
Research in social statistics
Overview of social statistics
Social statistics is concerned with the development of statistical methods that can be used across the social sciences. Statisticians play an essential role in all aspects of social inquiry, including: study design; measurement; data linkage; development of statistical models that account for the complex structure of social data; model selection and assessment.
Members of the LSE Social Statistics group have interests in statistical methods in each of these areas and regularly collaborate with social scientists whose questions motivate new lines of methodological research. We have experience in a range of social science disciplines, including demography, education, epidemiology, psychology and sociology.
A brief introduction to our main areas of methodological research in social statistics
Bayesian inference
Sara Geneletti; Kostas Kalogeropoulos
Bayesian inference can help address a wide range of research questions in social statistics. It provides a natural and coherent framework for hierarchical multilevel models as well as latent variable analysis, and it can be used to incorporate prior information or combine multiple data sources. Moreover, it offers powerful computational machinery that can be used for complex high-dimensional models and large datasets. We are interested in developing methods suitable for application in economics, health and medicine, biology and the social sciences. Some of the current methodological challenges are constructing efficient and accessible computational schemes, prior specification, model choice and variable search, and estimation of latent stochastic processes.
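As a minimal illustration of how prior information and multiple data sources can be combined in a Bayesian analysis, the sketch below applies the standard conjugate Beta-Binomial update to a proportion. The prior parameters and the survey counts are hypothetical values chosen for illustration only.

```python
def beta_binomial_update(prior_a, prior_b, successes, failures):
    """Return the Beta posterior parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical prior belief about a proportion: Beta(2, 8), prior mean 0.2.
prior_a, prior_b = 2.0, 8.0

# Hypothetical first data source: 30 "yes" answers out of 100 respondents.
post_a, post_b = beta_binomial_update(prior_a, prior_b, successes=30, failures=70)
posterior_mean = beta_mean(post_a, post_b)

# A second hypothetical data source is combined by updating again,
# using the first posterior as the new prior.
comb_a, comb_b = beta_binomial_update(post_a, post_b, successes=12, failures=38)
combined_mean = beta_mean(comb_a, comb_b)

print(f"posterior mean after first source: {posterior_mean:.3f}")
print(f"posterior mean after both sources: {combined_mean:.3f}")
```

In this conjugate setting the posterior mean always lies between the prior mean and the sample proportion, and the data dominate as the sample grows. Realistic hierarchical or latent variable models generally need simulation-based computation instead.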
Bias modelling in medical statistics
Sara Geneletti
Categorical data analysis
Wicher Bergsma; Jouni Kuha; Irini Moustaki; Fiona Steele
Data collected in the social sciences are often categorical in nature. Key questions about categorical data involve the nature of the relations among the variables, for example how political preference is shaped by social characteristics such as education or income. Methods for assessing relations among variables include graphical models and latent variable models. A first distinguishing feature of categorical data analysis, compared to continuous data analysis, is that inference can often be done with light assumptions about the distribution of the data, e.g., a multinomial distribution may often be adequate. A second distinguishing feature of categorical data analysis is that the variables can have one of a variety of measurement levels, the most common being nominal, ordinal or interval level. However, not all data types fit adequately into this scheme; for example, outcomes may consist of preference orderings or subsets of items. Much statistical research is devoted to taking measurement level into account in order to maximise the amount of information that can be extracted from the data.

Kuha, J. and Jackson, J. (2014). The item count method for sensitive survey questions: Modelling criminal behaviour. Journal of the Royal Statistical Society, Series C (Applied Statistics). ISSN 0035-9254 (In Press). Early view access.

Bergsma, W. P., Croon, M. A. and Hagenaars, J. A. (2013). Advancements in marginal modelling for categorical data. Sociological Methodology, 43 (1), pp. 1–41. ISSN 0081-1750

Kuha, J. and Goldthorpe, J. H. (2010). Path analysis for discrete variables: the role of education in social mobility. Journal of the Royal Statistical Society, Series A, 173 (2), pp. 351–369. ISSN 0964-1998

Mavridis, D. and Moustaki, I. (2009). The forward search algorithm for detecting aberrant response patterns in factor analysis for binary data. Journal of Computational and Graphical Statistics, 18 (4), pp. 1016–1034. ISSN 1061-8600

Bergsma, W. P., Croon, M. A. and Hagenaars, J. A. (2009). Marginal models for dependent, clustered and longitudinal categorical data. Springer NY. ISBN 9780387096094
Causal inference
Sara Geneletti
Causal inference is the name given to the area of statistical methodology aimed at identifying and estimating causal effects. In social science and epidemiology we often want to know the effect of interventions or to explain underlying mechanisms. We want to know whether a particular social programme works because, if it does, it might result in new policies or new interventions. We want to know how lifestyle factors cause a disease so that we can understand the biological mechanisms involved and how to treat or prevent it. One of the main issues in this area is how to make causal inferences from observational (i.e. non-randomised) data. Several approaches to this problem exist, including graphical modelling, potential outcomes and other statistical methods.
Latent variable models and structural equation models
Wicher Bergsma; Jouni Kuha; Irini Moustaki; Fiona Steele
Research in the area of latent variable models focuses on the development of methodology for categorical data, mixed types of data, goodness-of-fit measures, detection of outliers, new methods of estimation and the application of the methodology in social measurement and research. Latent variable models have extensive applications in educational testing as well as in psychometrics, sociometrics, economics and other social sciences. We are interested in measuring theoretical constructs such as intelligence, ability, attitude, belief and wealth that are not directly observed (latent). We collect information through surveys on variables that can be considered to be indicators of those unobserved constructs. Those indicators can be of any measurement type, such as categorical or metrical, and they can also be measured over time.

Moustaki, I. and Knott, M. (2013). Latent variable models that account for atypical responses. Journal of the Royal Statistical Society, Series C (Applied Statistics), online. ISSN 0035-9254 (In Press)

Katsikatsou, M., Moustaki, I., Yang-Wallentin, F. and Jöreskog, K. (2012). Pairwise Likelihood Estimation for factor analysis models with ordinal data. Computational Statistics and Data Analysis, 56 (12), pp. 4243–4258. ISSN 0167-9473

Vasdekis, V., Cagnone, S. and Moustaki, I. (2012). Composite likelihood estimation for latent variable models with longitudinal ordinal variables. Psychometrika, DOI:10.1007/s11336-012-9264-6

Bartholomew, D. J., Knott, M. and Moustaki, I. (2011). Latent variable models and factor analysis: a unified approach. 3rd ed. John Wiley & Sons, London, UK. ISBN 9780470971925

Bartholomew, D. J., Steele, F., Galbraith, J. and Moustaki, I. (2008). Analysis of multivariate social science data. Chapman & Hall/CRC Statistics in the Social and Behavioral Scie. 2nd ed., CRC Press, London. ISBN 9781584889601
Longitudinal data analysis
Wicher Bergsma; Kostas Kalogeropoulos; Irini Moustaki; Fiona Steele
Longitudinal data come from a variety of sources such as panel surveys, birth cohort studies and, more recently, ‘real-time’ digital data collection methods such as ecological momentary assessment. Analyses of longitudinal data are most commonly concerned with the nature and predictors of change in a response variable over time, for example children’s height or weight or individual attitudes and behaviour, using repeated measurements of the response variable and covariates. Methods for studying change using repeated measures data include growth curve models (which may be framed as a multilevel model or structural equation model), autoregressive ‘dynamic’ models with random or fixed individual effects, and marginal models. Another type of longitudinal study is concerned with the timing of events such as births and deaths, where the response is the (often partially observed) duration to event occurrence. Survival analysis, widely referred to as event history analysis in social research, is used to study duration data, with extensions to handle recurrent events and competing risks.
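To illustrate how partially observed (right-censored) durations are handled in survival analysis, here is a minimal sketch of the Kaplan-Meier estimator of the survival function. It is a textbook technique shown for illustration, not a method specific to the group, and the durations and censoring indicators are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the survival function.

    durations: time to the event or to censoring for each individual
    observed:  True if the event was observed, False if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    data = sorted(zip(durations, observed))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        removed = 0
        # Process all individuals whose duration equals t together.
        while i < len(data) and data[i][0] == t:
            events += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if events > 0:
            survival *= 1.0 - events / n_at_risk
            curve.append((t, survival))
        # Both events and censored cases leave the risk set after time t.
        n_at_risk -= removed
    return curve


# Hypothetical durations (e.g. months until an event); False marks censoring.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)

for time, prob in curve:
    print(f"t = {time}: S(t) = {prob:.2f}")
```

Censored individuals contribute to the risk set up to their censoring time but never count as events, which is how the partially observed durations are used without being discarded.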

Steele, F., French, R. and Bartley, M. (2013). Adjusting for selection bias in longitudinal analyses using simultaneous equations modelling: the relationship between employment transitions and mental health. Epidemiology, 24 (5), pp. 703–711. ISSN 1044-3983

Vasdekis, V., Cagnone, S. and Moustaki, I. (2012). A composite likelihood inference in latent variable models for ordinal longitudinal responses. Psychometrika, 77 (3), pp. 425–441. ISSN 0033-3123

Steele, F. (2008). Multilevel models for longitudinal data. Journal of the Royal Statistical Society: series A (statistics in society), 171 (1), pp. 5–19. ISSN 0964-1998
Marginal modelling
Wicher Bergsma
Some research questions involving clustered data concern population-averaged quantities rather than the sampling units themselves. Such questions can be handled using marginal modelling techniques, which are being developed for categorical data. As with other analyses of clustered data, the dependencies in the data due to the clustering have to be taken into account.

Bergsma, W. P., Croon, M. A. and Hagenaars, J. A. (2013). Advancements in marginal modelling for categorical data. Sociological Methodology, 43 (1), pp. 1–41. ISSN 0081-1750

Rudas, T., Bergsma, W. P. and Nemeth, R. (2010). Marginal log-linear parameterization of conditional independence models. Biometrika, 97 (4), pp. 1006–1012. ISSN 0006-3444

Bergsma, W. P., Croon, M. A. and Hagenaars, J. A. (2009). Marginal models for dependent, clustered and longitudinal categorical data. Springer NY. ISBN 9780387096094

Bergsma, W. P. and Rudas, T. (2002). Marginal models for categorical data. Annals of Statistics, 30 (1), pp. 140–159. ISSN 0090-5364
Measurement error in statistical analysis
Jouni Kuha; Irini Moustaki; Chris Skinner
The problem of measurement error in statistical analysis is both common and serious: common because very many of the quantities of interest in the social sciences are difficult to determine accurately; serious because even moderate amounts of measurement error can cause substantial biases in estimated models of interest. It is possible to reduce these biases by using appropriately modified estimation methods, provided that sufficient information about the measurement error is available in the form of either additional data or realistic assumptions. Different measurement error problems may require rather different solutions, depending on, for example, the type of model (linear, log-linear, logistic etc.), the erroneously measured variables (explanatory or response, continuous or discrete) and the method of estimation (e.g. moment or likelihood based, exact or approximate). The work carried out in this area has focused in particular on problems involving more than one type of inaccurate measurement, such as measurement error of continuous and discrete variables, error in both explanatory and response variables, and measurement error together with missing data.
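The bias described above can be seen in the simplest case: a linear regression with one explanatory variable measured with classical additive error. The sketch below simulates such data and applies a basic moment-based correction. It assumes that the measurement error variance is known, which stands in for the "additional data or realistic assumptions" mentioned above; all numerical values are hypothetical and the correction is a textbook one, not the group's own method.

```python
import random


def sample_variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def ols_slope(x, y):
    """Least-squares slope from regressing y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


random.seed(0)
n = 20000
true_slope = 2.0
error_variance = 1.0  # assumed known, e.g. from a validation study

# True covariate x, error-prone measurement w, and response y.
x = [random.gauss(0.0, 1.0) for _ in range(n)]
w = [xi + random.gauss(0.0, error_variance ** 0.5) for xi in x]
y = [true_slope * xi + random.gauss(0.0, 0.5) for xi in x]

# Regressing y on w gives a slope biased towards zero (attenuation).
naive_slope = ols_slope(w, y)

# Moment correction: divide by the estimated reliability ratio
# var(x) / var(w), where var(x) = var(w) - error_variance.
reliability = (sample_variance(w) - error_variance) / sample_variance(w)
corrected_slope = naive_slope / reliability

print(f"naive slope:     {naive_slope:.2f}")
print(f"corrected slope: {corrected_slope:.2f}")
```

With equal variances for the true covariate and the error, the reliability ratio is about 0.5, so the naive slope is roughly half the true value. This simple correction does not carry over directly to discrete variables, non-linear models or error in the response, which is why those cases need different solutions.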
Multilevel modelling
Kostas Kalogeropoulos; Fiona Steele
Most populations studied in the social sciences have a multilevel structure: students may be nested in schools, people in neighbourhoods, employees in firms or twins in twin pairs. Longitudinal data are an important example of a two-level hierarchical structure where repeated measurements over time are nested within individuals. Multilevel structures may also be non-hierarchical. For example, students may be nested within a cross-classification of school and neighbourhood of residence, and mobility between schools leads to a multiple-membership structure. Such clustered designs often provide rich information on processes operating at different levels, for instance people's characteristics interacting with institutional characteristics. Importantly, the standard assumption of independent observations is likely to be violated due to dependence among observations within the same cluster. Multilevel models extend conventional regression analysis to handle such dependence and exploit the richness of the data.
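One common way to quantify the within-cluster dependence described above is the intraclass correlation (ICC), the share of total variance that lies between clusters. The sketch below uses the one-way ANOVA estimator for balanced clusters as a simple illustration; a fitted multilevel model would normally estimate the variance components directly. The test scores are hypothetical.

```python
def anova_icc(clusters):
    """One-way ANOVA estimate of the intraclass correlation.

    clusters: list of equally sized lists, one list of responses per cluster.
    """
    k = len(clusters)
    m = len(clusters[0])
    if any(len(c) != m for c in clusters):
        raise ValueError("this sketch assumes balanced clusters")

    grand_mean = sum(sum(c) for c in clusters) / (k * m)
    cluster_means = [sum(c) / m for c in clusters]

    # Between-cluster and within-cluster mean squares.
    msb = m * sum((mu - grand_mean) ** 2 for mu in cluster_means) / (k - 1)
    msw = sum(
        (value - mu) ** 2
        for c, mu in zip(clusters, cluster_means)
        for value in c
    ) / (k * (m - 1))

    return (msb - msw) / (msb + (m - 1) * msw)


# Hypothetical test scores for three students in each of three schools.
scores = [
    [52, 55, 58],
    [61, 64, 67],
    [70, 73, 76],
]
icc = anova_icc(scores)
print(f"estimated ICC: {icc:.3f}")
```

An ICC near zero suggests that clustering can largely be ignored, whereas a value as high as in this artificial example means that students in the same school are far from independent observations.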

Steele, F., Rasbash, J. and Jenkins, J. (2013). A multilevel simultaneous equations model for within-cluster dynamic effects, with an application to reciprocal parent–child and sibling effects. Psychological methods, 18 (1), pp. 87–100. ISSN 1082-989X

Steele, F., Clarke, P. and Washbrooke, E. (2013). Modelling household decisions using longitudinal data from household panel surveys, with applications to residential mobility. Sociological methodology. ISSN 0081-1750 (In press)
Statistical model selection
Jouni Kuha
The study of the assessment and choice of statistical models covers a wide range of topics from the very specific to the most general. There is a large and growing number of statistical model comparison methods and criteria, whose properties still need to be further described and compared. Many of these have been developed in response to perceived shortcomings of other criteria, such as the sensitivity of standard goodness-of-fit tests to the sample size. At the same time, it is rarely clear what the most relevant way of comparing such criteria should be. Unambiguous results about best models and best criteria can only be obtained under precise and arguably unrealistic conditions, for example by assuming a true model of a specific parametric form. A more general discussion of the merits of different approaches requires stepping outside a strictly mathematical framework and considering such important and partly non-statistical considerations as the role of subject-matter input in model formulation and the purposes for which the models are to be used.
Survey methods
Chris Skinner; Jouni Kuha
Sampling has been central to the development of statistical methodology for surveys. Estimation under complex sampling, with auxiliary information and possibly nonresponse, is still a major research area. Methods to take account of complex sampling in survey data analysis are also of interest for many forms of analysis. There is growing interest in the combination of survey data with other data sources, for example administrative records, and this generates many questions regarding estimation and analysis, including issues of record linkage. Measurement error is critical to survey data quality and raises many questions about how to take account of it in estimation, as well as how to design measurement instruments. The control of statistical disclosure is another important area of research.

Kuha, J. and Jackson, J. (2014). The item count method for sensitive survey questions: Modelling criminal behaviour. Journal of the Royal Statistical Society, Series C (Applied Statistics). ISSN 0035-9254 (In Press). Early view access.

Micklewright, J., Schnepf, S. V. and Skinner, C. J. (2012). Nonresponse biases in surveys of school children: the case of the English ‘Programme for International Student Assessment’ (PISA) samples. Journal of the Royal Statistical Society, Series A, 175 (4), pp. 915–938. ISSN 0964-1998

Chambers, R. L. and Skinner, C. J., eds. (2003). Analysis of survey data. John Wiley & Sons, Chichester, UK. ISBN 9780471899877
Testing independence
Wicher Bergsma
Testing for independence between two nominal categorical variables can be done using the Pearson chi-squared test; for two continuous variables, independence tests can be based on the Pearson correlation; for a nominal categorical and a continuous variable, the t-test can be used. Current research is being done on developing more general tests, such as tests of independence between high-dimensional random variables or among more than two random variables.
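As a concrete reference point for the classical case, the sketch below computes the Pearson chi-squared statistic for a two-way contingency table using only the standard library. The counts are hypothetical, and the function returns the statistic and degrees of freedom rather than a p-value.

```python
def pearson_chi_squared(table):
    """Pearson chi-squared statistic for independence in a two-way table.

    table: list of rows of observed counts.
    Returns (statistic, degrees_of_freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical 2 x 2 table of counts for two nominal variables.
table = [
    [10, 20],
    [30, 40],
]
statistic, dof = pearson_chi_squared(table)
print(f"chi-squared = {statistic:.3f} on {dof} degree(s) of freedom")
```

The statistic is compared with a chi-squared reference distribution with the given degrees of freedom; for one degree of freedom the 5% critical value is about 3.84, so this hypothetical table shows no evidence against independence.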

Bergsma, W. P. and Dassios, A. (2013). A test of independence based on a sign covariance related to Kendall's tau. Bernoulli, 20 (2), pp. 1006–1028. ISSN 1350-7265.

Sejdinovic, D., Gretton, A. and Bergsma, W. P. (2013). A kernel test for three-variable interaction. Advances in Neural Information Processing Systems 26 (eds. Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z. and Weinberger, K. Q.). NIPS Proceedings 2013.

