Research in Social Statistics

Social statistics is concerned with the development of statistical methods that can be used across the social sciences. Statisticians play an essential role in all aspects of social inquiry, including: study design; measurement; data linkage; development of statistical models that account for the complex structure of social data; model selection and assessment.

Members of the LSE Social Statistics group have interests in statistical methods in each of these areas and regularly collaborate with social scientists whose questions motivate new lines of methodological research. We have experience in a range of social science disciplines, including demography, education, epidemiology, psychology and sociology.

Bayesian inference
Sara Geneletti; Kostas Kalogeropoulos

Bayesian inference is useful for many research questions in social statistics. It provides a natural and coherent framework for hierarchical (multilevel) models and latent variable analysis, and it can be used to incorporate prior information or to combine multiple data sources. Moreover, it offers powerful computational machinery for complex high-dimensional models and large datasets. We are interested in developing methods suitable for application in economics, health and medicine, biology and the social sciences. Current methodological challenges include constructing efficient and accessible computational schemes, prior specification, model choice and variable search, and estimation of latent stochastic processes.
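As a minimal illustration of how prior information enters a Bayesian analysis, the sketch below uses the conjugate Beta-Binomial model. The prior parameters and the survey counts are hypothetical, and the function names are illustrative rather than taken from any library.

```python
def beta_binomial_update(prior_a, prior_b, successes, trials):
    """Posterior Beta(a, b) parameters after observing binomial data."""
    return prior_a + successes, prior_b + (trials - successes)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical prior: earlier studies suggest a proportion near 0.2.
prior_a, prior_b = 2.0, 8.0
# Hypothetical new data: 12 of 40 respondents report the behaviour.
post_a, post_b = beta_binomial_update(prior_a, prior_b, successes=12, trials=40)

print(post_a, post_b)             # parameters of the posterior Beta distribution
print(beta_mean(post_a, post_b))  # lies between prior mean 0.2 and sample share 0.3
```

The posterior mean is a compromise between the prior mean and the observed proportion, weighted by how much information each carries. Realistic hierarchical or latent variable models rarely have such closed-form updates, which is where the computational schemes mentioned above come in.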

Dureau, J., Kalogeropoulos, K. and Baguelin, M. (2013). Capturing the time-varying drivers of an epidemic using stochastic dynamical systems. Biostatistics 14 (3), pp. 541-555. ISSN 1465-4644

Categorical data analysis
Wicher Bergsma; Jouni Kuha; Irini Moustaki; Fiona Steele

Data collected in the social sciences are often categorical in nature. Key questions about categorical data involve the nature of the relations among the variables, for example how political preference is shaped by social characteristics, such as education or income. Methods for assessing relations among variables include graphical models and latent variable models. A first distinguishing feature of categorical data analysis, compared to continuous data analysis, is that inference can often be done with light assumptions about the distribution of the data, e.g., a multinomial distribution may often be adequate. A second distinguishing feature of categorical data analysis is that the variables can have one of a variety of measurement levels, the most common being nominal, ordinal or interval level. However, not all data types adequately fit in this scheme, for example, outcomes may consist of preference orderings, or subsets of items. Much statistical research is devoted to taking measurement level into account in order to maximise the amount of information that can be extracted from the data.

Kuha, J. and Jackson, J. (2014). The item count method for sensitive survey questions: Modelling criminal behaviour. Journal of the Royal Statistical Society, Series C (Applied Statistics). ISSN 0035-9254 (In Press). Early view access.
Bergsma, W. P., Croon, M. A. and Hagenaars, J. A. (2013). Advancements in marginal modelling for categorical data. Sociological Methodology. 43 (1), pp. 1-41. ISSN 0081-1750
Kuha, J. and Goldthorpe, J. H. (2010). Path analysis for discrete variables: the role of education in social mobility. Journal of the Royal Statistical Society, Series A, 173 (2), pp. 351-369. ISSN 0964-1998
Mavridis, D. and Moustaki, I. (2009). The forward search algorithm for detecting aberrant response patterns in factor analysis for binary data. Journal of Computational and Graphical Statistics. 18 (4), pp. 1016-1034. ISSN 1061-8600
Bergsma, W. P., Croon, M. A. and Hagenaars, J. A. (2009). Marginal models for dependent, clustered and longitudinal categorical data. Springer NY. ISBN 9780387096094

Causal inference
Sara Geneletti

Causal inference is the name given to the area of statistical methodology aimed at identifying and estimating causal effects. In social science and epidemiology we often want to know the effect of interventions or to explain underlying mechanisms. We want to know whether a particular social programme works because, if it does, it might result in new policies, that is, new interventions. We want to know how lifestyle factors cause a disease so that we can understand the biological mechanisms involved and how to treat or prevent it. One of the main issues in this area is how to make causal inferences from observational (i.e. non-randomised) data. Several approaches to this problem exist, including graphical modelling, potential outcomes and other statistical approaches.

Geneletti, S., Best, N. and Mason, A. (2010). Adjusting for selection effects in epidemiological studies: why sensitivity analysis is the only "solution". Biostatistics, 10 (1), pp. 17-31. ISSN 1044-3983

Latent variable models
Wicher Bergsma; Jouni Kuha; Irini Moustaki; Fiona Steele

Research in the area of latent variable models focuses on the development of methodology for categorical data, mixed types of data, goodness-of-fit measures, detection of outliers, new methods of estimation and the application of the methodology in social measurement and research. Latent variable models have extensive applications in educational testing as well as in psychometrics, sociometrics, economics and other social sciences. We are interested in measuring theoretical constructs such as intelligence, ability, attitude, belief and wealth that are not directly observed (latent). We collect information through surveys on variables that can be considered to be indicators of those unobserved constructs. Those indicators can be of any measurement type, such as categorical or metrical and they can also be measured over time.

Wall, M. M., Park, J.-Y. and Moustaki, I. (2015). IRT modeling in the presence of zero-inflation with application to psychiatric disorder severity. Applied Psychological Measurement. (In press) ISSN 0146-6216
Kuha, J. and Moustaki, I. (2015). Non-equivalence of measurement in latent variable modelling of multigroup data: a sensitivity analysis. Psychological Methods. (In press), pp. 1-47. ISSN 1082-989X
Moustaki, I. and Knott, M. (2013). Latent variable models that account for atypical responses. Journal of the Royal Statistical Society, series C (applied statistics), online. ISSN 0035-9254 (In Press)
Katsikatsou, M., Moustaki, I., Yang-Wallentin, F. and Jöreskog, K. (2012). Pairwise Likelihood Estimation for factor analysis models with ordinal data. Computational Statistics and Data Analysis, 56 (12), pp. 4243–4258. ISSN 0167-9473
Vasdekis, V., Cagnone, S., and Moustaki, I. (2012). Composite likelihood estimation for latent variable models with longitudinal ordinal variables. Psychometrika, DOI:10.1007/s11336-012-9264-6
Bartholomew, D. J., Knott, M. and Moustaki, I. (2011). Latent variable models and factor analysis: a unified approach 3rd ed., John Wiley & Sons, London, UK. ISBN 9780470971925
Bartholomew, D. J., Steele, F., Galbraith, J. and Moustaki, I. (2008). Analysis of multivariate social science data. Chapman & Hall/CRC Statistics in the Social and Behavioral Scie. 2nd ed., CRC Press, London. ISBN 9781584889601

Longitudinal data analysis
Wicher Bergsma; Kostas Kalogeropoulos; Irini Moustaki; Fiona Steele

Longitudinal data come from a variety of sources such as panel surveys, birth cohort studies and, more recently, ‘real-time’ digital data collection methods such as ecological momentary assessment. Analyses of longitudinal data are most commonly concerned with the nature and predictors of change in a response variable over time, for example children’s height or weight or individual attitudes and behaviour, using repeated measurements of the response variable and covariates. Methods for studying change using repeated measures data include growth curve models (which may be framed as multilevel models or structural equation models), autoregressive ‘dynamic’ models with random or fixed individual effects, and marginal models. Another type of longitudinal study is concerned with the timing of events such as births and deaths, where the response is the (often partially observed) duration to event occurrence. Survival analysis, widely referred to as event history analysis in social research, is used to study duration data, with extensions to handle recurrent events and competing risks.
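To illustrate how partially observed (right-censored) durations are handled, here is a minimal sketch of the Kaplan-Meier estimator of the survival function. It is a textbook estimator chosen for illustration, not a method attributed to the group, and the durations below are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimates at each distinct event time.

    durations: time to event or to censoring for each individual
    observed:  True if the event was observed, False if censored
    Returns a list of (time, estimated survival probability) pairs.
    """
    event_times = sorted({t for t, d in zip(durations, observed) if d})
    survival, curve = 1.0, []
    for t in event_times:
        # Individuals still under observation just before time t.
        at_risk = sum(1 for u in durations if u >= t)
        # Events (not censorings) occurring exactly at time t.
        events = sum(1 for u, d in zip(durations, observed) if u == t and d)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical durations (e.g. years until a first birth); False marks censoring.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]

for time, surv in kaplan_meier(durations, observed):
    print(time, round(surv, 3))
```

Censored individuals contribute to the risk set up to their censoring time but never count as events, which is how the estimator uses the partial information they carry.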

Hafez, M. S., Moustaki, I. and Kuha, J. (2014). Analysis of multivariate longitudinal data subject to dropout. Structural Equation Modelling, 22 (2), pp. 193-201. ISSN 1070-5511
Steele, F., French, R. and Bartley, M. (2013). Adjusting for selection bias in longitudinal analyses using simultaneous equations modelling: the relationship between employment transitions and mental health. Epidemiology, 24 (5), pp. 703-711. ISSN 1044-3983
Vasdekis, V., Cagnone, S. and Moustaki, I. (2012). A composite likelihood inference in latent variable models for ordinal longitudinal responses. Psychometrika, 77 (3), pp. 425-441. ISSN 0033-3123
Steele, F. (2008). Multilevel models for longitudinal data. Journal of the Royal Statistical Society: series A (statistics in society), 171 (1), pp. 5-19. ISSN 0964-1998

Marginal modelling
Wicher Bergsma

A different type of research question involving clustered data concerns population averaged quantities rather than the sampling units themselves. Such questions can be handled using marginal modelling techniques, which are being developed for categorical data. Again, the dependencies in the data due to the clustering have to be taken into account.

Bergsma, W. P., Croon, M. A. and Hagenaars, J. A. (2013). Advancements in marginal modelling for categorical data. Sociological Methodology. 43 (1), pp. 1-41. ISSN 0081-1750
Rudas, T., Bergsma, W. P. and Nemeth, R. (2010). Marginal log-linear parameterization of conditional independence models. Biometrika, 97 (4), pp. 1006-1012. ISSN 0006-3444
Bergsma, W. P., Croon, M. A. and Hagenaars, J. A. (2009). Marginal models for dependent, clustered and longitudinal categorical data. Springer NY. ISBN 9780387096094
Bergsma, W. P. and Rudas, T. (2002). Marginal models for categorical data. Annals of Statistics, 30 (1), pp. 140-159. ISSN 0090-5364

Measurement error
Jouni Kuha; Irini Moustaki; Chris Skinner

The problem of measurement error in statistical analysis is both common and serious: common because very many of the quantities of interest in the social sciences are difficult to determine accurately; serious because even moderate amounts of measurement error can cause substantial biases in estimated models of interest. It is possible to reduce these biases by using appropriately modified estimation methods, provided that sufficient information about the measurement error is available in the form of either additional data or realistic assumptions. Different measurement error problems may require rather different solutions, depending on, for example, the type of model (linear, log-linear, logistic etc.), the erroneously measured variables (explanatory or response, continuous or discrete) and the method of estimation (e.g. moment or likelihood based, exact or approximate). The work carried out in this area has focused in particular on problems involving more than one type of inaccurate measurement, such as measurement error of continuous and discrete variables, error in both explanatory and response variables, and measurement error together with missing data.
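The bias described above can be seen in the simplest case, classical additive error in a single explanatory variable of a linear regression. The simulation below is a sketch under assumed values: the slope, the variances and the sample size are hypothetical, and the error variance is treated as known, which in practice would have to come from additional data or assumptions.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x in a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)   # fixed seed so the run is reproducible
n, beta = 20000, 2.0     # hypothetical sample size and true slope
var_x, var_u = 1.0, 1.0  # assumed variances of the true covariate and of its error

x_true = [rng.gauss(0.0, var_x ** 0.5) for _ in range(n)]
w_obs = [x + rng.gauss(0.0, var_u ** 0.5) for x in x_true]  # error-prone measurement
y = [beta * x + rng.gauss(0.0, 1.0) for x in x_true]

naive = ols_slope(w_obs, y)            # attenuated towards zero
reliability = var_x / (var_x + var_u)  # share of observed variance that is signal
corrected = naive / reliability        # method-of-moments correction

print(round(naive, 2), round(corrected, 2))
```

With equal signal and error variances the naive slope is roughly halved, and dividing by the reliability ratio recovers a value close to the true slope. This simple correction applies only to this linear, single-covariate setting; the problems described above generally need more elaborate methods.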

Da Silva, D. N. and Skinner, C. J. (2014). The use of accuracy indicators to correct for survey measurement error. Journal of the Royal Statistical Society, Series C (Applied Statistics). ISSN 1467-9876 (In Press). Early view access.
Skrondal, A. and Kuha, J. (2012). Improved regression calibration. Psychometrika, 77 (4), pp. 649-669. ISSN 0033-3123

Multilevel modelling
Kostas Kalogeropoulos; Fiona Steele

Most populations studied in the social sciences have a multilevel structure: students may be nested in schools, people in neighbourhoods, employees in firms or twins in twin-pairs. Longitudinal data are an important example of a two-level hierarchical structure where repeated measurements over time are nested within individuals. Multilevel structures may also be non-hierarchical. For example, students may be nested within a cross-classification of school and neighbourhood of residence, and mobility between schools leads to a multiple-membership structure. Such clustered designs often provide rich information on processes operating at different levels, for instance people's characteristics interacting with institutional characteristics. Importantly, the standard assumption of independent observations is likely to be violated due to dependence among observations within the same cluster. Multilevel models extend conventional regression analysis to handle such dependence and exploit the richness of the data.
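The cost of ignoring within-cluster dependence can be quantified in the simplest setting, a two-level random-intercept model with equal cluster sizes. The sketch below computes the intraclass correlation and the resulting design effect for a sample mean; the variance components and sample sizes are hypothetical.

```python
def intraclass_correlation(var_between, var_within):
    """Share of total variance that lies between clusters."""
    return var_between / (var_between + var_within)


def design_effect(cluster_size, icc):
    """Variance inflation of a sample mean under equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


# Hypothetical variance components: school level 0.2, student level 0.8.
icc = intraclass_correlation(var_between=0.2, var_within=0.8)
# Hypothetical design: 40 schools with 25 students each.
deff = design_effect(cluster_size=25, icc=icc)
effective_n = (40 * 25) / deff

print(round(icc, 3), round(deff, 2), round(effective_n, 1))
```

Even a modest intraclass correlation makes 1000 clustered observations carry far less information about the mean than 1000 independent ones, which is why standard errors from an analysis that assumes independence are too small.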

Clarke, P., Crawford, C., Steele, F. and Vignoles, A. (2015). Revisiting fixed- and random-effects models: some considerations for policy-relevant education research. Education Economics, 23 (3), pp. 259-277. ISSN 0964-5292
Steele, F., Rasbash, J. and Jenkins, J. (2013). A multilevel simultaneous equations model for within-cluster dynamic effects, with an application to reciprocal parent–child and sibling effects. Psychological methods, 18 (1), pp. 87-100. ISSN 1082-989X
Steele, F., Clarke, P. and Washbrook, E. (2013). Modelling household decisions using longitudinal data from household panel surveys, with applications to residential mobility. Sociological methodology. ISSN 0081-1750 (In press)

Model selection and assessment
Jouni Kuha

The study of the assessment and choice of statistical models covers a wide range of topics from the very specific to the most general. There is a large and growing number of statistical model comparison methods and criteria, whose properties still need to be further described and compared. Many of these have been developed in response to perceived shortcomings of other criteria, such as the sensitivity of standard goodness-of-fit tests to the sample size. At the same time, it is rarely clear what the most relevant way of comparing such criteria should be. Unambiguous results about best models and best criteria can only be obtained under precise and arguably unrealistic conditions, for example by assuming a true model of a specific parametric form. A more general discussion of the merits of different approaches requires stepping outside a strictly mathematical framework and considering such important and partly non-statistical considerations as the role of subject-matter input in model formulation and the purposes to which the models are to be used.
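Two widely used comparison criteria, AIC and BIC, show how criteria can disagree. The sketch below uses their standard definitions; the log-likelihoods, parameter counts and sample size are hypothetical values chosen to produce a disagreement.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: smaller is better."""
    return -2 * log_likelihood + 2 * n_params


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: smaller is better."""
    return -2 * log_likelihood + n_params * math.log(n_obs)


n_obs = 1000
# Hypothetical fitted models: (maximised log-likelihood, number of parameters).
models = {"simple": (-520.0, 3), "complex": (-515.0, 6)}

for name, (loglik, k) in models.items():
    print(name, aic(loglik, k), round(bic(loglik, k, n_obs), 2))

best_aic = min(models, key=lambda m: aic(*models[m]))
best_bic = min(models, key=lambda m: bic(*models[m], n_obs))
print(best_aic, best_bic)
```

Here AIC favours the larger model while BIC, whose penalty grows with the sample size, favours the smaller one. Neither answer is correct in the abstract, which is the point made above about needing subject-matter input and a clear purpose for the model.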

Kuha, J. and Firth, D. (2011). On the index of dissimilarity for lack of fit in loglinear and log-multiplicative models. Computational Statistics and Data Analysis, 55 (1), pp. 375-388. ISSN 0167-9473
Kuha, J. (2004). AIC and BIC: comparisons of assumptions and performance. Sociological Methods & Research, 33 (2), pp. 188-229. ISSN 0049-1241

Survey methodology
Chris Skinner; Jouni Kuha

Sampling has been central to the development of statistical methodology for surveys. Estimation under complex sampling, with auxiliary information and possibly non-response, is still a major research area. Methods to take account of complex sampling in survey data analysis are also of interest for many forms of analysis. There is growing interest in the combination of survey data with other data sources, for example administrative records, and this generates many questions regarding estimation and analysis, including issues of record linkage. Measurement error is critical to survey data quality and raises many questions about how to take account of it in estimation, as well as how to design measurement instruments. The control of statistical disclosure is another important area of research.

Hanly, M., Clarke, P. and Steele, F. (2015). Using edit cost settings to improve sequence analysis summaries of call record data. Journal of the Royal Statistical Society, Series A. (In press)
Kuha, J. and Jackson, J. (2014). The item count method for sensitive survey questions: Modelling criminal behaviour. Journal of the Royal Statistical Society, Series C (Applied Statistics). ISSN 0035-9254 (In Press). Early view access.
Micklewright, J., Schnepf, S.V. and Skinner, C. J. (2012). Non-response biases in surveys of school children: the case of the English ‘Programme for International Student Assessment’ (PISA) samples. Journal of the Royal Statistical Society, Series A, 175 (4), pp. 915-938. ISSN 0964-1998
Chambers, R. L. and Skinner, C. J., eds. (2003) Analysis of survey data. John Wiley & Sons, Chichester, UK. ISBN 9780471899877

Tests of independence
Wicher Bergsma

Testing for independence between two nominal categorical variables can be done using the Pearson chi-squared test; for two continuous variables, independence tests can be based on the Pearson correlation; for a nominal categorical and a continuous variable, the t-test (for a binary categorical variable) or, more generally, a one-way analysis of variance F-test can be used. Current research focuses on developing more general tests, such as tests of independence between high-dimensional random variables or among more than two random variables.
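As a concrete instance of the first of these classical tests, the sketch below computes the Pearson chi-squared statistic for a two-way contingency table. The counts are hypothetical, and the function is written out by hand for illustration rather than taken from a library.

```python
def pearson_chi_squared(table):
    """Return (statistic, degrees of freedom) for a two-way contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts: rows are two education levels, columns two party preferences.
counts = [[30, 10], [20, 40]]
stat, dof = pearson_chi_squared(counts)

# The 5% critical value of the chi-squared distribution with 1 degree of freedom is 3.841.
print(round(stat, 2), dof, stat > 3.841)
```

For these counts the statistic far exceeds the critical value, so independence would be rejected at the 5% level. The statistic relies on a large-sample approximation and assumes expected counts that are not too small.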

Bergsma, W. P. and Dassios, A. (2013). A test of independence based on a sign covariance related to Kendall's tau. Bernoulli, 20 (2), pp. 1006-1028. ISSN 1350-7265.
Sejdinovic, D., Gretton, A. and Bergsma, W. P. (2013). A kernel test for three-variable interaction. Advances in Neural Information Processing Systems 26 (eds. Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z. and Weinberger, K. Q.). NIPS Proceedings 2013.