Research in Social Statistics

Social statistics is concerned with the development of statistical methods that can be used across the social sciences. Statisticians play an essential role in all aspects of social inquiry, including: study design; measurement; data linkage; development of statistical models that account for the complex structure of social data; model selection and assessment.

Members of the LSE Social Statistics group have interests in statistical methods in each of these areas and regularly collaborate with social scientists whose questions motivate new lines of methodological research. We have experience in a range of social science disciplines, including demography, education, epidemiology, psychology and sociology.

A brief introduction to our main areas of methodological research in social statistics

Bayesian inference

Sara Geneletti; Kostas Kalogeropoulos

Bayesian inference is useful for a wide range of research questions in social statistics. It provides a natural and coherent framework for hierarchical multilevel models as well as latent variable analysis, and it can be used to incorporate prior information or combine multiple data sources. Moreover, it offers powerful computational machinery that can be used for complex high-dimensional models and large datasets. We are interested in developing methods suitable for application in economics, health and medicine, biology and the social sciences. Current methodological challenges include constructing efficient and accessible computational schemes, prior specification, model choice and variable search, and estimation of latent stochastic processes.
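
As a minimal illustration of how prior information can be combined with data, the sketch below uses a conjugate Beta-Binomial model. The prior parameters and the survey counts are hypothetical values chosen only for the example, and the function names are our own.

```python
def beta_binomial_update(prior_a, prior_b, successes, failures):
    """Return the parameters of the Beta posterior after observing binomial data."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical prior: Beta(2, 2), weakly centred on a proportion of 0.5.
prior_a, prior_b = 2.0, 2.0

# Hypothetical survey: 30 of 100 respondents give a particular answer.
post_a, post_b = beta_binomial_update(prior_a, prior_b, successes=30, failures=70)

sample_proportion = 30 / 100
posterior_mean = beta_mean(post_a, post_b)  # lies between the prior mean and the sample proportion
print(post_a, post_b, round(posterior_mean, 4))
```

The posterior mean is pulled slightly from the sample proportion towards the prior mean, which is the simplest form of the information pooling described above.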

Categorical data analysis

Wicher Bergsma; Jouni Kuha; Irini Moustaki; Fiona Steele

Data collected in the social sciences are often categorical in nature. Key questions about categorical data involve the nature of the relations among the variables, for example how political preference is shaped by social characteristics such as education or income. Methods for assessing relations among variables include graphical models and latent variable models. A first distinguishing feature of categorical data analysis, compared to continuous data analysis, is that inference can often be done with light assumptions about the distribution of the data, e.g., a multinomial distribution may often be adequate. A second distinguishing feature of categorical data analysis is that the variables can have one of a variety of measurement levels, the most common being nominal, ordinal or interval level. However, not all data types fit adequately in this scheme; for example, outcomes may consist of preference orderings or subsets of items. Much statistical research is devoted to taking measurement level into account in order to maximise the amount of information that can be extracted from the data.

Causal inference

Sara Geneletti

Causal inference is the name given to the area of statistical methodology aimed at identifying and estimating causal effects. In social science and epidemiology we often want to know the effect of interventions or to explain underlying mechanisms. We want to know whether a particular social programme works because, if it does, it might result in new policies and new interventions. We want to know how lifestyle factors cause a disease so that we can understand the biological mechanisms involved and how to treat or prevent it. One of the main issues in this area is how to make causal inferences from observational (i.e. non-randomised) data. A number of approaches to this problem exist, including graphical modelling, potential outcomes and related statistical methods.

Latent variable models and structural equation models

Wicher Bergsma; Jouni Kuha; Irini Moustaki; Fiona Steele

Research in the area of latent variable models focuses on the development of methodology for categorical data, mixed types of data, goodness-of-fit measures, detection of outliers, new methods of estimation and the application of the methodology in social measurement and research. Latent variable models have extensive applications in educational testing as well as in psychometrics, sociometrics, economics and other social sciences. We are interested in measuring theoretical constructs such as intelligence, ability, attitude, belief and wealth that are not directly observed (latent). We collect information through surveys on variables that can be considered to be indicators of those unobserved constructs. Those indicators can be of any measurement type, such as categorical or metrical, and they can also be measured over time.

Longitudinal data analysis

Wicher Bergsma; Kostas Kalogeropoulos; Irini Moustaki; Fiona Steele

Longitudinal data come from a variety of sources such as panel surveys, birth cohort studies and, more recently, ‘real-time’ digital data collection methods such as ecological momentary assessment. Analyses of longitudinal data are most commonly concerned with the nature and predictors of change in a response variable over time, for example children’s height or weight or individual attitudes and behaviour, using repeated measurements of the response variable and covariates. Methods for studying change using repeated measures data include growth curve models (which may be framed as a multilevel model or structural equation model), autoregressive ‘dynamic’ models with random or fixed individual effects, and marginal models. Another type of longitudinal study is concerned with the timing of events such as births and deaths, where the response is the (often partially observed) duration to event occurrence. Survival analysis, widely referred to as event history analysis in social research, is used to study duration data, with extensions to handle recurrent events and competing risks.
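
To make the idea of partially observed durations concrete, the following sketch computes the Kaplan-Meier estimate of a survival curve from right-censored data. The Kaplan-Meier estimator is a standard survival analysis tool that we have chosen for illustration; it is not named in the text above, and the durations below are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate for right-censored durations.

    durations: time to event or to censoring for each unit
    observed:  1 if the event was observed, 0 if the duration was censored
    Returns a list of (time, survival probability) pairs at each event time.
    """
    data = sorted(zip(durations, observed))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        events = 0
        leaving = 0
        # Collect every unit whose duration ends at time t.
        while i < len(data) and data[i][0] == t:
            events += data[i][1]
            leaving += 1
            i += 1
        if events > 0:
            survival *= 1 - events / n_at_risk
            curve.append((t, survival))
        # Both events and censored units leave the risk set after time t.
        n_at_risk -= leaving
    return curve


# Hypothetical durations (e.g. months); 0 marks a censored observation.
durations = [2, 3, 3, 5, 8]
observed = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(durations, observed)
print(km_curve)
```

For these data the estimated survival probabilities are approximately 0.8, 0.6 and 0.3 at times 2, 3 and 5. The censored units contribute to the risk set up to the time at which they are last observed, rather than being discarded.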

Marginal modelling

Wicher Bergsma

A different type of research question involving clustered data concerns population-averaged quantities rather than the sampling units themselves. Such questions can be handled using marginal modelling techniques, which are being developed for categorical data. As in multilevel modelling, the dependencies in the data due to the clustering have to be taken into account.

Measurement error in statistical analysis

Jouni Kuha; Irini Moustaki; Chris Skinner

The problem of measurement error in statistical analysis is both common and serious: common because very many of the quantities of interest in the social sciences are difficult to determine accurately; serious because even moderate amounts of measurement error can cause substantial biases in estimated models of interest. It is possible to reduce these biases by using appropriately modified estimation methods, provided that sufficient information about the measurement error is available in the form of either additional data or realistic assumptions. Different measurement error problems may require rather different solutions, depending on, for example, the type of model (linear, log-linear, logistic etc.), the erroneously measured variables (explanatory or response, continuous or discrete) and the method of estimation (e.g. moment or likelihood based, exact or approximate). The work carried out in this area has focused in particular on problems involving more than one type of inaccurate measurement, such as measurement error of continuous and discrete variables, error in both explanatory and response variables, and measurement error together with missing data.
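
The bias described above can be seen in a small simulation. The sketch below assumes the simplest case, classical additive error in a single continuous explanatory variable of a linear model, and assumes that the reliability ratio (true variance divided by observed variance) is known. All numerical settings are hypothetical.

```python
import random


def ols_slope(x, y):
    """Least-squares slope from a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20000
true_slope = 2.0

x_true = [rng.gauss(0, 1) for _ in range(n)]           # true covariate, variance 1
y = [true_slope * x + rng.gauss(0, 1) for x in x_true]  # response
x_obs = [x + rng.gauss(0, 1) for x in x_true]           # observed with error variance 1

slope_error_free = ols_slope(x_true, y)  # close to 2.0
slope_naive = ols_slope(x_obs, y)        # attenuated towards zero, close to 1.0

# Assumed known reliability ratio: var(x_true) / var(x_obs) = 1 / (1 + 1).
reliability = 0.5
slope_corrected = slope_naive / reliability  # moment-based correction

print(round(slope_error_free, 2), round(slope_naive, 2), round(slope_corrected, 2))
```

Here the naive slope is roughly half the true slope, and dividing by the reliability ratio recovers it. In practice the reliability would have to be estimated from additional data, such as replicate measurements, or justified by assumption.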

Multilevel modelling

Kostas Kalogeropoulos; Fiona Steele

Most populations studied in the social sciences have a multilevel structure: students may be nested in schools, people in neighbourhoods, employees in firms or twins in twin-pairs. Longitudinal data are an important example of a two-level hierarchical structure where repeated measurements over time are nested within individuals. Multilevel structures may also be non-hierarchical. For example, students may be nested within a cross-classification of school and neighbourhood of residence, and mobility between schools leads to a multiple-membership structure. Such clustered designs often provide rich information on processes operating at different levels, for instance people's characteristics interacting with institutional characteristics. Importantly, the standard assumption of independent observations is likely to be violated due to dependence among observations within the same cluster. Multilevel models extend conventional regression analysis to handle such dependence and exploit the richness of the data.
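
The within-cluster dependence mentioned above can be quantified by the intraclass correlation (ICC). The sketch below simulates a balanced two-level random-intercept structure and estimates the ICC with the one-way ANOVA estimator. The variance components, the cluster sizes and the choice of estimator are assumptions made for the example; a multilevel model would typically be fitted with specialised software.

```python
import random


def anova_icc(clusters):
    """One-way ANOVA estimate of the intraclass correlation (balanced clusters)."""
    k = len(clusters)       # number of clusters
    m = len(clusters[0])    # units per cluster
    grand_mean = sum(sum(c) for c in clusters) / (k * m)
    means = [sum(c) / m for c in clusters]
    ms_between = m * sum((mu - grand_mean) ** 2 for mu in means) / (k - 1)
    ms_within = sum(
        (y - mu) ** 2 for c, mu in zip(clusters, means) for y in c
    ) / (k * (m - 1))
    return (ms_between - ms_within) / (ms_between + (m - 1) * ms_within)


rng = random.Random(1)  # fixed seed for reproducibility
k, m = 500, 10          # hypothetical: 500 schools with 10 students each
sd_between = 1.0        # school-level standard deviation (variance 1)
sd_within = 3 ** 0.5    # student-level standard deviation (variance 3)

clusters = []
for _ in range(k):
    u = rng.gauss(0, sd_between)  # random intercept shared within a school
    clusters.append([u + rng.gauss(0, sd_within) for _ in range(m)])

icc_hat = anova_icc(clusters)          # true value is 1 / (1 + 3) = 0.25
design_effect = 1 + (m - 1) * icc_hat  # variance inflation for a sample mean
print(round(icc_hat, 3), round(design_effect, 2))
```

A design effect well above 1 indicates that standard errors computed under the independence assumption would be too small.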

Statistical model selection

Jouni Kuha

The study of the assessment and choice of statistical models covers a wide range of topics from the very specific to the most general. There is a large and growing number of statistical model comparison methods and criteria, whose properties still need to be further described and compared. Many of these have been developed in response to perceived shortcomings of other criteria, such as the sensitivity of standard goodness-of-fit tests to the sample size. At the same time, it is rarely clear what the most relevant way of comparing such criteria should be. Unambiguous results about best models and best criteria can only be obtained under precise and arguably unrealistic conditions, for example by assuming a true model of a specific parametric form. A more general discussion of the merits of different approaches requires stepping outside a strictly mathematical framework and considering such important and partly non-statistical matters as the role of subject-matter input in model formulation and the purposes for which the models are to be used.

Survey methods

Chris Skinner; Jouni Kuha

Sampling has been central to the development of statistical methodology for surveys. Estimation under complex sampling, with auxiliary information and possibly non-response, is still a major research area. Methods to take account of complex sampling in survey data analysis are also of interest for many forms of analysis. There is growing interest in the combination of survey data with other data sources, for example administrative records, and this generates many questions regarding estimation and analysis, including issues of record linkage. Measurement error is critical to survey data quality and raises many questions about how to take account of it in estimation, as well as how to design measurement instruments. The control of statistical disclosure is another important area of research.
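
As a simple illustration of estimation under unequal-probability sampling, the sketch below computes the Horvitz-Thompson estimate of a population total and the corresponding weighted (Hajek) estimate of a mean. These two estimators are our choice for the example and are not named in the text above; the sample values and inclusion probabilities are hypothetical.

```python
def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total by weighting each unit by 1 / inclusion probability."""
    return sum(y / p for y, p in zip(values, inclusion_probs))


def hajek_mean(values, inclusion_probs):
    """Weighted mean using inverse inclusion probabilities as weights."""
    weights = [1 / p for p in inclusion_probs]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Hypothetical sample: units with larger values were less likely to be sampled.
values = [10.0, 20.0, 30.0]
inclusion_probs = [0.5, 0.25, 0.125]

total_hat = horvitz_thompson_total(values, inclusion_probs)
mean_weighted = hajek_mean(values, inclusion_probs)
mean_unweighted = sum(values) / len(values)
print(total_hat, round(mean_weighted, 3), mean_unweighted)
```

The weighted mean exceeds the unweighted sample mean here because the under-sampled units have the largest values, which is the kind of bias that ignoring the sampling design can introduce.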

Testing independence

Wicher Bergsma

Independence between two nominal categorical variables can be tested using the Pearson chi-squared test; for two continuous variables, tests can be based on the Pearson correlation; for a two-category nominal variable and a continuous variable, the t-test can be used. Current research aims to develop more general tests, such as tests of independence between high-dimensional random variables or among more than two random variables.
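
For reference, the Pearson chi-squared statistic for a two-way contingency table can be computed as in the sketch below. The table of counts is hypothetical, and the comparison with a critical value is left to the reader or to a statistical library.

```python
def pearson_chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a two-way table.

    table: list of rows, each a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical 2 x 2 table, e.g. education level by political preference.
table = [[30, 20],
         [20, 30]]
chi2_stat, chi2_dof = pearson_chi_squared(table)
print(chi2_stat, chi2_dof)
```

For this table the statistic is 4.0 on 1 degree of freedom, which exceeds the 5% critical value of about 3.84, so independence would be rejected at that level.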