On working as an LSE Research Assistant

Thursday 30 April 2026

MSc Data Science student Amara Otero Salgado shares her recent insights on working as a Research Assistant at LSE.

In July 2025, just after graduating, I started in my current role as a Research Assistant for Professor Cheryl Schonhardt-Bailey in LSE's Department of Government. I work on Schonhardt-Bailey's project, which examines legal cases from the US Ninth and Tenth Circuit courts concerning public lands and wilderness, exploring how the arguments used in these cases have changed over time.

I am currently studying for the MSc Data Science in LSE's Department of Statistics, having completed the BSc Politics and Data Science here. My interest in politics complements my passion for data science, and this project lies at the intersection of the two disciplines, combined with some legal work. I have also always had an interest in law, having sat the LNAT at age 17, so this project allows me to combine all of my interests.

Compiling the dataset for this study has raised a number of key challenges. Legal case documents are difficult material to subject to statistical analysis: each is pages of unstructured natural language. The time frame for our study is also relatively long, spanning 1960–2024, which amounts to roughly 6000 cases in total. The sheer volume of cases means that manual data entry is infeasible. This is where I came in as a research assistant with a data science skill set.

At the outset of the project in Summer 2025, the research team consisted of only Prof Schonhardt-Bailey and myself. However, once the academic year got underway, our number expanded to include another postgraduate research assistant and two undergraduate research assistants, whom I manage under Prof Schonhardt-Bailey’s supervision. I assign and coordinate tasks within the team, taking into consideration the different skills, knowledge and experience among the researchers. The other postgraduate research assistant and I work on the coding and data-related tasks, whilst the undergraduate research assistants handle most of the robustness checking and legal research.

We have tested a number of different techniques for compiling our dataset, including the use of large language models (LLMs) to extract variables via application programming interfaces (APIs), which let our scripts send documents to the models and receive structured responses. For the most part, we used the R software environment to identify patterns within the plain text of the documents and extract key variables from them. Luckily, legal case documents tend to be uniformly structured, with the case title followed by the plaintiffs, defendants, synopsis and so on, which made this task more straightforward than it might otherwise have been.
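To give a flavour of this pattern-based extraction step, here is a minimal sketch in Python (the team worked in R, and the document layout and field names below are hypothetical illustrations, not the project's actual schema):

```python
import re

# Hypothetical plain-text layout: a title line followed by labelled
# sections, mirroring the uniform structure of legal case documents.
CASE_TEXT = """Sierra Club v. United States Forest Service
Plaintiffs: Sierra Club
Defendants: United States Forest Service
Synopsis: Environmental group challenged approval of logging on public lands.
"""

def extract_fields(text):
    """Pull key variables out of a uniformly structured case document."""
    fields = {"title": text.splitlines()[0].strip()}
    for label in ("Plaintiffs", "Defendants", "Synopsis"):
        # Each labelled field sits on its own line, as "Label: value"
        match = re.search(rf"^{label}:\s*(.+)$", text, flags=re.MULTILINE)
        fields[label.lower()] = match.group(1).strip() if match else None
    return fields

print(extract_fields(CASE_TEXT)["plaintiffs"])  # → Sierra Club
```

Because the documents share a common structure, simple anchored patterns like these can recover most fields reliably; the LLM-based extraction comes in for the less regular content.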

However, since our main focus in this project is the arguments used by the litigants, we were keen to do some form of 'topic modelling' (a machine learning technique used in natural language processing (NLP) to discover latent thematic structures, or 'topics', across large collections of documents). This could be done using traditional methods such as Latent Dirichlet Allocation or document-term matrices, but recent literature has shown that LLMs can outperform these traditional approaches at topic modelling. For example, research published by De-Marcos and Domínguez-Díaz in 2025 found that LLMs such as GPT and Gemini consistently outperformed traditional methods in capturing relevant, specialised themes, as they are better at filtering out generic terms. Moreover, the LLMs were more aligned with one another than the traditional methods were, producing significantly lower Levenshtein distances (a string metric that measures the minimum number of single-character edits, that is, insertions, deletions or substitutions, required to transform one word, phrase or sequence into another) between models than between methods.
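The Levenshtein distance described above can be computed with a short dynamic-programming routine; this Python sketch is a generic textbook implementation, not the specific tooling used in the paper or the project:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to transform string a into string b."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("wilderness", "wildernesses"))  # → 2
```

Lower distances between the topic labels produced by two different models indicate that the models agree more closely on what the topics are.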

De-Marcos and Domínguez-Díaz’s research was undertaken in the context of dark web forums, characterised by short, informal and context-specific posts. Whilst our setting differs significantly in many ways (our documents are much longer and extremely formally written), our texts are also highly context-specific, which gives us hope that we may find the same effect.

Engineering our prompts has been an iterative process. At each iteration, I produce a test output, share it with the team for verification, and then refine the prompt based on the feedback. Initial tests used only 20 cases, since the cost of API calls adds up. We are now at the point where our small-batch tests have passed approval, and I have generated full robustness-check outputs for 360 cases (the sample size needed to be 95% confident that the LLM will be accurate across all 6000 cases). My colleagues are working through these, and we hope to approve them soon.

Getting to the point where we have established a final, analysis-ready dataset is very exciting. The journey has been long and extremely bumpy (including overcoming problems in pinning down the exact cause of duplicated cases), but very rewarding. I feel that I have learned so much about the research process and what it is like to work with real data, as opposed to the clean, pre-packaged datasets that ship with Python libraries, which I used as a student. This experience will undoubtedly help me as I progress and continue a career in research.

From a personal perspective, perhaps the most exciting development with the project so far is that Prof Schonhardt-Bailey and I visited the Oxford Computational Political Science Group in March to deliver a talk about our work. I had the opportunity to speak about our project, address colleagues and field questions at the University of Oxford. My journey with this project to date is giving me such a valuable insight into the world of academia and research. Indeed, it has played a big part in my decision to pursue PhD study. I look forward to sharing more as the project progresses.

By Amara Otero Salgado