MY459 Half Unit
Computational Text Analysis and Large Language Models

This information is for the 2025/26 session.

Course Convenor

Ryan Hubert

Availability

This course is available on the MPA in Data Science for Public Policy, MSc in Applied Social Data Science, MSc in Data Science, MSc in Econometrics and Mathematical Economics, MSc in Human Geography and Urban Studies (Research), MSc in Political Science (Political Science and Political Economy), MSc in Social Research Methods, MSc in Statistics, MSc in Statistics (Financial Statistics), MSc in Statistics (Financial Statistics) (Research), MSc in Statistics (Research), MSc in Statistics (Social Statistics) and MSc in Statistics (Social Statistics) (Research). This course is freely available as an outside option to students on other programmes where regulations permit. It does not require permission.

The course is also available to research students as MY559. This course is not controlled access. If you register for a place and meet the prerequisites, if any, you are likely to be given a place.

Requisites

Additional requisites:

Applied Regression Analysis (MY452) or equivalent is required. Students should have foundational mathematical proficiency up to and including basic linear algebra. All students are required to complete the pre-sessional Digital Skills Lab course on programming (information will be available on the course Moodle page). Prior experience coding in at least one programming language is very helpful, but MY459 is suitable for students without coding experience once they have completed the pre-sessional course. Students in this situation should be prepared to invest additional time learning to code during the term.

Course content

This course introduces computational approaches to analysing text, emphasising how these methods can be used to investigate social phenomena and offering an overview of the tools that students can apply in academic research, policy analysis, or industry roles in data science. Students learn to quantify and empirically model texts using techniques such as dictionary methods for measuring sentiment, topic modelling for extracting document topics, scaling for identifying ideological rhetoric, supervised classification for categorising large numbers of texts, word embeddings for modelling the contextual meaning of words, and large language models for a range of possible applications.

A central focus is on evaluating the validity of these methods: what they measure, how results should be interpreted, and what kinds of inferences they can support. To help build a deeper conceptual understanding of how these methods enable learning about the social world, the course develops a range of applied technical skills, including programming, data wrangling, and the use of generative AI. Students gain practical experience through hands-on exercises designed to prepare them to use these techniques in academic research and beyond. By the end of the course, students will have developed a strong command of the logic and theory underpinning a wide range of computational techniques for analysing texts, be able to critically assess their uses and limits, and be equipped to apply them to real-world datasets.

Teaching

20 hours of lectures and 10 hours of seminars in the Winter Term.

This course has a reading week in Week 6 of Winter Term.

Formative assessment

Students will complete exercises during seminars and one take-home assignment during Winter Term.

Indicative reading

Benoit, Kenneth (2020). “Text as Data: An Overview.” In Curini, Luigi and Robert Franzese, eds. Handbook of Research Methods in Political Science and International Relations. Sage Publications. pp. 461-497.

Grimmer, Justin, Margaret E. Roberts and Brandon M. Stewart (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.

Jurafsky, Daniel and James H. Martin (2024). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd edition. https://web.stanford.edu/~jurafsky/slp3/

quanteda: An R package for quantitative text analysis. http://kbenoit.github.io/quanteda/

Assessment

Exam (100%), duration: 120 minutes, in the Spring exam period.


Key facts

Department: Methodology

Course Study Period: Winter Term

Unit value: Half unit

FHEQ Level: Level 7

CEFR Level: Null

Total students 2024/25: 21

Average class size 2024/25: 11

Controlled access 2024/25: No

Personal development skills

  • Problem solving
  • Application of numeracy skills