Not available in 2021/22
MY360 Half Unit
Text Mining

This information is for the 2021/22 session.

Teacher responsible

Prof Kenneth Benoit


Availability

This will be listed as an option for the new BSc in Politics and Data Science.

Course content

The course surveys methods and applications for text mining: systematically extracting quantitative information from text for social scientific purposes.

It starts with classical content analysis and dictionary-based methods, continues with quantitative methods for exploration and discovery from texts, and concludes with classification methods using simple machine learning.

The course lays a theoretical foundation for text analysis but takes a mainly practical, applied approach, so that students learn how to use these methods in actual research. All of the methods covered can be reduced to a three-step process: first, identifying texts and units of text for analysis; second, extracting quantitatively measured features from the texts (such as coded content categories, word counts, word types, dictionary counts, or parts of speech) and converting these into a quantitative matrix; and third, using quantitative or statistical methods to analyse this matrix in order to generate inferences about the texts or their authors. The course surveys these methods in a logical progression, with each technique applied hands-on to real texts using appropriate software.
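As a rough illustration of the three-step process, the sketch below uses plain Python rather than the R and quanteda tools the course itself uses. The two toy documents, the tokenize and cosine helper names, and the choice of word counts and cosine similarity as the feature and the statistic are all assumptions made for this example, not course material.

```python
import math
import re
from collections import Counter

# Step 1: identify the texts and the unit of analysis.
# These two short "documents" are hypothetical examples.
texts = {
    "doc1": "Taxes are too high and spending is too high.",
    "doc2": "We will cut taxes and protect public spending.",
}


def tokenize(text):
    """Lowercase a text and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", text.lower())


# Step 2: extract features (here, simple word counts) and arrange them
# as a document-feature matrix: one row per document, one column per word.
counts = {name: Counter(tokenize(text)) for name, text in texts.items()}
vocabulary = sorted(set().union(*counts.values()))
matrix = [[counts[name][word] for word in vocabulary] for name in texts]


# Step 3: analyse the matrix. As one simple choice, compute the cosine
# similarity between two document rows.
def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


similarity = cosine(matrix[0], matrix[1])
print(vocabulary)
print(matrix)
print(round(similarity, 3))
```

In practice the matrix step would usually be handled by a dedicated package; the point here is only the shape of the pipeline: texts, then a document-feature matrix, then a statistic computed from it.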

Lectures, class exercises and homework will use the R statistical software package but assume no prior knowledge of the language.


Teaching

16 hours and 40 minutes of lectures and 13 hours and 30 minutes of classes in the LT.

Classes and lectures together total 33.5 hours across Lent Term (counting each 50-minute session as one hour). This course has a Reading Week in Week 6 of LT.

Formative coursework

Students will work on weekly, structured problem sets in the staff-led class sessions. Five of these will be for formative assessment. Example solutions will be provided at the end of each week.

Indicative reading

Benoit, Kenneth. 2020. “Text as Data: An Overview.” In Curini, Luigi and Robert Franzese, eds. Handbook of Research Methods in Political Science and International Relations. Thousand Oaks: Sage. pp. 461–497.

Grimmer, Justin and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21(3):267–297.

Loughran, Tim and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66(1, February):35–65.

Evans, Michael, Wayne McIntosh, Jimmy Lin and Cynthia Cates. 2007. “Recounting the Courts? Applying Automated Content Analysis to Enhance Empirical Legal Research.” Journal of Empirical Legal Studies 4(4, December):1007–1039.

quanteda: An R package for quantitative text analysis.


Assessment

Take-home assessment (60%) and problem sets (40%) in the LT.

Summative problem sets will be marked in five of the weeks; together these constitute 40% of the final overall mark. The take-home assessment (60%) will be submitted in the week following the end of the Lent Term.

Course selection videos

Some departments have produced short videos to introduce their courses. Please refer to the course selection videos index page for further information.

Important information in response to COVID-19

Please note that during the 2021/22 academic year some variation to teaching and learning activities may be required to respond to changes in public health advice and/or to account for the differing needs of students in attendance on campus and those who might be studying online. For example, this may involve changes to the mode of teaching delivery and/or the format or weighting of assessments. Changes will only be made if required, and students will be notified about any changes to teaching or assessment plans at the earliest opportunity.

Key facts

Department: Methodology

Total students 2020/21: Unavailable

Average class size 2020/21: Unavailable

Capped 2020/21: No

Value: Half Unit

Guidelines for interpreting course guide information

Personal development skills

  • Self-management
  • Team working
  • Problem solving
  • Application of information skills
  • Communication
  • Application of numeracy skills
  • Specialist skills