MY360 Half Unit
Text Mining

Prof Kenneth Benoit


Course content

The course surveys methods and applications for text mining: systematically extracting quantitative information from text for social scientific purposes.

It starts with classical content analysis and dictionary-based methods, continues with quantitative methods for exploration and discovery from texts, and concludes with classification methods using simple machine learning.

The course lays a theoretical foundation for text analysis but is mainly practical and applied, so that students learn how to use these methods in actual research. All of the methods covered can be reduced to a common three-step process: first, identifying texts and units of texts for analysis; second, extracting quantitatively measured features from the texts (such as coded content categories, word counts, word types, dictionary counts, or parts of speech) and converting these into a quantitative matrix; and third, using quantitative or statistical methods to analyse this matrix in order to generate inferences about the texts or their authors. The course surveys these methods in a logical progression, applying each technique to real texts using appropriate software.
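The three-step process can be illustrated with a minimal sketch. It is written in Python using only the standard library, purely for illustration; the course itself uses R. The two example texts, the `dictionary` of content categories, and the helper names `tokenize` and `category_counts` are all hypothetical and are not taken from the course materials.

```python
import re
from collections import Counter

# Step 1: identify the texts and the unit of analysis (here, whole documents).
# These two short documents are made up for illustration.
texts = {
    "doc1": "Taxes are too high and taxes hurt growth.",
    "doc2": "Public spending supports growth and jobs.",
}

# Hypothetical dictionary mapping content categories to words.
dictionary = {
    "taxation": {"tax", "taxes"},
    "economy": {"growth", "jobs", "spending"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Step 2: extract features (word counts) and arrange them as a
# document-feature matrix: one row per document, one column per word.
counts = {doc: Counter(tokenize(text)) for doc, text in texts.items()}
features = sorted(set().union(*counts.values()))
dfm = {doc: [counter[f] for f in features] for doc, counter in counts.items()}


# Step 3: analyse the matrix. Here the analysis is a simple dictionary
# count: how many tokens in each document fall into each category.
def category_counts(counter):
    return {
        category: sum(counter[word] for word in words)
        for category, words in dictionary.items()
    }


scores = {doc: category_counts(counter) for doc, counter in counts.items()}

for doc, result in scores.items():
    print(doc, result)
```

In this toy example the third step is only a dictionary count; any statistical model that takes a document-feature matrix as input could be substituted at that stage.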

Lectures, class exercises and homework will be based on the use of the R statistical software package but will assume no background knowledge of that language.


Teaching

16 hours and 40 minutes of lectures and 13 hours and 30 minutes of classes in the Lent Term (LT).

Counting each 50-minute session as an hour, lectures and classes together total 33.5 hours across the term. This course has a Reading Week in Week 6 of LT.

Formative coursework

Students will work on weekly, structured problem sets in the staff-led class sessions. Five of these will be for formative assessment. Example solutions will be provided at the end of each week.

Indicative reading

Benoit, Kenneth. 2020. “Text as Data: An Overview.” In Curini, Luigi and Robert Franzese, eds. Handbook of Research Methods in Political Science and International Relations. Thousand Oaks: Sage. pp. 461–497.

Grimmer, Justin and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21(3):267–297.

Loughran, Tim and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66(1, February): 35–65.

Evans, Michael, Wayne McIntosh, Jimmy Lin and Cynthia Cates. 2007. “Recounting the Courts? Applying Automated Content Analysis to Enhance Empirical Legal Research.” Journal of Empirical Legal Studies 4(4, December):1007–1039.

quanteda: An R package for quantitative text analysis.


Assessment

Take-home assessment (60%) and problem sets (40%) in the LT.

Five problem sets, each set in a different week, will be marked summatively and will together constitute 40% of the final overall mark. The take-home assessment (60%) will be submitted in the week following the end of the Lent Term.

