MY559      Half Unit
Quantitative Text Analysis

This information is for the 2022/23 session.

Teacher responsible

Dr Blake Miller COL.7.14

Availability

This course is available on the MPhil/PhD in Economic Geography, MPhil/PhD in Environmental Economics, MPhil/PhD in International Relations, MPhil/PhD in Regional and Urban Planning Studies and MPhil/PhD in Social Research Methods. This course is available with permission as an outside option to students on other programmes where regulations permit.

This course is available as an outside option to students on other programmes where regulations permit. This course is not controlled access. If you register for a place and meet the prerequisites, if any, you are likely to be given a place.

Pre-requisites

The course will assume knowledge of linear and logistic regression models, to the level covered in MY452.

Course content

The course surveys methods for systematically extracting quantitative information from text for social scientific purposes. It starts with classical content analysis and dictionary-based methods, moves on to classification methods, and ends with state-of-the-art scaling methods and topic models that use statistical techniques to estimate quantities from text. The course lays a theoretical foundation for text analysis but mainly takes a practical, applied approach, so that students learn how to use these methods in actual research. All of the methods share a common three-step process: first, identifying texts and units of text for analysis; second, extracting quantitatively measured features from the texts (such as coded content categories, word counts, word types, dictionary counts, or parts of speech) and converting these into a quantitative matrix; and third, using quantitative or statistical methods to analyse this matrix in order to generate inferences about the texts or their authors. The methods are covered in a logical progression, and each technique is applied hands-on to real texts using appropriate software.

Lectures, class exercises and homework will be based on the use of the R statistical software package but will assume no background knowledge of that language.
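
As a rough illustration of the three-step process described above, the following minimal sketch builds a small document-feature matrix and applies a dictionary count to it. It is not course material: it is written in Python with the standard library only, rather than the R used in the course, and the toy texts, the `tokenize` helper, the category dictionary and the `dictionary_shares` function are all invented for this example.

```python
from collections import Counter
import re

# Step 1: identify the texts and the unit of analysis.
# Here each (invented) document is treated as one unit.
texts = {
    "doc1": "Taxes are too high and taxes hurt growth.",
    "doc2": "Public spending supports growth and welfare.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Step 2: extract features (word counts) and arrange them as a
# document-feature matrix: one row per document, one column per word.
counts = {name: Counter(tokenize(text)) for name, text in texts.items()}
vocabulary = sorted(set().union(*counts.values()))
dfm = {name: [c[word] for word in vocabulary] for name, c in counts.items()}

# Step 3: analyse the matrix. This sketch uses a simple dictionary method:
# the share of each document's tokens that fall in each hypothetical category.
dictionary = {
    "economy": {"taxes", "growth", "spending"},
    "social": {"welfare", "public"},
}


def dictionary_shares(name):
    """Return, for one document, the proportion of tokens in each category."""
    row = dict(zip(vocabulary, dfm[name]))
    total = sum(row.values())
    return {
        category: sum(row.get(word, 0) for word in words) / total
        for category, words in dictionary.items()
    }


for name in texts:
    print(name, dictionary_shares(name))
```

The same matrix could instead feed a classifier, a scaling model or a topic model; only the third step changes.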

Teaching

This course is delivered through a combination of classes and lectures totalling a minimum of 20 hours across Lent Term. 

This course has a reading week in Week 6 of LT.

Formative coursework

Students will be expected to produce 1 problem set in the LT.

Exercises from the computer classes can be submitted for marking.

Indicative reading

quanteda: An R package for quantitative text analysis. http://kbenoit.github.io/quanteda/

Grimmer, Justin and Brandon M. Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21(3):267–297.

Loughran, Tim and Bill McDonald. 2011. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66(1):35–65.

Evans, Michael, Wayne McIntosh, Jimmy Lin and Cynthia Cates. 2007. “Recounting the Courts? Applying Automated Content Analysis to Enhance Empirical Legal Research.” Journal of Empirical Legal Studies 4(4):1007–1039.

Assessment

Project (40%, 5000 words) in the ST.
Problem sets (60%) in the LT.

Key facts

Department: Methodology

Total students 2021/22: 9

Average class size 2021/22: 4

Value: Half Unit
