SEDS seminar

Past SEDS seminars

Normalizing Digital Trace Data

Date: 19 October 2017
Speaker: Andreas Jungherr
Abstract: Over the last ten years, social scientists have found themselves confronting a massive increase in available data sources. In the debates on how to use these new data, the research potential of “digital trace data” has featured prominently. While various commentators expect digital trace data to create a “measurement revolution”, empirical work has fallen somewhat short of these grand expectations. In fact, empirical research based on digital trace data is largely limited by the prevalence of two central fallacies: first, the n=all fallacy; second, the mirror fallacy. As I will argue, these fallacies can be addressed by developing a measurement theory for the use of digital trace data. For this, researchers will have to test the consequences of variations in research designs, account for sampling problems arising from digital trace data, and explicitly link signals identified in digital trace data to sophisticated conceptualizations of social phenomena. Below, I will outline the two fallacies in greater detail. Then, I will discuss their consequences for three general areas of work with digital trace data in the social sciences: digital ethnography, proxies, and hybrids. In these sections, I will present selected prominent studies, predominantly from political communication research. I will close with a short assessment of the road ahead and of how these fallacies might be constructively addressed by the systematic development of a measurement theory for work with digital trace data in the social sciences.


Integrating Conflict Event Data

Date: Thursday 4th May 2017
Speaker: Karsten Donnay, University of Konstanz
Abstract: The growing volume of sophisticated event-level data collection, with improving geographic and temporal coverage, offers prospects for conducting novel analyses. In instances where multiple related datasets are available, researchers tend to rely on one at a time, ignoring the potential value of the multiple datasets in providing more comprehensive, precise, and valid measurement of empirical phenomena. If multiple datasets are used, integration is typically limited to manual efforts for select cases. We develop the conceptual and methodological foundations for automated, transparent and reproducible integration and disambiguation of multiple event datasets. We formally present the methodology, validate it with synthetic test data, and demonstrate its application using conflict event data for Africa, drawing on four leading sources (UCDP-GED, ACLED, SCAD, GTD). We show that whether analyses rely on one or multiple datasets can affect substantive findings with regard to key explanatory variables, thus highlighting the critical importance of systematic data integration.
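
As a rough illustration of what automated matching across event datasets can involve, the Python sketch below pairs records from two sources when they fall within a chosen time window and distance. It is not the speaker's methodology: the `Event` fields, the `max_days` and `max_km` thresholds, the greedy matching rule, and the coordinates are all assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import date
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Event:
    source: str   # hypothetical dataset label
    day: date
    lat: float
    lon: float


def haversine_km(a: Event, b: Event) -> float:
    """Great-circle distance between two events in kilometres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def match_events(left, right, max_days=1, max_km=25.0):
    """Greedy one-to-one matching: closest candidates in time, then space, first."""
    candidates = []
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            gap = abs((a.day - b.day).days)
            dist = haversine_km(a, b)
            if gap <= max_days and dist <= max_km:
                candidates.append((gap, dist, i, j))
    candidates.sort()
    used_left, used_right, pairs = set(), set(), []
    for _gap, _dist, i, j in candidates:
        if i not in used_left and j not in used_right:
            used_left.add(i)
            used_right.add(j)
            pairs.append((i, j))
    return sorted(pairs)


# Made-up records for illustration only.
left = [Event("A", date(2015, 3, 1), 9.05, 7.49),
        Event("A", date(2015, 3, 10), 6.45, 3.39)]
right = [Event("B", date(2015, 3, 2), 9.10, 7.50),
         Event("B", date(2015, 6, 1), 6.45, 3.39)]
pairs = match_events(left, right)
```

Here only the first record of each list is paired. The second records share a location but lie months apart, so they remain separate events. Varying the two thresholds changes which records are treated as duplicates, which is one reason a transparent and reproducible procedure matters.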


The STEM requirements of "non-STEM" jobs: evidence from UK online vacancy postings and implications for Skills & Knowledge Shortages

Date: Thursday 23rd March 2017
Speaker: Inna Grinis, PhD candidate in the Department of Economics, LSE
Abstract: Do employers in “non-STEM” occupations (e.g. Graphic Designers, Economists) seek to hire STEM (Science, Technology, Engineering, and Mathematics) graduates with a higher probability than non-STEM ones for knowledge and skills that they have acquired through their STEM education (e.g. “Microsoft C#”, “Systems Engineering”) and not simply for their problem-solving and analytical abilities? This is an important question in the UK, where less than half of STEM graduates work in STEM occupations and where this apparent leakage from the “STEM pipeline” is often considered a waste of resources. To address it, this paper goes beyond the discrete divide of occupations into STEM vs. non-STEM and measures STEM requirements at the level of jobs by examining the universe of UK online vacancy postings between 2012 and 2016. We design and evaluate machine learning algorithms that classify thousands of keywords collected from job adverts and millions of vacancies into STEM and non-STEM. 35% of all STEM jobs belong to non-STEM occupations and 15% of all postings in non-STEM occupations are STEM. Moreover, STEM jobs are associated with higher wages within both STEM and non-STEM occupations, even after controlling for detailed occupations, education, experience requirements, employers, etc. Although our results indicate that the STEM pipeline breakdown may be less problematic than typically thought, we also find that many of the STEM requirements of “non-STEM” jobs could be acquired with STEM training that is less advanced than a full-time STEM education. Hence, a more efficient way of satisfying the STEM demand in non-STEM occupations could be to teach more STEM in non-STEM disciplines. We develop a simple abstract framework to show how this education policy could help reduce STEM shortages in both STEM and non-STEM occupations.


The Case for Research Preregistration, with Applications in Elections Research

Date: Thursday 23rd February 2017
Speaker: Prof. Jamie Monogan, Department of Political Science, University of Georgia.
Abstract: Preregistration means that an analyst commits to a research design before observing the outcome. How can preregistration be useful for political scientists? This presentation makes the argument that, when appropriate, study registration increases honesty and transparency in research reporting in a way that benefits authors, reviewers, and readers. The essential element for preregistration to be useful is a clear public signal of the design before the data could possibly be observed, such as before an experiment is conducted or before an election occurs. This presentation therefore offers illustrations of how to implement preregistration that focus on American elections. The three examples are: an analysis of the immigration issue in 2010 U.S. House of Representatives races, the effect of the 2011 debt ceiling controversy on the 2012 U.S. House elections, and a yet-to-be-implemented design of how anxiety shaped individual voters' decision-making process in the 2016 U.S. presidential election.


Revealing the Anatomy of Vote Trading

Date: Thursday 9th February 2017
Speaker: Dr Omar Guerrero, Saïd Business School, University of Oxford.
Abstract: Cooperation in the form of vote trading, also known as logrolling, is central to law-making processes, shaping the development of democratic societies. Measuring vote trading is challenging because it happens behind closed doors and is therefore not directly observable. Empirical evidence of logrolling is scarce and limited to highly specific situations because existing methods are not easily applicable to broader contexts. We have developed a general and scalable methodology for revealing a network of vote traders, allowing us to measure logrolling on a large scale. Analysis of more than 9 million votes spanning 40 years in the U.S. Congress reveals a higher logrolling prevalence in the Senate and an overall decreasing trend over recent congresses, coinciding with high levels of political polarization. Our method is applicable in multiple contexts, shedding light on many aspects of logrolling and opening new doors in the study of hidden cooperation.


Fitting Hierarchical Models in Large-Scale Recommender Systems

Date: Thursday 26 January 2017
Speaker: Professor Patrick Perry, Stern School of Business, New York University.
Abstract: Early in the development of recommender systems, hierarchical models were recognized as a tool capable of combining content-based filtering (recommending based on item-specific attributes) with collaborative filtering (recommending based on preferences of similar users). However, as recently as the late 2000s, many authors deemed the computational costs required to fit hierarchical models to be prohibitively high for commercial-scale settings. This talk addresses the challenge of fitting a hierarchical model at commercial scale by proposing a moment-based procedure for estimating the parameters of a hierarchical model. This procedure has its roots in a method originally introduced by Cochran in 1937. The method trades statistical efficiency for computational efficiency. It gives consistent parameter estimates, competitive prediction error performance, and substantial computational improvements. When applied to a large-scale recommender system application and compared to a standard maximum likelihood procedure, the method delivers competitive prediction performance while reducing computation time from hours to minutes. 
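
To give a feel for how a moment-based estimator trades statistical efficiency for speed, here is a minimal Python sketch of the textbook ANOVA-type moment estimator for a balanced one-way random-intercept model. It is not the procedure presented in the talk: the model, the function names `moment_estimates` and `shrunken_means`, and the toy ratings are assumptions chosen for illustration.

```python
from statistics import mean


def moment_estimates(groups):
    """Method-of-moments estimates for y_ij = mu + a_i + e_ij (balanced groups).

    Returns (grand mean, between-group variance, within-group variance).
    """
    k = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equally sized groups")
    group_means = [mean(g) for g in groups]
    grand = mean(group_means)
    # Within-group and between-group mean squares.
    msw = sum((y - m) ** 2
              for g, m in zip(groups, group_means) for y in g) / (k * (n - 1))
    msb = n * sum((m - grand) ** 2 for m in group_means) / (k - 1)
    # E[MSW] = var_within and E[MSB] = var_within + n * var_between.
    var_within = msw
    var_between = max(0.0, (msb - msw) / n)
    return grand, var_between, var_within


def shrunken_means(groups):
    """Pull each group mean toward the grand mean using the moment estimates."""
    grand, var_between, var_within = moment_estimates(groups)
    n = len(groups[0])
    weight = var_between / (var_between + var_within / n) if var_between > 0 else 0.0
    return [grand + weight * (mean(g) - grand) for g in groups]


# Toy ratings: three users (groups), three ratings each. Invented numbers.
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
grand, var_between, var_within = moment_estimates(groups)
predictions = shrunken_means(groups)
```

Everything here is closed-form and needs only sums over the data, with no iterative likelihood maximization. That is the kind of computational saving a moment-based approach aims for, at the cost of estimates that are generally less efficient than maximum likelihood.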


Detecting the Structure and Dynamics of Political Concepts from Text

Date: 8 December 2016
Speaker: Dr Paul Nulty, Research Associate, Cambridge Language Sciences at the University of Cambridge
Abstract: The availability of large archives of digitised political text offers new opportunities for analysing the emergence and formation of political concepts. This talk describes new methods for discovering the structure of abstract political concepts from large text corpora. Working in a theoretical framework that treats concepts as cultural entities that can be studied through patterns of lexical behaviour (De Bolla, 2013), Dr Nulty outlines several methods from computational linguistics that enable researchers to discover the architecture of political concepts from text. At the level of the sentence, grammatical relation parsing (dependency parsing) is used to extract predicates and propositions that compose complex concepts. Beyond the sentence level, he describes a weighted mutual-information measure calculated from long-range co-occurrences to discover looser conceptual associations that might not occur in a predicating relation with the central concept. Finally, he presents several examples from historical corpora of traces of the origin and structure of political concepts, and of how these have changed over time.
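
The idea of scoring word associations from co-occurrences can be sketched in a few lines of Python. The version below computes pointwise mutual information (PMI) from distance-weighted co-occurrence counts. The abstract does not specify the exact weighting used in the talk, so the `1 / distance` weight, the `window` parameter, the function names, and the toy token list are all assumptions.

```python
from collections import defaultdict
from math import log


def weighted_cooccurrence(tokens, window=10):
    """Symmetric co-occurrence counts, down-weighted by token distance."""
    counts = defaultdict(float)
    for i, left in enumerate(tokens):
        for distance in range(1, window + 1):
            if i + distance >= len(tokens):
                break
            right = tokens[i + distance]
            if left == right:
                continue  # ignore a word co-occurring with itself
            counts[(left, right)] += 1.0 / distance
            counts[(right, left)] += 1.0 / distance
    return counts


def pmi_table(counts):
    """PMI for each observed pair: log of observed over expected co-occurrence."""
    total = sum(counts.values())
    marginal = defaultdict(float)
    for (word, _other), c in counts.items():
        marginal[word] += c
    return {
        (w1, w2): log(c * total / (marginal[w1] * marginal[w2]))
        for (w1, w2), c in counts.items()
    }


# Invented toy sequence; window=1 keeps the arithmetic easy to check by hand.
tokens = ["liberty", "property", "liberty", "property",
          "tax", "revenue", "tax", "revenue"]
pmi = pmi_table(weighted_cooccurrence(tokens, window=1))
```

In this toy sequence "liberty" and "property" receive a positive score because they co-occur more often than their overall frequencies would suggest, while "property" and "tax" receive a negative one. A wider `window` would pick up the looser, longer-range associations the abstract refers to.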


Measuring and explaining political sophistication through textual complexity

Date: 24 November 2016
Speaker: Professor Ken Benoit, Department of Methodology at the LSE (with Kevin Munger and Arthur Spirling)
Abstract: The sophistication of political communication has been measured using “readability” scores developed from other contexts, but their application to political text suffers from a number of theoretical and practical issues. We develop a new benchmark of textual complexity which is better suited to the task of determining political sophistication. We use the crowd to perform tens of thousands of pairwise comparisons of snippets of State of the Union Addresses, scale these results into an underlying measure of reading ease, and “learn” which features of the texts are most associated with higher levels of sophistication, including linguistic markers, parts of speech, and a baseline of word frequency relative to 210 years of the Google book corpus ngram dataset. Our refitting of the readability model not only shows which features are appropriate to the political domain and how, but also provides a measure easily applied and rescaled to political texts in a way that facilitates comparison with reference to a meaningful baseline.
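
One common way to turn pairwise comparisons into a single scale is a Bradley-Terry model, and the Python sketch below fits one with a simple iterative update. The abstract does not say which scaling model the authors use, so treating this as Bradley-Terry is an assumption, as are the function name `bradley_terry`, the snippet labels, and the invented comparison outcomes. The sketch also assumes every snippet is judged easier at least once and that no snippet is compared with itself.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200):
    """Estimate strengths from (winner, loser) pairs; strengths sum to 1.

    Here the "winner" is the snippet a coder judged easier to read.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {item: 1.0 for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            # Sum over opponents j of (comparisons between i and j) / (p_i + p_j).
            denom = sum(
                n / (strength[i] + strength[j])
                for pair, n in pair_counts.items() if i in pair
                for j in pair - {i}
            )
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {i: s / total for i, s in updated.items()}
    return strength


# Invented outcomes: each tuple is (easier snippet, harder snippet).
comparisons = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
strength = bradley_terry(comparisons)
```

With these toy outcomes snippet A ends up with the highest estimated reading ease and C the lowest. The fitted strengths (or their logarithms) could then serve as the outcome when learning which text features predict ease, in the spirit of the approach the abstract describes.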