When your data science research hits an issue, what do you do? You should present your research problems at the DSI Squared Unsolved Problems Research Seminar, to get helpful ideas or to find collaborators across disciplines for breaking through the obstacles.
This series of seminars forms part of the DSI Squared collaboration between the LSE Data Science Institute and ICL Data Science Institute, to foster innovations by bridging the social sciences and computer science and STEM subjects.
Innovative researchers from both Institutes are invited to showcase their ideas in front of an expert audience of colleagues from both Institutes. These attendees offer their wide-ranging expertise and knowledge to crowdsource solutions to these stumbling blocks! For example, core data science experts may wish for contributions from those with knowledge in social science and vice versa.
The series avoids the classic seminar format of long presentations and limited chance for audience contributions by limiting the presenter to just an informal twenty-minute outline of their research. Presenters will circulate a paper, draft paper, or problem outline in advance identifying data science challenges that their research faces.
Events are hosted on alternate months by the institutes. The initial event took place at LSE on 30 September 2022 with a research presentation delivered by Dr Rossella Arcucci, a colleague from ICL. At these lunchtime meetings, we provide food after the seminar.
So, are you based at LSE or ICL? Join us at our upcoming events to get involved with our growing DSI Squared research community, take the chance to collaborate with others, and find answers to your unsolved problems.
Scaling Text with the Class Affinity Model
Professor Ken Benoit
Location: Data Observatory, Imperial College London's South Kensington Campus.
Date: 25 November 2022
Time: 12:30 - 13:30
Sign up to attend this event here.
Probabilistic methods for classifying text form a rich tradition in machine learning and natural language processing. For many important problems, however, class prediction is uninteresting because the class is known, and instead the focus shifts to estimating latent quantities related to the text, such as affect or ideology. We focus on one such problem of interest, estimating the ideological positions of 55 Irish legislators in the 1991 Dáil confidence vote, a challenge brought by opposition party leaders against the then-governing Fianna Fáil party in response to corruption scandals. In this application, we clearly observe support or opposition from the known positions of party leaders, but have only information from speeches from which to estimate the relative degree of support from other legislators. To solve this scaling problem and others like it, we develop a text modeling framework that allows actors to take latent positions on a “gray” spectrum between “black” and “white” polar opposites. We are able to validate results from this model by measuring the influences exhibited by individual words, and we are able to quantify the uncertainty in the scaling estimates by using a sentence-level block bootstrap. Applying our method to the Dáil debate, we are able to scale the legislators between extreme pro-government and pro-opposition in a way that reveals nuances in their speeches not captured by their votes or party affiliations.
This method is implemented in the R package quanteda.textmodels.
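As a rough illustration of the idea in the abstract, the Python sketch below treats a speech as a mixture of two reference word distributions and estimates the mixing weight by maximum likelihood, then resamples whole sentences to get a bootstrap interval. It is a toy under stated assumptions, not the quanteda.textmodels implementation: the function names, the add-one smoothing, the EM update, and the example sentences are all hypothetical choices made for this sketch.

```python
import random
from collections import Counter


def word_probs(texts, vocab, alpha=1.0):
    """Smoothed word distribution for one reference class (add-alpha smoothing is an assumption)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def affinity(tokens, p_a, p_b, iters=200):
    """EM estimate of theta in the mixture theta * p_a + (1 - theta) * p_b."""
    toks = [w for w in tokens if w in p_a]  # ignore out-of-vocabulary words
    if not toks:
        return 0.5  # no information: stay at the midpoint
    theta = 0.5
    for _ in range(iters):
        # Responsibility of class A for each token, then average.
        resp = [theta * p_a[w] / (theta * p_a[w] + (1 - theta) * p_b[w]) for w in toks]
        theta = sum(resp) / len(resp)
    return theta


def bootstrap_interval(sentences, p_a, p_b, reps=500, level=0.95, seed=0):
    """Percentile interval from resampling whole sentences with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        sample = rng.choices(sentences, k=len(sentences))
        tokens = [w for s in sample for w in s.lower().split()]
        estimates.append(affinity(tokens, p_a, p_b))
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * reps)]
    upper = estimates[min(reps - 1, int((1 + level) / 2 * reps))]
    return lower, upper


# Hypothetical reference texts standing in for the two polar classes.
ref_a = ["we support the government and its record", "the government has our confidence"]
ref_b = ["the government must resign over this scandal", "we oppose this corrupt government"]
vocab = sorted({w for t in ref_a + ref_b for w in t.lower().split()})
p_a = word_probs(ref_a, vocab)
p_b = word_probs(ref_b, vocab)

# Hypothetical speech, split into sentences, to be placed between the two classes.
speech = ["i support the government", "but the scandal is troubling", "we have confidence in its record"]
theta = affinity([w for s in speech for w in s.lower().split()], p_a, p_b)
lo, hi = bootstrap_interval(speech, p_a, p_b)
```

Here `theta` near 1 means the speech resembles the first reference class and `theta` near 0 the second. Resampling sentences rather than individual words keeps within-sentence word dependence intact, which is the motivation for a block bootstrap.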
This is a fairly classical statistical paper, using a bag-of-words approach to text, with feature selection based on influence statistics, and maximum likelihood to estimate the affinity statistic. A more contemporary method would be to incorporate a model using an artificial neural network, using a continuous bag-of-words input, an embedding layer, and a continuously scaled output layer as the estimand. This could be based on a transformer architecture.
- The goal is valid measurement, not prediction. Because the goal is measurement, some uncertainty accounting is desirable.
- There are no training labels for individual cases, but rather document sets identified with polar opposite classes, to whose affinity each unknown document is measured.
- Because the measurement is a latent trait with no directly verifiable value, validation does not work in the same way that a regression loss measure (e.g., RMSE) or a categorical loss measure (e.g., F1) can be measured.
- Robustness, reproducibility, and transparency are important goals in the construction of the “estimator”, since (social) science generally eschews black boxes.
Professor Ken Benoit
Ken Benoit is Director of the Data Science Institute at LSE and Professor of Computational Social Science in the Department of Methodology.
Ken’s current research focuses on computational, quantitative methods for processing large amounts of textual data, mainly political texts and social media. Current interests span the analysis of big data, including social media, and methods of text mining. He has published extensively on applications of measurement and the analysis of text as data in political science, including machine learning methods and text coding through crowd-sourcing, an approach that combines statistical scaling with the qualitative power of thousands of human coders working in tandem on small coding tasks.
Data Learning for more reliable AI models
Dr Rossella Arcucci
Read more about this event here.
This work fits into the context of AI for digital twins (DT). DTs are usually made of two components: a model and some data. When developing a digital twin, many fundamental questions exist, some connected with the data and its reliability and uncertainty, and some to do with dynamic model updating. To combine model and data, we use Data Assimilation (DA). DA is the approximation of the true state of some physical system by combining real-world observations with a dynamic model. DA models have increased in sophistication to better fit application requirements and circumvent implementation issues. Nevertheless, these approaches are incapable of fully overcoming some of their unrealistic assumptions, such as linearity of the systems. Machine Learning (ML) shows great capability in approximating nonlinear systems and extracting meaningful features from high-dimensional data. ML algorithms can assist or replace traditional forecasting methods. However, the data used to train any ML algorithm include numerical, approximation and round-off errors, which are trained into the forecasting model. Integration of ML with DA increases the reliability of prediction by including information in real time and with a physical meaning. This talk introduces Data Learning, a field that integrates Data Assimilation and Machine Learning to overcome limitations in applying these fields to real-world data. We present several Data Learning methods and results for some real-world test cases, though the equations are general and can easily be applied elsewhere.
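To make the Data Assimilation step concrete, here is a minimal Python sketch of a scalar analysis update of the kind used in Kalman-type filters: a model forecast and an observation are blended according to their error variances. This is a textbook simplification offered as an assumption about what "combining observations with a model" can look like, not the Data Learning methods presented in the talk. The function name and the numbers are hypothetical.

```python
def analysis_update(forecast, forecast_var, obs, obs_var):
    """Blend a model forecast with an observation, weighting each by its error variance."""
    # Gain is close to 1 when the forecast is uncertain, close to 0 when the observation is.
    gain = forecast_var / (forecast_var + obs_var)
    analysis = forecast + gain * (obs - forecast)
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var


# Hypothetical numbers: forecast and observation are equally uncertain,
# so the analysis lands halfway between them with half the variance.
state, state_var = analysis_update(forecast=10.0, forecast_var=4.0, obs=12.0, obs_var=4.0)
```

The update assumes a linear relation between state and observation, which is the kind of assumption the abstract describes ML as helping to relax.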
Dr Rossella Arcucci
Rossella is a Lecturer in Data Science and Machine Learning at Imperial College London, where she leads the DataLearning Group and is the elected representative of the AI Network of Excellence. She is also an elected member of the World Meteorological Organization (WMO), where she contributes to the World Weather Research Programme.
Rossella has developed models which have been applied to many industries including finance (to estimate optimal parameters of economic models), social science (to merge Twitter and polling data to better estimate the sentiment of people), engineering (to optimise the placement of sensors and reduce the costs), geoscience (to improve accuracy of forecasting) and climate change. With an academic background in mathematics, Rossella completed her PhD in Computational and Computer Science in February 2012 and became a Marie Sklodowska-Curie fellow with the European Commission Research Executive Agency in Brussels in February 2017.