When your data science research hits an issue, what do you do? You should present your research problems at the DSI Squared Unsolved Problems Research Seminar to get helpful ideas or to find collaborators across disciplines for breaking through the obstacles.
This series of seminars forms part of the DSI Squared collaboration between the LSE Data Science Institute and ICL Data Science Institute, to foster innovations by bridging the social sciences and computer science and STEM subjects.
Innovative researchers from both Institutes are invited to showcase their ideas in front of an expert audience of colleagues from both Institutes. These attendees offer their wide-ranging expertise and knowledge to crowdsource solutions to these stumbling blocks! For example, core data science experts may wish for contributions from those with knowledge in social science and vice versa.
The series avoids the classic seminar format of long presentations and limited chance for audience contributions by limiting the presenter to just an informal twenty-minute outline of their research. Presenters will circulate a paper, draft paper, or problem outline in advance identifying data science challenges that their research faces.
Events are hosted by the two institutes in alternate months. The initial event took place at LSE on 30 September 2022 with a research presentation delivered by Dr Rossella Arcucci, a colleague from ICL. At these lunchtime meetings, a light lunch is provided after the seminar.
So, are you based at LSE or ICL? Join us at our upcoming events to get involved with our growing DSI Squared research community, take the chance to collaborate with others, and find answers to your unsolved problems.
Generative Modelling of Cardiac Anatomy
Dr Wenjia Bai
Location: LSE Data Science Institute, COL 1.06, First Floor Columbia House, Houghton St, London WC2A 2AE
Date: 18 January 2024
Please Note: this event is open to staff and PhD students at LSE and Imperial College only.
Registration is required. Sign up to attend this event here.
Two key questions in cardiac image analysis are how to assess the anatomy and motion of the heart from images, and how to understand their association with non-imaging clinical factors such as gender, age and diseases. While the first question can often be addressed by image segmentation and motion tracking algorithms, our capability to model and answer the second question is still limited. Here, we propose a novel conditional generative model to describe the 4D spatio-temporal anatomy of the heart and its interaction with non-imaging clinical factors. The clinical factors are integrated as the conditions of the generative modelling, which allows us to investigate how these factors influence the cardiac anatomy. We evaluate the model performance mainly on two tasks: anatomical sequence completion and sequence generation. The model achieves high performance in anatomical sequence completion, comparable to or outperforming other state-of-the-art generative models. In terms of sequence generation, given clinical conditions, the model can generate realistic synthetic 4D sequential anatomies that share similar distributions with the real data.
The Unsolved Problems:
How can we develop a personalised normative model for human anatomy?
How do we show its clinical usefulness?
Social and Ethical Implications of Data Scarcity and Data Drift in Large Language Models (LLMs)
Dr Blake Miller
Location: Data Science Institute, Imperial College London, William Penney Laboratory, South Kensington Campus SW7 2AZ
Date: 5 October 2023
Time: 12:30 - 13:30
This event is open to staff and PhD students at LSE and Imperial College only. Registration is required. Sign up to attend this event here.
In this project, I investigate the effects of behavioral changes in data producers/providers due to the swift introduction and widespread adoption of powerful large language model (LLM) tools. I examine the impact of their use on the quality and quantity of data produced on platforms whose content is commonly used to train these models (e.g., Wikipedia, StackOverflow, Quora, etc.). I discuss the potential challenges arising from data drift and domain mismatch resulting from this behavioral shift, specifically concerning safety, content moderation, and the factual accuracy of LLM outputs. This project aims to highlight the extent of behavior change among content creators and to emphasize the potential risks of LLMs becoming less reliable due to scarcity of non-synthetic data.
The Unsolved Problem:
Evaluating LLM output, measuring the impact of data drift and data scarcity on safety, fairness, bias, etc. of LLMs.
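One simple way to put a number on data drift is to compare the token distributions of text collected before and after a behavioural shift. The sketch below is only an illustration of that idea, not the method used in the project: the whitespace tokenisation, the function names and the two toy corpora are all assumptions made for the example. It computes the Jensen-Shannon divergence, which is 0 for identical distributions and 1 (in base 2) for distributions with no shared tokens.

```python
import math
from collections import Counter


def token_distribution(texts):
    """Relative frequency of each lowercased, whitespace-split token."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two token distributions."""
    vocab = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        # KL(a || m); tokens with zero probability in a contribute nothing.
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical toy corpora standing in for "before" and "after" snapshots.
before = ["how do I sort a list in python", "sort a list of tuples by key"]
after = ["as an AI language model I cannot", "sort a list in python using sorted"]

drift = js_divergence(token_distribution(before), token_distribution(after))
print(f"Jensen-Shannon divergence: {drift:.3f}")
```

A score like this only captures lexical change; linking it to safety, fairness or bias outcomes is the open part of the problem.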
The need for “smarter” data curation methods
Dr Ovidiu Șerban
Location: FAW.2.04 (Fawcett House, LSE)
Date: 16 January 2023
Time: 12:30 - 13:30
The Deep Learning community is buzzing to find the “best” and “largest” model they can train without thinking more about the data and where it comes from. This phenomenon makes junior data scientists and students at all levels feel very uneasy with Data Curation, which is still considered an underrated topic. Throughout this talk, we will look at a few projects, their data problems and how we addressed the data curation issues to improve the Machine Learning models. In one of the projects, we will be forecasting COVID-19 cases and excess deaths using data proxies for human activity. In another project, we will look at fraudulent activity detection and the issue of generalising datasets for infrequent events. Last, we will look at data quality issues with human-annotated data and how to estimate the quality of textual annotations beyond inter-annotator agreements.
The Unsolved Problem:
The unsolved challenge of all these projects is improving data quality by spending little time manually curating and reviewing the data. Are there more intelligent data curation techniques available to accelerate this process?
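For context, the inter-annotator agreement baseline mentioned in the abstract is often reported as Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The following is a minimal sketch of that baseline only; the function name and the toy labels are assumptions for illustration, and it does not represent the quality-estimation methods discussed in the talk.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four text snippets.
annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa summarises how consistently annotators agree but says nothing about whether the agreed label is correct, which is one reason to look beyond it.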
- Molinas, R., Quilodran Casas, C., Arcucci, R., & Serban, O. A novel approach for predicting epidemiological forecasting parameters based on real-time signals and Data Assimilation. (in review) Available on request.
- Tuccella, J., Nadler, P., & Şerban, O. (2021). Protecting Retail Investors from Order Book Spoofing using a GRU-based Detection Model. arXiv. https://doi.org/10.48550/arXiv.2110.03687
- Vaghela, U., Rabinowicz, S., Bratsos, P., Martin, G., Fritzilas, E., Markar, S., Purkayastha, S., Stringer, K., Singh, H., Llewellyn, C., Dutta, D., Clarke, J. M., Howard, M., PanSurg REDASA Curators, Serban, O., & Kinross, J. (2021). Using a Secure, Continually Updating, Web Source Processing Pipeline to Support the Real-Time Data Synthesis and Analysis of Scientific Literature: Development and Validation Study. Journal of Medical Internet Research, e25714.
Dr Ovidiu Șerban
Dr Ovidiu Șerban (ʃerban) is a Research Fellow at the Data Science Institute (DSI), Imperial College London. He is currently the head of the Data Observatory Team at DSI and a member of the Imperial AI Network of Excellence and Imperial Mental Health Research. Ovidiu recently joined the Security, Privacy, Identity and Trust Engagement NetworkPlus (SPRITE+) as an Early Career researcher (ECR). In addition, he actively collaborates with Refinitiv, an LSEG business, on various projects around using Machine Learning and Data Curation for Environmental, Social, and Governance (ESG) issues. He is also a co-PI on the NIHR-funded project “R-Cancer” and REDASA, where he works on data quality estimation for annotated data on unstructured medical text documents.
Ovidiu holds a joint PhD from INSA de Rouen Normandy (France) and “Babeș-Bolyai” University (Romania), completed while working at the LITIS Laboratory in France. His current work includes real-time Data Curation, Machine Learning, Natural Language Processing, Large-Scale Visualisation Systems and Human-Computer Interaction.
Scaling Text with the Class Affinity Model
Professor Ken Benoit
Location: Data Observatory, Imperial College London's South Kensington Campus.
Date: 25 November 2022
Time: 12:30 - 13:30
Probabilistic methods for classifying text form a rich tradition in machine learning and natural language processing. For many important problems, however, class prediction is uninteresting because the class is known, and instead the focus shifts to estimating latent quantities related to the text, such as affect or ideology. We focus on one such problem of interest, estimating the ideological positions of 55 Irish legislators in the 1991 Dáil confidence vote, a challenge brought by opposition party leaders against the then-governing Fianna Fáil party in response to corruption scandals. In this application, we clearly observe support or opposition from the known positions of party leaders, but have only information from speeches from which to estimate the relative degree of support from other legislators. To solve this scaling problem and others like it, we develop a text modeling framework that allows actors to take latent positions on a “gray” spectrum between “black” and “white” polar opposites. We are able to validate results from this model by measuring the influences exhibited by individual words, and we are able to quantify the uncertainty in the scaling estimates by using a sentence-level block bootstrap. Applying our method to the Dáil debate, we are able to scale the legislators between extreme pro-government and pro-opposition in a way that reveals nuances in their speeches not captured by their votes or party affiliations.
This method is implemented in the R package quanteda.textmodels.
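To make the idea of a position between two poles concrete, here is a simplified Python sketch. It is not the quanteda.textmodels implementation and it omits the influence-based feature selection described in the abstract. It assumes that each token of a speech is drawn from a mixture of two reference word distributions and estimates the mixing weight by maximum likelihood using EM, then resamples whole sentences to get a rough uncertainty interval. All names, the add-one smoothing and the toy texts are assumptions for illustration.

```python
import random
from collections import Counter


def word_probs(docs, vocab, smooth=1.0):
    """Smoothed word distribution of a reference class over a fixed vocabulary."""
    counts = Counter(tok for d in docs for tok in d.lower().split())
    total = sum(counts[w] for w in vocab) + smooth * len(vocab)
    return {w: (counts[w] + smooth) / total for w in vocab}


def affinity(text, p_gov, p_opp, iters=200):
    """Maximum-likelihood mixing weight toward the 'government' pole, via EM."""
    counts = Counter(tok for tok in text.lower().split() if tok in p_gov)
    n = sum(counts.values())
    if n == 0:
        return 0.5  # assumption: no in-vocabulary tokens means no information
    theta = 0.5
    for _ in range(iters):
        # E-step: expected number of tokens attributed to the government pole.
        resp = sum(
            c * theta * p_gov[w] / (theta * p_gov[w] + (1 - theta) * p_opp[w])
            for w, c in counts.items()
        )
        theta = resp / n  # M-step: update the mixing weight
    return theta


def bootstrap_interval(sentences, p_gov, p_opp, reps=200, seed=0):
    """Rough 95% interval from resampling whole sentences with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        affinity(" ".join(rng.choices(sentences, k=len(sentences))), p_gov, p_opp)
        for _ in range(reps)
    )
    return estimates[int(0.025 * reps)], estimates[int(0.975 * reps) - 1]


# Hypothetical reference texts for the two poles and one speech to scale.
gov_refs = ["we support the government and its budget"]
opp_refs = ["we oppose the government and its scandal"]
vocab = {tok for d in gov_refs + opp_refs for tok in d.lower().split()}
p_gov, p_opp = word_probs(gov_refs, vocab), word_probs(opp_refs, vocab)

speech = ["we support the budget", "the scandal is serious", "we support the government"]
print(affinity(" ".join(speech), p_gov, p_opp))
print(bootstrap_interval(speech, p_gov, p_opp))
```

Resampling sentences rather than individual words keeps the within-sentence dependence between words intact, which is the motivation for a block bootstrap.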
This is a fairly classical statistical paper, using a bag-of-words approach to text, with feature selection based on influence statistics, and maximum likelihood to estimate the affinity statistic. A more contemporary method would be to incorporate a model using an artificial neural network, using a continuous bag-of-words input, an embedding layer, and a continuously scaled output layer as the estimand. This could be based on a transformer architecture.
- The goal is valid measurement, not prediction. Because the goal is measurement, some uncertainty accounting is desirable.
- There are no training labels for individual cases, but rather document sets identified with polar opposite classes, to whose affinity each unknown document is measured.
- Because the measurement is a latent trait with no directly verifiable value, validation cannot rely on a regression loss measure (e.g., RMSE) or a classification metric (e.g., F1) computed against known values.
- Robustness, reproducibility, and transparency are important goals in the construction of the “estimator”, since (social) science generally eschews black boxes.
Professor Ken Benoit
Ken Benoit is Director of the Data Science Institute at LSE and Professor of Computational Social Science in the Department of Methodology.
Ken’s current research focuses on computational, quantitative methods for processing large amounts of textual data, mainly political texts and social media. His current interests span the analysis of big data, including social media, and methods of text mining. He has published extensively on applications of measurement and the analysis of text as data in political science, including machine learning methods and text coding through crowd-sourcing, an approach that combines statistical scaling with the qualitative power of thousands of human coders working in tandem on small coding tasks.
Data Learning for more reliable AI models
Dr Rossella Arcucci
Read more about this event here.
This work fits into the context of AI for digital twins (DT). DTs are usually made of two components: a model and some data. When developing a digital twin, many fundamental questions exist, some connected with the data and its reliability and uncertainty, and some to do with dynamic model updating. To combine model and data, we use Data Assimilation (DA). DA is the approximation of the true state of some physical system by combining real-world observations with a dynamic model. DA models have increased in sophistication to better fit application requirements and circumvent implementation issues. Nevertheless, these approaches are incapable of fully overcoming some of their unrealistic assumptions, such as linearity of the systems. Machine Learning (ML) shows great capability in approximating nonlinear systems and extracting meaningful features from high-dimensional data. ML algorithms can assist or replace traditional forecasting methods. However, the data used to train any ML algorithm include numerical, approximation and round-off errors, which are trained into the forecasting model. Integration of ML with DA increases the reliability of prediction by including information in real time and with a physical meaning. This talk introduces Data Learning, a field that integrates Data Assimilation and Machine Learning to overcome limitations in applying these fields to real-world data. We present several Data Learning methods and results for some real-world test cases, though the equations are general and can easily be applied elsewhere.
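As a small illustration of what combining observations with a dynamic model means, the sketch below implements the simplest textbook case: a scalar, linear Kalman-style update. It is not one of the Data Learning methods from the talk, and it relies on exactly the linearity assumption the abstract identifies as a limitation. The function names, the constant-growth model and the numbers are assumptions for the example.

```python
def forecast_step(state, variance, growth=1.0, model_noise=0.0):
    """Propagate the state estimate and its variance through a linear model."""
    return growth * state, growth ** 2 * variance + model_noise


def assimilate(forecast, forecast_var, observation, obs_var):
    """Blend a model forecast with an observation, weighted by uncertainty."""
    gain = forecast_var / (forecast_var + obs_var)  # trust in the observation
    analysis = forecast + gain * (observation - forecast)
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var


# Hypothetical example: a sensor reading corrects a model forecast over time.
state, variance = 20.0, 4.0
for observation in [21.0, 19.5, 20.5]:
    state, variance = forecast_step(state, variance, model_noise=0.5)
    state, variance = assimilate(state, variance, observation, obs_var=1.0)
    print(f"analysis={state:.3f} variance={variance:.3f}")
```

The updated variance is always smaller than both the forecast variance and the observation variance, which is the sense in which the combined estimate is more reliable than either source alone.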
Dr Rossella Arcucci
Rossella Arcucci is a Lecturer in Data Science and Machine Learning at Imperial College London, where she leads the DataLearning Group and is the elected representative of the AI Network of Excellence. She is also an elected member of the World Meteorological Organization (WMO), where she contributes to the World Weather Research Programme.
Rossella has developed models which have been applied to many industries including finance (to estimate optimal parameters of economic models), social science (to merge Twitter and polling data to better estimate the sentiment of people), engineering (to optimise the placement of sensors and reduce the costs), geoscience (to improve accuracy of forecasting) and climate change. With an academic background in mathematics, Rossella completed her PhD in Computational and Computer Science in February 2012 and became a Marie Sklodowska-Curie fellow with the European Commission Research Executive Agency in Brussels in February 2017.