DSI Squared

DSI Squared is a collaborative initiative joining the Data Science Institutes of Imperial College London and the London School of Economics and Political Science (LSE).

When it comes to data science research and its impact, LSE’s strengths in the social sciences naturally complement Imperial’s strengths in science, technology, and medicine.

Our partnership aims to facilitate collaboration between the two institutions. Together, we host networking events, seminars, and a joint small grant scheme that encourages researchers from both institutions to collaborate on projects and grant applications through seed funding.


In an effort to facilitate collaboration, Imperial and LSE have partnered to launch the DSI Squared Research Grant Scheme, open to all researchers affiliated with either institute. Applications are normally considered for funding up to a maximum of £5,000, which could cover research projects or attendance at conferences or workshops.

At least one applicant must be a DSI Affiliate, and at least one co-applicant must be from Imperial College London and one must be from the London School of Economics and Political Science. The lead applicant must be a Research Fellow or above.

Download our application form
Apply for a DSI Squared Research Grant


Our Unsolved Problems in Data Science Seminars are a crucial component of the ongoing collaboration between LSE and ICL Data Science Institutes. These seminars unite LSE's social science strength with Imperial's science, technology, and medicine expertise to address complex data science challenges.

Over the course of an hour, and over a light lunch, researchers present their work, fostering interdisciplinary collaborations and innovative problem-solving. The series, alternating between LSE and ICL, encourages active audience participation, bridging social sciences with computer science and STEM fields.

Whether you are affiliated with LSE or ICL, these seminars provide a platform for collaboration and solutions in data science.

Upcoming Events

Visit our Upcoming Events page to see the latest DSI Squared Events

Past Events


Generative Modelling of Cardiac Anatomy
18 January 2024
Dr Wenjia Bai  

Location: LSE Data Science Institute, COL 1.06, First Floor Columbia House, Houghton St, London WC2A 2AE 
Date: 18 January 2024 
Time: 12:30-13:30 

Two key questions in cardiac image analysis are how to assess the anatomy and motion of the heart from images, and how to understand their association with non-imaging clinical factors such as gender, age, and disease. While the first question can often be addressed by image segmentation and motion tracking algorithms, our capability to model and answer the second question is still limited. Here, we propose a novel conditional generative model to describe the 4D spatio-temporal anatomy of the heart and its interaction with non-imaging clinical factors. The clinical factors are integrated as the conditions of the generative modelling, which allows us to investigate how these factors influence the cardiac anatomy. We evaluate the model's performance on two main tasks: anatomical sequence completion and sequence generation. The model achieves high performance in anatomical sequence completion, comparable to or outperforming other state-of-the-art generative models. In terms of sequence generation, given clinical conditions, the model can generate realistic synthetic 4D sequential anatomies that share similar distributions with the real data.
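The conditioning idea at the core of the talk can be illustrated with a deliberately simple sketch. Everything below is hypothetical (a 1-D feature, invented numbers, a linear-Gaussian model), not the actual 4D anatomy model: fit a conditional distribution of an anatomy feature given a clinical factor, then sample synthetic values under a chosen condition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: a 1-D anatomy feature (e.g. a ventricular
# volume in ml) that depends on a clinical factor (e.g. age in years).
age = rng.uniform(40, 80, size=500)
volume = 150.0 - 0.5 * age + rng.normal(0.0, 5.0, size=500)

# Fit the conditional model p(volume | age) = Normal(intercept + slope*age, sigma^2).
slope, intercept = np.polyfit(age, volume, 1)
sigma = (volume - (intercept + slope * age)).std()

def generate(age_condition, n=1000):
    """Sample synthetic anatomy features conditioned on a clinical factor."""
    mean = intercept + slope * age_condition
    return rng.normal(mean, sigma, size=n)

young, old = generate(45), generate(75)
print(young.mean() > old.mean())  # True: volume shrinks with age in this toy model
```

The real model replaces the linear-Gaussian part with a deep generative network over 4D sequences, but the role of the condition is the same: it shifts the distribution from which synthetic anatomies are drawn.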

The Unsolved Problems:
How can we develop a personalised normative model for human anatomy?
How do we show its clinical usefulness? 

Social and Ethical Implications of Data Scarcity and Data Drift in Large Language Models (LLMs)
5 October 2023
Dr Blake Miller

Location: Data Science Institute, Imperial College London, William Penney Laboratory, South Kensington Campus SW7 2AZ 
Date: 5 October 2023 
Time: 12:30 - 13:30

In this project, I investigate the effects of behavioral changes in data producers/providers due to the swift introduction and widespread adoption of powerful large language model (LLM) tools. I examine the impact of their use on the quality and quantity of data produced on platforms where these models are commonly trained (e.g., Wikipedia, StackOverflow, Quora, etc.). I discuss the potential challenges arising from data drift and domain mismatch resulting from this behavioral shift, specifically concerning safety, content moderation, and the factual accuracy of LLM outputs. This project aims to highlight the extent of behavior change among content creators and emphasizes the potential risks of LLMs becoming less reliable due to scarcity of non-synthetic data. 
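The drift the abstract describes can be quantified by comparing token distributions before and after a behavioural shift. The sketch below is a generic illustration with invented toy sentences and the standard Jensen–Shannon divergence, not code or data from the project itself.

```python
from collections import Counter
from math import log2

def distribution(tokens):
    """Relative frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two token distributions.
    0 means identical; values approach 1 (in bits) as they diverge."""
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}
    def kl(a, b):
        return sum(pa * log2(pa / b[t]) for t, pa in a.items() if pa > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy "before/after" corpora standing in for platform content pre- and
# post-adoption of LLM tools.
before = "how do i sort a list in python".split()
after = "as an ai language model here is how to sort".split()

drift = js_divergence(distribution(before), distribution(after))
print(round(drift, 3))
```

Tracking such a statistic over time on a platform's posts is one simple way to make "data drift" measurable.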

The Unsolved Problem: 
Evaluating LLM output, measuring the impact of data drift and data scarcity on safety, fairness, bias, etc. of LLMs. 


Generative AI and the Knowledge Economy Symposium
24 - 25 May 2023

This event provided an overview of generative AI and large language models (LLMs) and their implications for the knowledge economy and society writ large.

Over two days at Imperial College London and The London School of Economics and Political Science (LSE), participants explored the technical basis, future directions, industry applications, and consequences of generative AI, with particular attention to the knowledge economy and intellectual workers.

The exploration of these topics included why the tools are powerful, whether they are intelligent or just dazzling prediction machines, and what those answers mean for a range of knowledge work. Since we are research-driven educational institutions spanning highly technical subjects and specialisations in the social sciences, special consideration was given to the consequences for academia and education as tools such as ChatGPT challenge traditional approaches to teaching, assessment, and research.

Find more information about the event here.

DSI Squared Networking Event and Research Grants Scheme Launch
10 March 2023

This bilateral networking event connected researchers with an interest in data science from LSE and Imperial College as part of our ‘DSI Squared’ series. All LSE and Imperial data science researchers were welcome to participate in this opportunity to meet and share their current and planned work through short conversations.

When it comes to data science research and its impact, LSE’s strengths in the social sciences naturally complement Imperial’s strengths in science, technology, and medicine. In line with the aim to see these conversations grow into potential research collaborations, the DSI Squared partnership was pleased to announce that funding is now available to support these projects via the exciting new DSI Squared Research Grants Scheme.

The full details of the DSI Squared Small Grants Scheme were outlined at this event, including application deadlines, judging criteria, and the value of potential awards.

The Scheme will consider applications for grants to fund data science research and research-related activities (including dissemination of findings and public outreach) from those based at LSE and Imperial College London.

The need for “smarter” data curation methods
16 January 2023
Dr Ovidiu Șerban

Location: FAW.2.04 (Fawcett House, LSE)
Date: 16 January 2023
Time: 12:30 - 13:30

The Deep Learning community is racing to find the “best” and “largest” model it can train, without thinking more about the data and where it comes from. This phenomenon makes junior data scientists and students at all levels very uneasy with Data Curation, which is still an underrated topic. Throughout this talk, we will look at a few projects, their data problems, and how we addressed the data curation issues to improve the Machine Learning models. In one of the projects, we will forecast COVID-19 cases and excess deaths using data proxies for human activity. In another, we will look at fraudulent activity detection and the issue of generalising datasets for infrequent events. Lastly, we will look at data quality issues with human-annotated data and how to estimate the quality of textual annotations beyond inter-annotator agreements.
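As a baseline for the inter-annotator agreement measures the talk proposes to go beyond, Cohen's kappa for two annotators can be computed in a few lines. The labels below are invented for illustration; the formula is the standard one.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labelling with each
    # annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

ann1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
ann2 = ["pos", "neg", "neg", "neg", "pos", "neg"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.667
```

A kappa near 1 indicates reliable annotations; the harder, unsolved part is estimating quality when such pairwise overlap is unavailable or uninformative.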

The Unsolved Problem:
The unsolved challenge of all these projects is improving data quality by spending little time manually curating and reviewing the data. Are there more intelligent data curation techniques available to accelerate this process?  

Speaker: Dr Ovidiu Șerban

Dr Ovidiu Șerban (ʃerban) is a Research Fellow at the Data Science Institute (DSI), Imperial College London. He is currently the head of the Data Observatory Team at DSI and a member of the Imperial AI Network of Excellence and Imperial Mental Health Research. Ovidiu recently joined the Security, Privacy, Identity and Trust Engagement NetworkPlus (SPRITE+) as an Early Career Researcher (ECR). In addition, he actively collaborates with Refinitiv, an LSEG business, on various projects using Machine Learning and Data Curation for Environmental, Social, and Governance (ESG) issues. He is also a co-PI on the NIHR-funded projects “R-Cancer” and REDASA, where he works on data quality estimation for annotated data on unstructured medical text documents.

Ovidiu holds a joint PhD from INSA de Rouen Normandy (France) and “Babeș-Bolyai” University (Romania) while working at the LITIS Laboratory in France. His current work includes real-time Data Curation, Machine Learning, Natural Language Processing, Large-Scale Visualisation Systems and Human-Computer interaction.

Scaling Text with the Class Affinity Model
25 November 2022
Professor Ken Benoit

Location: Data Observatory, Imperial College London's South Kensington Campus.
Date: 25 November 2022
Time: 12:30 - 13:30

Probabilistic methods for classifying text form a rich tradition in machine learning and natural language processing. For many important problems, however, class prediction is uninteresting because the class is known, and instead the focus shifts to estimating latent quantities related to the text, such as affect or ideology. We focus on one such problem of interest, estimating the ideological positions of 55 Irish legislators in the 1991 Dáil confidence vote, a challenge brought by opposition party leaders against the then-governing Fianna Fáil party in response to corruption scandals. In this application, we clearly observe support or opposition from the known positions of party leaders, but have only information from speeches from which to estimate the relative degree of support from other legislators. To solve this scaling problem and others like it, we develop a text modeling framework that allows actors to take latent positions on a “gray” spectrum between “black” and “white” polar opposites. We are able to validate results from this model by measuring the influences exhibited by individual words, and we are able to quantify the uncertainty in the scaling estimates by using a sentence-level block bootstrap. Applying our method to the Dáil debate, we are able to scale the legislators between extreme pro-government and pro-opposition in a way that reveals nuances in their speeches not captured by their votes or party affiliations.

Other information: 
This method is implemented in the R package quanteda.textmodels. 
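Outside R, the core of the affinity idea can be sketched in a few lines of Python. This is a simplified toy with invented counts, not the package's implementation: estimate smoothed word distributions from the two polar reference classes, then grid-search the mixture weight that best explains an unknown document's word counts.

```python
import numpy as np

def word_probs(counts, alpha=0.5):
    """Smoothed word distribution for a reference class (counts over a shared vocab)."""
    counts = np.asarray(counts, dtype=float) + alpha
    return counts / counts.sum()

def affinity(doc_counts, p_class1, p_class2):
    """Mixture weight theta on class 1 that maximises the document's
    multinomial log-likelihood, assuming its words are drawn from
    theta*p_class1 + (1-theta)*p_class2. Found by grid search."""
    doc_counts = np.asarray(doc_counts, dtype=float)
    thetas = np.linspace(0.0, 1.0, 101)
    loglik = [np.sum(doc_counts * np.log(t * p_class1 + (1 - t) * p_class2))
              for t in thetas]
    return thetas[int(np.argmax(loglik))]

# Toy vocabulary of 4 words; word counts from reference documents for
# the two polar classes (e.g. government vs opposition leaders).
gov = word_probs([30, 10, 5, 5])
opp = word_probs([5, 5, 10, 30])

leaning_gov = affinity([12, 4, 2, 2], gov, opp)   # speech sounding like government
leaning_opp = affinity([2, 2, 4, 12], gov, opp)   # speech sounding like opposition
print(leaning_gov > 0.5 > leaning_opp)  # True
```

The full model adds feature selection via influence statistics and a sentence-level block bootstrap for uncertainty, which this sketch omits.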

Unsolved Problem(s): 
This is a fairly classical statistical paper, using a bag-of-words approach to text, with feature selection based on influence statistics, and maximum likelihood to estimate the affinity statistic. A more contemporary method would be to incorporate a model using an artificial neural network, using a continuous bag-of-words input, an embedding layer, and a continuously scaled output layer as the estimand. This could be based on a transformer architecture.  


  • The goal is valid measurement, not prediction. Because the goal is measurement, some uncertainty accounting is desirable. 
  • There are no training labels for individual cases, but rather document sets identified with polar opposite classes, to whose affinity each unknown document is measured.
  • Because the measurement is a latent trait with no directly verifiable value, validation does not work in the same way that a regression loss measure (e.g., RMSE) or a categorical loss measure (e.g., F1) can be measured. 
  • Robustness, reproducibility, and transparency are important goals in the construction of the “estimator”, since (social) science generally eschews black boxes.

Speaker: Professor Ken Benoit

Ken Benoit is Director of the Data Science Institute at LSE and Professor of Computational Social Science in the Department of Methodology. 

Ken’s current research focuses on computational, quantitative methods for processing large amounts of textual data, mainly political texts and social media. His current interests span the analysis of big data, including social media, and methods of text mining. He has published extensively on applications of measurement and the analysis of text as data in political science, including machine learning methods and text coding through crowd-sourcing, an approach that combines statistical scaling with the qualitative power of thousands of human coders working in tandem on small coding tasks.

Data Learning for more reliable AI models
30 September 2022
Dr Rossella Arcucci

Read more about this event here.

This work fits into the context of AI for digital twins (DT). DTs are usually made of two components: a model and some data. When developing a digital twin, many fundamental questions exist, some connected with the data and its reliability and uncertainty, and some to do with dynamic model updating. To combine model and data, we use Data Assimilation (DA). DA is the approximation of the true state of some physical system by combining real-world observations with a dynamic model. DA models have increased in sophistication to better fit application requirements and circumvent implementation issues. Nevertheless, these approaches are incapable of fully overcoming some of their unrealistic assumptions, such as linearity of the systems. Machine Learning (ML) shows great capability in approximating nonlinear systems and extracting meaningful features from high-dimensional data. ML algorithms can assist or replace traditional forecasting methods. However, the data used to train any ML algorithm include numerical, approximation, and round-off errors, which are trained into the forecasting model. Integration of ML with DA increases the reliability of prediction by including information in real time and with a physical meaning. This talk introduces Data Learning, a field that integrates Data Assimilation and Machine Learning to overcome limitations in applying these fields to real-world data. We present several Data Learning methods and results for some real-world test cases, though the equations are general and can easily be applied elsewhere.
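The model-observation blend at the heart of DA can be illustrated by the scalar optimal-interpolation (Kalman) update. This is a textbook formula with invented numbers, not the speaker's code: the analysis weights the forecast and the observation by the inverse of their error variances.

```python
def assimilate(forecast, var_forecast, observation, var_observation):
    """Scalar data-assimilation (Kalman) update: blend a model forecast
    with a real-world observation, trusting each in proportion to the
    inverse of its error variance."""
    gain = var_forecast / (var_forecast + var_observation)
    analysis = forecast + gain * (observation - forecast)
    var_analysis = (1 - gain) * var_forecast  # analysis is always more certain
    return analysis, var_analysis

# Toy example: the model predicts 20.0 with variance 4.0; a sensor reads
# 22.0 with variance 1.0, so the analysis sits nearer the observation.
analysis, var_analysis = assimilate(20.0, 4.0, 22.0, 1.0)
print(round(analysis, 3), round(var_analysis, 3))  # 21.6 0.8
```

Data Learning keeps this blending step but lets ML supply the nonlinear forecast or correction terms that classical linear DA cannot represent.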

Dr Rossella Arcucci

Dr Rossella Arcucci is a Lecturer in Data Science and Machine Learning at Imperial College London, where she leads the DataLearning Group and is the elected representative of the AI Network of Excellence. She is also an elected member of the World Meteorological Organization (WMO), where she contributes to the World Weather Research Programme.

Rossella has developed models which have been applied to many industries, including finance (to estimate optimal parameters of economic models), social science (to merge Twitter and polling data to better estimate public sentiment), engineering (to optimise the placement of sensors and reduce costs), geoscience (to improve the accuracy of forecasting), and climate change. With an academic background in mathematics, Rossella completed her PhD in Computational and Computer Science in February 2012 and became a Marie Sklodowska-Curie fellow with the European Commission Research Executive Agency in Brussels in February 2017.


DSI Squared 'Speed Dating' - Research Networking Event
14 June 2022

This bilateral LSE Data Science Institute / Imperial College London DSI research networking event was the first of our ‘DSI Squared’ series. 

What happened? 
This research networking event took the form of a ‘speed-dating’ event where researchers from each DSI had a chance to meet, with a time limit for each meeting, before shifting tables to meet the next researcher.

Participants considered their research interests in advance and sent details for a contact card summarising these. At the event, researchers were paired for 4-5 mins which allowed them to introduce themselves and their future research plans, before exchanging contact cards to follow up with later. 

Following the event, all researchers were invited to an informal drinks reception.

What were the aims? 
This event connected researchers from the two DSIs as part of a collaboration between Imperial DSI and LSE DSI, facilitating greater and more regular cooperation under the 'DSI-squared' collaboration. 

This event aimed to discover connections that could lead to shared funding applications under the (separately) proposed seed funding scheme and bridge the gap between data science applied to STEM and to SHAPE subjects. 

Read more about the event here.