Data Science

Data really powers everything that we do.

Research activities in the data science area are concerned with the development of machine learning and computational statistical methods, their theoretical foundations, and applications. Machine learning and computational statistics play an important role in a wide range of applications involving data characterised by variety, high dimension, volume or velocity.

Our research focuses on both theoretical and applied aspects in the broad area of machine learning and statistical computing.

We study machine learning algorithms for solving a variety of learning tasks, including supervised, semi-supervised, unsupervised, and reinforcement learning tasks. A special focus is devoted to fairness of machine learning, optimisation for machine learning, kernel methods, information theory, federated learning, and scalable models and tools for linking massive and distributed multimodal data. Our work on computational statistical methods includes Bayesian inference, functional data analysis, large-scale statistical inference, and non-parametric estimation.

Chengchun's research lies in developing statistical learning methodologies in reinforcement learning and causal inference. With the fast development of new technology, modern datasets often comprise massive numbers of observations and high-dimensional covariates, and are characterized by some degree of heterogeneity. In an era of big and complex data, he is interested in developing computationally efficient algorithms with statistical performance guarantees.

Ieva's research centres around Bayesian inference with a focus on inverse problems and uncertainty quantification in physics-based problems. This involves Gaussian processes for modelling complex spatial and temporal dependencies, and variational inference for scalable posterior estimation in high-dimensional Bayesian models. Applying these techniques to partial differential equation (PDE) based inverse problems and experimental design is of particular interest.

Joshua's research interests involve improving practices in data science and machine learning to reduce the impact of bias, particularly biases associated with social harms and scientific reproducibility. This includes developing methods and software for statistical inference after model selection, and using causality to analyse the fairness and interpretability of algorithms in machine learning and artificial intelligence.

Kostas' research interests revolve around Bayesian machine learning methods such as factor analysis, mixture models, Gaussian processes and sequential methods. The focus is on developing suitable, efficient computational schemes in the presence of data that are potentially partial, noisy and from multiple sources. The resulting computational schemes utilise Markov chain Monte Carlo and sequential Monte Carlo techniques to facilitate tasks such as prediction, model choice and averaging, parameter estimation and portfolio optimisation.

Marcos' research areas include data linkage methods and tools for massive and/or multimodal datasets, design and validation of machine learning/artificial intelligence models for healthcare and socioeconomic data, and federated learning models. On the educational side, his research comprises the design of teaching materials and practices accounting for interdisciplinarity and data decolonisation.

Milan's research interests are in developing new methodologies in the areas of machine learning and statistical inference. This involves developing new algorithms and establishing their theoretical performance guarantees to enable the design of efficient intelligent systems. He has worked in the areas of scalable optimisation methods for machine learning, multi-armed bandits, multi-agent systems, algorithms and uncertainty, and network system control and optimisation.

Tengyao is broadly interested in the area of high-dimensional statistics. His research aims to develop computationally efficient procedures for high-dimensional problems, while at the same time understanding the potential statistical limitations imposed by computational constraints.

Yining's research areas include change-point detection, time series, nonparametric statistics (in particular, shape constraint estimation) and computing.

Zoltan's research interests are in kernel methods, information theoretical estimators, scalable computation and their interaction, with a focus on fundamental mathematical questions which are often inspired by and intertwined with applications. The key properties of kernels motivating his research are that (i) they are able to capture the similarity of a wide range of data types, (ii) the associated reproducing kernel Hilbert space (RKHS) is quite flexible and can be rich enough, for instance, to encode probability distributions without loss of information, (iii) they nonetheless remain computationally tractable, (iv) the Hilbert structure of RKHSs facilitates their statistical analysis, and (v) they extend naturally to the vector-valued case, encoding the dependency among output coordinates.

Machine learning and computational statistical methods that we study have applications in a variety of domains. We work on such applications in collaboration with various academic and industrial partners, for example, in the areas of online platforms, healthcare systems, and finance. Particular applications include methods for predicting popularity of social media content, recommender systems in online platforms, standardising surgical assessments, dynamic portfolio allocation, interest rate forecasting, and forecasting of oil prices.

The motivation behind Chengchun's work stems from real-world applications. In medicine, applying reinforcement learning algorithms could assist patients in improving their health status. In ride-sharing platforms, applying RL algorithms could increase their revenue and customer satisfaction. His research is motivated by several key issues in these applications.

Ieva's work includes interdisciplinary projects in climate science, particularly on experimental design for ice sheet modelling and analysis of climate simulator data. She works in collaboration with the British Antarctic Survey and the Institute of Computing for Climate Science at the University of Cambridge.

Kostas has mostly engaged with financial applications, working on time series consisting of bonds, options, asset prices and volatility indices. He has also worked on infectious diseases, exploring the use of stochastic epidemic models.

Marcos' recent applications include i) design of a 3-tier alert-early system for outbreaks with pandemic potential (AESOP); ii) design of a Social Inequality Index (IDS-COVID-19) for measuring the effects of the COVID-19 pandemic on vulnerable communities; iii) design of artificial intelligence models for estimating underreporting of COVID-19 cases and deaths; iv) a framework to evaluate gait assessment algorithms applied to Parkinson’s disease; v) the Early Childhood Friendly Municipal Index (IMAPI), a multi-level index based on the Nurturing Care Framework to classify Brazilian municipalities regarding their support for early childhood development; and vi) bespoke data linkage tools to build the 100 Million Cohort, a longitudinal, multimodal database aggregating healthcare and socioeconomic data from over 130 million individuals.

Milan's research has found applications in several different areas, including online platforms, recommender systems, computer network systems, and systems for machine learning.

Tengyao has worked on applications including medical statistics, financial data analysis and statistical learning-assisted material discovery.

Yining has worked on applications in several areas, including medical statistics, financial time series data, and insurance.

Zoltan has worked on safety-critical learning, style transfer, shape-constrained prediction, hypothesis testing, functional output regression, distribution regression, and estimation of divergence and independence measures.

Ivan Ustyuzhaninov, Ieva Kazlauskaite, Carl Henrik Ek and Neill Campbell (2020). Monotonic Gaussian Process Flows. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, 108:3057-3067.

Jan Povala, Ieva Kazlauskaite, Eky Febrianto, Fehmi Cirak and Mark Girolami (2022). Variational Bayesian approximation of inverse problems using sparse precision matrices. Computer Methods in Applied Mechanics and Engineering, 393.

Arnaud Vadeboncoeur, Ömer Deniz Akyildiz, Ieva Kazlauskaite, Mark Girolami and Fehmi Cirak (2023). Fully probabilistic deep models for forward and inverse problems in parametric PDEs, Journal of Computational Physics, Volume 491.

T. Dubiel-Teleszynski, K. Kalogeropoulos, N. Karouzakis (2022) Sequential Learning and Economic Benefits of Affine Term Structure Models. To appear in Management Science.

A. Beskos, J. Dureau and K. Kalogeropoulos (2015) Bayesian inference for partially observed stochastic differential equations driven by fractional Brownian motion. Biometrika, 102 (4). pp. 809-827. http://eprints.lse.ac.uk/64806/

Shi, Chengchun, Wang, Xiaoyu, Luo, Shikai, Zhu, Hongtu, Ye, Jieping, & Song, Rui (2023). Dynamic causal effects evaluation in A/B testing with a reinforcement learning framework. Journal of the American Statistical Association, 118 (543), 2059-2071.

Shi, Chengchun, Uehara, Masatoshi, Huang, Jiawei, & Jiang, Nan (2022). A minimax learning approach to off-policy evaluation in confounded partially observable Markov decision processes. International Conference on Machine Learning (ICML), pages 20057-20094.

Pierre-Cyril Aubin-Frankowski, Zoltan Szabo (2022). Handling Hard Affine SDP Shape Constraints in RKHSs. Journal of Machine Learning Research, 23 (297):1-54.

Patric Bonnier, Harald Oberhauser, Zoltan Szabo (2023). Kernelized Cumulants: Beyond Kernel Mean Embeddings. Advances in Neural Information Processing Systems (NeurIPS), pages 11049-11074.

Kim, Jung-Hun, Vojnovic, Milan and Yun, Se-Young (2021) Rotting infinitely many-armed bandits. Proceedings of the 39th International Conference on Machine Learning. Journal of Machine Learning Research, pp. 11229-11254.

Lee, Dabeen and Vojnovic, Milan (2021) Scheduling jobs with stochastic holding costs. In: Ranzato, Marc'Aurelio, Beygelzimer, Alina, Dauphin, Yann, Liang, Percy S. and Wortman Vaughan, Jenn, (eds.) Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021. Advances in Neural Information Processing Systems. Neural Information Processing Systems Foundation, pp. 19375-19384.

Gao, Fengnan and Wang, Tengyao (2022) Two-sample testing of high-dimensional linear regression coefficients via complementary sketching. Annals of Statistics.

Follain, Bertille, Wang, Tengyao and Samworth, Richard J. (2022) High-dimensional changepoint estimation with heterogeneous missingness. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 84 (3). 1023 - 1055.

Calais-Ferreira, Lucas, Barreto, Marcos, Mendonça, Everton, Dite, Gillian S, Hickey, Martha, Ferreira, Paulo H, Scurrah, Katrina J and Hopper, John L (2022) Birthweight, gestational age and familial confounding in sex differences in infant mortality: a matched co-twin control study of Brazilian male-female twin pairs identified by population data linkage. International Journal of Epidemiology, 51 (5). 1502 - 1510.

Mauricio L Barreto, Maria Yury Ichihara, Julia M Pescarini, M Sanni Ali, Gabriela L Borges, Rosemeire L Fiaccone, Rita de Cássia Ribeiro-Silva, Carlos A Teles, Daniela Almeida, Samila Sena, Roberto P Carreiro, Liliana Cabral, Bethania A Almeida, George C G Barbosa, Robespierre Pita, Marcos E Barreto, Andre A F Mendes, Dandara O Ramos, Elizabeth B Brickley, Nivea Bispo, Daiane B Machado, Enny S Paixao, Laura C Rodrigues, Liam Smeeth. Cohort Profile: The 100 Million Brazilian Cohort. International Journal of Epidemiology, Volume 51, Issue 2, April 2022, Pages e27–e38.

Chen, Yining (2020) Jump or kink: note on super-efficiency in segmented linear regression break-point estimation. Biometrika. ISSN 0006-3444.

Feng, Oliver Y., Chen, Yining, Han, Qiyang, Carroll, Raymond J and Samworth, Richard J. (2022) Nonparametric, tuning-free estimation of S-shaped functions. Journal of the Royal Statistical Society. Series B: Statistical Methodology. ISSN 1369-7412.

Mona Azadkia - Assistant Professor

Research interests: Non-parametric statistics, causal inference, high-dimensional statistics.

Marcos Barreto - Assistant Professor (Education)

Research interests: Big data linkage & analytics, artificial intelligence applied to healthcare and socioeconomic data, federated learning models, data science teaching and assessment.

Yining Chen - Associate Professor

Research interests: Change-point detection, nonparametric statistics, shape-constrained estimation, computing.

Kostas Kalogeropoulos - Associate Professor

Research interests: Bayesian inference, Gaussian processes, latent stochastic processes, sequential learning, stochastic epidemic modelling, volatility estimation, bond risk premia.

Ieva Kazlauskaitė - Assistant Professor

Research interests: Probabilistic machine learning, Bayesian inference, Gaussian processes, variational inference, inverse problems.

Joshua Loftus - Assistant Professor

Research interests: High-dimensional inference, algorithmic fairness, data science.

Chengchun Shi - Associate Professor

Research interests: Reinforcement learning, causal inference, statistical inference.

Zoltan Szabo - Professor

Research interests: Statistical machine learning, information theoretical estimators, kernel methods, scalable computation.

Milan Vojnovic - Professor

Research interests: Algorithms, decision making, machine learning, optimisation, statistical inference.

Tengyao Wang - Professor

Research interests: High-dimensional statistics; changepoint analysis; dimension reduction; statistical-computational trade-offs.

Sakina Hansen

Research interests: Fair machine learning, explainability, equitable data science, philosophy and ethics of machine learning.

Ziqing Ho

Research interests: Non-parametric regression, high-dimensional statistics, and machine learning.

Liyuan Hu

Research interests: Reinforcement learning and statistical inference.

Xinhui Liu

Research interests: Bayesian statistics, machine learning, and social statistics, as well as their applications in social sciences.

Tao Ma

Research interests: Reinforcement learning, decision science and strategies, causal inference, optimization, applications in economics and finance.

Pingfan Su

Research interests: Reinforcement learning, causal inference, generative AI and their applications in finance.

Trevor Wrobleski

Research interests: Operations research, high-dimensional variable selection, computational efficiency optimization, model averaging, and spatio-temporal modeling.

Xuzhi Yang

Research interests: Optimal transport theory and its applications, robust statistics, theoretical machine learning, diffusion models.

Kai Ye

Research interests: Offline reinforcement learning, confounded partially observable Markov decision processes (POMDPs), and high-dimensional statistics.