Data Science

Data really powers everything that we do.

Research activities in the data science area are concerned with the development of machine learning and computational statistical methods, their theoretical foundations, and applications. Machine learning and computational statistics play an important role in a wide range of applications involving data characterised by variety, high dimensionality, volume or velocity.

Research areas

Our research focuses on both theoretical and applied aspects in the broad area of machine learning and statistical computing.  

We study machine learning algorithms for solving a variety of learning tasks, including supervised, semi-supervised, unsupervised, and reinforcement learning tasks. A special focus is devoted to fairness of machine learning, optimisation for machine learning, kernel methods, information theory, federated learning, and scalable models and tools for linking massive and distributed multimodal data. Our work on computational statistical methods includes Bayesian inference, functional data analysis, large-scale statistical inference, and non-parametric estimation.

Marcos' research areas include data linkage methods and tools for massive and/or multimodal datasets, design and validation of machine learning/artificial intelligence models for healthcare and socioeconomic data, and federated learning models. On the educational side, his research comprises the design of teaching materials and practices accounting for interdisciplinarity and data decolonisation.

Yining's research areas include change-point detection, time series, nonparametric statistics (in particular, shape-constrained estimation) and computing.

Kostas' research interests revolve around Bayesian machine learning methods such as factor analysis, mixture models, Gaussian processes and sequential methods. The focus is on developing efficient computational schemes for data that are potentially partial, noisy and drawn from multiple sources. These computational schemes utilise Markov chain Monte Carlo and sequential Monte Carlo techniques to facilitate tasks such as prediction, model choice and averaging, parameter estimation and portfolio optimisation.

Chengchun's research lies in developing statistical learning methodologies in reinforcement learning and causal inference. With the fast development of new technology, modern datasets often comprise massive numbers of observations and high-dimensional covariates, and are characterised by some degree of heterogeneity. In an era of big and complex data, he is interested in developing computationally efficient algorithms with statistical performance guarantees.

Zoltan's research interests are in kernel methods, information theoretical estimators, scalable computation and their interaction, with a focus on fundamental mathematical questions that are often inspired by and intertwined with applications. The key properties of kernels motivating his research are that (i) they are able to capture the similarity of a wide range of data types, (ii) the associated reproducing kernel Hilbert space (RKHS) is quite flexible and can be rich enough, for instance, to encode probability distributions without loss of information, (iii) they nevertheless remain computationally tractable, (iv) the Hilbert structure of RKHSs facilitates their statistical analysis, and (v) they extend naturally to the vector-valued case, encoding the dependency among output coordinates.

Milan's research interests are in developing new methodologies in the areas of machine learning and statistical inference. This involves developing new algorithms and establishing their theoretical performance guarantees to enable the design of efficient intelligent systems. He has worked in the areas of scalable optimisation methods for machine learning, multi-armed bandits, multi-agent systems, algorithms and uncertainty, and network system control and optimisation.

Tengyao is broadly interested in the area of high-dimensional statistics. His research aims to develop computationally efficient procedures for high-dimensional problems, while at the same time understanding the potential statistical limitations imposed by computational constraints. 


The machine learning and computational statistical methods that we study have applications in a variety of domains. We work on such applications in collaboration with various academic and industrial partners, for example in the areas of online platforms, healthcare systems, and finance. Particular applications include methods for predicting the popularity of social media content, recommender systems in online platforms, standardising surgical assessments, dynamic portfolio allocation, interest rate forecasting, and forecasting of oil prices.

Marcos' recent applications include i) design of a 3-tier alert-early system for outbreaks with pandemic potential (AESOP); ii) design of a Social Inequality Index (IDS-COVID-19) for measuring the effects of the COVID-19 pandemic on vulnerable communities; iii) design of artificial intelligence models for estimating underreporting of COVID-19 cases and deaths; iv) a framework to evaluate gait assessment algorithms applied to Parkinson’s disease; v) the Early Childhood Friendly Municipal Index (IMAPI), a multi-level index based on the Nurturing Care Framework to classify Brazilian municipalities regarding their support for early childhood development; and vi) bespoke data linkage tools to build the 100 Million Cohort, a longitudinal, multimodal database aggregating healthcare and socioeconomic data from over 130 million individuals.

Yining has worked on applications in several areas, including medical statistics, financial time series data, and insurance.

Kostas has mostly engaged with financial applications, working on time series consisting of bonds, options, asset prices and volatility indices. He has also worked on infectious diseases, exploring the use of stochastic epidemic models.

Chengchun's work is motivated by real-world applications. In medicine, applying reinforcement learning (RL) algorithms could help patients improve their health status. On ride-sharing platforms, applying RL algorithms could increase revenue and customer satisfaction. His research addresses several key issues arising in these applications.

Zoltan has worked on safety-critical learning, style transfer, shape-constrained prediction, hypothesis testing, functional output regression, distribution regression, and estimation of divergence and independence measures.

Milan's research has found applications in several different areas, including online platforms, recommender systems, computer network systems, and systems for machine learning.

Tengyao has worked on applications including medical statistics, financial data analysis and statistical learning-assisted material discovery.  

Selected publications 

T. Dubiel-Teleszynski, K. Kalogeropoulos, N. Karouzakis (2022) Sequential Learning and Economic Benefits of Affine Term Structure Models. To appear in Management Science.

A. Beskos, J. Dureau and K. Kalogeropoulos (2015) Bayesian inference for partially observed stochastic differential equations driven by fractional Brownian motion. Biometrika, 102 (4). pp. 809-827.

Shi, Chengchun, Wang, Xiaoyu, Luo, Shikai, Zhu, Hongtu, Ye, Jieping and Song, Rui (2022) Dynamic causal effects evaluation in A/B testing with a reinforcement learning framework. Journal of the American Statistical Association. 1 - 13. 

Shi, Chengchun, Uehara, Masatoshi, Huang, Jiawei and Jiang, Nan (2022) A minimax learning approach to off-policy evaluation in confounded Partially Observable Markov Decision Processes. In: Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. (In Press)

Aubin-Frankowski, Pierre-Cyril and Szabo, Zoltan (2022) Handling hard affine SDP shape constraints in RKHSs.

Lambert, Alex, Bouche, Dimitri, Szabo, Zoltan and d'Alché-Buc, Florence (2022) Functional output regression with infimal convolution: exploring the Huber and ε-insensitive Losses. In: International Conference on Machine Learning, 2022-07-17 - 2022-07-23, Baltimore, MD, United States.

Kim, Jung-Hun, Vojnovic, Milan and Yun, Se-Young (2021) Rotting infinitely many-armed bandits. In: Proceedings of the 39th International Conference on Machine Learning. Journal of Machine Learning Research, pp. 11229-11254.

Lee, Dabeen and Vojnovic, Milan (2021) Scheduling jobs with stochastic holding costs. In: Ranzato, Marc'Aurelio, Beygelzimer, Alina, Dauphin, Yann, Liang, Percy S. and Wortman Vaughan, Jenn, (eds.) Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021. Advances in Neural Information Processing Systems. Neural Information Processing Systems Foundation, pp. 19375-19384.

Gao, Fengnan and Wang, Tengyao (2022) Two-sample testing of high-dimensional linear regression coefficients via complementary sketching. Annals of Statistics.

Follain, Bertille, Wang, Tengyao and Samworth, Richard J. (2022) High-dimensional changepoint estimation with heterogeneous missingness. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 84 (3). 1023 - 1055.

Calais-Ferreira, Lucas, Barreto, Marcos, Mendonça, Everton, Dite, Gillian S, Hickey, Martha, Ferreira, Paulo H, Scurrah, Katrina J and Hopper, John L (2022) Birthweight, gestational age and familial confounding in sex differences in infant mortality: a matched co-twin control study of Brazilian male-female twin pairs identified by population data linkage. International Journal of Epidemiology, 51 (5). 1502 - 1510.

Mauricio L Barreto, Maria Yury Ichihara, Julia M Pescarini, M Sanni Ali, Gabriela L Borges, Rosemeire L Fiaccone, Rita de Cássia Ribeiro-Silva, Carlos A Teles, Daniela Almeida, Samila Sena, Roberto P Carreiro, Liliana Cabral, Bethania A Almeida, George C G Barbosa, Robespierre Pita, Marcos E Barreto, Andre A F Mendes, Dandara O Ramos, Elizabeth B Brickley, Nivea Bispo, Daiane B Machado, Enny S Paixao, Laura C Rodrigues, Liam Smeeth. Cohort Profile: The 100 Million Brazilian Cohort. International Journal of Epidemiology, Volume 51, Issue 2, April 2022, Pages e27–e38.

Chen, Yining (2020) Jump or kink: note on super-efficiency in segmented linear regression break-point estimation. Biometrika. ISSN 0006-3444. 

Feng, Oliver Y., Chen, Yining, Han, Qiyang, Carroll, Raymond J and Samworth, Richard J. (2022) Nonparametric, tuning-free estimation of S-shaped functions. Journal of the Royal Statistical Society. Series B: Statistical Methodology. ISSN 1369-7412. 

Academic and research staff


Mona Azadkia - Assistant Professor 

Research interests: Non-parametric statistics, causal inference, high-dimensional statistics.

Marcos Barreto - Assistant Professorial Lecturer

Research interests: Big data linkage & analytics, artificial intelligence applied to healthcare and socioeconomic data, federated learning models, data science teaching and assessment.


Yining Chen - Associate Professor

Research interests: Change-point detection, nonparametric statistics, shape-constrained estimation, computing.

Kostas Kalogeropoulos - Associate Professor

Research interests: Bayesian inference, Gaussian processes, latent stochastic processes, sequential learning, stochastic epidemic modelling, volatility estimation, bond risk premia.

Joshua Loftus - Assistant Professor 

Research interests: High-dimensional Inference, Algorithmic Fairness, Data Science.

Xinghao Qiao - Associate Professor

Chengchun Shi - Assistant Professor

Research interests: Reinforcement learning, causal inference, statistical inference.

Zoltan Szabo - Professor

Research interests: Statistical machine learning, information theoretical estimators, kernel methods, scalable computation.

Milan Vojnovic - Professor

Research interests: Algorithms, decision making, machine learning, optimisation, statistical inference.

Tengyao Wang - Associate Professor

Research interests: High-dimensional statistics; changepoint analysis; dimension reduction; statistical-computational trade-offs.

Research students

Sakina Hansen

Research interests: Fair machine learning, explainability, equitable data science, philosophy and ethics of machine learning.

Liyuan Hu

Research interests: Reinforcement learning and statistical inference.

Tao Ma

Research interests: Reinforcement learning, decision science and strategies, causal inference, optimization, applications in economics and finance.

Pingfan Su

Research interests: Reinforcement learning, causal inference, generative AI and their applications in finance.