Istock4

Statistics and Data Science Seminar Series

The Department of Statistics hosts the Statistics and Data Science Seminar Series (SDSS) throughout the year and usually taking place on Friday afternoons at 2pm. Topics include statistics, machine learning, computer science and their interface, both from a theoretical and applied point of view. We invite both internal and external speakers to present their latest cutting edge research. All staff and students are welcome to attend our seminars!

Autumn Term 2023 

Friday 13 October 2023, 2-3pm -  Bharath Sriperumbudur (Pennsylvania State University)

 Bharathimage

This event will take place in the Data Science Institute (COL.1.06).

Title: Gromov-Wasserstein Distances: Entropic Regularization, Duality, and Sample Complexity

Abstract: The Gromov-Wasserstein (GW) distance, rooted in optimal transport (OT) theory, quantifies dissimilarity between metric measure spaces and provides a framework for aligning heterogeneous datasets. While computational aspects of the GW problem have been widely studied, a duality theory and fundamental statistical questions concerning empirical convergence rates remained obscure. In this talk, I present our recent work that closes these gaps for the quadratic GW distance over Euclidean spaces of different dimensions $d_x$ and $d_y$. We treat both the standard and the entropically regularized GW distance, and derive dual forms that represent them in terms of the well-understood OT and entropic OT (EOT) problems, respectively. This enables employing proof techniques from statistical OT based on regularity analysis of dual potentials and empirical process theory, using which we establish the first GW empirical convergence rates. The derived two-sample rates are $n^{-2/\max\{\min\{d_x,d_y\},4\}}$ (up to a log factor when $\min\{d_x,d_y\}=4$) for standard GW and $n^{-1/2}$ for EGW, which matches the corresponding rates for standard and entropic OT. The parametric rate for EGW is evidently optimal, while for standard GW we provide matching lower bounds, which establish sharpness of the derived rates. We also study stability of EGW in the entropic regularization parameter and prove approximation and continuity results for the cost and optimal couplings. Our results serve as a first step towards a comprehensive statistical theory as well as computational advancements for GW distances, based on the discovered dual formulations.Joint work with Zhengxin Zhang (Cornell University, Applied Math), Ziv Goldfeld (Cornell University, ECE), and Youssef Mroueh (IBM Research, NYC)

Biography: Bharath Sriperumbudur is currently an Associate Professor of Statistics at Pennsylvania State University. He held a postdoctoral stint at Gatsby Computational Neuroscience Unit, University College London, and was a Research Fellow in the Statistical Laboratory, University of Cambridge. He received his Ph. D. in Electrical Engineering from University of California, San Diego. He is the recipient of the prestigious NSF CAREER Award. He is currently serving as an Action Editor for the Journal of Machine Learning Research and has served as an area chair for many machine learning conferences such as NeurIPS, COLT, ALT, ICML and AISTATS. His current research interests are in statistical learning theory, non-parametric statistics, RKHS theory and methods, topological data analysis, optimal transport and gradient flows.

Take a look at Bharath's slides (PDF).

Friday 20 October 2023, 2-3pm - Florian Kalinke (Karlsruhe Institute of Technology)

FlorianKalinke

This event will take place in B.07, Sir Arthur Lewis Building (SAL.B.07).

Title: Nyström M-Hilbert-Schmidt Independence Criterion

Abstract: Measuring the joint independence of random variables is a central problem in data science, with Hilbert-Schmidt independence criterion (HSIC) being one of the most popular measures in the area. While closed-form estimators for HSIC exist, they scale quadratically with the number of samples, which proves to be prohibitive for large-scale applications. Accordingly, speed-ups for the estimation of HSIC exist, but they are limited to two random variables. In this talk, I will present a Nyström-based accelerated estimation of HSIC that allows more than two random variables, accompanied by consistency and mini-max optimality guarantees. I will demonstrate the applicability of the approach for the dependency testing of media annotations and causal discovery.

Biography: Florian Kalinke is a third-year Ph.D. student in computer science at the Karlsruhe Institute of Technology (KIT) at the "Institute for Program Structures and Data Organization," advised by Klemens Böhm. His research focuses on processing streaming data, online change point detection, estimation of independence measures, and kernel techniques, with a particular focus on deriving efficient algorithms and the analysis of their performance and runtime trade-offs, and applying these techniques in energy management.

Take a look at Florian's slides (PDF).

Friday 27 October 2023, 2-3pm - Claire Monteleoni (INRIA Paris & University of Colorado Boulder)

ClaireMonteleoni

This event will take place in the Data Science Institute (COL.1.06).

Title: Machine Learning Research for Climate Change and Environmental Sustainability

Abstract: Despite the scientific consensus on climate change, drastic uncertainties remain. Crucial questions about regional climate trends, changes in extreme events, such as heat waves and mega-storms, and understanding how climate varied in the distant past, must be answered in order to improve predictions, assess impacts and vulnerability, and inform mitigation and sustainable adaptation strategies. Machine learning can help answer such questions and shed light on climate change. I will give an overview of our climate informatics research, focusing on challenges in learning from spatiotemporal data, along with semi- and unsupervised deep learning approaches to studying rare and extreme events, and precipitation and temperature downscaling

Biography: Claire Monteleoni is a Choose France Chair in AI and a Research Director at INRIA Paris, a Professor in the Department of Computer Science at the University of Colorado Boulder, and the founding Editor in Chief of Environmental Data Science, a Cambridge University Press journal, launched in December 2020. She joined INRIA in 2023 and has previously held positions at University of Paris-Saclay, CNRS, George Washington University, and Columbia University. She completed her PhD and Masters in Computer Science at MIT and was a postdoc at UC San Diego. She holds a Bachelor’s in Earth and Planetary Sciences from Harvard. Her research on machine learning for the study of climate change helped launch the interdisciplinary field of Climate Informatics. She co-founded the International Conference on Climate Informatics, which turned 12 years old in 2023, and has attracted climate scientists and data scientists from over 20 countries and 30 U.S. states. She gave an invited tutorial: Climate Change: Challenges for Machine Learning, at NeurIPS 2014. She currently serves on the NSF Advisory Committee for Environmental Research and Education.

Take a look at Claire's slides (PDF).

Friday 10 November 2023, 2-3pm - Emanuele Giorgi (Lancaster University)

emanuelegiorgi

This event will take place in the Data Science Institute (COL.1.06).

Title: Model-based geostatistical inference with low prevalence data: a case study on lymphatic filariasis

Abstract: Model-based geostatistics is a branch of spatial statistics that is used to draw inferences on a sptially continuous surface using data collected at a discrete set of locations.In this presentation, I will first give an overview of the concepts underpinning MBG and its application to public health problems where the goal is to predict disease risk within an geographical area of interests.I will then introduce the problem of analysing low-prevalence disease data were, due to the large number of zero reported cases, the estimation of MBG models can be quite challenging. This problem often arises in the context of diseases approaching elimination and where MBG has been used to assess the elimination status of areal units. Through a case study on lymphatic filariasis, a mosquito-borne parasitic disease that causes chronic and disabling swelling of the lymphatic system, we will present a spatio-temporal MBG model that uses historical data to draw inferences on the prevalence surface in a post-elimination setting. We will also illustrate how this approach can be used to design a surveillance system to quantify the likelihood of the resurgence of the disease in a post-elimination phase.

Biography: Emanuele Giorgi is a Senior Lecturer in Biostatistics at Lancaster University. He leads the Centre for Health for Health Informatics Computing and Statistics (CHICAS) which is a designated WHO Collaborating Centre on Geostatistical Methods for Neglected Tropical Disease Research.His primary research interests are in the development of spatial and spatio-temporal statistical methodologies and their implementation into open-source statistical software. His application domain is global health, with a specific focus on tropical diseases in low and middle-income countries.

Take a look at Emanuele's slides (PDF).

Monday 20 November 2023, 2-3pm - Zhigang Yao (National University of Singapore

Yao

This event will take place in the Data Science Institute (COL.1.06).

Title: Manifold Fitting -An Invitation to Statistics

Abstract: This manifold fitting problem can go back to H. Whitney’s work in the early 1930s (Whitney (1992)), and finally has been answered in recent years by C. Fefferman’s works (Fefferman, 2006, 2005). The solution to the Whitney extension problem leads to new insights for data interpolation and inspires the formulation of the Geometric Whitney Problems (Fefferman et al. (2020, 2021a)): Assume that we are given a set $Y \subset \mathbb{R}^D$. When can we construct a smooth $d$-dimensional submanifold $\widehat{M} \subset \mathbb{R}^D$ to approximate $Y$, and how well can $\widehat{M}$ estimate $Y$ in terms of distance and smoothness? To address these problems, various mathematical approaches have been proposed (see Fefferman et al. (2016, 2018, 2021b)). However, many of these methods rely on restrictive assumptions, making extending them to efficient and workable algorithms challenging. As the manifold hypothesis (non-Euclidean structure exploration) continues to be a foundational element in statistics, the manifold fitting Problem, merits further exploration and discussion within the modern statistical community. The talk will be partially based on recent works of Yao and Xia (2019) and Yao, Su, Li and Yau (2022) and Yao, Su, Yau (2023)

Biography: Zhigang Yao is a tenured associate professor in the Department of Statistics and Data Science at the National University of Singapore. He has been a visiting faculty at the Center for Mathematical Sciences and Applications at Harvard University since 2022. He holds a visiting professorship at YMSC at Tsinghua University. His primary research interests lie in statistical inference for complex data. In recent years, his focus has shifted towards Non-Euclidean Statistics and low-dimensional manifold learning.Yao is committed to promoting the new field of interaction between geometry and statistics. In recent years, Yao and his collaborators have proposed methods and theories that redefine traditional PCA on Riemannian manifolds including principal flows/sub-manifolds and principal boundaries, as well as new manifold learning theories. These methods aim to address deficiencies in traditional statistical methods and theories by taking into account the geometry of the data.

Friday 24 November 2023, 2-3pm - Daniela Witten (University of Washington)

DanielaResizedWitten_2021This event will take place in the Data Science Institute (COL.1.06).

Title: Data thinning and its applications

Abstract: We propose data thinning, a new approach for splitting an observation from a known distributional family with unknown parameter(s) into two or more independent parts that sum to yield the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This proposal is very general, and can be applied to a broad class of distributions within the natural exponential family, including the Gaussian, Poisson, negative binomial, Gamma, and binomial distributions, among others. Furthermore, we generalize data thinning to enable splitting an observation into two or more parts that can be combined to yield the original observation using an operation other than addition; this enables the application of data thinning far beyond the natural exponential family. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the "usual" approach of cross-validation via sample splitting, especially in unsupervised settings in which the latter is not applicable. We will present an application of data thinning to single-cell RNA-sequencing data, in a setting where sample splitting is not applicable. This is joint work with Anna Neufeld (Fred Hutch), Ameer Dharamshi (University of Washington), Lucy Gao (University of British Columbia), and Jacob Bien (University of Southern California)

Biography: TBC 

Friday 1 December 2023, 2-3pm - Giulia Martini (United Nations World Food Programme)

Giulia Martini 200x200

This event will take place in the Data Science Institute (COL.1.06).

Title: Food security monitoring, from real time data to machine learning

Abstract: Hunger Monitoring harnesses mobile technology, artificial intelligence and data analytics to establish remote, near real-time food security monitoring systems across countries to enable WFP, governments, partners and the broad humanitarian community to monitor the food security situation daily, capture problems in real-time in the event of a crisis and provide the necessary information for early action and mitigation.
By bringing food security analysts with expertise in data science, engineering and predictive modelling together, we collect daily data on hunger and its drivers and observe how situations may shift from one day to the next to inform effective response. To ensure that near real-time data is available as a global public good, we developed the HungerMapLIVE, WFP’s global hunger monitoring system so that anyone, anywhere can track hunger as it is now in over 90 countries. Survey data is used to fit machine learning models based on the XGBoost algorithm to make daily estimates of the food insecurity levels with first administrative area resolution.

Biography: Giulia is a data scientist who currently works at the United Nations World Food Program (WFP) in the Hunger Monitoring Unit where she leverages her expertise to develop and implement machine learning models. These models are used to estimate food security indicators in regions where primary data is unavailable and to provide longer-term forecasting. Before joining WFP in 2020, she worked as a researcher at the Netherlands Organisation for Applied Scientific Research (TNO) and received a Master's degree in civil engineering from Delft University of Technology. Giulia's primary objective is to make technology accessible and user-friendly to support and improve operations in the humanitarian sector.

Friday 8 December 2023, 2-3pm - Maria-Pia Victoria Feser (University of Geneva)

 MariaPia

This event will take place in B.07, Sir Arthur Lewis Building (SAL.B.07).

Title: Fast M-Estimation of Generalized Linear Latent Variable Models in High Dimensions

Abstract: Dimension reduction for high dimensional data is an important and challenging task, relevant to both machine learning and statistical applications. Generalized Linear Latent Variable Models (GLLVMs) provide a probabilistic alternative to matrix factorization when the data are of mixed types, whether discrete, continuous, or a mixture of both. They achieve the reduction of dimensionality by mapping the correlated multivariate data to so-called latent variables, defined in a lower-dimensional space. The benefit of GLLVMs is twofold: the latent variables can be estimated and used as features to be embedded in another model, and the model parameters themselves are interpretable and provide meaningful indications on the very structure of the data. Moreover, GLLVM can naturally be extended to dynamic processes such as those used to model longitudinal data. However, with a likelihood based approach, GLLVM’s estimation represents a tremendous challenge for even moderately large dimensions, essentially due to the multiple integrals involved in the likelihood function. Numerous methods based on approximations of this latter have been proposed: Laplace approximation, adaptive quadrature, or, recently, extended variational approximation. For GLLVMs, however, these methods do not scale well to high dimensions, and they may also introduce a large bias in the estimates. In this paper, we consider an alternative route, which consists in proposing an alternative estimator, based on drastically simplified estimating equations, complemented with a numerically efficient bias reduction methods in order to recover a consistent estimator for the GLLVM parameters. The resulting estimator is an M-estimator, which has a negligible efficiency loss compared to the (exact) MLE. For larger data sets, the proposed M-estimator, whose computational burden is linear in npq, remains applicable when the state-of-the-art method fails to converge. To compute the M-estimator, we propose to use a stochastic approximation algorithm.

Biography: Graduated from the University of Geneva (Ph. D. in econometrics and statistics) in 1993, Maria-Pia Victoria-Feser has held several positions in different institutions or Departments. She was appointed as lecturer in statistics at the London School of Economics (1993-1996), as assistant and associate professor in statistics (part time) at the Faculty of psychology and educational sciences at the University of Geneva (1997-2005), financed by a Swiss National Science Found grant, full professor in statistics at the University of Geneva since 2001. She now holds, since 2023, a full professor position at the department of statistics of the University of Bologna. She has also acted for the foundation and as founding dean (2013-2017) of the Geneva School of Economics and Management (GSEM) of the University of Geneva, and as founding director of the Research Center for Statistics of the University of Geneva (created in 2011).

Maria-Pia Victoria-Feser’s research interests are in fundamental statistics (robust statistics, model selection and simulation based inference in high dimensions for complex models) with applications in economics (welfare economics, extremes), psychology and social sciences (generalized linear latent variable models, media analytics), and engineering (time series for geo-localization). She has published in leading journals of statistics as well as top journals in related fields. Throughout her career, she has supervised several PhD students, three of which currently hold professorial positions and one a senior researcher position, in leading academic institutions.

 

 

SpringTerm 2024

Tuesday 30 April 2024, 2-3pm - Tengyuan Liang (University of Chicago)

tengyuanliangTitle: TBC

Abstract: TBC

Bio: TBC