[Statlist] Joint webinar of the IMS New Researchers Group, Young Data Science Researcher Seminar Zürich, and the YoungStatS Project: Extrapolation to unseen domains: from theory to applications, 22.04.24

Maurer Letizia letiziamaurer at ethz.ch
Thu Apr 11 12:52:25 CEST 2024


Dear Colleagues,

We are glad to announce the 3rd joint webinar of the IMS New Researchers Group, Young Data Science Researcher Seminar Zürich, and the YoungStatS Project.

Topic: Extrapolation to unseen domains: from theory to applications  
Time: Monday, April 22nd, 8:00 PT / 11:00 ET / 17:00 CEST.
Online via Zoom: https://washington.zoom.us/j/92385046970

We encourage you to RSVP so that you can get the recording emailed to you afterwards: https://docs.google.com/forms/d/e/1FAIpQLSdJ_BYx3B4mkeduSutoS5fsLjjm9bGGxzUWwOdcZeHoOUWWaA/viewform?vc=0&c=0&w=1&flr=0

The event will consist of three 25-minute talks by Mohammad Lotfollahi (Cambridge University), Max Simchowitz (MIT), and Zhijing Jin (MPI and ETH), followed by a ~15-minute discussion by Nicolai Meinshausen (ETH). The topics and expertise of the speakers are diverse, ranging from extrapolation in robotics to cellular biology and large language models.

Please see the detailed information about the talks below. We hope to see many of you at the webinar!


Max Simchowitz

Title: Statistical Learning under Heterogeneous Distribution Shift
Abstract: What makes a trained predictor, e.g., a neural network, more or less susceptible to performance degradation under distribution shift? In this talk, we will investigate a less well-studied factor: the statistical complexity of the individual features themselves. We will show that, for a very general class of predictors with a certain additive structure, empirical risk minimization is less sensitive to distribution shifts in "simple" features than in "complex" ones, where simplicity and complexity are measured in terms of natural statistical quantities. We demonstrate that this arises because standard ERM learns the dependence on the simpler features more quickly, whilst avoiding the risk of overfitting to the more complex features. We will conclude by drawing connections to the orthogonal machine learning literature and by validating our theory on various experimental domains (even those in which the additivity assumption fails to hold).

Mohammad Lotfollahi

Title: Generative Machine Learning to Model Cellular Perturbations
Abstract: The field of cellular biology has long sought to understand the intricate mechanisms that govern cellular responses to various perturbations, be they chemical, physical, or biological. Traditional experimental approaches, while invaluable, often face limitations in scalability and throughput, especially when exploring the vast combinatorial space of potential cellular states. Generative machine learning has shown exceptional promise in modeling such complex biological systems. This talk will highlight recent successes, address the challenges and limitations of current models, and discuss the future direction of this exciting interdisciplinary field. Through examples of practical applications, we will illustrate the transformative potential of generative ML in advancing our understanding of cellular perturbations and in shaping the future of biomedical research.

Zhijing Jin

Title: A Paradigm Shift in Addressing Distribution Shifts: Insights from Large Language Models
Abstract: Traditionally, the challenge of distribution shifts—where the training data distribution differs from the test data distribution—has been a central concern in statistical learning and model generalization. Traditional methods have primarily focused on techniques such as domain adaptation and transfer learning. However, the rise of large language models (LLMs) such as ChatGPT has ushered in novel empirical successes, triggering a significant "shift" in problem formulation and approach for traditional distribution shift problems. In this talk, I will start with two formulations for LLMs: (1) the engineering heuristics aimed at transforming "out-of-distribution" (OOD) problems into "in-distribution" scenarios, and (2) the hypothesized "emergence of intelligence" through massive scaling of data and model parameters, which challenges our traditional views on distribution shifts. I will examine these aspects in turn, first by presenting behavioral tests of these models' generalization capabilities on unseen data, and then by conducting intrinsic checks to uncover the mechanisms LLMs have learned. This talk seeks to provoke thought on several questions: Do the strategies of "making OOD problems IID" and of facilitating the "emergence of intelligence" by scaling truly stand up to scientific scrutiny? Furthermore, what do these developments imply for the field of statistical learning and the broader evolution of AI?

Discussant: Nicolai Meinshausen, ETH Zürich

Organizers: Sam Allen, Michael Law, Xinwei Shen
Webinar links:
https://math.ethz.ch/sfs/news-and-events/young-data-science.html
https://youngstats.github.io/post/2024/04/05/extrapolation-generalization-to-novel-domains-in-data-science/
