

I’m a professor in the Stanford AI Lab (SAIL), the Center for Research on Foundation Models (CRFM), and the Machine Learning Group (bio). Our lab works on the foundations of the next generation of AI systems.
- On the AI side, I am fascinated by how we can learn from increasingly weak forms of supervision, the basis of new architectures, the role of data, and by the mathematical foundations of such techniques.
- On the systems side, I am broadly interested in how machine learning is changing how we build software and hardware. I’m particularly excited when we can blend AI and systems, e.g., Snorkel, Overton (YouTube), or Together.
Our work is inspired by the observation that data is central to these systems, and so data management principles (re-imagined) play a starring role in our work. This sounds like Silicon Valley nonsense, but oddly enough, these ideas get used due to amazing students and collaborations with Google Ads, YouTube, Apple, and more.
While we’re very proud of our research ideas and their impact, the lab’s real goal is to help students become professors, entrepreneurs, and researchers. To that end, over a dozen members of our group have started their own professorships. With students and collaborators, I’ve been fortunate enough to cofound a number of companies and a venture firm. For transparency, I try to list companies I advise or invest in here and our research sponsors here. My students run the ML Sys Podcast.
The latest from Chris
This paper introduces the Structured State Space sequence model (S4), which uses a new parameterization for the state-space model to improve long-range dependency handling both mathematically and empirically.
This paper proposes cross-modal data programming (XMDP) for machine learning (ML) in medicine.
This paper provides a series of results studying how performance scales with changes in source coverage, source accuracy, and the Lipschitzness of label distributions in the embedding space, and compares this rate to that of standard weak supervision.
Knowledge graph (KG) embeddings learn low-dimensional representations of entities and relations to predict missing facts. KGs often exhibit hierarchical and logical patterns which must be preserved in the embedding space. For hierarchical data, hyperbolic embedding methods have shown promise for high-fidelity and parsimonious representations. However, existing hyperbolic embedding methods do not account for the rich logical patterns in KGs. In…
A popular way to estimate the causal effect of a variable x on y from observational data is to use an instrumental variable (IV): a third variable z that affects y only through x. The more strongly z is associated with x, the more reliable the estimate is, but such strong IVs are difficult to find. Instead, practitioners combine more…
Enzymatic and chemical reactions are key for understanding biological processes in cells. Curated databases of chemical reactions exist but these databases struggle to keep up with the exponential growth of the biomedical literature. Conventional text mining pipelines provide tools to automatically extract entities and relationships from the scientific literature, and partially replace expert curation, but such machine learning frameworks often…
This paper explores the applicability of weak supervision, that is, relying on higher-level, noisier forms of supervision to label training data, specifically using data programming.
Proposing a framework for integrating and modeling such weak supervision sources by viewing them as labeling different related sub-tasks of a problem, which we refer to as the multi-task weak supervision setting.
Outlining a vision for a Software 2.0 lifecycle centered around the idea that labeling training data can be the primary interface to Software 2.0 systems.

