The genomic data scientist investigating the underpinnings of disease

By Michaela Herrmann

Incoming FAS faculty member Xiang Zhou is developing novel computational and statistical methods to help other scientists understand the genetic drivers of disease.

Every year, Yale’s Faculty of Arts and Sciences welcomes exceptional scholars across the sciences, humanities, and social sciences. This series profiles six of the faculty joining the FAS in the 2025–26 academic year, highlighting their academic achievements, research ambitions, and the teaching they hope to do at Yale.

Xiang Zhou has always wanted to understand the genetic underpinnings of disease—but not in the way you might expect. 

Zhou, who joined the FAS this fall as Professor of Statistics and Data Science, does have a background in biology: a BS in biology, an MS in statistics, and a PhD in neurobiology. “I even did molecular biology experiments myself,” he notes.

But instead of becoming a biologist, Zhou decided to apply his expertise in statistics to developing novel computational and statistical methods that help other scientists answer fundamental questions about the genomic drivers of human disease. “I've always been curious about biological processes and wanted to understand disease mechanisms and disease etiologies to help eventually better treatment for different people,” he says. “That's why I have always been interested in solving those important genetic and genomic questions to eventually get down to the disease.”

In his research, Zhou asks why some individuals are more likely than others to develop certain diseases. From one individual to the next, human DNA is about 99.9% identical, but the less than one-tenth of one percent that differs results in countless phenotypic variations, Zhou explains. “Some people are tall; some people are short. Some people have brown eyes; some have blue eyes. And some people may be more prone to certain diseases, while other people are potentially healthier and perhaps live longer.”

To identify these differences, scientists must “zoom into” the genome at the level of gene sequences. There, the one-tenth of one percent difference between individuals shows up as single nucleotide polymorphisms (SNPs, or “snips”). Across the entire genome, there are millions of SNPs.

“We each have three billion DNA sequences, and those consist of A, T, C, and G,” explains Zhou, referring to the four different nucleotides—adenine, thymine, cytosine, and guanine, which are often referred to as “letters”—that make up our DNA. 

Among those sequences, certain regions vary from one individual to the next. “For example, maybe 60% of people have the A in a particular region, while 40% people have a T. You can then try to identify what those sites are associated with certain diseases,” Zhou says of his work investigating the human genome.
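To make the idea of testing a site for association concrete, here is a minimal sketch in Python. It is not Zhou's method: it is a textbook chi-square test on a 2x2 table of allele counts, and the function name and all counts are hypothetical, chosen only for illustration. It asks whether the A and T letters at one site appear at different rates in people with a disease and people without it.

```python
import math


def allelic_chi_square(case_a, case_t, control_a, control_t):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are cases and controls; columns are counts of the A and T alleles.
    Returns the chi-square statistic and its p-value.
    """
    table = [[case_a, case_t], [control_a, control_t]]
    total = case_a + case_t + control_a + control_t
    row_sums = [case_a + case_t, control_a + control_t]
    col_sums = [case_a + control_a, case_t + control_t]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if allele and disease status were independent.
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: the A allele is 70% of alleles in cases, 60% in controls.
chi2, p_value = allelic_chi_square(case_a=700, case_t=300,
                                   control_a=600, control_t=400)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.2e}")
```

A genome-wide study repeats a test of this kind at millions of sites, which is one reason the very large sample sizes and more sophisticated models described in this article matter.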

For researchers interested in diseases such as type 2 diabetes or cardiovascular disease, the goal is to collect thousands, or even millions, of samples. With huge datasets like these, Zhou’s sophisticated methods for analyzing genomic data are becoming increasingly essential.

Breaking ground with new methods 

Zhou specializes in developing novel statistical and computational methods, including machine learning approaches that use deep learning and artificial intelligence, for state-of-the-art genetic and genomic techniques. This work can in turn help uncover how variations in our genes influence biological functions and contribute to the development of disease.

These methods can be used to analyze large-scale, high-dimensional genetic and genomic data types, including genome-wide sequencing studies, single-cell sequencing studies, and multiomics sequencing studies, a rapidly emerging field Zhou is particularly excited to explore.

With spatial multiomics, he explains, “you can measure thousands or even millions of locations on the tissue. And on each location, you can measure the entire transcriptome of ten, twenty, 30,000 genes. 

“By doing that, now you can start to see that, well, this SNP is affecting this gene. And this gene seems to be affecting a particular area of this tissue or a particular cell type in this location, and it's this location that's associated with the traits of the complex diseases.”

The approach produces “extremely rich data,” Zhou says, requiring ever more powerful statistical tools that enable scientists to analyze these data points.

“Because those technologies are very new, very complex, and extremely noisy, it requires us to develop statistical models and methods to deal with them, especially as this data becomes larger and larger. So that's why we need also deep learning models and large language models to handle all that.”

Scientists at Yale are some of the few in the country working at the cutting edge of spatial multiomics, and Zhou joins several Yale data scientists harnessing the power of machine learning. Both the technology-forward work being done by colleagues and the collaborative environment fostered across Yale’s science and medicine departments attracted Zhou to campus. “I have already reached out to people here to talk with them, and I’m very excited to get involved in all those collaborative projects.”

Pushing science forward—collectively and collaboratively 

Zhou’s lab is not focused on specific diseases or one single approach to data analysis; rather, his team aims to develop methods that will support and improve the research being done by molecular biologists, biophysicists, data scientists, and beyond. 

He’s committed to developing applied methods and wants to be sure they can be readily accessed by other scientists. “When we develop methods, we always package it well, and we always make the software publicly available, and freely available, so that other biologists can make use of them,” he emphasizes.

Zhou published a paper last year about one such method: the multi-ancestry sum of the single effects model (MESuSiE), a probabilistic multi-ancestry fine-mapping method. Zhou and his co-author Boran Gao, assistant professor of statistics and biological sciences at Purdue University (and a former PhD student in Zhou's lab), wanted to address the relative scarcity of genomic data gathered from people who are not of European descent. That scarcity makes it difficult to extrapolate analyses about the SNPs causing disease in many Asian populations, African Americans, and other minority populations, and they thought a machine learning model could help fill in some of the gaps.

“The machine learning method allows you to borrow information from the European population but also take advantage of those relatively small-scale studies from other populations,” Zhou explains. “So we combined them to do an integrative analysis, which really helps you pinpoint the truly causal SNPs associated with diseases while also benefiting the other minority populations. We are effectively borrowing information on this very large European ancestry to help benefit the other minority population.”
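The general idea of borrowing information across studies can be illustrated with a much simpler technique than MESuSiE. The sketch below is not the MESuSiE model: it is a standard fixed-effect, inverse-variance weighted meta-analysis, and the function name, effect sizes, and standard errors are all hypothetical. It also assumes a SNP has the same effect in both populations, an assumption that multi-ancestry methods are designed to relax.

```python
import math


def inverse_variance_meta(estimates):
    """Combine (effect, standard_error) pairs using inverse-variance weights.

    More precise studies (smaller standard errors) receive more weight.
    Returns the combined effect estimate and its standard error.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    combined_effect = sum(
        w * effect for w, (effect, _) in zip(weights, estimates)
    ) / total_weight
    combined_se = math.sqrt(1.0 / total_weight)
    return combined_effect, combined_se


# Hypothetical estimates of one SNP's effect on a trait.
large_study = (0.10, 0.01)  # large study: precise estimate
small_study = (0.14, 0.04)  # smaller study: noisier estimate

combined_effect, combined_se = inverse_variance_meta([large_study, small_study])
print(f"combined effect = {combined_effect:.4f}, standard error = {combined_se:.4f}")
```

The combined standard error is smaller than either study's own, which is the sense in which a small study gains precision when analyzed together with a large one.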

In another, ongoing study, Zhou is helping biologists determine which SNPs may cause fibromuscular dysplasia (FMD), a cardiovascular disease found predominantly in young women. He's also developed methods to help determine the genetic drivers of diabetes and cardiovascular disease, making use of huge, high-quality datasets like the UK Biobank, a collection of data pertaining to patients in the United Kingdom’s National Health Service (NHS).

Regardless of what disease or method Zhou studies next—or what obstacles he may encounter—he’s sure to learn something that he will be eager to share. “In each project, I feel that we always learn something,” he says. “We aim for a particular targeted application and we try to develop a method. Sometimes it works towards this particular application, and sometimes it doesn't. When it doesn’t, we look at other collaborations and to see whether it's adapted to other applications.”

No matter what each collaboration may hold, Zhou reflects, “we see something unexpected almost every single day.”