Research
Our lab focuses on two synergistic areas: Genome AI and machine learning for biological discovery, and large-scale analysis of next-generation sequencing (NGS) data.
In recent years, our research has increasingly centered on Genome AI — the development of deep learning models that learn the functional grammar of the genome directly from DNA sequence. We build sequence-to-function models and genomic language models that integrate long-range regulatory context to predict gene expression, regulatory element activity, and the functional impact of genetic variation. By combining advances in transformers, representation learning, and large-scale biological datasets, we aim to construct models capable of decoding the regulatory logic of the human genome and enabling variant-to-function and variant-to-phenotype prediction at unprecedented resolution.
Beyond genome modeling, we have contributed broadly to machine learning applications in biomedical data science. Our work includes developing deep learning algorithms for breast cancer detection and risk prediction from mammography, sequence-to-phenotype prediction models using convolutional and transformer architectures, and deep neural network approaches for automated genome annotation.
A core strength of the lab is our expertise in large-scale genomic data analysis. Since 2009, our group has analyzed tens of thousands of NGS datasets totaling more than 300 terabytes of data, in collaboration with researchers across the United States. Our work has been published in leading journals including Nature, Science, Nature Medicine, and Nature Genetics.
We have also developed widely used computational tools for genomic data analysis, including ngs.plot for NGS data mining and visualization and diffReps for differential analysis of ChIP-seq data.
Our long-term vision is to bridge genomics and artificial intelligence, building foundation models that transform how we interpret genomes, understand disease mechanisms, and enable precision medicine.
Machine Learning
We have a long-term interest in machine learning research and its applications to biomedicine. Some highlights include:
- Large-scale genome AI including sequence-to-function and genomic language models:
– Dubey, V. & Shen, L. Personalized gene expression prediction in the era of deep learning: a review. Brief Bioinform 27, bbag022 (2026).
– Xu, K. & Shen, L. Predicting 3D Chromatin Interactions Using Transformer-Enhanced Deep Learning Models. 2025.04.10.647995 Preprint at https://doi.org/10.1101/2025.04.10.647995 (2025).
– Shen, L. AlphaGenome Enhances Personal Gene Expression Prediction but Retains Key Limitations. 2025.08.05.668750 Preprint at https://doi.org/10.1101/2025.08.05.668750 (2025).
– A. Ramakrishnan, G. Wangensteen, S. Kim, E. J. Nestler, and L. Shen, “DeepRegFinder: deep learning-based regulatory elements finder,” Bioinformatics Advances, vol. 4, no. 1, p. vbae007, Jan. 2024, doi: 10.1093/bioadv/vbae007.
– L. Shen, “Automatic genome segmentation with HMM-ANN hybrid models,” presented at the Workshops on Machine Learning in Computational Biology at the NIPS, 2015. bioRxiv: http://www.biorxiv.org/content/early/2016/10/29/034579
- Cancer detection and risk prediction using deep learning and medical imaging:
– K. Van Vorst and L. Shen, “Siamese Networks with Soft Labels for Unsupervised Lesion Detection and Patch Pretraining on Screening Mammograms.” arXiv, Jan. 10, 2024. Available: http://arxiv.org/abs/2401.05570.
– V. A. Arasu et al., “Comparison of Mammography AI Algorithms with a Clinical Risk Model for 5-year Breast Cancer Risk Prediction: An Observational Study,” Radiology, vol. 307, no. 5, p. e222733, Jun. 2023, doi: 10.1148/radiol.222733.
– J. D. Miller, V. A. Arasu, A. X. Pu, L. R. Margolies, W. Sieh, and L. Shen, “Self-Supervised Deep Learning to Enhance Breast Cancer Detection on Screening Mammography,” Mar. 2022. Available: https://arxiv.org/abs/2203.08812v1
– T. Schaffter et al., “Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms,” JAMA Netw Open, vol. 3, no. 3, pp. e200265–e200265, Mar. 2020, doi: 10.1001/jamanetworkopen.2020.0265.
– L. Shen, L. R. Margolies, J. H. Rothstein, E. Fluder, R. McBride, and W. Sieh, “Deep Learning to Improve Breast Cancer Detection on Screening Mammography,” Scientific Reports, vol. 9, no. 1, pp. 1–12, Aug. 2019, doi: 10.1038/s41598-019-48995-4.
Teaching
In 2023-2024, Dr. Shen was the course director for BDS 3002 – Machine Learning for Biomedical Data Science. Here is the course companion website.
Since 2025, Dr. Shen has continued as a lecturer in the BDS 3002 course. He teaches artificial neural networks, support vector machines and kernel methods, and deep learning.
Software
Our lab actively develops bioinformatics software for next-generation sequencing (NGS) and gene analysis. The lab has its own GitHub site. Dr. Shen also has his own GitHub profile, which mainly hosts machine learning projects.
- ngs.plot is a very popular tool for NGS data visualization and exploration. It features a versatile pipeline that allows anyone to visualize sequence alignment pileups at a collection of genomic regions and summarize the enriched patterns. It is carefully engineered to index and read large NGS data efficiently without exhausting memory. Since its publication in 2015, the paper has been cited more than 700 times. Access the paper here.
- diffReps is a tool for ChIP-seq differential analysis that takes biological replicates into account. There are many programs for comparing RNA-seq samples but far fewer for comparing ChIP-seq samples. The main reason is that ChIP-seq is more difficult to deal with because there are no predefined regions. diffReps uses a sliding window approach to perform millions of negative binomial tests along the genome and then merges and summarizes the results into a text report. Since its publication, it has remained one of the most popular tools for ChIP-seq differential analysis.
- GeneOverlap is a small utility package for comparing multiple gene lists and determining the significance of their overlaps. It solves the statistical problem with Fisher’s exact tests. Although simpler than ngs.plot and diffReps, it is among the top 20% of downloaded packages on Bioconductor. Why is that? It’s simply because people have so many gene lists to compare against one another!
- Our lab has also developed NGS-Data-Charmer, a high-performance preprocessing pipeline for NGS samples. It can run seamlessly on an HPC cluster or a local workstation with built-in fault tolerance.
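The gene list overlap test mentioned for GeneOverlap can be illustrated with a minimal sketch. This is not GeneOverlap's code or API (GeneOverlap is an R/Bioconductor package); it is a standalone Python example of a one-sided Fisher's exact test for enrichment, computed as a hypergeometric upper tail. The function name `overlap_pvalue`, the parameter `genome_size`, and the gene lists are all hypothetical and chosen only for illustration.

```python
from math import comb


def overlap_pvalue(list_a, list_b, genome_size):
    """One-sided Fisher's exact test for the overlap of two gene lists.

    Returns (overlap_size, p_value), where p_value is the probability of
    seeing an overlap at least this large if list_b were drawn at random
    from a background of `genome_size` genes (an assumed parameter).
    """
    a, b = set(list_a), set(list_b)
    n_a, n_b = len(a), len(b)
    if n_a > genome_size or n_b > genome_size:
        raise ValueError("gene lists cannot be larger than the background")

    k = len(a & b)  # observed overlap
    total = comb(genome_size, n_b)  # all ways to draw list_b from the background

    # Sum hypergeometric probabilities for overlaps of size k or larger.
    tail = sum(
        comb(n_a, i) * comb(genome_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return k, tail / total


if __name__ == "__main__":
    # Hypothetical gene lists and background size.
    up_in_condition_1 = ["FOSB", "ARC", "EGR1", "NPAS4", "JUNB"]
    up_in_condition_2 = ["ARC", "EGR1", "NPAS4", "BDNF"]
    k, p = overlap_pvalue(up_in_condition_1, up_in_condition_2, genome_size=20000)
    print(f"overlap = {k}, p-value = {p:.3g}")
```

The background size matters: the same overlap is far less surprising against a few hundred expressed genes than against a whole genome, so it should reflect the set of genes that could actually have appeared in either list.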
Publications
The most up-to-date list of our publications can be found on Google Scholar (Citations: 24000+; H-index 61).
Collaborators
Mount Sinai
- Eric Nestler
- Scott Russo
- Yasmin Hurd
- Schahram Akbarian
- Ian Maze
- Venetia Zachariou
- Hongyan Jenny Zou
- Roland Friedel
- Ana Pereira
- Weiva Sieh
- Laurie Margolies
University of Minnesota/ASU
- Jonathan Gewirtz
- Andrew Harris
Bioinformaticians/Data Scientists
- Aarthi Ramakrishnan, Bioinformatician
- Molly Estill, Bioinformatician
- Eddie Loh, Bioinformatician (part-time)
- Adam Catto, Data Scientist
Alumni
Name, past and present
- Ying Jin, postdoc => computational scientist @ CSHL
- Xiaochuan Liu, postdoc => Associate Director @ AstraZeneca
- Ningyi Shao, postdoc => Assistant Professor @ University of Macau
- Immanuel Purushothaman, bioinformatician => AI/ML @ Toast
- Yong Hwee Eddie Loh, bioinformatician => faculty @ UCLA
- John Miller, master’s student => Product @ YipitData
- Kevin Van Vorst, master’s student => Data Engineer @ Zwanger-Pesiri Radiology
- Viksar Dubey, research intern => undergrad @ UC Berkeley
- Kexin Xu, master’s student => machine learning engineer @ MeiTuan