Shen Lab of Machine Learning for Biomedical Data Science | Neuroscience Labs

Research

Our lab has two major focuses: machine learning applications and large-scale next-generation sequencing (NGS) data analysis.

In the past few years, we have contributed to numerous areas of machine learning applied to biomedical data science. Our work includes sequence-to-phenotype prediction models built on convolutional neural networks and transformers, deep learning algorithms for breast cancer detection and risk prediction from mammograms, and automated genome annotation with deep neural networks, among other projects.

We have extensive experience analyzing large-scale NGS data. Since 2009, our group has analyzed more than 30,000 NGS samples, totaling more than 300 TB, in collaboration with researchers across the nation. We have published many papers in top-tier journals, such as Nature, Science, Nature Medicine and Nature Genetics. We have also participated in developing software for NGS data analysis, including widely used tools such as ngs.plot (NGS data mining and visualization) and diffReps (differential analysis for ChIP-seq data).

Machine Learning

We have a long-term interest in machine learning research and its applications to biomedicine. Some highlights include:

Self-supervised learning for medical image classification:
– K. Van Vorst and L. Shen, “Siamese Networks with Soft Labels for Unsupervised Lesion Detection and Patch Pretraining on Screening Mammograms.” arXiv, Jan. 10, 2024. Available: http://arxiv.org/abs/2401.05570
– J. D. Miller, V. A. Arasu, A. X. Pu, L. R. Margolies, W. Sieh, and L. Shen, “Self-Supervised Deep Learning to Enhance Breast Cancer Detection on Screening Mammography,” Mar. 2022. Available: https://arxiv.org/abs/2203.08812v1

Breast cancer detection and risk prediction using deep learning and mammograms:
– V. A. Arasu et al., “Comparison of Mammography AI Algorithms with a Clinical Risk Model for 5-year Breast Cancer Risk Prediction: An Observational Study,” Radiology, vol. 307, no. 5, p. e222733, Jun. 2023, doi: 10.1148/radiol.222733.
– T. Schaffter et al., “Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms,” JAMA Netw Open, vol. 3, no. 3, pp. e200265–e200265, Mar. 2020, doi: 10.1001/jamanetworkopen.2020.0265.
– L. Shen, L. R. Margolies, J. H. Rothstein, E. Fluder, R. McBride, and W. Sieh, “Deep Learning to Improve Breast Cancer Detection on Screening Mammography,” Scientific Reports, vol. 9, no. 1, pp. 1–12, Aug. 2019, doi: 10.1038/s41598-019-48995-4.

Automated genome annotation using machine learning:
– A. Ramakrishnan, G. Wangensteen, S. Kim, E. J. Nestler, and L. Shen, “DeepRegFinder: deep learning-based regulatory elements finder,” Bioinformatics Advances, vol. 4, no. 1, p. vbae007, Jan. 2024, doi: 10.1093/bioadv/vbae007.
– L. Shen, “Automatic genome segmentation with HMM-ANN hybrid models,” presented at the Workshops on Machine Learning in Computational Biology at the NIPS, 2015. bioRxiv: http://www.biorxiv.org/content/early/2016/10/29/034579

Teaching

In 2023-2024, Dr. Shen was the course director for BDS 3002 – Machine Learning for Biomedical Data Science. Here is the course companion website.

Since 2025, Dr. Shen has continued as a lecturer in the BDS 3002 course. He teaches artificial neural networks, support vector machines and kernel methods, and deep learning.

Software

Our lab is actively involved in developing bioinformatics software for next-generation sequencing (NGS) and gene analysis. Our lab has its own GitHub site. Dr. Shen also has his own GitHub profile, which mainly hosts machine learning projects.

ngs.plot is a very popular tool for NGS data visualization and exploration. It features a versatile pipeline that allows anyone to visualize sequence alignment pileups at a collection of genomic regions and summarize the enriched patterns. It is carefully engineered to index and read large NGS data efficiently without exhausting memory. Since its publication in 2014, the paper has been cited more than 700 times. Access the paper here.
diffReps is a tool for ChIP-seq differential analysis that takes biological replicates into account. There are many programs for comparing RNA-seq samples but far fewer for comparing ChIP-seq samples. The main reason is that ChIP-seq is more difficult to deal with because there are no predefined regions. diffReps uses a sliding window approach to perform millions of negative binomial tests along the genome and then merges and summarizes the results into a text report. Since its publication, it has remained one of the most popular tools for ChIP-seq differential analysis.
GeneOverlap is a small utility package for comparing multiple gene lists and determining the significance of their overlaps. It solves the statistical problem with Fisher’s exact test. Although simpler than ngs.plot and diffReps, it is among the top 20% of downloaded packages on Bioconductor. Why is that? It’s simply because people have so many gene lists to compare against one another!
Our lab has also developed NGS-Data-Charmer, a high-performance preprocessing pipeline for NGS samples. It can run seamlessly on an HPC cluster or a local workstation with built-in fault tolerance.
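To illustrate the kind of overlap test described for GeneOverlap above, here is a minimal Python sketch of a one-sided Fisher's exact test for the overlap of two gene lists. It is not the package's code or interface (GeneOverlap is an R package); the function name `overlap_pvalue`, the `genome_size` parameter, and the gene names are hypothetical and chosen only for this example. The sketch assumes both lists are drawn from a common background of `genome_size` genes.

```python
from math import comb


def overlap_pvalue(list_a, list_b, genome_size):
    """Return (overlap size, one-sided Fisher's exact p-value).

    Hypothetical helper for illustration only. The p-value is the
    probability of seeing an overlap at least this large if list_b were
    drawn at random from a background of genome_size genes.
    """
    a, b = set(list_a), set(list_b)
    n_a, n_b = len(a), len(b)
    if max(n_a, n_b) > genome_size:
        raise ValueError("gene lists cannot be larger than the background")
    k = len(a & b)  # observed overlap

    # Upper tail of the hypergeometric distribution: P(X >= k)
    total = comb(genome_size, n_b)
    tail = sum(
        comb(n_a, i) * comb(genome_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return k, tail / total


if __name__ == "__main__":
    # Toy example with made-up gene names and a background of 10 genes.
    k, p = overlap_pvalue(["g1", "g2", "g3", "g4"], ["g3", "g4", "g5"], 10)
    print(f"overlap={k}, p-value={p:.4f}")
```

The choice of background size matters: a larger assumed background makes the same overlap look more significant, so it should reflect the set of genes that could actually have appeared in either list.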

Publications

The most up-to-date list of our publications can be found on Google Scholar. A few selected publications are listed below.

Machine Learning

K. Van Vorst and L. Shen, “Siamese Networks with Soft Labels for Unsupervised Lesion Detection and Patch Pretraining on Screening Mammograms.” arXiv, Jan. 10, 2024. Accessed: Jan. 12, 2024. [Online]. Available: http://arxiv.org/abs/2401.05570
A. Ramakrishnan, G. Wangensteen, S. Kim, E. J. Nestler, and L. Shen, “DeepRegFinder: deep learning-based regulatory elements finder,” Bioinformatics Advances, vol. 4, no. 1, p. vbae007, Jan. 2024, doi: 10.1093/bioadv/vbae007.
S. Ortega-Martorell et al., “Breast cancer patient characterisation and visualisation using deep learning and fisher information networks,” Sci Rep, vol. 12, no. 1, Art. no. 1, Aug. 2022, doi: 10.1038/s41598-022-17894-6.

J. D. Miller, V. A. Arasu, A. X. Pu, L. R. Margolies, W. Sieh, and L. Shen, “Self-Supervised Deep Learning to Enhance Breast Cancer Detection on Screening Mammography,” Mar. 2022, Accessed: Mar. 17, 2022. [Online]. Available: https://arxiv.org/abs/2203.08812v1
V. A. Arasu et al., “Comparison of Mammography Artificial Intelligence Algorithms for 5-year Breast Cancer Risk Prediction.” medRxiv, p. 2022.01.05.22268746, Jan. 07, 2022. doi: 10.1101/2022.01.05.22268746.
T. Schaffter et al., “Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms,” JAMA Netw Open, vol. 3, no. 3, pp. e200265–e200265, Mar. 2020, doi: 10.1001/jamanetworkopen.2020.0265.
L. Shen, L. R. Margolies, J. H. Rothstein, E. Fluder, R. McBride, and W. Sieh, “Deep Learning to Improve Breast Cancer Detection on Screening Mammography,” Scientific Reports, vol. 9, no. 1, pp. 1–12, Aug. 2019, doi: 10.1038/s41598-019-48995-4.
L. Shen, “Automatic genome segmentation with HMM-ANN hybrid models,” bioRxiv, p. 034579, Dec. 2015, doi: 10.1101/034579.

Bioinformatics

S. Westfall et al., “Optimization of probiotic therapeutics using machine learning in an artificial human gastrointestinal tract,” Scientific Reports, vol. 11, no. 1, Art. no. 1, Jan. 2021, doi: 10.1038/s41598-020-79947-y.
A. E. Lepack et al., “Dopaminylation of histone H3 in ventral tegmental area regulates cocaine seeking,” Science, vol. 368, no. 6487, pp. 197–201, Apr. 2020, doi: 10.1126/science.aaw8806.
D. M. Walker et al., “Cocaine self-administration alters transcriptome-wide responses in the brain’s reward circuitry,” Biological Psychiatry, Apr. 2018, doi: 10.1016/j.biopsych.2018.04.009.
C. J. Peña et al., “Early life stress confers lifelong stress susceptibility in mice via ventral tegmental area OTX2,” Science, vol. 356, no. 6343, pp. 1185–1188, Jun. 2017, doi: 10.1126/science.aan4491.
B. Labonté et al., “Sex-specific transcriptional signatures in human depression,” Nature Medicine, vol. 23, no. 9, p. 1102, Sep. 2017, doi: 10.1038/nm.4386.
R. C. Bagot et al., “Ketamine and Imipramine Reverse Transcriptional Signatures of Susceptibility and Induce Resilience-Specific Gene Expression Profiles,” Biological Psychiatry, vol. 81, no. 4, pp. 285–295, Feb. 2017, doi: 10.1016/j.biopsych.2016.06.012.
R. C. Bagot et al., “Circuit-wide Transcriptional Profiling Reveals Brain Region-Specific Gene Networks Regulating Depression Susceptibility,” Neuron, vol. 90, no. 5, pp. 969–83, Jun. 2016, doi: 10.1016/j.neuron.2016.04.015.
Roadmap Epigenomics Consortium et al., “Integrative analysis of 111 reference human epigenomes,” Nature, vol. 518, no. 7539, pp. 317–330, Feb. 2015, doi: 10.1038/nature14248.
J. Feng et al., “Role of Tet1 and 5-hydroxymethylcytosine in cocaine action,” Nat Neurosci, vol. 18, no. 4, pp. 536–544, 2015, doi: 10.1038/nn.3976.
J. Feng et al., “Chronic cocaine-regulated epigenomic changes in mouse nucleus accumbens,” Genome Biology, vol. 15, no. 4, p. R65, 2014.
C. Dias et al., “β-catenin mediates stress resilience through Dicer1/microRNA regulation,” Nature, vol. 516, no. 7529, pp. 51–55, 2014, doi: 10.1038/nature13976.

Software

L. Shen, N. Shao, X. Liu, and E. Nestler, “ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases,” BMC Genomics, vol. 15, no. 1, p. 284, 2014.
T. Wang et al., “STAR: an integrated solution to management and visualization of sequencing data,” Bioinformatics, vol. 29, no. 24, pp. 3204–3210, Dec. 2013, doi: 10.1093/bioinformatics/btt558.
L. Shen, N.-Y. Shao, X. Liu, I. Maze, J. Feng, and E. J. Nestler, “diffReps: Detecting Differential Chromatin Modification Sites from ChIP-seq Data with Biological Replicates,” PLoS ONE, vol. 8, no. 6, p. e65598, 2013, doi: 10.1371/journal.pone.0065598.
L. Shen, “GeneOverlap: Test and visualize gene overlaps.” R Bioconductor package, 2016. Available: http://www.bioconductor.org/packages/release/bioc/html/GeneOverlap.html

Collaborators

Mount Sinai

Eric Nestler
Scott Russo
Yasmin Hurd
Schahram Akbarian
Ian Maze
Venetia Zachariou
Hongyan Jenny Zou
Roland Friedel
Pinxian Xu
Ana Pereira
Weiva Sieh
Laurie Margolies

University of Minnesota/ASU

Jonathan Gewirtz
Andrew Harris

Bioinformaticians/Data Scientists

Aarthi Ramakrishnan, Bioinformatician
Molly Estill, Bioinformatician
Eddie Loh, Bioinformatician (part-time)
Adam Catto, Data Scientist
