MLxAI Codeathon Team Projects

ToxPipe: Semi-autonomous AI integration of diverse toxicological data streams
ToxPipe is an application that uses large language models (LLMs), LangChain, and various tools and data sources to answer toxicological queries about chemicals. ToxPipe currently pulls information from PubMed, PubChem, Semantic Scholar, and RDKit, and is inspired by and adapted from ChemCrow.

Team Members:

  • Trey Saddler (Team Leader) DTT, NIEHS, NIH
  • Parker Combs (Technical Lead) DTT, NIEHS, NIH
  • Virginie Grosboillot, University of Ljubljana
  • Grzegorz Boratyn, NCBI, NLM, NIH
  • Yixing Han, NHGRI, NIH
  • David Li, NIA, NIH
  • Olawale Ogundeji, University of Leeds
  • Mike Conway (Ad-hoc member) DTT, NIEHS, NIH
  • Scott Auerbach (Ad-hoc member) DTT, NIEHS, NIH

Harnessing the Microbiome: From Microbial Genes in the Gut to Intestinal Function and Drug Absorption
The scientific goal of the project is to build an AI/ML model that predicts the impact of a bacterial species on human intestinal function in inflammatory bowel diseases (IBD), specifically drug absorption and metabolism.

Team Members:
  • Abhinav Bhushan (Team Leader) Illinois Institute of Technology
  • Christopher Tang (Technical Lead) 
  • Abhinav Sur, NICHD, NIH
  • Gayathri Jahan Mohan
  • Gobikrishnan Subramaniam, Queen's University
  • Jooho Lee, D4CG, University of Chicago
  • Karan Jogi, Discovery Partners Institute
  • Soham Shirolkar, University of South Florida
  • Viktoriia Liu

RAGVar MLXAI Codeathon Team
The goal of RAGVar is to build a system for harmonizing your data with data that already exist in a repository such as NCBI or NHLBI BDC. Harmonization is a major challenge for all research repositories because data sources often lack the clear direction, the resources, or both to align data before ingest. As a result, harmonization happens retrospectively, done either by data users or through manual efforts by repository teams. The RAGVar system evaluates whether retrieval-augmented generation and AI reasoning can determine how new data, described in a data dictionary, align with data already in a repository. RAGVar focuses on the initial step: aligning user-provided variables, supplied as a data dictionary with descriptive variable labels, with the existing variables in the corpus the user wants to align with. After identifying candidate similar variables, RAGVar ranks which existing variable is most likely to match each new variable and attempts to tell the user how to harmonize their variables with the existing data sets.
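
The retrieve-then-rank step described above can be illustrated with a minimal sketch. This is not the RAGVar implementation: simple token-overlap (Jaccard) similarity stands in for embedding-based retrieval and AI reasoning, and the variable names, labels, and function names below are hypothetical.

```python
import re

def tokens(label: str) -> set:
    """Lower-case a variable label and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", label.lower()))

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two labels (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def rank_candidates(new_label: str, existing: dict, top_k: int = 3) -> list:
    """Rank existing repository variables by similarity to a new label.

    `existing` maps variable name -> descriptive label (hypothetical data).
    Ties are broken alphabetically by variable name.
    """
    scored = [(name, jaccard(new_label, label)) for name, label in existing.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]

# Hypothetical repository data dictionary, for illustration only.
EXISTING = {
    "BMI_V1": "body mass index at visit 1",
    "SBP_V1": "systolic blood pressure at visit 1",
    "AGE_ENROLL": "age at enrollment in years",
}

if __name__ == "__main__":
    for name, score in rank_candidates("systolic blood pressure", EXISTING):
        print(f"{name}\t{score:.2f}")
```

A real system would likely replace `jaccard` with a semantic similarity measure, since labels that mean the same thing often share few literal tokens.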

Team Members:
  • David Beaumont (Team Leader) Center for Data Modernization Solutions, SSES, RTI International
  • Corey Cox (Technical Lead) University of Colorado, Anschutz
  • Nathaniel Braswell, Center for Data Modernization Solutions, SSES, RTI International
  • Stephen Hwang, Center for Data Modernization Solutions, SSES, RTI International
  • Oswaldo Alonso Lozoya, Center for Data Modernization Solutions, SSES, RTI International

SPARCLE Curation
To assist the small team of experts who manually annotate subfamily protein architectures in the SPARCLE database, we trained a decision tree and an NLP model on the roughly 42k expert-curated architectures to automate label assignment for the remaining roughly 200k uncurated protein architectures.

Team Members:
  • Marc Gwadz (Team Leader) NCBI, NLM, NIH
  • Mingzhang Yang (Technical Lead) NCBI, NLM, NIH
  • Christopher Meyer (Writer) University of Chicago
  • Franziska Ahrend, ORISE Fellow at NIDDK, NIH
  • Yixiang Deng, MIT and Harvard
  • Shaojun Xie, NCI, NIH

A Gentle Introduction to ML/AI as Applied to Antibody Engineering
The project focused on developing resources and documentation for teaching data science and machine learning / artificial intelligence (ML/AI) concepts related to antibody engineering. Immune profiling (immunoprofiling) datasets were used as a source of antibody sequences for both data science and ML. The team developed Jupyter notebooks that run comparative analyses of iReceptor datasets and then apply the AbLang2 antibody-specific language model to characterize data from CoV-AbDab. The team also developed a dictionary and glossary defining the essential computing and biology terms used in the notebooks.

Team Members:
  • Todd Smith (Team Leader) Digital World Biology, LLC
  • Herminio Vazquez (Technical Lead) Copado Inc.
  • Stephen Panossian (Writer)
  • Zainab Adenaike, NCBI, NLM, NIH
  • Jake Lance, University of Toronto
  • Mohsen Sharifi Renani, Spotify AB

Random Forest for Genomic Association Detection
Use Random Forest (RF) to detect high-order interactions among genomic and other omic features associated with the phenotype.

Team Members:
  • Weiping Chen (Team Leader) NIDDK, NIH
  • Guanjie Chen (Technical Lead) NHGRI, NIH
  • Qing Li (Writer) NHGRI, NIH
  • Chimenya Ntweya, Queen Elizabeth Central Hospital

ClinCluster: A Package for Aggregating Disease Terms in ClinVar
Naming for human genetic diseases is complex. Diseases may be named for the phenotype; other information may be alluded to including the relevant gene, mode of inheritance, or the mechanism of disease. Diseases may be described at a high level with a generic name, or at a lower level with a more specific name; however, in the context of a variant in a specific gene, these differences may not be considered important. ClinVar data would be easier to ingest in bulk and to read in web displays if there were a meaningful way to aggregate diseases that effectively mean the same thing in the context of a gene. Problem Statement: Diseases in ClinVar are very granular and result in many variant-disease records. Can we use an ML/AI approach to aggregate disease terms in ClinVar to reduce the number of variant-disease records?

Team Members:
  • Melissa Landrum (Co-Team Leader) NCBI, NLM, NIH
  • Guangfeng Song (Co-Team Leader) NCBI, NLM, NIH
  • Lauren Edgar, NHGRI, NIH
  • Benjamin Kesler, Vanderbilt University
  • Nicholas Minor, University of Wisconsin
  • Michael Muchow
  • Rebecca Orris, NCBI, NLM, NIH
  • Wengang Zhang, NCI, NIH

Exploring LLM Applications to SRA Metadata Normalization
The NCBI Sequence Read Archive (SRA) contains valuable research data, but its metadata are often unstandardized, with misspellings and inconsistent descriptions of the same information. This project aims to use machine learning and AI to standardize these metadata fields.
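
To make the normalization problem concrete, here is a minimal baseline sketch that maps misspelled or inconsistently formatted values onto a controlled vocabulary. It uses plain fuzzy string matching from Python's standard library rather than an LLM, so it is not the team's approach; the vocabulary, the `normalize` function, and the cutoff value are assumptions chosen for illustration.

```python
import difflib

# Hypothetical controlled vocabulary for an organism-like metadata field.
CANONICAL = ["Homo sapiens", "Mus musculus", "Escherichia coli"]

def normalize(value: str, vocabulary=CANONICAL, cutoff: float = 0.8):
    """Map a free-text value to its closest canonical term, or None.

    Whitespace and case are normalized first; difflib then picks the
    closest vocabulary entry whose similarity ratio is at least `cutoff`.
    """
    cleaned = " ".join(value.strip().lower().split())
    lookup = {term.lower(): term for term in vocabulary}
    match = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[match[0]] if match else None

if __name__ == "__main__":
    for raw in ["  homo sapeins ", "HOMO SAPIENS", "zebrafish"]:
        print(repr(raw), "->", normalize(raw))
```

Values with no sufficiently close match return `None`, which leaves them for review instead of forcing a wrong label. Character-level matching cannot resolve synonyms, which is one reason a language-model approach is worth exploring.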

Team Members:
  • Jonathan Gunti (Co-Team Leader) NCBI, NLM, NIH
  • Ryan Conner (Co-Team Leader) NCBI, NLM, NIH
  • Andrey Kochergin (Technical Leader) NCBI, NLM, NIH
  • Corinne Matti (Writer) NCBI, NLM, NIH
  • Priyanka Ghosh, NCBI, NLM, NIH
  • Moyo Williams, NCBI, NLM, NIH
  • Vadim Zalunin, NCBI, NLM, NIH

Visualizations of Nucleotide Sequences in 3-Dimensional Space using Ziro Studio and Entrez
Building human genome visualizations for the next generation of CRISPR enthusiasts, genome sequencers, and bioinformatics engineers.
We live in an age where genetic editing is becoming more prevalent. The pivotal 2012 CRISPR paper by Dr. Doudna and Dr. Charpentier has led to several applications of the original CRISPR-Cas9 system, especially around cures for common somatic cell autoimmune diseases. Leveraging our knowledge of CRISPR, alongside the NLM NCBI libraries, we look to visualize the hemoglobin subunit beta (HBB) gene associated with sickle cell anemia as a stepping stone to better visualizing how CRISPR-Cas9 systems can affect the mutated regions. In doing so, we can better communicate the efficacy of treatments such as CASGEVY to a younger audience of K-12 students and inspire them to learn more about genome sequencing and editing.

Team Members:
  • John Hall (Team Leader) PASTEM.ORG
  • Raja Jasti, CEO, Ziro Studio
  • Hari Parthasarathy, UC Berkeley
  • Christopher Fluta, Drexel University
  • Dimple Amitha Garuadapuri, UC Berkeley
  • Daniel Chen, UC Berkeley

Machine Learning with Genomic Information for Predicting Cancer Drug Response
Genomics of Drug Sensitivity in Cancer (GDSC) is a valuable resource for pharmacogenomic research. It has characterized 1000 human cancer cell lines and screened them with hundreds of drug compounds. Currently, for a given drug, GDSC provides ANOVA test results for each of the 700 genomic features. Using machine learning, we aim to identify, for each drug, a panel of features with high predictive value for drug response. We will first use over 700 genetic features as input and IC50 (half-maximal inhibitory concentration) as the output of our model to predict drug response. We will then test whether our panel of features includes the genomic features selected by the ANOVA test.
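
The final comparison step, checking whether a model-derived feature panel includes the ANOVA-selected features, could be sketched as a simple set comparison. The function name and the feature names below are hypothetical placeholders, not GDSC results or the team's code.

```python
def compare_panels(ml_panel, anova_hits):
    """Compare a model-selected feature panel with ANOVA-selected features.

    Returns the shared features, the features unique to each method, and
    the fraction of ANOVA features recovered by the model panel.
    """
    ml, anova = set(ml_panel), set(anova_hits)
    shared = ml & anova
    return {
        "shared": sorted(shared),
        "anova_only": sorted(anova - ml),
        "ml_only": sorted(ml - anova),
        "anova_recall": len(shared) / len(anova) if anova else 0.0,
    }

if __name__ == "__main__":
    # Made-up feature names for one hypothetical drug.
    ml_panel = ["BRAF_mut", "TP53_mut", "KRAS_mut"]
    anova_hits = ["BRAF_mut", "NRAS_mut"]
    print(compare_panels(ml_panel, anova_hits))
```

Running this comparison per drug would show where the two selection methods agree and where the model panel surfaces features that the single-feature ANOVA test does not flag.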

Team Members:
  • Bingfang Ruth Xu (Team Leader) Frederick National Laboratory for Cancer Research (FNLCR), Leidos Biomedical Research, Inc., CMDL
  • Daniel Sierra-Sosa (Technical Lead) Hood College
  • Julie Bocetti (Writer) NICHD, NIH
  • Todd Young, Brooklyn College
  • Brendan Reilly, Postdoctoral Fellow, MIT and Harvard
  • Helga Saizonou, Tropical Infectious Diseases Research Centre (TIDRC), University of Abomey-Calavi (UAC)

Last Reviewed: March 28, 2024