MLxAI Codeathon Team Projects

ToxPipe: Semi-autonomous AI integration of diverse toxicological data streams

ToxPipe is an application that uses large language models (LLMs), LangChain, and various tools and data sources to answer toxicological queries about chemicals. ToxPipe currently pulls information from PubMed, PubChem, Semantic Scholar, and RDKit, and is inspired by and adapted from ChemCrow.

Team Members:

Trey Saddler (Team Leader) DTT, NIEHS, NIH
Parker Combs (Technical Lead) DTT, NIEHS, NIH
Virginie Grosboillot, University of Ljubljana
Grzegorz Boratyn, NCBI, NLM, NIH
Yixing Han, NHGRI, NIH
David Li, NIA, NIH
Olawale Ogundeji, University of Leeds
Mike Conway (Ad-hoc member) DTT, NIEHS, NIH
Scott Auerbach (Ad-hoc member) DTT, NIEHS, NIH

Harnessing the Microbiome: From Microbial Genes in the Gut to Intestinal Function and Drug Absorption

The scientific goal of the project is to build an AI/ML model to predict the impact of a bacterial species on human intestinal function in inflammatory bowel diseases (IBD), specifically drug absorption and metabolism.

Team Members:

Abhinav Bhushan (Team Leader) Illinois Institute of Technology
Christopher Tang (Technical Lead)
Abhinav Sur, NICHD, NIH
Gayathri Jahan Mohan
Gobikrishnan Subramaniam, Queen's University
Jooho Lee, D4CG, University of Chicago
Karan Jogi, Discovery Partners Institute
Soham Shirolkar, University of South Florida
Viktoriia Liu

RAGVar MLXAI Codeathon Team

The goal of RAGVar is to build a system for harmonizing your data with the data that already exists within a data repository such as NCBI or NHLBI BDC. Harmonization is a major challenge for all research repositories because data sources often lack the clear direction, the resources, or both to align data before ingest. As a result, data must be harmonized retrospectively, either by the data users or through manual efforts by the repository teams. The RAGVar system evaluates the potential of retrieval-augmented generation and AI reasoning to determine how new data, provided in a data dictionary, aligns with the data existing within a repository. RAGVar focuses on the initial step: aligning user-provided variables, in the form of a data dictionary with descriptive variable labels, with the existing variables in the corpus of data that a user may want to align with. After identifying prospective similar variables, RAGVar ranks which existing variable is most likely to match each new variable and attempts to tell the user how to harmonize their variables with the existing data sets.
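The retrieve-and-rank step described above can be illustrated with a minimal sketch. This is not RAGVar's implementation: it swaps in a simple bag-of-words cosine similarity for the retrieval and AI reasoning components, and the repository variables, labels, and function names are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(label: str) -> Counter:
    """Lowercase a variable label and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", label.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(new_label: str, repository: dict, top_k: int = 3) -> list:
    """Rank repository variables by label similarity to a new variable label."""
    query = tokenize(new_label)
    scored = [(name, cosine(query, tokenize(label)))
              for name, label in repository.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical repository data dictionary: variable name -> descriptive label.
REPOSITORY = {
    "sbp_visit1": "Systolic blood pressure at visit 1 (mmHg)",
    "bmi_base": "Body mass index at baseline",
    "age_enroll": "Age at enrollment in years",
}

if __name__ == "__main__":
    for name, score in rank_candidates("Baseline systolic blood pressure", REPOSITORY):
        print(f"{name}\t{score:.2f}")
```

In a fuller system the token-count vectors would presumably be replaced by learned embeddings, and the top-ranked candidates would be passed to a language model for the harmonization advice.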

Team Members:

David Beaumont (Team Leader) Center for Data Modernization Solutions, SSES, RTI International
Corey Cox (Technical Lead) University of Colorado, Anschutz
Nathaniel Braswell, Center for Data Modernization Solutions, SSES, RTI International
Stephen Hwang, Center for Data Modernization Solutions, SSES, RTI International
Oswaldo Alonso Lozoya, Center for Data Modernization Solutions, SSES, RTI International

SPARCLE Curation

To assist a small team of experts who manually annotate subfamily protein architectures in the SPARCLE database, we trained a decision tree and an NLP model on the roughly 42k expert-curated architectures to automate the assignment of labels to the remaining roughly 200k uncurated protein architectures.

Team Members:

Marc Gwadz (Team Leader) NCBI, NLM, NIH
Mingzhang Yang (Technical Lead) NCBI, NLM, NIH
Christopher Meyer (Writer) University of Chicago
Franziska Ahrend, ORISE Fellow at NIDDK, NIH
Yixiang Deng, MIT and Harvard
Shaojun Xie, NCI, NIH

A Gentle Introduction to ML/AI as Applied to Antibody Engineering

The project focused on developing resources and documentation for teaching data science and machine learning / artificial intelligence (ML/AI) concepts related to antibody engineering. Immune profiling (immunoprofiling) datasets were used as a source of antibody sequences for both data science and ML. The team developed Jupyter notebooks to undertake comparative analyses of iReceptor datasets, and then incorporated the AbLang2 antibody-specific language model to characterize data from CoV-AbDab. A dictionary and glossary defining essential computer and biology terms related to the computations in the Jupyter notebooks were also developed.

Team Members:

Todd Smith (Team Leader) Digital World Biology, LLC
Herminio Vazquez (Technical Lead) Copado Inc.
Stephen Panossian (Writer)
Zainab Adenaike, NCBI, NLM, NIH
Jake Lance, University of Toronto
Mohsen Sharifi Renani, Spotify AB

Random Forest for Genomic Association Detection

Use Random Forest (RF) to detect high-order interactions among genomic and other omic features associated with the phenotype.

Team Members:

Weiping Chen (Team Leader) NIDDK, NIH
Guanjie Chen (Technical Lead) NHGRI, NIH
Qing Li (Writer) NHGRI, NIH
Chimenya Ntweya, Queen Elizabeth Central Hospital

ClinCluster: A Package for Aggregating Disease Terms in ClinVar

Naming for human genetic diseases is complex. Diseases may be named for the phenotype; other information may be alluded to including the relevant gene, mode of inheritance, or the mechanism of disease. Diseases may be described at a high level with a generic name, or at a lower level with a more specific name; however, in the context of a variant in a specific gene, these differences may not be considered important. ClinVar data would be easier to ingest in bulk and to read in web displays if there were a meaningful way to aggregate diseases that effectively mean the same thing in the context of a gene. Problem Statement: Diseases in ClinVar are very granular and result in many variant-disease records. Can we use an ML/AI approach to aggregate disease terms in ClinVar to reduce the number of variant-disease records?

Team Members:

Melissa Landrum (Co-Team Leader) NCBI, NLM, NIH
Guangfeng Song (Co-Team Leader) NCBI, NLM, NIH
Lauren Edgar, NHGRI, NIH
Benjamin Kesler, Vanderbilt University
Nicholas Minor, University of Wisconsin
Michael Muchow
Rebecca Orris, NCBI, NLM, NIH
Wengang Zhang, NCI, NIH

Exploring LLM Applications to SRA Metadata Normalization

The NCBI Sequence Read Archive (SRA) contains valuable research data, but its metadata is often unstandardized, leading to misspellings and variations in how data is described. This project aims to use machine learning and AI to standardize these metadata fields.
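To make the normalization problem concrete, here is a minimal string-matching baseline. It is not the project's LLM-based approach: it uses Python's standard `difflib` module to map misspelled or variably formatted values onto a controlled vocabulary. The vocabulary, the field it represents, and the `normalize_value` helper are hypothetical.

```python
import difflib
from typing import List, Optional

# Hypothetical controlled vocabulary for a single metadata field (tissue).
CANONICAL_TISSUES = ["blood", "liver", "lung", "brain", "kidney"]


def normalize_value(raw: str, vocabulary: List[str],
                    cutoff: float = 0.8) -> Optional[str]:
    """Map a raw metadata value to its closest canonical term.

    Returns None when no vocabulary term is at least `cutoff` similar,
    so unfamiliar values are flagged rather than silently rewritten.
    """
    cleaned = raw.strip().lower()
    matches = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for value in [" Livr ", "Bloood", "soil"]:
        print(repr(value), "->", normalize_value(value, CANONICAL_TISSUES))
```

A baseline like this handles simple misspellings but not synonyms or free-text descriptions, which is where a language model could plausibly add value.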

Team Members:

Jonathan Gunti (Co-Team Leader) NCBI, NLM, NIH
Ryan Conner (Co-Team Leader) NCBI, NLM, NIH
Andrey Kochergin (Technical Lead) NCBI, NLM, NIH
Corinne Matti (Writer) NCBI, NLM, NIH
Priyanka Ghosh, NCBI, NLM, NIH
Moyo Williams, NCBI, NLM, NIH
Vadim Zalunin, NCBI, NLM, NIH

Visualizations of Nucleotide Sequences in 3-Dimensional Space using Ziro Studio and Entrez

Building human genome visualizations for the next generation of CRISPR enthusiasts, genome sequencers, and bioinformatics engineers.
Genetic editing is becoming more prevalent: the pivotal 2012 CRISPR paper by Dr. Doudna and Dr. Charpentier has led to several applications of the initial CRISPR-Cas9 system, especially toward cures for common somatic cell autoimmune diseases. Leveraging our knowledge of CRISPR, alongside the NLM NCBI libraries, we look to visualize the hemoglobin subunit beta (HBB) gene associated with sickle cell anemia as a stepping stone to better visualizing how CRISPR-Cas9 systems can affect the mutated regions. In doing so, we can better communicate the efficacy of treatments such as CASGEVY to a younger audience of K-12 students and inspire them to learn more about genome sequencing and editing.

Team Members:

John Hall (Team Leader) PASTEM.ORG
Raja Jasti, CEO, Ziro Studio
Hari Parthasarathy, UC Berkeley
Christopher Fluta, Drexel University
Dimple Amitha Garuadapuri, UC Berkeley
Daniel Chen, UC Berkeley

Machine Learning with Genomic Information for Predicting Cancer Drug Response

The Genomics of Drug Sensitivity in Cancer (GDSC) project is a valuable resource for pharmacogenomic research. It has characterized 1000 human cancer cell lines and screened them with hundreds of drug compounds. Currently, for a given drug, GDSC provides ANOVA test results for each of the 700 genomic features. We aim to use machine learning to identify, for each individual drug, a panel of features with high predictive value for drug response. We will first use over 700 genetic features as input and IC50 (half-maximal inhibitory concentration) as the output in our model to predict drug response. We will then test whether our panel of features includes the genomic features selected by the ANOVA test.
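The final comparison step, checking a model-selected panel against the ANOVA-selected features, can be sketched as a simple set overlap. This is only an illustration of that one step, assuming both selections are available as sets of feature names; the feature names and the `panel_overlap` helper are hypothetical.

```python
def panel_overlap(model_panel: set, anova_features: set) -> dict:
    """Summarize agreement between two feature selections for one drug."""
    shared = model_panel & anova_features
    union = model_panel | anova_features
    return {
        "shared": sorted(shared),
        "model_only": sorted(model_panel - anova_features),
        "anova_only": sorted(anova_features - model_panel),
        # Jaccard index: size of the intersection over size of the union.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical selections for a single drug.
    model_panel = {"TP53_mut", "BRAF_mut", "KRAS_mut"}
    anova_features = {"BRAF_mut", "KRAS_mut", "PTEN_loss"}
    print(panel_overlap(model_panel, anova_features))
```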

Team Members:

Bingfang Ruth Xu (Team Leader) Frederick National Laboratory for Cancer Research (FNLCR), Leidos Biomedical Research, Inc., CMDL
Daniel Sierra-Sosa (Technical Lead) Hood College
Julie Bocetti (Writer) NICHD, NIH
Todd Young, Brooklyn College
Brendan Reilly, Postdoctoral Fellow, MIT and Harvard
Helga Saizonou, Tropical Infectious Diseases Research Centre (TIDRC), University of Abomey-Calavi (UAC)

Last Reviewed: March 28, 2024