
Genomics + LLMs: A Case Study on adding variant annotations to LLMs through RAG and Fine-tuning



This blog was co-written by Lu Shuangjia, Research Intern at Microsoft Research and Ph.D. student in Genetics at Yale University.

Abstract:

This blog shows how to add genomics domain knowledge to large language models (LLMs) such as GPT-4 and GPT-4o through retrieval-augmented generation (RAG) and fine-tuning, using Azure OpenAI and Azure AI Search. The specific genomics knowledge we added is variant annotation data, which is critical for interpreting genetic reports and diagnosing diseases. Users can now query GPT-4o for specific variants and receive accurate variant annotations and interpretations, supported by the LLM's advanced inference and summarization capabilities.

Introduction:

  • RAG:

An LLM's knowledge may not be sufficient for the domain you are querying, or you may want the model to answer questions using your own data. This is where RAG helps. RAG is an approach that improves the accuracy and relevance of a model's responses by searching your own data and retrieving relevant information [1].

  • Fine-tuning:

Fine-tuning, also referred to as supervised fine-tuning (SFT), is the process of further training a pre-trained base model to adapt it to a specific domain, tone, or format. By fine-tuning the model on a new dataset, the model weights are adjusted to produce the desired responses [2].

  • Genetic variant annotation data:

Variant annotation datasets provide detailed information about genetic variants and their biological and clinical significance. These datasets are important for interpreting and prioritizing genetic variants potentially involved in disease. Variant annotation covers several aspects, including variant description and identification, affected genes, affected proteins, molecular consequences, population frequency, and associated clinical conditions; a simplified illustrative record is sketched below.
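To make the structure concrete, here is a minimal illustrative record in Python. The field names mirror the searchable fields used later in this post, and every value is a placeholder rather than an entry from the actual datasets.

```python
# Illustrative variant annotation record; all values are placeholders,
# not entries from the datasets used in this post.
example_annotation = {
    "chr:pos": "chr16:14555693",        # chromosome and position
    "rsid": "rs000000000",              # variant identifier (placeholder)
    "gene": "EXAMPLE_GENE",             # affected gene (placeholder)
    "Condition": "Example condition",   # associated clinical condition (placeholder)
    "content": "REF=A;ALT=G; molecular consequence and population frequency summary",
}
print(example_annotation["gene"])
```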

Methods:

We integrated 189 million variant annotations from multiple datasets into the GPT-4o and GPT-4 models through RAG or fine-tuning, and evaluated model performance by measuring output accuracy.


Building RAG for GPT-4o using Azure AI Search and Azure OpenAI


R1) Download data and convert to CSV format: We downloaded the variant annotation data, in VCF format [3], from the source database. To ensure compatibility with the search indexing step that follows, we converted the VCF files to CSV using bcftools [4] and standard command-line tools. Below is an example showing the process, including three variants in VCF format, the commands used to convert the format, and the resulting output CSV file, followed by a conversion sketch.

[Figure: three example variants in VCF format, the conversion commands, and the resulting CSV output]
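A minimal sketch of this conversion in Python, assuming bcftools is installed and that the annotation values live in INFO tags named GENEINFO and CLNDN (ClinVar-style tags); the file names and INFO tag names are assumptions, so adjust them to your dataset.

```python
# Sketch of step R1: flatten an annotated VCF into a CSV with bcftools query.
import csv
import subprocess

VCF_PATH = "variants.annotated.vcf.gz"   # hypothetical input file
CSV_PATH = "variants.annotated.csv"

# bcftools query prints one variant per line using a printf-style format string.
fmt = "%CHROM:%POS\t%ID\t%INFO/GENEINFO\t%INFO/CLNDN\t%REF>%ALT\n"
result = subprocess.run(
    ["bcftools", "query", "-f", fmt, VCF_PATH],
    check=True, capture_output=True, text=True,
)

with open(CSV_PATH, "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["chr:pos", "rsid", "gene", "Condition", "content"])
    for line in result.stdout.splitlines():
        writer.writerow(line.split("\t"))
```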

R2) Split the CSV file into smaller files to stay within the limits of the Azure AI Search S1 pricing tier; each file contains fewer than 4 million characters.
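A simple sketch of this splitting step, assuming the CSV produced in R1; the 4-million-character budget comes from the tier limit mentioned above.

```python
# Sketch of step R2: split a large CSV into chunks of < 4 million characters,
# repeating the header in each chunk.
MAX_CHARS = 4_000_000

def split_csv(path: str, prefix: str = "variants_part") -> None:
    with open(path) as src:
        header = src.readline()
        part, chars, out = 0, MAX_CHARS, None
        for row in src:
            if chars + len(row) > MAX_CHARS:     # start a new chunk
                if out:
                    out.close()
                part += 1
                out = open(f"{prefix}_{part:04d}.csv", "w")
                out.write(header)
                chars = len(header)
            out.write(row)
            chars += len(row)
        if out:
            out.close()

split_csv("variants.annotated.csv")
```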

R3) Transfer the data to Azure Blob Storage so it is accessible to the Azure AI Search service.
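A sketch of the upload using the azure-storage-blob SDK; the connection string environment variable, container name, and file pattern are placeholders.

```python
# Sketch of step R3: upload the CSV chunks to Azure Blob Storage.
import glob
import os
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("variant-annotations")  # assumed container name

for path in glob.glob("variants_part_*.csv"):
    with open(path, "rb") as data:
        container.upload_blob(name=os.path.basename(path), data=data, overwrite=True)
```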

R4) Create an Azure AI Search service and define the data source in Azure AI Search as Azure Blob Storage: Azure blob indexer – Azure AI Search | Microsoft Learn.
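A sketch of registering the blob container as a data source with the azure-search-documents SDK; the endpoint and key environment variables and the resource names are assumptions.

```python
# Sketch of step R4: register the blob container as an Azure AI Search data source.
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
)

indexer_client = SearchIndexerClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["SEARCH_ADMIN_KEY"]),
)

data_source = SearchIndexerDataSourceConnection(
    name="variant-annotations-ds",
    type="azureblob",
    connection_string=os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    container=SearchIndexerDataContainer(name="variant-annotations"),
)
indexer_client.create_or_update_data_source_connection(data_source)
```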


R5) Define a search index schema based on the data structure (Create an index – Azure AI Search). We created five searchable fields: chr:pos, rsid, gene, Condition, and content.
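A sketch of the index definition in Python. Azure AI Search field names are limited to letters, digits, and underscores, so "chr:pos" is written here as "chr_pos"; that rename, the key field, and the index name are assumptions on top of the field list above.

```python
# Sketch of step R5: define a search index with the five searchable fields.
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchableField,
    SearchFieldDataType,
    SearchIndex,
    SimpleField,
)

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),   # required key field
    SearchableField(name="chr_pos", type=SearchFieldDataType.String),
    SearchableField(name="rsid", type=SearchFieldDataType.String),
    SearchableField(name="gene", type=SearchFieldDataType.String),
    SearchableField(name="Condition", type=SearchFieldDataType.String),
    SearchableField(name="content", type=SearchFieldDataType.String),
]

index_client = SearchIndexClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["SEARCH_ADMIN_KEY"]),
)
index_client.create_or_update_index(SearchIndex(name="variant-annotations-index", fields=fields))
```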


R6) Load data from the data source into the search index through the blob indexer: Load an index – Azure AI Search.
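A sketch of creating and running the indexer; the names match the earlier sketches and are placeholders. For CSV blobs the indexer also needs its parsing mode set to delimited text, and a field mapping from the CSV column "chr:pos" to "chr_pos" may be required; both are omitted here for brevity.

```python
# Sketch of step R6: create and run a blob indexer that fills the search index.
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import SearchIndexer

indexer_client = SearchIndexerClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["SEARCH_ADMIN_KEY"]),
)

indexer = SearchIndexer(
    name="variant-annotations-indexer",
    data_source_name="variant-annotations-ds",
    target_index_name="variant-annotations-index",
)
indexer_client.create_or_update_indexer(indexer)  # register the indexer
indexer_client.run_indexer(indexer.name)          # start (or re-run) ingestion
```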


R7) Add the search index to GPT-4o in Azure OpenAI through the "Add your own data" feature: Use your data with Azure OpenAI Service – Azure OpenAI | Microsoft Learn.
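A sketch of querying the GPT-4o deployment with the search index attached through the On Your Data extension. The deployment name, API version, environment variables, and example question are placeholders, and the exact data_sources payload shape depends on the Azure OpenAI API version you target.

```python
# Sketch of step R7: chat completion with the Azure AI Search index attached.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-4o",  # the deployment name of the GPT-4o model
    messages=[{"role": "user", "content": "What is known about the variant at chr16:14555693?"}],
    extra_body={
        "data_sources": [{
            "type": "azure_search",
            "parameters": {
                "endpoint": os.environ["SEARCH_ENDPOINT"],
                "index_name": "variant-annotations-index",
                "authentication": {"type": "api_key", "key": os.environ["SEARCH_ADMIN_KEY"]},
            },
        }]
    },
)
print(response.choices[0].message.content)
```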

Fine-tuning GPT-4 in Azure OpenAI


F1) Convert the annotated VCF files into JSONL prompts in the format required for GPT-4 fine-tuning. Below is an example showing three input variants, along with the command used to convert the file format and the resulting JSONL output, followed by a conversion sketch.

[Figure: three example input variants, the format-conversion command, and the resulting JSONL output]
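A sketch of building chat-format JSONL examples from the CSV produced in step R1; the system prompt, question wording, and answer template are assumptions, while the {"messages": [...]} layout is the schema required for chat-model fine-tuning.

```python
# Sketch of step F1: one training example per variant, in chat JSONL format.
import csv
import json

SYSTEM = "You are a genomics assistant that returns variant annotations."

with open("variants.annotated.csv") as src, open("variants_train.jsonl", "w") as out:
    for row in csv.DictReader(src):
        example = {
            "messages": [
                {"role": "system", "content": SYSTEM},
                {"role": "user", "content": f"Annotate the variant at {row['chr:pos']}."},
                {"role": "assistant",
                 "content": f"rsid: {row['rsid']}; gene: {row['gene']}; "
                            f"condition: {row['Condition']}; {row['content']}"},
            ]
        }
        out.write(json.dumps(example) + "\n")
```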

F2) Upload 3,000 training variants and 1,000 validation variants to Azure OpenAI.

F3) Fine-tune a GPT-4 model in Azure OpenAI.
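A sketch of steps F2 and F3 with the openai Python SDK pointed at an Azure OpenAI endpoint; file names, environment variables, and the API version are placeholders, and the base model available for fine-tuning depends on your region and quota.

```python
# Sketch of steps F2-F3: upload training data and start a fine-tuning job.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)

train = client.files.create(file=open("variants_train.jsonl", "rb"), purpose="fine-tune")
valid = client.files.create(file=open("variants_valid.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="gpt-4",            # base model; availability varies by region
    training_file=train.id,
    validation_file=valid.id,
)
print(job.id, job.status)
```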

Model performance evaluation

We evaluated the performance of the models using output accuracy. For the fine-tuned model, we randomly selected 100 variants from the training set and calculated the exact match between the model output and the ground truth, to measure the model's ability to recall the information it was trained on and reproduce the desired output. For the RAG model, 100 variants were randomly selected from the input dataset and exact matches were calculated in the same way.
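A minimal sketch of the exact-match computation; `query_model` is a hypothetical callable wrapping a call to the RAG-enabled or fine-tuned deployment, and the string comparison is a simplification of the matching used in practice.

```python
# Sketch of the evaluation: exact-match accuracy on 100 sampled variants.
import random

def exact_match_accuracy(variants, ground_truth, query_model, field):
    sample = random.sample(variants, k=100)
    hits = sum(
        1 for v in sample
        if query_model(v, field).strip() == ground_truth[v][field].strip()
    )
    return hits / len(sample)
```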

Results:

Initial tests of the base GPT-4o model revealed limited genomics knowledge: it returned 0 out of 100 correct answers when queried for variant information.

We achieved significant improvements by leveraging RAG. With RAG, we successfully integrated 189 million variant annotations into the GPT-4o model and achieved 100% accuracy across all annotation fields on a test set of 100 variants. Now, when users query a variant, they not only receive accurate annotation information but also benefit from interpretation supported by GPT-4o's inference and summarization capabilities. The following example shows how model performance on the same user query improves when external data is integrated through RAG. Before RAG, the model's answers were often generic or misleading. After implementing RAG, the model provides accurate information and can offer further interpretation.

[Figure: example model responses to the same variant query before RAG and after RAG]

Fine-tuning GPT-4 on variant annotation data also improved performance on some annotation fields, but accuracy on most fields remained suboptimal. Initially, we fine-tuned GPT-4 to predict 13 annotation fields (e.g., ID, gene, disease name) from a user-provided chromosome position (e.g., chr16:14555693). Despite testing multiple input formats and iteration strategies, the average output accuracy for each field remained around 0.2. To better understand how to improve fine-tuned model performance, we adapted our approach and fine-tuned on a single field. Specifically, when we fine-tuned the model on the gene field only, it achieved an accuracy of 95%. However, when we scaled up to predict multiple fields simultaneously, accuracy dropped significantly. This led us to conclude that the more information we add, and the less frequently each piece occurs, the harder it is for the model to learn it through fine-tuning.

Discussion:

During this exploration, we found that RAG outperforms supervised fine-tuning for adding factual information to LLMs in terms of data volume, accuracy, and cost-effectiveness. In the future, we aim to explore embedding strategies and GraphRAG to expand application scenarios and integrate more genomics and clinical data with complex relationships. To further improve the model's performance and utility in genomics, we are considering enhancing the underlying genomics knowledge of LLMs through unsupervised learning or dictionary learning. We will also explore the use of small language models such as Phi-3 to safely leverage private data.

As a first effort to add genomics information to LLMs, our model paves the way for developing more comprehensive and useful genomics AI to support clinical diagnostics and research projects, demonstrating the potential of LLMs in specialized domains.

Disclaimer:

Although we strive to ensure the accuracy and reliability of the information generated by this large language model, we cannot guarantee its completeness or accuracy. To obtain accurate and up-to-date advice, users should independently verify all information and consult experts in the relevant field. We used fully public datasets from the Microsoft Genomics Data Lake [5]; no personally identifiable data was used.

Resources:

  1. Learn more about RAG: Retrieval augmented generation in Azure AI Studio – Azure AI Studio | Microsoft Learn
  2. Fine-tuning in Azure OpenAI: Customize models with the Azure OpenAI Service – Azure OpenAI | Microsoft Learn
  3. Introduction to the VCF file format: VCF – Variant Call Format – GATK (broadinstitute.org)
  4. Extract VCF information using bcftools: bcftools(1) (samtools.github.io)
  5. Microsoft Genomics Data Lake: Genomics Data Lake – Azure Open Datasets | Microsoft Learn




