
Use Microsoft Phi-3.5 to complete text chunking



The arrival of generative AI has accelerated AI transformation across many companies. Data is the key to helping LLMs understand a business and its scenarios, and RAG is currently the most common approach. Because LLMs and SLMs have limited context windows, text data must be chunked, and keeping each chunk contextually coherent is the real technical challenge. Microsoft Phi-3.5 supports 128k tokens, offers image analysis capabilities, and has strong long-context understanding. In this blog, we use the Microsoft Phi-3.5 family as the core of a text-chunking pipeline to improve the efficiency and coherence of the chunked content.

[Figure: SLMRAG.png]

Starting with the text content

We first need to understand the structure of the text. Daily news, contracts, newspapers, and similar documents all consist of three parts: text, images, and tables, and all three help readers understand the content. How do we extract them? Technically, you can combine Python with a variety of AI services: pypdf works for text and images, while Azure Document Intelligence extracts tables more accurately.

Role of Microsoft Phi-3.5

We divided the document into three parts: text, images, and tables, and now use Microsoft Phi-3.5 to understand each of them.

  • text – Text content should be divided into knowledge points so it is more useful for information retrieval. Microsoft Phi-3.5-mini-instruct acts as a reader of the text content, summarizing it and dividing it into information points. The prompt is:

You are an expert in content chunking. Please help me chunk user's input text according to the following requirements

    1. Truncate the text content into chunks of no more than 300 tokens. 

    2. Each chunk part should maintain contextual coherence. The truncated content should be retained in its entirety without any additions or modifications.

    3. Each chunked part is output in JSON format  { \"chunking\": \"...\" }

    4. The final output is a JSON array [{ \"chunking\" : \"...\" },{ \"chunking\" :\"...\"},{ \"chunking\" : \"...\"} ....]

  • image – Images are presented in conjunction with the text content. We can use Microsoft Phi-3.5-vision to understand the content of each image in the text. The prompt is:

You are my analysis assistant, help me analyze charts, flowchart, etc. according to the following conditions

    1. If it is a chart, please analyze according to the data in the chart and tell me the different details

    2. If it is a flowchart, please analyze all the situations in detail according to the flow and describe all processes in detail, do NOT simplify. Use bullet lists with indentation to describe the process

    3. The output is json {"chunking":"......"}
                                    
    4. If it is not a chart or flowchart (more than a single node), it does not need to be analyzed; the output is json {"chunking":"NIL"}
        

  • table – Tables are important. Through Microsoft Phi-3.5-mini-instruct, you can spot trends by analyzing the data in each table within the content. The prompt is:

You are my markdown table assistant, who can understand all the contents of the table and give analysis.                        
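Whichever of the three prompts is used, the model's reply still has to be parsed before indexing. A small sketch of that parsing step, assuming the model follows the JSON-array format requested above (and allowing for the markdown fences Phi-3.5 sometimes adds anyway):

```python
import json

def parse_chunks(raw_reply: str) -> list:
    """Parse a Phi-3.5 reply into plain chunk strings.
    Handles both a JSON array and a single {"chunking": "..."} object,
    and strips markdown code fences the model may add despite the prompt."""
    cleaned = raw_reply.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`").strip()
        if cleaned.startswith("json"):
            cleaned = cleaned[len("json"):]
    data = json.loads(cleaned)
    if isinstance(data, dict):
        data = [data]
    return [item["chunking"] for item in data]
```

The reply itself can be obtained by sending the prompt plus the document text to Phi-3.5-mini-instruct through GitHub Models or an Azure endpoint; the client call is omitted here since it depends on your credentials and SDK of choice.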
        

After completing these three processing steps, you can integrate the data, combine the output JSON, convert it to embeddings, and save it to a vector database. Vectorized storage of the text content is then complete.
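As a sketch of this integration step (the field names `id`, `content`, and `contentVector` are illustrative assumptions, not a fixed schema, and `embed` stands in for any embedding call such as an Azure OpenAI embeddings request):

```python
def to_search_documents(chunks, embed):
    """Shape chunk strings into documents for a vector index.
    `embed` is any callable str -> list[float]; the field names
    here must match whatever index schema you actually define."""
    return [
        {"id": str(i), "content": chunk, "contentVector": embed(chunk)}
        for i, chunk in enumerate(chunks)
    ]
```

With Azure AI Search, documents shaped like this can then be pushed with the `SearchClient.upload_documents` call from the azure-search-documents package, provided the index defines a matching vector field.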

Solution

Text chunking is completed with a Prompt flow pipeline combined with GitHub Models and Azure AI Search, as shown in the picture:

[Figure: SLMF flow.png]

Notes

  • You need to check whether the output is valid JSON; Phi-3.5-mini-instruct sometimes produces malformed output, so validate it with a check_json step.

  • You may need to tune the prompts for each document type; otherwise the model cannot understand the document content accurately enough.

  • This solution uses Azure AI Search as the vector store, but you can switch to a different vector database.

  • GitHub Models lets you make calls more efficiently during the development and verification phases, but for production environments we recommend moving to Azure or using locally deployed models.
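The check_json step mentioned above can be as simple as the following sketch (the exact validation performed in the sample code may differ):

```python
import json

def check_json(raw_reply):
    """Return the parsed chunk list if the model output is valid,
    otherwise None so the caller can retry the completion call."""
    try:
        data = json.loads(raw_reply)
    except (json.JSONDecodeError, TypeError):
        return None
    if isinstance(data, dict):
        data = [data]  # accept a single {"chunking": "..."} object
    if not isinstance(data, list):
        return None
    # Every element must be an object carrying the "chunking" key.
    if any(not isinstance(item, dict) or "chunking" not in item for item in data):
        return None
    return data
```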

You can check the related results through the chat flow.

[Figure: SLMChat.png]

Download sample code – Click here

Summary

How to chunk text more efficiently is an open research problem, and SLMs may be one way to solve it. However, more work is needed to handle the diversity of texts, and these attempts still leave much room for improvement.

Resources

  1. Learn more about Microsoft Phi-3.5 https://huggingface.co/microsoft/Phi-3.5-mini-instruct

  2. Read the Microsoft Phi-3 documentation https://aka.ms/phi-3cookbook

  3. GitHub Model https://gh.io/models

  4. Learn more about Microsoft prompt flow https://microsoft.github.io/promptflow/

  5. Learn more about Azure AI Search https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search




