Use Microsoft Phi-3.5 to complete text chunking

by info.odysseyx@gmail.com, November 4, 2024

The introduction of Generative AI has accelerated AI transformation at more companies. Data is key to helping LLMs better understand a business or a given scenario, and retrieval-augmented generation (RAG) is currently the most common way to supply it. Because LLMs and SLMs accept a limited number of tokens, text data needs to be chunked, and keeping each chunk contextually relevant is a technical problem. Microsoft Phi-3.5 supports 128k tokens, offers image analysis, and has strong long-context understanding. In this blog, we will use the Microsoft Phi-3.5 family as the core of text chunking to improve the efficiency and consistency of the chunked content.

Starting with the text content

We need to understand text structure. Daily news, contracts, newspapers, and similar documents all consist of three parts: text, images, and tables. Together these three parts help readers understand the content. How do we extract them? From a technical perspective, you can combine Python with a variety of AI technologies. For text and images you can use pypdf, and for tables you can use Azure Document Intelligence to get more accurate content.

Role of Microsoft Phi-3.5

We have divided the document into three parts: text, images, and tables. Now we use Microsoft Phi-3.5 to understand each of them.

Text – Text content should be divided into units of knowledge so that it is more useful for information retrieval. Microsoft Phi-3.5-mini-instruct acts as the reader of the text, summarizing it and dividing it into information points. The prompt is:

You are an expert in content chunking. Please help me chunk user's input text according to the following requirements
1. Truncate the text content into chunks of no more than 300 tokens.
2. Each chunk part should maintain contextual coherence. The truncated content should be retained in its entirety without any additions or modifications.
3. Each chunked part is output JSON format { "chunking": "..." }
4. The final output is a JSON array [{ "chunking" : "..." },{ "chunking" :"..."},{ "chunking" : "..."} ....]

Image – Images are presented alongside the text content. We can use Microsoft Phi-3.5-Vision to understand the content of each image in the document. The prompt is:

You are my analysis assistant, help me analyze charts, flowchart, etc. according to the following conditions
1. If it is a chart, please analyze according to the data in the chart and tell me the different details
2. If it is a flowchart, please analyze all the situations in detail according to the flow and describe all process in details, do NOT simplify. Use bullet lists with identation to describe the process
3. The output is json {"chunking":"......"}
4. If it is not a chart or flowchart(more than single node),it does not need to be analyzed, the output is json {"chunking":"NIL"}

Table – Tables are important. With Microsoft Phi-3.5-mini-instruct, you can analyze the data in each table in the content to identify development trends. The prompt is:

You are my markdown table assistant, who can understand all the contents of the table and give analysis.

After processing these three parts, you can merge the resulting JSON output, convert it to embeddings, and save it to a vector database. This completes the vectorized storage of the text content.

Solution

Text chunking is implemented with prompt flow combined with GitHub Models and Azure AI Search, as shown in the picture.

Notes

The output must be valid JSON, but Phi-3.5-mini-instruct sometimes returns malformed output, so validate it with check_json.

You may need to adjust the prompts for each document. Otherwise, the model will not understand the content of the document accurately.

This solution uses Azure AI Search as the vector store, but you can switch to a different vector database.
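The notes above mention validating model output with check_json. A minimal sketch of such a check might look like the following. The function body here is an assumption written for illustration, not the implementation from the sample code; it only relies on the output shapes described in the prompts (a JSON array of { "chunking": "..." } objects for text, a single object for images, and "NIL" for images that need no analysis).

```python
import json
import re
from typing import List, Optional


def check_json(raw: str) -> Optional[List[str]]:
    """Return the chunk strings if `raw` holds the expected JSON, else None.

    Hypothetical stand-in for the check_json step named in the notes;
    the sample code's own implementation may differ.
    """
    text = raw.strip()
    # Models sometimes wrap JSON in a Markdown code fence; unwrap it if present.
    fenced = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    # The image prompt yields a single object; the text prompt yields an array.
    if isinstance(data, dict):
        data = [data]
    if not isinstance(data, list):
        return None
    chunks = []
    for item in data:
        if not isinstance(item, dict) or not isinstance(item.get("chunking"), str):
            return None
        # "NIL" marks an image that did not need analysis, so it is skipped.
        if item["chunking"] != "NIL":
            chunks.append(item["chunking"])
    return chunks
```

When the function returns None, one reasonable option is to call the model again with the same input, or to fall back to a simpler fixed-size split for that passage.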
GitHub Models lets you make calls more efficiently during the development and verification phases, but for production environments we recommend moving to Azure or using locally deployed models.

You can check the related results through Chat-flow.

Download sample code – Click here

Summary

How to complete text chunking more efficiently is an academic problem, and SLMs may be one of the ways to solve it. However, more work is needed to handle the diversity of texts, and these attempts still have much room for improvement.

Resources

Learn more about Microsoft Phi-3.5: https://huggingface.co/microsoft/Phi-3.5-mini-instruct
Read the Microsoft Phi-3 documentation: https://aka.ms/phi-3cookbook
GitHub Models: https://gh.io/models
Learn more about Microsoft prompt flow: https://microsoft.github.io/promptflow/
Learn more about Microsoft AI Search: https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search