Maximizing Data Extraction Precision with Dual LLMs Integration and Human-in-the-Loop
by info.odysseyx@gmail.com | September 4, 2024

While improving data extraction accuracy is important, verifying the accuracy of the extracted data is equally important. Document Intelligence's layout model, combined with Markdown formatting and semantic chunking, plays a key role in dividing documents into clear sections and subsections. This approach preserves the relationships between sections and between structured formats (e.g., tables, paragraphs, sections, and figures), improving navigation, comprehension, and information retrieval. This structure helps the LLM understand the data more contextually and accurately during extraction. For more information on this concept, see:

Higher accuracy and human involvement

However, our customers still struggle to achieve near-100% accuracy. They are also looking for ways to integrate human verification into their processes, particularly through a Human-in-the-Loop (HITL) approach, to ensure that critical data points such as financial figures, legal terms, or medical data are accurately captured, especially in the early stages before human intervention is phased out where appropriate.

In this post, we propose a dual approach that uses two large language models (LLMs), one for data extraction and one for data validation, in the spirit of "two heads are better than one." Data extraction converts the document to Markdown and uses an LLM (e.g., GPT-4o) to extract the data into JSON based on a predefined schema, then passes the result back to the system. The system then calls data validation, which uses the same schema to extract the data with Document Intelligence and compares it against the data from the first extraction. Any inconsistent data is sent to the front-end UI for human verification.

For our demonstration, we use the latest document field extraction model, which leverages generative AI to accurately extract specific fields from documents regardless of their visual template. This custom model combines Document Intelligence's specialized algorithms with large language models (LLMs) and a precise custom extraction schema. It also provides a confidence score for each field and supports training to further improve accuracy.

Here is a summary of how the entire data extraction and validation process, with a human in the loop, works:

1. Define the extraction schema in JSON format.
2. The system calls data extraction, which converts the PDF or image file into Markdown and sends the Markdown, together with the predefined schema, in the prompt message. Once the output JSON is complete, it is returned to the system.
3. The system initiates data validation by calling the Document Intelligence custom generative AI model (public preview). The document can be submitted for analysis using the REST API or a client library. While the model is effective at extracting simple fields without labeled samples, providing labeled examples can greatly improve accuracy, especially for complex fields such as tables.
4. The validation process compares the values extracted by the two services against the same schema. Any values that do not match are flagged and sent to the user interface (UI) for human validation (a minimal sketch of this comparison step is shown after this list).
5. Users review the mismatched data and select the correct value based on the displayed PDF or image file, with the mismatches highlighted.
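The comparison step in point 4 can be a simple field-by-field check. The sketch below is illustrative rather than the exact implementation from the repository; the function name compare_extractions and the flat {field: value} layout of both inputs are assumptions:

```python
def compare_extractions(llm_result: dict, docintel_result: dict) -> list:
    """Compare the LLM extraction with the Document Intelligence extraction
    and return only the fields that need human review.
    (Illustrative sketch; not the repository's exact implementation.)"""
    mismatches = []
    for field, llm_value in llm_result.items():
        di_value = docintel_result.get(field)
        # Normalize to lower-case strings so 30000 and "30000" are not flagged as different
        if str(llm_value).strip().lower() != str(di_value).strip().lower():
            mismatches.append({
                "field": field,
                "llm_value": llm_value,
                "docintel_value": di_value,
            })
    return mismatches
```

Only the fields returned here are surfaced to the reviewer; everything that matches is accepted automatically.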
Users can also enter a new value if both displayed values are incorrect. This approach, which focuses the review on only the mismatched data rather than every field, leverages the LLM and Document Intelligence to improve accuracy while minimizing the need for extensive human intervention.

JSON Schema:

```json
"docTypes": {
    "custom-docutment-intel-model": {
        "fieldSchema": {
            "apn_number": { "type": "number" },
            "borrower_name": { "type": "string" },
            "lender_name": { "type": "string" },
            "trustee": { "type": "string" },
            "amount": { "type": "number" },
            "date": { "type": "date" }
        },
        "buildMode": "generative"
    }
}
```

Data Extraction (Azure OpenAI):

````python
# "client" is an AzureOpenAI client and "azure_openai_model" is the GPT-4o
# deployment name, both configured elsewhere.
def get_response_from_aoai_with_schema(document_content: str, schema: str):
    """Get a JSON response from the GPT-4o model with schema"""
    system_message = f"""
    ### you are AI assistant that helps extract information from given context.
    - context will be given by the user.
    - you will extract the relevant information using this json schema:
    ```json
    {schema}
    ```
    - if you are unable to extract the information, return JSON with the keys and empty strings or 0 as values.
    - if schema type is date, provide the date as a string in the format "YYYY-MM-DD".
    """

    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": document_content}
    ]

    try:
        response = client.chat.completions.create(
            model=azure_openai_model,  # the deployment name you chose when you deployed the GPT model
            messages=messages,
            response_format={"type": "json_object"},
        )
        response_message = response.choices[0].message
        return response_message.content
    except Exception as e:
        print(f"Error: {e}")
        return None
````

Data Validation in Document Intelligence:

```python
import time
import requests

# "docintel_endpoint", "docintel_key", and "docintel_custom_model_name" are configured elsewhere.
def get_response_from_ai_doc_intel(target_file):
    # get the file from the documents folder in the main directory and submit it for analysis
    with open(target_file, "rb") as f:
        url = f"{docintel_endpoint}documentintelligence/documentModels/{docintel_custom_model_name}:analyze"
        headers = {
            "Ocp-Apim-Subscription-Key": docintel_key,
            "Content-Type": "application/octet-stream"
        }
        params = {
            "api-version": "2024-07-31-preview",
            "outputContentFormat": "markdown"
        }
        submit_analysis = requests.post(url, params=params, headers=headers, data=f)

    if submit_analysis.status_code != 202:
        print(f"Error: {submit_analysis.json()}")
        return None

    # get the operation location to poll for the result
    operation_location = submit_analysis.headers["Operation-Location"]
    print(operation_location)

    # poll until the analysis is done
    while True:
        response = requests.get(operation_location, headers={"Ocp-Apim-Subscription-Key": docintel_key})
        if response.status_code != 200:
            print(f"Error: {response.json()}")
            return None

        analysis_results = response.json()
        if analysis_results["status"] == "running":
            # wait for 5 seconds before polling again
            print("Analysis is still running...")
            time.sleep(5)
            continue

        if analysis_results["status"] != "succeeded":
            print(f"Error: {analysis_results}")
            return None

        return analysis_results["analyzeResult"]
```

Sample output:

```json
{
    "apn_number": 38593847301,
    "borrower_name": "Siyabonga Sithole",
    "lender_name": "Addullo Kholov",
    "trustee": "Fabrikam, Inc",
    "amount": 30000,
    "date": "2024-12-25"
}
```

The front-end interface below demonstrates how, based on our test results, a user can achieve nearly 100% data extraction accuracy by combining three components: the large language model (LLM), Document Intelligence, and human intervention to resolve inconsistencies. Under "Fields" in the left column there is a list of extracted fields; selecting a field's radio button displays the comparison results under "Field Information".
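The values and confidence scores shown under "Field Information" can be read from the analyzeResult returned above. The helper below is a minimal sketch, assuming the result follows the usual documents/fields layout of the Document Intelligence analyze response; the helper name get_field_values is our own:

```python
def get_field_values(analyze_result: dict) -> dict:
    """Flatten the Document Intelligence analyzeResult into
    {field_name: {"value": ..., "confidence": ...}} for the comparison UI.
    Assumes the documents/fields layout of the analyze response."""
    fields = {}
    for document in analyze_result.get("documents", []):
        for name, field in document.get("fields", {}).items():
            # The populated value key depends on the field type,
            # e.g. valueString, valueNumber, valueDate
            value_key = "value" + field.get("type", "string").capitalize()
            fields[name] = {
                "value": field.get(value_key, field.get("content")),
                "confidence": field.get("confidence"),
            }
    return fields
```

Fields whose two extracted values disagree, or whose confidence falls below a chosen threshold, are the ones surfaced for review.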
Green highlighting indicates an exact match, while red highlighting indicates a mismatch with a low confidence score (e.g., 0.159), as shown in the figure below. The user only needs to focus on the red-highlighted fields and either accept the correct value or overwrite it with a new value in the editable text box if both options are incorrect.

Front-end app for human verification

For the detailed implementation, visit the GitHub repository.

Choosing an LLM

Choosing the right LLM typically means choosing a language model that can effectively understand and process text, including document structure and Markdown-specific syntax, and that can perform validation. Azure AI Studio provides a wide selection of language models (LLMs) to choose from.