
Maximizing Data Extraction Precision with Dual LLMs Integration and Human-in-the-Loop



While improving data extraction accuracy is important, verifying the accuracy of the extracted data is equally important. Document Intelligence’s layout model, combined with Markdown formatting and semantic chunking, plays a key role in dividing documents into clear sections and subsections. This approach preserves the relationships between sections and between structured elements (e.g., tables, paragraphs, sections, and figures), improving navigation, comprehension, and information retrieval. This structure helps the LLM understand the data more contextually and accurately during extraction.
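To illustrate the chunking idea, here is a minimal sketch that splits the layout model’s Markdown output into heading-scoped chunks. The splitting rule and the function name are assumptions for illustration, not part of the original pipeline.

import re

def chunk_markdown_by_headings(markdown_text: str) -> list[dict]:
    """Split layout-model Markdown into one chunk per heading section.

    Keeping each heading together with its body preserves local context
    when the chunk is later passed to the LLM.
    """
    chunks = []
    current = {"heading": "", "lines": []}
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line):  # a Markdown heading starts a new chunk
            if current["heading"] or current["lines"]:
                chunks.append(current)
            current = {"heading": line.strip(), "lines": []}
        else:
            current["lines"].append(line)
    if current["heading"] or current["lines"]:
        chunks.append(current)
    return [{"heading": c["heading"], "text": "\n".join(c["lines"]).strip()} for c in chunks]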

Higher accuracy and human involvement

However, our customers still struggle to achieve near-100% accuracy. They are also looking for ways to integrate human verification into their processes, particularly through a Human-in-the-Loop (HITL) approach, to ensure that critical data points such as financial figures, legal terms, or medical data are captured accurately, especially in the early stages, before human intervention is phased out where appropriate.

In this post, we propose a dual approach that uses two large language models (LLMs): one for data extraction and one for data validation, in the spirit of “two heads are better than one.” In the extraction step, the document is converted to Markdown and an LLM (e.g., GPT-4o) extracts the data as JSON based on a predefined schema, then returns it to the system. The system then calls the validation step, which uses the same schema to extract the data again with Document Intelligence and validates it against the data from the first extraction. Any inconsistent data is sent to the front-end UI for human verification.
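At a high level, the flow can be sketched as follows. It reuses the two functions shown later in this post (get_response_from_aoai_with_schema and get_response_from_ai_doc_intel); flatten_di_fields and compare_fields are hypothetical helpers sketched further down, and error handling and the UI hand-off are omitted.

import json

def extract_and_validate(pdf_path: str, markdown_content: str, schema: str) -> dict:
    """Two-pass extraction: GPT-4o extracts from the Markdown, the Document
    Intelligence custom model re-extracts from the original file, and the
    two results are compared so only mismatches go to the review UI."""
    # Pass 1: LLM extraction from the Markdown version of the document
    llm_fields = json.loads(get_response_from_aoai_with_schema(markdown_content, schema))

    # Pass 2: Document Intelligence field extraction from the original file
    analyze_result = get_response_from_ai_doc_intel(pdf_path)
    di_fields, di_confidence = flatten_di_fields(analyze_result)  # hypothetical helper, sketched later

    # Field-by-field comparison; mismatched fields are flagged for human review
    mismatches = compare_fields(llm_fields, di_fields, di_confidence)  # hypothetical helper, sketched later
    return {"fields": llm_fields, "needs_review": mismatches}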

For our demonstration, we use the latest document field extraction model, which leverages generative AI to accurately extract specific fields from documents regardless of their visual templates. This custom model combines Document Intelligence’s specialized algorithms with large language models (LLMs) and a precise custom extraction schema. It also provides a confidence score for each field and supports training to further improve accuracy.

Here’s a summary explaining how this process works:

The entire data extraction and validation process keeps a human in the loop.

  1. Define the fields to extract in a JSON schema.
  2. The system calls the data extraction step, which converts the PDF or image file into Markdown and sends the Markdown, together with the predefined schema, in the prompt message. Once the output JSON is complete, it is returned to the system.
  3. The system then initiates the data validation process by calling the data validation step. The document can be submitted for analysis using the REST API or a client library. While the custom generative AI model (public preview) is effective at extracting simple fields without labeled samples, providing labeled examples can greatly improve accuracy, especially for complex fields such as tables.
  4. The validation process compares the values from both extraction passes against the schema. Any values that do not match are flagged and sent to the user interface (UI) for human validation; a possible implementation of this comparison is sketched after this list.
  5. Users validate the mismatched data and select the correct value based on the displayed PDF or image file with the mismatches highlighted. They can also enter a new value if both displayed values are incorrect. This approach, which focuses on reviewing only the mismatched data rather than every field, leverages the LLM and Document Intelligence to improve accuracy while minimizing the need for extensive human intervention.
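One possible shape for the comparison in steps 4 and 5 is sketched below. The string normalization and the confidence threshold are assumptions for illustration; the sample app may implement this differently.

def compare_fields(llm_fields: dict, di_fields: dict, di_confidence: dict,
                   confidence_threshold: float = 0.8) -> list[dict]:
    """Compare the LLM-extracted values with the Document Intelligence values.

    Returns one entry per mismatched or low-confidence field so the UI can
    show both candidate values side by side for human review.
    """
    mismatches = []
    for name, llm_value in llm_fields.items():
        di_value = di_fields.get(name)
        confidence = di_confidence.get(name, 0.0)
        # Normalize to lowercase strings so "Fabrikam, Inc" matches "fabrikam, inc"
        is_match = str(llm_value).strip().lower() == str(di_value).strip().lower()
        if not is_match or confidence < confidence_threshold:
            mismatches.append({
                "field": name,
                "llm_value": llm_value,
                "document_intelligence_value": di_value,
                "confidence": confidence,
            })
    return mismatches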

JSON Schema:

    "docTypes": {
        "custom-docutment-intel-model": {
            "fieldSchema": {
                "apn_number": {
                    "type": "number"
                },
                "borrower_name": {
                    "type": "string"
                },
                "lender_name": {
                    "type": "string"
                },
                "trustee": {
                    "type": "string"
                },
                "amount": {
                    "type": "number"
                },
                "date": {
                    "type": "date"
                }
            },
            "buildMode": "generative"
        }
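Because the same field definitions drive both passes, the schema can live in a single JSON file and be reused for the LLM prompt. The sketch below assumes the file is named schema.json; the file name is not stated in the original post.

import json

# Load the custom model definition once and reuse its field schema in the LLM prompt
with open("schema.json", "r", encoding="utf-8") as f:
    doc_types = json.load(f)["docTypes"]

field_schema = doc_types["custom-docutment-intel-model"]["fieldSchema"]

# The prompt only needs the field names and types, serialized as a string
schema_for_prompt = json.dumps(field_schema, indent=2)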

Data Extraction (Azure OpenAI)

def get_response_from_aoai_with_schema(document_content: str, schema: str):
    """Get a JSON response from the GPT-4o model with schema"""

    system_message = f"""
    ### you are AI assistant that helps extract information from given context.
    - context will be given by the user.
    - you will extract the relevant information using this json schema:
        ```json
        {schema}
        ```
    - if you are unable to extract the information, return JSON with the keys and empty strings or 0 as values.
    - if schema type is date, provide the date as a string in the format "YYYY-MM-DD".
    """

    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": document_content}
    ]

    try:
        response = client.chat.completions.create(
            model=azure_openai_model, # the deployment name you chose when you deployed the GPT model
            messages=messages,
            response_format={ "type": "json_object" },
        )
        response_message = response.choices[0].message
        return response_message.content
    except Exception as e:
        print(f"Error: {e}")
        return None
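The function above assumes that client and azure_openai_model are already configured. A minimal setup and call might look like the following; the endpoint, key, API version, deployment name, and document.md file name are placeholders rather than values from the original post, and schema_for_prompt is the serialized field schema from the earlier sketch.

import os
from openai import AzureOpenAI

# Placeholder configuration; use your own Azure OpenAI resource and deployment
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)
azure_openai_model = "gpt-4o"  # the deployment name you chose for the GPT-4o model

# document_content is assumed to hold the Markdown produced by the layout model
with open("document.md", "r", encoding="utf-8") as f:
    document_content = f.read()

extracted_json = get_response_from_aoai_with_schema(document_content, schema_for_prompt)
print(extracted_json)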

Data Validation in Document Intelligence:

import time

import requests

def get_response_from_ai_doc_intel(target_file):
    # read the file from the documents folder in the main directory
    with open(target_file, "rb") as f:
        url = f"{docintel_endpoint}documentintelligence/documentModels/{docintel_custom_model_name}:analyze"
        headers = {
            "Ocp-Apim-Subscription-Key": docintel_key,
            "Content-Type": "application/octet-stream"
        }
        params = {
            "api-version": "2024-07-31-preview",
            "outputContentFormat": "markdown"
        }
        submit_analysis = requests.post(url, params=params, headers=headers, data=f)

        if submit_analysis.status_code != 202:
            print(f"Error: {submit_analysis.json()}")
            return None

        # get the operation location to poll for the result
        operation_location = submit_analysis.headers["Operation-Location"]
        print(operation_location)

        # poll until the analysis is done
        while True:
            response = requests.get(operation_location, headers={"Ocp-Apim-Subscription-Key": docintel_key})

            if response.status_code != 200:
                print(f"Error: {response.json()}")
                return None
            
            analysis_results = response.json()

            if analysis_results["status"] == "running":
                # wait for 5 seconds
                print("Analysis is still running...")
                time.sleep(5)
                continue
            
            if analysis_results["status"] != "succeeded":
                print(f"Error: {analysis_results}")
                return None
            
            return analysis_results["analyzeResult"]
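To compare the two passes, the analyzeResult can be flattened into plain value and confidence dictionaries keyed by field name. The sketch below assumes the documented custom-model response shape (documents[0].fields with per-field content, typed value* keys, and confidence); adjust the keys if your model version returns a different structure.

def flatten_di_fields(analyze_result: dict) -> tuple[dict, dict]:
    """Flatten a Document Intelligence analyzeResult into two dicts:
    field name -> extracted value, and field name -> confidence score."""
    values, confidences = {}, {}
    documents = analyze_result.get("documents", []) if analyze_result else []
    if not documents:
        return values, confidences

    for name, field in documents[0].get("fields", {}).items():
        # Prefer a typed value* key (valueString, valueNumber, valueDate, ...)
        # and fall back to the raw "content" string otherwise
        typed_values = [v for k, v in field.items() if k.startswith("value")]
        values[name] = typed_values[0] if typed_values else field.get("content")
        confidences[name] = field.get("confidence", 0.0)
    return values, confidences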

Output:

{
 "apn_number": 38593847301,
 "borrower_name": "Siyabonga Sithole",
 "lender_name": "Addullo Kholov",
 "trustee": "Fabrikam, Inc",
 "amount": 30000,
 "date": "2024-12-25"
}

The front-end interface below demonstrates how a user can achieve nearly 100% data extraction accuracy by combining three elements: a large language model (LLM), Document Intelligence, and human intervention to resolve inconsistencies, based on our test results. Under “Fields” in the left column is a list of extracted fields. Selecting a field’s radio button displays the comparison results under “Field Information”. Green highlighting indicates an exact match, while red highlighting indicates a mismatch with a low confidence score (e.g., 0.159), as shown in the figure below. The user only needs to focus on these red-highlighted fields and either accept the correct value or overwrite it with a new value in the editable text box if both displayed options are incorrect.

Front-end app for human verification

For the detailed implementation, visit the GitHub repository.

Choosing an LLM

Choosing the right LLM typically means selecting a language model that can effectively understand and process text, including document structure and Markdown-specific syntax, and that can support validation. AI Studio, as a platform, provides a variety of language models (LLMs) to choose from.




