Streamline Document Indexing in Azure Cosmos DB Using Logic Apps Automation
by info.odysseyx@gmail.com, November 4, 2024

Introduction

Managing large document collections effectively is essential for modern applications, especially those that depend on fast and reliable queries. With Azure Logic Apps, you can now automate document indexing into Azure Cosmos DB in addition to the existing indexing capabilities for AI Search, giving you the flexibility to use either service as your vector store.

Logic Apps provides a rich set of connectors that integrate with a variety of document sources, including Azure Blob Storage, SharePoint, and OneDrive, enabling automated workflows that collect documents from multiple locations. Whether you work with PDFs, Word documents, or structured data files such as CSVs, Logic Apps supports efficient parsing of a variety of document types. For larger documents, Logic Apps can also apply chunking, dividing the file into manageable parts to optimize processing and indexing. This lets you process complex or large data sets without excessive use of system resources.

For integration with Azure Cosmos DB, the Logic Apps Cosmos DB connector supports multiple authentication methods, including managed identity, shared key authentication, and Azure Active Directory OAuth, so you can choose the one that fits your security requirements. Logic Apps can also meet a variety of networking needs, such as integrating with private endpoints or using VNet integration to secure communication between services.

In this post, we’ll explore a scenario where Logic Apps automates the ingestion and indexing of documents, such as PDFs, into Azure Cosmos DB. This approach not only reduces operational overhead, but also keeps the data highly accessible and queryable.

Why use Logic Apps for document indexing in Cosmos DB?
- Automated workflow: Automating document indexing eliminates manual work and ensures that documents are indexed as soon as they are uploaded.
- Scalability: As your document volume grows, the global distribution of Azure Cosmos DB ensures that your data remains scalable and available.
- Seamless integration: Logic Apps lets you easily integrate with other Azure services, such as Blob Storage and AI models, to enhance document indexing with intelligence and automation.

Scenario Overview

This scenario automates the collection, parsing, and indexing of document content from Azure Blob Storage into Azure Cosmos DB. When a blob (such as a PDF or text document) is uploaded, a logic app workflow is triggered to process the document and store its data in a Cosmos DB container for easy searching and querying.

Prerequisites

To set up this scenario, you need the following:
- An Azure Cosmos DB resource to index the data
- An Azure Storage account to upload the content to be indexed

Azure Cosmos DB settings

After creating the resource:
1. Go to your Azure Cosmos DB resource.
2. Select “Features” in the “Settings” menu.
3. Enable the “Vector search in Azure Cosmos DB for NoSQL” feature.

The steps are also described in detail in this blog post on Azure Cosmos DB.

Now that we have set up our Cosmos DB resource as the index store, let’s create a new database and a container for the vector store. To create a new container, follow these instructions:
1. Go to “Data Explorer”.
2. Create a “New Container” with the following fields set:
- Database ID: This is your database ID. In our case it is ‘docs’.
- Container ID: The container where the documents will be stored, defined as ‘Category’.
- Partition Key: We defined it as ‘/category’ for data distribution, as there may be different categories of indexed documents to query.
- Container vector policy: This is where you set the vector properties for ‘vector embedding 1’:
- Path: The path to the property that holds the vector embedding.
In our case it is ‘/vector’.
- Data type: float32
- Distance function: Used to compute the distance between nearest neighbors. In our case we set it to ‘cosine’.
- Dimension: 1056
- Index type: diskANN, a low-cost, scalable, low-latency option for finding Approximate Nearest Neighbors (ANN).

More information about container setup can be found in the GitHub tutorials.

Document structure for indexing in Azure Cosmos DB

This Logic Apps workflow indexes document embeddings into Azure Cosmos DB. Below is a breakdown of the key fields we map and index.
- content: The body of the document, that is, the processed text content. For example, this could be text extracted from contracts, invoices, or other file types.
- document name: The name or title of the document being indexed. This field identifies documents by their file names, making it easier to search for and retrieve documents by their original names.
- vector: The document’s embedding vector, a numerical representation of the content. These vectors are used to perform similarity searches on documents, enabling AI-based insights or matches based on content similarity.
- Document ID: A unique identifier generated for each document. It is important for querying and updating specific items in the Cosmos DB container.
- category: The document type or category. In our case, we use “Document” as the value for this field. It groups documents and is useful when querying for specific types of documents within a database.
- ID: Another unique identifier, often generated automatically or derived by concatenating values. It ensures that there are no duplicates and that each item is properly referenced.
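As a rough illustration of the field breakdown above, here is a minimal Python sketch of how one item could be assembled before it is sent to Cosmos DB. The key names `documentName` and `documentId`, the helper `build_item`, and the scheme of deriving `id` from the document ID plus a chunk index are assumptions made for this example and are not taken from the sample workflow. The keys `vector` and `category` follow the ‘/vector’ path and ‘/category’ partition key configured earlier, and 1056 is the dimension set in the container vector policy.

```python
from __future__ import annotations

import uuid

# Dimension configured in the container vector policy described in this post.
VECTOR_DIMENSIONS = 1056


def build_item(document_name: str, chunk_index: int, content: str,
               vector: list[float], category: str = "Document") -> dict:
    """Assemble one Cosmos DB item for a single chunk of a document.

    Key names other than "id", "vector" and "category" are hypothetical.
    """
    if len(vector) != VECTOR_DIMENSIONS:
        raise ValueError(
            f"expected {VECTOR_DIMENSIONS} dimensions, got {len(vector)}"
        )

    # Assumption: a deterministic document ID derived from the file name,
    # so re-indexing the same file produces the same identifiers.
    document_id = str(uuid.uuid5(uuid.NAMESPACE_URL, document_name))

    return {
        # "id" is derived by concatenating values, one item per chunk.
        "id": f"{document_id}_{chunk_index}",
        "documentId": document_id,
        "documentName": document_name,
        "category": category,   # matches the '/category' partition key
        "content": content,
        "vector": vector,       # matches the '/vector' embedding path
    }


if __name__ == "__main__":
    # Placeholder embedding; a real one would come from an embedding model.
    item = build_item("contract.pdf", 0, "First chunk of text.",
                      [0.0] * VECTOR_DIMENSIONS)
    print(item["id"], item["category"], len(item["vector"]))
```

Deriving `id` deterministically is one possible design choice: it lets a bulk create-or-update operation overwrite existing items when a document is re-indexed instead of creating duplicates.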
The fields above make up the payload that the logic app workflow passes to Azure Cosmos DB.

Key steps in the workflow

See the GitHub sample for the workflow project. The workflow consists of the following steps:
1. Blob upload detection: The logic app starts by detecting when a new blob (document) is added or updated in Azure Blob Storage.
2. Read blob content: The workflow reads the content of the uploaded blob and prepares it for further processing.
3. Document parsing: Logic Apps parses the document to extract relevant content, including text and metadata. This may include PDF extraction or text chunking for larger documents.
4. Chunk text (if needed): For large documents, the content is split into manageable chunks for smooth processing and indexing.
5. Generate embeddings using AI: The logic app uses Azure AI to generate embeddings from the document content. These embeddings enable improved data processing, classification, and structural mapping within Cosmos DB.
6. Mapping to schema: The extracted data and embeddings are mapped to a predefined schema, ensuring consistency in how documents are indexed within Cosmos DB. The properties we index are the fields listed in the document structure section.
7. Bulk updates in Cosmos DB: The final processed documents are stored and indexed in Cosmos DB. The “Create or update many items in bulk” operation accepts a database ID and a container ID along with the data to be indexed, which consists of the items produced by the previous operations.

Conclusion

By using Azure Logic Apps to automate document indexing in Azure Cosmos DB, you can streamline your data workflow, reduce manual intervention, and keep your data organized for optimal performance. This integration simplifies processes, making it easier for teams to manage large volumes of documents and scale as needed.

What’s next?

Logic Apps currently supports efficient document indexing, but vector search features for AI-based search are not yet available in Azure Cosmos DB.
This is a highly anticipated feature that will strengthen Cosmos DB as a powerful vector store. Stay tuned for this update!