
Energy-Efficient Data Querying with NLP and Client-Side Compute for ISVs

by info.odysseyx@gmail.com


Introduction

Interacting with data using natural language can greatly enhance our ability to manipulate and understand information, making data more accessible and usable for everyone. Given recent advances in large language models (LLMs), this seems like an obvious solution. However, while we have made progress in interacting with unstructured data using NLP and AI, interacting with structured data remains a challenge. Using LLMs to convert natural language into domain-specific languages like SQL is a common and valid use case that demonstrates the power of these models. In this blog, we will identify the limitations of current solutions and introduce a new, energy-efficient approach that improves efficiency and flexibility.
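As a concrete illustration of the translation step, a text-to-SQL prompt might be assembled like the sketch below. The schema, prompt wording, and function name are illustrative only, not a specific product API:

```python
# Hypothetical sketch of assembling a text-to-SQL prompt for an LLM.
# The schema, prompt wording, and function name are illustrative only.
def build_sql_prompt(question: str, schema_ddl: str) -> str:
    """Compose a prompt asking an LLM for a single read-only SELECT."""
    return (
        "You are a SQL generator. Given this schema:\n"
        f"{schema_ddl}\n"
        "Translate the user's question into one read-only SELECT statement.\n"
        f"Question: {question}\n"
        "SQL:"
    )

schema = "CREATE TABLE sales (region TEXT, month TEXT, revenue REAL);"
prompt = build_sql_prompt("Total revenue per region last month", schema)
```

Grounding the prompt in the actual table schema is what lets the model emit column names that exist, rather than guessing.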

My team focuses on ISVs and the impact of each design decision on them. For example, if an ISV needs to offer “chat with data” in their solution, they also need to address the challenges of hosting, monetizing, and securing that functionality. We follow two core strategies:

  • Leverage deterministic tools to run domain-specific languages on the appropriate systems.
  • Distribute the computing load across client devices.

These strategies are ideal for ISVs that want to improve performance and scalability while reducing server load, providing customers with seamless, sustainable data access.

Challenge: Interacting efficiently with structured data

Structured data is typically stored in databases, structured files, and spreadsheets, and is the backbone of business intelligence and analytics. However, querying and extracting insights from this data often requires knowledge of specific query languages, such as SQL, which can be a barrier for many users. ISVs also face the challenge of anticipating the many ways customers want to interact with their data. As customer demand for natural language interfaces that simplify and make data access intuitive grows, ISVs are under pressure to develop solutions that bridge the gap between natural language and the structured data users need to interact with.

Using an LLM to translate natural language queries into a domain-specific language like SQL is a powerful capability, but it alone does not solve the problem. The next step is to execute those queries efficiently on the appropriate systems. Implementing such a solution requires basic safeguards to ensure that the generated SQL can be executed safely. In addition, there is the challenge of managing the computational load: hosting such functionality on ISV servers can be resource-intensive and expensive.
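One basic safeguard is to reject anything that is not a single read-only SELECT statement before it ever reaches the database. A minimal sketch follows; a keyword blocklist like this is illustrative rather than airtight, and in practice it should be paired with a read-only database connection:

```python
import re

# Minimal safeguard sketch: accept only a single read-only SELECT before
# handing LLM-generated SQL to the database. A blocklist alone is not
# airtight; pair it with a read-only connection in production.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|pragma|vacuum)\b",
    re.IGNORECASE,
)

def is_safe_select(sql: str) -> bool:
    stmt = sql.strip().rstrip("; \n")
    if ";" in stmt:                      # reject multi-statement input
        return False
    if not stmt.lower().startswith("select"):
        return False
    return FORBIDDEN.search(stmt) is None

ok = is_safe_select("SELECT region, SUM(revenue) FROM sales GROUP BY region")
bad = is_safe_select("SELECT 1; DROP TABLE sales")
```

The check is deliberately conservative: anything ambiguous is refused, and the cost of a false rejection is just asking the LLM to regenerate the query.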

Therefore, an effective solution must not only translate natural language into actionable queries, but also optimize how those queries are processed. This includes leveraging deterministic tools to execute domain-specific languages and offloading computational tasks to client devices. This allows ISVs to provide customers with more efficient, scalable, and cost-effective data interaction solutions.

Common Use Cases

ISVs collect data from a variety of sources, some open source and most from their customers (or tenants). These tenants can come from a variety of industries, such as retail, healthcare, and finance, and each requires a customized data solution. Many ISVs implement the Medallion pattern for data collection: a design pattern that organizes data into tiers (Bronze, Silver, and Gold) to ensure data quality and accessibility. In this pattern, raw data lands in the Bronze tier, is cleaned and enriched in the Silver tier, and is then aggregated into the Gold tier for analysis. The Gold tables containing the aggregated data are typically less than 20 MB per tenant.
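The tiering described above can be sketched end to end with SQLite; the table names and sample rows below are invented for illustration:

```python
import sqlite3

# Minimal illustration of the Bronze -> Silver -> Gold flow using SQLite.
# Table and column names are invented for the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bronze_sales (tenant TEXT, region TEXT, amount REAL);
    INSERT INTO bronze_sales VALUES
        ('t1', 'east', 100.0),
        ('t1', 'east', 50.0),
        ('t1', NULL,   25.0),  -- bad row, dropped at the Silver tier
        ('t1', 'west', 75.0);

    -- Silver: cleansed copy (here, simply dropping rows with missing regions)
    CREATE TABLE silver_sales AS
        SELECT tenant, region, amount FROM bronze_sales
        WHERE region IS NOT NULL;

    -- Gold: small aggregated table served to each tenant
    CREATE TABLE gold_sales AS
        SELECT tenant, region, SUM(amount) AS total
        FROM silver_sales GROUP BY tenant, region;
""")
rows = conn.execute("SELECT region, total FROM gold_sales ORDER BY region").fetchall()
```

Because the Gold tier holds only aggregates, the per-tenant result stays small, which is what makes the per-customer snapshots discussed later practical.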

The data ingestion pipeline runs periodically to populate the Gold tables hosted in Azure SQL Database. Data isolation is managed using row-level security or multiple schemas, tailored to the ISV’s needs. The next step for the ISV is to give tenants access to the data via a web application, with its own dashboards and reporting capabilities. These ISVs are often small companies that lack the resources to implement a full Business Continuity and Disaster Recovery (BCDR) approach or purchase commercial tools like Power BI, so they rely on self-built or free tooling.

Despite having a robust infrastructure, ISVs face several challenges.

  • Complex query languages: Users often struggle with the complexity of SQL or other query languages required to extract insights from their data, creating barriers to effective data utilization.
  • Performance and Scalability: Complex queries significantly increase server load, especially when multiple tenants access data simultaneously. This can lead to performance bottlenecks and scalability issues.
  • Cost and Resource Management: Hosting the computing resources needed to process data queries on ISV servers is resource-intensive and expensive. This includes maintaining high-performance databases and application servers.
  • User Experience: Customers increasingly demand the ability to interact with their data using natural language and expect seamless, intuitive user experiences.
For more information about the medallion pattern, see: This link.

[Architecture diagram: current setup]

The architecture diagram above shows the current setup.

  • Data Sources: Public sources and tenant data are collected into the system.
  • Storage: A data lake (or lakehouse) processes data from multiple sources, performs cleansing, and periodically stores the data in Gold tables.
  • Orchestration: ELT/ETL orchestration is done using Microsoft Fabric, Azure Synapse, or Azure Data Factory pipelines.
  • Serving: The web application is hosted on Azure App Service, and data is queried from Azure SQL Database.
  • Visualization: Data is reported using Power BI or other reporting tools (including self-built dashboards).

An improved approach: energy-efficient data interaction

To address the challenges mentioned above, ISVs can adopt the following strategies:

  • Leverage deterministic tools for query execution:
    • Translation: An LLM translates natural language queries into SQL.
    • execution: Create a sandbox environment for each customer’s data. This sandbox is hosted on low-cost storage, such as a storage container per customer, and contains a snapshot of the data that the customer can interact with.
    • Data Management: The same data ingestion pipeline that updates the Gold table in Azure SQL is adapted to also update customer-specific data sets stored in their storage containers. The idea is to use SQLite to store customer-specific data, keeping it lightweight and portable.
    • Benefits:
      • Efficiency and security: Leverage the power of SQL databases while minimizing risk, ensuring queries run efficiently and safely. By isolating each customer’s data in a sandbox, the need for sophisticated safeguards against incorrect queries and reporting-database overload is greatly reduced.
      • Cost and energy savings: There is no need to manage or host a dedicated reporting database. Customer-specific data is hosted in Azure storage containers, allowing ISVs to avoid the costs and energy consumption of maintaining a high-performance database infrastructure.
      • Scalability and stability: ISVs do not need to plan for the worst-case scenario where all customers execute queries simultaneously, which could impact the health of the central reporting database. Each customer’s query is isolated from the data, ensuring system stability and performance.
  • Offload computing to client devices:
    • Data transfer: Client-side applications make available a current snapshot of the data for the client to work with. For example, they can check the timestamp of the data or use other methods to check if the local data is up to date and download the latest version if necessary. This snapshot is encapsulated in a portable format such as JSON, SQLite, or Parquet.
    • Local processing: Client-side applications process data locally using translated SQL queries.
    • Benefits:
      • Performance: Reduce server load, improve scalability, and respond to queries faster by leveraging the client’s computing resources.
      • Cost and energy savings: Significant cost savings by reducing the need for high-performance server infrastructure. Hosting static websites and leveraging the processing power of client devices also reduces overall energy consumption.
      • Flexibility: Ensures that customers always work with up-to-date data without the need for constant server communication.
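Put together, the sandboxed execution step might look like the sketch below: a small per-tenant SQLite snapshot is built by the pipeline, and translated queries run over a read-only connection, so even a faulty generated statement cannot modify the data. File names and tables here are illustrative:

```python
import os
import sqlite3
import tempfile

# Sketch of the per-customer sandbox: build a small SQLite snapshot per
# tenant, then run translated queries over a read-only connection.
snapshot = os.path.join(tempfile.mkdtemp(), "tenant_42.db")

# Ingestion side: the pipeline writes the tenant's gold data to the snapshot.
conn = sqlite3.connect(snapshot)
conn.execute("CREATE TABLE gold_sales (region TEXT, total REAL)")
conn.executemany("INSERT INTO gold_sales VALUES (?, ?)",
                 [("east", 150.0), ("west", 75.0)])
conn.commit()
conn.close()

# Query side: open the snapshot read-only via a URI and run translated SQL.
ro = sqlite3.connect(f"file:{snapshot}?mode=ro", uri=True)
totals = ro.execute(
    "SELECT region, total FROM gold_sales ORDER BY region").fetchall()

try:
    ro.execute("DELETE FROM gold_sales")  # any write is rejected
    snapshot_modified = True
except sqlite3.OperationalError:
    snapshot_modified = False
```

Opening the file with `mode=ro` pushes the write protection down into the database engine itself, so it holds even if the upstream SQL validation misses something.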

[Architecture diagram: revised setup]

Revised architecture

  • Data Sources: Public sources and tenant data are collected into the system.
  • Storage: A data lake (or lakehouse) processes data from multiple sources, performs cleansing, and stores the data in customer-specific containers, which enhances security and isolation.
  • Orchestration: ELT/ETL orchestration is done using Microsoft Fabric, Azure Synapse, or Azure Data Factory pipelines.

The above components are hosted on the ISV infrastructure.

Client-side web applications fetch data from customer-specific containers and process the data locally. For further reading, see the Azure OpenAI .NET Starter Kit, in particular the 07_ChatWithJson and 08_Chat with Data notebooks.
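The freshness check mentioned earlier reduces to a timestamp comparison. In a real client the remote timestamp would come from the storage container’s metadata; the helper and timestamps below are hypothetical:

```python
from datetime import datetime

# Hypothetical freshness check for the local snapshot. In practice the
# remote timestamp would be read from the storage container's metadata.
def snapshot_is_stale(local_ts: str, remote_ts: str) -> bool:
    """Both arguments are ISO-8601 timestamps, e.g. '2024-07-07T14:00:00'."""
    return datetime.fromisoformat(local_ts) < datetime.fromisoformat(remote_ts)

stale = snapshot_is_stale("2024-07-06T00:00:00", "2024-07-07T00:00:00")
# If stale, the client downloads the latest snapshot before querying locally.
```

Checking a single timestamp is far cheaper than re-downloading the snapshot, which keeps the client’s server communication to a minimum.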

Why use this approach?

  • Efficiency: Data queries are executed locally, reducing the load on the server and improving performance.
  • Security: Data is safely isolated within a client-side sandbox, so customers can only query the information provided to them.
  • Cost and energy savings: Hosting a static website is much cheaper and more energy efficient than hosting a web application with a database. This approach further reduces infrastructure costs and energy consumption by leveraging the processing power of client devices.
  • Scalability: By isolating each customer’s data in a sandbox, ISVs do not need to worry about the impact of concurrent queries on a central database, ensuring the stability and scalability of the system.
  • Flexibility: Ensures that customers always have access to up-to-date data without the need for constant communication with the server.

Potential Drawbacks and Pitfalls

  • Client-side performance variability: This approach relies on the computing power of the client device, which varies widely; low-end devices may struggle with large snapshots or complex queries.
  • Data Synchronization: It can be difficult to ensure that local data snapshots on client devices are up to date. Delayed synchronization can result in users working with outdated data.

Conclusion

By adopting this strategy, ISVs can deliver more efficient, scalable, and cost-effective solutions for natural language queries on structured data. Leveraging deterministic tools to execute domain-specific languages within isolated sandboxes ensures robust and secure query execution. Offloading compute to client devices not only reduces server load, but also improves performance and scalability, providing a smooth and intuitive user experience.




