BigQuery integrates with Doc AI to help build document analytics and generative AI use cases


As digital transformation accelerates, organizations are generating vast amounts of text and other document data, all of which holds immense potential for insights and powering novel generative AI use cases. To help harness this data, we’re excited to announce an integration between BigQuery and Document AI, letting you easily extract insights from document data and build new large language model (LLM) applications.

BigQuery customers can now create Document AI Custom Extractors, powered by Google's cutting-edge foundation models, which they can customize based on their own documents and metadata. These customized models can then be invoked from BigQuery to extract structured data from documents in a secure, governed manner, using the simplicity and power of SQL.

Prior to this integration, some customers tried to construct independent Document AI pipelines, which involved manually curating extraction logic and schema. The lack of native integration capabilities left them to develop bespoke infrastructure to synchronize and maintain data consistency. This turned each document analytics project into a substantial undertaking that required significant investment. Now, with this integration, customers can easily create remote models in BigQuery for their custom extractors in Document AI, and use them to perform document analytics and generative AI at scale, unlocking a new era of data-driven insights and innovation.

A unified, governed data-to-AI experience

You can build a custom extractor in the Document AI Workbench with three steps:

  1. Define the data you need to extract from your documents. This is called the document schema; it is stored with each version of the custom extractor and is accessible from BigQuery.
  2. Optionally, provide extra documents with annotations as samples of the extraction.
  3. Train the model for the custom extractor, based on the foundation models provided in Document AI.

In addition to custom extractors that require manual training, Document AI also provides ready-to-use extractors for expenses, receipts, invoices, tax forms, government IDs, and a multitude of other scenarios in the processor gallery. You can use them directly without performing the above steps.

Then, once you have the custom extractor ready, you can move to BigQuery Studio to analyze the documents using SQL in the following four steps:

  1. Register a BigQuery remote model for the extractor using SQL. The model can understand the document schema (created above), invoke the custom extractor, and parse the results.
  2. Create object tables using SQL for the documents stored in Cloud Storage. You can govern the unstructured data in the tables by setting row-level access policies, which limit users’ access to certain documents and thus restrict what the AI can process, for privacy and security.
  3. Use the function ML.PROCESS_DOCUMENT on the object table to extract relevant fields by making inference calls to the API endpoint. You can also filter the documents for extraction with a “WHERE” clause outside of the function. The function returns a structured table, with each column being an extracted field.
  4. Join the extracted data with other BigQuery tables to combine structured and unstructured data, producing business value.

The following example illustrates the user experience:

[Screenshot: curating a Doc AI custom extractor in Workbench]

```sql
# Create an object table in BigQuery that maps to the document files stored in Cloud Storage.
CREATE OR REPLACE EXTERNAL TABLE `my_dataset.receipt_table`
WITH CONNECTION `my_project.us.example_connection`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://my_bucket/path/*'],
  metadata_cache_mode = 'AUTOMATIC',
  max_staleness = INTERVAL 1 HOUR
);

# Create a remote model to register your Doc AI processor in BigQuery.
CREATE OR REPLACE MODEL `my_dataset.invoice_parser`
REMOTE WITH CONNECTION `my_project.us.example_connection`
OPTIONS (
  remote_service_type = 'CLOUD_AI_DOCUMENT_V1',
  document_processor = 'projects/…/locations/us/processors/…/processorVersions/pretrained-invoice-v1.3-2022-07-15'
);

# Invoke the registered model over the object table to parse PDF expense receipts.
SELECT uri, total_amount, invoice_date
FROM ML.PROCESS_DOCUMENT(
  MODEL `my_dataset.invoice_parser`,
  TABLE `my_dataset.receipt_table`)
WHERE content_type = 'application/pdf';
```

[Table of results]
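Step 4 above can be sketched as a standard SQL join over the extracted fields; the `my_dataset.orders` table and its columns here are hypothetical, stand-ins for whatever structured data you keep in BigQuery:

```sql
-- Hypothetical example: join fields extracted from receipts with a
-- structured orders table to reconcile invoices against recorded orders.
SELECT
  o.order_id,
  o.customer_id,
  r.total_amount,
  r.invoice_date
FROM ML.PROCESS_DOCUMENT(
    MODEL `my_dataset.invoice_parser`,
    TABLE `my_dataset.receipt_table`) AS r
JOIN `my_dataset.orders` AS o
  ON o.invoice_date = r.invoice_date
 AND o.total_amount = CAST(r.total_amount AS NUMERIC);
```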

Text analytics, summarization and other document analysis use cases

Once you have extracted text from your documents, you can then perform document analytics in a few ways:

  • Use BigQuery ML to perform text analytics: BigQuery ML supports training and deploying text models in a variety of ways. For example, you can use BigQuery ML to identify customer sentiment in support calls, or to classify product feedback into different categories. If you are a Python user, you can also use BigQuery DataFrames, which provides pandas- and scikit-learn-like APIs for text analysis on your data.
  • Use the PaLM 2 LLM to summarize the documents: BigQuery has a ML.GENERATE_TEXT function that calls the PaLM 2 model to generate text, which can be used to summarize the documents. For instance, you can use Document AI to extract customer feedback and summarize the feedback using PaLM 2, all with BigQuery SQL.
  • Join document metadata with other structured data stored in BigQuery tables: This allows you to combine structured and unstructured data for more powerful use cases. For example, you could identify high customer lifetime value (CLTV) customers with feedback captured from online reviews, or shortlist the most requested product features from customer feedback.
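As a sketch of the first bullet, extracted text can be scored for sentiment with a remote Cloud Natural Language model registered in BigQuery; the model and extractor names below are hypothetical, assuming a remote model created with remote_service_type 'CLOUD_AI_NATURAL_LANGUAGE_V1':

```sql
-- Hypothetical example: analyze the sentiment of extracted feedback text
-- with a remote Cloud Natural Language model registered in BigQuery.
SELECT
  uri,
  ml_understand_text_result
FROM ML.UNDERSTAND_TEXT(
  MODEL `my_dataset.nlp_model`,
  (
    SELECT uri, customer_feedback AS text_content
    FROM ML.PROCESS_DOCUMENT(
      MODEL `my_dataset.customer_feedback_extractor`,
      TABLE `my_dataset.customer_feedback_documents`)
  ),
  STRUCT('ANALYZE_SENTIMENT' AS nlu_option));
```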
```sql
-- Example of document summarization using PaLM 2
SELECT
  ml_generate_text_result['predictions'][0]['content'] AS generated_text,
  ml_generate_text_result['predictions'][0]['safetyAttributes']
    AS safety_attributes,
  * EXCEPT (ml_generate_text_result)
FROM
  ML.GENERATE_TEXT(
    MODEL `my_dataset.llm_model`,
    (
      SELECT
        CONCAT(
          'Summarize the following text: ', customer_feedback) AS prompt,
        *
      FROM ML.PROCESS_DOCUMENT(
        MODEL `my_dataset.customer_feedback_extractor`,
        TABLE `my_dataset.customer_feedback_documents`)
    ),
    STRUCT(
      0.2 AS temperature,
      1024 AS max_output_tokens));
```
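The third bullet can likewise be sketched as a join; the `my_dataset.customers` table and its `cltv` column are hypothetical:

```sql
-- Hypothetical example: surface feedback from high-CLTV customers by
-- joining extracted document fields with a structured customer table.
SELECT
  c.customer_id,
  c.cltv,
  f.customer_feedback
FROM ML.PROCESS_DOCUMENT(
    MODEL `my_dataset.customer_feedback_extractor`,
    TABLE `my_dataset.customer_feedback_documents`) AS f
JOIN `my_dataset.customers` AS c
  ON c.customer_id = f.customer_id
WHERE c.cltv > 10000
ORDER BY c.cltv DESC;
```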

Implement search and generative AI use cases

Once you’ve extracted structured text from your documents, you can use BigQuery’s search and indexing capabilities to build indexes optimized for needle-in-the-haystack queries, unlocking powerful search functionality.
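A minimal sketch of this, assuming the extracted fields have been materialized into a hypothetical `my_dataset.extracted_feedback` table:

```sql
-- Hypothetical example: index the extracted text and run a keyword search.
CREATE SEARCH INDEX feedback_index
ON `my_dataset.extracted_feedback` (ALL COLUMNS);

SELECT uri, customer_feedback
FROM `my_dataset.extracted_feedback`
WHERE SEARCH(customer_feedback, 'refund');
```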

This integration also helps unlock new generative LLM applications like executing text-file processing for privacy filtering, content safety checks, and token chunking using SQL and custom Document AI models. The extracted text, combined with other metadata, simplifies the curation of the training corpus required to fine-tune large language models. Moreover, you’re building LLM use cases on governed, enterprise data that’s been grounded through BigQuery's embedding generation and vector index management capabilities. By synchronizing this index with Vertex AI, you can implement retrieval-augmented generation use cases, for a more governed and streamlined AI experience.
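The embedding and vector-index flow can be sketched as follows; the `my_dataset.embedding_model` remote model and the `my_dataset.extracted_feedback` table are hypothetical:

```sql
-- Hypothetical example: generate embeddings for the extracted text,
-- index them, and retrieve the nearest documents for a query.
CREATE OR REPLACE TABLE `my_dataset.feedback_embeddings` AS
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `my_dataset.embedding_model`,
  (SELECT customer_feedback AS content
   FROM `my_dataset.extracted_feedback`));

CREATE VECTOR INDEX feedback_vector_index
ON `my_dataset.feedback_embeddings` (ml_generate_embedding_result)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');

-- Find the five documents closest to a natural-language query.
SELECT base.content
FROM VECTOR_SEARCH(
  TABLE `my_dataset.feedback_embeddings`,
  'ml_generate_embedding_result',
  (SELECT ml_generate_embedding_result
   FROM ML.GENERATE_EMBEDDING(
     MODEL `my_dataset.embedding_model`,
     (SELECT 'shipping delays' AS content))),
  top_k => 5);
```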

Next steps

The above capabilities are now available in preview. To get started, reach out to your Google sales representative, or check out the following tutorials:
