Integrating BigQuery data into your LangChain application

Data is the heart of any AI solution. You've got lots of valuable BigQuery data, but how can you integrate it into an LLM application? Large language models excel at working with unstructured data; adding structured data can enrich and ground your model's responses and surface new relationships in your data. Let's explore how to use the LangChain BigQuery Data Loader to do just that.

LangChain is an open source framework that enables you to connect LLMs to data sources and control the flow of information between them. Using LangChain, you can simplify LLM development with a modular architecture and pre-built connectors. Popular use cases include chatbots and virtual assistants, code generation tools, and custom web applications built on large language models. When you see an LLM workflow with multiple steps and integration points, consider LangChain for the task.

Whether you're new to LangChain or just haven't used a data loader before, this post will walk through each step. In our example, our model will analyze multiple e-commerce tables to select the most appropriate customers for a marketing campaign. You'll learn how to use a data loader to support generative AI use cases from summarization to code generation.

Getting started with LangChain

To find out more about BigQuery and LangChain at the same time, all we need to do is ask the Vertex AI text foundation model. Six lines of code from start to finish, not too bad!

```python
# Install LangChain and the Vertex AI SDK
!pip install --quiet google-cloud-aiplatform langchain

# Initialize the Vertex AI SDK
import vertexai
vertexai.init(project="<your-project-id>", location="us-central1")

# Query the text foundation model
from langchain.llms import VertexAI
llm = VertexAI(model_name="text-bison@001", temperature=0)
llm("What's BigQuery?")
```

And what does it tell us?

BigQuery is a fully managed, petabyte-scale analytics data warehouse that enables businesses to analyze all their data very quickly. It is a cloud-based service that offers fast performance, scalability, and flexibility. BigQuery is easy to use and can be integrated with other Google Cloud Platform services.

Pro tip: you can now use BigQuery Studio to run notebooks and try out the SQL in this blog post directly within BigQuery.

Using the data loader

Now that we've climbed the "Hello world!" mountain, let's learn how to use the document loader. We'll use data from a fictional e-commerce clothing site called TheLook, available as a BigQuery public dataset.

Let's say we're starting from scratch with a bunch of tables we don't know well. And our marketing team is about to start a campaign in Japan, oh my! Can we ask the LLM to identify our target customers?

The first step to understanding our data is loading it. Let's query the schema from this dataset to extract the data definition language (DDL). DDL is used to create and modify tables, and can tell us about each column and its type.

As a prerequisite, let's make sure we have the BigQuery client library installed:

```python
# Install the BigQuery client library
!pip install google-cloud-bigquery
```

Let's now define the query and load the data:

```python
# Define our query
query = """
SELECT table_name, ddl
FROM `bigquery-public-data.thelook_ecommerce.INFORMATION_SCHEMA.TABLES`
WHERE table_type = 'BASE TABLE'
ORDER BY table_name;
"""

# Load the data
from langchain.document_loaders import BigQueryLoader

loader = BigQueryLoader(
    query,
    project="<your-project-id>",
    metadata_columns=["table_name"],
    page_content_columns=["ddl"],
)
data = loader.load()
```

Our query extracts the table name and DDL for each table. We then create a data loader, specifying that the table name is a metadata column and the DDL is the page content.

The result is a list of documents that looks like this:

```
[Document(page_content='ddl: CREATE TABLE `bigquery-public-data.thelook_ecommerce.distribution_centers`
    (id INT64, name STRING, latitude FLOAT64, longitude FLOAT64)
    OPTIONS(description="The Look fictitious e-commerce dataset: distribution_centers table");',
  metadata={'table_name': 'distribution_centers'}),
 ...]
```
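Each entry is a standard LangChain Document, so before chaining anything you can sanity-check what was loaded. For example, a quick sketch that lists each table and the size of its DDL:

```python
# Each Document pairs a table's DDL (page_content) with its name (metadata)
for doc in data:
    print(doc.metadata["table_name"], "-", len(doc.page_content), "characters of DDL")
```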

Writing our first chain

Now that we've loaded the documents, let's put them to work! We need a query that answers our marketing question and identifies the most appropriate customers for the campaign. We'll use the code generation model for this task.

We'll create a basic chain that "stuffs" all of the table metadata into a single prompt. For larger datasets with many more tables, a more sophisticated chaining approach is needed, because each prompt has a limited length, known as the context window.

For example, you could compress highlights from each table's content into smaller documents and then summarize them with a map-reduce approach (a minimal sketch follows below), or you could iterate over each table, refining your query as you go.
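To make the map-reduce option concrete, here is a minimal sketch using LangChain's built-in summarize chain, reusing the llm and data objects defined above. The map step summarizes each table's DDL individually, and the reduce step combines those partial summaries:

```python
# A minimal map-reduce sketch, reusing the `llm` and `data` objects from above.
# Each document is summarized on its own (map), then the partial summaries are
# combined into one (reduce), keeping each prompt within the context window.
from langchain.chains.summarize import load_summarize_chain

summarize_chain = load_summarize_chain(llm, chain_type="map_reduce")
table_summary = summarize_chain.run(data)
print(table_summary)
```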

Here's how to do it. We'll use the LangChain Expression Language (LCEL) to define the chain in three steps:

  1. Combine the page_content from each document (remember, that's the DDL of each table) into a single string called content.
  2. Create a prompt to find our most valuable customers, passing in content, the combined set of table metadata.
  3. Pass the prompt to the LLM.

```python
# Use the code generation model
llm = VertexAI(model_name="code-bison@latest", max_output_tokens=2048)

# Define the chain
from langchain.prompts import PromptTemplate
from langchain.schema import format_document

chain = (
    {
        "content": lambda docs: "\n\n".join(
            format_document(doc, PromptTemplate.from_template("{page_content}"))
            for doc in docs
        )
    }
    | PromptTemplate.from_template(
        "Suggest a query that will help me identify my most valuable "
        "customers, with an emphasis on recent sales:\n\n{content}"
    )
    | llm
)

# Invoke the chain with the documents, and remove code backticks
result = chain.invoke(data).strip("```")
print(result)
```

Let's take a look at the query:

```sql
SELECT
  users.id AS user_id,
  users.first_name AS first_name,
  users.last_name AS last_name,
  users.email AS email,
  SUM(order_items.sale_price) AS total_spend,
  users.country AS country
FROM `bigquery-public-data.thelook_ecommerce.users` AS users
JOIN `bigquery-public-data.thelook_ecommerce.orders` AS orders
  ON users.id = orders.user_id
JOIN `bigquery-public-data.thelook_ecommerce.order_items` AS order_items
  ON orders.order_id = order_items.order_id
WHERE users.country = 'Japan'
GROUP BY users.id, users.first_name, users.last_name, users.email, users.country
ORDER BY total_spend DESC
LIMIT 10;
```

Great, we've got the query! It's recommended to try it out with a dry run before executing it for real.
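A dry run asks BigQuery to validate the query and estimate how much data it would scan, without actually running it or incurring query costs. A minimal check with the BigQuery client library might look like this, reusing the result string generated above:

```python
import google.cloud.bigquery as bq

client = bq.Client(project="<your-project-id>")

# A dry run validates the SQL and reports the bytes a real run would process,
# without executing the query
dry_run_job = client.query(result, job_config=bq.QueryJobConfig(dry_run=True))
print(f"Query is valid; it would process {dry_run_job.total_bytes_processed:,} bytes.")
```

With the dry run passing, let's get the answer to our question and fetch the users from Japan: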

```python
# Execute the query with the client created above and load the results into a DataFrame
client.query(result).result().to_dataframe()
```

| index | user_id | first_name | last_name | email | total_spend | country |
|-------|---------|------------|-----------|-------|-------------|---------|
| 0 | 86707 | Robin | Oneill | [email protected] | 1533.96 | Japan |
| 1 | 68275 | Michael | Fisher | [email protected] | 1422.46 | Japan |
| 2 | 1988 | Alexander | Gill | [email protected] | 1061.95 | Japan |
