Achieve higher performance and lower query cost for BigQuery integer or timestamp lookups

7 months ago 28
News Banner

Looking for an Interim or Fractional CTO to support your business?

Read more

The first incarnation of search indexes in BigQuery focused on fast and efficient lookups on STRING data elements, either in standalone STRING scalar columns, or within an ARRAY, STRUCT, or JSON column. Our previous blog posts showcased the orders-of-magnitude performance gains achievable when utilizing indexes with the SEARCH function and other functions and operators.

Today, we are announcing the public preview of numeric search indexes, which enables optimized lookups on INT64 and TIMESTAMP data types. With this change, the EQUAL(=) and IN operations on these data types can utilize search indexes to reduce byte scans for improved performance. So now your lookups for account IDs or transactions IDs or log timestamps can get faster and cheaper.

In this blog, we demonstrate the gains on real data, showcasing index creation and queries on a 100TB log table called log_table that contains Google Cloud Logging data for an internal Google test project.

The base table details are as follows:

image_1

The table has the following columns of interest:

  • jsonPayload: type JSON

    • This jsonPayload has a leaf field named threadId of type JSON number.

  • sourceLocation: type RECORD (or STRUCT) with two sub-fields of interest:

    • file: type STRING, containing the name of the file producing the log entry

    • line: type INT64, containing the line number in the file where the log entry was produced. 

Using search indexes

By default, a search index is created for the STRING data only. If you want to index INT64 or TIMESTAMP, you need to provide them in the index option called data_types. In the following example, all data of type STRING and INT64 in the log_table table will be indexed.

code_block <ListValue: [StructValue([('code', "CREATE SEARCH INDEX index_with_string_int64\r\nON log_table (ALL COLUMNS)\r\nOPTIONS (data_types = ['STRING', 'INT64'])"), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3eeaf1ee2d00>)])]>

JSON field search

In this first example, we want to search for log entries that have the thread ID 12104 in the JSON payload.

code_block <ListValue: [StructValue([('code', 'SELECT timestamp, jsonPayload\r\nFROM log_table\r\nWHERE INT64(jsonPayload.threadId) = 12104;'), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3eeaf1ee2cd0>)])]>

We compare between having a search index and having no index. Given that log entries with this thread ID are very rare, the results show dramatic improvements on all three metrics:

 

Metrics 

Without Index

With Index

Improvement

Execution Time (ms)

48,790

4,664

10x

Processed Bytes

2,174,758,158,336

774,897,664

2,806x

Slot Usage (ms)

25,735,222

7,300

3,525x

STRUCT nested field search

In the second example, we count how many log entries are produced from a certain line of code (line 813 in the file borg/borgletlib/borgletlib.cc).

code_block <ListValue: [StructValue([('code', "SELECT count(*)\r\nFROM log_table\r\nWHERE sourceLocation.file = 'borg/borgletlib/borgletlib.cc'\r\n AND sourceLocation.line = 813"), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3eeaf1ee2fd0>)])]>

Note that sourceLocation.file is a STRING field. A search index on only STRING data type can already help improve the query performance as shown below. However, with indexing on also the INT64 data type, the performance can be further improved.

 

Metrics 

No Index

With IndexSTRING-only

With IndexSTRING & INT64

Execution Time (ms)

57,169

11,571

4.9x

7,982

7.1x

Processed Bytes

1,703,843,725,312

976,230,547,456 

1.7x

682,560,061,440

2.4x

Slot Usage (ms)

38,947,660

25,595,348 

1.5x

8,256,218

4.7x

Do I need indexing if I use partitioning/clustering?

While partitioning and clustering can optimize filtering and lookups, they have certain limitations. For instance, partitioning can only be done on a single column, and clustering allows up to four columns per table. However, clustering is most effective when filtering on the first clustering column, as subsequent columns often provide minimal pruning power. Furthermore, both partitioning and clustering are limited to top-level columns.

Search indexes on INT64/TIMESTAMP complement these BigQuery features by enabling lookup optimizations on any number of columns. In addition,  as demonstrated above, they cover struct nested fields, array elements, and JSON leaf fields.

This feature is currently in preview. For more information, refer to Optimize with numeric predicates.

Read Entire Article