Introducing new BigQuery features to simplify time-series data analysis

9 months ago 39
News Banner

Looking for an Interim or Fractional CTO to support your business?

Read more

We are excited to invite customers to join a preview of new SQL features that simplify time series analysis in BigQuery. These new features simplify writing queries that perform two of the most common time series operations: windowing and gap filling. We are also introducing the RANGE data type and supporting functions. The RANGE type represents a continuous window of time and is useful for recording time-based state of a value. Combined, these features make it easier to write time series queries over your data in BigQuery.

Time-series analytics and data warehouses

Time-series data is an important, and growing class of data for many enterprises. Operational data like debug logs, metrics, and event logs are inherently time-oriented and full of valuable information. Additionally, connected devices common in manufacturing, scientific research, and other domains continually produce important metrics that need to be analyzed. All of this data follows the same pattern: the data is reported as values at a specific time, and deriving meaning from that data requires understanding how it changes over time.

Users with time-series analytics problems have traditionally relied on domain-specific solutions that result in harder to access data silos. Additionally, they often have their own query languages, which require additional expertise to use. Organizations looking to analyze their time-series data alongside the rest of their data must invest in pipelines that first clean and normalize the data in one store before exporting it to their data warehouse. This slows down time-to-insights and potentially hides data from the users who could build value from the data. Adding more time-series analytics features to BigQuery SQL democratizes access to this valuable data. Organizations can now leverage BigQuery's scalable serverless architecture to store and analyze their time-series data alongside the rest of their business data without having to build and maintain expensive pipelines first.

Introducing time-series windowing and gap filling functions

The first step in most time-series queries is to map individual data points to output windows of a specific time duration and alignment. For example, you want to know the average of a sampled value every ten minutes over the past twelve hours. The underlying data is often sampled at arbitrary times so the user's query must include logic to map the input times to the output time windows. BigQuery's new time bucketing functions provide a full-featured method for performing this mapping.

Let's look at an example using a time series data set showing the air quality index (AQI) and temperature recorded over part of the day. Each row in the table represents a single sample of time-series data. In this case, we have a single time series for the zip code 60606, but the table can easily hold data for multiple time series, all represented by their zip code.

code_block <ListValue: [StructValue([('code', "CREATE OR REPLACE TABLE mydataset.environmental_data_hourly AS\r\nSELECT * FROM UNNEST(\r\n ARRAY<STRUCT<zip_code INT64, time TIMESTAMP, aqi INT64, temperature INT64>>[\r\n STRUCT(60606, TIMESTAMP '2020-09-08 00:30:51', 22, 66),\r\n STRUCT(60606, TIMESTAMP '2020-09-08 01:32:10', 23, 63),\r\n STRUCT(60606, TIMESTAMP '2020-09-08 02:30:35', 22, 60),\r\n STRUCT(60606, TIMESTAMP '2020-09-08 03:29:39', 21, 58),\r\n STRUCT(60606, TIMESTAMP '2020-09-08 06:31:14', 22, 56),\r\n STRUCT(60606, TIMESTAMP '2020-09-08 07:31:06', 28, 55)\r\n]);"), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3e40b75a01f0>)])]>

Before doing additional processing we'd like to assign each row to a specific time bucket that is two-hours wide. We'll also aggregate within those windows to compute the average AQI and maximum temperature.

code_block <ListValue: [StructValue([('code', 'SELECT\r\n TIMESTAMP_BUCKET(time, INTERVAL 2 HOUR) AS time,\r\n zip_code,\r\n CAST(AVG(aqi) AS INT64) AS aqi,\r\n MAX(temperature) AS max_temperature\r\nFROM mydataset.environmental_data_hourly\r\nGROUP BY zip_code, time\r\nORDER BY zip_code, time;\r\n\r\n+---------------------+----------+-----+-----------------+\r\n| time | zip_code | aqi | max_temperature |\r\n+---------------------+----------+-----+-----------------+\r\n| 2020-09-08 00:00:00 | 60606 | 23 | 66 |\r\n| 2020-09-08 02:00:00 | 60606 | 22 | 60 |\r\n| 2020-09-08 06:00:00 | 60606 | 25 | 56 |\r\n+---------------------+----------+-----+-----------------+'), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3e40ce065550>)])]>

Note that we can use any valid INTERVAL to define the width of the output window. We can also define an arbitrary origin for aligning the windows if we want to start the windows at something other than 00:00:00.

Gap filling is another common step in time-series analysis that addresses gaps in time of the underlying data. A server producing metrics may reset, stopping the flow of data in the progress. A lack of user traffic may mean gaps in event data. Or, data may be inherently sparse. Users want to control how these gaps are filled in before joining with other data or preparing it for graphing. The newly released GAP_FILL table valued function does just that.

In our previous example, we have a gap between 02:00 and 06:00 in the output data because there is no raw data between 04:00:00 and 05:59:59. We would like to fill in that gap before joining this data with other time-series data. Below, we demonstrate two modes of backfill supported by the GAP_FILL TVF: "last observation carried forward" (locf) and linear interpolation. The TVF also supports inserting nulls.

code_block <ListValue: [StructValue([('code', "WITH aggregated_2_hr AS (\r\n SELECT\r\n TIMESTAMP_BUCKET(time, INTERVAL 2 HOUR) AS time,\r\n zip_code,\r\n CAST(AVG(aqi) AS INT64) AS aqi,\r\n MAX(temperature) AS max_temperature\r\n FROM mydataset.environmental_data_hourly\r\n GROUP BY zip_code, time\r\n ORDER BY zip_code, time)\r\n\r\nSELECT *\r\nFROM GAP_FILL(\r\n TABLE aggregated_2_hr,\r\n ts_column => 'time',\r\n bucket_width => INTERVAL 2 HOUR,\r\n partitioning_columns => ['zip_code'],\r\n value_columns => [\r\n ('aqi', 'locf'),\r\n ('max_temperature', 'linear')\r\n ]\r\n)\r\nORDER BY zip_code, time;\r\n\r\n\r\n+---------------------+----------+-----+-----------------+\r\n| time | zip_code | aqi | max_temperature |\r\n+---------------------+----------+-----+-----------------+\r\n| 2020-09-08 00:00:00 | 60606 | 23 | 66 |\r\n| 2020-09-08 02:00:00 | 60606 | 22 | 60 |\r\n| 2020-09-08 04:00:00 | 60606 | 22 | 58 |\r\n| 2020-09-08 06:00:00 | 60606 | 25 | 56 |\r\n+---------------------+----------+-----+-----------------+"), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3e40e2768c70>)])]>

GAP_FILL can also be used to produce aligned output data without having to bucket the input data first.

Combined, these new bucketing and gap-filling operations provide key building blocks of time-series analytics. Users can now leverage BigQuery's serverless architecture, high-throughput streaming ingestion, massive parallelism, and built-in AI to derive valuable insights from their time-series data alongside the rest of their business data, all without relying on a separate data silo.

Importantly, these new functions align with the standard SQL syntax and semantics analytics that experts are familiar with. There is no need to learn a new language or a new computation model. Simply add these functions into your existing workflows and start delivering insights.

Introducing the RANGE data type

The RANGE type and functions make it easier to work with contiguous windows of data. For example, RANGE<DATE> "[2024-01-01, 2024-02-01)" represents all DATE values starting from 2024-01-01 up to and excluding 2024-02-01. RANGE can also be unbounded on either end, representing the beginning and end of time respectively. RANGE is useful for representing a state that is true, or valid, over a contiguous period of time. Quota assigned to a user, the exchange rate of a currency over different times, the value of a system setting, or the version number of an algorithm are all great examples.

We are also introducing features that allow you to combine or expand RANGEs at query time. RANGE_SESSIONIZE is a TVF that allows you to combine overlapping or adjacent RANGEs into a single row. The table below shows how users can use this feature to create a smaller table with the same information:

code_block <ListValue: [StructValue([('code', 'CREATE OR REPLACE TABLE mydataset.sensor_metrics AS\r\nSELECT * FROM UNNEST(\r\n ARRAY<STRUCT<sensor_id INT64, duration RANGE<DATETIME>, flow INT64, spins INT64>>[\r\n (1, RANGE<DATETIME> "[2020-01-01 12:00:01, 2020-01-01 12:05:23)", 10, 1),\r\n (1, RANGE<DATETIME> "[2020-01-01 12:05:12, 2020-01-01 12:10:46)", 10, 20), \r\n (1, RANGE<DATETIME> "[2020-01-01 12:10:27, 2020-01-01 12:15:56)", 11, 4),\r\n (2, RANGE<DATETIME> "[2020-01-01 12:05:08, 2020-01-01 12:10:30)", 21, 2),\r\n (2, RANGE<DATETIME> "[2020-01-01 12:10:22, 2020-01-01 12:15:42)", 21, 10)\r\n]);'), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3e40df6a4850>)])]>

RANGE_SESSIONIZE finds all overlapping ranges and outputs the session range, which is a union of all the ranges that overlap within the (sensor_id, flow) partition. We finally group by the session ranges to output a table that combines the overlapping ranges:

code_block <ListValue: [StructValue([('code', 'SELECT sensor_id, session_range, flow\r\nFROM RANGE_SESSIONIZE(\r\n (SELECT sensor_id, duration, flow FROM mydataset.sensor_metrics),\r\n "duration",\r\n ["sensor_id", "flow"],\r\n "OVERLAPS")\r\nORDER BY sensor_id, session_range;\r\n\r\n+-----------+--------------------------------------------+------+\r\n| sensor_id | session_range | flow | \r\n+-----------+--------------------------------------------+------+\r\n| 1 | [2020-01-01 12:00:01, 2020-01-01 12:10:46) | 10 |\r\n| 1 | [2020-01-01 12:00:01, 2020-01-01 12:10:46) | 10 |\r\n| 1 | [2020-01-01 12:10:27, 2020-01-01 12:15:56) | 11 |\r\n| 2 | [2020-01-01 12:05:08, 2020-01-01 12:15:42) | 21 |\r\n| 2 | [2020-01-01 12:05:08, 2020-01-01 12:15:42) | 21 |\r\n+-----------+--------------------------------------------+------+\r\n\r\nSELECT sensor_id, session_range, flow, SUM(spins)\r\nFROM RANGE_SESSIONIZE(\r\n TABLE mydataset.sensor_metrics, \r\n "duration",\r\n ["sensor_id", "flow"],\r\n "OVERLAPS")\r\nGROUP BY sensor_id, session_range, flow\r\nORDER BY sensor_id, session_range\r\n\r\n+-----------+--------------------------------------------+------+-------+\r\n| sensor_id | session_range | flow | spins |\r\n+-----------+--------------------------------------------+------+-------+\r\n| 1 | [2020-01-01 12:00:01, 2020-01-01 12:10:46) | 10 | 21 |\r\n| 1 | [2020-01-01 12:10:27, 2020-01-01 12:15:56) | 11 | 4 |\r\n| 2 | [2020-01-01 12:05:08, 2020-01-01 12:15:42) | 21 | 12 |\r\n+-----------+--------------------------------------------+------+-------+'), ('language', ''), ('caption', <wagtail.rich_text.RichText object at 0x3e40df729d60>)])]>

You can do much more with RANGE. For example, you can use RANGE to JOIN a timestamp table with a RANGE table or emulate the behavior of an AS OF JOIN. See more examples in the RANGE and time series documentation.

Getting started

To use the new time series analysis features, all you need is a regular table with some time series data in it. You can even try experimenting with one of the existing public datasets. Check out the documentation to learn more and get started. Give these new features a try and let us know if you have feedback.

Read Entire Article