Revolutionizing SQL with pipe syntax, now available in BigQuery and Cloud Logging

2 months ago 28

News Banner

Since its invention 50 years ago, SQL has become the universal language used across the database industry. Its familiar syntax and well-established community have truly democratized how people access data everywhere. But let's face it, SQL isn't perfect. SQL's syntax has several issues that make it harder to read and write:

Rigid structure: A query must follow a particular order (SELECT … FROM … WHERE … GROUP BY…), and doing anything else requires subqueries or other complex patterns.
Awkward inside-out data flow: A query starts in the middle and then logic builds outwards, starting with FROM clauses embedded in subqueries or common table expressions (CTE).
Verbose, repetitive syntax: Tired of listing the same columns over and over in SELECT, GROUP BY, and ORDER BY, and in every subquery? We hear you.

These issues can make SQL more difficult for new users. Even for expert users, reading or writing SQL takes more work than should be necessary. A more convenient syntax would help everyone.

Over the years, several alternative languages and APIs have been proposed, and some have had some success in narrow applications. Many of these, including Apache Beam and Python DataFrames, use piped data flow, which makes it easy to compose queries arbitrarily. Many users find this syntax easier to understand and more usable than SQL.

But SQL is ubiquitous, and unlikely to be replaced any time soon. Learning a new language and migrating workloads to new tools is simply too much for most users.

This got us wondering. Would it be possible to add something to SQL to give it the same power and flexibility, while keeping everything that's familiar and works well in SQL? The answer is yes!

Introducing SQL pipe syntax: the best of both worlds

At Google Cloud, we’re on a mission to make data analysis easier and more intuitive. That's why we're thrilled to introduce pipe syntax, a groundbreaking innovation that brings the elegance of piped data flow to SQL in BigQuery and Cloud Logging. You can read our recent research paper on the topic here.

What is pipe syntax?
In a nutshell, pipe syntax is an extension to standard SQL syntax that makes SQL simpler, more concise, and more flexible. It supports the same underlying operators as standard SQL, with the same semantics and mostly the same syntax, but allows applying operators in any order, any number of times.

How it works:

Queries can start with FROM.
Operators are written sequentially, using the |> pipe symbol.

Each operator consumes its input table and produces an output table.

Most pipe operators use the same syntax as standard SQL:

SELECT, WHERE, JOIN, ORDER BY, LIMIT, etc.

Standard and pipe syntax can be mixed arbitrarily, even within the same query.

Example: Suppose you want to know the average number of taxi trips per year in Chicago by payment_type using a BigQuery public dataset. Here's how you might write that in standard and pipe syntax.

Using the standard syntax, you typically need to write a subquery:

Here’s that same query using pipe syntax — no subquery needed!

Real-world impact at HSBC

The global banking giant HSBC has already embraced pipe syntax, following impressive results experimenting with a preview version in BigQuery. They saw significant improvements in productivity and code readability, especially when dealing with large JSON datasets.

"We're thrilled with the addition of pipe syntax to BigQuery! Our team has immediately put it to use to rewrite business reports and make future reports less time consuming, particularly those involving large JSON datasets. Previously, unpacking this data required extensive filtering and extraction, leading to cumbersome and repetitive SQL query writing engagement. The pipe syntax helps to significantly streamline this process by making up to 80% of the query string reusable, so we can focus on the specific use cases. This is going to be a major productivity boost for our engineers in the future, making it far easier to write and debug complex queries. We're excited to expand our use of pipe syntax to even more use cases and data types in the future." - Zoltán Pósfai, GCP Cloud Engineering Lead, HSBC

Advantages of adding pipe syntax inside SQL

We think that the addition of pipe syntax in SQL brings a number of benefits to SQL developers. Here are a few:

Easy to learn
Learning and adopting new languages is challenging, particularly in organizations with many users, and it's best if all users use the same languages and tools. Pipe syntax is not a new language — it's a new feature within the existing SQL language. Users who already know SQL find pipe syntax very easy to learn since it has all the same operators, with mostly the same syntax.

For users new to SQL, it’s easier to learn pipe syntax first. They still need to learn the operators and some semantics (e.g., inner and outer joins), but they can use those operators to express their intended queries directly, bypassing some of the complexities and workarounds required when expressing queries in standard SQL.

Easy to adopt incrementally, without migrations
We all know that migrating to a new language or system can be expensive, time-consuming and error-prone. Since pipe syntax is a feature inside GoogleSQL, you don't need to migrate anything before you can start using it. The new syntax can be used selectively, wherever it's helpful, and all existing queries continue to work. Any new SQL is fully interoperable with existing SQL code. For example, queries with pipe syntax can call standard views that were written in standard syntax, and vice versa. Using pipe syntax in new SQL code doesn't make any existing SQL obsolete or unusable.

No performance or cost hit
Pipe syntax works in existing platforms like BigQuery without any extra layers (such as translation proxies) that can add latency, cost, or reliability risks, and that can also make debugging or tuning difficult.

There's also no additional cost. Queries with pipe syntax still have SQL's declarative semantics, meaning the SQL query optimizer will still rearrange the query to run more efficiently. In other words, queries written in either standard or pipe syntax typically have exactly the same performance.

What can you use pipe syntax for?

Whether you're exploring data, creating dashboards, building data pipelines or analyzing logs, pipe syntax empowers you to write clearer, more efficient, and more maintainable SQL queries. And because pipe syntax supports most standard SQL operators, you can use it anywhere you write queries. Here are a few applications to get you started:

Ad-hoc analysis and query debugging
When performing data exploration, you typically start by looking at rows of a table (starting with a FROM clause) to see what exists, and then add filters, aggregations, joins, ordering, etc. Pipe syntax makes this kind of exploration very easy because you can start with a FROM clause and build from there. At every step, you can see the results so far, and then add another pipe operator, and then run the query again to see the new results.

Pipe syntax also helps when debugging queries. Queries in pipe syntax have the nice property that any query prefix up to a pipe symbol is also a valid query, so it's possible to highlight that query prefix and run it, showing the intermediate result up to that point.

Data engineering lifecycle
As the volume of data grows, data processing and transformation become more challenging and time-consuming. In data-heavy environments, building, adapting and maintaining a data pipeline usually involves a large engineering effort. With its linear query structure and more intuitive syntax, pipe syntax streamlines data engineering. Say goodbye to those deeply nested queries and CTEs that always seem to show up when using standard SQL. This new addition to GoogleSQL reimagines how to parse, extract and transform data, making it easier to create and maintain data pipelines. Here's an example use case showing how to accelerate log data workloads with pipe syntax both in BigQuery and Cloud Logging.

Applying LLMs and natural language with SQL
Research shows that SQL can be challenging for large language models (LLMs) to understand or generate, for the same reasons SQL can be complex for humans to read and write. In contrast, pipe syntax breaks queries down into independent steps, which correspond closely to the desired logical data flow. It's easier for the LLM to generate pipe syntax to express a desired data flow, and the generated queries can be much simpler and more human-readable, also making human validation of generated queries much easier.

Pipe syntax also makes it possible to provide better code assistants and auto-completion, since it's much easier to understand what's going on and what's possible, and to make suggestions for local edits to one pipe operator rather than global edits to a whole query. Smarter AI-generated code suggestions and mixing more natural language-based operators into a query are great opportunities for further enhancing user productivity.

Unlock the power of pipe syntax today

For 50 years, SQL has been the universal language of data because it works so well. SQL gets many things right, and there's tremendous power in expressing queries as declarative combinations of relational operators.

But that doesn't mean SQL couldn't be better. Pipe syntax brings SQL into the future, solving SQL's biggest usability challenges, and enabling new ways of working with and extending SQL. This isn't about replacing SQL or inventing a new language. SQL with pipe syntax is still SQL, but it's a better SQL — one that's more user-friendly, flexible and expressive.

We believe pipe syntax is the right direction for the future of SQL and hope to see it adopted across the industry. We can’t wait to see how you plan to use it today! You can sign up to use pipe syntax in BigQuery here.

Resources

Sign-up form to try out pipe syntax in BigQuery
Pipe syntax documentation and detailed reference guide
Demo video
Blog post on using pipe syntax for log data analysis
Research publication (VLDB 2024) with more background and language design details
Open-source GoogleSQL implementation, ZetaSQL

Posted in

Read Entire Article