Querying all data, anywhere

January 18, 2022
   
Strategy
Posted by Miguel Branco

Many years ago I was part of the team building the tools that helped physicists access data for the Large Hadron Collider at CERN. I recall spending many hours sitting down with physicists, helping them navigate a set of custom tools we had built so that they could track down which “physics events” they needed, download them locally and start their work. Helping them access data was, in essence, my role.

But as soon as we got through the various hurdles to get them the data they needed – which was hard enough, as there were petabytes of physics events, vast numbers of scripts, and home-grown tools – I then saw them struggling to process that data. Simple tasks like filtering events with a given set of physics properties were hard work. I recall contrasting what I saw – hand-rolled C++ code – with the world of relational databases, SQL and their powerful query optimizers. There were plenty of good reasons not to use any of those systems: the data had a complex structure with plenty of hierarchies; scale was an issue; and the data ownership and ingestion that traditional databases require were a no-go. But a lot was lost in the process.

What we were missing was a query engine that could query “all the data, anywhere”. By “all the data” I meant not just relational tables – the staple of SQL – but also data with complex, deeply hierarchical structures, for example. Whilst most data can be modelled relationally as tables, there’s a level of complexity where it’s simply not helpful to do so, as it just makes the data harder to understand. And by “anywhere” I meant data from any source, without having to download it or ingest it (i.e. replicate it) into another system.

Our solution at CERN was to build home-grown tools and plenty of processes. Since those days, NoSQL, NewSQL, virtual data warehouses, data lakes and cloud computing have made tremendous progress. But for the average user, querying “all the data, anywhere” is still an elusive goal. Much of what I saw back then I got to relive years later, for instance when querying data lakes. This problem – querying “all the data, anywhere” – is the one we have set out to solve at RAW Labs. Let me tell you how.


Querying “all data”

SQL is the de facto standard for querying data. The reasons are manifold, but at this point SQL is the most widely known query language with broad support, and there is a fundamental reason that makes it a nearly ideal query language: it is declarative. By this we mean that as a user of SQL you express the logic of a computation, not how it is carried out. A SQL user declares “I want to join datasets A and B, discarding all the data with these properties”. The user does not specify how that join should be computed: which algorithm should be used, how the computation should be distributed across a machine or cluster, or even when data files should be opened or closed, or how much should be read into memory at any point.
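
For example, a plain SQL join along these lines (the table and column names are made up purely for illustration) states only the result we want; the engine chooses the join algorithm, the access paths and the memory strategy:

-- Hypothetical tables: join customers to their orders and keep only large orders.
-- How the join is executed is entirely up to the engine.
SELECT a.name, b.total
FROM customers a
JOIN orders b ON b.customer_id = a.id
WHERE b.total > 100;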

Another property of the SQL world is the level of abstraction it provides over the data. As a user of a database system executing SQL commands, you are not concerned with how the data is stored – what format it is in – nor how best to access it. This is all handled for you. Data can be cached in memory, replicated on disk, or indexed in a variety of formats, and all of that is transparent to you. The format can even evolve over time, transparently, bringing you additional performance benefits.
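
As a small illustration (again with made-up names), adding an index changes how a query is executed but not how it is written; the engine decides whether and how to use it:

-- Creating an index is an internal storage decision; existing queries keep
-- working unchanged and may simply become faster.
CREATE INDEX idx_orders_total ON orders (total);

SELECT customer_id, total FROM orders WHERE total > 100;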

The declarative nature of SQL and the abstraction provided by database engines give ample room for optimisation. These are some of the most important reasons why SQL won the “query language” wars. That said, as at CERN, there are cases where it isn’t the right tool for the job. SQL today still comes closest to “querying all the data”, but it requires relational tables as its input; and no matter how much we try to make it fit, the fact is that some data is just too clunky to force into tables: deep hierarchies, matrices and complex document stores all exist for good reasons as ways of storing data.

There’s also SQL’s recent evolution. SQL is becoming too complex in ways that are, frankly, unhelpful to users. In recent years the SQL standard has had many extensions added. One example is the support for JSON, which is now part of the SQL:2016 standard. Under the new standard, every operation on a JSON field must use JSON-specific functions, e.g. json_arrayagg or json_array_elements, along with countless other specialised functions. In practice, this means that as a user you must now know your data is stored in JSON, because JSON is now a data type. This complicates the language immensely by adding many new keywords specific to JSON (and different ones for, say, XML). It is also a deviation from the original design principles of SQL, because you now must write code that depends on the format of your data. If your data were to move from JSON to XML or Protobufs – even with the exact same schema – you would need completely different query code.

There are good reasons for how JSON support is implemented in SQL, but in my opinion these have more to do with backward compatibility and legacy support than with providing users with the best possible query language.

At RAW Labs – and before that, in a database research group at EPFL, Switzerland, where RAW was born – we’ve been redesigning SQL for the future.

Consider the following example. Say you have JSON data with logs of transactions, where each transaction contains one or more entries:

[
  {
    "id": "xct1",
    "entries": [
      {
        "id": "ee4192ffb6d2caa490a1",
        "amount": 1200,
        "created_at": 1426412340,
      },
      {
        "id": "0e4a34e4ee7281d369df",
        "amount": 1500,
        "created_at": 1426412353,
      }
    ]
  },
  {
    "id": "xct2",
    "entries": [
      {
        "id": "de5192fab612d8a4a999",
        "amount": 2000,
        "created_at": 1426412779,
      }
    ]
  }
]

Now suppose we want to obtain the total amount, which requires scanning all elements in the JSON data and, for each of them, scanning all its entries.

Here’s how you’d perform that aggregation in PostgreSQL:

WITH entries AS (
  SELECT json_array_elements(y->'entries') AS e FROM data
)
SELECT sum((e->>'amount')::int) FROM entries;

Note the use of a JSON-specific function for accessing “array elements”, a separate syntax for accessing object fields, and the need to cast types; not to mention the use of a CTE (the WITH statement).

Contrast the example above with the following RAW SQL code to do the same query over the same data:

SELECT SUM(e.amount) FROM d IN data, e IN d.entries

Now suppose we have a structurally equivalent XML file, e.g.:

<root>
  <xct>
    <id>xct1</id>
    <entries>
      <entry>
        <id>ee4192ffb6d2caa490a1</id>
        <amount>1200</amount>
      </entry>
      <entry>
        <id>0e4a34e4ee7281d369df</id>
        <amount>1500</amount>
      </entry>
    </entries>
  </xct>
  <xct>
    <id>xct2</id>
    <entries>
      <entry>
        <id>de5192fab612d8a4a999</id>
        <amount>2000</amount>
      </entry>
    </entries>
  </xct>
</root>

Now here’s the code, again in RAW SQL, to query that XML file:

SELECT SUM(e.amount) FROM d IN data.xct, e IN d.entries.entry

(I tried to create a similar example in PostgreSQL, using the XML-specific functions, but gave up after a few frustrating attempts.)

In RAW SQL, the XML case is nearly identical to the JSON case, the only difference being that the XML has an extra named field wrapping each entry. It is essentially the same code because we are querying essentially the same data, just in a different format. The aggregation also uses the exact same SQL constructs, instead of custom, format-specific keywords. Fewer keywords and concepts to learn, and more query expressibility.

Achieving this simplicity for users was difficult internally and required us to revamp some of the core foundations of SQL, namely relational algebra. In fact, RAW uses a variant based on category theory, which gives us all the declarative power and optimisation of SQL but goes far beyond it.

This support for JSON is just a simple example of how SQL can be extended to provide a truly Structured Query Language for data. There are countless other examples where SQL struggles but RAW SQL provides the query expressibility needed. These include surprisingly common tasks such as querying web services directly from SQL, or querying complex document stores holding semi-structured data. In each of these, the state of the art before RAW has been to build custom scripts and tools, mix a variety of ad-hoc query languages and client APIs, and then revert to implementing joins and filters largely by hand, just as we had to do all those years ago at CERN.
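
To give a flavour of the web-services case, here is a sketch of what such a query could look like, reusing the same READ and IN constructs shown elsewhere in this post (the endpoint URL is purely illustrative):

SELECT SUM(e.amount)
FROM d IN READ("https://api.example.com/transactions"),
     e IN d.entries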


Querying data “anywhere”

Even if we can query “all the data”, we first have to get to it.  Over the years there have been many attempts at querying data from “anywhere”. Data integration, virtual data warehouses, and data lakes have all been recent attempts at solving this problem with the goal of providing a “single source of truth”.

Data integration and virtual data warehouses are based on the idea that data should be kept at source. On top of that, a layer – ideally virtual – is built that provides a unified view over the underlying data. Queries are sent to this layer which then splits and otherwise forwards them to the underlying systems. Data lakes turn this idea around and instead provide a single place where all types of data can be stored. Data lakes provide in essence a large, distributed file system or object-store with a SQL-like query layer on top.

While these are very distinct implementations, they are really two ways of solving the same problem: bringing disparate sources of data together. Data lakes, however, can lead to many duplicates of the data, which bring new concerns such as additional cost, security issues and quality problems. Virtual data integration, on the other hand, has performance and scalability issues. Most importantly, maintaining the consistency of the unified view over backend systems that change over time is also very hard to manage.

Data virtualisation is a powerful concept but requires a good answer to scalability and consistency issues. Data lakes can be distributed engineering marvels and provide immense flexibility and scalability. At RAW Labs, we took inspiration from both “camps”, so to speak. We use data lake technology in our query engine to provide large-scale caches and meet the scalability and robustness requirements. However, data is always referenced from the source, as in data virtualisation.

Here’s an example of joining PostgreSQL data with S3 in RAW SQL:

SELECT *
FROM READ("pgsql://server01/schema/table1") t1,
     READ("s3://raw-labs-tutorial/*.json") t2
WHERE t1.id = t2.query_id

(Example of joining a PostgreSQL table with JSON data stored on S3.)

There are actually multiple concepts at play here. First, note that data is referenced by URL only. There is no need to specify schemas, as these can be inferred automatically in many common cases. Of course, if you know the schema – or want to enforce it – you can always do so:

SELECT *
FROM READ[COLLECTION(RECORD(id: INT, ...))]("pgsql://server01/schema/table1") t1,
     READ[COLLECTION(RECORD(query_id: INT, ...))]("s3://raw-labs-tutorial/*.json") t2
WHERE t1.id = t2.query_id

(Same as above but specifying the data types. It certainly beats having to “CREATE EXTERNAL TABLE” by hand!)

Then, there is data freshness. You can also control how fresh data needs to be, which allows RAW to serve it from cache (faster) instead of going out to the original source to retrieve it (slower). This is done when reading the data:

SELECT *
FROM READ("pgsql://server01/schema/table1", freshness := INTERVAL "1 hour")
WHERE creation_date > DATE "2022-01-01"

(Example of a query over a PostgreSQL table that also specifies how fresh results need to be. If the cached data in RAW is fresh enough, it is served from cache, which is faster.)

Finally, there’s our approach to consistency. In the virtual case, over time users build queries with views layered on top of other views, which ultimately query the source systems. Evolving this code and its dependencies is difficult, because changing any underlying view or source system can affect everything that depends on it, in ways that are often hard to predict in complex scenarios.

Here we take inspiration from software development where large groups of developers work together to produce a single software artefact. And just like in software development we use tools like Git and approaches like Continuous Integration to manage RAW SQL code.

How does this work? RAW does not read “view definitions” from its own metadata tables like SQL engines traditionally do. Instead, it reads them from Git, and every Git commit transitions all RAW SQL code to the new version, atomically. All your RAW SQL code evolves at once; it can be tested in a separate branch against test databases or systems, and you can use whatever CI/CD methodology best fits your needs. As a user, you push code to a Git repository, and RAW retrieves it and updates its state atomically. Besides being a good solution to consistency and enabling a “DataOps” approach to testing and deployment, this also enables new use cases, such as sharing RAW SQL code in GitHub repositories. We can finally build, test, deploy and share data products with a truly data-as-code approach – but that’s a topic for another post.
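
To make this concrete, here is a sketch of how the freshness example from above might live under version control (the file path is illustrative, not a prescribed layout); pushing a commit that changes the file replaces the deployed definition atomically:

-- queries/recent_rows.sql (hypothetical file in a Git repository)
SELECT *
FROM READ("pgsql://server01/schema/table1", freshness := INTERVAL "1 hour")
WHERE creation_date > DATE "2022-01-01"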


A future for SQL

We believe SQL’s design principles are solid and its future bright, but changes are welcome. Our goal with RAW SQL is to give users a truly structured query language, with complete abstraction over data location, format and structure.

RAW SQL is not “yet another small variant” of SQL; it’s the SQL we should have had. Of course, we took the opportunity to modernise it as part of the RAW platform. This includes top-notch engineering: novel code generation techniques (a topic for another post!), data discovery mechanisms, Git integration and more. All of this makes RAW the best tool to discover and publish data, and the ideal tool to power new use cases for building data products as part of your data mesh architecture.

Miguel Branco, Founder and CEO, RAW Labs.

