BLOG

Tutorial 3. Managing complex data with RAW

January 17, 2022
   
Tutorial
Posted by Georges Lagardère

One of the flagship features of RAW is the ability to query complex data. By complex data we mean datasets that have complex structures, with JSON or XML data being the most common examples.

Traditional databases struggle to cope with complex data, and often require it to be “flattened” into tables and columns. This is cumbersome to do and often means some semantic information implicit in the structure is lost. More recent support – for JSON types for instance – helps somewhat, but it still falls short of providing enough query capabilities for complex data.

As we will shall see, RAW supports complex data with ease and with few SQL extensions.

Querying JSON documents

Let’s start by a moderately simple JSON document.

The file at https://raw-tutorial.s3.amazonaws.com/trips.json contains the following data:

[
  {"reason": "Holidays",
   "origin": "GVA",
   "destination": "KEF",
   "dates": {"departure": "2016/02/27", "arrival": "2016/03/06"}},
  {"reason": "Work",
   "origin": "GVA",
   "destination": "SFO",
   "dates": {"departure": "2016/04/10", "arrival": "2016/04/17"}},
  {"reason": "Holidays",
   "origin": "GVA",
   "destination": "LCY",
   "dates": {"departure": "2016/05/22", "arrival": "2016/05/29"}}
]

The dates field contains a JSON object with two fields: departure and arrival.

These can be queried in RAW as in:

trips := READ("https://raw-tutorial.s3.amazonaws.com/trips.json"); 

show_trips() := {
    SELECT origin, dates.departure AS departure, dates.arrival AS arrival
    FROM trips
}

Which produces something like this on output:

collection(
    record(
        origin: string, 
        departure: string, 
        arrival: string
    )
)
[
    {
    "origin": "GVA",
    "departure": "2016/02/27",
    "arrival": "2016/03/06"
    },
    {
    "origin": "GVA",
    "departure": "2016/04/10",
    "arrival": "2016/04/17"
    },
    {
    "origin": "GVA",
    "departure": "2016/05/22",
    "arrival": "2016/05/29"
    }
]

The . in dates.departure and dates.arrival is used to navigate inside the dates field.


Now let’s query the file https://raw-tutorial.s3.amazonaws.com/sales.json, which contains the following data:

[
    {"country": "CH",
     "products": [
         {"category": "Keyboard", "cost": 50},
         {"category": "Keyboard", "cost": 70},
         {"category": "Monitor", "cost": 450}]},
    {"country": "US",
     "products": [
        {"category": "Keyboard", "cost": 20},
        {"category": "Monitor", "cost": 200}]}
]

This is a list of two rows, in which the products field is a nested list of products. Each product has a category and a cost field.

Let’s start with a simple RAW function:

sales := READ("https://raw-tutorial.s3.amazonaws.com/sales.json");

show_sales() := {
    SELECT s.country, s.products
    FROM sales AS s
}

This will produce pretty much something similar to the source:

collection(
    record( 
        country: string,
        products: collection(
            record(
                category: string,
                cost: int
            )
        )
    )
)
[
    {
    "country": "CH",
    "products": [
        {
        "category": "Keyboard",
        "cost": 50
        },
        {
        "category": "Keyboard",
        "cost": 70
        },
        {
        "category": "Monitor",
        "cost": 450
        }
    ]
    },
    {
    "country": "US",
    "products": [
        {
        "category": "Keyboard",
        "cost": 20
        },
        {
        "category": "Monitor",
        "cost": 200
        }
    ]
    }
]

As you can see, the data can be queried directly. We can now shape it into a flat table-like shape, by “unnesting” the nested list of products:

sales := READ("https://raw-tutorial.s3.amazonaws.com/sales.json");

show_sales_flat() := {
    SELECT s.country, product.category, product.cost
    FROM sales AS s, s.products AS product
}

This will produce the following “flatten” output:

collection(
    record(
        country: string,
        category: string,
        cost: int
    )
)
[
    {
    "country": "CH",
    "category": "Keyboard",
    "cost": 50
    },
    {
    "country": "CH",
    "category": "Keyboard",
    "cost": 70
    },
    {
    "country": "CH",
    "category": "Monitor",
    "cost": 450
    },
    {
    "country": "US",
    "category": "Keyboard",
...

We see sale.products appearing in the FROM clause.

How does this work?

Let’s take a step back and look at the FROM syntax in SQL.

In standard SQL, we can say SELECT row.column FROM table AS row. This means bind row to each row of table in turn; we can then refer to row.column in the projection.

When doing sales AS s, s.products AS product, if we focus on s.products, we can think of it as if it were its own table: it is the table of products in each sale. In fact, if we refer back to the original data, we see that the first row of the file had products set to the “nested table” as follows:

[
         {"category": "Keyboard", "cost": 50},
         {"category": "Keyboard", "cost": 70},
         {"category": "Monitor", "cost": 450}
]

… and the second row of the file had products set to the “nested table”:

[
        {"category": "Keyboard", "cost": 20},
        {"category": "Monitor", "cost": 200}
]

We can call these “nested tables” (or more precisely, “nested collections”) in RAW.

Therefore, FROM sales AS s assigned each row of the file to the name s. Then, for each sale, s.products AS product assigns each row of the products nested table to the name product.

This operation is called “unnesting” in RAW and is commonly-used to process nested data.


Querying XML documents

Let’s start by the following XML document available at https://raw-tutorial.s3.amazonaws.com/article.xml:

<?xml version="1.0" encoding="UTF-8"?>
<article>
  <title>This is an article.</title>
  <authors>
    <name title="Mr">John</name>
    <name title="Dr">Jane</name>
  </authors>
  <contents>
    ...
  </contents>
</article>

The first thing to notice is that this XML is not a collection: there is a top-level record article, which contains nested elements titleauthors (a collection), and contents.

Therefore, if we would try to query this with SELECT, it will fail because it is a record and not a collection.

// THIS WILL FAIL !
// with an error like : "non-collections must be aliased explicitly with AS [line, col]"

show_articles() := {
    SELECT * FROM READ("https://raw-tutorial.s3.amazonaws.com/article.xml")
}

The error indicates the file is not a collection. We can still use SELECT if we use the AS keyword. But let’s just run the READ without a SELECT.

show_articles() := {
    READ("https://raw-tutorial.s3.amazonaws.com/article.xml")
}
record(
    title: string,
    authors: record(
        name: collection(
            record(
                `@title`: string,
                `#text`: string)
        )
    ), 
    contents: string
)
{
    "title": "This is an article.",
    "authors": {
    "name": [
        {
        "@title": "Mr",
        "#text": "John"
        },
        {
        "@title": "Dr",
        "#text": "Jane"
        }
    ]
    },
    "contents": "\n    ...\n  "
}

It evaluates to a record.

This query starts to show the “true nature” of RAW SQL.

RAW SQL is a programming language that is “disguised” to look as SQL. In fact, SELECT is just a keyword used in RAW to query “collections of data”. The example above, however, is a query over a non-collection, so we do not use SELECT.

Other queries which do not produce collections are possible in RAW SQL.

The simplest example is perhaps:

one_plus_one() := {
    1+1 // This will simply return "2"
}

The output of this query is a number: not a collection, not a table. Just the number 2.

Going back to the original XML, we can query the title or the contents by doing:

articles := READ("https://raw-tutorial.s3.amazonaws.com/article.xml") ;

show_title() := {
    articles.title
}

show_content() := {
    articles.content
}

Which will respectively return the string title “this is an article” and the string content “…”.

We can do the same for authors, and this time we get a record with a field name.

articles := READ("https://raw-tutorial.s3.amazonaws.com/article.xml") ;

show_authors() := {
    articles.authors
}
record(
    name: collection(
        record(
            `@title`: string,
            `#text`: string
        )
    )
)
{
    "name": [
        {
        "@title": "Mr",
        "#text": "John"
        },
        {
        "@title": "Dr",
        "#text": "Jane"
        }
    ]
}

The field name has two inner fields: @title and #text. These two are special field names generated by RAW. The @ is used a prefix for the XML attribute title, while #text is a special field that refers to the text of an XML element.

<name title="Mr">John</name>

Now, let’s turn these author names into a table with a single query. Note that the back-ticks are used to escape special characters in field names in RAW SQL.

articles := READ("https://raw-tutorial.s3.amazonaws.com/article.xml") ;

show_authors_names() := {
    SELECT name.`@title` AS title, name.`#text` AS name
    FROM articles.authors.name AS name
}
collection(
    record(
        title: string,
        name: string
    )
)
[
    {
    "title": "Mr",
    "name": "John"
    },
    {
    "title": "Dr",
    "name": "Jane"
    }
]

The source code for this tutorial can be found here on GitHub, and more Tutorial links can be found below.

Georges Lagardere Sales Director RAW Labs

Georges Lagardère, VP Customer Experience, RAW Labs.


Want to learn more?

Learn More

Blog

Tutorial 1. First flight with RAW

First Tutorial with RAW where we take reader through a first query, first API and passing parameters to that API

Blog

Tutorial 2. Producing complex data with RAW

Second RAW Tutorial to show how to produce complex data, extensions to Group By, Top level collections and records

Blog

Hello world, meet RAW

here we show how to use RAW Data Product Platform to create a simple Hello World output