BLOG

Tutorial 2. Producing complex data with RAW

January 14, 2022
   
Tutorial
Posted by Georges Lagardère

Collections in the output field of a SELECT

Let’s start with the following query:

sales := READ("https://raw-tutorial.s3.amazonaws.com/sales.json");

all_sales() := {
    SELECT * from sales
}
collection(
    record(
        country: string,
        products: collection(
            record(
                category: string,
                cost: int
            )
        )
    )
)
[
    {
    "country": "CH",
    "products": [
        {
	"category": "Keyboard",
	"cost": 50
	},
	{
	"category": "Keyboard",
	"cost": 70
	},
	{
	"category": "Monitor",
	"cost": 450
	}
    ]
    },
    {
    "country": "US",
    "products": [
...

products is a collection of records, so it is like a “nested table”. Let’s analyze the following query:

sales := READ("https://raw-tutorial.s3.amazonaws.com/sales.json");

products_cost_per_country() := {
    SELECT s.country, (SELECT p.cost FROM s.products AS p) AS products_cost
    FROM sales AS s
}
collection(
    record(
        country: string,
        products_cost: collection(int)
    )
)
[
    {
    "country": "CH",
    "products_cost": [
        50,
	70,
	450
    ]
    },
    {
    "country": "US",
    "products_cost": [
	20,
	200
    ]
    }
]

This query returns two rows: one for CH, another for US.

The first column is the country, and the second column is a list of the cost of the products in that country.

How does this work?

The inner SELECT contains (SELECT p.cost FROM s.products AS p). We can think of this as a normal query, over a table called s.products. The output of that query is a table with the cost of each product.

When we compose it in a single query, then the result is a table where the second column contains another table.

To further demonstrate that SELECTs are just operations over collections of data, let’s add a WHERE filter to the inner SELECT:

sales := READ("https://raw-tutorial.s3.amazonaws.com/sales.json");

products_cost_over_60_per_country() := {
    SELECT s.country, (SELECT p.cost FROM s.products AS p WHERE p.cost > 60) AS products_cost_over_60
    FROM sales AS s
}
collection(
    record(
        country: string,
        products_cost_over_60: collection(int)
    )
)
[
    {
    "country": "CH",
    "products_cost_over_60": [
        70,
	450
    ]
    },
    {
    "country": "US",
    "products_cost_over_60": [
        200
    ]
    }
]

This query filtered the products in the inner SELECT for those that cost > 60.

We can even do aggregations:

sales := READ("https://raw-tutorial.s3.amazonaws.com/sales.json");

number_products_cost_over_60_per_country() := {
    SELECT s.country, (SELECT COUNT(*) FROM s.products AS p WHERE p.cost > 60) AS number_products_cost_over_60
    FROM sales AS s
}
collection(
    record(
        country: string,
        number_products_cost_over_60: long
    )
)
[
    {
    "country": "CH",
    "number_products_cost_over_60": 2
    },
    {
    "country": "US",
    "number_products_cost_over_60": 1
    }
]

This query counts the number of products in each country that cost over 60.

Extensions to GROUP BY

We start by a traditional SQL aggregation:

airports := READ("https://raw-tutorial.s3.amazonaws.com/airports.csv");

airports_per_country() := {
    SELECT Country, COUNT(*) AS Number_Airports
    FROM airports
    GROUP BY Country
    ORDER BY Number_Airports DESC
}
collection(
    record(
        Country: string,
        Number_Airports: long
    )
)
[
	{
	"Country": "United States",
	"Number_Airports": 1697
	},
	{
	"Country": "Canada",
	"Number_Airports": 435
	},
	{
	"Country": "Germany",
	"Number_Airports": 321
	},
	{
	"Country": "Australia",
	"Number_Airports": 263
	},
	{
	"Country": "Russia",
	"Number_Airports": 249
	},
	{
	"Country": "France",
	"Number_Airports": 233
	},
	{
	"Country": "China",
	"Number_Airports": 219
	},
...

This query lists the number of airports per country.

In RAW SQL, however, GROUP BY produces “groups” that can be queried.

When the GROUP BY keyword is used, the * is bound to the group.

To query the entire “group” for a given country – i.e. the airports in each country – we can do:

airports := READ("https://raw-tutorial.s3.amazonaws.com/airports.csv");

airports_per_country_detail() := {
    SELECT Country, *
    FROM airports
    GROUP BY Country
}
collection(
    record(
        Country: string, 
        _2: collection( 
            record( 
                AirportID: int,
                Name: string,
                City: string,
                Country: string,
                IATA_FAA: string,
                ICAO: string,
                Latitude: double,
                Longitude: double,
                Altitude: int,
                Timezone: double,
                DST: string,
                TZ: string
            )
        )
    )
)
[
    {
    "Country": "Austria",
    "_2": [
        {
	"AirportID": 1607,
	"Name": "Wiener Neustadt East",
	"City": "Wiener Neustadt Ost",
	"Country": "Austria",
	"IATA_FAA": "",
	"ICAO": "LOAN",
	"Latitude": 47.843333,
	"Longitude": 16.260139,
	"Altitude": 896,
	"Timezone": 1,
	"DST": "E",
	"TZ": "Europe/Vienna"
	},
	{
	"AirportID": 1608,
	"Name": "Wels",
	"City": "Wels",
	"Country": "Austria",
	"IATA_FAA": "",
	"ICAO": "LOLW",
	"Latitude": 48.183304,
	"Longitude": 14.040861,
	"Altitude": 1043,
	"Timezone": 1,
	"DST": "E",
	"TZ": "Europe/Vienna"
	},
	{
	"AirportID": 1609,
	"Name": "Graz",
	"City": "Graz",
	"Country": "Austria",
	"IATA_FAA": "GRZ",
	"ICAO": "LOWG",
	"Latitude": 46.991067,
	"Longitude": 15.439628,
	"Altitude": 1115,
	"Timezone": 1,
	"DST": "E",
	"TZ": "Europe/Vienna"
	},
...

The * is a “nested table” containing all rows in the group defined by the GROUP BY clause.

In this example the * is all the airports in a given Country, since the query does GROUP BY Country.

Since * is a table, we can query it as normally:

airports := READ("https://raw-tutorial.s3.amazonaws.com/airports.csv");

airports_per_country_name_city() := {
    SELECT Country, (SELECT Name, City FROM *)
    FROM airports
    GROUP BY Country
}
collection(
    record(
        Country: string,
        _2: collection(
            record(
                Name: string,
                City: string
            )
        )
    )
)
[
    {
    "Country": "Austria",
    "_2": [
        {
	"Name": "Wiener Neustadt East",
	"City": "Wiener Neustadt Ost"
	},
	{
	"Name": "Wels",
	"City": "Wels"
	},
	{
	"Name": "Graz",
	"City": "Graz"
	},
	{
	"Name": "Innsbruck",
...

We can even group further:

airports := READ("https://raw-tutorial.s3.amazonaws.com/airports.csv");

airports_per_country_per_city() := {
    SELECT Country, (SELECT City, COUNT(*) FROM * GROUP BY City)
    FROM airports
    GROUP BY Country
}
collection(
    record(
        Country: string, 
        _2: collection(
            record(
                City: string, 
                _2: long
            )
        )
    )
)
[
    {
    "Country": "Austria",
    "_2": [
        {
	"City": "Hohenems",
	"_2": 1
	},
        {
	"City": "Punitz-Guessing",
	"_2": 1
	},
	{
	"City": "Bad Gastein",
	"_2": 1
	},
	{
...

This groups the airports by Country, and then by City.

The COUNT(*) in the inner SELECT refers to the groups created by GROUP BY City.

Top-level Collections

Let’s look in more detail at the output of the following queries:

sales := READ("https://raw-tutorial.s3.amazonaws.com/sales.json");

sales_countries_collection() := {
    SELECT country AS name FROM sales
}
collection(
    record(
        name: string
    )
)
[
    {
    "name": "CH"
    },
    {
    "name": "US"
    }
]

This query returns a list of countries. Each row in the output has the column name. We obtain a collection of two records.

Now the following query :

sales := READ("https://raw-tutorial.s3.amazonaws.com/sales.json");

sales_countries_strings() := {
    SELECT country FROM sales
}
collection(string)
[
    "CH",
    "US"
]

… appears similar but note that AS alias is not included.

The output is different: each row is a string. There is not record, just a collection of strings.

Top-level Records

The syntax (field1: "One", field2: 1) is used to create a record with two fields: field1, a string with value "One", and field2, a integer with value 1.

Collections and Records can be nested in RAW, so the following is a valid query:

airports := READ("https://raw-tutorial.s3.amazonaws.com/airports.csv");

number_of_airports_countries() := {
    (
        Number_Of_Airports: (SELECT COUNT(*) FROM airports),
        Countries: (SELECT DISTINCT Country FROM airports)
    )
}
record(
    Number_Of_Airports: long,
    Countries: collection(string)
)
{
    "Number_Of_Airports": 8107,
    "Countries": [
        "Austria",
        "Azerbaijan",
        "Belgium",
        "Bermuda",
        "Bhutan",
        "British Indian Ocean Territory",
        "Burundi",
        "Cameroon",
        "Cayman Islands",
        "Chile",
        "Congo (Kinshasa)",
        "Cote d'Ivoire",
        "Dominica",
        "Dominican Republic",
        "Faroe Islands",
        "Finland",
        "French Guiana",
        "Grenada",
        "Guernsey",
...

The output of this query is a record with two fields: Number_of_Airports, a long and Countries, a collection of strings.

The source code for this tutorial can be found here on GitHub, and more Tutorial links can be found below.

Georges Lagardere Sales Director RAW Labs

Georges Lagardère, VP Customer Experience, RAW Labs.

Learn More

Blog

Tutorial 1. First flight with RAW

First Tutorial with RAW where we take reader through a first query, first API and passing parameters to that API

Blog

Tutorial 3. Managing complex data with RAW

Third Tutorial to show how sot read, manage and produce complex data in RAW

Blog

Hello world, meet RAW

here we show how to use RAW Data Product Platform to create a simple Hello World output

Success

Thank you for your interest. Expect to hear from us soon.

Error

Email address must contain the @ symbol