
Reference Data Standard Library

January 12, 2022
   
Posted by Jeremy Posner

Introduction

Standard libraries are a familiar construct in the programming world. The data world is similar: there is always a set of commonly used, base classes of data needed across an organisation. It is also important that everyone uses the same set of ‘reference data’ values (aka ‘static’ data), otherwise data quality and interoperability problems ensue.

This standard library concept can expand to include any type of highly reusable data, where central reference data is needed in many places and formats, from databases to web apps to Excel spreadsheets. It makes perfect sense to supply this data over simple REST APIs.

Inside financial services organisations it is common to use standard data for Geography (Countries, Regions), Language and Currency – so this example shows how RAW can create base reference data for Languages, to serve as a building block for other types of data product. The code on GitHub includes Geography and Currency data too.

Getting Started: RAW = SQL + YAML + GIT

There’s a little initial setup, which is described here. You need a RAW account and a GitHub account, and then you synchronise your GitHub repo with RAW – we provide a utility for that. If you know SQL, can write a YAML file and can commit into Git, then you’re good to go.

A Classic 3 Layer Approach

It is possible to query the raw data, transform it, and output it directly as an API, all in a single statement. This works for simple data; however, when things get more complex you will want a more structured approach to delivering standardised data. A good way of doing this is a 3-layer approach, which is common in data warehousing, data virtualisation and API management:

Layer 1: the Raw layer – often called the Source layer, System layer or Staging layer. This is where the data matches the structure of the source; the aim is a simple 1:1 representation of the data as it is available.

Layer 2: the Model layer – often called the Integration layer or Canonical layer. This is where the data becomes integrated and standardised, with common business terms, standard datatypes and formats, and conformance to quality rules.

Layer 3: the Consumption layer – where data is served up as API endpoints based on user requirements. The principles of good API design need to apply here. Some nice articles exist; I like this one by Microsoft.

The good thing about RAW is that there is no performance difference between using a 3-layer structured approach and doing all the wrangling and conversion in a single statement. This is because RAW uses whole-stage compilation and execution, and our optimiser will reduce the code to the same expression internally.
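To illustrate, here is roughly what a collapsed, single-statement version of the all_languages endpoint built later in this post could look like. This is a sketch only – the function name is made up, but the source URL and fields are the same ones used in the layered example below:

all_languages_inline() := {
    select
      Part1 as lang_code,
      Ref_Name as lang_name
    from
      READ("https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab")
    where
      Part1 is not null
};

The optimiser reduces the layered version to an equivalent expression, so splitting the logic into layers costs nothing at runtime.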

You may have an API-first approach (define the interface, then find the data that fits it), a Model-first approach (create the agreed business logical model, then map it to both the API and the underlying data), or a Data-first approach (you have the data clearly defined in a database and just want to expose it as an API). Whichever approach you take, RAW will help you do this extremely quickly and without much fuss – it may change which layer you start from, who works on each layer, and how you structure the files in GitHub.

Standard Data Library example: Languages

Data about languages is well standardised under ISO 639, so it makes sense to use this. However, as with much of the ISO data, it isn’t always that easy to consume: it is often in spreadsheets, or you need to pay for a costly ISO subscription. There are also many well-maintained sources of language data on the web.

In this example we take a dataset available as Open Data (the URL appears in the query below). In RAW we would simply query this directly:

languages_raw := 
  SELECT Id, Part2B, Part2T, Part1, Scope, Language_Type, Ref_Name, Comment 
  FROM 
  READ("https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab");

This creates a variable corresponding to the raw data (Layer 1), i.e. the data as the source serves it up. For reference data sources like this it’s important that the source doesn’t change structure; if a new field appears, that’s fine, because our query will ignore it.

The next step is to convert the ‘Raw’ data into the ‘Model’ layer (Layer 2): a commonly agreed business representation of the data. It may involve just name changes, or format changes too. In this case I have taken just the 3 fields that we want to serve up in our API, and filtered some rows too:

languages :=
    SELECT 
      Part1 as iso639_1_lang_code,
      Language_Type as language_type,
      Ref_Name as language_name 
    FROM languages_raw
    WHERE Part1 is not null;

We now have our standardised data in the variable ‘languages’. Note that this is a simple example with just one source; this layer is the place to combine multiple sources into the business representation needed to conform to the agreed model, as sketched below.
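For instance, a second raw source could be folded in with a union, along these lines (languages_raw_other and its field names are hypothetical, for illustration only):

languages :=
    SELECT
      Part1 as iso639_1_lang_code,
      Language_Type as language_type,
      Ref_Name as language_name
    FROM languages_raw
    WHERE Part1 is not null
    UNION
    SELECT
      lang_code as iso639_1_lang_code,
      lang_type as language_type,
      lang_name as language_name
    FROM languages_raw_other;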

Lastly we take the standardised language data and serve it up with an API in our Consumption layer (Layer 3). We will create two APIs:

Firstly there’s a function to return all_languages:

all_languages() := {
    select 
      iso639_1_lang_code as lang_code,
      language_name as lang_name
    from
      languages
};

And then a second function to return the languages of a country, passed as an input parameter. Aside: this function uses another function to get the country for a given ISO country code – showing RAW’s capability as a ‘composable language’, i.e. a function can return a type that is queryable by a SQL construct. This is very powerful:

languages_by_country(iso: string) := {
  select 
      languages as lang_code 
  from 
      country_by_iso(iso)
};
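The country_by_iso function itself lives alongside the Geography data in the repo. Purely as an illustration, such a function could look something like this (the countries variable and the iso_alpha2_code field are hypothetical names):

country_by_iso(iso: string) := {
  select *
  from countries
  where iso_alpha2_code = iso
};

The country record carries a languages field holding a list of language codes, which languages_by_country selects and returns as an array – see the sample output further down.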

Lastly we need to configure the two endpoints so that the functions are called, resources are allocated, caching is specified (86400 seconds is one day – we can happily cache this very static data, but we can also turn caching off), metadata is described in our catalog (which can be exported to OpenAPI), and security settings are applied (in this case the endpoints are public).

Here is the simple YAML file, languages.yml:

raw: 0.9
endpoint: GET
language: rql
codeFile: code.rql
format: json
computeClass: normal
enabled: true
declaration: all_languages
metadata:
  title: Languages
  description: Return all languages 
  tags:
    - languages
security:
  public: true
refreshSeconds: 86400

This file resides in the GitHub directory structure, which determines the API path structure, including the API version (‘1’ in this case). So in our folder structure we have a file called “api/1/public/reference-data/languages.yml”.

Here is a second YAML file, called country.yml, which we place in a sub-directory under languages so that the endpoint is structured accordingly, i.e. “api/1/public/reference-data/languages/country”:

raw: 0.9
endpoint: GET
language: rql
codeFile: ../code.rql
format: json
computeClass: normal
enabled: true
declaration: languages_by_country
metadata:
  title: List of language codes for country
  description: Returns a list of language codes. Parameter is country code using ISO forms
  tags:
    - country
    - languages
security:
  public: true
refreshSeconds: 86400

Endpoints

https://api.raw-labs.com/demos/1/public/reference-data/languages

which returns all languages in a JSON structure (try it!), and:

https://api.raw-labs.com/demos/1/public/reference-data/languages/country?iso=ES

which simply returns the languages used in a country:

[{
	"lang_code": ["es-ES", "ca", "gl", "eu", "oc"]
}] 
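For example, from the command line:

curl "https://api.raw-labs.com/demos/1/public/reference-data/languages"
curl "https://api.raw-labs.com/demos/1/public/reference-data/languages/country?iso=ES"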

To get these APIs going, we simply commit the two YAML files, country.yml and languages.yml, plus the code file code.rql, into our GitHub repository. That’s it.
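Putting the paths together, the repository layout looks like this (code.rql sits next to languages.yml, which is why country.yml refers to it as ../code.rql):

api/
└── 1/
    └── public/
        └── reference-data/
            ├── code.rql
            ├── languages.yml
            └── languages/
                └── country.yml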

Check out our other reference data endpoints too, over at: https://github.com/raw-labs/demos.raw-labs.com/tree/master/1/public/reference-data

Wrap Up

Here we show how RAW can create a standard library of reference data APIs, using just a SQL + YAML + Git approach. Our platform takes care of the infrastructure and provides the execution capabilities, so you can focus on the data challenge. I hope you can see how powerful the platform is, and how its ease of use can result in huge productivity gains for your data teams and improved data delivery to your consumers.

As usual all the code is on GitHub.

Jeremy Posner, VP Product & Solutions, RAW Labs.

