BLOG

Integrating Snowflake with API-based Data: an Address Cleansing Example

September 30, 2022
   
Solution
Posted by Jeremy Posner

Intro

As a company you have decided to move your data into Snowflake for analytics. You’re on the journey to moving all the in-scope data there, but the following scenarios occur:

  1. You need data that is not in Snowflake yet – it’s on the plan, but for next year.
  2. You need data as a one-off; loading into Snowflake and then using it is work you want to avoid
  3. You want to combine your Snowflake data with data that belongs to someone else

The same of course applies to any other Cloud database, including RedShift, BigQuery, etc. Whichever you choose, their aim is to grow a captive audience who keep paying for more compute and storage, thus building up their Data Gravity (see my earlier blog on this topic).

We’re going to show how using RAW can help you get the job done without:

  • Paying for ELT and copying of data you don’t need
  • Paying for compute / scans you don’t need
  • Paying with excessive time and effort

AND:

  • You can create new, innovative data products.

Why RAW?

Here’s our value proposition:

  • At RAW we query all data in place. No moving, No downloading.
  • So it’s faster to get to results. Faster to iterate. Much less work.
  • And we expose data as APIs. Data products your users can access without needing to know where the data is.

Let’s get on with it….


Customer Data in Snowflake

Our example starts with a typical table in your Snowflake database. This one has some customer addresses. We can use RAW to query in VS Code using this command:

Please note as of November 2022 we have upgraded our language – see our documentation and quick tour for details. These examples below are using previous version of RQL, and will be refactored shortly.

// read snowflake table ADDRESS, from CRM schema in DB01 database

SELECT * FROM READ_SNOWFLAKE("DB01", "CRM", "ADDRESS")

This simple query returns the following data:

results set from query

Address Cleansing – The Snowflake Way

Address cleansing is a great example of a service based on data you don’t have. It’s someone else’s data product. You would not want every address on Earth in your database and the vendor would be unlikely to give it to you.

In Snowflake, their Marketplace lists many vendors of address data. The Snowflake way is generally to get you to sign up to the data inside Snowflake, join to it, execute your business logic (in this case address cleansing) and maintain everything inside their ecosystem – and in turn they profit handsomely from every query in the mushrooming database.

Marketplace vendors include Dun and Bradstreet for business addresses, and several other local country vendors for addresses in France, UK, Netherlands, US. Each vendor just provides the addresses – you have to do all the work, so imagine trying to cleanse a worldwide set of addresses using dozens of different sources and formats. Sounds like a project in itself 🙂

There is also another vendor (Demyst) who has a slightly different model – they cite: “User shares data with Demyst. Demyst enriches data. Demyst shares data back to user through Snowflake data share.”, a slightly different take, if a little clunky, but again it’s all inside the Snowflake ecosystem, and everything is a table. Plus, of course, Demyst needs to support the Snowflake-centric model so you’re locked into this mechanism.

Address Cleansing as a Service

Melissa are long-time address experts for 35 years, now with a global dataset and a nice set of APIs. Here’s their GlobalAddress API with good documentation to show how to get started, with examples. Plus you can get started for free. Note that we are not affiliated with Melissa in any way.

Let’s take a look at how we can use this in RAW.

The GlobalAddress API uses two structures, one input structure to pass your address info into the API, and one output structure they return from the API. Very simple. They are both well documented, so in RAW we can create two typealias , one for each. We have called them inRec and outRec below:

// Melissa GlobalAddress API definitions:

// input record format
typealias inRec := record(
  TransmissionReference: string nullable,
  CustomerID: string,
  Options: string nullable,
  `Records`: collection(record(
      `RecordID`: string nullable,
      `Organization`: string nullable,
      `AddressLine1`: string,
      `AddressLine2`: string nullable,
      `AddressLine3`: string nullable,
      `AddressLine4`: string nullable,
      `AddressLine5`: string nullable,
      `AddressLine6`: string nullable,
      `AddressLine7`: string nullable,
      `AddressLine8`: string nullable,
      `DoubleDependentLocality`: string nullable,
      `DependentLocality`: string nullable,
      `Locality`: string nullable,
      `SubAdministrativeArea`: string nullable,
      `AdministrativeArea`: string nullable,
      `PostalCode`: string nullable,
      `SubNationalArea`: string nullable,
      `Country`: string)));

// output record format
typealias outRec := record(
  `Version`: string,
  `TransmissionReference`: string,
  `TransmissionResults`: string,
  `TotalRecords`: string,
  `Records`: collection(record(
      `RecordID`: string,
      `Results`: string,
      `FormattedAddress`: string,
      `Organization`: string,
      `AddressLine1`: string,
      `AddressLine2`: string,
      `AddressLine3`: string,
      `AddressLine4`: string,
      `AddressLine5`: string,
      `AddressLine6`: string,
      `AddressLine7`: string,
      `AddressLine8`: string,
      `SubPremises`: string,
      `DoubleDependentLocality`: string,
      `DependentLocality`: string,
      `Locality`: string,
      `SubAdministrativeArea`: string,
      `AdministrativeArea`: string,
      `PostalCode`: string,
      `PostalCodeType`: string,
      `AddressType`: string,
      `AddressKey`: string,
      `SubNationalArea`: string,
      `CountryName`: string,
      `CountryISO3166_1_Alpha2`: string,
      `CountryISO3166_1_Alpha3`: string,
      `CountryISO3166_1_Numeric`: string,
      `CountrySubdivisionCode`: string,
      `Thoroughfare`: string,
      `ThoroughfarePreDirection`: string,
      `ThoroughfareLeadingType`: string,
      `ThoroughfareName`: string,
      `ThoroughfareTrailingType`: string,
      `ThoroughfarePostDirection`: string,
      `DependentThoroughfare`: string,
      `DependentThoroughfarePreDirection`: string,
      `DependentThoroughfareLeadingType`: string,
      `DependentThoroughfareName`: string,
      `DependentThoroughfareTrailingType`: string,
      `DependentThoroughfarePostDirection`: string,
      `Building`: string,
      `PremisesType`: string,
      `PremisesNumber`: string,
      `SubPremisesType`: string,
      `SubPremisesNumber`: string,
      `PostBox`: string,
      `Latitude`: string,
      `Longitude`: string,
      `DeliveryIndicator`: string,
      `MelissaAddressKey`: string,
      `MelissaAddressKeyBase`: string,
      `PostOfficeLocation`: string,
      `SubPremiseLevel`: string,
      `SubPremiseLevelType`: string,
      `SubPremiseLevelNumber`: string,
      `SubBuilding`: string,
      `SubBuildingType`: string,
      `SubBuildingNumber`: string,
      `UTC`: string,
      `DST`: string,
      `DeliveryPointSuffix`: string,
      `CensusKey`: string,
      `Extras`: record())));

Using these structure definitions, we can construct a function in RAW that simply calls the API endpoint with a POST method, posting in the addresses as inRec and outputting the results as outRec :

// function to call melissa API

GlobalAddress(addresses: inRec): outRec := {
  READ_JSON[outRec]("http://address.melissadata.net/V3/WEB/GlobalAddress/doGlobalAddress", 
  http_method := "post", 
  http_headers := 
  [
    ("x-raw-output-format", "json"),
    ("Content-Type", "application/json")
  ], 
  http_body_string := print_json(addresses))
};

Let’s clean some addresses …

We’re done with our set up in RAW. We’re now ready to use the address cleansing service with our Snowflake dataset. Here’s the query we will run below. This will pass all the records of the table into the Melissa API, which supports bulk address cleansing, up to 100-at-a-time, so it can be batched efficiently.

// reformat structure. Construct the record in the Melissa inRec form
r := (
  TransmissionReference: "myTest01", 
  CustomerID: "myLicenseKey", 
  Records: 
    SELECT 
       CustomerID as RecordID, 
      `Street address` as AddressLine1, 
      City, 
      Region as Locality, 
      `Postal code` as PostalCode, 
      Country 
    FROM 
      READ_SNOWFLAKE("SNOW01","CRM","ADDRESS")
);

GlobalAddress(r)

And the results from Melissa are in tabular form below. Note that RAW can display results in tabular, tree or raw output.

results in RAW from the address cleanse of snowflake data

The Melissa API not only gives the best match, with standardised and formatted addresses, but it also gives crucial match Results information in the form of codes – see their guide for more information. In this example: AV23,AV24 or AV25 are ‘good’ matches, and absence of these result codes indicate issues.

It would be trivial to filter on the results, and then update the Snowflake tables if required.

Creating a new API Data Product

Lastly, when we get the results in the format we want, it’s then easy to turn this into a new API – a new Data Product which can be shared securely with your users: this could be an address cleansing service for singular customer records, or in batches. You can decide.

To do this, simply wrap the code above in a function, create a YAML file and commit them both to GitHub. We have plenty of examples in our blogs, and a short video of how to create an API in 4 minutes.

Wrap Up

Snowflake data can be easily accessed and integrated with non-Snowflake data, including API-based services using a few lines of code in RAW. This saves time, saves money and can facilitate new, innovative data products. You also avoid cloud-database lock-in, excessive data gravity, and the ‘as-a-service’ model can replace one vendor with another at any time. See our follow on blog on Snowflake caching too.

Try us or contact us today to find out how you could get faster results.

Jeremy Posner, VP Product and Solutions, RAW Labs.


 

Get Started for Free

More Useful Links:

Like this? 

Sign up for our mailing list to receive the latest news and blogs

Learn More

Blog

A Better Snowflake Cache: Smart and Simple with Savings

Snowflake caching works only some of the time. Often you need something that can work on top of it, alongside it, which is smart, fast and simple, as well as potentially saving money.

Blog

Querying data in JSON files across multiple S3 buckets and AWS accounts

Read how RAW helps a customer query, analyse and share S3 data in JSON files across multiple AWS accounts and S3 buckets.

Blog

12 Benefits of Data Sharing via APIs

Data Sharing via APIs is a modern approach to unlocking valuable resources inside your organisation. Businesses sharing data are more likely to outperform those who do not. Read more about why using APIs is a better approach.

Blog

Introducing our Google Cloud API Connector

Our new Google Connector allows many services supported by Google's APIs to be used securely in RAW. Check out an example using Google Search Console, used for SEO and website performance

Success

Thank you for your interest. Expect to hear from us soon.

Error

Email address must contain the @ symbol