BLOG

A Better Snowflake Cache: Smart and Simple with Savings

October 3, 2022
   
Solution / Strategy
Posted by Jeremy Posner

Intro

Following on from the previous blog, which showed how to take Snowflake data and combine it with API-based data in RAW, this post shows the power and simplicity of caching over the top of Snowflake, Snowflake + APIs, or indeed any other combination of data sources in file, database or API form.

Not only is our solution powerful, flexible and incredibly easy to set up, but it could also reduce your cloud bills from unnecessary scans/compute too. Let’s see how it works…


Snowflake Caching

Snowflake’s caching mechanism is good in many circumstances, but it has a gap: it can only partially help the end user when some of the data isn’t in Snowflake, as in my previous blog or the example below, which combines real-time information with historical data held in Snowflake. How much it helps depends on how fast the other source(s) are relative to the Snowflake-cached data; either way, the user ends up waiting on something.


Caching in RAW

RAW allows simple and powerful caching rules to be created: you can add them to the code that performs the query execution, or to the configuration that defines the API endpoints. The cache is maintained in the cloud, secured, and adapts to the workload (data structures and queries), so the physical data structures are optimised differently per workload. There is also code caching, including fragments of the code tree, to speed things up further.

RAW’s cache can either work alongside Snowflake’s cache or replace it; the right choice depends on you, your workload, and the other sources you are connecting to. Say your query joins Snowflake with an API: RAW can then cache the combined query results. This is useful for a use case like the previous blog example, or when you want to provide a cache over “near” real-time data from an API plus historical data from Snowflake.

Our caching works at different levels, so let’s dig into each:

1. Caching specified in queries

Supply a cache parameter to a READ command, i.e. in the actual query. Here’s an example – you can also read more in our docs on cache hints:

Please note that as of November 2022 we have upgraded our language – see our documentation and quick tour for details. The examples below use a previous version of RQL and will be refactored shortly.

// read snowflake table ADDRESS, from CRM schema in DB01 database
// simple cache parameter

SELECT * FROM READ_SNOWFLAKE("DB01", "CRM", "ADDRESS", cache := INTERVAL "1 DAY")

The simple query above caches data for a day after it is first executed; the first execution after that day has elapsed goes back to the database. Putting the hint in the query code is granular, per data source, which can be useful when you know one of your sources is static, e.g. Country in the query below:

// cache snowflake in RAW (cache)
// read from oracle (no cache)

SELECT a.Country, SUM(o.Order) 
FROM 
  READ_SNOWFLAKE("DB01", "CRM", "ADDRESS", cache := INTERVAL "1 DAY") a,
  READ_ORACLE("ORA02", "ORDER", "CUSTOMER_ORDER", cache := INTERVAL "0 seconds") o
WHERE 
  o.CustomerID = a.CustomerID
GROUP BY 
  Country

In the above example, we have queried Snowflake (a data mart/warehouse) and Oracle (an OLTP database). We have also forced RAW to never cache the Oracle data, using cache := INTERVAL "0 seconds".
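Conceptually, each interval-based cache hint behaves like a per-source TTL (time-to-live) cache. Here’s a minimal Python sketch of that semantics – the class and function names are hypothetical illustrations, not RAW’s implementation:

```python
import time

class TTLCache:
    """Minimal per-key TTL cache, illustrating `cache := INTERVAL` semantics."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get_or_fetch(self, key, fetch, ttl_seconds, now=None):
        """Return (value, was_cache_hit); refetch when the entry has expired."""
        now = time.time() if now is None else now
        entry = self._store.get(key)
        if entry is not None and now < entry[1]:
            return entry[0], True           # still fresh: serve from cache
        value = fetch()                     # expired or absent: go to the source
        self._store[key] = (value, now + ttl_seconds)
        return value, False

cache = TTLCache()
fetches = []
def read_address():                         # stand-in for READ_SNOWFLAKE(...)
    fetches.append(1)
    return [("1", "CH"), ("2", "UK")]

ONE_DAY = 86400
_, hit1 = cache.get_or_fetch("DB01.CRM.ADDRESS", read_address, ONE_DAY, now=0)
_, hit2 = cache.get_or_fetch("DB01.CRM.ADDRESS", read_address, ONE_DAY, now=3600)
_, hit3 = cache.get_or_fetch("DB01.CRM.ADDRESS", read_address, ONE_DAY, now=90000)
# hit1 is False (first read), hit2 is True (within the day), hit3 is False (day elapsed)
```

Note that a TTL of zero makes `expires_at` equal to `now`, so an entry is never fresh – which is why `INTERVAL "0 seconds"` effectively disables caching for that source.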

2. Caching specified in API Configuration

Configuration-based caching options live in the endpoint’s YAML structure, complementing the code hints shown above.

RAW supports multiple mechanisms to build and maintain caches. To achieve a caching mechanism similar to the code hints above, we can set a parameter in the configuration to cache the whole endpoint, using expireSeconds. This keeps the results from the API cached for that duration, so if the same endpoint is called again with the same parameters, the cached results are returned. Here’s an example:

raw: 0.9
endpoint: GET 
metadata:
  title: My cached endpoint
  description: Example showing 10 minute cache for an API endpoint
code: rql 
codeFile: codefile.rql
declaration: snowflakeAndAPIQuery
format: json 
computeClass: normal 
#
# Endpoint cache settings
#
expireSeconds: 600

This works well when there’s high read-concurrency for the same data: everyone except the first user gets the benefit. However, unlike the first example, this mechanism does not support different caching strategies per data source.
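The key difference from the per-source hint is that the whole endpoint result is cached, keyed by the request parameters. A small Python sketch of that behaviour – again a hypothetical illustration, not RAW’s code:

```python
import time

class EndpointCache:
    """Sketch of `expireSeconds` semantics: one cache entry per parameter
    combination, shared by every caller of the endpoint."""

    def __init__(self, expire_seconds):
        self.expire_seconds = expire_seconds
        self._store = {}  # params -> (result, expires_at)

    def call(self, handler, params, now=None):
        now = time.time() if now is None else now
        key = tuple(sorted(params.items()))   # same parameters -> same entry
        entry = self._store.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]                   # later callers within the TTL hit the cache
        result = handler(**params)            # first caller (or post-expiry) pays the cost
        self._store[key] = (result, now + self.expire_seconds)
        return result

executions = []
def snowflake_and_api_query(customerID):      # stand-in for the endpoint's RQL declaration
    executions.append(customerID)
    return {"customer": customerID, "orders": 42}

endpoint = EndpointCache(expire_seconds=600)  # expireSeconds: 600
r1 = endpoint.call(snowflake_and_api_query, {"customerID": 7}, now=0)
r2 = endpoint.call(snowflake_and_api_query, {"customerID": 7}, now=300)  # cached
r3 = endpoint.call(snowflake_and_api_query, {"customerID": 8}, now=300)  # new params, executes
```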

Sometimes you want even the first user to get the speed benefit, rather than waiting for the cache to warm up. For this purpose RAW offers Precomputing of Endpoints using refreshCron, which allows cron-style timed execution of queries so that users always get a highly performant response, albeit on slightly delayed data – how well this fits depends on your use case.

The example below shows a cache refreshed every 30 seconds, and buildOnPublish: true ensures the cache is built as soon as the endpoint is deployed:

raw: 0.9
endpoint: GET 
metadata:
  title: My cached endpoint
  description: Example showing pre-computed endpoint with 30 second cache
code: rql 
codeFile: codefile.rql
declaration: snowflakeAndAPIQuery
format: json 
computeClass: normal 
#
# Endpoint cache settings
#
buildOnPublish: true
refreshCron: "0/30 * * * * ?" # Run the query in the background every 30 seconds.

In addition to these two strategies, RAW supports configurable cache sizes: you may want to cache a sizeable amount of data to ensure users with different needs are all catered for. The optional cacheSizeMB parameter sets a maximum cache size for this endpoint:

cacheSizeMB: 200

Larger cache sizes are available for our Business and Enterprise Pricing tiers. Contact us for details.
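To see what a size bound implies, here’s a small Python sketch of a byte-bounded cache that evicts its oldest entries when full. The oldest-first eviction policy is a hypothetical illustration – the blog doesn’t document how RAW evicts:

```python
from collections import OrderedDict

class BoundedCache:
    """Size-bounded cache: evicts oldest entries to stay under max_bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self._entries = OrderedDict()  # key -> (value, size), insertion-ordered

    def put(self, key, value, size):
        if key in self._entries:                      # replacing: reclaim old size first
            self.used -= self._entries.pop(key)[1]
        while self._entries and self.used + size > self.max_bytes:
            _, (_, evicted_size) = self._entries.popitem(last=False)  # evict oldest
            self.used -= evicted_size
        if size <= self.max_bytes:                    # skip results too big to ever fit
            self._entries[key] = (value, size)
            self.used += size

    def get(self, key):
        entry = self._entries.get(key)
        return None if entry is None else entry[0]

MB = 1024 * 1024
cache = BoundedCache(max_bytes=200 * MB)              # cacheSizeMB: 200
cache.put("q1", "result-1", 120 * MB)
cache.put("q2", "result-2", 60 * MB)
cache.put("q3", "result-3", 90 * MB)                  # forces eviction of q1
# q1 is evicted; q2 and q3 remain, using 150 MB of the 200 MB budget
```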

A Real-World Example: Orders Today vs. Yesterday

A typical case where you might want to cache data is comparing ‘live’ performance today against yesterday, or another historical day of your choosing.

In this case, today’s data is available from an Order Management API, called OrdersToday, and yesterday’s data lives in our Snowflake data warehouse, in the DWH.ORDER table. The query to retrieve both and join them on ProductID looks like this:

ComparePerformance(customerID: int, compareTo: interval := INTERVAL "1 day") := {
  today:=
    SELECT 
      t.ProductID as Prod, 
      sum(t.OrderSize) as OrderQty, 
      count(*) as NumOrders
    FROM 
      OrdersToday(customerID) t   // API from today's live data
    GROUP BY
       Prod;
 
  prev:=
    SELECT 
      h.ProdID as Prod, 
      sum(h.Quantity) as OrderQty, 
      count(*) as NumOrders
    FROM
      READ_SNOWFLAKE("DB02", "DWH", "ORDER") h  // history from warehouse
    WHERE
      h.CustID = customerID AND
      h.OrderDate = CURRENT_DATE() - compareTo 
    GROUP BY 
      Prod;

   // join the datasets
   SELECT
     t.Prod,
     t.OrderQty,
     t.NumOrders,
     p.OrderQty as prevQty,
     p.NumOrders as prevOrders
   FROM
     today t, prev p
   WHERE
     t.Prod = p.Prod
}

Configuring this endpoint with a YAML file and a five-minute refreshing cache looks like this:

raw: 0.9
endpoint: GET 
metadata:
  title: Compare Performance 
  description: Returns Orders Today vs. a Previous Day
code: rql 
codeFile: Snowflake-API-compare.rql
declaration: ComparePerformance
format: json 
computeClass: normal 
#
# Endpoint cache settings
#
cacheSizeMB: 100
buildOnPublish: true
refreshCron: "0 0/5 * * * ?" # Run the query in the background every 5 mins

Finally, deploying this to RAW just requires committing these files to your GitHub repository. You’re then good to go, assuming you have an account with us.

Wrap Up

Fast, cached data from a Live API and a Snowflake database. Made Simple. You can be up and running in minutes. Try this yourself or contact us today to find out how you could delight users with faster results, gain more control and potentially reduce those unnecessary Snowflake fees.

Jeremy Posner, VP Product and Solutions, RAW Labs.


 
