BLOG

Querying data in JSON files across multiple S3 buckets and AWS accounts

September 15, 2022
   
Solution
Posted by Georges Lagardère

Data in S3 Buckets under Multiple Accounts?

One of our customers showed up with an interesting use case. They had multiple AWS accounts with multiple S3 buckets and they wanted to query JSON files stored across them.

They had two potential solutions:

  1. Replicate all the data into a single bucket, effectively creating multiple copies – sync’ing and hence extra costs would be incurred, and account and role management would be complex to maintain.
  2. Set up Amazon Athena, again grant cross-account permissions (as roles) so that one central account would query the files in all others. This would be complex to set up and Athena’s pricing model (up to $7 per Tb scanned) could work out expensive.

Neither option looked satisfying to them. They wanted:

  • No copying of data
  • No complex security
  • No paying for each scan

Enter RAW – Simpler and Smarter

They decided to look at other options and tried with RAW Labs. Here’s the setup we helped them come up with.

In RAW, S3 data sources are easy to setup with credentials. You just need to repeat the operation for as many buckets as needed and create a credential for each.

Every bucket gets a unique name (for instance something like machine-1, machine-2… machine-n in their case). Accessing JSON files in the buckets with the RAW query language is done via this syntax:

READ_JSON(“s3://machine-1/data-2022-03-05.json”)

And you can merge files with a UNION operation:

READ_JSON(“s3://machine-1/data-2022-03-05.json”)
UNION ALL
READ_JSON(“s3://machine-2/data-2022-03-05.json”)

In reality, it wasn’t that simple, because the file names weren’t known upfront, only their content structure and naming convention were. When running the query, they had first to list the bucket contents to pick the files they wanted or could query.

This was done with the LS operation. Basically when running

LS(“s3://machine-1/*”)

you get a collection of strings with the names of the files found in the bucket like:

[
  “s3://machine-1/data-2022-03-01.json”,
  “s3://machine-1/data-2022-03-02.json”,
  “s3://machine-1/data-2022-03-03.json”,
  “s3://machine-1/data-2022-03-04.json”,
  “s3://machine-1/data-2022-03-05.json”
]

This allowed the client to make all sorts of selections on files based on date/names. The queries can bypass the files not of interest, since most are date-range queries too.

RAW creates secure APIs to access the data

Although they were not initially looking at building REST API endpoints to deliver their data, they found this genuinely useful and another way to deliver data to users, securely via Auth0 using Scopes and API Keys. They were easily able to plug the endpoints into their BI tool and provide the data for their analysts, using a simple call like this:

https://customer.api.raw-labs.com/machine-data/endpoint?fromDate=2022-03-01&toDate=2022-03-31


Conclusion

All in all, with a minimal setup and a few lines of RAW query language they were able to rapidly tackle their problem, and give them not only a more efficient, cost effective and elegant solution, but one which helps to deliver the data in additional ways which are genuinely useful to their business.

Georges Lagardere Sales Director RAW Labs

Georges Lagardère, VP Customer Experience, RAW Labs.


Want to learn more?

Learn More

Blog

Data Sharing: User Management and Invitations

We built our data delivery platform to have a simple and secure permissioning system, so that you can very easily control who has access to your API endpoints.

Blog

Querying all data, anywhere

Founder and CEO Miguel Branco explains the rationale behind the RAW platform, the problems it solves and how and why a better SQL is the way forward.

Blog

12 Benefits of Data Sharing via APIs

Data Sharing via APIs is a modern approach to unlocking valuable resources inside your organisation. Businesses sharing data are more likely to outperform those who do not. Read more about why using APIs is a better approach.

Blog

Retrieving JSON data from a REST API in Excel with Power Query

Use Excel with Power Query to retrieve data from REST APIs created through RAW in JSON, CSV, or even XML