Querying data in JSON files across multiple S3 buckets and AWS accounts
Data in S3 Buckets under Multiple Accounts?
One of our customers showed up with an interesting use case. They had multiple AWS accounts with multiple S3 buckets and they wanted to query JSON files stored across them.
They had two potential solutions:
- Replicate all the data into a single bucket, effectively creating multiple copies – sync’ing and hence extra costs would be incurred, and account and role management would be complex to maintain.
- Set up Amazon Athena, again grant cross-account permissions (as roles) so that one central account would query the files in all others. This would be complex to set up and Athena’s pricing model (up to $7 per Tb scanned) could work out expensive.
Neither option looked satisfying to them. They wanted:
- No copying of data
- No complex security
- No paying for each scan
Enter RAW – Simpler and Smarter
They decided to look at other options and tried with RAW Labs. Here’s the setup we helped them come up with.
In RAW, S3 data sources are easy to setup with credentials. You just need to repeat the operation for as many buckets as needed and create a credential for each.
Every bucket gets a unique name (for instance something like machine-1, machine-2… machine-n in their case). Accessing JSON files in the buckets with the RAW query language is done via this syntax:
And you can merge files with a UNION operation:
READ_JSON(“s3://machine-1/data-2022-03-05.json”) UNION ALL READ_JSON(“s3://machine-2/data-2022-03-05.json”)
In reality, it wasn’t that simple, because the file names weren’t known upfront, only their content structure and naming convention were. When running the query, they had first to list the bucket contents to pick the files they wanted or could query.
This was done with the LS operation. Basically when running
you get a collection of strings with the names of the files found in the bucket like:
[ “s3://machine-1/data-2022-03-01.json”, “s3://machine-1/data-2022-03-02.json”, “s3://machine-1/data-2022-03-03.json”, “s3://machine-1/data-2022-03-04.json”, “s3://machine-1/data-2022-03-05.json” ]
This allowed the client to make all sorts of selections on files based on date/names. The queries can bypass the files not of interest, since most are date-range queries too.
RAW creates secure APIs to access the data
Although they were not initially looking at building REST API endpoints to deliver their data, they found this genuinely useful and another way to deliver data to users, securely via Auth0 using Scopes and API Keys. They were easily able to plug the endpoints into their BI tool and provide the data for their analysts, using a simple call like this:
All in all, with a minimal setup and a few lines of RAW query language they were able to rapidly tackle their problem, and give them not only a more efficient, cost effective and elegant solution, but one which helps to deliver the data in additional ways which are genuinely useful to their business.
Georges Lagardère, VP Customer Experience, RAW Labs.
Want to learn more?
- Take a tour of our Demo Catalogue
- Have Questions / Comments? Ask away – join our Community discussion!
- Read our Tutorials and Getting Started Docs
- Why not follow us on LinkedIn, Twitter, YouTube
- Like code? head on over to GitHub
- Developer? Join us! we are looking for bright minds – at all levels of seniority, in databases, distributed systems, UI/UX.
Data Sharing: User Management and Invitations
We built our data delivery platform to have a simple and secure permissioning system, so that you can very easily control who has access to your API endpoints.
Querying all data, anywhere
Founder and CEO Miguel Branco explains the rationale behind the RAW platform, the problems it solves and how and why a better SQL is the way forward.
12 Benefits of Data Sharing via APIs
Data Sharing via APIs is a modern approach to unlocking valuable resources inside your organisation. Businesses sharing data are more likely to outperform those who do not. Read more about why using APIs is a better approach.
Retrieving JSON data from a REST API in Excel with Power Query
Use Excel with Power Query to retrieve data from REST APIs created through RAW in JSON, CSV, or even XML