Why did we create RAW?
RAW Labs was founded in 2015 by Miguel Branco and Professor Anastasia Ailamaki.
The work behind the RAW platform goes back further, to 2010 and earlier, when Professor Ailamaki led the Data-Intensive Applications and Systems lab at EPFL. There, she and others had been researching scientific data management, high-performance query processing, and software/hardware co-design.
The motivation, as is often the case with new inventions, was frustration. Frustration with scientific applications that cannot rely on database engines and must build their own homemade solutions at great cost. Frustration at seeing new, incredibly useful paradigms for data management emerge (machine learning, for example) while current technologies coped with them so inadequately. Frustration with countless hours spent writing scripts to load data into the database, or figuring out how to tune the query engine. Frustration with Object-Relational layers. Frustration with database engines that continue to expect that “all data belongs here”, when data grows so much faster than any one database engine can ingest it. And frustration because the idea of the data warehouse as a single source of truth had failed, yet few seemed to be doing much about it.
“The solution grew gradually in our heads, and with the time to experiment in academia, it became obvious that we were onto something significant. The solution was in a combination of ideas taken from multiple domains of computer science, including compilers, functional languages and database research, as well as math.”
How? Let’s disentangle the issues:
- It takes too long to load data. Solution: don’t load data. Instead, design the engine to query at source.
- You have to write scripts and other glue code, e.g. to load or transform data. Solution: don’t write scripts. Build language features instead that cover these functions.
- It’s hard to tune the database engine. Plus, requirements change all the time, so even if it is tuned correctly, tomorrow’s queries are different from today’s. Solution: don’t tune the database. Let it tune itself based on usage.
- Modern applications have data formats that are rich and complex; not just tables and not easily modeled as tables. Solution: support rich data formats as a core part of the language and internals.
- Modern data transformations are more complex than SELECTs and JOINs. Solution: support operations other than classical database algebraic operators; but make sure to find the correct math abstractions so that the query remains “optimizable” and the query language declarative.
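As a rough illustration of the first idea in the list above (query at source instead of loading first), here is a minimal Python sketch. It is not RAW's implementation or its query language: the helper names `scan`, `where` and `select`, and the sample data, are hypothetical and chosen only for this example. Rows are read lazily from the source and filtered as they stream past, with no separate load step.

```python
import csv
import io
from typing import Callable, Iterable, Iterator

Row = dict[str, str]


def scan(source: Iterable[str]) -> Iterator[Row]:
    """Lazily yield rows straight from a CSV source; nothing is loaded up front."""
    yield from csv.DictReader(source)


def where(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Keep only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def select(rows: Iterable[Row], *columns: str) -> Iterator[Row]:
    """Project each row onto the requested columns."""
    return ({c: row[c] for c in columns} for row in rows)


# Hypothetical sample data standing in for a file that stays where it is.
RAW_DATA = "run,detector,energy\n1,A,12.5\n2,B,7.1\n3,A,30.2\n"
source = io.StringIO(RAW_DATA)

# Roughly: SELECT run, energy FROM source WHERE energy > 10
result = list(
    select(
        where(scan(source), lambda r: float(r["energy"]) > 10.0),
        "run",
        "energy",
    )
)
print(result)
```

Because each step is a generator, a row is only parsed when the final `list(...)` asks for it. The same pipeline would work on an open file handle in place of the in-memory `io.StringIO` source.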
“Conceptually, the solution is not difficult. What is difficult is to build the correct design and theoretical framework for the solution. It’s hard to build a new system that still looks and feels like SQL. But that’s what we accomplished with RAW NoDB, as we now call it, with a great deal of integration between miscellaneous concepts and ideas.”
Miguel was working at CERN on distributed data management, where he saw many young physicists struggling to write code for data analysis. They were writing low-level code to process petabytes of data.
“We couldn’t use databases over there: we had vast amounts of data; we had to own the data format, as a relational database company may not be there in 50 years but our data still has to be readable; and the type of data was far too complex to be put into a database. They cope with tables well, but badly with very large and complex data like hierarchies, arrays, etc.”
Miguel learned about Anastasia’s work on scientific data management and it intrigued him. He would have liked the physicists at CERN to have a real database, so that instead of writing that analysis code by hand over and over again, they could use a powerful, high-level declarative query language that worked over complex data. So he joined Anastasia’s group, where she was looking at many of the same problems from other (scientific) domains.
“This was the genesis of RAW Labs, as we defined the basic goals better: support complex data like we saw in scientific domains, build a database that does not own the data, have a SQL-like query language, and create a just-in-time system that optimizes based on what the user is doing, not what the DBA thinks they will do.”
Multiple prototypes and published research papers ensued. During this period, various people in industry said that what the team was doing for the scientific domain would also apply to the business domain. So in 2015 RAW Labs SA was born, operating as a stealth research project until 2019, with Miguel leading the build-out of the technology platform, aided by a small but highly specialized team concentrating on querying large-scale, distributed, heterogeneous and complex data at source.
“As we were building out the core engine, and after showing this technology to people in 2020 and early 2021, it became obvious that the world had moved on since 2015: people in business wanted APIs as the main method of interacting with data, and on the technology side everyone was doing CI/CD to deliver code faster, so we put both together on top of our core engine offering. This is the problem space that needs solving: being able to create, publish and manage data products more simply and iterate faster, using familiar tools.”