NYC Taxi Cab Analytics and APIs

January 25, 2022
Solution / Tutorial
Posted by Jeremy Posner


There are many blogs and posts on the web on Big Data using the NYC Taxi Cab dataset – it’s a good crowd pleaser because of its volume, subject matter and the data is in a simple to consume format (csv).

It’s also open data, available to view here, and can be analysed along with other sources of open data – for instance related datasets from NYC on the excellent and websites.

So this is naturally a good place to show some of the capabilities of our new RAW Data Product Platform.

For recap, the dataset includes every Taxi, Uber, Limo and other For-Hire Vehicle (FHV) journey in New York City, available as monthly files where each month is about 20 million rides in total. The files are in different column layouts, have usual data quality issues (empty fields, bad dates, etc.) and semantic inconsistencies – so, typical of the type of data that you don’t control.

Why use RAW ?

Say you want to analyse this data – what are your choices? It’s too big for Excel and Google Sheets, and, even if you had a bigger spreadsheet tool, you would need to download, then load the data into the right columns, deal with the bad stuff, and then construct the queries. A lot of data and more importantly, a lot of work for, say one year of analysis (that’s 48 files)

Downloading all this data just to run a query is time-consuming. Sure you can also load the data into a database with some ETL/ELT (DBT for instance). You could also write some python scripts with pandas, dump into a data lake and use Presto, load into Spark, BigQuery, you could also beef python up with dask, pyspark or similar. Sure these are all valid options and indeed there’s many more ways too. In fact every tool can cope with this data – they are all good tools.

I’m here to show you how easy and quick it is in RAW – so you can compare for yourself. Comments welcome please, tweeting @raw_labs is welcome too. So here’s the value proposition:

  • At RAW we query all data in place. No moving, No downloading.
  • So it’s faster to get to results. Faster to iterate. Much less work.
  • And we expose data as APIs. Data products your users can access with their tools.

Our NYC Taxi Cab Blog

This will be a multi part blog, which allows us to build up to some interesting concepts and powerful functionality with our new RAW Data Product Platform:

PART 1 – basic concepts and simple queries on one set of monthly NYC Taxi Cab data files, to create an API to expose data as endpoints

PART 2 – Add parameterising the file names, create a function with year/month input to expose multiple months, and query a year of data, plus I’ll show some integration – joining to MTA (trains, buses) data to create a broader picture of trips

PART 3 – expand further to other datasets, joining directly to APIs, in this case to perform weather analysis.

The code for these blogs will be found in GitHub here

PART 1: Basic concepts, simple queries, one month and an API

Let’s get straight to it. With RAW, you can query data on the web like this:

SELECT count(*) 

This is not too remarkable of course, but it gets a little more interesting with something like this:

SELECT pickup_dt,  
  SUM(CASE WHEN hvfhs_license_num = "HV0003" THEN 1 ELSE 0 END) as uber,
  SUM(CASE WHEN hvfhs_license_num = "HV0004" THEN 1 ELSE 0 END) as via,
  SUM(CASE WHEN hvfhs_license_num = "HV0005" THEN 1 ELSE 0 END) as lyft
  READ("", cache:=interval "10 days")
GROUP BY to_date(substring(pickup_datetime,1,10),"yyyy-MM-dd") as pickup_dt

What’s interesting here? Well, there’s inferencing on the file, to pick out the fields, formats, encoding, delimiters, etc. Plus there’s a caching parameter so we don’t have to physically read the data several times (it’s a big file, and doesn’t change, right?).

All this is still a little unremarkable, until you then wrap this above SQL statement into a simple function call, and add a YAML file to make it into an API.

Take the metadata in the associated Data Dictionary – it says that for the field hvfhs_license_num the values mean: HV0003 is for Uber, HV0004 is for Via and HV0005 is for Lyft. We can code this into a simple function in RAW, re-use the function inside another function.

fhv_type_lookup(hvfhs_license_num:string) := {
  WHEN hvfhs_license_num = "HV0003" THEN "uber" 
  WHEN hvfhs_license_num = "HV0004" THEN "via" 
  WHEN hvfhs_license_num = "HV0005" THEN "lyft" 
  ELSE "unknown"

fhv_counts() := {
  SELECT  pickup_dt,  
          fhv_type_lookup(hvfhs_license_num) as type,
          count(*) as rides
     READ("", cache:=interval "10 days") 
    to_date(substring(pickup_datetime,1,10),"yyyy-MM-dd") as pickup_dt,
  ORDER BY pickupdate

The query above returns (as JSON, It can also return tabular structure too) :

		"pickupdate": "2021-07-01",
		"type": "uber",
		"rides": 355837
		"pickupdate": "2021-07-01",
		"type": "lyft",
		"rides": 141116

Let’s dig a little further. RAW SQL is composable functional language, so that the results of queries are collections of records that can be passed into functions too. We can also use the output of a function as a dataset, and input to another:

my_enum() := {

Calling my_enum here will return the enumerated collection of results, starting at 0:

		"_1": 0,
		"_2": {
			"pickupdate": "2021-07-01",
			"type": "uber",
			"rides": 355837
		"_1": 1,
		"_2": {
			"pickupdate": "2021-07-01",
			"type": "lyft",
			"rides": 141116

Let’s create an API

Creating an API in RAW is this simple:

  1. Create a Query, like we did above. Wrap it in a function call.
  2. Create a YAML file: Choose the Query function, specify some metadata and execution characteristics, and security if you need
  3. Commit the YAML file and the Query code into your GitHub project, which is set up to auto-sync with RAW.
  4. That’s it. No step 4. The API is ready 🙂

Here’s a YAML file to expose the query above as an API:

raw: 0.9
endpoint: GET
  title: Simple Example API of trip counts 
  description: Counts of FHV Trips in July 2021
    - nyc
    - transit
    - uber
    - fhv
code: rql
codeFile: fhv-counts.rql
declaration: fhv_counts
format: json
  public: true
computeClass: normal
enabled: true

Commit this and the associated query (RQL file) into GitHub, and now the API is ready – Give it a click 🙂

You can also see this and other endpoints in our demo catalog, where you can see how this is created and then create your own RAW account.

Putting it all together, and dealing with dirty data too.

In order to see all the trip count data for a particular month it’s pretty simple to create an integrated view that joins the 4 monthly files – the Green Cab data, Yellow Cab data, For-Hire-Vehicles (FHVs, e.g. Limos.) and High-Volume FHVs (e.g. Uber, Lyft, Via).

This can either be a UNION ALL, or a JOIN depending on how you want the data represented. We will show a JOIN because we want the data in columns showing each total by day.

Dealing with dirty raw data in files is often a problem, and this taxi data is no exception. Some fields are empty, dates are incorrectly formatted, etc. We will show you the way to deal with this, we can force the READ_CSV of the dataset to cast every field to a string, then filter out the bad actors with a string pattern, either in a SQL statement, or even a regular expression.

We can use a typealias below to read everything as a string:

typealias tf := record(`hvfhs_license_num`: string,`dispatching_base_num`: string,`pickup_datetime`: string,`dropoff_datetime`: string,`PULocationID`: string,`DOLocationID`: string,`SR_Flag`: string nullable);

fhv_hv := SELECT pickup_dt, 
  SUM(CASE WHEN hvfhs_license_num = "HV0003" THEN 1 ELSE 0 END) as uber,
  SUM(CASE WHEN hvfhs_license_num = "HV0004" THEN 1 ELSE 0 END) as via,
  SUM(CASE WHEN hvfhs_license_num = "HV0005" THEN 1 ELSE 0 END) as lyft
  READ_CSV[tf]("", cache:=interval "10 days")
WHERE pickup_datetime like "2021-07%" // filter bad dates out
GROUP BY to_date(substring(pickup_datetime,1,10),"yyyy-MM-dd") as pickup_dt;

This is the final piece needed in order to query 4 files safely, and and join them all together to get daily counts of all types of car in a single month:

SELECT y.pickup_dt, y.yellow,, h.uber, h.lyft, h.via, o.fhv_other
 taxi_yellow y, taxi_green g, fhv_other o, fhv_hv h
  y.pickup_dt = g.pickup_dt and
  y.pickup_dt = o.pickup_dt and
  y.pickup_dt = h.pickup_dt
ORDER BY y.pickup_dt

The full code is here on GitHub: Query and YAML file, which configures the query to return a CSV file. It could also return JSON quite easily. Here’s the API endpoint for querying the whole month of April 2020 – give it a Click 🙂

Wrap Up … and Next Post Teaser

We showed how to use RAW to query data directly in place, and clean, join and present the data as JSON or CSV via an API. There’s no need to move the data first – creating curated data products as APIs is as simple as SQL + YAML + GIT approach.

In the next post we will show how we build on this, to do more interesting things, including how we parameterise, join this dataset to other data, including APIs, Files, and showing a much larger query across multiple months and years. Stay tuned – In the meantime why not check out the links below…

Jeremy Posner, VP Product & Solutions, RAW Labs.

Want to learn more?

Learn More


NYC Taxi Cab Analytics and APIs, Part 2

Second in Blog Series, we now parameterise our function to dynamically query different files, join to MTA train and bus data for total rides per day, and run over several months.


API Integration: Where is the ISS ?

Georges shows how to integrate two APIs and produce a new API that can easily be published on our platform, using an example about the ISS spaceship and Reverse Geocoding


Tutorial 1. First flight with RAW

First Tutorial with RAW where we take reader through a first query, first API and passing parameters to that API