NYC Taxi Cab Analytics and APIs, Part 2

January 26, 2022
Solution / Tutorial
Posted by Jeremy Posner


In the first part of this blog we queried one month of NYC taxi cab rides from the NYC TLC. We used a single month to show some basics of RAW, including SQL syntax, functions, composability, inference and typealias, plus the YAML file needed to configure the query as an API endpoint.

In this post we will build upon those concepts and show how we can:

  1. Create a better function (and API) that dynamically forms the read targets based on the month and year passed in.
  2. Join this dataset to another open dataset (MTA ridership), again using familiar SQL constructs.
  3. Reuse this function to query many months of data, e.g. a whole year of rides.

Again, the key here is that no data was moved. Queries run at source, transformation is in-line, and APIs are deployed instantly as we run DataOps on infrastructure provisioned in the cloud.

So this is: Data Delivery. Made Simple. I hope you agree. Comments are most welcome, and tweets to @raw_labs too. As usual, the code for these posts can be found on GitHub.

Let’s go…

PART 2: Dynamic targets, joining to MTA data, and querying one year of data

How many rides in total are taken across public transport and cabs in New York every year? What was the busiest day? And are there patterns? Well, now you can find out.

Let’s take the query we developed in the last post, which joined a single month of data from those 4 files: Yellow and Green cabs, FHVs (limos) and FHV-HVs (Uber, Lyft, Via). It returns a nice CSV file with ride numbers by day for each type.

The first thing we want to do is extend this function to take a parameter specifying the year and month, and then use it to construct the queries dynamically, so we can return results for any year and month (caveat: this assumes that the format of the files does not change across the query period; we know that early files have a different format, so there’s a limit to how far back we can go).

Let’s modify the function and call it simply trips(), as it will take the period as a yearmonth parameter, e.g. 2021-06. We can then extract the year and month portions using substrings, as we will need these later:

trips(yearmonth: string) := {
  y_str := substring(yearmonth,1,4);
  m_str := substring(yearmonth,6,2);
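For readers less familiar with RQL, the same extraction looks like this in plain Python (an illustrative sketch; the function name `split_yearmonth` is ours, not part of RAW — note that RAW’s substring(s, start, len) is 1-indexed while Python slices are 0-indexed):

```python
def split_yearmonth(yearmonth: str) -> tuple[str, str]:
    """Split a 'yyyy-MM' period string into its year and month parts."""
    y_str = yearmonth[0:4]  # equivalent to substring(yearmonth, 1, 4)
    m_str = yearmonth[5:7]  # equivalent to substring(yearmonth, 6, 2)
    return y_str, m_str

print(split_yearmonth("2021-06"))  # ('2021', '06')
```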


Next up, we simply replace the URL in the READ() functions to construct the correct name for that month’s files; we will also change the filter in the WHERE clause to remove bad dates:

taxi_yellow := 
  SELECT pickup_dt, count(*) as yellow
  FROM READ_CSV[t]("" + 
        y_str + "-" + m_str + ".csv", cache:=interval "10 days")
  WHERE tpep_pickup_datetime like y_str + "-" + m_str + "%" // filter bad dates out
  GROUP BY to_date(substring(tpep_pickup_datetime,1,10),"yyyy-MM-dd") as pickup_dt;

We do the same for all four file queries. Very simple. Let’s now add some more data…
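The dynamic read-target construction can be sketched in Python as follows (the base URL below is a placeholder, not the real TLC location, and the dataset stems are our illustrative assumptions):

```python
BASE = "https://example.com/trip-data"  # hypothetical base URL, for illustration only

def monthly_targets(yearmonth: str) -> dict[str, str]:
    """Build one CSV file URL per dataset for a given 'yyyy-MM' period."""
    y, m = yearmonth[0:4], yearmonth[5:7]
    datasets = ["yellow_tripdata", "green_tripdata", "fhv_tripdata", "fhvhv_tripdata"]
    return {d: f"{BASE}/{d}_{y}-{m}.csv" for d in datasets}

print(monthly_targets("2021-06")["yellow_tripdata"])
# https://example.com/trip-data/yellow_tripdata_2021-06.csv
```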

Joining to another dataset – MTA ridership

New York’s MTA publishes a ridership-estimates file on a daily basis. It contains rider numbers for trains, buses, etc., and compares them to pre-pandemic days too. More info here. The dataset starts in March 2020 at the onset of the pandemic, has some gaps, and as usual there’s unclean data in the file (some N/A and TBD values in numeric columns).

Let’s create a simple query using that dataset and assign the results to a variable mta. Dates are in the American format, and the backticks are needed because of spaces in the field names. We will also choose to cache this data for a day, since it’s updated daily. Our READ_CSV() function has an optional has_header parameter for situations where inference can’t deduce that a header exists, which sometimes happens.

Lastly, for simple cleaning, where we need to identify other values that represent NULL, we use the nulls parameter to treat N/A and TBD as representations of NULL:

mta := 
  SELECT to_date(Date,"M/d/yyyy") as date, 
    `Subways: Total Estimated Ridership` as mta_subway,
    `Buses: Total Estimated Ridership` as mta_bus,
    `LIRR: Total Estimated Ridership` as mta_lirr,
    `Metro-North: Total Estimated Ridership` as mta_metronorth,
    `Access-A-Ride: Total Scheduled Trips` as mta_accessaride
  FROM READ_CSV("", cache:= interval "1 day", 
                has_header:=true, nulls:=["N/A", "TBD"]) 
  WHERE year(to_date(Date,"M/d/yyyy")) = CAST(y_str as int) AND 
        month(to_date(Date,"M/d/yyyy")) = CAST(m_str as int);
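What the nulls option does for us can be sketched in plain Python with the standard csv module (illustrative only; `read_clean` and the sample data are ours, not part of RAW):

```python
import csv
import io

NULL_TOKENS = {"N/A", "TBD"}  # same tokens as the nulls:=[...] option above

def read_clean(csv_text: str) -> list[dict]:
    """Parse CSV text, mapping the designated null tokens to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append({k: (None if v in NULL_TOKENS else v) for k, v in row.items()})
    return rows

sample = "Date,Subways\n3/1/2020,TBD\n3/2/2020,5329915\n"
print(read_clean(sample))
# [{'Date': '3/1/2020', 'Subways': None}, {'Date': '3/2/2020', 'Subways': '5329915'}]
```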

Now that we have cleaned up the MTA data, we can join it to the NYC taxi data using the date columns and a perfectly standard SQL clause, except these “tables” are actually results of functions:

SELECT m.date, m.mta_subway, m.mta_bus, m.mta_lirr, m.mta_metronorth, 
  m.mta_accessaride, y.yellow, g.green, h.uber, h.lyft, h.via, o.fhv_other
FROM mta m, taxi_yellow y, taxi_green g, fhv_other o, fhv_hv h
WHERE m.date = y.pickup_dt and m.date = g.pickup_dt and m.date = o.pickup_dt and m.date = h.pickup_dt
}
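The date join itself is nothing exotic; in plain Python it amounts to matching per-day records on their date key, roughly like this (toy data, not the real feeds):

```python
# Two per-day datasets keyed on date string, standing in for mta and taxi_yellow.
mta = {"2021-06-01": {"mta_subway": 1900000}}
yellow = {"2021-06-01": {"yellow": 65000}, "2021-06-02": {"yellow": 70000}}

# Inner join: keep only dates present in both datasets, merging their columns.
joined = {d: {**mta[d], **yellow[d]} for d in mta.keys() & yellow.keys()}

print(joined)  # {'2021-06-01': {'mta_subway': 1900000, 'yellow': 65000}}
```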

With that, we are done. The query is ready; we will save it as trips.rql and commit it to GitHub. The associated YAML file is called trips.yml, as the file name becomes the endpoint name which points to this query:

raw: 0.9
endpoint: GET
metadata:
  title: NYC cab and transit trips, by day
  description: Number of trips from Taxis (yellow and green), For-hire vehicles (Limo, Uber, Lyft, Via) and all MTA services by day, in a single month, parameter is in format yyyy-MM
  tags:
    - nyc
    - transit
    - uber
    - taxi
    - mta
code: rql
codeFile: trips.rql
declaration: trips
format: csv
security:
  public: true
computeClass: normal
cacheSizeMB: 512
computeLimitSeconds: 30
refreshSeconds: 86400
enabled: true

Notice how we have set some execution characteristics: a cache size (512 MB), a compute limit (30 seconds) and a refresh interval (1 day). Because this is a demo and the data changes at most daily, there’s no point re-executing against static sources unnecessarily; it also means you get the data back faster from our cache.

After we commit to GitHub, RAW syncs automatically and deploys the API. So here it is, our new trips() endpoint, with both the NYC cab data and the MTA data, aggregated trips by day. Click it 🙂

Querying multiple months by reusing our function

Lastly, we can call our endpoint/function with many months in a single statement, and RAW will execute these queries in parallel across the cluster. Here are two ways to achieve this, firstly in plain SQL, by unioning one trips() call per month:

trips("2021-03")
UNION
trips("2021-04")
UNION
trips("2021-05")
… or secondly by using a SPLIT() function to call our trips() function this way, using our composable language features:

// takes multiple yearmonths, "2021-03,2021-04,2021-05..."

trips_multi(yearmonths: string) := {
  SELECT row.*
  FROM SPLIT(yearmonths, ",") as yr_mo,
       row IN trips(yr_mo)
}
Wrap Up … and Next Post Teaser

We showed how to take a query function and parameterise it to make it reusable as an API. We then showed some integration by joining to another external dataset, cleaning it up, and then executing a bigger query across 300 million rows of data.

With a simple SQL + YAML + Git approach. And we never landed any data first.

BTW, the answers are:

1,189,911,598 Total Rides.

Busiest Day: 15th July

In the next post we will show how we build on this and add some fun weather data from an API serving JSON… Stay tuned. In the meantime, why not check out the links below…

Jeremy Posner, VP Product & Solutions, RAW Labs.

Want to learn more?



NYC Taxi Cab Analytics and APIs

First in Blog Series showing New York Taxi Cab Data, how we query, clean and join this data to produce simple APIs


Querying all data, anywhere

Founder and CEO Miguel Branco explains the rationale behind the RAW platform, the problems it solves and how and why a better SQL is the way forward.


API Integration: Where is the ISS?

Georges shows how to integrate two APIs and produce a new API that can easily be published on our platform, using an example about the International Space Station (ISS) and reverse geocoding