
NYC Taxi Cab Analytics and APIs, Part 3

February 8, 2022
   
Solution / Tutorial
Posted by Jeremy Posner

Intro

In the first part of this blog we queried one month of NYC Taxi Cab rides from NYC TLC. We used a single month to show some basics of RAW, including SQL syntax, functions, composability, inference and typealias, plus how to configure the query as API endpoints using YAML syntax.

In the second post, we then expanded the function / API to show how to join to another dataset, from MTA, to get all rides over a year across Cabs, Limos, Trains, Buses – which is useful if you want to perform analysis across both datasets.

In this post we will show how we can now join those tabular datasets to granular data from a weather API returning a JSON structure, and run queries to see correlations between journeys and weather.

NYC Cab data and Weather data – why RAW, and what for?

There are many articles on NYC cab data and weather analysis (in particular here and here), and all of them are great. The aim here is not to perform better analysis – those authors did a great job – but simply to show how easy it is to implement the ‘data engineering’ portion in RAW compared with other methods. We won’t put fancy heatmaps and charts here – there are plenty of tools for that – so if you’re reading this blog and want to build them on top of our data, please get in touch.

Many data scientists and analysts will appreciate that the ‘data engineering’ part of this type of analysis is very often a major share of the work – anywhere from 20–80% of the total effort. Data engineers also spend too much time on infrastructure, i.e. setting up and managing resources.

If you look into the actual details of how the same or similar data was extracted, downloaded and pre-processed, and how long it took in total (hint: many hours – see: redshift, microsoft azure synapse, clickhouse), hopefully you will see RAW as a simpler, faster alternative with powerful features – meaning less “glue code” and less swapping between multiple tools, bulk loaders, database code, Python code, etc.

To summarise:

With RAW we can analyse data from many sources without first downloading, transforming and loading data. RAW makes all the data directly queryable at source. Time to first results is faster. Iterations are also faster.

Comments are most welcome – tweets to @raw_labs too. As usual, the code for these blogs can be found on GitHub here.

So, let’s get to it…

PART 3: Weather API – retrieve, join, analyse all those Cab, Limo, Train and Bus rides

Does the NYC weather affect journeys? If so, which types of journey, and what types of weather?

Choosing a weather API

We will use a weather API from https://weather.visualcrossing.com/. There are many weather APIs out there, but we wanted one offering hourly, historical queries across a whole month in a single API call – purely because that’s more efficient for our use case of joining monthly NYC Cab data.

We like this weather API because it also offers a Visual Query Builder, plus there’s a nice PAYG option for occasional use.

We chose hourly granularity, although some APIs (like this one) go finer than hourly, so you could find out exactly what the weather was like when a journey started – assuming, say, the weather in Central Park is not too different from the weather in Queens. This weather API can also aggregate across the stations nearest to a location, but we didn’t use that.

For demo purposes we will just use ‘manhattan’ as the location and let the weather API do its thing. Below are the weather stations in and around NYC, closest to ‘manhattan’, which is shown in red.

Weather stations near Manhattan (red dot) from Visual Crossing

The API call we will use is the /timeline endpoint – you can read about it here.

The API supports multi-day spans with hourly weather details and customised results (see ‘elements’ below), so you can retrieve just what you want, in the format you want (JSON for our demo) and the units you want (metric, Celsius). It supports local times and also returns UTC.

The full API URL to get what we want looks like this:

https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/
timeline/manhattan/2022-02-01/2022-02-04?unitGroup=metric&
elements=datetime%2CdatetimeEpoch%2Cname%2Ctempmax%2Ctempmin%2Ctemp%2C
feelslikemax%2Cfeelslikemin%2Cfeelslike%2Cprecip%2Csnow%2Csnowdepth%2Cwindgust%2Cwindspeed%2Chours&key=YOURAPIKEY&contentType=json
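For reference, the same request URL can be assembled with Python’s standard library – a minimal sketch with a placeholder key, not part of the RAW pipeline:

```python
from urllib.parse import urlencode

# Base of the Visual Crossing /timeline endpoint.
BASE = "https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timeline"

# Only the elements we need; urlencode turns the commas into %2C.
ELEMENTS = ("datetime,datetimeEpoch,name,tempmax,tempmin,temp,"
            "feelslikemax,feelslikemin,feelslike,precip,snow,"
            "snowdepth,windgust,windspeed,hours")

def timeline_url(location, date1, date2, key, unit_group="metric"):
    """Build the /timeline URL for a location and date range."""
    query = urlencode({
        "unitGroup": unit_group,
        "elements": ELEMENTS,
        "key": key,
        "contentType": "json",
    })
    return f"{BASE}/{location}/{date1}/{date2}?{query}"

url = timeline_url("manhattan", "2022-02-01", "2022-02-04", "YOURAPIKEY")
```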

Creating a RAW function for this API

Now, in RAW we create a function in our reusable component Library that allows us to call the API above. I have omitted the typealias needed, because RAW can’t infer the structure returned from a dynamically generated URL, but you can see the full file here on GitHub. Our Library has a number of other “API shims”, i.e. RAW SQL layers on top of existing vendor APIs.

timeline( 
    key: string, 
    location: string, 
    date1: date, 
    date2: date, 
    unitGroup: string null := null
    ) := {

    read_json[timeline_response] 
        ( "https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timeline/" + 
        location + "/" + cast(date1 as string) + "/" + cast(date2 as string), 
        http_args:=[
            ("key", key),
            ("elements", "datetime,datetimeEpoch,name,tempmax,tempmin,temp,feelslikemax,feelslikemin,feelslike,precip,snow,snowdepth,windgust,windspeed,hours"),
            ("unitGroup", unitGroup),
            ("contentType", "json") // csv and excel also exist
        ]
    )
};

To use this timeline function, we can now simply IMPORT the definition using a GitHub-style URL and call it like this:

FROM `github://raw-labs/lib/1/public/visualcrossing.com/weather.rql` 
  IMPORT timeline;

timeline( secret("weather.visualcrossing.com"),
          "manhattan",
          date "2022-02-01", 
          date "2022-02-04",
          "metric")

In RAW we store secrets such as API keys so you don’t have to expose them in your code – you can see this in the secret function above. We can also configure queries with a cache duration, which is particularly good for saving money on repetitive API calls where the data does not change (e.g. historical weather).
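Outside RAW, the usual equivalent is to keep the key in an environment variable rather than in source code – a minimal sketch (the variable name VISUALCROSSING_KEY is our own invention):

```python
import os

# Keep the API key out of source control - read it from the environment,
# analogous to RAW's secret("weather.visualcrossing.com").
def get_weather_key():
    key = os.environ.get("VISUALCROSSING_KEY")  # hypothetical variable name
    if key is None:
        raise RuntimeError("Set VISUALCROSSING_KEY before calling the API")
    return key
```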

Calling the code above in RAW returns the structure below:

{
  "queryCost": 96,
  "latitude": 40.7145,
  "longitude": -74.006,
  "resolvedAddress": "Manhattan, NY, United States",
  "address": "manhattan",
  "timezone": "America/New_York",
  "tzoffset": -5,
  "days": [
    {
      "datetime": "2022-02-01",
      "datetimeEpoch": 1643691600,
      "tempmax": 1.8,
      "tempmin": -5.4,
      "temp": -1.4,
      "feelslikemax": 1.4,
      "feelslikemin": -10,
      "feelslike": -4.6,
      "precip": 0,
      "snow": 0,
      "snowdepth": 18.03,
      "windgust": 33.5,
      "windspeed": 16.8,
      "hours": [
        {
          "datetime": "00:00:00",
          "datetimeEpoch": 1643691600,
          "temp": -3.9,
          "feelslike": -3.9,
          "precip": 0,
          "snow": 0,
          "snowdepth": 18.03,
          "windgust": null,
          "windspeed": 3.3
        },
        {
          "datetime": "01:00:00",
           ...

A word on flattening data

In RAW you can query, join, sort, merge, aggregate and generally work with nested data like flat, tabular data. RAW has a unique query engine that is designed for this, and a RAW SQL language with features to harness this power.

Regular SQL can’t do all of what RAW SQL can with, say, JSON or XML data, and even the latest extensions to ANSI SQL and their implementations in relational databases are limited in this respect.

Sometimes you might want to flatten data to present it to users, use it in spreadsheets, and so on. People like flat data, so let’s have a look at how this is done in RAW.

To flatten the data above – in this case taking fields from days and fields from hours and placing them in a single row – RAW SQL makes it easy. Below is a query on the data above: you can see that the alias d for each day is itself treated like another table, joining back to the top-level result set, so the hourly fields can be joined back to the daily ones:

res := timeline( secret("weather.visualcrossing.com"),
                 "manhattan",
                 date "2022-02-01", 
                 date "2022-02-04",
                 "metric");

SELECT 
  to_date(d.datetime,"yyyy-MM-dd") + to_time( h.datetime, "H:m:s") as datehour,
  h.precip, 
  h.temp,
  h.feelslike
FROM 
  res.days as d, 
  d.hours as h
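For comparison, the same days-to-hours flattening can be sketched in plain Python over the parsed JSON response – purely illustrative, not how RAW executes it:

```python
# Flatten the nested days -> hours structure into one row per hour,
# pairing each day's date with each hour's time-of-day, as the
# RAW query above does with "res.days as d, d.hours as h".
def flatten_hours(response):
    rows = []
    for day in response["days"]:
        for hour in day.get("hours", []):
            rows.append({
                "datehour": f'{day["datetime"]} {hour["datetime"]}',
                "precip": hour["precip"],
                "temp": hour["temp"],
                "feelslike": hour["feelslike"],
            })
    return rows
```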

Executing the above query yields flattened results, and we then configure our API endpoint to deliver them as CSV. (Note that this Visual Crossing API returns CSV natively, but many APIs don’t.)

So, now we have our weather result set in a form we can easily consume in a spreadsheet or similar end-user tool. If you have Excel and PowerQuery you could import this right now.

The queries and two endpoints can be found on GitHub: one returns the daily weather data, the other the hourly weather data. Both make use of our RAW cache, so repeat queries come back faster and don’t use resources unnecessarily.

Here are the endpoints below – click them to download CSV dataset samples (the full dataset is licensed by visualcrossing.com):

Daily History: 
https://api.raw-labs.com/examples/1/public/weather/history-daily?location=manhattan&yearmonth=2021-01
Hourly History: 
https://api.raw-labs.com/examples/1/public/weather/history-hourly?location=manhattan&yearmonth=2022-01

Joining Weather to Our Trip data

So we’ve now done all the hard work, and we can relax because the next piece is very easy. The first query below joins one month of weather data to one month of trip data, using the dates as the join key.

We will also reuse the code from our GitHub repository via IMPORT statements:

nyc_transit_summary_daily_weather(yearmonth: string) := {

FROM `github://raw-labs/demos/1/public/weather/weather.rql` 
  IMPORT weather_history_daily;

FROM `github://raw-labs/demos/1/public/nyc-transit/trips.rql` 
  IMPORT trips;

SELECT 
  t.date, 
  t.yellow + t.green + t.uber + t.lyft + t.via + t.fhv_other as cabs,
  t.mta_bus + t.mta_accessaride as buses,
  t.mta_subway + t.mta_lirr + t.mta_metronorth as trains,
  w.temp,
  w.tempmax,
  w.tempmin,
  w.precip, 
  w.feelslike,
  w.feelslikemax, 
  w.feelslikemin,
  w.snow,
  w.snowdepth
FROM
  trips(yearmonth) t,
  weather_history_daily("manhattan", yearmonth) w
 WHERE 
  t.date = w.date
ORDER BY t.date

};

Note that we could also query the endpoints we just created – the choice is ours, and there are pros and cons to each. On GitHub you can find the query code for the above and the YML endpoint configuration for exposing it as an API – and you can see a sample* if you click below (* datasets contain data licensed by visualcrossing.com):

https://api.raw-labs.com/examples/1/public/nyc-transit/trips-summary-daily-weather?yearmonth=2021-01

The second query is an hourly one. We could of course join every trip to the weather at the time of that trip, but we only have hourly weather data, so let’s aggregate all the cab trips by hour and then join to the hourly weather data. Then we can apply some statistical functions to see whether there are correlations:

// cab trips by hour for 2021-01 
hourly_cab_trips:=
 SELECT 
   pickup_hour,
    count(*) as num_trips,
    avg(unix_timestamp(to_timestamp(tpep_dropoff_datetime,"yyyy-MM-dd H:m:s")) - 
      unix_timestamp(to_timestamp(tpep_pickup_datetime,"yyyy-MM-dd H:m:s"))) as avg_duration,
    round(avg(try_cast(trip_distance as double)),2) as avg_dist, 
    round(avg(try_cast(fare_amount as double)),2) as avg_cab_fare,
    round(avg(try_cast(total_amount as double)),2) as avg_amount_paid
  FROM 
    READ_CSV[t]("https://nyc-tlc.s3.amazonaws.com/trip+data/yellow_tripdata_2021-01.csv")
    UNION ALL
    READ_CSV[t]("https://nyc-tlc.s3.amazonaws.com/trip+data/green_tripdata_2021-01.csv")
  WHERE 
    tpep_pickup_datetime like "2021-01%" // filter bad dates 
  GROUP BY 
    date_trunc("hour", to_timestamp(tpep_pickup_datetime,"yyyy-MM-dd H:m:s")) as pickup_hour
  ORDER BY pickup_hour;

  // now join the cab and weather data
  SELECT
    c.pickup_hour, c.avg_duration, c.avg_dist,  
    c.avg_cab_fare, c.avg_amount_paid, c.num_trips, w.precip, w.temp, 
    w.feelslike, w.snow, w.snowdepth, w.windspeed
  FROM
    hourly_cab_trips c,
    weather_history_hourly("manhattan", "2021-01") w
  WHERE
    c.pickup_hour = w.datehour  
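The hour-bucketing and counting step that date_trunc performs above can be sketched in plain Python – illustrative only, with hypothetical function names:

```python
from datetime import datetime

# Truncate a pickup timestamp string to the hour, mirroring
# date_trunc("hour", ...) in the RAW query above.
def pickup_hour(ts):
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return dt.replace(minute=0, second=0)

# Count trips per hour bucket, the num_trips aggregate of the query.
def trips_per_hour(pickups):
    counts = {}
    for ts in pickups:
        h = pickup_hour(ts)
        counts[h] = counts.get(h, 0) + 1
    return counts
```

Joining to the hourly weather then amounts to looking up each bucket’s timestamp in the weather rows, which is exactly what the equi-join on c.pickup_hour = w.datehour does.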

This gives us results with hourly cab and weather data – you can see a sample* if you click below (* datasets are licensed by visualcrossing.com):

https://api.raw-labs.com/examples/1/public/nyc-transit/taxi-trips-hourly-weather?yearmonth=2020-08

Finally, add some Stats

So now we have our data integrated and joined, and we can apply some statistical functions in RAW. You can code them up natively in RAW, or use our Python integration. Here are two simple functions: the first is a Pearson correlation written in RAW SQL; the second is a Spearman’s rank correlation calling SciPy:

pearson_correlation (data: collection(record(x: double, y: double))) := {

  t:= SELECT 
    count(*)         as num,
    sum(x)           as sigX,
    sum(y)           as sigY,
    sum(POWER(x, 2)) as sigX2,
    sum(POWER(y, 2)) as sigY2,
    sum(x*y)         as sigXY
  FROM
    data;
  
  ((t.num * t.sigXY) - (t.sigX * t.sigY)) / 
  POWER( ((t.num * t.sigX2) - POWER(t.sigX,2)) * 
         ((t.num * t.sigY2) - POWER(t.sigY,2)) 
   , 0.5)

};
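As a sanity check, the same Pearson formula translates line for line into plain Python – a sketch using the same running sums as the RAW SQL above:

```python
from math import sqrt

# Pearson correlation from running sums, mirroring the RAW SQL version:
# num, sigX, sigY, sigX2, sigY2, sigXY.
def pearson(xs, ys):
    n = len(xs)
    sig_x, sig_y = sum(xs), sum(ys)
    sig_x2 = sum(x * x for x in xs)
    sig_y2 = sum(y * y for y in ys)
    sig_xy = sum(x * y for x, y in zip(xs, ys))
    return (n * sig_xy - sig_x * sig_y) / sqrt(
        (n * sig_x2 - sig_x ** 2) * (n * sig_y2 - sig_y ** 2))
```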

// uses Python SciPy
spearmanrank_correlation (data: collection(record( x: double, y: double))) := {

  spearmanr := \python(x: collection(double), y: collection(double)): collection(double) -> $$$
      from scipy.stats import spearmanr
      rho, p = spearmanr(x,y)
      return rho, p
  $$$;

  spearmanr ((select x from data), (select y from data))
};

These functions both take a collection of records of doubles, i.e. two columns of numbers, x and y. This makes it super simple to apply them to any two columns from our dataset of cab and weather data above.
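If SciPy isn’t to hand, Spearman’s rho (without the p-value that spearmanr also returns) can be sketched in plain Python too – assuming no tied values, for brevity:

```python
# Rank each value 1..n by sort order (assumes distinct values).
def rank(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

# Spearman's rho: with distinct values it reduces to
# 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d being the rank differences.
def spearman_rho(xs, ys):
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```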

With this we can run a simple RAW SQL query looking at the correlation between, for instance, avg_duration and feelslike. As you can see, it imports functions from GitHub for reusability and runs over 12 months of cab and weather data. We also restrict to a single one-hour window (in this case 15:00) to remove any time-of-day effects.

FROM `github://raw-labs/demos/1/public/nyc-transit/weather_blog.rql`
  IMPORT hourly_cab_trips_weather;

FROM `github://raw-labs/lib/1/public/raw-labs.com/stats.rql`
  IMPORT spearmanrank_correlation, pearson_correlation;

// query 12 months of cab rides
// cast as double and ensure non-nullable for the spearman rank
data:=
  SELECT ISNULL(cast(avg_duration as double),0.0) as x, 
         ISNULL(cast(feelslike as double),0.0) as y
  FROM
        hourly_cab_trips_weather("2020-08")
        UNION ALL
        hourly_cab_trips_weather("2020-09")
        UNION ALL 
        hourly_cab_trips_weather("2020-10")
        UNION ALL 
        hourly_cab_trips_weather("2020-11")          
        UNION ALL 
        hourly_cab_trips_weather("2020-12")
        UNION ALL 
        hourly_cab_trips_weather("2021-01")
        UNION ALL 
        hourly_cab_trips_weather("2021-02")          
        UNION ALL 
        hourly_cab_trips_weather("2021-03")
        UNION ALL 
        hourly_cab_trips_weather("2021-04") 
        UNION ALL 
        hourly_cab_trips_weather("2021-05")          
        UNION ALL 
        hourly_cab_trips_weather("2021-06")
        UNION ALL 
        hourly_cab_trips_weather("2021-07") 
  where
        hour(pickup_hour) = 15;

spearmanrank_correlation(data)

So then, to answer the question at the top of this blog:

Does the NYC weather affect journeys ? If so, which types of journey, and what types of weather?

Well, my result from the above was:

// rho
// p
[
	0.3884156262470408,
	1.3647531696825649e-14
]

Not being statisticians, we still found some interesting results in this limited testing – there seems to be some correlation between feelslike and avg_duration. Remembering that correlation is not causation, we shall leave further analysis to the experts reading this, and to the smart people who performed the analyses in the blogs linked earlier. If you are one of them and would like to get in touch, use RAW, or improve on this blog’s code, feel free!

Wrap Up

This is the end of the NYC Cab blog series; we covered a lot of ground. RAW is powerful and simple, and queries data at source. In RAW you can query files, databases and APIs, join them together to create reusable, shareable data products, and expose these as APIs very easily. There’s no infrastructure to manage, not even clusters to start and stop, and we offer compute resources matched to workloads.

If you want to be more productive with data, can write SQL and simple functions, use GitHub, and have an enquiring mind or a tricky business problem, then RAW might be for you. Thanks for reading – comments, likes and shares are all welcome.

Jeremy Posner, VP Product & Solutions, RAW Labs.



