NYC Taxi Cab Analytics and APIs, Part 3

Intro
In the first part of this blog we queried one month of NYC Taxi Cab rides from NYC TLC. We used a single month to show some basics of RAW, including SQL syntax, functions, composability, inference and typealias, plus how to configure the query as API endpoints using YAML syntax.
In the second post, we then expanded the function / API to show how to join to another dataset, from MTA, to get all rides over a year across Cabs, Limos, Trains, Buses – which is useful if you want to perform analysis across both datasets.
In this post we will show how we can now join those tabular datasets to granular data from a weather API returning a JSON structure, and run queries to see correlations between journeys and weather.
NYC Cab data and Weather data – why RAW, and what for ?
It’s noted that there are many articles about NYC cab data and weather data analysis (in particular here and here), and all of them great – the aim here is not to perform better analysis, as these authors did a great job – but simply to show how easy it is to implement the ‘data engineering’ portion in RAW, vs other methods. We will not put fancy heatmaps and charts here – there are lots of tools for doing this, so if you’re reading this blog and want to do that with our data, please get in touch.
Many data scientists and analysts will appreciate that the ‘data engineering’ part of this type of analysis work is very often a major part of the work, it could be anywhere from 20-80% of the total effort. Data Engineers also spend too much time on infrastructure, i.e. setting up and managing resources.
If you look into the actual details of how the same/similar data was extracted, downloaded, pre-processed and how long it took IN TOTAL (hint: many hours, see: redshift, microsoft azure synapse, clickhouse), hopefully you will see RAW as a simpler, faster alternative with powerful features – meaning less “glue-code”, swapping between multiple tools, bulk-loaders, database code, python code, etc.
To summarise:
With RAW we can analyse data from many sources without first downloading, transforming and loading data. RAW makes all the data directly queryable at source. Time to first results is faster. Iterations are also faster.
Comments are most welcome, tweets @raw_labs too. As usual the code for these blogs will be found in GitHub here
So, let’s get to it …..

PART 3: Weather API – retrieve, join, analyse all those Cab, Limo, Train and Bus rides
Does the NYC weather affect journeys ? If so, which types of journey, and what types of weather?
Choosing a weather API
We will use a weather API from https://weather.visualcrossing.com/ – there are so many weather APIs out there, but we wanted one which offered hourly, historical queries across a whole month in a single API call – purely because it’s more efficient for our use case, joining monthly NYC Cab data.
We like this weather API because it offers a Visual Query Builder also, plus there’s a nice PAYG option for occasional use.
We chose hourly granularity however some APIs (like this one) go down more granularly than hourly too, so you could find out exactly what the weather was like at the time the journey was started, assuming, say, the weather in Central Park is not too different than, say in Queens. This weather API will actually aggregate across stations nearest to a location, but we didn’t do this.
For the demo purposes we will just use ‘manhattan’ as a location and let this weather API do its thing. Below are the weather stations in/around NYC, closest to ‘manhattan’, which is in red.

The API call we will use is the /timeline
endpoint, you can read about it here.
The API enables mulitple day spans, with hourly weather details, and customised results (see: ‘elements’ below) so you can just retrieve what you want, in the format you want (we want JSON for our demo) and units (we will use metric Celsius). It supports localtimes and also returns UTC.
The full API URL to get what we want looks like this :
https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/ timeline/manhattan/2022-02-01/2022-02-0?unitGroup=metric& elements=datetime%2CdatetimeEpoch%2Cname%2Ctempmax%2Ctempmin%2Ctemp%2C feelslikemax%2Cfeelslikemin%2Cfeelslike%2Cprecip%2Csnow%2Csnowdepth %2Cwindgust%2Cwindspeed%2Chours&key=YOURAPIKEY&contentType=json
Creating a RAW function for this API
Now, In RAW we create a function in our reusable component Library, that allows us to call this API above. I have omitted the typealias needed because RAW can’t infer the structure returned from a dynamically generated URL, but you can see the full file here on GitHub. OurLibrary has a number of other “API shims”, i.e. RAW SQL layers on top of existing vendor APIs.
timeline(
key: string,
location: string,
date1: date,
date2: date,
unitGroup: string null := null
) := {
read_json[timeline_response]
( "https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timeline/" +
location + "/" + cast(date1 as string) + "/" + cast(date2 as string),
http_args:=[
("key", key),
("elements","datetime,datetimeEpoch,name,tempmax,tempmin,temp,feelslikemax,
feelslikemin,feelslike,precip,snow,snowdepth,windgust,windspeed,hours"),
("unitGroup", unitGroup),
("contentType", "json") // also exist: csv and excel
]
)
};
To use this timeline
function now we can simply IMPORT
the definition using a GitHub-style URL, and then call it like this:
FROM `github://raw-labs/lib/1/public/visualcrossing.com/weather.rql`
IMPORT timeline;
timeline( secret("weather.visualcrossing.com"),
"manhattan",
date "2022-02-01",
date "2022-02-04",
"metric")
In RAW we store secrets such as API keys, so you don’t have to expose them in your code. You can see this in the secret
function above. We can also configure queries to have a cache duration, which is particularly good for saving money on repetitive API calls where the data does not change (i.e. historical weather).
Calling this code above in RAW will return the structure below:
{
"queryCost": 96,
"latitude": 40.7145,
"longitude": -74.006,
"resolvedAddress": "Manhattan, NY, United States",
"address": "manhattan",
"timezone": "America/New_York",
"tzoffset": -5,
"days": [
{
"datetime": "2022-02-01",
"datetimeEpoch": 1643691600,
"tempmax": 1.8,
"tempmin": -5.4,
"temp": -1.4,
"feelslikemax": 1.4,
"feelslikemin": -10,
"feelslike": -4.6,
"precip": 0,
"snow": 0,
"snowdepth": 18.03,
"windgust": 33.5,
"windspeed": 16.8,
"hours": [
{
"datetime": "00:00:00",
"datetimeEpoch": 1643691600,
"temp": -3.9,
"feelslike": -3.9,
"precip": 0,
"snow": 0,
"snowdepth": 18.03,
"windgust": null,
"windspeed": 3.3
},
{
"datetime": "01:00:00",
...
A word on flattening data
In RAW you can query, join, sort, merge, aggregate and generally work with nested data like flat, tabular data. RAW has a unique query engine that is designed for this, and a RAW SQL language with features to harness this power.
Regular SQL doesn’t do all of what RAW SQL can do with say, JSON or XML data, and even the latest extensions to ANSI SQL and implementations of relational databases are limited in this respect.
Sometimes you might want to flatten data in order to present to users, use in spreadsheets, etc. People like flat data, so, let’s have a look at how this is done in RAW….
To flatten this data above, in this case to take fields from days
and fields from hours
and place them in a row, we can do this in RAW SQL easily. Below shows a query on the data above – you can see that alias d
for each day is itself treated like another table, joining back to the top level results set, and thus the hourly fields can be joined back to the daily ones:
res := timeline( secret("weather.visualcrossing.com"),
"manhattan",
date "2022-02-01",
date "2022-02-04",
"metric");
SELECT
to_date(d.datetime,"yyyy-MM-dd") + to_time( h.datetime, "H:m:s") as datehour,
h.precip,
h.temp,
h.feelslike
FROM
res.days as d,
d.hours as h
Executing the above query will yield results flattened, and then we configure our API endpoint to delivery it in CSV. (Note that this WeatherCrossing API does CSV results natively, but many don’t)
So, now we have our weather results set, in a form we can easily consume in a spreadsheet or similar end user tool. If you had Excel and PowerQuery you could import this right now.
The queries and two endpoints can be found on GitHub: one returns the daily weather data, the other returns the hourly weather data, they both make use of our RAW cache too, so that repetitive queries both come back faster, and don’t use resources unnecessarily.
Here are the endpoints below – click them, they will download CSV dataset samples (full data set is licensed by visualcrossing.com)
Daily History:
https://api.raw-labs.com/examples/1/public/weather/history-daily?location=manhattan&yearmonth=2021-01
Hourly History:
https://api.raw-labs.com/examples/1/public/weather/history-hourly?location=manhattan&yearmonth=2022-01
Joining Weather to Our Trip data
So we’ve now done all the hard work, we can relax because the next piece is very easy. The first query below will join one month of weather data to one month of trip data, using the dates to join.
We will also reuse the code from our GitHub repository by IMPORT
statements:
nyc_transit_summary_daily_weather(yearmonth: string) := {
FROM `github://raw-labs/demos/1/public/weather/weather.rql`
IMPORT weather_history_daily;
FROM `github://raw-labs/demos/1/public/nyc-transit/trips.rql`
IMPORT trips;
SELECT
t.date,
t.yellow + t.green + t.uber + t.lyft + t.via + t.fhv_other as cabs,
t.mta_bus + t.mta_accessaride as busses,
t.mta_subway + t.mta_lirr + t.mta_metronorth as trains,
w.temp,
w.tempmax,
w.tempmin,
w.precip,
w.feelslike,
w.feelslikemax,
w.feelslikemin,
w.snow,
w.snowdepth
FROM
trips(yearmonth) t,
weather_history_daily("manhattan", yearmonth) w
WHERE
t.date = w.date
ORDER BY t.date
};
It’s noted that we could also query the endpoints we just created. The option is ours, and there are pros/cons of each. On GitHub you can find the query code for above, and the YML endpoint configuration for exposing this as an API – and you can see a sample* if you click below (* datasets contain data licensed by visualcrossing.com):
https://api.raw-labs.com/examples/1/public/nyc-transit/trips-summary-daily-weather?yearmonth=2021-01
The second query is an hourly query. We could of course join every trip to the weather for that trip, but we only have hourly weather data, so let’s aggregate all the cab trips by hour and then join to the hourly weather data – following this we can apply some statistical functions to see if there are correlations:
// cab trips by hour for 2021-01
hourly_cab_trips:=
SELECT
pickup_hour,
count(*) as num_trips,
avg(unix_timestamp(to_timestamp(tpep_dropoff_datetime,"yyyy-MM-dd H:m:s")) -
unix_timestamp(to_timestamp(tpep_pickup_datetime,"yyyy-MM-dd H:m:s"))) as avg_duration,
round(avg(try_cast(trip_distance as double)),2) as avg_dist,
round(avg(try_cast(fare_amount as double)),2) as avg_cab_fare,
round(avg(try_cast(total_amount as double)),2) as avg_amount_paid
FROM
READ_CSV[t]("https://nyc-tlc.s3.amazonaws.com/trip+data/yellow_tripdata_2021-01.csv"
UNION ALL
READ_CSV[t]("https://nyc-tlc.s3.amazonaws.com/trip+data/green_tripdata_2021-01.csv"
WHERE
tpep_pickup_datetime like "2021-01%" // filter bad dates
GROUP BY
date_trunc("hour", to_timestamp(tpep_pickup_datetime,"yyyy-MM-dd H:m:s")) as pickup_hour
ORDER BY pickup_hour;
// now join the cab and weather data
SELECT
c.pickup_hour, c.avg_duration, c.avg_dist,
c.avg_cab_fare, c.avg_amount_paid, c.num_trips, w.precip, w.temp,
w.feelslike, w.snow, w.snowdepth, w.windspeed
FROM
hourly_cab_trips c,
weather_history_hourly("manhattan", "2021-01") w
WHERE
c.pickup_hour = w.datehour
This will give us some results with hourly cab and weather data, you can see sample* if you click below (* datasets are licensed by visualcrossing.com):
https:/api.raw-labs.com/examples/1/public/nyc-transit/taxi-trips-hourly-weather?yearmonth=2020-08
Finally, add some Stats
So now we have our data integrated and joined. We can apply some statistical functions in RAW. You can code them up natively in RAW, or use our Python Integration to do it. Here’s two simple functions, the first is a Pearson correlation, written in RAW SQL, the second is a Spearmans Rank correlation called from SciPy:
pearson_correlation (data: collection(record(x: double, y: double))) := {
t:= SELECT
count(*) as num,
sum(x) as sigX,
sum(y) as sigY,
sum(POWER(x, 2)) as sigX2,
sum(POWER(y, 2)) as sigY2,
sum(x*y) as sigXY
FROM
data;
((t.num * t.sigXY) - (t.sigX * t.sigY)) /
POWER( ((t.num * t.sigX2) - POWER(t.sigX,2)) *
((t.num * t.sigY2) - POWER(t.sigY,2))
, 0.5)
};
// uses Python SciPy
spearmanrank_correlation (data: collection(record( x: double, y: double))) := {
spearmanr := \python(x: collection(double), y: collection(double)): collection(double) -> $$$
from scipy.stats import spearmanr
rho, p = spearmanr(x,y)
return rho, p
$$$;
spearmanr ((select x from data), (select y from data))
};
These functions both take a collection of records of doubles, i.e. : two columns of numbers, x and y. This makes it super simple to apply these to any two columns from our dataset of cab and weather data above.
With this we can run a simple RAW SQL query looking at the correlation between avg_duration
and feelslike
, for instance. As you can see it imports functions from GitHub for reusability, and is running over 12 months of cab and weather data. We will also choose a 1-hour time window (in this case 15:00) to remove any time-of-day effects.
FROM `github://raw-labs/demos/1/public/nyc-transit/weather_blog.rql`
IMPORT hourly_cab_trips_weather;
FROM `github://raw-labs/lib/1/public/raw-labs.com/stats.rql`
IMPORT spearmanrank_correlation, pearson_correlation;
// query 12 months of cab rides
// cast as double and ensure non-nullable for the spearman rank
data:=
SELECT ISNULL(cast(avg_duration as double),0.0) as x,
ISNULL(cast(feelslike as double),0.0) as y
FROM
hourly_cab_trips_weather("2020-08")
UNION ALL
hourly_cab_trips_weather("2020-09")
UNION ALL
hourly_cab_trips_weather("2020-10")
UNION ALL
hourly_cab_trips_weather("2020-11")
UNION ALL
hourly_cab_trips_weather("2020-12")
UNION ALL
hourly_cab_trips_weather("2021-01")
UNION ALL
hourly_cab_trips_weather("2021-02")
UNION ALL
hourly_cab_trips_weather("2021-03")
UNION ALL
hourly_cab_trips_weather("2021-04")
UNION ALL
hourly_cab_trips_weather("2021-05")
UNION ALL
hourly_cab_trips_weather("2021-06")
UNION ALL
hourly_cab_trips_weather("2021-07")
where
hour(pickup_hour) = 15;
spearmanrank_correlation(data)
So then, to answer the question at the top of this blog:
Does the NYC weather affect journeys ? If so, which types of journey, and what types of weather?
Well, my result from the above was:
// rho
// p
[
0.3884156262470408,
1.3647531696825649e-14
]
Not being statisticians, we found some interesting results in the limited testing. Seems to be some correlation between feelslike
and avg_duration
? Remembering that correlation is not causation, we shall leave further analysis to the experts reading, and those smart people who performed the analysis in the blogs earlier. If you are, and would like to get in touch, use RAW, improve on this blog code, feel free !
Wrap Up
This is the end of the NYC Cab blog; we covered a lot of ground. RAW is powerful and simple, and queries data at source. In RAW you can query files, databases, and APIs, join them together to create reusable, shareable data products and expose these as API very easily. There’s no infrastructure to manage, or even clusters to start/stop, and we offer compute resources that are matched to workloads.
If you want to be more productive with data, can write SQL, simple functions, use GitHub and have an enquiring mind or a tricky business problem, then it might be for you. Thanks for reading – comments welcome, likes, shares, etc.

Jeremy Posner, VP Product & Solutions, RAW Labs.
Want to learn more?
- Give RAW a try: Get Started for free!
- Why not follow us on LinkedIn, or Twitter, or join the conversation over at Reddit
- Read our Tutorials and Getting Started docs
- Like code? head on over to GitHub and look at our demo APIs
- Developer? Join us! we are looking for bright minds – at all levels of seniority, in databases, distributed systems, UI/UX.
Learn More

Get Started with RAW: Hello World!
Instructional video to show how to get started with RAW, using a simple Hello World example. Accompanying instructions can also be found on our Developer Site

Tutorial 1. First flight with RAW
First Tutorial with RAW where we take reader through a first query, first API and passing parameters to that API

API Integration: Where is the ISS ?
Georges shows how to integrate two APIs and produce a new API that can easily be published on our platform, using an example about the ISS spaceship and Reverse Geocoding