
Analysing the News with RAW

February 16, 2022

Posted by Jeremy Posner

Intro

In this blog we will show how to use RAW with some great web APIs to produce powerful analysis of web pages in just a few lines of code. With RAW you don’t need complex and expensive ELT or ETL pipelines, and you don’t need to create or manage your own clusters – and because the data is analysed in place, it’s faster to develop and iterate.

RAW is a toolkit for data query, integration, reformatting and presentation through APIs. It’s the fastest way to take all your distributed and heterogeneous data and serve it up as trusted data products. If you can use SQL, YAML and Git, then you can use RAW.

Let’s go…

XML – still used for RSS and Sitemaps

There are apparently some 50 billion web pages out there, so there is plenty of choice of data sources. These could be pages from your own website, or from a customer, competitor or partner website.

There are news feeds too. For this example we’re going to use RSS, a long-established standard that is ubiquitous, well defined and easy to use.

RSS is an XML standard – despite reports of the death of XML, there’s still plenty of it in the news and publishing spaces. RAW reads XML natively with the READ_XML() command.

// read the RSS
READ_XML("https://www.huffpost.com/section/front-page/feed?x=1")

You can run this statement yourself and see the returned XML directly in your browser.

The URL above doesn’t have a stylesheet associated with it, so you see the raw XML. Some feeds do, however, such as this RSS feed from CNN, which publishes a list of news feeds on different subjects.

Incidentally, you might come across XML in sitemaps too: most websites are described by a sitemap.xml file (here is the RAW Labs one), so this example works for both.
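For instance, here is a minimal sketch of the sitemap case, assuming the navigation mirrors the RSS example and using the url and loc node names from the standard sitemap schema:

// read a sitemap and list the page URLs it declares (sketch)
SELECT 
  sitemap_entry.loc 
FROM 
  READ_XML("https://www.raw-labs.com/sitemap.xml").url as sitemap_entry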

In RAW, the READ_XML command lets us query complex data structures directly and natively. For example, this query counts the news items within the channel (a more complex query appears towards the end of this post).

// count the sub-nodes called 'item'
SELECT 
 count(*) 
FROM 
 READ_XML("http://rss.cnn.com/rss/edition_us.rss").channel.item

Web Page Processing

Our example will use an RSS feed from CNN. Each RSS file describes a set of web pages – links, titles and other metadata – held in <item> nodes within a <channel> node. However, RSS doesn’t carry the actual page content, and there is more metadata inside each page; so RSS makes a nice index, but we have to traverse down into each page to process more data.
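As a quick illustration of that structure, RAW’s dot notation walks straight down the XML hierarchy, so listing the title of every <item> inside <channel> is a one-liner:

// list the title of each <item> within <channel>
SELECT 
  rss_item.title 
FROM 
  READ_XML("http://rss.cnn.com/rss/edition_us.rss").channel.item as rss_item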

In this example we will:

  1. Query and order the XML, and extract metadata from each page using an API
  2. Pass the results to a text analysis API to return structured, semantic data (entity extraction)
  3. Aggregate up results for presentation

This is a fairly standard pattern. Here we will use the OpenGraph.io API to extract page metadata and Google’s Natural Language entity analysis API for the text analysis, but there are plenty of alternatives depending on what you want to do.

1. Query pages and extract page metadata

For this example, we want to navigate down into each web page and pull out some ‘opengraph’ metadata, using an API from opengraph.io. We like this one because it’s fast and simple to use.

Following good design practice, we will take a modular approach, creating reusable library functions that can be thought of as building blocks for data products.

We have wrapped the opengraph.io API in a RAW API wrapper, which you can see on GitHub. Essentially it’s just a shim that lets us call their API from RAW, passing in the same parameters and returning the same output. We can import it with the IMPORT command and then run our query using its site function, as shown below.

The opengraph.io API key is stored as a RAW secret, which you can add in the RAW App (along with other administration tasks) once you have an account.

// import library function
FROM `github://raw-labs/lib/1/public/opengraph.io/opengraph.rql`
  IMPORT site;

// query story list, extract description using opengraph API
SELECT 
  rss_item.link,
  rss_item.title,
  site(
    url:=rss_item.link, 
    key:=secret("opengraph.io")
  ).hybridGraph.description as description 
FROM 
  read_xml("http://rss.cnn.com/rss/edition_us.rss").channel.item as rss_item

This returns a Collection of Records, in RAW speak: the Collection is delimited by [ and ], and each Record by { and }. In RAW, a Collection of Records is effectively a table.

You can see below that we get the link (the page URL), the title, and a fuller description taken from the page’s own OpenGraph metadata.

[
  {
    "link": "https://www.cnn.com/2022/02/16/us/houston-racist-antisemitic-flyers/index.html",
    "title": "2 more Houston-area neighborhoods littered with racist and anti-Semitic flyers ",
    "description": "Two neighborhoods in northeast Harris County were littered with hundreds of racist and anti-Semitic flyers over the weekend, resident Stacie Brady told CNN."
  },
  {
    "link": ...,
    "title": ...,
    "description": ...
  },
  ...

Incidentally, we have a number of other API wrappers in our Library, including ones for Twitter, some Google APIs and weather data. We will use a Google one next.

2. Perform the Textual Analysis

Google has some of the best APIs: they are fast, easy to use and configure, pay-as-you-go, and come with great documentation and demo pages. For this example we are going to use the Analyze Entities endpoint of the Natural Language API.

Again, we have created a simple RAW Library function that acts as an API wrapper; you can see it here on GitHub. Google supports several authentication mechanisms; we have chosen a simple API key here, but RAW also supports Auth0.

Let’s take the output of the previous query and pass it through the Google API as a subquery:

FROM `github://raw-labs/lib/1/public/google.com/language.rql`
    IMPORT analyze_entities;

FROM `github://raw-labs/lib/1/public/opengraph.io/opengraph.rql`
    IMPORT site;

// call google entity analysis
SELECT 
  link,
  title,
  description,
  analyze_entities( 
    key:= secret("language.google.com") , 
    content:=mkstring( [title, description] , " - "), 
    type:="PLAIN_TEXT"
  ) as analysis
FROM 
(
  // opengraph metadata results – the Google API doesn't like escaped quotes (\"), so replace them with '
  SELECT 
    rss_item.link,
    rss_item.title,
    replace(
      site(
        url:=rss_item.link,
        key:=secret("opengraph.io")
      ).hybridGraph.description, "\"","'"
    ) as description 
  FROM 
    read_xml("http://rss.cnn.com/rss/edition_us.rss").channel.item as rss_item
)

Functions not subqueries

… or we can write these as functions for better reuse. In our library, in rss.rql, we have encapsulated the RSS read together with the typealias needed to describe the RSS structure.

FROM `github://raw-labs/lib/1/public/google.com/language.rql`
    IMPORT analyze_entities;

FROM `github://raw-labs/lib/1/public/opengraph.io/opengraph.rql`
    IMPORT site;

FROM `github://raw-labs/lib/1/public/raw-labs.com/rss.rql`
    IMPORT read_rss;

// query and order sub-set of RSS pages
// grab metadata from opengraph
get_rss_link_metadata(url: string) := {
  SELECT 
    rss_item.link,
    rss_item.title,
    rss_item.pubDate,
    replace(
      site(
        url:=rss_item.link,
        key:=secret("opengraph.io")
      ).hybridGraph.description, "\"","'"
    ) as description 
  FROM 
    read_rss(url:=url, cache:= interval "1 hour").channel.item as rss_item
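  // keep only stories published today: parse the RFC 822 pubDate into a timestamp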
  WHERE
    to_timestamp ( substring(rss_item.pubDate,1,25) , 
                   "eee, dd MMM yyyy HH:mm:ss") > 
                   DATE_TRUNC("day", current_timestamp())
  ORDER BY 3 DESC
};

// call google entity analysis API
// from function above 
SELECT 
  link,
  title,
  description,
  analyze_entities( 
    key:= secret("language.google.com") , 
    content:=mkstring( [title, description] , " - "), 
    type:="PLAIN_TEXT"
  ) as analysis
FROM 
  get_rss_link_metadata("http://rss.cnn.com/rss/edition_us.rss")

This second version is much easier to reuse. There is also some useful caching applied to the read_rss call: when re-running during development, we avoid hitting the same URL repeatedly in a short space of time and waiting for the results.

Either way, the return structure is shown below. Google’s API has returned an analysis record, whose entities field is a collection of records describing the content:

[
  {
    "link": "https://www.cnn.com/2022/02/15/politics/mazars-trump-letter-letitia-james/index.html",
    "title": "This is why Trump's accounting firm severed ties after questioning reliability of financial statements",
    "description": "Mazars USA, the accounting firm that former President Donald Trump and his businesses have used for years, cut ties with him in stunning fashion, declaring it could no longer vouch for the financial statements it has complied over the past decade.",
    "analysis": {
      "entities": [
        {
          "name": "Mazars USA",
          "type": "ORGANIZATION",
          "metadata": {
            "wikipedia_url": "https://en.wikipedia.org/wiki/Mazars",
            "mid": "/m/0273cwy"
          },
          "salience": 0.37116933,
          "mentions": [
            {
              "text": {
                "content": "Mazars USA",
                "beginOffset": 105
              },
              "type": "PROPER"
            },
            {
              "text": {
                "content": "accounting firm",
                "beginOffset": 121
              },
              "type": "COMMON"
            }
          ]
        },
        {
          "name": "Donald Trump",
          "type": "PERSON",
          "metadata": {
            "wikipedia_url": "https://en.wikipedia.org/wiki/Donald_Trump",
            "mid": "/m/0cqt90"
          },
          "salience": 0.27993295,
          "mentions": [
            {
              "text": {
                "content": "Trump",
                "beginOffset": 12
              },
              "type": "PROPER"
            },
            {
              "text": {
                "content": "Donald Trump",
                "beginOffset": 159
              },
       ...

3. Present aggregated results

The results so far combine the output of two APIs, adding information to our input RSS feed. We now have a nested data structure.

Depending on the question we are asking, the final query could return different structures. Here we will show a single query that returns aggregated entity and type information across all the pages in the RSS feed, in descending order of ‘hits’, so we can see what’s ‘most reported’.

We will also filter out uninteresting entities, so we can concentrate on people, organizations, locations and other real-world items.

RAW SQL is very strong at querying complex, nested structures like this. Regular SQL would find it very difficult: try doing this in a regular RDBMS, even with JSON query extensions!

// previous query results assigned to variable "data"
data := 
  SELECT 
    link,
    title,
    description,
    analyze_entities( 
      key:= secret("language.google.com") , 
      content:=mkstring( [title, description] , " - "), 
      type:="PLAIN_TEXT"
    ) as analysis
  FROM 
    get_rss_link_metadata("http://rss.cnn.com/rss/edition_us.rss");


// Sum up the results across all stories in the RSS feed.
SELECT 
  name as entity_name, 
  `type` as entity_type,
  metadata.wikipedia_url as wikipedia_url,
  SUM(e.salience) as total_salience, 
  COUNT(*) as story_count,
  (SELECT DISTINCT d.link FROM PARTITION) as stories,
  (SELECT COUNT(m) FROM PARTITION, e.mentions m) as mention_count // mentions are nested
FROM 
  data d, 
  d.analysis.entities e
WHERE 
  e.`type` in ["PERSON", "LOCATION", "ORGANIZATION",
               "EVENT", "WORK_OF_ART", "CONSUMER_GOOD"]
GROUP BY 
  e.name, 
  e.`type`,
  e.metadata
ORDER BY 
  COUNT(*) DESC

In RAW we can perform queries, group, sort, join and perform algebraic operations on non-tabular data, natively, because RAW does not use a relational data engine.

The RAW SQL syntax has some minor differences from standard SQL, but there’s a lot more power too. You can read more in the RAW SQL specification.
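Several of those differences appear, in miniature, in the query above: reserved words like type are escaped with backticks, IN takes a list literal, and nested collections are joined directly in the FROM clause. As a condensed sketch, reusing the data variable from before:

// RAW SQL features used above, condensed (sketch)
SELECT e.name, e.`type`
FROM data d, d.analysis.entities e
WHERE e.`type` in ["PERSON", "LOCATION"]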

The results are below (guess who is still most in the news!).

The data has been grouped by entity and type, summing the salience, counting mentions and stories, and embedding a collection of story links. Effectively we have rearranged the data above into the new form below:

[
  {
    "entity_name": "Donald Trump",
    "entity_type": "PERSON",
    "wikipedia_url": "https://en.wikipedia.org/wiki/Donald_Trump",
    "total_salience": 0.32082891,
    "story_count": 2,
    "stories": [
      "https://www.cnn.com/2022/02/15/politics/mazars-trump-letter-letitia-james/index.html",
      "https://www.cnn.com/2022/02/15/politics/january-6-subpoena-fake-elector-plot/index.html"
    ],
    "mention_count": 5
  },
  { 
    ...
  }

Finally, let’s make an API

So we’ve done this great analysis, and we want to share it with colleagues. That’s easy to do…

  1. Turn our RAW SQL into a function, see our code here
  2. Create a YAML file that exposes our function as an endpoint (a sketch follows below)
  3. Commit to GitHub

That’s it. Let RAW sync with your GitHub and deploy the API endpoint.
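For illustration, the endpoint definition is a short YAML file along the following lines. Treat this strictly as a sketch: the field names here are assumptions for illustration, not the definitive schema, so check the RAW documentation for the real format.

# hypothetical sketch of an endpoint definition file; field names are illustrative
raw: 0.9
endpoint: GET
code: rql
codeFile: rss-analysis.rql
declaration: rss_analysis
format: json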

Then go and run this endpoint with your favourite RSS feeds as parameters (just URL-encode the URL parameter first). Here are some examples; please bear in mind that these endpoints may take a few seconds to run.

CNN US Edition

RSS:

http://rss.cnn.com/rss/edition_us.rss 

RAW Endpoint:

https://api.raw-labs.com/examples/1/public/RSS-analysis/rss-analysis?url=http%3A%2F%2Frss.cnn.com%2Frss%2Fedition_us.rss

Huffington Post

RSS:

https://www.huffpost.com/section/front-page/feed?x=1 

RAW Endpoint:

https://api.raw-labs.com/examples/1/public/RSS-analysis/rss-analysis?url=https%3A%2F%2Fwww.huffpost.com%2Fsection%2Ffront-page%2Ffeed%3Fx%3D1

Techcrunch

RSS:

https://techcrunch.com/feed/?guccounter=1

RAW Endpoint:

https://api.raw-labs.com/examples/1/public/RSS-analysis/rss-analysis?url=https%3A%2F%2Ftechcrunch.com%2Ffeed%2F%3Fguccounter%3D1

Wrap Up

With RAW you can query and join all types of data: XML files, APIs that return JSON, plus tables and log files. You can create reusable functions as modular data building blocks. Finally, you can expose your new data products as APIs with just a YAML file and our GitHub integration. To see how to get started, follow the links throughout this post.

I hope you liked this post and found it insightful. Check out more posts soon!

Jeremy Posner, VP Product & Solutions, RAW Labs.

