Node.js – API Web Scraping HTML Table Values With Cheerio

Overview

This article walks through the steps needed to create a Node.js API that returns scraped data in JSON format using Cheerio. Cheerio is one of the easiest scrapers I have used, even easier than Beautiful Soup for Python. It works well for many sites, and there are ways around the more complex sites generated by a CMS. The learning curve is shallow, and you should be able to scrape the data from a typical site in 1 to 8 hours, depending on its complexity.

Prerequisites

If you don’t already have Node.js installed then you can get the download and install information from their website at https://nodejs.org/.

Create a folder for your project and go into that directory and run the following commands.

npm init
npm install express

Nodemon

Nodemon restarts the Node.js application whenever changes are made.

npm install nodemon

Update the “scripts” section of the “package.json” file with the following:

  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1",
    "start": "nodemon app.js"
  },

Note: It is important to escape the quotes inside the “test” value: \"Error: no test specified\".

To start the application, use npm start instead of node app.js.

Cheerio – Web Scraper

Install Cheerio, along with pretty to format the incoming HTML.

npm install cheerio pretty

Axios

Axios is a promise-based HTTP Client.

npm i axios

Environment Variable

Gives the application access to .env files. This is where we can put the keys for now.

npm i dotenv
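For example, the keys could live in a .env file at the project root. A sketch with placeholder variable names (these are not from the article):

```
PORT=3004
TARGET_URL=https://www.DOMAIN.com
```

After calling require('dotenv').config() at the top of app.js, these values are available as process.env.PORT and process.env.TARGET_URL.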

CORS

Cross-Origin Resource Sharing (CORS) is needed if you want to use the API from an external application.

npm i cors

Setting up CORS

Recommended if you plan to use an external web application to call this API.

The app.use() call needs “express.urlencoded()” to read incoming request bodies, and cors() for API calls made from a JavaScript web application. The code below will accept requests from an external web application served from “http://localhost:3000“.

app.use(
    express.urlencoded(),
    cors({
        origin: 'http://localhost:3000'
    })
);
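If more than one front end needs access, the cors package also accepts an array for the origin option. A minimal sketch (the second origin below is a placeholder, not from the article):

```javascript
// Hypothetical options object allowing two front-end origins.
const corsOptions = {
  origin: ['http://localhost:3000', 'https://app.example.com'],
  methods: ['GET'],
};

console.log(corsOptions.origin.length); // 2
```

This object would be passed to the middleware as app.use(cors(corsOptions)).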

Node API Web Scraper

Create a file named “app.js” in the root folder of the API.

Below are all of the imports needed to run the API: Express, Node’s built-in http module, Cheerio, and Axios. The server listens on port 3004.

const express = require('express');
const app = express();
const http = require('http');
const server = http.createServer(app);
const cheerio = require('cheerio');
const axios = require("axios");

server.listen(3004);

Example HTML Table to Scrape

  <table id="sortable" class="sortable-table">
    <thead>
        <tr>
            <th class="header">State</th>
            <th class="regular header" style="display: table-cell;">Regular</th>
            <th class="mid_grade header">Mid-Grade </th>
            <th class="premium header">Premium</th>
            <th class="diesel header">Diesel</th>
        </tr>
    </thead>
    <tbody>
      <tr>
          <td>
            <a href="https://domain.com/ak">
              Alaska                              
            </a>
          </td>
          <td class="regular" style="display: table-cell;">$3.720</td>
          <td class="mid_grade">$3.879</td>
          <td class="premium">$4.060</td>
          <td class="diesel">$3.528</td>
      </tr>
      <tr>
        <td>
          <a href="https://domain.com/AL">
            Alabama
          </a>
        </td>
        <td class="regular" style="display: table-cell;">$3.197</td>
        <td class="mid_grade">$3.510</td>
        <td class="premium">$3.830</td>
        <td class="diesel">$3.456</td>
      </tr>
                          
  </tbody>
</table>

Find the Selector

Using Chrome in this example, inspect the HTML tag you want to scrape.

Single HTML tag value

Example: for a specific column in a table

Right-click the element and go to Copy —> Copy Selector for the specific column.

Get a Single Value

You should get something like the following:

#sortable > tbody > tr:nth-child(1) > td.mid_grade

The API action needs to be asynchronous in order to work properly. The API action below simply logs the selected value to the console.

app.get('/', async (req, res) => {
  try {
    const url = "https://www.DOMAIN.com";

    const { data } = await axios.get(url);

    const $ = cheerio.load(data);

    // Log the text of the mid-grade cell in the first row
    $("#sortable > tbody > tr:nth-child(1) > td.mid_grade").each((index, element) => {
      console.log($(element).text());
    });
  } catch (err) {
    console.error(err);
  }

  res.json({
    "message": "success"
  });
});

Get Table Data

For the entire table, select the <tbody> tag

You should get something like

#sortable > tbody

Now loop through it as below.

app.get('/', async (req, res) => {
  try {
    const url = "https://gasprices.aaa.com/state-gas-price-averages/";

    const { data } = await axios.get(url);

    const $ = cheerio.load(data);

    // Log the text of the entire table body
    $("#sortable > tbody").each((index, element) => {
      console.log($(element).text());
    });
  } catch (err) {
    console.error(err);
  }

  res.json({
    "message": "success"
  });
});

Output

  Alaska

$3.720
$3.879
$4.060
$3.528

 Alabama

$3.197
$3.510
$3.830
$3.456

Get a specific column from a table. Note that indexing the matched set with [0] returns a raw DOM node, not a Cheerio object; use .eq(0) (or wrap the node with $()) to call .text() on it.

console.log($(element).find("td").eq(0).text());

Get all data from a single column

You should get something like

#sortable > tbody > tr:nth-child(2) > td.regular

Modify it to the following:

$("#sortable > tbody > tr > td.regular").each((index, element) => {

Now loop through it as below. trim() is also used to remove the extra whitespace.

app.get('/', async (req, res) => {
  var results = [];
  try {
    const url = "https://www.DOMAIN.com";

    const { data } = await axios.get(url);

    const $ = cheerio.load(data);

    // Collect the text of every "regular" price cell
    $("#sortable > tbody > tr > td.regular").each((index, element) => {
      const val = $(element).text();

      results.push({ column: "price", value: val.trim() });
    });
  } catch (err) {
    console.error(err);
  }

  res.json({
    "message": "success",
    "data": results
  });
});
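The scraped values are strings such as "$3.197". If numbers are needed (for sorting or math), a small helper can strip the dollar sign first. This helper is a sketch, not part of the original article:

```javascript
// Hypothetical helper: convert a scraped price string like "$3.197" to a number.
// Returns null when the text does not parse as a price.
function toPrice(text) {
  const n = parseFloat(text.trim().replace('$', ''));
  return Number.isNaN(n) ? null : n;
}

console.log(toPrice(' $3.197 ')); // 3.197
console.log(toPrice('n/a'));      // null
```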

Scrape all of the columns into JSON format

app.get('/', async (req, res) => {
  var results = [];
  try {
    const url = "https://www.DOMAIN.com";

    const { data } = await axios.get(url);

    const $ = cheerio.load(data);

    $("#sortable > tbody > tr").each((index, element) => {
      // The state name is inside an <a> tag, so select "td a" (not "td.a",
      // which looks for a <td> with class "a" and matches nothing)
      var state = $(element).find("td a").text();
      var regular = $(element).find("td.regular").text();
      var mid_grade = $(element).find("td.mid_grade").text();
      var premium = $(element).find("td.premium").text();
      var diesel = $(element).find("td.diesel").text();
      results.push({
        state: state.trim(),
        regular: regular.trim(),
        mid_grade: mid_grade.trim(),
        premium: premium.trim(),
        diesel: diesel.trim()
      });
    });

  } catch (err) {
    console.error(err);
  }

  res.json({
    "message": "success",
    "data": results
  });
});
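Once the rows are in the results array, plain JavaScript can post-process them before the response is sent, for example sorting states by their regular price. A sketch using sample data in the same shape the scraper produces (the rows below are hard-coded for illustration):

```javascript
// Sample rows in the same shape the scraper pushes into `results`.
const results = [
  { state: 'Alaska',  regular: '$3.720' },
  { state: 'Alabama', regular: '$3.197' },
];

// Sort ascending by the numeric value of the "regular" column
// (slice(1) drops the leading "$").
const sorted = [...results].sort(
  (a, b) => parseFloat(a.regular.slice(1)) - parseFloat(b.regular.slice(1))
);

console.log(sorted[0].state); // Alabama
```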

Output

{
    "message": "success",
    "data": [
        {
            "state": "Alaska",
            "regular": "$3.720",
            "mid_grade": "$3.879",
            "premium": "$4.060",
            "diesel": "$3.528"
        },
        {
            "state": "Alabama",
            "regular": "$3.197",
            "mid_grade": "$3.510",
            "premium": "$3.830",
            "diesel": "$3.456"
        },
        ...
    ]
}