Node.js – API Web Scraping HTML Table Values With Cheerio

Overview
This article walks through the steps needed to create a Node.js API that returns scraped data in JSON format using Cheerio. Cheerio is one of the easiest scrapers I have used, even easier than Beautiful Soup for Python. It works well for many sites, and there are ways around the more complex sites that are generated by a CMS. The learning curve is gentle, and you should be able to scrape the data from a typical site in 1 to 8 hours, depending on its complexity.
Prerequisites
If you don’t already have Node.js installed, you can get the download and installation information from https://nodejs.org/.
Create a folder for your project, change into that directory, and run the following commands.
npm init
npm install express
Nodemon
Nodemon restarts the Node.js application whenever a source file changes.
npm install nodemon
Update the “scripts” section of the “package.json” file with the following:
"scripts": {
  "test": "echo \"Error: no test specified\" && exit 1",
  "start": "nodemon app.js"
},
Note: The quotes inside the "test" value must be escaped with a backslash: \"Error: no test specified\".
To start the application, use npm start instead of node app.js.
Cheerio – Web Scraper
Install Cheerio, along with pretty to format the incoming HTML.
npm install cheerio pretty
Axios
Axios is a promise-based HTTP Client.
npm i axios
Environment Variable
The dotenv package loads the variables defined in a .env file into process.env. This is where we can put the keys for now.
npm i dotenv
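As a rough sketch of how those values might be read once dotenv has loaded them, the helper below takes an environment object (normally process.env) and falls back to defaults. The helper name readConfig and the variable names PORT and TARGET_URL are assumptions made for illustration; only the port 3004 and the placeholder URL come from this article.

```javascript
// Hypothetical helper: reads settings from an environment object.
// In the real app, call require('dotenv').config() once at startup so the
// values in .env are copied into process.env before this runs.
function readConfig(env = process.env) {
  const port = Number.parseInt(env.PORT, 10);
  return {
    // PORT and TARGET_URL are assumed variable names, not from the article.
    port: Number.isInteger(port) ? port : 3004,
    targetUrl: env.TARGET_URL || 'https://www.DOMAIN.com',
  };
}

console.log(readConfig({ PORT: '8080' }));
```

Passing the environment in as a parameter keeps the helper easy to try out without touching the real .env file.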
CORS
Cross-Origin Resource Sharing (CORS) is needed if you want to call the API from a web application served from a different origin.
npm i cors
Setting up CORS
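Recommended if you plan to call this API from an external web application.
The app.use() call needs express.urlencoded() to parse URL-encoded request bodies and cors() to allow API calls made from a JavaScript web application on another origin. The code below accepts requests from a web application served from http://localhost:3000. It assumes cors has been imported in app.js with require('cors'). Passing { extended: true } avoids the deprecation warning that some Express versions print when the option is omitted.
app.use(
  express.urlencoded({ extended: true }),
  cors({
    origin: 'http://localhost:3000'
  })
);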
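If more than one front end needs access, the cors package also accepts a function for its origin option, called with the request origin and a callback. The sketch below is one possible allowlist check; the second origin in the list and the name checkOrigin are made up for illustration.

```javascript
// Hypothetical allowlist; only http://localhost:3000 comes from the article.
const allowedOrigins = ['http://localhost:3000', 'http://localhost:5173'];

// Follows the (origin, callback) shape that cors accepts for `origin`.
// Requests without an Origin header (curl, server-to-server calls) are
// allowed here, which is a design choice rather than a requirement.
function checkOrigin(origin, callback) {
  if (!origin || allowedOrigins.includes(origin)) {
    callback(null, true);
  } else {
    callback(new Error(`Origin ${origin} is not allowed by CORS`));
  }
}

// Usage (assumes app and cors are set up as in the article):
// app.use(cors({ origin: checkOrigin }));
```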
Node API – Web Scraper
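Create a file named “app.js” in the root folder of the API.

These are the imports needed to run the API: Express and Node's http module for the server, Cheerio to parse the HTML, Axios to fetch the page, and cors for Cross-Origin Resource Sharing. The port is set to 3004.
const express = require('express');
const app = express();
const http = require('http');
const server = http.createServer(app);
const cheerio = require('cheerio'); // parses the fetched HTML
const axios = require('axios');     // fetches the page
const cors = require('cors');       // used by the CORS setup shown earlier
const port = 3004;

// Routes can be registered after this call; they are matched per request.
server.listen(port, () => console.log(`API listening on port ${port}`));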
Example HTML Table to Scrape
<table id="sortable" class="sortable-table">
  <thead>
    <tr>
      <th class="header">State</th>
      <th class="regular header" style="display: table-cell;">Regular</th>
      <th class="mid_grade header">Mid-Grade</th>
      <th class="premium header">Premium</th>
      <th class="diesel header">Diesel</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
        <a href="https://domain.com/ak">
          Alaska
        </a>
      </td>
      <td class="regular" style="display: table-cell;">$3.720</td>
      <td class="mid_grade">$3.879</td>
      <td class="premium">$4.060</td>
      <td class="diesel">$3.528</td>
    </tr>
    <tr>
      <td>
        <a href="https://domain.com/AL">
          Alabama
        </a>
      </td>
      <td class="regular" style="display: table-cell;">$3.197</td>
      <td class="mid_grade">$3.510</td>
      <td class="premium">$3.830</td>
      <td class="diesel">$3.456</td>
    </tr>
  </tbody>
</table>
Find the Selector
Using Chrome in this example, inspect the HTML tag you want to scrape.
Single HTML tag value
Example: a specific column in a table
Right-click the cell in the Elements panel and choose Copy > Copy selector.

Get a Single Value
You should get something like the following:
#sortable > tbody > tr:nth-child(1) > td.mid_grade
The route handler needs to be asynchronous because the page is fetched with await. The handler below simply logs the text of the selected cell to the console.
app.get('/', async (req, res) => {
  try {
    const url = "https://www.DOMAIN.com";
    // Fetch the page and load the HTML into Cheerio
    const { data } = await axios.get(url);
    const $ = cheerio.load(data);
    // Log the text of every element that matches the selector
    $("#sortable > tbody > tr:nth-child(1) > td.mid_grade").each((index, element) => {
      console.log($(element).text());
    });
  } catch (err) {
    console.error(err);
  }
  res.json({
    "message": "success",
  });
});
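The selector copied from Chrome follows a regular pattern, so the same cell in another row or column can be addressed by changing two values. The helper below is a small sketch of that idea; cellSelector and its parameters are hypothetical names, and it assumes the table keeps the id and class names from the example HTML.

```javascript
// Hypothetical helper that rebuilds the copied selector for any row/column.
// rowNumber is 1-based because :nth-child() counts from 1.
function cellSelector(rowNumber, columnClass, tableId = 'sortable') {
  if (!Number.isInteger(rowNumber) || rowNumber < 1) {
    throw new RangeError('rowNumber must be an integer >= 1');
  }
  return `#${tableId} > tbody > tr:nth-child(${rowNumber}) > td.${columnClass}`;
}

console.log(cellSelector(1, 'mid_grade'));
// prints: #sortable > tbody > tr:nth-child(1) > td.mid_grade
```

The resulting string can be passed to $() in place of the hard-coded selector.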
Get Table Data
For the entire table, select the <tbody> tag and copy its selector.

You should get something like:
#sortable > tbody
Now loop through it as below.
app.get('/', async (req, res) => {
  try {
    const url = "https://gasprices.aaa.com/state-gas-price-averages/";
    // Fetch the page and load the HTML into Cheerio
    const { data } = await axios.get(url);
    const $ = cheerio.load(data);
    // Log the combined text of every cell inside <tbody>
    $("#sortable > tbody").each((index, element) => {
      console.log($(element).text());
    });
  } catch (err) {
    console.error(err);
  }
  res.json({
    "message": "success",
  });
});
Output
Alaska
$3.720
$3.879
$4.060
$3.528
Alabama
$3.197
$3.510
$3.830
$3.456
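Because text() on the <tbody> returns one long string, the values above arrive without any row structure. As a rough sketch, the string can be split back into rows. This assumes every row has exactly five cells and that no cell is empty or contains a line break; the per-row approach shown later in the article is more robust. The name textToRows and the sample string are made up for illustration.

```javascript
// Splits the text of <tbody> into trimmed, non-empty lines and groups
// them into rows. Assumes five cells per row and no empty cells.
function textToRows(tbodyText, cellsPerRow = 5) {
  const cells = tbodyText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  const rows = [];
  for (let i = 0; i < cells.length; i += cellsPerRow) {
    rows.push(cells.slice(i, i + cellsPerRow));
  }
  return rows;
}

// Sample text shaped like the output above (whitespace is illustrative).
const sample =
  '\n Alaska \n$3.720\n$3.879\n$4.060\n$3.528\n Alabama \n$3.197\n$3.510\n$3.830\n$3.456\n';
console.log(textToRows(sample));
```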
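Get a specific cell from a table. Inside the loop, find("td") returns every cell under the element, eq(0) picks the first one, and text() returns its contents rather than the raw DOM node.
console.log($(element).find("td").eq(0).text().trim());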
Get all data from a single column
You should get something like:
#sortable > tbody > tr:nth-child(2) > td.regular
Remove :nth-child(2) so that the selector matches the cell in every row:
#sortable > tbody > tr > td.regular
Now loop through it as below. The trim() call removes the surrounding whitespace.
app.get('/', async (req, res) => {
  const results = [];
  try {
    const url = "https://www.DOMAIN.com";
    // Fetch the page and load the HTML into Cheerio
    const { data } = await axios.get(url);
    const $ = cheerio.load(data);
    // Collect the "regular" cell from every row
    $("#sortable > tbody > tr > td.regular").each((index, element) => {
      const val = $(element).text();
      results.push({ column: "price", value: val.trim() });
    });
  } catch (err) {
    console.error(err);
  }
  res.json({
    "message": "success",
    "data": results
  });
});
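Note that the handler above answers with "success" even when the request or the parsing throws, because the error is only logged. One possible way to make the response reflect the outcome is sketched below. The helper name buildPayload, the "detail" key, and the choice of status 502 are assumptions for illustration; the "message" and "data" keys follow the article.

```javascript
// Hypothetical helper: builds the JSON body and an HTTP status code.
// "message" and "data" follow the article; "detail" and the 502 status
// are assumptions made for this sketch.
function buildPayload(error, results) {
  if (error) {
    return { status: 502, body: { message: 'error', detail: error.message } };
  }
  return { status: 200, body: { message: 'success', data: results } };
}

// Usage inside a route (sketch): remember the caught error, then
//   const { status, body } = buildPayload(caughtError, results);
//   res.status(status).json(body);
console.log(buildPayload(null, [{ column: 'price', value: '$3.720' }]));
```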
Scrape all of the columns into JSON format
app.get('/', async (req, res) => {
  const results = [];
  try {
    const url = "https://www.DOMAIN.com";
    // Fetch the page and load the HTML into Cheerio
    const { data } = await axios.get(url);
    const $ = cheerio.load(data);
    // Build one object per table row
    $("#sortable > tbody > tr").each((index, element) => {
      // The state name is the link text inside the first cell ("td a", not "td.a")
      const state = $(element).find("td a").text();
      const regular = $(element).find("td.regular").text();
      const mid_grade = $(element).find("td.mid_grade").text();
      const premium = $(element).find("td.premium").text();
      const diesel = $(element).find("td.diesel").text();
      results.push({
        state: state.trim(),
        regular: regular.trim(),
        mid_grade: mid_grade.trim(),
        premium: premium.trim(),
        diesel: diesel.trim()
      });
    });
  } catch (err) {
    console.error(err);
  }
  res.json({
    "message": "success",
    "data": results
  });
});
Output
{
  "message": "success",
  "data": [
    {
      "state": "Alaska",
      "regular": "$3.720",
      "mid_grade": "$3.879",
      "premium": "$4.060",
      "diesel": "$3.528"
    },
    {
      "state": "Alabama",
      "regular": "$3.197",
      "mid_grade": "$3.510",
      "premium": "$3.830",
      "diesel": "$3.456"
    },
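The prices in the output are still strings with a leading dollar sign. If the consumer of the API needs numbers, a conversion step along the following lines could be applied to each row before it is returned. This is only a sketch: parsePrice and toNumericRow are hypothetical helpers, and the row shape is taken from the output above.

```javascript
// Converts a price string such as "$3.720" to a number, or returns null
// if the text does not look like a plain decimal price.
function parsePrice(text) {
  const cleaned = String(text).replace(/[$,\s]/g, '');
  return /^\d+(\.\d+)?$/.test(cleaned) ? Number(cleaned) : null;
}

// Applies parsePrice to every field of a scraped row except "state".
function toNumericRow(row) {
  const { state, ...prices } = row;
  const numeric = { state };
  for (const [grade, text] of Object.entries(prices)) {
    numeric[grade] = parsePrice(text);
  }
  return numeric;
}

console.log(toNumericRow({
  state: 'Alaska',
  regular: '$3.720',
  mid_grade: '$3.879',
  premium: '$4.060',
  diesel: '$3.528',
}));
```

Returning null for unparseable text keeps a missing or malformed cell visible instead of silently turning it into 0.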