Python Web Scraping with BeautifulSoup


Overview

These are just examples of the most common scenarios that I have run into when scraping data. You can always combine them into a single Python application that crawls a site, collects all the URLs, and gets the data from every page (if there is pagination).

Prerequisites

You must have Python installed. https://www.python.org/downloads/

Check for pip version

If you don't have pip installed, go to https://pip.pypa.io/en/stable/installation/.

Windows

C:> py --version
Python 3.N.N
C:> py -m pip --version
pip X.Y.Z from ... (python 3.N.N)

Mac or Linux

$ python --version
Python 3.N.N
$ python -m pip --version
pip X.Y.Z from ... (python 3.N.N)

Install Beautiful Soup with pip. You may need to specify the version of pip, for instance pip2 for Python 2.7 and pip3 for Python 3.x.

pip install beautifulsoup4

lxml – supplements Beautiful Soup as a parser for easy handling of XML and HTML files.

pip install lxml

You will also use urllib to open the URLs. The urllib.request module is part of the Python standard library, so there is nothing extra to install for it.

Imports

Import the following for all examples:

from bs4 import BeautifulSoup
import lxml
from urllib.request import Request, urlopen

Scraping Basics

The following is a list of how-to snippets for different situations.

Get a list of URLs

Get URLs from the anchor tag <a href="">

This example will get all the URLs that are inside a list.

Example HTML

<ul class="row list-unstyled">
    <li class="col-3">
        <a href="https://www.domain.com/page.html">
            Page one title
        </a>
    </li>
    <li class="col-3">
        <a href="https://www.domain.com/page2.html">
            Page two title
        </a>
    </li>
</ul>

The following code will scrape the URLs out of the anchor tags.


from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import lxml

def get_urls():

    url = "https://www.DOMAIN.com/"
    req = Request(url)
    page = urlopen(req)
    soup = BeautifulSoup(page, features="lxml")

    # Find the <ul> that holds the links, then loop through its <li> items
    table = soup.find('ul', class_='row list-unstyled')
    rows = table.find_all('li')

    for row in rows:
        link = row.find("a")
        siteURLVal = link['href']
        print('url ', siteURLVal)


get_urls()
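
Note: some sites respond with HTTP 403 when the request does not send a browser-like User-Agent header. If that happens, you can pass headers to Request before opening the page (a minimal sketch; the header value is just an illustration):

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req)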

Find every instance of a tag

.find_all()

Example HTML

<ul class="row list-unstyled">
    <li class="col-3">
        <a class="fancy-link" href="https://www.domain.com/content/image.jpg">
            <img class="img-fluid"
                 data-src="https://www.domain.com/content/image.jpg"
                 src="https://www.domain.com/image.jpg"/>
        </a>
    </li>
    <li class="col-3">
        <a class="fancy-link" href="https://www.domain.com/content/image.jpg">
            <img class="img-fluid"
                 data-src="https://www.domain.com/content/image.jpg"
                 src="https://www.domain.com/image.jpg"/>
        </a>
    </li>
</ul>

Two examples of how to find the list tag <ul> so it can be looped through as a list/array.

.find_all("ul", class_="row list-unstyled")

OR

.find_all("ul", {"class": "row list-unstyled"})

Iterate Through Multiple Sections with Identical Class or ID Names

Example HTML

<section class="main-section">
  <div class="main-content">
     <!-- Main Content -->
  </div>

  <div class="side-content">
      <h5>Height</h5>
      <p>5ft</p>
      <h5>Weight</h5>
      <p>180lbs</p>
  </div>
  <div class="side-content">
      <h5>Hair Color</h5>
      <p>Brown</p>
      <h5>Eye Color</h5>
      <p>Blue</p>
  </div>
</section>

The following is a code snippet to get the values from the HTML above:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import lxml

def get_side_content():
    
    url = "https://www.DOMAIN.com/"
    req = Request(url)
    page = urlopen(req)
    soup = BeautifulSoup(page, features="lxml")
    # Find the outer section, then every side-content block inside it
    mainContent = soup.find('section', class_='main-section')

    sideData = mainContent.find_all('div', class_='side-content')
    for sideSection in sideData:
        values = sideSection.find_all('p')
        for val in values:
            print(val.text)


get_side_content()
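
If you also need the label for each value, the <h5> headings can be paired with the <p> tags that follow them. A minimal sketch building on the sideData variable from the snippet above:

for sideSection in sideData:
    for label in sideSection.find_all('h5'):
        # find_next_sibling returns the <p> immediately after the heading
        value = label.find_next_sibling('p')
        if value is not None:
            print(label.text, '=', value.text)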

Extract value from Custom Tag – Example data-src

Get the image URL from the data-src attribute on the <img> tag inside each anchor:

<img src="https://www.domain.com/images/image.jpg" data-src="https://www.domain.com/images/image.jpg">

This example will get all the image URLs from the reddit.com home page.

# listitems is a <ul> element found earlier, e.g. soup.find('ul', class_='row list-unstyled')
imagelistArr = listitems.find_all('a')
for imageblock in imagelistArr:
    img = imageblock.find('img')
    if img is not None:
        print('list item img ', img['data-src'])
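
For context, here is a self-contained sketch of the same idea, assuming the <ul class="row list-unstyled"> markup from the find_all example above (the URL and function name are placeholders):

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def get_image_sources(url):
    req = Request(url)
    page = urlopen(req)
    soup = BeautifulSoup(page, features="lxml")

    # Find the list that holds the image links, then read data-src off each <img>
    listitems = soup.find('ul', class_='row list-unstyled')
    for imageblock in listitems.find_all('a'):
        img = imageblock.find('img')
        if img is not None and img.has_attr('data-src'):
            print('list item img ', img['data-src'])


get_image_sources('https://www.DOMAIN.com/')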

Extract Values From Text next to HTML Tags

Example HTML

<div class="short-description">
	<p>
    <em>Citrus lime ‘Bearss Seedless’ Std<br></em> <br> 
    <strong>• Mature Height: &nbsp; </strong>&nbsp;10′<br> 
    <strong>• Mature Width:&nbsp;</strong>&nbsp;8′<br> 
    <strong>• Light Requirements: &nbsp;</strong>&nbsp;Morning sun<br> 
    <strong>• Water Requirements: &nbsp;</strong>&nbsp;Regular water<br> 
    <strong>• Fertilizers: &nbsp;</strong>&nbsp;Dr Q’s Citrus Food, 14-7-7, Dr Q’s Organic Citrus Food 8-4-4
    </p>
</div>

The code below will scrape the value after the <strong> tags with the next_sibling property.

    title = ''
    mature_height = ''
    mature_width = ''
    light_req = ''
    water_req = ''

    req = Request(url)
    page = urlopen(req)
    soup = BeautifulSoup(page, features="lxml")

    # Find the description block from the example HTML, then its <p> tags
    table = soup.find('div', class_='short-description')
    rows = table.find_all('p')

    for row in rows:
        print('Row', row)
        title = row.find('em')

        data = row.find_all('strong')
        for item in data:
            print('item', item.text)
            if 'Mature Height:' in item.text:
                mature_height = item.next_sibling

            if 'Mature Width:' in item.text:
                mature_width = item.next_sibling

            if 'Light Requirements:' in item.text:
                light_req = item.next_sibling

            if 'Water Requirements:' in item.text:
                water_req = item.next_sibling
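
The captured siblings are NavigableString values that still start with non-breaking spaces (&nbsp;). One simple way to clean them up is to convert each value to a plain string and strip it, for example:

    mature_height = str(mature_height).strip()
    mature_width = str(mature_width).strip()
    light_req = str(light_req).strip()
    water_req = str(water_req).strip()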

Extract Select Value from a Dropdown Field

Example HTML

<div>
    <label for="specs-year-select">Year</label>
    <div class="specs-style-select-container">
        <select id="specs-year-select" name="year">
                    <option value="2021">2021</option>
                    <option value="2020">2020</option>
                    <option value="2019">2019</option>
                    <option value="2018" selected="">2018</option>
        </select>
    </div>
</div>

This is not ideal, but I was not able to find another way to get the selected value. An improvement would be to check if a string contains the value of “selected”, but I had no luck with that.

Another option you can try is the following, taken from a post by Ilham Riski on https://www.py4u.net/discuss/200746:

option = soup.find("selected",{"name":"try"}).findAll("option")
option_ = soup.find("table", {"style": "font-size:14px"}).findAll("option")

In the end, this works, and if I find a better solution, I will update the code snippet.

    selectedValue = ""
    optionCount = 0

    datatable = soup.find("div", class_="review-body")

    dropdowns = datatable.find('form', class_="specs-filter-container")
    selectFields = dropdowns.find_all('select')
    for selectField in selectFields:
        section = selectField.select('option[value]')
        for item in section:
            tag = str(item)
            # The selected <option> is rendered as '<option selected="" ...>'
            if tag[0:17] == "<option selected=":
                if optionCount == 0:
                    # The first dropdown holds the year, so cast it to int
                    selectedValue = int(item.text)
                elif optionCount == 1:
                    selectedValue = item.text
                optionCount = optionCount + 1
Find All Elements With Regex Library

Example HTML

<div>
    <h4>Header One</h4>
     <p>Paragraph One</p>
     <p>paragraph two</p>
     <p>paragraph three</p>
</div>

Use the regex library to find all of the paragraph <p> tags whose text contains “Paragraph” or “paragraph”, using a regex that matches both an upper- and lower-case P.

import re

for tag in soup.find_all('p', string=re.compile("[Pp]aragraph")):
    print(tag.text)

Results:

Paragraph One
paragraph two
paragraph three

Saving Images to a folder

First you need to install requests, which is used to download the image data (pyodbc is only needed if you also plan to save records to a database):

pip install requests
pip install pyodbc

Optional – uuid is used for creating unique names for the images. It is part of the Python standard library, so there is nothing to install.

Saving images

Example HTML

<ul class="row list-unstyled">
    <li class="col-3">
        <a class="fancy-link" href="https://www.domain.com/content/image.jpg">
            <img class="img-fluid"
                 data-src="https://www.domain.com/content/image.jpg"
                 src="https://www.domain.com/image.jpg"/>
        </a>
    </li>
    <li class="col-3">
        <a class="fancy-link" href="https://www.domain.com/content/image.jpg">
            <img class="img-fluid"
                 data-src="https://www.domain.com/content/image.jpg"
                 src="https://www.domain.com/image.jpg"/>
        </a>
    </li>
</ul>

In this example, we will save the images to a subfolder in the root path of the Python app.

import requests
from urllib.request import Request, urlopen
import lxml
from bs4 import BeautifulSoup
import pyodbc  # only needed if you also save records to a database
import os
import uuid

def scrape_images(url, RecordID):
    current_directory = os.getcwd()
    subfolderpath = os.path.join("images", str(RecordID))
    path = os.path.join(current_directory, subfolderpath)

    # Create the subfolder if it does not exist
    if not os.path.isdir(path):
        os.makedirs(path)

    req = Request(url)
    page = urlopen(req)
    soup = BeautifulSoup(page, features="lxml")

    # Find the image list, then loop through each <li> and grab its <img>
    imageContent = soup.find('ul', class_='row list-unstyled')

    if imageContent is not None:

        for list_item in imageContent.find_all('li', class_='col-3'):

            image = list_item.find('img')

            if image is not None:

                # Download the image from its data-src attribute
                image_data = requests.get(image['data-src']).content

                # Use a UUID so every file gets a unique name
                filename = str(uuid.uuid4()) + '.jpg'

                complete_file_path = os.path.join(path, filename)

                with open(complete_file_path, 'wb') as handler:
                    handler.write(image_data)


scrape_images('https://www.DOMAIN.com/', 2021)