Python Web Scraping with BeautifulSoup
Overview
These are examples of the most common scenarios I have run into when scraping data. You can combine them into a single Python application that crawls a site, collects all the URLs, and extracts data from every page (handling pagination if there is any).
Prerequisites
You must have Python installed. https://www.python.org/downloads/
Check for pip version
If you don’t have pip installed, see https://pip.pypa.io/en/stable/installation/.
Windows
C:> py --version
Python 3.N.N
C:> py -m pip --version
pip X.Y.Z from ... (python 3.N.N)
Mac or Linux
$ python --version
Python 3.N.N
$ python -m pip --version
pip X.Y.Z from ... (python 3.N.N)
Install Beautiful Soup with pip. You may need to specify the version of pip, for instance pip2 for Python 2.7 and pip3 for Python 3, etc.
pip install beautifulsoup4
lxml – supplements Beautiful Soup for easy handling of XML and HTML files.
pip install lxml
You will also need urllib to open the URLs. urllib.request ships with the Python standard library, so there is nothing extra to install for the examples below. (Note that urllib3 is a separate third-party package and is not what these examples use.)
Imports
Import the following for all examples:
from bs4 import BeautifulSoup
import lxml
from urllib.request import Request, urlopen
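To confirm everything is installed correctly, you can run a quick sanity check (a minimal sketch; the HTML string here is just a throwaway example):
from bs4 import BeautifulSoup

# Parse a tiny HTML snippet to verify that bs4 and the lxml parser work
soup = BeautifulSoup("<p>Hello, soup!</p>", features="lxml")
print(soup.p.text)  # prints: Hello, soup!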
Scraping Basics
The following is a list of how-to snippets for different situations.
Get a list of URLs
Get URLs from the anchor tag <a href="">
This example will get all the URLs that are inside a list.
Example HTML
<ul class="row list-unstyled">
  <li class="col-3">
    <a href="https://www.domain.com/page.html">
      Page one title
    </a>
  </li>
  <li class="col-3">
    <a href="https://www.domain.com/page2.html">
      Page two title
    </a>
  </li>
</ul>
The following code will scrape the URLs out of the anchor tags.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import lxml

def get_urls():
    url = "https://www.DOMAIN.com/"
    req = Request(url)
    page = urlopen(req)
    soup = BeautifulSoup(page, features="lxml")
    # Find the <ul> that holds the links, then loop over its <li> items
    table = soup.find('ul', class_='row list-unstyled')
    rows = table.find_all('li')
    for row in rows:
        link = row.find("a")
        print('url ', link['href'])

get_urls()
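Some sites block the default urllib user agent and return HTTP 403. If that happens, you can pass a browser-like User-Agent header to Request (a sketch; the header string is just an example value):
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req)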
Find every instance of a tag
.find_all()
Example HTML
<ul class="row list-unstyled">
  <li class="col-3">
    <a class="fancy-link" href="https://www.domain.com/content/image.jpg">
      <img class="img-fluid"
           data-src="https://www.domain.com/content/image.jpg" src="https://www.domain.com/image.jpg"/>
    </a>
  </li>
  <li class="col-3">
    <a class="fancy-link" href="https://www.domain.com/content/image.jpg">
      <img class="img-fluid"
           data-src="https://www.domain.com/content/image.jpg" src="https://www.domain.com/image.jpg"/>
    </a>
  </li>
</ul>
Two examples of how to find the list tag <ul> so it can be looped through as a list/array.
.find_all("ul", class_="row list-unstyled")
OR
.find_all("ul", {"class": "row list-unstyled"})
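Putting it together, a minimal sketch (assuming the example HTML above has already been loaded into soup) that loops over every matching <ul> and prints each image's data-src:
# Loop over each matching <ul>, then over the <img> tags inside it
for ul in soup.find_all("ul", class_="row list-unstyled"):
    for img in ul.find_all("img"):
        print(img["data-src"])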
Iterate Through Multiple Sections with Identical Class or ID Names
Example HTML
<section class="main-section">
<div class="main-content">
<!-- Main Content -->
</div>
<div class="side-content">
<h5>Height</h5>
<p>5ft</p>
<h5>Weight</h5>
<p>180lbs</p>
</div>
<div class="side-content">
<h5>Hair Color</h5>
<p>Brown</p>
<h5>Eye Color</h5>
<p>Blue</p>
</div>
</section>
The following is a code snippet to get the values from the HTML above
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import lxml

def get_side_content():
    url = "https://www.DOMAIN.com/"
    req = Request(url)
    page = urlopen(req)
    soup = BeautifulSoup(page, features="lxml")
    mainContent = soup.find('section', class_='main-section')
    # Both side divs share the same class, so find_all returns them in document order
    sideData = mainContent.find_all('div', class_='side-content')
    for sideSection in sideData:
        values = sideSection.find_all('p')
        for val in values:
            print(val)

get_side_content()
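The snippet above prints only the <p> values. To pair each value with its <h5> label instead, you can swap the inner loop of get_side_content for something like this (a sketch using find_next_sibling, which is part of the Beautiful Soup API):
for sideSection in sideData:
    for label in sideSection.find_all('h5'):
        value = label.find_next_sibling('p')  # the <p> right after each <h5>
        if value is not None:
            print(label.text, '=', value.text)  # e.g. Height = 5ft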
Extract value from Custom Tag – Example data-src
Get URLs from the anchor tag
<a href="https://www.domain.com/images/image.jpg" data-src="https://www.domain.com/images/image.jpg">
This example will get the data-src value from the image inside each anchor tag.
# `listitems` is an element found earlier, e.g. soup.find('ul', class_='row list-unstyled')
imagelistArr = listitems.find_all('a')
for imageblock in imagelistArr:
    img = imageblock.find('img')
    if img is not None:
        print('list item img ', img['data-src'])
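A defensive variant (a sketch; the attribute names come from the example HTML above) uses .get() so images without a data-src attribute fall back to src instead of raising a KeyError:
for imageblock in listitems.find_all('a'):
    img = imageblock.find('img')
    if img is not None:
        # .get() returns None instead of raising if the attribute is missing
        print('list item img ', img.get('data-src') or img.get('src'))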
Extract Values From Text next to HTML Tags
Example HTML
<div class="short-description">
<p>
<em>Citrus lime ‘Bearss Seedless’ Std<br></em> <br>
<strong>• Mature Height: </strong> 10′<br>
<strong>• Mature Width: </strong> 8′<br>
<strong>• Light Requirements: </strong> Morning sun<br>
<strong>• Water Requirements: </strong> Regular water<br>
<strong>• Fertilizers: </strong> Dr Q’s Citrus Food, 14-7-7, Dr Q’s Organic Citrus Food 8-4-4
</p>
</div>
The code below will scrape the value after the <strong> tags with the next_sibling property.
title = ''
mature_height = ''
mature_width = ''
light_req = ''
water_req = ''

req = Request(url)
page = urlopen(req)
soup = BeautifulSoup(page, features="lxml")
table = soup.find('div', class_='short-description')
rows = table.find_all('p')
for row in rows:
    print('Row', row)
    title = row.find('em')
    data = row.find_all('strong')
    for item in data:
        print('item', item.text)
        if 'Mature Height:' in item.text:
            mature_height = item.next_sibling
        if 'Mature Width:' in item.text:
            mature_width = item.next_sibling
        if 'Light Requirements:' in item.text:
            light_req = item.next_sibling
        if 'Water Requirements:' in item.text:
            water_req = item.next_sibling
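Note that next_sibling returns the raw text node, which usually includes leading whitespace. A small cleanup step (a sketch) makes the values easier to work with:
# next_sibling values are NavigableString objects; convert and trim them
mature_height = str(mature_height).strip()  # e.g. '10′'
mature_width = str(mature_width).strip()    # e.g. '8′'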
Extract Select Value from a Dropdown Field
Example HTML
<div>
<label for="specs-year-select">Year</label>
<div class="specs-style-select-container">
<select id="specs-year-select" name="year">
<option value="2021">2021</option>
<option value="2020">2020</option>
<option value="2019">2019</option>
<option value="2018" selected="">2018</option>
</select>
</div>
</div>
This is not ideal, but I was not able to find another way to get the selected value. An improvement would be to check whether a string contains the value “selected“, but I had no luck with that.
Another option you can try is the following, taken from a post by Ilham Riski on https://www.py4u.net/discuss/200746:
option = soup.find("selected",{"name":"try"}).findAll("option")
option_ = soup.find("table", {"style": "font-size:14px"}).findAll("option")
In the end, this works, and if I find a better solution, I will update the code snippet.
selectedValue = ""
optionCount = 0
datatable = soup.find("div", class_="review-body")
dropdowns = datatable.find('form', class_="specs-filter-container")
selectFields = dropdowns.find_all('select')
for item in selectFields:
    section = item.select('option[value]')
    for option in section:
        tag = str(option)
        # The selected option is rendered as '<option selected="" ...>'
        if tag[0:17] == "<option selected=":
            if optionCount == 0:
                selectedValue = int(option.text)
            elif optionCount == 1:
                selectedValue = option.text
    optionCount = optionCount + 1
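A cleaner alternative (a sketch; Tag.has_attr is part of the Beautiful Soup API) is to ask each option whether it carries the selected attribute at all, instead of string-matching the rendered tag:
# Works on the example HTML above: prints 2018
for option in soup.find_all('option'):
    if option.has_attr('selected'):
        selectedValue = option.text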
Find All Elements With Regex Library
Example HTML
<div>
<h4>Header One</h4>
<p>Paragraph One</p>
<p>paragraph two</p>
<p>paragraph three</p>
</div>
Using the re library to find all of the paragraph <p> tags that contain “Paragraph” or “paragraph“, with a regex that matches both the upper and lower case P.
import re

# string= matches against the tag's text; [Pp] covers both cases
for tag in soup.find_all('p', string=re.compile("[Pp]aragraph")):
    print(tag.text)
Results:
Paragraph One
paragraph two
paragraph three
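The same approach works on attributes: pass a compiled regex as an attribute filter (a sketch; the .jpg pattern is just an example):
# Find every link whose href ends in .jpg
for link in soup.find_all('a', href=re.compile(r'\.jpg$')):
    print(link['href'])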
Saving Images to a folder
First you need to install requests, which is used below to download the image bytes.
pip install requests
Optional – uuid is used to create unique file names for the images. It is part of the Python standard library, so there is nothing to install.
Saving images
Example HTML
<ul class = "row list-unstyled" >
<li class = "col-3" >
<a class = "fancy-link" href = "https://www.domain.com/content/image.jpg" >
<img class = "img-fluid"
data-src = "https://www.domain.com/content/image.jpg" src="https://www.domain.com/image.jpg"/>
</a>
</li>
<li class = "col-3" >
<a class = "fancy-link" href = "https://www.domain.com/content/image.jpg" >
<img class = "img-fluid"
data-src = "https://www.domain.com/content/image.jpg" src="https://www.domain.com/image.jpg"/>
</a>
</li>
</ul>
In this example we will save the images to a subfolder under the root path of the Python app.
import requests
from urllib.request import Request, urlopen
import lxml
from bs4 import BeautifulSoup
import os
import uuid

def scrape_images(url, RecordID):
    current_directory = os.getcwd()
    subfolderpath = os.path.join("images", str(RecordID))
    path = os.path.join(current_directory, subfolderpath)
    # Create the subfolder (and any missing parents) if it does not exist
    if not os.path.isdir(path):
        os.makedirs(path)
    req = Request(url)
    page = urlopen(req)
    soup = BeautifulSoup(page, features="lxml")
    imageContent = soup.find('ul', class_='row list-unstyled')
    if imageContent is not None:
        list_tags = imageContent.find_all("li", class_="col-3")
        for list_item in list_tags:
            image = list_item.find('img')
            if image is not None:
                # <img> tags have no href; download from the src attribute
                image_data = requests.get(image['src']).content
                # Give each file a unique name so downloads never collide
                filename = str(uuid.uuid4()) + '.jpg'
                complete_file_path = os.path.join(path, filename)
                with open(complete_file_path, 'wb') as handler:
                    handler.write(image_data)

scrape_images('https://www.DOMAIN.com/', 2021)