Web scraping reviews to CSV with Selenium
- What this is for: Scraping web pages to collect review data and storing it in a CSV file
- Requirements: Python (Anaconda distribution); basic knowledge of HTML structure and the Chrome Inspector tool
- Concepts covered: Selenium, error and exception handling
In an earlier blog post, I wrote a brief tutorial on web scraping with BeautifulSoup. It’s a great tool, but it has some limitations, particularly if you need to scrape a page whose content is loaded via AJAX.
Enter Selenium. Selenium can drive a real browser from Python, which means it can scrape AJAX-generated content. Before we continue, it is important to note that Selenium is technically a testing tool, not a scraper.
That said, Selenium is simple to use and can get the job done. In this tutorial, we’ll set up code similar to what you would need to scrape review data from a website and store it in a CSV file.
Install Selenium library
First, we’ll install the Selenium library in Anaconda.
Click on your Start menu and search for Anaconda Prompt. Open a new Anaconda Prompt window.
Change the directory to where you have Anaconda installed. For example:
cd C:\users\username\Anaconda3\
Next, type:
conda install -c conda-forge selenium
It will take a moment to load and will ask you to confirm the installation. Once installed, open Anaconda Navigator, go to the Environments tab, and search the installed packages to make sure selenium is listed.
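You can also confirm the install from the Anaconda Prompt itself. For example:
conda list selenium
If the install succeeded, selenium and its version number will appear in the output.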
We’ll also need to install ChromeDriver for the code to work. This essentially lets the code take control of a Chrome browser window.
ChromeDriver is available for download from the official ChromeDriver downloads page. Download the version that matches your installed Chrome, extract the ZIP file, and save the .EXE somewhere on your computer.
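Before writing any scraping code, you can confirm the driver works with a quick smoke test like the one below. The path here is a placeholder; point it at wherever you saved the .EXE:
from selenium import webdriver

#Launch Chrome via ChromeDriver, load a page, then close the browser
driver = webdriver.Chrome(executable_path=r'C:/path/to/chromedriver.exe')
driver.get('https://www.google.com')
driver.quit()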
Getting started in Python
First, we’ll import our libraries, create our CSV output file, and define a pandas dataframe.
#Importing packages
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException  #used in the error handling section below
import pandas as pd
import csv

#Create csv
outfile = open("star_ratings.csv", "w", newline='')
writer = csv.writer(outfile)

#Define dataframe
df = pd.DataFrame(columns=['business','category','rating','totalrating','comments'])
Next, we’ll define the URLs we want to scrape as a list. We’ll also define the location of our web driver .EXE file.
urls = ['https://www.reviewsite.com/1',
        'https://www.reviewsite.com/2',
        'https://www.reviewsite.com/3',
        'https://www.reviewsite.com/4'
        ]
driver = webdriver.Chrome(executable_path=r'C:/Users/username/Documents/Python Scripts/chromedriver/chromedriver.exe')
Because we’re scraping multiple pages, we’ll create a for loop that repeats our data-gathering steps for each site. All of the scraping code from here on is indented because it runs inside this loop.
for url in urls:
    driver.get(url)
Selenium can grab elements by their ID, class, tag, or other properties. To find the ID, class, tag, or other property you want to scrape, right-click within the Chrome browser and select Inspect (or press F12 to open Chrome’s developer tools).
In this case we’ll start with collecting the H1 data. This is simple with the find_element_by_tag_name method.
    #Scrape data
    business_element = driver.find_element_by_tag_name('h1')
    business = business_element.text
    print(business)
Cleaning strings
Next, we’ll collect the type of business. For this example, the site I was scraping stored this data in a way that needed a little cleaning. You may run into a similar situation, so let’s do some basic text cleaning.
When I looked at the section markup with Chrome Inspector, it looked something like this:
<div id="categories">
  Categories:<br>
  Type1<br>
  Type2<br>
</div>
In order to send clean data to the CSV, we’ll need to remove the “Categories:” text and replace line breaks with a pipe character to store data like this: “Type1|Type2”. This is how we can accomplish that:
    category_element = driver.find_element_by_id('categories')
    full_category = category_element.text
    category_clean = full_category.replace('Categories:\n', '')
    category = category_clean.replace('\n', '|')
    print(category)
Scraping other elements
For the other elements, we’ll use Selenium’s find_element_by_class_name method to capture them by class.
    rating_element = driver.find_element_by_class_name('starrating')
    rating = rating_element.text
    print(rating)

    totalrating_element = driver.find_element_by_class_name('totalratings-count')
    totalrating = totalrating_element.text
    print(totalrating)

    comments_element = driver.find_element_by_class_name('totalcomments-count')
    comments = comments_element.text
    print(comments)
Now, let’s piece all the data together and add it to our dataframe. Using the variables we created, we’ll append a new row to the dataframe.
    #Add new rows to dataframe
    df2 = pd.DataFrame([[business, category, rating, totalrating, comments]],
                       columns=['business','category','rating','totalrating','comments'])
    df = df.append(df2, ignore_index=True)
Handling errors
One error you may encounter is missing data. For example, if a business doesn’t have any reviews or comments, the site may not render the div that contains that information at all.
If you attempt to scrape a div that doesn’t exist, Selenium will raise an error. Python lets you handle errors like this with a try/except block.
So let’s assume our business may not have a star rating. In the try: block, we’ll write the code for what to do when the “starrating” class exists. In the except: block, we’ll write the code for what to do when it doesn’t and Selenium raises the NoSuchElementException we imported earlier.
    try:
        rating_element = driver.find_element_by_class_name('starrating')
        rating = rating_element.text
        print(rating)
    except NoSuchElementException:
        rating = "-"
        print(rating)
A word of caution: if you are planning to do statistical analysis of the data, be careful how you fill in missing values in the except: block. For example, if your code cannot find the number of stars, entering this data as “0” will skew your analysis, because there is a difference between having a 0-star rating and having no star rating at all. So for this example, data that raises an error will produce a “-” in the dataframe and CSV file instead of a 0.
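Finally, note that the snippets above never actually write anything to the CSV file we opened at the start. One minimal way to finish, once the loop is done, is to close the file handle and let pandas write the dataframe directly (alternatively, you could call writer.writerow() inside the loop):
#After the loop: save the dataframe and clean up
outfile.close()                              #release the file handle opened earlier
df.to_csv("star_ratings.csv", index=False)   #write the header row and all scraped rows
driver.quit()                                #close the Chrome window
And if you later read the CSV back into pandas for analysis, pd.read_csv("star_ratings.csv", na_values=["-"]) will load the “-” placeholders as NaN, so they are excluded from calculations such as .mean().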