Scraping Trustpilot Reviews with Python
Scraping becomes essential once you have more than a few pages of customer reviews, because copying them by hand quickly stops being practical.
For this SWD qualitative data challenge, I decided to analyse Ryanair’s Trustpilot reviews.
I used Python’s Beautiful Soup package to extract the reviews into a Pandas dataframe. The dataframe can then be processed with Python’s Natural Language Toolkit (NLTK) to find themes in the data.
You can also use this code to scrape reviews for any other company for your own analysis.
⚠️ Important:
Trustpilot’s HTML may have changed since I wrote this.
If so, update the class names in the code to match the current HTML.
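One way to make the script less sensitive to these changes is to match only the readable prefix of a class name instead of the full string with its generated suffix. The snippet below is a minimal sketch of that idea, not part of the original script: the sample HTML and the suffixes in it are made up for illustration, and it assumes that prefixes such as styles_reviewCard__ stay stable when the suffixes change.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML: the suffixes after "__" are invented for this example
sample_html = """
<article class="paper_paper__AAAAA styles_reviewCard__BBBBB">
  <h2 class="typography_h4__CCCCC styles_reviewTitle__DDDDD">Great flight</h2>
</article>
<article class="paper_paper__AAAAA styles_reviewCard__BBBBB">
  <h2 class="typography_h4__CCCCC styles_reviewTitle__DDDDD">Lost luggage</h2>
</article>
<div class="styles_sidebar__EEEEE">Not a review</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A compiled regex passed to class_ is tested against each CSS class of a tag,
# so only the stable prefix needs to match
review_cards = soup.find_all(class_=re.compile(r"^styles_reviewCard__"))

titles = [
    card.find(class_=re.compile(r"^styles_reviewTitle__")).get_text()
    for card in review_cards
]
print(titles)
```

If the prefixes themselves change, the patterns still need updating by hand, so it is worth checking the page source in your browser first.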
Update the URL
Update the URL in the response variable with your company’s domain as it appears on Trustpilot:
response = requests.get(f"https://www.trustpilot.com/review/YOUR-DOMAIN-HERE.COM?page={i}")
Page numbers
Update which Trustpilot pages are scraped using these variables:
from_page = 1
to_page = 50
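To see which URLs a given page range will request before running the scraper, the URL pattern and the two page variables can be combined as in the sketch below. The helper name build_review_url and the domain example.com are placeholders introduced for this illustration; they are not part of the original script.

```python
def build_review_url(domain: str, page: int) -> str:
    """Return the Trustpilot review URL for a company domain and page number."""
    return f"https://www.trustpilot.com/review/{domain}?page={page}"


# Placeholder values: swap in your own domain and page range
from_page = 1
to_page = 3

urls = [build_review_url("example.com", page) for page in range(from_page, to_page + 1)]
for url in urls:
    print(url)
```

Note that range() excludes its end value, which is why the loop uses to_page + 1 to include the last page.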
Final dataframe
If you hit any Trustpilot request limits, wait a while, then update from_page and to_page and run the loop again for the next batch of pages.
Once you’ve scraped all the pages you need, run the last two lines of the script to create the final dataframe, df_reviews, from the lists.
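If you would rather not restart batches by hand, the waiting can be scripted. The following is a rough sketch under my own assumptions: the function scrape_in_batches, the batch size and the pause length are all hypothetical, and Trustpilot’s actual limits are not known to me. The page-fetching step is passed in as a function so the batching logic can be tried without making any requests.

```python
import time


def scrape_in_batches(from_page, to_page, fetch_page,
                      batch_size=10, pause_seconds=60, sleep=time.sleep):
    """Call fetch_page(page) for each page, pausing between batches."""
    results = []
    for batch_start in range(from_page, to_page + 1, batch_size):
        batch_end = min(batch_start + batch_size - 1, to_page)
        for page in range(batch_start, batch_end + 1):
            results.append(fetch_page(page))
        # Pause only if there are more pages left to fetch
        if batch_end < to_page:
            sleep(pause_seconds)
    return results


# Dry run: a stand-in fetch function and a recorder instead of a real sleep
pauses = []
pages = scrape_in_batches(1, 25, fetch_page=lambda page: page,
                          batch_size=10, pause_seconds=60, sleep=pauses.append)
print(pages)
print(pauses)
```

In real use, fetch_page would wrap the requests.get call and the parsing from the script, and the default time.sleep would do the waiting.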
Python script:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import datetime as dt
# Initialize lists
review_titles = []
review_dates_original = []
review_dates = []
review_ratings = []
review_texts = []
page_number = []
# Set Trustpilot page numbers to scrape here
from_page = 1
to_page = 50
for i in range(from_page, to_page + 1):
    response = requests.get(f"https://www.trustpilot.com/review/YOUR-DOMAIN-HERE.COM?page={i}")
    web_page = response.text
    soup = BeautifulSoup(web_page, "html.parser")
    for review in soup.find_all(class_="paper_paper__1PY90 paper_square__lJX8a card_card__lQWDv card_noPadding__D8PcU styles_cardWrapper__LcCPA styles_show__HUXRb styles_reviewCard__9HxJJ"):
        # Review titles
        review_title = review.find(class_="typography_typography__QgicV typography_h4__E971J typography_color-black__5LYEn typography_weight-regular__TWEnf typography_fontstyle-normal__kHyN3 styles_reviewTitle__04VGJ")
        review_titles.append(review_title.getText())
        # Review dates (original text as shown on the page)
        review_date_original = review.select_one(selector="time")
        review_dates_original.append(review_date_original.getText())
        # Convert review date texts into Python date objects
        review_date = review_date_original.getText().replace("Updated ", "")
        if "hours ago" in review_date.lower() or "hour ago" in review_date.lower():
            review_date = dt.datetime.now().date()
        elif "a day ago" in review_date.lower():
            review_date = dt.datetime.now().date() - dt.timedelta(days=1)
        elif "days ago" in review_date.lower():
            # Read the whole leading number, so a two-digit day count is not cut to its first digit
            review_date = dt.datetime.now().date() - dt.timedelta(days=int(review_date.split()[0]))
        else:
            review_date = dt.datetime.strptime(review_date, "%b %d, %Y").date()
        review_dates.append(review_date)
        # Review ratings (taken from the alt text of the star image)
        review_rating = review.find(class_="star-rating_starRating__4rrcf star-rating_medium__iN6Ty").findChild()
        review_ratings.append(review_rating["alt"])
        # When there is no review text, append "" instead of skipping so that data remains in sequence with other review data e.g. review_title
        review_text = review.find(class_="typography_typography__QgicV typography_body__9UBeQ typography_color-black__5LYEn typography_weight-regular__TWEnf typography_fontstyle-normal__kHyN3")
        if review_text is None:
            review_texts.append("")
        else:
            review_texts.append(review_text.getText())
        # Trustpilot page number
        page_number.append(i)
# Create final dataframe from lists
df_reviews = pd.DataFrame(list(zip(review_titles, review_dates_original, review_dates, review_ratings, review_texts, page_number)),
                          columns=['review_title', 'review_date_original', 'review_date', 'review_rating', 'review_text', 'page_number'])
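The date conversion is the fiddliest part of the script, so it can help to pull it into its own function and try it on a few strings. This is a sketch of the same rules as the script, not a replacement for it: the function name parse_review_date and the today parameter are my additions, and the example strings are invented to match the formats the script expects.

```python
import datetime as dt


def parse_review_date(text, today=None):
    """Convert Trustpilot-style date text into a date object."""
    today = today or dt.date.today()
    cleaned = text.replace("Updated ", "").strip()
    lowered = cleaned.lower()
    if "hour ago" in lowered or "hours ago" in lowered:
        return today
    if "a day ago" in lowered:
        return today - dt.timedelta(days=1)
    if "days ago" in lowered:
        # The first word is the number of days, which may have several digits
        return today - dt.timedelta(days=int(lowered.split()[0]))
    # Anything else is assumed to be an absolute date in the "%b %d, %Y" format
    return dt.datetime.strptime(cleaned, "%b %d, %Y").date()


# A fixed reference day keeps the relative dates reproducible
reference_day = dt.date(2023, 3, 15)
examples = ["3 hours ago", "a day ago", "12 days ago", "Updated Jan 5, 2023"]
parsed = [parse_review_date(text, today=reference_day) for text in examples]
for text, date in zip(examples, parsed):
    print(f"{text!r} -> {date}")
```

Any text that matches none of the relative patterns and is not in the expected absolute format raises a ValueError, which is a useful signal that Trustpilot has introduced a new date wording.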