Scraping Trustpilot Reviews with Python

Being able to scrape review data is essential whenever you have more than a few pages of customer reviews.

For this SWD qualitative data challenge, I decided to analyse Ryanair’s Trustpilot reviews.

I used Python’s Beautiful Soup package to extract the reviews into a Pandas dataframe. This could then be processed to find any themes in the data using Python’s Natural Language Toolkit (NLTK).

You can also use this code to scrape reviews for any other company for your own analysis.

⚠️ Important:

Trustpilot’s HTML may have changed since I wrote this. If it has, update the class names in the code to match the current HTML.
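
One way to make the class lookups a little less brittle is to match only the readable prefix of a class name and ignore the suffix after the double underscore. The sketch below assumes those suffixes are the part most likely to change, which I haven't verified. The sample HTML and its suffixes are made up for illustration; only the prefixes styles_reviewCard and styles_reviewTitle come from the script in this post.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML that mimics the shape of a review card (not real Trustpilot markup)
SAMPLE_HTML = """
<article class="paper_paper__AAAAA styles_reviewCard__BBBBB">
  <h2 class="typography_h4__CCCCC styles_reviewTitle__DDDDD">Great flight</h2>
</article>
<div class="styles_sidebar__EEEEE">Not a review</div>
"""

# Match any single class that starts with the stable-looking prefix
REVIEW_CARD = re.compile(r"^styles_reviewCard__")
REVIEW_TITLE = re.compile(r"^styles_reviewTitle__")

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

titles = []
for card in soup.find_all(class_=REVIEW_CARD):
    title = card.find(class_=REVIEW_TITLE)
    # Fall back to "" when a card has no title so the list stays aligned
    titles.append(title.get_text(strip=True) if title is not None else "")

print(titles)
```

If the prefixes themselves change, the patterns still need updating by hand, so treat this as a convenience rather than a guarantee.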

Update the URL

Update the URL in the response variable with your company’s domain, as it appears in the company’s Trustpilot URL:

response = requests.get(f"https://www.trustpilot.com/review/YOUR-DOMAIN-HERE.COM?page={i}")
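
If you plan to scrape more than one company, it may be tidier to build the URL in a small helper. This is only a sketch: build_review_url and TRUSTPILOT_REVIEW_BASE are names I made up, and www.example.com is a placeholder domain.

```python
from urllib.parse import quote

# Base of the review URL used in the script in this post
TRUSTPILOT_REVIEW_BASE = "https://www.trustpilot.com/review"


def build_review_url(domain, page):
    """Return the review-page URL for a company domain and page number."""
    # quote() leaves ordinary domain characters alone and escapes anything unusual
    return f"{TRUSTPILOT_REVIEW_BASE}/{quote(domain)}?page={page}"


example_url = build_review_url("www.example.com", 3)
print(example_url)
```

Inside the loop you would then call requests.get(build_review_url("YOUR-DOMAIN-HERE.COM", i)).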

Page numbers

Update which Trustpilot pages are scraped using these variables:

from_page = 1
to_page = 50

Final dataframe

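If you hit any Trustpilot request limits, wait a while, then set from_page and to_page to the next batch of pages and re-run the loop. Don’t re-run the list initialisation between batches, as that would empty the lists you have already filled.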
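
The waiting can also be automated. The sketch below retries a request with a growing pause whenever the response status is 429, the standard HTTP code for "Too Many Requests". Whether Trustpilot actually signals its limits this way is an assumption on my part, and fetch_with_backoff and its parameters are hypothetical names, not part of the original script.

```python
import time


def fetch_with_backoff(get, url, max_attempts=4, base_delay=30.0, sleep=time.sleep):
    """Call get(url), pausing and retrying while the response status is 429.

    get   -- a callable such as requests.get that returns an object with .status_code
    sleep -- injectable so the function can be tested without really waiting
    """
    response = None
    for attempt in range(max_attempts):
        response = get(url)
        if response.status_code != 429:
            return response
        # Double the pause after each rate-limited attempt, except after the last one
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after every attempt: hand back the last response
    return response
```

In the scraping loop this would replace the direct call, for example response = fetch_with_backoff(requests.get, url). Check response.status_code afterwards, since the function can still return a rate-limited response once it runs out of attempts.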

Once you’ve scraped all the pages you need, run the final statement (the last two lines of the script) to create the final dataframe df_reviews from the lists.
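
Because the dataframe is built with zip, which stops at the shortest list, a single missed append would silently drop or misalign rows. A quick length check before building the dataframe can catch that. This is a minimal sketch: check_aligned is a name I made up, and the sample values are invented.

```python
def check_aligned(**columns):
    """Return the shared length of the named lists, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Lists are out of step: {lengths}")
    # All lengths are equal here; return that length (0 if no lists were given)
    return next(iter(lengths.values()), 0)


# Invented sample values standing in for the scraped lists
titles = ["Great flight", "Delayed again"]
ratings = ["5 stars", "1 star"]

row_count = check_aligned(review_title=titles, review_rating=ratings)
print(row_count)
```

With the real script you would pass all six lists, for example check_aligned(review_title=review_titles, review_text=review_texts, ...), before running the dataframe statement.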

Python script:

from bs4 import BeautifulSoup
import requests
import pandas as pd
import datetime as dt

# Initialize lists
review_titles = []
review_dates_original = []
review_dates = []
review_ratings = []
review_texts = []
page_number = []

# Set Trustpilot page numbers to scrape here
from_page = 1
to_page = 50

for i in range(from_page, to_page + 1):
    response = requests.get(f"https://www.trustpilot.com/review/YOUR-DOMAIN-HERE.COM?page={i}")
    web_page = response.text
    soup = BeautifulSoup(web_page, "html.parser")

    for review in soup.find_all(class_ = "paper_paper__1PY90 paper_square__lJX8a card_card__lQWDv card_noPadding__D8PcU styles_cardWrapper__LcCPA styles_show__HUXRb styles_reviewCard__9HxJJ"):
        # Review titles
        review_title = review.find(class_ = "typography_typography__QgicV typography_h4__E971J typography_color-black__5LYEn typography_weight-regular__TWEnf typography_fontstyle-normal__kHyN3 styles_reviewTitle__04VGJ")
        review_titles.append(review_title.getText())

        # Review dates
        review_date_original = review.select_one(selector="time")
        review_dates_original.append(review_date_original.getText())

        # Convert review date texts into Python date objects
        review_date = review_date_original.getText().replace("Updated ", "")
        today = dt.date.today()
        if "hours ago" in review_date.lower() or "hour ago" in review_date.lower():
            review_date = today
        elif "a day ago" in review_date.lower():
            review_date = today - dt.timedelta(days=1)
        elif "days ago" in review_date.lower():
            # Use the whole leading number so "12 days ago" gives 12, not 1
            review_date = today - dt.timedelta(days=int(review_date.split()[0]))
        else:
            # Absolute dates such as "Jan 5, 2023"
            review_date = dt.datetime.strptime(review_date, "%b %d, %Y").date()
        review_dates.append(review_date)

        # Review ratings
        review_rating = review.find(class_ = "star-rating_starRating__4rrcf star-rating_medium__iN6Ty").findChild()
        review_ratings.append(review_rating["alt"])
        
        # Review text: when a review has no text, append "" instead of skipping it,
        # so this list stays in sequence with the other lists (e.g. review_titles)
        review_text = review.find(class_ = "typography_typography__QgicV typography_body__9UBeQ typography_color-black__5LYEn typography_weight-regular__TWEnf typography_fontstyle-normal__kHyN3")
        if review_text is None:
            review_texts.append("")
        else:
            review_texts.append(review_text.getText())
        
        # Trustpilot page number
        page_number.append(i)

# Create final dataframe from lists
df_reviews = pd.DataFrame(list(zip(review_titles, review_dates_original, review_dates, review_ratings, review_texts, page_number)),
                columns =['review_title', 'review_date_original', 'review_date', 'review_rating', 'review_text', 'page_number'])
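
The date conversion is the easiest part of the script to get subtly wrong, so it can help to pull it into a function you can check without hitting Trustpilot. The sketch below handles only the label formats the script already expects ("hours ago", "a day ago", "N days ago" and dates like "Jan 5, 2023"); other formats are not covered. parse_review_date is a name I made up, and today is passed in so the results are repeatable.

```python
import datetime as dt


def parse_review_date(text, today):
    """Convert a review date label into a datetime.date, relative to `today`."""
    label = text.replace("Updated ", "").strip()
    lowered = label.lower()
    if "hour ago" in lowered or "hours ago" in lowered:
        return today
    if "a day ago" in lowered:
        return today - dt.timedelta(days=1)
    if "days ago" in lowered:
        # Read the whole leading number, so two-digit day counts work
        return today - dt.timedelta(days=int(lowered.split()[0]))
    # Anything else is treated as an absolute date such as "Jan 5, 2023"
    return dt.datetime.strptime(label, "%b %d, %Y").date()


reference_day = dt.date(2023, 6, 15)
for sample in ["3 hours ago", "a day ago", "12 days ago", "Updated Jan 5, 2023"]:
    print(sample, "->", parse_review_date(sample, reference_day))
```

A label in any other format raises a ValueError from strptime, which at least makes an unexpected format visible instead of storing a wrong date.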