How I scrape lots of sites with one Python script

The power of configurable code execution.

Mykhailo Kushnir
Level Up Coding


Have you ever wanted to scrape a website but didn’t want to pay for a scraping tool like Octoparse? Or maybe you only needed to scrape a few pages from the site and didn’t want to go through the hassle of setting up a scraping script? In this blog post, I will show you how I created a tool capable of scraping 90% of websites for free, using only Python and a bit of Docker.

UPD: "How I scrape lots of sites with one Python script. Part 2", covering Docker, is now available! I encourage you to read this post first, though, to get a general understanding of the scripting solution.

You can find the code from my posts about web scraping here in an easy-to-view format.

Types of data that can be scraped

Most scraping bots are created to scrape tabular data or lists. In terms of markup, tables and lists are essentially the same: a container holds rows, and each row holds cells filled with values. Hence the algorithm of the script:

[Flowchart of the application: the process of scraping a website]

To extend the list of potential scraping targets, I decided to use an old-fashioned combination of Python and Selenium. While I do enjoy working with Scrapy, and its configurable design heavily influenced my own parsing script, it has certain limits when parsing sites with pagination, so I opted for the solution mentioned above.

For the sake of stability, I also decided to use a dockerized version of ChromeDriver. It saves me some pain during local Chrome updates and is always there, ready for me, unlike a version installed on your OS, which can be broken by system updates or the installation of new software.

Assuming you already have the Docker service running on your machine, starting a new container with ChromeDriver is as easy as running two commands:

$ docker pull selenium/standalone-chrome
$ docker run -d -p 4444:4444 -p 7900:7900 --shm-size="2g" selenium/standalone-chrome

My Python script for scraping websites

The core of this post is the code-sharing part. First, I’ll introduce you to the helper methods:

These two allow me to switch between a dockerized version of Selenium and the local one when I need to debug something during development.
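
The original gist with these helpers isn’t embedded here, but a minimal sketch of what the two driver helpers might look like is below. The function names and the localhost:4444 address are assumptions, based on the dockerized standalone Chrome container started above.

from selenium import webdriver


def get_remote_driver():
    # Connects to the dockerized standalone Chrome started with the docker run command above.
    return webdriver.Remote(
        command_executor="http://localhost:4444/wd/hub",
        options=webdriver.ChromeOptions(),
    )


def get_local_driver():
    # Uses a locally installed Chrome, which is handy for debugging during development.
    return webdriver.Chrome(options=webdriver.ChromeOptions())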

There’s also a straightforward method I use to extract text from HTML elements. In the near future, I plan to add helpers to extract links and images automatically. If there is interest in the subject, I can share an updated version of the script.
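
A minimal sketch of such a text-extraction helper, assuming a Selenium WebElement as input (the name extract_text is hypothetical):

def extract_text(element):
    # .text returns only the visible text; fall back to textContent for elements hidden by CSS.
    return element.text.strip() or (element.get_attribute("textContent") or "").strip()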

The essence of this Selenium-based spider is in the gist below. Please read through the comments, and if you have any questions about how it works, let me know in the comments.
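
Since the gist doesn’t render here, below is a rough sketch of what such a Selenium-based spider could look like. It assumes the config keys mentioned later in this post (a url with a $p$ placeholder, data_selectors, data_column_titles) plus hypothetical start_page and end_page keys; the actual script may differ.

import argparse
import csv

import yaml
from selenium import webdriver
from selenium.webdriver.common.by import By


def run_spider(config_path, output_path):
    with open(config_path) as f:
        config = yaml.safe_load(f)

    driver = webdriver.Remote(
        command_executor="http://localhost:4444/wd/hub",
        options=webdriver.ChromeOptions(),
    )
    rows = []
    try:
        for page in range(config.get("start_page", 1), config.get("end_page", 1) + 1):
            # Substitute the current page number into the $p$ placeholder in the URL.
            driver.get(config["url"].replace("$p$", str(page)))
            # Each selector produces one column of values; zip the columns into rows.
            columns = [
                [el.text.strip() for el in driver.find_elements(By.CSS_SELECTOR, selector)]
                for selector in config["data_selectors"]
            ]
            rows.extend(zip(*columns))
    finally:
        driver.quit()

    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(config["data_column_titles"])
        writer.writerows(rows)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--config", required=True)
    parser.add_argument("-o", "--output", required=True)
    args = parser.parse_args()
    run_spider(args.config, args.output)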

How to use the script to scrape websites

In this part, I’ll demonstrate how the script can be used. First, you need to create a YAML configuration file and then run your spider. For example, let’s scrape the good old quotes.toscrape.com. A config for it would look like this:
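
The config gist isn’t reproduced here, but based on the keys described below, a config for quotes.toscrape.com might look roughly like this (start_page and end_page are assumed keys; the exact key names in the original script may differ):

url: "https://quotes.toscrape.com/page/$p$/"
start_page: 1
end_page: 10
data_selectors:
  - ".text"
  - ".author"
data_column_titles:
  - "quote"
  - "author"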

First of all, notice that $p$ is a placeholder for the page number. Most sites serve paginated content with a noticeable change in the URL from page to page. Your task is to identify how the URL changes and configure your spider with this mask.

Be aware that order matters in data_selectors and data_column_titles: the n-th selector fills the n-th column. For example, the text of the quotes would be parsed from the ".text" selector (duh).

After you have your config prepared, you can execute it with:

python -m spider -c "./configs/quotes.yaml" -o "./outputs/quotes/$(date +%Y-%m-%d).csv"

The bash line above takes the config from the ./configs/quotes.yaml file and stores the result in a CSV file at ./outputs/quotes/<current date>.csv.

Tips on how to improve your scraping process

  • Use proxies

Selenium lets you pass a proxy IP address as simply as adding a parameter to its constructor. There is a perfect answer on StackOverflow, so I won’t try to reinvent the wheel.
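
For reference, one common way to route the dockerized Chrome through a proxy (the proxy address below is just a placeholder):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://my.proxy.host:3128")  # replace with your proxy
driver = webdriver.Remote(
    command_executor="http://localhost:4444/wd/hub",
    options=options,
)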

  • Be gentle with sites you’re parsing

Check out robots.txt and comply with it. Run your requests with a delay between them to smooth the load. Use scheduling to run your scripts in the evenings, or whenever you expect the site to have low incoming traffic.
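
As an illustration, here is a tiny sketch that checks robots.txt with the standard library and pauses between page requests (the two-second delay is arbitrary):

import time
from urllib import robotparser

rp = robotparser.RobotFileParser("https://quotes.toscrape.com/robots.txt")
rp.read()

for page in range(1, 11):
    url = f"https://quotes.toscrape.com/page/{page}/"
    if not rp.can_fetch("*", url):
        continue  # skip pages the site asks bots not to touch
    # ... fetch and parse the page here ...
    time.sleep(2)  # smooth the load on the target site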

Voice of the crowd

One of the best things about configurable scraping bots is that you don’t have to write a new bot for every site you want to parse. You just need one good script that can be tweaked for each site or domain. Think back on all the scraping projects you’ve done this year so far: what would you like me to add to my script?
