Tutorial

How To Crawl A Web Page with Scrapy and Python 3

Updated on December 7, 2022
English
How To Crawl A Web Page with Scrapy and Python 3

Introduction

Web scraping, often called web crawling or web spidering, is the act of programmatically going over a collection of web pages and extracting data, and is a powerful tool for working with data on the web.

With a web scraper, you can mine data about a set of products, get a large corpus of text or quantitative data to play around with, retrieve data from a site without an official API, or just satisfy your own personal curiosity.

In this tutorial, you’ll learn about the fundamentals of the scraping and spidering process as you explore a playful data set. We’ll use Quotes to Scrape, a database of quotations hosted on a site designed for testing out web spiders. By the end of this tutorial, you’ll have a fully functional Python web scraper that walks through a series of pages containing quotes and displays them on your screen.

The scraper will be easily expandable so you can tinker around with it and use it as a foundation for your own projects scraping data from the web.

Prerequisites

To complete this tutorial, you’ll need a local development environment for Python 3. You can follow How To Install and Set Up a Local Programming Environment for Python 3 to configure everything you need.

Step 1 — Creating a Basic Scraper

Scraping is a two step process:

  1. Systematically finding and downloading web pages.
  2. Extract information from the downloaded pages.

Both of those steps can be implemented in a number of ways in many languages.

You can build a scraper from scratch using modules or libraries provided by your programming language, but then you have to deal with some potential headaches as your scraper grows more complex. For example, you’ll need to handle concurrency so you can crawl more than one page at a time. You’ll probably want to figure out how to transform your scraped data into different formats like CSV, XML, or JSON. And you’ll sometimes have to deal with sites that require specific settings and access patterns.

You’ll have better luck if you build your scraper on top of an existing library that handles those issues for you. For this tutorial, we’re going to use Python and Scrapy to build our scraper.

Scrapy is one of the most popular and powerful Python scraping libraries; it takes a “batteries included” approach to scraping, meaning that it handles a lot of the common functionality that all scrapers need so developers don’t have to reinvent the wheel each time.

Scrapy, like most Python packages, is on PyPI (also known as pip). PyPI, the Python Package Index, is a community-owned repository of all published Python software.

If you have a Python installation like the one outlined in the prerequisite for this tutorial, you already have pip installed on your machine, so you can install Scrapy with the following command:

  1. pip install scrapy

If you run into any issues with the installation, or you want to install Scrapy without using pip, check out the official installation docs.

With Scrapy installed, create a new folder for our project. You can do this in the terminal by running:

  1. mkdir quote-scraper

Now, navigate into the new directory you just created:

  1. cd quote-scraper

Then create a new Python file for our scraper called scraper.py. We’ll place all of our code in this file for this tutorial. You can create this file using the editing software of your choice.

Start out the project by making a very basic scraper that uses Scrapy as its foundation. To do that, you’ll need to create a Python class that subclasses scrapy.Spider, a basic spider class provided by Scrapy. This class will have two required attributes:

  • name — just a name for the spider.
  • start_urls — a list of URLs that you start to crawl from. We’ll start with one URL.

Open the scrapy.py file in your text editor and add this code to create the basic spider:

scraper.py
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spdier'
    start_urls = ['https://quotes.toscrape.com']

Let’s break this down line by line:

First, we import scrapy so that we can use the classes that the package provides.

Next, we take the Spider class provided by Scrapy and make a subclass out of it called BrickSetSpider. Think of a subclass as a more specialized form of its parent class. The Spider class has methods and behaviors that define how to follow URLs and extract data from the pages it finds, but it doesn’t know where to look or what data to look for. By subclassing it, we can give it that information.

Finally, we name the class quote-spider and give our scraper a single URL to start from: https://quotes.toscrape.com. If you open that URL in your browser, it will take you to a search results page, showing the first of many pages of famous quotations.

Now, test out the scraper. Typically, Python files are run with a command like python path/to/file.py. However, Scrapy comes with its own command line interface to streamline the process of starting a scraper. Start your scraper with the following command:

  1. scrapy runspider scraper.py

The command will output something like this:

Output
2022-12-02 10:30:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor 2022-12-02 10:30:08 [scrapy.extensions.telnet] INFO: Telnet Password: b4d94e3a8d22ede1 2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', ... 'scrapy.extensions.logstats.LogStats'] 2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', ... 'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', ... 'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled item pipelines: [] 2022-12-02 10:30:08 [scrapy.core.engine] INFO: Spider opened 2022-12-02 10:30:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2022-12-02 10:30:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023 2022-12-02 10:49:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com> (referer: None) 2022-12-02 10:30:08 [scrapy.core.engine] INFO: Closing spider (finished) 2022-12-02 10:30:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats: {'downloader/request_bytes': 226, ... 'start_time': datetime.datetime(2022, 12, 2, 18, 30, 8, 492403)} 2022-12-02 10:30:08 [scrapy.core.engine] INFO: Spider closed (finished)

That’s a lot of output, so let’s break it down.

  • The scraper initialized and loaded additional components and extensions it needed to handle reading data from URLs.
  • It used the URL we provided in the start_urls list and grabbed the HTML, just like your web browser would do.
  • It passed that HTML to the parse method, which doesn’t do anything by default. Since we never wrote our own parse method, the spider just finishes without doing any work.

Now let’s pull some data from the page.

Step 2 — Extracting Data from a Page

We’ve created a very basic program that pulls down a page, but it doesn’t do any scraping or spidering yet. Let’s give it some data to extract.

If you look at the page we want to scrape, you’ll see it has the following structure:

  • There’s a header that’s present on every page.
  • There’s a link to log in.
  • Then there are the quotes themselves, displayed in what looks like a table or ordered list. Each quote has a similar format.

When writing a scraper, you will need to look at the source of the HTML file and familiarize yourself with the structure. So here it is, with the tags that aren’t relevant to our goal removed for readability:

quotes.toscrape.com
<body> ... <div class="quote" itemscope itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“I have not failed. I&#39;ve just found 10,000 ways that won&#39;t work.”</span> <span>by <small class="author" itemprop="author">Thomas A. Edison</small> <a href="/author/Thomas-A-Edison">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" / > <a class="tag" href="/tag/edison/page/1/">edison</a> <a class="tag" href="/tag/failure/page/1/">failure</a> <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a> </div> </div> ... </body>

Scraping this page is a two step process:

  1. First, grab each quote by looking for the parts of the page that have the data we want.
  2. Then, for each quote, grab the data we want from it by pulling the data out of the HTML tags.

scrapy grabs data based on selectors that you provide. Selectors are patterns we can use to find one or more elements on a page so we can then work with the data within the element. scrapy supports either CSS selectors or XPath selectors.

We’ll use CSS selectors for now since CSS is a perfect fit for finding all the sets on the page. If you look at the HTML, you’ll see that each quote is specified with the class quote. Since we’re looking for a class, we’d use .quote for our CSS selector. The . part of the selector searches the class attribute on elements. All we have to do is create a new method in our class named parse and pass that selector into the response object, like this:

scraper.py
class QuoteSpider(scrapy.Spider):
    name = 'quote-spdier'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        
        for quote in response.css(QUOTE_SELECTOR):
            pass

This code grabs all the sets on the page and loops over them to extract the data. Now let’s extract the data from those quotes so we can display it.

Another look at the source of the page we’re parsing tells us that the text of each quote is stored within a span with the text class and the author of the quote in a <small> tag with the author class:

quotes.toscrape.com
... <span class="text" itemprop="text">“I have not failed. I&#39;ve just found 10,000 ways that won&#39;t work.”</span> <span>by <small class="author" itemprop="author">Thomas A. Edison</small> ...

The quote object we’re looping over has its own css method, so we can pass in a selector to locate child elements. Modify your code as follows to locate the name of the set and display it:

scraper.py
class QuoteSpider(scrapy.Spider):
    name = 'quote-spdier'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        
        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
            }

Note: The trailing comma after extract_first() isn’t a typo. In Python, a trailing comma in dict objects is valid syntax, and a good way to leave room for more adding more items, which we will here later.

You’ll notice two things going on in this code:

  • We append ::text to our selectors for the quote and author. That’s a CSS pseudo-selector that fetches the text inside of the tag rather than the tag itself.
  • We call extract_first() on the object returned by quote.css(TEXT_SELECTOR) because we just want the first element that matches the selector. This gives us a string, rather than a list of elements.

Save the file and run the scraper again:

  1. scrapy runspider scraper.py

This time the output will contain the quotes and their authors:

Output
... 2022-12-02 11:00:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com> {'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein'} 2022-12-02 11:00:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com> {'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'} 2022-12-02 11:00:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com> {'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe'} ...

Let’s keep expanding on this by adding new selectors for links to pages about the author and tags for the quote. By investigating the HTML for each quote, we find:

  • The link to the author’s about page is stored in a link immediately following their name.
  • The tags are stored as a collection of a tags, each classed tag, stored within a div element with the tags class.

So, let’s modify the scraper to get this new information:

scraper.py
class QuoteSpider(scrapy.Spider):
    name = 'quote-spdier'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        ABOUT_SELECTOR = '.author + a::attr("href")'
        TAGS_SELECTOR = '.tags > .tag::text'

        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
                'about': 'https://quotes.toscrape.com' + 
                        quote.css(ABOUT_SELECTOR).extract_first(),
                'tags': quote.css(TAGS_SELECTOR).extract(),
            }

Save your changes and run the scraper again:

  1. scrapy runspider scraper.py

Now the output will contain the new data:

Output
2022-12-02 11:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com> {'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'about': 'https://quotes.toscrape.com/author/Albert-Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']} 2022-12-02 11:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com> {'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'about': 'https://quotes.toscrape.com/author/Jane-Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']} 2022-12-02 11:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com> {'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'about': 'https://quotes.toscrape.com/author/Marilyn-Monroe', 'tags': ['be-yourself', 'inspirational']}

Now let’s turn this scraper into a spider that follows links.

Step 3 — Crawling Multiple Pages

You’ve successfully extracted data from that initial page, but we’re not progressing past it to see the rest of the results. The whole point of a spider is to detect and traverse links to other pages and grab data from those pages too.

You’ll notice that the top and bottom of each page has a little right carat (>) that links to the next page of results. Here’s the HTML for that:

quotes.toscrape.com
... <nav> <ul class="pager"> <li class="next"> <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a> </li> </ul> </nav> ...

In the source, you will find an li tag with the class of next, and inside that tag, there’s an a tag with a link to the next page. All we have to do is tell the scraper to follow that link if it exists.

Modify your code as follows:

scraper.py
class QuoteSpider(scrapy.Spider):
    name = 'quote-spdier'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        ABOUT_SELECTOR = '.author + a::attr("href")'
        TAGS_SELECTOR = '.tags > .tag::text'
        NEXT_SELECTOR = '.next a::attr("href")'

        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
                'about': 'https://quotes.toscrape.com' + 
                        quote.css(ABOUT_SELECTOR).extract_first(),
                'tags': quote.css(TAGS_SELECTOR).extract(),
            }

        next_page = response.css(NEXT_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))

First, we define a selector for the “next page” link, extract the first match, and check if it exists. The scrapy.Request is a new request object that Scrapy knows means it should fetch and parse next.

This means that once we go to the next page, we’ll look for a link to the next page there, and on that page we’ll look for a link to the next page, and so on, until we don’t find a link for the next page. This is the key piece of web scraping: finding and following links. In this example, it’s very linear; one page has a link to the next page until we’ve hit the last page, But you could follow links to tags, or other search results, or any other URL you’d like.

Now, if you save your code and run the spider again you’ll see that it doesn’t just stop once it iterates through the first page of sets. It keeps on going through all 100 quotes on all 10 pages. In the grand scheme of things it’s not a huge chunk of data, but now you know the process by which you automatically find new pages to scrape.

Here’s our completed code for this tutorial:

scraper.py
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spdier'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        ABOUT_SELECTOR = '.author + a::attr("href")'
        TAGS_SELECTOR = '.tags > .tag::text'
        NEXT_SELECTOR = '.next a::attr("href")'

        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
                'about': 'https://quotes.toscrape.com' + 
                        quote.css(ABOUT_SELECTOR).extract_first(),
                'tags': quote.css(TAGS_SELECTOR).extract(),
            }

        next_page = response.css(NEXT_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
            )

Conclusion

In this tutorial you built a fully-functional spider that extracts data from web pages in less than thirty lines of code. That’s a great start, but there’s a lot of fun things you can do with this spider. That should be enough to get you thinking and experimenting. If you need more information on Scrapy, check out Scrapy’s official docs. For more information on working with data from the web, see our tutorial on “How To Scrape Web Pages with Beautiful Soup and Python 3”.

Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.

Learn more about our products

About the authors
Default avatar
Justin Duke

author



Default avatar

Tech Writer at DigitalOcean


Still looking for an answer?

Ask a questionSearch for more help

Was this helpful?
 
10 Comments


This textbox defaults to using Markdown to format your answer.

You can type !ref in this text area to quickly search our full set of tutorials, documentation & marketplace offerings and insert the link!

for brickset in response.css(SET_SELECTOR):

what is brickset, in this case, referring to? when using this code on another project what would brickset be changed to? is it part of the URL? is it referring to part of the name of the spider?

Awesome informative lesson for a beginner like myself.

One thing that I would love to see in here in addition to the above is how to parse or scrape a ‘frame set’ in a webpage. Also to make it available when hitting a frame and otherwise perform as above!

One example could be here: I would like to fetch the company name and the company ID in the ‘href’ that are holding the ‘AGM’ travelling from the original site root.

Wow) That’s what scraping is inside) Never thought it’s such a complex system…Actually, I’ve only tried one scraping service which I personally found to be more than satisfying, it was http://www.tellprices.com/. This service enables you to get information about your competitors’ prices to get more sales by adjusting your prices

Hi Brian, excellent post! What’s modifications are necessary to save photos on a specific folder? In the case that each article has an image por example.

Thanks so much!

Hi, great guide! I am wondering… I just want to crawl my website so it create a cache files, can I use only the first part of the code and delete the second part that download the site?

Also is there anyway to detect mobile so it create mobile cache files?

Thanks!

Does Scrapy has any feature to show scraped data on webpage ? Mostly scrapy result are in csv, json file format.

I had some trouble installing Scrapy on a Windows 10 machine. I was getting error: Unable to find vcvarsall.bat

There were several suggestions on Stack Overflow that didn’t work. For me, the solution was to install Visual C++ 2015 Build Tools.

had to remove the ‘a’ from this line in order to extract the name :

NAME_SELECTOR = ‘h1 a ::text’

keeping this ‘a’ will extract the item number

For this question : Right now we’re only parsing results from 2016, as you might have guessed from the 2016 part of http://brickset.com/sets/year-2016 — how would you crawl results from other years?

Is it possible to search any urls in “YEAR” filter, and stock them in the start_urls list ?

i extracted all the url from webpage. is it possible to parse url (for depth crawling)??

for example on google page i got Gmail link. is it possible to parse gmail url

Try DigitalOcean for free

Click below to sign up and get $200 of credit to try our products over 60 days!

Sign up

Join the Tech Talk
Success! Thank you! Please check your email for further details.

Please complete your information!

Become a contributor for community

Get paid to write technical tutorials and select a tech-focused charity to receive a matching donation.

DigitalOcean Documentation

Full documentation for every DigitalOcean product.

Resources for startups and SMBs

The Wave has everything you need to know about building a business, from raising funding to marketing your product.

Get our newsletter

Stay up to date by signing up for DigitalOcean’s Infrastructure as a Newsletter.

New accounts only. By submitting your email you agree to our Privacy Policy

The developer cloud

Scale up as you grow — whether you're running one virtual machine or ten thousand.

Get started for free

Sign up and get $200 in credit for your first 60 days with DigitalOcean.*

*This promotional offer applies to new accounts only.