Web scraping, often called web crawling or web spidering, is the act of programmatically going over a collection of web pages and extracting data, and is a powerful tool for working with data on the web.
With a web scraper, you can mine data about a set of products, get a large corpus of text or quantitative data to play around with, retrieve data from a site without an official API, or just satisfy your own personal curiosity.
In this tutorial, you’ll learn about the fundamentals of the scraping and spidering process as you explore a playful data set. We’ll use Quotes to Scrape, a database of quotations hosted on a site designed for testing out web spiders. By the end of this tutorial, you’ll have a fully functional Python web scraper that walks through a series of pages containing quotes and displays them on your screen.
The scraper will be easily expandable so you can tinker around with it and use it as a foundation for your own projects scraping data from the web.
To complete this tutorial, you’ll need a local development environment for Python 3. You can follow How To Install and Set Up a Local Programming Environment for Python 3 to configure everything you need.
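Scraping is a two-step process: systematically finding and downloading web pages, then extracting information from the downloaded pages.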
Both of those steps can be implemented in a number of ways in many languages.
You can build a scraper from scratch using modules or libraries provided by your programming language, but then you have to deal with some potential headaches as your scraper grows more complex. For example, you’ll need to handle concurrency so you can crawl more than one page at a time. You’ll probably want to figure out how to transform your scraped data into different formats like CSV, XML, or JSON. And you’ll sometimes have to deal with sites that require specific settings and access patterns.
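To make the format-conversion chore concrete, here’s a minimal sketch using only Python’s standard library. The items list and the to_json and to_csv helpers are hypothetical names made up for this illustration; they are not part of any scraping library.

```python
import csv
import io
import json

# Hypothetical scraped records; a real scraper would produce these.
items = [
    {"text": "Example quote one.", "author": "Author A"},
    {"text": "Example quote two.", "author": "Author B"},
]


def to_json(records):
    """Serialize the records as a JSON array string."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["text", "author"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


print(to_json(items))
print(to_csv(items))
```

Even this small example has to decide on field order, encoding, and line endings, which is the kind of detail a scraping library can take off your hands.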
You’ll have better luck if you build your scraper on top of an existing library that handles those issues for you. For this tutorial, we’re going to use Python and Scrapy to build our scraper.
Scrapy is one of the most popular and powerful Python scraping libraries; it takes a “batteries included” approach to scraping, meaning that it handles a lot of the common functionality that all scrapers need so developers don’t have to reinvent the wheel each time.
Scrapy, like most Python packages, is on PyPI and can be installed with pip. PyPI, the Python Package Index, is a community-owned repository of all published Python software.
If you have a Python installation like the one outlined in the prerequisite for this tutorial, you already have pip installed on your machine, so you can install Scrapy with the following command:
- pip install scrapy
If you run into any issues with the installation, or you want to install Scrapy without using pip, check out the official installation docs.
With Scrapy installed, create a new folder for our project. You can do this in the terminal by running:
- mkdir quote-scraper
Now, navigate into the new directory you just created:
- cd quote-scraper
Then create a new Python file for our scraper called scraper.py. We’ll place all of our code in this file for this tutorial. You can create this file using the editing software of your choice.
Start out the project by making a very basic scraper that uses Scrapy as its foundation. To do that, you’ll need to create a Python class that subclasses scrapy.Spider, a basic spider class provided by Scrapy. This class will have two required attributes:
name: just a name for the spider.
start_urls: a list of URLs that you start to crawl from. We’ll start with one URL.
Open the scraper.py file in your text editor and add this code to create the basic spider:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spider'
    start_urls = ['https://quotes.toscrape.com']
Let’s break this down line by line:
First, we import scrapy so that we can use the classes that the package provides.
Next, we take the Spider class provided by Scrapy and make a subclass out of it called QuoteSpider. Think of a subclass as a more specialized form of its parent class. The Spider class has methods and behaviors that define how to follow URLs and extract data from the pages it finds, but it doesn’t know where to look or what data to look for. By subclassing it, we can give it that information.
Finally, we name the spider quote-spider and give our scraper a single URL to start from: https://quotes.toscrape.com. If you open that URL in your browser, it will take you to the first of many pages of famous quotations.
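If subclassing is new to you, the following sketch shows the idea in plain Python. MiniSpider and MiniQuoteSpider are hypothetical classes invented for this illustration; they only mimic the shape of the relationship and are not how Scrapy implements its Spider class.

```python
class MiniSpider:
    """Hypothetical stand-in for a generic spider base class."""

    name = None
    start_urls = []

    def describe(self):
        # The generic behavior lives in the parent class and relies on
        # attributes that each subclass fills in.
        return f"{self.name} starts from {len(self.start_urls)} URL(s)"


class MiniQuoteSpider(MiniSpider):
    # The subclass only supplies the specifics: what it is called
    # and where it should start looking.
    name = "quote-spider"
    start_urls = ["https://quotes.toscrape.com"]


print(MiniQuoteSpider().describe())
```

The parent class never mentions quotes; everything specific to our project sits in the subclass, which is the same division of labor we rely on with scrapy.Spider.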
Now, test out the scraper. Typically, Python files are run with a command like python path/to/file.py. However, Scrapy comes with its own command line interface to streamline the process of starting a scraper. Start your scraper with the following command:
- scrapy runspider scraper.py
The command will output something like this:
Output
2022-12-02 10:30:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-12-02 10:30:08 [scrapy.extensions.telnet] INFO: Telnet Password: b4d94e3a8d22ede1
2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
...
'scrapy.extensions.logstats.LogStats']
2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
...
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
...
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-02 10:30:08 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-02 10:30:08 [scrapy.core.engine] INFO: Spider opened
2022-12-02 10:30:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-02 10:30:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-02 10:49:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com> (referer: None)
2022-12-02 10:30:08 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-02 10:30:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 226,
...
'start_time': datetime.datetime(2022, 12, 2, 18, 30, 8, 492403)}
2022-12-02 10:30:08 [scrapy.core.engine] INFO: Spider closed (finished)
That’s a lot of output, so let’s break it down.
The scraper initialized and loaded the additional components and extensions it needed to handle reading data from URLs.
It used the URL we provided in the start_urls list and grabbed the HTML, just like your web browser would do.
It passed that HTML to the parse method. Since we never wrote our own parse method, the spider just finishes without doing any work.
Now let’s pull some data from the page.
We’ve created a very basic program that pulls down a page, but it doesn’t do any scraping or spidering yet. Let’s give it some data to extract.
If you look at the page we want to scrape, you’ll see that every quote is laid out in the same way. When writing a scraper, you will need to look at the HTML source of the page and familiarize yourself with that structure. Here it is, with the tags that aren’t relevant to our goal removed for readability:
quotes.toscrape.com
<body>
...
    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
        <span>by <small class="author" itemprop="author">Thomas A. Edison</small>
        <a href="/author/Thomas-A-Edison">(about)</a>
        </span>
        <div class="tags">
            Tags:
            <meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased" />
            <a class="tag" href="/tag/edison/page/1/">edison</a>
            <a class="tag" href="/tag/failure/page/1/">failure</a>
            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>
            <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>
        </div>
    </div>
...
</body>
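Scraping this page is a two-step process: first, grab each quote by finding the parts of the page that contain the data we want; then, for each quote, pull the data we want out of its HTML tags.
Scrapy grabs data based on selectors that you provide. Selectors are patterns we can use to find one or more elements on a page so we can then work with the data within the element. Scrapy supports either CSS selectors or XPath selectors.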
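To give a feel for what an XPath-style selector looks like, here’s a small sketch using Python’s standard xml.etree.ElementTree module, which supports a limited subset of XPath. This is only an illustration and not how Scrapy evaluates selectors; the SNIPPET markup is a simplified, well-formed fragment written for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed fragment modeled on the page's markup.
SNIPPET = """
<body>
  <div class="quote">
    <span class="text">First quote</span>
    <small class="author">Author A</small>
  </div>
  <div class="quote">
    <span class="text">Second quote</span>
    <small class="author">Author B</small>
  </div>
</body>
"""

root = ET.fromstring(SNIPPET)

# XPath-style counterpart of the CSS selector ".quote".
# Note: this predicate only matches when the class attribute is exactly "quote".
quotes = root.findall(".//div[@class='quote']")

pairs = [
    (
        q.find("./span[@class='text']").text,
        q.find("./small[@class='author']").text,
    )
    for q in quotes
]
print(pairs)
```

The pattern is the same with either selector language: select the repeating container first, then query inside each container for the individual fields.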
We’ll use CSS selectors for now since CSS is a perfect fit for finding all the quotes on the page. If you look at the HTML, you’ll see that each quote is specified with the class quote. Since we’re looking for a class, we’d use .quote for our CSS selector. The . part of the selector searches the class attribute on elements. All we have to do is create a new method in our class named parse and pass that selector to the css method of the response object, like this:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spider'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'

        for quote in response.css(QUOTE_SELECTOR):
            pass
This code grabs all the quotes on the page and loops over them, although the loop body doesn’t do anything yet. Now let’s extract the data from those quotes so we can display it.
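If you’re curious what matching on a class involves, here’s a rough sketch built on the standard library’s html.parser module. ClassCounter and sample_html are hypothetical names for this illustration, and this is not how Scrapy implements CSS selectors; it only shows that a class selector looks for a name within an element’s class attribute.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute is a space-separated list of class names.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


sample_html = (
    '<div class="quote"><span class="text">A</span></div>'
    '<div class="quote featured"><span class="text">B</span></div>'
    '<div class="tags">Tags:</div>'
)

counter = ClassCounter("quote")
counter.feed(sample_html)
print(counter.count)
```

The second div carries two classes and still counts as a match, which mirrors how a CSS class selector treats elements with several classes.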
Another look at the source of the page we’re parsing tells us that the text of each quote is stored within a span with the text class and the author of the quote in a <small> tag with the author class:
quotes.toscrape.com
...
<span class="text" itemprop="text">“I have not failed. I've just found 10,000 ways that won't work.”</span>
<span>by <small class="author" itemprop="author">Thomas A. Edison</small>
...
The quote object we’re looping over has its own css method, so we can pass in a selector to locate child elements. Modify your code as follows to locate the text and author of each quote and display them:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spider'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'

        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
            }
Note: The trailing comma after extract_first() isn’t a typo. In Python, a trailing comma in dict objects is valid syntax, and a good way to leave room for adding more items, which we will do later.
You’ll notice two things going on in this code:
We append ::text to our selectors for the quote and author. That’s a CSS pseudo-selector that fetches the text inside of the tag rather than the tag itself.
We call extract_first() on the object returned by quote.css(TEXT_SELECTOR) because we just want the first element that matches the selector. This gives us a string, rather than a list of elements.
Save the file and run the scraper again:
- scrapy runspider scraper.py
This time the output will contain the quotes and their authors:
Output
...
2022-12-02 11:00:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein'}
2022-12-02 11:00:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'}
2022-12-02 11:00:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe'}
...
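Notice that the scraped text still carries the curly quotation marks from the page. If you’d rather store the bare sentence, a small helper along these lines could clean it up. The clean_quote name is a hypothetical choice for this sketch, and whether to strip the marks at all depends on your project.

```python
def clean_quote(text):
    """Remove surrounding whitespace and curly quotation marks."""
    # \u201c and \u201d are the left and right curly double quotes.
    return text.strip().strip("\u201c\u201d")


raw = "\u201cExample quotation.\u201d"
print(clean_quote(raw))
```

You could call a helper like this on the value returned by extract_first() before yielding the item.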
Let’s keep expanding on this by adding new selectors for links to pages about the author and tags for the quote. By investigating the HTML for each quote, we find:
The link to the author’s about page is stored in an a tag immediately following the author’s name.
The tags are stored as a collection of a tags, each classed tag, stored within a div element with the tags class.
So, let’s modify the scraper to get this new information:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spider'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        ABOUT_SELECTOR = '.author + a::attr("href")'
        TAGS_SELECTOR = '.tags > .tag::text'

        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
                'about': 'https://quotes.toscrape.com' +
                         quote.css(ABOUT_SELECTOR).extract_first(),
                'tags': quote.css(TAGS_SELECTOR).extract(),
            }
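The string concatenation used for the about link works because every href on this site starts at the site root. As a hedged alternative, the standard library’s urllib.parse.urljoin resolves a link against a base URL and also copes with links that are already absolute. BASE_URL and the absolute helper are names made up for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://quotes.toscrape.com"


def absolute(href, base=BASE_URL):
    """Resolve a possibly relative href against a base URL."""
    return urljoin(base, href)


print(absolute("/author/Albert-Einstein"))
print(absolute("/page/3/", base="https://quotes.toscrape.com/page/2/"))
```

Resolving links this way avoids doubled or missing slashes when the base URL or the href changes shape.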
Save your changes and run the scraper again:
- scrapy runspider scraper.py
Now the output will contain the new data:
Output
2022-12-02 11:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'about': 'https://quotes.toscrape.com/author/Albert-Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
2022-12-02 11:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'about': 'https://quotes.toscrape.com/author/Jane-Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
2022-12-02 11:14:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'about': 'https://quotes.toscrape.com/author/Marilyn-Monroe', 'tags': ['be-yourself', 'inspirational']}
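Once each item carries a list of tags, you can start asking simple questions of the data. As a small sketch, here’s one way to count how often each tag appears, using the three items from the output above (trimmed to the author and tags fields). The items and tag_counts names are just for this illustration.

```python
from collections import Counter

# The three items from the sample output, trimmed to the fields we need.
items = [
    {
        "author": "Albert Einstein",
        "tags": ["inspirational", "life", "live", "miracle", "miracles"],
    },
    {
        "author": "Jane Austen",
        "tags": ["aliteracy", "books", "classic", "humor"],
    },
    {
        "author": "Marilyn Monroe",
        "tags": ["be-yourself", "inspirational"],
    },
]

# Flatten every item's tag list into one stream and count occurrences.
tag_counts = Counter(tag for item in items for tag in item["tags"])
print(tag_counts.most_common(1))
```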
Now let’s turn this scraper into a spider that follows links.
You’ve successfully extracted data from that initial page, but we’re not progressing past it to see the rest of the results. The whole point of a spider is to detect and traverse links to other pages and grab data from those pages too.
You’ll notice that each page has a Next link with a little right arrow (→) that leads to the next page of results. Here’s the HTML for that:
quotes.toscrape.com
...
<nav>
    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
        </li>
    </ul>
</nav>
...
In the source, you will find an li tag with the class of next, and inside that tag, there’s an a tag with a link to the next page. All we have to do is tell the scraper to follow that link if it exists.
Modify your code as follows:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spider'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        ABOUT_SELECTOR = '.author + a::attr("href")'
        TAGS_SELECTOR = '.tags > .tag::text'
        NEXT_SELECTOR = '.next a::attr("href")'

        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
                'about': 'https://quotes.toscrape.com' +
                         quote.css(ABOUT_SELECTOR).extract_first(),
                'tags': quote.css(TAGS_SELECTOR).extract(),
            }

        next_page = response.css(NEXT_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
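First, we define a selector for the “next page” link, extract the first match, and check if it exists. If it does, we yield a scrapy.Request, a new request object that tells Scrapy which page to fetch and parse next. Since we don’t specify a callback, Scrapy passes the response for that page to the same parse method.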
This means that once we go to the next page, we’ll look for a link to the next page there, and on that page we’ll look for a link to the next page, and so on, until we don’t find a link for the next page. This is the key piece of web scraping: finding and following links. In this example, it’s very linear; one page has a link to the next page until we’ve hit the last page. But you could follow links to tags, or other search results, or any other URL you’d like.
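The follow-the-next-link loop can be sketched without any network access. In the example below, PAGES is a hypothetical in-memory stand-in for a paginated site and crawl is a made-up helper; Scrapy schedules requests quite differently, so treat this purely as an illustration of the control flow.

```python
# Hypothetical in-memory "site": each page lists its quotes and an
# optional link to the next page.
PAGES = {
    "/page/1/": {"quotes": ["q1", "q2"], "next": "/page/2/"},
    "/page/2/": {"quotes": ["q3"], "next": "/page/3/"},
    "/page/3/": {"quotes": ["q4", "q5"], "next": None},
}


def crawl(start):
    """Yield every quote, following next links until none is left."""
    url = start
    seen = set()
    # Stop when there is no next link; the seen set guards against
    # accidental loops between pages in this sketch.
    while url and url not in seen:
        seen.add(url)
        page = PAGES[url]
        yield from page["quotes"]
        url = page["next"]


print(list(crawl("/page/1/")))
```

The last page has no next link, so the loop ends on its own, which is the same stopping condition our spider relies on.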
Now, if you save your code and run the spider again, you’ll see that it doesn’t just stop once it iterates through the first page of quotes. It keeps on going through all 100 quotes on all 10 pages. In the grand scheme of things it’s not a huge chunk of data, but now you know the process by which you automatically find new pages to scrape.
Here’s our completed code for this tutorial:
import scrapy


class QuoteSpider(scrapy.Spider):
    name = 'quote-spider'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        # CSS selectors for each quote and the fields inside it.
        QUOTE_SELECTOR = '.quote'
        TEXT_SELECTOR = '.text::text'
        AUTHOR_SELECTOR = '.author::text'
        ABOUT_SELECTOR = '.author + a::attr("href")'
        TAGS_SELECTOR = '.tags > .tag::text'
        NEXT_SELECTOR = '.next a::attr("href")'

        # Yield one item per quote on the current page.
        for quote in response.css(QUOTE_SELECTOR):
            yield {
                'text': quote.css(TEXT_SELECTOR).extract_first(),
                'author': quote.css(AUTHOR_SELECTOR).extract_first(),
                'about': 'https://quotes.toscrape.com' +
                         quote.css(ABOUT_SELECTOR).extract_first(),
                'tags': quote.css(TAGS_SELECTOR).extract(),
            }

        # Follow the "next page" link if there is one.
        next_page = response.css(NEXT_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
            )
In this tutorial, you built a fully functional spider that extracts data from web pages in fewer than thirty lines of code. That’s a great start, and there are a lot of fun things you can do with this spider; it should be enough to get you thinking and experimenting. If you need more information on Scrapy, check out Scrapy’s official docs. For more information on working with data from the web, see our tutorial on “How To Scrape Web Pages with Beautiful Soup and Python 3”.
what is brickset, in this case, referring to? when using this code on another project what would brickset be changed to? is it part of the URL? is it referring to part of the name of the spider?
Awesome informative lesson for a beginner like myself.
One thing that I would love to see in here in addition to the above is how to parse or scrape a ‘frame set’ in a webpage. Also to make it available when hitting a frame and otherwise perform as above!
One example could be here: I would like to fetch the company name and the company ID in the ‘href’ that are holding the ‘AGM’ travelling from the original site root.
Wow) That’s what scraping is inside) Never thought it’s such a complex system…Actually, I’ve only tried one scraping service which I personally found to be more than satisfying, it was http://www.tellprices.com/. This service enables you to get information about your competitors’ prices to get more sales by adjusting your prices
Hi Brian, excellent post! What’s modifications are necessary to save photos on a specific folder? In the case that each article has an image por example.
Thanks so much!
Hi, great guide! I am wondering… I just want to crawl my website so it create a cache files, can I use only the first part of the code and delete the second part that download the site?
Also is there anyway to detect mobile so it create mobile cache files?
Thanks!
Does Scrapy has any feature to show scraped data on webpage ? Mostly scrapy result are in csv, json file format.
I had some trouble installing Scrapy on a Windows 10 machine. I was getting
error: Unable to find vcvarsall.bat
There were several suggestions on Stack Overflow that didn’t work. For me, the solution was to install Visual C++ 2015 Build Tools.
had to remove the ‘a’ from this line in order to extract the name :
NAME_SELECTOR = ‘h1 a ::text’
keeping this ‘a’ will extract the item number
For this question : Right now we’re only parsing results from 2016, as you might have guessed from the 2016 part of http://brickset.com/sets/year-2016 — how would you crawl results from other years?
Is it possible to search any urls in “YEAR” filter, and stock them in the start_urls list ?
i extracted all the url from webpage. is it possible to parse url (for depth crawling)??
for example on google page i got Gmail link. is it possible to parse gmail url