Web scraping is the process of extracting data from websites. It can be used if the desired data are not readily available via e.g. a download link or an API.
Consider the website https://www.trustpilot.com/, which is a platform for customer reviews.
Each review consists of different parts such as the user name, the star rating, the date, the title, and the review text.
Let's say we are interested in whether the ratings of a shop change over time. For this we would need to gather the date and the star rating, which are, however, not downloadable.
We will learn how we can access these data nonetheless to perform the analysis.
Web scraping is generally legal. However, depending on the jurisdiction it can be considered illegal in some cases.
You should be careful if...
Important: avoid trouble by limiting the number of requests to a reasonable rate (e.g. one request every 10 seconds). Otherwise your scraping could be considered a denial-of-service attack.
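A minimal sketch of such throttling in R (the `fetch` argument and the URLs are placeholders; the default of one request every 10 seconds matches the rule of thumb above):

```r
# Sketch: fetch a vector of URLs with a pause between consecutive
# requests. `fetch` defaults to xml2::read_html; pass another function
# (or a stub) to test the throttling logic without network access.
scrape_politely <- function(urls, fetch = xml2::read_html, delay = 10) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay)  # wait between requests, not before the first
    results[[i]] <- fetch(urls[i])
  }
  results
}
```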
To extract data from a website it is necessary to understand the basics of how a website is built.
<!DOCTYPE html>
<html>
<head>
  <title>Page Title</title>
</head>
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph.</p>
  <div>
    <h1>This is another Heading</h1>
    <p>This is another paragraph.</p>
  </div>
</body>
</html>
<html>...</html> tags are the container for all other HTML elements.
<head>...</head> tags contain metadata which are not directly visible on the web page.
<body>...</body> contains everything we can see, such as text, links, images, tables, lists, etc. This is the most relevant part for web scraping.

<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph.</p>
  <div>
    <h1>This is another Heading</h1>
    <p>This is another paragraph.</p>
  </div>
</body>
In the body part, tags are used to give the displayed information a structure. In our example we use:
- <h1>...</h1> to define a heading
- <p>...</p> to define text
- <div>...</div> to define different sections

Look at the w3schools tag list for other tags you might encounter.
We will start by scraping the simple HTML page from before. Create a new HTML document and copy the code from the last slide into it.
read_html()

The package rvest makes use of the structure of an HTML document to extract the relevant information.
The first step is to load the website into R using xml2::read_html() (rvest depends on xml2, so xml2 is loaded automatically when rvest is loaded).
library(rvest)
URL <- here::here("SoSe_2022/webscraping/examples/simple_html_page.html") # path of my html file
(page <- read_html(URL))
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n \n <h1>This is a Heading</h1>\n <p>This is a paragraph.</p>\n ...

Our page is now stored as an XML document, which has a hierarchical data structure. For our page it looks like this.


html_nodes()

We can navigate through the XML object using rvest::html_nodes().

Get all p nodes:

page %>% rvest::html_nodes("p")

## {xml_nodeset (2)}
## [1] <p>This is a paragraph.</p>
## [2] <p>This is another paragraph.</p>

Get only p nodes which are children of div nodes:

page %>% rvest::html_nodes("div") %>% rvest::html_nodes("p")

## {xml_nodeset (1)}
## [1] <p>This is another paragraph.</p>

html_text()

Once we have the nodes which contain the data we want to scrape, we can use rvest::html_text() to extract the data (i.e. the text between the tags) as a normal character vector.
page %>% html_nodes("div") %>% html_nodes("p") %>% html_text()
## [1] "This is another paragraph."

Websites are not only built with HTML. CSS is the language used to style a website. Let's add a bit more to our simple page:

- link a css file (which doesn't exist yet) in the head section of your html file: <link rel="stylesheet" href="stylesheet.css">
- add another div with a heading and a paragraph
- create a new file stylesheet.css in the same folder as your HTML file
- copy the following code into that file and see what happens
div h1 {
  color: green;
  text-align: center;
}
What if we want different div sections to look differently?
For this, classes and ids can be specified. For web scraping we usually only need to know about classes, but ids work quite similarly.
- change one div section to <div class = "blue"> ... </div>
- and the other one to <div class = "red"> ... </div>
Having added a class attribute to our div sections, we can now use this class in the CSS file as follows
.blue h1 {
  color: blue;
  text-align: center;
}

.red h1 {
  color: green;
  text-align: center;
}
Everything before {} is called a CSS selector. .blue h1 reads as: select all h1 headings inside an element with class blue. Note the . before the class name.

For more on selectors look at the w3schools CSS Selector Reference.
Web developers use selectors to style similar content in the same way (e.g. on trustpilot.com each user review looks the same). We can use these selectors to scrape the content we desire more specifically.
Let's say we want to select all paragraphs (p) which are descendants of an element with class red. We can achieve this with:
url <- here::here("SoSe_2022/webscraping/examples/simple_html_page_with_css.html")
page <- url %>% read_html()
page %>% html_nodes(".red") %>% html_nodes("p") %>% html_text()
## [1] "But this isn't red."

or in short
page %>% html_nodes(".red p") %>% html_text()
## [1] "But this isn't red."

Sometimes we are not interested in the text between element tags; instead, the relevant information is hidden in the tag attributes.

Attributes are everything that is defined in the opening tag, e.g. in

<div class = "blue"> ... </div>

class = "blue" is an attribute. With html_attrs() and html_attr() we can extract this information.
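A typical use case is extracting link targets from <a> tags. A self-contained sketch (rvest::minimal_html() builds a small document from a string, so no file is needed; the URL is a placeholder):

```r
library(rvest)

# Build a tiny document inline and pull the href attribute of the link
doc <- minimal_html('<a href="https://example.com">a link</a>')
doc %>% html_nodes("a") %>% html_attr("href")
# "https://example.com"
```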
# Get all attributes
page %>% html_nodes(".red") %>% html_attrs()

## [[1]]
## class 
## "red"

# Get a specific attribute
page %>% html_nodes(".red") %>% html_attr("class")
## [1] "red"

html_table()

Since data are often stored in tables, rvest provides the function html_table(), which parses an HTML table into a data frame. Tables in HTML look like this:
<table style="width:100%">
  <tr>
    <th>Firstname</th>
    <th>Lastname</th>
    <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td>
    <td>Smith</td>
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td>
    <td>94</td>
  </tr>
</table>
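A table like this can be parsed directly into a data frame. A self-contained sketch using rvest::minimal_html() so the snippet runs without a separate file:

```r
library(rvest)

# Parse the example table from a string instead of a file
html <- minimal_html('
  <table style="width:100%">
    <tr><th>Firstname</th><th>Lastname</th><th>Age</th></tr>
    <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
    <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
  </table>')

# html_table() uses the <th> cells as column names and converts types
html %>% html_node("table") %>% html_table()
```

The result is a 2-row data frame with columns Firstname, Lastname, and Age.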
How do we find the CSS selectors?
Look into the source code (depending on your web browser, you can right-click on the webpage and click e.g. "view source" or "inspect").
Digging through the source code is often not necessary. If you use Chrome, install the SelectorGadget extension.
We want to write a function that for each review extracts the user name, the star rating, the date, the title, and the review text.
The final function to extract data from one URL might look like this:
get_reviews <- function(url){
  page <- read_html(url)
  tibble(
    name   = get_name(page),
    rating = get_rating(page),
    date   = get_date(page),
    title  = get_title(page),
    text   = get_text(page)
  )
}
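The helper functions (get_name(), get_rating(), etc.) are not shown; each selects one part of a review via its CSS class. A hypothetical sketch of one of them (the selector ".review-date" is a made-up placeholder, not Trustpilot's real class — find the real one with SelectorGadget):

```r
library(rvest)

# Hypothetical helper: extract the review dates from a page.
# ".review-date" is a placeholder selector, not the real class name.
get_date <- function(page) {
  page %>%
    html_nodes(".review-date") %>%
    html_text()
}
```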
trustpilot.com

Not only do we want to scrape all reviews from one URL, but all reviews for a company, which are distributed over many URLs. Can you write a function which can do this?
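One possible sketch, assuming the reviews are paginated via a ?page= query parameter (check the real URL scheme in your browser) and reusing the get_reviews() function from above:

```r
# Hypothetical: scrape all review pages of one company.
# Assumes URLs like "<base_url>?page=1", "<base_url>?page=2", ...
# and a get_reviews(url) function returning one data frame per page.
get_all_reviews <- function(base_url, n_pages, delay = 10) {
  urls <- paste0(base_url, "?page=", seq_len(n_pages))
  pages <- lapply(urls, function(u) {
    Sys.sleep(delay)  # rate limiting, as discussed earlier
    get_reviews(u)
  })
  do.call(rbind, pages)  # stack the per-page data frames
}
```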