Hi! 👋 Please sign up to get notified when I release the source code for this book on GitHub. Till then, I would greatly appreciate if you can submit all the grammatical, prose and code related issues here.

I post about once every couple of months. No spam, I promise

# 2 Scraping Steam Using lxml¶

Hello everyone! In this chapter, I’ll teach the basics of web scraping using lxml and Python. I also recorded this chapter in a screencast so if it’s preferred to watch me do this step by step in a video please go ahead and watch it on YouTube.

The final product of this chapter will be a script which will provide you access to data on Steam in an easy to use format for your own applications. It will provide you data in a JSON format similar to Fig. 2.1.

First of all, why should you even bother learning how to web scrape? If your job doesn’t require you to learn it, then let me give you some motivation. What if you want to create a website that curates the cheapest products from Amazon, Walmart, and a couple of other online stores? A lot of these online stores don’t provide you with an easy way to access their information using an API. In the absence of an API, your only choice is to create a web scraper which can extract information from these websites automatically and provide you with that information in an easy to use way. You can see an example of a typical API response in JSON from Reddit in Fig. 2.2.

There are a lot of Python libraries out there that can help you with web scraping. There is lxml, BeautifulSoup, and a full-fledged framework called Scrapy. Most of the tutorials discuss BeautifulSoup and Scrapy, so I decided to go with what powers both libraries: the lxml library. I will teach the basics of XPaths and how you can use them to extract data from an HTML document. I will go through a couple of different examples so that readers can quickly get up-to-speed with lxml and XPaths.

If you are a gamer, chances are you already know about this website. We will be trying to extract data from Steam. More specifically, we will be extracting information from the “popular new releases” section.

## 2.1 Exploring Steam¶

First, open up the “popular new releases” page on Steam and scroll down until you see the Popular New Releases tab. At this point, I usually open up Chrome developer tools and see which HTML tags contain the required data. I extensively use the element inspector tool (The button in the top left of the developer tools). It allows the ability to see the HTML markup behind a specific element on the page with just one click.

As a high-level overview, everything on a web page is encapsulated in an HTML tag, and tags are usually nested. We need to figure out which tags we need to extract the data from and then we will be good to go. In our case, if we take a look at Fig. 2.4, we can see that every separate list item is encapsulated in an anchor (a) tag.

The anchor tags themselves are encapsulated in the div with an id of tab_newreleases_content. I am mentioning the id because there are two tabs on this page. The second tab is the standard “New Releases” tab, and we don’t want to extract information from that tab. Hence, we will first extract the “Popular New Releases” tab, and then we will extract the required information from within this tag.

## 2.2 Start writing a Python script¶

This is a perfect time to create a new Python file and start writing down our script. I am going to create a scrape.py file. Now let’s go ahead and import the required libraries. The first one is the requests library and the second one is the lxml.html library.

import requests
import lxml.html


If you don’t have requests or lxml installed, make sure you have a virtualenv ready:

$python -m venv env$ source env/bin/activate


Then you can easily install them using pip:

$pip install requests$ pip install lxml


The requests library is going to help us open the web page (URL) in Python. We could have used lxml to open the HTML page as well but it doesn’t work well with all web pages so to be on the safe side I am going to use requests. Now let’s open up the web page using requests and pass that response to lxml.html.fromstring method.

html = requests.get('https://store.steampowered.com/explore/new/')
doc = lxml.html.fromstring(html.content)


This provides us with an object of HtmlElement type. This object has the xpath method which we can use to query the HTML document. This provides us with a structured way to extract information from an HTML document.

Note

I will not explicitly ask you to create a virtual environment in each chapter. Make sure you create one for each project before writing any Python code.

## 2.3 Fire up the Python Interpreter¶

Now save this file as scrape.py and open up a terminal. Copy the code from the scrape.py file and paste it in a Python interpreter session.

We are doing this so that we can quickly test our XPaths without continuously editing, saving, and executing our scrape.py file. Let’s try writing an XPath for extracting the div which contains the ‘Popular New Releases’ tab. I will explain the code as we go along:

new_releases = doc.xpath('//div[@id="tab_newreleases_content"]')[0]


This statement will return a list of all the divs in the HTML page which have an id of tab_newreleases_content. Now because we know that only one div on the page has this id we can take out the first element from the list ([0]) and that would be our required div. Let’s break down the xpath and try to understand it:

• // these double forward slashes tell lxml that we want to search for all tags in the HTML document which match our requirements/filters. Another option was to use / (a single forward slash). The single forward slash returns only the immediate child tags/nodes which match our requirements/filters

• div tells lxml that we are searching for div tags in the HTML page

• [@id="tab_newreleases_content"] tells lxml that we are only interested in those divs which have an id of tab_newreleases_content

Cool! We have got the required div. Now let’s go back to chrome and check which tag contains the titles of the releases.

## 2.4 Extract the titles & prices¶

The title is contained in a div with a class of tab_item_name. Now that we have the “Popular New Releases” tab extracted we can run further XPath queries on that tab. Write down the following code in the same Python console which we previously ran our code in:

titles = new_releases.xpath('.//div[@class="tab_item_name"]/text()')


This gives us the titles of all of the games in the “Popular New Releases” tab. You can see the expected output in Fig. 2.7.

Let’s break down this XPath a little bit because it is a bit different from the last one.

• . tells lxml that we are only interested in the tags which are the children of the new_releases tag

• [@class="tab_item_name"] is pretty similar to how we were filtering divs based on id. The only difference is that here we are filtering based on the class name

• /text() tells lxml that we want the text contained within the tag we just extracted. In this case, it returns the title contained in the div with the tab_item_name class name

Now we need to extract the prices for the games. We can easily do that by running the following code:

prices = new_releases.xpath('.//div[@class="discount_final_price"]/text()')


I don’t think I need to explain this code as it is pretty similar to the title extraction code. The only change we made is the change in the class name.

## 2.5 Extracting tags¶

Now we need to extract the tags associated with the titles. You can see the markup in Fig. 2.9.

Write down the following code in the Python terminal to extract the tags:

tags_divs = new_releases.xpath('.//div[@class="tab_item_top_tags"]')
tags = []
for div in tags_divs:
tags.append(div.text_content())


What we are doing here is extracting the divs containing the tags for the games. Then we loop over the list of extracted tags and then extract the text from those tags using the text_content() method. text_content() returns the text contained within an HTML tag without the HTML markup.

Note

We could have also made use of a list comprehension to make that code shorter. I wrote it down in this way so that even those who don’t know about list comprehensions can understand the code. Either way, this is the alternate code:

tags = [tag.text_content() for tag in
new_releases.xpath('.//div[@class="tab_item_top_tags"]')]


Let’s separate the tags in a list as well so that each tag is a separate element:

tags = [tag.split(', ') for tag in tags]


## 2.6 Extracting the platforms¶

Now the only thing remaining is to extract the platforms associated with each title.

The major difference here is that the platforms are not contained as texts within a specific tag. They are listed as the class name. Some titles only have one platform associated with them like this:

<span class="platform_img win"></span>


While some titles have 5 platforms associated with them like this:

 1 2 3 4 5 6  

As we can see, these spans contain the platform type as the class name. The only common thing between these spans is that all of them contain the platform_img class. First of all, we will extract the divs with the tab_item_details class, then we will extract the spans containing the platform_img class and finally we will extract the second class name from those spans. Here is the code:

 1 2 3 4 5 6 7 8 9 platforms_div = new_releases.xpath('.//div[@class="tab_item_details"]') total_platforms = [] for game in platforms_div: temp = game.xpath('.//span[contains(@class, "platform_img")]') platforms = [t.get('class').split(' ')[-1] for t in temp] if 'hmd_separator' in platforms: platforms.remove('hmd_separator') total_platforms.append(platforms) 

In line 1 we start with extracting the tab_item_details div.

The XPath in line 5 is a bit different. Here we have [contains(@class, "platform_img")] instead of simply having [@class="platform_img"]. The reason is that [@class="platform_img"] returns those spans which only have the platform_img class associated with them. If the spans have an additional class, they won’t be returned. Whereas [contains(@class, "platform_img")] filters all the spans which have the platform_img class. It doesn’t matter whether it is the only class or if there are more classes associated with that tag.

In line 6 we are making use of a list comprehension to reduce the code size. The .get() method allows us to extract an attribute of a tag. Here we are using it to extract the class attribute of a span. We get a string back from the .get() method. In the case of the first game, the string being returned is platform_img win so we split that string based on the comma and the whitespace, and then we store the last part (which is the actual platform name) of the split string in the list.

In lines 7-8 we are removing the hmd_separator from the list if it exists. This is because hmd_separator is not a platform. It is just a vertical separator bar used to separate actual platforms from VR/AR hardware.

## 2.7 Putting everything together¶

Now we just need this to return a JSON response so that we can easily turn this into a web-based API or use it in some other project. Here is the code for that:

 1 2 3 4 5 6 7 8 output = [] for info in zip(titles,prices, tags, total_platforms): resp = {} resp['title'] = info[0] resp['price'] = info[1] resp['tags'] = info[2] resp['platforms'] = info[3] output.append(resp) 

This code is self-explanatory. We are using the zip function to iterate over all of those lists in parallel. Then we create a dictionary for each game and assign the title, price, tags, and platforms as a separate key in that dictionary. Lastly, we append that dictionary to the output list.

The final code for this project is listed below:

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 import requests import lxml.html html = requests.get('https://store.steampowered.com/explore/new/') doc = lxml.html.fromstring(html.content) new_releases = doc.xpath('//div[@id="tab_newreleases_content"]')[0] titles = new_releases.xpath('.//div[@class="tab_item_name"]/text()') prices = new_releases.xpath('.//div[@class="discount_final_price"]/text()') tags = [] for tag in new_releases.xpath('.//div[@class="tab_item_top_tags"]'): tags.append(tag.text_content()) tags = [tag.split(', ') for tag in tags] platforms_div = new_releases.xpath('.//div[@class="tab_item_details"]') total_platforms = [] for game in platforms_div: temp = game.xpath('.//span[contains(@class, "platform_img")]') platforms = [t.get('class').split(' ')[-1] for t in temp] if 'hmd_separator' in platforms: platforms.remove('hmd_separator') total_platforms.append(platforms) output = [] for info in zip(titles,prices, tags, total_platforms): resp = {} resp['title'] = info[0] resp['price'] = info[1] resp['tags'] = info[2] resp['platforms'] = info[3] output.append(resp) print(output) 

Note

This is just a demonstration of how web scraping works. Make sure you do not infringe on the copyright of any service by doing web scraping. I can’t be held responsible for any irresponsible action you take.

## 2.8 Troubleshoot¶

The main issues I can think of are:

• Steam not returning a proper response

• lxml` not parsing the data correctly

For the first issue, the causes might be that you are making a lot of requests in a short amount of time. This causes Steam to think that you are a bot and it refuses to return a proper response. You can overcome this issue by opening Steam in your browser and solving any captcha it might show you. If that doesn’t work you can use a proxy to access Steam.

For the second issue, you can solve it by opening up Steam in the browser and checking the HTML code closely and making sure you are trying to extract data from the correct HTML tags.

## 2.9 Next Steps¶

You can now use this to create all sorts of services. How about a service that monitors Steam for promotions on specific games and sends you an email once it sees an amazing promotion? Just like always, options are endless. Make sure you send me an email about whatever service you end up making!

I hope you learned something new in this chapter. In a different chapter, we will learn how to take something like this and turn it into a web-based API using the Flask framework. See you in the next chapter!