2 Scraping Steam Using lxml

Hello everyone! In this chapter, I’ll teach you the basics of web scraping using lxml and Python. I also recorded this chapter as a screencast, so if you would rather watch me go through the steps in a video, you can find it on YouTube.

The final product of this chapter will be a script that gives you access to data on Steam in an easy-to-use format for your own applications. The script will output JSON similar to Fig. 2.1.


Fig. 2.1 JSON output

First of all, why should you even bother learning web scraping? If your job doesn’t require it, then let me give you some motivation. What if you want to create a website that curates the cheapest products from Amazon, Walmart, and a couple of other online stores? A lot of these online stores don’t provide you with an easy way to access their information using an API. In the absence of an API, your only choice is to create a web scraper that can extract information from these websites automatically and provide it to you in an easy-to-use way. You can see an example of a typical API response in JSON from Reddit in Fig. 2.2.

Fig. 2.2 JSON response from Reddit

There are a lot of Python libraries out there that can help you with web scraping. There is lxml, there is BeautifulSoup, and there is a full-fledged framework called Scrapy. Most tutorials discuss BeautifulSoup and Scrapy, so I decided to go with what powers both of them under the hood: the lxml library. I will teach you the basics of XPath and how you can use it to extract data from an HTML document, and I will go through a couple of different examples so that you can quickly get up to speed with lxml and XPath.

If you are a gamer, chances are you already know about this website. We will be trying to extract data from Steam. More specifically, we will be extracting information from the “Popular New Releases” section.

Fig. 2.3 Steam website

2.1 Exploring Steam

First, open up Steam’s new releases page and scroll down until you see the “Popular New Releases” tab. At this point, I usually open up the Chrome developer tools and see which HTML tags contain the required data. I make extensive use of the element inspector (the button in the top left of the developer tools), which lets you see the HTML markup behind any element on the page with just one click.

As a high-level overview, everything on a web page is encapsulated in an HTML tag, and tags are usually nested. We need to figure out which tags we need to extract the data from and then we will be good to go. In our case, if we take a look at Fig. 2.4, we can see that every separate list item is encapsulated in an anchor (a) tag.

Fig. 2.4 HTML markup

The anchor tags themselves are encapsulated in the div with an id of tab_newreleases_content. I am mentioning the id because there are two tabs on this page. The second tab is the standard “New Releases” tab, and we don’t want to extract information from that tab. Hence, we will first extract the “Popular New Releases” tab, and then we will extract the required information from within this tag.

2.2 Start writing a Python script

This is a perfect time to create a new Python file and start writing down our script. I am going to create a scrape.py file. Now let’s go ahead and import the required libraries. The first one is the requests library and the second one is the lxml.html library.

import requests
import lxml.html

If you don’t have requests or lxml installed, make sure you have a virtualenv ready:

$ python -m venv env
$ source env/bin/activate

Then you can easily install them using pip:

$ pip install requests
$ pip install lxml

The requests library is going to help us open the web page (URL) in Python. We could have used lxml to open the HTML page as well, but it doesn’t work well with all web pages, so to be on the safe side I am going to use requests. Now let’s open up the web page using requests and pass the response to the lxml.html.fromstring method.

html = requests.get('https://store.steampowered.com/explore/new/')
doc = lxml.html.fromstring(html.content)

This provides us with an object of the HtmlElement type. The object has an xpath method which we can use to query the HTML document, giving us a structured way to extract information from it.
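
If you want to confirm that Steam returned a usable page before you start querying it, a quick sanity check might look like this (same URL as above; raise_for_status() is a standard requests method that raises an exception on a bad HTTP status):

import requests
import lxml.html

html = requests.get('https://store.steampowered.com/explore/new/')
html.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses

doc = lxml.html.fromstring(html.content)
print(type(doc))  # <class 'lxml.html.HtmlElement'>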

Note

I will not explicitly ask you to create a virtual environment in each chapter. Make sure you create one for each project before writing any Python code.

2.3 Fire up the Python Interpreter

Now save this file as scrape.py and open up a terminal. Copy the code from the scrape.py file and paste it into a Python interpreter session.

Fig. 2.5 Testing code in Python interpreter

We are doing this so that we can quickly test our XPaths without continuously editing, saving, and executing our scrape.py file. Let’s try writing an XPath for extracting the div which contains the ‘Popular New Releases’ tab. I will explain the code as we go along:

new_releases = doc.xpath('//div[@id="tab_newreleases_content"]')[0]

This statement will return a list of all the divs in the HTML page which have an id of tab_newreleases_content. Because we know that only one div on the page has this id, we can take the first element from the list ([0]) and that will be our required div. Let’s break down the XPath and try to understand it (a toy demonstration follows the list):

  • // these double forward slashes tell lxml that we want to search for all tags in the HTML document which match our requirements/filters. Another option was to use / (a single forward slash). The single forward slash returns only the immediate child tags/nodes which match our requirements/filters

  • div tells lxml that we are searching for div tags in the HTML page

  • [@id="tab_newreleases_content"] tells lxml that we are only interested in those divs which have an id of tab_newreleases_content
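
Here is a tiny, self-contained example (using a made-up two-div document) that demonstrates the difference between // and /, as well as the attribute filter:

import lxml.html

snippet = lxml.html.fromstring(
    '<html><body><div id="outer"><div id="inner"></div></div></body></html>'
)

# // searches the whole tree, so both divs match
print(len(snippet.xpath('//div')))  # 2

# / only follows immediate children, so we have to spell out the full path
print(len(snippet.xpath('/html/body/div')))  # 1 (only the outer div)

# Attribute filters work exactly like in the Steam example
print(snippet.xpath('//div[@id="inner"]/@id'))  # ['inner']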

Cool! We have got the required div. Now let’s go back to Chrome and check which tag contains the titles of the releases.

2.4 Extract the titles & prices

Fig. 2.6 Titles & prices in div tags

The title is contained in a div with a class of tab_item_name. Now that we have the “Popular New Releases” tab extracted we can run further XPath queries on that tab. Write down the following code in the same Python console which we previously ran our code in:

titles = new_releases.xpath('.//div[@class="tab_item_name"]/text()')

This gives us the titles of all of the games in the “Popular New Releases” tab. You can see the expected output in Fig. 2.7.

Fig. 2.7 Titles extracted as a list

Let’s break down this XPath a little bit because it is a bit different from the last one (a short demonstration follows the list).

  • . tells lxml to run the query relative to the new_releases element instead of the whole document; combined with //, it searches through all of the descendants of new_releases

  • [@class="tab_item_name"] is pretty similar to how we were filtering divs based on id. The only difference is that here we are filtering based on the class name

  • /text() tells lxml that we want the text contained within the tag we just extracted. In this case, it returns the title contained in the div with the tab_item_name class name
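
Here is a short demonstration of why that leading dot matters (using a made-up document with two boxes):

import lxml.html

snippet = lxml.html.fromstring(
    '<html><body>'
    '<div id="a"><div class="name">First</div></div>'
    '<div id="b"><div class="name">Second</div></div>'
    '</body></html>'
)

box_a = snippet.xpath('//div[@id="a"]')[0]

# Without the dot, // ignores box_a and searches the whole document
print(box_a.xpath('//div[@class="name"]/text()'))   # ['First', 'Second']

# With .// the search is restricted to box_a's descendants
print(box_a.xpath('.//div[@class="name"]/text()'))  # ['First']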

Now we need to extract the prices for the games. We can easily do that by running the following code:

prices = new_releases.xpath('.//div[@class="discount_final_price"]/text()')

I don’t think I need to explain this code as it is pretty similar to the title extraction code. The only change is the class name.

Fig. 2.8 Prices extracted as a list

2.5 Extracting tags

Now we need to extract the tags associated with the titles. You can see the markup in Fig. 2.9.

Fig. 2.9 HTML markup for game tags

Write down the following code in the Python terminal to extract the tags:

tags_divs = new_releases.xpath('.//div[@class="tab_item_top_tags"]')
tags = []
for div in tags_divs:
    tags.append(div.text_content())

What we are doing here is extracting the divs containing the tags for the games. Then we loop over those extracted divs and pull out the text from each one using the text_content() method. text_content() returns the text contained within an HTML tag without the HTML markup.
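
To see what text_content() does with nested markup, here is a small sketch; the inner span structure is illustrative, not Steam’s exact markup:

import lxml.html

div = lxml.html.fromstring(
    '<div class="tab_item_top_tags">'
    '<span class="top_tag">Action</span>'
    '<span class="top_tag">, Adventure</span>'
    '</div>'
)

# All nested text is flattened into one plain string
print(div.text_content())  # 'Action, Adventure'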

Note

We could also have used a list comprehension to make that code shorter. I wrote it down this way so that even those who don’t know about list comprehensions can understand the code. Either way, here is the alternative code:

tags = [tag.text_content() for tag in
        new_releases.xpath('.//div[@class="tab_item_top_tags"]')]

Fig. 2.10 Tags extracted as a list

Let’s also split each of those strings into a list so that every tag becomes a separate element:

tags = [tag.split(', ') for tag in tags]
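
For example, a single scraped string (this one is made up) turns into a list of individual tags:

# One string per game becomes a list of separate tags
print('Simulation, Strategy, Building'.split(', '))
# ['Simulation', 'Strategy', 'Building']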

2.6 Extracting the platforms

Now the only thing remaining is to extract the platforms associated with each title.

Fig. 2.11 HTML markup for platforms information

The major difference here is that the platforms are not contained as text within a specific tag. Instead, they are listed as class names. Some titles only have one platform associated with them, like this:

<span class="platform_img win"></span>

While some titles have five platforms associated with them, like this:

<span class="platform_img win"></span>
<span class="platform_img mac"></span>
<span class="platform_img linux"></span>
<span class="platform_img hmd_separator"></span>
<span title="HTC Vive" class="platform_img htcvive"></span>
<span title="Oculus Rift" class="platform_img oculusrift"></span>

As we can see, these spans contain the platform type as a class name. The only thing these spans have in common is the platform_img class. First, we will extract the divs with the tab_item_details class, then we will extract the spans containing the platform_img class, and finally we will extract the second class name from those spans. Here is the code:

platforms_div = new_releases.xpath('.//div[@class="tab_item_details"]')
total_platforms = []

for game in platforms_div:
    temp = game.xpath('.//span[contains(@class, "platform_img")]')
    platforms = [t.get('class').split(' ')[-1] for t in temp]
    if 'hmd_separator' in platforms:
        platforms.remove('hmd_separator')
    total_platforms.append(platforms)

We start by extracting the divs that have the tab_item_details class.

The XPath inside the loop is a bit different. Here we have [contains(@class, "platform_img")] instead of simply [@class="platform_img"]. The reason is that [@class="platform_img"] returns only those spans whose class is exactly platform_img; if a span has any additional class, it won’t be returned. [contains(@class, "platform_img")], on the other hand, matches every span whose class attribute contains platform_img, whether it is the only class or one of several.

The list comprehension on the next line keeps the code compact. The .get() method allows us to extract an attribute of a tag; here we are using it to extract the class attribute of each span. .get() returns a string. In the case of the first game, the returned string is platform_img win, so we split it on the space character and store the last part of the split string (which is the actual platform name) in the list.

Finally, we remove hmd_separator from the list if it is present, because hmd_separator is not a platform. It is just a vertical separator bar used to separate actual platforms from VR/AR hardware.
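
Here is a toy demonstration of the difference between the two filters, using made-up spans:

import lxml.html

snippet = lxml.html.fromstring(
    '<html><body>'
    '<span class="platform_img"></span>'
    '<span class="platform_img win"></span>'
    '</body></html>'
)

# Exact matching only finds spans whose class is *exactly* "platform_img"
print(len(snippet.xpath('//span[@class="platform_img"]')))             # 1

# contains() also matches spans that carry additional classes
print(len(snippet.xpath('//span[contains(@class, "platform_img")]')))  # 2

# .get('class') returns the raw attribute string, which we split on spaces
span = snippet.xpath('//span[contains(@class, "win")]')[0]
print(span.get('class').split(' ')[-1])  # 'win'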

2.7 Putting everything together

Now we just need the script to return a JSON response so that we can easily turn it into a web-based API or use it in some other project. Here is the code for that:

output = []
for info in zip(titles, prices, tags, total_platforms):
    resp = {}
    resp['title'] = info[0]
    resp['price'] = info[1]
    resp['tags'] = info[2]
    resp['platforms'] = info[3]
    output.append(resp)

This code is fairly self-explanatory. We use the zip function to iterate over all of those lists in parallel. For each game, we create a dictionary and assign the title, price, tags, and platforms to separate keys. Lastly, we append that dictionary to the output list.
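
Note that output is still a plain Python list of dictionaries. If you want the actual JSON text shown in Fig. 2.1, the standard-library json module handles the serialization:

import json

# Turn the list of dicts into a JSON string (and pretty-print it)
print(json.dumps(output, indent=2))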

The final code for this project is listed below:

import requests
import lxml.html

html = requests.get('https://store.steampowered.com/explore/new/')
doc = lxml.html.fromstring(html.content)
new_releases = doc.xpath('//div[@id="tab_newreleases_content"]')[0]

titles = new_releases.xpath('.//div[@class="tab_item_name"]/text()')
prices = new_releases.xpath('.//div[@class="discount_final_price"]/text()')

tags = []
for tag in new_releases.xpath('.//div[@class="tab_item_top_tags"]'):
    tags.append(tag.text_content())

tags = [tag.split(', ') for tag in tags]

platforms_div = new_releases.xpath('.//div[@class="tab_item_details"]')
total_platforms = []

for game in platforms_div:
    temp = game.xpath('.//span[contains(@class, "platform_img")]')
    platforms = [t.get('class').split(' ')[-1] for t in temp]
    if 'hmd_separator' in platforms:
        platforms.remove('hmd_separator')
    total_platforms.append(platforms)

output = []
for info in zip(titles, prices, tags, total_platforms):
    resp = {}
    resp['title'] = info[0]
    resp['price'] = info[1]
    resp['tags'] = info[2]
    resp['platforms'] = info[3]
    output.append(resp)

print(output)

Note

This is just a demonstration of how web scraping works. Make sure you do not infringe on the copyright of any service while web scraping. I can’t be held responsible for any irresponsible actions you take.

2.8 Troubleshooting

The main issues I can think of are:

  • Steam not returning a proper response

  • lxml not parsing the data correctly

For the first issue, the cause is usually that you are making a lot of requests in a short amount of time, which makes Steam think you are a bot and refuse to return a proper response. You can often overcome this by opening Steam in your browser and solving any captcha it shows you. If that doesn’t work, you can use a proxy to access Steam.
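
If Steam keeps rejecting your requests, it sometimes helps to send a browser-like User-Agent header, and requests also accepts a proxy directly. Here is a sketch; the User-Agent string is just an example and the proxy address is a placeholder you would replace with your own:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
proxies = {'https': 'http://your-proxy-host:8080'}  # placeholder proxy

html = requests.get(
    'https://store.steampowered.com/explore/new/',
    headers=headers,
    proxies=proxies,
)
print(html.status_code)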

For the second issue, open up Steam in the browser, inspect the HTML closely, and make sure you are trying to extract data from the correct HTML tags.

2.9 Next Steps

You can now use this to create all sorts of services. How about a service that monitors Steam for promotions on specific games and sends you an email when it spots an amazing deal? As always, the options are endless. Make sure you send me an email about whatever service you end up making!

If you want to learn more, you can also check out this other Python web scraping tutorial published by my friends over at ScrapingBee.

I hope you learned something new in this chapter. In a later chapter, we will learn how to take something like this and turn it into a web-based API using the Flask framework. See you in the next chapter!