Programming Journey Entry 2: Starting an Extraction of Steam Data Python Script

I've started this series of posts about general programming in whatever language I happen to feel like. I decided to make it a series separate from my game development dev log posts. This one focuses on some ideas I have for a couple of different utilities, specifically in Python.

If you'd like to see my mad rambling notes as I learned, they're in a subdirectory of my Python GitHub repository I've tentatively called "steam scraper". Eventually I will spin it off into its own repository.

All of these Programming Journey posts can be found in the associated category.

In the prior entry of this series, I brainstormed a few utilities I could write in Python. In this entry I wanted to expand on my idea of importing my Steam library data and manipulating it.

Some may call this a redundant and unnecessary project. If nothing else it is a good data set to practice on.

I should perhaps attempt a further summary of this Steam library data I keep referring to. You buy games on Steam, they go into your library as part of your "Steam profile", and one page lists all the games (and software etc.) you have in one big table (this is accessible from a browser, as opposed to the Steam client software). The size of the table varies based on how many games you own. This page can be either private or public. Here is the library page of a Steam user I picked at random:

The browser developer tools would imply this is a big HTML document with a table, but really the browser runs some JavaScript that stitches together that table based on available library data (the get method below doesn't run the JavaScript like a browser does; it just grabs the text it finds).

I've decided to just write down what I learn as I go and not go back and re-write this post as if I understood it all at once. Maybe not good for a guide, but a stream-of-consciousness format can be valuable as well.

What I know so far

As I alluded to in the prior post, after many hours of not figuring out why my web scraping wasn't working the way the guides made it work, I realized: the HTML source of my target Steam page wasn't just static HTML, but rather a JavaScript-generated table based on the data.

The videos and documentation I was following said to find the rough area of the page I wanted, then via the developer tools right-click the nearest parent tag like a div or span, go to "copy" and from the sub-menu select either "CSS path" or "copy selector" to get something like:

#SHORTCUT_FOCUSABLE_DIV > div:nth-child(4) > div > div > div > div._3ozFtOe6WpJEMUtxDOIvtU > div.MSTY2ZpsdupobywLEfx9u > div > div.QscnL9OySMkHhGudEvEya > div > div._3TG57N4WQtubLLo8SbAXVF > h1

That is something that the popular Python library BeautifulSoup can understand. But I couldn’t get it to select the right portion of the page.

Okay, I'll skip to the end: the data isn't in that hierarchical structure of the HTML document. It's in what is actually a very programmer-friendly format: JSON embedded in the document itself.

I’m kind of expressing surprise here because Valve didn’t have to do it that way. They could have hidden that JSON data in a separate .JS file on the server and/or obfuscated the JavaScript code. Or just required an API key to access the data. It seems very young-people-learning-friendly of them to just include the data right there in the document and in a perfect format. That’s not something Dell would do for instance. Or I’m reading too much into it.

When I refer to a "Steam API" I'm referring to the ability to "talk directly" to Steam and get the "raw data", for lack of a better term. This is better than web scraping because Steam can re-design their entire site layout and re-write the HTML from scratch, and a script or program will still work. There's usually a limited number of "API keys" handed out, and something of a sign-up process to avoid abuse.
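Since I mentioned the API route: for reference, here's roughly what asking for the raw data directly would look like. This is my understanding of Steam's Web API GetOwnedGames endpoint, and the key and Steam ID below are placeholders, so treat it as a sketch rather than tested code:

```python
from urllib.parse import urlencode

# GetOwnedGames endpoint of the Steam Web API (to my understanding)
BASE = "https://api.steampowered.com/IPlayerService/GetOwnedGames/v1/"
params = {
    "key": "YOUR_API_KEY",           # placeholder - keys are issued by Valve
    "steamid": "76561190000000000",  # placeholder 64-bit Steam ID
    "include_appinfo": 1,            # ask for game names, not just app IDs
    "format": "json",
}
url = BASE + "?" + urlencode(params)
print(url)  # pass this to requests.get() to receive JSON directly
```

With a real key and ID, requests.get(url).json() would hand back the library data with no scraping at all.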

So the first obvious step then would be to extract just this JSON data from the Steam library page.

For this part I'll need these libraries:

  • Beautiful Soup (bs4) – a common library for parsing HTML pages
  • requests – common for downloading from the web and getting status return codes (the famous 404 for not found and 200 for “success” or “exists”)
  • Pyperclip – a module provided by the author of Automate the Boring Stuff to copy the contents of variables to the system clipboard.

So the basic structure of the program goes like this:

  • Use the requests library's get method to fetch the Steam library page (I used what I'm treating as a constant for the URL, since it won't change)
  • Pass the text of this captured HTML to a BeautifulSoup object, turning on "html.parser" mode in the process
  • This is where that CSS selector mentioned previously comes in: use the BeautifulSoup object to select the CSS selector containing the script tag. This way, instead of a string holding the entire HTML source, it's holding just the one script tag.
  • Using string manipulation, figure out where the JSON data starts and ends, and assign this string to a variable
  • Write the contents of this variable out to a text file

I thought of doing it this way so I don't have to re-grab the HTML every time I want to change one thing and re-run my Python script. With the text file written, I can just load its contents and manipulate the data from there.

So really the program should check whether the text file exists: if it's not there, prompt the user to make one, and if it is there, ask the user if it should be loaded (and offer an option about overwriting too, I suppose). I could let the user choose a name for the text file, but since I'm the only user and still debugging, I'll keep going with a static file name.
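A minimal sketch of that check-then-prompt flow, using os.path.exists from the standard library (the file name and prompt wording are just my placeholders):

```python
import os

DATA_FILE = "justGamedata.txt"  # static file name, as described above

def data_file_status(path=DATA_FILE):
    """Return 'missing' or 'exists' so the caller knows which prompt to show."""
    return "exists" if os.path.exists(path) else "missing"

def prompt_user():
    """Interactive flow sketched above; only called when run by hand."""
    if data_file_status() == "missing":
        print(f"No {DATA_FILE} yet - grab a fresh copy from Steam first.")
        return "download"
    answer = input(f"{DATA_FILE} exists. Load it instead of re-downloading? (y/n) ")
    return "load" if answer.lower().startswith("y") else "overwrite"
```

Splitting the existence check into its own little function keeps the interactive part separate, which should make it easier to test.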

More details

The first step is just copying the script tag that contains the variable holding all the data. Yes, it's one variable with one very long, unbroken line of text.

Turns out it's the very last script tag on the page, the 18th script tag (17th counting from 0). I could probably use some kind of array-length-minus-one trickery, but here is what I have so far. I was trying to decide what method to use for "selecting" the script tag I wanted, ideally something future-proof. I could select the 18th script tag with loadToBeaut.find_all("script")[17]. That works. Or I could use the CSS selector path, with something like loadToBeaut.select('#responsive_page_template_content > script:nth-child(4)')[0]. Technically either of those could change at any time.
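Worth noting: the "array length minus one" trickery is actually built into Python as negative indexing, so grabbing the last script tag stays correct no matter how many tags come before it. A demonstration on a made-up toy page:

```python
from bs4 import BeautifulSoup

# toy stand-in for the real Steam library page
toy_html = """
<html><body>
<script>var first = 1;</script>
<script>var second = 2;</script>
<script>var rgGames = [{"appid": 400}];</script>
</body></html>
"""
soup = BeautifulSoup(toy_html, "html.parser")
last_script = soup.find_all("script")[-1]  # -1 means "last", however many there are
print(last_script.string)
```

If Valve ever adds or removes a script tag earlier in the page, [-1] would keep pointing at the last one, while [17] would silently grab the wrong tag.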

import requests, pyperclip
from bs4 import BeautifulSoup

STEAM_LIBRARY_URL = " user id)/games/?tab=all&sort=name" # attempt at a constant for this URL
URL_TRUE = 200 # status code from "requests" meaning it "found the URL", thus as good as 'true' in this context (will need this later)

result = requests.get(STEAM_LIBRARY_URL) # result.text will show the whole source of the page
loadToBeaut = BeautifulSoup(result.text, "html.parser") # comprehend the HTML
scriptContent = loadToBeaut.find_all("script")[17] # the 18th script tag, or 17 counting from 0

pyperclip.copy(str(scriptContent)) # convert the script tag to a string type and copy it to the clipboard

Now that I have the whole tag, I need a way to cut off everything before and after the content I want.

I think cutting off the opening and closing script tags would be relatively easy using something out of Beautiful Soup, but I still need to cut off everything that comes after the content and before the closing script tag.

The content does end with ]; so I could use that. Actually, the content starts with [{ so I could literally just parse the string, capture everything between [{ and ];, and cut off everything else. It doesn't seem very elegant, but sometimes the easiest and fastest way is actually the best.

Actually, doing a ctrl+F of the page source, there's only one instance of the character combo [{. So I could do this even more easily than I already am.

Proof-reading pass note: I should perhaps mention that even though the string I was searching through - as verified by pasting said string into Notepad - had that ]; combination, I couldn't get the find method to match it. I even tried escaping both characters. Not sure what I did wrong there, but eventually I decided to try a different method.

Many hours later

After a lot of referencing, and trying to decide whether I should use regular expressions, I came up with at least part of a solution using just string functions. I'm hoping not to resort to regular expressions if I don't have to.

I replaced the above version of the script with a slightly altered version: first, instead of the find_all line above, I use this to grab the content of just the script tag:

scriptContent = str(loadToBeaut.select('#responsive_page_template_content > script:nth-child(4)')[0])

It's something of a distinction without a difference. This does in fact capture that <script> tag and only that script tag, as I wanted. It also converts it entirely to a string type. I thought the [0] might be redundant, but it isn't: select returns a list of matches, so the [0] pulls out the single matching tag before str() converts it.

Since it’s now a string type I have a whole host of string functions and methods to draw from.

Since the "preamble" as I call it - from the start of the string to the first/only instance of [{ - was so short, spanning just two lines, I added up the column numbers to get 49. I could have used another function to cut out the newlines/spaces and then counted, but it's short enough that this is fine. Then I used a slice on the string and saved that into a variable. With that stored, I could use the replace method to take the preamble out:

# save preamble to a variable
preambleText = scriptContent[:49]
# replace preamble with an empty string
noMorePreamble = scriptContent.replace(preambleText, '')

I verified the output using pyperclip and pasting into Notepad (it's enough content that it scrolls the console way too far off screen).

As I have just started Python (and am trying to get into programming in general), I am still learning how to manipulate strings and string data. One thing about string manipulation is knowing the difference between a position inside a string and the content of the string.

Luckily, if I practice long enough, it starts to make more and more sense. I think this is due to just how friendly the Python language is.

You can ask Python to find the position of a specific string, which is an integer, and you can ask Python to show what is at that particular position. This is used with what is called "slice notation".
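A quick illustration of that position-versus-content distinction on a toy string:

```python
s = "var rgGames = [{...}];"
pos = s.find("[{")  # find returns a position - an integer (or -1 if not found)
print(pos)          # 14
print(s[pos:])      # slice notation: the content from that position to the end
```

The integer 14 is where the data lives; s[pos:] is the data itself.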

Anyway, in an effort to hopefully make my code more future-proof, and to (perhaps inexplicably) avoid regular expressions since they seemed excessive, I think I have found something of a solution.

It seems like in some languages you can actually start from the end of a string and search backwards. I couldn't figure out how to do that in Python, but I did figure out how to find the position of a unique string.
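Proof-reading pass note: it turns out Python can do this after all. The built-in str.rfind method returns the position of the last occurrence, which amounts to searching from the right:

```python
s = "data = [1, 2]; // end = here"
print(s.find("="))   # 5  - first occurrence, searching from the left
print(s.rfind("="))  # 22 - last occurrence, i.e. effectively searching from the right
```

I'm leaving my original approach in place below, but rfind would be the tool for a search-from-the-end version.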

So first I wanted to find everything from the start of the script tag - the literal <script> tag - to the character immediately before the start of the JSON data. Then I needed to find everything from the character immediately after the JSON data to the last character in the script tag. Cutting both leaves only the JSON data, isolated from everything else.
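Here's the whole isolate-the-JSON idea on a toy version of the script tag. Note the offset past the landmark depends on the exact spacing (in this made-up string it's 6 characters), and json.loads from the standard library proves the captured text really is valid JSON:

```python
import json

tag = '<script>var rgGames = [{"appid": 400, "name": "Portal"}]; var rgChangingGames = [];</script>'
start = tag.find("= ") + 2             # two past the first "= ", i.e. the opening '['
end = tag.find("rgChangingGames") - 6  # back up over '; var ' (6 chars in this toy string)
json_text = tag[start:end]
games = json.loads(json_text)          # proof it's real JSON: now a Python list of dicts
print(games[0]["name"])                # Portal
```

Everything before "start" and after "end" is irrelevant; only the two landmark positions matter.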

Of course, if I could just use the requests library to ask (or literally request) for JSON data, none of this would be necessary. Or that's what I briefly inferred from this YouTube video. I tried it, however, and the Python interpreter was not having it.

The following looks through the script tag with the find method, returns the position of that combination of an equals sign and a space, "= ", then adds two to that (two places past the position) and saves it in a variable.

# get from start of string to right before the bracket '['
PreambleEndPos = (scriptContent.find("= ") + 2)
# use "slice notation" with print to verify it's the
# prefix that I want
print(scriptContent[:PreambleEndPos])

The print is just a debugging thing to verify I have the preamble and only the preamble captured.

Next is the suffix - or post-amble? Even though it doesn't quite match, I'll call it a suffix.

It's easy to find the very last character of the suffix, since that is the last character of the whole thing, which can be captured with the len() function, like len(scriptContent).

Then I just have to find the start of the suffix, which contains the unique name of another variable, rgChangingGames. Well, it's actually eight characters short of that unique variable name technically, so I used startSuffixPos = scriptContent.find("rgChangingGames") - 8.

Now I just have to put it together.

# start pos is just 0, so this is end pos:
PreambleEndPos = (scriptContent.find("= ") + 2)
# similarly, the suffix end pos is just the length of the whole thing
# so start of suffix:
startSuffixPos = scriptContent.find("rgChangingGames") - 8
# With those two variables I have four positions established,
# but really all I need is the "end of preamble" position and
# the "start of suffix" position. I don't technically "need"
# information about what comes before and after the data I want
justGamedata = scriptContent[PreambleEndPos:startSuffixPos]

So all I really need to know is “where does the string start” and “where does the string end”. What comes before and after the string doesn’t actually matter. I just need the position of a unique string to start at and the same for the ending.

It seems like it took me a while to accomplish this very minimal sort of progress. And this isn't even the hardest part. And it may have been unnecessary. Try not to think about that.

Next I decided to see if I could learn how to write out to a text file. It turns out this wasn't hard at all. I found a reference on W3Schools.

# the "x" mode means "create a new file"
writeAfile = open("justGamedata.txt", "x")
writeAfile.write(justGamedata)
writeAfile.close()

That’s it. That’s all I needed to write the data out to a text file.
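One wrinkle I should note for later: "x" mode raises a FileExistsError if the file is already there, which ties into the overwrite-prompt idea from earlier, and a with block closes the file automatically. Something like this (the stand-in data string is just for illustration):

```python
# stand-in for the extracted data; on a real run this comes from the slicing above
justGamedata = '[{"appid": 400, "name": "Portal"}]'

try:
    # "x" mode creates the file, erroring out instead of clobbering an existing one
    with open("justGamedata.txt", "x", encoding="utf-8") as outfile:
        outfile.write(justGamedata)
except FileExistsError:
    print("justGamedata.txt already exists - not overwriting.")
```

The with block also means no separate close() call to forget.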

Brainstorm of future ideas (and dream features)

I should probably lay out my basic ideas for the program before continuing any further.

I think I'm going to need to write a function just for things like testing whether the URL exists and prompting the user to grab a copy of the page (though if it's just me using it, this URL will never change). Basic setup stuff.

Then a function to see if there is already a file with the data in the same directory as the Python script, and respond accordingly (download new, overwrite existing, import existing - probably no need for an append option).

The next step would be the actual extraction of the data, hopefully in JSON form, since that would make working with and manipulating the data so much easier.

For instance, I could create my own categories and tags and filter by each. I could try importing tags from the store page of a particular game and/or assigning my own. I'm not sure yet whether extracting information from the store pages of games would be a separate program entirely or merely a function of this one. One of those tags would also be my rating of a particular game.

Then I could further sort and filter by different parameters. Maybe have extra attributes that note games I also own on different platforms.
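To make that concrete, here's the kind of filtering I have in mind once the data is a list of dictionaries. The "tags" and "also_on" fields are my own invention layered on top of the Steam data, not something Steam provides:

```python
# hypothetical structure: library entries plus my own added attributes
my_library = [
    {"name": "Portal", "tags": ["puzzle", "finished"], "also_on": ["GOG"]},
    {"name": "Stellaris", "tags": ["strategy"], "also_on": []},
    {"name": "Braid", "tags": ["puzzle"], "also_on": []},
]

def games_with_tag(library, tag):
    """Filter the library down to the names of games carrying a custom tag."""
    return [game["name"] for game in library if tag in game["tags"]]

print(games_with_tag(my_library, "puzzle"))  # ['Portal', 'Braid']
```

The same list-comprehension shape would work for the also-owned-elsewhere attribute or a personal rating.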

As for how it will look I haven’t thought of that yet. I could explore options for using a UI system like tkinter or Qt, but part of me just wants to learn UI making with curses.

Curses is a library for writing to the screen in text mode. If you've ever seen one of those keyboard-only text UIs on a terminal - I mainly see them at car mechanics and auto parts stores for whatever reason - that was likely created using some form of curses. Curses is popular on Linux and similar systems, but there was probably never a reason for adoption on Windows, WSL notwithstanding.

I found a YouTube video playlist from a relatively great host that I think I’ll be able to follow and understand (which I haven’t started yet).

I'm going to have to brainstorm on exactly how I would want the user interface to look for this. Unless I'm already being overly ambitious. Maybe I should just worry about retrieving the data and converting the JSON first, whatever form such a script takes, and worry about the UI four or five steps down the road.

Random idea: take text - copied to the clipboard - from either the post-purchase page or the "recent purchases" screen of the account page, and parse that to retrieve metadata and add it to the database. This would just be to maximize laziness, no other reason.

Reference links:

I've made a "clearing house" repo for random python programming projects on GitHub:
