So a buddy of mine asked me to help him write a shell script that could scrape content from a website and put it into a MongoDB database. I didn’t really feel like writing a shell script for that, since I figured it would be a huge pain in the a**, so I decided to try it with Python. After some research I stumbled upon Beautiful Soup. It turned out to be pretty easy, and before long I had a script that could scrape the MegaMillions website, grab the date, winning numbers, and mega number from every drawing, and put that info into a MongoDB database.
Grab The Website
So the first thing that needs to be done is a simple urlopen on the website in question:
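A minimal sketch of that step, assuming Python 3 and the bs4 package. The helper name fetch_soup is made up for illustration, and html.parser is just the parser that ships with Python, not necessarily the one used for the real script:

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url):
    """Download `url` and parse the response body into a BeautifulSoup tree."""
    with urlopen(url) as response:
        html = response.read()
    # "html.parser" is Python's built-in parser, so nothing extra to install.
    return BeautifulSoup(html, "html.parser")
```

Calling soup = fetch_soup(url) with the address of the page you want to scrape gives you the parsed page to work with.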
In our case we are going to pull down the first page of the MegaMillions winning number history and store the parsed result in the variable soup. If you were to do a simple print(soup.prettify()), you would see a nicely indented version of the HTML behind the URL posted above.
Parsing HTML Table Content With Beautiful Soup
I had to actually read the HTML source to determine that the fourth ‘table’ on the page was the one that contained the winning lottery numbers I wanted to parse out. Since soup('table') returns a list of every table on the page, I simply needed to pick the right one by index: print(soup('table')[3].prettify()) outputs the full table I was looking for (index 3, because counting starts at zero). This is a good start. I found other examples on the website that search by CSS class or other attributes, but in my case that wasn’t an option.
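Here is a small sketch of picking a table by position. The HTML is a made-up stand-in with four tables, since the real page markup isn't reproduced here:

```python
from bs4 import BeautifulSoup

# Made-up stand-in page with four tables (an assumption, not the real markup).
SAMPLE_HTML = """
<table><tr><td>nav</td></tr></table>
<table><tr><td>banner</td></tr></table>
<table><tr><td>search</td></tr></table>
<table><tr><td>1/1/2010</td><td>1 2 3 4 5</td></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Calling the soup like a function is shorthand for soup.find_all("table").
tables = soup("table")
# Indexing is zero-based, so index 3 is the fourth table.
fourth_table = tables[3]
print(fourth_table.prettify())
```

If the table you want has an id or class, find() with those attributes is less fragile than counting tables, but counting works when the markup gives you nothing else to grab.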
Iterating Through Table Rows
Now that I knew which table I wanted, I simply needed to iterate through its rows, so I created a simple loop:
    # Loop over every row of the fourth table (index 3)
    for row in soup('table')[3].find_all('tr'):
        # Collect the cells in this row
        tds = row('td')
This would allow me to print each set of td tags contained in a tr tag individually:
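A runnable sketch of that loop, again against a made-up two-row table rather than the real page:

```python
from bs4 import BeautifulSoup

# Made-up sample rows (an assumption); the real table has more columns.
SAMPLE_TABLE = """
<table>
  <tr><td>6/1/2010</td><td>1 2 3 4 5</td></tr>
  <tr><td>6/4/2010</td><td>6 7 8 9 10</td></tr>
</table>
"""

table = BeautifulSoup(SAMPLE_TABLE, "html.parser").table

rows_of_cells = []
for row in table.find_all("tr"):
    tds = row("td")          # every <td> inside this <tr>
    rows_of_cells.append(tds)
    print(tds)               # prints the cells with their HTML tags intact
```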
Print Strings From Specific Cells In HTML Table Row
Here is where Beautiful Soup really shines. I wanted to take only specific cells from the row and add them to a dictionary, but I only wanted the actual content string, not the HTML tags. First let's isolate one of the cells in one of the rows:
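Something along these lines shows the difference between a cell and its text. The row markup is invented for the example:

```python
from bs4 import BeautifulSoup

# Invented single row (an assumption), parsed on its own for brevity.
ROW_HTML = "<tr><td>6/1/2010</td><td>1 2 3 4 5</td></tr>"

row = BeautifulSoup(ROW_HTML, "html.parser").tr
tds = row("td")

first_cell = tds[0]
print(first_cell)          # the whole tag, markup included
print(first_cell.string)   # just the text inside the tag
```

Note that .string only returns text when the tag contains a single string; for cells with nested tags, get_text() is the safer call.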
But again, I only wanted the winning numbers, and in this case the winning numbers and the mega ball were both combined in the same cell. The good thing here is that the winning numbers and the mega ball are wrapped in different tags (which makes our life easy). To simply get the winning numbers, we do this:
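As a rough sketch: the tags used on the real page aren't named here, so this assumes, purely for illustration, that the winning numbers sit in a b tag and the mega ball in a strong tag:

```python
from bs4 import BeautifulSoup

# Assumed markup: <b> for the winning numbers, <strong> for the mega ball.
# Swap in whatever tags the real page uses.
CELL_HTML = "<td><b>1 - 2 - 3 - 4 - 5</b> <strong>6</strong></td>"

cell = BeautifulSoup(CELL_HTML, "html.parser").td

# Attribute access returns the first matching child tag.
winning_numbers = cell.b.string
mega_number = cell.strong.string

print(winning_numbers)
print(mega_number)
```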
I won’t bore you with all the details on how to input the data into MongoDB, or explain all the small Python details (there are websites way better than mine which can explain all that). I just wanted to give people a brief overview of the Beautiful Soup Python module and how they might better use it in their day-to-day coding. Here is the code all tied together:
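What follows is a condensed sketch of how the pieces could fit together, not the exact script. The sample HTML, the tag names, the dictionary keys, and the helper names parse_drawings and save_drawings are all assumptions made for the example; only insert_many is a real pymongo collection method.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the downloaded page (an assumption).
SAMPLE_HTML = """
<table><tr><td>navigation</td></tr></table>
<table>
  <tr><th>Date</th><th>Numbers</th></tr>
  <tr><td>6/1/2010</td><td><b>1 - 2 - 3 - 4 - 5</b> <strong>6</strong></td></tr>
  <tr><td>6/4/2010</td><td><b>7 - 8 - 9 - 10 - 11</b> <strong>12</strong></td></tr>
</table>
"""


def parse_drawings(html, table_index):
    """Return one dict per drawing found in the table at `table_index`."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup("table")[table_index]
    drawings = []
    for row in table.find_all("tr"):
        tds = row("td")
        if len(tds) < 2:
            continue  # skip header rows, which have no <td> cells
        drawings.append({
            "date": tds[0].get_text(strip=True),
            "winning_numbers": tds[1].b.get_text(strip=True),
            "mega_number": tds[1].strong.get_text(strip=True),
        })
    return drawings


def save_drawings(collection, drawings):
    """Store the drawings; `collection` is expected to be a pymongo Collection."""
    if drawings:
        collection.insert_many(drawings)


# The sample page keeps its data in the second table, hence index 1.
drawings = parse_drawings(SAMPLE_HTML, table_index=1)
print(drawings)
```

With pymongo installed and a MongoDB server running, passing something like MongoClient().lottery.drawings as the collection would store the results; the database and collection names there are invented.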