Website Scraping With Python and Beautiful Soup

So a buddy of mine asked me to help him write a shell script that could scrape content from a website and put it into a MongoDB database. I didn’t really feel like writing a shell script for that, since I figured it would be a huge pain, so I decided to try it with Python. After some research I stumbled upon Beautiful Soup. This actually turned out to be pretty easy, and in short order I had a script that could scrape the MegaMillions website, grab the date, winning numbers, and mega number from every drawing, and put that info into a MongoDB database.

Grab The Website

So the first thing that needs to be done is a simple urlopen on the website in question:

import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.usamega.com/mega-millions-history.asp?p=1').read())

In our case we are going to pull down the first page of the MegaMillions winning number history and store the parsed result in the variable soup. If you were to do a simple print(soup.prettify()) you would see a nicely indented version of the HTML at the URL above.
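For readers on Python 3, where urllib2 no longer exists, here is a rough sketch of the same fetch using urllib.request from the standard library. The helper names page_url and fetch_page and the 30-second timeout are my own assumptions, not part of the original script, and the URL may no longer resolve.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address taken from the article; it may have changed since.
BASE_URL = 'http://www.usamega.com/mega-millions-history.asp'


def page_url(page_num):
    """Build the history URL for one page (hypothetical helper)."""
    return BASE_URL + '?' + urlencode({'p': page_num})


def fetch_page(page_num, timeout=30):
    """Download one history page and return its raw bytes (hypothetical helper)."""
    with urlopen(page_url(page_num), timeout=timeout) as response:
        return response.read()
```

The bytes returned by fetch_page(1) could then be handed to BeautifulSoup in place of the urllib2 call.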

Parsing HTML Table Content With Beautiful Soup

I had to actually read the HTML source to determine that the table at index 4 (the fifth ‘table’ on the page, since indexing starts at zero) was the one that contained the winning lottery numbers I wanted to parse out. A simple print soup('table')[4].prettify() would output the full table content I was looking for. This is a good start: as you can see, I simply needed to supply the index of the table I wanted. I found other examples that would allow you to search for a table by its CSS class and the like, but in my case that wasn’t an option.

Iterating Through Table Rows

Now that I knew which table I wanted, I simply needed to iterate through the table rows, so I created a simple loop:

for row in soup('table')[4].findAll('tr'):
    tds = row('td')
    print tds

This would allow me to print each set of td tags contained in a tr tag individually:

[<td></td>, <td align="right" nowrap="nowrap"><a href="mega-millions-drawing.asp?d=9/18/2012">Tuesday, September 18, 2012</a></td>, <td></td>, <td align="center" nowrap="nowrap"><b>05 &middot; 09 &middot; 22 &middot; 36 &middot; 49</b>&nbsp; &nbsp; <font color="#000064">+ <strong>36</strong></font></td>, <td></td>, <td align="center" nowrap="nowrap"><b>3</b></td>, <td></td>, <td align="right" nowrap="nowrap"><a href="/mega-millions-jackpot.asp?d=9/18/2012">$15 Million</a></td>, <td></td>]

Print Strings From Specific Cells In HTML Table Row

Here is where Beautiful Soup really shines. I wanted to take only specific cells from the row and add them to a dictionary, but I only wanted the actual content string, not the HTML tags. First, let’s isolate one of the cells in one of the rows:

print soup('table')[4].findAll('tr')[1].findAll('td')[1]

This gives us the following output:

<td align="right" nowrap="nowrap"><a href="mega-millions-drawing.asp?d=11/30/2012">Friday, November 30, 2012</a></td>

But like I said, I wanted only the string information, and simply adding .string was not sufficient. I needed to tell Beautiful Soup that I wanted the string inside the link tag, like so:

print soup('table')[4].findAll('tr')[1].findAll('td')[1].a.string

Which gives me the following output:

Friday, November 30, 2012
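If you would rather store a real date than a string, the text can be converted with datetime.strptime. This is only a sketch: the parse_drawing_date name is made up for illustration, the final script keeps the date as a plain string, and the format assumes English day and month names. It is written for Python 3.

```python
from datetime import datetime


def parse_drawing_date(text):
    """Turn 'Friday, November 30, 2012' into a date object (hypothetical helper)."""
    # %A = weekday name, %B = month name, %d = day, %Y = four-digit year
    return datetime.strptime(text.strip(), '%A, %B %d, %Y').date()


print(parse_drawing_date('Friday, November 30, 2012'))
# prints 2012-11-30
```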

Perfect. Next I wanted the actual winning numbers, which in this case are in the 4th cell. So, again, simply printing index [3] would return the 4th cell:

print soup('table')[4].findAll('tr')[1].findAll('td')[3]

Which looks like this:

<td align="center" nowrap="nowrap"><b>11 &middot; 22 &middot; 24 &middot; 28 &middot; 31</b>&nbsp; &nbsp; <font color="#000064">+ <strong>46</strong></font></td>

But again, I only wanted the winning numbers, and in this case the winning numbers and the mega ball were both combined in the same cell. The good thing here is that the winning numbers and the mega ball are wrapped in different tags (which makes our life easy). To get just the winning numbers, we do this:

print soup('table')[4].findAll('tr')[1].findAll('td')[3].b.string

Which gives us our winning numbers separated by the &middot; entity (we can parse that out later):

11 &middot; 22 &middot; 24 &middot; 28 &middot; 31
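As a preview of that parsing step, here is one way it could look. Depending on the Beautiful Soup version, the separator may come back as the literal text &middot; or as an already decoded · character, so this sketch decodes entities first with the standard library and then splits. The parse_numbers name is hypothetical, and the snippet is written for Python 3.

```python
from html import unescape


def parse_numbers(text):
    """Split '11 &middot; 22 ...' into a list of ints (hypothetical helper)."""
    decoded = unescape(text)  # '&middot;' becomes the '·' character
    # int() ignores the spaces left around each number after the split
    return [int(part) for part in decoded.split('\u00b7')]


print(parse_numbers('11 &middot; 22 &middot; 24 &middot; 28 &middot; 31'))
# prints [11, 22, 24, 28, 31]
```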

To get the mega ball number, all we needed to do was pull out the string inside the ‘strong’ tag instead of the ‘b’ tag:

print soup('table')[4].findAll('tr')[1].findAll('td')[3].strong.string

And this gives us our mega ball number:

46

Final Script With MongoDB Integration

I won’t bore you with all the details on how to insert the data into MongoDB, or explain all the small Python details (there are websites way better than mine that can explain all that). I just wanted to give people a brief overview of the Beautiful Soup Python module and how they might use it in their day-to-day coding. Here is the code all tied together:

#!/usr/bin/python
# vim: set expandtab:
import urllib2
from BeautifulSoup import BeautifulSoup
# Note: newer PyMongo releases replace Connection with MongoClient.
from pymongo import Connection

host = 'localhost'
database = 'lotto'
collection = 'mega_millions'
base_url = 'http://www.usamega.com/mega-millions-history.asp?p='


def mongo_connection():
    """Return the MongoDB collection the drawings are written to."""
    con = Connection(host)
    col = con[database][collection]
    return col


def main():
    col = mongo_connection()
    total_pages = 63
    # Walk every page of the winning number history.
    for page_num in range(1, total_pages + 1):
        soup = BeautifulSoup(urllib2.urlopen(base_url + str(page_num)).read())
        for row in soup('table')[4].findAll('tr'):
            win_dict = {}
            tds = row('td')
            # Defensive check: skip rows without enough cells.
            if len(tds) < 4:
                continue
            # Only rows with a date link and a numbers cell are drawings.
            if tds[1].a is not None and tds[3].b is not None:
                win_dict['date'] = tds[1].a.string
                num_list = []
                # Told you we would get back to it
                number_list = tds[3].b.string.split('&middot;')
                for num in number_list:
                    num_list.append(int(num))
                win_dict['numbers'] = num_list
                mega_number = tds[3].strong.string
                win_dict['mega_number'] = int(mega_number)
                col.insert(win_dict)


if __name__ == "__main__":
    main()