Saturday, June 18, 2016

Beautiful Soup scrape for "Worldwide"


I'm trying to scrape some Box Office Mojo pages for Worldwide box office gross figures using Beautiful Soup. My code below grabs the Domestic figures just fine, but it won't work when I sub in "Worldwide" for "Domestic Total Gross". Maybe that's because "Worldwide" shows up on the page more than once, or something like that.

Any help on fixing it? I'll paste the source code for the two portions as well. Thanks!

Source code below

<center><table border="0" border="0" cellspacing="1" cellpadding="4" bgcolor="#dcdcdc" width="95%"><tr bgcolor="#ffffff"><td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$172,825,435</b></font></td></tr><tr bgcolor="#ffffff"><td valign="top">Distributor: <b><a href="/studio/chart/?studio=mgm.htm">MGM</a></b></td><td valign="top">Release Date: <b><nobr><a href="/schedule/?view=bydate&release=theatrical&date=1988-12-16&p=.htm">December 16, 1988</a></nobr></b></td></tr><tr bgcolor="#ffffff"><td valign="top">Genre: <b>Drama</b></td><td valign="top">Runtime: <b>2 hrs. 13 min.</b></td></tr><tr bgcolor="#ffffff"><td valign="top">MPAA Rating: <b>R</b></td><td valign="top">Production Budget: <b>$25 million</b></td></tr></table>  </td>

...skip...

<tr>
<td width="40%">=&nbsp;<b>Worldwide:</b></td>
<td width="35%" align="right">&nbsp;<b>$354,825,435</b></td>
<td width="25%">&nbsp;</td>
</tr>

Python code below

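import re
import urllib2

# Assumed import: the post does not show which Beautiful Soup version is used
from bs4 import BeautifulSoup

BOG_titles = ['=RainMan.htm']

def get_movie_value(soup, field_name):
    # Find the first text node on the page that matches the field name
    obj = soup.find(text = re.compile(field_name))
    if not obj:
        return "Nothing"
    # The value is expected in the tag that follows that text node
    next_sibling = obj.findNextSibling()
    if next_sibling:
        return next_sibling.text
    else:
        return "Still Nothing"

BOG_data = []
for x in BOG_titles:
    y = 'http://www.boxofficemojo.com/movies/?id' + x
    page = urllib2.urlopen(y)
    soup = BeautifulSoup(page)
    m = get_movie_value(soup, "Worldwide")
    # The page title looks like "Title (Year) ..."; keep only the part before "("
    title_string = soup.find('title').text
    title = title_string.split('(')[0].strip()
    BOG_data.append([title,m])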
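
One possible reading of the two fragments pasted above: in the Domestic case the figure's <b> tag is a sibling of the label text inside the same <font> tag, so findNextSibling() reaches it. In the Worldwide row the label text sits alone inside its own <b>, and the figure is in the next <td>, so the text node has no sibling tag to return. Below is a minimal sketch under that assumption, written for Python 3 with bs4. The helper name get_labelled_value and the trimmed SAMPLE_HTML are illustrative and not from the original post, and it has only been reasoned through against the pasted fragments, not the live page.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the two fragments quoted in the post (illustrative only).
SAMPLE_HTML = """
<table><tr><td><font size="4">Domestic Total Gross: <b>$172,825,435</b></font></td></tr></table>
<table><tr>
<td width="40%">=&nbsp;<b>Worldwide:</b></td>
<td width="35%" align="right">&nbsp;<b>$354,825,435</b></td>
<td width="25%">&nbsp;</td>
</tr></table>
"""


def get_labelled_value(soup, field_name):
    """Return the text printed next to a label, or None if nothing is found.

    Hypothetical helper: it tries every text node that matches field_name,
    because the label may appear on the page more than once.
    """
    pattern = re.compile(field_name)
    for label in soup.find_all(string=pattern):
        # Layout 1 (Domestic): the value tag is a sibling of the label text.
        sibling = label.find_next_sibling(True)
        if sibling is not None:
            return sibling.get_text(strip=True)
        # Layout 2 (Worldwide): the label is wrapped in its own tag, and the
        # value lives in the table cell that follows the label's cell.
        cell = label.find_parent("td")
        if cell is not None:
            next_cell = cell.find_next_sibling("td")
            if next_cell is not None:
                return next_cell.get_text(strip=True)
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(get_labelled_value(soup, "Worldwide:"))
print(get_labelled_value(soup, "Domestic Total Gross"))
```

Searching for "Worldwide:" with the trailing colon, as it appears in the pasted row, is a guess at how to skip other mentions of the word on the page. If another match still comes first, the pattern or the row lookup would need to be narrowed further.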
