在程序中中导入 Beautiful Soup库:
from BeautifulSoup import BeautifulSoup # For processing HTML from BeautifulSoup import BeautifulStoneSoup # For processing XML import BeautifulSoup # To get everything
下面的代码是Beautiful Soup基本功能的示范。你可以复制粘贴到你的python文件中,自己运行看看。
from BeautifulSoup import BeautifulSoup import re doc = ['<html><head><title>Page title</title></head>', '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.', '<p id="secondpara" align="blah">This is paragraph <b>two</b>.', '</html>'] soup = BeautifulSoup(''.join(doc)) print soup.prettify() # <html> # <head> # <title> # Page title # </title> # </head> # <body> # <p id="firstpara" align="center"> # This is paragraph # <b> # one # </b> # . # </p> # <p id="secondpara" align="blah"> # This is paragraph # <b> # two # </b> # . # </p> # </body> # </html>
navigate soup的一些方法:
soup.contents[0].name # u'html' soup.contents[0].contents[0].name # u'head' head = soup.contents[0].contents[0] head.parent.name # u'html' head.next # <title>Page title</title> head.nextSibling.name # u'body' head.nextSibling.contents[0] # <p id="firstpara" align="center">This is paragraph <b>one</b>.</p> head.nextSibling.contents[0].nextSibling # <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
下面是一些方法搜索soup,获得特定标签或有着特定属性的标签:
titleTag = soup.html.head.title titleTag # <title>Page title</title> titleTag.string # u'Page title' len(soup('p')) # 2 soup.findAll('p', align="center") # [<p id="firstpara" align="center">This is paragraph <b>one</b>. </p>] soup.find('p', align="center") # <p id="firstpara" align="center">This is paragraph <b>one</b>. </p> soup('p', align="center")[0]['id'] # u'firstpara' soup.find('p', align=re.compile('^b.*'))['id'] # u'secondpara' soup.find('p').b.string # u'one' soup('p')[1].b.string # u'two'
修改soup也很简单:
titleTag['id'] = 'theTitle' titleTag.contents[0].replaceWith("New title") soup.html.head # <head><title id="theTitle">New title</title></head> soup.p.extract() soup.prettify() # <html> # <head> # <title id="theTitle"> # New title # </title> # </head> # <body> # <p id="secondpara" align="blah"> # This is paragraph # <b> # two # </b> # . # </p> # </body> # </html> soup.p.replaceWith(soup.b) # <html> # <head> # <title id="theTitle"> # New title # </title> # </head> # <body> # <b> # two # </b> # </body> # </html> soup.body.insert(0, "This page used to have ") soup.body.insert(2, " <p> tags!") soup.body # <body>This page used to have <b>two</b> <p> tags!</body>
一个实际例子,用于抓取 ICC Commercial Crime Services weekly piracy report页面, 使用Beautiful Soup剖析并获得发生的盗版事件:
import urllib2 from BeautifulSoup import BeautifulSoup page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.html") soup = BeautifulSoup(page) for incident in soup('td', width="90%"): where, linebreak, what = incident.contents[:3] print where.strip() print what.strip() print