python Beautiful Soup python网络爬虫精解之Beautiful Soup的使用教程
小狐狸梦想去童话镇 人气:0一、Beautiful Soup的介绍
Beautiful Soup是一个强大的解析工具,它借助网页结构和属性等特性来解析网页。
它提供一些函数来处理导航、搜索、修改分析树等功能,Beautiful Soup不需要考虑文档的编码格式。Beautiful Soup在解析时实际上需要依赖解析器,常用的解析器是lxml。
二、Beautiful Soup的使用
test03.html测试实例:
<!DOCTYPE html> <html> <head> <meta content="text/html;charset=utf-8" http-equiv="content-type" /> <meta content="IE=Edge" http-equiv="X-UA-Compatible" /> <meta content="always" name="referrer" /> <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="stylesheet" type="text/css" /> <title>百度一下,你就知道 </title> </head> <body link="#0000cc"> <div id="wrapper"> <div id="head"> <div class="head_wrapper"> <div id="u1"> <a class="mnav" href="http://news.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trnews">新闻 </a> <a class="mnav" href="https://www.hao123.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trhao123">hao123 </a> <a class="mnav" href="http://map.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trmap">地图 </a> <a class="mnav" href="http://v.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trvideo">视频 </a> <a class="mnav" href="http://tieba.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trtieba">贴吧 </a> <a class="bri" href="//www.baidu.com/more/" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_briicon" style="display: block;">更多产品 </a> </div> </div> </div> </div> </body> </html>
1、节点选择器
我们之前了解到,一个网页是由若干个元素节点组成的,通过提取某个节点的具体内容,就可以获取到界面呈现的一些数据。使用节点选择器能够简化我们获取数据的过程,在不使用正则表达式的前提下,精准的获取数据。
from bs4 import BeautifulSoup file = open("./test03.html",'rb') html = file.read() soup = BeautifulSoup(html,'lxml') print(soup.head) print(soup.head.title) print(soup.a)
【运行结果】
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="stylesheet" type="text/css"/>
<title>百度一下,你就知道 </title>
</head>
<title>百度一下,你就知道 </title>
<a class="mnav" href="http://news.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trnews">新闻 </a>
分析:
第一条打印数据为获取网页的head节点;
第二条打印内容是获取head节点中title节点,获取该节点使用了一个嵌套选择,因为title节点是嵌套在head节点里面的;
第三条打印内容是获取a节点,在源码中我们看到有许多条a节点,而只匹配到第一个a节点就结束了。当有多个节点时,这种选择方式指只会选择第一个匹配的节点,其他后面节点会忽略。
2、提取信息
一般我们需要的数据位于节点名、属性值、文本值中,以下代码展示了如何获取这三个地方的数据:
from bs4 import BeautifulSoup file = open("./test03.html",'rb') html = file.read() soup = BeautifulSoup(html,'lxml') print(soup.body.name) print(soup.body.a.attrs['class']) print(soup.body.a.attrs['href']) print(soup.body.a.string)
【运行结果】
body
['mnav']
http://news.baidu.com
新闻
分析:
第一条获取body节点名;
第二条获取a节点class属性值;
第三条获取a节点href属性值;
第四条获取a节点的文本值;
3、关联选择
(1)子节点和子孙节点
子节点可以调用contents属性和children属性,子孙节点可以调用descendants属性,他们返回结果都是生成器类型,通过for循环输出匹配到的信息。
from bs4 import BeautifulSoup file = open("./test03.html",'rb') html = file.read() soup = BeautifulSoup(html,'lxml') # print(soup.body.contents) for i,content in enumerate(soup.body.contents): print(i,content)
【运行结果】
0
1 <div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="http://news.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_briicon" style="display: block;">更多产品 </a>
</div>
</div>
</div>
</div>
2
(2)父节点和祖先节点
获取某个节点的父节点可以调用parent属性,例如获取实例中title节点的父节点:
file = open("./test03.html",'rb') html = file.read() soup = BeautifulSoup(html,'lxml') print(soup.title.parent)
【运行结果】
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="stylesheet" type="text/css"/>
<title>百度一下,你就知道 </title>
</head>
同理,如果是想要获取节点的祖先节点,则可调用parents属性。
(3)兄弟节点
调用next_sibling获取节点的下一个兄弟元素;
调用previous_sibling获取节点的上一个兄弟元素;
调用next_siblings取节点的下一个兄弟节点;
调用previous_siblings获取节点的上一个兄弟节点;
4、方法选择器
find_all()
查找所有符合条件的元素,其使用方法如下:
find_all(name,attrs,recursive,text,**kwargs)
(1)name
根据节点名来查询元素,例如查询实例中a标签元素:
from bs4 import BeautifulSoup file = open("./test03.html",'rb') html = file.read() soup = BeautifulSoup(html,'lxml') print(soup.find_all(name = "a")) for a in soup.find_all(name = "a"): print(a)
【运行结果】
[<a class="mnav" href="http://news.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trnews">新闻 </a>, <a class="mnav" href="https://www.hao123.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trmap">地图 </a>, <a class="mnav" href="http://v.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trvideo">视频 </a>, <a class="mnav" href="http://tieba.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trtieba">贴吧 </a>, <a class="bri" href="//www.baidu.com/more/" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_briicon" style="display: block;">更多产品 </a>]
<a class="mnav" href="http://news.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_briicon" style="display: block;">更多产品 </a>
(2)attrs
在查询时我们还可以传入标签的属性,attrs参数的数据类型是字典。
from bs4 import BeautifulSoup file = open("./test03.html",'rb') html = file.read() soup = BeautifulSoup(html,'lxml') print(soup.find_all(name = "a",attrs = {"class":"bri"}))
【运行结果】
[<a class="bri" href="//www.baidu.com/more/" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_briicon" style="display: block;">更多产品 </a>]
可以看到,在加上class=“bri”属性时,查询结果就只剩一条a标签元素。
(3)text
text参数可以用来匹配节点的文本,传入的可以是字符串,也可以是正则表达式对象。
import re from bs4 import BeautifulSoup file = open("./test03.html",'rb') html = file.read() soup = BeautifulSoup(html,'lxml') print(soup.find_all(name = "a",text = re.compile('新闻')))
【运行结果】
[<a class="mnav" href="http://news.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trnews">新闻 </a>]
只包含文本内容为“新闻”的a标签。
find()
find()的使用与前者相似,唯一不同的是,find进匹配搜索到的第一个元素,然后返回单个元素,find_all()则是匹配所有符合条件的元素,返回一个列表。
5、CSS选择器
使用CSS选择器时,调用select()方法,传入相应的CSS选择器;
例如使用CSS选择器获取实例中的a标签
from bs4 import BeautifulSoup file = open("./test03.html",'rb') html = file.read() soup = BeautifulSoup(html,'lxml') print(soup.select('a')) for a in soup.select('a'): print(a)
【运行结果】
[<a class="mnav" href="http://news.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trnews">新闻 </a>, <a class="mnav" href="https://www.hao123.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trmap">地图 </a>, <a class="mnav" href="http://v.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trvideo">视频 </a>, <a class="mnav" href="http://tieba.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trtieba">贴吧 </a>, <a class="bri" href="//www.baidu.com/more/" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_briicon" style="display: block;">更多产品 </a>]
<a class="mnav" href="http://news.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trnews">新闻 </a>
<a class="mnav" href="https://www.hao123.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trmap">地图 </a>
<a class="mnav" href="http://v.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trvideo">视频 </a>
<a class="mnav" href="http://tieba.baidu.com" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_trtieba">贴吧 </a>
<a class="bri" href="//www.baidu.com/more/" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" name="tj_briicon" style="display: block;">更多产品 </a>
获取属性
获取上述a标签中的href属性
from bs4 import BeautifulSoup file = open("./test03.html",'rb') html = file.read() soup = BeautifulSoup(html,'lxml') for a in soup.select('a'): print(a['href'])
【运行结果】
http://news.baidu.com
https://www.hao123.com
http://map.baidu.com
http://v.baidu.com
http://tieba.baidu.com
//www.baidu.com/more/
获取文本
获取上述a标签的文本内容,使用get_text()方法,或者是string获取文本内容
from bs4 import BeautifulSoup file = open("./test03.html",'rb') html = file.read() soup = BeautifulSoup(html,'lxml') for a in soup.select('a'): print(a.get_text()) print(a.string)
【运行结果】
新闻
新闻
hao123
hao123
地图
地图
视频
视频
贴吧
贴吧
更多产品
更多产品
加载全部内容