北大青鸟 (Beida Jade Bird): The Python Crawler Notes You Need

Sometimes a crawler needs to fetch more than text; it may also need to fetch images. Let's look at how to save crawled images to the local disk.

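First we pick a target, the Zhihu topic "How can women work out to build a good figure?" (chosen purely because it has lots of images, so don't read anything into it (#_#)).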

Look at the page source and find the format of the image links under the topic.

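As you can see, each image sits in an img tag with class=origin_image zh-lightbox-thumb, and its link ends in .jpg. We can therefore combine Beautiful Soup with a regular expression to extract all of the links, as follows: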

links = soup.find_all('img', class_='origin_image zh-lightbox-thumb', src=re.compile(r'\.jpg$'))
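To see how this kind of filter behaves without hitting the live site, here is a minimal sketch that runs it against a small hand-written HTML snippet. The snippet and its example.com URLs are made up for illustration; only the tag name, the class value, and the .jpg suffix come from the description above.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page is far larger.
SAMPLE_HTML = """
<div>
  <img class="origin_image zh-lightbox-thumb" src="https://example.com/a.jpg">
  <img class="origin_image zh-lightbox-thumb" src="https://example.com/b.png">
  <img class="avatar" src="https://example.com/c.jpg">
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# class_ with a multi-word string matches that exact class attribute value;
# the compiled regex is searched against src, so only links ending in ".jpg" pass.
links = soup.find_all('img', class_='origin_image zh-lightbox-thumb',
                      src=re.compile(r'\.jpg$'))

image_urls = [link.attrs['src'] for link in links]
print(image_urls)
```

Only the first tag should survive: the second has the right class but a .png source, and the third ends in .jpg but has a different class. Escaping the dot (`\.jpg$`) matters, because an unescaped `.` would match any character before "jpg".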

After extracting all the links, use request.urlretrieve to save each one to the local disk. The Python documentation describes it like this:

Copy a network object denoted by a URL to a local file. If the URL points to a local file, the object will not be copied unless filename is supplied. Return a tuple (filename, headers) where filename is the local file name under which the object can be found, and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object). Exceptions are the same as for urlopen().

The complete implementation is as follows:

# -*- coding:utf-8 -*-
import time

import os
import re
from urllib import request

from bs4 import BeautifulSoup

url = r'https://www.zhihu.com/question/22918070'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

# Send a browser-like User-Agent with the request
page = request.Request(url, headers=headers)
page_info = request.urlopen(page).read().decode('utf-8')
soup = BeautifulSoup(page_info, 'html.parser')

# Combine Beautiful Soup with a regular expression to extract every image link
# (img tags with class "origin_image zh-lightbox-thumb" whose src ends in .jpg)
links = soup.find_all('img', class_='origin_image zh-lightbox-thumb', src=re.compile(r'\.jpg$'))

# Set the save path; otherwise files are saved in the program's current directory
local_path = r'E:\Pic'

for link in links:
    print(link.attrs['src'])
    # Save the image under a timestamp-based name to avoid naming conflicts
    request.urlretrieve(link.attrs['src'], os.path.join(local_path, '%s.jpg' % time.time()))
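The download step can also be tried offline. The sketch below is only an illustration under stated assumptions: the helper name save_all is hypothetical, a local file addressed through a file:// URL stands in for a remote image, and the loop index appended to the timestamp is an addition that is not in the script above. It relies on the documented behavior that urlretrieve copies a local file only when a filename is supplied.

```python
import os
import tempfile
import time
from pathlib import Path
from urllib import request


def save_all(urls, dest_dir):
    """Download each URL into dest_dir and return the saved file paths.

    Hypothetical helper for illustration; not part of the original script.
    """
    os.makedirs(dest_dir, exist_ok=True)  # urlretrieve does not create folders
    saved = []
    for index, url in enumerate(urls):
        # Timestamp plus loop index, so two fast downloads cannot share a name.
        name = '%d_%d.jpg' % (int(time.time() * 1000), index)
        target = os.path.join(dest_dir, name)
        # urlretrieve returns (filename, headers); filename is the target path.
        filename, _headers = request.urlretrieve(url, target)
        saved.append(filename)
    return saved


# Stand-in for a remote image: a local file addressed through a file:// URL.
work_dir = tempfile.mkdtemp()
source = Path(work_dir) / 'source.jpg'
source.write_bytes(b'not a real image')

saved_paths = save_all([source.as_uri(), source.as_uri()],
                       os.path.join(work_dir, 'Pic'))
print(saved_paths)
```

Creating the destination folder up front avoids an error when the save path does not exist yet, and the index suffix keeps names distinct even if two downloads finish within the same clock tick.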
