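Filter the img tags out of the page source, take the image URL from each src, download the image locally, and use XPath to change the src value to the corresponding local path.
Sometimes the image URL in src is not usable. In that case use src2 or data-src instead, because some sites control src dynamically through JS as a defence against crawlers.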
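The attribute fallback described above can be sketched as a small helper. This is a minimal sketch rather than part of the original script: the name `pick_image_url`, the candidate order, and the rule of skipping inline `data:` placeholders are assumptions made for illustration. It only needs an object with a dict-style `get`, so a plain dict or an lxml element's `attrib` should both work.

```python
def pick_image_url(attrib, candidates=("data-src", "src2", "src")):
    """Return the first usable image URL among the candidate attributes, or None.

    `attrib` is any mapping of attribute names to values.
    The candidate order is an assumption: lazy-load attributes first, src last.
    """
    for name in candidates:
        value = (attrib.get(name) or "").strip()
        # Skip empty values and inline data: placeholders left in src by lazy loaders
        if value and not value.startswith("data:"):
            return value
    return None
```

Inside a loop over img elements, `pick_image_url(img.attrib)` could replace a lookup of a single attribute; images for which it returns None are simply left untouched.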
import requests
import time
import os
from lxml import etree

# Build the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}

def getPage(url):
    print("download page ---")
    response = requests.get(url, headers=headers)
    html = response.content.decode("utf-8")  # decode so the source can be parsed with XPath
    xxpath = etree.HTML(html)
    imgs = xxpath.xpath("//img")  # select the img tags
    for img in imgs:
        src = img.xpath("./@data-src")  # take the usable image URL
        if len(src) > 0:
            src = src[0]
            l_src = getImg(src)  # download the image
            img.attrib['src'] = l_src  # point src at the local image path
            img.attrib['data-src'] = l_src
    divs = xxpath.xpath('//*[@id="js_content"]')
    for div in divs:
        div.attrib['style'] = "visibility: visible;"  # make the page content visible
    html = etree.tostring(xxpath, encoding="utf-8").decode("utf-8")  # serialize the tree back to page source
    return html

def getImg(url):
    print("download img:", url)
    response = requests.get(url, headers=headers)
    img = response.content
    rn = time.time()  # timestamp used as the file name
    fn = '/Users/moonmen/Desktop/oop/{}.jpg'.format(rn)
    os.makedirs(os.path.dirname(fn), exist_ok=True)  # create the directory if it does not exist
    with open(fn, 'wb') as f:
        f.write(img)
    print("local img:", fn)
    return fn

html = getPage('https://mp.weixin.qq.com/s/X15LNTgXMWV-I224NJ_U1A')
print("save page -----")
# Write as UTF-8 explicitly so the result does not depend on the platform default encoding
with open(r'/Users/moonmen/Desktop/oop/page.html', mode='w', encoding='utf-8') as f:
    f.write(html)
print("finish------")
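The script names every file by timestamp and always uses the .jpg extension, whatever the real image format is. One possible alternative is sketched below. It is not the original author's method: the helper name `local_image_name` is hypothetical, and reading the format from a `wx_fmt` query parameter is an assumption about how the image URLs on this kind of page look. The file name is derived from a hash of the URL, so the same image always maps to the same local file.

```python
import hashlib
import os
from urllib.parse import urlparse, parse_qs

def local_image_name(url, default_ext="jpg"):
    """Build a stable local file name for an image URL (hypothetical helper)."""
    parsed = urlparse(url)
    # First try the extension in the URL path, e.g. /a/b.png
    ext = os.path.splitext(parsed.path)[1].lstrip(".").lower()
    if not ext:
        # Assumption: the format may be carried in a wx_fmt query parameter
        ext = parse_qs(parsed.query).get("wx_fmt", [default_ext])[0].lower()
    # Hash the full URL so identical URLs give identical names
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return "{}.{}".format(digest, ext)
```

In getImg, the result could be joined onto the target directory with `os.path.join` in place of the timestamp-based name.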
Notice
This article is an original work by 崔维友 (威格灵, cuiweiyou, vigiles). If you repost it, please credit the source: http://www.gaohaiyan.com/2794.html