[Python] Crawling a site's sitemap URLs with Scrapy

From , 2020-11-20, filed under Python, viewed 118 times.
URL http://www.code666.cn/view/7866c91c
# This script quickly crawls pages based on the sitemap(s) of any given domain.
# Neither XmlXPathSelector nor XmlItemExporter worked well for this; running a
# regex over the raw sitemap body turned out to be the fastest and simplest way
# I have found to crawl a site from the URLs listed in its sitemap.

import re

from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.utils.response import body_or_str
from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector


class PageItem(Item):
    # Mock item holding whatever fields you extract from each page
    divText = Field()


class SitemapSpider(BaseSpider):
    name = "SitemapSpider"
    start_urls = ["http://www.domain.com/sitemap.xml"]

    def parse(self, response):
        nodename = 'loc'
        text = body_or_str(response)
        # Capture the text between each <loc ...> and </loc> pair
        r = re.compile(r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL)
        for match in r.finditer(text):
            url = match.group(2)
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)

        item = PageItem()

        # Do all your page parsing and select the elements you want
        item['divText'] = hxs.select('//div/text()').extract()[0]
        yield item
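The regex extraction in parse() can be exercised outside of Scrapy with nothing but the standard library. A minimal sketch of the same <loc> matching, run against a hand-written sample sitemap body (the sample XML and its URLs are placeholders, not from the original post):

```python
import re

# Placeholder sitemap body; in the spider above this text comes from
# body_or_str(response).
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.domain.com/page1</loc></url>
  <url><loc>http://www.domain.com/page2</loc></url>
</urlset>"""

def extract_locs(text, nodename="loc"):
    # Same pattern as the spider: group 2 captures everything between the
    # opening <loc ...> tag and the closing </loc> tag.
    r = re.compile(r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL)
    return [m.group(2) for m in r.finditer(text)]

print(extract_locs(sitemap))
# → ['http://www.domain.com/page1', 'http://www.domain.com/page2']
```

Because the sitemap namespace often defeats naive XPath queries, this regex approach sidesteps the namespace issue entirely, at the cost of assuming well-formed <loc> tags.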
