[Python] Scrapy: code for logging in before crawling

Posted 2020-07-30 in Python, viewed 153 times.
URL http://www.code666.cn/view/3bf75f71
from scrapy.http import Request, FormRequest
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy.spiders.init import InitSpider


class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    login_page = 'http://www.domain.com/login'
    start_urls = ['http://www.domain.com/useful_page/',
                  'http://www.domain.com/another_useful_page/']

    # Note: InitSpider subclasses Spider, not CrawlSpider, so these rules are
    # only followed if rule handling is added (for example by basing the
    # spider on CrawlSpider instead).
    rules = (
        # The dot is escaped so that only URLs ending in "-<word>.html" match.
        Rule(LinkExtractor(allow=r'-\w+\.html$'),
             callback='parse_item', follow=True),
    )

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request from the form on the login page."""
        return FormRequest.from_response(
            response,
            formdata={'name': 'herman', 'password': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        # response.text is the decoded body; response.body is bytes on Python 3.
        if "Hi Herman" in response.text:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin. The result of initialized() must be
            # returned, otherwise the start_urls requests are never scheduled.
            return self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse_item(self, response):
        # Scrape data from page. Placeholder: replace with real extraction.
        yield {'url': response.url}
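
The `allow` pattern in the rule is meant to select URLs ending in `-<word>.html`. A quick way to check a pattern like this outside Scrapy is the standard `re` module. The sketch below compares an unescaped dot with an escaped one, assuming the link extractor searches for the pattern anywhere in the URL. The sample URLs and the helper name `matching` are made up for illustration.

```python
import re

# Unescaped dot: "." matches any single character before "html".
LOOSE = re.compile(r'-\w+.html$')
# Escaped dot: only a literal "." before "html" is accepted.
STRICT = re.compile(r'-\w+\.html$')

# Hypothetical URLs, invented for this illustration.
SAMPLE_URLS = [
    'http://www.domain.com/useful_page/item-one.html',
    'http://www.domain.com/useful_page/item-oneXhtml',
    'http://www.domain.com/useful_page/index.html',
]


def matching(pattern, urls):
    """Return the URLs in which the pattern is found anywhere (re.search)."""
    return [url for url in urls if pattern.search(url)]


loose_hits = matching(LOOSE, SAMPLE_URLS)
strict_hits = matching(STRICT, SAMPLE_URLS)
print(loose_hits)
print(strict_hits)
```

With these samples the unescaped pattern also accepts the `item-oneXhtml` URL, while the escaped one keeps only the real `.html` page. The `index.html` URL is skipped by both because it contains no hyphen.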
