[Python] Python demo: scraping Baidu Baike

Posted 2020-12-28 in Python, viewed 119 times.
URL http://www.code666.cn/view/7d92c088
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Filename: get_baike.py
# Ported to Python 3: urllib2 is now urllib.request and print is a function.

import re
import urllib.request


def getHtml(url, timeout=10):
    """Download url and return the raw response body as bytes."""
    response = urllib.request.urlopen(url, timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()


def clearBlank(html):
    """Strip line breaks and tabs, then collapse runs of spaces."""
    if len(html) == 0:
        return ''
    html = re.sub(r'\r|\n|\t', '', html)
    # Assumption: the original loop replaced &nbsp; entities, which the page
    # rendering turned into plain spaces; as pasted it would never terminate.
    html = html.replace('&nbsp;', ' ').replace('\xa0', ' ')
    return re.sub(r' {2,}', ' ', html)


if __name__ == '__main__':
    html = getHtml('http://baike.baidu.com/view/994462.htm', 10)
    # The page is assumed to be GB2312-encoded, as in the original script.
    html = html.decode('gb2312', 'replace')

    title_reg = r'<h1 class="title" id="\d+">(.*?)</h1>'
    content_reg = r'<div class="card-summary-content">(.*?)</p>'

    title = re.findall(title_reg, html)
    content = re.findall(content_reg, html)

    # Fail with a clear message instead of an IndexError on title[0].
    if not title or not content:
        raise SystemExit('Title or summary not found; the page markup may have changed.')

    # Strip any tags left inside the captured fragments.
    title[0] = re.sub(r'<[^>]*?>', '', title[0])
    content[0] = re.sub(r'<[^>]*?>', '', content[0])

    print(title[0])
    print('#######################')
    print(content[0])
# //python/5589
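
The extraction step can be tried without touching the network. The sketch below is a minimal illustration and not part of the original paste: `SAMPLE_HTML` is hypothetical markup shaped to fit the two patterns, and `strip_markup` and `extract` are names invented here. It also assumes `re.S` is wanted so a summary that spans several lines still matches, which the original script does not do.

```python
import re

# Hypothetical sample markup, shaped like what the script's patterns expect.
SAMPLE_HTML = (
    '<h1 class="title" id="994462"><a href="#">Python</a></h1>\n'
    '<div class="card-summary-content"><p>Python is a <b>programming</b>\n'
    'language.&nbsp;&nbsp;It is widely used.</p></div>'
)

# Same patterns as the script; re.S (an assumption) lets "." match newlines.
TITLE_RE = re.compile(r'<h1 class="title" id="\d+">(.*?)</h1>', re.S)
CONTENT_RE = re.compile(r'<div class="card-summary-content">(.*?)</p>', re.S)
TAG_RE = re.compile(r'<[^>]*?>')


def strip_markup(fragment):
    """Remove tags, turn &nbsp; into spaces, and collapse whitespace."""
    text = TAG_RE.sub('', fragment)
    text = text.replace('&nbsp;', ' ')
    return re.sub(r'\s+', ' ', text).strip()


def extract(html):
    """Return (title, summary), or None when either pattern fails to match."""
    title = TITLE_RE.search(html)
    content = CONTENT_RE.search(html)
    if title is None or content is None:
        return None
    return strip_markup(title.group(1)), strip_markup(content.group(1))


if __name__ == '__main__':
    print(extract(SAMPLE_HTML))
```

Returning `None` on a failed match keeps the caller in charge of what to do when the page layout changes, rather than failing on an index into an empty list.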
