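I have a TXT file containing a list of 200,000 URLs and wanted to extract the title, description, and keywords from every one of those pages. A plain shell loop only fetches one page at a time, and scraping with curl proved unreliable. 火车头采集器 (the LocoySpider scraper) does support multithreading, but with a URL list this large the import step alone takes unbearably long. So I decided to write a multithreaded scraping tool in Python, which is lightweight yet powerful. With help from Python expert @老姜 the problem was finally solved, and I am posting the code here for future reference: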
The code runs 4 threads by default. If you need more, just change the 4 in test(l,4) to a larger number.
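To see the thread-count knob in isolation, here is a minimal sketch of the same queue-plus-workers pattern. It is written for Python 3 (the script in this post is Python 2), and `fake_fetch`, `run_workers`, and `num_threads` are hypothetical names made up for illustration. The download is replaced by a stub so the sketch runs without network access.

```python
import queue
import threading


def fake_fetch(url):
    # Hypothetical stand-in for the real page download and parsing.
    return "title of " + url


def run_workers(urls, num_threads=4):
    """Process urls with num_threads daemon workers; return {url: result}."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            try:
                value = fake_fetch(url)
                with lock:  # only one thread updates the dict at a time
                    results[url] = value
            finally:
                tasks.task_done()  # always mark done so join() can return

    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()
    for url in urls:
        tasks.put(url)
    tasks.join()  # block until every queued URL has been processed
    return results


if __name__ == "__main__":
    urls = ["http://example.com/%d" % i for i in range(10)]
    demo = run_workers(urls, num_threads=8)
    print(len(demo))
```

Raising `num_threads` only helps while the work is dominated by waiting on the network. Past some point the remote servers or the local connection become the bottleneck, so it is worth increasing the value gradually.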
The code depends on the BeautifulSoup library. For instructions on installing it on Windows, see this tutorial: "[Tested and working!] Installing Beautiful Soup 4 on Windows".
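# -*- coding: utf-8 -*-
# Python 2 script: fetch title/description/keywords for every URL in url.txt.
from bs4 import BeautifulSoup
import requests
import threading
import Queue
import time

write_lock = threading.Lock()  # serializes appends to tdk.txt


def btdk(url):
    """Return (title, description, keywords) for url as UTF-8 byte strings."""
    try:
        html = requests.get(url, timeout=10).text
    except Exception:
        # On failure, fall back to a stub page whose title is the URL itself.
        html = ('<html><title>%s</title>'
                '<meta name="keywords" content="" />'
                '<meta name="description" content="" /></html>') % url
    # Lowercase the markup so that tags such as <META NAME="Keywords"> match.
    soup = BeautifulSoup(html.lower(), 'html.parser')
    try:
        t = soup.title.text.encode('utf8', 'ignore')
    except Exception:
        t = ""  # page has no <title>
    try:
        k = soup.find(attrs={"name": "keywords"})['content'].encode('utf8', 'ignore')
    except Exception:
        k = ""
    try:
        d = soup.find(attrs={"name": "description"})['content'].encode('utf8', 'ignore')
    except Exception:
        d = ""
    return t, d, k


class MyThread(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            url = self.queue.get()
            try:
                t, d, k = btdk(url)  # same order as btdk's return value
                with write_lock:
                    with open('tdk.txt', 'a+') as s:
                        s.write(url + '#' + t + '#' + d + '#' + k + '\n')
            finally:
                # Always mark the task done, otherwise queue.join() hangs.
                self.queue.task_done()


def test(l, ts=4):
    ll = [i.strip() for i in l if i.strip()]  # drop newlines and blank lines
    for j in range(ts):
        t = MyThread(queue)
        t.setDaemon(True)  # daemon threads exit when the main thread finishes
        t.start()
    for url in ll:
        queue.put(url)
    queue.join()  # wait until every URL has been processed


if __name__ == '__main__':
    with open('url.txt') as f:
        l = f.readlines()
    queue = Queue.Queue()
    start = time.time()
    test(l, 4)
    end = time.time()
    print 'Total time: %s seconds' % (end - start)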
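The extraction step can also be tried offline on an HTML string. The sketch below is a substitution, not the method used in this post: it swaps BeautifulSoup for Python 3's standard-library `html.parser` so that it runs with no extra install. `TDKParser` and `extract_tdk` are hypothetical names, and unlike the script it does not lowercase the page content.

```python
from html.parser import HTMLParser


class TDKParser(HTMLParser):
    """Collects <title> text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {"keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            if name in self.meta:
                self.meta[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_tdk(html):
    """Return (title, description, keywords) parsed from an HTML string."""
    parser = TDKParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.meta["description"], parser.meta["keywords"]


if __name__ == "__main__":
    page = ('<html><head><title>Demo</title>'
            '<meta name="keywords" content="a,b" /></head></html>')
    print(extract_tdk(page))
```

The return order (title, description, keywords) is the same order `btdk` returns, and missing fields come back as empty strings rather than raising.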
2017-08-26 8:27 AM
https://github.com/xingshu1990/python_demo/blob/master/TDK
Changed it to 3, and the encoding problem is resolved for now.