Multithreaded scraping of website title/description/keywords with Python

Category: Python notes
Posted by 方法SEO顾问 on 2015-06-08 15:50:03, 1,417 characters. When reposting, please credit: Multithreaded scraping of website title/description/keywords with Python (方法SEO顾问)

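I had a TXT file listing 200,000 URLs and wanted to extract the title, description, and keywords of every one of those pages. A plain shell loop fetches them one at a time, and scraping with curl proved unreliable. The 火车头采集器 scraper does support multithreading, but with a URL list this large, just importing the URLs takes unbearably long. So I decided to build a multithreaded scraper in Python, which is lightweight yet powerful. With help from Python expert @老姜 the problem was finally solved, and I am posting the code here for future reference: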

The code runs 4 threads by default. To use more threads, change the 4 in test(l,4) to a larger number.
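
As a rough illustration of what that thread-count argument controls, here is a minimal Python 3 sketch of the same queue-plus-workers pattern. It is not the script from this post: run_pool, handler, and num_threads are hypothetical names chosen for the example, and the handler is a stand-in for the real page-fetching function.

```python
import queue
import threading


def run_pool(items, handler, num_threads=4):
    """Process items with num_threads daemon workers fed from one shared queue."""
    tasks = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            try:
                value = handler(item)
            except Exception as exc:
                # Record the failure instead of letting the worker thread die.
                value = exc
            with results_lock:
                results.append((item, value))
            # Mark the item done even on failure, so tasks.join() can return.
            tasks.task_done()

    # More threads means more items in flight at once.
    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()
    for item in items:
        tasks.put(item)
    tasks.join()  # Blocks until every queued item has been marked done.
    return results


if __name__ == "__main__":
    # len() stands in for a real fetch-and-parse function.
    urls = ["http://example.com/%d" % i for i in range(10)]
    print(len(run_pool(urls, len, num_threads=8)))
```

Results arrive in completion order, not input order, which is why each result is stored together with the item it belongs to.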

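The code depends on the BeautifulSoup library (and also imports requests). For installing BeautifulSoup on Windows, see this tutorial: [Tested and working!] Installing Beautiful Soup 4 on Windows. Note that the script is written for Python 2: it imports the Queue module and uses the print statement.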

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import requests
import threading
import Queue
import time

with open('url.txt') as f:
    l = f.readlines()


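def btdk(url):
    # Fetch the page; on any failure fall back to a stub page whose title is the URL.
    try:
        html = requests.get(url, timeout = 10).text
    except Exception:
        html = '<html><title>%s</title><meta name="keywords" content="" /><meta name="description" content="" /></html>'%url
    # Lowercasing lets name="Keywords" match too (it also lowercases the extracted text).
    soup = BeautifulSoup(html.lower(), 'html.parser')
    try:
        t = soup.title.text.strip().encode('utf8','ignore')
    except AttributeError:
        # The page has no <title> tag.
        t = ""
    try:
        k = soup.find(attrs={"name":"keywords"})['content'].encode('utf8','ignore')
    except (TypeError, KeyError):
        # No keywords meta tag, or the tag has no content attribute.
        k = ""
    try:
        d = soup.find(attrs={"name":"description"})['content'].encode('utf8','ignore')
    except (TypeError, KeyError):
        d = ""

    # Order of the returned tuple: title, description, keywords.
    return t,d,k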


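class MyThread(threading.Thread):

    # Shared by all workers so that appends to tdk.txt do not interleave.
    lock = threading.Lock()

    def __init__(self, queue, url):
        threading.Thread.__init__(self)
        self.queue = queue
        self.url = url

    def run(self):
        while True:
            url = self.queue.get()
            try:
                # btdk returns (title, description, keywords), in that order.
                t,d,k = btdk(url)
                line = url+'#'+t+'#'+d+'#'+k+'\n'
                with self.lock:
                    with open('tdk.txt', 'a+') as s:
                        s.write(line)
            except Exception as e:
                # Keep the worker alive if a single URL fails.
                print 'failed: %s (%s)' % (url, e)
            finally:
                # Always mark the item done, otherwise queue.join() never returns.
                self.queue.task_done()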


def test(l, ts=4):
    ll = [i.rstrip() for i in l]
    for j in range(ts):
        t = MyThread(queue,ll)
        t.setDaemon(True)
        t.start()
    for url in ll:
        queue.put(url)
    queue.join()


if __name__ == '__main__':
    queue = Queue.Queue()
    start = time.time()
    test(l,4)
    end = time.time()
    print 'Total time: %s seconds' % (end - start)
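
For readers on Python 3, the extraction step can be tried on its own without any network access. The sketch below is an assumption-laden illustration, not the code from this post: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra installs, and TDKParser and extract_tdk are hypothetical names.

```python
from html.parser import HTMLParser


class TDKParser(HTMLParser):
    """Collect the first <title> plus the description and keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {"description": "", "keywords": ""}
        self._in_title = False
        self._title_seen = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        attrs = dict(attrs)
        if tag == "title" and not self._title_seen:
            self._in_title = True
        elif tag == "meta":
            # Compare the name attribute case-insensitively, e.g. name="Keywords".
            name = (attrs.get("name") or "").lower()
            if name in self.meta and not self.meta[name]:
                self.meta[name] = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._title_seen = True

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_tdk(html):
    """Return (title, description, keywords); missing fields are empty strings."""
    parser = TDKParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.meta["description"], parser.meta["keywords"]


if __name__ == "__main__":
    sample = ('<html><head><title>Example</title>'
              '<meta name="description" content="Demo page"></head></html>')
    print(extract_tdk(sample))
```

Unlike the script above, this sketch does not lowercase the whole document, so the extracted text keeps its original casing.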

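  • Copyright notice: unless otherwise noted, all articles on this blog are original work by 北京SEO方法. When reposting or quoting, please credit this article with a hyperlink to its address; otherwise the disregard for copyright will be called out publicly within the SEO community. Thank you for your cooperation! Article address: https://seofangfa.com/python-note/python-title-description-keywords.html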

Comments (1)

    • 行书

      https://github.com/xingshu1990/python_demo/blob/master/TDK
      Ported it to Python 3; the encoding problem is solved for now.