Improving Crawler Efficiency: Multiprocessing in Practice (Process Pool Version)


from multiprocessing import Process

def func(name):
    print('hello', name)


if __name__ == "__main__":
    p = Process(target=func,args=('sxt',))
    p.start()
    p.join()  # wait for the child process to finish

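The Manager class: sharing data between processes

When designing concurrent programs, avoid shared data wherever possible, especially when using multiple processes. If you really do need to share data, the manager object returned by Manager() supports the types list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Barrier, Queue, Value, and Array.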
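
As a minimal sketch of the paragraph above, the following shares a dict and a list from Manager() between several child processes. The names record_square and run_demo are hypothetical and chosen only for this illustration.

```python
from multiprocessing import Manager, Process


def record_square(shared_dict, shared_list, n):
    # Each child writes through a proxy; the manager process holds the real data.
    shared_dict[n] = n * n
    shared_list.append(n)


def run_demo():
    with Manager() as manager:
        shared_dict = manager.dict()
        shared_list = manager.list()
        workers = [
            Process(target=record_square, args=(shared_dict, shared_list, n))
            for n in range(4)
        ]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        # Copy into plain objects before the manager shuts down.
        return dict(shared_dict), sorted(shared_list)


if __name__ == "__main__":
    squares, seen = run_demo()
    print(squares)
    print(seen)
```

The list is sorted before it is returned because the order in which the children append to it is not guaranteed.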


from multiprocessing import Process,Manager,Lock


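def print_num(info_queue,l,lo):
    # Producer: despite its name, this puts every item of l into the shared queue
    with lo:
        for n in l:
            info_queue.put(n)

def updata_num(info_queue,lo):
    # Consumer: prints items until the queue is empty
    with lo:
        while not info_queue.empty():
            print(info_queue.get())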


if __name__ == '__main__':
    manager = Manager()
    into_html = manager.Queue()
    lock = Lock()
    a = [1, 2, 3, 4, 5]
    b = [11, 12, 13, 14, 15]

    p1 = Process(target=print_num, args=(into_html, a, lock))
    p2 = Process(target=print_num, args=(into_html, b, lock))
    p1.start()
    p2.start()
    # Wait for both producers first; otherwise the consumer may find the queue empty and exit early
    p1.join()
    p2.join()
    p3 = Process(target=updata_num, args=(into_html, lock))
    p3.start()
    p3.join()


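from multiprocessing import Process
from multiprocessing import Manager
# fake_useragent and requests are third-party packages; they are only needed
# once the commented-out request code in spider() is enabled
from fake_useragent import UserAgent
import requests
from time import sleep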


def spider(url_queue):
    while not url_queue.empty():
        try:
            url = url_queue.get(timeout = 1)
            # headers = {'User-Agent':UserAgent().chrome}
            print(url)
            # resp = requests.get(url,headers = headers)
            # Process the response
            # for d in resp.json().get('data'):
            #     print(f'tid:{d.get("tid")} topic:{d.get("topicName")} content:{d.get("content")}')
            sleep(1)
            # if resp.status_code == 200:
            #     print(f'Fetched {url} successfully')
        except Exception as e:
            print(e)


if __name__ == '__main__':
    url_queue = Manager().Queue()
    for i in range(1,11):
        url = f'https://www.hupu.com/home/v1/news?pageNo={i}&pageSize=50'
        url_queue.put(url)

    all_process = []
    for i in range(3):
        p1 = Process(target=spider,args=(url_queue,))
        p1.start()
        all_process.append(p1)
    for p in all_process:
        p.join()

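Using a process pool

A process pool maintains a set of worker processes internally. When a task is submitted, it is handed to an idle process from the pool; if no process is available, the task waits until one becomes free.
Two commonly used methods for submitting tasks to a pool:
- apply: synchronous; blocks until the task finishes, so tasks run serially
- apply_async: asynchronous; returns immediately, so tasks can run in parallel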
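
The difference between the two methods can be sketched as follows. This is an illustrative example, not part of the original course code; slow_square, run_blocking, and run_async are hypothetical names, and the 0.2 second sleep merely stands in for real work such as a network request.

```python
import time
from multiprocessing import Pool


def slow_square(n):
    time.sleep(0.2)  # stand-in for real work
    return n * n


def run_blocking(pool, numbers):
    # apply() blocks until each call returns, so the tasks run one after another.
    return [pool.apply(slow_square, args=(n,)) for n in numbers]


def run_async(pool, numbers):
    # apply_async() returns an AsyncResult immediately; get() collects the value later.
    pending = [pool.apply_async(slow_square, args=(n,)) for n in numbers]
    return [p.get() for p in pending]


if __name__ == "__main__":
    numbers = [1, 2, 3]
    with Pool(3) as pool:
        start = time.perf_counter()
        print(run_blocking(pool, numbers), round(time.perf_counter() - start, 2))
        start = time.perf_counter()
        print(run_async(pool, numbers), round(time.perf_counter() - start, 2))
```

Both calls return the same results; with three workers, the apply_async version should take roughly the time of a single task, while the apply version takes about the sum of all three.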


from multiprocessing import Pool,Manager
def print_num(info_queue,l):
    for n in l:
        info_queue.put(n)

def updata_num(info_queue):
    while not info_queue.empty():
        print(info_queue.get())

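if __name__ == '__main__':
    html_queue = Manager().Queue()
    a = [11, 12, 13, 14, 15]
    b = [1, 2, 3, 4, 5]
    pool = Pool(3)
    r1 = pool.apply_async(func=print_num, args=(html_queue, a))
    r2 = pool.apply_async(func=print_num, args=(html_queue, b))
    # Wait for both producers; otherwise the consumer may see an empty queue and return at once
    r1.wait()
    r2.wait()
    pool.apply_async(func=updata_num, args=(html_queue,))
    pool.close()  # stop accepting new tasks; close() must be called before join()
    pool.join()   # wait for the submitted tasks to finish; without join() the main process exits without waiting for the workers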


from multiprocessing import Pool,Manager
from time import sleep

def spider(url_queue):
    while not url_queue.empty():
        try:
            url = url_queue.get(timeout = 1)
            print(url)
            sleep(1)
        except Exception as e:
            print(e)

if __name__ == '__main__':
    url_queue = Manager().Queue()
    for i in range(1,11):
        url = f'https://www.hupu.com/home/v1/news?pageNo={i}&pageSize=50'
        url_queue.put(url)
    pool = Pool(3)
    pool.apply_async(func=spider,args=(url_queue,))
    pool.apply_async(func=spider,args=(url_queue,))
    pool.apply_async(func=spider,args=(url_queue,))
    pool.close()
    pool.join()

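Quick check

1. Which of the following statements about improving crawler efficiency is incorrect?

A Multiprocessing can improve crawler efficiency

B Multi-process crawlers can communicate through a queue

C Multi-process crawlers do not require installing third-party modules

D Multi-threaded crawlers can be developed with multiprocessing

Answer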

1=>D
