Crawling a website with Scrapy

In this post we will see how to crawl a website with Scrapy. Crawling (or scraping) means extracting information from a website in an automated way.

Prerequisites

We start by installing Python 3 and Scrapy. On a Debian-based system, the following commands are enough:

$ sudo apt update
$ sudo apt install python3
$ sudo apt install python3-pip
$ pip3 install scrapy

Creating the crawler

Once Scrapy is installed, it is available as a command-line tool. We create a project, then the spiders that make up the crawler. For this example we will create two spiders. The first extracts the links to the company pages. The second extracts the information about each company from its page.

Initialization

$ scrapy startproject qualibat
$ cd qualibat
$ scrapy genspider listing www.qualibat.com
$ scrapy genspider entreprise www.qualibat.com

Listing

import scrapy


class ListingSpider(scrapy.Spider):
    """Collects the links to the company pages from the search result pages."""

    name = "listing"
    allowed_domains = ["www.qualibat.com"]
    start_urls = [
        "https://www.qualibat.com/rechercher/?wqq_type%5B%5D=Entreprise",
        "https://www.qualibat.com/rechercher/?wqq_type%5B0%5D=Entreprise&wqq_page=2",
        "https://www.qualibat.com/rechercher/?wqq_type%5B0%5D=Entreprise&wqq_page=3",
    ]

    def parse(self, response):
        # Each search result links to a company page through a relative href.
        for href in response.xpath('//a[@class="wqq-search-result__link"]/@href').getall():
            # urljoin resolves the href against the URL of the current page.
            yield {"url": response.urljoin(href)}
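
The three start URLs above differ only in the wqq_page parameter, so the list could also be generated instead of typed by hand. The following is a minimal sketch, not part of the original spider: the helper name build_listing_urls is hypothetical, and it assumes that pages beyond 3 follow the same query-string pattern as pages 2 and 3.

```python
BASE = "https://www.qualibat.com/rechercher/"


def build_listing_urls(last_page):
    """Return the listing URLs for pages 1..last_page (hypothetical helper)."""
    # Page 1 is the bare search URL, exactly as in the spider above.
    urls = [BASE + "?wqq_type%5B%5D=Entreprise"]
    # Later pages reuse the pattern seen for pages 2 and 3; that it holds
    # for higher page numbers is an assumption.
    for page in range(2, last_page + 1):
        urls.append(f"{BASE}?wqq_type%5B0%5D=Entreprise&wqq_page={page}")
    return urls
```

In the spider, start_urls = build_listing_urls(3) would reproduce the hand-written list.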

Entreprise

import csv

import scrapy


class EntrepriseSpider(scrapy.Spider):
    """Extracts the company details from the pages listed in urls.csv."""

    name = "entreprise"
    allowed_domains = ["www.qualibat.com"]

    def start_requests(self):
        # urls.csv is the output of the listing spider: a header line ("url")
        # followed by one URL per row. DictReader consumes the header, so it
        # is never requested as if it were a URL.
        with open("urls.csv", newline="", encoding="utf-8") as csvfile:
            for row in csv.DictReader(csvfile):
                self.log(row["url"])
                yield scrapy.Request(url=row["url"], callback=self.parse)

    def parse(self, response):
        # get(default="") avoids calling strip() on None when a field is missing.
        title = response.xpath('//div[@class="wqq-company__title"]/text()').get(default="")
        siret = response.xpath('//span[@class="wqq-company__siret"]/text()').get(default="")
        employees = response.xpath('//span[@class="wqq-company__employees"]/text()').get(default="")
        # Labels and values of the detail lines, kept as two parallel lists.
        titles = [
            key.strip()
            for key in response.xpath('//span[@class="wqq-company__block-line-title"]/text()').getall()
        ]
        values = [
            value.strip()
            for value in response.xpath('//span[@class="wqq-company__block-line-value"]/text()').getall()
        ]
        yield {
            "title": title.strip(),
            "siret": siret.strip(),
            "employees": employees.strip(),
            "keys": titles,
            "values": values,
        }
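
The spider returns the detail labels and their values as two parallel lists ('keys' and 'values'). If each label lines up with the value at the same position, they can be paired into a single mapping. This is only a sketch under that assumption: the helper name pair_fields and the sample labels are hypothetical, not taken from the site.

```python
def pair_fields(titles, values):
    """Pair each label with the value at the same position (hypothetical helper)."""
    # zip stops at the shorter list, so an unmatched trailing label or value
    # is silently dropped; a repeated label keeps only its last value.
    return dict(zip(titles, values))


# Sample data invented for illustration only.
fields = pair_fields(["Forme juridique", "Code NAF"], ["SARL", "4391A"])
```

Keeping the two lists separate, as the spider does, preserves everything the page contains; pairing them is more convenient to query but loses data when the lists have different lengths.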

Execution

Run the listing spider first, because the entreprise spider reads the urls.csv file it produces:

$ scrapy crawl -o urls.csv listing
$ scrapy crawl -o data.csv entreprise
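
Because -o appends to an existing file, running the listing spider more than once can leave repeated rows in the URL file. The following sketch shows one way to read that file defensively before the second crawl. The helper name load_urls is hypothetical, and it assumes the file has a 'url' column, as produced by the listing spider.

```python
import csv


def load_urls(path):
    """Return the unique URLs of the 'url' column, in file order (hypothetical helper)."""
    seen = set()
    urls = []
    with open(path, newline="", encoding="utf-8") as csvfile:
        for row in csv.DictReader(csvfile):
            url = (row.get("url") or "").strip()
            # Skip blank cells, stray header rows, and URLs already collected.
            if not url.startswith("http") or url in seen:
                continue
            seen.add(url)
            urls.append(url)
    return urls
```

The entreprise spider could iterate over load_urls("urls.csv") in start_requests instead of reading the file directly.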