[Crawler] Scrapy Overview, Installation, and Official Example (Windows)

2018. 7. 22. 19:30

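1. Scrapy Overview

Scrapy is a Python framework for crawling and scraping. Its features include:

Extracting links from web pages
Distinguishing allowed pages from disallowed ones based on robots.txt
Extracting XML sitemaps and the links they contain
Adjusting the crawl interval per domain and per IP address
Processing multiple crawl targets in parallel
Skipping URLs that have already been crawled
Retrying a set number of times when an error occurs
Running the crawler as a daemon and managing jobs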
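
Several of the features above are controlled through Scrapy settings. As a rough sketch, the dictionary below collects a few of them. The setting names are Scrapy's own, but the values and the name EXAMPLE_SETTINGS are illustrative assumptions rather than recommendations.

```python
# Illustrative values only; EXAMPLE_SETTINGS is a hypothetical name.
EXAMPLE_SETTINGS = {
    # Respect robots.txt when deciding which pages may be fetched
    "ROBOTSTXT_OBEY": True,
    # Seconds to wait between consecutive requests to the same site
    "DOWNLOAD_DELAY": 1.0,
    # Upper bound on requests in flight across all sites
    "CONCURRENT_REQUESTS": 16,
    # Upper bound on requests in flight for any single domain
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,
    # Retry failed requests, up to this many attempts after the first
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 3,
}

for name, value in sorted(EXAMPLE_SETTINGS.items()):
    print(f"{name} = {value!r}")
```

A dictionary like this could be assigned to a spider's custom_settings attribute, or the same names could be written as plain assignments in a project's settings.py.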

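2. Installing Scrapy (Windows)

Scrapy has supported Python 3 since version 1.1 and is built on top of several Python packages.

lxml : a C extension library built on libxml2 and libxslt that serves as an efficient XML and HTML parser
twisted : an event-driven networking engine. Because Scrapy is built on it, website downloads run asynchronously, and scraping can proceed while other downloads are still in progress.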
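
To illustrate what asynchronous downloading buys you, the sketch below simulates two downloads. It uses the standard library's asyncio in place of Twisted, purely so the example has no dependencies; it is not how Scrapy is implemented. The function name, page names, and delays are all made up.

```python
import asyncio

events = []


async def fetch_and_parse(name, delay):
    # Simulated download: control returns to the event loop while waiting
    await asyncio.sleep(delay)
    events.append(f"downloaded {name}")
    # Simulated scraping step; other downloads may still be pending here
    events.append(f"parsed {name}")


async def main():
    # Start both downloads at once instead of one after the other
    await asyncio.gather(
        fetch_and_parse("slow-page", 0.2),
        fetch_and_parse("fast-page", 0.01),
    )


asyncio.run(main())
print(events)
```

The fast page is downloaded and parsed while the slow page is still in flight, which is the same overlap the twisted entry describes.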

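According to the official documentation, the recommended way to install Scrapy on Windows is not pip. Instead, install Anaconda or Miniconda and use the package from the conda-forge channel, which avoids most installation issues. (Scrapy install guide)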

conda install -c conda-forge scrapy

Installation takes about 2 to 3 minutes. Once it finishes, running scrapy --version prints output like the following.

C:\Users>scrapy --version
Scrapy 1.5.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

3. Official Example

The official example for version 1.5 follows links on http://quotes.toscrape.com and scrapes the text and author of each quote.

quotes_spider.py

import scrapy


class QuotesSpider(scrapy.Spider):
    # Name of the spider (can be changed)
    name = "quotes"
    # List of URLs where crawling starts
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        """Extract text and author from each div.quote, then follow pagination."""
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # Follow the "next" pagination link if the page has one
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
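
The href taken from the next-page link is usually a relative path, and response.follow accepts it as is, resolving it against the URL of the current response. The sketch below reproduces only that resolution step with the standard library's urljoin. The href value is an assumed example of what the selector might return, not captured output.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (from start_urls in the spider)
current_url = "http://quotes.toscrape.com/tag/humor/"
# Assumed value of the li.next link's href attribute
next_href = "/tag/humor/page/2/"

# Resolve the relative href against the current page's URL
next_url = urljoin(current_url, next_href)
print(next_url)
```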

Run the spider with scrapy's runspider command, passing the path of the spider file and the output file (-o). The log shows the crawl running through to completion.

scrapy runspider quotes_spider.py -o quotes.json

Execution result

C:\workspace\python\scrapy>scrapy runspider quotes_spider.py -o quotes.json
2018-07-22 12:23:17 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-07-22 12:23:17 [scrapy.utils.log] INFO: Versions: lxml 3.6.4.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.5.2 |Anaconda custom (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 16.2.0 (OpenSSL 1.0.2o  27 Mar 2018), cryptography 1.5, Platform Windows-10-10.0.17134-SP0
2018-07-22 12:23:17 [scrapy.crawler] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'SPIDER_LOADER_WARN_ONLY': True, 'FEED_URI': 'quotes.json'}
2018-07-22 12:23:17 [scrapy.middleware] INFO: Enabled extensions:
...
2018-07-22 12:23:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/humor/> (referer: None)
2018-07-22 12:23:20 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/tag/humor/>
{'author': 'Jane Austen', 'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”'}
...
'finish_time': datetime.datetime(2018, 7, 22, 3, 23, 20, 646076),
 'start_time': datetime.datetime(2018, 7, 22, 3, 23, 18, 724216)}
2018-07-22 12:23:20 [scrapy.core.engine] INFO: Spider closed (finished)

Checking the output

C:\workspace\python\scrapy>type quotes.json
[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
...
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
][
{"author": "Jane Austen", "text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"},
...
{"author": "Jane Austen", "text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d"}
]
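
The listing above holds two JSON arrays written back to back (the `][` in the middle). That is what results when the command is run twice against the same file, because -o appends to an existing file rather than overwriting it. Whether this particular file came about that way is an assumption. A file in this state is not valid JSON, so deleting quotes.json before rerunning is the simplest fix. As a minimal sketch, the helper below (a hypothetical name, with shortened sample data standing in for the real file) reads such a file anyway.

```python
import json


def load_concatenated_arrays(raw):
    """Parse one or more JSON arrays written back to back into a single list."""
    decoder = json.JSONDecoder()
    items = []
    pos = 0
    while pos < len(raw):
        # raw_decode does not skip leading whitespace, so step over it here
        if raw[pos].isspace():
            pos += 1
            continue
        chunk, pos = decoder.raw_decode(raw, pos)
        items.extend(chunk)
    return items


# Shortened stand-in for a quotes.json that was written to twice
sample = (
    '[\n{"text": "\\u201cA\\u201d", "author": "Jane Austen"}\n]'
    '[\n{"author": "Jane Austen", "text": "\\u201cA\\u201d"}\n]'
)

try:
    json.loads(sample)
except json.JSONDecodeError as exc:
    print("plain json.loads fails:", exc.msg)

quotes = load_concatenated_arrays(sample)
print(len(quotes), quotes[0]["text"])
```

Decoding also turns the \u201c and \u201d escapes back into the curly quotation marks seen in the crawl log.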

Reference

Scrapy 1.5 documentation
Kato Kota, 『파이썬을 이용한 웹 크롤링과 스크레이핑』 (Web Crawling and Scraping with Python), trans. Yun In-seong, Wikibooks (2018-03-22), pp. 267-270.

Cheo-ri
