• Category: Q&A
• Field: Data Engineering
• Status: Unresolved

I have a question!

Written 2022-06-29 10:45 · 320 views


I tried crawling the Musinsa homepage the same way Gmarket is crawled around the 4-minute mark of the '강력/최신 크롤링 기술: Scrapy spider 크롤링 기법' lecture, but it doesn't work, so I'm asking here.

import scrapy

class MusinsaRankSpider(scrapy.Spider):
    name = 'musinsa_rank'
    allowed_domains = ['www.musinsa.com/ranking/best']
    start_urls = ['https://www.musinsa.com/ranking/best/']

    def parse(self, response):
        # Grab the product-link text from each item in the ranking list.
        ranks = response.css("div.li_inner > div.article_info > p.list_info > a::text").getall()
        for rank in ranks:
            print(rank)

I wrote the code like this and ran the crawl from a terminal:

C:\Users\JOONIOR\scrapyproject\musinsa\musinsa>scrapy crawl musinsa_rank
2022-06-29 10:41:30 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: musinsa)
2022-06-29 10:41:30 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-06-29 10:41:30 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'musinsa',
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'musinsa.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['musinsa.spiders']}
2022-06-29 10:41:30 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-06-29 10:41:30 [scrapy.extensions.telnet] INFO: Telnet Password: e6cbf7235634cbaa
2022-06-29 10:41:30 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-06-29 10:41:30 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-06-29 10:41:30 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-06-29 10:41:30 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-06-29 10:41:30 [scrapy.core.engine] INFO: Spider opened
2022-06-29 10:41:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-06-29 10:41:31 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-06-29 10:41:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musinsa.com/robots.txt> (referer: None)
2022-06-29 10:41:31 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.musinsa.com/ranking/best/>
2022-06-29 10:41:31 [scrapy.core.engine] INFO: Closing spider (finished)
2022-06-29 10:41:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
 'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
 'downloader/request_bytes': 229,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 503,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.329829,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 6, 29, 1, 41, 31, 379516),
 'httpcompression/response_bytes': 152,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 3,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'robotstxt/forbidden': 1,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 6, 29, 1, 41, 31, 49687)}
2022-06-29 10:41:31 [scrapy.core.engine] INFO: Spider closed (finished)

This is all that shows up in the terminal; the output I expected never appears.

Wondering whether the CSS selector was wrong, I tried it with BeautifulSoup in a Jupyter notebook, and there it worked fine.

Could you tell me what the reason might be?

The site I'm crawling is: https://www.musinsa.com/ranking/best
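Roughly, the Jupyter check was along these lines (a sketch, since the exact notebook code isn't included here; the selector is the same one the spider uses, and the User-Agent header is an added assumption):

import requests
from bs4 import BeautifulSoup

# Assumed User-Agent; the default python-requests agent is often filtered by sites.
headers = {"User-Agent": "Mozilla/5.0"}
res = requests.get("https://www.musinsa.com/ranking/best/", headers=headers)
res.raise_for_status()

soup = BeautifulSoup(res.text, "html.parser")
# Same CSS selector as in the Scrapy spider above.
for a in soup.select("div.li_inner > div.article_info > p.list_info > a"):
    print(a.get_text(strip=True))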


1 Answer


Hello, this is 답변도우미 :)

If it fails with Scrapy but works correctly with BeautifulSoup, one point worth suspecting is this: Scrapy issues requests concurrently under the hood, so when multiple requests arrive at once, some sites recognize that pattern as automated crawling and block it outright. Since BeautifulSoup is working normally for you, proceeding with BeautifulSoup seems like the better option.
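If you still want to experiment on the Scrapy side, here is a minimal sketch that throttles the spider so requests no longer arrive in a burst. CONCURRENT_REQUESTS, DOWNLOAD_DELAY, and USER_AGENT are standard Scrapy settings, but the values below are illustrative guesses, and whether Musinsa accepts throttled requests is untested. Also note that your log shows the request being dropped by RobotsTxtMiddleware ("Forbidden by robots.txt"), so with ROBOTSTXT_OBEY left at True the request never reaches the site at all; set it to False only if you have confirmed you are permitted to crawl that page.

import scrapy

class MusinsaRankThrottledSpider(scrapy.Spider):
    # Hypothetical variant of the question's spider, renamed to avoid a clash.
    name = 'musinsa_rank_throttled'
    allowed_domains = ['www.musinsa.com']  # bare domain; Scrapy expects domains here, not URLs
    start_urls = ['https://www.musinsa.com/ranking/best/']

    custom_settings = {
        'CONCURRENT_REQUESTS': 1,     # one request at a time instead of a burst
        'DOWNLOAD_DELAY': 1.0,        # pause between requests
        'USER_AGENT': 'Mozilla/5.0',  # assumed browser-like agent
        'ROBOTSTXT_OBEY': False,      # your log shows robots.txt blocking this URL
    }

    def parse(self, response):
        ranks = response.css("div.li_inner > div.article_info > p.list_info > a::text").getall()
        for rank in ranks:
            print(rank)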

Thank you.