I have a question!
Posted 22.06.29 10:45, 320 views
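In the lecture "Powerful/Latest Crawling Techniques: Scrapy spider crawling techniques", Gmarket is crawled at around the 4-minute mark.
I tried to crawl the Musinsa website the same way, but it does not work, so I am asking here.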
import scrapy


class MusinsaRankSpider(scrapy.Spider):
    name = 'musinsa_rank'
    # allowed_domains takes bare domain names, not URL paths
    allowed_domains = ['www.musinsa.com']
    start_urls = ['https://www.musinsa.com/ranking/best/']

    def parse(self, response):
        # Text of each product link in the ranking list
        ranks = response.css("div.li_inner > div.article_info > p.list_info > a::text").getall()
        for rank in ranks:
            print(rank)
I wrote the code like this and ran the crawl from the terminal:
C:\Users\JOONIOR\scrapyproject\musinsa\musinsa>scrapy crawl musinsa_rank
2022-06-29 10:41:30 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: musinsa)
2022-06-29 10:41:30 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l 24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-06-29 10:41:30 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'musinsa',
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'musinsa.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['musinsa.spiders']}
2022-06-29 10:41:30 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-06-29 10:41:30 [scrapy.extensions.telnet] INFO: Telnet Password: e6cbf7235634cbaa
2022-06-29 10:41:30 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2022-06-29 10:41:30 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-06-29 10:41:30 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-06-29 10:41:30 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-06-29 10:41:30 [scrapy.core.engine] INFO: Spider opened
2022-06-29 10:41:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-06-29 10:41:31 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-06-29 10:41:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musinsa.com/robots.txt> (referer: None)
2022-06-29 10:41:31 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.musinsa.com/ranking/best/>
2022-06-29 10:41:31 [scrapy.core.engine] INFO: Closing spider (finished)
2022-06-29 10:41:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
'downloader/request_bytes': 229,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 503,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.329829,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 6, 29, 1, 41, 31, 379516),
'httpcompression/response_bytes': 152,
'httpcompression/response_count': 1,
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'response_received_count': 1,
'robotstxt/forbidden': 1,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 6, 29, 1, 41, 31, 49687)}
2022-06-29 10:41:31 [scrapy.core.engine] INFO: Spider closed (finished)
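This is all the terminal prints, and the output I wanted never appears.
I wondered whether the CSS selector was wrong, so I tried the same selector with BeautifulSoup in a Jupyter notebook, and it worked fine.
Could you tell me why this happens?
The site I am crawling is https://www.musinsa.com/ranking/best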
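When a Scrapy run ends without calling `parse`, the log itself usually says what happened to the request. As a minimal sketch, the snippet below scans log text for the `Forbidden by robots.txt` message. The two sample lines are copied from the log above; the helper name `forbidden_urls` and the regular expression are illustrative assumptions, not part of Scrapy.

```python
import re

# Two lines copied verbatim from the crawl log in the question.
LOG = """\
2022-06-29 10:41:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.musinsa.com/robots.txt> (referer: None)
2022-06-29 10:41:31 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.musinsa.com/ranking/best/>
"""

# Pattern written to match the message format seen in the log above (an assumption,
# not an official Scrapy log format specification).
FORBIDDEN = re.compile(r"Forbidden by robots\.txt: <[A-Z]+ (?P<url>[^>]+)>")


def forbidden_urls(log_text):
    """Return the URLs that the log reports as blocked by robots.txt."""
    return [match.group("url") for match in FORBIDDEN.finditer(log_text)]


print(forbidden_urls(LOG))
# ['https://www.musinsa.com/ranking/best/']
```

Here the only start URL shows up in the result, which matches `'robotstxt/forbidden': 1` in the stats dump.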
잔재미코딩 DaveLee
Instructor, 2022.06.30
Hello. This is the answer helper :)
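If it fails with Scrapy but works with BeautifulSoup, one thing to suspect is the request pattern: Scrapy sends requests concurrently (it is asynchronous, built on Twisted, rather than thread-based), and a site that receives many requests at once may recognize this as crawling and block it. In this case, though, the log above points to a more direct cause. The line `Forbidden by robots.txt: <GET https://www.musinsa.com/ranking/best/>` shows that Scrapy itself dropped the request because `ROBOTSTXT_OBEY` is `True` in the project settings, so the ranking page was never downloaded and `parse` never ran. BeautifulSoup only parses HTML that has already been downloaded and performs no robots.txt check, which is why the same selector returned results there. Since BeautifulSoup works, continuing with BeautifulSoup seems like a good option.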
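The log in the question contains `Forbidden by robots.txt`, which means the start URL was checked against the site's robots.txt before download and rejected. The sketch below illustrates that kind of check with Python's standard `urllib.robotparser`. Two things here are assumptions: the rules in `SAMPLE_RULES` are hypothetical and are not the real contents of https://www.musinsa.com/robots.txt, and Scrapy uses its own configurable parser (Protego by default) rather than this module, so this only shows the idea.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real Musinsa robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /ranking/
"""


def is_allowed(rules_text, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


# The user agent string "Scrapy" is also just an example value.
print(is_allowed(SAMPLE_RULES, "Scrapy", "https://www.musinsa.com/ranking/best/"))  # False
print(is_allowed(SAMPLE_RULES, "Scrapy", "https://www.musinsa.com/"))  # True
```

In a Scrapy project this check is switched on by `ROBOTSTXT_OBEY`, which appears as `True` under "Overridden settings" in the log, so a disallowed URL is dropped with `IgnoreRequest` before any selector is applied.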
Thank you.