Inflearn Community Q&A

JAB
Ultimate Crawling Techniques: Mastering Scrapy and Selenium

Powerful, Modern Crawling: Building a Scrapy Crawler

Question about an error


While following the lecture and scraping gmarket with Scrapy, response.text is not printed and I get the error below.

I'm not certain, but it seems to be caused by this line, roughly in the middle of the log:

UnicodeEncodeError: 'cp949' codec can't encode character '\xa0' in position 18081: illegal multibyte sequence

How should I go about fixing it?

(I'm using Windows cmd.)

2020-02-28 21:17:02 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: ecommerce)
2020-02-28 21:17:02 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.5.3 (v3.5.3:1880cb95a742, Jan 16 2017, 16:02:32) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform Windows-10-10.0.18362-SP0
2020-02-28 21:17:02 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_MODULES': ['ecommerce.spiders'], 'BOT_NAME': 'ecommerce', 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'ecommerce.spiders'}
2020-02-28 21:17:03 [scrapy.extensions.telnet] INFO: Telnet Password: 27523c89cdea5418
2020-02-28 21:17:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2020-02-28 21:17:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-02-28 21:17:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-02-28 21:17:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-28 21:17:04 [scrapy.core.engine] INFO: Spider opened
2020-02-28 21:17:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-02-28 21:17:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-28 21:17:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.gmarket.co.kr/robots.txt> (referer: None)
2020-02-28 21:17:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.gmarket.co.kr/> (referer: None)
2020-02-28 21:17:04 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.gmarket.co.kr/> (referer: None)
Traceback (most recent call last):
  File "c:\users\jh\appdata\local\programs\python\python35\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\JH\ecommerce\ecommerce\spiders\gmarket.py", line 11, in parse
    print(response.text)
UnicodeEncodeError: 'cp949' codec can't encode character '\xa0' in position 18081: illegal multibyte sequence
2020-02-28 21:17:04 [scrapy.core.engine] INFO: Closing spider (finished)
2020-02-28 21:17:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 442,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 97847,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 0.535697,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 2, 28, 12, 17, 4, 962553),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/UnicodeEncodeError': 1,
 'start_time': datetime.datetime(2020, 2, 28, 12, 17, 4, 426856)}
2020-02-28 21:17:04 [scrapy.core.engine] INFO: Spider closed (finished)

My code in Sublime is identical to the lecture:



# -*- coding: utf-8 -*-
import scrapy


class GmarketSpider(scrapy.Spider):
    name = 'gmarket'
    allowed_domains = ['www.gmarket.co.kr']
    start_urls = ['http://www.gmarket.co.kr/']

    def parse(self, response):
        print(response.text)

Answer 1

Hello. This can happen when the response data contains special characters. In Python 3 all strings are Unicode, but when print writes to the Windows console it must encode them with the console's codec (cp949 here); the error means one character in the page, '\xa0' (a non-breaking space), has no cp949 mapping. Since that print was only there to debug what data came in, the overall crawling logic is unaffected.

Thank you.
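If you do still want to see the full response.text in cmd, here is a minimal sketch of two workarounds (the sample string is made up for illustration; '?' is the codec's standard replacement character):

```python
# Sketch: reproduce the cp949 print failure and two ways around it.
text = "Gmarket\xa0sale"  # '\xa0' (non-breaking space) has no cp949 mapping

# 1) Replace unencodable characters before printing:
safe = text.encode("cp949", errors="replace").decode("cp949")
print(safe)  # Gmarket?sale

# 2) Or make Python emit UTF-8 regardless of the console codec
#    (set before running the spider; no code change needed):
#    set PYTHONIOENCODING=utf-8
#    scrapy crawl gmarket
```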
