TF-IDF error

Hello, I ran into an error related to TF-IDF and would like to ask about it.

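I am running topic modeling on news articles stored in a CSV file.

As I keep adding articles to the CSV file and rerunning the topic modeling, at some point the error shown below occurs.

Topic modeling used to work, so I cannot figure out why this error starts appearing once more data has been added.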

Thank you for the great lectures!

UnicodeDecodeError                        Traceback (most recent call last)
Cell In[127], line 2
      1 vectorizer = TfidfVectorizer(tokenizer=tokenizer, max_df=0.90, min_df=100, max_features=20000)
----> 2 tfidf = vectorizer.fit_transform(topnews['text']).toarray()

File c:\Users\My COM\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\feature_extraction\text.py:2131, in TfidfVectorizer.fit_transform(self, raw_documents, y)
   2124 self._check_params()
   2125 self._tfidf = TfidfTransformer(
   2126     norm=self.norm,
   2127     use_idf=self.use_idf,
   2128     smooth_idf=self.smooth_idf,
   2129     sublinear_tf=self.sublinear_tf,
   2130 )
-> 2131 X = super().fit_transform(raw_documents)
   2132 self._tfidf.fit(X)
   2133 # X is already a transformed view of raw_documents so
   2134 # we set copy to False

File c:\Users\My COM\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\feature_extraction\text.py:1387, in CountVectorizer.fit_transform(self, raw_documents, y)
   1379             warnings.warn(
   1380                 "Upper case characters found in"
   1381                 " vocabulary while 'lowercase'"
   1382                 " is True. These entries will not"
   1383                 " be matched with any documents"
...
---> 93 result = [(token.getMorph(), token.getPos()) for token in result]
     95 if join:
     96     result = ['{}/{}'.format(morph, pos) for morph, pos in result]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: invalid continuation byte

Hello, this is 인사이저.

Regarding your question, I understand the situation as follows:

"Everything worked with the data we originally provided, but the error above occurs once newly collected news data is added."

The error comes from a difference in data encoding: the data contains text that is not encoded as 'utf-8'.

Please check whether the new data was saved as UTF-8.

Alternatively, I also recommend trying to read the data with pandas using encoding="utf-8".
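
As a rough sketch of that check, the snippet below tries a few encodings that are common for Korean text and reports the first one that decodes the whole file. The helper name `detect_encoding`, the candidate list, and the file name `topnews.csv` are assumptions for illustration and are not part of the course material.

```python
from pathlib import Path


def detect_encoding(path, candidates=("utf-8-sig", "cp949")):
    """Return the first candidate encoding that decodes the whole file, else None.

    "utf-8-sig" accepts UTF-8 with or without a byte order mark;
    "cp949" is the legacy Windows encoding for Korean (a superset of EUC-KR).
    """
    raw = Path(path).read_bytes()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the file, try the next one
        return enc
    return None


# Hypothetical usage; "topnews.csv" is a placeholder file name.
# import pandas as pd
# enc = detect_encoding("topnews.csv")
# topnews = pd.read_csv("topnews.csv", encoding=enc)
```

If the function returns something other than "utf-8-sig", the file was most likely saved in a different encoding, and passing that value to `pd.read_csv` (or re-saving the file as UTF-8) is worth trying.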

I am also sharing a blog post about the same issue below for your reference.

https://gmnam.tistory.com/291?category=899950
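
If the CSV file turns out to be valid UTF-8 and the error persists, one further possibility is worth checking. This is an assumption and is not confirmed by the traceback: the error is raised inside the tokenizer and not while reading the CSV, and byte 0xed is the UTF-8 lead byte for code points U+D000 to U+DFFF, a range that includes the surrogates UTF-16 uses to represent emoji and other characters outside the Basic Multilingual Plane. Newly collected articles containing such characters could therefore be the trigger. A minimal sketch for locating and replacing them (the helper names `find_suspect_rows` and `strip_non_bmp` are hypothetical):

```python
import re

# Matches lone surrogates and any character outside the Basic Multilingual
# Plane (most emoji fall in the latter group).
_NON_BMP = re.compile(r"[\ud800-\udfff\U00010000-\U0010FFFF]")


def find_suspect_rows(texts):
    """Return the positions of texts that contain a non-BMP or surrogate character."""
    return [
        i for i, text in enumerate(texts)
        if isinstance(text, str) and _NON_BMP.search(text)
    ]


def strip_non_bmp(text):
    """Replace every non-BMP or surrogate character with a single space."""
    return _NON_BMP.sub(" ", text)
```

Applied to the data in the question, this might look like `topnews['text'] = topnews['text'].fillna("").map(strip_non_bmp)` before calling `fit_transform`. Running `find_suspect_rows(topnews['text'])` first shows whether any of the newly added articles are affected at all.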

If you run into any other issues,

feel free to ask at any time.

Thank you.

Inflearn Community Q&A