Something isn't working in my RSS data exercise

Question

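Hello, instructor 좋은 사람.

In 3-4-2.py we worked with RSS data. Instead of the Korea Meteorological Administration feed, I found a ZDNet news RSS feed,

and I'm trying to save it to text files, but something isn't working.

When I print the RSS data (article title and body), everything prints correctly,

but when I write to a file, first, the title doesn't go into the file properly, and second, the file content is garbled.

Could you tell me what to do in this case?

Problems to solve (they occur in the loop at the very bottom)

How do I use titleForFile as the file name?

When I save content as the body of an .html file, the string data comes out garbled.

Also, the article body has HTML tags mixed in. In that case, can I get only the string value?

Or, if I save as .html instead of .txt, can I view the article in a web browser?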

import os
import sys
import io
import errno  # needed for errno.EEXIST below
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import codecs

# Force UTF-8 on the console streams
sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.detach(), encoding='utf-8')

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--log-level=3')
driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'C:/section3/webdriver/chrome/chromedriver')
driver.implicitly_wait(2)
driver.get('http://www.zdnet.co.kr/Include2/NewsSection0020.xml')
print('Sent a URL request to zdnet through the Chrome webdriver and loaded the RSS data')

xml_soup = BeautifulSoup(driver.page_source, 'html.parser')
print(xml_soup.prettify())
title_list = xml_soup.select('item > title')

# Create the folder
savePath = 'c:/fast_11/html/'
try:
    if not(os.path.isdir(savePath)):
        os.makedirs(os.path.join(savePath))
        print('Created the folder.')
except OSError as e:
    if e.errno != errno.EEXIST:
        print("Failed to create directory!!!!!")
        raise

print('Saving to the folder.')
for i, title in enumerate(title_list):
    titleForFile = title.string
    content = title.parent.description.string
    content2 = titleForFile + ' : ' + content
    print(i, 'titleForFile : ', titleForFile, ' : ', content)
    if content != None:
        with open('c:/fast_11/html/' + str(i) + '.txt', "wt") as f:
            f.write(content + '\n')
            f.write(r'content2')
print('Finished writing the text files')

Answer

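Hello, terecal.
That's a good example to practice with. I also saw your message.
First, the garbled file name and content are a character set encoding problem.
That is, the sending side and the receiving side use different character encodings, so the text gets garbled on the receiving side.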
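To make the sender/receiver mismatch concrete, here is a minimal sketch. The sample string is made up for illustration, and cp949 is only an assumption about what a Korean Windows system might fall back to when no encoding is given.

```python
# Hypothetical sample text; any Korean string shows the same effect.
text = "기사 제목"

# "Sender": encode the text as UTF-8 bytes.
raw = text.encode("utf-8")

# "Receiver" that wrongly assumes cp949 ends up with different characters.
wrong = raw.decode("cp949", errors="replace")

# A receiver that uses the same encoding recovers the original text.
right = raw.decode("utf-8")

# ascii() prints escaped code points, so the output does not depend on the console.
print(ascii(wrong))
print(right == text)
```

The bytes are identical in both cases; only the encoding used to interpret them differs.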
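The source you posted works correctly for me as it is.
http://raccoonyy.github.io/working-with-unicode-streams-in-python-korean/
Please refer to the link above and try again with UTF-8 encoding.
Also, right before the file is written, that is, before the line with open('c:/fast_11/html/'+str(i)+'.txt', "wt") as f:
add a print statement and run a quick unit test to see whether the text is already garbled on the console
or only becomes garbled after the file is written.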
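One way to run that check is sketched below. The helper name check_round_trip and the sample string are hypothetical. It writes to a temporary file instead of c:/fast_11/html/ so it can run anywhere.

```python
import os
import tempfile

def check_round_trip(text, encoding="utf-8"):
    """Write text to a temporary file, read it back, and report whether it survived."""
    # ascii() shows the exact code points in memory, independent of console fonts.
    print("before write:", ascii(text))
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "wt", encoding=encoding) as f:
            f.write(text)
        with open(path, "rt", encoding=encoding) as f:
            read_back = f.read()
    finally:
        os.remove(path)
    print("after read:  ", ascii(read_back))
    return read_back == text

# Hypothetical sample standing in for one "title : content" string.
sample = "제목 : 본문"
print(check_round_trip(sample))
```

If the "before write" line already looks wrong, the problem is upstream of the file write (parsing or page source). If only the saved file looks wrong in an editor, the editor is probably opening it with a different encoding than the one used to write it.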
For reference, in Python 3:
input = open("input.txt", "rt", encoding="utf-16")
output = open("output.txt", "wt", encoding="utf-8")
Try adding the encoding option as shown above.
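Applied to the loop in the question, this could look like the sketch below. The helpers safe_filename and save_article, the sample title, and the temporary folder are all hypothetical. Note that the posted loop calls f.write(r'content2'), which writes the literal text "content2" rather than the variable, so the sketch writes the combined string directly.

```python
import os
import re
import tempfile

def safe_filename(title, max_len=100):
    """Replace characters Windows forbids in file names and trim the result."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title).strip(" .")
    return cleaned[:max_len] or "untitled"

def save_article(folder, title, content):
    """Write 'title : content' into <folder>/<title>.txt as UTF-8 and return the path."""
    path = os.path.join(folder, safe_filename(title) + ".txt")
    with open(path, "wt", encoding="utf-8") as f:
        f.write(title + " : " + content + "\n")
    return path

# Hypothetical sample data standing in for one RSS item.
folder = tempfile.mkdtemp()
saved = save_article(folder, "AI 칩 경쟁: 누가 앞서나?", "기사 본문입니다.")
with open(saved, "rt", encoding="utf-8") as f:
    saved_text = f.read()
print(ascii(os.path.basename(saved)))
```

In the real loop, titleForFile and content would take the place of the sample strings. If the description contains HTML tags, passing it through BeautifulSoup(content, 'html.parser').get_text() first is one way to keep only the text.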
I hope this helps.
Thank you.

