Hello, this was working fine until the day before yesterday...

Question

Suddenly my crawler can no longer fetch general news articles. From my testing, only general news fails, but as far as I can tell nothing in my code has changed. Could you please take a look at it?

    import requests
    from bs4 import BeautifulSoup
    import time
    import pyautogui

    # User input
    keyword = pyautogui.prompt("Enter a search term")
    lsatpage = int(pyautogui.prompt("How many pages should be crawled?"))
    page_num = 1
    for i in range(1, lsatpage * 10, 10):
        print(f".===========================Fetching page {page_num}===========================")
        response = requests.get(f"https://search.naver.com/search.naver?where=news&sm=tab_jum&query={keyword}")
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        articles = soup.select("div.news_info")  # extract the 10 news article divs
        for article in articles:
            links = article.select("a.info")  # links only: the a tags with class "info"
            if len(links) >= 2:  # count the links with len(); continue if there are at least 2
                url = links[1].attrs['href']  # pick the second link
                # print(url)
                response = requests.get(url, headers={'User-agent': 'mozila/5.0'})
                html = response.text
                # print(html)
                soup = BeautifulSoup(html, 'html.parser')
                # print(soup)
                # If it is entertainment news
                if "entertain" in response.url:
                    title = soup.select_one(".end_tit")  # Naver news, first link
                    contant = soup.select_one("#articeBody")
                elif "sports" in response.url:  # if the URL stored in response contains "sports"
                    title = soup.select_one("h4.title")
                    contant = soup.select_one("#newsEndContents")
                    divs = contant.select("div")  # remove unneeded div and p tags from the body
                    for div in divs:
                        div.decompose()  # decompose() removes the tag
                    ps = contant.select("p")
                    for p in ps:
                        p.decompose()
                else:
                    title = soup.select_one("#articleTitle")  # news outlet article, second link
                    contant = soup.select_one("#articleBodyContents")
                print("==========Link========\n", url)
                print("==========Title========\n", title.text.strip())  # strip() removes surrounding whitespace
                print("==========Body========\n", contant.text.strip())
                time.sleep(0.3)
        page_num = page_num + 1

스타트코딩 · Answer

I checked, and as you said, only general news articles fail. After analyzing the site, I found that the HTML structure of Naver's general news articles has changed. If you update just the CSS selectors for the title and the body, the way the lecture showed, it should work without problems ^^

    import requests
    from bs4 import BeautifulSoup
    import pyautogui

    # User input
    keyword = pyautogui.prompt("Enter a search term")
    lastpage = int(pyautogui.prompt("How many pages should be crawled?"))
    page_num = 1
    for i in range(1, lastpage * 10, 10):
        print(f"Crawling page {page_num} ================")
        response = requests.get(f"https://search.naver.com/search.naver?where=news&sm=tab_jum&query={keyword}&start={i}")
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        articles = soup.select("div.info_group")  # extract the 10 news article divs
        for article in articles:
            links = article.select("a.info")
            if len(links) >= 2:  # if there are at least 2 links
                url = links[1].attrs['href']  # href of the second link
                # Send another request for the article page
                response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
                html = response.text
                soup = BeautifulSoup(html, 'html.parser')
                # Entertainment and sports news pages are laid out differently,
                # so using the general selectors on them can raise an error.
                if "entertain" in response.url:
                    title = soup.select_one(".end_tit")
                    content = soup.select_one("#articeBody")
                elif "sports" in response.url:
                    title = soup.select_one("h4.title")
                    content = soup.select_one("#newsEndContents")
                    # Remove unneeded divs inside the body
                    divs = content.select("div")
                    for div in divs:
                        div.decompose()
                else:
                    title = soup.select_one(".media_end_head_headline")
                    content = soup.select_one("#newsct_article")
                print("=======Link=======\n", url)
                print("=======Title=======\n", title.text.strip())
                print("=======Body=======\n", content.text.strip())
        page_num = page_num + 1