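[2024 Revised Edition] This Is Real Crawling - Practical Edition: Dashboard

(4.9) 72 reviews ∙ 846 students

Instructor: 스타트코딩 (StartCoding)
Topics: Python, web crawling
Price: 132,000 won (26,400 won per month on a 5-month installment plan)
Lessons: 26 in total (3 hours 29 minutes)
Certificate: issued
The instructor answers questions for this course.

Questions other students frequently ask: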

Unresolved
Question about a crawling error
Hello. I tried crawling with the code below using the search term '식품 로봇' (food robot). I want to collect every article that exists in the period specified in the URL, but I have no way of knowing how many pages there are in total, so I ran it with the page count set to 2,000. The crawl ran fine for a while and then raised an error. How can I fix this?

Error message:

    =======링크=======
     https://n.news.naver.com/mnews/article/025/0003239249?sid=101
    Traceback (most recent call last):
      File "/Users/유저이름/startcoding/Chapter04/11.마지막페이지확인하기.py", line 64, in <module>
        print("=======제목======= \n", title.text.strip())
                                     ^^^^^^^^^^
    AttributeError: 'NoneType' object has no attribute 'text'

Code I ran:

    import requests
    from bs4 import BeautifulSoup
    import time
    import pyautogui
    from openpyxl import Workbook
    from openpyxl.styles import Alignment

    # User input
    keyword = pyautogui.prompt("검색어를 입력하세요")
    lastpage = int(pyautogui.prompt("몇 페이지까지 크롤링 할까요?"))

    # Create the Excel workbook
    wb = Workbook()

    # Create the Excel sheet
    ws = wb.create_sheet(keyword)

    # Adjust the column widths
    ws.column_dimensions['A'].width = 60
    ws.column_dimensions['B'].width = 60
    ws.column_dimensions['C'].width = 120

    # Row number
    row = 1

    # Page number
    page_num = 1

    for i in range(1, lastpage * 10, 10):
        print(f"{page_num}페이지 크롤링 중 입니다.==========================")
        response = requests.get(f"https://search.naver.com/search.naver?sm=tab_hty.top&where=news&query={keyword}&start={i}")
        html = response.text  # the HTML is inside response.text
        soup = BeautifulSoup(html, 'html.parser')
        articles = soup.select("div.info_group")  # extract the 10 news article divs

        # There are 10 articles, so extract them one by one in a for loop
        for article in articles:
            links = article.select("a.info")  # get the a tags with class info; the result is a list
            if len(links) >= 2:  # if there are two or more links
                url = links[1].attrs['href']  # extract the href of the second link

                # Send another request
                response = requests.get(url, headers={'User-agent': 'Mozila/5.0'})
                html = response.text
                soup_sub = BeautifulSoup(html, 'html.parser')
                print(url)

                # Check for entertainment news
                if "entertain" in response.url:
                    title = soup_sub.select_one(".end_tit")
                    content = soup_sub.select_one("#articeBody")
                elif "sports" in response.url:
                    title = soup_sub.select_one("h4.title")
                    content = soup_sub.select_one("#newsEndContents")

                    # Remove unnecessary div and p elements inside the body
                    divs = content.select("div")
                    for div in divs:
                        div.decompose()
                    paragraphs = content.select("p")
                    for p in paragraphs:
                        p.decompose()
                else:
                    title = soup_sub.select_one(".media_end_head_headline")
                    content = soup_sub.select_one("#newsct_article")

                print("=======링크======= \n", url)
                print("=======제목======= \n", title.text.strip())
                print("=======본문======= \n", content.text.strip())

                ws[f'A{row}'] = url  # write the URL in column A
                ws[f'B{row}'] = title.text.strip()
                ws[f'C{row}'] = content.text.strip()

                # Wrap text automatically
                ws[f'C{row}'].alignment = Alignment(wrap_text=True)
                row = row + 1
                time.sleep(0.3)

        # Check whether this is the last page
        isLastPage = soup.select_one("a.btn_next").attrs['aria-disabled']
        if isLastPage == 'true':
            print("마지막 페이지 입니다.")
            break
        page_num = page_num + 1

    wb.save(f'{keyword}_result.xlsx')
cherrykim90 · 7 months ago
Votes: 0 · Views: 209 · Answers: 2
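
The AttributeError in the question above means that select_one returned None for that article, i.e. none of the expected selectors matched the page. A minimal sketch of one way to guard against this, assuming the hypothetical helper names first_match and safe_text (they are not from the course code) and selector strings quoted in the posts on this page:

```python
from bs4 import BeautifulSoup


def first_match(soup, selectors):
    """Return the first element matched by any selector, or None."""
    for selector in selectors:
        node = soup.select_one(selector)
        if node is not None:
            return node
    return None


def safe_text(node, default=""):
    """Return stripped text, or `default` when the selector matched nothing."""
    return node.text.strip() if node is not None else default


# Tiny stand-in page; a real article page would come from requests.get(url).text
html = '<h2 id="title_area"> Sample headline </h2>'
soup = BeautifulSoup(html, "html.parser")

title = first_match(soup, [".media_end_head_headline", "#title_area"])
content = first_match(soup, ["#newsct_article", "#dic_area"])

print(safe_text(title))
print(safe_text(content, "(no body)"))
```

Inside the crawl loop, an article whose title or content is still None can then be skipped with continue instead of stopping the whole run.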
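Unresolved
Setting the period for crawled articles
Hello. In the news crawler, to restrict the period of the news being crawled, given this line:

    response = requests.get("https://search.naver.com/search.naver?where=news&sm=tab_jum&query=검색어")

do I just set the news period as a search option, copy the URL of the resulting page, and put it inside the " " in the code above? Thank you.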
cherrykim90 · 8 months ago
Votes: 0 · Views: 632 · Answers: 2
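
A URL copied from a search page that already has the period option applied typically carries that period in its query string. A hedged sketch of reusing such a URL while changing only the start parameter for paging; the helper name with_start and the example URL are illustrative assumptions, and the period-related parameters are simply whatever the browser put into the copied URL:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_start(copied_url, start):
    """Return copied_url with its `start` query parameter set to `start`."""
    parts = urlsplit(copied_url)
    # Keep every parameter from the copied URL except an existing `start`
    params = [(key, value)
              for key, value in parse_qsl(parts.query, keep_blank_values=True)
              if key != "start"]
    params.append(("start", str(start)))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical URL standing in for one copied from the browser address bar
copied = "https://search.naver.com/search.naver?where=news&query=test&start=1"

for page in range(1, 4):
    print(with_start(copied, (page - 1) * 10 + 1))
```

Keeping the copied parameters untouched avoids having to know what each period-related parameter means.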
Unresolved
My code was working and now it isn't :(
It definitely used to work, but I can't tell what I did wrong, and the error below keeps occurring.

    startcoding/Chapter04/11.마지막페이지확인하기.py", line 62, in <module>
        print("=======링크======= \n", url)
                                     ^^^
    NameError: name 'url' is not defined

I went back through the lectures and rewrote the code, but now errors occur starting from 02.본문내용스크롤:

    Chapter04/02.뉴스본문내용크롤링하기.py", line 17, in <module>
        print(content.text)
              ^^^^^^^^^^^^
    AttributeError: 'NoneType' object has no attribute 'text'

"10.크롤링결과엑셀저장하기" also runs for a while and then raises this error from page 2 onward:

    startcoding/Chapter04/10.크롤링결과엑셀저장하기.py", line 63, in <module>
        print("=======제목======= \n", title.text.strip())
                                     ^^^^^^^^^^
    AttributeError: 'NoneType' object has no attribute 'text'

What on earth am I doing wrong? :(

    import requests
    from bs4 import BeautifulSoup
    import time  # load the time module
    import pyautogui
    from openpyxl import Workbook
    from openpyxl.styles import Alignment

    # User input 푸드
    keyword = pyautogui.prompt("검색어를 입력하세요")
    lastpage = int(pyautogui.prompt("몇 페이지까지 크롤링 할까요?"))

    # Create the Excel workbook
    wb = Workbook()

    # Create the Excel sheet
    ws = wb.create_sheet(keyword)

    # Adjust the column widths
    ws.column_dimensions['A'].width = 60
    ws.column_dimensions['B'].width = 60
    ws.column_dimensions['C'].width = 120

    # Row number
    row = 1

    # Page number
    page_num = 1

    for i in range(1, lastpage * 10, 10):
        print(f"{page_num}페이지 크롤링 중입니다.===============")
        response = requests.get(f"https://search.naver.com/search.naver?where=news&ie=utf8&sm=nws_hty&query={keyword}&start={i}")
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        articles = soup.select("div.info_group")  # extract the 10 news article divs (ctrl+F, search div.info_group, confirm there are 10)
        for article in articles:
            links = article.select("a.info")  # list: a tags whose class is info
            if len(links) >= 2:  # if there are two or more links
                url = links[1].attrs['href']  # extract the href of the second link
                response = requests.get(url, headers={'User-agent':'Mozila/5.0'})
                html = response.text
                soup = BeautifulSoup(html, 'html.parser')

                # Check for entertainment news
                if "entertain" in response.url:
                    title = soup.select_one(".end_tit")
                    content = soup.select_one("#articeBody")
                elif "sports" in response.url:
                    title = soup.select_one("h4.title")
                    content = soup.select_one("#newsEndContents")

                    # Remove unnecessary divs inside the body (content after the article body)
                    divs = content.select("div")
                    for div in divs:
                        div.decompose()
                    paragraphs = content.select("p")
                    for p in paragraphs:
                        p.decompose()
                else:
                    title = soup.select_one(".media_end_head_headline")
                    content = soup.select_one("#newsct_article")

                print("=======링크======= \n", url)
                print("=======제목======= \n", title.text.strip())
                print("=======본문======= \n", content.text.strip())

                ws[f'A{row}'] = url
                ws[f'B{row}'] = title.text.strip()
                ws[f'C{row}'] = content.text.strip()

                # Wrap text automatically
                ws[f'C{row}'].alignment = Alignment(wrap_text=True)
                row = row + 1
                time.sleep(0.3)  # pause the program for about 0.3 seconds (less load on the server, more stable program)
        page_num = page_num + 1

    wb.save(f'{keyword}_result.xlsx')
8 months ago
Votes: 0 · Views: 145 · Answers: 1
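
A NameError on url like the one quoted above can occur when the print statements run for an article whose if len(links) >= 2 branch was skipped, so url was never assigned. A minimal sketch of that control flow, with plain lists standing in for the parsed links (the sample data and the function name collect_urls are assumptions for illustration):

```python
def collect_urls(articles):
    """Return the second link of every article that has at least two links."""
    urls = []
    for links in articles:
        if len(links) < 2:
            # No second link: skip before any code that uses `url`
            continue
        url = links[1]
        urls.append(url)
    return urls


sample = [
    ["https://press.example/a"],                              # one link only
    ["https://press.example/b", "https://n.news.example/b"],  # two links
]
print(collect_urls(sample))
```

Everything that reads url sits after the early continue, so it can only run once url has been assigned for the current article.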
Unresolved
Imports aren't working~~~~
오유라 · 8 months ago
Votes: 1 · Views: 158 · Answers: 3
Unresolved
Hello. I'm proceeding without using Response..
I wrote the code as shown below. I went ahead without using Response; regular news articles are printed, but entertainment articles are not :(

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from bs4 import BeautifulSoup

    # Automatically update the Chrome driver
    from webdriver_manager.chrome import ChromeDriverManager
    import time
    import pyautogui
    import pyperclip
    import csv

    # Keep the browser from closing
    chrome_options = Options()
    chrome_options.add_experimental_option("detach", True)

    # Don't show the Chrome window
    chrome_options.add_argument('--headless')  # enable headless mode
    chrome_options.add_argument('--disable-gpu')  # disable GPU acceleration

    # Appear to come from a Mozilla web browser / get around detection and blocking of automated requests
    chrome_options.add_argument("--user-agent=Mozilla/5.0")

    # Suppress unnecessary messages
    chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])

    # Update the driver
    service = Service(executable_path=ChromeDriverManager().install())

    # Apply the options
    browser = webdriver.Chrome(service=service, options=chrome_options)

    news = pyautogui.prompt('뉴스기사 입력 >>> ')
    print(f'{news} 검색')

    # Go to the web page address
    path = f'https://search.naver.com/search.naver?where=news&sm=tab_jum&query={news}'

    # Talk to the URL
    browser.get(path)

    # Naver returns the HTML
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    articles = soup.select("div.info_group")  # extract the 10 news article divs

    for article in articles:
        links = article.select("a.info")
        if len(links) >= 2:  # if there are two or more links
            url = links[1].attrs['href']  # extract the href of the second link

            # Fetch once more
            browser.get(url)
            html = browser.page_source
            soup = BeautifulSoup(html, 'html.parser')

            # If it is entertainment news -> ? the div layout is different
            if 'entertain' in url:
                title = soup.select_one(".end_tit")
                content = soup.select_one('#articeBody')
            else:
                title = soup.select_one("#title_area")
                content = soup.select_one('#dic_area')  # get the id of the body for this link

            print("=============링크==========\n", url)
            print("=============제목==========\n", title.text.strip())
            print("=============내용==========\n", content.text.strip())
            time.sleep(0.7)

    print('\nDvlp.H.Y.C.Sol\n')

The output looks like this:

    =============링크==========
     https://n.news.naver.com/mnews/article/382/0001075938?sid=106
    Traceback (most recent call last):
      File "c:\Users\cksth\OneDrive\바탕 화면\Career\크롤링\심화\02.연예뉴스.py", line 71, in <module>
        print("=============제목==========\n", title.text.strip())
    AttributeError: 'NoneType' object has no attribute 'text
찬솔 · 8 months ago
Votes: 0 · Views: 169 · Answers: 1
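
The link printed in the output above points at n.news.naver.com, so a check such as 'entertain' in url on the original href cannot match, even if the browser ends up on an entertainment layout after a redirect. A hedged sketch that picks selectors from the final address instead (browser.current_url in Selenium, response.url in requests). The function name selectors_for and the SELECTORS table are illustrative, and the selector strings are copied from the posts on this page, so they may not match Naver's current markup:

```python
# Selector pairs (title, body) keyed by a marker expected in the final URL
SELECTORS = {
    "entertain": (".end_tit", "#articeBody"),
    "sports": ("h4.title", "#newsEndContents"),
}
DEFAULT = ("#title_area", "#dic_area")


def selectors_for(final_url):
    """Return (title_selector, body_selector) for the page's final URL."""
    for marker, pair in SELECTORS.items():
        if marker in final_url:
            return pair
    return DEFAULT


# Hypothetical final URLs, i.e. the address after any redirect
print(selectors_for("https://entertain.naver.com/read?oid=382&aid=0001075938"))
print(selectors_for("https://n.news.naver.com/mnews/article/025/0003239249"))
```

With Selenium this would be called as selectors_for(browser.current_url) after browser.get(url) has finished loading.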
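Unresolved
Running Python code / pip errors, etc.
Hello, thank you for your kind answer last time. While taking the course I wrote the crawling code and confirmed that it worked. I then installed Python and Visual Studio Code on a different PC and ran the file there, but the crawl does not work properly :(

At first I installed all of the libraries, and the following errors occurred.

[Now resolved] import (module?) error
When I write import requests, the requests part should turn green, but it stays white.

pip install --upgrade error
When I run this command the upgrade does not proceed, and only this message is printed:
ERROR : You must give at least one requirement to install (see "pip help install")
(Just in case, I entered it on the original PC where everything worked, and it printed a notice telling me to use a different command. On the original PC only the error appears, with no explanation at all.)

Basically, running the code should produce an Excel file after crawling, but in the end it doesn't. Could I get some help?

What I have tried so far:
1) Uninstalling and reinstalling Python and Visual Studio Code, checking the Windows version, etc.
2) Setting the Python environment variables (confirmed that the original PC works without any environment variable setup)
3) Restarting Visual Studio Code and rebooting the computer
4) Confirming in cmd that Python is installed correctly
5) Removing and reinstalling pip (still unable to upgrade)
6) Cross-checking the code on the original PC and the current PC (nothing unusual found)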
프로그램초 · 8 months ago
Votes: 0 · Views: 883 · Answers: 2
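
The message "You must give at least one requirement to install" is what pip prints when pip install --upgrade is run without naming a package. A small sketch that builds the full command for the interpreter that is currently running, which also avoids upgrading packages for a different Python installation by accident; the function name pip_upgrade_command is an assumption:

```python
import sys


def pip_upgrade_command(package):
    """Build a pip upgrade command bound to the running interpreter."""
    if not package:
        raise ValueError("pip needs the name of a package to upgrade")
    return [sys.executable, "-m", "pip", "install", "--upgrade", package]


print(" ".join(pip_upgrade_command("requests")))
# To actually run it:
# import subprocess
# subprocess.run(pip_upgrade_command("requests"), check=True)
```

The same shape typed by hand in a terminal would be python -m pip install --upgrade requests.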
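Unresolved
Real estate crawling course event
Hello, teacher~~ A few days ago I wrote a blog post to take part in the event and sent you an email. Once I finish this course I really want to take the real estate one too, haha. Thank you for such a useful course. Please check the email!!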
김제림 · 8 months ago
Votes: 0 · Views: 210 · Answers: 1
Solved
Question about how to improve speed when using selenium.

    '''
    Save rank, store name, star rating, visitor review count, and blog review count
    to Excel in the order shown on Naver Map (page 1 only)

    Notes
    Exclude ads; crawl only stores that have a star rating.
    If there are no visitor reviews, record 0; if there are no blog reviews, record 0.
    How to handle an iframe when one appears.
    Think about how to handle infinite scrolling.
    '''
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    import pyautogui
    import time
    from bs4 import BeautifulSoup
    import openpyxl

    # Automatically update the Chrome driver
    from webdriver_manager.chrome import ChromeDriverManager

    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
    }

    keyword = pyautogui.prompt("검색어를 입력하세요>>>")

    wb = openpyxl.Workbook()
    ws = wb.create_sheet(keyword)
    ws.append(["순위","가게명","별점","방문자 리뷰수","블로그 리뷰수"])

    chrome_options = Options()
    # Keep the browser from closing
    chrome_options.add_experimental_option("detach", True)
    # Suppress unnecessary error messages
    chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])  # ignore Selenium logs

    serivce = Service(executable_path=ChromeDriverManager().install())
    browser = webdriver.Chrome(service=serivce, options=chrome_options)

    # Go to the web page address
    browser.implicitly_wait(5)
    browser.maximize_window()
    browser.get("https://map.naver.com/p?c=15.00,0,0,0,dh")

    search = browser.find_element(By.CSS_SELECTOR,".input_search")
    search.click()
    time.sleep(1)
    search.send_keys(f"{keyword}")
    time.sleep(1)
    search.send_keys(Keys.ENTER)
    time.sleep(1)

    # Enter the iframe
    browser.switch_to.frame("searchIframe")
    # Leave the iframe
    #browser.switch_to.default_content()

    browser.find_element(By.CSS_SELECTOR,"#_pcmap_list_scroll_container").click()
    lists = browser.find_elements(By.CSS_SELECTOR,"li.UEzoS")
    before_scroll = len(lists)

    while True :
        before_scroll = len(lists)
        for i in range(0,10) :
            browser.find_element(By.CSS_SELECTOR,"body").send_keys(Keys.PAGE_DOWN)
        time.sleep(0.5)
        lists = browser.find_elements(By.CSS_SELECTOR,"li.UEzoS")
        after_scroll = len(lists)
        if before_scroll == after_scroll :
            break

    # lists = browser.find_elements(By.CSS_SELECTOR,"li.UEzoS.rTjJo")
    print(f"총 {after_scroll}개의 가게가 있습니다.")

    num = 0
    for list in lists :
        if len(list.find_elements(By.CSS_SELECTOR,"li>a.gU6bV")) == 0 :
            store = list.find_element(By.CSS_SELECTOR,"span.TYaxT")
            browser.execute_script("arguments[0].click();", store)
            time.sleep(1)
            browser.switch_to.default_content()
            browser.switch_to.frame("entryIframe")
            # html_entry = browser.page_source
            # soup_entry = BeautifulSoup(html_entry, 'html.parser')
            if len(browser.find_elements(By.CSS_SELECTOR,"#app-root > div > div > div > div.place_section.OP4V8 > div.zD5Nm.f7aZ0 > div.dAsGb > span.PXMot.LXIwF")) > 0 :
                num += 1
                title = browser.find_element(By.CSS_SELECTOR,".Fc1rA").text
                stars = float(browser.find_element(By.CSS_SELECTOR,".LXIwF > em").text)
                visitor_review = browser.find_element(By.CSS_SELECTOR,"div.dAsGb > span:nth-child(2)").text
                blog_review = browser.find_element(By.CSS_SELECTOR,"div.dAsGb > span:nth-child(3)").text
                visitor_review = int(visitor_review.replace("방문자리뷰 ","").replace(",",""))
                blog_review = int(blog_review.replace("블로그리뷰 ","").replace(",",""))
                print(num,title,stars,visitor_review,blog_review)
                ws.append([num,title,stars,visitor_review,blog_review])
            browser.switch_to.default_content()
            browser.switch_to.frame("searchIframe")

    wb.save(f"/Chapter05/{keyword}.xlsx")

Hello, instructor. Currently, when you search for 강남역 맛집 (Gangnam Station restaurants) on Naver Map, crawling the visitor reviews and blog reviews as in the lecture notes requires clicking each store and entering its iframe to read the data. That is why I wrote the code above. The problem is that although I allow only a 1-second delay for page loading after clicking the store name, reading the store name, star rating, and review counts takes a very long time (5 to 6 seconds or more). As a result, collecting the information for the whole list takes far too long. Is there any way to improve this?
김세종 · 8 months ago
Votes: 1 · Views: 1.13k · Answers: 3
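
One likely contributor to the delay described above is browser.implicitly_wait(5): with an implicit wait set, a find_elements call that matches nothing waits for the full timeout before returning an empty list, and the code uses such calls as existence checks. A hedged sketch of switching the implicit wait off around those checks. count_now and the FakeDriver stand-in are assumptions so that the snippet runs without a browser; implicitly_wait and find_elements are the real WebDriver method names, and "css selector" is the string value behind By.CSS_SELECTOR:

```python
def count_now(driver, selector, restore_wait=5):
    """Count matches for a CSS selector without waiting for missing elements."""
    driver.implicitly_wait(0)  # return immediately when nothing matches
    try:
        return len(driver.find_elements("css selector", selector))
    finally:
        driver.implicitly_wait(restore_wait)  # put the previous wait back


class FakeDriver:
    """Stand-in for a Selenium WebDriver, only for this demonstration."""

    def __init__(self, matches):
        self.matches = matches
        self.waits = []

    def implicitly_wait(self, seconds):
        self.waits.append(seconds)

    def find_elements(self, by, value):
        return self.matches.get(value, [])


driver = FakeDriver({"li.UEzoS": ["store-1", "store-2"]})
print(count_now(driver, "li.UEzoS"))
print(count_now(driver, "li>a.gU6bV"))
print(driver.waits)
```

For elements that must appear after a click, an explicit WebDriverWait on that specific element is usually a better fit than a fixed time.sleep.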
Unresolved
Providing the code
Where can I download the code? It doesn't seem to be in the lesson notes either, so I'm asking here.
소정 · 8 months ago
Votes: 0 · Views: 192 · Answers: 1
Unresolved
Error getting product names after a Coupang search

    '''
    For each keyword below, save rank, brand name, product name, price, and
    detail page link to Excel (ranks 1 to 100)
    [gaming mouse, mechanical keyboard, 27-inch monitor]

    ※ Notes
    - Exclude ad products (marked AD)
    - If the brand name is missing or looks wrong, leave it blank
    '''
    import requests
    from bs4 import BeautifulSoup
    import time
    import pyautogui

    # The request hangs unless User-Agent and Accept-Language are added to the headers
    header = {
        'Host': 'www.coupang.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'ko-KR,ko;q=0.8,en-US;q=0.5,en;q=0.3',
    }

    url = f'https://www.coupang.com/np/search?component=&q=%EA%B2%8C%EC%9D%B4%EB%B0%8D+%EB%A7%88%EC%9A%B0%EC%8A%A4&channel=user'
    response = requests.get(url, headers=header)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.select("div.descriptions-inner")

    for item in items :
        name = soup.select_one("div.name").text
        print(name)

I wrote code that searches for '게이밍 마우스' (gaming mouse) from the Coupang front page and gets the name of each product. (Not by visiting each product's URL as in the lecture, but directly from the search results page.) Each product has a div with the class descriptions-inner and, inside it, a div with the class name, so I tried to fetch the names that way. When I run the code above it does fetch a product name, but a single, seemingly random product name from the page is printed over and over... Where did I go wrong? (As shown below, one product name from the page is repeated.)

    로지텍코리아 (정품) 로지텍 G502 X PLUS 무선 게이밍 마우스, 블랙
    로지텍코리아 (정품) 로지텍 G502 X PLUS 무선 게이밍 마우스, 블랙
    로지텍코리아 (정품) 로지텍 G502 X PLUS 무선 게이밍 마우스, 블랙
    로지텍코리아 (정품) 로지텍 G502 X PLUS 무선 게이밍 마우스, 블랙
    로지텍코리아 (정품) 로지텍 G502 X PLUS 무선 게이밍 마우스, 블랙
    로지텍코리아 (정품) 로지텍 G502 X PLUS 무선 게이밍 마우스, 블랙
김세종 · 8 months ago
Votes: 1 · Views: 317 · Answers: 1
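
In the loop above, soup.select_one("div.name") searches the whole page on every iteration, so it always returns the first div.name in the document, which would explain the same name being printed repeatedly. Searching from item instead limits the lookup to that product's block. A minimal sketch on a made-up HTML fragment (the markup and product names are invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<div class="descriptions-inner"><div class="name">Mouse A</div></div>
<div class="descriptions-inner"><div class="name">Mouse B</div></div>
<div class="descriptions-inner"><div class="name">Mouse C</div></div>
"""
soup = BeautifulSoup(html, "html.parser")
items = soup.select("div.descriptions-inner")

# Searching from `soup` ignores the loop variable and repeats the first match
repeated = [soup.select_one("div.name").text for item in items]

# Searching from `item` looks only inside that product's block
names = [item.select_one("div.name").text for item in items]

print(repeated)  # ['Mouse A', 'Mouse A', 'Mouse A']
print(names)     # ['Mouse A', 'Mouse B', 'Mouse C']
```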
Unresolved
Crawling comments on a Naver Game Lounge post
I got stuck while crawling. How can I solve this?

URL: https://game.naver.com/lounge/Viking_Rise/board/detail/2583518

Code used:

    import requests
    from bs4 import BeautifulSoup
    import openpyxl

    url = "https://game.naver.com/lounge/Viking_Rise/board/detail/2583518"
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    comment_items = soup.select(".comment_item_text__1foPs")

    workbook = openpyxl.Workbook()
    sheet = workbook.active
    sheet.title = "Crawled Comments"
    sheet.cell(row=1, column=1, value="Comment Text")
    sheet.cell(row=1, column=2, value="Attributes")

    for index, comment_content in enumerate(comment_items, start=2):
        comment_text = comment_content.get_text(strip=True)
        comment_attributes = str(comment_content.attrs)
        sheet.cell(row=index, column=1, value=comment_text)
        sheet.cell(row=index, column=2, value=comment_attributes)

    workbook.save("crawled_comments.xlsx")
    print("Comment attributes have been crawled and saved to crawled_comments.xlsx")
프로그램초 · 9 months ago
Votes: 0 · Views: 315 · Answers: 1
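
If comment_items in the code above comes back empty, one common reason is that the comments are filled in by JavaScript after the page loads, so they are not in the HTML that requests receives. A quick, hedged way to check is to look for the class name in the raw response text; the helper name in_static_html and the two sample strings are assumptions for illustration:

```python
def in_static_html(html, marker):
    """Return True when `marker` appears in the HTML exactly as served."""
    return marker in html


# Stand-ins for response.text: a JavaScript shell page vs. server-rendered HTML
shell_page = '<div id="root"></div><script src="/app.js"></script>'
rendered_page = '<p class="comment_item_text__1foPs">Nice update</p>'

print(in_static_html(shell_page, "comment_item_text"))     # False
print(in_static_html(rendered_page, "comment_item_text"))  # True
```

When the marker is missing from response.text but visible in the browser, a rendered source such as Selenium's browser.page_source is one option to try.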
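Unresolved
Inquiry about the real estate listings course
Hello! I have taken the instructor's course all the way to the end. It helped me a great deal. Thank you. I saw the announcement about the real estate listings course and sent an inquiry by email, but I haven't received a reply yet, so I'm asking here as a question post. Please check my email inquiry. Thank you!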
hhs0995 · 9 months ago
Votes: 0 · Views: 201 · Answers: 1
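Unresolved
Handling varying tags when crawling
Hello. A question came up while I was taking the Naver Map crawling lecture! On Naver Map, for searches such as [강남역 맛집 (Gangnam Station restaurants), 홍대 술집 (Hongdae bars), 이태원 카페 (Itaewon cafes)], the results can be selected with a CSS selector like li.UEzoS.rTjJo, but for a search such as [청담 미용실 (Cheongdam hair salons)] a different CSS selector is needed. How should I handle a case like this? I always enjoy the lectures. Thank you!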
hhs0995 · 9 months ago
Votes: 0 · Views: 239 · Answers: 1
Unresolved
Fetching the Naver news body works and entertainment news works, but only sports news doesn't?
AttributeError: 'NoneType' object has no attribute 'text'
This error keeps appearing. I copied the entertainment news code and rewrote it right away, but it still doesn't work :(

    import requests
    from bs4 import BeautifulSoup
    import time

    response = requests.get("https://search.naver.com/search.naver?sm=tab_sug.top&where=news&query=%EC%86%90%ED%9D%A5%EB%AF%BC&oquery=%EB%B8%94%EB%9E%99%ED%95%91%ED%81%AC&tqi=iK4yElprvmZss69Ig8Nssssss1w-042517&acq=thsgmd&acr=1&qdt=0")
    html = response.text
    soup = BeautifulSoup(html,'html.parser')
    articles = soup.select("div.info_group")

    for article in articles:
        links = article.select("a.info")
        if len(links) >= 2:
            url = links[1].attrs["href"]
            response = requests.get(url,headers={'User-agent':'Mozila/5.0'})
            html = response.text
            soup = BeautifulSoup(html,'html.parser')

            # If it is entertainment news
            if "entertain" in response.url:
                title = soup.select_one(".end_tit")
                content = soup.select_one ("#articeBody")
            elif "storts" in response.url:
                title = soup.select_one("h4.title")
                content = soup.select_one ("#newsEndContents")

                # Remove unnecessary divs inside the body
                divs = content.select("div")
                for div in divs:
                    div.decompose()
                paragraphs = content.select("p")
                for p in paragraphs:
                    p.decompose()
            else:
                title = soup.select_one("#artcleTitle")
                content = soup.select_one("#areicleBodyContents")

            print("============링크=========\n", url)
            print("============제목=========\n", title.text.strip())
            print("============본문=========\n", content.text.strip())
            time.sleep(0.3)
남경민 · 9 months ago
Votes: 0 · Views: 362 · Answers: 2
Unresolved
Hello teacher, I'm not sure whether this kind of question is allowed here, but please help :(
After learning Selenium and the Chrome driver I set out to crawl a webhard (online file storage) site, but I ran into an unexpected situation...

I am trying to crawl a site called 스마트파일 (Smartfile), https://smartfile.co.kr/. The plan was to open the site with the Chrome driver, put in the details of the category I want (page number, item count, and so on), and fetch the result with Beautiful Soup. However, when I go to a particular category, copy its URL as-is, and open that copied URL directly in the browser, only the text "smart file" appears and no content is shown.

So I assumed something was being blocked and wrote code that clicks into the category with the Chrome driver instead, but clicking the category sends the browser to Google. It looks as if quite a lot is blocked; clicking through a JavaScript command gives the same result. What should I do? Please help...

This is the module that creates the driver:

    """
    Module that creates and configures the Chrome driver
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options

    # Automatically update the Chrome driver
    from webdriver_manager.chrome import ChromeDriverManager

    def create_driver():
        # Keep the Chrome browser from closing
        chrome_options = Options()
        chrome_options.add_experimental_option('detach', True)

        # Suppress unnecessary error messages
        chrome_options.add_experimental_option('excludeSwitches', ['enable-logging'])

        # Install the latest Chrome driver
        service = Service(executable_path=ChromeDriverManager().install())

        # Create the driver object
        driver = webdriver.Chrome(service=service, options=chrome_options)
        driver.service = service
        driver.implicitly_wait(10)
        driver.maximize_window()
        return driver

This is the module that takes the created Chrome driver and opens the given URL for crawling:

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    def execute_crawling(driver: webdriver.Chrome, url: str):
        for i in range(2):
            driver.get(url)
            time.sleep(2)
            if i == 0:
                menu_book = driver.find_element(By.CSS_SELECTOR, '#wrap > div.wrap-nav-wrap > div > ul.depth1 > li.menutop_DOC.m9')
                driver.execute_script("arguments[0].click();", menu_book)
                time.sleep(1)
KimJuYoung · 10 months ago
Votes: 0 · Views: 259 · Answers: 1
Unresolved
Selenium was working fine and suddenly started failing today
Hello. I am enjoying the course. The code I wrote with Selenium all worked fine until now, but as of today an error suddenly occurs, so I'm asking about it! To resolve it I upgraded every version and also tried rebooting, but every script I wrote with Selenium raises the error shown at the bottom :(

The code is as follows.

    # -*- coding: utf-8 -*-
    # Not something to memorize. Just copy and paste it when needed
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.by import By
    import time

    # Automatically update the Chrome driver
    from webdriver_manager.chrome import ChromeDriverManager

    # Keep the browser from closing
    chrome_options = Options()
    chrome_options.add_experimental_option("detach", True)

    # Suppress unnecessary error messages
    chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])

    service = Service(executable_path=ChromeDriverManager().install())
    browser = webdriver.Chrome(service=service, options=chrome_options)

    # Go to the web page address
    browser.get("https://www.naver.com")

The error message is as follows.

    Traceback (most recent call last):
      File "c:\pratice_crolling\실습4_셀레니움 기본 설정\[기초복붙용]셀레니움 기본 설정.py", line 21, in <module>
      File "C:\Users\hyeonseok\AppData\Local\Programs\Python\Python311\Lib\site-packages\webdriver_manager\chrome.py", line 39, in install
        driver_path = self._get_driver_path(self.driver)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\hyeonseok\AppData\Local\Programs\Python\Python311\Lib\site-packages\webdriver_manager\core\manager.py", line 30, in _get_driver_path
        file = self._download_manager.download_file(driver.get_driver_download_url())
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\hyeonseok\AppData\Local\Programs\Python\Python311\Lib\site-packages\webdriver_manager\drivers\chrome.py", line 40, in get_driver_download_url
        driver_version_to_download = self.get_driver_version_to_download()
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\hyeonseok\AppData\Local\Programs\Python\Python311\Lib\site-packages\webdriver_manager\core\driver.py", line 51, in get_driver_version_to_download
        self._driver_to_download_version = self._version if self._version not in (None, "latest") else self.get_latest_release_version()
                                                                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\hyeonseok\AppData\Local\Programs\Python\Python311\Lib\site-packages\webdriver_manager\drivers\chrome.py", line 62, in get_latest_release_version
        resp = self._http_client.get(url=latest_release_url)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\hyeonseok\AppData\Local\Programs\Python\Python311\Lib\site-packages\webdriver_manager\core\http.py", line 37, in get
        self.validate_response(resp)
      File "C:\Users\hyeonseok\AppData\Local\Programs\Python\Python311\Lib\site-packages\webdriver_manager\core\http.py", line 16, in validate_response
        raise ValueError(f"There is no such driver by url {resp.url}")
    ValueError: There is no such driver by url https://chromedriver.storage.googleapis.com/LATEST_RELEASE_115.0.5790
hhs0995 · 10 months ago
Votes: 2 · Views: 15.3k · Answers: 6
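
The URL in the traceback above is the older chromedriver download location, which has no entry for Chrome 115, so the webdriver_manager lookup fails before the browser starts. One commonly used alternative, assuming Selenium 4.6 or later is installed, is to let Selenium's built-in Selenium Manager locate the driver and to drop the Service(ChromeDriverManager().install()) line. A minimal sketch (create_browser is a hypothetical helper name, and the imports sit inside the function so that defining it does not itself require Selenium):

```python
def create_browser(detach=True):
    """Start Chrome and let Selenium Manager resolve the matching driver."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    if detach:
        # Keep the browser window open after the script ends
        options.add_experimental_option("detach", True)
    options.add_experimental_option("excludeSwitches", ["enable-logging"])
    # No Service(...) argument: Selenium 4.6+ finds or downloads the driver
    return webdriver.Chrome(options=options)


# Usage (needs Chrome and Selenium 4.6+ installed):
# browser = create_browser()
# browser.get("https://www.naver.com")
```

Upgrading webdriver_manager to a release that knows the newer download location is another route; which one fits depends on the installed versions.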
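Unresolved
When crawling with just a User-Agent is not possible on other sites, as in the Coupang case
Hello. In the Coupang case, where crawling was not possible with a User-Agent alone, you declared the header in that particular way. If crawling with only a User-Agent turns out to be impossible on another site as well, how should I declare the header there? Is there some kind of rule, or do I just have to google it and copy one? :(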
hhs0995 · 10 months ago
Votes: 0 · Views: 479 · Answers: 1
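
There is no fixed rule for the question above. A common approach is to open the browser's developer tools, look at the request headers the real browser sent for that page, and reuse the ones the site appears to require. A sketch that turns a block of headers copied as text into a dict for requests.get(url, headers=...); parse_headers is a hypothetical helper, and the header values are the ones quoted in the Coupang posts on this page:

```python
def parse_headers(raw):
    """Convert 'Name: value' lines into a headers dict."""
    headers = {}
    for line in raw.strip().splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        name, value = line.split(":", 1)
        if not name.strip():
            continue  # skip lines that start with a colon
        headers[name.strip()] = value.strip()
    return headers


copied = """
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: ko-KR,ko;q=0.8,en-US;q=0.5,en;q=0.3
"""

header = parse_headers(copied)
print(header["Accept-Language"])
```

Starting from the full set a browser sends and removing headers one at a time is one way to find out which ones a site actually checks.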
Unresolved
Error occurs only when searching for '블랙핑크'
Hello, instructor. With the code below, searching for '블랙핑크' (BLACKPINK) produces the following error :(

    Traceback (most recent call last):
      File "c:\pratice_crolling\심화1_\03_스포츠 뉴스 크롤링.py", line 52, in <module>
        print(article_title.text.strip())
              ^^^^^^^^^^^^^^^^^^
    AttributeError: 'NoneType' object has no attribute 'text'

I have checked that the CSS selectors are correct and that there are no typos, so why does only this search term fail?

    # -*- coding: euc-kr -*-
    # Crawl Naver for sports-related search terms such as 손흥민 and 오승환
    import requests
    from bs4 import BeautifulSoup
    import pyautogui
    import time

    search = pyautogui.prompt("어떤 것을 검색하시겠어요?")
    response = requests.get(f"https://search.naver.com/search.naver?sm=tab_hty.top&where=news&query={search}&oquery=%EC%98%B7%EC%9C%BC%ED%99%98&tqi=i74G%2FdprvTossZPeMhCssssssko-058644")
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    articles = soup.select(".info_group")

    for article in articles:
        # Extract only articles that have a '네이버뉴스' (Naver News) link (i.e. those with two or more <a> hyperlinks)
        links = article.select("a.info")
        if len(links) >=2 :
            url = links[1].attrs['href']
            response = requests.get(url, headers={'User-agent':'Mozila/5.0'})
            html = response.text
            soup = BeautifulSoup(html, "html.parser")

            # Sports article
            if "sports" in url:
                article_title = soup.select_one("h4.title")
                article_body = soup.select_one("#newsEndContents")

                # Remove unnecessary content in the body: the contents of p and div tags need not be printed, so remove them
                p_tags = article_body.select("p")  # extract the p tags in the body
                for p_tag in p_tags:
                    p_tag.decompose()
                div_tags = article_body.select("div")  # extract the div tags in the body
                for div_tag in div_tags:
                    div_tag.decompose()

            # Entertainment article
            elif "entertain" in url:
                article_title = soup.select_one(".end_tit")
                article_body = soup.select_one("#articeBody")

            # General news article
            else:
                article_title = soup.select_one("#title_area")
                article_body = soup.select_one("#dic_area")

            # Output
            print("==================================================== 주소 ===========================================================")
            print(url.strip())
            print("==================================================== 제목 ===========================================================")
            print(article_title.text.strip())
            print("==================================================== 본문 ===========================================================")
            print(article_body.text.strip())  # strip() removes leading and trailing whitespace
            time.sleep(0.3)
hhs0995 · 10 months ago
Votes: 0 · Views: 173 · Answers: 2
Unresolved
Question about Naver Map crawling
While crawling Naver Map, I can't work out how to extract the star rating text. Which tag should I use to extract the 4.37 shown below? The image below is a screenshot of my code.
김찬호 · 10 months ago
Votes: 0 · Views: 530 · Answers: 1
Unresolved
Error crawling the first page
Hello!! Even when I copy the updated course materials you uploaded and use them as-is, the code does not work!! I am on a MacBook, so just in case I changed the user agent value to

    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36

but still no values come out!! Please take a look!

    import requests
    from bs4 import BeautifulSoup

    main_url = "https://www.coupang.com/np/search?component=&q=usb%ED%97%88%EB%B8%8C&channel=user"

    # The request hangs unless User-Agent and Accept-Language are added to the headers
    header = {
        'Host': 'www.coupang.com',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'ko-KR,ko;q=0.8,en-US;q=0.5,en;q=0.3',
    }

    response = requests.get(main_url, headers=header)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.select("a.search-product-link")  # select returns a list
    print(links)
심호준 · 10 months ago
Votes: 0 · Views: 270 · Answers: 1
