[2024 개정판] 이것이 진짜 크롤링이다 - 실전편 · Dashboard

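(4.9) 72 reviews ∙ 846 students

스타트코딩

Python web crawling

132,000 KRW

26,400 KRW per month

with a 5-month installment plan

Instructor: 스타트코딩

26 lessons in total (3 hours 29 minutes)

Access period:

Certificate: issued

Difficulty: --

This course includes answers from the instructor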

Curious about the questions other students ask most often?

Resolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
NoSuchElementException is raised
Hello, I have a question. When I run the code below, I get a NoSuchElementException. As far as I can tell, while downloading the large images only the first picture is saved, and the message appears after that. I suspect the CSS selector is wrong, but I have not been able to fix it.

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    import time
    import os
    import urllib.request
    import pyautogui

    # Auto-update ChromeDriver
    from webdriver_manager.chrome import ChromeDriverManager

    keyword = pyautogui.prompt("검색어를 입력하세요.")

    if not os.path.exists(f'CRAWLING심화\ch4.구글이미지크롤링\{keyword}') == True:
        os.mkdir(f'CRAWLING심화\ch4.구글이미지크롤링\{keyword}')

    # Keep the browser from closing
    chrome_options = Options()
    chrome_options.add_experimental_option("detach", True)

    # Suppress unnecessary error messages
    chrome_options.add_experimental_option('excludeSwitches', ["enable-logging"])

    # Automatically download or locate the latest ChromeDriver
    service = Service(executable_path=ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=chrome_options)

    # Search URL
    url = f"https://www.google.com/search?q={keyword}&sca_esv=581612012&tbm=isch&sxsrf=AM9HkKnRu6DCGGz23e29xT4BSB7Hq95zgA:1699754235522&source=lnms&sa=X&ved=2ahUKEwiboaf7rb2CAxWJfd4KHWkWA9MQ_AUoAXoECAQQAw&biw=1552&bih=737&dpr=1.65"

    # Wait up to 10 seconds for the page to load
    driver.implicitly_wait(10)

    # Maximize the window
    driver.maximize_window()
    driver.get(url)

    # Height before scrolling
    before_h = driver.execute_script("return window.scrollY")

    # Infinite scroll
    while True:
        # Scroll to the bottom
        driver.find_element(By.CSS_SELECTOR, "body").send_keys(Keys.END)
        # Loading time between scrolls
        time.sleep(1)
        # Height after scrolling
        after_h = driver.execute_script("return window.scrollY")
        if after_h == before_h:
            break
        before_h = after_h

    # Extract thumbnail image tags
    imgs = driver.find_elements(By.CSS_SELECTOR, ".rg_i.Q4LuWd")

    for i, img in enumerate (imgs, 1):
        # Click the image to find the large version
        # Clicking sometimes raises an "element click intercepted" error
        # Clicking directly via JavaScript avoids it
        driver.execute_script("arguments[0].click();", img)
        img.click()
        time.sleep(1)

        # Extract the large image URL
        target = driver.find_element(By.CSS_SELECTOR, 'img.sFlh5c.pT0Scc.iPVvYb')
        img_src = target.get_attribute('src')

        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozila/5.0')]
        urllib.request.install_opener(opener)

        # Download the image
        urllib.request.urlretrieve(img_src, f'CRAWLING심화\ch4.구글이미지크롤링\{keyword}\{keyword}{i}.jpg')
        print(f'img {i}개 : {target}')

김민석 · 5 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
1
Views
153
Answers
1
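Separately from the selector problem, the save folder in the code above is written as a backslash path inside an f-string. Sequences such as `\c` and `\{` are not valid escapes, and recent Python versions warn about them. Below is a minimal sketch of building the same kind of path with `pathlib` instead. The helper name `image_path` and its parameters are hypothetical, not part of the course code.

```python
from pathlib import Path


def image_path(base_dir, keyword, index, ext="jpg"):
    """Return <base_dir>/<keyword>/<keyword><index>.<ext>, creating the folder if needed.

    Hypothetical helper for illustration; the names are not from the course material.
    """
    folder = Path(base_dir) / keyword
    # parents=True creates missing parent folders; exist_ok=True avoids an error on reruns
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{keyword}{index}.{ext}"
```

The returned `Path` can be passed to `urllib.request.urlretrieve` as the destination, so the separate `os.path.exists` / `os.mkdir` check is no longer needed.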
Resolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
href error on the first Coupang page
I copied and pasted the code from the lecture notes and ran it as-is:

    C:\CRAWLLING> cmd /C "C:\Users\libra_erv8ij1\AppData\Local\Programs\Python\Python312\python.exe c:\Users\libra_erv8ij1\.vscode\extensions\ms-python.python-2023.20.0\pythonFiles\lib\python\debugpy\adapter/../..\debugpy\launcher 1693 -- "c:\CRAWLLING\CRAWLING 심화\ch3. 쿠팡크롤링\01.첫번째페이지크롤링.py" "
    Traceback (most recent call last):
      File "c:\CRAWLLING\CRAWLING 심화\ch3. 쿠팡크롤링\01.첫번째페이지크롤링.py", line 20, in <module>
        sub_url = "https://www.coupang.com" + link.attrs['href']
                  ~~~~~~~~~~^^^^^^^^
    KeyError: 'href'

I get this href-related error. Why does it happen?
김민석 · 5 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
1
Views
337
Answers
1
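The `KeyError: 'href'` in the traceback above means that at least one of the selected `a` tags had no `href` attribute. The post does not show why (the page markup may have changed, or the response may differ from what the browser shows). In BeautifulSoup, `Tag.attrs` is a plain dict, so `.get()` returns `None` instead of raising. The sketch below uses a small stand-in class so it runs without any parsing library; `product_urls` and `FakeTag` are hypothetical names.

```python
def product_urls(links, base="https://www.coupang.com"):
    """Build absolute URLs, skipping anchors that have no href attribute."""
    urls = []
    for link in links:
        href = link.attrs.get("href")  # None instead of KeyError when missing
        if href:
            urls.append(base + href)
    return urls


class FakeTag:
    """Stand-in for a parsed <a> tag; only the .attrs dict is used here."""

    def __init__(self, **attrs):
        self.attrs = attrs


# One anchor with an href, one without: only the first yields a URL
print(product_urls([FakeTag(href="/vp/products/1"), FakeTag()]))
```

Printing the skipped tags while debugging is a quick way to see what kind of element matched the selector without carrying a link.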
Resolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Part of the HTML cannot be retrieved

    import requests
    from bs4 import BeautifulSoup

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
    }
    url = 'https://new.land.naver.com/complexes/8928?ms=37.2110479,127.0941727,16&a=APT:ABYG:JGC:PRE&b=B1&e=RETAIL&ad=true&articleNo=2350848148'
    article_response = requests.get(url, headers=headers)
    html = article_response.text
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.select(".info_title")  # select() returns a list
    print(items)

Hello. It has been a while since I finished the course, and I got stuck when I picked up BeautifulSoup again after a long break. I am trying to open the Naver Real Estate URL above and fetch the brokerage information for that listing. Chrome DevTools clearly shows a div tag with the info_title class, yet nothing is parsed. Parsing works fine when I specify other classes, so the code itself seems fine. Why does only this one return nothing?
김세종 · 5 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
1
Views
108
Answers
1
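One likely explanation, though the post alone cannot confirm it, is that the `info_title` element is inserted by JavaScript after the page loads: it then shows up in DevTools but not in the text that `requests` receives. A rough way to check is to search the raw response for the class before parsing. The function name below is hypothetical and the regular expression is only an approximation of HTML attribute syntax.

```python
import re


def class_in_static_html(html, class_name):
    """Rough check: does some class="..." attribute in the raw HTML contain class_name?"""
    pattern = r'class\s*=\s*["\'][^"\']*\b' + re.escape(class_name) + r'\b'
    return re.search(pattern, html) is not None


static_page = '<div class="wrap info_title">Agency</div>'
dynamic_page = '<div id="app"></div><script>render()</script>'

print(class_in_static_html(static_page, "info_title"))
print(class_in_static_html(dynamic_page, "info_title"))
```

If the check fails for `article_response.text`, the element is not in the static HTML, and a browser-driven approach such as Selenium (or the request the page itself makes for that data) would be needed instead.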
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
NoSuchElementException stack trace error
I hit an error partway through. I could not work out what went wrong at this point, so I am posting here.

      File "c:\Users\cksth\OneDrive\바탕 화면\Career\크롤링\심화3\09.click.py", line 71, in <module>
        raise exception_class(message, screen, stacktrace)
    selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"img.sFlh5c.pT0Scc.iPVvYb"}
      (Session info: chrome=119.0.6045.124); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
    Stacktrace:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    import urllib.request
    import time
    import pyautogui
    import os

    # Keep the browser from closing
    chrome_options = Options()
    chrome_options.add_experimental_option("detach", True)

    # Do not show the Chrome window
    # chrome_options.add_argument('--headless')     # enable headless mode
    # chrome_options.add_argument('--disable-gpu')  # disable GPU acceleration

    # Suppress unnecessary messages
    chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])

    # Auto-update ChromeDriver
    browser = webdriver.Chrome(options=chrome_options)

    keyword = pyautogui.prompt('검색어를 입력하세요.')
    cnt = 0

    # Create the folder (+= 1 if it already exists)
    while True:
        cnt += 1
        folder_path = f'크롤링/심화3/{keyword}{cnt}모음'
        if not os.path.exists(folder_path):
            os.mkdir(folder_path)
            break

    path = f'https://www.google.com/search?q={keyword}&sca_esv=581612012&tbm=isch&sxsrf=AM9HkKnRu6DCGGz23e29xT4BSB7Hq95zgA:1699754235522&source=lnms&sa=X&ved=2ahUKEwiboaf7rb2CAxWJfd4KHWkWA9MQ_AUoAXoECAQQAw&biw=1552&bih=737&dpr=1.65'  # Google

    browser.implicitly_wait(3)
    browser.maximize_window()
    browser.get(path)

    before_h = browser.execute_script("return window.scrollY")

    # Infinite scroll
    while True:
        # Scroll to the bottom
        browser.find_element(By.CSS_SELECTOR,"body").send_keys(Keys.END)
        time.sleep(5)
        # Height after scrolling
        after_h = browser.execute_script("return window.scrollY")
        # Exit the loop once the scroll height stops changing
        if after_h == before_h:
            print('OKOK')
            break
        # Update the scroll height
        before_h = after_h

    # Extract thumbnail image tags
    imgs = browser.find_elements(By.CSS_SELECTOR, '.rg_i.Q4LuWd')

    for i, img in enumerate(imgs, 1):
        # Click the image, then find the large version
        # Clicking raises "element click intercepted" -> click directly via JS
        browser.execute_script('arguments[0].click();', img)
        time.sleep(1)

        # Extract the large image URL
        target = browser.find_element(By.CSS_SELECTOR, 'img.sFlh5c.pT0Scc.iPVvYb')
        img_src = target.get_attribute('src')

        # Still working on the error
        # if i == 1:
        #     target = browser.find_elements(By.CSS_SELECTOR, 'img.sFlh5c.pT0Scc.iPVvYb')[0]
        # else:
        #     target = browser.find_elements(By.CSS_SELECTOR, 'img.sFlh5c.pT0Scc.iPVvYb')[1]
        # img_src = target.get_attribute('src')

        # Three lines to work around urllib.error.HTTPError: HTTP Error 403: Forbidden
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozila/5.0')]
        urllib.request.install_opener(opener)

        # Save the image
        try:
            urllib.request.urlretrieve(img_src, f'크롤링/심화3/{keyword}{cnt}모음/{keyword}{i}.png')
        except:
            pass
        print(f'img {i}개: {target}')

    print('\nDvlp.H.Y.C.Sol\nJason')

찬솔 · 6 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
352
Answers
2
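The selector in the error message, `img.sFlh5c.pT0Scc.iPVvYb`, is made of generated class names, and such names can change when the site is updated or differ between images. A defensive pattern is to try several candidate selectors and skip the image when none matches, rather than letting `find_element` raise. The sketch below is library-agnostic: `first_match` is a hypothetical helper, and the selectors in the demo are made up for illustration.

```python
def first_match(find_all, selectors):
    """Return the first element found by any selector, or None if none match.

    find_all is any callable that maps a selector string to a list of elements.
    """
    for selector in selectors:
        found = find_all(selector)  # expected to return a (possibly empty) list
        if found:
            return found[0]
    return None


# Demo with a fake page: only the second (made-up) selector matches anything
fake_page = {"img.new-class": ["<big image>"]}
print(first_match(lambda s: fake_page.get(s, []), ["img.old-class", "img.new-class"]))
```

With Selenium, `find_all` could be `lambda s: browser.find_elements(By.CSS_SELECTOR, s)`, since `find_elements` returns an empty list rather than raising when nothing matches. Note that with `implicitly_wait` set, each non-matching selector waits for the full timeout before returning. When the result is `None`, the loop can `continue` to the next thumbnail.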
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
IndexError when crawling large Google images
Hello. I fixed the previous error and started working on the code again. The error is IndexError: list index out of range. I read a post from another student, but I cannot tell what the problem is in my code. I think it is because there is no second index, but I only want to fetch the large picture.

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    import urllib.request
    import time
    import pyautogui
    import os

    # Keep the browser from closing
    chrome_options = Options()
    chrome_options.add_experimental_option("detach", True)

    # Do not show the Chrome window
    chrome_options.add_argument('--headless')     # enable headless mode
    chrome_options.add_argument('--disable-gpu')  # disable GPU acceleration

    # Suppress unnecessary messages
    chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])

    # Auto-update ChromeDriver
    browser = webdriver.Chrome(options=chrome_options)

    keyword = pyautogui.prompt('검색어를 입력하세요.')
    cnt = 0

    # Create the folder (+= 1 if it already exists)
    while True:
        cnt += 1
        folder_path = f'크롤링/심화3/{keyword}{cnt}모음'
        if not os.path.exists(folder_path):
            os.mkdir(folder_path)
            break

    path = f'https://www.google.com/search?q={keyword}&sca_esv=581612012&tbm=isch&sxsrf=AM9HkKnRu6DCGGz23e29xT4BSB7Hq95zgA:1699754235522&source=lnms&sa=X&ved=2ahUKEwiboaf7rb2CAxWJfd4KHWkWA9MQ_AUoAXoECAQQAw&biw=1552&bih=737&dpr=1.65'  # Google

    browser.implicitly_wait(3)
    browser.maximize_window()
    browser.get(path)

    before_h = browser.execute_script("return window.scrollY")

    # Infinite scroll
    while True:
        time.sleep(5)
        # Scroll to the bottom
        browser.find_element(By.CSS_SELECTOR,"body").send_keys(Keys.END)
        # Height after scrolling
        after_h = browser.execute_script("return window.scrollY")
        # Exit the loop once the scroll height stops changing
        if after_h == before_h:
            print('OKOK')
            break
        # Update the scroll height
        before_h = after_h

    # Extract thumbnail image tags
    imgs = browser.find_elements(By.CSS_SELECTOR, '.rg_i.Q4LuWd')

    for i, img in enumerate(imgs, 1):
        # Click the image, then find the large version
        img.click()
        time.sleep(1)

        # Extract the large image URL
        if i == 1:
            target = browser.find_elements(By.CSS_SELECTOR, 'img.sFlh5c.pT0Scc.iPVvYb')[0]
        else:
            target = browser.find_elements(By.CSS_SELECTOR, 'img.sFlh5c.pT0Scc.iPVvYb')[1]  # IndexError: list index out of range
        img_src = target.get_attribute('src')

        # Three lines to work around urllib.error.HTTPError: HTTP Error 403: Forbidden
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozila/5.0')]
        urllib.request.install_opener(opener)

        # Save the image
        try:
            urllib.request.urlretrieve(img_src, f'크롤링/심화3/{keyword}{cnt}모음/{keyword}{i}.png')
        except:
            pass
        print(f'img {i}개: {target}')

    print('\nDvlp.H.Y.C.Sol\nJason')

찬솔 · 6 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
273
Answers
2
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Question about Selenium and ChromeDriver
Hello, instructor. I am crawling with logic that updates ChromeDriver separately. The problem is that when the URL is requested and Chrome opens, an old version of the Google page appears, and the results are old as well. It looks like the screen on the right in the picture below. I have been using this approach because the other approach, which does not update the driver, fails with errors. For now I am attaching only the part of the code that runs. Please tell me how to fix this. Downloading the driver manually produces an error, so I think I have to stay with the ChromeDriver update approach.

Also, with the ChromeDriver update approach I have to use find_elements, but

    imgs = browser.find_elements(By.CSS_SELECTOR, '.rg_i.Q4LuWd')

does not support .click at this point. What should I do? The full code is attached below!

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    import urllib.request

    # Auto-update ChromeDriver
    from webdriver_manager.chrome import ChromeDriverManager
    import time
    import pyautogui
    import os

    # Keep the browser from closing
    chrome_options = Options()
    chrome_options.add_experimental_option("detach", True)

    # Do not show the Chrome window
    # chrome_options.add_argument('--headless')     # enable headless mode
    # chrome_options.add_argument('--disable-gpu')  # disable GPU acceleration

    # Appear to come from a Mozilla web browser / bypass detection and blocking of automated requests
    chrome_options.add_argument("--user-agent=Mozilla/5.0")

    # Suppress unnecessary messages
    chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])

    # Update the driver
    service = Service(executable_path=ChromeDriverManager().install())

    # Apply the options
    browser = webdriver.Chrome(service=service, options=chrome_options)

    keyword = pyautogui.prompt('검색어를 입력하세요.')
    cnt = 0

    # Create the folder (+= 1 if it already exists)
    while True:
        cnt += 1
        folder_path = f'크롤링/심화3/{keyword}{cnt}모음'
        if not os.path.exists(folder_path):
            os.mkdir(folder_path)
            break

    path = f'https://www.google.com/search?q={keyword}&sca_esv=580120143&hl=ko&tbm=isch&sxsrf=AM9HkKmDd46NefxcclWk71YsVWobVHQsIw:1699362285857&source=lnms&sa=X&ved=2ahUKEwicopLr-bGCAxXV-mEKHbygCZgQ_AUoAXoECAMQAw&biw=1455&bih=705&dpr=1.1'  # Google

    browser.implicitly_wait(5)
    browser.maximize_window()
    browser.get(path)

    before_h = browser.execute_script("return window.scrollY")

    # Infinite scroll
    while True:
        time.sleep(2)
        # Scroll to the bottom
        browser.find_element(By.CSS_SELECTOR,"body").send_keys(Keys.END)
        # Height after scrolling
        after_h = browser.execute_script("return window.scrollY")
        # Exit the loop once the scroll height stops changing
        if after_h == before_h:
            break
        # Update the scroll height
        before_h = after_h

    imgs = browser.find_elements(By.CSS_SELECTOR, '.rg_i.Q4LuWd')

찬솔 · 6 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
520
Answers
2
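Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
The links selector a.search-product-link does not return 36 items
Hello, instructor. I was following along with your lecture, and early on, in the part that collects product links, there is the line links = soup.select("a.search-product-link"). In your lecture it correctly selects the 36 items on the first page, but the Coupang site seems to have changed. Besides the items ranked 1 to 36, Coupang has added best-seller products to the same page, and these also have the class a.search-product-link, so I do not know how to filter them out. The upper picture shows a product ranked 1 to 36, and the lower picture shows a product in the best-seller section. As you can see, the classes are identical, so I do not know how to tell them apart.
안효기 · 6 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
169
Answers
1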
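Resolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Question about the difference between requests and selenium
Hello, instructor. I am a student working through the practical course after finishing the basic course. Because I learned selenium in the second half of the basic course, I found myself reaching for selenium to build the program you asked us to try on our own first in the lecture "How to get news article links". Watching the rest of the lecture, I saw that you used the requests library. I remember you saying that you choose between selenium and requests depending on whether a page is dynamic or static. Did you build this program with requests because the news article links in this lecture are also on a static page?
안효기 · 6 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
165
Answers
1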
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
On a Mac, the ID and password are entered incorrectly

    pyautogui.hotkey('command', 'v')
    time.sleep(2)

I did this correctly, but just as you described, the ID field receives the letter v while the password goes in correctly.

    id = driver.find_element(By.CSS_SELECTOR, "#id")
    id.click()
    pyperclip.copy("****")
    pyautogui.hotkey('command', 'v')
    time.sleep(2)

    pw = driver.find_element(By.CSS_SELECTOR, "#pw")
    pw.click()
    pyperclip.copy("****")
    pyautogui.hotkey('command', 'v')
    time.sleep(2)

What could be the problem?
sharprea · 6 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
176
Answers
2
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
I want to click, have the detail content change, and then fetch it
I am writing a program based on the Google image example. In the source below, each time menu_item is clicked inside the for loop, I need to fetch the content from the detail page again. It seems to run too fast to fetch anything, but when I add a time.sleep call an error occurs. What is the reason?

    for i, menu_item in enumerate(items,1):
        menu_item_text = menu_item.text.replace("\n","일")
        print("--------------------")
        print(f" {i}:{menu_item_text}")
        print("----------------------->")
        menu_item.click()
        choice_items = driver.find_elements(By.CSS_SELECTOR,"[onfocus]")
        for j , choice_item in enumerate(choice_items,1):
            if choice_item.text :  # when the text is not empty
                print(f" {j}: {choice_item.text.replace('예약하기','')}")

J Park · 7 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
160
Answers
1
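Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Where do I get the Chrome WebDriver on an M1 Mac?
The basic Selenium lecture showed how to update the driver automatically, but when building the YouTube crawling program you specified a path instead. Am I mixing things up? Please tell me how to do it.
min · 7 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
243
Answers
1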
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
About infinite scroll
Hello! I nearly found the answer in another student's post, but then I noticed that infinite scroll stops at the 50th image. When I run without showing the Chrome browser it crawls up to 500 images, but I want to run with the browser visible. How can I fix this? I allowed about 2 seconds of loading time, but it does not proceed, so I am posting a question!

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    import urllib.request

    # Auto-update ChromeDriver
    from webdriver_manager.chrome import ChromeDriverManager
    import time
    import pyautogui
    import os

    # Keep the browser from closing
    chrome_options = Options()
    chrome_options.add_experimental_option("detach", True)

    # Do not show the Chrome window
    chrome_options.add_argument('--headless')     # enable headless mode
    chrome_options.add_argument('--disable-gpu')  # disable GPU acceleration

    # Appear to come from a Mozilla web browser / bypass detection and blocking of automated requests
    chrome_options.add_argument("--user-agent=Mozilla/5.0")

    # Suppress unnecessary messages
    chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])

    # Update the driver
    service = Service(executable_path=ChromeDriverManager().install())

    # Apply the options
    browser = webdriver.Chrome(service=service, options=chrome_options)

    keyword = pyautogui.prompt('검색어를 입력하세요.')

    # Create the folder
    if not os.path.exists(f'크롤링/심화2/{keyword}모음'):
        os.mkdir(f'크롤링/심화2/{keyword}모음')

    # path = f'https://www.google.co.kr/search?tbm=isch&hl=ko&source=hp&biw=&bih=&q={keyword}'  # Google
    path = f'https://search.naver.com/search.naver?where=image&sm=tab_jum&query={keyword}'  # Naver

    browser.implicitly_wait(5)
    browser.maximize_window()
    browser.get(path)

    before_h = browser.execute_script("return window.scrollY")

    # Infinite scroll
    while True:
        time.sleep(2)
        # Scroll to the bottom
        browser.find_element(By.CSS_SELECTOR,"body").send_keys(Keys.END)
        # Height after scrolling
        after_h = browser.execute_script("return window.scrollY")
        # Exit the loop once the scroll height stops changing
        if after_h == before_h:
            break
        # Update the scroll height
        before_h = after_h

    # Extract image tags
    imgs = browser.find_elements(By.CSS_SELECTOR, '._image._listImage')

    for i, img in enumerate(imgs, 1):
        # Extract the URL of each image tag
        link = img.get_attribute('src')
        # Save the image
        urllib.request.urlretrieve(link, f'크롤링/심화2/{keyword}모음/{i}.png')
        print(f'img {i}개: {link}')

    print('\nDvlp.H.Y.C.Sol\n')

찬솔 · 7 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
331
Answers
2
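The scroll loop in the post exits the first time the height is unchanged after a single 2-second pause, so one slow load is enough to end it early. Whether that is what stops the crawl at 50 images cannot be confirmed from the post, but a more tolerant loop is easy to sketch: keep scrolling until the height has stayed the same several times in a row. `scroll_to_end` and its parameters are hypothetical names, and the callables stand in for the Selenium calls so the sketch runs on its own.

```python
import time


def scroll_to_end(get_height, scroll_down, pause=2.0, patience=3, sleep=time.sleep):
    """Scroll until the height stays unchanged `patience` times in a row.

    Returns the number of scroll steps performed.
    """
    steps = 0
    unchanged = 0
    last = get_height()
    while unchanged < patience:
        scroll_down()
        sleep(pause)          # give the page time to load more items
        steps += 1
        current = get_height()
        if current == last:
            unchanged += 1    # nothing new loaded this round
        else:
            unchanged = 0     # progress was made, reset the counter
            last = current
    return steps
```

With Selenium, `get_height` could be `lambda: browser.execute_script("return window.scrollY")` and `scroll_down` could send `Keys.END` to the `body` element, as in the post. A larger `patience` trades extra waiting at the end for fewer premature exits.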
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
"Not secure" data;
It was working fine, then it stopped here and an error appeared.
김동호 · 7 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
118
Answers
2
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
A question about code I wrote before watching the lecture

    import requests,openpyxl
    from bs4 import BeautifulSoup
    from openpyxl.styles import Font

    wb = openpyxl.Workbook()
    wb.create_sheet('증권')
    del wb['Sheet']
    ws = wb['증권']
    ws.append(['순번','종목','PER','ROE','DATE'])
    path = 'C:/Users/user/DESKTOP/'
    count = 10
    x =1

    requests.get('https://finance.naver.com/sise/field_submit.naver?menu=market_sum&returnUrl=http%3A%2F%2Ffinance.naver.com%2Fsise%2Fsise_market_sum.naver%3F%26page%3D1&fieldIds=per&fieldIds=roe&fieldIds=high_val&fieldIds=low_val&fieldIds=pbr&fieldIds=reserve_ratio')

    for t in range(1,count + 1,1) :
        respone = requests.get(f'https://finance.naver.com/sise/sise_market_sum.naver?sosok=0&page={t}')
        html = respone.text
        soup = BeautifulSoup(html,'html.parser')
        for i in range(1,100,1):
            first_link = f"#contentarea > div.box_type_l > table.type_2 > tbody > tr:nth-child({i+1})"
            links = soup.select(first_link)
            for link in links :
                try:
                    name = link.select_one(f"{first_link} > td:nth-child(2)").text
                    per = link.select_one(f"{first_link} > td:nth-child(9)").text
                    roe = link.select_one(f"{first_link} > td:nth-child(11)").text
                    dae = link.select_one(f"{first_link} > td:nth-child(12)").text
                    ws.append([x,name,per,roe,dae])
                    x += 1
                except:
                    pass

    wb.save(f'{path}증권.xlsx')

This is code I wrote for the assignment before watching the lecture, and there is something I still do not quite understand. When I first wrote it, the URL stayed the same after the request in the requests.get part was sent, so I stored the existing URL in a new variable, like this:

    respone = requests.get(f'https://finance.naver.com/sise/sise_market_sum.naver?sosok=0&page={t}')

When I used the existing URL, the checkbox selections I had saved kept showing up. Could you tell me why?

Also, I did not know how to handle the bar (divider) rows while saving the td cells, so I handled them with try and except as shown above. Could writing it this way cause problems, not only in this code but also when something similar comes up later?
이동창 · 7 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
154
Answers
2
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Different results appear when the search term is entered through the GUI!
Hello! I am working on crawling. I use a GUI prompt to fill the keyword variable with the search term. When I put the search term directly in the URL, the right results come up, but when I use the GUI, different content appears. In my results it is crawling books. What is the problem?

    import requests
    from bs4 import BeautifulSoup
    import time
    import pyautogui

    keyword = pyautogui.prompt('검색어를 입력하세요.')
    path = 'https://www.coupang.com/np/search?q={keyword}&channel=recent'

    # The request hangs unless User-Agent and Accept-Language are added to the headers
    header = {
        'Host': 'www.coupang.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:76.0) Gecko/20100101 Firefox/76.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'ko-KR,ko;q=0.8,en-US;q=0.5,en;q=0.3',
    }

    response_1 = requests.get(path, headers=header)
    html = response_1.text
    soup_1 = BeautifulSoup(html, 'html.parser')
    links = soup_1.select('a.search-product-link')

    for link in links:
        # Exclude ad products
        if len(link.select('span.ad-badge-text')) > 0:
            print('광고 상품 입니다.')
        else:
            sub_path = 'https://www.coupang.com/' + link.attrs['href']
            # print(sub_path)
            response_2 = requests.get(sub_path, headers=header)
            html = response_2.text
            soup_2 = BeautifulSoup(html, 'html.parser')

            # Brand - may or may not be present.
            # Used items have different tags
            try:
                brand_name = soup_2.select_one('a.prod-brand-name').text
            except:
                brand_name = ""

            # Product name
            product_name = soup_2.select_one('h2.prod-buy-header__title').text

            # Price
            try:
                product_price = soup_2.select_one('span.total-price > strong').text
            except:
                product_price = 0

            print(brand_name, product_name, product_price)

    print('\nDvlp.H.Y.C.Sol\n')

찬솔 · 7 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
142
Answers
2
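In the code as posted, the `path` string has no `f` prefix, so the literal text `{keyword}` is sent as the search term instead of what was typed in the prompt. That would explain unrelated results. A minimal sketch of the fix follows; it also percent-encodes the keyword with `urllib.parse.quote`, which is an addition of this sketch rather than something from the course code, and `search_url` is a hypothetical name.

```python
from urllib.parse import quote


def search_url(keyword):
    """Build the search URL with the keyword substituted and percent-encoded."""
    # The f prefix is what makes {…} a placeholder rather than literal text
    return f"https://www.coupang.com/np/search?q={quote(keyword)}&channel=recent"


print(search_url("usb cable"))
```

Printing the final URL before calling `requests.get` is a quick way to confirm that the keyword actually made it into the query string.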
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
What should I do when this error appears while crawling news articles?
It was scraping fine, then an error suddenly appeared, and since then this error keeps coming up.

        isLastPage = soup.select_one("a.btn_next").attrs['aria-disabled']
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: 'NoneType' object has no attribute 'attrs'

This is the code I wrote. I need the September news results, but I was told only up to 400 pages can be scraped, so I am roughly splitting the work into several runs.

    # Naver news crawling, saved to Excel
    import requests
    from bs4 import BeautifulSoup
    import time
    import pyautogui
    from openpyxl import Workbook

    # User input
    keyword = pyautogui.prompt("검색어를 입력하세요")
    lastpage = int(pyautogui.prompt("몇 페이지까지 크롤링 할까요?"))

    # Create the Excel workbook
    wb = Workbook()

    # Create the Excel sheet
    ws = wb.create_sheet(keyword)

    # Adjust column widths
    ws.column_dimensions['A'].width = 60
    ws.column_dimensions['B'].width = 60
    ws.column_dimensions['C'].width = 30

    # Row number
    row = 1

    # Page number
    pageNum = 1

    for i in range(1, lastpage*10, 10) :
        print(f"{pageNum}페이지 크롤링중입니다 =================")
        response = requests.get(f"https://search.naver.com/search.naver?where=news&query=%EC%95%94&sm=tab_opt&sort=1&photo=0&field=0&pd=3&ds=2023.09.01&de=2023.09.07&news&query={keyword}&start={i}")
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        articles = soup.select("div.info_group")  # extract the 10 news article divs

        for article in articles:
            links = article.select("a.info")  # list
            if len(links) >= 2:  # if there are two or more links
                url = links[1].attrs['href']  # take the href of the second link
                response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
                html = response.text
                soup_sub = BeautifulSoup(html, 'html.parser')

                title = None
                date = None

                # If it is entertainment news
                if "entertain" in response.url:
                    title = soup_sub.select_one(".end_tit")
                    date = soup_sub.select_one("div.article_info > span > em")
                # If it is sports news
                elif "sports" in response.url:
                    title = soup_sub.select_one("h4.title")
                else:
                    title = soup_sub.select_one(".media_end_head_headline")
                    date = soup_sub.select_one("span.media_end_head_info_datestamp_time._ARTICLE_DATE_TIME")

                print("=======링크======= \n", url)
                print("=======제목======= \n", title.text.strip() if title else "제목을 찾을 수 없습니다.")
                print("=======날짜======= \n", date.text if date else "날짜를 찾을 수 없습니다.")

                # Changed so that the 'date' definition in the 'else' block ends here
                ws['A1'] = 'URL'
                ws['B1'] = '기사제목'
                ws['C1'] = '업로드날짜'

                ws[f'A{row}'] = url
                ws[f'B{row}'] = title.text.strip() if title else "제목을 찾을 수 없습니다."
                if date:
                    ws[f'C{row}'] = date.text.strip()
                else:
                    ws[f'C{row}'] = "날짜를 찾을 수 없습니다."
                row=row+1

        # Check whether this is the last page
        isLastPage = soup.select_one("a.btn_next").attrs['aria-disabled']
        if isLastPage == 'true':
            print("마지막 페이지 입니다.")
            break
        pageNum = pageNum+1

    wb.save(r'/Users/eunkyungsong/Desktop/코딩/10월 셀레니움 크롤링/실전/9월뉴스기사크롤링/' + f'{keyword}_result.9월.본문x.9.07.xlsx')

sek95041143 · 7 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
367
Answers
2
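The `AttributeError` above means `soup.select_one("a.btn_next")` returned `None`, i.e. the response contained no element matching that selector. The post does not show why: the results may have run out, the markup may differ, or the request may have been refused after many calls. Either way, the result needs a `None` check before `.attrs` is read. The sketch below treats a missing button as the end of the results, which is an assumption; `is_last_page` and `FakeTag` are hypothetical names used so the example runs without a parser.

```python
def is_last_page(next_button):
    """True when there is no 'next' button or it is marked aria-disabled="true".

    Treating a missing button as the last page is an assumption of this sketch.
    """
    if next_button is None:  # select_one() found nothing
        return True
    return next_button.attrs.get("aria-disabled") == "true"


class FakeTag:
    """Stand-in for a parsed tag; only the .attrs dict is used here."""

    def __init__(self, attrs):
        self.attrs = attrs


print(is_last_page(None))
print(is_last_page(FakeTag({"aria-disabled": "false"})))
```

In the loop from the post this would be called as `is_last_page(soup.select_one("a.btn_next"))`. Logging `response.status_code` when the button is missing can help tell a genuine last page apart from a blocked request.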
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
How to specify a date range when crawling news articles
Hello, instructor. This is my first time even installing Python, and thanks to you I have made it this far. Thank you so much. I managed to follow along as far as exporting the news articles to Excel, but I want to set a date range and I cannot figure it out.
ex) I want only Q3 articles / I want only August articles
Please show me which part of the code to change and how.
+ When fetching Naver articles I can only get up to 400 pages. Is there another way?

    # Naver news crawling, saved to Excel
    import requests
    from bs4 import BeautifulSoup
    import time
    import pyautogui
    from openpyxl import Workbook

    # User input
    keyword = pyautogui.prompt("검색어를 입력하세요")
    lastpage = int(pyautogui.prompt("몇 페이지까지 크롤링 할까요?"))

    # Create the Excel workbook
    wb = Workbook()

    # Create the Excel sheet
    ws = wb.create_sheet(keyword)

    # Adjust column widths
    ws.column_dimensions['A'].width = 60
    ws.column_dimensions['B'].width = 60
    ws.column_dimensions['C'].width = 120
    ws.column_dimensions['D'].width = 60

    # Row number
    row = 1

    # Page number
    pageNum = 1

    for i in range(1, lastpage*10, 10) :
        print(f"{pageNum}페이지 크롤링중입니다 =================")
        response = requests.get(f"https://search.naver.com/search.naver?sm=tab_hty.top&where=news&query={keyword}&start={i}")
        html = response.text
        soup = BeautifulSoup(html, 'html.parser')
        articles = soup.select("div.info_group")  # extract the 10 news article divs

        for article in articles:
            links = article.select("a.info")  # list
            if len(links) >= 2:  # if there are two or more links
                url = links[1].attrs['href']  # take the href of the second link
                response = requests.get(url, headers={'User-agent': 'Mozila/5.0'})
                html = response.text
                soup_sub = BeautifulSoup(html, 'html.parser')

                # If it is entertainment news
                if "entertain" in response.url:
                    title = soup_sub.select_one(".end_tit")
                    content = soup_sub.select_one("#articeBody")
                    date = soup_sub.select_one("div.article_info > span > em")
                # If it is sports news
                elif "sports" in response.url:
                    title = soup_sub.select_one("h4.title")
                    content = soup_sub.select_one("#newsEndContents")
                    # Remove unnecessary div and p tags inside the body
                    divs = content.select("div")
                    for div in divs:
                        div.decompose()
                    paragraphs = content.select("p")
                    for p in paragraphs:
                        p.decompose()
                else:
                    title = soup_sub.select_one(".media_end_head_headline")
                    content = soup_sub.select_one("#newsct_article")
                    date = soup_sub.select_one("span.media_end_head_info_datestamp_time._ARTICLE_DATE_TIME")

                print("=======링크======= \n", url)
                print("=======제목======= \n", title.text.strip())
                print("=======본문======= \n", content.text.strip())
                print("=======날짜======= \n", date)

                ws['A1'] = 'URL'
                ws['B1'] = '기사제목'
                ws['C1'] = '기사본문'
                ws['D1'] = '업로드날짜'

                ws[f'A{row}'] = url
                ws[f'B{row}'] = title.text.strip()
                ws[f'C{row}'] = content.text.strip()
                ws[f'D{row}'] = date.text.strip()
                row=row+1
        pageNum = pageNum+1

    wb.save(f'{keyword}_result.date.xlsx')

sek95041143 · 7 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
862
Answers
2
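A search URL posted in another question on this page carries the parameters `sort=1&pd=3&ds=2023.09.01&de=2023.09.07`, which suggests that `ds` and `de` hold the start and end dates of the range. The sketch below builds such a URL with `urllib.parse.urlencode`. The parameter names are copied from that URL; their exact meaning (for example that `pd=3` selects a custom period) is an assumption, Naver may change them at any time, and `news_search_url` is a hypothetical name.

```python
from urllib.parse import urlencode


def news_search_url(keyword, start, date_from, date_to):
    """Build a Naver news search URL limited to a date range.

    Dates use the "YYYY.MM.DD" form seen in the URL quoted on this page.
    """
    params = {
        "where": "news",
        "query": keyword,
        "sort": 1,        # copied from the quoted URL; assumed to control ordering
        "pd": 3,          # copied from the quoted URL; assumed to mean "custom period"
        "ds": date_from,  # range start, e.g. "2023.08.01"
        "de": date_to,    # range end, e.g. "2023.08.31"
        "start": start,   # index of the first result on the page (1, 11, 21, ...)
    }
    return "https://search.naver.com/search.naver?" + urlencode(params)


print(news_search_url("AI", 1, "2023.08.01", "2023.08.31"))
```

For the page limit, one option is the approach the other poster describes: split a long period into shorter ranges (a week at a time, for instance) and run the same loop once per range, so that no single search has to go past the limit.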
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Error after inserting the code that removes unnecessary div and p tags
Hello, instructor. This is about an error that occurs after inserting the code that removes unnecessary div and p tags.

    import requests
    from bs4 import BeautifulSoup
    import time

    req_header_dict = {
        # Request header: browser information
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36'
    }

    response = requests.get("https://search.naver.com/search.naver?where=news&sm=tab_jum&query=%EC%86%90%ED%9D%A5%EB%AF%BC", headers= req_header_dict)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    articles = soup.select("div.info_group")  # get the 10 news article divs

    for article in articles:
        links = article.select("a.info")  # the result is a list
        if len(links) >= 2:
            url = links[1].attrs["href"]
            response = requests.get(url, headers= req_header_dict)
            html = response.text
            soup = BeautifulSoup(html, "html.parser")

            # If it is news
            if "entertain" in response.url:
                title = soup.select_one(".end_tit")
                content = soup.select_one("#articeBody")
            # If it is sports news
            elif "sports" in response.url:
                title = soup.select_one("h4.title")
                content =soup.select_one("#newsEndContents")
                # Remove unnecessary divs inside the body
                divs = content.select("div")
                for div in divs:
                    div.decompose()
                paragraphs = content.select("p")
                for p in paragraphs:
                    p.decompose()
            else:
                title = soup.select_one(".tit.title_area")
                content = soup.select_one("#newsct_article")

            print("##########링크##########",url)
            print("##########제목##########",title.text.strip())
            print("##########본문##########",content.text.strip())
            time.sleep(0.3)

yhahn02 · 7 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
121
Answers
1
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Extracting Google image URLs - error (cat)
*. Question: the problem seems to occur when extracting the large image URL. I cannot find a fix. Search term: "고양이" (cat)

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    import os
    import urllib.request
    import pyautogui

    # keyword = pyautogui.prompt("검색어를 입력하세요")
    if not os.path.exists("고양이"):
        os.mkdir("고양이")

    # Auto-update ChromeDriver
    from webdriver_manager.chrome import ChromeDriverManager
    import time
    import pyautogui
    import pyperclip

    # Keep the browser from closing
    chrome_options = Options()
    chrome_options.add_experimental_option("detach", True)

    # Suppress unnecessary error messages
    chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])

    service = Service(executable_path=ChromeDriverManager().install())
    browser = webdriver.Chrome(service=service, options=chrome_options)

    # Go to the target web page
    browser.implicitly_wait(10)  # wait up to 10 seconds for the page to load
    browser.maximize_window()
    #browser = webdriver.Chrome()
    browser.get("https://www.google.co.kr/search?q=%EA%B3%A0%EC%96%91%EC%9D%B4&tbm=isch&ved=2ahUKEwioo8HqscOBAxUM_WEKHdO9CDwQ2-cCegQIABAA&oq=%EA%B3%A0%EC%96%91%EC%9D%B4&gs_lcp=CgNpbWcQAzIECCMQJzIICAAQgAQQsQMyCAgAEIAEELEDMggIABCABBCxAzIICAAQgAQQsQMyCAgAEIAEELEDMggIABCABBCxAzIFCAAQgAQyCAgAEIAEELEDMgUIABCABDoLCAAQgAQQsQMQgwFQ9hJYiRlg7hpoAXAAeACAAY8BiAGMB5IBAzEuN5gBAKABAaoBC2d3cy13aXotaW1nwAEB&sclient=img&ei=eT4QZeiCOoz6hwPT-6LgAw&bih=933&biw=1680")

    before_h = browser.execute_script("return window.scrollY")

    # Infinite scroll
    while True:
        browser.find_element(By.CSS_SELECTOR, "body").send_keys(Keys.END)
        time.sleep(1)
        after_h = browser.execute_script("return window.scrollY")
        if after_h == before_h:
            break
        before_h = after_h

    # Extract thumbnail image tags
    imgs = browser.find_elements(By.CSS_SELECTOR,".rg_i.Q4LuWd")

    for i, img in enumerate(imgs,1):
        # Click each image to find the large version
        img.click()
        time.sleep(2)

        # Extract the large image
        target = browser.find_element("img.r48jcc.pT0Scc.iPVvYb")
        img_src = target.get_attribute("src")

        # Download the image
        # While crawling, an HTTP Error 403: Forbidden occurs.
        opener = urllib.request.build_opener()
        opener.addheaders = [("User-Agent","Mozila/5.0")]
        urllib.request.install_opener(opener)
        urllib.request.urlretrieve(img_src,f"고양이{i}.jpg")  # save the image

yhahn02 · 7 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
159
Answers
1
Unresolved
[2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Selenium fails to load the WebDriver

    Microsoft Windows [Version 10.0.19045.3448]
    (c) Microsoft Corporation. All rights reserved.

    C:\Users\user\data>C:/Users/user/AppData/Local/Programs/Python/Python311/python.exe c:/Users/user/data/sel.py
    Traceback (most recent call last):
      File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\common\driver_finder.py", line 38, in get_path
        path = SeleniumManager().driver_location(options) if path is None else path
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\common\selenium_manager.py", line 76, in driver_location
        browser = options.capabilities["browserName"]
                  ^^^^^^^^^^^^^^^^^^^^
    AttributeError: 'str' object has no attribute 'capabilities'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "c:\Users\user\data\sel.py", line 33, in <module>
      File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\chrome\webdriver.py", line 45, in __init__
        super().__init__(
      File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\chromium\webdriver.py", line 51, in __init__
        self.service.path = DriverFinder.get_path(self.service, options)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\common\driver_finder.py", line 40, in get_path
        msg = f"Unable to obtain driver for {options.capabilities['browserName']} using Selenium Manager."
                                             ^^^^^^^^^^^^^^^^^^^^

오유라 · 7 months ago · [2024 개정판] 이것이 진짜 크롤링이다 - 실전편
Votes
0
Views
4.13k
Answers
1
