Free Python Course (Usage Part 3) - Web Scraping (5 hours)

From HTML basics to expert scraping techniques, I'll teach you everything. This one video is all you need.

(5.0) 147 reviews

5,019 students

Web Crawling
Web Scraping
Selenium
Python

Correcting lecture errors

Hello, this is Nadocoding. ^^

Please note that some of the webpages used in the lecture have changed since it was filmed.

Please refer to the notes below as you study.

1. "Tistory" receives HTML normally without changing the UserAgent.

(Related lecture: User Agent)
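For reference, here is a minimal check. The blog address is a placeholder; use the Tistory blog from the lecture. A plain requests call with the default User-Agent should now return the page HTML without an error.

import requests

url = "https://example.tistory.com"  # placeholder Tistory address
res = requests.get(url)  # no User-Agent header set; the requests default is used
res.raise_for_status()  # raises an error if the response is not 200 OK
print(res.text[:300])  # beginning of the returned HTML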

2. When you try to log in to "Naver", a CAPTCHA page asking you to type characters to block automated input may appear. Please refer to the link below, which introduces a way to bypass this by entering the credentials with JavaScript (a sketch follows after the link).

https://jaeseokim.github.io/Python/python-selenium-using-web-crawling-Naver-login-after-subscription-feed-crawling/

(Related lecture: Selenium Advanced (Naver Login))
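As a rough sketch of the approach described in the linked post: instead of typing the ID and password with send_keys, which can trigger the automated-input check, the values are injected with execute_script. The element IDs "id" and "pw" and the login-button ID "log.login" reflect the Naver login page at the time and are assumptions that may have changed, so inspect the page to confirm them.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://nid.naver.com/nidlogin.login")

# Inject credentials with JavaScript instead of send_keys
# to avoid the automated-input detection
my_id, my_pw = "your_id", "your_password"
driver.execute_script(
    "document.getElementById('id').value = arguments[0];"
    "document.getElementById('pw').value = arguments[1];",
    my_id, my_pw,
)

# Click the login button (ID is an assumption; verify on the page)
driver.find_element(By.ID, "log.login").click()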

3. After reviewing the "Coupang" lecture, it appears that some items are retrieved slightly differently from what you see when browsing the site. About 80% of the results match the screen, while the remaining 20% contain values that do not appear on the page (they may belong to the next page), and the order of items is also somewhat scrambled compared to the web page. Coupang seems to return different values when fetched with requests alone, so it seems necessary to cross-check the results against Selenium (a comparison sketch follows below). I sincerely apologize for any errors in the lecture; I did not think to fully verify the results during class.

(Related lecture: BeautifulSoup4 Utilization 2 (Coupang))
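A minimal way to cross-check the two: fetch the same search page with requests and with Selenium, then compare the product names. The search URL is a placeholder, and the selectors li.search-product and div.name are assumptions based on the lecture that may have changed.

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.coupang.com/np/search?q=notebook"  # placeholder search URL
headers = {"User-Agent": "Mozilla/5.0"}

# Results as seen by requests alone
res = requests.get(url, headers=headers)
res.raise_for_status()
soup_req = BeautifulSoup(res.text, "lxml")
names_req = [n.get_text(strip=True) for n in soup_req.select("li.search-product div.name")]

# Results as seen by a real browser via Selenium
driver = webdriver.Chrome()
driver.get(url)
soup_sel = BeautifulSoup(driver.page_source, "lxml")
names_sel = [n.get_text(strip=True) for n in soup_sel.select("li.search-product div.name")]
driver.quit()

# Items that appear in one fetch but not the other
print(set(names_req) ^ set(names_sel))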

4. When you try to retrieve Naver News in the "Project" lecture, a 500 Server Error is returned. In this case, you can add your PC's User-Agent to the headers passed to requests, as in the example below.

(example)

import requests
from bs4 import BeautifulSoup

def create_soup(url):
    # Send a real browser's User-Agent so the server does not reject the request
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"}
    res = requests.get(url, headers=headers)
    res.raise_for_status()  # stop with an error on a failed response
    soup = BeautifulSoup(res.text, "lxml")
    return soup
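For example (the URL below is a placeholder; use the Naver News address from the lecture):

soup = create_soup("https://news.naver.com")
print(soup.title.get_text())  # page title, to confirm the request succeeded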

(Related lecture: Headline / IT News (Naver News))

Thank you.
