Practical Data Science Part 2: Data Preprocessing
김화종
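Learn why exploratory data analysis (EDA), data cleaning, scaling, outlier handling, log transformation, and categorical encoding are needed in real-world practice, and how to handle each of them. You will also learn how to merge tabular data and how to process (unstructured) time series data.
Beginner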
Python
The digital transformation (DT) and introduction of artificial intelligence (AI) in companies begin with the construction of machine learning models. However, the scope of machine learning technology is very broad, and in order to select the optimal method, it is necessary to clearly understand the basic concepts. In this lecture, we will introduce the core contents necessary to clearly understand the basic concepts of machine learning, focusing on five examples.
Understand the basics of what machine learning is and how it works.
Understand how to implement machine learning models in Python and various performance metrics to evaluate the performance of the model.
Understand the difference between traditional statistical analysis and machine learning, and learn key statistical techniques such as probability distributions, independence tests, and chi-square tests through examples.
Only the essential points are included!
Understanding Machine Learning Fundamentals for Model Building
Machine learning refers to software that performs tasks such as predicting numbers (regression), classifying categories, and making optimal recommendations, and that gradually improves its performance by observing and learning from data.
Machine learning is currently the most common method for implementing artificial intelligence. The core function of machine learning is to create a machine learning "model" that performs intelligent actions.
A model is software that obtains the optimal output (y) from input data (X), where the optimal output is one that predicts the correct answer (label, target) well.
Model types include linear models, logistic regression, support vector machines (SVMs), decision trees, random forests, k-NN, Bayesian models, and deep learning models (MLPs, CNNs, and RNNs). While this lecture does not cover the specifics of these algorithms, it will teach you the basic and common methods for implementing machine learning models using linear models. The characteristics of each model will be covered in other lectures.
To implement the optimal model, you must prepare the training data required to train the model and the validation data required to verify the operation of the trained model.
Data preprocessing is the process of creating appropriate training and validation data from raw data, and it greatly affects the performance of machine learning models.
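As a rough illustration of preparing training and validation data and then training a linear model, here is a minimal sketch in plain Python. The function names (`train_validation_split`, `fit_line`), the 25% validation ratio, and the toy data are assumptions made for this example, not material from the lecture.

```python
import random

def train_validation_split(X, y, validation_ratio=0.25, seed=0):
    """Shuffle the row indices, then hold out a share of rows for validation."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(X) * validation_ratio)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    y_val = [y[i] for i in val_idx]
    return X_train, X_val, y_train, y_val

def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data (made up for illustration): y = 2x + 1 with no noise
X = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 for xi in X]

X_train, X_val, y_train, y_val = train_validation_split(X, y)
slope, intercept = fit_line(X_train, y_train)

# The trained model is checked only on rows it did not see during training
val_predictions = [slope * xi + intercept for xi in X_val]
```

Keeping the validation rows out of `fit_line` is the point of the split: the validation predictions then indicate how the model behaves on data it has not seen.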
The purpose of using machine learning models is divided into four categories:
Learn an overview of machine learning and explore key concepts for understanding machine learning through five examples.
First, you will learn how to implement, train, and validate regression models, as well as model performance evaluation metrics such as R-squared, MAE, and RMSE.
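The three regression metrics named above can be computed directly from their definitions. The sketch below is one possible implementation in plain Python; the function name `regression_metrics` and the sample values are assumptions for illustration only.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R-squared, MAE, and RMSE for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n              # mean absolute error
    rmse = math.sqrt(sum(e * e for e in errors) / n)   # root mean squared error
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "rmse": rmse}

# Made-up example values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
metrics = regression_metrics(y_true, y_pred)
# metrics["mae"] is 0.25 and metrics["r2"] is 0.975 for these values
```

MAE and RMSE are expressed in the units of the target, while R-squared is unitless and compares the model against always predicting the mean.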
Next, we'll learn how to implement a classification model, as well as the concepts of decision boundaries, confusion matrices, accuracy, precision, recall, and the F1 score. Evaluating classification performance requires a clear understanding of the confusion matrix, which we'll explain in detail through examples.
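To make the link between the confusion matrix and the derived metrics concrete, here is a small sketch for binary labels (1 = positive, 0 = negative). The function names and the ten sample labels are assumptions chosen for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Derive accuracy, precision, recall, and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted positives, share correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual positives, share found
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)      # (3, 1, 1, 5)
scores = classification_metrics(y_true, y_pred)
```

With these labels accuracy is 0.8 while precision, recall, and F1 are all 0.75, which shows that accuracy alone can look better than the model's handling of the positive class.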
To comprehensively evaluate the performance of a classification model, the predicted ranking must be evaluated. To this end, we will explain how to use ROC-AUC and precision-recall curves.
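ROC-AUC can be read as a ranking measure: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way by comparing every positive-negative pair; the function name `roc_auc` and the sample scores are assumptions, and the pairwise loop is meant for small illustrative inputs only.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the share of positive-negative pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Made-up model scores: one negative (0.7) is ranked above one positive (0.6)
y_true = [1, 1, 0, 1, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(y_true, model_scores)  # 8 of 9 pairs are ordered correctly
```

Because only the order of the scores matters, ROC-AUC does not depend on any particular classification threshold.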
In real-world applications, classification models often have minimum precision or recall requirements, so an optimal classification threshold that satisfies them must be selected. This lecture details how to find the optimal threshold using the precision-recall curve.
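One simple way to pick a threshold under a minimum-precision requirement is to evaluate precision and recall at every candidate threshold and keep the one with the highest recall among those that meet the requirement. The sketch below assumes this selection rule; the function names, the 0.75 requirement, and the sample scores are hypothetical.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def best_threshold(y_true, scores, min_precision):
    """Highest-recall threshold whose precision meets the requirement, or None."""
    best = None
    for threshold in sorted(set(scores)):   # each distinct score is a candidate
        precision, recall = precision_recall_at(y_true, scores, threshold)
        if precision >= min_precision:
            if best is None or recall > best[2]:
                best = (threshold, precision, recall)
    return best

# Made-up labels and model scores
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
model_scores = [0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.30, 0.10]
choice = best_threshold(y_true, model_scores, min_precision=0.75)
# choice is (0.7, 0.75, 0.75): threshold, precision, recall
```

Raising the threshold to 0.85 would give perfect precision here but recall of only 0.5, which is why the rule prefers 0.70. A minimum-recall requirement could be handled the same way by swapping the roles of the two metrics.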
One of the things people are most curious about while learning machine learning is how it differs from statistical analysis. Statistical analysis is divided into descriptive statistics, estimation, and hypothesis testing.
Statistics emphasizes explaining theoretical foundations, dealing with hypotheses, probabilities, confidence intervals, and margins of error. In contrast, machine learning focuses on creating software models that excel at prediction and classification, rather than providing theoretical foundations.
If the data to be analyzed is small, it is necessary to rely on statistical analysis for explanation, estimation, hypothesis testing, etc. However, if the data is sufficiently large, it is more useful to create a machine learning model that can be used in practice.
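As a small example of hypothesis testing on limited data, the sketch below runs a chi-square test of independence on a 2x2 table of counts. It is a simplified illustration under stated assumptions: the function name `chi_square_2x2` and the counts are made up, no continuity correction is applied, and the p-value formula used is valid only for one degree of freedom (the 2x2 case).

```python
import math

def chi_square_2x2(table):
    """Chi-square statistic and p-value for independence in a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Made-up counts: rows are two groups, columns are two outcomes
table = [[30, 20], [20, 30]]
statistic, p_value = chi_square_2x2(table)
# statistic is 4.0 and p_value is roughly 0.046 for these counts
```

A p-value below a chosen significance level (0.05 is a common choice) would lead to rejecting the hypothesis that group and outcome are independent.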
This lecture introduces the fundamentals of statistical analysis, including the characteristics of the normal distribution. For reference, the normal distribution is the distribution that the sum (or mean) of many accumulated independent samples converges to, after which its shape no longer changes (see figure below).
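The convergence toward a normal distribution can be checked with a short simulation. The sketch below averages uniform random numbers, which are far from normal individually, and inspects how the averages are distributed. The function name `sample_means`, the sample sizes, and the fixed seed are assumptions chosen for this illustration.

```python
import random
import statistics

def sample_means(n_samples, sample_size, seed=0):
    """Means of repeated samples drawn from a uniform(0, 1) distribution."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.random() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

means = sample_means(n_samples=5000, sample_size=30)
center = statistics.fmean(means)   # should be close to 0.5
spread = statistics.stdev(means)   # should be close to sqrt(1/12) / sqrt(30)

# Share of sample means within one standard deviation of the center;
# for a normal distribution this share is about 68%.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)
```

Although each individual draw is uniform, the share of means within one standard deviation comes out near the 68% expected of a normal distribution.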
Who is this course right for?
For those who are learning the working principles of machine learning for the first time
For those who need to apply machine learning at work but cannot invest much time, and want to learn the core of machine learning in a short period
Need to know before starting?
Basic knowledge of Python is required.
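919 learners ∙ 77 reviews ∙ 11 answers ∙ 4.8 rating ∙ 3 courses
"Can you fix a broken radio?"
That is the question a friend asked me after I entered the electronics engineering department. Well, I did give an answer: "In electronics engineering we learn the principles of building a radio; fixing broken electronics isn't our job..."
More often than not, what is needed is a problem solver rather than an expert armed with theory. I believe solving real-world problems matters more.
Recently I have been using machine learning to solve problems in industries such as finance, energy, electronics, heavy equipment, logistics, drug discovery, and food, and it feels like a field with a great deal to learn and no end of work to do. My main job is as a professor (Department of Computer Engineering, Kangwon National University), but because I am keenly interested in solving problems in the field, I hold several concurrent positions: director of the AI Drug Discovery Support Center, adjunct professor at KAIST, and CEO of Data Science Lab.
I believe the talent most needed in the AI era is the data scientist who can solve real-world problems, and I hope all of you become in-demand data scientists.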
20 lectures ∙ (4hr 45min)
31 reviews ∙ 4.7 rating
$51.70