Realtime Datalake Using Kafka & Spark

A beginner-friendly Kafka & Spark real-time pipeline course. All in one: master everything from core concepts to architecture.

(4.9) 18 reviews

193 learners

  • hyunjinkim
real-time
data-processing
data-pipeline
Kafka
Apache Spark
pyspark
data-lake
kakao-tech


What you will learn!

  • Implementing CI/CD with GitHub Actions and AWS CodeDeploy

  • Kafka Broker, Confluent Producer & Consumer

  • Monitoring Kafka Dashboard with Prometheus & Grafana

  • Spark & Hive Metastore for Catalog Management

  • Implementing a Practical Project Using Spark Streaming

  • Availability testing for Kafka & Spark, ZooKeeper & YARN

Real-time Data Pipeline, Why Should You Learn It?


Building real-time data pipelines to support rapid analysis and decision-making is not an option but a necessity.

  • Real-time Personalized Marketing & Recommendations

  • Real-time Trend Analysis

  • Real-time Security Threat Detection and Response



Especially now that AI has become fundamental, there are countless examples of AI-powered real-time recommendation, detection, translation, and more, and real-time data pipelines are increasingly required to implement such architectures.


So I prepared this course.

One of the most popular stream-processing combinations: a real-time pipeline based on Kafka + Spark.

We build up Kafka & Spark from the basics, step by step, and go beyond pipeline implementation to cover design methods from an architectural perspective.

Features of This Course

📌 A single setup on your local machine? No! The trend is cloud, so we use AWS Cloud.

📌 CI/CD is a given, right? We'll set up CI/CD using GitHub Actions and AWS CodeDeploy.

📌 We start from the basics, step by step, and help you internalize the lecture content through hands-on practice and assignments.

📌 From server cluster configuration to real-time pipeline setup and availability testing: all in one.

👍 Recommended for people like this

I want to learn real-time data pipelines.
Those who are interested in data pipelines but have not experienced real-time processing

I want to learn about DataLake.
For those who want to learn how DataLake is implemented on the Cloud

I want to grow as an architect.
For those curious about implementing robust architecture capable of high-volume processing from infrastructure design to code level

After taking the course

  • You will understand the basic principles and availability guarantees of the Kafka Broker service, and be able to operate brokers based on that knowledge (a minimal sketch follows this list).

  • You will understand the basic principles and advanced options of the Kafka Producer/Consumer, and be able to write robust applications by understanding the trade-off between performance and consistency in large-scale environments.

  • You will understand the conditions under which Spark performs well, and be able to write applications using performance-optimization techniques.

  • You will understand the diversity of possible pipelines by integrating Spark with AWS services such as S3, Glue, and Athena.
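
As a taste of operating brokers with availability in mind, here is a minimal sketch (illustrative, not course code; the broker address and topic name are placeholders) that creates a topic whose replication settings balance availability against consistency:

```python
# Minimal sketch (not course code): create a topic whose replication settings
# balance availability against consistency. Broker and topic are placeholders.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker1:9092"})  # hypothetical broker

topic = NewTopic(
    "events",                             # hypothetical topic name
    num_partitions=3,
    replication_factor=3,                 # tolerate the loss of two brokers
    config={"min.insync.replicas": "2"},  # with acks=all, 2 replicas must ack
)

# create_topics() is asynchronous; it returns {topic_name: future}.
for name, future in admin.create_topics([topic]).items():
    try:
        future.result()  # raises if creation failed
        print(f"created topic {name}")
    except Exception as exc:
        print(f"failed to create {name}: {exc}")
```

With replication.factor=3 and min.insync.replicas=2, a producer using acks=all keeps writing through one broker failure, but refuses writes (rather than risking data loss) when two replicas are gone.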

So, what content does it cover?


  1. Now the trend is cloud. We'll configure a cluster just like in real-world scenarios using EC2 servers.

  2. We'll learn Kafka & Spark step by step, starting from the basics.

  3. Learn the basic concepts of Datalake on AWS through AWS's S3, Glue, and Athena services.



A pipeline can be divided into stages, from collection all the way to consumption.

You need to clearly understand which tools to use at each step, how to use them, and how they connect to each other.

Therefore, it doesn't stop at simply learning Kafka and Spark.

Finally, let's build an actual pipeline

In the process, you'll learn methods such as CI/CD, availability testing, problem-solving, and performance improvement.

Curriculum

  1. Data Lake Concept


  • Lambda Architecture

  • Kappa Architecture

  • Pipeline Design

  2. Kafka Basics


  • Broker

  • Kafka Producer

  • Kafka Consumer

  3. Monitoring

  • UI For Apache Kafka

  • Prometheus

  • Grafana

  4. Apache Spark Basics

  • Spark Cluster

  • Spark SQL

  • Spark Streaming

  5. Performance Improvement Tips

  • Performance Improvement Checklist

  • Troubleshooting

  • Spark Monitoring

  6. Availability Testing

  • ZooKeeper Cluster

  • Kafka Broker

  • Spark Cluster


Here's the content you'll learn.

Pipeline Design

We survey the combinations of tools available for building real-time data pipelines.

Then we understand and implement the data flow of Kafka & Spark, the combination chosen for hands-on practice.

CI/CD: GitHub Actions + CodeDeploy

CI/CD is the basics of the basics.

We connect local git to a GitHub repository, then set up automatic deployment using the GitHub Actions + CodeDeploy combination.

Kafka Web UI

Learn how to easily manage Kafka through UI For Apache Kafka.

Prometheus + Grafana

The trend in monitoring pipelines.

Learn how to monitor Kafka with the Prometheus + Grafana combination, and further explore how to monitor Spark Streaming consumer lag.
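
For intuition, the lag we'll monitor is simply the gap between a partition's latest offset and the consumer group's committed offset. A minimal sketch of that idea using the confluent-kafka library, assuming a hypothetical broker, group, and topic (not course code):

```python
# Minimal sketch (not course code): consumer lag = latest partition offset
# minus the group's committed offset. All names are placeholders.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "broker1:9092",  # hypothetical broker
    "group.id": "dashboard-app",          # hypothetical consumer group
})

tp = TopicPartition("events", 0)          # hypothetical topic, partition 0

# Highest offset currently in the partition (the log end offset).
_low, high = consumer.get_watermark_offsets(tp, timeout=10)

# The group's committed offset (may be negative if nothing committed yet).
committed = consumer.committed([tp], timeout=10)[0].offset

print(f"lag for {tp.topic}[{tp.partition}]: {high - committed}")
consumer.close()
```

Monitoring exporters expose this same number to Prometheus, which is what Grafana then charts and alerts on.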

Kafka Source + Spark Streaming

We implement an actual pipeline using the Kafka + Spark Streaming combination and build a dashboard visualization on top of it; a sketch of the Kafka source hookup follows below.
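
To make the hookup concrete, here is a minimal PySpark sketch, assuming a hypothetical broker, topic, and checkpoint path (not the course's actual code), that reads a Kafka topic as a stream:

```python
# Minimal sketch (not course code): Kafka -> Spark Structured Streaming.
# Requires the spark-sql-kafka package; all names/paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Subscribe to a Kafka topic as an unbounded streaming DataFrame.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "events")                      # hypothetical topic
    .load()
)

# Kafka delivers key/value as binary; cast them to strings before use.
parsed = events.select(col("key").cast("string"), col("value").cast("string"))

# Print each micro-batch to the console (swap for an S3 sink in practice).
query = (
    parsed.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/events")  # hypothetical path
    .start()
)
query.awaitTermination()
```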

AWS Athena

AWS Athena is a serverless query service. We'll use it to directly inspect the processing results of Spark Streaming.
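
Since Athena runs queries asynchronously, a client submits a query, polls for completion, then fetches results. A minimal boto3 sketch with hypothetical database, table, region, and result-bucket names (not course code):

```python
# Minimal sketch (not course code): run an Athena query over the streaming
# output and print the rows. Database/table/bucket/region are placeholders.
import time

import boto3

athena = boto3.client("athena", region_name="ap-northeast-2")  # hypothetical region

qid = athena.start_query_execution(
    QueryString="SELECT * FROM events_processed LIMIT 10",  # hypothetical table
    QueryExecutionContext={"Database": "datalake"},         # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([field.get("VarCharValue") for field in row["Data"]])
```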

Python Dashboard

We visualize the output of the implemented real-time data pipeline in a dashboard and come to understand the pipeline's flow end to end.

Availability Testing

We build the most production-like architecture possible and put it through availability testing, understanding and verifying the availability of the Kafka, Spark, and YARN clusters.

Infrastructure Setup Automation Using Ansible

Isn't there a lot that needs to be done to create all of this?

That's right. There's a lot.

There are many libraries to install and so many things to configure. If even one thing doesn't match up properly, you get errors 🤬

However, you only need to focus on what matters: implementing real-time pipelines.

Infrastructure configuration and various setups are automated through pre-prepared Ansible Scripts.

You can preview the Ansible scripts at the GitHub repository below.

https://github.com/hjkim-sun/datalake-ansible-playbook-season1


You'll clone the repository above and breeze through the setup process.

🚨 Please read this before starting the hands-on practice!

✔ Kafka clients (Producer/Consumer) are written in Python.


There are several Python Kafka libraries; we use the Confluent Kafka library, the best-performing among them. Confluent Kafka delivers high performance comparable to the Java client, and we'll learn how to write Producers and Consumers with it in Python.
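
To show the flavor of the library, here is a minimal producer sketch, assuming a hypothetical broker and topic (not course code); the commented options are where the performance-vs-consistency trade-off shows up:

```python
# Minimal sketch (not course code): a confluent-kafka Producer whose config
# illustrates performance-vs-consistency knobs. Names are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092",  # hypothetical broker
    "acks": "all",               # wait for all in-sync replicas: safer, slower
    "enable.idempotence": True,  # retries cannot create duplicates
    "linger.ms": 20,             # batch up to 20 ms: throughput over latency
    "compression.type": "lz4",   # fewer bytes on the wire
})

def on_delivery(err, msg):
    # Invoked from poll()/flush() with each message's final fate.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}] @ {msg.offset()}")

producer.produce("events", key="user-1", value='{"clicks": 3}', callback=on_delivery)
producer.flush()  # block until all outstanding messages are delivered
```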


✔ Spark applications are also written in Python.


Scala is the best-performing language for writing Spark applications. However, learning Scala just for Spark is a real burden, and it has the disadvantages of being less popular than Python and having relatively fewer deep learning/AI libraries. In practice, therefore, Spark programs are often developed in Python, and especially when deep learning/AI integration is on the table, writing in Python is an excellent alternative.

🚨 Expected AWS Lab Costs

The hands-on practice will be conducted on AWS Cloud, and separate practice costs will be incurred.


✔ Used for about 40 hours over one month, the labs incur roughly 40,000 KRW in AWS costs (at an exchange rate of 1,430 KRW/USD).

Most of the cost comes from EC2 (compute), so you must stop your server instances after practice and assignments. However, other resources, such as the EBS volumes and EIPs attached to the instances, keep accruing charges even while the instances are stopped. So the faster you complete the course, the lower your AWS practice costs.


✔ Even with all servers stopped, roughly 30,000 KRW per month is incurred for server volume costs.

So for the same 40 hours of usage spread over two months instead of one, an extra 30,000 KRW is added, for a total AWS bill of about 70,000 KRW. We therefore recommend completing the course as quickly as possible.
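
One habit that keeps the bill down is stopping the lab instances the moment a session ends. A minimal boto3 sketch, assuming the instances carry a hypothetical project tag (not course code; this stops EC2 compute charges only, while EBS/EIP charges continue):

```python
# Minimal sketch (not course code): stop all running lab instances so EC2
# compute charges stop accruing. Tag and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="ap-northeast-2")  # hypothetical region

# Find running instances tagged as lab machines (hypothetical tag).
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:project", "Values": ["datalake-lab"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
if ids:
    ec2.stop_instances(InstanceIds=ids)
    print(f"stopping: {ids}")
else:
    print("no running lab instances found")
```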

The content below will not be covered.


  1. Java-based Producer/Consumer Development

  2. Kafka Connect (Season 2 planned)

  3. Schema Registry (Season 2 planned)

  4. Kafka Streams

  5. KSQL



  6. Machine Learning and Deep Learning

  7. Open Table Format (e.g., Iceberg) (Season 2 planned)

  8. Scala-based applications (written only in PySpark)

Communication

Because this course works with many different tools, the Q&A board alone may not be enough when you have questions or run into unexpected errors.

(From experience, it takes about 3-4 days from when a question is posted until I answer and the student confirms the fix.)


To reduce this communication friction and provide high-quality service to students until the very end, we operate a Discord channel.

https://discord.gg/eTcYzMBxZm


Whether it's about the lectures or not, it's fine. Trivial chatter is welcome too.

It's a place for easy communication, so please feel free to join us.

Pre-enrollment Reference Information

Practice Environment

  • [OS] Most of the hands-on practice will be conducted on AWS. Therefore, you can take the course regardless of whether you use Windows/MacOS.

  • [Performance] High CPU/memory specs are not required. A typical laptop/desktop is sufficient for taking the course.

  • [Other] You can take the course anywhere with an internet connection. Additionally, you'll need a credit card that can be used to pay for AWS Cloud costs.

Learning Materials

  • Learning materials are provided in Lecture 1-2.


Essential Prerequisites

  1. Python Basics


    Basic data structures and fundamental syntax such as if/for/while statements, plus the ability to write functions competently.

  2. Basic Linux Commands


    Most infrastructure work is automated through Ansible, but you still need basic Linux commands to take this course (the vi editor, basic commands like cd/mv/rm, etc.).

  3. SQL


    Having basic SQL knowledge (SELECT, WHERE, JOIN, GROUP BY, ORDER BY, etc.) will make it much easier to follow along.
    (There won't be any difficult SQL)

Recommended Prerequisites

  1. Docker Containers
    We set up the monitoring tools using containers, so it helps to understand how containers work.

  2. git
    We use git in the CI/CD flow to deploy code directly. Every step of the usage is explained, but prior familiarity is a plus.

  3. Understanding Python Classes
    Most of the hands-on Python programs are organized into classes, so an understanding of classes and object-oriented programming makes the practice easier.
    (It's okay if you don't know them; I'll explain everything.)

Who is this course right for?

  • Those who want to learn Kafka & Spark

  • Those who want to learn real-time pipeline implementation

  • Those who need to develop various knowledge and skills as a data engineer

What do you need to know before starting?

  • Basic concepts of Python

  • Basic SQL knowledge (Filter, GroupBy, OrderBy level)

  • Able to use basic Linux commands

About the instructor

1,135 learners ∙ 72 reviews ∙ 207 answers ∙ 4.9 rating ∙ 2 courses

Hello.

I'm a 15-year practitioner working in the data & AI field.

Since earning the Professional Engineer in Information Management (정보관리기술사) certification, I've been creating content to share the knowledge I've gained with as many people as possible.

Nice to meet you. :)

Contact: hjkim_sun@naver.com

Curriculum

113 lectures ∙ (28hr 23min)

Course materials: lecture resources

Reviews

18 reviews ∙ 4.9 average rating

  • 램쥐뱅

    Reviews 18

    Average Rating 4.8

    5

    100% enrolled

    The curriculum and content were organized better than I expected, and I learned a lot. You can really feel how meticulously the course was made. I'll be waiting for the follow-up course. Thank you.

    • 김현진
      Instructor

      Thank you for the review, 램쥐뱅. While building the curriculum, I thought hard about how to teach the fundamentals solidly while connecting everything from Kafka to Spark. It took a year from planning to completion, but seeing it recognized like this makes me proud ^^ Thank you. The follow-up course I'm preparing isn't Season 2, but I'll pack it with solid content so you won't regret it 😀

  • 역시자네야

    Reviews 1

    Average Rating 5.0

    Edited

    5

    10% enrolled

    Instructor Hyunjin, whom I trust without hesitation. Highly recommended. I found him through the airflow course, and his courses differ from others in many ways. I liked that he explains everything from concepts to architecture design, including why each tool is used and how it works. The hands-on labs are pure comfort. He always answers questions kindly, too. I'm still early in the course, but I intend to finish it~ The weather is hot; take care of your health.

    • 김현진
      Instructor

      Hello 역시자네야, thank you so much for coming back after the airflow course! When I learn something myself, if I only pick up surface-level usage without understanding the principles, I quickly forget it and it never really clicks. I assume the same is true for others, so I devote a lot of lecture time to conveying the principles as I understand them. It's a bit of a grind since I also have to make the slides and assignments, but reviews like yours make it worthwhile :) I'll prepare the next course well 💪

  • :찬영

    Reviews 1

    Average Rating 5.0

    Edited

    5

    100% enrolled

    An exceptionally polished course.. genuinely moving. With most courses you run into "I followed along exactly, so why doesn't it work?" more than once, but I finished this one smoothly without any of that. When choosing a course, I look at the curriculum first, then compare price and lecture time. So many courses have been superficial for their price, but if you take Hyunjin's Kafka & Spark course, I guarantee you'll be able to produce polished results in future projects! I learned a lot, thank you! (When might Season 2 come out..?)

    • 김현진
      Instructor

      Hello 찬영! Thank you for the touching review. As you'll know from finishing it, the course goes beyond simple features to cover infrastructure setup, nginx configuration, docker setup, availability testing, and more, so I worried it might not go smoothly. That's why I standardized as much as possible with the ansible-playbook, and after filming I followed the lectures through myself to check that nothing was broken. I even prepared the Discord room for smooth communication in case something still fails for someone. It took quite a while to publish the course, but I tried to make it as polished as possible. Knowing you noticed washes away all that effort ^-^ I'm the one who should be grateful.. As for Season 2, I'm preparing a generative AI course first, so it may be a little delayed. Still, I'll push ahead and prepare it!

  • 형빈

    Reviews 1

    Average Rating 5.0

    5

    30% enrolled

  • seungil.park

    Reviews 4

    Average Rating 5.0

    5

    60% enrolled
