Realtime Datalake Using Kafka & Spark

A beginner-friendly Kafka & Spark real-time pipeline course. All in one: master everything from core concepts to architecture.

(4.9) 18 reviews

193 learners

  • hyunjinkim
real-time
data-processing
data-pipeline
Kafka
Apache Spark
pyspark
data-lake
kakao-tech


What you will learn!

  • Implementing CI/CD with GitHub Actions and AWS CodeDeploy

  • Kafka Broker, Confluent Producer & Consumer

  • Monitoring Kafka Dashboard with Prometheus & Grafana

  • Spark & Hive Metastore for Catalog Management

  • Implementing a Practical Project Using Spark Streaming

  • Availability testing for Kafka & Spark, ZooKeeper & YARN

Real-time Data Pipeline, Why Should You Learn It?


Building real-time data pipelines to support rapid analysis and decision-making is not an option but a necessity.

  • Real-time Personalized Marketing & Recommendations

  • Real-time Trend Analysis

  • Real-time Security Threat Detection and Response



Especially now that AI has become fundamental, there are countless examples of AI-powered real-time recommendation, detection, translation, and more, and real-time data pipelines are increasingly required to implement such architectures.


So I prepared this course.

One of the most popular stream-processing combinations: a real-time pipeline based on Kafka + Spark.

We build up Kafka & Spark from the basics, step by step, and go beyond pipeline implementation to cover design methods from an architectural perspective.

Features of This Course

📌 A single setup on your local machine? No! The trend is cloud, so we use AWS Cloud.

📌 CI/CD is a given, right? We'll set up CI/CD using GitHub Actions and AWS CodeDeploy.

📌 We start from the basics, step by step, and help you internalize the lecture content through hands-on practice and assignments.

📌 From server cluster configuration to real-time pipeline setup and availability testing: all in one.

👍 Recommended for people like this

I want to learn real-time data pipelines.
Those who are interested in data pipelines but have not experienced real-time processing

I want to learn about DataLake.
For those who want to learn how DataLake is implemented on the Cloud

I want to grow as an architect.
For those curious about implementing robust architecture capable of high-volume processing from infrastructure design to code level

After taking the course

  • You will understand the basic principles and availability guarantees of the Kafka Broker service, and be able to operate brokers based on that knowledge (a minimal sketch follows this list).

  • You will understand the basic principles and advanced options of the Kafka Producer/Consumer, and be able to write robust applications by understanding the trade-off between performance and consistency in large-scale environments.

  • You will understand the conditions under which Spark performs well, and be able to write applications using performance-optimization techniques.

  • You will understand the diversity of possible pipelines by integrating Spark with AWS services such as S3, Glue, and Athena.
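
As a taste of operating brokers with availability in mind, here is a minimal sketch (illustrative, not course code; the broker address and topic name are placeholders) that creates a topic whose replication settings balance availability against consistency:

```python
# Minimal sketch (not course code): create a topic whose replication settings
# balance availability against consistency. Broker and topic are placeholders.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker1:9092"})  # hypothetical broker

topic = NewTopic(
    "events",                             # hypothetical topic name
    num_partitions=3,
    replication_factor=3,                 # tolerate the loss of two brokers
    config={"min.insync.replicas": "2"},  # with acks=all, 2 replicas must ack
)

# create_topics() is asynchronous; it returns {topic_name: future}.
for name, future in admin.create_topics([topic]).items():
    try:
        future.result()  # raises if creation failed
        print(f"created topic {name}")
    except Exception as exc:
        print(f"failed to create {name}: {exc}")
```

With replication.factor=3 and min.insync.replicas=2, a producer using acks=all keeps writing through one broker failure, but refuses writes (rather than risking data loss) when two replicas are gone.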

So, what content does it cover?


  1. Now the trend is cloud. We'll configure a cluster just like in real-world scenarios using EC2 servers.

  2. We'll learn Kafka & Spark step by step, starting from the basics.

  3. Learn the basic concepts of Datalake on AWS through AWS's S3, Glue, and Athena services.



A pipeline can be divided into stages, from collection all the way to consumption.

You need to clearly understand which tools to use at each step, how to use them, and how they connect to each other.

Therefore, it doesn't stop at simply learning Kafka and Spark.

Finally, let's build an actual pipeline

In the process, you'll learn methods such as CI/CD, availability testing, problem-solving, and performance improvement.

Curriculum

  1. Data Lake Concept


  • Lambda Architecture

  • Kappa Architecture

  • Pipeline Design

  2. Kafka Basics


  • Broker

  • Kafka Producer

  • Kafka Consumer

  3. Monitoring

  • UI For Apache Kafka

  • Prometheus

  • Grafana

  4. Apache Spark Basics

  • Spark Cluster

  • Spark SQL

  • Spark Streaming

  5. Performance Improvement Tips

  • Performance Improvement Checklist

  • Troubleshooting

  • Spark Monitoring

  6. Availability Testing

  • ZooKeeper Cluster

  • Kafka Broker

  • Spark Cluster


Here's the content you'll learn.

Pipeline Design

We survey the combinations of tools available for building real-time data pipelines.

Then we understand and implement the data flow of Kafka & Spark, the combination chosen for hands-on practice.

CI/CD: GitHub Actions + CodeDeploy

CI/CD is the basics of the basics.

We connect local git to a GitHub repository, then set up automatic deployment using the GitHub Actions + CodeDeploy combination.

Kafka Web UI

Learn how to easily manage Kafka through UI For Apache Kafka.

Prometheus + Grafana

The trend in monitoring pipelines.

Learn how to monitor Kafka with the Prometheus + Grafana combination, and further explore how to monitor Spark Streaming consumer lag.
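
For intuition, the lag we'll monitor is simply the gap between a partition's latest offset and the consumer group's committed offset. A minimal sketch of that idea using the confluent-kafka library, assuming a hypothetical broker, group, and topic (not course code):

```python
# Minimal sketch (not course code): consumer lag = latest partition offset
# minus the group's committed offset. All names are placeholders.
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "broker1:9092",  # hypothetical broker
    "group.id": "dashboard-app",          # hypothetical consumer group
})

tp = TopicPartition("events", 0)          # hypothetical topic, partition 0

# Highest offset currently in the partition (the log end offset).
_low, high = consumer.get_watermark_offsets(tp, timeout=10)

# The group's committed offset (may be negative if nothing committed yet).
committed = consumer.committed([tp], timeout=10)[0].offset

print(f"lag for {tp.topic}[{tp.partition}]: {high - committed}")
consumer.close()
```

Monitoring exporters expose this same number to Prometheus, which is what Grafana then charts and alerts on.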

Kafka Source + Spark Streaming

We implement an actual pipeline using the Kafka + Spark Streaming combination and build a dashboard visualization on top of it; a sketch of the Kafka source hookup follows below.
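
To make the hookup concrete, here is a minimal PySpark sketch, assuming a hypothetical broker, topic, and checkpoint path (not the course's actual code), that reads a Kafka topic as a stream:

```python
# Minimal sketch (not course code): Kafka -> Spark Structured Streaming.
# Requires the spark-sql-kafka package; all names/paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Subscribe to a Kafka topic as an unbounded streaming DataFrame.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "events")                      # hypothetical topic
    .load()
)

# Kafka delivers key/value as binary; cast them to strings before use.
parsed = events.select(col("key").cast("string"), col("value").cast("string"))

# Print each micro-batch to the console (swap for an S3 sink in practice).
query = (
    parsed.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/events")  # hypothetical path
    .start()
)
query.awaitTermination()
```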

AWS Athena

AWS Athena is a serverless query service. We'll use it to directly inspect the processing results of Spark Streaming.
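
Since Athena runs queries asynchronously, a client submits a query, polls for completion, then fetches results. A minimal boto3 sketch with hypothetical database, table, region, and result-bucket names (not course code):

```python
# Minimal sketch (not course code): run an Athena query over the streaming
# output and print the rows. Database/table/bucket/region are placeholders.
import time

import boto3

athena = boto3.client("athena", region_name="ap-northeast-2")  # hypothetical region

qid = athena.start_query_execution(
    QueryString="SELECT * FROM events_processed LIMIT 10",  # hypothetical table
    QueryExecutionContext={"Database": "datalake"},         # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([field.get("VarCharValue") for field in row["Data"]])
```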

Python Dashboard

We visualize the output of the implemented real-time data pipeline in a dashboard and come to understand the pipeline's flow end to end.

Availability Testing

We build the most production-like architecture possible and put it through availability testing, understanding and verifying the availability of the Kafka, Spark, and YARN clusters.

Infrastructure Setup Automation Using Ansible

Isn't there a lot that needs to be done to create all of this?

That's right. There's a lot.

There are many libraries to install and so many things to configure. If even one thing doesn't match up properly, you get errors 🤬

However, you only need to focus on what matters: implementing real-time pipelines.

Infrastructure configuration and various setups are automated through pre-prepared Ansible Scripts.

You can preview the Ansible scripts at the GitHub repository below.

https://github.com/hjkim-sun/datalake-ansible-playbook-season1


You'll clone the repository above and breeze through the setup process.

🚨 Please read this before starting the hands-on practice!

✔ Kafka clients (Producer/Consumer) are written in Python.


There are several Python Kafka libraries; we use the Confluent Kafka library, the best-performing among them. Confluent Kafka delivers high performance comparable to the Java client, and we'll learn how to write Producers and Consumers with it in Python.
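
To show the flavor of the library, here is a minimal producer sketch, assuming a hypothetical broker and topic (not course code); the commented options are where the performance-vs-consistency trade-off shows up:

```python
# Minimal sketch (not course code): a confluent-kafka Producer whose config
# illustrates performance-vs-consistency knobs. Names are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092",  # hypothetical broker
    "acks": "all",               # wait for all in-sync replicas: safer, slower
    "enable.idempotence": True,  # retries cannot create duplicates
    "linger.ms": 20,             # batch up to 20 ms: throughput over latency
    "compression.type": "lz4",   # fewer bytes on the wire
})

def on_delivery(err, msg):
    # Invoked from poll()/flush() with each message's final fate.
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()}[{msg.partition()}] @ {msg.offset()}")

producer.produce("events", key="user-1", value='{"clicks": 3}', callback=on_delivery)
producer.flush()  # block until all outstanding messages are delivered
```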


✔ Spark applications are also written in Python.


Scala is the best-performing language for writing Spark applications. However, learning Scala just for Spark is a real burden, and it has the disadvantages of being less popular than Python and having relatively fewer deep learning/AI libraries. In practice, therefore, Spark programs are often developed in Python, and especially when deep learning/AI integration is on the table, writing in Python is an excellent alternative.

🚨 Expected AWS Lab Costs

The hands-on practice will be conducted on AWS Cloud, and separate practice costs will be incurred.


✔ Used for about 40 hours over one month, the labs incur roughly 40,000 KRW in AWS costs (at an exchange rate of 1,430 KRW/USD).

Most of the cost comes from EC2 (compute), so you must stop your server instances after practice and assignments. However, other resources, such as the EBS volumes and EIPs attached to the instances, keep accruing charges even while the instances are stopped. So the faster you complete the course, the lower your AWS practice costs.


✔ Even with all servers stopped, roughly 30,000 KRW per month is incurred for server volume costs.

So for the same 40 hours of usage spread over two months instead of one, an extra 30,000 KRW is added, for a total AWS bill of about 70,000 KRW. We therefore recommend completing the course as quickly as possible.
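
One habit that keeps the bill down is stopping the lab instances the moment a session ends. A minimal boto3 sketch, assuming the instances carry a hypothetical project tag (not course code; this stops EC2 compute charges only, while EBS/EIP charges continue):

```python
# Minimal sketch (not course code): stop all running lab instances so EC2
# compute charges stop accruing. Tag and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="ap-northeast-2")  # hypothetical region

# Find running instances tagged as lab machines (hypothetical tag).
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:project", "Values": ["datalake-lab"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
if ids:
    ec2.stop_instances(InstanceIds=ids)
    print(f"stopping: {ids}")
else:
    print("no running lab instances found")
```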

The content below will not be covered.


  1. Java-based Producer/Consumer Development

  2. Kafka Connect (Season 2 planned)

  3. Schema Registry (Season 2 planned)

  4. Kafka Streams

  5. KSQL



  6. Machine Learning and Deep Learning

  7. Open Table Format (e.g., Iceberg) (Season 2 planned)

  8. Scala-based applications (written only in PySpark)

Communication

Because this course works with many different tools, the Q&A board alone may not be enough when you have questions or run into unexpected errors.

(From experience, it takes about 3-4 days from when a question is posted until I answer and the student confirms the fix.)


To reduce this communication friction and provide high-quality service to students until the very end, we operate a Discord channel.

https://discord.gg/eTcYzMBxZm


Whether it's about the lectures or not, it's fine. Trivial chatter is welcome too.

It's a place for easy communication, so please feel free to join us.

Pre-enrollment Reference Information

Practice Environment

  • [OS] Most of the hands-on practice will be conducted on AWS. Therefore, you can take the course regardless of whether you use Windows/MacOS.

  • [Performance] High CPU/memory specs are not required. A typical laptop/desktop is sufficient for taking the course.

  • [Other] You can take the course anywhere with an internet connection. Additionally, you'll need a credit card that can be used to pay for AWS Cloud costs.

Learning Materials

  • Learning materials are provided in Lecture 1-2.


Essential Prerequisites

  1. Python Basics


    Basic data structures and fundamental syntax such as if/for/while statements, plus the ability to write functions competently.

  2. Basic Linux Commands


    Most infrastructure work is automated through Ansible, but you still need basic Linux commands to take this course (the vi editor, basic commands like cd/mv/rm, etc.).

  3. SQL


    Having basic SQL knowledge (SELECT, WHERE, JOIN, GROUP BY, ORDER BY, etc.) will make it much easier to follow along.
    (There won't be any difficult SQL)

Recommended Prerequisites

  1. Docker Containers
    We set up the monitoring tools using containers, so it helps to understand how containers work.

  2. git
    We use git in the CI/CD flow to deploy code directly. Every step of the usage is explained, but prior familiarity is a plus.

  3. Understanding Python Classes
    Most of the hands-on Python programs are organized into classes, so an understanding of classes and object-oriented programming makes the practice easier.
    (It's okay if you don't know them; I'll explain everything.)

Who is this course right for?

  • Those who want to learn Kafka & Spark

  • Those who want to learn real-time pipeline implementation

  • Those who need to develop various knowledge and skills as a data engineer

What do you need to know before starting?

  • Basic concepts of Python

  • Basic SQL knowledge (Filter, GroupBy, OrderBy level)

  • Able to use basic Linux commands

About the instructor

1,135 learners ∙ 72 reviews ∙ 207 answers ∙ 4.9 rating ∙ 2 courses

Hello.

I'm a 15-year practitioner working in the data & AI field.

Since earning the Professional Engineer in Information Management (정보관리기술사) certification, I've been creating content to share the knowledge I've gained with as many people as possible.

Nice to meet you. :)

Contact: hjkim_sun@naver.com

Curriculum

113 lectures ∙ (28hr 23min)

Course materials: lecture resources

Reviews

18 reviews ∙ 4.9 average rating

  • 램쥐뱅

    Reviews 18

    Average Rating 4.8

    5

    100% enrolled

    The curriculum and content were organized better than I expected, and I learned a lot. You can really feel how meticulously the course was made. I'll be waiting for the follow-up course. Thank you.

    • 김현진
      Instructor

      Thank you for the review, 램쥐뱅. While building the curriculum, I thought hard about how to teach the fundamentals solidly while connecting everything from Kafka to Spark. It took a year from planning to completion, but seeing it recognized like this makes me proud ^^ Thank you. The follow-up course I'm preparing isn't Season 2, but I'll pack it with solid content so you won't regret it 😀

  • 역시자네야

    Reviews 1

    Average Rating 5.0

    Edited

    5

    10% enrolled

    Instructor Hyunjin, whom I trust without hesitation. Highly recommended. I found him through the airflow course, and his courses differ from others in many ways. I liked that he explains everything from concepts to architecture design, including why each tool is used and how it works. The hands-on labs are pure comfort. He always answers questions kindly, too. I'm still early in the course, but I intend to finish it~ The weather is hot; take care of your health.

    • 김현진
      Instructor

      Hello 역시자네야, thank you so much for coming back after the airflow course! When I learn something myself, if I only pick up surface-level usage without understanding the principles, I quickly forget it and it never really clicks. I assume the same is true for others, so I devote a lot of lecture time to conveying the principles as I understand them. It's a bit of a grind since I also have to make the slides and assignments, but reviews like yours make it worthwhile :) I'll prepare the next course well 💪

  • :찬영

    Reviews 1

    Average Rating 5.0

    Edited

    5

    100% enrolled

    An exceptionally polished course.. genuinely moving. With most courses you run into "I followed along exactly, so why doesn't it work?" more than once, but I finished this one smoothly without any of that. When choosing a course, I look at the curriculum first, then compare price and lecture time. So many courses have been superficial for their price, but if you take Hyunjin's Kafka & Spark course, I guarantee you'll be able to produce polished results in future projects! I learned a lot, thank you! (When might Season 2 come out..?)

    • 김현진
      Instructor

      Hello 찬영! Thank you for the touching review. As you'll know from finishing it, the course goes beyond simple features to cover infrastructure setup, nginx configuration, docker setup, availability testing, and more, so I worried it might not go smoothly. That's why I standardized as much as possible with the ansible-playbook, and after filming I followed the lectures through myself to check that nothing was broken. I even prepared the Discord room for smooth communication in case something still fails for someone. It took quite a while to publish the course, but I tried to make it as polished as possible. Knowing you noticed washes away all that effort ^-^ I'm the one who should be grateful.. As for Season 2, I'm preparing a generative AI course first, so it may be a little delayed. Still, I'll push ahead and prepare it!

  • 형빈

    Reviews 1

    Average Rating 5.0

    5

    30% enrolled

  • seungil.park

    Reviews 4

    Average Rating 5.0

    5

    60% enrolled
