Realtime Datalake Using Kafka & Spark

Beginner's Kafka & Spark Real-time Pipeline Course. All-in-one: Master concepts to architecture.

(4.9) 19 reviews

259 learners

Level Basic

Course period Unlimited

  • hyunjinkim

Tags: real-time, data processing, data pipeline, Kafka, Apache Spark, pyspark, data-lake

Reviews from Early Learners

4.9

5.0

찬영

100% enrolled

It was a remarkably high-quality lecture... simply moving. When taking lectures you often run into 'Why isn't it working when I followed along exactly?' moments, but I completed this course smoothly without a single one. When choosing a lecture I first look at the curriculum and compare the price and duration; until now many lectures have been too superficial for the price, but I guarantee that if you take Hyunjin's Kafka & Spark lecture, you will be able to produce genuinely high-quality results in your own projects! I learned a lot. Thank you! (When do you think Season 2 will be released?)

5.0

램쥐뱅

100% enrolled

I'm learning so much from the curriculum and content that are organized far better than I expected. I really felt that you put a lot of care into creating this course. I'll be waiting for the follow-up courses. Thank you.

5.0

역시자네야

10% enrolled

I trust and watch Teacher Hyeonjin's classes. Highly recommend. I discovered this through the Airflow course, and it has many differentiating factors from other courses. From concepts to architecture design, I liked how you explain the reasons for use and principles. The hands-on practice is comfort itself. You always provide kind answers as well. I'm still in the early stages of taking the course, but I'll complete it all~ The weather is hot, so please take good care of your health.

What you will gain after the course

  • Implementing CI/CD with GitHub Actions and AWS CodeDeploy

  • Kafka Broker, Confluent Producer & Consumer

  • Monitoring Kafka Dashboard with Prometheus & Grafana

  • Spark & Hive Metastore for Catalog Management

  • Implementing a Practical Project Using Spark Streaming

  • Kafka & Spark, ZooKeeper & YARN Availability Testing

Real-time Data Pipelines: Why Learn Them?


Building real-time data pipelines to support rapid analysis and decision-making is not an option but a necessity.

  • Real-time Personalized Marketing & Recommendations

  • Real-time Trend Analysis

  • Real-time Security Threat Detection and Response



Especially in today's world where AI has become fundamental, there are countless examples of AI-powered real-time recommendations, detection, translation, and more, and real-time data pipelines are increasingly required to implement such architectures.


So I prepared this course.

One of the most popular streaming combinations: a real-time pipeline based on Kafka + Spark.

Kafka & Spark are taught from the basics, step by step, and we go beyond pipeline implementation to design methods from an architectural perspective.

Features of This Course

📌 A single-machine local setup? No! The trend is cloud, so we use AWS Cloud.

📌 CI/CD is basic, right? We'll set up CI/CD using GitHub Actions and AWS CodeDeploy.

📌 Starting from the basics step by step, but helping you internalize the lecture content through hands-on practice and assignments.

📌 From server cluster configuration to real-time pipeline setup and availability testing: an all-in-one course.

👍Recommended for people like this

I want to learn real-time data pipelines.
Those who are interested in data pipelines but have not experienced real-time processing

I want to learn about DataLake.
For those who want to learn how DataLake is implemented on the Cloud

I want to grow as an architect.
For those curious about implementing robust architecture capable of high-volume processing from infrastructure design to code level

After taking the course

  • You will understand the basic principles of Kafka Broker services, understand availability guarantees, and be able to handle Broker services based on this knowledge.

  • You will understand the basic principles of Kafka Producer/Consumer and advanced options, and be able to write robust applications by understanding the trade-off between performance and consistency in large-scale environments.

  • You will be able to understand the conditions under which Spark can demonstrate its performance and write Applications based on techniques that can optimize performance.

  • You will understand the diversity of pipelines by integrating Spark with various AWS services such as S3, Glue, and Athena.

So, what content does it cover?


  1. Now the trend is cloud. We'll configure a cluster just like in real-world scenarios using EC2 servers.

  2. Kafka & Spark will be learned slowly from the basics.

  3. Learn the basic concepts of Datalake on AWS through AWS's S3, Glue, and Athena services.



A pipeline can be broken into stages, from collection through to utilization.

You need to clearly understand which tools to use at each step, how to use them, and how they connect to each other.

Therefore, it doesn't stop at simply learning Kafka and Spark.

Finally, let's build an actual pipeline

In the process, you'll learn methods such as CI/CD, availability testing, problem-solving, and performance improvement.

Curriculum

  1. Data Lake Concepts

  • Lambda Architecture

  • Kappa Architecture

  • Pipeline Design

  2. Kafka Basics

  • Broker

  • Kafka Producer

  • Kafka Consumer

  3. Monitoring

  • UI for Apache Kafka

  • Prometheus

  • Grafana

  4. Apache Spark Basics

  • Spark Cluster

  • Spark SQL

  • Spark Streaming

  5. Performance Improvement Tips

  • Performance Improvement Checklist

  • Troubleshooting

  • Spark Monitoring

  6. Availability Testing

  • ZooKeeper Cluster

  • Kafka Broker

  • Spark Cluster


You will learn this content.

Pipeline Design

We survey the combinations of tools available for building real-time data pipelines.

Then we understand and implement the data flow of the Kafka & Spark stack chosen for the hands-on practice.

CI/CD: Github Actions + Code Deploy

CI/CD is the most basic of basics.

We connect a local git repository to a GitHub repository, then set up automatic deployment with the GitHub Actions + CodeDeploy combination.

Kafka Web UI

Learn how to easily manage Kafka through UI For Apache Kafka.

Prometheus + Grafana

The trend in monitoring pipelines.

Learn how to monitor Kafka through the Prometheus + Grafana combination and further explore monitoring methods for Spark Streaming LAG.
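As a concrete illustration of what those dashboards track: consumer lag for a partition is simply the gap between the partition's newest offset (the log end offset) and the offset the consumer group has committed. A minimal sketch in plain Python (the function and variable names are ours, not from the course):

```python
# Illustration only: "lag" for one partition is how far the committed offset
# trails the log end offset. Exporters feeding Prometheus report this per
# partition; Grafana then charts the totals.

def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Lag for a single partition; 0 means the consumer is fully caught up."""
    return max(log_end_offset - committed_offset, 0)

def total_lag(end_offsets: dict, committed: dict) -> int:
    """Total lag across partitions; a partition with no commit counts from 0."""
    return sum(partition_lag(end, committed.get(p, 0))
               for p, end in end_offsets.items())

print(total_lag({0: 1_000, 1: 1_500}, {0: 990, 1: 1_500}))  # 10
```

A steadily growing total is the classic sign that a streaming job cannot keep up with its input topic.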

Kafka Source + Spark Streaming

We implement an actual pipeline using the Kafka + Spark Streaming combination and visualize a Dashboard based on this.
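As a rough sketch of what such a pipeline looks like in code (not the course's actual project): a PySpark Structured Streaming job that reads a Kafka topic, parses each JSON record, and counts events by type. The broker address, topic name, and checkpoint path are placeholder values, and the Spark portion only runs when `RUN_SPARK=1` is set, since it needs a Spark installation and a reachable broker.

```python
import json
import os

def parse_event(raw: bytes) -> str:
    """Pure helper: decode one Kafka record value (JSON) and pull out a field.
    Kept free of Spark imports so the logic is trivially unit-testable."""
    event = json.loads(bytes(raw).decode("utf-8"))
    return event.get("event_type", "unknown")

if __name__ == "__main__" and os.environ.get("RUN_SPARK") == "1":
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()
    parse_udf = udf(parse_event, StringType())

    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
              .option("subscribe", "events")                      # placeholder
              .load()
              .withColumn("event_type", parse_udf("value")))

    # Running count of events by type, printed to the console sink.
    (events.groupBy("event_type").count()
           .writeStream.outputMode("complete")
           .format("console")
           .option("checkpointLocation", "/tmp/checkpoints/events")
           .start()
           .awaitTermination())
```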

AWS Athena

AWS Athena service is a serverless query service. We'll use this service to directly check the processing results of Spark Streaming.
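To give a feel for how Athena is driven from code, here is a hedged sketch using boto3's `start_query_execution`. The database, table, and S3 results bucket are made-up names, not the course's actual values, and the API call is gated behind `RUN_ATHENA=1` since it needs boto3 and AWS credentials.

```python
import os

def build_count_query(database: str, table: str, event_type: str) -> str:
    """Build the kind of ad-hoc aggregate Athena runs over data in S3."""
    return (f'SELECT COUNT(*) AS cnt FROM "{database}"."{table}" '
            f"WHERE event_type = '{event_type}'")

if __name__ == "__main__" and os.environ.get("RUN_ATHENA") == "1":
    import boto3  # needs AWS credentials configured

    athena = boto3.client("athena", region_name="ap-northeast-2")
    resp = athena.start_query_execution(
        QueryString=build_count_query("datalake", "events", "click"),
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    print(resp["QueryExecutionId"])  # poll get_query_execution for status
```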

Python Dashboard

Using the implemented real-time data pipeline, we visualize it with a Dashboard and understand the pipeline flow.

Availability Testing

We build the most realistic architecture possible and subject it to availability testing, understanding and verifying the availability of the Kafka, Spark, and YARN clusters.

Infrastructure Setup Automation Using Ansible

Isn't there a lot that needs to be done to create all of this?

That's right. There is a lot.

There are many libraries to install and a great many settings to configure; if even one of them is off, you get errors 🤬

However, you only need to focus on the important content of implementing real-time pipelines.

Infrastructure configuration and various setups are automated through pre-prepared Ansible Scripts.

You can preview the Ansible Script at the github address below.

https://github.com/hjkim-sun/datalake-ansible-playbook-season1


You will clone the above GitHub repository content to easily proceed with the setup process.

🚨Please refer to this before starting the practice!

✔ The Kafka client (Producer/Consumer) is written in Python.


There are several Python Kafka libraries; we use confluent-kafka, the best-performing of them (it wraps librdkafka, a C client, so its throughput is comparable to the Java clients). We will learn how to write Producers and Consumers with it in Python.
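To give a taste of the library, here is a minimal producer sketch with confluent-kafka. The broker address and topic name are placeholders, the config is just one reasonable consistency-leaning preset (not the course's exact settings), and the network portion only runs when `RUN_KAFKA=1` is set.

```python
import os

def producer_config(bootstrap_servers: str) -> dict:
    """One consistency-leaning preset: idempotence (which implies acks=all
    and retries) prevents duplicates and loss, at some cost in throughput."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "enable.idempotence": True,
        "linger.ms": 50,               # small batching window for throughput
        "compression.type": "lz4",
    }

if __name__ == "__main__" and os.environ.get("RUN_KAFKA") == "1":
    from confluent_kafka import Producer

    producer = Producer(producer_config("broker1:9092"))  # placeholder host

    def on_delivery(err, msg):
        if err is not None:
            print(f"delivery failed: {err}")
        else:
            print(f"delivered to {msg.topic()}[{msg.partition()}]")

    producer.produce("events", key="user-1",
                     value=b'{"event_type": "click"}',
                     on_delivery=on_delivery)
    producer.flush()  # block until delivery callbacks have fired
```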


✔ Spark applications are also written in Python.


When writing Spark applications, Scala offers the best raw performance. However, learning Scala just for Spark is a real burden, and Scala is less popular than Python and has relatively fewer deep learning/AI libraries. For these reasons, Python is widely used to develop Spark programs in practice, and when deep learning/AI integration is a consideration, writing in Python is an excellent alternative.

🚨AWS Expected Lab Costs

The hands-on practice will be conducted on AWS Cloud, and separate practice costs will be incurred.


✔ With about 40 hours of use over one month, approximately 40,000 KRW in AWS costs is incurred (based on an exchange rate of 1,430 KRW/USD).

Most of the practice cost comes from EC2 (compute), so you must stop the server instances after each practice session and assignment. However, some charges (EBS volumes and EIPs attached to the instances) continue to accrue even while the instances are stopped. Therefore, the faster you complete the course, the lower your AWS practice costs will be.


✔ Even when all servers are stopped, approximately 30,000 KRW per month is incurred for volume (EBS) costs.

So even for the same 40 hours of use, spreading them over two months instead of one adds another 30,000 KRW, for a total AWS bill of about 70,000 KRW. We therefore recommend completing the course as quickly as possible.
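The arithmetic above can be captured in a small helper (illustration only; actual AWS bills vary with usage and exchange rates):

```python
def estimated_cost_krw(months: int) -> int:
    """~40,000 KRW covers the first month (about 40 hours of EC2 use plus
    that month's EBS/EIP charges); each extra month the stopped servers
    are kept around adds ~30,000 KRW of volume/EIP charges."""
    first_month = 40_000
    idle_per_extra_month = 30_000
    return first_month + idle_per_extra_month * (months - 1)

print(estimated_cost_krw(1))  # 40000
print(estimated_cost_krw(2))  # 70000 -- same 40 hours, one extra idle month
```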

The content below will not be covered.


  1. Java-based Producer/Consumer Development

  2. Kafka Connect (Season2 planned)

  3. Schema Registry (Season2 planned)

  4. Kafka Streams

  5. KSQL



  1. Machine Learning and Deep Learning

  2. Open Table Format (e.g., Iceberg) (Season2 planned)

  3. Scala-based Application (written only in pyspark)

Communication

Due to the nature of lectures that involve working with multiple tools, communication through the Q&A board alone may be difficult when you have questions or encounter unexpected errors.

(From experience, it takes about 3-4 days from when a question is posted until I answer it and the answer is confirmed.)


To reduce these communication inconveniences and provide high-quality service to our students until the very end, we plan to operate a Discord channel.

https://discord.gg/eTcYzMBxZm


Questions can be about the lectures or anything else; trivial chatter is welcome too.

It's a space for easy communication, so please feel free to join.

Pre-enrollment Reference Information

Practice Environment

  • [OS] Most of the hands-on practice will be conducted on AWS. Therefore, you can take the course regardless of whether you use Windows/MacOS.

  • [Performance] Does not require high CPU/memory specifications. A typical laptop or desktop is sufficient for taking the course.

  • [Other] You can take the course anywhere with an internet connection. Additionally, you'll need a credit card that can be used to pay for AWS Cloud costs.

Learning Materials

  • Learning materials are provided in Lecture 1-2.


Essential Prerequisites

  1. Python Basics


    Basic data structures and fundamental syntax like if/for/while statements. Also, the ability to write functions at a competent level

  2. Basic Linux Commands


    Most infrastructure work is done through the Ansible automation tool. However, you need to know basic Linux commands to take this course. (vi editor, basic commands like cd/mv/rm, etc.)

  3. SQL


    Having basic SQL knowledge (SELECT, WHERE, JOIN, GROUP BY, ORDER BY, etc.) will make it much easier to follow along.
    (There won't be any difficult SQL)
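To gauge whether your SQL is at the expected level, the query below is roughly the ceiling. The course itself runs SQL through Spark SQL and Athena; Python's built-in sqlite3 is used here only so you can try it without any setup (the table and column names are invented for illustration):

```python
import sqlite3

# In-memory database with a tiny sample table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id TEXT, event_type TEXT, amount INTEGER);
    INSERT INTO events VALUES
        ('u1', 'click', 1), ('u1', 'buy', 30), ('u2', 'click', 1);
""")

# SELECT + WHERE + GROUP BY + ORDER BY: the level of SQL assumed.
rows = conn.execute("""
    SELECT event_type, COUNT(*) AS cnt, SUM(amount) AS total
    FROM events
    WHERE user_id IN ('u1', 'u2')
    GROUP BY event_type
    ORDER BY cnt DESC
""").fetchall()
print(rows)  # [('click', 2, 2), ('buy', 1, 30)]
conn.close()
```

If this reads comfortably, the SQL in the course will pose no problem.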

Recommended Prerequisites

  1. Docker Containers
    Set up monitoring tools using containers. It helps to understand the principles of containers.

  2. git
    We will use git for CI/CD to proceed with direct code deployment. We will explain all the usage step by step, but it would be even better if you already know it.

  3. Understanding Python Classes
    Most of the hands-on Python programs are organized with classes, so an understanding of classes and object-oriented programming will make the practice easier.
    (It's okay if you don't know them yet; I'll explain everything.)

Recommended for these people

Who is this course right for?

  • Those who want to learn Kafka & Spark

  • Those who want to learn real-time pipeline implementation

  • Those who need to develop various knowledge and skills as a data engineer

Need to know before starting?

  • Basic concepts of Python

  • Basic SQL knowledge (Filter, GroupBy, OrderBy level)

  • Able to use basic Linux commands

Hello, this is hyunjinkim.

1,312

Learners

85

Reviews

224

Answers

4.9

Rating

2

Courses

Hello.

I'm a 15-year industry professional working in Data & AI.

Since obtaining the Professional Engineer in Information Management (정보관리기술사) certification, I have been creating content to share the knowledge I've gained with as many people as possible.

Nice to meet you. :)

Contact: hjkim_sun@naver.com

Curriculum

All

113 lectures ∙ (28hr 23min)

Course Materials:

Lecture resources

Reviews

All

19 reviews

4.9

  • jusungpark

    Reviews 21

    Average Rating 4.8

    5

    100% enrolled

    I'm learning so much from the curriculum and content that are organized far better than I expected. I really felt that you put a lot of care into creating this course. I'll be waiting for the follow-up courses. Thank you.

    • hyunjinkim
      Instructor

      Dear Ramjjibaeng, Thank you for your course review. While creating the curriculum, I also thought a lot about how to teach from the basics solidly while connecting from Kafka to Spark. Thanks to you, it took a year from planning to completing the course, but I'm proud that you recognized it like this ^^ Thank you. The follow-up course I'm preparing now isn't season 2, but I'll proceed with substantial content so you won't regret it 😀

  • ㅈ

    Reviews 1

    Average Rating 5.0

    Edited

    5

    10% enrolled

    I trust and watch Teacher Hyeonjin's classes. Highly recommend. I discovered this through the Airflow course, and it has many differentiating factors from other courses. From concepts to architecture design, I liked how you explain the reasons for use and principles. The hands-on practice is comfort itself. You always provide kind answers as well. I'm still in the early stages of taking the course, but I'll complete it all~ The weather is hot, so please take good care of your health.

    • hyunjinkim
      Instructor

      Hello, yeoksijaneya! Thank you so much for continuing to find me after airflow! When I learn something on my own, if I only learn the superficial usage without understanding the principles, I quickly forget and don't really understand it. I think others are the same way, so I tend to dedicate a lot of lecture time to conveying the principles I've understood. That's why I have to make PPTs and create assignments, which is a bit tough, but I'm grateful that you recognize this effort :) I'll prepare the next lecture well too 💪

  • pcy78054921

    Reviews 1

    Average Rating 5.0

    Edited

    5

    100% enrolled

    It was a remarkably high-quality lecture... simply moving. When taking lectures you often run into 'Why isn't it working when I followed along exactly?' moments, but I completed this course smoothly without a single one. When choosing a lecture I first look at the curriculum and compare the price and duration; until now many lectures have been too superficial for the price, but I guarantee that if you take Hyunjin's Kafka & Spark lecture, you will be able to produce genuinely high-quality results in your own projects! I learned a lot. Thank you! (When do you think Season 2 will be released?)

    • hyunjinkim
      Instructor

      Hello, Chan-young! Thank you for your touching review. As you know well after completing the course, I was quite concerned that it might not be easy to follow along, as the content includes a variety of topics from infrastructure setup and nginx configuration to docker setup and availability testing, rather than simply explaining functions. That's why I standardized as much as possible using ansible-playbook, and even after finishing the lecture recording, I followed the lecture myself to check if there were any parts that weren't working correctly. And in case someone had trouble, I even prepared a Discord room for smooth communication. In the end, it took quite a long time to release the lecture, but I tried my best to create a lecture with high completeness. Knowing that Chan-young recognized my efforts makes all the hard work feel like it's washed away ^-^ I am even more grateful.. Also, before starting Season 2, I'm preparing a lecture related to generative AI first, so it might be a bit later. Still, I'll do my best to prepare it!

  • junhahwang9642

    Reviews 1

    Average Rating 5.0

    5

    60% enrolled

  • hbin0529

    Reviews 1

    Average Rating 5.0

    5

    30% enrolled
