Understanding LLM Architecture and GPU Utilization Strategies for AI Beginners

Understand Transformer-based LLM architectures and GPU utilization strategies, and gain hands-on experience with the actual serving process using vLLM. This course covers the entire practical workflow, from building AI system pipelines to monitoring and multi-GPU utilization, and is designed for intuitive understanding through diagrams and practice without complex formulas.

(5.0) 7 reviews

171 learners

Level Basic

Course period Unlimited

GPU
GPU
attention-model
attention-model
AI
AI
transformer
transformer
LLM
LLM
GPU
GPU
attention-model
attention-model
AI
AI
transformer
transformer
LLM
LLM

Reviews from Early Learners

5.0

5.0

WonJune Lee

43% enrolled

I am not in the deep learning industry, but I work in the field of computer vision (rule-based). Since my company requires LLM and vision-related deep learning technologies, I have been studying these topics. I have only completed about 40% of the course, but I felt compelled to leave a review now. I have taken many deep learning courses, including those by famous and highly-rated instructors, but I haven't found any course as clean and clear as this one. The best part is the quality of the lecture materials. The instructor recorded every single matrix calculation in Excel, which is incredibly helpful when reviewing. The Python code is also well-commented in many places. The quality of the lectures themselves is excellent; the instructor reminds you of parts you might have forgotten, ensuring you don't miss anything. While most other courses show calculations once or twice and then move on, this course goes through the calculations together until the end, which provides great clarity. It seems like the Q&A is monitored frequently, as I received immediate answers to my questions. The lectures seem to have been filmed this year, so it's great that they include a lot of the latest trends. It feels like this course hasn't gained much word-of-mouth yet, but I highly recommend it to anyone who needs to study these topics.

5.0

Jang Jaehoon

7% enrolled

Thank you for the great lecture!

5.0

nova7tr

11% enrolled

It's my first time taking the course, and it's very helpful. ^-^

What you will gain after the course

  • Understanding the encoder-decoder structure and core operating principles of the Transformer model

  • Understanding the evolution of the latest attention mechanisms, including MHA, MQA, GQA, and MLA

  • Hands-on practice on how to utilize the vLLM engine, the de facto standard for current AI serving

  • Monitoring key performance metrics such as TTFT and TPOT in a vLLM serving environment

  • Design and implementation of multi-GPU architecture utilizing Tensor/Pipeline/Data Parallelism

  • Understanding the Core Concepts of Agent AI and the Principles of Tool Calling

  • Experience in building AI system pipelines and monitoring performance from a real-world industry perspective

  • Understanding the latest LLM trends such as MLA, MTP, and n-grams based on the latest research papers

In the era of AI Agents,
practical skills for understanding AI systems are becoming increasingly important.

From Transformer-based LLM architectures
to GPU utilization, vLLM serving, and multi-GPU strategies

LLM Architecture Practical Class

In the era of autonomous AI Agents,
you can utilize various Agent tools and Public APIs such as OpenAI, Claude, and Codex.

However, in a real service environment, you must also consider
data security, network costs, token costs, and GPU resource management.

Therefore, what is important is
an understanding of the Hybrid AI architecture, which combines Public APIs and self-hosted GPU-based LLMs
according to the situation.
sao cho phù hợp với từng tình huống.


In that case, is using only Public APIs always the best option?

Not necessarily.

These days, many LLMs comparable to public APIs (ChatGPT, Claude, Sonnet, etc.) are being developed both domestically and internationally.



3 models selected as a result of the 1st evaluation of domestic Sovereign AI


However, knowing and using LLMs well is not easy.
Purchasing expensive GPUs and using LLMs with an understanding of them versus using them without understanding
leads to
a significant difference.

Therefore, it is now time to learn the architecture for serving LLMs directly.


🌟 From LLM Architecture to Serving


In the era of agents, we have moved from the age of training to the age of inference. While using public APIs effectively is necessary, many companies prefer building local serving environments for various reasons such as security, governance, and cost. Learn everything from understanding LLM architecture for building local serving environments to architectural configuration and the latest trends in LLM development.


Lecture Core Structure

Core 1. Understanding Hugging Face Models


You need to know how to use the numerous LLMs released on Hugging Face.
However, the config.json file, which provides the specifications of an LLM model, is no different from a secret code to beginners. This is because you need to understand the Transformer model to be able to read it.

But don't worry. After taking this course, you will become an expert who can look at and understand the key specifications.

Learn how to decode the config.json file through this lecture.

(This is the content for Chapter 3-5. Make sure to take away all the remaining key parameters.)


Core 2. Mastering Attention

Attention is the beginning and end of the Transformer model, which serves as the foundation for current LLM models.

The attention-model emerged in 2017, but
it has remained the most powerful algorithm for nearly a decade.
While many efforts are being made to move beyond the Transformer structure,
no architecture has yet emerged that completely replaces the Transformer's attention.

⚠️ You must never have just a vague understanding of attention.


Gain a perfect understanding of the principles of attention and learn about its evolutionary flow.

(This is the content for Chapter 5-4. The evolution of Attention is synonymous with the evolution of LLMs.)


Core 3. Mastering Multi-GPU Architecture

Multi-GPU configuration is essential for running large-scale LLMs and achieving fast inference.
However, did you know that there are several different ways to configure multi-GPU setups?


We will pass on essential GPU utilization strategies, a necessary gateway to becoming a core AI engineer.




😄 Recommended for these people

AI Beginners

Those who wanted to study Transformer and Attention structures,
but found them difficult due to complex formulas and concepts

AI Beginner

Those who have used ChatGPT or generative AI services,
but want to understand the principles of how LLMs actually work

AI Engineer

AI engineers who need to understand LLM architecture and GPU environments,
and possess the capability to build and operate actual AI systems

💡 What you will learn in this lecture

Step 1. Foundation

  • Understanding the Transformer Model

  • Tokenizer & Embedding

  • Encoder vs Decoder

  • View model source code

Step 2. Attention

  • Mastering the Decoder Model

  • Mastering Attention

  • Masked Attention

  • KV Cache

Step 3. Serving

  • vLLM Serving

  • Paged Attention

  • OpenAI Compatible

  • SSE Protocol

Step 4. Tool Call

  • Understanding Tool Calls

  • Tool Response Architecture

  • Chat Template

  • Tool call parser

Step 5. Optimization

  • Performance Testing

  • vLLM Monitoring

  • Multi-GPU & Parallelism

  • vLLM Additional Features

Step 6. Advanced

  • Multi Token Prediction

  • mHC

  • Engram

  • Efforts to overcome limitations

💡 Key Lecture Points

Point 1

Core Principles of Attention Learned Without Formulas


Learn various attention techniques intuitively through Excel without complex formulas (MHA → MQA → GQA, Sliding Window Attention)

Point 2

Implementation of a 3-Tier AI Architecture


Understand the basic structure of a 3-tier architecture connecting OpenWebUI, FastAPI, and vLLM, and learn the fundamental flow of tool integration.

Point 3

Measuring Concurrent Users and Tips for vLLM Operation

Using jMeter, perform a load test from FastAPI to vLLM to check metrics such as TTFT and TPOT according to the number of concurrent users.

Point 4

Monitoring vLLM Services

Learn the basic principles of vLLM service operation by building a Prometheus & Grafana dashboard pipeline.

Point 5

Single GPU / Multi-GPU Testing

Through hands-on practice with the three basic multi-GPU methods (Pipeline Parallel, Tensor Parallel, and Data Parallel), you will see firsthand why multi-GPU setups are necessary.

Point 6

Mastering LLM Development Trends

We introduce the latest techniques from DeepSeek, such as MTP, Shared MoE, MLA, and Engram, as well as current LLM development trends focused on inference efficiency.

✅ Tools used in the lecture




✅ Server Practice Environment Guide

The vLLM system construction will be carried out using Runpod. In addition, hands-on sessions using Google Colab's T4 GPU will be conducted in parallel. Since the T4 GPU provides 15GB of GPU memory, exercises that are feasible on Colab will be conducted there.

Runpod

We will configure a practice environment based on the OpenWebUI → FastAPI → Runpod flow. We will conduct various exercises by deploying vLLM on GPU servers in the Runpod cloud.

Hands-on practice will incur a cost of approximately $10 to $20.


Google Colab

Google Colab, which is like the standard environment for AI practice, is used for simple exercises that do not require a Runpod environment. We will use the standard free tier, not Pro, and utilize the T4 GPU.

✅ Local Practice Environment Guide

The vLLM service will be hosted on Runpod, but
OpenWebUI and FastAPI will also be running on your local computer.
Therefore, please check if your environment meets the requirements below!



Runpod and Colab are used as the primary practice environments, but
You will be practicing by running OpenWebUI and FastAPI within your local environment..

⚠️ This course will be updated as vLLM is updated.

vLLM's update speed is very fast. However, the major version is still in the 0.x range.
Nevertheless, many companies are using vLLM as their inference engine as a de facto standard.
vLLM supports not only the Transformer models that currently form the backbone of LLMs but also alternative architectures like Mamba , and it is updated every time new features are added to models, such as Multi Token Prediction, to support them.
This course will also be updated as new vLLM features or new model types are released.

Don't miss out on the latest LLM trends.


Recommended for
these people

Who is this course right for?

  • Practitioners who use ChatGPT and generative AI but want to understand how LLMs actually work

  • Beginners who aim to become AI engineers and want to systematically learn LLM serving and system architecture.

  • Developers who want to understand Transformer and Attention structures from a practical perspective without complex formulas.

  • Backend and infrastructure engineers who want to understand GPU optimization and the actual workflow of building AI systems in multi-GPU environments

  • PMs and planners who want to understand LLM architecture and GPU utilization strategies during the AI service planning and development process.

Need to know before starting?

  • Understanding of basic Python syntax (variables, functions, conditional statements, etc.)

  • Basic usage of git

Hello
This is hyunjinkim

Inflearn Verified

1,584

Learners

104

Reviews

240

Answers

4.9

Rating

3

Courses

Hello.

I am a 17-year veteran currently working in the Data & AI field at a large corporation.

Since obtaining my Professional Engineer Information Management certification, I have been creating content to share the knowledge I've gained with as many people as possible.

Nice to meet you. :)

 

Contact: hjkim_sun@naver.com

More

Curriculum

All

54 lectures ∙ (14hr 27min)

Course Materials:

Lecture resources
Published: 
Last updated: 

Reviews

All

7 reviews

5.0

7 reviews

  • kjunekjune0812님의 프로필 이미지
    kjunekjune0812

    Reviews 3

    Average Rating 5.0

    Edited

    5

    43% enrolled

    I am not in the deep learning industry, but I work in the field of computer vision (rule-based). Since my company requires LLM and vision-related deep learning technologies, I have been studying these topics. I have only completed about 40% of the course, but I felt compelled to leave a review now. I have taken many deep learning courses, including those by famous and highly-rated instructors, but I haven't found any course as clean and clear as this one. The best part is the quality of the lecture materials. The instructor recorded every single matrix calculation in Excel, which is incredibly helpful when reviewing. The Python code is also well-commented in many places. The quality of the lectures themselves is excellent; the instructor reminds you of parts you might have forgotten, ensuring you don't miss anything. While most other courses show calculations once or twice and then move on, this course goes through the calculations together until the end, which provides great clarity. It seems like the Q&A is monitored frequently, as I received immediate answers to my questions. The lectures seem to have been filmed this year, so it's great that they include a lot of the latest trends. It feels like this course hasn't gained much word-of-mouth yet, but I highly recommend it to anyone who needs to study these topics.

    • hyunjinkim
      Instructor

      Hello Wonjune Lee, Thank you for your thoughtful review! I put a lot of thought into improving the quality of the lecture materials so that students could receive meaningful resources and review them effectively even later on. I also spent a lot of time thinking about how to effectively convey operations like Attention. The conclusion I reached was that it shouldn't be explained through formulas alone, nor through simple metaphors, nor just through torch code. Believing that it can only be understood by following the flow visually, I tried my best to explain it using Excel, and I'm glad to hear that it was conveyed well :) I hope you gain great insights from the remaining parts of the course. Keep it up!

  • jjhgwx님의 프로필 이미지
    jjhgwx

    Reviews 911

    Average Rating 4.9

    5

    7% enrolled

    Thank you for the great lecture!

    • hyunjinkim
      Instructor

      Hello Jang jaehoon, Thank you for the course review 👍 I see you've completed 7%. I hope you enjoy the rest of the lessons and find them very helpful. You can do it!

  • nova7tr1173님의 프로필 이미지
    nova7tr1173

    Reviews 9

    Average Rating 4.8

    5

    11% enrolled

    It's my first time taking the course, and it's very helpful. ^-^

    • hyunjinkim
      Instructor

      Hello nova7tr, Thank you for your review. I am glad to hear that it was helpful. I hope you enjoy the rest of the course :)

  • chongin12님의 프로필 이미지
    chongin12

    Reviews 2

    Average Rating 5.0

    5

    61% enrolled

    • myshaitan8493님의 프로필 이미지
      myshaitan8493

      Reviews 2

      Average Rating 5.0

      5

      61% enrolled

      Similar courses

      Explore other courses in the same field!