
Understanding LLM Architecture and GPU Utilization Strategies for AI Beginners

Understand Transformer-based LLM architecture and GPU utilization strategies, and get hands-on serving experience with vLLM. This course covers the entire process of building an AI system pipeline, monitoring it, and utilizing multiple GPUs, and is designed so you can learn intuitively through illustrations and hands-on practice, without complex formulas or heavy coding.

11 learners are taking this course

Level Basic

Course period Unlimited

GPU
attention-model
AI
transformer
LLM

What you will gain after the course

  • What is a Transformer model? Understanding its encoder and decoder structure

  • The foundation of Transformer models: a complete understanding of the evolution of attention mechanisms, including MHA, MQA, GQA, and MLA

  • Mastering the utilization of the vLLM engine, the current de facto standard

  • vLLM Serving and Monitoring TTFT and TPOT Performance Metrics

  • Design and implementation of multi-GPU architecture utilizing Tensor/Pipeline/Data Parallelism

  • Understanding the Principles of Tool Calling: The Core of Agent AI

  • Industry know-how for building AI system pipelines and performance monitoring

  • Latest trends understood through DeepSeek papers (MLA, MTP, N-gram, etc.)

What is needed now that we have become one of the top three AI powerhouses

For understanding LLM and practical application

LLM Master Class

As we enter the era of autonomous agents, we rely on many agent tools such as Open Canvas, Claude Code, and Codex, yet the threat of data leakage and the problem of uncontrolled token costs remain unresolved.


The answer is a Hybrid AI architecture.



But does that mean public APIs are always better?
Not necessarily.

Nowadays, many LLMs comparable to the public APIs (ChatGPT, Claude, etc.) are being developed both domestically and internationally.



The 3 models selected based on the results of the 1st evaluation of domestic Sovereign AI


However, knowing LLMs well and using them well is not easy.
There is a world of difference between using an LLM with a real understanding of it
and using one without that understanding,
especially once you have purchased expensive GPUs.


Therefore, it is now time to learn the architecture for serving LLMs directly.


🌟 From LLM Architecture to Serving


In the era of agents, we have moved from the age of training to the age of inference. While using public APIs effectively is necessary, many companies prefer building local serving environments for various reasons such as security, governance, and cost. Learn everything from understanding LLM architecture for building local serving environments to architectural configuration and the latest trends in LLM development.


Lecture Core Composition

Core 1. Understanding Hugging Face Models


You must know how to use the numerous LLMs released on Hugging Face.
However, the config.json file, which spells out an LLM's specifications, reads like a secret code to beginners, because you need to understand the Transformer model to make sense of it.

But don't worry. After taking this course, you will become an expert who can look at and understand the key specifications.

Learn how to decode the config.json file through this lecture.

(This is the content of Chapter 3-5. Be sure to pick up the remaining key parameters as well.)
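As a taste of what decoding config.json looks like, here is a minimal sketch. The snippet below is a hand-written, illustrative config in the style of a Hugging Face decoder model (Llama-style field names); the numbers are made up and do not describe any specific model from the course.

```python
import json

# Hypothetical config.json snippet in the common Llama-style layout.
# Field names follow Hugging Face conventions; values are illustrative.
raw = """
{
  "hidden_size": 4096,
  "num_hidden_layers": 32,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "vocab_size": 128256,
  "max_position_embeddings": 8192
}
"""
cfg = json.loads(raw)

# Two specs you can derive once you can "read the code":
head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]            # size of each attention head
gqa_groups = cfg["num_attention_heads"] // cfg["num_key_value_heads"]  # query heads per KV head (GQA)

print(f"head_dim = {head_dim}")                              # 128
print(f"GQA: {gqa_groups} query heads share each KV head")   # 4
```

Once `num_key_value_heads` is smaller than `num_attention_heads`, you know the model uses grouped-query attention rather than plain MHA.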


Core 2. Mastering Attention

Attention is the beginning and end of the Transformer model, which serves as the foundation for current LLM models.

Attention emerged in 2017, yet it has remained the most powerful algorithm for nearly a decade. While many efforts are being made to move beyond the Transformer structure, no architecture has yet emerged that completely replaces the Transformer's attention.

⚠️ Don't settle for only a vague understanding of attention.


Gain a complete understanding of the principles of attention and learn about its evolutionary trends.

(This is the content of Chapter 5-4. The evolution of Attention is synonymous with the evolution of LLMs)
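One way to see why attention evolved from MHA toward MQA and GQA is to compare KV-cache memory. The following is back-of-the-envelope arithmetic with illustrative numbers (fp16, 32 layers, head dim 128, a 4096-token context), not an exact vLLM measurement:

```python
# Rough KV-cache size comparison for MHA vs GQA vs MQA.
# All numbers are illustrative assumptions, not a real model's specs.
bytes_per_elem = 2   # fp16
layers = 32
head_dim = 128
seq_len = 4096

def kv_cache_bytes(num_kv_heads: int) -> int:
    # Two cached tensors (K and V) per layer,
    # each of shape [seq_len, num_kv_heads, head_dim].
    return 2 * layers * seq_len * num_kv_heads * head_dim * bytes_per_elem

mha = kv_cache_bytes(32)  # MHA: every query head has its own K/V head
gqa = kv_cache_bytes(8)   # GQA: 4 query heads share one K/V head
mqa = kv_cache_bytes(1)   # MQA: all query heads share a single K/V head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.2f} GiB per sequence")
```

Shrinking the number of KV heads cuts the per-sequence cache by the same factor (here 4x for GQA, 32x for MQA), which is exactly the memory pressure that drives this evolution.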


Core 3. Mastering Multi-GPU Architecture

Multi-GPU configuration is essential for running large-scale LLMs and achieving fast inference.
However, did you know that there are several different ways to configure multi-GPU setups?


We will pass on GPU utilization strategies, an essential gateway to becoming a core AI engineer.
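To see why multi-GPU setups are unavoidable, a quick feasibility check helps. The numbers below are illustrative assumptions (a roughly 14 GiB fp16 7B-class model on 16 GiB GPUs); real deployments also need memory for activations and CUDA overhead, so treat this as a sketch only:

```python
# Back-of-the-envelope check: does the model fit on one GPU,
# or do we need tensor parallelism? Numbers are illustrative.
gpu_mem_gib = 16            # per-GPU memory
weights_gib = 14.0          # fp16 weights of a 7B-class model (assumed)
kv_cache_budget_gib = 4.0   # room we want to reserve for the KV cache

def fits(tensor_parallel: int) -> bool:
    # Tensor parallelism shards both the weights and the KV cache
    # across GPUs, so the per-GPU footprint shrinks with the TP degree.
    per_gpu = (weights_gib + kv_cache_budget_gib) / tensor_parallel
    return per_gpu <= gpu_mem_gib

print("TP=1 fits:", fits(1))  # False: 18 GiB needed on a 16 GiB GPU
print("TP=2 fits:", fits(2))  # True: 9 GiB per GPU
```

The same logic extends to pipeline parallelism (sharding by layer) and data parallelism (replicating the model to serve more concurrent requests), which the course covers hands-on.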




😄 Recommended for these people

AI Beginners

Those who gave up on the formulas while researching "Attention" to study Transformers.

AI Beginners

Those who have only used ChatGPT or public APIs, but want to learn the principles of how LLM models operate.

AI Engineer

AI engineers who need the capability to understand the characteristics of LLM model architectures and to run and manage them in GPU environments

💡 What you will learn in this course

Step 1. Foundation

  • Understanding Transformer Models

  • Tokenizer & Embedding

  • Encoder vs Decoder

  • View model source code

Step 2. Attention

  • Mastering the Decoder Model

  • Mastering Attention

  • Masked Attention

  • KV Cache

Step 3. Serving

  • vLLM Serving

  • Paged Attention

  • OpenAI Compatible

  • SSE Protocol

Step 4. Tool Call

  • Understanding Tool Calls

  • Tool Response Architecture

  • Chat Template

  • Tool call parser
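To preview what a tool call parser deals with, here is a hedged sketch of parsing an OpenAI-style tool call from an assistant message. The payload is hand-written for illustration; field names follow the OpenAI tool-calling schema:

```python
import json

# Hand-written example of an assistant message containing a tool call,
# in the OpenAI-compatible format that vLLM can emit.
response = {
    "role": "assistant",
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": "{\"city\": \"Seoul\"}",
            },
        }
    ],
}

call = response["tool_calls"][0]["function"]
args = json.loads(call["arguments"])  # arguments arrive as a JSON string
print(call["name"], args["city"])     # get_weather Seoul
```

Note that `arguments` is a JSON string, not an object: the model writes it as text, and your parser (or vLLM's tool call parser) must decode it before dispatching to the actual function.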

Step 5. Optimization

  • Performance Testing

  • vLLM Monitoring

  • Multi-GPU & Parallelism

  • Additional vLLM Features

Step 6. Advanced

  • Multi Token Prediction

  • mHC

  • Engram

  • Efforts to overcome limitations

💡 Key Lecture Points

Point 1

Core principles of attention learned without formulas


Learn various attention techniques intuitively through Excel without complex formulas (MHA → MQA → GQA, Sliding Window Attention)

Point 2

Implementation of 3-Tier AI Architecture


Understand the basic structure of a 3-tier architecture connecting OpenWebUI, FastAPI, and vLLM, and learn the fundamental flow of tool integration.

Point 3

Measuring Concurrent Users and Tips for vLLM Operation

Using Apache JMeter, we will run load tests from FastAPI to vLLM and check metrics such as TTFT and TPOT as the number of concurrent users increases.
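The two metrics themselves are simple to compute from the timestamps at which streamed tokens arrive. The timestamps below are made up for illustration:

```python
# Computing TTFT and TPOT from token arrival times (illustrative values).
request_sent = 0.00                            # seconds
token_times = [0.35, 0.40, 0.45, 0.50, 0.55]   # arrival time of each output token

# Time To First Token: latency until the stream starts.
ttft = token_times[0] - request_sent

# Time Per Output Token: average gap between subsequent tokens.
tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)

print(f"TTFT = {ttft * 1000:.0f} ms, TPOT = {tpot * 1000:.0f} ms/token")
```

Under load, TTFT tends to grow with queueing delay while TPOT reflects decode throughput, which is why the course tracks both as concurrency rises.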

Point 4

Monitoring vLLM Services

Build a Prometheus & Grafana dashboard pipeline to master the basic principles of vLLM service operations.

Point 5

Single GPU / Multi-GPU Testing

Through hands-on practice with the three basic multi-GPU methods (Pipeline Parallel, Tensor Parallel, and Data Parallel), you will see firsthand why multi-GPU setups are necessary.

Point 6

Mastering LLM Development Trends

We introduce the latest LLM development trends aimed at inference efficiency, including DeepSeek's MTP, Shared MoE, MLA, and Engram techniques.

✅ Tools used in the lecture




✅ Server Practice Environment Guide

The vLLM system construction will be carried out using Runpod. In addition, hands-on sessions utilizing the T4 GPU in Google Colab will be conducted in parallel. Since the T4 GPU provides 15GB of GPU memory, any exercises that can be performed in Colab will be done there.

Runpod

We will configure a practice environment based on the OpenWebUI → FastAPI → Runpod flow. We will conduct various exercises by deploying vLLM on GPU servers in the Runpod cloud.

A practice fee of approximately $10 to $20 will be incurred for the hands-on sessions.


Google Colab

Google Colab, which is like the standard environment for AI practice, is used for simple exercises that do not require a Runpod environment. We will use the standard free tier, not Pro, and utilize the T4 GPU.

✅ Local Practice Environment Guide

The vLLM service will be hosted on Runpod, but OpenWebUI and FastAPI will run on your local computer, so please check that the following environment requirements are met!



Runpod and Colab are used as the primary practice environments, but you will also practice by running OpenWebUI and FastAPI in your local environment.

⚠️ This course will be updated as vLLM is updated.

vLLM evolves very quickly, even though its major version is still in the 0.x range. Nevertheless, many companies already use vLLM as the de facto standard inference engine. vLLM supports not only the Transformer models that form the backbone of current LLMs but also alternative architectures like Mamba, and it is updated whenever new model features such as Multi Token Prediction appear. This course will likewise be updated as new vLLM features or new model types are released.

Don't miss out on the latest LLM trends.


Recommended for these people

Who is this course right for?

  • A beginner aiming to become an AI engineer who wants to systematically learn LLM serving technologies.

  • Developers who want to understand the principles of Transformers and Attention from a practical perspective without complex formulas.

  • Backend/Infrastructure engineers looking to build AI systems in GPU-optimized and multi-GPU environments

Need to know before starting?

  • Understanding of basic Python syntax (variables, functions, conditional statements, etc.)

  • Basic usage of git

Hello, this is hyunjinkim.

1,391 Learners ∙ 93 Reviews ∙ 233 Answers ∙ 4.9 Rating ∙ 3 Courses

Hello.

I am a 17-year veteran currently working in the Data & AI field at a large corporation.

Since obtaining my Professional Engineer Information Management certification, I have been creating content to share the knowledge I've gained with as many people as possible.

Nice to meet you. :)

 

Contact: hjkim_sun@naver.com


Curriculum

54 lectures ∙ (13hr 33min)

Course Materials:

Lecture resources

Reviews

Not enough reviews.
