
Windows System Programming
megayuchi
We'll teach you essential Windows System programming skills for developing games and applications for Windows.
Basic
windows-programming, C++, microsoft-visual-c++
GPGPU is no longer an unfamiliar technology. It has long been utilized in various fields such as scientific computation, simulation, and graphics processing, and today, it has established itself as the core foundation that determines the performance of AI technology. In this context, GPU programming skills serve as a powerful tool that expands a developer's capabilities to the next level. Moving beyond CPU-centric development to directly handling large-scale parallel computing means acquiring a new way of problem-solving and broader development possibilities. This course systematically covers CUDA programming—the de facto standard of GPGPU—from the basics to practical application. The curriculum focuses on content that can be immediately applied in practice, including understanding GPU architecture, parallel programming models, memory optimization, and kernel writing. The goal is to reach a level where you can design and implement GPU-based programs on your own after completing the course.
13 learners are taking this course
Level Intermediate
Course period Unlimited
CUDA Parallel Programming Capability - You will understand GPU thread structures, memory hierarchies, and kernel execution models, and be able to write CUDA kernels yourself.
Computation code that runs tens to hundreds of times faster than on the CPU - You can directly verify the performance difference by writing programs that use the GPU to accelerate real operations such as vector operations and matrix multiplication.
GPU programming is no longer exclusive to specialized fields. Today, GPUs play a core role in almost every domain, including AI, simulation, image processing, and scientific computing, and the ability to utilize them has become a powerful weapon that significantly expands a developer's competitive edge. This course is designed for developers who have experience with C/C++ but have hesitated to start because GPU programming felt unfamiliar. We cover everything from basic CUDA concepts to understanding GPU architecture, parallel programming models, memory optimization, kernel writing, stream utilization, and image processing with a practical focus. After completing this course, you will be able to design and implement GPU-based programs on your own.
We will examine the evolution of GPUs from dedicated graphics devices to GPGPU (General-Purpose computing on Graphics Processing Units) and provide a comprehensive overview of the core hardware and software concepts necessary for understanding CUDA programming. This section lightly covers the foundations for future hands-on practice, including GPU architecture, parallel processing methods, and the CUDA execution model.
The biggest challenge when starting CUDA development is the initial environment setup. In this chapter, we will provide a step-by-step guide to the entire environment needed for development, from installing the CUDA Toolkit to configuring the compiler and IDE. We will explain how to build a practical development environment where you can run and debug the examples in the following chapters.
This explains the basic flow of how a CUDA program operates. It covers the process of initializing and terminating the CUDA environment, and provides a step-by-step explanation of the overall execution structure, which includes copying from host memory to device memory, kernel execution, and copying from device memory back to host memory. Additionally, it summarizes essential concepts that serve as the foundation for subsequent hands-on exercises, such as CUDA kernel invocation methods and the usage of core CUDA APIs.
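As a rough illustration of this flow, the following minimal sketch allocates device memory, copies the inputs from host to device, launches a kernel, and copies the result back. The names `vecAdd` and `addOnGpu` are hypothetical and are not taken from the course materials.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host helper: allocate, copy in, launch, copy out, free.
// Returns true only if every checked CUDA call succeeded.
bool addOnGpu(const std::vector<float>& a, const std::vector<float>& b,
              std::vector<float>& out)
{
    if (a.empty() || a.size() != b.size()) return false;
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(float);
    out.resize(n);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess
           && cudaMalloc(&dB, bytes) == cudaSuccess
           && cudaMalloc(&dC, bytes) == cudaSuccess;
    if (ok) {
        // Host memory -> device memory
        ok = cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess
          && cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    }
    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;  // round up to cover n
        vecAdd<<<blocks, threads>>>(dA, dB, dC, n);      // kernel execution
        // Device memory -> host memory (this copy waits for the kernel to finish)
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(out.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA);  // cudaFree(nullptr) is a harmless no-op
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}
```

Calling `addOnGpu(a, b, c)` from `main` with two equally sized vectors fills `c` with the element-wise sums.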
This section covers the concept of global memory coalescing, a key element of GPU performance optimization. It explains how hardware merges (coalesces) requests when threads access global memory and compares the differences between optimal and worst-case access patterns through real-world scenarios. Additionally, it outlines data layout strategies and thread configuration methods to maximize memory access performance, explaining essential optimization techniques for writing efficient CUDA kernels.
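To make the contrast between access patterns concrete, here is a small sketch with one coalesced and one strided copy kernel, plus a timing helper based on CUDA events. The kernel names, the helper `timeCopyMs`, and the `stride` parameter are assumptions made for illustration. Error checking is omitted for brevity, and the measured difference depends on the GPU.

```cuda
#include <cuda_runtime.h>

// Coalesced: consecutive threads of a warp read consecutive floats,
// so the warp's request can be served with few memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart,
// so one warp's request is spread over many memory segments.
__global__ void copyStrided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = static_cast<int>((static_cast<long long>(i) * stride) % n);
        out[j] = in[j];
    }
}

// Hypothetical helper: returns the elapsed kernel time in milliseconds.
float timeCopyMs(bool strided, int n, int stride)
{
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEventRecord(start);
    if (strided) copyStrided<<<blocks, threads>>>(in, out, n, stride);
    else         copyCoalesced<<<blocks, threads>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ms;
}
```

Comparing `timeCopyMs(false, 1 << 24, 33)` with `timeCopyMs(true, 1 << 24, 33)` over several runs gives a feel for how much the access pattern matters on a given card.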
This section covers how threads within a block can collaborate to achieve higher performance. It explains how to share data efficiently at the block level using Shared Memory, followed by techniques for collaboration between threads within a warp using warp-level intrinsics. It discusses strategies for writing more optimized CUDA kernels by combining these two methods and, as a practical example, implements a minimum-value search using warp-level reduction and block-level reduction.
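One possible shape for such a minimum search is sketched below: each warp reduces its values with `__shfl_down_sync`, the per-warp results are exchanged through Shared Memory, and the first warp produces the block's minimum. The names `warpMin`, `blockMinKernel`, and `gpuMin` are hypothetical. The sketch assumes a block size that is a multiple of 32, finishes the reduction on the host, and omits error checking.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <climits>
#include <vector>

// Reduce across the 32 lanes of a warp; lane 0 ends up with the minimum.
__device__ int warpMin(int v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v = min(v, __shfl_down_sync(0xffffffffu, v, offset));
    return v;
}

// Each block writes its minimum to blockMin[blockIdx.x].
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__global__ void blockMinKernel(const int* data, int n, int* blockMin)
{
    __shared__ int warpResults[32];        // one slot per warp in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = (i < n) ? data[i] : INT_MAX;   // out-of-range threads contribute the identity
    v = warpMin(v);                        // step 1: warp-level reduction

    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    if (lane == 0) warpResults[warp] = v;  // step 2: publish through Shared Memory
    __syncthreads();

    if (warp == 0) {                       // step 3: first warp reduces the warp results
        int numWarps = blockDim.x / 32;
        v = (lane < numWarps) ? warpResults[lane] : INT_MAX;
        v = warpMin(v);
        if (lane == 0) blockMin[blockIdx.x] = v;
    }
}

// Hypothetical host helper; expects a non-empty vector.
int gpuMin(const std::vector<int>& h)
{
    const int n = static_cast<int>(h.size());
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    int *dData = nullptr, *dBlockMin = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dBlockMin, blocks * sizeof(int));
    cudaMemcpy(dData, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    blockMinKernel<<<blocks, threads>>>(dData, n, dBlockMin);
    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), dBlockMin, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dBlockMin);
    return *std::min_element(partial.begin(), partial.end());  // final pass on the host
}
```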
Learn the core concepts of utilizing Shared Memory through the process of transposing a matrix in CUDA. We will examine inefficient global memory access patterns that frequently occur during transpose operations and the resulting performance degradation, then explain how to optimize memory access using Shared Memory. Additionally, we cover techniques to resolve bank conflicts that can occur in Shared Memory, providing practical strategies for effectively using Shared Memory through matrix transposition examples.
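A common way to express this idea is a tiled transpose in which each block stages a tile in Shared Memory, so that both the global read and the global write run along rows, with one padding column to avoid bank conflicts. The sketch below follows that pattern. `TILE`, `transposeTiled`, and `transposeOnGpu` are illustrative names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 32;

// out (cols x rows) = transpose of in (rows x cols), both row-major.
// Launch with dim3 block(TILE, TILE) and a grid that covers the input.
__global__ void transposeTiled(const float* in, float* out, int rows, int cols)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 padding column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;  // input column
    int y = blockIdx.y * TILE + threadIdx.y;  // input row
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];  // row-wise (coalesced) read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;      // output column (= input row)
    y = blockIdx.x * TILE + threadIdx.y;      // output row (= input column)
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // row-wise (coalesced) write
}

// Hypothetical host helper.
std::vector<float> transposeOnGpu(const std::vector<float>& in, int rows, int cols)
{
    const size_t bytes = in.size() * sizeof(float);
    std::vector<float> out(in.size());
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(dIn, dOut, rows, cols);

    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Without the padding column, the threads of a warp reading `tile[threadIdx.x][threadIdx.y]` would all hit the same Shared Memory bank.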
Following the Matrix Transpose example, this section covers how to utilize Shared Memory even more effectively through a Matrix Multiplication case study. It explains the basic structure for processing large-scale matrix multiplication in CUDA and introduces the technique of dividing large matrices into smaller tiles (sub-matrices) for computation. Additionally, it compares the memory access patterns of matrix multiplication, which are similar to but not the same as those of matrix transposition, and discusses strategies to maximize performance and reduce memory access bottlenecks using Shared Memory.
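The tile-based approach can be sketched as follows for square row-major matrices. Each block loads one tile of A and one tile of B into Shared Memory, accumulates the partial products, and moves on to the next pair of tiles. `TILE`, `matMulTiled`, and `matMulOnGpu` are hypothetical names, and the helper omits error checking.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;

// C = A * B for square n x n row-major matrices.
__global__ void matMulTiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Elements outside the matrix are loaded as zero so they add nothing.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();                   // both tiles are fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                   // finish reading before the next load
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper.
std::vector<float> matMulOnGpu(const std::vector<float>& A,
                               const std::vector<float>& B, int n)
{
    const size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    std::vector<float> C(static_cast<size_t>(n) * n);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matMulTiled<<<grid, block>>>(dA, dB, dC, n);

    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

With tiling, each element of A and B is read from global memory once per tile rather than once per multiply, which is where the reduction in memory traffic comes from.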
This section covers how to boost GPU performance from the perspective of warp scheduling rather than memory access optimization. It explains what occupancy is and why the number of warps the GPU can execute simultaneously matters. We will look at strategies for adjusting thread configuration, register usage, and shared memory usage to increase occupancy, and compare cases where high occupancy leads to good performance with cases where performance actually drops despite high occupancy.
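Theoretical occupancy for a given kernel and block size can be estimated with the runtime call `cudaOccupancyMaxActiveBlocksPerMultiprocessor`. The sketch below is one way to do that. The kernel `scale` and the helper `theoreticalOccupancy` are placeholders made up for this example.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel whose resource usage is being queried.
__global__ void scale(float* data, int n, float k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= k;
}

// Returns the theoretical occupancy of `scale` in (0, 1], or -1 on error.
// With a block size that is a multiple of 32, the ratio of resident threads
// equals the ratio of resident warps.
float theoreticalOccupancy(int blockSize, size_t dynamicSmemBytes)
{
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1.0f;

    int blocksPerSm = 0;  // resident blocks per SM given registers and shared memory
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSm, scale, blockSize, dynamicSmemBytes) != cudaSuccess)
        return -1.0f;

    return static_cast<float>(blocksPerSm * blockSize) /
           static_cast<float>(prop.maxThreadsPerMultiProcessor);
}
```

Calling the helper with several block sizes, or with a non-zero dynamic shared memory size, shows how each resource limits the number of resident warps.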
We cover CUDA streams, which enable asynchronous execution in CUDA. After first understanding how the default stream, which every CUDA program uses implicitly, operates, we explain how to improve overall performance by overlapping computation and memory copies using multiple streams. Additionally, we use Nsight Systems to analyze the actual performance benefits provided by stream-based asynchronous execution.
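The overlap pattern might look like the following sketch, which splits an array into chunks and issues each chunk's copy-in, kernel, and copy-out on its own stream. `addOne` and `addOneWithStreams` are hypothetical names. The host buffer is assumed to be pinned (allocated with `cudaMallocHost`), because asynchronous copies from pageable memory do not overlap in the same way.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

__global__ void addOne(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Adds 1 to every element of a pinned host buffer, using four streams.
bool addOneWithStreams(float* pinnedHost, int n)
{
    const int kStreams = 4;
    cudaStream_t streams[kStreams];
    float* dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    const int chunk = (n + kStreams - 1) / kStreams;
    for (int s = 0; s < kStreams; ++s) {
        const int offset = s * chunk;
        const int count = std::min(chunk, n - offset);
        if (count <= 0) break;
        const size_t bytes = count * sizeof(float);
        // Work inside one stream runs in order; different streams may overlap.
        cudaMemcpyAsync(dev + offset, pinnedHost + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        addOne<<<(count + 255) / 256, 256, 0, streams[s]>>>(dev + offset, count);
        cudaMemcpyAsync(pinnedHost + offset, dev + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    const bool ok = cudaDeviceSynchronize() == cudaSuccess;  // wait for all streams
    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dev);
    return ok;
}
```

Whether the copies and kernels actually overlap depends on the hardware and driver, which is why a timeline view in a profiler is the usual way to confirm it.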
We introduce the concept of kernel filters, which are widely used in image processing, and learn the basics of GPU-based image processing by directly implementing Gaussian and Laplacian filters with CUDA.
We explain how to improve memory access efficiency using Texture Memory and Shared Memory, and provide a comparative analysis of the actual hardware performance differences that occur when the same algorithm is implemented with various memory structures.
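As a baseline for such comparisons, a 3x3 Gaussian filter that reads directly from global memory could be written as below. This is a sketch only: the weights are the standard 1-2-1 binomial kernel, borders are clamped, and the names `kGauss3x3`, `gaussian3x3`, and `blurOnGpu` are assumptions. Texture Memory and Shared Memory variants are not shown, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// 3x3 Gaussian weights (1 2 1 / 2 4 2 / 1 2 1, divided by 16) in constant memory.
__constant__ float kGauss3x3[9] = {
    1.0f / 16, 2.0f / 16, 1.0f / 16,
    2.0f / 16, 4.0f / 16, 2.0f / 16,
    1.0f / 16, 2.0f / 16, 1.0f / 16
};

// One thread per output pixel; neighbors outside the image are clamped to the edge.
__global__ void gaussian3x3(const float* in, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int sx = min(max(x + dx, 0), width - 1);
            int sy = min(max(y + dy, 0), height - 1);
            sum += in[sy * width + sx] * kGauss3x3[(dy + 1) * 3 + (dx + 1)];
        }
    }
    out[y * width + x] = sum;
}

// Hypothetical host helper for a single-channel float image.
std::vector<float> blurOnGpu(const std::vector<float>& image, int width, int height)
{
    const size_t bytes = image.size() * sizeof(float);
    std::vector<float> result(image.size());
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, image.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    gaussian3x3<<<grid, block>>>(dIn, dOut, width, height);

    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Replacing the weights with a Laplacian kernel (for example 0 1 0 / 1 -4 1 / 0 1 0) turns the same structure into an edge detector.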
We will implement an image pixel distribution analysis tool, the histogram, using CUDA, and cover data accumulation methods in a parallel environment along with the resulting performance issues. We will examine the basic structure of calculating histograms in CUDA and explain the operating principles and performance degradation issues of atomic operations, which are essential in this process. Subsequently, we will cover optimization techniques for writing more efficient histogram calculation kernels by reducing atomic operation bottlenecks using Shared Memory and warp intrinsics.
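One widely used way to relieve the atomic bottleneck is to accumulate a private histogram per block in Shared Memory and merge it into the global histogram once at the end. The sketch below shows only that Shared Memory step and does not use warp intrinsics. `BINS`, `histogramShared`, and `histogramOnGpu` are illustrative names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BINS = 256;

// `hist` must be zeroed before launch.
__global__ void histogramShared(const unsigned char* pixels, int n, unsigned int* hist)
{
    __shared__ unsigned int local[BINS];
    for (int b = threadIdx.x; b < BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    // Grid-stride loop: atomics contend only within the block, in Shared Memory.
    const int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        atomicAdd(&local[pixels[i]], 1u);
    __syncthreads();

    // One global atomic per non-empty bin per block.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        if (local[b] != 0) atomicAdd(&hist[b], local[b]);
}

// Hypothetical host helper for 8-bit single-channel pixel data (non-empty input).
std::vector<unsigned int> histogramOnGpu(const std::vector<unsigned char>& pixels)
{
    const int n = static_cast<int>(pixels.size());
    std::vector<unsigned int> hist(BINS, 0);
    unsigned char* dPixels = nullptr;
    unsigned int* dHist = nullptr;
    cudaMalloc(&dPixels, n);
    cudaMalloc(&dHist, BINS * sizeof(unsigned int));
    cudaMemcpy(dPixels, pixels.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(dHist, 0, BINS * sizeof(unsigned int));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    histogramShared<<<blocks, threads>>>(dPixels, n, dHist);

    cudaMemcpy(hist.data(), dHist, BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dPixels);
    cudaFree(dHist);
    return hist;
}
```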
This covers how to combine the Direct3D 12 rendering pipeline with CUDA to utilize GPU graphics and GPGPU operations simultaneously. It explains how to map the Render Target and Depth Buffer of a simple D3D12 game framework as CUDA resources and synchronize the D3D12 timeline with the CUDA timeline.
The example code implements functionality that takes textures mapped as CUDA resources as input, applies various image processing techniques such as Gaussian Blur, edge detection, normal map rendering, and depth value visualization, and outputs them to the final screen.
Operating System and Version (OS): Windows 10/11
Tools used: Visual Studio 2026
CUDA Toolkit 13.2
NVIDIA GPU
PDF provided
Source code provided via attachments
Required
C/C++
Basic Windows Programming
Recommended (The following courses may be helpful.)
Windows System Programming (https://inf.run/VciKC)
Windows debugging tips (https://inf.run/KH5J6)
A graphics card of the GTX 16 series (for example, GTX 1650 or GTX 1660) or higher is required.
Examples can also be run on GTX 10 series graphics cards, but the project settings must be slightly modified. The modification method is covered in the 'Installation and Development Environment Setup' chapter.
You can also use the latest CUDA Toolkit, version 13.3 or higher. Again, you will need to slightly modify the project settings. The modification method is covered in 'Installation and Development Environment Setup'.
This course does not cover AI technology. Matrix multiplication and kernel filters are related to AI, but the course does not deal with AI technology directly.
Who is this course right for?
Programmers who are intimidated by GPU programming due to a lack of graphics experience but want to utilize parallel computing.
Developers who want to directly accelerate AI, simulation, and scientific computing workloads.
Need to know before starting?
C/C++
Basic Windows Programming using Visual Studio
Inflearn Verified
Career Verified
3,313 Learners ∙ 95 Reviews ∙ 22 Answers ∙ 5.0 Rating ∙ 11 Courses
C++,x86/x64 ASM, DirectX9/11/12, Metal, OpenGL, CUDA, win32, winsock/bsd socket
D3D12 Programming Basics - https://inf.run/7gJhS
D3D12 Programming Basics Plus - https://inf.run/itHDW
DirectX Raytracing Programming - https://inf.run/cQqx7
Windows System Programming - https://inf.run/AwfCv
Windows Debugging Tips - https://inf.run/zL7E4
Blog : https://megayuchi.com
Youtube : https://youtube.com/megayuchi
LinkedIn : https://www.linkedin.com/in/megayuchi/
13 lectures ∙ (16hr 23min)
Course Materials:
9. CUDA Programming - Occupancy
01:59:00