CUDA Programming (4) - C/C++/GPU Parallel Computing - Matrix Multiplication

✅ (4) Multiplying Matrices (2D Arrays) in Parallel, part (4) of the complete series from (1) to (6) ✅ Explains NVIDIA GPU + CUDA programming step-by-step from the basics. ✅ Process arrays, matrices, images, statistics, sorting, and more at extreme speed with parallel computing in C++/C.

(5.0) 5 reviews

181 learners

Level Intermediate

Course period 36 months

CUDA
GPU
Parallel Processing
C++
C

Reviews from Early Learners

5.0

★5 하지 ∙ 100% enrolled
The optimization part is especially helpful in many ways.

★5 몽크in도시 ∙ 10% enrolled
It was unique that matrix multiplication can be implemented in various ways, and I didn't know that changing the loop structure in a CPU implementation could make it faster. It feels like I learned practical coding, not the level that usually appears in language books.

★5 hooha1207 ∙ 55% enrolled
It was good that you explained how to do mathematical operations with CUDA in earnest. And thanks to the table of contents organized by "One More Sip of Drip Coffee," I was able to find the content I wanted very quickly.

What you will gain after the course

  • Full Series - Massively Parallel Computing with CUDA using GPUs

  • This lecture: Part (4) - Multiplying Matrices (2D Arrays) in Parallel

  • Bundle discount coupon ✅ provided for the "CUDA Programming" roadmap ✳️

Speed is the lifeblood of a program!
Make it fast with massively parallel processing techniques 🚀

I heard massive parallel computing is important 🧐

GPU/graphics card-based massively parallel computing is being very actively used in fields such as AI, deep learning, big data processing, and image/video/audio processing. Currently, the most widely applied technology in GPU parallel computing is NVIDIA's CUDA architecture.

While technologies like massive parallel computing and CUDA are considered important within the field of parallel computing, it is often difficult to even start learning because it's hard to find courses that teach this field systematically. Through this course, you can learn CUDA programming step-by-step. CUDA and parallel computing require a theoretical background and can be challenging. However, if you follow along from the basics with this course's abundant examples and explanations of background knowledge, you can certainly do it! This course is planned as a series, ensuring sufficient lecture time is provided.

In this course, we aim to explain how C++/C programmers can combine CUDA libraries and C++/C functions to accelerate problems in various fields using large-scale parallel processing techniques. Through this method, you can accelerate existing C++/C programs or develop new algorithms and programs entirely with parallel computing to achieve breakthrough speeds.

📢 Please check before taking the course!

  • Please secure a hardware environment where NVIDIA CUDA works in advance for the practice. A PC/laptop equipped with an NVIDIA GeForce graphics card is essential.
  • While NVIDIA GeForce graphics cards can be used in some cloud environments, cloud settings change frequently and often involve costs. If you are using a cloud environment, you must personally ensure you know how to access and use the graphics card.
  • You can find detailed information about the lecture practice environment in the <00. Preparation Before the Lecture> video within the curriculum.

Course Features ✨

#1.
Abundant
examples and explanations

CUDA and massively parallel computing require abundant examples and explanations. This lecture series provides a total of over 24 hours of actual instruction time.

#2.
Practice is essential!

Since this is a computer programming course, we emphasize extensive hands-on practice and provide actual working source code so that you can follow along step-by-step.

#3.
Focus on the
important parts!

During the lecture, redundant explanations for previously covered source code are minimized as much as possible, allowing you to focus your learning on the modified parts or key points that need emphasis.


Recommended for these people 🙋‍♀️

University students who want to add a portfolio of new technologies before getting a job

Programmers who want to drastically improve existing programs

Major researchers who want to know how various applications have been accelerated

Those who want to learn the theory and practice of parallel processing for AI, deep learning, and matrix computation

A preview of course reviews 🏃

*The review below is for an external lecture conducted by the instructor on the same topic.

"I knew nothing about parallel algorithms or parallel computing, but
after taking the course, I gained confidence in parallel computing."

"There were many algorithms that I couldn't solve with existing C++ programs,
but through this lecture, I was able to improve them to enable real-time processing!"

"When I mentioned having experience in parallel computing during an interview after taking this course, the interviewers were very surprised.
They said it's not easy to find CUDA or parallel computing courses at the undergraduate level."


CUDA Programming Mastery Roadmap 🛩️

  • The CUDA programming course was designed as a 7-part series with over 24 hours of total content to enhance focus on each topic.
  • Each lecture consists of 6 or more sections, with each section covering an independent topic. (Part 0 consists of only 2 sections and provides the introduction.)
  • The slides used in the lecture are provided as PDF files, and the program source code used in the sections where hands-on examples are explained is also provided.

Part 0 (1-hour free lecture)

  • Introduction to MPC and CUDA - An introductory section giving an overall overview of massively parallel computing (MPC) and CUDA.

Part 1 (3 hours 40 minutes)

  • CUDA kernel concepts - Learn the concept of the CUDA kernel, the starting point of CUDA programming, and see how parallel computing works in action.

Part 2 (4 hours 15 minutes)

  • Vector addition - Presents operations between vectors (1D arrays) through various examples and demonstrates an actual implementation of the AXPY routine using CUDA.

Part 3 (4 hours 5 minutes)

  • Memory hierarchy - Learn the memory structure, the core of CUDA programming. Implement matrix addition, adjacent difference, etc., as examples.

Part 4 (3 hours 45 minutes) ← Current Lecture

  • Matrix transpose & multiply - Presents operations between matrices in the form of 2D arrays through various examples and implements the GEMM routine using CUDA.

Part 5 (3 hours 55 minutes)

  • Atomic operation & reduction - Along with an understanding of CUDA control flow, learn everything from problem definitions to solutions for atomic operations and reduction. Also, implement the GEMV routine using CUDA.

Part 6 (3 hours 45 minutes)

  • Search & sort - Learn examples of effectively implementing search-all problems, even-odd sort, bitonic sort, and counting merge sort using the CUDA architecture.

CUDA Programming and
Massive Parallel Computing Mastery Complete!


Q&A 💬

Q. What are the reviews for the paid courses like?

Since the paid lectures are being opened sequentially from (1) to (6), the reviews are scattered and currently set to private. The paid lectures have received the following reviews so far.

  • It was very helpful because you explained in detail the process of maximizing performance by applying various techniques to a single example.
  • It was much easier to understand because the memory structures and logic were explained through visualization.
  • While studying AI in a vague way, it's great to be able to add in-depth content about devices.
  • The software installation was well-explained and the source code was provided, making it easy to practice.

Q. Is this a lecture that non-majors can also take?

  • C++ programming experience is required to some extent. At the very least, you should have C programming experience. Although all examples are written as simply as possible, they are provided as C++/C code, and standard library functions such as malloc and memcpy are not explained separately.
  • However, if you have an understanding of computer architecture (registers, cache memory, etc.), operating systems (time-sharing, etc.), and compilers (code generation, code optimization), you will be able to understand the course content more deeply.
  • This course was originally designed as an advanced study for senior computer science majors at four-year universities.

Q. Is there anything I need to prepare before taking the course? Are there any reference materials regarding the course (required environment, other precautions, etc.)?

  • You must secure a hardware environment where NVIDIA CUDA works for the practice sessions in advance. A PC/laptop equipped with an NVIDIA GeForce graphics card is absolutely necessary.
  • While NVIDIA GeForce graphics cards can be used in some cloud environments, cloud settings change frequently and often involve costs; therefore, if you are using a cloud environment, you must work out how to access and use the graphics card on your own.

Q. To what level does the course content cover?

  • Starting from Part 0 and moving up from Part 1 to Part 6, the course requires deeper theory and a greater level of understanding.
  • We strongly recommend that you watch the courses in order from Part 0 to Part 6.
  • The counting merge sort covered at the end of Part 6 is difficult enough that even professional researchers may find it hard to follow immediately. However, offline students who followed along step-by-step were usually able to understand it without much trouble, building on the learning from the previous sections.

Q. Is there a reason for setting a course enrollment period?

  • The enrollment period is set because, given how quickly the computer science field moves, the content of this lecture is likely to be outdated by the time the period expires.
  • By then, I will see you again in a new course. 😄

Q. Are there subtitles in the videos?

  • Yes. Currently, all videos include subtitles.
  • However, some videos added in the future may not have subtitles.

Information regarding fonts used in lecture materials ✔️

  • Only free fonts from Google / Adobe were used in the videos and PDF files.
  • The Korean font used is "Noto Sans KR", and the English fonts used are Source Sans Pro and Source Serif Pro.
  • All of them can be downloaded for free from the following links. After downloading and extracting the files, install them by right-clicking the font files and choosing Install.
  • At https://fonts.google.com/noto/specimen/Noto+Sans+KR, download as a ZIP file by clicking "download family" and then install.
  • At https://fonts.google.com/specimen/Source+Sans+Pro, download as a ZIP file via "download family" and install.
  • At https://fonts.google.com/specimen/Source+Serif+Pro, download as a ZIP file by clicking "download family" and then install.


Who is this course right for?

  • Those who want to accelerate array/matrix/image processing, statistical processing, sorting, etc., using C++-based parallel computing/parallel processing.

  • Those who want to accelerate their own programs using parallel computing/CUDA.

  • Those who wish to study NVIDIA CUDA programming/CUDA computing from the basics.

  • Those who wish to study both the theory and practice of GPU parallel processing/parallel computing in a balanced manner.

Need to know before starting?

  • C++ or C programming experience

  • It is even better if you have knowledge of computer architecture, registers, caches, time sharing, etc.

Hello
This is onemoresipofcoffee

9,742 Learners ∙ 312 Reviews ∙ 65 Answers ∙ 4.9 Rating ∙ 30 Courses

One more cup of drip coffee for the road

Curriculum

40 lectures ∙ (3hr 40min)

Course materials: lecture resources (PDF slides and source code)

Reviews

5 reviews ∙ average rating 5.0
  • kissureng4871 (Reviews 5 ∙ Average Rating 5.0) ∙ ★5 ∙ 100% enrolled
    The optimization part is especially helpful in many ways.
    ↳ Instructor reply: Hello. 🌞 Thank you for your good review. 🍀 I hope you always have a happy time.

  • wayfarecru0581 (Reviews 25 ∙ Average Rating 5.0) ∙ ★5 ∙ 10% enrolled
    It was unique that matrix multiplication can be implemented in various ways, and I didn't know that changing the loop structure in a CPU implementation could make it faster. It feels like I learned practical coding, not the level that usually appears in language books.
    ↳ Instructor reply: Hello. 🌞 Thank you for your good review. 🍀 I hope you always have a happy time.

  • hooha1207 (Reviews 8 ∙ Average Rating 5.0) ∙ ★5 ∙ 55% enrolled
    It was good that you explained how to do mathematical operations with CUDA in earnest. And thanks to the table of contents organized by "One More Sip of Drip Coffee," I was able to find the content I wanted very quickly.

  • min4849 (Reviews 5 ∙ Average Rating 5.0) ∙ ★5 ∙ 30% enrolled
    ↳ Instructor reply: Hello. 🌞 Thank you for your good review. 🍀 I hope you always have a happy time.

  • dlghdwn0084660 (Reviews 4 ∙ Average Rating 5.0) ∙ ★5 ∙ 30% enrolled
