
[VLM101] Building a Multimodal Chatbot with Fine-tuning (feat.MCP / RunPod)

This is an introductory course for understanding the concept and application methods of Vision-Language Models (VLM), and practicing running the LLaVA model in an Ollama-based environment while integrating it with MCP (Model Context Protocol). This course covers the principles of multimodal models, quantization, service development, and integrated demo development, providing a balanced mix of theory and hands-on practice.

(4.6) 13 reviews

89 learners

Level: Basic

Course period: Unlimited

  • dreamingbumblebee
Hands-on focused
mcp
Vision Transformer
transformer
Llama
Model Context Protocol


What you will gain after the course

  • Understanding what MCP is

  • Hands-on VLM Tuning and Building a PoC Demo


Learn the Latest Multimodal Technology, VLM
through Fine-tuning & Chatbot Implementation

We use AI services like ChatGPT, Gemini, and Claude every day, but have you ever wondered how they 'understand' images? The core technology is the Vision-Language Model (VLM).

In this course, you'll fine-tune recent VLMs such as LLaVA and Qwen2.5v, run them locally with Ollama, and build your own multimodal chatbot using MCP (Model Context Protocol). We'll also cover practical skills you can apply directly to real-world work, such as the CLIP Vision Encoder, quantization, and MCP server setup. Beyond simple API calls, you'll experience the complete workflow, from understanding how VLMs work to integrating them with MCP.
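
To give a feel for the "run it locally with Ollama" part, here is a minimal sketch (not the course's exact code) that asks a locally served LLaVA model about an image through Ollama's REST API. It assumes `ollama pull llava` has already been run, the server is listening on the default port 11434, and the image path is a placeholder.

```python
# Minimal sketch: query a locally served LLaVA model through Ollama's REST API.
# Assumes `ollama pull llava` has been run and the server is on the default port.
import base64
import requests

with open("example.jpg", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llava",                         # any multimodal model pulled into Ollama
        "prompt": "Describe this image in one sentence.",
        "images": [image_b64],                    # images are passed base64-encoded
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```

The course builds on this kind of local endpoint: rather than a one-off script, the chatbot and MCP tools end up talking to the model served this way.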

📌 The evolution of multimodal AI at a glance

From CLIP to LLaVA OneVision, we trace the evolution and technical context of VLMs.

📌 Build Your Own VLM Chatbot

Fine-tuning, optimization, and local execution with Ollama - build the model yourself

📌 Perfect Balance of Theory and Practice

Train and test models using actual GPUs in the RunPod environment

📌 Anyone with deep learning experience is welcome

We explain the basic concepts step by step so that even beginners can follow along

5 key points you'll experience in this course

Hands-on Multimodal AI Experience - Build It Yourself, Not Just API Calls
Go beyond simply using models - this is a practice-focused curriculum where you directly tune, connect, and complete them.

Experience the evolution of VLM technology step by step
Systematically experience the development process of multimodal models from CLIP → LLaVA → LLaVA 1.5 → OneVision.

Incorporating the Latest Multimodal Technology
Covers the most recent multimodal AI trends including LLaVA OneVision, MCP, and more.

GPU hands-on practice designed to be completed for about $10
All hands-on exercises run in a RunPod environment and can be completed at an affordable cost.

Build your own portfolio through this course
Upon completion, you'll have your own multimodal chatbot as a tangible result.

We recommend this course for

🚀 I want to level up with AI development.
Developers / students who have only used the ChatGPT API and now want to work with AI models directly

👁 I'm interested in multimodal AI.
For those curious about how AI that processes text and images simultaneously actually works, and about the principles behind VLMs

I'm curious about building a local AI environment.
Those who find cloud API costs burdensome and want to run AI models locally

💡 A course for students who need this

😤 "I'm frustrated with just using APIs"

  • Those who built a service with the ChatGPT API but feel constrained by its cost and limitations

  • Those curious about the inner workings of black-box AI models and want to get hands-on experience

💸 "AI service operating costs are too expensive"

  • Startup developers considering building their own model due to the burden of OpenAI Vision API costs

  • Those planning a service that requires processing large volumes of images

🚀 "I want to become a multimodal AI expert"

  • Those who want to advance their career as an AI developer but have only worked with text-based LLMs

  • Job seekers who want to add differentiated projects to their portfolio

🤔 "I'm not sure exactly what VLM is"

  • Those who want to keep up with AI trends but don't fully understand what multimodal or VLM is

  • Those curious about the principles of AI that processes images and text simultaneously

After taking this course

  • You can fully understand the operating principles of CLIP and the LLaVA series. Multimodal AI will no longer be a black box.

  • You can fine-tune and deploy VLM in a practical environment using Ollama and RunPod.

  • With Quantization techniques, you can make huge models lightweight so they can run even on personal PCs.

  • You can build a workflow that integrates multiple AI tools using MCP (Model Context Protocol).

  • You'll be able to create your own multimodal chatbot from start to finish.

💡 Concrete Changes You'll Gain After Taking the Course

🎯 Immediately Applicable Practical Skills

After completing the course, you'll be able to independently work on the following real-world projects:

  • Your Own VLM Service: Image analysis chatbot specialized for specific domains (medical, education, shopping, etc.)

  • Local AI Workflow: An automated system where multiple AI tools collaborate using MCP

  • Cost-Effective AI Services: Services that reduce API dependency and run on your own models

📈 Portfolio for Career Development

  • GitHub Repository: A well-organized repository with complete practice code and trained models

  • Technical blog material: Can write technical posts documenting the VLM fine-tuning process and results

  • Interview Experience: A differentiated interview story with "hands-on experience fine-tuning VLM"

🧠 Deep Understanding and Application Skills

Beyond simple usage:

  • Fully understand the internal workings of VLM to quickly learn new models as well

  • Apply model optimization techniques such as quantization and GGUF conversion to other projects as well

  • The ability to design AI workflows utilizing the MCP ecosystem

Here's what you'll learn.

🧠 VLM Core Principles: From CLIP to LLaVA OneVision
How does multimodal AI 'understand' images? Learn step-by-step the evolution of VLM, from the principles of CLIP Vision Encoder to the latest LLaVA OneVision.
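
As a small, self-contained illustration of the CLIP idea this module starts from (matching images and text in a shared embedding space), the sketch below scores one image against a few candidate captions using a public CLIP checkpoint from HuggingFace. The checkpoint name, image path, and captions are placeholders, not course material.

```python
# Minimal sketch of CLIP zero-shot matching: embed an image and several
# captions into the same space, then compare them.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image
texts = ["a photo of a cat", "a photo of a dog", "a screenshot of code"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity between the image and each caption
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0].tolist()):
    print(f"{p:.3f}  {text}")
```

This image-text alignment is exactly what LLaVA-style models reuse as their vision encoder before an LLM takes over the language side.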

🔧 Hands-on Fine-tuning: Building Your Own VLM
Fine-tune the LLaVA model directly in a RunPod GPU environment. Learn efficient training methods using Jupyter Notebook and HuggingFace Accelerate.
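
For a rough idea of what the fine-tuning setup can look like, here is a sketch of attaching LoRA adapters to a LLaVA checkpoint with `peft`. The course's actual training script (Accelerate launch, dataset handling, hyperparameters) may differ; the model id and target modules below are illustrative assumptions.

```python
# Rough sketch only: attach LoRA adapters to a LLaVA checkpoint with peft.
# Model id and target modules are assumptions, not the course's exact setup.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "llava-hf/llava-1.5-7b-hf"            # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],          # common choice: attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()                # only the small LoRA matrices are trained
```

Training only the small LoRA matrices is what makes tuning a 7B-scale VLM feasible on a single rented GPU.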

Model Optimization: Quantization & GGUF Conversion
Learn practical techniques to convert large VLMs to GGUF format and apply quantization so they can run on personal PCs.
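
The GGUF conversion itself is typically done with llama.cpp's conversion tools; as a quick in-Python illustration of what quantization buys you, the sketch below loads a LLaVA checkpoint in 4-bit with bitsandbytes, which is a related but different quantization path from GGUF. The model id is an assumption.

```python
# Illustration of the memory benefit of quantization (not the GGUF pipeline):
# load a LLaVA checkpoint in 4-bit with bitsandbytes.
import torch
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",                   # assumed checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
# A ~7B model that needs roughly 14 GB in fp16 fits in a few GB of VRAM in 4-bit.
print(model.get_memory_footprint() / 1e9, "GB")
```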

🔗 MCP Integration: Collaboration of AI Tools
Learn how to connect multiple AI models and tools into a single workflow using the Model Context Protocol.
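
To make "connecting tools with MCP" concrete, here is a minimal sketch of an MCP server exposing a single image-description tool, written with the official MCP Python SDK (`pip install mcp`). The tool body is a stub where a call to the locally served VLM would go; all names are illustrative, not the course's code.

```python
# Minimal sketch of an MCP server with one tool, using the official MCP Python SDK.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("vlm-demo")

@mcp.tool()
def describe_image(image_path: str) -> str:
    """Return a text description of the image at `image_path`."""
    # In a real demo, this is where the Ollama-served VLM would be called.
    return f"(stub) description of {image_path}"

if __name__ == "__main__":
    mcp.run()  # serves over stdio so MCP clients can connect
```

An MCP-capable client (Claude Desktop, Cursor, and the like) can then discover and call this tool as part of a larger workflow.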

Who Created This Course

  • 2016 ~ Present: NLP & LLM Development Practitioner (Worked at large corporations N and S)

Notes Before Taking the Course

Practice Environment

  • The course is explained based on MacOS. If you're using a Windows machine, you should be able to follow along as long as Docker is installed.

  • The course uses Cursor. You should be able to follow along without any issues using VS Code as well.

  • Cloud Environment

    • RunPod: GPU instance rental service, using H100 or A100

    • Estimated Cost: $10 for the entire hands-on practice

    • Advantages: Can practice immediately without complex environment setup

    • Important Notice

      • RunPod account creation and payment card registration required

Learning Materials

  • You can check the attached PDF and source code

Prerequisites and Important Notes

  • LLM-related knowledge (refer to previous LLM 101 lecture)

  • Basic Python syntax (classes, functions, module usage)

  • Basic concepts of deep learning/machine learning (neural networks, training, inference, etc.)

  • Experience with model training in GPU environments is helpful (but not required)

  • Familiarity with using terminal/command line will be helpful

Recommended for these people

Who is this course right for?

  • For those new to Multimodal and VLM

  • People who want to create an MCP-based demo

Need to know before starting?

  • LLM Fundamentals

Hello, this is dreamingbumblebee.

  • Learners: 310
  • Reviews: 40
  • Answers: 4
  • Rating: 4.4
  • Courses: 2

📱 Contact: dreamingbumblebee@gmail.com

Curriculum


23 lectures ∙ (2hr 52min)

Course materials: lecture resources

Reviews

4.6 (13 reviews)

  • jukyellow7445 (1 review, average rating 5.0) · rated 5 · 61% enrolled

  • jgryu4241 (11 reviews, average rating 4.0) · rated 4 · 30% enrolled

  • sangsunkim11958 (1 review, average rating 5.0) · rated 5 · 61% enrolled

  • kimsc (25 reviews, average rating 4.8) · rated 5 · 52% enrolled · edited
    Thank you for the great lecture.

  • luke90 (2 reviews, average rating 5.0) · rated 5 · 61% enrolled
    It seems good for roughly examining concepts and creating simple demos. It's not bad for quickly grasping concepts in the early stages.

$59.40
