[VLM101] Creating a Multimodal Chatbot with Fine-tuning (feat.MCP)

This is an introductory course for understanding the concepts and applications of Vision-Language Models (VLMs). You will run the LLaVA model in an Ollama-based environment and practice integrating it with MCP (Model Context Protocol). The course covers the principles of multimodal models, quantization, serving, and integrated demo development, with a balanced mix of theory and practice.

(4.6) 9 reviews

34 learners

  • dreamingbumblebee
Hands-on focused
mcp
Vision Transformer
transformer
Llama
Model Context Protocol


What you will learn!

  • Understand what MCP is

  • Hands-on VLM fine-tuning and building a PoC demo yourself


Learning through fine-tuning and chatbot implementation
Latest multimodal technology, VLM

We use AI services like ChatGPT, Gemini, and Claude every day, but have you ever wondered how they ‘understand’ images? The core technology is the Vision-Language Model (VLM).

In this course, you will learn how to fine-tune the latest VLM models, LLaVA and Qwen2.5-VL, run them locally with Ollama, and create your own multimodal chatbot using MCP (Model Context Protocol). You will also pick up practical, immediately applicable techniques such as the CLIP vision encoder, quantization, and building an MCP server, and experience the entire workflow, from how a VLM works to MCP integration, rather than just making API calls.
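
To give a concrete feel for what running a VLM locally looks like, here is a minimal sketch, not taken from the course materials, that asks a locally served LLaVA model about an image through Ollama's HTTP API. It assumes Ollama is running on its default port with a `llava` model already pulled; the file path and prompt are placeholders.

```python
import base64
import json
import urllib.request

# Assumes Ollama is running locally (default port 11434) and a LLaVA model
# has already been pulled, e.g. with `ollama pull llava`.
def describe_image(image_path: str, prompt: str = "Describe this image.") -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = json.dumps({
        "model": "llava",          # local VLM served by Ollama (assumed name)
        "prompt": prompt,
        "images": [image_b64],     # Ollama accepts base64-encoded images
        "stream": False,
    }).encode("utf-8")

    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(describe_image("cat.jpg"))  # placeholder image path
```

Because the model runs locally, there are no per-call API costs, which is the motivation the course highlights.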

📌 A glance at the evolution of multimodal AI

From CLIP to LLaVA OneVision, we summarize the development and technical context of VLM.

📌 Create your own VLM chatbot

From fine-tuning and lightweighting to local execution with Ollama, you build the model yourself.

📌 Perfect balance between theory and practice

Train and test your models using real GPUs in a RunPod environment.

📌 Anyone with basic deep learning experience can join

We explain the basic concepts step by step so that even beginners can follow along.

What you can experience in this course
Five highlights

Build multimodal AI yourself, not through API calls
This is a hands-on course in which you go beyond simply using a model to tuning it, connecting it, and completing it yourself.

Experience the evolution of VLM technology step by step
Experience the systematic development of a multimodal model from CLIP → LLaVA → LLaVA 1.5 → OneVision.

Reflects the latest multimodal technology
It contains the latest multimodal AI trends such as LLaVA OneVision and MCP.

A GPU lab you can complete for $10
Full hands-on training is available at an affordable cost, based on the RunPod environment.

Complete your own portfolio through lectures
Upon completion of the course, you will have a multimodal chatbot of your own creation.

I recommend this course to these people

🚀 I want to level up with AI development.
A developer or student who has only used the ChatGPT API and now wants to work with AI models directly.

👁 I'm interested in multimodal AI.
For those curious about how an AI that processes text and images simultaneously works, and about the principles of VLMs.

I'm curious about building a local AI environment.
For those who want to run AI models locally because cloud API costs are burdensome

💡 A course these learners need

😤 "It's frustrating to only use API"

  • If you have created a service using ChatGPT API, but are frustrated because it is expensive and has many restrictions,

  • For those who are curious about the inside of an AI model like a black box and want to touch it directly

💸 "AI service operating costs are too expensive"

  • Startup developers who are considering building their own models due to the cost of calling the OpenAI Vision API

  • Anyone planning a service that requires large-scale image processing

🚀 "I want to become a multimodal AI expert"

  • Anyone who wants to advance their career as an AI developer but has only worked with text-based LLMs

  • Job seekers who want to add a differentiated project to their portfolio

🤔 "I don't know exactly what VLM is"

  • People who want to follow AI trends but don't exactly understand what multimodal is or what VLM is

  • For those who are curious about the principles of AI that processes images and text simultaneously

After class

  • You can fully understand the operating principles of CLIP and the LLaVA series. Multimodal AI is no longer a black box.

  • You can fine-tune and deploy VLMs in a production environment using Ollama and RunPod.

  • Using quantization techniques, you can make huge models lighter and run them on personal computers.

  • You can build a workflow that integrates multiple AI tools using MCP (Model Context Protocol).

  • You will be able to build your own multimodal chatbot from start to finish.

💡 Specific changes you can achieve after taking the course

🎯 Immediately actionable practical skills

After completing the course, you will be able to work on the following hands-on projects on your own:

  • My own VLM service: an image-analysis chatbot specialized for a specific domain (medical, education, shopping, etc.)

  • Local AI workflow: an automated system in which multiple AI tools collaborate via MCP

  • Cost-effective AI service: a service that reduces API dependency and runs on your own model

📈 Portfolio for career advancement

  • GitHub repository: a complete repository containing all of the practice code and trained models

  • Technical blog material: you can write technical posts summarizing the VLM fine-tuning process and results

  • Interview material: a differentiated interview story, having fine-tuned a VLM yourself

🧠 Deep understanding and application

Beyond simple usage:

  • Fully understand the internal workings of VLM, enabling rapid learning of new models

  • Apply model optimization techniques such as quantization and GGUF conversion to other projects

  • Ability to design AI workflows using the MCP ecosystem

Learn about these things.

🧠 VLM Core Principles: From CLIP to LLaVA OneVision
How does multimodal AI 'understand' images? Learn the evolution of VLM step by step, from the principles of CLIP Vision Encoder to the latest LLaVA OneVision.
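
As a rough, self-contained illustration of the CLIP idea mentioned above, namely encoding an image and candidate texts into a shared embedding space and comparing them, here is a sketch using the Hugging Face transformers CLIP classes. The checkpoint name, image path, and captions are assumptions for illustration, not necessarily what the course uses.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Example checkpoint; the course may use a different vision encoder.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")                      # placeholder local image
texts = ["a photo of a cat", "a photo of a dog"]   # candidate captions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores turned into probabilities over the captions
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0].tolist()):
    print(f"{text}: {p:.3f}")
```

This contrastive image-text matching is the building block that LLaVA-style models reuse as their vision encoder before attaching a language model.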

🔧 Real-world Fine Tuning: Create Your Own VLM
Fine-tune the LLaVA model directly in a RunPod GPU environment. Learn efficient training methods using Jupyter Notebook and HuggingFace Accelerate.
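
For orientation only, here is a heavily simplified sketch of what a parameter-efficient (LoRA) fine-tuning setup for a LLaVA-style model can look like with Hugging Face transformers and peft. The checkpoint, target modules, and hyperparameters are illustrative assumptions rather than the course's exact recipe.

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Illustrative checkpoint; the course may fine-tune a different LLaVA variant.
model_id = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Attach small trainable LoRA adapters to the language model's attention
# projections instead of updating all of the base weights.
lora_config = LoraConfig(
    r=16,                                 # assumed rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, batches of (image, prompt, answer) pairs would be encoded with
# `processor(...)` and fed through a standard training loop or
# `transformers.Trainer`, optionally launched with HuggingFace Accelerate.
```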

Model Lightweighting: Quantization & GGUF Conversion
Learn practical techniques for converting massive VLMs to GGUF format and applying Quantization so that they can run on personal computers.
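
The GGUF conversion itself is normally done with llama.cpp's conversion and quantization tooling rather than pure Python, so as a Python-side illustration of the same trade-off (lower-precision weights for a much smaller memory footprint), here is a sketch that loads a LLaVA checkpoint in 4-bit with bitsandbytes through transformers. The checkpoint and settings are assumptions, not the course's exact steps.

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

# 4-bit NF4 quantization: weights are stored in 4 bits and de-quantized
# to float16 on the fly during computation.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "llava-hf/llava-1.5-7b-hf"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Roughly: 7B params * 4 bits ≈ 3.5 GB of weights, versus ~14 GB in float16,
# which is what makes running the model on a personal machine feasible.
print(model.get_memory_footprint() / 1e9, "GB")
```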

🔗 MCP Integration: Collaboration of AI Tools
Learn how to connect multiple AI models and tools into a single workflow using the Model Context Protocol.
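
To make the MCP side concrete, here is a minimal sketch of an MCP server that exposes a single tool, written against the FastMCP helper in the official MCP Python SDK. The server name, the tool, and the idea of delegating to a local VLM are illustrative placeholders rather than the course's actual server.

```python
from mcp.server.fastmcp import FastMCP

# A minimal MCP server exposing one tool. An MCP-capable client
# (e.g. an agent or IDE) can discover and call this tool over the protocol.
mcp = FastMCP("vlm-demo")

@mcp.tool()
def describe_image(image_path: str) -> str:
    """Return a text description of the image at the given path."""
    # Placeholder: a real server would call the locally served VLM
    # (for example via Ollama's HTTP API) and return its answer.
    return f"(stub) description of {image_path}"

if __name__ == "__main__":
    # Runs the server over stdio so an MCP client can connect to it.
    mcp.run()
```

An MCP-capable client can then list this server's tools and call `describe_image` as one step in a larger workflow.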

Who created this course

  • 2016 ~ Present: NLP & LLM Development Practitioner (Working at large companies N ~ S)

Things to note before taking the class

Practice environment

  • The course is based on macOS. If you are on Windows with Docker installed, you can mostly follow along.

  • This course uses Cursor as the editor; you should be able to follow along in VS Code without any problems.

  • Cloud environment

    • RunPod: a GPU instance rental service; we use an H100 or A100

    • Estimated cost: $10 for the entire practice

    • Pros: you can start practicing right away without complicated environment setup

    • Note

      • You need to create a RunPod account and register a payment card.

Learning Materials

  • Please check the attached PDF and source code

Prerequisite Knowledge and Notes

  • LLM-related knowledge (see the previous LLM 101 course)

  • Basic Python syntax (using classes, functions, modules)

  • Deep learning/machine learning basic concepts (neural networks, training, inference, etc.)

  • Experience training models in a GPU environment is preferred (but not required)

  • Familiarity with terminal/command usage would be helpful

Recommended for these people

Who is this course right for?

  • Beginners in multimodal AI and VLMs

  • Those who want to build an MCP-based demo

What you need to know before starting

  • LLM Basics

Hello, this is dreamingbumblebee.

227 learners ∙ 29 reviews ∙ 4 answers ∙ 4.5 rating ∙ 2 courses

📱 Contact: dreamingbumblebee@gmail.com

Curriculum


23 lectures ∙ (2hr 52min)

Course Materials:

Lecture resources
Published: 
Last updated: 

Reviews

9 reviews ∙ average rating 4.6

  • luke90 (2 reviews ∙ average rating 5.0) ∙ rated 5 ∙ 61% enrolled

    It seems good for roughly grasping the concepts and building a simple demo. Not bad for quickly picking up the concepts early on.

  • haenarashin (9 reviews ∙ average rating 4.4) ∙ rated 3 ∙ 61% enrolled

    Rather than a 101 class, this feels more like something for people who have already studied or worked with the subject to skim through.

  • yyj (3 reviews ∙ average rating 5.0) ∙ rated 5 ∙ 30% enrolled

  • nar998614 (9 reviews ∙ average rating 4.7) ∙ rated 5 ∙ 100% enrolled

    The core content is explained well in a short amount of time.

  • joshuayoon7058186 (2 reviews ∙ average rating 5.0) ∙ rated 5 ∙ 100% enrolled

    Thanks to this course, I was able to quickly learn the MCP structure and how to build a demo. The first half explains complex material step by step, and the second half is hands-on, so it was great for applying it to real work right away.

Price: $51.70
