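Hello.

The quantization example in the lecture uses BitsAndBytesConfig, which requires CUDA, so I looked for a way to do the same thing on a MacBook. It turns out mlx_lm can quantize models using the Mac's GPU (the author checks it through PyTorch's MPS backend), so I'm sharing what I found. I expect quite a few people here are on a MacBook.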

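LangChain provides MLXPipeline, but it currently fails for me: the _call method that runs on invoke does not call generate_step correctly, even though mlx_lm does provide generate_step. I worked around the error with RunnableLambda. (Q: Does the instructor happen to know a fix? None of the code from the official site works ㅜㅜ. The failing code is attached at the very bottom.)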

First, how to run the quantization. There are two ways: a CLI command or the Python API. The link below has the details. It also explains the concept of quantization thoroughly, so it is worth a look.

Link: [quantization with mlx_lm](https://developer.apple.com/kr/videos/play/wwdc2025/298/?time=187)
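As a rough illustration of the concept the video covers, here is a minimal sketch of group-wise affine quantization in plain Python. It is not mlx_lm's implementation; the function names and the weight values are made up for this example. It only shows that a group of floats is stored as small integers plus one scale and one offset, and that the values are recovered approximately when dequantized.

```python
# Illustrative group-wise affine quantization. NOT mlx_lm's actual code:
# quantize_group / dequantize_group are hypothetical helpers for this sketch.

def quantize_group(values, bits):
    """Map floats to integer codes in [0, 2**bits - 1] with one scale and offset per group."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]


weights = [-0.82, -0.10, 0.03, 0.27, 0.55, 0.91, -0.44, 0.12]  # made-up values

for bits in (4, 6):
    codes, scale, lo = quantize_group(weights, bits)
    restored = dequantize_group(codes, scale, lo)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: max round-trip error = {max_err:.4f} (step = {scale:.4f})")
```

The round-trip error is bounded by half a quantization step, which is why using more bits for sensitive layers reduces the error for those layers.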

Quantization with mlx_lm

First, save the quantized model to a local path. The first part of the code, before the conversion, checks whether the Apple GPU is available.

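Honestly, running the quantization from the command line seems easier than writing Python code. I used Python because I read that the projection layer and the embedding layer should be kept at a higher bit width than the other layers. This is said to give better results when the weights are dequantized on the fly (during inference computation).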

# Example for a Mac GPU
import torch
from mlx_lm.convert import convert

# This checks PyTorch's MPS backend; mlx_lm itself runs on MLX, not PyTorch.
print("MPS available on this device =>", torch.backends.mps.is_available())

# Projection layer (lm_head) and embedding layer: 6 bit.
# Other quantizable layers: 4 bit. Layers that cannot be quantized: return False.
# model_config is optional because some mlx_lm versions pass the model config
# as a third argument to the predicate and others do not.
def mixed_quantization(layer_path, layer, model_config=None):
  if "lm_head" in layer_path or "embed_tokens" in layer_path:
    return {"bits": 6, "group_size": 64}
  elif hasattr(layer, "to_quantized"):
    return {"bits": 4, "group_size": 64}
  else:
    return False

# Run the quantization and save the result to mlx_path
convert(
  hf_path="microsoft/Phi-3-mini-4k-instruct",
  mlx_path="./models/microsoft-Phi-3-mini-4k-instruct-mixed-4-6-bit",
  dtype="float16",
  quantize=True,
  q_bits=4,
  q_group_size=64,
  quant_predicate=mixed_quantization
)
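Before running a full conversion, the predicate can be sanity-checked without downloading a model. The sketch below repeats the predicate so it runs on its own and feeds it stand-in objects. The two Fake classes and the layer paths are hypothetical and exist only for this check; real paths depend on the model architecture. The optional third parameter is an assumption to cover mlx_lm versions that also pass the model config.

```python
class FakeQuantizableLayer:
    """Hypothetical stand-in for a layer that supports quantization."""

    def to_quantized(self):
        return self


class FakeNormLayer:
    """Hypothetical stand-in for a layer without to_quantized (e.g. a norm layer)."""


def mixed_quantization(layer_path, layer, model_config=None):
    # 6 bit for the output projection and the embeddings, 4 bit otherwise
    if "lm_head" in layer_path or "embed_tokens" in layer_path:
        return {"bits": 6, "group_size": 64}
    elif hasattr(layer, "to_quantized"):
        return {"bits": 4, "group_size": 64}
    return False


# Made-up layer paths, only meant to exercise each branch of the predicate
examples = [
    ("model.embed_tokens", FakeQuantizableLayer()),
    ("model.layers.0.self_attn.qkv_proj", FakeQuantizableLayer()),
    ("model.layers.0.input_layernorm", FakeNormLayer()),
    ("lm_head", FakeQuantizableLayer()),
]

for path, layer in examples:
    print(f"{path:40s} -> {mixed_quantization(path, layer)}")
```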

$ mlx_lm.convert --hf-path "mistralai/Mistral-7B-Instruct-v0.3" \
               --mlx-path "./mistral-7b-v0.3-4bit" \
               --dtype float16 \
               --quantize --q-bits 4 --q-group-size 64 \
               --upload-repo "my-name/mistral-7b-v0.3-4bit"
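To get a feel for what the bit width and group size mean for disk and memory use, here is a back-of-the-envelope estimate. It assumes each group stores one 16-bit scale and one 16-bit offset next to the quantized weights, and it uses a round parameter count of 3.8 billion (roughly Phi-3-mini). Both are assumptions for illustration; real files also contain layers that are not quantized, so treat the numbers as ballpark only.

```python
def bits_per_weight(bits, group_size, overhead_bits=32):
    """Effective bits per weight, assuming 32 bits of scale/offset per group (an assumption)."""
    return bits + overhead_bits / group_size


def estimated_size_gb(num_params, bits, group_size):
    """Rough model size in GB if every parameter were quantized the same way."""
    return num_params * bits_per_weight(bits, group_size) / 8 / 1e9


num_params = 3.8e9  # approximate, for illustration only

print(f"float16 baseline: {num_params * 16 / 8 / 1e9:.2f} GB")
for bits in (4, 6, 8):
    bpw = bits_per_weight(bits, 64)
    size = estimated_size_gb(num_params, bits, 64)
    print(f"{bits}-bit, group_size=64: {bpw} bits/weight, about {size:.2f} GB")
```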

Integrating with LangChain

from functools import wraps
from langchain_core.runnables import RunnableLambda
from mlx_lm import generate, load

# Load the quantized model saved in the previous step
quantized_model_path = "./models/microsoft-Phi-3-mini-4k-instruct-mixed-4-6-bit"
model, tokenizer = load(quantized_model_path)

def runnable_wrapper(func):
  """Decorator: calling the decorated name with no arguments returns RunnableLambda(func)."""
  @wraps(func)
  def wrapper():
    return RunnableLambda(func)

  return wrapper

@runnable_wrapper
def create_chat_prompt(question):
  messages = [
    {
      "role": "system",
      "content": """
        You are an expert in information retrieval and summarization. 
        Your job is to read the provided text and produce a precise, faithful, and concise summary. 
        Prioritize the author’s main claim, key evidence, and conclusions. 
        Use plain English and avoid filler. 
        Do not invent facts that aren’t present in the input.
      """
    },
    {
      "role": "user",
      "content": f"""
        Question: {question}
      """
    }
  ]

  prompt_without_tokenized = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
  prompt_with_tokenized = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

  print("Generated prompt 👇\n", prompt_without_tokenized)

  return prompt_with_tokenized

@runnable_wrapper
def run_llm_with_mlx(prompt):
  return generate(model=model, tokenizer=tokenizer, prompt=prompt)

@runnable_wrapper
def output_parser(answer):
  return answer.replace("<|end|>", "")
  
chain = create_chat_prompt() | run_llm_with_mlx() | output_parser()
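The `|` operator builds a sequence in which each step's output becomes the next step's input. The dependency-free sketch below mirrors that data flow with plain functions, so the shape of the chain can be checked without loading a model. The `pipe` helper is hypothetical and is not a LangChain API, `fake_llm` is a stub standing in for the mlx_lm call, and the prompt format is simplified rather than the real chat template.

```python
from functools import reduce


def pipe(*steps):
    """Hypothetical helper: compose callables left to right, like a | b | c."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def create_chat_prompt(question):
    # Simplified prompt; the real code uses tokenizer.apply_chat_template
    return f"Question: {question}"


def fake_llm(prompt):
    # Stub standing in for the model call; always returns the same text
    return "This is a stub answer.<|end|>"


def output_parser(answer):
    # Strip the end-of-turn marker, as in the chain above
    return answer.replace("<|end|>", "")


chain = pipe(create_chat_prompt, fake_llm, output_parser)
print(chain("What is quantization?"))
```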

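Failing code

llm.invoke fails with TypeError: generate_step() got an unexpected keyword argument 'formatter'. Judging from the message, MLXPipeline seems to pass a formatter argument down to generate_step that the installed mlx_lm does not accept, so this may be a version mismatch between langchain_community and mlx_lm.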

from langchain_community.llms.mlx_pipeline import MLXPipeline
from mlx_lm import load

model_path = "./models/microsoft-Phi-3-mini-4k-instruct-mixed-4-6-bit"
model, tokenizer = load(model_path)

llm = MLXPipeline(
  model=model,
  tokenizer=tokenizer
)

llm.invoke("what's my name?")
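The error is the generic Python failure that occurs when a caller passes a keyword argument the callee does not declare. The sketch below reproduces it with a stand-in function and shows one defensive pattern: filtering keyword arguments against the callee's signature. `fake_generate_step` and `call_with_supported_kwargs` are hypothetical names for this illustration. This is a diagnostic aid, not a fix for MLXPipeline itself, since the argument is passed inside the library.

```python
import inspect


def call_with_supported_kwargs(func, *args, **kwargs):
    """Call func, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(func).parameters
    accepts_any = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())
    if accepts_any:
        return func(*args, **kwargs)
    supported = {k: v for k, v in kwargs.items() if k in params}
    dropped = sorted(set(kwargs) - set(supported))
    if dropped:
        print("dropping unsupported keyword arguments:", dropped)
    return func(*args, **supported)


def fake_generate_step(prompt, max_tokens=16):
    """Stand-in for a function that does not accept a formatter argument."""
    return f"{prompt!r} with max_tokens={max_tokens}"


# Passing the extra keyword directly reproduces the same kind of TypeError
try:
    fake_generate_step("hi", max_tokens=8, formatter=None)
except TypeError as exc:
    print("direct call fails:", exc)

# Filtering first lets the call go through
print(call_with_supported_kwargs(fake_generate_step, "hi", max_tokens=8, formatter=None))
```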
