Q&A - Inflearn | Community


Unresolved
Developing LLM Applications with RAG (feat. LangChain)

Sharing how to quantize Hugging Face models on a Mac (plus a question)

Hello. The quantization example in the lecture uses BitsAndBytesConfig, which requires CUDA, so I looked for a way to do the same thing on a MacBook. I found that mlx_lm can quantize models using MPS (the Mac GPU), and I would like to share it. I expect there are quite a few MacBook users.

LangChain provides MLXPipeline, but it currently fails inside the _call method that runs on invoke: mlx_lm does provide generate_step, but MLXPipeline cannot call it correctly, so I worked around the error with RunnableLambda. (Q: Does the instructor happen to know a fix? None of the code on the official site works. The failing code is attached at the very bottom.)

First, how to run the quantization. There are two ways: a CLI command or the Python API. I am attaching a link for the details. It explains the concept of quantization in depth, so it is worth a look.

Link: [quantization with mlx_lm](https://developer.apple.com/kr/videos/play/wwdc2025/298/?time=187)

Quantization with mlx_lm

First, save the quantized model to a local path. The code that checks for the Apple GPU is at the top of the script. Honestly, running the command seems easier than writing Python code. I used Python because I read that the projection layer and the embedding layer should be kept at a higher bit width than the other layers. This is said to give better results when the weights are dequantized on the fly (during inference computation).

```python
# Example using MPS on a Mac
import torch
from mlx_lm.convert import convert

print("MPS available on this device =>", torch.backends.mps.is_available())


# Projection layer and embedding layer: 6 bit.
# Other quantizable layers: 4 bit.
# Layers that cannot be quantized: return False.
def mixed_quantization(layer_path, layer):
    if "lm_head" in layer_path or "embed_tokens" in layer_path:
        return {"bits": 6, "group_size": 64}
    elif hasattr(layer, "to_quantized"):
        return {"bits": 4, "group_size": 64}
    else:
        return False


# Run the quantization
convert(
    hf_path="microsoft/Phi-3-mini-4k-instruct",
    mlx_path="./models/microsoft-Phi-3-mini-4k-instruct-mixed-4-6-bit",
    dtype="float16",
    quantize=True,
    q_bits=4,
    q_group_size=64,
    quant_predicate=mixed_quantization,
)
```

The CLI equivalent (uniform 4-bit, shown here with a Mistral model):

```shell
mlx_lm.convert --hf-path "mistralai/Mistral-7B-Instruct-v0.3" \
    --mlx-path "./mistral-7b-v0.3-4bit" \
    --dtype float16 \
    --quantize --q-bits 4 --q-group-size 64 \
    --upload-repo "my-name/mistral-7b-v0.3-4bit"
```

Integrating with LangChain

```python
from functools import wraps

from langchain_core.runnables import RunnableLambda
from langchain_core.output_parsers import StrOutputParser
from mlx_lm import generate, load

quantized_model_path = "./models/microsoft-Phi-3-mini-4k-instruct-mixed-4-6-bit"
model, tokenizer = load(quantized_model_path)


def runnable_wrapper(func):
    """RunnableLambda wrapper function"""
    @wraps(func)
    def wrapper(*args, **kwargs):
        # Calling the decorated name (with no arguments) returns
        # the original function wrapped in a RunnableLambda.
        return RunnableLambda(func)
    return wrapper


@runnable_wrapper
def create_chat_prompt(question):
    messages = [
        {
            "role": "system",
            "content": """
            You are an expert in information retrieval and summarization.
            Your job is to read the provided text and produce a precise, faithful, and concise summary.
            Prioritize the author’s main claim, key evidence, and conclusions.
            Use plain English and avoid filler.
            Do not invent facts that aren’t present in the input.
            """
        },
        {
            "role": "user",
            "content": f"""
            Question: {question}
            """
        }
    ]

    # String form is only for logging; the tokenized form is passed to the model.
    prompt_without_tokenized = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    prompt_with_tokenized = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    print("Generated prompt 👇\n", prompt_without_tokenized)

    return prompt_with_tokenized


@runnable_wrapper
def run_llm_with_mlx(prompt):
    return generate(model=model, tokenizer=tokenizer, prompt=prompt)


@runnable_wrapper
def output_parser(answer):
    # Strip the end-of-turn token from the generated text
    return answer.replace("<|end|>", "")


chain = create_chat_prompt() | run_llm_with_mlx() | output_parser()
```

Error code

The error occurs at llm.invoke, which raises `TypeError: generate_step() got an unexpected keyword argument 'formatter'`.

```python
from langchain_community.llms.mlx_pipeline import MLXPipeline
from mlx_lm import load

model_path = "./models/microsoft-Phi-3-mini-4k-instruct-mixed-4-6-bit"
model, tokenizer = load(model_path)

llm = MLXPipeline(
    model=model,
    tokenizer=tokenizer
)

llm.invoke("what's my name?")
```
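
As a side note on the quantization step, the selection logic of the mixed_quantization predicate can be checked without downloading a model or importing mlx_lm. The following is a minimal sketch under stated assumptions: the two stand-in layer classes and the layer paths are hypothetical and are not taken from Phi-3 or from mlx_lm. Only the predicate itself comes from the post.

```python
# Dry run of the mixed-precision predicate on stand-in layers.
# FakeQuantizableLayer, FakeNormLayer and the layer paths below are
# hypothetical; they only mimic "has to_quantized" vs. "does not".


class FakeQuantizableLayer:
    """Stand-in for a layer that exposes a to_quantized method."""

    def to_quantized(self, group_size=64, bits=4):
        return self


class FakeNormLayer:
    """Stand-in for a layer without a to_quantized method."""


def mixed_quantization(layer_path, layer):
    # Same rule as in the post: 6 bit for the output projection and the
    # embeddings, 4 bit for other quantizable layers, False otherwise.
    if "lm_head" in layer_path or "embed_tokens" in layer_path:
        return {"bits": 6, "group_size": 64}
    if hasattr(layer, "to_quantized"):
        return {"bits": 4, "group_size": 64}
    return False


layers = {
    "model.embed_tokens": FakeQuantizableLayer(),
    "model.layers.0.self_attn.qkv_proj": FakeQuantizableLayer(),
    "model.layers.0.input_layernorm": FakeNormLayer(),
    "lm_head": FakeQuantizableLayer(),
}

decisions = {path: mixed_quantization(path, layer) for path, layer in layers.items()}

for path, decision in decisions.items():
    print(f"{path}: {decision}")
```

Running a check like this first makes it easier to spot a path pattern that never matches before starting a long conversion.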

rlwjd31 · 1 day ago · Developing LLM Applications with RAG (feat. LangChain)
