Fine-Tuning Large Language Models with MLX
Table of Contents
- Introduction to Fine-Tuning
- Prerequisites & Setup
- Understanding the Components
- Creating Training Data
- Downloading Models from Hugging Face
- The Fine-Tuning Process
- Converting to GGUF Format
- Deploying with Ollama
- Understanding Key Concepts
- Troubleshooting Common Issues
- Cost Comparison & Hosting Options
- Advanced Topics
1. Introduction to Fine-Tuning
Fine-tuning allows you to customize pre-trained language models to have specific behaviors, personalities, or capabilities. This guide will walk you through the complete process of fine-tuning a language model on Apple Silicon using MLX, Apple's machine learning framework.
What You'll Learn
• How to set up a Python environment for ML on Mac
• Understanding training data formats and requirements
• Using MLX to fine-tune models with LoRA (Low-Rank Adaptation)
• Converting models to formats compatible with Ollama
• Deploying your custom model locally
• Understanding key concepts like loss, epochs, and parameters
Note: This guide focuses on Apple Silicon (M1/M2/M3/M4) Macs with 16GB+ RAM. While the principles apply broadly, specific commands and performance will vary on other hardware.
2. Prerequisites & Setup
System Requirements
• Hardware: Apple Silicon Mac (M1/M2/M3/M4) with at least 16GB RAM
• Operating System: macOS 13.5 or later (the minimum MLX supports)
• Python: Python 3.9 or later
• Storage: At least 20GB free space for models and tools
Installing Python Dependencies
First, verify you have Python 3 installed:
python3 --version
Create a virtual environment to keep dependencies isolated:
# Create virtual environment
python3 -m venv finetune_env
# Activate it
source finetune_env/bin/activate
Install required packages:
# Install MLX and MLX-LM
pip install mlx mlx-lm
# Install Hugging Face CLI
pip install huggingface-hub
# Install additional tools for conversion
pip install sentencepiece gguf
Setting Up Hugging Face Access
Many models require you to accept their license agreements:
- Create a Hugging Face account at https://huggingface.co
- For models like Llama, visit the model page and click 'Agree and access repository'
- Generate an access token at https://huggingface.co/settings/tokens
- Log in via the CLI:
huggingface-cli login
Paste your token when prompted (it won't show as you type; this is normal).
3. Understanding the Components
MLX Framework
MLX is Apple's machine learning framework optimized for Apple Silicon. It provides:
• Efficient use of unified memory architecture
• Optimized operations for M-series chips
• NumPy-like API for familiarity
• Built-in support for common ML operations
LoRA (Low-Rank Adaptation)
LoRA is a technique that makes fine-tuning efficient by:
• Training only small 'adapter' layers instead of the entire model
• Training only a small fraction of the total parameters (roughly 0.3-2%)
• Requiring much less memory and compute
• Allowing quick training (minutes instead of hours/days)
• Enabling easy switching between different adaptations
Analogy: Instead of retraining an entire actor (the base model), LoRA gives them a character sheet for a specific role. The actor's skills remain, but they now know how to play this particular character.
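To see why the adapters are so small, it helps to count parameters for a single weight matrix. The sketch below assumes the standard LoRA formulation, in which a frozen matrix of shape d_out x d_in receives a trainable update B·A, with B of shape d_out x r and A of shape r x d_in. The function names, dimensions, and rank are illustrative assumptions, not values taken from any particular model or from MLX.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Number of weights in one dense d_out x d_in matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Number of trainable weights in the two LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of this layer's weights that LoRA actually trains."""
    return lora_params(d_in, d_out, rank) / full_params(d_in, d_out)


if __name__ == "__main__":
    # Illustrative sizes only: a square 2048 x 2048 layer with rank 8.
    d, r = 2048, 8
    print(f"full layer:  {full_params(d, d):,} weights")
    print(f"LoRA update: {lora_params(d, d, r):,} weights")
    print(f"fraction:    {trainable_fraction(d, d, r):.2%}")
```

Because adapters are usually attached to only some layers, the fraction for a whole model is typically lower than the per-layer figure.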
Model Formats
| Format | Description | Used By |
|---|---|---|
| Hugging Face (safetensors) | Standard Hugging Face weight format (framework-agnostic), usually full or half precision | Training, MLX |
| MLX Format | Apple-optimized format for M-series chips | MLX training and inference |
| GGUF | Quantized format for efficient inference | Ollama, llama.cpp |
4. Creating Training Data
Data Format Requirements
Training data must be in JSONL format (JSON Lines), where each line is a complete conversation example:
{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi! How can I help you?"}]}
Key Principles
• Quality over quantity: 100 good examples beat 1000 poor ones
• Consistency: Maintain consistent style/format across examples
• Diversity: Cover different scenarios within your use case
• Balanced length: Mix short and longer conversations
• Minimum: At least 100 examples, ideally 300-1000
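The format and minimum-size guidance above can be checked automatically before you spend time on training. The following is a minimal sketch using only the standard library. The function names are hypothetical, and two of the rules are assumptions rather than requirements stated in this guide: that "system" is an acceptable role alongside "user" and "assistant", and that each conversation should end with an assistant message.

```python
import json

# Assumption: these are the roles your chat format uses.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if it looks fine)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    # Assumption: a training example should end with the reply to learn.
    if not problems and messages[-1]["role"] != "assistant":
        problems.append("last message should come from the assistant")
    return problems


def validate_file(path: str, min_examples: int = 100) -> list[str]:
    """Validate every non-blank line of a JSONL file and check the example count."""
    problems, count = [], 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            count += 1
            problems += [f"line {lineno}: {p}" for p in validate_line(line)]
    if count < min_examples:
        problems.append(f"only {count} examples; at least {min_examples} recommended")
    return problems
```

Calling validate_file('all_data.jsonl') returns an empty list when every line passes, so it can serve as a quick gate before splitting the data.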
Example Use Cases
Customer Support Bot: Train on example support conversations with common issues and solutions
Code Review Assistant: Examples of code snippets with constructive feedback
Technical Documentation Writer: Pairs of code/concepts and their explanations
Email Tone Shifter: Casual emails paired with professional versions
Domain Expert: Q&A pairs about specific technical domain
Creating Training Data
You can create training data in several ways:
Method 1: Manual Creation. Write examples yourself based on the desired behavior. Time-consuming, but yields the highest quality.
Method 2: AI-Assisted Generation. Use Claude or GPT-4 to generate examples based on your specifications. Review and edit all generated content.
Method 3: Extraction from Existing Content. Convert existing documentation, conversations, or content into Q&A format.
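Whichever method you use, the result has to end up in the chat-style JSONL format described earlier. As a rough sketch, the helper below turns question/answer pairs into that format. The function names and the optional system message are hypothetical conveniences, not part of MLX or any other tool.

```python
import json
from typing import Iterable, Optional


def qa_to_record(question: str, answer: str, system: Optional[str] = None) -> dict:
    """Build one training example in the {"messages": [...]} chat format."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": answer})
    return {"messages": messages}


def write_jsonl(pairs: Iterable[tuple[str, str]], path: str) -> int:
    """Write (question, answer) pairs to a JSONL file, one example per line."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for question, answer in pairs:
            # ensure_ascii=False keeps non-English text readable in the file
            f.write(json.dumps(qa_to_record(question, answer), ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    pairs = [("How do I reset my password?", "Open Settings, then choose Reset Password.")]
    print(write_jsonl(pairs, "all_data.jsonl"), "example(s) written")
```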
Data Splitting
Always split your data into training and validation sets:
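import json
import random
# Read all examples, skipping blank lines and normalizing line endings
with open('all_data.jsonl', 'r') as f:
    data = [line.rstrip('\n') + '\n' for line in f if line.strip()]
# Fail early if any line is not valid JSON
for line in data:
    json.loads(line)
# Shuffle with a fixed seed so the split is random but reproducible
random.seed(42)
random.shuffle(data)
# Split 90/10
split_idx = int(len(data) * 0.9)
train_data = data[:split_idx]
valid_data = data[split_idx:]
# Write splits
with open('train.jsonl', 'w') as f:
    f.writelines(train_data)
with open('valid.jsonl', 'w') as f:
    f.writelines(valid_data)
Why split data? The validation set tests whether your model can generalize to new examples rather than just memorizing the training data.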
5. Downloading Models from Hugging Face
Choosing a Base Model
Select a model based on your needs:
| Model | Size | Speed | Quality | License |
|---|---|---|---|---|
| Llama 3.2 1B | 1B params | Very Fast | Basic | Commercial* |
| Llama 3.2 3B | 3B params | Fast | Good | Commercial* |
| Qwen 2.5 1.5B | 1.5B params | Very Fast | Good | Apache 2.0 |
| Mistral 7B | 7B params | Moderate | Excellent | Apache 2.0 |
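* Llama: Commercial use is allowed under the Llama Community License unless your products exceed 700M monthly active users, in which case you must request a license from Meta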
Download Process
Use the Hugging Face CLI to download models:
# For Llama 3.2 3B (requires license acceptance)
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct \
--local-dir ./llama-3.2-3b
# For Qwen 2.5 1.5B (not gated, so no license acceptance step)
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct \
--local-dir ./qwen2.5-1.5b
Downloads typically take 5-15 minutes depending on your internet speed. Models are 3-7GB.
Tip: The model name comes from the Hugging Face URL. For example, https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct becomes Qwen/Qwen2.5-1.5B-Instruct
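If you collect model links in a script, the same URL-to-name rule can be applied in code. This is a small sketch using only the standard library; the function name is hypothetical, and it assumes the URL path starts with the owner and model name, as in the example above.

```python
from urllib.parse import urlparse


def repo_id_from_url(url: str) -> str:
    """Extract 'owner/model' from a Hugging Face model page URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected an owner and a model name in {url!r}")
    # Anything after the first two segments (for example /tree/main) is ignored.
    return "/".join(parts[:2])


if __name__ == "__main__":
    print(repo_id_from_url("https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct"))
```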
6. The Fine-Tuning Process
Project Structure
Organize your project directory:
your-project/
├── llama-3.2-3b/ # Downloaded base model
├── train.jsonl # Training data
├── valid.jsonl # Validation data
└── adapters/ # Will be created during training
Training Command
Run the fine-tuning with MLX:
mlx_lm.lora \
--model ./llama-3.2-3b \
--train \
--data . \
--batch-size 2 \
--iters 500 \
--learning-rate 1e-5 \
--steps-per-report 10 \
--steps-per-eval 100 \
--adapter-path ./adapters \
--grad-checkpoint \
--max-seq-length 512
Parameter Explanations
| Parameter | Purpose |
|---|---|
| --model | Path to downloaded base model |
| --train | Flag to enable training mode |
| --data | Directory containing train.jsonl and valid.jsonl |
| --batch-size | Number of examples processed together (2 is safe for 16GB RAM) |
| --iters | Number of training steps, each processing one batch (500 steps × batch size 2 = 1,000 examples seen, about 2.9 epochs for 347 examples) |
| --learning-rate | How fast the model learns (1e-5 is standard for fine-tuning) |
| --steps-per-report | How often to print training loss (every 10 steps) |
| --steps-per-eval | How often to run validation (every 100 steps) |
| --adapter-path | Where to save the LoRA adapter weights |
| --grad-checkpoint | Reduces memory usage (important for 16GB RAM) |
| --max-seq-length | Maximum conversation length in tokens |
What Happens During Training
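You'll see output like this (the sample below comes from a run on a roughly 1.5B-parameter model, as the parameter count in the log shows):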
Loading pretrained model
Loading datasets
Training
Trainable parameters: 0.342% (5.276M/1543.714M)
Starting training..., iters: 500
Calculating loss...: 100%|████| 17/17
Iter 1: Val loss 6.125, Val took 3.433s
Iter 10: Train loss 4.020, It/sec 1.510
Iter 100: Val loss 2.845
...
Iter 500: Train loss 0.253, Val loss 0.845
Saved adapter weights to adapters/
Training Time Expectations
| Model Size | 16GB M1 Pro | 32GB M2 Max | Notes |
|---|---|---|---|
| 1.5B params | 2-4 minutes | 2-3 minutes | Very fast training |
| 3B params | 3-5 minutes | 2-4 minutes | Recommended for learning |
| 7B params | 15-30 minutes | 10-20 minutes | Requires optimization |
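If you save the training output to a file, the loss values in it can be extracted for plotting or comparison between runs. The sketch below assumes the log lines look exactly like the sample output shown earlier in this section ("Iter N: Train loss X, ..." and "Iter N: Val loss Y, ..."). The actual format may differ between mlx-lm versions, so treat the regular expressions as a starting point. The function name is hypothetical.

```python
import re

# Patterns based on the sample log in this guide, not on a documented format.
ITER_RE = re.compile(r"^Iter (\d+):")
LOSS_RE = re.compile(r"(Train|Val) loss (\d+(?:\.\d+)?)")


def parse_losses(log_text: str) -> dict[str, list[tuple[int, float]]]:
    """Collect (iteration, loss) pairs for training and validation loss."""
    losses: dict[str, list[tuple[int, float]]] = {"train": [], "val": []}
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = ITER_RE.match(line)
        if not match:
            continue  # skip progress bars and status messages
        step = int(match.group(1))
        for kind, value in LOSS_RE.findall(line):
            losses[kind.lower()].append((step, float(value)))
    return losses


if __name__ == "__main__":
    sample = """Iter 1: Val loss 6.125, Val took 3.433s
Iter 10: Train loss 4.020, It/sec 1.510
Iter 500: Train loss 0.253, Val loss 0.845"""
    for kind, points in parse_losses(sample).items():
        print(kind, points)
```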
7. Converting to GGUF Format
Why GGUF?
GGUF (GPT-Generated Unified Format) is optimized for inference:
• Quantized weights reduce file size and memory usage
• Fast loading and inference
• Compatible with Ollama and llama.cpp
• Can run on CPU efficiently
Step 1: Merge Adapter with Base Model
First, combine the LoRA adapter with the base model:
mlx_lm.fuse \
--model ./llama-3.2-3b \
--adapter-path ./adapters \
--save-path ./model-merged
This creates a complete model with your customizations baked in.
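Conceptually, merging means adding the adapter's low-rank update to each adapted weight matrix, so that no separate adapter is needed at inference time. The toy example below illustrates that idea with plain Python lists, assuming the standard LoRA update W + scale · (B·A). It is not the implementation used by mlx_lm.fuse, and the matrices, scale value, and function names are made up for illustration.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two matrices given as nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)] for row in a]


def merge_lora(w: Matrix, lora_b: Matrix, lora_a: Matrix, scale: float = 1.0) -> Matrix:
    """Return W + scale * (B @ A), the weight matrix with the adapter baked in."""
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights (2 x 2)
    lora_b = [[1.0], [2.0]]        # trained factor B (2 x 1), so rank = 1
    lora_a = [[0.5, -0.5]]         # trained factor A (1 x 2)
    print(merge_lora(w, lora_b, lora_a, scale=2.0))
```

After merging, the result has the same shape as the original matrix, which is why the fused model can be converted like any ordinary checkpoint.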
Step 2: Install Conversion Tools
Clone llama.cpp for the conversion scripts:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Install the Python packages the conversion script depends on
pip install -r requirements.txt
Step 3: Convert to GGUF
Run the conversion script:
python convert_hf_to_gguf.py \
../model-merged \
--outfile ../model-custom.gguf \
--outtype f16
The conversion takes 2-5 minutes. The resulting .gguf file will be similar in size to the original model.
Common Issue: If you get 'ModuleNotFoundError: No module named sentencepiece', run: pip install sentencepiece
8. Deploying with Ollama
What is Ollama?
Ollama is a tool for running language models locally. It provides:
• Simple model management
• Fast inference on Mac
• REST API for integration
• Easy switching between models
Installing Ollama
Download from https://ollama.ai or install via Homebrew:
brew install ollama
Creating a Modelfile
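The Modelfile tells Ollama how to use your model. The template below uses the Llama 3 chat format; other model families (such as Qwen) need the template that matches their own chat format: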
FROM ./model-custom.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
TEMPLATE """<|begin_of_text|>{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
"""
SYSTEM You are a helpful assistant.
Save this as 'Modelfile' in your project directory.
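To make the template less abstract, the sketch below mimics in Python what it produces for a given system message and user prompt. It mirrors the template text exactly as printed in this guide and is only an approximation for illustration: Ollama renders the real template itself, and its whitespace handling may differ. The function name is hypothetical.

```python
def render_prompt(prompt: str, system: str = "") -> str:
    """Approximate the text the Modelfile template above sends to the model."""
    parts = ["<|begin_of_text|>"]
    if system:  # corresponds to {{ if .System }} ... {{ end }}
        parts.append(f"<|start_header_id|>system<|end_header_id|>\n{system}<|eot_id|>")
    if prompt:  # corresponds to {{ if .Prompt }} ... {{ end }}
        parts.append(f"<|start_header_id|>user<|end_header_id|>\n{prompt}<|eot_id|>")
    # The template always ends by opening the assistant turn for the model to complete.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_prompt("Hello!", system="You are a helpful assistant."))
```

Seeing the rendered text makes it easier to spot a mismatch between the template and the special tokens your base model was trained with.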
Importing to Ollama
Create the model in Ollama:
ollama create my-custom-model -f Modelfile
Running Your Model
Start chatting with your model:
ollama run my-custom-model
Or use the REST API:
curl http://localhost:11434/api/generate -d '{
"model": "my-custom-model",
"prompt": "Hello!"
}'
9. Understanding Key Concepts
Loss
What it is: A measure of how wrong the model's predictions are.
• Higher loss = Model is doing poorly
• Lower loss = Model is doing better
• Goal: Watch loss decrease during training
Train Loss measures error on training examples (the data it's learning from). Validation Loss measures error on held-out examples (tests generalization).
Example: If val loss starts at 6.125 and drops to 0.845, your model learned successfully!
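For intuition about what these numbers mean, the sketch below assumes the reported loss is the average per-token cross-entropy, which is the usual choice when training language models. Under that assumption, the loss for one token is the negative log of the probability the model gave to the correct token, and exp(loss) is the perplexity: roughly how many equally likely tokens the model is choosing among. The function names are illustrative.

```python
import math


def token_loss(p_correct: float) -> float:
    """Cross-entropy for one token: -log of the probability given to the right answer."""
    return -math.log(p_correct)


def perplexity(mean_loss: float) -> float:
    """Convert an average cross-entropy loss into perplexity."""
    return math.exp(mean_loss)


if __name__ == "__main__":
    # A confident correct prediction costs little; a poor one costs a lot.
    for p in (0.9, 0.5, 0.01):
        print(f"p(correct)={p:<5} loss={token_loss(p):.3f}")
    # The two validation losses used as the example in this section.
    for loss in (6.125, 0.845):
        print(f"loss={loss:<6} perplexity={perplexity(loss):.1f}")
```

Read this way, a drop from 6.125 to 0.845 takes perplexity from about 457 down to about 2.3.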
Parameters
Parameters are the numbers that make up the model (its 'brain'). A 3B model has 3 billion parameters.
• More parameters = More capability but slower and more memory
• Fewer parameters = Faster but less capable
• LoRA only trains 0.3-2% of parameters
Epochs
One epoch = one complete pass through all training data.
• 500 iterations × batch size 2 = 1,000 examples seen; with 347 examples that is ≈ 2.9 epochs
• More epochs = more learning but risk of overfitting
• 1-3 epochs is typical for fine-tuning
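The relationship between iterations, batch size, and epochs is simple arithmetic, and it is worth computing before choosing --iters. This sketch assumes every iteration processes exactly one batch of batch_size examples; the function names are hypothetical.

```python
import math


def epochs_for(iters: int, batch_size: int, n_examples: int) -> float:
    """How many passes over the data a given number of iterations amounts to."""
    return iters * batch_size / n_examples


def iters_for(target_epochs: float, batch_size: int, n_examples: int) -> int:
    """How many iterations are needed to reach a target number of epochs."""
    return math.ceil(target_epochs * n_examples / batch_size)


if __name__ == "__main__":
    # Values used in this guide: 500 iterations, batch size 2, 347 examples.
    print(f"{epochs_for(500, 2, 347):.2f} epochs")
    print(f"{iters_for(2, 2, 347)} iterations for 2 epochs")
```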
Batch Size
Number of examples processed together. Larger batches:
• Use more memory
• Can train faster
• May improve training stability
• For 16GB RAM, use batch size 1-2
Learning Rate
How big the training steps are:
• Too high: Model won't converge, loss jumps around
• Too low: Training is very slow
• Standard for fine-tuning: 1e-5 or 2e-5
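The effect of step size is easy to see on a toy problem. The sketch below runs plain gradient descent on f(x) = x², whose gradient is 2x, with three different learning rates. It only illustrates the behavior described above; the rates are chosen for this toy function and are not comparable to the 1e-5 used for fine-tuning.

```python
def descend(lr: float, steps: int = 50, x0: float = 1.0) -> float:
    """Minimize f(x) = x**2 with gradient descent and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2 * x      # derivative of x**2
        x -= lr * grad    # one training step of size lr
    return x


if __name__ == "__main__":
    for lr in (1.1, 0.1, 0.001):
        print(f"lr={lr:<6} final x={descend(lr):.6g}")
```

With the largest rate each step overshoots the minimum and x grows without bound, with the smallest rate x has barely moved after 50 steps, and the middle rate ends very close to the minimum at 0.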
Overfitting vs Underfitting
Overfitting: Model memorizes training data but can't generalize
• Train loss very low, val loss high
• Solution: More diverse training data, fewer epochs
Underfitting: Model hasn't learned enough
• Both train and val loss are high
• Solution: Train longer, more parameters, better data
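These rules of thumb can be turned into a rough check on the loss values from a run. The sketch below is only a heuristic: the function name, the threshold for a "high" loss, and the tolerance are hypothetical values you would need to tune for your own data. It flags overfitting when validation loss has risen above its best value while training loss fell.

```python
def diagnose(train_losses: list[float], val_losses: list[float],
             high: float = 2.0, tolerance: float = 0.05) -> str:
    """Classify a run from its loss history (values in chronological order)."""
    # Underfitting: both losses are still high at the end of training.
    if train_losses[-1] > high and val_losses[-1] > high:
        return "underfitting"
    # Overfitting: validation loss has turned upward while training loss went down.
    val_rose = val_losses[-1] > min(val_losses) + tolerance
    train_fell = train_losses[-1] < train_losses[0]
    if val_rose and train_fell:
        return "overfitting"
    return "ok"


if __name__ == "__main__":
    print(diagnose([4.0, 3.5], [4.0, 3.6]))        # both losses stay high
    print(diagnose([3.0, 0.2], [3.0, 1.0, 1.6]))   # validation loss turns upward
    print(diagnose([3.0, 0.9], [3.0, 1.4, 1.1]))   # both losses keep falling
```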
10. Troubleshooting Common Issues
Out of Memory Errors
Symptoms: Training crashes with memory errors
Solutions:
• Reduce batch size to 1
• Use --grad-checkpoint flag
• Reduce --max-seq-length (try 256 or 128)
• Use a smaller base model (1.5B instead of 3B)
• Close other applications
Loss Not Decreasing
Symptoms: Loss stays high or increases
Solutions:
• Check data format (must be valid JSONL)
• Verify data quality (consistent, diverse examples)
• Try higher learning rate (2e-5 instead of 1e-5)
• Train for more iterations
Model Generates Gibberish
Symptoms: Output is incoherent or random
Solutions:
• Check Modelfile template matches model type
• Reduce temperature in Modelfile (try 0.5)
• Verify training actually completed
• Check if base model works before fine-tuning
Conversion Errors
Missing sentencepiece: pip install sentencepiece
Model type not supported: Some models need specific converters
GGUF format errors: Try updating llama.cpp (git pull)
Training Too Slow
Expected speeds on M1 Pro 16GB:
• 1.5B model: ~3 iterations/second
• 3B model: ~1.5 iterations/second
• 7B model: ~0.5 iterations/second
If slower, check Activity Monitor for other processes using CPU/memory.
11. Cost Comparison & Hosting Options
Local Hosting (Your Mac)
Cost: $0/month (electricity only)
Pros: Free, private, full control
Cons: Mac must stay on, limited by upload speed
Best for: Development, testing, personal use
Cloud Hosting
| Provider | Cost | Best For |
|---|---|---|
| Hugging Face Spaces Free | $0/month (CPU only) | Demos and portfolios |
| Modal/Replicate (Serverless) | $0.0002 per request | Low/sporadic traffic |
| Vast.ai (GPU rental) | $100-200/month (24/7) | Consistent traffic |
| RunPod (Dedicated) | $245/month (RTX 3060) | Reliable production |
| AWS/GCP/Azure | $200+/month | Enterprise scale |
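To decide between per-request and flat monthly pricing, it helps to compute the traffic level at which the two cost the same. The sketch below uses the example prices from the table above, which are the guide's estimates rather than quoted rates; the function names are hypothetical.

```python
def serverless_monthly_cost(requests_per_month: int, cost_per_request: float) -> float:
    """Monthly bill under per-request pricing."""
    return requests_per_month * cost_per_request


def break_even_requests(flat_monthly_cost: float, cost_per_request: float) -> float:
    """Requests per month at which a flat-rate server costs the same as serverless."""
    return flat_monthly_cost / cost_per_request


if __name__ == "__main__":
    per_request = 0.0002   # serverless example price from the table
    flat = 100.0           # low end of the GPU rental range from the table
    print(f"50,000 requests cost ${serverless_monthly_cost(50_000, per_request):.2f} on serverless")
    print(f"break-even at about {break_even_requests(flat, per_request):,.0f} requests/month")
```

Below the break-even point serverless is cheaper; above it a rented GPU starts to pay off.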
Recommended Progression
Phase 1 (MVP): Your Mac + Cloudflare Tunnel ($0)
Phase 2 (Beta): Modal serverless (~$5-20/month)
Phase 3 (Launch): Vast.ai or RunPod (~$100-250/month)
Phase 4 (Scale): Dedicated infrastructure
12. Advanced Topics
When to Fine-Tune vs Use RAG
Use Fine-Tuning for:
• Teaching specific formats or styles
• Personality/tone customization
• Task specialization (classification, extraction)
• Changing communication patterns
Use RAG (Retrieval-Augmented Generation) for:
• Large knowledge bases (books, documentation)
• Frequently updated information
• When accuracy on facts is critical
• Content that's too large to fit in training
Best Approach: Often use both! RAG for knowledge, fine-tuning for style.
Commercial Use & Licensing
Llama 3.x: Commercial use allowed unless your products exceed 700M monthly active users, in which case you must request a license from Meta
Mistral: Apache 2.0, fully commercial
Qwen: Generally commercial-friendly, check specific model
Phi: MIT license, fully commercial
Always check the model card on Hugging Face for license details.
Tool Calling & Agents
Most small fine-tuned models cannot call tools/functions reliably unless:
• Base model was trained for function calling
• Training data includes tool call examples
• Model is 7B+ parameters
However, frameworks like LangChain can orchestrate tool use around your model, with the model providing personality rather than tool logic.
Quantization Levels
GGUF supports different quantization levels:
| Type | Approx. size (relative to F16) | Quality | Use Case |
|---|---|---|---|
| F16 | 100% | Best | Development |
| Q8_0 | ~50% | Excellent | Production |
| Q4_0 | ~25% | Good | Resource-constrained |
| Q2_K | ~12.5% | Basic | Extreme compression |
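These ratios give a quick estimate of how large a converted model will be. The sketch below assumes F16 stores 2 bytes per parameter and applies the relative sizes from the table above. Real GGUF files are somewhat larger because quantized formats also store scaling factors and metadata, so treat the results as lower bounds. The names are illustrative.

```python
# Relative sizes taken from the quantization table above.
RELATIVE_SIZE = {"F16": 1.0, "Q8_0": 0.5, "Q4_0": 0.25, "Q2_K": 0.125}
BYTES_PER_PARAM_F16 = 2  # 16 bits per weight


def estimate_size_gb(n_params: float, quant: str = "F16") -> float:
    """Rough model file size in gigabytes for a given quantization type."""
    return n_params * BYTES_PER_PARAM_F16 * RELATIVE_SIZE[quant] / 1e9


if __name__ == "__main__":
    for quant in RELATIVE_SIZE:
        print(f"3B model at {quant:<5}: ~{estimate_size_gb(3e9, quant):.2f} GB")
```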
Parameter-Efficient Fine-Tuning (PEFT)
LoRA is one PEFT method. Others include:
• DoRA: Improved variant of LoRA
• QLoRA: LoRA with quantized base model
• Adapter layers: Add small modules between layers
• Prefix tuning: Learn prompt-like parameters
MLX-LM supports LoRA and DoRA via the --fine-tune-type flag of mlx_lm.lora.
Appendix: Quick Reference Commands
Complete Workflow
# 1. Setup
python3 -m venv finetune_env
source finetune_env/bin/activate
pip install mlx mlx-lm huggingface-hub sentencepiece gguf
# 2. Login to Hugging Face
huggingface-cli login
# 3. Download model
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --local-dir ./model
# 4. Prepare data (train.jsonl and valid.jsonl)
# 5. Fine-tune
mlx_lm.lora --model ./model --train --data . --batch-size 2 --iters 500 \
--learning-rate 1e-5 --adapter-path ./adapters --grad-checkpoint
# 6. Merge adapter
mlx_lm.fuse --model ./model --adapter-path ./adapters \
--save-path ./model-merged
# 7. Convert to GGUF
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
python convert_hf_to_gguf.py ../model-merged \
--outfile ../custom.gguf --outtype f16
# 8. Import to Ollama
cd ..
ollama create my-model -f Modelfile
ollama run my-model
Useful Resources
• MLX Documentation: https://ml-explore.github.io/mlx/
• Hugging Face: https://huggingface.co
• Ollama: https://ollama.ai
• llama.cpp: https://github.com/ggerganov/llama.cpp
• LangChain: https://python.langchain.com