Fine-Tuning Large Language Models with MLX

Dom, Mon Nov 03 2025 • machine learning mlx apple silicon fine-tuning

1. Introduction to Fine-Tuning

Fine-tuning allows you to customize pre-trained language models to have specific behaviors, personalities, or capabilities. This guide will walk you through the complete process of fine-tuning a language model on Apple Silicon using MLX, Apple's machine learning framework.

What You'll Learn

• How to set up a Python environment for ML on Mac

• Understanding training data formats and requirements

• Using MLX to fine-tune models with LoRA (Low-Rank Adaptation)

• Converting models to formats compatible with Ollama

• Deploying your custom model locally

• Understanding key concepts like loss, epochs, and parameters

Note: This guide focuses on Apple Silicon (M1/M2/M3/M4) Macs with 16GB+ RAM. While the principles apply broadly, specific commands and performance will vary on other hardware.

2. Prerequisites & Setup

System Requirements

• Hardware: Apple Silicon Mac (M1/M2/M3/M4) with at least 16GB RAM

• Operating System: macOS 13.5 or later (MLX does not run on earlier releases; check the MLX installation docs for the current minimum)

• Python: Python 3.9 or later

• Storage: At least 20GB free space for models and tools

Installing Python Dependencies

First, verify you have Python 3 installed:

python3 --version

Create a virtual environment to keep dependencies isolated:

# Create virtual environment
python3 -m venv finetune_env

# Activate it
source finetune_env/bin/activate

Install required packages:

# Install MLX and MLX-LM
pip install mlx mlx-lm

# Install Hugging Face CLI
pip install huggingface-hub

# Install additional tools for conversion
pip install sentencepiece gguf

Setting Up Hugging Face Access

Many models require you to accept their license agreements:

1. Create a Hugging Face account at https://huggingface.co
2. For gated models like Llama, visit the model page and click 'Agree and access repository'
3. Generate an access token at https://huggingface.co/settings/tokens
4. Log in via the CLI:

huggingface-cli login

Paste your token when prompted (it won't show as you type - this is normal).

3. Understanding the Components

MLX Framework

MLX is Apple's machine learning framework optimized for Apple Silicon. It provides:

• Efficient use of unified memory architecture

• Optimized operations for M-series chips

• NumPy-like API for familiarity

• Built-in support for common ML operations

LoRA (Low-Rank Adaptation)

LoRA is a technique that makes fine-tuning efficient by:

• Training only small 'adapter' layers instead of the entire model

• Training only a small fraction of the total parameters (typically 0.3-2%)

• Requiring much less memory and compute

• Allowing quick training (minutes instead of hours/days)

• Enabling easy switching between different adaptations

Analogy: Instead of retraining an entire actor (the base model), LoRA gives them a character sheet for a specific role. The actor's skills remain, but they now know how to play this particular character.
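
To make the "small adapter" idea concrete, here is a minimal sketch of the arithmetic. It assumes the standard LoRA setup in which a frozen weight matrix of shape (d_out x d_in) gets two small trainable matrices of rank r. The layer shape and rank below are hypothetical values chosen for illustration, not figures from any particular model.

```python
def lora_adapter_params(d_out: int, d_in: int, rank: int) -> int:
    """Count the weights in one LoRA adapter for a (d_out x d_in) layer.

    The adapter is two small matrices: A (rank x d_in) and B (d_out x rank).
    """
    return rank * d_in + d_out * rank


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Adapter weights as a fraction of the frozen layer's weights."""
    return lora_adapter_params(d_out, d_in, rank) / (d_out * d_in)


if __name__ == "__main__":
    # Hypothetical layer shape and rank, chosen only for illustration.
    d_out, d_in, rank = 3072, 3072, 8
    print(lora_adapter_params(d_out, d_in, rank), "adapter weights")
    print(f"{trainable_fraction(d_out, d_in, rank):.2%} of the layer")
```

Because adapters are usually attached to only some layers, the share of the whole model that is trainable is often lower than the per-layer figure.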

Model Formats

Format	Description	Used By
Hugging Face (safetensors)	Standard unquantized weight format used on the Hugging Face Hub	Training, MLX
MLX Format	Apple-optimized format for M-series chips	MLX training and inference
GGUF	Quantized format for efficient inference	Ollama, llama.cpp

4. Creating Training Data

Data Format Requirements

Training data must be in JSONL format (JSON Lines), where each line is a complete conversation example:

{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi! How can I help you?"}]}
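
Malformed lines are a common cause of training failures, so it can help to check each line before training. The following is a minimal sketch of such a check. The function name and the set of accepted roles are assumptions made for this example, not part of MLX.

```python
import json

# Roles accepted by this sketch; adjust if your data uses others.
VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ['missing or empty "messages" list']
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        role = msg.get("role")
        if not isinstance(role, str) or role not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
    return problems
```

Run it over every line of your file and fix any line that returns a non-empty list.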

Key Principles

• Quality over quantity: 100 good examples beat 1000 poor ones

• Consistency: Maintain consistent style/format across examples

• Diversity: Cover different scenarios within your use case

• Balanced length: Mix short and longer conversations

• Minimum: At least 100 examples, ideally 300-1000

Example Use Cases

Customer Support Bot: Train on example support conversations with common issues and solutions

Code Review Assistant: Examples of code snippets with constructive feedback

Technical Documentation Writer: Pairs of code/concepts and their explanations

Email Tone Shifter: Casual emails paired with professional versions

Domain Expert: Q&A pairs about specific technical domain

Creating Training Data

You can create training data in several ways:

Method 1: Manual Creation. Write examples yourself based on the desired behavior. Time-consuming, but yields the highest quality.

Method 2: AI-Assisted Generation. Use Claude or GPT-4 to generate examples from your specifications. Review and edit all generated content.

Method 3: Extraction from Existing Content. Convert existing documentation, conversations, or other content into Q&A format.
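
Whichever method you use, the examples end up in the same chat format. As a rough sketch, question/answer pairs could be turned into JSONL like this. The helper names and the sample pair are hypothetical.

```python
import json


def qa_to_record(question: str, answer: str, system: str = "") -> dict:
    """Wrap one question/answer pair in the chat "messages" structure."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": answer})
    return {"messages": messages}


def write_jsonl(records: list, path: str) -> None:
    """Write one JSON object per line, as JSONL requires."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Hypothetical example pair.
    pairs = [("What is LoRA?", "A technique that trains small adapter layers.")]
    write_jsonl([qa_to_record(q, a) for q, a in pairs], "all_data.jsonl")
```

json.dumps escapes any newlines inside the text, so each example stays on a single line.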

Data Splitting

Always split your data into training and validation sets:

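import json
import random

# Read all examples, skipping blank lines
with open('all_data.jsonl', 'r', encoding='utf-8') as f:
    data = [line.strip() for line in f if line.strip()]

# Fail early if any line is not valid JSON
for line in data:
    json.loads(line)

# Shuffle with a fixed seed so the split is reproducible
# and does not depend on the order of the file
random.seed(42)
random.shuffle(data)

# Split 90/10
split_idx = int(len(data) * 0.9)
train_data = data[:split_idx]
valid_data = data[split_idx:]

# Write splits, one JSON object per line
with open('train.jsonl', 'w', encoding='utf-8') as f:
    f.writelines(line + '\n' for line in train_data)

with open('valid.jsonl', 'w', encoding='utf-8') as f:
    f.writelines(line + '\n' for line in valid_data)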

Why split data? The validation set tests if your model can generalize to new examples, not just memorize the training data.
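
The validation set only tests generalization if it does not repeat training examples. One way to check for such overlap is sketched below. The function names are made up for this example, and the comparison only catches exact duplicates (after normalizing JSON formatting), not paraphrases.

```python
import json


def canonical(line: str) -> str:
    """Re-serialize a JSONL line with sorted keys so that spacing or key
    order differences do not hide duplicates."""
    return json.dumps(json.loads(line), sort_keys=True)


def find_leaks(train_lines: list, valid_lines: list) -> list:
    """Return the validation lines that also appear in the training set."""
    train_set = {canonical(line) for line in train_lines if line.strip()}
    return [
        line for line in valid_lines
        if line.strip() and canonical(line) in train_set
    ]
```

If find_leaks returns anything, remove those lines from one of the two files before training.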

5. Downloading Models from Hugging Face

Choosing a Base Model

Select a model based on your needs:

Model	Size	Speed	Quality	License
Llama 3.2 1B	1B params	Very Fast	Basic	Commercial*
Llama 3.2 3B	3B params	Fast	Good	Commercial*
Qwen 2.5 1.5B	1.5B params	Very Fast	Good	Apache 2.0
Mistral 7B	7B params	Moderate	Excellent	Apache 2.0

* Llama: Commercial use is allowed under the Llama Community License unless you exceed 700M monthly active users

Download Process

Use the Hugging Face CLI to download models:

# For Llama 3.2 3B (requires license acceptance)
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct \
  --local-dir ./llama-3.2-3b

# For Qwen 2.5 1.5B (Apache 2.0, no license acceptance step required)
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct \
  --local-dir ./qwen2.5-1.5b

Downloads typically take 5-15 minutes depending on your internet speed. Models are 3-7GB.

Tip: The model name comes from the Hugging Face URL. For example, https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct becomes Qwen/Qwen2.5-1.5B-Instruct.

6. The Fine-Tuning Process

Project Structure

Organize your project directory:

your-project/
├── llama-3.2-3b/      # Downloaded base model
├── train.jsonl        # Training data
├── valid.jsonl        # Validation data
└── adapters/          # Will be created during training

Training Command

Run the fine-tuning with MLX:

mlx_lm.lora \
  --model ./llama-3.2-3b \
  --train \
  --data . \
  --batch-size 2 \
  --iters 500 \
  --learning-rate 1e-5 \
  --steps-per-report 10 \
  --steps-per-eval 100 \
  --adapter-path ./adapters \
  --grad-checkpoint \
  --max-seq-length 512

Parameter Explanations

Parameter	Purpose
--model	Path to downloaded base model
--train	Flag to enable training mode
--data	Directory containing train.jsonl and valid.jsonl
--batch-size	Number of examples processed together (2 is safe for 16GB RAM)
--iters	Number of training steps (500 steps at batch size 2 = 1,000 examples, or ~2.9 epochs for 347 training examples)
--learning-rate	How fast the model learns (1e-5 is standard for fine-tuning)
--steps-per-report	How often to print training loss (every 10 steps)
--steps-per-eval	How often to run validation (every 100 steps)
--adapter-path	Where to save the LoRA adapter weights
--grad-checkpoint	Reduces memory usage (important for 16GB RAM)
--max-seq-length	Maximum conversation length in tokens

What Happens During Training

You'll see output like this (the sample below comes from a model of roughly 1.5B parameters, so your numbers will differ):

Loading pretrained model
Loading datasets
Training
Trainable parameters: 0.342% (5.276M/1543.714M)
Starting training..., iters: 500
Calculating loss...: 100%|████| 17/17
Iter 1: Val loss 6.125, Val took 3.433s
Iter 10: Train loss 4.020, It/sec 1.510
Iter 100: Val loss 2.845
...
Iter 500: Train loss 0.253, Val loss 0.845
Saved adapter weights to adapters/
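
If you save the training output to a file, you can pull the loss values out for plotting or comparison. This is a small sketch that assumes log lines look exactly like the sample above. Other MLX-LM versions may print a different format, in which case the patterns need adjusting.

```python
import re

# Patterns based on the sample output above (an assumption about the format).
TRAIN_RE = re.compile(r"Iter (\d+):.*?Train loss (\d+(?:\.\d+)?)")
VAL_RE = re.compile(r"Iter (\d+):.*?Val loss (\d+(?:\.\d+)?)")


def parse_losses(log_text: str):
    """Return two dicts mapping iteration number to train and validation loss."""
    train, val = {}, {}
    for line in log_text.splitlines():
        match = TRAIN_RE.search(line)
        if match:
            train[int(match.group(1))] = float(match.group(2))
        match = VAL_RE.search(line)
        if match:
            val[int(match.group(1))] = float(match.group(2))
    return train, val
```

Lines that report neither loss, such as the progress bar, are simply skipped.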

Training Time Expectations

Model Size	16GB M1 Pro	32GB M2 Max	Notes
1.5B params	2-4 minutes	2-3 minutes	Very fast training
3B params	3-5 minutes	2-4 minutes	Recommended for learning
7B params	15-30 minutes	10-20 minutes	Requires optimization

7. Converting to GGUF Format

Why GGUF?

GGUF (GPT-Generated Unified Format) is optimized for inference:

• Quantized weights reduce file size and memory usage

• Fast loading and inference

• Compatible with Ollama and llama.cpp

• Can run on CPU efficiently

Step 1: Merge Adapter with Base Model

First, combine the LoRA adapter with the base model:

mlx_lm.fuse \
  --model ./llama-3.2-3b \
  --adapter-path ./adapters \
  --save-path ./model-merged

This creates a complete model with your customizations baked in.

Step 2: Install Conversion Tools

Clone llama.cpp for the conversion scripts:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

Step 3: Convert to GGUF

Run the conversion script:

python convert_hf_to_gguf.py \
  ../model-merged \
  --outfile ../model-custom.gguf \
  --outtype f16

The conversion takes 2-5 minutes. The resulting .gguf file will be similar in size to the original model.

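Common Issue: If you get "ModuleNotFoundError: No module named 'sentencepiece'", run: pip install sentencepiece. If other modules are missing, install llama.cpp's Python dependencies from inside the llama.cpp directory with: pip install -r requirements.txt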

8. Deploying with Ollama

What is Ollama?

Ollama is a tool for running language models locally. It provides:

• Simple model management

• Fast inference on Mac

• REST API for integration

• Easy switching between models

Installing Ollama

Download from https://ollama.ai or install via Homebrew:

brew install ollama

Creating a Modelfile

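The Modelfile tells Ollama how to use your model. The template below uses the Llama 3 chat format; if you fine-tuned a different model family (such as Qwen), use that family's template instead: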

FROM ./model-custom.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9

TEMPLATE """<|begin_of_text|>{{ if .System
}}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt
}}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

"""

SYSTEM You are a helpful assistant.

Save this as 'Modelfile' in your project directory.

Importing to Ollama

Create the model in Ollama:

ollama create my-custom-model -f Modelfile

Running Your Model

Start chatting with your model:

ollama run my-custom-model

Or use the REST API:

curl http://localhost:11434/api/generate -d '{
  "model": "my-custom-model",
  "prompt": "Hello!"
}'
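
The same request can be made from Python using only the standard library. This sketch assumes Ollama is running on its default port and that the model name matches the one you created. Setting "stream" to false asks for one JSON reply instead of a stream of chunks. The function names are made up for this example.

```python
import json
import urllib.request

# Same endpoint as the curl example above.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, url: str = OLLAMA_URL):
    """Build the POST request without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With Ollama running, generate("my-custom-model", "Hello!") returns the model's reply as a string.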

9. Understanding Key Concepts

Loss

What it is: A measure of how wrong the model's predictions are.

• Higher loss = Model is doing poorly

• Lower loss = Model is doing better

• Goal: Watch loss decrease during training

Train Loss measures error on training examples (the data it's learning from). Validation Loss measures error on held-out examples (tests generalization).

Example: If validation loss starts at 6.125 and drops to 0.845, the model is learning patterns that carry over to held-out examples.
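
A loss value is easier to interpret when converted to perplexity, which is roughly the number of equally likely tokens the model is choosing between at each step. The sketch below assumes the reported loss is the average per-token cross-entropy measured in nats. That is the usual convention, but it is an assumption here.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Convert an average cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss_nats)


if __name__ == "__main__":
    # Validation losses from the sample run: start and end of training.
    for loss in (6.125, 0.845):
        print(f"loss {loss} -> perplexity {perplexity(loss):.1f}")
```

Under that assumption, a loss of 6.125 corresponds to a perplexity of about 457, and 0.845 to about 2.3.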

Parameters

Parameters are the numbers that make up the model (its 'brain'). A 3B model has 3 billion parameters.

• More parameters = More capability but slower and more memory

• Fewer parameters = Faster but less capable

• LoRA only trains 0.3-2% of parameters

Epochs

One epoch = one complete pass through all training data.

• 500 iterations at batch size 2 with 347 training examples ≈ 2.9 epochs (≈ 1.4 epochs at batch size 1)

• More epochs = more learning but risk of overfitting

• 1-3 epochs is typical for fine-tuning
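
The relationship between iterations, batch size, and epochs is simple arithmetic. This sketch assumes every iteration consumes exactly batch_size examples. The function names are made up for this example.

```python
import math


def epochs_seen(iters: int, batch_size: int, num_train_examples: int) -> float:
    """How many passes over the training set a run makes."""
    return iters * batch_size / num_train_examples


def iters_for_epochs(target_epochs: float, batch_size: int,
                     num_train_examples: int) -> int:
    """Iterations needed to reach a target number of epochs (rounded up)."""
    return math.ceil(target_epochs * num_train_examples / batch_size)


if __name__ == "__main__":
    for batch_size in (1, 2):
        passes = epochs_seen(500, batch_size, 347)
        print(f"batch size {batch_size}: {passes:.2f} epochs")
```

For 347 training examples, 500 iterations come to about 1.4 epochs at batch size 1 and about 2.9 at batch size 2, so remember to adjust --iters whenever you change --batch-size.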

Batch Size

Number of examples processed together. Larger batches:

• Use more memory

• Can train faster

• May improve training stability

• For 16GB RAM, use batch size 1-2

Learning Rate

How big the training steps are:

• Too high: Model won't converge, loss jumps around

• Too low: Training is very slow

• Standard for fine-tuning: 1e-5 or 2e-5
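
The effect of step size is easy to see on a toy problem. This sketch minimizes f(x) = x² with plain gradient descent. The learning rates are chosen to make the behavior obvious on this toy function and are not values you would use to fine-tune a model.

```python
def gradient_descent(lr: float, steps: int = 50, x0: float = 1.0) -> float:
    """Minimize f(x) = x**2 (gradient 2x) and return the final distance from 0."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # step against the gradient
    return abs(x)


if __name__ == "__main__":
    for lr in (1.1, 0.001, 0.1):
        print(f"lr={lr}: distance from minimum = {gradient_descent(lr):.6g}")
```

With the largest rate each step overshoots and the value grows instead of shrinking, with the smallest it has barely moved after 50 steps, and the middle rate gets close to the minimum.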

Overfitting vs Underfitting

Overfitting: Model memorizes training data but can't generalize

• Train loss very low, val loss high

• Solution: More diverse training data, fewer epochs

Underfitting: Model hasn't learned enough

• Both train and val loss are high

• Solution: Train longer, more parameters, better data
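
A simple way to spot likely overfitting is to find the point where validation loss was lowest and check whether training continued past it. The sketch below is a rough heuristic under that assumption, not a rule built into MLX. The loss values in the usage example are invented for illustration.

```python
def best_validation_step(val_losses: dict):
    """Given {iteration: validation loss}, return (best_iteration, past_best).

    past_best is True when the lowest validation loss occurred before the
    final evaluation, which suggests the model may have started to overfit.
    """
    best_iter = min(val_losses, key=val_losses.get)
    last_iter = max(val_losses)
    return best_iter, best_iter != last_iter


if __name__ == "__main__":
    # Invented loss history: improves until iteration 300, then worsens.
    history = {100: 2.8, 200: 1.2, 300: 0.9, 400: 1.1, 500: 1.4}
    print(best_validation_step(history))
```

If the best step is well before the end, consider training for fewer iterations next time.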

10. Troubleshooting Common Issues

Out of Memory Errors

Symptoms: Training crashes with memory errors

Solutions:

• Reduce batch size to 1

• Use --grad-checkpoint flag

• Reduce --max-seq-length (try 256 or 128)

• Use a smaller base model (1.5B instead of 3B)

• Close other applications

Loss Not Decreasing

Symptoms: Loss stays high or increases

Solutions:

• Check data format (must be valid JSONL)

• Verify data quality (consistent, diverse examples)

• Try higher learning rate (2e-5 instead of 1e-5)

• Train for more iterations

Model Generates Gibberish

Symptoms: Output is incoherent or random

Solutions:

• Check Modelfile template matches model type

• Reduce temperature in Modelfile (try 0.5)

• Verify training actually completed

• Check if base model works before fine-tuning

Conversion Errors

Missing sentencepiece: pip install sentencepiece

Model type not supported: Some models need specific converters

GGUF format errors: Try updating llama.cpp (git pull)

Training Too Slow

Expected speeds on M1 Pro 16GB:

• 1.5B model: ~3 iterations/second

• 3B model: ~1.5 iterations/second

• 7B model: ~0.5 iterations/second

If slower, check Activity Monitor for other processes using CPU/memory.

11. Cost Comparison & Hosting Options

Local Hosting (Your Mac)

Cost: $0/month (electricity only)

Pros: Free, private, full control

Cons: Mac must stay on, limited by upload speed

Best for: Development, testing, personal use

Cloud Hosting

Provider	Cost	Best For
Hugging Face Spaces Free	$0/month (CPU only)	Demos and portfolios
Modal/Replicate (Serverless)	$0.0002 per request	Low/sporadic traffic
Vast.ai (GPU rental)	$100-200/month (24/7)	Consistent traffic
RunPod (Dedicated)	$245/month (RTX 3060)	Reliable production
AWS/GCP/Azure	$200+/month	Enterprise scale
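
To decide between per-request and fixed monthly pricing, it helps to compute the break-even traffic. This sketch uses the table's figures of $0.0002 per request and $100/month as inputs. Real pricing depends on model size, response length, and provider, so treat the result as a rough guide.

```python
def break_even_requests(fixed_monthly_cost: float, cost_per_request: float) -> float:
    """Monthly requests at which serverless cost equals the fixed monthly cost."""
    return fixed_monthly_cost / cost_per_request


def monthly_serverless_cost(requests_per_month: int, cost_per_request: float) -> float:
    """Total serverless bill for a given monthly request volume."""
    return requests_per_month * cost_per_request


if __name__ == "__main__":
    threshold = break_even_requests(100, 0.0002)
    print(f"Break-even at about {threshold:,.0f} requests/month")
```

With those inputs the break-even point is about 500,000 requests per month. Below that, per-request pricing is cheaper.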

Recommended Progression

Phase 1 (MVP): Your Mac + Cloudflare Tunnel ($0)

Phase 2 (Beta): Modal serverless (~$5-20/month)

Phase 3 (Launch): Vast.ai or RunPod (~$100-250/month)

Phase 4 (Scale): Dedicated infrastructure

12. Advanced Topics

When to Fine-Tune vs Use RAG

Use Fine-Tuning for:

• Teaching specific formats or styles

• Personality/tone customization

• Task specialization (classification, extraction)

• Changing communication patterns

Use RAG (Retrieval-Augmented Generation) for:

• Large knowledge bases (books, documentation)

• Frequently updated information

• When accuracy on facts is critical

• Content that's too large to fit in training

Best Approach: Often use both! RAG for knowledge, fine-tuning for style.

Commercial Use & Licensing

Llama 3.x: Commercial use is allowed under the Llama Community License unless you exceed 700M monthly active users

Mistral: Apache 2.0, fully commercial

Qwen: Generally commercial-friendly, check specific model

Phi: MIT license, fully commercial

Always check the model card on Hugging Face for license details.

Tool Calling & Agents

Most small fine-tuned models cannot call tools/functions reliably unless:

• Base model was trained for function calling

• Training data includes tool call examples

• Model is 7B+ parameters

However, frameworks like LangChain can orchestrate tool use around your model, with the model providing personality rather than tool logic.

Quantization Levels

GGUF supports different quantization levels:

Type	Size (approx., relative to F16)	Quality	Use Case
F16	100%	Best	Development
Q8_0	50%	Excellent	Production
Q4_0	25%	Good	Resource-constrained
Q2_K	12.5%	Basic	Extreme compression
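
The table's percentages can be turned into rough file-size estimates. This sketch assumes 2 bytes per parameter at F16 and applies the table's size fractions directly. Real GGUF files are somewhat larger because quantized blocks also store scaling data and the file carries metadata.

```python
# Size fractions taken from the table above (approximate).
SIZE_FRACTION = {"F16": 1.0, "Q8_0": 0.5, "Q4_0": 0.25, "Q2_K": 0.125}


def estimated_size_gb(num_params: float, quant: str) -> float:
    """Rough file size in GB for a model with num_params parameters."""
    f16_bytes = num_params * 2  # 16-bit weights take 2 bytes each
    return f16_bytes * SIZE_FRACTION[quant] / 1e9


if __name__ == "__main__":
    for quant in SIZE_FRACTION:
        print(f"3B model at {quant}: ~{estimated_size_gb(3e9, quant):.2f} GB")
```

By this estimate, a 3B model is about 6 GB at F16 and about 1.5 GB at Q4_0.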

Parameter-Efficient Fine-Tuning (PEFT)

LoRA is one PEFT method. Others include:

• DoRA: Improved variant of LoRA

• QLoRA: LoRA with quantized base model

• Adapter layers: Add small modules between layers

• Prefix tuning: Learn prompt-like parameters

MLX-LM supports LoRA and DoRA via the --fine-tune-type flag of mlx_lm.lora.

Appendix: Quick Reference Commands

Complete Workflow

# 1. Setup
python3 -m venv finetune_env
source finetune_env/bin/activate
pip install mlx mlx-lm huggingface-hub sentencepiece gguf

# 2. Login to Hugging Face
huggingface-cli login

# 3. Download model
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --local-dir ./model

# 4. Prepare data (train.jsonl and valid.jsonl)

# 5. Fine-tune
mlx_lm.lora --model ./model --train --data . --batch-size 2 --iters 500 \
  --learning-rate 1e-5 --adapter-path ./adapters --grad-checkpoint

# 6. Merge adapter
mlx_lm.fuse --model ./model --adapter-path ./adapters \
  --save-path ./model-merged

# 7. Convert to GGUF
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
python convert_hf_to_gguf.py ../model-merged \
  --outfile ../custom.gguf --outtype f16

# 8. Import to Ollama
cd ..
ollama create my-model -f Modelfile
ollama run my-model

Useful Resources

• MLX Documentation: https://ml-explore.github.io/mlx/

• Hugging Face: https://huggingface.co

• Ollama: https://ollama.ai

• llama.cpp: https://github.com/ggerganov/llama.cpp

• LangChain: https://python.langchain.com
