Fine-Tuning Large Language Models with MLX

Table of Contents

  1. Introduction to Fine-Tuning
  2. Prerequisites & Setup
  3. Understanding the Components
  4. Creating Training Data
  5. Downloading Models from Hugging Face
  6. The Fine-Tuning Process
  7. Converting to GGUF Format
  8. Deploying with Ollama
  9. Understanding Key Concepts
  10. Troubleshooting Common Issues
  11. Cost Comparison & Hosting Options
  12. Advanced Topics

1. Introduction to Fine-Tuning

Fine-tuning allows you to customize pre-trained language models to have specific behaviors, personalities, or capabilities. This guide will walk you through the complete process of fine-tuning a language model on Apple Silicon using MLX, Apple's machine learning framework.

What You'll Learn

• How to set up a Python environment for ML on Mac

• Understanding training data formats and requirements

• Using MLX to fine-tune models with LoRA (Low-Rank Adaptation)

• Converting models to formats compatible with Ollama

• Deploying your custom model locally

• Understanding key concepts like loss, epochs, and parameters

Note: This guide focuses on Apple Silicon (M1/M2/M3/M4) Macs with 16GB+ RAM. While the principles apply broadly, specific commands and performance will vary on other hardware.

2. Prerequisites & Setup

System Requirements

Hardware: Apple Silicon Mac (M1/M2/M3/M4) with at least 16GB RAM

Operating System: macOS 13.5 or later (macOS 14 or newer is recommended for MLX)

Python: Python 3.9 or later

Storage: At least 20GB free space for models and tools

Installing Python Dependencies

First, verify you have Python 3 installed:

python3 --version

Create a virtual environment to keep dependencies isolated:

# Create virtual environment
python3 -m venv finetune_env

# Activate it
source finetune_env/bin/activate

Install required packages:

# Install MLX and MLX-LM
pip install mlx mlx-lm

# Install Hugging Face CLI
pip install huggingface-hub

# Install additional tools for conversion
pip install sentencepiece gguf

Setting Up Hugging Face Access

Many models require you to accept their license agreements:

  1. Create a Hugging Face account at https://huggingface.co

  2. For models like Llama, visit the model page and click 'Agree and access repository'

  3. Generate an access token at https://huggingface.co/settings/tokens

  4. Login via CLI:

huggingface-cli login

Paste your token when prompted (it won't show as you type - this is normal).

3. Understanding the Components

MLX Framework

MLX is Apple's machine learning framework optimized for Apple Silicon. It provides:

• Efficient use of unified memory architecture

• Optimized operations for M-series chips

• NumPy-like API for familiarity

• Built-in support for common ML operations

LoRA (Low-Rank Adaptation)

LoRA is a technique that makes fine-tuning efficient by:

• Training only small 'adapter' layers instead of the entire model

• Training only a small fraction of the parameters, typically 0.3-2% of the total

• Requiring much less memory and compute

• Allowing quick training (minutes instead of hours/days)

• Enabling easy switching between different adaptations

Analogy: Instead of retraining an entire actor (the base model), LoRA gives them a character sheet for a specific role. The actor's skills remain, but they now know how to play this particular character.
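To make the parameter savings concrete, here is a minimal sketch of the arithmetic behind one LoRA adapter. The layer size (2048 x 2048) and rank (8) are hypothetical values chosen only for illustration; they are not taken from any specific model or from MLX's defaults.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a d_out x d_in weight matrix.

    LoRA learns two small matrices, A (rank x d_in) and B (d_out x rank),
    instead of updating the full d_out x d_in matrix.
    """
    return rank * d_in + d_out * rank


# Hypothetical layer size and rank, for illustration only.
d_in, d_out, rank = 2048, 2048, 8
full = d_in * d_out
adapter = lora_trainable_params(d_in, d_out, rank)

print(f"full matrix:  {full:,} params")
print(f"LoRA adapter: {adapter:,} params ({adapter / full:.2%} of the matrix)")
```

Because adapters are usually attached to only some of a model's layers, the share of trainable parameters across the whole model can be lower still.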

Model Formats

Format | Description | Used By
Hugging Face (safetensors) | Standard Hugging Face weight format, unquantized weights | Training, MLX
MLX Format | Apple-optimized format for M-series chips | MLX training and inference
GGUF | Format for efficient inference, usually quantized | Ollama, llama.cpp

4. Creating Training Data

Data Format Requirements

Training data must be in JSONL format (JSON Lines), where each line is a complete conversation example:

{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi! How can I help you?"}]}
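Before training, it can help to check every line against this format. The following is a minimal validation sketch, not part of MLX: the function name validate_chat_line and the accepted role set (system, user, assistant) are assumptions made for this example.

```python
import json

# Assumed role names; adjust if your data uses others.
VALID_ROLES = {"system", "user", "assistant"}


def validate_chat_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    return problems


example = '{"messages": [{"role": "user", "content": "Hello!"}, {"role": "assistant", "content": "Hi! How can I help you?"}]}'
print(validate_chat_line(example))
```

Running this over each line of your data file and printing any non-empty result is usually enough to catch formatting mistakes before they surface as training errors.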

Key Principles

Quality over quantity: 100 good examples beat 1000 poor ones

Consistency: Maintain consistent style/format across examples

Diversity: Cover different scenarios within your use case

Balanced length: Mix short and longer conversations

Minimum: At least 100 examples, ideally 300-1000

Example Use Cases

Customer Support Bot: Train on example support conversations with common issues and solutions

Code Review Assistant: Examples of code snippets with constructive feedback

Technical Documentation Writer: Pairs of code/concepts and their explanations

Email Tone Shifter: Casual emails paired with professional versions

Domain Expert: Q&A pairs about specific technical domain

Creating Training Data

You can create training data in several ways:

Method 1: Manual Creation. Write examples yourself based on desired behavior. Time-consuming but highest quality.

Method 2: AI-Assisted Generation. Use Claude or GPT-4 to generate examples based on your specifications. Review and edit all generated content.

Method 3: Extraction from Existing Content. Convert existing documentation, conversations, or content into Q&A format.

Data Splitting

Always split your data into training and validation sets:

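import json
import random

# Read all examples, skipping blank lines
with open('all_data.jsonl', 'r', encoding='utf-8') as f:
    data = [line.strip() for line in f if line.strip()]

# Fail early if any line is not valid JSON
for line in data:
    json.loads(line)

# Shuffle with a fixed seed so the split is random but reproducible
random.seed(42)
random.shuffle(data)

# Split 90/10
split_idx = int(len(data) * 0.9)
train_data = data[:split_idx]
valid_data = data[split_idx:]

# Write splits, one JSON object per line
with open('train.jsonl', 'w', encoding='utf-8') as f:
    f.write('\n'.join(train_data) + '\n')

with open('valid.jsonl', 'w', encoding='utf-8') as f:
    f.write('\n'.join(valid_data) + '\n')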

Why split data? The validation set tests if your model can generalize to new examples, not just memorize the training data.

5. Downloading Models from Hugging Face

Choosing a Base Model

Select a model based on your needs:

Model | Size | Speed | Quality | License
Llama 3.2 1B | 1B params | Very Fast | Basic | Commercial*
Llama 3.2 3B | 3B params | Fast | Good | Commercial*
Qwen 2.5 1.5B | 1.5B params | Very Fast | Good | Apache 2.0
Mistral 7B | 7B params | Moderate | Excellent | Apache 2.0

* Llama: Commercial use allowed for products with fewer than 700M monthly active users

Download Process

Use the Hugging Face CLI to download models:

# For Llama 3.2 3B (requires license acceptance)
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct \
  --local-dir ./llama-3.2-3b

# For Qwen 2.5 1.5B (Apache 2.0, no license acceptance step required)
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct \
  --local-dir ./qwen2.5-1.5b

Downloads typically take 5-15 minutes depending on your internet speed. The two models above are roughly 3-7GB each.

Tip: The model name comes from the Hugging Face URL. For example, https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct becomes Qwen/Qwen2.5-1.5B-Instruct
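The URL-to-name rule in the tip can be sketched as a small helper. The function name repo_id_from_url is hypothetical and exists only for this example; it simply takes the first two path segments of a Hugging Face model URL.

```python
from urllib.parse import urlparse


def repo_id_from_url(url: str) -> str:
    """Extract '<owner>/<model>' from a Hugging Face model page URL."""
    # Keep only non-empty path segments, e.g. ['Qwen', 'Qwen2.5-1.5B-Instruct']
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected https://huggingface.co/<owner>/<model>, got {url!r}")
    return f"{parts[0]}/{parts[1]}"


print(repo_id_from_url("https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct"))
```

The returned string is what you pass to huggingface-cli download as the model name.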

6. The Fine-Tuning Process

Project Structure

Organize your project directory:

your-project/
├── llama-3.2-3b/      # Downloaded base model
├── train.jsonl        # Training data
├── valid.jsonl        # Validation data
└── adapters/          # Will be created during training

Training Command

Run the fine-tuning with MLX:

mlx_lm.lora \
  --model ./llama-3.2-3b \
  --train \
  --data . \
  --batch-size 2 \
  --iters 500 \
  --learning-rate 1e-5 \
  --steps-per-report 10 \
  --steps-per-eval 100 \
  --adapter-path ./adapters \
  --grad-checkpoint \
  --max-seq-length 512

Parameter Explanations

Parameter | Purpose
--model | Path to downloaded base model
--train | Flag to enable training mode
--data | Directory containing train.jsonl and valid.jsonl
--batch-size | Number of examples processed together (2 is safe for 16GB RAM)
--iters | Number of training steps (500 steps at batch size 2 = 1,000 examples, about 2.9 epochs for 347 examples)
--learning-rate | How fast the model learns (1e-5 is standard for fine-tuning)
--steps-per-report | How often to print training loss (every 10 steps)
--steps-per-eval | How often to run validation (every 100 steps)
--adapter-path | Where to save the LoRA adapter weights
--grad-checkpoint | Reduces memory usage (important for 16GB RAM)
--max-seq-length | Maximum conversation length in tokens

What Happens During Training

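You'll see output like this (the sample comes from a run on a roughly 1.5B-parameter model, as the parameter count in it shows; your numbers will differ):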

Loading pretrained model
Loading datasets
Training
Trainable parameters: 0.342% (5.276M/1543.714M)
Starting training..., iters: 500
Calculating loss...: 100%|████| 17/17
Iter 1: Val loss 6.125, Val took 3.433s
Iter 10: Train loss 4.020, It/sec 1.510
Iter 100: Val loss 2.845
...
Iter 500: Train loss 0.253, Val loss 0.845
Saved adapter weights to adapters/
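If you save this output to a file, a short script can pull out the loss values for plotting or comparison. This is a rough sketch that assumes log lines look exactly like the sample above ("Iter N: Train loss X" and "Iter N: Val loss Y"); the real output may differ between MLX-LM versions, and the function name parse_losses is made up for this example.

```python
import re

# Patterns assume the line format shown in the sample output above.
ITER_RE = re.compile(r"^Iter (\d+):")
LOSS_RE = re.compile(r"(Train|Val) loss (\d+(?:\.\d+)?)")


def parse_losses(log_text: str) -> dict[str, list[tuple[int, float]]]:
    """Collect (iteration, loss) pairs for train and validation loss."""
    losses = {"train": [], "val": []}
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = ITER_RE.match(line)
        if not match:
            continue  # skip progress bars and status messages
        step = int(match.group(1))
        for kind, value in LOSS_RE.findall(line):
            losses[kind.lower()].append((step, float(value)))
    return losses


SAMPLE_LOG = """\
Iter 1: Val loss 6.125, Val took 3.433s
Iter 10: Train loss 4.020, It/sec 1.510
Iter 100: Val loss 2.845
Iter 500: Train loss 0.253, Val loss 0.845
"""

result = parse_losses(SAMPLE_LOG)
print("train:", result["train"])
print("val:  ", result["val"])
```

Comparing the last train and validation values from such a parse is a quick way to spot a widening gap between the two.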

Training Time Expectations

Model Size | 16GB M1 Pro | 32GB M2 Max | Notes
1.5B params | 2-4 minutes | 2-3 minutes | Very fast training
3B params | 3-5 minutes | 2-4 minutes | Recommended for learning
7B params | 15-30 minutes | 10-20 minutes | Requires optimization

7. Converting to GGUF Format

Why GGUF?

GGUF (GPT-Generated Unified Format) is optimized for inference:

• Quantized weights reduce file size and memory usage

• Fast loading and inference

• Compatible with Ollama and llama.cpp

• Can run on CPU efficiently

Step 1: Merge Adapter with Base Model

First, combine the LoRA adapter with the base model:

mlx_lm.fuse \
  --model ./llama-3.2-3b \
  --adapter-path ./adapters \
  --save-path ./model-merged

This creates a complete model with your customizations baked in.

Step 2: Install Conversion Tools

Clone llama.cpp for the conversion scripts:

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

Step 3: Convert to GGUF

Run the conversion script:

python convert_hf_to_gguf.py \
  ../model-merged \
  --outfile ../model-custom.gguf \
  --outtype f16

The conversion takes 2-5 minutes. The resulting .gguf file will be similar in size to the original model.

Common Issue: If you get 'ModuleNotFoundError: No module named sentencepiece', run: pip install sentencepiece

8. Deploying with Ollama

What is Ollama?

Ollama is a tool for running language models locally. It provides:

• Simple model management

• Fast inference on Mac

• REST API for integration

• Easy switching between models

Installing Ollama

Download from https://ollama.ai or install via Homebrew:

brew install ollama

Creating a Modelfile

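The Modelfile tells Ollama how to use your model. The TEMPLATE below uses the Llama 3 chat format; if you fine-tuned a different model family (for example Qwen), replace it with that model's chat template: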

FROM ./model-custom.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9

TEMPLATE """<|begin_of_text|>{{ if .System
}}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt
}}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

"""

SYSTEM You are a helpful assistant.

Save this as 'Modelfile' in your project directory.
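To see what the TEMPLATE above actually produces, the following sketch rebuilds the same string in Python. It mirrors only the Llama 3 template shown in this Modelfile; the function name render_llama3_prompt is hypothetical, and Ollama does this rendering itself, so the code is purely for inspection.

```python
def render_llama3_prompt(prompt: str, system: str = "") -> str:
    """Build the text the Modelfile TEMPLATE above would send to the model."""
    parts = ["<|begin_of_text|>"]
    if system:  # corresponds to {{ if .System }} ... {{ end }}
        parts.append(
            f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        )
    if prompt:  # corresponds to {{ if .Prompt }} ... {{ end }}
        parts.append(
            f"<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>"
        )
    # The template always ends by opening the assistant turn.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


print(render_llama3_prompt("Hello!", system="You are a helpful assistant."))
```

If the special tokens printed here do not match the ones your base model was trained with, the template needs to change.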

Importing to Ollama

Create the model in Ollama:

ollama create my-custom-model -f Modelfile

Running Your Model

Start chatting with your model:

ollama run my-custom-model

Or use the REST API:

curl http://localhost:11434/api/generate -d '{
  "model": "my-custom-model",
  "prompt": "Hello!"
}'
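The same request can be made from Python using only the standard library. This is a minimal sketch that assumes Ollama is running on its default port, as in the curl example; the helper names build_generate_request and generate are made up here. Setting "stream" to false asks Ollama for a single JSON reply instead of a stream of chunks.

```python
import json
import urllib.request

# Same endpoint as the curl example above.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (requires a running Ollama)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Usage, once `ollama run my-custom-model` works on your machine:
#   print(generate("my-custom-model", "Hello!"))
```

Keeping request construction separate from sending makes it easy to inspect the payload without a server running.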

9. Understanding Key Concepts

Loss

What it is: A measure of how wrong the model's predictions are.

Higher loss = Model is doing poorly

Lower loss = Model is doing better

Goal: Watch loss decrease during training

Train Loss measures error on training examples (the data it's learning from). Validation Loss measures error on held-out examples (tests generalization).

Example: If val loss starts at 6.125 and drops to 0.845, your model learned successfully!

Parameters

Parameters are the numbers that make up the model (its 'brain'). A 3B model has 3 billion parameters.

More parameters = More capability but slower and more memory

Fewer parameters = Faster but less capable

LoRA only trains 0.3-2% of parameters

Epochs

One epoch = one complete pass through all training data.

• 500 iterations at batch size 2 process 1,000 examples; with 347 examples that is ≈ 2.9 epochs

• More epochs = more learning but risk of overfitting

• 1-3 epochs is typical for fine-tuning
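Each iteration processes one batch, so the epoch count depends on batch size as well as on the number of iterations. A small sketch of that arithmetic, using the 347-example figure from this guide (the function name epochs_from_iters is an assumption for illustration):

```python
def epochs_from_iters(iters: int, batch_size: int, num_examples: int) -> float:
    """Number of passes over the training set for a given iteration count."""
    if num_examples <= 0:
        raise ValueError("num_examples must be positive")
    return iters * batch_size / num_examples


# 500 iterations over 347 examples at two batch sizes
print(round(epochs_from_iters(500, 2, 347), 2))  # 2.88
print(round(epochs_from_iters(500, 1, 347), 2))  # 1.44
```

To target a specific number of epochs, rearrange the same formula: iters = epochs * num_examples / batch_size.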

Batch Size

Number of examples processed together. Larger batches:

• Use more memory

• Can train faster

• May improve training stability

• For 16GB RAM, use batch size 1-2

Learning Rate

How big the training steps are:

Too high: Model won't converge, loss jumps around

Too low: Training is very slow

Standard for fine-tuning: 1e-5 or 2e-5

Overfitting vs Underfitting

Overfitting: Model memorizes training data but can't generalize

• Train loss very low, val loss high

• Solution: More diverse training data, fewer epochs

Underfitting: Model hasn't learned enough

• Both train and val loss are high

• Solution: Train longer, more parameters, better data

10. Troubleshooting Common Issues

Out of Memory Errors

Symptoms: Training crashes with memory errors

Solutions:

• Reduce batch size to 1

• Use --grad-checkpoint flag

• Reduce --max-seq-length (try 256 or 128)

• Use a smaller base model (1.5B instead of 3B)

• Close other applications

Loss Not Decreasing

Symptoms: Loss stays high or increases

Solutions:

• Check data format (must be valid JSONL)

• Verify data quality (consistent, diverse examples)

• Try higher learning rate (2e-5 instead of 1e-5)

• Train for more iterations

Model Generates Gibberish

Symptoms: Output is incoherent or random

Solutions:

• Check Modelfile template matches model type

• Reduce temperature in Modelfile (try 0.5)

• Verify training actually completed

• Check if base model works before fine-tuning

Conversion Errors

Missing sentencepiece: pip install sentencepiece

Model type not supported: Some models need specific converters

GGUF format errors: Try updating llama.cpp (git pull)

Training Too Slow

Expected speeds on M1 Pro 16GB:

• 1.5B model: ~3 iterations/second

• 3B model: ~1.5 iterations/second

• 7B model: ~0.5 iterations/second

If slower, check Activity Monitor for other processes using CPU/memory.
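The expected speeds above translate into rough wall-clock estimates. The sketch below divides the iteration count by iterations per second; it ignores model loading and validation passes, so treat the results as lower bounds. The function name estimated_minutes is hypothetical.

```python
def estimated_minutes(iters: int, iters_per_second: float) -> float:
    """Rough training time in minutes, excluding loading and validation."""
    if iters_per_second <= 0:
        raise ValueError("iters_per_second must be positive")
    return iters / iters_per_second / 60


# 500 iterations at the speeds listed above for an M1 Pro 16GB
for label, speed in [("1.5B", 3.0), ("3B", 1.5), ("7B", 0.5)]:
    print(f"{label} model: ~{estimated_minutes(500, speed):.1f} minutes")
```

If your measured It/sec in the training log is far below these figures, the estimate gives a quick sense of how much longer the run will take.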

11. Cost Comparison & Hosting Options

Local Hosting (Your Mac)

Cost: $0/month (electricity only)

Pros: Free, private, full control

Cons: Mac must stay on, limited by upload speed

Best for: Development, testing, personal use

Cloud Hosting

Provider | Cost | Best For
Hugging Face Spaces (Free) | $0/month (CPU only) | Demos and portfolios
Modal/Replicate (Serverless) | $0.0002 per request | Low/sporadic traffic
Vast.ai (GPU rental) | $100-200/month (24/7) | Consistent traffic
RunPod (Dedicated) | $245/month (RTX 3060) | Reliable production
AWS/GCP/Azure | $200+/month | Enterprise scale
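One way to choose between per-request and fixed monthly pricing is to compute the break-even request volume. The sketch below uses the figures from the table ($0.0002 per request versus $100 and $245 per month); real prices change often, and the function name break_even_requests is made up for this example.

```python
def break_even_requests(monthly_fixed_cost: float, cost_per_request: float) -> int:
    """Monthly request count at which serverless cost equals a fixed monthly fee."""
    if cost_per_request <= 0:
        raise ValueError("cost_per_request must be positive")
    # round() avoids off-by-one results from floating-point division
    return round(monthly_fixed_cost / cost_per_request)


print(f"{break_even_requests(100, 0.0002):,} requests/month")  # vs. $100/month rental
print(f"{break_even_requests(245, 0.0002):,} requests/month")  # vs. $245/month dedicated
```

Below the break-even volume, serverless is the cheaper option under these assumed prices; above it, a fixed monthly GPU becomes more economical.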

Recommended Progression

Phase 1 (MVP): Your Mac + Cloudflare Tunnel ($0)

Phase 2 (Beta): Modal serverless (~$5-20/month)

Phase 3 (Launch): Vast.ai or RunPod (~$100-250/month)

Phase 4 (Scale): Dedicated infrastructure

12. Advanced Topics

When to Fine-Tune vs Use RAG

Use Fine-Tuning for:

• Teaching specific formats or styles

• Personality/tone customization

• Task specialization (classification, extraction)

• Changing communication patterns

Use RAG (Retrieval-Augmented Generation) for:

• Large knowledge bases (books, documentation)

• Frequently updated information

• When accuracy on facts is critical

• Content that's too large to fit in training

Best Approach: Often use both! RAG for knowledge, fine-tuning for style.

Commercial Use & Licensing

Llama 3.x: Commercial use allowed for products with fewer than 700M monthly active users

Mistral: Apache 2.0, fully commercial

Qwen: Generally commercial-friendly, check specific model

Phi: MIT license, fully commercial

Always check the model card on Hugging Face for license details.

Tool Calling & Agents

Most small fine-tuned models cannot call tools/functions reliably unless:

• Base model was trained for function calling

• Training data includes tool call examples

• Model is 7B+ parameters

However, frameworks like LangChain can orchestrate tool use around your model, with the model providing personality rather than tool logic.

Quantization Levels

GGUF supports different quantization levels:

Type | Approx. size (relative to F16) | Quality | Use Case
F16 | 100% | Best | Development
Q8_0 | ~50% | Excellent | Production
Q4_0 | ~25% | Good | Resource-constrained
Q2_K | ~12.5% | Basic | Extreme compression
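The relative sizes in the table follow from the number of bits stored per weight. The sketch below estimates file size from a parameter count and a nominal bit width (16, 8, 4, 2). These are idealized values: real GGUF quantization types also store scaling data alongside the weights, so actual files come out somewhat larger. The function name estimated_size_gb is an assumption for this example.

```python
def estimated_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate model file size in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal bit widths for a 3B-parameter model; real quantized files are a bit larger.
for name, bits in [("F16", 16), ("Q8_0", 8), ("Q4_0", 4), ("Q2_K", 2)]:
    print(f"{name}: ~{estimated_size_gb(3e9, bits):.2f} GB")
```

Comparing the estimate with your available RAM gives a first indication of which quantization level is practical on your machine.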

Parameter-Efficient Fine-Tuning (PEFT)

LoRA is one PEFT method. Others include:

DoRA: Improved variant of LoRA

QLoRA: LoRA with quantized base model

Adapter layers: Add small modules between layers

Prefix tuning: Learn prompt-like parameters

MLX-LM supports LoRA and DoRA via the --fine-tune-type flag of mlx_lm.lora.

Appendix: Quick Reference Commands

Complete Workflow

# 1. Setup
python3 -m venv finetune_env
source finetune_env/bin/activate
pip install mlx mlx-lm huggingface-hub sentencepiece gguf

# 2. Login to Hugging Face
huggingface-cli login

# 3. Download model
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct --local-dir ./model

# 4. Prepare data (train.jsonl and valid.jsonl)

# 5. Fine-tune
mlx_lm.lora --model ./model --train --data . --batch-size 2 --iters 500 \
  --learning-rate 1e-5 --adapter-path ./adapters --grad-checkpoint

# 6. Merge adapter
mlx_lm.fuse --model ./model --adapter-path ./adapters \
  --save-path ./model-merged

# 7. Convert to GGUF
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
python convert_hf_to_gguf.py ../model-merged \
  --outfile ../model-custom.gguf --outtype f16

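# 8. Import to Ollama
# (the FROM line in Modelfile must point at the .gguf file from step 7,
#  and its TEMPLATE must match your base model's chat format)
cd ..
ollama create my-model -f Modelfile
ollama run my-model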

Useful Resources

MLX Documentation: https://ml-explore.github.io/mlx/

Hugging Face: https://huggingface.co

Ollama: https://ollama.ai

llama.cpp: https://github.com/ggerganov/llama.cpp

LangChain: https://python.langchain.com

2025 © Daumantas Pyragas