Fine-tune a small model with QLoRA
Fine-tune a small instruction-tuned model (Llama 3.1 8B or Mistral 7B) for a specific task using QLoRA. You will format a training dataset in the model's chat template format, run supervised fine-tuning with TRL's SFTTrainer, push the LoRA adapter to the Hugging Face Hub, and benchmark the fine-tuned model against the base model on a 20-question eval set.
Why this matters
Fine-tuning is how you move from a model that can do anything to a model that is reliably excellent at one thing. QLoRA makes this trainable on a single consumer GPU by quantizing the base model to 4-bit and training only small adapter matrices. The workflow (dataset format, training loop, eval before and after) is the same whether you are tuning for tone, domain knowledge, or a strict output format.
Before you start
- Python with transformers, peft, trl, bitsandbytes, and datasets installed
- A GPU with at least 10GB VRAM (RTX 3080/4070 or better), or a Colab A100 instance
- A Hugging Face account with a write-access token
- A narrow task in mind: classification, extraction, or following a specific output format all work well
Step-by-step guide
1. Pick a task and prepare your dataset
Choose a task with clear right/wrong answers; sentiment classification or entity extraction works well for a first fine-tune. Collect or generate 200-500 examples. Format each as a chat turn: system message defining the task, user message with the input, assistant message with the correct output. Save as a Hugging Face Dataset.
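As a sketch of the dataset step, here is the three-message structure for a hypothetical sentiment task. `SYSTEM_PROMPT` and `to_chat_example` are illustrative names, not library APIs:

```python
# Hypothetical example task: sentiment classification.
SYSTEM_PROMPT = "Classify the sentiment of the review as positive or negative."

def to_chat_example(text: str, label: str) -> dict:
    # One training example in the "messages" format SFTTrainer understands:
    # a list of role/content dicts forming a single chat turn.
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
            {"role": "assistant", "content": label},
        ]
    }

raw = [
    ("Great film, would watch again.", "positive"),
    ("Two hours I will never get back.", "negative"),
]
examples = [to_chat_example(text, label) for text, label in raw]
```

Calling `datasets.Dataset.from_list(examples)` then turns the list into a Hugging Face Dataset ready for training.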
2. Load the base model in 4-bit
Use BitsAndBytesConfig with load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, and bnb_4bit_quant_type='nf4'. Load the model with from_pretrained and the quantization config. Print the memory footprint before and after; you should see roughly a 4x reduction.
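A minimal loading sketch (the model ID is one of the two suggested above; `bnb_4bit_use_double_quant` is an optional extra not mentioned in the text):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # or "mistralai/Mistral-7B-Instruct-v0.3"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # optional: quantizes the quantization constants too
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
# Compare this against the full-precision footprint to see the ~4x reduction.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```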
3. Configure LoRA with PEFT
Create a LoraConfig with r=16, lora_alpha=32, target_modules pointing to the query and value projection layers (q_proj, v_proj for Llama), lora_dropout=0.05, task_type=CAUSAL_LM. Apply with get_peft_model. Print trainable parameters; they should be under 1% of total parameters.
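The config described above, as a sketch (assumes `model` is the 4-bit model loaded in the previous step):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # query/value projections for Llama-style models
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # expect well under 1% trainable
```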
4. Run the fine-tune
Configure SFTTrainer with your dataset, model, tokenizer, and training args: 3 epochs, per_device_train_batch_size=4, gradient_accumulation_steps=4, learning_rate=2e-4, bf16=True. Run trainer.train(). Monitor loss; it should decrease over the first epoch. A flat or increasing loss usually means the data format is wrong.
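A training sketch under the assumption of a recent TRL release, where hyperparameters go in an `SFTConfig`; older TRL versions pass these directly to `SFTTrainer` and use `tokenizer=` instead of `processing_class=`, so check your installed version. `dataset` is the chat-format dataset from step 1:

```python
from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    output_dir="qlora-finetune",   # hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,              # log often enough to watch the loss curve
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```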
5. Evaluate before and after
Before training, run your 20-question eval against the base model and record scores. After training, load the merged model (base + LoRA adapter) and run the same eval. Calculate the delta. A 20-40% improvement on a narrow task is typical. Under 10% usually means your dataset is too small or too noisy.
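A minimal scoring helper for a task with exact right/wrong answers; `score_eval` is an illustrative name, while `merge_and_unload` is the real PEFT method for producing the merged model:

```python
def score_eval(model_answers, gold_answers):
    """Exact-match accuracy, ignoring case and surrounding whitespace."""
    assert len(model_answers) == len(gold_answers)
    correct = sum(
        pred.strip().lower() == gold.strip().lower()
        for pred, gold in zip(model_answers, gold_answers)
    )
    return correct / len(gold_answers)

# After training, merge the adapter into the base weights before the second
# eval pass (shown as a comment since it needs the trained model):
#   merged_model = trainer.model.merge_and_unload()

base_score = score_eval(["negative", "negative", "positive"],
                        ["positive", "negative", "positive"])
print(f"accuracy: {base_score:.0%}")  # accuracy: 67%
```

Run the same helper on the base and merged models' answers, then compare the two scores to get the delta.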
6. Push and document
Push the LoRA adapter to the Hugging Face Hub with model.push_to_hub. Write a model card noting: base model, task, dataset size, training config, and eval scores. The card is what future-you needs when you come back to this adapter in three months and cannot remember what it does.
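The push itself is two calls; the repo ID below is a hypothetical placeholder, and pushing a PEFT model uploads only the adapter weights (a few megabytes), not the full base model:

```python
repo_id = "your-username/llama31-8b-task-qlora"  # hypothetical repo ID

trainer.model.push_to_hub(repo_id)  # uploads adapter_model.safetensors + adapter_config.json
tokenizer.push_to_hub(repo_id)      # keeps the tokenizer alongside the adapter
```

The generated README on the Hub is where the model card details above should go.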