Parameter-Efficient Fine-Tuning (PEFT) with LoRA [Hands-On Guide]
Fine-tune LLMs on a single GPU using PEFT and LoRA to save memory, ship MB-sized adapters, evaluate outputs confidently, and keep your data private.
Fine-tuning a large language model can really transform how you use AI for specific tasks. And here's what I've discovered: it's actually way more cost-effective than you might think, especially when you focus on fine-tuning just a smaller, task-specific piece instead of wrestling with the whole massive model. But let me tell you what nobody really talks about. Full fine-tuning of these large models? It needs some serious computational muscle. We're talking about handling all those weights, optimizer states, gradients, and activations. The memory requirements pile up so fast that most regular hardware just can't keep up.
This is where Parameter-Efficient Fine-Tuning (PEFT) comes in, and honestly, it's been a game-changer for me. Instead of updating everything, PEFT only touches a tiny portion of the model's parameters. The memory savings are huge, but you still get performance that's basically just as good. I've been able to fine-tune large models on just a single GPU, which would have been impossible with traditional methods.

Let me walk you through how PEFT actually works and show you a real example from my own experiments. Once you see this in action, you'll understand why this technique has made fine-tuning so much more practical.
Why PEFT?
Okay, before we dive into the actual code, I want to explain why PEFT matters so much. First, we need to look at what makes full fine-tuning such a pain, then I'll show you how PEFT fixes these problems.
Limitations of Full Fine-Tuning
High Memory and Storage Requirements: When you do full fine-tuning, you're updating every single parameter. I'm talking hundreds of gigabytes just for the model weights. But that's not even the worst part. You also need memory for optimizer states, gradients, and activations. And every time you create a version for a different task? That's more storage gone.
Risk of Catastrophic Forgetting: This one really frustrated me when I first encountered it. You fine-tune your model for a new task, and suddenly it forgets what it learned before. It's especially bad when you're trying to handle multiple tasks at once. The model's flexibility just disappears.
High Deployment Costs: Every task needs its own complete model after fine-tuning. The storage costs alone are enough to make you think twice, not to mention the deployment headaches. Maintaining multiple large models for different tasks? Good luck with that without serious resources.
How PEFT Overcomes These Challenges
Reduced Memory Footprint: This is where PEFT really shines. By updating just a small fraction of parameters, memory usage drops dramatically. The weights you end up with are tiny, often just a few megabytes. I've run PEFT on a single GPU that would have choked on full fine-tuning.
Efficient Multitasking: Here's something cool: PEFT creates task-specific weights that you can swap in and out during inference. One model, multiple tasks. No need to duplicate the entire thing for each use case.
Lower Risk of Catastrophic Forgetting: Since most of the original parameters stay frozen, the model keeps its general knowledge intact. It's much better at remembering what it learned across different tasks.
On-Premise and Private Deployments: For projects where data privacy is critical, PEFT's lightweight nature is perfect. You can keep everything on local servers without needing external APIs. Your data stays yours.
When to Choose PEFT Over Full Fine-Tuning
Prompting Limitations: Sometimes prompting just doesn't cut it. When you've tried everything and still can't get the performance you need, PEFT gives you a more targeted way to adapt the model.
Model Size and Resource Constraints: Let's be real. Most of us don't have access to massive GPU clusters. Full fine-tuning on large models is often impossible with consumer hardware. PEFT makes it doable.
Data Privacy: When you need to keep your data in-house, PEFT lets you customize models without sending anything to external servers. This is huge for privacy-sensitive applications.
Actually, before jumping into PEFT, you might want to try some advanced prompt engineering strategies for LLM APIs. I've found that sometimes you can get surprisingly good results just by being clever with your prompts. It's worth trying before you commit to fine-tuning.
Bottom line: PEFT gives you flexibility and cost-effectiveness, especially when you're dealing with on-premise deployment, privacy concerns, or just don't have access to a supercomputer.
PEFT Methods Overview
So PEFT isn't just one technique. It's actually a collection of different methods for fine-tuning large language models by tweaking only a small subset of parameters. Each approach has its sweet spot depending on what you're trying to do and what resources you have. Let me break down the main ones.
By the way, PEFT methods work really well with retrieval augmented generation workflows. When you're adapting models for domain-specific retrieval tasks, efficient fine-tuning makes a huge difference. If you're interested in building these kinds of pipelines, check out this guide on retrieval-augmented generation (RAG) workflows.
Selective Methods
These methods are pretty straightforward. You pick specific parts of the model to adapt, like certain layers or parameter types, and leave everything else alone. I've found this works great when you know exactly which layers matter for your task. Though honestly, if your task needs more comprehensive changes, this approach might fall short.
How it works:
You target specific components, maybe just the attention layers or the feed-forward networks.
Only selected parameter types get adjusted, which gives you a nice balance between task performance and efficiency.
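To make that concrete, here's a minimal sketch in plain PyTorch (not any particular selective-PEFT library API): freeze everything, then re-enable gradients only for the attention query and value projections of the FLAN-T5 model we'll use later in this post.
from transformers import AutoModelForSeq2SeqLM
# Load the base model used later in this post
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
# Freeze every parameter first
for param in model.parameters():
    param.requires_grad = False
# Re-enable training only for the attention query and value projections
for name, param in model.named_parameters():
    if ".q." in name or ".v." in name:
        param.requires_grad = True
# Count what is left trainable
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Selectively trainable parameters: {trainable:,}")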
Reparameterization Methods
This is where things get interesting. Methods like LoRA (Low-Rank Adaptation) are perfect when you need to keep memory and computational costs down. The idea is brilliant: use small, low-rank matrices for fine-tuning while the main model structure stays untouched.
How it works:
The main model weights stay frozen. You add these small rank-decomposition matrices on top.
During training, you only update these tiny matrices. The memory and compute savings are massive.
At inference time, you combine the low-rank matrices with the main weights. Latency and memory usage stay low.
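If you want to see the mechanics without any library, here's a simplified sketch of a LoRA-style linear layer in plain PyTorch. It's illustrative only; the real peft library handles initialization, dropout, and weight merging far more carefully.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pretrained weight W (stands in for an existing model weight)
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Small trainable low-rank factors: delta_W = B @ A
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zeros, so training starts from the base behavior
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen base projection plus the scaled low-rank update
        base = x @ self.weight.T
        update = (x @ self.lora_A.T) @ self.lora_B.T
        return base + self.scaling * update

layer = LoRALinear(in_features=512, out_features=512)
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
If you use the peft library, PeftModel's merge_and_unload() performs the inference-time merge described above, folding the low-rank update back into the frozen weight so there's no extra latency.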
Additive Methods
These methods take a different approach. Instead of modifying existing parameters, they add new, task-specific components. Adapters and Soft Prompts are the main players here. They're great when you need to add flexibility without messing with the model's core.
How it works:
Adapters: You insert additional layers that get trained specifically for your task.
Soft Prompts: You adjust or add specific prompt tokens for the task while keeping the main model frozen.
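As a rough sketch of the adapter flavor (the dimensions here are just illustrative), the classic design is a small bottleneck block with a residual connection, inserted after a frozen sublayer:
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        # Down-project, apply a non-linearity, then up-project; only these weights get trained
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.activation = nn.ReLU()
        self.up = nn.Linear(bottleneck_size, hidden_size)

    def forward(self, hidden_states):
        # The residual connection preserves the frozen model's output when the adapter contributes little
        return hidden_states + self.up(self.activation(self.down(hidden_states)))

adapter = BottleneckAdapter()
print(adapter(torch.randn(1, 16, 768)).shape)  # torch.Size([1, 16, 768])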
Now, here's the thing. When people talk about PEFT, they're usually talking about LoRA. It's become the go-to method because it strikes such a good balance between memory efficiency and performance. So for the rest of this post, I'm going to focus on LoRA to show you how this all works in practice.
PEFT/LoRA Fine-Tuning Walkthrough
Alright, let's get our hands dirty and see how PEFT actually works using LoRA. I'm going to use the same use case and dataset from my previous post on Full Fine-Tuning. If you want all the details about the dataset and what we're trying to accomplish, take a look at Fine-Tuning Large Language Models: A Step-by-Step Cookbook.
Oh, and if you're looking to integrate LLMs into your data science workflow, especially for interactive coding and debugging, I've got a tutorial on using LLM pair programming in Jupyter AI. It's a nice complement to PEFT because it shows how these fine-tuned models can actually speed up real projects.
Install the Required Libraries
First things first, you need to install the PEFT library. This gives you all those advanced Parameter-Efficient Fine-Tuning methods we've been talking about. (The rest of the walkthrough also assumes you already have transformers, datasets, and torch installed.)
!pip install peft
Set Up the Model for Fine-Tuning
Here's where the magic happens. We're going to add a new adapter layer specifically for fine-tuning, but the underlying LLM stays frozen. This means only the adapter gets trained. The rest of the model? Completely untouched.
# Import necessary libraries
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
# Define the base model to be used
model_name = 'google/flan-t5-base'
# Load the pre-trained model and tokenizer, setting the data type for efficient computation
original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Configure LoRA parameters
lora_config = LoraConfig(
    r=32, # 'r' is the rank, defining the adapter's dimensionality for LoRA
    lora_alpha=32, # Scaling factor to balance LoRA adjustments with the original model's outputs
    target_modules=["q", "v"], # Adapt only the query ('q') and value ('v') projection matrices of the attention mechanism
    lora_dropout=0.05, # Dropout rate applied to LoRA layers to improve generalization
    bias="none", # Specifies bias configuration; 'none' indicates no bias adjustments in LoRA
    task_type=TaskType.SEQ_2_SEQ_LM # Task type, set to sequence-to-sequence language modeling for fine-tuning
)
# Add LoRA adapters to the original model
peft_model = get_peft_model(original_model, lora_config)
Inspect Trainable Parameters
I always like to check how many parameters we're actually training. It's a simple helper function, but it really drives home how efficient PEFT is. Look at this: only 1.41% of the parameters are trainable! That's why you can run this on a single GPU. Try doing that with full fine-tuning.
def print_trainable_parameters(model):
    # Count trainable and total parameters
    trainable_params = sum(param.numel() for param in model.parameters() if param.requires_grad)
    total_params = sum(param.numel() for param in model.parameters())
    # Calculate the percentage of trainable parameters
    trainable_percentage = 100 * trainable_params / total_params
    # Print parameter summary
    print(f"Trainable parameters: {trainable_params}")
    print(f"Total parameters: {total_params}")
    print(f"Percentage of trainable parameters: {trainable_percentage:.2f}%")

# Call the function to display the parameter summary
print_trainable_parameters(peft_model)
Load and Preprocess the Data
We're using the same dataset I prepared when exploring full fine-tuning. Again, if you want the complete story on this dataset and use case, head over to Fine-Tuning Large Language Models: A Step-by-Step Cookbook.
from datasets import load_dataset
# Load your dataset from the JSONL file
dataset = load_dataset("json", data_files="city_qna.jsonl")
# Check the dataset structure
print(dataset["train"][0])
{'input': 'Can you describe the city of Paris to me?', 'output': 'Paris is a city in France with a population of 2.1 million, known for landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.'}
# Tokenize the dataset
def preprocess_data(examples):
    # Extract inputs and outputs as lists from the dictionary
    inputs = examples["input"]
    outputs = examples["output"]
    # Tokenize inputs and outputs with padding and truncation
    model_inputs = tokenizer(inputs, max_length=128, padding="max_length", truncation=True)
    labels = tokenizer(outputs, max_length=128, padding="max_length", truncation=True).input_ids
    # Replace padding token IDs with -100 to ignore them in the loss function
    labels = [[-100 if token == tokenizer.pad_token_id else token for token in label] for label in labels]
    model_inputs["labels"] = labels
    return model_inputs

# Use the map function to apply the preprocessing to the whole dataset
tokenized_dataset = dataset["train"].map(preprocess_data, batched=True)
Set Up Training Configuration
Before we start the actual fine-tuning, we need to set up our training arguments and create a Trainer instance. This handles all the training logistics for us.
# Set up output directory with a unique name based on the current timestamp
output_dir = f'./peft-flan-t5-city-tuning-{str(int(time.time()))}'
# Define training arguments for PEFT
peft_training_args = TrainingArguments(
    output_dir=output_dir, # Directory to save model checkpoints and logs
    auto_find_batch_size=True, # Automatically adjust batch size based on available memory
    learning_rate=5e-4, # Learning rate, set to a moderate value to balance stability and speed
    num_train_epochs=3, # Number of full passes through the training dataset
    logging_steps=10, # Log training metrics every 10 steps
    save_steps=100, # Save a model checkpoint every 100 steps
    eval_strategy="no", # Evaluation strategy during training; set to "no" if not evaluating
    save_total_limit=2, # Keep only the last 2 checkpoints to save storage space
    per_device_train_batch_size=8, # Batch size per device during training; adjusted for memory constraints
    per_device_eval_batch_size=8, # Batch size per device during evaluation
    weight_decay=0.01, # Weight decay for regularization, prevents overfitting
    max_steps=500 # Maximum number of training steps; overrides num_train_epochs for quicker experimentation
)

# Initialize the Trainer with the specified model, training arguments, and dataset
peft_trainer = Trainer(
    model=peft_model, # The PEFT/LoRA model to be fine-tuned
    args=peft_training_args, # Training arguments defined above
    train_dataset=tokenized_dataset, # The preprocessed, tokenized training dataset
)
Fine-Tune the Model
This is it. Time to actually train the model. The Trainer instance we set up will handle everything according to our configuration and dataset.
# Start the training process using the defined Trainer instance
peft_trainer.train()
# Define a local path to save the fine-tuned PEFT/LoRA model and tokenizer
peft_model_path = "./peft-flan-t5-city-tuning-checkpoint-local"
# Save the fine-tuned model to the specified path for later use
peft_trainer.model.save_pretrained(peft_model_path)
# Save the tokenizer associated with the model to the same path, ensuring compatibility during inference
tokenizer.save_pretrained(peft_model_path)
Check the Size of the Fine-Tuned Model
Want to see something impressive? Let's check how much space our fine-tuned model takes up. Since PEFT only modifies a tiny subset of parameters, the storage requirements are ridiculously small compared to full fine-tuning. This code calculates the total size of the saved model directory. You'll be amazed at how compact it is.
import os

def get_directory_size(path):
    # Calculate the total size of all files in the specified directory
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size

# Define the path to the saved PEFT model
peft_model_path = "./peft-flan-t5-city-tuning-checkpoint-local"
# Get the size in bytes and convert to megabytes (MB)
model_size_mb = get_directory_size(peft_model_path) / (1024 * 1024)
print(f"Fine-tuned PEFT model size: {model_size_mb:.2f} MB")
Fine-tuned PEFT model size: 15.98 MB
Evaluate the Model Qualitatively (Human Evaluation)
Now for the moment of truth. Let's see how well our fine-tuned model actually performs. To prepare for evaluation, we'll add an adapter to the original FLAN-T5 model and set is_trainable=False. This configures it for inference only, so we can focus on checking the quality of its responses without any additional training happening.
# Import PeftModel to load the model with adapters, and PeftConfig for adapter configuration settings
from peft import PeftModel, PeftConfig
# Load the base FLAN-T5 model with the specified data type for efficient computation
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
# Load the tokenizer for the FLAN-T5 model, required for text processing during inference
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
# Load the fine-tuned PEFT model by adding the LoRA adapter to the base model
# Set is_trainable=False to ensure the model is in inference mode (no further training)
peft_model = PeftModel.from_pretrained(
    peft_model_base, # Base model to apply the PEFT adapter to
    './peft-flan-t5-city-tuning-checkpoint-local/', # Path to the saved fine-tuned adapter
    torch_dtype=torch.bfloat16, # Data type for efficient inference
    is_trainable=False # Disable training mode for inference-only evaluation
)
# Define the input prompt for qualitative evaluation
input_text = "Describe the city of Vancouver"
# Tokenize input
inputs = tokenizer(input_text, return_tensors="pt")
# Generate response
outputs = peft_model.generate(input_ids=inputs.input_ids, max_length=50)
# Decode and print the output
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Vancouver is a city in Canada with a population of 1.8 million, known for landmarks such as the Golden Horseshoe Bridge, the Vancouver Museum, and the Canadian Museum.
And there it is! The model now gives us responses in exactly the format we wanted. Sure, some information might still be a bit off, and you'll see the occasional hallucination. But honestly? The output is so much closer to what we were aiming for. This is real progress.
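When I do this kind of human evaluation, it helps to put the untouched base model's answer next to the adapter's answer for the same prompt. Here's a minimal sketch of that comparison; it assumes the tokenizer and peft_model from above are still in memory, and it loads a fresh copy of the base model so the comparison isn't affected by the injected adapter. The second prompt is just a hypothetical extra example in the training format.
# Load a clean copy of the base model for a side-by-side comparison
base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)

prompts = [
    "Describe the city of Vancouver",
    "Can you describe the city of Toronto to me?", # hypothetical extra prompt
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    base_output = base_model.generate(input_ids=input_ids, max_length=50)
    peft_output = peft_model.generate(input_ids=input_ids, max_length=50)
    # Print both answers so a human reviewer can judge format and factual accuracy
    print(f"Prompt: {prompt}")
    print(f"Base model: {tokenizer.decode(base_output[0], skip_special_tokens=True)}")
    print(f"PEFT model: {tokenizer.decode(peft_output[0], skip_special_tokens=True)}")
    print("-" * 60)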
Conclusion
PEFT and LoRA have really opened up new possibilities for working with large models. The best part? You can create multiple small, fine-tuned versions for different tasks and just swap them in and out as you need them. Think about it. One main model, multiple lightweight adapters. No more managing a bunch of massive models for each task. It's such a cleaner approach, and it saves you time, storage, and computing power.
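To make that adapter-swapping workflow concrete, here's a hedged sketch using the peft library's named-adapter API. The city Q&A adapter is the one we trained above; the second adapter's path and name are purely hypothetical placeholders for another task you might train.
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel

# One shared base model kept in memory
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Attach the city Q&A adapter we trained above under a named slot
multi_task_model = PeftModel.from_pretrained(
    base,
    "./peft-flan-t5-city-tuning-checkpoint-local/",
    adapter_name="city_qna",
)

# Load a second adapter for another task (hypothetical path and name)
multi_task_model.load_adapter("./peft-flan-t5-other-task-adapter/", adapter_name="other_task")

# Switch between tasks at inference time without reloading the base model
multi_task_model.set_adapter("city_qna")   # behaves like the city Q&A model
multi_task_model.set_adapter("other_task") # now behaves like the other task's model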
These PEFT techniques also complement cutting-edge architectures, including reasoning-focused models like OpenAI o1. When you're dealing with complex problem solving and analysis, understanding how those models think helps you pick the right fine-tuning strategy.
What PEFT really does is make large models accessible. You don't need a server farm to get great performance on specialized tasks anymore. Just create these lightweight adapters for each use case, switch between them whenever you need to. Your whole AI setup becomes more adaptable and way more resource-friendly. Honestly, once you try this approach, you'll wonder why anyone still does full fine-tuning for most tasks.