
Fine-tuning Gemma 4 E2B with Unsloth on a free T4 GPU takes under 20 minutes and requires no paid compute. This guide covers the exact pipeline our team used: synthetic dataset creation, LoRA training, and Hugging Face deployment in a single Colab notebook.
Most teams approach small language model deployment from the wrong end. They pick a model based on benchmark leaderboards, then ask whether it can be fine-tuned for their use case. The more productive sequence runs the other direction: define the task, generate a representative dataset, run a fine-tuning experiment on available hardware, and let the results determine model selection.
Picture this: a domain-specific AI assistant, trained on your company's internal documents, running on free cloud hardware, deployed and queryable in under an hour. That's what we set out to test. We wanted to know how little data, how little compute, and how little infrastructure it actually takes to go from raw policy documents to a fine-tuned model that answers domain-specific questions accurately.
Google's Gemma 4 E2B and the Unsloth library together make that experimental cycle fast enough to matter. This guide documents the exact process our team used, from synthetic dataset creation to Hugging Face deployment, so other engineering teams can replicate it on their own data.
Gemma 4 is Google's fourth-generation open model family. The E2B variant (2 billion parameters) occupies the lower end of that family but is notable for several reasons that matter in production planning.
LoRA fine-tuning requires as little as 8 GB VRAM. Inference can run on 2 to 6 GB depending on quantization. That means the full fine-tuning cycle is accessible on a free Colab T4 instance with no reserved compute and no infrastructure cost during experimentation.
The architecture natively handles text, images, video, and audio. Audio input is exclusive to the E2B and E4B variants within the Gemma 4 family. For teams evaluating multimodal document or voice use cases, this eliminates the need to stand up separate models per modality during prototyping.
The context window runs to 128K tokens for E2B and E4B. Larger 26B and 31B models extend that to 256K. For most enterprise document and conversation use cases, 128K is sufficient.
Unsloth is what makes this cycle practical on a T4. The library achieves roughly 1.5x faster training and 60% less VRAM consumption than standard Flash Attention 2 setups, with no accuracy loss. On a T4, that shortens a 100-step training run from approximately 30 minutes to 15 to 20 minutes.
Unsloth Data Recipe is a node-based workflow tool built on NVIDIA NeMo Data Designer. It takes unstructured source documents and transforms them into structured training data through configurable pipeline stages. For teams that don't have labeled conversational datasets available, it provides a reproducible path from raw policy documents, product manuals, or knowledge bases to fine-tuning-ready JSONL.
We built a conversational dataset for an HR policy assistant. The pipeline has four components: a seed block ingesting unstructured PDFs and text files, a model provider block pointing to the OpenAI API endpoint, a model config block set to GPT-4o at temperature 0.7, and an LLM structured output block generating multi-turn conversation JSON.
The recipe configuration looks like this:
json
{
  "recipe": {
    "model_providers": [
      {
        "name": "provider_1",
        "endpoint": "https://api.openai.com/v1",
        "provider_type": "openai"
      }
    ],
    "model_configs": [
      {
        "alias": "model_1",
        "model": "gpt-4o",
        "provider": "provider_1",
        "inference_parameters": {
          "temperature": 0.7
        }
      }
    ],
    "seed_config": {
      "source": {
        "seed_type": "unstructured",
        "chunk_size": 1200,
        "chunk_overlap": 200
      }
    }
  }
}
The chunk size of 1200 tokens with 200-token overlap ensures that each chunk carries sufficient context for the generation model to produce grounded, multi-turn conversations.
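To make the overlap concrete, here is a minimal sliding-window sketch. It is not Data Recipe's actual chunker: the `chunk_tokens` helper is hypothetical, and plain integers stand in for real tokenizer output. It only illustrates how a 1200-token window with a 200-token overlap advances 1000 tokens at a time.

```python
def chunk_tokens(tokens, chunk_size=1200, chunk_overlap=200):
    """Split a token list into overlapping windows (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Stand-in "tokens": the integers 0..2999 instead of real tokenizer output.
tokens = list(range(3000))
chunks = chunk_tokens(tokens)

for chunk in chunks:
    print(f"tokens {chunk[0]}..{chunk[-1]} ({len(chunk)} tokens)")
```

Each window repeats the last 200 tokens of the previous one, so a policy clause that straddles a chunk boundary still appears intact in at least one chunk.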
The LLM block prompt is the quality gate for the entire pipeline. Ours instructed the model to generate only from facts present in the chunk, produce 8 to 20 turns depending on content richness, flag any gaps explicitly rather than inventing answers, and output strict JSON matching our schema.
You are an expert conversational dataset creator.
Using ONLY the information in the following policy chunk:
{{ chunk_text }}
Generate a realistic multi-turn conversation between an employee ("human")
and an HR assistant ("gpt").
Rules:
- Use only facts from the chunk
- Do not invent information
- Include natural follow-up and scenario-based questions
- If the chunk does not contain an answer, the HR assistant should say
the policy does not specify it
- Create 8–20 turns depending on the content available
- Output only valid JSON in the format below
The structured output format:
json
{
  "conversations": [
    {
      "from": "human",
      "value": "What is the leave encashment policy?"
    },
    {
      "from": "gpt",
      "value": "According to our policy, employees can encash..."
    }
  ]
}
Open Unsloth Studio, navigate to Data Recipes, and start from an empty canvas. Add blocks in sequence: Seed (document upload with chunk size 1200, overlap 200), Model Provider (OpenAI endpoint), Model Config (GPT-4o, temperature 0.7), and LLM Structured (conversation generation with the JSON schema above).
Use the Validate button to catch configuration errors before running. Preview generates five sample rows for manual review. Once quality looks acceptable, run the full pipeline and export as JSONL. We generated 100 conversations for this training run.
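Before training, it can help to run a quick structural check on the exported JSONL. The sketch below is our own illustration, not part of Unsloth Studio. It assumes each line holds one record in the `conversations` format shown earlier and that turns should alternate between `human` and `gpt`, starting with `human`. The function names are hypothetical.

```python
import json


def validate_conversation(record):
    """Return a list of problems found in one ShareGPT-style record."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    turns = record.get("conversations")
    if not isinstance(turns, list) or not turns:
        return ["missing or empty 'conversations' list"]
    problems = []
    expected = "human"  # assumption: every conversation opens with the employee
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not a JSON object")
            continue
        speaker = turn.get("from")
        if speaker != expected:
            problems.append(f"turn {i}: expected '{expected}', got '{speaker}'")
        if not str(turn.get("value", "")).strip():
            problems.append(f"turn {i}: empty value")
        expected = "gpt" if speaker == "human" else "human"
    return problems


def validate_jsonl(lines):
    """Map 1-based line numbers to their problems, skipping clean lines."""
    report = {}
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report[lineno] = [f"invalid JSON: {exc.msg}"]
            continue
        problems = validate_conversation(record)
        if problems:
            report[lineno] = problems
    return report


# In-memory stand-ins for lines read from the exported file.
sample_lines = [
    '{"conversations": [{"from": "human", "value": "Hi"}, {"from": "gpt", "value": "Hello"}]}',
    '{"conversations": [{"from": "human", "value": "Hi"}, {"from": "human", "value": ""}]}',
    "not json",
]
report = validate_jsonl(sample_lines)
for lineno, problems in report.items():
    print(lineno, problems)
```

To check a real export, pass an open file handle instead of `sample_lines`, since iterating a file yields one line at a time.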
A free Google Colab account with T4 GPU access is sufficient for this run. You'll also need a Hugging Face account with an API token (write permissions) for model upload.
bash
!pip install unsloth
This single command pulls in the Unsloth core library along with transformers, peft, and trl. No additional dependency management is required.
python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from transformers import TextStreamer
python
max_seq_length = 2048  # Context length

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "google/gemma-4-E2B",
    max_seq_length = max_seq_length,
    load_in_4bit = False,     # Use 16-bit for better accuracy
    load_in_16bit = True,     # bf16/16-bit LoRA
    full_finetuning = False,  # Use LoRA for efficiency
)
load_in_16bit=True delivers better accuracy than 4-bit quantization while remaining within T4 memory limits. Setting full_finetuning=False tells the library to apply LoRA adapters instead of updating all parameters.
LoRA (Low-Rank Adaptation) fine-tunes approximately 1 to 2% of model parameters by inserting trainable rank-decomposition matrices at targeted layers. The full model weights remain frozen.
python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
        "gate_proj", "up_proj", "down_proj",     # MLP layers
    ],
    lora_alpha = 16,   # Scaling factor (recommended: alpha = r)
    lora_dropout = 0,  # No dropout for small datasets
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Memory optimization
    random_state = 3407,
    max_seq_length = max_seq_length,
)
The rank of 16 is a reasonable default for small domain-specific datasets. Higher ranks increase trainable parameters and model capacity, which matters when the target domain has high linguistic diversity. For the HR policy use case, 16 was sufficient. Setting lora_alpha equal to r is the standard starting configuration. Unsloth's gradient checkpointing implementation is the key memory optimization: it allows training to fit within T4 VRAM limits that would otherwise be marginal.
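The arithmetic behind the rank setting can be sketched for a single layer. The 2048 x 2048 projection below is a hypothetical shape chosen for illustration, not Gemma 4's actual dimensions, and the helper names are our own. For a frozen weight of shape `d_out x d_in`, LoRA trains two small matrices, `A` (`r x d_in`) and `B` (`d_out x r`), and scales their product by `lora_alpha / r`.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scale(lora_alpha, r):
    """Factor applied to the low-rank update B @ A before adding it to W."""
    return lora_alpha / r


# Hypothetical square projection layer; real layer shapes will differ.
d_in = d_out = 2048
r = 16

full_params = d_in * d_out                            # frozen weights: 4,194,304
lora_params = lora_trainable_params(d_in, d_out, r)   # trainable: 65,536
fraction = lora_params / full_params                  # 0.015625

print(f"frozen: {full_params:,}  trainable: {lora_params:,}  fraction: {fraction:.4f}")
print(f"scale with lora_alpha = r: {lora_scale(16, r)}")
```

Doubling `r` doubles the trainable parameters for the layer, which is the capacity trade-off described above. With `lora_alpha` equal to `r`, the scale is 1.0, so the update is applied unscaled.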
python
# Load your dataset
url = "100_conversations.jsonl"
dataset = load_dataset("json", data_files=url, split="train")

# Convert to training format
def convert(example):
    text = ""
    for turn in example["conversations"]:
        # Skip empty turns
        if not turn["value"].strip():
            continue
        role = "Human" if turn["from"] == "human" else "Assistant"
        text += f"### {role}:\n{turn['value']}\n\n"
    return {"text": text.strip()}

dataset = dataset.map(convert)
Alternatively, Gemma 4's native chat template produces better results for conversational tasks:
python
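from unsloth.chat_templates import get_chat_template, standardize_data_formats

# Convert ShareGPT-style "from"/"value" turns to the "role"/"content" format
# that chat templates expect (skip if your data already uses role/content)
dataset = standardize_data_formats(dataset)

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-4",  # Use "gemma-4-thinking" for reasoning tasks
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize = False,
            add_generation_prompt = False,
        ).removeprefix('<bos>')  # The trainer adds <bos> again during tokenization
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)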
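Chat templates applied through `apply_chat_template` generally expect each turn as a `role`/`content` pair, while the exported dataset uses ShareGPT-style `from`/`value` keys. The sketch below shows that key mapping in plain Python, which is useful if you want to inspect or perform the conversion yourself. `ROLE_MAP` and `sharegpt_to_messages` are illustrative names, not part of Unsloth.

```python
# Assumed mapping from ShareGPT speaker labels to chat-template roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def sharegpt_to_messages(conversation):
    """Convert [{'from', 'value'}, ...] turns into [{'role', 'content'}, ...]."""
    messages = []
    for turn in conversation:
        text = turn["value"].strip()
        if not text:
            continue  # drop empty turns rather than emit blank messages
        messages.append({"role": ROLE_MAP[turn["from"]], "content": text})
    return messages


example = [
    {"from": "human", "value": "What is the leave encashment policy?"},
    {"from": "gpt", "value": "According to our policy, employees can encash..."},
]
messages = sharegpt_to_messages(example)
print(messages)
```

An unknown speaker label raises a `KeyError`, which surfaces malformed records early instead of silently mislabeling a turn.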
python
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    tokenizer = tokenizer,
    args = SFTConfig(
        dataset_text_field = "text",
        max_seq_length = max_seq_length,
        # Batch configuration
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,  # Effective batch size = 4
        # Learning rate schedule
        warmup_steps = 10,
        learning_rate = 2e-4,
        lr_scheduler_type = "linear",
        # Training duration
        max_steps = 100,
        # Optimization
        optim = "adamw_8bit",
        weight_decay = 0.01,
        max_grad_norm = 1.0,
        # Logging
        logging_steps = 1,
        output_dir = "outputs_gemma4",
        seed = 3407,
        report_to = "none",
    ),
)
The gradient_accumulation_steps=4 setting simulates a batch size of 4 while processing one sample at a time, critical for running on T4 VRAM. The 8-bit AdamW optimizer reduces memory consumption during backpropagation with negligible accuracy impact.
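The batch settings also determine how many passes the run makes over the data. The sketch below works through that arithmetic for the values in this guide, assuming each of the 100 conversations is one training example and a single GPU. The `training_coverage` helper is hypothetical.

```python
def training_coverage(num_examples, per_device_batch, grad_accum, max_steps, num_devices=1):
    """Return (effective batch size, examples processed, approximate epochs)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    examples_processed = effective_batch * max_steps
    epochs = examples_processed / num_examples
    return effective_batch, examples_processed, epochs


effective_batch, examples_processed, epochs = training_coverage(
    num_examples=100,    # conversations generated for this run
    per_device_batch=1,  # per_device_train_batch_size
    grad_accum=4,        # gradient_accumulation_steps
    max_steps=100,       # max_steps
)
print(f"effective batch: {effective_batch}, examples processed: {examples_processed}, epochs: {epochs}")
```

Under those assumptions, 100 optimizer steps cover the dataset about four times. With a larger dataset, raise `max_steps` proportionally to keep the same number of passes.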
Note: If you see loss values of 13 to 15 at the start of training, this is expected behavior for Gemma 4 E2B and E4B. It reflects the multimodal architecture and is not a configuration error.
python
trainer.train()
Expected training time on T4: 15 to 20 minutes for 100 steps. On L4: 8 to 10 minutes. On A100: approximately 5 minutes.
On Gemma 4's E2B and E4B variants, initial loss values of 13 to 15 are normal because of the multimodal architecture. For comparison, text-only fine-tuning with standard models typically starts at a loss of 2.0 to 4.0 and ends at 0.3 to 1.0. For Gemma 26B and 31B, the range is typically 1 to 3.
python
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

# Apply chat template
tokenizer = get_chat_template(tokenizer, chat_template = "gemma-4")

PROMPT = "Can you explain the Encashment Leave Policy in Ideas2IT?"
messages = [{
    "role": "user",
    "content": [{"type": "text", "text": PROMPT}]
}]

# Tokenize the prompt and move it to the GPU
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")

# Stream the generated answer to stdout
_ = model.generate(
    **inputs,
    max_new_tokens = 512,
    temperature = 1.0,
    top_p = 0.95,
    top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt=True),
)
python
from unsloth import FastModel
from peft import PeftModel

BASE_MODEL = "google/gemma-4-E2B"
FN_PATH = "/content/outputs_gemma4/checkpoint-100"

# Load base model
base_model, base_tokenizer = FastModel.from_pretrained(
    model_name = BASE_MODEL,
    max_seq_length = 2048,
    load_in_4bit = False,
)

# Load fine-tuned adapter on top of the base weights
finetuned_model = PeftModel.from_pretrained(base_model, FN_PATH)
The fine-tuned model's responses on covered topics are notably more grounded and specific. Hallucinations on policy details present in the training data drop substantially. The base model generates plausible-sounding but generic HR policy language; the fine-tuned model retrieves and applies specific policy language from the training set.
Here's how the generated answers differ between the two models. For example:
Prompt: "Can you explain the Encashment Leave Policy at Ideas2IT?"
Base model output: A generic 3-4 sentence paragraph describing what leave encashment typically means at most companies, with no specific figures or policy terms.
Fine-tuned model output: A specific response citing the actual policy terms, eligibility conditions, and encashment limits from the training data.
Unsloth supports three export formats, each suited to different deployment contexts.
python
import os

# Assumes your Hugging Face write token is stored in the HF_TOKEN environment variable
HF_TOKEN = os.environ["HF_TOKEN"]

# Save adapter locally
model.save_pretrained("gemma_4_finetuned")
tokenizer.save_pretrained("gemma_4_finetuned")

# Push to Hugging Face
model.push_to_hub("your-username/gemma_4_finetuned", token=HF_TOKEN)
tokenizer.push_to_hub("your-username/gemma_4_finetuned", token=HF_TOKEN)
Use LoRA adapters when storage is limited, when experimenting with multiple fine-tunes on a shared base model, or when the deployment environment supports PEFT loading natively.
python
# Save merged weights locally
model.save_pretrained_merged("gemma-4-finetuned-merged", tokenizer)

# Push merged weights to Hugging Face
model.push_to_hub_merged(
    "your-username/gemma_4_finetuned_merged",
    tokenizer,
    token=HF_TOKEN,
)
Merged models eliminate the adapter loading step. Use this for production inference where simplicity and deployment compatibility matter more than storage efficiency.
python
# Save GGUF locally
model.save_pretrained_gguf(
    "gemma_4_finetuned",
    tokenizer,
    quantization_method = "Q8_0",  # High quality, minimal loss
)

# Push GGUF to Hugging Face
model.push_to_hub_gguf(
    "your-username/gemma_4_finetuned",
    tokenizer,
    quantization_method = "Q8_0",
    token=HF_TOKEN,
)
GGUF format enables deployment via llama.cpp, Ollama, and similar runtimes on CPU or edge hardware. The quantization method controls the size-quality trade-off: