Fine-Tuning Gemma 4 E2B with Unsloth: Complete Setup Guide

Maheshwari Vigneswar

TL;DR

  • Google's Gemma 4 E2B is a 2-billion-parameter multimodal model that can be LoRA fine-tuned with as little as 8 GB of VRAM, which puts it within reach of the T4 GPU on Colab's free tier.
  • Unsloth delivers roughly 1.5x faster training with 60% less VRAM than standard Flash Attention 2 setups, at no accuracy cost.
  • Fine-tuning rank-16 LoRA adapters on a 100-conversation synthetic dataset takes under 20 minutes on a T4 and is enough to produce strong domain-specific reasoning on covered topics.
  • The full pipeline (dataset creation via Unsloth Data Recipe, LoRA training, and Hugging Face deployment in LoRA, merged, or GGUF format) fits inside a single Colab notebook.
  • Ideas2IT's engineering teams use this stack to evaluate SLM fit for client-specific AI use cases before committing to production architecture.

Fine-tuning Gemma 4 E2B with Unsloth on a free T4 GPU takes under 20 minutes and requires no paid compute. This guide covers the exact pipeline our team used: synthetic dataset creation, LoRA training, and Hugging Face deployment in a single Colab notebook.

Most teams approach small language model deployment from the wrong end. They pick a model based on benchmark leaderboards, then ask whether it can be fine-tuned for their use case. The more productive sequence runs the other direction: define the task, generate a representative dataset, run a fine-tuning experiment on available hardware, and let the results determine model selection.

Picture this: a domain-specific AI assistant, trained on your company's internal documents, running on free cloud hardware, deployed and queryable in under an hour. That's what we set out to test. We wanted to know how little data, how little compute, and how little infrastructure it actually takes to go from raw policy documents to a fine-tuned model that answers domain-specific questions accurately.

Google's Gemma 4 E2B and the Unsloth library together make that experimental cycle fast enough to matter. This guide documents the exact process our team used, from synthetic dataset creation to Hugging Face deployment, so other engineering teams can replicate it on their own data.

Why Gemma 4 E2B Is Worth Testing First

Gemma 4 is Google's fourth-generation open model family. The E2B variant (2 billion parameters) occupies the lower end of that family but is notable for several reasons that matter in production planning.

LoRA fine-tuning requires as little as 8 GB VRAM. Inference can run on 2 to 6 GB depending on quantization. That means the full fine-tuning cycle is accessible on a free Colab T4 instance with no reserved compute and no infrastructure cost during experimentation.

The architecture natively handles text, images, video, and audio. Audio input is exclusive to the E2B and E4B variants within the Gemma 4 family. For teams evaluating multimodal document or voice use cases, this eliminates the need to stand up separate models per modality during prototyping.

The context window runs to 128K tokens for E2B and E4B. Larger 26B and 31B models extend that to 256K. For most enterprise document and conversation use cases, 128K is sufficient.

Unsloth is what makes this practical on a T4. The library achieves roughly 1.5x faster training and 60% less VRAM consumption compared to standard Flash Attention 2 setups, with no accuracy loss. On a T4, that difference compresses a 100-step training run from approximately 30 minutes to 15 to 20 minutes.

Part 1: Building a Synthetic Dataset with Unsloth Data Recipe

What Data Recipe Does

Unsloth Data Recipe is a node-based workflow tool built on NVIDIA NeMo Data Designer. It takes unstructured source documents and transforms them into structured training data through configurable pipeline stages. For teams that don't have labeled conversational datasets available, it provides a reproducible path from raw policy documents, product manuals, or knowledge bases to fine-tuning-ready JSONL.

Our Pipeline Architecture

We built a conversational dataset for an HR policy assistant. The pipeline has four components: a seed block ingesting unstructured PDFs and text files, a model provider block pointing to the OpenAI API endpoint, a model config block set to GPT-4o at temperature 0.7, and an LLM structured output block generating multi-turn conversation JSON.

The recipe configuration looks like this:

json
{
  "recipe": {
    "model_providers": [
      {
        "name": "provider_1",
        "endpoint": "https://api.openai.com/v1",
        "provider_type": "openai"
      }
    ],
    "model_configs": [
      {
        "alias": "model_1",
        "model": "gpt-4o",
        "provider": "provider_1",
        "inference_parameters": {
          "temperature": 0.7
        }
      }
    ],
    "seed_config": {
      "source": {
        "seed_type": "unstructured",
        "chunk_size": 1200,
        "chunk_overlap": 200
      }
    }
  }
}

The chunk size of 1200 tokens with 200-token overlap ensures that each chunk carries sufficient context for the generation model to produce grounded, multi-turn conversations.
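To make the chunking arithmetic concrete, here is a minimal sketch of overlapping windows. It is not Data Recipe's actual splitter: the function name `chunk_with_overlap` is hypothetical, and a plain Python list stands in for whatever tokenization the tool applies internally.

```python
def chunk_with_overlap(tokens, chunk_size=1200, chunk_overlap=200):
    """Split a token list into windows that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the document
    return chunks


# Stand-in "document": 3,000 placeholder tokens (an assumption for illustration).
tokens = [f"tok{i}" for i in range(3000)]
chunks = chunk_with_overlap(tokens)
print([len(c) for c in chunks])  # → [1200, 1200, 1000]
```

With these settings each window advances by 1,000 tokens, so the last 200 tokens of one chunk reappear at the start of the next. A policy clause that straddles a boundary is therefore seen whole in at least one chunk.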

The Prompt Strategy

The LLM block prompt is the quality gate for the entire pipeline. Ours instructed the model to generate only from facts present in the chunk, produce 8 to 20 turns depending on content richness, flag any gaps explicitly rather than inventing answers, and output strict JSON matching our schema.

You are an expert conversational dataset creator.

Using ONLY the information in the following policy chunk:
{{ chunk_text }}

Generate a realistic multi-turn conversation between an employee ("human")
and an HR assistant ("gpt").

Rules:
- Use only facts from the chunk
- Do not invent information
- Include natural follow-up and scenario-based questions
- If the chunk does not contain an answer, the HR assistant should say
  the policy does not specify it
- Create 8–20 turns depending on the content available
- Output only valid JSON in the format below

The structured output format:

json
{
  "conversations": [
    {
      "from": "human",
      "value": "What is the leave encashment policy?"
    },
    {
      "from": "gpt",
      "value": "According to our policy, employees can encash..."
    }
  ]
}
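Because the generator is an LLM, it is worth checking each record against the schema before training. The sketch below is one possible check, not part of Data Recipe: `validate_conversation` is a hypothetical helper, and it assumes that every message counts as one turn and that speakers strictly alternate starting with the employee.

```python
ROLE_ORDER = ("human", "gpt")  # assumed: speakers alternate, employee first


def validate_conversation(record, min_turns=8, max_turns=20):
    """Return a list of problems found in one generated record (empty = OK)."""
    turns = record.get("conversations") if isinstance(record, dict) else None
    if not isinstance(turns, list):
        return ["missing 'conversations' list"]

    problems = []
    if not min_turns <= len(turns) <= max_turns:
        problems.append(f"expected {min_turns}-{max_turns} turns, got {len(turns)}")
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        expected = ROLE_ORDER[i % 2]
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected speaker '{expected}', got {turn.get('from')!r}")
        value = turn.get("value")
        if not isinstance(value, str) or not value.strip():
            problems.append(f"turn {i}: empty or non-string 'value'")
    return problems


sample = {
    "conversations": [
        {"from": "human", "value": "What is the leave encashment policy?"},
        {"from": "gpt", "value": "According to our policy, employees can encash..."},
    ]
}
print(validate_conversation(sample, min_turns=2))  # → []
print(validate_conversation(sample))               # flags the 2-turn length
```

Records that return a non-empty list can be dropped or regenerated before export.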

Running the Pipeline

Open Unsloth Studio, navigate to Data Recipes, and start from an empty canvas. Add blocks in sequence: Seed (document upload with chunk size 1200, overlap 200), Model Provider (OpenAI endpoint), Model Config (GPT-4o, temperature 0.7), and LLM Structured (conversation generation with the JSON schema above).

Use the Validate button to catch configuration errors before running. Preview generates five sample rows for manual review. Once quality looks acceptable, run the full pipeline and export as JSONL. We generated 100 conversations for this training run.
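After export, a quick pass over the JSONL file confirms the record count and surfaces malformed lines. This is a sketch under the assumption that each line holds one JSON object with a `conversations` list, as in the schema above; `summarize_jsonl` is a hypothetical helper, and the demo writes a throwaway file rather than reading a real export.

```python
import json
import os
import tempfile
from collections import Counter


def summarize_jsonl(path):
    """Count records and tally conversation lengths in an exported JSONL file."""
    lengths = Counter()
    bad_lines = []
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
                lengths[len(record["conversations"])] += 1
            except (json.JSONDecodeError, KeyError, TypeError):
                bad_lines.append(line_no)  # unparseable or wrong shape
    return {
        "records": sum(lengths.values()),
        "turn_counts": dict(lengths),
        "bad_lines": bad_lines,
    }


# Demo on a temporary file; point summarize_jsonl at your real export instead.
rows = [
    {"conversations": [{"from": "human", "value": "Q"}, {"from": "gpt", "value": "A"}]},
    {"conversations": [{"from": "human", "value": "Q"}, {"from": "gpt", "value": "A"}]},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for row in rows:
        tmp.write(json.dumps(row) + "\n")
    tmp.write("not json\n")  # deliberately broken line
summary = summarize_jsonl(tmp.name)
os.remove(tmp.name)
print(summary)  # → {'records': 2, 'turn_counts': {2: 2}, 'bad_lines': [3]}
```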

Part 2: Setting Up the Training Environment

Prerequisites

A free Google Colab account with T4 GPU access is sufficient for this run. You'll also need a Hugging Face account with an API token (write permissions) for model upload.

Installing Unsloth

bash

!pip install unsloth

This single command pulls in the Unsloth core library along with transformers, peft, and trl. No additional dependency management required.

Part 3: Fine-tuning Gemma 4 E2B

Import Dependencies

python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from transformers import TextStreamer

Load the Model

python
max_seq_length = 2048  # Context length

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "google/gemma-4-E2B",
    max_seq_length = max_seq_length,
    load_in_4bit = False,    # Use 16-bit for better accuracy
    load_in_16bit = True,    # 16-bit LoRA (fp16 on a T4, which has no native bf16)
    full_finetuning = False, # Use LoRA for efficiency
)

load_in_16bit=True delivers better accuracy than 4-bit quantization while remaining within T4 memory limits. Setting full_finetuning=False tells the library to apply LoRA adapters instead of updating all parameters.

Configure LoRA Adapters

LoRA (Low-Rank Adaptation) fine-tunes approximately 1 to 2% of model parameters by inserting trainable rank-decomposition matrices at targeted layers. The full model weights remain frozen.

python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
        "gate_proj", "up_proj", "down_proj",     # MLP layers
    ],
    lora_alpha = 16,  # Scaling factor (recommended: alpha = r)
    lora_dropout = 0, # No dropout for small datasets
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # Memory optimization
    random_state = 3407,
    max_seq_length = max_seq_length,
)

The rank of 16 is a reasonable default for small domain-specific datasets. Higher ranks increase trainable parameters and model capacity, which matters when the target domain has high linguistic diversity. For the HR policy use case, 16 was sufficient. Setting lora_alpha equal to r is the standard starting configuration. Unsloth's gradient checkpointing implementation is the key memory optimization: it allows training to fit within T4 VRAM limits that would otherwise be marginal.
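The effect of the rank on trainable parameters follows directly from the shapes of the two LoRA matrices. The sketch below works through that arithmetic along with the standard LoRA scaling factor `lora_alpha / r`. The 2048 x 2048 projection is a hypothetical size chosen for illustration, not Gemma 4's real dimensions, and both helper names are our own.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns B (d_out x r) and A (r x d_in), so the count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)


def lora_scale(lora_alpha, r):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


# Hypothetical 2048 x 2048 projection, NOT Gemma 4's actual layer size.
d_in = d_out = 2048
full = d_in * d_out
for r in (8, 16, 32):
    added = lora_param_count(d_in, d_out, r)
    print(f"r={r:>2}: {added:,} trainable vs {full:,} frozen ({added / full:.2%})")

print("scale with lora_alpha = r = 16:", lora_scale(16, 16))
```

Doubling the rank doubles the trainable parameters for every targeted layer, while keeping `lora_alpha` equal to `r` holds the update scale at 1.0, so the two settings can be tuned together without changing the effective step size of the adapter.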

Load and Prepare the Dataset

python
# Load your dataset
url = "100_conversations.jsonl"
dataset = load_dataset("json", data_files=url, split="train")

# Convert to training format
def convert(example):
    text = ""
    for turn in example["conversations"]:
        if not turn["value"].strip():
            continue
        role = "Human" if turn["from"] == "human" else "Assistant"
        text += f"### {role}:\n{turn['value']}\n\n"
    return {"text": text.strip()}

dataset = dataset.map(convert)

Alternatively, Gemma 4's native chat template produces better results for conversational tasks:

python
from unsloth.chat_templates import get_chat_template, standardize_data_formats

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma-4",  # Use "gemma-4-thinking" for reasoning tasks
)

# The dataset uses ShareGPT-style {"from", "value"} turns; convert them to the
# {"role", "content"} messages that apply_chat_template expects.
dataset = standardize_data_formats(dataset)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False
        ).removeprefix('<bos>')
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
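Chat templates operate on messages keyed by `role` and `content`, whereas the generated dataset uses ShareGPT-style `from` and `value` keys. The role mapping involved can be sketched in plain Python. `sharegpt_to_messages` is a hypothetical helper, and the `user` / `assistant` role names are an assumption based on the usual Hugging Face chat-template convention rather than something confirmed for the Gemma 4 template.

```python
# Assumed mapping from ShareGPT speaker names to chat-template roles.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def sharegpt_to_messages(conversation):
    """Convert [{'from', 'value'}, ...] into [{'role', 'content'}, ...]."""
    messages = []
    for turn in conversation:
        text = turn["value"].strip()
        if not text:
            continue  # drop empty turns, as convert() does
        # An unknown speaker raises KeyError so bad records fail loudly.
        messages.append({"role": ROLE_MAP[turn["from"]], "content": text})
    return messages


example = [
    {"from": "human", "value": "What is the leave encashment policy?"},
    {"from": "gpt", "value": "According to our policy, employees can encash..."},
    {"from": "human", "value": "   "},
]
print(sharegpt_to_messages(example))
```

The whitespace-only third turn is skipped, leaving one `user` and one `assistant` message.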

Configure Training

python
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    tokenizer = tokenizer,
    args = SFTConfig(
        dataset_text_field = "text",
        max_seq_length = max_seq_length,

        # Batch configuration
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,  # Effective batch size = 4

        # Learning rate schedule
        warmup_steps = 10,
        learning_rate = 2e-4,
        lr_scheduler_type = "linear",

        # Training duration
        max_steps = 100,

        # Optimization
        optim = "adamw_8bit",
        weight_decay = 0.01,
        max_grad_norm = 1.0,

        # Logging
        logging_steps = 1,
        output_dir = "outputs_gemma4",

        seed = 3407,
        report_to = "none",
    ),
)

The gradient_accumulation_steps=4 setting simulates a batch size of 4 while processing one sample at a time, which is critical for fitting within T4 VRAM. The 8-bit AdamW optimizer keeps its optimizer state in 8-bit precision, which reduces memory consumption with negligible accuracy impact.
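The numbers in this configuration imply a specific training budget, which the sketch below works out. It assumes one conversation equals one training sample (no sequence packing) and models the `linear` scheduler as linear warmup followed by linear decay to zero; both helper names are our own, not library functions.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Samples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def linear_lr(step, base_lr=2e-4, warmup_steps=10, max_steps=100):
    """Linear warmup to base_lr, then linear decay to zero at max_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))


batch = effective_batch_size(per_device_batch=1, grad_accum_steps=4)
samples_seen = batch * 100     # max_steps = 100
epochs = samples_seen / 100    # dataset of 100 conversations
print(batch, samples_seen, epochs)  # → 4 400 4.0

for step in (0, 10, 55, 100):
    print(step, linear_lr(step))
```

In other words, 100 optimizer steps at an effective batch size of 4 pass over the 100-conversation dataset four times. If you grow the dataset to 500 conversations and keep `max_steps = 100`, the same run covers less than one epoch, so `max_steps` should scale with dataset size.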

Note: If you see loss values of 13 to 15 at the start of training, this is expected behavior for Gemma 4 E2B and E4B. It reflects the multimodal architecture and is not a configuration error.

Train

python

trainer.train()

Expected training time on T4: 15 to 20 minutes for 100 steps. On L4: 8 to 10 minutes. On A100: approximately 5 minutes.

Understanding Training Loss

Gemma 4's E2B and E4B variants are a special case: as noted above, initial loss values of 13 to 15 are normal because of the multimodal architecture. For text-only fine-tuning with standard models, expect a starting loss of 2.0 to 4.0 and a final loss of 0.3 to 1.0. For the Gemma 4 26B and 31B models, the range is typically 1 to 3.
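With `logging_steps = 1`, the per-step loss is noisy, so the trend is easier to read after smoothing. The sketch below applies a trailing moving average to a list of log entries shaped like the Hugging Face Trainer's `trainer.state.log_history`. The helper name is hypothetical and the sample values are made up for illustration; they are not results from the run described here.

```python
def smoothed_losses(log_history, window=10):
    """Trailing moving average over the 'loss' entries of a Trainer log."""
    losses = [entry["loss"] for entry in log_history if "loss" in entry]
    smoothed = []
    for i in range(len(losses)):
        recent = losses[max(0, i - window + 1): i + 1]
        smoothed.append(sum(recent) / len(recent))
    return smoothed


# Made-up numbers for illustration only.
fake_log = [
    {"loss": 14.0},
    {"loss": 10.0},
    {"loss": 6.0},
    {"loss": 2.0},
    {"train_runtime": 1.0},  # non-loss entries are ignored
]
print(smoothed_losses(fake_log, window=2))  # → [14.0, 12.0, 8.0, 4.0]
```

After training, `smoothed_losses(trainer.state.log_history)` gives a curve where a steady decline from the high starting value is the signal to look for.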

Part 4: Testing the Fine-tuned Model

Quick Inference Test

python
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

# Apply chat template
tokenizer = get_chat_template(tokenizer, chat_template = "gemma-4")

PROMPT = "Can you explain the Encashment Leave Policy in Ideas2IT?"

messages = [{
    "role": "user",
    "content": [{"type": "text", "text": PROMPT}]
}]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    tokenize = True,
    return_dict = True,
).to("cuda")

_ = model.generate(
    **inputs,
    max_new_tokens = 512,
    temperature = 1.0,
    top_p = 0.95,
    top_k = 64,
    streamer = TextStreamer(tokenizer, skip_prompt=True),
)

Base Model vs. Fine-tuned Model Comparison

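python
from unsloth import FastModel
from peft import PeftModel

BASE_MODEL = "google/gemma-4-E2B"
FN_PATH = "/content/outputs_gemma4/checkpoint-100"

# Load base model
base_model, base_tokenizer = FastModel.from_pretrained(
    model_name = BASE_MODEL,
    max_seq_length = 2048,
    load_in_4bit = False,
)

# Load fine-tuned adapter.
# PEFT injects the LoRA layers into base_model in place, so generate the
# base-model answers before this line, or wrap base generation in
# `with finetuned_model.disable_adapter():` afterwards.
finetuned_model = PeftModel.from_pretrained(base_model, FN_PATH)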

The fine-tuned model's responses on covered topics are notably more grounded and specific. Hallucinations on policy details present in the training data drop substantially. The base model generates plausible-sounding but generic HR policy language; the fine-tuned model retrieves and applies specific policy language from the training set.

Here is how the answers differ between the two models, for example:

Prompt: "Can you explain the Encashment Leave Policy at Ideas2IT?"

Base model output: A generic 3-4 sentence paragraph describing what leave encashment typically means at most companies, with no specific figures or policy terms.

Fine-tuned model output: A specific response citing the actual policy terms, eligibility conditions, and encashment limits from the training data.

Part 5: Export and Deployment

Unsloth supports three export formats, each suited to different deployment contexts.

Option A: LoRA Adapter (Lightweight, ~100–200 MB)

python
import os

HF_TOKEN = os.environ["HF_TOKEN"]  # write-scoped token; set it in the environment first

# Save adapter locally
model.save_pretrained("gemma_4_finetuned")
tokenizer.save_pretrained("gemma_4_finetuned")

# Push to Hugging Face
model.push_to_hub("your-username/gemma_4_finetuned", token=HF_TOKEN)
tokenizer.push_to_hub("your-username/gemma_4_finetuned", token=HF_TOKEN)

Use LoRA adapters when storage is limited, when experimenting with multiple fine-tunes on a shared base model, or when the deployment environment supports PEFT loading natively.

Option B: Merged Full Model (~4–6 GB)

python
model.save_pretrained_merged("gemma-4-finetuned-merged", tokenizer)

model.push_to_hub_merged(
    "your-username/gemma_4_finetuned_merged",
    tokenizer,
    token=HF_TOKEN,
)

Merged models eliminate the adapter loading step. Use this for production inference where simplicity and deployment compatibility matter more than storage efficiency.
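Conceptually, merging folds the scaled low-rank update into the frozen weight once, so that inference needs a single matrix multiply per layer. The toy example below illustrates that equivalence on a 2 x 2 layer with a rank-1 adapter using plain Python lists. All values are arbitrary, the helper names are our own, and this is a sketch of the standard LoRA merge rule rather than Unsloth's implementation.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A) as a new matrix."""
    scale = lora_alpha / r
    return [
        [W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
         for j in range(len(W[0]))]
        for i in range(len(W))
    ]


# Toy layer with arbitrary values.
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (d_out x d_in)
A = [[1.0, -1.0]]             # LoRA A (r x d_in)
B = [[2.0], [1.0]]            # LoRA B (d_out x r)
x = [3.0, 5.0]
lora_alpha, r = 2, 1

# Adapter path: base output plus the scaled low-rank correction.
base_out = matvec(W, x)
delta = matvec(B, matvec(A, x))
adapter_out = [b + (lora_alpha / r) * d for b, d in zip(base_out, delta)]

# Merged path: fold the correction into the weights, then one multiply.
merged_out = matvec(merge_lora(W, A, B, lora_alpha, r), x)

print(adapter_out, merged_out)  # → [5.0, 25.0] [5.0, 25.0]
```

Both paths give the same output, which is why a merged model behaves like base model plus adapter while loading as an ordinary checkpoint.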

Option C: GGUF for Edge and CPU Deployment (~2–4 GB)

python
model.save_pretrained_gguf(
    "gemma_4_finetuned",
    tokenizer,
    quantization_method = "Q8_0",  # High quality, minimal loss
)

model.push_to_hub_gguf(
    "your-username/gemma_4_finetuned",
    tokenizer,
    quantization_method = "Q8_0",
    token=HF_TOKEN,
)

GGUF format enables deployment via llama.cpp, Ollama, and similar runtimes on CPU or edge hardware. The quantization method controls the size-quality trade-off; the quantization table at the end of this guide compares the common options.

Deployment Options

Unsloth Studio (local):

bash
curl -fsSL https://unsloth.ai/install.sh | sh
unsloth studio -H 0.0.0.0 -p 8888

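vLLM (offline inference):

python
from vllm import LLM, SamplingParams

# Loading by repo name expects full model weights, so this uses the merged repo from Option B.
llm = LLM(model="your-username/gemma_4_finetuned_merged")
sampling_params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=512)
prompts = ["Can you explain the Encashment Leave Policy at Ideas2IT?"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)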

Ollama (via GGUF):

bash
# Create Modelfile
cat > Modelfile << EOF
FROM ./gemma_4_finetuned-Q8_0.gguf
TEMPLATE """{{ .Prompt }}"""
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
EOF

ollama create gemma4-finetuned -f Modelfile
ollama run gemma4-finetuned

Practical Guidelines

Dataset Quality

The 100-conversation training set used here is sufficient for domain adaptation when the source documents are fact-dense and the task is narrow. For tasks with higher linguistic diversity (multiple document types, variable query phrasing, edge-case handling), 500 or more conversations will produce more consistent results. Unsloth Data Recipe's augmentation features generate varied examples from the same source material, which is the faster path compared to writing additional source documents.

Training Optimization

Start with context length 2048 and increase only if your task involves long multi-turn conversations or extended document passages. learning_rate=2e-4 is a reliable default for LoRA runs on small datasets; reduce to 2e-5 for longer training runs to avoid late-stage instability. Monitor the loss trajectory: a plateau before the target loss range suggests the training set may lack diversity rather than that the model has reached capacity.

Memory Management

If the T4 runs out of memory: reduce max_seq_length from 2048 to 1024, switch from load_in_16bit to load_in_4bit, confirm use_gradient_checkpointing="unsloth" is set, and close any other active Colab sessions sharing the GPU instance.

Evaluating whether SLMs fit a specific use case at your company?

Our engineering team runs structured SLM fit assessments: selecting model candidates, generating a representative dataset from your existing content, running a fine-tuning pilot, and producing an evaluation report with production deployment options.

We take your existing documentation, generate a representative 100-500 sample dataset, run a LoRA fine-tuning pilot on Gemma 4 or equivalent, and deliver an evaluation report showing whether a small model meets your production accuracy threshold before you commit to architecture. Typical turnaround: 2 weeks.

Schedule a working session to scope

Why Ideas2IT Works on This Stack

Ideas2IT engineers began working on transformer-based model optimization in 2017, before LLM fine-tuning became accessible enough to be a standard project type. The practical outcome of that history is that our teams can evaluate SLM candidates against a specific use case quickly, rather than spending the first weeks of an engagement on environment setup and configuration.

The Gemma 4 plus Unsloth stack is one of several fine-tuning configurations our engineering teams use in early-stage AI project evaluations. When a client has a defined domain task and existing documentation, running a 100-to-500-sample fine-tuning experiment is often the fastest way to establish whether a small model can meet production accuracy requirements before committing to a larger architecture.

Our Anticlock AI platform brings the same principle to the broader SDLC. It standardizes AI-driven development across engineering teams, enforcing consistent tooling, security guardrails, and deployment standards so that the results from a fine-tuning experiment translate into a production deployment with an auditable process behind it, not just a notebook someone ran once.

For teams evaluating AI use cases that require custom model behavior on domain-specific data, the working session described above is the appropriate starting point.

Frequently Asked Questions

What hardware do I need to fine-tune Gemma 4 E2B with Unsloth?

A T4 GPU, available on Google Colab's free tier, is sufficient for this training run (the card has 16 GB of VRAM, of which Colab exposes roughly 15 GB). Unsloth's memory optimizations make the difference: without them, 16-bit LoRA fine-tuning on a T4 with the standard Flash Attention 2 setup would be marginal. For faster iteration, an L4 roughly halves the 15-to-20-minute run time, and an A100 brings it down to about 5 minutes.

How many training samples does Gemma 4 E2B need for domain adaptation?

100 multi-turn conversations is enough for narrow, fact-bounded tasks where the source documentation is consistent and query types are predictable. Tasks with higher linguistic diversity or coverage requirements benefit from 500 or more examples. 

What is the difference between saving a LoRA adapter and a merged model?

A LoRA adapter stores only the low-rank weight updates from fine-tuning, typically 100 to 200 MB. Loading it requires the base model plus an adapter loader. A merged model combines the adapter weights back into the full model parameters, producing a standalone 4 to 6 GB file that loads without any special handling. 

What does the Gemma 4 E2B's high initial loss mean during training?

Initial loss values of 13 to 15 during E2B and E4B training are normal and reflect the multimodal architecture. The model's token distribution at initialization accounts for image, video, and audio modalities, which produces a higher cross-entropy loss on text-only fine-tuning data at the start of training.

Can this fine-tuning approach work for use cases other than HR policy assistants?

Yes. The pipeline generalizes to any domain where a fact-grounded conversational assistant provides value. Customer support, technical documentation, compliance Q&A, product onboarding, and clinical protocol assistants are all structurally similar. The constraint is that the source documents must be fact-dense enough to generate high-quality synthetic conversations.

What is Unsloth Data Recipe and is it required for fine-tuning?

Unsloth Data Recipe is a visual workflow tool for generating synthetic training datasets from unstructured source documents. It is not required: any properly formatted JSONL file can be used as the training dataset. Data Recipe is useful when you do not have an existing labeled dataset and need to generate one from internal documentation.

GGUF quantization methods (see Option C in Part 5):

| Method | Approximate Size | Quality    | Typical Use Case                  |
| Q8_0   | ~4 GB            | Highest    | Production, minimal quality loss  |
| Q4_K_M | ~2 GB            | Good       | Balanced size and quality         |
| Q4_0   | ~2 GB            | Acceptable | Smaller or constrained devices    |
| F16    | ~8 GB            | Full       | No quantization, largest footprint |
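The size column follows roughly from bits per weight. The sketch below estimates file size for the three formats whose block layout is fixed; Q4_K_M mixes several tensor types and is left out. The 1-billion-parameter count is a placeholder chosen for round numbers, not Gemma 4 E2B's stored parameter count, and real files also carry metadata and may keep some tensors at higher precision.

```python
# Bits stored per weight, including the per-block scale factor.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,  # 32 int8 weights + one fp16 scale per block
    "Q4_0": 4.5,  # 32 4-bit weights + one fp16 scale per block
}


def estimate_gguf_gb(n_params, method):
    """Rough GGUF size in GB (10^9 bytes), ignoring metadata overhead."""
    return n_params * BITS_PER_WEIGHT[method] / 8 / 1e9


# Placeholder parameter count for illustration only.
n_params = 1_000_000_000
for method in ("F16", "Q8_0", "Q4_0"):
    print(f"{method}: ~{estimate_gguf_gb(n_params, method):.2f} GB per billion parameters")
```

Multiplying the per-billion figures by the model's actual stored parameter count gives a first-order check on the sizes you see after export.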