TL;DR
- Generative AI is bottlenecked by latency and cost, especially when serving models like Stable Diffusion at scale.
- DeepSpeed-MII changes the game: it delivers low-latency, low-cost inference for thousands of DL models using optimizations like tensor slicing, fusion, and ZeroQuant.
- However, the default deployments (Azure ML, gRPC) can feel restrictive for enterprises wanting more flexible serving solutions.
- In this blog, we show how to serve a DeepSpeed-MII optimized Stable Diffusion model via TorchServe, bypassing native limits while retaining performance gains.
- This approach helps teams experiment locally, integrate into diverse pipelines, and cut inference costs without being locked into a single cloud vendor.
Incredibly powerful text and image generation models such as GPT-2, BERT, BLOOM 176B, and Stable Diffusion are now available to anyone with access to a handful of GPUs, or even a single one. Yet their application is still restricted by two critical factors: inference latency and cost.
For product teams, this creates a bottleneck. Running large models at scale is expensive and often too slow for real-time applications. Enterprises need a way to deliver the same model quality without prohibitive infrastructure costs or long response times.
This is where DeepSpeed-MII comes in.
What is DeepSpeed-MII?
DeepSpeed-MII is a new open-source Python library from DeepSpeed, aimed at making low-latency, low-cost inference of powerful models not only feasible but also easily accessible.
- MII offers access to highly optimized implementations of thousands of DL models.
- MII-supported models achieve significantly lower latency and cost compared to their original implementation.
- To enable low latency/cost inference, MII leverages an extensive set of optimizations from DeepSpeed-Inference such as deep fusion for transformers, automated tensor-slicing for multi-GPU inference, on-the-fly quantization with ZeroQuant, and others.
How it works:
DeepSpeed-MII leverages advanced optimizations from DeepSpeed-Inference, including:
- Deep Fusion for Transformers – reduces kernel launch overhead.
- Tensor Slicing – enables multi-GPU inference with minimal coding.
- ZeroQuant – applies on-the-fly quantization to cut memory + cost.
Together, these optimizations make state-of-the-art models deployable in production without breaking compute budgets.
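As a concrete illustration, the snippet below is a minimal sketch (not taken from MII's documentation) of how these optimizations are typically switched on through deepspeed.init_inference, the DeepSpeed-Inference entry point that MII builds on. The model choice is arbitrary, and exact keyword arguments can vary between DeepSpeed releases.

```python
# Minimal sketch: applying DeepSpeed-Inference optimizations to a Hugging Face
# model. The model choice is arbitrary and keyword arguments may vary slightly
# between DeepSpeed releases.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any supported causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# replace_with_kernel_inject swaps in fused, optimized kernels (deep fusion);
# mp_size > 1 enables automatic tensor slicing across GPUs;
# dtype controls precision (fp16 here; int8 paths tie into ZeroQuant).
engine = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.half,
    replace_with_kernel_inject=True,
)
model = engine.module

inputs = tokenizer("DeepSpeed-MII makes inference", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```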
Native Deployment Options
DeepSpeed-MII comes packaged with two deployment options:
- Local Deployment with gRPC server
- Azure ML Endpoints
The local deployment offered by DeepSpeed-MII carries the overhead of running an extra gRPC server for model inferencing, while the cloud option introduces an Azure ML dependency. Being bound to an Azure Machine Learning or gRPC-style deployment makes it difficult to serve the optimized model with other model servers.
Non-Native Deployment Example:
This blog covers how to serve a DeepSpeed-MII optimized Stable Diffusion model via TorchServe by bypassing the default deployment options offered by DeepSpeed-MII.
Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
Refer: https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work
Before using the model, you need to accept the model license in order to download and use the weights.
For access tokens refer: https://huggingface.co/docs/hub/security-tokens
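As a hedged example, assuming you use the huggingface_hub package, authentication can also be done programmatically (the token value below is a placeholder):

```python
# Authenticate with the Hugging Face Hub so the gated Stable Diffusion weights
# can be downloaded. The token value is a placeholder; use your own token.
from huggingface_hub import login

login(token="hf_xxx")  # alternatively, run `huggingface-cli login` in a shell
```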
Below is a sample Python implementation of the Stable Diffusion model optimized with DeepSpeed-MII, without the native deployments.
The script requires the pillow, deepspeed-mii, and huggingface-hub packages.
Sample Python script:
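The original dsmii.py is not reproduced here; the sketch below shows one way such a script could look, assuming the pipeline is loaded with diffusers, saved to a local model directory for the packaging step later in this post, and then optimized with deepspeed.init_inference (the DeepSpeed-Inference entry point that MII builds on). The argument names and file layout are illustrative.

```python
# dsmii.py -- illustrative sketch, not the exact script from this blog.
# Loads Stable Diffusion with diffusers, saves the pipeline to ./model for the
# TorchServe packaging step, applies DeepSpeed-Inference optimizations, and
# generates one image for the supplied prompt.
import argparse
from datetime import datetime

import torch
import deepspeed
from diffusers import StableDiffusionPipeline


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", default="CompVis/stable-diffusion-v1-4")
    parser.add_argument("--prompt", required=True)
    parser.add_argument("--save_path", default="model")
    args = parser.parse_args()

    # Requires prior license acceptance and a valid Hugging Face token.
    pipe = StableDiffusionPipeline.from_pretrained(
        args.model_name, torch_dtype=torch.float16
    ).to("cuda")

    # Save the downloaded weights so they can be zipped and packaged below.
    pipe.save_pretrained(args.save_path)

    # Inject DeepSpeed's fused kernels into the pipeline's submodules.
    # Exact kwargs (e.g. enable_cuda_graph) vary across DeepSpeed versions.
    pipe = deepspeed.init_inference(
        pipe,
        mp_size=1,
        dtype=torch.half,
        replace_with_kernel_inject=True,
    )

    image = pipe(args.prompt).images[0]  # a PIL image (pillow)
    image.save(f"output-{datetime.now().strftime('%Y%m%d%H%M%S')}.jpg")


if __name__ == "__main__":
    main()
```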
Inference query:
python dsmii.py --model_name CompVis/stable-diffusion-v1-4 --prompt "A photo of a golden retriever puppy wearing a shirt. Background office"

TorchServe Implementation:
Compress Model:
Zip the folder where the model was saved. In this case:
cd model
zip -r ../model.zip *
Generate MAR File:
torch-model-archiver --model-name stable-diffusion --version 1.0 --handler custom_handler.py --extra-files model.zip -r requirements.txt
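The custom_handler.py referenced above is not reproduced here; the sketch below shows what such a TorchServe handler could look like, assuming model.zip is unpacked inside initialize() and the DeepSpeed optimization is re-applied in the serving worker. Class names and method bodies are illustrative, not the original handler.

```python
# custom_handler.py -- illustrative sketch of a TorchServe handler for the
# DeepSpeed-optimized Stable Diffusion pipeline; not the exact handler used
# in the original example.
import io
import os
import zipfile

import torch
import deepspeed
from diffusers import StableDiffusionPipeline
from ts.torch_handler.base_handler import BaseHandler


class DiffusersHandler(BaseHandler):
    def initialize(self, ctx):
        model_dir = ctx.system_properties.get("model_dir")
        # model.zip was packaged via --extra-files; unpack the weights.
        with zipfile.ZipFile(os.path.join(model_dir, "model.zip"), "r") as zf:
            zf.extractall(os.path.join(model_dir, "model"))

        pipe = StableDiffusionPipeline.from_pretrained(
            os.path.join(model_dir, "model"), torch_dtype=torch.float16
        ).to("cuda")
        # Re-apply DeepSpeed kernel injection inside the serving worker.
        self.pipe = deepspeed.init_inference(
            pipe, mp_size=1, dtype=torch.half, replace_with_kernel_inject=True
        )
        self.initialized = True

    def preprocess(self, requests):
        # Each request body is expected to carry the raw prompt text.
        prompts = []
        for req in requests:
            data = req.get("data") or req.get("body")
            if isinstance(data, (bytes, bytearray)):
                data = data.decode("utf-8")
            prompts.append(data)
        return prompts

    def inference(self, prompts):
        return [self.pipe(p).images[0] for p in prompts]

    def postprocess(self, images):
        # Return JPEG bytes, one entry per request in the batch.
        outputs = []
        for img in images:
            buf = io.BytesIO()
            img.save(buf, format="JPEG")
            outputs.append(buf.getvalue())
        return outputs
```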
Start TorchServe:
config.properties:
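Its contents are not included above; a minimal example, assuming the MAR file sits in a local model_store directory and per-model Python dependencies are installed from the bundled requirements.txt, might look like this:

```properties
# Illustrative config.properties; paths and addresses are assumptions.
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
model_store=model_store
load_models=stable-diffusion.mar
install_py_dep_per_model=true
```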
torchserve --start --ts-config config.properties
Run Inference:
python query.py --url "http://localhost:8080/predictions/stable-diffusion" --prompt "a photo of an astronaut riding a horse on mars"
The generated image will be written to a timestamped file such as output-20221027213010.jpg.
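The query.py client is likewise not reproduced here; a minimal sketch, assuming the requests package and the timestamped output naming shown above, could be:

```python
# query.py -- illustrative client sketch for the TorchServe endpoint above.
import argparse
from datetime import datetime

import requests


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--url", required=True)
    parser.add_argument("--prompt", required=True)
    args = parser.parse_args()

    # Send the raw prompt text; the handler responds with JPEG bytes.
    response = requests.post(args.url, data=args.prompt)
    response.raise_for_status()

    filename = f"output-{datetime.now().strftime('%Y%m%d%H%M%S')}.jpg"
    with open(filename, "wb") as f:
        f.write(response.content)
    print(f"Image written to {filename}")


if __name__ == "__main__":
    main()
```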
Why This Matters for Practitioners
- Flexibility: Serve MII-optimized models on TorchServe, Triton, KServe, or your in-house stack. Run the same package on-prem, in any cloud, or at the edge.
- Cost control: ZeroQuant and tensor slicing cut memory use and GPU hours. Deep fusion improves kernel efficiency. You get lower per-request cost without rewriting models.
- Enterprise fit: Slot into existing MLOps: Kubernetes, CI/CD, MLflow, Prometheus/Grafana, Datadog, Vault, and IAM. Keep your current observability, secrets, and rollout patterns.
- Scalability: Start on a single GPU, scale to multi-GPU and multi-node. Use autoscaling, canary releases, and A/B routing at the endpoint level.
- Operational signals: Track P50/P95 latency, throughput, VRAM utilization, and error rates. Tune batch size, scheduler, and attention settings to meet SLOs.
Production checklist
- Package: model artifacts + handler + env locked by hash
- Security: HF token handling, license acceptance, network policy
- Observability: logs, metrics, traces wired to your stack
- Rollouts: blue/green or canary with quick rollback
- Cost guardrails: per‑endpoint quotas and autoscale limits
Conclusion
This blog described how DeepSpeed-MII can be used without the gRPC or Azure ML deployments. This removes the need to run a gRPC server in local deployments and the Azure ML dependency in cloud deployments, enabling DeepSpeed-MII optimized models to be served from other solutions such as TorchServe.
FAQ
1. What is DeepSpeed-MII?
DeepSpeed-MII is an open-source library that makes inference of large models faster and cheaper using optimizations like ZeroQuant and tensor slicing.
2. Why integrate DeepSpeed-MII with TorchServe?
TorchServe enables flexible, production-ready deployment. Combining it with DeepSpeed-MII brings low-latency inference without being locked to gRPC or Azure ML.
3. Which models are supported by DeepSpeed-MII?
MII supports a wide range of NLP and vision models including GPT, BERT, Bloom, and Stable Diffusion.
4. How does DeepSpeed-MII reduce inference costs?
By using quantization, kernel fusion, and multi-GPU optimizations, DeepSpeed-MII minimizes GPU usage and cuts cloud bills.
5. Can DeepSpeed-MII scale beyond a single GPU?
Yes. The same optimizations scale from a single GPU to multi-GPU clusters, making it enterprise-ready.