Powerful text and image generation models such as GPT-2, BERT, BLOOM 176B, and Stable Diffusion are now available to anyone with access to a handful of GPUs, or even a single one. Their application is still restricted by two critical factors: inference latency and cost.
For product teams, this creates a bottleneck. Running large models at scale is expensive and often too slow for real-time applications. Enterprises need a way to deliver the same model quality without prohibitive infrastructure costs or long response times.
This is where DeepSpeed-MII comes in.
DeepSpeed-MII is a new open-source Python library from DeepSpeed, aimed at making low-latency, low-cost inference of powerful models not only feasible but also easily accessible.
How it works:
DeepSpeed-MII leverages advanced optimizations from DeepSpeed-Inference, including kernel fusion, tensor slicing for multi-GPU inference, and quantization with ZeroQuant.
Together, these optimizations make state-of-the-art models deployable in production without breaking compute budgets.
DeepSpeed MII comes pre-packaged with two deployment options: a local deployment served over gRPC and a cloud deployment on Azure Machine Learning.
The local deployment carries the overhead of running an extra gRPC server for model inference, and the cloud deployment depends on Azure ML.
Because the optimized model is bound to one of these two deployment types, serving it from other model servers is difficult.
This blog covers how to serve a DeepSpeed MII optimized Stable Diffusion model via TorchServe by bypassing the default deployment options offered by DeepSpeed MII.
Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
For details, see: https://huggingface.co/blog/stable_diffusion#how-does-stable-diffusion-work
Before using the model, you need to accept the model license in order to download and use the weights.
For access tokens, see: https://huggingface.co/docs/hub/security-tokens
Below is a sample Python implementation of the Stable Diffusion model optimized with DeepSpeed MII, without the native deployments.
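A minimal sketch of what such a script (dsmii.py) could look like. It is not the author's original listing. It assumes the diffusers StableDiffusionPipeline is used to load the weights and that optimization is applied by calling deepspeed.init_inference directly, which avoids the gRPC and Azure ML deployment paths. The --save_path and --output flags and the HF_TOKEN environment variable lookup are hypothetical additions, and the exact init_inference arguments may need adjusting for the installed DeepSpeed version.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the flags used by the example command."""
    parser = argparse.ArgumentParser(description="Stable Diffusion with DeepSpeed inference")
    parser.add_argument("--model_name", default="CompVis/stable-diffusion-v1-4")
    parser.add_argument("--prompt", required=True)
    # Hypothetical extras, not part of the original command.
    parser.add_argument("--save_path", default="model")
    parser.add_argument("--output", default="output.jpg")
    return parser.parse_args(argv)


def build_pipeline(model_name, save_path, hf_token=None):
    """Load the pipeline in fp16, save a copy, and optimize it with DeepSpeed."""
    # Heavy imports stay local so argument parsing works without a GPU.
    import torch
    import deepspeed
    from diffusers import StableDiffusionPipeline
    from huggingface_hub import login

    if hf_token:
        login(token=hf_token)  # requires the model license to be accepted

    pipe = StableDiffusionPipeline.from_pretrained(model_name, torch_dtype=torch.float16)
    # Assumption: save the unmodified weights so they can be zipped for serving.
    pipe.save_pretrained(save_path)
    pipe = pipe.to("cuda")
    # Assumption: kernel injection modifies the pipeline's modules in place.
    deepspeed.init_inference(pipe, dtype=torch.float16, replace_with_kernel_inject=True)
    return pipe


def main(argv=None):
    args = parse_args(argv)
    pipe = build_pipeline(args.model_name, args.save_path, os.environ.get("HF_TOKEN"))
    image = pipe(args.prompt).images[0]  # a PIL image
    image.save(args.output)
```

Calling main() from the usual `if __name__ == "__main__":` guard turns the sketch into a command-line script. The weights are saved before the DeepSpeed call so that the saved folder contains a plain pipeline that can be reloaded elsewhere.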
The script requires the pillow, deepspeed-mii, and huggingface-hub packages.
python dsmii.py --model_name CompVis/stable-diffusion-v1-4 --prompt "A photo of a golden retriever puppy wearing a shirt. Background office"

Zip the folder where the model is saved. In this case, that is the model directory:
cd model
zip -r ../model.zip *
torch-model-archiver --model-name stable-diffusion --version 1.0 --handler custom_handler.py --extra-files model.zip -r requirements.txt
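The archiver command refers to a custom_handler.py. A minimal sketch of such a handler follows, under several assumptions: the class name DiffusionHandler is hypothetical, the handler is a plain class exposing the initialize and handle entry points (subclassing TorchServe's BaseHandler is the more common choice), model.zip holds the contents of the saved pipeline folder, and the response is returned as raw JPEG bytes.

```python
import io
import os
import zipfile


class DiffusionHandler:
    """Sketch of a TorchServe handler with class-level entry points."""

    def __init__(self):
        self.pipe = None
        self.initialized = False

    def initialize(self, context):
        # Heavy imports stay local so the module imports without a GPU.
        import torch
        import deepspeed
        from diffusers import StableDiffusionPipeline

        model_dir = context.system_properties.get("model_dir")
        weights_dir = os.path.join(model_dir, "model")
        # model.zip arrives through --extra-files; unpack it once at load time.
        with zipfile.ZipFile(os.path.join(model_dir, "model.zip")) as archive:
            archive.extractall(weights_dir)

        self.pipe = StableDiffusionPipeline.from_pretrained(
            weights_dir, torch_dtype=torch.float16
        ).to("cuda")
        # Assumption: same DeepSpeed-Inference call as when generating locally.
        deepspeed.init_inference(self.pipe, dtype=torch.float16, replace_with_kernel_inject=True)
        self.initialized = True

    def preprocess(self, requests):
        """Extract one prompt string per request in the batch."""
        prompts = []
        for request in requests:
            payload = request.get("data") or request.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = payload.decode("utf-8")
            prompts.append(payload)
        return prompts

    def inference(self, prompts):
        return self.pipe(prompts).images  # list of PIL images

    def postprocess(self, images):
        """Encode each image as JPEG bytes, one response per request."""
        responses = []
        for image in images:
            buffer = io.BytesIO()
            image.save(buffer, format="JPEG")
            responses.append(buffer.getvalue())
        return responses

    def handle(self, data, context):
        return self.postprocess(self.inference(self.preprocess(data)))
```

Loading and optimizing the pipeline in initialize means the cost is paid once per worker, not once per request.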
Config.properties
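As an illustration only, the helper below writes a config.properties with a handful of standard TorchServe settings. Every value is an assumption to adapt: the model_store directory name is hypothetical and must contain the generated stable-diffusion.mar, the timeout is an arbitrary choice for slow image generation, and install_py_dep_per_model is enabled because the archive was built with a requirements file.

```python
from pathlib import Path

# Hypothetical settings; adjust every value to the actual deployment.
SETTINGS = {
    "inference_address": "http://0.0.0.0:8080",
    "management_address": "http://0.0.0.0:8081",
    "model_store": "model_store",
    "load_models": "stable-diffusion.mar",
    "install_py_dep_per_model": "true",
    "default_response_timeout": "600",
}


def render_properties(settings):
    """Render key=value lines in the Java properties style TorchServe reads."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


def write_config(path="config.properties", settings=SETTINGS):
    """Write the rendered settings to disk and return the path."""
    Path(path).write_text(render_properties(settings))
    return path
```

Calling write_config() produces a plain text file that can be passed to torchserve through --ts-config.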
torchserve --start --ts-config config.properties
python query.py --url "http://localhost:8080/predictions/stable-diffusion" --prompt "a photo of an astronaut riding a horse on mars"
The generated image is written to a timestamped file such as output-20221027213010.jpg
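A client along the lines of query.py could be sketched as follows, using only the standard library. It assumes the handler returns the image as raw JPEG bytes in the response body and that the prompt is sent as the plain-text POST body; the function names and the timeout value are hypothetical.

```python
import argparse
import urllib.request
from datetime import datetime


def output_filename(now=None):
    """Build a timestamped name such as output-20221027213010.jpg."""
    now = now or datetime.now()
    return now.strftime("output-%Y%m%d%H%M%S.jpg")


def query(url, prompt, timeout=600):
    """POST the prompt to the prediction endpoint and return the raw response."""
    request = urllib.request.Request(url, data=prompt.encode("utf-8"), method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def main(argv=None):
    parser = argparse.ArgumentParser(description="Query a TorchServe endpoint")
    parser.add_argument("--url", required=True)
    parser.add_argument("--prompt", required=True)
    args = parser.parse_args(argv)

    image_bytes = query(args.url, args.prompt)
    path = output_filename()
    with open(path, "wb") as handle:
        handle.write(image_bytes)
    print(f"Image written to {path}")
```

If the handler returns a different payload, for example a JSON-encoded pixel array, the file-writing step in main() would need to decode it first.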
Production checklist
This blog described how DeepSpeed MII can be used without the gRPC or Azure ML deployments. This removes the need to run a gRPC server in a local deployment and the Azure ML dependency in a cloud deployment, so DeepSpeed MII optimized models can be used with other serving solutions.
1. What is DeepSpeed-MII?
DeepSpeed-MII is an open-source library that makes inference of large models faster and cheaper using optimizations like ZeroQuant and tensor slicing.
2. Why integrate DeepSpeed-MII with TorchServe?
TorchServe enables flexible, production-ready deployment. Combining it with DeepSpeed-MII brings low-latency inference without being locked to gRPC or Azure ML.
3. Which models are supported by DeepSpeed-MII?
MII supports a wide range of NLP and vision models including GPT, BERT, Bloom, and Stable Diffusion.
4. How does DeepSpeed-MII reduce inference costs?
By using quantization, kernel fusion, and multi-GPU optimizations, DeepSpeed-MII minimizes GPU usage and cuts cloud bills.
5. Can DeepSpeed-MII scale beyond a single GPU?
Yes. The same optimizations scale from a single GPU to multi-GPU clusters, making it enterprise-ready.

