DeepSpeed MII Made Easy - Ideas2IT


Incredibly powerful text generation and image generation models such as GPT-2, BERT, BLOOM 176B, and Stable Diffusion are now available to anyone with access to a handful of GPUs, or even a single one. Yet their application is still restricted by two critical factors: inference latency and cost.

What is DeepSpeed-MII?

DeepSpeed-MII is a new open-source Python library from DeepSpeed, aimed at making low-latency, low-cost inference of powerful models not only feasible but also easily accessible.

  • MII offers access to highly optimized implementations of thousands of DL models.
  • MII-supported models achieve significantly lower latency and cost compared to their original implementation.
  • To enable low latency/cost inference, MII leverages an extensive set of optimizations from DeepSpeed-Inference such as deep fusion for transformers, automated tensor-slicing for multi-GPU inference, on-the-fly quantization with ZeroQuant, and others.

Performance Metrics

Native Deployment Options

DeepSpeed MII comes prepackaged with two deployment options:

  1. Local Deployment with gRPC server
  2. Azure ML Endpoints
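For reference, the native local deployment looks roughly like the following. This is a sketch based on MII's legacy deployment API; the deployment name and prompt are illustrative, and running it requires the deepspeed-mii package and a GPU.

```python
# Sketch of MII's native local deployment, which starts a gRPC server
# hosting the optimized model. Deployment name and prompt are illustrative.
def deploy_and_query(prompt):
    import mii  # requires the deepspeed-mii package and a CUDA GPU

    # Spins up a local gRPC server with the DeepSpeed-optimized model.
    mii.deploy(task="text-to-image",
               model="CompVis/stable-diffusion-v1-4",
               deployment_name="sd_deploy")

    # Query the running gRPC server through MII's query handle.
    generator = mii.mii_query_handle("sd_deploy")
    return generator.query({"query": prompt})
```

It is exactly this always-on gRPC server (or the Azure ML equivalent) that the rest of this post works around.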

The local deployment option carries the overhead of running an extra gRPC server for model inferencing, while the Azure ML option introduces a dependency on the Azure cloud.

Being bound to a gRPC or Azure ML style of deployment makes it difficult to serve the optimized model through other model servers.

Non-Native Deployment Example:

This blog covers how to serve a DeepSpeed MII optimized Stable Diffusion model via TorchServe, bypassing the default deployment options offered by DeepSpeed MII.

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.


Before using the model, you need to accept the model license in order to download and use the weights.

For access tokens, refer to the Hugging Face documentation on user access tokens.

Below is a sample Python implementation of the Stable Diffusion model optimized with DeepSpeed MII, without the native deployments.

The script requires the pillow, deepspeed-mii, and huggingface-hub packages.

Sample Python script:
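A minimal sketch of such a script is shown below. It applies DeepSpeed-Inference (the engine MII uses under the hood) directly to a diffusers pipeline, so no gRPC server is involved. The script name, argument names, and the use of the diffusers package are assumptions; running it requires torch, diffusers, deepspeed, and a CUDA GPU.

```python
# sd_inference.py (name assumed): Stable Diffusion optimized with
# DeepSpeed-Inference kernels, with no gRPC or Azure ML deployment.
import argparse
import time


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", default="CompVis/stable-diffusion-v1-4")
    parser.add_argument("--prompt", required=True)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)

    # Heavy imports are deferred so the module can be imported without a GPU.
    import torch
    import deepspeed
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        args.model_name, torch_dtype=torch.float16).to("cuda")

    # Inject DeepSpeed's optimized inference kernels into the pipeline;
    # exact pipeline support varies by DeepSpeed version.
    pipe = deepspeed.init_inference(pipe, dtype=torch.float16,
                                    replace_with_kernel_inject=True)

    image = pipe(args.prompt).images[0]
    out_file = time.strftime("output-%Y%m%d%H%M%S.jpg")
    image.save(out_file)
    print(out_file)
```

With an `if __name__ == "__main__": main()` guard added, this can be invoked from the command line as in the inference query below.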

Inference query:

python --model_name CompVis/stable-diffusion-v1-4 --prompt "A photo of a golden retriever puppy wearing a shirt. Background office"

TorchServe Implementation:

Download Model:
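One way to fetch the weights locally is sketched below, using diffusers' `save_pretrained`; the function name is ours, and the `model` directory matches the folder zipped in the next step. It assumes you have accepted the model license and logged in with `huggingface-cli login`.

```python
# Sketch: download the Stable Diffusion weights into a local folder so they
# can be packaged for TorchServe. Requires diffusers and torch, plus a
# Hugging Face login after accepting the model license.
def download_model(model_name="CompVis/stable-diffusion-v1-4", save_dir="model"):
    # Imported lazily so the module loads even where diffusers is absent.
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_name)
    pipe.save_pretrained(save_dir)  # writes weights and configs into ./model
    return save_dir
```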

Compress Model:

Zip the folder where the model is saved; in this case, the model directory:

cd model
zip -r ../model.zip *

For advanced TorchServe handler examples, refer to the TorchServe documentation.
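A custom handler for this model might be sketched as follows, using TorchServe's module-level `handle(data, context)` entry point. The helper names, the lazy model loading, and the base64-encoded response format are our assumptions, not part of the original post.

```python
# Sketch of a module-level TorchServe handler for the DeepSpeed-optimized
# Stable Diffusion model. Heavy imports are deferred to first request.
import base64
import io

_pipe = None  # pipeline loaded once, on the first real request


def extract_prompt(data):
    # TorchServe passes a list of requests; each body may be bytes or a dict.
    body = data[0].get("data") or data[0].get("body")
    if isinstance(body, (bytes, bytearray)):
        body = body.decode("utf-8")
    if isinstance(body, dict):
        return body.get("prompt", "")
    return str(body)


def _load_pipeline(context):
    import torch
    import deepspeed
    from diffusers import StableDiffusionPipeline

    model_dir = context.system_properties.get("model_dir")
    pipe = StableDiffusionPipeline.from_pretrained(
        model_dir, torch_dtype=torch.float16).to("cuda")
    # Inject DeepSpeed-Inference kernels into the UNet, the dominant cost.
    pipe.unet = deepspeed.init_inference(pipe.unet, dtype=torch.float16,
                                         replace_with_kernel_inject=True)
    return pipe


def handle(data, context):
    global _pipe
    if data is None:  # TorchServe's initialization call
        return None
    if _pipe is None:
        _pipe = _load_pipeline(context)
    image = _pipe(extract_prompt(data)).images[0]
    buf = io.BytesIO()
    image.save(buf, format="JPEG")
    # One response entry per request in the batch, base64-encoded.
    return [base64.b64encode(buf.getvalue()).decode("utf-8")]
```

TorchServe also supports class-based handlers via `BaseHandler`; the function form above is simply the shortest to follow.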

Generate MAR File:

torch-model-archiver --model-name stable-diffusion --version 1.0 --handler <handler_file> --extra-files <model_zip> -r requirements.txt

Here <handler_file> is your custom TorchServe handler script and <model_zip> is the zipped model archive from the previous step.

Start TorchServe:

torchserve --start --ts-config <config_file>
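A minimal config.properties for this setup might look like the following; all values are assumptions, so adjust the ports, model store path, and timeout to your environment.

```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
model_store=model_store
load_models=stable-diffusion.mar
# image generation can be slow; allow a generous response timeout
default_response_timeout=300
```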

Run Inference:

python --url "http://localhost:8080/predictions/stable-diffusion" --prompt "a photo of an astronaut riding a horse on mars"
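The client can be sketched with the standard library alone. The function names are ours, and the server is assumed to return the generated image base64-encoded; adapt the decoding if your handler responds differently.

```python
# Sketch of the inference client: POST the prompt to TorchServe and write
# the returned image to a timestamped file.
import base64
import json
import time
import urllib.request


def output_filename(now=None):
    # e.g. output-20221027213010.jpg
    return time.strftime("output-%Y%m%d%H%M%S.jpg", now or time.localtime())


def run_inference(url, prompt):
    req = urllib.request.Request(
        url,
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        payload = resp.read()
    path = output_filename()
    with open(path, "wb") as f:
        # assumes the server returns the image base64-encoded
        f.write(base64.b64decode(payload))
    return path
```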


The generated image will be written to a timestamped file such as output-20221027213010.jpg.


This blog described how DeepSpeed MII can be used without the gRPC or Azure ML deployments. This removes the need for a gRPC server in local deployments and the Azure ML dependency in cloud deployments, enabling DeepSpeed MII optimized models to be served by other serving solutions.