Incredibly powerful text-generation and image-generation models such as GPT-2, BERT, BLOOM 176B, and Stable Diffusion are now available to anyone with a handful of GPUs, or even a single one. Yet their application is still restricted by two critical factors: inference latency and cost.
What is DeepSpeed-MII?
DeepSpeed-MII is a new open-source Python library from DeepSpeed, aimed at making low-latency, low-cost inference of powerful models not only feasible but also easily accessible.
- MII offers access to highly optimized implementations of thousands of DL models.
- MII-supported models achieve significantly lower latency and cost compared to their original implementation.
- To enable low latency/cost inference, MII leverages an extensive set of optimizations from DeepSpeed-Inference such as deep fusion for transformers, automated tensor-slicing for multi-GPU inference, on-the-fly quantization with ZeroQuant, and others.
Native Deployment Options
DeepSpeed-MII comes packaged with two deployment options:
- Local Deployment with gRPC server
- Azure ML Endpoints
The local deployment offered by DeepSpeed-MII carries the overhead of running an extra gRPC server for model inference, while the cloud option depends on Azure ML. Being bound to a gRPC or Azure Machine Learning style of deployment makes it difficult to serve the optimized model from other model servers.
Non-Native Deployment Example:
This blog covers how to serve a DeepSpeed-MII-optimized Stable Diffusion model via TorchServe, bypassing the default deployment options offered by DeepSpeed-MII.
Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
Before using the model, you need to accept the model license on the Hugging Face Hub in order to download and use the weights.
For access tokens, refer to: https://huggingface.co/docs/hub/security-tokens
Below is a sample Python implementation of the Stable Diffusion model optimized with DeepSpeed-MII, without the native deployments.
The script requires the pillow, deepspeed-mii, and huggingface-hub packages.
Sample Python script
python dsmii.py --model_name CompVis/stable-diffusion-v1-4 --prompt "A photo of a golden retriever puppy wearing a shirt. Background office"
Zip the contents of the folder where the optimized model is saved. In this case:
zip -r ../model.zip *
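If the zip CLI is not available, the same archive can be produced with Python's standard-library zipfile module. This is a minimal sketch; the directory and archive names are just illustrative placeholders.

```python
import os
import zipfile


def zip_dir(src_dir: str, dest_zip: str) -> None:
    """Recursively archive src_dir into dest_zip, storing paths relative
    to src_dir (equivalent to `cd src_dir && zip -r ../model.zip *`)."""
    with zipfile.ZipFile(dest_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # arcname drops the src_dir prefix so the zip root matches
                # the folder contents, as the zip command above does.
                zf.write(full, arcname=os.path.relpath(full, src_dir))
```

Usage would be, for example, `zip_dir("model", "model.zip")` with the actual path where the optimized model was saved.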
For an advanced TorchServe handler, refer to: https://github.com/pytorch/serve/blob/master/examples/deepspeed_mii/DeepSpeed_mii_handler.py
Generate MAR File:
torch-model-archiver --model-name stable-diffusion --version 1.0 --handler custom_handler.py --extra-files model.zip -r requirements.txt
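The config.properties file referenced below is not shown in this post; a minimal example using TorchServe's standard configuration keys might look like this (the addresses, model store path, and MAR file name are assumptions for illustration):

```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
model_store=model_store
load_models=stable-diffusion.mar
```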
Start TorchServe:
torchserve --start --ts-config config.properties
Run inference:
python query.py --url "http://localhost:8080/predictions/stable-diffusion" --prompt "a photo of an astronaut riding a horse on mars"
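The query.py script itself is not reproduced here; a minimal stdlib-only sketch of what it could look like, assuming the endpoint returns raw image bytes and the timestamped output name follows the pattern shown below:

```python
import datetime
import urllib.request


def output_filename(now: datetime.datetime) -> str:
    """Build a timestamped name matching the output-YYYYMMDDHHMMSS.jpg pattern."""
    return now.strftime("output-%Y%m%d%H%M%S.jpg")


def query(url: str, prompt: str) -> bytes:
    """POST the prompt to the TorchServe prediction endpoint and return
    the raw image bytes from the response body."""
    req = urllib.request.Request(url, data=prompt.encode("utf-8"), method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

In the actual script, the --url and --prompt flags from the invocation above would be parsed with argparse, and the returned bytes written to `output_filename(datetime.datetime.now())`.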
The generated image is written to a timestamped file, e.g. output-20221027213010.jpg.
This blog described how DeepSpeed-MII can be used without the gRPC or Azure ML deployments. This removes the need for a gRPC server in local deployments and the Azure ML dependency in cloud deployments, enabling DeepSpeed-MII-optimized models to be served from other serving solutions such as TorchServe.