
Running LLMs on CPU: A Comprehensive Guide

Imagine being able to deploy a chatbot directly on your CPU. Surprisingly, not every chatbot needs real-time inference on expensive H100 chips. With the right choice of model and infrastructure, running LLM use cases such as chatbots on a CPU is entirely feasible.

A pivotal factor is which variant you choose from the spectrum of Large Language Models (LLMs): the model's parameter count strongly influences inference cost. Tasks such as correcting inaccuracies in patient data, converting formats, or extracting key terms from discharge summaries can all be handled offline, eliminating the need for expensive hardware such as GPUs.

Let's explore this further with an example: extracting keywords from a discharge summary.

Sample Discharge Summary

A patient's discharge summary serves as our working document. The summary comprises 3,020 characters, or roughly 727 tokens. In this blog, we compare the performance of the Llama-2 model on a GPU against the same model served by llamacpp on a CPU.

Prompt

Let's prepare a simple prompt that we can feed into LLMs to extract the answers.
You are a helpful and polite assistant. From the given context, extract the answers based on the user question. If the answer is not present in the context, just say "Information not available"
Context:
{context}
Question:
{question}
Answer:
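
In code, the same template can be filled in per question. Below is a minimal Python sketch, assuming the summary text is stored in a local file (the file name is illustrative):

PROMPT_TEMPLATE = """You are a helpful and polite assistant. From the given context, extract the answers based on the user question. If the answer is not present in the context, just say "Information not available"
Context:
{context}
Question:
{question}
Answer:"""

# discharge_summary.txt is an assumed local file holding the summary text
with open("discharge_summary.txt") as f:
    context = f.read()

prompt = PROMPT_TEMPLATE.format(context=context, question="Any history of allergies?")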

Information extraction using Llama-2

Setup

Llama-2, developed by Meta, is an open-source Large Language Model (LLM). It is available for commercial use and is claimed to outperform GPT-3.5 on specific tasks. Ensure the torch and transformers packages are installed on your system.
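
If they are missing, a typical installation looks like the following (accelerate is an extra assumption here; it enables the device_map-based loading used in the inference sketch later):

pip install torch transformers accelerate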

Model Download

Now, let's put this to the test using the Llama-2 7B chat version of the model, available on the Hugging Face platform: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf.

Before running this test, note that you must first request access through the Meta form at the following link: https://ai.meta.com/resources/models-and-libraries/llama-downloads/

Infrastructure

The Llama-2 7B model can be loaded on a single GPU (either a T4 or an A10).

Inference Script

The following inference script loads the Llama-2 7B model on a single NVIDIA T4 GPU and runs inference. The inference script can be accessed from here.
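
The linked script is not reproduced in this post, but a minimal sketch of what it might look like with the transformers library follows. It assumes you have been granted access to the model and are logged in to Hugging Face; the generation settings are illustrative, not the exact values behind the results below:

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# float16 keeps the 7B model within a single T4's 16 GB of memory
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def answer(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.time()
    output = model.generate(**inputs, max_new_tokens=256)
    print(f"Time taken: {time.time() - start:.2f}s")
    # decode only the newly generated tokens, skipping the prompt
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )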

Results

| Question | Response | Time Taken (s) |
| --- | --- | --- |
| What are the medications prescribed to the patient? | The medications prescribed to the patient are: Lorcet 10/650 mg (dispensed 24 tablets, one tablet to be taken by mouth q.6h as needed for pain); Ancef IV (continued via PICC line). Information not available for the other medications mentioned in the context. | 6.72 |
| Social history of the patient? | The patient denies smoking or drinking. | 1.53 |
| Any history of allergies? | Yes, the patient has an allergy to Accutane. | 1.74 |
| Patient’s admission and discharge date? | The patient was admitted on June 14, 2007, and discharged on June 19, 2007. | 2.95 |
| Name of the patient? | Information not available. The name of the patient is not provided in the given context. | 1.96 |
| Describe patient’s illness | The patient presented to Dr. X's office on June 14, 2007, complaining of a painful left foot. They had been treated conservatively in office for approximately 5 days, but symptoms progressed, leading to the decision for incision and drainage. The patient underwent I&D surgery and was admitted to the hospital. They were placed on vancomycin 1 g q.12h after surgery and later changed to Ancef 2 g IV every 8 hours. The patient progressively improved with IV antibiotics and local wound care and was discharged from the hospital on June 19, 2007, in excellent condition. | 10.77 |
| What are the lab tests taken by the patient? | Lab tests taken by the patient are not mentioned in the given context. Therefore, the answer is "Information not available." | 2.48 |
| List the diseases and its corresponding ICD10 codes | 1. Abscess with cellulitis - ICD10 code: L97.0 2. Incision and drainage - ICD10 code: 997.01 3. Left foot - ICD10 code: L50.0. Information not available for the other questions. | 5.69 |
| Extract the insurance claim details | Information not available. The discharge summary does not mention anything about insurance claim details. | 2.01 |
| Summarize the discharge summary | The patient was admitted to the hospital with an abscess and cellulitis in their left foot. They underwent incision and drainage in the operating room and were placed on IV antibiotics. After a five-day hospital stay, the patient was discharged with instructions to keep their foot elevated, avoid excessive ambulation, and keep their dressing dry and intact. The patient was given written and oral instructions about wound care before discharge and was continued on IV antibiotics at home. The patient's vital signs were stable, and they were afebrile prior to discharge. The discharge summary includes the patient's history, physical examination findings, laboratory results, and diagnostic studies | 10.9 |

Information extraction using llamacpp

The primary objective of llamacpp is to enable running Large Language Models with 4-bit integer quantization on a MacBook or a standard Windows machine. Llamacpp uses its own model file format, GGML, which allows widely recognized open-source models such as LLaMA, Alpaca, Vicuna, and Falcon to run effectively on CPU-based machines.

Check out the llama.cpp GitHub repository for more details: https://github.com/ggerganov/llama.cpp

Setup

Setting up llamacpp is straightforward. Clone the llamacpp repository in a terminal and run the make command:
git clone https://github.com/ggerganov/llama.cpp.git

cd llama.cpp
make -j
Subsequently, an executable named "main" is produced in the directory; this is the tool used to run inference. For those not acquainted with the C++ version and its configuration, an alternative exists in the form of Python bindings for llamacpp. You can find more information in the abetlen/llama-cpp-python repository: https://github.com/abetlen/llama-cpp-python

Furthermore, the llama-cpp-python package can be installed with a simple pip command:
pip install llama-cpp-python
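
If you built the C++ binary instead, inference can be run directly from the terminal once a quantized model file is in place (see the next section). The flags below are llama.cpp's standard options: -m sets the model path, -p the prompt, and -n the number of tokens to generate; the model filename here is illustrative:

./main -m ./models/llama-2-7b-chat.Q4_K_M.gguf -p "Your prompt here" -n 256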

Model Download

As mentioned previously, llamacpp operates with a distinct file format known as GGML (since superseded by GGUF, which the reference below uses). To proceed, let's download the quantized versions of both the Llama-2 7B and 13B models from the Hugging Face platform.

Various quantization methods are available for download. For our purposes, we will use the Q4 quantized version.

Reference - https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
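
To fetch the file programmatically, the huggingface_hub package offers hf_hub_download. The exact filename below is an assumption based on the repository's naming convention, so verify it against the repo's file list:

from huggingface_hub import hf_hub_download

# Downloads the Q4-quantized model into the local Hugging Face cache
# and returns the resolved file path.
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",  # assumed name; check the repo
)
print(model_path)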

Inference Script

Let's craft a straightforward, reusable script named "inference.py" to test our questions, enabling streamlined inference via llama-cpp-python. The inference script can be accessed from here.
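
The linked script is not reproduced here either, but its core might look like this minimal sketch (the model path and generation settings are illustrative):

import time
from llama_cpp import Llama

# path to the Q4-quantized GGUF file downloaded earlier (assumed location)
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

def answer(prompt: str) -> str:
    start = time.time()
    result = llm(prompt, max_tokens=256, stop=["Question:"])
    print(f"Time taken: {time.time() - start:.2f}s")
    return result["choices"][0]["text"]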

Results

| Question | Response | Time Taken (s) |
| --- | --- | --- |
| What are the medications prescribed to the patient? | Lorcet 10/650 mg (dispense 24 tablets) | 9.82 |
| Social history of the patient? | The patient denies smoking or drinking. | 5.63 |
| Any history of allergies? | Yes, the patient has a history of allergy to Accutane. | 7.34 |
| Patient’s admission and discharge date? | The patient was admitted on June 14, 2007 and discharged on June 19, 2007. | 7.85 |
| Name of the patient? | Information not available. | 4.06 |
| Describe patient’s illness | Based on the discharge summary, the patient presented to the clinic with a painful left foot and was diagnosed with an abscess with cellulitis. The patient underwent incision and drainage of the abscess and was admitted for IV antibiotics and wound care. The patient was discharged from the hospital on 06/19/07 in excellent condition after a successful surgery and proper wound care. | 31.07 |
| What are the lab tests taken by the patient? | Lab tests were never abnormal. | 6.18 |
| List the diseases and its corresponding ICD10 codes | ICD10 codes - Abscess code is L25.1 | 8.59 |
| Extract the insurance claim details | Not available | 6.21 |
| Summarize the discharge summary | The patient underwent incision and drainage of an abscess on their left foot. They were prescribed Lorcet 10/650 mg for pain, Ancef IV via PICC line, and home health administration of IV antibiotics. The patient was instructed to keep their foot elevated, wear a surgical shoe for ambulation, avoid excessive ambulation, and contact Dr. X for any follow-up care. The patient's vital signs were stable before discharge, and they were afebrile | 46.1 |

Inference Latency Comparison

Inference Latency Comparison between llama-2 and llamacpp

The chart clearly shows that llamacpp takes more time when generating longer texts. For the remaining questions, however, the latencies of llamacpp and Llama-2 are notably close.

Cost Analysis

Running llama-2 on a GPU (AWS - g4dn.xlarge) entails an hourly expense of $0.526, which accumulates to an estimated annual cost of $4607.

On the other hand, hosting llamacpp on a CPU instance (AWS - t3.xlarge) incurs a charge of $0.1664 per hour, translating to an approximate annual expenditure of $1457.
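
The annual figures follow directly from the hourly rates; a quick sanity check:

HOURS_PER_YEAR = 24 * 365              # 8,760 hours of always-on hosting

gpu_annual = 0.526 * HOURS_PER_YEAR    # ~$4,608 for g4dn.xlarge
cpu_annual = 0.1664 * HOURS_PER_YEAR   # ~$1,458 for t3.xlarge
savings = 1 - cpu_annual / gpu_annual  # ~0.68, i.e. roughly 68% cheaper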

This reduces the overall cost to less than a third of the GPU figure, a saving of roughly 68%. It's important to note that this analysis is an approximate assessment based on AWS infrastructure rates.

Conclusion

In conclusion, our exploration of Large Language Models (LLMs) has highlighted several critical considerations. Llamacpp can power capable applications, such as FreedomGPT, that run effectively on a CPU. While llamacpp offers significant infrastructure cost savings, it's important to acknowledge the trade-offs: 4-bit quantization reduces model size but introduces longer inference latency and some loss of accuracy. This underscores the need for a well-informed decision when opting for llamacpp.


Ideas2IT Team
