
Running LLMs on your personal computer

Imagine being able to deploy chatbots directly on your personal computer. Surprisingly, not all chatbot workloads require real-time inference on expensive H100 chips. With the right choice of model and infrastructure, running LLM use cases like chatbots on your laptop is entirely feasible.

A pivotal factor is which variant you pick from the spectrum of Large Language Models (LLMs): the model's parameter count strongly influences inference cost. For instance, tasks like correcting patient data inaccuracies, converting formats, or extracting key terms from discharge summaries can all be handled offline, without expensive hardware such as GPUs.

Let's explore this concept with an example: extracting information from a discharge summary.

Sample Discharge Summary

Below is a sample discharge summary of a patient. The summary comprises 3,020 characters, or roughly 727 tokens. In this blog, we will compare the performance of the Llama-2 model on a GPU with that of llamacpp on a CPU, using this summary as the context.

Prompt

Let's prepare a simple prompt that we can feed into the LLMs to extract answers.
You are a helpful and polite assistant. From the given context, extract the answers based on the user question. If the answer is not present in the context, just say "Information not available"
Context:
{context}
Question:
{question}
Answer:
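As an illustration, the template can be filled in Python before being sent to either model. This is a minimal sketch; the helper name build_prompt is our own, not from the original scripts:

PROMPT_TEMPLATE = """You are a helpful and polite assistant. From the given context, extract the answers based on the user question. If the answer is not present in the context, just say "Information not available"
Context:
{context}
Question:
{question}
Answer:"""

def build_prompt(context: str, question: str) -> str:
    # Substitute the discharge summary and a user question into the template.
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt("<discharge summary text>", "Any history of allergies?")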

Information extraction using Llama-2

Setup

Llama-2, developed by Meta, is an open-source Large Language Model (LLM). It is licensed for commercial use, and Meta claims it outperforms GPT-3.5 on specific tasks. Ensure the torch and transformers packages are installed on your system.
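A typical installation looks like the following; the accelerate package is an extra assumption on our part, needed for automatic device placement in the inference sketch later:

pip install torch transformers accelerate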

Model Download

Now, let's put our assessment to the test using the Llama-2 7B chat version of the model, available on the Hugging Face platform at https://huggingface.co/meta-llama/Llama-2-7b-chat-hf. Before running this test, note that you must first request access using the Meta form at https://ai.meta.com/resources/models-and-libraries/llama-downloads/
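Because the repository is gated, you also need to authenticate with your Hugging Face account once; the weights are then fetched automatically on the first from_pretrained call:

huggingface-cli login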

Infrastructure

The Llama-2 7B model can be loaded on a single GPU (either a T4 or an A10).

Inference Script

The inference script loads the Llama-2 7B model on a single NVIDIA T4 GPU and carries out the inference process.
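As a reference, here is a minimal sketch of such a script, assuming the Hugging Face transformers text-generation pipeline with fp16 weights; generation settings such as max_new_tokens are illustrative, not the exact values used for the results below:

import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # fp16 so the 7B model fits in a single T4's 16 GB
    device_map="auto",          # requires the accelerate package
)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

def answer(prompt: str) -> str:
    # Run one prompt (built as in the Prompt section) and report latency.
    start = time.time()
    result = generator(prompt, max_new_tokens=256, do_sample=False, return_full_text=False)
    print(f"Time taken: {time.time() - start:.1f} secs")
    return result[0]["generated_text"]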

Results

Let's tabulate the inference latency and the text generated by Llama-2:

| S.No | Question | Llama-2-7b-chat response | Time taken (secs) |
| --- | --- | --- | --- |
| 1 | What are the medications prescribed to the patient | The medications prescribed to the patient are: * Lorcet 10/650 mg (dispensed 24 tablets, one tablet to be taken by mouth q.6h as needed for pain) * Ancef IV (continued via PICC line). Information not available for the other medications mentioned in the context. | 6.7 |
| 2 | Social history of the patient | The patient denies smoking or drinking. | 1.5 |
| 3 | Any history of allergies? | Yes, the patient has an allergy to Accutane. | 1.7 |
| 4 | Patient's admission and discharge date? | The patient was admitted on June 14, 2007, and discharged on June 19, 2007. | 2.9 |
| 5 | Name of the patient? | Information not available. The name of the patient is not provided in the given context. | 1.9 |
| 6 | Describe patient's illness | The patient presented to Dr. X's office on June 14, 2007, complaining of a painful left foot. They had been treated conservatively in office for approximately 5 days, but symptoms progressed, leading to the decision for incision and drainage. The patient underwent I&D surgery and was admitted to the hospital. They were placed on vancomycin 1 g q.12h after surgery and later changed to Ancef 2 g IV every 8 hours. The patient progressively improved with IV antibiotics and local wound care and was discharged from the hospital on June 19, 2007, in excellent | 10.7 |
| 7 | What are the lab tests taken by the patient | Lab tests taken by the patient are not mentioned in the given context. Therefore, the answer is "Information not available." | 2.4 |
| 8 | List the diseases and its corresponding ICD10 codes | 1. Abscess with cellulitis - ICD10 code: L97.0 2. Incision and drainage - ICD10 code: 997.01 3. Left foot - ICD10 code: L50.0. Information not available for the other questions. | 5.6 |
| 9 | Extract the insurance claim details | Information not available. The discharge summary does not mention anything about insurance claim details. | 2.0 |
| 10 | Summarize the discharge summary | The patient was admitted to the hospital with an abscess and cellulitis in their left foot. They underwent incision and drainage in the operating room and were placed on IV antibiotics. After a five-day hospital stay, the patient was discharged with instructions to keep their foot elevated, avoid excessive ambulation, and keep their dressing dry and intact. The patient was given written and oral instructions about wound care before discharge and was continued on IV antibiotics at home. The patient's vital signs were stable, and they were afebrile prior to discharge. The discharge summary includes the patient's history, physical examination findings, laboratory results, and | 10.9 |

(The responses to questions 6 and 10 end mid-sentence in the original output.)

Information extraction using llamacpp

The primary objective of llamacpp is to enable running Large Language Models with 4-bit integer quantization on a MacBook or a standard computer. Llamacpp uses a distinct model format referred to as ggml, which lets widely recognized open-source models like llama, alpaca, vicuna, and falcon run effectively on CPU-only machines. Check out the llamacpp GitHub repository for more details: https://github.com/ggerganov/llama.cpp

Setup

Setting up llamacpp is straightforward. Clone the llamacpp repository in the terminal and run the make command:
git clone https://github.com/ggerganov/llama.cpp.git

cd llama.cpp
make -j
Subsequently, an executable named "main" will be produced in the directory; this binary is used to run inference. For those not acquainted with the C++ version and its configuration, Python bindings for llamacpp are available in the abetlen/llama-cpp-python repository: https://github.com/abetlen/llama-cpp-python. Installing the llama-cpp-python package takes a single pip command:
pip install llama-cpp-python

Model Download

As mentioned previously, llamacpp operates with its own file format, ggml (since superseded by the GGUF format, which the reference below uses). To proceed, let's download the quantized versions of the llama-2 7b and 13b models from the Hugging Face platform. Several quantization methods are available for download; for our purpose, we will use the Q4 quantized version. Reference: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
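One way to fetch a Q4 file from that repository is with the huggingface_hub CLI. The exact filename below is an assumption on our part; check the repository's file list for the current variants:

huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models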

Inference Script

Let's craft a simple, reusable script named "inference.py" that runs our test questions through llama-cpp-python.
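Here is a minimal sketch of what such a script could look like; the model path matches the file downloaded above, and settings like n_ctx and max_tokens are illustrative rather than the exact values used for the results below:

import time

from llama_cpp import Llama

# Load the Q4-quantized model; n_ctx=2048 comfortably fits the ~727-token summary plus a question.
llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

def answer(prompt: str) -> str:
    # Run one prompt (built as in the Prompt section) and report latency.
    start = time.time()
    result = llm(prompt, max_tokens=256, temperature=0.0)
    print(f"Time taken: {time.time() - start:.1f} secs")
    return result["choices"][0]["text"]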

Results

Let's tabulate the results:

| S.No | Question | llama-7b-ggml response | Time taken (secs) |
| --- | --- | --- | --- |
| 1 | What are the medications prescribed to the patient | Lorcet 10/650 mg (dispense 24 tablets) | 9.8 |
| 2 | Social history of the patient | The patient denies smoking or drinking. | 5.6 |
| 3 | Any history of allergies? | Yes, the patient has a history of allergy to Accutane. | 7.3 |
| 4 | Patient's admission and discharge date? | The patient was admitted on June 14, 2007 and discharged on June 19, 2007. | 7.8 |
| 5 | Name of the patient? | Information not available. | 4.0 |
| 6 | Describe patient's illness | Based on the discharge summary, the patient presented to the clinic with a painful left foot and was diagnosed with an abscess with cellulitis. The patient underwent incision and drainage of the abscess and was admitted for IV antibiotics and wound care. The patient was discharged from the hospital on 06/19/07 in excellent condition after a successful surgery and proper wound care. | 31.0 |
| 7 | What are the lab tests taken by the patient | Lab tests were never abnormal. | 6.1 |
| 8 | List the diseases and its corresponding ICD10 codes | Abscess - ICD10 code is L25.1 | 8.5 |
| 9 | Extract the insurance claim details | Not available | 6.2 |
| 10 | Summarize the discharge summary | The patient underwent incision and drainage of an abscess on their left foot. They were prescribed Lorcet 10/650 mg for pain, Ancef IV via PICC line, and home health administration of IV antibiotics. The patient was instructed to keep their foot elevated, wear a surgical shoe for ambulation, avoid excessive ambulation, and contact Dr. X for any follow-up care. The patient's vital signs were stable before discharge, and they were afebrile | 46.1 |

(The response to question 10 ends mid-sentence in the original output.)

Inference Latency Comparison

The latency figures above clearly demonstrate that llamacpp takes more time when generating lengthier texts. For the remaining questions, however, the times taken by llamacpp and Llama-2 are notably closer.

Cost Analysis

Running llama-2 on a GPU (AWS g4dn.xlarge) costs $0.526 per hour, which accumulates to an estimated $4,607 per year. On the other hand, hosting llamacpp on a CPU instance (AWS t3.xlarge) costs $0.1664 per hour, roughly $1,457 per year. That cuts the overall cost to less than a third of the GPU figure, a reduction of about 68%. It's important to note that this analysis is an approximation based on AWS on-demand rates.
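The annual figures follow directly from the hourly rates, assuming 24x7 uptime:

GPU_HOURLY, CPU_HOURLY = 0.526, 0.1664  # AWS on-demand: g4dn.xlarge vs t3.xlarge
HOURS_PER_YEAR = 24 * 365

gpu_annual = GPU_HOURLY * HOURS_PER_YEAR  # ~$4,608
cpu_annual = CPU_HOURLY * HOURS_PER_YEAR  # ~$1,458
print(f"Annual savings: {1 - cpu_annual / gpu_annual:.0%}")  # ~68%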

Conclusion

In conclusion, our exploration of Large Language Models (LLMs) has highlighted several critical considerations. Llamacpp can be used to build capable applications, like FreedomGPT, that run entirely on your own computer. While llamacpp offers significant infrastructure cost savings, it's important to acknowledge the trade-offs: 4-bit quantization reduces model size, but it introduces longer inference latency and some loss of accuracy. This underscores the need for a well-informed decision when opting for llamacpp.


Ideas2IT Team
