Imagine deploying chatbots directly on your personal computer. Not every chatbot needs real-time inference on expensive H100 GPUs! With the right choice of model and infrastructure, LLM use cases like chatbots can run on your laptop.
A key decision is which variant of Large Language Model (LLM) to use, since the model's parameter count strongly influences inference cost. Tasks such as correcting errors in patient data, converting formats, or extracting key information from discharge summaries can all be handled offline, without expensive hardware such as GPUs.
Let’s dive further into this concept with an example: extracting key information from a discharge summary.
Sample Discharge Summary
The sample discharge summary can be downloaded from here.
The discharge summary of the patient comprises 3020 characters, or approximately 727 tokens.
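If you want to reproduce the token count yourself, here is a minimal sketch using the Llama-2 tokenizer (the local file name is illustrative; the gated model requires the access approval described below):

```python
from transformers import AutoTokenizer

# Gated model: requires approved access and a prior `huggingface-cli login`
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

with open("discharge_summary.txt") as f:  # illustrative file name
    summary = f.read()

print(len(summary), "characters,", len(tokenizer.encode(summary)), "tokens")
```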
In this blog, we compare the performance of the Llama-2 model running on a GPU with the same model running on a CPU via llamacpp.
Prompt
Let’s prepare a simple prompt template that we can feed to the LLMs to extract answers; a sketch of filling it in Python follows the template.
You are a helpful and polite assistant. From the given context, extract the answers based on the user question. If the answer is not present in the context, just say "Information not available"
Context:
{context}
Question:
{question}
Answer:
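As a small sketch of how this template is filled before being sent to the model (the helper name is illustrative, not taken from the original scripts):

```python
PROMPT_TEMPLATE = """You are a helpful and polite assistant. From the given context, extract the answers based on the user question. If the answer is not present in the context, just say "Information not available"
Context:
{context}
Question:
{question}
Answer:"""

def build_prompt(context: str, question: str) -> str:
    """Fill the extraction template with the discharge summary and a question."""
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt("<discharge summary text>", "Any history of allergies ?")
```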
Information extraction using Llama-2
Setup
Llama-2 is an open-source Large Language Model (LLM) developed by Meta. It is licensed for commercial use and claims to outperform GPT-3.5 on specific tasks. Ensure the torch and transformers packages are installed on your system.
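Both can be installed with pip:
pip install torch transformers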
Model Download
Now, let’s put our assessment to the test using the Llama-2 7B chat version of the model, available on the Hugging Face platform: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf.
Before running this test, note that you must first request access via the Meta form at the following link: https://ai.meta.com/resources/models-and-libraries/llama-downloads/
Infrastructure
The Llama-2 7B model can be loaded on a single GPU (either a T4 or an A10).
Inference Script
The inference script loads the Llama-2 7B model on a single NVIDIA T4 GPU and runs inference. The inference script can be accessed from here.
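Since the script is linked rather than reproduced here, the sketch below shows what such a script might look like with the standard transformers API; the float16 loading and greedy generation settings are assumptions chosen to fit a 16 GB T4, not details taken from the original script.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# float16 weights keep the 7B model within the T4's 16 GB of memory
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def answer(prompt: str, max_new_tokens: int = 256) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```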
Results
Let’s tabulate the inference latency and the text generated by Llama-2.
| Sno | Question | Llama-7b-chat response | Time taken (secs) |
| --- | --- | --- | --- |
| 1 | What are the medications prescribed to the patient | The medications prescribed to the patient are: * Lorcet 10/650 mg (dispensed 24 tablets, one tablet to be taken by mouth q.6h as needed for pain) * Ancef IV (continued via PICC line) Information not available for the other medications mentioned in the context. | 6.7 |
| 2 | Social history of the patient | The patient denies smoking or drinking. | 1.5 |
| 3 | Any history of allergies ? | Yes, the patient has an allergy to Accutane. | 1.7 |
| 4 | Patient’s admission and discharge date ? | The patient was admitted on June 14, 2007, and discharged on June 19, 2007. | 2.9 |
| 5 | Name of the patient ? | Information not available. The name of the patient is not provided in the given context. | 1.9 |
| 6 | Describe patient’s illness | The patient presented to Dr. X’s office on June 14, 2007, complaining of a painful left foot. They had been treated conservatively in office for approximately 5 days, but symptoms progressed, leading to the decision for incision and drainage. The patient underwent I&D surgery and was admitted to the hospital. They were placed on vancomycin 1 g q.12h after surgery and later changed to Ancef 2 g IV every 8 hours. The patient progressively improved with IV antibiotics and local wound care and was discharged from the hospital on June 19, 2007, in excellent | 10.7 |
| 7 | What are the lab tests taken by the patient | Lab tests taken by the patient are not mentioned in the given context. Therefore, the answer is “Information not available.” | 2.4 |
| 8 | List the diseases and its corresponding ICD10 codes | 1. Abscess with cellulitis – ICD10 code: L97.0 2. Incision and drainage – ICD10 code: 997.01 3. Left foot – ICD10 code: L50.0 Information not available for the other questions. | 5.6 |
| 9 | Extract the insurance claim details | Information not available. The discharge summary does not mention anything about insurance claim details. | 2.0 |
| 10 | Summarize the discharge summary | The patient was admitted to the hospital with an abscess and cellulitis in their left foot. They underwent incision and drainage in the operating room and were placed on IV antibiotics. After a five-day hospital stay, the patient was discharged with instructions to keep their foot elevated, avoid excessive ambulation, and keep their dressing dry and intact. The patient was given written and oral instructions about wound care before discharge and was continued on IV antibiotics at home. The patient’s vital signs were stable, and they were afebrile prior to discharge. The discharge summary includes the patient’s history, physical examination findings, laboratory results, and | 10.9 |
Information extraction using llamacpp
The primary goal of llamacpp is to run Large Language Models with 4-bit integer quantization on a MacBook or other commodity hardware. Llamacpp uses its own model format, ggml (since superseded by the GGUF format), which allows widely used open-source models such as LLaMA, Alpaca, Vicuna, and Falcon to run effectively on CPU-only machines.
Check out the llamacpp GitHub repository for more details: https://github.com/ggerganov/llama.cpp
Setup
Setting up llamacpp is simple and straightforward: clone the llamacpp repository in a terminal and run the make command.
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make -j
This produces an executable named “main” in the directory, which is used to run inference.
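For example, a quick test from the command line might look like this (the model path is illustrative; see the Model Download section below for obtaining the file):
./main -m ./models/llama-2-7b-chat.Q4_K_M.gguf -p "Hello" -n 64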
If you are not familiar with the C++ version and its configuration, Python bindings for llamacpp are also available; see the abetlen/llama-cpp-python repository for details: https://github.com/abetlen/llama-cpp-python
The llama-cpp-python package can be installed with a single pip command:
pip install llama-cpp-python
Model Download
As mentioned previously, llamacpp operates on its own file format, ggml (now succeeded by GGUF). Let’s download quantized versions of the Llama-2 7B and 13B chat models from the Hugging Face platform.
Several quantization methods are available for download; for our purpose, we will use a Q4 (4-bit) quantized version. Reference: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF
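One way to fetch a Q4 file is via huggingface_hub; the exact filename below (the Q4_K_M variant) is an assumption based on the naming in TheBloke’s repo, so check the repo’s file list before downloading:

```python
from huggingface_hub import hf_hub_download

# Downloads into the local Hugging Face cache and returns the resolved path
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7b-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",  # assumed Q4 variant name
)
print(model_path)
```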
Inference Script
Let’s craft a simple, reusable script named “inference.py” to test our questions, running streamlined inference with llama-cpp-python. The inference script can be accessed from here.
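Again as a sketch rather than the original inference.py, a minimal llama-cpp-python version might look like this (the context size and generation parameters are assumptions, not values from the original script):

```python
from llama_cpp import Llama

# Load the Q4-quantized GGUF model on CPU; n_ctx must cover the
# ~727-token summary plus the question and the generated answer.
llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048, verbose=False)

def answer(prompt: str, max_tokens: int = 256) -> str:
    result = llm(prompt, max_tokens=max_tokens, temperature=0.0, stop=["Question:"])
    return result["choices"][0]["text"].strip()
```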
Result
Let’s tabulate the results.
| Sno | Question | Llama-7b-ggml response | Time taken (secs) |
| --- | --- | --- | --- |
| 1 | What are the medications prescribed to the patient | Lorcet 10/650 mg (dispense 24 tablets) | 9.8 |
| 2 | Social history of the patient | The patient denies smoking or drinking. | 5.6 |
| 3 | Any history of allergies ? | Yes, the patient has a history of allergy to Accutane. | 7.3 |
| 4 | Patient’s admission and discharge date ? | The patient was admitted on June 14, 2007 and discharged on June 19, 2007. | 7.8 |
| 5 | Name of the patient ? | Information not available. | 4.0 |
| 6 | Describe patient’s illness | Based on the discharge summary, the patient presented to the clinic with a painful left foot and was diagnosed with an abscess with cellulitis. The patient underwent incision and drainage of the abscess and was admitted for IV antibiotics and wound care. The patient was discharged from the hospital on 06/19/07 in excellent condition after a successful surgery and proper wound care. | 31.0 |
| 7 | What are the lab tests taken by the patient | Lab tests were never abnormal. | 6.1 |
| 8 | List the diseases and its corresponding ICD10 codes | Abscess – ICD10 code is L25.1 | 8.5 |
| 9 | Extract the insurance claim details | Not available | 6.2 |
| 10 | Summarize the discharge summary | The patient underwent incision and drainage of an abscess on their left foot. They were prescribed Lorcet 10/650 mg for pain, Ancef IV via PICC line, and home health administration of IV antibiotics. The patient was instructed to keep their foot elevated, wear a surgical shoe for ambulation, avoid excessive ambulation, and contact Dr. X for any follow-up care. The patient’s vital signs were stable before discharge, and they were afebrile | 46.1 |
Inference Latency Comparison
The results clearly show that llamacpp takes much longer when generating lengthy answers (questions 6 and 10). For the remaining questions, the gap between llamacpp and GPU-hosted Llama-2 is considerably smaller.
Cost Analysis
Running Llama-2 on a GPU (AWS g4dn.xlarge) costs $0.526 per hour, which accumulates to an estimated annual cost of $4607.
On the other hand, hosting llamacpp on a CPU instance (AWS t3.xlarge) costs $0.1664 per hour, translating to an approximate annual expenditure of $1457.
This cuts the overall cost to less than a third, a reduction of roughly 68%. It’s important to note that this is an approximate assessment based on AWS on-demand infrastructure rates.
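The annual figures follow directly from the hourly rates, assuming 24/7 on-demand operation:

```python
HOURS_PER_YEAR = 24 * 365

gpu_annual = 0.526 * HOURS_PER_YEAR    # g4dn.xlarge -> ~$4608
cpu_annual = 0.1664 * HOURS_PER_YEAR   # t3.xlarge   -> ~$1458

print(f"savings: {1 - cpu_annual / gpu_annual:.0%}")  # ~68%
```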
Conclusion
In conclusion, our exploration of Large Language Models (LLMs) has highlighted several practical considerations. Llamacpp can be used to build capable applications, such as FreedomGPT, that run entirely on your own computer.
While llamacpp offers significant infrastructure cost savings, it comes with trade-offs: 4-bit quantization reduces model size, but it also introduces longer inference latency and diminished accuracy, as the results above illustrate. Opting for llamacpp should therefore be a well-informed decision.
References:
- Llama 2: https://ai.meta.com/llama/
- llama.cpp: https://github.com/ggerganov/llama.cpp
- llama-cpp-python: https://github.com/abetlen/llama-cpp-python
- FreedomGPT: https://github.com/ohmplatform/FreedomGPT