Large language models (LLMs) have been at the center of the generative AI revolution, especially since the emergence of ChatGPT. The potential use cases for LLMs are vast and include automating processes across many industries, such as the software supply chain, marketing, and information security. However, the full potential of LLMs has yet to be unlocked, and one significant reason is cost. The cost of incorporating LLMs into your application can range from a few cents for on-demand use cases to over $20,000 per month for hosting a single instance of an LLM on your cloud. There are also significant costs associated with fine-tuning, training, and vector search, and, of course, scale is a crucial factor. In this blog post, I will explore the reasons LLMs can be expensive and break down the costs into their major components.
Why are LLMs so expensive?
There are several aspects associated with LLM costs, not all of which will be relevant to your business, depending on how you utilize LLMs. To better understand the costs that are relevant to you, let's break them down piece by piece:
The Cost of Creating LLMs
While most of our readers won't train their own LLM from scratch, there have been reports that the cost of training LLMs such as BloombergGPT reached millions of US dollars, and these figures primarily refer to GPU costs. Training an LLM today also involves investing in research, acquiring and cleaning expensive data, and paying for many hours of human feedback through a technique called RLHF (Reinforcement Learning from Human Feedback). And while most companies that integrate LLMs into their Gen-AI applications use an LLM that another company or organization trained (like ChatGPT or Llama2), they still pay indirectly for the costs associated with creating it.
Paying for using LLMs
Assuming that you just want to use an LLM, you will encounter two main pricing models:
Pay by Token: Companies pay based on the amount of data processed by the model service. The costs are determined by the number of tokens (derived from words or symbols) processed, both for inputs and outputs. The figure below shows, as an example, how OpenAI calculates tokens.
Hosting Your Own Model: Companies host LLMs on their own infrastructure, paying for the computing resources, especially GPUs, required to run these models; they may also pay a license fee for the LLM itself.
While the pay-by-token model offers simplicity and scalability, hosting your model provides control over data privacy and operational flexibility. However, it demands significant infrastructure investment and maintenance. Let's review the two options:
Hosting an LLM on your cloud
When it comes to hosting your own model, the main cost, as mentioned, would be hardware. Consider, for example, hosting an open-source Falcon 180B on AWS. The default instance recommended by AWS is ml.p4de.24xlarge, with a listed price of almost 33 USD per hour (on-demand). This means that such a deployment would cost at least 23,000 USD per month, assuming it doesn't scale up or down and that no discounts are applied.
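The arithmetic behind that monthly figure can be sketched in a few lines. This is a back-of-the-envelope estimate, assuming a flat on-demand rate with no autoscaling and no discounts:

```python
# Back-of-the-envelope cost of one always-on hosted LLM instance.
# The 33 USD/hour rate is the on-demand list price cited above for
# ml.p4de.24xlarge; adjust for your region and any discounts.

def monthly_hosting_cost(hourly_rate_usd: float,
                         hours_per_day: float = 24,
                         days_per_month: float = 30) -> float:
    """Fixed cost of a single instance running around the clock."""
    return hourly_rate_usd * hours_per_day * days_per_month

cost = monthly_hosting_cost(33.0)  # 33 * 24 * 30 = 23,760 USD per month
```

Reserved-capacity discounts or running the instance only during business hours changes the inputs, but the fixed-cost structure stays the same.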
Scaling up and down of this AWS service may require attention, configuration changes, and optimization processes; however, the costs will still be very high for such deployments.
Paying per tokens
An alternative to hosting an LLM and paying for the hardware is to use SaaS models and pay per token. Tokens are the units vendors use to price calls to their APIs. Different vendors, like OpenAI and Anthropic, have different tokenization methods, and they charge varying prices per token based on whether it's an input token, output token, or related to the model size. In the following example, we demonstrate how OpenAI calculates a token count for a given text. It's evident that using special characters results in higher costs, while words in English consume fewer tokens. If you are using other languages, such as Hebrew, be aware that the costs may be even higher.
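To make the pay-per-token model concrete, here is a minimal cost-estimation sketch. The 4-characters-per-token heuristic is a rough rule of thumb for English, and the per-1K-token prices are illustrative placeholders, not any vendor's actual rates; real token counts come from the vendor's own tokenizer (e.g., OpenAI's tiktoken):

```python
# Rough per-request cost estimate under the pay-by-token model.
# Heuristic and prices are illustrative only.

def estimate_tokens(text: str) -> int:
    # Common rule of thumb for English: ~4 characters per token.
    # Special characters and non-English text usually tokenize less efficiently.
    return max(1, len(text) // 4)

def request_cost(prompt: str, completion: str,
                 input_price_per_1k: float,
                 output_price_per_1k: float) -> float:
    # Inputs and outputs are typically priced at different rates.
    return (estimate_tokens(prompt) / 1000 * input_price_per_1k +
            estimate_tokens(completion) / 1000 * output_price_per_1k)

cost = request_cost(prompt="Summarize this article. " * 50,
                    completion="A short summary.",
                    input_price_per_1k=0.0005,
                    output_price_per_1k=0.0015)
```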
Now that we've established how you can pay for an LLM, let's discuss the costs that are often overlooked.
Hidden costs of LLM applications
GPT-For-Work has developed an OpenAI pricing calculator for GPT products. We utilized it to estimate the cost of an AI application that processes 5 requests per minute, and were immediately faced with the question: how many tokens will be sent in such a case? The answer is complex, as this number is influenced by several hidden and unknown factors:
The size of the user input and the generated output can vary significantly.
There are hidden costs associated with application prompts.
Utilizing agent libraries typically incurs additional API calls in the background to LLMs, in order to implement frameworks like ReAct or to summarize data for buffers.
These hidden costs are often the primary cause of bill shock when transitioning from the prototyping phase to production. Therefore, generating visibility into these costs is crucial.
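The way these hidden factors compound can be sketched as follows. All the numbers here are assumptions chosen for illustration; real system prompts, agent step counts, and buffer sizes vary per application:

```python
# Illustrative sketch of how agent frameworks inflate token usage behind
# a single user request. All numbers are assumptions for demonstration.

def agent_request_tokens(user_tokens: int, system_prompt_tokens: int,
                         react_steps: int, tokens_per_step: int,
                         summary_tokens: int) -> int:
    # Each ReAct step re-sends the system prompt plus the user input and
    # accumulated reasoning; memory buffers add extra summarization calls.
    per_step = system_prompt_tokens + user_tokens + tokens_per_step
    return react_steps * per_step + summary_tokens

visible = 200  # tokens the user actually typed
total = agent_request_tokens(user_tokens=200, system_prompt_tokens=500,
                             react_steps=4, tokens_per_step=300,
                             summary_tokens=400)
hidden = total - visible  # tokens billed that the user never sees
```

Under these assumed numbers, a 200-token user request triggers thousands of billed tokens, which is exactly the kind of multiplier that causes bill shock at production scale.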
The emergence of vector databases
Most of the previous discussion has focused on hosting LLMs and the exchange of data with an LLM. However, LLMs have proven themselves useful not only for on-demand generation use cases but also for creating a new format for data storage known as embeddings. These embeddings are vectors (arrays of numbers) that can compress various media types such as text, images, audio, and video. Once data is compressed into vectors, it can be stored and indexed for advanced search purposes. Weaviate, a leading vector database solution, has demonstrated the efficacy of storing and retrieving data in a vectorized form. What makes it special is the ability to search for media objects that are conceptually similar, such as emails that share the same tone or mention the same topics, even if the exact words are not used.
These new databases are significantly more expensive, as the creation and updating of embeddings are done through invoking Large Language Models (LLMs). Additionally, searching the database requires more advanced and costly techniques.
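The core comparison a vector database performs can be sketched in a few lines: conceptually similar items have embeddings that point in similar directions, typically measured by cosine similarity. The 3-dimensional vectors below are toy values (real embeddings have hundreds or thousands of dimensions):

```python
# Minimal sketch of embedding comparison via cosine similarity.
# The vectors are toy stand-ins for real, high-dimensional embeddings.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a)) *
            math.sqrt(sum(y * y for y in b)))
    return dot / norm

email_a = [0.9, 0.1, 0.3]   # e.g. "friendly tone, project update"
email_b = [0.8, 0.2, 0.4]   # similar tone and topic, different wording
email_c = [-0.7, 0.9, 0.0]  # unrelated content
# email_a scores much closer to email_b than to email_c
```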
How to control the cost of LLMs?
Improving the underlying hardware is a key strategy for controlling the cost of LLMs. By investing in faster or more advanced GPUs, you can significantly increase the speed of inference, making your LLM applications run more efficiently. This improvement in performance can offset higher costs by reducing the time it takes to generate results, thereby balancing the trade-offs between speed, cost, and accuracy.
Choosing the Size of the LLM
The size of the LLM plays a crucial role in its performance and cost. As illustrated in the accompanying graph, larger LLMs typically offer greater accuracy but at the expense of higher costs due to increased resource requirements. For example, upgrading from GPT-3.5 to GPT-4 can provide more accurate results but will also incur higher expenses. This decision requires careful consideration of the balance between accuracy and cost.
The chart above illustrates the impact of various prompt engineering techniques on the accuracy of LLM applications across different LLMs, ordered by size. As you can see, effective prompt engineering creates variance in performance, and even some overlap between different LLMs, but the size of the LLM still plays a critical role in determining accuracy. This implies that higher costs (since larger LLMs are more expensive) correlate with greater accuracy. How much more expensive? Let's examine OpenAI's pricing for a clearer understanding.
For reference, the table above presents the pricing for various GPT models offered by OpenAI as of February 2024. It illustrates the significant differences in potential monthly expenses. Later in this post, we will discuss strategies to ensure you select the most suitable LLM for your use case and budget.
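The gap between model tiers is easiest to see at a fixed traffic level. The per-1K-token prices below are illustrative values in the range of OpenAI's early-2024 list prices; always check the current pricing page before budgeting:

```python
# Compare monthly spend for two model tiers at the same traffic level.
# Prices are illustrative, in the range of early-2024 OpenAI list prices.

PRICES = {  # model: (input $/1K tokens, output $/1K tokens)
    "gpt-3.5-turbo": (0.0005, 0.0015),
    "gpt-4":         (0.03,   0.06),
}

def monthly_cost(model: str, requests_per_min: int,
                 in_tokens: int, out_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    requests = requests_per_min * 60 * 24 * 30  # requests per month
    return requests * (in_tokens / 1000 * in_price +
                       out_tokens / 1000 * out_price)

# 5 requests/min, 1,000 input tokens and 200 output tokens per request:
cheap = monthly_cost("gpt-3.5-turbo", 5, 1000, 200)
premium = monthly_cost("gpt-4", 5, 1000, 200)
```

Under these assumed prices the premium model costs roughly fifty times more per month for identical traffic, which is why model selection deserves as much scrutiny as prompt design.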
Quantizing LLMs is a technique that reduces the precision of existing models, leading to improved performance with a trade-off of slightly reduced "intelligence." This method can be particularly effective in managing costs while still maintaining a level of performance that is acceptable for many applications. By quantizing your LLMs, you can achieve a more cost-effective balance between performance and accuracy.
The table above illustrates the effect of LLM quantization. Quantizing the models reduces their size by approximately 60%, decreasing the cost of hosting the LLM and improving latency. But at what compromise in accuracy? For the selected benchmark, the quantized Llama2-13B showed better results than Llama2-7B, despite being roughly half its size. Read further in Miguel Neves' post.
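The memory savings from quantization follow directly from bytes per parameter. This is a simplified sketch; real quantized checkpoints carry some overhead (scales, zero-points) on top of the raw parameter storage:

```python
# Rough memory footprint of a model at different precisions:
# bytes per parameter x parameter count. Simplified; real quantized
# checkpoints add some overhead for scales and zero-points.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_size_gb(params_billions: float, precision: str) -> float:
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

full = model_size_gb(13, "fp16")       # ~26 GB for Llama2-13B in fp16
quantized = model_size_gb(13, "int4")  # ~6.5 GB after 4-bit quantization
```

Shrinking a model from 26 GB to under 7 GB can move it onto a much cheaper GPU instance, which is where the hosting savings come from.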
Fine-tuning LLMs for specific tasks can offer significant improvements in performance. If your LLM is required for a particular function, making a one-time investment in fine-tuning can enhance its effectiveness for that task. This approach allows for a more efficient use of resources by tailoring the model's capabilities to your specific needs, potentially reducing overall costs in the long run.
Constructing Better Prompts
System prompts are templates used in LLM applications to give instructions to the models, in addition to injecting specific data such as user prompts. Crafting better system prompts can greatly improve the accuracy of LLMs and reduce instances of hallucination. Techniques such as "chain of thought" prompting can minimize errors by guiding the model through a more logical process of generating responses. However, this method may increase the amount of data sent to the model, thereby raising costs and potentially affecting performance. Optimizing prompt design is a crucial aspect of managing trade-offs between cost, accuracy, and efficiency.
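The cost side of that trade-off can be sketched directly. The two prompt texts below are invented examples, and the 4-characters-per-token heuristic is a rough approximation of a real tokenizer:

```python
# Sketch of the prompt-design trade-off: a chain-of-thought instruction
# tends to improve reasoning but adds tokens to every single request.
# Prompts and the token heuristic are illustrative.

PLAIN = "You are a support assistant. Answer the user's question."
COT = ("You are a support assistant. Before answering, think step by step: "
       "restate the question, list the relevant facts, reason about them, "
       "and only then give a final answer.")

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough rule of thumb for English

overhead_per_request = approx_tokens(COT) - approx_tokens(PLAIN)
# At 5 requests/minute, the overhead recurs on every call, all month long:
monthly_overhead = overhead_per_request * 5 * 60 * 24 * 30
```

A few dozen extra prompt tokens look negligible per call, but multiplied across every request they become a permanent line item on the bill.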
To provide perspective, we examine the "leaked" prompt of GitHub Copilot, the application that helps developers autocomplete code. Although not confirmed, the reported prompt contains 487 tokens, incurring significant out-of-the-box costs before introducing any business context.
Using an analytic approach
While there is intuition and theory about how to use these techniques, it's often difficult to predict in advance which method will be more effective when optimizing LLM systems. Therefore, a practical solution is to adopt an analytical approach that allows you to track different scenarios and test them against your data. Tools like LLMstudio, an open-source platform from TensorOps, facilitate exactly this. It enables you to test different prompts against various LLMs from any vendor and log the history of requests and responses for later evaluation. Additionally, it tracks key metrics related to cost and latency, enabling you to make data-driven decisions about optimizing your LLM deployment.
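The essence of this analytical approach is simple enough to sketch: log every call with its cost and latency, tagged by the prompt/model variant, then aggregate per variant. This is a generic illustration in the spirit of tools like LLMstudio, not its actual API:

```python
# Minimal request log for data-driven prompt/model comparisons.
# A generic sketch, not the API of any particular tool.
import statistics
from dataclasses import dataclass, field

@dataclass
class CallRecord:
    variant: str      # e.g. "gpt-3.5/prompt-v2"
    cost_usd: float
    latency_s: float

@dataclass
class LLMLog:
    records: list = field(default_factory=list)

    def log(self, variant: str, cost_usd: float, latency_s: float) -> None:
        self.records.append(CallRecord(variant, cost_usd, latency_s))

    def summary(self, variant: str) -> dict:
        rows = [r for r in self.records if r.variant == variant]
        return {"calls": len(rows),
                "total_cost": sum(r.cost_usd for r in rows),
                "p50_latency": statistics.median(r.latency_s for r in rows)}

log = LLMLog()
log.log("gpt-3.5/prompt-v2", 0.002, 1.1)
log.log("gpt-3.5/prompt-v2", 0.003, 0.9)
```

Comparing the per-variant summaries across real traffic is what turns "which prompt is better?" from a guess into a measurement.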
A word from your trusted advisors
As a consulting company, we have encountered several instances where LLM deployments failed because the unit economics were not viable. LLMs can be thought of as SpaceX's rockets that can return safely to Earth: although it's groundbreaking technology, you wouldn't use it to go to the grocery store, right? We assist companies in assessing the efficiency of their LLM deployment through a structured process that spans three weeks. If your company is interested in optimizing your LLM deployment, you can learn more about it here, or leave us a message on our website.