Large language models (LLMs) have been at the center of the generative AI revolution, especially since the emergence of ChatGPT. The potential use cases for LLMs are vast and include automating processes across many industries, such as the software supply chain, marketing, and information security. However, the full potential of LLMs has yet to be unlocked, and one significant reason is cost. The cost of incorporating LLMs into your application can range from a few cents for on-demand use cases to over $20,000 per month for hosting a single instance of an LLM on your cloud. There are also significant costs associated with fine-tuning, training, and vector search, and, of course, scale is a crucial factor. In this blog post, I will explore the reasons LLMs can be expensive and break down the costs into their major components.
Why are LLMs so expensive?
There are several aspects associated with LLM costs, not all of which will be relevant to your business, depending on how you utilize LLMs. To better understand the costs that are relevant to you, let's break them down piece by piece:
The Cost of Creating LLMs
While most of our readers won't train their own LLM from scratch, there have been reports that the cost of training LLMs such as BloombergGPT reached millions of US dollars, and these figures primarily refer to GPU costs. Training an LLM today also involves investing in research, acquiring and cleaning expensive data, and paying for many hours of human feedback through a technique called RLHF (Reinforcement Learning from Human Feedback). And while most companies that integrate LLMs into their Gen-AI applications use an LLM that another company or organization trained (like ChatGPT or Llama2), they still pay indirectly for the costs associated with creating it.
Paying for using LLMs
Assuming that you just want to use an LLM, you will encounter two main pricing models:
Pay by Token: Companies pay based on the amount of data processed by the model service. The costs are determined by the number of tokens (derived from words or symbols) processed, both for inputs and outputs. The figure below shows, as an example, how OpenAI calculates tokens.
Hosting Your Own Model: Companies host LLMs on their own infrastructure, paying for the computing resources, especially GPUs, required to run these models; they may also pay a license fee for the LLM itself.
While the pay-by-token model offers simplicity and scalability, hosting your model provides control over data privacy and operational flexibility. However, it demands significant infrastructure investment and maintenance. Let's review the two options:
Hosting an LLM on your cloud
When it comes to hosting your own model, the main cost, as mentioned, would be hardware. Consider, for example, hosting an open-source Falcon 180B on AWS. The default instance recommended by AWS is ml.p4de.24xlarge, with a listed price of almost 33 USD per hour (on-demand). This means that such a deployment would cost at least 23,000 USD per month, assuming it doesn't scale up or down and that no discounts are applied.
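The arithmetic behind that monthly figure can be sketched in a few lines. This is a back-of-the-envelope estimate, assuming a flat on-demand rate with no autoscaling and no discounts:

```python
# Back-of-the-envelope cost of one always-on hosted LLM instance.
# The 33 USD/hour rate is the on-demand list price cited above for
# ml.p4de.24xlarge; adjust for your region and any discounts.

def monthly_hosting_cost(hourly_rate_usd: float,
                         hours_per_day: float = 24,
                         days_per_month: float = 30) -> float:
    """Fixed cost of a single instance running around the clock."""
    return hourly_rate_usd * hours_per_day * days_per_month

cost = monthly_hosting_cost(33.0)  # 33 * 24 * 30 = 23,760 USD per month
```

Reserved-capacity discounts or running the instance only during business hours changes the inputs, but the fixed-cost structure stays the same.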
Scaling up and down of this AWS service may require attention, configuration changes, and optimization processes; however, the costs will still be very high for such deployments.
Paying per tokens
An alternative to hosting an LLM and paying for the hardware is to use SaaS models and pay per token. Tokens are the units vendors use to price calls to their APIs. Different vendors, like OpenAI and Anthropic, have different tokenization methods, and they charge varying prices per token based on whether it's an input token, output token, or related to the model size. In the following example, we demonstrate how OpenAI calculates a token count for a given text. It's evident that using special characters results in higher costs, while words in English consume fewer tokens. If you are using other languages, such as Hebrew, be aware that the costs may be even higher.
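To make the pay-per-token model concrete, here is a minimal cost-estimation sketch. The 4-characters-per-token heuristic is a rough rule of thumb for English, and the per-1K-token prices are illustrative placeholders, not any vendor's actual rates; real token counts come from the vendor's own tokenizer (e.g., OpenAI's tiktoken):

```python
# Rough per-request cost estimate under the pay-by-token model.
# Heuristic and prices are illustrative only.

def estimate_tokens(text: str) -> int:
    # Common rule of thumb for English: ~4 characters per token.
    # Special characters and non-English text usually tokenize less efficiently.
    return max(1, len(text) // 4)

def request_cost(prompt: str, completion: str,
                 input_price_per_1k: float,
                 output_price_per_1k: float) -> float:
    # Inputs and outputs are typically priced at different rates.
    return (estimate_tokens(prompt) / 1000 * input_price_per_1k +
            estimate_tokens(completion) / 1000 * output_price_per_1k)

cost = request_cost(prompt="Summarize this article. " * 50,
                    completion="A short summary.",
                    input_price_per_1k=0.0005,
                    output_price_per_1k=0.0015)
```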
Now that we've established how you can pay for an LLM, let's discuss the costs that are often overlooked.
Hidden costs of LLM applications
GPT-For-Work has developed an OpenAI pricing calculator for GPT products. We utilized it to estimate the cost of an AI application that processes 5 requests per minute, and were immediately faced with the question: how many tokens will be sent in such a case? The answer is complex, as this number is influenced by several hidden and unknown factors:
The size of the user input and the generated output can vary significantly.
There are hidden costs associated with application prompts.
Utilizing agent libraries typically incurs additional API calls in the background to LLMs, in order to implement frameworks like ReAct or to summarize data for buffers.
These hidden costs are often the primary cause of bill shock when transitioning from the prototyping phase to production. Therefore, generating visibility into these costs is crucial.
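The way these hidden factors compound can be sketched as follows. All the numbers here are assumptions chosen for illustration; real system prompts, agent step counts, and buffer sizes vary per application:

```python
# Illustrative sketch of how agent frameworks inflate token usage behind
# a single user request. All numbers are assumptions for demonstration.

def agent_request_tokens(user_tokens: int, system_prompt_tokens: int,
                         react_steps: int, tokens_per_step: int,
                         summary_tokens: int) -> int:
    # Each ReAct step re-sends the system prompt plus the user input and
    # accumulated reasoning; memory buffers add extra summarization calls.
    per_step = system_prompt_tokens + user_tokens + tokens_per_step
    return react_steps * per_step + summary_tokens

visible = 200  # tokens the user actually typed
total = agent_request_tokens(user_tokens=200, system_prompt_tokens=500,
                             react_steps=4, tokens_per_step=300,
                             summary_tokens=400)
hidden = total - visible  # tokens billed that the user never sees
```

Under these assumed numbers, a 200-token user request triggers thousands of billed tokens, which is exactly the kind of multiplier that causes bill shock at production scale.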
The emergence of vector databases
Most of the previous discussion has focused on hosting LLMs and the exchange of data with an LLM. However, LLMs have proven themselves useful not only for on-demand generation use cases but also for creating a new format for data storage known as embeddings. These embeddings are vectors (arrays of numbers) that can compress various media types such as text, images, audio, and video. Once data is compressed into vectors, it can be stored and indexed for advanced search purposes. Weaviate, a leading vector database solution, has demonstrated the efficacy of storing and retrieving data in a vectorized form. What makes it special is the ability to search for media objects that are conceptually similar, such as emails that share the same tone or mention the same topics, even if the exact words are not used.
These new databases are significantly more expensive, as the creation and updating of embeddings are done through invoking Large Language Models (LLMs). Additionally, searching the database requires more advanced and costly techniques.
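The core comparison a vector database performs can be sketched in a few lines: conceptually similar items have embeddings that point in similar directions, typically measured by cosine similarity. The 3-dimensional vectors below are toy values (real embeddings have hundreds or thousands of dimensions):

```python
# Minimal sketch of embedding comparison via cosine similarity.
# The vectors are toy stand-ins for real, high-dimensional embeddings.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a)) *
            math.sqrt(sum(y * y for y in b)))
    return dot / norm

email_a = [0.9, 0.1, 0.3]   # e.g. "friendly tone, project update"
email_b = [0.8, 0.2, 0.4]   # similar tone and topic, different wording
email_c = [-0.7, 0.9, 0.0]  # unrelated content
# email_a scores much closer to email_b than to email_c
```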
How to control the cost of LLMs?
Improving the underlying hardware is a key strategy for controlling the cost of LLMs. By investing in faster or more advanced GPUs, you can significantly increase the speed of inference, making your LLM applications run more efficiently. This improvement in performance can offset higher costs by reducing the time it takes to generate results, thereby balancing the trade-offs between speed, cost, and accuracy.
Choosing the Size of the LLM
The size of the LLM plays a crucial role in its performance and cost. As illustrated in the accompanying graph, larger LLMs typically offer greater accuracy but at the expense of higher costs due to increased resource requirements. For example, upgrading from GPT-3.5 to GPT-4 can provide more accurate results but will also incur higher expenses. This decision requires careful consideration of the balance between accuracy and cost.
The chart above illustrates the impact of various prompt engineering techniques on the accuracy of LLM applications across different LLMs, ordered by size. As you can see, effective prompt engineering creates variance in performance, and even some overlap between different LLMs, but the size of the LLM still plays a critical role in determining accuracy. This implies that higher costs (since larger LLMs are more expensive) correlate with greater accuracy. How much more expensive? Let's examine OpenAI's pricing for a clearer understanding.
For reference, the table above presents the pricing for various GPT models offered by OpenAI as of February 2024. It illustrates the significant differences in potential monthly expenses. Later in this post, we will discuss strategies to ensure you select the most suitable LLM for your use case and budget.
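The gap between model tiers is easiest to see at a fixed traffic level. The per-1K-token prices below are illustrative values in the range of OpenAI's early-2024 list prices; always check the current pricing page before budgeting:

```python
# Compare monthly spend for two model tiers at the same traffic level.
# Prices are illustrative, in the range of early-2024 OpenAI list prices.

PRICES = {  # model: (input $/1K tokens, output $/1K tokens)
    "gpt-3.5-turbo": (0.0005, 0.0015),
    "gpt-4":         (0.03,   0.06),
}

def monthly_cost(model: str, requests_per_min: int,
                 in_tokens: int, out_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    requests = requests_per_min * 60 * 24 * 30  # requests per month
    return requests * (in_tokens / 1000 * in_price +
                       out_tokens / 1000 * out_price)

# 5 requests/min, 1,000 input tokens and 200 output tokens per request:
cheap = monthly_cost("gpt-3.5-turbo", 5, 1000, 200)
premium = monthly_cost("gpt-4", 5, 1000, 200)
```

Under these assumed prices the premium model costs roughly fifty times more per month for identical traffic, which is why model selection deserves as much scrutiny as prompt design.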
Quantizing LLMs is a technique that reduces the precision of existing models, leading to improved performance with a trade-off of slightly reduced "intelligence." This method can be particularly effective in managing costs while still maintaining a level of performance that is acceptable for many applications. By quantizing your LLMs, you can achieve a more cost-effective balance between performance and accuracy.
The table above illustrates the effect of LLM quantization. Quantizing the models reduces their size by approximately 60%, decreasing the cost of hosting the LLM and improving latency. But at what compromise in accuracy? For the selected benchmark, the quantized Llama2-13B showed better results than Llama2-7B, despite being roughly half its size. Read further in Miguel Neves' post.
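The memory savings from quantization follow directly from bytes per parameter. This is a simplified sketch; real quantized checkpoints carry some overhead (scales, zero-points) on top of the raw parameter storage:

```python
# Rough memory footprint of a model at different precisions:
# bytes per parameter x parameter count. Simplified; real quantized
# checkpoints add some overhead for scales and zero-points.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_size_gb(params_billions: float, precision: str) -> float:
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9

full = model_size_gb(13, "fp16")       # ~26 GB for Llama2-13B in fp16
quantized = model_size_gb(13, "int4")  # ~6.5 GB after 4-bit quantization
```

Shrinking a model from 26 GB to under 7 GB can move it onto a much cheaper GPU instance, which is where the hosting savings come from.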
Fine-tuning LLMs for specific tasks can offer significant improvements in performance. If your LLM is required for a particular function, making a one-time investment in fine-tuning can enhance its effectiveness for that task. This approach allows for a more efficient use of resources by tailoring the model's capabilities to your specific needs, potentially reducing overall costs in the long run.
Constructing Better Prompts
System prompts are templates used in LLM applications to give instructions to the models, in addition to injecting specific data such as user prompts. Crafting better system prompts can greatly improve the accuracy of LLMs and reduce instances of hallucination. Techniques such as "chain of thought" prompting can minimize errors by guiding the model through a more logical process of generating responses. However, this method may increase the amount of data sent to the model, thereby raising costs and potentially affecting performance. Optimizing prompt design is a crucial aspect of managing trade-offs between cost, accuracy, and efficiency.
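The cost side of that trade-off can be sketched directly. The two prompt texts below are invented examples, and the 4-characters-per-token heuristic is a rough approximation of a real tokenizer:

```python
# Sketch of the prompt-design trade-off: a chain-of-thought instruction
# tends to improve reasoning but adds tokens to every single request.
# Prompts and the token heuristic are illustrative.

PLAIN = "You are a support assistant. Answer the user's question."
COT = ("You are a support assistant. Before answering, think step by step: "
       "restate the question, list the relevant facts, reason about them, "
       "and only then give a final answer.")

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough rule of thumb for English

overhead_per_request = approx_tokens(COT) - approx_tokens(PLAIN)
# At 5 requests/minute, the overhead recurs on every call, all month long:
monthly_overhead = overhead_per_request * 5 * 60 * 24 * 30
```

A few dozen extra prompt tokens look negligible per call, but multiplied across every request they become a permanent line item on the bill.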
To provide perspective, we examine the "leaked" prompt of GitHub Copilot, the application that helps developers autocomplete code. Although not confirmed, the reported prompt contains 487 tokens, incurring significant out-of-the-box costs before introducing any business context.
Using an analytic approach
While there is intuition and theory about how to use these techniques, it's often difficult to predict in advance which method will be more effective when optimizing LLM systems. Therefore, a practical solution is to adopt an analytical approach that allows you to track different scenarios and test them against your data. Tools like LLMstudio, an open-source platform from TensorOps, facilitate exactly this. It enables you to test different prompts against various LLMs from any vendor and log the history of requests and responses for later evaluation. Additionally, it tracks key metrics related to cost and latency, enabling you to make data-driven decisions about optimizing your LLM deployment.
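The essence of this analytical approach is simple enough to sketch: log every call with its cost and latency, tagged by the prompt/model variant, then aggregate per variant. This is a generic illustration in the spirit of tools like LLMstudio, not its actual API:

```python
# Minimal request log for data-driven prompt/model comparisons.
# A generic sketch, not the API of any particular tool.
import statistics
from dataclasses import dataclass, field

@dataclass
class CallRecord:
    variant: str      # e.g. "gpt-3.5/prompt-v2"
    cost_usd: float
    latency_s: float

@dataclass
class LLMLog:
    records: list = field(default_factory=list)

    def log(self, variant: str, cost_usd: float, latency_s: float) -> None:
        self.records.append(CallRecord(variant, cost_usd, latency_s))

    def summary(self, variant: str) -> dict:
        rows = [r for r in self.records if r.variant == variant]
        return {"calls": len(rows),
                "total_cost": sum(r.cost_usd for r in rows),
                "p50_latency": statistics.median(r.latency_s for r in rows)}

log = LLMLog()
log.log("gpt-3.5/prompt-v2", 0.002, 1.1)
log.log("gpt-3.5/prompt-v2", 0.003, 0.9)
```

Comparing the per-variant summaries across real traffic is what turns "which prompt is better?" from a guess into a measurement.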
A word from your trusted advisors
As a consulting company, we have encountered several instances where LLM deployments failed because the unit economics were not viable. LLMs can be thought of as SpaceX's rockets that can return safely to Earth: although it's groundbreaking technology, you wouldn't use it to go to the grocery store, right? We assist companies in assessing the efficiency of their LLM deployment through a structured process that spans three weeks. If your company is interested in optimizing your LLM deployment, you can learn more about it here, or leave us a message on our website.