LLM-FinOps: The Key to Cost-Effective Gen AI Applications

Gad Benram
Feb 4, 2024

Updated: Feb 5, 2024

LLMstudio performance dashboard - visibility over your LLM deployment - www.llmstudio.ai

Businesses are constantly looking for ways to streamline operations, reduce costs, and improve efficiency. The rise of Large Language Models (LLMs) such as GPT-3 and its successors has opened a new era of artificial intelligence, with remarkable capabilities in natural language processing, content creation, and more. However, deploying and maintaining LLMs in production comes with its own set of challenges and expenses. This is where LLM-FinOps enters the picture, serving as the financial conscience of LLM deployment.

Understanding the Economics of LLMs

LLM services like OpenAI's ChatGPT, Anthropic's Claude, and Amazon Bedrock are revolutionizing industries with their advanced AI-driven solutions. Built on vast datasets and complex algorithms, these LLMs can comprehend, generate, and translate human language. This capability allows businesses to automate intricate tasks and offer sophisticated user interactions.

The evolution of very large language models. Source: arXiv:2310.05694

The Impact of Model Size

The capabilities and costs of LLMs are significantly influenced by their size, typically measured by parameter count. Models with more parameters can execute complex tasks with greater accuracy but demand substantial computational power. This leaves businesses with a trade-off: weighing the benefits of enhanced performance against the higher costs involved.

To demonstrate the importance of model size, let's look at research published by Sondos Mahmoud Bsharat and colleagues. The team examined how effectively different prompt engineering techniques improve the accuracy of LLM responses. The results indicate that going "bigger" with LLMs not only improves accuracy but compounds the improvement, because larger models are more responsive to prompt engineering techniques. While there are several ways to improve a system's metrics, opting for larger models is currently a winning strategy from an accuracy perspective, though it comes at a price in cost and runtime performance.

@article{bsharat2023principled,
  title={Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4},
  author={Sondos Mahmoud Bsharat and Aidar Myrzakhan and Zhiqiang Shen},
  journal={arXiv preprint arXiv:2312.16171},
  year={2023}
}

Having highlighted the substantial trade-off between model size, costs, and performance, attention now turns to a critical aspect of LLM deployment in production: identifying the sweet spot between cost and performance. To do this, let's split the costs into their more apparent and less obvious components.

LLM pricing models

The pricing models of LLMs can be categorized into two main types:

Pay by Token: Companies pay based on the amount of data processed by the model. The cost is determined by the number of tokens (derived from words or symbols) processed, for both inputs and outputs. Later in this section we use OpenAI's tokenizer as an example of how tokens are counted.
Hosting Own Model: Companies host LLMs on their own infrastructure and pay for the computing resources, especially GPUs, required to run these models.

While the pay-by-token model offers simplicity and scalability, hosting your own model provides control over data privacy and operational flexibility. However, it demands significant infrastructure investment and maintenance. Let's review the two options.

When it comes to hosting your own model, the main cost, as mentioned, would be hardware. Consider, for example, hosting an open-source Falcon 180B on AWS. The default instance recommended by AWS is ml.p4de.24xlarge, with a listed price of almost 33 USD per hour (on-demand). This means that such a deployment would cost at least 23,000 USD per month, assuming it doesn't scale up or down and that no discounts are applied.

Recommended hardware for deploying a single Falcon model

Scaling this AWS service up and down may require attention, configuration changes, and optimization processes; even so, the costs of such deployments will remain very high.
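The monthly figure quoted above follows from simple arithmetic. Here is a minimal sketch, assuming a flat on-demand rate of 33 USD per hour (the rounded price mentioned above), a 30-day month, and a single always-on instance. The function name is ours and is used for illustration only.

```python
HOURS_PER_DAY = 24


def monthly_hosting_cost(hourly_rate_usd: float, instances: int = 1, days: int = 30) -> float:
    """Cost of keeping `instances` machines running around the clock."""
    return hourly_rate_usd * HOURS_PER_DAY * days * instances


# Assumption: ~33 USD/hour, the rounded on-demand price quoted for ml.p4de.24xlarge.
cost = monthly_hosting_cost(33.0)
print(f"{cost:,.0f} USD per month")  # 23,760 USD per month
```

Changing `instances` or `hourly_rate_usd` shows how quickly the bill grows once the deployment has to scale beyond one machine.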

An alternative is to use SaaS models and pay per token. Tokens are the units vendors use to price calls to their APIs. Different vendors, like OpenAI and Anthropic, use different tokenization methods, and the price per token varies by whether it is an input or an output token and by the size of the model. In the following example, we demonstrate how OpenAI calculates a token count for a given text. Special characters consume more tokens and therefore cost more, while English words consume fewer. If you are using other languages, such as Hebrew, be aware that the costs may be even higher.

OpenAI tokenizer: demonstrates how characters are charged
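To make the pay-per-token model concrete, here is a minimal sketch of how the cost of a single call could be computed once the token counts are known. The prices below are placeholders chosen for illustration, not any vendor's actual rates, and the `TokenPricing` and `call_cost` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Prices in USD per 1,000 tokens. Values used below are placeholders."""
    input_per_1k: float
    output_per_1k: float


def call_cost(pricing: TokenPricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one API call: input and output tokens are billed at different rates."""
    return (
        (input_tokens / 1000) * pricing.input_per_1k
        + (output_tokens / 1000) * pricing.output_per_1k
    )


# Hypothetical price sheets for a smaller and a larger model.
small = TokenPricing(input_per_1k=0.001, output_per_1k=0.002)
large = TokenPricing(input_per_1k=0.01, output_per_1k=0.03)

for name, pricing in [("small", small), ("large", large)]:
    print(name, round(call_cost(pricing, input_tokens=800, output_tokens=200), 6))
```

The token counts themselves depend on the vendor's tokenizer. For OpenAI models, the open-source tiktoken library can be used to count tokens before a call is sent.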

Now that we have established how you can pay for an LLM, let's discuss the costs that are often overlooked.

Hidden costs of LLM applications

GPT-For-Work has developed an OpenAI pricing calculator for GPT products. We utilized it to estimate the cost of an AI application that processes 5 requests per minute, and were immediately faced with the question: how many tokens will be sent in such a case? The answer is complex, as this number is influenced by several hidden and unknown factors:

The size of the user input and the generated output can vary significantly.
There are hidden costs associated with application prompts.
Utilizing agent libraries typically incurs additional API calls in the background to LLMs, in order to implement frameworks like ReAct or to summarize data for buffers.

These hidden costs are often the primary cause of bill shock when transitioning from the prototyping phase to production. Therefore, generating visibility into these costs is crucial.
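The factors listed above can be turned into a rough estimate. The sketch below assumes 5 requests per minute, as in our example, and splits the monthly token volume into a visible part (user input and output) and a hidden part (application prompts and background agent calls). Every per-request figure is an illustrative guess, not a measurement.

```python
def monthly_tokens(
    requests_per_minute: float,
    user_input_tokens: int,
    output_tokens: int,
    prompt_overhead_tokens: int = 0,
    background_calls_per_request: int = 0,
    tokens_per_background_call: int = 0,
    days: int = 30,
) -> dict:
    """Split the monthly token volume into visible and hidden parts."""
    requests = requests_per_minute * 60 * 24 * days
    visible = requests * (user_input_tokens + output_tokens)
    hidden = requests * (
        prompt_overhead_tokens
        + background_calls_per_request * tokens_per_background_call
    )
    return {"requests": requests, "visible": visible, "hidden": hidden, "total": visible + hidden}


# All per-request figures below are illustrative guesses, not measurements.
estimate = monthly_tokens(
    requests_per_minute=5,
    user_input_tokens=200,
    output_tokens=300,
    prompt_overhead_tokens=600,      # system prompt, few-shot examples, retrieved context
    background_calls_per_request=2,  # e.g. agent reasoning steps or buffer summaries
    tokens_per_background_call=700,
)
print(estimate)
print(f"hidden share: {estimate['hidden'] / estimate['total']:.0%}")
```

With these made-up numbers, the hidden part is several times larger than what the user actually typed and read, which is the kind of gap that only shows up once calls are measured in production.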

LLM-FinOps: Bridging the Gap Between Finance and AI

LLM-FinOps is not just about cost-cutting; it's a multidimensional approach that involves:

Cost Monitoring and Analysis: Continuously tracking and analyzing the expenses associated with LLM operations to identify cost-saving opportunities and budget effectively.
Performance Optimization: Ensuring that the LLMs deliver the desired performance without unnecessary expenditure on computing resources.
Scalability and Flexibility: Adapting the LLM infrastructure to meet changing demands, scaling up or down as needed without incurring prohibitive costs.
Collaboration and Governance: Fostering a collaborative environment where finance teams, IT, and data scientists work together to align financial goals with AI initiatives.
Continuous Improvement: Regularly reviewing and updating LLM strategies to capitalize on emerging technologies, pricing models, and best practices in LLM management.

The Emerging Stack of LLM-FinOps

Provisioned Throughput Units (PTUs)

Provisioned throughput in Azure OpenAI, defined in Provisioned Throughput Units (PTUs), offers a way to manage and optimize Large Language Model deployments, ensuring predictable performance and cost efficiency. This approach is somewhat similar to other cloud reservations, such as Reserved Instances (RIs) on AWS. It allows for reserved processing capacity, aligning well with the principles of LLM-FinOps for efficient resource management.

Predictability: PTUs ensure stable maximum latency and throughput, crucial for uniform workloads.
Reserved Capacity: Resources are allocated regardless of usage, ensuring availability but possibly leading to underutilization.
Cost-Efficiency: Especially beneficial for high-throughput workloads, potentially offering significant cost savings compared to token-based models.

PTUs utilization as a function of the calls made to Azure OpenAI. Source: Azure Learn

To truly benefit from the potential of PTUs, an Azure user needs as detailed an analysis of the calls as possible. This is essential to identify the baseline of "stable" workloads and to perform back-testing based on historical data. This brings us to our next topic: how to monitor LLM applications effectively.
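A back-test of this kind can be sketched in a few lines. The example below replays a made-up hourly token history against a reservation sized to the stable baseline. It assumes that traffic above the reserved capacity spills over to pay-per-token billing, and all prices and capacities are placeholders, not Azure figures. The function name is hypothetical.

```python
def backtest_reservation(
    hourly_tokens: list[int],
    reserved_tokens_per_hour: int,
    reservation_cost_per_hour: float,
    price_per_1k_tokens: float,
) -> dict:
    """Compare pure pay-per-token billing with a fixed reservation plus per-token overflow."""
    hours = len(hourly_tokens)
    pay_per_token = sum(hourly_tokens) / 1000 * price_per_1k_tokens
    # Tokens above the reserved capacity are assumed to be billed per token.
    overflow = sum(max(0, t - reserved_tokens_per_hour) for t in hourly_tokens)
    # Tokens served by the reservation, capped at its hourly capacity.
    used = sum(min(t, reserved_tokens_per_hour) for t in hourly_tokens)
    reserved = reservation_cost_per_hour * hours + overflow / 1000 * price_per_1k_tokens
    utilization = used / (reserved_tokens_per_hour * hours)
    return {"pay_per_token": pay_per_token, "reserved": reserved, "utilization": utilization}


# Hypothetical history: a stable baseline with two peak hours.
history = [400_000, 420_000, 410_000, 900_000, 950_000, 405_000]
result = backtest_reservation(
    history,
    reserved_tokens_per_hour=400_000,  # sized to the stable baseline (assumption)
    reservation_cost_per_hour=6.0,     # placeholder, not an Azure price
    price_per_1k_tokens=0.02,          # placeholder pay-per-token price
)
print(result)
```

Running the same function over several candidate reservation sizes shows where utilization starts to drop, which is the point at which reserved capacity stops paying for itself.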

LLM Monitoring Tools

Optimizing any system depends on gathering and analyzing data about its performance. For LLM applications, this involves identifying all interactions with LLMs, including the associated API or hosting calls. Unlike with traditional backend applications, it's also crucial to monitor inputs and outputs and to collect feedback to assess the quality of the results. Adjusting the size of the LLM or the prompt configuration can affect the cost, latency, and accuracy of the system.

LLMstudio architecture - serves as a gateway to all LLM calls made by users and services in your perimeter

To streamline this process, we recommend using an LLM gateway service like LLMstudio. A well-designed LLM gateway is vendor-neutral and can direct requests to various backends (either managed LLMs or self-hosted) while integrating with your logging and monitoring platforms such as Datadog, Cloud Logging, or Grafana.

Beyond these essential functions, LLM gateways aim to offer additional features like security and load balancing, enhancing the stability and robustness of your LLM deployment. Implementing these practices will lay a solid foundation for optimization.
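To illustrate the idea, here is a minimal sketch of a gateway: a single entry point that routes each call to a named backend and records who called, how long it took, and how many tokens were used. This is not LLMstudio's API. The `Gateway` class and the stand-in backend are hypothetical, and a real gateway would forward the records to your logging platform instead of keeping them in memory.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

# A backend takes a prompt and returns (completion_text, input_tokens, output_tokens).
Backend = Callable[[str], tuple[str, int, int]]


@dataclass
class Gateway:
    backends: dict[str, Backend]
    log: list[dict] = field(default_factory=list)

    def complete(self, backend_name: str, prompt: str, caller: str = "unknown") -> str:
        """Route the call to one backend and record caller, latency, and token usage."""
        start = time.perf_counter()
        text, input_tokens, output_tokens = self.backends[backend_name](prompt)
        self.log.append({
            "caller": caller,
            "backend": backend_name,
            "latency_s": time.perf_counter() - start,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
        })
        return text

    def tokens_by_caller(self) -> dict[str, int]:
        """Total tokens (input + output) attributed to each caller."""
        totals: dict[str, int] = {}
        for record in self.log:
            used = record["input_tokens"] + record["output_tokens"]
            totals[record["caller"]] = totals.get(record["caller"], 0) + used
        return totals


def fake_backend(prompt: str) -> tuple[str, int, int]:
    """Stand-in for a managed or self-hosted model; counts words instead of tokens."""
    reply = "ok"
    return reply, len(prompt.split()), len(reply.split())


gateway = Gateway(backends={"managed": fake_backend, "self_hosted": fake_backend})
gateway.complete("managed", "summarize this support ticket", caller="support-bot")
gateway.complete("self_hosted", "translate this sentence", caller="support-bot")
print(gateway.tokens_by_caller())
```

Because every call passes through one place, per-team or per-feature cost attribution becomes a simple aggregation over the log.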

LLM Cache

Generating data with LLMs and AI often yields amazing and unique results at significantly lower costs than manual creation. For the image below, we asked ChatGPT to generate a surrealistic image that never existed before.

However, the output is not always as unique as desired. In chatbot applications for support centers, for example, questions and answers often repeat themselves. In these cases, storing and retrieving generated outputs can greatly reduce costs. Current caching techniques primarily involve storing results in a database and performing vector searches. While established frameworks are not yet available, we recommend exploring GPTCache, which implements a caching layer fully compatible with LangChain and LlamaIndex.
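The store-and-search approach described above can be sketched as follows. This is a toy illustration and not GPTCache's API: word counts stand in for a real embedding model, a Python list stands in for a vector database, and the similarity threshold of 0.8 is an arbitrary assumption.

```python
import math
from collections import Counter
from typing import Callable, Optional


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def lookup(self, question: str) -> Optional[str]:
        """Return the stored answer of the most similar question, if similar enough."""
        query = embed(question)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None

    def store(self, question: str, answer: str) -> None:
        self.entries.append((embed(question), answer))


def answer(question: str, cache: SemanticCache, llm: Callable[[str], str]) -> str:
    cached = cache.lookup(question)
    if cached is not None:
        return cached       # cache hit: no LLM call, no token cost
    result = llm(question)  # cache miss: pay for the call, then remember it
    cache.store(question, result)
    return result


calls = []


def fake_llm(question: str) -> str:
    """Stand-in for a paid LLM call; records each invocation."""
    calls.append(question)
    return "Go to Settings > Security > Reset password."


cache = SemanticCache()
answer("how do I reset my password", cache, fake_llm)
answer("how do I reset my password please", cache, fake_llm)
print(len(calls))
```

The threshold controls the trade-off: set it too low and users receive answers to questions they did not ask, set it too high and near-duplicates still trigger paid calls.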

Model size optimization

While the previous methods focus on optimizing the deployment of the LLM system, there are also ways to optimize the model itself. Here are a few techniques for achieving that:

Foundation model size selection: Choosing the right size of foundation model for the specific use case. This choice can lead to very different business and technical results further down the line, as you progress to the next techniques.
Quantization: Compressing the model to reduce its size without significantly compromising performance.
Fine-tuning: Customizing the model for specific tasks to improve efficiency and reduce unnecessary prompting.
Using Mixture of Experts: Results have shown that orchestrating several smaller, specialized models can yield more efficient results than using one big LLM.
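Of the techniques above, quantization is the easiest to show in a few lines. The sketch below applies symmetric 8-bit quantization to a handful of made-up weights and measures the rounding error after converting back. It is a simplified per-tensor scheme written in plain Python for illustration. Production tools use more elaborate variants and operate on full tensors.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.01, -0.27]  # made-up values for illustration
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, f"max error: {max_error:.5f}")
```

Each weight is now stored in 8 bits instead of 16 or 32, so the memory needed for the weights shrinks by a factor of two or four, at the price of a small rounding error per weight.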

We addressed many of these topics in our webinar, which you're welcome to tune into at your convenience!

The future of LLM-FinOps

As the adoption of LLMs increases, we anticipate a significant expansion in the infrastructure for LLM monitoring and optimization tools. Techniques such as mixture of experts and LLM quantization, along with architectural decisions like smart LLM routing and caching, can be the differentiators between a profitable and a loss-making LLM deployment.

TensorOps is dedicated to helping customers reduce costs. If any of this resonates with you, our engineers would be delighted to have a chat and offer their assistance. Just leave a message here!