- The development of neural networks and improvements in their architecture along with access to massive amounts of data led to the emergence of large language models (LLMs).
- At scale, LLMs have demonstrated extraordinary capabilities related to text processing and similar tasks.
- With an LLM integrated into your product, you can gain competitive advantages and optimize business expenses.
- Prompting and fine-tuning are two key ways to use an LLM in your product.
- When choosing an LLM for your product, you should consider multiple factors.
These have been intense years for AI.
Since the launch of GPT-3 in 2020, the field of artificial intelligence (AI) has witnessed remarkable progress, with significant advancements in the development and application of large language models (LLMs).
Within a relatively short time, AI tools have started doing magic: now, we can generate code, images, and text in a matter of seconds with the help of tools based on Transformer models — a neural network architecture that has been behind the most revolutionary AI advancements of the past six years.
These advancements are comparable to some of the most prominent events in the history of technology. For instance, decades ago, accounting used to consume huge resources. To create a simple financial report, dozens of employees used endless sheets of paper and a good deal of calculators.
Picture of "human computers" in Washington, D.C., circa 1920. Courtesy of the Library of Congress, Washington, D.C., USA.
Later, mainframes changed the game for big organizations. Personal computers and Microsoft Excel made accounting, calculations, data processing, and a myriad of other previously grueling tasks routine, completing them in seconds.
These days, we can observe something similar happening.
Organizations generate terabytes of text data and invest millions of dollars in attempts to efficiently process it. Large language models (LLMs) are able to change that: these technologies can cut data processing time and cost and provide you with assistance and insights you didn’t even know you needed.
ChatGPT, Bard, Claude, and other products based on LLMs are currently transforming the digital space and every industry. They enable machines to comprehend and generate human-like content with astonishing accuracy and contextual awareness.
In this article, you will find out how to become a pioneer in embracing LLM technologies for business, improve your existing product with an LLM, or develop an LLM-based application from scratch.
Attention made it possible: the short story of large language models
Large language models are based on the Transformer architecture that was introduced by the Google Brain team in 2017 in a paper titled “Attention Is All You Need.”
The Transformer architecture uses an attention mechanism as its primary building block.
The attention mechanism allows a language model to focus on different words in a sequence when generating the next word or token. The self-attention mechanism in the Transformer model is a further enhancement of this concept. It allows each element in the sequence to attend to all other elements in a computationally efficient manner, effectively capturing the context of the entire sequence for each element.
Moreover, the authors introduced the concept of “multi-head” attention, which allows the model to focus on different types of information. In essence, each “head” in the attention mechanism can potentially learn to focus on different aspects of the input data, thus providing a more nuanced understanding of the data.
Generative Pre-Trained Transformers (GPT) by OpenAI took the lead among early Transformer-based models. OpenAI’s idea was that the road to LLM advancements is through scaling. Since the original release of GPT, OpenAI has kept on scaling each new model, making it more powerful and intelligent than the previous version. Using a large amount of training data, new models not only learn to understand syntax but also can grasp semantics and abstract real-world concepts in the training texts.
The 175-billion-parameter GPT-3 model can answer questions and write code, while GPT-4 — the latest version having an unknown but much larger number of parameters — can pass a number of academic tests.
Today, there are multiple language models of different sizes trained on terabytes of texts and other data that are capable of almost human-like reasoning and that can successfully complete tasks we couldn’t even have imagined an LLM tackling in the past.
The trick is to decide where and how to use them.
Get ready for takeoff: focusing on tasks to solve with an LLM
Let’s start with where to use an LLM.
As of now, there is no one-size-fits-all answer to this question. Technically, you can implement an LLM in any product. There are two general approaches:
- Wait until market leaders and potential competitors come up with use cases and follow their strategies in the future.
- Start with basic use cases, research, test your ideas, and develop powerful LLM-based solutions for your business right now.
In this section, you can find basic tasks modern LLMs can complete to date. Still, growth in this field is so rapid and intense that the number of potential use cases can increase any day. We suggest you start by assessing the textual data your organization processes and think of how you can handle data processing tasks using a language model:
Human-like text generation is one of the most impressive capabilities of large language models. Based on a short prompt or description, models like GPT and Claude can generate coherent and contextually appropriate texts, reducing the time and effort required for content creation.
LLMs can classify text into predefined categories or labels.
For example, e-commerce companies and marketplaces can use LLMs for sentiment analysis to automatically classify customer reviews as positive, negative, or neutral. This may help in monitoring feedback and understanding customer sentiment at scale.
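As a minimal sketch of how such classification might look in practice, the snippet below builds a zero-shot sentiment prompt and normalizes the model’s reply. The actual LLM call is left out: in a real product you would send the prompt to a hosted API, and the label names here are just illustrative choices.

```python
# Sketch of LLM-based sentiment classification via prompting.
# The model call itself is stubbed out; in production you would send
# the prompt to a hosted LLM API and pass its reply to parse_label().

LABELS = ["positive", "negative", "neutral"]

def build_sentiment_prompt(review: str) -> str:
    """Build a zero-shot classification prompt for a customer review."""
    return (
        "Classify the sentiment of the following customer review as one of: "
        + ", ".join(LABELS) + ".\n\n"
        f"Review: {review}\n"
        "Sentiment:"
    )

def parse_label(model_output: str) -> str:
    """Normalize a raw model reply into one of the allowed labels."""
    cleaned = model_output.strip().lower()
    for label in LABELS:
        if label in cleaned:
            return label
    return "neutral"  # fall back when the reply is unrecognized

prompt = build_sentiment_prompt("The delivery was fast and the quality is great!")
```

Parsing the reply defensively matters because a model may answer “Positive!” or “The sentiment is positive.” rather than a bare label.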
IBM Watson was introduced many years ago, and its key features included answering questions. Tools like Siri have been answering our questions for over twelve years.
So what’s unique about LLMs in this regard?
Compare the results you get from Watson and Siri to responses you can now read on ChatGPT. Language models are stronger and smarter at answering questions compared to any earlier ML-based tools. And responses from LLMs are mostly correct.
Tools like ChatGPT can act as human-like companions and keep track of the context in a long chat, understand sentiments, and regulate tone of voice, vocabulary, and linguistic style based on your feedback. LLMs can participate in conversations, and in some cases become the most interesting participant in the room.
This capability of LLMs comes in handy when powering your website with an intelligent chatbot. Trained on your company’s data, an LLM bot can provide relevant and mostly correct answers to customers’ questions.
LLMs may be prone to generating biased and offensive content, like Snapchat’s chatbot recently did. But we can assume that with constant evolution of language models, this issue will be mitigated.
Originally, the Google team started their research on the Transformer architecture to solve translation tasks. And specialized LLMs are particularly good at translation.
Tools like the Alexa Teacher Model (AlexaTM 20B) can achieve high translation accuracy. This 20-billion-parameter model allows for transferring knowledge from one language or task to another with minimal human input.
LLMs are capable of summarizing lengthy documents, distilling key information into concise summaries.
This may offer great support for experts in various niches. For example, legal specialists may use document summarization features to analyze and summarize huge legal documents, extract relevant information, and optimize their workflows.
Choosing your LLM
Once you’ve focused on a particular task you wish to handle with an LLM, it’s time to take a closer look at the Transformer architecture, as its specifics may impact your choice of language model.
In “Pre-Trained Language Models and Their Applications,” Haifeng Wang and others provide a comprehensive overview of how the specifics of a model’s architecture determine its possible applications.
The Transformer architecture includes two primary components: an encoder and a decoder.
- The encoder processes the input sequence (for example, a sentence) and creates a contextualized representation of each token (word) in the sequence.
- The decoder receives this representation of tokens and generates the output sequence, typically one token at a time. The model is auto-regressive, meaning that each new token is generated based on the previous tokens.
The encoder and decoder consist of several layers, each including two sublayers: a multi-head self-attention mechanism that connects different positions of a sequence to generate a representation of the sequence, and a feed-forward neural network in which information moves only in one direction, from the input through hidden nodes to the output.
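To make the self-attention sublayer concrete, here is a toy, pure-Python sketch of single-head scaled dot-product attention over two-dimensional token vectors. Real implementations use tensor libraries and learned projection matrices; this version only shows the core computation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted sum of the value vectors
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs

# Two toy tokens attending over themselves (self-attention: Q = K = V)
seq = [[1.0, 0.0], [0.0, 1.0]]
out = attention(seq, seq, seq)
```

Each token’s output blends information from every token in the sequence, weighted by similarity; this is how the model captures context. Multi-head attention simply runs several such computations in parallel with different learned projections.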
The specifics of a model’s architecture may impact its capabilities.
- Encoder-only models at scale employ a bidirectional transformer encoder to learn contextual representations; they demonstrate impressive performance on Natural Language Understanding (NLU) tasks (BERT).
- Decoder-only models aim at generating human-like texts (BLOOM, Claude, GPT-4).
- Encoder–decoder models can handle language understanding, translation, and generation tasks (BART).
Encoder–decoder-based LLMs like BART can perform translation tasks at a high level. At the same time, huge models like GPT-4 and Claude that emerged in 2023 can rival specialized translation systems across many language pairs. The reason is that state-of-the-art models are trained on a tremendous volume of data; even though they are based on a decoder-only architecture, they can go above and beyond in any task we’ve previously considered.
Below, you can see the evolution of LLMs based on different types of architectures.
Matching a use case to an appropriate model
The LLM tree is growing much faster than the maple across the street, and it blooms profusely throughout all seasons. Given the abundance of LLM options, it may seem challenging to pick a model that will serve as a reliable assistant.
Janna Lipenkova, a B2B entrepreneur with a background in AI and NLP, conducted insightful research into the strengths of various language models and collected the results in a handy table:
You and your team may have already determined key tasks to solve with a trained LLM. At this step, match the task requirements to the capabilities of the most suitable LLMs:
- Text summarization — BART
- Machine translation — BLOOM, GPT models
- Conversational AI — LaMDA, GPT models
- Question answering — LaMDA, PALM, GPT models
- Sentiment analysis — BERT
But the task you wish to solve isn’t the only factor to consider. Let’s have a look at eight more things to keep in mind when selecting a language model.
8 major factors that may impact your choice of a language model
In this section, we draw your attention to the performance and size of large language models, their possible ethics- and bias-related drawbacks, the memory (context window) of LLMs, the cost to use a model, privacy considerations, licensing, and integration and development specifics.
Evaluating the performance and accuracy of LLMs is vital. Consider factors such as language fluency, coherence, contextual understanding, and the ability to generate relevant and accurate responses.
GPT-4, Claude, and GPT-3.5 lead the parade. We’re using these LLMs for a couple of projects, and they show promising results in targeted tasks. You may choose them for your use case too, then consider other options after you validate your idea and start gaining traction.
"Some of the best models are GPT-4, by far, I would say, followed by Claude and GPT-3.5."
— Andrej Karpathy, State of GPT talk for Microsoft Build
Language models vary in size. As you compare the size-to-quality ratio, you may notice that size directly impacts the complexity of tasks a model can handle, as well as the quality of its output. As the number of parameters grows, the model becomes able to make more accurate predictions.
At the same time, smaller models may demonstrate shorter inference time — the time between receiving new data and predicting the next token. Inference time may be from a couple of milliseconds to several seconds, but the ultimate goal is to keep it as fast as possible, especially if your product requires low latency.
Still, there are numerous approaches to optimizing inference time in large models and ensuring both high performance and low inference time.
Ethics and bias
LLMs are prone to reflecting biases present in the training data, potentially leading to inaccurate or undesired outputs. It’s important to consider the ethical implications and bias mitigation strategies offered by a particular LLM. Before you make your choice, make sure an LLM aligns with your organization’s ethical guidelines and find out if additional steps are required to address potential errors.
How much data do you need your LLM to remember before generating an answer?
Your use case may define the context window length you need.
The context window refers to the quantity of previous text that a model considers before generating additional text.
Models with a smaller context window tend to forget the content of the conversation and provide increasingly irrelevant responses to new questions.
State-of-the-art language models are particularly good at conversations. They can remember what you talk about; they can access a huge number of tokens before generating a response, and they have large context windows.
- GPT-4’s context window reaches up to 32,000 tokens, or approximately 25,000 words.
- Claude’s context window is 100,000 tokens, or 75,000 words.
You may need to employ a powerful LLM like Claude to reach your business goals, or you might focus on a model with a smaller context window. In this case, you’d need to invest more in engineering, break your data into pieces, and use embeddings to handle your tasks.
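When a model’s context window is smaller than your documents, you have to break data into pieces yourself. The sketch below splits text into chunks using a rough character-based token estimate (about four characters per token for English text is a common heuristic); a production system would use the model’s actual tokenizer instead.

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # A real pipeline would use the target model's tokenizer here.
    return max(1, len(text) // 4)

def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    """Split text on sentence boundaries so each chunk fits a token budget."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + ". " if current else "") + sentence
        if approx_tokens(candidate) > max_tokens and current:
            chunks.append(current + ".")
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current + ".")
    return chunks

chunks = chunk_text("One sentence. " * 100, max_tokens=50)
```

Each chunk can then be embedded and stored separately, which is the engineering work mentioned above that a larger context window lets you skip.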
To get closer to the right choice, define your budget and evaluate the costs associated with utilizing an LLM, including licensing fees, computational resources, and maintenance expenses. LLMs can be open-source or commercially licensed:
- Open-source: Stable LM, BLOOM, Falcon
- Commercial: GPT-4, Claude
Open-source solutions may seem more affordable at first sight: you don’t need to pay a fee to access them. However, you will need to invest in hosting and all infrastructure setup and support. When you use a commercial LLM, the service provider covers all infrastructure-related costs.
Most popular cloud service providers claim to keep your data safe and sound. However, most businesses will be wary to share sensitive data with third-party services.
This may be an obstacle on your way to choosing a commercial model. If you wish to maximize data privacy, consider choosing a model with weights available and hosting it on your dedicated servers. Doing so may require more effort and expertise, but this approach may contribute to more controllable data protection.
Licensing for your LLM is another important factor to consider.
For example, LLaMA — a huge and powerful language model created by Meta — is now licensed for educational and research purposes only. You may find it in open-source repositories, as the LLaMA weights were leaked in spring 2023, but using it in commercial products is forbidden.
Another new language model called Falcon is available under the Apache license and may be a great choice for commercial products.
Before choosing an LLM, make sure that its use is legal for your purposes.
Integration and development specifics
Explore whether there are well-documented APIs, libraries, and guides that can facilitate the integration process. In this case, again, OpenAI beats all competitors: its guidelines are clear and detailed.
How to implement LLM: four basic ways to make an LLM work for your business
Trained on huge datasets, a general-purpose LLM can be an amusing toy out of the box.
But in your business, you probably aren’t looking for ways to have AI-powered fun. You need an intelligent tool to work with your corporate data, handle your tasks, and chat about brand-specific topics.
In this section, we consider four main ways to adjust an LLM for your business case. We draw parallels between training a new employee and training an LLM to help you understand each approach and focus on the one (or ones) that resonate with your business, tasks, and needs.
One way to train new employees is to make them learn everything your team has been through during the last few years and provide examples of how other employees have handled particular tasks. In several months, a hypothetical trainee will learn how to perform those tasks the same way.
There is a similar process called fine-tuning in language model training.
Fine-tuning involves taking a pre-trained model and training it on a specific task or dataset to improve its performance or adapt it to a particular domain, business, or task.
Fine-tuning is a classic transfer learning technique in machine learning. Initial training of the LLM usually takes several months and requires a huge amount of compute and training data. A model trained like this acquires a lot of common sense and can identify business-specific patterns after being shown comparatively few examples. Also, fine-tuning is much quicker and cheaper compared to training from scratch.
The fine-tuning process allows for updating the model’s parameters through supervised learning techniques to adapt it to the task at hand.
The paramount component of fine-tuning is data collection.
First of all, consider data availability across your organization. Do you have enough data to start the training process? And how much is enough?
"The quality of training data has a direct impact on model performance — and also on the required size of the model."
— Janna Lipenkova, Choosing the Right Language Model for Your NLP Use Case
The answer may seem pretty generic, but the more data you have, the better the results you may achieve. OpenAI suggests that performance tends to increase linearly with every doubling of the number of examples.
Say you’d like to train a smart LLM-based assistant to respond to your emails.
- If you have a database of 10 emails, the quality of responses may be poor. It would be easy for a recipient to recognize the machine-generated texts.
- If you have 10,000 emails, your assistant may approach human-like performance.
To implement fine-tuning, start by forming a training data set. Ensure the data is well-structured and relevant to the use cases you defined during the discovery phase. Additionally, keep in mind that a chosen LLM may have its own requirements as to the size and format of training data. Before integrating a respective API, we recommend you get acquainted with its rules and requirements.
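As an illustration of such formatting requirements, the sketch below serializes (input, output) pairs into the JSON Lines prompt/completion format OpenAI has used for fine-tuning. The separator and leading-space conventions follow OpenAI’s data preparation guidance; the example emails are invented.

```python
import json

def to_finetune_jsonl(examples: list[tuple[str, str]]) -> str:
    """Serialize (input, output) pairs into JSON Lines for fine-tuning."""
    lines = []
    for prompt, completion in examples:
        record = {
            "prompt": prompt.strip() + "\n\n###\n\n",   # explicit end-of-prompt separator
            "completion": " " + completion.strip(),      # leading space helps tokenization
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

# Hypothetical training examples for an email-answering assistant
emails = [
    ("Customer asks about refund policy", "Thanks for reaching out! Refunds are..."),
    ("Customer reports a broken link", "Sorry about that! The correct link is..."),
]
jsonl = to_finetune_jsonl(emails)
```

The resulting file can be uploaded through the provider’s fine-tuning API; other models and providers may expect a different schema, so always check the current documentation first.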
OpenAI models win in terms of the simplicity of fine-tuning. You can upload examples of training data through an API, wait several minutes, and start using your fine-tuned model right away. To fine-tune an open-source model, you’ll need to do more legwork and tinkering yourself.
As skilled newcomers join your team, another way to adapt them to your tasks is to share basic instructions on how to perform them.
You can use the same approach to explain a target task to an LLM. Just provide it with an instruction called a prompt and the LLM will attempt to follow it to complete the task.
The prompting method started to bring results with the rise of GPT-3, and since then it has become a viable rival to fine-tuning.
Behind prompting, there are two processes:
- Prompt design is the process of providing a model with instructions and context to perform a task.
You can provide a model with instructions on how to handle a task, or you can employ role prompting, when you ask a system to act, build reasoning, or simply talk the way somebody else would.
- Prompt engineering is a more advanced iterative process in which engineers develop and optimize prompts to improve the model’s performance, measuring it along the way.
You can simply ask a model to perform a certain task (zero-shot prompting), or you can provide it with one example (one-shot prompting) or several examples (few-shot prompting) of how to perform the task to improve the quality of the output. Prompting can also be used to specify unusual output formats such as JSON or XML.
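The difference between these prompting styles is easiest to see in code. The helper below builds a zero-, one-, or few-shot prompt depending on how many (input, output) examples it receives; the translation task is just an illustrative choice.

```python
def build_prompt(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Build a zero-, one-, or few-shot prompt depending on how many
    (input, output) examples are supplied."""
    parts = [task.strip()]
    for sample_in, sample_out in examples:   # zero examples -> zero-shot
        parts.append(f"Input: {sample_in}\nOutput: {sample_out}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

zero_shot = build_prompt("Translate English to French.", [], "Good morning")
few_shot = build_prompt(
    "Translate English to French.",
    [("Hello", "Bonjour"), ("Thank you", "Merci")],
    "Good morning",
)
```

The same template scales from zero-shot to few-shot by adding examples, up to the point where the context window runs out.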
The weakness of the prompting technique is that the context window of an LLM can contain only a defined number of tokens. Thus, you can’t pre-feed a model with thousands of examples — it simply can’t look that far to suggest the next token in the sequence.
Although few-shot prompting brings a considerable performance increase over zero-shot prompting, it still has flaws: the technique works well for many tasks but falls short on complex reasoning tasks.
To overcome the limitations of few-shot prompting, you can employ the chain-of-thought (CoT) method to reach better results.
This approach has been showing superior impact on the performance of language models.
"Prompting a PaLM 540B with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier."
— Jason Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
The essence of CoT is that a model gets better at finding the right answer when it first generates a step-by-step explanation of its reasoning and only then states the final answer.
This approach may help when your model is involved in solving mathematical, common-sense, and symbolic reasoning tasks.
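A chain-of-thought prompt is just a regular few-shot prompt whose exemplar answers spell out the reasoning. The sketch below uses the well-known tennis-ball exemplar from the chain-of-thought paper; the trailing question is truncated on purpose as a placeholder.

```python
# One worked example whose answer includes explicit reasoning steps,
# taken from the chain-of-thought prompting paper's examples.
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a reasoning exemplar so the model explains before answering."""
    return COT_EXEMPLAR + f"\nQ: {question}\nA:"

prompt = build_cot_prompt("A cafeteria had 23 apples...")
```

Because the exemplar answer walks through intermediate steps, the model tends to imitate that structure and reason its way to the final number instead of guessing it directly.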
- Reinforcement learning from human feedback (RLHF)
Let’s go back to the hiring process. If you work closely with newcomers, constantly review their progress, ask them to provide multiple solutions to a single task, and then rate each solution, you’re using elements of reinforcement learning based on human feedback. A newcomer uses your ratings to learn and provide better results with each new attempt.
A similar approach allows you to fine-tune a pre-trained LLM using human feedback about the quality of the output.
Reinforcement learning from human feedback works this way:
- An operator prepares a set of tasks or questions.
- A language model generates several responses to a given question.
- A human operator rates these responses. These ratings are used to train a separate model called a reward model.
- The reward model is used to fine-tune the original language model using reinforcement learning.
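The third step, turning human ratings into training data for the reward model, can be sketched as follows. Reward models are commonly trained on pairwise preferences, so the helper below converts rated responses into (preferred, rejected) pairs; the response strings and ratings are invented for illustration.

```python
def preference_pairs(responses: list[str], ratings: list[float]) -> list[tuple[str, str]]:
    """Turn rated responses to one question into (preferred, rejected) pairs,
    which serve as the training signal for a reward model."""
    ranked = sorted(zip(responses, ratings), key=lambda rr: rr[1], reverse=True)
    pairs = []
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            if ranked[i][1] > ranked[j][1]:   # skip ties
                pairs.append((ranked[i][0], ranked[j][0]))
    return pairs

# Hypothetical operator ratings for three responses to one question
pairs = preference_pairs(["answer A", "answer B", "answer C"], [2.0, 5.0, 1.0])
```

The reward model learns to score the preferred response above the rejected one in each pair, and that scoring function then guides the reinforcement learning step.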
This is an iterative approach that combines the expertise of a human operator and capabilities of a chosen model, achieving high performance.
OpenAI used RLHF to turn the base GPT model (that was able to guess the next word based on internet texts) into a helpful assistant that can follow instructions, answer questions, and chat with users on any topic.
RLHF can yield impressive results, although the technology is far too complex, costly, and unreliable to consider it among your first options.
"RLHF is very much research territory and is even much harder to get to work. I would probably not advise that someone just tries to roll their own RLHF implementation. These things are pretty unstable, very difficult to train, not something that is beginner friendly right now."
— Andrej Karpathy, State of GPT talk for Microsoft Build
On the other hand, in some cases, the RLHF technique may be more attractive for a human operator. Compared to generating examples from scratch, it’s easier and faster to rate responses.
- Pre-training from scratch
Pre-training from scratch is far more demanding than onboarding a new colleague. We can describe this method as training a Jedi or a Tibetan monk: from infancy, an imaginary child is trained not to handle particular tasks but to master exceptional skills that normal kids don’t possess.
An LLM trained from scratch on your corporate data may be smarter and more suitable to do the defined job in comparison to mainstream solutions.
But creating a pre-trained LLM is complex, time-consuming, and prohibitively expensive.
Researchers estimate that training GPT-3 cost OpenAI around $5 million; the price of training GPT-4 exceeded $100 million. It takes terabytes of data, billions of parameters, and huge funds to train an LLM.
Model pre-training on terabytes of data would require millions of dollars.
Still, some enterprises invest time, money, and expertise to take risks associated with pre-training from scratch. Take Bloomberg: the company’s specialists trained BloombergGPT from scratch using both traditional general-purpose training data as well as domain-specific financial data. As a result, they created “a model that vastly outperforms existing models on in-domain financial tasks while being on par or better on general NLP benchmarks.”
The pre-training process consists of basic steps:
- Data collection
The first step is to prepare the content you want to use for integrating an LLM with your product. Consider the scope, depth, and format of information you want an LLM to access.
In the case of BloombergGPT, the team created FinPile — a dataset consisting of a financial dataset (51% of training) and a public dataset (49% of training). Overall, the model was trained on 569 billion tokens.
In this and many other articles, you will encounter the concept of natural language understanding, or NLU — the ability of machines to understand and interpret human language. Yet a machine can’t literally understand English; it operates with integers.
Thus, at this step, a tokenizer transforms texts into groups of integers.
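A drastically simplified word-level tokenizer makes the idea concrete: build a vocabulary that maps each word to an integer id, then encode text as a list of ids. Real LLM tokenizers work on subword units (byte-pair encoding and similar schemes), not whole words.

```python
def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign a unique integer id to every word seen in the corpus."""
    vocab = {"<unk>": 0}  # reserved id for words never seen in training
    for text in corpus:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Encode text as integer ids, mapping unknown words to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

vocab = build_vocab(["the model reads tokens", "the model predicts tokens"])
ids = tokenize("the model predicts", vocab)
```

These integer sequences are what the Transformer actually consumes when it learns to predict the next token.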
A Transformer tries to predict the next token in the sequence and uses its predictions to update weights in the neural network.
Training time makes a difference, too. It impacts how the model will capture complex patterns and dependencies in the training data. With more training iterations, the model can potentially learn more nuanced representations and improve its ability to generate a coherent and contextually relevant output, which, again, takes us back to the level of performance you’d like your LLM product to demonstrate.
You’re not limited to using only one approach to LLM training. For example, you can combine prompting and fine-tuning to make your model smarter, faster, and more understanding and use tools and insights from the following section to achieve even better results.
LLM as part of your technology stack
A whole ecosystem of tools and frameworks is emerging around the relatively new concept of LLMs. Tools related to LLMs allow us to consider an LLM as a new primitive we can use to build more complex solutions.
In one of its recent articles, Andreessen Horowitz offers an approach to understanding and implementing an LLM as a component of an emerging software technology stack.
The suggested framework describes the data flow through the pipeline and embeddings into the vector database. Orchestration allows an LLM to interact with external APIs and plugins. Responses are cached, logged, and validated.
Embeddings and vector databases
When you rely solely on prompting to adjust your model, you may face context window–related challenges: the bigger the window, the more computational power is needed and the higher the price of your effort. Embeddings allow you to tap into relevant information while using resources efficiently, and vector databases can help you make the most out of them.
An embedding is a term used to explain the meaning and relationships between words in a way that can be understood and processed by machine learning models.
During the LLM training phase, the model learns to assign unique numerical vectors to each word in its vocabulary. These vectors capture semantic and contextual information about the words, enabling the model to understand similarities, differences, and relationships among them.
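Similarity between embedding vectors is usually measured with cosine similarity. The sketch below computes it in pure Python over invented three-dimensional vectors; real embeddings have hundreds or thousands of dimensions and come from a model or an API.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional "embeddings" with made-up values: semantically close
# words should end up with nearby vectors.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
apple = [0.1, 0.2, 0.95]
```

Vector databases are essentially optimized engines for running exactly this kind of similarity search over millions of stored embeddings.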
To implement embeddings, you can consider the OpenAI API or Hugging Face.
At the data pre-processing stage, a vector database plays a significant role. It allows us to store, analyze, and retrieve embeddings. Among a variety of vector databases, you can choose open-source solutions like Vespa and Qdrant, local libraries like Chroma, and services such as Pinecone.
Our team implemented a GPT-based LLM to enable intelligent search on the Clockwise Software blog. To handle this task, we used embeddings.
We break our blog posts into chunks, send our texts to the OpenAI API endpoint, and receive embeddings we can then extract, save, and process.
Each question we receive on our blog is embedded and compared to stored and indexed embeddings, and the model chooses the pieces of text that are most related to the question. Then, it includes the most relevant chunks in the prompt and generates an answer.
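The retrieval-then-prompt flow described above can be sketched end to end. Here the embedding step is a crude bag-of-letters stand-in so the example runs without any API; in a real pipeline you would call an embedding endpoint and query a vector database instead.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding API call: a crude bag-of-letters vector.
    return [float(text.lower().count(ch)) for ch in "abcdefghijklmnopqrstuvwxyz"]

def answer_prompt(question: str, chunks: list[str], top_k: int = 2) -> str:
    """Retrieve the chunks most similar to the question and build the prompt."""
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q_vec), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"

chunks = ["LLMs can summarize documents.", "Our office cat likes tuna."]
prompt = answer_prompt("How do LLMs help with documents?", chunks, top_k=1)
```

Swapping the toy `embed` function for real embeddings and the sorted list for a vector database gives you the production version of the same idea.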
Check it out yourself!
Orchestration frameworks like LangChain make the whole set of chosen technologies work together. These frameworks enable integration of different tools, interaction with external APIs, data collection, synchronization, data retrieval, and memory maintenance.
LangChain is the most popular early orchestration tool, although your development team may opt for custom Python code to implement software orchestration. Orchestration tools may also be used to create agents — LLM-based tools that are able to make decisions about which actions to take, take those actions, measure their impact, and repeat the process iteratively until achieving the final result.
Take AutoGPT. It aspires to assign tasks to itself and complete them. When it gets a multi-component mission that would normally require human input and feedback, it can generate prompts, respond to those prompts, evaluate its responses, and access required external data to perform a task.
LLM caching systems
App caching can help cut an application’s response time and web app development cost. The most common tool for this is Redis — an open-source data structure server that can store LLM responses and serve repeated or similar requests thousands of times faster than regenerating them with the model.
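The core caching pattern is simple: key the cache by a hash of the prompt and only call the model on a miss. The sketch below uses an in-memory dict so it is self-contained; a production setup would typically use Redis with an expiry policy, and `fake_llm` stands in for a real model call.

```python
import hashlib

class PromptCache:
    """In-memory sketch of an LLM response cache. A production setup would
    typically use Redis (with a TTL) instead of a plain dict."""

    def __init__(self):
        self._store: dict[str, str] = {}

    def _key(self, prompt: str) -> str:
        return hashlib.sha256(prompt.encode()).hexdigest()

    def get_or_compute(self, prompt: str, compute) -> str:
        key = self._key(prompt)
        if key not in self._store:        # cache miss: call the LLM once
            self._store[key] = compute(prompt)
        return self._store[key]

calls = []
def fake_llm(prompt: str) -> str:
    """Stand-in for an expensive LLM API call; records each invocation."""
    calls.append(prompt)
    return f"response to: {prompt}"

cache = PromptCache()
first = cache.get_or_compute("What is an LLM?", fake_llm)
second = cache.get_or_compute("What is an LLM?", fake_llm)
```

The second identical request is served from the cache, so the expensive model call runs only once; semantic caches extend this idea by matching similar rather than identical prompts.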
The better you understand the tasks you are about to solve with an LLM, the easier it gets for you to choose the appropriate model and technology stack, adjust it according to your needs, and find the most effective and cost-optimized way to LLM implementation in your product.
Optimizing your LLM-based product
Optimization may help you reduce resources you invest in your product or use them in a more efficient way.
- Use a specialized model. To validate your idea, you may use the most popular, best-known, or most powerful model. At the optimization stage, focus on choosing a model that’s polished particularly for your use case.
- Move to your own servers. You can choose a paid out-of-the-box solution or opt for an open-source LLM. If you choose a model available via a commercial API, consider an open-source alternative and think about migrating to your own servers. Although this step requires going an extra mile in engineering, it may help you optimize maintenance costs in the long run and provide a higher level of data security to your users in comparison to a commercial LLM API.
- Reform your team. With an intelligent LLM tool, you may automate or completely eliminate many routine tasks. This means you can also alter your software development team structure and size. LLMs may help you improve your team’s specialization and bring you closer to expected business results.
LLMs are unsupervised multi-task learners. These technologies can serve you well in performing multiple tasks. The more data you collect, use, and process in your organization, the more helpful LLMs can be.
They are not just one of the seasonal web development trends; they are here to stay.
While some people still know nothing about these breakthroughs and haven’t tried ChatGPT, you can and should start doing magic with LLMs right now. Just look around. Bing is showing impressive results with its AI-powered search, Notion AI is taking task management and other routine tasks to a whole new level, and dozens of new solutions to old problems are appearing daily.
With an LLM product, you can relive the experience of the first Excel users. Just like those first Excel users embraced the perks of a new product that changed the whole accounting sphere, you can revolutionize your business and the services you provide with an intelligent LLM-based solution.
Now is the time to get the upper hand, increase your market share, and optimize expenses.
Focus on a particular task, choose a suitable model and implementation approach, validate your idea, adjust it to your needs, and make your business thrive in an AI-driven era.