The emerging LLM tech stack provides a structured approach to integrating large language models into applications, emphasizing modularity and scalability. By leveraging in-context learning and the layered architecture, developers can efficiently build and deploy AI-powered solutions tailored to specific use cases.
Pre-trained Large Language Models (LLMs), such as OpenAI’s GPT-4 and Meta’s Llama 2, have become increasingly prevalent in application development leveraging generative AI. As a software developer, how can you efficiently integrate LLM-powered capabilities into your application? An emerging tech stack, often referred to as the LLM application stack, is forming to facilitate interaction with these models via in-context learning.
In this article, we'll define in-context learning and explore each layer of the emerging tech stack, which serves as a reference architecture for AI startups and tech companies.
Ready-to-use LLMs like GPT-4 and Llama 2 are foundation models pre-trained on vast amounts of publicly available data, including Common Crawl, Wikipedia, and Project Gutenberg. GPT-4's size is estimated at approximately 1.76 trillion parameters, while the largest Llama 2 variant has 70 billion parameters. Due to the breadth of their pre-training data, these LLMs are suitable for generalized use cases. However, adjustments are often necessary to tailor them to specific application scenarios.
Two general approaches for customizing pre-trained LLMs to unique use cases are:
Fine-tuning involves additional training of a pre-trained LLM by providing it with a smaller, domain-specific, and proprietary dataset. This process will alter the parameters of the LLM, and thus modify the “model's knowledge bank” to be more specialized.
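As a rough illustration, fine-tuning typically starts from a small file of domain-specific examples. The sketch below prepares such a dataset in the chat-style JSONL format that OpenAI's fine-tuning endpoint accepted as of late 2023; the file name and example content are hypothetical.

```python
import json

# Hypothetical domain-specific examples used to specialize a general-purpose model.
training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a support agent for Acme Co."},
            {"role": "user", "content": "How long do refunds take?"},
            {"role": "assistant", "content": "Refunds are processed within 5-7 business days."},
        ]
    },
    # ... more input-output pairs covering the target domain
]

# Fine-tuning APIs typically expect one JSON object per line (JSONL).
with open("acme_support_finetune.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
```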
As of December 2023, OpenAI offers fine-tuning for GPT-3.5 Turbo (with GPT-4 fine-tuning in experimental access), and open-weight models such as Llama 2 can be fine-tuned on your own infrastructure.
Advantages: strong performance on the specialized domain, and shorter prompts at inference time because the domain knowledge is encoded in the model's parameters.
Disadvantages: it requires curating a training dataset, compute resources, and machine learning expertise, and the process must be repeated whenever the underlying data changes.
In-context learning maintains the integrity of the pre-trained model, guiding its output through structured prompts and relevant retrieved data. This approach provides the model with pertinent information at the appropriate time.
To condition the LLM for specific tasks and desired output formats, few-shot prompting can be employed. This technique involves supplying the LLM with examples of expected input-output pairs as part of the input context. The LLM's context, comprising tokenized data, functions as the model's "attention span." These example pairs act as a targeted, mini training dataset.
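For example, the chatbot's few-shot prompt might embed a couple of sample exchanges directly in the input context before the real question. The snippet below is a minimal sketch; the example pairs and wording are hypothetical.

```python
# Hypothetical example pairs that condition the model on the desired task and tone.
FEW_SHOT_EXAMPLES = [
    ("Do you ship internationally?", "Yes, we ship to over 40 countries. Delivery takes 7-14 days."),
    ("Can I change my order?", "Orders can be modified within 2 hours of purchase from your account page."),
]

def build_few_shot_prompt(user_question: str) -> str:
    """Compile a prompt that shows the model expected input-output pairs before the real question."""
    lines = ["You are a helpful customer service agent. Answer concisely.", ""]
    for question, answer in FEW_SHOT_EXAMPLES:
        lines.append(f"Customer: {question}")
        lines.append(f"Agent: {answer}")
        lines.append("")
    lines.append(f"Customer: {user_question}")
    lines.append("Agent:")
    return "\n".join(lines)

print(build_few_shot_prompt("What is your refund policy?"))
```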
A compiled prompt typically combines elements such as a prompt template defined by the developer, few-shot examples, information retrieved from external sources, and the user's query.
Given that pre-trained LLMs were trained on publicly available data with a cutoff date, they lack awareness of recent events and private data. To supplement the LLM's knowledge, the Retrieval Augmented Generation (RAG) technique can be utilized. This involves retrieving additional required information from various sources—such as vector or SQL databases, internal or external APIs, and document repositories—and including it as part of the input context.
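Putting these pieces together, the compiled prompt simply interleaves a template, the retrieved passages, and the user's query. The sketch below assumes a hypothetical retrieve_documents helper that performs the retrieval step.

```python
def compile_prompt(user_question: str, retrieved_passages: list[str]) -> str:
    """Combine a template, retrieved context, and the user's query into one prompt."""
    context = "\n\n".join(retrieved_passages)
    return (
        "You are a customer service assistant. Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_question}\n"
        "Answer:"
    )

# retrieve_documents() is a hypothetical retrieval step (vector DB, SQL query, or API lookup).
# passages = retrieve_documents("refund policy", top_k=3)
# prompt = compile_prompt("How do I get a refund?", passages)
```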
As of December 2023, context windows range from 4,096 tokens for Llama 2 to 128,000 tokens for GPT-4 Turbo, which bounds how much retrieved information can be included in a single prompt.
Advantages: no additional training is required, iteration is fast and inexpensive, and the model's knowledge can be kept current simply by changing what is retrieved.
Disadvantages: output quality depends heavily on prompt construction and retrieval quality, the context window limits how much information can be supplied, and every request carries the token cost of the added context.
The emerging LLM tech stack comprises three main layers: the data layer, the model layer, and the orchestration layer, plus a supplementary operational (LLMOps) layer.
We will illustrate these layers using a simple application example: a customer service chatbot knowledgeable about a company's products, policies, and FAQs.
The data layer encompasses the full preprocessing and storage of private and supplementary information. The data processing involves three main steps: extracting, embedding, and storing.
A table of available offerings for the Data Layer (as of December 2023, not exhaustive).
Source: Inspired by Emerging Architectures for LLM Applications, written by Matt Bornstein and Rajko Radovanovic, and The New Language Model Stack, written by Michelle Fradin and Lauren Reeder.
Relevant data may originate from multiple sources in various formats. Connectors are established to ingest data from these sources for extraction. For a customer service chatbot, data sources might include product documentation, company policy documents, and FAQ pages.
Optional steps include cleaning the extracted data by removing unnecessary or confidential parts and transforming it into a standardized format, such as JSON, for efficient downstream processing.
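As a minimal sketch, assuming the exported documents live in a local folder (the directory and file names are hypothetical), the extraction step might normalize each file into a JSON record ready for embedding.

```python
import json
from pathlib import Path

def extract_documents(source_dir: str) -> list[dict]:
    """Read raw text files and normalize them into standardized JSON-ready records."""
    records = []
    for path in Path(source_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8").strip()
        records.append({"source": path.name, "content": text})
    return records

# Hypothetical directory holding exported policy documents and FAQ pages.
docs = extract_documents("./company_docs")
Path("extracted.json").write_text(json.dumps(docs, indent=2))
```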
Depending on data complexity and scale, this preprocessing can be handled by lightweight scripts and document loaders or by dedicated data pipelines and ETL tooling.
An embedding is a numerical representation capturing semantic meaning, expressed as a vector. Embeddings enable quick classification and search of unstructured data by comparing their vector representations.
To create embeddings, the prepared data is passed to an embedding model, for example through OpenAI's embeddings endpoint at https://api.openai.com/v1/embeddings. Once data is embedded, the output along with the original content is stored in a vector database or in a traditional database enhanced with a vector search extension.
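A rough sketch of that call, using the endpoint above with the text-embedding-ada-002 model (the model choice and record structure are assumptions), might look like this.

```python
import os
import requests

def embed_text(text: str) -> list[float]:
    """Request an embedding vector for a piece of text from OpenAI's embeddings endpoint."""
    response = requests.post(
        "https://api.openai.com/v1/embeddings",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": "text-embedding-ada-002", "input": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["data"][0]["embedding"]

# Store the vector together with the original content for later similarity search.
content = "Refunds are issued within 5-7 business days."
record = {"content": content, "embedding": embed_text(content)}
```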
A vector database is built specifically for indexing and querying vectorized data. It supports CRUD operations and is optimized for high-performance similarity search, real-time updates, and data security.
Adding vector search to an existing SQL or NoSQL database is a quicker transition and allows teams to leverage existing infrastructure and expertise. However, this is best suited for simple use cases. Traditional databases prioritize data consistency and atomicity, whereas vector databases are designed for search speed and availability, accepting eventual consistency as a tradeoff. Integrating vector functionality into relational systems may result in performance compromises. A dedicated vector database also requires separate maintenance and monitoring from the primary database system.
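Under the hood, similarity search ranks stored vectors by their closeness to the query vector. The sketch below does this in memory with cosine similarity; dedicated vector databases perform the same ranking at scale using specialized indexes.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vector: np.ndarray, stored: list[dict], k: int = 3) -> list[dict]:
    """Return the k stored records whose embeddings are most similar to the query vector."""
    ranked = sorted(
        stored,
        key=lambda record: cosine_similarity(query_vector, np.array(record["embedding"])),
        reverse=True,
    )
    return ranked[:k]
```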
The model layer contains the off-the-shelf LLM used by your application, such as GPT-4 or Llama 2. Select the LLM that suits your specific purpose as well as your requirements for cost, performance, and complexity. For a customer service chatbot, we may use GPT-4, which is optimized for conversations, offers robust multilingual support, and has advanced reasoning capabilities.
The model API handles the inference process: it receives the compiled prompt, executes it against the pre-trained model, and returns the output, often working alongside supporting systems such as logging and caching.
The access method depends on the specific LLM, whether it is proprietary or open-source, and how the model is hosted. Typically, there will be an API endpoint for LLM inference, or prompt execution, which receives the input data and produces the output. At the time of writing, GPT-4 is accessed through OpenAI's API at https://api.openai.com (for chat models, via the /v1/chat/completions endpoint).
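A minimal prompt-execution call against that endpoint might look like the following sketch; the model name and error handling are kept deliberately simple.

```python
import os
import requests

def run_prompt(prompt: str) -> str:
    """Submit a compiled prompt to OpenAI's chat completions endpoint and return the model's reply."""
    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": "gpt-4", "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```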
A table of available offerings for the Model Layer (as of December 2023, not exhaustive).
Source: Inspired by Emerging Architectures for LLM Applications, written by Matt Bornstein and Rajko Radovanovic, and The New Language Model Stack, written by Michelle Fradin and Lauren Reeder.
The orchestration layer is the core framework responsible for coordinating all other layers in the LLM application stack, as well as any external systems. It provides libraries, templates, and tools to handle key operations such as prompt construction and execution. Functionally, it resembles the controller in the Model-View-Controller (MVC) architecture.
With the in-context learning approach, the orchestration framework takes in the user's query, retrieves the relevant supplementary data, compiles the prompt from the template, few-shot examples, retrieved context, and query, submits it to the LLM, and processes the response before returning it to the application.
For a basic customer service chatbot, if a user asks about refund policies, the framework embeds the question, retrieves the relevant refund policy passages from the vector database, compiles them into the prompt along with the question, and sends the result to the LLM, which answers based on the company's actual policy.
This illustrates how orchestration frameworks enable dynamic, context-aware LLM applications.
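A framework-agnostic sketch of that flow is below; embed, search, and call_llm stand in for the data-layer and model-layer functions described earlier and are assumed to be supplied by the application.

```python
def answer_question(question, embed, search, call_llm):
    """Orchestrate one chatbot turn: embed the query, retrieve context, compile the prompt, call the model."""
    query_vector = embed(question)              # data layer: embed the user's question
    passages = search(query_vector, top_k=3)    # data layer: similarity search over stored documents
    prompt = (
        "Answer the customer's question using only the context below.\n\n"
        "Context:\n" + "\n".join(passages) + "\n\n"
        "Question: " + question + "\nAnswer:"
    )
    return call_llm(prompt)                     # model layer: prompt execution

# An orchestration framework such as LangChain packages these same steps as reusable "chains".
```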
One example framework is LangChain (libraries available in JavaScript and Python), containing interfaces and integrations to common components as well as the ability to combine multiple steps into “chains.” Aside from programming frameworks, there are GUI frameworks available for orchestration. For instance, Flowise, built on top of LangChain, has a graphical layout for visually chaining together major components.
A table of available offerings for the Orchestration Layer (as of December 2023, not exhaustive).
Source: Inspired by Emerging Architectures for LLM Applications, written by Matt Bornstein and Rajko Radovanovic, and The New Language Model Stack, written by Michelle Fradin and Lauren Reeder.
As large language model applications reach production scale, an LLMOps layer can be introduced to manage reliability, efficiency, and security. This layer addresses concerns such as caching, logging and monitoring, validation of inputs and outputs, and protection of sensitive data.
In a chatbot context, frequent questions like refund policies can be served from cache. Queries can be validated before prompt execution, and outputs reviewed to ensure they meet performance and accuracy standards.
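A simple sketch of these operational checks wraps prompt execution with a cache lookup and basic input and output validation; the validation rules and messages here are illustrative only.

```python
response_cache: dict[str, str] = {}

def answer_with_ops(question: str, call_llm) -> str:
    """Wrap prompt execution with caching and simple input/output checks."""
    if not question.strip() or len(question) > 2000:
        return "Sorry, I couldn't process that question."          # validate the query before execution
    if question in response_cache:
        return response_cache[question]                            # serve frequent questions from cache
    answer = call_llm(question)                                     # call_llm is a hypothetical model-layer callable
    if not answer.strip():
        return "Sorry, something went wrong. Please try again."    # review the output before returning it
    response_cache[question] = answer
    return answer
```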
A table of available offerings for the Operational Layer (as of December 2023, not exhaustive).
Source: Inspired by Emerging Architectures for LLM Applications, written by Matt Bornstein and Rajko Radovanovic, and The New Language Model Stack, written by Michelle Fradin and Lauren Reeder.
This article outlined how in-context learning provides an easier path to building LLM applications compared to model fine-tuning. The emerging LLM tech stack now includes a data layer for preprocessing and storing supplementary information, a model layer providing the pre-trained LLM, an orchestration layer that coordinates the other layers and compiles prompts, and an optional operational (LLMOps) layer for production concerns such as caching, validation, and monitoring.
Each layer offers a modular entry point into LLM app development and helps teams quickly build scalable and production-ready solutions.