The emerging LLM tech stack provides a structured approach to integrating large language models into applications, emphasizing modularity and scalability. By leveraging in-context learning and the layered architecture, developers can efficiently build and deploy AI-powered solutions tailored to specific use cases.
Pre-trained Large Language Models (LLMs), such as OpenAI’s GPT-4 and Meta’s Llama 2, have become increasingly prevalent in application development leveraging generative AI. As a software developer, how can you efficiently integrate LLM-powered capabilities into your application? An emerging tech stack, often referred to as the LLM application stack, is forming to facilitate interaction with these models via in-context learning.
In this article, we'll define in-context learning and explore each layer of the emerging tech stack, which serves as a reference architecture for AI startups and tech companies.
Ready-to-use LLMs like GPT-4 and Llama 2 are foundation models pre-trained on vast amounts of publicly available data, including Common Crawl, Wikipedia, and Project Gutenberg. GPT-4's size is estimated at approximately 1.76 trillion parameters, while the largest Llama 2 variant has 70 billion parameters. Due to the breadth of their pre-training data, these LLMs are suitable for generalized use cases. However, adjustments are often necessary to tailor them to specific application scenarios.
Two general approaches for customizing pre-trained LLMs to unique use cases are:
Fine-tuning involves additional training of a pre-trained LLM by providing it with a smaller, domain-specific, and proprietary dataset. This process will alter the parameters of the LLM, and thus modify the “model's knowledge bank” to be more specialized.
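As a rough illustration, fine-tuning typically starts from a small file of domain-specific examples. The sketch below prepares such a dataset in the chat-style JSONL format that OpenAI's fine-tuning endpoint accepted as of late 2023; the file name and example content are hypothetical.

```python
import json

# Hypothetical domain-specific examples used to specialize a general-purpose model.
training_examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a support agent for Acme Co."},
            {"role": "user", "content": "How long do refunds take?"},
            {"role": "assistant", "content": "Refunds are processed within 5-7 business days."},
        ]
    },
    # ... more input-output pairs covering the target domain
]

# Fine-tuning APIs typically expect one JSON object per line (JSONL).
with open("acme_support_finetune.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
```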
As of December 2023, OpenAI offers fine-tuning for GPT-3.5 Turbo (with GPT-4 fine-tuning in experimental access), and open-weight models such as Llama 2 can be fine-tuned on your own infrastructure.
Advantages: strong performance on the specialized domain, and shorter prompts at inference time because the domain knowledge is encoded in the model's parameters.
Disadvantages: it requires curating a training dataset, compute resources, and machine learning expertise, and the process must be repeated whenever the underlying data changes.
In-context learning maintains the integrity of the pre-trained model, guiding its output through structured prompts and relevant retrieved data. This approach provides the model with pertinent information at the appropriate time.
To condition the LLM for specific tasks and desired output formats, few-shot prompting can be employed. This technique involves supplying the LLM with examples of expected input-output pairs as part of the input context. The LLM's context, comprising tokenized data, functions as the model's "attention span." These example pairs act as a targeted, mini training dataset.
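For example, the chatbot's few-shot prompt might embed a couple of sample exchanges directly in the input context before the real question. The snippet below is a minimal sketch; the example pairs and wording are hypothetical.

```python
# Hypothetical example pairs that condition the model on the desired task and tone.
FEW_SHOT_EXAMPLES = [
    ("Do you ship internationally?", "Yes, we ship to over 40 countries. Delivery takes 7-14 days."),
    ("Can I change my order?", "Orders can be modified within 2 hours of purchase from your account page."),
]

def build_few_shot_prompt(user_question: str) -> str:
    """Compile a prompt that shows the model expected input-output pairs before the real question."""
    lines = ["You are a helpful customer service agent. Answer concisely.", ""]
    for question, answer in FEW_SHOT_EXAMPLES:
        lines.append(f"Customer: {question}")
        lines.append(f"Agent: {answer}")
        lines.append("")
    lines.append(f"Customer: {user_question}")
    lines.append("Agent:")
    return "\n".join(lines)

print(build_few_shot_prompt("What is your refund policy?"))
```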
A compiled prompt typically combines elements such as a prompt template defined by the developer, few-shot examples, information retrieved from external sources, and the user's query.
Given that pre-trained LLMs were trained on publicly available data with a cutoff date, they lack awareness of recent events and private data. To supplement the LLM's knowledge, the Retrieval Augmented Generation (RAG) technique can be utilized. This involves retrieving additional required information from various sources—such as vector or SQL databases, internal or external APIs, and document repositories—and including it as part of the input context.
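Putting these pieces together, the compiled prompt simply interleaves a template, the retrieved passages, and the user's query. The sketch below assumes a hypothetical retrieve_documents helper that performs the retrieval step.

```python
def compile_prompt(user_question: str, retrieved_passages: list[str]) -> str:
    """Combine a template, retrieved context, and the user's query into one prompt."""
    context = "\n\n".join(retrieved_passages)
    return (
        "You are a customer service assistant. Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_question}\n"
        "Answer:"
    )

# retrieve_documents() is a hypothetical retrieval step (vector DB, SQL query, or API lookup).
# passages = retrieve_documents("refund policy", top_k=3)
# prompt = compile_prompt("How do I get a refund?", passages)
```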
As of December 2023, context windows range from 4,096 tokens for Llama 2 to 128,000 tokens for GPT-4 Turbo, which bounds how much retrieved information can be included in a single prompt.
Advantages: no additional training is required, iteration is fast and inexpensive, and the model's knowledge can be kept current simply by changing what is retrieved.
Disadvantages: output quality depends heavily on prompt construction and retrieval quality, the context window limits how much information can be supplied, and every request carries the token cost of the added context.
The emerging LLM tech stack comprises three main layers: the data layer, the model layer, and the orchestration layer, plus a supplementary operational (LLMOps) layer.
We will illustrate these layers using a simple application example: a customer service chatbot knowledgeable about a company's products, policies, and FAQs.
The data layer encompasses the full preprocessing and storage of private and supplementary information. The data processing involves three main steps: extracting, embedding, and storing.
A table of available offerings for the Data Layer (as of December 2023, not exhaustive).
Source: Inspired by Emerging Architectures for LLM Applications, written by Matt Bornstein and Rajko Radovanovic, and The New Language Model Stack, written by Michelle Fradin and Lauren Reeder.
Relevant data may originate from multiple sources in various formats. Connectors are established to ingest data from these sources for extraction. For a customer service chatbot, data sources might include product documentation, company policy documents, and FAQ pages.
Optional steps include cleaning the extracted data by removing unnecessary or confidential parts and transforming it into a standardized format, such as JSON, for efficient downstream processing.
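As a minimal sketch, assuming the exported documents live in a local folder (the directory and file names are hypothetical), the extraction step might normalize each file into a JSON record ready for embedding.

```python
import json
from pathlib import Path

def extract_documents(source_dir: str) -> list[dict]:
    """Read raw text files and normalize them into standardized JSON-ready records."""
    records = []
    for path in Path(source_dir).glob("*.txt"):
        text = path.read_text(encoding="utf-8").strip()
        records.append({"source": path.name, "content": text})
    return records

# Hypothetical directory holding exported policy documents and FAQ pages.
docs = extract_documents("./company_docs")
Path("extracted.json").write_text(json.dumps(docs, indent=2))
```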
Depending on data complexity and scale, this preprocessing can be handled by lightweight scripts and document loaders or by dedicated data pipelines and ETL tooling.
An embedding is a numerical representation capturing semantic meaning, expressed as a vector. Embeddings enable quick classification and search of unstructured data by comparing their vector representations.
To create embeddings, the prepared data is passed to an embedding model, for example through OpenAI's embeddings endpoint at https://api.openai.com/v1/embeddings. Once data is embedded, the output along with the original content is stored in a vector database or in a traditional database enhanced with a vector search extension.
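A rough sketch of that call, using the endpoint above with the text-embedding-ada-002 model (the model choice and record structure are assumptions), might look like this.

```python
import os
import requests

def embed_text(text: str) -> list[float]:
    """Request an embedding vector for a piece of text from OpenAI's embeddings endpoint."""
    response = requests.post(
        "https://api.openai.com/v1/embeddings",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": "text-embedding-ada-002", "input": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["data"][0]["embedding"]

# Store the vector together with the original content for later similarity search.
content = "Refunds are issued within 5-7 business days."
record = {"content": content, "embedding": embed_text(content)}
```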
A vector database is built specifically for indexing and querying vectorized data. It supports CRUD operations and is optimized for high-performance similarity search, real-time updates, and data security.
Adding vector search to an existing SQL or NoSQL database is a quicker transition and allows teams to leverage existing infrastructure and expertise. However, this is best suited for simple use cases. Traditional databases prioritize data consistency and atomicity, whereas vector databases are designed for search speed and availability, accepting eventual consistency as a tradeoff. Integrating vector functionality into relational systems may result in performance compromises. A dedicated vector database also requires separate maintenance and monitoring from the primary database system.
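Under the hood, similarity search ranks stored vectors by their closeness to the query vector. The sketch below does this in memory with cosine similarity; dedicated vector databases perform the same ranking at scale using specialized indexes.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vector: np.ndarray, stored: list[dict], k: int = 3) -> list[dict]:
    """Return the k stored records whose embeddings are most similar to the query vector."""
    ranked = sorted(
        stored,
        key=lambda record: cosine_similarity(query_vector, np.array(record["embedding"])),
        reverse=True,
    )
    return ranked[:k]
```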
The model layer contains the off-the-shelf LLM used by your application, such as GPT-4 or Llama 2. Select the LLM that suits your specific purpose as well as your requirements for cost, performance, and complexity. For a customer service chatbot, we may use GPT-4, which is optimized for conversations, offers robust multilingual support, and has advanced reasoning capabilities.
The model API handles the inference process: it receives the compiled prompt, executes it against the pre-trained model, and returns the output, often working alongside supporting systems such as logging and caching.
The access method depends on the specific LLM, whether it is proprietary or open-source, and how the model is hosted. Typically, there will be an API endpoint for LLM inference, or prompt execution, which receives the input data and produces the output. At the time of writing, GPT-4 is accessed through OpenAI's API at https://api.openai.com (for chat models, via the /v1/chat/completions endpoint).
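A minimal prompt-execution call against that endpoint might look like the following sketch; the model name and error handling are kept deliberately simple.

```python
import os
import requests

def run_prompt(prompt: str) -> str:
    """Submit a compiled prompt to OpenAI's chat completions endpoint and return the model's reply."""
    response = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": "gpt-4", "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```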
A table of available offerings for the Model Layer (as of December 2023, not exhaustive).
Source: Inspired by Emerging Architectures for LLM Applications, written by Matt Bornstein and Rajko Radovanovic, and The New Language Model Stack, written by Michelle Fradin and Lauren Reeder.
The orchestration layer is the core framework responsible for coordinating all other layers in the LLM application stack, as well as any external systems. It provides libraries, templates, and tools to handle key operations such as prompt construction and execution. Functionally, it resembles the controller in the Model-View-Controller (MVC) architecture.
With the in-context learning approach, the orchestration framework takes in the user's query, retrieves the relevant supplementary data, compiles the prompt from the template, few-shot examples, retrieved context, and query, submits it to the LLM, and processes the response before returning it to the application.
For a basic customer service chatbot, if a user asks about refund policies, the framework embeds the question, retrieves the relevant refund policy passages from the vector database, compiles them into the prompt along with the question, and sends the result to the LLM, which answers based on the company's actual policy.
This illustrates how orchestration frameworks enable dynamic, context-aware LLM applications.
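A framework-agnostic sketch of that flow is below; embed, search, and call_llm stand in for the data-layer and model-layer functions described earlier and are assumed to be supplied by the application.

```python
def answer_question(question, embed, search, call_llm):
    """Orchestrate one chatbot turn: embed the query, retrieve context, compile the prompt, call the model."""
    query_vector = embed(question)              # data layer: embed the user's question
    passages = search(query_vector, top_k=3)    # data layer: similarity search over stored documents
    prompt = (
        "Answer the customer's question using only the context below.\n\n"
        "Context:\n" + "\n".join(passages) + "\n\n"
        "Question: " + question + "\nAnswer:"
    )
    return call_llm(prompt)                     # model layer: prompt execution

# An orchestration framework such as LangChain packages these same steps as reusable "chains".
```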
One example framework is LangChain (libraries available in JavaScript and Python), containing interfaces and integrations to common components as well as the ability to combine multiple steps into “chains.” Aside from programming frameworks, there are GUI frameworks available for orchestration. For instance, Flowise, built on top of LangChain, has a graphical layout for visually chaining together major components.
A table of available offerings for the Orchestration Layer (as of December 2023, not exhaustive).
Source: Inspired by Emerging Architectures for LLM Applications, written by Matt Bornstein and Rajko Radovanovic, and The New Language Model Stack, written by Michelle Fradin and Lauren Reeder.
As large language model applications reach production scale, an LLMOps layer can be introduced to manage reliability, efficiency, and security. This layer addresses concerns such as caching, logging and monitoring, validation of inputs and outputs, and protection of sensitive data.
In a chatbot context, frequent questions like refund policies can be served from cache. Queries can be validated before prompt execution, and outputs reviewed to ensure they meet performance and accuracy standards.
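A simple sketch of these operational checks wraps prompt execution with a cache lookup and basic input and output validation; the validation rules and messages here are illustrative only.

```python
response_cache: dict[str, str] = {}

def answer_with_ops(question: str, call_llm) -> str:
    """Wrap prompt execution with caching and simple input/output checks."""
    if not question.strip() or len(question) > 2000:
        return "Sorry, I couldn't process that question."          # validate the query before execution
    if question in response_cache:
        return response_cache[question]                            # serve frequent questions from cache
    answer = call_llm(question)                                     # call_llm is a hypothetical model-layer callable
    if not answer.strip():
        return "Sorry, something went wrong. Please try again."    # review the output before returning it
    response_cache[question] = answer
    return answer
```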
A table of available offerings for the Operational Layer (as of December 2023, not exhaustive).
Source: Inspired by Emerging Architectures for LLM Applications, written by Matt Bornstein and Rajko Radovanovic, and The New Language Model Stack, written by Michelle Fradin and Lauren Reeder.
This article outlined how in-context learning provides an easier path to building LLM applications compared to model fine-tuning. The emerging LLM tech stack now includes a data layer for preprocessing and storing supplementary information, a model layer providing the pre-trained LLM, an orchestration layer that coordinates the other layers and compiles prompts, and an optional operational (LLMOps) layer for production concerns such as caching, validation, and monitoring.
Each layer offers a modular entry point into LLM app development and helps teams quickly build scalable and production-ready solutions.