
Building LLM Applications: Open-Source RAG (Part 7)

Vipra Singh · 27 min read · Mar 17, 2024

Learn Large Language Models (LLM) through the lens of a Retrieval Augmented
Generation (RAG) Application.

Posts in this Series


1. Introduction

2. Data Preparation

3. Sentence Transformers

4. Vector Database

5. Search & Retrieval

6. LLM

7. Open-Source RAG ( This Post )

8. Evaluation

9. Serving LLMs

10. Advanced RAG

Table of Contents
· 1. Introduction
∘ 1.1. LLMs
∘ 1.2. LLM Providers
∘ 1.3. Vector Databases
∘ 1.4. Embedding Models
∘ 1.5. Orchestration Tools
∘ 1.6. Quality Tuning Tools
∘ 1.7. Data Tools
∘ 1.8. Infrastructure
· 2. Build an LLM application from scratch
∘ 2.1. Prepare the data
∘ 2.2. Create the embeddings + retriever
∘ 2.3. Load quantized model
∘ 2.4. Setup the LLM chain
∘ 2.5. Compare the results
· 3. LLM Server
· 4. Chatbot Applications
· 5. Application 1: Chat with multiple PDFs
· 6. Application 2: Chatbot with Open WebUI
· 7. Application 3: Deploy Chatbot using Docker
· Conclusion
· Credits

1. Introduction

Source

Our previous blog posts extensively explored Large Language Models (LLMs),
covering their evolution and wide-ranging applications. Now, let’s take a closer look
at the core of this journey: Building LLM Applications locally.

In this blog post, we’ll create a basic LLM Application using LangChain.

Afterward, we’ll proceed to develop three additional open-source LLM Applications


locally.

Credits : Yujian Tang

Let’s start with looking into the tools in the LLM App Stack.

We will see more LLM apps implemented, and we'll start to see more of them take on production characteristics. These include, but are not limited to, observability, data versioning, and enterprise features on top of the basic pieces.

As of the March 2024 update, this article contains 67 companies in 8 categories:

LLMs

LLM Providers

Vector Databases

Embedding Models

Orchestration

Quality Tuning

Infrastructure

Data Tools

1.1. LLMs

Large language models are all the rage in AI. They have enabled us to work with AI through natural language, a goal that researchers and practitioners everywhere have been striving toward for decades. The 2014 rise of generative adversarial networks, combined with the 2018 emergence of transformers and the increased compute capabilities over the years, has led to this moment and this technology.

It’s not accurate or fair to say that LLMs will change the world. They already have.

OpenAI (GPT)

Meta (Llama)

Google (Gemini)

Mistral

Deci AI

DeciLM-7B is the latest in a family of LLMs by Deci AI. With its 7.04 billion
parameters and an architecture optimized for high throughput, it achieves top
performance on the Open LLM Leaderboard. It ensures diversity and efficiency in
training through a blend of open source and proprietary datasets and innovative
techniques like Grouped-Query Attention. Deci 7B supports 8192 tokens and is
under an Apache 2.0 license. — Harpreet Sahota

Symbl AI

We [the founders] come from a telecom background, where we saw a need for latency-sensitive, low-memory language models. Symbl AI features a unique AI model that focuses on understanding speech from end to end. It includes the ability to do speech-to-text as well as analyze and understand what was said. — Surbhi Rathore

Claude by Anthropic

AI Bloks

We built AI Bloks to solve the problem of automating LLM workflows in the enterprise in private cloud. Our product ecosystem has one of the most comprehensive open-source development frameworks and models for enterprise-focused LLM workflows. We have an integrated RAG framework with over 40 small language models that are fine-tuned and CPU-friendly, and designed to stack together for a comprehensive solution. — Namee Oberst

Arcee AI

Abacus AI

Nous Research

Solar by Upstage

LLMs are expensive, even more so for developing countries. There needs to be a solution to this. That's why we made Solar. Solar is small enough to fit on a chip and accessible enough that anyone can use it. — Sung Kim

1.2. LLM Providers


Amazon (Bedrock)

OctoAI

“When we started OctoAI, we knew models would only get larger, making GPU
resources scarce. This led us to focus our systems expertise on serving AI workloads
efficiently at scale. Today OctoAI serves the latest text-gen and media-gen
foundation models, via OpenAI-compatible APIs, so developers can get the best out
of open source innovation in a cost-effective package.” — Thierry Moreau

Fireworks AI

Martian

1.3. Vector Databases


Vector databases are a critical piece of the LLM stack. They provide the ability to
work with unstructured data. Otherwise impossible to work with, unstructured data
can be run through machine learning models to produce a vector embedding.
Vector databases use these vector embeddings to find similar vectors.

Milvus (This is my project! Give it a GitHub star!)

Milvus is an open source vector database aimed at making it possible to work with
billions of vectors. Aimed at enterprise scale, Milvus also includes many enterprise
features like multi-tenancy, role based access control, and data replications. —
Yujian Tang
Weaviate

Chroma

Qdrant

Astra DB

ApertureData

“We built Aperture Data with the intention of simplifying interactions with
multimodal data types for DS/ML teams. Our biggest value proposition is that we
can merge vector data, intelligence graphs, and multimodal data for querying
through one API.” — Vishakha Gupta

Pinecone

LanceDB

LanceDB runs in your app with no servers to manage. Zero vendor lock-in. LanceDB
is a developer-friendly, open source database for AI. It is based on DuckDB and the
Lance data format. — Jasmine Wang

ElasticSearch

Zilliz

Zilliz Cloud intends to solve the unstructured data problem. Built on the highly
scalable, reliable, and popular open source vector database Milvus, Zilliz Cloud
offers devs the ability to customize their vector search, scale seamlessly to billions
of vectors, and do it all without having to manage a complex infrastructure. —
Charles Xie

1.4. Embedding Models


Embedding models are the models that create vector embeddings. These are a
critical piece of the LLM stack that are often confused with LLMs. I blame OpenAI’s
naming conventions + the intense fervor around the need to learn this new
technology. LLMs can be used as embedding models, but embedding models have
existed long before the rise of LLMs.

Hugging Face

Voyage AI

“Voyage AI offers general-purpose, domain-specific, and fine-tuned embedding models with the best retrieval quality and efficiency.” — Tengyu Ma

MixedBread

MixedBread looks to change the way that AI and people interact with Data. It’s
backed by a strong research and software engineering team. — Aamir Shakir

Jina AI
1.5. Orchestration Tools
A whole new set of orchestration tools rose around LLMs. The primary reason?
Orchestration of LLM apps includes prompting, an entirely new category. These
tools are made by people on the cutting edge of both “prompt engineering” and
machine learning.

LlamaIndex

“We built the first version of LlamaIndex at the cusp of the ChatGPT boom to solve
one of the most pressing problems with LLM tooling — how to harness this
reasoning capability and apply it on top of a user’s data. Today we’re a mature data
framework in Python and TypeScript that provides comprehensive
tooling/integrations (150+ data loaders, 50+ templates, 50+ vector stores) to build out
any LLM application over your data, from RAG to agents.” — Jerry Liu

LangChain

HayStack

Semantic Kernel

AutoGen

Flyte

“Flyte is redefining the landscape of machine learning and data engineering workflows by leveraging containerization and Kubernetes to orchestrate complex, scalable, and reliable workflows. With a focus on reproducibility and efficiency, Flyte provides a unified platform for running computational tasks that allow ML engineers and data scientists to streamline their work across teams and resources easily. You can access the power of Flyte, fully managed in your Cloud with [Link]” — Ketan Umare

Flowise AI

Flowise is an orchestration tool built on top of LangChain and LlamaIndex.


Developing LLM apps requires a whole new set of dev tooling; that's why we created Flowise: to allow developers to build, evaluate, and observe their LLM apps in one single platform. We were the first to open up the new frontier of low-code LLM app builders and the first open-source platform that integrates different LLM frameworks, allowing devs to customize for their use cases. — Henry Heng

Boundary ML

We created BAML, a new config language optimized for expressing AI functions. BAML offers built-in type-safety, a native VSCode playground, arbitrary model support, observability, and support for both Python and TypeScript. On top of that, it's open source! — Vaibhav Gupta
1.6. Quality Tuning Tools
Build your app first, then worry about quality. But really, quality is important. These tools exist because: a) a lot of LLM-based results are subjective but need a way to be measured, b) if you're using something in production, you need to make sure it's good, and c) seeing how different parameters affect your application shows you how to improve it.

Arize AI

“My co-founder, Aparna Dhinakaran, came from Uber’s ML team and I came from
TubeMogul, where we both realized the hardest problems we faced were
troubleshooting real world AI and making sense of AI performance. Arize has a
unique combination of people who have been working for decades on AI system
performance evaluation, highly usable observability tools, and large data systems.
We have a foundation in open source, and support a community version of our
software called Phoenix.” — Jason Lopatecki, CEO and Co-Founder of Arize AI.

WhyLabs

“WhyLabs helps AI practitioners build reliable and trustworthy AI systems. As our customers expanded from predictive to GenAI use cases, security and control became their biggest barriers to launching to production. To solve that, we open sourced LangKit, a tool that enables teams to prevent abuse, misinformation, and data leakage in LLM applications. For enterprise teams, the WhyLabs Platform layers on top of LangKit to provide a collaborative control and root cause analysis center.” — Alessya Vijnic

Deepchecks

“We started Deepchecks as a response to the overwhelming cost of building, aligning, and observing AI. Our special sauce is in providing our users with an automated scoring mechanism. This allows us to combine multiple considerations such as quality and compliance to score an LLM response.” — Philip Tannor

Aporia

TruEra

TruEra's AI Quality solutions evaluate, debug, and monitor machine learning models and large language model applications for higher quality, trustworthiness, and faster deployment. For large language model applications, TruEra uses feedback functions to evaluate the performance of LLM applications without labels. Combined with deep traceability, this brings unparalleled observability into any AI app. TruEra works across the model lifecycle, is independent of model development platforms, and embeds easily into your existing AI stack. — Josh Reini

Honey Hive

Guardrails AI

Exploding Gradients (RAGAS)

BrainTrust Data

BrainTrust Data is a robust, software-engineering-centric solution for evaluating LLM apps. It allows for quick iteration, offers a different set of evaluations from other MLOps tools focused on usage, and lets users define their own functions. — Albert Zhang

Patronus AI

Giskard

Quotient

Quotient provides an end-to-end platform to quantitatively test changes in your LLM application. After many conversations at GitHub, we decided that quantitatively testing LLM apps was a big problem. Our special sauce is that we provide domain-specific evaluation for business use cases. — Julia Neagu

Galileo
1.7. Data Tools
In 2012, data was your best friend in any AI/ML application. In 2024, the story is a little different, but not by much. The quality of your data is still critical. These tools help you ensure that your data is labeled correctly, that you're using the right datasets, and that you can move your data around easily.

Voxel51

“Models are only as good as the data they’re trained on, so what’s in your datasets?
We built Voxel51 to organize your unstructured data in a centralized, searchable,
and visualizable location that uniquely allows you to build automations that
improve the quality of the training data that you feed to your models.” — Brian
Moore

DVC

XetHub

We built XetHub after building Apple’s ML data platform and watching ML teams
struggle because their tools & processes didn’t scale and weren’t aligned with
software teams. XetHub has scaled Git to 100TB (per repo) and offers a GitHub-like
experience with tailor-made features for ML (custom views, model diffs, automatic
summarization, block-based deduplication, streaming mounts, dashboarding, and
more). — Rajat Arya

Kafka

Airbyte

ByteWax

“Bytewax first fills a gap in the Python ecosystem for a Python-native stream processor that is production-ready. Second, it aims at the developer experience problem with existing stream processing tools with an easy-to-use and intuitive API and a straightforward deployment story: `pip install bytewax && python -m bytewax.run my_dataflow.py`” — Zander Matheson

Unstructured.io

At Unstructured we're building data engineering tools to make it effortless to transform unstructured data from raw to ML-ready. Today, developers and data scientists spend more than 80% of their time on data preparation; our mission is to give this time back to them to focus on model training and application development. — Brian Raymond

Spark

Pulsar

Floom

Flink

Proton by Timeplus

Apache NiFi

ActiveLoop

HumanLoop

SuperLinked

Skyflow (Privacy)

Skyflow is a data privacy vault service inspired by the zero trust approach used by
companies like Apple and Netflix. It isolates and protects sensitive customer data,
transforming it into non-sensitive references to reduce security risks. Skyflow’s APIs
ensure privacy-safe model training by excluding sensitive data, preventing
inference from prompts or user-provided files for operations like RAG. — Sean
Falconer

VectorFlow

Daios

Pathway

Mage AI

Flexor

1.8. Infrastructure
A March 2024 addition and shakeup to this stack, infrastructure tools are critical to building LLM apps. These tools let you build your app first and defer the production work until later. They allow you to serve, train, and evaluate LLMs and LLM-based applications.

BentoML

I started BentoML because I saw how tough it was to run and serve AI models
efficiently. With traditional cloud infra, handling heavy GPU workloads and dealing
with large models can be a real headache. In short, we make it super easy for AI
developers to get their AI inference service up and running. We’re all about open
source here, supported by an awesome community that’s always contributing. —
Chaoyu Yang

Databricks

LastMile AI

TitanML

Lots of enterprises want to self host language models but don’t have the
infrastructure to do it well. TitanML provides that infrastructure to let developers
build applications. The special sauce is that it focuses on optimization for enterprise
workloads like batch inference, multimodal, and embedding models. — Meryem
Arik

Determined AI (acquired by HPE)

Pachyderm (acquired by HPE)

ConfidentialMind

ConfidentialMind is building a deployable API stack for enterprises. What makes it unique? The ability to deploy everything on-prem and plug and play with open-source tools. — Markku Räsänen

Snowflake

Upstash

There needs to be a way to keep track of state for stateless tools, and the developers
need to be served in this space. — Enes Akar

Unbody

We make AI accessible for non-AI developers and build private data pipelines for AI functionalities. — Amir Houieh

NIM by NVIDIA

Parea AI

Parea is making it easier to build and evaluate LLM applications by bringing an agnostic framework and model for LLMs. — Joel Alexander

Below is a representative architecture of a RAG application:

Source

2. Build an LLM application from scratch


We will quickly build a RAG (Retrieval Augmented Generation) pipeline for a project's GitHub issues using the HuggingFaceH4/zephyr-7b-beta model and LangChain.

Here’s a quick illustration:

The external data is converted into embedding vectors with a separate embeddings model, and the vectors are kept in a database. Embeddings models are typically small, so updating the embedding vectors on a regular basis is faster, cheaper, and easier than fine-tuning a model.

At the same time, the fact that fine-tuning is not required gives you the freedom
to swap your LLM for a more powerful one when it becomes available, or switch
to a smaller distilled version, should you need faster inference.

Let’s illustrate building a RAG using an open-source LLM, embeddings model, and
LangChain.

First, let’s install the required dependencies:

!pip install -q torch transformers accelerate bitsandbytes sentence-transformers faiss-gpu

# If running in Google Colab, you may need to run this cell to make sure you're using a UTF-8 locale
import locale
locale.getpreferredencoding = lambda: "UTF-8"

!pip install -q langchain

2.1. Prepare the data


In this example, we'll load all of the issues (both open and closed) from the PEFT library's repo.

First, you need to acquire a GitHub personal access token to access the GitHub API.

from getpass import getpass


ACCESS_TOKEN = getpass("YOUR_GITHUB_PERSONAL_TOKEN")

Next, we'll load all of the issues in the huggingface/peft repo:

By default, pull requests are considered issues as well; here we chose to exclude them from the data by setting include_prs=False.

Setting state = "all" means we will load both open and closed issues.

from langchain.document_loaders import GitHubIssuesLoader

loader = GitHubIssuesLoader(
    repo="huggingface/peft",
    access_token=ACCESS_TOKEN,
    include_prs=False,
    state="all",
)
docs = loader.load()

The content of individual GitHub issues may be longer than what an embedding
model can take as input. If we want to embed all of the available content, we need to
chunk the documents into appropriately sized pieces.

The most common and straightforward approach to chunking is to define a fixed chunk size and decide whether there should be any overlap between chunks. Keeping some overlap between chunks allows us to preserve some semantic context between them. The recommended splitter for generic text is the RecursiveCharacterTextSplitter, and that's what we'll use here.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)

chunked_docs = splitter.split_documents(docs)
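As a quick sanity check (purely illustrative, not part of the original notebook), we can inspect how many chunks were produced and what one looks like:

```python
# Illustrative check of the chunking result; output will vary with the repo state.
print(f"{len(docs)} issues -> {len(chunked_docs)} chunks")
print(chunked_docs[0].page_content[:200])  # first 200 characters of the first chunk
print(chunked_docs[0].metadata)            # issue metadata carried over from the loader
```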

2.2. Create the embeddings + retriever


Now that the docs are all of the appropriate size, we can create a database with their
embeddings.

To create document chunk embeddings we’ll use the HuggingFaceEmbeddings and the
BAAI/bge-base-en-v1.5 embeddings model. There are many other embedding
models available on the Hub, and you can keep an eye on the best-performing ones
by checking the Massive Text Embedding Benchmark (MTEB) Leaderboard.

To create the vector database, we’ll use FAISS , a library developed by Facebook AI.
This library offers efficient similarity search and clustering of dense vectors, which
is what we need here. FAISS is currently one of the most used libraries for NN
search in massive datasets.

We’ll access both the embedding model and FAISS via LangChain API.

from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

db = FAISS.from_documents(
    chunked_docs, HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
)

We need a way to return (retrieve) the documents given an unstructured query. For that, we'll use the as_retriever method using the db as a backbone:

search_type="similarity" means we want to perform similarity search between the query and the documents.

search_kwargs={'k': 4} instructs the retriever to return the top 4 results.

retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 4})
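As a quick check (with a hypothetical query), we can invoke the retriever directly and look at the returned chunks:

```python
# Hypothetical query, just to confirm the retriever returns issue chunks.
sample_docs = retriever.get_relevant_documents("How do I merge LoRA adapters?")
for d in sample_docs:
    print(d.metadata.get("title"), "->", d.page_content[:80])
```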

The vector database and retriever are now set up; next we need to set up the next piece of the chain — the model.

2.3. Load quantized model


For this example, we chose HuggingFaceH4/zephyr-7b-beta, a small but powerful model.

With many models being released every week, you may want to substitute this model with the latest and greatest. The best way to keep track of open-source LLMs is to check the Open LLM Leaderboard.

To make inference faster, we will load the quantized version of the model:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "HuggingFaceH4/zephyr-7b-beta"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)

2.4. Setup the LLM chain


Finally, we have all the pieces we need to set up the LLM chain.

First, create a text_generation pipeline using the loaded model and its tokenizer.

Next, create a prompt template — this should follow the format of the model, so if
you substitute the model checkpoint, make sure to use the appropriate formatting.

from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain_core.output_parsers import StrOutputParser

text_generation_pipeline = pipeline(
model=model,
tokenizer=tokenizer,
task="text-generation",
temperature=0.2,
do_sample=True,
repetition_penalty=1.1,
return_full_text=True,
max_new_tokens=400,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:
{context}
</s>
<|user|>
{question}
</s>
<|assistant|>
"""

prompt = PromptTemplate(
input_variables=["context", "question"],
template=prompt_template,
)

llm_chain = prompt | llm | StrOutputParser()

Note: You can also use tokenizer.apply_chat_template to convert a list of messages (as
dicts: {'role': 'user', 'content': '(...)'} ) into a string with the appropriate chat
format.
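For reference, here is a minimal sketch of that alternative (the message contents are just the template above re-expressed as chat messages; {context} and {question} remain placeholders for the LangChain PromptTemplate):

```python
# Build the same prompt via the tokenizer's chat template instead of
# hand-writing Zephyr's special tokens.
messages = [
    {"role": "system", "content": "Answer the question based on your knowledge. Use the following context to help: {context}"},
    {"role": "user", "content": "{question}"},
]
prompt_template = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
prompt = PromptTemplate(
    input_variables=["context", "question"], template=prompt_template
)
```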

Finally, we need to combine the llm_chain with the retriever to create a RAG chain.
We pass the original question through to the final generation step, as well as the
retrieved context docs:

from langchain_core.runnables import RunnablePassthrough

retriever = db.as_retriever()

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | llm_chain

2.5. Compare the results


Let’s see the difference RAG makes in generating answers to the library-specific
questions.

question = "How do you combine multiple adapters?"

First, let’s see what kind of answer we can get with just the model itself, no context
added:

llm_chain.invoke({"context": "", "question": question})

Output:

To combine multiple adapters, you need to ensure that they are compatible with

1. Identify the types of connectors you need: Before combining adapters, determ

2. Check compatibility: Make sure that the adapters you choose are compatible w

3. Connect the adapters: Once you have identified the compatible adapters, conn

4. Test the connection: After connecting all the adapters, test the connection

5. Secure the connections: To prevent accidental disconnections, use cable ties

Remember, combining multiple adapters can sometimes result in signal loss or in

rag_chain.invoke(question)

Output:

Based on the provided context, here are some potential ways to combine multiple adapters:

1. Load each adapter separately and concatenate their outputs:

```python
from peft import Peft

# Load the base model and adapter 1


base_model = AutoModelForSequenceClassification.from_pretrained("your_base_model")
adapter1 = Peft("adapter1").requires_grad_(False)
adapter1(base_model).load_state_dict(torch.load("path/to/[Link]"))

# Load adapter 2
adapter2 = Peft("adapter2").requires_grad_(False)
adapter2(base_model).load_state_dict(torch.load("path/to/[Link]"))

# Concatenate the outputs of both adapters


def forward(self, input_ids, attention_mask):
    x = self.base_model(input_ids, attention_mask)[0]
    x = torch.cat([x, adapter1(x), adapter2(x)], dim=-1)
    return x

# Create a new model class that includes the concatenated outputs

class MyModel(BaseModel):
    def __init__(self):
        super().__init__()
        self.forward = forward

# Instantiate the new model class and use it for inference


my_model = MyModel()
```

2. Freeze multiple adapters and selectively activate them during inference:

```python
from peft import Peft

# Load the base model and all ad

As we can see, the added context really helps the exact same model provide a much more relevant and informed answer to the library-specific question.

Notably, combining multiple adapters for inference has been added to the library,
and one can find this information in the documentation, so for the next iteration of
this RAG it may be worth including documentation embeddings.
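A minimal sketch of that idea follows (the URL is just an example documentation page; any set of pages could be loaded the same way, and WebBaseLoader additionally requires beautifulsoup4):

```python
# Illustrative only: pull a documentation page, chunk it with the same
# splitter, and add it to the existing FAISS index alongside the issues.
from langchain_community.document_loaders import WebBaseLoader

doc_pages = WebBaseLoader(
    "https://huggingface.co/docs/peft/developer_guides/lora"  # example page
).load()
db.add_documents(splitter.split_documents(doc_pages))
```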

So, now we have an understanding of how to build an LLM RAG Application from
scratch.

Google Colab (Credits): [Link]ks/en/rag_zephyr_langchain.ipynb

Next, we will use our understanding to build 3 more LLM Applications.

For the below applications, we will be using Ollama as our LLM Server. Let’s start
with understanding more about LLM Server below.

3. LLM Server
The most critical component of this app is the LLM server. With Ollama, we have a
robust LLM Server that can be set up locally.

What is Ollama?

Ollama can be installed from [Link] site.

Ollama isn’t a single language model but a framework that lets us run multiple
open-source LLMs locally on our machine. Think of it like a platform for playing
different language models like Llama 2, Mistral, etc., instead of a specific player
itself.

Additionally, we can use the Langchain SDK, which is a tool for working with Ollama
more conveniently.

Using Ollama on the command line is very simple. The following commands can be used to run Ollama on our computer.

ollama pull — This command pulls a model from the Ollama model hub.

ollama rm — This command is used to remove the already downloaded model


from the local computer.

ollama cp — This command is used to make a copy of the model.

ollama list — This command is used to see the list of downloaded models.

ollama run — This command is used to run a model. If the model is not already downloaded, it will pull the model and serve it.

ollama serve — This command is used to start the server, to serve the
downloaded models.

We can download these models to our local machine, and then interact with those
models through a command line prompt. Alternatively, when we run the model,
Ollama also runs an inference server hosted at port 11434 (by default) that we can
interact with through APIs and other libraries like Langchain.

As of this post, Ollama has 74 models, which also include categories like embedding
models.
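For example, once a model has been pulled and the Ollama server is running, a minimal sketch of calling it from Python through LangChain looks like this (the model name and prompt are just examples):

```python
# Minimal sketch: query the local Ollama server (default port 11434)
# through the langchain_community integration.
from langchain_community.llms import Ollama

llm = Ollama(model="llama2", base_url="http://localhost:11434")
print(llm.invoke("Explain Retrieval Augmented Generation in one sentence."))
```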

Source : Ollama

4. Chatbot Applications
The 3 essential Chatbot applications that we will be building next are :

1. Chat with multiple PDFs using LangChain, ChromaDB & Streamlit

2. Chatbot with Open WebUI

3. Deploy Chatbot using Docker.

By the end of these 3 applications, we will build an intuition of how industrial


applications are built and deployed at scale.

RAG Application High-level Flow

5. Application 1: Chat with multiple PDFs


We will build an application similar to ChatPDF but simpler, where users can upload multiple PDF documents and ask questions through a straightforward UI.

Our tech stack is super easy with Langchain, Ollama, and Streamlit.

Architecture

LLM Server: The most critical component of this app is the LLM server. Thanks
to Ollama, we have a robust LLM Server that can be set up locally, even on a
laptop. While [Link] is an option, I find Ollama, written in Go, easier to set
up and run.
RAG: Undoubtedly, the two leading libraries in the LLM domain are LangChain and LlamaIndex. For this project, I'll be using LangChain due to my familiarity with it from my professional experience. An essential component for any RAG framework is vector storage. We'll be using Chroma here, as it integrates well with LangChain.

Chat UI: The user interface is also an important component. Although there are
many technologies available, I prefer using Streamlit, a Python library, for peace
of mind.

Okay, let’s start setting it up.

The chatbot can access information from various PDFs. Here’s a breakdown:

Data Source: Multiple PDFs

Storage: ChromaDB’s vector store (efficiently stores and retrieves information)

Processing: LangChain API prepares the data for a large language model (LLM)

LLM Integration: Likely involves Retrieval-Augmented Generation (RAG) for


enhanced responses

User Interface: Streamlit creates a user-friendly chat interface
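To make the flow concrete, here is a minimal, self-contained sketch of such an app (illustrative only; the actual code in the repository differs, and the file, model, and parameter names are assumptions):

```python
# Sketch: multi-PDF chat with LangChain + Chroma + Ollama + Streamlit.
# Assumes `ollama run llama2` is serving locally and pypdf is installed.
import tempfile

import streamlit as st
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

st.title("Chat with your PDFs")
uploaded = st.file_uploader("Upload PDFs", type="pdf", accept_multiple_files=True)
question = st.text_input("Ask a question about the documents")

if uploaded and question:
    docs = []
    for f in uploaded:
        # PyPDFLoader needs a file path, so write each upload to a temp file first.
        with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
            tmp.write(f.read())
        docs.extend(PyPDFLoader(tmp.name).load())

    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1024, chunk_overlap=80
    ).split_documents(docs)
    vectordb = Chroma.from_documents(
        chunks, HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
    )

    qa = RetrievalQA.from_chain_type(
        llm=Ollama(model="llama2"),
        retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
    )
    st.write(qa.invoke({"query": question})["result"])
```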

GitHub Repo Structure:


[Link]

Folder Structure

1. Clone the repository to the local machine.

git clone [Link]


cd 7_Ollama/local-pdf-chatbot

2. Create a Virtual Environment:

python3 -m venv myenv


source myenv/bin/activate

3. Install the requirements:

pip install -r requirements.txt

4. Install Ollama and pull the LLM model specified in [Link]. [We have already covered setting up Ollama in the section above.]

5. Run the Llama 2 model using Ollama:

ollama pull llama2


ollama run llama2

6. Run the [Link] file using the Streamlit CLI. Execute the following command:

streamlit run [Link]

Image by Author

6. Application 2: Chatbot with Open WebUI


In this application, we will be building a chatbot with Open WebUI instead of Streamlit / Chainlit / Gradio as the UI.

Below are the steps :

1. Install Ollama & Deploy an LLM

We can install Ollama directly on our local machine or deploy the Ollama Docker container locally. The choice is ours; either will work with the LangChain Ollama interface, the official Ollama Python interface, and the Open WebUI interface.

Below are the instructions for installing Ollama directly on our local systems:

Setting up and running Ollama is straightforward. First, visit [Link] and download the app appropriate for our operating system.

Next, open the terminal and execute the following command to pull the latest models. While there are many other LLM models available, I chose Mistral-7B for its compact size and competitive quality.

ollama pull llama2

ollama run llama2

The set-up procedure is the same for all other models. We need to pull and run.

Image Source

2. Install open-webui (ollama-webui)

Open WebUI is an extensible, feature-rich, and user-friendly self-hosted WebUI designed to operate entirely offline. It supports various LLM runners, including Ollama and OpenAI-compatible APIs.

Official GitHub Repo: [Link]

Run the docker command below to deploy the open-webui Docker container on the local machine.

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v ollama-webui:/app/backend/data --name ollama-webui --restart always ghcr.io/ollama-webui/ollama-webui:main

(The project has since been renamed to Open WebUI, so check the official README for the current image name and volume if this command has changed.)

Image by Author

3. Open Browser

Open the browser and go to localhost on port 3000:

http://localhost:3000

Image by Author

To get started, we need to register for the first time. Simply click on the "Sign up"
button to create our account.

Image by Author

Once registered, we will be routed to the home page of open-webui.

Image by Author

Depending on which LLM we deployed on our local machine, those options will be
reflected in the drop-down to select.

Image by Author

Once selected, chat.

Image by Author

This saves us from having to create Streamlit or Gradio UI interfaces to experiment with various open-source LLMs, for presentations, etc.

We can chat with any PDF file now :

Image by Author

Below is a demo GIF showcasing how we can use open-webui to chat with images.

Source: open-webui

Let’s look into how we can use our customized models with Ollama. Below are the
steps for the same.

How to use a customized model?

Import from GGUF

Ollama supports importing GGUF models in the Modelfile:

1. Create a file named Modelfile , with a FROM instruction with the local filepath to
the model we want to import.

FROM ./vicuna-33b.Q4_0.gguf

2. Create the model in Ollama

ollama create example -f Modelfile

3. Run the model

ollama run example

Import from PyTorch or Safetensors

See the guide on importing models for more information.

Customize a prompt

Models from the Ollama library can be customized with a prompt. For example, to
customize the llama2 model:

ollama pull llama2

Create a Modelfile :

FROM llama2

# set the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 1

# set the system message
SYSTEM """
You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
"""

Next, create and run the model:

ollama create mario -f ./Modelfile


ollama run mario
>>> hi
Hello! It's your friend Mario.

For more examples, see the examples directory. For more information on working
with a Modelfile, see the Modelfile documentation.

7. Application 3: Deploy Chatbot using Docker


Let's build the chatbot application using LangChain. To access our model from the Python application, we will build a simple Streamlit chatbot application. We will deploy this Python application in one container and run Ollama in a different container. We will build the infrastructure using docker-compose.

The following picture shows the architecture of how the containers interact, and
what ports they will be accessing.

Source

We build 2 containers,

Ollama container uses the host volume to store and load the models
( /root/.ollama is mapped to the local ./data/ollama ). Ollama container listens
on 11434 (external port, which is internally mapped to 11434)

The Streamlit chatbot application listens on 8501 (external port, which is internally mapped to 8501).

GitHub Repository :

Folder Structure

Clone the repository to the local machine.

git clone [Link]


cd 7_Ollama/docker-pdf-chatbot

Create a Virtual Environment:

python3 -m venv ./ollama-langchain-venv


source ./ollama-langchain-venv/bin/activate

Install the requirements:

pip install -r [Link]

Ollama also ships as a Docker image, which allows us to run the Ollama server in a container. This is very useful for building microservices applications that use Ollama models: we can easily deploy our applications in Docker ecosystems such as OpenShift, Kubernetes, and others. To run Ollama in Docker, we use the docker run command, as shown below. Before this, we should have Docker installed on our system.

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Below is the output :

Image by Author

We should then be able to interact with this container using docker exec , as shown
below, and run the prompts.

docker exec -it ollama ollama run phi

Image by Author

In the above command, we are running the phi model inside the Docker container.

We can also send prompts to the Ollama REST API directly:

curl http://localhost:11434/api/generate -d '{
  "model": "phi:latest",
  "prompt": "Who are you?",
  "stream": false
}'

Below is the result :

Image by Author

Note that Docker containers are ephemeral, and whatever models we pull will disappear when we restart the container. We will solve this issue below, where we build the Streamlit application setup with docker-compose and map the volume of the container to the host.

Ollama is a powerful tool that enables new ways of creating and running LLM
applications on the cloud. It simplifies the development process and offers flexible
deployment options. It also allows for easy management and scaling of the
applications.

Now, let’s get started with the Streamlit application.

We are using Ollama and calling the model through the Ollama LangChain integration (which is part of langchain_community).

Let's define the dependencies in requirements.txt.
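These will typically include streamlit, langchain, and langchain_community (check the repository for the exact pinned versions). A minimal sketch of the Streamlit app itself could look like the following; the file name, service name, and model are assumptions for illustration, not the repository's actual code:

```python
# main.py (illustrative sketch). Inside docker-compose, the Ollama container
# is reached via its service name rather than localhost; "ollama-container"
# is an assumed service name here.
import streamlit as st
from langchain_community.llms import Ollama

llm = Ollama(base_url="http://ollama-container:11434", model="phi")

st.title("Ollama + LangChain chatbot")
prompt = st.text_input("Your prompt")
if prompt:
    st.write(llm.invoke(prompt))
```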

Let’s now define a Dockerfile to build the docker image of the Streamlit application.

We are using the Python docker image, as the base image, and creating a working
directory called /app . We are then copying our application files there, and running
the pip installs to install all the dependencies. We are then exposing the port 8501
and starting the streamlit application.
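A sketch of such a Dockerfile, matching the description above (file names are assumptions; the repository's actual Dockerfile may differ):

```dockerfile
# Illustrative Dockerfile sketch for the Streamlit app described above.
FROM python:3.11-slim
WORKDIR /app
COPY . /app
RUN pip install --no-cache-dir -r requirements.txt
EXPOSE 8501
CMD ["streamlit", "run", "main.py", "--server.port=8501", "--server.address=0.0.0.0"]
```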

We can build the docker image using docker build command, as shown below.

docker build . -t viprasingh/ollama-langchain:0.1

Image by Author

We should be able to check that the Docker image is built using the docker images command, as shown below.

Image by Author

Let’s now build a docker-compose configuration file, to define the network of the
Streamlit application and the Ollama container, so that they can interact with each
other. We will also be defining the various port configurations, as shown in the
picture above. For Ollama, we will also be mapping the volume, so that whatever
models are pulled, are persisted.

[Link]
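A sketch of what this docker-compose file could look like, based on the ports and volume mapping described above (service names are assumptions; the repository's actual file may differ):

```yaml
# Illustrative docker-compose.yml sketch: Ollama + the Streamlit app.
version: "3.8"
services:
  ollama-container:
    image: ollama/ollama
    volumes:
      - ./data/ollama:/root/.ollama   # persist pulled models on the host
    ports:
      - "11434:11434"

  app:
    build: .
    ports:
      - "8501:8501"
    depends_on:
      - ollama-container
```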

We can bring up the applications by running the docker-compose up command. Once we execute docker-compose up, we see that both containers start running, as shown in the screenshot below.

Image by Author

We should be able to see the containers running by executing the docker-compose ps command, as shown below.

Image by Author

We now check if Ollama is running by calling http://localhost:11434, as shown in the screenshot below.

Let’s now download the required model, by logging into the docker container using
the docker exec command as shown below.

docker exec -it docker-pdf-chatbot-ollama-container-1 ollama run phi

Since we are using the model phi, we are pulling that model and testing it by running it. In the screenshot below, the phi model is downloaded and starts running (since we are using the -it flag, we can interact and test with sample prompts).

We can see the downloaded model files and manifests in our local folder
./data/ollama (which is internally mapped to /root/.ollama for the container,
which is where Ollama looks for the downloaded models to serve)

Image by Author

Let's now access our Streamlit application by opening http://localhost:8501 in the browser. The following screenshot shows the interface.

Let's try to run the prompt "generate a story about dog called bozo". We should be able to see the console logs reflecting the API calls coming from our Streamlit application, as shown below.

The screenshot below shows the response I got for the prompt I sent.

We can bring down the deployment by calling docker-compose down. The following screenshot shows the output.

There we go. It was super fun working on this blog, getting Ollama to work with LangChain and deploying them on Docker using Docker Compose.

Conclusion
The blog explores building Large Language Model (LLM) applications locally,
focusing on Retrieval-Augmented Generation (RAG) chains. It covers components
like the LLM Server powered by Ollama, LangChain framework, Chroma for
embeddings, and Streamlit for web apps. It details creating Chatbot applications
using Ollama, LangChain, ChromaDB, and Streamlit, with GitHub repo structures
and Docker deployment. Overall, it offers a practical guide to developing LLM
applications efficiently.

Credits
In this blog post, we have compiled information from various sources, including research papers, technical blogs, official documentation, YouTube videos, and more. Each source has been appropriately credited beneath the corresponding images, with source links provided.

Below is a consolidated list of references:

1. [Link]7168449062336225280-3n_p/

2. [Link]eac28b9dc1e7

3. [Link]

4. [Link]

5. [Link]ollama-deploy-on-docker-5dfcfd140363

Thank you for reading!


If this guide has enhanced your understanding of Python and Machine Learning:

Please show your support with a clap 👏 or several claps!

Your claps help me create more valuable content for our vibrant Python or ML
community.

Feel free to share this guide with fellow Python or AI / ML enthusiasts.

Your feedback is invaluable — it inspires and guides my future posts.

Connect with me!


Vipra
