Create your own PDF chatbot with OpenAI, Langchain and Streamlit

Lakmina Pramodya Gamage
6 min read · Sep 22, 2023


Digital art : © www.pdfgear.com

Chatbots have become a major trend lately due to the rise of LLM (large language model) products like ChatGPT, Bing AI and Bard. In this article, we explore how to use such services in our own projects.

The PDF (Portable Document Format) has become a key player in the digital documentation world. PDFs have been used by students, publishers, researchers and many other parties for countless purposes since the 1990s.

Reading a PDF from beginning to end can be a challenging task if the document is lengthy.

With our PDF chatbot, we leverage the power of LLMs to grasp any information contained in a PDF in a conversational style, without reading it from scratch.

For this project we use Python as our development language, together with the Langchain and Streamlit libraries and the OpenAI ChatGPT API.

Let’s dive into the project.

Overview:

Abstract Workflow of the Service

In this project we read the content of a PDF and enable users to ask questions about it and get answers within the context of the document. We employ various technologies from the Generative AI domain. After reading this article you will be able to create and host your own PDF chatbot in minutes!

Environment Setup:

Open up your favourite text editor and create a new Python project. Make sure Python 3 is installed on your system. For the Langchain library to work, we need Python version 3.8.1 or later (it does not work on 3.8.0). You can check your Python version by running the command below in your CMD or Terminal.

python3 --version

If your Python version is lower than 3.8.1, you have to install a newer one.

We use the Python virtualenv package to create a virtual environment for our project. If virtualenv is not installed on your system, install it using pip in CMD or Terminal.

pip3 install virtualenv

After installing it, let’s create a virtual environment for our project and activate it.

virtualenv pdfbot
source pdfbot/bin/activate

Now the virtual environment is ready, and we can install the Python packages needed for our project.

pip install langchain==0.0.154 streamlit==1.18.1 altair==4.0.0 PyPDF2==3.0.1 

Code:

Once the installation finishes, we can get started with coding. In the root directory, create a file named “app.py”. This will hold the entire script for our program.

Let’s get started with importing the required libraries.

import pickle

import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains.question_answering import load_qa_chain

load_dotenv()  # load OPENAI_API_KEY from the .env file

After that we can use Streamlit to build UI components in seconds. Streamlit is an open-source Python application framework for production-ready data apps. The official documentation of Streamlit can be found here.

st.header('ChatPDF v0.1')
st.sidebar.header(":blue[Welcome to ChatPDF!]")
pdf = st.file_uploader('Upload a PDF file with text in English. PDFs that only contain images will not be recognized.', type=['pdf'])
query = st.text_input('Ask question about the PDF you entered!', max_chars=300)

This code segment creates a simple web UI with a header, a foldable sidebar and a file uploader utility. We use the file uploader to receive the user’s PDF file, and Streamlit’s text input component to get the user’s questions about the PDF.

txt = ""
try:
    pdf_doc = PdfReader(pdf)
    for page in pdf_doc.pages:
        txt += page.extract_text()
except Exception as e:
    st.error(str(e))

With the above code segment, we use PyPDF2 to read the content of the PDF document page by page, accumulating the extracted text into a single string.

text_split = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # number of characters per chunk
    chunk_overlap=200,  # keeps the context of a chunk intact with the previous and next chunks
    length_function=len,
)
chunks = text_split.split_text(text=txt)

Then we use the RecursiveCharacterTextSplitter from Langchain to split the content of the PDF until the text chunks are small enough to be processed. We use a chunk size of 1,000 characters so each chunk easily fits inside the maximum token limit of the OpenAI models. The chunk overlap keeps the context of a passage intact across neighbouring chunks.
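To make the overlap concrete, here is a simplified, character-only sketch of what the splitter does. This is an illustration, not Langchain’s implementation: the real RecursiveCharacterTextSplitter prefers natural boundaries such as paragraphs and sentences before falling back to raw characters.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    # Slide a window of chunk_size characters over the text,
    # stepping forward by (chunk_size - chunk_overlap) each time,
    # so consecutive chunks share chunk_overlap characters.
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "x" * 2500
chunks = chunk_text(sample)
print(len(chunks))        # 3 chunks: [0:1000], [800:1800], [1600:2500]
print(len(chunks[-1]))    # 900 (last chunk is shorter)
```

Each chunk starts 800 characters after the previous one, so the last 200 characters of a chunk reappear at the start of the next.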

embeddings = OpenAIEmbeddings()
vectorStore = FAISS.from_texts(chunks, embedding=embeddings)

We embed the text to create a vector representation of it. With that we can run semantic search, looking up similar pieces of text inside a vector space. The vector store takes care of storing the vectors and performing the search within the created vector space. For that we use FAISS (Facebook AI Similarity Search).
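The core idea behind that search can be shown with a toy example. The sketch below uses hand-made two-dimensional vectors purely for illustration; real embeddings from OpenAIEmbeddings have over a thousand dimensions, and FAISS uses optimized index structures rather than a brute-force scan.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def similarity_search(query_vec, doc_vecs, k=2):
    # Rank all document vectors by similarity to the query, return top-k indices.
    scored = sorted(enumerate(doc_vecs),
                    key=lambda iv: cosine_similarity(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(similarity_search([1.0, 0.05], docs, k=2))  # → [0, 1]
```

The two vectors pointing in roughly the same direction as the query rank highest; the orthogonal one is excluded.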

docs = vectorStore.similarity_search(query=query)
llm = ChatOpenAI(model_name="gpt-3.5-turbo")

Here we perform a similarity search against the user’s query taken from the Streamlit UI. ChatOpenAI is a wrapper for OpenAI LLMs that use the chat endpoint. We use “gpt-3.5-turbo” as our LLM for this project as it is comparatively low in cost. You can see the list of alternative models and their pricing at this link.

chain = load_qa_chain(llm=llm, chain_type="stuff")

We create a “stuff” chain using Langchain here: it stuffs all the retrieved document chunks into a single prompt. With that we can constrain our response system, for example asking it to respond only when it knows the answer and to avoid making one up.
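Conceptually, the “stuff” strategy amounts to something like the sketch below. The prompt wording here is illustrative only, not Langchain’s actual template: the retrieved chunks are joined into one context block with an instruction to answer only from that context.

```python
def build_stuff_prompt(docs, question):
    # "Stuff" all retrieved chunks into one prompt alongside the question.
    context = "\n\n".join(docs)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(["First chunk.", "Second chunk."], "What is X?")
print(prompt)
```

This works well when the retrieved chunks fit in the model’s context window; for larger documents, Langchain offers other chain types (such as map-reduce) that process chunks separately.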

response = chain.run(input_documents=docs, question=query)
st.write(response)

We execute the chain to get the response for the query and the document the user has uploaded, then present it to the user through the web UI.

That’s how our service works in simple terms. We can add extra functionality to reduce cost and improve the system. One option is to store the vector store in a local file, so that we don’t have to call the OpenAI embeddings API every time a user uploads the same document.

For that we can save the created vector store file as a pickle locally.

with open(f"{store_name}.pkl", "wb") as f:
    pickle.dump(vectorStore, f)

and load it from local storage whenever user uploads the same document again.

with open(f"{store_name}.pkl", "rb") as f:
    vectorStore = pickle.load(f)

OpenAI Integration:

Note that we must provide an OpenAI API key for the system to integrate OpenAI services. For that, create an environment variable file (a .env file) in the root directory and add the key-value pair “OPENAI_API_KEY” in simple KEY="value" format.

We can obtain an API key at platform.openai.com

OPENAI_API_KEY="copy-and-paste-key-here"

Testing App:

To run our project we can use the Streamlit app service. After finalising the code, use the Streamlit CLI to run the application. The command below starts a web server and runs your application locally when executed from your project’s root folder.

streamlit run app.py

Go to the localhost link to try and test your app.

Deployment:

When it comes to deployment, Streamlit comes in handy. You can deploy your app to Streamlit Community Cloud in a matter of a few clicks.

First, create a Git repository for your project and publish it to your GitHub account.

Then head over to share.streamlit.io. Create a new app and authenticate with your GitHub account, giving the necessary permissions to Streamlit.

In a matter of seconds your brand-new PDF chatbot will be ready to explore the seven seas!

Conclusion:

Just like that, we can easily create a PDF document information extractor app with conversational AI. There are many features and improvements you can add to this project if you are eager to go beyond.

Resources:

OpenAI API references : https://platform.openai.com/docs/api-reference/introduction

Streamlit documentation : https://docs.streamlit.io/

Langchain documentation : https://python.langchain.com/docs/get_started/introduction

PyPDF2 Documentation : https://pypdf2.readthedocs.io/en/3.0.0/

Remarks:

A live demo of the app is available at the url below.

https://snapread.streamlit.app/

A link to the project on my GitHub is available below. Feel free to fork it and add any improvements you can imagine.

Got any questions? Contact me on LinkedIn at the following URL.

https://www.linkedin.com/in/lakmina-gamage/
