Creating a Multiple-PDF Chatbot with LangChain and Streamlit
Table of contents
- Simplified Breakdown of the Project
- The User Uploads the PDF Files
- Extracting the Text from the PDFs
- Splitting the Text into Chunks
- Creating the Vector Store
- Creating the Conversation Chain
- Prompting the User for a Question
- Generating a Response
- Summary
Hello guys! I started coding with LangChain last month, and this is my first project: a chatbot that lets a user upload PDFs and ask questions about them. We will use LangChain and Streamlit to build it. Follow along with me.
Requirements
```
langchain==0.0.184
PyPDF2==3.0.1
python-dotenv==1.0.0
streamlit==1.18.1
openai==0.27.6
faiss-cpu==1.7.4
# uncomment to use Hugging Face LLMs
# huggingface-hub==0.14.1
# uncomment to use instructor embeddings
# InstructorEmbedding==1.0.1
# sentence-transformers==2.2.2
```
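Before running the app, install the dependencies with `pip install -r requirements.txt`. The code below also calls `load_dotenv()` and uses OpenAI for both embeddings and chat, so you need an OpenAI API key in a `.env` file next to the app. A minimal example (the variable name `OPENAI_API_KEY` is the one the OpenAI client reads; the value here is a placeholder):

```
# .env
OPENAI_API_KEY=your-openai-api-key-here
```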
Simplified Breakdown of the Project
Here is a simplified breakdown of the project to help you understand it as a beginner:
1. The user uploads the PDF files that they want to chat about.
2. The code extracts the text from the PDF files and combines it into a single string.
3. The code splits the text into a list of smaller chunks.
4. The code creates a vector store from the text chunks.
5. The code creates a conversation chain from the vector store.
6. The code prompts the user to ask a question about the PDF files.
7. The code uses the conversation chain to generate a response to the user's question.
8. The code displays the response to the user.
9. The code repeats steps 6-8 until the user quits the program.
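All of the snippets below assume these imports at the top of the file. This is my reconstruction based on the requirements list (module paths match langchain 0.0.184; `PyPDF2` is assumed as the PDF library):

```python
import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.text_splitter import CharacterTextSplitter  # used in the splitter variant shown later
```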
The User Uploads the PDF Files
The user uploads the PDF files that they want to chat about in the `main()` function. Here is the code:
```python
def main():
    load_dotenv()
    st.set_page_config(page_title="Chat with multiple PDFs",
                       page_icon=":books:")

    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

    st.header("Chat with multiple PDFs :books:")
    user_question = st.text_input("Ask a question about your documents:")
    if user_question:
        handle_userinput(user_question)

    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here and click on 'Process'",
            accept_multiple_files=True)
        if st.button("Process"):
            with st.spinner("Processing"):
                # get pdf text
                raw_text = get_pdf_text(pdf_docs)

                # get the text chunks
                text_chunks = get_text_chunks(raw_text)

                # create vector store
                vectorstore = get_vectorstore(text_chunks)

                # create conversation chain
                st.session_state.conversation = get_conversation_chain(
                    vectorstore)


if __name__ == '__main__':
    main()
```
The `pdf_docs` variable is a list of the PDF files that the user has uploaded; the `st.file_uploader()` function creates it. The `st.button()` function creates a button that the user can click to process the PDF files. When the user clicks the button, `get_pdf_text()` is called to extract the text from the PDF files, and the result is stored in the `raw_text` variable. `get_text_chunks()` then splits the text into a list of smaller chunks, stored in `text_chunks`. `get_vectorstore()` creates a vector store from the chunks, stored in `vectorstore`. Finally, `get_conversation_chain()` builds a conversation chain from the vector store and stores it in `st.session_state.conversation`, which `handle_userinput()` later uses to generate responses to the user's questions.

Extracting the Text from the PDFs
The code that extracts the text from the PDF files and combines it into a single string is located in the `get_pdf_text()` function. Here is the code:
```python
def get_pdf_text(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text
```
The `get_pdf_text()` function first initializes an empty string called `text`. For each uploaded PDF, it creates a `PdfReader`, iterates through the pages, and appends the text extracted from each page to `text`. Finally, `text` is returned.
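One caveat: `extract_text()` returns an empty string for scanned or image-only pages, so a PDF with no text layer contributes nothing. A slightly more defensive variant (my own sketch, not part of the original project) skips such pages and keeps a break between pages:

```python
def get_pdf_text(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            page_text = page.extract_text()
            if page_text:  # skip pages with no extractable text
                text += page_text + "\n"
    return text
```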
`get_pdf_text()` is called inside the `with st.sidebar:` block of the `main()` function shown above. The `with st.sidebar:` block creates a sidebar where the user uploads the PDF files. When the user clicks the Process button, `get_pdf_text()` extracts the text from the PDF files and combines it into a single string, which is stored in the `raw_text` variable.

Splitting the Text into Chunks
The code that splits the text into a list of smaller chunks is located in the `get_text_chunks()` function. Here is the code:
```python
def get_text_chunks(text):
    text_chunks = []
    chunk = ""
    # iterating over a string yields one character at a time
    for char in text:
        if char == "\n":
            text_chunks.append(chunk)
            chunk = ""
        else:
            chunk += char
    text_chunks.append(chunk)
    return text_chunks
```
The `get_text_chunks()` function first initializes a list called `text_chunks`, which will hold the chunks, and an empty `chunk` string. The `text` argument contains the text extracted from the PDF files. The function then walks through `text` one character at a time: when it reaches a newline, it appends the current chunk to `text_chunks` and starts a new one; otherwise it adds the character to the current chunk. After the loop, the final chunk is appended and `text_chunks` is returned.
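Splitting on every newline produces chunks of very uneven size, which is not ideal for retrieval. A common alternative is LangChain's `CharacterTextSplitter` (a sketch assuming the langchain 0.0.184 API from the requirements; the chunk sizes are typical defaults, not tuned values):

```python
from langchain.text_splitter import CharacterTextSplitter

def get_text_chunks(text):
    # pack newline-separated text into ~1000-character chunks,
    # overlapping by 200 characters so context isn't cut off mid-thought
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    return text_splitter.split_text(text)
```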
`get_text_chunks()` is called in the `with st.sidebar:` block of `main()`, right after `get_pdf_text()`. When the user clicks the Process button, it splits the extracted text into a list of smaller chunks, which is stored in the `text_chunks` variable.
Creating the Vector Store
The code that creates a vector store from the text chunks is located in the `get_vectorstore()` function. Here is the code:
```python
def get_vectorstore(text_chunks):
    embeddings = OpenAIEmbeddings()
    vectorstore = FAISS.from_texts(text_chunks, embedding=embeddings)
    return vectorstore
```
The `get_vectorstore()` function first initializes an `OpenAIEmbeddings` object, which computes an embedding vector for each text chunk. The `FAISS.from_texts()` class method then embeds the chunks and builds a FAISS vector store from them. The vector store is returned.
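The commented-out packages in the requirements hint at a free, local alternative to the OpenAI embeddings. A sketch using instructor embeddings (assuming `InstructorEmbedding` and `sentence-transformers` are installed; this runs locally but is noticeably slower on CPU):

```python
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

def get_vectorstore(text_chunks):
    # embeds the chunks locally instead of calling the OpenAI API
    embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
    vectorstore = FAISS.from_texts(text_chunks, embedding=embeddings)
    return vectorstore
```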
`get_vectorstore()` is called in the `with st.sidebar:` block of `main()`. When the user clicks the Process button, it builds a vector store from the text chunks, which is stored in the `vectorstore` variable.
Creating the Conversation Chain
The code that creates a conversation chain from the vector store is located in the `get_conversation_chain()` function. Here is the code:
```python
def get_conversation_chain(vectorstore):
    llm = ChatOpenAI()
    memory = ConversationBufferMemory(
        memory_key='chat_history', return_messages=True)
    # the chain is built with the from_llm() factory method
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory)
    return conversation_chain
```
The `get_conversation_chain()` function first initializes a `ChatOpenAI` object, which provides access to OpenAI's chat models (the models behind ChatGPT). It then initializes a `ConversationBufferMemory` object, which stores the chat history between the user and the chatbot. Finally, it builds a `ConversationalRetrievalChain` from the `llm` object, a retriever derived from the `vectorstore` object, and the `memory` object. This chain is what generates responses to user questions, and it is returned to the caller.
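To see what the chain returns outside of Streamlit, here is a quick usage sketch (the question text is hypothetical, and `OPENAI_API_KEY` must be set in the environment):

```python
chain = get_conversation_chain(vectorstore)
response = chain({'question': "What is the main topic of these documents?"})
print(response['answer'])        # the model's reply
print(response['chat_history'])  # the message list maintained by the memory
```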
`get_conversation_chain()` is called at the end of the `with st.sidebar:` block of `main()`. When the user clicks the Process button, it creates a conversation chain from the vector store and stores it in `st.session_state.conversation`, so the chain survives Streamlit's reruns.
Prompting the User for a Question
The code that prompts the user to ask a question about the PDF files is located in the `main()` function. The relevant lines are:

```python
user_question = st.text_input("Ask a question about your documents:")
if user_question:
    handle_userinput(user_question)
```
The `st.text_input()` function prompts the user to ask a question about the PDF files, and the question is stored in the `user_question` variable. If the user enters a question, the `handle_userinput()` function is called to generate a response.

Generating a Response
The `handle_userinput()` function is located in the same file as `main()`. Here is the code:
```python
def handle_userinput(user_question):
    if user_question and st.session_state.conversation is not None:
        # calling the chain runs retrieval + the LLM and updates the memory
        response = st.session_state.conversation({'question': user_question})
        st.write(response['answer'])
        # the memory returns the full message list, including this exchange
        st.session_state.chat_history = response['chat_history']
```
The `handle_userinput()` function first checks that the user entered a question and that a conversation chain has been created. It then calls the conversation chain directly, passing the question in a dict; the chain returns a dict whose `answer` key holds the response, which is displayed to the user.

The function also stores the updated chat history in `st.session_state.chat_history`. The chat history is a list of the messages exchanged between the user and the chatbot, and it allows the chain to take previous turns into account when generating answers to follow-up questions.
Under the hood, the conversation chain first condenses the user's question and the chat history into a standalone query. It then performs a similarity search over the vector store: the query's embedding is compared with each chunk's embedding using a vector similarity score (a measure of how close two pieces of text are in meaning), and the highest-scoring chunks are retrieved. The language model then generates a response grounded in those chunks, which is returned to `handle_userinput()` and displayed to the user.
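If you want to render the whole conversation rather than just the latest answer, here is a minimal sketch (my own addition; in the memory's message list, even indices are user turns and odd indices are bot turns):

```python
def show_chat_history():
    if st.session_state.chat_history:
        for i, message in enumerate(st.session_state.chat_history):
            speaker = "You" if i % 2 == 0 else "Bot"
            st.write(f"**{speaker}:** {message.content}")
```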
As shown earlier, the conversation chain is created in the `main()` function and stored in the `st.session_state.conversation` variable, which `handle_userinput()` then uses to generate responses to the user's questions.
Summary
The project uses LangChain to create a chatbot that can answer questions about PDF files. Streamlit provides the user interface where the user uploads PDF files and asks questions, and LangChain processes the PDF files and generates the responses.
The project covers the following ins and outs of LangChain that will help a learner:
- How to create a conversation chain
- How to use a retriever to find relevant passages from PDF files
- How to generate responses to user questions
- How to use memory to store the chat history
Building a chatbot like this is a valuable skill for learners who are interested in natural language processing or chatbot development.
In addition to the ins and outs of LangChain, the project also teaches learners about the following:
- How to use Streamlit to create a user interface
- How to handle file uploads in a web app
- How to use a vector store (FAISS) to store and search embeddings
- How to persist state across Streamlit reruns with st.session_state
Overall, the project is a valuable resource for learners who are interested in learning more about LangChain and how to use it to create chatbots.