Creating a Multiple-PDF Chatbot with LangChain and Streamlit

Hello guys, I started coding with LangChain last month and this is my first project: a chatbot that lets a user upload PDFs and ask questions about them. We will use LangChain and Streamlit to build it. Follow along with me.

Requirements

  1. langchain==0.0.184

  2. PyPDF2==3.0.1

  3. python-dotenv==1.0.0

  4. streamlit==1.18.1

  5. openai==0.27.6

  6. faiss-cpu==1.7.4

  7. # uncomment to use Hugging Face LLMs

  8. # huggingface-hub==0.14.1

  9. # uncomment to use Instructor embeddings

  10. # InstructorEmbedding==1.0.1

  11. # sentence-transformers==2.2.2
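Since the app calls load_dotenv() at startup, your API keys belong in a .env file next to the script. A minimal example (the values below are placeholders):

```shell
# .env -- read by python-dotenv via load_dotenv()
OPENAI_API_KEY=sk-your-key-here
# only needed if you uncomment the Hugging Face dependencies above
HUGGINGFACEHUB_API_TOKEN=hf-your-token-here
```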

Simplified Breakdown of the Project

Here is a simplified breakdown of the project to help you understand it as a beginner.

  1. The user uploads the PDF files that they want to chat about.

  2. The code extracts the text from the PDF files and creates a string containing all of the text.

  3. The code splits the text into a list of smaller chunks.

  4. The code creates a vector store from the text chunks.

  5. The code creates a conversation chain from the vector store.

  6. The code prompts the user to ask a question about the PDF files.

  7. The code uses the conversation chain to generate a response to the user's question.

  8. The code displays the response to the user.

  9. The code repeats steps 6-8 until the user quits the program.

The User Uploads the PDF Files They Want

The user uploads the PDF files that they want to chat about in the main() function. Here is the code:

if __name__ == '__main__':
    load_dotenv()
    st.set_page_config(page_title="Chat with multiple PDFs",
                       page_icon=":books:")

    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

    st.header("Chat with multiple PDFs :books:")
    user_question = st.text_input("Ask a question about your documents:")
    if user_question:
        handle_userinput(user_question)

    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader(
            "Upload your PDFs here and click on 'Process'", accept_multiple_files=True)
        if st.button("Process"):
            with st.spinner("Processing"):
                # get pdf text
                raw_text = get_pdf_text(pdf_docs)

                # get the text chunks
                text_chunks = get_text_chunks(raw_text)

                # create vector store
                vectorstore = get_vectorstore(text_chunks)

                # create conversation chain
                st.session_state.conversation = get_conversation_chain(
                    vectorstore)
    The pdf_docs variable holds the list of PDF files the user has uploaded via st.file_uploader(). The st.button() call creates a Process button. When the user clicks it, get_pdf_text() extracts the text from the PDF files into raw_text, get_text_chunks() splits that text into a list of smaller chunks, get_vectorstore() builds a vector store from the chunks, and get_conversation_chain() creates a conversation chain from the vector store. The chain is stored in st.session_state.conversation, which handle_userinput() later uses to answer the user's questions.

The code that extracts the text from the PDF files and creates a string containing all of the text is located in the get_pdf_text() function. Here is the code:

     def get_pdf_text(pdf_docs):
         text = ""
         for pdf in pdf_docs:
             pdf_reader = PdfReader(pdf)
             for page in pdf_reader.pages:
                 text += page.extract_text()
         return text
    

    The get_pdf_text() function initializes an empty string, text. It then opens each uploaded file with PyPDF2's PdfReader, iterates over that file's pages, and appends each page's extracted text to text. Finally, text is returned.
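    A caveat worth knowing: extract_text() can return an empty string (and, in some pypdf versions, None) for scanned, image-only pages. Here is a defensive variant of the same loop, sketched with stand-in page objects instead of real PDFs so it runs anywhere; the FakePage class exists only for this illustration:

```python
class FakePage:
    """Stand-in for a PyPDF2 page object (illustration only)."""
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text  # real pages may yield "" or None


def extract_all_text(pages):
    """Concatenate page text, tolerating pages with no extractable text."""
    text = ""
    for page in pages:
        # 'or ""' keeps the concatenation safe if extract_text() returns None
        text += page.extract_text() or ""
    return text


pages = [FakePage("Hello "), FakePage(None), FakePage("world")]
print(extract_all_text(pages))  # Hello world
```

    In the real get_pdf_text(), the same `or ""` guard can be appended to the `text += page.extract_text()` line.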

    The get_pdf_text() function is called inside the with st.sidebar: block of the main() listing shown earlier; the returned string is stored in the raw_text variable.

    The with st.sidebar: block creates a sidebar where the user uploads the PDF files. When the user clicks the Process button, get_pdf_text() extracts the text from the PDF files into a single string, which is then passed to the get_text_chunks() function. Here is the code:

    Python

     def get_text_chunks(text):
         text_chunks = []
         chunk = ""
         for char in text:        # walk the text one character at a time
             if char == "\n":     # a newline ends the current chunk
                 text_chunks.append(chunk)
                 chunk = ""
             else:
                 chunk += char
         text_chunks.append(chunk)  # don't forget the final chunk
         return text_chunks
    

    The get_text_chunks() function first initializes an empty list, text_chunks, which will hold the chunks. The text argument contains the combined text of the PDF files.

    The function then walks through the text one character at a time. Whenever it hits a newline character, it appends the current chunk to text_chunks and starts a new one; otherwise it adds the character to the current chunk. Note that this splits on lines, so chunk sizes depend entirely on the line lengths in the PDFs.

    Finally, the text_chunks variable is returned.
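    To make the splitting behavior concrete, here is the function run on a small sample string (a self-contained copy of the code above):

```python
def get_text_chunks(text):
    text_chunks = []
    chunk = ""
    for char in text:
        if char == "\n":
            text_chunks.append(chunk)
            chunk = ""
        else:
            chunk += char
    text_chunks.append(chunk)
    return text_chunks


# each line becomes one chunk; a trailing newline leaves an empty final chunk
print(get_text_chunks("Hello world\nSecond line"))  # ['Hello world', 'Second line']
print(get_text_chunks("One line\n"))                # ['One line', '']
```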

    The get_text_chunks() function is called in the with st.sidebar: block of the main() listing shown earlier. When the user clicks the Process button, it splits the text from the PDF files into a list of smaller chunks, and the list is stored in the text_chunks variable.
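    One limitation of splitting on newlines is that chunk sizes are at the mercy of the PDF's line lengths, and embeddings work best on reasonably sized, overlapping chunks. LangChain's langchain.text_splitter module (for example, CharacterTextSplitter) is the usual tool for this; as a dependency-free sketch of the same idea, here is a sliding-window splitter with a fixed size and overlap (the chunk_size and chunk_overlap numbers are just examples):

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size chunks; consecutive chunks share
    chunk_overlap characters, so a sentence cut at one boundary still
    appears whole in the neighbouring chunk."""
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the rest of the text is already covered
    return chunks


# a toy alphabet "document" with small sizes so the overlap is visible
print(split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4))
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```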

The code that creates a vector store from the text chunks is located in the get_vectorstore() function. Here is the code:

    Python

     def get_vectorstore(text_chunks):
         embeddings = OpenAIEmbeddings()
         vectorstore = FAISS.from_texts(text_chunks, embedding=embeddings)
         return vectorstore
    

    The get_vectorstore() function first creates an OpenAIEmbeddings object, which turns text into embedding vectors via the OpenAI API.

    It then calls FAISS.from_texts(), passing the text chunks and the embeddings object; FAISS embeds each chunk and builds a searchable index over the resulting vectors. The vector store is then returned.
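    To demystify what the vector store does at query time: each chunk is stored alongside its embedding vector, and a query is answered by embedding the question and returning the chunks whose vectors are closest (FAISS's default flat index measures closeness with L2 distance like this, just much faster). The embed() function below is a made-up toy for illustration; real embeddings come from OpenAIEmbeddings:

```python
import math


def embed(text):
    # toy 2-d "embedding": character count and word count (illustration only)
    return (len(text), len(text.split()))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_text(query, texts):
    # brute-force nearest-neighbour search, the essence of a flat FAISS index
    q = embed(query)
    return min(texts, key=lambda t: l2_distance(embed(t), q))


chunks = ["a", "medium sized chunk", "a very long chunk of text here"]
print(nearest_text("medium size chunks", chunks))  # medium sized chunk
```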

    The get_vectorstore() function is called in the with st.sidebar: block of the main() listing shown earlier. When the user clicks the Process button, it creates a vector store from the text chunks of the PDF files; the vector store is stored in the vectorstore variable and passed on to get_conversation_chain().

    The code that creates a conversation chain from the vector store is located in the get_conversation_chain() function. Here is the code:

    Python

     def get_conversation_chain(vectorstore):
         llm = ChatOpenAI()
         memory = ConversationBufferMemory(
             memory_key='chat_history', return_messages=True)
         conversation_chain = ConversationalRetrievalChain.from_llm(
             llm=llm, retriever=vectorstore.as_retriever(), memory=memory)
         return conversation_chain
    

    The get_conversation_chain() function first initializes a ChatOpenAI object, LangChain's wrapper around OpenAI's chat models (gpt-3.5-turbo by default).

    It then initializes a ConversationBufferMemory object, which stores the chat history between the user and the chatbot under the chat_history key.

    Next, it builds a ConversationalRetrievalChain that ties together the llm, the vector store's retriever (vectorstore.as_retriever()), and the memory. On each question, the chain retrieves the most relevant chunks from the vector store and has the model answer based on them and the chat history.

    Finally, the conversation_chain object is returned.
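    As a mental model for what ConversationBufferMemory does, here is a dependency-free sketch: the buffer simply accumulates every message and hands the whole transcript back to the chain on each turn. The class and method names below are invented for illustration and are not LangChain's API:

```python
class BufferMemory:
    """Toy stand-in for a conversation buffer memory (illustration only)."""

    def __init__(self):
        self.messages = []  # full transcript, oldest message first

    def save_turn(self, question, answer):
        self.messages.append(("human", question))
        self.messages.append(("ai", answer))

    def load_history(self):
        # the chain receives the entire history on every call
        return list(self.messages)


memory = BufferMemory()
memory.save_turn("What is the PDF about?", "It is about LangChain.")
memory.save_turn("Who wrote it?", "The author is not stated.")
print(len(memory.load_history()))  # 4 (two questions, two answers)
```

    Because the buffer grows with every turn, long conversations push more and more tokens into each prompt; LangChain also offers windowed and summarizing memories for that case.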

    The get_conversation_chain() function is called in the with st.sidebar: block of the main() listing shown earlier. When the user clicks the Process button, it creates a conversation chain from the vector store of the PDF files, and the chain is stored in the st.session_state.conversation variable so that it survives Streamlit's reruns.

The code that prompts the user to ask a question about the PDF files is located in the main() function. Here is the relevant part:

    Python

     st.header("Chat with multiple PDFs :books:")
     user_question = st.text_input("Ask a question about your documents:")
     if user_question:
         handle_userinput(user_question)

    The st.text_input() function is used to prompt the user to ask a question about the PDF files. The question is then stored in the user_question variable. If the user enters a question, the handle_userinput() function is called to generate a response to the question.

    The handle_userinput() function is located in the same file as the main() function. Here is the code for the handle_userinput() function:

    Python

     def handle_userinput(user_question):
         if user_question:
             # calling the chain runs retrieval + the LLM and updates its memory
             response = st.session_state.conversation({'question': user_question})
             st.write(response['answer'])
             st.session_state.chat_history = response['chat_history']


    The handle_userinput() function first checks that user_question is not empty. If it is not, it calls the conversation chain, passing the question in a dict under the 'question' key. The chain retrieves the most relevant chunks from the vector store, has the language model generate an answer from them and the chat history, and returns a dict containing both the answer and the updated chat history. The answer is displayed to the user with st.write().

    The function then stores the updated chat history in st.session_state.chat_history. Keeping the history between turns is what lets the chatbot resolve follow-up questions (for example, pronouns like "it") against earlier parts of the conversation.

To recap how the response is generated: when handle_userinput() calls the conversation chain, the chain first uses the chat history to rephrase the user's question as a standalone question. It then embeds that standalone question and asks the retriever for the text chunks whose embeddings are most similar to it; similarity is measured with a vector metric such as cosine similarity, a score of how alike two pieces of text are in embedding space. Finally, the language model generates an answer conditioned on the retrieved chunks, and the answer is returned to handle_userinput() and displayed to the user.
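    The cosine similarity score mentioned above has a simple formula: the dot product of the two vectors divided by the product of their lengths, giving 1 for vectors pointing the same way and 0 for unrelated (orthogonal) ones. A quick self-contained demonstration on toy vectors:

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```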

    The conversation_chain object is created by get_conversation_chain() and stored in the st.session_state.conversation variable inside main(). The handle_userinput() function then uses st.session_state.conversation to generate responses to the user's questions.

Summary

The project uses LangChain to create a chatbot that can answer questions about PDF files. Streamlit provides the user interface where the user uploads the PDF files and asks questions; LangChain then processes the PDF files and generates the responses.

The project covers the following ins and outs of LangChain that will help a learner:

  • How to create a conversation chain

  • How to use a retriever to find relevant passages from PDF files

  • How to generate responses to user questions

  • How to use memory to store the chat history

This is a valuable skill for learners who are interested in natural language processing or chatbot development.

In addition to the ins and outs of LangChain, the project also teaches learners about the following:

  • How to use Streamlit to create a user interface

  • How to handle file uploads with st.file_uploader()

  • How to keep objects alive across Streamlit reruns with st.session_state

  • How to use a FAISS vector store (rather than a traditional database) to store and search embeddings

Overall, the project is a valuable resource for learners who are interested in learning more about LangChain and how to use it to create chatbots.