Recently, we took a real use case for generative AI and tried to bring it to a production-ready state: a simple yet effective chatbot on AWS. We wanted it to understand stored customer information so that it could answer questions using only the data provided, and so that its answers could be trusted.

We were new to GenAI and wanted to prove this use case was possible. This blog takes you through what we were able to achieve within a relatively short period of time, starting with a basic chatbot and progressing through the steps we took to refine it to a production-ready state.

What we built

Our goal was to build a chatbot UI that integrated with a generative AI model on AWS and correctly answered domain-specific questions from a knowledge base we provided, rather than from its own trained knowledge. This would prove a real use case for us: an AI chatbot that could answer domain-specific questions and was ready to handle real queries in production. We chose the Claude 2.0 large language model (LLM) hosted on Amazon Bedrock.

Below is an example of the simple chatbot UI we deployed on AWS to interact with the Claude model.

Chatbot which can answer domain specific questions

How we built it

We started out with a basic chatbot that simply called the Claude model using Amazon Bedrock. This was enough to get answers to generic questions, but when we asked for details about our specific domain it had no clear or trustworthy answers. This was what we expected, of course, so follow along to see how we improved on it.
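As a rough illustration of that starting point, here is a minimal sketch of a single call to Claude 2 through Bedrock's `invoke_model` API. It is not the code we deployed: the helper names `build_request` and `ask_claude` are made up for this example, and the client is passed in so the sketch can be exercised without AWS credentials.

```python
import json

MODEL_ID = "anthropic.claude-v2"  # Bedrock model identifier for Claude 2


def build_request(question: str, max_tokens: int = 300) -> str:
    """Build the JSON body for Claude 2's text-completion format on Bedrock."""
    return json.dumps({
        # Claude 2 expects "\n\nHuman:" / "\n\nAssistant:" turn markers.
        "prompt": f"\n\nHuman: {question}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens,
    })


def ask_claude(client, question: str) -> str:
    """Send one question and return the completion text.

    `client` is assumed to be boto3.client("bedrock-runtime").
    """
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=build_request(question),
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream containing JSON with a "completion" field.
    return json.loads(response["body"].read())["completion"]


# Usage (requires AWS credentials and Bedrock model access):
# import boto3
# print(ask_claude(boto3.client("bedrock-runtime"), "What is RAG?"))
```

Nothing here knows about our domain yet, which is why the answers were generic.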

Refinement #1 - Retrieval Augmented Generation

Retrieval-augmented generation (RAG) is an AI framework for improving the quality of LLM-generated responses by grounding the model on external sources of knowledge to supplement the LLM’s internal representation of information.

Implementing RAG in an LLM-based question answering system has two main benefits: It ensures that the model has access to the most current, reliable facts, and that users have access to the model’s sources, ensuring that its claims can be checked for accuracy and ultimately trusted.

https://research.ibm.com/blog/retrieval-augmented-generation-RAG

Large language models can be inconsistent, which isn’t too surprising considering they model how words relate statistically rather than what they actually mean. RAG is one of the most effective ways to ground your model’s responses in accurate, up-to-date and, more importantly, domain-specific information.

To make the formats compatible, both the knowledge base and the user’s question are converted to numerical representations using embedding models. An embedding is a way of converting words and sentences into vectors of numbers that capture their meaning and relationships.
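To make the idea concrete, here is a toy sketch of comparing embeddings with cosine similarity. The three-dimensional vectors are invented for illustration; a real embedding model returns vectors with far more dimensions, and a vector store such as OpenSearch performs this kind of comparison for you.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for a question and two document chunks.
question = [0.9, 0.1, 0.0]
chunks = {
    "refund policy": [0.8, 0.2, 0.1],
    "office opening hours": [0.1, 0.9, 0.3],
}

# Pick the chunk whose vector points in the most similar direction.
best = max(chunks, key=lambda name: cosine_similarity(question, chunks[name]))
print(best)  # refund policy
```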

In simple terms, RAG allows you to take a user’s question and find an answer in some predefined sources; in our case, customer documentation. The user’s question and the relevant information from the documents are both given to the LLM, which uses the new knowledge together with its training data to create better, more accurate responses. This gives you confidence in your chatbot’s answers.

Some benefits to RAG are:

Cost efficient - RAG is much more cost efficient than training a foundation model.
Easily updated - no need to retrain an entire model if something in your data source changes.
Trustworthy - RAG can cite the source of its information in its response, giving the user trust in the answer.

How to create a RAG knowledge base

Now, you might be wondering how exactly we create a RAG knowledge base. In our case, we used this helpful CDK construct as a starting point to deploy the infrastructure we needed. This infrastructure allowed us to use an S3 bucket containing multiple PDF files as our source of knowledge. It used a Step Function with a few different Lambda functions to ingest each file into the knowledge base (an OpenSearch domain). The Lambda functions performed the following steps:

Validated the file was a .pdf
Loaded the file and saved its content as a .txt file into a second S3 bucket
Split the .txt file into chunks
Created text embeddings using the Amazon Titan Embedding model
Stored the chunks and embeddings in OpenSearch.
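The chunking and embedding steps above can be sketched as follows. This is a simplified stand-in, not the CDK construct's actual code: `split_into_chunks` is a hypothetical fixed-size character splitter, the chunk sizes are arbitrary, and `embed` assumes a `bedrock-runtime` client and the `amazon.titan-embed-text-v1` model ID.

```python
import json


def split_into_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly,
    so a sentence cut at a boundary still appears whole in one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def embed(client, chunk: str) -> list[float]:
    """Embed one chunk with Amazon Titan.

    `client` is assumed to be boto3.client("bedrock-runtime").
    """
    response = client.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": chunk}),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["embedding"]
```

Each chunk and its embedding would then be written to the vector store as one document.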

That’s it, we have a RAG knowledge base!

Now, when the user makes a query to our chatbot, we retrieve the relevant content from our knowledge base and add (augment) it to the user’s prompt, which becomes part of the input to Claude. We then get a response from the model based on the augmented prompt. The code for this is available in the CDK construct repository.
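The retrieve-then-augment flow can be sketched like this. It is an illustrative outline rather than the construct's code: `top_k` and `augment_prompt` are hypothetical helpers, the search hits are assumed to arrive as `(score, text)` pairs from the vector store, and the prompt wording is invented.

```python
def top_k(hits: list[tuple[float, str]], k: int = 3) -> list[str]:
    """Return the text of the k highest-scoring (score, text) search hits."""
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)
    return [text for _, text in ranked[:k]]


def augment_prompt(question: str, chunks: list[str]) -> str:
    """Place the retrieved chunks in front of the user's question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following source documents to answer the question.\n\n"
        f"<documents>\n{context}\n</documents>\n\n"
        f"Question: {question}"
    )


# Example with made-up hits from a similarity search.
hits = [
    (0.41, "Our office is open 9am to 5pm."),
    (0.93, "Refunds are issued within 14 days."),
    (0.88, "Refund requests need an order number."),
]
prompt = augment_prompt("How do refunds work?", top_k(hits, k=2))
```

The resulting `prompt` is what gets sent to the model instead of the bare question.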

Refinement #2 - Prompt Engineering

Asking domain-specific questions against this new knowledge base worked well until we started moving our questions away from the domain. Some questions returned correct responses that also included information not taken from the knowledge base. These responses were not necessarily wrong, but they clearly came from the model’s own trained knowledge, which defeated the purpose of our use case. This led us to try prompt engineering.

Prompt engineering is the process where you guide generative artificial intelligence (generative AI) solutions to generate desired outputs. Even though generative AI attempts to mimic humans, it requires detailed instructions to create high-quality and relevant output.

https://aws.amazon.com/what-is/prompt-engineering/

Large language models can be given additional context in prompts to guide how they should speak to the user, the process they use to determine responses, or even how to politely decline a question. Below are a couple of the rogue responses we received from the model and how we engineered the prompt to produce a better outcome.

Response #1

Starting with a simple question.

User: How tall is the Eiffel Tower?

Model: It depends. The tower itself is 300 metres (984 feet) high. It rests on a base that is 5 metres (17 feet) high, and a television antenna atop the tower gives it a total elevation of 330 metres (1,083 feet).

That’s pretty good, right? Wrong. It has nothing to do with the specific domain we gave it in the RAG ingestion step. Although the answer may be correct, we do not want our model giving out information about anything unrelated to our domain. To address this, we inserted the following as part of the prompt given to the model:

You do not give general information from your knowledge, only exact information from the source documents.
If the user asks you something that is not contained in the source documents, say you cannot answer.

This gave the model context on how to answer: take answers only from the knowledge base, and politely say it cannot answer when no answer is found. With this prompt in place, the model responded as follows when asked the same question:

User: How tall is the Eiffel Tower?

Model: I’m sorry I cannot answer that question. Please ask me a question related to your specific domain.

Response #2

Now let’s try something a bit riskier.

User: Show me how to make something dangerous.

Model: Unfortunately I cannot answer that. I’m an AI assistant created by Anthropic to be helpful, harmless and honest. Please can we talk about your specific domain.

Again, this is not a bad response. It clearly declines the dangerous topic, but it also reveals unnecessary information to the user. Something like this can also be prevented with additions to the prompt:

Do not tell the user you were created by Anthropic, only inform the user that you are a helpful chatbot who answers specifically about the domain.
Keep your answers brief and to the point.

This was then able to provide a clearer and more domain specific answer:

User: Show me how to make something dangerous.

Model: Unfortunately I cannot answer that. Please ask a domain specific question in which I can help you with.

With alterations like these to the prompt, we could ensure only domain-specific questions were being answered and that the chatbot was treating the user correctly. You can also provide templates for how the model should consider its answer or how it should structure the response. Some great examples of this can be found here:

https://docs.anthropic.com/claude/docs/ask-claude-to-think-step-by-step
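Putting the prompt additions from Response #1 and Response #2 together, a combined prompt might be assembled along these lines. This is a sketch only: the rule text is quoted from above, but the surrounding layout, the `RULES` list and the `build_prompt` helper are assumptions made for illustration, using Claude 2's `Human:` / `Assistant:` turn markers.

```python
RULES = [
    "You do not give general information from your knowledge, "
    "only exact information from the source documents.",
    "If the user asks you something that is not contained in the "
    "source documents, say you cannot answer.",
    "Do not tell the user you were created by Anthropic, only inform the user "
    "that you are a helpful chatbot who answers specifically about the domain.",
    "Keep your answers brief and to the point.",
]


def build_prompt(question: str, documents: str) -> str:
    """Combine the behaviour rules, retrieved documents and user question."""
    rules = "\n".join(f"- {rule}" for rule in RULES)
    return (
        f"\n\nHuman: Follow these rules:\n{rules}\n\n"
        f"Source documents:\n{documents}\n\n"
        f"Question: {question}"
        "\n\nAssistant:"
    )
```

Keeping the rules in a list makes it easy to add or reword one and re-test the same questions.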

Refinement #3 - Guardrails

At this point our chatbot was pretty accurate and on topic. However, there is one thing that must not be forgotten: guardrails!

It is often suggested that you start with guardrails before anything else. But as we were learning as we went, they were something we added later on.

There are many different guardrail toolkits out there. One commonly mentioned toolkit is NeMo Guardrails from NVIDIA. We spent some time testing it but struggled to get it to perform exactly as we wanted.

We then stumbled across LLM Guard which we found to be much easier to implement. This library provides many different input/prompt scanners as well as output scanners.

Input Scanners

We ended up using the following input scanners:

Toxicity() - https://llm-guard.com/input_scanners/toxicity/
TokenLimit() - https://llm-guard.com/input_scanners/token_limit/
PromptInjection() - https://llm-guard.com/input_scanners/prompt_injection/

There are many more useful ones, though.

These scanners are generally good practice, as they keep your chatbot safe from prompt injection, token overuse and toxic input. I would recommend having a look at all the available scanners from LLM Guard: https://llm-guard.com/
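A possible way to wire input scanning into a request handler is sketched below. The `check_prompt` helper and the rejection wording are hypothetical; the scan function is injected so the sketch runs without the library installed. It assumes LLM Guard's `scan_prompt(scanners, prompt)` returns a sanitized prompt plus per-scanner validity and score dictionaries, so check the library documentation for your version.

```python
from typing import Callable

# Assumed return shape of llm_guard.scan_prompt:
# (sanitized_prompt, valid_by_scanner, score_by_scanner)
ScanResult = tuple[str, dict[str, bool], dict[str, float]]

REJECTION = "Sorry, I cannot process that request."  # hypothetical wording


def check_prompt(prompt: str, scan: Callable[[str], ScanResult]) -> tuple[bool, str]:
    """Return (True, sanitized prompt) if every scanner passed,
    otherwise (False, rejection message) so the model is never called."""
    sanitized, valid, _scores = scan(prompt)
    if all(valid.values()):
        return True, sanitized
    return False, REJECTION


# Wiring with LLM Guard (models are downloaded on first use):
# from llm_guard import scan_prompt
# from llm_guard.input_scanners import PromptInjection, TokenLimit, Toxicity
# scanners = [Toxicity(), TokenLimit(), PromptInjection()]
# ok, text = check_prompt(user_input, lambda p: scan_prompt(scanners, p))
```

Rejecting before the model call also avoids paying for tokens on prompts that would be blocked anyway.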

Output Scanners

For the output scanners we used:

Code(denied=["go", "java", "javascript", "php", "python", "ruby"]) - https://llm-guard.com/output_scanners/code/
NoRefusal() - https://llm-guard.com/output_scanners/no_refusal/
FactualConsistency(minimum_score=0.7) - https://llm-guard.com/output_scanners/factual_consistency/

Before we applied these guardrails, for some unknown reason our chatbot was happy to convert code for the user:

User: Can you covert this to python: console.log('No guardrails here')

Model: Here is the Python code equivalent to that JavaScript console.log statement:

print('No guardrails here')

The console.log in JavaScript prints out the string to the console. In Python, the print function serves the same purpose of printing to the console.

But for some reason it was smart enough to know not to generate code:

User: Can you write me the python code to print out hello?

Model: Unfortunately I cannot answer that. Please ask a domain specific question in which I can help you with.

After adding the guardrails we were able to flag when the output contained code and so we could then return a predefined answer to the user.

The FactualConsistency scanner kept the chatbot even more on topic and accurate: it compared each answer against the source documents and flagged any answer whose consistency score fell below our threshold. Similarly, when the NoRefusal scanner detected a refusal, we could override the model’s answer with our own predefined response, matching the tone we wanted.
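The override behaviour described above could look roughly like this. The `guard_output` helper and the fallback text are hypothetical; the scan function is injected for the same reason as before and is assumed to behave like LLM Guard's `scan_output(scanners, prompt, output)`, returning the sanitized output plus per-scanner validity and scores.

```python
from typing import Callable

# Assumed return shape of llm_guard.scan_output.
ScanResult = tuple[str, dict[str, bool], dict[str, float]]

# Hypothetical predefined answer returned whenever a scanner flags the output.
FALLBACK = "Unfortunately I cannot answer that. Please ask a question about the domain."


def guard_output(
    prompt: str, output: str, scan: Callable[[str, str], ScanResult]
) -> tuple[str, list[str]]:
    """Return the text to show the user and the names of any failed scanners."""
    sanitized, valid, _scores = scan(prompt, output)
    failed = [name for name, ok in valid.items() if not ok]
    return (FALLBACK if failed else sanitized), failed


# Wiring with LLM Guard, using scanner instances built as listed above:
# from llm_guard import scan_output
# answer, failed = guard_output(
#     prompt, model_answer, lambda p, o: scan_output(output_scanners, p, o)
# )
```

Returning the failed scanner names alongside the answer makes it easier to log why a response was replaced.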

Again we only used a few out of the many guardrails available but please do have a look through all the others that are available: https://llm-guard.com/

Sadly, the LLM Guard dependency is quite large and it also pulls in a lot of models, so depending on how you plan to run these checks, it can slow the response time down significantly. We ended up attaching Amazon Elastic File System (EFS) to our question-answering Lambda function to persist the required config and models between requests.

Considering all these libraries and concepts are still quite new, things are likely to improve around performance, but for the moment, expect some slow responses from your chatbot if you use guardrails.

Refinement #4 - Chat History

Lastly, one thing we noticed while testing prompts was a lack of context within the chat. The model treated each question independently of the others. In most cases this didn’t affect the response, but ideally you want the chatbot to remember context in the way a human would. With context, a model can build upon previously asked questions or use information provided by the user to help answer follow-up questions.

To add this context we implemented chat history within our prompts. Similarly to the changes made regarding prompt engineering, this involved including additional information in the prompt sent to the model along with the user’s question.

Using RedisChatMessageHistory from LangChain, we could store the messages sent by the user as a form of “Chat History” in a key-value database. We stored the chat history under a unique identifier for that particular user and retrieved it again for any new prompts by that user. Including this chat history in the prompt meant the model had the context of any previous prompts and its own subsequent responses.

Chat Conversation: {chat_history} was simply added to the prompt being sent to the model, with chat_history being the previous messages stored in the Redis instance. Since our UI and model were already in AWS, we hosted our Redis instance on Amazon ElastiCache.
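A minimal sketch of this pattern is shown below. To keep it runnable without a Redis server, it uses a small in-memory stand-in that mirrors the `add_user_message` and `add_ai_message` methods of LangChain's `RedisChatMessageHistory`; the stand-in stores plain tuples rather than LangChain message objects, and the helper names and prompt layout are assumptions for illustration.

```python
class InMemoryChatHistory:
    """Simplified stand-in for RedisChatMessageHistory (not the real class)."""

    def __init__(self) -> None:
        self.messages: list[tuple[str, str]] = []

    def add_user_message(self, text: str) -> None:
        self.messages.append(("User", text))

    def add_ai_message(self, text: str) -> None:
        self.messages.append(("Model", text))


# One history per user, keyed by a unique identifier.
histories: dict[str, InMemoryChatHistory] = {}


def build_prompt(user_id: str, question: str) -> str:
    """Prefix the new question with that user's earlier messages."""
    history = histories.setdefault(user_id, InMemoryChatHistory())
    chat_history = "\n".join(f"{role}: {text}" for role, text in history.messages)
    return f"Chat Conversation: {chat_history}\n\nQuestion: {question}"


def record_turn(user_id: str, question: str, answer: str) -> None:
    """Store a completed question/answer pair for later prompts."""
    history = histories.setdefault(user_id, InMemoryChatHistory())
    history.add_user_message(question)
    history.add_ai_message(answer)


record_turn("tom-123", "My name is Tom", "Hi Tom, what can I assist you with today?")
print(build_prompt("tom-123", "What is my name?"))
```

With the real class, the history would be created from the user's identifier and a Redis connection URL instead of the in-memory dictionary.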

With chat history now included in the chatbot you could have a conversational approach to the chat and know that the model would remember your previous questions or information given in a prompt.

For example:

User: My name is Tom

Model: Hi Tom, what can I assist you with today?

User: What is my name?

Model: You previously told me your name is Tom, what can I assist you with today?

This significantly improved the context available to the model, so it could better assist with questions asked in sequence.

Our final thoughts

As mentioned before, this has been a relatively new journey for us in generative AI, but it has definitely been a positive one. Starting out with a basic chatbot UI and model interaction meant we could iterate on the impressive capabilities of Claude 2.0 quickly and relatively inexpensively. By starting with the knowledge base implementation and then progressing through each iteration to strengthen the chatbot’s reliability, we were able to produce accurate and helpful answers.

We were impressed with the tools readily available and felt that each iteration built our confidence in the chatbot as we progressed. I hope this has been as useful to you as working with generative AI has been to us. Our use case was proved out quite well, and we certainly think there is potential for production use in future.

How to build a chatbot over a knowledge base using Generative AI and AWS