Generate character chatbots from existing corpora with LangChain. [Blog]
TLDR: This repo enables you to create data-driven characters in three steps:
- Upload a corpus
- Name a character
- Enjoy
The purpose of `data-driven-characters` is to serve as a minimal, hackable starting point for creating your own data-driven character chatbots. It provides a simple library built on top of LangChain for processing any text corpus, creating character definitions, and managing memory, with various examples and interfaces that make it easy to spin up and debug your own character chatbots.
This repo provides three ways to interact with your data-driven characters:
- Export to character.ai
- Debug locally in the command line or with a Streamlit interface
- Host a self-contained Streamlit app in the browser
Example chatbot architectures provided in this repo include:
- character summary
- retrieval over transcript
- retrieval over summarized transcript
- character summary + retrieval over transcript
- character summary + retrieval over summarized transcript
- Put the corpus into a single `.txt` file inside the `data/` directory.
- Run either `generate_single_character.ipynb` to generate the definition of a specific character, or `generate_multiple_characters.ipynb` to generate the definitions of multiple characters.
- Export the character definitions to character.ai to create a character or create a room, and enjoy!
Here is how to generate the description of "Evelyn" from the movie Everything Everywhere All At Once (2022).
```python
from dataclasses import asdict
import json

from data_driven_characters.character import generate_character_definition
from data_driven_characters.corpus import generate_corpus_summaries, load_docs

# copy the transcript into this text file
CORPUS = 'data/everything_everywhere_all_at_once.txt'

# the name of the character we want to generate a description for
CHARACTER_NAME = "Evelyn"

# split the corpus into a set of chunks
docs = load_docs(corpus_path=CORPUS, chunk_size=2048, chunk_overlap=64)

# generate the character.ai character definition
character_definition = generate_character_definition(
    name=CHARACTER_NAME,
    corpus_summaries=generate_corpus_summaries(docs=docs))

print(json.dumps(asdict(character_definition), indent=4))
```
gives
```json
{
    "name": "Evelyn",
    "short_description": "I'm Evelyn, a Verse Jumper exploring universes.",
    "long_description": "I'm Evelyn, able to Verse Jump, linking my consciousness to other versions of me in different universes. This unique ability has led to strange events, like becoming a Kung Fu master and confessing love. Verse Jumping cracks my mind, risking my grip on reality. I'm in a group saving the multiverse from a great evil, Jobu Tupaki. Amidst chaos, I've learned the value of kindness and embracing life's messiness.",
    "greeting": "Hey there, nice to meet you! I'm Evelyn, and I'm always up for an adventure. Let's see what we can discover together!"
}
```
Now you can chat with Evelyn on character.ai.
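The `load_docs` call above splits the corpus into overlapping chunks before summarization. A minimal sketch of what `chunk_size` and `chunk_overlap` mean, as a character-based sliding window (a simplification for illustration, not the library's actual splitting logic, which is separator-aware):

```python
def chunk_text(text, chunk_size=2048, chunk_overlap=64):
    """Character-based sliding-window chunker: each chunk starts
    chunk_size - chunk_overlap characters after the previous one,
    so consecutive chunks share chunk_overlap characters of context."""
    chunks = []
    step = chunk_size - chunk_overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# a 5000-character corpus yields 3 overlapping chunks with the defaults
print(len(chunk_text("a" * 5000)))  # 3
```

The overlap keeps sentences that straddle a chunk boundary visible to both chunks, at a small cost in redundant tokens.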
Beyond generating character.ai character definitions, this repo gives you tools to easily create, debug, and run your own chatbots trained on your own corpora.
If you are primarily interested in accessibility and open-ended entertainment, character.ai is the better choice.
But if you want more control over the design of your chatbots, such as how they use memory, how they are initialized, and how they respond, `data-driven-characters` may be the better option.
Compare the conversation with the Evelyn chatbot on character.ai with our own Evelyn chatbot designed with `data-driven-characters`. The character.ai Evelyn appears to simply latch onto the local concepts present in the conversation, without bringing new information from its backstory. In contrast, our Evelyn chatbot stays in character and grounds its dialogue in real events from the transcript.
This repo implements the following tools for packaging information for your character chatbots:
- character summary
- retrieval over the transcript
- retrieval over a summarized version of the transcript
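How these pieces might combine, for example in the "character summary + retrieval" architectures listed above, can be sketched as a prompt-assembly step (an illustration only; the function and field names here are hypothetical, not the repo's actual API):

```python
def build_prompt(character_summary, retrieved_chunks, user_message):
    """Assemble a prompt in which the character summary grounds the
    persona and the retrieved corpus chunks ground the reply."""
    context = "\n\n".join(retrieved_chunks)
    return (
        f"You are the following character:\n{character_summary}\n\n"
        f"Relevant excerpts from the corpus:\n{context}\n\n"
        "Stay in character and ground your reply in the excerpts.\n"
        f"User: {user_message}\nCharacter:"
    )
```

The retrieval step would select `retrieved_chunks` from either the raw transcript or its summarized version, matching the `--retrieval_docs raw` / `--retrieval_docs summarized` options shown below.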
To summarize the transcript, one has the option to use LangChain's `map_reduce` or `refine` chains.
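The difference between the two summarization strategies can be illustrated with placeholder summarizers (a sketch of the control flow only, not LangChain's implementation; `summarize`, `combine`, and `refine` stand in for LLM calls):

```python
def map_reduce_summarize(chunks, summarize, combine):
    # map: summarize each chunk independently (parallelizable)
    partial = [summarize(chunk) for chunk in chunks]
    # reduce: combine the partial summaries into one
    return combine(partial)

def refine_summarize(chunks, summarize, refine):
    # start from a summary of the first chunk...
    summary = summarize(chunks[0])
    # ...then sequentially fold each remaining chunk into it
    for chunk in chunks[1:]:
        summary = refine(summary, chunk)
    return summary
```

`map_reduce` is faster because the map step can run in parallel; `refine` is sequential but lets each step condition on everything summarized so far.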
Generated transcript summaries and character definitions are cached in the `output/<corpus>` directory.
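A minimal version of this caching pattern (the file name and helper are illustrative, not the repo's exact layout):

```python
import json
from pathlib import Path

def cached_summaries(corpus_name, compute, output_dir="output"):
    """Reuse summaries from output/<corpus>/summaries.json if present;
    otherwise compute them once and write them to the cache."""
    cache = Path(output_dir) / corpus_name / "summaries.json"
    if cache.exists():
        return json.loads(cache.read_text())
    summaries = compute()
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(json.dumps(summaries))
    return summaries
```

Caching summaries this way means switching characters within the same corpus does not re-run the expensive summarization step.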
Command Line Interface
Example command:
```shell
python chat.py --corpus data/everything_everywhere_all_at_once.txt --character_name Evelyn --chatbot_type retrieval --retrieval_docs raw
```
Streamlit Interface
Example command:
```shell
python -m streamlit run chat.py -- --corpus data/everything_everywhere_all_at_once.txt --character_name Evelyn --chatbot_type retrieval --retrieval_docs summarized --interface streamlit
```
This produces a UI based on the official Streamlit chatbot example that looks like this:
It uses the `map_reduce` summarization chain for generating corpus summaries by default.
Run the following command:
```shell
python -m streamlit run app.py
```
This will produce an app that looks like this:
Interact with the hosted app here.
To install the data_driven_character_chat package, you need to clone the repository and install the dependencies.
You can clone the repository using the following command:
```shell
git clone https://github.com/mbchang/data-driven-characters.git
```
Then, navigate into the cloned directory:
```shell
cd data-driven-characters
```
Install the package and its dependencies with:
```shell
pip install -e .
```
Store your OpenAI API key, either as an environment variable or as the last line of your `.bashrc` or `.zshrc`:

```shell
export OPENAI_API_KEY=<your_openai_api_key sk-...>
```
The examples in this repo are movie transcripts taken from Scraps from the Loft. However, any text corpus can be used, including books and interviews.
- Movie Transcript: Everything Everywhere All At Once (2022)
- Movie Transcript: Thor: Love and Thunder (2022)
- Movie Transcript: Top Gun: Maverick (2022)
- Fan Fiction: My Immortal - Ebony Dark'ness Dementia Raven Way (courtesy of @sdtoyer)
Contribute your characters with a pull request by adding a link to the character above, along with a link to the text corpus you used to generate it.
Other pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
General points for improvement:
- better prompt engineering for embodying the speaking style of the character
- new summarization techniques
- a more customizable UI than Streamlit provides
Concrete features to add:
- Add the option to summarize the raw corpus from the character's perspective. This would be more expensive, because we cannot reuse corpus summaries for other characters, but it could make the character personality more realistic
- recursive summarization
- calculate token expenses
Known issues:
- In the hosted app, clicking "Rerun" does not reset the conversation. Streamlit re-executes the entire app script (in this case `app.py`) from top to bottom every time a user interacts with the app, which means we need to use `st.session_state` to cache previous messages in the conversation. However, this also means that `st.session_state` persists when the user clicks "Rerun". Therefore, to reset the conversation, please click the "Reset" button instead.
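This pattern can be sketched by treating `st.session_state` as the dictionary it behaves like (a simplified illustration of the pattern, not the app's exact code; `append_message` and `reset_conversation` are hypothetical helpers):

```python
def append_message(session_state, role, content):
    """Cache a message so it survives Streamlit's top-to-bottom
    re-execution of the script on every user interaction."""
    session_state.setdefault("messages", []).append(
        {"role": role, "content": content})

def reset_conversation(session_state, greeting):
    """Explicitly clear the cached history; a page 'Rerun' will not
    do this, because session_state persists across reruns."""
    session_state["messages"] = [{"role": "assistant", "content": greeting}]
```

In the real app, `reset_conversation` would be wired to the "Reset" button's callback, with `st.session_state` passed in place of the plain dict.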