Let’s write an AI Keeper for Call of Cthulhu! — Part I: Design & Demo

Ming
10 min read · Sep 10, 2024

Call of Cthulhu (CoC) is a tabletop role-playing game based on the works of H.P. Lovecraft. It involves 3–8 players, but coordinating everyone’s schedule can feel like trying to align the stars. What if you’re all alone and suddenly start craving a game? That’s where an AI Keeper comes in. It can run a game for you, anytime, anywhere.

My scribble of a Cthulhu-y creature.

In this series, we’ll build a chatbot that acts as the game master (“Keeper”) in a single-player CoC game. It will narrate the story, play the NPCs, and roll the dice. This means:

  • We’ll use large language models (LLMs). As a narrative-driven game, CoC involves lots of natural-language interaction. Often, the Keeper has to improvise the story based on the players’ actions, which demands latent knowledge and common sense. This is where LLMs shine.
  • The chatbot will be agentic. As the Keeper, our AI needs to consult the rulebook, roll dice, and examine player statuses. Each of those capabilities can be delegated to a “tool” that the AI can use. The ability to use tools, in this context, is called “having agency”.
  • We’ll use retrieval-augmented generation (RAG). While hosting the game, the AI Keeper often needs to look up rules, consult playbooks, and even search the internet for facts. We can’t possibly train an LLM on all that knowledge (or guarantee its accuracy in recalling anything). Instead, we’ll have the AI agent retrieve data on demand. See Why RAG is Big.

Are you looking for real-world examples of building AI applications? This project is a great starting point. Let’s dive in!

Background

A typical game of CoC involves a Keeper (the game master) and Investigators (the players). The Keeper narrates the story, and the Investigators interact with the world. The Keeper also plays the roles of the non-player characters (NPCs).

How does a Keeper know how the story should unfold? That’s where CoC modules come in. A module is a scenario that the Keeper uses to run a game. It contains the story, the NPCs, and the challenges that the Investigators face.

Yes, challenges. CoC is a horror game, and the Investigators will face many challenges. These challenges can be anything from a locked door to a monster from another dimension. To determine the outcome of these challenges, the Keeper and the Investigators roll dice. If fortune bestows upon them, they succeed and progress the story. If not, they face the consequences, and the story takes a darker turn.

Here’s a flowchart of a typical CoC game:

Now, let’s examine each component in detail and see how we want to implement them.

Design

Character building is a process that involves quite some math, which isn’t an LLM’s strong suit. We can circumvent this by delegating the task to an existing tool, of which the internet has plenty. Several web-based character builders, for example, let you download character sheets as JSON files, which our AI Keeper can easily parse.

What if the player doesn’t want to digress to a browser at all? Let’s do it in pure Python! The package Cochar by Adam Walkiewicz does just that. We can simply register it as a tool for our AI Keeper to use. Plus, since this package brings its own class representing character sheets, it saves us the trouble of defining a data schema for imported JSON files.
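Here’s a minimal sketch of that registration. Note that the create_character call is my assumption about Cochar’s API; consult the package’s documentation for the actual signature.

import cochar  # character-generator package from PyPI
from llama_index.core.tools import FunctionTool

def create_investigator(year: int = 1925, country: str = "US") -> str:
    """Generate a random investigator and return the character sheet as text."""
    # cochar.create_character(year, country) is an assumed signature;
    # double-check against the Cochar documentation.
    character = cochar.create_character(year, country)
    return str(character)

# Register the function as a tool that the AI Keeper can call.
character_tool = FunctionTool.from_defaults(fn=create_investigator)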

Another math-heavy part is dice rolling. Researchers have shown that LLMs tend to be biased when generating random numbers. Let’s delegate this task to a tool written in traditional programming code. For dice outcomes, the CoC rulebook has an exact mapping from numerical values to degrees of success (success, failure, fumble, etc.). Things like this should also be handled by traditional code and packaged into a tool.
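As a sketch, that pair of responsibilities could look like the following. The thresholds in degree_of_success are my reading of the 7th-edition rules (one-half and one-fifth of the skill value, with fumbles at the top of the d100 range); double-check them against your copy of the rulebook.

import random

def roll_a_dice(n_faces: int = 100) -> int:
    """Roll a single die with the given number of faces."""
    return random.randint(1, n_faces)

def degree_of_success(roll: int, skill: int) -> str:
    """Map a d100 roll against a skill value to a degree of success."""
    if roll == 1:
        return "critical success"
    # Fumble range per my reading of the rules: 96-100 if skill < 50, else 100.
    if roll == 100 or (skill < 50 and roll >= 96):
        return "fumble"
    if roll <= skill // 5:
        return "extreme success"
    if roll <= skill // 2:
        return "hard success"
    if roll <= skill:
        return "regular success"
    return "failure"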

Storytelling is the fun part. Our AI Keeper should be faithful to the module’s story and follow the rules. This means we should give the chatbot a tool for looking up details from the module or the rulebook. We can prepare both documents beforehand in some parser-friendly format.
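In LlamaIndex terms, a minimal lookup tool might look like this, assuming the module and rulebook have been exported as plain text or Markdown into a documents/ folder:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.tools import QueryEngineTool

# Index the game module and rulebook (prepared beforehand as text/Markdown).
documents = SimpleDirectoryReader("./documents").load_data()
index = VectorStoreIndex.from_documents(documents)

# Expose the index as a tool the agent can query in natural language.
lookup_tool = QueryEngineTool.from_defaults(
    query_engine=index.as_query_engine(),
    name="lookup_module_and_rulebook",
    description="Look up story details from the module or rules from the CoC rulebook.",
)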

Photo by Maurice Nguyen on Unsplash

In case the Keeper needs to improvise, a similar tool should be available for searching the internet. There are search engines built specifically for LLMs to use via API calls, such as Tavily. If you prefer a more widely adopted search engine, that’s easy, too: popular RAG frameworks like LlamaIndex have integrations for big names like Google and Bing.
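For instance, assuming you’ve installed LlamaIndex’s llama-index-tools-tavily-research integration and have a Tavily API key, wiring up web search takes just a couple of lines:

from llama_index.tools.tavily_research import TavilyToolSpec

# Each method of the tool spec (e.g., search) becomes a tool the agent can call.
search_tools = TavilyToolSpec(api_key="tvly-...").to_tool_list()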

A final data source to consider is the Keeper’s own notebook. This alleviates the problem of your chatbot missing details that appeared earlier in the conversation, perhaps due to a short context window in your LLM of choice. We can invite the LLM to track the status of characters, branching storylines, and improvised details there. Unlike the tools we’ve designed so far, this capability should be broken down into two parts: one for taking notes and another for reading them.
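Here’s a throwaway sketch of those two parts, with a plain text file standing in for whatever storage you prefer:

from pathlib import Path

NOTEBOOK = Path("keeper_notes.txt")

def take_note(note: str) -> str:
    """Append a note about character statuses, plot branches, or improvised details."""
    with NOTEBOOK.open("a") as file:
        file.write(note + "\n")
    return "Noted."

def read_notes() -> str:
    """Read back every note taken so far."""
    return NOTEBOOK.read_text() if NOTEBOOK.exists() else "No notes yet."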

Apart from function-calling tools and data-retrieving tools, let’s spice it up with some LLM-powered tools. For example, a player may feel disoriented and ask the Keeper, “What can I do in this dark cave?” A generic chatbot may say something like, “You can explore the cave or go back.” But a good Keeper should suggest some skills appropriate for the situation and forecast the possible outcomes. We can have a tool that generates such recommendations.
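One way to build such a tool is to have it call the LLM itself with a narrowly scoped prompt. A rough sketch:

from llama_index.core import Settings

def suggest_skills(situation: str) -> str:
    """Suggest skills appropriate for the player's situation and forecast outcomes."""
    prompt = (
        "You are a Call of Cthulhu Keeper. Given the situation below, suggest "
        "two or three investigator skills worth attempting, and briefly "
        "forecast the likely outcome of each.\n\n"
        f"Situation: {situation}"
    )
    # Delegate to whichever LLM is configured globally for the app.
    return str(Settings.llm.complete(prompt))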

With the components translated into tools, we can now design the chatbot’s workflow:

Ughh, I can’t say MermaidJS always generates the most aesthetic layouts.

Unlike the flow of the game itself, the flow of the bot constitutes an infinite loop. (Notice the absence of an “end” terminal from the diagram.) This is because the chatbot should be able to handle multiple questions from the player in a single session. The player can close the window whenever they want.

Goal / Demo

The end goal for our chatbot is to hold a conversation like the following:

What just happened? In the dialogue above, I asked our AI Keeper (which I named “Cocai”, for “CoC AI”) to generate a character for me. Its first attempt failed to satisfy the input requirements of the character-creation tool, but Cocai corrected itself on the next try and successfully got a 23-year-old Mr. Don Joe out of the tool:

Cocai was then able to suggest skills appropriate to the daring situation my Investigator was facing. Subsequently, I asked Cocai to roll a skill check for me, Spot Hidden. Notice that Spot Hidden wasn’t mentioned in Don Joe’s short biography. When I challenged Cocai about where that skill value of 51 came from, it consulted the game module as well as the internet for a better-grounded default. Humbly admitting its mistake, Cocai worked nicely with the player to keep the game going.

Implementation

You can find the code here. I’d also like to encourage you to implement the AI agent yourself.

Starting off with a skeleton

Let’s start by creating a skeleton for our AI Keeper. For the chatbot part, we’ll use the following libraries:

  • LlamaIndex is a framework for building LLM applications. We’ll be using its agentic AI capabilities.
  • Chainlit provides a ChatGPT-style web UI for LlamaIndex applications.
  • Arize Phoenix is an observability platform. We use it to inspect the AI’s chains of thought (CoTs), which are generally hidden away from the user-facing UI that Chainlit provides.
  • Pydantic is a data validation library. We’ll use it to describe to the LLM the inputs each tool expects.

This quartet of libraries is employed so often in modern AI projects that I’m willing to name it “the LCAP stack”. (Anyone remember the MEAN stack for web devs and the TICK stack for time-series data?) In fact, I have a template repository that sets up the LCAP stack for you. You can use it as a starting point for your own AI projects.

One deviation from my template repo that I’m making in Cocai is the package manager: I’m using uv instead of Poetry. This departure is largely inspired by Stuart Ellis’s blog post, Modern Good Practices for Python Development. To summarize, uv is faster, closer to Python standards, and more space-efficient. I recommend giving it a try.

Here’s the bare-minimum code to get the chatbot up and running:

import chainlit as cl
import phoenix as px
from llama_index.agent.openai import OpenAIAgent
from llama_index.core import Settings
from llama_index.core.agent import AgentRunner
from llama_index.core.callbacks import CallbackManager
from phoenix.trace.llama_index import OpenInferenceTraceCallbackHandler

# Launch the Phoenix observability app in the background.
px.launch_app()

@cl.on_chat_start
async def factory():
    Settings.callback_manager = CallbackManager([
        # Phoenix can display in real time the traces automatically collected
        # from your LlamaIndex application. One-liner activation is possible,
        # but I prefer to do it manually, so that I can put all callback
        # handlers in one place.
        OpenInferenceTraceCallbackHandler(),
        cl.LlamaIndexCallbackHandler(),
    ])
    Settings.llm = ...  # pick and configure your LLM here
    # Initialize the agent and stash it in the user session...
    cl.user_session.set(
        "agent",
        OpenAIAgent.from_tools(
            system_prompt="You are a keeper of a Call of Cthulhu game...",
            tools=[...],
        ),
    )

@cl.on_message
async def main(message: cl.Message):
    # ... and retrieve the agent from the user session here.
    agent: AgentRunner = cl.user_session.get("agent")
    response = await cl.make_async(agent.chat)(message.content)
    await cl.Message(content=response.response).send()

A note about asynchronous programming. The line await cl.make_async(agent.chat)(message.content) may look messy, but it's actually a recommendation from the Chainlit doc:

The make_async function takes a synchronous function (for instance a LangChain agent) and returns an asynchronous function that will run the original function in a separate thread. This is useful to run long running synchronous tasks without blocking the event loop.

I once thought we could just use agent.achat(...), which is the async flavor of the agent.chat(...) method native to LlamaIndex. However, it would cause <ContextVar name='chainlit' at 0x...> errors. It seems that it matters in which thread the async function is declared. (Please tell me if I got it wrong.)

Picking an LLM and a function-calling paradigm

Ever since I started building AI agents last year, I’ve been using the ReAct paradigm. It simulates function-calling capabilities with a purely semantic approach, allowing me to try out ideas with locally served LLMs, which rarely support calling functions natively.

This distinction may be better illustrated by comparison. Take the LlamaIndex framework as an example, where interactions between an AI agent and its underlying LLM are carried out by AgentWorkers:

  • A ReActAgentWorker describes all the tools in the system prompt in English, eavesdrops on the LLM's "inner dialogue" about which tool it wants to use, executes that tool, and sends the result back to the LLM for user-facing responses. (See my previous post, Why RAG is Big, where I explained ReAct in more detail; a minimal construction sketch appears right after this list.)
  • An OpenAIAgentWorker sends the tooling information according to the OpenAI API's specifications in JSON, sees which tool the remote server says the LLM wants to execute, executes it, and sends the result back to OpenAI for user-facing responses.
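For concreteness, here’s a minimal sketch of constructing the ReAct flavor in LlamaIndex (the OpenAI flavor appears in the example further below):

import random

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool

def roll_a_dice(n: int) -> int:
    """Roll an n-faced dice and return the result."""
    return random.randint(1, n)

# verbose=True prints the Thought/Action/Observation "inner dialogue".
react_agent = ReActAgent.from_tools(
    tools=[FunctionTool.from_defaults(roll_a_dice)],
    verbose=True,
)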

ReAct has its own issues, though. The natural-language approach can be error-prone. The three major problems I witnessed (and fixed for LlamaIndex) are malformed JSON strings, hallucinated tools, and failure to adhere to “inner voice” formats. Although we can ask the LLM to correct its own mistakes, it’s better to prevent them from happening in the first place, just like native function-calling LLMs would.

The situation has improved. This year, many more open-source LLMs have started supporting native function calls. The most prominent of them may be Llama 3.1, released on July 23, 2024. Two days later, Ollama published an example of how to have Llama 3.1 use tools when served via Ollama. This looked promising, so I decided to try it out in this project.

We can use the OpenAIAgentWorker to make use of Llama 3.1's tooling capabilities. The only caveat is that OpenAIAgentWorker expects an OpenAI LLM. If we continued to use the LLM class llama_index.llms.ollama.Ollama, OpenAIAgentWorker would complain "llm must be a OpenAI instance". Luckily, Ollama offers an OpenAI-compatible API, so we can simply use the LLM class llama_index.llms.openai_like.OpenAILike as a workaround. Here's a minimal reproducible example (gist):

import random

from llama_index.agent.openai import OpenAIAgent
from llama_index.core import Settings
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai_like import OpenAILike
from pydantic import Field

def roll_a_dice(
    n: int = Field(description="number of faces of the dice to roll", gt=0, le=100),
) -> int:
    """Roll an n-faced dice and return the result."""
    return random.randint(1, n)

if __name__ == "__main__":
    # Ollama exposes an OpenAI-compatible endpoint at /v1, so OpenAILike works.
    Settings.llm = OpenAILike(
        model="llama3.1",
        api_base="http://localhost:11434/v1",
        api_key="ollama",  # Ollama ignores the key, but the client requires one
        is_function_calling_model=True,
        is_chat_model=True,
    )
    agent = OpenAIAgent.from_tools(tools=[FunctionTool.from_defaults(roll_a_dice)])
    print(agent.chat("Roll a 7-faced dice just for fun. What's the outcome?"))

In my implementation, I’ve added support for both the OpenAI-like Ollama API and the genuine OpenAI API. If you would like to see how ReAct performs, revert this commit and run the chatbot with the Ollama API.

What’s next?

Phew! That’s a long article already. In the next installment of the series, I will implement each of the agentic tools we have designed. They will cover all four ways of declaring a tool in LlamaIndex:

  • taking a pre-packaged tool from LlamaHub,
  • wrapping a Python library from PyPI into a tool,
  • writing your own Python function for the LLM, and finally
  • using LLM’s text completion as a tool itself.

The ideas translate easily to other frameworks like LangChain, as I’ve illustrated in another post earlier this year. Stay tuned!
