Build a Knowledge Base with Large Language Models

This is a brief read about GPTVault, one of my AI side projects. The idea is to leverage generative models to accumulate knowledge in a usable form. It works by querying an LLM (large language model) about a starting topic. The LLM responds with a brief description of the concept, a list of related concepts, and a list of subconcepts. The subconcepts are then queried in turn:

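concepts = ["Knowledge"]  # or any other starting concept of your choice
while True:  # runs until you stop the program
    new_concepts = []
    for concept in concepts:
        # queryLLM and save_concept stand in for the API call and the file writer
        concept_description, subconcepts = queryLLM(concept)
        save_concept(concept_description, subconcepts)
        # collect the subconcepts of every concept to form the next layer
        new_concepts.extend(subconcepts)
    concepts = new_concepts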
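
The loop above can be made concrete with a depth limit and a record of concepts that were already expanded. This is a minimal sketch, not GPTVault's actual code: `expand`, `fake_llm`, and the tiny in-memory `FAKE_MODEL` are hypothetical stand-ins so that the example runs without an API key.

```python
from typing import Callable, Dict, List, Tuple

Answer = Tuple[str, List[str]]  # (description, subconcepts)

# Hypothetical stand-in for the real LLM: a fixed lookup table.
FAKE_MODEL: Dict[str, Answer] = {
    "Knowledge": ("What is known.", ["Epistemology", "Learning"]),
    "Epistemology": ("The study of knowledge.", ["Belief"]),
    "Learning": ("Acquiring knowledge.", ["Belief", "Memory"]),
}


def fake_llm(concept: str) -> Answer:
    """Pretend to query an LLM; unknown concepts get no subconcepts."""
    return FAKE_MODEL.get(concept, (f"No entry for {concept}.", []))


def expand(start: List[str], query: Callable[[str], Answer],
           max_depth: int) -> Dict[str, Answer]:
    """Expand concepts layer by layer, querying each concept at most once."""
    saved: Dict[str, Answer] = {}
    layer = list(start)
    for _ in range(max_depth):
        next_layer: List[str] = []
        for concept in layer:
            if concept in saved:
                continue  # already expanded, possibly earlier in this layer
            description, subconcepts = query(concept)
            saved[concept] = (description, subconcepts)
            next_layer.extend(s for s in subconcepts if s not in saved)
        if not next_layer:
            break  # nothing left to expand
        layer = next_layer
    return saved


vault = expand(["Knowledge"], fake_llm, max_depth=2)
print(sorted(vault))  # ['Epistemology', 'Knowledge', 'Learning']
```

Passing the query function as a parameter keeps the traversal independent of the model or platform behind it.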

That way, two things arise over time:

1. A hierarchy: With the starting concept at the top and subconcepts branching out, it becomes clear how each concept relates to the others, which ones are useful, and which ones are details. The knowledge base gets deeper the longer the program runs, so you can stop at whatever depth suits your use case.

2. A conceptual network: Related concepts are linked to each other. This makes it easier to place concepts into context and improves their discoverability. (The “connective tissue” consists of the concepts themselves.)

GPTVault stores these concepts in Markdown files so that they integrate seamlessly with Obsidian. You can open the folder in Obsidian (as a vault) and explore the network in the graph view.
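
To illustrate how a concept could end up as an Obsidian note, here is a minimal sketch. The note layout and the helper names `render_note` and `save_note` are assumptions for illustration, not GPTVault's actual template; the only Obsidian feature relied on is the `[[wikilink]]` syntax, which is what makes linked notes show up as edges in the graph view.

```python
import tempfile
from pathlib import Path
from typing import List


def render_note(description: str, related: List[str], subconcepts: List[str]) -> str:
    """Render one concept as Markdown, linking other concepts as [[wikilinks]]."""
    lines = [description, "", "Related concepts:"]
    lines += [f"- [[{name}]]" for name in related]
    lines += ["", "Subconcepts:"]
    lines += [f"- [[{name}]]" for name in subconcepts]
    return "\n".join(lines) + "\n"


def save_note(folder: Path, concept: str, text: str) -> Path:
    """Write the note to <folder>/<concept>.md and return the path."""
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{concept}.md"
    path.write_text(text, encoding="utf-8")
    return path


note = render_note("The study of knowledge.", ["Knowledge"], ["Belief", "Justification"])

# Write into a temporary folder so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = save_note(Path(tmp) / "mdfiles", "Epistemology", note)
    written = path.read_text(encoding="utf-8")

print(note)
```

A real implementation would also need to sanitize concept names that contain characters not allowed in file names.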

Use cases

Concept Exploration

GPTVault is useful for those who want to get a grasp of a field or concept, or who want to check whether they missed something after exploring a topic.

Personal Knowledge Management

GPTVault can be a powerful tool for individuals seeking to expand their understanding of various topics. It can serve as a starting point for a new field of personal research by providing a solid structure. This can be especially useful for Obsidian users like me, who want to write the bare minimum and don’t like to engage in linking, or creating “connective tissue” in their second brain.

Inspiration for writing, content creation, and blogging

Writers and bloggers can leverage GPTVault to generate a vast array of interconnected topics for content creation. The program can serve as a source of inspiration and provide a systematic approach to brainstorming and organizing article ideas.

Exploration of an LLM’s grasp of things

How the concepts are arranged depends partly on how the LLM behaves. Repeating patterns in the resulting structure allow conclusions about an LLM’s worldview and the biases in its training data.

As an educational resource

GPTVault can be used by educators and students alike to explore and expand upon specific subjects. By setting a depth appropriate to the level of knowledge required, the program can support in-depth research and study.

Demonstration

An example run with the starting concept “Knowledge” yields this graph, visualized with Obsidian.

![Demo using Knowledge as starting concept](dev-to-uploads.s3.amazonaws.com/uploads/art..)

After running a bit longer, the program had generated more than 1000 files. As you can see, the result is quite exhaustive.

![Demo 2, 1000 Concepts](dev-to-uploads.s3.amazonaws.com/uploads/art..)

A conceptual thread might look like this:

Machine Learning -> Artificial Intelligence -> Robotics -> Humanoid Robots -> Social Interaction -> Empathy in AI -> Ethical AI Development

GPTVault allows users to build a usable vault from scratch by simply providing the program with topic names. This means you don’t need to manually curate the entire knowledge base, which saves time and effort. However, you need basic programming knowledge, since you run the program in your terminal.

Customization Options

The program can be customized to fit individual needs. Users can modify the prompt, the intermediary JSON file structure, and the Markdown file template (useful for Obsidian users), and can even change the behavior of the program itself, as the structure is quite simple.

Integration in an existing vault

By choosing the `fromlayer` mode when running the program (more on that in the next section), you can use existing files as the last layer (the one which has not yet been used to create new concepts). You might need to modify the function that searches a vault for its last layer (contact me if you need help with this).
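
How the last layer is found depends on your vault, so the following is only one possible heuristic, not the function GPTVault ships with. It assumes that a note which has not been expanded yet contains no outgoing `[[wikilinks]]`. The name `find_last_layer` and the sample notes are hypothetical.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# Matches the target of [[Target]], [[Target|alias]] or [[Target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def find_last_layer(vault: Path) -> List[str]:
    """Return the names of notes without outgoing wikilinks, sorted by name."""
    frontier = []
    for path in vault.glob("*.md"):
        text = path.read_text(encoding="utf-8")
        if not WIKILINK.search(text):
            frontier.append(path.stem)  # file name without the .md suffix
    return sorted(frontier)


# Build a throwaway two-note vault to try the heuristic on.
with tempfile.TemporaryDirectory() as tmp:
    vault = Path(tmp)
    (vault / "Knowledge.md").write_text("See [[Epistemology]].", encoding="utf-8")
    (vault / "Epistemology.md").write_text("Hand-written note, no links yet.", encoding="utf-8")
    last_layer = find_last_layer(vault)

print(last_layer)  # ['Epistemology']
```

In a hand-curated vault, fully written notes may also lack links, so you may prefer a tag or a frontmatter field to mark notes as expandable.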

API Costs

Building a knowledge base with GPTVault is not as expensive as you might imagine: the run that created the graph in the Demonstration section, with over 1000 notes, cost just under 1€. (It used GPT-3.5-Turbo via the OpenAI API in July 2023.)

Visualization and Analysis

GPTVault’s results can be visualized using tools like Obsidian Graph view, which provides insights into conceptual relationships and how they are arranged. Additional Obsidian plugins, such as Graph Analysis, can offer further information and analysis.

How to use GPTVault

It uses the OpenAI API by default, but this can be changed easily in the respective function.

1. Clone the repository

2. Set Up the API Key: Set the environment variable `OPENAI_API_KEY` to your OpenAI API key, allowing the program to interact with the OpenAI API.

3. Start the Program: Run `generator.py`. If you are using it for the first time, use `new` as the usage-type. The program will prompt you to provide a set of concepts to start with. If you want to expand on already created files, choose the other mode `fromlayer`.

4. Building the Knowledge Base: GPTVault will start building the knowledge base from the initial concepts you provided. JSON files will be saved in the `concepts` folder, while Markdown files will be saved in the `mdfiles` folder.
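
Before starting a longer run, it can help to check the setup from steps 2 and 4 up front. The sketch below is an assumption about how such a check could look, not part of `generator.py`: `load_api_key` and `ensure_output_folders` are hypothetical helpers, while the variable name `OPENAI_API_KEY` and the folder names `concepts` and `mdfiles` come from the steps above.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before running generator.py")
    return key


def ensure_output_folders(base: Path) -> None:
    """Create the folders for JSON and Markdown output if they are missing."""
    for name in ("concepts", "mdfiles"):
        (base / name).mkdir(parents=True, exist_ok=True)


# Demonstrate with a fake environment and a temporary directory.
key = load_api_key({"OPENAI_API_KEY": " sk-example "})
with tempfile.TemporaryDirectory() as tmp:
    ensure_output_folders(Path(tmp))
    created = sorted(p.name for p in Path(tmp).iterdir())
print(key, created)
```

Failing early on a missing key avoids discovering the problem only when the first API request is rejected.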

On depth selection

The depth parameter in GPTVault determines how far the program will expand the knowledge base from the initial concepts. For broader topics like “machine learning,” a higher depth, around 8, is recommended to explore a wide range of related subtopics. However, for more specific concepts, a depth of 3 to 5 may be more appropriate, to prevent the LLM from wandering off.
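
To get a feeling for what a given depth means in terms of notes (and therefore API calls), a rough upper bound helps. The sketch below assumes that the starting concept sits at depth 0 and that every concept yields the same number of new subconcepts with no overlap; both are simplifications, and the branching factor of 3 is a made-up value, not a measurement from GPTVault.

```python
def estimate_concepts(branching: int, depth: int) -> int:
    """Upper bound on the number of concepts for one starting concept.

    Sums branching**level over levels 0..depth (a geometric series).
    """
    return sum(branching ** level for level in range(depth + 1))


for depth in (3, 5, 8):
    print(depth, estimate_concepts(branching=3, depth=depth))
# 3 40
# 5 364
# 8 9841
```

In practice the count stays below this bound, because concepts that already exist are not created again.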

Customizing GPTVault

GPTVault is customizable, and users can tailor it to their specific needs:

1. Customizing the Prompt: The prompt used to request concepts from the LLM can be modified. As long as the LLM responds with the required JSON string, the program will work.

2. Modifying JSON File Structure: The intermediary JSON structure in which the LLM answers queries can be customized. Users will need to replace the corresponding keys in the program.

3. Changing API Request Parameters: The `makeResponse` function in `generator.py` can be modified to adjust the parameters used in API requests to the LLM. You could, for example, use a different model, or platform.

4. Disabling Markdown Conversion: If desired, you can turn off the automatic conversion to Markdown files and keep only the JSON files as your knowledge base. This might come in handy if you want to use them in your own program and don’t care about the Markdown files.
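
Since the program depends on the LLM answering with a well-formed JSON string, any change to the prompt or the JSON structure is easier to handle with a validation step. This is a minimal sketch: the keys `description`, `related`, and `subconcepts` are assumed for illustration and are not necessarily the keys GPTVault uses, and `parse_concept` is a hypothetical helper.

```python
import json
from typing import Any, Dict

# Hypothetical keys and the type each value is expected to have.
REQUIRED_KEYS = {"description": str, "related": list, "subconcepts": list}


def parse_concept(raw: str) -> Dict[str, Any]:
    """Parse the LLM's answer and check that all expected keys are present."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"missing or malformed key: {key}")
    return data


raw_answer = (
    '{"description": "The study of knowledge.", '
    '"related": ["Logic"], "subconcepts": ["Belief"]}'
)
concept = parse_concept(raw_answer)
print(concept["subconcepts"])  # ['Belief']
```

If you change the JSON structure, only `REQUIRED_KEYS` needs to change in this sketch.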

Limitations

GPTVault, like any tool, has its limitations:

1. Synonyms are not accounted for: The program does not account for synonyms, which might result in some concepts being missed or not adequately represented in the knowledge base. However, you can add a synonyms array to the JSON template, describe it in the prompt, and change the behavior of the `conceptExists` function.

2. Depth-Limit-Stopping: Depending on the depth you choose, the number of concepts stored and the overall cost vary widely. That’s because the number of concepts grows exponentially with depth, so keep that in mind.

3. Drifting Off: Right now, nothing prevents GPT-3.5 from drifting off once a topic is exhausted. One solution would be to paste the starting concept into each prompt and tell GPT-3.5 not to deviate from it, e.g. by letting it terminate a thread via a status response like “0”.
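
The drift guard proposed in the last point could look roughly like this. It is a sketch of the idea only and not implemented in GPTVault: the prompt wording and the helpers `build_prompt` and `handle_answer` are assumptions, and the stop signal “0” is the example status response from above.

```python
from typing import Optional

STOP_SIGNAL = "0"  # status response that ends a thread


def build_prompt(concept: str, starting_concept: str) -> str:
    """Build a prompt that anchors every query to the starting concept."""
    return (
        f'Describe the concept "{concept}" and list its subconcepts as JSON. '
        f'Stay within the topic "{starting_concept}". '
        f'If "{concept}" is no longer related to "{starting_concept}", '
        f"answer only with {STOP_SIGNAL}."
    )


def handle_answer(answer: str) -> Optional[str]:
    """Return the answer for further processing, or None to end this thread."""
    if answer.strip() == STOP_SIGNAL:
        return None
    return answer


prompt = build_prompt("Empathy in AI", "Machine Learning")
print(prompt)
print(handle_answer("0"))  # None
```

When `handle_answer` returns `None`, the caller would skip saving the concept and not add its subconcepts to the next layer.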

Conclusion

GPTVault is a powerful tool for building a knowledge base iteratively using GPT-3.5. Its ability to expand the hierarchy of concepts based on user-provided starting concepts makes it efficient and cost-effective. With customization options available, users can adapt the program to suit their specific requirements. Despite the synonym-limitation, GPTVault offers a valuable way to organize knowledge and gain insights into concept hierarchies, whether used for personal knowledge management or as a foundation for research or the creative process.

You can view the code in its public GitHub repository. Feel free to use and extend it. I kept the code simple and easy to customize.