Wrapping MongoDB with an LLM

Imagine being able to ask your database questions in plain English. What were my sales for last quarter? Show me customers who made purchases in the last month. Personally, I'd much rather use natural language than remember SQL syntax and figure out which tables I need to query and join. This has enormous potential to save professionals hours of work every week. Business analysts, actuaries, data scientists, and others could spend more time solving problems and less time sifting through data.

Abstract: This project combines the power of LLMs with MongoDB (a NoSQL database) to gain insights from data using natural language. Here's how it works:

  1. A user types in a question in plain English
  2. The LLM interprets the question and issues an initial query against the database
  3. The LLM observes the output and iteratively tweaks the query to get the information it needs to answer the user's question
  4. Once the LLM has the data it needs, it uses that data to answer the user's question and stops iterating (a rough skeleton of this loop follows below).
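
To make the shape of that loop concrete before any setup, here is a minimal, self-contained skeleton. ask_llm and run_mongo_query are placeholder stubs standing in for the Groq call and the MongoDB wrapper built later in this post:

# Skeleton of the question -> query -> observe -> answer loop described above.
# ask_llm and run_mongo_query are placeholder stubs, not the real implementation.

def ask_llm(conversation: list) -> str:
    # Placeholder: would send the conversation so far to a hosted LLM
    return "FINAL ANSWER: your heaviest squat set was 80 kg."

def run_mongo_query(request: str) -> str:
    # Placeholder: would run the requested query against MongoDB
    return "[]"

def answer(question: str, max_iterations: int = 5) -> str:
    conversation = [question]
    reply = ""
    for _ in range(max_iterations):
        reply = ask_llm(conversation)                # steps 2-3: propose/refine a query
        if reply.startswith("FINAL ANSWER"):         # step 4: enough data, stop iterating
            return reply
        conversation.append(run_mongo_query(reply))  # feed the results back to the LLM
    return reply

print(answer("What was my heaviest squat?"))
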
File structure

.
+-- docker-compose.yaml 
+-- init-mongo.js
+-- seed 
    +-- data.json 
+-- data-analyzer-agent
    +-- test.ipynb
                
docker-compose.yaml

version: '3.8'
services:
  mongodb:
    image: mongo:latest
    restart: always
    ports:
      - 27017:27017
    volumes:
      # Persist the database files between runs
      - ./data:/data/db
      # The official mongo image runs anything in /docker-entrypoint-initdb.d/
      # on first startup, so mount the init script and the seed data there
      - ./init-mongo.js:/docker-entrypoint-initdb.d/init-mongo.js:ro
      - ./seed/data.json:/docker-entrypoint-initdb.d/data.json:ro
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example

The official mongo image automatically executes any .js files mounted into /docker-entrypoint-initdb.d/ the first time the container starts, so the MongoDB instance loads the dummy data by running the following script.

init-mongo.js

db = db.getSiblingDB('workoutdb');

// Drop the collection if it exists to ensure fresh import
db.workouts.drop();

// Read and parse the JSON data
const data = cat('/docker-entrypoint-initdb.d/data.json');
const workoutsData = JSON.parse(data); // data.json is a top-level array of workout documents

// Insert the data into the 'workouts' collection
db.workouts.insertMany(workoutsData);
    

I mocked up some workout data as part of another personal project, which works nicely for this tutorial. Save the file as data.json in a new directory called seed.

seed/data.json

[
      {
        "date": "2024-11-17",
        "name": "Leg Day",
        "exercises": [
          {
            "name": "Squats",
            "exercise_type": "Strength",
            "sets": [
              { "reps": 10, "value": 60, "units": "kg" },
              { "reps": 8, "value": 70, "units": "kg" },
              { "reps": 6, "value": 80, "units": "kg" }
            ]
          },
          {
            "name": "Leg Press",
            "exercise_type": "Strength",
            "sets": [
              { "reps": 12, "value": 100, "units": "kg" },

Docker lets us create a MongoDB instance that is isolated from the rest of the machine and seeded automatically when the container is first spun up. There are two steps:

  1. Create the MongoDB instance
  2. Load all our dummy data into the database

Make sure you have Docker Desktop installed on your machine. If you've never used Docker, you can learn more on Docker's website. Once Docker Desktop is installed, run the following command to spin up the MongoDB instance.

Terminal Command
docker-compose up
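
To sanity-check that the seed data actually loaded, you can run a quick one-off query with pymongo. This assumes the root credentials from docker-compose.yaml and that the container is exposed on localhost:27017:

from pymongo import MongoClient

# Connect with the root credentials defined in docker-compose.yaml
client = MongoClient("mongodb://root:example@localhost:27017/")

# Count the documents that init-mongo.js inserted into workoutdb.workouts
print(client["workoutdb"]["workouts"].count_documents({}))

If this prints a non-zero count, the database is seeded and we can move on to the agent notebook.
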
data-analyzer-agent/test.ipynb

from pymongo import MongoClient
from groq import Groq
import pprint
client = Groq(
    api_key="<YOUR API KEY>",
)

class MyMongoClient():
    def __init__(self):
        # Connection string (matches the credentials in docker-compose.yaml)
        CONNECTION_STRING = "mongodb://root:example@localhost:27017/"

        # Connect to MongoDB
        mongo_client = MongoClient(CONNECTION_STRING)

        # Access the database seeded by init-mongo.js
        self.db = mongo_client["workoutdb"]

        # Access the collection
        self.collection = self.db["workouts"]
Run the code above. You can save it either as a Python file (.py) or as a Jupyter notebook (.ipynb); the notebook is recommended for this tutorial. You should see output similar to this:

Output from data-analyzer-agent/test.ipynb

=============== Response ===============
||CALL_FUNCTION||query||db.find({'exercises.exercise_type':'Strength', 'exercises.exercise_name':'Bench Press'})||
=============== Results from MongoDB ===============
'[]'
=============== Response ===============
It seems that there are no documents in the database that match the query for bench presses. This could be because there are no strength exercises with the name 'Bench Press' in the database. Would you like to try a different query?
=============== Response ===============
Let me try a more general query to see if there are any strength exercises in the database. 

||CALL_FUNCTION||query||db.find({'exercises.exercise_type':'Strength'})||
=============== Results from MongoDB ===============
("[{'_id': ObjectId('673fbf45017a8b638253e187'), 'date': '2024-11-18', 'name': "
 "'Back and Biceps', 'exercises': [{'name': 'Deadlifts', 'exercise_type': "
 "'Strength', 'sets': [{'reps': 5, 'value': 100, 'units': 'kg'}, {'reps': 5, "
 "'value': 110, 'units': 'kg'}, {'reps': 3, 'value': 120, 'units': 'kg'}]}, "
 "{'name': 'Barbell Rows', 'exercise_type': 'Strength', 'sets': [{'reps': 8, "
 "'value': 60, 'units': 'kg'}, {'reps': 8, 'value': 70, 'units': 'kg'}, "
 "{'reps': 6, 'value': 80, 'units': 'kg'}]}, {'name': 'Bicep Curls', "
 "'exercise_type': 'Strength', 'sets': [{'reps': 12, 'value': 20, 'units': "

You will need a self-hosted or cloud large language model to run the code above. I used Groq for this tutorial because it's free to use and extremely fast compared to other services. Groq serves LLMs on its own custom inference chips, which makes responses markedly faster than typical GPU-backed services. Go to their website and sign up to access the dev console; from there, you can get a free API key to use for this tutorial.

This implementation was designed to show you how LLMs can interact with external systems to gain broader knowledge and generate insights. Using the latest and most powerful models will generally yield more accurate responses.

Challenges: Getting the LLM's prompt right was genuinely difficult, since small changes can produce drastically different results. This process, called prompt engineering, is more art than science. Large language models are, at their core, language models: they are very good at predicting what to say next, but their limited reasoning means user instructions will not always be followed.