Imagine being able to ask your database questions in plain English: "What were my sales for last quarter?" "Show me the customers who made purchases in the last month." Personally, I'd much rather use natural language than remember SQL syntax and work out which tables I need to query and join. This has enormous potential to save hours of work per week across many professions: business analysts, actuaries, data scientists, and more could spend their time solving problems rather than sifting through data.
Abstract: This project combines the power of LLMs with MongoDB (a NoSQL database) to gain insights from data using natural language. Here's how it works:
.
+-- docker-compose.yaml
+-- init-mongo.js
+-- seed
|   +-- data.json
+-- data-analyzer-agent
    +-- agent.ipynb
version: '3.8'
services:
  mongodb:
    image: mongo:latest
    restart: always
    ports:
      - "27017:27017"
    volumes:
      - ./data:/data/db
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example
  mongo-seed:
    image: mongo:latest
    depends_on:
      - mongodb
    volumes:
      - ./init-mongo.js:/init-mongo.js:ro
      - ./seed/data.json:/seed/data.json:ro
    # Retry until mongodb is ready to accept authenticated connections
    command: sh -c "until mongosh mongodb://root:example@mongodb:27017/admin /init-mongo.js; do sleep 3; done"
The mongo-seed container runs the following script against the mongodb instance to load the dummy data. Save it as init-mongo.js next to docker-compose.yaml:
db = db.getSiblingDB('workoutdb');

// Drop the collection if it exists to ensure a fresh import
db.workouts.drop();

// Read and parse the JSON data (a top-level array of workout documents)
const fs = require('fs');
const workoutsData = JSON.parse(fs.readFileSync('/seed/data.json', 'utf8'));

// Insert the data into the 'workouts' collection
db.workouts.insertMany(workoutsData);
I mocked up some workout data as part of another personal project, and it works nicely for this tutorial. Save the following file as data.json in a new directory called seed (an excerpt is shown; add as many workouts as you like):
[
  {
    "date": "2024-11-17",
    "name": "Leg Day",
    "exercises": [
      {
        "name": "Squats",
        "exercise_type": "Strength",
        "sets": [
          { "reps": 10, "value": 60, "units": "kg" },
          { "reps": 8, "value": 70, "units": "kg" },
          { "reps": 6, "value": 80, "units": "kg" }
        ]
      },
      {
        "name": "Leg Press",
        "exercise_type": "Strength",
        "sets": [
          { "reps": 12, "value": 100, "units": "kg" }
        ]
      }
    ]
  }
]
Docker lets us create a MongoDB instance that is isolated and reseeded each time the containers are spun up. There are two steps here: Compose starts the mongodb service, then the mongo-seed service runs init-mongo.js against it. Kick both off with a single command:

docker-compose up
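
Once the containers are up, it's worth confirming the seed actually landed. Here is a quick sanity check with PyMongo (a throwaway snippet of mine, not one of the project files); it should print a number greater than zero:

from pymongo import MongoClient

# Connect with the root credentials from docker-compose.yaml
client = MongoClient("mongodb://root:example@localhost:27017/")

# Count the seeded documents; anything > 0 means the import worked
print(client["workoutdb"]["workouts"].count_documents({}))

With data in place, we can wire up the agent: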
from pymongo import MongoClient
from groq import Groq
import pprint

# Groq client for LLM calls
groq_client = Groq(
    api_key="<YOUR API KEY>",
)

class MyMongoClient():
    def __init__(self):
        # Connection string (replace with your actual credentials)
        CONNECTION_STRING = "mongodb://root:example@localhost:27017/"
        # Connect to MongoDB
        client = MongoClient(CONNECTION_STRING)
        # Access the database
        db = client["workoutdb"]
        # Access the collection
        self.collection = db["workouts"]

    def find(self, query):
        # Run the query against the workouts collection and return the
        # documents as a string the LLM can read
        return str(list(self.collection.find(query)))
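
The glue that turns this into an agent is a conversation loop. Below is a minimal sketch of it; the system prompt wording, the ||CALL_FUNCTION||query||...|| marker format, the five-turn cap, and the model name are my illustrative choices rather than the only way to wire this up. The idea: send the conversation to Groq, and whenever the reply contains a function-call marker, extract the db.find(...) expression, run it against MongoDB, and feed the results back to the model:

# Minimal agent loop (illustrative sketch; see assumptions above)
SYSTEM_PROMPT = (
    "You are a data analyst with access to a MongoDB collection of workouts. "
    "When you need data, reply with exactly: "
    "||CALL_FUNCTION||query||db.find({...})|| using PyMongo-style filters. "
    "Otherwise, answer the user in plain English."
)

db = MyMongoClient()
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "How much do I bench press?"},
]

for _ in range(5):  # cap the number of round trips to the LLM
    response = groq_client.chat.completions.create(
        model="llama3-70b-8192",  # assumption: any chat model Groq serves will do
        messages=messages,
    )
    reply = response.choices[0].message.content
    print("=============== Response ===============")
    print(reply)
    messages.append({"role": "assistant", "content": reply})

    if "||CALL_FUNCTION||" not in reply:
        break  # the model answered directly, so we are done

    # The marker fields are ||CALL_FUNCTION||query||<expression>||
    expression = reply.split("||")[3]
    # Evaluate the expression with `db` bound to our Mongo wrapper.
    # NOTE: eval'ing model output is only acceptable in a local demo.
    results = eval(expression, {"db": db})
    print("=============== Results from MongoDB ===============")
    pprint.pprint(results)
    messages.append({"role": "user", "content": f"Query results: {results}"})

Feeding the query results back as a user message lets the model decide whether to answer in plain English or issue another query, which is exactly the back-and-forth you see in the transcript below.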
Run the code above. You can save it either as a Python file (.py) or as a Jupyter notebook (.ipynb); the notebook is what I recommend for this tutorial. You should observe an output similar to this:
=============== Response ===============
||CALL_FUNCTION||query||db.find({'exercises.exercise_type':'Strength', 'exercises.exercise_name':'Bench Press'})||
=============== Results from MongoDB ===============
'[]'
=============== Response ===============
It seems that there are no documents in the database that match the query for bench presses. This could be because there are no strength exercises with the name 'Bench Press' in the database. Would you like to try a different query?
=============== Response ===============
Let me try a more general query to see if there are any strength exercises in the database.
||CALL_FUNCTION||query||db.find({'exercises.exercise_type':'Strength'})||
=============== Results from MongoDB ===============
("[{'_id': ObjectId('673fbf45017a8b638253e187'), 'date': '2024-11-18', 'name': "
"'Back and Biceps', 'exercises': [{'name': 'Deadlifts', 'exercise_type': "
"'Strength', 'sets': [{'reps': 5, 'value': 100, 'units': 'kg'}, {'reps': 5, "
"'value': 110, 'units': 'kg'}, {'reps': 3, 'value': 120, 'units': 'kg'}]}, "
"{'name': 'Barbell Rows', 'exercise_type': 'Strength', 'sets': [{'reps': 8, "
"'value': 60, 'units': 'kg'}, {'reps': 8, 'value': 70, 'units': 'kg'}, "
"{'reps': 6, 'value': 80, 'units': 'kg'}]}, {'name': 'Bicep Curls', "
"'exercise_type': 'Strength', 'sets': [{'reps': 12, 'value': 20, 'units': "
You will need a self-hosted or cloud-hosted large language model to run the code block above. I used Groq for this tutorial because it's free to use and extremely fast compared to other services: Groq serves LLMs on its own custom chips, which speeds up response times by 2-3x. Sign up on their website to access the dev console, where you can generate a free API key for this tutorial.
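
Rather than pasting the key into your code, you can keep it in an environment variable; the Groq SDK reads GROQ_API_KEY automatically if you pass no key at all:

import os
from groq import Groq

# Option 1: read the key explicitly from the environment
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])

# Option 2: with GROQ_API_KEY exported, the SDK picks it up by default
groq_client = Groq()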
This implementation was designed to show how LLMs can interact with external systems to gain broader knowledge and generate insights. Using the latest and most powerful models will yield more accurate responses.
Challenges: Getting the LLM's prompt right was genuinely difficult, since small changes can produce drastically different results. This process, called prompt engineering, is more of an art than a science. Large language models are, at heart, language models: very good at predicting what to say next. Their limited reasoning capabilities mean user instructions will not always be followed.
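
One mitigation worth sketching (a hypothetical guardrail, not code from the project) is to validate the model's reply against the expected marker format before executing anything, and re-prompt when it drifts:

import re

# Only replies shaped like ||CALL_FUNCTION||query||db.find(...)|| get through
CALL_PATTERN = re.compile(r"\|\|CALL_FUNCTION\|\|query\|\|(db\.find\(.*\))\|\|")

def extract_query(reply):
    # Return the db.find(...) expression, or None if the format is off
    match = CALL_PATTERN.search(reply)
    return match.group(1) if match else None

reply = "||CALL_FUNCTION||query||db.find({'exercises.exercise_type': 'Strength'})||"
query = extract_query(reply)
if query is None:
    # Re-prompt instead of executing a malformed call
    followup = "Your reply did not match the required format; please retry."
else:
    print(query)  # shaped like a find call, ready for the Mongo wrapper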