Updating documentation

Jairik · Jairik · commit 9b3371e8c870 · 2025-12-08T07:44:29.000-05:00
Still need to finish up contributors, but other than that all documentation should be finalized and ready to go.
diff --git a/README.md b/README.md
@@ -1,61 +1,95 @@
 # Parallel Database Query Processing System
 
-## Overview
-
 This project implements a parallel query-processing engine designed to run SQL-like queries over large, structured datasets using high-performance computing techniques. The goal is to build a lightweight, command-line database system that supports fast data ingestion, indexing, and query execution using a combination of:
 
-- B+ tree based storage
+- B+ tree based indexing
 - Serial, OpenMP, and MPI execution modes
 - Parallel query evaluation and parallel data scanning
 
 The system provides a full pipeline—from data generation to query parsing to parallel execution—making it useful for system administrators who need a fast, embeddable tool for scanning large logs or structured records.
 
-<!-- Should modify this later
-## Expected Components
+---
+
+<!-- How to compile and run your programs (including how to generate the data) (makefile and python file) -->
+## Compilation & Execution
 
-* **`QPESeq.c`** — Serial query processing engine.
-* **`QPEOMP.c`** — Parallel version using **OpenMP**.
-* **`QPEMPI.c`** — Parallel version using **MPI**.
-* **`Proj2.pdf`** — Documentation and runtime analysis.
-* **`db.txt`** — Sample generated dataset.
-* **`sample-queries.txt`** — Sample SQL-like queries.
+Within this project, we have helpers for downloading dependencies, generating synthetic data, and executing the programs.
 
--->
+### Downloading Dependencies
+To ensure that all requirements are satisfied, run the convenience script `requirements.sh`:
 
-## Current File Structure
+```bash
+bash requirements.sh
+```
 
-* **data-generation/** - Schema and scripts for generating log data
-* **engine/** - B+ tree implementation and query functionality (serial/parallel)
-* **include/** - Header files
-* **tokenizer/** - Command tokenizing functionality for main program
-* **docs/** - Various MD documentation on design choices and architectural motivation, as well as reports
-* **QPESeq.c** - Main serial implementation, using the serial engine
-* **QPEOMP.c** - Main parallel implementation, using the OpenMP engine
-* **QPEMPI.c** - Main parallel implementation, using the OpenMPI engine
+### Generating synthetic data
 
-## Compilation & Execution
+Our data generation helper (`generate_commands.py`) draws from a bank of known commands to randomly generate a given number of records. The script takes two parameters: a required number of tuples to generate (50,000 in the example below) and an optional filename to save to.
 
 ```bash
-# Serial execution
-gcc QPESeq.c -o QPESeq
-./QPESeq db.txt sql.txt
+python generate_commands.py 50000
+```
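
Editorial sketch (not part of the commit): the argument handling described above, one required tuple count and one optional output filename, could look roughly like the following. The internals of `generate_commands.py` are not shown in this diff, so `COMMAND_BANK`, `build_parser`, `generate_commands`, and the stdout fallback are all assumptions made for illustration.

```python
import argparse
import random

# Hypothetical stand-in for the script's "bank of known commands".
COMMAND_BANK = ["ls -la", "cat /var/log/syslog", "sudo systemctl restart nginx"]


def build_parser():
    """Build a parser with a required count and an optional output file."""
    parser = argparse.ArgumentParser(description="Generate synthetic command records.")
    # Required positional argument: how many tuples to generate.
    parser.add_argument("count", type=int, help="number of tuples to generate")
    # Optional positional argument: where to save the output.
    # Falling back to stdout when omitted is an assumption, not the script's documented default.
    parser.add_argument("outfile", nargs="?", default=None, help="file to save to")
    return parser


def generate_commands(count, seed=None):
    """Return `count` commands drawn at random from COMMAND_BANK."""
    rng = random.Random(seed)
    return [rng.choice(COMMAND_BANK) for _ in range(count)]


def main(argv=None):
    """Parse arguments, generate the rows, and write them out."""
    args = build_parser().parse_args(argv)
    rows = generate_commands(args.count)
    if args.outfile is None:
        print("\n".join(rows))
    else:
        with open(args.outfile, "w", encoding="utf-8") as handle:
            handle.write("\n".join(rows) + "\n")
    return rows
```

Calling `main(["50000"])` mirrors the `python generate_commands.py 50000` invocation above; passing a second list element would select the output file.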
 
-w/ makefile: make run
+### Executing the QPE Files
 
-# OpenMP version
-gcc -fopenmp QPEOMP.c -o QPEOMP
-./QPEOMP db.txt sql.txt
+To run the full tests, use the predefined targets in the `makefile`.
 
-w/ makefile: make run-omp
+First, **compile** all relevant `.c` files:
 
-# MPI version
-mpicc QPEMPI.c -o QPEMPI
-mpirun -np <num_processes> ./QPEMPI db.txt sql.txt
+```bash
+# Compile and link all relevant files
+make
+```
+
+Once all files are compiled, we can use other makefile helpers to execute each version of our QPE testing functions: 
+
+**Serial**:
 
-w/ makefile: make run-mpi
+```bash
+# Serial version
+make run
 ```
 
-<!--
+**OpenMP Parallel Version**: 
+```bash
+# OpenMP Version
+make run-omp
+```
+
+**MPI Parallel Version**:
+```bash
+# MPI Version
+make run-mpi
+```
+
+Once testing is complete, run `make clean` to remove all build artifacts and object files.
+
+---
+
+## File Structure
+
+project-root/
+├── build/                      # Compiled binaries and test executables
+├── data-generation/            # Scripts for generating synthetic datasets
+├── docs/                       # Project documentation, diagrams, design notes
+├── engine/                     # Core database engine implementation
+│   ├── mpi/                    # MPI-specific build + execution logic
+│   ├── omp/                    # OpenMP-specific build + execution logic
+│   ├── serial/                 # Serial build + execution logic
+│   └── bplus.c                 # B+ Tree data structure implementation
+├── include/                    # Shared headers across modules
+├── tests/                      # Unit test cases + verification utilities
+├── tokenizer/                  # SQL parsing + tokenization logic
+├── connectEngine.c             # Bridge between parser and execution engines
+├── makefile                    # Build rules and compiler instructions
+├── QPEMPI.c                    # Main entry for MPI execution engine
+├── QPEOMP.c                    # Main entry for OpenMP execution engine
+├── QPESeq.c                    # Main entry for serial execution engine
+├── requirements.sh             # Environment + dependency setup script
+└── sample-queries.txt          # Example queries for debugging + validation
+
+---
+
 ## Report & Analysis
 
 See **Proj2.pdf** for:
@@ -64,11 +98,15 @@ See **Proj2.pdf** for:
 * Scalability with increased problem size
 * Optimal thread and process count for performance
 
-## Contributors
+## Ensuring Correctness Without Sacrificing Performance
 
-* *Name A*: Data generation & serial QPE
-* *Name B*: OpenMP implementation
-* *Name C*: MPI implementation & runtime analysis
--->
----
+We verified accuracy through targeted testing and edge-case checks, while profiling and optimizing critical paths to keep execution fast. This continual cycle of testing and refinement ensured the system remained both correct and efficient.
+
+<!-- TODO Update these with finished deliverables -->
+## Contributors
 
+* *JJ McCauley*: Serial engines, makefiles/testing, docs, & QPE testing files
+* *Ian Davis*:
+* *Anthony Czerwinski*: Sample queries & Select parallelizations
+* *Sam Dickerson*: Parser & Insert parallelizations
+* *Logan Kelsch*: Data generation & 
diff --git a/db.csv b/db.csv
diff --git a/docs/bplus.md b/docs/bplus.md
@@ -47,10 +47,10 @@ The engine is structured into three core components:
 1. **bplus.c**
    Implements the B+ tree storage structure itself.
 
-2. **buildEngine-serial.c**
+2. **buildEngine-*.c** (serial, omp, mpi)
    Builds tables and creates B+ tree indexes over chosen attributes.
 
-3. **queryEngine-serial.c**
+3. **executeEngine-*.c** (serial, omp, mpi)
    Executes commands (SELECT, WHERE) and uses the B+ tree for fast lookups.
 
 ### Index Lifecycle
diff --git a/docs/engine.md b/docs/engine.md
@@ -1,6 +1,6 @@
 # Engine Documentation
 
-This document is a comprehensive technical reference for the Serial engine used by the Parallel-Query-Processing-System repository. It describes the B+ tree index implementation, build utilities that load CSV data and construct indexes, and the execute engine that implements SQL-like operations (SELECT, INSERT, DELETE). The goal is to make the codebase easy to understand for contributors who need to maintain, extend, or benchmark the engine.
+This document is a comprehensive technical reference for the engines used by the Parallel-Query-Processing-System repository. It describes the B+ tree index implementation, build utilities that load CSV data and construct indexes, and the execute engines that implement SQL-like operations (SELECT, INSERT, DELETE). While the Serial engine is described in detail, the OpenMP and MPI implementations follow a similar structure.
 
 Table of Contents
 - Section 1 — B+ Tree (structure, public API, internals)
@@ -62,10 +62,11 @@ Design notes and caveats
 
 ## Section 2 — Build Engine
 
-Files: `engine/serial/buildEngine-serial.c`, `include/buildEngine-serial.h`, `engine/recordSchema.c`, `include/recordSchema.h`
+Files: `engine/*/buildEngine-*.c`, `include/buildEngine-*.h`, `engine/recordSchema.c`, `include/recordSchema.h`
 
 Purpose
 - Load CSV data into an in-memory `record **` representation and provide helpers to build B+ tree indexes from those records.
+- Note: Each implementation (Serial, OpenMP, MPI) has its own build engine file (e.g., `buildEngine-serial.c`, `buildEngine-omp.c`, `buildEngine-mpi.c`).
 
 Key functions and behavior
 - `record **getAllRecordsFromFile(const char *filepath, int *num_records)`
@@ -92,10 +93,11 @@ Design notes
 
 ## Section 3 — Execute Engine
 
-Files: `engine/serial/executeEngine-serial.c`, `include/executeEngine-serial.h`
+Files: `engine/*/executeEngine-*.c`, `include/executeEngine-*.h`
 
 Purpose
 - Implements application-level query execution (SELECT / INSERT / DELETE) and query predicate evaluation. Connects in-memory records, persistent CSV storage, and B+ tree indexes.
+- Note: Each implementation (Serial, OpenMP, MPI) has its own execute engine file.
 
 Core types
 - `struct engineS` — engine state with fields for in-memory records, index roots, index metadata, and the CSV datafile path.
diff --git a/docs/fileStructure.md b/docs/fileStructure.md
@@ -14,7 +14,16 @@ Documentation and reports. Used mostly for developer experience and to help expl
 
 ## `/engine`
 
-This serves as the **main powerhouse** of the program. This holds the B+ tree implementation (*bplus-x.c*), utility functions for *building* the different trees that will be used for indexing (*buildEngine-x.c*), and various functions that can be used for executing specific commands (*executeEngine-x.c*). The execute file will hold the specific commands for things such as SELECT, WHERE, INSERT, DELETE, etc. This should be used as a means to abstract the lower-level functionality to be used in the root-level files (*QPEx.c*). For more of an explanation of how this will work, see the **bplus.md** md file in the docs folder.
+This serves as the **main powerhouse** of the program. It contains:
+- `bplus.c`: The core B+ tree implementation.
+- `recordSchema.c`: Schema definitions and helpers.
+- `serial/`, `omp/`, `mpi/`: Subdirectories containing specific implementations for Serial, OpenMP, and MPI execution engines.
+
+Each subdirectory contains:
+- `buildEngine-*.c`: Utility functions for *building* the indexes.
+- `executeEngine-*.c`: Functions for executing specific commands (SELECT, INSERT, DELETE).
+
+This structure abstracts lower-level functionality for use in the root-level files (`QPE*.c`). For more explanation, see `bplus.md` and `engine.md`.
 
 ## `/include`
 
@@ -28,6 +37,6 @@ Any basic tests ran during development to verify the functionality of any utilit
 
 Given a string SQL command, will parse it to determine the actual functionality desired by the user.
 
-## `QPE.c`
+## `QPE*.c`
 
-The *QPEMPI.c*, *QPEOMP.c*, and *QPESeq.c* uses the wrapper functions in the `/engine` directory to perform high-level queries. For now, these files should read in each command in the `sample-queries.txt` file, use the parser to determine which specific functionality is desired, then run it through the engine to get the specific results. This will also perform high-level benchmarking.
+The `QPEMPI.c`, `QPEOMP.c`, and `QPESeq.c` files use the wrapper functions in the `/engine` directory to perform high-level queries. These files read commands from `sample-queries.txt`, use the tokenizer to parse them, and then run them through the appropriate engine (Serial, OpenMP, or MPI) to get results. They also handle high-level benchmarking.
diff --git a/docs/schema.md b/docs/schema.md
@@ -23,7 +23,11 @@
 
 There should be minimal default row indexes, as each one will require a separate B+ tree to be stored in memory. Below are the currently chosen *default* indexes:
 
-- TODO
+- `command_id` (UINT64)
+- `user_id` (INT)
+- `risk_level` (INT)
+- `exit_code` (INT)
+- `sudo_used` (BOOL)
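
Editorial sketch (not part of the commit): one way a generated row could be checked against the default index columns listed above before it is loaded. The column names and type labels come from the list; the helper names, the 32-bit range assumed for `INT`, and the accepted spellings for `BOOL` are assumptions for illustration only.

```python
# Default index columns and their declared types, as listed in the schema doc.
DEFAULT_INDEXES = {
    "command_id": "UINT64",
    "user_id": "INT",
    "risk_level": "INT",
    "exit_code": "INT",
    "sudo_used": "BOOL",
}

# Assumed value ranges: INT is treated here as a 32-bit signed integer.
RANGES = {
    "UINT64": (0, 2**64 - 1),
    "INT": (-(2**31), 2**31 - 1),
}


def parse_indexed_value(column, text):
    """Convert one CSV field to a Python value, enforcing its declared type."""
    kind = DEFAULT_INDEXES[column]
    if kind == "BOOL":
        lowered = text.strip().lower()
        if lowered in ("1", "true"):
            return True
        if lowered in ("0", "false"):
            return False
        raise ValueError(f"{column}: not a boolean: {text!r}")
    value = int(text)
    low, high = RANGES[kind]
    if not low <= value <= high:
        raise ValueError(f"{column}: {value} is outside the {kind} range")
    return value


def parse_indexed_columns(row):
    """Return only the indexed columns of a row (a dict of strings), typed."""
    return {name: parse_indexed_value(name, row[name]) for name in DEFAULT_INDEXES}
```

A row such as `{"command_id": "42", "user_id": "1001", "risk_level": "3", "exit_code": "0", "sudo_used": "true"}` would pass, while a negative `command_id` would raise `ValueError`.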
 
 ## Generation