TL;DR This is a reusable platform (data + wrappers) that allows running tens of thousands of experiments for query optimization using "hints" in just a few minutes on a laptop.
⚡ Key Features:
- Super-fast prototyping of hint-based approaches thanks to look-ups instead of real execution.
- Implemented torch-like `Dataset` and `DataLoader` objects for efficient data handling.
- Prototype (`query_explorer.py`) for experimenting with different query exploration strategies.
This repository is part of the project HERO: Hint-Based Efficient and Reliable Query Optimizer.
Save all necessary data while developing a hint-based optimization approach, so that a single look-up later replaces real execution. This dramatically reduces computation time during algorithm development. We ran ~2 weeks of compute and collected results for ~60K plans.
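The look-up idea can be sketched as follows. This is a minimal illustration with made-up timings and hypothetical names, not the repository's actual API:

```python
from typing import Dict, Tuple

# Hypothetical cache: (benchmark, query, dop, hintset) -> execution time in ms.
# In practice it would be pre-filled from the collected raw data; the two
# timings below are made up purely for illustration.
LOOKUP: Dict[Tuple[str, str, int, int], float] = {
    ("JOB", "1b.sql", 1, 0): 120.0,
    ("JOB", "1b.sql", 1, 42): 95.0,
}

def emulate_execution(benchmark: str, query: str, dop: int, hintset: int) -> float:
    """Replace real query execution with a single dictionary look-up."""
    return LOOKUP[(benchmark, query, dop, hintset)]

print(emulate_execution("JOB", "1b.sql", 1, 42))  # → 95.0
```

An exploration algorithm can then probe thousands of (query, hint set) points per second, since each "execution" is a constant-time dictionary access.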
```shell
python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
pip install -e .
sudo apt-get install -y p7zip-full
7za x src/hbo_bench/data/raw/raw_data.7z -osrc/hbo_bench/data/raw
python3 process_raw_data.py
pytest || [ $? -eq 5 ]  # pytest exit code 5 means "no tests were collected"
```

The `raw_data.7z` archive contains results obtained by running the following pseudocode (all queries were executed sequentially on a free server, with the cache warmed up beforehand):
```python
for dop in [1, 16, 64]:
    for benchmark in ["JOB", "sample_queries", "tpch_10gb"]:
        for query in benchmark_queries:
            for hintset in HINTSETS:
                ...
                # 1. Plan the query and save the result and planning time
                #    using the `EXPLAIN` command (fields `hs_to_explain_plan`,
                #    `hs_to_planning_time`).
                # 2. Execute the query and save the actual cardinalities and
                #    execution time using the `EXPLAIN ANALYZE (format json)`
                #    command. If the plan had already been executed, the results
                #    were reused. To avoid long wait times when there is
                #    significant degradation due to hint sets, a timeout was set
                #    to twice the execution time without any hints. This nuance
                #    requires special handling later because the execution time
                #    becomes effectively unknown.
```

The key object that allows emulation of DBMS operation is `Oracle` (the terminology comes from mathematics and is not connected with @Oracle). Simple examples of its functionality are presented in `example.ipynb`.
Example: How to access the stored data directly:
```python
import json

dop = 1
benchmark_name = "JOB"
with open(f"src/hbo_bench/data/raw/dop{dop}/{benchmark_name}.json", "r") as f:
    data = json.load(f)

query_data = data["1b.sql"]
explain_plan = query_data["hs_to_explain_plan"]["42"]
explain_analyze_plan = query_data["explain_plan_to_explain_analyze_plan"][json.dumps(explain_plan)]
planning_time = query_data["hs_to_planning_time"]["42"]
# Sometimes this is unavailable due to a timeout (T/O); it is guaranteed
# only for the default hint set (0)
execution_time = explain_analyze_plan["Total Runtime"]
```

What is hint-based optimization?
Hint-based query optimization (HBO) is a technique for reducing workload execution time without modifying a single line of the DBMS kernel code. It works by selecting planner hyperparameters (hints) that influence the construction of the query execution plan. Although this approach can greatly speed up query execution, it faces several fundamental challenges. First, there is no universal hint set. Second, the search space is exponential, and the cost of exploring a "point" within it depends on its execution time. As a result, we need to construct and train an intelligent hint advisor to address these challenges.
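To make the exponential blow-up concrete: with the 7 boolean hints used in this platform, every query has 2^7 = 128 candidate hint sets, and each trial costs a full query execution. A back-of-the-envelope sketch (the per-trial latency below is an assumed value, not a measurement):

```python
# With 7 boolean planner hints, every subset of them is a candidate hint set
n_hints = 7
n_hintsets = 2 ** n_hints
print(n_hintsets)  # → 128

# Illustrative only: assuming an average of 2 s per trial, exhaustively
# probing a single query already costs minutes of server time
avg_trial_s = 2.0  # assumed value
print(n_hintsets * avg_trial_s)  # → 256.0
```

Multiplied over hundreds of queries and thousands of algorithm iterations, this is exactly the cost the look-up table removes.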
Why is this platform useful?
The main problem in hint-based optimization is that we don’t know in advance which hint set is best for a query. To find the optimal hint set, we typically need to try all possible combinations. The cost of each trial is the time it takes to execute the query with those hints. This platform helps free us from real execution, allowing us to prototype solutions much faster. In practice, this platform enabled us to develop a well-balanced query explorer for selecting the best hint combination, outperforming state-of-the-art results in both speed and overall performance. For details, see the repository for our paper HERO: New Learned Hint-based Efficient and Reliable Query Optimizer.
How was the data collected?
To experiment quickly with tens of thousands of different hint exploration strategies, we implemented the following approach: for every query from these benchmarks and all possible hint combinations, we saved execution plans and their latencies obtained from OpenGauss DB. This allowed us to replace real query execution with a simple table lookup. To ensure the consistency of the collected data, the server was used exclusively during idle periods, with statistics updates disabled, and the database was pre-warmed before each query execution.
Which data benchmarks were used?
For experimental evaluation, we used two IMDb-based benchmarks: the JOB benchmark consisting of 113 queries and its skewed version, SQ (sample_queries from the repository of Marcus, Ryan, et al. "Bao: Making learned query optimization practical."), with 40 queries. Additionally, we used the TPCH benchmark with 22 queries and a scale factor of 10.
Which hints were used?
The following list of hints was used, controlled by the corresponding global user configuration parameters (GUCs):
```python
HINTS: "List[Hint]" = [
    "Nested Loop",
    "Merge",
    "Hash",
    "Bitmap",
    "Index Only Scan",
    "Index Scan",
    "Seq Scan",
]

GUCS: "List[GUC]" = [
    "nestloop",
    "mergejoin",
    "hashjoin",
    "bitmapscan",
    "indexonlyscan",
    "indexscan",
    "seqscan",
]
```

To enumerate all combinations of these hints, we simply use bit masks corresponding to the order above (the highest bit corresponds to "Nested Loop", and the lowest bit to "Seq Scan").
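A small sketch of this bit-mask encoding. Whether a set bit enables or disables the corresponding GUC is not specified here, so the helper below (illustrative, not the repository's API) only maps a mask to the hints it refers to:

```python
HINTS = [
    "Nested Loop", "Merge", "Hash", "Bitmap",
    "Index Only Scan", "Index Scan", "Seq Scan",
]

def decode_hintset(mask: int) -> list:
    # HINTS[0] ("Nested Loop") corresponds to the highest of the 7 bits,
    # HINTS[-1] ("Seq Scan") to the lowest
    n = len(HINTS)
    return [hint for i, hint in enumerate(HINTS) if mask & (1 << (n - 1 - i))]

# All hint sets are simply the masks 0..127
print(len(range(2 ** len(HINTS))))  # → 128

# e.g. hint set "42" (seen in the stored-data example above) touches:
print(decode_hintset(42))  # → ['Merge', 'Bitmap', 'Index Scan']
```

The integer mask doubles as the key (e.g. `"42"`) under which plans and timings are stored in the raw data.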
What is the configuration of the server used?
All data were obtained on OpenGauss DB on a server with the following settings:
| Parameter | Value |
|---|---|
| max_process_memory | 200GB |
| cstore_buffers | 100GB |
| work_mem | 80GB |
| effective_cache_size | 32GB |
| standby_shared_buffers_fraction | 0.1 |
| shared_buffers | 160GB |
| Parameter | Value |
|---|---|
| Architecture | aarch64 |
| CPU op-mode(s) | 64-bit |
| Byte Order | Little Endian |
| CPU(s) | 128 |
| On-line CPU(s) list | 0-127 |
| Thread(s) per core | 1 |
| Core(s) per socket | 64 |
| Socket(s) | 2 |
| NUMA node(s) | 4 |
| Vendor ID | HiSilicon |
| Model | 0 |
| Model name | Kunpeng-920 |
| Stepping | 0x1 |
| CPU MHz | 2600.000 |
| CPU max MHz | 2600.0000 |
There are two main papers on hint-based query optimization, along with useful links:
and 🔥 our paper: