
REaR is a fast, LLM-free framework for multi-table retrieval that separates semantic relevance from structural joinability. By retrieving relevant tables, expanding with joinable ones, and refining noisy candidates, it consistently improves multi-table QA and Text-to-SQL performance, matching LLM-based methods at much lower cost and latency.

CoRAL-ASU/REaR


REaR: Retrieve, Expand, and Refine for Effective Multi-table Retrieval

Quick Start

  • Create/activate a virtual environment, then install dependencies listed in setup.sh (FAISS, llama-index, PyTorch, etc.).
  • Keep large assets (e.g., combined_database_with_desc.json, FAISS binaries) outside version control but accessible via the paths you pass to these scripts.
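After running setup.sh, a quick sanity check that the heavy dependencies actually resolved can save a confusing traceback later. This is a hedged sketch, not part of the repo; the module names are assumptions based on the packages setup.sh mentions:

```python
import importlib.util

def check_deps(modules):
    """Return {module_name: importable?} without actually importing anything heavy."""
    return {m: importlib.util.find_spec(m) is not None for m in modules}

# Module names assumed from the setup.sh dependency list.
for mod, ok in check_deps(["faiss", "llama_index", "torch"]).items():
    print(f"{mod}: {'ok' if ok else 'MISSING'}")
```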

Project Structure

REaR/
├── .gitignore
├── LICENSE
├── README.md
├── setup.sh
├── imports.py
├── utils.py
├── preprocessing/
│   ├── create_col_store.py
│   ├── create_vector_score.py
│   ├── generate_questions_list.py
│   └── generate_table_descriptions.py
├── retrieve.py
├── expand.py
├── refine.py
├── generate.py
├── evaluate_retrieval.py
└── full_inference.py

Build Artifacts

  • python create_col_store.py --table-repository combined_database_with_desc.json --embedding-model BAAI/bge-base-en-v1.5 --index-out vs_col.bin --metadata-out doc_metadata.json [--normalize]
    Generates the column-level FAISS index + metadata used for joinability.
  • python create_vector_score.py --table-repository combined_database_with_desc.json --output-dir storage_bge --embedding-model BAAI/bge-base-en-v1.5
    Persists a LlamaIndex/FAISS store for base table retrieval.
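As a rough mental model of the column store (a sketch only, not the repo's code): each column gets one embedding plus a metadata record, so a nearest-neighbor hit can be mapped back to its table. Deterministic toy vectors stand in for BAAI/bge-base-en-v1.5, and plain numpy for the FAISS index; all names below are illustrative:

```python
import numpy as np

# Toy repository; the real input is combined_database_with_desc.json.
tables = {
    "orders": ["order_id", "customer_id", "total"],
    "customers": ["customer_id", "name"],
}

def embed(text, dim=16):
    # Deterministic stand-in for the bge-base-en-v1.5 encoder.
    rng = np.random.default_rng(sum(text.encode()))
    return rng.standard_normal(dim).astype("float32")

vectors, metadata = [], []
for table, columns in tables.items():
    for col in columns:
        vectors.append(embed(f"{table}.{col}"))
        metadata.append({"table": table, "column": col})

mat = np.stack(vectors)
mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # --normalize: cosine via dot product

def nearest(query, k=2):
    """Map the top-k column vectors back to (table, column) metadata."""
    q = embed(query)
    q /= np.linalg.norm(q)
    order = np.argsort(-(mat @ q))[:k]
    return [metadata[i] for i in order]
```

In the real build the matrix lives in vs_col.bin and the metadata list in doc_metadata.json; keeping them as a parallel pair is what lets expand.py turn raw vector ids into joinable tables.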

Pipeline Stages

  • python retrieve.py --questions-file merged_questions.json --vector-store-path storage_bge --output-file base.json
    Returns top-k tables per question from the retrieval store.
  • python expand.py --retrieval-file base.json --faiss-index vs_col.bin --faiss-metadata doc_metadata.json --output-file expansion.json
    Finds additional joinable tables via the column index.
  • python refine.py --expansion-file expansion.json --table-repository combined_database_with_desc.json --output-file pruned.json
    Prunes tables with cross-encoder scoring (--top-n, --alpha, --beta adjust behavior).
  • python generate.py --pruned-file pruned.json --table-repository combined_database_with_desc.json --provider gemini --output-file sql.json [--env-file .env]
    Produces SQL with Gemini, OpenAI, or DeepInfra; .env can preload API keys.
  • python evaluate_retrieval.py --predictions-file base.json --ground-truth-file merged_questions.json --output-file eval.json
    Computes recall/precision/F1 for retrieved tables versus ground truth.
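The evaluation arithmetic is per-question set overlap between retrieved and ground-truth tables. This is a hedged sketch of the metrics evaluate_retrieval.py presumably reports, not its actual code:

```python
def table_prf(predicted, gold):
    """Precision/recall/F1 over sets of retrieved vs. ground-truth tables."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Retrieved 3 tables, 2 of them correct; every gold table was found.
print(table_prf(["orders", "customers", "products"], ["orders", "customers"]))
# precision ≈ 0.667, recall = 1.0, F1 ≈ 0.8
```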

End-to-End Runner

  • python full_inference.py --env-file .env --questions-file merged_questions.json --vector-store-path storage_bge --faiss-index vs_col.bin --faiss-metadata doc_metadata.json --table-repository combined_database_with_desc.json --output-file e2e_results.json [--intermediate-dir runs/]
    Executes all four stages sequentially and optionally saves intermediate JSON artifacts.
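The four stages chain naturally: each consumes the previous stage's output. The in-memory toy below (stub logic and illustrative names only, not the repo's API) shows the retrieve → expand → refine data flow that full_inference.py orchestrates before SQL generation:

```python
# Tiny stand-in repository; the real one is combined_database_with_desc.json.
REPO = {
    "orders":    {"columns": ["order_id", "customer_id", "total"]},
    "customers": {"columns": ["customer_id", "name"]},
    "products":  {"columns": ["product_id", "price"]},
}

def retrieve(question, k=1):
    # Stand-in for semantic retrieval: crude keyword overlap with table names.
    scored = sorted(REPO, key=lambda t: -sum(w in t for w in question.lower().split()))
    return scored[:k]

def expand(tables):
    # Stand-in for the column-index joinability search: shared column names.
    out = list(tables)
    for t in tables:
        for u, info in REPO.items():
            if u not in out and set(REPO[t]["columns"]) & set(info["columns"]):
                out.append(u)
    return out

def refine(tables, top_n=2):
    # The real refine stage reranks with cross-encoder scores (--top-n/--alpha/--beta).
    return tables[:top_n]

def pipeline(question):
    return refine(expand(retrieve(question)))

print(pipeline("total orders per customer name"))  # → ['orders', 'customers']
```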

API Credentials

  • Supported providers: gemini, openai, deepinfra.
    Supply keys via CLI (--api-key) or .env file entries such as GOOGLE_API_KEY=..., OPENAI_API_KEY=..., DEEPINFRA_API_KEY=....
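A minimal sketch of how a .env file can feed the provider flags (python-dotenv is the usual real-world choice; the parser and the provider-to-variable mapping below are illustrative assumptions):

```python
import os

PROVIDER_ENV = {  # assumed mapping, mirroring the variable names above
    "gemini": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "deepinfra": "DEEPINFRA_API_KEY",
}

def load_env(path=".env"):
    """Parse KEY=VALUE lines; already-set environment variables win."""
    loaded = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            loaded[key.strip()] = value.strip()
            os.environ.setdefault(key.strip(), value.strip())
    return loaded

def resolve_key(provider, cli_key=None):
    # A --api-key passed on the CLI takes precedence over the environment.
    return cli_key or os.environ.get(PROVIDER_ENV[provider])
```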

License

This project is licensed under a Creative Commons license; see the LICENSE file for details.
