
REaR is a fast, LLM-free framework for multi-table retrieval that separates semantic relevance from structural joinability. By retrieving relevant tables, expanding with joinable ones, and refining noisy candidates, it consistently improves multi-table QA and Text-to-SQL performance, matching LLM-based methods at much lower cost and latency.

CoRAL-ASU/REaR


REaR: Retrieve, Expand, and Refine for Effective Multi-table Retrieval

Quick Start

  • Create/activate a virtual environment, then install dependencies listed in setup.sh (FAISS, llama-index, PyTorch, etc.).
  • Keep large assets (e.g., combined_database_with_desc.json, FAISS binaries) outside version control but accessible via the paths you pass to these scripts.
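After running setup.sh, a quick sanity check that the heavy dependencies actually resolved can save a confusing traceback later. This is a hedged sketch, not part of the repo; the module names are assumptions based on the packages setup.sh mentions:

```python
import importlib.util

def check_deps(modules):
    """Return {module_name: importable?} without actually importing anything heavy."""
    return {m: importlib.util.find_spec(m) is not None for m in modules}

# Module names assumed from the setup.sh dependency list.
for mod, ok in check_deps(["faiss", "llama_index", "torch"]).items():
    print(f"{mod}: {'ok' if ok else 'MISSING'}")
```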

Project Structure

REaR/
├── .gitignore
├── LICENSE
├── README.md
├── setup.sh
├── imports.py
├── utils.py
├── preprocessing/
│   ├── create_col_store.py
│   ├── create_vector_score.py
│   ├── generate_questions_list.py
│   └── generate_table_descriptions.py
├── retrieve.py
├── expand.py
├── refine.py
├── generate.py
├── evaluate_retrieval.py
└── full_inference.py

Build Artifacts

  • python create_col_store.py --table-repository combined_database_with_desc.json --embedding-model BAAI/bge-base-en-v1.5 --index-out vs_col.bin --metadata-out doc_metadata.json [--normalize]
    Generates the column-level FAISS index + metadata used for joinability.
  • python create_vector_score.py --table-repository combined_database_with_desc.json --output-dir storage_bge --embedding-model BAAI/bge-base-en-v1.5
    Persists a LlamaIndex/FAISS store for base table retrieval.
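As a rough mental model of the column store (a sketch only, not the repo's code): each column gets one embedding plus a metadata record, so a nearest-neighbor hit can be mapped back to its table. Deterministic toy vectors stand in for BAAI/bge-base-en-v1.5, and plain numpy for the FAISS index; all names below are illustrative:

```python
import numpy as np

# Toy repository; the real input is combined_database_with_desc.json.
tables = {
    "orders": ["order_id", "customer_id", "total"],
    "customers": ["customer_id", "name"],
}

def embed(text, dim=16):
    # Deterministic stand-in for the bge-base-en-v1.5 encoder.
    rng = np.random.default_rng(sum(text.encode()))
    return rng.standard_normal(dim).astype("float32")

vectors, metadata = [], []
for table, columns in tables.items():
    for col in columns:
        vectors.append(embed(f"{table}.{col}"))
        metadata.append({"table": table, "column": col})

mat = np.stack(vectors)
mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # --normalize: cosine via dot product

def nearest(query, k=2):
    """Map the top-k column vectors back to (table, column) metadata."""
    q = embed(query)
    q /= np.linalg.norm(q)
    order = np.argsort(-(mat @ q))[:k]
    return [metadata[i] for i in order]
```

In the real build the matrix lives in vs_col.bin and the metadata list in doc_metadata.json; keeping them as a parallel pair is what lets expand.py turn raw vector ids into joinable tables.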

Pipeline Stages

  • python retrieve.py --questions-file merged_questions.json --vector-store-path storage_bge --output-file base.json
    Returns top-k tables per question from the retrieval store.
  • python expand.py --retrieval-file base.json --faiss-index vs_col.bin --faiss-metadata doc_metadata.json --output-file expansion.json
    Finds additional joinable tables via the column index.
  • python refine.py --expansion-file expansion.json --table-repository combined_database_with_desc.json --output-file pruned.json
    Prunes tables with cross-encoder scoring (--top-n, --alpha, --beta adjust behavior).
  • python generate.py --pruned-file pruned.json --table-repository combined_database_with_desc.json --provider gemini --output-file sql.json [--env-file .env]
    Produces SQL with Gemini, OpenAI, or DeepInfra; .env can preload API keys.
  • python evaluate_retrieval.py --predictions-file base.json --ground-truth-file merged_questions.json --output-file eval.json
    Computes recall/precision/F1 for retrieved tables versus ground truth.
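The evaluation arithmetic is per-question set overlap between retrieved and ground-truth tables. This is a hedged sketch of the metrics evaluate_retrieval.py presumably reports, not its actual code:

```python
def table_prf(predicted, gold):
    """Precision/recall/F1 over sets of retrieved vs. ground-truth tables."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Retrieved 3 tables, 2 of them correct; every gold table was found.
print(table_prf(["orders", "customers", "products"], ["orders", "customers"]))
# precision ≈ 0.667, recall = 1.0, F1 ≈ 0.8
```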

End-to-End Runner

  • python full_inference.py --env-file .env --questions-file merged_questions.json --vector-store-path storage_bge --faiss-index vs_col.bin --faiss-metadata doc_metadata.json --table-repository combined_database_with_desc.json --output-file e2e_results.json [--intermediate-dir runs/]
    Executes all four stages sequentially and optionally saves intermediate JSON artifacts.
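The four stages chain naturally: each consumes the previous stage's output. The in-memory toy below (stub logic and illustrative names only, not the repo's API) shows the retrieve → expand → refine data flow that full_inference.py orchestrates before SQL generation:

```python
# Tiny stand-in repository; the real one is combined_database_with_desc.json.
REPO = {
    "orders":    {"columns": ["order_id", "customer_id", "total"]},
    "customers": {"columns": ["customer_id", "name"]},
    "products":  {"columns": ["product_id", "price"]},
}

def retrieve(question, k=1):
    # Stand-in for semantic retrieval: crude keyword overlap with table names.
    scored = sorted(REPO, key=lambda t: -sum(w in t for w in question.lower().split()))
    return scored[:k]

def expand(tables):
    # Stand-in for the column-index joinability search: shared column names.
    out = list(tables)
    for t in tables:
        for u, info in REPO.items():
            if u not in out and set(REPO[t]["columns"]) & set(info["columns"]):
                out.append(u)
    return out

def refine(tables, top_n=2):
    # The real refine stage reranks with cross-encoder scores (--top-n/--alpha/--beta).
    return tables[:top_n]

def pipeline(question):
    return refine(expand(retrieve(question)))

print(pipeline("total orders per customer name"))  # → ['orders', 'customers']
```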

API Credentials

  • Supported providers: gemini, openai, deepinfra.
    Supply keys via CLI (--api-key) or .env file entries such as GOOGLE_API_KEY=..., OPENAI_API_KEY=..., DEEPINFRA_API_KEY=....
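A minimal sketch of how a .env file can feed the provider flags (python-dotenv is the usual real-world choice; the parser and the provider-to-variable mapping below are illustrative assumptions):

```python
import os

PROVIDER_ENV = {  # assumed mapping, mirroring the variable names above
    "gemini": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "deepinfra": "DEEPINFRA_API_KEY",
}

def load_env(path=".env"):
    """Parse KEY=VALUE lines; already-set environment variables win."""
    loaded = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            loaded[key.strip()] = value.strip()
            os.environ.setdefault(key.strip(), value.strip())
    return loaded

def resolve_key(provider, cli_key=None):
    # A --api-key passed on the CLI takes precedence over the environment.
    return cli_key or os.environ.get(PROVIDER_ENV[provider])
```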

License

This project is licensed under a Creative Commons license; see the LICENSE file for details.
