- Create/activate a virtual environment, then install the dependencies listed in `setup.sh` (FAISS, llama-index, PyTorch, etc.).
- Keep large assets (e.g., `combined_database_with_desc.json`, FAISS binaries) outside version control but accessible via the paths you pass to these scripts.
```
REaR/
├── .gitignore
├── LICENSE
├── README.md
├── setup.sh
├── imports.py
├── utils.py
├── preprocessing/
│   ├── create_col_store.py
│   ├── create_vector_score.py
│   ├── generate_questions_list.py
│   └── generate_table_descriptions.py
├── retrieve.py
├── expand.py
├── refine.py
├── generate.py
├── evaluate_retrieval.py
└── full_inference.py
```
```bash
python create_col_store.py --table-repository combined_database_with_desc.json --embedding-model BAAI/bge-base-en-v1.5 --index-out vs_col.bin --metadata-out doc_metadata.json [--normalize]
```

Generates the column-level FAISS index + metadata used for joinability.

```bash
python create_vector_score.py --table-repository combined_database_with_desc.json --output-dir storage_bge --embedding-model BAAI/bge-base-en-v1.5
```
Persists a LlamaIndex/FAISS store for base table retrieval.
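Conceptually, the store maps each table description to a normalized embedding, and retrieval becomes a nearest-neighbor search. A minimal numpy-only sketch of that idea, with the BGE model and FAISS index replaced by a toy bag-of-words "embedding" (all names here are illustrative, not the repo's API):

```python
import numpy as np

def build_vocab(texts):
    # Fixed word -> dimension mapping; a stand-in for a real embedding model.
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def embed(text, vocab):
    # Toy bag-of-words vector; the real store would use BAAI/bge-base-en-v1.5.
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    return v

def build_store(table_texts, vocab):
    # L2-normalize so inner product equals cosine similarity,
    # mirroring a FAISS inner-product index over normalized vectors.
    vecs = np.stack([embed(t, vocab) for t in table_texts])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def top_k(store, query_vec, k=3):
    # Rank tables by cosine similarity to the query, best first.
    q = query_vec / np.linalg.norm(query_vec)
    return [int(i) for i in np.argsort(-(store @ q))[:k]]

tables = ["orders order_id customer_id total",
          "customers customer_id name email",
          "products product_id price"]
vocab = build_vocab(tables)
store = build_store(tables, vocab)
print(top_k(store, embed("customer_id email", vocab), k=2))  # → [1, 0]
```

The query shares two terms with the `customers` table and one with `orders`, so those rank first and second.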
```bash
python retrieve.py --questions-file merged_questions.json --vector-store-path storage_bge --output-file base.json
```

Returns the top-k tables per question from the retrieval store.

```bash
python expand.py --retrieval-file base.json --faiss-index vs_col.bin --faiss-metadata doc_metadata.json --output-file expansion.json
```

Finds additional joinable tables via the column index.

```bash
python refine.py --expansion-file expansion.json --table-repository combined_database_with_desc.json --output-file pruned.json
```

Prunes tables with cross-encoder scoring (`--top-n`, `--alpha`, and `--beta` adjust behavior).

```bash
python generate.py --pruned-file pruned.json --table-repository combined_database_with_desc.json --provider gemini --output-file sql.json [--env-file .env]
```

Produces SQL with Gemini, OpenAI, or DeepInfra; `.env` can preload API keys.

```bash
python evaluate_retrieval.py --predictions-file base.json --ground-truth-file merged_questions.json --output-file eval.json
```
Computes recall/precision/F1 for retrieved tables versus ground truth.
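The metrics themselves are set overlaps between predicted and gold table sets. A sketch of one plausible macro-averaged implementation (the exact aggregation in `evaluate_retrieval.py` may differ):

```python
def retrieval_metrics(predictions, ground_truth):
    """Macro-averaged precision/recall/F1 of per-question table sets."""
    p_tot = r_tot = f_tot = 0.0
    for qid, gold in ground_truth.items():
        pred = set(predictions.get(qid, []))
        gold = set(gold)
        tp = len(pred & gold)                      # correctly retrieved tables
        p = tp / len(pred) if pred else 0.0        # precision for this question
        r = tp / len(gold) if gold else 0.0        # recall for this question
        f = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean
        p_tot += p; r_tot += r; f_tot += f
    n = len(ground_truth)
    return {"precision": p_tot / n, "recall": r_tot / n, "f1": f_tot / n}

preds = {"q1": ["orders", "customers"], "q2": ["products"]}
gold  = {"q1": ["customers"], "q2": ["products", "inventory"]}
print(retrieval_metrics(preds, gold))  # precision 0.75, recall 0.75
```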
```bash
python full_inference.py --env-file .env --questions-file merged_questions.json --vector-store-path storage_bge --faiss-index vs_col.bin --faiss-metadata doc_metadata.json --table-repository combined_database_with_desc.json --output-file e2e_results.json [--intermediate-dir runs/]
```
Executes all four stages sequentially and optionally saves intermediate JSON artefacts.
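The chaining can be pictured with stub stages standing in for the four scripts: each stage consumes the previous stage's JSON artefact and writes its own. The stage logic below is purely illustrative; only the artefact names match the commands above.

```python
import json
import pathlib
import tempfile

# Stubs standing in for retrieve.py / expand.py / refine.py / generate.py.
def retrieve(questions):
    return {q: ["orders"] for q in questions}

def expand(base):
    return {q: tables + ["customers"] for q, tables in base.items()}

def refine(expanded, top_n=1):
    return {q: tables[:top_n] for q, tables in expanded.items()}

def generate(pruned):
    return {q: f"SELECT * FROM {tables[0]}" for q, tables in pruned.items()}

def run_pipeline(questions, out_dir):
    """Run all four stages, saving each intermediate JSON artefact."""
    out = pathlib.Path(out_dir)
    base = retrieve(questions)
    (out / "base.json").write_text(json.dumps(base))
    expanded = expand(base)
    (out / "expansion.json").write_text(json.dumps(expanded))
    pruned = refine(expanded)
    (out / "pruned.json").write_text(json.dumps(pruned))
    sql = generate(pruned)
    (out / "sql.json").write_text(json.dumps(sql))
    return sql

with tempfile.TemporaryDirectory() as d:
    print(run_pipeline(["How many orders?"], d))
```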
- Supported providers: `gemini`, `openai`, `deepinfra`.
- Supply keys via the CLI (`--api-key`) or `.env` file entries such as `GOOGLE_API_KEY=...`, `OPENAI_API_KEY=...`, `DEEPINFRA_API_KEY=...`.
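For example, a `.env` file with placeholder values (substitute your real keys) looks like:

```
GOOGLE_API_KEY=...
OPENAI_API_KEY=...
DEEPINFRA_API_KEY=...
```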
This project is licensed under a Creative Commons license; see the LICENSE file for details.