An automated, reproducible pipeline to build Sefaria exports from a MongoDB dump using the official Sefaria-Project exporter, and publish the resulting archives as GitHub Releases.
This repository is a collection of small, composable Bash and Python scripts that:
- Prepare a build environment (tools, Python, MongoDB Database Tools)
- Download a small sample MongoDB dump for quick end-to-end runs
- Clone the upstream Sefaria-Project repository and install its dependencies
- Restore the database, run the exporters, verify results
- Package, post-process, and split the archives
- Optionally create a GitHub Release and upload the generated assets
- Top-level scripts 01_... to 21_... implement each step in the pipeline and are designed to be run sequentially.
- Supporting Python utilities:
  - configure_local_settings.py
  - ensure_history_collection.py
  - run_exports.py
  - check_export_module.py
- GitHub Actions workflow: .github/workflows/release.yml for CI-driven builds and releases.
You can run the pipeline on Linux or macOS. The GitHub Actions workflow shows a fully automated reference run. For a local run, install or ensure access to:
- Bash and coreutils
- Python 3.9 (to mirror CI) with pip
- Git, curl, unzip, jq
- MongoDB Database Tools (for mongorestore)
- A running MongoDB instance on localhost:27017
  - Quick start with Docker: docker run --rm -p 27017:27017 --name mongo mongo:7
The scripts will attempt to install/prepare some tools automatically, but having the above ready smooths the process.
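Before starting a local run, it can help to confirm that the tools listed above are on PATH. The following is a minimal sketch, not part of this repository: the function name `missing_tools` is hypothetical, and the executable names `python3` and `pip` are assumptions about how Python 3.9 and pip are exposed on your system.

```python
import shutil

# Tools named in the prerequisites list. "python3" and "pip" are assumed
# executable names; adjust them if your system uses different ones.
REQUIRED_TOOLS = ["bash", "python3", "pip", "git", "curl", "unzip", "jq", "mongorestore"]


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def report(tools=REQUIRED_TOOLS):
    """Print a one-line summary of what is missing, if anything."""
    missing = missing_tools(tools)
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All prerequisite tools were found on PATH.")
```

Calling `report()` prints the tools that still need to be installed; scripts 02 and 03 may be able to install some of them for you.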
The scripts are designed to be executed in order. A minimal local end-to-end run using the small sample dump looks like this:
- Compute a timestamp used for naming artifacts
bash 01_compute_timestamp.sh
- Install base tools (curl, jq, unzip, etc.)
bash 02_install_base_tools.sh
- Install MongoDB Database Tools (mongorestore)
bash 03_install_mongo_tools.sh
- Download a small MongoDB dump suitable for quick tests
bash 04_download_small_dump.sh
- Clone the upstream Sefaria codebase
bash 05_clone_sefaria_project.sh
- Install build dependencies and Python requirements
bash 06_install_build_deps.sh
bash 07_pip_install_requirements.sh
- Fallback build for Google RE2 (only if needed by your environment)
bash 08_fallback_built_google_re2.sh
- Prepare local project settings and export directories
bash 09_create_exports_dir.sh
bash 10_create_local_settings.sh
- Ensure MongoDB is up, then restore the sample dump
bash 11_wait_for_mongodb.sh
bash 12_restore_db_from_dump.sh
- Sanity-check exporter module, run exports, verify outputs
bash 13_check_export_module.sh
bash 14_run_exports.sh
bash 15_verify_exports.sh
- (Optional) Drop the database to free space
bash 16_drop_db.sh
- Build and post-process archives
bash 17_build_combined_archive.sh
# Optional content processing helpers:
bash 17a_remove_english_in_exports.sh
bash 17b_flatten_hebrew_in_exports.sh
bash 18_split_archive.sh
- (Optional) Create a GitHub Release and upload assets
bash 19_ensure_gh_cli.sh
bash 20_create_or_update_release.sh
bash 21_upload_release_assets.sh
Notes
- The scripts are idempotent where practical; if something fails, re-running from the last successful step is typically fine.
- By default, scripts assume localhost:27017 for MongoDB. Adjust environment variables as needed if your setup differs.
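Running the numbered scripts in order, and resuming from a given step after a failure, can be sketched as a small driver. This is an illustration only and not a script shipped with this repository: the names `step_key`, `ordered_steps`, and `run_pipeline` are hypothetical, and the sketch assumes every step is a file named with a numeric prefix, an optional letter, an underscore, and a `.sh` extension. It runs every matching script in the chosen range, including the optional ones, so pick `start` and `stop` accordingly.

```python
import re
import subprocess
from pathlib import Path

# Matches names such as "01_compute_timestamp.sh" or "17a_remove_english_in_exports.sh".
STEP_PATTERN = re.compile(r"^(\d+)([a-z]?)_.+\.sh$")


def step_key(name):
    """Return (number, letter) for a step script name, or None if it is not one."""
    match = STEP_PATTERN.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def ordered_steps(names, start=1, stop=None):
    """Sort step scripts by prefix and keep those numbered start..stop inclusive."""
    keyed = [(step_key(name), name) for name in names]
    steps = sorted((key, name) for key, name in keyed if key is not None)
    return [
        name
        for key, name in steps
        if key[0] >= start and (stop is None or key[0] <= stop)
    ]


def run_pipeline(directory=".", start=1, stop=None):
    """Run each selected step with bash, stopping at the first failure."""
    names = [path.name for path in Path(directory).glob("*.sh")]
    for name in ordered_steps(names, start, stop):
        print(f"==> {name}")
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(["bash", name], cwd=directory, check=True)
```

For example, `run_pipeline(start=11, stop=15)` would re-run the restore, export, and verification steps without repeating the setup steps.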
Some scripts accept environment variables to tweak behavior. Common ones include:
- PYTHON_VERSION – Pin a Python version (the CI uses 3.9)
- MONGODB_URI – Override the default MongoDB connection string (e.g., mongodb://localhost:27017)
- GITHUB_TOKEN – Personal Access Token with repo scope, required for release steps when running locally
- RELEASE_TAG / RELEASE_NAME – Override the computed tag/name for releases
Refer to each script for any additional, script-specific variables.
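As an illustration of how these variables might be consumed from a Python helper, here is a minimal sketch. It is not taken from the repository's utilities: `load_settings` and `can_publish` are hypothetical names, and falling back to "3.9" for PYTHON_VERSION is an assumption based on the CI version mentioned above. The MongoDB default mirrors the localhost:27017 address this README describes.

```python
import os

# Default connection string described in this README.
DEFAULT_MONGODB_URI = "mongodb://localhost:27017"


def load_settings(env=None):
    """Collect the documented variables from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return {
        # Assumption: fall back to the version the CI uses.
        "python_version": env.get("PYTHON_VERSION", "3.9"),
        "mongodb_uri": env.get("MONGODB_URI", DEFAULT_MONGODB_URI),
        # None means "not set"; the release steps need a token locally.
        "github_token": env.get("GITHUB_TOKEN"),
        # None means "use the tag/name computed by the pipeline".
        "release_tag": env.get("RELEASE_TAG"),
        "release_name": env.get("RELEASE_NAME"),
    }


def can_publish(settings):
    """Return True when a GitHub token is available for the release steps."""
    return bool(settings["github_token"])
```

Passing a plain dictionary as `env` keeps the function easy to test without touching the real environment.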
The workflow at .github/workflows/release.yml provides a full CI pipeline that:
- Spins up a MongoDB service
- Runs the numbered scripts in sequence
- Packages artifacts
- Creates/updates a release and uploads artifacts
Trigger it manually (workflow_dispatch) or configure schedules/conditions as desired. The workflow expects default permissions or a token with sufficient rights to create releases.
- MongoDB connection errors: ensure MongoDB is listening on localhost:27017 and reachable. If using Docker, check the container logs and port mapping.
- mongorestore not found: re-run 03_install_mongo_tools.sh or install MongoDB Database Tools from MongoDB’s official distribution.
- Python build issues (e.g., re2): run 08_fallback_built_google_re2.sh to build a compatible wheel as a fallback.
- Exporter module not found: run 05_clone_sefaria_project.sh and 07_pip_install_requirements.sh again, then 13_check_export_module.sh.
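When diagnosing connection errors, a quick first check is whether anything accepts TCP connections on the expected port. The sketch below is a hedged illustration and not the implementation of 11_wait_for_mongodb.sh: `wait_for_port` is a hypothetical helper, and the retry count and delay are arbitrary defaults. A successful TCP connection only shows that the port is open; it does not prove that MongoDB is ready to serve queries.

```python
import socket
import time


def wait_for_port(host="localhost", port=27017, attempts=30, delay=1.0,
                  connect=socket.create_connection):
    """Return True once a TCP connection to host:port succeeds, else False.

    `connect` is injectable so the retry logic can be exercised without a
    real server; by default it is socket.create_connection.
    """
    for attempt in range(attempts):
        try:
            # Open and immediately close the connection; we only test reachability.
            connect((host, port), timeout=2.0).close()
            return True
        except OSError:
            # Refused, timed out, or unresolved: wait before the next attempt.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

For example, `wait_for_port(attempts=5)` gives a Docker container a few seconds to start listening before the restore step is attempted.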
This repository focuses on orchestration and reproducibility of Sefaria exports. It does not modify Sefaria content or implement the exporter itself; those come from the upstream Sefaria-Project.
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). See LICENSE for details.