Pack huge numbers of small files into larger, S3-optimized shard objects. DES provides deterministic routing (no database lookups), pluggable compression, and fast retrieval via S3 range-GETs (footer → index → payload). It ships with local/S3/multi-S3 backends, a packer CLI, and an HTTP retriever service.
- Append-only shard format with header/data/index/footer.
- Compression (`zstd`, `lz4`) with smart skipping of already-compressed extensions.
- Optimized S3 retriever: range-GETs for header/footer/index plus payload (or BigFile object when needed).
- Multi-zone S3 routing (`MultiS3ShardRetriever`) based on shard index ranges.
- HTTP retriever with backends: `local`, `s3`, `multi_s3`; Prometheus metrics at `/metrics`.
- In-memory index cache to reduce repeated S3 calls.
- Packer CLI (`des-pack`), Docker images, docker-compose, and K8s job manifests.
- Source reads from local filesystem or S3 URIs (`s3://bucket/key`) with retry/backoff via `S3FileReader`.
- Payloads larger than `DES_BIG_FILE_THRESHOLD_BYTES` (default 10 MiB) are written outside the `.des` body under a sibling `_bigFiles/` directory; the shard stores only metadata (`is_bigfile`, `bigfile_hash`, `bigfile_size`, `meta`).
- Why: keeps shard sizes predictable and avoids inflating `.des` objects with rare, huge files.
- Config: `DES_BIG_FILE_THRESHOLD_BYTES`, `DES_BIGFILES_PREFIX` (defaults via `DESConfig.from_env()`, passed into `ShardWriter`/`ShardReader`/`S3ShardRetriever`).
- Layout: local `.../YYYYMMDD/*.des` with `.../YYYYMMDD/_bigFiles/<sha256>`; S3 `s3://bucket/prefix/*.des` with `s3://bucket/prefix/_bigFiles/<sha256>`.
- Read path: index entries flagged `is_bigfile` are fetched from `_bigFiles/` (local FS or S3) instead of via range-GET payloads; inline entries behave as before. Older shards (with no big files) continue to read normally.
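The write-side decision described above can be sketched in a few lines. This is only an illustration: `write_payload` and the return-dict shape are hypothetical, but the threshold check, SHA-256 naming, and metadata-only index entry mirror the behaviour documented here.

```python
import hashlib
import os

DES_BIG_FILE_THRESHOLD_BYTES = 10 * 1024 * 1024  # default: 10 MiB

def write_payload(payload: bytes, shard_dir: str) -> dict:
    """Decide whether a payload is inlined in the .des body or spilled
    to a sibling _bigFiles/ directory, returning index-entry metadata."""
    if len(payload) <= DES_BIG_FILE_THRESHOLD_BYTES:
        # Small payloads stay inline; the shard stores the bytes directly.
        return {"is_bigfile": False, "size": len(payload)}
    # Big payloads go to _bigFiles/<sha256>; the shard keeps only metadata.
    digest = hashlib.sha256(payload).hexdigest()
    big_dir = os.path.join(shard_dir, "_bigFiles")
    os.makedirs(big_dir, exist_ok=True)
    with open(os.path.join(big_dir, digest), "wb") as f:
        f.write(payload)
    return {"is_bigfile": True, "bigfile_hash": digest,
            "bigfile_size": len(payload)}
```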
- Routing: `des_core.routing.locate_shard` maps `(uid, created_at, n_bits)` to `date_dir`, `shard_index`, `shard_hex`, `object_key`.
- Shard IO: `des_core.shard_io` writes/reads `.des` shards, stores an index + footer, handles compression metadata, and decompresses transparently.
- Packing: `des_core.packer` (local) and `des_core.s3_packer` (upload) build shards from manifests.
- Retrieval: `des_core.retriever` (filesystem), `des_core.s3_retriever` (single S3), `des_core.multi_s3_retriever` (zone fan-out).
- HTTP: `des_core.http_retriever` exposes retrieval over FastAPI.
Read path (S3 backend): locate shard → range-GET footer → range-GET index → payload range → decompress if needed → return bytes.
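The deterministic-routing idea (no database: the shard is derivable from the uid and timestamp alone) can be illustrated with a small sketch. The real `des_core.routing.locate_shard` may differ; the SHA-256-of-uid scheme and the object-key layout here are assumptions for illustration only.

```python
import hashlib
from datetime import datetime

def locate_shard_sketch(uid: str, created_at: datetime, n_bits: int = 8) -> dict:
    """Map (uid, created_at, n_bits) to a date directory and shard index
    without any database lookup: hash the uid and keep the low n_bits."""
    digest = hashlib.sha256(uid.encode("utf-8")).digest()
    shard_index = int.from_bytes(digest, "big") & ((1 << n_bits) - 1)
    date_dir = created_at.strftime("%Y%m%d")
    shard_hex = format(shard_index, "02x")
    return {
        "date_dir": date_dir,
        "shard_index": shard_index,
        "shard_hex": shard_hex,
        "object_key": f"{date_dir}/{shard_hex}.des",
    }
```

Because the mapping is a pure function of `(uid, created_at, n_bits)`, every reader and writer agrees on the shard location with no coordination.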
```bash
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -e ".[compression,s3,dev]"   # installs S3/compression extras and dev tooling
pytest
```
For a runtime-only install without lint/type tooling: `pip install -e ".[compression,s3]"` (or `pip install .`).
From source:
```bash
export DES_BACKEND=local
export DES_BASE_DIR=./example_data/des
export DES_N_BITS=8
uvicorn des_core.http_retriever:app --host 0.0.0.0 --port 8000
```
Via Docker:
```bash
docker build -t des-http:local .
docker run --rm -p 8000:8000 \
  -e DES_BACKEND=local \
  -e DES_BASE_DIR=/data/des \
  -v "$(pwd)"/example_data/des:/data/des:ro \
  des-http:local
```
Key endpoints:
- `GET /files/{uid}?created_at=YYYY-MM-DDTHH:MM:SSZ` → file bytes
- `GET /metrics` → Prometheus metrics
- `GET /health` → simple health check
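A minimal client call against the retrieval endpoint might look like this. The host/port are placeholders; only the URL shape follows the endpoint listed above.

```python
from urllib.parse import quote, urlencode

def file_url(base: str, uid: str, created_at: str) -> str:
    """Build the GET /files/{uid} retrieval URL for an ISO-8601 timestamp."""
    return f"{base}/files/{quote(uid)}?{urlencode({'created_at': created_at})}"

url = file_url("http://localhost:8000", "uid-123", "2024-01-01T00:00:00Z")
# urllib.request.urlopen(url).read() would return the raw file bytes
# once the service is running.
```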
- `DES_BASE_DIR` – directory with `.des` shards.
- `DES_N_BITS` – shard routing bits (default 8).
- `DES_S3_BUCKET` (required)
- `DES_S3_REGION` (optional)
- `DES_S3_ENDPOINT_URL` (optional, for S3-compatible storage)
- `DES_S3_PREFIX` (optional prefix)
- `DES_N_BITS` must match how the shards were written.
DES_ZONES_CONFIG– YAML/JSON zones file (required).
Example zones config:
```yaml
n_bits: 8
zones:
  - name: zone-a
    range: { start: 0, end: 127 }
    s3:
      bucket: des-zone-a
      region_name: us-east-1
      endpoint_url: https://s3.example.com
      prefix: des-a/
  - name: zone-b
    range: { start: 128, end: 255 }
    s3:
      bucket: des-zone-b
      region_name: us-east-1
      endpoint_url: https://s3.example.com
      prefix: des-b/
```
Ranges must not overlap and must be within `[0, 2**n_bits - 1]`.
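The two constraints (in-bounds and non-overlapping ranges) can be checked with a short validator. This is a sketch, not the project's loader; the zone dicts mirror the YAML structure above.

```python
def validate_zones(zones: list[dict], n_bits: int) -> None:
    """Ensure every zone range lies within [0, 2**n_bits - 1] and that
    no two zone ranges overlap."""
    limit = (1 << n_bits) - 1
    spans = sorted((z["range"]["start"], z["range"]["end"], z["name"])
                   for z in zones)
    prev_end = -1
    for start, end, name in spans:
        if not (0 <= start <= end <= limit):
            raise ValueError(f"zone {name}: range out of [0, {limit}]")
        if start <= prev_end:
            raise ValueError(f"zone {name}: overlaps previous range")
        prev_end = end

zones = [
    {"name": "zone-a", "range": {"start": 0, "end": 127}},
    {"name": "zone-b", "range": {"start": 128, "end": 255}},
]
validate_zones(zones, n_bits=8)  # valid: no exception raised
```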
`des-pack` reads a JSON manifest and writes `.des` shards to an output directory.
Manifest example (list of files):
```json
[
  {
    "uid": "uid-123",
    "created_at": "2024-01-01T00:00:00Z",
    "size_bytes": 1024,
    "source_path": "/data/input/file-123.bin"
  },
  {
    "uid": "uid-456",
    "created_at": "2024-01-02T00:00:00Z",
    "size_bytes": 2048,
    "source_path": "/data/input/file-456.bin"
  }
]
```
Run locally:
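A manifest in this shape can be generated from a directory scan. This is a sketch under stated assumptions: deriving `uid` from the file stem and `created_at` from the mtime are illustrative choices, not the project's convention.

```python
import os
from datetime import datetime, timezone

def build_manifest(input_dir: str) -> list[dict]:
    """Produce des-pack manifest entries for every regular file in a directory."""
    entries = []
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            continue
        stat = os.stat(path)
        entries.append({
            # Assumption: uid is the filename without extension.
            "uid": os.path.splitext(name)[0],
            # Assumption: created_at is the file's mtime in UTC.
            "created_at": datetime.fromtimestamp(
                stat.st_mtime, tz=timezone.utc
            ).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "size_bytes": stat.st_size,
            "source_path": path,
        })
    return entries
```

The resulting list can be serialized with `json.dump` and fed to `des-pack --input-json`.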
```bash
des-pack \
  --input-json ./example_data/manifests/to-pack.json \
  --output-dir ./example_data/output \
  --max-shard-size 1073741824 \
  --n-bits 8
```
docker-compose provides two services:
- HTTP retriever: `docker compose up des-http`
- Packer (one-shot job): `docker compose run --rm des-packer`
Mounts:
- `./example_data/input` → `/data/input`
- `./example_data/manifests` → `/data/manifests`
- `./example_data/output` → `/data/output`
Manifests under k8s/:
- `des-http-deployment.yaml` + `des-http-service.yaml` – HTTP retriever.
- `des-packer-job.yaml` – one-off packer job.
- `des-packer-cronjob.yaml` – periodic packer job.
Apply:
```bash
kubectl apply -f k8s/des-http-deployment.yaml
kubectl apply -f k8s/des-http-service.yaml
kubectl apply -f k8s/des-packer-job.yaml   # or des-packer-cronjob.yaml
```
Replace the images with your registry (`image: ghcr.io/your-org/des-http:tag`) and use PVCs instead of `emptyDir` for real data.
Prometheus metrics exposed at `/metrics`:
- `des_retrievals_total{backend,status}`
- `des_retrieval_seconds{backend}`
- `des_s3_range_calls_total{backend,type}`
- `des_s3_source_reads_total{status}`
- `des_s3_source_read_seconds{status}`
- `des_s3_source_bytes_downloaded`
Example scrape config:
```yaml
scrape_configs:
  - job_name: des-http
    static_configs:
      - targets: ["des-http.default.svc.cluster.local:8000"]
```

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e ".[compression,s3,dev]"
pytest --cov=des_core --cov-report=term-missing
ruff check src tests
mypy src tests
```
Dry-run migration stats (Story 3):
```bash
des-stats --db-url "postgresql+psycopg://user:pass@host/db" --table source_files --cutoff "2024-01-01T00:00:00Z"
```
Migration orchestration (Story 4):
- Configure `SourceDatabase`, the packer, and run `MigrationOrchestrator.run_migration_cycle()` to fetch → validate → pack → mark → optional cleanup.
- Source paths can be local or `s3://bucket/key` when `packer.s3_source.enabled=true` (see docs/S3_SOURCE_GUIDE.md).
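The fetch → validate → pack → mark cycle can be sketched as below. All names here (`MigrationCycleSketch`, the callables, the stats keys) are hypothetical stand-ins for the real `SourceDatabase`/packer wiring; only the ordering of steps follows the description above.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class MigrationCycleSketch:
    """Illustrates one fetch → validate → pack → mark migration cycle."""
    fetch: Callable[[], list]           # pending rows from the source DB
    validate: Callable[[dict], bool]    # e.g. source file exists, size matches
    pack: Callable[[list], None]        # hand valid entries to the packer
    mark_done: Callable[[list], None]   # flag migrated rows in the DB
    stats: dict = field(default_factory=lambda: {
        "fetched": 0, "packed": 0, "skipped": 0})

    def run_cycle(self) -> dict:
        rows = self.fetch()
        self.stats["fetched"] += len(rows)
        valid = [r for r in rows if self.validate(r)]
        self.stats["skipped"] += len(rows) - len(valid)
        if valid:
            # Pack first, then mark, so a packing failure leaves rows pending.
            self.pack(valid)
            self.mark_done(valid)
            self.stats["packed"] += len(valid)
        return self.stats
```

Packing before marking means a crash mid-cycle leaves rows pending rather than lost; the real orchestrator presumably makes a similar ordering choice.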
Migration from database (Story 5):
```bash
# Single run
des-migrate --config migration-config.json

# Dry run
des-migrate --config migration-config.json --dry-run

# Continuous
des-migrate --config migration-config.yaml --continuous --interval 3600
```
Migration metrics (Story 6):
- Prometheus metrics: `des_migration_cycles_total{status}`, `des_migration_files_total`, `des_migration_bytes_total`, `des_migration_duration_seconds` (histogram), `des_migration_pending_files`, `des_migration_batch_size`.
- Expose metrics via `prometheus_client.start_http_server(port)` or integrate the default registry into your HTTP app.
- Assets: `examples/grafana-dashboard-des-migration.json`, `examples/alerts-des-migration.yml`.
- No built-in auth for HTTP retriever.
- Multi-S3 routing is read-side only; write replication is manual.
- Packer consumes manifest files; upstream DB integration is not included yet.
Further reading: docs/ARCHITECTURE.md, docs/DEPLOYMENT.md, ROADMAP.md
GitHub Actions runs on pushes and pull requests to `main`/`master`, executing `pytest` with coverage, `mypy` on `src` and `tests`, and `ruff check` to enforce linting.