Skip to content

dariuszduszynski/DES

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

118 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Easy Store (DES)

Pack huge numbers of small files into larger, S3-optimized shard objects. DES gives deterministic routing (no DB), pluggable compression, and fast retrieval using S3 range-GET (footer → index → payload). It ships with local/S3/multi-S3 backends, a packer CLI, and HTTP retriever service.

Key features

  • Append-only shard format with header/data/index/footer.
  • Compression (zstd, lz4) with smart skipping of already-compressed extensions.
  • Optimized S3 retriever: range-GETs for header/footer/index plus payload (or BigFile object when needed).
  • Multi-zone S3 routing (MultiS3ShardRetriever) based on shard index ranges.
  • HTTP retriever with backends: local, s3, multi_s3; Prometheus metrics at /metrics.
  • In-memory index cache to reduce repeated S3 calls.
  • Packer CLI (des-pack), Docker images, docker-compose, and K8s job manifests.
  • Source reads from local filesystem or S3 URIs (s3://bucket/key) with retry/backoff via S3FileReader.

BigFiles support

  • Payloads larger than DES_BIG_FILE_THRESHOLD_BYTES (default 10 MiB) are written outside the .des body under a sibling _bigFiles/ directory; the shard stores only metadata (is_bigfile, bigfile_hash, bigfile_size, meta).
  • Why: keep shard sizes predictable and avoid inflating .des objects with rare, huge files.
  • Config: DES_BIG_FILE_THRESHOLD_BYTES, DES_BIGFILES_PREFIX (defaults via DESConfig.from_env() and passed into ShardWriter/ShardReader/S3ShardRetriever).
  • Layout: local .../YYYYMMDD/*.des with .../YYYYMMDD/_bigFiles/<sha256>; S3 s3://bucket/prefix/*.des with s3://bucket/prefix/_bigFiles/<sha256>.
  • Read path: index entries flagged is_bigfile are fetched from _bigFiles/ (local FS or S3) instead of range-GET payloads; inline entries behave as before. Older shards (no bigfiles) continue to read normally.

High-level architecture

  • Routing: des_core.routing.locate_shard maps (uid, created_at, n_bits) to date_dir, shard_index, shard_hex, object_key.
  • Shard IO: des_core.shard_io writes/reads .des shards, stores an index+footer, handles compression metadata, decompresses transparently.
  • Packing: des_core.packer (local) and des_core.s3_packer (upload) build shards from manifests.
  • Retrieval: des_core.retriever (filesystem), des_core.s3_retriever (single S3), des_core.multi_s3_retriever (zone fan-out).
  • HTTP: des_core.http_retriever exposes retrieval over FastAPI.

Read path (S3 backend): locate shard → range-GET footer → range-GET index → payload range → decompress if needed → return bytes.

Installation (source)

python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -e ".[compression,s3,dev]"  # installs S3/compression extras and dev tooling
pytest

For a runtime-only install without lint/type tooling: pip install -e ".[compression,s3]" (or pip install .).

Run HTTP retriever (local backend)

From source:

export DES_BACKEND=local
export DES_BASE_DIR=./example_data/des
export DES_N_BITS=8

uvicorn des_core.http_retriever:app --host 0.0.0.0 --port 8000

Via Docker:

docker build -t des-http:local .
docker run --rm -p 8000:8000 \
  -e DES_BACKEND=local \
  -e DES_BASE_DIR=/data/des \
  -v "$(pwd)"/example_data/des:/data/des:ro \
  des-http:local

Key endpoints:

  • GET /files/{uid}?created_at=YYYY-MM-DDTHH:MM:SSZ → file bytes
  • GET /metrics → Prometheus metrics
  • GET /health → simple health check

Backend configuration

Local (DES_BACKEND=local)

  • DES_BASE_DIR – directory with .des shards.
  • DES_N_BITS – shard routing bits (default 8).

Single S3 (DES_BACKEND=s3)

  • DES_S3_BUCKET (required)
  • DES_S3_REGION (optional)
  • DES_S3_ENDPOINT_URL (optional, for S3-compatible storage)
  • DES_S3_PREFIX (optional prefix)
  • DES_N_BITS must match how shards were written.

Multi-S3 (DES_BACKEND=multi_s3)

  • DES_ZONES_CONFIG – YAML/JSON zones file (required).

Example zones config:

n_bits: 8
zones:
  - name: zone-a
    range: { start: 0, end: 127 }
    s3:
      bucket: des-zone-a
      region_name: us-east-1
      endpoint_url: https://s3.example.com
      prefix: des-a/
  - name: zone-b
    range: { start: 128, end: 255 }
    s3:
      bucket: des-zone-b
      region_name: us-east-1
      endpoint_url: https://s3.example.com
      prefix: des-b/

Ranges must not overlap and must be within [0, 2**n_bits - 1].

Packer CLI (des-pack)

des-pack reads a JSON manifest and writes .des shards to an output directory.

Manifest example (list of files):

[
  {
    "uid": "uid-123",
    "created_at": "2024-01-01T00:00:00Z",
    "size_bytes": 1024,
    "source_path": "/data/input/file-123.bin"
  },
  {
    "uid": "uid-456",
    "created_at": "2024-01-02T00:00:00Z",
    "size_bytes": 2048,
    "source_path": "/data/input/file-456.bin"
  }
]

Run locally:

des-pack \
  --input-json ./example_data/manifests/to-pack.json \
  --output-dir ./example_data/output \
  --max-shard-size 1073741824 \
  --n-bits 8

Docker & docker-compose

docker-compose provides two services:

  • HTTP retriever:
    docker compose up des-http
  • Packer (one-shot job):
    docker compose run --rm des-packer

Mounts:

  • ./example_data/input/data/input
  • ./example_data/manifests/data/manifests
  • ./example_data/output/data/output

Kubernetes overview

Manifests under k8s/:

  • des-http-deployment.yaml + des-http-service.yaml – HTTP retriever.
  • des-packer-job.yaml – one-off packer job.
  • des-packer-cronjob.yaml – periodic packer job.

Apply:

kubectl apply -f k8s/des-http-deployment.yaml
kubectl apply -f k8s/des-http-service.yaml
kubectl apply -f k8s/des-packer-job.yaml  # or des-packer-cronjob.yaml

Replace images with your registry (image: ghcr.io/your-org/des-http:tag) and use PVCs instead of emptyDir for real data.

Metrics

Prometheus metrics exposed at /metrics:

  • des_retrievals_total{backend,status}
  • des_retrieval_seconds{backend}
  • des_s3_range_calls_total{backend,type}
  • des_s3_source_reads_total{status}
  • des_s3_source_read_seconds{status}
  • des_s3_source_bytes_downloaded

Example scrape config:

scrape_configs:
  - job_name: des-http
    static_configs:
      - targets: ["des-http.default.svc.cluster.local:8000"]

Development & tests

python -m venv .venv
source .venv/bin/activate
pip install -e ".[compression,s3,dev]"
pytest --cov=des_core --cov-report=term-missing
ruff check src tests
mypy src tests

Dry-run migration stats (Story 3):

des-stats --db-url "postgresql+psycopg://user:pass@host/db" --table source_files --cutoff "2024-01-01T00:00:00Z"

Migration orchestration (Story 4):

  • Configure SourceDatabase, packer, and run MigrationOrchestrator.run_migration_cycle() to fetch → validate → pack → mark → optional cleanup.
  • Source paths can be local or s3://bucket/key when packer.s3_source.enabled=true (see docs/S3_SOURCE_GUIDE.md).

Migration from database (Story 5):

# Single run
des-migrate --config migration-config.json

# Dry run
des-migrate --config migration-config.json --dry-run

# Continuous
des-migrate --config migration-config.yaml --continuous --interval 3600

Migration metrics (Story 6):

  • Prometheus metrics: des_migration_cycles_total{status}, des_migration_files_total, des_migration_bytes_total, des_migration_duration_seconds (histogram), des_migration_pending_files, des_migration_batch_size.
  • Expose metrics via prometheus_client.start_http_server(port) or integrate the default registry into your HTTP app.
  • Assets: examples/grafana-dashboard-des-migration.json, examples/alerts-des-migration.yml.

Limitations & roadmap

  • No built-in auth for HTTP retriever.
  • Multi-S3 routing is read-side only; write replication is manual.
  • Packer consumes manifest files; upstream DB integration is not included yet.

Further reading: docs/ARCHITECTURE.md, docs/DEPLOYMENT.md, ROADMAP.md

CI / Quality Gate

GitHub Actions runs on pushes and pull requests to main/master, executing pytest with coverage, mypy on src and tests, and ruff check to enforce linting.

About

Data Easy Storage

Resources

License

Stars

Watchers

Forks

Packages

No packages published