Update design for PR xai-org#286: Reproducible and automatically configure development environments

gauravagerwala · gauravagerwala · commit f62fe14ab79d · 2025-12-07T09:20:18.000-08:00
diff --git a/.exp/design-workflow-1-grok-1-inference-and-sampling.md b/.exp/design-workflow-1-grok-1-inference-and-sampling.md
@@ -4,10 +4,12 @@
 
 The \"Grok-1 Inference and Sampling\" workflow provides the machinery to load the Grok-1 model's 314 billion parameters from a checkpoint, initialize the decoder-only transformer architecture with Mixture-of-Experts (MoE) layers and Grouped Query Attention (GQA), set up distributed sharding across GPUs using JAX meshes and PJIT, tokenize prompts with SentencePiece, and generate text autoregressively. Sampling incorporates temperature-controlled softmax, nucleus (top-p) filtering for diversity control, and top-k logging. The design emphasizes correctness for validation, supporting batched multi-request handling via a generator that manages KV caches per request slot, padding for variable lengths, and efficient decode steps post-prefill.
 
-Key inputs: Checkpoint in `./checkpoints/ckpt-0/`, `tokenizer.model`, GPU cluster, prompts as `Request` objects (prompt str, temperature float, nucleus_p float, rng_seed int, max_len int).  
+Key inputs: Checkpoint in `./checkpoints/ckpt-0/` (automated download and symlink via `just download-weights` in Justfile), `tokenizer.model`, GPU cluster, prompts as `Request` objects (prompt str, temperature float, nucleus_p float, rng_seed int, max_len int).  
 Outputs: Generated text strings.  
-Entry points: `run.py` for test run, or `InferenceRunner().run()` generator for streaming requests.  
-Relevant files: `run.py`, `runners.py`, `model.py`, `checkpoint.py`, `tokenizer.model`.
+Entry points: `just test` (runs `run.py`) for test run, or `runners.InferenceRunner().run()` generator for streaming requests. Use `nix develop` from `flake.nix` for dev environment setup including deps and tools.  
+Relevant files (core): `run.py`, `runners.py`, `model.py`, `checkpoint.py`, `tokenizer.model`.
+
+Development and setup files: `Justfile` (tasks for download and test), `flake.nix` (Nix dev shell), `.envrc` (direnv integration), `.env.public` (magnet link), `requirements.txt` (Python deps), `.github/hooks/pre-commit` (ruff pre-commit), `.github/workflows/test.yml` (CI linting).
 
 The workflow orchestrates model loading, compilation of sharded compute functions, prompt processing (prefill KV cache while sampling first token), and iterative single-token generation using cached attention keys/values, until max length or EOS.
 
@@ -45,25 +47,28 @@ The workflow orchestrates model loading, compilation of sharded compute function
 
 ```mermaid
 sequenceDiagram
+    participant Setup as "Dev Setup (Addition)"
     participant User
     participant RunPy as run.py
     participant IR as InferenceRunner
     participant MR as ModelRunner
     participant Model as model.py
     participant Checkpoint as checkpoint.py
     participant JAX as JAX Runtime
-    User->>RunPy: Execute main()
+    Setup->>User: nix develop<br/>direnv allow<br/>just download-weights<br/>(reproducible env, deps install, checkpoint torrent download & symlink)
+    User->>RunPy: Execute main()<br/>or just test
     RunPy->>IR: Create with config, MR, paths, meshes
     IR->>MR: initialize(dummy_data, meshes)
     MR->>Model: model.initialize(), fprop_dtype=bf16
     Note over MR,JAX: Calculate batch sizes, create mesh (data, model axes)
     MR->>MR: hk.transform forward/logits_fn with pjit sharding
     MR->>Checkpoint: load_or_init -> restore(shapes, mesh, sharding)
     Checkpoint->>MR: Sharded params (TrainingState)
-    IR->>IR: Load tokenizer, compile pjit funcs (sample_step, prefill_memory, new_memory) with shardings
+    IR->>IR: Load tokenizer<br/>compile pjit funcs (sample_step, prefill_memory, new_memory)<br/>with shardings
     IR->>IR: Precompile with dummy prompts for pad_sizes
     RunPy->>IR: gen = run()  // generator setup with initial memory, settings, etc.
 ```
+Note: The new "Dev Setup" participant and its steps reflect the PR #286 additions for environment and data preparation. The core sequence is unchanged.
 
 ## Inference and Sampling Sequence
 
@@ -104,6 +109,41 @@ sequenceDiagram
     end
 ```
 
+## Development Environment and Setup Sequence
+
+PR #286 adds infrastructure for reproducible dev environments and automated setup, streamlining preparation for this workflow.
+
+### Setup Sequence
+
+```mermaid
+sequenceDiagram
+    participant Dev as Developer
+    participant Nix as Nix Flake
+    participant Direnv as Direnv (.envrc)
+    participant Env as .env.public
+    participant Just as Justfile
+    participant Transmission as Transmission CLI
+    participant Checkpoints as checkpoints/
+    participant GitHooks as Git Hooks
+    participant Ruff as Ruff Linter
+    participant Python as Python Venv
+    Dev->>Nix: nix develop or direnv allow
+    Nix->>Direnv: source .envrc (use flake, python layout)
+    Direnv->>Env: load GROK_MAGNET_LINK
+    Nix->>Just: install just
+    Nix->>Transmission: install transmission
+    Nix->>Ruff: install ruff
+    Nix->>Python: create .venv, pip install requirements.txt
+    Nix->>GitHooks: git config core.hooksPath .github/hooks
+    GitHooks->>Ruff: pre-commit runs ruff check
+    Dev->>Just: just download-weights
+    Just->>Transmission: transmission-cli --download-dir checkpoints $GROK_MAGNET_LINK
+    Transmission->>Checkpoints: download grok-1/ckpt-0/
+    Just->>Checkpoints: ln -s grok-1/ckpt-0 ckpt-0
+    Dev->>Just: just test (runs python run.py)
+    Note over Dev,Ruff: GitHub CI test.yml runs ruff on PRs/push
+```
+
 ## Sharding and Distributed Execution
 
 - **Mesh Configuration**: `make_mesh(local=(data_replicas, model_par), between_hosts=(data_hosts, model_hosts))` creates hybrid mesh for SPMD parallelism. E.g., local 1x8 shards model across 8 GPUs.
@@ -127,6 +167,13 @@ sequenceDiagram
 - **Error/Edge Cases**: Assumes sufficient memory/GPUs; handles long contexts by left-truncation/padding. No built-in EOS handling (relies on max_len or app logic). Quantized weights require custom unpickling.
 - **Performance Notes**: MoE router/experts use JAX vmap/shard_map (serial per-token, inefficient for prod). Focus on correctness/single-host validation.
 - **Extensibility**: Modular Haiku design allows custom configs/modules. Generator interface suits serving multiple prompts concurrently.
-- **Dependencies & Setup**: `requirements.txt` (jax[cuda12_pip], haiku, etc.). Download ckpt via torrent/HF, place in checkpoints/.
+- **Dependencies & Setup**:
+  - Python dependencies: `requirements.txt` (jax[cuda12_pip], haiku, sentencepiece, numpy, etc.).
+  - Reproducible dev environment: `flake.nix` provides a `nix develop` shell that auto-creates `.venv`, installs `requirements.txt` via pip, supplies the tools (just, transmission for torrents, ruff linter), and sets git `core.hooksPath` to `.github/hooks` for pre-commit ruff checks.
+  - Direnv: `.envrc` automatically activates the Nix flake and Python layout, and loads `.env.public`, which contains `GROK_MAGNET_LINK`.
+  - Checkpoint download: weights for the ~314B-parameter model via torrent (magnet URI) or Hugging Face into `./checkpoints/`; place or symlink them as `ckpt-0`. Automated with `just download-weights` (uses transmission-cli, creates the directory and symlink).
+  - Testing: `just test` runs `python run.py`; running `python run.py` directly also works.
+  - Quality control: Ruff linting in local pre-commit and GitHub Actions CI (`.github/workflows/test.yml`) on PRs and pushes to main.
+  - Concerns: The large download was not tested in the PR because of its size; it requires a stable internet connection and sufficient disk space.
 
 This document captures the high-level design, derived from code analysis.
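
The checkpoint steps listed under "Dependencies & Setup" (torrent download into `checkpoints/`, then a `ckpt-0` symlink) can be sketched in Python as below. This is a minimal illustration, not the Justfile's actual recipe: the helper names `download_command` and `link_checkpoint` are hypothetical, and only the `transmission-cli --download-dir` invocation and the `grok-1/ckpt-0` layout are taken from the design doc.

```python
import os
from pathlib import Path


def download_command(magnet_link: str, download_dir: str = "checkpoints") -> list[str]:
    """Build the transmission-cli invocation quoted in the setup diagram.

    Hypothetical helper: it only assembles the argument list, it does not run it.
    """
    if not magnet_link.startswith("magnet:"):
        raise ValueError("expected a magnet URI, e.g. the value of GROK_MAGNET_LINK")
    return ["transmission-cli", "--download-dir", download_dir, magnet_link]


def link_checkpoint(checkpoints_dir: str = "checkpoints") -> Path:
    """Create checkpoints/ckpt-0 -> grok-1/ckpt-0 unless it already exists.

    Mirrors `ln -s grok-1/ckpt-0 ckpt-0` run inside the checkpoints directory,
    but is safe to call more than once.
    """
    root = Path(checkpoints_dir)
    target = Path("grok-1") / "ckpt-0"  # relative to the link's own directory
    link = root / "ckpt-0"
    if not (root / target).is_dir():
        raise FileNotFoundError(f"download incomplete: {root / target} not found")
    if link.is_symlink() or link.exists():
        return link  # already linked (or a real directory was placed there)
    link.symlink_to(target, target_is_directory=True)
    return link
```

A caller would pass the result of `download_command(os.environ["GROK_MAGNET_LINK"])` to `subprocess.run` and then call `link_checkpoint()`. Keeping the link relative means the `checkpoints/` directory can be moved without breaking it.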
diff --git a/pr-analysis-286.md b/pr-analysis-286.md
@@ -0,0 +1,80 @@
+# PR #286: Workflow Design Impact Analysis
+
+## Affected Workflows
+- **Grok-1 Inference and Sampling** (Workflow 1): This workflow is affected by the PR's changes to development setup and checkpoint acquisition. The PR description and changed files show the addition of Nix-based reproducible environments (flake.nix, .envrc), task automation for downloading model weights via torrent (Justfile, .env.public, transmission), testing (`just test`, which runs the run.py entry point), and linting enforcement (pre-commit, test.yml). These align with and expand the design doc's "Dependencies & Setup" section, which mentions torrent/HF download to checkpoints/ckpt-0/, a key input. Core runtime flows are unchanged. [PR #286](https://github.com/xai-org/grok-1/pull/286)
+
+Workflows 2 (Model Loading) and 3 (Forward Pass) are unaffected: their docs contain no setup references, and the PR does not change their core files or logic.
+
+## Workflow 1 Analysis
+### Summary of design changes
+Specific aspects affected: Prerequisites for workflow execution, including environment configuration and model data preparation. The PR adds a structured dev setup layer before user invocation of run.py.
+
+Implementation: 
+- Deterministic deps via nixpkgs in flake.nix, with shellHook automating venv, pip installs, git hooks setup.
+- Auto-activation via direnv (.envrc loading .env.public magnet).
+- Tasks in Justfile: download-weights (torrent download + symlink), test (run.py).
+- Quality: ruff integration locally and in CI for PR validation.
+
+Benefits: reproducibility across systems, fewer manual steps for the large download and setup, and lint standards enforced before merge. Implications: easier collaboration and onboarding, a potential Nix learning curve, and an untested download task (per the PR's stated concerns).
+
+The design docs have been updated in .exp/design-workflow-1-grok-1-inference-and-sampling.md to reflect these changes, including new/updated diagrams and sections.
+
+### Updated Diagrams Showing Changes
+**Initialization Sequence (with additions)**: The setup-phase participant and steps are new (green); the User to run.py interaction is modified (yellow); nothing is removed (no red).
+
+```mermaid
+sequenceDiagram
+    participant Setup as "Dev Setup (Addition)"
+    participant User
+    participant RunPy as run.py
+    participant IR as InferenceRunner
+    participant MR as ModelRunner
+    participant Model as model.py
+    participant Checkpoint as checkpoint.py
+    participant JAX as JAX Runtime
+    Setup->>User: nix develop<br/>direnv allow<br/>just download-weights<br/>(reproducible env, deps install, checkpoint torrent download & symlink)
+    User->>RunPy: Execute main()<br/>or just test
+    RunPy->>IR: Create with config, MR, paths, meshes
+    IR->>MR: initialize(dummy_data, meshes)
+    MR->>Model: model.initialize(), fprop_dtype=bf16
+    Note over MR,JAX: Calculate batch sizes, create mesh (data, model axes)
+    MR->>MR: hk.transform forward/logits_fn with pjit sharding
+    MR->>Checkpoint: load_or_init -> restore(shapes, mesh, sharding)
+    Checkpoint->>MR: Sharded params (TrainingState)
+    IR->>IR: Load tokenizer<br/>compile pjit funcs (sample_step, prefill_memory, new_memory)<br/>with shardings
+    IR->>IR: Precompile with dummy prompts for pad_sizes
+    RunPy->>IR: gen = run()  // generator setup with initial memory, settings, etc.
+```
+
+**New Setup Sequence Diagram** (additions only, green by nature):
+
+```mermaid
+sequenceDiagram
+    participant Dev as Developer
+    participant Nix as Nix Flake
+    participant Direnv as Direnv (.envrc)
+    participant Env as .env.public
+    participant Just as Justfile
+    participant Transmission as Transmission CLI
+    participant Checkpoints as checkpoints/
+    participant GitHooks as Git Hooks
+    participant Ruff as Ruff Linter
+    participant Python as Python Venv
+    Dev->>Nix: nix develop or direnv allow
+    Nix->>Direnv: source .envrc (use flake, python layout)
+    Direnv->>Env: load GROK_MAGNET_LINK
+    Nix->>Just: install just
+    Nix->>Transmission: install transmission
+    Nix->>Ruff: install ruff
+    Nix->>Python: create .venv, pip install requirements.txt
+    Nix->>GitHooks: git config core.hooksPath .github/hooks
+    GitHooks->>Ruff: pre-commit runs ruff check
+    Dev->>Just: just download-weights
+    Just->>Transmission: transmission-cli --download-dir checkpoints $GROK_MAGNET_LINK
+    Transmission->>Checkpoints: download grok-1/ckpt-0/
+    Just->>Checkpoints: ln -s grok-1/ckpt-0 ckpt-0
+    Dev->>Just: just test (runs python run.py)
+    Note over Dev,Ruff: GitHub CI test.yml runs ruff on PRs/push
+```
+
+The Inference and Sampling diagram remains unchanged because the PR does not affect sampling flows.
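
The setup sequence above implies a few preconditions before `just test` can succeed: the tools are on `PATH`, `GROK_MAGNET_LINK` is loaded, and `checkpoints/ckpt-0` exists. A preflight check along those lines might look like the sketch below. It is not part of PR #286; the function name `missing_prerequisites` and its message strings are hypothetical, and the tool names are taken from the diagram.

```python
import shutil
from pathlib import Path

# Tool names as they appear in the setup sequence diagram.
REQUIRED_TOOLS = ("just", "transmission-cli", "ruff")


def missing_prerequisites(env, repo_root=".", which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready.

    `env` is a mapping such as os.environ. `which` is injectable so the
    check can be exercised without the real tools installed.
    """
    problems = []
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"tool not on PATH: {tool}")
    if not env.get("GROK_MAGNET_LINK"):
        problems.append("GROK_MAGNET_LINK is not set")
    if not (Path(repo_root) / "checkpoints" / "ckpt-0").exists():
        problems.append("checkpoints/ckpt-0 is missing")
    return problems
```

Calling `missing_prerequisites(os.environ)` from the repository root inside the `nix develop` shell would be the expected usage; outside that shell the tool checks are the ones most likely to fail.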