An open-source replication of Logical Intelligence's Kona 1.0 — an Energy-Based Model that solves Sudoku through latent reasoning. Kona achieved 96.2% accuracy on hard Sudoku in ~313ms per puzzle, while frontier LLMs (GPT-5.2, Claude Opus 4.5, Gemini 3 Pro, DeepSeek V3.2) managed just 2% combined. This project implements the core idea: learn an energy landscape in representation space where valid solutions have low energy, then "think" at inference time by optimizing a latent variable via Langevin dynamics.
The 96% vs 2% gap between EBMs and LLMs on Sudoku isn't about Sudoku specifically — it exposes a fundamental architectural limitation. LLMs generate solutions token-by-token, committing to each digit as they go, with no way to revise earlier decisions when conflicts emerge later. EBMs instead produce a complete candidate solution and evaluate it against all constraints simultaneously, using gradient information in continuous latent space to move toward valid configurations.
Logical Intelligence (with Yann LeCun as founding chair of the Technical Research Board and Fields Medalist Michael Freedman as Chief of Mathematics) demonstrated this with Kona 1.0 — trained only on partial solutions (50% masked), the model learned the rules and generated full solutions at test time.
This project is an independent replication using a JEPA (Joint Embedding Predictive Architecture) approach, informed by:
- I-JEPA (CVPR 2023) — Self-supervised learning through prediction in representation space
- IRED (2024) — Iterative reasoning with energy diffusion
- JEPA-Reasoner (2025) — Applying JEPA to logical reasoning tasks
Training architecture:

```mermaid
graph LR
    puzzle["Puzzle (9x9)"] --> CE["Context Encoder<br/><i>8-layer Transformer</i>"]
    CE --> z_ctx["z_context"]
    z_ctx --> P["Predictor<br/><i>3-layer MLP</i>"]
    z["z"] --> P
    P --> z_pred["z_pred"]
    solution["Solution (9x9)"] --> TE["Target Encoder<br/><i>EMA of Context Encoder</i>"]
    TE --> z_tgt["z_target"]
    z_tgt --> ZE["z_encoder<br/><i>Linear projection</i>"]
    ZE -- "+ noise" --> z
    z_pred -. "Energy = ‖z_pred − z_target‖²" .-> z_tgt
    z_ctx --> D["Decoder<br/><i>4-layer Transformer</i>"]
    z --> D
    D --> logits["Cell Logits (9x9x9)"]
    logits -. "CE loss + constraint loss<br/>(empty cells only)" .-> solution
    style z_pred fill:#4a9eff,color:#fff
    style z_tgt fill:#4a9eff,color:#fff
    style z_ctx fill:#6bcb77,color:#fff
    style z fill:#ff9800,color:#fff
```
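Below is a minimal sketch of one JEPA training step under this architecture, in PyTorch. The module roles mirror the diagram, but the call signatures, `noise_std`, and the EMA helper are illustrative assumptions rather than the repository's exact API.

```python
import torch
import torch.nn.functional as F

def jepa_training_step(context_encoder, target_encoder, z_encoder,
                       predictor, puzzle, solution, noise_std=0.1):
    """One illustrative JEPA step (signatures are assumptions, not the repo API).

    puzzle, solution: (batch, 81, 10) one-hot grids.
    """
    # Encode the puzzle with the trainable context encoder.
    z_context = context_encoder(puzzle)              # (batch, 81, d_model)

    # Encode the full solution with the EMA target encoder (no gradients flow here).
    with torch.no_grad():
        z_target = target_encoder(solution)          # (batch, 81, d_model)

    # Project z_target into the latent z and add noise, so the predictor
    # cannot simply copy the target and must also use z_context.
    proj = z_encoder(z_target)
    z = proj + noise_std * torch.randn_like(proj)

    # Predict the target representation from (z_context, z).
    z_pred = predictor(z_context, z)

    # Energy term: squared L2 distance (here, mean squared error) in representation space.
    return F.mse_loss(z_pred, z_target), z_context, z

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum):
    """EMA update of the target encoder (momentum scheduled from 0.996 toward 1.0)."""
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)
```

In the full objective this energy term is combined with the VICReg, decode, and constraint terms listed in the loss table below.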
Inference (latent reasoning):

```mermaid
graph LR
    puzzle["Puzzle (9x9)"] --> CE["Context Encoder"]
    CE --> z_ctx["z_context"]
    init["z₀ ~ N(0, I)"] --> loop
    subgraph loop ["Iterate T steps"]
        direction TB
        energy["E(z, z_context)"] --> grad["∇z Energy"]
        grad --> update["z ← z − η∇E + noise"]
    end
    z_ctx --> loop
    loop --> D["Decoder"]
    z_ctx --> D
    D --> sol["9x9 Solution"]
    style loop fill:#fff3e0,stroke:#ff9800
    style sol fill:#4a9eff,color:#fff
```
| Module | Description | Parameters |
|---|---|---|
| Context Encoder | 8-layer Transformer with Sudoku-aware positional encoding (learned row + column + box embeddings). Encodes puzzle to z_context. | d_model=512, 8 heads |
| Target Encoder | Same architecture, processes solutions. Updated via EMA from context encoder (no gradients). | EMA momentum 0.996 -> 1.0 |
| Predictor | 3-layer MLP mapping (z_context, z) -> z_pred. Intentionally limited capacity so it can't ignore z. | hidden=1024 |
| Decoder | 4-layer Transformer decoding (z_context, z) to per-cell digit logits. Hard-enforces given clues. | 8 heads, d_cell=128 |
| Total | | ~50M trainable parameters |
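The "Sudoku-aware positional encoding" amounts to three learned embedding tables, one each for row, column, and 3x3 box, summed into every cell token. A minimal sketch of that idea follows; the class name and layout are illustrative, not necessarily how `encoder.py` implements it:

```python
import torch
import torch.nn as nn

class SudokuPositionalEncoding(nn.Module):
    """Learned row + column + box embeddings, summed per cell (illustrative sketch)."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.row_emb = nn.Embedding(9, d_model)
        self.col_emb = nn.Embedding(9, d_model)
        self.box_emb = nn.Embedding(9, d_model)

        idx = torch.arange(81)
        rows, cols = idx // 9, idx % 9
        boxes = (rows // 3) * 3 + (cols // 3)        # 3x3 box index, 0..8
        self.register_buffer("rows", rows)
        self.register_buffer("cols", cols)
        self.register_buffer("boxes", boxes)

    def forward(self, cell_tokens: torch.Tensor) -> torch.Tensor:
        # cell_tokens: (batch, 81, d_model) embedded cell values
        pos = self.row_emb(self.rows) + self.col_emb(self.cols) + self.box_emb(self.boxes)
        return cell_tokens + pos                      # (81, d_model), broadcast over batch
```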
| Term | Formula | Purpose |
|---|---|---|
| Energy loss | ‖z_pred − z_target‖² | Drives representation learning |
| VICReg loss | Variance + covariance penalty on latent representations | Prevents representation collapse |
| Decode loss | Cross-entropy on empty cells only | Auxiliary decoder supervision |
| Constraint loss | Sudoku row/col/box uniqueness penalty on softmax(logits) | Teaches structural rules explicitly |
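The constraint term is the only place Sudoku's rules enter the loss directly. One common differentiable relaxation, sketched below, checks that the probability mass for each digit sums to 1 across every row, column, and 3x3 box; the exact penalty in `constraints.py` may differ in detail.

```python
import torch

def sudoku_constraint_penalty(logits: torch.Tensor) -> torch.Tensor:
    """Differentiable row/col/box uniqueness penalty (illustrative sketch).

    logits: (batch, 9, 9, 9) per-cell digit logits, indexed (row, col, digit).
    """
    probs = torch.softmax(logits, dim=-1)

    row_sums = probs.sum(dim=2)                         # (batch, 9 rows, 9 digits)
    col_sums = probs.sum(dim=1)                         # (batch, 9 cols, 9 digits)
    boxes = probs.view(-1, 3, 3, 3, 3, 9)               # (batch, box_row, r, box_col, c, digit)
    box_sums = boxes.sum(dim=(2, 4)).reshape(-1, 9, 9)  # (batch, 9 boxes, 9 digits)

    # In a valid solution each digit appears exactly once per unit, so every sum is 1.
    return (
        ((row_sums - 1) ** 2).mean()
        + ((col_sums - 1) ** 2).mean()
        + ((box_sums - 1) ** 2).mean()
    )
```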
At inference time, the model solves puzzles through Langevin dynamics — gradient-based optimization of the latent variable z:
1. Initialize multiple z chains from N(0, I).
2. At each step, decode z to a candidate solution, re-encode it through the target encoder, and compute the self-consistency energy plus the constraint penalty.
3. Update z via noisy gradient descent, with the noise temperature annealed over the run.
4. Select the lowest-energy chain and decode it to a discrete grid.
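A compressed sketch of that loop is below. The attribute names on `model` (`z_shape`, `self_consistency_energy`, `constraint_penalty`) and the hyperparameter defaults are assumptions for illustration; the actual implementation lives in `evaluation/solver.py`.

```python
import torch

def langevin_solve(model, puzzle, n_chains=16, n_steps=100,
                   step_size=0.05, temp_start=1.0, temp_end=0.01):
    """Latent-space Langevin dynamics over z (illustrative sketch).

    puzzle: (1, 81, 10) one-hot grid for a single puzzle.
    """
    with torch.no_grad():
        z_context = model.context_encoder(puzzle)                    # encode clues once
    z_context = z_context.expand(n_chains, *z_context.shape[1:])     # share across chains

    z = torch.randn(n_chains, *model.z_shape, requires_grad=True)    # z0 ~ N(0, I)

    for step in range(n_steps):
        temp = temp_start * (temp_end / temp_start) ** (step / max(n_steps - 1, 1))

        # Self-consistency energy (decode z, re-encode via the target encoder,
        # compare with the prediction) plus the differentiable constraint penalty.
        energy = model.self_consistency_energy(z_context, z) \
            + model.constraint_penalty(z_context, z)                  # (n_chains,)

        (grad,) = torch.autograd.grad(energy.sum(), z)
        with torch.no_grad():
            z -= step_size * grad                                     # gradient step
            z += (2 * step_size * temp) ** 0.5 * torch.randn_like(z)  # annealed noise

    # Keep the lowest-energy chain and decode it to a discrete grid.
    with torch.no_grad():
        best = model.self_consistency_energy(z_context, z).argmin()
        logits = model.decoder(z_context[best:best + 1], z[best:best + 1])
        return logits.argmax(dim=-1) + 1                              # digits 1..9
```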
Trained on 9M puzzles (8M train / 500K val), 20 epochs per run, on an RTX 5090.
| Run | Changes | Cell Acc | Puzzle Acc | Langevin Puzzle Acc (Δ vs. Puzzle Acc) |
|---|---|---|---|---|
| Run 2 | Baseline (bs=512, lr=3e-4) | 97.2% | 74.7% | 70.7% (-4.0) |
| Run 3 | Larger batch (bs=2048, lr=6e-4) | 98.3% | 83.8% | 81.0% (-2.8) |
| Run 4 | + z normalization, constraint loss, self-consistency energy | 97.6% | 82.5% | 83.5% (+1.0) |
Key milestone in Run 4: Langevin dynamics improved puzzle accuracy for the first time (+1.0 points), after consistently hurting results in prior runs. The three structural fixes (L2-normalized z_encoder, constraint loss during training, self-consistency inference energy) aligned the solver with what the model learned.
Note: Kona's 96.2% is on hard puzzles specifically; our validation set includes all difficulty levels.
See training-log.md for detailed run history and lessons learned.
```text
src/ebm/
    dataset/
        loader.py          # Kaggle dataset download and loading
        torch_dataset.py   # PyTorch Dataset with one-hot encoding
        splits.py          # Deterministic train/val/test splitting
    model/
        encoder.py         # SudokuEncoder (Transformer + positional encoding)
        predictor.py       # LatentPredictor (MLP)
        decoder.py         # SudokuDecoder (lightweight Transformer)
        energy.py          # Energy function (L2 distance)
        constraints.py     # Differentiable Sudoku constraint penalty
        jepa.py            # SudokuJEPA orchestrator
    training/
        trainer.py         # Training loop with validation
        losses.py          # VICReg + combined loss computation
        checkpoint.py      # Best-K checkpoint management
        scheduler.py       # LR warmup + cosine decay, EMA scheduling
        metrics.py         # Weights & Biases integration
    evaluation/
        solver.py          # Langevin dynamics inference
        metrics.py         # Cell accuracy, puzzle accuracy, constraint satisfaction
    utils/
        config.py          # Pydantic configuration classes
        device.py          # GPU detection, auto-scaling batch size + LR
    main.py                # CLI entry point
tests/                     # Unit tests (101 tests, 96%+ coverage)
scripts/
    smoke_test.py          # Quick training validation script
    plot_training.py       # Generate training curve comparison plots
```
Requires Python 3.13+ and uv.
```bash
# Clone and install
git clone https://github.com/MVPandey/Enso.git
cd Enso
uv sync

# Configure API keys (copy and fill in)
cp .env.example .env
```

Required `.env` variables:

- `KAGGLE_API_TOKEN` — for downloading the 9M Sudoku dataset
- `WANDB_API_KEY` — for experiment tracking (optional)
- `WANDB_PROJECT`, `WANDB_ENTITY` — W&B project settings (optional)
```bash
# Full training (9M puzzles, 50 epochs, ~20 hours on RTX 5090)
uv run python -m ebm.main train

# Quick experiment (100K puzzles, 20 epochs, ~5 minutes)
uv run python -m ebm.main train --n-samples 100000 --epochs 20

# Custom batch size
uv run python -m ebm.main train --batch-size 256 --epochs 10
```

```bash
# Evaluate a trained checkpoint
uv run python -m ebm.main eval --checkpoint checkpoints/best.pt

# With custom inference parameters
uv run python -m ebm.main eval --checkpoint checkpoints/best.pt \
    --langevin-steps 100 --n-chains 16
```

```bash
# Verify training pipeline works (100 steps on 10K samples)
uv run python scripts/smoke_test.py
```

```bash
# Run tests
uv run pytest

# Lint and format
uv run ruff check --fix .
uv run ruff format .

# Type check
uv run ty check src/
```

Uses the 9 Million Sudoku Puzzles and Solutions dataset from Kaggle. Each puzzle is an 81-character string of digits (0 = empty cell). The dataset is automatically downloaded and cached on first use.
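For reference, here is a minimal sketch of turning one such 81-character string into a one-hot tensor for the encoder; the exact layout used in `torch_dataset.py` may differ:

```python
import torch
import torch.nn.functional as F

def encode_puzzle(puzzle: str) -> torch.Tensor:
    """Convert an 81-character puzzle string into an (81, 10) one-hot tensor.

    Channel 0 marks an empty cell; channels 1-9 mark given digits.
    (Illustrative layout; the repository's encoding may differ.)
    """
    assert len(puzzle) == 81
    digits = torch.tensor([int(c) for c in puzzle], dtype=torch.long)  # (81,)
    return F.one_hot(digits, num_classes=10).float()                   # (81, 10)

# Example: a grid whose only given clue is a 5 in row 1, column 3.
x = encode_puzzle("005" + "0" * 78)
print(x.shape)        # torch.Size([81, 10])
print(x[2].argmax())  # tensor(5)
```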
- Training: 8M puzzles
- Validation: 500K puzzles
- Test: 500K puzzles
Training metrics are logged to Weights & Biases when configured:
- Per-step: total loss, energy loss, VICReg loss, decode loss, constraint loss, learning rate, EMA momentum
- Per-epoch: validation energy, cell accuracy, puzzle accuracy, z-variance (collapse detector)
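The z-variance metric is the early-warning signal for representation collapse (all inputs mapping to nearly the same latent). A minimal sketch of such a check, not necessarily the exact metric logged by `metrics.py`:

```python
import torch

def z_variance(z: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Mean per-dimension standard deviation of latents across the batch.

    z: (batch, ...) latent representations. Values near zero indicate collapse.
    """
    z = z.reshape(z.shape[0], -1)                # flatten all but the batch dimension
    return torch.sqrt(z.var(dim=0) + eps).mean()
```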
Runs are automatically named with timestamps for easy identification.
See training-log.md for detailed run history and findings.
- Kona 1.0 Sudoku Demo — Logical Intelligence's EBM achieving 96.2% on hard Sudoku
- EBM vs. LLMs: 96% vs. 2% Benchmark — Technical blog post
- I-JEPA — Self-supervised learning through prediction in representation space
- IRED — Iterative reasoning with energy diffusion
- JEPA-Reasoner — Applying JEPA to logical reasoning tasks
