A workflow specification for autonomous agents.
Simplex is a specification for describing work that autonomous agents will perform. It captures what needs to be done and how to know when it's done, without prescribing how to do it. Simplex is optimized for high-fidelity interpretation by large language models.
The motivation is practical. When agents work autonomously for extended periods, they need instructions that are complete enough to act on without clarification, yet flexible enough to allow implementation choices. Simplex occupies this middle ground between natural language (too ambiguous) and programming languages (too prescriptive).
Five pillars guide Simplex.
Enforced simplicity. Simplex refuses to support constructs that would allow specifications to become unwieldy. If something cannot be expressed simply, it must be decomposed into smaller pieces first. This is a feature, not a limitation. Complexity that cannot be decomposed is complexity that is not yet understood.
Note: Enforcement happens through tooling, not the specification itself. A Simplex linter flags overly complex constructs (lengthy RULES blocks, excessive inputs, deep nesting) and rejects them. The specification defines what simplicity means; tooling enforces it. See the Linter Specification section.
Syntactic tolerance, semantic precision. Simplex allows for formatting inconsistencies, typos, and notational variations. Agents interpret what you meant, not what you typed. However, the meaning itself must be unambiguous. If an agent would have to guess your intent, the specification is invalid. Sloppy notation is acceptable; vague meaning is not.
Note: Semantic precision is validated through example coverage. If examples do not exercise every branch of the rules, or if examples could be satisfied by multiple conflicting interpretations, the specification is ambiguous and invalid. See Validation Criteria.
Testability. Every function requires examples. These are not illustrations; they are contracts. The examples define what correct output looks like for given inputs. An agent's work is not complete until its output is consistent with the examples.
Completeness. A valid specification must be sufficient to generate working code without further clarification. This is what distinguishes Simplex from a prompting approach. There is no back-and-forth, no "what did you mean by X?" The spec must stand alone.
Implementation autonomy. Simplex describes behavior and constraints, never implementation. Algorithms, data structures, and technology choices belong to agents. If a spec needs persistent storage, it says so. Whether that becomes a graph database, file system, or something else is the agent's concern. The spec neither prescribes nor cares.
Simplex has no formal grammar. There is no parser, no AST, no compilation step. Agents read specifications as semi-structured text and extract meaning directly.
This is intentional. A formal grammar would contradict the principle of syntactic tolerance. It would also add complexity and create failure modes. Since Simplex exists for LLM interpretation, it should be native to how LLMs work.
Instead of grammar rules, Simplex uses landmarks. Landmarks are structural markers that agents recognize and orient around. They are all-caps words followed by a colon. Content under a landmark continues until the next landmark or the end of the document.
Agents scan for landmarks, extract the content associated with each, and build understanding from there. Unrecognized landmarks are ignored rather than rejected, which provides forward compatibility as Simplex evolves.
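To make the interpretation model concrete, here is a minimal sketch of landmark scanning in Python. It is illustrative only, not a reference implementation: agents read specs natively, and the regex and block representation below are assumptions.

```python
import re

# A landmark is an all-caps word (underscores allowed) followed by a colon,
# optionally with an argument on the same line, e.g. "FUNCTION: parse_spec(text) -> Spec".
LANDMARK = re.compile(r"^([A-Z][A-Z_]*):\s*(.*)$")

def scan_landmarks(text: str) -> list[dict]:
    """Split spec text into landmark blocks.

    Content under a landmark continues until the next landmark or the end
    of the document. Unrecognized landmarks are collected rather than
    rejected, mirroring the forward-compatibility rule.
    """
    blocks: list[dict] = []
    for line in text.splitlines():
        match = LANDMARK.match(line.strip())
        if match:
            blocks.append({"landmark": match.group(1),
                           "arg": match.group(2),
                           "content": []})
        elif blocks:
            blocks[-1]["content"].append(line)
    return blocks
```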
Simplex defines sixteen landmarks. Five describe structure. Eleven describe functions.
DATA defines the shape of a type. It names a concept and lists its fields with descriptions and constraints. DATA blocks help agents understand what they are working with, but they are optional. If a function's inputs and outputs are clear from context, explicit DATA definitions are unnecessary.
DATA: PolicyRule
id: string, unique, format "XXX-NNN"
rule: string, the policy statement
severity: critical | warning | info
tags: list of strings
example_violation: string, optional
example_fix: string, optional
CONSTRAINT states an invariant that must hold. Unlike function-specific rules, constraints apply broadly. They describe conditions that should never be violated, regardless of which function is executing.
CONSTRAINT: policy_ids_must_exist
any policy ID referenced anywhere must exist in the registry
FUNCTION introduces a unit of work. It names an operation, lists its inputs, and declares its return type. The function block contains nested landmarks that specify behavior, completion criteria, and test cases.
FUNCTION: filter_policies(policies, ids, tags) → filtered list
BASELINE declares evolutionary context for a function. It establishes what currently exists, what must be preserved during evolution, and what is being evolved. BASELINE is optional; when absent, the function is treated as greenfield (building something new rather than modifying something existing).
BASELINE:
reference: "current session-based authentication"
preserve:
- existing login API contract
- session timeout behavior (30 minutes)
- backward compatibility for v1 clients
evolve:
- authentication mechanism (target: JWT-based)
- token refresh (target: rotation on each use)
BASELINE contains three required fields when present:
- reference: A description or pointer to the prior state being evolved from. Can be prose or a concrete reference (commit hash, version tag).
- preserve: Behaviors, contracts, or properties that must remain unchanged. These become regression tests.
- evolve: Capabilities being added or changed. These represent forward progress.
BASELINE differs from CONSTRAINT. A constraint is a timeless invariant ("all tokens must be signed"). A preserved behavior is relative to a reference point ("the login API response shape was X before, keep it X"). Both can coexist in a specification.
EVAL declares how to measure success for a function's examples. It specifies grading approaches and thresholds, particularly distinguishing regression tests (must always pass) from capability tests (measure progress). EVAL is required when BASELINE is present; optional otherwise.
EVAL:
preserve: pass^3
evolve: pass@5
grading: code
EVAL contains three fields:
- preserve: Threshold for preserved behaviors using pass^k notation (all k trials must pass). Required when BASELINE present.
- evolve: Threshold for evolved capabilities using pass@k notation (at least 1 of k attempts must succeed). Required when BASELINE present.
- grading: How examples are evaluated: code (deterministic comparison, default), model (LLM-as-judge for subjective outputs), or outcome (verify actual state change). Optional, defaults to code.
The threshold notation:
- pass^k means all k trials must pass. Used for regression tests where consistency is required.
- pass@k means at least 1 of k trials must pass. Used for capability tests where progress is being measured.
When BASELINE is absent and EVAL is omitted, examples are treated as traditional test cases with implicit code grading and pass^1 threshold.
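As a sketch of how a test harness might apply these thresholds (the boolean trial-result representation is an assumption):

```python
def passes(notation: str, trials: list[bool]) -> bool:
    """Apply a pass^k or pass@k threshold to k boolean trial outcomes."""
    if "^" in notation:                      # pass^k: all k trials must pass
        k = int(notation.split("^")[1])
        return len(trials) >= k and all(trials[:k])
    k = int(notation.split("@")[1])          # pass@k: at least 1 of k must pass
    return any(trials[:k])

assert passes("pass^3", [True, True, True])                   # regression: consistent
assert not passes("pass^3", [True, False, True])              # one failure breaks pass^k
assert passes("pass@5", [False, False, True, False, False])   # capability: one success suffices
```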
These landmarks appear within a FUNCTION block.
RULES describes what the function does. This is behavioral specification: given these inputs, what should happen? Rules are prose, not pseudocode. They describe outcomes, not steps.
RULES:
- if neither ids nor tags provided, return all policies
- if only ids provided, return policies matching those IDs
- if only tags provided, return policies with at least one matching tag
- if both provided, return union of matches, deduplicated
DONE_WHEN states how an agent knows the work is complete. These are observable criteria, not implementation details. An agent checks these conditions to determine whether to stop or continue.
DONE_WHEN:
- returned list contains exactly the policies matching criteria
- no duplicates in returned list
EXAMPLES provides concrete input-output pairs. These are mandatory. They serve as test cases, ground truth, and clarification of intent. If the rules are ambiguous, the examples disambiguate.
Examples must satisfy coverage criteria:
- Every conditional branch in RULES must have at least one example exercising it
- If a rule uses "or," "optionally," or other alternation, examples must show each path
- If the same set of examples could be produced by multiple conflicting interpretations of the rules, the specification is ambiguous
EXAMPLES:
([p1, p2, p3], none, none) → [p1, p2, p3] # neither ids nor tags
([p1, p2, p3], [p1.id], none) → [p1] # only ids
([p1, p2, p3], none, [python]) → matches with tag # only tags
([p1, p2, p3], [p1.id], [python]) → union # both provided
ERRORS specifies what to do when things go wrong. It maps conditions to responses. This landmark is required. At minimum, it must specify default failure behavior.
A valid minimal ERRORS block:
ERRORS:
- any unhandled condition → fail with descriptive message
A more complete ERRORS block:
ERRORS:
- policy ID not found → fail with "unknown policy ID: {id}"
- invalid YAML syntax → fail with "parse error: {details}"
- any unhandled condition → fail with descriptive message
The requirement ensures agents never silently swallow failures during long-running autonomous execution.
READS declares what shared memory this function consumes. When agents coordinate through shared state, this landmark makes dependencies explicit.
READS:
- SharedMemory.artifacts["registry_path"]
- SharedMemory.status["validation_complete"]
WRITES declares what shared memory this function produces. Together with READS, this allows agents to understand data flow without central orchestration.
WRITES:
- SharedMemory.artifacts["compiled_agents"]
- SharedMemory.status["compilation"] = success | failure
TRIGGERS states conditions under which an agent should pick up this work. In swarm architectures where agents poll for available work, triggers help them decide what to do next.
TRIGGERS:
- SharedMemory.artifacts["registry_path"] exists
- SharedMemory.status["compilation"] != success
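A polling agent might evaluate these triggers against a shared-memory snapshot roughly as follows. The dict-shaped SharedMemory here is an assumption; Simplex leaves the implementation open.

```python
def should_pick_up(shared: dict) -> bool:
    """Check the TRIGGERS above: registry exists and compilation not yet successful."""
    return ("registry_path" in shared.get("artifacts", {})
            and shared.get("status", {}).get("compilation") != "success")
```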
NOT_ALLOWED establishes boundaries. It states what the function must not do, even if it might seem reasonable. Use this sparingly; over-constraining defeats the purpose of implementation autonomy.
NOT_ALLOWED:
- modify source files
- skip invalid entries silently
- generate partial output on error
HANDOFF describes what passes to the next stage. On success, what does the receiving agent get? On failure, what information helps with recovery or escalation?
HANDOFF:
- on success: CompiledArtifacts ready for write_artifacts
- on failure: error message with file and line number
UNCERTAIN declares conditions under which an agent should signal low confidence rather than proceeding silently. This provides a structured way to handle ambiguity in long-running autonomous workflows.
UNCERTAIN:
- if input format doesn't match any documented pattern → log warning and attempt best-effort parse
- if multiple valid interpretations exist → pause and request clarification
- if output would affect more than 100 files → require confirmation before proceeding
When UNCERTAIN is absent, agents proceed with best judgment and do not pause. When present, it defines explicit thresholds for caution.
UNCERTAIN does not violate the completeness pillar. The specification remains complete. It simply acknowledges that real-world inputs may fall outside documented cases and provides guidance for those situations.
DETERMINISM declares variance requirements for a function. Large language models are inherently non-deterministic—given identical inputs, they may produce different but equally valid outputs. When a specification requires reproducible outputs, DETERMINISM provides explicit guidance.
DETERMINISM:
level: strict | structural | semantic
seed: optional seed value or "from_input"
vary: fields allowed to vary
stable: fields that must be identical across runs
The three levels control what variance is acceptable:
- strict: Identical outputs for identical inputs. No variance permitted.
- structural: Same semantic content, but structural details (ordering, formatting) may vary.
- semantic: Outputs must be semantically equivalent but may differ in expression.
When seed: from_input is specified, agents derive seeds deterministically from input values, enabling reproducible outputs without hardcoded seeds. The vary and stable fields provide fine-grained control—a test generator might allow test_names to vary while requiring assertions to be stable.
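One plausible realization of seed: from_input, sketched in Python. The canonical-JSON hashing scheme is an assumption, not part of the specification.

```python
import hashlib
import json
import random

def seed_from_input(inputs: dict) -> int:
    """Derive a reproducible seed from function inputs.

    Canonical JSON keeps the seed stable across runs and processes,
    unlike Python's built-in hash(), which is salted per process.
    """
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return int(hashlib.sha256(canonical.encode()).hexdigest()[:16], 16)

# Identical inputs yield an identical generator, hence reproducible variance.
rng = random.Random(seed_from_input({"policies": ["SEC-001", "SEC-002"]}))
```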
DETERMINISM interacts with EVAL's grading types: grading: code with level: strict requires exact match; grading: model implicitly operates at semantic level. When both landmarks are present, DETERMINISM constrains what variance is acceptable while EVAL specifies how to measure conformance.
Of the sixteen landmarks, five are always required for a valid specification. One additional landmark is conditionally required.
FUNCTION is required because without it there is no work to describe.
RULES is required because without it there is no behavior.
DONE_WHEN is required because without it agents cannot know when to stop.
EXAMPLES is required because without them there is no ground truth.
ERRORS is required because without it agents may fail silently during autonomous execution.
EVAL is conditionally required: when BASELINE is present, EVAL must also be present. Without EVAL, BASELINE's preserve/evolve distinction would have no measurement criteria, violating the completeness pillar.
Everything else is optional. A minimal valid spec consists of a function with rules, completion criteria, examples, and error handling. A minimal valid evolution spec adds BASELINE and EVAL. The optional landmarks add precision when needed, but their absence does not invalidate a spec.
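For illustration, here is a complete minimal specification using only the five required landmarks (the function and its contents are invented for this example):

FUNCTION: dedupe(items) → list without duplicates
RULES:
- if items contains duplicates, return items with only the first occurrence of each kept
- if items contains no duplicates, return items unchanged
DONE_WHEN:
- returned list contains each input item exactly once
- relative order of first occurrences is preserved
EXAMPLES:
([a, b, a, c]) → [a, b, c] # duplicates removed
([a, b, c]) → [a, b, c] # already unique
ERRORS:
- any unhandled condition → fail with descriptive message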
A specification is valid if it passes structural and semantic validation. Structural validation requires that:
- At least one FUNCTION block exists
- Every FUNCTION contains RULES, DONE_WHEN, EXAMPLES, and ERRORS
- DATA types referenced in FUNCTION signatures are defined or obvious from context
- CONSTRAINT blocks state verifiable invariants
- If BASELINE is present, it must contain reference, preserve, and evolve
- If BASELINE is present, EVAL must also be present
- If EVAL is present with BASELINE, it must contain preserve and evolve thresholds
- EVAL.grading must be one of: code, model, outcome
Semantic validation ensures the specification is unambiguous.
Example coverage. Every conditional path in RULES must be exercised by at least one example.
- Count the branches: "if X" is one branch; "if X or Y" is two branches; "if X, else Y" is two branches
- Each branch needs at least one example demonstrating it
- Missing coverage is a validation error
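A rough branch counter for a single RULES item, as a sketch; the textual heuristics are assumptions, and agents may count branches more intelligently.

```python
import re

def count_branches(rule: str) -> int:
    """Estimate branches per the counting rule above.

    "if X" is one branch, each standalone "or" in the condition adds one,
    "else"/"otherwise" adds one, and "optionally" implies two paths.
    """
    text = rule.lower()
    if "optionally" in text:
        return 2
    if not text.startswith("if "):
        return 0
    branches = 1 + len(re.findall(r"\bor\b", text))
    if re.search(r"\belse\b|\botherwise\b", text):
        branches += 1
    return branches

assert count_branches("if neither ids nor tags provided, return all policies") == 1
assert count_branches("if ids or tags provided, filter accordingly") == 2
```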
Interpretation uniqueness. The examples must not be satisfiable by conflicting interpretations.
- If an agent could imagine two different implementations that both pass all examples but would behave differently on some unstated input, the specification is ambiguous
- This is a heuristic, not a formal proof. Agents should flag suspected ambiguity
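To illustrate the heuristic with invented data: suppose a filter spec's only example shows that requesting p1's id together with p1's tag returns [p1]. Both interpretations below pass that example yet diverge on unstated inputs, so one example alone would leave the spec ambiguous.

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    id: str
    tags: list[str] = field(default_factory=list)

def filter_union(policies, ids, tags):
    # Interpretation A: a policy matches on id OR on any tag.
    return [p for p in policies if p.id in ids or set(p.tags) & set(tags)]

def filter_intersection(policies, ids, tags):
    # Interpretation B: a policy matches on id AND on a tag.
    return [p for p in policies if p.id in ids and set(p.tags) & set(tags)]

p1, p2 = Policy("SEC-001", ["python"]), Policy("SEC-002", ["go"])
# Both interpretations satisfy the lone example...
assert filter_union([p1, p2], ["SEC-001"], ["python"]) == [p1]
assert filter_intersection([p1, p2], ["SEC-001"], ["python"]) == [p1]
# ...but they diverge on an unstated input, so one example is not enough.
assert filter_union([p1, p2], [], ["python"]) == [p1]
assert filter_intersection([p1, p2], [], ["python"]) == []
```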
Observable completion. DONE_WHEN criteria must be checkable from outside the function.
- "Internal state is consistent" is not observable—invalid
- "Output contains no duplicates" is observable—valid
- "All items processed" is observable only if processing produces visible evidence—clarify
Behavioral rules. RULES must describe outcomes, not procedures.
- "Loop through items and check each" is procedural—invalid
- "All items matching criteria are included in output" is behavioral—valid
Evolution coverage. When BASELINE is present, every item in preserve and evolve should have at least one corresponding example.
- Preserved behaviors need examples that verify regression protection
- Evolved capabilities need examples that demonstrate the new behavior
- Examples should be classifiable as testing preserved vs. evolved behavior based on what they exercise
Simplex does not have a composition construct. There is no way to formally specify that one function calls another, or that functions must execute in a particular order.
Design Note: This is intentional and represents a research hypothesis. Simplex is designed for autonomous agent workflows where agents operate over extended periods. The hypothesis is that agents can infer task dependencies and decomposition from context, potentially discovering structures the specification author did not anticipate. Prescribed composition would constrain this emergent behavior.
If a spec author wants to suggest relationships between functions, they can:
- Use READS/WRITES to show data dependencies
- Use TRIGGERS to show activation conditions
- Write prose in HANDOFF describing what the next stage expects
But Simplex does not enforce ordering. Agents determine sequencing based on their understanding of the full specification.
This design choice is experimental. Future versions may revisit it based on empirical results from autonomous agent research.
When agents coordinate through shared state (a knowledge graph, a key-value store, a file system), the READS, WRITES, and TRIGGERS landmarks describe interaction patterns.
However, Simplex does not define what shared memory is or how it works. It only provides landmarks for describing contracts against it. The implementation of shared memory is an agent concern.
A specification might say:
READS:
- SharedMemory.knowledge_graph (policy relationships)
WRITES:
- SharedMemory.artifacts["compiled_output"]
Whether SharedMemory is a graph database, a Redis instance, or a directory of JSON files is not the spec's concern. The contract is that something called SharedMemory exists, supports these operations, and agents can rely on it.
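For a typed agent implementation, the assumed contract could be captured as a structural interface. This sketch invents the shape, since Simplex deliberately defines none:

```python
from typing import Any, Protocol

class SharedMemory(Protocol):
    """Minimal shape an agent might assume when a spec references SharedMemory."""
    artifacts: dict[str, Any]   # e.g. artifacts["compiled_output"]
    status: dict[str, str]      # e.g. status["compilation"] = "success"
    knowledge_graph: Any        # opaque; the backing store is the agent's choice
```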
The following specification defines a Simplex linter. The linter enforces the "enforced simplicity" pillar through concrete limits and checks.
DATA: LintResult
valid: boolean
errors: list of LintError
warnings: list of LintWarning
DATA: LintError
location: string, which landmark or line
code: string, error identifier
message: string, human-readable explanation
DATA: LintWarning
location: string
code: string
message: string
FUNCTION: lint_spec(spec_text) → LintResult
RULES:
- parse spec_text to identify all landmarks and their content
- check structural validity: required landmarks present
- check complexity limits (see thresholds below)
- check semantic validity: example coverage, interpretation uniqueness
- check style guidance: behavioral rules, observable completion
- collect all errors and warnings
- spec is valid only if zero errors
DONE_WHEN:
- all landmarks examined
- all checks performed
- LintResult populated with findings
EXAMPLES:
(minimal valid spec) → {valid: true, errors: [], warnings: []}
(missing ERRORS landmark) → {valid: false, errors: [{code: "E001", ...}], warnings: []}
(RULES block over limit) → {valid: false, errors: [{code: "E010", ...}], warnings: []}
(uncovered branch in RULES) → {valid: false, errors: [{code: "E020", ...}], warnings: []}
ERRORS:
- unparseable input → fail with "cannot parse spec: {details}"
- any unhandled condition → fail with descriptive message
FUNCTION: check_complexity(spec) → list of LintError and LintWarning
RULES:
- RULES block exceeds 15 items → error E010 "RULES too complex: {count} items, max 15"
- FUNCTION has more than 6 inputs → error E011 "too many inputs: {count}, max 6"
- EXAMPLES fewer than branch count in RULES → error E020 "insufficient examples: {count} examples for {branches} branches"
- single RULES item exceeds 200 characters → warning W010 "rule may be too complex"
- spec contains more than 10 FUNCTION blocks → warning W011 "consider splitting into multiple specs"
DONE_WHEN:
- all complexity thresholds checked
- errors collected
EXAMPLES:
(spec with 5 RULES items, 3 inputs, 4 examples for 4 branches) → []
(spec with 20 RULES items) → [E010]
(spec with 8 inputs) → [E011]
(spec with 2 examples for 4 branches) → [E020]
ERRORS:
- any unhandled condition → fail with descriptive message
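A sketch of how an implementation might realize these thresholds. The spec representation (.functions, .rules, .inputs) is invented for the sketch, the limits mirror the defaults above, and the E020 example-count check is left to check_coverage.

```python
DEFAULT_LIMITS = {"max_rules": 15, "max_inputs": 6,
                  "max_rule_chars": 200, "max_functions": 10}

def check_complexity(spec, limits=DEFAULT_LIMITS):
    """Apply the complexity thresholds; returns (errors, warnings) as strings."""
    errors, warnings = [], []
    for fn in spec.functions:
        if len(fn.rules) > limits["max_rules"]:
            errors.append(f"E010 RULES too complex: {len(fn.rules)} items, "
                          f"max {limits['max_rules']}")
        if len(fn.inputs) > limits["max_inputs"]:
            errors.append(f"E011 too many inputs: {len(fn.inputs)}, "
                          f"max {limits['max_inputs']}")
        warnings += ["W010 rule may be too complex"
                     for rule in fn.rules if len(rule) > limits["max_rule_chars"]]
    if len(spec.functions) > limits["max_functions"]:
        warnings.append("W011 consider splitting into multiple specs")
    return errors, warnings
```

The `limits` parameter reflects the linter_thresholds constraint below: the numbers are configurable defaults, not fixed values.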
FUNCTION: check_coverage(rules, examples) → list of LintError
RULES:
- identify all conditional branches in rules
- "if X" introduces one branch
- "if X or Y" introduces two branches
- "if X, otherwise Y" introduces two branches
- "optionally" introduces two branches (with and without)
- for each branch, verify at least one example exercises it
- uncovered branch → error E020
DONE_WHEN:
- all branches identified
- all branches checked against examples
- errors collected
EXAMPLES:
(4 branches, 4 examples each covering one) → []
(4 branches, 3 examples covering 3) → [E020 for uncovered branch]
("if X or Y" with only X shown) → [E020 "branch 'Y' not covered by examples"]
ERRORS:
- cannot parse RULES structure → error E021 "cannot identify branches in RULES"
- any unhandled condition → fail with descriptive message
FUNCTION: check_observability(done_when) → list of LintError and LintWarning
RULES:
- each criterion in DONE_WHEN must be externally observable
- references to "internal state" → error E030
- references to "variable" or "data structure" → error E030
- valid: references to outputs, return values, side effects, written files
- valid: references to SharedMemory state
DONE_WHEN:
- all DONE_WHEN criteria examined
- non-observable criteria flagged
EXAMPLES:
("output contains no duplicates") → []
("internal counter reaches zero") → [E030]
("all items processed") → warning W030 "may not be observable without clarification"
ERRORS:
- any unhandled condition → fail with descriptive message
FUNCTION: check_behavioral(rules) → list of LintError
RULES:
- RULES must describe outcomes, not procedures
- procedural indicators: "loop", "iterate", "for each", "step 1", "then"
- procedural indicators: "create a variable", "initialize", "increment"
- finding procedural language → error E040 "RULES should be behavioral, not procedural"
- valid: describes what is true of output
- valid: describes conditions and their corresponding outcomes
DONE_WHEN:
- all RULES items examined
- procedural language flagged
EXAMPLES:
("items matching criteria are included") → []
("loop through items and add matches") → [E040]
("first, parse the input, then validate") → [E040]
ERRORS:
- any unhandled condition → fail with descriptive message
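A keyword-based pass over RULES items could look like this sketch; the marker list mirrors the indicators above, and real detection would likely be smarter than pattern matching.

```python
import re

# Indicators from the rules above; \b avoids matching inside larger words.
PROCEDURAL = re.compile(
    r"\b(loop|iterate|for each|step 1|then|create a variable|initialize|increment)\b")

def check_behavioral(rules: list[str]) -> list[str]:
    """Flag RULES items that read as procedures rather than outcomes."""
    return [f'E040 RULES should be behavioral, not procedural: "{rule}"'
            for rule in rules if PROCEDURAL.search(rule.lower())]

assert check_behavioral(["items matching criteria are included"]) == []
assert len(check_behavioral(["first, parse the input, then validate"])) == 1
```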
FUNCTION: check_baseline(baseline) → list of LintError
RULES:
- if BASELINE is present, it must contain reference field
- if BASELINE is present, it must contain preserve field with at least one item
- if BASELINE is present, it must contain evolve field with at least one item
- missing reference → error E050 "BASELINE requires reference field"
- missing preserve → error E051 "BASELINE requires preserve field"
- missing evolve → error E052 "BASELINE requires evolve field"
- empty preserve list → error E053 "BASELINE preserve must contain at least one item"
- empty evolve list → error E054 "BASELINE evolve must contain at least one item"
DONE_WHEN:
- BASELINE structure validated
- all required fields checked
EXAMPLES:
(baseline with reference, preserve, evolve) → []
(baseline missing reference) → [E050]
(baseline with empty preserve) → [E053]
ERRORS:
- any unhandled condition → fail with descriptive message
FUNCTION: check_eval(eval, baseline_present) → list of LintError
RULES:
- if BASELINE is present, EVAL must also be present
- if BASELINE is present, EVAL must contain preserve threshold
- if BASELINE is present, EVAL must contain evolve threshold
- BASELINE present but EVAL absent → error E060 "EVAL required when BASELINE present"
- preserve threshold missing when BASELINE present → error E061 "EVAL requires preserve threshold when BASELINE present"
- evolve threshold missing when BASELINE present → error E062 "EVAL requires evolve threshold when BASELINE present"
- preserve threshold must be pass^k notation → error E063 "preserve threshold must use pass^k notation"
- evolve threshold must be pass@k notation → error E064 "evolve threshold must use pass@k notation"
- grading must be code, model, or outcome → error E065 "grading must be code, model, or outcome"
- k in pass^k and pass@k must be positive integer → error E066 "threshold k must be positive integer"
DONE_WHEN:
- EVAL presence checked against BASELINE
- all threshold notations validated
- grading type validated
EXAMPLES:
(eval with pass^3, pass@5, code; baseline present) → []
(no eval; baseline present) → [E060]
(eval with invalid threshold "pass3") → [E063 or E064]
(eval with grading "fuzzy") → [E065]
ERRORS:
- any unhandled condition → fail with descriptive message
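The notation checks for E063, E064, and E066 reduce to a pattern match; a sketch with assumed regexes:

```python
import re

PASS_CARET = re.compile(r"^pass\^([1-9]\d*)$")   # pass^k, k a positive integer
PASS_AT = re.compile(r"^pass@([1-9]\d*)$")       # pass@k, k a positive integer

def check_threshold(field: str, value: str) -> list[str]:
    """Validate one EVAL threshold field ("preserve" or "evolve")."""
    pattern, code, notation = (
        (PASS_CARET, "E063", "pass^k") if field == "preserve"
        else (PASS_AT, "E064", "pass@k"))
    if not pattern.match(value):
        return [f"{code} {field} threshold must use {notation} notation"]
    return []  # the [1-9]\d* group already enforces E066's positive-integer rule

assert check_threshold("preserve", "pass^3") == []
assert check_threshold("preserve", "pass3") != []   # E063
```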
FUNCTION: check_evolution_coverage(baseline, examples) → list of LintWarning
RULES:
- every item in BASELINE.preserve should have at least one corresponding example
- every item in BASELINE.evolve should have at least one corresponding example
- uncovered preserve item → warning W050 "preserve item '{item}' has no corresponding example"
- uncovered evolve item → warning W051 "evolve item '{item}' has no corresponding example"
- examples should be classifiable as testing preserved vs evolved behavior
DONE_WHEN:
- all preserve items checked for coverage
- all evolve items checked for coverage
- warnings collected
EXAMPLES:
(3 preserve items, 3 evolve items, 6+ examples covering all) → []
(2 preserve items with only 1 covered) → [W050]
(2 evolve items with none covered) → [W051, W051]
ERRORS:
- any unhandled condition → fail with descriptive message
CONSTRAINT: linter_thresholds
the specific numeric limits (15 rules, 6 inputs, 200 chars) are defaults
implementations may allow configuration
the principle is enforced simplicity, not specific numbers
Simplex can describe itself. The following specification defines how agents should interpret Simplex documents.
DATA: Landmark
name: string, all caps
purpose: what it communicates
required: yes | no | conditional
DATA: Spec
functions: one or more FUNCTION blocks
data: zero or more DATA blocks
constraints: zero or more CONSTRAINT blocks
FUNCTION: parse_spec(text) → Spec
RULES:
- landmarks are all-caps words followed by colon
- required: FUNCTION, RULES, DONE_WHEN, EXAMPLES, ERRORS
- conditionally required: EVAL (when BASELINE present)
- optional: DATA, CONSTRAINT, BASELINE, EVAL, READS, WRITES, TRIGGERS, NOT_ALLOWED, HANDOFF, UNCERTAIN, DETERMINISM
- content continues until next landmark or end
- allow for formatting inconsistency
- extract meaning not syntax
- ignore unrecognized landmarks
DONE_WHEN:
- all FUNCTION blocks identified with nested landmarks
- all DATA and CONSTRAINT blocks identified
- BASELINE and EVAL blocks identified when present
ERRORS:
- no FUNCTION found → "invalid spec: no functions"
- any unhandled condition → fail with descriptive message
EXAMPLES:
(well-formed text) → Spec
(sloppy formatting) → Spec if meaning clear
(unknown landmarks) → Spec, unknowns ignored
(missing ERRORS) → "invalid spec: ERRORS required"
(BASELINE without EVAL) → "invalid spec: EVAL required when BASELINE present"
FUNCTION: validate_spec(spec) → valid | issues
RULES:
- FUNCTION requires RULES, DONE_WHEN, EXAMPLES, ERRORS
- RULES must be behavioral, not procedural
- DONE_WHEN must be externally observable
- EXAMPLES must cover every conditional branch in RULES
- EXAMPLES must not be satisfiable by conflicting interpretations
- ERRORS must specify at least default failure behavior
- DATA types referenced must be defined or obvious
- CONSTRAINT must state verifiable invariants
- if BASELINE present, must contain reference, preserve, evolve
- if BASELINE present, EVAL must also be present
- if EVAL present with BASELINE, must contain preserve and evolve thresholds
- EVAL.grading must be code, model, or outcome
- preserve/evolve items in BASELINE should have corresponding examples
DONE_WHEN:
- structural checks complete
- semantic checks complete
- evolution checks complete (when BASELINE present)
- all issues collected
ERRORS:
- validation issues are returned in the result, not thrown
- any unhandled condition → fail with descriptive message
EXAMPLES:
(complete spec with full coverage) → valid
(missing ERRORS landmark) → issues: ["E001: ERRORS required"]
(uncovered branch) → issues: ["E020: branch X not covered"]
(procedural RULES) → issues: ["E040: RULES should be behavioral"]
(non-observable DONE_WHEN) → issues: ["E030: criterion not observable"]
(BASELINE without EVAL) → issues: ["E060: EVAL required when BASELINE present"]
(EVAL with invalid threshold) → issues: ["E063: preserve threshold must use pass^k notation"]
CONSTRAINT: self_description
this specification is parseable by parse_spec
this specification passes validate_spec
this specification passes lint_spec with zero errors
The self-description constraint is meaningful. Any future version of Simplex must remain self-describing and pass its own linter. This provides a check on evolution: changes that break self-description or fail linting are changes that have gone too far.
When writing a Simplex specification, start with the function signature and examples. The examples force clarity about inputs and outputs before you describe behavior. Many specification errors become obvious when you try to write concrete examples.
Next, write the rules. Describe behavior, not implementation. If you catch yourself writing "loop through" or "create a variable," step back. Describe what should be true of the output, not how to compute it.
Then write the completion criteria. These should be observable from outside the function. "Internal data structure is consistent" is not observable. "Output contains no duplicates" is observable.
Write the error handling. At minimum, specify that unhandled conditions fail with descriptive messages. For functions with known failure modes, map specific conditions to specific responses.
Add optional landmarks only as needed. A simple function may need nothing beyond the required five. A function that interacts with shared memory or has complex coordination needs benefits from READS, WRITES, TRIGGERS. A function operating in uncertain environments benefits from UNCERTAIN.
If a specification becomes unwieldy, decompose it. Multiple simple specs are better than one complex spec. If the linter flags complexity errors, that is a signal to break the work into smaller pieces.
Run the linter before considering a specification complete. A spec that passes linting has met the structural and semantic requirements for validity.
v0.5 — Current version. Consolidated pillars from six to five, merging "Specification, not implementation" and "Implementation opacity" into "Implementation autonomy." Added DETERMINISM landmark for explicit control over output variance (strict/structural/semantic levels, seed specification, vary/stable fields). Enhanced CONSTRAINT as behavioral anchors with examples of state invariants, output guarantees, and boundary conditions. Strengthened DATA as required output schemas—when a function return type references a DATA block, outputs must conform exactly. Added "not allowed" field annotation and conditional field presence. Added linter function check_determinism. Updated meta-specification and validation for new landmarks.
v0.5 variance reduction features address the gap between specification intent and implementation variance. When agents have implementation freedom, outputs can vary in structure, ordering, and detail even when semantically equivalent. These landmarks provide spec authors with tools to constrain variance where consistency matters.
v0.4 — Added BASELINE and EVAL landmarks for evolutionary specifications. BASELINE declares what to preserve and evolve relative to a reference state. EVAL declares grading approach and consistency thresholds using pass^k (all trials must pass) and pass@k (at least one trial must pass) notation. These landmarks address agent failure modes in long-horizon software evolution scenarios. EVAL is required when BASELINE is present to ensure measurement criteria are explicit. Added linter functions for BASELINE/EVAL validation and evolution coverage checking. Updated meta-specification for conditional landmark requirements.
v0.4 landmark additions informed by SWE-EVO research [1].
v0.3 — Made ERRORS required. Added UNCERTAIN landmark for confidence signaling. Added Validation Criteria section with semantic ambiguity detection. Added Linter Specification. Clarified that composition absence is an intentional research hypothesis. Updated meta-specification for new requirements.
v0.2 — Established pillars, landmarks, interpretation model, and meta-specification.
v0.1 — Initial exploration. Identified need for specification targeting autonomous agents.
[1] Y. Liu et al., "SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios," arXiv:2512.18470, December 2025. https://arxiv.org/abs/2512.18470