Add SDK compatibility documentation and benchmarks-commit parameter (#119)

simonrosenberg · openhands-agent · web-flow · commit 9aabf7fa98f1 · 2025-12-02T15:46:47.000+01:00
* Add SDK compatibility documentation and benchmarks-commit parameter

  - Document SDK critic module breaking change (commit 79868ae5) in README
  - Add optional benchmarks-commit parameter to build-swe-bench-images workflow
  - Update checkout step to support evaluating older SDK versions with compatible benchmarks code
  - Maintain backward compatibility - workflow behaves the same when parameter is not provided

  Fixes #118

  Co-authored-by: openhands <openhands@all-hands.dev>

* Refactor documentation to emphasize general SDK version compatibility

  - Make the documentation more general about benchmarks/SDK version dependencies
  - Present SDK critic module as an example rather than the main focus
  - Clarify that version incompatibilities can arise as both codebases evolve

  Co-authored-by: openhands <openhands@all-hands.dev>

* Add clarifying comments about empty ref behavior in checkout step

  - Explain that empty ref causes actions/checkout to use the triggering commit
  - This preserves the original workflow behavior for workflow_dispatch events

  Co-authored-by: openhands <openhands@all-hands.dev>

---------

Co-authored-by: openhands <openhands@all-hands.dev>
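
The ref precedence introduced by the "Determine checkout ref" step can be sketched as a standalone shell function, which makes the three cases easy to try outside of GitHub Actions. This is only an illustration: `resolve_checkout_ref` and its positional arguments are hypothetical names, not part of the workflow, and the real step reads its values from the `github` and `inputs` contexts.

```shell
# Hypothetical helper mirroring the precedence of the "Determine checkout ref" step.
#   $1 = event name (e.g. workflow_dispatch, pull_request, push)
#   $2 = benchmarks-commit input (may be empty)
#   $3 = pull request head SHA (may be empty)
resolve_checkout_ref() {
  if [ "${1:-}" = "workflow_dispatch" ] && [ -n "${2:-}" ]; then
    # A manual dispatch with an explicit benchmarks commit takes priority.
    printf '%s\n' "$2"
  elif [ -n "${3:-}" ]; then
    # Otherwise fall back to the PR head SHA when one is available.
    printf '%s\n' "$3"
  else
    # Empty ref: actions/checkout then uses the commit that triggered the workflow.
    printf '\n'
  fi
}

# Example calls (the SHAs are placeholders):
resolve_checkout_ref workflow_dispatch abc1234 ""
resolve_checkout_ref pull_request "" def5678
```

Note that the `benchmarks-commit` value is only honored for `workflow_dispatch` events; on any other event the sketch, like the workflow step, ignores it.
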
diff --git a/.github/workflows/build-swe-bench-images.yml b/.github/workflows/build-swe-bench-images.yml
@@ -30,6 +30,11 @@ on:
         required: false
         default: ''
         type: string
+      benchmarks-commit:
+        description: 'Benchmarks repository commit/ref to use. Leave blank to use the PR head or main branch. Useful for evaluating older SDK versions that are incompatible with current benchmarks code (e.g., SDK versions before the critic module was added in commit 79868ae5).'
+        required: false
+        default: ''
+        type: string
 
 # Reasonable defaults for automatic (push) runs; workflow_dispatch can override these.
 env:
@@ -61,9 +66,25 @@ jobs:
       issues: write
 
     steps:
+      - name: Determine checkout ref
+        id: checkout-ref
+        run: |
+          if [ "${{ github.event_name }}" = "workflow_dispatch" ] && [ -n "${{ inputs.benchmarks-commit }}" ]; then
+            echo "ref=${{ inputs.benchmarks-commit }}" >> "$GITHUB_OUTPUT"
+            echo "Using benchmarks-commit from workflow_dispatch: ${{ inputs.benchmarks-commit }}"
+          elif [ -n "${{ github.event.pull_request.head.sha }}" ]; then
+            echo "ref=${{ github.event.pull_request.head.sha }}" >> "$GITHUB_OUTPUT"
+            echo "Using PR head SHA: ${{ github.event.pull_request.head.sha }}"
+          else
+            # Empty ref means checkout the ref that triggered the workflow (e.g., main branch for workflow_dispatch)
+            echo "ref=" >> "$GITHUB_OUTPUT"
+            echo "Using default ref (the commit that triggered this workflow)"
+          fi
+      
       - uses: actions/checkout@v4
         with:
-          ref: ${{ github.event.pull_request.head.sha }}
+          # When ref is empty, actions/checkout uses the commit that triggered the workflow
+          ref: ${{ steps.checkout-ref.outputs.ref }}
           submodules: recursive
 
       # If this was a manual dispatch, override defaults with provided inputs.
diff --git a/README.md b/README.md
@@ -166,6 +166,47 @@ Uses a [remote runtime API](https://openhands.dev/blog/evaluation-of-llms-as-cod
 
 See individual benchmark READMEs for specific usage examples.
 
+## SDK Compatibility and Version Management
+
+⚠️ **Important**: The benchmarks repository depends on the [OpenHands Agent SDK](https://github.com/OpenHands/software-agent-sdk), and **not every version of the benchmarks is compatible with every version of the SDK**. As the SDK evolves and introduces new features, the benchmarks code may adopt these features, creating version dependencies.
+
+### Evaluating Different SDK Versions
+
+When evaluating a specific SDK version, you need to ensure the benchmarks code is compatible with that SDK version. You have two options:
+
+1. **Use the `benchmarks-commit` parameter in the workflow** (Recommended):
+   - When manually triggering the `build-swe-bench-images` workflow, specify both:
+     - `sdk-commit`: The SDK version you want to evaluate
+     - `benchmarks-commit`: A benchmarks commit that's compatible with that SDK version
+   
+2. **Manually check out compatible versions locally**:
+   ```bash
+   # Check out a benchmarks commit that's compatible with your target SDK version
+   git checkout <benchmarks-commit>
+   
+   # Update the SDK submodule to your target version
+   cd vendor/software-agent-sdk
+   git checkout <sdk-commit>
+   cd ../..
+   
+   # Rebuild the environment
+   make build
+   ```
+
+### Example: SDK Critic Module
+
+A notable example of version dependency is the SDK critic module. SDK commit [`79868ae5`](https://github.com/OpenHands/software-agent-sdk/commit/79868ae5) (November 17, 2025) introduced the `openhands.sdk.critic` module in the OpenHands Agent SDK. Current benchmarks code imports `CriticBase` from this module, which means:
+
+- **SDK commits at or after `79868ae5`**: Compatible with current benchmarks code
+- **SDK commits before `79868ae5`**: Require an older benchmarks commit (before the critic import was added)
+
+To check if a specific benchmarks commit requires the critic module:
+```bash
+git show <commit>:benchmarks/utils/models.py | grep "from openhands.sdk.critic"
+```
+
+If this command returns output, that benchmarks commit requires an SDK version with the critic module.
+
 ## Links
 
 - **Original OpenHands**: https://github.com/OpenHands/OpenHands/
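
The critic-module check from the README section added in this commit can be wrapped so that it reports its result through the exit status instead of printed output, which is easier to use in scripts. This is a minimal sketch: `imports_critic` is a hypothetical helper name, and it assumes the import still lives in `benchmarks/utils/models.py` as the README states.

```shell
# Hypothetical helper: reads file contents on stdin and succeeds (exit 0)
# if they contain an import from openhands.sdk.critic.
imports_critic() {
  grep -q 'from openhands\.sdk\.critic'
}

# Intended use inside a benchmarks checkout (commented out so the sketch
# runs anywhere; <commit> is a placeholder for a benchmarks commit):
#   if git show "<commit>:benchmarks/utils/models.py" | imports_critic; then
#     echo "This benchmarks commit needs an SDK version with the critic module"
#   fi
```
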