**OSS Security Economics SIG – Benchmarking AI/ML Security Agents on CVE Triage & Patching**
1. Summary
This issue proposes a focused next phase for the OSS Security Economics (SecEcon) work under the AI/ML Security WG. The goal is to turn SecEcon into a SIG that defines and maintains open, reproducible benchmarks for the economic and operational value of AI/ML-powered cyber agents on real security workflows, starting with CVE triage and patching. The intent is to provide shared metrics, datasets, and reporting templates the whole community can use.
2. Background
The original SecEcon SIG (see also prior issues and TAC thread) explored how to better understand the economics of open source security: where security investment needs to go, which projects are the most valuable, and how risk and funding interact across ecosystems.
Since then, the community has reinforced that economics and incentives are first-order concerns for open source security:
- OpenSSF open letter after the npm worm (joint letter from public OSS infrastructure stewards): https://openssf.org/blog/2025/09/23/open-infrastructure-is-not-free-a-joint-statement-on-sustainable-stewardship/
- Python Software Foundation open letter on funding / limited runway: https://blog.python.org/2025/02/connecting-dots-understanding-psfs.html
- Linux Foundation / FINOS – open source security & economics
The problems remain unchanged:
- Critical dependencies often lack sustained maintainer capacity.
- “Part-time firefighter” models (sporadic volunteer time) are rational but fragile and not scalable.
- Organizations lack good data and frictionless onramps for proactive investment in OSS security.
At the same time, the wider Linux Foundation and FINOS communities are increasingly focused on the hidden economics of open models and open source AI infrastructure. The AI/ML WG is a natural place to pilot this work for AI-driven security workflows.
3. Updated Mission & Scope
We propose to refocus the OSS Security Economics work as an AI/ML Security Economics SIG under the AI/ML WG, with the following updated mission:
Define, run, and regularly publish open benchmarks for the economic and operational value of AI/ML cyber agents in real-world open source maintainer security workflows, starting with:
- CVE triage
- Patch recommendation and validation
- Noise reduction (e.g., filtering low-value, AI-generated “slop” reports)
Key characteristics:
- Security economics benchmark open to all agents/tools
- Any AI/ML agent, system, or workflow that participates is evaluated under the same set of tasks and metrics.
- The benchmark is not tied to a specific implementation or vendor.
- Real-world, open source security-focused tasks
- Domain: cyber security tasks, not general software engineering.
- Example task types:
- Prioritize incoming CVEs for a given project based on impact and exploitability.
- Propose patches for specific vulnerabilities.
- Validate candidate patches (tests passing, regression risk, security impact).
- Reduce noise in issue/alert queues and highlight the most economically impactful fixes.
- Metrics that reflect time, risk, and outcomes
- Time saved (e.g., engineer-hours avoided for triage, patching, or validation).
- Risk reduced (e.g., coverage of higher-severity issues, reduced MTTR).
- Noise removed (e.g., fewer false positives / dead-end investigations).
- A simple, transparent economic score (e.g., estimated cost saved vs. a baseline human-only workflow); a minimal scoring sketch appears after this list.
- Open, reproducible, grounded in real OSS
- Benchmarks are built with feedback from real maintainers, around real open source projects and real CVEs, with clearly documented conditions and scoring rules.
- The SIG will aim for a regular cadence (e.g., monthly or quarterly runs) and public benchmark reports.
- Multiple agents/tools can be compared fairly across the same scenarios.
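To make the economic score concrete, here is a minimal sketch (referenced in the metrics bullet above) of how raw measurements could roll up into a single, auditable number. Everything in it is illustrative: the field names, severity weighting, and assumed hourly rate are placeholders, not part of any agreed specification.

```python
from dataclasses import dataclass

# Hypothetical hourly cost used to convert time saved into dollars.
ASSUMED_HOURLY_RATE_USD = 100.0

@dataclass
class TaskResult:
    """Per-task measurements for one agent run (all fields illustrative)."""
    hours_saved: float           # engineer-hours avoided vs. a human-only baseline
    severity_weight: float       # e.g., CVSS-derived weight of the issues handled
    mttr_reduction_hours: float  # improvement in mean time to remediate
    false_positives_removed: int

def economic_score(results: list[TaskResult],
                   hourly_rate: float = ASSUMED_HOURLY_RATE_USD) -> dict:
    """Aggregate raw measurements into a simple, transparent summary.

    Returns the underlying totals alongside a single dollar-equivalent
    estimate, so the headline number stays auditable rather than opaque.
    """
    total_hours = sum(r.hours_saved for r in results)
    risk_reduced = sum(r.severity_weight * r.mttr_reduction_hours for r in results)
    noise_removed = sum(r.false_positives_removed for r in results)
    return {
        "engineer_hours_saved": total_hours,
        "severity_weighted_risk_reduction": risk_reduced,
        "false_positives_removed": noise_removed,
        "estimated_value_usd": total_hours * hourly_rate,
    }

if __name__ == "__main__":
    demo = [TaskResult(1.5, 7.8, 12.0, 3), TaskResult(0.5, 4.0, 2.0, 1)]
    print(economic_score(demo))
```

The design intent is that the headline dollar figure is always derivable from the published raw metrics, so any consumer of a benchmark report can re-weight it for their own context.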
Given this scope, we suggest a hybrid structure:
- SecEcon as a SIG under the AI/ML WG:
- Owns the problem framing, benchmark design, metrics, and coordination across WGs.
- Benchmark suite as a project once the design stabilizes:
- Hosts datasets, automation, and long-term maintenance.
4. Differentiation from Cyber Reasoning Systems
To avoid confusion and overlap, we propose the following clear separation of concerns:
- Cyber Reasoning Systems (CRS)
- Focus: research into new capabilities of autonomous or semi-autonomous security systems.
- Typical questions:
- Can systems solve specific security challenges or CTF-style tasks?
- How well do they reason about complex vulnerabilities and exploits?
- OSS Security Economics SIG (SecEcon)
- Focus: economic and operational value of these systems in real-world workflows.
- Typical questions:
- How much time do agents actually give maintainers back?
- How much risk do they reduce in practice?
- How much noise do they remove from maintainers’ alert and issue streams?
- How does human attention get reallocated when agents are in the loop?
In other words:
- CRS = “How do these systems perform security tasks?”
- SecEcon = “How valuable are these systems for open source maintainers and enterprise security teams in real operations, in terms of time, risk, and cost?”
SecEcon will consume outputs from CRS-like systems (patches, triage outputs, recommendations) and benchmark their impact. It will not design core CRS architectures or compete with those efforts. Instead, SecEcon aims to provide shared metrics and reports that CRS, AI tools, and organizations can all use to standardize and refine their approaches.
5. Proposed Deliverables (Phase 1–2)
Phase 1 – Design & Spec (Q1 2026)
- Initial benchmark specification, covering:
- Task definitions (CVE triage, patch recommendation, patch validation, noise reduction).
- Input/output formats and constraints.
- Evaluation metrics (time saved, risk reduced, noise removed, economic score).
- Scoring methodology that is simple, transparent, and robust.
- Small reference dataset drawn from real OSS projects:
- A curated set of CVEs and associated patches.
- Clear documentation of project context and assumptions.
- Lightweight reporting template, for example (a machine-readable sketch of this template appears after the quoted example):
- “Agent X on Task Y, over Period Z, achieved:
- N minutes/hours saved per issue,
- MTTR improved by Δ,
- coverage of top-k highest impact issues improved by Δ,
- resulting in an estimated $X equivalent value vs. baseline.”
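To complement the prose template above, here is a minimal sketch of how a single report entry might be captured in machine-readable form; the field names and sample values are hypothetical and would be finalized as part of the Phase 1 specification.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class BenchmarkReport:
    """One row of the lightweight reporting template (fields illustrative)."""
    agent: str                   # e.g., "Agent X"
    task: str                    # e.g., "cve-triage", "patch-validation"
    period: str                  # e.g., "2026-Q2"
    minutes_saved_per_issue: float
    mttr_improvement_hours: float
    top_k_coverage_delta: float  # change in coverage of top-k highest-impact issues
    estimated_value_usd: float   # vs. the documented baseline workflow

# Hypothetical example mirroring the template text above; not real data.
example = BenchmarkReport(
    agent="Agent X",
    task="cve-triage",
    period="2026-Q2",
    minutes_saved_per_issue=18.0,
    mttr_improvement_hours=6.5,
    top_k_coverage_delta=0.12,
    estimated_value_usd=4200.0,
)

print(json.dumps(asdict(example), indent=2))
```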
Phase 2 – Pilot Runs & Regular Reporting (Q2 and Q3 2026)
- Run the benchmark on:
- A small set of volunteer agents / systems across participating organizations.
- Baseline “human-only” or “script-only” workflows where feasible.
- Publish regular summaries (e.g., monthly/quarterly) with:
- Aggregate metrics (no vendor-specific ranking required to start; see the aggregation sketch after this list).
- Lessons learned on task design, metrics, and operational constraints.
- Refine the spec based on feedback from:
- Open source maintainers
- AI/ML WG
- Cyber Reasoning Systems stakeholders
- Supply chain / best practices WGs
- FINOS and other domain partners interested in security economics.
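For the aggregate summaries mentioned above, a pooled view with no per-vendor ranking could be as simple as the sketch below; the records and field names are invented for illustration only.

```python
from statistics import mean

# Hypothetical per-run records from a pilot; agent names are anonymized
# before aggregation so the published summary carries no vendor ranking.
runs = [
    {"agent": "a1", "minutes_saved_per_issue": 18.0, "false_positive_reduction": 0.30},
    {"agent": "a2", "minutes_saved_per_issue": 11.0, "false_positive_reduction": 0.22},
    {"agent": "a3", "minutes_saved_per_issue": 25.0, "false_positive_reduction": 0.41},
]

def aggregate(runs: list[dict]) -> dict:
    """Pool all runs into ecosystem-level averages (no per-agent breakdown)."""
    return {
        "participating_agents": len({r["agent"] for r in runs}),
        "mean_minutes_saved_per_issue": mean(r["minutes_saved_per_issue"] for r in runs),
        "mean_false_positive_reduction": mean(r["false_positive_reduction"] for r in runs),
    }

print(aggregate(runs))
```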
6. Coordination & Next Steps
To move this forward in a way that respects everyone’s time and existing structures, we propose:
1. Monthly SecEcon segment in AI/ML WG
- A monthly check-in rather than an additional call, preferably piggybacked on an existing AI/ML WG call (e.g., the last 10–20 minutes of the call, expandable to 30–45 minutes for those interested).
- Focus of these segments:
- Review benchmark design proposals.
- Discuss candidate tasks, metrics, and datasets.
- Coordinate with CRS and other relevant efforts.
2. Repository under ossf
- Create or designate a repository (e.g., ossf/ai-ml-security-econ-benchmarks), used for:
- Benchmark specifications and task definitions.
- Synthetic and curated datasets (with clear licensing).
- Evaluation scripts / notebooks.
- Public reports and dashboards.
3. Slack channel
- Set up a Slack channel such as #wg-ai-ml-sececon to:
- Bring together AI/ML WG members, CRS participants, security engineers, and interested FINOS / financial-services stakeholders.
- Share drafts, benchmark runs, and feedback.
- Coordinate contributions and reviews asynchronously.
7. Open Questions for the WG
We’d appreciate feedback from the AI/ML WG on the following:
- Scope clarity
- Does this feel clearly distinct from, and complementary to, Cyber Reasoning Systems and other ongoing efforts (model signing, model cards, AIBOM/SBOM work)?
- Metrics
- Which 2–3 metrics would you consider must-have from day one (e.g., MTTR, false positive reduction, time saved per issue, severity-weighted risk reduction)?
- Cadence & integration
- Is a monthly SecEcon-focused segment within the AI/ML WG meeting a reasonable starting point?
- Interlocks
- Which other WGs/SIGs should we formally loop in early (e.g., Best Practices, Supply Chain, Policy, FINOS participants)?
If you’re interested in contributing tasks, data, or agent implementations for early experiments, please comment here or indicate your interest so we can follow up as we bootstrap the repo and channel.