CI-Bench is a framework for evaluating LLM-based tools on software engineering tasks using BugSwarm artifacts. It currently supports the following tasks and tools:
- Automated Code Repair
  - SWE-Agent
  - Agentless
  - Auto-code-rover
- Test Generation
- Fault Localization
We are working on adding more tasks and tools to the framework.
Prerequisites:

- Ubuntu 22.04
- Python 3.9 or above
- conda
- Docker
- bugswarm-client
- bugswarm-common
- yq (v4.46.1)
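As a convenience, here is a hedged sketch of installing the Python-side prerequisites and yq on Ubuntu 22.04. It assumes the BugSwarm packages are available on PyPI and that yq is the mikefarah/yq Go binary with the standard GitHub release layout; Docker and conda installation are not shown.

```
# Install the BugSwarm client libraries (assumes they are published on PyPI)
python3 -m pip install bugswarm-client bugswarm-common

# Install yq v4.46.1 (assumes the mikefarah/yq GitHub release naming scheme)
sudo wget https://github.com/mikefarah/yq/releases/download/v4.46.1/yq_linux_amd64 -O /usr/local/bin/yq
sudo chmod +x /usr/local/bin/yq
```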
The following describes the steps to run CI-Bench on supported tasks and tools. For creating new tasks or benchmarking new tools, please refer to this documentation.
To set up the tool to be evaluated, run the following command.
```
bash setup.sh --task <task_name> --tool-name <tool> --virtual-env <env_option>
```

Required options:

- `task_name`: The targeted software engineering task. Can be `repair`, `testing`, or `localization`.
- `tool`: The tool to be evaluated. Can be `swe-agent`, `auto-code-rover`, or `agentless`.
- `env_option`: The virtual environment option. Can be `conda` or `python`.
Suppose we want to set up the SWE-agent tool in a Python venv environment. The command would be:

```
bash setup.sh --task repair --tool-name swe-agent --virtual-env python
```
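Similarly, if you prefer conda over a Python venv, the same script can be invoked with the `conda` option (all values are taken from the option list above):

```
bash setup.sh --task repair --tool-name swe-agent --virtual-env conda
```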
To benchmark the tool, run the following command:

```
bash run.sh --task <task_name> --tool-name <tool_name> --virtual-env <env_option> --artifact-id <bugswarm_artifact>
```

To run the benchmark with multiple artifacts, use a file with one artifact ID per line (a sample file is shown after the option list):

```
bash run.sh --task <task_name> --tool-name <tool_name> --virtual-env <env_option> --artifact-list <file_path>
```

Required options:

- `task_name`: The targeted software engineering task. Can be `repair`, `testing`, or `localization`.
- `tool_name`: The tool to be evaluated. Can be `swe-agent`, `auto-code-rover`, or `agentless`.
- `env_option`: The virtual environment option. Can be `conda` or `python`.
- `bugswarm_artifact`: The BugSwarm artifact ID for the benchmarking task.
- `file_path`: The path to the file containing the list of BugSwarm artifact IDs.
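The artifact-list file is plain text with one BugSwarm artifact ID per line; for example (the second entry below is only a placeholder):

```
tananaev-traccar-64783123
<another-bugswarm-artifact-id>
```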
Let's consider benchmarking the artifact tananaev-traccar-64783123 with the SWE-agent tool in a Python venv environment. The sample command will be:

```
bash run.sh --task repair --tool-name swe-agent --virtual-env python --artifact-id tananaev-traccar-64783123
```

If the artifact IDs are listed in a file named artifacts.txt, the command will be:

```
bash run.sh --task repair --tool-name swe-agent --virtual-env python --artifact-list artifacts.txt
```

Please remove any leftover container and artifact image before running the benchmark.
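A minimal cleanup sketch using standard Docker commands is shown below; the container name and image tag are placeholders, since the actual names depend on the artifact and on how it was launched.

```
# Find the leftover container and artifact image (names/tags below are placeholders)
docker ps -a
docker images

# Remove the container, then the artifact image
docker rm -f <container_name>
docker rmi <artifact_image>
```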
To test a patch generated by an LLM-based repair tool, follow the steps below.
First, export your BugSwarm token:

```
export BUGSWARM_TOKEN="<token>"
```

Then run the following command:
```
python3 components/executor.py <patch_file_path> <artifact-id>
```

Required options:

- `patch_file_path`: The path to the patch file.
- `artifact-id`: The BugSwarm artifact ID for the benchmarking task.
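For example, to validate a patch for the artifact used above (the patch path below is hypothetical; point it at wherever your tool wrote its patch):

```
export BUGSWARM_TOKEN="<token>"
python3 components/executor.py patches/tananaev-traccar-64783123.patch tananaev-traccar-64783123
```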
You need to install antlr4 first:

```
python -m pip install antlr4-python3-runtime==4.13.2
```

To check whether the patch is syntactically equivalent (SYE), run:
```
bash evaluate.sh --tool_name <tool_name> --evaluation_metric SYE --bugswarm_artifact <artifact-id> --patch_file_path <patch_file_path>
```

Required options:

- `tool_name`: The tool to be evaluated. Can be `swe-agent`, `auto-code-rover`, or `agentless`.
- `artifact-id`: The BugSwarm artifact ID for the benchmarking task.
- `patch_file_path`: The path to the patch file.
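For example, to check syntactic equivalence of a SWE-agent patch for the artifact used above (the patch path is again hypothetical):

```
bash evaluate.sh --tool_name swe-agent --evaluation_metric SYE --bugswarm_artifact tananaev-traccar-64783123 --patch_file_path patches/tananaev-traccar-64783123.patch
```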
The YouTube video link is here.