CI-Bench

CI-Bench is a framework for evaluating LLM-based tools on software engineering tasks using BugSwarm artifacts.

Supported Tasks and Tools

  • Automated Code Repair
    • SWE-Agent
    • Agentless
    • Auto-code-rover
  • Test Generation
  • Fault Localization

We are working on adding more tasks and tools to the framework.

Prerequisites

  • Ubuntu 22.04
  • Python 3.9 or above
  • conda
  • Docker
  • bugswarm-client
  • bugswarm-common
  • yq (v4.46.1)
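The BugSwarm client libraries are published on PyPI, and yq v4 is the Go implementation from mikefarah. A minimal install sketch (the yq download URL and architecture are assumptions; adjust for your platform):

python3 -m pip install bugswarm-client bugswarm-common
# Assumed download URL for mikefarah's yq v4.46.1 on Linux x86_64
sudo wget -q https://github.com/mikefarah/yq/releases/download/v4.46.1/yq_linux_amd64 -O /usr/local/bin/yq
sudo chmod +x /usr/local/bin/yq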

General Usage

The following describes the steps to run CI-Bench on supported tasks and tools. For creating new tasks or benchmarking new tools, please refer to this documentation.

Set up the Tool

To set up the tool to be evaluated, run the following command.

bash setup.sh --task <task_name> --tool-name <tool> --virtual-env <env_option>

Required options:

  • task_name: The targeted software engineering task. Can be repair, testing, or localization.
  • tool: The tool to be evaluated. Can be swe-agent, auto-code-rover, or agentless.
  • env_option: The virtual environment option. Can be conda or python.

For example, to set up the SWE-agent tool in a Python venv environment, the command is:

bash setup.sh --task repair --tool-name swe-agent --virtual-env python

Benchmarking the Tool

To benchmark the tool, run the following command.

bash run.sh --task <task_name> --tool-name <tool_name> --virtual-env <env_option> --artifact-id <bugswarm_artifact>

For example, to benchmark SWE-agent on the artifact tananaev-traccar-64783123 in a Python venv environment, the command is:

bash run.sh --task repair --tool-name swe-agent --virtual-env python --artifact-id tananaev-traccar-64783123

To run the benchmark with multiple artifacts, provide a file containing artifact IDs, one per line.

bash run.sh --task <task_name> --tool-name <tool_name> --virtual-env <env_option> --artifact-list <file_path>

Required options:

  • task_name: The targeted software engineering task. Can be repair, testing, or localization.
  • tool: The tool to be evaluated. Can be swe-agent, auto-code-rover, or agentless.
  • env_option: The virtual environment option. Can be conda or python.
  • bugswarm_artifact: The BugSwarm artifact ID to benchmark (used with --artifact-id).
  • file_path: The path to a file containing BugSwarm artifact IDs, one per line (used with --artifact-list).

If the artifact IDs are listed in a file named artifacts.txt, the command is:

bash run.sh --task repair --tool-name swe-agent --virtual-env python --artifact-list artifacts.txt
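Here, artifacts.txt contains one BugSwarm artifact ID per line, for example (the second entry is a placeholder):

tananaev-traccar-64783123
<another-artifact-id>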

Testing the Patch Generated by an LLM-Based Repair Tool

Before testing, remove any existing container and artifact image for the target artifact.
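A minimal cleanup sketch, assuming the container and image are tagged with the artifact ID (use docker ps -a and docker images to find the exact names on your system):

docker ps -a                 # locate the container created for the artifact
docker rm -f <container-id>
docker images                # locate the artifact image
docker rmi <image-id>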

To test the patch generated by the repair tool, first export your BugSwarm token:

export BUGSWARM_TOKEN="<token>"

Then run the executor:

python3 components/executor.py <patch_file_path> <artifact-id>

Required options:

  • patch_file_path: The path to the patch file.
  • artifact-id: The BugSwarm artifact ID for the benchmarking task.
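For example, to test a patch generated for the artifact used above (the patch path is hypothetical):

python3 components/executor.py results/tananaev-traccar-64783123.patch tananaev-traccar-64783123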

Testing Whether the LLM-Generated Patch Is Syntactically Equivalent (SYE)

First, install the ANTLR4 Python runtime:

python -m pip install antlr4-python3-runtime==4.13.2

To check whether the patch is syntactically equivalent, run:

bash evaluate.sh --tool_name <tool_name> --evaluation_metric SYE --bugswarm_artifact <artifact-id> --patch_file_path <patch_file_path>

Required options:

  • tool_name: The tool that generated the patch. Can be swe-agent, auto-code-rover, or agentless.
  • artifact-id: The BugSwarm artifact ID for the benchmarking task.
  • patch_file_path: The path to the patch file.
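A sample invocation, reusing the artifact from the earlier examples (the patch path is hypothetical):

bash evaluate.sh --tool_name swe-agent --evaluation_metric SYE --bugswarm_artifact tananaev-traccar-64783123 --patch_file_path results/tananaev-traccar-64783123.patch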

Video Demonstration

The YouTube video link is here.
