Grounded Semantic Role Labelling from Synthetic Multimodal Data for Situated Robot Commands
The project introduces multimodal models for grounded semantic role labelling, together with the generation of synthetic images of domestic scenes. The generated dataset is conditioned on linguistic and environmental constraints extracted from the HuRIC dataset, enabling experiments in Situated Human-Robot Interaction (HRI).
The paper has been accepted to EMNLP 2025 and is available at https://aclanthology.org/2025.emnlp-main.1212/.
The repository provides the code to train and evaluate multimodal models for Grounded Semantic Role Labelling (G-SRL) in domestic environments using synthetically generated images, together with a complete pipeline for generating and processing synthetic visual data for robotic command understanding. The pipeline supports the following steps (a small illustrative sketch of the first two follows the list):
- Extraction of constraints from HuRIC annotations
- Prompt generation for synthetic image creation
- Image generation using diffusion models
- Automatic bounding box annotation
- Consistency checking with visual LLMs
- Filtering and selection of top-ranked samples
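To give a concrete flavour of the first two steps, the snippet below sketches how a text-to-image prompt might be assembled from a simplified, hypothetical HuRIC-style annotation. The field names and the prompt template are illustrative assumptions, not the repository's actual schema; the real constraint extraction and prompt generation are implemented in image_generator/.

```python
# Minimal sketch of constraint extraction and prompt generation.
# The annotation below is a simplified, hypothetical stand-in for a HuRIC
# example; the real schema and extraction logic live in image_generator/.

def build_prompt(annotation: dict) -> str:
    # Collect the entities the command refers to (e.g. "a book", "a table").
    objects = [arg["lexical_reference"] for arg in annotation["arguments"]]
    room = annotation.get("room", "a domestic room")
    # A simple template conditioning the image generator on the scene constraints.
    return (
        f"A realistic photo of {room}, containing "
        + ", ".join(objects)
        + ", indoor lighting, eye-level view"
    )

example = {
    "command": "take the book on the table",
    "room": "a living room",
    "arguments": [
        {"role": "Theme", "lexical_reference": "a book"},
        {"role": "Source", "lexical_reference": "a table"},
    ],
}

print(build_prompt(example))
# A realistic photo of a living room, containing a book, a table, indoor lighting, eye-level view
```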
This repository includes two primary components:
training_models/: contains the training and evaluation scripts for applying MiniCPM-V 2.6 to the G-SRL task using the generated and validated dataset.
Refer to the README in training_models/ for configuration and usage.
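For orientation, here is a minimal inference sketch of MiniCPM-V 2.6 that follows the model's public Hugging Face model card. It is not the repository's training or evaluation code, and the G-SRL prompt shown is a simplified placeholder; refer to training_models/ for the actual scripts and prompt format.

```python
# Minimal MiniCPM-V 2.6 inference sketch, following the model's public
# Hugging Face model card. The checkpoint is the base model and the prompt
# is a simplified placeholder; the actual G-SRL prompt format and training
# code are in training_models/.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6"
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("generated_scene.png").convert("RGB")  # a synthetic image from the pipeline
command = "take the book on the table"
question = (
    f"Command: {command}\n"
    "Identify the semantic frame evoked by the command and, for each argument, "
    "the bounding box of the corresponding object in the image."
)
msgs = [{"role": "user", "content": [image, question]}]

answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```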
image_generator/: a self-contained pipeline to create the set of synthetic images for the G-SRL dataset. It includes:
- Constraint and prompt generation
- Diffusion-based image generation (see the sketch after this list)
- Automatic bounding box labelling (an illustrative detection sketch appears further below)
- Visual consistency evaluation
- Top-k image selection
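As a hedged illustration of the diffusion step referenced in the list above, the snippet below generates one candidate image from a scene prompt with the Hugging Face diffusers library. The SDXL checkpoint and sampling settings are assumptions made for this example; the diffusion model and parameters actually used are configured in image_generator/.

```python
# Illustrative text-to-image generation with Hugging Face diffusers.
# The checkpoint and sampling settings are assumptions for this example;
# the diffusion model actually used is configured in image_generator/.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "A realistic photo of a living room, containing a book, a table, indoor lighting"
generator = torch.Generator("cuda").manual_seed(0)  # fixed seed for reproducible candidates

image = pipe(prompt=prompt, num_inference_steps=30, generator=generator).images[0]
image.save("candidate_000.png")
```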
Refer to the README in image_generator/ for full details.
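The automatic bounding box labelling step can be approximated with an open-vocabulary object detector. The sketch below uses OWL-ViT from transformers purely as an illustration of the idea; the detector, text queries, and confidence threshold used by the actual pipeline are those documented in image_generator/.

```python
# Illustrative open-vocabulary detection to label a generated image with boxes.
# OWL-ViT is used here only as an example detector; the labelling model and
# threshold used by the actual pipeline may differ (see image_generator/).
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
detector = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("candidate_000.png").convert("RGB")
queries = [["a book", "a table"]]  # the objects mentioned in the command

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)

target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes
)[0]

for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(queries[0][int(label)], [round(v, 1) for v in box.tolist()], round(score.item(), 3))
```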
The prerequisites and environment setup are common to the entire project. You will need:
- CUDA-capable GPUs
- NVIDIA CUDA drivers installed
- Python + Conda
Create the environment as follows:
```bash
export CUDA_HOME=/usr/local/cuda
conda env create -f environment.yml
conda activate visual_grounding
./install_requirements.sh
```

Each subfolder includes a dedicated README to walk you through its functionality. A typical workflow consists of:
1. Running the image generation pipeline (image_generator/)
2. Using the generated images to train or evaluate a model (training_models/)
If you only want to train the MiniCPM models (Step 2), you can use our publicly available datasets by setting the correct paths.
Please refer to the subfolder READMEs for detailed instructions on each component.
If you find this work useful, please cite the paper:

```bibtex
@inproceedings{hromei-etal-2025-grounded,
title = "Grounded Semantic Role Labelling from Synthetic Multimodal Data for Situated Robot Commands",
author = "Hromei, Claudiu Daniel and
Scaiella, Antonio and
Croce, Danilo and
Basili, Roberto",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1212/",
pages = "23758--23781",
ISBN = "979-8-89176-332-6",
}
```