This repository contains the code related to our paper Hierarchy-Aware Evaluation of Free-Form Predictions From Vision-And-Language Models:
Vésteinn Snæbjarnarson¹·², Kevin Du², Niklas Stoehr², Serge Belongie¹, Ryan Cotterell², Nico Lang¹, Stella Frank¹; ¹University of Copenhagen, ²ETH Zürich
When a vision-and-language model (VLM) is prompted to identify an entity in an image, it may err on the side of caution and answer with "tree" instead of a more specific description such as "pine tree". Traditional binary accuracy metrics cannot differentiate between wrong predictions and insufficiently specific ones. They also do not give partial credit for close answers: "pine tree" for a Norway spruce should be better than "cypress", taxonomically speaking, but string-matching-based similarity measures will reject both equally. To address this shortcoming, we propose a framework for evaluating open-ended text predictions against a taxonomic hierarchy, using measures of hierarchical precision and recall to measure the level of correctness and specificity of predictions. We first show that existing text similarity measures and accuracy-based evaluation metrics do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the free-form outputs and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our taxonomic evaluation. We find that models respond differently to instructions prompting for more specific answers, with GPT4V responding most specifically and others showing a trade-off between hierarchical precision and recall.
We suggest scoring the performance of a vision-and-language model not only on binary accuracy, but also on how correct or how incorrect its answer is. For instance, we think that "mammal" should be less wrong than "bird" when classifying a photo of a cat. Similarly, an overly confident wrong answer is worse than a less specific one: in general, we would prefer the prediction "some kind of bird" over the wrong species.
To do so, we assume a ground-truth taxonomy in which labels are connected based on their hierarchical similarity. For this purpose we rely on two taxonomies: (1) the Tree of Life, based on genetic relationships between species, and (2) a taxonomy extracted from Wikidata, a knowledge graph extracted from Wikipedia.
Depending on the application, users are encouraged to bring their own domain-specific data, but for assessing the general behaviour of a general-purpose model we believe the two taxonomies we consider can be useful.
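To make the scoring concrete, here is a minimal sketch, assuming a toy taxonomy given as a child-to-parent dictionary: hierarchical precision and recall computed from overlapping ancestor sets. It illustrates the general idea only, not the repository's implementation.

```python
# Minimal sketch of hierarchical precision and recall over a toy taxonomy.
# The taxonomy and node names are illustrative, not the paper's data,
# and this is not the repository's implementation.

# Child-to-parent mapping; the root has parent None.
parent = {
    "animal": None,
    "mammal": "animal", "bird": "animal",
    "cat": "mammal", "dog": "mammal",
    "sparrow": "bird",
}

def ancestors(node: str) -> set[str]:
    """Return the node itself together with all of its ancestors."""
    out = set()
    while node is not None:
        out.add(node)
        node = parent[node]
    return out

def hierarchical_scores(prediction: str, gold: str) -> tuple[float, float]:
    """Hierarchical precision and recall from the overlap of ancestor sets."""
    pred_anc, gold_anc = ancestors(prediction), ancestors(gold)
    overlap = len(pred_anc & gold_anc)
    return overlap / len(pred_anc), overlap / len(gold_anc)

# "mammal" for a cat is correct but unspecific: full precision, reduced recall.
print(hierarchical_scores("mammal", "cat"))  # ≈ (1.0, 0.67)
# "bird" for a cat is on the wrong branch: only the shared root overlaps.
print(hierarchical_scores("bird", "cat"))    # ≈ (0.5, 0.33)
```

Under this kind of scoring, an unspecific but correct answer keeps full hierarchical precision and only loses recall, whereas an answer on the wrong branch loses both.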
If you wish to use a virtual environment, set one up first, e.g. a new conda environment:

```bash
conda create -n vlmeval python=3.12
conda activate vlmeval
conda install cuda -c nvidia
pip install poetry
```

Install LLaVA (we recommend commenting out the torch, torchvision, sentencepiece and scikit-learn versions in its pyproject.toml) after cloning it:
```bash
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .
```

Then clone this repository:
```bash
cd ..
git clone git@github.com:vesteinn/lvlm-eval.git
cd lvlm-eval
git submodule update --recursive --init
git lfs fetch --all
git lfs pull
```

And install the package along with its dependencies:

```bash
pip install -e .
```

While we provide scripts for generating our data, we suggest users bring their own data for their domain of interest. This is likely to need some custom processing.
We also provide our already processed data.
If you wish to process the data yourself, you will need to modify the paths in `lvlm-eval/src/vlmeval/paths.py`.
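The exact contents of `paths.py` depend on the repository version, but the kind of edit involved is sketched below; every path is a placeholder, and only `MODEL_OUTPUT` is referenced later in this README.

```python
# src/vlmeval/paths.py (illustrative sketch only; the variable names other
# than MODEL_OUTPUT are hypothetical and the actual file may differ)
from pathlib import Path

DATA_ROOT = Path("/data/vlmeval")          # placeholder: where downloaded data lives
MODEL_OUTPUT = DATA_ROOT / "model_output"  # where vlmeval-generate writes predictions
```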
Run `bash download_data.sh` to fetch the files not included in the repository; this requires several gigabytes of space.
Since we use CLIP embeddings, it is best to pre-compute them once:

```bash
cd src/vlmeval/calculate_scores
python clip_embed.py
```
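For reference, pre-computing CLIP text embeddings for a set of labels might look roughly like the sketch below, using the Hugging Face transformers CLIP API; the label list and output file name are placeholders, and `clip_embed.py` in the repository is the authoritative version.

```python
# Illustrative sketch of pre-computing CLIP text embeddings for taxonomy labels.
# The actual script is src/vlmeval/calculate_scores/clip_embed.py; the labels
# and the output path below are placeholders.
import torch
from transformers import CLIPModel, CLIPProcessor

labels = ["cat", "mammal", "bird", "Norway spruce"]  # placeholder label set

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    inputs = processor(text=labels, return_tensors="pt", padding=True)
    embeddings = model.get_text_features(**inputs)
    # Normalize to unit length so dot products give cosine similarities.
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)

torch.save({"labels": labels, "embeddings": embeddings}, "clip_label_embeddings.pt")
```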
After installation, the command `vlmeval-generate` is available in the environment; it writes its output to the folder specified in `vlmeval.paths.MODEL_OUTPUT`.

```
$ vlmeval-generate --help
usage: vlmeval-generate [-h] [-Q QUANTIZATION_MODE] [-S START_IDX] [-E END_IDX] [-T TEMPERATURE] [-M MAX_NEW_TOKENS] [-I INPUT_DATA_PATH] [-P PROMPT_TYPE] [-O OUTPUT_FILE_PATH]
                        DATASET_NAME MODEL_NAME

positional arguments:
  DATASET_NAME          Name of the dataset
  MODEL_NAME            Must match the model class names exactly.

options:
  -h, --help            show this help message and exit
  -Q QUANTIZATION_MODE, --QUANTIZATION_MODE QUANTIZATION_MODE
                        ['full', '16bit', '8bit', or '4bit']
  -S START_IDX, --START_IDX START_IDX
                        Start index of the dataset
  -E END_IDX, --END_IDX END_IDX
                        End index of the dataset
  -T TEMPERATURE, --TEMPERATURE TEMPERATURE
  -M MAX_NEW_TOKENS, --MAX_NEW_TOKENS MAX_NEW_TOKENS
  -I INPUT_DATA_PATH, --INPUT_DATA_PATH INPUT_DATA_PATH
                        path for either the inat or oven dataset jsonl files
  -P PROMPT_TYPE, --PROMPT_TYPE PROMPT_TYPE
                        Must be one of ['specific', 'default', 'barebones'] or otherwise defined in the prompt_templates.py file
  -O OUTPUT_FILE_PATH, --OUTPUT_FILE_PATH OUTPUT_FILE_PATH
                        path for the output file
```

For example, to run inference with the Llama3.2 model on the OVEN dataset, run:
```bash
vlmeval-generate oven Llama3_2 -Q 16bit -S 0 -E 5046 -I src/vlmeval/data/oven/val/oven_only_test_equal_repr_all_bar_inaturalist.jsonl -P barebones
```

Note that you will need to be logged in to Hugging Face and have accepted the conditions for using the model.
We also compare how well existing textual similarity measures capture taxonomic information. To run these calculations, use:
```
$ vlmeval-score-measure --help
usage: vlmeval-score-measure [-h] [--out_postfix OUT_POSTFIX] [--reliability] [--specificity] [--accuracy] [--top_k] [--write_samples_for_inference] [--measures MEASURES]
                             [--verbosity VERBOSITY] [--estimate_position_file ESTIMATE_POSITION_FILE] [--estimate_measure ESTIMATE_MEASURE]
                             [--estimate_dataset ESTIMATE_DATASET] [--estimate_model_name ESTIMATE_MODEL_NAME] [--datasets DATASETS]

options:
  -h, --help            show this help message and exit
  --out_postfix OUT_POSTFIX
  --reliability
  --specificity
  --accuracy
  --top_k
  --write_samples_for_inference
  --measures MEASURES
  --verbosity VERBOSITY
  --estimate_position_file ESTIMATE_POSITION_FILE
  --estimate_measure ESTIMATE_MEASURE
  --estimate_dataset ESTIMATE_DATASET
  --estimate_model_name ESTIMATE_MODEL_NAME
  --datasets DATASETS
```

To reproduce the correlation results (Table 1) in the paper, you can run the script `scripts/run_correlation_all.sh`.
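For intuition, the sketch below shows the kind of comparison involved: correlating a generic string-similarity measure with a toy taxonomic distance. The pairs, the edge distances, and the choice of `SequenceMatcher` and Spearman correlation are illustrative assumptions, not the paper's exact measures or data.

```python
# Illustrative sketch: does a text-similarity measure track taxonomic similarity?
# The pairs, distances, and the measure are toy placeholders.
from difflib import SequenceMatcher
from scipy.stats import spearmanr

# (prediction, gold label, hand-made number of taxonomy edges between them)
pairs = [
    ("pine tree", "Norway spruce", 2),
    ("cypress", "Norway spruce", 4),
    ("tree", "Norway spruce", 3),
    ("sparrow", "Norway spruce", 8),
]

text_sim = [SequenceMatcher(None, a, b).ratio() for a, b, _ in pairs]
tax_dist = [d for _, _, d in pairs]

# A measure that captures the taxonomy should be strongly anti-correlated
# with taxonomic distance.
rho, p_value = spearmanr(text_sim, tax_dist)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```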
We also analyze how well these measures can be used to map predictions onto a taxonomy. To run these calculations, use `vlmeval-map-predictions`:
```
$ vlmeval-map-predictions --help
usage: vlmeval-map-predictions [-h] [--prediction_file PREDICTION_FILE] [--measure MEASURE] [--matcher {complex,direct,both}] [--output_dir OUTPUT_DIR] [--top_k TOP_K]
                               [--verbosity VERBOSITY]

Calculate taxonomy positioning metrics

options:
  -h, --help            show this help message and exit
  --prediction_file PREDICTION_FILE
                        Path to file containing predictions
  --measure MEASURE     Name of measure to use from measures.py
  --matcher {complex,direct,both}
                        Matching strategy to use
  --output_dir OUTPUT_DIR
                        Directory to save results
  --top_k TOP_K         Number of top candidates to consider for direct matcher
  --verbosity VERBOSITY
                        Verbosity level (0-2)
```

To reproduce the mapping accuracy results (Table 2) in the paper with the different measures, you can use the script `scripts/run_all_positioning_test.sh`.
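For intuition, here is a hedged sketch of a simple nearest-neighbour matcher in the spirit of a direct mapping: it embeds a free-form prediction with CLIP and retrieves the closest taxonomy labels by cosine similarity. The embedding file name reuses the placeholder from the CLIP sketch above, and this is not the repository's matcher implementation.

```python
# Illustrative sketch of mapping a free-form prediction onto a taxonomy label
# by nearest-neighbour search over pre-computed CLIP text embeddings.
# The file name and label set are placeholders; the repository's matchers
# (--matcher complex/direct/both) are the authoritative implementations.
import torch
from transformers import CLIPModel, CLIPProcessor

store = torch.load("clip_label_embeddings.pt")  # from the earlier sketch
labels, label_emb = store["labels"], store["embeddings"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def map_to_taxonomy(prediction: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Return the top_k taxonomy labels closest to the prediction by cosine similarity."""
    with torch.no_grad():
        inputs = processor(text=[prediction], return_tensors="pt", padding=True)
        emb = model.get_text_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
    scores = (label_emb @ emb.T).squeeze(1)  # cosine similarities (unit-norm vectors)
    top = scores.topk(min(top_k, len(labels)))
    return [(labels[int(i)], float(s)) for s, i in zip(top.values, top.indices)]

print(map_to_taxonomy("Some kind of bird."))
```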