This repository contains the code related to our paper Hierarchy-Aware Evaluation of Free-Form Predictions From Vision-And-Language Models:
Vésteinn Snæbjarnarson¹·², Kevin Du², Niklas Stoehr², Serge Belongie¹, Ryan Cotterell², Nico Lang¹, Stella Frank¹; ¹University of Copenhagen, ²ETH Zürich
When a vision-and-language model (VLM) is prompted to identify an entity in an image, it may err on the side of caution and answer with "tree" instead of a more specific description such as "pine tree". Traditional binary accuracy metrics cannot differentiate between wrong predictions and insufficiently specific ones. They also do not give partial credit for close answers: "pine tree" for a Norway spruce should be better than "cypress", taxonomically speaking, but string-matching-based similarity measures will reject both equally. To address this shortcoming, we propose a framework for evaluating open-ended text predictions against a taxonomic hierarchy, using measures of hierarchical precision and recall to measure the level of correctness and specificity of predictions. We first show that existing text similarity measures and accuracy-based evaluation metrics do not capture taxonomic similarity well. We then develop and compare different methods to map textual VLM predictions onto a taxonomy. This allows us to compute hierarchical similarity measures between the free-form outputs and the ground truth labels. Finally, we analyze modern VLMs on fine-grained visual classification tasks based on our taxonomic evaluation. We find that models respond differently to instructions prompting for more specific answers, with GPT4V responding most specifically and others showing a trade-off between hierarchical precision and recall.
We suggest scoring the performance of a vision-and-language model not only on binary accuracy, but also on how correct or how incorrect its answer is. For instance, we think that "mammal" should be less wrong than "bird" when classifying a photo of a cat. Similarly, an overly confident wrong answer is worse than a less specific one: in general, we would prefer the prediction "some kind of bird" over the wrong species.
To do so, we assume a ground-truth taxonomy in which labels are connected based on their hierarchical similarity. For this purpose we rely on two taxonomies: (1) the Tree of Life, based on genetic relationships between species, and (2) a taxonomy extracted from Wikidata, a knowledge graph extracted from Wikipedia.
Depending on the application, users are encouraged to bring their own domain-specific data, but for assessing the general behaviour of a general-purpose model we believe the two taxonomies we consider can be useful.
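To make the scoring concrete, here is a minimal sketch, assuming a toy taxonomy given as a child-to-parent dictionary: hierarchical precision and recall computed from overlapping ancestor sets. It illustrates the general idea only, not the repository's implementation.

```python
# Minimal sketch of hierarchical precision and recall over a toy taxonomy.
# The taxonomy and node names are illustrative, not the paper's data,
# and this is not the repository's implementation.

# Child-to-parent mapping; the root has parent None.
parent = {
    "animal": None,
    "mammal": "animal", "bird": "animal",
    "cat": "mammal", "dog": "mammal",
    "sparrow": "bird",
}

def ancestors(node: str) -> set[str]:
    """Return the node itself together with all of its ancestors."""
    out = set()
    while node is not None:
        out.add(node)
        node = parent[node]
    return out

def hierarchical_scores(prediction: str, gold: str) -> tuple[float, float]:
    """Hierarchical precision and recall from the overlap of ancestor sets."""
    pred_anc, gold_anc = ancestors(prediction), ancestors(gold)
    overlap = len(pred_anc & gold_anc)
    return overlap / len(pred_anc), overlap / len(gold_anc)

# "mammal" for a cat is correct but unspecific: full precision, reduced recall.
print(hierarchical_scores("mammal", "cat"))  # ≈ (1.0, 0.67)
# "bird" for a cat is on the wrong branch: only the shared root overlaps.
print(hierarchical_scores("bird", "cat"))    # ≈ (0.5, 0.33)
```

Under this kind of scoring, an unspecific but correct answer keeps full hierarchical precision and only loses recall, whereas an answer on the wrong branch loses both.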
If you wish to use a virtual environment, set one up first, e.g. a new conda environment:

```bash
conda create -n vlmeval python=3.12
conda activate vlmeval
conda install cuda -c nvidia
pip install poetry
```

Install LLaVA (we recommend commenting out the torch, torchvision, sentencepiece and scikit-learn versions in its pyproject.toml) after cloning it:
```bash
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .
```

Then clone this repository:
```bash
cd ..
git clone git@github.com:vesteinn/lvlm-eval.git
cd lvlm-eval
git submodule update --recursive --init
git lfs fetch --all
git lfs pull
```

And install the package along with its dependencies:

```bash
pip install -e .
```

While we provide scripts for generating our data, we suggest users bring their own data for their domain of interest. This is likely to need some custom processing.
We also provide our already processed data.
If you wish to process the data yourself, you will need to modify the paths in `lvlm-eval/src/vlmeval/paths.py`.
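The exact contents of `paths.py` depend on the repository version, but the kind of edit involved is sketched below; every path is a placeholder, and only `MODEL_OUTPUT` is referenced later in this README.

```python
# src/vlmeval/paths.py (illustrative sketch only; the variable names other
# than MODEL_OUTPUT are hypothetical and the actual file may differ)
from pathlib import Path

DATA_ROOT = Path("/data/vlmeval")          # placeholder: where downloaded data lives
MODEL_OUTPUT = DATA_ROOT / "model_output"  # where vlmeval-generate writes predictions
```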
Run `bash download_data.sh` to fetch the files not included in the repository; this requires several gigabytes of space.
Since we use CLIP embeddings, it is best to pre-compute them once:

```bash
cd src/vlmeval/calculate_scores
python clip_embed.py
```
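For reference, pre-computing CLIP text embeddings for a set of labels might look roughly like the sketch below, using the Hugging Face transformers CLIP API; the label list and output file name are placeholders, and `clip_embed.py` in the repository is the authoritative version.

```python
# Illustrative sketch of pre-computing CLIP text embeddings for taxonomy labels.
# The actual script is src/vlmeval/calculate_scores/clip_embed.py; the labels
# and the output path below are placeholders.
import torch
from transformers import CLIPModel, CLIPProcessor

labels = ["cat", "mammal", "bird", "Norway spruce"]  # placeholder label set

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    inputs = processor(text=labels, return_tensors="pt", padding=True)
    embeddings = model.get_text_features(**inputs)
    # Normalize to unit length so dot products give cosine similarities.
    embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)

torch.save({"labels": labels, "embeddings": embeddings}, "clip_label_embeddings.pt")
```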
After installation, the command `vlmeval-generate` is available in the environment; it writes its output to the folder specified in `vlmeval.paths.MODEL_OUTPUT`.

```
$ vlmeval-generate --help
usage: vlmeval-generate [-h] [-Q QUANTIZATION_MODE] [-S START_IDX] [-E END_IDX] [-T TEMPERATURE] [-M MAX_NEW_TOKENS] [-I INPUT_DATA_PATH] [-P PROMPT_TYPE] [-O OUTPUT_FILE_PATH]
                        DATASET_NAME MODEL_NAME

positional arguments:
  DATASET_NAME          Name of the dataset
  MODEL_NAME            Must match the model class names exactly.

options:
  -h, --help            show this help message and exit
  -Q QUANTIZATION_MODE, --QUANTIZATION_MODE QUANTIZATION_MODE
                        ['full', '16bit', '8bit', or '4bit']
  -S START_IDX, --START_IDX START_IDX
                        Start index of the dataset
  -E END_IDX, --END_IDX END_IDX
                        End index of the dataset
  -T TEMPERATURE, --TEMPERATURE TEMPERATURE
  -M MAX_NEW_TOKENS, --MAX_NEW_TOKENS MAX_NEW_TOKENS
  -I INPUT_DATA_PATH, --INPUT_DATA_PATH INPUT_DATA_PATH
                        path for either the inat or oven dataset jsonl files
  -P PROMPT_TYPE, --PROMPT_TYPE PROMPT_TYPE
                        Must be one of ['specific', 'default', 'barebones'] or otherwise defined in the prompt_templates.py file
  -O OUTPUT_FILE_PATH, --OUTPUT_FILE_PATH OUTPUT_FILE_PATH
                        path for the output file
```

For example, to run inference with the Llama3.2 model on the OVEN dataset, run:
```bash
vlmeval-generate oven Llama3_2 -Q 16bit -S 0 -E 5046 -I src/vlmeval/data/oven/val/oven_only_test_equal_repr_all_bar_inaturalist.jsonl -P barebones
```

Note that you will need to be logged in to Hugging Face and have accepted the conditions for using the model.
We also compare how well existing textual similarity measures capture taxonomic information. To run these calculations, use:
```
$ vlmeval-score-measure --help
usage: vlmeval-score-measure [-h] [--out_postfix OUT_POSTFIX] [--reliability] [--specificity] [--accuracy] [--top_k] [--write_samples_for_inference] [--measures MEASURES]
                             [--verbosity VERBOSITY] [--estimate_position_file ESTIMATE_POSITION_FILE] [--estimate_measure ESTIMATE_MEASURE]
                             [--estimate_dataset ESTIMATE_DATASET] [--estimate_model_name ESTIMATE_MODEL_NAME] [--datasets DATASETS]

options:
  -h, --help            show this help message and exit
  --out_postfix OUT_POSTFIX
  --reliability
  --specificity
  --accuracy
  --top_k
  --write_samples_for_inference
  --measures MEASURES
  --verbosity VERBOSITY
  --estimate_position_file ESTIMATE_POSITION_FILE
  --estimate_measure ESTIMATE_MEASURE
  --estimate_dataset ESTIMATE_DATASET
  --estimate_model_name ESTIMATE_MODEL_NAME
  --datasets DATASETS
```

To reproduce the correlation results (Table 1) in the paper, you can run the script `scripts/run_correlation_all.sh`.
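For intuition, the sketch below shows the kind of comparison involved: correlating a generic string-similarity measure with a toy taxonomic distance. The pairs, the edge distances, and the choice of `SequenceMatcher` and Spearman correlation are illustrative assumptions, not the paper's exact measures or data.

```python
# Illustrative sketch: does a text-similarity measure track taxonomic similarity?
# The pairs, distances, and the measure are toy placeholders.
from difflib import SequenceMatcher
from scipy.stats import spearmanr

# (prediction, gold label, hand-made number of taxonomy edges between them)
pairs = [
    ("pine tree", "Norway spruce", 2),
    ("cypress", "Norway spruce", 4),
    ("tree", "Norway spruce", 3),
    ("sparrow", "Norway spruce", 8),
]

text_sim = [SequenceMatcher(None, a, b).ratio() for a, b, _ in pairs]
tax_dist = [d for _, _, d in pairs]

# A measure that captures the taxonomy should be strongly anti-correlated
# with taxonomic distance.
rho, p_value = spearmanr(text_sim, tax_dist)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```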
We also analyze how well these measures can be used to map predictions onto a taxonomy. To run these calculations, use `vlmeval-map-predictions`:
```
$ vlmeval-map-predictions --help
usage: vlmeval-map-predictions [-h] [--prediction_file PREDICTION_FILE] [--measure MEASURE] [--matcher {complex,direct,both}] [--output_dir OUTPUT_DIR] [--top_k TOP_K]
                               [--verbosity VERBOSITY]

Calculate taxonomy positioning metrics

options:
  -h, --help            show this help message and exit
  --prediction_file PREDICTION_FILE
                        Path to file containing predictions
  --measure MEASURE     Name of measure to use from measures.py
  --matcher {complex,direct,both}
                        Matching strategy to use
  --output_dir OUTPUT_DIR
                        Directory to save results
  --top_k TOP_K         Number of top candidates to consider for direct matcher
  --verbosity VERBOSITY
                        Verbosity level (0-2)
```

To reproduce the mapping accuracy results (Table 2) in the paper with the different measures, you can use the script `scripts/run_all_positioning_test.sh`.
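For intuition, here is a hedged sketch of a simple nearest-neighbour matcher in the spirit of a direct mapping: it embeds a free-form prediction with CLIP and retrieves the closest taxonomy labels by cosine similarity. The embedding file name reuses the placeholder from the CLIP sketch above, and this is not the repository's matcher implementation.

```python
# Illustrative sketch of mapping a free-form prediction onto a taxonomy label
# by nearest-neighbour search over pre-computed CLIP text embeddings.
# The file name and label set are placeholders; the repository's matchers
# (--matcher complex/direct/both) are the authoritative implementations.
import torch
from transformers import CLIPModel, CLIPProcessor

store = torch.load("clip_label_embeddings.pt")  # from the earlier sketch
labels, label_emb = store["labels"], store["embeddings"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def map_to_taxonomy(prediction: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Return the top_k taxonomy labels closest to the prediction by cosine similarity."""
    with torch.no_grad():
        inputs = processor(text=[prediction], return_tensors="pt", padding=True)
        emb = model.get_text_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)
    scores = (label_emb @ emb.T).squeeze(1)  # cosine similarities (unit-norm vectors)
    top = scores.topk(min(top_k, len(labels)))
    return [(labels[int(i)], float(s)) for s, i in zip(top.values, top.indices)]

print(map_to_taxonomy("Some kind of bird."))
```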