Support multimodal input data#278

Merged
moskomule merged 15 commits into main from feat/multimodal-input
Feb 18, 2026

Conversation

Member

@moskomule moskomule commented Feb 12, 2026

Refer to #277.

Add parse_input_utterance and preprocessor to TemplateChatDataset to support multimodal input data.

  • parse_input_utterance parses structured contents used in multimodal LMs
  • preprocessor applies per-item preprocessing, such as image resizing

A base class Preprocessor is also provided so that preprocessor can be configured from jsonnet.
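As a rough illustration of this design, here is a minimal sketch of what a base class plus a concrete preprocessor could look like. The actual interface in flexeval may differ, and the LowercaseText class is purely hypothetical; this only shows the "abstract base class + per-item transform" pattern the description refers to.

```python
# Hypothetical sketch only -- not the actual flexeval interface.
from abc import ABC, abstractmethod
from typing import Any


class Preprocessor(ABC):
    """Base class so concrete preprocessors can be instantiated from config (e.g., jsonnet)."""

    @abstractmethod
    def __call__(self, item: dict[str, Any]) -> dict[str, Any]:
        """Transform a single dataset item (e.g., resize an image)."""


class LowercaseText(Preprocessor):
    """Toy example: lowercases a text field of each item."""

    def __call__(self, item: dict[str, Any]) -> dict[str, Any]:
        item["text"] = item["text"].lower()
        return item


item = LowercaseText()({"text": "Hello VLM"})
# item["text"] is now "hello vlm"
```

A list of such instances could then be applied to each item in order before template rendering, which matches the "list of preprocessor instances" described in the review summary below.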

Contributor

Copilot AI left a comment

Pull request overview

This PR adds support for multimodal input data (e.g., text + images) to flexeval, enabling evaluation of Vision Language Models (VLMs) and other multimodal language models. The implementation introduces two key features to TemplateChatDataset: parse_input_utterance to parse structured content from templates into lists of dictionaries (as required by OpenAI's multimodal API format), and preprocessor to preprocess items before template rendering (e.g., image resizing or base64 encoding).

Changes:

  • Added parse_input_utterance parameter supporting literal_eval and json_loads parsing methods
  • Added preprocessor parameter accepting a list of preprocessor instances for item transformation
  • Created Preprocessor abstract base class defining the preprocessor interface
  • Implemented ConvertImageToBase64 as an example preprocessor for image handling
  • Added tests for the parse_input_utterance functionality
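The two parsing methods named above presumably correspond to the standard-library functions of the same names (ast.literal_eval and json.loads). A small sketch of what either would do to a rendered utterance string, using an invented example in the OpenAI-style multimodal content format:

```python
# Sketch of the two parsing methods; the rendered string is an invented example.
import ast
import json

# A Jinja2 template might render an utterance as a string like this,
# i.e., a list of typed content parts in the OpenAI multimodal format:
rendered = (
    '[{"type": "text", "text": "What is in this image?"},'
    ' {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}]'
)

via_json = json.loads(rendered)           # parse_input_utterance = "json_loads"
via_literal = ast.literal_eval(rendered)  # parse_input_utterance = "literal_eval"

# Both yield the same list of dicts for JSON-compatible input;
# literal_eval additionally accepts Python literal syntax (single quotes, etc.).
```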

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 11 comments.

File Description
  • flexeval/core/chat_dataset/template_based.py: core implementation of multimodal support; adds the Preprocessor ABC plus the parse_input_utterance and preprocessor parameters to TemplateChatDataset and its subclasses
  • flexeval/multimodal/image_preprocessor.py: example implementation of an image-to-base64 preprocessor with resizing support
  • flexeval/multimodal/__init__.py: module initialization exporting ConvertImageToBase64
  • tests/core/chat_dataset/test_template_based.py: test coverage for the parse_input_utterance feature with different parsing methods
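For context on the image_preprocessor.py entry: converting an image to base64 for a chat API typically means embedding the raw bytes as a data URL in the message content. A minimal stdlib sketch of that step follows; the real ConvertImageToBase64 also supports resizing, which would require an imaging library such as Pillow, and the helper name here is hypothetical.

```python
# Conceptual sketch; not the flexeval implementation.
import base64


def image_bytes_to_data_url(data: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL (OpenAI-style image input)."""
    b64 = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{b64}"


url = image_bytes_to_data_url(b"\x89PNG...")  # placeholder bytes, not a real image
```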

moskomule and others added 9 commits February 12, 2026 15:11

@amanjainj98 amanjainj98 left a comment

Adding the Preprocessor class to the exports in flexeval/core/chat_dataset/__init__.py would make it easier to use:

from .template_based import HFChatDataset, JsonlChatDataset, TemplateChatDataset, load_jinja2_template

Collaborator

@junya-takayama junya-takayama left a comment

LGTM! nice feature 👏

Just a small nit. ↓

@moskomule moskomule merged commit ea919a1 into main Feb 18, 2026
7 checks passed
@moskomule moskomule deleted the feat/multimodal-input branch February 18, 2026 09:32

Labels

enhancement (New feature or request)

4 participants