Polysome is a generic data generation framework designed for transforming text attributes using Large Language Models. It serves as a powerful workflow builder for chaining prompts to generate synthetic data at scale.
While originally developed for computational pathology, Polysome is domain-agnostic and focuses purely on the data generation stage of a modeling pipeline. It lets you define complex, multi-step text transformation workflows using a node-based architecture.
Key Features:
- Workflow Engine: Define DAG-based pipelines to load data, process it with LLMs, and structure the output.
- Synthetic Data Generation: Ideal for creating instruction tuning datasets, rewriting reports, or extracting structured information from unstructured text.
- Prompt Chaining: Chain distinct nodes for complex reasoning, summarization, and translation tasks.
- High Performance: Supports batch processing and Data Parallelism via vLLM.
Install Polysome from PyPI (minimal install, HuggingFace engine only):

```bash
pip install polysome
```

For specific engines or features:
```bash
# vLLM (Recommended for Linux + NVIDIA GPU)
pip install "polysome[vllm]"

# llama.cpp (Recommended for CPU or Apple Silicon)
pip install "polysome[llama-cpp]"

# UI / Prompt Editor
pip install "polysome[ui]"

# Install everything (for development/testing)
pip install "polysome[all]"
```

Convenience aliases:

- `pip install "polysome[gpu]"` installs the vLLM stack
- `pip install "polysome[cpu]"` installs the llama.cpp stack
Many models (like Gemma-3) are gated and require authentication. Before running workflows with these models, log in via the CLI:
```bash
pip install huggingface_hub
huggingface-cli login
```

Alternatively, set the `HF_TOKEN` environment variable.
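For example, in a shell or CI environment (the token value below is a placeholder; create your own at https://huggingface.co/settings/tokens):

```bash
# Placeholder value; substitute your own Hugging Face access token
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
```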
Initialize a new project with example workflows and prompts:
```bash
polysome init my-analysis
cd my-analysis
```

This creates:
- `workflows/` - Example workflow configurations
- `prompts/` - Prompt templates for your tasks
- `data/` - Directory for input data (with an example file)
- `output/` - Results will be saved here
```bash
# Run the example workflow
polysome run workflows/basic_text_generation.json

# Run with custom settings
polysome run workflows/my_workflow.json --log-level DEBUG
```
1. Edit your workflow (`workflows/basic_text_generation.json`):
   - Change the model name (default: `google/gemma-3-4b-it`)
   - Adjust generation parameters
   - Add or remove processing nodes
2. Customize prompts (`prompts/simple_qa/`; sketched after this list):
   - `system_prompt.txt`: Define the AI's role
   - `user_prompt.txt`: Template with variables like `{{ question }}`
   - `few_shot.jsonl`: Example inputs and outputs
3. Prepare your data (`data/input.json`):
   - Format as JSON with fields matching your prompt variables (also sketched below)
4. Run and iterate:

   ```bash
   polysome run workflows/basic_text_generation.json
   ```
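A rough sketch of the prompt and data files from steps 2 and 3 (the contents here are invented for illustration; the files generated by `polysome init` are the authoritative examples):

`prompts/simple_qa/user_prompt.txt`, a Jinja2 template:

```
Answer the following question concisely:

{{ question }}
```

`data/input.json`, where the keys must match the template variables:

```json
[
  { "question": "What is prompt chaining?" },
  { "question": "When is a DAG-based workflow useful?" }
]
```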
For reproducible environments or deployment:
```bash
# Run with Docker
docker run --rm --gpus all \
  -v ./data:/data \
  -v ./output:/output \
  -v ./workflows:/workflows \
  -v ./prompts:/prompts \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e WORKFLOW_PATH=/workflows/basic_text_generation.json \
  ghcr.io/computationalpathologygroup/polysome:latest
```

Note: We recommend mounting your Hugging Face cache to avoid re-downloading models.
For detailed Docker usage, see docs/docker_container.md.
Advanced users can also drive Polysome programmatically:

```python
from polysome.workflow import Workflow

# Load and run a workflow
workflow = Workflow("workflows/my_workflow.json")
success = workflow.run()

# Access results
print(f"Results saved to: {workflow.output_dir}")
```

Want to contribute or modify Polysome? See CONTRIBUTING.md for guidelines.
```bash
git clone https://github.com/computationalpathologygroup/Polysome.git
cd Polysome

# Install in development mode with dependencies
pip install -e ".[gpu-dev]"  # For GPU development
# OR
pip install -e ".[cpu-dev]"  # For CPU development
```

Run the tests:

```bash
pytest
pytest --cov=polysome  # With coverage
```

Workflows are defined in JSON files (DAGs) located in the `workflows/` directory. They control how data is loaded, processed by LLMs, and saved.
For a detailed guide on creating nodes and configuring JSONs, see docs/text_preprocessing.md.
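For orientation, a workflow file wires nodes into a DAG along roughly these lines. The node types and keys below are illustrative guesses, not the actual schema; consult the guide above and the files generated by `polysome init` for the real format.

```json
{
  "_comment": "Illustrative sketch only, not the real Polysome schema",
  "name": "basic_text_generation",
  "nodes": [
    { "id": "load", "type": "loader", "path": "data/input.json" },
    { "id": "generate", "type": "llm", "model": "google/gemma-3-4b-it",
      "prompts": "prompts/simple_qa", "inputs": ["load"] },
    { "id": "save", "type": "writer", "path": "output/", "inputs": ["generate"] }
  ]
}
```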
Polysome includes a Streamlit-based Prompt Editor to help you design, manage, and test Jinja2 templates for your LLM tasks.
```bash
# Run the editor
polysome-gui
```

For a user guide on managing templates and few-shot examples, see docs/prompt_editor.md.
For large-scale processing, Polysome supports Data Parallelism using vLLM to distribute batches across multiple GPUs.
To enable this, use the `vllm_dp` engine in your workflow configuration. See docs/data_parallelism.md for setup instructions and performance tuning.
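The exact placement of the engine setting is described in that guide; as a purely hypothetical fragment, switching engines might look like:

```json
{ "_comment": "Illustrative fragment only", "engine": "vllm_dp", "model": "google/gemma-3-4b-it" }
```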
This framework was originally developed to support visual instruction tuning. If you use this code to generate data for such models, please consider citing the following paper:
```bibtex
@inproceedings{moonemans2025open,
  title={Democratizing Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modeling},
  author={Sander Moonemans and Sebastiaan Ram and Fr{\'e}d{\'e}rique Meeuwsen and Carlijn Lems and Jeroen van der Laak and Geert Litjens and Francesco Ciompi},
  booktitle={Submitted to Medical Imaging with Deep Learning},
  year={2025},
  url={https://openreview.net/forum?id=aGPowreqPi},
  note={under review}
}
```