This tool enables asynchronous downloading and processing of immunofluorescence (IF) multi-channel localization images from the Human Protein Atlas (HPA) by parsing XML metadata. It supports concurrent downloads with configurable retries and delays, and processes images by converting them to grayscale, resizing with aspect ratio preservation, padding to 384×384 pixels, and saving as PNG files. Each protein's data is stored with metadata and comprehensive logs for easy dataset management.
- Asynchronous concurrent downloads with configurable concurrency, retries, and delay intervals.
- Beautiful progress bars and debug logging powered by rich.
- Automatic parsing of the HPA `.xml` interface to extract single-channel original image URLs.
- Image processing pipeline: grayscale conversion, aspect-ratio-preserving resizing, black padding to 384×384 pixels, and PNG saving.
- Generates an individual `metadata.json` for each protein.
- Aggregates dataset statistics into `dataset_stats.json`.
- Logs errors comprehensively in `error_log.txt`.
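
As a rough illustration of the XML parsing step, the sketch below collects image URLs from an XML document with the standard library. The sample XML, the `imageUrl` element name, the `example.org` URLs, and the helper `extract_image_urls` are all assumptions made for this example; they are not taken from this repository or guaranteed to match the real HPA schema.

```python
import xml.etree.ElementTree as ET

# Illustrative XML only: element names and URLs are assumptions,
# not a copy of the real HPA schema.
SAMPLE_XML = """<proteinAtlas>
  <entry>
    <image><imageUrl>https://example.org/images/1_A1_1_blue.jpg</imageUrl></image>
    <image><imageUrl>https://example.org/images/1_A1_1_green.jpg</imageUrl></image>
    <image><imageUrl>https://example.org/images/1_A1_1_blue.jpg</imageUrl></image>
  </entry>
</proteinAtlas>"""


def extract_image_urls(xml_text: str, tag: str = "imageUrl") -> list[str]:
    """Return unique, non-empty URLs found in <tag> elements, in document order."""
    root = ET.fromstring(xml_text)
    seen: set[str] = set()
    urls: list[str] = []
    for element in root.iter(tag):  # iter() walks the whole tree depth-first
        url = (element.text or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


urls = extract_image_urls(SAMPLE_XML)
print(urls)
```

Deduplicating while preserving order keeps the download list stable between runs, which makes retries and later verification easier to reason about.
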
- Clone or download this repository.
- Create a conda environment (optional) and install required packages:

    conda create -n hpasync python=3.10
    conda activate hpasync
    conda install aiohttp pillow rich # conda
    # or
    pip install -r requirements.txt # pip

Or simply install via pip without conda:

    pip install -r requirements.txt

The main dependencies include:
- `aiohttp` for asynchronous HTTP requests
- `rich` for progress bars and logging
- `Pillow` for image processing

Run the main script with your protein list and desired options:

    python main.py --protein-list protein_list.txt --outdir ./HPA_download --concurrency 8 --delay 0.5 --retries 3 --debug

| Argument | Short | Description | Default |
|---|---|---|---|
| `--protein-list` | `-p` | Path to a text file containing protein IDs (one per line). | Required |
| `--outdir` | `-o` | Output root directory for saving images and metadata. | `./HPA_download` |
| `--concurrency` | `-c` | Maximum number of concurrent download tasks. | 6 |
| `--delay` | `-d` | Request interval in seconds to avoid server overload. | 0.5 |
| `--retries` | `-r` | Number of retry attempts on failed requests. | 3 |
| `--timeout` | `-t` | Timeout per request in seconds. | 30 |
| `--debug` | `-D` | Enable DEBUG logging for detailed output. | False |
| `--help` | `-h` | Show help message and exit. | N/A |

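
To show how the `--concurrency`, `--delay`, and `--retries` options might interact, here is a minimal sketch built on `asyncio` alone. It is not the code in `main.py`: the helper `fetch_with_retries`, the simulated `flaky_fetch`, and the choice of one initial attempt plus `retries` further attempts are assumptions made for illustration. In the real tool the fetch function would presumably wrap an `aiohttp` request.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_with_retries(
    fetch: Callable[[str], Awaitable[bytes]],
    url: str,
    semaphore: asyncio.Semaphore,
    retries: int = 3,
    delay: float = 0.5,
) -> bytes:
    """Call fetch(url) under a concurrency limit, retrying failed attempts."""
    last_error = None
    for attempt in range(1 + retries):  # assumption: 1 initial try + `retries` retries
        async with semaphore:  # caps the number of requests in flight
            try:
                return await fetch(url)
            except Exception as exc:  # broad on purpose in this sketch
                last_error = exc
        if attempt < retries:
            await asyncio.sleep(delay)  # pause outside the semaphore before retrying
    raise RuntimeError(f"giving up on {url}") from last_error


async def demo() -> tuple[bytes, int]:
    calls = 0

    async def flaky_fetch(url: str) -> bytes:
        """Simulated download that fails twice, then succeeds."""
        nonlocal calls
        calls += 1
        if calls < 3:
            raise ConnectionError("simulated failure")
        return b"image-bytes"

    semaphore = asyncio.Semaphore(6)  # mirrors the default --concurrency
    data = await fetch_with_retries(
        flaky_fetch, "https://example.org/a.jpg", semaphore, retries=3, delay=0.01
    )
    return data, calls


result = asyncio.run(demo())
print(result)
```

Sleeping outside the `async with` block releases the semaphore slot while a task waits, so one failing URL does not block other downloads.
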
After running, the output directory will have the following structure:

    HPA_download/
    ├── protein_id_1/
    │   ├── image_1.png
    │   ├── image_2.png
    │   └── metadata.json
    ├── protein_id_2/
    │   ├── image_1.png
    │   └── metadata.json
    ├── dataset_stats.json
    └── error_log.txt

- Per-protein folder: Contains processed images and a `metadata.json` file describing the images.
- `dataset_stats.json`: Summarizes statistics across all processed proteins.
- error_log.txt: Logs any errors encountered during downloads or processing.
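
The aggregation into `dataset_stats.json` could look roughly like the sketch below, which counts PNG files per protein folder. The actual schema of `dataset_stats.json` is not documented here, so the field names (`num_proteins`, `num_images`, `images_per_protein`) and the helper `aggregate_stats` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def aggregate_stats(outdir: Path) -> dict:
    """Count PNG images in each protein folder under outdir."""
    per_protein = {}
    for folder in sorted(p for p in outdir.iterdir() if p.is_dir()):
        per_protein[folder.name] = len(list(folder.glob("*.png")))
    return {
        # Hypothetical field names; the real file may use a different schema.
        "num_proteins": len(per_protein),
        "num_images": sum(per_protein.values()),
        "images_per_protein": per_protein,
    }


# Build a throwaway directory that mirrors the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for protein, n_images in {"protein_id_1": 2, "protein_id_2": 1}.items():
        (root / protein).mkdir()
        for i in range(1, n_images + 1):
            (root / protein / f"image_{i}.png").touch()

    stats = aggregate_stats(root)
    stats_path = root / "dataset_stats.json"
    stats_path.write_text(json.dumps(stats, indent=2))
    written = json.loads(stats_path.read_text())

print(stats)
```
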
- Ensure your protein list file contains valid HPA protein IDs.
- To limit server load, requests are throttled by default; adjust `--delay` and `--concurrency` as needed.
- Debug mode is helpful for troubleshooting but may produce verbose output.
- The image processing pipeline ensures uniform image size and format for downstream analysis.
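
The resize-and-pad step can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the helper names `fit_within` and `process_image` are hypothetical, the resized image is assumed to be centered on the black canvas, and smaller images are assumed to be scaled up. The geometry helper uses only the standard library; `process_image` needs Pillow.

```python
def fit_within(width: int, height: int, target: int = 384) -> tuple[int, int, int, int]:
    """Return (new_width, new_height, x_offset, y_offset) for a centered fit.

    The longer side is scaled to `target`; the shorter side keeps the aspect ratio.
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return new_w, new_h, (target - new_w) // 2, (target - new_h) // 2


def process_image(src_path: str, dst_path: str, target: int = 384) -> None:
    """Grayscale, resize, pad with black to target x target, and save as PNG."""
    from PIL import Image  # Pillow; imported here so fit_within works without it

    with Image.open(src_path) as img:
        gray = img.convert("L")  # 8-bit grayscale copy
    new_w, new_h, x, y = fit_within(gray.width, gray.height, target)
    resized = gray.resize((new_w, new_h), Image.LANCZOS)
    canvas = Image.new("L", (target, target), 0)  # 0 = black padding
    canvas.paste(resized, (x, y))  # centering is an assumption of this sketch
    canvas.save(dst_path, format="PNG")


print(fit_within(2048, 1024))
print(fit_within(300, 600))
```

A call such as `process_image("raw.jpg", "image_1.png")` would then write a 384×384 grayscale PNG; the file names here are placeholders.
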
This tool includes a verification feature to ensure data integrity and consistency after downloading and processing.
- Compares the XML metadata with the generated `metadata.json` files to verify consistency.
- Checks the completeness and integrity of single-channel images to ensure no missing or corrupted files.
- Provides a summary table in the console indicating verification status per protein.
- Generates a detailed JSON report with verification results for all proteins.
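
A simplified version of the completeness and integrity check might look like the sketch below. It is not the logic in `verify.py`: the helpers `looks_like_png` and `verify_protein_dir`, the report fields, and the use of the 8-byte PNG signature as a cheap corruption test are assumptions for illustration. A full check would likely decode each image as well.

```python
import tempfile
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every valid PNG file


def looks_like_png(path: Path) -> bool:
    """Cheap integrity check: does the file start with the PNG signature?"""
    with path.open("rb") as handle:
        return handle.read(8) == PNG_SIGNATURE


def verify_protein_dir(folder: Path, expected_files: list[str]) -> dict:
    """Report which expected images are missing or fail the signature check."""
    missing = [name for name in expected_files if not (folder / name).is_file()]
    corrupted = [
        name
        for name in expected_files
        if (folder / name).is_file() and not looks_like_png(folder / name)
    ]
    return {
        # Hypothetical report fields, not the real verify_stats.json schema.
        "protein": folder.name,
        "missing": missing,
        "corrupted": corrupted,
        "ok": not missing and not corrupted,
    }


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "protein_id_1"
    folder.mkdir()
    (folder / "image_1.png").write_bytes(PNG_SIGNATURE + b"rest-of-file")
    (folder / "image_2.png").write_bytes(b"not a png")
    report = verify_protein_dir(
        folder, ["image_1.png", "image_2.png", "image_3.png"]
    )

print(report)
```
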
Run the verification script with the output directory containing the downloaded data:

    python verify.py

| Argument | Short | Description | Default |
|---|---|---|---|
| `--indir` | `-i` | Root directory of the downloaded dataset to verify. | `./HPA_download` |
| `--out` | `-o` | Output path for the summary JSON report. | `./verify_stats.json` |
| `--concurrency` | `-c` | Number of concurrent HTTP requests for verification. | 8 |
| `--debug` | `-D` | Enable debug logging to print detailed difference lists. | False |
| `--help` | `-h` | Show help message and exit. | N/A |

For questions or issues, please contact me.
This project is licensed under the MIT License.

