HPA Multi-Channel Localization Image Downloader and Processor (XML Parsing Version)

Overview

This tool enables asynchronous downloading and processing of immunofluorescence (IF) multi-channel localization images from the Human Protein Atlas (HPA) by parsing XML metadata. It supports concurrent downloads with configurable retries and delays, and processes images by converting them to grayscale, resizing with aspect ratio preservation, padding to 384×384 pixels, and saving as PNG files. Each protein's data is stored with metadata and comprehensive logs for easy dataset management.

Features

Asynchronous concurrent downloads with configurable concurrency, retries, and delay intervals.
Progress bars and debug logging powered by rich.
Automatic parsing of the HPA XML interface (.xml) to extract the URLs of the original single-channel images.
Image processing pipeline: grayscale conversion, aspect ratio preserving resizing, black padding to 384×384 pixels, and PNG saving.
Generates individual metadata.json for each protein.
Aggregates dataset statistics into dataset_stats.json.
Logs errors comprehensively in error_log.txt.
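
The image processing steps listed above can be sketched with Pillow roughly as follows. This is a minimal illustration, not the code in main.py: the function name `to_padded_grayscale`, the LANCZOS resampling filter, and the choice to centre the image on the black canvas are assumptions made for the example.

```python
from PIL import Image

TARGET = 384  # final width and height in pixels, as stated in the feature list


def to_padded_grayscale(img: Image.Image, size: int = TARGET) -> Image.Image:
    """Convert to 8-bit grayscale, fit inside size x size, pad with black."""
    gray = img.convert("L")
    # Scale so that the longer side equals `size`, keeping the aspect ratio.
    scale = size / max(gray.width, gray.height)
    new_w = max(1, round(gray.width * scale))
    new_h = max(1, round(gray.height * scale))
    resized = gray.resize((new_w, new_h), Image.LANCZOS)
    # Black square canvas; centring the image is an assumption of this sketch.
    canvas = Image.new("L", (size, size), 0)
    canvas.paste(resized, ((size - new_w) // 2, (size - new_h) // 2))
    return canvas
```

A processed image could then be written with `to_padded_grayscale(Image.open(path)).save(out_path, format="PNG")`, where `path` and `out_path` are placeholders.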

Installation and Dependencies

Clone or download this repository.
Create a conda environment (optional) and install required packages:

conda create -n hpasync python=3.10
conda activate hpasync
conda install aiohttp pillow rich

Alternatively, with or without conda, install the dependencies via pip:

pip install -r requirements.txt

The main dependencies include:

aiohttp for asynchronous HTTP requests
rich for progress bars and logging
Pillow for image processing

Usage

Run the main script with your protein list and desired options:

python main.py --protein-list protein_list.txt --outdir ./HPA_download --concurrency 8 --delay 0.5 --retries 3 --debug

Command Line Arguments

| Argument | Short | Description | Default |
| --- | --- | --- | --- |
| `--protein-list` | `-p` | Path to a text file containing protein IDs (one per line). | Required |
| `--outdir` | `-o` | Output root directory for saving images and metadata. | `./HPA_download` |
| `--concurrency` | `-c` | Maximum number of concurrent download tasks. | 6 |
| `--delay` | `-d` | Request interval in seconds to avoid server overload. | 0.5 |
| `--retries` | `-r` | Number of retry attempts on failed requests. | 3 |
| `--timeout` | `-t` | Timeout per request in seconds. | 30 |
| `--debug` | `-D` | Enable DEBUG logging for detailed output. | False |
| `--help` | `-h` | Show help message and exit. | N/A |
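
For readers who want to script around the tool, the options in the table map naturally onto an argparse parser. The sketch below shows one way such a parser might be declared; the function name `build_parser` and the help strings are illustrative, and the actual parser in main.py may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the options from the table above (-h/--help is added by argparse)."""
    parser = argparse.ArgumentParser(
        description="Download and process HPA immunofluorescence images."
    )
    parser.add_argument("-p", "--protein-list", required=True,
                        help="text file with one protein ID per line")
    parser.add_argument("-o", "--outdir", default="./HPA_download",
                        help="output root directory")
    parser.add_argument("-c", "--concurrency", type=int, default=6,
                        help="maximum number of concurrent download tasks")
    parser.add_argument("-d", "--delay", type=float, default=0.5,
                        help="request interval in seconds")
    parser.add_argument("-r", "--retries", type=int, default=3,
                        help="retry attempts on failed requests")
    parser.add_argument("-t", "--timeout", type=float, default=30,
                        help="timeout per request in seconds")
    parser.add_argument("-D", "--debug", action="store_true",
                        help="enable DEBUG logging")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```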

Output Files and Structure

After running, the output directory will have the following structure:

HPA_download/
├── protein_id_1/
│   ├── image_1.png
│   ├── image_2.png
│   └── metadata.json
├── protein_id_2/
│   ├── image_1.png
│   └── metadata.json
├── dataset_stats.json
└── error_log.txt

Per protein folder: Contains processed images and a metadata.json file describing the images.
dataset_stats.json: Summarizes statistics across all processed proteins.
error_log.txt: Logs any errors encountered during downloads or processing.

Notes

Ensure your protein list file contains valid HPA protein IDs.
By default the tool limits server load with a 0.5-second request delay and at most 6 concurrent downloads; both values are configurable.
Debug mode is helpful for troubleshooting but may produce verbose output.
The image processing pipeline ensures uniform image size and format for downstream analysis.
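
The interplay of concurrency, retries, and delay mentioned in these notes can be illustrated with plain asyncio. This is a sketch under stated assumptions rather than the tool's implementation: `fetch_with_retries` and `fetch_all` are hypothetical names, `fetch` stands for any coroutine that downloads one URL (for example a wrapper around an aiohttp session), and the delay is applied only before each retry.

```python
import asyncio


async def fetch_with_retries(fetch, url, retries=3, delay=0.5):
    """Await fetch(url); on failure sleep `delay` seconds and retry.

    Assumption: `retries` counts attempts after the first one.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return await fetch(url)
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
            if attempt < retries:
                await asyncio.sleep(delay)
    raise last_error


async def fetch_all(fetch, urls, concurrency=6, retries=3, delay=0.5):
    """Fetch all URLs with at most `concurrency` requests in flight.

    Returns a dict mapping each URL to its result, or to the exception
    that remained after the final retry.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(url):
        async with semaphore:
            try:
                return url, await fetch_with_retries(fetch, url, retries, delay)
            except Exception as exc:
                return url, exc

    return dict(await asyncio.gather(*(worker(u) for u in urls)))
```

Returning the exception instead of raising it lets one failed URL be written to an error log without cancelling the remaining downloads.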

Verification

This tool includes a verification feature to ensure data integrity and consistency after downloading and processing.

Functionality

Compares the XML metadata with the generated metadata.json files to verify consistency.
Checks the completeness and integrity of single-channel images to ensure no missing or corrupted files.
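
The consistency check described above amounts to comparing two sets of image URLs. The sketch below shows the idea with the standard library only. The element name `imageUrl`, the metadata key `image_urls`, and the function names are assumptions made for illustration; the real XML schema and the layout of metadata.json should be checked against the tool's output.

```python
import json
import xml.etree.ElementTree as ET


def urls_from_xml(xml_text, tag="imageUrl"):
    """Collect the text of every <tag> element (tag name is an assumption)."""
    root = ET.fromstring(xml_text)
    return {el.text.strip() for el in root.iter(tag) if el.text and el.text.strip()}


def compare(xml_text, metadata_text, key="image_urls"):
    """Diff the URLs found in the XML against those recorded in the metadata.

    `key` is a hypothetical field name for the URL list in metadata.json.
    """
    expected = urls_from_xml(xml_text)
    recorded = set(json.loads(metadata_text).get(key, []))
    return {
        "ok": expected == recorded,
        "missing": sorted(expected - recorded),     # in XML, not in metadata
        "unexpected": sorted(recorded - expected),  # in metadata, not in XML
    }
```

A per-protein report could collect the returned dictionaries and flag any entry whose `ok` value is false.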

Output

Provides a summary table in the console indicating verification status per protein.
Generates a detailed JSON report with verification results for all proteins.

Usage Example

Run the verification script with the output directory containing the downloaded data:

python verify.py --indir ./HPA_download --out ./verify_stats.json

Command Line Arguments

Argument	Short	Description	Default
`--indir`	`-i`	Root directory of the downloaded dataset to verify.	`./HPA_download`
`--out`	`-o`	Output path for the summary JSON report.	`./verify_stats.json`
`--concurrency`	`-c`	Number of concurrent HTTP requests for verification.	8
`--debug`	`-D`	Enable debug logging to print detailed difference lists.	False
`--help`	`-h`	Show help message and exit.	N/A

For questions or issues, please contact me.

License

This project is licensed under the MIT License.
