
This tool asynchronously downloads and processes multi-channel images from the Human Protein Atlas (HPA) by parsing XML metadata. It supports concurrent downloads with retries, converts images to grayscale, resizes and pads to 384×384 pixels, saves PNGs, and maintains metadata and logs per protein.

pacmandoh/hpasync

HPA Multi-Channel Localization Image Downloader and Processor (XML Parsing Version)

Overview

This tool enables asynchronous downloading and processing of immunofluorescence (IF) multi-channel localization images from the Human Protein Atlas (HPA) by parsing XML metadata. It supports concurrent downloads with configurable retries and delays, and processes images by converting them to grayscale, resizing with aspect ratio preservation, padding to 384×384 pixels, and saving as PNG files. Each protein's data is stored with metadata and comprehensive logs for easy dataset management.
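As a rough illustration of the XML-parsing step, the sketch below collects image URLs from an XML document using only the standard library. The element name `imageUrl`, the sample document, and the function name `extract_image_urls` are assumptions made for this example; the actual script may walk the HPA XML differently.

```python
import xml.etree.ElementTree as ET


def extract_image_urls(xml_text: str, tag: str = "imageUrl") -> list[str]:
    """Return the de-duplicated text of every <tag> element, in document order.

    The default tag name is an assumption, not taken from the tool's source.
    """
    root = ET.fromstring(xml_text)
    seen: set[str] = set()
    urls: list[str] = []
    for elem in root.iter(tag):
        url = (elem.text or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical, heavily simplified document used only to exercise the function.
SAMPLE_XML = """
<proteinAtlas>
  <entry>
    <image><imageUrl>https://example.org/img/1_A1_1_red.jpg</imageUrl></image>
    <image><imageUrl>https://example.org/img/1_A1_1_green.jpg</imageUrl></image>
    <image><imageUrl>https://example.org/img/1_A1_1_red.jpg</imageUrl></image>
  </entry>
</proteinAtlas>
"""

sample_urls = extract_image_urls(SAMPLE_XML)
```

Using `root.iter(tag)` keeps the sketch independent of how deeply the image elements are nested.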

Features

  • Asynchronous concurrent downloads with configurable concurrency, retries, and delay intervals.
  • Beautiful progress bars and debug logging powered by rich.
  • Automatic parsing of HPA .xml interface to extract single-channel original image URLs.
  • Image processing pipeline: grayscale conversion, aspect ratio preserving resizing, black padding to 384×384 pixels, and PNG saving.
  • Generates individual metadata.json for each protein.
  • Aggregates dataset statistics into dataset_stats.json.
  • Logs errors comprehensively in error_log.txt.
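The concurrency, retry, and delay behaviour listed above can be sketched with plain `asyncio`. This is a minimal sketch, not the tool's implementation: the function names are hypothetical, the `fetch` coroutine is injected so the example runs without network access, and treating `retries` as extra attempts after the first failure is an assumption.

```python
import asyncio
from typing import Awaitable, Callable

Fetch = Callable[[str], Awaitable[bytes]]


async def fetch_with_retries(fetch: Fetch, url: str, *, retries: int = 3,
                             delay: float = 0.5, timeout: float = 30.0) -> bytes:
    """Call fetch(url); on failure, wait `delay` seconds and retry up to `retries` times."""
    last_error: Exception | None = None
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(fetch(url), timeout=timeout)
        # Real aiohttp code would probably also catch aiohttp.ClientError here.
        except (asyncio.TimeoutError, OSError) as exc:
            last_error = exc
            if attempt < retries:
                await asyncio.sleep(delay)
    assert last_error is not None
    raise last_error


async def download_all(fetch: Fetch, urls: list[str], *, concurrency: int = 6,
                       **retry_options) -> dict[str, object]:
    """Fetch all URLs with at most `concurrency` requests in flight.

    Returns a dict mapping each URL to its payload, or to the exception
    that caused the final attempt to fail.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(url: str):
        async with semaphore:
            try:
                return url, await fetch_with_retries(fetch, url, **retry_options)
            except (asyncio.TimeoutError, OSError) as exc:
                return url, exc

    return dict(await asyncio.gather(*(worker(u) for u in urls)))
```

Returning the exception instead of raising it lets one failed image be logged without cancelling the rest of the batch.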

Installation and Dependencies

  1. Clone or download this repository.
  2. Create a conda environment (optional) and install required packages:
conda create -n hpasync python=3.10
conda activate hpasync
conda install aiohttp pillow rich # conda
# or
pip install -r requirements.txt # pip

Or simply install via pip without conda:

pip install -r requirements.txt

The main dependencies include:

  • aiohttp for asynchronous HTTP requests
  • rich for progress bars and logging
  • Pillow for image processing

Usage

Run the main script with your protein list and desired options:

python main.py --protein-list protein_list.txt --outdir ./HPA_download --concurrency 8 --delay 0.5 --retries 3 --debug

Command Line Arguments

| Argument | Short | Description | Default |
| --- | --- | --- | --- |
| --protein-list | -p | Path to a text file containing protein IDs (one per line). | Required |
| --outdir | -o | Output root directory for saving images and metadata. | ./HPA_download |
| --concurrency | -c | Maximum number of concurrent download tasks. | 6 |
| --delay | -d | Request interval in seconds to avoid server overload. | 0.5 |
| --retries | -r | Number of retry attempts on failed requests. | 3 |
| --timeout | -t | Timeout per request in seconds. | 30 |
| --debug | -D | Enable DEBUG logging for detailed output. | False |
| --help | -h | Show help message and exit. | N/A |
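For readers who want to script around the tool, the options above map naturally onto `argparse`. The sketch below mirrors the documented flags and defaults; the function name `build_parser` and the argument types (for example `float` for the timeout) are assumptions, and `main.py` may define its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (-h/--help is added automatically)."""
    parser = argparse.ArgumentParser(
        description="Download and process HPA multi-channel images."
    )
    parser.add_argument("-p", "--protein-list", required=True,
                        help="Text file with one protein ID per line.")
    parser.add_argument("-o", "--outdir", default="./HPA_download",
                        help="Output root directory.")
    parser.add_argument("-c", "--concurrency", type=int, default=6,
                        help="Maximum number of concurrent download tasks.")
    parser.add_argument("-d", "--delay", type=float, default=0.5,
                        help="Request interval in seconds.")
    parser.add_argument("-r", "--retries", type=int, default=3,
                        help="Retry attempts on failed requests.")
    parser.add_argument("-t", "--timeout", type=float, default=30,
                        help="Timeout per request in seconds.")
    parser.add_argument("-D", "--debug", action="store_true",
                        help="Enable DEBUG logging.")
    return parser


# Parse the same options as the usage example above.
example_args = build_parser().parse_args(
    ["--protein-list", "protein_list.txt", "--concurrency", "8", "--debug"]
)
```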

Output Files and Structure

After running, the output directory will have the following structure:

HPA_download/
├── protein_id_1/
│   ├── image_1.png
│   ├── image_2.png
│   └── metadata.json
├── protein_id_2/
│   ├── image_1.png
│   └── metadata.json
├── dataset_stats.json
└── error_log.txt
  • Per protein folder: Contains processed images and a metadata.json file describing the images.
  • dataset_stats.json: Summarizes statistics across all processed proteins.
  • error_log.txt: Logs any errors encountered during downloads or processing.
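Given the layout above, a quick way to inspect a finished run is to walk the output root and count the PNG files per protein folder. This is a small sketch under that layout; the function name `summarize_output` and the shape of the returned dictionary are hypothetical and unrelated to the contents of dataset_stats.json.

```python
from pathlib import Path


def summarize_output(root) -> dict[str, dict]:
    """Map each protein folder to its PNG count and whether metadata.json exists."""
    summary: dict[str, dict] = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue  # skip dataset_stats.json and error_log.txt at the root
        summary[folder.name] = {
            "images": sum(1 for _ in folder.glob("*.png")),
            "has_metadata": (folder / "metadata.json").is_file(),
        }
    return summary
```

Calling `summarize_output("./HPA_download")` after a run returns one entry per protein folder, which makes folders without images or metadata easy to spot.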

Notes

  • Ensure your protein list file contains valid HPA protein IDs.
  • To limit server load, the tool defaults to a 0.5-second request interval and 6 concurrent downloads; both can be changed with --delay and --concurrency.
  • Debug mode is helpful for troubleshooting but may produce verbose output.
  • The image processing pipeline ensures uniform image size and format for downstream analysis.
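The image processing steps (grayscale conversion, aspect-ratio-preserving resize, black padding to 384×384) can be sketched with Pillow as follows. Centering the image on the canvas, the LANCZOS filter, upscaling of smaller images, and the function name are assumptions made for this example rather than details confirmed by the tool.

```python
from PIL import Image

TARGET_SIZE = 384


def to_padded_grayscale(img: Image.Image, size: int = TARGET_SIZE) -> Image.Image:
    """Convert to 8-bit grayscale, fit the longer side to `size`, pad with black."""
    gray = img.convert("L")
    scale = size / max(gray.width, gray.height)
    new_w = max(1, round(gray.width * scale))
    new_h = max(1, round(gray.height * scale))
    resized = gray.resize((new_w, new_h), Image.LANCZOS)

    canvas = Image.new("L", (size, size), 0)  # 0 = black padding
    # Centering is an assumption; the tool may anchor the image differently.
    canvas.paste(resized, ((size - new_w) // 2, (size - new_h) // 2))
    return canvas
```

A processed image can then be written with `to_padded_grayscale(img).save("image_1.png", format="PNG")`.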

Verification

This tool includes a verification feature to ensure data integrity and consistency after downloading and processing.

Functionality

  • Compares the XML metadata with the generated metadata.json files to verify consistency.
  • Checks the completeness and integrity of single-channel images to ensure no missing or corrupted files.

Output

  • Provides a summary table in the console indicating verification status per protein.
  • Generates a detailed JSON report with verification results for all proteins.

Usage Example

Run the verification script with the output directory containing the downloaded data:

python verify.py --indir ./HPA_download --out ./verify_stats.json

Command Line Arguments

| Argument | Short | Description | Default |
| --- | --- | --- | --- |
| --indir | -i | Root directory of the downloaded dataset to verify. | ./HPA_download |
| --out | -o | Output path for the summary JSON report. | ./verify_stats.json |
| --concurrency | -c | Number of concurrent HTTP requests for verification. | 8 |
| --debug | -D | Enable debug logging to print detailed difference lists. | False |
| --help | -h | Show help message and exit. | N/A |

For questions or issues, please contact me.

License

This project is licensed under the MIT License.
