JINGWEI is a deep learning framework for imputing missing values in proteomic data. It supports two methods: DMF (Deep Matrix Factorization) and DCAE (Dilated Convolutional AutoEncoder).
- Multiple Imputation Methods: Support for DMF and DCAE algorithms
- Flexible Architecture: Configurable network architectures and hyperparameters
- GPU Acceleration: CUDA support with specific GPU selection
- Comprehensive Logging: TensorBoard integration for training monitoring
- Early Stopping: Prevent overfitting with configurable patience
- Batch Processing: Efficient batch training with customizable batch sizes
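To illustrate the patience mechanism behind the Early Stopping feature listed above, the following sketch tracks a monitored loss and signals a stop once it has failed to improve for a set number of consecutive checks. The `EarlyStopper` class, the `epochs_run` helper, and the `min_delta` parameter are hypothetical names used for illustration, not part of JINGWEI; only the default of 20 is taken from the documented `--patience` option.

```python
class EarlyStopper:
    """Hypothetical helper; not JINGWEI's actual implementation."""

    def __init__(self, patience=20, min_delta=0.0):
        self.patience = patience      # checks to wait without improvement
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, loss):
        """Record one loss value and report whether training should stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


def epochs_run(losses, patience):
    """Return how many epochs run before stopping (or all of them)."""
    stopper = EarlyStopper(patience=patience)
    for epoch, loss in enumerate(losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(losses)


losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
print(epochs_run(losses, patience=3))  # 6: three checks in a row without improvement
print(epochs_run(losses, patience=5))  # 7: the late improvement resets the counter
```

A larger patience tolerates longer plateaus at the cost of extra epochs.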
- Python 3.12
- CUDA-capable GPU (optional, but recommended)
It is recommended to use conda to manage the environment.
conda create -n jingwei python=3.12
conda activate jingwei
Install the required packages:
pip install -r requirements.txt
Or install manually:
pip install torch pytorch-lightning pandas numpy matplotlib seaborn tensorboard scipy scikit-learn
# Basic usage with DMF method
./src/JINGWEI.sh --data-path data/your_dataset.csv
# Use DCAE method with GPU 1
./src/JINGWEI.sh --data-path data/Alzheimer.csv --method DCAE --device cuda --gpu-id 1
# Custom parameters with early stopping
./src/JINGWEI.sh --data-path data/your_dataset.csv \
--method DMF \
--hidden-dims 512 256 128 \
--embedding-dim 128 \
--early-stopping \
--max-epochs 100
--data-path PATH: Path to input CSV file
--method {DMF,DCAE}: Imputation method (default: DMF)
--hidden-dims DIMS: Hidden layer dimensions, space-separated (default: "256 128")
--batch-size SIZE: Batch size for training (default: 1024)
--learning-rate RATE: Learning rate (default: 0.001)
--weight-decay DECAY: Weight decay for optimizer (default: 0.00001)
--gradient-clip VALUE: Gradient clipping value (default: 1.0)
--embedding-dim DIM: Embedding dimension (default: 64)
--latent-dim DIM: Latent dimension (default: 64)
--num-encoder-blocks NUM: Number of encoder blocks (default: 2)
--num-decoder-blocks NUM: Number of decoder blocks (default: 2)
--dilation VALUE: Dilation factor (default: 2)
--mask-weight WEIGHT: Weight for mask prediction loss (default: 0.5)
--reconstruction-weight WEIGHT: Weight for reconstruction loss (default: 1.0)
--max-epochs EPOCHS: Maximum training epochs (default: 200)
--early-stopping: Enable early stopping
--patience PATIENCE: Patience for early stopping (default: 20)
--device {cpu,cuda,auto}: Device to use (default: auto)
--gpu-id ID: Specific GPU ID to use (0, 1, etc.)
--results-dir DIR: Directory for saving results (default: ./results)
--log-interval INTERVAL: Logging interval in steps (default: 50)
--progress-bar: Show progress bar during training
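The options above map naturally onto Python's `argparse`. The sketch below declares a subset of them with the documented defaults. It is an assumption about how `train.py` might parse its arguments, not a copy of it: `build_parser` is a hypothetical name, and treating `--data-path` as required is a guess.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring a subset of the documented options."""
    p = argparse.ArgumentParser(prog="JINGWEI")
    # Assumption: the input path is mandatory; the README does not say so explicitly.
    p.add_argument("--data-path", required=True, help="Path to input CSV file")
    p.add_argument("--method", choices=["DMF", "DCAE"], default="DMF")
    # nargs="+" accepts space-separated values such as: --hidden-dims 512 256 128
    p.add_argument("--hidden-dims", type=int, nargs="+", default=[256, 128])
    p.add_argument("--batch-size", type=int, default=1024)
    p.add_argument("--learning-rate", type=float, default=0.001)
    p.add_argument("--max-epochs", type=int, default=200)
    p.add_argument("--early-stopping", action="store_true")
    p.add_argument("--patience", type=int, default=20)
    p.add_argument("--device", choices=["cpu", "cuda", "auto"], default="auto")
    p.add_argument("--gpu-id", type=int, default=None)
    return p


# Parse an explicit argument list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(
    ["--data-path", "data/your_dataset.csv", "--method", "DCAE",
     "--hidden-dims", "512", "256", "128"]
)
print(args.method, args.hidden_dims, args.batch_size)  # DCAE [512, 256, 128] 1024
```

Note that `argparse` converts dashes in option names to underscores, so `--hidden-dims` is read back as `args.hidden_dims`.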
The input CSV file should have the following format:
- First row: Header (will be skipped)
- First column: Sample IDs/names (will be skipped)
- Remaining columns: Protein expression data
- Missing values: Use 0, negative values, or NaN
Example:
Sample_ID,Protein_1,Protein_2,Protein_3,...
Sample_001,1.23,0.45,NaN,...
Sample_002,2.34,0,1.67,...
Sample_003,1.45,1.23,2.89,...
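The format rules above can be sketched as a small parser that drops the header row and the sample ID column, then marks 0, negative, and NaN cells as missing. `parse_expression_csv` is a hypothetical helper written with the standard library only; it is not JINGWEI's loader in `datasets.py`, and treating empty cells as missing is an added assumption.

```python
import csv
import io
import math


def parse_expression_csv(text):
    """Return (values, observed_mask) as lists of rows.

    Hypothetical helper: skips the header row and the first column, and
    marks 0, negative, NaN, or empty cells as missing (stored as 0.0).
    """
    rows = list(csv.reader(io.StringIO(text)))[1:]  # drop the header row
    values, mask = [], []
    for row in rows:
        if not row:  # ignore blank lines
            continue
        vals, obs = [], []
        for cell in row[1:]:  # drop the sample ID column
            cell = cell.strip()
            x = float(cell) if cell else math.nan  # assumption: empty means missing
            observed = not math.isnan(x) and x > 0
            vals.append(x if observed else 0.0)
            obs.append(observed)
        values.append(vals)
        mask.append(obs)
    return values, mask


sample = (
    "Sample_ID,Protein_1,Protein_2,Protein_3\n"
    "Sample_001,1.23,0.45,NaN\n"
    "Sample_002,2.34,0,1.67\n"
    "Sample_003,1.45,1.23,2.89\n"
)
values, mask = parse_expression_csv(sample)
print(mask[0])  # [True, True, False]
print(mask[1])  # [True, False, True]
```

The mask separates entries the model can learn from (observed) from entries it must fill in.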
JINGWEI generates the following outputs in the results directory:
results/
├── checkpoints/                      # Model checkpoints
├── logs/                             # TensorBoard logs
└── outputs/
    └── {METHOD}_{DATASET}_{TIMESTAMP}/
        ├── config.json               # Training configuration
        ├── imputed_data.csv          # Imputed protein data
        ├── training_metrics.csv      # Training loss history
        └── model_final.ckpt          # Final trained model
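Because each run writes to its own `{METHOD}_{DATASET}_{TIMESTAMP}` directory, a small helper can locate the newest run for a given method and dataset. The sketch below picks the most recently modified matching directory rather than parsing the timestamp, since the timestamp format is not documented here. `latest_run_dir` is a hypothetical name, and using the CSV file stem (for example `Alzheimer`) as `DATASET` is an assumption.

```python
from pathlib import Path


def latest_run_dir(results_dir, method, dataset):
    """Return the most recently modified {METHOD}_{DATASET}_* directory, or None."""
    outputs = Path(results_dir) / "outputs"
    if not outputs.is_dir():
        return None
    runs = [p for p in outputs.glob(f"{method}_{dataset}_*") if p.is_dir()]
    # Compare by modification time; default=None covers the "no runs yet" case.
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


run = latest_run_dir("./results", "DMF", "Alzheimer")
if run is not None:
    print(run / "imputed_data.csv")
else:
    print("no matching run found")
```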
./src/JINGWEI.sh --data-path data/Alzheimer.csv \
--method DMF \
--hidden-dims 512 256 128 64 \
--embedding-dim 128 \
--mask-weight 0.3 \
--learning-rate 0.0005 \
--max-epochs 150 \
--early-stopping \
--progress-bar
./src/JINGWEI.sh --data-path data/Alzheimer.csv \
--method DCAE \
--device cuda \
--gpu-id 1 \
--latent-dim 128 \
--num-encoder-blocks 3 \
--num-decoder-blocks 3 \
--dilation 4 \
--batch-size 512
./src/JINGWEI.sh --data-path data/Alzheimer.csv \
--device cpu \
--results-dir ./my_results \
--max-epochs 50 \
--log-interval 10
DMF (Deep Matrix Factorization):
- Uses row and column embeddings to capture latent patterns
- Suitable for collaborative filtering-style missing data
- Good for datasets with structured missing patterns
DCAE (Dilated Convolutional AutoEncoder):
- Uses dilated convolutions to capture long-range dependencies
- Suitable for sequential or structured protein data
- Better for complex missing data patterns
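To make the long-range claim for dilated convolutions concrete, the sketch below implements a 1-D dilated convolution in plain Python and computes the receptive field of a stack of such layers. The kernel size of 3, the dilation schedules, and the function names are illustrative assumptions; this is not the DCAE implementation.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1-D dilated convolution (cross-correlation, as in deep learning)."""
    span = (len(kernel) - 1) * dilation  # distance between the first and last tap
    return [
        sum(w * signal[i + k * dilation] for k, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilations):
    """Inputs seen by one output after stacking stride-1 layers with these dilations."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


x = list(range(10))
# With dilation 2 the three taps sit at offsets 0, 2, 4, so each output is x[i] - x[i + 4].
print(dilated_conv1d(x, [1, 0, -1], dilation=2))  # [-4, -4, -4, -4, -4, -4]
print(receptive_field(3, [1, 1, 1]))  # 7
print(receptive_field(3, [1, 2, 4]))  # 15
```

With the same three layers and the same number of weights, growing the dilation roughly doubles the span of input positions each output can see.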
tensorboard --logdir results/logs
Monitor the following metrics:
train_loss: Overall training loss
reconstruction_loss: Data reconstruction quality
mask_loss: Missing data pattern prediction accuracy
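One plausible way these three metrics relate is a weighted sum controlled by `--reconstruction-weight` and `--mask-weight`. The sketch below is an assumption about that relationship, not the confirmed JINGWEI loss: it uses mean squared error over observed entries for reconstruction and binary cross-entropy for mask prediction, and all function names are hypothetical.

```python
import math


def reconstruction_loss(pred, target, observed):
    """Mean squared error over observed entries only."""
    errs = [(p - t) ** 2 for p, t, o in zip(pred, target, observed) if o]
    return sum(errs) / len(errs) if errs else 0.0


def mask_loss(mask_prob, observed, eps=1e-7):
    """Binary cross-entropy between predicted observed-probabilities and the true mask."""
    if not observed:
        return 0.0
    total = 0.0
    for q, o in zip(mask_prob, observed):
        q = min(max(q, eps), 1 - eps)  # clamp to keep log() finite
        total -= math.log(q) if o else math.log(1 - q)
    return total / len(observed)


def total_loss(rec, mask, reconstruction_weight=1.0, mask_weight=0.5):
    """Assumed combination: weighted sum using the documented default weights."""
    return reconstruction_weight * rec + mask_weight * mask


rec = reconstruction_loss([1.0, 2.0, 3.0], [1.5, 0.0, 3.0], [True, False, True])
msk = mask_loss([0.5, 0.5], [True, False])
print(rec)  # 0.125
print(total_loss(rec, msk))
```

Under this assumption, lowering `--mask-weight` shifts the optimizer's attention toward reconstructing observed values.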
- CUDA Out of Memory
  - Reduce --batch-size
  - Use --device cpu for CPU training
- Shape Mismatch Errors
  - Check CSV format (ensure first column is skipped)
  - Verify data contains only numeric values
- Slow Training
  - Use GPU acceleration with --device cuda
  - Increase --batch-size if memory allows
- Poor Performance
  - Adjust --mask-weight (try 0.1-0.8)
  - Experiment with different --hidden-dims
  - Enable --early-stopping
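The two checks listed under Shape Mismatch Errors can be run before training with a short standard-library script. The sketch below reports cells that cannot be parsed as numbers and rows whose length differs from the header. `find_non_numeric_cells` and `find_ragged_rows` are hypothetical helpers, not part of JINGWEI, and line numbers assume no quoted fields that span several lines.

```python
import csv
import io


def find_non_numeric_cells(text):
    """Return (line_number, column_name, value) for cells float() cannot parse.

    The header row and the first (sample ID) column are not validated.
    "NaN" parses as a float and is accepted; empty cells are reported.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    problems = []
    for line_no, row in enumerate(reader, start=2):
        for name, cell in zip(header[1:], row[1:]):
            try:
                float(cell)
            except ValueError:
                problems.append((line_no, name, cell))
    return problems


def find_ragged_rows(text):
    """Return (line_number, field_count) for rows whose length differs from the header."""
    reader = csv.reader(io.StringIO(text))
    width = len(next(reader))
    return [(n, len(r)) for n, r in enumerate(reader, start=2) if len(r) != width]


text = "Sample_ID,P1,P2\nS1,1.0,NaN\nS2,n/a,2.0\nS3,3.0\n"
print(find_non_numeric_cells(text))  # [(3, 'P1', 'n/a')]
print(find_ragged_rows(text))        # [(4, 2)]
```

To check a real file, read it first, for example with `Path("data/your_dataset.csv").read_text()`, and pass the result to both functions.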
For help with parameters:
./src/JINGWEI.sh --help
JINGWEI/
├── README.md
├── requirements.txt
├── src/
│   ├── JINGWEI.sh        # Main training script
│   ├── train.py          # Python training interface
│   ├── datasets.py       # Data loading utilities
│   ├── models.py         # Model architectures
│   └── methods/
│       ├── DMF.py        # DMF implementation
│       └── DCAE.py       # DCAE implementation
└── data/
    └── your_datasets.csv
This project is licensed under the MIT License.
- Initial release
- Support for DMF and DCAE methods
- GPU acceleration
- Comprehensive parameter configuration
- TensorBoard integration