NETD: A Dynamic Non-I.I.D. Encrypted Traffic Dataset

Note:

⭐ Please leave a STAR if you like this project! ⭐
If you find any incorrect / inappropriate / outdated content, please kindly consider opening an issue or a PR.

This repository contains the NETD (Dynamic Non-I.I.D. Encrypted Traffic Dataset), a dataset designed to support research on Out-of-Distribution (O.O.D.) generalization for encrypted traffic classification. NETD is constructed by introducing controlled distributional shifts into existing public datasets, allowing researchers to evaluate model robustness under varying conditions.

This dataset was developed as part of the research for the paper:

Lin, Xinjie and Xiong, Gang and Gou, Gaopeng and Dong, Wenqi and Yu, Jing and Li, Zhen and Xia, Wei. 2025. Respond to Change with Constancy: Instruction-tuning with LLM for Non-I.I.D. Network Traffic Classification. IEEE Transactions on Information Forensics and Security.

Motivation

Most contemporary traffic classification research relies on datasets that assume the training and testing data are Independent and Identically Distributed (I.I.D.). However, this assumption is often unrealistic in real-world network scenarios where application updates and user behavior change over time, causing "distribution drift". This drift degrades the performance of models trained on older data.

NETD was created to address this gap by providing a dataset that:

Is the first of its kind to support the dynamic adjustment of traffic distribution bias in a controlled manner.
Simulates realistic distributional shifts using principled strategies.
Enables a more rigorous evaluation of O.O.D. generalization in traffic classification models.

The dataset is constructed from the publicly available ISCX-VPN dataset, which includes traffic from 17 applications across 6 service categories (Chat, Email, File Transfer, P2P, Streaming, VoIP).

Construction Methodology

NETD simulates distributional shifts by manipulating two key factors: Proportional Bias and Compositional Bias. This is achieved by treating a specific network behavior (e.g., traffic from one application) as a "principal component" and others within the same class as "secondary components".

1. Proportional Bias

This setting simulates a shift in the prevalence of intra-class behaviors. We ensure all components of a class are present in both training and testing sets, but we alter their ratios. For a given service category, one application is randomly selected as the dominant component. The bias is controlled by the Dominant Ratio:

$$\text{Dominant Ratio} = \frac{N_{Dominant}}{N_{Minor}}$$

where $N_{Dominant}$ is the number of samples from the dominant application, and $N_{Minor}$ is the average number of samples from the other applications in that class. By setting one ratio for the training set and a different one for the test set, we create a distributional shift.
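As a rough illustration of the Dominant Ratio, the following Python sketch computes the ratio for one class and draws per-application subsets that hit a target ratio. It is not the authors' construction code: the function names (`dominant_ratio`, `sample_with_ratio`), the application labels, the sample counts, and the rounding rule are all assumptions made for this example.

```python
import random
from statistics import mean


def dominant_ratio(counts, dominant):
    """Return N_Dominant / N_Minor, where N_Minor is the mean count of the other apps."""
    minor_counts = [n for app, n in counts.items() if app != dominant]
    if not minor_counts:
        raise ValueError("need at least one non-dominant application")
    return counts[dominant] / mean(minor_counts)


def sample_with_ratio(samples_by_app, dominant, ratio, n_minor, seed=0):
    """Draw n_minor samples per minor app and round(ratio * n_minor) from the dominant app."""
    rng = random.Random(seed)
    subset = {}
    for app, samples in samples_by_app.items():
        k = round(ratio * n_minor) if app == dominant else n_minor
        if k > len(samples):
            raise ValueError(f"not enough samples for {app}: need {k}, have {len(samples)}")
        subset[app] = rng.sample(samples, k)
    return subset


# Hypothetical class with three applications and 1000 placeholder flows each.
chat = {app: [f"{app}_{i}" for i in range(1000)] for app in ("app_a", "app_b", "app_c")}

# Different ratios for training and testing create the proportional shift.
train = sample_with_ratio(chat, dominant="app_c", ratio=3.0, n_minor=100)
test = sample_with_ratio(chat, dominant="app_c", ratio=1 / 3, n_minor=300, seed=1)

train_ratio = dominant_ratio({a: len(s) for a, s in train.items()}, "app_c")
test_ratio = dominant_ratio({a: len(s) for a, s in test.items()}, "app_c")
print(train_ratio, test_ratio)
```

Every application appears in both splits; only the relative prevalence of the dominant application changes between them.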

2. Compositional Bias

This setting simulates a more extreme shift where the training data fails to cover the complete distribution of the test data. This is achieved by varying the number of constituent applications for each service category between the training and testing sets. For example, the training set might only contain traffic from a subset of applications for a "Streaming" service, while the test set contains traffic from all applications.
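A compositional split of this kind might be sketched in Python as below. This is only an illustration under stated assumptions: `compositional_split`, the placeholder application names, the random selection of the training subset, and the floor-with-minimum-of-one rounding rule are hypothetical and are not taken from the paper or the repository code.

```python
import math
import random


def compositional_split(apps_by_class, train_fraction, seed=0):
    """For each class, keep a random subset of applications for training; test keeps all."""
    rng = random.Random(seed)
    train_apps, test_apps = {}, {}
    for service, apps in apps_by_class.items():
        # Assumed rounding rule: floor, but keep at least one application
        # so that the class still appears in the training set.
        k = max(1, math.floor(train_fraction * len(apps)))
        train_apps[service] = sorted(rng.sample(apps, k))
        test_apps[service] = sorted(apps)
    return train_apps, test_apps


# Hypothetical service class with five placeholder applications.
apps_by_class = {"Streaming": ["app_a", "app_b", "app_c", "app_d", "app_e"]}

train_apps, test_apps = compositional_split(apps_by_class, train_fraction=0.8)

# Applications that the model only encounters at test time.
unseen = set(test_apps["Streaming"]) - set(train_apps["Streaming"])
print(train_apps["Streaming"], sorted(unseen))
```

Lowering `train_fraction` leaves more applications unseen during training, which corresponds to a stronger compositional shift.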

Dataset Variants

We provide six datasets generated with different distribution shifts:

NETD-1: Constructed using a proportional bias strategy. The training set is generated by randomly sampling with a dominant-to-minor component ratio of 1:3.
NETD-2: Also constructed using proportional bias, but with a dominant-to-minor component ratio of 3:1 in the training set.
NETD-3: Constructed using a compositional bias strategy. The training set is built from only 80% of the applications within each target service class, while the test set contains the full data.
NETD-4: A more extreme version of NETD-3. The training set is built from only 20% of the applications within each target service class.
APP53-Time: Used for classifying encrypted application traffic across a time span (one-month interval). Compared to the categories listed in the paper, it lacks the bbc.mobile.weather category.
APP53-Version: Used for classifying encrypted application traffic across a version span (before and after a version update).

Dataset Access

You can download the complete datasets from the following links:

NETD
- APP53-Time: Download APP53-Time from Cloud Drive
- APP53-Version: Download APP53-Version from Cloud Drive

How to Cite

If you use NETD in your research, please cite our paper:

@article{lin2025etool,
  title={Respond to Change with Constancy: Instruction-tuning with LLM for Non-IID Network Traffic Classification},
  author={Lin, Xinjie and Xiong, Gang and Gou, Gaopeng and Dong, Wenqi and Yu, Jing and Li, Zhen and Xia, Wei},
  journal={IEEE Transactions on Information Forensics and Security},
  volume={20},
  pages={5758-5773},
  year={2025}
}
