Learning Multi-modal Representations from Network Traffic

In this repository, we provide the source code and data used in the experiments of our paper "Learning Multi-modal Representations from Network Traffic", presented at the [Conference Name] conference in [Year]. The paper presents a multi-modal autoencoder (MAE) architecture for network traffic classification, and evaluates it on different traffic analysis tasks.

The MAE architecture consists of multiple adaptation modules for handling different input measurements, and an integration module for creating intermediate embeddings. We demonstrate the benefits of this approach by implementing it on three different traffic analysis tasks and comparing it to alternative deep learning solutions. Our results show that the MAE performs on par with or better than these alternatives, and that the representation it generates is rich and compact.
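
The structure described above can be sketched in Keras as follows. This is a minimal illustration, not the model from the paper: the two input modalities, their dimensions, the layer sizes, and the function name `build_mae` are all assumptions chosen for the example.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_mae(input_dims, adapt_units=32, embed_dim=16):
    """Build a toy multi-modal autoencoder (all sizes are illustrative)."""
    inputs, adapted = [], []
    # One adaptation module per input measurement (here a single Dense layer).
    for name, dim in input_dims.items():
        inp = keras.Input(shape=(dim,), name=f"{name}_in")
        inputs.append(inp)
        adapted.append(
            layers.Dense(adapt_units, activation="relu", name=f"{name}_adapt")(inp)
        )

    # Integration module: merge the adapted inputs into one compact embedding.
    merged = layers.Concatenate(name="merge")(adapted)
    embedding = layers.Dense(embed_dim, activation="relu", name="embedding")(merged)

    # One decoder head per modality reconstructs the original input.
    outputs = [
        layers.Dense(dim, name=f"{name}_out")(embedding)
        for name, dim in input_dims.items()
    ]

    autoencoder = keras.Model(inputs, outputs, name="mae")
    encoder = keras.Model(inputs, embedding, name="encoder")
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder


# Hypothetical modalities: 10 numeric quantities and a 50-dimensional entity vector.
mae, encoder = build_mae({"quantities": 10, "entities": 50})
```

In a setup like this, the autoencoder would be trained to reconstruct its own inputs, and the `encoder` model would then be used to obtain the embedding for each sample.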

This repo includes an implementation of the multi-modal autoencoder (MAE) architecture proposed in the paper, using Python 3.7.9 and Keras. We also provide scripts for training the models and notebooks for running the experiments and reproducing the results reported in the paper.

We hope that this repository will be useful for researchers interested in network traffic analysis and multi-modal autoencoders. Please feel free to contact us if you have any questions or suggestions.

Project structure

The repository is organized as follows:

notebooks: This folder contains Jupyter notebooks for each of the three case studies, as well as notebooks for hyperparameter tuning and results aggregation and visualization.
training: This folder contains scripts for training the MAE and deep classifiers for all case studies, as well as for the grid search.

Dependencies

To run the code in this repository, you will need the following software and libraries:

Python 3.6 or higher
gensim
community

How to reproduce results in the paper?

Note: This guide assumes a Debian-like system (tested on Ubuntu 20.04 & Debian 11).

Clone this repository

$ git clone https://github.com/[USERNAME]/[REPOSITORY]

Download the datasets and install the prerequisites. Open a terminal and run:
```
$ ./setup.sh
```
This script first downloads the datasets from https://url_tbd, then creates a new virtual environment and installs the required Python libraries. Finally, it starts a Jupyter notebook server.
Run the notebooks. To run the first notebook:
```
$ jupyter-lab 01-darknet-overview.ipynb
```
Note that each notebook refers to a single fold of the stratified k-fold cross-validation.
When you have finished exploring the notebooks, remember to deactivate the virtual environment:
```
$ deactivate
```
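
For readers unfamiliar with the stratified k-fold cross-validation mentioned above, the idea can be sketched in plain Python. This is only an illustration under assumptions: the function name `stratified_folds` and the toy labels are hypothetical, k = 5 is inferred from the fold numbers 0-4 used by the training scripts, and the repository may build its folds differently (for example with scikit-learn's `StratifiedKFold`).

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of this class round-robin across the folds,
        # so every fold keeps roughly the original class proportions.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


labels = ["a"] * 10 + ["b"] * 5
folds = stratified_folds(labels, k=5)

# Fold 0 is held out for testing; the remaining folds form the training set.
test_idx = folds[0]
train_idx = [i for fold in folds[1:] for i in fold]
```

Each fold serves once as the held-out set, which is why results are reported per fold.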

Datasets

The datasets used in the paper can be downloaded from [INSERT URL HERE] using the password provided. These datasets are:

MIRAGE: a dataset of network traffic flows for mobile app classification, with 44,000 samples and 16 classes.
DARKNET: a dataset of IP addresses for classifying groups based on coordinates, with 14,000 samples and 13 + 1 classes.
ISCXVPN2016: a dataset of network flows for traffic type classification, with 609 samples and 5 classes.

We provide the datasets in preprocessed form, using Tstat to extract quantities and Word2Vec to extract entities from a corpus file in .txt format.

Once the datasets are downloaded and unzipped, they are ready to use. You can then follow the instructions in the notebooks to preprocess the data and train the models.

Notebooks

Feature engineering
- Task01 - Quantities and entities extraction
- Task02 - Quantities and entities extraction
- Task03 - Quantities and entities extraction
- Datasets characterization
Task01: Traffic application classification
- Validation of deep classifiers
- Embeddings neighborhood evaluation
- Shallow learners
- Unsupervised clustering
Task02: Darknet traffic classification
- Validation of deep classifiers
- Embeddings neighborhood evaluation
Task03: Traffic category classification
- Validation of deep classifiers
- Embeddings neighborhood evaluation
Grid Search
- Validation of deep classifiers
Conclusions
- Experiments finalization
- Data visualization

Training the models

We provide the scripts to train models for (i) different tasks and folds and (ii) MAE gridsearch on task01.

Regular training

To train the MAE and the classifiers for a single task, users should specify the task number, model type, number of epochs, batch size, and fold number as command line arguments.

To run the training, open a terminal, move to the training folder and run:

python train_tasks.py -t TASK_NUMBER -m MODEL_TYPE -e NUM_EPOCHS -b BATCH_SIZE -f FOLD_NUMBER

Arguments

TASK_NUMBER: The task number to run (01, 02, or 03). Default: 01
MODEL_TYPE: The model type to run (mae, classifiers, or all). Default: all
NUM_EPOCHS: The number of epochs to run. Default: 20
BATCH_SIZE: The batch size to use. Default: 128
FOLD_NUMBER: The fold number to run (0-4 or all). Default: all
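
The flags and defaults listed above could be declared with `argparse` roughly as follows. This is a sketch that mirrors the documented interface, not the actual contents of `train_tasks.py`; the attribute names (`task`, `model`, `epochs`, `batch_size`, `fold`) and the helper `folds_to_run` are assumptions.

```python
import argparse


def build_parser():
    """Parser mirroring the flags and defaults documented above."""
    parser = argparse.ArgumentParser(description="Train the MAE and/or the classifiers")
    parser.add_argument("-t", dest="task", choices=["01", "02", "03"], default="01")
    parser.add_argument("-m", dest="model", choices=["mae", "classifiers", "all"],
                        default="all")
    parser.add_argument("-e", dest="epochs", type=int, default=20)
    parser.add_argument("-b", dest="batch_size", type=int, default=128)
    parser.add_argument("-f", dest="fold", default="all",
                        choices=["0", "1", "2", "3", "4", "all"])
    return parser


def folds_to_run(fold):
    """Expand the -f value into a list of fold indices."""
    return list(range(5)) if fold == "all" else [int(fold)]


args = build_parser().parse_args(
    ["-t", "02", "-m", "classifiers", "-e", "15", "-b", "256", "-f", "3"]
)
print(args.task, args.model, args.epochs, args.batch_size, folds_to_run(args.fold))
# prints: 02 classifiers 15 256 [3]
```

Omitting a flag falls back to the default shown in the list, so an empty command line corresponds to task 01, all models, 20 epochs, batch size 128, and all folds.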

Examples

To run the classifiers model for task 02 with 15 epochs and a batch size of 256 for fold 3:

$ cd training
$ python train_tasks.py -t 02 -m classifiers -e 15 -b 256 -f 3

To run the MAE model for task 01 with 20 epochs and a batch size of 128 for all 5 folds:

$ cd training
$ python train_tasks.py -t 01 -m mae -e 20 -b 128

To run both the MAE and classifiers models for task 03 with 10 epochs and a batch size of 512 for all 5 folds:

$ cd training
$ python train_tasks.py -t 03 -e 10 -b 512
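
To train all three tasks in one go, the commands above can be generated by a small driver script. The sketch below only builds and prints the command lines using the documented flags; the function name `training_commands` is hypothetical, and it assumes the script is started from the training folder.

```python
import sys


def training_commands(tasks=("01", "02", "03"), model="all", epochs=20, batch_size=128):
    """Build one train_tasks.py command line per task, using the documented flags."""
    return [
        [sys.executable, "train_tasks.py", "-t", task, "-m", model,
         "-e", str(epochs), "-b", str(batch_size)]
        for task in tasks
    ]


for cmd in training_commands(epochs=10, batch_size=512):
    print(" ".join(cmd))
    # To launch the training instead of printing the command,
    # add `import subprocess` and call: subprocess.run(cmd, check=True)
```

Because `-f` is omitted, each generated command trains on all folds, as in the last example above.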

GridSearch training

To run the grid search, open a terminal, move to the training folder and run:

python train_gridsearch.py -e NUM_EPOCHS -b BATCH_SIZE -f FOLD_NUMBER

Arguments

NUM_EPOCHS: The number of epochs to run. Default: 20
BATCH_SIZE: The batch size to use. Default: 128
FOLD_NUMBER: The fold number to run (0-4 or all). Default: all
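
A grid search trains one model per combination of candidate hyperparameter values. The sketch below shows how such a grid can be expanded with `itertools.product`; the hyperparameter names and values are purely hypothetical and are not the search space used in the paper or by the grid search script.

```python
from itertools import product

# Hypothetical search space; the values explored in the paper may differ.
grid = {
    "embedding_size": [16, 32, 64],
    "learning_rate": [1e-3, 1e-4],
}


def expand_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


configs = list(expand_grid(grid))
print(len(configs))  # 3 * 2 = 6 combinations
```

Each resulting configuration would then be trained for the requested number of epochs on each selected fold, so the total cost grows with the product of the grid sizes and the number of folds.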

Examples

To run the gridsearch training with 15 epochs and a batch size of 256 for fold 3:

$ cd training
$ python train_gridsearch.py -e 15 -b 256 -f 3

To run the gridsearch training with 15 epochs and a batch size of 256 for all 5 folds:

$ cd training
$ python train_gridsearch.py -e 15 -b 256

Citations

If you use this code or data in your research, please cite our paper:

@article{authors20xx,
  title={Multi-modal Autoencoders for Network Traffic Analysis},
  author={Authors, A. and Authors, B.},
  journal={Journal Name},
  year={20xx}
}

Todo

Notebooks:
- Finalization and visualization
  - t-SNE - task01
Documentation:
- README
- Notebooks
  - Starting datasets and characterization
  - Finalization and visualization
- References to the datasets and papers
- More detailed instruction on how to download and setup experiments (once agreed)
