This MATLAB project uses two different neural networks to classify Darknet traffic samples into three classes: Tor, VPN, and Benign (Non-Tor + NonVPN).
The `datasets` directory contains two datasets in the CSV format:
- `Darknet.csv` - also known as CIC-Darknet2020
- `Darknet_preprocessed.csv` - preprocessed dataset
In order to generate the preprocessed dataset and to get the label distribution, run `data_preprocessing/main.m`.
From the dataset, only 28 features were considered:
 | | | |
---|---|---|---|
Average Packet Size | FIN Flag Count | Fwd Packets/s | Packet Length Mean |
Bwd Init Win Bytes | Flow Duration | Fwd Segment Size Avg | Packet Length Std |
Bwd Packet Length Max | Flow IAT Max | Fwd Seg Size Min | Packet Length Variance |
Bwd Packet Length Mean | Flow IAT Mean | Idle Max | Protocol |
Bwd Packet Length Min | Flow IAT Min | Idle Mean | Subflow Bwd Bytes |
Bwd Packets/s | Fwd Header Length | Idle Min | Subflow Fwd Packets |
Bwd Segment Size Avg | FWD Init Win Bytes | Packet Length Max | Total Length of Bwd Packet |
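For illustration only, selecting those columns could look like the snippet below; the CSV headers are assumed to match the names above exactly, and the actual preprocessing lives in `data_preprocessing/main.m`:

```matlab
% Hypothetical feature selection: keep only the 28 listed columns.
T = readtable('datasets/Darknet.csv', 'VariableNamingRule', 'preserve');
features = ["Average Packet Size", "FIN Flag Count", "Fwd Packets/s", ...
            "Packet Length Mean", "Flow Duration", "Protocol"];  % ...and the rest of the 28
X = T(:, features);   % table restricted to the selected features
```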
Data was normalized within the range [-1, 1] using the Z-Score, which measures the distance of a data point from the mean in terms of the standard deviation, preserving the shape properties of the original data.
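A minimal sketch of that normalization step, assuming the feature matrix `X` holds one sample per row and one of the 28 features per column:

```matlab
% Z-Score normalization: distance from the per-feature mean in units of
% the per-feature standard deviation.
mu    = mean(X);                 % per-feature mean
sigma = std(X);                  % per-feature standard deviation
Xnorm = (X - mu) ./ sigma;
% Equivalent one-liner (base MATLAB R2018a+): Xnorm = normalize(X, 'zscore');
```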
Due to the notable scarcity of Tor samples in comparison to other types of traffic, data augmentation was performed using the SMOTE (Synthetic Minority Over-sampling Technique) function. This yields a more representative dataset with diverse and abundant samples from each class, helping to avoid overfitting.
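MATLAB has no built-in SMOTE, so the project presumably relies on its own or a third-party implementation; the following is only an illustrative sketch of the core idea (interpolating between a minority sample and one of its nearest minority-class neighbours), using `knnsearch` from the Statistics and Machine Learning Toolbox:

```matlab
% Illustrative SMOTE sketch, not the exact function used by the project.
function Xsyn = smote_sketch(Xmin, k, nSynthetic)
    % Xmin       - minority-class samples (rows = samples, cols = features)
    % k          - number of nearest neighbours to consider
    % nSynthetic - number of synthetic samples to generate
    idx = knnsearch(Xmin, Xmin, 'K', k + 1);   % first neighbour is the point itself
    idx = idx(:, 2:end);                       % drop the self-matches
    Xsyn = zeros(nSynthetic, size(Xmin, 2));
    for i = 1:nSynthetic
        p = randi(size(Xmin, 1));              % random minority sample
        q = idx(p, randi(k));                  % random neighbour of that sample
        lambda = rand;                         % interpolation factor in [0, 1]
        Xsyn(i, :) = Xmin(p, :) + lambda * (Xmin(q, :) - Xmin(p, :));
    end
end
```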
Data was divided into three subsets:
 | Non-Tor | NonVPN | VPN | Tor |
---|---|---|---|---|
Training (60%) | 56,014 | 14,318 | 13,751 | 8,352 |
Validation (20%) | 18,671 | 4,773 | 4,584 | 2,784 |
Testing (20%) | 18,671 | 4,772 | 4,584 | 2,784 |
Total | 93,356 | 23,863 | 22,919 | 13,920 |
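A minimal sketch of such a random 60/20/20 split (the project's scripts handle this themselves; `X` and `Y` are placeholder feature and label variables):

```matlab
rng(1);                                  % fix the seed for reproducibility
n      = size(X, 1);                     % number of samples (one per row)
perm   = randperm(n);                    % random permutation of sample indices
nTrain = round(0.6 * n);
nVal   = round(0.2 * n);
trainIdx = perm(1:nTrain);
valIdx   = perm(nTrain+1 : nTrain+nVal);
testIdx  = perm(nTrain+nVal+1 : end);
Xtrain = X(trainIdx, :);  Ytrain = Y(trainIdx);
Xval   = X(valIdx, :);    Yval   = Y(valIdx);
Xtest  = X(testIdx, :);   Ytest  = Y(testIdx);
```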
The MLP comprises an input layer with 28 nodes, corresponding to the number of features, and two hidden layers with 5 nodes each. The output layer is composed of 3 nodes, one per class. The model was trained with the Levenberg-Marquardt backpropagation algorithm, using the mean squared error (MSE) as the performance metric, in a parallel execution environment.
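A minimal sketch of this MLP, assuming the Deep Learning Toolbox; `X` (28-by-N features) and `T` (3-by-N one-hot targets) are placeholders, and MATLAB's built-in data division is used here as shorthand for the 60/20/20 split:

```matlab
% Two hidden layers of 5 nodes, Levenberg-Marquardt training (trainlm),
% MSE performance function, and parallel training enabled.
net = feedforwardnet([5 5], 'trainlm');
net.performFcn = 'mse';
net.divideParam.trainRatio = 0.6;   % 60/20/20 split, as in the table above
net.divideParam.valRatio   = 0.2;
net.divideParam.testRatio  = 0.2;
[net, tr] = train(net, X, T, 'useParallel', 'yes');
```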
The CNN is composed of an input layer where each input sample has a height of 28 pixels, a width of 1 pixel, and a single channel (grayscale). It also comprises a convolutional layer followed by a rectified linear unit (ReLU) activation function, which is applied element-wise to the output of the convolutional layer. This structure is followed by three fully connected layers: the first two have 5 neurons each, while the last consists of 3 neurons, corresponding to the number of output classes. A softmax layer converts the outputs of the last fully connected layer into a probability distribution over the classes, and a final classification layer assigns the predicted class based on the highest probability. This model was trained on the training subset using the Adam optimizer, for a maximum of 8 epochs with a mini-batch size of 256, in a parallel execution environment.
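A minimal sketch of this CNN, assuming the Deep Learning Toolbox; the filter size and number of filters are illustrative guesses, and `XTrain`/`YTrain`/`XVal`/`YVal` are placeholders for 28x1x1xN image arrays and categorical labels:

```matlab
% Layer stack: 28x1x1 "image" input, one conv + ReLU block, three fully
% connected layers (5, 5, 3 neurons), softmax and classification output.
layers = [
    imageInputLayer([28 1 1])
    convolution2dLayer([3 1], 8, 'Padding', 'same')  % illustrative size/count
    reluLayer
    fullyConnectedLayer(5)
    fullyConnectedLayer(5)
    fullyConnectedLayer(3)
    softmaxLayer
    classificationLayer];

% Adam optimizer, 8 epochs, mini-batch size 256, parallel execution.
options = trainingOptions('adam', ...
    'MaxEpochs', 8, ...
    'MiniBatchSize', 256, ...
    'ExecutionEnvironment', 'parallel', ...
    'ValidationData', {XVal, YVal});
net = trainNetwork(XTrain, YTrain, layers, options);
```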
Running `mlp/main.m` and/or `cnn/main.m` will do the following:
- Read `datasets/Darknet_preprocessed.csv` and split the data into training, validation and testing subsets.
- Create and train the MLP/CNN model, respectively.
- Generate the confusion matrix after feeding the testing subset (see the sketch below).
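As a rough sketch of that last step for the CNN case (assuming a network `net` returned by `trainNetwork` and placeholder test data `XTest`/`YTest`):

```matlab
% Hypothetical evaluation snippet: predict on the testing subset and plot
% the confusion matrix. For the MLP ('network' object), the analogous call
% would be plotconfusion(TTest, net(XTest)).
YPred = classify(net, XTest);   % predicted class labels for the test set
confusionchart(YTest, YPred);   % confusion matrix of true vs. predicted labels
```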
The results from the experiments are presented below in the form of confusion matrices, showcasing the performance of the Multilayer Perceptron (left) and the Convolutional Neural Network (right):
The evaluation metrics used to assess the performance of the models can be seen in the table below; Precision, Recall, and F1 are reported per class:
Metric | Equation | MLP | CNN |
---|---|---|---|
Accuracy | (TP + TN)/(TP + FP + TN + FN) | 0.94 | 0.91 |
Precision | TP/(TP + FP) | 0.87 | 0.89 |
 | | 0.86 | 0.78 |
 | | 0.97 | 0.94 |
Recall | TP/(TP + FN) | 0.99 | 0.94 |
 | | 0.81 | 0.69 |
 | | 0.97 | 0.95 |
F1 | 2TP/(2TP + FP + FN) | 0.93 | 0.91 |
 | | 0.83 | 0.73 |
 | | 0.97 | 0.94 |
TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative