Massive data streams generated by IoT devices provide new opportunities for learning from such streams in real time. Traditional machine learning algorithms require data sets of fixed size, available prior to the training of the model. When data continuously arrives at high speed, we must train machine learning models in a different way. Our project is motivated by the incremental training of a binary classifier from an IoT data stream. We analyze a labeled dataset from Kaggle [1] containing measurements from environmental sensors such as temperature, humidity, smoke, gas, movement, and light, with light being our class label. We predict the class label for unseen examples using two data stream machine learning models: the Hoeffding Tree classifier and the Naive Bayes classifier. These algorithms differ from traditional machine learning methods because they process one example at a time and use a limited amount of memory, which allows them to learn from massive unbounded data streams in real time.
Keywords — Spark Streaming, Supervised Learning, Decision Trees, Data Streams, skmultiflow library.
IoT sensors allow the collection of immense amounts of data from our environment, which can be used to create smart autonomous buildings. For example, in public buildings such as hospitals, it is important to control and predict the indoor environment. In case of emergencies or failures, the facility will be able to automatically regulate the indoor parameters using models that learn from previous usage. This allows us to take full advantage of the IoT sensor data and to realize near real-time control of the building. The data stream is continuously processed and analyzed on premises, and we can detect anomalies and trigger alarms accordingly.
Our goal is to build a binary classifier to predict the usage of light from an IoT data stream. One approach would be a cloud-based system in which all IoT sensors send measurements to an endpoint on the cloud for storage and for training of the model. There are three main problems with this setting:
- The volume and the velocity of the data will create network overhead.
- It might be too slow to train the model given the volume of the data.
- If we want to update the model as we receive new data, we will have to restart the learning process.
The question is whether it is possible to build an accurate decision tree on premises, without storing the data and with a limited amount of memory. To solve this problem we can build the decision tree in a different way, as proposed by P. Domingos and G. Hulten [3].
Prior to the Hoeffding Tree model proposed in [3], there were systems that performed batch learning [4]; however, these systems needed external storage and required multiple passes over the data. The advantage of Hoeffding Trees is that they require a single pass over the data and are thus suitable for classifying high-speed streams. Several enhancements of the algorithm have been introduced to improve its performance and to generalize the idea [5], [6].
Moreover, A. Mukherjee et al. [8] have already studied Naive Bayes and Decision Tree classifiers for streaming data using HBase; in this project, we use Apache Spark together with the scikit-learn and scikit-multiflow packages to build, evaluate, and improve the overall performance of the algorithms.
Our use case is based on an environmental data set [1] generated by three arrays of environmental sensors taking different measurements: carbon monoxide, humidity, gas, smoke, temperature, detected motion, and light. We assume that all sensors feed into one central controller on a single input port. We combined the results from all devices in order to obtain a generalized model.
[Table 1: The nine columns of our data set]
column | description | units |
---|---|---|
ts | timestamp of event | epoch |
device | unique device name | string |
co | carbon monoxide | ppm (%) |
humidity | humidity | percentage |
light | light detected? | boolean |
lpg | liquid petroleum gas | ppm (%) |
motion | motion detected? | boolean |
smoke | smoke | ppm (%) |
temp | temperature | Fahrenheit |
The data spans the period from 07/12/2020 00:00:00 UTC – 07/19/2020 23:59:59 UTC (8 days). There is a total of 405,184 rows of data points. The average rate is 1 entry every 1.33 seconds. The longest period without any entries is 6 seconds.
During the experiment, each of the three IoT devices was placed in a different physical location with varied environmental conditions: some locations were cooler and more humid, others had highly variable temperature and humidity, and the third type was warm and dry. Thus, we have a variety of data. The data is accurate, there are no missing values, and there are only a few duplicates, for which we kept only the first row. Hence the data set has the property of veracity. The collected data is valuable for monitoring and control of the indoor environment in smart buildings.
Our target label is “light” and has Boolean values: 0 (light off) and 1 (light on). We apply supervised learning algorithms to predict this label for unseen examples. 28% of the values in the column “light” are True and the remaining 72% are False. So, the light being on is not a rare event, but the dataset is imbalanced.
[Proportion of the classes]
We investigated the distribution of the class label over time and found that the target is not evenly distributed.
[Changes in the class distribution over time - concept drift]
In addition, the heatmap of the correlation matrix using the Pearson method shows that the target label "light" and the feature "temperature" are highly correlated. Hence, temperature will be an important feature for the classification algorithm.
[Heatmap of the correlation matrix]
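For reproducibility, here is a minimal sketch of how such a heatmap can be produced with pandas and seaborn; the file name and plotting details are illustrative, not taken from our pipeline:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Kaggle data set [1]; the file name is illustrative.
df = pd.read_csv("iot_telemetry_data.csv")
features = ["co", "humidity", "light", "lpg", "motion", "smoke", "temp"]

# Pearson correlation; boolean columns are cast to 0/1 first.
corr = df[features].astype(float).corr(method="pearson")

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pearson correlation of the sensor features")
plt.show()
```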
The language for this project was Python, used together with the scikit-learn machine learning library and pandas dataframes. We also used the Apache Spark Structured Streaming library [7] to create the streaming context and execute the stream machine learning algorithms available in the scikit-multiflow library [10]. To build the multinomial Naive Bayes model we import MultinomialNB from scikit-learn, and for the Hoeffding Tree we import HoeffdingTreeClassifier from scikit-multiflow. Decision trees (e.g., the Hoeffding Tree) and Naive Bayes (NB) are among the most widely used models in the data science community for data stream machine learning, which is why we picked them. SVM is also accurate, but it is computationally very expensive, and it performs poorly on imbalanced datasets due to its soft-margin optimization problem.
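As a minimal sketch of this setup (import paths as in scikit-learn and scikit-multiflow [10]; the helper and feature layout are illustrative):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from skmultiflow.trees import HoeffdingTreeClassifier

nb = MultinomialNB()              # requires non-negative feature values
ht = HoeffdingTreeClassifier()
classes = np.array([0, 1])        # 0 = light off, 1 = light on

def learn_one(model, x, y):
    """Update an incremental model with a single (features, label) example."""
    X = np.asarray(x, dtype=float).reshape(1, -1)
    model.partial_fit(X, [y], classes=classes)

# e.g. learn_one(ht, [co, humidity, lpg, motion, smoke, temp], light)
```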
We measured the accuracy of our models, i.e., the percentage of instances of the test set that were correctly classified. Moreover, we calculated precision, recall, and F1-score for the built models to understand what proportion of instances is predicted correctly. Additionally, we performed error analysis to understand where the learner went wrong, used the confusion matrix to measure the performance of the model, and tuned hyperparameters based on our observations to adjust the model accordingly.
We also compared the C4.5 [2] and Hoeffding Tree algorithms. Both are supervised algorithms that build a decision tree to classify the data points. The main difference is that the Hoeffding Tree does not need the entire dataset to estimate the best split, and it needs to see each sample only once.
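The statistical tool that makes this possible is the Hoeffding bound used in [3]: after observing $n$ independent samples of a random variable with range $R$ (e.g., an information gain), the true mean differs from the observed mean by more than

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

with probability at most $\delta$. A node is split as soon as the observed difference in the split criterion between the two best attributes exceeds $\epsilon$, so the split chosen from the sample agrees, with high probability, with the split that would be chosen from infinite data.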
Our main challenge was that our classifiers were biased towards the majority class. We applied the following three techniques to address the problem of the imbalanced data set.
- Class weights
We used the method compute_sample_weight from the scikit-learn library to obtain class weights inversely proportional to the class frequencies in the input data. With the class weights we place more emphasis on the minority class.
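A minimal sketch of this weighting, assuming examples arrive in small batches so that class frequencies can be computed per batch; the helper name is ours:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight
from skmultiflow.trees import HoeffdingTreeClassifier

ht = HoeffdingTreeClassifier()

def train_weighted_batch(model, X, y):
    # 'balanced' sets each weight inversely proportional to the frequency
    # of its class, so minority-class examples count for more.
    weights = compute_sample_weight(class_weight="balanced", y=y)
    model.partial_fit(X, y, classes=np.array([0, 1]), sample_weight=weights)
```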
- Reservoir Sampling
With Reservoir Sampling we fill a preallocated buffer, called a reservoir, with uniformly sampled elements from the data stream. In our case, we used only data points from the majority class to fill the reservoir, while using as many data points as possible from the minority class to train the model (we were able to use 70% of the minority class samples). Our goal was to undersample the majority class. Once we reached a given number of training samples from the minority class, we used the reservoir to train the model with the same number of data points from the majority class, so that the model is trained with an equal number of examples from both classes. The pseudocode of our algorithm is provided below, followed by a runnable sketch.
[Algorithm: Reservoir Sampling]
Fix N - the size of the reservoir
while get_next_sample from the data stream:
    **test** the model on the current sample
    if the sample is from the majority class:
        add it to the reservoir (reservoir sampling)
    if the sample is from the minority class:
        **train** the model on the sample
        if the model has been trained with N minority samples:
            **train** with the N majority samples from the reservoir
            empty the reservoir
return model and metrics
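A runnable sketch of this procedure under our assumptions (a stream of (features, label) pairs, class 0 treated as the majority class; helper names are ours, not from the original code):

```python
import random
import numpy as np
from skmultiflow.trees import HoeffdingTreeClassifier

def reservoir_train(stream, N=100, majority=0):
    """Test-then-train with reservoir under-sampling of the majority class."""
    model = HoeffdingTreeClassifier(grace_period=10)
    reservoir, seen, n_minority = [], 0, 0
    correct = total = 0
    trained = False
    for x, y in stream:                          # one example at a time
        X = np.asarray(x, dtype=float).reshape(1, -1)
        if trained:                              # test first ...
            total += 1
            correct += int(model.predict(X)[0] == y)
        if y == majority:                        # majority: into the reservoir
            seen += 1
            if len(reservoir) < N:
                reservoir.append((X, y))
            else:
                j = random.randrange(seen)       # keep with probability N/seen
                if j < N:
                    reservoir[j] = (X, y)
        else:                                    # minority: ... then train
            model.partial_fit(X, [y], classes=[0, 1])
            n_minority += 1
            trained = True
            if n_minority == N:                  # flush: N majority samples
                for Xr, yr in reservoir:
                    model.partial_fit(Xr, [yr], classes=[0, 1])
                reservoir, seen, n_minority = [], 0, 0
    return model, correct / max(total, 1)
```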
- Apply SMOTE (Over-Sampling)
To create a balanced dataset for the Naive Bayes classifier, we also applied SMOTE over-sampling to the training set: SMOTE creates synthetic minority-class samples by interpolating the characteristics of existing minority-class instances, transforming the imbalanced dataset into a balanced one. A sketch follows the illustration below.
[Illustration of Under-Sampling and Over-Sampling]
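A minimal sketch using the SMOTE implementation from the imbalanced-learn package (our choice of implementation; X_train and y_train come from the train/test split described next):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.naive_bayes import GaussianNB

# Over-sample only the training split; the test split stays untouched.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_res))   # class counts before/after

gnb = GaussianNB().fit(X_res, y_res)            # train on the balanced data
```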
The dataset was split into 80% for training the models and 20% for evaluating model performance. Results on both the balanced and the imbalanced dataset will be presented and explained. In addition, we used the test-then-train method, which is suitable when no separate test data is available: in a streaming environment, where data keeps arriving, it is hard to split it into training and test sets. With test-then-train, for each data point we first make a prediction using the current model and then use the same data point to update the model. We count the number of correctly predicted labels versus the total number of instances, which gives us the prequential error [9].
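Following [9], after $t$ examples the prequential error is the running average of the 0-1 loss of the predictions, each made before the corresponding example is used for training:

$$E_{preq}(t) = \frac{1}{t} \sum_{i=1}^{t} \mathbf{1}\left[\hat{y}_i \neq y_i\right]$$

and prequential accuracy is $1 - E_{preq}(t)$.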
To measure model performance, we apply four metrics: accuracy_score, recall_score, precision_score, and f1_score. We also used the confusion matrix (CM) to analyse class-wise accuracy.
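All of these are available in scikit-learn's metrics module; a minimal sketch, where y_true and y_pred are the labels and predictions collected during evaluation:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1-score :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows = true class, columns = predicted
```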
Our results show that the applied machine learning models perform well in predicting the target label. The chosen methods are suitable for training in the context of data streams.
- Using a holdout test set
A holdout test set was used to evaluate the performance of the incremental training of the Hoeffding Tree with class weights. We set aside part of the data set for testing before starting the training, using the method train_test_split from scikit-learn to randomly split the data into 85% for the training set and 15% for the test set. Our goal was to keep the test set small, because the model is evaluated at every step: we made a prediction on the selected test set at regular intervals, every 100 data points. Our results showed, on average, high accuracy and F1-score. The graphs also show that the scores vary depending on the data distribution. Our conclusion is that training with class weights was not enough to mitigate the uneven distribution of the class label. A sketch of this evaluation loop follows the figures below.
[Accuracy – Hoeffding Tree with class weights]
[F1-score - Hoeffding Tree with class weights]
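A sketch of this evaluation loop under our assumptions (X and y are numpy arrays of features and labels; the class-weight step from the previous section can be added via sample_weight):

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from skmultiflow.trees import HoeffdingTreeClassifier

# 85% of the data plays the role of the stream, 15% is held out for testing.
X_stream, X_test, y_stream, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

ht = HoeffdingTreeClassifier()
f1_history = []
for i in range(len(X_stream)):
    ht.partial_fit(X_stream[i:i + 1], y_stream[i:i + 1], classes=[0, 1])
    if (i + 1) % 100 == 0:                     # evaluate every 100 data points
        f1_history.append(f1_score(y_test, ht.predict(X_test)))
```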
- Using Reservoir Sampling
We chose a reservoir of fixed size N, selected to be greater than the grace period of the Hoeffding Tree (N=100, grace_period=10). We applied the test-then-train method together with Reservoir Sampling for the majority-class examples. We count the number of correctly predicted labels versus the total number of instances, and we also keep track of true positive, false negative, and false positive data points. We obtained how the accuracy and F1-score evolve as data points are processed:
[Accuracy and F1-score]
[Confusion Matrix (every 1000 samples)]
[Final Model]
We applied four types of Naive Bayes models, given below; a comparison sketch in code follows the list:
- Gaussian Naive Bayes: it assumes the features follow a normal distribution.
- Multinomial Naive Bayes: it is used when the data is multinomially distributed.
- Bernoulli Naive Bayes: it works similarly to the Multinomial classifier, but the predictors are independent Boolean variables.
- Complement Naive Bayes: it is a well-known model for imbalanced datasets. Where Multinomial and Gaussian Naive Bayes may give low accuracy, Complement Naive Bayes can perform quite well and give relatively higher accuracy. With our dataset, however, Complement Naive Bayes did not perform well; instead, Gaussian Naive Bayes performed best, reaching 98% accuracy on both the imbalanced and the balanced dataset.
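A minimal sketch of this comparison, using the four classifiers from sklearn.naive_bayes on the train/test split (Multinomial, Bernoulli, and Complement NB require non-negative features, which holds for our sensor readings):

```python
from sklearn.naive_bayes import (GaussianNB, MultinomialNB,
                                 BernoulliNB, ComplementNB)
from sklearn.metrics import accuracy_score

models = {
    "Gaussian":    GaussianNB(),
    "Multinomial": MultinomialNB(),
    "Bernoulli":   BernoulliNB(),
    "Complement":  ComplementNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```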
[Gaussian Naive Bayes model score]
Among the four Naive Bayes classifiers, Gaussian Naive Bayes gives the highest accuracy on both balanced and imbalanced data, 98%, along with high precision and recall. For light off, we achieved 97% precision, 100% recall, and a 99% F1-score; for light on, we got 100% precision, 93% recall, and a 96% F1-score. These results indicate good performance and good accuracy for the Gaussian Naive Bayes model.
[Gaussian Naive Bayes: Confusion Matrix]
We also generated the confusion matrix, both with raw counts and in normalized form, for Gaussian Naive Bayes. The figure shows the normalized confusion matrix, where true positives account for 26% and true negatives for 72% of the instances, which gives 98% accuracy.
[Comparison of the four Naive Bayes classifiers]
We also compared the performance of the four Naive Bayes classifiers. Gaussian Naive Bayes performed exceptionally well on both the balanced (i.e., after applying SMOTE over-sampling) and the imbalanced (i.e., original) dataset, reaching 98% accuracy, whereas Multinomial and Bernoulli Naive Bayes reached around 75% and Complement Naive Bayes 48%.
We investigated the behavior of two data stream learning algorithms for binary classification in a real-world scenario: the Hoeffding Tree and Naive Bayes. We obtained the confusion matrix and the prediction error. We observed that, in the case of data streams, we need techniques different from the methods used to learn from stationary tabular data: for decision trees, we do not have the entire data set available to determine the best split. A major challenge is that the class distribution continuously changes over time. We applied several techniques to make sure we obtain models that capture the underlying concepts of the stream well and generalize to unseen examples.
Our results show that the Hoeffding Tree is a suitable model for classifying streams of data and can be successfully applied to real-world problems. The model has a short training time: the confusion matrix shows that after seeing 35,000 samples (about 10% of all samples) the model reaches an F1-score of 90%. By applying Reservoir Sampling we made sure the model is trained with an equal number of samples from both classes, and it achieved a high overall F1-score of 94%. We think the high correlation between temperature and light can explain this high F1-score; the tree captures the underlying concepts of the data stream well. Possible improvements would be to explore different Reservoir Sampling techniques, for example a reservoir with adaptive size, or two independent sampling processes, one for the majority class and one for the minority class. The goal would be to maximize the number of samples used for training in order to capture the characteristics of the data well; in our project we were able to use only 70% of the minority class data points for training.
The evaluation of the models was a challenge in such a non-stationary environment. It is hard to tell whether the model overfits, because it is hard to set aside a test set. The holdout test set proved unsuitable for data streams, because evaluating the model on a large test set takes too much time. The second method, test-then-train, is more appropriate for evaluating the model in real time as data points arrive at a high rate.
We balanced the data for Naive Bayes by over-sampling, but other standard and reliable techniques, such as under-sampling or Reservoir Sampling, could also be applied to Naive Bayes. Although our data size posed no computational difficulty when running the models, the approach can be extended to cluster and parallel computing frameworks such as Apache Spark and Dask for faster processing when the data grows bigger. We would also like to apply more recent algorithms, such as deep learning and artificial neural networks.
We thank Professor Glatard for his advice and comments, both in class and online, about the course and the project. Thanks also to all the TAs; their lab sessions were great and helped us write the code for this project. We enjoyed and learned a lot from this course.
[1] Stafford, G. A. (2020). Environmental Sensor Data. https://www.kaggle.com/garystafford/environmental-sensor-data-132k
[2] Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[3] Domingos, P., Hulten, G. Mining High-Speed Data Streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.
[4] Gehrke, J., Ramakrishnan, R., Ganti, V. RainForest - a framework for fast decision tree construction of large datasets. Data Mining and Knowledge Discovery, 4(2/3):127-162, 2000.
[5] Hulten, G., Spencer, L., Domingos, P. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
[6] Domingos, P., Hulten, G. Mining complex models from arbitrarily large databases in constant time, 2002.
[7] Apache Spark Structured Streaming Programming Guide. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
[8] Mukherjee, A., et al. Naive Bayes and Decision Tree Classifier for Streaming Data Using HBase. Springer. https://link.springer.com/chapter/10.1007/978-981-13-3250-0_8
[9] Gama, J., Sebastião, R., Rodrigues, P. Issues in Evaluation of Stream Learning Algorithms. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
[10] Montiel, J., Read, J., Bifet, A., Abdessalem, T. Scikit-multiflow: A multi-output streaming framework. Journal of Machine Learning Research, 19(72):1-5, 2018.