diff --git a/3-variants-of-classification-problems-in-machine-learning.md b/3-variants-of-classification-problems-in-machine-learning.md new file mode 100644 index 0000000..a4d6fbe --- /dev/null +++ b/3-variants-of-classification-problems-in-machine-learning.md @@ -0,0 +1,137 @@ +--- +title: "3 Variants of Classification Problems in Machine Learning" +date: "2020-10-19" +categories: + - "deep-learning" +tags: + - "classification" + - "classifier" + - "deep-learning" + - "machine-learning" + - "neural-network" + - "support-vector-machine" +--- + +The field of machine learning is big and by consequence it can be daunting to start your first machine learning project. In doing so, it's likely that you have already performed a bit of research. During this research, you likely branched off into the sub field of Supervised Machine Learning methods, and subsequently into classification. And that's likely why you're here. + +Because: what is classification? What can we use it for? And, more importantly: what variants of classification are out there? Those are the questions that we will be looking at in this article. In doing so, we will firstly look at the more general topic and find out what classification is about. Subsequently, we will move on and discuss each of the three variants of classification present within Classification-related Supervised Machine Learning problems: + +1. **Binary Classification** +2. **Multiclass Classification** +3. **Multilabel Classification** + +It's important to know that these three types are generic, and can - and will - be separated from the algorithms that can be used. For example, Support Vector Machines, Deep Neural Networks, Logistic Regression and Decision Trees can be used for classification purposes. While their internals differ, their effect is the same - as we shall see, it comes down to assigning particular samples to a set of buckets. + +Let's get moving! 😎 + +* * * + +\[toc\] + +* * * + +## What is classification? + +Suppose that you're working at an assembly line, where various parts are moving on a conveyor belt into the direction of a bucket. Schematically, this can be visualized as follows: + +![](images/whatisclassification.png) + +When all the individual parts have fallen into the bucket, it's up to you to separate the blue ones from the yellow ones. This is a tedious job, especially when you have to handle a large amount of parts. The consequence of such labor is that mistakes get more prevalent when time goes on - humans get tired, after all. Especially in critical assembly lines, such errors can be costly and should be avoided. + +Fortunately, there is a good thing hidden in the image above. Recall that there are _blue objects_ and _yellow objects_. If you don't look at the individual objects, but at their types, you can see that there are only two of them: blue and yellow. When you talk about the general 'type' of an object rather than a specific instance, you're talking about the _**class**_ of objects - which groups all similar objects into a coherent group. Such terminology should especially be resembling to those who have some experience with object-oriented programming and related languages, such as Java. + +### Towards ML-based classification + +Machine Learning, which essentially boils down to pattern recognition, means that it becomes possible to build systems that automatically perform _**classification**_ - i.e., assigning a class to a particular sample. 
For example, such systems can assign a class to any image that is input to the system. This allows systems to separate cats from dogs, to give just one example. However, it can also be applied to our setting. By adding a machine learning powered system to the assembly line, it should become possible to distinguish between objects by simply looking at them, like this: + +![](images/whatisclassification2.png) + +In the real world, this can be achieved by creating a **Machine Learning model** that takes a picture as input (essentially, a video feed is an endless stream of pictures) and subsequently predicts to what class the image belongs. Or, if we apply a model with more detail, detect the particular objects within the video. The ML powered system is thus a software program, a webcam as well as some 'actuator', or a mechanism that can act on predictions generated for some input. Thus, with Machine Learning, we can create the scenario above: using technology, we can separate the blue and yellow objects automatically. + +This process - distinguishing between object types or _classes_ by automatically assigning them into a particular category - is what we know as **classification**. Let's now take a look at the three variants of classification that can be applied within a supervised classification problem in machine learning. + +* * * + +## Variant 1: Binary Classification + +The first variant of classification problems is called **binary classification**. If you know the binary system of numbers, you'll know that it's related to the number _two_: + +> In mathematics and digital electronics, a binary number is a number expressed in the base-2 numeral system or binary numeral system, which uses only two symbols: typically "0" (zero) and "1" (one). +> +> Wikipedia (2003) + +Binary classification, here, equals the assembly line scenario that we already covered and will repeat now: + +![](images/whatisclassification2.png) + +Essentially, there are two outcomes (i.e. a binary outcome): **class 0** or **class 1**. This is the case because classes are always represented numerically and hence there is no such thing as "blue" or "yellow". However, in the output, we can obviously transform 0 into "blue" and 1 into "yellow". + +### Implementing a binary classifier + +With binary classification, we therefore assign an input to one of two classes: class 0, or class 1. Usually, in neural networks, we use the [Sigmoid](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/#sigmoid) activation function for doing so. Funnily, neural networks therefore predict a value in the range \[latex\]\[0, 1\]\[/latex\], meaning between 0 and 1. For example, the output of a neural network can be \[latex\]0.69\[/latex\]. Here, the network thinks that it's more likely that it belongs to class 1, but cannot be fully sure. It's then up to the ML engineer to do something with the outcome, by e.g. applying a `round()` function that maps outputs to 0 or 1. + +Note that other machine learning methods such as SVMs do not necessarily output values between 0 and 1, but rather include the rounding effect as part of their functioning. + +* * * + +## Variant 2: Multiclass Classification + +The second variant of classification is called **multiclass classification**. 
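+
+Before we explore it, let's first make the binary case from the previous section a bit more tangible with code. Below is a minimal, purely illustrative Keras sketch - the dataset, layer sizes and feature dimensionality are made up for demonstration purposes - of a classifier with a single Sigmoid-activated output neuron, whose predictions are rounded into class 0 or class 1 afterwards:
+
+```
+import numpy as np
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense
+
+# Illustrative data: 1000 samples with 8 features, labels 0 ('blue') or 1 ('yellow')
+X = np.random.rand(1000, 8)
+y = (X.sum(axis=1) > 4).astype(int)
+
+# Minimal binary classifier with one Sigmoid-activated output neuron
+model = Sequential()
+model.add(Dense(16, activation='relu', input_shape=(8,)))
+model.add(Dense(1, activation='sigmoid'))
+model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
+model.fit(X, y, epochs=5, batch_size=32, verbose=0)
+
+# Outputs lie in [0, 1]; rounding maps them onto class 0 or class 1
+probabilities = model.predict(X[:5])
+print(np.round(probabilities).astype(int))
+```
+
+With that binary baseline in mind, let's return to the multiclass case.
+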
Here, we extend the assembly line by adding another bucket: + +![](images/whatisclassification5.png) + +Now, the machine learning powered system can distinguish between blue, yellow and red objects, or in ML terms **classes 0, 1 and 2**. + +### Algorithmic implementation of Multiclass Classification + +Multiclass classification can therefore be used in the setting where your classification dataset has more than two classes. Depending on the algorithm you're using, constructing a multiclass classifier can be cumbersome or really easy. This depends on whether the algorithms natively support this form of classification. For example: + +- A **Support Vector Machine** does not natively support multiclass classification. In those cases, you must train [multiple binary classifiers](https://www.machinecurve.com/index.php/2020/11/11/creating-one-vs-rest-and-one-vs-one-svm-classifiers-with-scikit-learn/) and apply a strategy to generate a multiclass prediction. +- A **Deep Neural Network** _does_ natively support multiclass classification. By means of the [Softmax activation function](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/), it is possible to generate a probability distribution over the multiple classes, one for each input. That is, for one input, the model will predict the probability that it belongs to a particular class, with probabilities summed equalling \[latex\]1\[/latex\], or \[latex\]100%\[/latex\]. For example, where the output of a binary classifier could be \[latex\]0.69\[/latex\] as we saw above, a multiclass scenario would producte something like \[latex\]\[0.10 0.69 0.21\]\[/latex\]. Together, the probabilities equal 100%, and class 1 is most likely. By simply taking an `argmax` here, you would get the class that is most probable. + +* * * + +## Variant 3: Multilabel Classification + +The assembly lines covered above have two or three types of objects that belong to one bucket and to one bucket only. In machine learning terms, this means that binary and multiclass classification assume that each input can be mapped to one particular target class. + +This does not necessarily hold for all machine learning problems. It can be that your dataset assigns multiple classes to an input value. If we want to automate this process, we must create a machine learning model that can do the same thing. Enter the world of **multilabel classification**, or tagging, which generalizes multiclass classification to a multi-class-at-once scenario. Visually, this looks as follows - indeed, rather than assigning one class and using multiple buckets, you're back at the one-bucket scenario, where you'll find individual objects with multiple tags attached: + +![](images/whatisclassification6.png) + +As the objects are now tagged, you can easily get a subset of the objects by applying simple filter operations. This means that it's no longer a burden to keep the different objects in just one bucket: it's easy to find the objects you need at a point in time thanks to the labeling. In fact, you now have a higher-dimensional search space compared to multiclass classification, which can benefit you in some scenarios. + +### Examples of Multilabel Classification + +In fact, there are [many scenarios](https://www.uco.es/kdis/mllresources/) where using a multilabel classifier is useful (Universidad de Córdoba, n.d.): + +- **Categorization of news articles:** news articles often belong to more than one category. 
For example, an article that discusses a Formula 1 race belongs both to the categories _automotive_ and _race sports_. Automatically assigning news articles a category thus involves multilabel classification. +- **Categorization of academic works:** in a similar setting as for news articles, academic works can also have multiple categories. +- **Semantics of music:** analyzing music can involve assigning tags related to semantic concepts (Universidad de Córdoba, n.d.). As a music classifier often takes a few seconds of sound as input, a machine learning model should be able to assign multiple semantic concepts to just one input. +- **Human emotions:** humans can show multiple emotions in a very brief time interval, and sometimes even two emotions at once: happily puzzled, to give just one example. This emotion represents the categories 'happy' and 'puzzled'. A machine learning model that should detect emotions on some visual input should therefore be able to perform multiclass classification. + +And there are many more - [as we can see here](http://www.uco.es/kdis/mllresources/#3sourcesDesc). + +* * * + +## Summary + +Classification in supervised machine learning involves using a machine learning model to categorize input samples by assigning them one or multiple 'types', 'tags', 'labels', or in more formal ML terms _classes_. A variety of machine learning algorithms can be used for this purpose. In this article, however, we refrained from a too deep focus on the algorithms, but rather covered classification at a high level. + +We first looked at what classification is. By means of an assembly line example, we saw how classifiers can be used to automate away cumbersome human tasks - tasks that can be error-prone. Subsequently, we looked at the multiple forms of classification that can be achieved with a machine learning model. Firstly, in a binary classification task, the machine learning model will assign the input sample to one of two buckets. Secondly, in the multiclass equivalent, the input sample will be assigned to one of multiple buckets. Finally, in the multilabel scenario, the task transforms into a tagging task, where multiple tasks can be assigned to an input sample. Eventual filtering can then be performed to retrieve the subset of samples that you need. We saw that those classifiers can be useful in many scenarios. + +I hope that you've learnt something from today's article! If you did, or _did not_, please feel free to leave a comment in the comments section below. I'll happily read your post and improve my article where necessary, so please don't omit any criticism where applicable. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2003, June 2). _Binary number_. Wikipedia, the free encyclopedia. Retrieved October 19, 2020, from [https://en.wikipedia.org/wiki/Binary\_number](https://en.wikipedia.org/wiki/Binary_number) + +_Binary classification_. (2003, April 3). Wikipedia, the free encyclopedia. Retrieved October 19, 2020, from [https://en.wikipedia.org/wiki/Binary\_classification](https://en.wikipedia.org/wiki/Binary_classification) + +_Multiclass classification_. (2010, February 25). Wikipedia, the free encyclopedia. Retrieved October 19, 2020, from [https://en.wikipedia.org/wiki/Multiclass\_classification](https://en.wikipedia.org/wiki/Multiclass_classification) + +_Multi-label classification_. (2006, October 16). Wikipedia, the free encyclopedia. 
Retrieved October 19, 2020, from [https://en.wikipedia.org/wiki/Multi-label\_classification](https://en.wikipedia.org/wiki/Multi-label_classification) + +_Universidad de Córdoba. (n.d.). _Multi-label classification dataset repository_. [https://www.uco.es/kdis/mllresources/](https://www.uco.es/kdis/mllresources/)_ diff --git a/a-gentle-introduction-to-long-short-term-memory-networks-lstm.md b/a-gentle-introduction-to-long-short-term-memory-networks-lstm.md new file mode 100644 index 0000000..6ff4124 --- /dev/null +++ b/a-gentle-introduction-to-long-short-term-memory-networks-lstm.md @@ -0,0 +1,280 @@ +--- +title: "A gentle introduction to Long Short-Term Memory Networks (LSTM)" +date: "2020-12-29" +categories: + - "deep-learning" +tags: + - "deep-learning" + - "long-short-term-memory" + - "lstm" + - "machine-learning" + - "recurrent-neural-networks" + - "rnn" + - "seq2seq" + - "sequence-to-sequence-learning" + - "transformer" + - "transformers" + - "vanilla-rnn" +--- + +One of the fields where Machine Learning has boosted progress is Natural Language Processing. This is particularly true for the models that are used for machine translation and similar tasks. In other words, for models that can be used for performing [sequence-to-sequence learning](https://www.machinecurve.com/index.php/2020/12/21/from-vanilla-rnns-to-transformers-a-history-of-seq2seq-learning/), where sequences of one kind (e.g. phrases written in English) are transducted into ones of another kind (e.g. phrases written in German). + +For many years, **Long Short-Term Memory** networks (LSTM networks) have been part of the state-of-the-art within sequence-to-sequence learning. Having been replaced slowly but surely after the 2017 [Transformer breakthrough](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) (i.e., the Vaswani et al. work from 2017), they do still play an essential role in many Seq2Seq tasks today, especially with deployed models. + +This article provides a gentle introduction to LSTMs. As with any MachineCurve article, we're going to take a look at some context first. We'll cover classic Recurrent Neural Networks and why training them is problematic. This is followed by an introduction of the Long Short-Term Memory Network by Hochreiter and Schmidhuber in their 1997 work. We're also going to cover intuitively why LSTMs solve the vanishing gradients problem traditionally present within Machine Learning with recurrent segments. + +Included as well is a thorough analysis of the contemporary LSTM architecture, which includes a few changes here and there to improve the basic LSTM. In particular, we're going to take a look at separating memory from the hidden state, the various gates (i.e. the forget, input/update and output gates). Finally, we're taking a look at the future as well, by looking at why Transformers have replaced LSTMs in the past few years. + +These are the takeaways from reading this article: + +- Finding out what the problems are with classic Recurrent Neural Networks. +- Identifying how LSTMs work and why they solve the vanishing gradients problems. +- Looking at the contemporary LSTM architecture, its components, and its variants. +- Learning why Transformers have slowly but surely replaced LSTMs in sequence-to-sequence learning. + +Let's go! 
😎 + +* * * + +\[toc\] + +* * * + +## Problems with classic RNNs + +When people speak about applying Machine Learning to the field of Natural Language Processing, the term **recurrent neural networks** is what many people come across relatively quickly. In its basic form, i.e. in its _vanilla_ form, a recurrent neural network (RNN) can be visualized in the following way: + +![](images/2560px-Recurrent_neural_network_unfold.svg_.png) + +A fully recurrent network. Created by [fdeloche](https://commons.wikimedia.org/wiki/User:Ixnay) at [Wikipedia](https://en.wikipedia.org/wiki/Recurrent_neural_network#/media/File:Recurrent_neural_network_unfold.svg), licensed as [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0). No changes were made. + +After tokenizing a sequence such as a phrase, we can feed individual tokens (e.g. words) to the network; tokens here are visualized by the green circles \[latex\]x\_t\[/latex\]. These are input to a network with a hidden state \[latex\]h\[/latex\], which based on this hidden state generate an output token \[latex\]o\[/latex\]. What's more, the output of the hidden state is passed back into the hidden state. This way, we can both generate output values _and_ have some kind of a memory. + +Especially when you unfold this structure showing the parsing of subsequent tokens \[latex\]x\_{t-1}\[/latex\] etc., we see that hidden state passes across tokens in a left-to-right fashion. Each token can use information from the previous steps and hence benefit from additional context when transducing (e.g. translating) a token. + +> The structure of the network is similar to that of a standard multilayer perceptron, with the distinction that we allow connections among hidden units associated with a time delay. Through these connections the model can retain information about the past, enabling it to discover temporal correlations between events that are far away from each other in the data. +> +> Pascanu et al. (2013) + +While being a relatively great step forward, especially with larger sequences, classic RNNs did not show great improvements over classic neural networks where the inputs were sets of time steps (i.e. multiple tokens just at once), according to Hochreiter & Schmidhuber (1997). Diving into Hochreiter's thesis work from 6 years earlier, the researchers have identified the [vanishing gradients problem](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) and the relatively large distances error flow has to go when sequences are big as one of the leading causes why such models don't perform well. + +> The vanishing gradients problem refers to the opposite behaviour, when long term components go exponentially fast to norm 0, making it impossible for the model to learn correlation between temporally distant events. +> +> Pascanu et al. (2013) + +### Why vanishing gradients? + +The vanishing gradients problem in classic RNNs occurs because they were trained with a backpropagation variant called **Backpropagation through time** (BPTT; Wikipedia, 2010). To understand how BPTT works, we'll have to take a look at recurrent neural networks again. In the figure below, we can see a recurrent network, handling an input \[latex\]a\_t\[/latex\] for some time step and generates a prediction \[latex\]y\_{t+1}\[/latex\] for the next timestep. The hidden state of the previous attempt is passed to the network as well and is often a vector of zeroes at \[latex\]t = 0\[/latex\] (Wikipedia, 2010). 
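+
+To make this recurrence a bit more concrete, here is a minimal NumPy sketch - not taken from any particular implementation; the dimensions, weight names and initialization are illustrative only - of how such a network processes a sequence one time step at a time:
+
+```
+import numpy as np
+
+def sigmoid(z):
+    return 1 / (1 + np.exp(-z))
+
+# Illustrative dimensions: 10-dimensional inputs, 16-dimensional hidden state, 4 outputs
+W_x = np.random.randn(16, 10) * 0.1  # input-to-hidden weights
+W_h = np.random.randn(16, 16) * 0.1  # hidden-to-hidden (recurrent) weights
+W_y = np.random.randn(4, 16) * 0.1   # hidden-to-output weights
+
+sequence = [np.random.randn(10) for _ in range(5)]  # five input vectors a_t
+h = np.zeros(16)                                     # hidden state: a vector of zeroes at t = 0
+
+for a_t in sequence:
+    h = sigmoid(W_x @ a_t + W_h @ h)  # the same weights are reused at every time step
+    y = W_y @ h                       # prediction for the next time step
+```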
+ +When unfolded through time, we can see the chained passages of inputs \[latex\]a\_t\[/latex\] across the entire time domain. We also see the hidden state changing over time, being used continuously for generating the subsequent input. Effectively, we're 'copying' the network, but every copy of the network has the same parameters (Wikipedia, 2010). We can then simply apply backpropagation for computing the gradients, like we're used to. + +![](images/Unfold_through_time.png) + +Source: Headlessplatter (Wikipedia). Licensed to be in the public domain. + +Now here's the problem. Traditionally, to ensure that neural networks can [learn to handle nonlinear data](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/), nonlinear activation functions were added to the network. Sigmoid has been an activation function that used to be one of the standard functions that was applied in neural network. The plot below illustrates perfectly why gradients vanish if the chain of 'copies' through which backpropagation must plough is long: the maximum value of the Sigmoid derivative is < 0.3. + +In other words, if we have to chain the derivative of Sigmoid across three time steps, our gradient gets close to zero quickly. Especially upstream layers i.e. upstream time steps are struck significantly by this problem, because they cease learning when sequences get too long. Say hello to the _vanishing gradients problem_! + +[![](images/sigmoid_deriv-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/sigmoid_deriv.png) + +* * * + +## Introducing Long Short-Term Memory Networks + +In their 1997 work, Hochreiter & Schmidhuber introduce the **Long Short-Term Memory**, or LSTM. In the work, the authors explore Hochreiter's 1991 thesis which among others investigates the problem with vanishing gradients in classic RNNs. They explore why they happen and identify a solution: the so-called **Constant Error Carrousel** (CEC). We'll explore why it solves vanishing gradients in more detail later, but it boils down to one thing: because the memory is constructed using linear operations, the derivative is always \[latex\]1.0\[/latex\] (because the derivative of \[latex\]f = c \\times x\[/latex\] equals 1.0). + +### The contemporary LSTM architecture + +Let's however first take a look at the architecture of a contemporary LSTM network. Below, you'll see a visualization which seems to be complex at first sight. However, it isn't necessarily so when we look at it with more caution. More specifically, we can structure the various building blocks into four main categories: + +1. A separation between **memory** and **output state**. +2. A **forget gate** which helps us remove certain things from memory. +3. An **update (or input) gate** which helps us add certain things to memory. +4. An **output gate** which helps us generate an output prediction based on the input and existing memory (i.e. based on input and updated context). + +All functionality within an LSTM is grouped into a cell-like structure called a **memory cell**. Similar to classic recurrent networks, the output of the cell flows back into the cell when the next prediction takes place. Or, when unrolled, like the recurrent network above, the output of one copy of an identical cell is passed to another copy of that cell. In the image below, this is visualized by the horizontal streams of _outputs_ \[latex\]h\[t\]\[/latex\] and of _memory_ \[latex\]c\[t\]\[/latex\]. 
+ +[![](images/LSTM-1024x657.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/LSTM.png) + +An LSTM memory cell with a Forget Gate, Update Gate and Output Gate. + +### Separating memory and output state + +The first key difference between classic recurrent neural networks and LSTMs is that **memory** is separated from the **outputs**. In classic networks, these are tightly integrated, as we can see in the unrolled recurrent network visualized above. + +In the classic network, the _output_ is used for providing context for the next prediction. This is no longer true for LSTMs. At least, no longer _only_ true for LSTMs, because of this separation between memory and output state. + +This is visible in the image above: + +- Here, the input to the LSTM at any point in time is displayed as \[latex\]x\[t\]\[/latex\]. +- The output is visualized as \[latex\]h\[t\]\[/latex\]. In fact, it's recurrent, as it accepts the output from the previous cell (or, to be more precise, another copy of the identical cell) and passes it onto the next one. +- The same is true for the memory \[latex\]c\[t\]\[/latex\], which is newly available and was not available in previous networks. + +The idea here is that while outputs can provide quite some context about what has happened directly before, a longer-term memory is necessary for providing additional (i.e. longer-term) context. This is why the outputs and the memory are no longer tightly integrated with LSTMs. Now, the drawback of separating memory from cell outputs is that you'll have to keep both in sync. + +And keeping them in sync means that we must forget what can be forgotten from the previous output, given the current one. It also means that we have to remember what must be remembered from the current one, given the previous output. Otherwise, the memory is useless, isn't it? + +For this reason, LSTMs come with **gates**. Below, we'll describe the _contemporary_ variant, as proposed by Gers et al. (1999) as an extension to the original LSTM proposed by Hochreiter & Schmidhuber (1997). It has three gates, being the **forget gate**, the **update gate** and the **output gate**. They all play a distinct but important role. Let's now take a look at how each individual gate keeps the memory in sync. + +### Forget gate + +Suppose you are feeding the sequence `I am going to the gym` to the model, where the sentence has been tokenized into ` ` (of course to its integer equivalents, by means of the generation of some vocabulary). + +In the previous run, you have applied the LSTM model to `` and you will now be processing ``. This means that you'll have the following setting: + +- The value for \[latex\]x\[t\]\[/latex\] will be the tokenized version of ``. +- The value for \[latex\]h\[t-1\]\[/latex\] will be the (translated) tokenized output of ``. +- The value for \[latex\]c\[t-1\]\[/latex\] will be some representation of long-term memory, which at that point only includes (part of) the representation of ``. + +Why it's likely that it's only _part of_ the representation is because the tokenized input will impact both the **output** \[latex\]h\[t\]\[/latex\] and the **memory** \[latex\]c\[t\]\[/latex\]. + +The first way in which this will happen is through the **forget gate**, which has been selected in green below. The gate is composed of multiple components, from top to bottom: + +- A block that (Hadamard) **multiplies** the memory from the previous timestep with the output of the forget gate. 
+- A **Sigmoid function** which acts as a mechanism for deciding what to forget. +- The **previous output** and the **current input** as inputs to the forget gate. + +![](images/LSTM-1-1024x657.png) + +The previous output \[latex\]h\[t-1\]\[/latex\] and current input \[latex\]\[x\[t\]\[/latex\] are first added together by means of matrix addition, after (learned) weight matrices have been applied to both inputs. These learned weights determine the strength of the forget gate by putting more attention on the current input or the previous output. The result is then added to a [Sigmoid activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/), of which we know that it maps all inputs to a value between 0.0 and 1.0. + +In other words, the current and previous input together with the learned weights determine what must be 'forgotten', because when certain elements in the matrices that are the outcome of the addition are < 0, they are likely to be more forgotten (since the output of the Sigmoid activation is closer to 0.0 than to 1.0). If instead outcomes are >= 0, they are more likely to be _omitted_ from the removal process. + +The removal or forgetting process itself happens by means of a Hadamard matrix multiplication. The memory matrix is Hadamard multiplied with the outcome of the Sigmoid-activated matrix, meaning that all elements that should be reduced in strength are reduced, and all elements that must be retained are not impacted significantly. In other words, this gate allows us to learn what to forget based on certain combinations of previous outputs and current inputs. + +[![](images/sigmoid_deriv-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/sigmoid_deriv.png) + +### Update gate (input gate) + +Next up is the **update gate** (also called the **input gate**), visualized in green below. Contrary to the forget gate, whose task is to _remove information from memory_, the task of the update gate is to _add information into memory_. + +The gate itself is a bit more complex than the forget gate, but don't worry, with some explanation it'll also be easy to grasp what is happening here. + +Recall that this is our point in time: + +- The value for \[latex\]x\[t\]\[/latex\] will be the tokenized version of ``. +- The value for \[latex\]h\[t-1\]\[/latex\] will be the (translated) tokenized output of ``. +- The value for \[latex\]c\[t-1\]\[/latex\] will be some representation of long-term memory, which at that point only includes (part of) the representation of ``. + +As you can see, it's composed of two components: a [Sigmoid activation](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) of a joint learned weighted input based on the previous output \[latex\]h\[t-1\]\[/latex\] and current input \[latex\]x\[t\]\[/latex\] and a [Tanh activation](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) that accepts another joint learned weighted input based on the two inputs. The outcomes of these activations are first Hadamard matrix multiplied, and subsequently added into memory by means of matrix addition. + +[![](images/LSTM-2-1024x657.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/LSTM-2.png) + +I can imagine that it's still a bit vague what is happening here. Let's break down stuff even further. 
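+
+Written out as equations - using the same notation that we will use in the _Memory, mathematically_ section below, and leaving out the bias terms for readability - the forget gate, the input/update gate and the candidate memory compute:
+
+\[latex\]f\_t = \\sigma(W\_f x\_t + U\_f h\_{t-1})\[/latex\]
+
+\[latex\]i\_t = \\sigma(W\_i x\_t + U\_i h\_{t-1})\[/latex\]
+
+\[latex\]\\tilde{c}\_t = \\tanh(W\_c x\_t + U\_c h\_{t-1})\[/latex\]
+
+Here, the \[latex\]W\[/latex\] and \[latex\]U\[/latex\] matrices are the learned weights applied to the current input and the previous output, respectively, and \[latex\]\\sigma\[/latex\] is the Sigmoid function.
+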
Here are the plots of both the Sigmoid and Tanh function and their derivatives. + +- [![](images/sigmoid_and_deriv-1024x511.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/09/sigmoid_and_deriv.jpeg) + +- [![](images/tanh_and_deriv-1024x511.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/09/tanh_and_deriv.jpeg) + + +Let's first take a look at the **Tanh function**. As we can see, the function maps all inputs to a value between -1.0 and +1.0. In other words, it [normalizes](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) any input to the \[latex\]\[-1.0, 1.0\]\[/latex\] range. Feeding the joined weighted combination of previous outputs and current inputs to Tanh therefore ensures some normalization of input values. This benefits the stability of the training process. It doesn't however truly serve as an _update_, because with Tanh, all new information will be added. + +That's why here too, we apply a **Sigmoid function**. As we know that it maps to 0.0 to +1.0 instead, we can see that it learns to select the most important aspects of the combination of previous output and current input. The outcome of the Sigmoid activation is Hadamard matrix multiplied with the Tanh outcome before it is added to memory. + +In other words, the Hadamard matrix multiplied combination of Sigmoid activated and Tanh activated outcomes ensures that (1) only important aspects, given the current inputs, are added into memory; that (2) they are added in a way that numerically stabilizes the training process. + +Great stuff! + +### Output gate + +Last but not least is the **output gate**, which is visualized in green below. + +Its sole responsibility is formulating the _output_ \[latex\]h\[t\]\[/latex\] of the memory cell given the previous output and the current input \[latex\]h\[t-1\]\[/latex\] and \[latex\]x\[t\]\[/latex\]. This gate is nevertheless really important, because it'll determine both the correctness of the prediction (i.e. the output) and the stability of all subsequent productions simply because its prediction is re-used in the next one. + +Once again, we see a Tanh and Sigmoid activated Hadamard matrix multiplication. This time, though, the inputs flow from a different direction. + +- The Sigmoid activated input flows from the previous output and current input. Being weighted using separate weight matrices, like all the Sigmoids so far, this Sigmoid activation provides a learned representation about what's most important in the current input and previous output for using in the transduction task. +- The Tanh activated input flows from the memory (which has been updated by means of forgetting and adding new information) and essentially normalizes the memory values, stabilizing the training process. + +Together, through a Hadamard matrix multiplication, they produce the output token that we are _hopefully_ looking for. + +[![](images/LSTM-3-1024x657.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/LSTM-3.png) + +### Why LSTMs don't suffer from vanishing gradients + +We know that classic RNNs faced the difficulty of vanishing gradients during the training process, but that LSTMs are free from vanishing gradients. + +But why is this the case? + +Now that we understand how LSTMs work and that they separate memory and previous outputs by means of gates, the answer is simple: **gradients can flow freely, constantly (i.e. gradient = 1.0), between the copies of the same memory cell**. 
+ +And in addition, gradients _within_ the cell components can be any value by virtue of the Sigmoid and Tanh activation functions being used, so the cell will be able to learn how it can adapt the weights of the matrices involved with the forget, update and output gates. + +I can imagine that this is hard to grasp, so let's break it down into separate components once more :) + +![](images/LSTM-4-1024x657.png) + +#### Memory, mathematically + +Let's take a close look at the way in which the memory is updated in one token pass first. Put simply, it's a linear operation that is written mathematically this way: + +\[latex\]c\_t = f\_t \\circ c\_{t-1} + i\_t \\circ \\tilde{c}\_t\[/latex\] + +Here, \[latex\]f\_t\[/latex\] represents the activation value for the _forget gate_, which is Hadamard matrix multiplied with the value for \[latex\]c\[t-1\]\[/latex\] (we know that from above). + +Here, \[latex\]i\_t \\circ \\tilde{c}\_t\[/latex\] is the Hadamard matrix multiplication between the Sigmoid-activated and Tanh-activated outputs from the _update gate_, which are then simply matrix added into memory. + +In other words, it represents the operations that we intuitively understood above. + +#### The memory activation function is the identity function + +In addition, no [nonlinear activation function](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/) is present within this memory flow, contrary to classic RNNs, which are often Sigmoid activated. In other words: the activation function can be described as the identity function, or \[latex\]f(x) = x\[/latex\] (ManiacMalko, n.d.). As the gradient of it is 1.0, we can see that errors can flow freely between copies of the same memory cell withint vanishing (as happens when gradients are < 1.0 e.g. in the Sigmoid case). + +This change compared to classic RNNs resolves the vanishing gradients problem in LSTMs. + +* * * + +## From LSTMs to Transformers + +[![](images/Diagram-32-1-1024x991.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/Diagram-32-1.png) + +In the 2010s, LSTMs were the go-to type of network for sequence-to-sequence learning activities such as Machine Translation. + +However, there was one remaining bottleneck that was not resolved by LSTMs either: the fact that processing has to happen sequentially. + +Each part of the sequence must be fed to the network in sequence, after which a transduction is computed on a per-token basis. + +This unnecessarily slows down the training process. + +In their breakthrough work, Vaswani et al. (2017) have proposed the [Transformer architecture](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/), which allows for parallelism by means of stripping away the recurrent aspects in the architecture. The massive growth in interest in Transformers has ensured that LSTMs have been removed from the pedestal; they are no longer considered to be state-of-the-art in NLP. + +Nevertheless, they are continuously being used today, with approximately 55.000 hits in Google Scholar even when the 'since 2020' option was selected. A wide variety of applications is covered, among which predictions for [COVID-19 disease](https://www.machinecurve.com/index.php/2020/11/05/ml-against-covid-19-detecting-disease-with-tensorflow-keras-and-transfer-learning/), air quality forecasting, and water production forecasting. 
+ +That's why LSTMs must not yet be discarded, but applied with care :) + +* * * + +## Summary + +In this article, we looked at Long Short-Term Memory networks (LSTMs), which were state-of-the-art approaches in Machine Learning for NLP (or more generally, for time series) in the past few years before they were replaced by Transformer networks. In doing so, we first saw which problems occur when we train classic RNNs, primarily the vanishing gradients problem. We also saw that it occurs because classic activation functions like Sigmoid produce derivatives that can be < 1 at best, yielding the vanishing of gradients at improvement time. + +LSTMs, we saw, overcome this problem by introducing what is known as Constant Error Caroussels. By separating memory from the hidden, nonlinearly activated output, they can ensure that the gradient of the memory is 1.0 at all times - ensuring that the gradients neither explode nor vanish, while they can flow freely between time steps. Through three gates, being the forget gate, the input/update gate and the output gate, current inputs and previous predictions can update memory by removing what can be discarded, adding what must be retained, and finally generate output based on inputs and current memory. + +Despite the benefits achieved with LSTMs, they are no longer considered to be state-of-the-art approaches. This is primarily due to the nascence of Transformer networks, which have the additional benefit that sequences don't have to be processed sequentially, but rather, in parallel. Still, LSTMs remain widely applied and hence must not be discarded from research and engineering activities. + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope that you have learned something from this article. If you did, please feel free to drop a message in the comments section below 💬 Please do the same if you have any questions, or click the **Ask Questions** button on the right to ask your question. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Hochreiter, S., & Schmidhuber, J. (1997). [Long short-term memory](https://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735). _Neural computation_, _9_(8), 1735-1780. + +Gers, F. A., Schmidhuber, J., & Cummins, F. (1999). [Learning to forget: Continual prediction with LSTM](https://digital-library.theiet.org/content/conferences/10.1049/cp_19991218). + +Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008. + +Pascanu, R., Mikolov, T., & Bengio, Y. (2013, February). [On the difficulty of training recurrent neural networks.](http://proceedings.mlr.press/v28/pascanu13.pdf?source=post_page---------------------------) In _International conference on machine learning_ (pp. 1310-1318). + +Wikipedia. (2010, June 1). _Backpropagation through time_. Wikipedia, the free encyclopedia. Retrieved December 28, 2020, from [https://en.wikipedia.org/wiki/Backpropagation\_through\_time](https://en.wikipedia.org/wiki/Backpropagation_through_time) + +Xu, C. (n.d.). _Need help understanding LSTMs' backpropagation and carousel of error_. Data Science Stack Exchange. [https://datascience.stackexchange.com/a/23042](https://datascience.stackexchange.com/a/23042) + +ManiacMalko. (n.d.). _\[D\] LSTM - Constant error carrousel_. reddit. 
[https://www.reddit.com/r/MachineLearning/comments/ecja78/d\_lstm\_constant\_error\_carrousel/](https://www.reddit.com/r/MachineLearning/comments/ecja78/d_lstm_constant_error_carrousel/) + +Gers, F. A., Schraudolph, N. N., & Schmidhuber, J. (2002). [Learning precise timing with LSTM recurrent networks](https://www.jmlr.org/papers/volume3/gers02a/gers02a.pdf). _Journal of machine learning research_, _3_(Aug), 115-143. + +Wikipedia. (2007, April 16). _Long short-term memory_. Wikipedia, the free encyclopedia. Retrieved December 29, 2020, from [https://en.wikipedia.org/wiki/Long\_short-term\_memory](https://en.wikipedia.org/wiki/Long_short-term_memory) diff --git a/a-simple-conv3d-example-with-keras.md b/a-simple-conv3d-example-with-keras.md new file mode 100644 index 0000000..4e90981 --- /dev/null +++ b/a-simple-conv3d-example-with-keras.md @@ -0,0 +1,555 @@ +--- +title: "A simple Conv3D example with TensorFlow 2 and Keras" +date: "2019-10-18" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "conv3d" + - "convolutional-neural-networks" + - "deep-learning" + - "keras" + - "mnist" +--- + +These past few years, convolutional neural networks have become known for the boost they gave to machine learning, or artificial intelligence in a broader sense. Primarily, these networks have been applied to two-dimensional data: data with two axes (x and y), such as images. + +_The cover image is courtesy of [David de la Iglesia Castro](https://github.com/daavoo?tab=repositories), the creator of the 3D MNIST dataset._ + +We all know about the computer vision applications which allow us to perform object detection, to name just one. + +How these Conv2D networks work [has been explained in another blog post.](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) + +For many applications, however, it's not enough to stick to two dimensions. Rather, the _height_ or _time_ dimension is also important. In videos, which are essentially many images stacked together, time is this third axis. It can however also be height or _number of layers_, in e.g. the layered image structure of an MRI scan. In both cases, the third axis intrinsically links the two-dimensional parts together, and hence cannot be ignored. + +Enter three-dimensional convolutional neural networks, or Conv3Ds. In this blog post, we'll cover this type of CNNs. More specifically, we will first take a look at the differences between 'normal' convolutional neural networks (Conv2Ds) versus the three-dimensional ones (Conv3D). Subsequently, we will actually provide a TensorFlow 2/Keras-based implementation of a Conv3D, with the [3D MNIST](https://www.kaggle.com/daavoo/3d-mnist) dataset available at Kaggle. We discuss all the individual parts of the implementation before arriving at the final code, which ensures that you'll understand what happens on the fly. + +After reading this tutorial, you will understand... + +- **What the differences are between `Conv2D` and `Conv3D` layers.** +- **What the 3D MNIST dataset contains.** +- **How to build a 3D Convolutional Neural Network with TensorFlow 2 based Keras.** + +All right, let's go! 😄 + +_Note that the code for this blog post is also available on [GitHub](https://github.com/christianversloot/keras-cnn)._ + +* * * + +**Update 10/Feb/2021:** ensure that tutorial is up to date. Converted all TensorFlow examples to new versions of the library (TensorFlow 2.x). 
+ +* * * + +\[toc\] + +* * * + +## Example code: using Conv3D with TensorFlow 2 based Keras + +This example shows how you can **create 3D convolutional neural networks** with TensorFlow 2 based Keras through `Conv3D` layers. You can immediately use it in your neural network code. However, if you want to understand 3D Convolutions in more detail or wish to get step-by-step examples for creating your own 3D ConvNet, make sure to read the rest of this tutorial too 🚀 + +``` + # Create the model + model = Sequential() + model.add(Conv3D(32, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=sample_shape)) + model.add(MaxPooling3D(pool_size=(2, 2, 2))) + model.add(Conv3D(64, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform')) + model.add(MaxPooling3D(pool_size=(2, 2, 2))) + model.add(Flatten()) + model.add(Dense(256, activation='relu', kernel_initializer='he_uniform')) + model.add(Dense(no_classes, activation='softmax')) +``` + +* * * + +## Conv2D vs Conv3D + +If you are familiar with convolutional neural networks, it's likely that you understand what happens in a traditional or two-dimensional CNN: + +![](images/CNN.jpg) + +A two-dimensional image, with multiple channels (three in the RGB input in the image above), is interpreted by a certain number (`N`) kernels of some size, in our case 3x3x3. The actual _interpretation_ happens because each kernel _slides over the input image_; literally, from the left to the right, then down a bit; from the left to the right, and so on. By means of element-wise multiplications, it generates a _feature map_ which is smaller than the original input, and in fact is a _more abstract summary_ of the original input image. Hence, by stacking multiple convolutional layers, it becomes possible to generate a very abstract representation of some input representing some _average object_, which allows us to classify them into groups. + +_For more information, I'd really recommend my other blog post, [Convolutional Neural Networks and their components for computer vision](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/)._ + +Now, with three-dimensional convolutional layers, things are different - but not too different. Instead of three dimensions in the input image (the two image dimensions and the _channels_ dimension, you'll have four: the two image dimensions, the time/height dimension, and the channels dimension). As such, the feature map is also three-dimensional. This means that the filters move in three dimensions instead of two: not only from left to right and from the top to the bottom, but also forward and backward. Three-dimensional convolutional layers will therefore be more expensive in terms of the required computational resources, but allow you to retrieve much richer insights. + +Now that we understand them intuitively, let's see if we can build one! + +* * * + +## Today's dataset: 3D MNIST + +...creating a machine learning requires a dataset with which the model can be trained. + +The **3D MNIST dataset** that is available at [Kaggle](https://www.kaggle.com/daavoo/3d-mnist) serves this purpose. It is an adaptation of the original MNIST dataset which we used to create e.g. the [regular CNN](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/). 
The authors of the dataset converted the two-dimensional data into 3D by means of point clouds, as follows: + +[![](images/mnist3d.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/10/mnist3d.jpg) + +Courtesy of [David de la Iglesia Castro](https://github.com/daavoo?tab=repositories), the creator of the 3D MNIST dataset. + +Since the data is three-dimensional, we can use it to give an example of how the Keras Conv3D layers work. + +Since it is relatively simple (the 2D dataset yielded accuracies of almost 100% in the 2D CNN scenario), I'm confident that we can reach similar accuracies here as well, allowing us to focus on the model architecture rather than poking into datasets to maximize performance. + +Let's now create the model! 😎 + +* * * + +## Creating the model + +### What you'll need to run it + +Before we start coding, let's make sure that you have all the software dependencies installed that we need for successful completion: + +- **Python**, obviously, since Keras runs in Python. It's best to use Python 3.8+. +- **TensorFlow 2.x**, especially one of the newer versions. It includes Keras by means of the tightly coupled `tensorflow.keras` APIs. +- **Numpy** for relatively basic number processing in terms of reshaping the input data (we'll see why we need Numpy later!) +- **Matplotlib** for data visualization. +- **H5py** for importing and parsing HDF5 files. The 3D MNIST dataset is provided in HDF5 format, which stands for _Hierarchical Data Format version 5_ and is a way of storing large datasets into _one file_, by means of a hierarchy comparable to a folder structure in Windows Explorer. With H5py, we can import and parse the files into a format we can further use. + +Besides the software dependencies, you'll also need the data itself. The dataset is available on Kaggle, which is a community of machine learning enthusiasts where competitions, question and answers and datasets are posted. + +There are two ways of installing the dataset into your host machine: + +- By installing the Kaggle Python API, with `pip install kaggle`. Next, you can issue `kaggle datasets download -d daavoo/3d-mnist` (if you included the `kaggle.json` API key file in the `~/.kaggle` - read [here](https://github.com/Kaggle/kaggle-api) how to do this) and the dataset must download. We will need the file `full_dataset_vectors.h5`. **Note that for the 3D MNIST dataset, this option is currently (as of February 2021) broken, and you will have to download the data manually.** +- Besides using the API facilities, it's also possible to download the data manually. On the [Kaggle data repository page](https://www.kaggle.com/daavoo/3d-mnist), navigate to 'Data', and download `full_dataset_vectors.h5`. + +For both scenarios, you'll need a free Kaggle account. + +Let's move the file `full_dataset_vectors.h5` into a new folder (e.g. `3d-cnn`) and create a Python file such as `3d_cnn.py`. Now that the data has been downloaded & that the model file is created, we can start coding! 😄 + +So let's open up your code editor and _on y va!_ (🇫🇷 for _let's go!_). + +### Model imports + +As usual, we import the dependencies first: + +``` +''' + A simple Conv3D example with TensorFlow 2 based Keras +''' +import tensorflow +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv3D, MaxPooling3D +from tensorflow.keras.utils import to_categorical +import h5py +import numpy as np +import matplotlib.pyplot as plt +``` + +For most of them, I already explained why we need them. 
However, for the Keras ones, I'll explain them in a slightly more detailed way: + +- We'll work with the **Sequential API**. It's the easiest way to structure your Keras layers (contrary to the Functional API), but it comes with a cost - you lose flexibility in terms of how data flows through your model, as you literally stack all layers. For this blog post, that doesn't matter, but it may be an idea to inform yourself about the differences between both APIs. +- Next, we import some layers: + - The **Dense** layer represents the densely-connected layers ([MLP-like layers](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/)) that we will use to classify the very abstract 3D convolutional feature maps into one of the buckets 0-9, for the digits 0-9. + - The **Flatten** layer will be used to convert the multidimensional feature map into a one-dimensional array, since only those can be handled by densely-connected layers. + - The **Conv3D** layer, which was intuitively discussed above, will be used for performing the convolutional operations. + - In between the convolutional layers, we apply three-dimensional max pooling with **MaxPooling3D** in order to down-sample the feature maps (or in plain English: making them smaller, presumably without losing information) which saves precious computational resources. +- Finally, we import the `[to_categorical](https://www.machinecurve.com/index.php/2020/11/24/one-hot-encoding-for-machine-learning-with-tensorflow-and-keras/)` function. The [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) we're using to compute _how bad the model performs_ during training, [categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#categorical-crossentropy), requires that we convert our integer target data (e.g. \[latex\]8\[/latex\] when it's an 8) into categorical vectors representing true/false values for class presence, e.g. \[latex\]\[0, 0, 0, 0, 0, 0, 0, 0, 1, 0\]\[/latex\] for class 8 over all classes 0-9. `to_categorical` converts the integer target data into categorical format. + +### Model configuration + +Now that we imported all dependencies, we can proceed with some model configuration variables that allow us to configure the model in an orderly fashion: + +``` +# -- Preparatory code -- +# Model configuration +batch_size = 100 +no_epochs = 30 +learning_rate = 0.001 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +``` + +Specifically, we configure the model as follows: + +- We use a **batch size** of 100 samples. This means that one hundred samples are fed forward through the network each time, generating predictions, computing loss, and optimization. The higher the batch size, the higher the efficiency with which the improvement gradient can be computed, but the more memory is required. +- We use 30 **epochs**. One epoch, or full iteration, means that all samples are fed forward once, and that the process can start over again. It is possibly to [dynamically determine the number of epochs](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/), but for the sake of simplicity we just choose 30. +- The **learning rate**, or the aggressiveness with which the optimizer (in our case, the Adam optimizer) will attempt to improve once the gradient is known, is set to 0.001. 
+- We obviously have 10 classes (the digits 0 up to and including 9), so **no\_classes** is 10. +- Twenty percent or 0.2 of the training data is used as validation data, so this defines our **validation\_split**. +- Finally, we set **verbosity** to 1, which means that all possible output is output to our screens. This helps in understanding what happens, but slightly slows down the process. Hence, if you're using those models for real, you may wish to turn verbose mode off, but for now, it's going to be useful. + +### Adding helper functions + +Contrary to the two-dimensional CNN, we must add some helper functions: + +``` +# Convert 1D vector into 3D values, provided by the 3D MNIST authors at +# https://www.kaggle.com/daavoo/3d-mnist +def array_to_color(array, cmap="Oranges"): + s_m = plt.cm.ScalarMappable(cmap=cmap) + return s_m.to_rgba(array)[:,:-1] + +# Reshape data into format that can be handled by Conv3D layers. +# Courtesy of Sam Berglin; Zheming Lian; Jiahui Jang - University of Wisconsin-Madison +# Report - https://github.com/sberglin/Projects-and-Papers/blob/master/3D%20CNN/Report.pdf +# Code - https://github.com/sberglin/Projects-and-Papers/blob/master/3D%20CNN/network_final_version.ipynb +def rgb_data_transform(data): + data_t = [] + for i in range(data.shape[0]): + data_t.append(array_to_color(data[i]).reshape(16, 16, 16, 3)) + return np.asarray(data_t, dtype=np.float32) +``` + +The first helper function, `array_to_color`, was provided by the authors of the [3D MNIST dataset](https://www.kaggle.com/daavoo/3d-mnist) and courtesy goes out to them. What it does is this: the imported data will be of one channel only. This function converts the data into RGB format, and hence into three channels. This ensures resemblence with the original 2D scenario. + +Next, we use `rgb_data_transform`, which was created by machine learning students [Sam Berglin, Zheming Lian and Jiahui Jang](https://github.com/sberglin/Projects-and-Papers/blob/master/3D%20CNN/network_final_version.ipynb) at the University of Wisconsin-Madison. Under guidance of professor Sebastian Raschka, whose [Mlxtend](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) library we use quite often, they also created a 3D ConvNet for the 3D MNIST dataset, but then using PyTorch instead of Keras. + +The function reshapes the data, which per sample comes in a (4096,) shape (16x16x16 pixels = 4096 pixels), so in a one-dimensional array. Their function reshapes the data into three-channeled, four-dimensional 16x16x16x3 format, making use of `array_to_color`. The Conv3D function can now handle the data. 
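+
+If you want to convince yourself of what these helpers do, you could run a quick, purely illustrative sanity check on some dummy data once both functions are defined (with Numpy and Matplotlib imported as shown earlier):
+
+```
+# Illustrative sanity check: 5 dummy samples of 16 x 16 x 16 = 4096 voxel intensities
+dummy = np.random.rand(5, 4096)
+transformed = rgb_data_transform(dummy)
+print(dummy.shape)        # (5, 4096)
+print(transformed.shape)  # (5, 16, 16, 16, 3) - three 'RGB' channels per voxel
+```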
+ +### Data preparation + +We can next import and prepare the data: + +``` +# -- Process code -- +# Load the HDF5 data file +with h5py.File("./full_dataset_vectors.h5", "r") as hf: + + # Split the data into training/test features/targets + X_train = hf["X_train"][:] + targets_train = hf["y_train"][:] + X_test = hf["X_test"][:] + targets_test = hf["y_test"][:] + + # Determine sample shape + sample_shape = (16, 16, 16, 3) + + # Reshape data into 3D format + X_train = rgb_data_transform(X_train) + X_test = rgb_data_transform(X_test) + + # Convert target vectors to categorical targets + targets_train = to_categorical(targets_train).astype(np.integer) + targets_test = to_categorical(targets_test).astype(np.integer) +``` + +The first line containing `with` ensures that we open up the [HDF5](https://www.machinecurve.com/index.php/2020/04/13/how-to-use-h5py-and-keras-to-train-with-data-from-hdf5-files/) file as `hf`, which we can subsequently use to retrieve the data we need. + +Specifically, we first load the training and testing data into two different variables: the `X`es for the feature vectors, the `targets` for the... well, unsurprisingly, targets 😄 + +Next, we determine the shape of each sample, which we must supply to the Keras model later. + +Next, we actually transform and reshape the data from one-channeled (4096,) format into three-channeled (16, 16, 16, 3) format. This is followed by converting the targets into categorical format, which concludes the preparatory phase. + +### Model architecture & training + +We can now finally create the model architecture and start the training process. + +First - the architecture: + +``` +# Create the model +model = Sequential() +model.add(Conv3D(32, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=sample_shape)) +model.add(MaxPooling3D(pool_size=(2, 2, 2))) +model.add(Conv3D(64, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform')) +model.add(MaxPooling3D(pool_size=(2, 2, 2))) +model.add(Flatten()) +model.add(Dense(256, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(no_classes, activation='softmax')) +``` + +As discussed, we use the Keras Sequential API with Conv3D, MaxPooling3D, Flatten and Dense layers. + +Specifically, we use two three-dimensional convolutional layers with 3x3x3 kernels, ReLU [activation functions](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) and hence He uniform [init](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/). + +3D max pooling is applied with 2x2x2 pool sizes. + +Once the convolutional operations are completed, we Flatten the feature maps and feed the result to a Dense layer which also activates and initializes using the ReLU/He combination. + +Finally, we output the data into a Dense layer with `no_classes` (= 10) neurons and a Softmax activation function. This activation function generates a multiclass probability distribution over all the possible target classes, essentially a vector with probabilities that the sample belongs to that particular class, all values summing to 100% (or, statistically, 1). 
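+
+If you want to verify how the feature maps shrink layer by layer, you can print a summary of the architecture defined above. The expected output shapes - given `sample_shape = (16, 16, 16, 3)` - are added as comments below as a reference sketch; the exact formatting of `model.summary()` depends on your TensorFlow version:
+
+```
+# Print a summary of the architecture defined above
+model.summary()
+
+# Expected output shapes:
+# Conv3D (32 filters, 3x3x3)  -> (None, 14, 14, 14, 32)
+# MaxPooling3D (2x2x2)        -> (None, 7, 7, 7, 32)
+# Conv3D (64 filters, 3x3x3)  -> (None, 5, 5, 5, 64)
+# MaxPooling3D (2x2x2)        -> (None, 2, 2, 2, 64)
+# Flatten                     -> (None, 512)
+# Dense (256)                 -> (None, 256)
+# Dense (no_classes = 10)     -> (None, 10)
+```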
+ +Second - the training procedure: + +``` +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(lr=learning_rate), + metrics=['accuracy']) + +# Fit data to model +history = model.fit(X_train, targets_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +We first `compile` the model which essentially configures the architecture according to the hyperparameters that we set in the configuration section. + +Next, we `fit` the data to the model, using the other configuration settings set before. Fitting the data starts the training process. The output of this training process is stored in the `history` object which we can use for [visualization purposes](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/). + +### Model evaluation + +Finally, we can add some code for evaluating model performance: + +``` +# Generate generalization metrics +score = model.evaluate(X_test, targets_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Plot history: Categorical crossentropy & Accuracy +plt.plot(history.history['loss'], label='Categorical crossentropy (training data)') +plt.plot(history.history['val_loss'], label='Categorical crossentropy (validation data)') +plt.plot(history.history['accuracy'], label='Accuracy (training data)') +plt.plot(history.history['val_accuracy'], label='Accuracy (validation data)') +plt.title('Model performance for 3D MNIST Keras Conv3D example') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() +``` + +The above code simply evaluates the model by means of the testing data, printing the output to the console, as well as generating a plot displaying categorical crossentropy & accuracy over the training epochs. + +* * * + +## The model altogether + +Altogether, we arrive at this model code: + +``` +''' + A simple Conv3D example with TensorFlow 2 based Keras +''' +import tensorflow +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv3D, MaxPooling3D +from tensorflow.keras.utils import to_categorical +import h5py +import numpy as np +import matplotlib.pyplot as plt + +# -- Preparatory code -- +# Model configuration +batch_size = 100 +no_epochs = 30 +learning_rate = 0.001 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Convert 1D vector into 3D values, provided by the 3D MNIST authors at +# https://www.kaggle.com/daavoo/3d-mnist +def array_to_color(array, cmap="Oranges"): + s_m = plt.cm.ScalarMappable(cmap=cmap) + return s_m.to_rgba(array)[:,:-1] + +# Reshape data into format that can be handled by Conv3D layers. 
+# Courtesy of Sam Berglin; Zheming Lian; Jiahui Jang - University of Wisconsin-Madison +# Report - https://github.com/sberglin/Projects-and-Papers/blob/master/3D%20CNN/Report.pdf +# Code - https://github.com/sberglin/Projects-and-Papers/blob/master/3D%20CNN/network_final_version.ipynb +def rgb_data_transform(data): + data_t = [] + for i in range(data.shape[0]): + data_t.append(array_to_color(data[i]).reshape(16, 16, 16, 3)) + return np.asarray(data_t, dtype=np.float32) + +# -- Process code -- +# Load the HDF5 data file +with h5py.File("./full_dataset_vectors.h5", "r") as hf: + + # Split the data into training/test features/targets + X_train = hf["X_train"][:] + targets_train = hf["y_train"][:] + X_test = hf["X_test"][:] + targets_test = hf["y_test"][:] + + # Determine sample shape + sample_shape = (16, 16, 16, 3) + + # Reshape data into 3D format + X_train = rgb_data_transform(X_train) + X_test = rgb_data_transform(X_test) + + # Convert target vectors to categorical targets + targets_train = to_categorical(targets_train).astype(np.integer) + targets_test = to_categorical(targets_test).astype(np.integer) + + # Create the model + model = Sequential() + model.add(Conv3D(32, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=sample_shape)) + model.add(MaxPooling3D(pool_size=(2, 2, 2))) + model.add(Conv3D(64, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform')) + model.add(MaxPooling3D(pool_size=(2, 2, 2))) + model.add(Flatten()) + model.add(Dense(256, activation='relu', kernel_initializer='he_uniform')) + model.add(Dense(no_classes, activation='softmax')) + + # Compile the model + model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(lr=learning_rate), + metrics=['accuracy']) + + # Fit data to model + history = model.fit(X_train, targets_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + + # Generate generalization metrics + score = model.evaluate(X_test, targets_test, verbose=0) + print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + + # Plot history: Categorical crossentropy & Accuracy + plt.plot(history.history['loss'], label='Categorical crossentropy (training data)') + plt.plot(history.history['val_loss'], label='Categorical crossentropy (validation data)') + plt.plot(history.history['accuracy'], label='Accuracy (training data)') + plt.plot(history.history['val_accuracy'], label='Accuracy (validation data)') + plt.title('Model performance for 3D MNIST Keras Conv3D example') + plt.ylabel('Loss value') + plt.xlabel('No. 
epoch') + plt.legend(loc="upper left") + plt.show() +``` + +## Model performance + +Running the model produces mediocre performance - a test accuracy of approximately 65.6%, contrary to the 99%+ of the 2D model: + +``` +Train on 8000 samples, validate on 2000 samples +Epoch 1/30 +2019-10-18 14:49:16.626766: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll +2019-10-18 14:49:17.253904: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll +8000/8000 [==============================] - 5s 643us/step - loss: 2.1907 - accuracy: 0.2256 - val_loss: 1.8527 - val_accuracy: 0.3580 +Epoch 2/30 +8000/8000 [==============================] - 2s 305us/step - loss: 1.6607 - accuracy: 0.4305 - val_loss: 1.4618 - val_accuracy: 0.5090 +Epoch 3/30 +8000/8000 [==============================] - 2s 308us/step - loss: 1.3590 - accuracy: 0.5337 - val_loss: 1.2485 - val_accuracy: 0.5760 +Epoch 4/30 +8000/8000 [==============================] - 2s 309us/step - loss: 1.2173 - accuracy: 0.5807 - val_loss: 1.2304 - val_accuracy: 0.5620 +Epoch 5/30 +8000/8000 [==============================] - 2s 306us/step - loss: 1.1320 - accuracy: 0.6084 - val_loss: 1.1913 - val_accuracy: 0.5795 +Epoch 6/30 +8000/8000 [==============================] - 2s 305us/step - loss: 1.0423 - accuracy: 0.6376 - val_loss: 1.1136 - val_accuracy: 0.6140 +Epoch 7/30 +8000/8000 [==============================] - 2s 310us/step - loss: 0.9899 - accuracy: 0.6572 - val_loss: 1.0940 - val_accuracy: 0.6255 +Epoch 8/30 +8000/8000 [==============================] - 2s 304us/step - loss: 0.9365 - accuracy: 0.6730 - val_loss: 1.0905 - val_accuracy: 0.6310 +Epoch 9/30 +8000/8000 [==============================] - 2s 305us/step - loss: 0.8850 - accuracy: 0.6975 - val_loss: 1.0407 - val_accuracy: 0.6425 +Epoch 10/30 +8000/8000 [==============================] - 2s 309us/step - loss: 0.8458 - accuracy: 0.7115 - val_loss: 1.0667 - val_accuracy: 0.6315 +Epoch 11/30 +8000/8000 [==============================] - 3s 320us/step - loss: 0.7971 - accuracy: 0.7284 - val_loss: 1.0328 - val_accuracy: 0.6420 +Epoch 12/30 +8000/8000 [==============================] - 3s 328us/step - loss: 0.7661 - accuracy: 0.7411 - val_loss: 1.0596 - val_accuracy: 0.6365 +Epoch 13/30 +8000/8000 [==============================] - 3s 324us/step - loss: 0.7151 - accuracy: 0.7592 - val_loss: 1.0463 - val_accuracy: 0.6470 +Epoch 14/30 +8000/8000 [==============================] - 3s 334us/step - loss: 0.6850 - accuracy: 0.7676 - val_loss: 1.0592 - val_accuracy: 0.6355 +Epoch 15/30 +8000/8000 [==============================] - 3s 341us/step - loss: 0.6359 - accuracy: 0.7839 - val_loss: 1.0492 - val_accuracy: 0.6555 +Epoch 16/30 +8000/8000 [==============================] - 3s 334us/step - loss: 0.6136 - accuracy: 0.7960 - val_loss: 1.0399 - val_accuracy: 0.6570 +Epoch 17/30 +8000/8000 [==============================] - 3s 327us/step - loss: 0.5794 - accuracy: 0.8039 - val_loss: 1.0548 - val_accuracy: 0.6545 +Epoch 18/30 +8000/8000 [==============================] - 3s 330us/step - loss: 0.5398 - accuracy: 0.8169 - val_loss: 1.0807 - val_accuracy: 0.6550 +Epoch 19/30 +8000/8000 [==============================] - 3s 351us/step - loss: 0.5199 - accuracy: 0.8236 - val_loss: 1.0881 - val_accuracy: 0.6570 +Epoch 20/30 +8000/8000 [==============================] - 3s 332us/step - loss: 0.4850 - accuracy: 0.8350 - val_loss: 1.0920 - val_accuracy: 0.6485 +Epoch 21/30 +8000/8000 
[==============================] - 3s 330us/step - loss: 0.4452 - accuracy: 0.8549 - val_loss: 1.1540 - val_accuracy: 0.6510 +Epoch 22/30 +8000/8000 [==============================] - 3s 332us/step - loss: 0.4051 - accuracy: 0.8696 - val_loss: 1.1422 - val_accuracy: 0.6570 +Epoch 23/30 +8000/8000 [==============================] - 3s 347us/step - loss: 0.3743 - accuracy: 0.8811 - val_loss: 1.1720 - val_accuracy: 0.6610 +Epoch 24/30 +8000/8000 [==============================] - 3s 349us/step - loss: 0.3575 - accuracy: 0.8816 - val_loss: 1.2174 - val_accuracy: 0.6580 +Epoch 25/30 +8000/8000 [==============================] - 3s 349us/step - loss: 0.3223 - accuracy: 0.8981 - val_loss: 1.2345 - val_accuracy: 0.6525 +Epoch 26/30 +8000/8000 [==============================] - 3s 351us/step - loss: 0.2859 - accuracy: 0.9134 - val_loss: 1.2514 - val_accuracy: 0.6555 +Epoch 27/30 +8000/8000 [==============================] - 3s 347us/step - loss: 0.2598 - accuracy: 0.9218 - val_loss: 1.2969 - val_accuracy: 0.6595 +Epoch 28/30 +8000/8000 [==============================] - 3s 350us/step - loss: 0.2377 - accuracy: 0.9291 - val_loss: 1.3296 - val_accuracy: 0.6625 +Epoch 29/30 +8000/8000 [==============================] - 3s 349us/step - loss: 0.2119 - accuracy: 0.9362 - val_loss: 1.3784 - val_accuracy: 0.6550 +Epoch 30/30 +8000/8000 [==============================] - 3s 350us/step - loss: 0.1987 - accuracy: 0.9429 - val_loss: 1.4143 - val_accuracy: 0.6515 +Test loss: 1.4300630502700806 / Test accuracy: 0.656000018119812 +``` + +We can derive a little bit more information from the diagram that we generated based on the `history` object: + +[![](images/3d_mnist_perf-1024x581.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/3d_mnist_perf.png) + +The first and most clear warning signal is the orange line, or the categorical crossentropy loss on the validation data. It's increasing, which means that the model is overfitting - or adapting too much to the training data. The blue line illustrates this even further, since loss is decreasing rapidly there, while the 'check' gets worse and worse. + +This deviation also becomes visible in the accuracy plot, albeit less significantly. + +Now - we got a working Conv3D model with the 3D MNIST dataset, but can we improve on the 65.6% accuracy by doing something about the overfitting? + +* * * + +## Battling overfitting + +### Adding Dropout + +Adding [Dropout](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/) to the model architecture allows us to 'drop' random elements from the feature maps during training. 
Although this confuses the model, it disallows it to adapt too much to the training data: + +``` +# Create the model +model = Sequential() +model.add(Conv3D(32, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform', input_shape=sample_shape)) +model.add(MaxPooling3D(pool_size=(2, 2, 2))) +model.add(Dropout(0.5)) +model.add(Conv3D(64, kernel_size=(3, 3, 3), activation='relu', kernel_initializer='he_uniform')) +model.add(MaxPooling3D(pool_size=(2, 2, 2))) +model.add(Dropout(0.5)) +model.add(Flatten()) +model.add(Dense(256, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(no_classes, activation='softmax')) +``` + +Don't forget to add it as an extra import: + +``` +from tensorflow.keras.layers import Dense, Flatten, Conv3D, MaxPooling3D, Dropout +``` + +With Dropout, overfitting can be reduced: + +[![](images/with_dropout-1024x497.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/with_dropout.png) + +However, testing accuracy remains mediocre. This suggests that the model cannot further improve because the _quantity of data_ is too low. Perhaps, if more data were added, or when a process called Data Augmentation is used, we can improve performance even further. However, that's for another time! 😎 + +* * * + +## Summary + +In this blog post, we've seen how Conv3D layers differ from Conv2D but more importantly, we've seen a Keras based implementation of a convolutional neural network that can handle three-dimensional input data. I hope you've learnt something from this blog - and if you did, I would appreciate a comment below! 👇 + +Thanks for reading and happy engineering 😄 + +_Note that the code for this blog post is also available on [GitHub](https://github.com/christianversloot/keras-cnn)._ + +* * * + +## References + +GitHub. (n.d.). daavoo - Overview. Retrieved from [https://github.com/daavoo](https://github.com/daavoo) + +Berglin, S., Lian, Z., & Jiang, J. (2019). 3D Convolutional Neural Networks. Retrieved from [https://github.com/sberglin/Projects-and-Papers/blob/master/3D%20CNN/Report.pdf](https://github.com/sberglin/Projects-and-Papers/blob/master/3D%20CNN/Report.pdf) + +Kaggle. (n.d.). 3D MNIST. Retrieved from [https://www.kaggle.com/daavoo/3d-mnist](https://www.kaggle.com/daavoo/3d-mnist) + +GitHub. (2019, September 19). Kaggle/kaggle-api. Retrieved from [https://github.com/Kaggle/kaggle-api](https://github.com/Kaggle/kaggle-api) + +MachineCurve. (2019, May 30). Convolutional Neural Networks and their components for computer vision. Retrieved from [https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) + +MachineCurve. (2019, September 23). Understanding separable convolutions. Retrieved from [https://www.machinecurve.com/index.php/2019/09/23/understanding-separable-convolutions/](https://www.machinecurve.com/index.php/2019/09/23/understanding-separable-convolutions/) + +About loss and loss functions – MachineCurve. (2019, October 15). Retrieved from [https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) + +Avoid wasting resources with EarlyStopping and ModelCheckpoint in Keras – MachineCurve. (2019, June 3). 
Retrieved from [https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/) diff --git a/about-loss-and-loss-functions.md b/about-loss-and-loss-functions.md new file mode 100644 index 0000000..6a585a1 --- /dev/null +++ b/about-loss-and-loss-functions.md @@ -0,0 +1,488 @@ +--- +title: "About loss and loss functions" +date: "2019-10-04" +categories: + - "deep-learning" + - "svms" +tags: + - "classifier" + - "deep-learning" + - "loss-function" + - "machine-learning" + - "optimizer" + - "regression" + - "support-vector-machine" +--- + +When you're training supervised machine learning models, you often hear about a **loss function** that is minimized; that must be chosen, and so on. + +The term **cost function** is also used equivalently. + +**But what is loss? And what is a loss function?** + +I'll answer these two questions in this blog, which focuses on this optimization aspect of machine learning. We'll first cover the high-level supervised learning process, to set the stage. This includes the role of training, validation and testing data when training supervised models. + +Once we're up to speed with those, we'll introduce loss. We answer the question _what is loss?_ However, we don't forget _what is a loss function?_ We'll even look into some commonly used loss functions. + +Let's go! 😎 + +\[toc\] + +\[ad\] + +## The high-level supervised learning process + +Before we can actually introduce the concept of loss, we'll have to take a look at the **high-level supervised machine learning process**. All supervised training approaches fall under this process, which means that it is equal for deep neural networks such as [MLPs](https://www.machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/) or [ConvNets](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/), but also for [SVMs](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/). + +Let's take a look at this training process, which is cyclical in nature. + +[![](images/High-level-training-process-1024x973.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/09/High-level-training-process.jpg) + +### Forward pass + +We start with our **features and targets**, which are also called your _dataset_. This dataset is split into three parts before the training process starts: training data, validation data and testing data. The training data is used during the training process; more specificially, to generate predictions during the forward pass. However, after each training cycle, the predictive performance of the model must be tested. This is what the validation data is used for - it helps during model optimization. + +Then there is testing data left. Assume that the validation data, which is essentially a statistical _sample_, does not fully match the _population it describes_ in statistical terms. That is, the sample does not represent it fully and by consequence the mean and variance of the sample are (hopefully) slightly different than the actual population mean and variance. Hence, a little bias is introduced into the model every time you'll optimize it with your validation data. While it may thus still work very well in terms of _predictive power_, it may be the case that it will lose its power to _generalize_. 
In that case, it would no longer work for data it has never seen before, e.g. data from a different sample. The _testing data_ is used to test the model once the entire training process has finished (i.e., only after the last cycle), and allows us to tell something about the generalization power of our machine learning model. + +The _training data_ is fed into the machine learning model in what is called the **forward pass**. The origin of this name is really easy: the data is simply fed to the network, which means that it passes through it in a forward fashion. The end result is a set of predictions, one per sample. This means that when my training set consists of 1000 feature vectors (or rows with features) that are accompanied by 1000 targets, I will have 1000 predictions after my forward pass. + +\[ad\] + +### Loss + +You do however want to know how well the model performs with respect to the targets originally set. A well-performing model would be interesting for production usage, whereas an ill-performing model must be optimized before it can be actually used. + +**This is where the concept of loss enters the equation.** + +Most generally speaking, the _loss allows us to compare between some actual targets and predicted targets_. It does so by imposing a "cost" (or, using a different term, a "loss") on each prediction if it deviates from the actual targets. + +It's relatively easy to compute the loss conceptually: we agree on some cost for our machine learning predictions, compare the 1000 targets with the 1000 predictions and compute the 1000 costs, then add everything together and present the global **loss**. + +Our goal when training a machine learning model? + +**To minimize the loss**. + +The reason why is simple: the lower the loss, the more the set of targets and the set of predictions resemble each other. + +And the more they resemble each other, the better the machine learning model performs. + +As you can see in the machine learning process depicted above, arrows are flowing backwards towards the machine learning model. Their goal: to optimize the internals of your model only slightly, so that it will perform better during the next cycle (or iteration, or epoch, as they are also called). + +### Backwards pass + +When loss is computed, the model must be improved. This is done by propagating the error backwards to the model structure, such as the **model's weights**. This closes the learning cycle between feeding data forward, generating predictions, and improving it - by adapting the weights, the model likely improves (sometimes much, sometimes slightly) and hence _learning takes place_. + +Depending on the model type used, there are many ways for optimizing the model, i.e. propagating the error backwards. In neural networks, often, a combination of **gradient descent based methods** and **backpropagation** is used: gradient descent like optimizers for computing the _gradient_ or the direction in which to optimize, backpropagation for the actual error propagation. + +In other model types, such as [Support Vector Machines](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/), we do not actually propagate the error backward, strictly speaking. However, we use methods such as **quadratic optimization** to find the mathematical optimum, which given linear separability of your data (whether in regular space or kernel space) must exist. However, visualizing it as "adapting the weights by computing some error" benefits understanding. 
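+
+To make the cycle of forward pass, loss computation and backwards pass a bit more tangible, here is a deliberately tiny sketch in plain NumPy. It is not how Keras or any other framework implements things internally - it just fits a single weight \[latex\]w\[/latex\] to a made-up dataset with gradient descent, so you can see the three steps from the diagram in code form:
+
+```
+import numpy as np
+
+# Made-up dataset: the "true" relationship is y = 2x
+x = np.array([1.0, 2.0, 3.0, 4.0])
+y = np.array([2.0, 4.0, 6.0, 8.0])
+
+w = 0.0               # our single model "weight"
+learning_rate = 0.01
+
+for epoch in range(100):
+    predictions = w * x                               # forward pass
+    loss = np.mean((y - predictions) ** 2)            # loss (MSE, covered below)
+    gradient = -2.0 * np.mean((y - predictions) * x)  # direction of improvement
+    w = w - learning_rate * gradient                  # backwards pass: adapt the weight
+
+print(w, loss)  # w converges towards 2.0, loss towards 0
+```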
Next up - the loss functions we can actually use for computing the error! 😄
+
+\[ad\]
+
+## Loss functions
+
+Here, we'll cover a wide array of loss functions: some of them for regression, others for classification.
+
+### Loss functions for regression
+
+There are two main types of supervised learning problems: [classification](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/#what-is-a-classifier) and [regression](https://www.machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/#mlps-for-classification-and-regression-the-differences). In the first, your aim is to classify a sample into the correct bucket, e.g. into one of the buckets 'diabetes' or 'no diabetes'. In the latter case, however, you don't _classify_ but rather _estimate_ some real-valued number. What you're trying to do is _regress a mathematical function from some input data_, and hence it's called regression. For regression problems, there are many loss functions available.
+
+#### Mean Absolute Error (L1 Loss)
+
+**Mean Absolute Error** (MAE) is one of them. This is what it looks like:
+
+![](images/image-16-1024x185.png)
+
+Don't worry about the maths, we'll introduce the MAE intuitively now.
+
+That weird E-like sign you see in the formula is what is called a Sigma sign, and it sums up what's behind it: `|Ei|`, in our case, where `Ei` is the error (the difference between prediction and actual value) and the | signs mean that you're taking the _absolute value_, i.e. -3 becomes 3 and 3 remains 3.
+
+The summation, in this case, means that we sum all the errors, for all the `n` samples that were used for training the model. We therefore end up with one single number. We divide this number by `n`, or the number of samples used, to find the _mean_, or the average Absolute Error: the Mean Absolute Error or MAE.
+
+It's perfectly possible to use the MAE in a multitude of regression scenarios (Rich, n.d.). However, if your average error is very small, it may be better to use the Mean Squared Error that we will introduce next.
+
+What's more, and this is important: when you use the MAE in optimizations that use gradient descent, you'll face the fact that the gradients are continuously large (Grover, 2019). Since this also occurs when the loss is low (and hence, you would only need to _move a tiny bit_), this is bad for learning - it's easy to continuously overshoot the minimum, finding a suboptimal model. Consider _Huber loss_ (more below) if you face this problem. If you face larger errors and don't care (yet?) about this issue with gradients, or if you're here to learn, let's move on to Mean Squared Error!
+
+#### Mean Squared Error
+
+Another loss function used often in regression is **Mean Squared Error** (MSE). It sounds really difficult, especially when you look at the formula (Binieli, 2018):
+
+![](images/image-14-1024x296.png)
+
+... but fear not. It's actually really easy to understand what MSE is and what it does!
+
+We'll break the formula above into three parts, which allows us to understand each element and subsequently how they work together to produce the MSE.
+
+![](images/image-15-1024x290.png)
+
+The primary part of the MSE is the middle part, being the Sigma symbol or the _summation sign_. What it does is really simple: it counts from _i_ to _n_, and on every count executes what's written behind it. In this case, that's the third part - the square of (Yi - Y'i).
+
+In our case, `i` starts at 1 and _n_ is not yet defined.
Rather, `n` is the number of samples in our training set and hence the number of predictions that has been made. In the scenario sketched above, `n` would be 1000. + +Then, the third part. It's actually mathematical notation for what we already intuitively learnt earlier: it's the difference between the actual target for the sample (`Yi`) and the predicted target (`Y'i`), the latter of which is removed from the first. + +With one minor difference: the end result of this computation is _squared_. This property introduces some mathematical benefits during optimization (Rich, n.d.). Particularly, the MSE is continuously differentiable whereas the MAE is not (at x = 0). This means that optimizing the MSE is easier than optimizing the MAE. + +Additionally, large errors introduce a much larger cost than smaller errors (because the differences are squared and larger errors produce much larger squares than smaller errors). This is both good and bad at the same time (Rich, n.d.). This is a good property when your errors are small, because optimization is then advanced (Quora, n.d.). However, using MSE rather than e.g. MAE will open your ML model up to outliers, which will severely disturb training (by means of introducing large errors). + +Although the conclusion may be rather unsatisfactory, choosing between MAE and MSE is thus often heavily dependent on the dataset you're using, introducing the need for some a priori inspection before starting your training process. + +Finally, when we have the sum of the squared errors, we divide it by n - producing the _mean squared error_. + +#### Mean Absolute Percentage Error + +The **Mean Absolute Percentage Error**, or MAPE, really looks like the MAE, even though the formula looks somewhat different: + +![](images/image-18-1024x269.png) + +When using the MAPE, we don't compute the absolute error, but rather, the _mean error percentage with respect to the actual values_. That is, suppose that my prediction is 12 while the actual target is 10, the MAPE for this prediction is \[latex\]| (10 - 12 ) / 10 | = 0.2\[/latex\]. + +Similar to the MAE, we sum the error over all the samples, but subsequently face a different computation: \[latex\]100\\% / n\[/latex\]. This looks difficult, but we can once again separate this computation into more easily understandable parts. More specifically, we can write it as a multiplication of \[latex\]100\\%\[/latex\] and \[latex\]1 / n\[/latex\] instead. When multiplying the latter with the sum, you'll find the same result as dividing it by `n`, which we did with the MAE. That's great. + +The only thing left now is multiplying the whole with 100%. Why do we do that? Simple: because our _computed error_ is a ratio and not a percentage. Like the example above, in which our error was 0.2, we don't want to find the ratio, but the percentage instead. \[latex\]0.2 \\times 100\\%\[/latex\] is ... unsurprisingly ... \[latex\]20\\%\[/latex\]! Hence, we multiply the mean ratio error with the percentage to find the MAPE! :-) + +Why use MAPE if you can also use MAE? + +\[ad\] + +Very good question. + +Firstly, it is a very intuitive value. Contrary to the absolute error, we have a sense of how _well-performing_ the model is or how _bad it performs_ when we can express the error in terms of a percentage. An error of 100 may seem large, but if the actual target is 1000000 while the estimate is 1000100, well, you get the point. + +Secondly, it allows us to compare the performance of regression models on different datasets (Watson, 2019). 
Suppose that our goal is to train a regression model on the NASDAQ ETF and the Dutch AEX ETF. Since their absolute values are quite different, using MAE won't help us much in comparing the performance of our model. MAPE, on the other hand, demonstrates the error in terms of a _percentage_ - and a percentage is a percentage, whether you apply it to NASDAQ or to AEX. This way, it's possible to compare model performance across statistically varying datasets. + +#### Root Mean Squared Error (L2 Loss) + +Remember the MSE? + +![](images/image-14-1024x296.png) + +There's also something called the RMSE, or the **Root Mean Squared Error** or Root Mean Squared Deviation (RMSD). It goes like this: + +![](images/image.png) + +Simple, hey? It's just the MSE but then its square root value. + +How does this help us? + +The errors of the MSE are squared - hey, what's in a name. + +The RMSE or RMSD errors are _root squares_ of the _square_ - and hence are back at the scale of the original targets (Dragos, 2018). This gives you much better intuition for the error in terms of the targets. + +#### Logcosh + +"Log-cosh is the logarithm of the hyperbolic cosine of the prediction error." (Grover, 2019). + +Well, how's that for a starter. + +This is the mathematical formula: + +![](images/image-3.png) + +And this the plot: + +[![](images/logcosh-1024x433.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/10/logcosh.jpeg) + +Okay, now let's introduce some intuitive explanation. + +The [TensorFlow docs](https://www.tensorflow.org/api_docs/python/tf/keras/losses/logcosh) write this about Logcosh loss: + +> `log(cosh(x))` is approximately equal to `(x ** 2) / 2` for small `x` and to `abs(x) - log(2)` for large `x`. This means that 'logcosh' works mostly like the mean squared error, but will not be so strongly affected by the occasional wildly incorrect prediction. + +Well, that's great. It seems to be an improvement over MSE, or L2 loss. Recall that MSE is an improvement over MAE (L1 Loss) if your data set contains quite large errors, as it captures these better. However, this also means that it is much more sensitive to errors than the MAE. Logcosh helps against this problem: + +- For relatively small errors (even with the _relatively small but larger errors_, which is why MSE can be better for your ML problem than MAE) it outputs approximately equal to \[latex\]x^2 / 2\[/latex\] - which is pretty equal to the \[latex\]x^2\[/latex\] output of the MSE. +- For larger errors, i.e. outliers, where MSE would produce extremely large errors (\[latex\](10^6)^2 = 10^12\[/latex\]), the Logcosh approaches \[latex\]|x| - log(2)\[/latex\]. It's like (as well as unlike) the MAE, but then somewhat corrected by the `log`. + +Hence: indeed, if you have _both larger errors_ that must be detected _as well as outliers_, which you perhaps cannot remove from your dataset, consider using Logcosh! It's available in many frameworks like TensorFlow as we saw above, but also in [Keras](http://keras.io/losses#logcosh). + +\[ad\] + +#### Huber loss + +Let's move on to **Huber loss**, which we already hinted about in the section about the MAE: + +![](images/image-4-1024x284.png) + +Or, visually: + +[![](images/huberloss-1024x580.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/10/huberloss.jpeg) + +When interpreting the formula, we see two parts: + +- \[latex\]1/2 \\times (t-p)^2\[/latex\], when \[latex\]|t-p| \\leq \\delta\[/latex\]. This sounds very complicated, but we can break it into parts easily. 
+ - \[latex\]|t-p|\[/latex\] is the _absolute error_: the difference between target \[latex\]t\[/latex\] and prediction \[latex\]p\[/latex\]. + - We square it and divide it by two. + - We however only do so when the absolute error is smaller than or equal to some \[latex\]\\delta\[/latex\], also called delta, which **you** can configure! We'll see next why this is nice. +- When the absolute error is _larger than_ \[latex\]\\delta\[/latex\], we compute the error as follows: \[latex\]\\delta \\times |t-p| - (\\delta^2 / 2)\[/latex\]. + - Let's break this apart again. We multiply the delta with the absolute error and remove half of delta square. + +**What is the effect of all this mathematical juggling?** + +Look at the visualization above. + +For relatively small deltas (in our case, with \[latex\]\\delta = 0.25\[/latex\], you'll see that the loss function becomes relatively flat. It takes quite a long time before loss increases, even when predictions are getting larger and larger. + +For larger deltas, the slope of the function increases. As you can see, the larger the delta, the slower the _increase of this slope:_ eventually, for really large \[latex\]\\delta\[/latex\] the slope of the loss tends to converge to some maximum. + +If you look closely, you'll notice the following: + +- With small \[latex\]\\delta\[/latex\], the loss becomes relatively insensitive to larger errors and outliers. This might be good if you have them, but bad if on average your errors are small. +- With large \[latex\]\\delta\[/latex\], the loss becomes increasingly sensitive to larger errors and outliers. That might be good if your errors are small, but you'll face trouble when your dataset contains outliers. + +Hey, haven't we seen that before? + +Yep: in our discussions about the MAE (insensitivity to larger errors) and the MSE (fixes this, but facing sensitivity to outliers). + +Grover (2019) writes about this [nicely](https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0): + +> Huber loss approaches MAE when 𝛿 ~ 0 and MSE when 𝛿 ~ ∞ (large numbers.) + +That's what this \[latex\]\\delta\[/latex\] is for! You are now in control about the 'degree' of MAE vs MSE-ness you'll introduce in your loss function. When you face large errors due to outliers, you can try again with a lower \[latex\]\\delta\[/latex\]; if your errors are too small to be picked up by your Huber loss, you can increase the delta instead. + +And there's another thing, which we also mentioned when discussing the MAE: it produces large gradients when you optimize your model by means of gradient descent, even when your errors are small (Grover, 2019). This is bad for model performance, as you will likely overshoot the mathematical optimum for your model. You don't face this problem with MSE, as it tends to decrease towards the actual minimum (Grover, 2019). If you switch to Huber loss from MAE, you might find it to be an additional benefit. + +Here's why: Huber loss, like MSE, decreases as well when it approaches the mathematical optimum (Grover, 2019). This means that you can combine the best of both worlds: the insensitivity to larger errors from MAE with the sensitivity of the MSE and its suitability for gradient descent. Hooray for Huber loss! And like always, it's also available when you train models with [Keras](https://keras.io/losses/#huber_loss). 
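+
+To see the two regimes of the formula in action, here is a small NumPy sketch of Huber loss. It is an illustrative implementation following the definition above, not the Keras one, and the example values are made up:
+
+```
+import numpy as np
+
+def huber_loss(targets, predictions, delta=1.0):
+    error = np.abs(targets - predictions)
+    # Quadratic branch, used when |t - p| <= delta
+    quadratic = 0.5 * error ** 2
+    # Linear branch, used when |t - p| > delta
+    linear = delta * error - 0.5 * delta ** 2
+    return np.mean(np.where(error <= delta, quadratic, linear))
+
+targets = np.array([1.0, 2.0, 10.0])
+predictions = np.array([1.1, 2.5, 2.0])
+print(huber_loss(targets, predictions, delta=1.0))   # the outlier (10 vs 2) is punished linearly
+print(huber_loss(targets, predictions, delta=10.0))  # with a large delta, it is punished quadratically, MSE-style
+```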
+
+**Then why isn't this the perfect loss function?**
+
+Because the benefit of the \[latex\]\\delta\[/latex\] is also becoming your bottleneck (Grover, 2019). As you have to configure it manually (or perhaps using some automated tooling), you'll have to spend time and resources on finding the most optimum \[latex\]\\delta\[/latex\] for your dataset. This is an iterative problem that, in the extreme case, may become impractical at best and costly at worst. However, in most cases, it's best just to experiment - perhaps, you'll find better results!
+
+\[ad\]
+
+### Loss functions for classification
+
+Loss functions are also applied in classifiers. I already discussed in another post what classification is all about, so I'm going to repeat it here:
+
+> Suppose that you work in the field of separating non-ripe tomatoes from the ripe ones. It’s an important job, one can argue, because we don’t want to sell customers tomatoes they can’t process into dinner. It’s the perfect job to illustrate what a human classifier would do.
+>
+> Humans have a perfect eye to spot tomatoes that are not ripe or that have any other defect, such as being rotten. They derive certain characteristics for those tomatoes, e.g. based on color, smell and shape:
+>
+> \- If it’s green, it’s likely to be unripe (or: not sellable);
+> \- If it smells, it is likely to be unsellable;
+> \- The same goes for when it’s white or when fungus is visible on top of it.
+>
+> If none of those occur, it’s likely that the tomato can be sold. We now have _two classes_: sellable tomatoes and non-sellable tomatoes. Human classifiers _decide about which class an object (a tomato) belongs to._
+>
+> The same principle occurs again in machine learning and deep learning.
+> Only then, we replace the human with a machine learning model. We’re then using machine learning for _classification_, or for deciding about some “model input” to “which class” it belongs.
+>
+> Source: [How to create a CNN classifier with Keras?](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/)
+
+We'll now cover loss functions that are used for classification.
+
+#### Hinge
+
+The **hinge loss** is defined as follows (Wikipedia, 2011):
+
+![](images/image-1.png)
+
+It simply takes the maximum of either 0 or the computation \[latex\] 1 - t \\times y\[/latex\], where `t` is the true target (-1 or +1) and `y` is the machine learning output value (a real-valued score).
+
+When the target equals the prediction, the computation \[latex\]t \\times y\[/latex\] is always one: \[latex\]1 \\times 1 = -1 \\times -1 = 1\[/latex\]. Essentially, because then \[latex\]1 - t \\times y = 1 - 1 = 0\[/latex\], the `max` function takes the maximum \[latex\]max(0, 0)\[/latex\], which of course is 0.
+
+That is: when the actual target meets the prediction, the loss is zero. Negative loss doesn't exist. When the target != the prediction, the loss value increases.
+
+For `t = 1`, i.e. when \[latex\]1\[/latex\] is your target, hinge loss looks like this:
+
+[![](images/hinge_loss-1024x507.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/10/hinge_loss.jpeg)
+
+Let's now consider three scenarios which can occur, given our target \[latex\]t = 1\[/latex\] (Kompella, 2017; Wikipedia, 2011):
+
+- The prediction is correct, which occurs when \[latex\]y \\geq 1.0\[/latex\].
+- The prediction is very incorrect, which occurs when \[latex\]y < 0.0\[/latex\] (because the sign swaps, in our case from positive to negative).
+- The prediction is not correct, but we're getting there (\[latex\] 0.0 \\leq y < 1.0\[/latex\]).
+
+In the first case, e.g. when \[latex\]y = 1.2\[/latex\], the output of \[latex\]1 - t \\times y\[/latex\] will be \[latex\] 1 - ( 1 \\times 1.2 ) = 1 - 1.2 = -0.2\[/latex\]. Loss, then, will be \[latex\]max(0, -0.2) = 0\[/latex\]. Hence, for all correct predictions - even if they are _too correct_ - loss is zero. In the _too correct_ situation, the classifier is simply very sure that the prediction is correct (Peltarion, n.d.).
+
+In the second case, e.g. when \[latex\]y = -0.5\[/latex\], the output of the loss equation will be \[latex\]1 - (1 \\times -0.5) = 1 - (-0.5) = 1.5\[/latex\], and hence the loss will be \[latex\]max(0, 1.5) = 1.5\[/latex\]. Very wrong predictions are hence penalized significantly by the hinge loss function.
+
+In the third case, e.g. when \[latex\]y = 0.9\[/latex\], the output of the loss function will be \[latex\]1 - (1 \\times 0.9) = 1 - 0.9 = 0.1\[/latex\]. Loss will be \[latex\]max(0, 0.1) = 0.1\[/latex\]. We're getting there - and that's also indicated by the small but nonzero loss.
+
+What this essentially sketches is a _margin_ that you try to _maximize_: when the prediction is correct or even too correct, it doesn't matter much, but when it's not, we're trying to correct. The correction process keeps going until the prediction is fully correct (or when the human tells the improvement to stop). We're thus finding the most optimum decision boundary and are hence performing a maximum-margin operation.
+
+It is therefore not surprising that hinge loss is one of the most commonly used loss functions in [Support Vector Machines](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) (Kompella, 2017). What's more, hinge loss itself _cannot be used with gradient descent-like optimizers_, those with which (deep) neural networks are trained. This occurs due to the fact that it's not continuously differentiable, more precisely at the 'boundary' between no loss / minimum loss. Fortunately, a subgradient of the hinge loss function can be optimized, so it can (albeit in a different form) still be used in today's deep learning models (Wikipedia, 2011). For example, hinge loss is available as a [loss function](https://keras.io/losses#hinge) in Keras.
+
+#### Squared hinge
+
+The **squared hinge loss** is like the hinge formula displayed above, but then the \[latex\]max()\[/latex\] function output is _squared_.
+
+This helps achieve two things:
+
+- Firstly, it makes the loss value more sensitive to outliers, just as we saw with MSE vs MAE. Large errors will add to the loss more significantly than smaller errors. Note that similarly, this may also mean that you'll need to inspect your dataset for the presence of such outliers first.
+- Secondly, squared hinge loss is differentiable whereas hinge loss is not (Tay, n.d.). The way the hinge loss is defined makes it not differentiable at the 'boundary' point of the chart - [also see this perfect answer that illustrates it](https://www.quora.com/Why-is-squared-hinge-loss-differentiable). Squared hinge loss, on the other hand, is differentiable, _simply because of the square_ and the mathematical benefits it introduces during differentiation. This makes it easier for us to use a hinge-like loss in gradient based optimization - we'll simply take squared hinge.
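+
+As a quick illustration of the three scenarios above, here is a small NumPy sketch that computes (squared) hinge loss for a target of +1 and the three example outputs we just discussed. It is a bare-bones sketch of the formulas, not the Keras implementation:
+
+```
+import numpy as np
+
+def hinge_loss(targets, outputs):
+    # targets: -1 or +1, outputs: raw model scores
+    return np.mean(np.maximum(0.0, 1.0 - targets * outputs))
+
+def squared_hinge_loss(targets, outputs):
+    return np.mean(np.maximum(0.0, 1.0 - targets * outputs) ** 2)
+
+targets = np.array([1.0, 1.0, 1.0])
+outputs = np.array([1.2, -0.5, 0.9])   # correct, very incorrect, almost correct
+
+print(hinge_loss(targets, outputs))          # (0 + 1.5 + 0.1) / 3 = 0.533...
+print(squared_hinge_loss(targets, outputs))  # (0 + 2.25 + 0.01) / 3 = 0.753...
+```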
+ +\[ad\] + +#### Categorical / multiclass hinge + +Both normal hinge and squared hinge loss work only for _binary classification problems_ in which the actual target value is either +1 or -1. Although that's perfectly fine for when you have such problems (e.g. the [diabetes yes/no problem](https://www.machinecurve.com/index.php/2019/07/24/why-you-cant-truly-create-rosenblatts-perceptron-with-keras/) that we looked at previously), there are many other problems which cannot be solved in a binary fashion. + +(Note that one approach to create a multiclass classifier, especially with SVMs, is to create many binary ones, feeding the data to each of them and counting classes, eventually taking the most-chosen class as output - it goes without saying that this is not very efficient.) + +However, in neural networks and hence gradient based optimization problems, we're not interested in doing that. It would mean that we have to train _many networks_, which significantly impacts the time performance of our ML training problem. Instead, we can use the _multiclass hinge_ that has been introduced by researchers Weston and Watkins (Wikipedia, 2011): + +![](images/image-2-1024x170.png) + +What this means in plain English is this: + +**For all \[latex\]y\[/latex\] (output) values unequal to \[latex\]t\[/latex\], compute the loss. Eventually, sum them together to find the multiclass hinge loss.** + +Note that this does not mean that you sum over _all possible values for y_ (which would be all real-valued numbers except \[latex\]t\[/latex\]), but instead, you compute the sum over _all the outputs generated by your ML model during the forward pass_. That is, all the predictions. Only for those where \[latex\]y \\neq t\[/latex\], you compute the loss. This is obvious from an efficiency point of view: where \[latex\]y = t\[/latex\], loss is always zero, so no \[latex\]max\[/latex\] operation needs to be computed to find zero after all. + +Keras implements the multiclass hinge loss as [categorical hinge loss](https://keras.io/losses/#categorical_hinge), requiring to change your targets into categorical format (one-hot encoded format) first by means of `to_categorical`. + +#### Binary crossentropy + +A loss function that's used quite often in today's neural networks is **binary crossentropy**. As you can guess, it's a loss function for _binary_ classification problems, i.e. where there exist two classes. Primarily, it can be used where the output of the neural network is somewhere between 0 and 1, e.g. by means of the Sigmoid layer. + +This is its formula: + +![](images/image-5-1024x122.png) + +It can be visualized in this way: + +[![](images/bce-1-1024x421.png)](blob:https://www.machinecurve.com/3ed39fd0-ad6b-45d4-a546-1fad50051cc9) + +And, like before, let's now explain it in more intuitive ways. + +The \[latex\]t\[/latex\] in the formula is the _target_ (0 or 1) and the \[latex\]p\[/latex\] is the prediction (a real-valued number between 0 and 1, for example 0.12326). + +When you input both into the formula, loss will be computed related to the target and the prediction. In the visualization above, where the target is 1, it becomes clear that loss is 0. However, when moving to the left, loss tends to increase (ML Cheatsheet documentation, n.d.). What's more, it increases increasingly fast. 
Hence, it not only tends to _punish wrong predictions_, but also _wrong predictions that are extremely confident_ (i.e., if the model is very confident that it's 0 while it's 1, it gets punished much harder than when it thinks it's somewhere in between, e.g. 0.5). This latter property makes binary crossentropy a valued loss function in classification problems.
+
+When the target is 0, you can see that the loss is mirrored - which is exactly what we want:
+
+![](images/bce_t0-1024x459.png)
+
+#### Categorical crossentropy
+
+Now what if you have no _binary_ classification problem, but instead a _multiclass one_?
+
+Thus: one where your output can belong to one of > 2 classes.
+
+The [CNN that we created with Keras](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) using the MNIST dataset is a good example of this problem. As you can find in the blog (see the link), we used a different loss function there - **categorical crossentropy**. It's still crossentropy, but then adapted to multiclass problems.
+
+![](images/image-6.png)
+
+This is the formula with which we compute categorical crossentropy. Put very simply, we sum over all the classes that we have in our system; for each class, we take the target of the _observation_ and the prediction of the _observation_, and multiply the observation target with the natural log of the observation prediction.
+
+It took me some time to understand what was meant with a prediction, though, but thanks to Peltarion (n.d.), I got it.
+
+The answer lies in the fact that the crossentropy is _categorical_ and that hence _categorical data is used_, with _one-hot encoding_.
+
+Suppose that we have a dataset that presents what the odds are of getting diabetes after five years, just like the [Pima Indians dataset](https://www.machinecurve.com/index.php/2019/07/24/why-you-cant-truly-create-rosenblatts-perceptron-with-keras/) we used before. However, this time another class is added, being "Possibly diabetic", rendering us three classes for one's condition after five years given current measurements:
+
+- 0: no diabetes
+- 1: possibly diabetic
+- 2: diabetic
+
+That dataset would look like this:
+
+| Features | Target |
+|----------|--------|
+| { … } | 1 |
+| { … } | 2 |
+| { … } | 0 |
+| { … } | 0 |
+| { … } | 2 |
+| …and so on | …and so on |
+
+However, categorical crossentropy cannot simply use _integers_ as targets, because its formula doesn't support this. Instead, we must apply _one-hot encoding_, which transforms the integer targets into categorical vectors, which are just vectors that display all categories and whether the sample belongs to that class or not:
+
+- 0: \[latex\]\[1, 0, 0\]\[/latex\]
+- 1: \[latex\]\[0, 1, 0\]\[/latex\]
+- 2: \[latex\]\[0, 0, 1\]\[/latex\]
+
+\[ad\]
+
+That's what we always do with `to_categorical` in Keras.
+
+Our dataset then looks as follows:
+
+| Features | Target |
+|----------|--------|
+| { … } | [latex][0, 1, 0][/latex] |
+| { … } | [latex][0, 0, 1][/latex] |
+| { … } | [latex][1, 0, 0][/latex] |
+| { … } | [latex][1, 0, 0][/latex] |
+| { … } | [latex][0, 0, 1][/latex] |
+| …and so on | …and so on |
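+
+In code, producing these categorical vectors is a one-liner. A small sketch that converts the integer targets from the earlier table into exactly the one-hot vectors shown above, using the 0/1/2 class meanings defined for the diabetes example:
+
+```
+import numpy as np
+from tensorflow.keras.utils import to_categorical
+
+# Integer targets: 0 = no diabetes, 1 = possibly diabetic, 2 = diabetic
+integer_targets = np.array([1, 2, 0, 0, 2])
+print(to_categorical(integer_targets, num_classes=3))
+# [[0. 1. 0.]
+#  [0. 0. 1.]
+#  [1. 0. 0.]
+#  [1. 0. 0.]
+#  [0. 0. 1.]]
+```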
+
+Now, we can explain what is meant with _an observation_.
+
+Let's look at the formula again and recall that we iterate over all the possible output classes - **once for every prediction made**, with some true target:
+
+![](images/image-6.png)
+
+Now suppose that our trained model outputs for the set of features \[latex\]{ ... }\[/latex\] or a very similar one that has target \[latex\]\[0, 1, 0\]\[/latex\] a probability distribution of \[latex\]\[0.25, 0.50, 0.25\]\[/latex\] - that's what these models do, they don't pick one class, but instead compute the probability that the sample belongs to each particular class in the categorical vector.
+
+Computing the loss, for \[latex\]c = 1\[/latex\], what is the target value? It's 0: in \[latex\]\\textbf{t} = \[0, 1, 0\]\[/latex\], the target value for class 0 is 0.
+
+What is the prediction? Well, following the same logic, the prediction is 0.25.
+
+We call these two values _observations_ with respect to the total prediction. By looking at all _observations_, merging them together, we can find the loss value for the entire prediction.
+
+We multiply the target value with the log. But wait! We multiply the log with **0** - so the loss value for this target is 0.
+
+It shouldn't surprise you that this happens for all targets **except for one** - where the target value is 1: in the prediction above, that would be for the second one.
+
+Note that when the sum is complete, you'll multiply it with -1 to find the true categorical crossentropy loss.
+
+Hence, loss is driven by the actual target observation of your sample instead of all the non-targets. The structure of the formula however allows us to perform multiclass machine learning training with crossentropy. There we go, we learnt another loss function :-)
+
+#### Sparse categorical crossentropy
+
+But what if we don't want to convert our integer targets into categorical format? We can use sparse categorical crossentropy instead (Lin, 2019).
+
+It performs in pretty much the same way as regular categorical crossentropy loss, but allows you to use integer targets instead! That's nice.
+
+| Features | Target |
+|----------|--------|
+| { … } | 1 |
+| { … } | 2 |
+| { … } | 0 |
+| { … } | 0 |
+| { … } | 2 |
+| …and so on | …and so on |
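+
+A small sketch to show that both variants compute the same thing: for the \[latex\]\[0.25, 0.50, 0.25\]\[/latex\] prediction from before, sparse categorical crossentropy with the integer target 1 and categorical crossentropy with the one-hot target \[latex\]\[0, 1, 0\]\[/latex\] should yield the same loss value, approximately \[latex\]-ln(0.5) \\approx 0.693\[/latex\]:
+
+```
+import numpy as np
+import tensorflow as tf
+
+prediction = np.array([[0.25, 0.50, 0.25]])
+
+# Integer target, no one-hot encoding required
+print(tf.keras.losses.sparse_categorical_crossentropy([1], prediction).numpy())
+
+# One-hot encoded target, as used by regular categorical crossentropy
+print(tf.keras.losses.categorical_crossentropy([[0.0, 1.0, 0.0]], prediction).numpy())
+
+# Both print approximately [0.6931]
+```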
+ +#### Kullback-Leibler divergence + +Sometimes, machine learning problems involve the comparison between two probability distributions. An example comparison is the situation below, in which the question is _how much the uniform distribution differs from the Binomial(10, 0.2) distribution_. + +[![](images/kld.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/kld.png) + +When you wish to compare two probability distributions, you can use the Kullback-Leibler divergence, a.k.a. KL divergence (Wikipedia, 2004): + +\\begin{equation} KL (P || Q) = \\sum p(X) \\log ( p(X) \\div q(X) ) \\end{equation} + +KL divergence is an adaptation of entropy, which is a common metric in the field of information theory (Wikipedia, 2004; Wikipedia, 2001; Count Bayesie, 2017). While intuitively, entropy tells you something about "the quantity of your information", KL divergence tells you something about "the change of quantity when distributions are changed". + +Your goal in machine learning problems is to ensure that \[latex\]change \\approx 0\[/latex\]. + +Is KL divergence used in practice? Yes! _Generative machine learning models_ work by drawing a sample from encoded, latent space, which effectively represents a latent probability distribution. In other scenarios, you might wish to perform _multiclass classification_ with neural networks that use Softmax activation in their output layer, effectively generating a probability distribution across the classes. And so on. In those cases, you can use KL divergence loss during training. It compares the probability distribution represented by your training data with the probability distribution generated during your [forward pass](#forward-pass), and computes the _divergence_ (the difference, although when you swap distributions, the value changes due to non-symmetry of KL divergence - hence it's not _entirely_ the difference) between the two probability distributions. This is your loss value. Minimizing the loss value thus essentially steers your neural network towards the probability distribution represented in your training set, which is what you want. + +## Summary + +In this blog, we've looked at the concept of loss functions, also known as cost functions. We showed why they are necessary by means of illustrating the high-level machine learning process and (at a high level) what happens during optimization. Additionally, we covered a wide range of loss functions, some of them for classification, others for regression. Although we introduced some maths, we also tried to explain them intuitively. + +I hope you've learnt something from my blog! If you have any questions, remarks, comments or other forms of feedback, please feel free to leave a comment below! 👇 I'd also appreciate a comment telling me if you learnt something and if so, what you learnt. I'll gladly improve my blog if mistakes are made. Thanks and happy engineering! 😎 + +## References + +Chollet, F. (2017). _Deep Learning with Python_. New York, NY: Manning Publications. + +Keras. (n.d.). Losses. Retrieved from [https://keras.io/losses/](https://keras.io/losses/) + +Binieli, M. (2018, October 8). Machine learning: an introduction to mean squared error and regression lines. Retrieved from [https://www.freecodecamp.org/news/machine-learning-mean-squared-error-regression-line-c7dde9a26b93/](https://www.freecodecamp.org/news/machine-learning-mean-squared-error-regression-line-c7dde9a26b93/) + +Rich. (n.d.). 
Why square the difference instead of taking the absolute value in standard deviation? Retrieved from [https://stats.stackexchange.com/a/121](https://stats.stackexchange.com/a/121) + +Quora. (n.d.). What is the difference between squared error and absolute error? Retrieved from [https://www.quora.com/What-is-the-difference-between-squared-error-and-absolute-error](https://www.quora.com/What-is-the-difference-between-squared-error-and-absolute-error) + +Watson, N. (2019, June 14). Using Mean Absolute Error to Forecast Accuracy. Retrieved from [https://canworksmart.com/using-mean-absolute-error-forecast-accuracy/](https://canworksmart.com/using-mean-absolute-error-forecast-accuracy/) + +Drakos, G. (2018, December 5). How to select the Right Evaluation Metric for Machine Learning Models: Part 1 Regression Metrics. Retrieved from [https://towardsdatascience.com/how-to-select-the-right-evaluation-metric-for-machine-learning-models-part-1-regrression-metrics-3606e25beae0](https://towardsdatascience.com/how-to-select-the-right-evaluation-metric-for-machine-learning-models-part-1-regrression-metrics-3606e25beae0) + +Wikipedia. (2011, September 16). Hinge loss. Retrieved from [https://en.wikipedia.org/wiki/Hinge\_loss](https://en.wikipedia.org/wiki/Hinge_loss) + +Kompella, R. (2017, October 19). Support vector machines ( intuitive understanding ) ? Part#1. Retrieved from [https://towardsdatascience.com/support-vector-machines-intuitive-understanding-part-1-3fb049df4ba1](https://towardsdatascience.com/support-vector-machines-intuitive-understanding-part-1-3fb049df4ba1) + +Peltarion. (n.d.). Squared hinge. Retrieved from [https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/squared-hinge](https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/squared-hinge) + +Tay, J. (n.d.). Why is squared hinge loss differentiable? Retrieved from [https://www.quora.com/Why-is-squared-hinge-loss-differentiable](https://www.quora.com/Why-is-squared-hinge-loss-differentiable) + +Rakhlin, A. (n.d.). Online Methods in Machine Learning. Retrieved from [http://www.mit.edu/~rakhlin/6.883/lectures/lecture05.pdf](http://www.mit.edu/~rakhlin/6.883/lectures/lecture05.pdf) + +Grover, P. (2019, September 25). 5 Regression Loss Functions All Machine Learners Should Know. Retrieved from [https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0](https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0) + +TensorFlow. (n.d.). tf.keras.losses.logcosh. Retrieved from [https://www.tensorflow.org/api\_docs/python/tf/keras/losses/logcosh](https://www.tensorflow.org/api_docs/python/tf/keras/losses/logcosh) + +ML Cheatsheet documentation. (n.d.). Loss Functions. Retrieved from [https://ml-cheatsheet.readthedocs.io/en/latest/loss\_functions.html](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html) + +Peltarion. (n.d.). Categorical crossentropy. Retrieved from [https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/categorical-crossentropy](https://peltarion.com/knowledge-center/documentation/modeling-view/build-an-ai-model/loss-functions/categorical-crossentropy) + +Lin, J. (2019, September 17). categorical\_crossentropy VS. sparse\_categorical\_crossentropy. 
Retrieved from [https://jovianlin.io/cat-crossentropy-vs-sparse-cat-crossentropy/](https://jovianlin.io/cat-crossentropy-vs-sparse-cat-crossentropy/) + +Wikipedia. (2004, February 13). Kullback–Leibler divergence. Retrieved from [https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler\_divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) + +Wikipedia. (2001, July 9). Entropy (information theory). Retrieved from [https://en.wikipedia.org/wiki/Entropy\_(information\_theory)](https://en.wikipedia.org/wiki/Entropy_(information_theory)) + +Count Bayesie. (2017, May 10). Kullback-Leibler Divergence Explained. Retrieved from [https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained) diff --git a/albert-explained-a-lite-bert.md b/albert-explained-a-lite-bert.md new file mode 100644 index 0000000..59fe622 --- /dev/null +++ b/albert-explained-a-lite-bert.md @@ -0,0 +1,212 @@ +--- +title: "ALBERT explained: A Lite BERT" +date: "2021-01-06" +categories: + - "deep-learning" +tags: + - "albert" + - "bert" + - "deep-learning" + - "language-model" + - "machine-learning" + - "natural-language-processing" + - "nlp" + - "transformers" +--- + +Transformer models like GPT-3 and [BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) have been really prominent in today's Natural Language Processing landscape. They have built upon the [original Transformer model](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/), which performed [sequence-to-sequence tasks](https://www.machinecurve.com/index.php/2020/12/29/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning/), and are capable of performing a wide variety of language tasks such as text summarization and machine translation. Text generation is also one of their capabilities, this is true especially for the models from the GPT model family. + +While being very capable, in fact capable of generating human-like text, they also come with one major drawback: they are huge. The size of models like BERT significantly limits their adoption, because they cannot be run on normal machines and even require massive GPU resources to even get them running properly. + +In other words: a solution for this problem is necessary. In an attempt to change this, Lan et al. (2019) propose **ALBERT**, which stands for **A Lite BERT**. By changing a few things in BERT's architecture, they can create a model that is capable of achieving the same performance as BERT, but only at a fraction of the parameters and hence computational cost. + +In this article, we'll explain the ALBERT model. First of all, we're going to take a look at the problem in a bit more detail, by taking a look at BERT's size drawback. We will then introduce the ALBERT model and take a look at the three key differences compared to BERT: factorized embeddings, cross-layer parameter sharing and another language task, namely inter-sentence coherence loss. Don't worry about the technical terms, because we're going to take a look at them in relatively plain English, to make things understandable even for beginners. + +Once we know how ALBERT works, we're going to take a brief look at its performance. We will see that it actually works better, and we will also see that this behavior emerges from the changes ALBERT has incorporated. + +Let's take a look! 
😎

* * *

\[toc\]

* * *

## BERT's (and other models') drawback: it's _huge_

If you want to understand what the ALBERT model is and what it does, it can be a good idea to read our [Introduction to the BERT model](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) first.

In that article, we cover BERT in more detail, and we see how it improves upon the [vanilla Transformer](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) proposed in 2017 - a model that has changed the Natural Language Processing field significantly by showing that language models can be created that rely on the attention mechanism alone.

However, let's take a quick look at BERT here as well before we move on. Below, you can see a high-level representation of BERT, or at least its input and output structure.

- BERT always takes two sets of tokens as inputs, a sentence A and a sentence B. Note that, depending on the task, sentence B can be empty (i.e. its set of tokens is empty) whereas sentence A is filled all the time. The latter scenario happens during regular text classification tasks such as sentiment analysis, whereas with other tasks (such as textual entailment, i.e. learning text directionality) both sentences must be filled.
- Text from sentences A and B is first tokenized. Before the tokens from set A, we add a **classification token**, \[CLS\]. This token learns to contain sentence-level information based on interactions with the textual tokens in [BERT's attention mechanism.](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) The output of this token, called C, can be used to e.g. fine-tune the model on sentence-level tasks.
- After \[CLS\], we add the tokens from sentence A. We then add a separation token, \[SEP\], and then continue with the tokens from sentence B. In other words, the input to BERT is therefore a set of tokens, with some manual token interventions in between and in front of the textual tokens.
- Tokens are fed into BERT, meaning that they are converted into word embeddings first. They are then taken through the Transformer model, [meaning that attention is computed across tokens](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/), and that the output is a set of vectors representing state.
- BERT utilizes two language tasks for this purpose: a **Masked Language Model (MLM)** task for predicting output tokens ("given these input tokens, what is the most likely output token?" - indeed, it should be the actual, masked token from the input, but it's the task of the model to learn this). It also utilizes a **Next Sentence Prediction (NSP)** task to learn sentence-level information available in C.

![](images/Diagram-44-1024x625.png)

Previous studies (such as the [study creating BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) or the [one creating GPT](https://www.machinecurve.com/index.php/2021/01/05/dall-e-openai-gpt-3-model-can-draw-pictures-based-on-text/)) have demonstrated that the size of language models is related to performance. The bigger the language model, the better the model performs, is the general finding.

> Evidence from these improvements reveals that a large network is of crucial importance for achieving state-of-the-art performance
>
> Lan et al. 
(2019)

While this allows us to build models that really work well, this also comes at a cost: models are really huge and therefore cannot be used widely in practice.

> An obstacle to answering this question is the memory limitations of available hardware. Given that current state-of-the-art models often have hundreds of millions or even billions of parameters, it is easy to hit these limitations as we try to scale our models. Training speed can also be significantly hampered in distributed training, as the communication overhead is directly proportional to the number of parameters in the model.
>
> Lan et al. (2019)

Recall that BERT comes in two flavors: a \[latex\]\\text{BERT}\_\\text{BASE}\[/latex\] model that has 110 million trainable parameters, and a \[latex\]\\text{BERT}\_\\text{LARGE}\[/latex\] model that has 340 million (Devlin et al., 2018).

This is _huge!_ Compare this to relatively simple ConvNets, [which if really small](https://www.machinecurve.com/index.php/2019/12/19/creating-a-signal-noise-removal-autoencoder-with-keras/) can be < 100k parameters in size.

The effect, as suggested above, is that scaling models often means that engineers run into resource limits during deployment. There is also an impact on the training process, especially when training is distributed (i.e. across many machines), because the computational overhead of [distributed training strategies](https://www.machinecurve.com/index.php/question/what-are-tensorflow-distribution-strategies/) can be really big, especially with so many parameters.

In their work, Lan et al. (2019) have tried to answer one question in particular: _Is having better NLP models as easy as having larger models?_ As a result, they come up with a better BERT design, yielding a drop in parameters with only a small loss in terms of performance. Let's now take a look at ALBERT, or _a lite BERT_.

* * *

## ALBERT, A Lite BERT

According to the authors, the answer to that question is a **clear no** - better NLP models do not necessarily have to be _bigger_. In their work, which is referenced below as Lan et al. (2019) including a link, they introduce **A Lite BERT**, nicely abbreviated to **ALBERT**. Let's now take a look at it in more detail, so that we understand why it is smaller and why it supposedly works just as well, and perhaps even better when scaled to the same number of parameters as BERT.

[![](images/Diagram-6.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/Diagram-6.png)

From the paper, we come to understand that ALBERT simply utilizes the [BERT architecture](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/). This architecture, which itself is the [encoder segment from the original Transformer](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) (with only a few minor tweaks), is visible in the image on the right. It is changed in three key ways, which bring about a significant reduction in parameters:

- **Key difference 1:** embeddings are factorized, decomposing the embedding parameters into two smaller matrices, in addition to adaptations to the embedding size and hidden state size.
- **Key difference 2:** ALBERT applies cross-layer parameter sharing. In other words, parameters between certain subsegments from the (stacked) encoder segments are shared, e.g. the parameters of the Multi-head Self-Attention Segment and the Feedforward Segment. 
This is counter to BERT, which allows these segments to have their own parameters.
- **Key difference 3:** following post-BERT works which suggest that the Next Sentence Prediction (NSP) task utilized by BERT actually underperforms compared to what the model should be capable of, Lan et al. (2019) introduce a sentence-order prediction (SOP) loss task that actually learns about sentence coherence.

If things are not clear by now, don't worry - that was expected :D We're going to take a look at each difference in more detail next.

### Key difference 1: factorized embedding parameters

The first key difference between the BERT and ALBERT models is that **parameters of the word embeddings are factorized**.

> In mathematics, **factorization** (...) or **factoring** consists of writing a number or another mathematical object as a product of several _factors_, usually smaller or simpler objects of the same kind. For example, 3 × 5 is a factorization of the integer 15
>
> Wikipedia (2002)

Factorization of these parameters is achieved by taking the matrix representing the weights of the word embeddings \[latex\]E\[/latex\] and decomposing it into two smaller matrices. Instead of projecting the one-hot encoded vectors directly onto the hidden space, they are first projected onto a lower-dimensional embedding space, which is then projected onto the hidden space (Lan et al., 2019). By itself, this decomposition does not change what the model can represent - the parameter savings come from the next step.

That next step, which actually ensures that this change reduces the number of parameters, is that the authors suggest reducing the size of the embedding matrix. In BERT, the embedding size (the width of the embedding matrix E) equals the hidden state size H. According to the authors, this makes no sense from both a theoretical and a practical point of view.

First of all, theoretically, the matrix E captures context-independent information (i.e. a general word encoding) whereas the hidden representation H captures context-dependent information (i.e. related to the data with which the model is trained). According to Lan et al. (2019), BERT's performance emerges from using context to learn context-dependent representations. The context-independent aspects are not really involved. For this reason, they argue, \[latex\]\\text{H >> E}\[/latex\] (H must be a lot greater than E) in order to make things more efficient.

The authors argue that this is also true from a practical point of view. When \[latex\]\\text{E = H}\[/latex\], increasing the hidden state (and hence the capability for BERT to capture more contextual details) also increases the size of the matrix for E, which makes no sense, as E is context-independent. By consequence, models with billions of parameters become possible, most of which are updated only sparsely during training (Lan et al., 2019).

In other words, a case can be made that this is not really a good idea.

Recall that ALBERT solves this issue by decomposing the embedding parameters into two smaller matrices, allowing a two-step mapping between the original word vectors and the space of the hidden state. In terms of parameter count, this no longer means \[latex\]\\text{O(VxH)}\[/latex\] but rather \[latex\]\\text{O(VxE + ExH)}\[/latex\], which brings a significant reduction when \[latex\]\\text{H >> E}\[/latex\].
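
To get a feeling for the impact, here is a small illustrative computation - the vocabulary size and dimensionalities below are assumptions chosen for the sake of the example, not numbers taken from the paper:

```
# Illustrative parameter counts for the embedding matrix (all numbers are assumptions)
V = 30000  # vocabulary size
H = 4096   # hidden state size
E = 128    # factorized embedding size

params_unfactorized = V * H        # O(VxH): one big V x H embedding matrix
params_factorized = V * E + E * H  # O(VxE + ExH): two smaller matrices

print(params_unfactorized)  # 122880000, roughly 123M parameters
print(params_factorized)    # 4364288, roughly 4.4M parameters
```

With \[latex\]\\text{H >> E}\[/latex\], the factorized embedding is indeed only a small fraction of the original one.
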
### Key difference 2: cross-layer parameter sharing

The next key difference is that **between encoder segments, layer parameters are shared for every similar subsegment**.

This means that e.g. with 12 encoder segments:

- The **multi-head self-attention subsegments** share parameters (i.e. weights) across all twelve layers.
- The same is true for the **feedforward segments**.

The consequence of this change is that the number of parameters is reduced significantly, simply because they are shared. An additional benefit reported by Lan et al. (2019) goes beyond parameter reduction: parameter sharing also stabilizes the neural network. In other words, beyond simply reducing the computational cost involved with training, the paper suggests that sharing parameters can also improve the training process itself.

### Key difference 3: inter-sentence coherence loss

The third and final key difference is that instead of the Next Sentence Prediction (NSP) loss, an **inter-sentence coherence loss** called **sentence-order prediction (SOP)** is used.

The authors, based on previous findings that themselves are based on evaluations of the [BERT model](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/), argue that the NSP task can be unreliable. The key problem with this loss is that it merges topic prediction and coherence prediction into one task. Recall that NSP was added to BERT to predict whether two sentences are related (i.e. whether sentence B is actually the next sentence for sentence A or whether it is not). This involves both looking at the _topic_ ("what is this sentence about?") and some measure of coherence ("how related are the sentences?").

Intuitively, we can argue that topic prediction is much easier than coherence prediction. The consequence is that when the model discovers this, it can focus entirely on the easier subtask and forget about the coherence prediction task - effectively taking the path of least resistance. The authors demonstrate that this is happening with the NSP task, and replace it within their work with a **sentence-order prediction or SOP** task.

This task focuses on coherence prediction only. It utilizes the same technique as BERT (i.e. passing two consecutive segments), but is different because it doesn't pair a segment with a random sentence from another document for the negative ('is not next') case. Rather, it simply swaps two sentences that are always consecutive, effectively performing a related prediction problem that is focused entirely on coherence. It enforces that the model zooms in on the hard problem (coherence) instead of the easy one (topic prediction).
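
To make the difference between the two tasks a bit more concrete, here is a small sketch of how NSP and SOP training pairs could be constructed - this is our own illustration, not code from the Lan et al. (2019) paper:

```
import random

document = ["The cat sat on the mat.", "It then fell asleep.", "Nobody noticed."]
other_document = ["Stock prices rose sharply today."]

def make_nsp_pair(doc, other_doc):
    # NSP: the negative ('is not next') example pairs a segment with a random
    # segment from another document - which also leaks topic information
    a, b = doc[0], doc[1]
    if random.random() < 0.5:
        return (a, b), 1                      # is next
    return (a, random.choice(other_doc)), 0   # is not next

def make_sop_pair(doc):
    # SOP: the negative example simply swaps two consecutive segments from the
    # same document - so only their order (coherence) distinguishes the classes
    a, b = doc[0], doc[1]
    if random.random() < 0.5:
        return (a, b), 1   # correct order
    return (b, a), 0       # swapped order

print(make_nsp_pair(document, other_document))
print(make_sop_pair(document))
```
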
* * *

## Training ALBERT: high performance at lower cost

Now that we understand how ALBERT works and what the key differences are, let's take a look at how it is trained.

### Various configurations

In the Lan et al. paper from 2019, four ALBERT types are mentioned and compared to the two BERT models.

We can see that the ALBERT base model attempts to mimic BERT base, with a hidden state size of 768, parameter sharing and a smaller embedding size due to the factorization explained above. Contrary to BERT base's 108 million parameters, it has only 12 million. This makes a big difference when training the model.

Another model, ALBERT xxlarge (extra-extra large), has 235 million parameters, with 12 encoder segments, a 4096-dimensional hidden state and a 128-dimensional embedding size. It also includes parameter sharing. In theory, the context-dependent aspects of the model should be more performant than original BERT, since the hidden state is bigger. Let's now see whether this is true.

| Model | Type | No. Parameters | No. Encoder Segments | Hidden State Size | Embedding Size | Parameter Sharing |
| --- | --- | --- | --- | --- | --- | --- |
| BERT | base | 108M | 12 | 768 | 768 | False |
| BERT | large | 334M | 24 | 1024 | 1024 | False |
| ALBERT | base | 12M | 12 | 768 | 128 | True |
| ALBERT | large | 18M | 24 | 1024 | 128 | True |
| ALBERT | xlarge | 60M | 24 | 2048 | 128 | True |
| ALBERT | xxlarge | 235M | 12 | 4096 | 128 | True |

Source: Lan et al. (2019)

### Comparing BERT and ALBERT

ALBERT, like BERT, was pretrained on the BooksCorpus and English Wikipedia datasets (Lan et al., 2019). It was then further evaluated on three downstream benchmarks:

- The **General Language Understanding Evaluation (GLUE)** benchmark and its individual language tasks.
- Two versions of the **Stanford Question Answering Dataset (SQuAD)**.
- The **ReAding Comprehension from Examinations (RACE)** dataset.

The following results can be reported:

- The ALBERT xxlarge model performs significantly better than BERT large while it has 70% fewer parameters. Improvements per task (Lan et al., 2019): SQuAD v1.1 (+1.9%), SQuAD v2.0 (+3.1%), MNLI (+1.4%), SST-2 (+2.2%), and RACE (+8.4%).
- ALBERT models have higher data throughput compared to BERT models. This means that they can train faster than the BERT models - in fact, about 1.7 times faster.

### Ablation studies: do the differences cause the performance improvement?

Beyond the general results, the authors have also performed ablation experiments to see whether the changes actually cause the performance improvement, or not.

> An ablation study studies the performance of an AI system by removing certain components, to understand the contribution of the component to the overall system.
>
> Wikipedia (n.d.)

These are the results:

- For **factorized embeddings**, the authors report good performance. Both the case where cross-layer parameters were not shared and the case where they were are reported. Without sharing, larger embedding sizes give better performance. With sharing, the performance boost saturates at an embedding size of 128 dimensions. That's why the 128-size embeddings were used in the table above.
- For **cross-layer parameter sharing**, the authors looked at not performing cross-layer sharing, performing cross-layer sharing for the feedforward segments only, performing sharing for the attention segments only, and performing sharing for all subsegments. It turns out that sharing the parameters for the attention segments is most effective, while sharing the feedforward segment parameters does not contribute significantly. This clearly illustrates the important role of the attention mechanism in Transformer models. However, because all-segment sharing significantly decreases the number of parameters at only slightly worse performance compared to attention-only sharing, the authors chose to perform all-segment sharing instead.
- For the **SOP task**, we can read that if NSP is performed on a SOP task, performance is poor. NSP on NSP of course performs well, as does SOP on SOP. However, if SOP is performed on NSP, it performs really well. This suggests that SOP actually captures sentence coherence whereas NSP might not, and that SOP yields a better result than NSP.

Summarizing the ablation studies, we can see that every difference contributes to the performance of ALBERT models over the traditional BERT ones. 
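
If you want to play around with a pretrained ALBERT model yourself, checkpoints are publicly available - for example through the HuggingFace Transformers library. Below is a minimal sketch; it assumes that the `transformers` (with `sentencepiece`) and `torch` packages are installed and that the `albert-base-v2` checkpoint is available:

```
from transformers import AlbertTokenizer, AlbertModel

# Load a pretrained ALBERT base model and its tokenizer
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")

# Encode a sentence pair (sentence A and sentence B) and feed it through the model
inputs = tokenizer("This is sentence A.", "This is sentence B.", return_tensors="pt")
outputs = model(**inputs)

# One output vector per input token: (batch size, sequence length, hidden state size)
print(outputs.last_hidden_state.shape)
```
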
+ +* * * + +## Summary + +While Transformer models in general and BERT models in particular perform really well in Natural Language Processing, they are of massive size, which significantly limits adoption. For example, engineers who want to deploy these big models run into significant hardware issues, and the same is true for when they must be trained further, e.g. during finetuning. + +In a 2019 paper, a different type of model was proposed, called ALBERT or _A Lite BERT_. Using the same architecture - the encoder segment from the original Transformer - with three key differences, the authors attempted to prove that better NLP models does not necessarily mean bigger models. These are the differences: + +1. Factorized embedding parameters, decoupling them from the hidden state, allowing embeddings to be of lower size. This massively reduces the number of parameters within models. +2. Cross-layer parameter sharing between the attention subsegments within the encoder segments, as well as the feedforward ones. Once again, this reduces the number of parameters significantly. +3. A different language task: instead of Next Sentence Prediction (NSP), Sentence-order prediction (SOP) is performed, to improve upon concerns about NSP. + +The experiments show that a better and more contextual model (ALBERT xxlarge) can be trained that improves upon BERT large at only 70% of the amount of BERT large parameters. This shows that better language models can be created with fewer parameters, possibly making such language models a _bit_ more of a commodity. + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope that you have learned something from this article! If you did, please feel free to leave a message in the comments section below 💬 I'd love to hear from you. Please do the same if you have any questions, or click the **Ask Questions** button above. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). [Albert: A lite bert for self-supervised learning of language representations.](https://arxiv.org/abs/1909.11942) _arXiv preprint arXiv:1909.11942_. + +Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). [Bert: Pre-training of deep bidirectional transformers for language understanding](https://arxiv.org/abs/1810.04805). _arXiv preprint arXiv:1810.04805_. + +Wikipedia. (2002, September 8). _Factorization_. Wikipedia, the free encyclopedia. Retrieved January 6, 2021, from [https://en.wikipedia.org/wiki/Factorization](https://en.wikipedia.org/wiki/Factorization) + +MachineCurve. (2021, January 1). _What are ablation studies in machine learning?_ [https://www.machinecurve.com/index.php/question/what-are-ablation-studies-in-machine-learning/](https://www.machinecurve.com/index.php/question/what-are-ablation-studies-in-machine-learning/) + +Wikipedia. (n.d.). _Ablation (artificial intelligence)_. Wikipedia, the free encyclopedia. 
Retrieved January 6, 2021, from [https://en.wikipedia.org/wiki/Ablation\_(artificial\_intelligence)](https://en.wikipedia.org/wiki/Ablation_(artificial_intelligence)) diff --git a/an-introduction-to-dcgans.md b/an-introduction-to-dcgans.md new file mode 100644 index 0000000..aff128f --- /dev/null +++ b/an-introduction-to-dcgans.md @@ -0,0 +1,124 @@ +--- +title: "An introduction to DCGANs" +date: "2021-03-24" +categories: + - "buffer" + - "deep-learning" +tags: + - "convolutional-neural-networks" + - "dcgan" + - "deep-learning" + - "gan" + - "gans" + - "generative-adversarial-networks" + - "generative-models" + - "machine-learning" +--- + +The class of Generative Adversarial Network models, or GANs, belongs to the toolbox of any advanced Deep Learning engineer these days. First proposed [in 2014](https://www.machinecurve.com/index.php/2021/03/23/generative-adversarial-networks-a-gentle-introduction/), GANs can be used for a wide variety of generative applications - but primarily and most significantly the generation of images and videos. + +But as with any innovation in neural networks, the original propositions almost never scale well. The same is true for GANs: vanilla GANs suffer from instability during training. And that does not benefit the quality of the images that are generated. + +In this article, we're going to take a look at the **Deep Convolutional GAN** or DCGAN family of GAN architectures. Proposed by Radford et al. (2015) after analyzing a wide variety of architectural choices, DCGANs apply a set of best practices that make training more stable and efficient. The primary difference compared to vanilla GANs is the usage of Convolutional layers, and the possibility to do so in a stable way. + +By reading it, you will learn... + +- **That GANs can be used for feature extraction.** +- **How DCGANs make training GANs more stable, and primarily use Conv instead of Dense layers.** +- **A set of best practices for training your GAN compared to a vanilla GAN.** + +Let's take a look! 🚀 + +* * * + +\[toc\] + +* * * + +## Conv-based GANs and stability problems + +Generative Adversarial Networks or GANs have been around in the Generative Deep Learning field since the 2014 paper by Ian Goodfellow and others. As we know from our [introduction article on GANs](https://www.machinecurve.com/index.php/2021/03/23/generative-adversarial-networks-a-gentle-introduction/), they are composed of two models. The first, the generator, is responsible for generating fake images that cannot be distinguished from real ones. In other words, for counterfeiting. + +The second, however, is the police - and its job is to successfully detect when images presented to it are fake. + +### Using GANs for feature extraction + +The reason why Radford et al. (2015) were so interested in Generative Adversarial Networks in the first place was not because of their generative capabilities. Rather, GANs can also be used as feature extractors. Feature extraction, here, involves constructing a set of features that are more abstract but also informative about the features they were based on. In other words, it involves dimensionality reduction. + +Interestingly, parts of the Generator and the Discriminator of a GAN can be reused "as feature extractors for supervised tasks" (Radford et al., 2015). They hence become an interesting alternative to other approaches such as classic Convolutional Neural Networks. + +![](images/GAN-1024x431.jpg) + +The vanilla GAN proposed by Goodfellow et al. 
(2014) was however composed of densely-connected layers, a.k.a. Dense layers. This was natural for the time: AlexNet was only two years old and ConvNets were only slowly but surely overtaking 'classic' MLP-like neural networks.

Today, we know that when it comes to computer vision tasks, Dense layers are suboptimal compared to convolutional (a.k.a. Conv) layers. This is because the latter serve as trainable feature extractors. Rather than "showing the entire picture to each layer" (which is what happens when you use Dense layers), a Conv layer feeds only parts of the image to a set of neurons.

![](images/Cnn_layer-1.jpg)

Having produced significant performance improvements in regular classification tasks, Conv layers can also improve GAN performance.

### Stability

While GANs already showed great potential in the 2014 paper, they weren't perfect (and they still are not perfect today). Although adding Conv layers is a good option for improving the performance of a GAN, problems emerged related to stability (Radford et al., 2015).

And that's not good if we want to use them in practice. Let's take a look at what can be done to make Conv-based GANs more stable, according to the best practices found in the Radford et al. (2015) paper.

* * *

## Some best practices - introducing DCGANs

Radford et al. (2015), in their paper ["Unsupervised representation learning with deep convolutional generative adversarial networks"](https://arxiv.org/abs/1511.06434), explored possibilities for using convolutional layers in GANs to make them suitable as feature extractors for other vision approaches.

After "extensive model exploration" they identified "a family of architectures \[resulting\] in stable training across a range of datasets \[, allowing for\] higher resolution and deeper (...) models" (Radford et al., 2015). This family of architectures is named **DCGAN**, or **Deep Convolutional GANs**.

Converted into best practices, this is a list that, when followed, should improve any GAN compared to the vanilla ones from the early days:

1. **Minimizing fully connected layers:** Remove fully connected hidden layers for deeper architectures, [relying on Global Average Pooling instead](https://www.machinecurve.com/index.php/2020/01/31/reducing-trainable-parameters-with-a-dense-free-convnet-classifier/). If you cannot do that, make sure to add Dense layers only to the input of the Generator and the output of the Discriminator.
    - The first layer of the Generator must be a Dense layer because it must be able to take samples from the latent distribution \[latex\]Z\[/latex\] as its input.
    - The final layer of the Discriminator must be a Dense layer because it must be able to convert inputs to a probability value.
2. **Allowing the network to learn its own downsampling and upsampling.** This is achieved through replacing _deterministic pooling functions_ (like [max pooling](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/)) with strided convolutions in the Discriminator and fractional-strided convolutions in the Generator.
3. **Applying Batch Normalization.** Ensuring that the data distributions within each layer remain in check means that the weight updates oscillate less during training, and by consequence training is more stable.
4. **Use Rectified Linear Unit in the Generator.** The ReLU activation function is used in the generator, except for the last layer, which uses Tanh.
5. 
**Use Leaky ReLU in the Discriminator.** This was found to work well, in contrast to the Goodfellow et al. (2014) approach, which used maxout. Radford et al. (2015) set the slope of the leak to 0.2. + +For the record: for training their DCGANs, they used minibatch SGD with Adam optimization, a batch size of 128, weight init from a zero-centered Normal distribution with 0.02 stddev. Learning rate for Adam was set to 0.0002 (contrary to default 0.001) and the momentum term \[latex\]\\beta\_1\[/latex\] was reduced to 0.5 from 0.9. + +Best practices, always nice! + +* * * + +## Summary + +In this article, we studied a class of GAN architectures called DCGAN, or Deep Convolutional GAN. We saw that vanilla GANs suffer from instability during training, and that this is not too uncommon for innovations - remember that the original GAN was already proposed back in 2014! + +DCGANs apply a set of best practices identified by Radford et al. (2015) in a series of experiments. Minimizing the amount of fully-connected layers, replacing elements like Max Pooling with (fractional-)strided convolutions, applying Batch Normalization, using ReLU in the Generator and Leaky ReLU in the Discriminator stabilizes training - and allows you to achieve better results with your GAN. + +Summarizing everything, by reading this article, you have learned... + +- **That GANs can be used for feature extraction.** +- **How DCGANs make training GANs more stable, and primarily use Conv instead of Dense layers.** +- **A set of best practices for training your GAN compared to a vanilla GAN.** + +I hope that it was useful for your learning process! Please feel free to share what you have learned in the comments section 💬 I’d love to hear from you. Please do the same if you have any questions or other remarks. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Radford, A., Metz, L., & Chintala, S. (2015). [Unsupervised representation learning with deep convolutional generative adversarial networks.](https://arxiv.org/abs/1511.06434) _arXiv preprint arXiv:1511.06434_. + +Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … & Bengio, Y. (2014). [Generative adversarial networks.](https://arxiv.org/abs/1406.2661) _arXiv preprint arXiv:1406.2661_. + +Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin. [Striving for simplicity: The all convolutional net.](https://arxiv.org/abs/1412.6806) arXiv preprint arXiv:1412.6806, 2014. + +Mordvintsev, Alexander, Olah, Christopher, and Tyka, Mike. Inceptionism : Going deeper into neural networks. [http://googleresearch.blogspot.com/2015/06/ inceptionism-going-deeper-into-neural.html](http://googleresearch.blogspot.com/2015/06/ inceptionism-going-deeper-into-neural.html). + +Ioffe, Sergey and Szegedy, Christian. [Batch normalization: Accelerating deep network training by reducing internal covariate shift.](http://proceedings.mlr.press/v37/ioffe15.html) arXiv preprint arXiv:1502.03167, 2015. + +Nair, Vinod and Hinton, Geoffrey E. [Rectified linear units improve restricted boltzmann machines.](https://www.cs.toronto.edu/~hinton/absps/reluICML.pdf) In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010. + +Maas, Andrew L, Hannun, Awni Y, and Ng, Andrew Y. [Rectifier nonlinearities improve neural network acoustic models.](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.693.1422&rep=rep1&type=pdf) In Proc. ICML, volume 30, 2013. 
+ +Xu, Bing, Wang, Naiyan, Chen, Tianqi, and Li, Mu. [Empirical evaluation of rectified activations in convolutional network.](https://arxiv.org/abs/1505.00853) arXiv preprint arXiv:1505.00853, 2015. diff --git a/an-introduction-to-tensorflow-keras-callbacks.md b/an-introduction-to-tensorflow-keras-callbacks.md new file mode 100644 index 0000000..971aa99 --- /dev/null +++ b/an-introduction-to-tensorflow-keras-callbacks.md @@ -0,0 +1,596 @@ +--- +title: "An introduction to TensorFlow.Keras callbacks" +date: "2020-11-10" +categories: + - "frameworks" +tags: + - "callbacks" + - "keras" + - "tensorflow" +--- + +Training a deep learning model is both simple and complex at the same time. It's simple because with libraries like TensorFlow 2.0 (`tensorflow.keras`, specifically) it's very easy to get started. But while creating a first model is easy, fine-tuning it while knowing what you are doing is a bit more complex. + +For example, you will need some knowledge on the [supervised learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), gradient descent or other optimization, regularization, and a lot of other contributing factors. + +Tweaking and tuning a deep learning models therefore benefits from two things: insight into what is happening and automated control to avoid the need for human intervention where possible. In Keras, this can be achieved with the `tensorflow.keras.callbacks` API. In this article, we will look into Callbacks in more detail. We will first illustrate what they are by displaying where they play a role in the supervised machine learning process. Then, we cover the Callbacks API - and for each callback, illustrate what it can be used for together with a small example. Finally, we will show how you can create your own Callback with the `tensorflow.keras.callbacks.Base` class. + +Let's take a look :) + +**Update 11/Jan/2021:** changed header image. + +* * * + +\[toc\] + +* * * + +## Callbacks and their role in the training process + +In our article about the [supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), we saw how a supervised machine learning model is trained: + +1. A machine learning model (today, often a neural network) is [initialized](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/). +2. Samples from the training set are fed forward, through the model, resulting in a set of predictions. +3. The predictions are compared with what is known as the _ground truth_ (i.e. the labels corresponding to the training samples), resulting in one value - a [loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions) - telling us how _bad_ the model performs. +4. Based on the loss value and the subsequent backwards computation of the error, the weights are changed a little bit, to make the model a bit better. Then, we're either moving back to step 2, or we stop the training process. + +As we can see, steps 2-4 are _iterative_, meaning that the model improves in a cyclical fashion. This is reflected in the figure below. + +![](images/High-level-training-process-1024x973.jpg) + +In Machine Learning terms, each iteration is also called an **epoch**. Hence, training a machine learning model involves the completion of at least one, but often multiple epochs. 
Note from the article about [gradient descent based optimization](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) that we often don't feed forward all data at once. Instead, we use what is called a _minibatch approach_ - the entire batch of data is fed forward in smaller batches called minibatches. By consequence, each epoch consists of at least one but often multiple **batches** of data. + +Now, it can be the case that you want to get insights from the training process while it is running. Or you want to provide automated steering in order to avoid wasting resources. In those cases, you might want to add a **callback** to your Keras model. + +> A callback is an object that can perform actions at various stages of training (e.g. at the start or end of an epoch, before or after a single batch, etc). +> +> Keras Team (n.d.) + +As we shall see later in this article, among others, there are [callbacks for monitoring](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/) and for stopping the training process [when it no longer makes the model better](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/). This is possible because with callbacks, we can 'capture' the training process while it is happening. They essentially 'hook' into the training process by allowing the training process to invoke certain callback definitions. In Keras, each callback implements at least one, but possibly multiple of the following definitions (Keras Team, n.d.). + +- With the `on_train_begin` and `on_train_end` definitions, we can perform a certain action either when `model.fit` starts executing or when the training process has just ended. +- With the `on_epoch_begin` and `on_epoch_end` definitions, we can perform a certain action just before the start of an epoch, or directly after it has ended. +- With the `on_test_begin` and `on_test_end` definitions, we can perform a certain action just before or after the model [is evaluated](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/). +- With the `on_predict_begin` and `on_predict_end` definitions, we can do the same, but then when we generate [new predictions](https://www.machinecurve.com/index.php/2020/02/21/how-to-predict-new-samples-with-your-keras-model/). If we predict for a batch rather than a single sample, we can use the `on_predict_batch_begin` and `on_predict_batch_end` definitions. +- With the `on_train_batch_begin`, `on_train_batch_end`, `on_test_batch_begin` and `on_test_batch_end` definitions, we can perform a certain action directly before or after we feed a batch to either the training or testing process. + +As we can see, by using a callback, through the definitions outlined above, we can control the training process at a variety of levels. + +* * * + +## The Keras Callbacks API + +Now that we understand what callbacks are, how they can help us, and what definitions - and hence hooks - are available for 'breaking into' your training process in TensorFlow 2.x based Keras. Now, it's time to take a look at the Keras Callbacks API. Available as `tensorflow.keras.callbacks`, it's a set of generally valuable Callbacks that can be used in a variety of cases. + +Most specifically, it contains the following callbacks, and we will cover each of them next: + +1. 
**ModelCheckpoint callback:** can be used to [automatically save a model](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/) after each epoch, or just the best one. +2. **TensorBoard callback:** allows us to monitor the training process in realtime with [TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/). +3. **EarlyStopping callback:** ensures that the training process stops if the loss value [does no longer improve](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/). +4. **LearningRateScheduler callback:** updates the learning rate before the start of an epoch, based on a `scheduler` function. +5. **ReduceLROnPlateau callback:** reduces learning rate if the loss value does no longer improve. +6. **RemoteMonitor callback:** sends TensorFlow training events to a remote monitor, such as a logging system. +7. **LambdaCallback:** allows us to define simple functions that can be executed as a callback. +8. **TerminateOnNaN callback:** if the loss value is Not a Number (NaN), the training process stops. +9. **CSVLogger callback:** streams the outcome of an epoch to a CSV file. +10. **ProgbarLogger callback:** used to determine what is printed to standard output in the Keras progress bar. + +### How do we add a callback to a Keras model? + +Before we take a look at all the individual callbacks, we must take a look at how we can use the `tensorflow.keras.callbacks` API in the first place. Doing so is really simple and only changes your code in a minor way: + +1. You must add the specific callbacks to the model imports. +2. You must _initialize_ the callbacks you want to use, including their configuration; preferably do so in a list. +3. You must add the callbacks to the `model.fit` call. + +With those three simple steps, you ensure that the callbacks are hooked into the training process! + +For example, if we want to use both `ModelCheckpoint` and `EarlyStopping` - [as we do here](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/) - for step (1), we first **add the imports**: + +``` + +from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint +``` + +Then, for step (2), we **initialize the callbacks** in a list: + +``` +keras_callbacks = [ + EarlyStopping(monitor='val_loss', patience=5, mode='min', min_delta=0.01), + ModelCheckpoint(checkpoint_path, monitor='val_loss', save_best_only=True, mode='min') +] +``` + +And then, for step (3), we simply **add the callbacks** to `model.fit`: + +``` +model.fit(train_generator, + epochs=50, + verbose=1, + callbacks=keras_callbacks, + validation_data=val_generator) +``` + +### ModelCheckpoint callback + +If you want to periodically save your Keras model - or the model weights - to some file, the `ModelCheckpoint` callback is what you need. + +> Callback to save the Keras model or model weights at some frequency. +> +> TensorFlow (n.d.) + +It is available as follows: + +``` +tf.keras.callbacks.ModelCheckpoint( + filepath, monitor='val_loss', verbose=0, save_best_only=False, + save_weights_only=False, mode='auto', save_freq='epoch', options=None, **kwargs +) +``` + +With the following arguments: + +- With `filepath`, you can specify where the model must be saved. +- If you want to save only if some quantity has changed, you can set this quantity by means of `monitor`. 
It is set to validation loss by default. +- With `verbose`, you can specify if the callback output should be output in your standard output (often, your terminal). +- If you only want to save the model when the monitored quantity improves, you can set `save_best_only` to `True`. +- Normally, the entire model is [saved](https://www.machinecurve.com/index.php/2020/02/14/how-to-save-and-load-a-model-with-keras/) - that is, the stack of layers as well as the [model weights](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/). If you want to save the weights only (e.g. because you can initialize the model yourself), you can set `save_weights_only` to `True`. +- With `mode`, you can determine in what direction the `monitor` quantity must move to consider it to be an improvement. You can choose any from `{auto, min, max}`. When it is set to `auto`, it determines the `mode` based on the `monitor` - with loss, for example, it will be `min`; with accuracy, it will be `max`. +- The `save_freq` allows you to determine when to save the model. By default, it is saved after every epoch (or checks whether it has improved after every epoch). By changing the `'epoch'` string into an integer, you can also instruct Keras to save after every `n` minibatches. +- If you want, you can specify other compatible `options` as well. Check the `ModelCheckpoint` docs (see link in references) for more information about these `options`. + +Using `ModelCheckpoint` is easy - and here is an example based on a [generator](https://www.machinecurve.com/index.php/2020/04/06/using-simple-generators-to-flow-data-from-file-with-keras/): + +``` +checkpoint_path=f'{os.path.dirname(os.path.realpath(__file__))}/covid-convnet.h5' +keras_callbacks = [ + ModelCheckpoint(checkpoint_path, monitor='val_loss', save_best_only=True, mode='min') +] +model.fit(train_generator, + epochs=50, + verbose=1, + callbacks=keras_callbacks, + validation_data=val_generator) +``` + +### TensorBoard callback + +Did you know that you can visualize the training process realtime [with TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/)? + +![](images/image-1.png) + +With the `TensorBoard` callback, you can link TensorBoard with your Keras model. + +> Enable visualizations for TensorBoard. +> +> TensorFlow (n.d.) + +The callback logs a range of items from the training process into your TensorBoard log location: + +- Metrics summary plots +- Training graph visualization +- Activation histograms +- Sampled profiling + +It is implemented as follows: + +``` +tf.keras.callbacks.TensorBoard( + log_dir='logs', histogram_freq=0, write_graph=True, write_images=False, + update_freq='epoch', profile_batch=2, embeddings_freq=0, + embeddings_metadata=None, **kwargs +) +``` + +- With `log_dir`, you can specify the file path to your TensorBoard log folder. +- The `TensorBoard` callback computes activation and weight histograms. With `histogram_freq`, you can specify the frequency (in epochs) when this should happen. Histograms will not be computed when `histogram_freq` is set to 0. +- Whether to write the TensorFlow graph to the logs can be configured with `write_graph`. +- If you want to visualize your model weights as images in TensorBoard, you can set `write_images` to `True`. +- With `update_freq`, you can specify when this callback sends data to TensorBoard. If it's set to `epoch`, it will send data every epoch. If set to `batch`, data will be sent on every batch. 
If set to an integer `n` instead, data will be sent every `n` batches. +- With the [TensorFlow Profiler](https://www.tensorflow.org/guide/profiler), we can calculate the compute performance of TensorFlow - that is, the resources it needs at a point in time. With `profile_batch`, you can specify a batch to profile, meaning that Profiling information will be sent to TensorBoard as well. +- If you are using [Embeddings](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/), it is possible to let TensorFlow visualize them. Specifying the `embeddings_freq` allows you to configure when Embeddings need to be visualized; it represents the frequency in epochs. Embeddings will not be visualized when the frequency is set to 0. +- A dictionary with Embeddings metadata can be passed along with `embeddings_metadata`. + +Here is an example of using the `TensorBoard` callback within your Keras model: + +``` +keras_callbacks = [ + TensorBoard(log_dir="./logs") +] +model.fit(train_generator, + epochs=50, + verbose=1, + callbacks=keras_callbacks, + validation_data=val_generator) +``` + +### EarlyStopping callback + +Optimizing your neural network involves applying [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or [another optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) to a loss value generated by feeding forward batches of training samples, generating predictions that are compared with the corresponding training labels. + +During this process, you want to find a model that performs well in terms of predictions (i.e., it is not underfit) but that is not too rigid with respect to the dataset it is trained on (i.e., it is neither overfit). That's why the `EarlyStopping` callback can be useful if you are dealing with a situation like this. + +> Stop training when a monitored metric has stopped improving. +> +> TensorBoard (n.d.) + +It is implemented as follows: + +``` +tf.keras.callbacks.EarlyStopping( + monitor='val_loss', min_delta=0, patience=0, verbose=0, mode='auto', + baseline=None, restore_best_weights=False +) +``` + +- The `monitor` is the quantity to monitor for improvement; it is similar to the quantity monitored for `ModelCheckpointing`. +- The same goes for the `mode`. +- With `min_delta`, you can configure the minimum change that must happen from the current `monitor` in order to consider the change an improvement. +- With `patience`, you can indicate how long in epochs to wait for additional improvements before stopping the training process. +- With `verbose`, you can specify the verbosity of the callback, i.e. whether the output is written to standard output. +- The `baseline` value can be configured to specify a minimum `monitor` that must be achieved at all before _any_ change can be considered an improvement. +- As you would expect, having a `patience` > 0 will ensure that the model is trained for `patience` more epochs, possibly making it worse. With `restore_best_weights`, we can restore the weights of the best-performing model instance when the training process stops. This can be useful if you directly perform [model evaluation](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/) after stopping the training process. 
+ +Here is an example of using `EarlyStopping` with Keras: + +``` + +keras_callbacks = [ + EarlyStopping(monitor='val_loss', min_delta=0.001, restore_best_weights=True) +] +model.fit(train_generator, + epochs=50, + verbose=1, + callbacks=keras_callbacks, + validation_data=val_generator) +``` + +### LearningRateScheduler callback + +During the optimization process, a so called _weight update_ is computed. However, if we compare the optimization process with rolling a ball down a mountain (reflecting the [loss landscape](https://www.machinecurve.com/index.php/2020/02/26/getting-out-of-loss-plateaus-by-adjusting-learning-rates/)), we want to smooth the ride, ensuring that our ball does not bounce out of control. That is why a [learning rate](https://www.machinecurve.com/index.php/2019/11/06/what-is-a-learning-rate-in-a-neural-network/) is applied: it specifies a fraction of the weight update to be used by the optimizer. + +Preferably being relatively large during the early iterations and lower in the later stages, we must adapt the learning rate during the training process. This is called [learning rate decay](https://www.machinecurve.com/index.php/2019/11/11/problems-with-fixed-and-decaying-learning-rates/) and shows what a _learning rate scheduler_ can be useful for. The `LearningRateScheduler` callback implements this functionality. + +> At the beginning of every epoch, this callback gets the updated learning rate value from `schedule` function provided at `__init__`, with the current epoch and current learning rate, and applies the updated learning rate on the optimizer. +> +> TensorFlow (n.d.) + +Its implementation is really simple: + +``` +tf.keras.callbacks.LearningRateScheduler( + schedule, verbose=0 +) +``` + +- It accepts a `schedule` function which you can use to decide yourself how the learning rate must be scheduled during every epoch. +- With `verbose`, you can decide to illustrate the callback output in your standard output. + +Here is an example of using the `LearningRateScheduler` with Keras: + +``` +def scheduler(epoch, learning_rate): + if epoch < 15: + return learning_rate + else: + return learning_rate * 0.99 + +keras_callbacks = [ + LearningRateScheduler(scheduler) +] +model.fit(train_generator, + epochs=50, + verbose=1, + callbacks=keras_callbacks, + validation_data=val_generator) +``` + +### ReduceLROnPlateau callback + +During the optimization process - i.e., rolling the ball downhill - it can be the case that you encounter so-called _loss plateaus_. In those areas, the gradient of the loss function is close to zero, but not entirely - indicating that you are in the vicinity of a loss minimum. That is, close to where you want to be (unless you are dealing with a local minimum, of course). + +Keeping your learning rate equal when close to a plateau means that your model will likely not improve any further. This happens because your model will optimize, oscillating around the loss minimum, simply because the steps the current [learning rate](https://www.machinecurve.com/index.php/2019/11/06/what-is-a-learning-rate-in-a-neural-network/) it instructs to set are too big. + +With the `ReduceLROnPlateau` callback, the optimization process can be instructed to _reduce_ the learning rate (and hence the step) when a plateau is encountered. + +> Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This callback monitors a quantity and if no improvement is seen for a 'patience' number of epochs, the learning rate is reduced. 
+> +> TensorFlow (n.d.) + +The callback is implemented as follows: + +``` +tf.keras.callbacks.ReduceLROnPlateau( + monitor='val_loss', factor=0.1, patience=10, verbose=0, mode='auto', + min_delta=0.0001, cooldown=0, min_lr=0, **kwargs +) +``` + +- The `monitor` and `patience` resemble the monitors and patience values that we have already encountered. In other words, it is the quantity to observe that helps us judge whether improvement has happened. Patience tells us how long to wait before we consider improvement impossible. The `mode` is related to the `monitor` and instructs what kind of operation to perform while monitoring: `min` or `max` (or `auto`matically determined). +- The `min_delta` tells us _how much_ the model should improve at minimum before we consider the change an improvement. +- The `factor` determines how much to decrease the learning rate upon encountering a plateau: `new_lr = lr * factor`. +- The `verbose` attribute can be configured to display the callback output in your standard output. +- The `min_lr` gives us a lower bound on the learning rate. +- The `cooldown` attribute instructs the model to wait with invoking this specific callback for a number of epochs, allowing us to find _some improvement_ with the reduced learning rate (this could take a few epochs). + +An example of using the `ReduceLROnPlateau` callback with Keras: + +``` +keras_callbacks = [ + ReduceLROnPlateau(monitor='val_loss', factor=0.25, patience=5, cooldown=5, min_lr=0.000000001) +] +model.fit(train_generator, + epochs=50, + verbose=1, + callbacks=keras_callbacks, + validation_data=val_generator) +``` + +### RemoteMonitor callback + +Above, we saw that training logs can be distributed to [TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/) for visualization and logging purposes. However, it can be the case that you have your own logging and visualization system - whether that's a cloud-based system or a locally installed Grafana or Elastic Stack visualization tooling. + +In those cases, you might wish to send the training logs there instead. The `RemoteMonitor` callback can help you do this. + +> Callback used to stream events to a server. +> +> TensorFlow (n.d.) + +It is implemented as follows: + +``` +tf.keras.callbacks.RemoteMonitor( + root='http://localhost:9000', path='/publish/epoch/end/', field='data', + headers=None, send_as_json=False +) +``` + +- With the `root` argument, you can specify the root of the endpoint to where data must be sent. +- The `path` indicates the path relative to `root` where data must be sent. In other words, `root + path` describe the full endpoint. +- The JSON field under which data is sent can be configured with `field`. +- In `headers`, additional HTTP headers (such as an Authorization header) can be provided. +- With `send_as_json` as `True`, the content type of the request will be changed to `application/json`. Otherwise, it will be sent as part of a form. + +An example of using the `RemoteMonitor` callback with Keras: + +``` +keras_callbacks = [ + RemoteMonitor(root='https://some-domain.com', path='/statistics/keras') +] +model.fit(train_generator, + epochs=50, + verbose=1, + callbacks=keras_callbacks, + validation_data=val_generator) +``` + +### LambdaCallback + +Say that you want a certain function to fire after every batch or every epoch - a simple function, nothing special. However, it's not provided in the collection of callbacks presented with the `tensorflow.keras.callbacks` API. 
In this case, you might want to use the `LambdaCallback`.
+
+> Callback for creating simple, custom callbacks on-the-fly. This callback is constructed with anonymous functions that will be called at the appropriate time.
+>
+> TensorFlow (n.d.)
+
+It can thus be used to provide anonymous functions (i.e. `lambda` functions without a name) to the training process. The callback looks as follows:
+
+```
+tf.keras.callbacks.LambdaCallback(
+    on_epoch_begin=None, on_epoch_end=None, on_batch_begin=None, on_batch_end=None,
+    on_train_begin=None, on_train_end=None, **kwargs
+)
+```
+
+Here, the `on_epoch_begin`, `on_epoch_end`, `on_batch_begin`, `on_batch_end`, `on_train_begin` and `on_train_end` _event_ based arguments can be filled with Python definitions. They are executed at the right point in time.
+
+An example of a `LambdaCallback` added to your Keras model:
+
+```
+keras_callbacks = [
+  LambdaCallback(on_batch_end=lambda batch, log_data: print(batch))
+]
+model.fit(train_generator,
+          epochs=50,
+          verbose=1,
+          callbacks=keras_callbacks,
+          validation_data=val_generator)
+```
+
+### TerminateOnNaN callback
+
+In some cases (e.g. when you did not apply min-max normalization to your input data), the loss value can be very strange - outputting values close to Infinity or values that are Not a Number (`NaN`). In those cases, you don't want to continue training. The `TerminateOnNaN` callback can help here.
+
+> Callback that terminates training when a NaN loss is encountered.
+>
+> TensorFlow (n.d.)
+
+It is implemented as follows:
+
+```
+tf.keras.callbacks.TerminateOnNaN()
+```
+
+An example of using the `TerminateOnNaN` callback with your Keras model:
+
+```
+keras_callbacks = [
+  TerminateOnNaN()
+]
+model.fit(train_generator,
+          epochs=50,
+          verbose=1,
+          callbacks=keras_callbacks,
+          validation_data=val_generator)
+```
+
+### CSVLogger callback
+
+CSV files can be very useful when you need to exchange data. If you want to flush your training logs into a CSV file, the `CSVLogger` callback can be useful to you.
+
+> Callback that streams epoch results to a CSV file.
+>
+> TensorFlow (n.d.)
+
+It is implemented as follows:
+
+```
+tf.keras.callbacks.CSVLogger(
+    filename, separator=',', append=False
+)
+```
+
+- The `filename` attribute determines where the CSV file is located. If the file does not exist, it will be created.
+- The `separator` attribute determines what character separates the columns in a single row, also called the delimiter.
+- With `append`, you can indicate whether data should simply be added to the end of the file, or whether a new file should overwrite the old one every time.
+
+This is an example of using the `CSVLogger` callback with Keras:
+
+```
+keras_callbacks = [
+  CSVLogger('./logs.csv', separator=';', append=True)
+]
+model.fit(train_generator,
+          epochs=50,
+          verbose=1,
+          callbacks=keras_callbacks,
+          validation_data=val_generator)
+```
+
+### ProgbarLogger callback
+
+When you are training a Keras model with verbosity set to `True`, you will see a progress bar in your terminal. With the `ProgbarLogger` callback, you can change what is displayed there.
+
+> Callback that prints metrics to stdout.
+>
+> TensorFlow (n.d.)
+
+It is implemented as follows:
+
+```
+tf.keras.callbacks.ProgbarLogger(
+    count_mode='samples', stateful_metrics=None
+)
+```
+
+- With `count_mode`, you can instruct Keras to display the number of samples or steps (i.e. batches) already fed forward through the model.
+- The `stateful_metrics` attribute can contain metrics that should not be averaged over time.
+ +Here is an example of using the `ProgbarLogger` callback with Keras. + +``` +keras_callbacks = [ + ProgbarLogger(count_mode='samples') +] +model.fit(train_generator, + epochs=50, + verbose=1, + callbacks=keras_callbacks, + validation_data=val_generator) +``` + +### Experimental: BackupAndRestore callback + +When you are training a neural network, especially in a [distributed setting](https://www.machinecurve.com/index.php/2020/10/16/tensorflow-cloud-easy-cloud-based-training-of-your-keras-model/), it would be problematic if your training process suddenly stops - e.g. due to machine failure. Every iteration passed so far will be gone. With the experimental `BackupAndRestore` callback, you can instruct Keras to create temporary checkpoint files after each epoch, to which you can restore later. + +> `BackupAndRestore` callback is intended to recover from interruptions that happened in the middle of a model.fit execution by backing up the training states in a temporary checkpoint file (based on TF CheckpointManager) at the end of each epoch. +> +> TensorFlow (n.d.) + +It is implemented as follows: + +``` +tf.keras.callbacks.experimental.BackupAndRestore( + backup_dir +) +``` + +Here, the `backup_dir` attribute indicates the folder where checkpoints should be created. + +Here is an example of using the `BackupAndRestore` callback with Keras. + +``` +keras_callbacks = [ + BackupAndRestore('./checkpoints') +] +model.fit(train_generator, + epochs=50, + verbose=1, + callbacks=keras_callbacks, + validation_data=val_generator) +``` + +### Applied by default: History and BaseLogger callbacks + +There are two callbacks that are part of the `tensorflow.keras.callbacks` API but which can be covered less extensively - because of the simple reason that they are already applied to each Keras model under the hood. + +They are the `History` and `BaseLogger` callbacks. + +- The `History` callback generates a `History` [object](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/#the-history-object) when calling `model.fit`. +- The `BaseLogger` callback accumulates basic metrics to display later. + +* * * + +## Creating your own callback with the Base Callback + +Sometimes, neither the default or the `lambda` callbacks can provide the functionality you need. In those cases, you can create your own callback, by using the Base callback class `tensorflow.keras.callbacks.Callback`. Creating one is very simple: you define a `class`, create the relevant definitions (you can choose from `on_epoch_begin`, `on_epoch_end`, `on_batch_begin`, `on_batch_end`, `on_train_begin` and `on_train_end` etc.), and then add the callback to your callbacks list. There you go! + +``` +class OwnCallback(tensorflow.keras.callbacks.Callback): + def on_train_begin(self, logs=None): + print('Training is now beginning!') + +keras_callbacks = [ + OwnCallback() +] +model.fit(train_generator, + epochs=50, + verbose=1, + callbacks=keras_callbacks, + validation_data=val_generator) +``` + +* * * + +## Summary + +In this article, we looked at the concept of a callback for hooking into the supervised machine learning training process. Sometimes, you want to receive _additional information_ while you are training a model. In other cases, you want to _actively steer the process_ into a desired direction. Both cases are possible by means of a callback. + +Beyond the conceptual introduction to callbacks, we also looked at how Keras implements them - by means of the `tensorflow.keras.callbacks` API. 
We briefly looked at each individual callback provided by Keras, ranging from automated changes to hyperparameters to logging in TensorBoard, file or into a remote monitor. We also looked at creating your own callback, whether that's with a `LambdaCallback` for simple custom callbacks or with the Base callback class for more complex ones. + +I hope that you have learned something from today's article! If you did, please feel free to leave a comment in the comments section below 💬 Please do the same if you have any questions, remarks or suggestions for improvement. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Keras Team. (n.d.). _Keras documentation: Callbacks API_. Keras: the Python deep learning API. [https://keras.io/api/callbacks/](https://keras.io/api/callbacks/) + +Keras Team. (2020, April 15). _Keras documentation: Writing your own callbacks_. Keras: the Python deep learning API. [https://keras.io/guides/writing\_your\_own\_callbacks/#a-basic-example](https://keras.io/guides/writing_your_own_callbacks/#a-basic-example) + +TensorFlow. (n.d.). _Tf.keras.callbacks.ModelCheckpoint_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/ModelCheckpoint](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint) + +TensorFlow. (n.d.). _Tf.keras.callbacks.TensorBoard_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/TensorBoard](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/TensorBoard) + +TensorFlow. (n.d.). _Tf.keras.callbacks.EarlyStopping_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/EarlyStopping](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping) + +TensorFlow. (n.d.). _Tf.keras.callbacks.LearningRateScheduler_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/LearningRateScheduler](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LearningRateScheduler) + +TensorFlow. (n.d.). _Tf.keras.callbacks.ReduceLROnPlateau_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/ReduceLROnPlateau](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ReduceLROnPlateau) + +TensorFlow. (n.d.). _Tf.keras.callbacks.RemoteMonitor_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/RemoteMonitor](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/RemoteMonitor) + +TensorFlow. (n.d.). _Tf.keras.callbacks.LambdaCallback_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/LambdaCallback](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LambdaCallback) + +TensorFlow. (n.d.). _Tf.keras.callbacks.TerminateOnNaN_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/TerminateOnNaN](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/TerminateOnNaN) + +TensorFlow. (n.d.). _Tf.keras.callbacks.BaseLogger_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/BaseLogger](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/BaseLogger) + +TensorFlow. (n.d.). _Tf.keras.callbacks.CSVLogger_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/CSVLogger](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/CSVLogger) + +TensorFlow. (n.d.). _Tf.keras.callbacks.History_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/History](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/History) + +TensorFlow. (n.d.). _Tf.keras.callbacks.ProgbarLogger_. 
[https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/ProgbarLogger](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ProgbarLogger) + +TensorFlow. (n.d.). _Tf.keras.callbacks.experimental.BackupAndRestore_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/experimental/BackupAndRestore](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/experimental/BackupAndRestore) + +TensorFlow. (n.d.). _Tf.keras.callbacks.Callback_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/Callback](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/Callback) diff --git a/automating-neural-network-configuration-with-keras-tuner.md b/automating-neural-network-configuration-with-keras-tuner.md new file mode 100644 index 0000000..1e0f26a --- /dev/null +++ b/automating-neural-network-configuration-with-keras-tuner.md @@ -0,0 +1,406 @@ +--- +title: "Automating neural network configuration with Keras Tuner" +date: "2020-06-09" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-neural-network" + - "hyperparameter-tuning" + - "hyperparameters" + - "keras" + - "keras-tuner" + - "training-process" +--- + +Machine learning has been around for many decades now. Starting with the [Rosenblatt Perceptron](https://www.machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/) in the 1950s, followed by Multilayer Perceptrons and a variety of other machine learning techniques like [Support Vector Machines](https://www.machinecurve.com/index.php/2020/05/03/creating-a-simple-binary-svm-classifier-with-python-and-scikit-learn/), we have arrived in the age of deep neural networks since 2012. + +In the last few years, we have seen an explosion of machine learning research: a wide variety of neural network architectures was invented, published, and the same goes for _tuning_ the neural networks - i.e., what set of hyperparameters works best given a certain problem scenario. That's why training a neural network is often considered to be more of an art than a science - intuition through experience often guides the deep learning engineer into picking the right configuration for their model. + +However, I do believe that this is going to end. Not deep learning itself, but the amount of knowledge required for successfully training a deep neural network. In fact, training ML models is being commoditized... and in today's blog, we'll cover one of the ways in which this is currently happening, namely, with the Keras Tuner. Keras Tuner is a technique which allows deep learning engineers to define neural networks with the Keras framework, define a search space for both model parameters (i.e. architecture) and model hyperparameters (i.e. configuration options), and first search for the best architecture before training the final model. + +We'll first cover the supervised machine learning process and illustrate hyperparameter tuning and its difficulties in more detail. Subsequently, we'll provide some arguments as to why automating hyperparameter tuning can lead to _better_ end results in possibly _less time_. Then, we introduce the Keras Tuner, and close off with a basic example so that you can get basic experience. In another blog post, we'll cover the Keras Tuner building blocks, which will help you gain a deeper understanding of automated hyperparameter tuning. + +**Update 08/Dec/2020:** added references to PCA article. + +* * * + +\[toc\] + +* * * + +## Training neural networks: what is (hyper)parameter tuning? 
+ +Let's take a step back. Before we can understand automated parameter and hyperparameter tuning, we must first take a look at what it is in the first place. + +That's why we'll take a look at the [high-level supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) that we're using to explain how training a neural network works throughout this website. + +Here it is: + +[![](images/High-level-training-process-1024x973.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/09/High-level-training-process.jpg) + +In your machine learning workflow, you have selected or extracted features and targets for your model based on a priori analysis of your dataset - perhaps using dimensionality reduction techniques like [PCA](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/). Using those features, you will be able to train your machine learning model - visible in green. You do so iteratively: + +- Before training starts, you initialize the weights of your neural network in a random or almost-random way; +- In the _forward pass_, you'll feed all your samples (often, in minibatches) to the machine learning model, which generates predictions. +- With a _loss function_, the predictions are compared to the true targets, and a loss value emerges. +- Through backwards computation of the error contribution of particular neurons in the _backwards pass_, it becomes clear how much each neuron contributes to the error. +- With an optimizer such as [Gradient Descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or [Adaptive Optimization](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), the weights are changed a tiny bit. +- A new iteration starts, where we expect that the model performs a little bit better. This goes on until the model has improved sufficiently for it to be used in practice. + +### Neural network architecture and configuration + +If you look at how we build models, you'll generally see that doing so consists of three individual steps: + +1. **Creating the model skeleton** (in Keras, this happens through the [Sequential API](https://keras.io/api/models/sequential/) or the [Functional API](https://keras.io/guides/functional_api/)). +2. **Instantiating the model:** using the skeleton and configuration options to create a trainable model. +3. **Fitting data to the model:** starting the training process. 
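+
+To make these three steps concrete, here is a minimal sketch of how they typically map onto Keras code. Note that the layer sizes, the loss function and the `X_train`/`y_train` arrays below are illustrative placeholders rather than values taken from this tutorial:
+
+```
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense
+
+# Step 1: create the model skeleton with the Sequential API
+model = Sequential()
+model.add(Dense(16, activation='relu', input_shape=(10,)))
+model.add(Dense(1, activation='sigmoid'))
+
+# Step 2: instantiate a trainable model by compiling it with configuration options
+model.compile(optimizer='adam',
+              loss='binary_crossentropy',
+              metrics=['accuracy'])
+
+# Step 3: fit data to the model, starting the training process
+model.fit(X_train, y_train,
+          batch_size=32,
+          epochs=10,
+          validation_split=0.2)
+```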
+ +### Tuning parameters in your neural network + +In step (1), you add various layers of your neural network to the skeleton, such as the [Convolutional Neural Network](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) created here with Keras: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) +``` + +Here, the architectural choices you make (such as the number of filters for a `Conv2D` layer, kernel size, or the number of output nodes for your `Dense` layer) determine what are known as the _parameters_ of your neural network - the weights (and by consequence biases) of your neural network:[](https://datascience.stackexchange.com/posts/17643/timeline) + +> The parameters of a neural network are typically the weights of the connections. In this case, these parameters are learned during the training stage. So, the algorithm itself (and the input data) tunes these parameters. +> +> [Robin, at StackExchange](https://datascience.stackexchange.com/questions/17635/model-parameters-hyper-parameters-of-neural-network-their-tuning-in-training#:~:text=The%20parameters%20of%20a%20neural,or%20the%20number%20of%20epochs.) + +### Tuning hyperparameters in your neural network + +However, things don't end there. Rather, in step (2), you'll _configure_ the model during instantiation by setting a wide range of configuration options. Those options include, but are not limited to: + +- The **optimizer** that is used during training: e.g., whether you are using [Gradient Descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or an adaptive optimizer like [Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam). +- The **learning rate** that is used during optimization: i.e., what fraction of the error contribution found will be used for optimization for a particular neuron. +- The **batch size** that will be used during the forward pass. +- The **number of iterations** (or epochs) that will be used for training the neural network + +Here's why they are called _hyper_parameters: + +> The hyper parameters are typically the learning rate, the batch size or the number of epochs. The are so called "hyper" because they influence how your parameters will be learned. You optimize these hyper parameters as you want (depends on your possibilities): grid search, random search, by hand, using visualisations… The validation stage help you to both know if your parameters have been learned enough and know if your hyper parameters are good. +> +> [Robin, at StackExchange](https://datascience.stackexchange.com/questions/17635/model-parameters-hyper-parameters-of-neural-network-their-tuning-in-training#:~:text=The%20parameters%20of%20a%20neural,or%20the%20number%20of%20epochs.) + +As Robin suggests, hyperparameters can be selected (and optimized) in multiple ways. The easiest way of doing so is by hand: you, as a deep learning engineer, select a set of hyperparameters that you will subsequently alter in an attempt to make the model better. + +However, can't we do this in a better way when training a Keras model? 
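+
+To see why this quickly becomes tedious, consider a naive, hand-rolled search over a few candidate learning rates. This is only a sketch: `build_model` is a hypothetical helper that creates and compiles a model for a given learning rate, and `X_train`/`y_train` are assumed to exist already - nothing here is part of Keras or Keras Tuner itself:
+
+```
+# Naive manual tuning: train one model per candidate learning rate
+candidate_learning_rates = [1e-2, 1e-3, 1e-4]
+results = {}
+
+for lr in candidate_learning_rates:
+  # Hypothetical helper that builds and compiles a model for this learning rate
+  model = build_model(lr)
+  history = model.fit(X_train, y_train,
+                      batch_size=64,
+                      epochs=10,
+                      validation_split=0.2,
+                      verbose=0)
+  # Keep the best validation accuracy observed for this learning rate
+  results[lr] = max(history.history['val_accuracy'])
+
+best_lr = max(results, key=results.get)
+print(f'Best learning rate: {best_lr}')
+```
+
+Every additional hyperparameter multiplies the number of combinations to try, which is exactly the problem that automated tuning addresses.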
+ +* * * + +## Automating (hyper)parameter tuning for faster & better experimentation: introducing the Keras Tuner + +As you would have expected: yes, we can! :) Let's introduce Keras Tuner to the scene. As you would expect from engineers, the description as to what it does is really short but provides all the details: + +> A hyperparameter tuner for Keras, specifically for tf.keras with TensorFlow 2.0. +> +> [Keras-tuner on GitHub](https://github.com/keras-team/keras-tuner) + +If you already want to look around, you could visit their website, and if not, let's take a look at what it does. + +### Automatically tuning (hyper)parameters of your Keras model through search spaces + +Keras Tuner can be used for automatically tuning the parameters and hyperparameters of your Keras model. It does so by means of a search space. If you are used to a bit of mathematics, you are well aware of what a space represents. If not, and that's why we're using this particular space, you can likely imagine what we mean when we talk about a _three-dimensional_ or a _two-dimensional_ space. + +Indeed, in the case of a 2D space - where the axes represent e.g. the _hyperparameter_ learning rate and the _parameter_ (or, more strictly, contributing factor to the number of parameters) _number of layers_, you can visualize the space as follows: + +![](images/searchspace.png) + +Here, all the intersections between the two axes (dimensions) are possible combinations of hyperparameters that can be selected for the model. For example, learning rate \[latex\]LR\[/latex\] and number of layers \[latex\]N\[/latex\] can be \[latex\](LR = 10^{-3}, N = 4)\[/latex\], but also \[latex\](LR = 10^{-2}, N = 2)\[/latex\] is possible, and so on. Here, we have two dimensions (which benefits visualization), but the more tunable options you add to your model, the more dimensions will be added to your search space. + +Hopefully, you are now aware about how a search space is constructed by yourself when you want Keras Tuner to look for a most optimal set of hyperparameters and parameters for your neural network. + +You can use a wide range of `HyperParameters` building block styles for creating your search space: + +- **Boolean** **values**, which are set to `true` or `false` +- **Choice values**, which represent an `array` of choices from which one value is chosen for a set of hyperparameters. +- **Fixed values**, which aren't tunable but rather are fixed as they are. +- **Float values**, which represent floating-point values (such as the learning rate above). +- **Integer values**, which represent integer values to be tuned (such as the number of layers above). + +Although the _choice_ values and _float/integer values_ look a lot like each other, they are different - in the sense that you can specify a range in the latter. However, that's too much detail for now - we will cover all the tunable `HyperParameters` in that different blog post we already mentioned before. At this point, it's important that you understand that using Keras Tuner will allow you to construct a search space by means of the building blocks mentioned before. + +### Putting bounds to your search space + +And it's also important that you understand that it does so _within constraints set by the user_. That is, searching the hyperparameter space cannot go on indefinitely. Keras Tuner allows you to constrain searching: by setting a **maximum number of trials**, you can tell the tuner to cut off tuning after some time. + +There's one thing missing, still. 
It's nice that we have a search space, but _how exactly_ does Keras Tuner perform the search operation?
+
+### Applying various search strategies
+
+By means of a search strategy!
+
+It's as if you've lost something, and there are multiple approaches you can follow to find what you've lost. And as with anything, there are many ways in which you can do a particular thing... the same is true for searching through your hyperparameter space :)
+
+We'll cover the various search strategies in more detail in that other blog post that we've mentioned. Here's a brief overview of the search strategies that are supported by Keras Tuner:
+
+- **Random search:** well, this one is pretty easy. For every dimension in your search space, this algorithm selects a random value, trains the model, and reports the results.
+- **Bayesian optimization:** viewing hyperparameter tuning as the optimization of a black-box function, and using Bayes' rule for optimization.
+- **Hyperband:** this one attempts to reduce the total tuning time by running many experiments for only a few epochs at first, then taking only the best of them forward for longer training, in a competition-style fashion.
+- **Sklearn:** allowing you to tune hyperparameters for Scikit-learn models as well, using cross-validated hyperparameter search.
+
+* * *
+
+## A basic example of using Keras Tuner
+
+Now let's take a look at using Keras Tuner for optimizing your Keras model. We will be building a simple ConvNet, [as we have seen in the Conv2D tutorial](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/). We'll subsequently tune its hyperparameters with Keras Tuner for a limited number of epochs, and finally train the best model fully. We'll keep it simple: we're only going to construct a one-dimensional search space based on the learning rate for the Adam optimizer.
+
+Make sure that Keras Tuner is installed by executing `pip install -U keras-tuner` first in your machine learning environment :)
+
+### Imports, model configuration, and loading the data
+
+Open up your IDE and create a file, e.g. called `tuning.py`. Here, you're going to write down your code. We'll start with the imports (such as `tensorflow.keras` and `kerastuner`), defining the model configuration options and loading the data. If you have no experience in doing so, I recommend that you first read the [Conv2D post](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) as I explain these things there in more detail.
Here's the code that you'll add first: + +``` +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from kerastuner.tuners import RandomSearch + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 28, 28, 1 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 25 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST data +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 +``` + +In brief, what it does: + +- Load all the modules and libraries that you'll be using today. +- Defining all the hyperparameters that we will not be tuning today, and other configuration options. +- Loading the [MNIST dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/), and reshaping it into Conv2D-compatible format. +- Cast the data into `float32` format which allows GPU owners to train their models faster. +- Scaling the data into the \[latex\]\[0, 1\]\[/latex\] range which benefits the training process. + +### Defining the model-building function + +Keras Tuner allows you to perform your experiments in two ways. The first, and more scalable, approach is a `HyperModel` class, but we don't use it today - as Keras Tuner itself introduces people to automated hyperparameter tuning via model-building functions. + +Those functions are nothing more than a Python `def` where you create the model skeleton and compile it, as you would do usually. However, here, you also construct your search space - that space we explained above. For example, I make the learning rate hyperparameter tunable by specifying it as follows: `hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])`. + +Here's the code for the model-building function. If you've used Keras before, you instantly recognize what it does! + +``` +# MODEL BUILDING FUNCTION +def build_model(hp): + # Create the model + model = Sequential() + model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) + model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) + model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) + model.add(Flatten()) + model.add(Dense(128, activation='relu')) + model.add(Dense(no_classes, activation='softmax')) + + # Display a model summary + model.summary() + + # Compile the model + model.compile(loss=loss_function, + optimizer=Adam( + hp.Choice('learning_rate', + values=[1e-2, 1e-3, 1e-4])), + metrics=['accuracy']) + + # Return the model + return model +``` + +### Performing tuning + +Now, it's time to perform tuning. 
As we've constructed our search space, we must first define our search strategy - and it will be `RandomSearch` today:
+
+```
+# Perform tuning
+tuner = RandomSearch(
+    build_model,
+    objective='val_accuracy',
+    max_trials=5,
+    executions_per_trial=3,
+    directory='tuning_dir',
+    project_name='machinecurve_example')
+```
+
+We'll add the model-building function as the function that contains our model and our search space. Our goal is to maximize validation accuracy (Keras Tuner automatically infers whether the objective should be maximized or minimized based on its name). We tell it that it should perform 5 trials, and that it should perform 3 executions per trial. The latter ensures that it's not simply variance that causes a hyperparameter to be 'best', as more instances of better performance tend to suggest that performance is _actually_ better. The `directory` and `project_name` attributes are set so that checkpoints of the tuning operations are saved.
+
+Now that we have configured our search strategy, it's time to print a summary of it and actually perform the search operation:
+
+```
+# Display search space summary
+tuner.search_space_summary()
+
+# Perform random search
+tuner.search(input_train, target_train,
+             epochs=5,
+             validation_split=validation_split)
+```
+
+Here, we instruct Keras Tuner to perform hyperparameter tuning with our training set, for 5 epochs per trial, and to create a validation split (of 20%, in our case, given how we have configured our model).
+
+### Fully train the best model
+
+Once the search is complete, you can get the best model and train it fully as per your configuration:
+
+```
+# Get best model
+models = tuner.get_best_models(num_models=1)
+best_model = models[0]
+
+# Fit data to model
+history = best_model.fit(input_train, target_train,
+            batch_size=batch_size,
+            epochs=no_epochs,
+            verbose=verbosity,
+            validation_split=validation_split)
+
+# Generate generalization metrics
+score = best_model.evaluate(input_test, target_test, verbose=0)
+print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
+```
+
+That's it! :) You should now have a fully working Keras Tuner based hyperparameter tuner. If you run `python tuning.py`, of course while having all the dependencies installed on your system, the tuning _and_ eventually the training process should begin.
+
+### Full model code
+
+If you wish to obtain the full model code, that's of course also possible.
Here you go:
+
+```
+from tensorflow.keras.datasets import mnist
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense, Flatten, Conv2D
+from tensorflow.keras.losses import sparse_categorical_crossentropy
+from tensorflow.keras.optimizers import Adam
+from kerastuner.tuners import RandomSearch
+
+# Model configuration
+batch_size = 50
+img_width, img_height, img_num_channels = 28, 28, 1
+loss_function = sparse_categorical_crossentropy
+no_classes = 10
+no_epochs = 25
+validation_split = 0.2
+verbosity = 1
+
+# Load MNIST data
+(input_train, target_train), (input_test, target_test) = mnist.load_data()
+
+# Reshape data
+input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
+input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
+
+# Determine shape of the data
+input_shape = (img_width, img_height, img_num_channels)
+
+# Parse numbers as floats
+input_train = input_train.astype('float32')
+input_test = input_test.astype('float32')
+
+# Scale data
+input_train = input_train / 255
+input_test = input_test / 255
+
+# MODEL BUILDING FUNCTION
+def build_model(hp):
+  # Create the model
+  model = Sequential()
+  model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
+  model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
+  model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
+  model.add(Flatten())
+  model.add(Dense(128, activation='relu'))
+  model.add(Dense(no_classes, activation='softmax'))
+
+  # Display a model summary
+  model.summary()
+
+  # Compile the model
+  model.compile(loss=loss_function,
+                optimizer=Adam(
+                  hp.Choice('learning_rate',
+                            values=[1e-2, 1e-3, 1e-4])),
+                metrics=['accuracy'])
+
+  # Return the model
+  return model
+
+# Perform tuning
+tuner = RandomSearch(
+    build_model,
+    objective='val_accuracy',
+    max_trials=5,
+    executions_per_trial=3,
+    directory='tuning_dir',
+    project_name='machinecurve_example')
+
+# Display search space summary
+tuner.search_space_summary()
+
+# Perform random search
+tuner.search(input_train, target_train,
+             epochs=5,
+             validation_split=validation_split)
+
+# Get best model
+models = tuner.get_best_models(num_models=1)
+best_model = models[0]
+
+# Fit data to model
+history = best_model.fit(input_train, target_train,
+            batch_size=batch_size,
+            epochs=no_epochs,
+            verbose=verbosity,
+            validation_split=validation_split)
+
+# Generate generalization metrics
+score = best_model.evaluate(input_test, target_test, verbose=0)
+print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
+```
+
+* * *
+
+## Summary
+
+In this blog post, you've been introduced to automated tuning of your neural network parameters and hyperparameters. Over the next years, this will become an increasingly important aspect of machine learning, in my opinion - because why leave to humans what computers can do better? Maybe, machine learning configuration will even become commoditized because of such progress! The benefit for you is that you've read this post (and can deepen your understanding further with some additional searching). You're now aware of this trend, and can steer your learnings towards staying on top of the machine learning wave :)
+
+What's more, you've also been able to get some practical experience with a code example using Keras Tuner.
I hope you've learnt something today, and that it will help your machine learning endeavors :) If you have any questions, remarks, or other comments, please feel free to leave a comment in the comments section below. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +_Keras tuner_. (n.d.). [https://keras-team.github.io/keras-tuner/](https://keras-team.github.io/keras-tuner/) + +Data Science Stack Exchange. (n.d.). _Model parameters & hyper parameters of neural network & their tuning in training & validation stage_. [https://datascience.stackexchange.com/questions/17635/model-parameters-hyper-parameters-of-neural-network-their-tuning-in-training](https://datascience.stackexchange.com/questions/17635/model-parameters-hyper-parameters-of-neural-network-their-tuning-in-training) diff --git a/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras.md b/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras.md new file mode 100644 index 0000000..f38fcff --- /dev/null +++ b/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras.md @@ -0,0 +1,314 @@ +--- +title: "Using EarlyStopping and ModelCheckpoint with TensorFlow 2 and Keras" +date: "2019-05-30" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "ai" + - "callbacks" + - "deep-learning" + - "keras" + - "neural-networks" +--- + +Training a neural network can take a lot of time. In some cases, especially with very deep architectures trained on very large data sets, it can take weeks before one's model is finally trained. + +In Keras, when you train a neural network such as a [classifier](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) or a [regression model](https://www.machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/), you'll usually set the number of epochs when you call `model.fit`: + +``` +fit(x=None, y=None, batch_size=None, epochs=1, verbose=1, callbacks=None, validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None, initial_epoch=0, steps_per_epoch=None, validation_steps=None, validation_freq=1) +``` + +Unfortunately, setting a fixed number of epochs is often a **bad idea**. Here's why: + +- When you use too few epochs, your model will remain underfit. What I mean is that its predictive power can still be improved without a loss of generalization power (i.e., it improves without overfitting). You will end up with a model that does not perform at its maximum capability. +- When you use too many epochs, depending on how you configure the training process, your final model will either be _optimized_ or it will be _overfit_. In both cases, you will have wasted resources. Hey, but why are those resources wasted when the final model is optimal? Simple - most likely, this optimum was found in e.g. 20% of the epochs you configured the model for. 80% of the resources you used are then wasted. Especially with highly expensive tasks in computational terms, you'll want to avoid waste as much as you can. + +This is quite a dilemma, isn't it? How do we choose what number of epochs to use? + +You cannot simply enter a random value due to the reasons above. + +Neither can you test without wasting more resources. 
What's more, if you think to avert the dilemma by finding out with a very small subset of your data, then I've got some other news - you just statistically altered your sample by drawing a subset from the original sample. You may now find that by using the original data set for training, it is still not optimal. + +What to do? :( In this tutorial, we'll check out one way of getting beyond this problem: using a combination of **Early Stopping** and **model checkpointing**. Let's see what it is composed of. + +In other words, this tutorial will teach you... + +- **Why performing early stopping and model checkpointing can be beneficial.** +- **How early stopping and model checkpointing are implemented in TensorFlow.** +- **How you can use `EarlyStopping` and `ModelCheckpoint` in your own TensorFlow/Keras model.** + +Let's take a look 🚀 + +* * * + +**Update 13/Jan/2021:** Added code example to the top of the article, so that people can get started immediately. Also ensured that the article is still up-to-date, and added a few links to other articles. + +**Update 02/Nov/2020:** Made model code compatible with TensorFlow 2.x. + +**Update 01/Feb/2020:** Added links to other MachineCurve blog posts and processed textual corrections. + +* * * + +\[toc\] + +* * * + +## Code example: how to use EarlyStopping and ModelCheckpoint with TensorFlow? + +This code example immediately teaches you **how EarlyStopping and ModelCheckpointing can be used with TensorFlow**. It allows you to get started straight away. If you want to understand both callbacks in more detail, however, then make sure to continue reading the rest of this tutorial. + +``` +from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint + +keras_callbacks = [ + EarlyStopping(monitor='val_loss', patience=30, mode='min', min_delta=0.0001), + ModelCheckpoint(checkpoint_path, monitor='val_loss', save_best_only=True, mode='min') +] + +model.fit(x_train, y_train, + batch_size=batch_size, + epochs=epochs, + validation_split=0.2, + callbacks=keras_callbacks) +``` + +* * * + +## EarlyStopping and ModelCheckpoint in Keras + +Fortunately, if you use Keras for creating your deep neural networks, it comes to the rescue. + +It has two so-called [callbacks](https://www.machinecurve.com/index.php/mastering-keras/#keras-callbacks) which can really help in settling this issue, avoiding wasting computational resources a priori and a posteriori. They are named `EarlyStopping` and `ModelCheckpoint`. This is what they do: + +- **EarlyStopping** is called once an epoch finishes. It checks whether the [metric](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) you configured it for has improved with respect to the best value found so far. If it has not improved, it increases the count of 'times not improved since best value' by one. If it did actually improve, it resets this count. By configuring your _patience_ (i.e. the number of epochs without improvement you allow before training should be aborted), you have the freedom to decide when to stop training. This allows you to configure a very large number of epochs in model.fit (e.g. 100.000), while you know that it will abort the training process once it no longer improves. Gone is your waste of resources with respect to training for too long. +- It would be nice if you could save the best performing model automatically. **ModelCheckpoint** is perfect for this and is also called after every epoch. 
Depending on how you configure it, it saves the entire model or its weights to an HDF5 file. If you wish, it can only save the model once it has improved with respect to some metric you can configure. You will then end up with the best performing instance of your model saved to file, ready for loading and production usage. + +Together, EarlyStopping and ModelCheckpoint allow you to stop early, saving computational resources, while maintaining the best performing instance of your model automatically. That's precisely what you want. + +* * * + +## Example implementation + +Let's build one of the [Keras examples](https://github.com/keras-team/keras/blob/master/examples/imdb_cnn.py) step by step. It uses one-dimensional [convolutional layers](https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) for classifying IMDB reviews and, according to its metadata, achieves about 90% test accuracy after just two training epochs. + +We will slightly alter it in order to (1) include the callbacks and (2) keep it running until it no longer improves. + +Let's first load the Keras imports. Note that we also include `numpy`, which is not done in the Keras example. We include it because we'll need to fix the random number generator, but we'll come to that shortly. + +``` +from tensorflow.keras.preprocessing import sequence +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Activation +from tensorflow.keras.layers import Embedding +from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D +from tensorflow.keras.datasets import imdb +import numpy as np +``` + +We will then set the parameters. Note that instead of 2 epochs in the example, we'll use 200.000 epochs here. + +``` +# set parameters: +max_features = 5000 +maxlen = 400 +batch_size = 32 +embedding_dims = 50 +filters = 250 +kernel_size = 3 +hidden_dims = 250 +epochs = 200000 +``` + +We'll fix the random seed in Numpy. This allows us to use the same pseudo random number generator every time. This removes the probability that variation in the data is caused by the pseudo-randomness between multiple instances of a 'random' number generator - rather, the pseudo-randomness is equal all the time. + +``` +np.random.seed(7) +``` + +We then load the data. We make a `load_data` call to the [IMDB data set](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#imdb-movie-reviews-sentiment-classification), which is provided in Keras by default. We load a maximum of 5.000 words according to our configuration file. The `load_data` definition provided by Keras automatically splits the data in training and testing data (with inputs `x` and targets `y`). In order to create feature vectors that have the same shape, the sequences are padded. That is, `0.0` is added towards the end. Neural networks tend not to be influenced by those numbers. + +``` +(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features) +x_train = sequence.pad_sequences(x_train, maxlen=maxlen) +x_test = sequence.pad_sequences(x_test, maxlen=maxlen) +``` + +Next up is the model itself. It is proposed by Google. 
Given the goal of this blog post, there's not much need for explaining whether the architecture is good (which is the case, though): + +``` +model = Sequential() +model.add(Embedding(max_features, + embedding_dims, + input_length=maxlen)) +model.add(Dropout(0.2)) +model.add(Conv1D(filters, + kernel_size, + padding='valid', + activation='relu', + strides=1)) +model.add(GlobalMaxPooling1D()) +model.add(Dense(hidden_dims)) +model.add(Dropout(0.2)) +model.add(Activation('relu')) +model.add(Dense(1)) +model.add(Activation('sigmoid')) +``` + +Next, we compile the model. [Binary crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) is used since we have two target classes (`positive` and `negative`) and our task is a classification task (for which [crossentropy](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#binary-crossentropy) is a good way of computing loss). The optimizer is [Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam), which is a state-of-the-art optimizer combining various improvements to original [stochastic gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/). As an additional metric which is more intuitive to human beings, `accuracy` is included as well. + +``` +model.compile(loss='binary_crossentropy', + optimizer='adam', + metrics=['accuracy']) +``` + +We'll next make slight changes to the example. Google utilizes the `test` data for validation; we don't do that. Rather, we'll create a separate validation split from the training data. We thus end up with three distinct data sets: a training set, which is used to train the model; a validation set, which is used to study its predictive power after every epoch, and a testing set, which shows its generalization power since it contains data the model has never seen. We generate the validation data by splitting the training data in actual training data and validation date. We use a 80/20 split for this; thus, 20% of the original training data will become validation data. All right, let's fit the training data and start the training process. + +``` +model.fit(x_train, y_train, + batch_size=batch_size, + epochs=epochs, + validation_split=0.2) +``` + +Later, we'll evaluate the model with the test data. + +### Adding the callbacks + +We must however first add the callbacks to the imports at the top of our code: + +``` +from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint +``` + +We can then include them into our code. Just before `model.fit`, add this Python variable: + +``` +keras_callbacks = [ + EarlyStopping(monitor='val_loss', patience=30, mode='min', min_delta=0.0001), + ModelCheckpoint(checkpoint_path, monitor='val_loss', save_best_only=True, mode='min') +] +``` + +As you can see, the callbacks have various configuration options: + +- The **checkpoint\_path** in ModelCheckpoint is the path to the file where the model instance should be saved. In my case, the checkpoint path is `checkpoint_path=f'{os.path.dirname(os.path.realpath(__file__))}/testmodel.h5'`. +- A **monitor**, which specifies the variable that is being monitored by the callback for making its decision whether to stop or save the model. Often, it's a good idea to use `val_loss`, because it overfits much slower than training loss. This does however require that you add a `validation_split` in `model.fit`. 
+- A **patience**, which specifies how many epochs without improvement you'll allow before the callback interferes. In the case of EarlyStopping above, once the [validation loss](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) improves, I allow Keras to complete 30 new epochs without improvement before the training process is finished. When it improves at e.g. the 23rd epoch, this counter is reset and the cycle starts again. +- The **mode**, which can also be `max` or left empty. If it's left empty, it decides itself based on the `monitor` you specify. Common sense dictates what mode you should use. Validation loss should be minimized; that's why we use `min`. Not sure why you would attempt to maximize validation loss :) +- The **min\_delta** in EarlyStopping. Only when the improvement is higher than this delta value it is considered to be an improvement. This avoids that very small improvements disallow you from finalizing training, e.g. when you're trapped in a small convergence scenario when using a really small learning rate. +- The **save\_best\_only** in ModelCheckpoint pretty much speaks for itself. If `True`, it only saves the best model instance with respect to the monitor specified. +- If you wish, you can add `verbose=1` to both callbacks. This textually shows you whether the model has improved or not and whether it was saved to your `checkpoint_path`. I leave this up to you as it slows down the training process slightly (...since the prints must be handled by Python). + +Those are not the only parameters. There's many more for both [ModelCheckpoint](https://keras.io/callbacks/#modelcheckpoint) and [EarlyStopping](https://keras.io/callbacks/#earlystopping), but they're used less commonly. Do however check them out! + +All right, if we would now add the callback variable to the `model.fit` call, we'd have a model that stops when it no longer improves _and_ saves the best model. Replace your current code with this: + +``` +model.fit(x_train, y_train, + batch_size=batch_size, + epochs=epochs, + validation_split=0.2, + callbacks=keras_callbacks) +``` + +Okay, let's run it and see what happens :) + +![](images/bookshelves-chair-desk-1546912.jpg) + +All right, let's give it a go! + +* * * + +## Numpy allow\_pickle error + +It may be that you'll run into issues with Numpy when you load the data into a Numpy array. Specifically, the error looks as follows: + +``` +ValueError: Object arrays cannot be loaded when allow_pickle=False +``` + +It occurs because Numpy has recently inverted the default value for allow\_pickle and Keras has not updated yet. Altering `imdb.py` in `keras/datasets` folder will resolve this issue. Let's hope the pull request that has been issued for this problem will be accepted soon. Change line 59 into: + +``` +with np.load(path, allow_pickle=True) as f: +``` + +**Update February 2020:** this problem should be fixed in any recent Keras version! 
🎉
+
+* * *
+
+## Keras results
+
+You'll relatively quickly see the results:
+
+```
+Epoch 1/200000
+20000/20000 [==============================] - 10s 507us/step - loss: 0.4380 - acc: 0.7744 - val_loss: 0.3145 - val_acc: 0.8706
+Epoch 00001: val_loss improved from inf to 0.31446, saving model to C:\Users\chris\DevFiles\Deep Learning/testmodel.h5
+
+Epoch 2/200000
+20000/20000 [==============================] - 7s 347us/step - loss: 0.2411 - acc: 0.9021 - val_loss: 0.2719 - val_acc: 0.8890
+Epoch 00002: val_loss improved from 0.31446 to 0.27188, saving model to C:\Users\chris\DevFiles\Deep Learning/testmodel.h5
+
+Epoch 3/200000
+20000/20000 [==============================] - 7s 344us/step - loss: 0.1685 - acc: 0.9355 - val_loss: 0.2733 - val_acc: 0.8924
+Epoch 00003: val_loss did not improve from 0.27188
+```
+
+Apparently, the training process achieves its optimal validation loss after just two epochs (which was also indicated by the Google engineers who created the model code that we gratefully adapted), because after epoch 32 it shows:
+
+```
+Epoch 32/200000
+20000/20000 [==============================] - 7s 366us/step - loss: 0.0105 - acc: 0.9960 - val_loss: 0.7375 - val_acc: 0.8780
+Epoch 00032: val_loss did not improve from 0.27188
+Epoch 00032: early stopping
+```
+
+...and the training process comes to a halt, as we intended :) Most likely, the model can still be improved - e.g. by introducing [learning rate decay](https://www.machinecurve.com/index.php/2019/11/11/problems-with-fixed-and-decaying-learning-rates/) and finding the best [learning rate](https://www.machinecurve.com/index.php/2019/11/06/what-is-a-learning-rate-in-a-neural-network/) prior to the training process - but hey, that wasn't the goal of this exercise.
+
+I've also got my HDF5 file:
+
+![](images/image-1.png)
+
+* * *
+
+## Let's evaluate the model
+
+We can next comment out everything from `model = Sequential()` up to and including `model.fit`. Let's add some evaluation functionality.
+
+We should load the model, so we need to add the corresponding import:
+
+```
+from tensorflow.keras.models import load_model
+```
+
+And subsequently add evaluation code just after the code that was commented out:
+
+```
+model = load_model(checkpoint_path)
+scores = model.evaluate(x_test, y_test, verbose=1)
+print(f'Score: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%')
+```
+
+Next, run it again. Instead of training the model again (you commented out the code specifying the model and the training process), it will now load the model you saved during training and evaluate it. You will most likely see a test accuracy of ≈ 88%.
+
+```
+25000/25000 [==============================] - 3s 127us/step
+Score: loss of 0.27852724124908446; acc of 88.232%
+```
+
+All right! Now you know how you can use the EarlyStopping and ModelCheckpoint callbacks in Keras, allowing you to save precious resources when a model no longer improves. Let me wish you all the best with your machine learning adventures and please, feel free to comment if you have questions or comments. I'll be happy to respond and to improve my work if you feel I've made a mistake. Thanks!
+
+* * *
+
+## References
+
+- [Keras,](https://github.com/keras-team/keras) which is licensed under the [MIT License](https://github.com/keras-team/keras/blob/master/LICENSE).
+- The specific Keras [example](https://github.com/keras-team/keras/blob/master/examples/imdb_cnn.py) which lies at the foundation of this blog post.
+- The [Keras](https://keras.io/) docs. + +Thanks a lot to the authors of those works! diff --git a/batch-normalization-with-pytorch.md b/batch-normalization-with-pytorch.md new file mode 100644 index 0000000..c26992c --- /dev/null +++ b/batch-normalization-with-pytorch.md @@ -0,0 +1,419 @@ +--- +title: "Batch Normalization with PyTorch" +date: "2021-03-29" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "batch-normalization" + - "covariance-shift" + - "deep-learning" + - "neural-network" + - "neural-networks" + - "pytorch" +--- + +One of the key elements that is considered to be a good practice in a neural network is a technique called Batch Normalization. Allowing your neural network to use normalized inputs across all the layers, the technique can ensure that models converge faster and hence require less computational resources to be trained. + +In a different tutorial, we showed how you can implement [Batch Normalization with TensorFlow and Keras](https://www.machinecurve.com/index.php/2020/01/15/how-to-use-batch-normalization-with-keras/). This tutorial focuses on **PyTorch** instead. After reading it, you will understand: + +- **What Batch Normalization does at a high level, with references to more detailed articles.** +- **The differences between `nn.BatchNorm1d` and `nn.BatchNorm2d` in PyTorch.** +- **How you can implement Batch Normalization with PyTorch.** + +It also includes a test run to see whether it can really perform better compared to not applying it. + +Let's take a look! 🚀 + +* * * + +\[toc\] + +* * * + +## Full code example: Batch Normalization with PyTorch + +``` +import os +import torch +from torch import nn +from torchvision.datasets import CIFAR10 +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(32 * 32 * 3, 64), + nn.BatchNorm1d(64), + nn.ReLU(), + nn.Linear(64, 32), + nn.BatchNorm1d(32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare CIFAR-10 dataset + dataset = CIFAR10(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +* * * + +## What is Batch Normalization? 
+
+Training a neural network is performed according to the high-level supervised machine learning process. A batch of data is fed through the model, after which its predictions are compared with the actual or _ground truth_ values for the inputs.
+
+The difference leads to what is known as a loss value, which can be used for subsequent error backpropagation and model optimization.
+
+Optimizing a model involves slightly adapting the weights of the trainable layers in your model. All is good so far. However, now suppose that you have the following scenario:
+
+- You feed a model with a batch of low-dimensional data that has a mean of 0.25 and a standard deviation of 1.2, and you adapt your model.
+- Your second batch has a mean of 13.2 and a standard deviation of 33.9.
+- Your third goes back to 0.35 and 1.9, respectively.
+
+You can imagine that, given your model's weights, it will be relatively poor in handling the second batch - and as a consequence, the weights change significantly. Your model will then also be _worse_ than it could be when processing the third batch, simply because it has adapted to the significantly deviating second one.
+
+And although it can learn to revert to the more generic behavior over time, you can see that with relative instability in your dataset (which can even happen within relatively normalized datasets, due to such effects happening in downstream layers), model optimization will oscillate quite heavily. And this is bad, because it slows down the training process.
+
+**Batch Normalization** is a normalization technique that can be applied at the layer level. Put simply, it normalizes the inputs to each layer to a learnt representation likely close to \[latex\](\\mu = 0.0, \\sigma = 1.0)\[/latex\]. By consequence, all the layer inputs are normalized, and significant outliers are less likely to impact the training process in a negative way. And if they do, their impact will be much lower than without using Batch Normalization.
+
+> Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
+>
+> The abstract from the [Batch Normalization paper](http://proceedings.mlr.press/v37/ioffe15.html) by Ioffe & Szegedy (2015)
+
+* * *
+
+## BatchNormalization with PyTorch
+
+If you wish to understand Batch Normalization in more detail, I recommend reading our [dedicated article about Batch Normalization](https://www.machinecurve.com/index.php/2020/01/14/what-is-batch-normalization-for-training-neural-networks/). Here, you will continue with implementing Batch Normalization using the PyTorch library for deep learning. This involves a few steps:
+
+1. Taking a look at the differences between `nn.BatchNorm2d` and `nn.BatchNorm1d`.
+2. Writing your neural network and constructing your Batch Normalization-impacted training loop.
+3. Consolidating everything in the full code.
+
+### Differences between BatchNorm2d and BatchNorm1d
+
+First of all, let's look at the differences between two-dimensional and one-dimensional Batch Normalization in PyTorch.
+
+1. Two-dimensional Batch Normalization is made available by [`nn.BatchNorm2d`](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html).
+2. For one-dimensional Batch Normalization, you can use [`nn.BatchNorm1d`](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html).
+
+One-dimensional Batch Normalization is defined as follows on the PyTorch website:
+
+> Applies Batch Normalization over a 2D or 3D input (a mini-batch of 1D inputs with optional additional channel dimension) (...)
+>
+> PyTorch (n.d.)
+
+...and this is how two-dimensional Batch Normalization is described:
+
+> Applies Batch Normalization over a 4D input (a mini-batch of 2D inputs with additional channel dimension) (…)
+>
+> PyTorch (n.d.)
+
+Let's summarize:
+
+- One-dimensional BatchNormalization (`nn.BatchNorm1d`) applies Batch Normalization over a 2D or 3D input (a _batch_ of _1D_ inputs with an optional _channel_ dimension).
+- Two-dimensional BatchNormalization (`nn.BatchNorm2d`) applies it over a 4D input (a _batch_ of _2D_ inputs with an additional _channel_ dimension).
+
+#### 4D, 3D and 2D inputs to BatchNormalization
+
+Now, what is a "4D input"? PyTorch describes it as follows: \[latex\](N, C, H, W)\[/latex\]
+
+- Here, \[latex\]N\[/latex\] stands for the number of samples in a batch.
+- \[latex\]C\[/latex\] represents the number of channels.
+- \[latex\]H\[/latex\] represents height and \[latex\]W\[/latex\] width.
+
+In other words, a 4D input to a `nn.BatchNorm2d` layer represents a set of \[latex\]N\[/latex\] objects that each have a height, a width and a number of channels of at least 1. What comes to mind when reading that?
+
+Indeed, images do.
+
+A "2D or 3D input" goes as follows: \[latex\](N, C, L)\[/latex\] (here, the \[latex\]C\[/latex\] is optional).
+
+`nn.BatchNorm1d` thus handles lower-dimensional inputs: a number of samples, possibly a number of channels, and the contents of each sample. These are regular, one-dimensional arrays, like the ones produced by [Dense layers](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/) in a neural network.
+
+Okay: we now know that we must apply `nn.BatchNorm2d` to layers that handle images. Primarily, these are [Convolutional layers](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/), which slide over images in order to generate a more abstract representation of them. `nn.BatchNorm1d` can be used with Dense layers that are stacked on top of the Convolutional ones in order to generate classifications.
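+
+To make this difference in expected shapes concrete, here is a minimal sketch - my own illustration, not part of the original tutorial - that feeds appropriately shaped random tensors through both layer types:
+
+```
+import torch
+from torch import nn
+
+# BatchNorm1d expects (N, C) or (N, C, L): here, a batch of 16 samples with 64 features each,
+# like the output of a Linear layer with 64 output neurons
+bn1d = nn.BatchNorm1d(64)
+dense_output = torch.randn(16, 64)
+print(bn1d(dense_output).shape)  # torch.Size([16, 64])
+
+# BatchNorm2d expects (N, C, H, W): here, a batch of 16 inputs with 3 channels of 32 x 32 pixels,
+# like a batch of CIFAR-10 images or the output of a Conv2d layer with 3 output channels
+bn2d = nn.BatchNorm2d(3)
+conv_output = torch.randn(16, 3, 32, 32)
+print(bn2d(conv_output).shape)  # torch.Size([16, 3, 32, 32])
+```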
+
+#### Where to use BatchNormalization in your neural network
+
+Now that we know what _type_ of Batch Normalization must be applied to each type of layer in a neural network, we can wonder about the _where_ - i.e., where to apply Batch Normalization in our neural network.
+
+Here's the advice of some Deep Learning experts:
+
+> Andrew Ng says that batch normalization should be applied immediately before the non-linearity of the current layer. The authors of the BN paper said that as well, but now according to François Chollet on the keras thread, the BN paper authors use BN after the activation layer. On the other hand, there are some benchmarks (…) that show BN performing better after the activation layers.
+>
+> StackOverflow (n.d.)
+
+There is thus no clear answer to this question. You will have to find out experimentally what works best for your model.
+
+### Writing the neural network + training loop
+
+Okay, we now know the following things...
+
+- What Batch Normalization does at a high level.
+- Which types of Batch Normalization we need for what type of layer.
+- Where to apply Batch Normalization in your neural network.
+
+Time to talk about the core of this tutorial: implementing Batch Normalization in your PyTorch based neural network. Applying Batch Normalization to a PyTorch based [neural network](https://www.machinecurve.com/index.php/2021/01/26/creating-a-multilayer-perceptron-with-pytorch-and-lightning/) involves just three steps:
+
+1. Stating the imports.
+2. Defining the `nn.Module`, which includes the application of Batch Normalization.
+3. Writing the training loop.
+
+Create a file - e.g. `batchnorm.py` - and open it in your code editor. Also make sure that you have Python, PyTorch and `torchvision` installed on your system (or available within your Python environment). Let's go!
+
+#### Stating the imports
+
+Firstly, we're going to state our imports.
+
+- We're going to need `os` based definitions for downloading the dataset properly.
+- All `torch` based imports are required for PyTorch: `torch` itself, the `nn` (a.k.a. neural network) module and the `DataLoader` for loading the dataset we're going to use in today's neural network.
+- From `torchvision`, we load the `CIFAR10` dataset - as well as some `transforms` (primarily image normalization) that we will apply on the dataset before training the neural network.
+
+```
+import os
+import torch
+from torch import nn
+from torchvision.datasets import CIFAR10
+from torch.utils.data import DataLoader
+from torchvision import transforms
+```
+
+#### Defining the nn.Module, with Batch Normalization
+
+Next up is defining the `nn.Module`. Note that we're not using `Conv` layers today - adding them would likely improve your neural network further. Instead, we're immediately flattening the 32x32x3 input, then further processing it into a 10-class outcome (because CIFAR10 has 10 classes).
+
+As you can see, we're applying `BatchNorm1d` here because we use densely-connected/fully connected (a.k.a. `Linear`) layers. Note that the number of inputs to each BatchNorm layer must equal the number of _outputs_ of the preceding `Linear` layer.
+
+The code below clearly shows how Batch Normalization is applied with PyTorch.
+
+```
+class MLP(nn.Module):
+  '''
+    Multilayer Perceptron.
+  '''
+  def __init__(self):
+    super().__init__()
+    self.layers = nn.Sequential(
+      nn.Flatten(),
+      nn.Linear(32 * 32 * 3, 64),
+      nn.BatchNorm1d(64),
+      nn.ReLU(),
+      nn.Linear(64, 32),
+      nn.BatchNorm1d(32),
+      nn.ReLU(),
+      nn.Linear(32, 10)
+    )
+
+  def forward(self, x):
+    '''Forward pass'''
+    return self.layers(x)
+```
+
+#### Writing the training loop
+
+Next up is writing the training loop. We're not going to cover it in great detail here, because we already wrote about it in our dedicated article about getting started with a first PyTorch model:
+
+- [Getting started with PyTorch](https://www.machinecurve.com/index.php/2021/01/13/getting-started-with-pytorch/)
+
+However, to summarize briefly what happens, here you go:
+
+- First, we set the seed of our random number generator to a fixed number. This makes runs reproducible, so that any differences between runs are caused by actual changes rather than by the pseudorandomness of the number generator.
+- We then prepare the CIFAR-10 dataset, initialize the MLP and define the loss function and optimizer.
+- This is followed by iterating over the epochs, where we set the current loss to 0.0 and start iterating over the data loader. We set the gradients to zero, perform the forward pass, compute the loss, and perform the backward pass followed by optimization. Indeed, this is what happens in the supervised ML process.
+- We print statistics after every 500 mini-batches fed forward through the model.
+
+```
+if __name__ == '__main__':
+
+  # Set fixed random number seed
+  torch.manual_seed(42)
+
+  # Prepare CIFAR-10 dataset
+  dataset = CIFAR10(os.getcwd(), download=True, transform=transforms.ToTensor())
+  trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1)
+
+  # Initialize the MLP
+  mlp = MLP()
+
+  # Define the loss function and optimizer
+  loss_function = nn.CrossEntropyLoss()
+  optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4)
+
+  # Run the training loop
+  for epoch in range(0, 5): # 5 epochs at maximum
+
+    # Print epoch
+    print(f'Starting epoch {epoch+1}')
+
+    # Set current loss value
+    current_loss = 0.0
+
+    # Iterate over the DataLoader for training data
+    for i, data in enumerate(trainloader, 0):
+
+      # Get inputs
+      inputs, targets = data
+
+      # Zero the gradients
+      optimizer.zero_grad()
+
+      # Perform forward pass
+      outputs = mlp(inputs)
+
+      # Compute loss
+      loss = loss_function(outputs, targets)
+
+      # Perform backward pass
+      loss.backward()
+
+      # Perform optimization
+      optimizer.step()
+
+      # Print statistics
+      current_loss += loss.item()
+      if i % 500 == 499:
+        print('Loss after mini-batch %5d: %.3f' %
+              (i + 1, current_loss / 500))
+        current_loss = 0.0
+
+  # Process is complete.
+  print('Training process has finished.')
+```
+
+### Full code
+
+I can imagine why you want to get started immediately. It's always more fun to play around, isn't it? :) You can find the full code for this tutorial at the top of this page.
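+
+In the next section, the results are compared against a variant of the model _without_ Batch Normalization. If you want to reproduce that comparison yourself, a minimal sketch of such a variant - my own illustration, not part of the original code - simply leaves out the `nn.BatchNorm1d` layers:
+
+```
+from torch import nn
+
+class MLPWithoutBatchNorm(nn.Module):
+  '''
+    Multilayer Perceptron without Batch Normalization, for comparison.
+  '''
+  def __init__(self):
+    super().__init__()
+    self.layers = nn.Sequential(
+      nn.Flatten(),
+      nn.Linear(32 * 32 * 3, 64),
+      nn.ReLU(),
+      nn.Linear(64, 32),
+      nn.ReLU(),
+      nn.Linear(32, 10)
+    )
+
+  def forward(self, x):
+    '''Forward pass'''
+    return self.layers(x)
+```
+
+Swapping `MLP()` for `MLPWithoutBatchNorm()` in the training loop is enough to run this comparison.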
+ +* * * + +## Results + +These are the results after training our MLP for 5 epochs on the CIFAR-10 dataset, _with_ Batch Normalization: + +``` +Starting epoch 5 +Loss after mini-batch 500: 1.573 +Loss after mini-batch 1000: 1.570 +Loss after mini-batch 1500: 1.594 +Loss after mini-batch 2000: 1.568 +Loss after mini-batch 2500: 1.609 +Loss after mini-batch 3000: 1.573 +Loss after mini-batch 3500: 1.570 +Loss after mini-batch 4000: 1.571 +Loss after mini-batch 4500: 1.571 +Loss after mini-batch 5000: 1.584 +``` + +The same, but then _without_ Batch Normalization: + +``` +Starting epoch 5 +Loss after mini-batch 500: 1.650 +Loss after mini-batch 1000: 1.656 +Loss after mini-batch 1500: 1.668 +Loss after mini-batch 2000: 1.651 +Loss after mini-batch 2500: 1.664 +Loss after mini-batch 3000: 1.649 +Loss after mini-batch 3500: 1.647 +Loss after mini-batch 4000: 1.648 +Loss after mini-batch 4500: 1.620 +Loss after mini-batch 5000: 1.648 +``` + +Clearly, but unsurprisingly, the Batch Normalization based model performs better. + +* * * + +## Summary + +In this tutorial, you have read about implementing Batch Normalization with the PyTorch library for deep learning. Batch Normalization, which was already proposed in 2015, is a technique for normalizing the inputs to each layer within a neural network. This can ensure that your neural network trains faster and hence converges earlier, saving you valuable computational resources. + +After reading it, you now understand... + +- **What Batch Normalization does at a high level, with references to more detailed articles.** +- **The differences between `nn.BatchNorm1d` and `nn.BatchNorm2d` in PyTorch.** +- **How you can implement Batch Normalization with PyTorch.** + +Great! Your next step may be to enhance your training process even further. Take a look at our article about [K-fold Cross Validation](https://www.machinecurve.com/index.php/2021/02/03/how-to-use-k-fold-cross-validation-with-pytorch/) for doing so. + +I hope that it was useful for your learning process! Please feel free to share what you have learned in the comments section 💬 I’d love to hear from you. Please do the same if you have any questions or other remarks. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +PyTorch. (n.d.). _BatchNorm1d — PyTorch 1.8.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html) + +PyTorch. (n.d.). _BatchNorm2d — PyTorch 1.8.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html) + +StackOverflow. (n.d.). _Where to apply batch normalization on standard CNNs_. Stack Overflow. [https://stackoverflow.com/questions/47143521/where-to-apply-batch-normalization-on-standard-cnns](https://stackoverflow.com/questions/47143521/where-to-apply-batch-normalization-on-standard-cnns) + +Ioffe, S., & Szegedy, C. (2015, June). [Batch normalization: Accelerating deep network training by reducing internal covariate shift.](http://proceedings.mlr.press/v37/ioffe15.html) In _International conference on machine learning_ (pp. 448-456). PMLR. 
diff --git a/best-machine-learning-artificial-intelligence-books.md b/best-machine-learning-artificial-intelligence-books.md new file mode 100644 index 0000000..f665a5e --- /dev/null +++ b/best-machine-learning-artificial-intelligence-books.md @@ -0,0 +1,1049 @@ +--- +title: "Best Machine Learning & Artificial Intelligence Books Available in 2021" +date: "2020-06-08" +categories: + - "books-about-ai" + - "deep-learning" +tags: + - "artificial-intelligence" + - "books" + - "learning-machine-learning" + - "machine-learning" +--- + +Quite recently, one of my colleagues asked me to give some tips for books about machine learning. In his case, he wanted to have a book about machine learning for beginners, so that he could understand what I'm doing... which helps him think about how machine learning can create value for the company I work for during daytime. + +Relatively quickly, I was able to find the book he needed - a perfect balance between technological rigor and understandability and readability. He quite liked it. And that's when I thought: there must be more people who are looking for machine learning books that suit their needs! That's why this post is dedicated to **books about machine learning**. More specifically, it is tailored to a set of categories: for example, you'll find **beginner machine learning books**, machine learning books about frameworks like **PyTorch**. I also cover books about **Keras/TensorFlow** and **scikit-learn**, or books about the **maths behind machine learning**. We even look at **academic textbooks** and books that discuss **societal and business impacts of machine learning (and artificial intelligence in general)**. + +This will therefore be a long post. Using the Table of Contents below, you can first select a group of books that you're interested in (or click one of the highlighted links above). Then, you'll be able to read my ideas about the books. I will cover a couple of things: the **author**, the **publishing date** (which illustrates whether it's a true classic or contains state-of-the-art knowledge), **what it covers and how it does that**, and **my impression about the book**. Additionally, I'll try to provide an overview of other reviews made available online. + +**Disclaimer:** creating this post - and a website like MachineCurve - involves a large time investment. MachineCurve participates in the **Amazon Services LLC Associates Program**, an affiliate advertising program designed to provide a means for sites to earn advertising commissions by linking to Amazon. + +I will therefore earn a small affiliate commission when you buy any product on Amazon linked to from this website. This does not create any additional cost for you. Neither does this mean that my ideas are biased towards commerce – on the contrary, they’re real. Through affiliate commissions, I have more time for generating Machine Learning content! 💡 + +* * * + +**Last Updated:** December 10, 2020 +This is a work in progress! I'm working on adding more and more books on a daily basis. + +* * * + +In this table of contents, you can see **all categories** of Machine Learning books that we're reviewing on this page, as well as the **individual books** that are part of the categories. Click on one of the categories or books to go there directly. + +\[toc\] + +* * * + +## Books about Machine Learning and Artificial Intelligence for Beginners + +### [1\. 
Grokking Deep Learning, by Andrew Trask](https://www.amazon.com/gp/product/1617293709/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1617293709&linkId=3cc2257936d61b1b50b16fa6c7121db7) + + + +_If you want to learn building Deep Learning models from scratch using Python._ + +**Author:** Andrew Trask, Senior Research Scientist at DeepMind +**Publishing date:** January 25, 2019 +**Price/quality:** 🟡🟢 Acceptable to good +**What it covers:** + +- Grokking Deep Learning teaches deep learning from a conceptual and a programming perspective. It teaches building deep learning models from scratch. +- You don't use any framework yet - rather, you'll use **Python** and **NumPy**. +- It covers fundamental concepts, like supervised vs unsupervised learning, forward propagation, gradient descent, backpropagation, to make you understand things from a high-level perspective. +- It then proceeds with more detailed stuff: regularization, batches, activation functions. +- After the conceptual deep dive, it broadens your view as it covers multiple types of neural networks - such as Convolutional Neural Networks, neural networks for Natural Language Processing, and so on. +- Finally, it provides a guide as to what steps you could take next. + +**My impression:** + +[Grokking Deep Learning](https://www.amazon.com/gp/product/1617293709/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1617293709&linkId=3cc2257936d61b1b50b16fa6c7121db7) (_affiliate link_) is a great book for those who wish to understand neural networks - especially if they have become thrilled by the deep learning related hype. But don't understand me incorrectly, it's not a hype oriented book. Rather, it helps you take your first baby steps. + +As with all good things, it starts with why. Why study deep learning; what could you possibly gain from it? You'll soon discover that the world is changing, and that it becomes increasingly automated. Deep learning is a major catalyst of this movement. What's more, it helps you understand what happens within deep learning frameworks - and, it claims, has a uniquely low barrier to entry. + +Let's take a look at this from my perspective. When I first started with deep learning, I used François Chollet's [Deep Learning with Python](https://www.amazon.com/gp/product/1617294438/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1617294438&linkId=a869d87f8ef52041c60446d91bb9721b) (_affiliate link_) to get a heads start. I've always been a big fan of this book because it makes deep learning concepts very accessible, but does so through the lens of Keras. Grokking Deep Learning takes the true conceptual path - you won't be able to create blazingly cool TensorFlow models, or create GANs with PyTorch, but you _will_ understand what happens within the neural nets. + +And it indeed does so in a brilliantly easy way. The only prerequisites are knowledge of Python and some basic mathematics knowledge - related to calculus and vector theory. And if you don't have the info, you'll learn it from the book. It contains a large amount of visualizations that help you understand intuitively what is going on. Definitely recommended if you want to get the basics. However, it seems like that towards the end of the book, [the chapters become denser](https://www.reddit.com/r/deeplearning/comments/dsx87c/saw_this_review_of_grokking_deep_learning_on/fl6yxl4/) and less easily comprehensible. 
So especially the first chapters provide a good introduction. Still, if you like a little bit of searching around besides reading things from books, it could be a good choice. The [Amazon reviews](https://www.amazon.com/gp/product/1617293709/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1617293709&linkId=3cc2257936d61b1b50b16fa6c7121db7) (_affiliate link_) are mostly very positive. + +### [2\. Machine Learning For Dummies, by John Paul Mueller & Luca Massaron](https://www.amazon.com/gp/product/1119245516/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1119245516&linkId=e564bec25928f7de81c97c71ac6a5db9) + + + +_If you have some interest in technology and want to understand how Machine Learning models work._ + +**Author:** John Paul Mueller (freelance author and technical editor, 100+ books) & Luca Massaron (data scientist) +**Publishing date:** May 10, 2016 +**Price/quality:** 🟢 Good +**What it covers:** + +- Why machine learning is playing such a prominent role in today's list of technologies promising change. +- Introducing data science related languages, such as Python and R, which can be used for machine learning too. +- Introducing basic steps for coding in R with R Studio and in Python with Anaconda. + +**My impression:** + +[Machine Learning For Dummies](https://www.amazon.com/gp/product/1119245516/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1119245516&linkId=e564bec25928f7de81c97c71ac6a5db9) (_affiliate link_) is a good introductory book to machine learning, although it's already getting older (it was released in 2016). It first introduces artificial intelligence and covers what I think is an important aspect - art and engineering - as machine learning forces you to follow your intuition every now and then. This is followed by an introduction to Big Data, which is the other part of the coin needed. + +In my point of view, the book forces you to choose a language for coding relatively quickly, as it proceeds with you preparing your learning tools: you either use R, or Python (or both, but often you'd just choose one). In doing so, it gives you a crash course of programming in both of the languages, for when you haven't done so before. And if you're not satisfied with both, it'll give you guidance to other machine learning tools as well - such as SAS, SPSS, Weka, RapidMiner and even Spark, for distributed training. However, it doesn't cover them in depth. + +Then, it proceeds with the basics of machine learning - and shows you how supervised ML essentially boils down to error computation and subsequent optimization. It also covers data preprocessing, and then introduces a wide array of machine learning techniques: clustering, support vector machines, neural networks and linear models. Finally, it allows you to create models for image classification, text/sentiment classification and product recommendation. + +I do appreciate the effort put into the book by the authors. However, I think that it would be best if you already have some background experience with programming - despite the crash course. In my point of view, it's also important to have a clear view about the differences between say, supervised and unsupervised machine learning, as it covers them all relatively quickly - and the field is wide. 
Nevertheless, if you are into machine learning _programming_, [this can be a very good book for starters](https://www.amazon.com/gp/product/1119245516/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1119245516&linkId=e564bec25928f7de81c97c71ac6a5db9) (_affiliate link_) - especially considering its price. + +### [3\. Artificial Intelligence For Dummies, by John Mueller & Luca Massaron](https://www.amazon.com/gp/product/1119467659/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1119467659&linkId=4da8193d9eb89565ee638b2f9d398833) + + + +_If you have a background in business, are non-technical but want to understand what happens technologically._ + +**Author:** John Paul Mueller (freelance author and technical editor, 100+ books) & Luca Massaron (data scientist) +**Publishing date:** March 16, 2018 +**Price/quality:** 🟢 Really good +**What it covers:** + +- Some history about Artificial Intelligence +- How AI is used in modern computing +- The limits of AI, common misconceptions and application areas. + +**My impression:** + +If you're looking for an AI book that is written for business oriented people who are interested in the technology side of AI without diving deep into technology, this could be the book you're looking for. + +When I give guest lectures about the impact of AI and Machine Learning, I always make sure to include a slide which asks my audience a particular question: "What is Artificial Intelligence?" + +Funnily, they will find out, the _precise_ answer to the question is given by them by remaining silent... as nobody knows. + +That's why I think this book is such a good introduction for persons who want to understand Artificial Intelligence in more detail, beyond the realm of _it has great impact on your business_, without getting lost in programming code. + +First of all, the book does precisely that: introducing Artificial Intelligence, questioning what intelligence is, taking a look at its history, including the first AI winter (connectionist-expert systems debate) and the second (the demise of the latter and the revival of the first). + +It then proceeds by looking at the fuel of AI - being data. It covers why data is so useful, but also why it cannot be trusted all the time, _and_ the limits of getting data in order. Once complete, the discussion gets a bit more technical - looking at the concept of an algorithm, introducing machine learning as well as specialized hardware for creating AI applications and running them (i.e., GPUs). + +Following the conceptual part is a part that considers the uses of AI in society. First, a wide range of applications is covered - such as AI for _corrections_ and AI for _suggestions_. This includes a chapter on automating industrial processes and even the application of AI in healthcare, which is a controversial topic - privacy related issues are just around the corner, not to mention the ethical implications of health + +Subsequently, it provides a lot of information about applying artificial intelligence in _software applications_ - introducing machine learning and deep learning for doing so, as well as in _hardware applications_, i.e. robotics, unmanned vehicles and self-driving cars. This is concluded by a chapter about the future of AI - especially from the lens of the hype that we've seen emerging in the past few years. 
But it also looks at the potential of AI to disrupt today's jobs, how it can be applied in space and how it can contribute to society in general. + +I really like the book. I do. It helps bridge the gap between business and technology, and is in fact the book that I recommended my colleague when he wanted to understand the _technology_ side of AI in more detail. As he's a business oriented person, he doesn't code and neither wants to learn how to. This book provides all the broad technology oriented details, links them to application areas, and is appreciative of the history of AI, the nonsense of the current AI hype, and what the future may hold. I definitely recommend it. + +### [4\. Make Your Own Neural Network, by Tariq Rashid](https://www.amazon.com/gp/product/B01EER4Z4G/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B01EER4Z4G&linkId=e49ce31f2f08e681c8b7de38357aea37) + + + +_If you want to understand what happens inside a neural network in great detail - as a learning by doing experience._ + +**Author:** Tariq Rashid +**Publishing date:** March 31, 2016 +**Price/quality:** 🟢 Really good +**What it covers:** + +- The mathematics of neural networks, but in a comprehensive way - secondary school mathematics will suffice. +- Creating your own neural network with pure Python. +- Iterative improvement of your code by showing what works and what doesn't. + +**My impression:** + +The book [Make Your Own Neural Network](https://www.amazon.com/gp/product/B01EER4Z4G/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B01EER4Z4G&linkId=e49ce31f2f08e681c8b7de38357aea37) (_affiliate link_) starts with a prologue in which the author covers the history of the AI field in a nutshell. Very briefly, he covers the rise of AI in the 1950s, the first AI winter, and all progress until now. Very insufficient detail in order to fully grasp what has been around in the past few years, but that's not the point - rather, it _does_ set the stage, which _was_ the goal. + +The book contains three parts: + +1. In the first part, **How they work**, the author covers mathematical ideas related to neural networks. +2. In the second part, **DIY with Python**, you're going to get to work. More specifically, you will build a neural network that can classify handwritten digits. +3. Finally, in the third part, **Even More Fun**, you're going to expand your neural network in order to find whether you can boost its performance. You're even going to try and look inside the neural network you've created. + +In my point of view, the author really does his best to make neural networks comprehensible for absolute beginners. That's why it's likely not a book for you if you already have some experience: you likely won't learn many new things. However, if you have _absolutely no experience_, I think it's absolutely one of the best books to start with. Kudos to Tariq Rashid, who has done a terrific job at making neural network theory very accessible. + +### [5\. 
Machine Learning Pocket Reference: Working with Structured Data in Python, by Matt Harrison](https://www.amazon.com/gp/product/1492047546/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1492047546&linkId=ddd7302d1442448c6fded31e4970e980) + + + +_If you want a quick-lookup reference manual for when you're undecided about what to do when Machine Learning engineering._ + +**Author:** Matt Harrison +**Publishing date:** August 27, 2019 +**Price/quality:** 🟢 Really good +**What it covers:** + +- Data preprocessing: cleaning your dataset and what to do when data goes missing. +- Feature selection: which features are useful to your model? How to find out? +- Model selection: what configuration of my ML model works best? +- Supervised learning: classification and regression. +- Unsupervised learning: clustering and dimensionality reduction. +- Evaluating your machine learning model. + +**My impression:** + +You sometimes don't want a book filled with details, but rather a reference guide that you can use when you're troubled by some kind of machine learning related problem. [Machine Learning Pocket Reference](https://www.amazon.com/gp/product/1492047546/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1492047546&linkId=ddd7302d1442448c6fded31e4970e980) (_affiliate link_) is then a great book for you. As you can see above, it covers many things related to the _building your machine learning model_ part of the machine learning lifecycle. From data preprocessing to evaluating your machine learning model, this guide will help you proceed. + +Contrary to many books on the topic, the author avoids state-of-the-art neural network frameworks like TensorFlow, Keras and PyTorch. In doing so, he wants to focus on the _concepts at hand_ - i.e., performing all the work with just Python and Scikit-learn, which provides many interesting helper functions. A few of the things that are covered: + +- Pandas Profiling, which generates reports about your Pandas DataFrame which helps you inspect your dataset easily. +- Validation curves, which help the evaluation process, as well as Confusion Matrices. +- Performing exploratory data analysis, which includes box plots and violin plots. + +And much more - _including_ code for doing so! + +When solving a problem with machine learning, time is often your greatest ally _and_ your greatest enemy. Training a machine learning model can be time-intensive, and by consequence you want to do many things right. But then, exploring the data, cleaning the data - those are time-consuming tasks. [Machine Learning Pocket Reference](https://www.amazon.com/gp/product/1492047546/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1492047546&linkId=ddd7302d1442448c6fded31e4970e980) (_affiliate link_) is scattered with useful tools and techniques that help make the life of machine learning engineers easier. Once again: if you want a quick pocket guide to fall back to if you're facing a problem, you could try Google... or this book! + +### [6\. 
Artificial Intelligence: A Guide for Thinking Humans, by Melanie Mitchell](https://www.amazon.com/gp/product/0374257833/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=0374257833&linkId=e8a574741ada34ed87205ad42123790b) + + + +_If you want a holistic perspective on AI: where it comes from, what it is now and where it is heading._ + +**Author:** Melanie Mitchell +**Publishing date:** October 15, 2019 +**Price/quality:** 🟢 Really good +**What it covers:** + +- Understanding the intelligence of AI programs and how they work +- Understanding how they fail +- Understanding the differences between AI and humans and what Artificial General Intelligence looks like + +**My impression:** + +The book [Artificial Intelligence: A Guide for Thinking Humans](https://www.amazon.com/gp/product/0374257833/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=0374257833&linkId=e8a574741ada34ed87205ad42123790b) (_affiliate link_) is not your ordinary AI book: it doesn't cover the practical perspective. Rather, it's a book that provides a deep dive into the AI field - allowing you to understand where things have come from, how things work, and what AI could possibly achieve. + +It is written in five parts: Background, Looking and Seeing, Learning to Play, Artificial Intelligence Meets Natural Language and The Barrier of Meaning. + +In the first part, Mitchell covers the roots of Artificial Intelligence. She traces current AI developments back into the past, which in my opinion is very important for people who wish to learn a thing or two about AI. You simply need to know the past. And she covers it with great detail, as you will see - she'll cover all theoretical developments, including neural networks and the patterns of AI summers (hypes) and winters (the exact opposite). + +Following the Background part, the book continues with Looking and Seeing. Here, among others, the reader is introduced to Convolutional Neural Networks - which are the ones that triggered the AI hype back in 2012. It covers machine learning, the branch of AI that you hear a lot about, in depth - and does not shy away from discussing AI and Ethics, an important theme in deployment of 'smart' algorithms. + +In the other parts, Melanie Mitchell covers games & AI (which leans towards reinforcement learning) and Natural Language Processing, two important themes in AI research and practice! It's actually a stepping stone towards the final, and equally important part: The Barrier of Meaning. Here, the author takes a look at Artificial General Intelligence - or what happens if AIs become as intelligent as human beings. What does it mean to 'understand'? What is knowledge, and how can it be represented in AI? As you will see, it's not very simple to replicate human intelligence. But efforts are underway. + +In my opinion, the book [Artificial Intelligence: A Guide for Thinking Humans](https://www.amazon.com/gp/product/0374257833/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=0374257833&linkId=e8a574741ada34ed87205ad42123790b) (_affiliate link_) is a great book for those who wish to understand AI from a holistic perspective. Where does it come from? What is it now? And where is it going to? Melanie Mitchell answers those questions without making the book boring. And the reviews are in her favor: she's got a 5-star rating on Amazon. Definitely recommended - especially given the price. + +### [7\. 
AI Crash Course: A fun and hands-on introduction to machine learning, reinforcement learning, deep learning, and artificial intelligence with Python, by Hadelin de Ponteves](https://www.amazon.com/gp/product/1838645357/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1838645357&linkId=cde7ddce49aefce22131361749d15dab) + + + +_If_ _you want to get a conceptual and hands-on introduction to Reinforcement Learning._ + +**Author:** Hadelin de Ponteves +**Publishing date:** November 29, 2019 +**Price/quality:** 🟢 Really good +**What it covers:** + +- Learning about the basics of Reinforcement Learning +- Getting practical experience by building fun projects, solving real-world business problems, and learning how to code Reinforcement Learning models with Python +- Discovering Reinforcement Learning and Deep Reinforcement Learning, i.e. the state-of-the-art in AI research + +**My impression:** + +[AI Crash Course](https://www.amazon.com/gp/product/1838645357/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1838645357&linkId=cde7ddce49aefce22131361749d15dab) (_affiliate link_) is not your ordinary machine learning book. Why that is the case? Very simple - although the title suggests that it covers "deep learning", in my point of view, it seems to cover the reinforcement part of deep learning only. That is: it skips the more traditional supervised deep learning approaches (e.g. supervised deep neural networks) and unsupervised learning, which are still important areas of active research and practice today. + +The fact that it does, does not make it a bad book. On the contrary: for what it does, it's good - the crash course is really relevant and is perceived to be really good by many readers, and definitely worth the money. Let's take a look at what it does from my perspective. + +In chapter 1, Hadelin de Ponteves introduces you to the topic of Artificial Intelligence. It's called "Welcome to the Robot World" and not for nothing: taking the analogy of robotic systems, the author introduces you to main concepts of reinforcement learning (e.g. Q-learning and Deep Q-learning) and gives examples of the deployment of Artificial Intelligence across a wide range of industries. Chapter 2 introduces you to GitHub and Colab, while Chapter 3 subsequently provides you with a crash course in Python - relevant for those who haven't worked with the language before. + +Now that you have been introduced to AI, some tools and Python, it's time to get to work. Chapter 4 kickstarts the AI Crash Course with "AI Foundation Techniques", or better: Reinforcement Learning Foundation Techniques. It introduces how AI models convert inputs to outputs, how a reward can be attached to outputs, how the environment impacts the way your AI works and one of the core topics in Reinforcement Learning - the Markov decision process. Finally, the book covers how you can train your Reinforcement Learning model. + +After the introduction, the book covers a lot of applications. Using Thompson Sampling, Q-learning, Deep Q-learning and other techniques, you will create models for sales/advertising, logistics, autonomous vehicles, business in general and gaming. After those applications, where you'll create real code, the book recaps and finally suggests additional reading materials. + +The book is good. You'll definitely feel as if you achieved something after completing every chapter. It even provides a lot of examples. 
However, I do think that the author could have better named it Reinforcement Learning Crash Course - because readers may be confused when they discover the areas of supervised and unsupervised learning once they dive deeper into Machine Learning after reading the book. And what to think about the other approaches in AI, which have nothing to do with Machine Learning? Despite the name, [AI Crash Course](https://www.amazon.com/gp/product/1838645357/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1838645357&linkId=cde7ddce49aefce22131361749d15dab) (_affiliate link_) is definitely a book recommended to those who wish to get an introduction to reinforcement learning.
+
+### [8\. Machine Learning For Absolute Beginners: A Plain English Introduction (Machine Learning From Scratch), by Oliver Theobald](https://www.amazon.com/gp/product/B07335JNW1/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07335JNW1&linkId=9a1bbcee47cf127543b04502ac3ff41a)
+
+
+
+_If you have no experience with Machine Learning yet and want to understand the basic concepts._
+
+**Author:** Oliver Theobald
+**Publishing date:** 2017
+**Price/quality:** 🟢 Really good
+**What it covers:**
+
+- Teaching you the basic concepts of machine learning, which makes it suitable for absolute beginners
+- Teaching you how to build a model in Python, although that's not the focus
+- Preparing you for more advanced machine learning books
+
+**My impression:**
+
+Well, what's in a name? The book [Machine Learning for Absolute Beginners](https://www.amazon.com/gp/product/B07335JNW1/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07335JNW1&linkId=9a1bbcee47cf127543b04502ac3ff41a) (_affiliate link_) is, in my opinion, a really great book to start with machine learning if you know absolutely nothing about it. It doesn't have a lot of code and it covers the basic concepts - but given the price, it's a great purchase if you want to find out whether machine learning is something for you.
+
+The book starts by introducing Machine Learning through a story about IBM and a machine that plays checkers better than its programmer. Following the introduction, the categories of ML - being supervised learning, unsupervised learning and reinforcement learning - are introduced, as well as what can be found in the toolbox of a machine learning engineer.
+
+Then, the book proceeds with data: how to clean it, and make it ready for actual machine learning projects. Those projects are then highlighted: the book covers regression analysis with machine learning, clustering, bias & variance, and a lot of machine learning techniques such as neural networks, decision trees and model ensembles.
+
+Once you know about the concepts, it teaches you how to build a model in Python with the Scikit-learn framework, as well as how to optimize it. This prepares you for other, more advanced books - e.g. the ones introducing Scikit-learn in more detail, or TensorFlow/Keras.
+
+I think the book [Machine Learning for Absolute Beginners](https://www.amazon.com/gp/product/B07335JNW1/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07335JNW1&linkId=9a1bbcee47cf127543b04502ac3ff41a) (_affiliate link_) is priced appropriately and does what it suggests: teach you the absolute basic concepts in Machine Learning. Don't expect the book, which is just short of 130 pages, to make you an expert.
But if you have no clue about ML and what it is, this book will help you understand things quickly. Definitely recommended in that case!
+
+### [9\. Neural Network Projects with Python: The ultimate guide to using Python to explore the true power of neural networks through six projects, by James Loy](https://www.amazon.com/gp/product/B07P77QWW7/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07P77QWW7&linkId=d2925f3b198ae57bfd660615ec806ccb)
+
+
+
+_If_ _you want to start gaining hands-on experience with neural networks, without losing track of the concepts._
+
+**Author:** James Loy
+**Publishing date:** February 28, 2019
+**Price/quality:** 🟢 Really good
+**What it covers:**
+
+- Architectures of neural networks: Convolutional Neural Networks (CNNs/ConvNets) and Recurrent Neural Networks (LSTMs)
+- Using popular frameworks like Keras to build neural networks
+- Diving deep into application areas of machine learning, like the identification of faces and other objects, and sentiment analysis
+
+**My impression:**
+
+Machine learning has come a long way - but today, neural networks are the primary driver of the machine learning hype. The book [Neural Network Projects with Python](https://www.amazon.com/gp/product/B07P77QWW7/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07P77QWW7&linkId=d2925f3b198ae57bfd660615ec806ccb) (_affiliate link_) understands this and introduces you to the topic. It's a good read for those who have some machine learning experience and want to be introduced to neural networks.
+
+The book starts with a Machine Learning and Neural Networks 101. The chapter covers what machine learning is, how your computer can be set up to run ML, and introduces you to frameworks for data science and ML: pandas, TensorFlow/Keras, and other libraries.
+
+This is followed by a variety of neural network types such as Multilayer Perceptrons (which we know from the past, as traditional ML models), Deep Feedforward Networks (the trend of deep learning), and their specific variants - ConvNets, autoencoders, and recurrent ones such as LSTMs. Finally, the book lets you implement an object detection system for faces with contemporary frameworks. Each chapter covers the particularities of the problem at hand: for example, it covers scaling of data, other preprocessing and feature selection, but also a review of neural network history and configuration. This allows you to understand the ML concepts in more depth.
+
+As mentioned, I do think that this book is a good introduction to neural networks for those who have no neural network experience, or limited ML experience at least.
+
+### [10\. Machine Learning with Python for Everyone, by Mark Fenner](https://www.amazon.com/gp/product/0134845625/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=0134845625&linkId=7e07f4ade76a218b3528039f2a8920c5)
+
+
+
+_If_ _you're a beginner in ML and want to start building models rather than being bothered with maths._
+
+**Author:** Mark Fenner
+**Publishing date:** July 30, 2019
+**Price/quality:** 🟢 Good
+**What it covers:**
+
+- Understanding machine learning algorithms and concepts
+- Introducing yourself to machine learning pipelines: feature engineering, model creation, and evaluation
+- Applying machine learning to images and text in classification and regression settings
+- Studying neural networks and building your own models with Scikit-learn
+
+**My impression:**
+
+Machine learning is a daunting topic to many, especially given the fact that many books on the topic are filled with maths. In [Machine Learning with Python for Everyone](https://www.amazon.com/gp/product/0134845625/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=0134845625&linkId=7e07f4ade76a218b3528039f2a8920c5) (_affiliate link_), Mark Fenner aims to teach the absolute ML beginner how things work, without requiring them to have substantial experience with mathematics. Although some equations will be inevitable, Fenner mostly explains things through stories and visualizations - as well as code.
+
+The book contains 4 parts. In Part 1, Fenner establishes the foundation that you need to understand the absolute basics of ML. In chapter 1, the book covers what ML, and especially supervised ML, involves (i.e. features and targets), shows what classifiers and regressors are, and introduces evaluating machine learning systems. This is followed by chapter 2, which covers a bit of technology background - and some maths. Then come two more detailed chapters on creating classifiers and regressors. You'll be introduced to what they are, with some initial examples involving e.g. a Nearest Neighbors or Naive Bayes Classifier, or Linear Regression.
+
+Moving to Part 2, Fenner starts discussing model evaluation. Chapter 5 introduces the reader to why models must be evaluated, and how model error can be represented by a cost function. Beyond simple error measurement, the chapter also covers how sampling strategies such as Cross Validation can be used to obtain better evaluation estimates. Chapters 6 and 7 deepen your knowledge about evaluation by specifically looking at evaluation methods for classifiers and regressors (with e.g. ROC curves, Precision-Recall curves, and basic regression metrics).
+
+Part 3 deepens the reader's knowledge about classifiers, regressors and machine learning pipelines. In chapter 8, you'll read about a variety of classifiers (Decision Trees, Support Vector Machines, Logistic Regression) which you haven't read about earlier in the book. You'll also learn how to compare them in order to select the right one for the job. The same happens with regressors: you'll learn about Linear Regression, Support Vector Regression, Regression Trees and others.
+
+Chapter 10 introduces you to feature engineering. How to select features for your model while being confident that they actually contribute to generating the prediction? How to prepare data so that it can be fed to the model training process properly? Those and other questions will be answered in chapter 10. This is followed by tuning of the model (and specifically, its hyperparameters) in chapter 11.
+
+Part 4 introduces the reader to more complex topics. Chapter 12 introduces you to ensemble methods, which is a fancy term for combining models to generate better performance. This is followed by chapters about automatic feature engineering (such as [Principal Components Analysis, or PCA](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/)), feature engineering for specific domains (such as text, clustering, and images), and advanced feature engineering (such as neural networks).
+
+That's it!
In fewer than 600 pages, Mark Fenner covers the width and depth of the machine learning field - without a lot of maths. This way, readers who have a background in programming (preferably Python) and don't want to be bothered by the math have a chance of getting started with their machine learning endeavor. The reviews love the book: "This is a wonderful book", "best approach I have seen" and so on. However, it does not guide you through writing code fully… one review suggests that the reader must be familiar with Jupyter Notebooks (or get familiar with them) before they can get started. Perhaps that's a suggestion for improvement. For the rest, [Machine Learning with Python for Everyone](https://www.amazon.com/gp/product/0134845625/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=0134845625&linkId=7e07f4ade76a218b3528039f2a8920c5) (_affiliate link_) is definitely recommended to those who want to just start with ML! :)

### [11\. Machine Learning: The Absolute Complete Beginner’s Guide by Steven Samelson](https://amzn.to/2HhTTGB)

_If you want to understand Machine Learning at a high level._

**Author:** Steven Samelson
**Publishing date:** 2019
**Price/quality:** 🟠🟡 Moderate to acceptable
**What it covers:**

- A high-level coverage of the field of machine learning.

**My impression:**

The book [Machine Learning: The Absolute Complete Beginner’s Guide by Steven Samelson](https://amzn.to/2HhTTGB) (_affiliate link_) was published in 2019, has 122 pages and sets out to guide Machine Learning beginners on their learning path.

As such, it's important to realize that this book does not provide an in-depth introduction to Machine Learning. Nor is it a technical handbook. Rather, it can be considered a high-level perspective on what ML is, what it is not, and how you can start.

For this reason, some people feel a bit disappointed upon reading the book, as they had expected it to be that in-depth introduction. It's not. What's more, people on Amazon complain that the grammar is not good. Therefore, I'd say: be careful checking out this book - despite the low price of approximately 5 dollars.

### [12\. Building Machine Learning Powered Applications: Going from Idea to Product by Emmanuel Ameisen](https://amzn.to/2RNVzJS)

_If you want to grasp the ML model lifecycle: from training a model to deploying it while measuring its success_ - _focused on product development rather than theory._

**Author:** Emmanuel Ameisen
**Publishing date:** January 2020
**Price/quality:** 🟢 Really good
**What it covers:**

- Developing machine learning driven products rather than models only
- Covering the machine learning lifecycle for doing so
- Teaching you how to define a ML problem that solves your product goals
- Explaining how to build a ML pipeline with a dataset, train and evaluate your ML models, and finally deploy and monitor them

**My impression:**

The book [Building Machine Learning Powered Applications](https://amzn.to/2RNVzJS) (_affiliate link_) has four parts: (1) Finding the correct Machine Learning approach; (2) Building a Working Pipeline; (3) Iterating on Models, and (4) Deploying and Monitoring them.

Being a software engineer, this list really excites me, because I'm always focused on getting things out there - moving away from theory and deploying my machine learning models and other software in the field.
I haven't encountered many books yet which suggest that they cover the entire path from having a problem to model deployment from a practical point of view!

In the preface, the author Emmanuel Ameisen - who now works as a Machine Learning engineer at payment platform Stripe - introduces the foundation for this book: his goal is to teach you how to use ML for building practical applications. He then moves to chapter 1, which covers "framing" your product's goal as a Machine Learning problem.

That's an important aspect, because many problems that can be solved by ML can be solved even better by traditional, sometimes even human-only, methods! Machine Learning, despite the hype, should not be used for everything. In chapter 2, the author continues here, by diving deeper into the framing exercise - teaching you how to create a plan for building a machine learning model. It covers the first steps in model monitoring, scope estimation, and planning how to get to work.

In chapter 3, the author helps you create an end-to-end pipeline for your machine learning model. You're taught what the contents are, from data preprocessing to creating the skeleton of your model. This is followed by chapters 4 and 5, which cover exploring and processing your first dataset as well as training and evaluating your first attempt at a machine learning model.

Despite your best efforts, it's likely that your model will not work well out of the box after your first training run. Chapter 6 covers this, by teaching how to debug your ML model in order to make it better. It covers best practices in software and machine learning development, visualizing the model, ensuring predictive power (that is, that your model works on your data) as well as generalization (that is, that it also works on data it has never seen before). Chapter 7 expands on this by introducing classifiers.

Part 4 covers model deployment and does so with chapters 8 to 11. First, and I think this is a good thing, the book starts with deployment considerations in chapter 8. Here, the author stresses that you should think first and deploy later. The chapter covers aspects related to data ownership, bias and its consequences, as well as abuse of your ML model.

Chapters 9, 10 and 11 subsequently cover the actual aspects of model deployment. In chapter 9, you're taught whether to deploy in a server-side setting or deploy your model directly at your client. Chapter 10 covers building safeguards for models, to ensure that your model keeps running - despite the variety of edge cases that it could potentially face. Chapter 11, finally, covers how your machine learning model can be monitored.

If you're responsible for training machine learning models as well as their deployment, I think that this is a great book - but only when you're a beginner in doing so (let me explain this in a bit). I think that this book is unique in the sense that it covers model training as well as deployment. It's also very young (early 2020) and therefore really relevant for today's practice. Now, with respect to your level: if you are a beginner in training and deployment, this book is great. You will learn a lot when reading it, especially because it also greatly covers the 'why' of deployment. If you aren't a beginner, it's likely that you already know a lot about the training aspects of machine learning models. This makes the first few parts of the book less relevant for you.
Even then, I think that the book's continuation into model deployment will teach even the more advanced ML engineers a few things - because ML deployment is still a developing area in today's ML practice.

In sum: definitely a unique book that I would love to have on my book shelf :)

### [13\. Deep Learning Illustrated: A Visual, Interactive Guide to Artificial Intelligence, by Jon Krohn, Grant Beyleveld and Aglaé Bassens](https://amzn.to/32Xlw09)

_If you want to get a more in-depth introduction with a great balance between theory and practice._

**Author:** Jon Krohn, Grant Beyleveld and Aglaé Bassens
**Publishing date:** August 5, 2019
**Price/quality:** 🟢 Really good
**What it covers:**

- Finding out why deep learning is different and how it can benefit practice
- Mastering the theory: finding out about neurons, training, optimization, [ConvNets](https://www.machinecurve.com/index.php/2020/03/19/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi/), recurrent nets, GANs, reinforcement learning, and so on
- Building interactive deep learning applications, to help move forward your own AI projects

**My impression:**

We hear a lot about Deep Learning today. The field is a subset of Machine Learning, which in its own right is a subset of the broader field of Artificial Intelligence. The book [Deep Learning Illustrated: A Visual, Interactive Guide to Artificial Intelligence](https://amzn.to/32Xlw09) (_affiliate link_) aims to specifically teach you the introductory principles of Deep Learning - with the goal of facilitating your work on your own AI projects. For this, it has four parts:

1. **Introducing Deep Learning**
2. **Essential Theory Illustrated**
3. **Interactive Applications of Deep Learning**
4. **You and AI**

The chapters seem to be densely packed with information. Personally, I never think this is a bad thing - because you'll get your money's worth of information. However, if you're not so much into reading, perhaps different books are better choices, providing you a coarser overview of the Deep Learning field. Now, with respect to the chapters: in part 1, deep learning is introduced. Chapter 1 covers the differences between biological vision (i.e., how organisms see) and computer or machine vision. It provides a bit of history as to how computer vision has evolved over the years, and what applications are possible today. We therefore find that the book starts with a bit of history as well as applications - in my view, those are good ways of approaching the field holistically.

Chapter 2 does the same, but this time for human language and machine language. It introduces how deep learning is used for Natural Language Processing, and how things have evolved in the past. Briefly, it shows you how language is represented in machines, and how techniques that can handle those representations are used in today's applications. Chapters 3 and 4 repeat this, but for machine art (chapter 3) and game-playing machines (chapter 4). Part 1 therefore gives you a true overview of the Deep Learning field - as well as its history and applications. A great introduction to the context and the 'why' of what's happening.

Now that the reader is introduced to the context, the book moves forward with essential theory. It introduces neural networks by means of a practical implementation using [Keras](https://www.machinecurve.com/index.php/mastering-keras/), followed by theory about artificial neurons.
That is, chapters 5 and 6 cover both a practical introduction to neural networks as a theoretical one. Following the path from the [Perceptron](https://www.machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/) to [modern neurons](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/), the reader can appreciate how things have evolved while directly putting things to practice. Once again, I appreciate this approach of the author. + +Chapters 7, 8 and 9 study deep learning theory in more detail. Now that the reader has been introduced to neurons in chapter 6, chapter 7 moves forward by structuring them into networks. What does the input layer mean? What are densely-connected (or fully-connected) layers? And how can we stack them together to generate a neural network? Those are the questions that will be covered in the chapter. Subsequently, chapter 8 moves even more in-depth by studying cost functions (or: [loss functions](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/)), [optimization](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) (or: how models are made better), backpropagation. Here, once again, things are made practical by means of Keras implementations, which is good. Chapter 9 subsequently moves forward by introducing [weight initialization](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/) (and things like [Xavier init](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/)), [vanishing and exploding gradients](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/), [L1/L2 regularization](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/), [Dropout](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/) and [modern optimizers](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/). It really throws you off the cliff in terms of complexity, but you'll learn a lot. + +Part 3, covering chapters 10 to 13, moves forward to practical applications of Deep Learning. Now that you understand better how those models work, it's time to study Machine or Computer Vision, Natural Language Processing, Generative Networks and Reinforcement Learning. Those application areas have their own specifics when it comes to deep learning. For example, computer vision widely uses [ConvNets](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/), which are covered in the book. [GANs](https://www.machinecurve.com/index.php/2019/07/17/this-person-does-not-exist-how-does-it-work/) work differently compared to Computer Vision and NLP models, because they work with two models - at once - for generating new data. Reinforcement Learning is even more different, and the book teaches you how agents can be used in situations where insufficient training data is available. + +Finally, part 4 - chapter 14 - covers how YOU can move forward with AI, and specifically your own deep learning projects. 
It covers a set of ideas, points you towards additional resources, and gives you pointers to frameworks/libraries for Deep Learning such as [Keras](https://www.machinecurve.com/index.php/mastering-keras/), TensorFlow and [PyTorch](https://www.machinecurve.com/index.php/mastering-pytorch/). It also briefly covers how Artificial General Intelligence (AGI) can change the world, and where we are on the path towards there. + +In my point of view, this book is a great introduction to the field of Deep Learning. This is especially so for people who are highly specialized in one field of Deep Learning, while they know not so much about other fields. For example, this would be the case if you are a Computer Vision engineer, wanting to know more about Reinforcement Learning. The book nicely combines theory with practical implementations, meaning that you're not drowned in theory but gain enough theoretical understanding in order to understand what's happening. Once again, as with many books reviewed here, definitely recommended. Even so that I'm actually considering buying it myself :) + +### [14\. Programming Machine Learning: From Coding to Deep Learning, by Paolo Perrotta](https://amzn.to/3mZh9cV) + + + +_If you want to start with machine learning as a software engineer_ + +**Author:** Paolo Perrotta +**Publishing date:** March 31, 2020 +**Price/quality:** 🟢 Good +**What it covers:** + +- Creating machine learning models from a developer perspective +- Intuitively understanding what happens in supervised machine learning without being buried in maths +- Working through the history of machine learning techniques (linear regression, [perceptrons](https://www.machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/)) before creating actual deep learning models, to give you a bit of context as well. + +**My impression:** + +The field of machine learning presents some really big barriers to entry, especially if you don't come from a mathematics origin. In fact, websites like MachineCurve exist because people - me included - think that it's possible to learn machine learning without having to dive through all the maths. In fact, I even think that it's perfectly possible to get intuition about what happens without having to write down that math. If it works, and if it works well - and if you can explain why - why would that be no good? + +The book [Programming Machine Learning: From Coding to Deep Learning](https://amzn.to/3mZh9cV) (_affiliate link_) takes a similar perspective. It is a book for developers who want to learn how to build machine learning models from scratch. It covers three widely used terms within the field: supervised learning, neural networks, and deep learning. + +Supervised learning, which involves training machine learning models based on a training dataset, is introduced in Part 1 of the book. You'll learn how to create a program that can learn within a short time, and turn it into a [perceptron](https://www.machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/) (indeed, that single-neuron network that was very popular in the 1950s). The author attempts to achieve this by first explaining how machine learning works. That is, supervised learning is introduced, and it is explained why it is different than regular programming. Then, the book teaches you how to make your system ready for machine learning. 
Part 1 also covers these elements: + +- Understanding the problem that you're trying to solve with machine learning +- Understanding gradient descent +- Understanding hyperspaces and how they are used in classifiers +- Understanding more advanced methods for classification + +Subsequently, the book moves to larger neural networks: in Part 2, you'll learn how to create code for Multilayer Perceptrons. Finally, in Part 3, you'll be using modern machine learning libraries for creating actual deep learning models. It also introduces more advanced deep learning techniques. + +The book [Programming Machine Learning](https://amzn.to/3mZh9cV) (_affiliate link_) is true to its title. That is, it explains machine learning in plain English and with a lot of pictures to support the content. While it can be a bit high-level for people who already have a background in machine learning, and while it can be a bit non-technical for those who really want to be fully introduced to a library like Scikit or TensorFlow, it's a good book if you are a (Python) developer who wants to start learning ML. It is like a stepping stone to a more advanced, in-depth machine learning book. Good buy. + +### [15\. Machine Learning with Python: An introduction to Data Science with useful concepts and examples, step by step, learning to use Python, by William Gray](https://amzn.to/3jeiEBJ) + + + +_If you want to grasp the basics of machine learning, and start with Python based ML development._ + +**Author:** William Gray +**Publishing date:** July 26, 2019 +**Price/quality:** 🟡 Acceptable +**What it covers:** + +- The fundamental concepts and applications of machine learning +- What kind of machine learning algorithms there are, as well as other branches of Artificial Intelligence +- Python basics +- Machine Learning in Python, including case studies +- Key frameworks, open source libraries, including the creation of a machine learning model and learning how to deploy it in a web application. + +**My impression:** + +The book [Machine Learning with Python](https://amzn.to/3jeiEBJ) (_affiliate link_) is in fact two books in one - a book about the concepts of Machine Learning, while another focuses on key frameworks and real Python applications. Let's briefly focus on the contents of the two books first, and then proceed with our impression. + +The first book first starts with an introduction to machine learning. Why use AI? What is its purpose, and what are research goals? Then: what is machine learning, and how does it differ from AI? Those are the questions with which this book starts. Subsequently, fundamental concepts of Machine Learning are discussed - that is, methods, applications, deep learning and deep neural networks. It also attempts to cover some applications such as Siri, Cortana, use at Paypal and Uber, and Google Translate. Chapter 3 then covers all use of AI around us. + +Chapter 4 moves on with explaining the various categories of ML algorithms. It covers supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning, including use cases for the application of the latter category. Chapters 5 to 7 work towards an introduction to Python. It attempts to brush up your math skills to start with Python maths libraries, teaches you Python syntax, and introduces you to Numpy, Pandas and Matplotlib - the main data analysis libraries. Chapter 8, 9 and 10 cover machine learning applications, and how Python has evolved. 
Chapters 11, 12 and finally 13 cover challenges of ML in big data, general AI, and whether third world countries can learn AI.

The second book starts again with introducing Python for machine learning. Chapter 2 then moves on to understanding key ML frameworks, starting with the right questions, and whether you are best served by deep learning techniques or more classic techniques. It also looks forward to model deployment and model optimization.

Chapter 3 introduces most of the modern frameworks used in Machine Learning such as TensorFlow, Keras, PyTorch and so on. Once again, ML approaches - i.e. supervised and unsupervised learning - are introduced in Chapter 4. Chapter 5 introduces Scikit-learn based classification, while chapter 6 teaches you how to implement neural networks from scratch and then moves on to activation functions, overfitting, hyperparameter tuning and types of classification algorithms. Chapter 7 introduces TensorFlow, chapter 8 teaches you how to deploy ML models into a web application, and chapter 9 moves on to the future of ML and how to attain a competitive advantage with Machine Learning.

Now that we have seen the contents of the book - or rather the two books-in-a-book - I'm a bit puzzled. If you have some experience with machine learning, you know that there is a logical path up the learning curve. If you then look at the table of contents of this book, it seems that the topics flow back and forth a bit - and sometimes, a logical structure seems to be missing.

This is also reflected in the reviews on [Amazon](https://amzn.to/3jeiEBJ) (_affiliate link_): "a lot of repetition to explain again what ML is", and "it looks like a series of blog articles". References are missing and I also spotted textual mistakes here and there. This is what I saw as well: topics are reintroduced time and again. Therefore, I'd like to stress that this is an acceptable book for beginners. It will definitely teach you something. However, if you expect absolute quality, this might not be the book for you.

* * *

## Books about Machine Learning concepts

### [1\. Machine Learning Pocket Reference: Working with Structured Data in Python, by Matt Harrison](https://amzn.to/305ucj3)

_If you want to start with structured machine learning._

**Author:** Matt Harrison
**Publishing date:** August 27, 2019
**Price/quality:** 🟢 Really good
**What it covers:**

- Exploring your dataset, performing analyses to find out how to proceed
- Cleaning data and handling missing data
- Selecting features useful to your machine learning model
- Selecting a model class
- Building a classifier and a regression model
- Evaluating the model with classifier- and regression-specific metrics
- Unsupervised learning techniques such as Clustering and Dimensionality reduction
- Pipelines in Scikit-learn

**My impression:**

The book [Machine Learning Pocket Reference: Working with Structured Data in Python](https://amzn.to/305ucj3) (_affiliate link_) by Matt Harrison has 19 chapters and claims to help you navigate the basic waters of machine learning. That is, if you are a beginner, and have no idea - not even about some basic concepts - this book should be for you. Let's take a look.

As mentioned before, the book covers the process of training a machine learning model: exploring your dataset, to find starting points; cleaning and preparing your data; selecting features and a model; model training; and finally, model evaluation.
The chapters are structured in that same, rather sequential way. I think this is good, because it allows you to gain understanding of the concepts as well as link them together in a natural flow.

Now, with respect to the chapters: chapter 1 covers the technical basics. That is, Harrison discusses the libraries that are used throughout the book. The list is quite extensive: it includes `autosklearn`, `sklearn`, `fastai`, `xgboost`, the basic stack of `numpy`, `scipy` and `matplotlib`, and many others. It also covers how to install them with `pip` or `conda`, creating an environment specifically for the book so that your work does not interfere with other projects on your host machine.

In my point of view, Chapter 2 then comes at precisely the right time. If you want to begin studying a topic, I think it's very important to perform what I call castle building. The castle represents your knowledge, and by studying a variety of topics you can both build your castle and make it uniquely targeted at a particular job or skillset. This helps you study for understanding rather than recollection. Elon Musk attempts to connect the dots when he reads, and I too think that this is very important. But where to start? Often, a holistic standpoint is good in those cases - a starting point where you look at the entire field from a high level. It allows you to see the dots, which you can then start connecting. Chapter 2 of this book provides precisely that holistic point of view: it presents - at a high level - the process of creating structured machine learning models. It starts with "asking a question" and eventually moves to "deploy model". From my perspective, and the way I learn, I think that's a great way of teaching.

Chapter 3 then introduces you to a variety of topics, such as imports, what happens when asking a question, importing data, creating the model, and optimizing it. It introduces all the different topics in a bit more depth. This is expanded in the subsequent chapters. Chapter 4 covers missing data: what to do when data is missing, when to drop it, and so on. Chapter 5 covers cleaning your dataset to make it more suitable for machine learning. Chapter 6 subsequently covers data exploration: through statistics such as size and summary stats, and plots such as histograms, scatter plots and box plots, it's possible to draw conclusions about your datasets. For example, they allow you to discover whether your dataset is large enough for a particular model.

This is further expanded upon in chapters 7 and 8, which specifically target data preprocessing and feature selection. Chapter 9 subsequently covers what to do when you have an imbalanced dataset: this happens in scenarios where a large group of samples belongs to one class, while the other class (or classes) have significantly fewer samples. As models would naturally favor the large class, this could lead to a biased model. The book covers methods for handling those cases. This is where you move from data preparation to model building.

Chapter 10 introduces you to a variety of methods and techniques for classification. For example, it covers logistic regression, Naive Bayes, SVMs and decision trees - the older methods - as well as relatively newer ones, such as XGBoost and other boosting techniques. It does not cover deep neural networks in depth. Chapter 11 moves forward by answering the question: given this large set of possible models, how do I select the right one?
That's also one of the important questions to answer. Chapter 12 subsequently moves to metrics and evaluation, which you'll have to perform after training, and introduces a variety of metrics that can be used for this. Chapter 13 then explains how to add model explainability - that is, how to answer the question of why your model predicted what it predicted. Chapters 14, 15 and 16 do the same as chapters 10, 11 and 12, but for regression models. The book then moves to unsupervised learning techniques in chapters 17 and 18.

Finally, in chapter 19, it covers the creation of pipelines using Scikit-learn. It does so for [classification](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/), [regression](https://www.machinecurve.com/index.php/2020/11/17/how-to-perform-multioutput-regression-with-svms-in-python/) and [PCA](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/).
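To give a rough idea of what such a Scikit-learn pipeline looks like - a minimal, generic sketch on a toy dataset, not code taken from the book - consider a pipeline that chains scaling, PCA and a classifier:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Toy dataset standing in for your own structured data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Chain preprocessing, dimensionality reduction and a classifier into one estimator.
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('clf', LogisticRegression()),
])
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
```

The nice thing about pipelines is that the whole chain behaves like a single model: it can be fit, evaluated and cross-validated as one object.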
In sum, I think that this is a very useful book that is packed with useful code. It should help you get that a-ha moment back when you're confused during a ML project. What it is not - and the book's introduction says as much - is an extensive guide. Rather, it's a "companion" - a nice analogy found in one of the reviews on [Amazon](https://amzn.to/305ucj3) (_affiliate link_). If you're a beginner, wishing to grasp the concepts in more detail but in a relatively "pocket" version, this could be the book for you. Definitely recommended.

* * *

## Books about Keras and TensorFlow

### [1\. Deep Learning with Python, by François Chollet](https://www.amazon.com/gp/product/1617294438/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1617294438&linkId=fc1751c9b5d563c5a593b744c5f6ae4c)

_If you have a background in Python and want to get started with Deep Learning._

**Author:** François Chollet
**Publishing date:** 2017
**Price/quality:** 🟡🟢 Acceptable to good
**What it covers:**

- An introduction to deep learning. What is it? How does it differ from previous approaches?
- Mathematical building blocks of neural networks, in an accessible way.
- Your first neural network with Keras (densely-connected neural network).
- Convolutional neural networks for image classification.
- Recurrent neural networks for text classification.
- Best practices for deep learning.
- An introduction to generative deep learning.

**My impression:**

This is the book which I used to start learning about deep learning. I come from a software background and have always loathed the maths-intensive books... it felt as if I couldn't start properly when reading those books. François Chollet's [Deep Learning with Python](https://www.amazon.com/gp/product/1617294438/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1617294438&linkId=fc1751c9b5d563c5a593b744c5f6ae4c) (_affiliate link_) provided the precise balance between rigor and accessibility. Rather than explaining things through a lens of mathematics, Chollet - who is one of the key players in the field of Deep Learning - utilizes a programming lens instead.

In the book, Chollet first introduces deep learning. What is it? How is it different from previous machine learning techniques like Support Vector Machines? He then proceeds with _some_ mathematics, as at least a basic understanding and appreciation of maths is necessary in order to get started with neural networks. Those chapters cover the fundamentals of deep learning. The first part also includes basic neural networks with Python code, utilizing the Keras framework - of which he is the creator.

In the second part of the book, Chollet introduces a variety of practical applications of deep learning. This part builds on top of the first and can be considered the "advanced" part of the book. He introduces ConvNets for computer vision, covers text-based classifiers, provides best practices for deep learning and covers generative deep learning models. This wraps up the Deep Learning with Python book.

Back in 2018, when I started with deep learning, this was a great book. And to be honest, it still is. Especially for beginners, it can be a great book to understand neural networks conceptually. However, the deep learning landscape has changed significantly over the past few years. The biggest change impacting this book: the deep integration of Keras with TensorFlow since TensorFlow 2.0. Where Chollet utilizes the standalone Keras version `keras` throughout the book, it's best practice to use `tensorflow.keras` these days. While this is often not problematic (as a simple replacement does wonders - see the short sketch below), some parts of the framework have moved to different locations, which poses the risk that parts of your code might no longer work properly. This means that you might need to do some Googling around.

If you're a fan of Chollet and his style, go for the [book](https://www.amazon.com/gp/product/1617294438/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1617294438&linkId=fc1751c9b5d563c5a593b744c5f6ae4c) (_affiliate link_). If not, or if you want to ensure that you buy a book that is more up to date, this could perhaps not be the book for you. Nevertheless, it's one of my all-time deep learning favorites... it's the book I started things with :)
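As a rough illustration of that replacement, here is a minimal sketch (my own, not taken from the book) of the old standalone-Keras imports next to their `tensorflow.keras` equivalents:

```python
# Standalone Keras, as used throughout the book:
# from keras.models import Sequential
# from keras.layers import Dense

# Keras bundled with TensorFlow 2.x - usually a drop-in replacement:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
```

In many cases this import swap is all that is needed, but as noted above, some utilities have moved to different modules, so some code from the book may still require small adjustments.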
### [2\. Deep Learning with R, by François Chollet & J.J. Allaire](https://www.amazon.com/gp/product/161729554X/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=161729554X&linkId=210473045a82710dea4481631e04b7aa)

_If you have a background in R and want to start with Deep Learning._

**Author:** François Chollet and J.J. Allaire
**Publishing date:** February 9, 2018
**Price/quality:** 🟡🟢 Acceptable to good
**What it covers:**

- An introduction to deep learning. What is it? How does it differ from previous approaches?
- Mathematical building blocks of neural networks, in an accessible way.
- Your first neural network with Keras (densely-connected neural network).
- Convolutional neural networks for image classification.
- Recurrent neural networks for text classification.
- Best practices for deep learning.
- An introduction to generative deep learning.

**My impression:**

The original Keras framework was targeted at many backends - TensorFlow, Theano and CNTK for Python, to name a few. But it also runs on R. The book [Deep Learning with R](https://www.amazon.com/gp/product/161729554X/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=161729554X&linkId=210473045a82710dea4481631e04b7aa) (_affiliate link_) by François Chollet and J.J. Allaire allows you to take your first steps in developing neural networks if you have a background in R rather than Python.

It was published a few months after the Python version, Deep Learning with Python. The table of contents is the same - which means that it is practically the same book, but filled with R examples rather than Python ones. Like the Python variant, it covers multiple things. Firstly, it introduces deep learning. What is it and how is it different from previous approaches? Then, it continues by introducing some mathematical building blocks of neural networks. Don't worry, the maths aren't heavy.

This is followed by your first baby steps in building a neural network. Using a few Dense layers of the Keras framework, you'll build your first classifier. This is followed by more practical and relatively state-of-the-art types of layers, such as Convolutional layers - and hence ConvNets for image classification - and Recurrent layers - and hence Recurrent Neural Networks for text classification. The authors then finish with deep learning best practices and an introduction to generative deep learning.

As with the Python book, I think it's a great book for those who wish to understand the concepts behind deep learning but have a background in R. However, here too, you must be careful about the fact that the book is a bit older... and the Keras landscape has changed significantly over the last few years. In fact, TensorFlow 2.x is now the lingua franca in terms of backends used with Keras. This means that 'old' Keras still supports R, but that it's no longer the main focus. That's why I'd suggest switching to the Python version instead, or even trying a few different Python-related machine learning books which are more up to date. But conceptually, this is still a great [book](https://www.amazon.com/gp/product/161729554X/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=161729554X&linkId=210473045a82710dea4481631e04b7aa) (_affiliate link_).

### [3\. Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2, by Sebastian Raschka](https://www.amazon.com/gp/product/B07VBLX2W7/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07VBLX2W7&linkId=c61c5f59e9008c545232a2f89fdaa1eb)

_If you want to get a broad introduction to Machine Learning and Deep Learning, followed by Python examples with the Scikit-learn and TensorFlow 2.0 frameworks._

**Author:** Sebastian Raschka
**Publishing date:** December 9, 2019
**Price/quality:** 🟢 Really good
**What it covers:**

- Learn to use the Scikit-learn and TensorFlow frameworks for machine learning and deep learning.
- Study across a wide range of applications, such as image classification, sentiment analysis and more.
- Select and build a wide range of model types (neural networks and classic models) including best practices for evaluating and tuning them.

**My impression:**

The name Sebastian Raschka already makes me think positively about this book. When working with his [Mlxtend toolkit for visualizing classifier decision boundaries](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/), Raschka was quick to respond to issues filed at his GitHub repository, and responded positively to my request for using his toolkit in my blog posts. Still, that does not provide an objective review of his book. Let's take a look at what's inside.
+ +The book starts with an introduction about machine learning, looking at what it is. It introduces the three main branches of ML - being supervised learning, unsupervised learning and reinforcement learning, and follows up with basic terminology and an introduction to the basic machine learning workflow. He also makes a case for using Python in the book, being one of the important languages for data science and machine learning. + +He then proceeds by explaining the most salient machine learning models there are, and how they work. It starts with the Rosenblatt Perceptron, a very old type of artificial neural network, and explains how it is optimized i.e. by means of the Perceptron Learning Rule. Next, he covers a wide range of traditional ML approaches: logistic regression, Support Vector Machines (for linear learning), kernel SVMs (for nonlinear learning), decision trees and kNN. This way, Raschka makes sure that you both appreciate _and_ understand salient machine learning algorithms before deep learning was hot. + +Part of training a machine learning model is data preprocessing, i.e. handling missing data, selecting features for your model, and so on. This is especially true for classic ML models. That's why Raschka, before introducing neural networks, proceeds with a wide range of important topics first. He covers preprocessing, dimensionality reduction (especially important with traditional ML algorithms), as well as best practices for model evaluation and hyperparameter tuning. Ensemble learning is also covered, i.e. how multiple models can be combined to generate one prediction, possibly improving predictive power on the go. + +Once you finish this part of the book, you'll work on various applications. Throughout a wide range of application areas (sentiment analysis, computer vision, agents in complex environments, image synthesis) as well as ML types (neural networks, unsupervised approaches, ConvNets, GANs and recurrent neural networks), Raschka covers recent developments thoroughly. This includes building models from scratch, to understand them conceptually, as well as recreating what you did using modern machine learning frameworks such as Scikit-learn and TensorFlow (including Keras). Reinforcement learning is covered separately, and includes its theoretical foundations, important algorithms, and a first implementation using the OpenAI Gym Toolkit. + +If you have some experience writing Python code and want to start with machine learning, [Python Machine Learning](https://www.amazon.com/gp/product/B07VBLX2W7/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07VBLX2W7&linkId=c61c5f59e9008c545232a2f89fdaa1eb)(_affiliate link_) is a really good book. It's up to date (as it was released relatively recently), it covers state-of-the-art frameworks and toolkits, but it also doesn't fail to explain the concepts, best practices _and_ the history of machine learning. Thus, rather than ending up having a good understanding of one particular type of ML, Raschka's book introduces you to the full breadth of ML and invites you to specialize further. Recommended! + +### [4\. 
TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers, by Pete Warden & Daniel Situnayake](https://www.amazon.com/gp/product/1492052043/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1492052043&linkId=d114387271335768598136f3b51f1c81) + + + +_If you want to understand pragmatically what needs to be done to make Edge AI possible with TensorFlow._ + +**Author:** Pete Warden and Daniel Situnayake +**Publishing date:** December 16, 2019 +**Price/quality:** 🟢 Really good +**What it covers:** + +- Various applications: speech recognition, image recognition, gesture responder +- Real embedded hardware: deploy ML models on Arduino and low-power microcontrollers +- ML background: understand basic concepts of machine learning and training models based on various data sources +- TensorFlow Lite for Microcontrollers, which helps you make tiny Machine Learning models +- Ensure privacy and security by design +- Optimize for latency or energy usage and understand how this impacts your model + +**My impression:** + +If you take a close look at trends in machine learning, you will find that models are required to be smaller and smaller. The reason is simple: they must run in an embedded way, _at the edge_, for the reason that people no longer want to run them in cloud environments but _directly in the field_. + +This is pretty problematic, as machine learning models - and especially today's deep neural networks - have a large amount of parameters and are thus often way too large for running on microcontrollers such as Arduinos and other low-power environments. What to do about this is what is covered in the book [TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers](https://www.amazon.com/gp/product/1492052043/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1492052043&linkId=d114387271335768598136f3b51f1c81) (_affiliate link_). If you're looking to learn about Edge AI and really learn to _apply_ it, this could be a very interesting book for you. + +As with any book about machine learning, it covers its basic concepts first. You'll learn what ML is as well as the workflow used in Deep Learning projects. This is followed by an introduction of the microcontrollers used by the authors. However, don't expect a big introduction about ML and microcontrollers in this book: it's about the synthesis of both and the authors therefore expect that you have gained basic knowledge elsewhere, which makes sense to me. + +The book then proceeds by showing how you can build and deploy models for word detection, person detection and gesture detection. This includes exporting the model to TensorFlow Lite, which allows you to convert your model into one that can run on low-power environments; also converting it into C for running on Arduinos is covered. + +Following the applications, TensorFlow Lite for Microcontrollers is introduced. This first starts with a hierarchy between TensorFlow, TensorFlow Lite and the Microcontrollers edition, and covers many of the microcontroller related aspects for deploying machine learning models. Finally, the authors cover designing your own applications with best practices, and show you what optimization of your ML model looks like. + +This is a pretty recent book about a topic within Machine Learning that in my opinion will gain a lot in popularity in the years to come. 
Deploying AI models in the field will be increasingly important, and you'll be able to set yourself apart if you know a lot about this niche. [Book reviews](https://www.amazon.com/gp/product/1492052043/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1492052043&linkId=d114387271335768598136f3b51f1c81) (_affiliate link_) are very positive. Recommended! + +### [5\. Deep Learning with TensorFlow 2 and Keras: Regression, ConvNets, GANs, RNNs, NLP, and more with TensorFlow 2 and the Keras API, by Antonio Gulli, Amita Kapoor & Sujit Pal](https://www.amazon.com/gp/product/B082MBMFVF/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B082MBMFVF&linkId=865dba0c6592875d015657e0b15f4bcc) + + + +_If you want to learn more about the TensorFlow 2.0 and Keras APIs, and get some real coding experience._ + +**Author:** Antonio Gulli, Amita Kapoor and Sujit Pal +**Publishing date:** December 20, 2019 +**Price/quality:** 🟢 Good +**What it covers:** + +- Learn creating deep neural networks with TensorFlow 2.0 and Keras +- A wide range of application areas for TF 2.0 and Keras deep neural networks +- Many code examples + +**My impression:** + +If you already know what deep learning is all about, and want to get some practical experience, this book could be for you. The second edition of [Deep Learning with TensorFlow 2 and Keras](https://www.amazon.com/gp/product/B082MBMFVF/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B082MBMFVF&linkId=865dba0c6592875d015657e0b15f4bcc) (_affiliate link_) entirely focuses on TensorFlow 2.0 and the Keras API, gives a lot of code examples, and touches all the important concepts within deep learning from a developer perspective. + +However, it seems that it _does_ require that you already know what stuff is about at a conceptual level. For example, with the Perceptron algorithm, it simply introduces the way it activates, what it does (construct a hyperplane between samples) and how it is optimized. That's it. For other details, it assumes that you know to find your way to Google. This is the same for all other topics covered in the book. + +However, as I wrote, if you're looking for a book filled with code examples - this one is it. It introduces TensorFlow, Keras and what things have changed in TensorFlow 2.0, followed by the Perceptron, Multilayer Perceptron, and examples with Keras (including all the detailed tuning options such as adding Batch Normalization to your architecture, adding other forms of regularization, choosing an optimizer, and so on). + +After the introduction, and a detailed comparison between TensorFlow 1.0 and 2.0, it proceeds with application areas. First, you'll find a chapter about regression, followed by classification using Convolutional Neural Networks - i.e. computer vision. In both chapters, the typical problems (such as linear regression or image classification) are discussed, as well as how this impacts architectural choices in your quest to find a well-performing machine learning model. + +Subsequently, more advanced applications of ConvNets are covered, as well as GANs for generative deep learning. This is followed by word embeddings, which can be used in natural language processing, by recurrent neural networks, autoencoders, unsupervised learning and reinforcement learning. That's pretty much it in terms of what can be done with deep learning these days. 
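To give a feel for the kind of Keras code such a book revolves around, here is a small sketch of my own (not taken from the book) of a densely-connected classifier that uses some of the tuning options mentioned above - Batch Normalization, Dropout as regularization, and an explicitly chosen optimizer:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.optimizers import Adam

# A small densely-connected classifier for 784-feature inputs and 10 classes.
model = Sequential([
    Dense(128, activation='relu', input_shape=(784,)),
    BatchNormalization(),   # normalize intermediate activations
    Dropout(0.3),           # regularization to reduce overfitting
    Dense(64, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer=Adam(learning_rate=1e-3),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```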
+ +Once those details are covered, the book moves forward by looking at how TensorFlow can be deployed in cloud instances so that you can perform learning with e.g. a powerful GPU. Then, running TensorFlow models on mobile/IoT and in the web browser are covered, followed by AutoML - which includes automated hyperparameter tuning. Towards the end, the book covers the maths behind deep learning, and TPUs - pieces of hardware that can accelerate your training process. + +Coming back to the original statement about this book: if you already have some experience with machine learning, this can be a great book. However, if you're an absolute beginner, it may be wise to look at a beginners book above first - for the simple purpose that if you understand what is going on, you'll flow through this book more easily. For people who already have some experience under their belt, [Deep Learning with TensorFlow 2 and Keras](https://www.amazon.com/gp/product/B082MBMFVF/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B082MBMFVF&linkId=865dba0c6592875d015657e0b15f4bcc) (_affiliate link_) is definitely recommended. + +### [6\. Practical Deep Learning for Cloud, Mobile, and Edge: Real-World AI & Computer-Vision Projects Using Python, Keras & TensorFlow, by Anirudh Koul, Siddha Ganju and Meher Kasam](https://amzn.to/372fYUr) + + + +_If you want to learn using TensorFlow, Keras and TensorFlow lite focused on Edge-based Computer Vision._ + +**Author:** Anirudh Koul, Siddha Ganju, and Meher Kasam +**Publishing date:** October 14, 2019 +**Price/quality:** 🟢 Really good +**What it covers:** + +- Using Keras, TensorFlow, Core ML, and TensorFlow Lite +- Putting the focus on 'lite' models, running them on Raspberry Pi, Jetson Nano and Google Coral, as well as in the web browser +- A bit of reinforcement learning and transfer learning +- Case studies and practical tips + +**My impression:** + +The book [Practical Deep Learning for Cloud, Mobile, and Edge: Real-World AI & Computer-Vision Projects Using Python, Keras & TensorFlow](https://amzn.to/372fYUr) (_affiliate link_) by Anirudh Koul, Siddha Ganju and Meher Kasam is an interesting book for the Machine Learning practitioner. Contrary to many books about TensorFlow, Keras and other libraries, it does not aim to provide a general introduction to the libraries. + +Rather, it aims to capture a trend which I believe will be immensely important in the years to come with respect to generating new model predictions, or inference: moving Deep Learning models away from cloud-based environments and into the field, where they run on embedded devices. As we shall see, the authors aim to capture this trend by looking at a variety of use cases and deployment scenarios, and providing the tools from the TensorFlow arsenal that could make this work for you. + +The book starts by exploring the landscape of Artificial Intelligence in chapter 1. This always makes me happy, because I am a fan of books that provide the reader with necessary context in order to understand what is going on. In the chapter, the authors discuss what AI is (and why precisely that is a difficult question), a brief history from AI through the hypes and AI winters, and introduce Deep Learning as one of the most recent trends within the field of AI. It also gives critical succes factors for a well-performing Deep Learning model, and hints towards responsible utilization of AI - also an increasingly important topic for the next years. 
+ +As I said, the book has a focus on Edge AI cases. Computer vision problems are very prominent in this area - as models rely on progress in this branch of Deep Learning if they want to see what is going on in the field. In chapter 2, the authors introduce Image Classification with Keras, by means of the ImageNet competition and Model Zoos. This is followed by Transfer Learning in chapter 3, where building a classifier is extended in multiple ways: using a pretrained model for getting better results; organizing the data; data augmentation; and finally training and testing the model. By focusing on computer vision problems, you'll be introduced to the Keras APIs that you need. + +Chapter 4 will teach you to build a Reverse Image Search Engine - by means of Feature Embeddings. Through t-SNE and [PCA](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/), as well as some other techniques, you'll learn to build a tool for image similarity. This is followed by Chapter 5, which focuses on Convolutional Neural Networks and all their related components. It introduces TensorBoard for showing realtime training progress, breaking up your data into training, testing and validation data, early stopping and other stuff. What I'm mostly happy about are the two final components of this chapter: how hyperparameters affect accuracy (with a discussion on things like batch size, [optimizers](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), learning rate, and so on), and tools for automating ML: [Keras Tuner](https://www.machinecurve.com/index.php/2020/06/09/automating-neural-network-configuration-with-keras-tuner/), AutoAugment and AutoKeras. Really great - and this makes this book especially future proof! + +If you've been familiar with Deep Learning for some time, you know that it's often necessary to have big GPUs if you want to train your model. Chapter 6 helps you manage the GPUs you're using by teaching how to maximize speed and performance of TensorFlow, i.e. how to squeeze every bit of power out of your graphics card that you could possibly do. Chapter 7 extends this by providing practical tips, and Chapter 8 teaches you how to use Cloud APIs for Computer Vision Problems. + +When you want to deploy your Machine Learning model, it's important that you do so professionally. In chapter 9, the authors introduce how to scale inference and how to deploy your model in a good way by means of TensorFlow Serving and KubeFlow. Doing so, the authors describe a set of desirable qualities for production machine learning scenarios (think availability, scalability, latency, failure handling, monitoring, and so on), and teach you how to deploy models by means of Google Cloud Platform, TensorFlow Serving and KubeFlow. Great stuff! + +The next chapters start zooming in to specific usage scenarios of your Deep Learning model. If you want to run your model in a web browser, to give just one example, that is entirely possible with TensorFlow.js. Chapter 10 focuses entirely on this matter. This is followed by Chapter 11, which shows how to convert your TensorFlow and Keras models into CoreML models, which allows you to run them on an iOS device. Chapter 12 extends this topic and Chapter 13 teaches you how to run TF models on Android, extended in Chapter 14 on the TensorFlow Object Detection API. 
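Since running models on mobile and embedded devices keeps coming back in these chapters, it may help to see what the conversion step looks like in general. Below is a minimal, generic TensorFlow Lite sketch of my own (not code from the book; `my_model.h5` is just a placeholder file name for a trained Keras model):

```python
import tensorflow as tf

# Load a previously trained Keras model and convert it to the TensorFlow Lite format,
# which is suitable for mobile and embedded deployment.
model = tf.keras.models.load_model('my_model.h5')

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional post-training optimization
tflite_model = converter.convert()

# Write the converted model to disk; this .tflite file is what ships with the app.
with open('my_model.tflite', 'wb') as f:
    f.write(tflite_model)
```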
+ +If you truly want to run your model in the field, it's likely that you'll be using a piece of embedded hardware for doing so, like a Raspberry Pi or a FPGA board or an Arduino or an NVIDIA Jetson Nano. Chapter 15 compares those devices and gives you a hands-on example of running your model on an embedded device. The last two chapters, Chapter 16 and 17, move towards building a Self-driving Car, eventually providing a brief introduction to Reinforcement Learning. + +Having studied this [book](https://amzn.to/372fYUr) (_affiliate link_) for a while, I can only argue that this is one of the best books that your money can buy at this point in time. It's good, because it introduces today's state-of-the-art Deep Learning libraries, and I think it's also future proof, because it covers three topics (automation, scaling & cloud based training and Edge AI) which in my point of view will be the most important ones in the years to come. Definitely a buy for me! + +* * * + +## Books about PyTorch + +### [1\. Programming PyTorch for Deep Learning: Creating and Deploying Deep Learning Applications, by Ian Pointer](https://www.amazon.com/gp/product/B07Y6181J5/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07Y6181J5&linkId=5e44172e378d714d1919b4e16a412c60) + + + +_If_ _you want to get to know the PyTorch API in a learning by doing fashion._ + +**Author:** Ian Pointer +**Publishing date:** September 20, 2019 +**Price/quality:** 🟢 Really good +**What it covers:** + +- Deploying deep learning models into production +- Find out how PyTorch is used in various companies +- Learning how to create deep learning models with PyTorch + +**My impression:** + +The book starts with an introduction to deep learning and PyTorch. What machine do you need? What do you do when you want to train your model in the cloud? How to install PyTorch? Those are the questions that are covered before the actual work starts. + +One of the main fields that was accelerated by deep learning is computer vision. That's why it's not surprising that this book starts off with computer vision, and specifically image classification, for you to start writing neural networks with PyTorch. It covers the dataset that you will be using, how to load it with PyTorch, how to build your training set and what the point is of testing and validation datasets. Subsequently, you'll create the neural network - train it - make predictions - and save the model. The book thus offers you a full iteration through the machine learning training workflow with PyTorch. + +Once you have set your first steps, the book continues with more advanced topics related to ConvNets (which are used for computer vision problems). You'll cover the conceptual stuff, a history of ConvNets, and pretrained models. This is followed by an entire chapter about Transfer Learning, which allows you to reuse models and train them for your problem at hand. + +The other field that was massively accelerated by deep learning is natural language processing. It's neither surprising that after ConvNets, recurrent neural networks are introduced for text classification. The book introduces Torchtext for doing so, and covers augmenting your dataset. The chapter about text classification is followed by sound classification, debugging PyTorch models, using models in production (Python Flask web service, Kubernetes deployment, TorchScript and libTorch) and more advanced topics (such as GANs). 
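For readers who have never seen that workflow in PyTorch, a minimal, generic sketch (my own, with dummy data standing in for a real image dataset - not code from the book) looks roughly like this:

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for a real image dataset: 256 samples with 784 features, 10 classes.
inputs = torch.randn(256, 784)
labels = torch.randint(0, 10, (256,))
train_loader = DataLoader(TensorDataset(inputs, labels), batch_size=32, shuffle=True)

# A small classifier, a loss function and an optimizer.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# The training loop: forward pass, loss, backpropagation, weight update.
for epoch in range(5):
    for batch_inputs, batch_labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_inputs), batch_labels)
        loss.backward()
        optimizer.step()

# Make a prediction and save the trained weights.
prediction = model(torch.randn(1, 784)).argmax(dim=1)
torch.save(model.state_dict(), 'classifier.pt')
```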
+
+Overall, this is a good book that is much like François Chollet's Deep Learning with Python: it introduces you to deep learning with a focus on one specific framework; in this case, PyTorch. Amazon reviews suggest that the first chapters in particular are really good, but that the latter ones are a bit more difficult to get through. This makes sense, but still, it's a good book if you specifically want to learn PyTorch.
+
+### [2\. PyTorch Computer Vision Cookbook: Over 70 recipes to master the art of computer vision with deep learning and PyTorch 1.x, by Michael Avendi](https://www.amazon.com/gp/product/B0862CX2ZL/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B0862CX2ZL&linkId=ffea6ad19e1b5f47449cd8946e249b7a)
+
+
+
+_If_ _you're looking for a book that teaches you PyTorch for computer vision._
+
+**Author:** Michael Avendi
+**Publishing date:** March 20, 2020
+**Price/quality:** 🟢 Really good
+**What it covers:**
+
+- Developing, training and finetuning deep learning models with PyTorch.
+- Focusing specifically on computer vision tasks such as classification, object detection and object segmentation.
+- Learning advanced applications of computer vision such as neural style transfer, image generation and video classification.
+- Discovering best practices for applying deep learning to computer vision problems.
+
+**My impression:**
+
+This is a new book about using PyTorch for computer vision problems. As you most likely know, the field of computer vision is one of those most accelerated by the advent of deep learning since 2012. Today's deep learning frameworks therefore contain a lot of functionality specifically tailored to such models. In [PyTorch Computer Vision Cookbook](https://www.amazon.com/gp/product/B0862CX2ZL/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B0862CX2ZL&linkId=ffea6ad19e1b5f47449cd8946e249b7a) (_affiliate link_), Michael Avendi covers the breadth of applications of PyTorch to computer vision.
+
+The first chapter covers getting started with PyTorch for deep learning. It lists the technical requirements for running PyTorch on your system, provides instructions for installing the tools you need, and introduces you to some general PyTorch concepts - such as the nn.Sequential and nn.Module APIs, running the model on your GPU with CUDA, and saving/loading models.
+
+Subsequently, Avendi proceeds by writing about binary image classification. That is, an image is assigned one of two classes. This is followed by a multiclass classification problem, where the image is assigned to one of multiple classes instead. In both chapters, the book follows a relatively standard workflow: exploring the dataset, splitting it into training/testing data, transforming it, building the classifier, performing hyperparameter tuning, training and evaluation, and deployment and inference. This covers almost the entire deep learning model lifecycle.
+
+After classification, the book covers object detection. It covers both single-object detection as well as multi-object detection. The chapters about object detection follow a similar structure to the ones about image classification. That's unsurprising, given how much those approaches resemble each other. And once you're up to speed on detection, the book covers an even more fine-grained approach - object segmentation.
Rather than detecting an object and putting a bounding box around it, segmentation classifies each pixel into a class, allowing you to create models that really segment objects.
+
+More advanced topics follow next. First, Avendi covers Neural Style Transfer - an application area of deep learning where two images are blended to create a new one. It is unsurprising that Generative Adversarial Networks are covered next, which are used for generating new image data. Finally, the book covers video processing with PyTorch. Here too, all the chapters are focused on getting things done.
+
+The [PyTorch Computer Vision Cookbook](https://www.amazon.com/gp/product/B0862CX2ZL/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B0862CX2ZL&linkId=ffea6ad19e1b5f47449cd8946e249b7a) (_affiliate link_) is therefore a highly practical book for those who wish to learn programming in PyTorch for computer vision. It provides many code examples, and covers the full breadth of computer vision application areas. Interesting book.
+
+### [3\. PyTorch 1.x Reinforcement Learning Cookbook: Over 60 recipes to design, develop, and deploy self-learning AI models using Python, by Yuxi (Hayden) Liu](https://www.amazon.com/gp/product/B07YZ9GZ7J/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07YZ9GZ7J&linkId=84fb398dcb496512f6b978af6db7e410)
+
+
+
+_If_ _you're looking for a book that teaches PyTorch for reinforcement learning._
+
+**Author:** Yuxi (Hayden) Liu
+**Publishing date:** October 31, 2019
+**Price/quality:** 🟢 Good
+**What it covers:**
+
+- Learning about Reinforcement Learning algorithms
+- Applying Reinforcement Learning tools to simulate environments your agents operate in
+- Using PyTorch to build the models
+
+**My impression:**
+
+In the news, you hear a lot about machine learning - but often you'll hear about just one branch of ML: supervised learning. Image classification, regression, object detection - those applications all require that a model is trained on a dataset before it can be used in practice.
+
+Sometimes, however, this simply cannot be done. This could either be the case because the environment is too complex or too unstable, or because you don't have enough data to make a supervised approach worthwhile. Reinforcement Learning could then provide a viable path. With RL, which is an emerging theme in machine learning research, you effectively have an intelligent agent which you train by rewarding good behavior and punishing poor behavior.
+
+The book [PyTorch 1.x Reinforcement Learning](https://www.amazon.com/gp/product/B07YZ9GZ7J/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07YZ9GZ7J&linkId=84fb398dcb496512f6b978af6db7e410) (_affiliate link_) allows machine learning engineers to find quick solutions to a variety of Reinforcement Learning scenarios. It starts with the tools that you'll need to get started with RL: your working environment, OpenAI Gym, Atari environments, CartPole, and developing algorithms with PyTorch. It then covers a variety of Reinforcement Learning techniques in subsequent chapters:
+
+- Markov Decision Processes and Dynamic Programming
+- Monte Carlo methods
+- Temporal Difference and Q-Learning, including SARSA
+- Multi-armed Bandit Problems
+- Scaling up your RL approach
+- Deep Q-Networks
+- Policy Gradients
+
+It's not a book for beginners - in the sense that if you have no prior experience with machine learning, you will find the book really difficult.
If, however, you are aware of what RL is, and want to gain both detailed insight into the breadth of RL approaches and real practical experience, [PyTorch 1.x Reinforcement Learning](https://www.amazon.com/gp/product/B07YZ9GZ7J/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07YZ9GZ7J&linkId=84fb398dcb496512f6b978af6db7e410) (_affiliate link_) could be a really good extension of your current knowledge.
+
+### [4\. Deep Learning with PyTorch 1.x: Implement deep learning techniques and neural network architecture variants using Python, by Laura Mitchell, Sri. Yogesh K. & Vishnu Subramanian](https://www.amazon.com/gp/product/B07TB6SV6K/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07TB6SV6K&linkId=3241d6b8ed07bf7794e2f8e413962799)
+
+
+
+_If you're looking for a book that introduces you to Deep Learning concepts and allows you to write code in the process._
+
+**Author:** Laura Mitchell, Sri. Yogesh K. and Vishnu Subramanian
+**Publishing date:** November 29, 2019
+**Price/quality:** 🟢 Really good
+**What it covers:**
+
+- Learning to work with the PyTorch framework
+- Understanding how to run the training of your PyTorch models on GPUs
+- Using a wide range of model types - CNNs, RNNs, LSTMs, ResNet, DenseNet and Inception
+- Applying your knowledge to application areas such as computer vision and Natural Language Processing
+- Working with advanced neural networks such as GANs and Autoencoders, as well as Transfer Learning and Reinforcement Learning
+
+**My impression:**
+
+The book starts with an introduction to Deep Learning using PyTorch. It does so in a way that I often appreciate, namely by starting off with a dive into the history and origins of Artificial Intelligence and Machine Learning. I think that it is important for people to understand where things have come from… and how Deep Learning fits into this process. Subsequently, it covers application areas of Deep Learning, frameworks (specifically tailored to PyTorch) and setting up your work environment.
+
+It then dives into neural network building blocks, explaining what they are, how they can be built with the PyTorch framework (nn.Sequential and nn.Module), and how tensor operations work. Those two chapters prepare you for more advanced topics, which follow next.
+
+Neural networks are highly configurable - we all know that. That's why it's important to get a feeling for those internals too, and the book covers this by studying things like activation functions, which architecture to choose for which problem, loss functions, and how neural networks are optimized. In essence, this is the high-level supervised learning process that we also cover on this website.
+
+Next, it proceeds with the two default application areas for Deep Learning - Computer Vision and Natural Language Processing. Those application areas are the ones where advances have had the greatest impact, and the book will teach you to create real deep learning models for those application areas with PyTorch. For computer vision, this includes studying CV-specific DL aspects (such as convolutions), transfer learning, and visualizing the black box. For Natural Language Processing, this includes working with text data (tokenization, n-gram representation and vectorization), using word embeddings and why ConvNets (CV networks!) can also be applied here.
+
+Once you're up to speed on these two application areas, the book proceeds with Autoencoders and Generative Adversarial Networks - which open up a wide range of new applications such as generative machine learning (yes, generating new data indeed). Finally, the book covers Transfer Learning in more detail, introduces you to Deep Reinforcement Learning (including Q-learning and Policy methods) and finally covers what's next - i.e., areas that might gain popular traction in the years to come.
+
+Like François Chollet's Deep Learning with Python, which does the same for Keras, I think this is a good book to get started with Deep Learning. It introduces you to the concepts and does not assume that you already know them, and it specifically focuses on a framework, which allows you to get practical experience in the process. If you're looking for a more mathematical book, this one isn't for you, but if you want to learn from a developer perspective - [Deep Learning with PyTorch 1.x](https://www.amazon.com/gp/product/B07TB6SV6K/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07TB6SV6K&linkId=3241d6b8ed07bf7794e2f8e413962799) is really recommended.
+
+### [5\. Python Deep learning: Develop your first Neural Network in Python Using TensorFlow, Keras, and PyTorch, by Samuel Burns](https://www.amazon.com/gp/product/1092562222/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1092562222&linkId=3e00b20e68477638201c4b0816c41a11)
+
+
+
+**Author:** Samuel Burns
+**Publishing date:** April 3, 2019
+**Price/quality:** 🟠 Moderate
+**What it covers:**
+
+- Understanding Deep Learning in detail
+- Getting started with Deep Learning in Python
+- Coding a Neural Network from scratch
+- Using Python 3.X, TensorFlow, Keras and PyTorch
+
+**My impression:**
+
+The book starts with an introduction to Deep Learning and Artificial Neural Networks. Subsequently, it explores the libraries and frameworks that it uses: TensorFlow, Keras and PyTorch, as well as instructions for installing them.
+
+Once you have a working environment, the book proceeds with TensorFlow basics such as Constants, Variables and Sessions, which allow you to work with the framework. It then moves on to what it calls Keras basics, such as Learning Rate, Optimizers, Metrics and Loss Functions. I cringe a bit here, as many of those Keras basics are fundamental concepts in Deep Learning instead - and do not fully belong to Keras itself. What's more, the book makes use of the 'old' version of Keras, which supports the CNTK, Theano and TensorFlow backends. Today, Keras is tightly coupled to TensorFlow as tensorflow.keras, and the other backends are no longer recommended. Take this into account when considering the book.
+
+Once the Keras basics have been introduced, the book moves forward to PyTorch basics: computational graphs, tensors, building a neural network … and then applies it to ConvNets and Recurrent Neural Networks. That's all. In my opinion, the book stays at a high level and is a good starting point, but there are much better books out there today… especially since some of the concepts are already outdated (e.g. the Keras version used in the book). I wouldn't recommend this book per se: better pick one of the above.
+
+* * *
+
+## Books about Scikit-learn
+
+### [1\. Python Machine Learning: Machine Learning and Deep Learning with Python, scikit-learn, and TensorFlow 2, by Sebastian Raschka](https://www.amazon.com/gp/product/B07VBLX2W7/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07VBLX2W7&linkId=c61c5f59e9008c545232a2f89fdaa1eb)
+
+
+
+_If you want to get a broad introduction to Machine Learning and Deep Learning, followed by Python examples with the Scikit-learn and TensorFlow 2.0 frameworks._
+
+**Author:** Sebastian Raschka
+**Publishing date:** December 9, 2019
+**Price/quality:** 🟢 Really good
+**What it covers:**
+
+- Learn to use the Scikit-learn and TensorFlow frameworks for machine learning and deep learning.
+- Study across a wide range of applications, such as image classification, sentiment analysis and more.
+- Select and build a wide range of model types (neural networks and classic models) including best practices for evaluating and tuning them.
+
+**Read on this site:**
+
+We already covered [Python Machine Learning](https://www.amazon.com/gp/product/B07VBLX2W7/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07VBLX2W7&linkId=c61c5f59e9008c545232a2f89fdaa1eb) (_affiliate link_) at a different spot on this website. [Click here to read my impression of this book.](#3-python-machine-learning-machine-learning-and-deep-learning-with-python-scikit-learn-and-tensorflow-2-by-sebastian-raschka)
+
+### [2\. Introduction to Machine Learning with Python: A Guide for Data Scientists, by Andreas C. Müller & Sarah Guido](https://www.amazon.com/gp/product/1449369413/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1449369413&linkId=70832165a078e5ef5c865264f47d45c3)
+
+
+
+_If_ _you want to discover the breadth of machine learning algorithms and get practical experience along the way._
+
+**Author:** Andreas C. Müller and Sarah Guido
+**Publishing date:** October 2016
+**Price/quality:** 🟢 Really good
+**What it covers:**
+
+- Diving into machine learning concepts before you start coding
+- Taking a look at the advantages and disadvantages of various machine learning algorithms
+- The machine learning workflow: data representation, feature selection, hyperparameter tuning, and model evaluation, including advanced topics such as pipelines and text processing
+- Best practices for your machine learning projects
+
+**My impression:**
+
+The book starts with an introduction to machine learning. It covers problems that can be solved by ML and, more importantly, covers the limits as well - as machine learning is not the answer to every problem. The introduction also covers why Python is the lingua franca for data science projects these days, helps you with Scikit-learn, covers a variety of essential tools (Jupyter, NumPy, SciPy, Matplotlib, Pandas, Mglearn) and allows you to work on writing your first machine learning model!
+
+Machine learning has three broad categories of work - Supervised Learning, Unsupervised Learning (including data preprocessing) and Reinforcement Learning. The second chapter covers Supervised Learning and the wide range of models and model types available (kNN, linear models, Naive Bayes, Decision Trees, SVMs and ensembles). This includes best practices related to generalization, overfitting/underfitting and model uncertainty. The chapter is focused on classification and regression problems.
+
+Once you're up to speed about supervised learning, you'll learn about Unsupervised Learning.
The book covers data preprocessing and scaling, dimensionality reduction and clustering. This is followed by chapters on representing your dataset, feature selection, model evaluation and improvement, machine learning pipelines and working with textual data.
+
+Altogether, this book allows you to get a very broad and yet in-depth understanding of the wide range of possibilities within the machine learning field. It also allows you to actually create models with the Scikit-learn framework and provides a wide range of examples for doing so. Although this does not include the deeper neural networks (which are often built with Keras and TensorFlow), this book is a really good basis for those who want to start with machine learning from a developer perspective. What's more, despite the fact that the book is relatively old (it was released in October 2016), [Introduction to Machine Learning with Python](https://www.amazon.com/gp/product/1449369413/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1449369413&linkId=70832165a078e5ef5c865264f47d45c3) (_affiliate link_) is still up to date, as the Scikit-learn API does not change very often. Definitely recommended!
+
+### [3\. Hands-On Unsupervised Learning Using Python: How to Build Applied Machine Learning Solutions from Unlabeled Data, by Ankur A. Patel](https://www.amazon.com/gp/product/B07NY447H8/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07NY447H8&linkId=6632bf5e0b6a33f045faf8ae3dd2dfcc)
+
+
+
+_If you're looking for a book about unsupervised machine learning with Python._
+
+**Author:** Ankur A. Patel
+**Publishing date:** February 21, 2019
+**Price/quality:** 🟡 Acceptable
+**What it covers:**
+
+- Comparing unsupervised learning to the other two approaches, supervised and reinforcement learning
+- Setting up an end-to-end machine learning workflow, tailored to unsupervised models
+- Performing anomaly detection, clustering and semisupervised learning
+- Creating restricted Boltzmann machines and using GANs
+
+**My impression:**
+
+When you start the book, you will first cover the fundamentals of unsupervised learning in part one. This is important, because you will have to understand where exactly unsupervised learning is located in the machine learning ecosystem. It covers the difference between rule based models and machine learning, supervised versus unsupervised learning, and the conceptual details of supervised and unsupervised approaches. It even looks at combining reinforcement learning with unsupervised learning!
+
+Once you're through the concepts, the book allows you to set up your end-to-end machine learning project. It covers various libraries (TensorFlow, Keras, XGBoost, LightGBM) and allows you to set up a Jupyter Notebook in which you'll work. It then proceeds with data inspection, data preparation, model preparation, and picking a machine learning model. Then, you'll look at evaluating your model, model ensembles, and model selection. Note that this chapter includes a few supervised approaches, but that the book then continues with unsupervised learning.
+
+Indeed: the next part fully covers unsupervised learning using Scikit-learn. It covers dimensionality reduction, principal component analysis, singular value decomposition and random projection… and a variety of other methods for unsupervised learning! This is all preparatory work and belongs to the sphere of feature selection.
+ +The book then moves on to actual unsupervised machine learning approaches. It starts off with anomaly detection. Then, it proceeds with clustering, and group segmentation, before it moves to more advanced topics such as autoencoders, semisupervised learning and deep unsupervised learning (with Restricted Boltzmann Machines, GANs and deep clustering). + +The reviews of [Hands-On Unsupervised Learning](https://www.amazon.com/gp/product/B07NY447H8/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=B07NY447H8&linkId=6632bf5e0b6a33f045faf8ae3dd2dfcc) (_affiliate link_) are mixed. Some call the examples trivial, and others mention that it seems to be hurried. However, others seem to be happy. Looking through the book, it indeed seems to be the case that explanations are often not too detailed, and that especially visualizations are missing - which could have greatly helped. In my opinion, it's a good book - especially if you're looking for one about unsupervised machine learning - but you should already have some ML experience under the belt. And you should like quick and high-level explanations. It's not for beginners, I'd say. + +### [4\. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, by Aurélien Géron](https://www.amazon.com/gp/product/1492032646/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1492032646&linkId=e1273e0aa8f76ae3842777311187411a) + + + +_If_ _you want to get broad practical machine learning experience while understanding the concepts._ + +**Author:** Aurélien Géron +**Publishing date:** March 13, 2017 +**Price/quality:** 🟢 Really good +**What it covers:** + +- Exploring what's out there in the machine learning field: support vector machines, decision trees, random forests, ensemble methods, and neural networks. +- Using Scikit-learn and TensorFlow to build classic ML models and neural networks +- More advanced details about neural network architectures (ConvNets, recurrent nets, reinforcement learning) + +**My impression:** + +The book Hands-On Machine Learning aims to provide a very broad yet deep understanding of the machine learning field. As you would expect, it starts off with the ML fundamentals. First of all, questions like "what is machine learning?" and "why use it?" are answered, followed by a coverage of application areas, types of machine learning problems, and main challenges that you will face as a machine learning engineer. + +Once you're up to speed about the basics, you'll learn what a real machine learning project entails - from getting up to speed with the data, selecting features that are useful, training your model, to deploying it into production settings. This provides the general basis that will prove to be very useful in all your machine learning projects. + +The book then proceeds with a wide range of classic models - linear regression, polynomial regression, other linear models and logistic regression. In doing so, it introduces gradient descent based optimization, a technique that you will also find when studying deep(er) neural networks. Once this is done, you'll learn about Support Vector Machines - and how they work, both linearly and nonlinearly. And even how SVMs can be used for regression! And the book also makes sure that you'll understand how they work internally; it does not only provide code examples and guidance with respect to how to code them. 
+ +Decision trees, ensemble learning and random forests are subsequently covered as traditional machine learning techniques. Then, before it moves to neural networks and deep learning, it covers unsupervised approaches for dimensionality reduction and unsupervised learning (e.g. clustering). + +As mentioned, it then moves into neural networks and deep learning territory. And despite the fact that I'm really impressed by the first part, I think this part covers many of the Deep Learning issues in great detail - it's a really good section. You'll learn about the history of neural networks first (Rosenblatt Perceptron and Multilayer Perceptron), and how they can be used in both regression and classification tasks. The book then moves to a practical implementation using TensorFlow 2.x based Keras; this is the version of Keras that is up to date with the state-of-the-art. It includes a variety of basic operations such as saving and restoring a model and using callbacks, as we are used to with the Keras library. + +If you think that's it, you're wrong :) The book then proceeds with advanced concepts related to deep learning, such as vanishing and exploding gradients (and what to do about them), better optimization techniques, and how overfitting can be avoided with regularization. In the final chapters, more advanced TensorFlow topics are covered - such as how to train models at scale - as well as one application area, being Deep Computer Vision. + +I agree with the reviews. [Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems](https://www.amazon.com/gp/product/1492032646/ref=as_li_qf_asin_il_tl?ie=UTF8&tag=webn3rd02-20&creative=9325&linkCode=as2&creativeASIN=1492032646&linkId=e1273e0aa8f76ae3842777311187411a) (_affiliate link_) is one of the best books out there right now. Despite its age (and other competitors emergent at the scene of combining various frameworks), this one is still worth buying! From this book forward, you can proceed to books like the [Pocket Reference](#5-machine-learning-pocket-reference-working-with-structured-data-in-python-by-matt-harrison) so that you'll find even more details. Definitely recommended. + +* * * + +## Books about other Machine Learning frameworks + +### [1\. Hands-On Machine Learning with R, by Brad Boehmke and Brandon M. Greenwell](https://amzn.to/3jkkZLn) + + + +_If you want a practitioner's guide to the machine learning process - as well as applying the ML stack within R._ + +**Author:** Brad Boehmke and Brandon M. Greenwell +**Publishing date:** 2019 +**Price/quality:** 🟢 Really good +**What it covers:** + +- It takes a developer perspective to machine learning using R, by using a variety of R packages such as glmnet, h2o, ranger, xgboost, and keras; +- Nevertheless, it teaches the user about the entire machine learning process i.e. from feature engineering to model evaluation & interpretation. +- A variety of algorithms, such as regression, random forests, gradient boosting, and deep learning, is presented. +- Teaching you a firm understanding of what is possible with R when it comes to machine learning, including hands-on experience implementing such models. + +**My impression:** + +The book [Hands-On Machine Learning with R](https://amzn.to/3jkkZLn) (_affiliate link_) by Bradley Boehmke and Brandon M. Greenwell claims to be a practitioner's guide for machine learning in R. 
Since many machine learning models are created with Python these days, it could in theory be a great book for those who have R experience or don't want to make the switch to Python. This is especially true because the book argues that it will use a variety of frameworks that are well-known in the Python world. Let's take a look at how the book proceeds!
+
+From a high level, we can observe that the book has four parts. Part 1, Fundamentals, will teach you, well, the fundamentals of machine learning. Chapter 1 teaches the reader what the differences are between supervised learning and unsupervised learning. This includes a look at what the differences are between regression and classification problems. Chapter 2 then moves on to the modeling process: from data splitting to creating models, resampling, and model evaluation. This is followed by a chapter on feature and target engineering, which must be performed in many machine learning projects before a model can actually be trained. By looking at missing values, filtering features, numeric and categorical feature engineering and things like dimensionality reduction, the reader is presented with a thorough overview of what is necessary to train with good features.
+
+Part 2 then moves on to Supervised Learning, which was introduced in chapter 1. The chapters provide content about Linear Regression, Logistic Regression, Regularized Regression and techniques like K-Nearest Neighbors and Decision Trees. Bagging, Random Forests and Gradient Boosting are also covered, and so are SVMs. In ML terms, those are often called relatively 'old-fashioned', but they are still very practically usable today. That's why I think it's good that this book covers those topics. Part 2 also covers Deep Learning, model ensembling and how to interpret machine learning models - i.e., looking inside the black box.
+
+Part 3 moves to Dimensionality Reduction. Often, if you're dealing with a machine learning dataset, you have many dimensions, from which you should select a few when engineering your features. Principal Components Analysis, Generalized Low Rank Models and Autoencoders can all be used for reducing the dimensionality of your machine learning problem - whether that is by selecting dimensions or reducing the dimensionality altogether.
+
+Finally, Part 4 moves forward to Clustering - that is, Unsupervised Learning. It introduces the reader to K-means clustering, Hierarchical clustering and Model-based clustering.
+
+The [book](https://amzn.to/3jkkZLn) (_affiliate link_) is also available [here](https://bradleyboehmke.github.io/HOML/) for those who wish to take a look, and I can highly recommend it. While the authors write relatively academically, they do so in an inviting way - they don't burden the user with maths, but rather provide source code and some math to gain an intuition about what happens in the process. It also covers a large share of the machine learning techniques used today (whether for feature engineering, training your model, or evaluating it), as well as a variety of machine learning algorithms. They do so with well-written English and a lot of examples - both visual examples and source code examples. That's why, given my ML experience, the structure of the book and its contents, I would definitely recommend it to those who wish to get a book about Machine Learning with R. While not cheap, it is definitely a great investment for those who really want to give it a go. Great buy!
+ +* * * + +## Books about Reinforcement Learning + +Coming soon! + +* * * + +## Books about Machine Learning mathematics + +Coming soon! + +* * * + +## Books about Machine Learning for academic courses + +Coming soon! + +* * * + +## Books about other Machine Learning topics + +Coming soon! + +* * * + +## Books about Machine Learning/Artificial Intelligence and Business + +Coming soon! + +* * * + +## Want quick tutorials instead? Welcome to MachineCurve! diff --git a/beyond-swish-the-lisht-activation-function.md b/beyond-swish-the-lisht-activation-function.md new file mode 100644 index 0000000..4398b57 --- /dev/null +++ b/beyond-swish-the-lisht-activation-function.md @@ -0,0 +1,129 @@ +--- +title: "Beyond Swish: the LiSHT activation function" +date: "2019-11-17" +categories: + - "buffer" + - "deep-learning" +tags: + - "activation-function" + - "activation-functions" + - "deep-learning" + - "lisht" + - "machine-learning" + - "neural-networks" + - "optimizer" + - "relu" + - "swish" +--- + +Deep neural networks perform linear operations to combine weight vectors with input vectors. The values that are the outputs of these combinations are subsequently fed to [activation functions](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) which map the linear input into nonlinear output. + +The Rectified Linear Unit or ReLU activation function is very popular today. It activates to zero for all inputs lower than zero, and activates linearly (i.e. \[latex\]f(x) = x\[/latex\] for all \[latex\]x >= 0\[/latex\]). + +Nevertheless, it has some challenges - to which [the Swish activation function was found to be a solution](https://www.machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/). Increasing in popularity, studies have emerged that empirically investigate the effectiveness of Swish. Does it really result in better model performance? If not, why is this the case? How could even Swish be improved? + +We'll take a look at these questions in this blog post. First, we recap - based on our earlier blog post linked above - how Swish might improve model performance compared to traditional ReLU. Subsequently, we introduce challenges that were found empirically, before introducing a new activation function called _LiSHT_. + +Ready? Let's go! + +**Update 17/Mar/2021:** ensured that article is up to date for 2021. Added better formatting, fixed a few spelling issues and improved article metadata. + +* * * + +\[toc\] + +* * * + +## Recap: how Swish improves ReLU + +If we wish to understand the challenges of the Swish activation function, we must first investigate how Swish improves ReLU in the first place. As we have seen [in our Swish related blog post](https://www.machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/), there are multiple reasons ( Ramachandran, 2017): + +- Like ReLU, it is bounded below and unbounded above. This allows Swish to introduce both sparsity and non-congestion in the training process. +- It's also smooth, compared to ReLU. Because of this, the [Swish loss landscape](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) is smooth as well, which allows the optimizer to experience less oscillation. This might ensure faster convergence. +- Small negative values are not zeroed out, which may help you catch certain patterns in your dataset in a better way. 
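+
+To make the comparison above a bit more concrete, here is a minimal sketch of both activation functions - my own example, not code from the Ramachandran et al. (2017) paper - assuming the commonly used \[latex\]\\beta = 1\[/latex\] variant of Swish, i.e. \[latex\]swish(x) = x \\times sigmoid(x)\[/latex\]:
+
+```
+import numpy as np
+
+def relu(x):
+  # ReLU: zero for all negative inputs, identity for positive inputs
+  return np.maximum(0, x)
+
+def swish(x):
+  # Swish with beta = 1: the input multiplied by its Sigmoid activation
+  return x * (1 / (1 + np.exp(-x)))
+
+inputs = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
+print(relu(inputs))   # small negative inputs are zeroed out
+print(swish(inputs))  # small negative inputs yield small negative outputs
+```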
+ +[![](images/relu_swish-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/relu_swish.png) + +How the ReLU and Swish activations activate. They are really similar, but Swish is smooth and allows the model to capture small negative inputs. + +* * * + +## Swish challenges + +This does not mean that Swish is free of challenges. On the contrary - and this has everything to do with model optimization. + +While Swish reportedly improves model performance (Ramachandran et al., 2017), it still does not allow you to avoid [vanishing gradients](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/), as argued by Roy et al. (2019). Instead, they argue that "the gradient diminishing problem is still present in case of Swish function". + +But why is this the case? + +We'll have to take a look at neural network optimization by means of [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) (or [similar optimizers](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/)) combined with backpropagation. + +It will be fairly simple to identify why even Swish might cause you to fall prey to these vanishing gradients. + +### Vanishing gradients? + +Lets very briefly recap the vanishing gradients problem for the unaware reader. Suppose that we create a neural network with the Sigmoid activation function. Gradient descent, which is a first-order derivative optimizer, will then - together with backprop - use the first-order derivative to compute the gradients and to perform the optimization procedure. + +The activation function and its first-order derivative can be visualized as follows: + +[![](images/sigmoid_deriv-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/sigmoid_deriv.png) + +As you can see, computed gradients for Sigmoid will never be larger than \[latex\]\\approx 0.25\[/latex\], and in many cases the gradients will be very small. + +Since optimizing multiple layers of a neural network essentially chains together computed gradients from loss value to layer, with all intermediate layers included, the gradients for upstream layers get really small, slowing down the learning process the more upstream you get. Adding more and more layers will thus essentially create a network that learns slowly or cannot even converge anymore - _say hello to the vanishing gradients problem_. + +While Sigmoid is one of the worst activation functions in terms of the vanishing gradients problem, we experience a similar situation when applying the Swish activation function. Let's take a look. + +### Swish and vanishing gradients + +We can generate the same plot for the Swish activation function (Serengil, 2018; Ramachandran, 2017): + +[![](images/swish_deriv-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/swish_deriv.png) + +Even though the vanishing gradients problem is much less severe in case of Swish, only inputs of \[latex\]x >= 2\[/latex\] result in gradients of 1 and (sometimes) higher. In any other case, the gradient will still cause the chain to get smaller with increasing layers. + +Hence, indeed - as Roy et al. (2019) argue: Swish does not fully avoid the vanishing gradients problem. + +* * * + +## Introducing LiSHT + +To reduce the impact of this problem, they introduce the LiSHT activation function, or the **Linearly Scaled Hyperbolic Tangent**. 
This activation function simply uses the `tanh` function and scales it linearly, as follows:
+
+\[latex\]LiSHT(x) = x \\times tanh(x)\[/latex\]
+
+When we compare it with traditional ReLU and Swish, we get this plot:
+
+[![](images/lisht_visualized-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lisht_visualized.png)
+
+And when we look at LiSHT in terms of the derivatives, this is what we see:
+
+[![](images/lisht_derivs-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lisht_derivs.png)
+
+Essentially, LiSHT looks very much like Swish in terms of the first-order derivative. However, the range is expanded into the negative direction as well, which means that the vanishing gradients problem is reduced even further - at least in theory.
+
+In their work, Roy et al. (2019) report, based on empirical testing, that the vanishing gradients problem is indeed reduced compared to Swish and traditional ReLU. Additional correlations between network learning and the shape of e.g. the LiSHT loss landscape were identified.
+
+Even though the authors empirically tested LiSHT on various datasets (Car Evaluation, Iris, MNIST, CIFAR10, CIFAR100 and Twitter140) with multiple types of architectures ([MLP](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/), [CNN](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/), [LSTM](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/)), we'll have to wait and see whether LiSHT will gain traction in the machine learning community. Firstly, it will be difficult to knock ReLU off the throne, as it generalizes well to most machine learning scenarios. While the authors have done their best to test LiSHT across many settings, we still don't know enough about how well it generalizes across most scenarios.
+
+Secondly - and this is an observation rather than established _fact_ - the machine learning community has been relatively slow to adopt promising activation functions like Swish. While it does improve ReLU in many cases, most tutorials still recommend ReLU over such new activation functions. While this partially occurs because of the first reason - i.e., that ReLU simply generalizes well, and works well in many cases - the LiSHT authors also face the inherent slowness of collective human nature to adapt.
+
+I'm curious to see more applications of LiSHT, and I'm sure that we'll also do some testing ourselves here at MachineCurve!
+
+* * *
+
+## Summary
+
+In this blog post, we introduced the LiSHT activation function. It's a relatively new function that attempts to improve on Swish, which itself was an improvement over traditional ReLU in terms of the loss landscape generated during optimization. We did so by taking a look at how Swish improves ReLU in the first place, why Swish is still sensitive to vanishing gradients, and how LiSHT attempts to reduce this sensitivity.
+
+I hope you've learnt something new today, and I wish you all the best in your machine learning process. If you have any questions, please feel free to leave a comment in the comments box below 😄👇 I'd encourage you to do the same if you do not agree with elements of my blog post, since the only way to improve it is by doing so collectively. Thanks for reading MachineCurve today and happy engineering! 😎
+
+* * *
+
+## References
+
+Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Swish: a self-gated activation function. 
[_arXiv preprint arXiv:1710.05941_, _7_.](https://www.semanticscholar.org/paper/Swish%3A-a-Self-Gated-Activation-Function-Ramachandran-Zoph/4f57f486adea0bf95c252620a4e8af39232ef8bc) + +Roy, S. K., Manna, S., Dubey, S. R., & Chaudhuri, B. B. (2019). LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent Activation Function for Neural Networks. [_arXiv preprint arXiv:1901.05894_.](https://arxiv.org/abs/1901.05894) + +Serengil, S. (2018, August 31). Swish as Neural Networks Activation Function. Retrieved from [https://sefiks.com/2018/08/21/swish-as-neural-networks-activation-function/](https://sefiks.com/2018/08/21/swish-as-neural-networks-activation-function/) diff --git a/bidirectional-lstms-with-tensorflow-and-keras.md b/bidirectional-lstms-with-tensorflow-and-keras.md new file mode 100644 index 0000000..ac4e490 --- /dev/null +++ b/bidirectional-lstms-with-tensorflow-and-keras.md @@ -0,0 +1,252 @@ +--- +title: "Bidirectional LSTMs with TensorFlow 2.0 and Keras" +date: "2021-01-11" +categories: + - "deep-learning" + - "frameworks" +tags: + - "bidirectional" + - "deep-learning" + - "lstm" + - "machine-learning" + - "nlp" + - "recurrent-neural-networks" + - "seq2seq" + - "sequence-to-sequence-learning" + - "tensorflow" +--- + +Long Short-Term Memory networks or [LSTMs](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/) are Neural Networks that are used in a variety of tasks. Used in Natural Language Processing, time series and other sequence related tasks, they have attained significant attention in the past few years. Thanks to their recurrent segment, which means that LSTM output is fed back into itself, LSTMs can use context when predicting a next sample. + +Traditionally, LSTMs have been one-way models, also called unidirectional ones. In other words, sequences such as tokens (i.e. words) are read in a left-to-right or right-to-left fashion. This does not necessarily reflect good practice, as more recent Transformer based approaches like [BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) suggest. In fact, _bidirectionality_ - or processing the input in a left-to-right _and_ a right-to-left fashion, can improve the performance of your Machine Learning model. + +In this tutorial, we will take a closer look at Bidirectionality in LSTMs. We will take a look LSTMs in general, providing sufficient context to understand what we're going to do. We also focus on how Bidirectional LSTMs implement bidirectionality. We then continue and actually implement a Bidirectional LSTM with TensorFlow and Keras. We're going to use the `tf.keras.layers.Bidirectional` layer for this purpose. + +After reading this tutorial, you will... + +- Understand what Bidirectional LSTMs are and how they compare to regular LSTMs. +- Know how Bidirectional LSTMs are implemented. +- Be able to create a TensorFlow 2.x based Bidirectional LSTM. + +* * * + +\[toc\] + +* * * + +## Code example: using Bidirectional with TensorFlow and Keras + +Here's a quick code example that illustrates how TensorFlow/Keras based `LSTM` models can be wrapped with `Bidirectional`. This converts them from unidirectional recurrent models into bidirectional ones. [Click here](https://www.machinecurve.com/index.php/2021/01/11/bidirectional-lstms-with-tensorflow-and-keras/#tf-keras-layers-bidirectional) to understand the `merge_mode` attribute. 
If you want to understand bidirectional LSTMs in more detail, or construct the rest of the model and actually run it, make sure to read the rest of this tutorial too! :) + +``` +# Define the Keras model +model = Sequential() +model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length)) +model.add(Bidirectional(LSTM(10), merge_mode='sum')) +model.add(Dense(1, activation='sigmoid')) +``` + +* * * + +## Bidirectional LSTMs: concepts + +Before we take a look at the code of a Bidirectional LSTM, let's take a look at them in general, how unidirectionality can limit LSTMs and how bidirectionality can be implemented conceptually. + +### How LSTMs work + +A **[Long Short-Term Memory](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/)** network or LSTM is a type of [recurrent neural network](https://www.machinecurve.com/index.php/2020/12/21/from-vanilla-rnns-to-transformers-a-history-of-seq2seq-learning/) (RNN) that was developed to resolve the [vanishing gradients problem](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/). This problem, which is caused by the chaining of gradients during error backpropagation, means that the most upstream layers in a neural network learn very slowly. + +It is especially problematic when your neural network is recurrent, because the type of backpropagation involved there involves unrolling the network for each input token, effectively chaining copies of the same model. The longer the sequence, the worse the vanishing gradients problem is. We therefore don't use classic or vanilla RNNs so often anymore. + +LSTMs fix this problem by separating _memory_ from the _hidden outputs_. An LSTM consists of memory cells, one of which is visualized in the image below. As you can see, the output from the previous layer \[latex\]h\[t-1\]\[/latex\] and to the next layer \[latex\]h\[t\]\[/latex\] is separated from the memory, which is noted as \[latex\]c\[/latex\]. Interactions between the previous output and current input with the memory take place in three segments or _gates_: + +- The **forget gate**, which is the first segment. It feeds both the previous output and the current input through a [Sigmoid](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) (\[latex\]\\sigma\[/latex\]) function, then multiplying the result with memory. It thus removes certain short-term elements from memory. +- The **input** or **update gate**, which is the second segment. It also utilizes a Sigmoid function and learns what must be added memory, updating it based on the current input and the output from the previous layer. In addition, this Sigmoid activated data is multiplied with a [Tanh](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/) generated output from memory and input, normalizing the memory update and keeping memory values low. +- The **output gate**, which is the third segment. It utilizes a Sigmoid activated combination from current input and previous output and multiplies it with a [Tanh-normalized](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) representation from memory. The output is then presented and is used in the next cell, which is a copy of the current one with the same parameters. 
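+
+For those who prefer code over diagrams, the gate interactions described above can be summarized in a short sketch of a single cell step. Note that this is my own simplified illustration with hypothetical weight matrices `W` and biases `b` per gate - a real implementation, such as `tf.keras.layers.LSTM`, fuses and optimizes these operations:
+
+```
+import numpy as np
+
+def sigmoid(x):
+  return 1 / (1 + np.exp(-x))
+
+def lstm_cell_step(x_t, h_prev, c_prev, W, b):
+  # Combine the previous output and the current input, as in the diagram
+  z = np.concatenate([h_prev, x_t])
+  f = sigmoid(W['f'] @ z + b['f'])        # forget gate: what to remove from memory
+  i = sigmoid(W['i'] @ z + b['i'])        # input/update gate: what to add to memory
+  c_cand = np.tanh(W['c'] @ z + b['c'])   # candidate memory values, kept small by Tanh
+  c_t = f * c_prev + i * c_cand           # memory update: multiplication and addition only
+  o = sigmoid(W['o'] @ z + b['o'])        # output gate
+  h_t = o * np.tanh(c_t)                  # hidden output, passed on to the next cell
+  return h_t, c_t
+
+# Minimal usage with random, hypothetical weights: 4 input features, 3 hidden units
+rng = np.random.default_rng(42)
+W = {k: rng.normal(size=(3, 7)) for k in 'fico'}  # 7 = 3 hidden + 4 input
+b = {k: np.zeros(3) for k in 'fico'}
+h_t, c_t = lstm_cell_step(rng.normal(size=4), np.zeros(3), np.zeros(3), W, b)
+```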
+ +While many [nonlinear operations](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/) are present within the memory cell, the memory flow from \[latex\]c\[t-1\]\[/latex\] to \[latex\]c\[t\]\[/latex\] is _linear_ - the multiplication and addition operations are linear operations. By consequence, through a smart implementation, the gradient in this segment is always kept at `1.0` and hence vanishing gradients no longer occur. This aspect of the LSTM is therefore called a **Constant Error Carrousel**, or CEC. + +[![](images/LSTM-5.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/LSTM-5.png) + +### How unidirectionality can limit your LSTM + +Suppose that you are processing the sequence \[latex\]\\text{I go eat now}\[/latex\] through an LSTM for the purpose of translating it into French. Recall that processing such data happens on a per-token basis; each token is fed through the LSTM cell which processes the input token and passes the hidden state on to itself. When unrolled (as if you utilize many copies of the same LSTM model), this process looks as follows: + +[![](images/unidirectional-1024x414.png)](https://www.machinecurve.com/wp-content/uploads/2021/01/unidirectional.png) + +This immediately shows that LSTMs are unidirectional. In other words, the sequence is processed into one direction; here, from left to right. This makes common sense, as - except for a few languages - we read and write in a left-to-right fashion. For translation tasks, this is therefore not a problem, because you don't know what will be said in the future and hence have no business about knowing what will happen after your current input word. + +But unidirectionality can also limit the performance of your Machine Learning model. This is especially true in the cases where the task is language _understanding_ rather than [sequence-to-sequence modeling](https://www.machinecurve.com/index.php/2020/12/29/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning/). For example, if you're reading a book and have to construct a summary, or understand the context with respect to the sentiment of a text and possible hints about the semantics provided later, you'll read in a back-and-forth fashion. + +Yes: you will read the sentence from the left to the right, and then also approach the same sentence from the right. In other words, in some language tasks, you will perform _bidirectional_ reading. And for these tasks, unidirectional LSTMs might not suffice. + +### From unidirectional to bidirectional LSTMs + +In those cases, you might wish to use a Bidirectional LSTM instead. With such a network, sequences are processed in both a left-to-right _and_ a right-to-left fashion. In other words, the phrase \[latex\]\\text{I go eat now}\[/latex\] is processed as \[latex\]\\text{I} \\rightarrow \\text{go} \\rightarrow \\text{eat} \\rightarrow \\text{now}\[/latex\] and as \[latex\]\\text{I} \\leftarrow \\text{go} \\leftarrow \\text{eat} \\leftarrow \\text{now}\[/latex\]. + +This provides more context for the tasks that require both directions for better understanding. + +[![](images/bidirectional-1024x414.png)](https://www.machinecurve.com/wp-content/uploads/2021/01/bidirectional.png) + +While conceptually bidirectional LSTMs work in a bidirectional fashion, they are not bidirectional in practice. Rather, they are just two unidirectional LSTMs for which the output is combined. 
Outputs can be combined in multiple ways (TensorFlow, n.d.): + +- **Vector summation**. Here, the output equals \[latex\]\\text{LSTM}\_\\rightarrow + \\text{LSTM}\_\\leftarrow\[/latex\]. +- **Vector averaging**. Here, the output equals \[latex\]\\frac{1}{2}(\\text{LSTM}\_\\rightarrow + \\text{LSTM}\_\\leftarrow)\[/latex\] +- **Vector multiplication.** Here, the output equals \[latex\]\\text{LSTM}\_\\rightarrow \\times \\text{LSTM}\_\\leftarrow\[/latex\]. +- **Vector concatenation**. Here, the output vector is twice the dimensionality of the input vectors, because they are concatenated rather than combined. + +* * * + +## Implementing a Bidirectional LSTM + +Now that we understand how bidirectional LSTMs work, we can take a look at implementing one. In this tutorial, we will use TensorFlow 2.x and its Keras implementation `tf.keras` for doing so. + +### Tf.keras.layers.Bidirectional + +Bidirectionality of a recurrent Keras Layer can be added by implementing `tf.keras.layers.bidirectional` (TensorFlow, n.d.). It is a [wrapper layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional) that can be added to any of the recurrent layers available within Keras, such as `LSTM`, `GRU` and `SimpleRNN`. It looks as follows: + +``` +tf.keras.layers.Bidirectional( + layer, merge_mode='concat', weights=None, backward_layer=None, + **kwargs +) +``` + +The layer attributes are as follows: + +- The first argument represents the `layer` (one of the recurrent `tf.keras.layers`) that must be turned into a bidirectional one. +- The `merge_mode` represents the way that outputs are constructed. Recall that results can be summated, averaged, multiplied and concatenated. By default, it's `concat` from the options `{'sum', 'mul', 'concat', 'ave', None}`. When set to `None`, nothing happens to the outputs, and they are returned as a list (TensorFlow, n.d.). +- With `backward_layer`, a different layer can be passed for backwards processing, should left-to-right and right-to-left directionality be processed differently. + +### Creating a regular LSTM + +The first step in creating a Bidirectional LSTM is defining a regular one. This can be done with the `tf.keras.layers.LSTM` layer, which we have [explained in another tutorial](https://www.machinecurve.com/index.php/2021/01/07/build-an-lstm-model-with-tensorflow-and-keras/). For the sake of brevity, we won't copy the entire model here multiple times - so we'll just show the segment that represents the model. As you can see, creating a regular LSTM in TensorFlow involves initializing the model (here, using `Sequential`), adding a [word embedding](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/), followed by the LSTM layer. Using a final Dense layer, we perform a [binary classification problem](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/). + +``` +# Define the Keras model +model = Sequential() +model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length)) +model.add(LSTM(10)) +model.add(Dense(1, activation='sigmoid')) +``` + +### Wrapping the LSTM with Bidirectional + +Converting the regular or unidirectional LSTM into a bidirectional one is really simple. The only thing you have to do is to wrap it with a `Bidirectional` layer and specify the `merge_mode` as explained above. 
In this case, we set the merge mode to _summation_, which deviates from the default value of _concatenation_. + +``` +# Define the Keras model +model = Sequential() +model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length)) +model.add(Bidirectional(LSTM(10), merge_mode='sum')) +model.add(Dense(1, activation='sigmoid')) +``` + +### Full model code + +Of course, we will also show you the full model code for the examples above. This teaches you how to implement a full bidirectional LSTM. Let's explain how it works. Constructing a bidirectional LSTM involves the following steps... + +1. **Specifying the model imports**. As you can see, we import a lot of TensorFlow modules. We're using the provided [IMDB dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/) for educational purposes, `Embedding` for [learned embeddings](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/), the `Dense` layer type for [classification](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/), and `LSTM`/`Bidirectional` for constructing the bidirectional LSTM. [Binary crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) is used together with the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) for optimization. With `pad_sequences`, we can ensure that our inputs are of equal length. Finally, we'll use `Sequential` - the Sequential API - for creating the initial model. +2. **Listing the configuration options.** I always think it's useful to specify all the configuration options before using them throughout the code. It simply provides the overview that we need. They are explained in more detail [in the tutorial about LSTMs](https://www.machinecurve.com/index.php/2021/01/07/build-an-lstm-model-with-tensorflow-and-keras/#listing-model-configuration). +3. **Loading and preparing the dataset.** We use `imdb.load_data(...)` for loading the dataset given our configuration options, and use `pad_sequences` to ensure that sentences that are shorter than our maximum limit are padded with zeroes so that they are of equal length. The IMDB dataset can be used for sentiment analysis: we'll find out whether a review is positive or negative. +4. **Defining the Keras model**. In other words, constructing the skeleton of our model. Using `Sequential`, we initialize a model, and stack the `Embedding`, `Bidirectional LSTM`, and `Dense` layers on top of each other. +5. **Compiling the model**. This actually converts the model skeleton into a model that can be trained and used for predictions. Here, we specify the [optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) and additional metrics. +6. **Generating a summary**. This allows us to [inspect the model in more detail](https://www.machinecurve.com/index.php/2020/04/01/how-to-generate-a-summary-of-your-keras-model/). +7. **Training and evaluating the model**. 
With `model.fit(...)`, we start the training process using our [training data](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/), with subsequent [evaluation](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/) on our testing data using `model.evaluate(...)`. + +``` +import tensorflow as tf +from tensorflow.keras.datasets import imdb +from tensorflow.keras.layers import Embedding, Dense, LSTM, Bidirectional +from tensorflow.keras.losses import BinaryCrossentropy +from tensorflow.keras.models import Sequential +from tensorflow.keras.optimizers import Adam +from tensorflow.keras.preprocessing.sequence import pad_sequences + +# Model configuration +additional_metrics = ['accuracy'] +batch_size = 128 +embedding_output_dims = 15 +loss_function = BinaryCrossentropy() +max_sequence_length = 300 +num_distinct_words = 5000 +number_of_epochs = 5 +optimizer = Adam() +validation_split = 0.20 +verbosity_mode = 1 + +# Load dataset +(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_distinct_words) +print(x_train.shape) +print(x_test.shape) + +# Pad all sequences +padded_inputs = pad_sequences(x_train, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with +padded_inputs_test = pad_sequences(x_test, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with + +# Define the Keras model +model = Sequential() +model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length)) +model.add(Bidirectional(LSTM(10), merge_mode='sum')) +model.add(Dense(1, activation='sigmoid')) + +# Compile the model +model.compile(optimizer=optimizer, loss=loss_function, metrics=additional_metrics) + +# Give a summary +model.summary() + +# Train the model +history = model.fit(padded_inputs, y_train, batch_size=batch_size, epochs=number_of_epochs, verbose=verbosity_mode, validation_split=validation_split) + +# Test the model after training +test_results = model.evaluate(padded_inputs_test, y_test, verbose=False) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%') +``` + +* * * + +## Results + +We can now run our Bidirectional LSTM by running the code in a terminal that has TensorFlow 2.x installed. This is what you should see: + +``` +2021-01-11 20:47:14.079739: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2) +Epoch 1/5 +157/157 [==============================] - 20s 102ms/step - loss: 0.6621 - accuracy: 0.5929 - val_loss: 0.4486 - val_accuracy: 0.8226 +Epoch 2/5 +157/157 [==============================] - 15s 99ms/step - loss: 0.4092 - accuracy: 0.8357 - val_loss: 0.3423 - val_accuracy: 0.8624 +Epoch 3/5 +157/157 [==============================] - 16s 99ms/step - loss: 0.2865 - accuracy: 0.8958 - val_loss: 0.3351 - val_accuracy: 0.8680 +Epoch 4/5 +157/157 [==============================] - 20s 127ms/step - loss: 0.2370 - accuracy: 0.9181 - val_loss: 0.3010 - val_accuracy: 0.8768 +Epoch 5/5 +157/157 [==============================] - 22s 139ms/step - loss: 0.1980 - accuracy: 0.9345 - val_loss: 0.3290 - val_accuracy: 0.8686 +Test results - Loss: 0.33866164088249207 - Accuracy: 86.49600148200989% +``` + +An 86.5% accuracy for such a simple model, trained for only 5 epochs - not too bad! :) + +* * * + +## Summary + +In this tutorial, we saw how we can use TensorFlow and Keras to create a bidirectional LSTM. 
Using step-by-step explanations and many Python examples, you have learned how to create such a model, which should be better when bidirectionality is naturally present within the language task that you are performing. + +We saw that LSTMs can be used for sequence-to-sequence tasks and that they improve upon classic RNNs by resolving the vanishing gradients problem. However, they are unidirectional, in the sense that they process text (or other sequences) in a left-to-right or a right-to-left fashion. This can be problematic when your task requires context 'from the future', e.g. when you are using the full context of the text to generate, say, a summary. + +Bidirectionality can easily be added to LSTMs with TensorFlow thanks to the `tf.keras.layers.Bidirectional` layer. Being a layer wrapper to all Keras recurrent layers, it can be added to your existing LSTM easily, as you have seen in the tutorial. Configuration is also easy. + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope that you have learned something from this article! If you did, please feel free to leave a comment in the comments section 💬 Please do the same if you have any remarks or suggestions for improvement. If you have questions, click the **Ask Questions** button on the right. I will try to respond as soon as I can :) + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +MachineCurve. (2020, December 29). _A gentle introduction to long short-term memory networks (LSTM)_. [https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/) + +TensorFlow. (n.d.). _Tf.keras.layers.Bidirectional_. [https://www.tensorflow.org/api\_docs/python/tf/keras/layers/Bidirectional](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional) diff --git a/binary-crossentropy-loss-with-pytorch-ignite-and-lightning.md b/binary-crossentropy-loss-with-pytorch-ignite-and-lightning.md new file mode 100644 index 0000000..1b5068b --- /dev/null +++ b/binary-crossentropy-loss-with-pytorch-ignite-and-lightning.md @@ -0,0 +1,312 @@ +--- +title: "Binary Crossentropy Loss with PyTorch, Ignite and Lightning" +date: "2021-01-20" +categories: + - "deep-learning" + - "frameworks" +tags: + - "binary-crossentropy" + - "crossentropy" + - "deep-learning" + - "loss-function" + - "loss-value" + - "machine-learning" + - "training-process" +--- + +Training a deep learning model is a cyclical process. First, you feed forward data, generating predictions for each sample. Then, the predictions are compared and the comparison is aggregated into a loss value. Finally, using this loss value, errors are computed backwards using backpropagation and the model is optimized with gradient descent or an adaptive optimizer. + +This way, you can train a model that really performs well - one that can be used in practice. + +In this tutorial, we will take a close look at **using Binary** **Crossentropy Loss with PyTorch**. This loss, which is also called BCE loss, is the de facto standard loss for [binary classification tasks](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/) in neural networks. After reading this tutorial, you will... + +- Understand what Binary Crossentropy Loss is. +- How BCE Loss can be used in neural networks for binary classification. 
+- Have implemented Binary Crossentropy Loss in a PyTorch, PyTorch Lightning and PyTorch Ignite model. + +Let's get to work! 🚀 + +* * * + +\[toc\] + +* * * + +## Using BCELoss with PyTorch: summary and code example + +Training a neural network with PyTorch, PyTorch Lightning or PyTorch Ignite requires that you use a loss function. This is not specific to PyTorch, as they are also common in TensorFlow - and in fact, a core part of how a neural network is trained. + +Choosing a loss function is entirely dependent on your dataset, the problem you are trying to solve and the specific variant of that problem. For **binary classification problems**, the loss function that is most suitable is called **binary crossentropy loss**. It compares the prediction, which is a number between 0 and 1, with the true target, that is either 0 or 1. Having the property that loss increases exponentially while the offset increases linearly, we get a way to punish extremely wrong predictions more aggressively than ones that are close to the target. This stabilizes the training process. + +In PyTorch, binary crossentropy loss is provided by means of `nn.BCELoss`. Below, you'll see how Binary Crossentropy Loss can be implemented with either classic PyTorch, PyTorch Lightning and PyTorch Ignite. Make sure to read the rest of the tutorial too if you want to understand the loss or the implementations in more detail! + +### Classic PyTorch + +Using `BCELoss` in classic PyTorch is a two-step process: + +1. **Define it as a criterion.** +2. **Use it in the custom training loop.** + +Step 1 - the criterion definition: + +``` +criterion = nn.BCELoss() +``` + +Step 2 - using it in the custom training loop: + +``` +for epoch in range(5): + for i, data in enumerate(trainloader, 0): + inputs, labels = data + optimizer.zero_grad() + # Forward pass + outputs = net(inputs) + # Compute loss + loss = criterion(outputs, labels) + # Backward pass + loss.backward() + # Optimization + optimizer.step() +``` + +### PyTorch Lightning + +In Lightning, we can **add `BCELoss` to our `training_step`, `validation_step` and `testing_step`** like this to start using Binary Crossentropy Loss: + +``` +from torch import nn +import pytorch_lightning as pl + +class NeuralNetwork(pl.LightningModule): + def training_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.bce(y_hat, y) + self.log('train_loss', loss) + return loss +``` + +### PyTorch Ignite + +In Ignite, we can **add `BCELoss` as a `criterion` to the Trainer** **creation** for using Binary Crossentropy Loss. It can be added like this: + +``` +from torch import nn + +criterion = nn.BCELoss() +trainer = create_supervised_trainer(model, optimizer, criterion, device=device) +``` + +* * * + +## Binary Crossentropy Loss for Binary Classification + +From our article about the [various classification problems](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/) that Machine Learning engineers can encounter when tackling a supervised learning problem, we know that **binary classification** involves grouping any input samples in one of two classes - a first and a second, often denoted as _class 0_ and _class 1_. 
+ +![](images/whatisclassification2.png) + +### High-level training process + +We also know from our article about [loss functions](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) and the [high-level supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) that when you train a neural network, these are the steps that the process will go through: + +1. **Feeding forward data through the model.** The result is a set of predictions with one prediction per input sample. +2. **Comparing the predictions with the ground truth**. Here, we compute the differences between the prediction and the _true_ sample. We converge these differences in one value, which we call the _loss value_. +3. **Improving the model.** By computing the errors backwards by means of backpropagation, we get gradients that we can use to improve the model through optimization. +4. **Starting at (1) again.** This process is cyclical until a performance threshold has been passed, until time is up or until the process is halted manually. + +Sounds like a straight-forward process. But we didn't answer the _how_ with respect to generating differences between predictions and the true sample, and the subsequent convergence of these into a loss value. + +### Binary crossentropy loss + +In fact, there are many loss functions that we can use for this purpose - and each combination of task, variant and data distribution has the best possible candidate. + +For binary classification problems, the loss function of choice is the **binary crossentropy loss**, or the **BCELoss**, if you will. Don't be scared away by the maths, but it can be defined as follows: + +![](images/image-5-1024x122.png) + +Don't let the maths scare you away... just read on! 😉 + +Here, `t` is the target value (either `0.0` or `1.0` - recall that the classes are represented as _class 0_ and _class 1_). The prediction `p` can be any value between zero and one, as is common with the [Sigmoid activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/). This function is commonly used to generate the output in the last layer of your neural network when performing binary classification. The `log` here is the logarithm which generates the exponential properties that make the function so useful. + +Visualized for the two possible targets and any value for `p` between 0 and 1, this is what BCE loss looks like: + +- [![](images/bce-1-1024x421.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/bce-1-1024x421.png) + + Binary crossentropy, target = 1 + +- [![](images/bce_t0-1024x459.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/bce_t0-1024x459.png) + + Binary crossentropy, target = 0 + + +Indeed: + +- If the distance between the target and the prediction is high (e.g. `t = 0.0; p = 1.0` or `t = 1.0; p = 0.0`), loss is highest - infinite, even, for an `1.0` delta. +- There is continuity between all loss values, meaning that all possible values (i.e. `[0, 1]`) are supported. +- Loss increases exponentially when the difference between prediction and target increases linearly. In other words, predictions that are _really_ wrong are punished more significantly than predictions that are _a bit off_. This means no craziness when the model is close to optimum values, but quite a shift in weights when it's not. 
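To make these properties a bit more tangible, here is a small, self-contained sketch - written with plain NumPy rather than any of the PyTorch flavors - that simply evaluates the formula above for a handful of predictions against target `t = 1.0`. The `bce` helper is just an illustrative name, not part of any library:

```
import numpy as np

def bce(t, p, eps=1e-7):
    # Binary crossentropy for a single prediction:
    # loss = -(t * log(p) + (1 - t) * log(1 - p))
    # eps keeps log() away from zero, where the loss would become infinite.
    p = np.clip(p, eps, 1 - eps)
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

# Target is 1.0: the further p drifts away from it, the faster the loss grows
for p in [0.9, 0.6, 0.1, 0.01]:
    print(f't=1.0, p={p}: loss = {bce(1.0, p):.3f}')
# Prints roughly 0.105, 0.511, 2.303 and 4.605:
# linearly worse predictions, much harsher punishment
```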

These properties make binary crossentropy a very suitable loss function for binary classification problems. Let's now take a look at how we can implement it with PyTorch and its varieties.

* * *

## Implementing Binary Crossentropy Loss with PyTorch

In this section, we'll see a step-by-step approach to constructing Binary Crossentropy Loss using PyTorch or any of the variants (i.e. PyTorch Lightning and PyTorch Ignite). As these are the main flavors of PyTorch these days, we'll cover all three of them.

### Introducing BCELoss

In PyTorch, Binary Crossentropy Loss is provided as `[nn.BCELoss](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html)`. This loss function can be used with classic PyTorch, with PyTorch Lightning and with PyTorch Ignite. It looks like this (PyTorch, n.d.):

```
torch.nn.BCELoss(weight: Optional[torch.Tensor] = None, size_average=None, reduce=None, reduction: str = 'mean')
```

You can pass four _optional_ arguments:

- The optional **weight** Tensor can be provided to rescale the loss of each batch element. In other words, it can be used to compute a weighted loss function.
- The **size\_average** argument is deprecated. It is set to `True` by default, computing the average loss over each minibatch; when set to `False`, the losses are summed for each minibatch instead.
- The **reduce** argument is also deprecated. When set to `False`, a loss per batch element is returned instead of the summed or averaged value.
- The **reduction** argument replaces both _size\_average_ and _reduce_ and is the recommended way of configuring this behavior. It can be set to `none`, `mean` or `sum`:
    - When set to `none`, no reduction will be applied.
    - When set to `mean`, the average will be computed.
    - When set to `sum`, the sum will be computed.

### Classic PyTorch

In classic PyTorch, we must define the training, testing and validation loops ourselves. Adding `BCELoss` as a loss function is not too difficult, though. It involves specifying the loss as a `criterion` first and then manually invoking it within e.g. the training loop.

Specifying the loss as a criterion involves using `BCELoss` in the following way:

```
criterion = nn.BCELoss()
```

Here is an example of a (very simple) training loop. It does nothing but reset the gradients (so that they do not accumulate across iterations), make a forward pass, compute the loss, perform the backward pass with backpropagation and subsequently optimize the model.

```
for epoch in range(5):
  for i, data in enumerate(trainloader, 0):
    inputs, labels = data
    optimizer.zero_grad()
    # Forward pass
    outputs = net(inputs)
    # Compute loss
    loss = criterion(outputs, labels)
    # Backward pass
    loss.backward()
    # Optimization
    optimizer.step()
```

Indeed, that's the high-level training process that we covered at the start of this tutorial!

### PyTorch Lightning

PyTorch Lightning is a wrapper on top of native PyTorch which helps you organize code while benefiting from all the good things that PyTorch has to offer. In Lightning, the forward pass during training is split into three definitions: `training_step`, `validation_step` and `testing_step`. These specify what should happen for the training process, its validation component and subsequent model evaluation, respectively.

Using native PyTorch under the hood, we can also use `nn.BCELoss` here.
The first step is initializing it in the `__init__` definition: + +``` +from torch import nn +import pytorch_lightning as pl + +class NeuralNetwork(pl.LightningModule): + def __init__(self): + super().__init__() + # Other inits, like the layers, are also here. + self.bce = nn.BCELoss() +``` + +Recall that a loss function computes the aggregate error when a set of predictions is passed - by comparing them to the ground truth for the samples. In the `training_step`, we can create such functionality in the following way: + +- We first decompose the batch (i.e. the input sample/target combinations) into `x` and `y`, where obviously, \[latex\]\\text{x} \\rightarrow \\text{y}\[/latex\]. +- We then reshape `x` so that it can be processed by our neural network. +- We generate `y_hat`, which is the set of predictions for `x`, by feeding `x` forward through our neural network defined in `self.layers`. Note that you will see the creation of `self.layers` in the full code example below. +- We then compute binary crossentropy loss between `y_hat` (predictions) and `y` (ground truth), log the loss, and return it. Based on this loss, PyTorch Lightning will handle the gradients computation and subsequent optimization (with the optimizer defined in `configure_optimizers`, see the full code example below). + +``` + def training_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.bce(y_hat, y) + self.log('train_loss', loss) + return loss +``` + +Quite easy, isn't it? When added to a regular Lightning model i.e. to the `LightningModule`, the full code looks as follows: + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +import pytorch_lightning as pl + +class MNISTNetwork(pl.LightningModule): + + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(28 * 28, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10), + nn.Sigmoid() + ) + self.bce = nn.BCELoss() + + def forward(self, x): + return self.layers(x) + + def training_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.bce(y_hat, y) + self.log('train_loss', loss) + return loss + + def configure_optimizers(self): + optimizer = torch.optim.Adam(self.parameters(), lr=1e-4) + return optimizer + +if __name__ == '__main__': + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + pl.seed_everything(42) + neuralnetwork = MNISTNetwork() + trainer = pl.Trainer(auto_scale_batch_size='power', gpus=1, deterministic=True) + trainer.fit(neuralnetwork, DataLoader(dataset)) +``` + +### PyTorch Ignite + +In PyTorch Ignite, we can also add Binary Crossentropy loss quite easily. Here, we have to specify it as a `criterion` in the Trainer. Like with classic PyTorch and Lightning, we can use `nn.BCELoss` for this purpose. Adding BCE loss can be done as follows: + +``` +from torch import nn + +criterion = nn.BCELoss() +trainer = create_supervised_trainer(model, optimizer, criterion, device=device) +``` + +That's it for today! Now that you have completed this tutorial, you know how to implement Binary Crossentropy Loss with PyTorch, PyTorch Lightning and PyTorch Ignite. 
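As a final reference, here is a minimal, self-contained sketch of that Ignite setup in action. The tiny model and the random data below are stand-ins purely for illustration - in practice you would plug in your own network and `DataLoader`:

```
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from ignite.engine import create_supervised_trainer

# Stand-in model: a tiny binary classifier ending in Sigmoid, as BCELoss expects probabilities
model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCELoss()

# Stand-in data: 256 random samples with binary targets
inputs = torch.randn(256, 10)
targets = torch.randint(0, 2, (256, 1)).float()
train_loader = DataLoader(TensorDataset(inputs, targets), batch_size=32)

# Create the Ignite trainer with BCELoss as the criterion and run it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
trainer = create_supervised_trainer(model, optimizer, criterion, device=device)
trainer.run(train_loader, max_epochs=5)
```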
If you have any comments, please feel free to leave a message in the comments section below 💬 Please do the same if you have any questions, or ask your question [here](https://www.machinecurve.com/index.php/machine-learning-questions/). + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +PyTorch Ignite. (n.d.). _Ignite your networks! — ignite master documentation_. PyTorch. [https://pytorch.org/ignite/](https://pytorch.org/ignite/) + +PyTorch Lightning. (2021, January 12). [https://www.pytorchlightning.ai/](https://www.pytorchlightning.ai/) + +PyTorch. (n.d.). [https://pytorch.org](https://pytorch.org/) + +PyTorch. (n.d.). _BCELoss — PyTorch 1.7.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html) diff --git a/build-an-lstm-model-with-tensorflow-and-keras.md b/build-an-lstm-model-with-tensorflow-and-keras.md new file mode 100644 index 0000000..0e71bff --- /dev/null +++ b/build-an-lstm-model-with-tensorflow-and-keras.md @@ -0,0 +1,438 @@ +--- +title: "Build an LSTM Model with TensorFlow 2.0 and Keras" +date: "2021-01-07" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "deep-neural-network" + - "long-short-term-memory" + - "lstm" + - "machine-learning" + - "neural-network" + - "neural-networks" + - "recurrent-neural-networks" + - "tensorflow" +--- + +Long Short-Term Memory ([LSTM](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/)) based neural networks have played an important role in the field of Natural Language Processing. In addition, they have been used widely for sequence modeling. The reason why LSTMs have been used widely for this is because the model connects back to itself during a forward pass of your samples, and thus benefits from context generated by previous predictions when prediction for any new sample. + +In this article, we're going to take a look at how we can build an LSTM model with TensorFlow and Keras. For doing so, we're first going to take a brief look at what LSTMs are and how they work. Don't worry, we won't cover this in much detail, because [we already did so in another article](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/). It is necessary though to understand what is happening before we actually get to work. That's how you build intuition for the models you'll use for Machine Learning tasks. + +Once we know about LSTMs, we're going to take a look at how we can build one with TensorFlow. More specifically, we're going to use `tf.keras`, or TensorFlow's tightly coupled (or frankly, embedded) version of Keras for the job. First of all, we're going to see how LSTMs are represented as `tf.keras.layers.LSTM`. We'll then move on and actually build the model. With **step-by-step explanations**, you will understand what is going on at each line and build an understanding of LSTM models in code. + +Let's get to work! 😎 + +**Update 11/Jan/2021:** added quick example. + +* * * + +\[toc\] + +* * * + +## Example code: Using LSTM with TensorFlow and Keras + +The code example below gives you a working LSTM based model with TensorFlow 2.x and Keras. If you want to understand it in more detail, make sure to read the rest of the article below. 
+ +``` +import tensorflow as tf +from tensorflow.keras.datasets import imdb +from tensorflow.keras.layers import Embedding, Dense, LSTM +from tensorflow.keras.losses import BinaryCrossentropy +from tensorflow.keras.models import Sequential +from tensorflow.keras.optimizers import Adam +from tensorflow.keras.preprocessing.sequence import pad_sequences + +# Model configuration +additional_metrics = ['accuracy'] +batch_size = 128 +embedding_output_dims = 15 +loss_function = BinaryCrossentropy() +max_sequence_length = 300 +num_distinct_words = 5000 +number_of_epochs = 5 +optimizer = Adam() +validation_split = 0.20 +verbosity_mode = 1 + +# Disable eager execution +tf.compat.v1.disable_eager_execution() + +# Load dataset +(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_distinct_words) +print(x_train.shape) +print(x_test.shape) + +# Pad all sequences +padded_inputs = pad_sequences(x_train, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with +padded_inputs_test = pad_sequences(x_test, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with + +# Define the Keras model +model = Sequential() +model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length)) +model.add(LSTM(10)) +model.add(Dense(1, activation='sigmoid')) + +# Compile the model +model.compile(optimizer=optimizer, loss=loss_function, metrics=additional_metrics) + +# Give a summary +model.summary() + +# Train the model +history = model.fit(padded_inputs, y_train, batch_size=batch_size, epochs=number_of_epochs, verbose=verbosity_mode, validation_split=validation_split) + +# Test the model after training +test_results = model.evaluate(padded_inputs_test, y_test, verbose=False) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%') +``` + +* * * + +## Brief recap on LSTMs + +Before we will actually write any code, it's important to understand what is happening inside an LSTM. First of all, we must say that an LSTM is an improvement upon what is known as a _vanilla_ or _traditional_ Recurrent Neural Network, or RNN. Such networks look as follows: + +![](images/2560px-Recurrent_neural_network_unfold.svg_.png) + +A fully recurrent network. Created by [fdeloche](https://commons.wikimedia.org/wiki/User:Ixnay) at [Wikipedia](https://en.wikipedia.org/wiki/Recurrent_neural_network#/media/File:Recurrent_neural_network_unfold.svg), licensed as [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0). No changes were made. + +In a vanilla RNN, an input value (`X`) is passed through the model, which has a hidden or learned state `h` at that point in time. The model produces the output `O` which is in the target representation. Using this way of working, we can convert inputs in English into outputs in German, to give just an example. Vanilla RNNs are therefore widely used as [sequence-to-sequence models](https://www.machinecurve.com/index.php/2020/12/29/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning/). + +However, we can do the same with classic neural networks. Their benefit compared to [classic MLPs](https://www.machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/) is that they pass the output back to themselves, so that it can be used during the next pass. This provides the neural network with context with respect to previous inputs (which in semantically confusing tasks like translation can sometimes be really important). 
Classic RNNs are therefore little more than fully-connected networks that feed their outputs back into the neurons.

So far, so good. RNNs really boosted the state-of-the-art back in the days. There is a problem, however, and it emerges when you want to train classic Recurrent Neural Networks. When you apply backpropagation to a regular neural network, errors are computed backwards through the layers, yielding the gradient update that can be applied by the [optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/). Such a direct form of backpropagation is not readily available for recurrent connections, so another approach had to be taken: _unfolding_ the network through time, effectively making copies of the network for every time step (all sharing the same weights) and backpropagating through the unfolded structure. This way, gradients can be computed per time step and chained together, which made training RNNs possible.

But _chaining gradients together_ means that you have to apply multiplications. And here's the catch: classic RNNs were combined with [activation functions like Sigmoid and Tanh, but primarily Sigmoid](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/). As the derivative of these functions is almost always < 1.0, you get a severe case of [vanishing gradients](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/). Classic RNNs could therefore not be used when sequences got long; they simply got stuck or trained _very_ slowly.

Enter LSTMs. These **[Long Short-Term Memory](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/)** networks effectively split up the _output_ and the _memory_. In so-called _memory cells_, they allow all functionality to happen, the prediction to be generated, and memory to be updated. Visually, this looks as follows:

![](images/LSTM-1024x657.png)

Let's take a brief look at all the components in a bit more detail:

- All functionality is embedded into a _memory cell_, visualized above with the rounded border.
- The `h[t-1]` and `h[t]` variables represent the outputs of the memory cell at respectively `t-1` and `t`. In plain English: the output of the previous cell into the current cell, and the output of the current cell to the next one.
- The `c[t-1]` and `c[t]` variables represent the _memory_ itself, at the known time steps. As you can see, memory has been cut away from the output variable, being an entity on its own.
- We have three so-called _gates_, represented by the three blocks of elements within the cell:
    - On the left, we see a _forget gate_. It takes the previous output and current input and, by means of [Sigmoid activation](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/), computes what can be forgotten and hence removed from memory given the current and previous input. By multiplying this with the memory, the removal is performed.
    - In the middle, we see an _input gate_. It takes the previous output and current input and applies both a [Sigmoid and Tanh activation](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/).
The Sigmoid activation effectively learns what must be _kept_ from the inputs, whereas the Tanh _normalizes_ the values into the range `[-1, +1]`, stabilizing the training process. As you can see, the results are first multiplied (to ensure that normalization occurs) after which it is added into memory. + - On the right, we see an _output gate_. It takes a _normalized_ value for memory through Tanh and a Sigmoid activated value for the previous output and current input, effectively learning what must be predicted for the current input value. This value is then output, and the memory and output values are also passed to the next cell. + +The benefit of LSTMs with respect to simple RNNs lies in the fact that memory has been separated from the actual output mechanisms. As you can see, all vanishing gradient-causing mechanisms lie _within_ the cell. In inter-cell communication, the only elements that are encountered during gradient computation are multiplication (x) and addition (+). These are linear operations, and by consequence the LSTM can ensure that gradients between cells are always 1.0. Hence, with LSTMs, the vanishing gradients problem is resolved. + +This makes them a lot faster than vanilla RNNs. + +* * * + +## LSTMs in TensorFlow and Keras + +Now that we understand how LSTMs work in theory, let's take a look at constructing them in TensorFlow and Keras. Of course, we must take a look at how they are represented first. In TensorFlow and Keras, this happens through the `tf.keras.layers.LSTM` class, and it is described as: + +> Long Short-Term Memory layer - Hochreiter 1997. +> +> TensorFlow (n.d.) + +Indeed, that's the LSTM we want, although it might not have all the gates yet - gates were changed in another paper that was a follow-up to the Hochreiter paper. Nevertheless, understanding the LSTM with all the gates is a good idea, because that's what most of them look like today. + +In code, it looks as follows: + +``` +tf.keras.layers.LSTM( + units, activation='tanh', recurrent_activation='sigmoid', + use_bias=True, kernel_initializer='glorot_uniform', + recurrent_initializer='orthogonal', + bias_initializer='zeros', unit_forget_bias=True, + kernel_regularizer=None, recurrent_regularizer=None, bias_regularizer=None, + activity_regularizer=None, kernel_constraint=None, recurrent_constraint=None, + bias_constraint=None, dropout=0.0, recurrent_dropout=0.0, + return_sequences=False, return_state=False, go_backwards=False, stateful=False, + time_major=False, unroll=False, **kwargs +) +``` + +These are the attributes that can be configured: + +- With **units**, we can define the dimensionality of the output space, as we are used to e.g. with Dense layers. +- The **activation** attribute defines the [activation function](https://www.machinecurve.com/index.php/2020/01/24/overview-of-activation-functions-for-neural-networks/) that will be used. By default, it is the [Tanh function](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/). +- With **recurrent\_activation**, you can define the activation function for the recurrent functionality. +- The **use\_bias** attribute can be used to configure whether bias must be used to steer the model as well. +- The **[initializers](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/)** can be used to initialize the weights of the kernels and recurrent segment, as well as the biases. +- The **unit\_forget\_bias** represents the bias value (+1) at the forget gate. 
This is recommended in a follow-up study to the original LSTM paper. +- The **[regularizers](https://www.machinecurve.com/index.php/2020/01/26/which-regularizer-do-i-need-for-training-my-neural-network/)** and **constraints** allow you to constrain the training process, possibly blocking vanishing and exploding gradients, and keeping the model at adequate complexity. +- **[Dropout](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/)** can be added to avoid overfitting, to both the cell itself as well as the recurrent segment. +- With **return\_sequences**, you can indicate whether you want only the prediction for the current input as the output, or that with all the previous predictions appended. +- With **return\_state**, you can indicate whether you also want to have state returned besides the outputs. +- With **go\_backwards**, you can indicate whether you want to have the sequence returned in reverse order. +- If you set **stateful** to True, the recurrent segment will work on a batch level rather than model level. +- Structure of your input (timesteps, batch, features or batch, timesteps, features) can be switched with **time\_major**. +- With **unroll**, you can still unroll the network at training. If set to False, a symbolic loop will be used. +- Additional arguments can be passed with **\*\*kwargs**. + +* * * + +## How to create a Neural Network with LSTM layers in TensorFlow and Keras + +Now that we understand how LSTMs work and how they are represented within TensorFlow, it's time to actually build one with Python, TensorFlow and its Keras APIs. We'll walk you through the process with step-by-step examples. The process is composed of the following steps: + +1. Importing the Keras functionality that we need into the Python script. +2. Listing the configuration for our LSTM model and preparing for training. +3. Loading and preparing a dataset; we'll use the [IMDB dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#imdb-movie-reviews-sentiment-classification) today. +4. Defining the Keras model. +5. Compiling the Keras model. +6. Training the Keras model. +7. [Evaluating](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/) the Keras model. + +Open up a code editor and create a file, e.g. called `lstm.py`, and let's go! + +### Defining the model imports + +Let's specify the model imports first: + +``` +import tensorflow as tf +from tensorflow.keras.datasets import imdb +from tensorflow.keras.layers import Embedding, Dense, LSTM +from tensorflow.keras.losses import BinaryCrossentropy +from tensorflow.keras.models import Sequential +from tensorflow.keras.optimizers import Adam +from tensorflow.keras.preprocessing.sequence import pad_sequences +``` + +- We'll need TensorFlow so we import it as `tf`. +- From the [TensorFlow Keras Datasets](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/), we import the `imdb` one. +- We'll need [word embeddings](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/) (`Embedding`), [MLP layers](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/) (`Dense`) and LSTM layers (`LSTM`), so we import them as well. 
+- Our [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) will be [binary cross entropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/). +- As we'll stack all layers on top of each other with `model.add`, we need `Sequential` (the Keras Sequential API) for constructing our `model` variable in the first place. +- For [optimization](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) we use an extension of classic gradient descent called [Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/). +- Finally, we need to import `pad_sequences`. We're going to use the IMDB dataset which has sequences of reviews. While we'll specify a maximum length, this can mean that shorter sequences are present as well; these are not cutoff and therefore have different sizes than our desired one (i.e. the maximum length). We'll have to pad them with zeroes in order to make them of equal length. + +### Listing model configuration + +The next step is specifying the model configuration. While strictly not necessary (we can also specify them hardcoded), I always think it's a good idea to group them together. This way, you can easily see how your model is configured, without having to take a look through all the aspects. + +Below, we can see that our model will be trained with a batch size of 128, using [binary crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) and [Adam optimization](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), and only for five epochs (we only have to show you that it works). 20% of our training data will be used for validation purposes, and the output will be verbose, with verbosity mode set to 1 out of 0, 1 and 2. Our [learned word embedding](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/) will have 15 hidden dimensions and each sequence passed through the model is 300 characters at max. Our vocabulary will contain 5000 words at max. + +``` +# Model configuration +additional_metrics = ['accuracy'] +batch_size = 128 +embedding_output_dims = 15 +loss_function = BinaryCrossentropy() +max_sequence_length = 300 +num_distinct_words = 5000 +number_of_epochs = 5 +optimizer = Adam() +validation_split = 0.20 +verbosity_mode = 1 +``` + +You might now also want to disable [Eager Execution in TensorFlow](https://www.machinecurve.com/index.php/2020/09/13/tensorflow-eager-execution-what-is-it/). While it doesn't work for all, some people report that the training process speeds up after using it. However, it's not necessary to do so - simply test how it behaves on your machine: + +``` +# Disable eager execution +tf.compat.v1.disable_eager_execution() +``` + +### Loading and preparing the data + +Once this is complete, we can load and prepare the data. To make things easier, Keras comes [with a standard set of datasets](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/), of which the IMDB dataset can be used for sentiment analysis (essentially text classification with two classes). Using `imdb.load_data(...)`, we can load the data. + +Once the data has been loaded, we apply `pad_sequences`. 
This ensures that sentences shorter than the maximum sentence length are brought to equal length by applying padding with, in this case, zeroes, because that often corresponds with the padding character. + +``` +# Load dataset +(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_distinct_words) +print(x_train.shape) +print(x_test.shape) + +# Pad all sequences +padded_inputs = pad_sequences(x_train, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with +padded_inputs_test = pad_sequences(x_test, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with +``` + +### Defining the Keras model + +We can then define the Keras model. As we are using the Sequential API, we can initialize the `model` variable with `Sequential()`. The first layer is an `Embedding` layer, which learns a [word embedding](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/) that in our case has a dimensionality of 15. This is followed by an `LSTM` layer providing the recurrent segment (with default `tanh` activation enabled), and a `Dense` layer that has one output - through Sigmoid a number between 0 and 1, representing an orientation towards a class. + +``` +# Define the Keras model +model = Sequential() +model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length)) +model.add(LSTM(10)) +model.add(Dense(1, activation='sigmoid')) +``` + +### Compiling the Keras model + +The model can then be compiled. This initializes the model that has so far been a skeleton, a foundation, but no actual model yet. We do so by specifying the optimizer, the loss function, and the additional metrics that we had specified before. + +``` +# Compile the model +model.compile(optimizer=optimizer, loss=loss_function, metrics=additional_metrics) +``` + +This is also a good place to [generate a summary](https://www.machinecurve.com/index.php/2020/04/01/how-to-generate-a-summary-of-your-keras-model/) of what the model looks like. + +``` +# Give a summary +model.summary() +``` + +### Training the Keras model + +Then, we can instruct TensorFlow to start the training process. + +``` +# Train the model +history = model.fit(padded_inputs, y_train, batch_size=batch_size, epochs=number_of_epochs, verbose=verbosity_mode, validation_split=validation_split) +``` + +The `(input, output)` pairs passed to the model are the padded inputs and their corresponding class labels. Training happens with the batch size, number of epochs, verbosity mode and validation split that were also defined in the configuration section above. + +### Evaluating the Keras model + +We cannot evaluate the model on the same dataset that was used for training it. We fortunately have testing data available through the [train/test split](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/) performed in the `load_data(...)` section, and can use built-in [evaluation facilities](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/) to evaluate the model. We then print the test results on screen. + +``` +# Test the model after training +test_results = model.evaluate(padded_inputs_test, y_test, verbose=False) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%') +``` + +### Full model code + +If you want to get the full model code just at once, e.g. 
for copy-and-run, here you go: + +``` +import tensorflow as tf +from tensorflow.keras.datasets import imdb +from tensorflow.keras.layers import Embedding, Dense, LSTM +from tensorflow.keras.losses import BinaryCrossentropy +from tensorflow.keras.models import Sequential +from tensorflow.keras.optimizers import Adam +from tensorflow.keras.preprocessing.sequence import pad_sequences + +# Model configuration +additional_metrics = ['accuracy'] +batch_size = 128 +embedding_output_dims = 15 +loss_function = BinaryCrossentropy() +max_sequence_length = 300 +num_distinct_words = 5000 +number_of_epochs = 5 +optimizer = Adam() +validation_split = 0.20 +verbosity_mode = 1 + +# Disable eager execution +tf.compat.v1.disable_eager_execution() + +# Load dataset +(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_distinct_words) +print(x_train.shape) +print(x_test.shape) + +# Pad all sequences +padded_inputs = pad_sequences(x_train, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with +padded_inputs_test = pad_sequences(x_test, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with + +# Define the Keras model +model = Sequential() +model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length)) +model.add(LSTM(10)) +model.add(Dense(1, activation='sigmoid')) + +# Compile the model +model.compile(optimizer=optimizer, loss=loss_function, metrics=additional_metrics) + +# Give a summary +model.summary() + +# Train the model +history = model.fit(padded_inputs, y_train, batch_size=batch_size, epochs=number_of_epochs, verbose=verbosity_mode, validation_split=validation_split) + +# Test the model after training +test_results = model.evaluate(padded_inputs_test, y_test, verbose=False) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%') +``` + +### Running the model + +Time to run the model! Open up a terminal where at least TensorFlow and Python have been installed, and run the model - `python lstm.py`. + +You should see that the model starts training after e.g. a few seconds. If you have the IMDB dataset not downloaded to your machine, it will be downloaded first. 
+ +Eventually, you'll approximately see an 87.1% accuracy on the evaluation set: + +``` +Model: "sequential" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +embedding (Embedding) (None, 300, 15) 75000 +_________________________________________________________________ +lstm (LSTM) (None, 10) 1040 +_________________________________________________________________ +dense (Dense) (None, 1) 11 +================================================================= +Total params: 76,051 +Trainable params: 76,051 +Non-trainable params: 0 +_________________________________________________________________ +2021-01-08 14:53:19.988309: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2) +Epoch 1/5 +157/157 [==============================] - 19s 106ms/step - loss: 0.6730 - accuracy: 0.5799 - val_loss: 0.4866 - val_accuracy: 0.8174 +Epoch 2/5 +157/157 [==============================] - 13s 83ms/step - loss: 0.4312 - accuracy: 0.8445 - val_loss: 0.3694 - val_accuracy: 0.8540 +Epoch 3/5 +157/157 [==============================] - 14s 86ms/step - loss: 0.2997 - accuracy: 0.8955 - val_loss: 0.3333 - val_accuracy: 0.8680 +Epoch 4/5 +157/157 [==============================] - 15s 96ms/step - loss: 0.2499 - accuracy: 0.9133 - val_loss: 0.3078 - val_accuracy: 0.8782 +Epoch 5/5 +157/157 [==============================] - 14s 90ms/step - loss: 0.2032 - accuracy: 0.9316 - val_loss: 0.3152 - val_accuracy: 0.8806 +Test results - Loss: 0.3316078186035156 - Accuracy: 87.09200024604797% +``` + +#### TensorFlow/Keras LSTM slow on GPU + +If you face speed issues with training the TensorFlow LSTM on your GPU, you might decide to temporarily disable its access to your GPUs by adding the following _before_ `model.fit`: + +``` +import os +os.environ['CUDA_VISIBLE_DEVICES'] = '-1' +``` + +* * * + +## Summary + +Long Short-Term Memory Networks (LSTMs) are a type of recurrent neural network that can be used in Natural Language Processing, time series and other sequence modeling tasks. In this article, we covered their usage within TensorFlow and Keras in a step-by-step fashion. + +We first briefly looked at LSTMs in general. What are they? What can they be used for? How do they improve compared to previous RNN based approaches? This analysis gives you the necessary context in order to understand what is going on within your code. + +We then looked at how LSTMs are represented in TensorFlow and Keras. We saw that there is a separate `LSTM` layer that can be configured with a wide variety of attributes. In the article, we looked at the meaning for each attribute and saw how everything interrelates. Once understanding this, we moved on to actually implementing the model with TensorFlow. In a step-by-step phased approach, we explained in detail why we made certain choices, allowing you to see exactly how the model was constructed. + +After training on the IMDB dataset, we saw that the model achieves an accuracy of approximately 87.1% on the evaluation set. + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope that you have learned something from this article. If you did, please feel free to drop a message, as I'd love to hear from you 💬 Please do the same if you have any questions, or click the **Ask Questions** button to the right. Thank you for reading MachineCurve today and happy engineering! 
😎 + +* * * + +## References + +Keras Team. (n.d.). _Keras documentation: The sequential class_. Keras: the Python deep learning API. [https://keras.io/api/models/sequential/](https://keras.io/api/models/sequential/) + +MachineCurve. (2020, December 29). _A gentle introduction to long short-term memory networks (LSTM)_. [https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/) + +TensorFlow. (n.d.). _Tf.keras.layers.LSTM_. [https://www.tensorflow.org/api\_docs/python/tf/keras/layers/LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) + +TensorFlow. (n.d.). _Tf.keras.losses.BinaryCrossentropy_. [https://www.tensorflow.org/api\_docs/python/tf/keras/losses/BinaryCrossentropy](https://www.tensorflow.org/api_docs/python/tf/keras/losses/BinaryCrossentropy) + +TensorFlow. (n.d.). _Tf.keras.optimizers.Adam_. [https://www.tensorflow.org/api\_docs/python/tf/keras/optimizers/Adam](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) + +TensorFlow. (n.d.). _Tf.keras.layers.SimpleRNN_. [https://www.tensorflow.org/api\_docs/python/tf/keras/layers/SimpleRNN](https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNN) diff --git a/building-a-decision-tree-for-classification-with-python-and-scikit-learn.md b/building-a-decision-tree-for-classification-with-python-and-scikit-learn.md new file mode 100644 index 0000000..b1b8a00 --- /dev/null +++ b/building-a-decision-tree-for-classification-with-python-and-scikit-learn.md @@ -0,0 +1,402 @@ +--- +title: "Building a Decision Tree for classification with Python and Scikit-learn" +date: "2022-01-23" +categories: + - "geen-categorie" +tags: + - "decision-tree" + - "decision-trees" + - "machine-learning" + - "python" + - "scikit-learn" + - "traditional-machine-learning" +--- + +Although we hear a lot about deep learning these days, there is a wide variety of other machine learning techniques that can still be very useful. Decision tree learning is one of them. By recursively partitioning your feature space into segments that group common elements yielding a class outcome together, it becomes possible to build predictive models for both classification and regression. + +In today's tutorial, you will learn to build a decision tree for classification. You will do so using Python and one of the key machine learning libraries for the Python ecosystem, _Scikit-learn_. After reading it, you will understand... + +- **What decision trees are.** +- **How the CART algorithm can be used for decision tree learning.** +- **How to build a decision tree with Python and Scikit-learn.** + +Are you ready? Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## What are decision trees? + +Suppose that you have a dataset that describes wines in many columns, and the wine variety in the last column. + +These independent variables can be used to build a predictive model that, given some new inputs, tells us whether a specific measurement comes from a wine of variety one, two or three, ... + +As you already know, there are many techniques for building predictive models. Deep neural networks are very popular these days, but there are also approaches that are a bit more _classic_ - but not necessarily wrong. + +Decision trees are one such technique. They essentially work by breaking down the decision-making process into many smaller questions. 
In the wine scenario, as an example, you know that wines can be separated by color. This distinguishes between wine varieties that make _white wine_ and varieties that make _red wine_. There are more such questions that can be asked: what is the alcohol content? What is the magnesium content? And so forth. + +![](images/tree-1024x535.png) + +An example of a decision tree. Each variety (there are three) represents a different color - orange, green and purple. Both color and color intensity point towards an estimated class given a sub question stage. For example, the first question points towards class 2, the path of which gets stronger over time. Still, it is possible to end up with both class 1 and class 3 - by simply taking the other path or diverting down the road. + +By structuring these questions in a smart way, you can separate the classes (in this case, the varieties) by simply providing answers that point you to a specific variety. And precisely that is what _decision trees_ are: they are tree-like structures that break your classification problem into many smaller sub questions given the inputs you have. + +Decision trees can be constructed manually. More relevant however is the automated construction of decision trees. And that is precisely what you will be looking at today, by building one with Scikit-learn. + +* * * + +## How are decision tree classifiers learned in Scikit-learn? + +In today's tutorial, you will be building a decision tree for classification with the `DecisionTreeClassifier` class in Scikit-learn. When learning a decision tree, it follows the **Classification And Regression Trees** or **CART** algorithm - at least, an optimized version of it. Let's first take a look at how this algorithm works, before we build a classification decision tree. + +### Learning a CART tree + +At a high level, a CART tree is built in the following way, using some _split evaluation criterion_ (we will cover that in a few moments): + +1. Compute all splits that can be made (often, this is a selection over the entire feature space). In other words, do this for each of the independent variables, and a target value. For example, in the tree above, "Proline <= 755.0" in the root node is one such split at the first level. It's the _proline_ variable, with _755.0_ as the target value. +2. For each split, compute the value of the _split evaluation criterion_. +3. Pick the one with the best value as the split. +4. Repeat this process for the next level, until split generation is exhausted (by either a lack of further independent variables or a user-constrained depth of the decision tree). + +In other words, the decision tree learning process is a recursive process that picks the best split at each level for building the tree until the tree is exhausted or a user-defined criterion (such as maximum tree depth) is reached). + +Now, regarding the split evaluation criterion, Scikit-learn based CART trees use two types of criterions: the **Gini impurity** and the **entropy** metrics. + +### Gini impurity + +The first - and default - split evaluation metric available in Scikit's decision tree learner is Gini impurity: + +![](images/1*DJ_5UG9hn1ppqZ8XOdfpzw.png) + +The metric is defined in the following way: + +> Gini impurity (named after Italian mathematician Corrado Gini) is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. +> +> Wikipedia (2004) + +Suppose that we... 
+ +- Pick a random sample. +- Assign a random class. + +What is the **probability that we classify it wrongly**? That's the Gini impurity for the specific sample. + +#### Random classification + +For example, if we have 100 samples, where 25 belong to class A and 75 to class B, these are our probabilities: + +- **Pick A _and_ classify A**: 25/100 x 25/100 = 6.25% +- **Pick A** _**and**_ **classify B**: 25/100 x 75/100 = 18.75% +- **Pick B _and_ classify A:** 75/100 x 25/100 = 18.75% +- **Pick B _and_ classify B:** 75/100 x 75/100 = 56.25%. + +So, what's the probability of **classifying it wrongly?** + +That's 18.75 + 18.75 = 37.5%. In other words, the Gini impurity of this data scenario with random classification is 0.375. + +By _minimizing the Gini impurity_ of the scenario, we get the best classification for our selection. + +#### Adding a split + +Suppose that instead of randomly classifying our samples, we add a _decision boundary_. In other words, we split our sample space in two, or in other words, we add a split. + +We can simply compute the Gini impurity of this split by computing a weighted average of the Gini impurities of both sides of the split. + +Suppose that we add the following split to the very simple two-dimensional dataset below, generated by the OPTICS clustering algorithm: + +![](images/afbeelding.png) + +Now, for both sides of the split, we repeat the same: + +- Pick a random sample. +- Classify it randomly given the available classes. + +On the left, you can clearly see that Gini impurity is 0: if we pick a sample, it can be classified as blue only, because the only class available in that side is blue. + +On the right, impurity is very low, but not zero: there are some blue samples available, and Gini impurity is approximately 0.00398. + +Clearly, a better split is available at `X[0] ~ 5`, where Gini impurity would be 0... ;-) But this is just for demonstrative purposes! + +#### Now, how good is a split? + +Now that you understand how Gini impurity can be computed given a split, we can look at the final aspect of computing the _goodness-of-split_ using Gini impurity...how to decide about the contribution of a split? + +At each level of your decision tree, you know the following: + +- The current Gini impurity, given your previous levels (at the root level, that is 0, obviously). +- The possible splits and their Gini impurities. + +Picking the best split now involves picking the split with the greatest reduction in total Gini impurity. This can be computed by the weighted average mentioned before. In the case above... + +- We have 498 samples on the left with a Gini impurity of 0. +- We have 502 samples on the right with a Gini impurity of 0.00398. +- Total reduction of Gini impurity given this split would be (498/1000) \* 0 + (502/1000) \* 0.00398 = 0.00199796. + +If this is the _greatest_ reduction of Gini impurity (by computing the difference between existing impurity and resulting impurity), then it's the split to choose! :) + +### Entropy + +A similar but slightly different metric that can be used is that of entropy: + +![](images/1*GiGjVirI86xCX_gX_eng2Q.png) + +For using entropy, you'll have to repeat all the steps executed above. Then, it simply boils down to adding the probabilities computed above into the formula... and you pick the split that yields lowest entropy. + +### Choosing between Gini impurity and entropy + +Model performance-wise, there is little reason to choose between Gini impurity and entropy. 
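If you want to check these numbers yourself, here is a small, self-contained Python sketch - it is not part of the Scikit-learn tutorial below, just an illustration - that recomputes the 25/75 example and the weighted impurity of the split above for both criteria:

```
from math import log2

def gini(proportions):
    """ Gini impurity: the probability of mislabeling a random sample
        when labels are drawn from the class distribution. """
    return 1.0 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    """ Shannon entropy (in bits) of the class distribution. """
    return -sum(p * log2(p) for p in proportions if p > 0)

# The 25/75 example from above: 25 samples of class A, 75 of class B
print(gini([0.25, 0.75]))     # 0.375, matching the manual computation
print(entropy([0.25, 0.75]))  # ~0.811 bits

# Weighted impurity of the split above: 498 pure samples on the left,
# 502 samples with Gini impurity ~0.00398 on the right
weighted_gini = (498 / 1000) * 0.0 + (502 / 1000) * 0.00398
print(weighted_gini)          # ~0.002, as computed above
```

Whichever of the two criteria you use, candidate splits tend to be ranked in largely the same way, which is also what the literature suggests.
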
In an analysis work, Raileanu and Stoffel (2004) identified that... + +- There is no clear empirical difference between choosing between Gini impurity and entropy. +- That entropy might be slower to compute because it uses a logarithm. + +In other words, I would go with Gini impurity - and assume that's why it's the default option in Scikit-learn, too! :) + +* * * + +## Building a Decision Tree for classification with Scikit-learn + +Now that you understand some of the theory behind CART trees, it's time to build one such tree for classification. You will use one of the default machine learning libraries for this purpose, being Scikit-learn. It's a three-step process: + +- First, you will ensure that you have installed all dependencies necessary for running the code. +- Then, you take a look at the dataset. +- Finally, you'll build the decision tree classifier. + +### Ensure that you have installed the dependencies + +Before writing any code, it's important that you have installed all the dependencies on your machine: + +- **Python**. It's important to run a recent version of Python, at least 3+. +- **Scikit-learn**. Being one of the key libraries for traditional machine learning algorithms, Scikit-learn is still widely used within these machine learning communities. Ensure that you can use its functionality by having it installed via `pip install -U scikit-learn`. +- **Matplotlib**. You will also need to visualize some results (being the learned tree). Ensure that you have Matplotlib installed as well, via `pip install matplotlib`. + +### Today's dataset + +If you have been a frequent reader of MachineCurve tutorials, you know that I favor out-of-the-box datasets that come preinstalled with machine learning libraries used during tutorials. + +That's very simple - although in the real world data is _key_ to success, these tutorials are meant to tell you something about the models you're building and hence lengthy sections on datasets can be distracting. + +For that reason, today, you will be using one of the datasets that comes with Scikit-learn out of the box: the **wine dataset**. + +> The wine dataset is a classic and very easy multi-class classification dataset. +> +> Scikit-learn + +It is a dataset with 178 samples and 13 attributes that assigns each sample to a wine variety (indeed, we're using a dataset similar to what you have read about before!). The dataset has 3 wine varieties. These are the attributes that are part of the wine dataset: + +- Alcohol +- Malic acid +- Ash +- Alcalinity of ash +- Magnesium +- Total phenols +- Flavanoids +- Nonflavanoid phenols +- Proanthocyanins +- Color intensity +- Hue +- OD280/OD315 of diluted wines +- Proline + +In other words, in the various dimensions of the _independent variables_, many splits can be made using which many Gini impurity/entropy values can be computed... after which we can choose the best split every time. + +### Specifying the Python imports + +Now that you understand something about decision tree learning and today's dataset, let's start writing some code. Open up a Python file in your favorite IDE or create a Jupyter Notebook, and let's add some imports: + +``` +from sklearn.datasets import load_wine +from sklearn import tree +import matplotlib.pyplot as plt +``` + +These imports speak pretty much for themselves. The first is related to the dataset that you will be using. The second is the representation of decision trees within Scikit-learn, and the latter one is the PyPlot functionality from Matplotlib. 
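Before moving on, you can optionally verify the dataset characteristics mentioned above - 178 samples, 13 attributes and 3 varieties - with a few extra lines. This snippet is only a sanity check and is not part of the model code that follows:

```
from sklearn.datasets import load_wine

# Optional sanity check of the wine dataset used in this tutorial
X, y = load_wine(return_X_y=True)
print(X.shape)         # (178, 13): 178 samples, 13 attributes
print(sorted(set(y)))  # [0, 1, 2]: three wine varieties
```
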
+ +### Loading our dataset + +In Python, it's good practice to work with _definitions_. They make code reusable and allow you to structure your code into logical segments. In today's model, you will apply these definitions too. + +The first one that you will create is one for loading your dataset. It simply calls `load_wine(...)` and passes the `return_X_y` attribute set to `True`. This way, your dataset will be returned in two separate lists - `X` and `y`. + +``` +def load_dataset(): + """ Load today's dataset. """ + return load_wine(return_X_y=True) +``` + +### Defining feature and class names + +Next up, you will specify a definition that returns names of the features (the independent variables) and the eventual class names. + +``` +def feature_and_class_names(): + """ Define feature and class names. """ + feature_names = ["Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Flavanoids","Nonflavanoid phenols","Proanthocyanins","Color intensity","Hue","OD280/OD315 of diluted wines","Proline",] + class_names = ["Class 1", "Class 2", "Class 3"] + + return feature_names, class_names +``` + +### Initializing the classifier and fitting the data + +Per the Scikit-learn documentation of the `DecisionTreeClassifier` model type that you will use, there are some options that you must include in your model design. These are the options that are configurable. + +- **criterion** {“gini”, “entropy”}, default=”gini” + - Used for measuring the quality of a split. Like we discussed above, you can choose between Gini impurity and entropy, while often it's best to leave it configured as `gini`. +- **splitter** {“best”, “random”}, default=”best” + - Used to determine how a best split is chosen. "best" here represents the best split, whereas "random" represents the best random split. +- **max\_depth** int, default=None + - The maximum number of levels of your decision tree. Can be used to limit the depth of your tree, to avoid overfitting. +- **min\_samples\_split** int or float, default=2 + - The number of samples that must be available to split an internal node. +- **min\_samples\_leaf** int or float, default=1 + - The number of samples that must be available for letting a node be a leaf node. +- **min\_weight\_fraction\_leaf** float, default=0.0 + - The minimum weighted value of samples that must be present for a node to be a leaf node. +- **max\_features** int, float or {“auto”, “sqrt”, “log2”}, default=None + - The number of features to look at when generating splits. +- **random\_state** int, RandomState instance or None, default=None + - A random seed that you can use to make the behavior of your fitting process deterministic. +- **max\_leaf\_nodes** int, default=None + - The maximum number of leaf nodes that you allow in your tree. +- **min\_impurity\_decrease** float, default=0.0 + - A split will be considered only when the impurity / entropy decrease is equal to or larger than the configured value. +- **class\_weight** dict, list of dict or “balanced”, default=None + - When having an imbalanced dataset, you can weigh classes according to their importance. This allows the model to better balance between the classes in an attempt to avoid overfitting. +- **ccp\_alpha** non-negative float, default=0.0 + - A pruning parameter that is not relevant for today's article. + +Let's now create a definition for initializing your decision tree. We choose Gini impurity, best splitting, and letting maximum depth be guided by the minimum of samples necessary for generating a split. 
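(As a side note: if you wanted to constrain the tree explicitly using the options listed above, initialization could look like the sketch below. The parameter values here are purely illustrative and are not used anywhere else in this tutorial.)

```
from sklearn import tree

# Illustrative only: a constrained tree with an explicit depth limit and
# minimum leaf size. These values are hypothetical and not tuned.
constrained_tree = tree.DecisionTreeClassifier(
    criterion='gini',     # split quality metric: 'gini' or 'entropy'
    splitter='best',      # evaluate candidate splits and pick the best one
    max_depth=4,          # limit the number of levels to reduce overfitting
    min_samples_leaf=5,   # require at least 5 samples in every leaf node
    random_state=42       # make the fitting process deterministic
)
```
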
In other words, we accept the risk of overfitting in order to keep the configuration simple. In practice, that wouldn't be a good idea - but for this tutorial, a tree built with default settings is fine.

```
def init_tree():
    """ Initialize the DecisionTreeClassifier. """
    return tree.DecisionTreeClassifier()
```

Then, we can add a definition for training the tree:

```
def train_tree(empty_tree, X, Y):
    """ Train the DecisionTreeClassifier. """
    return empty_tree.fit(X, Y)
```

### Plotting the decision tree

Finally, what's left is a definition for plotting the decision tree:

```
def plot_tree(trained_tree):
    """ Plot the DecisionTreeClassifier. """

    # Load feature and class names
    feature_names, class_names = feature_and_class_names()

    # Plot tree
    tree.plot_tree(trained_tree, feature_names=feature_names, class_names=class_names, fontsize=12, rounded=True, filled=True)
    plt.show()
```

### Merging everything together

Then, you merge everything together ...

- In a definition, you load the dataset, initialize the tree, train the tree, and plot the trained tree.
- You then call this definition when your Python script starts.

```
def decision_tree_classifier():
    """ End-to-end training of decision tree classifier. """

    # Load dataset
    X, Y = load_dataset()

    # Train the decision tree
    tree = init_tree()
    trained_tree = train_tree(tree, X, Y)

    # Plot the trained decision tree
    plot_tree(trained_tree)


if __name__ == '__main__':
    decision_tree_classifier()
```

### Full model code

If you want to get started immediately, here is the full code example for creating a classification decision tree with Scikit-learn.

```
from sklearn.datasets import load_wine
from sklearn import tree
import matplotlib.pyplot as plt

def load_dataset():
    """ Load today's dataset. """
    return load_wine(return_X_y=True)


def feature_and_class_names():
    """ Define feature and class names. """
    feature_names = ["Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Flavanoids","Nonflavanoid phenols","Proanthocyanins","Color intensity","Hue","OD280/OD315 of diluted wines","Proline",]
    class_names = ["Class 1", "Class 2", "Class 3"]

    return feature_names, class_names


def init_tree():
    """ Initialize the DecisionTreeClassifier. """
    return tree.DecisionTreeClassifier()


def train_tree(empty_tree, X, Y):
    """ Train the DecisionTreeClassifier. """
    return empty_tree.fit(X, Y)


def plot_tree(trained_tree):
    """ Plot the DecisionTreeClassifier. """

    # Load feature and class names
    feature_names, class_names = feature_and_class_names()

    # Plot tree
    tree.plot_tree(trained_tree, feature_names=feature_names, class_names=class_names, fontsize=12, rounded=True, filled=True)
    plt.show()


def decision_tree_classifier():
    """ End-to-end training of decision tree classifier. """

    # Load dataset
    X, Y = load_dataset()

    # Train the decision tree
    tree = init_tree()
    trained_tree = train_tree(tree, X, Y)

    # Plot the trained decision tree
    plot_tree(trained_tree)


if __name__ == '__main__':
    decision_tree_classifier()
```

* * *

## References

Scikit-learn. (n.d.). _1.10. Decision trees — scikit-learn 0.24.0 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved January 21, 2022, from [https://scikit-learn.org/stable/modules/tree.html](https://scikit-learn.org/stable/modules/tree.html)

Scikit-learn. (n.d.). _Sklearn.tree.DecisionTreeClassifier_. scikit-learn.
Retrieved January 21, 2022, from [https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) + +Scikit-learn. (n.d.). _Sklearn.datasets.load\_wine_. scikit-learn. Retrieved January 21, 2022, from [https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load\_wine.html#sklearn.datasets.load\_wine](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine) + +Wikipedia. (2004, April 5). _Decision tree learning_. Wikipedia, the free encyclopedia. Retrieved January 22, 2022, from [https://en.wikipedia.org/wiki/Decision\_tree\_learning](https://en.wikipedia.org/wiki/Decision_tree_learning) + +Raileanu, L. E., & Stoffel, K. (2004). Theoretical comparison between the Gini index and information gain criteria. _Annals of Mathematics and Artificial Intelligence_, _41_(1), 77-93. [https://doi.org/10.1023/b:amai.0000018580.96245.c6](https://doi.org/10.1023/b:amai.0000018580.96245.c6) diff --git a/building-a-simple-vanilla-gan-with-pytorch.md b/building-a-simple-vanilla-gan-with-pytorch.md new file mode 100644 index 0000000..bf00aee --- /dev/null +++ b/building-a-simple-vanilla-gan-with-pytorch.md @@ -0,0 +1,816 @@ +--- +title: "Building a simple vanilla GAN with PyTorch" +date: "2021-07-17" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "computer-vision" + - "deep-learning" + - "discriminator" + - "gan" + - "gans" + - "generative-adversarial-networks" + - "generative-ml" + - "generative-models" + - "generator" + - "machine-learning" + - "mnist" +--- + +Using a **Generative Adversarial Model**, or a GAN, it is possible to perform generative Machine Learning. In other words, you can ensure that a model learns to produce new data, such as images. + +Like these: + +- ![](images/epoch36_batch50.jpg) + +- ![](images/epoch36_batch0.jpg) + +- ![](images/epoch30_batch50.jpg) + + +In today's article, you will create a **simple GAN**, also called a _vanilla GAN_. It resembles the Generative Adversarial Network first created by Goodfellow et al. (2014). After reading this article, you will... + +- **Understand what a GAN is and how it works.** +- **Be capable of building a simple GAN with Python and PyTorch.** +- **Have produced your first GAN results**. + +Let's take a look! :) + +* * * + +\[toc\] + +* * * + +## What is a GAN? + +Before we start building our simple GAN, it may be a good idea to briefly recap what GANs are. Make sure to read the [gentle introduction to GANs](https://www.machinecurve.com/index.php/2021/03/23/generative-adversarial-networks-a-gentle-introduction/) if you wish to understand their behavior in more detail. However, we'll also cover things here briefly. Let's take a look at the generic architecture of a GAN: + +![This image has an empty alt attribute; its file name is GAN-1024x431.jpg](images/GAN-1024x431.jpg) + +You'll see that a GAN is composed of two separate models. The first, being called the **Generator**, learns to convert a sample of noise (often drawn from a standard normal distribution) into a fake image. This image is then fed to the **Discriminator**, which judges whether the image is fake or real. Using the loss that emerges from this judgment, the networks are optimized jointly, after which the process starts again. 
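Formally, Goodfellow et al. (2014) describe this joint optimization as a two-player minimax game between the Generator \[latex\]G\[/latex\] and the Discriminator \[latex\]D\[/latex\], with value function

\[latex\]\min\_G \max\_D V(D, G) = \mathbb{E}\_{x \sim p\_{data}(x)} \[\log D(x)\] + \mathbb{E}\_{z \sim p\_z(z)} \[\log(1 - D(G(z)))\]\[/latex\]

The Discriminator tries to push this value up by assigning high probabilities to real samples \[latex\]x\[/latex\] and low probabilities to generated samples \[latex\]G(z)\[/latex\], while the Generator tries to push it down by producing samples that the Discriminator can no longer tell apart from real ones. The training code below implements this game with binary cross-entropy loss, using the common trick - also suggested by Goodfellow et al. - of training the Generator to maximize \[latex\]\log D(G(z))\[/latex\] rather than to minimize \[latex\]\log(1 - D(G(z)))\[/latex\].
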
+ +You can also compare this process with that of a **counterfeiter** and the **police**. The Generator serves as the counterfeiter, while the task of the police is to catch them. When the police catches more counterfeit images, the counterfeiter has to learn to produce better results. This is exactly what happens: through the Discriminator becoming better in judging whether an image is fake or real, the Generator eventually becomes better in generating fake images. Consequentially, the Generator can be used independently to generate images after it has been trained. + +Now, it's time to start building the GAN. Note that more contemporary approaches, such as [DCGANs](https://www.machinecurve.com/index.php/2021/07/15/creating-dcgan-with-pytorch/), are more preferred if you wish to use your GAN in production (because of the simple reason that originally, the vanilla GAN didn't use any Convolutional layers). However, if you want to start with GANs, the example that you will produce below is a very good starting point - after which you can continue with DCGANs and further. Let's take a look! :) + +* * * + +## Simple GAN with PyTorch - fully explained code example + +Let's now take a look at building a **simple Generative Adversarial Network**, which looks like the original GAN proposed by Goodfellow et al. (2014). + +### Importing the dependencies + +When you want to run the code that you're going to create, you will need to ensure that some dependencies are installed into your environment. These dependencies are as follows: + +- A 3.x based version of **Python**, which you will use to run these scripts. +- **PyTorch** and its corresponding version of **Torchvision** for training the neural networks with MNIST data. +- **NumPy** for numbers processing. +- **Matplotlib** for visualizing images. + +Now, create a Python file or Python-based Notebook, with the following imports: + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +import numpy as np +import matplotlib.pyplot as plt +import uuid +``` + +For some Operating System functions, you will need `os`. `uuid` will be used for generating a unique run identifier, which will be useful for saving intermediate models and generated images; i.e., for housekeeping. `torch` will be used for training the neural network, and hence you will need to import its `nn` library. The `MNIST` dataset will be used and hence requires import, and it will be loaded with the `DataLoader`. Finally, when loading the data, you will convert it into Tensor format and normalize the images, requiring `transforms`. Finally, for number processing and visualization, you'll need `numpy` and `matplotlib.pyplot`. + +### Configuration variables + +Now that you have specified the imports, it's time to pin down the configurable variables that will be used throughout the training process. Here's what you will create and why you'll need it: + +- **The number of epochs:** each training process contains a fixed number of iterations through the entire training set, the _number of epochs_. We set it to 50, but you can choose any number. Note that 50 will produce an acceptable result; more may improve the results even further. +- **The noise dimension:** recall that the Generator will be fed a variable that serves as a sample from a multidimensional latent distribution. 
This is a complicated way of saying that we sample from a landscape that will eventually take a shape such that the Generator produces good examples. The dimensionality of this landscape - and hence of the vectors sampled from it - is defined by `NOISE_DIMENSION`.
- **The batch size:** within an epoch, we feed forward the data through the network in batches - i.e., not all at once. The reason is simple: all the data would not fit in memory otherwise. We set the batch size to 128 samples, but this can be higher, depending on the hardware in your system.
- **Training on GPU, yes or no:** depending on the availability of a GPU, you can choose to use it for training - otherwise, your CPU will be used.
- **A unique run identifier:** related to housekeeping. You will see that during the training process, intermediate models and images will be stored on disk so that you can keep track of training progress. A folder with a _unique_ identifier will be created for this purpose; hence the `UNIQUE_RUN_ID`.
- **Print stats after n-th batch:** after feeding forward minibatches through the network, statistics will be printed after every `n-th` batch. Currently, we set it to 50.
- The **optimizer learning rate** and **optimizer betas**. The optimizers for the Generator and Discriminator will be initialized with a learning rate and Beta values. We set them to values that are known from previous research to produce acceptable results.
- The **output shape of the Generator** will be used to initialize the last layer of the Generator and the first layer of the Discriminator. It must be the product of all shape dimensions of an individual image. In our case, the MNIST dataset has `28x28x1` images, so `28 * 28 * 1 = 784`.

```
# Configurable variables
NUM_EPOCHS = 50
NOISE_DIMENSION = 50
BATCH_SIZE = 128
TRAIN_ON_GPU = True
UNIQUE_RUN_ID = str(uuid.uuid4())
PRINT_STATS_AFTER_BATCH = 50
OPTIMIZER_LR = 0.0002
OPTIMIZER_BETAS = (0.5, 0.999)
GENERATOR_OUTPUT_IMAGE_SHAPE = 28 * 28 * 1
```

### PyTorch speedups

There are some ways to make your PyTorch code [run faster](https://betterprogramming.pub/how-to-make-your-pytorch-code-run-faster-93079f3c1f7b); that's why you'll write these speedups next.

```
# Speed ups
torch.autograd.set_detect_anomaly(False)
torch.autograd.profiler.profile(False)
torch.autograd.profiler.emit_nvtx(False)
torch.backends.cudnn.benchmark = True
```

### Building the Generator

Now that we have written some preparatory code, it's time to build the actual Generator! Contrary to the [Deep Convolutional GAN](https://www.machinecurve.com/index.php/2021/07/15/creating-dcgan-with-pytorch/), which builds on the _vanilla GAN_ that you will create today, this Generator does not use [Convolutional layers](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/).
Here's the code for the Generator: + +``` +class Generator(nn.Module): + """ + Vanilla GAN Generator + """ + def __init__(self,): + super().__init__() + self.layers = nn.Sequential( + # First upsampling + nn.Linear(NOISE_DIMENSION, 128, bias=False), + nn.BatchNorm1d(128, 0.8), + nn.LeakyReLU(0.25), + # Second upsampling + nn.Linear(128, 256, bias=False), + nn.BatchNorm1d(256, 0.8), + nn.LeakyReLU(0.25), + # Third upsampling + nn.Linear(256, 512, bias=False), + nn.BatchNorm1d(512, 0.8), + nn.LeakyReLU(0.25), + # Final upsampling + nn.Linear(512, GENERATOR_OUTPUT_IMAGE_SHAPE, bias=False), + nn.Tanh() + ) + + def forward(self, x): + """Forward pass""" + return self.layers(x) +``` + +You can see that it is a regular PyTorch [`nn.Module` class](https://www.machinecurve.com/index.php/2021/01/26/creating-a-multilayer-perceptron-with-pytorch-and-lightning/) and hence performs a `forward` pass by simply feeding the data to a model, specified in `self.layers` as a `nn.Sequential` based neural network. In our case, you will write four upsampling blocks. The intermediate blocks consist of a `nn.Linear` (or densely-connected) layer, a `BatchNorm1d` layer for Batch Normalization, and Leaky ReLU. Bias is set to `False` because the Batch Norm layers nullify it. + +The final upsampling layer converts the intermediate amount of neurons of already 512 into `GENERATOR_OUTPUT_IMAGE_SHAPE`, which is `28 * 28 * 1 = 784`. With Tanh, the outputs are normalized to the range `[-1, 1]`. + +### Building the Discriminator + +The Discriminator is even simpler than the Generator. It is a separate neural network, as you can see by its `nn.Module` class definition. It simply composes a fully-connected neural network that accepts an input of dimensionality `GENERATOR_OUTPUT_IMAGE_SHAPE` (i.e., a Generator output) and converts it into a `[0, 1]` Sigmoid-normalized prediction as to whether the image is real or fake. + +``` +class Discriminator(nn.Module): + """ + Vanilla GAN Discriminator + """ + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(GENERATOR_OUTPUT_IMAGE_SHAPE, 1024), + nn.LeakyReLU(0.25), + nn.Linear(1024, 512), + nn.LeakyReLU(0.25), + nn.Linear(512, 256), + nn.LeakyReLU(0.25), + nn.Linear(256, 1), + nn.Sigmoid() + ) + + def forward(self, x): + """Forward pass""" + return self.layers(x) +``` + +### Combining everything into one + +Okay, we now have two different neural networks, a few imports and some configuration variables. Time to combine everything into one! Let's start with writing some housekeeping functions. + +#### Housekeeping functions + +Recall that you read before that intermediate models would be saved in a folder, and that images would be generated as well. While we will actually _implement_ these calls later, i.e. use them, you're already going to write them now. Our housekeeping functions contain five definitions: + +1. **Getting the device**. Recall that you specified `True` or `False` for `TRAIN_ON_GPU`. This definition will check whether you want to use the GPU and whether it is avilable, and instructs PyTorch to use your CPU otherwise. +2. **Making the directory for a run** utilizes the `UNIQUE_RUN_ID` to generate a directory for the unique run. +3. **Generating the images** will generate 16 examples using some Generator (usually, the Generator that you will have trained most recently) and store them to disk. +4. **Saving the models** saves the current state of the Generator and Discriminator to disk. +5. 
**Printing training progress** prints the current loss values on screen. + +``` +def get_device(): + """ Retrieve device based on settings and availability. """ + return torch.device("cuda:0" if torch.cuda.is_available() and TRAIN_ON_GPU else "cpu") + + +def make_directory_for_run(): + """ Make a directory for this training run. """ + print(f'Preparing training run {UNIQUE_RUN_ID}') + if not os.path.exists('./runs'): + os.mkdir('./runs') + os.mkdir(f'./runs/{UNIQUE_RUN_ID}') + + +def generate_image(generator, epoch = 0, batch = 0, device=get_device()): + """ Generate subplots with generated examples. """ + images = [] + noise = generate_noise(BATCH_SIZE, device=device) + generator.eval() + images = generator(noise) + plt.figure(figsize=(10, 10)) + for i in range(16): + # Get image + image = images[i] + # Convert image back onto CPU and reshape + image = image.cpu().detach().numpy() + image = np.reshape(image, (28, 28)) + # Plot + plt.subplot(4, 4, i+1) + plt.imshow(image, cmap='gray') + plt.axis('off') + if not os.path.exists(f'./runs/{UNIQUE_RUN_ID}/images'): + os.mkdir(f'./runs/{UNIQUE_RUN_ID}/images') + plt.savefig(f'./runs/{UNIQUE_RUN_ID}/images/epoch{epoch}_batch{batch}.jpg') + + +def save_models(generator, discriminator, epoch): + """ Save models at specific point in time. """ + torch.save(generator.state_dict(), f'./runs/{UNIQUE_RUN_ID}/generator_{epoch}.pth') + torch.save(discriminator.state_dict(), f'./runs/{UNIQUE_RUN_ID}/discriminator_{epoch}.pth') + + +def print_training_progress(batch, generator_loss, discriminator_loss): + """ Print training progress. """ + print('Losses after mini-batch %5d: generator %e, discriminator %e' % + (batch, generator_loss, discriminator_loss)) +``` + +#### Preparing the dataset + +Okay, after housekeeping it's time to start writing functionality for preparing the dataset. This will be a multi-stage process. First, we load the `MNIST` dataset from `torchvision`. Upon loading, the smaples will be transformed into Tensor format and normalized in the range `[-1, 1]` so that they are directly compatible with the Generator-generated images. + +However, after loading all the data, we still need to batch it - recall that you will not feed all the images to the network at once, but will do so in a batched fashion. You will also shuffle the images. For the sake of PyTorch efficiency, the number of workers will be 4, and `pin_memory` is set to True. Once complete, the `DataLoader` is returned, so that it can be used. + +``` +def prepare_dataset(): + """ Prepare dataset through DataLoader """ + # Prepare MNIST dataset + dataset = MNIST(os.getcwd(), download=True, train=True, transform=transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize((0.5,), (0.5,)) + ])) + # Batch and shuffle data with DataLoader + trainloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4, pin_memory=True) + # Return dataset through DataLoader + return trainloader +``` + +#### Initialization functions + +Some other defs that you will need are related to the models, loss functions and optimizers that will be used during the joint training process. + +In `initialize_models`, youll initialize the Generator and Discriminator, move them to the device that was configured, and return it. Initializing binary cross-entropy loss will be performed in `initialize_loss`, and finally, the optimizers for both Generator and Discriminator will be initialized in `initialize_optimizers`. Once again, you will use these later. 
+ +``` +def initialize_models(device = get_device()): + """ Initialize Generator and Discriminator models """ + generator = Generator() + discriminator = Discriminator() + # Move models to specific device + generator.to(device) + discriminator.to(device) + # Return models + return generator, discriminator + + +def initialize_loss(): + """ Initialize loss function. """ + return nn.BCELoss() + + +def initialize_optimizers(generator, discriminator): + """ Initialize optimizers for Generator and Discriminator. """ + generator_optimizer = torch.optim.AdamW(generator.parameters(), lr=OPTIMIZER_LR,betas=OPTIMIZER_BETAS) + discriminator_optimizer = torch.optim.AdamW(discriminator.parameters(), lr=OPTIMIZER_LR,betas=OPTIMIZER_BETAS) + return generator_optimizer, discriminator_optimizer +``` + +#### Forward and backward pass + +Using the initialized models, you will perform a forward and a backward pass. For this, and the training step as a whole, you'll need three defs that will be created next. The fist, `generate_noise`, is used to generate `number_of_images` noise vectors of `noise_dimension` dimensionality, onto the device that you configured earlier. + +Efficiently zeroing the gradients must be done at the start of each training step and will be done by calling `efficient_zero_grad()`. Finally, using `forward_and_backward`, a forward _and_backward pass will be computed using some model, loss function, data and corresponding targets. The numeric value for loss is then returned. + +``` +def generate_noise(number_of_images = 1, noise_dimension = NOISE_DIMENSION, device=None): + """ Generate noise for number_of_images images, with a specific noise_dimension """ + return torch.randn(number_of_images, noise_dimension, device=device) + + +def efficient_zero_grad(model): + """ + Apply zero_grad more efficiently + Source: https://betterprogramming.pub/how-to-make-your-pytorch-code-run-faster-93079f3c1f7b + """ + for param in model.parameters(): + param.grad = None + + +def forward_and_backward(model, data, loss_function, targets): + """ + Perform forward and backward pass in a generic way. Returns loss value. + """ + outputs = model(data) + error = loss_function(outputs, targets) + error.backward() + return error.item() +``` + +#### Performing a training step + +Now that we have defined our functions for the forward and the backward pass, it's time to create one for performing a training step. + +Recall that a training step for a GAN involves multiple forward and backward passes: one with real images using the Discriminator and one with fake images using the Discriminator, after which it is optimized. Then, the fake images are used again for optimizing the Generator. + +Below, you will code this process into four intermediate steps. First of all, you'll prepare a few things, such as setting label values for real and fake data. In the second step, the Discriminator is trained, followed by the Generator in the third. Finally, you'll merge together some loss values, and return them, in the fourth step. + +``` +def perform_train_step(generator, discriminator, real_data, \ + loss_function, generator_optimizer, discriminator_optimizer, device = get_device()): + """ Perform a single training step. """ + + # 1. PREPARATION + # Set real and fake labels. + real_label, fake_label = 1.0, 0.0 + # Get images on CPU or GPU as configured and available + # Also set 'actual batch size', whih can be smaller than BATCH_SIZE + # in some cases. 
+ real_images = real_data[0].to(device) + actual_batch_size = real_images.size(0) + label = torch.full((actual_batch_size,1), real_label, device=device) + + # 2. TRAINING THE DISCRIMINATOR + # Zero the gradients for discriminator + efficient_zero_grad(discriminator) + # Forward + backward on real images, reshaped + real_images = real_images.view(real_images.size(0), -1) + error_real_images = forward_and_backward(discriminator, real_images, \ + loss_function, label) + # Forward + backward on generated images + noise = generate_noise(actual_batch_size, device=device) + generated_images = generator(noise) + label.fill_(fake_label) + error_generated_images =forward_and_backward(discriminator, \ + generated_images.detach(), loss_function, label) + # Optim for discriminator + discriminator_optimizer.step() + + # 3. TRAINING THE GENERATOR + # Forward + backward + optim for generator, including zero grad + efficient_zero_grad(generator) + label.fill_(real_label) + error_generator = forward_and_backward(discriminator, generated_images, loss_function, label) + generator_optimizer.step() + + # 4. COMPUTING RESULTS + # Compute loss values in floats for discriminator, which is joint loss. + error_discriminator = error_real_images + error_generated_images + # Return generator and discriminator loss so that it can be printed. + return error_generator, error_discriminator +``` + +#### Performing an epoch + +Recall that training the GAN consists of multiple epochs which themselves consist of multiple training steps. Now that you have written some code for an individual training step, it's time that you write code for performing an epoch. As you can see below, you'll iterate over the batches that are created by the `DataLoader`. Using each batch, a training step is performed, and statistics are printed if necessary. + +After every epoch, the models are saved, and CUDA memory is cleared. + +``` +def perform_epoch(dataloader, generator, discriminator, loss_function, \ + generator_optimizer, discriminator_optimizer, epoch): + """ Perform a single epoch. """ + for batch_no, real_data in enumerate(dataloader, 0): + # Perform training step + generator_loss_val, discriminator_loss_val = perform_train_step(generator, \ + discriminator, real_data, loss_function, \ + generator_optimizer, discriminator_optimizer) + # Print statistics and generate image after every n-th batch + if batch_no % PRINT_STATS_AFTER_BATCH == 0: + print_training_progress(batch_no, generator_loss_val, discriminator_loss_val) + generate_image(generator, epoch, batch_no) + # Save models on epoch completion. + save_models(generator, discriminator, epoch) + # Clear memory after every epoch + torch.cuda.empty_cache() +``` + +#### Starting the training process + +Finally - the last definition! + +In this definition, you will merge everything together, so that training can actually be performed. + +First of all, you'll ensure that a new directory is created for this unique run. Then, you'll set the seed for the random number generator to a fixed number, so that variability in the initialization vector cannot be the cause of any oddities. Then, you'll retrieve the prepared (i.e. shuffled and batched) dataset; initialize the models, loss and optimizers; and finally train the model by iterating for the number of epochs specified. + +To ensure that your script starts running, you'll call `train_dcgan()` as the last part of your code. + +``` +def train_dcgan(): + """ Train the DCGAN. 
""" + # Make directory for unique run + make_directory_for_run() + # Set fixed random number seed + torch.manual_seed(42) + # Get prepared dataset + dataloader = prepare_dataset() + # Initialize models + generator, discriminator = initialize_models() + # Initialize loss and optimizers + loss_function = initialize_loss() + generator_optimizer, discriminator_optimizer = initialize_optimizers(generator, discriminator) + # Train the model + for epoch in range(NUM_EPOCHS): + print(f'Starting epoch {epoch}...') + perform_epoch(dataloader, generator, discriminator, loss_function, \ + generator_optimizer, discriminator_optimizer, epoch) + # Finished :-) + print(f'Finished unique run {UNIQUE_RUN_ID}') + + +if __name__ == '__main__': + train_dcgan() +``` + +### Python GAN - full code example + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +import numpy as np +import matplotlib.pyplot as plt +import uuid + + +# Configurable variables +NUM_EPOCHS = 50 +NOISE_DIMENSION = 50 +BATCH_SIZE = 128 +TRAIN_ON_GPU = True +UNIQUE_RUN_ID = str(uuid.uuid4()) +PRINT_STATS_AFTER_BATCH = 50 +OPTIMIZER_LR = 0.0002 +OPTIMIZER_BETAS = (0.5, 0.999) +GENERATOR_OUTPUT_IMAGE_SHAPE = 28 * 28 * 1 + + +# Speed ups +torch.autograd.set_detect_anomaly(False) +torch.autograd.profiler.profile(False) +torch.autograd.profiler.emit_nvtx(False) +torch.backends.cudnn.benchmark = True + + +class Generator(nn.Module): + """ + Vanilla GAN Generator + """ + def __init__(self,): + super().__init__() + self.layers = nn.Sequential( + # First upsampling + nn.Linear(NOISE_DIMENSION, 128, bias=False), + nn.BatchNorm1d(128, 0.8), + nn.LeakyReLU(0.25), + # Second upsampling + nn.Linear(128, 256, bias=False), + nn.BatchNorm1d(256, 0.8), + nn.LeakyReLU(0.25), + # Third upsampling + nn.Linear(256, 512, bias=False), + nn.BatchNorm1d(512, 0.8), + nn.LeakyReLU(0.25), + # Final upsampling + nn.Linear(512, GENERATOR_OUTPUT_IMAGE_SHAPE, bias=False), + nn.Tanh() + ) + + def forward(self, x): + """Forward pass""" + return self.layers(x) + + +class Discriminator(nn.Module): + """ + Vanilla GAN Discriminator + """ + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(GENERATOR_OUTPUT_IMAGE_SHAPE, 1024), + nn.LeakyReLU(0.25), + nn.Linear(1024, 512), + nn.LeakyReLU(0.25), + nn.Linear(512, 256), + nn.LeakyReLU(0.25), + nn.Linear(256, 1), + nn.Sigmoid() + ) + + def forward(self, x): + """Forward pass""" + return self.layers(x) + + +def get_device(): + """ Retrieve device based on settings and availability. """ + return torch.device("cuda:0" if torch.cuda.is_available() and TRAIN_ON_GPU else "cpu") + + +def make_directory_for_run(): + """ Make a directory for this training run. """ + print(f'Preparing training run {UNIQUE_RUN_ID}') + if not os.path.exists('./runs'): + os.mkdir('./runs') + os.mkdir(f'./runs/{UNIQUE_RUN_ID}') + + +def generate_image(generator, epoch = 0, batch = 0, device=get_device()): + """ Generate subplots with generated examples. 
""" + images = [] + noise = generate_noise(BATCH_SIZE, device=device) + generator.eval() + images = generator(noise) + plt.figure(figsize=(10, 10)) + for i in range(16): + # Get image + image = images[i] + # Convert image back onto CPU and reshape + image = image.cpu().detach().numpy() + image = np.reshape(image, (28, 28)) + # Plot + plt.subplot(4, 4, i+1) + plt.imshow(image, cmap='gray') + plt.axis('off') + if not os.path.exists(f'./runs/{UNIQUE_RUN_ID}/images'): + os.mkdir(f'./runs/{UNIQUE_RUN_ID}/images') + plt.savefig(f'./runs/{UNIQUE_RUN_ID}/images/epoch{epoch}_batch{batch}.jpg') + + +def save_models(generator, discriminator, epoch): + """ Save models at specific point in time. """ + torch.save(generator.state_dict(), f'./runs/{UNIQUE_RUN_ID}/generator_{epoch}.pth') + torch.save(discriminator.state_dict(), f'./runs/{UNIQUE_RUN_ID}/discriminator_{epoch}.pth') + + +def print_training_progress(batch, generator_loss, discriminator_loss): + """ Print training progress. """ + print('Losses after mini-batch %5d: generator %e, discriminator %e' % + (batch, generator_loss, discriminator_loss)) + + +def prepare_dataset(): + """ Prepare dataset through DataLoader """ + # Prepare MNIST dataset + dataset = MNIST(os.getcwd(), download=True, train=True, transform=transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize((0.5,), (0.5,)) + ])) + # Batch and shuffle data with DataLoader + trainloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4, pin_memory=True) + # Return dataset through DataLoader + return trainloader + + +def initialize_models(device = get_device()): + """ Initialize Generator and Discriminator models """ + generator = Generator() + discriminator = Discriminator() + # Move models to specific device + generator.to(device) + discriminator.to(device) + # Return models + return generator, discriminator + + +def initialize_loss(): + """ Initialize loss function. """ + return nn.BCELoss() + + +def initialize_optimizers(generator, discriminator): + """ Initialize optimizers for Generator and Discriminator. """ + generator_optimizer = torch.optim.AdamW(generator.parameters(), lr=OPTIMIZER_LR,betas=OPTIMIZER_BETAS) + discriminator_optimizer = torch.optim.AdamW(discriminator.parameters(), lr=OPTIMIZER_LR,betas=OPTIMIZER_BETAS) + return generator_optimizer, discriminator_optimizer + + +def generate_noise(number_of_images = 1, noise_dimension = NOISE_DIMENSION, device=None): + """ Generate noise for number_of_images images, with a specific noise_dimension """ + return torch.randn(number_of_images, noise_dimension, device=device) + + +def efficient_zero_grad(model): + """ + Apply zero_grad more efficiently + Source: https://betterprogramming.pub/how-to-make-your-pytorch-code-run-faster-93079f3c1f7b + """ + for param in model.parameters(): + param.grad = None + + +def forward_and_backward(model, data, loss_function, targets): + """ + Perform forward and backward pass in a generic way. Returns loss value. + """ + outputs = model(data) + error = loss_function(outputs, targets) + error.backward() + return error.item() + + +def perform_train_step(generator, discriminator, real_data, \ + loss_function, generator_optimizer, discriminator_optimizer, device = get_device()): + """ Perform a single training step. """ + + # 1. PREPARATION + # Set real and fake labels. + real_label, fake_label = 1.0, 0.0 + # Get images on CPU or GPU as configured and available + # Also set 'actual batch size', whih can be smaller than BATCH_SIZE + # in some cases. 
+ real_images = real_data[0].to(device) + actual_batch_size = real_images.size(0) + label = torch.full((actual_batch_size,1), real_label, device=device) + + # 2. TRAINING THE DISCRIMINATOR + # Zero the gradients for discriminator + efficient_zero_grad(discriminator) + # Forward + backward on real images, reshaped + real_images = real_images.view(real_images.size(0), -1) + error_real_images = forward_and_backward(discriminator, real_images, \ + loss_function, label) + # Forward + backward on generated images + noise = generate_noise(actual_batch_size, device=device) + generated_images = generator(noise) + label.fill_(fake_label) + error_generated_images =forward_and_backward(discriminator, \ + generated_images.detach(), loss_function, label) + # Optim for discriminator + discriminator_optimizer.step() + + # 3. TRAINING THE GENERATOR + # Forward + backward + optim for generator, including zero grad + efficient_zero_grad(generator) + label.fill_(real_label) + error_generator = forward_and_backward(discriminator, generated_images, loss_function, label) + generator_optimizer.step() + + # 4. COMPUTING RESULTS + # Compute loss values in floats for discriminator, which is joint loss. + error_discriminator = error_real_images + error_generated_images + # Return generator and discriminator loss so that it can be printed. + return error_generator, error_discriminator + + +def perform_epoch(dataloader, generator, discriminator, loss_function, \ + generator_optimizer, discriminator_optimizer, epoch): + """ Perform a single epoch. """ + for batch_no, real_data in enumerate(dataloader, 0): + # Perform training step + generator_loss_val, discriminator_loss_val = perform_train_step(generator, \ + discriminator, real_data, loss_function, \ + generator_optimizer, discriminator_optimizer) + # Print statistics and generate image after every n-th batch + if batch_no % PRINT_STATS_AFTER_BATCH == 0: + print_training_progress(batch_no, generator_loss_val, discriminator_loss_val) + generate_image(generator, epoch, batch_no) + # Save models on epoch completion. + save_models(generator, discriminator, epoch) + # Clear memory after every epoch + torch.cuda.empty_cache() + + +def train_dcgan(): + """ Train the DCGAN. """ + # Make directory for unique run + make_directory_for_run() + # Set fixed random number seed + torch.manual_seed(42) + # Get prepared dataset + dataloader = prepare_dataset() + # Initialize models + generator, discriminator = initialize_models() + # Initialize loss and optimizers + loss_function = initialize_loss() + generator_optimizer, discriminator_optimizer = initialize_optimizers(generator, discriminator) + # Train the model + for epoch in range(NUM_EPOCHS): + print(f'Starting epoch {epoch}...') + perform_epoch(dataloader, generator, discriminator, loss_function, \ + generator_optimizer, discriminator_optimizer, epoch) + # Finished :-) + print(f'Finished unique run {UNIQUE_RUN_ID}') + + +if __name__ == '__main__': + train_dcgan() +``` + +* * * + +## Results + +Now, it's time to run your model, e.g. with `python gan.py`. + +You should see that the model starts iterating relatively quickly, even on CPU. 
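Because `save_models` stores the Generator's `state_dict` after every epoch, you can also reload any checkpoint later and sample new digits without the Discriminator. Below is a minimal sketch of how that could look; it assumes that the definitions from the script above (`Generator`, `generate_noise`) are available, and the run identifier and epoch number are placeholders that you need to replace with values from your own run:

```
import torch

# Placeholders: replace with an actual run identifier and a saved epoch number
RUN_ID = 'your-unique-run-id'
EPOCH = 49

# Rebuild the Generator architecture and load the saved weights
generator = Generator()
generator.load_state_dict(torch.load(f'./runs/{RUN_ID}/generator_{EPOCH}.pth', map_location='cpu'))
generator.eval()

# Sample new noise vectors and generate a batch of fake digits
with torch.no_grad():
    noise = generate_noise(16, device=torch.device('cpu'))
    images = generator(noise).reshape(-1, 28, 28)
print(images.shape)  # torch.Size([16, 28, 28])
```
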
+ +During the first epochs, we see a quick improvement from the random noise into slightly recognizable numbers, when we open the files in the folder created for this training run: + +- [![](images/epoch0_batch0-1.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch0_batch0-1.jpg) + + Epoch 0, batch 0 + +- [![](images/epoch0_batch50-1.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch0_batch50-1.jpg) + + Epoch 0, batch 50 + +- [![](images/epoch1_batch0-1.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch1_batch0-1.jpg) + + Epoch 1, batch 0 + +- [![](images/epoch1_batch50.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch1_batch50.jpg) + + Epoch 1, batch 50 + +- [![](images/epoch2_batch0.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch2_batch0.jpg) + + Epoch 2, batch 0 + +- [![](images/epoch2_batch50.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch2_batch50.jpg) + + Epoch 2, batch 50 + +- [![](images/epoch3_batch0.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch3_batch0.jpg) + + Epoch 3, batch 0 + +- [![](images/epoch3_batch50.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch3_batch50.jpg) + + Epoch 3, batch 50 + + +Over the course of subsequent epochs, the outputs start to improve, as more and more noise disappears: + +- [![](images/epoch18_batch0.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch18_batch0.jpg) + + Epoch 18, batch 0 + +- [![](images/epoch18_batch50.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch18_batch50.jpg) + + Epoch 18, batch 50 + +- [![](images/epoch25_batch0.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch25_batch0.jpg) + + Epoch 25, batch 0 + +- [![](images/epoch25_batch50.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch25_batch50.jpg) + + Epoch 25, batch 50 + +- [![](images/epoch30_batch0.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch30_batch0.jpg) + + Epoch 30, batch 0 + +- [![](images/epoch30_batch50.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch30_batch50.jpg) + + Epoch 30, batch 50 + +- [![](images/epoch36_batch0.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch36_batch0.jpg) + + Epoch 36, batch 0 + +- [![](images/epoch36_batch50.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/07/epoch36_batch50.jpg) + + Epoch 36, batch 50 + + +Voila, your first GAN is complete! :D + +* * * + +## Sources + +Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … & Bengio, Y. (2014). [Generative adversarial networks.](https://arxiv.org/abs/1406.2661) _arXiv preprint arXiv:1406.2661_. + +MachineCurve. (2021, July 15). _Creating DCGAN with PyTorch_. 
[https://www.machinecurve.com/index.php/2021/07/15/creating-dcgan-with-pytorch/](https://www.machinecurve.com/index.php/2021/07/15/creating-dcgan-with-pytorch/) diff --git a/building-an-image-denoiser-with-a-keras-autoencoder-neural-network.md b/building-an-image-denoiser-with-a-keras-autoencoder-neural-network.md new file mode 100644 index 0000000..cce3623 --- /dev/null +++ b/building-an-image-denoiser-with-a-keras-autoencoder-neural-network.md @@ -0,0 +1,463 @@ +--- +title: "Building an Image Denoiser with a Keras autoencoder neural network" +date: "2019-12-20" +categories: + - "deep-learning" + - "frameworks" +tags: + - "autoencoder" + - "conv2dtranspose" + - "convolutional-neural-networks" + - "deep-learning" + - "denoising" + - "keras" + - "noise-removal" + - "transposed-convolution" +--- + +Images can be noisy, and you likely want to have this noise removed. Traditional noise removal filters can be used for this purpose, but they're not data-specific - and hence may remove more noise than you wish, or leave too much when you want it gone. + +Autoencoders based on neural networks can be used to _learn_ the noise removal filter based on the dataset you wish noise to disappear from. In this blog post, we'll show you what autoencoders are, why they are suitable for noise removal, and how you can create such an autoencoder with the Keras deep learning framework, providing some nice results! + +Are you ready? Let's go 😊 + +* * * + +\[toc\] + +* * * + +## Recap: autoencoders, what are they again? + +If we wish to create an autoencoder, it's wise to provide some background information about them first. If you know a thing or two about autoencoders already, it may be the case that this section is no longer relevant for you. In that case, feel free to skip it, but if you know only little about the concept of autoencoders, I'd recommend you keep reading 😀 + +This is an autoencoder at a very high level: + +![](images/Autoencoder.png) + +It contains an _encoder_, which transforms some high-dimensional input into lower-dimensional format, and a _decoder_, which can read the encoded state and convert it into something else. The encoded state is also called latent state. + +[![](images/2-300x225.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/2.png) + +_When autoencoders are used for reconstructing some input, [this is what you get](https://www.machinecurve.com/index.php/2019/12/10/conv2dtranspose-using-2d-transposed-convolutions-with-keras/)._ + +(What you must understand is that traditional autoencoders a.k.a. vanilla autoencoders cannot be used for _generative_ activity, i.e. constructing new images from some encoded state, [like a GAN](https://www.machinecurve.com/index.php/2019/07/17/this-person-does-not-exist-how-does-it-work/). This has to do with the non-restrictiveness with which the encoder learns the latent/encoded state (Shafkat, 2018). Vanilla autoencoders can however perfectly be used for noise reduction (as we will do in this blog post) and dimensionality reduction purposes.) + +Usually, neural networks are used for learning the encoder and the decoder. Depending on the data you'll feed it, different types of layers must be used. 
For example, for image data [or data that can be represented as image-like data](https://www.machinecurve.com/index.php/2019/12/19/creating-a-signal-noise-removal-autoencoder-with-keras/), you usually use two-dimensional convolutional layers for the encoder, and two-dimensional [transposed convolutions](https://www.machinecurve.com/index.php/2019/09/29/understanding-transposed-convolutions/) for the decoder segment. For simpler settings, you may choose to use Densely-connected (a.k.a. Dense) layers.

### Why autoencoders can be good denoisers

One of the main application areas for autoencoders is noise reduction (Keras Blog, n.d.). This is also called denoising, and in very well-performing cases one speaks about noise removal. But why are they so suitable for denoising? It's a valid question... let's try to find out! 😁

When looking at François Chollet's blog post **["Building Autoencoders in Keras"](https://blog.keras.io/building-autoencoders-in-keras.html)**, you can find a few key principles that tell you why autoencoders are so suitable for removing noise from signals or images (Keras Blog, n.d.). They also include why you must be careful at the same time. The principles are as follows:

- **The encoder and decoder are learnt**. Since you control your inputs and your target values before starting the training process, it's possible to learn the encoder and decoder in a way that removes noise. For training, provide noisy images as input and their corresponding noise-free images as targets, and the encoder and decoder will together learn to remove the particular noise present in your images.
- **The behavior of encoder and decoder will be lossy**. Because the autoencoder learns to convert high-dimensional data (e.g., an image) into a lower-dimensional format (i.e., the encoded/latent state), data must be dropped in order to preserve only the strongest relationships between image and encoded state. Additionally, going from latent state to output also incurs information loss. By consequence, it's important to understand that encoder and decoder will behave in a lossy way; lossless use of autoencoders is impossible.
- **The encoder and decoder are highly data-specific**. While it's possible to use mathematics-based noise removal algorithms across a wide range of denoising scenarios, you cannot use autoencoders in such a way. This is because the encoder and decoder are learnt in a highly data-specific way. Consequently, if you used one in another scenario (e.g., an autoencoder trained on MNIST data applied to noise removal on one-dimensional waveforms), results will likely be poor. This behavior emerges because the features the autoencoder is applied to were never seen during learning, and are therefore not present in the latent state space (Shafkat, 2018).

Altogether, this behavior of autoencoders makes them useful in denoising projects, if you can live with their drawbacks 😀

* * *

## Today's case

Now that we know what autoencoders are and why they can be useful, it's time to take a look at the autoencoder that we will create today. What is the data that we will use? What is our goal, and what does our model look like? Let's find out! 😎

First, we're going to discuss the dataset we're using today - which is the MNIST image dataset.

Subsequently, we cover the model and its architecture, and explain why we use certain layers.
Finally, we'll tell you what software dependencies need to be installed on your system if you want to run this model successfully. + +### The data + +[![](images/mnist-300x133.png)](https://www.machinecurve.com/wp-content/uploads/2019/07/mnist.png) + +First: the data. If you follow MachineCurve blogs regularly, you must be familiar with the MNIST dataset by now. + +MNIST (the abbreviation for **"Modified National Institute of Standards and Technology dataset"**) is an image dataset that contains thousands of 28x28 pixel one-digit images (LeCun et al., n.d.). A few samples have been visualized on the right, and they clearly show dataset contents: digits, presumably handwritten, and thus ten classes. + +The dataset contains 60.000 training samples and 10.000 testing samples. + +Today, we'll be trying to learn an _image noise remover_ (or _denoiser_) based on this dataset. This means that we'll have to add noise to the data, after which we can feed both the noisy and the pure data to an autoencoder which learns noise removal. For the sake of clarity, this is what a pure and a noisy sample looks like (with 55% of the generated amount of Gaussian noise of \[latex\](0, 1)\[/latex\] mean/stddev applied to the image): + +- [![](images/1-6.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/1-6.png) + +- [![](images/2-4.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/2-4.png) + + +### The model + +[![](images/explained-1-300x165.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/09/explained-1.jpg) + +_Read our blog post **["Understanding transposed convolutions"](https://www.machinecurve.com/index.php/2019/09/29/understanding-transposed-convolutions/)** if you wish to understand transposed convolutions in more detail. Check out **["Conv2DTranspose: using 2D transposed convolutions with Keras"](https://www.machinecurve.com/index.php/2019/12/10/conv2dtranspose-using-2d-transposed-convolutions-with-keras/)** if you wish to understand how to use them with Keras._ + +Second: the model. As we're trying to remove noise from images, it makes sense to use **convolutional layers** for the encoder segment and **transposed convolutions** for the decoder segment. + +Below, you can see what it looks like. Obviously, it has an input layer, to receive the inputs, before the encoding and decoding segments are added. + +The two two-dimensional convolutional layers (Conv2D layers) form the part of the autoencoder that learn the encoder. The first layer learns 64 features and the other 32 features. A kernel size of 3x3 pixels is used, together with max-norm regularization (\[latex\]normsize = 2.0\[/latex\]). Since we use [ReLU activation](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/), [we use He init](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/). + +![](images/model-6.png) + +The two two-dimensional transposed convolutional layers, or Conv2DTranspose, serve as the decoder for our autoencoder. They learn to convert the latent state, which is the output of the encoder segment, into an output image - in our case, that's the noise-free image. The first learns 32 features; the second 64. As with the Conv2D layers, we also use max-norm regularization, ReLU activation and He initialization here. + +The last layer is the output layer, and is represented by a two-dimensional convolutional layer (Conv2D) that outputs one filter and uses padding in order not to change the shape. 
The output here must be the constructed, noise-free MNIST sample. + +Let's see if we can actually build such an autoencoder with Keras! + +### What you'll need to run the model + +...however, the only step left before we can start is an overview of what we'll need in terms of software dependencies. Without them, we can't run the model. Here they are - make sure that you have them installed before trying to run the Python script: + +- **Keras**, the deep learning framework that we use. +- One of the Keras backends, preferably **TensorFlow**. +- **Matplotlib**, for visualizing some samples. +- **Numpy**, for numbers processing. +- **Python**, for running the code 😋 + +* * * + +## Implementing the autoencoder with Keras + +All right, time to create some code 😁 + +The first thing to do is to open up your Explorer, and to navigate to a folder of your choice. In this folder, create a new file, and call it e.g. `image_noise_autoencoder.py`. Now open this file in your code editor - and you're ready to start :) + +[![](images/model-6-187x300.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/model-6.png) + +Creating our model consists of multiple steps: + +- First, we'll add the imports to our Python script, so that we can actually use e.g. Keras. +- Next, we set the configuration variables for our model. +- Then, we load and prepare MNIST data so that it can be used in the autoencoder. +- Subsequently, we also create noisy samples by adding noise to a copy of the dataset. +- Then, we define the architecture of our model, based on the plot on the right. +- Next, we compile our model and start the training process. +- Finally, we identify how well the model performs by visualizing a few denoised images from our test set. That's data our new autoencoder won't have seen before. + +### Adding the imports + +Step one: define which packages you'll need in your Python script. They are as follows. + +``` +import keras +from keras.datasets import mnist +from keras.models import Sequential +from keras.layers import Conv2D, Conv2DTranspose +from keras.constraints import max_norm +from keras import backend as K +import matplotlib.pyplot as plt +import numpy as np +``` + +With Keras, we'll (1) use a handy pointer to the MNIST dataset, and (2) create our deep learning model - with the Sequential API, using the Conv layers, and max-norm regularization. The Keras backend is used for removing the differences between Theano, CNTK and TensorFlow in the context of channels-first/channels-last, as we'll see in the data preparation section. + +With Matplotlib, we'll create some visualizations, and Numpy is used for numbers processing. + +### Setting model configuration + +Step two: defining the configuration of your model. + +``` +# Model configuration +img_width, img_height = 28, 28 +batch_size = 150 +no_epochs = 50 +validation_split = 0.2 +verbosity = 1 +max_norm_value = 2.0 +noise_factor = 0.55 +number_of_visualizations = 6 +``` + +The image width and image height of MNIST data are both 28 pixels. Hence, we set `img_width = img_height = 28`. + +Next up are some configuration variables related to tuning the neural network. We use a batch size of 150 to balance between accurate gradients (during optimization) and speed of learning/memory requirements. We'll let the model train for 50 epochs, and use 20% of the training data for validating the state of the model. We use max-norm regularization with a max norm of 2.0. + +Verbosity mode is set to True. 
This means that all data will be output on screen. The last two config values are related to the data. The `noise_factor` represents the percentage of the generated noise that must be added to the pure input data. In our case, that's 55%, or 0.55. The `number_of_visualizations`, on the other hand, tells us how many test data-based visualizations we must make once the model has finished training. + +### Loading and preparing data + +Step three: loading and preparing the dataset. + +We do so in a few steps: + +- First, we use `load_data()` to download the MNIST dataset or to retrieve it from cache. This allows us to load the data into four variables (two for training/testing data; two for inputs and targets) easily. +- Then, we reshape the data based on a channels-first/channels-last strategy. Image data must always contain a third dimension which represents the number of channels present in your image. For example, RGB data has 3 channels. Today, we only use one, but have to specify it anyway. The unfortunate thing, however, is that the backends use different strategies: some backends use a shape that presents channels first (e.g. \[latex\](1, 28, 28)\[/latex\]) while others present them last (\[latex\](28, 28, 1)\[/latex\]). Depending on what strategy your backend is using (hence, we need to import the backend into `K`!), we reshape the data into the correct format, so that the model becomes backend-agnostic 😀 Note that the Keras team wrote the code for doing so, and that they must be thanked. +- Next, we parse the int numbers into floats, specifically the `float32` datatype. Presumably, this speeds up the training process. +- Finally, we normalize the data into the range \[latex\]\[0, 1\]\[/latex\]. + +``` +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data based on channels first / channels last strategy. +# This is dependent on whether you use TF, Theano or CNTK as backend. +# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py +if K.image_data_format() == 'channels_first': + input_train = input_train.reshape(input_train.shape[0], 1, img_width, img_height) + input_test = input_test.reshape(input_test.shape[0], 1, img_width, img_height) + input_shape = (1, img_width, img_height) +else: + input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) + input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) + input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 +``` + +### Adding noise to the dataset + +Step four: adding some noise to the dataset. We first retrieve the pure training and testing data, and subsequently generate some noise (Gaussian noise with a mean of 0 and a standard deviation of 1, shaped according to `pure.shape` and `pure_test.shape`). Subsequently, following the `noise_factor` we set in the configuration step, we add the noise to the pure data, creating the `noisy_input` which we'll feed to the autoencoder. 
+ +``` +# Add noise +pure = input_train +pure_test = input_test +noise = np.random.normal(0, 1, pure.shape) +noise_test = np.random.normal(0, 1, pure_test.shape) +noisy_input = pure + noise_factor * noise +noisy_input_test = pure_test + noise_factor * noise_test +``` + +### Creating the model's architecture + +Next, as step five, we specify the architecture that we discussed earlier. Note that the last layer makes use of the Sigmoid activation function, which allows us to use [binary crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/). + +``` +# Create the model +model = Sequential() +model.add(Conv2D(64, kernel_size=(3, 3), kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform', input_shape=input_shape)) +model.add(Conv2D(32, kernel_size=(3, 3), kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(Conv2DTranspose(32, kernel_size=(3,3), kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(Conv2DTranspose(64, kernel_size=(3,3), kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(Conv2D(1, kernel_size=(3, 3), kernel_constraint=max_norm(max_norm_value), activation='sigmoid', padding='same')) + +model.summary() +``` + +Calling `model.summary()` produces this nice summary, which provides even more insight into our model: + +``` +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +conv2d_1 (Conv2D) (None, 26, 26, 64) 640 +_________________________________________________________________ +conv2d_2 (Conv2D) (None, 24, 24, 32) 18464 +_________________________________________________________________ +conv2d_transpose_1 (Conv2DTr (None, 26, 26, 32) 9248 +_________________________________________________________________ +conv2d_transpose_2 (Conv2DTr (None, 28, 28, 64) 18496 +_________________________________________________________________ +conv2d_3 (Conv2D) (None, 28, 28, 1) 577 +================================================================= +Total params: 47,425 +Trainable params: 47,425 +Non-trainable params: 0 +_________________________________________________________________ +Train on 48000 samples, validate on 12000 samples +``` + +### Model compilation & starting the training process + +Step six: compiling the model and starting the training process. Compiling the model is just a difficult combination of words for setting some configuration values; these are the so-called hyperparameters. We have to choose an optimizer ([for which we use the Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam) given its benefits compared with traditional SGD) and a [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) (binary crossentropy loss). + +``` +# Compile and fit data +model.compile(optimizer='adam', loss='binary_crossentropy') +model.fit(noisy_input, pure, + epochs=no_epochs, + batch_size=batch_size, + validation_split=validation_split) +``` + +Fitting the data to the compiled model, once again, is just another way of saying something simpler: starting the training process. 
Important to note is that our `noisy_input` is indeed used as features, while the `pure` data represent the targets, meaning that the loss function computes the error based on how much the predicted pure sample is off with respect to the actual pure sample. We also set the number of epochs, the batch size and the validation split based on the parameters configured earlier. + +### Model evaluation through visualization + +Step seven - our final step: evaluating the model by generating some visualizations of how test samples are denoised by our trained autoencoder. For this purpose, we'll take a subset of the test data, as well as their targets, and use it to generate predictions - the `denoised_images`. + +``` +# Generate denoised images +samples = noisy_input_test[:number_of_visualizations] +targets = target_test[:number_of_visualizations] +denoised_images = model.predict(samples) +``` + +We then create some Matplotlib code to visualize them: + +``` +# Plot denoised images +for i in range(0, number_of_visualizations): + # Get the sample and the reconstruction + noisy_image = noisy_input_test[i][:, :, 0] + pure_image = pure_test[i][:, :, 0] + denoised_image = denoised_images[i][:, :, 0] + input_class = targets[i] + # Matplotlib preparations + fig, axes = plt.subplots(1, 3) + fig.set_size_inches(8, 3.5) + # Plot sample and reconstruciton + axes[0].imshow(noisy_image) + axes[0].set_title('Noisy image') + axes[1].imshow(pure_image) + axes[1].set_title('Pure image') + axes[2].imshow(denoised_image) + axes[2].set_title('Denoised image') + fig.suptitle(f'MNIST target = {input_class}') + plt.show() +``` + +And this is it! Now open up a terminal, `cd` to the folder where your model is located, and run `image_noise_autoencoder.py`. The training process should now begin and once it finished, visualizations that present the denoising process should start popping up 😁 + +### Full model code + +If you're interested in the full model code altogether, here you go: + +``` +import keras +from keras.datasets import mnist +from keras.models import Sequential +from keras.layers import Conv2D, Conv2DTranspose +from keras.constraints import max_norm +from keras import backend as K +import matplotlib.pyplot as plt +import numpy as np + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 150 +no_epochs = 50 +validation_split = 0.2 +verbosity = 1 +max_norm_value = 2.0 +noise_factor = 0.55 +number_of_visualizations = 6 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data based on channels first / channels last strategy. +# This is dependent on whether you use TF, Theano or CNTK as backend. 
+# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py +if K.image_data_format() == 'channels_first': + input_train = input_train.reshape(input_train.shape[0], 1, img_width, img_height) + input_test = input_test.reshape(input_test.shape[0], 1, img_width, img_height) + input_shape = (1, img_width, img_height) +else: + input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) + input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) + input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Add noise +pure = input_train +pure_test = input_test +noise = np.random.normal(0, 1, pure.shape) +noise_test = np.random.normal(0, 1, pure_test.shape) +noisy_input = pure + noise_factor * noise +noisy_input_test = pure_test + noise_factor * noise_test + +# Create the model +model = Sequential() +model.add(Conv2D(64, kernel_size=(3, 3), kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform', input_shape=input_shape)) +model.add(Conv2D(32, kernel_size=(3, 3), kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(Conv2DTranspose(32, kernel_size=(3,3), kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(Conv2DTranspose(64, kernel_size=(3,3), kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(Conv2D(1, kernel_size=(3, 3), kernel_constraint=max_norm(max_norm_value), activation='sigmoid', padding='same')) + +model.summary() +from keras.utils.vis_utils import plot_model + +plot_model(model, to_file='model.png') + +# Compile and fit data +model.compile(optimizer='adam', loss='binary_crossentropy') +model.fit(noisy_input, pure, + epochs=no_epochs, + batch_size=batch_size, + validation_split=validation_split) + +# Generate denoised images +samples = noisy_input_test[:number_of_visualizations] +targets = target_test[:number_of_visualizations] +denoised_images = model.predict(samples) + +# Plot denoised images +for i in range(0, number_of_visualizations): + # Get the sample and the reconstruction + noisy_image = noisy_input_test[i][:, :, 0] + pure_image = pure_test[i][:, :, 0] + denoised_image = denoised_images[i][:, :, 0] + input_class = targets[i] + # Matplotlib preparations + fig, axes = plt.subplots(1, 3) + fig.set_size_inches(8, 3.5) + # Plot sample and reconstruciton + axes[0].imshow(noisy_image) + axes[0].set_title('Noisy image') + axes[1].imshow(pure_image) + axes[1].set_title('Pure image') + axes[2].imshow(denoised_image) + axes[2].set_title('Denoised image') + fig.suptitle(f'MNIST target = {input_class}') + plt.show() +``` + +* * * + +## Results + +Next up, the interesting part - the results 😁 + +And I must say that I'm really happy with how well the autoencoder has learnt to denoise MNIST images 🎉 With a loss value of \[latex\]\\approx 0.095\[/latex\], it performs quite well - but hey, it's better to see how it works visually. 
Therefore, let's skip to the example visualizations: + +- [![](images/1-5.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/1-5.png) + +- [![](images/2-3.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/2-3.png) + +- [![](images/3-3.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/3-3.png) + +- [![](images/4-3.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/4-3.png) + +- [![](images/5-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/5-2.png) + +- [![](images/6-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/6-2.png) + + +I'm really happy with the results! 😎 + +* * * + +## Summary + +In this blog post, we've seen what autoencoders are and why they are suitable for noise removal / noise reduction / denoising of images. Additionally, we provided an example of such an autoencoder created with the Keras deep learning framework. This way, I hope that you can make a quick start in your neural network based image denoising projects. + +If not - I hope you've learnt something from this blog post. If you did, or if you have questions left, please feel free to leave a comment below 👇 Please do so as well if you have remarks or when you spot mistakes in the article. I'll then happily improve my post and list you as a contributor :) + +Thank you for reading MachineCurve today and happy engineering! + +_If you're interested: the code for today's model is also available in my [keras-autoencoders repository](https://github.com/christianversloot/keras-autoencoders) on GitHub._ + +* * * + +## References + +LeCun, Y., Cortes, C., & Burges, C. (n.d.). MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges. Retrieved from [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/) + +Shafkat, I. (2018, April 5). Intuitively Understanding Variational Autoencoders. Retrieved from [https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf](https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf) + +Keras Blog. (n.d.). Building Autoencoders in Keras. Retrieved from [https://blog.keras.io/building-autoencoders-in-keras.html](https://blog.keras.io/building-autoencoders-in-keras.html) diff --git a/can-neural-networks-approximate-mathematical-functions.md b/can-neural-networks-approximate-mathematical-functions.md new file mode 100644 index 0000000..d23240d --- /dev/null +++ b/can-neural-networks-approximate-mathematical-functions.md @@ -0,0 +1,157 @@ +--- +title: "Can neural networks approximate mathematical functions?" +date: "2019-07-18" +categories: + - "svms" +tags: + - "function" + - "mathematics" + - "neural-network" +--- + +In the [paper](https://www.sciencedirect.com/science/article/pii/0893608089900208) _Multilayer feedforward networks are universal approximators_ written by Kurt Hornik, Maxwell Stinchcombe and Halbert White in 1989, it was argued that neural networks can approximate "quite well nearly any function". + +...and it made the authors wonder about what neural networks can achieve, since pretty much anything can be translated into models and by consequence mathematical formulae. + +When reading the paper, I felt like experimenting a little with this property of neural networks, and to try and find out whether with sufficient data functions such as \[latex\]x^2\[/latex\], \[latex\]sin(x)\[/latex\] and \[latex\]1/x\[/latex\] can be approximated. + +Let's see if we can! 
+ +**Update 02/Nov/2020:** made code compatible with TensorFlow 2.x. + +**Update 02/Nov/2020:** added Table of Contents. + +* * * + +\[toc\] + +* * * + +## The experiment + +For the experiment, I used the following code for approximating \[latex\]x^2\[/latex\]: + +``` +# Imports +import numpy as np +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense + +# Load training data +x = -50 + np.random.random((25000,1))*100 +y = x**2 + +# Define model +model = Sequential() +model.add(Dense(40, input_dim=1, activation='relu')) +model.add(Dense(20, activation='relu')) +model.add(Dense(10, activation='relu')) +model.add(Dense(1)) +model.compile(loss='mean_squared_error', optimizer='adam') +model.fit(x, y, epochs=15, batch_size=50) + +predictions = model.predict([10, 5, 200, 13]) +print(predictions) # Approximately 100, 25, 40000, 169 +``` + +Let's take the code above apart first, before we move on to the results. + +First, I'm importing the Python packages that I need for successfully running the experiment. First, I'm using `numpy`, which is the numerical processing package that is the de facto standard in data science today. + +Second, I'm using `keras`, which is a deep learning framework for Python and runs on TensorFlow, Theano and CNTK. It simply abstracts much of the pain away and allows one to create a deep learning model in only a few lines of code. + +And it runs on GPU, which is very nice. + +Specifically, for Keras, I'm importing the `Sequential` model type and the `Dense` layer type. The Sequential model type requires the engineer to 'stack' the individual layers on top of each other (as you will see next), while the Dense or Densely-connected layer means that each individual neuron is connected to all neurons in the following layer. + +Next, I load the training data. Rather simply, I'm generating 25.000 numbers in the range \[-50, 50\]. Subsequently, I'm also generating the targets for the individual numbers by applying `x**2` or \[latex\]x^2\[/latex\]. + +Then, I define the model - it's a Sequential one with three hidden layers: all of them are Dense with 40, 20 and 10 neurons, respectively. The input layer has simply one neuron (every `x` is just a number) and the output layer has only one as well (since we regress to `y`, which is also just a number). Note that all layers use `ReLU` as an activation function except for the last one, standard with regression. + +Mean squared error is used as a loss function, as well as Adam for optimization, all pretty much standard options for deep neural networks today. + +Next, we fit the data in 15 epochs and generate predictions for 4 values. Let's see what it outputs under 'The results'. + +\[ad\] + +### The two other functions + +I used the same code for \[latex\]sin(x)\[/latex\] and \[latex\]1/x\[/latex\], however I did change the assignment of \[latex\]y\[/latex\] as follows, together with the expected values for the predictions: + +- **sin(x):** \[latex\]y = np.sin(x)\[/latex\]; expected values approximately -0.544, -0.959, -0.873 and 0.420. +- **1/x:** \[latex\]y = 1/x\[/latex\]; expected values approximately 0.10, 0.20, 0.005 and 0.077. + +## The results + +For \[latex\]x^2\[/latex\], these were the expected results: `100, 25, 40000, 169`. + +Those are the actual results: + +``` +[[ 101.38112 ] + [ 25.741158] + [11169.604 ] + [ 167.91489 ]] +``` + +Pretty close for most ones. Only for `40000`, the model generated a wholly wrong prediction. 
That's not strange, though: the training data was generated in the interval \[-50, 50\]; apparently, 100, 25 and 169 are close enough to be properly regressed, while 40000 is not. That makes intuitive sense. + +\[ad\] + +Let's now generate predictions for all the `x`s when the model finishes and plot the results: + +``` +import matplotlib.pyplot as plt + +# Generate predictions across the full training domain +predictions = model.predict(x) + +plt.subplot(2, 1, 1) +plt.scatter(x, y, s = 1) +plt.title('y = $x^2$') +plt.ylabel('Real y') + +plt.subplot(2, 1, 2) +plt.scatter(x, predictions, s = 1) +plt.xlabel('x') +plt.ylabel('Approximated y') + +plt.show() +``` + +When you plot the functions, you get pretty decent results for \[latex\]x^2\[/latex\]: + +[![](images/x2_approximated-1024x537.jpeg)](https://machinecurve.com/wp-content/uploads/2019/07/x2_approximated.jpeg) + +For \[latex\]sin(x)\[/latex\], results are worse: + +[![](images/sinx_approximated-1024x537.jpeg)](https://machinecurve.com/wp-content/uploads/2019/07/sinx_approximated.jpeg) + +What you see is that it approximates the sine function quite appropriately for a _very small domain_, e.g. \[-5, +3\], but then loses track. We might improve the estimation by feeding it with _more_ samples, so we increase the number of random samples to 100.000, still on the interval \[-50, 50\]: + +[![](images/sinx_more_data-1024x537.jpeg)](https://machinecurve.com/wp-content/uploads/2019/07/sinx_more_data.jpeg) + +That's already much better, but still insufficient. Perhaps the cause is different - e.g. we may achieve better results if we used something like sin(x) as an activation function. However, that's something for a next blog. + +\[ad\] + +And finally, this is what \[latex\]1/x\[/latex\] looks like: + +[![](images/1x_approximated-1024x537.jpeg)](https://machinecurve.com/wp-content/uploads/2019/07/1x_approximated.jpeg) + +That one's getting closer again, but you can see that it is not yet _highly accurate._ + +## My observations + +The experiment was quite interesting, actually. + +First, I noticed that you need more training data than I expected. For example, with only 1000 samples in my training set, the approximation gets substantially worse: + +[![](images/x2_1000-1024x537.jpeg)](https://machinecurve.com/wp-content/uploads/2019/07/x2_1000.jpeg) + +Second, not all the functions could be approximated properly. Particularly, the sine function was difficult to approximate. + +Third, I did not account for overfitting whatsoever. I just let the models run, possibly introducing severe overfitting to the function at hand. But - to some extent - that was precisely what we wanted. + +\[ad\] + +Fourth, perhaps as a result of (3), the models seem to perform quite well _around_ the domain of the training data (i.e. the \[-50, +50\] interval), but generalization remains difficult. On the other hand, that could be expected; the `40000` value in the first \[latex\]x^2\[/latex\] experiment corresponds to an input that is anything but within \[latex\]-50 < x < 50\[/latex\]. + +Altogether, this was a nice experiment for during the evening, showing that you can use neural networks for approximating mathematical functions - if you take into account that it's slightly more complex than you imagine at first, it can be done. 
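For completeness, this is roughly what the \[latex\]sin(x)\[/latex\] variant described under "The two other functions" looks like in code. It is a minimal sketch that reuses the exact architecture from the \[latex\]x^2\[/latex\] experiment and only swaps the target assignment; reshaping the `predict` input into a column vector is my own addition, to keep the input shape explicit:

```
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Same 25.000 random inputs in the range [-50, 50]
x = -50 + np.random.random((25000,1))*100
y = np.sin(x)  # the only real change: the target function

# Identical architecture to the x^2 experiment
model = Sequential()
model.add(Dense(40, input_dim=1, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x, y, epochs=15, batch_size=50)

# Expected values: approximately -0.544, -0.959, -0.873 and 0.420
predictions = model.predict(np.array([[10], [5], [200], [13]]))
print(predictions)
```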
diff --git a/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d.md b/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d.md new file mode 100644 index 0000000..c030bba --- /dev/null +++ b/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d.md @@ -0,0 +1,499 @@ +--- +title: "Classifying IMDB sentiment with Keras and Embeddings, Dropout & Conv1D" +date: "2020-03-03" +categories: + - "deep-learning" + - "frameworks" +tags: + - "dataset" + - "deep-learning" + - "imdb-dataset" + - "keras" + - "machine-learning" + - "natural-language-processing" + - "text-classification" + - "word-embedding" +--- + +When using the Keras framework for deep learning, you have at your disposal a module called `keras.datasets` - which represents [standard datasets](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/) that can be used in your deep learning models, for educating yourself (click [here](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/) if you wish to extend them). + +Now, what can you do with them? + +With regards to the image datasets, it's pretty straightforward: use [convolutional layers](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) to generate a computer vision model with e.g. the [CIFAR-10 and CIFAR-100 datasets](https://www.machinecurve.com/index.php/2020/02/09/how-to-build-a-convnet-for-cifar-10-and-cifar-100-classification-with-keras/). + +However, there are more datasets - and the **IMDB Dataset** is one of them. + +This dataset contains reviews for movies from IMDB and corresponding movie sentiment. It's clearly no computer vision problem, but it can be cracked! In this blog post, we'll show you how to do so - by building a Keras sentiment classifier which attempts to predict the sentiment for input based on the patterns learnt from the IMDB data. + +First, we'll introduce you to the concepts for today's blog - being sentiment, the dataset itself, one-hot encoding, word embeddings, and one-dimensional convolutions. Then, we continue with building the Keras sentiment classifier - we'll walk through every single step doing so. Finally, we show you the results - and how to predict the sentiment of new text based on your Keras model. + +Are you ready? Let's give it a go! 😎 + +* * * + +\[toc\] + +* * * + +## Introducing the concepts for today's blog + +As introduced earlier, let's first take a look at a few concepts that are important for today's blog post: + +- Sentiment; +- The IMDB dataset; +- One-hot encoding; +- Word embeddings; +- One-dimensional convolutions. + +### Sentiment + +We'll begin with _sentiment_. What is it? What does it represent? Likely, you already have an intuitive understanding about what it is - something related to how you perceive something, probably. + +Sentiment is a term that we see a lot in the context of Tweets, as much machine learning research has focused on building models with Twitter data given its enormous size. However, more generally, using the Oxford Learner's Dictionaries (n.d.), we arrive at this definition for _sentiment_: + +> \[countable, uncountable\] _(formal)_ a feeling or an opinion, especially one based on emotions +> +> Oxford Learner's Dictionaries (n.d.) + +We were close with our initial guess. + +If you express sentiment about something, such as a movie, you express the feeling or opinion you have, which is likely based on emotions. Do you like the movie? Why so? 
Those questions. + +### The IMDB dataset + +In the `keras.datasets` module, we find the IMDB dataset: + +> Dataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a [sequence](https://keras.io/preprocessing/sequence/) of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer "3" encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words". +> +> Keras (n.d.) + +When processing the reviews into [readable format](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#imdb-movie-reviews-sentiment-classification), this is an example: + +> this has to be one of the worst films of the 1990s when my friends i were watching this film being the target audience it was aimed at we just sat watched the first half an hour with our jaws touching the floor at how bad it really was the rest of the time everyone else in the theatre just started talking to each other leaving or generally crying into their popcorn that they actually paid money they had earnt working to watch this feeble excuse for a film it must have looked like a great idea on paper but on film it looks like no one in the film has a clue what is going on crap acting crap costumes i can't get across how embarrasing this is to watch save yourself an hour a bit of your life + +Well.. while the movie may not be good, we can get access to 25K reviews for building a neural network sentiment classifer ... and that _is_ good :) + +### Representing words in ML models naïvely + +Representing words in machine learning models can be a difficult thing to do. Machine learning models, and especially modern neural networks, often have difficulties representing _words_. Take a look at TensorFlow - it's a framework for processing _numeric_ data, not text. + +But it can work, though! If we have a way to convert text into numeric format, we could use such frameworks and train modern machine learning models based on textual data. + +But how to do so efficiently? + +Quite quickly, but naïvely, one would use a process called **one-hot encoding** in order to generate such representations. Here, each word is represented in a vector that spans all distinct words in your dataset; only the dimension that represents your word is set to 1, the rest is zero. + +For example, if you have these two short phrases (lowercase intended): + +- hi there +- i am chris + +You'd have 5 words, and one-hot encoding your vector would thus have five dimensions, and you'd have these vectors, each representing one word: + +\[latex\]\[1, 0, 0, 0, 0\] \\rightarrow \\text{hi}\[/latex\] + +\[latex\]\[0, 1, 0, 0, 0\] \\rightarrow \\text{there} \[/latex\] + +\[latex\]\[0, 0, 1, 0, 0\] \\rightarrow \\text{i} \[/latex\] + +\[latex\]\[0, 0, 0, 1, 0\] \\rightarrow \\text{am} \[/latex\] + +\[latex\]\[0, 0, 0, 0, 1\] \\rightarrow \\text{chris} \[/latex\] + +Now, this will work well! + +...except for when you have a massive amount of words in your dataset ;-) + +One thousand distinct words? One thousand dimensions in your one-hot encoded vector. + +With approximately 500.000 words in the English vocabulary, you get the point about why this approach works while being naïve (Wikipedia, n.d.). 
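To make this a bit more tangible, here is a minimal Python sketch of the one-hot scheme for the five-word example above. The `one_hot` helper is purely illustrative - it is not part of the IMDB classifier we'll build later - but it shows why the approach breaks down for large vocabularies:

```
import numpy as np

# The tiny vocabulary from the two phrases above
vocabulary = ['hi', 'there', 'i', 'am', 'chris']
word_to_index = {word: index for index, word in enumerate(vocabulary)}

def one_hot(word):
    # One dimension per distinct word, with a single 1 at this word's position
    vector = np.zeros(len(vocabulary))
    vector[word_to_index[word]] = 1.0
    return vector

for word in vocabulary:
    print(word, one_hot(word))
# hi [1. 0. 0. 0. 0.]
# there [0. 1. 0. 0. 0.]
# ...and so on. With 10.000 distinct words, every vector would need 10.000 dimensions.
```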
+ +### Word embeddings + +Now, what may work better is a so-called **word embedding**: + +[![](images/T-SNE_visualisation_of_word_embeddings_generated_using_19th_century_literature-1024x695.png)](https://www.machinecurve.com/wp-content/uploads/2020/03/T-SNE_visualisation_of_word_embeddings_generated_using_19th_century_literature.png) + +By [Siobhán Grayson](Siobhangrayson&action=edit&redlink=1) - Own work, [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0), [Link](https://commons.wikimedia.org/w/index.php?curid=64541584) + +In the image above, you see a two-dimensional "slice" (using T-SNE) from a multidimensional space, representing words in your dataset. It was created by Siobhán Grayson, for which we are thankful :) + +Now, what you see highlighted are purple Fs and grey Ms - these represent gender specific words, with F being _female_ related and M _male_ related ones. As you can see, across these two dimensions, words cluster in four distinct groups, with some outliers. What you can also see is a wide range of clusters of other (types of) words, mapped onto the two dimensions (and by consequence, onto the other dimensions as well - but this is just the slice). + +I hope it's clear to you what the power of such embeddings is: _words are mapped onto a multidimensional space, and each word represents a real-valued vector in this space_. Hence, each word can be described uniquely, while the space allows for relative sparsity of your vectors (e.g., with a ten-dimensional word embedding space, your vector has only ten values). + +Even better, _word embeddings can be learnt_. That is, they are initialized randomly, and the function mapping the word onto space can be adapted during training of the whole model - so that it gets better over time. Do note that pretrained embeddings can be used as well - this entirely depends on your use case. + +In this blog post, we'll use word embeddings with the IMDB data to generate our classifier. Specifically, we'll do so using the Keras `Embedding` layer. However, let's take a look at one-dimensional convolutions first. + +### One-dimensional convolutions (Conv1D) + +As you will see later in this blog post, a Keras sentiment classifier can be created by a simple pattern: + +- Using an `Embedding` layer; +- Using a `Flatten` layer, to perform dimensionality reduction by plain flattening; +- Using one or multiple `Dense` layers, which serve as the classifier (just as with convolutional layers, the Embeddings layer serves to extract and structure features). + +Now, while this will likely work, it's a naïve approach according to Chollet (2017): + +> \[Note\] that merely flattening the embedded sequences and training a single Dense layer on top leads to a model that treats each word in the input sequence separately, without considering inter-word relationships and sentence structure (for example, this model would likely treat both “this movie is a bomb” and “this movie is the bomb” as being negative reviews). It’s much better to add recurrent layers or 1D convolutional layers on top of the embedded sequences to learn features that take into account each sequence as a whole. +> +> Chollet (2017, p. 187) + +Hence, we could also use one-dimensional convolutional layers. But what are they? 
+ +We recall from the blog posts about [2D convolutional layers](https://www.machinecurve.com/index.php/2019/12/03/what-do-convnets-see-visualizing-filters-with-activation-maximization/) that they represent two-dimensional _kernels_, which slide ("convolve") over the input, generating _feature maps_. As they learn to do so increasingly well over the course of the training process, they are effectively the _feature extractors_ for your model - allowing the Dense layers to work based on the patterns that were identified by the Conv layers. + +![](images/1dconv.png) + +Now, this principle works as well for 1D data - as we can see above. Here, the kernels simply aren't two-dimensional, but one-dimensional too. They convolve over the 1D input too, and generate feature maps that are "triggered" when their learnt patterns occur in new data. This way, they can help the Dense layers in generating their classification. + +Now, this is why Chollet (2017) argued that 1D Conv layers could improve text classification - for the simple reason that 1D Conv layers extract features based on _multiple input elements at once_, e.g. with the size 3 kernel above. This way, interrelationships between words are captured in a better way. I'm certain that there are more state of the art methods for doing so today, and I'm happy to hear about them - so leave a comment with your ideas for improving this post! 💬 + +Nevertheless, we'll use Keras `Conv1D` layers in today's classifier. + +* * * + +## Building a Keras sentiment classifier with the IMDB dataset + +Let's now take a look whether we can actually build something :) + +For this to work, first open up your Explorer/Finder, and navigate to some folder. Here, create a Python file - e.g., `imdb.py`. Now, open this file in your code editor, make sure that it supports Python, and let's go! + +### What you'll need to run this model + +In order to run this model successfully, it's important that your system has installed these software dependencies: + +- Keras, preferably by means of the TensorFlow 2.0 integration - this blog post was created for this major version; +- Numpy; +- Matplotlib, if you wish to generate the visualizations near the end of the model. + +Preferably, install the dependencies in an Anaconda environment, so that you make sure not to interfere with other projects and environments. If you have them installed, let's code :) + +### Model imports + +First things first - let's add the model imports: + +``` +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout, Conv1D, MaxPooling1D +from tensorflow.keras.datasets import imdb +from tensorflow.keras.preprocessing.sequence import pad_sequences +import numpy as np +import matplotlib.pyplot as plt +``` + +Here, we use the `Sequential` API for stacking layers on top of each other. More specifically, we'll use the `Embedding` layer for learning the word embedding, `Flatten` for making the data `Dense`\-ready, `Dropout` [for reducing overfitting](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/) and `Conv1D`/`MaxPooling1D` for extracting better patterns and generating [spatial hierarchy](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/). 
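Before we move on to configuring the model, here is a tiny numeric sketch of the one-dimensional convolution idea discussed above: a size-3 kernel sliding over a sequence, producing one feature map value per position. The numbers are made up purely for illustration - in the actual model, Keras learns 32 such kernels, and each position holds an embedding vector rather than a single number:

```
import numpy as np

sequence = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 1.0])  # a 1D input signal
kernel = np.array([0.5, 1.0, -0.5])                   # one learnable size-3 kernel

# Slide the kernel over the sequence: one dot product per position
feature_map = np.array([
    np.dot(sequence[i:i + 3], kernel)
    for i in range(len(sequence) - 3 + 1)
])

print(feature_map)  # [ 2.5  1.5 -2.5  2. ]
```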
+ +### Model configuration + +Next, model configuration: + +``` +# Model configuration +max_sequence_length = 100 +num_distinct_words = 10000 +embedding_output_dims = 15 +loss_function = 'binary_crossentropy' +optimizer = 'adam' +additional_metrics = ['accuracy'] +number_of_epochs = 100 +verbosity_mode = True +validation_split = 0.20 +``` + +Max sequence length, or `max_sequence_length`, describes the number of words in each sequence (a.k.a. sentence). We require this parameter because we need uniform input, i.e. inputs with the same shape. That is, with 100 words per sequence, each sequence is either padded to ensure that it is 100 words long, or truncated for the same purpose. + +With `num_distinct_words`, we'll set how many distinct words we obtain using the `keras.datasets.imdb` dataset's `load_data()` call. In this setting, it will load the 10.000 most frequent words - likely, more than enough for a well-functioning model. Other words are replaced with a uniform "replacement" token. + +Our embeddings layer has a dimensionality of `embedding_output_dims = 15`. For [loss](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/), we use [binary crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) (as we use [Sigmoid](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/) activated outputs), and the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam). As an additional metric, we use the more intuitive accuracy. We train the model for 100 epochs, set verbosity mode to True (outputting most of the training process on screen) and use 20% of our training data for validation purposes. + +### Loading and preparing the IMDB dataset + +Next, we load the IMDB dataset and print some basic statistics: + +``` +# Load dataset +(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_distinct_words) +print(x_train.shape) +print(x_test.shape) +``` + +Now, we'll have to add a little comment: + +``` +# Here, you'd normally test first that the model generalizes and concatenate all data +# (that is, normally, you'd perform e.g. K-fold Cross Validation first) +# Then, you can use all data for a full training run. Now, we'll use x_train for training only. +``` + +Indeed, normally, you wouldn't want to split training and testing data when training your production model - the more data, the better. However, you _do_ need to estimate how well your model works - and also in a production setting, with data that it hasn't seen. Hence, here, you would normally apply an evaluation technique such as [K-fold Cross Validation](https://www.machinecurve.com/index.php/2020/02/18/how-to-use-k-fold-cross-validation-with-keras/). If this shows that model performance is adequate, you would normally retrain your model with all the data. However, for the sake of simplicity, we use `x_train` only and discard `x_test` until the model evaluation step. 
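For completeness, this is roughly what such a K-fold evaluation could look like. Note that this is a sketch under a few assumptions: it uses scikit-learn's `KFold` (which is not used elsewhere in this post), it relies on the padded sequences we will create in the next step, and `build_model()` is a hypothetical helper that returns a freshly compiled instance of the model we define below:

```
import numpy as np
from sklearn.model_selection import KFold

# Combine all available data, then split it into 5 folds
inputs = np.concatenate((padded_inputs, padded_inputs_test), axis=0)
targets = np.concatenate((y_train, y_test), axis=0)

for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5, shuffle=True).split(inputs)):
    model = build_model()  # hypothetical helper returning a freshly compiled model
    model.fit(inputs[train_idx], targets[train_idx], epochs=number_of_epochs, verbose=0)
    loss, accuracy = model.evaluate(inputs[val_idx], targets[val_idx], verbose=0)
    print(f'Fold {fold} - loss: {loss} - accuracy: {100*accuracy}%')
```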
+ +The next step would be to pad all sequences, as suggested before, to ensure that the shape of all inputs is equal (in our case, 100 words long): + +``` +# Pad all sequences +padded_inputs = pad_sequences(x_train, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with +padded_inputs_test = pad_sequences(x_test, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with +``` + +Next, we output three texts on screen - to get a feeling for what we're working with: + +``` +# Obtain 3 texts +for i in np.random.randint(0, len(padded_inputs), 3): + INDEX_FROM=3 # word index offset + word_to_id = imdb.get_word_index() + word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()} + word_to_id[""] = 0 + word_to_id[""] = 1 + word_to_id[""] = 2 + word_to_id[""] = 3 + + id_to_word = {value:key for key,value in word_to_id.items()} + print('=================================================') + print(f'Sample = {i} | Length = {len(padded_inputs[i])}') + print('=================================================') + print(' '.join(id_to_word[id] for id in padded_inputs[i] )) +``` + +### Defining the Keras model + +Then, we can define the Keras model: + +``` +# Define the Keras model +model = Sequential() +model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length)) +model.add(Dropout(0.50)) +model.add(Conv1D(filters=32, kernel_size=2, padding='same', activation='relu')) +model.add(Dropout(0.50)) +model.add(MaxPooling1D(pool_size=2)) +model.add(Flatten()) +model.add(Dropout(0.50)) +model.add(Dense(1, activation='sigmoid')) +``` + +As you can see, the first layer is an `Embedding` layer which learns the word embedding - based on the number of distinct words, the number of output dimensions, and the input length that we defined during model configuration. + +Dropout is added after every layer of interest in order to add [noise through Bernoulli variables](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/), hopefully to avoid or reduce overfitting. + +Following the Embedding layer is a `Conv1D` layer with 32 filters of size 2. Then, we use `MaxPooling1D` to boost spatial hierarchies within your model - see the article [about pooling](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/) for more information. Finally, we use `Flatten` to reduce dimensionality of the data and `Dense` for generating a `Sigmoid`\-activated classification (that is, a classification within the range \[latex\](0, 1)\[/latex\]). + +### Model compilation, fitting & summary + +Next, we compile the model, fit the data and generate a summary: + +``` +# Compile the model +model.compile(optimizer=optimizer, loss=loss_function, metrics=additional_metrics) + +# Give a summary +model.summary() + +# Train the model +history = model.fit(padded_inputs, y_train, epochs=number_of_epochs, verbose=verbosity_mode, validation_split=validation_split) +``` + +If you would save your work and run the Python code, it would start the training process :) + +However, let's add a few evaluation & visualization parts before doing so - for you to visually appreciate model progress. 
+ +### Model evaluation & visualization + +First, we add a numerical evaluation using `model.evaluate` and the testing dataset: + +``` +# Test the model after training +test_results = model.evaluate(padded_inputs_test, y_test, verbose=False) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%') +``` + +And subsequently, we use the `history` object in order to [visualize model history](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/): + +``` +# Visualize history +# Plot history: Validation loss +plt.plot(history.history['val_loss']) +plt.title('Validation loss history') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.show() + +# Plot history: Accuracy +plt.plot(history.history['val_accuracy']) +plt.title('Validation accuracy history') +plt.ylabel('Accuracy value (%)') +plt.xlabel('No. epoch') +plt.show() +``` + +We now have a fully functioning machine learning model for IMDB sentiment classification using Word embeddings, 1D convolutional layers and Dropout! :D + +### Full model code + +Should you wish to obtain the full model code at once - here you go :) + +``` +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Embedding, Flatten, Dense, Dropout, Conv1D, MaxPooling1D +from tensorflow.keras.datasets import imdb +from tensorflow.keras.preprocessing.sequence import pad_sequences +import numpy as np +import matplotlib.pyplot as plt + +# Model configuration +max_sequence_length = 100 +num_distinct_words = 10000 +embedding_output_dims = 15 +loss_function = 'binary_crossentropy' +optimizer = 'adam' +additional_metrics = ['accuracy'] +number_of_epochs = 100 +verbosity_mode = True +validation_split = 0.20 + +# Load dataset +(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=num_distinct_words) +print(x_train.shape) +print(x_test.shape) + +# Here, you'd normally test first that the model generalizes and concatenate all data +# (that is, normally, you'd perform e.g. K-fold Cross Validation first) +# Then, you can use all data for a full training run. Now, we'll use x_train for training only. 
+ +# Pad all sequences +padded_inputs = pad_sequences(x_train, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with +padded_inputs_test = pad_sequences(x_test, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with + +# Obtain 3 texts +for i in np.random.randint(0, len(padded_inputs), 3): + INDEX_FROM=3 # word index offset + word_to_id = imdb.get_word_index() + word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()} + word_to_id[""] = 0 + word_to_id[""] = 1 + word_to_id[""] = 2 + word_to_id[""] = 3 + + id_to_word = {value:key for key,value in word_to_id.items()} + print('=================================================') + print(f'Sample = {i} | Length = {len(padded_inputs[i])}') + print('=================================================') + print(' '.join(id_to_word[id] for id in padded_inputs[i] )) + +# Define the Keras model +model = Sequential() +model.add(Embedding(num_distinct_words, embedding_output_dims, input_length=max_sequence_length)) +model.add(Dropout(0.50)) +model.add(Conv1D(filters=32, kernel_size=2, padding='same', activation='relu')) +model.add(Dropout(0.50)) +model.add(MaxPooling1D(pool_size=2)) +model.add(Flatten()) +model.add(Dropout(0.50)) +model.add(Dense(1, activation='sigmoid')) + +# Compile the model +model.compile(optimizer=optimizer, loss=loss_function, metrics=additional_metrics) + +# Give a summary +model.summary() + +# Train the model +history = model.fit(padded_inputs, y_train, epochs=number_of_epochs, verbose=verbosity_mode, validation_split=validation_split) + +# Test the model after training +test_results = model.evaluate(padded_inputs_test, y_test, verbose=False) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {100*test_results[1]}%') + +# Visualize history +# Plot history: Validation loss +plt.plot(history.history['val_loss']) +plt.title('Validation loss history') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.show() + +# Plot history: Accuracy +plt.plot(history.history['val_accuracy']) +plt.title('Validation accuracy history') +plt.ylabel('Accuracy value (%)') +plt.xlabel('No. epoch') +plt.show() +``` + +* * * + +## Results + +Let's now take a look at some results after you ran the model with `python imdb.py`. + +### Validation plots + +First, the validation plots - i.e., the plots with validation loss and validation accuracy. Clearly, they indicate that [overfitting](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/#how-well-does-your-model-perform-underfitting-and-overfitting) occurs: the loss minimum is reported straight at the beginning of the process, after which loss increases again ([check here how to detect underfitting and overfitting on loss plots](https://www.machinecurve.com/index.php/2020/02/20/finding-optimal-learning-rates-with-the-learning-rate-range-test/#overfitting-and-underfitting-checking-your-validation-loss)). + +While this is bad - and perhaps can be overcome by tuning learning rates, using different optimizers, preparing the data or model architecture differently, training for longer and considering this as a temporary worse loss - we don't really care for now, haha :P Especially because accuracy at that point is \[latex\]\\approx 86\\%\[/latex\]. 
+ +Instead, the scope of our blog post - to create an IMDB sentiment classifier - was achieved :) + +- [![](images/emb_loss.png)](https://www.machinecurve.com/wp-content/uploads/2020/03/emb_loss.png) + +- [![](images/emb_acc.png)](https://www.machinecurve.com/wp-content/uploads/2020/03/emb_acc.png) + + +### Generating new predictions + +We can also generate predictions for 'new' texts - like this: + +``` +# Texts +text_bad = x_train[7737] +text_good = x_train[449] +texts = (text_bad, text_good) +padded_texts = pad_sequences(texts, maxlen=max_sequence_length, value = 0.0) # 0.0 because it corresponds with + +# Generate predictions +predictions = model.predict(padded_texts) +print(predictions) +``` + +For sample 449, the prediction is `0.8987303` ... close to "good". This makes sense - the text clearly indicates that the viewer had positive sentiment about the movie, but he/she also makes a few neutral statements (such as "the acting is ok"): + +``` +================================================= +Sample = 449 | Length = 100 +================================================= +i'm doing these two together because their comic timing and acting quality was superb and for lisa this was one of her first roles and she is so natural as and matthew perry is just matthew perry playing himself basically the episode quality does improve later such as the sets they looks dark and creepy in this episode and makes them seem the acting is ok the characters gain confidence with each new scene and i am proud this is the pilot i hope we see the friends reunite cause they will always be there for us +``` + +For sample 7337, the output is `0.02299032` - which is close to `0`, "bad". Obviously, this is correct given the text: + +``` +================================================= +Sample = 7337 | Length = 100 +================================================= +is a mess i mean it's all over the place and so over the top tony montana would have been proud br br the last but not least mistake that movie made is a completely irrelevant title you simply can't really connect a between the plot and its title and then you will end up thinking that it makes no sense at all in short watching the detectives is pleasant if forgettable motion picture that you might have a chance to catch it on cable tv so quick that you couldn't imagine br br rating 1 5 4 +``` + +* * * + +## Summary + +In this blog post, we saw how to create an IMDB sentiment classifier using Keras. Firstly, we looked at the concepts that we used in the model - being word embeddings, 1D convolutional layers, and the concept of sentiment. Subsequently, we created a TensorFlow 2.0 based Keras implementation of an IMDB dataset classifier, which we did by guiding you through every step. + +I hope you've learnt something from today's blog post! :) If you did, please feel free to leave a comment below. I'd really appreciate it! + +Thank you for reading MachineCurve today and happy engineering 😎 + +\[kerasbox\] + +* * * + +## References + +Oxford Learner's Dictionaries. (n.d.). Sentiment. Retrieved from [https://www.oxfordlearnersdictionaries.com/definition/english/sentiment](https://www.oxfordlearnersdictionaries.com/definition/english/sentiment) + +Keras. (n.d.). Datasets: IMDB Movie reviews sentiment classification. Retrieved from [https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification](https://keras.io/datasets/#imdb-movie-reviews-sentiment-classification) + +Chollet, F. (2017). _Deep Learning with Python_. 
New York, NY: Manning Publications. + +Wikipedia. (2016, August 6). List of dictionaries by number of words. Retrieved from [https://en.wikipedia.org/wiki/List\_of\_dictionaries\_by\_number\_of\_words](https://en.wikipedia.org/wiki/List_of_dictionaries_by_number_of_words) diff --git a/cnns-and-feature-extraction-the-curse-of-data-sparsity.md b/cnns-and-feature-extraction-the-curse-of-data-sparsity.md new file mode 100644 index 0000000..0212a18 --- /dev/null +++ b/cnns-and-feature-extraction-the-curse-of-data-sparsity.md @@ -0,0 +1,177 @@ +--- +title: "CNNs and feature extraction? The curse of data sparsity." +date: "2019-07-19" +categories: + - "deep-learning" +tags: + - "computer-vision" + - "convolutional-neural-networks" + - "deep-learning" + - "dimensionality" +--- + +I recently finished work on my master's thesis in which I investigated how the process of mapping underground utilities such as cables and pipelines could be improved with deep neural networks. + +Specifically, since utility mapping harnesses a geophysical technique called Ground Penetrating Radar, which produces image-like data, I investigated the effectiveness of [Convolutional Neural Networks](https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) for this purpose. Since utility mapping is effectively a classification problem with respect to utility material type, that's what made CNNs worthwhile. + +Later more on my thesis work, but today I want to share a peculiar observation with you: **that I have the feeling that feature compression deteriorates model performance when you're using CNNs.** + +**Update February 2020** - Added table of contents and added links to relevant MachineCurve blog posts. + +\[toc\] + +## A bit of history + +Since deep learning practitioners such as Chollet claim "to input data into CNNs as raw as possible", you may wonder why this blog is written in the first place. + +So let's look backwards for a bit before we'll try to explain the behavior I observed during my research. + +Primarily, approaches harnessing machine learning for improving the utility mapping process have used [Support Vector Machines](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) for this purpose. SVMs, which were popular many years ago i.e. before deep learning was cool, had one big shortcoming: they could not handle dimensionality well. That is, if you had an image, you had to substantially downsample it prior to feeding it to the model. Otherwise, it wouldn't work. + +By consequence, many feature extraction approaches were investigated for utility mapping that all had in common that they wanted to reduce this _curse of dimensionality_. Examples are signal histograms (reducing dimensionality because many signal backscatters could be grouped into histogram bins) or the Discrete Cosine Transform (which essentially transforms the data input into the frequency spectrum, making it usable for signal compression such as the JPEG format). + +...so I thought: let's try and see if they also work with CNNs, and I trained CNNs with histograms, DCTs and raw data. + +Fun fact: the first two didn't work with accuracies averaging 50-60%. The latter one _did_ work and achieved ~80% with only 2500 data points. + +_Side note:_ we're currently expanding the number of samples to avoid the trap of overfitting. 
+ +I think I have been able to intuitively derive the reasons for this problem based on logical reasoning, but let's first see if we can reproduce this behavior once more. + +\[ad\] + +## MNIST CNN + +Do we remember that fancy numbers dataset? + +![](images/mnist.png) + +Indeed, it's the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset: "a training set of 60,000 examples, and a test set of 10,000 examples". It [contains handwritten digits](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#mnist-database-of-handwritten-digits), thus numbers from 0-9. + +To give you a baseline of what a CNN can do with such a dataset, you will next see the result of training a CNN based on a [default Keras example script](https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py): + +``` +Epoch 1/12 +60000/60000 [==============================] - 24s 404us/step - loss: 0.2616 - acc: 0.9201 - val_loss: 0.0745 - val_acc: 0.9779 +Epoch 2/12 +60000/60000 [==============================] - 15s 250us/step - loss: 0.0888 - acc: 0.9731 - val_loss: 0.0427 - val_acc: 0.9864 +Epoch 3/12 +60000/60000 [==============================] - 15s 244us/step - loss: 0.0667 - acc: 0.9797 - val_loss: 0.0356 - val_acc: 0.9878 +Epoch 4/12 +60000/60000 [==============================] - 14s 239us/step - loss: 0.0559 - acc: 0.9835 - val_loss: 0.0308 - val_acc: 0.9901 +Epoch 5/12 +60000/60000 [==============================] - 14s 238us/step - loss: 0.0478 - acc: 0.9858 - val_loss: 0.0318 - val_acc: 0.9901 +Epoch 6/12 +60000/60000 [==============================] - 13s 212us/step - loss: 0.0434 - acc: 0.9870 - val_loss: 0.0288 - val_acc: 0.9908 +Epoch 7/12 +60000/60000 [==============================] - 13s 218us/step - loss: 0.0392 - acc: 0.9877 - val_loss: 0.0312 - val_acc: 0.9904 +Epoch 8/12 +60000/60000 [==============================] - 14s 236us/step - loss: 0.0350 - acc: 0.9891 - val_loss: 0.0277 - val_acc: 0.9909 +Epoch 9/12 +60000/60000 [==============================] - 14s 232us/step - loss: 0.0331 - acc: 0.9897 - val_loss: 0.0276 - val_acc: 0.9906 +Epoch 10/12 +60000/60000 [==============================] - 15s 243us/step - loss: 0.0318 - acc: 0.9901 - val_loss: 0.0269 - val_acc: 0.9913 +Epoch 11/12 +60000/60000 [==============================] - 13s 219us/step - loss: 0.0284 - acc: 0.9914 - val_loss: 0.0296 - val_acc: 0.9899 +Epoch 12/12 +60000/60000 [==============================] - 12s 200us/step - loss: 0.0263 - acc: 0.9918 - val_loss: 0.0315 - val_acc: 0.9903 +Test loss: 0.03145747215508682 +Test accuracy: 0.9903 +``` + +That's pretty good performance: it was right in approximately 99% of cases using the test set after only 12 epochs, or rounds of training. Could be worse... although it's a very simple computer vision problem indeed ;-) + +## Making the data sparser + +In order to demonstrate what I mean with _worse performance when your data is sparser_, I'm going to convert the MNIST samples into a sparsened version. I'll use the Discrete Cosine Transform for this, also called the DCT. + +The DCT is a signal compression technique which, according to [Wikipedia](https://en.wikipedia.org/wiki/Discrete_cosine_transform), "expresses a finite sequence of data points in terms of a sum of cosine functions oscillating at different frequencies". + +\[ad\] + +I'm specifically using the `scipy.fftpack` DCT, type 2, which is the de facto default DCT in the scientific community. 
It can be written as [follows](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.fftpack.dct.html): + +``` + N-1 +y[k] = 2* sum x[n]*cos(pi*k*(2n+1)/(2*N)), 0 <= k < N. + n=0 +``` + +This is what the numbers subsequently look like visually: + +[![](images/mnist_with_dct-1024x537.jpeg)](https://machinecurve.com/wp-content/uploads/2019/07/mnist_with_dct.jpeg) + +DCTs generated for MNIST samples. + +You see that they can still be distinguished, but that the signal is more compact now (or diluted). This property, called _signal compaction_, allows one to literally downsample the DCT without losing predictive power. + +Now let's see what happens if you average the matrices across one of the axes: + +![](images/signal_compaction-1.png) + +We have substantially sparser feature vectors now: in fact, every number is now represented by 28 instead of 784 features. + +Let's redo the experiment. Note that this time, I had to change all references to [2D image data](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/), e.g. the `Conv2D` and the `MaxPooling2D` layers, into their 1D variants - we namely removed one dimension from the data, and the 2D variants simply don't work anymore. + +The [convolution operation](https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) with learning filters itself, however, remains similar. This is the result: + +``` +Epoch 1/12 +60000/60000 [==============================] - 23s 380us/step - loss: 2.5680 - acc: 0.1103 - val_loss: 2.3011 - val_acc: 0.1135 +Epoch 2/12 +60000/60000 [==============================] - 11s 183us/step - loss: 2.3026 - acc: 0.1123 - val_loss: 2.3010 - val_acc: 0.1135 +Epoch 3/12 +60000/60000 [==============================] - 12s 196us/step - loss: 2.3021 - acc: 0.1126 - val_loss: 2.3010 - val_acc: 0.1135 +Epoch 4/12 +60000/60000 [==============================] - 11s 190us/step - loss: 2.3015 - acc: 0.1123 - val_loss: 2.3010 - val_acc: 0.1135 +Epoch 5/12 +60000/60000 [==============================] - 10s 174us/step - loss: 2.3016 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135 +Epoch 6/12 +60000/60000 [==============================] - 11s 186us/step - loss: 2.3014 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135 +Epoch 7/12 +60000/60000 [==============================] - 11s 185us/step - loss: 2.3013 - acc: 0.1123 - val_loss: 2.3010 - val_acc: 0.1135 +Epoch 8/12 +60000/60000 [==============================] - 11s 192us/step - loss: 2.3013 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135 +Epoch 9/12 +60000/60000 [==============================] - 11s 184us/step - loss: 2.3013 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135 +Epoch 10/12 +60000/60000 [==============================] - 10s 163us/step - loss: 2.3015 - acc: 0.1125 - val_loss: 2.3010 - val_acc: 0.1135 +Epoch 11/12 +60000/60000 [==============================] - 10s 166us/step - loss: 2.3013 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135 +Epoch 12/12 +60000/60000 [==============================] - 11s 191us/step - loss: 2.3014 - acc: 0.1124 - val_loss: 2.3010 - val_acc: 0.1135 +Test loss: 2.3010036102294924 +Test accuracy: 0.1135 +``` + +Absolutely terrible performance. Unworthy of CNNs! + +And this is indeed what I also experienced during my research. 
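+
+For readers who want to reproduce this sparsening step, here is a minimal sketch of the preprocessing described above - a type-2 DCT per image, followed by averaging over one axis so that every digit is represented by 28 values instead of 784. It is an illustrative reconstruction rather than the exact script I used, and it assumes `scipy` and the Keras MNIST loader are available:
+
+```
+import numpy as np
+from scipy.fftpack import dct
+from tensorflow.keras.datasets import mnist
+
+# Load MNIST: (60000, 28, 28) train images, (10000, 28, 28) test images
+(x_train, y_train), (x_test, y_test) = mnist.load_data()
+
+def sparsen(images):
+  # Type-2 DCT over the last axis (the rows of each 28x28 image)
+  transformed = dct(images.astype('float32'), type=2, norm='ortho', axis=-1)
+  # Average over one spatial axis: 784 features become 28 features per digit
+  return transformed.mean(axis=1)
+
+x_train_sparse = sparsen(x_train)  # shape: (60000, 28)
+x_test_sparse = sparsen(x_test)    # shape: (10000, 28)
+print(x_train_sparse.shape, x_test_sparse.shape)
+```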
+
+## Here's why I think that DCT sparsity deteriorated model performance
+
+In my research, I drew this conclusion with respect to the loss of performance when using the DCT:
+
+_I think you blind the convolutional filters to the idiosyncrasies of the data._
+
+\[ad\]
+
+Or, in layman's terms, you make the CNN blind to the unique aspects represented by the numbers... despite the fact that they are already _in there_.
+
+**Why is this the case?**
+
+In my opinion, this can be explained by looking at the internals of a convolutional layer. It works as follows. [You specify a number of filters](https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) which, during training, learn to recognize unique aspects of the image-like data. They can then be used to classify new samples - quite accurately, as we have seen with raw MNIST data. This means that the convolutional layer _already makes your data representation sparser_. What's more, this effect gets even stronger when layers like [Max Pooling](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/) are applied - which is precisely what I did above.
+
+But when you downsample the data first, e.g. by applying the DCT, _you effectively apply sparsening twice._ My conclusion is therefore that the convolutional filters can no longer learn the unique aspects within the image-like data, because those aspects are hidden away in the compacted dataset. Only then did I truly understand why people always suggest feeding your image data into CNNs as untransformed as possible.
+
+**Then why did this work with SVMs?**
+
+Previous scientific works on supporting utility mapping with machine learning achieved promising results when applying dimensionality reduction techniques like the DCT before training their models, such as SVMs.
+
+Yet, it didn't work with CNNs.
+
+Besides the architectural differences between them, one must also conclude that _CNNs make data essentially sparser while SVMs do not_. Consequently, for the latter you actually needed to apply those compression techniques to make them work in the first place, while for the former they make the models perform worse.
+
+An interesting insight - and a reminder to always set an average- to well-performing baseline first before you start training variations :-)
+
+Did you run into this problem too? I'm eager to know. Please feel free to leave a comment. I'm happy to respond :) Thanks for reading!
diff --git a/commoditizing-ai-the-state-of-automated-machine-learning.md b/commoditizing-ai-the-state-of-automated-machine-learning.md
new file mode 100644
index 0000000..2a5be1a
--- /dev/null
+++ b/commoditizing-ai-the-state-of-automated-machine-learning.md
@@ -0,0 +1,186 @@
+---
+title: "Commoditizing AI? The state of automated machine learning."
+date: "2019-07-22"
+categories:
+  - "deep-learning"
+  - "svms"
+tags:
+  - "automl"
+  - "commoditization"
+  - "deep-learning"
+  - "machine-learning"
+---
+
+It cannot go unnoticed that machine learning has been quite the hype these past few years. AI programs at universities are spiking in interest and, at least here in the Netherlands, they have become so crowded that students have to be turned away.
+
+However, hidden from popular view, are we on the verge of a radical transformation in machine learning and its subset practice of deep learning?
+ +A transformation in the sense that we are moving towards automated machine learning - making hardcore ML jobs obsolete? + +Perhaps so, as recent research reports indicate that research into automated ML tools is intensifying (Tuggener et al., 2019). It triggered me: can I lose my job as a ML engineer even _before_ the field has stopped to be hot? + +Let's find out. In this blog, I'll take a brief look into so-called _AutoML_ tools as well as their developments. I first take a theoretical path and list the main areas of research into automating ML. I'll then identify a few practical tools that I think are most promising today. Finally, I'll discuss how this may in my opinion affect our jobs as ML engineers. + +\[ad\] + +\[toc\] + +\[ad\] + +## What are the reasons for automating machine learning? + +Data scientists have the sexiest job of the 21st Century, at least that's what they wrote some years back. However, the job is really complex, especially when it comes to training machine learning models. It encompasses many things... + +The first step is getting to know your data. What are its ideosyncrasies? What is important in the dataset? Which features do you think are most discriminative with respect to the machine learning problem at hand? Those are questions that must be answered by data scientists before one can even think about training a ML model. + +Then, next question - which type of model must be used? Should we use Support Vector Machines with some kernel function that allows us to train SVMs for non-linear datasets? Or should we use neural networks instead? If so, what type of neural network? + +[![](images/confused-digital-nomad-electronics-874242-1024x682.jpg)](https://machinecurve.com/wp-content/uploads/2019/07/confused-digital-nomad-electronics-874242-1024x682.jpg) + +How ML engineers may feel every now and then. + +Ok, suppose that we chose a certain class of neural networks, say [Convolutional Neural Networks](https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/). You'll then have to decide about the network architecture. How many layers should be used? Which activation functions must be added to these layers? What kind of regularization do I apply? How many densely-classified layer must accompany my convolutional ones? All kind of questions that must be answered by the engineer. + +Suppose that you have chosen both a _model class_ and an _architecture_. You'll then move on and select a set of hyperparameters, or configuration options. Example ones are the degree with which a model is optimized every iteration, also known as the learning rate. Similarly, you choose the optimizer, and the loss function to be used during training. And there are other ones. + +All right, but how do I even start when I already feel overwhelmed right now? + +### Data science success correlates with experience + +Quite easy. Very likely, you do not know the answers to all these questions in advance. Often, you therefore use the experience you have to guide you towards an intuitively suitable algorithm. Subsequently, you experiment with various architectures and hyperparameters - slightly guided by what is found in the literature, perhaps, but often based on common sense. + +And worry not: it's not strange that difficult jobs are made easier. In fact, this is very common. In the 1990s and later, the World Wide Web caused a large increase in access to information. 
This made difficult jobs, such as collecting insights on highly specific topics, much easier and - often - obsolete. This process can now also be observed in the field of machine learning. + +Will AI become a commodity? Let's see where we stand now, both in theory and in practice. + +\[ad\] + +## What are current approaches towards automated ML? + +What becomes clear from the paper written by [Tuggener et al. (2019)](https://arxiv.org/abs/1907.08392) is that much research is currently being performed into automating "various blocks of the machine learning pipeline", i.e. from the beginning to the end. They suggest that these developments can be grouped into these distinct categories: + +- Automating feature engineering. +- Meta-learning. +- Architecture search. +- Hyperparameter optimization. +- Combined Model Selection and Hyperparameter Optimization (CASH). + +### Automating feature engineering + +The first category is automated **feature engineering**. Every model harnesses feature vectors and, together with their respective targets in the case of supervised learning, attempts to identify patterns in the data set. + +However, not every feature in a feature vector is, so to say, _discriminative_ enough. + +That is, it blinds the model from identifying relevant patterns rather than making those patterns clearer. + +It's often best to remove these features. This is often a tedious job, since an engineer must predict which ones must be removed, before retraining the models to see whether his or her prediction is right. + +Various approaches towards automating this problem exist today. For example, it can be considered to be a reinforcement learning problem, where an intelligent agent learns to recognize good and bad features. Other techniques combine features before feeding them into the model, assessing their effectiveness. Another approach attempts to compute the information gain for scenarios where features are varied. Their goal is to maximize this gain. However, they all have in common that they _only focus on the feature engineering aspects_. That's however only one aspect of the ML pipeline. + +### Meta-learning + +In another approach, named **meta-learning**, the features are not altered. Rather, a meta-model is trained that has learnt from previous training processes. Such models can take as input e.g. the number and type of features as well as the algorithms and then generate a prediction with respect to what optimization is necessary. + +As Tuggener et al. (2019) demonstrate, many such algorithms are under active development today. The same observation is made by Elshawi et al. (2019). + +### Architecture search + +Similarly under active research scrutiny these days is what Tuggener et al. (2019) call **architecture search**. In essence, finding the best-performing model can be considered to be a search problem with the goal of finding the right model architecture. It's therefore perhaps one of the most widely used means for automating ML these days. + +Within this category, many sub approaches to searching the most optimal architecture can be observed today (Elshawi et al., 2019). At a very high level, they are as follows: + +- Searching randomly. It's a naïve approach, but apparently especially this fact benefits finding model architectures. +- Reinforcement learning, or training a dumb agent by means of "losses" and "rewards" to recognize good paths towards improvement, is an approach that is used today as well. 
+- By optimizing the gradient of the _search problem_, one can essentially consider finding the architecture to be a meta problem. +- Evolutionary algorithms that add genetic optimization can be used for finding well-performing architectures. +- Bayesian optimization, or selecting a path to improvement from a Gaussian distribution, is also used in certain works. + +I refer to the original work (Elshawi et al., 2019 - see the references list below) for a more detailed review. + +### Hyperparameter optimization + +Suppose that you have chosen a particular model type, say a Convolutional Neural Network. As you've read before, you then face the choice of hyperparameter selection - or, selecting the model's configuration elements. + +It includes, as we recall, picking a suitable optimizer, setting the learning rate, et cetera. + +This is essentially a large search problem to be solved. + +If **hyperparameter optimization** is used for automating machine learning, it's essentially this last part of the ML training process that is optimized. + +But is it enough? Let's introduce CASH. + +### Combined Model Selection and Hyperparameter Optimization (CASH) + +If you combine the approaches discussed previously, you come to what is known as the CASH approach: combining model selection and hyperparameter optimization. + +Suppose that you have a dataset for which you wish to train a machine learning model, but you haven't decided yet about an architecture. + +Solving the CASH problem would essentially mean that you find the optimum data pipeline for the dataset (Tuggener et al., 2019) - including: + +- Cleaning your data. +- Feature selection and construction, where necessary. +- Model selection (SVMs? Neural networks? CNNs? RNNs? Eh, who knows?) +- Hyperparameter optimization. +- Perhaps, even ensemble learning, combining the models into a better-performing ensemble. + +According to Tuggener et al. (2019) this would save a massive amount of time for data scientists. They argued that a problem which their data scientists worked hard on for weeks could be solved by automated tooling in 30 minutes. Man, that's progress. + +\[ad\] + +## AutoML tools used in practice + +All these theoretical contributions are nice, but I am way more curious about how they are applied in practice. + +What systems for automating machine learning are in use today? + +Let's see if we can find some and compare them. + +### Cloud AutoML + +The first system I found is called [Cloud AutoML](https://cloud.google.com/automl/) and is provided as a service by Google. It suggests that it uses Google's _Neural Architecture Search_. This yields the insight that it therefore specifically targets neural networks and attempts to find the best architecture with respect to the dataset. It focuses on computer vision, natural language processing and tabular data. + +[![](images/art-blue-skies-clouds-335907-1024x686.jpg)](https://machinecurve.com/wp-content/uploads/2019/07/art-blue-skies-clouds-335907.jpg) + +The cloud is often the place for automated machine learning, but this does not always have to be the case. + +### AutoKeras + +Cloud AutoML is however rather pricy as it apparently costs $20 per hour (Seif, 2019). Fortunately, for those who have experience with [Keras](https://machinecurve.com/index.php/mastering-keras/), there is now a library out there called AutoKeras - take a look at it [here](https://github.com/keras-team/autokeras). 
It essentially turns the Keras based way of working into an AutoML problem: it performs an architecture search by means of Bayesian optimization and network morphism. Back to plain English now, but if you really wish to understand it deeper - take a look at (Jin et al., 2018). + +I do - and will dive deeper into it ASAP. Remind me of this, please! 😄 + +### Other tools + +A post by Oleksii Kharkovyna at Medium/TowardsDataScience suggests that there are various other approaches to automated ML in use today. Check it out [here](https://towardsdatascience.com/top-10-data-science-ml-tools-for-non-programmers-d12ce6dcccc). + +## My conclusions + +The field of machine learning seems to be democratizing rapidly. Whereas deep knowledge on algorithms, particularly deep neural networks these days, was required in the past, that seems to be less and less the case. + +Does this mean that no ML engineers are required anymore? + +No. Not in my view. + +![](images/connection-data-desk-1181675-1024x683.jpg) + +However, what I'm trying to suggest here is a number of things: + +1. Do not stare yourself blind at becoming an expert in model optimization. It's essentially a large search problem that is bound to be democratized and, by consequently, automated away. +2. Take notice of the wide array of automated machine learning tools and get experience with them. You may be asked to use them in the future. It would be nice if you already had some experience - it would set you apart from the rest 😄 +3. Become creative! 🧠 These automated machine learning solutions are simply the solvers of large search problems. However, translating business problems into a machine learning task still requires creativity and tactical and/or strategic awareness. This is still a bridge too far for those kind of technologies. + +Data science may still be the sexiest job of the 21st Century, but be prepared for some change. Would you agree with me? Or do you disagree entirely? I would be glad to know. Leave your comments in the comment section below 👇 I'll respond with my thoughts as soon as I can. + +\[ad\] + +## References + +Elshawi, R., Maher, M., & Sakr, S. (2019, June). Automated Machine Learning: State-of-The-Art and Open Challenges. Retrieved from [https://arxiv.org/abs/1906.02287](https://arxiv.org/abs/1906.02287) + +Kharkovyna, O. (2019, May 22). Top 10 Data Science & ML Tools for Non-Programmers - Towards Data Science. Retrieved from [https://towardsdatascience.com/top-10-data-science-ml-tools-for-non-programmers-d12ce6dcccc](https://towardsdatascience.com/top-10-data-science-ml-tools-for-non-programmers-d12ce6dcccc ) + +Jin, H., Song, Q., & Hu, X. (2018, June). Auto-Keras: An Efficient Neural Architecture Search System. Retrieved from [https://arxiv.org/abs/1806.10282](https://arxiv.org/abs/1806.10282) + +Seif, G. (2019, February 23). AutoKeras: The Killer of Google's AutoML - Towards Data Science. Retrieved from [https://towardsdatascience.com/autokeras-the-killer-of-googles-automl-9e84c552a319](https://towardsdatascience.com/autokeras-the-killer-of-googles-automl-9e84c552a319) + +Tuggener, L., Amirian, M., Rombach, K., Lörwald, S., Varlet, A., Westermann, C., & Stadelmann, T. (2019, July). Automated Machine Learning in Practice: State of the Art and Recent Results. 
Retrieved from [https://arxiv.org/abs/1907.08392](https://arxiv.org/abs/1907.08392) diff --git a/conditional-gans-cgans-explained.md b/conditional-gans-cgans-explained.md new file mode 100644 index 0000000..22e7d34 --- /dev/null +++ b/conditional-gans-cgans-explained.md @@ -0,0 +1,107 @@ +--- +title: "Conditional GANs (cGANs) explained" +date: "2021-03-25" +categories: + - "buffer" + - "deep-learning" +tags: + - "cgan" + - "conditional-gan" + - "gan" + - "gans" + - "generative-adversarial-networks" + - "generative-models" +--- + +**Conditional Generative Adversarial Networks**, or _cGANs_ for short, improve regular or 'vanilla' GANs by adding a condition into the Generator and Discriminator networks. The idea is that it allows a GAN to better structure its latent space and the mapping into data space, and the concept of a cGAN was proposed by Mirza & Osindero (2014). + +In this article, we're going to take a look at cGANs and explain the concepts. After reading it, you will... + +- **Understand what is meant with vanilla GANs being _unconditional_.** +- **Know how GANs can be made conditional, and be turned into cGANs.** +- **See how this improves performance of a GAN trained on the MNIST dataset.** + +Let's take a look! 🚀 + +* * * + +\[toc\] + +* * * + +## Vanilla GANs are Unconditional + +Generative Adversarial Networks were proposed back in 2014, through a paper written by Ian Goodfellow and others (Goodfellow et al., 2014). Their architecture is composed of two neural networks. First of all, there is a _generator_ \[latex\]G\[/latex\], which is responsible for generating images. Secondly, there is a _discriminator_ \[latex\]D\[/latex\], which has the task to detect which of the images presented to it is fake and which is real. + +As they are trained jointly by minimizing loss components, the following minimax game emerges, as discussed in [our article about vanilla GANs](https://www.machinecurve.com/index.php/2021/03/23/generative-adversarial-networks-a-gentle-introduction/): + +![](images/image-1-1024x401.png) + +Loss and its components for an unconditioned GAN. + +This loss works with both the vector \[latex\]\\bf{z}\[/latex\] sampled from the latent distribution \[latex\]\\text{Z}\[/latex\] and vector \[latex\]\\bf{x}\[/latex\] which is generated by the generator on the basis of \[latex\]\\bf{z}\[/latex\]. + +Through their joint training, the Generator learns to convert samples from the latent distribution in such a way that they produce output images that cannot be distinguished from real ones anymore. This allows us to draw samples (\[latex\]\\bf{z}\[/latex\]s) from the latent distribution and generate images. In effect, through the lens of the Generator, the latent distribution thus compresses information about 'data space' - but then in an accessible way. + +According to Goodfellow et al. (2014), vanilla GANs have many straightforward extensions - of which cGANs are one: + +> A conditional generative model p(x | c) can be obtained by adding c as input to both G and D +> +> Goodfellow et al. (2014) + +Yes, vanilla GANs are _unconditional_. In the quote above, you can read \[latex\]p(\\text{x | c})\[/latex\] as _the probability that we generate vector \[latex\]\\bf{x}\[/latex\] given some condition \[latex\]c\[/latex\]._ Adding a condition to our probabilities allows us to teach the Generator to use the latent distribution in an even more structured way. 
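+
+To make "adding c as input to G" a bit more concrete before we move on, here is a minimal Keras sketch of what a label-conditioned Generator input could look like. This is purely illustrative - it is not the architecture from the Mirza & Osindero (2014) paper - and the latent dimensionality, embedding size and layer sizes are arbitrary choices:
+
+```
+import tensorflow as tf
+from tensorflow.keras import layers
+
+latent_dim = 100   # dimensionality of the latent vector z (arbitrary choice)
+num_classes = 10   # e.g. the ten MNIST digits
+
+# Latent vector z and the condition c (here: a class label)
+z = layers.Input(shape=(latent_dim,), name='z')
+label = layers.Input(shape=(1,), dtype='int32', name='condition')
+
+# Embed the label and concatenate it with z, so that G models p(x | c)
+label_embedding = layers.Flatten()(layers.Embedding(num_classes, 50)(label))
+joined = layers.Concatenate()([z, label_embedding])
+
+# A tiny fully-connected 'generator' that maps (z, c) to a 28x28 image
+x = layers.Dense(256, activation='relu')(joined)
+x = layers.Dense(28 * 28, activation='tanh')(x)
+fake_image = layers.Reshape((28, 28, 1))(x)
+
+generator = tf.keras.Model([z, label], fake_image, name='conditional_generator')
+generator.summary()
+```
+
+The same trick - an extra input that gets concatenated with the intermediate representation - is applied to the Discriminator, so that both networks receive the condition.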
+ +* * * + +## Introducing the Conditional GAN (cGAN) + +For this reason, Mirza & Osindero (2014) propose what they call the **Conditional GAN**, or cGAN. It adds conditioning on some extra information \[latex\]\\bf{y}\[/latex\]. Then, the probability \[latex\]p(\\bf{x})\[/latex\] generated by \[latex\]G\[/latex\] turns into \[latex\]p(\\bf{x|y})\[/latex\], or "the probability for \[latex\]\\bf{x}\[/latex\] given condition \[latex\]\\bf{y}\[/latex\]". The trained eye easily sees that \[latex\]\\bf{y}\[/latex\] and \[latex\]\\bf{c}\[/latex\] are the same; that they just use different letters. + +Adding conditioning is expected to allow for better structuring of the latent space and its sampling _and_ thus generate better results. Any kind of information can be used for conditioning (Mirza & Osindero, 2014). Simple conditioning can be achieved by adding label data, such as the target for the image to be generated (e.g. `y = 1` if the goal is to generate 1s from the MNIST dataset). It is also possible to use more complex conditioning information, such as data from other modalities (e.g. text instead of images). + +Applying conditioning involves feeding the condition parameter \[latex\]\\bf{y}\[/latex\] into both the generator \[latex\]G\[/latex\] and the discriminator \[latex\]D\[/latex\] of your GAN. We use an additional input layer for this purpose. Visually, this looks as follows: + +![](images/image-5-1024x875.png) + +Conditional Generative Adversarial Network (cGAN) architecture (Mirza & Osindero, 2014). + +We can see that the conditioning information is first concatenated with the latent space or generation and then further processed. It's a straightforward expansion of the vanilla GAN. No other significant changes to e.g. hyperparameters were applied compared to the Goodfellow et al. (2014) GAN, at least according to the paper (Mirza & Osindero, 2014). + +* * * + +## Results compared to vanilla GANs + +Conditional GANs were tested in two settings: + +- **Generating samples from the MNIST dataset.** Performance was measured by a log-likelihood estimate. +- **Generating text labels for images.** + +In both cases, very agreeable results were achieved, with cGANs achieving better performance on MNIST generation compared to vanilla GANs. Conditioning definitely helps here! + +![](images/mnist.png) + +* * * + +## Summary + +Vanilla GANs were proposed back in 2014 (by Goodfellow et al., 2014) as a new mechanism for generative Machine Learning. As with any innovation, non-significant amounts of optimization have taken place. This is even suggested in the original work: there are a variety of straightforward extensions that can be applied to possibly make GANs even more performant. + +Conditional GANs, or cGANs, are one such extension. By making the sampling from latent space and data space conditional by adding an additional parameter to the neural networks, the neural network can much better structure the latent space and the mapping into data space. As a consequence, cGANs are more performant compared to the 2014 vanilla GANs, and adding conditionality - if possible - can be a best practice for training your GAN. + +By reading this tutorial, you now... + +- **Understand what is meant with vanilla GANs being _unconditional_.** +- **Know how GANs can be made conditional, and be turned into cGANs.** +- **See how this improves performance of a GAN trained on the MNIST dataset.** + +I hope that it was useful for your learning process! 
Please feel free to share what you have learned in the comments section 💬 I’d love to hear from you. Please do the same if you have any questions or other remarks. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … & Bengio, Y. (2014). [Generative adversarial networks.](https://arxiv.org/abs/1406.2661) _arXiv preprint arXiv:1406.2661_. + +Mirza, M., & Osindero, S. (2014). [Conditional generative adversarial nets.](https://arxiv.org/abs/1411.1784) _arXiv preprint arXiv:1411.1784_. diff --git a/conv2dtranspose-using-2d-transposed-convolutions-with-keras.md b/conv2dtranspose-using-2d-transposed-convolutions-with-keras.md new file mode 100644 index 0000000..d9ac3a7 --- /dev/null +++ b/conv2dtranspose-using-2d-transposed-convolutions-with-keras.md @@ -0,0 +1,356 @@ +--- +title: "Conv2DTranspose: using 2D transposed convolutions with Keras" +date: "2019-12-10" +categories: + - "deep-learning" + - "frameworks" +tags: + - "autoencoder" + - "conv2d" + - "conv2dtranspose" + - "convolutional-neural-networks" + - "deep-learning" + - "keras" + - "machine-learning" + - "python" + - "transposed-convolution" +--- + +Transposed convolutions - [we looked at them in theory](https://www.machinecurve.com/index.php/2019/09/29/understanding-transposed-convolutions/) in a previous blog post, but how can they be applied? What are they useful for? These were questions that kept popping up every now and then. + +While we all understand the usefulness of 'normal' convolutional layers, this is more difficult for transposed layers. + +As a result, I've spent some time looking into applications, which results in this blog post, covering how to use 2D transposed convolutions with Keras. In this blog post, we'll first revisit the concept of transposed convolutions at a high level, to polish our understanding of these layers. Subsequently, we introduce the concept of an _autoencoder_, where they can be used. This understanding is subsequently transformed into an actual Keras model, with which we will try to reconstruct MNIST images that have been encoded into lower-dimensional state before. + +Ready to find out how Conv2DTranspose works with Keras? Let's go! 😎 + +\[toc\] + +## Recap: what are transposed convolutions? + +Imagine that you have a ConvNet which has only one convolutional layer, performing the following operation on an image that has only one channel: + +![](images/CNN-onechannel.png) + +This is what a convolutional layer, being part of a [convolutional neural network](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/), does when training it. + +But now you have the desire to work in the opposite direction, i.e., to use a smaller input and to learn its larger representation, being the following: + +![](images/CNN-opposite.png) + +What to do? + +You have multiple options, as we can see: + +- It's possible to use **traditional interpolation techniques** like [bicubic](https://en.wikipedia.org/wiki/Bicubic_interpolation) or [bilinear interpolation](https://en.wikipedia.org/wiki/Bilinear_interpolation). While they are fast, they are not too flexible: they just produce a pixel estimate given the pixel's surroundings. This might not be suitable if e.g. you have very particular data, which shares certain patterns across samples. 
+
+- You could also choose to use **[transposed convolutions](https://www.machinecurve.com/index.php/2019/09/29/understanding-transposed-convolutions/)**. These convolutions essentially compute the matrix transpose of a regular convolutional layer, swapping the effect of the forward and the backward pass as a result. The fun thing: the weights of these transposed convolutions are learnable, allowing - and requiring - you to learn the 'swap' from the data you're feeding it.
+
+If you're interested in how these transposed convolutions work, I would like to recommend the post "[Understanding transposed convolutions](https://www.machinecurve.com/index.php/2019/09/29/understanding-transposed-convolutions/)", where I cover them in more detail.
+
+In this blog, we'll try and implement them with Keras, in order to build something that is known as an "autoencoder".
+
+## Transposed convolutions in the Keras API
+
+Let's first take a look at how Keras represents transposed convolutions, by looking at the Keras API (Keras, n.d.).
+
+This immediately requires us to make a choice: Keras contains functionality for two-dimensional and three-dimensional transposed convolutions.
+
+The difference? Relatively simple - it has to do with the dimensionality of your input data.
+
+As with the [Conv2D](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) and [Conv3D](https://www.machinecurve.com/index.php/2019/10/18/a-simple-conv3d-example-with-keras/) layers, which take either two- or three-dimensional input data (e.g., 2D pictures or 3D videos), you'll need two types of transposed convolutions to perform the opposite operation: **Conv2DTranspose** and **Conv3DTranspose**.
+
+We'll leave the three-dimensional variant to another blog and cover the two-dimensional transposed convolution here, and will provide an example implementation as well.
+
+### Conv2DTranspose in the Keras API
+
+This is how Conv2DTranspose is represented within the Keras API:
+
+```
+keras.layers.Conv2DTranspose(filters, kernel_size, strides=(1, 1), padding='valid', output_padding=None, data_format=None, dilation_rate=(1, 1), activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
+```
+
+The source can be found [here](https://github.com/keras-team/keras/blob/master/keras/layers/convolutional.py#L621), and the official Keras docs [here](https://keras.io/layers/convolutional/#conv2dtranspose).
+
+Let's now break it apart - we'll see that the attributes are pretty similar to the ones of the regular Conv2D layer:
+
+- The Conv2DTranspose layer learns a number of `filters`, similar to the regular Conv2D layer (remember that the transpose layer simply swaps the backward and forward pass, keeping the rest of the operations the same!)
+- As the transposed convolution will also slide over the input, we must specify a `kernel_size`, as with the normal convolution.
+- The same goes for the stride, through the `strides` attribute.
+- The same goes for the `padding` and `output_padding` attributes.
+- Data format: `data_format`, either the channels first or channels last approach.
+- Dilation rate: `dilation_rate`, if you wish to use dilated convolution.
+- Whether biases should be used, with `use_bias` (by default set to True, and best kept there, I'd say).
+- The `activation` function that must be used.
+
+- As with any layer, it's possible to specify initializers, regularizers and constraints for the kernel and bias.
+
+And that's it - we just dissected the Conv2DTranspose layer present within the Keras API, and have seen that it's not complicated at all 😀
+
+## Using Conv2DTranspose with Keras
+
+Now, it's time to show you how the Conv2DTranspose layer can be used with Keras.
+
+As said, we do so by building what is known as an "autoencoder".
+
+### What is an autoencoder?
+
+Wikipedia mentions the following about an autoencoder: "An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner."
+
+Okay, building a neural network is likely not too difficult for the MachineCurve reader, but "\[learning\] efficient data codings in an unsupervised manner"? What's that?
+
+Perhaps we can first show this visually.
+
+This is what an autoencoder does at a high level:
+
+[![](images/Autoencoder.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/Autoencoder.png)
+
+- One inputs an **input image** into the neural network.
+- This input image is fed through the **encoder** part of the network, which encodes the image into some **encoded state**. This state is often reduced in dimensionality.
+- This encoded state is fed into the **decoder** part of the network, which simply attempts to perform some action - in the case above, reconstructing the original input image.
+
+While I will cover autoencoders in more detail in another series of blog posts, it's important to note that the _encoder_ and the _decoder_ are learnt based on your dataset. While this means that many interesting applications emerge (see the next section), your encoder and decoder will be data-specific (and useless in other data contexts) and lossy (having difficulty reconstructing the image for 100%, while often getting close), requiring some training data as well (Keras Blog, n.d.).
+
+Note that I deliberately applied a different color to the reconstructed image, to show that while the network will try, the reconstruction will never be _fully_ equal to the original input.
+
+### Interesting applications of autoencoders (and hence, Conv2DTranspose)
+
+Because you learn _encoder_ and _decoder_ functions with autoencoders, an interesting new range of applications emerges (Keras Blog, n.d.; own experience):
+
+- By consequence of training an encoder, you're efficiently learning a **dimensionality reduction** method that is highly applicable to your training set. If your neural network, e.g. your classifier, requires lower-dimensional data, it may be worthwhile to let it pass through a learned encoder function first, using the encoded state as the feature vectors for your classifier.
+- Autoencoders are also used for **noise reduction**. Think about it as follows: when training the encoder and decoder parts, i.e. learning weights for the trainable parameters of these parts, you feed forward data - just as in the [high-level supervised learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process). You compare the prediction with some 'target', compute a loss, and optimize the model in order to move the prediction closer to the 'target'. Now, when you have data (e.g. images) and add the noise you wish to filter to them, you can use autoencoders for noise reduction.
By feeding forward the noisy images as input data, and setting the targets to the noise-free data, the autoencoder will essentially 'reconstruct' the image based on the noise-free data, given the noisy input data: there you have your noise reduction algorithm. +- In a strange way, you can also use autoencoders for **classification**. Say that you have a binary classification scenario for simple images: "yes" or "no". Picetti et al. (2018) had this scenario, in which they had so-called Ground Penetrating Radar images of landmines: _contains_ landmine or _does not contain_ landmine. By training the encoder and decoder on radar images _without_ landmines, they ensured that decoding would fail when landmines were present. By subsequently measuring the differences between input and (reconstructed) output, it's possible to say whether a mine is present: if there's not too much different, no landmine is present; if there is a lot of difference, it's likely that a mine has been spotted. + +### Today's Conv2DTranspose model: a Conv-based autoencoder + +Autoencoders can be built in many ways. For example, it's possible to use densely-connected (or, in Keras terms, `Dense`) layers, but this is not recommended for images (Keras Blog, n.d.). + +Instead, for image-like data, a Conv-based autoencoder is more preferred - based on convolutional layers, which give you the same benefits as 'normal' ConvNets (e.g., invariance to the position of objects in an image, due to the nature of convolutions). + +As we'll use the MNIST dataset for showing how Conv2DTranspose works, which contains thousands of images of handwritten digits, we'll create a _simple_ Conv-based autoencoder today, using the Conv2DTranspose layers. Do note that I won't cover many of the autoencoder ideosyncrasies and will keep the autoencoder architecture really simple (only providing the _decoder_ function, keeping the encoder function hidden in the model), as today's goal is not to explain autoencoders, but to give a Conv2DTranspose example instead. + +[![](images/mnist.png)](https://www.machinecurve.com/wp-content/uploads/2019/07/mnist.png) + +### What you'll need to run the model + +As with many of the tutorials we post n MachineCurve, you'll need a few dependencies to run this model successfully: + +- First of all, you'll need a recent version of **Python** - as we will write our code in this language. Please use Python 3.6 or newer. +- Additionally, you'll need the **Keras** deep learning framework, which we use to show Conv2DTranspose. +- The **Tensorflow** (or **Tensorflow GPU**) backend. +- You will also need **Numpy** for number processing. +- And don't forget **Matplotlib** for visualizing the inputs and the reconstructed outputs 😊 + +### Model imports & configuration + +Now, open your file explorer, navigate to some folder and create a Python file: for example, `conv2dtranspose.py`. + +Open up your code editor as well as this file, and let's start writing some code 😀 + +We first add all the imports: + +``` +import keras +from keras.datasets import mnist +from keras.models import Sequential +from keras.layers import Conv2D, Conv2DTranspose +import matplotlib.pyplot as plt +import numpy as np +``` + +- We import `keras` so that we can import all the other stuff. +- We import `mnist` from `keras.datasets`. Easy way of importing your data! +- From `keras.models` we import `Sequential`, which represents the Keras Sequential API for stacking all the model layers. 
+- We use `keras.layers` to import `Conv2D` (for the encoder part) and `Conv2DTranspose` (for the decoder part). +- We import Matplotlib, specifically the Pyplot library, as `plt`. +- And `numpy` as `np`. + +Subsequently, we specify some configuration options: + +``` +# Model configuration +img_width, img_height = 28, 28 +batch_size = 1000 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +``` + +The MNIST digits are 28 pixels wide and high - we hence specify `img_width` and `img_height` as 28. + +We use a [batch size of 1000](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/#minibatch-gradient-descent). Even though we don't use the gradient descent optimizer (as we will see later), this represents a minibatch approach, balancing between _memory requirements_ and _accuracy of gradients_ (click the link for more detailed information about this trade-off). + +The autoencoder will be trained for 25 epochs, and there are ten classes - the ten digits, 0 to 9. 20% of the data will be used for validation purposes and the `verbosity` will be set to True (1), showing all output on screen. + +![](images/mnist-visualize.png) + +Example MNIST digits + +### Loading MNIST data & making it ready for training + +We next load the MNIST data (this assumes that you'll run Keras on Tensorflow given the channels first/channels last approach): + +``` +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 +``` + +This also includes reshaping the data into actionable format, parsing the numbers as floats (presumably speeding up the training process), and normalizing the data (which is appreciated by the optimizer). + +### Defining the model architecture + +Next, we can define the model architecture. It looks as follows: + +![](images/model-2.png) + +Our model starts with an input layer, allowing us to input the data - which is normal for any neural network. + +It is then followed by three Conv2D layers, forming the 'encoder' part of our autoencoder. + +The Conv2D layers are followed by three Conv2DTranspose layers, which form the 'decoder' part of our model: they upsample the encoded state back into higher-dimensional format, being the 28x28 pixel data as we had before. The last Conv2D layer finalizes this process, effectively reshaping the image into the 28x28 pixel images by convolving over the upsampled data. 
+ +In code, it looks like this: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', kernel_initializer='he_normal', input_shape=input_shape)) +model.add(Conv2D(16, kernel_size=(3, 3), activation='relu', kernel_initializer='he_normal')) +model.add(Conv2D(8, kernel_size=(3, 3), activation='relu', kernel_initializer='he_normal')) +model.add(Conv2DTranspose(8, kernel_size=(3,3), activation='relu', kernel_initializer='he_normal')) +model.add(Conv2DTranspose(16, kernel_size=(3,3), activation='relu', kernel_initializer='he_normal')) +model.add(Conv2DTranspose(32, kernel_size=(3,3), activation='relu', kernel_initializer='he_normal')) +model.add(Conv2D(1, kernel_size=(3, 3), activation='sigmoid', padding='same')) +``` + +This is how the **Conv2DTranspose** layer can be used: for the decoder part of an autoencoder. + +Do note the following aspects: + +- For all but the last layer, we use the `he_normal` [kernel initializer](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/). We do so given the fact that we use ReLU, and that Xavier init is [incompatible with this activation function](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/). +- The last layer contains 'same' padding in order to ensure that the output is 28x28 pixels. + +### Model compilation & fitting the data + +Next, we compile the model: + +``` +# Compile and fit data +model.compile(optimizer='adam', loss='binary_crossentropy') +``` + +...using the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) and [binary crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/), which are the standard choices in today's ML toolkit 😉 The BCE loss allows the loss to increasingly go up when the predicted reconstruction is more off, while punishing almost-correct predictions only lightly. This avoids large, weird steps. + +Then, we fit the data: + +``` +model.fit(input_train, input_train, + epochs=no_epochs, + batch_size=batch_size, + validation_split=validation_split) +``` + +Do note that we fit the input data _as targets_ as well. This allows the decoder part to reconstruct the original image, as the outcome of the Conv2D-Conv2DTranspose pipeline will be compared with the original inputs being targets as well. Here, we also configure the `batch_size` and `validation-split` that we set earlier. + +### Reconstructing inputs with the autoencoder + +`model.fit` starts the training process. Once the model has finished training, it is likely that you have created a reconstruction model that can reconstruct the MNIST digits quite successfully. Let's now see whether we can actually achieve this. + +``` +# Generate reconstructions +num_reconstructions = 8 +samples = input_test[:num_reconstructions] +targets = target_test[:num_reconstructions] +reconstructions = model.predict(samples) +``` + +The code above, when added, takes `num_reconstructions` of the test data set (i.e., data the model has never seen before, to see whether it generalizes well) as well as its `targets` and generates `reconstructions` by `model.predict`. You can set the number of reconstructions yourself, or choose to build it in a different way - but that's up to you. 
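+
+If you are curious about the encoded state itself - which stays hidden inside the model in this setup - you can peek at it by wrapping part of the trained model in a new `Model`. This is an optional extra, not part of the original script; it assumes the `model` and `samples` defined above, and simply treats the output of the third `Conv2D` layer as the 'encoded' representation:
+
+```
+from keras.models import Model
+
+# Build a new model that maps the original input to the output of the
+# last Conv2D layer of the encoder part (layer index 2 in this architecture).
+encoder = Model(inputs=model.input, outputs=model.layers[2].output)
+
+# Encode a few test samples and inspect the shape of the encoded state
+encoded_states = encoder.predict(samples)
+print(encoded_states.shape)  # e.g. (8, 22, 22, 8) for 28x28 inputs and three 3x3 convolutions
+```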
+ +### Visualizing the reconstructions + +Next, we add relatively simple code for visualizing the original inputs and reconstructions: + +``` +# Plot reconstructions +for i in np.arange(0, num_reconstructions): + # Get the sample and the reconstruction + sample = samples[i][:, :, 0] + reconstruction = reconstructions[i][:, :, 0] + input_class = targets[i] + # Matplotlib preparations + fig, axes = plt.subplots(1, 2) + # Plot sample and reconstruciton + axes[0].imshow(sample) + axes[0].set_title('Original image') + axes[1].imshow(reconstruction) + axes[1].set_title('Reconstruction with Conv2DTranspose') + fig.suptitle(f'MNIST target = {input_class}') + plt.show() +``` + +This allows us to iterate over the reconstructions, to retrieve the sample and reconstruction and to prepare it for application in `imshow`, subsequently generating `subplots` and plotting the images. + +Let's now see whether this was a success 😊 + +## What the reconstructions look like + +Now, open up a terminal, `cd` to the folder where your file is located - and run `python conv2dtranspose.py`. The training process should then commence and eventually finish, and plots should be generated. + +Training our model yields a loss of approximately 0.05 - which is quite good, yet is unsurprising given the results often achieved with MNIST. + +What's perhaps more important is to see whether it actually _works_ - by visualizing the reconstructions. I won't keep you waiting any longer - here they are: + +- [![](images/1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/1.png) + +- [![](images/2.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/2.png) + +- [![](images/3.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/3.png) + +- [![](images/4.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/4.png) + +- [![](images/5.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/5.png) + +- [![](images/6.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/6.png) + +- [![](images/7.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/7.png) + +- [![](images/8.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/8.png) + + +As you can see, using Conv2DTranspose for the decoder part worked pretty well! 😎 + +## Summary + +In this blog post, we've seen how transposed convolutions can be used with Keras - by virtue of an autocoder example. These transposed convolutions, in two dimensions available as the Conv2DTranspose layer, can be used for the 'decoder' part of such an autoencoder - allowing you to e.g. reduce dimensionality, remove noise, or reconstruct images, as we have done. + +The tutorial includes a Keras based example of how to build such a model. Training the model has resulted in successful reconstructions, and a good demonstration of how Conv2DTranspose can be used with Keras. + +I hope you've learnt something today! If you did, or if you didn't or have any questions - please feel free to leave a comment below. I'm looking forward to your response 😊 + +Thanks for reading MachineCurve today and happy engineering! 😎 + +## References + +Ramey, J. (2018, May 14). Autoencoders with Keras. Retrieved from [https://ramhiser.com/post/2018-05-14-autoencoders-with-keras/](https://ramhiser.com/post/2018-05-14-autoencoders-with-keras/) + +Keras Blog. (n.d.). Building Autoencoders in Keras. Retrieved from [https://blog.keras.io/building-autoencoders-in-keras.html](https://blog.keras.io/building-autoencoders-in-keras.html) + +MachineCurve. (2019, September 29). 
Understanding transposed convolutions. Retrieved from [https://www.machinecurve.com/index.php/2019/09/29/understanding-transposed-convolutions](https://www.machinecurve.com/index.php/2019/09/29/understanding-transposed-convolutions) + +Keras. (n.d.). Convolutional Layers: Conv2DTranspose. Retrieved from [https://keras.io/layers/convolutional/#conv2dtranspose](https://keras.io/layers/convolutional/#conv2dtranspose) + +Wikipedia. (2006, September 4). Autoencoder. Retrieved from [https://en.wikipedia.org/wiki/Autoencoder](https://en.wikipedia.org/wiki/Autoencoder) + +Picetti, F., Testa, G., Lombardi, F., Bestagini, P., Lualdi, M., & Tubaro, S. (2018, July). [Convolutional Autoencoder for Landmine Detection on GPR Scans.](https://ieeexplore.ieee.org/abstract/document/8441206) In _2018 41st International Conference on Telecommunications and Signal Processing (TSP)_ (pp. 1-4). IEEE. diff --git a/convolutional-neural-networks-and-their-components-for-computer-vision.md b/convolutional-neural-networks-and-their-components-for-computer-vision.md new file mode 100644 index 0000000..03fb7c6 --- /dev/null +++ b/convolutional-neural-networks-and-their-components-for-computer-vision.md @@ -0,0 +1,449 @@ +--- +title: "Convolutional Neural Networks and their components for computer vision" +date: "2018-12-07" +categories: + - "deep-learning" +tags: + - "cnn" + - "computer-vision" + - "convolutional-neural-networks" + - "deep-learning" + - "neural-networks" +--- + +Machine learning (and consequently deep learning) can be used to train computers to see things. We know that machine learning is about feeding examples to machines, after which they derive the patterns in these examples themselves. Consequently, we can see that using machine learning for computer vision equals showing machines enough examples so that they can learn to recognize them on their own, for new data. + +In deep learning, we use [deep neural networks](https://machinecurve.com/index.php/2018/11/23/what-is-deep-learning-exactly/) to learn machines to recognize patterns on their own. + +But not every class of deep learning algorithms is suitable for computer vision, for reasons that will be explained later in this blog. + +Nevertheless, there exists a class of networks that _can_ handle this kind of data. + +**Convolutional Neural Networks** (CNNs) are a class of Artificial Neural Networks (ANNs) which have proven to be very effective for this type of task. They have certain characteristics that share resemblance with how human beings recognize patterns in visual imagery. + +\[ad\] + +But CNN is not _one thing_. It is a _class_ of algorithms. And it contains various so-called _network architectures_. + +Then what is an architecture? + +Most simply, we can compare an architecture with a building. It consists of walls, windows, doors, et cetera - and together these form the building. Explaining what a neural network architecture is benefits from this analogy. Put simply, it is a collection of components that is put in a particular order. The components themselves may be repeated and also may form blocks of components. Together, these components form a neural network: in this case, a CNN to be precise. + +In this blog, I would like to explain the generic concepts behind CNNs in more detail. I will cover the differences between _regular_ neural networks and convolutional ones, I will decompose a general neural network into its components - i.e., its layers - and I will give some suggestions about how to construct such a network from scratch. 
+ +**Update February 2020** - Added references to other MachineCurve blogs and added more information about activation functions to the text. + +\[toc\] + +## The differences between regular neural networks and convolutional ones + +CNNs are quite similar to 'regular' neural networks: it's a network of neurons, which receive input, transform the input using mathematical transformations and [preferably](https://machinecurve.com/index.php/2018/11/23/what-is-deep-learning-exactly/) a non-linear activation function, and they often end in the form of a classifier/regressor. + +But they are different in the sense that they assume that the input is an image. + +What's more, the input is expected to be preprocessed **to a minimum extent**. + +Based on this assumption, using various types of layers, we can create architectures which are optimized for computer vision. + +### Short recall: an ANN consists of layers + +In order to build our knowledge, we must take one small step back before we can continue. We must recall that a regular neural network consists of a chain of interconnected layers of neurons. One network may look as follows: + +\[caption id="attachment\_172" align="aligncenter" width="296"\]![](images/296px-Colored_neural_network.svg_.png) _Source: [Colored neural network at Wikipedia](https://en.wikipedia.org/wiki/Artificial_neural_network#/media/File:Colored_neural_network.svg), author: [Glosser.ca](https://commons.wikimedia.org/wiki/User_talk:Glosser.ca), license: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/nl/legalcode), no changes._\[/caption\] + +### Layer structure in a normal neural network + +As you may recognize from our previous post on [what deep learning is](https://machinecurve.com/index.php/2018/11/23/what-is-deep-learning-exactly/), such a neural network consists of layers: in its simplest form one input layer, one or multiple hidden layers, and one output layer. + +The neurons are structured vertically and are interconnected. This means that the output from one neuron in the input layers goes to every neuron in the subsequent layer. This process happens for every layer, as you can recognize in the model above. + +This kind of information propagation is highly inefficient when it comes to computer vision. Let's imagine how you would recognize objects in an image: by taking a close look at the object itself, and possibly its direct surroundings. If you must recognize trees in a photo, you will not take a look at the blue skies at the very top of the image. + +This is however what **would** happen if you would feed images to a normal ANN: it would take into account the entire image for the computer vision task you want it to perform. This is highly inefficient: both in terms of pragmatic quality (i.e., why the heck do you want to look at the entire image to identify an object within the image?) as well as neural network quality (for large images, the number of parameters skyrockets - and this is not a good thing, for it would greatly increase the odds that overfitting happens). + +\[ad\] + +### Layer structure in a CNN + +Convolutional neural networks to the rescue. They are specifically designed to be used in computer vision tasks, which means that their design is optimized for processing images. + +In CNNs, the layers are threedimensional. This means that the neurons are structured in shape of form **(width, height, depth)**. If we have a 50x50 pixels image encoded as RGB (red - green - blue), the shape of the input layer will be (50, 50, 3). 
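To make this shape a bit more concrete, here is a tiny NumPy illustration - just a sketch with dummy data, not part of any actual model:

```
import numpy as np

# A batch of ten 50x50 pixel RGB images, filled with zeros for demonstration
images = np.zeros((10, 50, 50, 3))

print(images.shape)     # (10, 50, 50, 3) -> ten threedimensional inputs
print(images[0].shape)  # (50, 50, 3)     -> (width, height, depth) of one input
```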
+ +You may now ask: why is this better than a regular ANN? + +Here's why: we can now use so-called convolutional layers, which allow us to inspect small pieces of the image and draw conclusions for every piece and then merge them together. + +That's completely different than these general ANNs, which look at the image as a whole. + +I will next cover the components of a CNN, including these convolutional layers! + +\[caption id="attachment\_196" align="aligncenter" width="182"\]![](images/cnn_layer.jpg) Layers in a CNN that are specifically tailored to computer vision are threedimensional: they process images with shape (width, height, depth)\[/caption\] + +## CNN components + +Convolutional neural networks share the characteristics of [multilayer perceptrons](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/) (and may be said to be composed of individual MLPs, although this analogy remains a bit vague): they have one input layer, one output layer and a set of - at minimum one - hidden layer(s) in between. + +Like I wrote before, CNNs are composed of various 'components'. A component equals at least one layer in the CNN. Now, what are these components? Here they are, in short: + +- Convolutional layers; +- Pooling layers; +- Fully connected layers. + +However, in order to understand a CNN in a better way, I will first have to look at another term in more detail: the _receptive field_ of a particular layer. I will then continue with the actual layers. + +### Receptive field + +Suppose that we have a neural network. In this case, I'll use a simple one, for clarity, and it is illustrated below. The yellow layer with two neurons is the **input layer**. It receives the input data. The red layer at the end is the **output layer**, which provides the output as specified by the architecture. The neural network below consists of one blue layer, a so-called **hidden layer**. It may be more than possible (and this is the case more often than not) that a model consists of multiple hidden layers. + +But once again, for clarity, I attempted to keep the model as simple as possible. + +Every hidden layer is connected to two layers: one from which it receives a signal and one to which it passes the transformed signal. + +The input layer is connected to a subsequent layer and the output layer is connected to a prior layer. + +Every layer has a so-called _receptive field_. This is the relative number (or the percentage, or the ratio, as you desire) of neurons from which it receives a signal. The hidden layer in the network below receives the signal from 2 out of 2 neurons, so its receptive field is the entire previous layer. The same goes for the output layer, which is 5 out of 5. + +\[ad\] + +You will see that in CNNs, not every neuron in a layer has a full receptive field. That's what I meant when I wrote that layers tailored to a CNN will allow us to investigate pieces of images, rather than entire images at once. Let's proceed with the so-called _convolutional layers_, in which this principle becomes perfectly clear! + +\[caption id="attachment\_189" align="aligncenter" width="341"\]![](images/Basic-neural-network.jpg) Ⓒ MachineCurve.com\[/caption\] + +### Convolutional layers + +In a CNN, the first layer that comes after the input layer is a so-called **convolutional layer**. 
In another [brilliant piece](https://adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/) on CNNs, the analogy to a flashlight was drawn, when explaining what a convolutional layer does. Let's attempt to recreate this analogy here to make clear how such a layer operates. + +\[caption id="attachment\_201" align="aligncenter" width="656"\]![](images/Cnn_layer-1.jpg) My recreation of the analogy. Note that it is also perfectly explained [here](https://adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/)!\[/caption\] + +#### From input to filter: the flashlight analogy + +We have the **input layer** on the left. This layer contains the actual image - a matrix/tensor/whatever you wish to call it of shape (width, height, depth). In our case, since we used a 50x50 RGB image above, our shape is (50, 50, 3). Note that this means that behind the layer you see in the left of the image, lie two more layers (given the depth of 3). + +The smaller layer on the right is the first **convolutional layer**. What it does can be explained using the flashlight analogy. Every neuron has a certain _receptive field_ on the previous layer, the input layer in our case. The receptive field in our example is 5x5x3 pixels (note that we have three layers!). Consequently, the convolutional layer must also have a depth of 3. What we can do now is take a look at a small part of the image and see what's in it. That's what I meant when I wrote about it before! + +In deep learning terms, we also call this neuron a **filter**. A convolutional layer thus consists of a set of filters which all look at different parts of the image. In our case, the filters look at only 5x5x3 = 75 pixels each, instead of the 50x50x3 = 7500 pixels within the three RGB layers. + +#### Don't I have to pick a filter when training a CNN? + +I then had the following question: what does this filter look like? How do I pick a filter when I start training a CNN? It was extremely unclear to me: it seemed like the models applied filters that could detect various things within an image (e.g., a vertical line), but why didn't I find about how to pick a filter type when starting training? + +One has to start somewhere, doesn't he? + +An [answer](https://stackoverflow.com/questions/48388810/what-is-the-kind-of-filter-does-keras-uses-for-conv2d-in-cnn) on StackOverflow provided the solution to my confusion. It goes like this when training a CNN: first, for every convolutional layer, the filter is initialized in some way. This may be done randomly, but different initialization methods for CNNs exist. + +It then trains once and calculates the so-called '[loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/)', i.e. the difference between the real world and what the model predicted. + +Based on this loss value, the filter [is changed a tiny bit](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/), after which the same thing starts again. + +\[ad\] + +This way, the CNN learns to create its filter organically based on the data, while maximizing the performance of the network. + +Now that we know how a convolutional layer uses filters to look at smaller areas of the image, we can move on and take a look inside the filter. + +#### Inside the filter + +A neuron produces just one single output value. + +This means that the 5x5x3 = 75 pixels must be transformed into _just one_ number. + +But how is this done? 
+ +Let's take a look inside the filter to find out more. + +And especially how it calculates the singular value that the neuron in the convolutional network produces as output. + +It does so by what we call **element wise multiplications.** + +Suppose that we have a picture of a house. + +\[caption id="attachment\_205" align="aligncenter" width="171"\]![](images/small_house.jpg) A house. Ish.\[/caption\] + +Suppose that our network has learned that it must detect vertical lines. What I mean is: it has learned a filter that knows how to detect vertical lines. This filter looks like this: + +![](images/filter.jpg) + +Normally, a filter moves over the image, in order to capture it entirely. We'll cover that next. But for simplicity, we have just put the filter at a convenient place, exactly where we have a vertical line in our image. Like this: + +![](images/house_with_filter.jpg) + +If we zoom in at the **image**, and especially the part that is covered by the filter, we see this: + +![](images/image.jpg) + +If we translate this into what a computer sees, remember that it's a 2D array of pixel intensities. It would thus look something like this internally: + +![](images/computerized_image_2.jpg) + +The same is true for the filter: + +![](images/computerized_filter-2.jpg) + +Obviously, the model does not learn a filter based on one picture. This means that the pixel intensities for the vertical line filter will probably not exactly match the ones for the image. Additionally, they may be different for different sub parts of the vertical line, as you can see in the computerized version of the filter above. + +But that does not matter really, since it will still be able to detect a vertical line. + +\[ad\] + +And here's why: + +What happens next is that the model performs **element wise multiplications** using the filters that it's learnt, like I wrote before. That's very simple: if we put the two 2D arrays (also known as matrices) next to each other, we take the two elements at position (i,j) and multiply them. We do this for all the positions in the matrices, and add up the numbers. + +Visually, we can represent this as the two arrows that iterate over the matrices. + +The first calculation will be (0 x 0) = 0. + +We move the arrow one position to the right. + +The second, third, fourth, up to the eighth, is also 0. + +But the 9th is (35 x 25) = 875. + +We remember 875 and move the arrow one step to the right. + +![](images/multiplication.jpg) + +We continue this process until we _looked at every individual element, and its corresponding variant in the other matrix._ We will now have a large set of numbers. + +Part of the element-wise multiplications used in Convolutional Neural Networks is that in the end we sum them all together, like this: + +![](images/formula.jpg) + +The result is a large number, obviously. It is the output of the neuron of the convolutional layer which looks at its very own receptive field. + +But what is the significance of this number? + +That's very simple: the larger it is, the more the receptive field (i.e., the part of the image) looked at by the neuron matches the filter used (in our case, that the part of the house looked at contains a vertical line). + +This means that we can now distinguish certain characteristics in parts of an image. + +\[ad\] + +But why is that the case? + +Let's move the filter a bit, once again to an arbitrary position in the image, to show what I mean. 
+ +![](images/house_moved_filter.jpg) + +As you may guess, the vertical line filter itself did not change, because we're using the trained model and thus its learnt filters. The computerized version is thus still the same: + +![](images/computerized_filter-2.jpg) + +But the computerized version of the part of the image _did_ change, simply because we let the filter look at another part of the image. + +What do you think it looks like? + +If you think it's a matrix of zeroes, you're right :-) + +![](images/matrix_zeroes.jpg) + +The result of the formula is now very simple. All the elements are multiplied by 0, and thus the sum will be **0**. + +You can now make a guess how the CNN detects certain patterns in data. If a part of the image looks like something available in a filter, it will produce a rather large number. If the part of the image does not look like the filter at all, the result of the element-wise multiplication will be zero or a very small number! + +#### How does the filter really move over the image? + +For the examples above, we twice placed our filter on top of an arbitrary part of the image. + +As you may guess, this is not how Convolutional Neural Networks do really attempt to detect learnt patterns in an image. + +The process is actually pretty simple, though, and here's how it is done. + +The filter starts at the very top left part of the image, for which the element-wise multiplications are computed. + +\[ad\] + +It will then move slightly to the right. If it reaches the border, it will move slightly to the bottom and will start again at the very left end of the image. + +![](images/filter_move.jpg) + +By how many pixels it will move to the right (or towards the bottom whenever necessary) is what we call **stride**. The higher the stride, the faster it moves over an image, since it will move by more pixels. + +This means that the model will be smaller and easier to train. + +But increased speed comes with one downside: the higher the stride, the less detailed it can capture patterns in the image, since it simply does not look at all of its individual parts. + +Overall, Deep Learning engineers often choose a stride of 1, and sometimes 2. We often favor longer training times with more accurate models over less accurate ones while training time is short. + +So what we know now is that in a convolutional layer, for new data, the learnt filters slide over the image, calculate whether their content is present in the part of the image, and output a number for every step. + +#### The activation map + +Once the filter has looked at the entire image, it will be able to draw some kind of a heatmap over the image. The colors of the heatmap will be brightest where the filter was able to detect the pattern that it is trained for. They will be the lightest where, in our case, no vertical lines could be found. + +For our house, this means that the two vertical lines that represent the walls will most likely light up, while the rest remains darker. + +We call this heatmap-like structure the **activation map** and it provides highlights where certain aspects can be found. + +This activation map will reduce the dimensionality of the input to 1D. Suppose that we have a 32 x 32 RGB image and thus a 3D input of 32x32x3. + +\[ad\] + +If we use 5x5 pixel filters, it can take 28 locations horizontally, and 28 vertically. As the filter will look and merge the 3 dimensions, we will end up with an array of shape (28, 28, 1), where 1 stands for the number of filters used. 
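If you want to verify this arithmetic yourself, a minimal Keras sketch reproduces the same numbers - assuming you have TensorFlow/Keras installed; the snippet is purely illustrative:

```
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

# One 5x5 filter sliding over a 32x32 RGB input, with stride 1 and no padding
model = Sequential()
model.add(Conv2D(1, kernel_size=(5, 5), strides=(1, 1), input_shape=(32, 32, 3)))
model.summary()  # the convolutional output shape is (None, 28, 28, 1)
```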
+ +In our case, we simply used one filter, but obviously, multiple ones are used. Being able to detect vertical lines will not be enough to distinguish a house from a lamp post. That's why often more filters are used. If five are used, the output shape would be (28, 28, 5). + +#### CNNs make images more abstract, why is that? + +From our blog about [what deep learning is](https://machinecurve.com/index.php/2018/11/23/what-is-deep-learning-exactly/) we learnt that neural networks make representations more abstract when one moves through the network and moves to the right. + +In convolutional neural networks, this is also the case. + +But why? + +Quite simple, actually. + +Suppose that you train a model so that it can distinguish the lamp posts we used in our previous example. It will need to provide as an output "contains lamp post" or "contains no lamp post". + +It will thus need to learn filters that can together recognize these lamp posts, of which one obviously will be a vertical line detector. + +We would need to use various images that contain lamp posts. We feed them to the model during training so that it can learn these particular filters. + +Obviously, every lamp post is different. No lamp post looks the same because simply put, every picture is different. + +This means that when processing the hundreds or thousands of lamp posts it sees in the training data, it would need to generalize a bit, or it will not be able to recognize a new lamp post. + +This is why when one moves through a network with multiple convolutional layers, data gets more abstract. Whereas the first convolutional layer will probably detect certain 'large-scale patterns' (like an object similar to a post exists yes/no, while trees may also be classified as 'yes'); the second one will be able to look at the data at a more abstract level. For example, the filters of the layer will probably learn that a lamp post is pretty much vertical, and learn a vertical line detector. It will also learn that lamp posts have some kind of light structure, and thus that trees - with leaves - are probably no lamp posts. + +Making the data more abstract thus allows for better generalization and thus for better handling of new data! + +### Non-linearity layers: ReLu + +Traditional machine learning methods can often only handle data that is linear. But images are far from linear: suppose that we have a set of pictures with blue skies. In some of them, a ball flies through the sky, and our aim is to identify the pictures which contain a ball. + +If we would use a linear machine learning model, we would get into trouble. There is no way to define a linear function that can grasp the ball in full. Have you ever seen a (curved) line in the shape of a circle before? + +Yes, maybe the cosine and sine functions come to mind - but let's stop here, because these are no lines. + +Deep neural networks use some kind of [nonlinear activation function](https://machinecurve.com/index.php/2018/11/23/what-is-deep-learning-exactly/) to process more advanced, non-linear data. + +But convolutional layers do not provide this kind of non-linearity. What they simply do is to compute element-wise multiplications between a filter matrix and a matrix that contains a part of an image. + +We would thus need to add some non-linearity to our model. + +We use **non-linearity layers** for this, and we put them directly after the convolutional layer. 
Multiple non-linearity layer types exist, of which these [are most widely used](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/): + +- Rectified Linear Unit (ReLU) +- Sigmoid +- Tanh + +The preferable non-linear layer of choice these days is the [ReLu](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/#rectified-linear-unit-relu) layer, though. Researchers have identified that models using these type of activation functions (non-linear layers) are faster to train, which saves computational resources. This does not mean that sigmoid and tanh based CNNs are useless, possibly even the contrary. However, it will take longer to train, and for most tasks this would not be necessary, as ReLu would perform fine for them. Please note that a wide variety of activation functions [is available these days](https://www.machinecurve.com/index.php/2020/01/24/overview-of-activation-functions-for-neural-networks/). Click the link for an overview. It also includes activation functions that attempt to improve the ReLU activation function mentioned above. + +\[ad\] + +### Pooling layers + +We know that the CNN's convolutional layer reduces the input. This reduction is however very small: using one filter, the 32x32x3 RGB image input into the convolutional layer leads to a 28x28x1 output. + +But often, images are not 32x32 pixels: they are larger. + +This means that the output would still be very large. + +We saw before that in many cases, multiple convolutional layers are used in a CNN, for reasons of abstraction and generalization. + +The larger the input of a convolutional layer, the larger the convolutional operation (the sliding of the learnt filters over the images) it needs to perform, and the longer it will take to train the model. + +For the sake of computational resources, we would thus need to reduce the dimensions of the output of a convolutional layer. + +And here is where [pooling layers](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/) come into view. + +Pooling layers will allow us to move over the output of the convolutional layer (and possibly, the ReLu layer) i.e. the activation map, and make it smaller. + +We move over the output often with a 2 x 2 structure and move again with the concept of **stride**, which is often 1. + +#### Max pooling + +Multiple forms of pooling exist, but max pooling is the most widely used one. As images will say more than one thousand words, this perfectly demonstrates what max pooling does: + +![](images/pooling.jpg) + +(note that here, stride 2 was used). + +As you can see, if the activation map for the image was first (4,4,1), it will now be (2,2,1). + +This will ensure that future convolutional layers will have much more ease at training! + +With a real-world image example from a [great source on CNNs](http://cs231n.github.io/convolutional-networks/#pool): + +![](images/main-qimg-3a8a3a78734fed3301ed3546634b871a-c.jpg) + +In this example, a convolutional layer produced a 224x224 activation map with 64 filters, which is downsampled to 112x112x64. This is four times smaller, saving a lot of computational resources! (the calculation: (112\*112)/(224\*224)). + +\[ad\] + +### Fully connected layers + +Suppose that we trained a CNN which needs to classify images as belonging to the class "lamp post" or the class "no lamp post". 
+ +Convolutional layers combined with non-linear layers and pooling layers allowed us to identify patterns in images, but they cannot _predict the actual class_. + +They're simply not built to do that. + +But we have seen a typical structure before that _is able_ to do precisely that. + +![](images/Basic-neural-network.jpg) + +And it's the **fully-connected** (a.k.a. densely-connected) neural network. It suits classification and regression tasks very well. + +Near the end of a CNN used for classification and regression, we will thus very likely see these kind of structures, packaged into a set of layers. For the same reason that they are used with convolutional layers, these layers will also have non-linear activation functions. The last layer will probably use the so-called _softmax activation function_ since it will allow us to direct our output into a class. We also call this last layer the _loss layer._ + +## Stacking CNN layers into an architecture + +Now that we covered the layers that are very often used in a CNN, we may talk about stacking them into an architecture. + +Since we're talking about convolutional neural networks, the convolutional layers play a big role in these kind of architectures. + +This means that we will most probably start with a convolutional layer, which takes the image as input, and converts the input to an activation map with learnt filters. + +We will then use a [ReLu](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/) layer to apply some non-linearity to the image. + +We then arrive at a crossroad. + +Do we need the data to be more abstract, so accuracy can improved? + +\[ad\] + +Especially for images this can be useful, but for smaller drawings this may not be necessarily true. + +If we need it to be more abstract, we will need to use additional convolutional layers. + +If we can move on to classification, we can use the fully connected layers with a softmax output layer. + +In both cases, we would need to apply downsampling (using, for example, max pooling) to reduce the dimensionality of the network. + +\[caption id="attachment\_258" align="aligncenter" width="774"\]![](images/convnet_fig.png) A CNN architecture. You can see all the familiar components neatly structured into one architecture. You can also see that the input is downsampled and gets smaller quickly, which benefits training. In the end, a 'flatten' layer is used to convert the multidimensional activation maps into an 1D structure, to be used by the classifier. This architecture will probably be able to classify images, e.g. "lamp post" or "no lamp post". Source: [gwding/draw\_convnet](https://github.com/gwding/draw_convnet)\[/caption\] + +### Repetition of layers + +Consequently, we would start a CNN with: + +- A convolutional layer; +- Followed by a ReLu layer; +- Followed by a max pooling layer. + +If we can move on to classification, we add the _densely classified layers_. + +If we need to make data more abstract, we would repeat the structure above until we can move on to classification. + +Repetition of these layers is perfectly normal and we encourage that once you build these kind of networks, you play around with layer stacking / repetition of layers, in order to find out how these kind of architectural changes impact the models. + +\[ad\] + +## Conclusion + +If you got here, you will now know what a convolutional neural network is used for and what it is composed of. 
In a next blog, we will definitely cover the various architectures that are out there today. But for now, it's time for you to rest, to let it sink in, and to have your knowledge about CNNs serve you as a basis for further development. + +Good luck! :-) + +## Sources + +- [CNN Architectures: LeNet, AlexNet, VGG, GoogLeNet, ResNet and more](https://medium.com/@sidereal/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5) +- [Jeremy Jordan: Common architectures in convolutional neural networks](https://www.jeremyjordan.me/convnet-architectures/) +- [Wikipedia (EN): Convolutional neural network](https://en.wikipedia.org/wiki/Convolutional_neural_network) +- [CS231n: Convolutional Neural Networks (CNNs / ConvNets)](http://cs231n.github.io/convolutional-networks/) +- [Adit Deshpande: A Beginner's Guide To Understanding Convolutional Neural Networks](https://adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/) +- [Quora: What is the kind of filter does keras uses for conv2D in CNN?](https://stackoverflow.com/questions/48388810/what-is-the-kind-of-filter-does-keras-uses-for-conv2d-in-cnn) +- [StackExchange: What is the definition of a “feature map” (aka “activation map”) in a convolutional neural network?](https://stats.stackexchange.com/questions/291820/what-is-the-definition-of-a-feature-map-aka-activation-map-in-a-convolutio) +- [gwding/draw\_convnet](https://github.com/gwding/draw_convnet) +- Chollet, F. (2018). _Deep learning with Python._ +- Own knowledge diff --git a/convolutional-neural-networks-with-pytorch.md b/convolutional-neural-networks-with-pytorch.md new file mode 100644 index 0000000..c77929e --- /dev/null +++ b/convolutional-neural-networks-with-pytorch.md @@ -0,0 +1,185 @@ +--- +title: "Convolutional Neural Networks with PyTorch" +date: "2021-07-08" +categories: + - "deep-learning" + - "frameworks" +tags: + - "convnet" + - "deep-learning" + - "machine-learning" + - "neural-network" + - "neural-networks" + - "pytorch" +--- + +Deep neural networks are widely used to solve computer vision problems. Frequently, their performance is much better compared to Multilayer Perceptrons, which - as we shall see - is not too surprising. In this article, we will focus on building a ConvNet with the PyTorch library for deep learning. + +After reading it, you will understand... + +- **How Convolutional Neural Networks work** +- **Why ConvNets are better than MLPs for image problems** +- **How to code a CNN with PyTorch** + +Let's take a look! :) + +* * * + +\[toc\] + +* * * + +## How ConvNets are used for Computer Vision + +If you are new to the world of neural networks, you will likely see such networks being displayed as a set of connected neurons: + +![](images/Basic-neural-network.jpg) + +These networks are called _Multilayer Perceptrons_, or MLPs for short. They take some input data, pass them through (a set of) layers in a forward fashion, and then generate a prediction in some output layer. + +With MLPs, a variety of problems can be solved - including computer vision problems. But this does not mean that they are the best tool for the job. Rather, it is more likely that you will be using a **Convolutional Neural Network** - which looks as follows: + +![](images/convnet_fig.png) + +Source: [gwding/draw\_convnet](https://github.com/gwding/draw_convnet) + +We'll now briefly cover the inner workings of such a network, and why it can be a better tool for image problems. 
We don't cover this topic extensively, because this article focuses on building a ConvNet with PyTorch. If you wish to understand ConvNets in more detail, we'd love to point you to these articles:

- [Convolutional Neural Networks and their components for computer vision](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/)
- [How to build a ConvNet for CIFAR-10 and CIFAR-100 classification with Keras?](https://www.machinecurve.com/index.php/2020/02/09/how-to-build-a-convnet-for-cifar-10-and-cifar-100-classification-with-keras/#why-convolutional-neural-networks-suit-image-classification)

### A ConvNet, structured

Let's now take a look at the image above. We begin on the right, where you'll see an _Outputs_ layer with two outputs. Apparently, that network generates two types of predictions (for example, it can be a multiclass network with two classes, or it can give two regression outputs).

Left of this layer, we can see two layers with Hidden units. These are called _Fully connected_. Indeed, they are the type of layer that we know from a Multilayer Perceptron! In other words, a Convolutional Neural Network often includes an MLP for generating the predictions. But then what makes such a network _Convolutional?_

The presence of Convolutional layers (hello, captain obvious).

On the left, we can see so-called **Convolution** layers followed by **[(Max) pooling](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/)** layers. A _convolution_ can be defined as follows:

> In [mathematics](https://en.wikipedia.org/wiki/Mathematics) (in particular, [functional analysis](https://en.wikipedia.org/wiki/Functional_analysis)), **convolution** is a [mathematical operation](https://en.wikipedia.org/wiki/Operation_(mathematics)) on two [functions](https://en.wikipedia.org/wiki/Function_(mathematics)) (_f_ and _g_) that produces a third function (_f_ \* _g_) that expresses how the shape of one is modified by the other.
>
> Wikipedia (2001)

In other words, a Convolutional layer combines two parts and generates a function that expresses how one alters the other. Recall, if you are familiar with neural networks, that they have _inputs_ which are fed through a layer that has _weights_. If you take a look at this from a Convolution perspective, such a layer will have weights - and it evaluates how much the inputs "alter", or "trigger", these weights.

Then, by adapting the weights during optimization, we can teach the network to be "triggered" by certain patterns present in the input data. Indeed, such layers can be taught to be triggered by certain parts that are present in some input data, such as a nose, and - seen from the whole network - relate it to e.g. the output class "human".

Since ConvNets work with a kernel that is slid over the input data, they are said to be _translation invariant_ - meaning that a nose can be detected regardless of its position within the image. It is why ConvNets are way more powerful for computer vision problems than classic MLPs.

* * *

## Code example: simple Convolutional Neural Network with PyTorch

Now that we have recalled how ConvNets work, it's time to actually build one with PyTorch. Next, you will see a full example of a simple Convolutional Neural Network.
From beginning to end, you will see that the following happens:

1. **The imports**. First of all, we're importing all the dependencies that are necessary for this example. For loading the dataset, which is `MNIST`, we'll need the operating system functionalities provided by Python - i.e., `os`. We'll also need PyTorch (`torch`) and its neural networks library (`nn`). Using the `DataLoader` we can load the dataset, which we can transform into Tensor format with `transforms` - as we will see later.
2. **The neural network Module definition.** In PyTorch, neural networks are constructed as `nn.Module` instances - or neural network modules. In this case, we specify a `class` called `ConvNet`, which extends the `nn.Module` class. In its constructor, we first call the constructor of the super class, and then define a `Sequential` set of layers. This set of layers means that a variety of neural network layers is stacked on top of each other.
3. **The layers**. Recall from the image above that the first layers are Convolutional in nature, followed by MLP layers. For two-dimensional inputs, such as images, Convolutional layers are represented in PyTorch as `nn.Conv2d`. Recall that these layers require an activation function, and in this case we use the Rectified Linear Unit (`ReLU`) for all but the output layer. The multidimensional output of the final Conv layer is flattened into one-dimensional inputs for the MLP layers, which are represented by `Linear` layers.
4. **Layer inputs and outputs.** Where applicable, PyTorch layers specify the number of _in\_channels_ and the number of _out\_channels_ in their first two arguments. For our example, this means that:
    - The first `Conv2d` layer has one input channel (which makes sense, since MNIST data is grayscale and hence has just one channel) and provides ten output channels.
    - The second `Conv2d` layer takes these ten output channels and outputs five.
    - As the MNIST dataset has 28 x 28 pixel images, two `Conv2d` layers with a kernel size of 3 produce feature maps of 24 x 24 pixels each. This is why, after flattening, the number of inputs to the first `Linear` layer will be `24 * 24 * 5` - 24 x 24 pixels with 5 channels from the last Conv layer. 64 outputs are specified.
    - The next Linear layer has 64 inputs and 32 outputs.
    - Finally, the 32 inputs are converted into 10 outputs. This also makes sense, since MNIST has ten classes (the numbers 0 to 9). Our loss function will be able to handle this format.
5. **Forward definition**. In the `forward` def, the forward pass of the data through the network is performed.
6. **The operational aspects**. Under the `main` check, the random seed is fixed, the data is loaded and preprocessed, and the ConvNet, loss function and optimizer are initialized, after which the training loop is performed. In the training loop, batches of data are passed through the network, the loss is computed, the error is backpropagated, and the network weights are adapted during optimization.
+ +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class ConvNet(nn.Module): + ''' + Simple Convolutional Neural Network + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Conv2d(1, 10, kernel_size=3), + nn.ReLU(), + nn.Conv2d(10, 5, kernel_size=3), + nn.ReLU(), + nn.Flatten(), + nn.Linear(24 * 24 * 5, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare CIFAR-10 dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the ConvNet + convnet = ConvNet() + + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(convnet.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = convnet(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +* * * + +## Sources + +- [gwding/draw\_convnet](https://github.com/gwding/draw_convnet) +- Wikipedia. (2001, December 20). _Convolution_. Wikipedia, the free encyclopedia. Retrieved July 8, 2021, from [https://en.wikipedia.org/wiki/Convolution](https://en.wikipedia.org/wiki/Convolution) +- PyTorch. (n.d.). _Conv2d — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html) diff --git a/could-chaotic-neurons-reduce-machine-learning-data-hunger.md b/could-chaotic-neurons-reduce-machine-learning-data-hunger.md new file mode 100644 index 0000000..6a24173 --- /dev/null +++ b/could-chaotic-neurons-reduce-machine-learning-data-hunger.md @@ -0,0 +1,216 @@ +--- +title: "Could chaotic neurons reduce machine learning data hunger?" +date: "2019-06-01" +categories: + - "svms" +tags: + - "chaos-theory" + - "machine-learning" + - "mathematics" +--- + +This week, I found a very interesting work on Arxiv that was published only a short while ago. It's called [A Novel Chaos Theory Inspired Neuronal Architecture](https://arxiv.org/abs/1905.12601) and is the product of research performed by Harikrishnan N B and Nithin Nagaraj. + +Today's deep learning models are very data hungry. It's one of the fundamental challenges of deep artificial neural networks. They don't learn like humans do. When we learn, we create rules of logic based on first time observations which we can use in the future. Deep neural networks cannot do this. 
By consequence, they need large amounts of data to learn superficial representations of their target classes. + +And this is a problem for the data scenarios where you'll have very little data or when you have a very skewed distribution over the classes. Can we do something about this? + +\[toc\] + +\[ad\] + +## Adding chaos to learning + +In their work, the authors recognize that deep learning has so far been really promising in many areas. They however argue that although neural networks are loosely inspired by the human brain, they do not include its chaotic properties. That is, they remain relatively predictable over time - for the input, we know its output in advance. Human brains, according to the authors, also contain chaotic neurons, whose predictability reduces substantially after some time... and whose behavior _appears_ to become random (but, since they are chaotic, they are not). + +The main question the authors investigate in their work is as follows: **what if we create a neuronal architecture based on chaotic neurons?** Does it impact the success rate of learning with very small datasets, and perhaps positively? Let's find out. + +## How it works + +Let's see if we can intuitively - that is, with a minimum amount of mathematics and merely stimulating one's sense of intuition - find out how it works :) + +### Chaotic neurons + +Suppose that **X** is the _m x n_ matrix representing the inputs of our training set. Every row then represents a feature vector. Suppose that our matrix has 4 columns, thus n = 4. Our feature vector can then be represented as follows: + +\[mathjax\] $$ x\_i = \[ \\,\\, x^1\_i \\,\\,\\,\\, x^2\_i \\,\\,\\,\\, x^3\_i \\,\\,\\,\\, x^4\_i \\,\\, \] $$ + +By design, the network proposed by the authors must have four input neurons, one per feature. + +\[ad\] + +The authors call each of those neurons a Chaotic Generalized Luroth Series neuron (GLS), which take real inputs between \[0, 1) and map them to a real output value between \[0, 1) as follows. + +\\begin{equation} T(x) = \\begin{cases} \\frac{x}{b}, & \\text{if}\\ 0 <= x < b \\\\ \\frac{(1-x)}{(1-b)}, & \\text{if}\\ b <= x < 1 \\\\ \\end{cases} \\end{equation} + +For the \[0, 1\] domain, it visually looks as follows: + +[![](images/1dmap-1024x511.png)](https://machinecurve.com/wp-content/uploads/2019/05/1dmap.png) + +Since this function is _topologically transitive_, chaotic behavior is introduced in model behavior. I do not have the background to fully grasp this behavior - but it is one of the essential characteristics of chaos, at least in mathematical terms. So for this work, let's just assume that it is, so we can focus on its implications for machine learning :-) + +### Neuron behavior + +Neurons generally fire immediately, which emerges from their deterministic nature. That is, they are often continuous functions which take an input which is then mapped to another space, possibly in the same dimension. For example, `f(x) = x` is such a function. Mathematically, there is no delay between input and output. + +The chaotic neurons proposed by the authors behave differently. + +They do not cease firing immediately. Rather, their chaotic nature ensures that they fire for some time, and oscillate around some values, before they grind to a halt. This is visualized below. The neuron oscillates until its value approximates the input, then returns the number of milliseconds until that moment as its output. 
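To make this behavior a bit more tangible, here is a rough Python sketch of how I interpret it. Note that this is my own simplification - the function names and the default values for `b`, `q` and `error` are purely illustrative:

```
def gls_map(x, b):
    """The chaotic GLS (skew tent) map on [0, 1)."""
    return x / b if x < b else (1 - x) / (1 - b)

def firing_time(stimulus, b=0.46, q=0.23, error=0.1, max_iterations=10000):
    """Iterate the map from membrane potential q until the trajectory comes
    within `error` of the stimulus; the iteration count is the neuron's output."""
    value = q
    for iteration in range(max_iterations):
        if abs(value - stimulus) < error:
            return iteration
        value = gls_map(value, b)
    return max_iterations  # no convergence within the budget
```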
+ +The formulae and the precise pseudo-code algorithm can be found [in the paper](https://arxiv.org/pdf/1905.12601.pdf). + +\[ad\] + +[![](images/GLS.png)](https://machinecurve.com/wp-content/uploads/2019/05/GLS.png) + +Inner workings of the four GLS neurons for the four-dimensional feature vector. x1 to x4 were initialized randomly in the domain of \[0, 1). + +## Training our network + +Training the network goes differently than we're used to. There is no backpropagation and there is no gradient descent. Rather, it looks somewhat like how Support Vector Machines attempt to build a weight vector. The authors propose to train the network as follows: + +1. Normalize the input data to the domain of \[0, 1). +2. For every cell in the input data, compute the value for the neuron. +3. Once this is completed, you have another matrix, but then filled with _firing times_. Split this matrix into multiple ones, grouped by class. +4. Compute a so-called _representation vector_ for the matrices. That is, compute the mean vector for all the vectors available in the class matrices. + +This representation vector represents the 'average' input vector for this class. It can be used to classify new inputs. Let's see how this works. + +## Classifying new inputs + +According to the authors, one would classify new inputs as follows: + +1. Normalize the input data to the domain of \[0, 1). +2. For every cell in the input vector, compute the output of the respective neuron. +3. For the vector with neuron outputs, compute the cosine similarities with respect to the representation vectors for the matrices. +4. Take the `argmax` value and find the class you're hopefully looking for. + +## Testing network performance + +In their work, the authors suggest that they achieve substantial classification performance on _really small sub samples_ of the well-known MNIST and Iris datasets. Those datasets are really standard-ish data sets when you're interested in playing around with machine learning models. + +And with substantial performance, I really mean substantial: **they claim that combining chaotic behavior with neurons allows one to get high performance with really small data sets**. For example, they achieved 70%+ accuracies on the MNIST data set with > 5 samples, and accuracies of +≈ 80% with ≈ 20 samples. Note: the authors _do suggest that when the number of samples increase_, regular deep learning models will eventually perform better. But hey, let's see what we find for this type of model in small data scenarios. + +\[ad\] + +### Implementing the authors' architecture + +Rather unfortunately, the authors did not provide code which means that I had to implement the feature extractor, training algorithm and testing algorithm myself. Fortunately, however, the authors provided pseudo-code for this, which was really beneficial. Let's take a look at what happened. + +According to the paper, there are two parameters that must be configured: `b` and `q`. `b` is used to compute the chaotic map and determines the tipping point of the function (see the visualization above, where b was approximately 0.46). `q`, on the other hand, is the starting point for the neuron's chaotic behavior, and represents neural membrane potential. In my architecture it's the same for all neurons since that is what the authors used, but an extension to their work may be customized `q`s for each neuron. The `error` rate was 0.1, in line with the paper. 
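In simplified Python - again my own sketch, reusing the `firing_time` function from above and leaving out some bookkeeping - training and classification then boil down to:

```
import numpy as np

def to_firing_times(X, b, q, error):
    """Convert a (samples, features) matrix with values in [0, 1) to firing times."""
    return np.array([[firing_time(x, b, q, error) for x in row] for row in X])

def fit(X, y, b, q, error=0.1):
    """Compute one mean firing-time vector (the representation vector) per class."""
    y = np.asarray(y)
    firing = to_firing_times(X, b, q, error)
    classes = np.unique(y)
    representations = np.array([firing[y == c].mean(axis=0) for c in classes])
    return classes, representations

def predict(X, classes, representations, b, q, error=0.1):
    """Assign each sample to the class with the most cosine-similar representation."""
    firing = to_firing_times(X, b, q, error)
    norms = np.linalg.norm(firing, axis=1, keepdims=True) * np.linalg.norm(representations, axis=1)
    similarities = (firing @ representations.T) / (norms + 1e-12)
    return classes[np.argmax(similarities, axis=1)]
```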
+ +### Testing on the MNIST dataset + +All right, after implementing the architecture, I could begin with testing. I tested model performance on the MNIST dataset. + +The [MNIST dataset](http://yann.lecun.com/exdb/mnist/) is a relatively straight-forward dataset which contains handwritten numbers. It's a great dataset if one intends to learn building machine learning models for image classification and it's therefore one of the standard data sets. It looks as follows: + +[![](images/mnist-visualize.png)](https://machinecurve.com/wp-content/uploads/2019/06/mnist-visualize.png) + +First, I created a fancy little test protocol in order to attempt to show that it can both predict and generalize. It is as follows — + +- I used the `mnist` data set available by default in Keras. From the `x_train` sample, I always drew random samples for training, with replacement. +- I trained multiple times with varying numbers of training samples per class, but with an always equal number of samples per class. I trained the model with 1, 2, ... 21 samples per class, to see how its performance differs. +- I randomly drew 500 samples per class from the `x_train` sample for testing. It may be the case that some of those overlap with the actual training data. This is obviously considered to be poor practice, and yes, shame on me. But it was relatively easy to make it work this way :) What's more, in the ultimate worst case, only 4.2% of the test samples would overlap. But since we're drawing 500 samples from about 5-7k per class, and this 4.2% only occurs in the _worst case_ scenario when training with 21 samples if all 21 overlap (21/500 ≈ 4.2%), I think this won't be too problematic. + +And then, there was a setback. I simply could not get it to work with the MNIST dataset. Well, the network worked, but its performance was poor: I achieved accuracies of 20% at max: + +[![](images/mnist-acc-poor-1024x537.png)](https://machinecurve.com/wp-content/uploads/2019/06/mnist-acc-poor.png) + +Then I read [in the paper](https://arxiv.org/pdf/1905.12601.pdf) that it "may be the case that certain values of q may not work, but we can always find a `q` that works". + +My problem thus transformed into a search problem: find a value for `q` and possibly for `b` that works. The result of this quest is a piece of Python code which iterates over the entire \[0, 1) spectrum for both `b` (deltas of 0.05) and `q` (deltas of 0.01) to allow me to find the optimum combination. + +This is the result: + +\[ad\] + +[![](images/plot_for_mnist-1024x537.png)](https://machinecurve.com/wp-content/uploads/2019/06/plot_for_mnist.png) + +So indeed, it seems to be the case that model performance is very sensitive to the configurable parameters. The `q` I had configured seemed to produce a very low accuracy. Slightly altering the value for `q` yielded an entirely different result: + +[![](images/mnist_accs-1024x511.png)](https://machinecurve.com/wp-content/uploads/2019/05/mnist_accs.png) + +Accuracies of > 75% on the MNIST datasets with only 20+ training samples per class. + +Wow! :) I could pretty much reproduce the findings of the authors. An decreasingly increasing accuracy with respect to the number of samples, achieving some kind of plateau at > 20 samples for training. Even the maximum accuracy of about 78% gets close to what the authors found. 
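For reference, the search itself is nothing fancier than a brute-force grid over `b` and `q`. Roughly, it looks like the sketch below, where `X_train`, `y_train`, `X_test` and `y_test` are placeholders for the normalized data and `fit` and `predict` are the helpers sketched earlier; `b` starts at 0.05 rather than 0 to avoid a division by zero in the map:

```
import numpy as np

best_accuracy, best_b, best_q = -1.0, None, None
for b in np.arange(0.05, 1.0, 0.05):      # deltas of 0.05 for b
    for q in np.arange(0.01, 1.0, 0.01):  # deltas of 0.01 for q
        classes, representations = fit(X_train, y_train, b, q, error=0.1)
        predictions = predict(X_test, classes, representations, b, q, error=0.1)
        accuracy = np.mean(predictions == y_test)
        if accuracy > best_accuracy:
            best_accuracy, best_b, best_q = accuracy, b, q

print(f'Best accuracy {best_accuracy:.3f} at b = {best_b:.2f}, q = {best_q:.2f}')
```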
+ +### Testing on the Iris dataset + +Next up is the [Iris dataset](https://www.kaggle.com/uciml/iris/downloads/Iris.csv/data), which is another common dataset used by the machine learning community for playing around with new ideas. I let the search algorithm find optimum `b` and `q` values while it was configured to use 5 samples for training (which is similar to the authors' work), using 45 samples for testing (the Iris dataset I used contains 50 samples per class). First, I normalized the values into the \[0, 1) interval, since otherwise the neurons cannot handle them. + +The search plot looks promising, with maximum accuracies of ≈ 98,5%: + +[![](images/iris-plot-1024x537.png)](https://machinecurve.com/wp-content/uploads/2019/06/iris-plot.png) + +By zooming into this plot, I figured that one of the maximum accuracies, possibly the highest, occurs at `q = 0.50` and `b ≈ 0.55`. Let's train and see what happens: + +[![](images/iris-performance-1024x537.png)](https://machinecurve.com/wp-content/uploads/2019/06/iris-performance.png) + +We can see that it performs well. Once again, we can support the authors' findings :) However, we must note that performance seems to deteriorate slightly when a relatively large number of samples is used for training (> 5 samples, which is > 10% of the entire number of samples available per class). + +\[ad\] + +### Testing on CIFAR-10 + +All right. We just tested the model architecture with two data sets which the authors also used. For any machine learning problem, an engineer would be interested in how well it generalizes to different data sets... so the next obvious step was to train the model on another data set, not used by the authors. + +A dataset readily available within the Keras framework is the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html). It contains many images for ten classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). It looks as follows: + +[![](images/cifar10_visualized.png)](https://machinecurve.com/wp-content/uploads/2019/06/cifar10_visualized.png) + +The first step is running the Python code for finding the most optimum combination of `q` and `b`. + +[![](images/plot_for_cifar10-1024x511.png)](https://machinecurve.com/wp-content/uploads/2019/06/plot_for_cifar10.png) + +Oops, relatively poor accuracies + +Unfortunately, the maximum accuracies found by the search algorithm are only about 30% - and they are rather consistent in this behavior. This means that for CIFAR-10, the chaotic model performs worse than when the prediction is made at random. That's not what we want. + +I'm however not exactly sure why this behavior occurs. I do however have multiple hypotheses. First, if you inspect the data visualizations for MNIST and CIFAR-10 above, you'll see that the MNIST dataset is highly contrast rich, especially compared to the CIFAR-10 dataset. That is, we can clearly see what the number is. This distinction is relatively more obscure in the CIFAR-10 dataset. It may be that the model cannot handle this well. In that case, we've found our first possible bottleneck for the chaos theory inspired neural network: _it may be that it cannot handle well data relatively poor in contrast between areas of interest and areas of non-interest._ + +Second, the MNIST dataset provides numbers that have been positioned in the relative center of the image. That's a huge benefit for machine learning models. 
Do note for example that [CNNs](https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) are so effective because the convolution operation allows them to be largely invariant to the position of the object. That is, they don't care where in the image the object is. _Hypothesis two:_ it may be that this chaos theory inspired network, in line with more traditional machine learning models, is sensitive to the precise location of objects.

\[ad\]

### Testing on the Pima Indians Diabetes dataset

We did however see that the chaos theory inspired neural architecture performs relatively well on the Iris dataset. In order to see how well it generalizes with respect to those kinds of datasets (i.e., no images), I finally also tested it on the [Pima Indians Diabetes dataset](https://www.kaggle.com/kumargh/pimaindiansdiabetescsv). It is a CC0 dataset usable for getting experience with machine learning models and contains various medical measurements and a prediction about whether patients will have to face diabetes:

> This dataset describes the medical records for Pima Indians and whether or not each patient will have an onset of diabetes within five years.
>
> Source: Pima Indians Diabetes Dataset

The dataset is relatively imbalanced. Class 0, 'no diabetes', is present 500 times, whereas class 1, i.e. when one is predicted to get diabetes, is present only 267 times. Nevertheless, we should still have enough samples for training and testing.

Similar to the Iris dataset, I first normalized the individual values into the \[0, 1) interval. This should not change the underlying patterns, while the dataset can now be input into the GLS neurons. Let's inspect the results for searching good `q`s and `b`s. I'll run it with 15 samples for training and 50 for testing.

[![](images/pima-performance-1024x537.png)](https://machinecurve.com/wp-content/uploads/2019/06/pima-performance.png)

Once again, I'm impressed with the results of the network, this time on a dataset which was not tested by the authors previously. It seems that `b = 0.78` and `q = 0.47` should yield good results, and indeed:

[![](images/pima-performance-2-1024x537.png)](https://machinecurve.com/wp-content/uploads/2019/06/pima-performance-2.png)

\[ad\]

### Conclusions

With my experiments, I could reproduce the results reported by the authors in their paper [A Novel Chaos Theory Inspired Neuronal Architecture](https://arxiv.org/pdf/1905.12601.pdf). I was also able to reproduce these results on another dataset (i.e. the Pima Indians Diabetes dataset), but failed to reproduce those findings on yet another (i.e., the CIFAR-10 dataset). I feel that the relative lack of contrast between object and non-object in the CIFAR-10 dataset results in low performance, together with the variable positions of objects of interest in this dataset. Consequently, I feel like the work produced by the authors is a really great start... while more work is required to make this approach work with real-world image datasets, of which CIFAR-10 is a prime example. However, I'll be happy to test with more non-image datasets in the future... to further investigate its performance :)

If you've made it this far, I would like to thank you for reading this blog - I hope you've found it as interesting as I did. It is fun to play around with new ideas about how to improve machine learning - and it's even more rewarding to find that the results reported in the original work could be reproduced.
If you feel like I've made any mistakes, if you have questions or if you have any remarks, please feel free to leave a comment below. They are highly appreciated and I'll try to answer them as quickly as I can. Thanks again and happy engineering!

## References

Harikrishnan, N., & Nagaraj, N. (2019). A Novel Chaos Theory Inspired Neuronal Architecture. Retrieved from [https://arxiv.org/pdf/1905.12601.pdf](https://arxiv.org/pdf/1905.12601.pdf)

How to Load and Visualize Standard Computer Vision Datasets With Keras. (2019, April 8). Retrieved from [https://machinelearningmastery.com/how-to-load-and-visualize-standard-computer-vision-datasets-with-keras/](https://machinelearningmastery.com/how-to-load-and-visualize-standard-computer-vision-datasets-with-keras/)

MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges. (n.d.). Retrieved from [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/)

CIFAR-10 and CIFAR-100 datasets. (n.d.). Retrieved from [https://www.cs.toronto.edu/~kriz/cifar.html](https://www.cs.toronto.edu/~kriz/cifar.html)

pima-indians-diabetes.csv. (n.d.). Retrieved from [https://www.kaggle.com/kumargh/pimaindiansdiabetescsv](https://www.kaggle.com/kumargh/pimaindiansdiabetescsv)

Iris Species. (n.d.). Retrieved from [https://www.kaggle.com/uciml/iris](https://www.kaggle.com/uciml/iris)

diff --git a/creating-a-multilabel-neural-network-classifier-with-tensorflow-and-keras.md b/creating-a-multilabel-neural-network-classifier-with-tensorflow-and-keras.md new file mode 100644 index 0000000..4696d5b --- /dev/null +++ b/creating-a-multilabel-neural-network-classifier-with-tensorflow-and-keras.md @@ -0,0 +1,170 @@

---
title: "Creating a Multilabel Neural Network Classifier with Tensorflow 2.0 and Keras"
date: "2020-11-16"
categories:
  - "deep-learning"
  - "frameworks"
tags:
  - "classification"
  - "deep-learning"
  - "deep-neural-network"
  - "keras"
  - "machine-learning"
  - "multilabel-classification"
  - "neural-network"
  - "neural-networks"
  - "tensorflow"
---

Neural networks can be used for a variety of purposes. One of them is what we call **multilabel classification:** creating a classifier where the outcome is not _one out of multiple_, but _some out of multiple_ labels. An example of multilabel classification in the real world is tagging: for example, attaching multiple categories (or 'tags') to a news article. But many more exist.

There are many ways in which multilabel classifiers can be constructed. In other articles, we have seen [how to construct them with Support Vector Machines](https://www.machinecurve.com/index.php/2020/11/12/using-error-correcting-output-codes-for-multiclass-svm-classification/). But in this article, we're going to use neural networks for that purpose. It is structured as follows. Firstly, we'll take a more detailed look at multilabel classification. What is it? How does it work? We're going to use an assembly line setting to demonstrate it conceptually.

Subsequently, we're going to continue in a more practical way - by introducing how Neural networks can be used for multilabel classification. Using the [bias-variance tradeoff](https://www.machinecurve.com/index.php/2020/11/02/machine-learning-error-bias-variance-and-irreducible-error-with-python/), we will look at pros and cons of using them for creating a multilabel classifier.
Once this is complete, we do the real work: using a **step-by-step example**, we're going to build a multilabel classifier ourselves, using TensorFlow and Keras. + +Let's get to work! :) + +* * * + +\[toc\] + +* * * + +## What is multilabel classification? + +Suppose that we are observing someone who is working in a factory. It's their task to monitor an assembly line for new objects. Once a new object appears, they must attach a label to the object about its **size** as well as its **shape**. Subsequently, the objects must be stored in a bucket - which can then be transported away, or something else. + +This is _classification,_ and to be more precise it is an instance of **multilabel classification**. + +> In machine learning, **multi-label classification** and the strongly related problem of **multi-output classification** are variants of the classification problem where multiple labels may be assigned to each instance. +> +> Wikipedia (2006) + +> Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y). +> +> Wikipedia (2006) + +Visually, this looks as follows: + +![](images/whatisclassification6.png) + +* * * + +## Using Neural Networks for Multilabel Classification: the pros and cons + +Neural networks are a popular class of Machine Learning algorithms that are widely used today. They are composed of stacks of _neurons_ called _layers_, and each one has an Input layer (where data is fed into the model) and an Output layer (where a prediction is output). In between, there are (often many) Hidden layers, which are responsible for capturing patterns from the data - providing the predictive capabilities that eventually result in a prediction for some input sample. + +![](images/Basic-neural-network.jpg) + +Today, in Deep Learning, neural networks have very deep architectures - partially thanks to the advances in compute power and the cloud. Having such deep architectures allows neural networks to learn _a lot of patterns_ as well as _abstract and detailed patterns_, meaning that since their rise Machine Learning models can be trained and applied in a wide variety of situations. + +Among them, multilabel classification. + +Nevertheless, if we want to use Neural networks for any classification or regression task - and hence also multilabel classification - we must also take a look at the pros and cons. These can be captured by looking at them in terms of the **[bias-variance tradeoff](https://www.machinecurve.com/index.php/2020/11/02/machine-learning-error-bias-variance-and-irreducible-error-with-python/)**. + +- If your Machine Learning model has high **bias**, it is very rigid in terms of the underlying function the model is learning. For example, a linear model must explicitly learn a function of the form \[latex\]f(x): y = a \\times x + b\[/latex\]. It is therefore impossible to capture, say, a quadratic pattern in your dataset with a linear model. +- If your Machine Learning model has high **variance**, the internal function it learns changes significantly even with only _minor_ changes in the input distribution, i.e. the distribution of your training dataset. You therefore have to inspect your data closely before training - and especially look at things like model generalization. 
An example of such a model is a nonlinear [Support Vector Machine](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/), which can learn patterns through any kernel function of choice.

Interestingly, bias and variance are connected in a tradeoff: if your model has high bias, variance is often relatively low due to the rigidity of the function learned. If variance is high, meaning that small changes will significantly change the underlying function learned, then the function cannot be too rigid as a consequence, and hence bias is low.

If we want to use Neural Networks for multilabel classification, we must take this into account. Through [nonlinear activation functions](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/) like [ReLU](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/), Neural networks are systems of neurons that can learn [any arbitrary function](https://www.machinecurve.com/index.php/2019/07/18/can-neural-networks-approximate-mathematical-functions/). This means that their bias is low - there is no rigidity when the Neural network is nonlinear. However, this also means that they are susceptible to variance-related behavior - small changes in the dataset may trigger significant changes to the underlying patterns that are learned. In other words, if you have a small dataset or already have a good idea of the functional structure of your input data, you might also consider performing multilabel classification with other models, [such as SVMs](https://www.machinecurve.com/index.php/2020/11/11/creating-one-vs-rest-and-one-vs-one-svm-classifiers-with-scikit-learn/). In other cases, Neural networks can definitely be useful.

Now that we know about Neural networks for multilabel classification, let's see if we can create one with TensorFlow and Keras.

![](images/darts-1024x768.jpg)

* * *

## Creating a Multilabel Classifier with Tensorflow and Keras

Creating a multilabel classifier with TensorFlow and Keras is easy. In fact, it is not so different from creating a regular classifier - except for a few minor details. Let's take a look at the steps required to create the dataset and the model, and the Python code necessary for doing so.

- **Imports:** the first step is importing all the Python dependencies that we need. We will use two packages: `sklearn`, primarily for data preprocessing related activities, and `tensorflow`, for the Neural network. From `sklearn`, we import `make_multilabel_classification` - which allows us to create a multilabel dataset - and `train_test_split` - allowing us to split the data into a training and testing dataset. From `tensorflow`, we will use the `Sequential` API for constructing our Neural Network, using `Dense` (i.e. densely-connected) layers. We use [binary crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) for computing [loss](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) and [Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) for optimization.
    - We assume that you have the packages installed onto your system. If not, you can run `pip install tensorflow scikit-learn`.
- **Configuration options:** the second step is specifying a set of configuration options for dataset generation and the model.
For example, we create 10000 samples with 6 features (i.e. columns) per sample (or vector/array), which have 3 target classes of which 2 are 'activated' per sample on average. We will train for 50 iterations (epochs), initialize our random number generators with a seed of 42, use a 250-sample batch size, output everything on `stdout` through `verbosity = 1` and use 20% of the training data for validation purposes.
- **Creating the dataset:** the next thing we do is create the dataset. Up to now, we have none! Using Scikit-learn and more specifically `make_multilabel_classification`, we can create a multilabel dataset for classification - and we use the configuration options defined just before for doing so.
- **Train/test split:** after generating the dataset, we must create a split between training and testing data. Scikit-learn also provides a nice function for this: `train_test_split`. We convert `X` and `y` into their training and testing components with a 66/33 train/test split. In other words, 66% of the 10000 samples will be used for training (and validation) purposes, while 33% will be used for testing. This split is relatively big on the testing end: 80/20 splits are also common.
- **Creating the model:** the next step is creating the `model` using an instance of the `Sequential` API. Using `model.add`, we then stack multiple densely-connected (`Dense`) layers on top. Recall from the image above that in a Dense layer, each neuron in a layer connects to all the other neurons in the previous layer. This means that they can pick up patterns from any of the upstream neurons that fire. The Input layer has `n_features` [input dimensions](https://www.machinecurve.com/index.php/2020/04/05/how-to-find-the-value-for-keras-input_shape-input_dim/), as the shape must equal that of the input data. Our Dense layers become narrower as we get closer to the output layer. This allows us to detect many patterns at first, generating 'summaries' later down the line. As is common, we use [ReLU activations](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/), except for the last layer. Here, we use [Sigmoid ones](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/). As we know, a Sigmoid activation function generates a prediction somewhere between \[latex\]\[0, 1\]\[/latex\] - and will hence do so for all neurons in the output layer. We set the _number_ of neurons there to `n_classes`. In other words, we get a 0-100% (0-1) prediction for _each_ output neuron, and there are as many neurons as there are classes: our multilabel prediction setting is complete.
- **Compiling the model:** we then convert the model skeleton that we have just created into a true model. Using [binary crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) (which effectively treats the problem as `n_classes` independent binary classification tasks) and the Adam optimizer, we instantiate the model.
- **Training the model:** we then fit the training data to the model and provide a few configuration options defined earlier. The model will now start training.
- **Evaluating the model:** after the model is trained, we can [evaluate](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/) it using `model.evaluate`.
Based on the testing dataset, we then know how well it performs when it is used on data that it has never seen before. + +Here is the Python code which is the output of the steps mentioned above: + +``` +# Imports +from sklearn.datasets import make_multilabel_classification +from sklearn.model_selection import train_test_split +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +from tensorflow.keras.losses import binary_crossentropy +from tensorflow.keras.optimizers import Adam + +# Configuration options +n_samples = 10000 +n_features = 6 +n_classes = 3 +n_labels = 2 +n_epochs = 50 +random_state = 42 +batch_size = 250 +verbosity = 1 +validation_split = 0.2 + +# Create dataset +X, y = make_multilabel_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes, n_labels=n_labels, random_state=random_state) + +# Split into training and testing data +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=random_state) + +# Create the model +model = Sequential() +model.add(Dense(32, activation='relu', input_dim=n_features)) +model.add(Dense(16, activation='relu')) +model.add(Dense(8, activation='relu')) +model.add(Dense(n_classes, activation='sigmoid')) + +# Compile the model +model.compile(loss=binary_crossentropy, + optimizer=Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(X_train, y_train, + batch_size=batch_size, + epochs=n_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(X_test, y_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +Running it gives the following performance: + +``` +Test loss: 0.30817817240050344 / Test accuracy: 0.8562628030776978 +``` + +* * * + +## Summary + +In this article, we looked at creating a multilabel classifier with TensorFlow and Keras. For doing so, we first looked at what multilabel classification is: assigning multiple classes, or labels, to an input sample. This is clearly different from binary and multiclass classification, to some of which we may already be used. + +We also looked at how Neural networks can be used for multilabel classification in general. More specifically, we looked at the bias-variance tradeoff, and provided a few suggestions when to use Neural networks for the task, or when it can be useful to look at other approaches first. + +Subsequently, we moved forward and provided a step-by-step example of creating a Neural network for multilabel classification. We used the TensorFlow and Keras libraries for doing so, as well as generating a multilabel dataset using Scikit. We achieved quite nice performance. + +I hope that you have learned something from today's article! If you did, please feel free to leave a comment in the comments section below 💬 Please do the same if you have questions or other remarks, or even suggestions for improvement. I'd love to hear from you and will happily adapt my post when necessary. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +_TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc._ + +Wikipedia. (2006, October 16). _Multi-label classification_. Wikipedia, the free encyclopedia. Retrieved November 16, 2020, from [https://en.wikipedia.org/wiki/Multi-label\_classification](https://en.wikipedia.org/wiki/Multi-label_classification) + +MachineCurve. (2020, November 2). 
_Machine learning error: Bias, variance and irreducible error with Python_. [https://www.machinecurve.com/index.php/2020/11/02/machine-learning-error-bias-variance-and-irreducible-error-with-python/](https://www.machinecurve.com/index.php/2020/11/02/machine-learning-error-bias-variance-and-irreducible-error-with-python/)

diff --git a/creating-a-multilayer-perceptron-with-pytorch-and-lightning.md b/creating-a-multilayer-perceptron-with-pytorch-and-lightning.md new file mode 100644 index 0000000..7e11fc5 --- /dev/null +++ b/creating-a-multilayer-perceptron-with-pytorch-and-lightning.md @@ -0,0 +1,522 @@

---
title: "Creating a Multilayer Perceptron with PyTorch and Lightning"
date: "2021-01-26"
categories:
  - "buffer"
  - "deep-learning"
  - "frameworks"
tags:
  - "deep-learning"
  - "lightning"
  - "machine-learning"
  - "mlp"
  - "multilayer-perceptron"
  - "neural-network"
  - "neural-networks"
  - "pytorch"
  - "pytorch-lightning"
---

Multilayer Perceptrons or MLPs are one of the basic types of neural networks that can be created. Still, they are very important, because they also lie at the basis of more advanced models. If you know that Multilayer Perceptrons are often called _feedforward segments_ in these architectures, you can easily see that they are heavily used in Transformer models as well as in Convolutional Neural Networks.

![](images/800px-Neural_network_example.svg_-768x1024.png)

A basic MLP. License: public domain.

In other words: basic does not mean useless. Quite the contrary, for MLPs.

Today, there are two frameworks that are heavily used for creating neural networks with Python. The first is TensorFlow. This article however provides a tutorial for creating an MLP with PyTorch, the second framework that is very popular these days. It also shows how to create one with PyTorch Lightning. After reading this tutorial, you will...

- Have refreshed the basics of Multilayer Perceptrons.
- Understand how to build an MLP with PyTorch.
- Also understand how to build one with PyTorch Lightning.

Let's get to work! 🚀

* * *

\[toc\]

* * *

## Summary and code examples: MLP with PyTorch and Lightning

Multilayer Perceptrons are straightforward and simple neural networks that lie at the basis of all Deep Learning approaches that are so common today. Having emerged many years ago, they are an extension of the simple Rosenblatt Perceptron from the 50s, made feasible by increases in computing power. Today, they are used in many neural networks, sometimes augmented with other layer types as well.

Being composed of layers of neurons that are stacked on top of each other, these networks - which are also called MLPs - can be used for a wide variety of purposes, such as regression and classification. In this article, we will show you how you can create MLPs with **PyTorch** and **PyTorch Lightning**, which are very prominent in today's machine learning and deep learning industry.

First, we'll show two full-fledged examples of an MLP - the first created with classic PyTorch, the second with Lightning.

### Classic PyTorch

Defining a Multilayer Perceptron in classic PyTorch is not difficult; it just takes quite a few lines of code. We'll explain every aspect in detail in this tutorial, but here is already a **complete code example for a PyTorch created Multilayer Perceptron**. If you want to understand everything in more detail, make sure to read the rest of the tutorial as well. Best of luck!
:) + +``` +import os +import torch +from torch import nn +from torchvision.datasets import CIFAR10 +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(32 * 32 * 3, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare CIFAR-10 dataset + dataset = CIFAR10(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### PyTorch Lightning + +You can also get started with PyTorch Lightning straight away. Here, we provided a **full code example for an MLP created with Lightning**. Once more: if you want to understand everything in more detail, make sure to read the rest of this tutorial as well! :D + +``` +import os +import torch +from torch import nn +from torchvision.datasets import CIFAR10 +from torch.utils.data import DataLoader +from torchvision import transforms +import pytorch_lightning as pl + +class MLP(pl.LightningModule): + + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(32 * 32 * 3, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + self.ce = nn.CrossEntropyLoss() + + def forward(self, x): + return self.layers(x) + + def training_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.ce(y_hat, y) + self.log('train_loss', loss) + return loss + + def configure_optimizers(self): + optimizer = torch.optim.Adam(self.parameters(), lr=1e-4) + return optimizer + + +if __name__ == '__main__': + dataset = CIFAR10(os.getcwd(), download=True, transform=transforms.ToTensor()) + pl.seed_everything(42) + mlp = MLP() + trainer = pl.Trainer(auto_scale_batch_size='power', gpus=0, deterministic=True, max_epochs=5) + trainer.fit(mlp, DataLoader(dataset)) +``` + +* * * + +## What is a Multilayer Perceptron? + +![](images/800px-Neural_network_example.svg_-768x1024.png) + +Created by Wiso at Wikipedia. License: public domain. + +I always tend to think that it is good practice if you understand some concepts before you write some code. 
That's why we'll take a look at the basics of Multilayer Perceptrons, abbreviated as MLPs, in this section. Once completed, we move on and start writing some code with PyTorch and Lightning.

Back in the 1950s, in the era when people had just started using computing technology after finding out how useful it was, there was a psychologist named Frank Rosenblatt. The man imagined what it would be like to add intelligence to machines - in other words, to make a machine that can think. The result is the [Rosenblatt Perceptron](https://www.machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/) - a mathematical operation where some input is passed through a neuron, where _weights_ are stored, and where the end result is used to optimize those weights. While it can learn a [binary classifier](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/), it fell short of learning massively complex functions like thinking.

Besides theoretical issues, the absence of sufficient computing power also meant that neural networks could not be utilized massively. Decades later, technological progress made possible the step towards **multilayer perceptrons**, or MLPs. In these perceptrons, _more than just one neuron_ is used for generating predictions. In addition, neurons are stacked in layers of increasing abstractness, where each layer learns more abstract patterns. That is, while one layer can learn to detect lines, another can learn to detect noses.

In MLPs, the input data is fed to an _input layer_ that shares the dimensionality of the input space. For example, if you feed input samples with 8 features per sample, you'll also have 8 neurons in the input layer. After being processed by the input layer, the results are passed to the next layer, which is called a hidden layer. The final layer is the output layer. Its neuron structure depends on the problem you are trying to solve (i.e. one neuron in the case of regression and binary classification problems; multiple neurons in a multiclass classification problem).

If you look closely, you can see that each neuron passes the input to **all** neurons in the subsequent (or downstream) layer. This is why such layers are also called densely-connected, or Dense. In TensorFlow and Keras they are available as `tensorflow.keras.layers.Dense`; PyTorch utilizes them as `torch.nn.Linear`.

* * *

## Creating an MLP with PyTorch

Now that we understand what an MLP looks like, it is time to build one with PyTorch. Below, we will show you how you can create your own PyTorch-based MLP with step-by-step examples. In addition to that, we also show you how to build one with PyTorch Lightning. This is a library on top of PyTorch which allows you to build models with much less overhead ([for example, by automating away explicitly stating the training loop](https://www.machinecurve.com/index.php/2021/01/13/getting-started-with-pytorch/#benefits-of-pytorch-lightning-over-classic-pytorch)).

First, we'll show you how to build an MLP with classic PyTorch, then how to build one with Lightning.

### Classic PyTorch

Implementing an MLP with classic PyTorch involves five steps:

1. Importing all dependencies, meaning `os`, `torch` and `torchvision`.
2. Defining the MLP neural network class as a `nn.Module`.
3. Adding the preparatory runtime code.
4. Preparing the CIFAR-10 dataset and initializing the dependencies (loss function, optimizer).
5. 
Defining the custom training loop, where all the magic happens. + +#### Importing all dependencies + +The first step here is to add all the dependencies. We need `os` for file input/output functionality, as we will save the CIFAR-10 dataset to local disk later in this tutorial. We'll also import `torch`, which imports PyTorch. From it we import `nn`, which allows us to define a neural network module. We also import the `DataLoader` (for feeding data into the MLP during training), the `CIFAR10` dataset (for obvious purposes) and `transforms`, which allows us to perform transformations on the data prior to feeding it to the MLP. + +``` +import os +import torch +from torch import nn +from torchvision.datasets import CIFAR10 +from torch.utils.data import DataLoader +from torchvision import transforms +``` + +#### Defining the MLP neural network class + +Next up is defining the `MLP` class, which replicates the `nn.Module` class. This Module class instructs the implementation of our neural network and is therefore really useful when creating one. It has two definitions: `__init__`, or the constructor, and `forward`, which implements the forward pass. + +In the constructor, we first invoke the superclass initialization and then define the layers of our neural network. We stack all layers (three densely-connected layers with `Linear` and [ReLU activation functions](https://www.machinecurve.com/index.php/2021/01/21/using-relu-sigmoid-and-tanh-with-pytorch-ignite-and-lightning/) using `nn.Sequential`. We also add `nn.Flatten()` at the start. Flatten [converts](https://www.machinecurve.com/index.php/question/runtimeerror-mat1-and-mat2-shapes-cannot-be-multiplied-384x32-and-1024x64-in-pytorch/) the 3D image representations (width, height and channels) into 1D format, which is necessary for `Linear` layers. Note that with image data it is often best to use Convolutional Neural Networks. This is out of scope for this tutorial and will be covered in another one. + +The forward pass allows us to react to input data - for example, during the training process. In our case, it does nothing but feeding the data through the neural network layers, and returning the output. + +``` +class MLP(nn.Module): + ''' + Multilayer Perceptron. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(32 * 32 * 3, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) +``` + +#### Adding runtime code + +After defining the class, we can move on and write the runtime code. This code is actually executed at runtime, i.e. when you call the Python script from the terminal with e.g. `python mlp.py`. The `class` itself is then not yet used, but we will do so shortly. + +The first thing we define in the runtime code is setting the seed of the random number generator. Using a fixed seed ensures that this generator is initialized with the same starting value. This benefits reproducibility of your ML findings. + +``` +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) +``` + +#### Preparing the CIFAR-10 dataset and initializing dependencies + +[![](images/cifar10_visualized.png)](https://www.machinecurve.com/wp-content/uploads/2019/06/cifar10_visualized.png) + +The next code we add involves preparing the CIFAR-10 dataset. Some samples from this dataset are visualized in the image on the right. 
The dataset contains 10 classes and has 60,000 32 by 32 pixel images, with 6000 images per class.

Loading and preparing the CIFAR-10 data is a two-step process:

1. Initializing the dataset itself, by means of `CIFAR10`. Here, in order, you specify the directory where the dataset has to be saved, that it must be downloaded, and that the images must be converted into Tensor format.
2. Initializing the `DataLoader`, which takes the dataset, a batch size, a shuffle parameter (whether the data must be presented in random order) and the number of workers to load data with. In PyTorch, data loaders are used for feeding data to the model uniformly.

```
  # Prepare CIFAR-10 dataset
  dataset = CIFAR10(os.getcwd(), download=True, transform=transforms.ToTensor())
  trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1)
```

Now, it's time to initialize the MLP - and use the class that we defined before but had not yet used. We also specify the loss function (categorical crossentropy loss) and the Adam optimizer. The optimizer works on the parameters of the MLP and utilizes a learning rate of `1e-4`. We'll use them next.

```
  # Initialize the MLP
  mlp = MLP()
  
  # Define the loss function and optimizer
  loss_function = nn.CrossEntropyLoss()
  optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4)
```

#### Defining the training loop

The core part of our runtime code is the training loop. In this loop, we perform the epochs, or training iterations. For every iteration, we iterate over the training dataset, perform the entire forward and backward passes, and perform model optimization.

Step-by-step, these are the things that happen within the loop:

- Of course, we have a number of full iterations - also known as epochs. Here, we use 5 epochs, as defined by the `range(0, 5)`.
- We set the current loss value for printing to `0.0`.
- Per epoch, we iterate over the training dataset - and more specifically, the minibatches within this training dataset as specified by the batch size (set in the `trainloader` above). Here, we do the following things:
    - We decompose the data into inputs and targets (or `x` and `y` values, respectively).
    - We zero the gradients in the optimizer, to ensure that it starts fresh for this minibatch.
    - We perform the forward pass - which in effect is feeding the inputs to the model, which, recall, was initialized as `mlp`.
    - We then compute the loss value based on the `outputs` of the model and the ground truth, available in `targets`.
    - This is followed by the backward pass, where the gradients are computed, and optimization, where the model is adapted.
    - Finally, we print some statistics - but only at every 500th minibatch. At the end of the entire process, we print that the training process has finished.
+ +``` + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +#### Full model code + +For the full model code, see the full code example at the beginning of this tutorial. + +#### Running the training process + +Now, when you save the code e.g. to a file called `mlp.py` and run `python mlp.py`, you'll see the following when your PyTorch has been installed successfully. + +``` +Starting epoch 1 +Loss after mini-batch 500: 2.232 +Loss after mini-batch 1000: 2.087 +Loss after mini-batch 1500: 2.004 +Loss after mini-batch 2000: 1.963 +Loss after mini-batch 2500: 1.943 +Loss after mini-batch 3000: 1.926 +Loss after mini-batch 3500: 1.904 +Loss after mini-batch 4000: 1.878 +Loss after mini-batch 4500: 1.872 +Loss after mini-batch 5000: 1.874 +Starting epoch 2 +Loss after mini-batch 500: 1.843 +Loss after mini-batch 1000: 1.828 +Loss after mini-batch 1500: 1.830 +Loss after mini-batch 2000: 1.819 +... +``` + +Great! 😎 + +### PyTorch Lightning + +Another approach for creating your PyTorch based MLP is using PyTorch Lightning. It is a library that is available on top of classic PyTorch (and in fact, uses classic PyTorch) that makes creating PyTorch models easier. + +The reason is simple: writing even a simple PyTorch model means writing a lot of code. And in fact, writing a lot of code that does nothing more than the default training process (like our training loop above). + +In Lightning, these elements are automated as much as possible. In addition, running your code on a GPU does not mean converting your code to CUDA format (which we even haven't done above!). And there [are other benefits](https://www.machinecurve.com/index.php/2021/01/13/getting-started-with-pytorch/). Since Lightning is nothing more than classic PyTorch structured differently, there is significant adoption of Lightning. We'll therefore also show you how to create that MLP with Lightning - and you will see that it saves a lot of lines of code. + +#### Importing all dependencies + +The first step is importing all dependencies. If you have also followed the classic PyTorch example above, you can see that it is not so different from classic PyTorch. In fact, we use the same imports - `os` for file I/O, `torch` and its sub imports for PyTorch functionality, but now also `pytorch_lightning` for Lightning functionality. + +``` +import os +import torch +from torch import nn +from torchvision.datasets import CIFAR10 +from torch.utils.data import DataLoader +from torchvision import transforms +import pytorch_lightning as pl +``` + +#### Defining the MLP LightningModule + +In PyTorch Lightning, all functionality is shared in a `LightningModule` - which is a structured version of the `nn.Module` that is used in classic PyTorch. 
Here, the `__init__` and `forward` definitions capture the definition of the model. We specify a neural network with three MLP layers and ReLU activations in `self.layers`. We also specify the cross entropy loss in `self.ce`. In `forward`, we perform the forward pass. + +Different in Lightning is that it also requires you to pass the `training_step` and `configure_optimizers` definitions. This is mandatory because Lightning strips away the training loop. The `training_step` allows you to compute the loss (which is then used for optimization purposes under the hood), and for these optimization purposes you'll need an optimizer, which is specified in `configure_optimizers`. + +That's it for the MLP! + +``` +class MLP(pl.LightningModule): + + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(32 * 32 * 3, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + self.ce = nn.CrossEntropyLoss() + + def forward(self, x): + return self.layers(x) + + def training_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.ce(y_hat, y) + self.log('train_loss', loss) + return loss + + def configure_optimizers(self): + optimizer = torch.optim.Adam(self.parameters(), lr=1e-4) + return optimizer +``` + +#### Adding runtime code: dataset, seed, and the Trainer + +Since Lightning hides much of the training loop, your runtime code becomes really small! + +- You have to define your dataset by initializing `CIFAR10`, just like with the original example. +- You'll seed everything to 42 to ensure that all pseudo-random number generators are initialized with fixed starting values. +- You initialize the MLP. +- You initialize the `Trainer` object, which is responsible for automating away much of the training loop, pass configuration options and then `fit` the data available in the `dataset` through the `DataLoader`. + +``` +if __name__ == '__main__': + dataset = CIFAR10(os.getcwd(), download=True, transform=transforms.ToTensor()) + pl.seed_everything(42) + mlp = MLP() + trainer = pl.Trainer(auto_scale_batch_size='power', gpus=1, deterministic=True, max_epochs=5) + trainer.fit(mlp, DataLoader(dataset)) +``` + +_Please do note that automating away the training loop does **not** mean that you lose all control over the loop. You can still control it if you want by means of your code. This is however out of scope for this tutorial._ + +#### Full model code + +For the full model code, see the full code example at the beginning of this tutorial. + +#### Running the training process + +Now, when you save the code e.g. to a file called `mlp-lightning.py` and run `python mlp-lightning.py`, you'll see the following when your PyTorch and PyTorch Lightning have been installed successfully. 
+ +``` +PU available: True, used: True +TPU available: None, using: 0 TPU cores +LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0] + + | Name | Type | Params +-------------------------------------------- +0 | layers | Sequential | 199 K +1 | ce | CrossEntropyLoss | 0 +-------------------------------------------- +199 K Trainable params +0 Non-trainable params +199 K Total params +Epoch 0: 82%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 40909/50000 [04:27<00:59, 152.76it/s, loss=2.14, v_num=4] +``` + +* * * + +## Recap + +In this tutorial, you have learned what a Multilayer Perceptron is and how you can create one with PyTorch and PyTorch Lightning. Firstly, we saw that MLPs (as they are called for short) involve densely-connected neurons stacked in layers. In a forward pass, samples are fed through the model, after which a prediction is generated. They are then optimized in an iterative fashion. + +After understanding the basics of MLPs, you used PyTorch and PyTorch Lightning for creating an actual MLP. In PyTorch, we saw that we could create one successfully, but that quite some redundant code had to be written in order to specify relatively straight-forward elements (such as the training loop). In the second example, we used PyTorch Lightning to avoid writing all this code. Running on top of classic PyTorch, Lightning allows you to specify your models in much less code without losing control over how they work. + +I hope that you have learned something from this tutorial! If you did, please feel free to leave a message in the comments section below 💬 I'd love to hear from you! + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +PyTorch Lightning. (2021, January 12). [https://www.pytorchlightning.ai/](https://www.pytorchlightning.ai/) + +PyTorch. (n.d.). [https://pytorch.org](https://pytorch.org/) + +PyTorch. (n.d.). _ReLU — PyTorch 1.7.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU) diff --git a/creating-a-signal-noise-removal-autoencoder-with-keras.md b/creating-a-signal-noise-removal-autoencoder-with-keras.md new file mode 100644 index 0000000..bb7fb14 --- /dev/null +++ b/creating-a-signal-noise-removal-autoencoder-with-keras.md @@ -0,0 +1,766 @@ +--- +title: "Creating a Signal Noise Removal Autoencoder with Keras" +date: "2019-12-19" +categories: + - "deep-learning" + - "frameworks" +tags: + - "autoencoder" + - "convolutional-neural-networks" + - "deep-learning" + - "denoising" + - "keras" + - "machine-learning" + - "noise-removal" + - "python" +--- + +Pure signals only exist in theory. That is, when you're doing signal processing related activities, it's very likely that you'll experience noise. Whether noise is caused by the measurement (or reception) device or by the medium in which you perform measurements, you want it gone. + +Various mathematical tricks exist to filter out noise from a signal. When noise is relatively constant across a range of signals, for example, you can take the mean of all the signals and deduct it from each individual signal - which likely removes the factors that contribute from noise. + +However, these tricks work by knowing a few things about the noise up front. 
In many cases, the exact shape of your noise is unknown or cannot be estimated because it is relatively hidden. In those cases, the solution may lie in _learning_ the noise from example data.

[![](images/x2noise-300x225.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/x2noise.png)

_A noisy \[latex\]x^2\[/latex\] sample. We'll try to remove the noise with an autoencoder._

Autoencoders can be used for this purpose. By feeding them noisy data as inputs and clean data as outputs, it's possible to make them recognize _the idiosyncratic noise in the training data_. This way, autoencoders can serve as denoisers.

But what are autoencoders exactly? And why does the way they work make them suitable for noise removal? And how can we implement one for _signal denoising / noise reduction_?

We'll answer these questions in today's blog. First, we'll provide a recap on autoencoders - to (re)gain a theoretical understanding of what they are and how they work. This includes a discussion on why they can be applied to noise removal. Subsequently, we implement an autoencoder to demonstrate this, by means of a three-step process:

- We generate a large dataset of \[latex\]x^2\[/latex\] samples.
- We generate a large dataset of \[latex\]x^2\[/latex\] samples to which Gaussian (i.e., random) noise has been added.
- We create an autoencoder which learns to transform noisy \[latex\]x^2\[/latex\] inputs into the original signal, i.e. _which removes the noise_ - also for new data!

Ready?

Okay, let's go! 😊

**Update 06/Jan/2021:** updated the article to reflect TensorFlow in 2021. As 1-dimensional transposed convolutions are available in TensorFlow now, the article was updated to use `Conv1D` and `Conv1DTranspose` layers instead of their 2D variants. This fits better given the 1D aspect of our dataset. In addition, references to old Keras were replaced with newer `tf.keras` versions, meaning that this article is compatible with TensorFlow 2.4.0+.

* * *

\[toc\]

* * *

## Recap: what are autoencoders?

If we're going to build an autoencoder, we must know what they are.

In our blog post **"Conv2DTranspose: using 2D transposed convolutions with Keras"**, we already [covered the high-level principles](https://www.machinecurve.com/index.php/2019/12/10/conv2dtranspose-using-2d-transposed-convolutions-with-keras/#what-is-an-autoencoder) behind autoencoders, but it's wise to repeat them here.

We can visualize the flow of an autoencoder as follows:

[![](images/Autoencoder.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/Autoencoder.png)

Autoencoders are composed of two parts: an _encoder_, which encodes some input into an encoded state, and a _decoder_ which can decode the encoded state into another format. This can be a reconstruction of the original input, as we can see in the plot below, but it can also be something different.

![](images/3.png)

_When autoencoders are used to reconstruct inputs from an encoded state._

For example, autoencoders are learnt for noise removal, but also for dimensionality reduction (Keras Blog, n.d.; we then use them to convert the input data into low-dimensional format, which might benefit training lower-dimensionality model types such as [SVMs](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/)).

Note that the red parts in the block above - that is, the encoder and the decoder - are _learnt based on data_ (Keras Blog, n.d.).
This means that, contrary to more abstract mathematical functions (e.g. filters), they are highly specialized in _one domain_ (e.g. signal noise removal at \[latex\]x^2\[/latex\] plots as we will do next) while they perform very poorly in another (e.g. when using the same autoencoder for image noise removal). + +* * * + +## Why autoencoders are applicable to noise removal + +Autoencoders learn an _encoded state_ with an _encoder_, and learn to decode this state into _something else_ with a _decoder_. + +Now think about this in the context of signal noise: suppose that you feed the neural network noisy data as _features_, while you have the pure data available as _targets_. Following the drawing above, the neural network will learn an encoded state based on the noisy image, and will attempt to decode it to best match the _pure data_. What's the thing that stands in between the pure data and the noisy data? Indeed, the noise. In effect, the autoencoder will thus learn to recognize noise and remove it from the input image. + +Let's now see if we can create such an autoencoder with Keras. + +* * * + +## Today's example: a Keras based autoencoder for noise removal + +In the next part, we'll show you how to use the Keras deep learning framework for creating a _denoising_ or _signal removal_ autoencoder. Here, we'll first take a look at two things - the data we're using as well as a high-level description of the model. + +### The data + +First, the data. As _pure signals_ (and hence autoencoder targets), we're using pure \[latex\]x^2\[/latex\] samples from a small domain. When plotted, a sample looks like this: + +[![](images/x2sample.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/x2sample.png) + +For today's model, we use 100.000 samples. To each of them, we add Gaussian noise - or random noise. While the global shape remains present, it's clear that the plots become noisy: + +[![](images/x2noise.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/x2noise.png) + +### The model + +Now, the model. It looks as follows: + +![](images/model-5.png) + +...and has these layers: + +- The input layer, which takes the input data; +- Two [Conv1D layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv1D), which serve as _encoder_; +- Two [Conv1D transpose layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv1DTranspose), which serve as _decoder_; +- One Conv1D layer with one output, a Sigmoid activation function and padding, serving as the output layer. + +To provide more details, this is the model summary: + +``` +Model: "sequential" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +conv1d (Conv1D) (None, 148, 128) 512 +_________________________________________________________________ +conv1d_1 (Conv1D) (None, 146, 32) 12320 +_________________________________________________________________ +conv1d_transpose (Conv1DTran (None, 148, 32) 3104 +_________________________________________________________________ +conv1d_transpose_1 (Conv1DTr (None, 150, 128) 12416 +_________________________________________________________________ +conv1d_2 (Conv1D) (None, 150, 1) 385 +================================================================= +Total params: 28,737 +Trainable params: 28,737 +Non-trainable params: 0 +``` + +Let's now start with the first part - generating the pure waveforms! Open up your Explorer, navigate to some folder (e.g. 
`keras-autoencoders`) and create a file called `signal_generator.py`. Next, open this file in your code editor - and let the coding process begin! + +* * * + +## Generating pure waveforms + +[![](images/x2sample-300x225.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/x2sample.png) + +Generating pure waveforms consists of the following steps, in order to generate visualizations like the one shown on the right: + +- Adding the necessary imports to the start of your Python script; +- Defining configuration settings for the signal generator; +- Generating the data, a.k.a. the pure waveforms; +- Saving the waveforms and visualizing a subset of them. + +### Adding imports + +First, the imports - it's a simple list: + +``` +import matplotlib.pyplot as plt +import numpy as np +``` + +We use Numpy for data generation & processing and Matplotlib for visualizing some of the samples at the end. + +### Configuring the generator + +Generator configuration consists of three steps: sample-wide configuration, intra-sample configuration and other settings. First, sample-wide configuration, which is just the number of samples to generate: + +``` +# Sample configuration +num_samples = 100000 +``` + +Followed by intra-sample configuration: + +``` +# Intrasample configuration +num_elements = 1 +interval_per_element = 0.01 +total_num_elements = int(num_elements / interval_per_element) +starting_point = int(0 - 0.5*total_num_elements) +``` + +`num_elements` represents the _width_ of your domain. `interval_per_element` represents the step size that the iterator will take when generating the sample. In this case, the domain \[latex\](0, 1\]\[/latex\] will thus contain 100 samples (as \[latex\]1/interval per element = 1/0.01 = 100\[/latex\]). That's what's represented in `total_num_elements`. + +The starting point determines where to start the generation process. + +Finally, you can set the number of samples that you want visualized in the `other configuration` settings: + +``` +# Other configuration +num_samples_visualize = 1 +``` + +### Generating data + +Next step, creating some data! 😁 + +We'll first specify the lists that contain our data and the sub-sample data (one sample in `samples` contains multiple `xs` and `ys`; when \[latex\]totalnumelements = 100\[/latex\], that will be 100 of them each): + +``` +# Containers for samples and subsamples +samples = [] +xs = [] +ys = [] +``` + +Next, the actual data generation part: + +``` +# Generate samples +for j in range(0, num_samples): + # Report progress + if j % 100 == 0: + print(j) + # Generate wave + for i in range(starting_point, total_num_elements): + x_val = i * interval_per_element + y_val = x_val * x_val + xs.append(x_val) + ys.append(y_val) + # Append wave to samples + samples.append((xs, ys)) + # Clear subsample containers for next sample + xs = [] + ys = [] +``` + +We'll first iterate over every sample, determined by the range between 0 and the `num_samples` variable. This includes a progress report every 100 samples. + +Next, we construct the wave step by step, adding the function outputs to the `xs` and `ys` variables. + +Subsequently, we append the entire wave to the `samples` list, and clear the subsample containers for generating the next sample. + +### Saving & visualizing + +The next step is to save the data. We do so by using Numpy's `save` call, and save `samples` to a file called `./signal_waves_medium.py`. 
+ +``` +# Input shape +print(np.shape(np.array(samples[0][0]))) + +# Save data to file for re-use +np.save('./signal_waves_medium.npy', samples) + +# Visualize a few random samples +for i in range(0, num_samples_visualize): + random_index = np.random.randint(0, len(samples)-1) + x_axis, y_axis = samples[random_index] + plt.plot(x_axis, y_axis) + plt.title(f'Visualization of sample {random_index} ---- y: f(x) = x^2') + plt.show() +``` + +Next, with some basic Matplotlib code, we visualize `num_samples_visualize` random samples from the `samples` array. And that's it already! + +Run your code with `python signal_generator.py` (ensure that you have Numpy and Matplotlib installed) and the generation process should begin, culminating in a `.npy` file and one (or more) visualizations popping up once the process finishes. + +### Full generator code + +If you wish to obtain the entire signal generator at once, here you go: + +``` +import matplotlib.pyplot as plt +import numpy as np + +# Sample configuration +num_samples = 100000 + +# Intrasample configuration +num_elements = 1 +interval_per_element = 0.01 +total_num_elements = int(num_elements / interval_per_element) +starting_point = int(0 - 0.5*total_num_elements) + +# Other configuration +num_samples_visualize = 1 + +# Containers for samples and subsamples +samples = [] +xs = [] +ys = [] + +# Generate samples +for j in range(0, num_samples): + # Report progress + if j % 100 == 0: + print(j) + # Generate wave + for i in range(starting_point, total_num_elements): + x_val = i * interval_per_element + y_val = x_val * x_val + xs.append(x_val) + ys.append(y_val) + # Append wave to samples + samples.append((xs, ys)) + # Clear subsample containers for next sample + xs = [] + ys = [] + +# Input shape +print(np.shape(np.array(samples[0][0]))) + +# Save data to file for re-use +np.save('./signal_waves_medium.npy', samples) + +# Visualize a few random samples +for i in range(0, num_samples_visualize): + random_index = np.random.randint(0, len(samples)-1) + x_axis, y_axis = samples[random_index] + plt.plot(x_axis, y_axis) + plt.title(f'Visualization of sample {random_index} ---- y: f(x) = x^2') + plt.show() +``` + +* * * + +## Adding noise to pure waveforms + +The second part: adding noise to the 100k pure waveforms we generated in the previous step. + +[![](images/x2noise-300x225.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/x2noise.png) + +It's composed of these individual steps: + +- Once again, adding imports; +- Setting the configuration variables for the noising process; +- Loading the data; +- Adding the noise; +- Saving the noisy samples and visualizing a few of them. + +Create an additional file, e.g. `signal_apply_noise.py`, and let's add the following things. + +### Adding imports + +Our imports are the same as we used in the signal generator: + +``` +import matplotlib.pyplot as plt +import numpy as np +``` + +### Configuring the noising process + +Our noising configuration is also a lot simpler: + +``` +# Sample configuration +num_samples_visualize = 1 +noise_factor = 0.05 +``` + +`num_samples_visualize` is the number of samples we wish to visualize once the noising process finishes, and `noise_factor` is the amount of noise we'll be adding to our samples (\[latex\]0 = no noise; 1 = full noise\[/latex\]). + +### Loading data + +Next, we load the data and assign the samples to the correct variables, being `x_val` and `y_val`. 
+ +``` +# Load data +data = np.load('./signal_waves_medium.npy') +x_val, y_val = data[:,0], data[:,1] +``` + +### Adding noise + +Next, we add the noise to our samples. + +``` +# Add noise to data +noisy_samples = [] +for i in range(0, len(x_val)): + if i % 100 == 0: + print(i) + pure = np.array(y_val[i]) + noise = np.random.normal(0, 1, pure.shape) + signal = pure + noise_factor * noise + noisy_samples.append([x_val[i], signal]) +``` + +First, we define a new list that will contain our noisy samples. Subsequently, we iterate over each sample (reporting progress every 100 samples). We then do a couple of things: + +- We assign the pure sample (i.e., the \[latex\]x^2\[/latex\] plot wihtout noise) to the `pure` variable. +- Subsequently, we generate Gaussian noise using `np.random.normal`, with the same shape as `pure`'s. +- Next, we add the noise to the pure sample, using the `noise_factor`. +- Finally, we append the sample's domain and the noisy sample to the `noisy_samples` array. + +### Saving & visualizing + +Next, we - and this is no different than with the generator before - save the data into a `.npy` file (this time, with a different name 😃) and visualize a few random samples based on the number you configured earlier. + +``` +# Save data to file for re-use +np.save('./signal_waves_noisy_medium.npy', noisy_samples) + +# Visualize a few random samples +for i in range(0, num_samples_visualize): + random_index = np.random.randint(0, len(noisy_samples)-1) + x_axis, y_axis = noisy_samples[random_index] + plt.plot(x_axis, y_axis) + plt.title(f'Visualization of noisy sample {random_index} ---- y: f(x) = x^2') + plt.show() +``` + +If you would now run `signal_apply_noise.py`, you'd get 100k noisy samples, with which we can train the autoencoder we'll build next. + +### Full noising code + +If you're interested in the full code of the noising script, here you go: + +``` +import matplotlib.pyplot as plt +import numpy as np + +# Sample configuration +num_samples_visualize = 1 +noise_factor = 0.05 + +# Load data +data = np.load('./signal_waves_medium.npy') +x_val, y_val = data[:,0], data[:,1] + +# Add noise to data +noisy_samples = [] +for i in range(0, len(x_val)): + if i % 100 == 0: + print(i) + pure = np.array(y_val[i]) + noise = np.random.normal(0, 1, pure.shape) + signal = pure + noise_factor * noise + noisy_samples.append([x_val[i], signal]) + +# Save data to file for re-use +np.save('./signal_waves_noisy_medium.npy', noisy_samples) + +# Visualize a few random samples +for i in range(0, num_samples_visualize): + random_index = np.random.randint(0, len(noisy_samples)-1) + x_axis, y_axis = noisy_samples[random_index] + plt.plot(x_axis, y_axis) + plt.title(f'Visualization of noisy sample {random_index} ---- y: f(x) = x^2') + plt.show() +``` + +* * * + +## Creating the autoencoder + +[![](images/model-5-187x300.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/model-5.png) + +It's now time for the interesting stuff: creating the autoencoder 🤗 + +Creating it contains these steps: + +- Once again, adding some imports 😋 +- Setting configuration details for the model; +- Data loading and preparation; +- Defining the model architecture; +- Compiling the model and starting training; +- Visualizing denoised waveforms from the test set, to find out visually whether it works. + +To run it successfully, you'll need **TensorFlow 2.4.0+**, **Matplotlib** and **Numpy**. + +Let's create a third (and final 😋) file: `python signal_autoencoder.py`. 
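Before we start coding the model, it can be useful to quickly verify that both `.npy` files were written correctly and line up. This is an optional sanity check, not part of the original scripts - a minimal sketch, assuming the filenames and default configuration used above (100.000 samples of 150 elements each):

```
import numpy as np

# Load both files and compare their shapes (optional sanity check)
pure = np.load('./signal_waves_medium.npy')
noisy = np.load('./signal_waves_noisy_medium.npy')

# With the defaults above, both should be shaped (100000, 2, 150):
# axis 0 = samples, axis 1 = (x values, y values), axis 2 = elements per wave
print(pure.shape)
print(noisy.shape)
assert pure.shape == noisy.shape, 'Pure and noisy data should have the same shape'
```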
+ +### Adding imports + +First, let's specify the imports: + +``` +import tensorflow.keras +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Conv1D, Conv1DTranspose +from tensorflow.keras.constraints import max_norm +import matplotlib.pyplot as plt +import numpy as np +import math +``` + +From Keras, we import the Sequential API (which we use to stack the layers on top of each other), the Conv1D and Conv1DTranspose layers (see the architecture and the rationale [here](#the-model) to find out why), and the MaxNorm constraint, in order to keep the weight updates in check. We also import Matplotlib, Numpy and the Python `math` library. + +### Model configuration + +Next, we set some configuration options for the model: + +``` +# Model configuration +input_shape = (150, 1) +batch_size = 150 +no_epochs = 5 +train_test_split = 0.3 +validation_split = 0.2 +verbosity = 1 +max_norm_value = 2.0 +``` + +Here are some insights about the model configuration: + +- The `input_shape`, in line with Conv1D input, is thus \[latex\] (150, 1)\[/latex\]. +- The batch size is 150. This number seemed to work well, offering a nice balance between loss value and prediction time. +- The number of epochs is fairly low, but pragmatic: the autoencoder did not improve substantially anymore after this number. +- We use 30% of the total data, i.e. 30k samples, as testing data. +- 20% of the training data (70k) will be used for validation purposes. Hence, 14k will be used to validate the model per epoch (and even per minibatch), while 56k will be used for actual training. +- All model outputs are displayed on screen, with `verbosity` mode set to True. +- The `max_norm_value` is 2.0. This value worked well in a different scenario, and slightly improved the training results. + +### Data loading & preparation + +The next thing to do is to load the data. We load both the noisy and the pure samples into their respective variables: + +``` +# Load data +data_noisy = np.load('./signal_waves_noisy_medium.npy') +x_val_noisy, y_val_noisy = data_noisy[:,0], data_noisy[:,1] +data_pure = np.load('./signal_waves_medium.npy') +x_val_pure, y_val_pure = data_pure[:,0], data_pure[:,1] +``` + +Next, we'll reshape the data. We do so for each sample. This includes the following steps: + +[![](images/bce-1-300x123.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/bce-1.png) + +_Binary crossentropy loss values for target = 1, in the prediction range \[0, 1\]._ + +- First, given the way how [binary crossentropy loss works](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#binary-crossentropy), we normalize our samples to fall in the range \[latex\]\[0, 1\]\[/latex\]. Without this normalization step, odd loss values (extremely negative ones, impossible with BCE loss) start popping up (Quetzalcohuatl, n.d.). +- We subsequently add the noisy and pure samples to the specific `*_r` arrays. 
+ +``` +# Reshape data +y_val_noisy_r = [] +y_val_pure_r = [] +for i in range(0, len(y_val_noisy)): + noisy_sample = y_val_noisy[i] + pure_sample = y_val_pure[i] + noisy_sample = (noisy_sample - np.min(noisy_sample)) / (np.max(noisy_sample) - np.min(noisy_sample)) + pure_sample = (pure_sample - np.min(pure_sample)) / (np.max(pure_sample) - np.min(pure_sample)) + y_val_noisy_r.append(noisy_sample) + y_val_pure_r.append(pure_sample) +y_val_noisy_r = np.array(y_val_noisy_r) +y_val_pure_r = np.array(y_val_pure_r) +noisy_input = y_val_noisy_r.reshape((y_val_noisy_r.shape[0], y_val_noisy_r.shape[1], 1)) +pure_input = y_val_pure_r.reshape((y_val_pure_r.shape[0], y_val_pure_r.shape[1], 1)) +``` + +Once each sample is resampled, we convert the _entire_ array for both the resampled noisy and resampled pure samples into a structure that TensorFlow/Keras can handle. That is, we increase the shape with another dimension to represent the number of channels, which in our case is just 1. + +Finally, we perform the split into training and testing data (30k test, 56+14 = 70k train): + +``` +# Train/test split +percentage_training = math.floor((1 - train_test_split) * len(noisy_input)) +noisy_input, noisy_input_test = noisy_input[:percentage_training], noisy_input[percentage_training:] +pure_input, pure_input_test = pure_input[:percentage_training], pure_input[percentage_training:] +``` + +### Creating the model architecture + +This is the architecture of our autoencoder: + +``` +# Create the model +model = Sequential() +model.add(Conv1D(128, kernel_size=3, kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform', input_shape=input_shape)) +model.add(Conv1D(32, kernel_size=3, kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(Conv1DTranspose(32, kernel_size=3, kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(Conv1DTranspose(128, kernel_size=3, kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(Conv1D(1, kernel_size=3, kernel_constraint=max_norm(max_norm_value), activation='sigmoid', padding='same')) + +model.summary() +``` + +- We'll use the Sequential API, for stacking the layers on top of each other. +- The two Conv1D layers serve as the _encoder_, and learn 128 and 32 filters, respectively. They activate with the [ReLU activation function](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/), and by consequence require [He initialization](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/). Max-norm regularization is applied to each of them. +- The two Conv1DTranspose layers, which learn 32 and 128 filters, serve as the _decoder_. They also use ReLU activation and He initialization, as well as Max-norm regularization. +- The final Conv layer serves as the output layer, and does (by virtue of `padding='same'`) not alter the shape, except for the number of channels (back into 1). +- Kernel sizes are 3 pixels. + +Generating a model summary, i.e. 
calling `model.summary()`, results in this summary, which also shows the number of parameters that is trained: + +``` +Model: "sequential" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +conv1d (Conv1D) (None, 148, 128) 512 +_________________________________________________________________ +conv1d_1 (Conv1D) (None, 146, 32) 12320 +_________________________________________________________________ +conv1d_transpose (Conv1DTran (None, 148, 32) 3104 +_________________________________________________________________ +conv1d_transpose_1 (Conv1DTr (None, 150, 128) 12416 +_________________________________________________________________ +conv1d_2 (Conv1D) (None, 150, 1) 385 +================================================================= +Total params: 28,737 +Trainable params: 28,737 +Non-trainable params: 0 +_________________________________________________________________ +``` + +### Model compilation & starting the training process + +The next thing to do is to compile the model (i.e., specify the optimizer and loss function) and to start the training process. We use Adam and Binary crossentropy for the fact that they are relatively default choices for today's deep learning models. + +Fitting the data shows that we're going from `noisy_input` (features) to `pure_input` (targets). The number of epochs, the batch size and the validation split are as configured earlier. + +``` +# Compile and fit data +model.compile(optimizer='adam', loss='binary_crossentropy') +model.fit(noisy_input, pure_input, + epochs=no_epochs, + batch_size=batch_size, + validation_split=validation_split) +``` + +### Visualizing denoised waveforms from test set + +Once the training process finishes, it's time to find out whether our model actually works. We do so by generating a few reconstructions: we add a noisy sample from the test set (which is data the model has never seen before!) and visualize whether it outputs the noise-free shape. This is the code + +``` +# Generate reconstructions +num_reconstructions = 4 +samples = noisy_input_test[:num_reconstructions] +reconstructions = model.predict(samples) + +# Plot reconstructions +for i in np.arange(0, num_reconstructions): + # Prediction index + prediction_index = i + percentage_training + # Get the sample and the reconstruction + original = y_val_noisy[prediction_index] + pure = y_val_pure[prediction_index] + reconstruction = np.array(reconstructions[i]) + # Matplotlib preparations + fig, axes = plt.subplots(1, 3) + # Plot sample and reconstruciton + axes[0].plot(original) + axes[0].set_title('Noisy waveform') + axes[1].plot(pure) + axes[1].set_title('Pure waveform') + axes[2].plot(reconstruction) + axes[2].set_title('Conv Autoencoder Denoised waveform') + plt.show() +``` + +Open up your terminal again, and run `python signal_autoencoder.py`. Now, the training process should begin. 
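One optional addition: the full listing below imports `save_model` but never calls it. If you want to reuse the trained denoiser later without retraining, you could persist it once `model.fit` finishes - a small sketch, where the target folder name is just an assumption:

```
from tensorflow.keras.models import save_model, load_model

# After training: persist the trained autoencoder (folder name is an arbitrary choice)
save_model(model, './denoising_autoencoder')

# Later, reload it and denoise data without retraining
reloaded = load_model('./denoising_autoencoder')
denoised = reloaded.predict(noisy_input_test[:4])
```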
+ +### Full model code + +If you're interested in the full code, here you go: + +``` +import tensorflow.keras +from tensorflow.keras.models import Sequential, save_model +from tensorflow.keras.layers import Conv1D, Conv1DTranspose +from tensorflow.keras.constraints import max_norm +import matplotlib.pyplot as plt +import numpy as np +import math + +# Model configuration +input_shape = (150, 1) +batch_size = 150 +no_epochs = 5 +train_test_split = 0.3 +validation_split = 0.2 +verbosity = 1 +max_norm_value = 2.0 + +# Load data +data_noisy = np.load('./signal_waves_noisy_medium.npy') +x_val_noisy, y_val_noisy = data_noisy[:,0], data_noisy[:,1] +data_pure = np.load('./signal_waves_medium.npy') +x_val_pure, y_val_pure = data_pure[:,0], data_pure[:,1] + +# Reshape data +y_val_noisy_r = [] +y_val_pure_r = [] +for i in range(0, len(y_val_noisy)): + noisy_sample = y_val_noisy[i] + pure_sample = y_val_pure[i] + noisy_sample = (noisy_sample - np.min(noisy_sample)) / (np.max(noisy_sample) - np.min(noisy_sample)) + pure_sample = (pure_sample - np.min(pure_sample)) / (np.max(pure_sample) - np.min(pure_sample)) + y_val_noisy_r.append(noisy_sample) + y_val_pure_r.append(pure_sample) +y_val_noisy_r = np.array(y_val_noisy_r) +y_val_pure_r = np.array(y_val_pure_r) +noisy_input = y_val_noisy_r.reshape((y_val_noisy_r.shape[0], y_val_noisy_r.shape[1], 1)) +pure_input = y_val_pure_r.reshape((y_val_pure_r.shape[0], y_val_pure_r.shape[1], 1)) + +# Train/test split +percentage_training = math.floor((1 - train_test_split) * len(noisy_input)) +noisy_input, noisy_input_test = noisy_input[:percentage_training], noisy_input[percentage_training:] +pure_input, pure_input_test = pure_input[:percentage_training], pure_input[percentage_training:] + +# Create the model +model = Sequential() +model.add(Conv1D(128, kernel_size=3, kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform', input_shape=input_shape)) +model.add(Conv1D(32, kernel_size=3, kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(Conv1DTranspose(32, kernel_size=3, kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(Conv1DTranspose(128, kernel_size=3, kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(Conv1D(1, kernel_size=3, kernel_constraint=max_norm(max_norm_value), activation='sigmoid', padding='same')) + +model.summary() + +# Compile and fit data +model.compile(optimizer='adam', loss='binary_crossentropy') +model.fit(noisy_input, pure_input, + epochs=no_epochs, + batch_size=batch_size, + validation_split=validation_split) + +# Generate reconstructions +num_reconstructions = 4 +samples = noisy_input_test[:num_reconstructions] +reconstructions = model.predict(samples) + +# Plot reconstructions +for i in np.arange(0, num_reconstructions): + # Prediction index + prediction_index = i + percentage_training + # Get the sample and the reconstruction + original = y_val_noisy[prediction_index] + pure = y_val_pure[prediction_index] + reconstruction = np.array(reconstructions[i]) + # Matplotlib preparations + fig, axes = plt.subplots(1, 3) + # Plot sample and reconstruciton + axes[0].plot(original) + axes[0].set_title('Noisy waveform') + axes[1].plot(pure) + axes[1].set_title('Pure waveform') + axes[2].plot(reconstruction) + axes[2].set_title('Conv Autoencoder Denoised waveform') + plt.show() + +``` + +## Results + +Next, the results 😎 + +After the fifth epoch, 
validation loss \[latex\]\\approx 0.3556\[/latex\]. This is high, but acceptable. What's more important is to find out how well the model works when visualizing the test set predictions. + +Here they are: + +- [![](images/1-2-1024x537.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/1-2.png) + +- [![](images/2-2-1024x537.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/2-2.png) + +- [![](images/3-2-1024x537.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/3-2.png) + +- [![](images/4-2-1024x537.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/4-2.png) + + +Clearly, the autoencoder has learnt to remove much of the noise. As you can see, the denoised samples are not entirely noise-free, but it's a lot better. Some nice results! 😎 + +## Summary + +In this blog post, we created a denoising / noise removal autoencoder with Keras, specifically focused on signal processing. By generating 100.000 pure and noisy samples, we found that it's possible to create a trained noise removal algorithm that is capable of removing specific noise from input data. I hope you've learnt something today, and if you have any questions or remarks - please feel free to leave a comment in the comments box below! 😊 I'll try to answer your comment as soon as I can. + +Thanks for reading MachineCurve today and happy engineering! 😎 + +_Please note that the code for these models is also available in my [keras-autoencoders Github repository](https://github.com/christianversloot/keras-autoencoders)._ + +## References + +Keras Blog. (n.d.). Building Autoencoders in Keras. Retrieved from [https://blog.keras.io/building-autoencoders-in-keras.html](https://blog.keras.io/building-autoencoders-in-keras.html) + +Quetzalcohuatl. (n.d.). The loss becomes negative · Issue #1917 · keras-team/keras. Retrieved from [https://github.com/keras-team/keras/issues/1917#issuecomment-451534575](https://github.com/keras-team/keras/issues/1917#issuecomment-451534575) + +TensorFlow. (n.d.). _Tf.keras.layers.Conv1DTranspose_. [https://www.tensorflow.org/api\_docs/python/tf/keras/layers/Conv1DTranspose](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv1DTranspose) + +TensorFlow. (n.d.). _Tf.keras.layers.Conv1D_. [https://www.tensorflow.org/api\_docs/python/tf/keras/layers/Conv1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv1D) diff --git a/creating-a-simple-binary-svm-classifier-with-python-and-scikit-learn.md b/creating-a-simple-binary-svm-classifier-with-python-and-scikit-learn.md new file mode 100644 index 0000000..5d5a584 --- /dev/null +++ b/creating-a-simple-binary-svm-classifier-with-python-and-scikit-learn.md @@ -0,0 +1,679 @@ +--- +title: "Creating a simple binary SVM classifier with Python and Scikit-learn" +date: "2020-05-03" +categories: + - "frameworks" + - "svms" +tags: + - "classification" + - "classifier" + - "python" + - "scikit-learn" + - "support-vector-machine" + - "svm" +--- + +Suppose that you are cleaning your house - and especially the clothes you never wear anymore. For every item, you decide whether you keep it or whether you'll throw it away (or, more preferably, bring it to some kind of second-hand clothing initiative). + +What you are effectively doing here is _classifying_ each sample into one of two classes: "keep" and "throw away". + +This is called **binary classification** and it is precisely what we will be looking at in today's blog post. 
In supervised machine learning, we can create models that do the same - assign one of two classes to a new sample, based on samples from the past that instruct it to do so. + +Today, neural networks are very hot - and [they can be used for binary classification as well](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/). However, today, we will keep the neural networks out of this post - and we will focus on another Machine Learning technique called Support Vector Machine. It is one of the more _traditional_ techniques, but it is still used today. + +Let's take a look at what we will do today. Firstly, we'll dive into classification in more detail. What is it? What is a class? What is a binary classifier? How are classifiers trained? We will answer those questions, so that you can understand what is going on - but don't worry, we'll do so intuitively. + +Subsequently, we will focus on the Support Vector Machine class of classifiers. How do they work? How are they trained? We'll cover those questions in today's blog. + +Following the theoretical part is a practical one - namely, building a SVM classifier for binary classification This answers the question _How to create a binary SVM classifier?_ We will be using Python for doing so - for many data scientists and machine learning engineers the lingua franca for creating machine learning models. More specifically, we will use Scikit-learn, a Python framework for machine learning, for creating our SVM classifier. It is one of the most widely used frameworks and therefore a perfect candidate for today's post. + +Part of the theoretical part is a step-by-step example of how to generate a sample dataset, build the SVM classifier, train it, and visualize the decision boundary that has emerged after training. We'll explain every part, so that you understand with great detail how to build one yourself for a different dataset. + +All right. Are you ready? Let's go :) + +* * * + +\[toc\] + +* * * + +## What is classification in Machine Learning? + +Let's revisit that scenario that we discussed above. + +You are in your bedroom, because you've decided that you need to clean up your closet. It's time to renew it, which includes getting rid of all the clothing that you no longer wear - or maybe, even have grown out of, in either of two directions :) + +[![](images/assorted-clothes-996329-1-1024x683.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/05/assorted-clothes-996329-1-scaled.jpg) + +Photographer: Kai Pilger / Pexels License + +You would follow this process: + +1. Pick an item from your closet. +2. Take a look at it, and at your decision criteria, and make a decision: + 1. **Keep** it; + 2. **Discard** it; +3. Put the item onto the pile of clothing that likely already exists, or at some assigned place for clothing assigned that particular choice if it's the first item you've assigned that decision to. + +[![](images/bin.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/bin.png) + +Translated into conceptual terms, this is what you have been doing: + +1. Pick a new sample. +2. Check the characteristics of the sample against your decision criteria, and assign the class **"keep"** or the class **"discard"**. + +This means that you've been _classifying_ new samples according to a preexisting set of decision criteria. 
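To make the idea of "decision criteria" a bit more tangible, here is a deliberately simple, hand-written "classifier" in Python. It is purely illustrative and the criteria are made up; the whole point of the machine learning approach discussed next is that we _learn_ such criteria from data instead of hard-coding them:

```
# A hypothetical, hand-coded decision rule for the closet example.
# The criteria (worn in the last year, still fits) are made up for illustration.
def classify_clothing_item(worn_last_year: bool, still_fits: bool) -> str:
    if worn_last_year and still_fits:
        return 'keep'
    return 'discard'

print(classify_clothing_item(worn_last_year=True, still_fits=True))   # keep
print(classify_clothing_item(worn_last_year=False, still_fits=True))  # discard
```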
+ +### From the human world to the machine world + +In fact, it is something we humans do every day: we make a choice to take ("yes") or don't take ("no") some fastfood out on our way home, to go for a run ("yes/no" again), whether a date is good or not ("friendzone/romance zone" ;-) ), and so on! + +In supervised machine learning, scholars and engineers have attempted to mimic this decision-making ability by allowing us to create what is known as a **classifier**. Using data from the past, it attempts to learn a **decision boundary** between the samples from the different classes - i.e., the decision criteria we just mentioned for sorting the clothes. + +The end result: a machine learning model which can be used to decide automatically what class should be assigned once it is fed a new sample. But, of course, only if it is trained well. + +### Binary and multiclass classification + +In the scenario above, we had two classes: this is called a **binary classification** scenario. + +However, sometimes, there are more classes - for example, in the dating scenario above, you might wish to add the class "never want to see / speak to again", which I'd consider a good recommendation for some people :) + +This is called **multiclass classification**. + +In any transition from binary into multiclass classification, you should take a close look at machine learning models and find out whether they support it out of the box. + +Very often, they do, but they may not do so natively - requiring a set of tricks for multiclass classification to work. + +For example, [neural networks](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/) support multiclass classification out of the box. It's simply a matter of adding the [Softmax activation function](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) to generate a multiclass probability distribution that will give you the likelihood of your sample belonging to one class. + +Support Vector Machines, which we are using in today's blog post, do not support multiclass classification natively, as we shall see next. However, they _do_ support it with a few tricks, [but those will be covered in another blog post](https://www.machinecurve.com/index.php/2020/11/11/creating-one-vs-rest-and-one-vs-one-svm-classifiers-with-scikit-learn/). Should you wish to find out more, you could look [here](https://en.wikipedia.org/wiki/Support-vector_machine#Multiclass_SVM). + +\[affiliatebox\] + +* * * + +## What is a Support Vector Machine? + +Let's now take a look at what a Support Vector Machine is. Here is a great visual explanation: + +https://www.youtube.com/watch?v=N-sPSXDtcQw + +* * * + +## Creating a binary SVM classifier, step-by-step + +Now that we know what classification is and how SVMs can be used for classification, it's time to move to the more practical part of today's blog post. + +We're going to build a SVM classifier step-by-step with Python and Scikit-learn. This part consists of a few steps: + +1. **Generating a dataset:** if we want to classify, we need something to classify. For this reason, we will generate a linearly separable dataset having 2 features with Scikit's `make_blobs`. +2. **Building the SVM classifier:** we're going to explore the concept of a kernel, followed by constructing the SVM classifier with Scikit-learn. +3. 
**Using the SVM to predict new data samples:** once the SVM is trained, it should be able to correctly predict new samples. We're going to demonstrate how you can evaluate your binary SVM classifier.
4. **Finding the support vectors of your trained SVM:** as we know, support vectors determine the decision boundary. But given your training data, which vectors were used as support vectors? We can find out - and we will show you.
5. **Visualizing the decision boundary:** by means of a [cool extension called Mlxtend](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/), we can visualize the decision boundary of our model. We're going to show you how to do this with your binary SVM classifier.

Make sure that you have installed all the Python dependencies before you start coding. These dependencies are Scikit-learn (or `sklearn` in PIP terms), Numpy, and Matplotlib.

Let's go and generate a dataset :) Open up a code editor, create a file (such as `binary-svm.py`), and code away 👩‍💻

[![](images/dataset.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/dataset.png)

_A plot of today's dataset._

### Generating a dataset

As with any Python script, we need to define our imports on top:

```
# Imports
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
```

We're going to use four imports for generating our dataset:

- Scikit-learn's `make_blobs` function, which allows us to generate the two clusters/blobs of data displayed above.
- Scikit-learn's `train_test_split` function, which allows us to split the generated dataset into a [part for training and a part for testing](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets) easily.
- Numpy, for numbers processing.
- Matplotlib, for generating the plot from above.

#### Configuration

Once we have set the imports, we're going to define a number of configuration options:

```
# Configuration options
blobs_random_seed = 42
centers = [(0,0), (5,5)]
cluster_std = 1
frac_test_split = 0.33
num_features_for_samples = 2
num_samples_total = 1000
```

- The **random seed for our blobs** ensures that we initialize the pseudorandom number generator with the same starting point on every run; in the code below it is used as the `random_state` for the train/test split, making the split reproducible. It can be any number, but the number 42 is cool [for obvious reasons](https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy#Answer_to_the_Ultimate_Question_of_Life,_the_Universe,_and_Everything_(42)).
- The **centers** represent the (x, y) positions of the centers of the blobs we're generating.
- The **cluster standard deviation** tells us how scattered the samples are around their blob's center in the two-dimensional space. It can be set to any number; the lower it is, the more condensed the clusters are.
- The **fraction of the test split** tells us what percentage of our data is used for testing purposes. In our case, that's 33%, or one third of our dataset.
- The **number of features for samples** tells us how many features each generated sample has - _not_ the number of classes. With 2 features, every sample is a point in a two-dimensional plane; the number of classes follows from the number of `centers` we defined (two, which is why we're building a binary classifier).
- The **number of samples in total** tells us the number of samples that are generated in total.
For educational purposes, we're keeping the number quite low today, but it can be set to larger numbers if you desire. + +#### Generation + +Now that we have the imports and the configuration, we can generate the data: + +``` +# Generate data +inputs, targets = make_blobs(n_samples = num_samples_total, centers = centers, n_features = num_features_for_samples, cluster_std = cluster_std) +X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=frac_test_split, random_state=blobs_random_seed) +``` + +For this, we're calling `make_blobs` with the configuration options from before. We store its output in the `inputs` and `targets` variables, which store the features (inputs) and targets (class outcomes), respectively. + +Then, we split the inputs and targets into training and testing data. + +\[affiliatebox\] + +#### Saving and loading (optional) + +Should you wish to re-use your generated data many times, you don't want the plot to change every time you run the script. In that case, you might use Numpy to save the data temporarily, and load it before continuing: + +``` +# Save and load temporarily +np.save('./data.npy', (X_train, X_test, y_train, y_test)) +X_train, X_test, y_train, y_test = np.load('./data.npy', allow_pickle=True) +``` + +Now, if you run the code once, then uncomment `np.save` (and possibly the generation part of the code as well), you'll always have your code run with the same dataset. A simple trick. + +#### Visualizing + +Finally, we can generate that visualization from above: + +``` +# Generate scatter plot for training data +plt.scatter(X_train[:,0], X_train[:,1]) +plt.title('Linearly separable data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() +``` + +Et voila - if we run it, we get the plot (although in yours, the samples are at a different position, but relatively close to where mine are): + +[![](images/dataset.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/dataset.png) + +#### Full code so far + +Should you wish to obtain the full code so far, you can copy from here: + +``` +# Imports +from sklearn.datasets import make_blobs +from sklearn.model_selection import train_test_split +import numpy as np +import matplotlib.pyplot as plt + +# Configuration options +blobs_random_seed = 42 +centers = [(0,0), (5,5)] +cluster_std = 1 +frac_test_split = 0.33 +num_features_for_samples = 2 +num_samples_total = 1000 + +# Generate data +inputs, targets = make_blobs(n_samples = num_samples_total, centers = centers, n_features = num_features_for_samples, cluster_std = cluster_std) +X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=frac_test_split, random_state=blobs_random_seed) + +# Save and load temporarily +np.save('./data.npy', (X_train, X_test, y_train, y_test)) +X_train, X_test, y_train, y_test = np.load('./data.npy', allow_pickle=True) + +# Generate scatter plot for training data +plt.scatter(X_train[:,0], X_train[:,1]) +plt.title('Linearly separable data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() +``` + +### Building the SVM classifier + +All right - now we have the data, we can build our SVM classifier :) + +We will be doing so with `SVC` from Scikit-learn, which is their representation of a **S**upport **V**ector **C**lassifier - or SVC. This primarily involves two main steps: + +1. **Choosing a kernel function** \- in order to make _nonlinear_ data linearly separable, if necessary. Don't worry, we'll explain this next. +2. **Building our classifier** - i.e., writing our code. + +Let's take a look. 
+ +#### Choosing a kernel function + +As we've seen above, SVMs will attempt to find a **linear separation** between the samples in your dataset. + +In cases like this... + +[![](images/dataset.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/dataset.png) + +...this won't be too problematic :) + +But if your data looks differently... + +![](images/moons.png) + +Whoops 👀 + +We could use a **kernel** for this. Let's take a look - if we plot our 'moons' (the data looks similar to 2 moons) in 3D, we would get this: + +- [![](images/moons3d.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/moons3d.png) + +- [![](images/moons3d1.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/moons3d1.png) + + +Indeed, we still cannot separate them linearly - but the extra dimension shows you why a kernel is useful. In SVMs, kernel functions _map the function into another space, where the data becomes linearly separable_. + +And through a smart mathematical formulation, this will be possible at no substantial increase in computational cost. It's truly one of the most beautiful things of SVMs, if you ask me :) + +Any mathematical function can be used as a kernel function. Scikit-learn also supports this by means of a 'callable', which means that you can provide a kernel function if you see fit. However, out of the box, Scikit-learn supports these: + +- **Linear:** which simply maps the same onto a different space. +- **Polynomial kernel**: it "represents vector similarity over polynomials of the original variables". +- **RBF,** or **Radial Basis Function:** value depends on the distance from some point. +- The **[Sigmoid function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/)**. +- A precomputed function. + +Here's an example of what would happen if we apply some customkernel to our moons: + +- [![](images/kernelized.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/kernelized.png) + +- [![](images/kernelized1.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/kernelized1.png) + + +As you can see, they are mapped onto the 3rd dimension differently than in our original setting. Still, they are not linearly separable - but you get the point. + +Fortunately, in our case, we have linearly separable data - check the plot again - so we choose `linear` as our kernel: + +[![](images/dataset.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/dataset.png) + +#### Building the classifier + +We can now extend our code - by adding this to our imports first: + +``` +from sklearn import svm +``` + +Subsequently, we can initialize our SVM classifier: + +``` +# Initialize SVM classifier +clf = svm.SVC(kernel='linear') +``` + +After which we can fit our training data to our classifier, which means that the training process starts: + +``` +clf = clf.fit(X_train, y_train) +``` + +#### Full model code so far + +All right, so far, we have generated our dataset _and_ initialized our SVM classifier, with which we are also fitting data already. 
Should you wish to obtain what we have so far in full, here you go: + +``` +# Imports +from sklearn.datasets import make_blobs +from sklearn.model_selection import train_test_split +import numpy as np +import matplotlib.pyplot as plt +from sklearn import svm + +# Configuration options +blobs_random_seed = 42 +centers = [(0,0), (5,5)] +cluster_std = 1 +frac_test_split = 0.33 +num_features_for_samples = 2 +num_samples_total = 1000 + +# Generate data +inputs, targets = make_blobs(n_samples = num_samples_total, centers = centers, n_features = num_features_for_samples, cluster_std = cluster_std) +X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=frac_test_split, random_state=blobs_random_seed) + +# Save and load temporarily +np.save('./data.npy', (X_train, X_test, y_train, y_test)) +X_train, X_test, y_train, y_test = np.load('./data.npy', allow_pickle=True) + +# Generate scatter plot for training data +plt.scatter(X_train[:,0], X_train[:,1]) +plt.title('Linearly separable data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Initialize SVM classifier +clf = svm.SVC(kernel='linear') + +# Fit data +clf = clf.fit(X_train, y_train) +``` + +### Using the SVM to predict new data samples + +Generating new predictions is simple. For example, for generating predictions of our test set, we simply add: + +``` +predictions = clf.predict(X_test) +``` + +After training, it's wise to evaluate a model with the test set to see how well it performs. Today, we'll do so by means of a **confusion matrix**, which shows you the correct and wrong predictions in terms of true positives, true negatives, false positives and false negatives + +Let's show the confusion matrix. + +\[affiliatebox\] + +#### Confusion matrix + +If we add to our imports... + +``` +from sklearn.metrics import plot_confusion_matrix +``` + +...and subsequently after our `fit` call: + +``` +# Predict the test set +predictions = clf.predict(X_test) + +# Generate confusion matrix +matrix = plot_confusion_matrix(clf, X_test, y_test, + cmap=plt.cm.Blues, + normalize='true') +plt.title('Confusion matrix for our classifier') +plt.show(matrix) +plt.show() +``` + +We can generate what is known as a **confusion matrix:** + +![](images/conf_matrix.png) + +It shows the true positives, true negatives, false positives and false negatives for our model given the evaluation dataset. In our case, we have 100% true positives and 100% true negatives, and no wrong predictions. 
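Besides the confusion matrix, you may want a single summary metric. A small, optional sketch - assuming the same `y_test` and `predictions` variables as above - that prints the accuracy and a per-class report with Scikit-learn:

```
from sklearn.metrics import accuracy_score, classification_report

# Summarize the test set performance numerically
print(f'Test accuracy: {accuracy_score(y_test, predictions):.3f}')
print(classification_report(y_test, predictions, target_names=['class 0', 'class 1']))
```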
+ +That's not strange given the linear separability of our dataset - and very unlikely to happen in practice - but the confusion matrix is then still a very useful tool :) + +#### Model code so far + +Should you wish to obtain what we have so far - here you go: + +``` +# Imports +from sklearn.datasets import make_blobs +from sklearn.model_selection import train_test_split +import numpy as np +import matplotlib.pyplot as plt +from sklearn import svm +from sklearn.metrics import plot_confusion_matrix + +# Configuration options +blobs_random_seed = 42 +centers = [(0,0), (5,5)] +cluster_std = 1 +frac_test_split = 0.33 +num_features_for_samples = 2 +num_samples_total = 1000 + +# Generate data +inputs, targets = make_blobs(n_samples = num_samples_total, centers = centers, n_features = num_features_for_samples, cluster_std = cluster_std) +X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=frac_test_split, random_state=blobs_random_seed) + +# Save and load temporarily +# np.save('./data.npy', (X_train, X_test, y_train, y_test)) +X_train, X_test, y_train, y_test = np.load('./data.npy', allow_pickle=True) + +# Generate scatter plot for training data +plt.scatter(X_train[:,0], X_train[:,1]) +plt.title('Linearly separable data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Initialize SVM classifier +clf = svm.SVC(kernel='linear') + +# Fit data +clf = clf.fit(X_train, y_train) + +# Predict the test set +predictions = clf.predict(X_test) + +# Generate confusion matrix +matrix = plot_confusion_matrix(clf, X_test, y_test, + cmap=plt.cm.Blues, + normalize='true') +plt.title('Confusion matrix for our classifier') +plt.show(matrix) +plt.show() +``` + +### Finding the support vectors of your trained SVM + +Now, on to the next topic: **finding the support vectors of your trained model**. + +As we recalled before, the decision boundary is determined by so-called "support vectors" - vectors from each class that are the figurative last man standing between "their own" and "the others", i.e. the other cluster of data. + +We can visualize those support vectors with Scikit-learn and Matplotlib: + +``` +# Get support vectors +support_vectors = clf.support_vectors_ + +# Visualize support vectors +plt.scatter(X_train[:,0], X_train[:,1]) +plt.scatter(support_vectors[:,0], support_vectors[:,1], color='red') +plt.title('Linearly separable data with support vectors') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() +``` + +This produces the following plot: + +[![](images/supportvectors.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/supportvectors.png) + +Indeed, as we intuitively grasped, the linear separability of our dataset ensures that only limited support vectors are necessary to make the separation with highest margin - two, in our case. 
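If you want to check this numerically rather than visually, the fitted `SVC` also exposes how many support vectors it uses. A short sketch, assuming the trained `clf` from above:

```
# Number of support vectors per class, and their indices in X_train
print(clf.n_support_)        # e.g. array([1, 1]) - one support vector per class
print(clf.support_)          # indices of the support vectors in the training data
print(clf.support_vectors_)  # the support vectors themselves (also plotted above)
```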
+ +#### Full code so far + +Here's our code so far: + +``` +# Imports +from sklearn.datasets import make_blobs +from sklearn.model_selection import train_test_split +import numpy as np +import matplotlib.pyplot as plt +from sklearn import svm +from sklearn.metrics import plot_confusion_matrix + +# Configuration options +blobs_random_seed = 42 +centers = [(0,0), (5,5)] +cluster_std = 1 +frac_test_split = 0.33 +num_features_for_samples = 2 +num_samples_total = 1000 + +# Generate data +inputs, targets = make_blobs(n_samples = num_samples_total, centers = centers, n_features = num_features_for_samples, cluster_std = cluster_std) +X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=frac_test_split, random_state=blobs_random_seed) + +# Save and load temporarily +# np.save('./data.npy', (X_train, X_test, y_train, y_test)) +X_train, X_test, y_train, y_test = np.load('./data.npy', allow_pickle=True) + +# Generate scatter plot for training data +plt.scatter(X_train[:,0], X_train[:,1]) +plt.title('Linearly separable data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Initialize SVM classifier +clf = svm.SVC(kernel='linear') + +# Fit data +clf = clf.fit(X_train, y_train) + +# Predict the test set +predictions = clf.predict(X_test) + +# Generate confusion matrix +matrix = plot_confusion_matrix(clf, X_test, y_test, + cmap=plt.cm.Blues, + normalize='true') +plt.title('Confusion matrix for our classifier') +plt.show(matrix) +plt.show() + +# Get support vectors +support_vectors = clf.support_vectors_ + +# Visualize support vectors +plt.scatter(X_train[:,0], X_train[:,1]) +plt.scatter(support_vectors[:,0], support_vectors[:,1], color='red') +plt.title('Linearly separable data with support vectors') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +``` + +### Visualizing the decision boundary + +Sometimes, we don't want to visualize the support vectors, but **the exact decision boundary** for our SVM classifier. + +We can do so with a fantastic package called [Mlxtend](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/), created by dr. Sebastian Raschka, who faced this problem for his classifiers. + +It can be installed in a very simple way: `pip install mlxtend`. 
Then, if we add it to the imports: + +``` +from mlxtend.plotting import plot_decision_regions +``` + +...and subsequently add _two lines of code only_: + +``` +# Plot decision boundary +plot_decision_regions(X_test, y_test, clf=clf, legend=2) +plt.show() +``` + +We get a very nice plot :) + +[![](images/boundary.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/boundary.png) + +Nice :D + +#### Full and final model code + +Now, if you should wish to obtain everything at once - here you go :D + +``` +# Imports +from sklearn.datasets import make_blobs +from sklearn.model_selection import train_test_split +import numpy as np +import matplotlib.pyplot as plt +from sklearn import svm +from sklearn.metrics import plot_confusion_matrix +from mlxtend.plotting import plot_decision_regions + +# Configuration options +blobs_random_seed = 42 +centers = [(0,0), (5,5)] +cluster_std = 1 +frac_test_split = 0.33 +num_features_for_samples = 2 +num_samples_total = 1000 + +# Generate data +inputs, targets = make_blobs(n_samples = num_samples_total, centers = centers, n_features = num_features_for_samples, cluster_std = cluster_std) +X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=frac_test_split, random_state=blobs_random_seed) + +# Save and load temporarily +# np.save('./data.npy', (X_train, X_test, y_train, y_test)) +X_train, X_test, y_train, y_test = np.load('./data.npy', allow_pickle=True) + +# Generate scatter plot for training data +plt.scatter(X_train[:,0], X_train[:,1]) +plt.title('Linearly separable data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Initialize SVM classifier +clf = svm.SVC(kernel='linear') + +# Fit data +clf = clf.fit(X_train, y_train) + +# Predict the test set +predictions = clf.predict(X_test) + +# Generate confusion matrix +matrix = plot_confusion_matrix(clf, X_test, y_test, + cmap=plt.cm.Blues, + normalize='true') +plt.title('Confusion matrix for our classifier') +plt.show(matrix) +plt.show() + +# Get support vectors +support_vectors = clf.support_vectors_ + +# Visualize support vectors +plt.scatter(X_train[:,0], X_train[:,1]) +plt.scatter(support_vectors[:,0], support_vectors[:,1], color='red') +plt.title('Linearly separable data with support vectors') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Plot decision boundary +plot_decision_regions(X_test, y_test, clf=clf, legend=2) +plt.show() +``` + +\[affiliatebox\] + +## Summary + +In today's blog post, we created a binary Support Vector Machine classifier with Python and Scikit-learn. We first looked at classification in general - what is it? How does it work? This was followed by a discussion on Support Vector Machines, and how they construct a decision boundary when training a classifier. + +All the theory was followed by a practical example that was explained step-by-step. Using Python and Scikit-learn, we generated a dataset that is linearly separable and consists of two classes - so, in short, a simple and binary dataset. We then created a SVM with a linear kernel for training a classifier, but not before explaining the function of kernel functions, as to not to skip an important part of SVMs. This was followed by explaining some post-processing as well: generating a confusion matrix, visualizing the support vectors and visualizing the decision boundary of the model. + +I hope you've learnt something from today's blog post! 
:) If you did, I'd really appreciate your comment in the comments section below 💬 Please leave a comment as well if you have any questions, remarks or other comments. Thank you for reading MachineCurve today and happy engineering! 😎 + +\[scikitbox\] + +* * * + +## References + +_Scikit-learn_. (n.d.). scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved May 3, 2020, from [https://scikit-learn.org/stable/index.html](https://scikit-learn.org/stable/index.html) + +Scikit-learn. (n.d.). _1.4. Support vector machines — scikit-learn 0.22.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved May 3, 2020, from [https://scikit-learn.org/stable/modules/svm.html#classification](https://scikit-learn.org/stable/modules/svm.html#classification) + +Scikit-learn. (n.d.). _Sklearn.svm.SVC — scikit-learn 0.22.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved May 3, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) + +Wikipedia. (2005, July 26). _Radial basis function_. Wikipedia, the free encyclopedia. Retrieved May 3, 2020, from [https://en.wikipedia.org/wiki/Radial\_basis\_function](https://en.wikipedia.org/wiki/Radial_basis_function) + +Wikipedia. (2012, November 12). _Polynomial kernel_. Wikipedia, the free encyclopedia. Retrieved May 3, 2020, from [https://en.wikipedia.org/wiki/Polynomial\_kernel](https://en.wikipedia.org/wiki/Polynomial_kernel) + +Raschka, S. (n.d.). _Home - mlxtend_. Site not found · GitHub Pages. [https://rasbt.github.io/mlxtend/](https://rasbt.github.io/mlxtend/) diff --git a/creating-an-mlp-for-regression-with-keras.md b/creating-an-mlp-for-regression-with-keras.md new file mode 100644 index 0000000..0fa87d0 --- /dev/null +++ b/creating-an-mlp-for-regression-with-keras.md @@ -0,0 +1,506 @@ +--- +title: "MLP for regression with TensorFlow 2 and Keras" +date: "2019-07-30" +categories: + - "buffer" + - "frameworks" + - "svms" +tags: + - "keras" + - "mlp" + - "multilayer-perceptron" + - "neural-networks" +--- + +Machine learning is a wide field and machine learning problems come in many flavors. If, say, you wish to group data based on similarities, you would choose an _unsupervised_ approach called _clustering_. If you have a fixed number of classes which you wish to assign new data to, you'll choose a _supervised_ approach named _classification_. If, however, you don't have a fixed number, but wish to estimate a real value - your approach will still be _supervised_, but your ML problem has changed: you'll then focus on _regression_. + +In a previous blog we showed that [Multilayer Perceptrons](https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/) (MLPs) can be used successfully for classification, albeit that state-of-the-art methods may yield better performance for some datasets. + +But MLPs can also be used for a regression problem. And that's exactly what we will demonstrate in today's blog. + +We'll create a MLP for regression for a (relatively simple) regression problem. For this reason, we'll use the Chennai Water Management Dataset, which describes the water levels and daily amounts of rainfall for four water reservoirs near Chennai. It was uploaded during the Chennai Water Crisis of 2019, in which the reservoirs literally dried up. 
Despite our quest for a simple regression problem, the 'business' problem behind the data isn't simple at all. + +After reading this tutorial, you will... + +- See the impact of climate change on India and how ML can be part of a solution. +- Understand the differences between MLPs for classification and for regression. +- Be capable of building an MLP for regression with TensorFlow 2.0 and Keras. + +The code for this blog is also available at [GitHub](https://github.com/christianversloot/keras-mlp-regression). + +Let's go. + +* * * + +**Update 18/Jan/2021:** added example to the top of this tutorial. Ensured that the tutorial is up to date for 2021. Also updated header information. + +**Update 02/Nov/2020:** updated code to TensorFlow 2.x APIs and added full model code block. + +* * * + +\[toc\] + +* * * + +## Example code: Multilayer Perceptron for regression with TensorFlow 2.0 and Keras + +If you want to get started immediately, you can use this **example code for a Multilayer Perceptron**. It was created with **TensorFlow 2.0 and Keras**, and runs on the Chennai Water Management Dataset. The dataset can be downloaded [here](https://www.kaggle.com/sudalairajkumar/chennai-water-management/version/3). If you want to understand the code and the concepts behind it in more detail, make sure to read the rest of the tutorial too! 😎 + +``` +# Load dependencies +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +import numpy as np + +# Load data +dataset = np.loadtxt('./chennai_reservoir_levels.csv', delimiter='|', skiprows=1, usecols=(1,2,3,4)) + +# Shuffle dataset +np.random.shuffle(dataset) + +# Separate features and targets +X = dataset[:, 0:3] +Y = dataset[:, 3] + +# Set the input shape +input_shape = (3,) +print(f'Feature shape: {input_shape}') + +# Create the model +model = Sequential() +model.add(Dense(16, input_shape=input_shape, activation='relu')) +model.add(Dense(8, activation='relu')) +model.add(Dense(1, activation='linear')) + +# Configure the model and start training +model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_squared_error']) +model.fit(X, Y, epochs=250, batch_size=1, verbose=1, validation_split=0.2) +``` + +* * * + +## What you'll need + +If you wish to run the code that you'll create during this tutorial, you do need to have a working setup. What you'll need is: + +- A running Python installation, preferably 3.8+ +- A working installation of Tensorflow: `pip install tensorflow`. +- A working NumPy package: `pip install numpy`. + +Preferably, install these in an environment with Anaconda. See [here](https://towardsdatascience.com/installing-keras-tensorflow-using-anaconda-for-machine-learning-44ab28ff39cb) how you can do that. + +* * * + +## MLPs for classification and regression: the differences + +We created a Multilayer Perceptron for classifying data (MNIST data, to be specific) in [another blog](https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/). As we'll discover in this blog, MLPs can also be applied to regression. However, I must stress that there are a few differences that we must take into account before we proceed. + +Firstly, the final activation function. For classification MLPs, we used the `Softmax` activation function for the multiclass classification problem that we intended to solve. This does not work for regression MLPs. 
While you want to compute the probability that a sample belongs to any of the predetermined classes during classification (i.e., what Softmax does), you want something different during regression. In fact, what you want is to predict a real-valued number, like '24.05'. You therefore cannot use Softmax during regression. You'll simply use the linear activation function for the final layer instead.

(For the same reason, you don't [convert your data](https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/#loading-your-data) with `to_categorical` during regression).

Secondly, the loss function that you'll define is different. For multiclass classification problems, categorical crossentropy was your loss function of preference (Chollet, 2017). Binary crossentropy would be the one for binary classification. However, once again, you're regressing this time - and you cannot use crossentropy, which essentially attempts to compare probability distributions (or, by the analogy from our previous blog, purple elephants) and see how much they are alike. Instead, you'll use the mean absolute error or the mean squared error, or similar loss functions. These simply compute the difference between the prediction and the expected value, and apply an operation (such as taking the absolute value or squaring) that makes the outcome suitable for optimization. We'll cover them in more detail later.

Thirdly, while for Softmax based output layers the number of neurons had to be equal to the number of classes you wish to predict for, in the case of regression you'll simply use 1 output neuron - unless you wish to regress multiple values at the same time, but that's not for now.

Let's first get used to our dataset :)

* * *

## Getting familiar with the data: the Chennai Water Crisis

In this blog, we use the Chennai Water Management Dataset. It is a CC0 Public Domain dataset that is available at [Kaggle](https://www.kaggle.com/sudalairajkumar/chennai-water-management/version/3). It is about the city of Chennai in India and especially its water management. Particularly:

> Chennai also known as Madras is the capital of the Indian state of Tamil Nadu. Located on the Coromandel Coast off the Bay of Bengal, it is the biggest cultural, economic and educational centre of south India.
>
> Being my second home, the city is facing an acute water shortage now (June 2019). Chennai is entirely dependent on ground water resources to meet its water needs. There are four reservoirs in the city, namely, Red Hills, Cholavaram, Poondi and Chembarambakkam, with a combined capacity of 11,057 mcft. These are the major sources of fresh water for the city.
>
> Source: [Sudalai Rajkumar](https://www.kaggle.com/sudalairajkumar/chennai-water-management/version/3), the author of the dataset

It was uploaded with the goal of inspiring people to come up with solutions that will help Chennai face its water shortage.

Can you imagine, a city with 7+ million people without solid access to water? It's extreme.

Although we might not exactly aim for resolving Chennai's water problem today, it's still nice to use this dataset in order to make the problem more known to the world. Water shortage is an increasing problem given climate change, and more and more cities throughout the world will face it in the years to come. Public awareness is the first step then, I'd say!

So let's see if we can get a better idea about the water crisis that Chennai is facing right now.
+ +### Rain and water levels for four reservoirs + +The dataset provides daily rain and water levels for four reservoirs in the vicinity of Chennai: the Poondi Reservoir, the Cholavaram Reservoir, the Red Hills Reservoir and the Chembarambakkam Reservoir. They are some of the primary sources for water in Chennai, because the rivers are polluted with sewage (Wikipedia, 2013). + +The lakes are located here: + +[![](images/image-10.png)](https://machinecurve.com/wp-content/uploads/2019/07/image-10.png) + +The lakes in the Chennai (Madras) area. Source: [Google Maps](https://goo.gl/maps/o5Ynbx6iMRg4KH8h6) + +For each of the four sites, the dataset provides two types of data. Firstly, it provides the daily amount of rain in millimeters (mm): + +[![](images/image-3-1024x414.png)](https://machinecurve.com/wp-content/uploads/2019/07/image-3.png) + +Secondly, it provides the daily water levels in the reservoirs in millions of cubic feet. Every million is about 28.3 million litres, if that makes this chart more intuitive: + +[![](images/image-4-1024x415.png)](https://machinecurve.com/wp-content/uploads/2019/07/image-4.png) + +### The problem: increasing water shortage + +Poondi Reservoir is the most important water reservoir for Chennai (Wikipedia, 2015). Rather unfortunately, if you inspect the water levels for this reservoir and add a trend line, you'll see that they indeed decrease over the years: + +[![](images/image-5-1024x415.png)](https://machinecurve.com/wp-content/uploads/2019/07/image-5.png) + + + +The same can be observed for the other reservoirs: + +[![](images/image-7-1024x415.png)](https://machinecurve.com/wp-content/uploads/2019/07/image-7.png) + +[![](images/image-8-1024x415.png)](https://machinecurve.com/wp-content/uploads/2019/07/image-8.png) + +[![](images/image-9-1024x415.png)](https://machinecurve.com/wp-content/uploads/2019/07/image-9.png) + +Except for 2015, when there were [heavy floods](https://en.wikipedia.org/wiki/2015_South_Indian_floods) due to large amounts of rainfall, the reservoirs have been emptier than in the years before 2012. One of the primary reasons for this is that the monsoons have become less predictable over the last couple of years (NASA, 2019). By consequence, refilling those reservoirs becomes a challenging task, with real trouble starting this year. + +### 2019 Chennai Water Crisis: there's no water left + +This was Puzhal Lake (also known as the Red Hills Lake) on May 31, 2018: + +[![](images/chennai_oli_2018151.jpg)](https://machinecurve.com/wp-content/uploads/2019/07/chennai_oli_2018151.jpg) + +Source: [NASA](https://earthobservatory.nasa.gov/images/145242/water-shortages-in-india) + +This was the situation in June 2019: + +[![](images/chennai_oli_2019170.jpg)](https://machinecurve.com/wp-content/uploads/2019/07/chennai_oli_2019170.jpg) + +Source: [NASA](https://earthobservatory.nasa.gov/images/145242/water-shortages-in-india) + +As you can see, the Red Hills lake dried up entirely. + +That's bad - and it is the perfect example of what is known as the [Chennai Water Crisis of 2019](https://en.wikipedia.org/wiki/2019_Chennai_water_crisis). + +This is also perfectly visible in the data. 
As you can see, the lakes had been filled only marginally after the 2018 monsoon and were empty by June:

[![](images/image-4-1024x415.png)](https://machinecurve.com/wp-content/uploads/2019/07/image-4.png)

Now that we have a feel for the dataset and the real problem that it presents, we can think of certain ways in which machine learning could potentially help the Chennai residents.

In this blog, we specifically tailor this quest towards the MLP we intend to create, but obviously, there's much more imaginable.

The first question that popped into my mind was this one: _what if we can predict the water level at one particular reservoir given the current levels in the other three?_ In that case, we might be able to accurately estimate the water contents in case measurements at some lake are not possible.

Intuitively, that might make sense, because from the charts it indeed seems that the water levels fluctuate up and down together. Obviously, we would need to do correlation analyses if we wish to know for sure, but I'll skip these for the sake of simplicity... we're creating an MLP for regression today, and the dataset is - despite the severity of the problem - the means to an end.

Similarly, much more useful ways of applying ML can be thought of with regard to this problem, such as timeseries based prediction, but we'll keep it easy in order to focus on what we intend to create ... an MLP.

* * *

## Building a Keras based MLP for predicting the water levels

As usual, we'll start by creating a folder, say `keras-mlp-regression`, and we create a model file named `model.py` in it.

We then add our imports:

```
# Load dependencies
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy as np
```

We use the Sequential API and the densely-connected layer type for creating the particular structure of the MLP. We'll use NumPy for importing our data.

That's what we do next, we load our dataset (it is available from [Kaggle](https://www.kaggle.com/sudalairajkumar/chennai-water-management/version/3)):

```
# Load data
dataset = np.loadtxt('./chennai_reservoir_levels.csv', delimiter='|', skiprows=1, usecols=(1,2,3,4))
```

We use NumPy's `loadtxt` function for loading the data from the CSV file. It works nicely with textual data, of which CSV data is a good example. Since the data is delimited by a `|`, we configure that above. Additionally, we skip the first row (which contains the column names) and only use columns 1-4, representing the actual data.

### Feature/target split

We next split the data into feature vectors and targets:

```
# Separate features and targets
X = dataset[:, 0:3]
Y = dataset[:, 3]
```

The assumption that I make here is that the water levels at one reservoir can be predicted from the other three. Specifically, I use the first three (`0:3`, a.k.a. zero to but excluding three) columns in the dataset as predictor variables, while I use the fourth (column `3`) as the predicted variable.

In plain English, this means that I'm trying to predict the water levels at the Chembarambakkam reservoir based on the Red Hills, Poondi and Cholavaram reservoirs.

If you're from the region and say in advance that it's a false assumption - my apologies. Despite some research, I am not entirely sure about the assumption either - and since I'm not from the region, I cannot know for sure.
However, it would still be possible to train an MLP, since the MLP fits the data - and it allows me to show you how to create one. And that's what we'll do next.

We set the input shape as our next step:

```
# Set the input shape
input_shape = (3,)
print(f'Feature shape: {input_shape}')
```

The input shape is a one-dimensional vector of three features this time. The features are the water levels at the Red Hills, Poondi and Cholavaram reservoirs at one particular date, while the Chembarambakkam one is to be predicted.

### Creating the model

Next, we create our MLP:

```
# Create the model
model = Sequential()
model.add(Dense(16, input_shape=input_shape, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='linear'))
```

Similar to the [MLP for classification](https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/), we're using the Keras Sequential API since it makes our life easier given the simplicity of our model.

We then specify three densely-connected layers of neurons: one with 16 outputs, one with 8 outputs and one with 1 output. This way, the neural network will be allowed to 'think' wider first, before converging to the actual prediction.

The input layer is specified by the input shape and therefore contains 3 neurons; one per input feature.

Note that we're using ReLU based activation because it is [one of the standard activation functions](https://machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/#todays-activation-functions) used today. However, note as well that for the final layer we're no longer using `Softmax`, as with the MLP classifier. Instead, we're using the identity function or \[latex\]f(x) = x\[/latex\] for generating the prediction. Using the linear function allows us to generate a real-valued or numeric prediction, which is exactly what we need.

### Hyperparameter configuration and fitting the data

We finally configure the model and start the training process:

```
# Configure the model and start training
model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_squared_error'])
model.fit(X, Y, epochs=10, batch_size=1, verbose=1, validation_split=0.2)
```

Contrary to the MLP based classifier, in which we used categorical crossentropy as our loss function, we do not wish to compare certain classes (or as I called them, elephants).

Instead, we want to generate a real-valued or numeric prediction and see how much it deviates from the actual outcome.

Some loss functions are available for this, which are based on the error \[latex\]\\text{E = prediction - real outcome}\[/latex\] (Grover, 2019). Those include:

- The **mean squared error** (MSE), which computes the squared error (\[latex\]error^2\[/latex\]) for all the predictions made, and subsequently averages them by dividing it by the number of predictions.
- The **mean absolute error** (MAE), which instead of computing the squared error computes the absolute error (\[latex\]|error|\[/latex\]) for all predictions made and subsequently averages them in the same way.

To illustrate how they work, we'll use an example: if there are two errors, e.g. \[latex\]-4\[/latex\] and \[latex\]4\[/latex\], the MSE will produce 16 twice, while the MAE produces 4 twice.
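If you want to verify these numbers yourself, a few lines of NumPy are enough - this small snippet is just an illustration and not part of the model code:

```
import numpy as np

# Two example errors: prediction minus real outcome
errors = np.array([-4, 4])

# MAE averages the absolute errors, MSE averages the squared errors
print('MAE:', np.mean(np.abs(errors)))   # 4.0
print('MSE:', np.mean(errors ** 2))      # 16.0
```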
They both have their [benefits and drawbacks](https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0), but generally, the MAE is used in situations in which outliers can be present (Grover, 2019).

We'll train our MLP with both, adding the other one as a supporting metric in the `metrics` attribute.

Since the Adam optimizer is pretty much the standard optimizer used today, we use it in this example (Chollet, 2017). Adam is an extension of traditional stochastic gradient descent that adds momentum-like behavior and adaptive, per-parameter learning rates. I'll cover the details in another blog later.

We use 10 epochs, a batch size of 1, a validation split of 20% and verbosity mode 1. A batch size of 1 means that the weights are updated after every individual sample: each gradient estimate is noisy, but the model receives a very large number of updates per epoch.

### Full model code

Should you wish to obtain the full model code at once - that's of course possible too. Here you go 😎 Note that this version already includes the dataset shuffle and the 250 training epochs that we will arrive at later in this post.

```
# Load dependencies
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy as np

# Load data
dataset = np.loadtxt('./chennai_reservoir_levels.csv', delimiter='|', skiprows=1, usecols=(1,2,3,4))

# Shuffle dataset
np.random.shuffle(dataset)

# Separate features and targets
X = dataset[:, 0:3]
Y = dataset[:, 3]

# Set the input shape
input_shape = (3,)
print(f'Feature shape: {input_shape}')

# Create the model
model = Sequential()
model.add(Dense(16, input_shape=input_shape, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='linear'))

# Configure the model and start training
model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_squared_error'])
model.fit(X, Y, epochs=250, batch_size=1, verbose=1, validation_split=0.2)
```

* * *

## Validating the model

Next, let's start the training process and see what happens.
These are the results from our first attempt:

```
Epoch 1/10
4517/4517 [==============================] - 14s 3ms/step - loss: 332.6803 - mean_squared_error: 246576.6700 - val_loss: 294.8595 - val_mean_squared_error: 151995.6923
Epoch 2/10
4517/4517 [==============================] - 13s 3ms/step - loss: 276.1181 - mean_squared_error: 126065.0225 - val_loss: 305.3823 - val_mean_squared_error: 160556.6063
Epoch 3/10
4517/4517 [==============================] - 13s 3ms/step - loss: 274.3100 - mean_squared_error: 125171.9773 - val_loss: 322.0316 - val_mean_squared_error: 174732.2345
Epoch 4/10
4517/4517 [==============================] - 14s 3ms/step - loss: 273.0496 - mean_squared_error: 124494.1493 - val_loss: 304.1849 - val_mean_squared_error: 158879.7165
Epoch 5/10
4517/4517 [==============================] - 14s 3ms/step - loss: 273.0190 - mean_squared_error: 124420.8973 - val_loss: 326.6588 - val_mean_squared_error: 179274.0880
Epoch 6/10
4517/4517 [==============================] - 14s 3ms/step - loss: 272.5061 - mean_squared_error: 124192.4299 - val_loss: 305.9678 - val_mean_squared_error: 160826.3846
Epoch 7/10
4517/4517 [==============================] - 15s 3ms/step - loss: 271.1735 - mean_squared_error: 124102.1444 - val_loss: 302.8888 - val_mean_squared_error: 153143.9235
Epoch 8/10
4517/4517 [==============================] - 15s 3ms/step - loss: 270.2527 - mean_squared_error: 123426.2535 - val_loss: 304.5966 - val_mean_squared_error: 154317.4158
Epoch 9/10
4517/4517 [==============================] - 14s 3ms/step - loss: 270.5909 - mean_squared_error: 123033.3367 - val_loss: 316.0911 - val_mean_squared_error: 165068.8407
Epoch 10/10
4517/4517 [==============================] - 14s 3ms/step - loss: 268.9381 - mean_squared_error: 121666.2221 - val_loss: 320.5413 - val_mean_squared_error: 166442.5935
```

Our validation loss (the MAE) seems to be in the range of 290-320. That's relatively bad: on average, we're off by roughly 300 million cubic feet of water.

And that is far more than a single droplet.
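For a second attempt, switching to MSE only requires swapping the loss function and the supporting metric in the `compile` call - something like this (the exact call isn't shown here, so consider it a sketch):

```
# Second attempt (sketch): MSE as the loss, MAE as the supporting metric
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mean_absolute_error'])
model.fit(X, Y, epochs=10, batch_size=1, verbose=1, validation_split=0.2)
```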
Here are the results of that second attempt, with MSE as the loss function:

```
Epoch 1/10
4517/4517 [==============================] - 15s 3ms/step - loss: 255334.5861 - mean_absolute_error: 333.2326 - val_loss: 158943.3863 - val_mean_absolute_error: 304.4497
Epoch 2/10
4517/4517 [==============================] - 13s 3ms/step - loss: 129793.7640 - mean_absolute_error: 286.0301 - val_loss: 160327.8901 - val_mean_absolute_error: 308.0849
Epoch 3/10
4517/4517 [==============================] - 14s 3ms/step - loss: 125248.8358 - mean_absolute_error: 280.8977 - val_loss: 170016.9162 - val_mean_absolute_error: 318.3974
Epoch 4/10
4517/4517 [==============================] - 14s 3ms/step - loss: 124579.2617 - mean_absolute_error: 278.7398 - val_loss: 159538.5700 - val_mean_absolute_error: 310.0963
Epoch 5/10
4517/4517 [==============================] - 14s 3ms/step - loss: 123096.8864 - mean_absolute_error: 277.0384 - val_loss: 166921.0205 - val_mean_absolute_error: 315.9326
Epoch 6/10
4517/4517 [==============================] - 14s 3ms/step - loss: 122259.9060 - mean_absolute_error: 274.9807 - val_loss: 166284.8314 - val_mean_absolute_error: 315.1071
Epoch 7/10
4517/4517 [==============================] - 16s 4ms/step - loss: 121631.5276 - mean_absolute_error: 274.2378 - val_loss: 171566.1304 - val_mean_absolute_error: 323.3036
Epoch 8/10
4517/4517 [==============================] - 17s 4ms/step - loss: 120780.4943 - mean_absolute_error: 272.7180 - val_loss: 157775.8531 - val_mean_absolute_error: 305.2346
Epoch 9/10
4517/4517 [==============================] - 15s 3ms/step - loss: 120394.1161 - mean_absolute_error: 272.3696 - val_loss: 171933.4463 - val_mean_absolute_error: 319.7063
Epoch 10/10
4517/4517 [==============================] - 16s 4ms/step - loss: 119243.6368 - mean_absolute_error: 270.3955 - val_loss: 176639.7063 - val_mean_absolute_error: 322.7455
```

Again, that's quite a bit more than a droplet.

However, what immediately came to mind is what I once read in François Chollet's book Deep Learning with Python: that you should especially be careful with your data splits when you're using timeseries data (Chollet, 2017).

It crossed my mind that we're indeed using timeseries data, albeit not in a timeseries way.

However, precisely that may still be problematic. We split the data into training and validation data - and this is how Keras splits the data:

> The validation data is selected from the last samples in the x and y data provided, before shuffling.
>
> Source: [Keras (n.d.)](https://keras.io/models/sequential/)

Ah, okay. That's like taking the last 20 percent off this graph for validation while training with the rest:

![](images/image-4-1024x415.png)

The point is that most of that last 20% covers the situation with a lack of water, while much of the first 80% covers the situation in which water levels were relatively okay. This way, we train our model on data with very different idiosyncrasies than the validation data:

- The monsoons got less predictable during the years with water shortages. By consequence, so did the water levels. This is a difference from the early years.
- Water management in Chennai could have changed, especially since it is described as one of the major causes for the water crisis (Wikipedia, 2019).
- Perhaps, rainfall has changed due to factors we cannot explain - cycles in the weather that we may not know about.
- Perhaps, the demand for water has increased, reducing the lifecycle time of water in the reservoirs.
- And so on.
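If you want to see this imbalance in numbers, a quick check on the unshuffled, chronologically ordered `X`/`Y` arrays from before already hints at it - a small sketch (the exact values depend on the dataset version you downloaded):

```
# Quick sanity check (sketch): compare Chembarambakkam levels in the first 80%
# vs the last 20% of the unshuffled, chronologically ordered dataset.
split_index = int(0.8 * len(Y))
print(f'Mean level, first 80% of days: {Y[:split_index].mean():.1f} mcft')
print(f'Mean level, last 20% of days: {Y[split_index:].mean():.1f} mcft')
```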
By consequence, we must take time into account as much as we can.

### Taking into account time

And, perhaps strangely, I believe we can do so by randomly shuffling the data.

Our MLP does not take time into account by design (i.e., although the data is a timeseries, our MLP is not a timeseries model. Perhaps naïvely, it attempts to simply predict the level at one lake based on the current levels in the other three).

Yet, it took time into account by consequence, because of how we split our data.

Randomly shuffling the data before training may yield a better balance between training and validation data.

For this, we add two lines between loading the data (`# Load data`) and separating the features and targets (`# Separate features and targets`), as follows:

```
# Load data
dataset = np.loadtxt('./chennai_reservoir_levels.csv', delimiter='|', skiprows=1, usecols=(1,2,3,4))

# Shuffle dataset
np.random.shuffle(dataset)

# Separate features and targets
X = dataset[:, 0:3]
Y = dataset[:, 3]
```

Those are the results when we run the training process again:

```
Epoch 1/10
4517/4517 [==============================] - 16s 3ms/step - loss: 296.1796 - mean_squared_error: 156532.2806 - val_loss: 290.2458 - val_mean_squared_error: 141232.8286
Epoch 2/10
4517/4517 [==============================] - 14s 3ms/step - loss: 282.1418 - mean_squared_error: 133645.8504 - val_loss: 280.9738 - val_mean_squared_error: 134865.3968
Epoch 3/10
4517/4517 [==============================] - 15s 3ms/step - loss: 279.2078 - mean_squared_error: 132291.1732 - val_loss: 281.8184 - val_mean_squared_error: 135522.1895
Epoch 4/10
4517/4517 [==============================] - 15s 3ms/step - loss: 277.4232 - mean_squared_error: 130418.7432 - val_loss: 279.9939 - val_mean_squared_error: 131684.8306
Epoch 5/10
4517/4517 [==============================] - 14s 3ms/step - loss: 275.6177 - mean_squared_error: 130715.3942 - val_loss: 280.5357 - val_mean_squared_error: 130576.4042
Epoch 6/10
4517/4517 [==============================] - 15s 3ms/step - loss: 273.3028 - mean_squared_error: 128172.1251 - val_loss: 272.0446 - val_mean_squared_error: 126942.4550
Epoch 7/10
4517/4517 [==============================] - 16s 4ms/step - loss: 271.7314 - mean_squared_error: 126806.0373 - val_loss: 273.5686 - val_mean_squared_error: 127348.5214
Epoch 8/10
4517/4517 [==============================] - 15s 3ms/step - loss: 270.4174 - mean_squared_error: 125443.8001 - val_loss: 269.9208 - val_mean_squared_error: 125395.7469
Epoch 9/10
4517/4517 [==============================] - 17s 4ms/step - loss: 270.0084 - mean_squared_error: 125520.7887 - val_loss: 274.6282 - val_mean_squared_error: 129173.8515
Epoch 10/10
4517/4517 [==============================] - 17s 4ms/step - loss: 268.4413 - mean_squared_error: 124098.9995 - val_loss: 268.5992 - val_mean_squared_error: 125443.7568
```

They are better indeed - but they aren't good yet.

Training the model for 250 epochs instead of 10 got me to a validation loss of approximately 240 million cubic feet, but that's still too much.

Here's why I think the performance is relatively poor:

- **Unknown factors** interfering with the data. I expect that water levels cannot be predicted by water levels alone and that, given the relatively large distances between the lakes, certain idiosyncratic factors between those sites influence the water levels as well. Primarily, this may be the case because - if I'm not wrong - certain lakes seem to be river-fed as well.
This makes the water levels at those lakes dependent on rain conditions upstream, while this may not be the case for all the lakes. Perhaps, taking this into account may make our model better - e.g. by removing the river-fed lakes (although you may wonder, what will remain?).
    - If I'm wrong with this assumption, please let me know in the comments!
- We didn't take into account **time**. We simply predicted the water level at Chembarambakkam based on the levels in the three other lakes. The movements in water levels over the past few days, perhaps weeks, may be important predictors for the water levels instead. Perhaps, making it a true timeseries model may make it better.
- We didn't take into account **human activity**. The numbers do not say anything about human activity; perhaps, water levels changed due to certain water management activities. If this is the case, it would directly influence the model's predictive power if this pattern does not occur in all the lakes. I read [here](https://en.wikipedia.org/wiki/Poondi_reservoir#2008-2009:_Construction_of_pump_houses) that activities were undertaken in 2008-2009 to reduce the effects of evaporation. This might influence the data.
- Finally, we also did not take into account **weather conditions**. The weather is chaotic and may therefore reduce balance within the data. This is particularly the case because we only have rain data - and no data about, say, sunshine, and by consequence the degree of evaporation. It may be the case that we can improve the performance of the model if we simply add more weather data to it.

And to be frank, one can think of many better approaches to this problem than an MLP - approaches that would make the prediction much more aware of (primarily the temporal) context. For the sake of simplicity, I won't cover them all, but creating timeseries based models with e.g. [CNNs](https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) could be an option.

Nevertheless, we have been successful in creating a Multilayer Perceptron in Keras for regression - contrary to the classification one that we created before.

And despite the major crisis that Chennai is currently facing, that was the goal of our post today.

I do still hope that you are now also a little bit more aware of the challenges that our planet is facing with respect to climate over the years to come. Just goes to show what simply visualizing data for a Keras tutorial can do, doesn't it? 😊🌍

The code for this blog is available at [GitHub](https://github.com/christianversloot/keras-mlp-regression).

Thank you once again for reading my blog. If you have any comments, questions or remarks, or if you have suggestions for improvement, please feel free to leave a comment below 👇 I'll try to review them and respond to them as soon as I can. Particularly, I'm interested in your suggestions for the Chennai Water Management dataset - what can we do with it to make the world a slightly better place? Let creativity loose. Thanks again! 👍

* * *

## References

Chollet, F. (2017). _Deep Learning with Python_. New York, NY: Manning Publications.

Grover, P. (2019, May 24). 5 Regression Loss Functions All Machine Learners Should Know. Retrieved from [https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0](https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0)

NASA. (2019, June 27).
Water Shortages in India. Retrieved from [https://earthobservatory.nasa.gov/images/145242/water-shortages-in-india](https://earthobservatory.nasa.gov/images/145242/water-shortages-in-india) + +Keras. (n.d.). Sequential. Retrieved from [https://keras.io/models/sequential/](https://keras.io/models/sequential/) + +Rajkumar, S. (2019). Chennai Water Management. Retrieved from [https://www.kaggle.com/sudalairajkumar/chennai-water-management/version/3](https://www.kaggle.com/sudalairajkumar/chennai-water-management/version/3) + +Wikipedia. (2013, July 14). Water management in Chennai. Retrieved from [https://en.wikipedia.org/wiki/Water\_management\_in\_Chennai#Primary\_water\_sources](https://en.wikipedia.org/wiki/Water_management_in_Chennai#Primary_water_sources) + +Wikipedia. (2015, May 7). Poondi reservoir. Retrieved from [https://en.wikipedia.org/wiki/Poondi\_reservoir](https://en.wikipedia.org/wiki/Poondi_reservoir) diff --git a/creating-dcgan-with-pytorch.md b/creating-dcgan-with-pytorch.md new file mode 100644 index 0000000..b2cadd8 --- /dev/null +++ b/creating-dcgan-with-pytorch.md @@ -0,0 +1,957 @@ +--- +title: "Creating DCGAN with PyTorch" +date: "2021-07-15" +categories: + - "deep-learning" + - "frameworks" +tags: + - "dcgan" + - "deep-learning" + - "gan" + - "gans" + - "generative-adversarial-networks" + - "generative-models" + - "neural-network" + - "neural-networks" + - "pytorch" +--- + +Generative Adversarial Networks have been able to produce images that are _shockingly_ realistic (think [This Person Does Not Exist](https://www.machinecurve.com/index.php/2019/07/17/this-person-does-not-exist-how-does-it-work/)). For this reason, I have started focusing on GANs recently. After reading about [GAN theory](https://www.machinecurve.com/index.php/generative-adversarial-networks-explanations-examples/), I wanted to create a GAN myself. For this reason, I started with a relatively simple type of GAN called the Deep Convolutional GAN. In this article, you will... + +- **Briefly cover what a DCGAN is, to understand what is happening.** +- **Learn to build a DCGAN with PyTorch.** +- **See what happens when you train it on the MNIST dataset.** + +In other words, you're going to build a model that can learn to output what's on the right when beginning with what's on the left: + +- ![](images/epoch0_batch0.jpg) + +- ![](images/epoch22_batch250.jpg) + + +Ready? Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## What is a Deep Convolutional GAN (DCGAN)? + +A **Generative Adversarial Network** or GAN for short is a combination of two neural networks and can be used for generative Machine Learning. In other words, and plainer English, it can be used to generate data if it has learned what data it must generate. + +As we have seen with [This Person Does Not Exist](https://www.machinecurve.com/index.php/2019/07/17/this-person-does-not-exist-how-does-it-work/), GANs can be used to generate highly realistic pictures of peoples' faces - because that specific GAN has learned to do so. However, GANs can also be used for more serious purposes, such as composing music for movies and for generative medicine, possibly helping us cure disease. + +Now, with respect to the **Deep Convolutional GAN** that we will create today, we'll briefly cover its components. If you want to understand DCGANs in more detail, [refer to this article](https://www.machinecurve.com/index.php/2021/03/24/an-introduction-to-dcgans/). + +A DCGAN is composed of a **Generator** and a **Discriminator**. 
As you can see in the image below, the Generator takes as input a noise sample, which is drawn from a standard normal distribution. It outputs a fake image, which is fed to the Discriminator. The Discriminator is trained on both real images and these generated ones, and judges whether the image it receives is real or fake. Through the losses that result from these judgements, the Discriminator gets better at separating fakes from real images, while the Generator - which is optimized against the Discriminator's verdict - is unknowingly trained to generate better fake images.

Eventually, through training, the Generator learns to map samples from the noise distribution (also called the _latent distribution_) to images that can no longer be distinguished from real ones, beating the Discriminator. In today's article, we're going to create such a system using PyTorch!

Compared to _standard_ GANs (vanilla GANs / original GANs), DCGANs have a set of additional improvements:

1. **A minimum of fully connected layers is used.**
2. **Any pooling is replaced with learnt downsampling and upsampling.**
3. **Batch Normalization is applied.**
4. **ReLU is applied in the Generator.**
5. **Leaky ReLU is applied in the Discriminator.**

Where necessary, you will also apply these in this article :)

![](images/GAN-1024x431.jpg)

A Generative Adversarial Network

* * *

## Building a DCGAN with PyTorch

Let's now actually create a Deep Convolutional GAN with PyTorch, including a lot of code examples and step-by-step explanations! :D

### What you'll need to run the code

If you want to run this code, it is important that you have installed the following dependencies:

- PyTorch, including `torchvision`
- NumPy
- Matplotlib
- Python 3.x, most preferably a recent version.

### Specifying imports and configurable variables

The first step in building a DCGAN is creating a file. Let's call it `dcgan.py`. We start with specifying the imports:

```
import os
import torch
from torch import nn
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms
import numpy as np
import matplotlib.pyplot as plt
import uuid
```

We import `os` because we need some Operating System functions. We also import `torch` and the `nn` library for building a neural network. As we will train our GAN on the `MNIST` dataset, we import it, as well as the `DataLoader` which ensures that the dataset is properly shuffled and batched. The `transforms` import ensures that we can convert the MNIST images into Tensor format, after which we can normalize them (more on that later).

Finally, we import `numpy` for some number processing, `plt` for visualizing the GAN outputs and `uuid` for generating unique identifiers for each training session - so that we can save the trained models.

This is followed by a variety of configurable variables.

```
# Configurable variables
NUM_EPOCHS = 50
NOISE_DIMENSION = 50
BATCH_SIZE = 128
TRAIN_ON_GPU = True
UNIQUE_RUN_ID = str(uuid.uuid4())
PRINT_STATS_AFTER_BATCH = 50
OPTIMIZER_LR = 0.0002
OPTIMIZER_BETAS = (0.5, 0.999)
```

- The **number of epochs** specifies how many iterations over the full training set will be performed.
- The **noise dimension** can be configured to set the number of dimensions of the noise vector that is input to the Generator.
- The **batch size** instructs the `DataLoader` how big batches should be when MNIST samples are loaded.
- If available, we can **train on GPU** - this can be configured.
+- The **unique run ID** represents a unique identifier that describes this training session, and is used when the models and sample images are saved. +- **Print stats after batch** tells us how many mini batches should pass in an epoch before intermediary statistics are printed. +- The **optimizer LR** and **optimizer Betas** give the Learning Rate and Beta values for the `AdamW` optimizer used in our GAN. + +### Training speedups + +PyTorch code can be made to run faster with [some simple tweaks](https://betterprogramming.pub/how-to-make-your-pytorch-code-run-faster-93079f3c1f7b). Some must be applied within the model (e.g. in the `DataLoader`), while others can be applied standalone. Here are some standalone training speedups. + +``` +# Speed ups +torch.autograd.set_detect_anomaly(False) +torch.autograd.profiler.profile(False) +torch.autograd.profiler.emit_nvtx(False) +torch.backends.cudnn.benchmark = True +``` + +### The Generator + +Now that we have prepared, it's time for the real work! Let's start creating our DCGAN Generator model. Recall that the generator takes a small input sample, generated from a standard normal distribution. It uses Transposed Convolutions (upsampling layers that _learn_ the upsampling process rather than performing interpolation) for constructing the output image in a step-by-step fashion. In our case, the generator produces a `28 x 28` pixel image - hopefully resembling an MNIST digit after a while :) + +Below, you'll see the code for the Generator. + +- As the Generator is a separate PyTorch model, it must be a class that extends `nn.Module`. +- In the constructor (`__init__`), we initialize the superclass, set the number of feature maps output by our model, and create our layers. +- The Generator contains five upsampling blocks. In each, `ConvTranspose2d` is used for learned upsampling. Starting with the `NOISE_DIMENSION` (representing the dimensionality of the generated noise), many feature maps (`num_feature_maps * 8`) are generated, whereas the number of feature maps decreases with downstream layers. +- Note a variety of optimizations: + - Characteristic for DCGAN is the use of Batch Normalization (`BatchNorm2d`), the use of `ReLU` in the generator and the use of `Tanh` after the final upsampling block. + - More generally, `bias` is set to `False` in each layer that is followed by a Batch Normalization layer - possibly leading to a model that converges faster. Bias is nullified in a Batch Normalization layer; that's why it makes no sense to use it in the layers directly before BN. +- The `forward` def simply performs a forward pass. 
+ +``` +class Generator(nn.Module): + """ + DCGan Generator + """ + def __init__(self,): + super().__init__() + num_feature_maps = 64 + self.layers = nn.Sequential( + # First upsampling block + nn.ConvTranspose2d(NOISE_DIMENSION, num_feature_maps * 8, 4, 1, 0, bias=False), + nn.BatchNorm2d(num_feature_maps * 8), + nn.ReLU(), + # Second upsampling block + nn.ConvTranspose2d(num_feature_maps * 8, num_feature_maps * 4, 4, 2, 1, bias=False), + nn.BatchNorm2d(num_feature_maps * 4), + nn.ReLU(), + # Third upsampling block + nn.ConvTranspose2d(num_feature_maps * 4, num_feature_maps * 2, 4, 2, 1, bias=False), + nn.BatchNorm2d(num_feature_maps * 2), + nn.ReLU(), + # Fourth upsampling block + nn.ConvTranspose2d(num_feature_maps * 2, num_feature_maps, 4, 2, 1, bias=False), + nn.BatchNorm2d(num_feature_maps), + nn.ReLU(), + # Fifth upsampling block: note Tanh + nn.ConvTranspose2d(num_feature_maps, 1, 1, 1, 2, bias=False), + nn.Tanh() + ) + + def forward(self, x): + """Forward pass""" + return self.layers(x) +``` + +### The Discriminator + +Next up: the Discriminator! + +Recall that while the Generator generates images, the Discriminator serves as a mechanism of quality control - it can ensure that no fake images pass, and by consequence helps the Generator generate fake images that are difficult to distinguish anymore. + +Like the Generator, the Discriminator is also a `nn.Module` based class with a constructor (`__init__`) and a forward pass definition (`forward`). The forward pass def is simple so will not be explained in detail. For the constructor, here's what happens: + +- First of all, the **number of feature maps** is defined. Note that this must be equal to the number of feature maps specified in the Generator. +- It follows the structure of a [Convolutional Neural Network](https://www.machinecurve.com/index.php/2021/07/08/convolutional-neural-networks-with-pytorch/). Using a stack of `Conv2d` layers, feature maps are generated that help detect certain patterns in the input data. The feature maps of the final `Conv2d` layer are eventually Flattened and passed to a `Linear` (or fully-connected) layer, after which the [Sigmoid](https://www.machinecurve.com/index.php/2021/01/21/using-relu-sigmoid-and-tanh-with-pytorch-ignite-and-lightning/) activation function ensures that the output is in the range `[0, 1]`. +- Two-dimensional batch normalization (`BatchNorm2d`) is used to help speed up the training process, as suggested in general and for DCGANs specifically. This is also why, like in the Generator, the `bias` values for the preceding layers are set to `False`. +- Leaky ReLU with an `alpha=0.2` is used instead of regular ReLU. 
+ +``` +class Discriminator(nn.Module): + """ + DCGan Discriminator + """ + def __init__(self): + super().__init__() + num_feature_maps = 64 + self.layers = nn.Sequential( + nn.Conv2d(1, num_feature_maps, 4, 2, 1, bias=False), + nn.BatchNorm2d(num_feature_maps * 1), + nn.LeakyReLU(0.2), + nn.Conv2d(num_feature_maps, num_feature_maps * 2, 4, 2, 1, bias=False), + nn.BatchNorm2d(num_feature_maps * 2), + nn.LeakyReLU(0.2), + nn.Conv2d(num_feature_maps * 2, num_feature_maps * 4, 4, 2, 1, bias=False), + nn.BatchNorm2d(num_feature_maps * 4), + nn.LeakyReLU(0.2), + nn.Conv2d(num_feature_maps * 4, 1, 4, 2, 1, bias=False), + nn.Flatten(), + nn.Linear(1, 1), + nn.Sigmoid() + ) + + def forward(self, x): + """Forward pass""" + return self.layers(x) +``` + +### The DCGAN: a set of definitions + +Now that we have built the Generator and the Discriminator, it's actually time to construct functionality for the DCGAN! :) + +In Python, it's good practice to split as many individual functionalities in separate definitions. This avoids that you'll end up with one large block of code, and ensures that you can re-use certain functions if they must be used in different places. + +Now, because a GAN is quite complex in terms of functionality, we're going to write a lot of definitions that eventually will be merged together: + +- **Getting the PyTorch device:** recall that you used `TRAIN_ON_GPU` for specifying whether you want to train the GAN on your GPU or your CPU. Getting the PyTorch device makes sure that all other functionalities can use that device, should it be available. +- **Making run directory & generating images:** the GAN will be constructed in such a way, that each time when you run it, a folder is created with a `UNIQUE_RUN_ID`. Here, saved models and images generated during the training process will be stored. We'll also construct the def for generating the images there. House keeping, in other words. +- **Functionality for saving models & printing progress:** even more house keeping. Models will be saved every once in a while, so that intermediate versions of your GAN can also be used. Here, we'll create a def for that, as well as a def for printing progress during the training process. +- **Preparing the dataset:** now that we have written defs for house keeping, we can continue with the real work. Next up is a def for preparing the MNIST dataset. +- **Weight initializer function:** DCGAN weights must be initialized in a specific way. With this def, you'll ensure that this is done properly. +- **Initializing models, loss and optimizers:** here, we will create three defs for initializing the models (i.e. Generator and Discriminator), loss function (BCELoss) and optimizers (we will use `AdamW`, which is Adam with weight decay). +- **Generating noise:** recall that Generators take noise from a latent distribution and convert it into an output image. We create a def that can generate noise. +- **Efficiently zero-ing gradients:** in PyTorch, gradients must be zeroed before a new training step can occur. While PyTorch has `zero_grad()` for this, it can be done more efficiently. We create a custom def for this purpose. +- **Forward and backward passes:** the real work! Here, we feed a batch of data through a model, perform backpropagation and return the loss. The def will be set up in a generic way so that it can be used for both Generator and Discriminator. +- **Combining passes into a training step:** each batch of data will be forward and backward propagated. 
This def ensures that the previous def will be called for the batch of data currently being enumerated. +- **Combining training steps into epochs:** recall that one epoch is a forward and backward pass over all the training data. In other words, one epoch combines all training steps for the training data. In this def, we ensure that this happens. +- **Combining epochs into a DCGAN:** the final def that combines everything. It calls previous defs for preparation and eventually starts the training process by iterating according to the number of epochs configured before. + +Let's now take a look at all the definitions and provide example code. + +#### Getting the PyTorch device + +In this def, we'll construct the PyTorch device depending on configuration and availability. We check whether **both** `TRAIN_ON_GPU` and `torch.cuda.is_available()` resolve to `True` to ensure that the `cuda:0` device can be loaded. If not, we use our CPU. + +Note that this def is configured to run the GAN on one GPU only. You'll have to manually add a multi-GPU training strategy if necessary. + +``` +def get_device(): + """ Retrieve device based on settings and availability. """ + return torch.device("cuda:0" if torch.cuda.is_available() and TRAIN_ON_GPU else "cpu") +``` + +#### Making run directory & generating images + +The contents of `make_directory_for_run()`, the definition that makes a directory for the training run, are quite straight-forward. It checks whether a folder called `runs` exists in the current path, and creates it if it isn't available. Then, in `./runs`, a folder for the `UNIQUE_RUN_ID` is created. + +``` +def make_directory_for_run(): + """ Make a directory for this training run. """ + print(f'Preparing training run {UNIQUE_RUN_ID}') + if not os.path.exists('./runs'): + os.mkdir('./runs') + os.mkdir(f'./runs/{UNIQUE_RUN_ID}') +``` + +In `generate_image`, an image with sub plots containing generated examples will be created. As you can see, noise is generated, fed to the generator, and is then added to a Matplotlib plot. It is saved to a folder `images` relative to `/.runs/{UNIQUE_RUN_ID}` which itself is created if it doesn't exist yet. + +``` +def generate_image(generator, epoch = 0, batch = 0, device=get_device()): + """ Generate subplots with generated examples. """ + images = [] + noise = generate_noise(BATCH_SIZE, device=device) + generator.eval() + images = generator(noise) + plt.figure(figsize=(10, 10)) + for i in range(16): + # Get image + image = images[i] + # Convert image back onto CPU and reshape + image = image.cpu().detach().numpy() + image = np.reshape(image, (28, 28)) + # Plot + plt.subplot(4, 4, i+1) + plt.imshow(image, cmap='gray') + plt.axis('off') + if not os.path.exists(f'./runs/{UNIQUE_RUN_ID}/images'): + os.mkdir(f'./runs/{UNIQUE_RUN_ID}/images') + plt.savefig(f'./runs/{UNIQUE_RUN_ID}/images/epoch{epoch}_batch{batch}.jpg') +``` + +#### Functionality for saving models & printing progress + +Saving the models is also really straight-forward. Once again, they are saved relative to `./runs/{UNIQUE_RUN_ID}`, and both the `generator` and `discriminator` are saved. As they are saved after every epoch ends, the `epoch` is passed as well and included in the `*.pth` file. + +``` +def save_models(generator, discriminator, epoch): + """ Save models at specific point in time. 
""" + torch.save(generator.state_dict(), f'./runs/{UNIQUE_RUN_ID}/generator_{epoch}.pth') + torch.save(discriminator.state_dict(), f'./runs/{UNIQUE_RUN_ID}/discriminator_{epoch}.pth') +``` + +Printing training progress during the training steps is done with a specific def called `print_training_progress`. It simply prints the batch number, generator loss and discriminator loss in a standardized way. + +``` +def print_training_progress(batch, generator_loss, discriminator_loss): + """ Print training progress. """ + print('Losses after mini-batch %5d: generator %e, discriminator %e' % + (batch, generator_loss, discriminator_loss)) +``` + +#### Preparing the dataset + +Recall that all previous definitions were preparatory in terms of house keeping, but that you will now create a definition for preparing the dataset. It is as follows: + +``` +def prepare_dataset(): + """ Prepare dataset through DataLoader """ + # Prepare MNIST dataset + dataset = MNIST(os.getcwd(), download=True, train=True, transform=transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize((0.5,), (0.5,)) + ])) + # Batch and shuffle data with DataLoader + trainloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4, pin_memory=True) + # Return dataset through DataLoader + return trainloader +``` + +Here, you can see that the `MNIST` dataset is loaded from the current working directory (`os.getcwd()`). It is downloaded if necessary and the training data is used. In addition, a composition of various `transforms` is used. First, we convert the MNIST images (which are PIL-based images) into Tensor format, so that PyTorch can use them efficiently. Subsequently, the images are Normalized into the range `[-1, 1]`. + +That the `dataset` is now available does not mean that it can already be used. We must apply a `DataLoader` to use the data efficiently; in other words, in batches (hence `BATCH_SIZE`) and in a shuffled fashion. The number of workers is set to 4 and memory is pinned due to the PyTorch efficiencies that we discussed earlier. + +Finally, the `dataloader` is returned. + +#### Weight initializer function + +Recall from the Radford et al. (2015) paper that weights must be initialized in a specific way: + +> All weights were initialized from a zero-centered Normal distribution with standard deviation 0.02. + +Next, we therefore write a definition that ensures this and which can be used later: + +``` +def weights_init(m): + """ Normal weight initialization as suggested for DCGANs """ + classname = m.__class__.__name__ + if classname.find('Conv') != -1: + nn.init.normal_(m.weight.data, 0.0, 0.02) + elif classname.find('BatchNorm') != -1: + nn.init.normal_(m.weight.data, 1.0, 0.02) + nn.init.constant_(m.bias.data, 0) +``` + +#### Initializing models, loss and optimizers + +Now, we create three definitions: + +- In `initialize_models()`, we initialize the Generator and the Discriminator. Here, the `weights_init` def is also applied, and the models are moved to the device that was configured. Both the Generator and Discriminator are then returned. +- Using `initialize_loss`, an instance of Binary cross-entropy loss is returned. BCELoss is used to compare an output between 0 and 1 with a corresponding target variable, which is either 0 or 1. +- With `initialize_optimizers`, we init the optimizers for both the Generator and the Discriminator. Recall that each is an individual neural network and hence requires a separate optimizer. 
We use `AdamW`, which is Adam with decoupled weight decay - it is expected to help the model converge better. The learning rates and optimizer betas are configured in line with the configuration options specified above.

```
def initialize_models(device = get_device()):
    """ Initialize Generator and Discriminator models """
    generator = Generator()
    discriminator = Discriminator()
    # Perform proper weight initialization
    generator.apply(weights_init)
    discriminator.apply(weights_init)
    # Move models to specific device
    generator.to(device)
    discriminator.to(device)
    # Return models
    return generator, discriminator


def initialize_loss():
    """ Initialize loss function. """
    return nn.BCELoss()


def initialize_optimizers(generator, discriminator):
    """ Initialize optimizers for Generator and Discriminator. """
    generator_optimizer = torch.optim.AdamW(generator.parameters(), lr=OPTIMIZER_LR, betas=OPTIMIZER_BETAS)
    discriminator_optimizer = torch.optim.AdamW(discriminator.parameters(), lr=OPTIMIZER_LR, betas=OPTIMIZER_BETAS)
    return generator_optimizer, discriminator_optimizer
```

#### Generating noise

The definition for generating noise is also really straight-forward. Using `torch.randn`, noise for a specific number of images with a specific dimension is generated on a specific device.

```
def generate_noise(number_of_images = 1, noise_dimension = NOISE_DIMENSION, device=None):
    """ Generate noise for number_of_images images, with a specific noise_dimension """
    return torch.randn(number_of_images, noise_dimension, 1, 1, device=device)
```

#### Efficiently zero-ing gradients

In PyTorch, gradients must be zeroed during every training step because otherwise history can interfere with the current training step. PyTorch itself provides `zero_grad()` for this purpose, but it sets gradients to `0.0` - which is numeric rather than `None`. It was found that setting the gradients to `None` can make training faster. Hence, we create a definition for this purpose, which can be used with any `model` and can be re-used multiple times later in this article.

```
def efficient_zero_grad(model):
    """
    Apply zero_grad more efficiently
    Source: https://betterprogramming.pub/how-to-make-your-pytorch-code-run-faster-93079f3c1f7b
    """
    for param in model.parameters():
        param.grad = None
```

#### Forward and backward passes

Recall that training a neural network involves a forward pass, where data is passed through the network returning predictions, and a backward pass, where the error is backpropagated through the network. Once this is done, the network can be optimized. In this definition, we ensure that for any `model` a batch of `data` can be fed forward through the model. Subsequently, using a `loss_function`, loss is computed and backpropagated through the network. The numeric value for loss is returned so that it can be printed with the print def created above.

```
def forward_and_backward(model, data, loss_function, targets):
    """
    Perform forward and backward pass in a generic way. Returns loss value.
    """
    outputs = model(data)
    error = loss_function(outputs, targets)
    error.backward()
    return error.item()
```

#### Combining the passes into a training step

So far, we have created everything that is necessary for constructing functionality for a single training step. Recall that training the GAN involves iterating for a specific number of epochs, and that each epoch is composed of a number of training steps.
Here, you will create the def for the training steps. As you can see, the `generator`, `discriminator`, a batch of `real_data`, as well as the loss function and optimizers can be passed. A specific device can be passed as well, or the configured device will be used.

A training step consists of four phases:

1. **Preparation**. Here, the real and fake labels are set, the real images are loaded onto the device, and a Tensor with the real label is set so that we can train the discriminator with real images.
2. **Training the discriminator**. First, the gradients are zeroed, after which a forward and backward pass is performed with the discriminator and real images. Directly afterwards, a forward and backward pass is performed on an equal number of fake images, for which noise is generated. After these passes, the discriminator is optimized.
3. **Training the generator**. This involves a forward pass on the generated images for the _updated discriminator_, after which the generator is optimized with the resulting loss. Here you can see the interplay between discriminator and generator: the discriminator is first updated based on images generated by the generator (using its current state), after which the generator is trained based on the _updated_ discriminator. In other words, they play the minimax game that is characteristic for a GAN.
4. **Computing the results.** Finally, some results are computed, and loss values for the discriminator and generator are returned.

```
def perform_train_step(generator, discriminator, real_data, \
    loss_function, generator_optimizer, discriminator_optimizer, device = get_device()):
    """ Perform a single training step. """

    # 1. PREPARATION
    # Set real and fake labels.
    real_label, fake_label = 1.0, 0.0
    # Get images on CPU or GPU as configured and available
    # Also set 'actual batch size', which can be smaller than BATCH_SIZE
    # in some cases.
    real_images = real_data[0].to(device)
    actual_batch_size = real_images.size(0)
    label = torch.full((actual_batch_size,1), real_label, device=device)

    # 2. TRAINING THE DISCRIMINATOR
    # Zero the gradients for discriminator
    efficient_zero_grad(discriminator)
    # Forward + backward on real images
    error_real_images = forward_and_backward(discriminator, real_images, \
        loss_function, label)
    # Forward + backward on generated images
    noise = generate_noise(actual_batch_size, device=device)
    generated_images = generator(noise)
    label.fill_(fake_label)
    error_generated_images = forward_and_backward(discriminator, \
        generated_images.detach(), loss_function, label)
    # Optim for discriminator
    discriminator_optimizer.step()

    # 3. TRAINING THE GENERATOR
    # Forward + backward + optim for generator, including zero grad
    efficient_zero_grad(generator)
    label.fill_(real_label)
    error_generator = forward_and_backward(discriminator, generated_images, loss_function, label)
    generator_optimizer.step()

    # 4. COMPUTING RESULTS
    # Compute loss values in floats for discriminator, which is joint loss.
    error_discriminator = error_real_images + error_generated_images
    # Return generator and discriminator loss so that it can be printed.
    return error_generator, error_discriminator
```

#### Combining training steps into epochs

Recall that an epoch consists of multiple training steps. With the `perform_epoch` def, we iterate over the data provided by the `dataloader`. For each batch of `real_data`, we perform the training step by calling the `perform_train_step` def that we just created above.
After each training step is completed, we check whether a certain number of steps has passed; if so, we print the training progress and generate intermediate images.

On epoch completion, the generator and discriminator are saved, and the CUDA memory cache is cleared as far as possible to keep GPU memory usage in check.

```
def perform_epoch(dataloader, generator, discriminator, loss_function, \
    generator_optimizer, discriminator_optimizer, epoch):
    """ Perform a single epoch. """
    for batch_no, real_data in enumerate(dataloader, 0):
        # Perform training step
        generator_loss_val, discriminator_loss_val = perform_train_step(generator, \
            discriminator, real_data, loss_function, \
            generator_optimizer, discriminator_optimizer)
        # Print statistics and generate image after every n-th batch
        if batch_no % PRINT_STATS_AFTER_BATCH == 0:
            print_training_progress(batch_no, generator_loss_val, discriminator_loss_val)
            generate_image(generator, epoch, batch_no)
    # Save models on epoch completion.
    save_models(generator, discriminator, epoch)
    # Clear memory after every epoch
    torch.cuda.empty_cache()
```

#### Combining epochs into a DCGAN

Now that you have completed the preparatory definitions, the training step and the epochs, it's time to combine everything into a definition that allows us to train the GAN.

In the code below, you can see that a directory for the training run is created, the random seed is configured, the dataset is prepared, that the models, loss function and optimizers are initialized, and that the model is finally trained by means of `perform_epoch` (and hence the training steps).

Voila, this composes your DCGAN!

```
def train_dcgan():
    """ Train the DCGAN. """
    # Make directory for unique run
    make_directory_for_run()
    # Set fixed random number seed
    torch.manual_seed(42)
    # Get prepared dataset
    dataloader = prepare_dataset()
    # Initialize models
    generator, discriminator = initialize_models()
    # Initialize loss and optimizers
    loss_function = initialize_loss()
    generator_optimizer, discriminator_optimizer = initialize_optimizers(generator, discriminator)
    # Train the model
    for epoch in range(NUM_EPOCHS):
        print(f'Starting epoch {epoch}...')
        perform_epoch(dataloader, generator, discriminator, loss_function, \
            generator_optimizer, discriminator_optimizer, epoch)
    # Finished :-)
    print(f'Finished unique run {UNIQUE_RUN_ID}')
```

### Initializing and starting GAN training

There is only one thing left now, and that is to instruct Python to call the `train_dcgan()` definition when you run the script:

```
if __name__ == '__main__':
    train_dcgan()
```

### Full DCGAN code example

Of course, it is also possible to copy and use the DCGAN code altogether.
If that's what you want, here you go: + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +import numpy as np +import matplotlib.pyplot as plt +import uuid + + +# Configurable variables +NUM_EPOCHS = 50 +NOISE_DIMENSION = 50 +BATCH_SIZE = 128 +TRAIN_ON_GPU = True +UNIQUE_RUN_ID = str(uuid.uuid4()) +PRINT_STATS_AFTER_BATCH = 50 +OPTIMIZER_LR = 0.0002 +OPTIMIZER_BETAS = (0.5, 0.999) + + +# Speed ups +torch.autograd.set_detect_anomaly(False) +torch.autograd.profiler.profile(False) +torch.autograd.profiler.emit_nvtx(False) +torch.backends.cudnn.benchmark = True + + +class Generator(nn.Module): + """ + DCGan Generator + """ + def __init__(self,): + super().__init__() + num_feature_maps = 64 + self.layers = nn.Sequential( + # First upsampling block + nn.ConvTranspose2d(NOISE_DIMENSION, num_feature_maps * 8, 4, 1, 0, bias=False), + nn.BatchNorm2d(num_feature_maps * 8), + nn.ReLU(), + # Second upsampling block + nn.ConvTranspose2d(num_feature_maps * 8, num_feature_maps * 4, 4, 2, 1, bias=False), + nn.BatchNorm2d(num_feature_maps * 4), + nn.ReLU(), + # Third upsampling block + nn.ConvTranspose2d(num_feature_maps * 4, num_feature_maps * 2, 4, 2, 1, bias=False), + nn.BatchNorm2d(num_feature_maps * 2), + nn.ReLU(), + # Fourth upsampling block + nn.ConvTranspose2d(num_feature_maps * 2, num_feature_maps, 4, 2, 1, bias=False), + nn.BatchNorm2d(num_feature_maps), + nn.ReLU(), + # Fifth upsampling block: note Tanh + nn.ConvTranspose2d(num_feature_maps, 1, 1, 1, 2, bias=False), + nn.Tanh() + ) + + def forward(self, x): + """Forward pass""" + return self.layers(x) + + +class Discriminator(nn.Module): + """ + DCGan Discriminator + """ + def __init__(self): + super().__init__() + num_feature_maps = 64 + self.layers = nn.Sequential( + nn.Conv2d(1, num_feature_maps, 4, 2, 1, bias=False), + nn.BatchNorm2d(num_feature_maps * 1), + nn.LeakyReLU(0.2), + nn.Conv2d(num_feature_maps, num_feature_maps * 2, 4, 2, 1, bias=False), + nn.BatchNorm2d(num_feature_maps * 2), + nn.LeakyReLU(0.2), + nn.Conv2d(num_feature_maps * 2, num_feature_maps * 4, 4, 2, 1, bias=False), + nn.BatchNorm2d(num_feature_maps * 4), + nn.LeakyReLU(0.2), + nn.Conv2d(num_feature_maps * 4, 1, 4, 2, 1, bias=False), + nn.Flatten(), + nn.Linear(1, 1), + nn.Sigmoid() + ) + + def forward(self, x): + """Forward pass""" + return self.layers(x) + + +def get_device(): + """ Retrieve device based on settings and availability. """ + return torch.device("cuda:0" if torch.cuda.is_available() and TRAIN_ON_GPU else "cpu") + + +def make_directory_for_run(): + """ Make a directory for this training run. """ + print(f'Preparing training run {UNIQUE_RUN_ID}') + if not os.path.exists('./runs'): + os.mkdir('./runs') + os.mkdir(f'./runs/{UNIQUE_RUN_ID}') + + +def generate_image(generator, epoch = 0, batch = 0, device=get_device()): + """ Generate subplots with generated examples. 
""" + images = [] + noise = generate_noise(BATCH_SIZE, device=device) + generator.eval() + images = generator(noise) + plt.figure(figsize=(10, 10)) + for i in range(16): + # Get image + image = images[i] + # Convert image back onto CPU and reshape + image = image.cpu().detach().numpy() + image = np.reshape(image, (28, 28)) + # Plot + plt.subplot(4, 4, i+1) + plt.imshow(image, cmap='gray') + plt.axis('off') + if not os.path.exists(f'./runs/{UNIQUE_RUN_ID}/images'): + os.mkdir(f'./runs/{UNIQUE_RUN_ID}/images') + plt.savefig(f'./runs/{UNIQUE_RUN_ID}/images/epoch{epoch}_batch{batch}.jpg') + + +def save_models(generator, discriminator, epoch): + """ Save models at specific point in time. """ + torch.save(generator.state_dict(), f'./runs/{UNIQUE_RUN_ID}/generator_{epoch}.pth') + torch.save(discriminator.state_dict(), f'./runs/{UNIQUE_RUN_ID}/discriminator_{epoch}.pth') + + +def print_training_progress(batch, generator_loss, discriminator_loss): + """ Print training progress. """ + print('Losses after mini-batch %5d: generator %e, discriminator %e' % + (batch, generator_loss, discriminator_loss)) + + +def prepare_dataset(): + """ Prepare dataset through DataLoader """ + # Prepare MNIST dataset + dataset = MNIST(os.getcwd(), download=True, train=True, transform=transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize((0.5,), (0.5,)) + ])) + # Batch and shuffle data with DataLoader + trainloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4, pin_memory=True) + # Return dataset through DataLoader + return trainloader + + +def weights_init(m): + """ Normal weight initialization as suggested for DCGANs """ + classname = m.__class__.__name__ + if classname.find('Conv') != -1: + nn.init.normal_(m.weight.data, 0.0, 0.02) + elif classname.find('BatchNorm') != -1: + nn.init.normal_(m.weight.data, 1.0, 0.02) + nn.init.constant_(m.bias.data, 0) + + +def initialize_models(device = get_device()): + """ Initialize Generator and Discriminator models """ + generator = Generator() + discriminator = Discriminator() + # Perform proper weight initialization + generator.apply(weights_init) + discriminator.apply(weights_init) + # Move models to specific device + generator.to(device) + discriminator.to(device) + # Return models + return generator, discriminator + + +def initialize_loss(): + """ Initialize loss function. """ + return nn.BCELoss() + + +def initialize_optimizers(generator, discriminator): + """ Initialize optimizers for Generator and Discriminator. """ + generator_optimizer = torch.optim.AdamW(generator.parameters(), lr=OPTIMIZER_LR,betas=OPTIMIZER_BETAS) + discriminator_optimizer = torch.optim.AdamW(discriminator.parameters(), lr=OPTIMIZER_LR,betas=OPTIMIZER_BETAS) + return generator_optimizer, discriminator_optimizer + + +def generate_noise(number_of_images = 1, noise_dimension = NOISE_DIMENSION, device=None): + """ Generate noise for number_of_images images, with a specific noise_dimension """ + return torch.randn(number_of_images, noise_dimension, 1, 1, device=device) + + +def efficient_zero_grad(model): + """ + Apply zero_grad more efficiently + Source: https://betterprogramming.pub/how-to-make-your-pytorch-code-run-faster-93079f3c1f7b + """ + for param in model.parameters(): + param.grad = None + + +def forward_and_backward(model, data, loss_function, targets): + """ + Perform forward and backward pass in a generic way. Returns loss value. 
+ """ + outputs = model(data) + error = loss_function(outputs, targets) + error.backward() + return error.item() + + +def perform_train_step(generator, discriminator, real_data, \ + loss_function, generator_optimizer, discriminator_optimizer, device = get_device()): + """ Perform a single training step. """ + + # 1. PREPARATION + # Set real and fake labels. + real_label, fake_label = 1.0, 0.0 + # Get images on CPU or GPU as configured and available + # Also set 'actual batch size', whih can be smaller than BATCH_SIZE + # in some cases. + real_images = real_data[0].to(device) + actual_batch_size = real_images.size(0) + label = torch.full((actual_batch_size,1), real_label, device=device) + + # 2. TRAINING THE DISCRIMINATOR + # Zero the gradients for discriminator + efficient_zero_grad(discriminator) + # Forward + backward on real iamges + error_real_images = forward_and_backward(discriminator, real_images, \ + loss_function, label) + # Forward + backward on generated images + noise = generate_noise(actual_batch_size, device=device) + generated_images = generator(noise) + label.fill_(fake_label) + error_generated_images =forward_and_backward(discriminator, \ + generated_images.detach(), loss_function, label) + # Optim for discriminator + discriminator_optimizer.step() + + # 3. TRAINING THE GENERATOR + # Forward + backward + optim for generator, including zero grad + efficient_zero_grad(generator) + label.fill_(real_label) + error_generator = forward_and_backward(discriminator, generated_images, loss_function, label) + generator_optimizer.step() + + # 4. COMPUTING RESULTS + # Compute loss values in floats for discriminator, which is joint loss. + error_discriminator = error_real_images + error_generated_images + # Return generator and discriminator loss so that it can be printed. + return error_generator, error_discriminator + + +def perform_epoch(dataloader, generator, discriminator, loss_function, \ + generator_optimizer, discriminator_optimizer, epoch): + """ Perform a single epoch. """ + for batch_no, real_data in enumerate(dataloader, 0): + # Perform training step + generator_loss_val, discriminator_loss_val = perform_train_step(generator, \ + discriminator, real_data, loss_function, \ + generator_optimizer, discriminator_optimizer) + # Print statistics and generate image after every n-th batch + if batch_no % PRINT_STATS_AFTER_BATCH == 0: + print_training_progress(batch_no, generator_loss_val, discriminator_loss_val) + generate_image(generator, epoch, batch_no) + # Save models on epoch completion. + save_models(generator, discriminator, epoch) + # Clear memory after every epoch + torch.cuda.empty_cache() + + +def train_dcgan(): + """ Train the DCGAN. """ + # Make directory for unique run + make_directory_for_run() + # Set fixed random number seed + torch.manual_seed(42) + # Get prepared dataset + dataloader = prepare_dataset() + # Initialize models + generator, discriminator = initialize_models() + # Initialize loss and optimizers + loss_function = initialize_loss() + generator_optimizer, discriminator_optimizer = initialize_optimizers(generator, discriminator) + # Train the model + for epoch in range(NUM_EPOCHS): + print(f'Starting epoch {epoch}...') + perform_epoch(dataloader, generator, discriminator, loss_function, \ + generator_optimizer, discriminator_optimizer, epoch) + # Finished :-) + print(f'Finished unique run {UNIQUE_RUN_ID}') + + +if __name__ == '__main__': + train_dcgan() +``` + +* * * + +## Results + +Time to start the training process! 
Ensure that the dependencies listed above are installed in your environment, open up a terminal, and run `python dcgan.py`. You should see that the process has started when messages like these show up on your screen:

```
Preparing training run bbc1b297-fd9d-4a01-abc6-c4d03f18d54f
Starting epoch 0...
Losses after mini-batch     0: generator 1.337156e+00, discriminator 1.734429e+00
Losses after mini-batch    50: generator 3.972991e+00, discriminator 1.365001e-01
Losses after mini-batch   100: generator 4.795033e+00, discriminator 3.830627e-02
Losses after mini-batch   150: generator 5.441184e+00, discriminator 1.489213e-02
Losses after mini-batch   200: generator 5.729664e+00, discriminator 1.159845e-02
Losses after mini-batch   250: generator 5.579849e+00, discriminator 1.056747e-02
Losses after mini-batch   300: generator 5.983423e+00, discriminator 5.716243e-03
Losses after mini-batch   350: generator 6.004053e+00, discriminator 6.531999e-03
Losses after mini-batch   400: generator 2.578202e+00, discriminator 3.643379e-01
Losses after mini-batch   450: generator 4.946642e+00, discriminator 3.067930e-01
Starting epoch 1...
```

Don't worry if you see the model produce _nonsense_ during the first series of batches. Only after about 400 batches in the first epoch did the model start to show that something good was happening :)

- ![](images/epoch0_batch0.jpg)
    
    Untrained model
    
- ![](images/epoch0_batch100.jpg)
    
    Epoch 0, batch 100
    
- ![](images/epoch0_batch200.jpg)
    
    Epoch 0, batch 200
    
- ![](images/epoch0_batch300.jpg)
    
    Epoch 0, batch 300
    
- ![](images/epoch0_batch400.jpg)
    
    Epoch 0, batch 400
    
- ![](images/epoch1_batch0.jpg)
    
    Epoch 1, batch 0
    
- ![](images/epoch1_batch100.jpg)
    
    Epoch 1, batch 100
    
- ![](images/epoch1_batch200.jpg)
    
    Epoch 1, batch 200
    
- ![](images/epoch1_batch300.jpg)
    
    Epoch 1, batch 300
    

After epoch 22, the numbers were already becoming realistic:

- ![](images/epoch22_batch100.jpg)
    
- ![](images/epoch22_batch150.jpg)
    
- ![](images/epoch22_batch200.jpg)
    
- ![](images/epoch22_batch250.jpg)
    

That's it, you just created a DCGAN from scratch! :)

* * *

## Summary

In this article, you have...

- **Learned what a DCGAN is, to understand what is happening.**
- **Learned to build a DCGAN with PyTorch.**
- **Seen what happens when you train it on the MNIST dataset.**

I hope that it was useful for your learning process! Please feel free to leave a comment in the comments section below if you have any questions or other remarks. I'll happily respond and adapt the article when necessary.

Thank you for reading MachineCurve today and happy engineering! 😎

* * *

## Sources

Radford, A., Metz, L., & Chintala, S. (2015). [Unsupervised representation learning with deep convolutional generative adversarial networks.](https://arxiv.org/abs/1511.06434) _arXiv preprint arXiv:1511.06434_

Verma, A. (2021, April 5). _How to make your PyTorch code run faster_. Medium. [https://betterprogramming.pub/how-to-make-your-pytorch-code-run-faster-93079f3c1f7b](https://betterprogramming.pub/how-to-make-your-pytorch-code-run-faster-93079f3c1f7b)

TensorFlow. (n.d.). _Deep Convolutional generative adversarial network_.
[https://www.tensorflow.org/tutorials/generative/dcgan](https://www.tensorflow.org/tutorials/generative/dcgan) diff --git a/creating-dcgan-with-tensorflow-2-and-keras.md b/creating-dcgan-with-tensorflow-2-and-keras.md new file mode 100644 index 0000000..22fe5ac --- /dev/null +++ b/creating-dcgan-with-tensorflow-2-and-keras.md @@ -0,0 +1,724 @@ +--- +title: "Creating DCGAN with TensorFlow 2 and Keras" +date: "2021-07-15" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "gan" + - "generative-adversarial-networks" + - "keras" + - "machine-learning" + - "tensorflow" +--- + +Generative Machine Learning is a really interesting area of research that investigates how Machine Learning (and by consequence, Deep Learning) models can be used for _generative_ purposes. Or in other words, how models can learn to generate data, such as images, music and even works of art. + +While there are various ways to generate data (such as [VAEs](https://www.machinecurve.com/index.php/2019/12/30/how-to-create-a-variational-autoencoder-with-keras/)), [Generative Adversarial Networks](https://www.machinecurve.com/index.php/generative-adversarial-networks-explanations-examples/) are one of them. By allowing a Generator to generate data and a Discriminator to detect these fake images, both can learn to become better, after which the Generator can eventually trick the Discriminator better and better. And precisely that principle is what we will be using in today's article: we're going to create a _Deep Convolutional GAN_, or a GAN that primarily uses Convolutions to generate and discriminate data. + +In this article, you will… + +- **Briefly cover what a DCGAN is, to understand what is happening.** +- **Learn to build a DCGAN with [TensorFlow 2 and Keras](https://www.machinecurve.com/index.php/mastering-keras/).** +- **See what happens when you train it on the MNIST dataset.** + +In other words, you’re going to build a model that can learn to output what’s on the right when beginning with what’s on the left: + +- ![](images/epoch0_batch50.jpg) + +- ![](images/epoch40_batch1750.jpg) + + +* * * + +\[toc\] + +* * * + +## What is a DCGAN? + +Compared to [_standard_ GANs](https://www.machinecurve.com/index.php/2021/03/23/generative-adversarial-networks-a-gentle-introduction/) (vanilla GANs / original GANs), DCGANs have a set of additional improvements: + +1. **A minimum of fully connected layers is used.** +2. **Any pooling is replaced with learnt downsampling and upsampling.** +3. **Batch Normalization is applied.** +4. **ReLU is applied in the Generator.** +5. **Leaky ReLU is applied in the Discriminator.** + +![This image has an empty alt attribute; its file name is GAN-1024x431.jpg](images/GAN-1024x431.jpg) + +The structure of a GAN. + +* * * + +## Building a DCGAN with TensorFlow 2 and Keras - code examples & explanations + +Now that we understand what a DCGAN is, it's time to build one with TensorFlow 2 and Keras. [Click here for the PyTorch equivalent](https://www.machinecurve.com/index.php/2021/07/15/creating-dcgan-with-pytorch/). Note that any GAN is quite complex in terms of the code that has to be written. That's why you'll write quite a large amount of Python defs, which split the code into smaller parts that are combined together. 
Here are the definitions that will be written:

- **Imports**
- **Configuration variables**
- **Initializing loss function, weight init scheme and optimizers**
- **Function for preparing the training run**
- **Function for generating images**
- **Function for loading data**
- **Creating the generator**
- **Function for generating noise**
- **Creating the discriminator**
- **Functions for computing generator and discriminator loss**
- **Functions for saving models & printing training progress**
- **Function for performing training steps**
- **Function that combines training steps into epochs**
- **Combining everything together**

Let's start with the imports.

### Imports

If you want to run this code, you'll need a recent version of TensorFlow 2 - which contains the Keras deep learning library by default. In addition, you'll need Python 3.x and must install Matplotlib and NumPy.

Here are the imports that we'll need for today's article:

```
# Import
import tensorflow
from tensorflow.keras import layers
import matplotlib.pyplot as plt
import uuid
import os
import numpy as np
```

### Configuration variables

You must now initialize a set of variables that will be used throughout the code. They are grouped together here so that you can configure your GAN without having to search throughout your code, possibly forgetting a few options here and there.

- The **number of epochs** specifies the number of full iterations over the training set.
- The **batch size** and **buffer size** instruct our code how the `tf.data.Dataset` that is used for training the GAN should be constructed.
- There is no explicit GPU switch in this script: TensorFlow automatically **trains on GPU** when one is available.
- The **noise dimension** can be configured to set the number of dimensions of the noise vector that is input to the Generator.
- The **unique run ID** represents a unique identifier that describes this training session, and is used when the models and sample images are saved.
- **Print stats after batch** tells us how many mini batches should pass in an epoch before intermediate statistics are printed.
- The **[optimizer](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) LR** and **optimizer Betas** give the Learning Rate and Beta values for the `Adam` optimizer used in our GAN.
- The **weight init standard deviation** represents the standard deviation that will be used in the weight init schema that you will create below.

```
# Initialize variables
NUM_EPOCHS = 50
BUFFER_SIZE = 30000
BATCH_SIZE = 28
NOISE_DIMENSION = 75
UNIQUE_RUN_ID = str(uuid.uuid4())
PRINT_STATS_AFTER_BATCH = 50
OPTIMIZER_LR = 0.0002
OPTIMIZER_BETAS = (0.5, 0.999)
WEIGHT_INIT_STDDEV = 0.02
```

### Initializing loss function, weight init scheme and optimizers

Okay, now, after specifying the configuration options, it's time to do something with them! :)

As a next step, you will define and initialize the **loss function** that will be used for comparing predictions (from the Discriminator) with corresponding targets, a **weight initialization schema** that will be used for initializing the Generator and Discriminator layer kernels, and two **optimizers**, one for the generator and one for the discriminator.

We use [binary crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) directly applied to the [logits](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/).
This loss will be used to compare the outputs of the Discriminator on either the real or generated images (somewhere in the range `[0, 1]` with the true labels (either `0` or `1`)). + +A `RandomNormal` initializer is used in line with the Radford et al. (2015) paper. It is initialized with a `WEIGHT_INIT_STDDEV=0.02`. + +The optimizers for Generator and Discriminator are initializes as an Adam optimizer with a preconfigured `OPTIMIZER_LR` (learning rate) and Beta values. + +``` +# Initialize loss function, init schema and optimizers +cross_entropy_loss = tensorflow.keras.losses.BinaryCrossentropy(from_logits=True) +weight_init = tensorflow.keras.initializers.RandomNormal(stddev=WEIGHT_INIT_STDDEV) +generator_optimizer = tensorflow.keras.optimizers.Adam(OPTIMIZER_LR, \ + beta_1=OPTIMIZER_BETAS[0], beta_2=OPTIMIZER_BETAS[1]) +discriminator_optimizer = tensorflow.keras.optimizers.Adam(OPTIMIZER_LR, \ + beta_1=OPTIMIZER_BETAS[0], beta_2=OPTIMIZER_BETAS[1]) +``` + +### Function for preparing the training run + +After defining loss function, weight init scheme and optimizers, it's time to add another preparatory Python def: that for making a directory for a run. + +You will see that during the training process, intermediate images are generated that display how the model performs after some training step. In addition, both the Generator and Discriminator will be saved after every epoch. To perform some housekeeping, we save them in a specific file. That's why you'll first check whether a directory called `runs` is available relative to the current working directory (and if not create it), followed by the creation of a directory following some unique run ID. This directory will be where the intermediate models and images are saved. + +``` +def make_directory_for_run(): + """ Make a directory for this training run. """ + print(f'Preparing training run {UNIQUE_RUN_ID}') + if not os.path.exists('./runs'): + os.mkdir('./runs') + os.mkdir(f'./runs/{UNIQUE_RUN_ID}') +``` + +### Function for generating images + +Above, you read that the model will generate images during the training process. These images look as follows: + +![](images/epoch40_batch1850.jpg) + +Although the actual _creation_ of images will be added later, you will now add a function that can be used _for creating images_. In other words, it will be created now, but used later. The code below will create a Matplotlib based image containing generated images from noise. An example is displayed above. + +``` +def generate_image(generator, epoch = 0, batch = 0): + """ Generate subplots with generated examples. """ + images = [] + noise = generate_noise(BATCH_SIZE) + images = generator(noise, training=False) + plt.figure(figsize=(10, 10)) + for i in range(16): + # Get image and reshape + image = images[i] + image = np.reshape(image, (28, 28)) + # Plot + plt.subplot(4, 4, i+1) + plt.imshow(image, cmap='gray') + plt.axis('off') + if not os.path.exists(f'./runs/{UNIQUE_RUN_ID}/images'): + os.mkdir(f'./runs/{UNIQUE_RUN_ID}/images') + plt.savefig(f'./runs/{UNIQUE_RUN_ID}/images/epoch{epoch}_batch{batch}.jpg') +``` + +### Function for loading data + +In addition to _creating images_, the DCGAN will have access to a set of _real images_ that are used by the Discriminator. The `load_data` def that you will write now ensures that samples from the MNIST dataset are imported, reshaped, and normalized to the `[-1, 1]` range. Subsequently, it's converted into a `tensorflow.data.Dataset`, shuffled and batched properly according to the buffer and batch size. 
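
If you want to convince yourself that the pipeline behaves as described, a quick, hypothetical sanity check (not part of the tutorial script) could look like the snippet below once the `load_data` definition further down has been added. It prints the batch shape and the pixel value range, which should be roughly `[-1, 1]`:

```
# Hypothetical check, assuming load_data() from below is already defined:
dataset = load_data()
for batch in dataset.take(1):
    print(batch.shape)                          # (BATCH_SIZE, 28, 28, 1)
    print(float(tensorflow.reduce_min(batch)),
          float(tensorflow.reduce_max(batch)))  # approximately -1.0 and 1.0
```

Here is the `load_data` definition itself: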
+ +``` +def load_data(): + """ Load data """ + (images, _), (_, _) = tensorflow.keras.datasets.mnist.load_data() + images = images.reshape(images.shape[0], 28, 28, 1) + images = images.astype('float32') + images = (images - 127.5) / 127.5 + return tensorflow.data.Dataset.from_tensor_slices(images).shuffle(BUFFER_SIZE).batch(BATCH_SIZE) +``` + +### Creating the generator + +Time for the real work, creating the Generator! You will add a variety of layers to a `tensorflow.keras.Sequential` model. First of all, you will add a `Dense` layer that has quite a few outputs, does not use bias (because any `BatchNormalization` will nullify the bias value of the previous layer) and uses the `NOISE_DIMENSION` as input shape. These are followed by Batch Normalization and Leaky ReLU. + +Following the first block, a few upsampling blocks are added which use `Conv2DTranspose` layers (transposed convolutions) for learned upsampling, as well as batch normalization and Leaky ReLU. Also note the `kernel_initializer`, which utilizes the weight init schema specified above. Finally, the `generator` is returned. + +``` +def create_generator(): + """ Create Generator """ + generator = tensorflow.keras.Sequential() + # Input block + generator.add(layers.Dense(7*7*128, use_bias=False, input_shape=(NOISE_DIMENSION,), \ + kernel_initializer=weight_init)) + generator.add(layers.BatchNormalization()) + generator.add(layers.LeakyReLU()) + # Reshape 1D Tensor into 3D + generator.add(layers.Reshape((7, 7, 128))) + # First upsampling block + generator.add(layers.Conv2DTranspose(56, (5, 5), strides=(1, 1), padding='same', use_bias=False, \ + kernel_initializer=weight_init)) + generator.add(layers.BatchNormalization()) + generator.add(layers.LeakyReLU()) + # Second upsampling block + generator.add(layers.Conv2DTranspose(28, (5, 5), strides=(2, 2), padding='same', use_bias=False, \ + kernel_initializer=weight_init)) + generator.add(layers.BatchNormalization()) + generator.add(layers.LeakyReLU()) + # Third upsampling block: note tanh, specific for DCGAN + generator.add(layers.Conv2DTranspose(1, (5, 5), strides=(2, 2), padding='same', use_bias=False, activation='tanh', \ + kernel_initializer=weight_init)) + # Return generator + return generator +``` + +### Function for generating noise + +As you could see in the `create_generator()` def written above, the `input_shape` for the first layer is a `NOISE_DIMENSION`. Recall that the Generator is fed a noise sample from a latent (eventually learned) distribution that is converted into an output image that should preferably resemble the 'real images' fed to the Discriminator. If noise is fed, it must be generated. You'll therefore use `tensorflow.random.normal` to generate noise for a `number_of_images`, with a specific `noise_dimension`. + +``` +def generate_noise(number_of_images = 1, noise_dimension = NOISE_DIMENSION): + """ Generate noise for number_of_images images, with a specific noise_dimension """ + return tensorflow.random.normal([number_of_images, noise_dimension]) +``` + +### Creating the discriminator + +Now, the Generator is complete, and we can continue with the Discriminator. Below, you'll write a def for it. It is also a `tensorflow.keras.Sequential` model, which has a 28\*28\*1 image as its input (a one-dimensional grayscale 28x28 pixel MNIST image or fake image). The input is downsampled with Convolutional layers (Conv2D) and fed through Leaky ReLU and Dropout, and all layers are initialized using the weight initialization scheme. 
The final layer outputs a value between 0 and 1, implicating the 'real-ness' of the image. + +After creation, the discriminator is returned. + +``` +def create_discriminator(): + """ Create Discriminator """ + discriminator = tensorflow.keras.Sequential() + # First Convolutional block + discriminator.add(layers.Conv2D(28, (5, 5), strides=(2, 2), padding='same', + input_shape=[28, 28, 1], kernel_initializer=weight_init)) + discriminator.add(layers.LeakyReLU()) + discriminator.add(layers.Dropout(0.5)) + # Second Convolutional block + discriminator.add(layers.Conv2D(64, (5, 5), strides=(2, 2), padding='same', kernel_initializer=weight_init)) + discriminator.add(layers.LeakyReLU()) + discriminator.add(layers.Dropout(0.5)) + # Flatten and generate output prediction + discriminator.add(layers.Flatten()) + discriminator.add(layers.Dense(1, kernel_initializer=weight_init, activation='sigmoid')) + # Return discriminator + return discriminator +``` + +### Functions for computing generator and discriminator loss + +We're getting a bit ahead of ourselves, but realize that training the GAN will follow this schema in a few definitions below: + +1. A batch of real data is fed to the Discriminator. +2. A batch of generated data is fed to the Discriminator. +3. How poor the Generator performs (i.e., its loss) is measured by looking at how well the Discriminator can identify fake samples. +4. How poor the Discriminator performs (i.e., its loss) is measured by looking at the classification error for both real and fake samples. + +When they are subsequently optimized, the Generator will attempt to fool the Discriminator better, while the Discriminator will attempt to be better in catching the Generator while also improving on the real data. + +This must be reflected in how the loss is computed. In the two definitions below, you'll see that... + +1. For Generator loss, the predicted fakes are compared with a Tensor filled with _ones_. In other words, any fakes that are classified incorrectly will increase loss, and _exponentially_ if the difference is high. +2. For Discriminator loss, the predicted reals are compared with a _ones_ Tensor, and the fakes with a _zeros_ Tensor. They are then combined. + +``` +def compute_generator_loss(predicted_fake): + """ Compute cross entropy loss for the generator """ + return cross_entropy_loss(tensorflow.ones_like(predicted_fake), predicted_fake) + + +def compute_discriminator_loss(predicted_real, predicted_fake): + """ Compute discriminator loss """ + loss_on_reals = cross_entropy_loss(tensorflow.ones_like(predicted_real), predicted_real) + loss_on_fakes = cross_entropy_loss(tensorflow.zeros_like(predicted_fake), predicted_fake) + return loss_on_reals + loss_on_fakes +``` + +### Functions for saving models & printing training progress + +Functions for saving the models and printing the training progress are now added. Saving the models does nothing more than saving the `generator` and `discriminator` into the folder created for this run. Printing the training process simply prints the batch number and loss values in a standardized way. + +``` +def save_models(generator, discriminator, epoch): + """ Save models at specific point in time. 
""" + tensorflow.keras.models.save_model( + generator, + f'./runs/{UNIQUE_RUN_ID}/generator_{epoch}.model', + overwrite=True, + include_optimizer=True, + save_format=None, + signatures=None, + options=None + ) + tensorflow.keras.models.save_model( + discriminator, + f'./runs/{UNIQUE_RUN_ID}/discriminator{epoch}.model', + overwrite=True, + include_optimizer=True, + save_format=None, + signatures=None, + options=None + ) + + +def print_training_progress(batch, generator_loss, discriminator_loss): + """ Print training progress. """ + print('Losses after mini-batch %5d: generator %e, discriminator %e' % + (batch, generator_loss, discriminator_loss)) +``` + +### Function for performing training steps + +All right, time for the real work! Now that we have created the Generator, the Discriminator and all support definitions, we can begin with the training loop. Recall that the training process involves feeding forward batches of data through Generator and Discriminator. Recall as well that an epoch contains all the batches of data that jointly represent the training dataset, and that the whole process involves a number of epochs. + +In other words, you'll now create a function that performs a training step (a full forward pass, backward pass and optimization for a batch of data). Below, you'll use this function in the epochs, and eventually in the whole GAN. + +In each training step, for the `BATCH_SIZE`, noise is generated. Using the [TensorFlow gradient tape](https://www.tensorflow.org/guide/advanced_autodiff) we can construct the actual training step without having to rely on high-level abstractions such as `model.fit(...)`. You'll see that a tape is created for both the discriminator and the generator. Using them, we feed the noise to the Generator, indicating that training is happening, and receiving a set of images in return. Both the generated and real images are then passed to the discriminator separately, once again indicating that training is taking place, after which loss for the Generator and Discriminator is computed. + +Once loss is known, backpropagation (the backward pass) can be used for computing the gradients for both models, after which they are combined with the existing variables and applied to the model. Voila, one training step is complete! For administration purposes, we return both Generator and Discriminator loss. 

```
@tensorflow.function
def perform_train_step(real_images, generator, discriminator):
    """ Perform one training step with Gradient Tapes """
    # Generate noise
    noise = generate_noise(BATCH_SIZE)
    # Feed forward and loss computation for one batch
    with tensorflow.GradientTape() as discriminator_tape, \
            tensorflow.GradientTape() as generator_tape:
        # Generate images
        generated_images = generator(noise, training=True)
        # Discriminate generated and real images
        discriminated_generated_images = discriminator(generated_images, training=True)
        discriminated_real_images = discriminator(real_images, training=True)
        # Compute loss
        generator_loss = compute_generator_loss(discriminated_generated_images)
        discriminator_loss = compute_discriminator_loss(discriminated_real_images, discriminated_generated_images)
    # Compute gradients
    generator_gradients = generator_tape.gradient(generator_loss, generator.trainable_variables)
    discriminator_gradients = discriminator_tape.gradient(discriminator_loss, discriminator.trainable_variables)
    # Optimize model using gradients
    generator_optimizer.apply_gradients(zip(generator_gradients, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(discriminator_gradients, discriminator.trainable_variables))
    # Return generator and discriminator losses
    return (generator_loss, discriminator_loss)
```

### Function that combines training steps into epochs

Above, we defined what should happen _within_ a training step. Recall again that an epoch contains multiple training steps; as many as the data set allows given the batch size. You will therefore now create a `train_gan` def. It iterates over the configured number of epochs, as well as over the batches _within_ an epoch. For each batch, it calls `perform_train_step`, actually performing the training step.

If necessary (after every `PRINT_STATS_AFTER_BATCH`th batch), it prints statistics (current progress) and generates the images we discussed above.

After every epoch, the Generator and Discriminator are saved to disk.

This comprises the whole training process of the GAN!

> **Important!** If you get the error message `Attribute error: 'BatchDataset' object has no attribute '__len__'` for the following code when running the script, this likely means that you are running an older version of TensorFlow. If you change `num_batches = image_data.__len__()` into `num_batches = image_data._batch_size`, it will work.

```
def train_gan(num_epochs, image_data, generator, discriminator):
    """ Train the GAN """
    # Perform one training step per batch for every epoch
    for epoch_no in range(num_epochs):
        num_batches = image_data.__len__()
        print(f'Starting epoch {epoch_no+1} with {num_batches} batches...')
        batch_no = 0
        # Iterate over batches within epoch
        for batch in image_data:
            generator_loss, discriminator_loss = perform_train_step(batch, generator, discriminator)
            batch_no += 1
            # Print statistics and generate image after every n-th batch
            if batch_no % PRINT_STATS_AFTER_BATCH == 0:
                print_training_progress(batch_no, generator_loss, discriminator_loss)
                generate_image(generator, epoch_no, batch_no)
        # Save models on epoch completion.
+ save_models(generator, discriminator, epoch_no) + # Finished :-) + print(f'Finished unique run {UNIQUE_RUN_ID}') +``` + +### Combining everything together + +The only thing left is combining everything (preparations, model initialization, and model training) into a definition: + +``` +def run_gan(): + """ Initialization and training """ + # Make run directory + make_directory_for_run() + # Set random seed + tensorflow.random.set_seed(42) + # Get image data + data = load_data() + # Create generator and discriminator + generator = create_generator() + discriminator = create_discriminator() + # Train the GAN + print('Training GAN ...') + train_gan(NUM_EPOCHS, data, generator, discriminator) +``` + +...after which we can call the `def` when we run the Python script: + +``` + +if __name__ == '__main__': + run_gan() +``` + +That's it! You just created a DCGAN with TensorFlow 2 and Keras! :D + +### Full code example + +Should you wish to use the code example without walking through this article step-by-step, you can also use this entire code example: + +``` +# Import +import tensorflow +from tensorflow.keras import layers +import matplotlib.pyplot as plt +import uuid +import os +import numpy as np + +# Initialize variables +NUM_EPOCHS = 50 +BUFFER_SIZE = 30000 +BATCH_SIZE = 28 +NOISE_DIMENSION = 75 +UNIQUE_RUN_ID = str(uuid.uuid4()) +PRINT_STATS_AFTER_BATCH = 50 +OPTIMIZER_LR = 0.0002 +OPTIMIZER_BETAS = (0.5, 0.999) +WEIGHT_INIT_STDDEV = 0.02 + +# Initialize loss function, init schema and optimizers +cross_entropy_loss = tensorflow.keras.losses.BinaryCrossentropy(from_logits=True) +weight_init = tensorflow.keras.initializers.RandomNormal(stddev=WEIGHT_INIT_STDDEV) +generator_optimizer = tensorflow.keras.optimizers.Adam(OPTIMIZER_LR, \ + beta_1=OPTIMIZER_BETAS[0], beta_2=OPTIMIZER_BETAS[1]) +discriminator_optimizer = tensorflow.keras.optimizers.Adam(OPTIMIZER_LR, \ + beta_1=OPTIMIZER_BETAS[0], beta_2=OPTIMIZER_BETAS[1]) + + +def make_directory_for_run(): + """ Make a directory for this training run. """ + print(f'Preparing training run {UNIQUE_RUN_ID}') + if not os.path.exists('./runs'): + os.mkdir('./runs') + os.mkdir(f'./runs/{UNIQUE_RUN_ID}') + + +def generate_image(generator, epoch = 0, batch = 0): + """ Generate subplots with generated examples. 
""" + images = [] + noise = generate_noise(BATCH_SIZE) + images = generator(noise, training=False) + plt.figure(figsize=(10, 10)) + for i in range(16): + # Get image and reshape + image = images[i] + image = np.reshape(image, (28, 28)) + # Plot + plt.subplot(4, 4, i+1) + plt.imshow(image, cmap='gray') + plt.axis('off') + if not os.path.exists(f'./runs/{UNIQUE_RUN_ID}/images'): + os.mkdir(f'./runs/{UNIQUE_RUN_ID}/images') + plt.savefig(f'./runs/{UNIQUE_RUN_ID}/images/epoch{epoch}_batch{batch}.jpg') + + +def load_data(): + """ Load data """ + (images, _), (_, _) = tensorflow.keras.datasets.mnist.load_data() + images = images.reshape(images.shape[0], 28, 28, 1) + images = images.astype('float32') + images = (images - 127.5) / 127.5 + return tensorflow.data.Dataset.from_tensor_slices(images).shuffle(BUFFER_SIZE).batch(BATCH_SIZE) + + +def create_generator(): + """ Create Generator """ + generator = tensorflow.keras.Sequential() + # Input block + generator.add(layers.Dense(7*7*128, use_bias=False, input_shape=(NOISE_DIMENSION,), \ + kernel_initializer=weight_init)) + generator.add(layers.BatchNormalization()) + generator.add(layers.LeakyReLU()) + # Reshape 1D Tensor into 3D + generator.add(layers.Reshape((7, 7, 128))) + # First upsampling block + generator.add(layers.Conv2DTranspose(56, (5, 5), strides=(1, 1), padding='same', use_bias=False, \ + kernel_initializer=weight_init)) + generator.add(layers.BatchNormalization()) + generator.add(layers.LeakyReLU()) + # Second upsampling block + generator.add(layers.Conv2DTranspose(28, (5, 5), strides=(2, 2), padding='same', use_bias=False, \ + kernel_initializer=weight_init)) + generator.add(layers.BatchNormalization()) + generator.add(layers.LeakyReLU()) + # Third upsampling block: note tanh, specific for DCGAN + generator.add(layers.Conv2DTranspose(1, (5, 5), strides=(2, 2), padding='same', use_bias=False, activation='tanh', \ + kernel_initializer=weight_init)) + # Return generator + return generator + + +def generate_noise(number_of_images = 1, noise_dimension = NOISE_DIMENSION): + """ Generate noise for number_of_images images, with a specific noise_dimension """ + return tensorflow.random.normal([number_of_images, noise_dimension]) + + +def create_discriminator(): + """ Create Discriminator """ + discriminator = tensorflow.keras.Sequential() + # First Convolutional block + discriminator.add(layers.Conv2D(28, (5, 5), strides=(2, 2), padding='same', + input_shape=[28, 28, 1], kernel_initializer=weight_init)) + discriminator.add(layers.LeakyReLU()) + discriminator.add(layers.Dropout(0.5)) + # Second Convolutional block + discriminator.add(layers.Conv2D(64, (5, 5), strides=(2, 2), padding='same', kernel_initializer=weight_init)) + discriminator.add(layers.LeakyReLU()) + discriminator.add(layers.Dropout(0.5)) + # Flatten and generate output prediction + discriminator.add(layers.Flatten()) + discriminator.add(layers.Dense(1, kernel_initializer=weight_init, activation='sigmoid')) + # Return discriminator + return discriminator + + +def compute_generator_loss(predicted_fake): + """ Compute cross entropy loss for the generator """ + return cross_entropy_loss(tensorflow.ones_like(predicted_fake), predicted_fake) + + +def compute_discriminator_loss(predicted_real, predicted_fake): + """ Compute discriminator loss """ + loss_on_reals = cross_entropy_loss(tensorflow.ones_like(predicted_real), predicted_real) + loss_on_fakes = cross_entropy_loss(tensorflow.zeros_like(predicted_fake), predicted_fake) + return loss_on_reals + loss_on_fakes + + +def 
save_models(generator, discriminator, epoch): + """ Save models at specific point in time. """ + tensorflow.keras.models.save_model( + generator, + f'./runs/{UNIQUE_RUN_ID}/generator_{epoch}.model', + overwrite=True, + include_optimizer=True, + save_format=None, + signatures=None, + options=None + ) + tensorflow.keras.models.save_model( + discriminator, + f'./runs/{UNIQUE_RUN_ID}/discriminator{epoch}.model', + overwrite=True, + include_optimizer=True, + save_format=None, + signatures=None, + options=None + ) + + +def print_training_progress(batch, generator_loss, discriminator_loss): + """ Print training progress. """ + print('Losses after mini-batch %5d: generator %e, discriminator %e' % + (batch, generator_loss, discriminator_loss)) + + +@tensorflow.function +def perform_train_step(real_images, generator, discriminator): + """ Perform one training step with Gradient Tapes """ + # Generate noise + noise = generate_noise(BATCH_SIZE) + # Feed forward and loss computation for one batch + with tensorflow.GradientTape() as discriminator_tape, \ + tensorflow.GradientTape() as generator_tape: + # Generate images + generated_images = generator(noise, training=True) + # Discriminate generated and real images + discriminated_generated_images = discriminator(generated_images, training=True) + discriminated_real_images = discriminator(real_images, training=True) + # Compute loss + generator_loss = compute_generator_loss(discriminated_generated_images) + discriminator_loss = compute_discriminator_loss(discriminated_real_images, discriminated_generated_images) + # Compute gradients + generator_gradients = generator_tape.gradient(generator_loss, generator.trainable_variables) + discriminator_gradients = discriminator_tape.gradient(discriminator_loss, discriminator.trainable_variables) + # Optimize model using gradients + generator_optimizer.apply_gradients(zip(generator_gradients, generator.trainable_variables)) + discriminator_optimizer.apply_gradients(zip(discriminator_gradients, discriminator.trainable_variables)) + # Return generator and discriminator losses + return (generator_loss, discriminator_loss) + + +def train_gan(num_epochs, image_data, generator, discriminator): + """ Train the GAN """ + # Perform one training step per batch for every epoch + for epoch_no in range(num_epochs): + num_batches = image_data.__len__() + print(f'Starting epoch {epoch_no+1} with {num_batches} batches...') + batch_no = 0 + # Iterate over batches within epoch + for batch in image_data: + generator_loss, discriminator_loss = perform_train_step(batch, generator, discriminator) + batch_no += 1 + # Print statistics and generate image after every n-th batch + if batch_no % PRINT_STATS_AFTER_BATCH == 0: + print_training_progress(batch_no, generator_loss, discriminator_loss) + generate_image(generator, epoch_no, batch_no) + # Save models on epoch completion. 
+ save_models(generator, discriminator, epoch_no) + # Finished :-) + print(f'Finished unique run {UNIQUE_RUN_ID}') + + +def run_gan(): + """ Initialization and training """ + # Make run directory + make_directory_for_run() + # Set random seed + tensorflow.random.set_seed(42) + # Get image data + data = load_data() + # Create generator and discriminator + generator = create_generator() + discriminator = create_discriminator() + # Train the GAN + print('Training GAN ...') + train_gan(NUM_EPOCHS, data, generator, discriminator) + + +if __name__ == '__main__': + run_gan() +``` + +* * * + +## Results + +Now, you can open a terminal where all dependencies are installed (e.g. a Conda environment), and run your script, say `python dcgan.py`. When you'll see the following (possibly with some TensorFlow logs in between), you are successfully training your GAN: + +``` +Training GAN ... +Starting epoch 1 with 2143 batches... +Losses after mini-batch 50: generator 6.096838e-01, discriminator 1.260103e+00 +Losses after mini-batch 100: generator 6.978830e-01, discriminator 1.074400e+00 +Losses after mini-batch 150: generator 6.363150e-01, discriminator 1.181754e+00 +Losses after mini-batch 200: generator 8.537785e-01, discriminator 1.195267e+00 +Losses after mini-batch 250: generator 8.990633e-01, discriminator 1.261971e+00 +Losses after mini-batch 300: generator 7.339471e-01, discriminator 1.260589e+00 +Losses after mini-batch 350: generator 7.893692e-01, discriminator 1.238701e+00 +.... +``` + +It will take some time before actual outputs are generated. After the first few hundred batches, these were the results: + +- ![](images/epoch0_batch50.jpg) + +- ![](images/epoch0_batch200-1.jpg) + +- ![](images/epoch0_batch350.jpg) + +- ![](images/epoch0_batch500.jpg) + +- ![](images/epoch0_batch650.jpg) + + +However, outputs will become more and more accurate after some time - for example, after the 40th epoch: + +- ![](images/epoch40_batch1550.jpg) + +- ![](images/epoch40_batch1650.jpg) + +- ![](images/epoch40_batch1750.jpg) + +- ![](images/epoch40_batch1850.jpg) + + +What dataset will you apply this GAN to? :) + +* * * + +## Summary + +In this article, we have... + +- **Briefly covered what a DCGAN is, to understand what is happening.** +- **Learned to build a DCGAN with [TensorFlow 2 and Keras](https://www.machinecurve.com/index.php/mastering-keras/).** +- **Seen what happens when you train it on the MNIST dataset.** + +I hope that it was useful to you! Please make sure to leave any questions or other comments in the comments section below 💬 I'll try to respond when I can. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## Sources + +Radford, A., Metz, L., & Chintala, S. (2015). [Unsupervised representation learning with deep convolutional generative adversarial networks.](https://arxiv.org/abs/1511.06434) _arXiv preprint arXiv:1511.06434_ + +TensorFlow. (n.d.). _Deep Convolutional generative adversarial network_. 
[https://www.tensorflow.org/tutorials/generative/dcgan](https://www.tensorflow.org/tutorials/generative/dcgan) diff --git a/creating-depthwise-separable-convolutions-in-keras.md b/creating-depthwise-separable-convolutions-in-keras.md new file mode 100644 index 0000000..14e7a93 --- /dev/null +++ b/creating-depthwise-separable-convolutions-in-keras.md @@ -0,0 +1,444 @@ +--- +title: "Creating depthwise separable convolutions with TensorFlow 2 and Keras" +date: "2019-09-24" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "convolutional-neural-networks" + - "deep-learning" + - "keras" + - "kernel" +--- + +In a recent blog post, we took a look at [separable convolutions](https://machinecurve.com/index.php/2019/09/23/understanding-separable-convolutions/). When you separate your convolutional kernels in a depthwise way, you can substantially reduce the required amount of resources for your machine learning project. + +The best thing: presumably, this is all without losing the predictive power of the traditional convolutional neural network. + +In this blog, we'll adapt a neural network that we trained earlier to illustrate the following: + +**How to create a depthwise separable convolutional neural network in Keras.** + +We'll first briefly review traditional convolutions, depthwise separable convolutions and how they improve the training process of your neural network. We then move towards adapting a ConvNet that we created earlier, for performing classifications with the MNIST dataset. The best thing: we can even compare the two in terms of performance _and_ time required for completing the training. + +After reading this tutorial, you will... + +- Understand what depthwise separable convolutional layers are. +- How they are represented in TensorFlow 2 based Keras. +- How to use `tensorflow.keras.layers.SeparableConv2D` in your neural network. + +Let's take a look! 🚀 + +Note that the code for this blog post is also available on [GitHub](https://github.com/christianversloot/keras-cnn). + +* * * + +**Update 08/Feb/2021:** ensured that article is up to date. + +**Update 03/Nov/2020:** updated blog post to make the code examples compatible with TensorFlow 2.x. Also added link to relevant articles. + +* * * + +\[toc\] + +* * * + +## A brief review: what is a depthwise separable convolutional layer? + +Suppose that you're working with some traditional [convolutional kernels](https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/), like the ones in this image: + +![](images/CNN.png) + +If your 15x15 pixels image is RGB, and by consequence has 3 channels, you'll need (15-3+1) x (15-3+1) x 3 x 3 x 3 x N = 4563N multiplications to complete the full interpretation of _one image_. If you're working with ten kernels, so N = 10, you'll need over 45000 multiplications. Today, 20 to 50 kernels are not uncommon, datasets often span thousands of images and neural networks often compose multiple convolutional layers in their architecture. + +That's many resources you'll need, possibly draining you from funds that might have been spent better. + +Enter [depthwise separable convolutional](https://machinecurve.com/index.php/2019/09/23/understanding-separable-convolutions/) layers: + +![](images/CNNaltogether.png) + +With those, you essentially split your N traditional kernels into _depthwise convolutions_ and _pointwise convolutions_. 
In the first subprocess, you convolve with M filters on a layer basis, adding the kernels 'pointwise' in the second subprocess. + +While achieving the same result, you'll need only **9633 convolutions** as we've seen in [our other blog post](https://machinecurve.com/index.php/2019/09/23/understanding-separable-convolutions/#how-many-multiplications-do-we-save). + +Depthwise separable convolutional layers may therefore greatly optimize your learning process without giving in on accuracy, since essentially the same operation is performed. + +We'll test this premise today, in this blog. We'll adapt a [traditional CNN classifier](https://machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) created in a previous blog to use `SeparableConv2D` instead of the traditional `Conv2D`. What's more, we'll cover each of the configuration settings in detail, to augment the [Keras docs](https://keras.io/layers/convolutional/) for SeparableConv2D. + +Training with SeparableConv2D instead of Conv2D using the same model architecture and the same dataset allows us to compare the two in terms of performance and training time without much interference from architecture-specific factors or configuration-specific factors. This ensures that the comparison is as fair as possible. + +Allright, let's go! + +* * * + +## Adapting our traditional MNIST CNN + +Next, we'll adapt the traditional CNN we created for classifying instances of the MNIST dataset. As we recall, the MNIST dataset stands for _Modified National Institute of Standards and Technology_ and contains thousands of 28 x 28 pixel images of the digits 0-9. We first present the Keras code for the traditional CNN. Then, we introduce the `SeparableConv2D` layer and explain its configuration options. Finally, before we move on to the training and comparison stages, we show you how to adapt a normal CNN to use depthwise separable convolutions. + +### The traditional CNN + +This was the traditional CNN that we used in the other blog + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape the data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Convert into [0, 1] range. 
+input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +Briefly reviewing the code, this is what happens: + +- Firstly, all the dependencies are imported into your Python script: Keras itself, the MNIST dataset (which is [embedded](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/) in Keras), the Sequential API, and the layers that we'll need. +- Secondly, we specify the configuration of our model. Mainly, we cover hyperparameters and the shape of our data (by specifying image size). +- Thirdly, the MNIST dataset is [loaded](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/). +- Fourthly, we reshape the data into the \[latex\]\[0, 1\]\[/latex\] range. +- Fifthly, we parse numbers as floats ([this benefits training on GPUs](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/#float32-in-your-ml-model-why-its-great)), convert the images into grayscale (to make them color-agnostic, which benefits classification of new instances) and convert target vectors (which are scalars) into categorical data (vectors deciding for each possible target, in this case scalars 0-9, whether it belongs to that category yes/no). [More about](https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/#small-detour-categorical-cross-entropy) `[to_categorical](https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/#small-detour-categorical-cross-entropy)` [here.](https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/#small-detour-categorical-cross-entropy) +- Sixthly, we do one of the most important things: we specify the model architecture. Our model makes use of the `Sequential` API provided by Keras and stacks all layers on top of each other, in line with this API. We employ `Conv2D` twice, followed by Max Pooling and Dropout, before we flatten the abstract feature map and classify the data by means of densely-connected layers. +- Seventhly, we _configure_ the model and _fit the data_. 
We specify hyperparameters such as the [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) ([categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/)), the [optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), additional metrics, batch size, number of epochs and validation split. +- Eightly, and finally, we add [model evaluation](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/). Since during training _validation loss_ is computed, we can fairly accurately assess the _predictive power_ of our model. However, it can overfit, which means that it no longer works as well with data that the model has never seen before. By means of the _testing set_, we can test our model. In the other blog, test accuracy was as high as training (validation) accuracy. Test loss was even lower (which is better). That's great. + +### SeparableConv2D in Keras + +Now that we understand what happens in the model code, we can introduce the `SeparableConv2D` convolutional layer in Keras, which implements depthwise separable convolution for two-dimensional data (such as images). + +The layer is very similar to the traditional `Conv2D` layer. It can be added to your Keras model easily and, as we saw above, it performs mostly the same trick. However, it comes with some separation-specific configuration options that must be set before training is commenced. The [Keras website](https://keras.io/layers/convolutional/) defines the `SeparableConv2D` layer as follows: + +``` +tensorflow.keras.layers.SeparableConv2D(filters, kernel_size, strides=(1, 1), padding='valid', data_format=None, dilation_rate=(1, 1), depth_multiplier=1, activation=None, use_bias=True, depthwise_initializer='glorot_uniform', pointwise_initializer='glorot_uniform', bias_initializer='zeros', depthwise_regularizer=None, pointwise_regularizer=None, bias_regularizer=None, activity_regularizer=None, depthwise_constraint=None, pointwise_constraint=None, bias_constraint=None) +``` + +Where all the configuration options mean the following: + +- **Filters:** the number of output filters (or traditional 'kernels' length in the situation above); +- **Kernel size:** either an integer (if the spatial dimensions are the same) or a tuple of integers (if they are not). Hence, both `3` and `(3, 3)` represent a `3x3xN` kernel. +- **Stride:** how fast the kernel will convolve over your input image. If `1`, it will move pixel by pixel, whereas with larger values, it will skip certain convolutions in order to be faster. +- **Padding:** use no padding (`valid`; might even drop rows if the kernel size and stride don't match up) or padding equally distributed left, right, up and down (`same`) in order to fully cover the input images. +- **Data format:** whether your image input is `channels_first` or `channels_last`. By default, this is defined as `channels_last`. +- **Dilation rate:** as a simple example, a convolution over a grayscale image is usually a `mxn` block of which the convolving pixels are grouped tightly together; really as a `mxn` block. When you specify dilations of `> 1`, what you will see is that the distance between the convolving pixels increases and that the convolution is _dilated_, as if you're no longer looking directly to the image but continuously from the edge. 
Dilation has empirically shown to improve model training in some cases, so it may be worth playing with this parameter. +- **Depth multiplier:** how many depthwise convolutions must be performed over the channels of the input image. Traditionally, this is `1`, as we've seen in the drawing above. However, you might wish to manually set this to a larger value. Note that you must accomodate for the required resources, though. +- **Activation function:** well, this one speaks for itself - which [activation function](https://machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) you'll use to (very likely) add nonlinearity to your deep learning model. +- **Whether bias must be used:** bias might help you steer your result a bit into the right direction if you cannot find a proper decision boundary with gradient-optimized weights only. By default, bias is used. +- **Depthwise, pointwise and bias initializers:** which [weight initialization strategy](https://machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/) is used for the neuron's vectors representing the depthwise and pointwise convolutions and the accompanying bias vectors. By default, this is zeros for bias (which is fine) and Glorot uniform or Xavier for the depthwise and pointwise convolutions. [Watch out in that case when you use ReLU for activating your network, especially when you train with much data.](https://machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/) +- **Depthwise, pointwise and bias regularizers:** which regularization techniques are applied to the depthwise and pointwise convolutions and the accompanying bias, to keep the training process balanced. +- **Activity regularizer:** which regularization technique is applied to the _output_ of the layer, i.e. what flows out of the activation function. This is different than the other regularizers, which are applied _within_ the layer. +- **Depthwise, pointwise and bias constraints:** [constraints](https://keras.io/constraints/) applied to the depthwise and pointwise convolution and the layer bias vector. + +### Adapting the CNN to use depthwise separable convolutions + +Now that we understand how to create a depthwise separable convolutional layer in Keras and how to configure it, we'll move on to adapting the CNN from above to use depthwise separable convolutions. + +And that's really simple - we'll just adapt the `Conv2D` layers to use `SeparableConv2D` and add the extra configuration that we need. + +Eventually, we then end up with this: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import SeparableConv2D, MaxPooling2D + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data. +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Convert into [0, 1] range. 
+input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(SeparableConv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(SeparableConv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +### A small note on Conv1D and Conv3D + +Although this blog post shows you how to create a depthwise separable convolutional neural network based on a `Conv2D` layer, it's of course also possible to use separable convolutions in `1D`: `Conv1D` can be replaced with `SeparableConv1D`. So far, there is [no such thing](https://github.com/keras-team/keras/issues/5639) as a `SeparableConv3D` available in Keras. + +* * * + +## Training the neural network + +Let's go & train our model to see how it performs! + +### Software dependencies you'll need to install first + +I quote my usual advice about software dependencies from another blog - + +> +> We always start with listing certain dependencies that you'll need to install before you can run the model on your machine. Those are for today: +> +> A version of **Python** that can run `tensorflow.keras` (e.g. 3.8+). +> **TensorFlow 2.0**, e.g. 2.4+. +> If you wish to generate plots, it's also wise to install **Numpy** (if it's not a peer dependency of the previous ones) and **Matplotlib**. +> +> Preferably, you'll install these in an Anaconda environment. [Read here how to do that.](https://towardsdatascience.com/installing-keras-tensorflow-using-anaconda-for-machine-learning-44ab28ff39cb) + +### Running your model + +Create a file that is called e.g. `model_depthwise_separable.py` and store it somewhere (possibly besides the regular CNN [you created before](https://machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/)). Subsequently open up a terminal and `cd` to the particular folder. Issue the command `python model_depthwise_separable.py` to start training. Note that if you're using Anaconda that you must activate your Keras environment first, with `conda activate `, in my case e.g. `conda activate tensorflow_gpu`. 
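+
+Since the next section compares training time as well as accuracy, it can be handy to record epoch durations programmatically instead of reading them off the progress bar. The snippet below is a minimal sketch of how you _could_ do this; the `TimingCallback` class is not part of Keras, but a small helper we define here for illustration.
+
+```
+import time
+from tensorflow.keras.callbacks import Callback
+
+class TimingCallback(Callback):
+  """Records the duration of every epoch, so Conv2D and SeparableConv2D runs can be compared."""
+  def __init__(self):
+    super().__init__()
+    self.epoch_times = []
+
+  def on_epoch_begin(self, epoch, logs=None):
+    self.epoch_start = time.time()
+
+  def on_epoch_end(self, epoch, logs=None):
+    self.epoch_times.append(time.time() - self.epoch_start)
+
+# Hypothetical usage: pass an instance to model.fit(...)
+# timing = TimingCallback()
+# model.fit(..., callbacks=[timing])
+# print(f'Mean epoch time: {sum(timing.epoch_times) / len(timing.epoch_times):.2f}s')
+```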
+ +* * * + +## Traditional vs Depthwise separable CNN: performance comparison + +This is the output of training the depthwise separable CNN: + +``` +Epoch 1/25 +48000/48000 [==============================] - 9s 198us/step - loss: 0.5469 - acc: 0.8535 - val_loss: 0.1651 - val_acc: 0.9510 +Epoch 2/25 +48000/48000 [==============================] - 4s 84us/step - loss: 0.1720 - acc: 0.9459 - val_loss: 0.1176 - val_acc: 0.9648 +Epoch 3/25 +48000/48000 [==============================] - 4s 85us/step - loss: 0.1310 - acc: 0.9597 - val_loss: 0.0889 - val_acc: 0.9734 +Epoch 4/25 +48000/48000 [==============================] - 4s 85us/step - loss: 0.1072 - acc: 0.9658 - val_loss: 0.0853 - val_acc: 0.9740 +Epoch 5/25 +48000/48000 [==============================] - 4s 86us/step - loss: 0.0939 - acc: 0.9710 - val_loss: 0.0721 - val_acc: 0.9781 +Epoch 6/25 +48000/48000 [==============================] - 4s 87us/step - loss: 0.0811 - acc: 0.9747 - val_loss: 0.0626 - val_acc: 0.9815 +Epoch 7/25 +48000/48000 [==============================] - 4s 87us/step - loss: 0.0734 - acc: 0.9773 - val_loss: 0.0588 - val_acc: 0.9821 +Epoch 8/25 +48000/48000 [==============================] - 4s 86us/step - loss: 0.0695 - acc: 0.9783 - val_loss: 0.0530 - val_acc: 0.9843 +Epoch 9/25 +48000/48000 [==============================] - 4s 88us/step - loss: 0.0616 - acc: 0.9797 - val_loss: 0.0512 - val_acc: 0.9853 +Epoch 10/25 +48000/48000 [==============================] - 4s 89us/step - loss: 0.0557 - acc: 0.9827 - val_loss: 0.0520 - val_acc: 0.9838 +Epoch 11/25 +48000/48000 [==============================] - 4s 89us/step - loss: 0.0525 - acc: 0.9825 - val_loss: 0.0485 - val_acc: 0.9857 +Epoch 12/25 +48000/48000 [==============================] - 4s 92us/step - loss: 0.0477 - acc: 0.9845 - val_loss: 0.0491 - val_acc: 0.9844 +Epoch 13/25 +48000/48000 [==============================] - 4s 93us/step - loss: 0.0445 - acc: 0.9849 - val_loss: 0.0484 - val_acc: 0.9852 +Epoch 14/25 +48000/48000 [==============================] - 4s 91us/step - loss: 0.0404 - acc: 0.9863 - val_loss: 0.0456 - val_acc: 0.9868 +Epoch 15/25 +48000/48000 [==============================] - 4s 90us/step - loss: 0.0385 - acc: 0.9869 - val_loss: 0.0449 - val_acc: 0.9859 +Epoch 16/25 +48000/48000 [==============================] - 4s 91us/step - loss: 0.0349 - acc: 0.9887 - val_loss: 0.0467 - val_acc: 0.9857 +Epoch 17/25 +48000/48000 [==============================] - 4s 94us/step - loss: 0.0337 - acc: 0.9886 - val_loss: 0.0430 - val_acc: 0.9871 +Epoch 18/25 +48000/48000 [==============================] - 5s 95us/step - loss: 0.0298 - acc: 0.9902 - val_loss: 0.0406 - val_acc: 0.9881 +Epoch 19/25 +48000/48000 [==============================] - 5s 94us/step - loss: 0.0300 - acc: 0.9900 - val_loss: 0.0434 - val_acc: 0.9872 +Epoch 20/25 +48000/48000 [==============================] - 5s 95us/step - loss: 0.0269 - acc: 0.9906 - val_loss: 0.0410 - val_acc: 0.9884 +Epoch 21/25 +48000/48000 [==============================] - 5s 96us/step - loss: 0.0269 - acc: 0.9912 - val_loss: 0.0407 - val_acc: 0.9883 +Epoch 22/25 +48000/48000 [==============================] - 5s 96us/step - loss: 0.0255 - acc: 0.9914 - val_loss: 0.0420 - val_acc: 0.9874 +Epoch 23/25 +48000/48000 [==============================] - 5s 104us/step - loss: 0.0230 - acc: 0.9928 - val_loss: 0.0443 - val_acc: 0.9869 +Epoch 24/25 +48000/48000 [==============================] - 5s 99us/step - loss: 0.0209 - acc: 0.9926 - val_loss: 0.0418 - val_acc: 0.9890 +Epoch 25/25 +48000/48000 
[==============================] - 5s 95us/step - loss: 0.0211 - acc: 0.9931 - val_loss: 0.0419 - val_acc: 0.9881
+Test loss: 0.03642239146179636 / Test accuracy: 0.9886
+```
+
+### Accuracy performance
+
+These are the last five epochs from the traditional CNN together with its test evaluation performance:
+
+```
+Epoch 20/25
+48000/48000 [==============================] - 4s 84us/step - loss: 0.0094 - acc: 0.9968 - val_loss: 0.0281 - val_acc: 0.9924
+Epoch 21/25
+48000/48000 [==============================] - 4s 85us/step - loss: 0.0098 - acc: 0.9966 - val_loss: 0.0306 - val_acc: 0.9923
+Epoch 22/25
+48000/48000 [==============================] - 4s 84us/step - loss: 0.0094 - acc: 0.9967 - val_loss: 0.0320 - val_acc: 0.9921
+Epoch 23/25
+48000/48000 [==============================] - 4s 85us/step - loss: 0.0068 - acc: 0.9979 - val_loss: 0.0347 - val_acc: 0.9917
+Epoch 24/25
+48000/48000 [==============================] - 5s 100us/step - loss: 0.0074 - acc: 0.9974 - val_loss: 0.0347 - val_acc: 0.9916
+Epoch 25/25
+48000/48000 [==============================] - 4s 85us/step - loss: 0.0072 - acc: 0.9975 - val_loss: 0.0319 - val_acc: 0.9925
+
+Test loss: 0.02579820747410522 / Test accuracy: 0.9926
+```
+
+The depthwise separable convolution seems to perform _slightly worse_ on both validation loss (~0.04 in the last five epochs vs ~0.03 in the last five traditional epochs) and test loss (~0.036 against ~0.026). This may be caused by the initialization of your weights (which, by setting your starting point uniquely, may impact how the model performs even towards the end).
+
+I therefore ran the model multiple times. This was the output of the 25th epoch and the evaluation step for five re-runs:
+
+```
+Epoch 25/25
+48000/48000 [==============================] - 5s 97us/step - loss: 0.0218 - acc: 0.9927 - val_loss: 0.0445 - val_acc: 0.9873
+Test loss: 0.03588760701002175 / Test accuracy: 0.9883
+
+Epoch 25/25
+48000/48000 [==============================] - 5s 99us/step - loss: 0.0230 - acc: 0.9918 - val_loss: 0.0392 - val_acc: 0.9893
+Test loss: 0.03982483066770946 / Test accuracy: 0.9886
+
+Epoch 25/25
+48000/48000 [==============================] - 6s 128us/step - loss: 0.0189 - acc: 0.9934 - val_loss: 0.0396 - val_acc: 0.9883
+Test loss: 0.03224361159349937 / Test accuracy: 0.9895
+
+Epoch 25/25
+48000/48000 [==============================] - 5s 107us/step - loss: 0.0281 - acc: 0.9903 - val_loss: 0.0432 - val_acc: 0.9874
+Test loss: 0.04041151546177571 / Test accuracy: 0.9869
+
+Epoch 25/25
+48000/48000 [==============================] - 5s 98us/step - loss: 0.0308 - acc: 0.9893 - val_loss: 0.0461 - val_acc: 0.9875
+Test loss: 0.04591406463075546 / Test accuracy: 0.9852
+```
+
+On average, test loss is approximately 0.0389 and validation loss is similar. This is still worse than the traditional `Conv2D` layer. Oops. You might wish to experiment with Conv2D and SeparableConv2D first before you choose to do large-scale training.
+
+Why this is the case might be explained through the number of trainable parameters. Since fewer multiplications are necessary, fewer parameters are to be trained. This might result in the model becoming unable to capture the underlying patterns in the data set.
+
+In our model, neither adding a layer nor removing one helps improve validation and test loss. You might thus really wish to test between `Conv2D` and `SeparableConv2D` first.
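+
+To make the trainable-parameters argument a bit more concrete, you can count the parameters of a single layer directly. The snippet below is a small sketch, assuming a 3x3 layer with 32 filters on MNIST-shaped `(28, 28, 1)` input; the counts for the full model will differ, but the ratio illustrates why the separable variant has less capacity.
+
+```
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Conv2D, SeparableConv2D
+
+# Parameter count of a single 3x3 layer with 32 filters on 28x28x1 input
+conv_model = Sequential([Conv2D(32, kernel_size=(3, 3), input_shape=(28, 28, 1))])
+sep_model = Sequential([SeparableConv2D(32, kernel_size=(3, 3), input_shape=(28, 28, 1))])
+
+# Conv2D: 3*3*1*32 weights + 32 biases = 320 parameters
+print('Conv2D parameters:', conv_model.count_params())
+# SeparableConv2D: 3*3*1 depthwise + 1*1*1*32 pointwise + 32 biases = 73 parameters
+print('SeparableConv2D parameters:', sep_model.count_params())
+```
+
+With a single input channel the saving is modest in absolute terms, but it grows quickly as the number of input channels increases.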
+
+### Time performance: it's slower than `Conv2D` (with TensorFlow)
+
+What's more interesting is that _despite the many fewer multiplications_ the depthwise separable convolutional layer trains _slower_ than the traditional `Conv2D` layer. Although this seems to contradict the theory, it is a practical reality, possibly negating the benefits altogether (especially with large datasets, where a small increase in training time per epoch adds up to a large difference over many epochs). However, this seems to be an implementation issue in TensorFlow.
+
+- **Update 08/Feb/2021:** it seems to be the case that the issue remains unresolved.
+
+> Also experiencing that SeparableConv2d is slower than Conv2d in Keras. The number of input\_channels does not seem to matter, I tested 32-2048 and in all cases the Conv2d is faster. Interestingly, in the SeparableConv2d-model the number parameters is lower as well as the FLOPS. Still this does not seem to have the wanted affect on the inference.
+>
+> Source: gitman88 (2019), [https://github.com/tensorflow/tensorflow/issues/12132#issuecomment-471880273](https://github.com/tensorflow/tensorflow/issues/12132#issuecomment-471880273)
+
+* * *
+
+## Summary
+
+In this blog, we've seen how a (2D) depthwise separable convolutional layer can be implemented in Keras by means of the `SeparableConv2D` layer. For this to work well, we briefly recapped depthwise separable convolutions and their split into depthwise and pointwise convolutions. We also explained the Keras configuration for the `SeparableConv2D` layer and showed how to implement one by adapting a CNN based classifier we created before - see e.g. [GitHub](https://github.com/christianversloot/keras-cnn) for the code.
+
+The fact that the two models were very similar in terms of data and configuration allowed us to compare the results. The performance of the depthwise separable convolution seems to be a bit lower than that of the traditional layer, perhaps due to underfitting given the fewer multiplications and, hence, fewer trainable parameters. Similarly, its time performance was worse, presumably due to an issue in TensorFlow, which performs the numerical operations. Therefore: choose wisely and test first!
+
+I hope you've learnt something today - at least, I thought it was interesting to find deviating performance that directly opposes the theoretical benefits of the depthwise separable layer. Let's hope the issue with TensorFlow is repaired relatively soon. Until then, happy engineering! 😎
+
+* * *
+
+## References
+
+Keras. (n.d.). Convolutional Layers. Retrieved from [https://keras.io/layers/convolutional/](https://keras.io/layers/convolutional/)
+
+Keras. (n.d.). Constraints. Retrieved from [https://keras.io/constraints/](https://keras.io/constraints/)
+
+Keras-team/keras. (n.d.). Trains a simple convnet on the MNIST dataset. Retrieved from [https://github.com/keras-team/keras/blob/master/examples/mnist\_cnn.py](https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py)
+
+Gitman88. (2019). slim.separable\_conv2d is too slow · Issue #12132 · tensorflow/tensorflow. Retrieved from [https://github.com/tensorflow/tensorflow/issues/12132#issuecomment-471880273](https://github.com/tensorflow/tensorflow/issues/12132#issuecomment-471880273)
+
+Alexvicegrab. (n.d.). SeparableConv3D · Issue #5639 · keras-team/keras. Retrieved from [https://github.com/keras-team/keras/issues/5639](https://github.com/keras-team/keras/issues/5639)
+
+Ceballos, F. (2019, September 8).
Installing a Python Based Machine Learning Environment in Windows 10. Retrieved from [https://towardsdatascience.com/installing-keras-tensorflow-using-anaconda-for-machine-learning-44ab28ff39cb](https://towardsdatascience.com/installing-keras-tensorflow-using-anaconda-for-machine-learning-44ab28ff39cb) + +MachineCurve. (2019, September 23). Understanding separable convolutions. Retrieved from [https://machinecurve.com/index.php/2019/09/23/understanding-separable-convolutions/](https://machinecurve.com/index.php/2019/09/23/understanding-separable-convolutions/) + +MachineCurve. (2019, May 30). Convolutional Neural Networks and their components for computer vision. Retrieved from [https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/](https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) + +MachineCurve. (2019, September 24). How to create a CNN classifier with Keras? Retrieved from [https://machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/](https://machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) + +MachineCurve. (2019, July 27). How to create a basic MLP classifier with the Keras Sequential API – Small detour: categorical cross entropy. Retrieved from [https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/#small-detour-categorical-cross-entropy](https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/#small-detour-categorical-cross-entropy) + +MachineCurve. (2019, September 4). ReLU, Sigmoid and Tanh: today's most used activation functions. Retrieved from [https://machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/](https://machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) + +MachineCurve. (2019, August 22). What is weight initialization? Retrieved from [https://machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/](https://machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/) + +MachineCurve. (2019, September 18). He/Xavier initialization & activation functions: choose wisely. Retrieved from [https://machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/](https://machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/) diff --git a/creating-interactive-visualizations-of-tensorflow-keras-datasets.md b/creating-interactive-visualizations-of-tensorflow-keras-datasets.md new file mode 100644 index 0000000..ae86312 --- /dev/null +++ b/creating-interactive-visualizations-of-tensorflow-keras-datasets.md @@ -0,0 +1,368 @@ +--- +title: "Creating Interactive Visualizations of TensorFlow Keras datasets" +date: "2021-03-25" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "datasets" + - "keras" + - "keras-datasets" + - "streamlit" + - "tensorflow-datasets" +--- + +Data scientists find some aspects of their job really frustrating. Data preprocessing is one of them, but the same is true for generating visualizations and other kind of reports. They're boring, nobody reads them, and creating them takes a lot of time. + +What if there is an alternative, allowing you to create interactive visualizations of your data science results within minutes? 
+ +That's what we're going to find out today. You're going to explore Streamlit, an open source and free package for creating data driven web apps. More specifically, you will generate visualizations of the `tensorflow.keras.datasets` datasets related to images: the MNIST dataset, the Fashion MNIST dataset, and the CIFAR-10 and CIFAR-100 datasets. It allows you to easily walk through the datasets, generating plots on the fly. + +After reading this tutorial, you will... + +- **Understand what Streamlit is and what it can be used for.** +- **Have built your first Streamlit app for walking through the `tensorflow.keras` datasets.** +- **Have a good basis for creating more advanced functionalities with Streamlit.** + +Are you ready? Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## What is Streamlit? + +While the job of data scientists can be cool, it can also be really frustrating - especially when it comes to visualizing your datasets. + +Creating an application for showing what you have built or what you want to built can be really frustrating. + +No more. Say hello to [Streamlit](https://streamlit.io/). Streamlit is an [open source and free](https://github.com/streamlit/streamlit) package with which you can create data driven web apps in _minutes_. + +Really, it takes almost no time to build your data dashboard - and we're going to see how to use it today. + +![](images/image-12-1024x801.png) + +* * * + +## Example code: visualizing datasets with Streamlit + +Let's now write some code! 🚀 + +### Software dependencies + +You will need to install these dependencies, if not already installed, to run the code in this tutorial + +- Streamlit: `pip install streamlit` +- TensorFlow: `pip install tensorflow` +- Matplotlib: `pip install matplotlib` + +### Writing our tool + +Let's now take a look at writing our tool. Creating an interactive visualization for the Keras datasets involves the following steps: + +1. Stating the imports. +2. Writing the `get_dataset_mappings()` def. +3. Creating the `load_dataset()` def. +4. Implementing the `draw_images()` def. +5. Finally, merging everything together in the `do_streamlit()` def. +6. Then invoking everything in the `__main__` part. + +However, let's begin with creating a file where our code can be written - say, `keras-image-datasets.py`. + +![](images/image-11.png) + +A screenshot from the visualization generated by our tool. + +#### Stating the imports + +The first thing we do - as always - is writing the specification of the dependencies that we need: + +``` +import streamlit as st +import tensorflow +from tensorflow.keras import datasets +import matplotlib.pyplot as plt +``` + +We will need `streamlit` because it is the runtime for our interactive visualization. With `tensorflow` and `tensorflow.keras`, we can load the datasets. Finally, we're using Matplotlib's `pyplot` API for visualizing the images. + +#### Writing the get\_dataset\_mappings() def + +We can then write the dataset to dataset mappings: + +``` +def get_dataset_mappings(): + """ + Get mappings for dataset key + to dataset and name. + """ + mappings = { + 'CIFAR-10': datasets.cifar10, + 'CIFAR-100': datasets.cifar100, + 'Fashion MNIST': datasets.fashion_mnist, + 'MNIST': datasets.mnist + } + return mappings +``` + +This definition provides a `string -> dataset` mapping by defining a dictionary that can be used for converting some input String to the corresponding `tensorflow.keras.datasets` dataset. For example, if we take its `MNIST` attribute, it returns the MNIST dataset. 
We can use this dictionary for emulating `switch`\-like behavior, which is not present in Python by default. + +#### Creating the load\_dataset() def + +Subsequently, we can define `load_dataset()`. It takes a `name` argument. First, it retrieves the dataset mappings that we discussed above. Subsequently, it loads the corresponding Keras dataset (also as discussed above) and performs `load_data()`. As you can see, we're only using the training inputs, which we return as the output of this def. + +``` +def load_dataset(name): + """ + Load a dataset + """ + # Define name mapping + name_mapping = get_dataset_mappings() + + # Get train data + (X, _), (_, _) = name_mapping[name].load_data() + + # Return data + return X +``` + +#### Implementing the draw\_images() def + +Now that we have a dataset, we can draw some images! + +With `draw_images()`, we will be able to generate a multiplot with the samples that we selected. + +For this, we have to specify a dataset (`data`), a position/index of our starting image (`start_index`), and the number of rows (`num_rows`) and columns (`num_cols)` that we want to show. + +First of all, we generate Matplotlib subplots - as many as `num_rows` and `num_cols` allow. + +Then, usign the columns and rows, we can compute the total number of images, in `show_items`. We then specify an iterator index and iterate over each `row` and `col`, filling the specific frame with the image at that index. + +Finally, we return the figure - but do so using Streamlit's `pyplot` wrapper, to make it work. + +``` +def draw_images(data, start_index, num_rows, num_cols): + """ + Generate multiplot with selected samples. + """ + # Get figure and axes + fig, axs = plt.subplots(num_rows, num_cols) + # Show number of items + show_items = num_rows * num_cols + # Iterate over items from start index + iterator = 0 + for row in range(0, num_rows): + for col in range(0, num_cols): + index = iterator + start_index + axs[row, col].imshow(data[index]) + axs[row, col].axis('off') + iterator += 1 + # Return figure + return st.pyplot(fig) +``` + +#### Finally, creating do\_streamlit() + +It is good practice in Python to keep as much of your code in definitions. That's why we finally define `do_streamlit()`, which does nothing more than setting up the Streamlit dashboard and processing user interactions. + +It involves the following steps: + +- Setting the Pyplot style to use a black background, in line with Streamlit's styling. +- Creating a title Streamlit object. +- Defining a selection box with the datasets supported by the tool. +- Loading the selected dataset with our `load_dataset()` def. +- Loading the number of images in the dataset given the shape of our dataset. +- Defining the sliders for the picture index, the number of rows and the number of columns. Note that we specify `maximum_length` here in order to not exceed the input shape by too much. +- Finally, we show the image. We capture this in a `try/except` statement because invalid combinations, although minimized, remain possible. For example, by setting the `picture_id` to a value that less than `no_rows * no_cols` below the `maximum_length`, image generation crashes. We can fix this with some additional code, but chose to keep things simple. Who needs the final images if you can visualize many in between? + +``` +def do_streamlit(): + """ + Set up the Streamlit dashboard and capture + interactions. 
+ """ + # Styling + plt.style.use('dark_background') + + # Set title + st.title('Interactive visualization of Keras image datasets') + + # Define select box + dataset_selection = st.selectbox('Dataset', ('CIFAR-10', 'CIFAR-100', 'Fashion MNIST', 'MNIST')) + + # Dataset + dataset = load_dataset(dataset_selection) + + # Number of images in dataset + maximum_length = dataset.shape[0] + + # Define sliders + picture_id = st.slider('Start at picture', 0, maximum_length, 0) + no_rows = st.slider('Number of rows', 2, 30, 5) + no_cols = st.slider('Number of columns', 2, 30, 5) + + # Show image + try: + st.image(draw_images(dataset, picture_id, no_rows, no_cols)) + except: + print() +``` + +#### Then invoking everything in \_\_main\_\_ + +Finally, we write the runtime `if` statement, which checks if we are running the Python interpreter. If so, we're invoking everything with `do_streamlit()`. + +``` +if __name__ == '__main__': + do_streamlit() +``` + +### Full model code + +I can understand if you don't want to follow all the individual steps above and rather want to play with the full code. That's why you can also retrieve the full code below. Make sure to rest of the article in order to understand everything that is going on! :) + +``` +import streamlit as st +import tensorflow +from tensorflow.keras import datasets +import matplotlib.pyplot as plt + + +def get_dataset_mappings(): + """ + Get mappings for dataset key + to dataset and name. + """ + mappings = { + 'CIFAR-10': datasets.cifar10, + 'CIFAR-100': datasets.cifar100, + 'Fashion MNIST': datasets.fashion_mnist, + 'MNIST': datasets.mnist + } + return mappings + + +def load_dataset(name): + """ + Load a dataset + """ + # Define name mapping + name_mapping = get_dataset_mappings() + + # Get train data + (X, _), (_, _) = name_mapping[name].load_data() + + # Return data + return X + + +def draw_images(data, start_index, num_rows, num_cols): + """ + Generate multiplot with selected samples. + """ + # Get figure and axes + fig, axs = plt.subplots(num_rows, num_cols) + # Show number of items + show_items = num_rows * num_cols + # Iterate over items from start index + iterator = 0 + for row in range(0, num_rows): + for col in range(0, num_cols): + index = iterator + start_index + axs[row, col].imshow(data[index]) + axs[row, col].axis('off') + iterator += 1 + # Return figure + return st.pyplot(fig) + + +def do_streamlit(): + """ + Set up the Streamlit dashboard and capture + interactions. + """ + # Styling + plt.style.use('dark_background') + + # Set title + st.title('Interactive visualization of Keras image datasets') + + # Define select box + dataset_selection = st.selectbox('Dataset', ('CIFAR-10', 'CIFAR-100', 'Fashion MNIST', 'MNIST')) + + # Dataset + dataset = load_dataset(dataset_selection) + + # Number of images in dataset + maximum_length = dataset.shape[0] + + # Define sliders + picture_id = st.slider('Start at picture', 0, maximum_length, 0) + no_rows = st.slider('Number of rows', 2, 30, 5) + no_cols = st.slider('Number of columns', 2, 30, 5) + + # Show image + try: + st.image(draw_images(dataset, picture_id, no_rows, no_cols)) + except: + print() + + +if __name__ == '__main__': + do_streamlit() +``` + +* * * + +## Results + +Let's now take a look what happens when we run the code. + +We can do so by opening up a terminal and making sure that it runs in the environment where our dependencies are installed. If not, make sure that it does - by enabling it. + +Then run `streamlit run keras-image-datasets.py`. 
It should open up your browser relatively quickly and this is what you should see: + +![](images/image-7-737x1024.png) + +You can use the selectors on top for customing the output image. With _Dataset_, you can pick one of the [image-based TensorFlow Keras datasets](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/). With _number of rows_ and _number of columns_, you can configure the output dimensions of your image. Finally, using _start at picture_, you can choose the index of the picture in the top left corner. All other images are the subsequent indices. + +For example, by switching to the Fashion MNIST dataset: + +![](images/image-8.png) + +This is what we get: + +![](images/image-9.png) + +Then, we also tune the start position, the number of rows and the number of columns: + +![](images/image-10.png) + +And see, we have created ourselves a tool that allows us to quickly explore the Keras datasets! + +With some adaptation, it should even be possible to explore your own dataset with this tool, but that's for another tutorial :) + +* * * + +## Summary + +Now that you have read this tutorial, you... + +- **Understand what Streamlit is and what it can be used for.** +- **Have built your first Streamlit app for walking through the `tensorflow.keras` datasets.** +- **Have a good basis for creating more advanced functionalities with Streamlit.** + +I hope that it was useful for your learning process! Please feel free to share what you have learned in the comments section 💬 I’d love to hear from you. Please do the same if you have any questions or other remarks. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Streamlit. (n.d.). The fastest way to build and share data apps. [https://streamlit.io/](https://streamlit.io/) + +GitHub. (n.d.). _Streamlit/streamlit_. [https://github.com/streamlit/streamlit](https://github.com/streamlit/streamlit) diff --git a/creating-one-vs-rest-and-one-vs-one-svm-classifiers-with-scikit-learn.md b/creating-one-vs-rest-and-one-vs-one-svm-classifiers-with-scikit-learn.md new file mode 100644 index 0000000..6418ba8 --- /dev/null +++ b/creating-one-vs-rest-and-one-vs-one-svm-classifiers-with-scikit-learn.md @@ -0,0 +1,258 @@ +--- +title: "Creating One-vs-Rest and One-vs-One SVM Classifiers with Scikit-learn" +date: "2020-11-11" +categories: + - "frameworks" + - "svms" +tags: + - "classification" + - "multiclass-classification" + - "scikit-learn" + - "support-vector-machine" + - "svm" +--- + +Support Vector Machines (SVMs) are a class of Machine Learning algorithms that are used quite frequently these days. Named after their [method for learning a decision boundary](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/), SVMs are binary classifiers - meaning that they only work with a [0/1 class scenario](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/). In other words, it is not possible to create a multiclass classification scenario with an SVM natively. + +Fortunately, there are some methods for allowing SVMs to be used with multiclass classification. In this article, we focus on two similar but slightly different ones: **one-vs-rest classification** and **one-vs-one classification**. Both involve the utilization of multiple binary SVM classifiers to finally get to a multiclass prediction. It is structured as follows. First, we'll look at multiclass classification in general. 
It serves as a brief recap, and gives us the necessary context for the rest of the article.
+
+After introducing multiclass classification, we will take a look at why it is not possible to create multiclass SVMs natively. That is, why they are binary classifiers and binary classifiers only. This is followed by two approaches for creating multiclass SVMs anyway: tricks, essentially - the one-vs-rest and one-vs-one classifiers. Those approaches include examples that illustrate step-by-step how to create them with the Scikit-learn machine learning library.
+
+Let's take a look! :D
+
+* * *
+
+\[toc\]
+
+* * *
+
+## What are multiclass classifiers?
+
+Classification is one of the approaches available in _supervised learning_. With a training dataset that has feature vectors (i.e. input samples with multiple columns per sample) and corresponding labels, we can train a model to assign one of the labels the model was trained on when it is fed new samples.
+
+Classification can be visualized as an automated system that categorizes items that are moving on a conveyor belt. In this assembly line scenario, the automated system recognizes characteristics of the object and moves it into a specific bucket when it is first in line. This looks as follows:
+
+![](images/whatisclassification5.png)
+
+There are [3 variants of classification](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/). In the _binary_ case, there are only two buckets - and hence two categories. This can be implemented with most machine learning algorithms. The other two cases - _multiclass_ and _multilabel_ classification - are different. In the multiclass case, we can assign items into one of multiple (> 2) buckets; in the multilabel case, we can assign multiple labels to one instance.
+
+> Multiclass classification can therefore be used in the setting where your classification dataset has more than two classes.
+>
+> [3 Variants of Classification Problems in Machine Learning](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/)
+
+Multiclass classification is reflected in the figure above. We clearly have no binary classifier: there are three buckets. Nor do we have a multilabel classifier: we assign items into buckets, rather than attaching multiple labels onto each item and then moving them into _one_ bucket.
+
+Implementing a multiclass classifier is easy when you are using neural networks. When using [SVMs](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/), this is more difficult. Let's now take a look at why this cannot be done so easily.
+
+* * *
+
+## Why you cannot create multiclass SVMs natively
+
+Take a look at the figure below. You see samples from two classes - black and white - plotted in a scatter plot, which visualizes a two-dimensional feature space. In addition, you see three decision boundaries: \[latex\]H\_1\[/latex\], \[latex\]H\_2\[/latex\] and \[latex\]H\_3\[/latex\]. The first is not capable of adequately separating the classes. The second is, and the third is as well.
+
+But which one is best if you are training a Support Vector Machine?
+
+Spoiler alert: it's \[latex\]H\_3\[/latex\]. The reason is that SVMs are **maximum-margin classifiers**, which means that they attempt to generate a decision boundary that is _equidistant_ from the two classes of data.
+ +> A point is said to be **equidistant** from a set of objects if the distances between that point and each object in the set are equal. +> +> Wikipedia (2005) + +To be more precise, it will not take into account the whole class - but rather the samples closest to the decision boundary, the so-called [support vectors](https://www.machinecurve.com/index.php/2020/05/05/how-to-visualize-support-vectors-of-your-svm-classifier/). + +![](images/Svm_separating_hyperplanes_SVG.svg_-1024x886.png) + +Hyperplanes and data points. The [image](https://en.wikipedia.org/wiki/Support-vector_machine#/media/File:Svm_separating_hyperplanes_(SVG).svg)is not edited. Author: [Zack Weinberg](https://commons.wikimedia.org/w/index.php?title=User:ZackWeinberg&action=edit&redlink=1), derived from [Cyc’s](https://commons.wikimedia.org/w/index.php?title=User:Cyc&action=edit&redlink=1) work. License: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/legalcode) + +Now, as you can see, using a SVM for learning a decision boundary makes that the hyperplane is _binary_ - i.e., it is capable of distinguishing between two classes of samples. + +The equidistance property simply does not allow us to distinguish between > 2 classes. + +Imagine adding another class to the image, with another separation boundary, effectively creating three sub boundaries: that between 1 and 2, that between 2 and 3, and that between 1 and 3. + +For all sub boundaries, the equidistance property is no longer true: the 1-2 boundary no longer guarantees an equidistant distance to the support vectors from class 3, and so on. + +This is why Support Vector Machines are binary classifiers and cannot be used for multiclass classification natively. + +### Two approaches for creating them anyway + +Fortunately, SVMs _can_ in practice be used for multiclass classification. There are a few approaches which help mimic a multiclass classifier. In this article, we'll cover two ones: + +- The **One-vs-Rest** method for multiclass classification: distinguishing between some label and all the others, where the class prediction with highest probability wins. +- The **One-vs-One** method: a classifier is trained for every pair of classes, allowing us to make continuous comparisons. The class prediction with highest quantity of predictions wins. + +Let's now take a look at each individual method in more detail and see how we can implement them with Scikit-learn. + +* * * + +## One-vs-Rest (OvR) Classification + +The **One-vs-Rest** method can be used for creating a multiclass SVM classifier. Let's recall the multiclass assembly line that we discussed above. Here, the output is one out of three possible classes: `{yellow, blue, red}`. + +![](images/whatisclassification5.png) + +Training an One-vs-Rest classifier for our model actually involves creating three binary classifiers under the hood: + +- **OvR binary classifier 1:** `yellow` vs `{blue, red}` +- **OvR binary classifier 2:** `blue` vs `{yellow, red}` +- **OvR binary classifier 3:** `red` vs `{blue, yellow}` + +Each binary classifier should predict a [class probability](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/). Say that we can define the predictions for each binary classifier as `p1`, `p2` and `p3`, then the multiclass prediction that is the outcome of the OvR classifier is `argmax(p1, p2, p3)`. 
In other words, if the probability that it is yellow vs blue or red is `0.99`, blue vs yellow or red is `0.23`, red vs blue or yellow is `0.78`, then the outcome of the multiclass classifier is `0` a.k.a. yellow. + +### One-vs-Rest in Scikit-learn: OneVsRestClassifier + +Say that we've got the following linearly separable dataset with three classes in a two-dimensional feature space: + +![](images/linearly.png) + +It can be generated as follows: + +``` +from sklearn.datasets import make_blobs + +# Configuration options +num_samples_total = 10000 +cluster_centers = [(5,5), (3,3), (1,5)] +num_classes = len(cluster_centers) + +# Generate data +X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.30) +``` + +We can now create a linear Support Vector Machine for classification with Scikit-learn's `sklearn.svm.LinearSVC` model type and a `OneVsRestClassifier` wrapper. Note that for evaluation purposes, we also generate a [confusion matrix](https://www.machinecurve.com/index.php/2020/05/05/how-to-create-a-confusion-matrix-with-scikit-learn/) and a [decision boundary plot](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) in the code below. For this reason, make sure that besides `sklearn` you also have `mlxtend` installed onto your system (or remove the code if not). + +``` +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from sklearn.multiclass import OneVsRestClassifier +from sklearn.svm import LinearSVC +from sklearn.model_selection import train_test_split +from sklearn.metrics import plot_confusion_matrix +from mlxtend.plotting import plot_decision_regions + +# Configuration options +num_samples_total = 10000 +cluster_centers = [(5,5), (3,3), (1,5)] +num_classes = len(cluster_centers) + +# Generate data +X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.30) + +# Split into training and testing data +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) + +# np.save('./clusters.npy', X) +X = np.load('./clusters.npy') + +# Create the SVM +svm = LinearSVC(random_state=42) + +# Make it an OvR classifier +ovr_classifier = OneVsRestClassifier(svm) + +# Fit the data to the OvR classifier +ovr_classifier = ovr_classifier.fit(X_train, y_train) + +# Evaluate by means of a confusion matrix +matrix = plot_confusion_matrix(ovr_classifier, X_test, y_test, + cmap=plt.cm.Blues, + normalize='true') +plt.title('Confusion matrix for OvR classifier') +plt.show(matrix) +plt.show() + +# Plot decision boundary +plot_decision_regions(X_test, y_test, clf=ovr_classifier, legend=2) +plt.show() +``` + +As expected, since our data is linearly separable, running the model results in a confusion matrix and decision boundary plot which show perfect linear separation. Of course, this is never the case in the real world - but it illustrates that we _can_ create a multiclass SVM when using One-vs-Rest! + +- [![](images/ovr_conf.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/ovr_conf.png) + +- [![](images/ovr_boundary.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/ovr_boundary.png) + + +* * * + +## One-vs-One (OvO) Classification + +The **One-vs-One** method can be used as well for creating a multiclass SVM classifier. 
Given the assembly line scenario from above, we create a set of binary classifiers, each representing one of the pairs: + +- ****OvO binary classifier**** **1:** yellow vs blue +- ****OvO binary classifier**** **2:** yellow vs red +- **OvO binary classifier** **3:** blue vs red + +Here, the winner is the class that is picked the most. So, for example, if yellow is picked twice in OvO 1 and OvO 2, it wins, because neither red and blue can exceed one win anymore (that of OvO 3). + +### One-vs-One in Scikit-learn: OneVsOneClassifier + +Here is a simple example of using `OneVsOneClassifier` i.e. One-vs-One with Scikit-learn. Very similar to the One-vs-Rest setting, we can wrap a linear binary SVM into the wrapper, resulting in a set of classifiers being created, trained and subsequently used for multiclass predictions. Do note again that we are also generating a [confusion matrix](https://www.machinecurve.com/index.php/2020/05/05/how-to-create-a-confusion-matrix-with-scikit-learn/) and [decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) and that by consequence `mlxtend` is required besides `sklearn`. + +``` +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from sklearn.multiclass import OneVsOneClassifier +from sklearn.svm import LinearSVC +from sklearn.model_selection import train_test_split +from sklearn.metrics import plot_confusion_matrix +from mlxtend.plotting import plot_decision_regions + +# Configuration options +num_samples_total = 10000 +cluster_centers = [(5,5), (3,3), (1,5)] +num_classes = len(cluster_centers) + +# Generate data +X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.30) + +# Split into training and testing data +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) + +# np.save('./clusters.npy', X) +X = np.load('./clusters.npy') + +# Create the SVM +svm = LinearSVC(random_state=42) + +# Make it an OvO classifier +ovo_classifier = OneVsOneClassifier(svm) + +# Fit the data to the OvO classifier +ovo_classifier = ovo_classifier.fit(X_train, y_train) + +# Evaluate by means of a confusion matrix +matrix = plot_confusion_matrix(ovo_classifier, X_test, y_test, + cmap=plt.cm.Blues, + normalize='true') +plt.title('Confusion matrix for OvO classifier') +plt.show(matrix) +plt.show() + +# Plot decision boundary +plot_decision_regions(X_test, y_test, clf=ovo_classifier, legend=2) +plt.show() +``` + +Here, we too observe an artificially perfect confusion matrix and decision boundary plot: + +- ![](images/ovo_boundary.png) + +- ![](images/ovo_conf.png) + + +* * * + +## Summary + +In this article, we looked at multiclass SVM classification in Scikit-learn by means of two strategies: the One-vs-Rest and the One-vs-One strategy for multiclass classification. In order to explain this, we first looked at what multiclass classification and SVM classification are, and why they don't mix well natively. The OvR and OvO methods do make multiclass classification possible, though. + +I hope that you have learned something from today's article. Please feel free to leave a comment if you did! If you have other questions or comments, please leave a comment in the comments section as well 💬 Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2005, February 21). _Equidistant_. 
Wikipedia, the free encyclopedia. Retrieved November 11, 2020, from [https://en.wikipedia.org/wiki/Equidistant](https://en.wikipedia.org/wiki/Equidistant) diff --git a/cropping-layers-with-pytorch.md b/cropping-layers-with-pytorch.md new file mode 100644 index 0000000..ae879ae --- /dev/null +++ b/cropping-layers-with-pytorch.md @@ -0,0 +1,143 @@ +--- +title: "Cropping layers with PyTorch" +date: "2021-11-10" +categories: + - "deep-learning" + - "frameworks" +tags: + - "cropping" + - "cropping-layer" + - "deep-learning" + - "machine-learning" + - "neural-network" + - "neural-networks" + - "pytorch" +--- + +Sometimes, you may wish to perform cropping on the input images that you are feeding to your neural network. While strictly speaking a part of data processing in many cases, it can be interesting to _move_ cropping your input data to the neural network itself, because then you might not need to adapt a full dataset in advance. + +In TensorFlow and Keras, cropping your input data is relatively easy, using the [Cropping layers](https://www.machinecurve.com/index.php/2020/02/05/how-to-use-cropping-layers-with-keras/) readily available there. + +In PyTorch, this is different, because Cropping layers are not part of the PyTorch API. + +In this article, you will learn how you can perform Cropping within PyTorch anyway - by using the `ZeroPad2d` layer, which performs zero padding. By using it in an inverse way, we can _remove_ padding (and hence perform cropping) instead of _adding_ it. + +Ready? Let's take a look. 😎 + +* * * + +\[toc\] + +* * * + +## Using `ZeroPad2d` for Cropping + +For creating our Cropping layer, we will be using the `ZeroPad2d` layer that is available within PyTorch. + +Normally, it's used for _adding_ a box of pixels around the input data - which is what padding does. In that case, it's used with _positive_ padding. In the image below, on the left, you can see what happens when it's called with a +1 padding - an extra box of zero-valued pixels is added around the input image. + +Now, what if we used a -1 padding instead? You would expect that padding then works in the _opposite direction_, meaning that a box is not added, but _removed_. And precisely this effect is what we will use for creating a Cropping layer for your PyTorch model. + +![](images/Zero.drawio.png) + +Calling Zero Padding with a positive padding results in a zero-valued box of pixels being added to your input image. Using a negative padding removes data from your image. + +* * * + +## Full Cropping layers example using PyTorch + +Let's now take a look at how we can implement `ZeroPad2d` for generating a Cropping layer with PyTorch. First, it's time to write down our imports. + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +import matplotlib.pyplot as plt +``` + +These are relatively straight-forward: there are many `torch` related imports, which are explained in our articles on PyTorch based networks [such as the ConvNet](https://www.machinecurve.com/index.php/2021/07/08/convolutional-neural-networks-with-pytorch/). + +Time to move forward with the `CroppingNetwork`. 
Here it is: + +``` +class CroppingNetwork(nn.Module): + ''' + Simple network with one Cropping layer + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.ZeroPad2d(-1), + nn.ZeroPad2d(-1), + nn.ZeroPad2d(-1), + nn.ZeroPad2d(-1), + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) +``` + +It is actually really simple! By specifying `nn.ZeroPad2d` with a cropping size of `-1`, we remove 1 column of pixels on the left, 1 on the right, as well as a row from the top and the bottom of the image. + +Our input images - MNIST images - have an input shape of `(1, 28, 28)` - or `(28, 28)` when we reshape them. Since we repeat the layer four times, we remove 4 pixels from the left, 4 from the right, 4 from the top and 4 from the bottom. This means that the shape of our outputs will be `(20, 20)`. + +What remains is stitching everything together: + +``` +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare CIFAR-10 dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the CroppingNetwork + croppingnet = CroppingNetwork() + + # Iterate over some samples + for i, data in enumerate(trainloader, 0): + + # Unpack inputs and targets + inputs, targets = data + + # Feed samples through the network + cropped_samples = croppingnet(inputs) + + # Reshape the samples + reshaped_original = inputs[i].reshape(28, 28) + reshaped_cropped = cropped_samples[i].reshape(20, 20) + fig, (ax1, ax2) = plt.subplots(1, 2) + fig.set_size_inches(9, 5, forward=True) + fig.suptitle('Original sample (left) and Cropped sample (right)') + ax1.imshow(reshaped_original) + ax2.imshow(reshaped_cropped) + plt.show() +``` + +The code above uses the PyTorch `DataLoader` for loading the first minibatch of samples, feeds them through the `CroppingNetwork`, and visualizes the results. + +* * * + +## Examples of PyTorch cropping layers + +And here they are - some examples of what is produced by the cropping network: + +- ![](images/3.png) + +- ![](images/2.png) + +- ![](images/1.png) + + +* * * + +## References + +PyTorch. (n.d.). _ZeroPad2d — PyTorch 1.10.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.ZeroPad2d.html](https://pytorch.org/docs/stable/generated/torch.nn.ZeroPad2d.html) diff --git a/dall-e-openai-gpt-3-model-can-draw-pictures-based-on-text.md b/dall-e-openai-gpt-3-model-can-draw-pictures-based-on-text.md new file mode 100644 index 0000000..58c7ab6 --- /dev/null +++ b/dall-e-openai-gpt-3-model-can-draw-pictures-based-on-text.md @@ -0,0 +1,102 @@ +--- +title: "DALL·E: OpenAI GPT-3 model can draw pictures based on text" +date: "2021-01-05" +categories: + - "deep-learning" +tags: + - "dall-e" + - "dalle" + - "gpt" + - "gpt-3" + - "openai" + - "transformer" + - "transformers" +--- + +In 2020, the GPT-3 model created by OpenAI created big headlines: it was capable of generating text that could not be distinguished from _human-written text_. In addition, Microsoft acquired an exclusive license to the model, possibly integrating it with its cloud services for text generation. + +GPT-3, however, cannot only be used for text purposes. Recently, we have seen the emergence of Transformers for Computer Vision. Today, in a blog post at OpenAI.com, DALL·E was announced. 
The model, which is named after Salvador Dalí and Pixar's WALL·E, is capable of generating high-quality images based on text.

We've ploughed through the blog article to understand how it works. In this article, you'll therefore find what DALL·E is capable of, how it works, and how it was trained. We've also brainstormed about a few possible applications for DALL·E.

We're still awaiting the publication of the DALL·E paper, but let's already take a look! 😎

* * *

\[toc\]

* * *

## What DALL·E does

Suppose that you need to generate images. Previously, you'd hire an artist, who would take your requirements and generate the image in return. Or, if you wanted a photograph that looked professional, you'd hire a photographer, tell him or her what to do, and await the results.

With DALL·E, you can instead give the requirements to the Artificial Intelligence model and get the result back. For example, as available in [OpenAI's blog article](https://openai.com/blog/dall-e/) (a really recommended read - more examples can be found there):

- The query **an illustration of a baby daikon radish in a tutu walking a dog** gives, well, exactly the result you want.
- **A store front that has the word 'openai' written on it** also gives awesome results.

![](images/image-2.png)

Source: [OpenAI (2021)](https://openai.com/blog/dall-e/)

* * *

## How DALL·E works

DALL·E is based on the GPT-3 model that we have heard a lot of buzz about in the past few months. This model, which is an extension of GPT-2, which in turn extends [GPT](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/) itself, [autoregressively](https://www.machinecurve.com/index.php/2020/12/29/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning/) learns to build an understanding of natural language. This understanding can subsequently be used for downstream tasks like [text summarization](https://www.machinecurve.com/index.php/2020/12/21/easy-text-summarization-with-huggingface-transformers-and-machine-learning/) or [question answering](https://www.machinecurve.com/index.php/2020/12/21/easy-question-answering-with-machine-learning-and-huggingface-transformers/).

### About GPT-3 and previous approaches

Previous approaches like [BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) and the original GPT model followed the _fine-tuning approach_. Here, the model was first pretrained on massive datasets that are unlabeled (e.g. the BooksCorpus dataset, or the English Wikipedia dataset), which allows it to build up an unguided understanding of natural language. It could then be finetuned to a specific language task by means of some labeled, but smaller, dataset.

GPT-2 and GPT-3 recognized that, even while pretraining already provided lots of benefits compared to training from scratch, so-called zero-shot learning - where the pretrained model is applied to language tasks directly, without any task-specific finetuning - could be the way forward. The creators of these successive models argued that pretrained models could build sufficient language understanding to be used in downstream applications directly. And they succeeded: GPT-3 is capable of generating human-like language. This does however come at a cost: the models are _huge_. So huge that they cannot be used normally in practice. But diving into this is beyond the scope of this article.
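To make the difference between finetuning and simply _using_ a pretrained model a bit more tangible, here is a minimal sketch - not part of OpenAI's announcement, and assuming you have the HuggingFace Transformers library installed - that lets a pretrained GPT-2 model continue a prompt without any task-specific finetuning:

```
# Illustrative only: text generation with a pretrained (not finetuned) GPT-2 model,
# using the HuggingFace Transformers library (pip install transformers).
from transformers import pipeline

# Load a text generation pipeline backed by the pretrained GPT-2 model
generator = pipeline('text-generation', model='gpt2')

# Let the pretrained model continue a prompt - no task-specific finetuning involved
outputs = generator('Machine learning is', max_length=30, num_return_sequences=2)
for output in outputs:
    print(output['generated_text'])
```

With that context in mind, let's get back to DALL·E.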
+ +### Specific implementation of GPT-3 for DALL·E + +Like GPT-3, DALL·E is based on the [Transformer architecture](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/). This architecture, which was originally proposed back in 2017, has changed the field of Natural Language Processing. The DALL·E model, during pretraining, receives two sequences of data of at max 1280 tokens: both the **text** as well as the **image** (OpenAI, 2021). + +It is then trained using maximum likelihood, predicting the tokens in a sequence, in some sort of a Language Modeling task (OpenAI, 2021). + +As we can see [in the article](https://openai.com/blog/dall-e/), DALL·E is capable of performing a variety of tasks: + +- **Controlling attributes**, instructing the model what particular attributes of an object should look like. For example: "a collection of glasses is sitting on a table" (OpenAI, 2021). Here, we instruct the model about the glasses, and more precisely, their location. +- **Drawing multiple objects** is also possible, but is more challenging, because it can be unknown whether certain characteristics belong to one object or another (OpenAI, 2021). DALL·E is however also capable of performing that task, but at the risk of making mistakes - once again due to the issue mentioned previously. The success rate decreases rapidly when the number of objects increases. +- **Visualizing perspective and three-dimensionality**, meaning that DALL·E can be instructed to take a particular "perspective" when generating the image (OpenAI, 2021). +- **Visualizing across many levels**, from "extreme close-up" to "higher-level concepts" (OpenAI, 2021). +- **Inferring context**, meaning that particular elements can be added to an image that normally do not belong to a particular context (e.g. the OpenAI logo in the image above; this is normally not displayed on a store front). + +* * * + +## Possible applications for DALL·E + +We can come up with a wide variety of applications for the new DALL·E model: + +- **Industrial and interior design**, to aid designers when creating a variety of household and other objects. +- **Architecture**, to guide the creation of buildings and other forms of constructions. +- **Photography**, to create an image specifically tailored to one's requirements. +- **Graphic design**, with e.g. the creation of a variety of icons. + +![](images/image-3.png) + +How DALL·E can be used in industrial and interior design: an armchair in the shape of an avocado. Source: [OpenAI (2021)](https://openai.com/blog/dall-e/) + +* * * + +## Summary + +DALL·E is a GPT-3 based model that can use text for the creation of images. OpenAI published about the model in January 2021, spawning yet another possibility to use GPT-3 in practice. + +In this article, we first looked at what DALL·E is. Named after Salvador Dalí and Pixar's WALL·E movie, we saw that it can indeed be used for image creation. Then, when taking a look at how it works, we saw that it is not _so_ different from the original GPT-3 model. Whereas the latter utilizes textual inputs in a language modelling task, DALL·E jointly inputs text and images in a fixed-length sequence to learn how to generate the images. + +[OpenAI's article](https://openai.com/blog/dall-e/) gives you the opportunity to create many images yourself. Go check it out. It's really awesome! 😎 + +* * * + +## References + +Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). 
[Language models are unsupervised multitask learners.](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) _OpenAI blog_, _1_(8), 9. + +Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Agarwal, S. (2020). [Language models are few-shot learners](https://arxiv.org/abs/2005.14165). _arXiv preprint arXiv:2005.14165_. + +OpenAI. (2021, January 5). _DALL·E: Creating images from text_. [https://openai.com/blog/dall-e/](https://openai.com/blog/dall-e/) diff --git a/dialogpt-transformers-for-dialogues.md b/dialogpt-transformers-for-dialogues.md new file mode 100644 index 0000000..448aaa1 --- /dev/null +++ b/dialogpt-transformers-for-dialogues.md @@ -0,0 +1,15 @@ +--- +title: "DialoGPT: Transformers for Dialogues" +date: "2021-03-16" +categories: + - "buffer" + - "deep-learning" +tags: + - "dialogpt" + - "dialogue" + - "machine-learning" + - "text" + - "transformer" +--- + +DialoGPT is “a tunable gigaword-scale neural network model for generation of conversational responses, trained on Reddit data”. It uses a Transformer based architecture for doing so, because of their great empirical success. Doing so, the creators have attempted to resolve challenges present with neural response generation – i.e. generating texts relevant to the prompt. These are related to the fact that conversations are informal, noisy, and contain abbreviations or errors. diff --git a/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning.md b/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning.md new file mode 100644 index 0000000..0f43c28 --- /dev/null +++ b/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning.md @@ -0,0 +1,213 @@ +--- +title: "Differences between Autoregressive, Autoencoding and Sequence-to-Sequence Models in Machine Learning" +date: "2020-12-29" +categories: + - "deep-learning" +tags: + - "autoencoder" + - "autoencoding" + - "autoregressive" + - "deep-learning" + - "machine-learning" + - "seq2seq" + - "sequence-to-sequence-learning" + - "transformers" +--- + +Transformers have changed the application of Machine Learning in Natural Language Processing. They have replaced [LSTMs](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/) as state-of-the-art (SOTA) approaches in the wide variety of language and text related tasks that can be resolved by Machine Learning. + +However, as we have seen before when paradigms shift towards different approaches, one breakthrough spawns a large amount of research and hence a large amount of small improvements. For example, we have seen this with ConvNets in computer vision: after the introduction of AlexNet in 2012, which won the ImageNet competition with an unprecedented advantage, a wide variety of convolutional architectures has been proposed, tested and built for image related tasks. + +The same is true for Transformers: after the 2017 work by [Vaswani et al.](https://arxiv.org/abs/1706.03762) changing the nature of sequence-to-sequence models, many different architectures have seen the light of day. + +However, what these extensions have in common is that they use a wide variety of terms to describe all parts of the model. 
When you read related papers, you'll find that some models are called _**autoregressive**_, that others are called _**autoencoding**_, or _**sequence-to-sequence**_. As a beginner, this can be confusing, because when you are trying to understand Transformers, you're going to compare everything with the basic Vaswani Transformer. + +And precisely that is why this article covers the **overlap** and **differences** between these three encoder-decoder architectures. We'll first cover the basics of encoder-decoder architectures in order to provide the necessary context. This also includes a brief coverage of the classic or vanilla Transformer architecture. Then, we move on to autoregressive models. We'll subsequently cover autoencoding models and will see that when combined, we get Seq2Seq or sequence-to-sequence models. Multimodal and retrieval-based architectures are covered finally, before we summarize. + +Ready? Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## Introducing encoder-decoder architectures + +In this article, we're going to take a look at the overlap and differences between three (general) model architectures used in Natural Language Processing. In order to do this, we'll first have to take a look at progress in so-called **encoder-decoder architectures**, because every architecture type is related to this way of thinking. + +Encoder-decoder architectures are composed of an **encoder** and a **decoder**. The encoder is capable of taking inputs, for example sentences (sequences) written in German, and mapping them to a high-dimensional representation. The encoder here learns which parts of the inputs are important and passes them to the representation, while the less-important aspects are left out. We cannot understand the representation easily, because there are no semantics involved, as the mapping is learned. + +However, if we add a decoder to the architecture, we can convert the high-dimensional representation into another sequence. This sequence can for example be a sentence written in English. Adding an encoder and a decoder allows us to build models that can transduce (i.e. map without losing semantics) 'one way' into 'another', e.g. German into English. By training the encoder and decoder together, we have created what is known as a sequence-to-sequence model. If we train one part only, we get either an autoregressive or an autoencoding model. We'll cover each now. + +[![](images/Diagram-33-1024x352.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/Diagram-33.png) + +* * * + +## What are Seq2Seq models? + +A **sequence-to-sequence model** is capable of ingesting a sequence of a particular kind and outputting another sequence of another kind. In general, it's the model architecture visualized above. Such models are also called Seq2Seq models. + +There are many applications of performing sequence-to-sequence learning. + +> Sequence to sequence learning has been successful in many tasks such as machine translation, speech recognition (...) and text summarization (...) amongst others. +> +> Gehring et al. (2017) + +While this is not strictly necessary (e.g. think vanilla RNNs), most contemporary Seq2Seq models make use of an encoder-decoder architecture. In this architecture, an encoder is trained to convert input sequences into a hidden representation. Often, this is a [high-dimensional hidden state vector](https://www.machinecurve.com/index.php/2019/12/26/how-to-visualize-the-encoded-state-of-an-autoencoder-with-keras/). 
+ +Subsequently, a trained decoder is applied, which is capable of changing the hidden state vector into some desired output. + +By chaining the encoder and decoder together into one Machine Learning task, e.g. for translating using German inputs and English outputs, the encoder and decoder's weight matrices jointly learn to perform the transduction task. + +> The primary components \[of a Seq2Seq model\] are one encoder and one decoder network. The encoder turns each item into a corresponding hidden vector containing the item and its context. The decoder reverses the process, turning the vector into an output item, using the previous output as the input context. +> +> Wikipedia (2019) + +### Seq2Seq made visual + +More visually, this looks as follows. Say that we've got an input sequence of four tokens, e.g. a tokenized version of the phrase "I am going home". When feeding this sequence to the encoder, it'll generate a high-dimensional representation. Through the training process, it has been trained to do so. + +![](images/Diagram-34-1024x353.png) + +We can then feed the high-dimensional representation into the decoder, which once again generates a tokenized sequence. For example, in the use case of translation, this can be "Je vais à la maison", or _I am going home_ in French. + +![](images/Diagram-36-1024x353.png) + +### Original Transformer is a Seq2Seq model + +In a different article, [we introduced the original Transformer architecture](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/), as proposed by Vaswani et al. back in 2017. Below, you will find a visualization of its architecture. Even though the flow is more vertical than in the example above, you can see that it is in essence an encoder-decoder architecture performing sequence-to-sequence learning: + +- We have **N encoder segments** that take inputs (in the form of a learned embedding) and encode it into a higher-dimensional intermediate representation (in the case of the original Transformer, it outputs a 512-dimensional [state vector](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/#vanilla-transformers-use-learned-input-embeddings)). It takes either the previously encoded state as its input, or the source sequence (i.e., the phrase in English). +- We have **N decoder segments** that take the final encoded state as the input, as well as the output of either the previous decoder segment or the target input sequence (i.e., the phrase in French). + +The encoder segments ensure that the inputs are converted into an abstract, high-dimensional intermediate representation. The decoder segments take this representation providing context about the input as well as the target sequence, and ensure that appropriate sequences in a target language can be predicted for those in a source language. + +The original Transformer model, a.k.a. _vanilla_ or _classic_ Transformers, is therefore a Sequence-to-Sequence model. + +![](images/Diagram-32-1-1024x991.png) + +Source: [Introduction to Transformers in Machine Learning](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/), based on Vaswani et al. (2017) + +* * * + +## What are Autoregressive models? + +Sequence-to-Sequence models are traditionally used to convert entire sequences from a target format into a source format. It's a performed transformation at the sequence level, and it applies to each and individual token. 
+ +There are however more tasks within Natural Language Processing. One of these tasks is the generation of language, or in more formal terms Natural Language Generation (NLG). It is quite difficult to generate text with a model that is capable of converting sequences, as we simply don't know the full sequence yet. That's why a different approach is necessary. + +The answer to creating a model that can generate text lies in the class of **autoregressive models**. + +> A statistical model is autoregressive if it predicts future values based on past values. For example, an autoregressive model might seek to predict a stock's future prices based on its past performance. +> +> Investopedia (n.d.) + +In the statistics oriented but applicable definition above, you'll already read what is key to text generation: using past values for predicting future values. Or, in other words, using words predicted in the past for predicting the word at present. + +An autoregressive model can therefore be seen as a model that utilizes its previous predictions for generating new ones. In doing so, it can continue infinitely, or - in the case of NLP models - until a stop signal is predicted. + +### Autoregressive Transformers + +[![](images/Diagram-37.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/Diagram-37.png) + +The GPT architecture (based on Radford et al., 2018) + +After studying the original Transformer proposed by Vaswani et al. (2017), many researchers and engineers have sought for methods to apply autoregression with Transformers as well. + +And they succeeded: Transformers can actually be used for autoregression and hence for text generation. + +The class of Transformers called **GPT** (indeed, even [GPT-2](https://openai.com/blog/better-language-models/) and [GPT-3](https://en.wikipedia.org/wiki/GPT-3)) is autoregressive (Radford et al., 2018). GPT is heavily inspired by the decoder segment of the original [Transformer](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/), as we can see in the visualization on the right. + +- The input is first embedded. This embedding is a matrix (_position embedding matrix_) and hence the actual input is a vector with multiple tokens (meaning that it can be used time and time again, i.e., have an autoregressive property). +- 12 decoder segments with masked multi-head attention segments, feedforward segments, and layer normalization segments interpret the input values. +- The output can be a text prediction; in that case, the task is to model language. However, it can also be used for other tasks, such as similarity detection and multiple choice answering. + +By means of pretraining, the model learns to model language. It can subsequently be fine-tuned for the additional tasks mentioned above. + +* * * + +## What are Autoencoding models? + +Autoregressive models are very good when the goal is to model language - i.e., to perform Natural Language Generation. However, there is another class of tasks that does not benefit from autoregressive models. It does neither benefit from Seq2Seq models. We're talking about Natural Language Understanding activities. + +- While Seq2Seq models are required to understand language, they use this understanding to perform a different task (usually, translation). +- Natural Language Generation tasks and hence autoregressive models do not necessarily require to _understand_ language if generation can be performed successfully. + +**Autoencoding models** can help here. 
+ +> The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name. +> +> Wikipedia (2006) + +> Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original sentence. +> +> HuggingFace (n.d.) + +### Autoencoding Transformers + +An example of an autoencoding Transformer is the BERT model, proposed by Devlin et al. (2018). It first corrupts the inputs and aims to predict the original inputs and by consequence learns an encoding that can be used for downstream tasks. + +That's precisely the dogma used with BERT-like models: pretrain on an unsupervised dataset, after which it becomes possible to fine-tune the model on downstream tasks such as [question answering](https://www.machinecurve.com/index.php/2020/12/21/easy-question-answering-with-machine-learning-and-huggingface-transformers/). + +* * * + +## Autoregressive vs autoencoding depends on the task and training, not on the architecture + +While so far we have gained some understanding about Seq2Seq, autoregressive and autoencoding models, for me, there was still some unclarity when I was at this point. + +If autoencoder models learn an encoding, why can autoregressive models then be used for fine-tuning as well? + +The answer is simple. Whether a model is Seq2Seq, autoregressive or autoencoding does **not depend on the architecture**. The decoder segment of the original Transformer, traditionally being used for autoregressive tasks, can also be used for autoencoding (but it may not be the smartest thing to do, given the masked nature of the segment). The same is true for the encoder segment and autoregressive tasks. Then what makes a model belong to a particular type? + +It's **the task that is solved**, as well as the **type of training** (HuggingFace, n.d.). + +- If the idea is that the model as a whole transducts (i.e. transforms without altering semantics) one sequence into another, then we're talking about a **Seq2Seq model**. +- If the idea is that you learn an encoded representation of the inputs by corrupting inputs and generating the original variants, we're talking about an **autoencoding model**. +- If the idea is that you use all previous predictions for generating the next one, in a cyclical fashion, we're talking about an **autoregressive model**. + +> Note that the only difference between autoregressive models and autoencoding models is in the way the model is pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models. When a given model has been used for both types of pretraining, we have put it in the category corresponding to the article where it was first introduced. +> +> HuggingFace (n.d.) + +Hopefully, this makes things a bit more clear. + +* * * + +## Summary + +Transformers have significantly changed the way Machine Learning is applied in Natural Language Processing, for a variety of tasks. However, there is also a large amount of terms used within the literature about these Transformers - Seq2Seq models, autoregressive models, and autoencoder models. + +In this article, we looked at these terms in more detail. Firstly, we looked at the concept of encoder-decoder architectures. 
Through using two segments with an intermediary representation, we can build models that perform a wide variety of NLP tasks. + +This was followed by looking at the concept of Seq2Seq models. We saw that when one sequence is fed to a model that produces another sequence, we call it a Seq2Seq model. Often, but not strictly necessary, these models are built following the idea of an encoder-decoder architecture. + +Autoregressive models take the previous predictions to generate a new prediction. Training them therefore involves a language modelling task: models have to learn a language and interdependencies between words, phrases, including semantics. Text generation is a classic task that is performed with autoregressive models. + +Autoencoding models corrupt textual inputs and generate the original inputs in return. The result is an encoding that can be used for additional downstream tasks, such as question answering. + +What makes things a bit more confusing is that saying whether we're performing a _Seq2Seq,_ an _autoregressive_ or an _autoencoding_ task does not depend on the architecture. Many state-of-the-art approaches such as GPT and BERT simply use parts of the original Transformer architecture. Rather, they adapt the training task to the task they want to perform: text generation or text understanding. Hence, whether a model is autoregressive or autoencoding therefore depends mostly on the task and by consequence the type of training. + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope that you have learned something from this article. If you did, please feel free to leave a comment in the comments section 💬 I'd love to hear from you. Please do the same if you have remarks or suggestions for improvement. If you have questions, please click the **Ask Questions** button above, or leave a message below. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +HuggingFace. (n.d.). _Summary of the models — transformers 4.1.1 documentation_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/transformers/model\_summary.html](https://huggingface.co/transformers/model_summary.html) + +Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008. + +Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017). [Convolutional sequence to sequence learning](https://arxiv.org/abs/1705.03122). _arXiv preprint arXiv:1705.03122_. + +Wikipedia. (2019, December 17). _Seq2seq_. Wikipedia, the free encyclopedia. Retrieved December 29, 2020, from [https://en.wikipedia.org/wiki/Seq2seq](https://en.wikipedia.org/wiki/Seq2seq) + +Investopedia. (n.d.). _What does autoregressive mean?_ [https://www.investopedia.com/terms/a/autoregressive.asp](https://www.investopedia.com/terms/a/autoregressive.asp) + +Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). [Improving language understanding by generative pre-training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf). + +Wikipedia. (2006, September 4). _Autoencoder_. Wikipedia, the free encyclopedia. Retrieved December 29, 2020, from [https://en.wikipedia.org/wiki/Autoencoder](https://en.wikipedia.org/wiki/Autoencoder) + +Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). 
[Bert: Pre-training of deep bidirectional transformers for language understanding](https://arxiv.org/abs/1810.04805). _arXiv preprint arXiv:1810.04805_. diff --git a/distributed-training-tensorflow-and-keras-models-with-apache-spark.md b/distributed-training-tensorflow-and-keras-models-with-apache-spark.md new file mode 100644 index 0000000..4d72518 --- /dev/null +++ b/distributed-training-tensorflow-and-keras-models-with-apache-spark.md @@ -0,0 +1,404 @@ +--- +title: "Distributed training: TensorFlow and Keras models with Apache Spark" +date: "2020-10-22" +categories: + - "deep-learning" + - "frameworks" +tags: + - "apache-spark" + - "big-data" + - "deep-learning" + - "distributed-training" + - "machine-learning" + - "neural-networks" + - "parallelism" + - "tensorflow" +--- + +Ever since that particular breakthrough in 2012, deep learning has been an important driver of today's buzz about Artificial Intelligence. And in some areas, it absolutely deserves applause - for example, [convolutional neural networks](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) have spawned really great applications of computer vision: + +- [Snagging Parking Spaces with Mask R-CNN and Python: Using Deep Learning to Solve Minor Annoyances](https://medium.com/@ageitgey/snagging-parking-spaces-with-mask-r-cnn-and-python-955f2231c400) +- [How to Get Beautiful Results with Neural Style Transfer](https://towardsdatascience.com/how-to-get-beautiful-results-with-neural-style-transfer-75d0c05d6489) +- [Scaling Machine Learning at Uber with Michelangelo](https://eng.uber.com/scaling-michelangelo/) + +...and there are many more! + +Despite the progress made so far, deep learning is still a computationally expensive field. Training neural networks [involves feeding forward data, computing the error or loss and subsequently optimizing](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) the model with [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or [adaptive optimizers](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/). + +Especially in settings where your model is large - and deep learning models _are_ large, with sometimes hundreds of layers in just one model, yielding millions and millions of trainable parameters - training a model does (1) take a lot of time, (2) requires GPU powered resources which are expensive, and (3) consumes a lot of electricity. + +Now, what if we can take a different approach to deep learning? What if we can take our existing TensorFlow and Keras deep learning models and run them in a distributed way - that is, we don't do all computations on one heavy machine, splitting all the work across many cheaper and less powerful ones? We'll take a look at doing so in this article 😀 + +More specifically, we'll look at a few things. Firstly, we discuss the difference between _small data and big data_. Subsequently, we introduce Apache Spark - which is a well-known framework among data engineers for processing big data in a distributed way. Then, we'll cover a range of Spark extensions for running your TF/Keras models in a distributed way. In doing so, we also give examples, as well as our observations. We finally discuss our experiences with regards to their maturity. 
+ +It's promising to become quite a read, but I'm sure that it'll provide you with a nice overview of what's out there. Let's go! 😎 + +**Update 07/Jan/2021:** the [Elephas project](https://www.machinecurve.com/index.php/2020/10/22/distributed-training-tensorflow-and-keras-models-with-apache-spark/#elephas-distributed-deep-learning-with-keras-spark) was taken over by [@danielenricocahall](https://github.com/danielenricocahall). References were updated to accommodate for this. In addition, the new Elephas release now also supports regression models 🚀 This was adapted in the text. + +* * * + +\[toc\] + +* * * + +## Small data versus big data + +Data is very hot these days. + +So hot that I keep hearing people pouring terms like _big data, deep learning, machine learning, Artificial Intelligence_... sometimes asking myself whether people truly understand what they are talking about. + +Well, back on-topic: we're going to look at the differences between small data and big data - another buzzword that is very common these days. + +Despite the buzz, big data is really a thing, and must be treated as such. + +But what is it? And how is it different from small data? Let's take a look. + +### Small data + +Data and databases play a big role in our life today, and it's likely that many people are unaware of it. + +Are you using an ERP system at work, or a CRM system? They are supported by databases (and often very expensive proprietary ones - smart business models). + +Even MachineCurve runs on top of a database, which stores my articles, and serves them to you - the reader - when necessary (hopefully today included :) ). + +Databases are traditionally built in a relational way, meaning that commonday objects are modeled into their generic form (a "class" or "entity"), and that relationships between objects of those entities are possible. + +To make things a little bit less abstract, I always use the example of a school bus. + +Suppose that there are two school buses with which people are brought to school: a yellow one, as we can all visualize a school bus to be, and a purple one, which is a bit... weird, but well. + +Both buses exist in the real world, making them and "object" - just generically speaking, a _thing_. What do they share in common? Indeed, that they are composed of many similar ingredients (wheels, windows, ...), but also that both are a "school bus". That's the _class_ of objects, and `SchoolBus` could thus be an entity in our relational model. + +Now, say that we have 10 students. While they are all very different people (objects), they can all be gathered under the `Student` class or entity. What's more, we can assign Students to a SchoolBus - which is precisely why those are called _relational_ models. + +The benefits of relational data models are that the relationships can be checked to be valid. This reduces errors within the data structure, or in plain English the odds that a student is not assigned to a school bus by accident and is left standing in the rain. + +The disbenefit of relational data models and by consequence databases is... precisely the same thing. The fact that the checks must be done means that the database must be locked for very brief amounts of time every time... which is unacceptable with today's vast quantities of data. + +Different solutions are necessary. + +### Big data + +Here, technologies that can be shared under the umbrella term of _big data_ come in. 
Indeed, that's a widely used term and often a bit overhyped, but still, the technologies are very tangile and _really_ useful when your datasets can no longer be stored or processed on just one machine. + +Over the past few years, a variety of big data technologies has emerged - all with different tasks in the big data landscape. For example, Hadoop is a distributed file system that composes a set of commodity machines which altogether, and in a smart way, represent your data redundantly. This makes it very resilient against failure. In recent years, we have also seen object storage - most notably S3 and S3 compatible types of storage - rise to power, sometimes even taking over Hadoop based big data storage. + +For processing, many people are familiar with Apache Spark. By creating what is called a Resilient Distributed Dataset, and allowing engineers to apply MapReduce principles to processing data, Spark runs processing jobs on a variety of commodity machines - just like Hadoop, but then for Compute tasks rather than Storage tasks. + +Over the years, many other tools such as Apache NiFi, Apache Ambari and Apache Airflow as well as a variety of proprietary tools / cloud based services (often based off Apache tooling!) have emerged for other tasks, such as metadata monitoring and ETL jobs in case of batch processing. + +Now, this is no big data article, so let's take a look at how Machine Learning is related to this discussion about big data. + +### Machine learning: small and big data based ML + +If your dataset is _small_, that is - it fits on the disk of the machine that you're using and, as we're talking about machine learning, in the memory of your machine as well - then there is no problem related to training your model. + +In fact, by simply installing Keras, it is possible to train a variety of models [like this classifier](https://www.machinecurve.com/index.php/2020/10/20/tutorial-building-a-hot-dog-not-hot-dog-classifier-with-tensorflow-and-keras/). + +Life becomes more difficult when your dataset gets bigger. Fortunately, the machine learning community comes to the rescue with a concept called _distributed training_. + +What is distributed training, you may now ask. + +Very simple: rather than performing the entire training process on one machine/GPU, it is _spread_ or _distributed_ across many different ones. There is a variety of general distribution strategies that can be applied: + +- An **on-machine distribution strategy**, where the machine has multiple GPUs, which are used in parallel for training the machine learning model; +- An **across-machine distribution strategy**, where the machine has one GPU, but many machines are used in parallel for training the machine learning model; +- A **best-of-both-worlds distribution strategy**, where multiple machines with multiple GPUs are employed for training your machine learning model; +- A **big data-powered distribution strategy**, where a batch data processing framework from the big data field is employed for distributing the training operations. + +In today's article, we will focus on the latter strategy. The other distributed strategies can be employed on your machine, should you have a _beast_ on-premise, [or in the cloud](https://www.machinecurve.com/index.php/2020/10/16/tensorflow-cloud-easy-cloud-based-training-of-your-keras-model/), if you don't. 
The big data-powered distribution strategy that we will look at today will benefit greatly from Apache Spark and the way it distributes processing jobs across a variety of commodity machines. + +Let's first take a look at what Apache Spark is, what it does and how its properties can possibly benefit machine learning too. Then, we'll take a look at a variety of modules built on top of Spark / making use of Spark, which can be used for training your machine learning model in a distributed way. + +* * * + +## Introducing Apache Spark + +If you're a bit old, like me, you know what an FTP server is - indeed, it's that place where you put a variety of files in case you were building a website. Standing for File Transfer Protocol, did you know that its use cases extend beyond websites alone? I mean: it's also possible to _put files_ on an FTP server, so that someone else can _get them off_ again. + +While FTP is increasingly being replaced by S3 storage, it's still a very common method for transferring files from one place to another. + +Say that you have an FTP server where many CSV files are put on - the CSV files, here, report millisecond measurements of some chemical experiments performed in your laboratory (I just had to come up with something). Hence, they are quite big in relative terms: hundreds of megabytes worth of CSV data per file is nothing special. The quantity with which those files flow into your FTP server is also relatively large: many files are stored on the server every few minutes or so. + +The task of processing those files and generating metadata for summary reports is up to you. That's quite a challenging task given the quantity constraints in terms of _number of files_ and _size of the files_. + +If you would approach this problem naïvely, it's likely that you would write a program that reads the FTP server every minute or so, checks which files are new, and then processes them sequentially. While this approach is simple, it's not scalable, and your program will likely be slower than the files flowing in. Even when you think you're smart and apply Python based parallelism, the problem likely persists. Processing all the rows into a summary report is simply impossible with this quantity. + +Enter Apache Spark! Being one of the crown jewels in your big data technology landscape, it is the perfect tool for processing those files into a report. But how does it do that? And how does it fit in that data landscape? Let's take a look. + +![](images/Apache_Spark_logo.svg_.png) + +First of all, the landscape. Apache Spark is used for _processing_. It works best if connected to big data compatible file storage such as Hadoop or S3. In our case, we would want to connect Spark to S3. This means that first, all files need to be transferred to S3 storage - which can be done automatically by a workflow management system like Apache Airflow. Once in S3 storage, Spark can do its job. + +Second of all, the how. How does Apache Spark make sure that it can do what your naïve approach can't? Although precisely explaining how Spark works takes too much time here (especially since this article focuses on machine learning), according to Wikipedia: + +> Apache Spark has its architectural foundation in the resilient distributed dataset (RDD), a read-only [multiset](https://en.wikipedia.org/wiki/Multiset) of data items distributed over a cluster of machines, that is maintained in a [fault-tolerant](https://en.wikipedia.org/wiki/Fault-tolerant_computing) way. 
>
> Wikipedia (2012)

The benefit of Spark is that it is capable of generating a Resilient Distributed Dataset, or RDD, which is a multiset (a replicated set) distributed over a cluster of machines. This essentially allows for the same benefits for Compute as are available to Storage with e.g. Hadoop: using a large number of commodity (i.e., cheap) machines, large-scale data processing can take place. Because the data is stored in a multiset fashion, resiliency is built in and data loss is not much of a problem. In addition, because it is built in a particular way (Google for RDD lineage), Spark ensures that it's fast too.

The fact that Spark is capable of distributing processing jobs into essentially small packages makes one wonder whether it cannot be used for _machine learning_ too! The reasoning why this could possibly work is simple: supervised machine learning [is essentially solving an optimization problem, iteratively.](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) Feeding forward the training samples produces an error score, which is subsequently used to optimize the [weights](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/) of the individual neurons. It should therefore be possible to distribute this process over many machines.

Is it? Let's take a look at some methods for distributing deep learning using Apache Spark.

* * *

## Distributing deep learning: training your models on Apache Spark

Today, we'll take a look at the following extensions that make it possible to train your machine learning models with Apache Spark:

- Elephas
- CERN dist-keras
- Intel Analytics BigDL
- Apache Spark SystemML's Keras2DML
- Databricks Spark Deep Learning
- Yahoo TensorFlowOnSpark

### Elephas: Distributed Deep learning with Keras & Spark

![](images/elephas-logo.png)

Let's begin with **Elephas**. Freely available on [GitHub](https://github.com/danielenricocahall/elephas), with an open source license - I am always a fan of open source tools - it can be described as follows:

> Elephas is an extension of Keras, which allows you to run distributed deep learning models at scale with Spark.
>
> Elephas (n.d.)

We all know that Keras makes creating a deep learning model incredibly simple if you know what you are doing. With the latter, I of course mean: if you know which concepts to apply in which order, and if you know how the parameters and hyperparameters of the model must be configured. It is therefore not surprising that TensorFlow 2.x utilizes `tensorflow.keras` as the main API towards creating machine learning models. Its deep integration is truly symbiotic.

Now, Elephas - which pretty much attempts to extend Keras, as even becomes clear from its logo - "brings deep learning with Keras to Spark" (Elephas, n.d.). Recognizing the usability and simplicity of the Keras library, it implements algorithms that use Spark's concept of an RDD and dataframes to train models in a parallel fashion. More specifically, it does so in a data-parallel way.

Data-parallel? That's a term that we haven't seen yet.
#### Data parallelism vs model parallelism

In a great article that should be read by those who have a background in maths, Mao (n.d.) argues that [batch gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) can produce big oscillations in the computed gradients, especially because it works with batches of data - and such batches do not necessarily follow the same distribution.

This lengthens the road towards convergence.

Now, one naïve attempt to fix this is by increasing the size of your batches. While this will work for some time, you will not completely avoid the problem and will eventually run into issues with respect to hardware constraints - especially memory constraints, for those who haven't experienced such issues before.

Data parallelism is a way of overcoming this issue by making use of the law of large numbers, which dictates that if you consider a computed gradient as a sample, a sufficiently large set of samples should - when averaged - produce a gradient that is closer to the population mean, i.e., to the _true_ best gradient for a particular point in time. Data parallelism implements this by parallelizing the computation of gradients across batches and then averaging them, using the averaged gradient for the weight update. Each batch is trained on a different GPU.

Model parallelism, on the other hand, simply cuts the model into pieces and trains separate parts on separate GPUs. Rather than the data being split, it's the model.

#### Data parallelism of Elephas

Spark's capability of parallelization in a resilient way with RDDs aligns naturally with the concept of data parallelism, as a Spark job essentially parallelizes the processing of data across many machines. In effect, how it works is simple: a Keras model is initialized on the Spark driver and then passed as a whole to a worker, as well as a bit of the data which it should train on. Each worker then trains the model on its part and sends the gradients back to the driver, which subsequently updates the "master model" in the data parallel way described above.

Schematically, this looks as follows:

![](images/elephas.gif)

Image from [Elephas GitHub](https://github.com/maxpumperla/elephas). License: MIT.

Personally, I think this is a great way of aligning the benefits of Apache Spark with the requirements of training Keras models in a parallel way. In fact, Elephas does not only support training. In total, three use cases are supported:

- Distributed data parallel training of your Keras model (animation above).
- Distributed hyperparameter optimization for your Keras model (that is, finding the best set of hyperparameters automatically, such as with [Keras Tuner](https://www.machinecurve.com/index.php/2020/06/09/automating-neural-network-configuration-with-keras-tuner/), but then distributed).
- Distributed training of ensemble models, by means of hyperparameter optimization and subsequently ensembling the \[latex\]N\[/latex\] best-performing models.

While it [used to be impossible](https://github.com/maxpumperla/elephas/issues/139) to perform regression tasks in previous versions of Elephas, [it was added](https://www.machinecurve.com/index.php/2020/10/22/distributed-training-tensorflow-and-keras-models-with-apache-spark/#comment-12187) in [version 0.4.5](https://github.com/danielenricocahall/elephas/releases/tag/0.4.5), released in early 2021.
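Before we write actual Elephas code, here is a purely illustrative sketch of the gradient averaging idea described above, using a hypothetical linear model and plain NumPy rather than Keras or Spark. Each simulated 'worker' computes a gradient on its own shard of the batch, and the 'driver' averages those gradients before updating the weights:

```
import numpy as np

# Hypothetical toy problem: linear regression with a mean squared error (MSE) loss
np.random.seed(42)
X = np.random.randn(64, 3)                 # one batch of 64 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * np.random.randn(64)

w = np.zeros(3)                            # the 'master model' weights on the driver

def gradient(X_shard, y_shard, weights):
    '''MSE gradient of a linear model, computed on one worker's shard of data.'''
    error = X_shard @ weights - y_shard
    return 2.0 * X_shard.T @ error / len(y_shard)

# Simulate 4 workers: split the batch into equally sized shards
shards = zip(np.array_split(X, 4), np.array_split(y, 4))
worker_grads = [gradient(X_s, y_s, w) for X_s, y_s in shards]

# The driver averages the per-worker gradients...
avg_grad = np.mean(worker_grads, axis=0)

# ...which, for equally sized shards, equals the gradient over the full batch
print(np.allclose(avg_grad, gradient(X, y, w)))  # True

# The averaged gradient is then used to update the master model
w -= 0.1 * avg_grad
```

Elephas performs this kind of averaging for you, but across real Spark workers and with real Keras models - let's now see what that looks like in code.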
+ +#### Creating a model with Keras and Elephas + +Creating a model with Keras and Elephas is truly simple. [As per the docs on GitHub](https://github.com/maxpumperla/elephas), it's necessary to perform a few steps: + +- Create a `pyspark` context +- Define and compile the Keras model +- Convert your dataset into an RDD +- Initialize an `elephas.spark_model.SparkModel` instance +- Submitting your script with `spark-submit` + +The steps are explained in more detailed [here](https://github.com/maxpumperla/elephas#basic-spark-integration), but here's a full code example of a simple Keras classifier - our [hot dog classifier made Spark-ready](https://www.machinecurve.com/index.php/2020/10/20/tutorial-building-a-hot-dog-not-hot-dog-classifier-with-tensorflow-and-keras/). Do note that we omitted some general parts, which can be retrieved in the linked article. + +``` +import tensorflow +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Conv2D, Flatten +from pyspark import SparkContext, SparkConf +from elephas.utils.rdd_utils import to_simple_rdd +from elephas.spark_model import SparkModel + +# >> Omitting loading the dataset: check article for how-to << + +# >> Omitting configuration options: check article for how-to << + +# Generating Spark Context +conf = SparkConf().setAppName('MachineCurve').setMaster('local[8]') +sc = SparkContext(conf=conf) + +# Model creation +def create_model(): + model = Sequential() + model.add(Conv2D(4, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) + model.add(Conv2D(8, kernel_size=(3, 3), activation='relu')) + model.add(Conv2D(12, kernel_size=(3, 3), activation='relu')) + model.add(Flatten()) + model.add(Dense(256, activation='relu')) + model.add(Dense(no_classes, activation='softmax')) + return model + +# Model compilation +def compile_model(model): + model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + return model + +# Create and compile the model +model = create_model() +model = compile_model(model) + +# Convert dataset to RDD +rdd = to_simple_rdd(sc, X_train, y_train) + +# Train model +spark_model = SparkModel(model, frequency='epoch', mode='asynchronous') +spark_model.fit(rdd, epochs=20, batch_size=32, verbose=0, validation_split=0.1) +``` + +* * * + +### CERN dist-keras + +The [CERN Database Group](https://github.com/cerndb) (indeed, the European Organization for Nuclear Research, which produced the [Large Hadron Collider](https://home.cern/science/accelerators/large-hadron-collider)) created [dist-keras](https://github.com/cerndb/dist-keras), which can be used for distributed optimization of your Keras-based deep learning model. In fact: + +> Distributed Keras is a distributed deep learning framework built op top of Apache Spark and Keras, with a focus on "state-of-the-art" distributed optimization algorithms. We designed the framework in such a way that a new distributed optimizer could be implemented with ease, thus enabling a person to focus on research. Several distributed methods are supported, such as, but not restricted to, the training of **ensembles** and models using **data parallel** methods. +> +> CERN (n.d.) + +Similar to Elephas, `dist-keras` also allows people to train models on Apache Spark in a data parallel way (for those who haven't read about Elephas yet: navigate to the Elephas section above if you want to understand the concept of data parallelism in more detail). 
It does so by allowing people to perform distributed optimization; that is, rather than performing [Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) or [classic SGD](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/), `dist-keras` utilizes _distributed_ optimizers such as ADAG, Dynamic SGD and AEASGD ([click here for a great article that explains them](https://joerihermans.com/ramblings/distributed-deep-learning-part-1-an-introduction/), of which the author is also affiliated with `dist-keras`). + +Contrary to Elephas, with `dist-keras` it is possible to implement your own distributed optimizer - for example, because new state-of-the-art methods have appeared. You don't have to wait for people to adapt your tools as you can simply replace the part that requires replacement. In theory, this is a big advantage over other distributed deep learning methods. + +The latest commit on GitHub dates back to 2018. In addition, [Joeri Hermans](https://www.linkedin.com/in/joerih/) - the driving force between `dist-keras` - is no longer working with CERN, but now instead pursues a PhD at the University of Liège. From an engineering point of view, it is therefore questionable whether `dist-keras` should still be used - as the Keras landscape has changed significantly since 2018 (think of TensorFlow 2.x, anyone?). We therefore don't provide an example for it. Still, it's a great effort that cannot go unnoticed. + +* * * + +### Intel Analytics BigDL + +![](images/bigdl-logo-bw.jpg) + +License: [Apache-2.0 License](https://github.com/intel-analytics/BigDL/blob/master/LICENSE) + +Produced by [intel-analytics](https://github.com/intel-analytics), BigDL can also be used for distributing training of your deep learning model on Apache Spark. + +In fact, that's what its [GitHub page](https://github.com/intel-analytics/BigDL) claims as the primary header: **BigDL: Distributed Deep Learning on Apache Spark**. It has emerged from three drivers: + +1. Data scale driving deep learning processes. As deep learning models get deeper, more data is required for training them so that they can predict _and_ generalize. With _more data_, we often talk about the big datasets we discussed above. Hadoop and Spark are often deployed for processing those datasets, but no well-performing distributed deep learning library was available. +2. Real-world deep learning applications can be viewed as complex big data pipelines. Then why not integrate with existing big data tooling for training your deep learning models? +3. Deep learning is increasingly being adopted by big data and data science communities. In general, tools that align with current ways of working are adopted more quickly. That's why it could be wise to create tooling that works with what people already know. + +Where `elephas` and `dist-keras` focus on Keras models, BigDL works a bit differently. Instead of focusing on an existing framework for deep learning, it requires people to write models directly against Spark - that is, by using `pyspark`. In doing so, it attempts to replicate what we know as the Sequential API from Keras, which should make it fairly easy for people used to the Keras way of working to implement models with BigDL. + +While the former may sound strange at first, it does in fact come with great benefits. 
Since Spark is not effectively 'abused' to run Keras models in a data parallel way, but instead runs as _direct transformations_ of inputs to outputs (essentially replicating the mathematical operations performed by e.g. [Convolutional layers](https://www.machinecurve.com/index.php/2020/10/20/tutorial-building-a-hot-dog-not-hot-dog-classifier-with-tensorflow-and-keras/#what-is-a-convnet) directly in Spark), it becomes possible to train with _extremely_ large datasets currently stored on Hadoop (or S3). This is not possible with Elephas, to give just one example: here, an existing dataset had to be converted into an RDD and then run on the Spark cluster. + +As such, BigDL also allows you to train your deep learning model in a data parallel way, but is Spark-native rather than Keras-based. + +In fact, the benefit we just discussed makes BigDL a lot more mature compared to Elephas and `dist-keras`. What's more, its GitHub shows that it was updated only 14 days ago, and now allows you to deploy on Spark 3.x based clusters, thereby supporting Spark's latest release. That's definitely some great news for those who need to train their deep learning models on really big datasets. + +#### Creating a model with BigDL + +Many examples for creating a model with BigDL are available [here](https://github.com/intel-analytics/BigDL-tutorials). + +* * * + +### Apache Spark SystemML: Keras2DML + +While BigDL utilizes native Spark processing for creating your deep learning model, [Keras2DML](http://systemml.incubator.apache.org/docs/1.2.0/beginners-guide-keras2dml) comes back to the approach we saw earlier in this article - converting a Keras model into DML, which can then be run on Spark. More specifically, it allows you to train your Keras Functional API based model on a Spark cluster by converting it into a [Caffe](https://caffe.berkeleyvision.org/) model first, and then in DML. + +It's essentially a converter to a format that can be converted into a Spark-compatible model and is part of the Apache Spark SystemML, a flexible machine learning system automatically scaling to Spark and Hadoop clusters. + +I do however have some relatively bad news for you, which makes this section really short compared to the others: I'm not so sure anymore whether utilizing Keras2DML is the best approach today, especially given the benefits of BigDL and the non-Spark method of [TensorFlow Cloud](https://www.machinecurve.com/index.php/2020/10/16/tensorflow-cloud-easy-cloud-based-training-of-your-keras-model/). This observation comes from two lower-level ones: + +1. SystemML is now called SystemDS. I cannot find anything about KerasDML on a SystemDS related website. The only website I find is from SystemML; this means that it's really old. +2. Articles about SystemML and Keras date back to 2018, as well as this [example](https://gist.github.com/NiloyPurkait/1c6c44f329f2255f5de2b0d498c3f238). The example even imports `keras` manually, rather than importing it from TensorFlow as `tensorflow.keras` - clearly indicating that Keras2DML is compatible with Keras 1.x based models only! + +That's why I wouldn't recommend using Keras2DML today, unless you really know what you're doing, and why. + +* * * + +### Yahoo TensorFlowOnSpark + +Combining important elements from TensorFlow with Apache Spark and Apache Hadoop, the TensorFlowOnSpark system that was created by Yahoo makes it possible to train your deep learning model in a distributed way on a GPU or CPU machine powered cluster (TensorFlowOnSpark, n.d.). 
+ +> _TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters._ + +> (TensorFlowOnSpark, n.d.) + +According to the docs, it was created with TensorFlow compatibility in mind. The authors argue that it provides many benefits over other solutions used for training your deep learning model: + +- Converting your TensorFlow model to a TensorFlowOnSpark based one is easy, requiring a code change of fewer than 10 lines of code. This is not the most salient benefit, as e.g. Elephas requires you to change almost no code either. +- Many TensorFlow functionalities are supported: various forms of parallelism, inferencing, and even [TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/). +- Your datasets can reside on HDFS and other sources (think S3). Elephas and `dist-keras` don't support this; BigDL does, but doesn't work with TensorFlow/Keras models. +- Deployment can be done anywhere Spark is running. + +The repository on GitHub is updated rather frequently and the README file suggests that TensorFlow 2.x is supported. This should mean that it can be used for training with contemporary libraries. + +#### Creating a model with TensorFlowOnSpark + +Some examples for using TensorFlowOnSpark are available [here](https://github.com/yahoo/TensorFlowOnSpark/tree/master/examples). + +* * * + +## Summary: distributed DL maturity and what to choose + +In this article, we looked at some extensions and other tools for training your deep learning models in a distributed way, on Apache Spark. We saw that training your deep learning model in a distributed way often boils down to parallelizing the data: training many small instances of your machine learning model on a variety of machines and subsequently computing the gradient as a weighted average of all the parallelized gradients. + +Spark natively fits this approach, as it also performs data processing in a parallelized way by means of RDDs and processing on a cluster of commodity machines. + +If you have experience with Apache Spark and want to start with training deep learning models, it's great news that you can also use Spark for training your deep learning models. For this reason, in this article, we also looked at a few tools for doing so. More specifically: + +1. **Elephas**, which allows you to train your Keras models in a data parallelized way on Spark. It only seems to support classification tasks, while throwing an error for regression tasks. +2. **Dist-keras**, which allows you to do the same, but with distributed optimization algorithms instead of standard optimizers. The maintainer no longer works at CERN, where it was created, so it's unclear whether it will be updated again (last update: 2018). +3. **BigDL**, which allows you to train deep learning models. It is not dependent on any deep learning library but rather implements the operations as Spark operations, meaning that you can also train with datasets present on Hadoop storage that Spark connects to. +4. **Keras2DML:** a method for converting Keras models into Spark compatible Caffe models, and subsequently into a Spark-compatible format. +5. **TensorFlowOnSpark**, which allows you to train TensorFlow models on Spark, even with Hadoop based data, like BigDL. + +Now, the obvious question would be: **what is the best choice for training your deep learning model in a distributed way on Spark?** The answer, as always, is that "it depends". 
Here's why - it depends on what you need and what your entry point is. + +Do you already have a TensorFlow or Keras model, for example? Then you might not want to use BigDL, because you'd have to specify your model again (although that's not really difficult, but still). Do you have a dataset that exceeds any reasonable storage device, requiring you to store it on S3 or on a Hadoop cluster? Then Elephas and dist-keras may not be for you. Are you adventurous, or do you want to navigate towards a more production-ready way of working? If the latter is true, then you might wish to use BigDL or TensorFlowOnSpark. + +More generally, I would therefore say that BigDL and TensorFlowOnSpark are the most mature from this list. They support a wide variety of operations, support connecting to data stored on Hadoop, are maintained by larger organizations, support modern versions of the libraries (e.g. TensorFlow 2.x) and have been updated recently. Elephas and dist-keras - while I appreciate the amount of work that must have been put into creating them - don't have all these pros. Keras2DML seems to be very outdated, so I wouldn't recommend using it. + +But still, "it depends". Choose wisely. [For example, consider using TensorFlow Cloud on Google machines if you don't have experience with Spark](https://www.machinecurve.com/index.php/2020/10/16/tensorflow-cloud-easy-cloud-based-training-of-your-keras-model/). You then don't have the hassle of getting to know Spark, while you can benefit from distributed strategies there as well. In fact, it's also really easy once you have completed the setup. + +I hope that you've learnt something interesting from today's article. I certainly did - for me, it was new that Apache Spark can be used for training deep learning models. However, after reading about data parallelism, it all clicked - and I recognized why using Spark can be really useful. It was fun to see the effort put into creating the tools that we covered in the article today, and I'm happy to see that some are maintained even today. + +Please feel free to leave a comment if you have any questions whatsoever. Please also do so if you have remarks, suggestions for improvements, or a comment in general 💬 I'd love to hear from you and will happily respond! Thank you for reading MachineCurve today and happy engineering 😎 + +\[kerasbox\] + +* * * + +## References + +Geitgey, A. (2019, January 21). _Snagging parking spaces with mask R-CNN and Python_. Medium. [https://medium.com/@ageitgey/snagging-parking-spaces-with-mask-r-cnn-and-python-955f2231c400](https://medium.com/@ageitgey/snagging-parking-spaces-with-mask-r-cnn-and-python-955f2231c400) + +Hotaj, E. (2020, March 28). _How to get beautiful results with neural style transfer_. Medium. [https://towardsdatascience.com/how-to-get-beautiful-results-with-neural-style-transfer-75d0c05d6489](https://towardsdatascience.com/how-to-get-beautiful-results-with-neural-style-transfer-75d0c05d6489) + +Hermann, J. (2019, October 27). _Scaling machine learning at Uber with Michelangelo_. Uber Engineering Blog. [https://eng.uber.com/scaling-michelangelo/](https://eng.uber.com/scaling-michelangelo/) + +Apache Spark. (n.d.). _Apache Spark™ - Unified Analytics Engine for Big Data_. [https://spark.apache.org/](https://spark.apache.org/) + +Wikipedia. (2012, November 17). _Apache Spark_. Wikipedia, the free encyclopedia. Retrieved October 21, 2020, from [https://en.wikipedia.org/wiki/Apache\_Spark](https://en.wikipedia.org/wiki/Apache_Spark) + +Wikipedia. (2002, May 24). 
_File transfer protocol_. Wikipedia, the free encyclopedia. Retrieved October 21, 2020, from [https://en.wikipedia.org/wiki/File\_Transfer\_Protocol](https://en.wikipedia.org/wiki/File_Transfer_Protocol) + +Elephas. (n.d.). _Maxpumperla/elephas_. GitHub. [https://github.com/maxpumperla/elephas](https://github.com/maxpumperla/elephas) + +Mao, L. (n.d.). _Data parallelism VS model parallelism in distributed deep learning training_. Lei Mao's Log Book. [https://leimao.github.io/blog/Data-Parallelism-vs-Model-Paralelism/](https://leimao.github.io/blog/Data-Parallelism-vs-Model-Paralelism/) + +CERN. (n.d.). _Cerndb/dist-keras_. GitHub. [https://github.com/cerndb/dist-keras](https://github.com/cerndb/dist-keras) + +Dai, J. J., Wang, Y., Qiu, X., Ding, D., Zhang, Y., Wang, Y., ... & Wang, J. (2019, November). [Bigdl: A distributed deep learning framework for big data](https://dl.acm.org/doi/abs/10.1145/3357223.3362707). In _Proceedings of the ACM Symposium on Cloud Computing_ (pp. 50-60). + +BigDL. (n.d.). _Intel-analytics/BigDL_. GitHub. [https://github.com/intel-analytics/BigDL](https://github.com/intel-analytics/BigDL) + +Apache SystemML. (n.d.). _Beginner's guide for Caffe2DML users_. Apache SystemML - Declarative Large-Scale Machine Learning. [https://systemml.incubator.apache.org/docs/0.15.0/beginners-guide-caffe2dml](https://systemml.incubator.apache.org/docs/0.15.0/beginners-guide-caffe2dml) + +TensorFlowOnSpark. (n.d.). _Yahoo/TensorFlowOnSpark_. GitHub. [https://github.com/yahoo/TensorFlowOnSpark](https://github.com/yahoo/TensorFlowOnSpark) diff --git a/easy-causal-language-modeling-with-machine-learning-and-huggingface-transformers.md b/easy-causal-language-modeling-with-machine-learning-and-huggingface-transformers.md new file mode 100644 index 0000000..1ed67ea --- /dev/null +++ b/easy-causal-language-modeling-with-machine-learning-and-huggingface-transformers.md @@ -0,0 +1,235 @@ +--- +title: "Easy Causal Language Modeling with Machine Learning and HuggingFace Transformers" +date: "2021-03-03" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "causal-language-model" + - "causal-language-modeling" + - "deep-learning" + - "huggingface" + - "language-model" + - "language-modeling" + - "machine-learning" + - "nlp" + - "transformer" + - "transformers" +--- + +Machine Learning in NLP is making a lot of progress. It can be used for many language tasks, primarily thanks to the so-called Transformer architecture that was invented back in 2017 and has been improved until today. [Text summarization](https://www.machinecurve.com/index.php/2020/12/21/easy-text-summarization-with-huggingface-transformers-and-machine-learning/), [machine translation](https://www.machinecurve.com/index.php/2021/02/16/easy-machine-translation-with-machine-learning-and-huggingface-transformers/), [named entity recognition](https://www.machinecurve.com/index.php/2021/02/11/easy-named-entity-recognition-with-machine-learning-and-huggingface-transformers/) and even [speech-to-text](https://www.machinecurve.com/index.php/2021/02/17/easy-speech-recognition-with-machine-learning-and-huggingface-transformers/) - those are just a few examples. + +But **language modeling** itself is also a task that can be performed by such models. That is: using machine learning to predict a new word given the previous words. Using language modeling, you will be able to generate text and use ML for generative purposes. In fact, this lies at the basis of many more specialized models, such as the ones mentioned above. 
+ +And this importance, combined with the opportunities for using it, is why we'll take a look at language modeling in this tutorial. + +**After reading this tutorial, you will understand...** + +- What Causal Language Modeling involves. +- How the GPT family of language models supports these tasks, and how they are different from each other. +- How to build a GPT2-based Language Modeling pipeline with HuggingFace Transformers. + +Let's take a look! 🚀 + +* * * + +\[toc\] + +* * * + +## Code example: language modeling with Python + +This **fully working code example** shows how you can create a generative language model with Python. We use HuggingFace Transformers for this model, so make sure to have it installed in your environment (`pip install transformers`). Also make sure to have a recent version of PyTorch installed, as it is required as well. However, with a few changes, the example can also be adapted to run with TensorFlow. + +Make sure to read the rest of the article to understand everything in more detail, but here you go 🚀 + +``` +from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering +import torch +from torch.nn import functional as F + +# Load GPT-2 tokenizer and model +tokenizer = AutoTokenizer.from_pretrained('gpt2') +model = AutoModelForCausalLM.from_pretrained('gpt2') + +# Tokenize input phrase +phrase = f'Make sure to read the rest of this ' +inputs = tokenizer.encode(phrase, return_tensors='pt') + +# Get logits from last layer +last_layer_logits = model(inputs).logits[:, -1, :] + +# Keep top 100 logits at max; stop if cumulative probability >= 1.0. +top_logits = top_k_top_p_filtering(last_layer_logits, top_k=100, top_p=1.0) + +# Softmax the logits into probabilities +probabilities = F.softmax(top_logits, dim=-1) + +# Generate next token +generated_next_token = torch.multinomial(probabilities, num_samples=1) +generated = torch.cat([inputs, generated_next_token], dim=-1) + +# Get result +result_string = tokenizer.decode(generated.tolist()[0]) + +# Print string +print(result_string) +``` + +Result: + +``` +Make sure to read the rest of this ____ +``` + +* * * + +## Causal Language Modeling and Transformers + +According to HuggingFace (n.d.): + +> Causal language modeling is the task of predicting the token following a sequence of tokens. In this situation, the model only attends to the left context (tokens on the left of the mask). Such a training is particularly interesting for generation tasks. + +Today's NLP models, which primarily rely on the Transformer architecture that we will discuss shortly, are mostly trained in a [pretraining-finetuning fashion](https://www.machinecurve.com/index.php/question/what-is-fine-tuning-based-training-for-nlp-models/). This is a two-stage process where models are first _pretrained_ with a very large, unlabeled set of textual data. This way, machine learning models can benefit from the vast quantities of such data, without the cost of labeling, which is relatively high. Subsequently, in a _finetuning step_, pretrained models are tailored to a specific task - such as [sentiment analysis](https://www.machinecurve.com/index.php/2020/12/23/easy-sentiment-analysis-with-machine-learning-and-huggingface-transformers/) or [named entity recognition](https://www.machinecurve.com/index.php/2021/02/11/easy-named-entity-recognition-with-machine-learning-and-huggingface-transformers/). + +Pretraining happens with a particular task. **Language modeling** is one of these tasks. 
As you can see in the definition above and the image below, the model must compute the most likely token given the current sequence of tokens. In other words, it must learn to predict the best word given a certain phrase, which comes from a particular context. As you can imagine, when language models do this at scale (hence the vast quantities of supervised data and large models), they can learn to understand many patterns underlying human language. + +![](images/causal-1024x445.png) + +(Causal) language modeling + +The definition from HuggingFace (n.d.) quoted above mentions that "the model only attends to the left context", meaning "tokens on the left of the mask". If we want to understand this in more detail, we must take a look at the trade-off between [unidirectionality and bidirectionality](https://www.machinecurve.com/index.php/question/what-are-unidirectional-language-models/). If you know Transformers, you know that the [multi-head attention mechanism](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/#multi-head-attention) generates attention maps that illustrate which words are closely related. + +Attention can be used for paying attention to words that matter and hence play a large role in predicting, say, the next token. + +Original Transformer models, such as the [Vaswani model](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) and [OpenAI's GPT model](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/), added a so-called _masked_ attention segment into their architecture as well. Masking provides unidirectionality: attention can only be computed in a left-to-right or right-to-left fashion; often left-to-right. It ensures that models cannot see 'into the future' during training, which would translate into simply memorizing tokens if the goal is to predict the most likely token that follows another sequence. + +Masking works by setting all future tokens in the attention map to minus infinite, meaning that they are converted into zero when fed to the Softmax layer that is [common within Transformer attention segments](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/#multi-head-attention). + +![](images/Diagram-20-1024x282.png) + +Masked self-attention + +Masking future tokens in the attention map can benefit text generation. For other tasks, such as sentiment analysis, it may be counterproductive - as was argued by the creators of the [BERT model](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/). The B in BERT stands for _Bidirectional_ and it is unsurprising to find that BERT architectures remove masking altogether (by using the encoder segment from the Transformer only). This is why we see traditional (Seq2Seq) and GPT-like (decoder-only; autoregressive) models being used for text generation a lot, whereas BERT-like models are more used for other tasks (say sentiment analysis, text classification, ...). + +Let's now take a look at implementing a Language Modeling model with HuggingFace Transformers and Python. + +* * * + +## Implementing a Language Modeling model with Python + +In this section, you will learn how a Python based pipeline for (Causal) Language Modeling can be implemented. We will first take a brief look at GPT2, the Transformer model that we will be using today. 
Then, we'll introduce HuggingFace Transformers, which is a library that can be used for creating such models with just a few lines of code. After that, we teach you how to code your model, and finally show some results. + +### Today's Transformer: GPT2, part of the GPT family of language models + +If you have been following developments within machine learning, you know that the GPT family of language models has gained a lot of traction recently. These models use the [decoder segment](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/#the-decoder-segment) of the original Transformer model, applying some changes, and using an [autoregressive language modeling task](https://www.machinecurve.com/index.php/2020/12/29/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning/) - where the goal is to predict the next token given the previous ones (does that ring a bell? ;-) ). + +Very briefly: + +- **GPT**, or GPT-1, was one of the founding models for the pretraining-finetuning paradigm. It showed that there was no need to use labeled datasets and train for specific tasks directly. Rather, it is possible to use large-scale unlabeled datasets for pretraining first, followed by using smaller datasets for small-scale fine-tuning to specific tasks. This was a big breakthrough. +- **GPT-2**, the model under consideration today, mostly improved GPT by using a larger dataset for pretraining and adding more parameters. However, some other key improvements were "Task Conditioning", meaning that a multitask model can be created rather than a single-task model, and "Zero Shot Learning", where the model understands a particular task without prior instructions in the text. In other words, with GPT-2, OpenAI shows that it wants to move towards models that require _no_ finetuning and can be trained with pretraining only. +- **GPT-3**, the current frontrunner in the GPT family, once more added more parameters to the model architecture - 100 times more than GPT-2! GPT-3 now shows adequate performance on tasks in zero-shot and few-shot settings. It can even write articles because of its good text generation capabilities. For this, it uses "in-context learning" - requiring the presentation of only a few examples or a description, allowing the model to adapt its output to the specific concept. This is a powerful strength of really big language models. + +We will be using GPT-2 for our Language Modeling pipeline today. It is open source and available within the [HuggingFace Model Hub](https://huggingface.co/models), whereas GPT-3 is [exclusively licensed by Microsoft](https://blogs.microsoft.com/blog/2020/09/22/microsoft-teams-up-with-openai-to-exclusively-license-gpt-3-language-model/) (goodbye, open source?). + +![](images/Diagram-17-627x1024.png) + +Vaswani et al.'s decoder segment lies at the basis of GPT2. + +### HuggingFace Transformers + +For building our pipeline, we will be using [HuggingFace Transformers](https://huggingface.co/transformers/), part of the HuggingFace community that is focused on democratizing NLP models through the open source movement. It is a library that contains many functionalities for using pretrained and finetuned models that are stored in the Model Hub, including GPT-2. + +![](images/image.png) + +### Model code + +Time to write some code! 
Ensure that you have installed HuggingFace Transformers (`pip install transformers`) and, in this case, PyTorch - [although it will also work with TensorFlow backends](https://huggingface.co/transformers/task_summary.html) (search for Causal Language Modeling there). + +Here's what happens under the hood: + +- First, we specify the imports. The most important ones are the `AutoModelForCausalLM`, which supports pretrained language models for Causal Language Modeling. We also use the `AutoTokenizer` for tokenization and `top_k_top_p_filtering` for selecting the most contributing logits (more about that later). +- We then load the GPT-2 tokenizer and model. It can be the case that you will need to download it first, which involves a download of approximately ~550MB. HuggingFace Transformers starts the download automatically when you run the script for the first time. +- We specify and tokenize an input phrase. +- We pass the tokenized phrase through the model and take the [logits](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/#logits-layer-and-logits) from the last layer. We then keep the top `30` contributing logits, unless we get a cumulative probability of `>= 1.0` with fewer logits. We [Softmax](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work) the outcome to generate pseudoprobabilities, generate the next token based on this outcome, and generate the result. +- Finally, we show the result on screen. + +``` +from transformers import AutoModelForCausalLM, AutoTokenizer, top_k_top_p_filtering +import torch +from torch.nn import functional as F + +# Load GPT-2 tokenizer and model +tokenizer = AutoTokenizer.from_pretrained('gpt2') +model = AutoModelForCausalLM.from_pretrained('gpt2') + +# Tokenize input phrase +phrase = f'I sleep in a bed that is poorly ' +inputs = tokenizer.encode(phrase, return_tensors='pt') + +# Get logits from last layer +last_layer_logits = model(inputs).logits[:, -1, :] + +# Keep top 30 logits at max; stop if cumulative probability >= 1.0. +top_logits = top_k_top_p_filtering(last_layer_logits, top_k=30, top_p=1.0) + +# Softmax the logits into probabilities +probabilities = F.softmax(top_logits, dim=-1) + +# Generate next token +generated_next_token = torch.multinomial(probabilities, num_samples=1) +generated = torch.cat([inputs, generated_next_token], dim=-1) + +# Get result +result_string = tokenizer.decode(generated.tolist()[0]) + +# Print string +print(result_string) +``` + +### Results + +Running the code for the first time indeed ensures that the model is downloaded: + +``` +Downloading: 100%|█████████████████████████████████████████████████████████████████████| 665/665 [00:00<00:00, 333kB/s] +Downloading: 100%|████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 1.41MB/s] +Downloading: 100%|███████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 650kB/s] +Downloading: 100%|█████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:01<00:00, 843kB/s] +Downloading: 100%|██████████████████████████████████████████████████████████████████| 548M/548M [01:05<00:00, 8.43MB/s] +``` + +I then get the following output: + +``` +I sleep in a bed that is poorly iced +``` + +A bit strange, but hey! 😂 + +* * * + +## Summary + +In this tutorial, you have learned the following things: + +- What Causal Language Modeling involves. 
+- How the GPT family of language models supports these tasks, and how they are different from each other. +- How to build a GPT2-based Language Modeling pipeline with HuggingFace Transformers. + +I hope that you have learned a few things from this tutorial! If you did, please feel free to leave a message in the comments section below, as I'd love to hear from you 💬 Please do the same when you have any questions or remarks. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +HuggingFace. (n.d.). _Summary of the tasks — transformers 4.3.0 documentation_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/transformers/task\_summary.html](https://huggingface.co/transformers/task_summary.html) + +Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). [Language models are unsupervised multitask learners](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf). _OpenAI blog_, _1_(8), 9. + +Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). [Improving language understanding by generative pre-training.](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf) + +Shree, P. (2020, November 10). _The journey of open AI GPT models_. Medium. [https://medium.com/walmartglobaltech/the-journey-of-open-ai-gpt-models-32d95b7b7fb2](https://medium.com/walmartglobaltech/the-journey-of-open-ai-gpt-models-32d95b7b7fb2) diff --git a/easy-chatbot-with-dialogpt-machine-learning-and-huggingface-transformers.md b/easy-chatbot-with-dialogpt-machine-learning-and-huggingface-transformers.md new file mode 100644 index 0000000..c639b53 --- /dev/null +++ b/easy-chatbot-with-dialogpt-machine-learning-and-huggingface-transformers.md @@ -0,0 +1,227 @@ +--- +title: "Easy Chatbot with DialoGPT, Machine Learning and HuggingFace Transformers" +date: "2021-03-16" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "chatbot" + - "deep-learning" + - "dialogpt" + - "huggingface" + - "machine-learning" + - "natural-language-processing" + - "neural-response-generation" + - "nlp" + - "text-generation" + - "transformers" +--- + +These past few years, machine learning has boosted the field of Natural Language Processing via Transformers. Whether it's Natural Language Understanding or Natural Language Generation, models like GPT and BERT have ensured that human-like texts and interpretations can be generated on a wide variety of language tasks. + +For example, today, we can create pipelines for [sentiment analysis](https://www.machinecurve.com/index.php/2020/12/23/easy-sentiment-analysis-with-machine-learning-and-huggingface-transformers/), [missing text prediction](https://www.machinecurve.com/index.php/2021/03/02/easy-masked-language-modeling-with-machine-learning-and-huggingface-transformers/) and even [speech-to-text](https://www.machinecurve.com/index.php/2021/02/17/easy-speech-recognition-with-machine-learning-and-huggingface-transformers/) with only a few lines of code. + +One of these tasks is human-level response generation. In other words, the creation of chatbots. In this tutorial, we will explore precisely that topic. You will build a chatbot with the DialoGPT model. We already covered the foundations of this approach [in a different article](https://www.machinecurve.com/index.php/question/what-is-dialogpt-and-how-does-it-work/), so click the link if you want to understand it in more detail. 
Here, you will learn... + +- **How DialoGPT works at a high level.** +- **How you can build a chatbot with Machine Learning and Transformers.** +- **How you can converse with your chatbot.** + +Ready? Let's go 🚀 + +* * * + +\[toc\] + +* * * + +## DialoGPT for Neural Response Generation - a.k.a., Chatbots + +Before we move on to creating code for our chatbot, I think that it's important that we cover DialoGPT at a high level. This way, you can also understand what happens in the background when your code runs. + +Let's first take a look at what chatbots are. Formally, they belong to the class of models for _neural response generation_, or NRG. In other words, their goal is to predict a response text to some input text, as if two people are chatting. + +Traditionally, chatbots have been built in a recurrent way - with models like [Long Short-Term Memory networks](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/) or LSTMs. As we know [from our introduction to Transformers](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/), Transformer-based approaches have really taken over from LSTMs thanks to the self-attention mechanism. We can therefore ask ourselves whether Transformers can also be used to improve how chatbots work. + +That's also what Zhang et al. (2019) thought. The group of authors, which works at Microsoft, is the creator of the **[DialoGPT](https://www.machinecurve.com/index.php/question/what-is-dialogpt-and-how-does-it-work/)** Transformer. It inherits from the GPT-2 model (which itself is already a very powerful NLP model) and was trained with a custom dataset derived from Reddit. Evaluation was performed with a wide variety of datasets and tasks. It boosts the state-of-the-art in NRG, even surpassing Microsoft's `PersonalityStyle` model used in Azure Cognitive Services, and is available in three flavors (117M, 345M and 762M parameters). + +**Additional reading** + +- [DialoGPT: Transformers for Dialogues](https://www.machinecurve.com/index.php/question/what-is-dialogpt-and-how-does-it-work/) + +![](images/Diagram-17-627x1024.png) + +Decoder segment from the original Transformer. This segment lies at the basis of the GPT-2 model, used in DialoGPT. Source: Vaswani et al. (2017) + +* * * + +## Building a Chatbot with Transformers + +Now that we have a high-level understanding of how DialoGPT works, we can actually start writing some code! + +Firstly, we'll take a look at the software dependencies that must be available on your machine for the chatbot to work. Then, we'll cover creating the chatbot step-by-step, explaining every piece of the code that we will create. Finally, we're going to chat with the bot that we created, to see if it works well. Let's take a look! 🚀 + +### What you'll need for the chatbot + +Today's Machine Learning based chatbot will be created with [HuggingFace Transformers](https://huggingface.co/). Created by a company with the same name, it is a library that aims to democratize Transformers - meaning that everyone should be able to use the wide variety of Transformer architectures with only a few lines of code. + +And we shall see below that creating a chatbot is really easy and can be done in approximately 50 lines. + +However, in order to make it run, you will need to have HuggingFace Transformers installed on your system, preferably in some kind of Python-based environment. You can do so with `pip install transformers`. 
Note that this also requires that Python is installed. Finally, you will also need PyTorch, because we will use `torch` in our code. Once you have these dependencies, you're ready to start coding. + +### Chatbot code example - explained + +Let's create a file or a [Notebook](https://www.machinecurve.com/index.php/2020/10/07/easy-install-of-jupyter-notebook-with-tensorflow-and-docker/) - e.g. called `chatbot.py` and write some code! As you can see, you will create a set of Python definitions that you will execute towards the end. Let's walk through each of them individually: + +- First of all, **we define `load_tokenizer_and_model`**. As you can imagine, it loads the tokenizer and the model instance for a specific variant of DialoGPT. As with any Transformer, inputs must be tokenized - that's the role of the tokenizer. The model subsequently generates the predictions based on what the tokenizer has created. We're using the `AutoTokenizer` and the `AutoModelForCausalLM` instances of HuggingFace for this purpose, and return the `tokenizer` and `model`, because we'll need them later. + - Do note that by default, the `microsoft/DialoGPT-large` model is loaded. You can also use the `-medium` and `-small` models. +- Then **we define `generate_response`**. Using the `tokenizer`, the `model`, a `chat_round` (indicating the _n_th chat round) and a set of `chat_history_ids`, a response to some user input is generated. First of all, the user input and an End-of-String (EOS) token are encoded. These are appended to the chat history, because DialoGPT (in theory) uses the whole chat history for generating predictions. Subsequently, this is used for generating a response - but only using the 1250 most recent tokens in the input sequence. The response is finally printed and the `chat_history_ids` (the current response) is returned for usage in a subsequent round. +- This is followed by `**chat_for_n_rounds**`. It loads the tokenizer and model by calling the `load_tokenizer_and_model` definition that we created above. Subsequently, it sets the chat history to `None` (there is no history before the first round) and chats for n rounds by means of a `for` loop. The number of rounds is configurable by means of the `n` parameter. As you can see, this generates an iterative chatting process. + - The chatbot can also be expanded so that it continues chatting forever until you give some kind of a stop word, like `bye`. That's out of scope for now, but please ask for it in the comments if you're interested in that! +- Finally, we check if the **`'__main__'` process** is running (in other words, if the code is running). If so, we start the chatting process by chatting for 5 rounds. This concludes our walkthrough. As you can see, we start with relatively detailed functionalities and mix everything together towards the end. + +We should have a working chatbot now! 🤖 Let's see what it can do. + +``` +from transformers import AutoModelForCausalLM, AutoTokenizer +import torch + + +def load_tokenizer_and_model(model="microsoft/DialoGPT-large"): + """ + Load tokenizer and model instance for some specific DialoGPT model. + """ + # Initialize tokenizer and model + print("Loading model...") + tokenizer = AutoTokenizer.from_pretrained(model) + model = AutoModelForCausalLM.from_pretrained(model) + + # Return tokenizer and model + return tokenizer, model + + +def generate_response(tokenizer, model, chat_round, chat_history_ids): + """ + Generate a response to some user input. 
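+ +    Args (as produced elsewhere in this script): +        tokenizer: the DialoGPT tokenizer returned by load_tokenizer_and_model. +        model: the DialoGPT model instance returned by load_tokenizer_and_model. +        chat_round: zero-based index of the current chat round. +        chat_history_ids: token ids of the conversation so far; None before the first round. + +    Returns: +        The updated chat_history_ids, including the newly generated response.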
+ """ + # Encode user input and End-of-String (EOS) token + new_input_ids = tokenizer.encode(input(">> You:") + tokenizer.eos_token, return_tensors='pt') + + # Append tokens to chat history + bot_input_ids = torch.cat([chat_history_ids, new_input_ids], dim=-1) if chat_round > 0 else new_input_ids + + # Generate response given maximum chat length history of 1250 tokens + chat_history_ids = model.generate(bot_input_ids, max_length=1250, pad_token_id=tokenizer.eos_token_id) + + # Print response + print("DialoGPT: {}".format(tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True))) + + # Return the chat history ids + return chat_history_ids + + +def chat_for_n_rounds(n=5): + """ + Chat with chatbot for n rounds (n = 5 by default) + """ + + # Initialize tokenizer and model + tokenizer, model = load_tokenizer_and_model() + + # Initialize history variable + chat_history_ids = None + + # Chat for n rounds + for chat_round in range(n): + chat_history_ids = generate_response(tokenizer, model, chat_round, chat_history_ids) + + +if __name__ == '__main__': + chat_for_n_rounds(5) +``` + +### What if downloading takes really long? + +It can be the case that downloading the model file takes a lot of time when running the code above. This happens because the `microsoft/DialoGPT-large` is approximately 1.75 GB, and the download speed seems to be limited to approximately 300-400 kB/s, even though I have fast internet. In that case, you might want to use `microsoft/DialoGPT-small` or `microsoft/DialoGPT-medium` instead. + +### Chatting with the bot + +Let's take a look at some conversations that I had with the bot :) + +#### A repetitive one + +``` +(pytorch) C:\Users\chris\MachineCurve>python dialogpt.py +Loading model... +>> You:Good day! +DialoGPT: Good day to you too! +>> You:What day follows a Friday? +DialoGPT: I think it's a Friday. +>> You:No, it's Tuesday. +DialoGPT: I think it's a Tuesday. +>> You:Yes, indeed. Are you just repeating me? +DialoGPT: I think it's a Tuesday. +>> You:Haha ... wow. +DialoGPT: I think it's a Tuesday. +``` + +#### Wikipedia, built by the people of the world + +``` +(pytorch) C:\Users\chris\MachineCurve>python dialogpt.py +Loading model... +>> You:What is Wikipedia? +DialoGPT: It's a website that has a lot of information about the world. +>> You:Cool! Who builds it? +DialoGPT: The people of the world. +>> You:Everyone? +DialoGPT: The people of the world. +>> You:Okay, interesting. +DialoGPT: I'm not sure if I should be impressed or not. +>> You:Why not? +DialoGPT: I'm not sure either. +``` + +#### Not sure if I'm human anymore + +``` +(pytorch) C:\Users\chris\MachineCurve>python dialogpt.py +Loading model... +>> You:Are you a machine? +DialoGPT: I am a machine. +>> You:That is creepy, don't you think? +DialoGPT: I think it's a good thing. +>> You:Why is that? +DialoGPT: Because you're a machine. +>> You:Am I?!?! +DialoGPT: I think you're a machine. +>> You:Why do you think that? +DialoGPT: Because you're a machine. +``` + +* * * + +## Summary + +Transformers have really boosted the field of NLP. DialoGPT is a GPT2-based model that can be used for Neural Response Generation, or in plain English for chatbots. In this article, you have learned... + +- **How DialoGPT works at a high level.** +- **How you can build a chatbot with Machine Learning and Transformers.** +- **How you can converse with your chatbot.** + +And they were interesting conversations indeed! 
DialoGPT thought I was a machine, and it finds that Wikipedia was built by the people of the world. While this is still somewhat.... vague, every now and then, it already feels really human-like. + +I hope that you have learned something from this article! If you did, please feel free to leave a message in the comments section below 💬 I'd love to hear from you. Please do the same if you have any comments or suggestions for improvement. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008. + +Zhang, Y., Sun, S., Galley, M., Chen, Y. C., Brockett, C., Gao, X., … & Dolan, B. (2019). [Dialogpt: Large-scale generative pre-training for conversational response generation.](https://arxiv.org/abs/1911.00536) arXiv preprint arXiv:1911.00536. + +Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). [Language models are unsupervised multitask learners.](https://arxiv.org/abs/1911.00536) OpenAI blog, 1(8), 9. diff --git a/easy-grammar-error-detection-correction-with-machine-learning.md b/easy-grammar-error-detection-correction-with-machine-learning.md new file mode 100644 index 0000000..9277e78 --- /dev/null +++ b/easy-grammar-error-detection-correction-with-machine-learning.md @@ -0,0 +1,337 @@ +--- +title: "Easy grammar error detection & correction with Machine Learning" +date: "2021-07-14" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "gramformer" + - "grammar-correction" + - "huggingface" + - "machine-learning" + - "natural-language-processing" + - "nlp" + - "transformers" +--- + +Machine learning in general and deep learning in particular has boosted Natural Language Processing. A variety of models has allowed to perform machine translation, text summarization and sentiment analysis - to name just a few use cases. Today, we're adding another one to that list: we're going to construct a pipeline for grammar error detection & correction with Machine Learning, using Gramformer. + +After reading this article, you will... + +- **Understand how Transformers can be used for Natural Language Processing.** +- **Have built a Gramformer based grammar error detection & correction system with Python.** +- **Have built the same system with HuggingFace Transformers instead of the Gramformer repository.** + +Let's take a look! :) + +* * * + +\[toc\] + +* * * + +## Transformers for Natural Language Processing + +Deep learning based techniques have transformed the field of Machine Learning ever since the _breakthrough moment_ of AI in 2012. While that breakthrough was in the field of Computer Vision, another prominent field where such models have been applied is Natural Language Processing. + +Ever since 2017, [Transformer based models](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) have been rising in popularity. Before we dive into grammar checking & correction with Gramformer, it is a good idea to provide a brief Transformer background so that everyone understands Gramformer's context. Click on the link for a more detailed introduction. + +Written and spoken text is a sequence of _words_, and eventually even letters. The combination of letters into words and the combination of words, which is the _syntax_ of e.g. 
a written text, has underlying _semantics_, or meaning. This means that when neural networks are to process text, they must be able to handle such meaning. Hence, they must be able to process the text _in sequence_ - or they would fail to capture the meaning. No good would come from a model that mixes all words and letters before processing them, would there? + +Traditionally, NLP has worked with recurrent neural networks (such as [LSTMs](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/)) for handling text. A recurrent neural network is a network where the output of a previous 'pass' is passed along to the next, by means of recurrent connections. In other words, the history of what has been processed before during a run (e.g. the words "I was on the way to..." processed before "the supermarket") is used to predict the next output. In the case of translations, for example, this can be highly useful: translations are sometimes highly dependent on the meaning of what has been produced before. + +Precisely this recurrent segment is the bottleneck of recurrent neural networks. It means that every element of the sequence (e.g., every word) has to be processed _in sequence_. In addition, because LSTMs work with 'memory', memory of words processed quite a long time ago (e.g., 20 words ago with long phrases) is faded, possibly hiding semantic dependencies in complex phrases. Using recurrent neural networks and LSTMs in other words was highly ineffective especially with longer sentences. + +In 2017, Vaswani et al. produced a completely new architecture for processing language - the Transformer architecture. By applying the attention mechanism in a different way, they showed that _attention is all you need_ - meaning that recurrent segments are no longer necessary. The original Transformer architecture is displayed below and represents N **encoder segments** and N **decoder segments**. The encoder segments jointly process text into an intermediary representation, which contains the semantics in a compressed way. This is done by computing **multi-head self-attention**, a mechanism that essentially allows us to compare the importance of individual words (self-attention) from different angles (multi-head). Once again, please check out the link above if you wish to understand this mechanism in more detail. + +The intermediary representations from each encoder segment are then passed into the corresponding decoder segment, as you can see in the image. Where the encoder segment takes a _source_ sequence as its input (e.g. a phrase in Dutch), the decoder takes the corresponding _target_ as its input (e.g. the translation in English). By computing the individual importance of words in the target phrase, and then combining these with the intermediary representation from the source phrase, the model can learn to produce a proper translation. + +Beyond translation, which is traditionally performed with such [sequence-to-sequence](https://www.machinecurve.com/index.php/2020/12/29/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning/) architectures, Transformers have also been applied to text generation (with the GPT-like architectures, using the decoder part) and text interpretation (mostly with BERT-like architectures, using the encoder part). + +Let's now take a look at Gramformer. + +![](images/Diagram-32-1-1024x991.png) + +The original Transformer architecture, as proposed by Vaswani et al. 
(2017) + +* * * + +## Grammar detection & correction with Gramformer + +[Gramformer](https://github.com/PrithivirajDamodaran/Gramformer/) is an _open source_ tool for the **detection** and **correction** of grammatical errors in English text: + +> Gramformer is a library that exposes 3 seperate interfaces to a family of algorithms to detect, highlight and correct grammar errors. To make sure the corrections and highlights recommended are of high quality, it comes with a quality estimator. +> +> GitHub (n.d.) + +* * * + +## Grammar detection & correction with Machine Learning - example code + +Let's now take a look at using Gramformer to build a system for grammar detection & correction. Below, you'll find how to install Gramformer, how to use it for getting corrected text, for getting individual edits, and for getting highlights where errors are detected. + +### Installing Gramformer + +Installing Gramformer is really easy - you can do so using `pip` (preferably `pip3` because of Python 3.x) directly from the Gramformer GitHub repository: + +``` +pip3 install -U git+https://github.com/PrithivirajDamodaran/Gramformer.git +``` + +#### Possible issues when installing Gramformer + +- Issue with `lm-scorer` +- Errant not installed +- En not found https://stackoverflow.com/questions/49964028/spacy-oserror-cant-find-model-en + +### Getting corrected text + +Getting corrected text from Gramformer is quite easy and takes the following steps: + +- Specifying the imports. +- Fixing the PyTorch seed. +- Initializing Gramformer. +- Specifying incorrect phrases. +- Letting Gramformer give suggestions for phrases including corrections. +- Printing corrected phrases. + +Let's begin with the imports. We import `Gramformer` and PyTorch, through `torch`. + +``` +# Imports +from gramformer import Gramformer +import torch +``` + +Then, we fix the seed. This means that all random number generation is performed with the same initialization vector, and that any deviations can not be related to random number generation. + +``` +# Fix seed, also on GPU +def fix_seed(value): + torch.manual_seed(value) + if torch.cuda.is_available(): + torch.cuda.manual_seed_all(value) + +fix_seed(42) +``` + +Then, we initialize `Gramformer`. We set `models` to `1`, or correction mode, and we instruct it _not_ to use GPU. If you have a dedicated GPU, you can of course set it to `True`. + +``` +# Initialize Gramformer +grammar_correction = Gramformer(models = 1, use_gpu=False) +``` + +Let's then create a list with three gramatically incorrect phrases: + +``` +# Incorrect phrases +phrases = [ + 'How is you doing?', + 'We is on the supermarket.', + 'Hello you be in school for lecture.' +] +``` + +...after which we can let Gramformer improve them. For each phrase, we let Gramformer perform a correction by suggesting two candidates, and then printing the incorrect phrase with suggested improvements. 
+ +``` +# Improve each phrase +for phrase in phrases: + corrections = grammar_correction.correct(phrase, max_candidates=2) + print(f'[Incorrect phrase] {phrase}') + for i in range(len(corrections)): + print(f'[Suggestion #{i}] {corrections[i]}') + print('~'*100) +``` + +As a whole, this yields the following code: + +``` +# Imports +from gramformer import Gramformer +import torch + +# Fix seed, also on GPU +def fix_seed(value): + torch.manual_seed(value) + if torch.cuda.is_available(): + torch.cuda.manual_seed_all(value) + +fix_seed(42) + +# Initialize Gramformer +grammar_correction = Gramformer(models = 1, use_gpu=False) + +# Incorrect phrases +phrases = [ + 'How is you doing?', + 'We is on the supermarket.', + 'Hello you be in school for lecture.' +] + +# Improve each phrase +for phrase in phrases: + corrections = grammar_correction.correct(phrase, max_candidates=2) + print(f'[Incorrect phrase] {phrase}') + for i in range(len(corrections)): + print(f'[Suggestion #{i}] {corrections[i]}') + print('~'*100) +``` + +And these are the results when running it: + +``` +[Gramformer] Grammar error correct/highlight model loaded.. +[Incorrect phrase] How is you doing? +[Suggestion #0] ('How are you doing?', -20.39444351196289) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +[Incorrect phrase] We is on the supermarket. +[Suggestion #0] ("We're in the supermarket.", -32.21493911743164) +[Suggestion #1] ('We are at the supermarket.', -32.99837112426758) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +[Incorrect phrase] Hello you be in school for lecture. +[Suggestion #0] ('Hello, are you in school for the lecture?', -48.61809539794922) +[Suggestion #1] ('Hello, you are in school for lecture.', -49.94304275512695) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +``` + +Great! We just built a grammar issue checker & correction tool! :-D + +### Getting individual edits + +Instead of the corrected phrases, we can also print the _edits_ that Gramformer has performed: + +``` +# Print edits for each improved phrase +for phrase in phrases: + corrections = grammar_correction.correct(phrase, max_candidates=2) + print(f'[Incorrect phrase] {phrase}') + for i in range(len(corrections)): + edits = grammar_correction.get_edits(phrase, corrections[i][0]) + print(f'[Edits #{i}] {edits}') + print('~'*100) +``` + +You can see that _is_ was improved into _are_ for the first phrase; that _We is on_ is turned into _We're in_ in the second phrase, and so forth. + +``` +[Incorrect phrase] How is you doing? +[Edits #0] [('VERB:SVA', 'is', 1, 2, 'are', 1, 2)] +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +[Incorrect phrase] We is on the supermarket. +[Edits #0] [('OTHER', 'We is on', 0, 3, "We're in", 0, 2)] +[Edits #1] [('VERB:SVA', 'is', 1, 2, 'are', 1, 2), ('PREP', 'on', 2, 3, 'at', 2, 3)] +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +[Incorrect phrase] Hello you be in school for lecture. 
+[Edits #0] [('OTHER', 'Hello', 0, 1, 'Hello,', 0, 1), ('VERB', '', 1, 1, 'are', 1, 2), ('VERB', 'be', 2, 3, '', 3, 3), ('DET', '', 6, 6, 'the', 6, 7), ('NOUN', 'lecture.', 6, 7, 'lecture?', 7, 8)] +[Edits #1] [('OTHER', 'Hello', 0, 1, 'Hello,', 0, 1), ('MORPH', 'be', 2, 3, 'are', 2, 3)] +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +``` + +### Getting highlights + +Simply changing `get_edits` into `highlight` will yield the original phrase where the errors are marked: + +``` +# Print highlights for each improved phrase +for phrase in phrases: + corrections = grammar_correction.correct(phrase, max_candidates=2) + print(f'[Incorrect phrase] {phrase}') + for i in range(len(corrections)): + highlights = grammar_correction.highlight(phrase, corrections[i][0]) + print(f'[Highlights #{i}] {highlights}') + print('~'*100) +``` + +In other words: + +``` +[Gramformer] Grammar error correct/highlight model loaded.. +[Incorrect phrase] How is you doing? +[Highlights #0] How is you doing? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +[Incorrect phrase] We is on the supermarket. +[Highlights #0] We is on the supermarket. +[Highlights #1] We is on the supermarket. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +[Incorrect phrase] Hello you be in school for lecture. +[Highlights #0] Hello are'>Hello you be in school for lecture. +[Highlights #1] Hello you be in school for lecture. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +``` + +* * * + +## Using Gramformer with HuggingFace Transformers + +According to the `setup.py` installation instructions, Gramformer is built on top of HuggingFace Transformers. This means that you can also construct Gramformer with HuggingFace Transformers, meaning that you don't need to install the Gramformer repository with `pip`. Here's an example that illustrates how you can use the `AutoTokenizer` and `AutoModelForSeq2SeqLM` with the pretrained Gramformer tokenizer/model for grammar checking: + +``` +# Imports +from transformers import AutoTokenizer, AutoModelForSeq2SeqLM + +# Load the tokenizer +tokenizer = AutoTokenizer.from_pretrained("prithivida/grammar_error_correcter_v1") + +# Load the model +model = AutoModelForSeq2SeqLM.from_pretrained("prithivida/grammar_error_correcter_v1") + +# Incorrect phrases +phrases = [ + 'How is you doing?', + 'We is on the supermarket.', + 'Hello you be in school for lecture.' +] + +# Tokenize text +tokenized_phrases = tokenizer(phrases, return_tensors='pt', padding=True) + +# Perform corrections and decode the output +corrections = model.generate(**tokenized_phrases) +corrections = tokenizer.batch_decode(corrections, skip_special_tokens=True) + +# Print correction +for i in range(len(corrections)): + original, correction = phrases[i], corrections[i] + print(f'[Phrase] {original}') + print(f'[Suggested phrase] {correction}') + print('~'*100) +``` + +...results: + +``` +[Phrase] How is you doing? +[Suggested phrase] How are you doing? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +[Phrase] We is on the supermarket. +[Suggested phrase] We are at the supermarket. +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +[Phrase] Hello you be in school for lecture. +[Suggested phrase] Hello you are in school for lecture. 
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +``` + +* * * + +## Summary + +In this article, you have... + +- **Found how Transformers can be used for Natural Language Processing.** +- **Built a Gramformer based grammar error detection & correction system with Python.** +- **Built the same system with HuggingFace Transformers instead of the Gramformer repository.** + +I hope that it was useful and that you have learned a lot. Thank you for reading MachineCurve and happy engineering! 😎 + +* * * + +## Sources + +GitHub. (n.d.). _PrithivirajDamodaran/Gramformer_. [https://github.com/PrithivirajDamodaran/Gramformer](https://github.com/PrithivirajDamodaran/Gramformer) + +Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008. + +GitHub. (2020). _Python 3.8 support? · Issue #10 · simonepri/LM-scorer_. [https://github.com/simonepri/lm-scorer/issues/10](https://github.com/simonepri/lm-scorer/issues/10) diff --git a/easy-install-of-jupyter-notebook-with-tensorflow-and-docker.md b/easy-install-of-jupyter-notebook-with-tensorflow-and-docker.md new file mode 100644 index 0000000..1c897a5 --- /dev/null +++ b/easy-install-of-jupyter-notebook-with-tensorflow-and-docker.md @@ -0,0 +1,255 @@ +--- +title: "Easy install of Jupyter Notebook with TensorFlow 2.0 and Docker" +date: "2020-10-07" +categories: + - "frameworks" +tags: + - "docker" + - "jupyter-notebook" + - "machine-learning" + - "tensorflow" +--- + +Being a data scientist could mean that you have the sexiest job of the 21st Century, according to some business literature. I'd argue that very similar things are true for those who research and engineer machine learning models, as breakthroughs in the areas can directly be captured. If you're familiar with deployment tools, you can even [deploy the model](https://www.machinecurve.com/index.php/2020/03/19/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi/) in the field, for example by means of a web service. + +In my experience, success factors of data science and machine learning projects - or any software project in general - include that runtime environments are shared. In the past, this meant that everyone had to install dependencies on their own systems. Then came Python environments, then came Anaconda, but today we will cover Jupyter Notebook. It's widely used in the data science community and therefore deserves a more prominent role on MachineCurve and in any future article I write. + +We'll do a few things in particular. Firstly, we'll take a look at what a Jupyter Notebook is. What can it be used for? How can it help? This is what we will try to answer. Subsequently, we are interested in actually _installing_ such a Notebook onto your system. This could have been problematic, as everyone's host machine works differently (e.g. due to different software installed on the machine, or different operating systems that are in play). Fortunately, with Docker, we can remove many of those problems by abstracting away the host machine. We'll therefore also cover what Docker is, briefly how it works and how to install it to your system. + +Subsequently, we're going to install a Jupyter Notebook with Docker. 
Specifically, we will install a Notebook oriented to TensorFlow projects, although - as we shall see - there are other Notebooks specifically tailored to other use cases (such as Apache Spark). + +\[toc\] + +* * * + +## What is a Jupyter Notebook? + +Nobody installs software without knowing what it is and what it does. If our goal is to use a Jupyer Notebook, we must first understand what it is. Fortunately, the Jupyter website provides clear information as to what you can expect (Project Jupyter, n.d.): + +> The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. + +Sounds awesome, doesn't it? :) + +Indeed - being widely used within the Data Science Community, a Jupyter Notebook is a web application which can be used for _live code documents_. Those notebooks are essentially digital paper sheets where code can be written. The code can also be executed right there, which makes it an ideal playground for creating a variety of data science and machine learning related code. + +As Python code can be created and executed within a Jupyter Notebook, it is also possible to create and train TensorFlow models from within the web application. What's more, it's even possible to export the Notebook - so that reuse of code is really easy! + +Here's what a (part of) a Jupyter Notebook looks like, with some TensorFlow code: + +![](images/image-3-1024x356.png) + +* * * + +## What is Docker? + +Time to look at the other component of today's article: Docker. If we take a look at the [Wikipedia page for Docker](https://en.wikipedia.org/wiki/Docker_(software)), we read the following: + +> Docker is a set of platform as a service (PaaS) products that use OS-level virtualization to deliver software in packages called containers. + +Now, that's quite a technical text, with some interesting words that you may not be familiar with. Let's therefore break things apart into its individual components to understand them better: + +- **Platform as a Service (PaaS):** a term used to describe software components that together constitute a platform, i.e. a "place" where "things can run" - in this case, containers. +- **OS-level virtualization:** virtualization at the operating system level. +- **Virtualization:** running an operating system virtually, i.e., within another operating system (such as running a Linux VM on a Windows machine). +- **OS-level virtualization, again:** virtualization at the operating system level. Now that we understand virtualization, we can understand that it happens _within_ the operating system (virtualization can be applied on hardware as well). +- **Package-based software called containers:** a design pattern where software is broken up into smaller components, packaged into its own "virtualized file system" (such as Linux) and then ran (called a "container"). + +If you already have some experience with virtualization, it's likely that something is starting to appear here: by means of Docker, you can run software packages in a virtualized way, in their own pseudo-OS, isolated from each other. + +Indeed, that is precisely what Docker does - by means of containerization. Not running a _true_ VM, i.e. a real operating system, but running the basics to make e.g. 
Linux work as the basis for many packages, it allows software developers to 'package' their software and related components together, publish them, for others to run them in an isolated way. + +As a frequent user of Docker myself in my daily work (often, as a container runtime for the Kubernetes orchestration technology), I really love how it works! 😎 + +Now that we know what Docker is and what it can be used for, as well understand what Jupyter Notebooks are, we can clearly see that they can be combined together. Using Docker, it becomes possible to run a Jupyter Notebook as well as the dependencies that come installed with one, in an isolated fashion - i.e., as a container. And precisely that is what we're going to do in order to install a Jupyer Notebook on your machine easily! + +* * * + +## Installing a TensorFlow Notebook with Docker + +- Make sure to install Docker first: [click here for installation instructions](https://www.docker.com/products/docker-desktop). + +If Docker was setup successfully on your machine, it's really easy to install a TensorFlow Notebook with Docker. This is because Jupyter has made available so-called [docker-stacks](https://github.com/jupyter/docker-stacks), which are Notebook based Docker images that can be readily installed. There are many, as you can see by means of the link, but those are most prominent: + +- **Datascience-notebook:** running data science tasks with a Notebook specifically tailored to data scientists and their package requirements. +- **TensorFlow-notebook:** training TensorFlow models from your Notebook with `tensorflow` 2.x preinstalled. As we know given the TensorFlow dependencies, this includes the installation of packages such as `numpy` and `scipy`. +- **Scipy-notebook:** running scientific programming jobs with a Notebook tailored to this usage, specifically focused on `scipy`. +- **R-notebook:** running mathematical programming with a Notebook filled with R packages. +- **Pyspark-notebook:** starting Apache Spark jobs from your Notebook with Spark preinstalled. + +For our case, we want to run this command: + +``` +docker run -v c:/notebook:/home/jovyan/notebooks -p 8888:8888 jupyter/tensorflow-notebook +``` + +It does the following: + +1. It downloads the **jupyter/tensorflow-notebook** Docker image and with `run` creates a container based on this image. +2. **Port 8888** on your host system maps to **port 8888** within the Docker container, meaning that any communications to http://localhost:8888 will be passed to port 8888 of the container. Fortunately for us, that's where our Notebook runs! (If you have something else running at 8888 locally, you could e.g. move your deployment to port 1234 by writing `-p 1234:8888`.) +3. We're **mounting** the folder `notebooks` within the container's `/home/jovyan` folder to, in our case `c:/notebook`, because we want to store the Notebooks on our host machine. If we would not do that, all our work would be gone as soon as we kill the Docker container - or if it crashes. Now, all Notebooks are written to `c:/notebook`, and will be loaded into Jupyter the next time your Notebook container starts. Note the following: + 1. On a Linux or Mac based machine, you can map any folder to `/home/jovyan/notebooks`, e.g. `./hello:/home/jovyan/notebooks`. [This does not work like that on Windows](https://rominirani.com/docker-on-windows-mounting-host-directories-d96f3f056a2c). 
In Docker for Windows, you will have to make available a folder directly in `c:/`, enable volume mounts in your Docker settings, and mount like we did. + 2. As you will see when you start Jupyter for the first time, everything is stored in a folder called `notebooks`. This makes sense, because Jupyter itself starts from `/home/jovyan` - and `/home/jovyan/notebooks` simply represents a folder there. If we would mount our volume _directly_ to `/home/jovyan`, however, we would get a permissions error and our Python kernel would not start (see below). That's why we had to mount to a sub folder, so that kernel files generated _within the container_ and Notebooks _stored outside of the container_ are separated! + +``` +Traceback (most recent call last): + File "/opt/conda/lib/python3.8/site-packages/tornado/web.py", line 1703, in _execute + result = await result + File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 742, in run + yielded = self.gen.throw(*exc_info) # type: ignore + File "/opt/conda/lib/python3.8/site-packages/notebook/services/sessions/handlers.py", line 69, in post + model = yield maybe_future( + File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 735, in run + value = future.result() + File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 742, in run + yielded = self.gen.throw(*exc_info) # type: ignore + File "/opt/conda/lib/python3.8/site-packages/notebook/services/sessions/sessionmanager.py", line 88, in create_session + kernel_id = yield self.start_kernel_for_session(session_id, path, name, type, kernel_name) + File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 735, in run + value = future.result() + File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 742, in run + yielded = self.gen.throw(*exc_info) # type: ignore + File "/opt/conda/lib/python3.8/site-packages/notebook/services/sessions/sessionmanager.py", line 100, in start_kernel_for_session + kernel_id = yield maybe_future( + File "/opt/conda/lib/python3.8/site-packages/tornado/gen.py", line 735, in run + value = future.result() + File "/opt/conda/lib/python3.8/site-packages/notebook/services/kernels/kernelmanager.py", line 176, in start_kernel + kernel_id = await maybe_future(self.pinned_superclass.start_kernel(self, **kwargs)) + File "/opt/conda/lib/python3.8/site-packages/jupyter_client/multikernelmanager.py", line 185, in start_kernel + km.start_kernel(**kwargs) + File "/opt/conda/lib/python3.8/site-packages/jupyter_client/manager.py", line 309, in start_kernel + kernel_cmd, kw = self.pre_start_kernel(**kw) + File "/opt/conda/lib/python3.8/site-packages/jupyter_client/manager.py", line 256, in pre_start_kernel + self.write_connection_file() + File "/opt/conda/lib/python3.8/site-packages/jupyter_client/connect.py", line 468, in write_connection_file + self.connection_file, cfg = write_connection_file(self.connection_file, + File "/opt/conda/lib/python3.8/site-packages/jupyter_client/connect.py", line 138, in write_connection_file + with secure_write(fname) as f: + File "/opt/conda/lib/python3.8/contextlib.py", line 113, in __enter__ + return next(self.gen) + File "/opt/conda/lib/python3.8/site-packages/jupyter_core/paths.py", line 445, in secure_write + raise RuntimeError("Permissions assignment failed for secure file: '{file}'." +RuntimeError: Permissions assignment failed for secure file: '/home/jovyan/.local/share/jupyter/runtime/kernel-38ce2548-e4f9-4a5a-9f28-206ed3225e93.json'. Got '0o655' instead of '0o0600'. 
+```
+
+* * *
+
+## Running a Keras model in the Notebook
+
+After the Docker container has started, you will see log output in the console (use the `-d` flag if you want to run the container in `daemon` mode, i.e., in the background). Log output will look as follows:
+
+```
+    To access the notebook, open this file in a browser:
+        file:///home/jovyan/.local/share/jupyter/runtime/nbserver-6-open.html
+    Or copy and paste one of these URLs:
+        http://c456944aff29:8888/?token=cea80acd38c70100d733a2aa185fc7a3048be68ca69c1998
+     or http://127.0.0.1:8888/?token=cea80acd38c70100d733a2aa185fc7a3048be68ca69c1998
+```
+
+You can now copy the second URL (the first one is a Docker-internal URL) into your web browser and a Notebook environment should start:
+
+![](images/image-2-1024x476.png)
+
+Now click 'New', then 'Python 3', and a new Notebook will be created for you:
+
+![](images/image-4-1024x207.png)
+
+Here, we can add some TensorFlow code, because recall that we ran a Docker container with the TensorFlow dependencies preinstalled, meaning that we can use them immediately. Should you wish to use other packages, you might be able to install them by adding them through `pip` in the container, or preferably, look up the Dockerfile online, copy it and build an image from the copied file including your edits, to ensure that your dependencies won't be gone when you remove or recreate the container.
+
+We can now add Keras code for an actual Notebook. However, since we noted before that Notebooks can be easily distributed, it is probably preferable to show you the Notebook that I created - [it can be found here](https://github.com/christianversloot/easy-jupyter-notebook/blob/master/example-notebook.ipynb)! :) Note that you can also download it there, and import it into your own Jupyter Notebook environment.
+
+However, I've also added the code for a [simple MNIST classifier](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) in the next section.
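+
+Before running the full classifier, you may first want to check that TensorFlow is indeed available inside the container. A minimal cell like the one below should do; note that the exact version number depends on the image you pulled, and that the stock `jupyter/tensorflow-notebook` image is typically CPU-only, so the list of GPUs may well be empty:
+
+```
+import tensorflow as tf
+
+# Verify that TensorFlow was preinstalled in the container
+print(tf.__version__)
+
+# List visible GPUs (typically empty for the stock CPU-only image)
+print(tf.config.list_physical_devices('GPU'))
+```
+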
+Here's what our Notebook looks like right now:
+
+![](images/image-5-1024x886.png)
+
+### Keras code that we used
+
+```
+import tensorflow
+from tensorflow.keras.datasets import mnist
+from tensorflow.keras.models import Sequential, save_model
+from tensorflow.keras.layers import Dense, Dropout, Flatten
+from tensorflow.keras.layers import Conv2D, MaxPooling2D
+import numpy as np
+
+# Model configuration
+img_width, img_height = 28, 28
+batch_size = 250
+no_classes = 10
+validation_split = 0.2
+verbosity = 1
+no_epochs = 15
+
+# Load MNIST dataset
+(input_train, target_train), (input_test, target_test) = mnist.load_data()
+input_shape = (img_width, img_height, 1)
+
+# Reshape data for ConvNet
+input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
+input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
+input_shape = (img_width, img_height, 1)
+
+# Parse numbers as floats
+input_train = input_train.astype('float32')
+input_test = input_test.astype('float32')
+
+# Normalize [0, 255] into [0, 1]
+input_train = input_train / 255
+input_test = input_test / 255
+
+# Convert target vectors to categorical targets
+target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes)
+target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes)
+
+# Create the model
+model = Sequential()
+model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
+model.add(MaxPooling2D(pool_size=(2, 2)))
+model.add(Dropout(0.25))
+model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
+model.add(MaxPooling2D(pool_size=(2, 2)))
+model.add(Dropout(0.25))
+model.add(Flatten())
+model.add(Dense(256, activation='relu'))
+model.add(Dense(no_classes, activation='softmax'))
+
+# Compile the model
+model.compile(loss=tensorflow.keras.losses.categorical_crossentropy,
+              optimizer=tensorflow.keras.optimizers.Adam(),
+              metrics=['accuracy'])
+
+# Fit data to model
+model.fit(input_train, target_train,
+          batch_size=batch_size,
+          epochs=no_epochs,
+          verbose=verbosity,
+          validation_split=validation_split)
+
+# Generate generalization metrics for original model
+score = model.evaluate(input_test, target_test, verbose=0)
+print(f'CNN - Test loss: {score[0]} / Test accuracy: {score[1]}')
+```
+
+* * *
+
+## Summary
+
+In this blog, we saw how we can easily install a Jupyter Notebook by means of Docker. Jupyter Notebooks are web application based live code documents where code can be created, run and exchanged with other people. Since Python runs natively within Notebooks, and TensorFlow can be installed, Notebooks have become very prominent in the data science communities.
+
+Docker, on the other hand, is a containerization technology which means that you can package software into containers and then ship them - for other people to run. Combined, we used Docker and Jupyter Notebook to very easily deploy a Notebook on your system. In addition, TensorFlow components came already preinstalled, meaning that you could train a TensorFlow model immediately - as we saw by means of a simple Convolutional Neural Network.
+
+I hope that you've learnt something from today's article. If you did, please feel free to leave a comment in the comments section below 💬 Please also do the same if you have any other comments, questions or suggestions for improvement. Thank you for reading MachineCurve today and happy engineering! 😎
+
+\[kerasbox\]
+
+* * *
+
+## References
+
+_Jupyter/docker-stacks_. (n.d.). GitHub. 
[https://github.com/jupyter/docker-stacks](https://github.com/jupyter/docker-stacks) + +_Project Jupyter_. (n.d.). [https://jupyter.org/](https://jupyter.org/) + +_Docker_. (n.d.). [https://www.docker.com/](https://www.docker.com/) + +_Docker (software)_. (2013, July 30). Wikipedia, the free encyclopedia. Retrieved October 7, 2020, from [https://en.wikipedia.org/wiki/Docker\_(software)](https://en.wikipedia.org/wiki/Docker_(software)) diff --git a/easy-machine-translation-with-machine-learning-and-huggingface-transformers.md b/easy-machine-translation-with-machine-learning-and-huggingface-transformers.md new file mode 100644 index 0000000..53adc44 --- /dev/null +++ b/easy-machine-translation-with-machine-learning-and-huggingface-transformers.md @@ -0,0 +1,306 @@ +--- +title: "Easy Machine Translation with Machine Learning and HuggingFace Transformers" +date: "2021-02-15" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "huggingface" + - "machine-translation" + - "seq2seq" + - "sequence-to-sequence-learning" + - "text-translation" + - "transformers" +--- + +Transformers have significantly changed the way in which Natural Language Processing tasks can be performed. This architecture, which trumps the classic recurrent one - and even LSTM-based architectures in some cases, has been around since 2017 and is the process of being democratized today. And in fact, many tasks can use these developments: for example, [text summarization](https://www.machinecurve.com/index.php/2020/12/21/easy-text-summarization-with-huggingface-transformers-and-machine-learning/), [named entity recognition](https://www.machinecurve.com/index.php/2021/02/11/easy-named-entity-recognition-with-machine-learning-and-huggingface-transformers/), [sentiment analysis](https://www.machinecurve.com/index.php/2020/12/23/easy-sentiment-analysis-with-machine-learning-and-huggingface-transformers/) - they can all be successfully used with this type of model. + +In this tutorial, we will be looking at the task of **machine translation**. We'll first take a look at how Transformers can be used for this purpose, and that they effectively perform a sequence-to-sequence learning task. This includes a brief recap on what Transformers are and how the T5 Transformer, which we will use in this article, works. + +Subsequently, we'll be introducing HuggingFace Transformers, which is a library that is democratizing Transformer-based NLP at incredible speed. We'll show you how easy pipelines for Machine Translation are available for English-French, English-German and English-Romanian translation tasks. We also show you how you can use them. If they don't suit you - for example because you want to translate into a different language - you will also learn how. + +So, in short, after reading this tutorial, you will... + +- **Understand how Transformers can be used for Machine Translation, in particular the T5 Transformer.** +- **See how HuggingFace Transformer based Pipelines can be used for easy Machine Translation.** +- **See how you can use other pretrained models if the standard pipelines don't suit you.** + +Let's take a look! 🚀 + +**Update 24/Mar/2021:** fixed issue with example 2. + +* * * + +\[toc\] + +* * * + +## Code example: pipelines for Machine Translation + +The two code examples below give **fully working examples of pipelines for Machine Translation**. 
The first is an easy out-of-the-box pipeline making use of the HuggingFace Transformers `pipeline` API, and which works for English to German (`en_to_de`), English to French (`en_to_fr`) and English to Romanian (`en_to_ro`) translation tasks. + +The second is a more difficult but generic approach with which you can use any of the HuggingFace Seq2Seq [Translation models](https://huggingface.co/models?pipeline_tag=translation) available within HuggingFace. + +If you want to understand what's happening under the hood in more detail, such as how the T5 Transformer used for this task works, make sure to read the rest of this tutorial as well! 🔥 + +### Example 1: easy out-of-the-box pipeline + +``` +from transformers import pipeline + +# Init translator +translator = pipeline("translation_en_to_de") + +# Translate text +text = "Hello my friends! How are you doing today?" +translation = translator(text) + +# Print translation +print(translation) +``` + +### Example 2: constructing a pipeline for any pretrained model + +_Note:_ this example requires you to run PyTorch. + +``` +from transformers import AutoTokenizer, AutoModelForSeq2SeqLM + +# Initialize the tokenizer +tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-nl") + +# Initialize the model +model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-nl") + +# Tokenize text +text = "Hello my friends! How are you doing today?" +tokenized_text = tokenizer.prepare_seq2seq_batch([text], return_tensors='pt') + +# Perform translation and decode the output +translation = model.generate(**tokenized_text) +translated_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0] + +# Print translated text +print(translated_text) +``` + +* * * + +## How Transformers can be used for Machine Translation + +Previously, machine learning engineers used **recurrent neural networks** when they wanted to perform tasks related to sequences. These networks obviously generated an output when served an input, but in addition also included a _recurrent segment_ - a segment pointing to itself. + +In other words, these models can use representations of the hidden state - and hence previous interactions, slowly faded over time - for generating new inputs. In the case of the sentence "I went to the milk store. I bought a can of", the presence of "milk store" might help the model realize that it was in fact a can of _milk_ that I bought. + +Visually, such networks look as follows when folded and eventually [unfolded for optimization](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/#why-vanishing-gradients). + +![](images/2560px-Recurrent_neural_network_unfold.svg_.png) + +A fully recurrent network. Created by [fdeloche](https://commons.wikimedia.org/wiki/User:Ixnay) at [Wikipedia](https://en.wikipedia.org/wiki/Recurrent_neural_network#/media/File:Recurrent_neural_network_unfold.svg), licensed as [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0). No changes were made. + +While in theory a significant advancement, these models proved troublesome. For example, due to their structure and activation functions used, they suffered significantly from the vanishing gradients problem. In other words, when maximum sequence length was set for too long, the most upstream unfolds would no longer learn properly. This was solved by the introduction of Long Short-Term Memory networks, or LSTMs, but still another problem persisted. 
This problem was that inputs in such networks are processed sequentially, which significantly slows down processing ('one at a time processing'). + +Even the addition of a mechanism that more strongly considered relationships between tokens, the attention mechanism, did not solve the problem of sequential processing, because it is inherently associated with the network architectures themselves. + +**Transformers**, which were introduced in 2017 in a paper by Vaswani et al., **do** solve this problem by showing that in fact _attention is all you need_. + +### What are Transformer models? + +Transformer models, which have been visualized below, entirely remove the need for sequentially processing. In fact, they allow sequences of tokens (in plainer English, parts of words from a phrase) to be processed in parallel! Below, you can see how the entire architecture works (although there are architectures like [BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) which use the left part only and like [GPT](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/) which use the right part only) for performing sequence-to-sequence tasks like Machine Translation. + +- Note that with _inputs_, we mean sequences of tokens from a source representation (e.g. a source language like English) and with _outputs_, sequences of tokens from a target representation (e.g. a target language like German). "It's going well" is therefore an input sequence from English, "Es geht gut" the corresponding output sequence from German. +- The _**inputs**_ are first converted into (learned) **input embeddings**, which effectively convert these inputs into vector format. This helps reduce the dimensionality of the input space. In addition, input embeddings are then **positionally encoded** meaning that information about the positioning of these embeddings is added. Since inputs are no longer processed in sequence, this information is lost, but positional encodings add this information back. +- Then input is fed through a series of so-called **Encoder Segments**. In these segments, inputs are split into query, key and value blocks, which are fed into [multi-head attention segments](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/#multi-head-attention). These essentially score the input tokens for their interimportance, i.e. how important they are given each other. Subsequently, the inputs are passed through a feed forward network (one time per input), yielding a so-called _hidden state_ that is either used by the next encoder segment or serves as the output from the encoder. Note that in the whole process, residual layers are present too in order to allow gradients to flow more smoothly during error backpropagation. +- The **Decoder Segment** then first takes the _**outputs**_, embeds and encodes them, and lets them pass through a [masked multi-head attention](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/#masked-multi-head-attention) segment. This segment performs the same scoring as normal multi-head attention, but only in a masked way, meaning that inputs cannot see future inputs. This is necessary as the decoder segment involves predicting the output of the model, and if during training samples can see future values, they will only memorize these values instead of learning patterns from text. 
The outputs from masked multi-head attention are fed to another multi-head attention segment that combines the outputs from the encoder with the expected textual outputs. These are then processed and fed through a Feed Forward network per token. Note that there are also multiple Decoder Segments here, and that thus outputs either serve as inputs for the next decoder segment or as output of the decoder as a whole. +- The **final output** from the last Decoder Segment is then passed through a Linear layer where a Softmax activation function generates a probability distribution over all possible output values. The `argmax` value represents the most likely token, and it is selected. +- In other words, without recurrent segments, this architecture is capable of being trained on source inputs and target outputs, learning to pay attention to specific structures in text, and predicting the output given previous inputs. Really great! + +![](images/Diagram-32-1-1024x991.png) + +An overview of the Transformer architecture. Source: Vaswani et al. (2017) + +### Today's model type: T5 Transformer + +In this article, we will be using a Transformer architecture called [Text-to-Text Transfer Transformer or T5](https://www.machinecurve.com/index.php/question/what-is-the-t5-transformer-and-how-does-it-work/). This type of Transformer architecture [was proposed by Google](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) and investigated whether it was possible to train a universal Transformer architecture with many language tasks instead of using a task-specific architecture. + +Using a Common Crawl-derived dataset called C4 and by prefixing the various tasks with instructions (such as "translate" or "summarize"), the authors were able to create a model that can be used for a variety of language tasks when finetuned for these tasks. The image below visualizes how T5 works from input to output. + +Today, we'll be using a T5 model (`t5-base`) that was finetuned for Machine Translation. + +![](images/j3XuSZT.png) + +Source: Raffel et al. (2019) + +* * * + +## Introducing HuggingFace Transformers and Pipelines + +For creating today's Transformer model, we will be using the [HuggingFace Transformers library](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/). This library was created by the company HuggingFace to democratize NLP. It makes available many pretrained Transformer based models. In addition to that, it also exposes a set of Pipelines with which it is very easy to build NLP based applications. + +Examples of these pipelines are [Sentiment Analysis](https://www.machinecurve.com/index.php/2020/12/23/easy-sentiment-analysis-with-machine-learning-and-huggingface-transformers/), [Named Entity Recognition](https://www.machinecurve.com/index.php/2021/02/11/easy-named-entity-recognition-with-machine-learning-and-huggingface-transformers/) and [Text Summarization](https://www.machinecurve.com/index.php/2020/12/21/easy-text-summarization-with-huggingface-transformers-and-machine-learning/), but today we will focus on Machine Translation. + +### Pipelines for Machine Translation + +Currently (February 2021), a `translation` pipeline is available within the HuggingFace library. If called, it performs the following: + +- It implements a so-called `TranslationPipeline`. +- For PyTorch (`pt`), it implements an `AutoModelForSeq2SeqLM`, whereas for TensorFlow it implements a `TFAutoModelForSeq2SeqLM`.... 
which can be loaded with a variety of pretrained models: + - A T5 Base Transformer for English to French (`en-fr`) translation tasks. + - A T5 Base Transformer for English to German (`en-de`) translation tasks. + - A T5 Base Transformer for English to Romanian (`en-ro`) translation tasks. + +``` + # This task is a special case as it's parametrized by SRC, TGT languages. + "translation": { + "impl": TranslationPipeline, + "tf": TFAutoModelForSeq2SeqLM if is_tf_available() else None, + "pt": AutoModelForSeq2SeqLM if is_torch_available() else None, + "default": { + ("en", "fr"): {"model": {"pt": "t5-base", "tf": "t5-base"}}, + ("en", "de"): {"model": {"pt": "t5-base", "tf": "t5-base"}}, + ("en", "ro"): {"model": {"pt": "t5-base", "tf": "t5-base"}}, + }, + }, +``` + +* * * + +## Building your own Machine Translation pipeline + +Now that we understand _what_ we will be using exactly, it's time to show you _how we will use it_. + +In other words, you're going to build your own pipeline for Machine Translation using Transformers. + +Let's take a look at how this can be done - it may surprise you, but these days doing so only requires a few lines of code thanks to libraries like HuggingFace Transformers. + +### Machine Translation example code + +``` +from transformers import pipeline + +# Init translator +translator = pipeline("translation_en_to_de") + +# Translate text +text = "Hello my friends! How are you doing today?" +translation = translator(text) + +# Print translation +print(translation) +``` + +As you can see above, a series of steps are performed: + +- First of all, we import the `pipeline` API from the `transformers` library. If you don't have it yet, you can install HuggingFace Transformers with `pip` using `pip install transformers`. Make sure to install it in the environment where you have a running installation of (recent) TensorFlow or PyTorch too. +- Then, we initialize the Machine Translation pipeline for an English to German translation task (verify that this is possible by taking a look at the spec a bit earlier). +- Now, we're ready to translate some text. We input the text `Hello my friends! How are you doing today?` into the translation pipeline. +- We then print the outcome on screen. + +### Running the Machine Translation pipeline + +Running the script for the first time requires that the pretrained model is downloaded: + +``` +Downloading: 100%|█████████████████████████████████████████████████████████████████| 1.20k/1.20k [00:00<00:00, 240kB/s] +Downloading: 39%|█████████████████████████▍ | 343M/892M [03:04<05:31, 1.65MB/s] +``` + +### Outcome + +Once this is complete, however, this is what you'll see: + +``` +[{'translation_text': 'Hallo liebe Freunde, wie geht es Ihnen heute?'}] +``` + +Since `Hallo liebe Freunde, wie geht es Ihnen heute?` equals our input but then in German, this is pretty awesome! 😎 + +* * * + +## Using another pretrained model for translation + +The default HuggingFace pipeline only supports translation tasks from English into German, French and Romanian. Since these + +Fortunately, the HuggingFace platform also comes with a repository of [pretrained models](https://huggingface.co/models) using a variety of Transformer architectures (think BERT, GPT, T5, ...) and then finetuned on a large variety of language tasks (including machine translation!). + +Filtering for [translation models](https://huggingface.co/models?pipeline_tag=translation), we can see that in February 2021 over 1350 models are available for translation. 
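+
+Note that many of these models can also be plugged straight into the `pipeline` API by passing their name via the `model` argument. As a quick sketch - assuming the `Helsinki-NLP/opus-mt-en-nl` model that we will also use in the next section, and noting that the exact task string and supported arguments may differ slightly between Transformers versions - this could look as follows:
+
+```
+from transformers import pipeline
+
+# Construct a translation pipeline around a specific pretrained model
+# (an English-to-Dutch OPUS-MT model in this case)
+translator = pipeline('translation_en_to_nl', model='Helsinki-NLP/opus-mt-en-nl')
+
+# Translate text
+print(translator('Hello my friends! How are you doing today?'))
+```
+
+The more generic approach below gives you a bit more control over tokenization and generation, which is why we walk through it step by step.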
+
+### Using the generic model loading approach
+
+That's why here, you'll also learn how to apply another pretrained model for translation. Doing so requires you to perform a few additional steps.
+
+1. Importing `AutoTokenizer` and `AutoModelForSeq2SeqLM` from `transformers`. Note that you need to import `TFAutoModelForSeq2SeqLM` if you want the TensorFlow equivalent.
+2. Initializing the Tokenizer. We'll be using the Helsinki-NLP pretrained/finetuned OpusMT English to Dutch model for initializing the tokenizer. Using a tokenizer, we can convert textual inputs into tokens.
+3. Initializing the model. Using the same pretrained/finetuned model, we can generate translations.
+4. We then tokenize the input text in a Seq2Seq fashion, as if we convert a batch of one sentence (hence wrapping everything inside a Python list), returning PyTorch tensors.
+5. We then generate a translation for all the elements in the batch, decode the batch, and take the first element.
+6. Which is the translated text that we then print on screen.
+
+The code for this looks as follows.
+
+```
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+# Initialize the tokenizer
+tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-nl")
+
+# Initialize the model
+model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-nl")
+
+# Tokenize text
+text = "Hello my friends! How are you doing today?"
+tokenized_text = tokenizer.prepare_seq2seq_batch([text], return_tensors='pt')
+
+# Perform translation and decode the output
+translation = model.generate(**tokenized_text)
+translated_text = tokenizer.batch_decode(translation, skip_special_tokens=True)[0]
+
+# Print translated text
+print(translated_text)
+```
+
+### Running the translation model
+
+Here too, we're facing a downloading step:
+
+```
+Downloading: 100%|█████████████████████████████████████████████████████████████████| 1.13k/1.13k [00:00<00:00, 280kB/s]
+Downloading: 100%|███████████████████████████████████████████████████████████████████| 790k/790k [00:02<00:00, 378kB/s]
+Downloading: 100%|███████████████████████████████████████████████████████████████████| 814k/814k [00:01<00:00, 439kB/s]
+Downloading: 100%|█████████████████████████████████████████████████████████████████| 1.66M/1.66M [00:03<00:00, 457kB/s]
+Downloading: 100%|██████████████████████████████████████████████████████████████████| 42.0/42.0 [00:00<00:00, 12.6kB/s]
+Downloading:  25%|████████████████                                                   | 78.4M/316M [00:42<01:57, 2.03MB/s]
+```
+
+### Outcome
+
+This is the outcome of our generic translation task:
+
+```
+Hallo vrienden, hoe gaat het vandaag?
+```
+
+...since this is Dutch for `Hello my friends! How are you doing today?`, this is once again awesome! 😎
+
+* * *
+
+## Recap
+
+In this article, we looked at creating a Machine Translation pipeline with Python and HuggingFace Transformers using the T5 Transformer. This type of Transformer architecture has the ability to learn a variety of language tasks using one generic architecture, rather than a task-specific one, and was proposed by Google in 2020. Our T5 Transformer, `t5-base`, was subsequently finetuned on Machine Translation, so it can be used for that purpose.
+
+And with HuggingFace Transformers (`pip install transformers`), generating such a pipeline is really easy!
+
+Beyond the simple pipeline, which supports English-German, English-French and English-Romanian translations out of the box, it may be the case that you want to translate from a different source language or into a different target language, or maybe even both. 
We also showed you how to create a pipeline for any pretrained Seq2Seq model for translation available within HuggingFace, using an English-to-Dutch translation model as an example. + +I hope that you have learned something from today's tutorial! If you did, please feel free to drop a message in the comment section below 💬 Please do the same if you have any questions, remarks, or suggestions for improvement. I'd love to hear from you :) + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008. + +Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … & Liu, P. J. (2019). [Exploring the limits of transfer learning with a unified text-to-text transformer](https://arxiv.org/abs/1910.10683). arXiv preprint arXiv:1910.10683. diff --git a/easy-masked-language-modeling-with-machine-learning-and-huggingface-transformers.md b/easy-masked-language-modeling-with-machine-learning-and-huggingface-transformers.md new file mode 100644 index 0000000..c1dd833 --- /dev/null +++ b/easy-masked-language-modeling-with-machine-learning-and-huggingface-transformers.md @@ -0,0 +1,282 @@ +--- +title: "Easy Masked Language Modeling with Machine Learning and HuggingFace Transformers" +date: "2021-03-02" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "bert" + - "distilbert" + - "distilroberta" + - "huggingface" + - "language-model" + - "masked-language-modeling" + - "mlm" + - "nlp" + - "roberta" + - "transformers" +--- + +Masked Language Modeling (MLM) is a language task very common in Transformer architectures today. It involves masking part of the input, then learning a model to predict the missing tokens - essentially reconstructing the non-masked input. MLM is often used within pretraining tasks, to give models the opportunity to learn textual patterns from unlabeled data. + +Downstream tasks can benefit from models pretrained on MLM too. Suppose that you are faced with the task of reconstructing the contents of partially destroyed documents. For example, say that you have found a written letter that reads "I am ... to the bakery". While this is easy - _going_ is the expected missing value here - you can imagine that many tasks may benefit if complexity is bigger. + +In this tutorial, we will therefore focus on creating a pipeline for Masked Language Modeling. It will be an easy pipeline, meaning that you can do so with only a few lines of code, using a model pretrained before. For this, we will be using the HuggingFace Transformers library. + +**After reading this tutorial, you will understand...** + +- What Masked Language Modeling involves. +- How Transformers can be used for MLM tasks, and especially the DistilRoBERTa base model. +- What it takes to build a pipeline for Masked Language Modeling yourself, with only a few lines of code. + +Let's take a look 🚀 + +**Update 05/Mar/2021:** fixed a small mistake regarding the description of Masked Language Modeling tasks. + +* * * + +\[toc\] + +* * * + +## Example code: MLM with HuggingFace Transformers + +This code example shows you how you can implement Masked Language Modeling with HuggingFace Transformers. It provides a full example for constructing a pipeline, masking a phrase and getting the result with the model. 
It can be used _if_ HuggingFace Transformers (`pip install transformers`) and a recent version of TensorFlow 2 or PyTorch are installed in your environment.
+
+Of course, make sure to read the rest of this tutorial as well if you want to understand concepts in more detail! 🚀
+
+```
+from transformers import pipeline
+
+# Initialize MLM pipeline
+mlm = pipeline('fill-mask')
+
+# Get mask token
+mask = mlm.tokenizer.mask_token
+
+# Get result for particular masked phrase
+phrase = f'Read the rest of this {mask} to understand things in more detail'
+result = mlm(phrase)
+
+# Print result
+print(result)
+```
+
+This yields:
+
+```
+[{
+    'sequence': 'Read the rest of this article to understand things in more detail',
+    'score': 0.35419148206710815,
+    'token': 1566,
+    'token_str': ' article'
+}, {
+    'sequence': 'Read the rest of this post to understand things in more detail',
+    'score': 0.20478709042072296,
+    'token': 618,
+    'token_str': ' post'
+}, {
+    'sequence': 'Read the rest of this guide to understand things in more detail',
+    'score': 0.07164707034826279,
+    'token': 4704,
+    'token_str': ' guide'
+}, {
+    'sequence': 'Read the rest of this essay to understand things in more detail',
+    'score': 0.06781881302595139,
+    'token': 14700,
+    'token_str': ' essay'
+}, {
+    'sequence': 'Read the rest of this blog to understand things in more detail',
+    'score': 0.04165174812078476,
+    'token': 5059,
+    'token_str': ' blog'
+}]
+```
+
+* * *
+
+## What is Masked Language Modeling?
+
+Today, many language models - primarily Transformer models, which we will discuss in more detail below - are trained on a language task. There is a variety of language tasks. **Language modeling** is one of them. The goal with language modeling is that given a current set of input tokens, a new token is predicted. This token should obviously be the token that corresponds to the actual next token in the input data. This way, language models can learn to recognize patterns in text.
+
+**Masked Language Modeling** works slightly differently. In this case, a model does not have access to the full input. Rather, it has access to a _masked_ input, where some (often [10-20 percent](https://www.machinecurve.com/index.php/question/what-percentage-of-tokens-is-masked-in-berts-mlm-objective/)) of the input tokens are masked. With masked, we simply mean that a token (and sometimes a span of tokens) is replaced with a special mask token - `<mask>` for the model that we will use below. The goal, then, becomes reconstructing the original sequence, i.e. to reveal what is hidden under the mask. The task adds complexity on top of a regular language modeling task, and some works argue that it can help boost performance.
+
+![](images/mlm-1024x447.png)
+
+Masked Language Modeling
+
+MLM is primarily used for pretraining a model, after which it can be [finetuned](https://www.machinecurve.com/index.php/question/what-is-fine-tuning-based-training-for-nlp-models/) to a particular downstream task. As you can see in the image below, no text needs to be labeled by human labelers in order to predict the missing values. I corresponds to I, am to am, going to going, and so on. The only thing is that some of the words are masked, but the underlying word is available during the optimization step. This is greatly beneficial, since labeling data is a costly task and little labeled data is available. Unlabeled data, however, is ubiquitous. This is why models are often pretrained on these large unlabeled corpora. 
Subsequently, they can be finetuned to a particular task with a labeled dataset - for example, for [text summarization](https://www.machinecurve.com/index.php/2020/12/21/easy-text-summarization-with-huggingface-transformers-and-machine-learning/). It is effectively a form of transfer learning, and MLM can greatly help here. + +![](images/Diagram-39-1024x436.png) + +* * * + +## Today's model: a DistilRoBERTa base model + +These days, NLP models often make use of the so-called [Transformer](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) paradigm. With Transformer models, we no longer need [recurrent segments](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/) to make sequence compatible machine learning models. This was necessary for quite a long time, significantly impacting the performance of models especially with longer sequences of words. + +Vaswani et al. showed in a 2017 paper that _Attention is all you need_ - that, by slightly changing the neural network architecture, the attention mechanism was the only necessary thing in order to build language models that can learn by processing all tokens in parallel. Ever since, Transformer models have been at the forefront of NLP developments. These days, there are many, and BERT is one of them. + +### What is BERT? + +BERT, which stands for **[Bidirectional Encoder Representations from Transformers](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/)**, is a special type of Transformer model. Using the left part of the Transformer only - i.e., the [encoder segment](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/#what-are-transformers) - it is not a fully Seq2Seq model and must use a special task to generate an encoding during pretraining. As you can see in the image below, it utilizes a Masked Language Modeling task for this purpose. + +- In BERT, inputs are separated into two segments per sequence: a sentence A and a sentence B. +- There is a separator token separating both sequences. +- In addition, there is a [CLS token](https://www.machinecurve.com/index.php/question/how-does-bert-separate-token-and-phrase-level-tasks/) that represents class level (i.e. global) information. + +Sentences A and B are masked. When processed through BERT, the goal is to reconstruct the original input - a typical MLM task. In addition, the CLS input token produces a C output token. This token contains global information (i.e. information about the sequence as a whole) and is primarily relevant during finetuning, e.g. to generate a model that can perform a task for which the whole sequence is required (such as [sentiment analysis](https://www.machinecurve.com/index.php/2020/12/23/easy-sentiment-analysis-with-machine-learning-and-huggingface-transformers/)). + +![](images/Diagram-44-1024x625.png) + +### From BERT to RoBERTa to DistilRoBERTa + +While BERT is very successful (in fact, [BERT powers many of Google's search queries](https://blog.google/products/search/search-language-understanding-bert/)), it has one drawback: BERT is a really big model. With up to 340 million parameters, it cannot be run on slower machines. Edge devices? Forget it. + +That's why in recent years, research communities have started focusing on making Transformers more available to the masses. This approach is twofold. 
First of all, companies like [HuggingFace](https://huggingface.co/) democratize NLP by creating a generic library for using Transformer based models - and allowing researchers to open source their pretrained and finetuned models for usage by the open source community. + +While this is great, it does not solve the problem of BERT's size. That's where _efficiency approaches_ come in. Take [ConvBERT](https://www.machinecurve.com/index.php/question/what-is-convbert-and-how-does-it-work/), which is a more recent example. It utilizes special convolution operations to replace part of the self-attention mechanism in the BERT model, yielding more efficient training and inference without losing much of the performance. In addition, since BERT was an early Transformer directly spawning from the [Vaswani model](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) and the first [GPT model](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/), authors had less knowledge about how to pretrain and finetune most optimally. + +That's where **RoBERTa** steps in, which stands for **Ro**bust **BERT** pretraining **a**pproach. Described in Liu et al. (2019), the work attempts to replicate the training process for BERT - and found that BERT is significantly undertrained. They design and validate a new pretraining approach which allows their version of BERT to significantly outperform the then state-of-the-art. + +But RoBERTa is still big. That's why we use a distilled variant called **Distil**RoBERTa. According to HuggingFace (n.d.), it's faster because it is smaller - with 82 million parameters instead of 125 million. That's still too many for many real-time uses, but hey, I think that we will see edge oriented Transformer like approaches only in the years to come. We haven't even seen the start yet! + +> This model is a distilled version of the RoBERTa-base model. It follows the same training procedure as DistilBERT. The code for the distillation process can be found here. This model is case-sensitive: it makes a difference between english and English. The model has 6 layers, 768 dimension and 12 heads, totalizing 82M parameters (compared to 125M parameters for RoBERTa-base). On average DistilRoBERTa is twice as fast as Roberta-base. +> +> HuggingFace (n.d.) + +* * * + +## Building a Masked Language Modeling pipeline + +Let's now take a look at how you can build a Masked Language Modeling pipeline with Python. For this, we'll be using [HuggingFace Transformers](https://huggingface.co/transformers/quicktour.html). This is a library created by a company democratizing NLP by making available generic pipelines and APIs for many pretrained and finetuned Transformer models in an open source way. + +In other words, you can create your own pipelines for a variety of tasks - think [text summarization](https://www.machinecurve.com/index.php/2020/12/21/easy-text-summarization-with-huggingface-transformers-and-machine-learning/), [machine translation](https://www.machinecurve.com/index.php/2021/02/16/easy-machine-translation-with-machine-learning-and-huggingface-transformers/) and [sentiment analysis](https://www.machinecurve.com/index.php/2020/12/23/easy-sentiment-analysis-with-machine-learning-and-huggingface-transformers/); more [here](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/) - with very few lines of code. And that is precisely what we will show you here. Let's take a look. 
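+
+As a small aside: if you want to verify the size difference between DistilRoBERTa and RoBERTa for yourself, you can load both models and count their parameters. This is an optional sketch - it downloads the pretrained weights for both models, and with the language modeling head included the counts should come out at roughly 82 and 125 million respectively:
+
+```
+from transformers import AutoModelForMaskedLM
+
+# Compare the number of parameters of DistilRoBERTa and RoBERTa
+for name in ['distilroberta-base', 'roberta-base']:
+    model = AutoModelForMaskedLM.from_pretrained(name)
+    print(f'{name}: {model.num_parameters() / 1e6:.0f}M parameters')
+```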
+ +Before we start, it's important to create a Python file - e.g. `mlm.py`. Also make sure that you have installed the most recent version of HuggingFace Transformers. The library is available through `pip`, so you can do `pip install transformers`. Note that it has to run with either a recent version of TensorFlow 2 or PyTorch, so this must also be installed in the environment where the pipeline is run. + +### Full model code + +Now that we have stated all the preparations, it's time to write some code. As promised, creating a Masked Language Modeling pipeline is really easy. + +- We first import the `pipeline` module from `transformers`. +- We use it to initialize a `mlm` pipeline, which is represented as `fill-mask`. Note that this loads the `[distilroberta-base](https://huggingface.co/distilroberta-base)` which was pretrained on OpenWebTextCorpus. +- We derive the mask token from its tokenizer, so that we can mask a phrase. +- We specify a `phrase`, where we deliberately leave out a word with our mask, and feed it through the pipeline. +- We finally print the result. + +``` +from transformers import pipeline + +# Initialize MLM pipeline +mlm = pipeline('fill-mask') + +# Get mask token +mask = mlm.tokenizer.mask_token + +# Get result for particular masked phrase +phrase = f'At a {mask} you can drink beer and wine' +result = mlm(phrase) + +# Print result +print(result) +``` + +### Results + +Running our code with e.g. `python mlm.py` yields the following result: + +``` +[{ + 'sequence': 'At a discount you can drink beer and wine', + 'score': 0.22876130044460297, + 'token': 6720, + 'token_str': ' discount' +}, { + 'sequence': 'At a premium you can drink beer and wine', + 'score': 0.08584875613451004, + 'token': 4549, + 'token_str': ' premium' +}, { + 'sequence': 'At a minimum you can drink beer and wine', + 'score': 0.07710543274879456, + 'token': 3527, + 'token_str': ' minimum' +}, { + 'sequence': 'At a cafe you can drink beer and wine', + 'score': 0.059273071587085724, + 'token': 16381, + 'token_str': ' cafe' +}, { + 'sequence': 'At a festival you can drink beer and wine', + 'score': 0.04346294328570366, + 'token': 3241, + 'token_str': ' festival' +}] +``` + +These are quite relevant, yet we can also see that the model is not too confident - it can either be about the price (cheaper, more expensive) or about the location (cafe, festival). Let's try again with some phrase: + +``` +phrase = f'Performing additions and subtractions is a part of {mask}' +``` + +The result: + +``` +[{ + 'sequence': 'Performing additions and subtractions is a part of mathematics', + 'score': 0.0855618417263031, + 'token': 25634, + 'token_str': ' mathematics' +}, { + 'sequence': 'Performing additions and subtractions is a part of programming', + 'score': 0.07897020876407623, + 'token': 8326, + 'token_str': ' programming' +}, { + 'sequence': 'Performing additions and subtractions is a part of calculus', + 'score': 0.06476884335279465, + 'token': 41454, + 'token_str': ' calculus' +}, { + 'sequence': 'Performing additions and subtractions is a part of arithmetic', + 'score': 0.03068726696074009, + 'token': 43585, + 'token_str': ' arithmetic' +}, { + 'sequence': 'Performing additions and subtractions is a part of learning', + 'score': 0.025549395009875298, + 'token': 2239, + 'token_str': ' learning' +}] +``` + +Looking good! 😎 + +* * * + +## Summary + +In this tutorial, you learned about the following things: + +- What Masked Language Modeling is about. 
+- How Transformers and specifically the DistilRoBERTa model can be used for this purpose. +- How to build a Masked Language Modeling pipeline yourself with Python and HuggingFace Transformers. + +I hope that you have learned a few things here and there! If you did, please leave a message in the comments section below, as I'd love to hear from you 💬 Please do the same if you have any questions or other remarks. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008. + +Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). [Bert: Pre-training of deep bidirectional transformers for language understanding](https://arxiv.org/abs/1810.04805). _arXiv preprint arXiv:1810.04805_. + +Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). [Roberta: A robustly optimized bert pretraining approach.](https://arxiv.org/abs/1907.11692) _arXiv preprint arXiv:1907.11692_. + +HuggingFace. (n.d.). _Distilroberta-base · Hugging face_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/distilroberta-base](https://huggingface.co/distilroberta-base) diff --git a/easy-named-entity-recognition-with-machine-learning-and-huggingface-transformers.md b/easy-named-entity-recognition-with-machine-learning-and-huggingface-transformers.md new file mode 100644 index 0000000..53df94b --- /dev/null +++ b/easy-named-entity-recognition-with-machine-learning-and-huggingface-transformers.md @@ -0,0 +1,201 @@ +--- +title: "Easy Named Entity Recognition with Machine Learning and HuggingFace Transformers" +date: "2021-02-11" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "huggingface" + - "machine-learning" + - "named-entity-recognition" + - "transformers" +--- + +Deep learning approaches have boosted the field of Natural Language Processing in recent years. A variety of tasks can now be performed, and relatively easy. For example, we can now use ML to perform [text summarization](https://www.machinecurve.com/index.php/2020/12/21/easy-text-summarization-with-huggingface-transformers-and-machine-learning/), [question answering](https://www.machinecurve.com/index.php/2020/12/21/easy-question-answering-with-machine-learning-and-huggingface-transformers/) and [sentiment analysis](https://www.machinecurve.com/index.php/2020/12/23/easy-sentiment-analysis-with-machine-learning-and-huggingface-transformers/) - with only a few lines of code. + +And it doesn't end there. The task of **Named Entity Recognition** can also be performed using Machine Learning. Among others, it can be performed with Transformers, which will be the focus of today's tutorial. In it, we will focus on performing an NLP task with a pretrained Transformer. It is therefore structured as follows. Firstly, we'll take a brief look at the concept of Named Entity Recognition itself - because you'll need to understand what it is. Then, we focus on Transformers for NER, and in particular the pretraining-finetuning approach and the model we will be using today. + +This is finally followed by **an example implementation** of a Named Entity Recognition model that is **easy and understandable** by means of a HuggingFace Transformers pipeline. 
+ +After reading this tutorial, you will understand... + +- **What Named Entity Recognition is all about.** +- **How Transformers can be used for Named Entity Recognition.** +- **How a pipeline performing NER with Machine Learning can be built.** + +Let's take a look! 🚀 + +* * * + +\[toc\] + +* * * + +## Code example: NER with Transformers and Python + +The code below allows you to create a **simple but effective Named Entity Recognition pipeline** with HuggingFace Transformers. If you use it, ensure that the former is installed on your system, as well as TensorFlow or PyTorch. If you want to understand everything in a bit more detail, make sure to read the rest of the tutorial as well! 🔥 + +``` +from transformers import pipeline + +# Initialize the NER pipeline +ner = pipeline("ner") + +# Phrase +phrase = "David helped Peter enter the building, where his house is located." + +# NER task +ner_result = ner(phrase) + +# Print result +print(ner_result) +``` + +* * * + +## What is Named Entity Recognition? + +If we are to build a model for **Named Entity Recognition** (NER), we will need to understand what it does, don't we? + +> \[Named Entity Recognition is used\] to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. +> +> Wikipedia (2005) + +As with any technical definition, it is quite a difficult one for beginners, so let's take a look at it in a bit more detail :-) + +Now, what is a "named entity", for example? + +> A named entity is a real-world object, such as persons, locations, organizations, products, etc., that can be denoted with a proper name. It can be abstract or have a physical existence. +> +> Wikipedia (2007) + +I see - so NER models can be used to detect real-world objects in text. For example, for the following text: + +- The bus is heading to the garage for maintenance. + +Here, 'bus' is of type _vehicle_, whereas the 'garage' is of type _building_. Those are named entities. The words 'the', 'is', 'to the', 'for' are not, and are hence of type _outside of a named entity_, as we shall see later. + +In other words, using Named Entity Recognition, we can extract real-world objects from text, or infuse more understanding about the meaning of a particular text (especially when combined with other approaches that highlight different aspects of the text). Let's now take a look at how Transformer architectures can be used for this purpose. + +* * * + +## Transformers for NER: pretraining and finetuning + +Until 2017, most NLP related tasks that used neural networks were performed using network architectures like recurrent neural networks or LSTMs. This proved to be troublesome, despite some improvements such as the _attention mechanism_: the sequential nature of models ensured that they could not be trained well on larger texts. + +Vaswani et al. (2017) entirely replaced the paradigm of recurrent networks with a newer paradigm by introducing an architecture called a Transformer - notably, in a paper named and indicating that _attention is all you need_. The attention mechanism, when combined with an encoder-decoder type of architecture, is enough to achieve state-of-the-art performance in a variety of language tasks... _without_ the recurrent segments being there. + +In other words, NLP models have moved from sequential processing to parallel processing of text... 
and this has tremendously improved their performance. + +Among others, models like BERT and GPT have been introduced. They use (parts of, i.e. only the encoder or decoder) the original Transformer architecture and apply their own elements on top of it, then train it to achieve great performance. But how does training happen? Let's take a look at the common approach followed by Transformers, which is called pretraining-finetuning. + +[![](images/Diagram-32-1-1024x991.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/Diagram-32-1.png) + +An overview of the Transformer architecture. Source: Vaswani et al. (2017) + +### What is the pretraining-finetuning approach to NLP? + +Training a supervised learning model requires you to have at your disposal a labeled dataset. As with anything related to data management, creating, maintaining and eventually having such datasets available poses a big challenge to many organizations. This can be problematic if organizations want to use Transformers, because these models are often _very big_ (GPT-3 has billions of parameters!). If datasets are too small, models cannot be trained because they overfit immediately. + +Compared to labeled data, organizations (and the world in general) often have a lot of _unlabeled_ data at their disposal. Think of the internet as one big example of massive amounts of unlabeled text -- semantics are often hidden within the content, while web pages don't provide such metadata and thus labels in some kind of parallel data space whatsoever. If only we could benefit from this vast amount of data, that would be good. + +Transformers are often trained with a **pretraining-finetuning approach**, which benefits from this fact. The approach involves using a large, unlabeled corpus of text (with large, you can think about gigabytes of data) to which it applies a very generic language modeling task (such as ["predict the next token given this input text"](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/#pre-training-task) or ["predict the hidden tokens"](https://www.machinecurve.com/index.php/question/what-is-a-masked-language-model-mlm-objective/)). This process called _pretraining_ allows a Transformer to capture generic syntactical and semantic patterns from text. After the pretraining phase has finished, we can use the labeled but smaller dataset to perform _finetuning_ the model to that particular task (indeed, such as "predict the named entity for this word/token", which we are taking a look at in this tutorial). + +Visually, this process looks as follows: + +![](images/Diagram-39-1024x436.png) + +Briefly note that a pretrained model does not necessarily have to be used in a finetuning setting, because finetuning requires a lot of computational resources. You can also perform a [**feature-based approach**](https://www.machinecurve.com/index.php/question/what-is-feature-based-training-for-nlp-models/) (i.e. use the outputs of the pretrained models as tokens in a normal, and thus smaller, neural network). Many studies however find finetuning-based approaches to be superior to feature-based ones, despite the increased computational cost. + +### Today's pretrained Transformer: BERTlarge finetuned on CoNLL-2003 + +Today, we will be using the [BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) Transformer. BERT, which stands for Bidirectional Encoder Representations for Transformer, utilizes the encoder segment (i.e. 
the left part) of the original Transformer architecture. For pretraining, among others, it performs a particular task where masked inputs have to be reconstructed. + +We are using the `BERTlarge` type of BERT which is pretrained with 24 encoder segments, a 1024-dimensional hidden state, and 16 attention heads (64 dimensions per head). + +Finetuning happens with the CoNLL-2003 dataset: + +> The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. +> +> UAntwerpen (n.d.) + +* * * + +## Building the Named Entity Recognition pipeline + +Constructing the pipeline for our Named Entity Recognition pipeline occurs with the HuggingFace Transformers library. This library, which is developed by a company called HuggingFace and democratizes using language models (and training language models) for PyTorch and TensorFlow, provides a so-called `pipeline` that supports Named Entity Recognition out of the box. + +Very easy indeed! + +The model that will be used in this pipeline is the so-called `dbmdz/bert-large-cased-finetuned-conll03-english` model, which involves a BERTlarge model trained on CoNLL-2003 and more specifically, its English NER dataset. + +It can recognize whether (parts of) words belong to either of these classes: + +- O, Outside of a named entity +- B-MIS, Beginning of a miscellaneous entity right after another miscellaneous entity +- I-MIS, Miscellaneous entity +- B-PER, Beginning of a person’s name right after another person’s name +- I-PER, Person’s name +- B-ORG, Beginning of an organization right after another organization +- I-ORG, Organization +- B-LOC, Beginning of a location right after another location +- I-LOC, Location + +### Full model code + +Below, you can find the entire code for the NER pipeline. As I said, it's going to be a very easy pipeline! + +- First of all, we import the `pipeline` API from the HuggingFace `transformers` library. If you don't have it installed: you can do so with `pip install transformers`. Please make sure that you have TensorFlow or PyTorch on your system, and in the environment where you are running the code. +- You then initialize the NER pipeline by initializing the pipeline API for a `"ner"` task. +- The next action you take is defining a phrase and feeding it through the `ner` pipeline. +- That's it - you then print the outcome on screen. + +``` +from transformers import pipeline + +# Initialize the NER pipeline +ner = pipeline("ner") + +# Phrase +phrase = "David helped Peter enter the building, where his house is located." + +# NER task +ner_result = ner(phrase) + +# Print result +print(ner_result) +``` + +Here's what you will see for the phrase specified above: + +``` +[{'word': 'David', 'score': 0.9964208602905273, 'entity': 'I-PER', 'index': 1}, {'word': 'Peter', 'score': 0.9955975413322449, 'entity': 'I-PER', 'index': 3}] +``` + +The model recognizes `David` at index 1 and `Peter` at index 3 as `I-PER`, a person's name. Indeed they are! + +* * * + +## Recap + +In this tutorial, you have seen how you can create a simple but effective pipeline for Named Entity Recognition with Machine Learning and Python. First, we looked at what NER involves, and saw that it can be used for recognizing real-world objects in pieces of text. Subsequently, we looked at Transformers, and how they are used in NLP tasks these days. 
We saw that Transformers improve upon more classic approaches like recurrent neural networks and LSTMs in the sense that they do no longer process data sequentially, but rather in parallel. + +Once the theoretical part was over, we implemented an easy NER pipeline with HuggingFace Transformers. This library democratizes NLP by means of providing a variety of models and model training facilities out of the box. Today, we used a BERTlarge model trained on a specific NER dataset for our NER pipeline. Creating the pipeline was really easy, as we saw in our code example! + +I hope that you have learned something when reading the tutorial today! If you did, please feel free to drop a comment in the comments section below 💬 Please do the same if you have any questions, remarks, or comments otherwise. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2005, May 18). _Named-entity recognition_. Wikipedia, the free encyclopedia. Retrieved February 11, 2021, from [https://en.wikipedia.org/wiki/Named-entity\_recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) + +Wikipedia. (2007, October 11). _Named entity_. Wikipedia, the free encyclopedia. Retrieved February 11, 2021, from [https://en.wikipedia.org/wiki/Named\_entity](https://en.wikipedia.org/wiki/Named_entity) + +Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008. + +UAntwerpen. (n.d.). _Language-independent named entity recognition (II)_. [https://www.clips.uantwerpen.be/conll2003/ner/](https://www.clips.uantwerpen.be/conll2003/ner/) diff --git a/easy-object-detection-with-python-huggingface-transformers-and-machine-learning.md b/easy-object-detection-with-python-huggingface-transformers-and-machine-learning.md new file mode 100644 index 0000000..98ea519 --- /dev/null +++ b/easy-object-detection-with-python-huggingface-transformers-and-machine-learning.md @@ -0,0 +1,321 @@ +--- +title: "Easy Object Detection with Python, HuggingFace Transformers and Machine Learning" +date: "2022-01-04" +categories: + - "deep-learning" + - "frameworks" +tags: + - "detection-transformer" + - "huggingface-transformers" + - "object-detection" + - "transformers" +--- + +YOLO! If you're into machine learning, it's a term that rings a bell. Indeed, You Only _Look_ Once has been one of the default ways for object detection in the past few years. Driven by the progress made in ConvNets, many versions of the object detection method have been created already. + +These days, however, there is a competitor on the horizon - and it's the use of Transformer based models in computer vision. More specifically, the use of Transformers for _object detection._ + +In today's tutorial, you'll be learning about this type of Transformer model. You will also learn to create your own object detection pipeline with Python, a default Transformer model and the HuggingFace Transformers library. In fact, that will be very _easy_, so let's take a look! + +After reading this tutorial, you will... + +- **Understand what object detection can be used for.** +- **Know how Transformer models work when they are used for object detection.** +- **Have implemented a Transformer model based pipeline for (image) object detection with Python and HuggingFace Transformers.** + +Let's go! 😎 + +* * * + +\[toc\] + +* * * + +## What is object detection? 
+
+Take a look around you. Likely, you will see a lot of things - possibly a computer monitor, a keyboard and mouse, or when you're browsing in your mobile browser, a smartphone.
+
+These are all _objects_, instances of a specific _class_. In the image below, for example, we see an instance of class _human_. We also see many instances of class _bottle_. While a class is a blueprint, an object is the real deal, having many unique characteristics while being a member of the class because of the shared ones.
+
+In pictures and videos, we see many such objects. When you're making a video of traffic, for example, it's likely that you see many instances of _pedestrian_, of _car_, of _bicycle_, and so forth. And knowing that they are in the image can be very fruitful!
+
+Why? Because you can count them, to give one example. It allows you to say something about the crowdedness of a neighborhood. Another example is the [detection of a parking spot](https://www.researchgate.net/publication/337502002_Deep_Learning_Based_On-Street_Parking_Spot_Detection_for_Smart_Cities) in busy areas, allowing you to park your car.
+
+And so forth.
+
+That's what _object detection_ is used for!
+
+![](images/image-15.png)
+
+* * *
+
+## Object detection and Transformers
+
+Traditionally, object detection is performed with [Convolutional Neural Networks](https://www.machinecurve.com/index.php/2021/07/08/convolutional-neural-networks-with-pytorch/). Usually, their architectures are [specifically tailored to object detection](https://www.machinecurve.com/index.php/2021/01/15/object-detection-for-images-and-videos-with-tensorflow-2-x/), as they take images as their input and output the bounding boxes of the objects within them.
+
+If you're familiar with neural networks, you know that ConvNets are really useful when it comes to learning important features in images, and that they are spatially invariant - in other words, it doesn't matter where learned objects are in the image or what their size is. If the network is capable of _seeing_ the object's characteristics and associates them with a specific class, it can recognize the object. Many different cats, for example, can be recognized as instances of the _cat class_.
+
+Recently, however, [Transformer architectures](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) have acquired significant attention in the field of deep learning - and that of NLP in particular. Transformers work by encoding the input into a _high-dimensional state_ and subsequently decoding it back into a desired output. By smartly using the [concept of self-attention](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/#multi-head-attention), Transformers not only learn to detect specific patterns, but also learn to associate these patterns with others. In the cat example above, to give just one example, Transformers can learn to associate the cat with its characteristic spot - the couch, to give just an idea :)
+
+If Transformers can be used for image classification, it is only one step further to use them for object detection. Carion et al. (2020) have shown that it is in fact possible to use a Transformer based architecture for doing so. In their work [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872), they introduce the _Detection Transformer_ or DeTr, which we will use for creating our object detection pipeline today.
+ +It works as follows, and does not even abandon CNNs fully: + +- Using a Convolutional Neural Network, important features are derived from the input image. These are positionally encoded, like in language Transformers, to help the neural network learn where these features are present in the image. +- The input is flattened and subsequently encoded into intermediate state, using the _transformer encoder_, and attention. +- The input to the _transformer decoder_ is this state and a _learned set of object queries_, acquired during the training process. You can imagine them as questions, asking "is there an object here, because I have seen one before in many cases?", which will be answered by using the intermediate state. +- Indeed, the decoder's output is a set of predictions via multiple prediction heads: one for each query. As the number of queries in DeTr is set to 100 by default, it can predict only 100 objects in one image, unless you configure it differently. + +![](images/image-5-1024x268.png) + +How Transformers can be used for object detection. From Carion et al. (2020), [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872), introducing the DeTr Transformer used in this pipeline. + +* * * + +## HuggingFace Transformers and its ObjectDetectionPipeline + +Now that you understand how DeTr works, it's time to use it for creating an actual object detection pipeline! + +We will use [HuggingFace Transformers](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/) for this purpose, which was built to make working with NLP and Computer Vision Transformers easy. In fact, it is so easy that using it boils down to loading the `ObjectDetectionPipeline` - that by defaults loads a DeTr Transformer trained with a ResNet-50 backbone for generating image features. + +Let's start looking at the tech details now! :D + +The `ObjectDetectionPipeline` can easily be initialized as a `pipeline` instance ... in other words, by means of `pipeline("object-detection")`, and we shall see this in the example below. When you provide no other input, this is how the pipeline gets initialized according to GitHub (n.d.): + +``` + "object-detection": { + "impl": ObjectDetectionPipeline, + "tf": (), + "pt": (AutoModelForObjectDetection,) if is_torch_available() else (), + "default": {"model": {"pt": "facebook/detr-resnet-50"}}, + "type": "image", + }, +``` + +Unsurprisingly, an `ObjectDetectionPipeline` instance is used, which is tailored to object detection. In the PyTorch version of HuggingFace Transformers, an `AutoModelForObjectDetection` is used for this purpose. Interestingly, for the TensorFlow version, no implementation of this pipeline is available...yet?! + +As you learned, by default, the `facebook/detr-resnet-50` [model](https://huggingface.co/facebook/detr-resnet-50) is used for deriving image features: + +> DEtection TRansformer (DETR) model trained end-to-end on COCO 2017 object detection (118k annotated images). It was introduced in the paper [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872) by Carion et al. +> +> HuggingFace (n.d.) + +The COCO dataset (Common Objects in Context) is one of the standard datasets used for object detection models and was used for training this model. Don't worry, you can obviously also train your own DeTr based model! + +**Important!** To use the `ObjectDetectionPipeline`, it is important that the `timm` package - containing PyTorch image models - is installed. 
Make sure to run this command when you haven't installed it yet: `pip install timm`. + +* * * + +## Implementing an Easy Object Detection pipeline with Python + +Let's now take a look at implementing an easy solution for Object Detection with Python. + +Recall that you are using HuggingFace Transformers, which must be installed onto your system - run `pip install transformers` if you don't have it yet. + +We also assume that PyTorch, one of the leading libraries for deep learning these days, is installed. Recall from above `ObjectDetectionPipeline` that will be loaded under the hood when calling `pipeline("object-detection")` has no instance for TensorFlow, and thus PyTorch is necessary. + +This is the image we will run the object detection pipeline we're creating for, later in this article: + +![](images/street_mc-1024x684.jpg) + +We begin with the imports: + +``` +from transformers import pipeline +from PIL import Image, ImageDraw, ImageFont +``` + +Obviously, we're using `transformers`, and specifically its `pipeline` representation. Then, also, we use `PIL`, a Python library for loading, visualizing and manipulating images. Specifically, we're using the first import - `Image` for loading the image, `ImageDraw` for drawing the bounding boxes and the labels, the latter of which also requires `ImageFont`. + +Speaking of both, next up is loading the font (we pick Arial) and initializing the object detection pipeline we introduced above. + +``` +# Load font +font = ImageFont.truetype("arial.ttf", 40) + +# Initialize the object detection pipeline +object_detector = pipeline("object-detection") +``` + +Then, we create a definition called `draw_bounding_box`, which - unsurprisingly - will be used for drawing bounding boxes. It takes the image (`im`), the class probability, coordinates of the bounding box, the bounding box index in the list with bounding boxes this definition will be used for, and the length of that list as input. + +In the definition, you will... + +- First draw the actual bounding box on top of the image, represented as a `rounded_rectangle` bbox with a red color and small radius to ensure smooth edges. +- Secondly, draw the textual label slightly above the bounding box. +- Finally, return the intermediate result, so that we can draw the next bounding box and label on top. + +``` +# Draw bounding box definition +def draw_bounding_box(im, score, label, xmin, ymin, xmax, ymax, index, num_boxes): + """ Draw a bounding box. """ + + print(f"Drawing bounding box {index} of {num_boxes}...") + + # Draw the actual bounding box + im_with_rectangle = ImageDraw.Draw(im) + im_with_rectangle.rounded_rectangle((xmin, ymin, xmax, ymax), outline = "red", width = 5, radius = 10) + + # Draw the label + im_with_rectangle.text((xmin+35, ymin-25), label, fill="white", stroke_fill = "red", font = font) + + # Return the intermediate result + return im +``` + +What remains is the _core part_ - using the `pipeline` and then drawing the bounding boxes based on its result. + +Here's how we do that. + +First of all, the image - which we call `street.jpg` and which is present in the same directory as the Python script - will be opened and stored in an `im` PIL object. We simply feed it to the initialized `object_detector` - **which is enough for the model to return the bounding boxes!** The Transformers library takes care of the rest 😲. + +We then assign data to some variables and iterate over each result, drawing the bounding box. + +Finally, we save the image - to `street_bboxes.jpg`. 
+ +Voilà, that's it! :o + +``` +# Open the image +with Image.open("street.jpg") as im: + + # Perform object detection + bounding_boxes = object_detector(im) + + # Iteration elements + num_boxes = len(bounding_boxes) + index = 0 + + # Draw bounding box for each result + for bounding_box in bounding_boxes: + + # Get actual box + box = bounding_box["box"] + + # Draw the bounding box + im = draw_bounding_box(im, bounding_box["score"], bounding_box["label"],\ + box["xmin"], box["ymin"], box["xmax"], box["ymax"], index, num_boxes) + + # Increase index by one + index += 1 + + # Save image + im.save("street_bboxes.jpg") + + # Done + print("Done!") +``` + +### Using a different model / using your own model for object detection + +....and if you _did_ create your own model, or want to use a different one, it is very easy to use _that_ instead of the ResNet-50 based DeTr Transformer. + +Doing so will require you to add the following to the imports: + +``` +from transformers import DetrFeatureExtractor, DetrForObjectDetection +``` + +Then, you can initialize the feature extractor and model, and initialize the `object_detector` with them instead of the default one. For example, if you want to use ResNet-101 as your backbone instead, you can do this as follows: + +``` +# Initialize another model and feature extractor +feature_extractor = DetrFeatureExtractor.from_pretrained('facebook/detr-resnet-101') +model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-101') + +# Initialize the object detection pipeline +object_detector = pipeline("object-detection", model = model, feature_extractor = feature_extractor) +``` + +* * * + +## Results + +Here's the result we got after running the object detection pipeline on our input image: + +![](images/street_bboxes_mc-1024x684.jpg) + +Or, when zoomed in: + +[![](images/image-4-1024x455.png)](https://www.machinecurve.com/wp-content/uploads/2021/12/image-4.png) + +* * * + +## Object detection example - full code + +Here's the full code for people who want to get started immediately: + +``` +from transformers import pipeline +from PIL import Image, ImageDraw, ImageFont + + +# Load font +font = ImageFont.truetype("arial.ttf", 40) + +# Initialize the object detection pipeline +object_detector = pipeline("object-detection") + + +# Draw bounding box definition +def draw_bounding_box(im, score, label, xmin, ymin, xmax, ymax, index, num_boxes): + """ Draw a bounding box. """ + + print(f"Drawing bounding box {index} of {num_boxes}...") + + # Draw the actual bounding box + im_with_rectangle = ImageDraw.Draw(im) + im_with_rectangle.rounded_rectangle((xmin, ymin, xmax, ymax), outline = "red", width = 5, radius = 10) + + # Draw the label + im_with_rectangle.text((xmin+35, ymin-25), label, fill="white", stroke_fill = "red", font = font) + + # Return the intermediate result + return im + + +# Open the image +with Image.open("street.jpg") as im: + + # Perform object detection + bounding_boxes = object_detector(im) + + # Iteration elements + num_boxes = len(bounding_boxes) + index = 0 + + # Draw bounding box for each result + for bounding_box in bounding_boxes: + + # Get actual box + box = bounding_box["box"] + + # Draw the bounding box + im = draw_bounding_box(im, bounding_box["score"], bounding_box["label"],\ + box["xmin"], box["ymin"], box["xmax"], box["ymax"], index, num_boxes) + + # Increase index by one + index += 1 + + # Save image + im.save("street_bboxes.jpg") + + # Done + print("Done!") +``` + +That's it! 
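+
+If you only want to draw the most confident predictions, you could additionally filter the pipeline output by its `score` key, right after the `object_detector(im)` call and before `num_boxes` is computed. This is a small, optional sketch rather than part of the original pipeline, and the 0.9 threshold is an arbitrary choice:
+
+```
+# Optional: keep only confident detections before drawing them.
+# The 0.9 threshold is arbitrary - tune it for your own images.
+bounding_boxes = [b for b in bounding_boxes if b["score"] >= 0.9]
+print(f"Kept {len(bounding_boxes)} confident detections")
+```
+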
Using Transformers for object detection is very easy these days. + +If you have any questions, comments or suggestions, feel free to leave a message in the comments section below 💬 I will then try to answer you as quickly as possible. For now, thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +GitHub. (n.d.). _Transformers/\_\_init\_\_.py at master · huggingface/transformers_. [https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/\_\_init\_\_.py](https://github.com/huggingface/transformers/blob/master/src/transformers/pipelines/__init__.py) + +HuggingFace. (n.d.). _Pipelines_. Hugging Face – The AI community building the future. [https://huggingface.co/docs/transformers/v4.15.0/en/main\_classes/pipelines#transformers.ObjectDetectionPipeline](https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/pipelines#transformers.ObjectDetectionPipeline) + +HuggingFace. (n.d.). _Facebook/detr-resnet-50 · Hugging face_. Hugging Face – The AI community building the future. [https://huggingface.co/facebook/detr-resnet-50](https://huggingface.co/facebook/detr-resnet-50) + +Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020, August). [End-to-end object detection with transformers.](https://arxiv.org/abs/2005.12872) In _European Conference on Computer Vision_ (pp. 213-229). Springer, Cham. diff --git a/easy-question-answering-with-machine-learning-and-huggingface-transformers.md b/easy-question-answering-with-machine-learning-and-huggingface-transformers.md new file mode 100644 index 0000000..363f678 --- /dev/null +++ b/easy-question-answering-with-machine-learning-and-huggingface-transformers.md @@ -0,0 +1,204 @@ +--- +title: "Question Answering with Python, HuggingFace Transformers and Machine Learning" +date: "2020-12-21" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "bert" + - "deep-learning" + - "distilbert" + - "huggingface" + - "natural-language-processing" + - "question-answering" + - "transformer" +--- + +In the last few years, Deep Learning has really boosted the field of Natural Language Processing. Especially with the [Transformer architecture](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) which has become a state-of-the-art approach in text based models since 2017, many Machine Learning tasks involving language can now be performed with unprecedented results. Question answering is one such task for which Machine Learning can be used. In this article, we will explore **building a Question Answering model pipeline** in a **really easy way**. + +It is structured as follows. Firstly, we will take a look at the role of the Transformer architecture in Natural Language Processing. We're going to take a look at what Transformers are and how the encoder-decoder segments from the architecture work together. This includes a look at [BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/), which is an extension of the original or _vanilla_ Transformer, only using the encoder segment. Here, we also focus on the prevalent line of thinking in NLP that models must be _pretrained on massive datasets_ and subsequently _finetuned to specific tasks_. + +Jointly, this information provides the necessary context for introducing today's Transformer: a **DistilBERT-based Transformer** fine-tuned on the Stanford Question Answering Dataset, or **SQuAD**. 
It lies at the basis of the practical implementation work to be performed later in this article, using the **[HuggingFace Transformers](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/)** library and the `question-answering` pipeline. HuggingFace Transformers democratize the application of Transformer models in NLP by making available _really easy_ pipelines for building Question Answering systems powered by Machine Learning, and we're going to benefit from that today! :) + +Let's take a look! 😎 + +**Update 07/Jan/2021:** added more links to relevant articles. + +* * * + +\[toc\] + +* * * + +## The role of Transformers in Natural Language Processing + +Before we dive in on the Python based implementation of our Question Answering Pipeline, we'll take a look at _some_ theory. I always think that Machine Learning should be intuitive and developer driven, but this doesn't mean that we should omit all theory. Rather, I think that having a basic and intuitive understanding of what is going on under the hood will only help in making sound choices with respect to Machine Learning algorithms and architectures that can be used. + +For this reason, in this section, we'll be looking at three primary questions: + +1. What is a Transformer architecture? +2. What is this _pretraining and fine-tuning_ dogma? +3. What does today's Transformer look like? + +### What is a Transformer architecture? + +![](images/1_BHzGVskWGS_3jEcYYi6miQ-842x1024.png) + +Source: Vaswani et al. (2017) + +Back in 2017, researchers and engineers faced a problem when they wanted to train language models. + +The state-of-the-art approaches at the time required sequences (such as sentences) to be processed in a sequential, word-by-word fashion. Each word had to be fed to the model individually, after which a prediction about the most likely token emerged. This was the only way in which some source sequences could be converted into corresponding target sequences. + +Having [solved the issues](https://www.machinecurve.com/index.php/2020/12/21/from-vanilla-rnns-to-transformers-a-history-of-seq2seq-learning/) with respect to vanishing gradients (by means of [LSTMs](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/) and GRUs) and long-term memory loss (by means of the previous ones as well as the attention mechanism), this was still bugging the Machine Learning communities involved with language models. + +Until Vaswani et al. (2017) proposed an approach where the recurrent and hence sequential aspects from the model were removed altogether. In the landmark paper _[Attention is all you need](https://arxiv.org/abs/1706.03762)_, the authors outlined that by applying the attention mechanism in a smart way, i.e. in a self-attention fashion, inputs could be processed in parallel without losing the ability for particular inputs to attend to other inputs when generating the target sequence prediction. + +This approach, which is called the **[Transformer architecture](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/)**, has been a real breakthrough in Natural Language Processing. In fact, thanks to Transformers, LSTM and GRU based models are now no longer considered to be state-of-the-art. Rather, many model architectures have emerged based on the original or _vanilla_ Transformer proposed by Vaswani et al. 
If you're reading about [BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) ([driving many Google Searches today](https://blog.google/products/search/search-language-understanding-bert/)) or [GPT-based models](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/) (such as [the exclusive GPT-3 license acquired by Microsoft](https://blogs.microsoft.com/blog/2020/09/22/microsoft-teams-up-with-openai-to-exclusively-license-gpt-3-language-model/)), you're reading about Transformer-inspired architectures. + +Transformers are a smart combination of two segments that work together nicely during the [training process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/). There is an _encoder_ segment which converts inputs, in the vanilla Transformer [learned embeddings](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/) with positional encoding, into a high-dimensional, intermediate state. This is subsequently fed into the _decoder segment_, which processes the expected outputs jointly with the encoded inputs into a prediction for the subsequent token. By applying self-attention and doing so in a smart way so that many contexts can be looked at at once (the so-called _multi-head attention_), Transformers have really ensured that parallelism entered the world of Natural Language Processing. + +This finally solved one of the remaining key issues with language models at the time. + +Click [here](https://www.machinecurve.com/index.php/2020/12/21/from-vanilla-rnns-to-transformers-a-history-of-seq2seq-learning/#attention-is-all-you-need-transformers) if you want to read more about vanilla Transformers. Here, we're going to continue by looking at the BERT architecture. + +### BERT: Bidirectional Encoder Representations from Transformer + +Vanilla Transformers perform both _encoding_ and _decoding_, meaning that when an input flows through the model, it automatically gets converted into an output prediction. In other words, if I input the English phrase _I am doing great today_ into a model trained for a translation task into Dutch, the output would be _Het gaat vandaag geweldig._ + +Sometimes, we don't want that, especially when we want to perform Transfer Learning activities: what if we can train a model to encode really well based on a really large dataset? If we'd add the decoder segment, there is only a limited opportunity for transfering what has been learned onto one's own Machine Learning task. If we leave the user with the encoded state instead, they can choose how to fine-tune on their own. + +#### BERT + +This is one of the key ideas in the **[BERT architecture](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/)**, which stands for Bidirectional Encoder Representations from Transformer. It was proposed in a paper written by Devlin et al. (2018) and takes the encoder segment from the vanilla Transformer architecture. With additional changes (such as not taking any learned embeddings but rather WordPiece embeddings, and changing the learning tasks performed during training), a BERT-based model is really good at understanding natural language. + +One of the other key changes is that a BERT based model is bidirectional in the sense that it does not only use the context in a left-to-right fashion (which is what vanilla Transformers do). It also does so in a right-to-left fashion - at the same time. 
This allows models to experience much richer context for generating encodings based on the input values. + +#### Pretraining and fine-tuning + +The idea of taking the encoder only is that when it is trained on a massive dataset, it can learn to perform the encoding task in a general way, and do so _really well_. This is precisely why BERT proposes that models are pretrained on really large datasets and subsequently fine-tuned to specific language tasks. For example, as we have seen in our article about [text summarization](https://www.machinecurve.com/index.php/2020/12/21/easy-text-summarization-with-huggingface-transformers-and-machine-learning/), a BERT-like encoder can be coupled with a GPT-like decoder and subsequently be fine-tuned to summarization on a dataset related to the task. + +### Today's Transformer: DistilBERT + +Even BERT was not the end station itself. The reason why is its computational intensity: in its two flavors, it has either 110 million parameters (BERT base) or 345 million parameters (BERT large). And that is a _huge_ number, especially if you look at [relatively simple ConvNets](https://www.machinecurve.com/index.php/2020/01/31/reducing-trainable-parameters-with-a-dense-free-convnet-classifier/) which have only hundreds of thousands of parameters. + +The problem with such large amounts of parameters is that both fine-tuning _and_ inference takes a really long time. If you have to wait seconds for your prediction to return, well, how can we expect to use that model in production? + +This is why many approaches have emerged to make computation lighter, just like in Computer Vision - with e.g. the MobileNet architecture, and others. One of these approaches and the one that lies at the basis of today's Transformer-based Question Answering pipeline is the **DistilBERT architecture**, which was proposed in a 2019 paper by Sanh et al. + +Here's the abstract for the work. If you would like to read about DistilBERT in more detail I'd suggest [clicking here](https://arxiv.org/abs/1910.01108) for the article, but from what the abstract suggests it was made 60% faster by performing a 40% size reduction while retaining 97% of its language understanding. This is a significant improvement and a great optimization with respect to traditional or 'vanilla' BERT. + +> As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study. +> +> Sanh et al. 
(2019) + +### Fine-tuning DistilBERT on SQuAD + +DistilBERT was pretrained on the same datasets as BERT, being "a concatenation of English Wikipedia and Toronto Book Corpus" (Sanh et al., 2019; HuggingFace, n.d.). The general distilled version of BERT was subsequently fine-tuned using the SQuAD dataset, which stands for **Stanford Question Answering Dataset** (Stanford Question Answering Dataset, n.d.). + +> **S**tanford **Qu**estion **A**nswering **D**ataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or _span_, from the corresponding reading passage, or the question might be unanswerable. +> +> Stanford Question Answering Dataset (n.d.) + +A few questions with corresponding answer from this dataset are as follows: + +- **Q: In what country is Normandy located?** + - A: France +- **Q: What was the percentage of people that voted in favor of the Pico Act of 1859?** + - A: 75 +- **Q: What is the second busiest airport in the United States?** + - A: Los Angeles International Airport + +In sum, DistilBERT improves BERT performance and is Transformer inspired. Having been pretrained on a massive dataset (like all BERT models) and subsequently been fine-tuned on the SQuAD dataset, it can be used for answering questions. Let's now take a look at how we can generate an easy Question Answering system with HuggingFace Transformers. + +* * * + +## Implementing a Question Answering Pipeline with HuggingFace Transformers + +Now that we understand the concepts behind today's Machine Learning pipeline for Question Answering, it's time to actually start building. Today, we're going to build a **very easy implementation** of a Question Answering system using the [HuggingFace Transformers library](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/). This library is becoming increasingly important for democratizing Transformer based approaches in Machine Learning and allows people to use Transformers out-of-the-box. + +Let's quickly take a look at building that pipeline. + +### Full model code + +Here it is, the full model code for our Question Answering Pipeline with HuggingFace Transformers: + +- From `transformers` we import the `pipeline`, allowing us to perform one of the tasks that HuggingFace Transformers supports out of the box. + - If you don't have `transformers` installed yet, you can do so easily via `pip install transformers`. Make sure to have recent versions of PyTorch or TensorFlow installed as well! +- Subsequently, we specify a `question` ("What is the capital of the Netherlands?") and provide some context (cited [straight from the Wikipedia page](https://en.wikipedia.org/wiki/Netherlands) about the Netherlands). +- We initialize the `question-answering` Pipeline allowing us to easily create the Question Answering pipeline, because it utilizes the [DistilBERT model](https://huggingface.co/distilbert-base-cased) fine-tuned to [SQuAD](https://huggingface.co/distilbert-base-cased-distilled-squad). +- We then generate the answer from the context-based question, and print it on screen. + +``` +from transformers import pipeline + +# Open and read the article +question = "What is the capital of the Netherlands?" 
+context = r"The four largest cities in the Netherlands are Amsterdam, Rotterdam, The Hague and Utrecht.[17] Amsterdam is the country's most populous city and nominal capital,[18] while The Hague holds the seat of the States General, Cabinet and Supreme Court.[19] The Port of Rotterdam is the busiest seaport in Europe, and the busiest in any country outside East Asia and Southeast Asia, behind only China and Singapore." + +# Generating an answer to the question in context +qa = pipeline("question-answering") +answer = qa(question=question, context=context) + +# Print the answer +print(f"Question: {question}") +print(f"Answer: '{answer['answer']}' with score {answer['score']}") +``` + +With one of the recent versions of HuggingFace Transformers, you might run into this issue: + +``` +RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.IntTensor instead (while checking arguments for embedding) +``` + +The fix so far is to install the most recent `master` branch with `pip`: + +``` +pip install git+https://github.com/huggingface/transformers +``` + +Et voila, running it now gives: + +``` +Question: What is the capital of the Netherlands? +Answer: 'Amsterdam' with score 0.37749919295310974 +``` + +Brilliant! 😎 + +* * * + +## Summary + +In today's article, we saw how we can build a Question Answering pipeline based on Transformers using [HuggingFace Transformers](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/). In doing so, we first looked at what [Transformers](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) are in the first place. We saw that by smartly connecting an encoder and decoder segment, removing sequential elements from processing data while keeping the benefits of attention, they have become state-of-the-art in Natural Language Processing. + +Moving away from traditional or vanilla Transformers, we looked at the [BERT model](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/), and noticed that it only implements the encoder segment - which is pretrained on a large, massive corpus of unlabeled text. It can be fine-tuned and hence be tailor-made to your specific NLP problem. In fact, the emergence of the BERT model has given rise to the dogma of "pretraining and subsequent fine-tuning" that is very common today. + +Now, BERT itself is really big: the base version has > 100 million parameters and the large version has > 300 million ones. This is too big for many projects and hence various people have improved BERT; DistilBERT is one such example, yielding only 3% information loss at 40-60% size and speed improvements. It is therefore unsurprising that DistilBERT, which is trained on the same corpus as traditional BERT, is used frequently today. In fact, it is used in the [HuggingFace](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/) `question-answering` pipeline that we used for today's question answering model. It is fine-tuned on the Stanford Question Answering Dataset or SQuAD dataset. + +In the final section of this article, we saw how we can use HuggingFace Transformers - a library driving democratization of Transformers in NLP - to implement a Question Answering pipeline _without a hassle_. With just a few lines of code, we generated a pipeline that can successfully answer questions. 
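+
+Before wrapping up: because the pipeline object is reusable, you can ask several questions about the same context without re-initializing anything. Here is a small sketch that reuses the `qa` and `context` variables from the code above - the second question is just an illustrative addition, and the exact answers and scores depend on the model:
+
+```
+# Ask multiple questions about the same context by reusing the pipeline
+questions = [
+    "What is the capital of the Netherlands?",
+    "Which city holds the seat of the States General?",
+]
+
+for q in questions:
+    answer = qa(question=q, context=context)
+    print(f"{q} -> '{answer['answer']}' (score: {answer['score']})")
+```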
+ +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope that you have learned something from today's article, whether that's how Transformers work, how Question Answering pipelines are pretrained and fine-tuned, or how to use the HuggingFace Transformers library for actual implementations. If you did, please feel free to leave a comment in the comments section below. I'd love to hear from you! 💬 Please do the same if you have any questions, or click the **Ask Questions** button to the right. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.](https://arxiv.org/abs/1910.01108) _arXiv preprint arXiv:1910.01108_. + +HuggingFace. (n.d.). _Distilbert-base-cased · Hugging face_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/distilbert-base-cased](https://huggingface.co/distilbert-base-cased) + +HuggingFace. (n.d.). _Distilbert-base-cased-distilled-squad · Hugging face_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/distilbert-base-cased-distilled-squad](https://huggingface.co/distilbert-base-cased-distilled-squad) + +HuggingFace. (n.d.). _Transformers.pipelines — transformers 4.1.1 documentation_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/transformers/\_modules/transformers/pipelines.html](https://huggingface.co/transformers/_modules/transformers/pipelines.html) + +Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008. + +Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). [Bert: Pre-training of deep bidirectional transformers for language understanding](https://arxiv.org/abs/1810.04805). _arXiv preprint arXiv:1810.04805_. + +Stanford Question Answering Dataset. (n.d.). _The Stanford question answering dataset_. Pranav Rajpurkar. [https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/) diff --git a/easy-sentiment-analysis-with-machine-learning-and-huggingface-transformers.md b/easy-sentiment-analysis-with-machine-learning-and-huggingface-transformers.md new file mode 100644 index 0000000..2b6267f --- /dev/null +++ b/easy-sentiment-analysis-with-machine-learning-and-huggingface-transformers.md @@ -0,0 +1,226 @@ +--- +title: "How to perform Sentiment Analysis with Python, HuggingFace Transformers and Machine Learning" +date: "2020-12-23" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "bert" + - "distilbert" + - "huggingface" + - "machine-learning" + - "natural-language-processing" + - "python" + - "sentiment-analysis" + - "sst-2" + - "transformer" + - "transformers" +--- + +While human beings can be really rational at times, there are other moments when emotions are most prevalent within single humans and society as a whole. Humans also find it difficult to strictly separate rationality from emotion, and hence express emotion in _all their communications_. + +Such emotion is also known as _sentiment_. Texts, being examples of human communication, are hence also examples of a way in which human beings express emotion to the outside world. 
The task of **Sentiment Analysis** is hence to determine emotions in text. It is a subfield of Natural Language Processing and is becoming increasingly important in an ever-faster world. + +In this article, we will take a look at Sentiment Analysis in more detail. Firstly, we'll try to better understand what it is. Then, we take a look at state-of-the-art approaches for building Sentiment Analysis models with Machine Learning, using [Transformers](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/). This includes looking at what Transformers are, and inspecting the BERT and DistilBERT architectures in more detail, because they lie at the basis of the Sentiment Analysis ML pipeline that we will build today. Finally, we also take a look at the SST-2 dataset, which was used for fine-tuning the pretrained DistilBERT architecture used as a model. + +Once we understand how everything works, which should go relatively quickly, we'll move on to implementing a Sentiment Analysis Pipeline with Python. Since we are using the [HuggingFace Transformers library](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/) and more specifically its out-of-the-box pipelines, this should be really easy. With only a few lines of code, you will have a Transformer that is capable of analyzing the sentiment of text. + +Let's take a look! 😎 + +**Update 07/Jan/2021:** added more links to related articles. + +* * * + +\[toc\] + +* * * + +## What is Sentiment Analysis? + +Before we move to taking a look at the technical details of Sentiment Analysis, it may be a good idea to take a look at what Sentiment Analysis is in the first place. + +> **Sentiment analysis** (also known as **opinion mining** or **emotion AI**) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. +> +> Wikipedia (2006) + +Now, that is quite a mouth full of words. + +So, when performing Sentiment Analysis, a variety of techniques and technologies is used to extract "subjective information" and "affective states". Subjective here means obviously that it is related to personal feelings; affective state is related to _affect_. + +> Affect, in psychology, refers to the underlying experience of feeling, emotion or mood. +> +> Wikipedia (2005) + +Aha! + +Sentiment Analysis therefore involves the extraction of personal feelings, emotions or moods from language - often text. + +There are many applications for Sentiment Analysis activities. For example, with well-performing models, we can derive sentiment from news, satiric articles, but also from customer reviews. And what about emails, film reviews, or even Tweets, which can be really sarcastic at times? + +Further application areas of Sentiment Analysis range to stock markets, to give just a few examples. In the short term, stocks are known to be very sensitive to market sentiments, and hence performing such analyses can give people an edge when trading stocks. Applying (relatively) open and massive data sources such as Tweets has therefore been an area of active research with respect to stock trading. + +* * * + +## Sentiment Analysis with Transformers + +Beyond a variety of human-developed algorithms used for sentiment analysis, Machine Learning can also be used really well for extracting sentiment from language. 
What's more, a special Deep Learning approach called a Transformer has been the state-of-the-art in Machine Learning for NLP in the past few years.
+
+### What is a Transformer?
+
+![](images/1_BHzGVskWGS_3jEcYYi6miQ-842x1024.png)
+
+Source: Vaswani et al. (2017)
+
+In Natural Language Processing, people have traditionally used [recurrent neural networks](https://www.machinecurve.com/index.php/2020/12/21/from-vanilla-rnns-to-transformers-a-history-of-seq2seq-learning/). In those networks, sequences are processed into sequences of another nature. In plainer English, that would be one phrase (e.g. in English) processed into its German equivalent:
+
+_It's going well --> Es geht gut._
+
+Classic RNNs worked, but came with a range of disadvantages: vanishing gradients caused long-term memory loss, the sequential nature of processing meant that models could not be optimized for training at sentence level (but rather had to be trained word by word), and so on.
+
+Long Short-Term Memory ([LSTM](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/)) and Gated Recurrent Unit (GRU) models, eventually augmented with the attention mechanism, [replaced the classic or vanilla RNN some years ago](https://www.machinecurve.com/index.php/2020/12/21/from-vanilla-rnns-to-transformers-a-history-of-seq2seq-learning/). By adding memory cells and resolving the vanishing gradients issue, long-term memory loss was reduced to some extent - especially once attention was added. But the problem with sequential processing persisted, due to the sequential design of these models.
+
+Enter [Transformers](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/), which were originally proposed by Vaswani et al. (2017). In this architecture, which you can see on the right, sequences are first processed into some intermediary encoding, after which they are processed into the target sequence - but this is not new, as such encoder-decoder architectures were also common with LSTMs and GRUs.
+
+What was new, however, is that Vaswani et al. argued that "attention is all you need" - hence the name of their paper. In other words, they claimed that it is not necessary to add recurrent segments to a Sequence-to-Sequence model, but that letting each word attend to the other words in its phrase is enough for language models to perform well. Showing that this works through self-attention in a multi-headed (and hence multi-contextual) way, Transformers have taken the NLP world by storm. They have even replaced LSTMs and GRUs as state-of-the-art approaches.
+
+### The BERT and DistilBERT architectures
+
+As is usual with breakthroughs in Machine Learning, the massive amount of attention drawn to works like the one by Vaswani et al. yields new ideas that are then developed and validated. One of the arguments put forward by Devlin et al. (2019) was that classic Transformers work in a left-to-right fashion: by reading text from left to right, they learn to add context to individual words, after which they can learn to predict target tokens really well. But humans read differently, Devlin et al. (2019) argued; they also take a look at words from the right to the left, i.e. at the whole text, if they want to add context to whatever they read.
+
+That is why they proposed the [**Bidirectional Encoder Representations from Transformers** (BERT)](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) in their 2019 work. They took only the encoder segment from the classic Transformer architecture and changed it in a few ways. One of the key changes was that it is no longer trained on a plain language modelling task, but rather on a "masked" language modelling task, where its goal is to predict what's under the mask. In addition, it's also given pairs of sentences, where the goal is to predict whether the second sentence actually follows the first, in order to learn additional context. This made _bidirectional learning_ possible, meaning that BERT-based models look at texts in a left-to-right _and_ in a right-to-left fashion.
+
+Devlin et al. (2019) also helped put forward the common way of thinking today that such models must be pretrained on large, unlabeled datasets. They therefore made available BERT-based models trained on large corpora with millions and millions of words. They argued that, starting from such a pretrained model, fine-tuning to a specific task (such as [text summarization](https://www.machinecurve.com/index.php/2020/12/21/easy-text-summarization-with-huggingface-transformers-and-machine-learning/) or [question answering](https://www.machinecurve.com/index.php/2020/12/21/easy-question-answering-with-machine-learning-and-huggingface-transformers/)) becomes a lot easier.
+
+As BERT models only take the encoder segment when training, they are generally really good at text understanding, but not so good at text generation. That's why many generative tasks (such as summarization) use derivations like [BART](https://www.machinecurve.com/index.php/2020/12/21/easy-text-summarization-with-huggingface-transformers-and-machine-learning/), which add a generative ([GPT-like](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/)) decoder to the BERT-like encoder as well, allowing better fine-tuning.
+
+While BERT is good, BERT is also _really_ big. The standard BERT model has over 100 million trainable parameters, and the large BERT model has more than 300 million. This means that inference tasks are really intensive in terms of compute costs, which hinders adoption of state-of-the-art language models. This is why Sanh et al. (2019) proposed a smaller version of BERT (roughly 40% smaller and 60% faster) that loses little of the performance (retaining around 97% of language understanding capabilities). By using a technique called **knowledge distillation**, which means that a smaller student model is trained to mimic the outputs (and hence the loss) of the original model, they proposed **DistilBERT** - a BERT-like model pre-trained with BERT and the original data, but with many aspects of the original BERT model stripped away.
+
+By using DistilBERT as your pretrained model, you can significantly speed up fine-tuning _and_ model inference _without_ losing much of the performance.
+
+### DistilBERT fine-tuned on SST-2
+
+The original DistilBERT model has been pretrained on the unlabeled datasets BERT was also trained on. It must be fine-tuned to be tailored to a specific task. In the HuggingFace-based Sentiment Analysis pipeline that we will implement, the DistilBERT architecture was fine-tuned on the SST-2 dataset.
This dataset stands for **Stanford Sentiment Treebank** **version 2** and can be described as follows: + +> The Stanford Sentiment Treebank SST-2 dataset contains 215,154 phrases with fine-grained sentiment labels in the parse trees of 11,855 sentences from movie reviews. Models performances are evaluated either based on a fine-grained (5-way) or binary classification model based on accuracy. +> +> DeepAI (n.d.) + +In other words, sentences are expressed in a tree-like structure. Contrary to SST-1, which is version 1 of the same dataset, neutral phrases are deleted in order to keep strictly positive and strictly negative ones. Visually, phrases can look as follows (Stanford NLP, n.d.). We can see that this sentence as a whole is negative, but that some parts of it are positive. + +![](images/image-3-1024x534.png) + +* * * + +## Implementing an Easy Sentiment Analysis Pipeline with Python + +Now that we understand how Sentiment Analysis is used, what our [Transformer](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) based model looks like and how it is fine-tuned, we have sufficient context for implementing a pipeline with Sentiment Analysis with Python. At least, you'll now understand what happens under the hood, which I think is really important for Machine Learning engineers. + +For the pipeline, we will be using the HuggingFace Transformers library: + +> 🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch. +> +> HuggingFace (n.d.) + +In other words, applying Transformers in a PyTorch and TensorFlow setup has never been easier before. Installing it is also easy: ensure that you have TensorFlow or PyTorch installed, followed by a simple HF install with `pip install transformers`. + +That's all! + +### Creating the pipeline + +In today's model, we're setting up a pipeline with HuggingFace's DistilBERT-pretrained and SST-2-fine-tuned Sentiment Analysis model. This is really easy, because it belongs to HuggingFace's out-of-the-box pipelines: + +``` + "sentiment-analysis": { + "impl": TextClassificationPipeline, + "tf": TFAutoModelForSequenceClassification if is_tf_available() else None, + "pt": AutoModelForSequenceClassification if is_torch_available() else None, + "default": { + "model": { + "pt": "distilbert-base-uncased-finetuned-sst-2-english", + "tf": "distilbert-base-uncased-finetuned-sst-2-english", + }, + }, + }, +``` + +We can see that it implements a `TextClassificationPipeline`, which is essentially what Sentiment Analysis involves - assigning a (sentiment) class to an input text. A TensorFlow or PyTorch based sequence classification model is initialized based on library availability, and filled with the `distilbert-base-uncased-finetuned-sst-2-english` model: + +![](images/image-4-897x1024.png) + +HuggingFace. (n.d.). Distilbert-base-uncased-finetuned-sst-2-english · Hugging face. Hugging Face – On a mission to solve NLP, one commit at a time. 
[https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) + +Implementing the pipeline is really easy: + +- We import the `pipeline` class from `transformers` and initialize it with a `sentiment-analysis` task. This ensures that the PyTorch and TensorFlow models are initialized following the SST-2-fine-tuned model above. +- We can then easily call the Sentiment Analyzer and print the results. + +``` +from transformers import pipeline +sentimentanalyzer = pipeline("sentiment-analysis") + +result = sentimentanalyzer("I really don't like what you did")[0] +print(f"Sentiment: {result['label']}") + +result = sentimentanalyzer("I really like what you did")[0] +print(f"Sentiment: {result['label']}") + +result = sentimentanalyzer("This is good")[0] +print(f"Sentiment: {result['label']}") +``` + +The outcome: + +``` +Sentiment: NEGATIVE +Sentiment: POSITIVE +Sentiment: POSITIVE +``` + +Which indeed is a correct classification of the sentiment of these phrases! + +Great 😎 + +* * * + +## Summary + +In this article, we built a Sentiment Analysis pipeline with Machine Learning, Python and the [HuggingFace Transformers library](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/). However, before actually implementing the pipeline, we looked at the concepts underlying this pipeline with an intuitive viewpoint. Firstly, we saw what Sentiment Analysis involves - the classification of subjective language, related to affect, or emotion. We also saw that it can be used in a wide variety of use cases, with stock market analysis based on sentiment from Tweets as a cool example. + +We then immediately made the switch to Machine Learning. We briefly covered the history of ML architectures in Sentiment Analysis, including classic RNNs, [LSTMs](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/), GRUs and the attention mechanism. Through understanding [Transformers](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/), we saw how the [BERT model](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) became very prominent and how it was improved through DistilBERT. This architecture was pretrained on a large unlabeled corpus and can be fine-tuned to be tailored to specific Machine Learning tasks. + +Using the SST-2 dataset, the DistilBERT architecture was fine-tuned to Sentiment Analysis using English texts, which lies at the basis of the pipeline implementation in the Transformers library. Finally, after having gained a basic understanding of what happens under the hood, we saw how we can implement a Sentiment Analysis Pipeline powered by Machine Learning, with only a few lines of code. + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope that you have learned something from this article. If you did, please feel free to leave a message in the comments section 💬 I'd love to hear from you. Please do the same if you have any questions, or click the **Ask Questions** button to the right. Anyway, thanks a lot for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2006, August 13). _Sentiment analysis_. Wikipedia, the free encyclopedia. 
Retrieved December 23, 2020, from [https://en.wikipedia.org/wiki/Sentiment\_analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) + +Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.](https://arxiv.org/abs/1910.01108) _arXiv preprint arXiv:1910.01108_. + +HuggingFace. (n.d.). _Transformers.pipelines — transformers 4.1.1 documentation_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/transformers/\_modules/transformers/pipelines.html](https://huggingface.co/transformers/_modules/transformers/pipelines.html) + +Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008. + +Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). [Bert: Pre-training of deep bidirectional transformers for language understanding](https://arxiv.org/abs/1810.04805). _arXiv preprint arXiv:1810.04805_. + +Wikipedia. (2005, December 19). _Affect (psychology)_. Wikipedia, the free encyclopedia. Retrieved December 23, 2020, from [https://en.wikipedia.org/wiki/Affect\_(psychology)](https://en.wikipedia.org/wiki/Affect_(psychology)) + +Medhat, W., Hassan, A., & Korashy, H. (2014). Sentiment analysis algorithms and applications: A survey. _Ain Shams engineering journal_, _5_(4), 1093-1113. + +Zimbra, D., Abbasi, A., Zeng, D., & Chen, H. (2018). The state-of-the-art in Twitter sentiment analysis: A review and benchmark evaluation. _ACM Transactions on Management Information Systems (TMIS)_, _9_(2), 1-29. + +DeepAI. (n.d.). _Stanford sentiment Treebank dataset_. [https://deepai.org/dataset/stanford-sentiment-treebank](https://deepai.org/dataset/stanford-sentiment-treebank) + +Stanford NLP. (n.d.). _Recursive deep models for semantic Compositionality over a sentiment Treebank_. The Stanford Natural Language Processing Group. [https://nlp.stanford.edu/sentiment/treebank.html](https://nlp.stanford.edu/sentiment/treebank.html) + +HuggingFace. (n.d.). _Transformers.pipelines — transformers 4.1.1 documentation_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/transformers/\_modules/transformers/pipelines.html](https://huggingface.co/transformers/_modules/transformers/pipelines.html) + +HuggingFace. (n.d.). _Transformers — transformers 4.1.1 documentation_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/transformers/index.html](https://huggingface.co/transformers/index.html) + +HuggingFace. (n.d.). _Distilbert-base-uncased-finetuned-sst-2-english · Hugging face_. Hugging Face – On a mission to solve NLP, one commit at a time. 
[https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english) diff --git a/easy-speech-recognition-with-machine-learning-and-huggingface-transformers.md b/easy-speech-recognition-with-machine-learning-and-huggingface-transformers.md new file mode 100644 index 0000000..7aa550d --- /dev/null +++ b/easy-speech-recognition-with-machine-learning-and-huggingface-transformers.md @@ -0,0 +1,241 @@ +--- +title: "Easy Speech Recognition with Machine Learning and HuggingFace Transformers" +date: "2021-02-17" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "huggingface" + - "speech-recognition" + - "speech-to-text" + - "transformers" +--- + +Transformer architectures have gained a lot of attention in the field of Natural Language Processing. Ever since the original Transformer architecture was released in 2017, they have achieved state-of-the-art results on a variety of language tasks. + +Another task was added to which Transformers can be applied last year. In this tutorial, we will take a look at **Speech Recognition**. We will take a look at the [Wav2vec2 model](https://www.machinecurve.com/index.php/question/how-does-wav2vec-2-for-speech-recognition-speech2text-work/) which is specifically tailored to Speech Recognition tasks. We will show you how it can be used to pretrain and then finetune a model to the task of Speech-to-text recognition. This also includes an example implementation of a pipeline created with HuggingFace Transformers. Using the pipeline, you'll be able to apply Speech Recognition to your Machine Learning driven project very easily. + +After reading this tutorial, you will be able to... + +- **Understand how Transformer-based architectures can be applied to Speech Recognition.** +- **Explain how the Wav2vec2 architecture works at a high level, and refer to a summary of the paper.** +- **Build a Wav2vec2-powered Machine Learning pipeline with HuggingFace Transformers and Python.** + +* * * + +\[toc\] + +* * * + +## Example: speech recognition with Transformers + +This code example shows how you can create a **Speech Recognition pipeline with Transformers** relatively easily. You can use it to get started straight away, granted that you have `transformers` (HuggingFace Transformers) installed as well as a PyTorch or TensorFlow installation. + +If you wish to understand everything in a bit more detail, make sure to read the rest of this tutorial as well 🚀 + +``` +from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC +import librosa as lb +import torch + +# Initialize the tokenizer +tokenizer = Wav2Vec2Tokenizer.from_pretrained('facebook/wav2vec2-base-960h') + +# Initialize the model +model = Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-base-960h') + +# Read the sound file +waveform, rate = lb.load('./order.wav', sr = 16000) + +# Tokenize the waveform +input_values = tokenizer(waveform, return_tensors='pt').input_values + +# Retrieve logits from the model +logits = model(input_values).logits + +# Take argmax value and decode into transcription +predicted_ids = torch.argmax(logits, dim=-1) +transcription = tokenizer.batch_decode(predicted_ids) + +# Print the output +print(transcription) +``` + +* * * + +## Speech recognition with Transformers: Wav2vec2 + +In this tutorial, we will be implementing a pipeline for Speech Recognition. 
In this area, earlier developments revolved around extracting more abstract (latent) representations from raw waveforms with convolutions, and then letting these representations converge to a token (see e.g. Schneider et al., 2019 for how this is done with Wav2vec 1.0).
+
+However, today, we are living in the era of Transformer architectures. These architectures have greatly benefited the field of NLP by making transfer learning a very feasible approach for training language models. Combined with the benefits resulting from their architecture (i.e. attention is all you need, and no sequential processing is necessary), very large models (like BERT or the GPT series) have been trained that achieve state-of-the-art performance on a variety of language tasks.
+
+And now, they are also making their entry into fields beyond pure text-based language processing. Say hello to Wav2vec version 2!
+
+![](images/qSEY9xn.png)
+
+Source: Baevski et al. (2020)
+
+We have covered Wav2vec2 extensively in a [summary of the paper](https://www.machinecurve.com/index.php/question/how-does-wav2vec-2-for-speech-recognition-speech2text-work/), but we'll briefly cover it here as well.
+
+As you can see, the Wav2vec2 architecture moves from a raw waveform to a Transformer output which, combined with a quantized representation, contributes to a loss value for optimization. Let's take a look at how this works.
+
+- First of all, the **raw waveform**. This is a piece of sound, sampled at a particular frequency.
+- A **feature encoder** in the form of a 1D/temporal ConvNet with 7 layers takes the waveform and converts it into `T` time steps. These are the _latent speech representations_. These time steps serve as input to the Transformer architecture.
+- The **Transformer architecture** takes inputs and converts them to outputs, the so-called _context representations_.
+- When performing fine-tuning or using the architecture in practice, a **linear layer segment** is stacked on top of the context representations to generate the _outputs_. This works similarly to the [C/CLS class vector in BERT](https://www.machinecurve.com/index.php/question/how-is-the-cls-c-output-in-bert-used-in-nsp/), which must also be used for this purpose (but is included in the architecture).
+- We have avoided discussing the **quantization segment** with the _quantized representations_ so far. This is an important element of the architecture. Quantization of a vector essentially means that you generate a vector from a finite set of possibilities given some input, rather than using the real-valued vector directly. As this constrains your possible input space, the model may learn to generalize better. This is at least what was found in previous studies. In the quantization segment, representations are generated by creating codebooks with many entries, then constructing a vector by combining and linearly projecting the closest contributions from each codebook. Sounds quite difficult, but once you get it (the paper describes it in more detail and you can also find some interesting articles on Google), it's very clear!
+- The outputs of the Transformer are combined with the quantized representations in a loss value which, during pretraining, learns to (1) select good quantized representations for the expected outputs (i.e.
find a good generalized representation for the output value), and (2) favor diversity over non-diversity in terms of the linear projection - so that all codebooks contribute relatively equally to constructing the quantized representation.
+
+The model was pretrained on either of these two datasets:
+
+- LibriSpeech corpus with 960 hours of audio (LS-960)
+- LibriVox dataset (LV-60k); 53,200 hours after preprocessing.
+
+It was then finetuned with one of these five:
+
+- 960 hours of transcribed LibriSpeech
+- 100 hours of transcribed LibriSpeech
+- 10 hours of transcribed LibriSpeech
+- 1 hour of transcribed LibriSpeech
+- 10 minutes of transcribed LibriSpeech
+
+* * *
+
+## Implementing Speech Recognition in a Pipeline
+
+Now that we understand at a high level what Wav2vec2 is and how it works, we can take a look at implementing it in a Machine Learning based pipeline.
+
+Fortunately, the [HuggingFace Transformers library](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/) - which democratizes the application of Transformer architectures in NLP - introduced Wav2vec2 functionality per its [4.3.0 release](https://github.com/huggingface/transformers/releases/tag/v4.3.0). In other words, we can benefit from pretrained and fine-tuned models _and_ some really nice APIs that can load these models for performing Speech Recognition ourselves.
+
+Let's now get to work 🚀
+
+### Ensuring that you have HuggingFace 4.3.0+
+
+If you want to be able to run the code below, you must ensure that you have a recent version of HuggingFace Transformers on your system. You can easily check your current version by running `python` in your development environment, then importing `transformers`, and printing its version number:
+
+```
+>>> import transformers
+>>> print(transformers.__version__)
+3.4.0
+```
+
+This clearly suggests that I have to upgrade to a newer version.
+
+`pip install transformers --upgrade` does the trick.
+
+```
+>>> import transformers
+>>> print(transformers.__version__)
+4.3.2
+```
+
+Voila! And if you don't have HuggingFace Transformers installed on your system yet, you can easily do so by running `pip install transformers`. Make sure to have either PyTorch or TensorFlow installed in your particular Transformers environment as well, because the library runs on either of the two.
+
+### Using an `.mp3` file, converted into `.wav`
+
+The pipeline that we will be creating today requires you to use `.wav` files, and more specifically `.wav` files with a sampling rate of 16000 Hz (16 kHz). This is because the model we're using [was pretrained and finetuned on 16 kHz data](https://huggingface.co/facebook/wav2vec2-base-960h), and our data needs to be similar.
+
+Any file that you want to use can be converted with online converters. You can [use this converter](https://audio.online-convert.com/convert-to-wav) to give just one example. Make sure to set **'change sampling rate'** to 16000 Hz, as illustrated here:
+
+![](images/hz.png)
+
+### Implementing the Python code
+
+It's now time to implement the Python code for our pipeline. Creating a Speech Recognition pipeline involves a few parts:
+
+1. **The model imports**. We import the `Wav2Vec2Tokenizer` and `Wav2Vec2ForCTC`. The tokenizer is used for tokenization: converting the raw waveform into tokens that can be fed to the model; `Wav2Vec2ForCTC` represents the CTC-loss based model class.
+2. **Initializing the tokenizer**.
We use the `[facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h)` model for this. This model was pretrained on the LibriSpeech corpus and then finetuned on the 960 hours of data; hence the name. +3. **Initializing the model**. We use the same model for this purpose. +4. **Read the sound file**. Using `librosa`, we read the `.wav` file, with a sampling rate of 16000 Hz. +5. **Tokenize the waveform**. Using the `tokenizer`, we tokenize the waveform, and retrieve the input values. +6. **Retrieve logits from the model**. We retrieve logits from the model, reflecting the whole probability distribution over all possible output tokens. +7. **Take the argmax value and decode into transcription**. As with any logits, we can take `argmax` to find the most probable value(s) for the logits. We can batch decode these to find the text corresponding to the speech. Finally, we print this text. + +``` +from transformers import Wav2Vec2Tokenizer, Wav2Vec2ForCTC +import librosa as lb +import torch + +# Initialize the tokenizer +tokenizer = Wav2Vec2Tokenizer.from_pretrained('facebook/wav2vec2-base-960h') + +# Initialize the model +model = Wav2Vec2ForCTC.from_pretrained('facebook/wav2vec2-base-960h') + +# Read the sound file +waveform, rate = lb.load('./order.wav', sr = 16000) + +# Tokenize the waveform +input_values = tokenizer(waveform, return_tensors='pt').input_values + +# Retrieve logits from the model +logits = model(input_values).logits + +# Take argmax value and decode into transcription +predicted_ids = torch.argmax(logits, dim=-1) +transcription = tokenizer.batch_decode(predicted_ids) + +# Print the output +print(transcription) +``` + +* * * + +## Results + +Here, we test the pipeline on three sound segments: + +- _Order, order!_ with UK House of Commons speaker John Bercow. +- _Please stand for Her Majesty the Queen!_ during the 2012 London Olympics. +- _Just smile and wave, boys, smile and wave_, from the Madagascar movie. + +### Order, order! + +- **Source:** [Order! Speaker John Bercow sorgt im House of Commons für Ordnung](https://www.youtube.com/watch?v=H4v7wddN-Wg) (00:20 - 00:30) converted into `.wav` with `16 kHz` sampling. +- **Input clip:** + +- **Model output:** `['AND WILL BE HEARD ORDER ORDER THE ORABLE GENTLEMAN HAS GOT TO LEARN THE ART OF PATIENCE AND IF HE IS PATIENT']` +- **Verdict:** except for the ... orable part ... this works pretty well! Clearly, swallowing the "hon" in "honorable" is not picked up by the model. + +### Please stand for Her Majesty the Queen + +- **Source:** [James Bond and The Queen London 2012 Performance](https://www.youtube.com/watch?v=1AS-dCdYZbo) (05:32 - 05:46) converted into `.wav` with `16 kHz` sampling. +- **Input clip:** + +- **Model output:** `['HORDOR LAYAY DAA AN BAY DAA AT TO THE QUEEN AND HIS ROYAL HIGHNESS THE DUKE OF EDIBRA ACCOMPANIED BY THE PRIVTENT OF THE INTERNATIONAL ELYGCNICAMITE JACU BROD']` +- **Verdict:** well... whoops. Clearly, the fact that this speech is not _really clear_ here generates some trouble. This occurs because the model was pretrained and finetuned on relatively clean data. "Hordo layay daa and bay daa"... is no English. "Edinbra" is literal -- and it should have been "Edinburgh", of course. "Privtent" -- "president". "Elygcnicamite" - "Olympic committee". "Jacu Brod" - "Jacques Rogge". Clearly some improvement required here! 
+ +### Just smile and wave, boys, smile and wave + +- **Source:** [smile and wave (scene) ||Madagascar 2005](https://www.youtube.com/watch?v=F1mr6L9OgQI) (00:03 - 00:10) converted into `.wav` with `16 kHz` sampling. +- **Input clip:** + +- **Model output:** `['JUST SMILEIN WAVE BOYS SMILEIN WAVE GWAWSKY CROGESRYPOR']` +- **Verdict:** while people who are familiar with _Madagascar_ can recognize this phrase out of thousands, it's clear that the model also needs a bit of improvement here. + +Altogether, we can see that the model can achieve quite good results if speech is clear. This is not surprising, given the fact that it was pretrained and finetuned with a very clean dataset in terms of noise, swallowing of words, et cetera. Now, for example, if we pretrain the model with other real-world data, and then finetune it to e.g. the problem we face, we might improve it. + +But that's for another tutorial :) + +* * * + +## Recap + +This tutorial focused on the language task of Speech Recognition, or speech-to-text recognition. Using the Wav2vec2 architecture, we showed you how you can pretrain a Transformer based architecture specifically on audio datasets. It can then be finetuned to perform Speech Recognition by applying a labeled dataset, just like Transformer based approaches like BERT apply this to regular, text-only based tasks. + +Beyond the theoretical part, which helped you understand how Wav2vec2 works, we also saw how it can be implemented. Using the HuggingFace Transformers library, you implemented an example pipeline to apply Speech Recognition / Speech to Text with Wav2vec2. Through this tutorial, you saw that using Wav2vec2 is really a matter of only a few lines of code. + +I hope that you have learned something from today's tutorial. If you did, please feel free to drop a comment in the comments section below 💬 Please do the same if you have any questions, remarks or suggestions for improvement. I'd love to hear from you :) + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). [wav2vec 2.0: A framework for self-supervised learning of speech representations.](https://arxiv.org/abs/2006.11477) arXiv preprint arXiv:2006.11477. +  +Schneider, S., Baevski, A., Collobert, R., & Auli, M. (2019). [wav2vec: Unsupervised pre-training for speech recognition.](https://arxiv.org/abs/1904.05862) arXiv preprint arXiv:1904.05862. + +MachineCurve. (2021, February 16). _How does Wav2vec 2 for speech recognition (speech2text) work?_ [https://www.machinecurve.com/index.php/question/how-does-wav2vec-2-for-speech-recognition-speech2text-work/](https://www.machinecurve.com/index.php/question/how-does-wav2vec-2-for-speech-recognition-speech2text-work/) diff --git a/easy-table-parsing-with-tapas-machine-learning-and-huggingface-transformers.md b/easy-table-parsing-with-tapas-machine-learning-and-huggingface-transformers.md new file mode 100644 index 0000000..5efadae --- /dev/null +++ b/easy-table-parsing-with-tapas-machine-learning-and-huggingface-transformers.md @@ -0,0 +1,265 @@ +--- +title: "Easy Table Parsing with TAPAS, Machine Learning and HuggingFace Transformers" +date: "2021-03-10" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "huggingface" + - "language-model" + - "machine-learning" + - "nlp" + - "table-parsing" + - "tapas" + - "transformers" +--- + +Big documents often contain quite a few tables. 
Tables are useful: they can provide a structured overview of data that supports or contradicts a particular statement made in the accompanying text. Especially if your goal is to analyze reports, tables are useful because they provide more raw data. But analyzing tables costs a lot of energy, as one has to reason over them when answering questions.
+
+But what if that process can be partially automated?
+
+The **Table Parser** Transformer, or **TAPAS**, is a machine learning model that is capable of doing precisely that. Given a table and a question related to that table, it can provide the answer in a short amount of time.
+
+In this tutorial, we will be taking a look at using Machine Learning for Table Parsing in more detail. We will see how previous approaches relied on manually extracting logical forms, and how Transformer-based approaches have simplified parsing tables. We'll then take a look at the TAPAS Transformer for table parsing, and how it works. This is followed by implementing a table parsing model yourself using a pretrained and finetuned variant of TAPAS, with HuggingFace Transformers.
+
+After reading this tutorial, you will understand...
+
+- **How Machine Learning can be used for parsing tables.**
+- **Why Transformer-based approaches have simplified table parsing over other ML approaches.**
+- **How you can use TAPAS and HuggingFace Transformers to implement a table parser with Python and ML.**
+
+Let's take a look! 🚀
+
+* * *
+
+\[toc\]
+
+* * *
+
+## Machine Learning for Table Parsing: TAPAS
+
+Ever since Vaswani et al. (2017) introduced the [Transformer architecture](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) back in 2017, the field of NLP has been on fire. Transformers have removed the need for [recurrent segments](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/) and thus avoid the drawbacks of recurrent neural networks and LSTMs when creating sequence-based models. By relying on a mechanism called [self-attention](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/#multi-head-attention), built in with multiple so-called _attention heads_, models are capable of generating a supervision signal themselves.
+
+By consequence, Transformers have widely adopted the [pretraining-finetuning paradigm](https://www.machinecurve.com/index.php/question/what-is-fine-tuning-based-training-for-nlp-models/), where models are first pretrained using a massive but unlabeled dataset, acquiring general capabilities, after which they are finetuned with a smaller but labeled and hence task-focused dataset.
+
+The results are incredible: through subsequent improvements like [GPT](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/) and [BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) and a variety of finetuned models, Transformers can now be used for a wide variety of tasks ranging from [text summarization](https://www.machinecurve.com/index.php/2020/12/21/easy-text-summarization-with-huggingface-transformers-and-machine-learning/) and [machine translation](https://www.machinecurve.com/index.php/2021/02/16/easy-machine-translation-with-machine-learning-and-huggingface-transformers/) to [speech recognition](https://www.machinecurve.com/index.php/2021/02/17/easy-speech-recognition-with-machine-learning-and-huggingface-transformers/).
And today we can also add table parsing to that list.
+
+**Additional reading materials:**
+
+- [List of Transformer tutorials for Deep Learning](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/)
+- [The TAPAS Transformer: Table Parsing with BERT](https://www.machinecurve.com/index.php/question/what-is-the-tapas-transformer-in-nlp-and-how-does-it-work/)
+
+### BERT for Table Parsing
+
+The [BERT family](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) is a widely varied but very powerful family of language models that relies on the encoder segment of the [original Transformer](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/). Invented by Google, it employs [Masked Language Modeling](https://www.machinecurve.com/index.php/2021/03/02/easy-masked-language-modeling-with-machine-learning-and-huggingface-transformers/) during the pretraining and finetuning stages, and slightly adapts the architecture and embeddings in order to add more context to the processed representations.
+
+**[TAPAS](https://www.machinecurve.com/index.php/question/what-is-the-tapas-transformer-in-nlp-and-how-does-it-work/)**, which stands for **Table Parser**, is an extension of BERT proposed by Herzig et al. (2020) - who are affiliated [with Google](https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html). It is specifically tailored to table parsing - not surprising given its name. TAPAS allows tables to be input after they are flattened and thus essentially converted into 1D.
+
+By adding a variety of additional embeddings, however, table-specific structure and additional context can be harnessed during training. It outputs a prediction for an _aggregation operator_ (i.e., what to do with some outcome) and _cell selection coordinates_ (i.e., which outcome to do something with).
+
+TAPAS is covered in [another article on this website](https://www.machinecurve.com/index.php/question/what-is-the-tapas-transformer-in-nlp-and-how-does-it-work/), and I recommend going there if you want to understand how it works in great detail. For now, a visualization of its architecture will suffice - as this is a practical tutorial :)
+
+![](images/B3gizm9.png)
+
+Source: Herzig et al. (2020)
+
+* * *
+
+## Implementing a Table Parsing model with HuggingFace Transformers
+
+Let's now take a look at how you can implement a Table Parsing model yourself with HuggingFace Transformers. We'll first focus on the software requirements that you must install into your environment. You will then learn how to code a TAPAS based table parser for question answering. Finally, we will also show you the results that we got when running the code.
+
+### Software requirements
+
+HuggingFace Transformers is a Python library that was created for democratizing the application of state-of-the-art NLP models, [Transformers](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/). It can easily be installed with `pip`, by means of `pip install transformers`. If you are running it, you will also need to use PyTorch or TensorFlow as the backend - by installing it into the same environment (or vice-versa, installing HuggingFace Transformers in your PT/TF environment).
+
+The code in this tutorial was created with PyTorch, but it _may_ be relatively easy (possibly with a few adaptations) to run it with TensorFlow as well.
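+
+If you are unsure what is already present in your environment, a quick check can save some installation headaches. The snippet below is a minimal sketch that simply probes for the dependencies used in this tutorial (the import names `transformers`, `torch`, `tensorflow` and `torch_scatter`); anything reported as not installed can then be added with the commands listed below.
+
+```
+# Minimal environment check (a sketch): probe for the dependencies used below.
+import importlib
+import importlib.util
+
+for package in ["transformers", "torch", "tensorflow", "torch_scatter"]:
+    if importlib.util.find_spec(package) is None:
+        print(f"{package}: not installed")
+    else:
+        module = importlib.import_module(package)
+        print(f"{package}: {getattr(module, '__version__', 'version unknown')}")
+```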
+
+To run the code, you will need to install the following things into an environment:
+
+- **HuggingFace Transformers:** `pip install transformers`.
+- **A deep learning framework:** either TensorFlow or PyTorch.
+- **Torch Scatter**, which is a TAPAS dependency. The command depends on whether you are using PyTorch with GPU or CPU. Replace `1.6.0` with your PyTorch version.
+    - For GPU: `pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.6.0+${CUDA}.html`
+    - For CPU: `pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.6.0+cpu.html`
+
+### Model code
+
+Compared to [Pipelines](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/#getting-started-with-transformer-based-pipelines) and [other pretrained models](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/#running-other-pretrained-and-fine-tuned-models), running TAPAS requires you to do a few more things. Below, you can find the code for the TAPAS based model as a whole. But don't worry! I'll explain everything right now.
+
+- **Imports:** First of all, we're importing `TapasTokenizer` and `TapasForQuestionAnswering` from `transformers` - that is, HuggingFace Transformers. The tokenizer is used for tokenization; its result can subsequently be fed to the question answering model. TAPAS requires a specific way of tokenization and input presentation, and the TAPAS-specific tokenizer and QA model have this built in. Very easy! We also import `pandas`, which we'll need later.
+- **Table and question definitions:** next up is defining the table and the questions. As you can see, the table is defined as a Python dictionary. Our table has two columns - `Cities` and `Inhabitants` - and values (in millions of inhabitants) are provided for Paris, London and Lyon.
+- **Specifying some Python definitions:**
+    - _Loading the model and tokenizer:_ in `load_model_and_tokenizer`, we initialize the Tokenizer and QuestionAnswering model with a [finetuned](https://www.machinecurve.com/index.php/question/what-is-fine-tuning-based-training-for-nlp-models/) variant of TAPAS - more specifically, `google/tapas-base-finetuned-wtq`, or TAPAS finetuned on WikiTable Questions (WTQ).
+    - _Preparing the inputs:_ our Python dictionary must first be converted into a `DataFrame` before it can be tokenized. We use `pandas` for this purpose, and create the dataframe from a dictionary. We can then feed it to the `tokenizer` together with the `queries`, and return the results.
+    - _Generating the predictions:_ in `generate_predictions`, we feed the tokenized inputs to our TAPAS model. Our tokenizer can subsequently be used to find the cell coordinates and aggregation operators that were predicted - recall that TAPAS predicts relevant cells (the coordinates) and an operator that must be executed to answer the question (the aggregation operator).
+    - _Postprocessing the predictions:_ in `postprocess_predictions`, we convert the predictions into a format that can be displayed on screen.
+    - _Showing the answers:_ in `show_answers`, we then actually visualize these answers.
+    - _Running TAPAS:_ `run_tapas` combines all other `def`s together in an end-to-end flow. This wasn't directly added to `__main__` because it's best practice to keep as much functionality within Python definitions.
+- **Running the whole thing:** so far, we have created a lot of definitions, but nothing is running yet. That's why we check whether our Python is running with that if statement at the bottom, and if so, invoke `run_tapas()` - and therefore the whole model. + +``` +from transformers import TapasTokenizer, TapasForQuestionAnswering +import pandas as pd + +# Define the table +data = {'Cities': ["Paris, France", "London, England", "Lyon, France"], 'Inhabitants': ["2.161", "8.982", "0.513"]} + +# Define the questions +queries = ["Which city has most inhabitants?", "What is the average number of inhabitants?", "How many French cities are in the list?", "How many inhabitants live in French cities?"] + +def load_model_and_tokenizer(): + """ + Load + """ + # Load pretrained tokenizer: TAPAS finetuned on WikiTable Questions + tokenizer = TapasTokenizer.from_pretrained("google/tapas-base-finetuned-wtq") + + # Load pretrained model: TAPAS finetuned on WikiTable Questions + model = TapasForQuestionAnswering.from_pretrained("google/tapas-base-finetuned-wtq") + + # Return tokenizer and model + return tokenizer, model + + +def prepare_inputs(data, queries, tokenizer): + """ + Convert dictionary into data frame and tokenize inputs given queries. + """ + # Prepare inputs + table = pd.DataFrame.from_dict(data) + inputs = tokenizer(table=table, queries=queries, padding='max_length', return_tensors="pt") + + # Return things + return table, inputs + + +def generate_predictions(inputs, model, tokenizer): + """ + Generate predictions for some tokenized input. + """ + # Generate model results + outputs = model(**inputs) + + # Convert logit outputs into predictions for table cells and aggregation operators + predicted_table_cell_coords, predicted_aggregation_operators = tokenizer.convert_logits_to_predictions( + inputs, + outputs.logits.detach(), + outputs.logits_aggregation.detach() + ) + + # Return values + return predicted_table_cell_coords, predicted_aggregation_operators + + +def postprocess_predictions(predicted_aggregation_operators, predicted_table_cell_coords, table): + """ + Compute the predicted operation and nicely structure the answers. + """ + # Process predicted aggregation operators + aggregation_operators = {0: "NONE", 1: "SUM", 2: "AVERAGE", 3:"COUNT"} + aggregation_predictions_string = [aggregation_operators[x] for x in predicted_aggregation_operators] + + # Process predicted table cell coordinates + answers = [] + for coordinates in predicted_table_cell_coords: + if len(coordinates) == 1: + # 1 cell + answers.append(table.iat[coordinates[0]]) + else: + # > 1 cell + cell_values = [] + for coordinate in coordinates: + cell_values.append(table.iat[coordinate]) + answers.append(", ".join(cell_values)) + + # Return values + return aggregation_predictions_string, answers + + +def show_answers(queries, answers, aggregation_predictions_string): + """ + Visualize the postprocessed answers. + """ + for query, answer, predicted_agg in zip(queries, answers, aggregation_predictions_string): + print(query) + if predicted_agg == "NONE": + print("Predicted answer: " + answer) + else: + print("Predicted answer: " + predicted_agg + " > " + answer) + + +def run_tapas(): + """ + Invoke the TAPAS model. 
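+    End-to-end flow: load the tokenizer and model, prepare the inputs,
+    generate the predictions, postprocess them, and display the answers.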
+ """ + tokenizer, model = load_model_and_tokenizer() + table, inputs = prepare_inputs(data, queries, tokenizer) + predicted_table_cell_coords, predicted_aggregation_operators = generate_predictions(inputs, model, tokenizer) + aggregation_predictions_string, answers = postprocess_predictions(predicted_aggregation_operators, predicted_table_cell_coords, table) + show_answers(queries, answers, aggregation_predictions_string) + + +if __name__ == '__main__': + run_tapas() +``` + +### Results + +Running the WTQ based TAPAS model against the questions specified above gives the following results: + +``` +Which city has most inhabitants? +Predicted answer: London, England +What is the average number of inhabitants? +Predicted answer: AVERAGE > 2.161, 8.982, 0.513 +How many French cities are in the list? +Predicted answer: COUNT > Paris, France, Lyon, France +How many inhabitants live in French cities? +Predicted answer: SUM > 2.161, 0.513 +``` + +This is great! + +- According to our table, London has most inhabitants. **True**. As you can see, there was no prediction for an aggregation operator. This means that TAPAS was smart enough to recognize that this is a cell selection procedure rather than some kind of aggregation! +- For the second question, the operator predicted is `AVERAGE` - and all relevant cells are selected. **True again**. +- Very cool is that we can even ask more difficult questions that do not contain the words - for example, we get `COUNT` and two relevant cells - precisely what we mean - when we ask _which French cities are in the list_. +- Finally, a correct `SUM` operator and cells are also provided when the question is phrased differently, focusing on inhabitants instead. + +Really cool! 😎 + +* * * + +## Summary + +Transformers have really changed the world of language models. Harnessing the self-attention mechanism, they have removed the need for recurrent segments and hence sequential processing, allowing bigger and bigger models to be created that every now and then show human-like behavior - think [GPT](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/), [BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) and [DALL-E](https://www.machinecurve.com/index.php/2021/01/05/dall-e-openai-gpt-3-model-can-draw-pictures-based-on-text/). + +In this tutorial, we focused on [TAPAS](https://www.machinecurve.com/index.php/question/what-is-the-tapas-transformer-in-nlp-and-how-does-it-work/), which is an extension of [BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) and which can be used for table parsing. It specifically focused on the practical parts: that is, implementing this model for real-world usage by means of HuggingFace Transformers. + +Reading it, you have learned... + +- **How Machine Learning can be used for parsing tables.** +- **Why Transformer-based approaches have simplified table parsing over other ML approaches.** +- **How you can use TAPAS and HuggingFace Transformers to implement a table parser with Python and ML.** + +I hope that this tutorial was useful for you! 🚀 If it was, please let me know in the comments section below 💬 Please do the same if you have any questions or other comments. I'd love to hear from you. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). 
[Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008. + +Herzig, J., Nowak, P. K., Müller, T., Piccinno, F., & Eisenschlos, J. M. (2020). [Tapas: Weakly supervised table parsing via pre-training.](https://arxiv.org/abs/2004.02349) _arXiv preprint arXiv:2004.02349_. + +GitHub. (n.d.). _Google-research/tapas_. [https://github.com/google-research/tapas](https://github.com/google-research/tapas) + +Google. (2020, April 30). _Using neural networks to find answers in tables_. Google AI Blog. [https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html](https://ai.googleblog.com/2020/04/using-neural-networks-to-find-answers.html) + +HuggingFace. (n.d.). _TAPAS — transformers 4.3.0 documentation_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/transformers/model\_doc/tapas.html](https://huggingface.co/transformers/model_doc/tapas.html) diff --git a/easy-text-summarization-with-huggingface-transformers-and-machine-learning.md b/easy-text-summarization-with-huggingface-transformers-and-machine-learning.md new file mode 100644 index 0000000..bcc9395 --- /dev/null +++ b/easy-text-summarization-with-huggingface-transformers-and-machine-learning.md @@ -0,0 +1,271 @@ +--- +title: "How to perform Text Summarization with Python, HuggingFace Transformers and Machine Learning" +date: "2020-12-21" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "bart" + - "bert" + - "gpt" + - "huggingface" + - "natural-language-processing" + - "text-summarization" + - "transformer" +--- + +Natural Language Processing is one of the key areas where Machine Learning has been very effective. In fact, whereas NLP traditionally required a lot of human intervention, today, this is no longer true. Specifically Deep Learning technology can be used for learning tasks related to language, such as translation, classification, entity recognition or in this case, summarization. + +Because summarization is what we will be focusing on in this article. We will see how we can use [HuggingFace Transformers](https://github.com/huggingface/transformers) for performing **easy text summarization**. We'll structure things as follows. First of all, we'll be looking at how Machine Learning can be useful to summarizing text. Subsequently, we'll take a look at how summarization can be performed with a pretrained Transformer. We'll look at vanilla Transformers, [BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/), [GPT](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/) and eventually BART to find out how today's summarizer works. Subsequently, we'll also see how it was trained, before moving on to the coding part. + +Finally, after all the text, we'll actually implement the text summarization model with [HuggingFace Transformers](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/), which is a library for easy NLP with Python. It will be fun! + +**After reading this tutorial, you will...** + +- Understand what a Transformer is at a high level. +- See how BERT and GPT can be composed to form the BART Transformer. +- Create a Text Summarization pipeline that _really_ works on all of your English texts! + +Let's take a look 😎 + +* * * + +**Update 28/Jan/2020:** slight update to article metadata and introduction text. Also added summary. 
+ +**Update 07/Jan/2020:** added more relevant links to the article. + +* * * + +\[toc\] + +* * * + +## Summary & Example: Text Summarization with Transformers + +Transformers are taking the world of language processing by storm. These models, which learn to interweave the importance of tokens by means of a mechanism called self-attention and without recurrent segments, have allowed us to train larger models without all the problems of recurrent neural networks. There are many use cases for NLP, including text summarization, which is the focus of this tutorial. + +In this tutorial, you'll learn how to create an easy summarization pipeline with a library called **HuggingFace Transformers**. This library, which runs on top of PyTorch and TensorFlow, allows you to implement Transformer models and use them for a variety of language tasks. The example below shows how to run a text summarization pipeline for an (English) text stored in a file called `article.txt`, based on a so-called BART (= BERT + GPT) Transformer. You can immediately use it, as long as you have installed HuggingFace Transformers with `pip install transformers`. + +If you want to understand everything in a bit more detail, make sure to read the rest of the tutorial as well! ⚡ + +``` +from transformers import pipeline + +# Open and read the article +f = open("article.txt", "r", encoding="utf8") +to_tokenize = f.read() + +# Initialize the HuggingFace summarization pipeline +summarizer = pipeline("summarization") +summarized = summarizer(to_tokenize, min_length=75, max_length=300) + +# Print summarized text +print(summarized) +``` + +* * * + +## Machine Learning for Text Summarization + +Human beings have limited cognitive capacity for performing certain texts. Today, many people aren't fond of reading large amounts of text anymore. For this reason, summarizing texts is quite important: when studying for exams, reading self-help books, or simply reading the news, today's world is becoming so accelerated that time is of the absolute essence. + +Let's take this text describing the historical drama The Crown (Wikipedia, 2020): + +``` +The Crown is a historical drama streaming television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Television for Netflix. Morgan developed it from his drama film The Queen (2006) and especially his stage play The Audience (2013). The first season covers the period from Elizabeth's marriage to Philip, Duke of Edinburgh in 1947, to the disintegration of her sister Princess Margaret's engagement to Group Captain Peter Townsend in 1955. The second season covers the period from the Suez Crisis in 1956 to the retirement of Prime Minister Harold Macmillan in 1963 and the birth of Prince Edward in 1964. The third season spans the period between 1964 and 1977, including Harold Wilson's two periods as prime minister, and introduces Camilla Shand. The fourth season spans 1977 to 1990 and includes Margaret Thatcher's tenure as prime minister and Lady Diana Spencer's marriage to Prince Charles. The fifth and sixth seasons, which will close the series, will cover the Queen's reign into the 21st century. + +New actors are being cast every two seasons. Claire Foy portrays the Queen in the first two seasons, alongside Matt Smith as Prince Philip and Vanessa Kirby as Princess Margaret. 
For the third and fourth seasons, Olivia Colman takes over as the Queen, Tobias Menzies as Prince Philip, and Helena Bonham Carter as Princess Margaret. Imelda Staunton, Jonathan Pryce, and Lesley Manville will succeed Colman, Menzies, and Bonham Carter, respectively, for the final two seasons. Filming takes place at Elstree Studios in Borehamwood, Hertfordshire, with location shooting throughout the United Kingdom and internationally. The first season was released by Netflix on 4 November 2016, the second on 8 December 2017, the third on 17 November 2019, and the fourth on 15 November 2020. The fifth season is anticipated in 2022. As of 2020, the estimated production budget of The Crown has been reported to be $260 million, making it one of the most expensive television series in history.[4] + +The Crown has been praised for its acting, directing, writing, cinematography, and production values. It received accolades at the 23rd Screen Actors Guild Awards, won Best Actress for Foy in the lead role and Best Actor for John Lithgow as Winston Churchill, and has secured a total of 39 nominations for its first three seasons at the Primetime Emmy Awards, including three for Outstanding Drama Series.[5] The series was nominated for Best Drama TV Series at the 77th Golden Globe Awards. +``` + +For those who don't know The Crown: + +https://www.youtube.com/watch?v=JWtnJjn6ng0 + +...quite a text to read, even while we quoted selectively 😂 What if instead, we could read a summary, covering the most important tasks only? + +``` +The Crown is a television series based on the life of Queen Elizabeth II . The first season was released by Netflix on 4 November 2016, the third on 17 November 2019, the fourth on 15 November 2020. The series has been nominated for Best Drama TV Series at the 23rd Screen Actors Guild Awards, and has received 39 nominations for its first three seasons at the Primetime Emmy Awards, including three for Best Actress. +``` + +Wouldn't it be great if we can use Machine Learning to make such summaries as well? + +It would! + +In fact, this summary was created by a Machine Learning model - precisely the one that we will be using today. It shows the power of Machine Learning in Natural Language Processing in general and Text Summarization in particular. In fact, back in the early days, using ML for text summarization was so interesting that [massive sums of money were paid for the capability](https://en.wikipedia.org/wiki/Nick_D%27Aloisio). + +And today, we can build a text summarizer using only a few lines of code. Let's now dive into how this can be achieved. + +* * * + +## Performing summarization with a pretrained Transformer + +More precisely, today, we will be performing text summarization with a pretrained [Transformer](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/). While the code that we will write is really simple and easy to follow, the technology behind the easy interface is complex. In this section, we will therefore take a look at Transformers first, which are state-of-the-art in Natural Language Processing. This is followed by taking a closer look at the two variations of the Transformer that lie at the basis of the pretrained one that we will use, being the BERT and the GPT model architectures. + +Having understood these basics, we'll move on and look at the BART model, which is the model architecture that underpins the easy summarizer that we will be using today. 
We will see that BART combines a bidirectional BERT-like encoder with a GPT-like decoder, allowing us to benefit from BERT bidirectionality while being able to generate text, which is not one of BERT's key benefits. Once we understand BART intuitively, we're going to take a look at the _pretrained_ BART model - because BART itself is only an architecture. We will take a look at the CNN / Daily Mail dataset, which is what our model has been trained on. + +Once we understand all these aspects, we can clearly see _how_ our summarizer works, _why_ it works, and then we can move to _making it work_. Let's go! + +### What is a Transformer? + +![](images/1_BHzGVskWGS_3jEcYYi6miQ-842x1024.png) + +Source: Vaswani et al. (2017) + +In Natural Language Processing, the state-of-the-art in Machine Learning today involves a wide variety of [Transformer-based models](https://www.machinecurve.com/index.php/2020/12/21/from-vanilla-rnns-to-transformers-a-history-of-seq2seq-learning/#attention-is-all-you-need-transformers). + +A **[Transformer](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/)** is a machine learning architecture that combines an _encoder_ with a _decoder_ and jointly learns them, allowing us to convert input sequences (e.g. phrases) into some _intermediate format_ before we convert it back into human-understandable format. + +A human analogy would be two translators which both speak some imaginary language and a human-interpretable one, such as German and French. The first translator can translate French into the imaginary language; the second then has learned to translate the intermediate language back into German. Without an understanding of both human languages, one translator (the encoder) and another (the decoder) can still perform the translation job. + +They have become the primary choice for ML driven language tasks these days because they can apply self-attention and are parallel in nature. As we have seen, [previous approaches](https://www.machinecurve.com/index.php/2020/12/21/from-vanilla-rnns-to-transformers-a-history-of-seq2seq-learning/) couldn't do this: they either suffered from long-term memory loss or experienced significant compute bottlenecks. Transformers don't, and in fact, we can train them on datasets with unprecedented scale (Dosovitskiy et al., 2020). + +### BERT and GPT models + +When there is a breakthrough in Machine Learning, many researchers and organizations dive in - and boost the speed with which original ideas evolve. We have seen this with Convolutional Neural Networks: since their breakthrough application in 2012 (with the AlexNet - essentially a combination of [Conv layers](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) with [pooling ones](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/) in between followed by [Dense ones](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/)), many improvements have occurred: think of VGGs, ResNets, ... you name it. + +And recently, even Transformers are applied for computer vision tasks (Dosovitskiy et al., 2020). + +But let's get back on-topic: since the introduction of Transformers by Vaswani et al. (2017), many improvements have been suggested. 
Two of these improvements are BERT and GPT: + +- The [**Bidirectional Encoder** **Representations from Transformers** (BERT)](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) by Devlin et al. (2018) takes the encoder segment from the classic (or vanilla) Transformer, slightly changes how the inputs are generated (by means of WordPiece rather than learned embeddings) and changes the learning task into a Masked Language Model (MLM) plus Next Sentence Prediction (NSP) rather than training a simple language model. They also follow the argument for pretraining and subsequent fine-tuning: by taking the encoder segment only, pretraining it on massive datasets, BERT can be used as the encoder for subsequent finetuning tasks. In other words, we can easily build on top of BERT and use it as the root for training our own models with significantly smaller language datasets. + - Since BERT utilizes the encoder segment from the vanilla Transformer only, it is really good at understanding natural language, but less good at generating text. + - BERT's differences ensure that it does not only look at text in a left-to-right fashion, which is common in especially the masked segments of vanilla Transformers. Rather, it is bidirectional, which means that it can both look at text in a left-to-right _and_ right-to-left fashion. +- The [**Generative Pre-Training** GPT](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/) model series were invented by OpenAI and take the decoder segment of vanilla Transformers (Radford et al., 2018). It is therefore really good at generating text. Today, we're at GPT-3, for which [Microsoft has acquired an exclusive license](https://www.infoq.com/news/2020/09/microsoft-license-gpt-3). + +### BART = BERT-like encoder and GPT-like decoder + +Above, we saw that a pretrained BERT model is really good at understanding language (and hence understanding input text) but less adequate at generating new text. In other words, while it can understand questions really well (see Google, [which is utilizing BERT for many search queries these days](https://blog.google/products/search/search-language-understanding-bert/)), it's not the primary choice for generating the answers. + +On the other hand, GPT is really good at generating text, but likely less good at BERT-like understanding simply because it utilizes the decoder segment of the vanilla Transformer. + +This is where BART comes in, which stands for **Bidirectional and Auto-Regressive Transformers** (Lewis et al., 2019). It essentially generalizes BERT and GPT based architectures by using the standard Seq2Seq Transformer architecture from Vaswani et al. (2017) while mimicing BERT/GPT functionality and training objectives. For example, pretraining BART involves _token masking_ (like BERT does), _token deletion_, _text infilling_, _sentence permutation_ and _document rotation_. Once the pretrained BART model has finished training, it can be fine-tuned to a more specific task, such as text summarization. + +In the schema below, we visualize what BART looks like at a high level. First of all, you can see that input texts are passed through the _bidirectional encoder_, i.e. the BERT-like encoder. By consequence, texts are looked at from left-to-right and right-to-left, and the subsequent output is used in the _autoregressive decoder_, which predicts the output based on the encoder input _and_ the output tokens predicted so far. 
In other words, with BART, we can now both _understand_ the inputs really well and generate new outputs. + +That is, we can e.g. finetune a model for a text summarization task. + +![](images/bart-1024x449.jpg) + +### Our pretrained BART model finetuned to summarization + +Using the BART architecture, we can finetune the model to a specific task (Lewis et al., 2019). In the case of today's article, this finetuning will be **summarization**. For this summarization task, the implementation of HuggingFace (which we will use today) has performed finetuning with the CNN/DailyMail summarization dataset. This dataset has two features: + +- The article, which is the text of the news article. +- The highlights, which represent the key elements of the text and can be useful for summarization. + +![](images/image-1.png) + +The dataset is available with [TensorFlow](https://www.tensorflow.org/datasets/catalog/cnn_dailymail). + +* * * + +## Implementing a summarizer with HuggingFace Transformers + +Now that we understand many aspects of the summarizer that we will create, we can take a look at how we can easily implement the CNN/DailyMail pretrained summarizer with [HuggingFace Transformers](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/): + +> 🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch. +> +> HuggingFace (n.d.) + +Implementing such a summarizer involves multiple steps: + +- Importing the `pipeline` from `transformers`, which imports the Pipeline functionality, allowing you to easily use a variety of pretrained models. + - If you don't have Transformers installed, you can do so with `pip install transformers`. Do note that it's best to have [PyTorch](https://pytorch.org/) installed as well, possibly in a separate environment. +- Read an article stored in some text file. +- Initializing and configuring the summarization pipeline, and generating the summary using BART. +- Printing the summarized text. + +### Everything in code + +...and easy it is! In fact, you can build a text summarization model with pretrained BART in just a few lines of code: + +``` +from transformers import pipeline + +# Open and read the article +f = open("article.txt", "r", encoding="utf8") +to_tokenize = f.read() + +# Initialize the HuggingFace summarization pipeline +summarizer = pipeline("summarization") +summarized = summarizer(to_tokenize, min_length=75, max_length=300) + +# Print summarized text +print(summarized) +``` + +### Running the model + +Let's now run the model by taking a [BBC article](https://www.bbc.com/worklife/article/20201214-how-linguistic-mirroring-can-make-you-more-convincing), copying the text to the `article.txt` file, and running the summarizer with `python summarization.py` (or whatever your file is called). + +This is what the text looks like (full text via the linked page above), on both the BBC website and when added to the `article.txt` file through Notepad: + +- ![](images/image-2.png) + +- ![](images/notepad-1024x349.jpg) + + +This is the outcome of the summarizer: + +``` +New research shows that analysing someone’s communication type and parroting it back may make you more persuasive . 
The technique is called ‘linguistic mirroring’ and data shows that implementing the strategy can boost the efficacy of your message . The next time you’re on that Zoom all-staff meeting, pay close attention to how each of your colleagues speak and present their thoughts . +``` + +Really awesome! 😎 You can even make the summary bigger by setting the value for `min_length`: + +``` +New research shows that analysing someone’s communication type and parroting it back may make you more persuasive . Linguistic mirroring is called ‘linguistic mirror’ and data shows that implementing the strategy can boost the efficacy of your message . For example, the next time you’re on that Zoom all-staff meeting, pay close attention to how each of your colleagues speak and present their thoughts . Some might only be concerned with fast data points and bottom lines, acting brusque and maybe even a bit standoffish . Others may be far less linear, and might launch into a rambling story . The research shows you should adjust your speech to mimic them – even if their communication style is miles from your own. +``` + +* * * + +## Recap + +In this article, we generated an **easy text summarization** Machine Learning model by using the [HuggingFace pretrained implementation](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/) of the BART architecture. More specifically, it was implemented in a Pipeline which allowed us to create such a model with only a few lines of code. + +However, we first looked at text summarization in the first place. What is it, and why is it necessary? Through a text of the Crown television series, we saw that summarizing might help people speed up their learning - by only appreciating the key highlights rather than all the details. + +To understand how text summarizers work, we looked at the [Transformer architecture](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/), which is one of the most prominent architectures in Natural Language Processing today. After looking at the vanilla version of the architecture, we saw how Transformers like [BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) and [GPT](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/) are implemented at a high level. BART is the outcome of combining the best of both worlds and yields a model architecture that, once trained, is really good at understanding text _and_ at generating text. + +Seeing that the HuggingFace BART based Transformer was trained on the CNN/DailyMail dataset for finetuning it to text summarization, we built an easy text summarization Machine Learning model with only a few lines of code. The examples above illustrate that it works really well, which is really impressive! + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope that you have learned something by reading today's article. If you did, please feel free to leave a message in the comments section below 💬 Please do the same if you have any questions, or click the **Ask Questions button** to the right. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2020, December 15). _The crown (TV series)_. Wikipedia, the free encyclopedia. Retrieved December 15, 2020, from [https://en.wikipedia.org/wiki/The\_Crown\_(TV\_series)](https://en.wikipedia.org/wiki/The_Crown_(TV_series)) + +HuggingFace. 
(n.d.). _Summary of the tasks — transformers 4.0.0 documentation_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/transformers/task\_summary.html#summarization](https://huggingface.co/transformers/task_summary.html#summarization) + +GitHub. (n.d.). _Abisee/CNN-dailymail_. [https://github.com/abisee/cnn-dailymail](https://github.com/abisee/cnn-dailymail) + +Papers With Code. (2019, October 29). _Papers with code - BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension_. The latest in Machine Learning | Papers With Code. [https://paperswithcode.com/paper/bart-denoising-sequence-to-sequence-pre](https://paperswithcode.com/paper/bart-denoising-sequence-to-sequence-pre) + +Giacaglia, G. (2020, October 5). _Transformers_. Medium. [https://towardsdatascience.com/transformers-141e32e69591](https://towardsdatascience.com/transformers-141e32e69591) + +Wikipedia. (2019, August 25). _Transformer (machine learning model)_. Wikipedia, the free encyclopedia. Retrieved December 16, 2020, from [https://en.wikipedia.org/wiki/Transformer\_(machine\_learning\_model)](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) + +Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008. + +Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Uszkoreit, J. (2020). [An image is worth 16x16 words: Transformers for image recognition at scale](https://arxiv.org/abs/2010.11929). _arXiv preprint arXiv:2010.11929_. + +Wikipedia. (2019, October 10). _BERT (language model)_. Wikipedia, the free encyclopedia. Retrieved December 21, 2020, from [https://en.wikipedia.org/wiki/BERT\_(language\_model)](https://en.wikipedia.org/wiki/BERT_(language_model)) + +Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). [Bert: Pre-training of deep bidirectional transformers for language understanding](https://arxiv.org/abs/1810.04805). _arXiv preprint arXiv:1810.04805_. + +HuggingFace. (n.d.). _BERT — transformers 4.1.1 documentation_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/transformers/model\_doc/bert.html](https://huggingface.co/transformers/model_doc/bert.html) + +Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). [Improving language understanding by generative pre-training](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf). + +HuggingFace. (n.d.). _OpenAI GPT — transformers 4.1.1 documentation_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/transformers/model\_doc/gpt.html](https://huggingface.co/transformers/model_doc/gpt.html) + +Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., ... & Zettlemoyer, L. (2019). [Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension](https://arxiv.org/abs/1910.13461). _arXiv preprint arXiv:1910.13461_. + +HuggingFace. (n.d.). _Transformers — transformers 4.1.1 documentation_. Hugging Face – On a mission to solve NLP, one commit at a time. 
[https://huggingface.co/transformers/](https://huggingface.co/transformers/) diff --git a/exploring-the-keras-datasets.md b/exploring-the-keras-datasets.md new file mode 100644 index 0000000..eff8f84 --- /dev/null +++ b/exploring-the-keras-datasets.md @@ -0,0 +1,788 @@ +--- +title: "Exploring the Keras Datasets" +date: "2019-12-30" +categories: + - "deep-learning" + - "frameworks" +tags: + - "dataset" + - "deep-learning" + - "keras" + - "machine-learning" + - "neural-networks" +--- + +Datasets are crucial to functional machine learning models. Having a good dataset available can be the difference between success or failure in your ML projects. This is especially true when you're new to machine learning and when you're creating models to learn: you don't want your models to be dysfunctional because of the data, instead of your own effort (as harsh as it sounds 😉, from the latter you can learn). + +That's why the Keras deep learning framework contains a set of standard datasets. Today, we'll take a look at these datasets in more detail. We explore the datasets individually, taking a look at the data in detail, visualizing contents where possible. Additionally, we'll try to find out about some use cases when these datasets may be useful for your learning trajectory. + +Ready? Let's go! 😎 + +**Update 16/Nov/2020:** made the examples compatible with TensorFlow 2.x. + +\[toc\] + +* * * + +## What are the Keras Datasets? + +In order to allow people who are interested in machine learning to start smoothly, Keras provides a number of datasets within the context of the framework (Keras, n.d.). This means that you can start creating models without having to be concerned about the data: you'll only need a tiny amount of code to load it. + +The rationale behind this is simple: getting data to work for you is a notorious bottleneck in machine learning projects. Often, data is available in CSV sheets, traditional SQL databases, or worse - in Word documents or PDF files. You'll then have to scrape the data, clean it, and store it in things like a Pandas dataframe, before you can use it in your machine learning model. + +Likely, this is too much for a young and aspiring student of machine learning, and can be a substantial bottleneck to picking up learning Keras. This is why Keras made available some datasets for loading easily with just _one_ call: `load_data()`. The datasets are as follows: + +- **Image classification:** CIFAR-10, CIFAR-100, MNIST, Fashion-MNIST; +- **Text classification:** IMDB Movie Reviews, Reuters Newswire topics; +- **Regression:** Boston House Prices. + +We'll now take a look at each dataset individually :) + +* * * + +## The datasets + +### CIFAR-10 small image classification + +The **CIFAR-10** dataset was introduced by Krizhevsky & Hinton (2009) and can be used for image classification. Having been named after the Canadian Institute for Advanced Research (CIFAR), which funded the project that created it, it contains 60.000 RGB images across 10 classes - 6.000 per class. + +CIFAR-10 has images for these classes (Krizhevsky & Hinton, 2009): + +
- Airplane, Automobile, Bird, Cat, Deer
- Dog, Frog, Horse, Ship, Truck
+ +The images are 32 times 32 pixels and are split into a training set of 50.000 images and a test set of 10.000 images. + +With the Keras datasets API, it can be loaded easily (Keras, n.d.). Including the dataset in your code goes as follows: + +``` +from tensorflow.keras.datasets import cifar10 +(x_train, y_train), (x_test, y_test) = cifar10.load_data() +``` + +Let's now visualize 30 random samples from the CIFAR-10 dataset, to get an impression of what the images look like: + +- [![](images/834.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/834.jpg) + +- [![](images/3576.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/3576.jpg) + +- [![](images/11312.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/11312.jpg) + +- [![](images/12403.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/12403.jpg) + +- [![](images/13749.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/13749.jpg) + +- [![](images/15330.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/15330.jpg) + +- [![](images/18017.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/18017.jpg) + +- [![](images/20619.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/20619.jpg) + +- [![](images/24100.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/24100.jpg) + +- [![](images/24854.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/24854.jpg) + +- [![](images/27447.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/27447.jpg) + +- [![](images/27569.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/27569.jpg) + +- [![](images/28222.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/28222.jpg) + +- [![](images/28291.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/28291.jpg) + +- [![](images/36144.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/36144.jpg) + +- [![](images/36450.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/36450.jpg) + +- [![](images/37591.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/37591.jpg) + +- [![](images/37932.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/37932.jpg) + +- [![](images/38151.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/38151.jpg) + +- [![](images/38333.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/38333.jpg) + +- [![](images/38811.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/38811.jpg) + +- [![](images/40969.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/40969.jpg) + +- [![](images/41192.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/41192.jpg) + +- [![](images/42180.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/42180.jpg) + +- [![](images/45028.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/45028.jpg) + +- [![](images/46818.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/46818.jpg) + +- [![](images/47308.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/47308.jpg) + +- [![](images/48003.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/48003.jpg) + +- [![](images/48715.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/48715.jpg) + +- [![](images/48975.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/48975.jpg) + + +``` +# Imports +import matplotlib.pyplot as plt +import numpy as np +from tensorflow.keras.datasets import cifar10 + +# CIFAR-10 +(x_train, y_train), (x_test, y_test) = cifar10.load_data() + +# 
Target classes: numbers to text +classes = { + 0: 'airplane', + 1: 'automobile', + 2: 'bird', + 3: 'cat', + 4: 'deer', + 5: 'dog', + 6: 'frog', + 7: 'horse', + 8: 'ship', + 9: 'truck' +} + +# Visualize 30 random samples +for i in np.random.randint(0, len(x_train)-1, 30): + # Get data + sample = x_train[i] + target = y_train[i][0] + # Set figure size and axis + plt.figure(figsize=(1.75, 1.75)) + plt.axis('off') + # Show data + plt.imshow(sample) + plt.title(f'{classes[target]}') + plt.savefig(f'./{i}.jpg') +``` + +* * * + +### CIFAR-100 small image classification + +While the CIFAR-10 dataset contains 60.000 samples across 10 classes, the **CIFAR-100** dataset has 60.000 as well - but this time across 100 non-overlapping classes (Krizhevsky & Hinton, 2009). Instead of 6.000 samples per class, CIFAR-100 contains 600. For the rest, the structure is pretty similar to the CIFAR-10 dataset. + +Loading it is easy, as with any of the Keras Datasets (Keras, n.d.): + +``` +from tensorflow.keras.datasets import cifar100 +(x_train, y_train), (x_test, y_test) = cifar100.load_data() +``` + +These are the classes present within CIFAR-100 (Krizhevsky & Hinton, 2009): + +
- Beaver, Dolphin, Otter, Seal, Whale
- Aquarium fish, Flatfish, Ray, Shark, Trout
- Orchids, Poppies, Roses, Sunflowers, Tulips
- Bottles, Bowls, Cans, Cups, Plates
- Apples, Mushrooms, Oranges, Pears, Sweet peppers
- Clock, Computer keyboard, Lamp, Telephone, Television
- Bed, Chair, Couch, Table, Wardrobe
- Bee, Beetle, Butterfly, Caterpillar, Cockroach
- Bear, Leopard, Lion, Tiger, Wolf
- Bridge, Castle, House, Road, Skyscraper
- Cloud, Forest, Mountain, Plain, Sea
- Camel, Cattle, Chimpanzee, Elephant, Kangaroo
- Fox, Porcupine, Possum, Raccoon, Skunk
- Crab, Lobster, Snail, Spider, Worm
- Baby, Boy, Girl, Man, Woman
- Crocodile, Dinosaur, Lizard, Snake, Turtle
- Hamster, Mouse, Rabbit, Shrew, Squirrel
- Maple, Oak, Palm, Pine, Willow
- Bicycle, Bus, Motorcycle, Pickup truck, Train
- Lawn-mower, Rocket, Streetcar, Tank, Tractor
+ +And here are, once again, 30 samples randomly drawn and visualized: + +- [![](images/1403.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/1403.jpg) + +- [![](images/1676.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/1676.jpg) + +- [![](images/1813.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/1813.jpg) + +- [![](images/3513.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/3513.jpg) + +- [![](images/5023.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/5023.jpg) + +- [![](images/6418.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/6418.jpg) + +- [![](images/10425.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/10425.jpg) + +- [![](images/15307.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/15307.jpg) + +- [![](images/15743.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/15743.jpg) + +- [![](images/18167.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/18167.jpg) + +- [![](images/21402.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/21402.jpg) + +- [![](images/26247.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/26247.jpg) + +- [![](images/26544.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/26544.jpg) + +- [![](images/27260.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/27260.jpg) + +- [![](images/27757.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/27757.jpg) + +- [![](images/27872.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/27872.jpg) + +- [![](images/29119.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/29119.jpg) + +- [![](images/29735.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/29735.jpg) + +- [![](images/30218.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/30218.jpg) + +- [![](images/33582.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/33582.jpg) + +- [![](images/34242.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/34242.jpg) + +- [![](images/34889.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/34889.jpg) + +- [![](images/35045.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/35045.jpg) + +- [![](images/35793.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/35793.jpg) + +- [![](images/39358.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/39358.jpg) + +- [![](images/41909.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/41909.jpg) + +- [![](images/42681.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/42681.jpg) + +- [![](images/43871.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/43871.jpg) + +- [![](images/49406.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/49406.jpg) + +- [![](images/49626.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/49626.jpg) + + +``` +# Imports +import matplotlib.pyplot as plt +import numpy as np +from tensorflow.keras.datasets import cifar100 + +# CIFAR-100 +(x_train, y_train), (x_test, y_test) = cifar100.load_data() + +# Target classes: numbers to text +# Source: https://github.com/keras-team/keras/issues/2653#issuecomment-450133996 +classes = [ + 'apple', + 'aquarium_fish', + 'baby', + 'bear', + 'beaver', + 'bed', + 'bee', + 'beetle', + 'bicycle', + 'bottle', + 'bowl', + 'boy', + 'bridge', + 'bus', + 'butterfly', + 'camel', + 'can', + 'castle', + 'caterpillar', + 'cattle', + 'chair', + 'chimpanzee', + 'clock', + 'cloud', + 
'cockroach', + 'couch', + 'crab', + 'crocodile', + 'cup', + 'dinosaur', + 'dolphin', + 'elephant', + 'flatfish', + 'forest', + 'fox', + 'girl', + 'hamster', + 'house', + 'kangaroo', + 'computer_keyboard', + 'lamp', + 'lawn_mower', + 'leopard', + 'lion', + 'lizard', + 'lobster', + 'man', + 'maple_tree', + 'motorcycle', + 'mountain', + 'mouse', + 'mushroom', + 'oak_tree', + 'orange', + 'orchid', + 'otter', + 'palm_tree', + 'pear', + 'pickup_truck', + 'pine_tree', + 'plain', + 'plate', + 'poppy', + 'porcupine', + 'possum', + 'rabbit', + 'raccoon', + 'ray', + 'road', + 'rocket', + 'rose', + 'sea', + 'seal', + 'shark', + 'shrew', + 'skunk', + 'skyscraper', + 'snail', + 'snake', + 'spider', + 'squirrel', + 'streetcar', + 'sunflower', + 'sweet_pepper', + 'table', + 'tank', + 'telephone', + 'television', + 'tiger', + 'tractor', + 'train', + 'trout', + 'tulip', + 'turtle', + 'wardrobe', + 'whale', + 'willow_tree', + 'wolf', + 'woman', + 'worm', +] + +# Visualize 30 random samples +for i in np.random.randint(0, len(x_train)-1, 30): + # Get data + sample = x_train[i] + target = y_train[i][0] + # Set figure size and axis + plt.figure(figsize=(1.75, 1.75)) + plt.axis('off') + # Show data + plt.imshow(sample) + plt.title(f'{classes[target]}') + plt.savefig(f'./{i}.jpg') +``` + +* * * + +### IMDB Movie reviews sentiment classification + +Maas et al. (2011) provide the **IMDB Movie Reviews dataset** for sentiment classification, which is made available preprocessed in the Keras Datasets section. The dataset contains 25.000 movie reviews from IMDB, labeled by sentiment (positive and negative, Keras n.d.). + +It can be used to experiment with building models for sentiment classification. + +As said, it's preprocessed, which warrants a description of _how_ it was preprocessed. Firstly, it's important to understand that "each review is encoded as a sequence of word indexes" (Keras, n.d.). This means that each word was converted into an integer, which represents the position of the word in some word index. One sample (to be precise, index 3 in the training data) looks like this: + +``` +[1, 14, 47, 8, 30, 31, 7, 4, 249, 108, 7, 4, 5974, 54, 61, 369, 13, 71, 149, 14, 22, 112, 4, 2401, 311, 12, 16, 3711, 33, 75, 43, 1829, 296, 4, 86, 320, 35, 534, 19, 263, 4821, 1301, 4, 1873, 33, 89, 78, 12, 66, 16, 4, 360, 7, 4, 58, 316, 334, 11, 4, 1716, 43, 645, 662, 8, 257, 85, 1200, 42, 1228, 2578, 83, 68, 3912, 15, 36, 165, 1539, 278, 36, 69, 44076, 780, 8, 106, 14, 6905, 1338, 18, 6, 22, 12, 215, 28, 610, 40, 6, 87, 326, 23, 2300, 21, 23, 22, 12, 272, 40, 57, 31, 11, 4, 22, 47, 6, 2307, 51, 9, 170, 23, 595, 116, 595, 1352, 13, 191, 79, 638, 89, 51428, 14, 9, 8, 106, 607, 624, 35, 534, 6, 227, 7, 129, 113] +``` + +Where the target for this value is plain and simple: `0`. Apparently, this review is negative, although we don't know why, since we don't know the words :) + +However, we can find out about them. 

I adapted some code created by Mdaoust (2019) [and available here](https://stackoverflow.com/a/44891281) into the following, exploiting the `get_word_index()` call available for the IMDB dataset:

```
from tensorflow.keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data()

INDEX_FROM=3   # word index offset
word_to_id = imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
# The first indices are reserved for special tokens
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
word_to_id["<UNUSED>"] = 3

id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[id] for id in x_train[2] ))
```

And here it is:

```
<START> this has to be one of the worst films of the 1990s when my friends i were watching this film being the target audience it was aimed at we just sat watched the first half an hour with our jaws touching the floor at how bad it really was the rest of the time everyone else in the theatre just started talking to each other leaving or generally crying into their popcorn that they actually paid money they had earnt working to watch this feeble excuse for a film it must have looked like a great idea on paper but on film it looks like no one in the film has a clue what is going on crap acting crap costumes i can't get across how embarrasing this is to watch save yourself an hour a bit of your life
```

Definitely negative 😂

Do note that the actual index is sorted by word frequency: `i = 1` is the most frequent word, `i = 2` the second most frequent word, and so on. This allows one to e.g. "consider the top 10.000 most common words, but eliminate the top 20 \[ones\]" (Keras, n.d.).

In the simplest form, the data can be loaded as follows:

```
from tensorflow.keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data()
```

However, there are some arguments that can be set (Keras, n.d.):

```
from tensorflow.keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=None,
                                                      skip_top=0,
                                                      maxlen=None,
                                                      seed=113,
                                                      start_char=1,
                                                      oov_char=2,
                                                      index_from=3)
```

These are, respectively (Keras, n.d.):

- `path`: the path to which the IMDB data will be downloaded if you don't have it locally yet.
- `num_words`: the top most frequent words to consider. Anything beyond this value will be encoded with the `oov_char`, which, as we shall see, has to be configured by you.
- `skip_top` tells Keras how many of the top frequent words to skip before starting the count towards `num_words`.
- `maxlen` specifies the maximum length of the sequence, before it will be truncated.
- `seed` is the random seed value for "reproducible data shuffling" (Keras, n.d.). It's for fixing the random generator used when shuffling.
- The `start_char` marks where a sequence starts.
- The `oov_char` replaces any word that is "out of vocabulary" (i.e., because it falls out of the range `skip_top < top n word < (num_words + skip_top)`).
- The `index_from` setting tells Keras Datasets to index words from that particular index.

Knowing all this, it shouldn't be too hard for you to build a sentiment classifier :) We'll do that in another blog post ;-)

* * *

### Reuters newswire topics classification

Another dataset for text classification is the **Reuters newswire topics dataset** (Keras, n.d.).
It's preprocessed in the same way as the IMDB dataset before and can be used for classifying texts into one of 46 topics:

> Dataset of 11,228 newswires from Reuters, labeled over 46 topics. As with the IMDB dataset, each wire is encoded as a sequence of word indexes (same conventions).
>
> Keras (n.d.)

The simplest way of loading this dataset goes as follows:

```
from tensorflow.keras.datasets import reuters
(x_train, y_train), (x_test, y_test) = reuters.load_data()
```

The attributes discussed under the IMDB dataset are also available, as well as `test_split` (float): this represents the fraction of data to be used for testing (Keras, n.d.) and is assigned to the `_test` variables.

Adapting the code we used previously (originally created by Mdaoust (2019) [and available here](https://stackoverflow.com/a/44891281), adapted by me; please note that I found the Reuters dataset topics [here](https://github.com/keras-team/keras/issues/12072#issuecomment-458154097), Bauer n.d.) into ...

```
from tensorflow.keras.datasets import reuters
import numpy as np
(x_train, y_train), (x_test, y_test) = reuters.load_data()

# Define the topics
# Source: https://github.com/keras-team/keras/issues/12072#issuecomment-458154097
topics = ['cocoa','grain','veg-oil','earn','acq','wheat','copper','housing','money-supply',
          'coffee','sugar','trade','reserves','ship','cotton','carcass','crude','nat-gas',
          'cpi','money-fx','interest','gnp','meal-feed','alum','oilseed','gold','tin',
          'strategic-metal','livestock','retail','ipi','iron-steel','rubber','heat','jobs',
          'lei','bop','zinc','orange','pet-chem','dlr','gas','silver','wpi','hog','lead']

# Build the word index once; the first indices are reserved for special tokens
INDEX_FROM=3   # word index offset
word_to_id = reuters.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
word_to_id["<UNUSED>"] = 3
id_to_word = {value:key for key,value in word_to_id.items()}

# Obtain 3 texts randomly
for i in np.random.randint(0, len(x_train), 3):
    print('=================================================')
    print(f'Sample = {i} | Topic = {topics[y_train[i]]} ({y_train[i]})')
    print('=================================================')
    print(' '.join(id_to_word[id] for id in x_train[i] ))
```

...yields three texts in the dataset, about _money earnt_, _crude oil_ and _business acquisitions_, so it seems:

```
=================================================
Sample = 8741 | Topic = earn (3)
=================================================
<START> qtly div 50 cts vs 39 cts pay jan 20 record dec 31 reuter 3
=================================================
Sample = 8893 | Topic = crude (16)
=================================================
<START> ice conditions are unchanged at the soviet baltic oil port of ventspils with continuous and compacted drift ice 15 to 30 cms thick the latest report of the finnish board of navigation said icebreaker assistance to reach ventspils harbour is needed for normal steel vessels without special reinforcement against ice the report said it gave no details of ice conditions at the other major soviet baltic export harbour of klaipeda reuter 3
=================================================
Sample = 1829 | Topic = acq (4)
=================================================
<START> halcyon investments a new york firm reported a 6 9 pct stake in research cottrell inc alan slifka a partner in halcyon told reuters the shares were purchased for investment purposes but declined further comment on june 8
research cottrell said it had entered into a definitive agreement to be acquired by r c acquisitions inc for 43 dlrs per share research cottrell closed at 44 1 4 today unchanged from the previous close reuter 3 +``` + +In a different blog article, we'll see if we can create a classifier 😁 + +* * * + +### MNIST database of handwritten digits + +Another dataset that is included in the Keras Datasets API is the **MNIST dataset**, which stands for Modified National Institute of Standards and Technology (LeCun et al., n.d.). The dataset contains 60.000 training images and 10.000 testing images of handwritten digits, which are all 28 times 28 pixels in size. + +Loading them is easy: + +``` +from tensorflow.keras.datasets import mnist +(x_train, y_train), (x_test, y_test) = mnist.load_data() +``` + +This is a selection of MNIST digits: + +[![](images/mnist.png)](https://www.machinecurve.com/wp-content/uploads/2019/07/mnist.png) + +There is a wide range of possibilities when using MNIST in educational machine learning settings. I use it a lot in my blogs here, on MachineCurve. For example, I've created a [variational autoencoder](https://www.machinecurve.com/index.php/2019/12/30/how-to-create-a-variational-autoencoder-with-keras/) with the MNIST dataset, which allowed me to generate new digits: + +[![](images/mnist_digits.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/mnist_digits.png) + +And so on! :) + +* * * + +### Fashion-MNIST database of fashion articles + +The MNIST dataset is used as a benchmark dataset in many studies, for validation of algorithms, and so on. Xiao et al. (n.d.) however argue that people should move away from MNIST: + +- It's too easy. I must say that the dataset is indeed really discriminative. According to Xiao et al., state-of-the-art convolutional nets achieve 99.7% accuracies. This means that real breakthroughs are likely no longer found when using MNIST. +- It's overused. A lot of people are using it, me included 🙊 + +https://twitter.com/goodfellow\_ian/status/852591106655043584 + +In order to overcome these issues, Xiao et al. introduce the **Fashion-MNIST** dataset. The dataset, which is a drop-in replacement for MNIST (which means: you can simply replace `mnist` with `fashion_mnist` to use it!), also contains 60.000 training images and 10.000 testing images (Xiao et al., n.d.). + +Loading it is easy, once again: + +``` +from tensorflow.keras.datasets import fashion_mnist +(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data() +``` + +However, the samples are different in nature. Rather than digits, they represent the following classes (Xiao et al., n.d.): + +
- T-shirt/top, Trouser, Pullover, Dress, Coat
- Sandal, Shirt, Sneaker, Bag, Ankle boot
+ +Visualizing 30 yields that the differences within this dataset are larger than with traditional MNIST. For example, compare the sneaker on the second row with the sneaker on the third. While both are sneakers, the second-row sneaker has a striped pattern, whereas the third-row sneaker does not: + +- [![](images/2558.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/2558.jpg) + +- [![](images/4798.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/4798.jpg) + +- [![](images/5436.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/5436.jpg) + +- [![](images/5726.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/5726.jpg) + +- [![](images/7333.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/7333.jpg) + +- [![](images/10305.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/10305.jpg) + +- [![](images/10539.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/10539.jpg) + +- [![](images/11515.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/11515.jpg) + +- [![](images/12365.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/12365.jpg) + +- [![](images/12481.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/12481.jpg) + +- [![](images/15294.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/15294.jpg) + +- [![](images/17749.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/17749.jpg) + +- [![](images/21752.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/21752.jpg) + +- [![](images/24085.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/24085.jpg) + +- [![](images/27809.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/27809.jpg) + +- [![](images/31184.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/31184.jpg) + +- [![](images/32057.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/32057.jpg) + +- [![](images/32091.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/32091.jpg) + +- [![](images/35707.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/35707.jpg) + +- [![](images/38876.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/38876.jpg) + +- [![](images/40360.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/40360.jpg) + +- [![](images/49002.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/49002.jpg) + +- [![](images/49065.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/49065.jpg) + +- [![](images/50830.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/50830.jpg) + +- [![](images/51660.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/51660.jpg) + +- [![](images/54277.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/54277.jpg) + +- [![](images/54650.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/54650.jpg) + +- [![](images/56116.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/56116.jpg) + +- [![](images/56334.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/56334.jpg) + +- [![](images/58299.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/58299.jpg) + + +``` +# Imports +import matplotlib.pyplot as plt +import numpy as np +from tensorflow.keras.datasets import fashion_mnist + +# Fashion MNIST +(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data() + +# Target classes: numbers to text +classes = { + 0: 'T-shirt/top', + 1: 'trouser', + 2: 'pullover', + 3: 'dress', + 4: 'coat', + 5: 'sandal', + 6: 'shirt', + 7: 'sneaker', 
+ 8: 'bag', + 9: 'ankle boot' +} + +# Visualize 30 random samples +for i in np.random.randint(0, len(x_train)-1, 30): + # Get data + sample = x_train[i] + target = y_train[i] + # Set figure size and axis + plt.figure(figsize=(1.75, 1.75)) + plt.axis('off') + # Show data + plt.imshow(sample, cmap='gray') + plt.title(f'{classes[target]}') + plt.savefig(f'./{i}.jpg') +``` + +Let's hope that Fashion MNIST or a similar dataset can play an actual role as a benchmark dataset in the machine learning community. + +Fun side note: with this dataset, it's also possible to create a [variational autoencoder](https://www.machinecurve.com/index.php/2019/12/24/what-is-a-variational-autoencoder-vae/) :) Visualizing the so-called latent space, or the distribution learnt by the autoencoder, shows that we can generate samples that are something in between a shoe and a boot, in between trouwsers and a t-shirt, and so on. Might be interesting for fashion designers 😋 + +[![](images/fmnist_dmax_plot.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/fmnist_dmax_plot.png) + +* * * + +### Boston housing price regression dataset + +Another dataset that is available within the Keras Datasets is the **Boston Housing Price Regression Dataset**. As the name implies, this dataset can be used in regression settings, contrary to the many classification-related datasets that we've already seen in this blog post. + +Loading the data is easy, as with pretty much all of the Keras Datasets: + +``` +from tensorflow.keras.datasets import boston_housing +(x_train, y_train), (x_test, y_test) = boston_housing.load_data() +``` + +The dataset contains 506 observations that relate certain characteristics with the price of houses (in $1000s) in Boston in some period. As we can see: + +- The minimum house price is $5000, while the maximum house price is $50.000. This may sound weird, but it's not: house prices have risen over the decades, and the study that produced this data is from 1978 (Harrison & Rubinfeld, 1978). Actually, around 1978 prices of \[latex\]\\approx $ 50.000\[/latex\] were quite the median value, so this dataset seems to contain relatively cheaper houses (or the Boston area was cheaper back then - I don't know; Martin, 2017). +- The mean house price was $22.533. +- Variance in house prices is $84.587. + +``` +DescribeResult(nobs=506, minmax=(5.0, 50.0), mean=22.53280632411067, variance=84.58672359409854, skewness=1.1048108228646372, kurtosis=1.4686287722747515) +``` + +Given this box plot of the training _and_ testing data combined, outliers are primarily present in the upper segment: + +[![](images/boxplot.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/boxplot.jpg) + +Code for generating the summary and the box plot: + +``` +''' + Generate a BoxPlot image to determine how many outliers are within the Boston Housing Pricing Dataset. 
+''' +from tensorflow.keras.datasets import boston_housing +import numpy as np +import matplotlib.pyplot as plt +from scipy import stats as st + +# Load the data +(x_train, y_train), (x_test, y_test) = boston_housing.load_data() + +# We only need the targets, but do need to consider all of them +y = np.concatenate((y_train, y_test)) + +# Describe statistics +stats = st.describe(y) +print(stats) + +# Generate box plot +plt.boxplot(y) +plt.title('Boston housing price regression dataset - boxplot') +plt.savefig('./boxplot.jpg') +``` + +Okay, now that we know the descriptive statistics, there's one big question that remains - _what does the data look like?_ Which variables describe the housing price, and why were they taken? + +These are all the variables present in the dataset (StatLib, n.d.): + +- **CRIM** per capita crime rate by town +- **ZN** proportion of residential land zoned for lots over 25,000 sq.ft. +- **INDUS** proportion of non-retail business acres per town +- **CHAS** Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) +- **NOX** nitric oxides concentration (parts per 10 million) +- **RM** average number of rooms per dwelling +- **AGE** proportion of owner-occupied units built prior to 1940 +- **DIS** weighted distances to five Boston employment centres +- **RAD** index of accessibility to radial highways +- **TAX** full-value property-tax rate per $10,000 +- **PTRATIO** pupil-teacher ratio by town +- **B** 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town +- **LSTAT** % lower status of the population +- **MEDV** Median value of owner-occupied homes in $1000's + +**MEDV** is the median value and hence the target variable, while the other variables describe the MEDV or median value of owner-occupied homes. Given the focus of the study ("hedonic housing prices and the demand for clean air", see Harrison & Rubinfeld 1978), it's clear why variables such as crime rate, retail acres and nitric oxides concentration are present. + +[(It does however seem to be the case that additional factors help determine house prices, as we found quite a high error rate when creating a neural network with this dataset. We didn't however explore which ones they are.)](https://www.machinecurve.com/index.php/2019/10/12/using-huber-loss-in-keras/) + +* * * + +## Summary + +In this blog post, we explored the datasets that are available within the Keras Datasets: + +- **Image classification:** CIFAR-10, CIFAR-100, MNIST, Fashion-MNIST; +- **Text classification:** IMDB Movie Reviews, Reuters Newswire topics; +- **Regression:** Boston House Prices. + +For each, we looked at the individual datasets: what do they represent? How can they be visualized? How must they be interpreted? How are they useful for learning machine learning? Those were the questions that we tried to answer today :) + +I hope this blog post was useful to you. If it was, I'd love to know, so that I know to spend more attention to these topics in my future articles. If it wasn't, or if you have questions, please let me know as well, so that I can improve. In those cases, please leave a message in the comments box below 👇 + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Keras. (n.d.). Datasets. Retrieved from [https://keras.io/datasets/](https://keras.io/datasets/) + +Krizhevsky, A., & Hinton, G. (2009). _[Learning multiple layers of features from tiny images](https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf)_ (Vol. 1, No. 4, p. 7). 
Technical report, University of Toronto (alternatively: [take a look at their website](https://www.cs.toronto.edu/~kriz/cifar.html)!). + +LeCun, Y., Cortes, C., & Burges, C. (n.d.). MNIST handwritten digit database. Retrieved from [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/) + +Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. Han Xiao, Kashif Rasul, Roland Vollgraf. [arXiv:1708.07747](http://arxiv.org/abs/1708.07747) + +Harrison, D., & Rubinfeld, D. L. (1978). Hedonic housing prices and the demand for clean air. _Journal of Environmental Economics and Management_, _5_(1), 81-102. doi:10.1016/0095-0696(78)90006-2 + +StatLib. (n.d.). Datasets Archive. Retrieved from [http://lib.stat.cmu.edu/datasets/](http://lib.stat.cmu.edu/datasets/) + +Martin, E. (2017, June 23). Here's how much housing prices have skyrocketed over the last 50 years. Retrieved from [https://www.cnbc.com/2017/06/23/how-much-housing-prices-have-risen-since-1940.html](https://www.cnbc.com/2017/06/23/how-much-housing-prices-have-risen-since-1940.html) + +Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011, June). [Learning word vectors for sentiment analysis](https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset). In _Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1_ (pp. 142-150). Association for Computational Linguistics. + +Mdaoust. (2019). Restore original text from Keras's imdb dataset. Retrieved from [https://stackoverflow.com/a/44891281](https://stackoverflow.com/a/44891281) + +Bauer, S. (n.d.). Where can I find topics of reuters dataset · Issue #12072 · keras-team/keras. Retrieved from [https://github.com/keras-team/keras/issues/12072#issuecomment-458154097](https://github.com/keras-team/keras/issues/12072#issuecomment-458154097) diff --git a/extensions-to-gradient-descent-from-momentum-to-adabound.md b/extensions-to-gradient-descent-from-momentum-to-adabound.md new file mode 100644 index 0000000..e2387f3 --- /dev/null +++ b/extensions-to-gradient-descent-from-momentum-to-adabound.md @@ -0,0 +1,265 @@ +--- +title: "Extensions to Gradient Descent: from momentum to AdaBound" +date: "2019-11-03" +categories: + - "buffer" + - "deep-learning" +tags: + - "adam" + - "adaptive-optimizers" + - "deep-learning" + - "machine-learning" + - "minibatch-gradient-descent" + - "neural-networks" + - "optimizer" +--- + +Today, optimizing neural networks is often performed with what is known as gradient descent: analogous to [walking down a mountain](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/), an algorithm attempts to find a minimum in a neural network's loss landscape. + +Traditionally, one of the variants of gradient descent - batch gradient descent, stochastic gradient descent and minibatch gradient descent - were used for this purpose. + +However, over many years of usage, various shortcomings of traditional methods were found to exist. In this blog post, I'll cover these challenges based on the available literature, and introduce new optimizers that have flourished since then. Even today's standard optimizers, such as Adam, are covered here. + +As you'll see, it's going to be somewhat of a chronology - many of the optimizers covered in this post will be improvements of each other. Funnily, the usage of _adaptive optimizers_ has caused renewed interest in traditional gradient descent as of recently. 
This is due to the fact that adaptive optimizers were found to perform worse than traditional ones in terms of generalization - i.e., on the test set. We'll therefore also cover more state-of-the-art optimizers here, such as AdaBound, which aims to combine the best of both worlds. + +If you have questions - feel free to leave a message in my comments box below! 👇 I'll happily answer any question you have. Thanks and enjoy the post! 😄 + +**After reading this article, you will understand...** + +- What challenges there are with classic gradient descent. +- Why adaptive optimizers are called _adaptive_. +- How a variety of adaptive optimizers - (Nesterov) momentum, Adagrad, Adadelta, RMSprop, Adam, AdaMax and Nadam - works, and how they are different. + +**Update 01/Mar/2021:** ensure that article is still relevant in 2021. + +* * * + +\[toc\] + +* * * + +## Traditional gradient descent & challenges + +When considering the [high-level machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) for supervised learning, you'll see that each forward pass generates a loss value that can be used for optimization. + +Although backpropagation generates the actual gradients in order to perform the optimization, the optimizer algorithm used determines _how optimization is performed_, i.e., where to apply what change in the weights of your neural network in order to improve loss during the next iteration. + +https://www.youtube.com/watch?v=kJgx2RcJKZY + +Source: [Christopher Gondek](https://www.youtube.com/watch?v=kJgx2RcJKZY) + +Gradient descent is the traditional class of algorithms used for this purpose. In another blog post detailing three of the traditional variants, we introduced these optimizers that can also be used today: + +- **Batch gradient descent**, which optimizes the model when the entire dataset was fed forward, i.e., producing one global update. Although quite accurate in terms of your dataset, it's terribly slow (datasets often have thousands of samples, if not millions!) and is barely usable in practice. +- **Stochastic gradient descent**, which optimizes your model each time a sample is fed forward, based on the loss generated for this sample. Although it's blazing fast, especially compared to batch gradient descent, it is much less accurate. Imagine what happens when a statistical outlier is fed forward - your model will swing away from its path to the minimum. +- **Minibatch gradient descent**, which lies somewhere in between: your model is optimized based on a weights change determined by mini batches of 50 to 250 samples; hence, not based on _one_, but neither on your whole dataset. With minibatch gradient descent, you essentially create balance between accuracy and speed. + +Various challenges were identified with usage of traditional gradient descent over the years: + +- The computed optimization is always computed with a learning rate before actually performed. This learning rate determines the speed with which actual optimization must be performed, or the size of the steps the optimization algorithm must take. When taking too large steps, the model may continuously overshoot the minimum (imagine yourself stepping over a hole in the ground if your steps are large enough). If your steps are too small, it will take a tremendous amount of time (and hence computing resources) to reach the minimum. 
As you will have to provide the learning rate in advance, whilst not knowing the ideosyncrasies of your dataset, you will likely create a less-than-optimal performing optimizer (Ruder, 2016). +- Even though _decay schemes_ are available which set a large learning rate at first and decreasing it substantially with each epoch, you'll have to configure these _in advance_. You thus essentially face the same problem again (Ruder, 2016). +- When the loss landscapes are _non-convex_, or in plain English when there is no actual minimum available, gradient descent-like optimizers face great difficulty in optimizing the model (Shaikh, 2019). +- Traditional optimizers like the ones above adapt the same learning rate to each _trainable parameter_, or, in plain English, _the weights of each neuron_ (Ruder, 2016). You might wish to perform updates more sparsely and use different learning rates across the network, dynamically. +- The learning process may slow down when large flat areas are encountered during training (O'Reilly, n.d.). +- Additionally, _saddle points_ may be encountered, which are points that descend when you approach them from one direction, yet ascend when you approach them from another one. Optimizers will get confused when they encounter such points (Shaikh, 2019; Ruder, 2016). +- A loss landscape often contains multiple minima. But which one is the global minimum? Traditional gradient descent does not know this in advance, since it cannot look over the next mountainous peak. This is another problem with the traditional methods (Ruder, 2016). +- You may face [vanishing and exploding gradients](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) (Shaikh, 2019). + +* * * + +## Adaptive optimizers + +Over the years, given those challenges, many new optimizers have been created, most of which belong to the class of _adaptive optimizers_. Contrary to gradient descent optimizers, which adapt the weights in a _static_ way with a _fixed_ learning rate across all parameters, adaptive optimizers have more flexibility built-in. We'll now cover many of these optimizers, starting with momentum. + +### Momentum + +Remember that gradient descent was like walking down a mountain? + +Now suppose that you're a ball instead, and roll down that same mountain. + +What would happen when rolling down? Indeed, you would go faster and faster, and if you were to change direction, that would be more difficult when you're already rolling. + +Traditional gradient descent techniques don't have this built in: they just change direction and keep setting small steps, even though it has been moving downward for some time already. + +This is problematic when your loss landscape contains many local minima (Ruder, 2016). In fact, what happens, is that your optimizer will move towards the bottom of the local minimum with small baby steps. Eventually, it could get stuck in the local minimum, not being able to escape from it. This leaves you a suboptimal model. + +Now, what if we could model gradient descent after the ball-like way of descending a mountain, overshooting the minimum because it already built up some _momentum_ during descent? + +This is exactly what **momentum based gradient descent** does for you (Qian, 1999; Ruder, 2016). To each gradient update, it adds a little bit of the previous update; that is, its current speed (= direction + velocity) is added to the update. 
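To make that concrete, here is a minimal, purely illustrative sketch of the classic momentum update rule in plain Python - not the implementation of any particular framework, and the function and parameter names are mine:

```python
def momentum_step(w, velocity, gradient, learning_rate=0.01, momentum=0.9):
    """One momentum update: the new step is the current gradient step
    plus a fraction (`momentum`) of the previous step."""
    velocity = momentum * velocity + learning_rate * gradient
    return w - velocity, velocity

# Toy example: minimize f(w) = w^2, whose gradient is 2w
w, velocity = 5.0, 0.0
for _ in range(100):
    gradient = 2 * w
    w, velocity = momentum_step(w, velocity, gradient)
print(w)  # close to the minimum at w = 0, possibly oscillating slightly around it
```

Because part of the previous step is carried over, the optimizer keeps "rolling" in the direction it was already moving.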
This ensures two things:

- The model will likely converge to its optimum much faster, since built-up momentum reduces drifting / oscillation of the optimizer (Qian, 1999);
- Similarly, the model _might_ converge to better optima, since built-up momentum can cause the optimizer to overshoot suboptimal and local minima, ending up in the global minimum that you couldn't find with traditional gradient descent (Qian, 1999).

Take a look at this video which we found at [Towards AIMLPY](https://www.youtube.com/watch?v=6iwvtzXZ4Mo), which shows how momentum works:

https://www.youtube.com/watch?v=6iwvtzXZ4Mo

Source: [Towards AIMLPY](https://www.youtube.com/watch?v=6iwvtzXZ4Mo)

### Nesterov accelerated gradient

In his work, Ruder (2016) asked himself: what if we can get an \[even\] smarter ball?

One that can essentially look ahead, to see what's coming, merged with whatever happened in the past?

Enter **Nesterov accelerated gradient** (Nesterov, 1983), which is based on traditional momentum.

If we're at some position and know our momentum, we know where we'll move towards with some margin of error (caused by the gradient update we didn't take into account yet).

This means that rather than computing the gradient update at our _current position_, we can compute it at our _current position plus momentum_ - that is, at the position we are about to move to - and then add momentum to that update. This is as if we're looking _one step ahead of time_, given our current position and where we're going to. Ruder (2016) argues that this "results in increased responsiveness" of our model. That's nice 😊

### Adagrad

While the previous adaptive optimizers allow neural networks to better navigate the data's loss landscape by adapting the update for all weights _in the same way_, another step forward could be to _adapt the update for each individual weight separately_.

In response to this desire, a _subgradient_ method called Adagrad was introduced by Duchi et al. (2011). Contrary to the learning rate of the previous methods, which is set and fixed, Adagrad uses dynamic learning rates for every parameter (Ruder, 2016).

This learning rate, for which a basic rate can be set a priori, adapts itself as it is divided by the square root of the sum of past squared gradients computed for the parameter, plus a small smoothing term that avoids division by zero (Ruder, 2016). The square root is computed by means of an efficient matrix operation. If the gradient updates in the past were large, the learning rate gets smaller and smaller, while if they were small, the learning rate remains relatively large.

The result: parameters that didn't update as much as the ones that did, will have larger updates later on in the model's training process. This has multiple benefits:

- Self-adapting learning rates effectively allow you to omit configuring the learning rate a priori (Ruder, 2016), or using some learning rate decay scheme (Milman & Dali, n.d.). This solves one of the challenges listed earlier in this blog post.
- For data problems with much sparsity, such as one-hot encoded word vectors where a few words are present (ones) and many are not (zeros), Adagrad significantly improves optimization (Ruder, 2016). This occurs because of the fit between Adagrad and sparse data (Gupta, n.d.; Mallinar, n.d.): sparse data contains 'islands' of features which are represented in _certain areas of your neural network_. If those islands must be adapted, it's best to only adapt the _areas impacted by those islands_.
Since Adagrad does this - flexibly adjusting learning rates based on the structure of your data - it was a substantial improvement over traditional gradient descent like methods, especially for sparse ML problems like word processing. + +Unfortunately, Adagrad also comes with a big drawback: **the speed of learning will decrease with each time step** (Milman (2), n.d.). What's more, at some point in time, the model will effectively stop learning, since the self-adaptive learning rate converges to zero. + +Why this happens is simple: it's due to the square root of the past gradients by which the fixed learning rate is divided. + +If you have an infinite amount of steps, and hence an infinitely large history of gradients, the squared root will also be infinite. Dividing anything by an infinitely large number yields a value that is approximately zero - and this also applies to the adaptive learning rate. While no practical training process will remotely approach an infinite amount of epochs, you get what happens with many iterations of your training process. + +Fortunately, Adadelta comes to the rescue. + +### Adadelta + +We recall that Adagrad benefited from dynamic learning rates, that they were computed by dividing the preconfigured learning rate by the squared root of the parameter's previous gradients, and that precisely this computation yields vanishing gradients due to the increasing nature of the denominator (Ruder, 2016; Milman (2), n.d.). + +Adadelta provides a fix for this problem (Zeiler, 2012). It does so by following an interesting philosophy. Suppose that you're currently in the 101st epoch, or iteration. For adapting the parameters, how interesting is the gradient update between the 4th and 5th epoch? Isn't it less interesting than the update between the 30th and 31st epoch, which itself is less interesting than the 67th and 68th one, 80th and 81st, and so on? + +Yep, according to Zeiler (2012): the older the gradient update, the less interesting it is. + +Adadelta, by consequence, uses a decaying average of some \[latex\]w\[/latex\] previous gradients. If \[latex\]w = 5\[/latex\], for example, only the last 5 gradient updates would be used in the computation, of which the last is most relevant, followed by the second last, and so on (Ruder, 2016). What's more, given the maths behind Adadelta, it's no longer necessary to provide a learning rate a priori (Zeiler, 2012). + +### RMSprop + +An optimizer that was developed in parallel to Adadelta, and actually resembles it quite much, is RMSprop (Ruder, 2016; Hinton, n.d.). In line with Adadelta, RMSprop also divides the learning rate by an exponentially decaying average of some previous gradients. Contrary to Adadelta, however, it is still necessary to configure an initial learning rate when using RMSprop (Hinton, n.d.). + +### Adam + +Now what if we could add _both momentum_ and _adapt our neuron's weights locally_? + +This is what the **Adam optimizer** does, which combines the best of both worlds - i.e., it performs similar to RMSprop and Adadelta, _as well as momentum_ (Ruder, 2016; Kingma & Ba, 2014). + +By consequence, it is one of the most widely used optimizers today - and it is the reason that we use it in many of our blog posts, such as [How to create a CNN classifier with Keras?](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) + +### AdaMax + +In their same work, Kingma & Ba (2014) introduce the **AdaMax** optimizer. 
It is a variant of the Adam optimizer that uses the infinity norm, while the Adam optimizer itself uses the L2 norm for optimization (Kingma & Ba, 2014). + +What is a norm, you might now think - that's what I did, at least, when I was confronted with these mathematical terms. Since I'm not a mathematician, I'll try to explain norms intuitively, based on sources on the internet. + +Essentially, a norm is a _function_ that maps some input to some output. In the case of norms, it assigns a positive length to any vector in some vector space (Wikipedia, 2004). Essentially, norms therefore tell us something about the length of a vector in a certain mathematical space (Karthick, n.d.). Beyond this, norms must also satisfy other properties which ensure that norms do return particular vector lengths given their space (Wikipedia, 2004): + +- If the output of the norm is zero, the input must be the zero vector (a vector without any length). Any positive vectors cannot have zero norms. This ensures that any non-zero vector gets assigned a length in their respective space. +- It is _absolutely homogeneous_. That is, if the vector is multiplied with some integer \[latex\]n\[/latex\] (e.g. \[latex\]n = 2\[/latex\]), the norm is multiplied with \[latex\]n\[/latex\] as well. +- The norm of two vectors that are added together is smaller than the two individual norms added together. + +Multiple norms exist, such as the L0 norm (which essentially tells you something about the number of non-zero elements in a vector; Vishnu, n.d.), the L1 norm (which in any space produces the _taxicab norm_ or a block-style length for the vector), the L2 norm (which produces the shortest possible distance or Euclidian distance), and so on. + +This can be generalized to the _p-norm_, which essentially computes the L-norm for some \[latex\]p\[/latex\] (p = 2 is the Euclidian norm, and so on). + +[![](images/1280px-Manhattan_distance.svg_-1024x1024.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/1280px-Manhattan_distance.svg_.png) + +L1 norm (red, blue and yellow) versus L2 norm (green) / public domain. + +When you let p approach infinity, you'll get what is known as the max norm or infinity norm. Given some vector \[latex\]\\textbf{x} = \\{ x\_1, x\_2, ..., x\_n \\}\[/latex\], the infinity norm gives you the _maximum element in the vector_. + +Now this is exactly the difference between Adam and the Adamax optimizer, which is essentially a generalization of the L2 norm into the L-infinity norm. + +The Adam optimizer updates the gradients inversely proportional to the L2 norm of the "past gradients (...) and current gradient" (Ruder, 2016). In plain English, what gets computed is a _vector_ that is composed of the past gradients and the current gradient (I would like to refer to the Ruder (2016) paper if you wish to get the maths as well!), which by consequence is the smallest vector possible to represent the distance between the two vector edges, and hence represents the value output by the L2 norm. + +When generalizing Adam to the L-infinity norm, and hence Adamax, you'll find that the gradient update is the _maximum between the past gradients and current gradient_. That is, if large weight swings (by virtue of the current gradient) are required, this is possible, but only if they are _really significant_ (given the influence of the past gradients). 
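If the norm story feels abstract, this small NumPy snippet (illustrative only) shows how the p-norm approaches the maximum absolute element as \[latex\]p\[/latex\] grows - the "infinity norm" idea that AdaMax builds on:

```python
import numpy as np

vector = np.array([0.2, -3.0, 1.5, 0.7])

# L1 (taxicab), L2 (Euclidean) and infinity (max) norms of the same vector
print(np.linalg.norm(vector, ord=1))       # 5.4
print(np.linalg.norm(vector, ord=2))       # ~3.43
print(np.linalg.norm(vector, ord=np.inf))  # 3.0, the largest absolute element

# As p grows, the p-norm converges to that infinity norm
for p in [2, 4, 8, 16, 64]:
    print(p, np.linalg.norm(vector, ord=p))
```

With that picture in mind, back to what a max-based update means in practice.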
+ +In practice, this means that data that is traditionally noisy in terms of gradient updates (e.g., datasets with many outliers) can benefit from Adamax over Adam; this obviously for the reason that previous stable gradient updates can counteract one significant gradient swing. Across the internet, you'll see that many examples illustrate perceived benefits of Adamax when using word embeddings (Ghosh, n.d.). Those representations are traditionally sparse and indeed, Adamax could then be of value. However, as illustrated by Ghosh (n.d.), it's always best to just try at first to see whether Adamax is actually better than Adam or even traditional SGD 😄 + +### Nadam + +Now this one is simple: Nadam = Nesterov acceleration + Adam (Dozat, 2016). + +Adam benefits from _momentum_ and _individual parameter updates_, but does so based on the _current gradient_. + +Momentum based SGD also computes the gradient update based on the current gradient, and we can recall from above that Nesterov acceleration ensures that SGD can essentially _look one step ahead_ by computing the estimated position given current momentum. + +Well, Dozat (2016) thought, why can't we incorporate this into Adam? + +The result is **Nadam**: Nesterov momentum based Adam. Now that we've covered Adamax with some abstract norm stuff, life gets easier with Nadam 😊 ...but we aren't there yet 😄 + +* * * + +## Challenges with adaptive optimizers & new ones + +Even though adaptive optimizers have improved traditional SGD in terms of momentum and/or individual parameter optimization, they have their own challenges as well 😄 + +Those challenges are as follows: + +- Adaptive optimizers have been found to generalize poorly compared to traditional (stochastic) gradient descent (Keskar & Socher, 2017). That is, when performance is tested against the test set, models trained with optimizers such as Adam perform worse than when trained with gradient descent like methods. +- What's worse - and actually one step before generalization - neural networks optimized with adaptive optimizers sometimes even fail to converge to the loss optimum (Luo et al., 2019; Reddi et al., 2018). This is likely caused by the moving average used to dynamically set the learning rate, as illustrated earlier in this blog post. While this type of short-term 'memory' was introduced to avoid vanishing gradients due to many gradients in the denominator, now the fix seems to be the problem since larger gradients are forgotten. Perhaps, we need some longer-term memory instead? (Reddi et al., 2018). + +In an attempt to resolve these problems, a range of new optimizers was developed to improve Adam and AdaMax. [AMSGrad](http://www.satyenkale.com/papers/amsgrad.pdf) was one of the first ones, but Luo et al. (2019) argue that it still doesn't fix both issues, and propose AdaBound and AMSBound to fix either Adam or AmsGrad. + +The premise behind those two methods is easy: use an adaptive optimizer during the first epochs, when it performs better than traditional methods, and switch to traditional gradient descent later (Luo et al., 2019). So far, results are promising: it seems that at least parts of the generalization gap can be closed. + +* * * + +## Summary + +In this blog post, we've covered many of the optimizers that are in use in today's neural networks. We covered gradient descent as well as adaptive optimizers, and took a look at some of their extensions to overcome problems with convergence and generalization. 
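For reference, this is roughly how the optimizers discussed above can be selected in Keras (TensorFlow 2.x). The hyperparameter values below are merely illustrative defaults, not recommendations for your dataset:

```python
from tensorflow.keras.optimizers import (
    SGD, Adagrad, Adadelta, RMSprop, Adam, Adamax, Nadam
)

# Classic gradient descent, with (Nesterov) momentum enabled explicitly
sgd_momentum = SGD(learning_rate=0.01, momentum=0.9)
sgd_nesterov = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)

# Adaptive optimizers covered in this post
adagrad  = Adagrad(learning_rate=0.01)
adadelta = Adadelta()  # per Zeiler (2012), removes the need to pick a rate a priori
rmsprop  = RMSprop(learning_rate=0.001)
adam     = Adam(learning_rate=0.001)
adamax   = Adamax(learning_rate=0.002)
nadam    = Nadam(learning_rate=0.002)

# Any of these can then be passed to model.compile(optimizer=..., loss=...)
```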
+ +I hope you've learnt something from this post 😊 If you miss something, or if you have any questions or other remarks, please feel free to leave a comment below! I'll happily answer and/or adapt my post. Thanks! 😄 + +* * * + +## References + +Ruder, S. (2016). An overview of gradient descent optimization algorithms. _arXiv preprint [arXiv:1609.04747](https://arxiv.org/abs/1609.04747)_. + +O'Reilly. (n.d.). Fundamentals of Deep Learning. Retrieved from [https://www.oreilly.com/library/view/fundamentals-of-deep/9781491925607/ch04.html](https://www.oreilly.com/library/view/fundamentals-of-deep/9781491925607/ch04.html) + +Dabbura, I. (2019, September 3). Gradient Descent Algorithm and Its Variants. Retrieved from [https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3](https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3) + +Shaikh, F. (2019, June 24). Introduction to Gradient Descent Algorithm (along with variants) in Machine Learning. Retrieved from [https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/](https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/) + +Qian, N. (1999). On the momentum term in gradient descent learning algorithms. _[Neural networks](https://www.sciencedirect.com/science/article/abs/pii/S0893608098001166)_[,](https://www.sciencedirect.com/science/article/abs/pii/S0893608098001166) _[12](https://www.sciencedirect.com/science/article/abs/pii/S0893608098001166)_[(1), 145-151.](https://www.sciencedirect.com/science/article/abs/pii/S0893608098001166) + +Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence O (1/k^ 2). In _[Doklady AN USSR](https://ci.nii.ac.jp/naid/20001173129/)_ [(Vol. 269, pp. 543-547)](https://ci.nii.ac.jp/naid/20001173129/). + +Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. _[Journal of Machine Learning Research](http://www.jmlr.org/papers/v12/duchi11a.html)_[,](http://www.jmlr.org/papers/v12/duchi11a.html) _[12](http://www.jmlr.org/papers/v12/duchi11a.html)_[(Jul), 2121-2159](http://www.jmlr.org/papers/v12/duchi11a.html). + +Milman, O., & Dali (pseudonym), S. (n.d.). Gradient Descent vs Adagrad vs Momentum in TensorFlow. Retrieved from [https://stackoverflow.com/a/44225502](https://stackoverflow.com/a/44225502) + +Gupta, S. (n.d.). Shashank Gupta's answer to What is the purpose of AdaGrad for stochastic gradient decent neural network training? Retrieved from [https://www.quora.com/What-is-the-purpose-of-AdaGrad-for-stochastic-gradient-decent-neural-network-training/answer/Shashank-Gupta-75](https://www.quora.com/What-is-the-purpose-of-AdaGrad-for-stochastic-gradient-decent-neural-network-training/answer/Shashank-Gupta-75) + +Mallinar, N. (n.d.). Neil Mallinar's answer to What is the purpose of AdaGrad for stochastic gradient decent neural network training? Retrieved from [https://www.quora.com/What-is-the-purpose-of-AdaGrad-for-stochastic-gradient-decent-neural-network-training/answer/Neil-Mallinar](https://www.quora.com/What-is-the-purpose-of-AdaGrad-for-stochastic-gradient-decent-neural-network-training/answer/Neil-Mallinar) + +Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. _arXiv preprint [arXiv:1212.5701](https://arxiv.org/abs/1212.5701)_. + +Milman (2), O. (n.d.). Understanding the mathematics of AdaGrad and AdaDelta. 
Retrieved from [https://datascience.stackexchange.com/a/38319](https://datascience.stackexchange.com/a/38319) + +Hinton, G. (n.d.). _Overview of mini-batch gradient descent_ \[PDF\]. Retrieved from [http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture\_slides\_lec6.pdf](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) + +Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. _[arXiv preprint arXiv:1412.6980](https://arxiv.org/abs/1412.6980)_. + +Vishnu. (n.d.). Vishnu's answer to What is a norm of vector (intuitive definition)? Retrieved from [https://www.quora.com/What-is-a-norm-of-vector-intuitive-definition/answer/Vishnu-55](https://www.quora.com/What-is-a-norm-of-vector-intuitive-definition/answer/Vishnu-55) + +Karthick, N. G. (n.d.). Karthick N.G.'s answer to What is a norm of vector (intuitive definition)? Retrieved from [https://www.quora.com/What-is-a-norm-of-vector-intuitive-definition/answer/Karthick-N-G](https://www.quora.com/What-is-a-norm-of-vector-intuitive-definition/answer/Karthick-N-G) + +Wikipedia. (2004, September 16). Norm (mathematics). Retrieved from [https://en.wikipedia.org/wiki/Norm\_(mathematics)](https://en.wikipedia.org/wiki/Norm_(mathematics)) + +Ghosh, T. (n.d.). Tapa Ghosh's answer to When would you use Adamax over Adam? Retrieved from [https://www.quora.com/When-would-you-use-Adamax-over-Adam/answer/Tapa-Ghosh](https://www.quora.com/When-would-you-use-Adamax-over-Adam/answer/Tapa-Ghosh) + +Dozat, T. (2016). [Incorporating nesterov momentum into Adam.](https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ) + +Keskar, N. S., & Socher, R. (2017). Improving generalization performance by switching from adam to sgd. _[arXiv preprint arXiv:1712.07628](https://arxiv.org/abs/1712.07628)_[.](https://arxiv.org/abs/1712.07628) + +Luo, L., Xiong, Y., Liu, Y., & Sun, X. (2019). Adaptive gradient methods with dynamic bound of learning rate. _[arXiv preprint arXiv:1902.09843](https://arxiv.org/abs/1902.09843)_[.](https://arxiv.org/abs/1902.09843) + +Reddi, S. J., Kale, S., & Kumar, S. (2019). On the convergence of adam and beyond. _[arXiv preprint arXiv:1904.09237](https://arxiv.org/pdf/1904.09237.pdf)_[.](https://arxiv.org/pdf/1904.09237.pdf) diff --git a/feature-scaling-with-python-and-sparse-data.md b/feature-scaling-with-python-and-sparse-data.md new file mode 100644 index 0000000..8b22171 --- /dev/null +++ b/feature-scaling-with-python-and-sparse-data.md @@ -0,0 +1,255 @@ +--- +title: "Feature Scaling with Python and Sparse Data" +date: "2020-11-23" +categories: + - "frameworks" + - "svms" +tags: + - "feature-scaling" + - "normalization" + - "scikit-learn" + - "sparse-data" + - "sparsity" + - "standardization" +--- + +When you are training a Supervised Machine Learning model, scaling your data before you start fitting the model can be a crucial step for training success. In fact, without doing so, there are cases when the model's [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) will behave very strangely. However, not every dataset is made equal. There are cases when [standard approaches to scaling](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) don't work so well. Having a sparse dataset is one such scenario. In this article, we'll find out why and what we can do about it. + +The article is structured as follows. Firstly, we will look at Feature Scaling itself. What is it? Why is it necessary? 
And what are those _standard approaches_ that we have just talked about? Then, we move on to the sparsity characteristic of a dataset. What makes it sparse? Those questions will be answered first before we move to the core of our article. + +This core combines the two topics: _why we can't apply default Feature Scaling techniques when our dataset is sparse_. We will show you what happens and why this is a bad thing. We do however also show you an example of how to handle this, involving Python and the Scikit-learn `MaxAbsScaler`. This way, you can still perform scaling, even when your dataset is sparse. + +Let's take a look! 😎 + +**Update 25/Nov/2020:** fixed issue where wrong `MaxAbsScaler` output was displayed. + +* * * + +\[toc\] + +* * * + +## What is Feature Scaling? + +Suppose that we have the following dataset: + +![](images/gauss0.png) + +It visualizes two variables and two classes of variables. + +We can use both variables to tell us something about the class: the variables closest to \[latex\](X, Y) = (2, 8)\[/latex\] likely belong to the purple-black class, while variables towards the edge belong to the yellow class. + +In other words, we can create a classifier that helps us determine what class a new sample belongs to. When we train a classifier, it will attempt to learn from the variables. Depending on the algorithm, there are various issues that can possibly occur when doing that: + +1. When our classifier involves a _distance_ computation for class computation, e.g. when we use Radial Basis Function networks, our classifier will possibly be distorted by large distances, especially if the distances for one variable are large (e.g. it ranges from \[latex\]\[0, 1000000\]\[/latex\]) and low for another one (e.g. \[latex\]\[0, 1\]\[/latex\]. If not made comparable, it thinks that the distances from the first variable are way more important, because the deltas are larger. +2. When our classifier utilizes _[regularization](https://www.machinecurve.com/index.php/2020/01/26/which-regularizer-do-i-need-for-training-my-neural-network/)_ for reducing model complexity, we can get ourselves into trouble as well, because the [most common regularizers](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/) are based on distance metrics. Here, the same thing goes wrong. +3. Sometimes, especially when we are using traditional Machine Learning algorithms, we don't want too many variables in our feature space - because of the _[curse of dimensionality](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/)._ In those cases, we want to select the variables that contribute most first. Algorithms we can use for this purpose, such as Principal Component Analysis, rely on the _variance_ of the variables for picking the most important ones. + +> _Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from their average value._ +> +> Wikipedia (2001) + +Given the three points mentioned above and the dataset displayed above, we can intuitively say the following: + +**Variance of the vertical variable is larger than the one of the horizontal one.** + +Or is it? + +Can we actually compare those variables? What if we can't? + +Let's check with [standardization](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/). 
Using this technique, with which we can express our variables in terms of their differences in standard deviation from the variable's mean value, we get the following picture: + +![](images/gauss1.png) + +So it seems to be the case that the first variable was not more important than the second one after all! + +The process of standardization is part of a class of techniques called **Feature Scaling** techniques. They involve methods to make variable scales comparable, and involve two mainly used techniques: + +1. **Normalization**, or _min-max normalization_, uses the minimum and maximum values from the dataset to normalize the variables into the \[latex\]\[0, 1\]\[/latex\] or \[latex\]\[a, b\]\[/latex\] ranges depending on your choice. +2. **Standardization**, or _Z-score normalization,_ converts the scale into the deviation in standard intervals from the mean for each variable. We already saw what could happen when applying standardization before. + +If you want to understand Feature Scaling techniques in more detail, it would be good to read [this article first](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) before moving on. + +* * * + +## What is Sparse Data? + +Suppose that this is a sample from the dataset that you are training a Machine Learning model with. You can see that it is five-dimensional; there are five features that can - when desired - jointly be used to generate predictions. + +For example, they can be measurements of e.g. particles, or electrical current, or anything like that. If it's zero, it means that there is no measurement. + +This is what such a table can look like: + +
| Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 |
| --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 7.7 | 0 |
| 1.26 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 |
| 2.12 | 0 | 2.11 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 |
| 0 | 1.28 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1.87 |
+ +This is an example of **sparse data**: + +> A variable with sparse data is one in which a relatively high percentage of the variable's cells do not contain actual data. Such "empty," or NA, values take up storage space in the file. +> +> Oracle (n.d.) + +Having sparse data is common when you are creating Machine Learning models related to time series. As we shall see, Feature Scaling can be quite problematic in that case. + +* * * + +## Feature Scaling with Sparse Data + +Suppose that we take the first feature and use standardization to rescale it: + +``` +import numpy as np +from sklearn.preprocessing import StandardScaler +samples_feature = np.array([0, 0, 1.26, 0, 2.12, 0, 0, 0, 0, 0, 0, 0]).reshape(-1, 1) +scaler = StandardScaler() +scaler.fit(samples_feature) +standardized_dataset = scaler.transform(samples_feature) +print(standardized_dataset) +``` + +This would be the output: + +``` +[[-0.43079317] + [-0.43079317] + [ 1.49630526] + [-0.43079317] + [ 2.81162641] + [-0.43079317] + [-0.43079317] + [-0.43079317] + [-0.43079317] + [-0.43079317] + [-0.43079317] + [-0.43079317]] +``` + +Not good! + +As you can see, all values formerly 0 have turned into \[latex\]\\approx -0.431\[/latex\]. By consequence, the scalars from feature 1 are not sparse anymore - and the entire dataset has become dense! + +If your Machine Learning setting depends on sparse data, e.g. when it needs to fit into memory, applying standardization entirely removes the benefits that would become present in another case (StackOverflow, n.d.). + +### Using the MaxAbsScaler to handle Sparse Data + +Fortunately, there is a way in which Feature Scaling can be applied to Sparse Data. We can do so using Scikit-learn's `MaxAbsScaler`. + +> Scale each feature by its maximum absolute value. This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity. +> +> Scikit-learn (n.d.) + +As we can see, it uses the maximum absolute value to perform the scaling - and it therefore works in a similar way compared to regular min-max normalization, except then that we use absolute values here. The MaxAbsScaler does not center the data, but rather scales the range. This is why it works perfectly with sparse data. In fact, it is the recommenmded + +``` + +import numpy as np +from sklearn.preprocessing import MaxAbsScaler +samples_feature = np.array([0, 0, 1.26, 0, 2.12, 0, 0, 0, 0, 0, 0, 0]).reshape(-1, 1) +scaler = MaxAbsScaler() +scaler.fit(samples_feature) +standardized_dataset = scaler.transform(samples_feature) +print(standardized_dataset) +``` + +...indeed gives the sparsity and scaling that we were looking for: + +``` +[[0. ] + [0. ] + [0.59433962] + [0. ] + [1. ] + [0. ] + [0. ] + [0. ] + [0. ] + [0. ] + [0. ] + [0. ]] +``` + +### Why MaxAbsScaler and not MinMaxScaler for sparse data? + +Great, I thought, but why use the `MaxAbsScaler` - and why cannot we use simple [min-max normalization](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) when we have a sparse dataset? + +Especially because the output would be the same if we applied the `MinMaxScaler`, which is Scikit-learn's implementation of min-max normalization, to the dataset we used above: + +``` +[[0. ] + [0. ] + [0.59433962] + [0. ] + [1. ] + [0. ] + [0. ] + [0. ] + [0. ] + [0. ] + [0. ] + [0. 
]] +``` + +Now, here's the catch - all values in the original input array to the scaler were positive. This means that the minimum value is zero and that, because it scales by minimum and maximum value, all values will be in the range \[latex\]\[0, 1\]\[/latex\]. Since the maximum absolute value here equals the overall maximum value. + +What if we used a dataset where negative values are present? + +``` +samples_feature = np.array([-2.40, -6.13, 0.24, 0, 0, 0, 0, 0, 0, 2.13]).reshape(-1, 1) +``` + +Min-max normalization would produce this: + +``` +[[0.45157385] + [0. ] + [0.77118644] + [0.74213075] + [0.74213075] + [0.74213075] + [0.74213075] + [0.74213075] + [0.74213075] + [1. ]] +``` + +Bye bye sparsity! + +The output of our `MaxAbsScaler` is good, as we would expect: + +``` +[[-0.39151713] + [-1. ] + [ 0.03915171] + [ 0. ] + [ 0. ] + [ 0. ] + [ 0. ] + [ 0. ] + [ 0. ] + [ 0.34747145]] +``` + +So that's why you should prefer absolute-maximum-scaling (using `MaxAbsScaler`) when you are working with a sparse dataset. + +* * * + +## Summary + +In this article, we looked at what to do when you have a sparse dataset and you want to apply Feature Scaling techniques. The reason why we did this is because applying the standard methods for Feature Scaling is problematic in this case, because it destroys the sparsity characteristic of the dataset, meaning that e.g. memory benefits are no longer applicable. + +And with Machine Learning algorithms, which can use a lot of compute capacity from time to time, this can be really problematic. + +Normalization and Standardization are therefore not applicable. However, fortunately, there is a technique that can be applied: scaling by means of the maximum absolute value from the dataset. In this case, we create a scaled dataset where sparsity is preserved. We saw that it works by means of a Python example using Scikit-learn's `MaxAbsScaler`. In the example, we also saw why regular max-min normalization doesn't work and why we really need the `MaxAbsScaler`. + +I hope that you have learned something from today's article! If you did, please feel free to leave a message in the comments section below 💬 Please do the same if you have any questions or other remarks. Regardless, thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2001, June 30). _Variance_. Wikipedia, the free encyclopedia. Retrieved November 18, 2020, from [https://en.wikipedia.org/wiki/Variance](https://en.wikipedia.org/wiki/Variance) + +Oracle. (n.d.). _Defining data objects, 6 of 9_. Moved. [https://docs.oracle.com/cd/A91202\_01/901\_doc/olap.901/a86720/esdatao6.htm](https://docs.oracle.com/cd/A91202_01/901_doc/olap.901/a86720/esdatao6.htm) + +StackOverflow. (n.d.). _Features scaling and mean normalization in a sparse matrix_. Stack Overflow. [https://stackoverflow.com/questions/21875518/features-scaling-and-mean-normalization-in-a-sparse-matrix](https://stackoverflow.com/questions/21875518/features-scaling-and-mean-normalization-in-a-sparse-matrix) + +Scikit-learn. (n.d.). _Sklearn.preprocessing.MaxAbsScaler — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. 
Retrieved November 23, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler) diff --git a/finding-optimal-learning-rates-with-the-learning-rate-range-test.md b/finding-optimal-learning-rates-with-the-learning-rate-range-test.md new file mode 100644 index 0000000..578aad3 --- /dev/null +++ b/finding-optimal-learning-rates-with-the-learning-rate-range-test.md @@ -0,0 +1,591 @@ +--- +title: "Finding optimal learning rates with the Learning Rate Range Test" +date: "2020-02-20" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "keras" + - "learning-rate" + - "learning-rate-range-test" + - "machine-learning" + - "neural-network" + - "neural-networks" +--- + +Learning Rates are important when configuring a neural network. But choosing one is not easy, as there is no single best learning rate due to its dependency on your dataset. + +Now, how to choose one? And should it be a fixed one or should I use learning rate decay? If I know how I'll choose one, how to do so objectively? They're all interesting questions - and we'll answer each of them in this blog post. + +Today, we'll look at multiple things. In our blog post, we'll... + +1. Introduce you to the concept of a learning rate by taking a look at optimizing supervised machine learning models at a high level. +2. Show you why fixed learning rates are almost never a good idea, and how learning rate decay may help you. +3. Show you why learning rate decay suffers from the same issue as fixed learning rates, i.e. that humans still have to make a guess about where to start. +4. Introduce the Learning Rate Range Test based on academic works and other Medium blogs, which allows you to select the optimal learning rate for your model empirically and easily. +5. Provide Python code that implements the Learning Rate Range Test for a series of tests, using the Keras deep learning framework and the `keras-lr-finder` package. + +Are you ready? + +Let's go! 😎 + +* * * + +\[toc\] + +* * * + +## On optimizing supervised machine learning models + +Let's take a look at the [high-level supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process): + +![](images/High-level-training-process-1024x973.jpg) + +Training such models goes through a simple, sequential and cyclical process: + +1. The _features_, i.e. the inputs, predictors or independent variables, are fed to the machine learning model. The model will generate predictions for the data, e.g. the class it thinks that the features belong to. +2. These predictions are compared with the _targets_, which represent the ground truth for the features. That is, they are the _actual_ classes in the classification scenario above. +3. The difference between the predictions and the actual targets can be captured in the loss value. Depending on your machine learning problem, [you can choose from a wide range of loss functions](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#loss-functions). +4. Based on the loss value, the model computes the best way of making it better - i.e., it computes gradients using backpropagation. +5. 
Based on these gradients, an optimizer (such as [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or [an adaptive optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/)) will adapt the model accordingly. +6. The process starts again. Likely, and hopefully, the model performs slightly better this time. + +Once you're happy with the end results, you stop the machine learning process, and you have a model that can hopefully be used in production :) + +Now, if we wish to understand the concept of the Learning Rate Range Test in more detail, we must take a look at model optimizers. In particular, we should study the concept of a learning rate. + +### Configuration of model optimizers: learning rates + +When specifying an optimizer, it's possible to configure the learning rate most of the times. For example, the Adam optimizer in Keras (Keras, n.d.): + +``` +keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False) +``` + +Indeed, here, the learning rate can be set with `learning_rate` - and it is set to 0.001 by default. + +Now, what is a learning rate? If our goal is to study the _Learning Rate_ Range Test, it's critical to understand the concept of a learning rate, isn't it? :-P + +Let's go back to step 4 of the machine learning process outlined above: computing gradients with backpropagation. + +![](images/adult-adventure-backpack-287240-1024x767.jpg) + +I always compare optimizing a model with walking down a mountain. + +The mountain represents the "loss landscape", or how the loss value changes with respect to the particular model state, and your goal is to walk to the valley, where loss is lowest. + +This analogy can be used to understand what backpropagation does and why you need learning rates to control it. + +Essentially, I like to see backpropagation a "step-computer". While you walk down the mountain, you obviously set steps towards your goal. However, you don't want to miss out on possible shortcuts towards the valley. This requires you to take smaller steps. + +Now this is why learning rates are useful: while backpropagation will likely compute relatively large steps, you wish to slow down your descent to allow yourself to look around more thoroughly. Perhaps, you'll indeed find that path that brings you to the valley in a shorter amount of time! + +So, while backpropagation is a "step-computer", the learning rate will allow you to "control" the size of your steps. While you'll take longer to arrive, you might do so more efficiently after all. Especially when the valley is very narrow, you might no longer overstep it because your steps are too large. + +This analogy also perfectly explains why the learning rate in the Adam example above was set to `learning_rate = 0.001`: while it uses the _computed gradient_ for optimization, it makes it 1.000 times smaller first, before using it to change the model weights with the optimizer. + +### Overfitting and underfitting - checking your validation loss + +Let's now build in a small intermezzo: the concepts of **overfitting** and **underfitting**, and checking for them by using validation and test loss. + +Often, before you train a model with all your data, you'll first evaluate your choice with [hold-out techniques or K-fold Cross Validation](https://www.machinecurve.com/index.php/2020/02/18/how-to-use-k-fold-cross-validation-with-keras/). 
These generate a dataset split between training data and testing data, which you'll need in order to decide when the model is good enough.

And "good enough" is the precise balance between _a model that can still improve_ and _a model that has adapted too much to its training data._

In the first case, which is called **underfitting**, your model can still improve in a predictive sense. By feeding more samples, and optimizing further, it's likely to improve and show better performance over time.

However, when you do so for too long, the model will **overfit** - or adapt too much to your dataset and its idiosyncrasies. As your dataset is a sample, which is drawn from the true population you wish to train for, you face differences between the sample and population means and variances - by definition. If your model is over-adapted to your training set, it's likely that these differences get in the way when you want to use it for new data from the population. And likely, this will occur when you use your model in production.

You'll therefore always have to strike a balance between the model's predictive performance and the model's ability to generalize. This is a very intricate balance that can often only be found in a small interval of your training iterations.

Fortunately, it's possible to detect overfitting using a plot of your loss value (Smith, 2018). Always take your validation or test loss for this. Use your test loss if you don't split your _training_ data into true training and validation data (which is the case if you're simply evaluating models with e.g. K-fold Cross Validation). Use validation loss if you evaluate models and train the final one at once (requiring training, validation and testing data). In both cases, you ensure that you use data that the model has not seen before, avoiding that you - as a student - mark your own homework ;)

This is especially useful when [you are using e.g. TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/), where you can inspect progress in real-time.

However, it's also possible [to generate a plot when your training process finishes](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/). Such diagrams make things crisply clear:

[![](images/UnderOver.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/UnderOver.png)

In the first part of the training process, the model's predictive performance is clearly improving. Hence, it is _underfit_ during that stage - and additional epochs can improve model performance.

However, after about the 20th epoch, validation loss starts to increase again, while (you must assume this) _training_ loss still decreases. This means that while the model gets better and better at predicting the training data, it is getting worse at predicting the validation data. Hence, after the 20th epoch, _overfitting_ starts to occur.

While you can reduce the impact of overfitting or delay it with [regularizers](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/) and [Dropout](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/), it's clear that for this model and corresponding configuration, the optimum is achieved at the 20th epoch.
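As a quick illustration of how such a diagram can be produced after training, here is a minimal, self-contained sketch. It uses a tiny toy model on random data purely to obtain a Keras `History` object - it is not the model behind the plot above:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import matplotlib.pyplot as plt

# Tiny synthetic binary classification problem, just to obtain a History object
X = np.random.rand(1000, 8)
y = (X.sum(axis=1) > 4).astype(int)

model = Sequential([
    Dense(16, activation='relu', input_shape=(8,)),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# validation_split holds out part of the data so we get a validation loss per epoch
history = model.fit(X, y, epochs=30, validation_split=0.2, verbose=0)

# Plot training vs. validation loss; the point where they diverge hints at overfitting
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
```

In your own plot, the divergence point will of course lie elsewhere.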
What's important to understand here is that this optimum emerges _given the model architecture and configuration!_ If you changed the architecture, or configured it differently, you might e.g. delay overfitting or achieve even lower validation loss minimums. That's why training neural networks is more of an art than a science :) + +As choosing a learning rate setting impacts the loss significantly, it's good that it's clear what overfitting and underfitting are, and how you can spot them on a plot. Let's now take a look at _choosing a learning rate._ + +* * * + +## Choosing a learning rate: static and decaying ones + +Which learning rate to choose? What options do I have? + +Good questions. + +Let's now take a look at two ways of setting a learning rate: + +- Choosing one static learning rate for the entire training process. +- Choosing a fixed start rate, which you'll decay over time with a decay scheme. + +### Why static learning rates are likely suboptimal + +Let's take a look at the Adam optimizer implementation for Keras again (Keras, n.d.): + +``` +keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False) +``` + +Here, the learning rate is set as a _constant_. It's a fixed value which is used in every epoch. + +Unfortunately, this doesn't produce an optimal learning process. + +Let's take a look at two other models that we trained [for another blog post](https://www.machinecurve.com/index.php/2020/01/31/reducing-trainable-parameters-with-a-dense-free-convnet-classifier/): + +[![](images/gap_loss.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/gap_loss.png) + +The model in orange clearly produces a low loss rapidly, and much faster than the model in blue. However, we can also observe some overfitting to occur after approximately the 10th epoch. Not so weird, given the fact that we trained for ten times longer than strictly necessary. + +Now, the rapid descent of the loss value and the increasingly slower pace of falling down are typical for machine learning settings which use optimizers like [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or [adaptive ones](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/). + +Why is this the case? And why is this important for a learning rate? + +Let's dig a little bit deeper. + +#### Model initialization + +Supervised machine learning models work with _model weights_: on initialization, models are configured to accept certain input data, and they create "weight vectors" in which they can store the numeric patterns they observe. Eventually, they multiply these vectors with the input vectors during training and production usage. + +Now, when you start training, it's often best practice to initialize your weight vectors randomly, or by using [approaches adapted to your model](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/). + +For the forward pass (step 1 of the 6 steps outlined at the start), you can imagine that multiplying your input data with random weights will produce very poor results. Indeed, loss is likely high during the first few epochs. However, in this stage, it's also possible to make large steps towards accurate weights and hence adequate loss values. That's why you see loss descend so rapidly during the first few iterations of a supervised ML training process: it's looking for a global loss minimum very fast. 
+ +However, as you walk down that "loss mountain", the number of possible steps that can be taken goes down - by function of the number of steps you already set. This is also true for loss landscapes in neural networks: once you get close to the global loss minimum (should it exist), then room for improvement gets tighter and tighter. For this reason, loss balances out (or even gets worse! - i.e. overfitting) over time. + +![](images/weight_histogram_2.jpg) + +Weight histograms for one layer across 5 epochs; clearly, the weights change a bit. + +#### The issue with static learning rates + +This rationale for as to why loss values initially decrease substantially while balancing out later on is a substantial issue for our learning rate: + +**We don't want it to be static.** + +As we recall, the learning rate essentially tells the model _how much of the gradient_ to use during optimization. Remember that with `learning_rate = 0.001` only 1/1000th of the computed gradient is used. + +For the latter part of the training process, this would be good, as there's no point in setting large steps. Instead, here, you want to set small ones in order to truly find the global minimum, without overshooting it every time. You might even want to use lower learning rate values here. + +However, for the first part of the training process, such low learning rates are problematic. Here, you would actually _benefit_ from large learning rates, for the simple reason that you can afford setting large steps during the first few epochs. Having a small fixed learning rate will thus unnecessarily slow down your learning process or make finding a global minimum in time even impossible! + +Hence, a static learning rate is - in my opinion - not really a good idea when training a neural network. + +Now, of course, you can choose to use a static learning rate that lies somewhere between the "large" and "small" ones. However, is this really a solution, especially when better solutions are available? + +Let's now introduce the concept of a _decaying learning rate_. Eventually, we'll now also begin to discover why the Learning Rate Range Test can be useful. + +### Decaying learning rates + +Instead of a fixed learning rate, wouldn't it be good if we could reduce it over time? + +That is, apply [learning rate decay](https://www.machinecurve.com/index.php/2019/11/11/problems-with-fixed-and-decaying-learning-rates/#what-is-learning-rate-decay)? + +Indeed, this seems to be an approach to reducing the negative impact of a fixed learning rate. By using a so-called "decay scheme", which decides how the learning rate decays over time, you can exhibit control over the learning rate for an arbitrary epoch. + +There are many decay schemes available, and here are four examples: + +- [![](images/linear_decay.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/linear_decay.png) + +- [![](images/step_decay.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/step_decay.png) + +- [![](images/exponential_decay.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/exponential_decay.png) + +- [![](images/time_decay.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/time_decay.png) + + +Linear decay allows you to start with a large learning rate, decay it pretty rapidly, and then keeping it balanced at a static one. Together with step decay, which keeps your learning rate fixed for a set number of epochs, these learning rates are not smooth. 
+ +It's also possible to use exponential and time decay, which _are_ in fact smooth. With exponential decay, your learning rate decays rapidly at first, and slower over time - but smoothly. Time decay is like a diesel engine: it's a slow start, with great performance once the car has velocity, balancing out when its max is reached. + +### What start rate to choose? What decay to choose? - questions about architecture and hyperparameters + +While each has their benefits, there is a wide range of new questions: + +- When I use learning rate decay, which learning rate should I start with? +- Given some decay scheme, how fast should my decay happen? For example, this means controlling the exponential decay, which can also happen at a slower pace than visualized above. +- Can I achieve better results when adapting the batch size of my training feedforward process? What start rate and decay scheme settings do I need then? +- What happens when I adapt my architecture? How do I need to configure my model then? +- And so on. + +These are all important questions and the list is going on and on. It's impractical if not impossible to train your whole architecture every time such a question pops up, to compare. Neither is performing a grid search operation, which is expensive (Smith, 2018). However, especially with respect to the first two questions, there is another way: the Learning Rate Range Test (Smith, 2018). + +Let's take a look at what it is and what it does! :) + +* * * + +## Learning Rate Range Test + +With the **Learning Rate Range Test**, it's possible to find an estimate of the optimal learning rate quite quickly and accurately. Smith (2018) gives a perfect introduction to the topic: + +> It is relatively straight-forward: in a test run, one starts with a very small learning rate, for which one runs the model and computes the loss on the validation data. One does this iteratively, while increasing the learning rate exponentially in parallel. One can then plot their findings into a diagram representing loss at the y axis and the learning rate at the x axis. The x value representing the lowest y value, i.e. the lowest loss, represents the optimal learning rate for the training data. + +However, he also argues that... + +> The learning rate at this extrema is the largest value that can be used as the learning rate for the maximum bound with cyclical learning rates but a smaller value will be necessary when choosing a constant learning rate or the network will not begin to converge. + +Therefore, we'll simply pick a value just a tiny bit to the left of the loss minimum. + +One such Learning Rate Range Test could, theoretically, yield the following plot: + +- [![](images/sgd_only_v-1024x537.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/sgd_only_v.png) + +- [![](images/sgd_only-1024x537.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/sgd_only.png) + + +It's a real plot generated with a ConvNet tested for MNIST data. + +We see the fastest learning rate descent at \[latex\]\\approx 10^{-1.95}\[/latex\]: in the first plot, the descent is steepest there. The second plot confirms this as it displays the lowest _loss delta_, i.e. where "negative change in loss value" (= improvement) was highest given change of learning rate. By consequence, we would choose this learning rate :) + +* * * + +## Implementing the Learning Rate Range Test with Keras + +Now that we know what the LR Range Test is, it's time to implement it with Keras. 
Fortunately, that's not a difficult thing to do! :D + +Let's take a look. + +### Installing keras-lr-finder and other dependencies + +We need a few dependencies if we wish to run this example successfully. Before you continue, make sure that you have them installed. The dependencies are as follows: + +- Keras, and preferably Keras using TensorFlow 2.0+ i.e. the integrated one. +- Matplotlib. +- The **keras-lr-finder** package, which is an implementation of the Learning Rate Range Test for Keras. **Please select the variant for your TensorFlow version below**. Clone the GitHub repository to some folder, open a command prompt, `cd` to the particular folder and run `python setup.py install`. It should install immediately. + - For _old_ Keras i.e. 1.x, you can use the original repository: [https://github.com/surmenok/keras\_lr\_finder](https://github.com/surmenok/keras_lr_finder) + - For _new_ Keras i.e. TensorFlow 2.x based Keras, you can use the changes I made: [https://github.com/christianversloot/keras\_lr\_finder](https://github.com/christianversloot/keras_lr_finder) + +Now, keep your command prompt open, and generate a new file, e.g. `touch lr-finder.py`. Open this file in a code editor, and you're ready to code 😎 + +### Model imports + +The first thing I always do is to import everything we need: + +- The [MNIST dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#mnist-database-of-handwritten-digits), which we'll use today; +- The Sequential API, which allows us to stack layers nicely; +- The Dense, Flatten, Conv2D and MaxPooling2D layers, as we'll find optimal learning rates for a ConvNet that classifies the MNIST data; +- [Sparse categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/), which is our loss function for today; +- The [SGD](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) and [Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) optimizers, for which we'll compute the optimum learning rates. + +``` +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import SGD, Adam +import matplotlib.pyplot as plt +from keras_lr_finder import LRFinder +``` + +### Test configuration + +Next, we set the configuration for our test scenario. We'll use batches of 250 samples for testing. Our images are 28 x 28 pixels and are one-channeled, as the MNIST dataset is grayscale. The number of classes equals 10, while we'll test for 5 epochs (unless one of the abort conditions, such as a loss value that goes out of the roof, occurs before then). Our estimated start learning rate is \[latex\]10^{-4}\[/latex\] while we stop at \[latex\]10^0\[/latex\]. When generating a plot of our test results, we use a moving average of 20 loss values for smoothing the line, to make our results more interpretable. 
+ +``` +# Model configuration +batch_size = 250 +img_width, img_height, img_num_channels = 28, 28, 1 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 5 +start_lr = 0.0001 +end_lr = 1 +moving_average = 20 +``` + +### Data import and preparation + +The next things we do are related to the dataset: + +- First, we import the MNIST data; +- Then, we determine the `input_shape` that will be used by Keras; +- This is followed by casting the data into `float32` format (presumably speeding up training, especially when using GPU based TensorFlow) and reshaping the data into the `input_shape` we specified. +- Finally, we scale the data. + +``` +# Load MNIST data +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Cast numbers to float32 format and reshape data +input_train = input_train.astype('float32').reshape(input_train.shape[0], img_width, img_height, img_num_channels) +input_test = input_test.astype('float32').reshape(input_test.shape[0], img_width, img_height, img_num_channels) + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 +``` + +### Model architecture + +Then, we specify the model architecture. It's not the most important thing for today, but here it is. It's a simple [ConvNet](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) using [Max Pooling](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/): + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) +``` + +### Learning Rate Range Tests to be performed + +Now, here's the interesting part. We specified the model architecture in our previous step, so we can now decide about which tests we want to perform. For the sake of simplicity, we specify only two, but you can test as much as you'd like: + +``` +# Determine tests you want to perform +tests = [ + (SGD(), 'SGD optimizer'), + (Adam(), 'Adam optimizer'), +] +``` + +As you can see, the tests that we will perform today will find the best learning rate for the traditional SGD optimizer, and also for the Adam one. What's great is that by plotting them together (that's what we will do later), we can even compare the performance of the optimizer given this architecture. We can thus also answer the question _Which optimizer produces lowest loss?_ + +### Performing the tests + +Now that we have specified the tests, let's perform them! 😎 In preparation for this, let's specify three 'containers' for data - one for the learning rates per step, one for the corresponding losses per step, one for the loss changes (a.k.a. deltas) and one for the labels of the tests. + +``` +# Set containers for tests +test_learning_rates = [] +test_losses = [] +test_loss_changes = [] +labels = [] +``` + +Then, we perform the test. For every test, we specify the `test_optimizer` to be used as well as the `label`, and compile the model following that particular optimizer. 
This is followed by instantiating the Learning Rate Range Test through `LRFinder`, and performing the actual test using the training data and the configuration we specified above. + +Once the test has finished - this may either be the case because we have completed all epochs, because loss becomes `NaN` or because loss becomes too large - we take the `learning_rates`, the `losses` and `loss_changes` and store them in containers. However, before storing the loss changes, we smooth them using the `moving_average` that we defined before. Credits for the smoothing part of the code go to the [keras-lr-finder package](https://github.com/surmenok/keras_lr_finder/blob/master/keras_lr_finder/lr_finder.py). + +After smoothing, we store the learning rates per step, as well as the test losses and the labels, to the containers we specified before. This iteration will ensure that all tests are performed in line with how we want them to perform. + +``` +# Perform each test +for test_optimizer, label in tests: + + # Compile the model + model.compile(loss=loss_function, + optimizer=test_optimizer, + metrics=['accuracy']) + + # Instantiate the Learning Rate Range Test / LR Finder + lr_finder = LRFinder(model) + + # Perform the Learning Rate Range Test + outputs = lr_finder.find(input_train, target_train, start_lr=start_lr, end_lr=end_lr, batch_size=batch_size, epochs=no_epochs) + + # Get values + learning_rates = lr_finder.lrs + losses = lr_finder.losses + loss_changes = [] + + # Compute smoothed loss changes + # Inspired by Keras LR Finder: https://github.com/surmenok/keras_lr_finder/blob/master/keras_lr_finder/lr_finder.py + for i in range(moving_average, len(learning_rates)): + loss_changes.append((losses[i] - losses[i - moving_average]) / moving_average) + + # Append values to container + test_learning_rates.append(learning_rates) + test_losses.append(losses) + test_loss_changes.append(loss_changes) + labels.append(label) +``` + +### Visualizing the outcomes + +Now that we have the outcomes, we can visualize them! :) We'll use Matplotlib for doing so, and we'll create two plots: one for the loss deltas and one for the actual loss values. + +For each, the first thing we do is iterate over the containers, and generate a plot for each test with `plt.plot`. In our case, this generates two plots, both on top of each other. This is followed by plot configuration - for example, we set the x axis to logarithmic scale, and finally by a popup that visualizes the end result. + +``` +# Generate plot for Loss Deltas +for i in range(0, len(test_learning_rates)): + plt.plot(test_learning_rates[i][moving_average:], test_loss_changes[i], label=labels[i]) +plt.xscale('log') +plt.legend(loc='upper left') +plt.ylabel('loss delta') +plt.xlabel('learning rate (log scale)') +plt.title('Results for Learning Rate Range Test / Loss Deltas for Learning Rate') +plt.show() + +# Generate plot for Loss Values +for i in range(0, len(test_learning_rates)): + plt.plot(test_learning_rates[i], test_losses[i], label=labels[i]) +plt.xscale('log') +plt.legend(loc='upper left') +plt.ylabel('loss') +plt.xlabel('learning rate (log scale)') +plt.title('Results for Learning Rate Range Test / Loss Values for Learning Rate') +plt.show() +``` + +### Interpreting the results + +All right, you should now have a model that runs! :) + +Open up that command prompt again, `cd` to the folder where your `.py` file is located (if you're not already there :) ), and run e.g. `python lr-finder.py`. 
You should see the epochs begin, and once they finish, two plots similar to these ones should pop up sequentially: + +- [![](images/lrt_losses-1024x537.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/lrt_losses.png) + +- [![](images/lrt_loss_deltas-1024x537.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/lrt_loss_deltas.png) + + +_Note that yours won't be exactly the same due to the fact that machine learning models are stochastic, e.g. due to random or pseudo-random initialization of your weight vectors during model initialization._ + +The results are very clear: for this training setting, Adam performs substantially better. We can observe that it reaches a lower loss value compared to SGD (first plot), and that it does so in a much shorter time (second plot - the negative delta occurs at a lower learning rate). Likely, this is how we benefit from the fact that Adam performs [local parameter updates](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam), whereas SGD does not. If we had to choose between these two optimizers, it would clearly be Adam with a learning rate of \[latex\]\\approx 10^{-3.95}\[/latex\]. + +### Full code + +If you wish, it's also possible to obtain the full model code at once :) + +Here you go: + +``` +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import SGD, Adam +import matplotlib.pyplot as plt +from keras_lr_finder import LRFinder + +# Model configuration +batch_size = 250 +img_width, img_height, img_num_channels = 28, 28, 1 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 5 +start_lr = 0.0001 +end_lr = 1 +moving_average = 20 + +# Load MNIST data +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Cast numbers to float32 format and reshape data +input_train = input_train.astype('float32').reshape(input_train.shape[0], img_width, img_height, img_num_channels) +input_test = input_test.astype('float32').reshape(input_test.shape[0], img_width, img_height, img_num_channels) + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Determine tests you want to perform +tests = [ + (SGD(), 'SGD optimizer'), + (Adam(), 'Adam optimizer'), +] + +# Set containers for tests +test_learning_rates = [] +test_losses = [] +test_loss_changes = [] +labels = [] + +# Perform each test +for test_optimizer, label in tests: + + # Compile the model + model.compile(loss=loss_function, + optimizer=test_optimizer, + metrics=['accuracy']) + + # Instantiate the Learning Rate Range Test / LR Finder + lr_finder = LRFinder(model) + + # Perform the Learning Rate Range Test + outputs = lr_finder.find(input_train, target_train, start_lr=start_lr, end_lr=end_lr, batch_size=batch_size, 
epochs=no_epochs) + + # Get values + learning_rates = lr_finder.lrs + losses = lr_finder.losses + loss_changes = [] + + # Compute smoothed loss changes + # Inspired by Keras LR Finder: https://github.com/surmenok/keras_lr_finder/blob/master/keras_lr_finder/lr_finder.py + for i in range(moving_average, len(learning_rates)): + loss_changes.append((losses[i] - losses[i - moving_average]) / moving_average) + + # Append values to container + test_learning_rates.append(learning_rates) + test_losses.append(losses) + test_loss_changes.append(loss_changes) + labels.append(label) + +# Generate plot for Loss Deltas +for i in range(0, len(test_learning_rates)): + plt.plot(test_learning_rates[i][moving_average:], test_loss_changes[i], label=labels[i]) +plt.xscale('log') +plt.legend(loc='upper left') +plt.ylabel('loss delta') +plt.xlabel('learning rate (log scale)') +plt.title('Results for Learning Rate Range Test / Loss Deltas for Learning Rate') +plt.show() + +# Generate plot for Loss Values +for i in range(0, len(test_learning_rates)): + plt.plot(test_learning_rates[i], test_losses[i], label=labels[i]) +plt.xscale('log') +plt.legend(loc='upper left') +plt.ylabel('loss') +plt.xlabel('learning rate (log scale)') +plt.title('Results for Learning Rate Range Test / Loss Values for Learning Rate') +plt.show() +``` + +* * * + +## Summary + +In this blog post, we looked at the Learning Rate Range Test for finding the best learning rate for your neural network - empirically. + +This was done by looking at the concept of a learning rate before moving to Python code. What is a learning rate? Why is it useful? And how to configure it objectively? Do I need a fixed or a decaying learning rate? Those are all questions that we answered in the first part of this blog post. + +In the second part, we introduced the Learning Rate Range Test: a method based on Smith (2018) that allows us to empirically determine the best learning rate for the model and its `compile` settings that you specify. It even allows us to compare multiple settings at once, and which learning rate is best! + +In the third and final part, we used the `keras-lr-finder` package to implement the Learning Rate Range Test. With blocks of Python code, we explained each step of doing so - and why we set that particular step. This should allow you to use the Learning Rate Range Test in your own projects too. + +I hope that this blog was useful to you and that you've learnt new things! :) If you did, I'd be very honored if you left a comment in the comments section below 💬 Please do the same if you have questions, other remarks or if you think that I made a mistake. I'll happily improve and mention your feedback. + +Thanks for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Smith, L. N. (2018). [A disciplined approach to neural network hyper-parameters: Part 1--learning rate, batch size, momentum, and weight decay](https://arxiv.org/abs/1803.09820). _arXiv preprint arXiv:1803.09820_. + +Keras. (n.d.). Optimizers. Retrieved from [https://keras.io/optimizers/](https://keras.io/optimizers/) + +Surmenok, P. (2018, July 14). Estimating an Optimal Learning Rate For a Deep Neural Network. 
Retrieved from [https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0](https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0) diff --git a/from-vanilla-rnns-to-transformers-a-history-of-seq2seq-learning.md b/from-vanilla-rnns-to-transformers-a-history-of-seq2seq-learning.md new file mode 100644 index 0000000..0d38ecf --- /dev/null +++ b/from-vanilla-rnns-to-transformers-a-history-of-seq2seq-learning.md @@ -0,0 +1,221 @@ +--- +title: "From vanilla RNNs to Transformers: a history of Seq2Seq learning" +date: "2020-12-21" +categories: + - "deep-learning" + - "svms" +tags: + - "attention" + - "backpropagation" + - "deep-learning" + - "gated-recurrent-unit" + - "gru" + - "long-short-term-memory" + - "lstm" + - "machine-learning" + - "named-entity-recognition" + - "natural-language-processing" + - "recurrent-neural-networks" + - "sentiment-analysis" + - "text-summarization" + - "text-translation" + - "transformer" + - "vanilla-rnn" +--- + +Machine Learning has been playing an important role in Natural Language Processing over the past few years. Machine summarization, machine translation, sentiment analysis, you name it - ML has been used for it. In particular, using a technique called sequence-to-sequence learning (Seq2seq), the goal is to transform one sequence into another by learning an intermediate representation that can perform the transformation. In this article, we'll be looking at a few of the major approaches and where state-of-the-art is today. + +It is structured as follows. First of all, we'll be looking at the concept of sequence-to-sequence learning in Natural Language Processing. What is it? How does it work? Those are the questions that we will be answering. Subsequently, we're going to look at how those techniques have evolved. We're going to start with _vanilla RNNs_, which are simple implementations of recurrent neural networks. This is followed by Long-Short Term Memory networks and Gated Recurrent Units, as well as the concept of attention. Finally, we're going to take a look at Transformers. + +The article can be a starting point for those who wish to understand the relationships between the major components of Machine Learning in NLP: RNNs, [LSTMs](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/) and GRUs, as well as attention and transformers. + +Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## Sequence-to-Sequence learning in Natural Language Processing + +Natural Language Processing is a wide field and many techniques and algorithms have been used for interpreting text. If, however, we look at Machine Learning approaches closely, many of them have focused on **Sequence-to-Sequence learning**, or Seq2Seq. Here's how Wikipedia describes it: + +> Seq2seq turns one sequence into another sequence. It does so by use of a recurrent neural network (RNN) or more often [LSTM](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/) or GRU to avoid the problem of vanishing gradient. The context for each item is the output from the previous step. The primary components are one encoder and one decoder network. The encoder turns each item into a corresponding hidden vector containing the item and its context. The decoder reverses the process, turning the vector into an output item, using the previous output as the input context. 
+> 
+> Wikipedia (2019)
+
+I can imagine that you have a few questions now and that it's still difficult to grasp the concept altogether.
+
+For this reason, let's take a look at Seq2seq in more detail.
+
+### What is a sequence?
+
+Suppose that we have the following phrase: _I am going to work today._
+
+This phrase can be expressed as a sequence of the individual words: `[I, am, going, to, work, today]`.
+
+In French, we can say _Je vais travailler aujourd'hui_, and hence the target sequence would be `[Je, vais, travailler, aujourd'hui]`.
+
+The goal of sequence-to-sequence learning is to learn a model that can map the first sequence into the second. Of course, only if our goal is translation. In the case of summarization, the goal would be to transform (really) long sequences into (really) short ones; summaries.
+
+### Recurrent Neural Networks and Encoder-Decoder Models
+
+There are many ways to perform sequence-to-sequence learning. We'll cover the main techniques in the next section, but here are the two main branches that have been used for this purpose:
+
+- **Classic Recurrent Neural Networks**, where the goal is to learn a _model that processes the sequence item by item_, passing a hidden representation of the items processed so far as input together with the next item from the sequence. In other words, each item is processed using (part of) the context provided by the words that have been processed previously. In a real-life analogy, this looks like a human translator, e.g. one who can translate German into French directly.
+- **Encoder-decoder models**, where the goal is to learn an _encoder_ that can process the input sequence into a hidden representation, and a _decoder_ which maps the representation into an output sequence. Following the analogy, here, the hidden representation represents an imaginary language; the encoder is a translator who can translate German into this language; the decoder is a translator who translates the imaginary language into French.
+
+As we shall see now, encoder-decoder models - specifically Transformers - have benefits over Recurrent Neural Networks when it comes to learning mappings between sequences.
+
+* * *
+
+## Evolution of techniques for Sequence-to-Sequence Learning
+
+Now that we understand what Sequence-to-Sequence learning is, and now that we know about the two categories of models that are primarily being used in Seq2seq, we're going to cover them in more detail. First of all, we're going to cover the vanilla variant of Recurrent Neural Networks. This is followed by introducing LSTMs and GRUs, the attention mechanism, and finally the basis of today's state-of-the-art approaches: the Transformer architecture.
+
+Let's take a look!
+
+### Vanilla RNNs: simple Recurrent Neural Networks
+
+Previous approaches in Natural Language Processing were sequential in nature. In other words, if we consider the phrase \[latex\]\\text{I am doing great}\[/latex\], transformed into a set of individual components \[latex\]{I, am, doing, great}\[/latex\], previous approaches would have to cover each word individually _by means of letting it flow through the entire model_.
+
+Let's take a look at this recurrent structure. Each input flows through \[latex\]h\[/latex\] and then becomes an output (for example, if we translate from Dutch to English, the word \[latex\]\\text{ik}\[/latex\] would become \[latex\]\\text{I}\[/latex\]), which is what we know from regular neural networks. 
However, it also connects back to itself, which means that upon the entry of new input, the context from the previous forward passes will be used to provide a better prediction. In other words, when predicting \[latex\]\\text{am}\[/latex\], we're using the context provided by the \[latex\]\\text{ik} \\rightarrow \\text{I}\[/latex\] translation. + +We call this a **Vanilla Recurrent Neural Network.** Visually, this looks as follows: + +![](images/2560px-Recurrent_neural_network_unfold.svg_.png) + +A fully recurrent network. Created by [fdeloche](https://commons.wikimedia.org/wiki/User:Ixnay) at [Wikipedia](https://en.wikipedia.org/wiki/Recurrent_neural_network#/media/File:Recurrent_neural_network_unfold.svg), licensed as [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0). No changes were made. + +Due to [vanishing gradients](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) in the backpropagation step of computing the error backwards for improvement, vanilla RNNs are quite problematic when input sequences are long. While they show good results for shorter sequences, in longer ones (e.g. "Every week, I repetitively wash my clothes, and I do grocery shopping every day. Starting next week, I will ...") the distance between the relevant words can be too long for results to be acceptable. + +### LSTM: Long Short-Term Memory + +Over the past few years, many extensions of the classic RNN have been proposed, of which the **[Long Short-Term Memory](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/)** (LSTM) is one of the most prominent ones. In these architectures, rather than passing the entire hidden state, several gates are available for partially processing the previous state as well as new inputs. + +For example, in an **LSTM network**, there are three gates: an input gate, an output gate and a forget gate. Here they are, visualized together: + +![](images/1920px-LSTM_cell.svg_.png) + +An LSTM cell. Created by [Guillaume Chevalier](https://commons.wikimedia.org/w/index.php?title=User:GChe&action=edit&redlink=1) (svg by Ketograff) at [Wikipedia](https://en.wikipedia.org/wiki/Long_short-term_memory#/media/File:LSTM_cell.svg), licensed as [CC BY 4.0](https://creativecommons.org/licenses/by/4.0). + +On the left, you can see the forget gate: the output produced by the previous cell \[latex\]h\_{t-1}\[/latex\] is passed through a Sigmoid function, together with the input \[latex\]X\[/latex\] at \[latex\]t\[/latex\], \[latex\]X\_t\[/latex\]. We know that Sigmoid [maps each value](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/) to the range between 0 and 1. In other words, by doing so, the network can learn to forget certain aspects from cell state based on the current input values and the previous hidden state, through this gate. + +The second gate, which takes \[latex\]X\_t\[/latex\] as its input, is the input gate. It is dual in nature, because \[latex\]X\_t\[/latex\] and the hidden state from the previous cell is passed through both a Sigmoid (\[latex\]\\sigma\[/latex\]) and a \[latex\]tanh\[/latex\] function, after which the results are combined. Tanh here forces the input into the \[latex\]\[-1, 1\]\[/latex\] range and hence normalizes the data. The Sigmoid function once again maps the combination to \[latex\]\[0, 1\]\[/latex\] range, indicating which parts of the input must be kept. 
The outcome of what must be _kept_ is combined with what must be _forgotten_. This is passed to the third gate, the output gate.
+
+In this output gate, the hidden state from the previous cell as well as the current input are Sigmoid-ed to identify what must be kept based on short-term input. This is combined with the short-term input-influenced memory provided by the cell state that has just been altered by the forget and input gates. The output is passed to the outer world above and serves as the short-term input for the next cell (providing short-term output context for the next element in the sequence). The memory is also passed to the next cell, as we can see.
+
+### GRU: Gated Recurrent Unit
+
+A simplification of Long Short-Term Memory networks was proposed back in 2014. It is called a **Gated Recurrent Unit** and is similar to an LSTM, but it also has its fair share of differences. For example:
+
+- It combines the roles of the input and forget gates into a single update gate.
+- It lacks an output gate.
+
+By consequence, it is faster to train than an LSTM, because it has fewer parameters. However, this comes at a cost: GRUs have been shown to be incapable of performing some tasks that can be learned by LSTMs. For example, "(...) the GRU fails to learn simple languages that are learnable by the LSTM" (Wikipedia, 2016). This is why in practice, if you have to choose between LSTMs and GRUs, it's always best to test both approaches.
+
+GRUs are composed of a reset gate and an update gate. The goal of the reset gate is to _forget_ whereas the goal of the update gate is to _remember_, based on only the previous output and the current input. There are multiple variants (Wikipedia, 2016):
+
+1. A fully gated unit, with three sub types which compute the gates based on the hidden state and bias, the hidden state only, or the bias only.
+2. A minimally gated unit, where the reset and update gates are merged into a forget gate; the network then no longer explicitly learns what to remember, but only what to forget.
+
+Visually, this is what a GRU looks like if we consider its fully gated version:
+
+![](images/2560px-Gated_Recurrent_Unit_base_type.svg_.png)
+
+Created by [Jeblad](https://commons.wikimedia.org/wiki/User:Jeblad) at [Wikipedia](https://en.wikipedia.org/wiki/Gated_recurrent_unit#/media/File:Gated_Recurrent_Unit,_base_type.svg), licensed as [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0) (no changes made).
+
+We can see that the output from the previous item \[latex\]h\_{t-1}\[/latex\] and the input of the current item \[latex\]x\_t\[/latex\] are jointly passed through a Sigmoid function, which allows us to compute what must be forgotten. This is subsequently applied to the output from the previous item, which now excludes what must have been forgotten. Together with the current input, it is passed to a Tanh function which normalizes the inputs, and together with a Sigmoid-ed combination of the hidden output and current input, it is used for updating the hidden state and hence generating the output.
+
+The output is then used in the next cell for performing the same with the new input value.
+
+### Attention mechanism
+
+Thanks to their benefits in terms of backwards error computation, LSTMs and GRUs have yielded better performance than classic, vanilla RNNs. 
Nevertheless, they still face the problem related to the sequential processing of data: if sequences are too long, even those improvements cannot handle the sequence transformation. + +This primarily occurs because the memory is updated by means of a short-term change: in LSTMs, the memory is adapted based on short-term (i.e. current and previous input) interrelationships. While longer-term ones can pass through memory, they are forgotten over time. + +For this reason, scientists invented what is known as an **attention mechanism**. This mechanism effectively involves using _all_ the intermediate states, not only the last one, for generating the output prediction. By learning weights for these states, it becomes possible to teach the model to attend to certain parts of the input for computing a particular output. In other words, it can still use certain parts of the inputs while not losing focus on the most important aspects. In other words, thanks to attention, models no longer face the impact of long-term memory loss. + +But this still wasn't enough. Vanilla RNNs, LSTMs and GRUs all face the same bottleneck: even though _memory_ itself was improved through attention, _computation_ was not. The process was still sequential, because each token has to be processed sequentially. And that is where Transformers came in. + +### Attention is all you need: Transformers + +In a 2017 article, Vaswani et al. introduce a new concept, based on the premise that _attention is all you need_. In their work, they described the development of a new model architecture which completely strips off the recurrent aspects of a model, while still using attention. In other words, they ensured that the benefits of attention are still valid while significantly reducing the computational costs involved. The specific model architecture is called a **Transformer**. + +Wikipedia describes Transformers as follows: + +> Like recurrent neural networks (RNNs), Transformers are designed to handle sequential data, such as natural language, for tasks such as translation and text summarization. However, unlike RNNs, Transformers do not require that the sequential data be processed in order. For example, if the input data is a natural language sentence, the Transformer does not need to process the beginning of it before the end. +> +> Wikipedia (2019) + +Aha! An interesting property. + +Their architecture is quite complex, but let's try to simplify it. Note that the goal here is not to explain a Transformer in great detail, but rather to provide you with an introduction as to why they are an improvement over classic RNNs and LSTMs/GRUs. We'll dive into the Transformer architecture in more detail in another article. + +#### Transformers, visualized + +Like LSTMs and GRUs which can be used as encoder-decoder architectures, Transformers involve an encoder segment and a decoder segment as well. This is what a Transformer looks like, visually (Vaswani et al., 2017): + +![](images/1_BHzGVskWGS_3jEcYYi6miQ-842x1024.png) + +#### So complex, how does it work? + +I can imagine that the image visualized above is interpreted as complex - and it is. But it's not _too_ complex, to say the least. + +We can see that a Transformer is composed of two main parts: an **encoder** **segment**, which encodes an input into an intermediate representation, and a **decoder segment**, which converts it back into readable text. 
To illustrate this: if our goal is to train a model for translating English texts into French, the encoder first translates English into some kind of imaginary language, while the decoder - who understands this language as well - translates it into French. + +The encoder segment is composed of the following parts: + +- A (learned) **[embedding](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/) layer** which allows us to capture the textual input in a way that machines can process. +- A **positional encoding** which adds position information to the input embedding, allowing the model to understand the position of certain parts of the input in the input as a whole. Recall that there is no recurrence in a Transformer anymore, and that hence positional information must be added in a different way. It is achieved by means of the positional encoding. +- An **encoder part**, which can be identically repeated for \[latex\]N\[/latex\] times, each time increasing the precision of the encoding. It is composed of a _multi-head attention segment_ and a _feed forward segment_. Residual connections (i.e. connections that pass the original input to the output, such as the input to the _Add & Norm_ aspect, help boost training performance by allowing gradients to flow freely during backpropagation. + - The **multi-head attention segment** allows us to split inputs into queries, keys and values. Suppose that you are looking for a Netflix title. You're doing so by means of a _query_, which is mapped against characteristics of the series available - the _keys_. Based on this mapping, we can find the best matches - the _values_. By performing these matches, by means of _self-attention_ in the \[latex\]\\text{keys} \\times \\text{values}\[/latex\] mapping, we can teach the encoder segment to focus on certain parts of phrases when encoding certain words. It is multi-headed in the sense that the _queries, keys and values_ combination can be split into \[latex\]N\[/latex\] parallel parts, allowing the model to take a variety of viewpoints on the mappings being made. The multihead attention blocks are subsequently being added together, and layer normalized with the residual input. + - The **feedforward segment** allows us to generate the encoding for each individual input. The encoding itself is a high-dimensional representation of the input. Here, too, a residual connection is available for free flow of gradients. + +The decoder segment, here, attempts to translate the imaginary language back into written language, or perform summarization, anything you have trained the model for. It is composed of the following parts: + +- Once again, a (learned) **embedding layer** allowing us to capture the textual input in a way that machines can process. In Vaswani et al. (2017), the weight matrices for at least the input and output embedding layers are shared. The embedding is also **position encoded**, like in the encoder segment. +- A **decoder part**, which can also be repeated \[latex\]N\[/latex\] times, and composed of the following elements: + - A **masked multi-head attention segment**, generating self-attention for the desired outputs, i.e., what words to focus on given a certain input word. It is _masked_ in the sense that for any word, all future words in a phrase are 'hidden' because we cannot know anything from the future. The same multi-head split and subsequent residual adding & layer normalization happens here, too. 
+    - Another **multi-head attention segment**, where the keys and values from the encoder outputs are merged with the values from the masked multi-head attention block from the decoder. It essentially allows us to combine the "key-value mapping" (the question-to-possibility mapping) with the actual best-matching outcomes (the values). In other words, it allows us to find a best match (e.g. "I am doing fine") for an encoded representation of the phrase "How are you doing?". All information from the masked segment is also passed as a residual flow.
+    - A **feedforward network**, including the residual connection, is available here too. It allows us to convert the processed input-output combination into a \[latex\]W \\text{-dimensional}\[/latex\] vector, where \[latex\]W\[/latex\] equals the number of words. Subsequently, using a [Softmax](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) function, we can generate the most likely word.
+
+The predicted outputs are now added back into the decoder segment, allowing us to better predict the next output token given the encoded input token and the previous predictions. This way of working has allowed Natural Language Processing practitioners to achieve extremely impressive results in terms of text processing - i.e., in terms of language generation, summarization, translation, and so on. They have solved the issues with long-term memory (by means of attention) _and_ computation speed (by means of removing the recurrent segments), meaning that recurrent neural networks may no longer be the first choice for generating language models. Still, this is an oversimplification: many [LSTM](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/) models are still being used today.
+
+As is usual with machine learning problems, the path forward is simple: it's up to you to find the best model for the job!
+
+* * *
+
+## Summary
+
+In this article, we looked at sequence-to-sequence learning, and especially the history of doing so. We first investigated the concept of a sequence and what it means to perform sequence-to-sequence learning, e.g. in the case of machine summarization or machine translation. This also allowed us to briefly explore the differences between _classic_ Seq2Seq and _encoder-decoder_ based Seq2Seq, which is very prominent today.
+
+After these basics, we moved on and looked at a variety of model architectures that have been used for Seq2Seq learning in the last few years. We introduced classic or vanilla RNNs, where recurrent sections are available between predictions. In other words, the output from the previous prediction is also used as a "hidden state" for the next prediction, providing some context of the phrase that is being predicted. We have seen that this leads to many problems in terms of (1) long-term memory loss due to an enormous focus on short-term memory, (2) computation issues because of vanishing gradients and (3) computation issues in terms of the sequential nature of predicting tokens.
+
+Long Short-term Memory Networks ([LSTMs](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/)) and Gated Recurrent Units (GRUs) have significantly improved the memory aspects of Seq2Seq models. 
By allowing networks to keep track of some memory by means of a cell structure, adding 'gates' to forget and add new information into memory, they have ensured that the long-term memory loss problem has been reduced significantly. In addition, the vanishing gradients problem has been reduced if not resolved. Still, with these model types, outputs are still problematic when sequences are really long - the memory additions have their limits! In addition, the sequential nature of computation is still there.
+
+The attention mechanism, which was first added to LSTMs and GRUs, allows models to take into account a (learned) weighted average of all previous output vectors when generating an output value. This extends memory into infinity, resolving that problem and providing sufficient context for a good prediction. Still, due to the sequential nature of processing, computational requirements remained high.
+
+In 2017, there was a fix for this final problem, with the introduction of the Transformer architecture. By arguing that _attention is all you need_, a group of researchers produced an encoder-decoder model architecture that allows us to generate models which generate attention over their own inputs. In addition, attention can be generated over the entire input, i.e. the _input as a whole_. This massively boosts parallelization and hence the sequential nature of processing is something from the past when using Transformers. In other words, we get the benefits from attention (i.e. theoretically infinite memory) while we also get to remove the drawbacks from sequential processing.
+
+In the past few years, Transformers have greatly changed the landscape of Seq2Seq learning and have been the state-of-the-art approach for such Machine Learning tasks. In fact, many extensions such as BERT, BART, ALBERT, GPT, GPT-2, GPT-3, ..., have been proposed and developed. If your goal is to understand the application of Machine Learning in Natural Language Processing, I'd say: understand why previous approaches were problematic, and start from Transformer based architectures. Good luck!
+
+[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/)
+
+I hope that you have learned something from today's article. If you did, please feel free to leave a message at the comments section below 💬 I'd love to hear from you :) Please feel free to ask questions too, if you have them. For asking questions, click the button to the right, or add a message in the comments section below. Thank you for reading MachineCurve today and happy engineering! 😎
+
+* * *
+
+## References
+
+Wikipedia. (2019, August 25). _Transformer (machine learning model)_. Wikipedia, the free encyclopedia. Retrieved December 16, 2020, from [https://en.wikipedia.org/wiki/Transformer\_(machine\_learning\_model)](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model))
+
+Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008.
+
+Wikipedia. (2019, December 17). _Seq2seq_. Wikipedia, the free encyclopedia. Retrieved December 17, 2020, from [https://en.wikipedia.org/wiki/Seq2seq](https://en.wikipedia.org/wiki/Seq2seq)
+
+Wikipedia. (2007, April 16). _Long short-term memory_. Wikipedia, the free encyclopedia. 
Retrieved December 18, 2020, from [https://en.wikipedia.org/wiki/Long\_short-term\_memory](https://en.wikipedia.org/wiki/Long_short-term_memory)
+
+Wikipedia. (2016, May 18). _Gated recurrent unit_. Wikipedia, the free encyclopedia. Retrieved December 18, 2020, from [https://en.wikipedia.org/wiki/Gated\_recurrent\_unit](https://en.wikipedia.org/wiki/Gated_recurrent_unit)
diff --git a/gans-an-introduction-to-frechet-inception-distance-fid.md b/gans-an-introduction-to-frechet-inception-distance-fid.md
new file mode 100644
index 0000000..b4b1eea
--- /dev/null
+++ b/gans-an-introduction-to-frechet-inception-distance-fid.md
@@ -0,0 +1,15 @@
+---
+title: "GANs: an Introduction to Fréchet Inception Distance (FID)"
+date: "2021-11-09"
+categories:
+  - "deep-learning"
+tags:
+  - "deep-learning"
+  - "fid"
+  - "frechet-inception-distance"
+  - "machine-learning"
+  - "neural-networks"
+---
+
+The **Fréchet Inception Distance** or FID is a method for comparing the statistics of two distributions by computing the distance between them. In GANs, the FID method is used for computing how much the distribution of images produced by the Generator looks like the distribution of the real images. By consequence, it is a metric of GAN performance – the lower the FID, the better the GAN.
+It is named _Inception_ Distance because you’re using an Inception neural network (say, InceptionV3) for computing the distance. Here’s how you’ll do that, technically:
diff --git a/generative-adversarial-networks-a-gentle-introduction.md b/generative-adversarial-networks-a-gentle-introduction.md
new file mode 100644
index 0000000..d0ecf84
--- /dev/null
+++ b/generative-adversarial-networks-a-gentle-introduction.md
@@ -0,0 +1,172 @@
+---
+title: "Generative Adversarial Networks, a gentle introduction"
+date: "2021-03-23"
+categories:
+  - "buffer"
+  - "deep-learning"
+tags:
+  - "gan"
+  - "gans"
+  - "generative-adversarial-networks"
+  - "generative-models"
+---
+
+In the past few years, deep learning has revolutionized the field of Machine Learning. It is about "discovering rich (...) models" that work well with a variety of data (Goodfellow et al., 2014). While most approaches have been discriminative, over the past few years, we have seen a rise in _generative_ deep learning.
+
+Within the field of image generation, **Generative Adversarial Networks** or GANs have been really popular. I have recently started reading about them because I want to expand my knowledge about them -- I see a lot of momentum and my knowledge about GANs was really scarce. Turns out the first paper about this approach was already written back in 2014. In _Generative Adversarial Nets_, Ian Goodfellow and others introduce the adversarial, neural network based approach to add simplicity to generative ML - backprop and Dropout are possibly all you need.
+
+As we shall see, GANs involve two models: a generative one that is capable of generating images, and an adversarial one that is capable of telling fake images from real ones. In other words, it's a competition between counterfeiters and the police, where counterfeiters eventually learn to fool the police because they become too good at generating fake images.
+
+In this tutorial, we'll be taking a brief but deep look at how GANs work. It is in effect an explanation of the 2014 Goodfellow paper. It therefore doesn't cover many of the newer developments around GANs. However, it _does_ provide an intuitive explanation of the core ideas. 
Other topics follow in subsequent tutorials, which you can find on [this page](https://www.machinecurve.com/index.php/generative-adversarial-networks-explanations-examples/).
+
+After reading this article, you will understand...
+
+- **What a Generative Adversarial Network is.**
+- **How the Generator (Counterfeiter) and Discriminator (Police) components of GANs work.**
+- **How the Generator and Discriminator play a Minimax game, enabling generative ML.**
+- **How a GAN is trained.**
+
+Let's take a look! 🚀
+
+* * *
+
+\[toc\]
+
+* * *
+
+## What is a Generative Adversarial Network?
+
+Websites like [thispersondoesnotexist.com](http://thispersondoesnotexist.com) show that Machine Learning - and specifically Deep Learning - applications can also be used for generative purposes these days. Beyond images, they are even used for other types of deep fakes - videos, for example.
+
+Generative Deep Learning is mostly powered by Generative Adversarial Networks these days. A **GAN** is a machine learning approach that combines two neural networks. The first is a _Generator_, which takes a random noise sample and converts it into an image. This output image is then fed to a _Discriminator_, which was trained on real images. The Discriminator detects whether the image is fake or real. This leads to a loss, using which both the Discriminator and Generator are optimized.
+
+![](images/GAN-1024x431.jpg)
+
+The schematics of a Generative Adversarial Network. Two neural networks, a Generator and Discriminator, battle with each other. The Generator serves as a counterfeiter whereas the Discriminator serves as the police. Through their battle, the Generator learns to generate images that cannot be distinguished from real ones - using noisy inputs drawn from a "latent space" (we'll cover that in the next section).
+
+By consequence of this joint optimization, the process can be framed as a battle between counterfeiters (Generator) and the police (Discriminator). This is in fact how it was framed by the 2014 Goodfellow et al. work. In this battle, the Generator faces the steepest learning curve - because it has no notion of what is real. Rather, it has to learn this through failure. As a result, it can learn to generate images that are _eerily real_.
+
+- ![](images/6-1024x1024.jpg)
+
+- ![](images/1-1024x1024.jpg)
+
+- ![](images/4-1024x1024.jpg)
+
+- ![](images/2-1024x1024.jpg)
+
+- ![](images/3-1024x1024.jpg)
+
+- ![](images/5-1024x1024.jpg)
+
+
+Examples of images generated by a GAN (Karras et al., 2019).
+
+* * *
+
+## The maths of GANs, intuitively
+
+Now that we understand how Generative Adversarial Networks work intuitively, let's take a look at the maths behind them. I think it's really crucial to understand these maths if you want to learn about the internals of a GAN. However, as with any tutorial on this website, maths are not leading. Rather, I try to cover all the maths relatively intuitively.
+
+We use this order: first, we're going to take a look at the Generator. This is followed by the Discriminator, their interplay, and how this translates into a minimax game between both. Let's go!
+
+### Latent space, priors and image generation: the Generator
+
+The first element of any GAN, and maybe the most important part, is the **Generator**. Visualized in red below, a Generator can be defined as \[latex\]G\[/latex\] mathematically. 
\[latex\]G\[/latex\] is usually a neural network based model. Even more specifically, it can be defined as \[latex\]G(\\textbf{z}, \\theta\_g)\[/latex\]. Let's break this apart into a variety of components. + +![](images/image-2-1024x454.png) + +The generator part of a GAN. + +First, \[latex\]\\theta\_g\[/latex\]. These are simply the parameters of the neural network; its weights. As they can be trained (and hence updated), the parameters are specified specifically, because they are not fixed all the time. + +Secondly, \[latex\]\\textbf{z}\[/latex\]. Beyond the parameters, it is the other input to the Generator. In the image above, we can see that the only input to a Generator is a _noise vector_. Recall that Generators of a GAN battle against Discriminators. However, in order to do so, we'll have to generate an image. We cannot generate an output of a model without any input, and that's why we have to input _something_. And that something is random noise. (And we shall see that the battle between both ensures that random noise will be capable of producing very high-quality stuff later.) + +Anyway, let's get back on topic. \[latex\]\\textbf{z}\[/latex\] is the noise. We also call this noise vector a _latent vector_. Latent means "hidden" (Quora, n.d.). It is called that way because it comes from a latent or "hidden" probability distribution \[latex\]p\_\\textbf{z}\[/latex\]. It provides the probabilities that if we sample randomly, we get \[latex\]\\textbf{z}\[/latex\]. For any \[latex\]\\textbf{z}\[/latex\], we can define its probability \[latex\]p\_\\textbf{z}(\\textbf{z})\[/latex\]. We call this the _prior_. + +![](images/image-3-1024x350.png) + +Note that the probability distribution can be any continuous distribution (StackExchange, n.d.). Often, however, the Gaussian distribution with \[latex\](\\mu = 0.0, \\sigma = 1.0)\[/latex\] is chosen for two reasons (Goodfellow et al., 2014): + +1. The unit variance means that each element in the noise vector can be a different feature of the output image. +2. The zero mean means that we can "walk" over our latent space and generate new images with some continuity between them. + +Okay, back to the generator - \[latex\]G(\\textbf{z}, \\theta\_g)\[/latex\]. It maps any \[latex\]\\textbf{z}\[/latex\] (latent vector) to "data space" (via a data vector, \[latex\]\\textbf{x}\[/latex\], using the network and its weights. Data space here means the space with all possible output images. The neural network \[latex\]G\[/latex\] and the parameters \[latex\]\\theta\_g\[/latex\] determine how the mapping is made; \[latex\]\\textbf{z}\[/latex\] provides the sample that is mapped. + +So, to summarize, for each iteration - and sample - we randomly draw a latent vector from our latent space. It's fed to the Generator, and we receive an image. This image is picked up by the Discriminator. + +### Checking for counterfeiting: the Discriminator + +The Discriminator, or \[latex\]D\[/latex\], learns to detect whether an image created by generator \[latex\]G\[/latex\] is real or not. Let's take a look at how it does that. + +First of all, we know that \[latex\]G\[/latex\] generates an image. This image, \[latex\]\\textbf{x}\[/latex\], is fed to the Discriminator - \[latex\]D(\\textbf{x}\[/latex\]. The Discriminator is also a neural network. Instead of an image, however, it outputs the probability that \[latex\]\\textbf{x}\[/latex\] comes from the data rather than the probability distribution \[latex\]p\_g\[/latex\] (Goodfellow et al., 2014). 
In other words, that it is "real" (and thus from the real images) rather than "counterfeit") (from the Generator's distribution, \[latex\]p\_g\[/latex\]). + +Quite simple! + +![](images/image-4-1024x407.png) + +### Generator vs Discriminator: a minimax game + +Okay, now we know how the Generator works (TL/DR: mapping vectors sampled from a latent space to an output) as well as how the Discriminator works (TL/DR: outputting the probability that its input is from data space / real rather than from generator space / counterfeit). + +In other words, we know that \[latex\]D\[/latex\] is trained to _maximize_ the probability of assigning the correct label to training examples and samples from G ("distinguish real from fake"). \[latex\]G\[/latex\], on the other hand, is trained to minimize \[latex\]log(1-D(G(\\textbf{z})))\[/latex\] ("fool \[latex\]D\[/latex\] with my image \[latex\]G(\\textbf{z})\[/latex\]"). + +By consequence, the battle between the Generator and the Discriminator is a minimax game ("minimizing the performance of G while maximizing the performance of D"). Although this sounds counterintuitive at first, it is easy to see that G thus faces an uphill battle and that it has to work _really hard_ to fool the police \[latex\]D\[/latex\]. Just like any counterfeiter these days. This is good, because only such pressure ensures that \[latex\]G\[/latex\] will learn to generate scary stuff. + +The game can be illustrated by the formula below, which we adapted from Goodfellow et al. (2014) - remarks by me. We add the expected loss value for the Discriminator given the \[latex\]\\textbf{x}\[/latex\] generated by \[latex\]G\[/latex\] ("how well the Discriminator works on real data") to the expected loss value for the Discriminator given how the Generator processed the sampled vector \[latex\]\\textbf{z}\[/latex\]. In other words, how bad the Discriminator works on counterfeit data. + +The game minimizes this loss for the Generator (minimize how well \[latex\]D\[/latex\] works on real data and how good it is on counterfeit data jointly) and maximizes the loss for the Discriminator (maximize how poor it performs jointly). As with any battle, balance between the parties will emerge. And precisely this game is why G will be capable of generating real images. + +![](images/image-1-1024x401.png) + +The Minimax game played by a Generative Adversarial Network. Original formula from Goodfellow et al. (2014), remarks by me. + +* * * + +## Training a GAN: basic algorithm + +Great! We now know how the Generator works, how its outputs are checked by the Discriminator, and how they battle. The only thing that we haven't looked at in more detail is the _training process_. In the original GAN paper, Goodfellow et al. (2014) describe an algorithm for training a Generative Adversarial Network. + +Brief note - recall that we really take a look at the fundamentals here, and that any updates to the original training algorithm are not reflected here. Instead, they will be reflected in future articles. + +This is the training algorithm from Goodfellow et al. (2014): + +- For _some number of training iterations_, do: + - For _k steps_, do: + - Sample a minibatch of _m_ noisy samples \[latex\]\\{\\textbf{z}^{(1)}, \\textbf{z}^{(2)}, ..., \\textbf{z}^{(m)}\\}\[/latex\] from the noise prior, \[latex\]p\_g(\\textbf{z})\[/latex\]. 
+ - Sample a minibatch of _m_ real data examples \[latex\]\\{\\textbf{x}^{(1)}, \\textbf{x}^{(2)}, ..., \\textbf{x}^{(m)}\\}\[/latex\] from the data generating distribution, \[latex\]p\_{data}(\\textbf{x})\[/latex\]. + - Update the Discriminator \[latex\]D\[/latex\] by performing gradient ascent using the average loss according to the minimax formula above, for each pair \[latex\]\\{\\textbf{z}^{(i)}, \\textbf{x}^{(i)}\\}\[/latex\], where \[latex\]0 < i < m\[/latex\]. + - Sample a minibatch of _m_ noisy samples \[latex\]\\{\\textbf{z}^{(1)}, \\textbf{z}^{(2)}, ..., \\textbf{z}^{(m)}\\}\[/latex\] from the noise prior, \[latex\]p\_g(\\textbf{z})\[/latex\]. + - Update the Generator \[latex\]G\[/latex\] by performing gradient descent using the average loss according to the _expected generator loss_ from the formula above, for each pair \[latex\]\\textbf{z}^{(i)}\[/latex\], where \[latex\]0 < i < m\[/latex\]. + +As you can see, the Discriminator is updated for \[latex\]k\[/latex\] steps and only then the Generator is updated. This process is repeated continuously. \[latex\]k\[/latex\] can be set to 1, but usually larger values are better (Goodfellow et al., 2014). Any gradient-based learning rule can be used for optimization. Goodfellow et al. (2014) used momentum. + +* * * + +## Summary + +Generative Adversarial Networks yield very realistic _generative_ products these days. For example, they can be used to generate images of human beings that do not exist. In this tutorial, we looked at the core ideas about GANs from the 2014 Goodfellow et al. paper. By reading it, you have learned... + +- **What a Generative Adversarial Network is.** +- **How the Generator (Counterfeiter) and Discriminator (Police) components of GANs work.** +- **How the Generator and Discriminator play a Minimax game, enabling generative ML.** +- **How a GAN is trained.** + +I hope that it was useful for your learning process! Please feel free to share what you have learned in the comments section 💬 I'd love to hear from you. Please do the same if you have any questions or other remarks. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). [Generative adversarial networks.](https://arxiv.org/abs/1406.2661) _arXiv preprint arXiv:1406.2661_. + +StackExchange. (n.d.). _How does a generator (GAN) create samples similar to the data space from a vector of random numbers?_ Cross Validated. [https://stats.stackexchange.com/questions/278623/how-does-a-generator-gan-create-samples-similar-to-the-data-space-from-a-vecto](https://stats.stackexchange.com/questions/278623/how-does-a-generator-gan-create-samples-similar-to-the-data-space-from-a-vecto) + +Quora. (n.d.). _What is the meaning of latent space? - Quora_. A place to share knowledge and better understand the world. [https://www.quora.com/What-is-the-meaning-of-latent-space](https://www.quora.com/What-is-the-meaning-of-latent-space) + +Wikipedia. (2002, February 25). _Minimax_. Wikipedia, the free encyclopedia. Retrieved March 22, 2021, from [https://en.wikipedia.org/wiki/Minimax](https://en.wikipedia.org/wiki/Minimax) + +Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020). [Analyzing and improving the image quality of stylegan.](https://arxiv.org/abs/1912.04958) In _Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition_ (pp. 8110-8119). 
diff --git a/getting-out-of-loss-plateaus-by-adjusting-learning-rates.md b/getting-out-of-loss-plateaus-by-adjusting-learning-rates.md new file mode 100644 index 0000000..47de030 --- /dev/null +++ b/getting-out-of-loss-plateaus-by-adjusting-learning-rates.md @@ -0,0 +1,351 @@ +--- +title: "Getting out of Loss Plateaus by adjusting Learning Rates" +date: "2020-02-26" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "deep-neural-network" + - "learning-rate" + - "learning-rate-range-test" + - "loss" + - "loss-function" + - "loss-plateau" +--- + +When you train a supervised machine learning model, your goal is to minimize the [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) - or error function - that you specify for your model. + +Generally speaking, this goes pretty easily in the first iterations of your training process. Loss falls fast, and the model improves substantially. But then, you get stuck. + +You encounter what is known as a **loss plateau** - suddenly, it seems to have become impossible to improve the model, with loss values balancing around some constant value. + +It may be the case that you have reached the _global loss minimum_. In that case, you're precisely where you want to be. + +But what if you're not? What if your model is stuck in what is known as a saddle point, or a local minimum? What are these? Interesting questions, which we'll answer in this blog post. + +Now, in those cases, you might wish to "boost" your model and ensure that it can escape from these problematic areas of your loss landscape. We'll show you two possible approaches in this blog post, one of which we'll dive into much deeper. Firstly, we'll briefly touch Cyclical Learning Rates - subsequently pointing you to another blog post at MachineCurve which discusses them in detail. + +Secondly, and most importantly, we'll show you how automated adjustment of your Learning Rate may be just enough to escape the problematic areas mentioned before. What's more, we'll also provide an example implementation for your neural network using the Keras framework for deep learning, using TensorFlow 2.0. + +Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## Loss plateaus: saddle points and local minima + +In the introduction, we introduced the training process for a supervised machine learning model. There, we also noticed that two types of problematic areas may occur in your loss landscape: saddle points and local minima. + +Indeed, they may be the reason that your loss does not improve any further - especially when at a particular point in time, your learning rate becomes very small, either because it is configured that way or because it has decayed to really small values. + +Let's take a look at saddle points and local minima in more detail next. + +### Saddle points + +A loss landscape is a representation in some space of your loss value. Below, you'll see two (slices of) loss landscapes with a saddle point in each of them. + +- [![](images/Saddle_point.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/Saddle_point.png) + +- [![](images/Saddle_Point_between_maxima.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/Saddle_Point_between_maxima.png) + + +_Two landscapes with saddle points. On the left, it's most visible - while on the right, it's in between two maxima. 
| Left: By [Nicoguaro](//commons.wikimedia.org/wiki/User:Nicoguaro "User:Nicoguaro") - Own work, [CC BY 3.0](https://creativecommons.org/licenses/by/3.0 "Creative Commons Attribution 3.0"), [Link](https://commons.wikimedia.org/w/index.php?curid=20570051) | Right: By [Nicoguaro](//commons.wikimedia.org/wiki/User:Nicoguaro "User:Nicoguaro") - Own work, [CC BY 4.0](https://creativecommons.org/licenses/by/4.0 "Creative Commons Attribution 4.0"), [Link](https://commons.wikimedia.org/w/index.php?curid=48854962)_ + +The loss landscapes, here, are effectively the \[latex\]z\[/latex\] values for the \[latex\]x\[/latex\] and \[latex\]y\[/latex\] inputs to the fictional loss function used to generate them. + +Now, if they are the output of a _function_, it may be the case that we can compute the _derivative_ of that function as well. And by consequence, we can compute the _gradient_ for that particular \[latex\](x, y)\[/latex\] position too, a.k.a. the direction and speed of change at that point. + +**Saddle points are points in your loss landscape where the gradient is zero, but which are not an extremum** (Wikipedia, 2004). That is, the gradient is zero, but they don't represent minima or maxima. + +And this is problematic. Why, you may ask. Fair question. + +Here's the answer: supervised machine learning models are optimized by means of the gradients. If they're zero, the model gets stuck. + +Contrary to local minima, which we will cover next, saddle points are extra problematic because they don't represent an extremum. For example, if you'd move left or right, you'd find a loss that increases - while it would decrease in the other two directions. This makes it extra difficult to escape such points. + +Let's therefore focus on another, but slightly less problematic area in your loss landscape first, before we move on to possible solutions. + +### Local minima + +Indeed, another possible bottleneck for your training process can be when it encounters a local minimum. In this case, the point is an extremum - which is good - but the gradient is zero. + +For example, the red dot in this plot represents such a local minimum: + +![](images/MaximumCounterexample.png) + +_Source: [Sam Derbyshire](https://en.wikipedia.org/wiki/User:Sam_Derbyshire "wikipedia:User:Sam Derbyshire") at [Wikipedia](https://en.wikipedia.org/wiki/ "wikipedia:") [CC BY-SA 3.0](http://creativecommons.org/licenses/by-sa/3.0/ "Creative Commons Attribution-Share Alike 3.0"), [Link](https://commons.wikimedia.org/w/index.php?curid=48728184)_ + +Here, too, if your learning rate is too small, you might not escape the local minimum. Recall that in the beginning of this blog post, we noted that the loss value in your hypothetical training scenario started balancing around some constant value. It may be that this value represents this local minimum. + +Meanwhile, as you can see towards the bottom right part of the cube, loss starts decreasing rapidly once you are able to escape the minimum and get over the ridge. It may thus actually be worth it to try and see whether you can escape these points. + +### Zero gradients and consequences for training + +Altogether, we can thus say that zero gradients are bottlenecks for your training process - unless they represent the global minimum in your entire loss landscape. + +We can also say that we must try and find a way to escape from areas with saddle points and local minima. + +The small numerical sketch below makes the effect of (near-)zero gradients tangible. After that, let's take a look at a few approaches with which we can try to escape them.
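
To be clear, the sketch below is not part of the original post - it's just a minimal, self-contained illustration in plain Python/NumPy of vanilla gradient descent on the toy surface \[latex\]f(x, y) = x^2 - y^2\[/latex\], which has a saddle point at the origin where the gradient is exactly zero:

```
import numpy as np

# Toy loss surface with a saddle point at the origin: f(x, y) = x^2 - y^2.
# The gradient there is exactly zero, yet (0, 0) is neither a minimum nor a maximum.
def loss(w):
    x, y = w
    return x ** 2 - y ** 2

def gradient(w):
    x, y = w
    return np.array([2.0 * x, -2.0 * y])

w = np.array([0.5, 1e-7])  # start almost exactly on the saddle's ridge
learning_rate = 0.01       # a deliberately small "step size"

# Plain gradient descent: w <- w - learning_rate * gradient(w)
for step in range(1001):
    w = w - learning_rate * gradient(w)
    if step % 200 == 0:
        print(f"step {step:4d} | w = ({w[0]: .6f}, {w[1]: .6f}) "
              f"| loss = {loss(w): .6f} | gradient norm = {np.linalg.norm(gradient(w)):.6f}")
```

If you run this, you'll see the loss drop a little at first (as \[latex\]x\[/latex\] shrinks), after which it sits at roughly zero for hundreds of updates - a plateau - because the gradient is nearly zero close to the ridge. Only once the tiny \[latex\]y\[/latex\] component has grown large enough does the loss start to fall quickly again. Start exactly on the ridge (\[latex\]y = 0\[/latex\]), or use an even smaller learning rate, and you would not escape within any reasonable number of steps. That is exactly the behavior the approaches below try to counter.
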
+ +* * * + +## Getting out of loss plateaus + +Here, we'll cover the concepts behind Cyclical Learning Rates and Automated Plateau Adjustment of your Neural Learning Rate. + +Yeah, the latter one is just an invention by me, but well, I had to give it a name, right? :) + +We'll briefly cover Cyclical Learning Rates, as we covered them in detail [in another blog post](https://www.machinecurve.com/index.php/2020/02/25/training-your-neural-network-with-cyclical-learning-rates/). Nevertheless, it's worthwhile to introduce them here. However, after doing so, we'll focus on APANLR - crazy acronym, so let's skip that one from now on 😋 + +### Using Cyclical Learning Rates + +One cause for getting stuck in saddle points and local minima can be a learning rate that is too small. + +![](images/adult-adventure-backpack-287240-1024x767.jpg) + +As learning rates effectively represent the "step size" of your mountain descent - which is what you're doing when you're walking down that loss landscape visualized in blue above - a learning rate that is too small makes you move very slowly. + +With respect to local minima and saddle points, one could argue that you could simply walk "past" them if you set steps that are large enough. Having a learning rate that is too small will thus ensure that you get stuck. + +Now, **Cyclical Learning Rates** - which were introduced by Smith (2017) - help you fix this issue. These learning rates are indeed cyclical, and ensure that the learning rate moves back and forth between a _minimum value_ and a _maximum value_ all the time. Here are a few examples: + +- [![](images/triangular.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/triangular.png) + +- [![](images/sinusoidal.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/sinusoidal.png) + +- [![](images/parabolic.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/parabolic.png) + + +As you can imagine, this strikes a balance between "stepping over" local minima and still allowing yourself to look around in detail every now and then. Or, as Smith (2017) calls it - giving up short-term performance improvements in order to get better in the long run. + +Make sure to look at [that blog post](https://www.machinecurve.com/index.php/2020/02/25/training-your-neural-network-with-cyclical-learning-rates/) if you wish to understand them in more detail. It provides a Keras example too! 😎 + +### Adjusting your Learning Rate when Plateaus are encountered + +While Cyclical Learning Rates may work very nicely, can't we think of another way to escape such points? + +We actually might. And once again, we'll be using the **Learning Rate Range Test** for this, a test that [has proved to be useful](https://www.machinecurve.com/index.php/2020/02/20/finding-optimal-learning-rates-with-the-learning-rate-range-test/) where learning rates are concerned. + +This test, which effectively starts a training process at a very small but exponentially increasing learning rate, allows you to find out which learning rate - or which _range of learning rates_ - works best for your model. + +Now, we - and by _we_ I mean Jonathan Mackenzie with his `keras_find_lr_on_plateau` [repository on GitHub](https://github.com/JonnoFTW/keras_find_lr_on_plateau) ([mirror](https://github.com/christianversloot/keras_find_lr_on_plateau)) - could invent an algorithm which both ensures that the model trains and uses the Learning Rate Range Test to find new learning rates when loss plateaus: + +> Train a model for a large number of epochs.
If the model's loss fails to improve for `n` epochs: +> +> 1\. Take a snapshot of the model +> 2\. Set training rate to min\_lr and train for a batch +> 3\. Increase the learning rate exponentially toward max\_lr after every batch. +> 4\. Once candidate learning rates have been exhausted, select new\_lr as the learning rate that gave the steepest negative gradient in loss. +> 5\. Reload weights from the snapshot +> 6\. Set model's learning rate to new\_lr and continue training as normal +> +> Mackenzie (n.d.) + +Interesting! :) + +Now, if you look at Mackenzie's repository more closely, you'll see that he's also provided an implementation for Keras - by means of a Keras callback. Such callbacks effectively "spy" on the training process, and can act on it after every epoch. In this case, it thus simply looks at model improvement, pausing the training process temporarily (by snapshotting the model), finding a better learning rate, after which it's resumed again (with the snapshotted model). + +* * * + +## Automatically adjusting Learning Rates on Plateaus - a Keras example + +Let's now find out how we can use this implementation with an actual Keras model :) + +### Today's dataset + +In today's model, we'll be working with the [CIFAR-10 dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#cifar-10-small-image-classification) - a dataset generated by a Canadian institute that contains many images across ten varying classes: + +- [![](images/834.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/834.jpg) + +- [![](images/20619.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/20619.jpg) + +- [![](images/18017.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/18017.jpg) + +- [![](images/15330.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/15330.jpg) + +- [![](images/13749.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/13749.jpg) + +- [![](images/12403.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/12403.jpg) + +- [![](images/11312.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/11312.jpg) + +- [![](images/3576.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/3576.jpg) + + +### Today's Keras model + +The model with which we'll be showing you how to use this callback is a slight adaptation of the [sparse categorical crossentropy loss based model](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/) that we created before. Check out that blog if you wish to understand it in more detail, as we explained each individual block of code there. + +Now, open up your Explorer/Finder, create a file - say, `plateau_model.py` - and add this code. Ensure that TensorFlow 2.0 is installed, and that its Keras implementation works flawlessly (i.e., if you use the GPU version, this means that you'll also need to install other dependencies such as correct CUDA versions, and so on). 
+ +``` +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 32, 32, 3 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 100 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Load CIFAR-100 data +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +### Installing the LR Plateau Finder + +Installing the code is easy - open a terminal, ensure that Git is installed, `cd` into the folder where your `plateau_model.py` file is stored, and clone the repository: + +``` +git clone https://github.com/JonnoFTW/keras_find_lr_on_plateau.git +``` + +(if the repository above doesn't work anymore, you could always use the mirrored i.e. forked version, but I can't guarantee that it's up to date - therefore, I'd advise to use Jonathan Mackenzie's one.) + +``` +git clone https://github.com/christianversloot/keras_find_lr_on_plateau.git +``` + +### Detecting plateaus and adjusting learning rates + +Now, let's add the code for the callback :) + +First of all, we'll add an `ImageDataGenerator`. This is a built-in facility in Keras for processing your images and adding e.g. augmentation at the same time. Strictly speaking, we don't need it - our CIFAR-10 dataset is quite simple - but the LR Plateau Optimizer requires it. By consequence, we'll add it next - directly after `model.compile` and before `model.fit`: + +``` +# Define an ImageDataGenerator +gen = ImageDataGenerator(validation_split=validation_split) +``` + +Note that we do have to specify the validation split in the Image Data Generator rather than the `fit`, because - as we shall see - we'll be using it a little bit differently. 
Do note that we also have to add the generator to the imports: + +``` +from tensorflow.keras.preprocessing.image import ImageDataGenerator +``` + +The same goes for the LR Plateau Optimizer: + +``` +from keras_find_lr_on_plateau.keras_lr_optimiser_callback.optimize_lr_on_plateau import LRFOnPlateau +``` + +Next, we can instantiate it with the corresponding configuration - with a `max_lr` of 1, in order to provide a real "boost" during the testing phase: + +``` +# Define the LR Plateau Optimizer +adjuster = LRFOnPlateau(max_lr=1e0, train_iterator=gen, train_samples=input_train, batch_size=batch_size, epochs=no_epochs) +``` + +Finally, we fit the data to the generator - note the `adjuster` callback! + +``` +# Fit data to model +history = model.fit_generator(gen.flow(input_train, target_train, batch_size=batch_size), + epochs=no_epochs, + verbose=verbosity, + callbacks=[adjuster]) +``` + +And we're ready to go! Open up a terminal, `cd` to the folder where your `plateau_model.py` file is located, and run it with `python plateau_model.py`. The training process _including_ the Plateau Optimizer should now begin :) + +### Oops - it breaks down... let's fix it + +Except that it doesn't. It turns out that it's not entirely up to date, as far as I can tell. + +However, we can easily fix this by replacing two parts within the `optimize_lr_on_plateau.py` file: + +First, we'll replace the `LRFinder` import with: + +``` +from .lr_finder import LRFinder +``` + +This fixes the first issue. Now the second: + +``` + File "C:\Users\chris\MachineCurve\Models\keras-k-fold\keras_find_lr_on_plateau\keras_lr_optimiser_callback\optimize_lr_on_plateau.py", line 24, in on_epoch_end + if self.monitor_op(current, self.best): + File "C:\Users\chris\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\callbacks\callbacks.py", line 1023, in + self.monitor_op = lambda a, b: np.less(a, b - self.min_delta) +TypeError: '<' not supported between instances of 'NoneType' and 'float' +``` + +When Googling around, this seems like a typical error. Now, after line 22 (which reads `self.wait = 0`), add this: + +``` +if current is None: + current = 0.0 +``` + +This should fix the issue. Now, training should commence as expected :) + +* * * + +## Summary + +In this blog post, we found out how to implement a method for finding a possibly better learning rate once loss plateaus occur. It does so by applying the Learning Rate Range Test as a callback in the learning process, which we demonstrated by implementing it for Keras model. + +Hopefully, this method works for you when you're facing saddle points, local minima or other issues that cause your losses to plateau. If it does, please let me know! I'll be happy to hear from you. Please leave a comment as well if you spot a mistake, or when you have questions or remarks. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2004, May 7). Saddle point. Retrieved from [https://en.wikipedia.org/wiki/Saddle\_point](https://en.wikipedia.org/wiki/Saddle_point) + +Smith, L. N. (2017, March). [Cyclical learning rates for training neural networks](https://arxiv.org/abs/1506.01186). In _2017 IEEE Winter Conference on Applications of Computer Vision (WACV)_ (pp. 464-472). IEEE. + +Mackenzie, J. (n.d.). JonnoFTW/keras\_find\_lr\_on\_plateau. 
Retrieved from [https://github.com/JonnoFTW/keras\_find\_lr\_on\_plateau](https://github.com/JonnoFTW/keras_find_lr_on_plateau) diff --git a/getting-started-with-pytorch.md b/getting-started-with-pytorch.md new file mode 100644 index 0000000..6e3bc15 --- /dev/null +++ b/getting-started-with-pytorch.md @@ -0,0 +1,487 @@ +--- +title: "Getting started with PyTorch" +date: "2021-01-13" +categories: + - "frameworks" +tags: + - "deep-learning" + - "getting-started" + - "introduction" + - "machine-learning" + - "pytorch" +--- + +When you want to build a deep learning model these days, there are two machine learning libraries that you must consider. The first is [TensorFlow](https://www.machinecurve.com/index.php/mastering-keras/), about which we have written a lot on this website already. TensorFlow, having been created by Google and released to the public in 2015, has been the leading library for years. The second one is **PyTorch**, which was released by Facebook in 2016. Long running behind, both frameworks are now on par with each other, and are both used very frequently. + +In this article, we will take a look at **getting started with PyTorch**. We will focus on simplicity of both our explanations and the code that we write. For this reason, we have chosen to work with [PyTorch Lightning](https://www.pytorchlightning.ai/) in the PyTorch articles on this website. Being a way to structure native PyTorch code, it helps boost reusability while saving a lot of overhead. In other words: you'll have the freedom of native PyTorch, while having the benefits of neat and clean code. + +After reading this tutorial, you will have the answer to the question _"How to get started with PyTorch?"_. More specifically, you will... + +- Know what steps you'll have to take in order to get started. +- Understand what PyTorch Lightning is and how it improves classic PyTorch. +- See how functionality in Lightning is organized in a `LightningModule` and how it works. +- Be able to set up PyTorch Lightning yourself. +- Have created your first PyTorch Lightning model. + +* * * + +**Update 20/Jan/2021:** Added `pl.seed_everything(42)` and `deterministic = True` to the code examples to ensure that pseudo-random number generator initialization happens with the same value, and use deterministic algorithms where available. + +* * * + +\[toc\] + +* * * + +## Quick start: 3 steps to get started with PyTorch Lightning + +If you want to get started with PyTorch, **follow these 3 starting steps** to get started straight away! If you want to understand getting started with PyTorch in more detail, make sure to read the full tutorial. Here are the steps: + +1. Ensure that Python, PyTorch and PyTorch Lightning are installed through `conda install pytorch-lightning -c conda-forge` and `conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch`. +2. Make sure that you understand what a `LightningModule` is, how it works and why it improves the model creation process over classic PyTorch. +3. Copy and paste the following example code into your editor and run it with Python. 
+ +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +import pytorch_lightning as pl + +class MNISTNetwork(pl.LightningModule): + + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(28 * 28, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + self.ce = nn.CrossEntropyLoss() + + def forward(self, x): + return self.layers(x) + + def training_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.ce(y_hat, y) + self.log('train_loss', loss) + return loss + + def configure_optimizers(self): + optimizer = torch.optim.Adam(self.parameters(), lr=1e-4) + return optimizer + + +if __name__ == '__main__': + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + pl.seed_everything(42) + neuralnetwork = MNISTNetwork() + trainer = pl.Trainer(auto_scale_batch_size='power',gpus=1,deterministic=True) + trainer.fit(neuralnetwork, DataLoader(dataset)) +``` + +* * * + +## What is PyTorch Lightning? + +Today, when you want to create a deep learning model, you can choose **[PyTorch](https://pytorch.org/)** as the library of your choice. This library, which was released in September 2016 by Facebook, has become one of the two leading deep learning libraries. It is used by many researchers varying from academia to engineering, and is updated frequently. + +[![](images/image-5-1024x487.png)](https://www.machinecurve.com/wp-content/uploads/2021/01/image-5.png) + +The website of PyTorch Lightning + +Native PyTorch models can be a bit disorganized, to say it nicely. They are essentially **long Python files** with all the elements you need, but **without any order**. For example, you'll have to… + +- Declare the models and their structure. +- Define and load the dataset that you are using. +- Initialize optimizers and defining your custom training loop. + +With **[PyTorch Lightning](https://www.pytorchlightning.ai/)**, this is no longer the case. It is a layer on top of native [PyTorch](https://pytorch.org/) and is hence compatible with all your original code - which can in fact be re-organized into Lightning code, to improve reusability. This is what makes Lightning different: + +> Lightning makes coding complex networks simple. +> +> PyTorch Lightning (2021) + +### Benefits of PyTorch Lightning over classic PyTorch + +If we take a look at the benefits in more detail, we get to the following four: + +1. **The same code, but then organized.** +2. **Trainer automates parts of the training process.** +3. **No **`.cuda()`** or **`.to()`** calls.** +4. **Built-in parallelism.** + +Let's explore each in more detail now. + +#### Benefit 1: The same code, but then organized + +The _first benefit_ of using PyTorch Lightning is that **you'll have the same, PyTorch-compatible code, but then organized**. In fact, it "is just plain PyTorch" (PyTorch Lightning, 2021). Let's take a look at this example, which comes from the [Lightning website](https://www.pytorchlightning.ai/), and slightly adapted. We can see that the code is composed of a few segments that are all interrelated: + +- The `models` segment specifies the neural network's encoder and decoder segments using the `torch.nn` APIs. 
+- Under `download data`, we download the MNIST dataset, and apply a transform to [normalize the data](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/). +- We then generate a [train/test split](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/) of 55.000/5.000 images and load the data with `DataLoaders`. +- We specify an `optimizer`; the [Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) one in this case. +- Finally, we specify a custom training loop. + +``` +# models +encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3)) +decoder = nn.Sequential( + nn.Linear(28 * 28, 64), nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28) +) +encoder.cuda(0) +decoder.cuda(0) + +# download data + +transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize(0.5, 0.5)]) +mnist_train = MNIST(os.getcwd(), train=True, download=True, transform=transform) + +# train (55,000 images), val split (5,000 images) +mnist_train, mnist_val = random_split(mnist_train, [55000, 5000]) + +# The dataloaders handle shuffling, batching, etc... +mnist_train = DataLoader(mnist_train, batch_size=64) +mnist_val = DataLoader(mnist_val, batch_size=64) + +# optimizer +params = [encoder.parameters(), decoder.parameters()] +optimizer = torch.optim.Adam(params, lr=1e-3) + +# TRAIN LOOP +model.train() +num_epochs = 1 +for epoch in range(num_epochs): + for train_batch in mnist_train: + x, y = train_batch + x = x.cuda(0) + x = x.view(x.size(0), -1) + z = encoder(x) + x_hat = decoder(z) + loss = F.mse_loss(x_hat, x) + print("train loss: ", loss.item()) +``` + +And this is a simple model. You can imagine that when your model grows (and it does, because you'll have to write custom data loading and transformation segments; specify more layers; perhaps use custom loss functions and such), it'll become very difficult to see how things interrelate. + +One of the key benefits of PyTorch Lightning is _that it organizes your code into a `LightningModule`._ We will cover this Lightning Module later in this article, and you will see that things are much more organized there! + +#### Benefit 2: Trainer automates parts of the training process + +In classic PyTorch, in the training loop, you have to write a lot of custom code, including... + +- Instructing the model to get into training mode, enabling gradients to flow. +- Looping over the data loaders for training, validation and testing data; thus performing training, validation and testing activities. +- Computing loss for a batch, performing backprop, and applying the results with the optimizer. +- Defining device parallelism. + +With PyTorch Lightning, this is no longer necessary either. The second benefit is that it **comes with a** **`Trainer` object** **that automates all the steps mentioned above, without forbidding control.** + +> Once you’ve organized your PyTorch code into a LightningModule, the Trainer automates everything else. +> +> PyTorch Lightning (n.d.) + +Yes: the `Trainer` automates training mode and gradient flow, automates the training loop, performs optimization, and allows you to tell PyTorch easily on what devices it must run and with what strategy. + +No: this does not come at the cost of forfeiting control over your training process. 
Rather, while `Trainer` objects allow you to abstract away much of the training process, they allow you to customizer whatever part of the training process you want to customize. This allows you to get started quickly, while being able to configure the training process to your needs for when models are more complex. + +#### Benefit 3: No .cuda() or .to() calls + +This one's a bit more difficult, but the third benefit of PyTorch Lightning is **that you don't need to provide manual `.cuda()` and `.to()` calls**. + +In order to understand what this means, you must realize that data processing on a GPU happens differently compared to processing on a CPU. GPU-based processing requires you to convert Tensors (i.e. the representations of data used within both TensorFlow and PyTorch) into CUDA objects; this is performed with `.cuda()`. Using `.to()`, you can also convert Tensors into different formats and across devices. + +An example from the PyTorch docs is provided below (PyTorch, n.d.). In this example, three Tensors are created and possibly manipulated. The first Tensor is directly allocated to the first CUDA available device, i.e. a GPU. The second is first created on CPU and then transferred to the same GPU with `.cuda()`. The third is also first created on CPU and then transferred to a GPU, but then an explicitly defined one, using `.to(device=cuda)`. + +``` +cuda = torch.device('cuda') # Default CUDA device +cuda0 = torch.device('cuda:0') +cuda2 = torch.device('cuda:2') # GPU 2 (these are 0-indexed) + +with torch.cuda.device(1): + # allocates a tensor on GPU 1 + a = torch.tensor([1., 2.], device=cuda) + + # transfers a tensor from CPU to GPU 1 + b = torch.tensor([1., 2.]).cuda() + # a.device and b.device are device(type='cuda', index=1) + + # You can also use ``Tensor.to`` to transfer a tensor: + b2 = torch.tensor([1., 2.]).to(device=cuda) +``` + +While this gives you full control over the deployment of your model, it also comes at a cost: getting a correct configuration can be difficult. What's more, in many cases, your training setting is static over time - it's unlikely that you have 1.000 GPUs at your disposal at one time, and 3 at another time. This is why manual configuration of your CUDA devices and Tensor creation is an overhead at best and can be inefficient at worst. + +PyTorch Lightning overcomes this issue by fully automating the `.cuda()`/`.to()` calls depending on the configuration provided in your `Trainer` object. You simply don't have to use them anymore in most of your code. Isn't that cool! + +#### Benefit 4: Built-in parallelism + +In classic PyTorch, when you want to train your model in a parallel setting (i.e. training on multiple GPUs), you had to build this into your code manually. + +The fourth and final key benefit of PyTorch Lightning is that **Lightning takes care of parallelism when training your model, through the `Trainer` object.** + +Indeed, adding parallelism is as simple as specifying e.g. the GPUs that you want to train your model on in the `Trainer` object (PyTorch Lightning, n.d.): + +``` +Trainer(gpus=[0, 1]) +``` + +And that's it - PyTorch Lightning takes care of the rest! + +#### All benefits together + +Let's sum a few things together now. + +PyTorch Lightning improves classic PyTorch in the following ways: + +1. **The same code, but then organized.** +2. **Trainer automates parts of the training process.** +3. **No `.cuda()` or `.to()` calls.** +4. 
**Built-in parallelism.** + +But even then, you still have full control, and can override any automated choices made by Lightning. And what's more, it runs native PyTorch under the hood. + +That's why we'll use Lightning in our PyTorch oriented tutorials as the library of choice. + +* * * + +## Introducing the LightningModule + +Okay, so now we know why PyTorch Lightning improves PyTorch and that it can be used for constructing PyTorch models. Let's now take a look at the _what_, i.e. the `LightningModule` with which we'll work during the construction of our PyTorch models. + +> A `LightningModule` is a [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module) but with added functionality +> +> PyTorch Lightning (n.d.) + +Here, the `torch.nn.Module` is the base class for all PyTorch based neural networks. In other words, a `LightningModule` is a layer on top of the basic way in which neural networks are constructed with PyTorch. It allows us to achieve the benefits that were outlined above, and in particular the benefit related to the organization of your machine learning model. + +Each `LightningModule` is composed of six subsegments: + +1. **Initialization segment**, or `__init__`. Essentially being the constructor of the `LightningModule` based class, it allows you to define the computations that must be used globally. For example, in this segment, you can specify the layers of your network and possibly how they are stacked together. +2. **Forward segment**, or `forward`. All inference data flows through `forward` and it therefore allows you to customize what happens during inference. Primarily though, this should be the generation of the prediction. +3. **Training segment**, or `training_step`. Here, you can specify the forward pass through the model during training, and the computation of loss. Upon returning the loss, PyTorch Lightning ensures that (1) the actual forward pass happens, that (2) errors with respect to loss are backpropagated, and that (3) the model is optimized with the optimizer of choice. +4. **Configure optimizers** through `configure_optimizers`. In this definition, you can specify the optimizer that you want to use. +5. **Validation segment** _(optional)_, or `validation_step`. Equal to the training segment, it is used for validating your model during the training process. Having a separate _validation step_ segment allows you to define a different validation approach, if necessary. +6. **Testing segment** _(optional)_, or `test_step`. Once again equal to the training segment, but then for evaluation purposes. It is not called during the training process, but rather when `.test()` is called on the `Trainer` object. + +* * * + +![](images/pexels-photo-1114690-1024x684.jpeg) + +Getting started with PyTorch can be done at the speed of lightning - hence the name of the library. + +* * * + +## Setting up PyTorch Lightning + +PyTorch Lightning can be installed really easily: + +- **With PIP:** `pip install pytorch-lightning` +- **With Conda:** `conda install pytorch-lightning -c conda-forge` + +That's all you need to get started with PyTorch Lightning! + +If you are still missing packages after installation, also try the following: + +``` +conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch +``` + +* * * + +## Your first PyTorch model with Lightning + +Now that we know both PyTorch Lightning and its `LightningModule`, it's time to show how you can build a neural network with PyTorch. 
Today, for introductory purposes, we will be creating a simple neural network that is capable of classifying the MNIST dataset. Building a neural network with PyTorch involves these five steps: + +1. **Creating the LightningModule.** +2. **Defining the forward pass for inference.** +3. **Defining the training step.** +4. **Configuring the optimizers.** +5. **Setting the operational aspects.** + +Let's now take a look at each individual one in more detail. Open up a Terminal and write some code. + +### Creating the LightningModule + +The first step involves specifying all the imports and creating the class that implements the `LightningModule` class. + +With respect to the imports, we can say that we import the default modules. We will need `os` for dataset related activities. We use `torch` and its lower-level imports for PyTorch related aspects: + +- The `nn` import defines building blocks for our neural network. +- The `DataLoader` can be used for loading the dataset into the model when training. +- From `torchvision`, we import both the `MNIST` dataset and `transforms`, the latter of which will be used for transforming the dataset into proper Tensor format later. +- Finally, we import PyTorch Lightning as `pl`. + +Once this is completed, we can create the `LightningModule`. In fact, we create a class - called `MNISTNetwork` that implements the `LightningModule` class and hence has to implement many of its functions as well. The first definition that we implement is `__init__`, or the constructor function if you are familiar with object-oriented programming. Here, we: + +- Also initialize the super class i.e. the instantiation of `pl.LightningModule` using `super().__init__()`. +- Define the neural network: using `nn.Sequential`, we can add our neural layers on top of each other. In this network, we're going to use three `Linear` layers that have ReLU activation functions and one final `Linear` layer. + +``` +import os +import torch +from torch import nn + +from torch.utils.data import DataLoader +from torchvision.datasets import MNIST +from torchvision import transforms +import pytorch_lightning as pl + +class MNISTNetwork(pl.LightningModule): + + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(28 * 28, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + self.ce = nn.CrossEntropyLoss() +``` + +### Defining the forward step for inference + +The second step is to define the `forward` step that is used during inference. In other words, if a sample is passed into the model, you define here what should happen. + +In our case, that is a pass of the input sample through our layers, and the output is returned. + +``` + def forward(self, x): + return self.layers(x) +``` + +### Defining the training step + +The third step is to define the training step, by means of the `training_step` definition. This definition accepts `batch` and `batch_idx` variables, where `batch` represents the items that are to be processed during this training step. + +We first decompose the batch into `x` and `y` values, which contain the inputs and targets, respectively. + +``` + def training_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.ce(y_hat, y) + self.log('train_loss', loss) + return loss +``` + +### Configuring the optimizers + +We can then configure the optimizer. 
In this case, we use the Adam optimizer - which is a very common optimizer - and return it in our `configure_optimizers` definition. We set the default learning rate to \[latex\]10^-4\[/latex\] and let it use the model's parameters. + +``` + def configure_optimizers(self): + optimizer = torch.optim.Adam(self.parameters(), lr=1e-4) + return optimizer +``` + +### Setting the operational aspects + +That's it for creating the model. However, that's also only what we've got so far. We must add a few more things: loading and preparing the dataset, initializing the neural network, initializing the `Trainer` object (recall that it's a PyTorch Lightning feature that helps us automate aspects of the training process), and finally fitting the data. + +We wrap all these aspects in `if __name__ == '__main__':`: + +- Into `dataset`, we assign the `MNIST` dataset, which we download when it's not on our system and Transform into Tensor format using `transform=transforms.ToTensor()`. +- We use `pl.seed_everything(42)` to set a random seed for our pseudo-random number generator. This ensures full reproducibility regardless of pseudo-random number initialization (PyTorch Lightning, n.d.). +- We initialize the `MNISTNetwork` so that we can use our neural network. +- We initialize the PyTorch Lightning `Trainer` and instruct it to automatically scale batch size based on the hardware characteristics of our system. In addition, we instruct it to use a GPU device for training. If you don't have a dedicated GPU, you might use the CPU for training instead. In that case, simply remove `gpus=1`. Finally, we set `deterministic=True` to ensure reproducibility of the model (PyTorch LIghtning, n.d.). +- Finally, we apply `.fit(..)` and fit the `dataset` to the `neuralnetwork` by means of a `DataLoader`. + +``` +if __name__ == '__main__': + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + neuralnetwork= MNISTNetwork() + + trainer = pl.Trainer(auto_scale_batch_size='power',gpus=1,deterministic=True) + trainer.fit(neuralnetwork, DataLoader(dataset)) +``` + +### Full model code + +Here's the full model code, for those who want to copy it and get started immediately. + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +import pytorch_lightning as pl + +class MNISTNetwork(pl.LightningModule): + + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(28 * 28, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + self.ce = nn.CrossEntropyLoss() + + def forward(self, x): + return self.layers(x) + + def training_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.ce(y_hat, y) + self.log('train_loss', loss) + return loss + + def configure_optimizers(self): + optimizer = torch.optim.Adam(self.parameters(), lr=1e-4) + return optimizer + + +if __name__ == '__main__': + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + pl.seed_everything(42) + neuralnetwork = MNISTNetwork() + trainer = pl.Trainer(auto_scale_batch_size='power',gpus=1,deterministic=True) + trainer.fit(neuralnetwork, DataLoader(dataset)) +``` + +* * * + +## Summary + +PyTorch is one of the leading frameworks for machine learning these days, besides TensorFlow. In this article, we have started with PyTorch and showed you how you can get started too. 
First of all, we noticed that there are layers on top of PyTorch that can make your life easier as a PyTorch developer. We saw that with PyTorch Lightning, you don't have to worry about the organization of your code, parallelism of the training process, GPU deployment of your Tensors. In fact, many parts of the training process are automated away. + +We then saw that a PyTorch Lightning module is called a `LightningModule` and that it consists of a few common building blocks that make it work. With the `__init__` definition, you can initialize the module, e.g. specifying the layers of your neural network. `Forward` can be used for specifying what should happen upon inference, i.e. when new samples are passed through the model. The `training_step`, `testing_step` and `validation_step` definitions describe what happens during the training, testing or validation steps, respectively. Finally, with `configure_optimizers`, you can choose what optimizer must be used for training the neural network and how it must be configured. + +In an example implementation of a PyTorch model, we looked at how to construct a neural network using PyTorch in a step-by-step fashion. We saw that it's quite easy to do so once you understand the basics of neural networks and the way in which LightningModules are constructed. In fact, with our neural network, a classifier can be trained that is capable of classifying the MNIST dataset. + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope that this tutorial was useful! If you learned something, please feel free to leave a comment in the comments section 💬 Please do the same if you have questions, or leave a question through the **Ask Questions** button on the right. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +PyTorch Lightning. (2021, January 12). [https://www.pytorchlightning.ai/](https://www.pytorchlightning.ai/) + +PyTorch Lightning. (n.d.). _LightningModule — PyTorch lightning 1.1.4 documentation_. PyTorch Lightning Documentation — PyTorch Lightning 1.1.4 documentation. [https://pytorch-lightning.readthedocs.io/en/stable/lightning\_module.html](https://pytorch-lightning.readthedocs.io/en/stable/lightning_module.html) + +PyTorch Lightning. (n.d.). _Trainer — PyTorch lightning 1.1.4 documentation_. PyTorch Lightning Documentation — PyTorch Lightning 1.1.4 documentation. [https://pytorch-lightning.readthedocs.io/en/latest/trainer.html](https://pytorch-lightning.readthedocs.io/en/latest/trainer.html) + +PyTorch. (n.d.). _CUDA semantics — PyTorch 1.7.0 documentation_. [https://pytorch.org/docs/stable/notes/cuda.html](https://pytorch.org/docs/stable/notes/cuda.html) + +PyTorch Lightning. (n.d.). _Multi-GPU training — PyTorch lightning 1.1.4 documentation_. PyTorch Lightning Documentation — PyTorch Lightning 1.1.4 documentation. 
[https://pytorch-lightning.readthedocs.io/en/latest/multi\_gpu.html](https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html) diff --git a/gradient-descent-and-its-variants.md b/gradient-descent-and-its-variants.md new file mode 100644 index 0000000..d391ec1 --- /dev/null +++ b/gradient-descent-and-its-variants.md @@ -0,0 +1,162 @@ +--- +title: "Gradient Descent and its variants" +date: "2019-10-24" +categories: + - "buffer" + - "deep-learning" +tags: + - "deep-learning" + - "gradient-descent" + - "machine-learning" + - "minibatch-gradient-descent" + - "optimizer" + - "stochastic-gradient-descent" +--- + +When you're creating machine learning models, people say that you're _training_ the model when you're using supervised approaches such as classification and regression. + +In a different post, we've seen how the [high-level supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) ensures that models can train: by feeding the model examples that represent the statistical population you wish to classify or regress from/to, the model will iteratively adapt its internals (often, its weights) and hence _learn_ to discover the patterns hidden in the dataset. + +It does so by computing a loss value (check the link if you wish to know more about loss functions) which tells you something about how poorly the model performs. This loss value, which essentially represents the error, can further be used to optimize the model. Let's now zoom into neural networks. + +How do they optimize? + +That's an interesting question. + +In today's blog, we'll cover three variants of optimizers that have been here for some time now: **Batch Gradient Descent**, **Stochastic Gradient Descent** and **Minibatch Gradient Descent**. Although many problems have been discovered with these approaches and many other optimizers are now available in contemporary deep learning frameworks, they're still of value and still being used when they can. + +We do so by first introducing the concept of gradient descent intuitively - followed by the three variants outlined above. Subsequently, we will discuss its practical use in today's deep learning scenarios: although it is relatively slow, it is accurate (given the quality of your dataset, of course). In fact, it can be more accurate than more contemporary methods such as Adam, which belong to the class of _adaptive optimizers_. We'll take a look at them too, introducing them so that we can cover them better in a later blog post! + +**After reading this article, you will understand...** + +- What gradient descent does for optimization. +- How gradient descent relates to backpropagation. +- What variants of gradient descent (batch, stochastic and minibatch) there are and how they work. + +All right, let's go 😊 + +**Update 01/Mar/2021:** ensure that article is up to date for 2021. + +* * * + +\[toc\] + +* * * + +## Gradient descent, intuitively + +Let's first take a look at what gradient descent is in an intuitive way. + +Suppose that a helicopter drops you at the summit of a mountain. Upon departure, the helicopter pilot gives you a letter which instructs you to do as follows: **to get to the valley, which you see when you look down the mountain, as soon as possible.** + +However, it also instructs you to do this: + +1. To get down safely. +2. To take the quickest path, but only if it helps you in the long run. +3. To evaluate your plans each time you make a step, and to change them if necessary. 
+ +Motivated to successfully complete your mission, you start moving. Literally at every step you question whether you're still moving in the right direction, whether you're safe and whether you're not taking a slower path than necessary. + +And indeed, after some time, you arrive in the valley safely, where you are greeted by many villagers who were informed about your challenging trek when they heard a helicopter landed in their village awaiting a mountaineer's arrival. + +You successfully completed your trip and your knowledge about the mountain is now always reusable, so that you can tell new mountaineers whether they can undertake the same trip, and if so, how they do it best. + +### From mountains to gradient descent + +Obviously, we're not talking about real mountains here - rather, I'm giving you an analogy for what happens during gradient descent. + +In the blog post describing the [high level machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) for supervised learning problems, we saw that when the _forward pass_ is made, a loss value is computed. This loss value is effectively a _mathematical function_ that, given the parameters being input (i.e., everything from the forward pass) outputs the numeric loss value. + +When visualizing this function, you effectively visualize what looks like the _loss landscape_. And as you can see, loss landscapes can look substantially like mountaineering problems: + +![](images/resnet56_noshort_small.jpg) + +A visualization of a neural network's loss landscape. It really looks like a mountainous path. The goal of the model: to descent as efficiently as possible, without risking that it gets stuck in local valleys. + +Copyright (c) 2017 Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer and Tom Goldstein / [loss-landscape](https://github.com/tomgoldstein/loss-landscape) library. Licensed under the [MIT License](https://github.com/tomgoldstein/loss-landscape/blob/master/LICENSE). + +The analogy between the mountain story and loss functions is now hopefully clear. + +Gradient descent can be visualized as _mountain descent_: the goal is to navigate the loss landscape, moving towards the valley, while doing so efficiently yet cautiously: you don't want to get stuck in one of the intermediate valleys, where you cannot escape from (Ruder, 2016). + +Now that we understand what gradient descent is, let's take a look at how it relates to another well-covered aspect of neural networks, being _backpropagation_. + +### How gradient descent relates to backpropagation + +Let's go back to that mountain story 😄 + +You were given a set of conditions which describe _how you have to move_ when you exited the helicopter. Or, in terms of the machine learning model, when your loss value (the error function) has been computed. + +Gradient descent will ensure that you'll walk towards the mathematical optimum, or the valley in mountain terms, so that you can arrive at a machine learning model that is useful to practice (Garcia, n.d.). How it achieves that? It will adapt the parameters, or the _weights of the neurons_, based on some gradient multiplied with a learning rate (i.e., how much you'll have to improve; Proroković (n.d.)). + +But how does the neural network know _how much to improve in the first place?_ Or: how large the step should be when you're descending that mountain, while _actually setting the step_ is gradient descent? 
+ +This is the role of backpropagation (Garcia, n.d.; Proroković, n.d.). + +Given the error at the end of your neural network, it propagates that error backwards, computing a gradient for the parameters (or neural weights) of every layer. As you can imagine, for layers that lie multiple steps away from the loss function, the gradient is dependent on the previous layers, just as an actual step would change if you knew more about the terrain that was ahead of you. + +Together, backpropagation and gradient descent will ensure that you arrive in the valley safely: backpropagation computes how large your step (or neural weight update) should be (multiplied by something called the learning rate, which reduces the length of your step, as "baby steps induce more safety"), while gradient descent actually changes the weights - or takes the step, in mountain terms. + +![](images/adult-adventure-backpack-287240-1024x767.jpg) + +In my opinion, walking down a mountainous path is one of the most powerful analogies I found when understanding gradient descent and backprop myself. Photographer: Nina Uhlíková, Slovakia, Pexels License. + +* * * + +## Variants of gradient descent + +Now that we understand both gradient descent and backpropagation, as well as their role in optimizing a neural network, we can look at the oldest variants of gradient descent. Don't worry, though - the fact that they are old does not mean that they are no longer relevant today 😄 + +We'll cover Batch Gradient Descent, Stochastic Gradient Descent and Minibatch Gradient Descent. + +### Batch Gradient Descent + +Suppose that you're walking down that mountain again. If your goal is to walk down the mountain both _efficiently_ and _safely_, you may decide to sacrifice one in order to maximize the other. + +Since you don't want to fall off that mountain, you sacrifice efficiency instead of safety. + +Well, but how can we do that? + +One way would be to make an educated guess about the structure of the mountain. Based on everything you've seen so far, you might be able to extrapolate towards all possible steps you could take, and compute one highly informed step that you'll take no matter what. + +This is effectively what batch gradient descent does: it uses the entire batch of training data to compute a single, highly accurate step, since it takes into account maximum information about the environment (it's being informed by all the training samples). However, it's very slow (Ruder, 2016). In fact, it can be so slow that online learning (i.e., adapting your model on the fly, when new data comes in) becomes impossible. + +If you want maximally accurate steps, choose batch gradient descent (in Keras this can be done by setting `batch_size` to `len(training_samples)`, for example). If you don't, make sure to read on 😉 + +### Stochastic Gradient Descent + +Another approach would be to take _very small_ steps - baby steps, indeed. You do this by computing a parameter update for each individual sample in your training set (in Keras: `batch_size = 1`). This is very fast, since you don't have to use a lot of information, but it comes at a cost. + +...the cost being the safety and long-term efficiency of your mountainous descent. While speed is increased substantially, it gets easier to misstep towards a local valley from which you cannot escape once you arrive there.
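To make these two extremes a bit more tangible, here is a minimal, self-contained Keras sketch. Note that the dataset, the architecture and the hyperparameters below are dummy placeholders chosen purely for illustration - the only thing that matters here is the `batch_size` argument:

```
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Dummy data: 1000 samples with 10 features and a binary target.
X_train = np.random.rand(1000, 10)
y_train = np.random.randint(0, 2, size=(1000,))

# A very simple placeholder model.
model = Sequential([
    Dense(16, activation='relu', input_shape=(10,)),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='sgd', loss='binary_crossentropy')

# Batch gradient descent: one parameter update per epoch, computed over all samples.
model.fit(X_train, y_train, batch_size=len(X_train), epochs=5)

# Stochastic gradient descent: one parameter update per individual sample.
model.fit(X_train, y_train, batch_size=1, epochs=5)
```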
+ +Hence, stochastic gradient descent can be a good idea if you don't care about maximum accuracy but do care about speed, or when your loss landscape has a clear global minimum without many local minima. In any other case, it may be wise to benefit from the best of both worlds - with minibatch gradient descent. + +### Minibatch Gradient Descent + +In Minibatch Gradient Descent, you set `batch_size` neither to 1 nor to `len(training_samples)`. Instead, you look at a small number of samples at once, leaving the rest for subsequent updates. This helps you reduce the variance of the parameter updates (Ruder, 2016). That is, since stochastic gradient descent works with very limited information, its updates will drift around a bit. While batch gradient descent computes a very sharp path but is often too slow to be practical, minibatch gradient descent allows you to reduce this drifting ('variance') while being faster than the batch approach. + +While the size of your minibatches varies with the idiosyncrasies of your machine learning problem, generally speaking, an acceptable size would be somewhere between 50 and 256 samples (Ruder, 2016). It is likely the preferred approach when you're training neural networks. + +* * * + +## Summary & what's next: Adaptive methods + +In this blog post, we've intuitively introduced gradient descent. Subsequently, we described three variants of traditional gradient descent, which help you to get a basic feeling for which one to choose in your machine learning setting. + +Gradient descent methods are actually quite old (Ruder, 2016). In fact, many extensions have been proposed and implemented that fix the challenges of gradient descent that were discovered while it was in active use. In another blog post, which extends this one, we will cover these challenges and the majority of today's common optimizers that are different from, but similar to, traditional gradient descent. + +If you have any questions or wish to express your remarks, please feel free to leave a comment in the comments box below! 👇 I'll happily answer and adapt my post where necessary. Thanks 😊, and happy coding! + +* * * + +## References + +Ruder, S. (2016). An overview of gradient descent optimization algorithms. _arXiv preprint [arXiv:1609.04747](https://arxiv.org/abs/1609.04747)_. + +Gradient descent. (2003, March 26). Retrieved from [https://en.wikipedia.org/wiki/Gradient\_descent](https://en.wikipedia.org/wiki/Gradient_descent) + +Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018). [Visualizing the loss landscape of neural nets.](http://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets) In _Advances in Neural Information Processing Systems_ (pp. 6389-6399). + +Garcia, F. (n.d.). Francisco Garcia's answer to What is the difference between backpropagation and gradient descent when training a deep learning neural network? Which of the two is Tensorflow using? Retrieved from [https://www.quora.com/What-is-the-difference-between-backpropagation-and-gradient-descent-when-training-a-deep-learning-neural-network-Which-of-the-two-is-Tensorflow-using/answer/Francisco-Garcia-52](https://www.quora.com/What-is-the-difference-between-backpropagation-and-gradient-descent-when-training-a-deep-learning-neural-network-Which-of-the-two-is-Tensorflow-using/answer/Francisco-Garcia-52) + +Proroković, K. (n.d.). Krsto Proroković's answer to What is the difference between gradient descent and back propagation in deep learning?
Are they not the same thing? Retrieved from [https://www.quora.com/What-is-the-difference-between-gradient-descent-and-back-propagation-in-deep-learning-Are-they-not-the-same-thing/answer/Krsto-Prorokovi%C4%87](https://www.quora.com/What-is-the-difference-between-gradient-descent-and-back-propagation-in-deep-learning-Are-they-not-the-same-thing/answer/Krsto-Prorokovi%C4%87) diff --git a/greedy-layer-wise-training-of-deep-networks-a-pytorch-example.md b/greedy-layer-wise-training-of-deep-networks-a-pytorch-example.md new file mode 100644 index 0000000..2c9f21a --- /dev/null +++ b/greedy-layer-wise-training-of-deep-networks-a-pytorch-example.md @@ -0,0 +1,703 @@ +--- +title: "Greedy layer-wise training of deep networks, a PyTorch example" +date: "2022-01-24" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "exploding-gradients" + - "greedy-layer-wise-training" + - "machine-learning" + - "neural-networks" + - "pytorch" + - "vanishing-gradients" +--- + +In the _old days_ of deep learning, pracitioners ran into many problems - vanishing gradients, exploding gradients, a non-abundance of compute resources, and so forth. In addition, not much was known about the theoretic behavior of neural networks, and by consequence people frequently didn't know _why_ their model worked. + +While that is still the case for many models these days, much has improved, but today's article brings a practical look to a previous fix that remains useful, even today. You're going to take a look at **greedy layer-wise training of a PyTorch neural network** using a practical point of view. Firstly, we'll briefly explore greedy layer-wise training, so that you can get a feeling about what it involves. Then, we continue with a Python example - by building and training a neural network greedily and layer-wise ourselves. + +Are you ready? Let's take a look! 😎 + +- If you want to build a neural network using greedy layer-wise training with TensorFlow and Keras, [take a look at this article](https://www.machinecurve.com/index.php/2022/01/09/greedy-layer-wise-training-of-deep-networks-a-tensorflow-keras-example/). + +* * * + +\[toc\] + +* * * + +## What is greedy layer-wise training? + +In the early days of deep learning, an abundance of resources was not available when training a deep learning model. In addition, deep learning practitioners suffered from the vanishing gradients problem and the exploding gradients problem. + +This was an unfortunate combination when one wanted to train a model with increasing depth. What depth would be best? From what depth would we suffer from vanishing and/or exploding gradients? And how can we try to find out without _wasting_ a lot of resources? + +**Greedy layer-wise training** of a neural network is one of the answers that was posed for solving this problem. By adding a hidden layer every time the model finished training, it becomes possible to find what depth is adequate given your training set. + +It works really simply. You start with a simple neural network - an input layer, a hidden layer, and an output layer. You train it for a fixed number of epochs - say, 25. Then, after training, you **freeze all the layers**, except for the last one. In addition, you cut it off the network. At the tail of your cutoff network, you now add a new layer - for example, a densely-connected one. You then re-add the trained final layer, and you end up with a network that is one layer deeper. 
In addition, because all layers except for the last two are frozen, your progress so far will help you to train the final two better. + +The idea behind this strategy is to find an optimum number of layers for training your neural network. + +![](images/greedy.drawio-1024x336.png) + +* * * + +## Implementing greedy layer-wise training with PyTorch + +Let's now take a look at how you can implement greedy layer-wise training with PyTorch. Even though the strategy is really old (in 2022, it's 15 years ago that it was proposed!), there are cases when it may be really useful today. + +Implementing greedy layer-wise training with PyTorch involves multiple steps: + +1. Importing all dependencies, including PyTorch. +2. Defining the `nn.Module` structure; in other words, your PyTorch model. +3. Creating a definition for getting the global configuration. +4. Creating another one for getting the model configuration. +5. Retrieving the DataLoader through another definition. +6. Writing a definition for adding a layer to an existing model, while freezing all existing layers. +7. Creating a definition for training a model. +8. Wrapping everything together. + +### Model imports + +Let's begin writing some code. Open up a Python supporting IDE, create a file - say, `greedy.py` - or a Jupyter Notebook, and add the following imports: + +``` +import os +import torch +from torch import nn +from torchvision.datasets import CIFAR10 +from torch.utils.data import DataLoader +from torchvision import transforms +from collections import OrderedDict +from accelerate import Accelerator +``` + +You will use the following dependencies: + +- `os`, which is a Python dependency for Operating System calls. For this reason, you'll need to make sure that you have a recent version of Python installed, too. +- PyTorch, which is represented in the `torch` package. Besides the package itself, you will also import the `CIFAR10` dataset (which you will train today's model with) and the `DataLoader`, which is used for loading the training data. +- From `torchvision`, a sub package that must be installed jointly with PyTorch, you will import `transforms`, which is used for transforming the input data into Tensor format, and allows you to perform additional transformations otu of the box. +- From `collections`, you import an ordered dictionary - `OrderedDict`. You will see that it will play a big role in structuring the layers of your neural network. It is a default Python API, so if you have installed Python, nothing else needs to be installed. +- Finally, you will import `Accelerator` - which is the [HuggingFace Accelerate](https://www.machinecurve.com/index.php/2022/01/07/quick-and-easy-gpu-tpu-acceleration-for-pytorch-with-huggingface-accelerate/) package. It can be used to relieve you from all the `.to(cuda)` calls, moving your data and your model to your CUDA device if available. It handles everything out of the box! Click the link if you want to understand it in more detail. + +![](images/cifar10_images.png) + +Samples from the CIFAR-10 dataset, which is what you will use for training today's model. + +### Defining the nn.Module + +Now that you know what you will use, it's time to actually define your neural network. Here's the full code, which you'll learn more about after the code segment: + +``` +class LayerConfigurableMLP(nn.Module): + ''' + Layer-wise configurable Multilayer Perceptron. 
+ ''' + def __init__(self, added_layers = 0): + super().__init__() + + # Retrieve model configuration + config = get_model_configuration() + shape = config.get("width") * config.get("height") * config.get("channels") + layer_dim = config.get("layer_dim") + num_classes = config.get("num_classes") + + # Create layer structure + layers = [ + (str(0), nn.Flatten()), + (str(1), nn.Linear(shape, layer_dim)), + (str(2), nn.ReLU()) + ] + + # Create output layers + layers.append((str(3), nn.Linear(layer_dim, num_classes))) + + # Initialize the Sequential structure + self.layers = nn.Sequential(OrderedDict(layers)) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + + def set_structure(self, layers): + self.layers = nn.Sequential(OrderedDict(layers)) +``` + +Let's break this class apart by its definitions - `__init__`, `forward` and `set_structure`. + +- Each class must have a **constructor**. In Python classes, this is the `__init__` definition. In ours, which is the constructor for the `nn.Module` ([the base PyTorch class for a neural network](https://www.machinecurve.com/index.php/2021/07/20/how-to-create-a-neural-network-for-regression-with-pytorch/)), the constructor does the following: + - First, it retrieves the configuration - because we will need some items from it. + - We compute the full dimensionality should our Tensor be one-dimensional, which is a simple multiplication of width, height and the number of channels. + - We retrieve the `layer_dim`, which is the dimensionality of each hidden layer - including the layers that we will add later, during greedy layer-wise training. + - The `num_classes` represents the number of output classes. In the case of the CIFAR-10 dataset, that's ten classes. + - Then, you create the **basic layer structure**. It is composed of a `Flatten` layer, which flattens each three-dimensional input Tensor (width, height, channels) into a one-dimensional Tensor (hence the multiplication). This is bad practice in neural networks, because we have convolutional layers for learning features from such image-like data, but for today's model, we simply flatten it - because it's about the _greedy layer-wise training_ rather than convolutions. + - After the `Flatten` layer, you will add a `Linear` layer. This layer has `shape` inputs and produces `layer_dim` outputs. It is followed by a ReLU activation function for nonlinearity. + - Then, you will also add the **output layer**, which converts `layer_dim` input dimensionality into `num_classes` - after which a Softmax activation can be applied by the loss function. + - The keys of each element represents the position of the layer in your neural network structure. You will see now why this is necessary: the `nn.Sequential` layer is built up by an `OrderedDict`, created from the `layers`. Normally, using such dictionaries is not necessary, but to preserve order when adding layers later, we do need it now. +- The constructor is followed by the `forward` definition - which represents the **forward pass**, to speak in deep learning language. It simply passes the input Tensor `x` through the layers, and returns the result. +- Finally, there is an additional definition - `set_structure` - which you don't see in neural networks often. It simply takes a new `layers` structure, creates an `OrderedDict` from it, and replaces the layers with the new structure. You will see later how this is used.! + +### Getting the global configuration + +First, however, let's create a definition with **global settings**. 
+ +``` +def get_global_configuration(): + """ Retrieve configuration of the training process. """ + + global_config = { + "num_layers_to_add": 10, + } + + return global_config +``` + +It's pretty simple - the global configuration specifies the number of layers that must be added. For your model, this means that a base model will be trained at first, after which another layer will be added and training will be continued; another; another, and so forth, until 10 such iterations have been performed. + +### Getting the model configuration + +The **model configuration** is a bit more complex - it specifies all the settings that are necessary for successsfully training your model. In addition, these settings are _model specific_ rather than specific to the _training process_. + +For example, through the `width`, `height` and `channels`, the shape of your image Tensor is represented. Indeed, a CIFAR-10 sample is a 32 x 32 pixels image with 3 channels. The number of classes in the output is 10, and we use a 250-sample batch size when training. We also specify (but not initialize!) the loss function and optimizer. We use `CrossEntropyLoss` for [computing how poorly the model performs.](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#loss) + +> This criterion combines [`nn.LogSoftmax()`](https://pytorch.org/docs/stable/nn.html#logsoftmax) and [`nn.NLLLoss()`](https://pytorch.org/docs/stable/nn.html#nllloss) in one single class. +> +> PyTorch docs + +Using `CrossEntropyLoss` is also why we don't use Softmax activation in our layer structure! This [PyTorch loss function](https://www.machinecurve.com/index.php/2021/07/19/how-to-use-pytorch-loss-functions/) combines both softmax and NLL loss and hence pushes Softmax computation to the loss function, which is more stable numerically. + +For optimization, we use Adam, which is an adaptive optimizer and one of the default optimizers that are used in neural networks these days. + +For educational purposes, we set `num_epochs` to 1 - to allow you to walk through greedy layer-wise training quickly. However, a better setting would be `num_epochs = 5`, or `num_epochs = 25`. + +Finally, you set the `layer_dim` to 256. This is the dimensionality of all hidden layers. Obviously, if you want to have a varying layer dimensionality or a different approach, you can alter layer construction and have it your way - but for today's example, having hidden layers with equal dimensionality is the simplest choice :) + +``` +def get_model_configuration(): + """ Retrieve configuration for the model. """ + + model_config = { + "width": 32, + "height": 32, + "channels": 3, + "num_classes": 10, + "batch_size": 250, + "loss_function": nn.CrossEntropyLoss, + "optimizer": torch.optim.Adam, + "num_epochs": 1, + "layer_dim": 256 + } + + return model_config +``` + +### Retrieving the DataLoader + +Now that you have specified global and model configurations, it's time to retrieve the `DataLoader`. + +Its functionality is pretty simple - it initializes the `CIFAR10` dataset with a simple `ToTensor()` transform applied, and inits a `DataLoader` which constructs _shuffled_ batches per your batch size configuration. 
+ +``` +def get_dataset(): + """ Load and convert dataset into inputs and targets """ + config = get_model_configuration() + dataset = CIFAR10(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=config.get("batch_size"), shuffle=True, num_workers=1) + + return trainloader +``` + +### Adding a layer to an existing model + +Next up is adding a layer to an `existing` model. + +Recall that greedy layer-wise training involves training a model for a full amount of epochs, after which a layer is added, while all trained layers (except for the last layer) are set to nontrainable. + +This means that you will need functionality which: + +- Retrieves the current layer structure. +- Saves the last layer for adding it back later. +- Sets all layer parameters to nontrainable (a.k.a. freeze the layers), while appending them to your new layer structure. +- Adds a brand new, untrained layer to your new layer structure. +- Re-adds the last layer saved previously. +- Changes the model structure. + +Here's the definition which performs precisely that. It first retrieves the current layers, prints them to your terminal, saves the last layer, and defines a new layer structure to which all existing layers (except for the last one) are added. These layers are also made nontrainable by setting `requires_grad` to `False`. + +When these have been added, a brand new hidden layer that respects the `layer_dim` configuration is added to your new layer structure. Finally, the last layer is re-added, and the `model` structure is changed (indeed, via `set_structure`). Now, you hopefully realize too why we're using the `OrderedDict` - the keys of this dictionary simply specify the layer order of your new `nn.Sequential` structure, allowing the layers to be added properly. + +Finally, after restructuring your model, you simply return it for later usage. + +``` +def add_layer(model): + """ Add a new layer to a model, setting all others to nontrainable. """ + config = get_model_configuration() + + # Retrieve current layers + layers = model.layers + print("="*50) + print("Old structure:") + print(layers) + + # Save last layer for adding later + last_layer = layers[-1] + + # Define new structure + new_structure = [] + + # Iterate over all except last layer + for layer_index in range(len(layers) - 1): + + # For old layer, set all parameters to nontrainable + old_layer = layers[layer_index] + for param in old_layer.parameters(): + param.requires_grad = False + + # Append old layer to new structure + new_structure.append((str(layer_index), old_layer)) + + # Append new layer to the final intermediate layer + new_structure.append((str(len(new_structure)), nn.Linear(config.get("layer_dim"), config.get("layer_dim")))) + + # Re-add last layer + new_structure.append((str(len(new_structure)), last_layer)) + + # Change the model structure + model.set_structure(new_structure) + + # Return the model + print("="*50) + print("New structure:") + print(model.layers) + + return model +``` + +### Training a model + +The next definitions is a pretty default PyTorch training loop. + +- You specify the loss at 0.0, iterate over the number of epochs and per epoch over the data loader, feed forward the data, compute loss and perform optimization. + +Do note that you're using the HuggingFace Accelerate way of optimization: you first prepare the `model`, `optimizer` and `trainloader` with `accelerator.prepare(...)`, and then perform the backward pass with `accelerator`, too. 
+ +In the end, you return the trained `model` as well as the loss value at the end of training, so that you can compare it with the loss value of the next set of epochs, with yet another layer added. This allows you to see whether adding layers yields better performance or whether you've reached layer saturation for your training scenario. + +``` +def train_model(model): + """ Train a model. """ + config = get_model_configuration() + loss_function = config.get("loss_function")() + optimizer = config.get("optimizer")(model.parameters(), lr=1e-4) + trainloader = get_dataset() + accelerator = Accelerator() + + # Set current loss value + end_loss = 0.0 + + # Accelerate model + model, optimizer, trainloader = accelerator.prepare(model, optimizer, trainloader) + + # Iterate over the number of epochs + for epoch in range(config.get("num_epochs")): + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = model(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + accelerator.backward(loss) + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + end_loss = current_loss / 500 + current_loss = 0.0 + + # Return trained model + return model, end_loss +``` + +### Wrapping everything together + +Finally, it's time to wrap all the definitions together into a working whole. + +In the `greedy_layerwise_training` def, you load the global config, initialize your MLP, and iterate over the number of layers that must be added, adding one more at each step. Then, for each layer configuration, you train the model and compare loss. + +When you run your Python script, you call `greedy_layerwise_training()` for training your neural network in a greedy layer-wise fashion. + +``` +def greedy_layerwise_training(): + """ Perform greedy layer-wise training. 
""" + global_config = get_global_configuration() + torch.manual_seed(42) + + # Initialize the model + model = LayerConfigurableMLP() + + # Loss comparison + loss_comparable = 0.0 + + # Iterate over the number of layers to add + for num_layers in range(global_config.get("num_layers_to_add")): + + # Print which model is trained + print("="*100) + if num_layers > 0: + print(f">>> TRAINING THE MODEL WITH {num_layers} ADDITIONAL LAYERS:") + else: + print(f">>> TRAINING THE BASE MODEL:") + + # Train the model + model, end_loss = train_model(model) + + # Compare loss + if num_layers > 0 and end_loss < loss_comparable: + print("="*50) + print(f">>> RESULTS: Adding this layer has improved the model loss from {loss_comparable} to {end_loss}") + loss_comparable = end_loss + elif num_layers > 0: + print("="*50) + print(f">>> RESULTS: Adding this layer did not improve the model loss.") + elif num_layers == 0: + loss_comparable = end_loss + + # Add layer to model + model = add_layer(model) + + # Process is complete + print("Training process has finished.") + + +if __name__ == '__main__': + greedy_layerwise_training() + +``` + +### Full model code + +If you want to get started immediately, this is the full code for **greedy layer-wise training with PyTorch:** + +``` +import os +import torch +from torch import nn +from torchvision.datasets import CIFAR10 +from torch.utils.data import DataLoader +from torchvision import transforms +from collections import OrderedDict +from accelerate import Accelerator + +class LayerConfigurableMLP(nn.Module): + ''' + Layer-wise configurable Multilayer Perceptron. + ''' + def __init__(self, added_layers = 0): + super().__init__() + + # Retrieve model configuration + config = get_model_configuration() + shape = config.get("width") * config.get("height") * config.get("channels") + layer_dim = config.get("layer_dim") + num_classes = config.get("num_classes") + + # Create layer structure + layers = [ + (str(0), nn.Flatten()), + (str(1), nn.Linear(shape, layer_dim)), + (str(2), nn.ReLU()) + ] + + # Create output layers + layers.append((str(3), nn.Linear(layer_dim, num_classes))) + + # Initialize the Sequential structure + self.layers = nn.Sequential(OrderedDict(layers)) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + + def set_structure(self, layers): + self.layers = nn.Sequential(OrderedDict(layers)) + + +def get_global_configuration(): + """ Retrieve configuration of the training process. """ + + global_config = { + "num_layers_to_add": 10, + } + + return global_config + + +def get_model_configuration(): + """ Retrieve configuration for the model. """ + + model_config = { + "width": 32, + "height": 32, + "channels": 3, + "num_classes": 10, + "batch_size": 250, + "loss_function": nn.CrossEntropyLoss, + "optimizer": torch.optim.Adam, + "num_epochs": 1, + "layer_dim": 256 + } + + return model_config + + +def get_dataset(): + """ Load and convert dataset into inputs and targets """ + config = get_model_configuration() + dataset = CIFAR10(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=config.get("batch_size"), shuffle=True, num_workers=1) + + return trainloader + + +def add_layer(model): + """ Add a new layer to a model, setting all others to nontrainable. 
""" + config = get_model_configuration() + + # Retrieve current layers + layers = model.layers + print("="*50) + print("Old structure:") + print(layers) + + # Save last layer for adding later + last_layer = layers[-1] + + # Define new structure + new_structure = [] + + # Iterate over all except last layer + for layer_index in range(len(layers) - 1): + + # For old layer, set all parameters to nontrainable + old_layer = layers[layer_index] + for param in old_layer.parameters(): + param.requires_grad = False + + # Append old layer to new structure + new_structure.append((str(layer_index), old_layer)) + + # Append new layer to the final intermediate layer + new_structure.append((str(len(new_structure)), nn.Linear(config.get("layer_dim"), config.get("layer_dim")))) + + # Re-add last layer + new_structure.append((str(len(new_structure)), last_layer)) + + # Change the model structure + model.set_structure(new_structure) + + # Return the model + print("="*50) + print("New structure:") + print(model.layers) + + return model + + + + +def train_model(model): + """ Train a model. """ + config = get_model_configuration() + loss_function = config.get("loss_function")() + optimizer = config.get("optimizer")(model.parameters(), lr=1e-4) + trainloader = get_dataset() + accelerator = Accelerator() + + # Set current loss value + end_loss = 0.0 + + # Accelerate model + model, optimizer, trainloader = accelerator.prepare(model, optimizer, trainloader) + + # Iterate over the number of epochs + for epoch in range(config.get("num_epochs")): + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = model(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + accelerator.backward(loss) + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + end_loss = current_loss / 500 + current_loss = 0.0 + + # Return trained model + return model, end_loss + + +def greedy_layerwise_training(): + """ Perform greedy layer-wise training. 
""" + global_config = get_global_configuration() + torch.manual_seed(42) + + # Initialize the model + model = LayerConfigurableMLP() + + # Loss comparison + loss_comparable = 0.0 + + # Iterate over the number of layers to add + for num_layers in range(global_config.get("num_layers_to_add")): + + # Print which model is trained + print("="*100) + if num_layers > 0: + print(f">>> TRAINING THE MODEL WITH {num_layers} ADDITIONAL LAYERS:") + else: + print(f">>> TRAINING THE BASE MODEL:") + + # Train the model + model, end_loss = train_model(model) + + # Compare loss + if num_layers > 0 and end_loss < loss_comparable: + print("="*50) + print(f">>> RESULTS: Adding this layer has improved the model loss from {loss_comparable} to {end_loss}") + loss_comparable = end_loss + elif num_layers > 0: + print("="*50) + print(f">>> RESULTS: Adding this layer did not improve the model loss.") + elif num_layers == 0: + loss_comparable = end_loss + + # Add layer to model + model = add_layer(model) + + # Process is complete + print("Training process has finished.") + + +if __name__ == '__main__': + greedy_layerwise_training() + +``` + +* * * + +## Results + +When you run your script, you should see a base model being trained first (given our settings for 1 epoch or given yours for the number of epochs that you have configured), after which another layer is added and the same process is repeated. Then, loss is compared, and yet another layer is added. + +Hopefully, this allows you to get a feeling for empirically finding the number of layers that is likely adequate for your PyTorch neural network! :) + +``` +==================================================================================================== +>>> TRAINING THE BASE MODEL: +Files already downloaded and verified +Starting epoch 1 +================================================== +Old structure: +Sequential( + (0): Flatten(start_dim=1, end_dim=-1) + (1): Linear(in_features=3072, out_features=256, bias=True) + (2): ReLU() + (3): Linear(in_features=256, out_features=10, bias=True) +) +================================================== +New structure: +Sequential( + (0): Flatten(start_dim=1, end_dim=-1) + (1): Linear(in_features=3072, out_features=256, bias=True) + (2): ReLU() + (3): Linear(in_features=256, out_features=256, bias=True) + (4): Linear(in_features=256, out_features=10, bias=True) +) +==================================================================================================== +>>> TRAINING THE MODEL WITH 1 ADDITIONAL LAYERS: +Files already downloaded and verified +Starting epoch 1 +================================================== +>>> RESULTS: Adding this layer did not improve the model loss. +================================================== +Old structure: +Sequential( + (0): Flatten(start_dim=1, end_dim=-1) + (1): Linear(in_features=3072, out_features=256, bias=True) + (2): ReLU() + (3): Linear(in_features=256, out_features=256, bias=True) + (4): Linear(in_features=256, out_features=10, bias=True) +) +================================================== +New structure: +Sequential( + (0): Flatten(start_dim=1, end_dim=-1) + (1): Linear(in_features=3072, out_features=256, bias=True) + (2): ReLU() + (3): Linear(in_features=256, out_features=256, bias=True) + (4): Linear(in_features=256, out_features=256, bias=True) + (5): Linear(in_features=256, out_features=10, bias=True) +) +.......... +``` + +* * * + +## References + +Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). 
[Greedy layer-wise training of deep networks](https://proceedings.neurips.cc/paper/2006/file/5da713a690c067105aeb2fae32403405-Paper.pdf). In _Advances in neural information processing systems_ (pp. 153-160). + +MachineCurve. (2022, January 9). _Greedy layer-wise training of deep networks, a TensorFlow/Keras example_. [https://www.machinecurve.com/index.php/2022/01/09/greedy-layer-wise-training-of-deep-networks-a-tensorflow-keras-example/](https://www.machinecurve.com/index.php/2022/01/09/greedy-layer-wise-training-of-deep-networks-a-tensorflow-keras-example/) diff --git a/greedy-layer-wise-training-of-deep-networks-a-tensorflow-keras-example.md b/greedy-layer-wise-training-of-deep-networks-a-tensorflow-keras-example.md new file mode 100644 index 0000000..66de93c --- /dev/null +++ b/greedy-layer-wise-training-of-deep-networks-a-tensorflow-keras-example.md @@ -0,0 +1,452 @@ +--- +title: "Greedy layer-wise training of deep networks, a TensorFlow/Keras example" +date: "2022-01-09" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "greedy-layer-wise-training" + - "keras" + - "machine-learning" + - "neural-networks" + - "tensorflow" + - "vanishing-gradients" +--- + +In the early days of deep learning, people training neural networks continuously ran into issues - the vanishing gradients problem being one of the main issues. In addition to that, cloud computing was nascent at the time, meaning that computing infrastructure (especially massive GPUs in the cloud) was still expensive. + +In other words, one could not simply run a few GPUs to find that one's model does not perform very well. Put simply, that's a waste of money. + +To overcome these limitations, researchers came up with **greedy layer-wise training** as an approach to training a neural network. By adding a layer after every training process finishes, it became possible to determine when the model became too _deep_ i.e. when the vanishing gradients problem became too _severe_ for the model to have additional performance gains. + +While it's a relatively old technique, its concepts are still useful today (e.g., because they are applied in some GANs in a relatively obscure way), so in today's article you'll be creating a TensorFlow/Keras based neural network using greedy layer-wise training. After reading this tutorial, you will... + +- **Understand why training neural networks was problematic around 2006-07.** +- **How greedy layer-wise training solves some of these issues.** +- **Have implemented greedy layer-wise training with TensorFlow and Keras.** + +Are you ready? Let's take a look! 😎 + +- If you want to build a neural network using greedy layer-wise training with PyTorch[](https://www.machinecurve.com/index.php/mastering-keras/), [take a look at this article](https://www.machinecurve.com/index.php/2022/01/24/greedy-layer-wise-training-of-deep-networks-a-pytorch-example/). + +* * * + +\[toc\] + +* * * + +## What is greedy layer-wise training? + +Today, thanks to a set of standard components that is used when training a deep neural network, the odds are that you will end up with a model that learns to successfully predict for new samples that belong to your training distribution. + +For example, we can thank nonlinear activation functions like ReLU for this. + +However, there was a time - think before 2007 - when these improvements were not available yet. 
At least, if we take ReLU as an example, we know that it has been around since the 1960s, but it was not until 2011 that renewed interest emerged, because it was found that using it improves neural network performance (Wikipedia, 2012). + +In 2007, however, people employing deep neural networks ran into many issues that all boiled down to the fact that the resulting networks didn't perform well. A key issue, however, was the **[vanishing gradients problem](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/)**, which means that for deeper networks the most upstream layers didn't receive a sufficient gradient, because [error propagation with Sigmoid and Tanh](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/#vanishing-gradients) resulted in very small gradients, and hence slow learning. + +In other words, there was a limit to how deep networks could become in order to remain trainable, while they can be [universal function approximators](https://www.machinecurve.com/index.php/2019/07/18/can-neural-networks-approximate-mathematical-functions/) in theory. + +Thanks to a [paper](https://proceedings.neurips.cc/paper/2006/file/5da713a690c067105aeb2fae32403405-Paper.pdf) by Bengio et al. from 2007, **greedy layer-wise (pre)training** of a neural network renewed interest in deep networks. Although it sounds very complex, it boils down to one simple observation: + +**A deep network is trained once with a hidden layer; then a second hidden layer is added and training is repeated; a third is added and training is repeated, and so forth.** This process is repeated until your target number of layers is reached. Obviously, you can set an absolute target number of layers, or adapt dynamically based on [test performance](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/): if the model no longer improves for e.g. 3 consecutive additions, it's possible that this happens due to vanishing gradients (at least back then, when ReLU was not yet widely used). + +![](images/greedy.drawio-1024x336.png) + +Greedy layer-wise training can be performed in four ways, along two dimensions: + +- **The setting can differ:** you can use it in a **pretraining** way, where you _pretrain_ the network with a smaller amount of training samples, and select the model with the best-performing depth for training with your entire training set. This requires that you are certain that your selected samples are distributed similarly to the full dataset. If you don't, or are unsure how to do this, you can also use a **full training setting**, where every depth iteration performs a full training process. The first is simpler but more error-prone, whereas the latter is safer but requires a lot more computational resources (and possibly, time). +- **The amount of supervision can differ:** obviously, training such a neural network can be performed in a supervised way. However, you may lack the labels to do so, in which case the layers can also be (pre)trained in an unsupervised fashion - which is, in fact, the setting that Bengio et al. (2007) focused on. + +* * * + +## Implementing greedy layer-wise training with TensorFlow and Keras + +Now that you understand what greedy layer-wise training is, let's take a look at how you can harness this approach to training a neural network using TensorFlow and Keras. + +The first thing you'll need to do is to ensure that you have [installed TensorFlow](https://www.tensorflow.org/install). Then, create a Python file (e.g.
`greedy.py`) or open up a [Jupyter Notebook](https://www.machinecurve.com/index.php/2020/10/07/easy-install-of-jupyter-notebook-with-tensorflow-and-docker/) and let's write some code! + +### Python imports + +First of all, you'll need a few imports. Obviously, you'll import `tensorflow`. This is followed by importing the CIFAR-10 dataset, with which we'll train today's neural network. We use the Keras Sequential API and the `Dense`, `Dropout` and `Flatten` layers. + +Do note that in the case of images, it would be best to create a [Convolutional Neural Network](https://www.machinecurve.com/index.php/2021/07/08/convolutional-neural-networks-with-pytorch/). Instead, for the sake of simplicity, we will be [creating an MLP](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/) instead. + +``` +import tensorflow +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +``` + +### Global configuration + +Now that you have specified the imports, it's time to start writing some Python definitions - building blocks with which you'll join all the functionality later! + +The first one we create is `get_global_configuration`. This definition will return the overall configuration for the training process. In today's model, that will only be the number of layers to add greedily - in other words, we'll be training 10 times, expanding the layers after every run. + +``` +def get_global_configuration(): + """ Retrieve configuration of the training process. """ + num_layers_to_add = 10 + return num_layers_to_add +``` + +### Model configuration + +Then, we add the model configuration definition - `get_model_configuration`. It has the model-specific elements, such as image size, number of classes present in the dataset, and so forth. These all speak for themselves if you have worked with deep learning models before. If not, [take a look here](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/). + +``` +def get_model_configuration(): + """ Retrieve configuration for the model. """ + img_width, img_height = 32, 32 + no_classes = 10 + batch_size = 250 + no_epochs = 25 + validation_split = 0.2 + verbosity = 0 + return img_width, img_height, no_classes, batch_size,\ + no_epochs, validation_split, verbosity +``` + +### Retrieving training and evaluation data + +Next up is a definition for retrieving the dataset, after it has been preprocessed. This involves multiple steps: + +- Loading the relevant model configuration: image size and the number of classes. The `_` represents variables that are returned by the definition, but which we won't need here. +- Loading the `cifar10` dataset. Note that it returns training and testing data; both inputs and targets. +- Reshaping the data so that it fits the `(32, 32, 3)` structure of a CIFAR10 sample well. +- Parsing numbers as floats, benefiting training, and converting them into the `[0, 1]` range, also [benefiting the training process](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/). +- Converting target vectors (targets here are simple integers) into [categorical targets](https://www.machinecurve.com/index.php/2020/11/24/one-hot-encoding-for-machine-learning-with-tensorflow-and-keras/) by means of one-hot encoding. +- Finally, returning all the data elements! 
+ +``` +def get_dataset(): + """ Load and convert dataset into inputs and targets """ + # Load relevant model configuration + img_width, img_height, no_classes, _, _, _, _ = get_model_configuration() + + # Load cifar10 dataset + (input_train, target_train), (input_test, target_test) = cifar10.load_data() + + # Reshape data + input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 3) + input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 3) + input_shape = (img_width, img_height, 3) + + # Parse numbers as floats + input_train = input_train.astype('float32') + input_test = input_test.astype('float32') + + # Convert into [0, 1] range. + input_train = input_train / 255 + input_test = input_test / 255 + + # Convert target vectors to categorical targets + target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) + target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + + # Return data + return input_train, input_test, target_train, target_test +``` + +### Creating the base model + +Now that we have created definitions for configuration and dataset loading, it's time to create one that returns the base model. + +First of all, you will also retrieve relevant model configuration here, being the number of classes, while leaving the rest as -is. + +Then, you create a simple Keras `model` using the `Sequential` API. The first thing that is done is flattening the 3D sample into an 1D array, because `Dense` layers can only handle one-dimensional data. Then, you add a `Dense` intermediate layer that is [ReLU activated](https://www.machinecurve.com/index.php/2021/01/21/using-relu-sigmoid-and-tanh-with-pytorch-ignite-and-lightning/), followed by a [Softmax activated](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) output layer. + +Finally, you return the base model. + +``` +def create_base_model(): + """ Create a base model: one Dense layer. """ + # Retrieve relevant model configuration + _, _, no_classes, _, _, _, _ = get_model_configuration() + # Create model instance and add initial layers + model = Sequential() + model.add(Flatten()) + model.add(Dense(256, activation='relu')) + model.add(Dense(no_classes, activation='softmax')) + # Return model + return model +``` + +### Adding an extra layer to the model + +Recall that greedy layer-wise training involves adding an extra layer to the model after every training run finishes. This can be summarized with the following equation: + +**Old model + New layer = New model**. + +In other words, we'll need the existing `model` as input, and add a new layer. Here's how that's done: + +- The existing `model` is passed as an input parameter. +- You define the new intermediate layer that must be added. In our case, it will be a `Dense` layer with 256 outputs and ReLU activation every time. +- Then, you add the trained Softmax output layer to a temporary variable. +- Because all layers have been trained already, you set them so that they become untrainable. In other words, you will only notice the impact of adding a new layer on model performance. +- Then, you remove the output layer, and add the new, untrained layer. +- Finally, you re-add the trained (and trainable) output layer and return the model. + +``` +def add_extra_layer(model): + """ Add an extra Dense layer to the model. 
""" + # Define the layer that must be added + layer_to_add = Dense(256, activation='relu') + # Temporarily assign the output layer to a variable + output_layer = model.layers[-1] + # Set all upstream layers to nontrainable + for layer in model.layers: + layer.trainable = False + # Remove output layer and add new layer + model.pop() + model.add(layer_to_add) + # Re-add output layer + model.add(output_layer) + # Return trained model, with extra layer + return model +``` + +### Training a model instance + +Now that you have created definitions for creating a base model and adding an extra layer to an existing model (regardless of whether it's a base model or an intermediate model), it's time to create a definition for actually training _any_ model instance. + +This involves multiple steps: + +- Receiving the `model` and training/testing `data` as input. +- Retrieving relevant model configuration: batch size, number of epochs, validation split, and Keras verbosity i.e. how much output is written to the terminal during the training process. +- Decomposing the training/testing data in its individual components. +- Compiling the model, or actually creating a valid model instance from the model skeleton. +- Training the model given your model configuration, using your training data. +- Evaluating your model with your testing data, and writing evaluation results to the terminal. +- Returning the trained `model`, so that a layer can be added and a new training process can start. + +``` +def train_model(model, data): + """ Compile and train a model. """ + # Retrieve relevant model configuration + _, _, _, batch_size, no_epochs, validation_split, verbosity = get_model_configuration() + # Decompose data into components + input_train, input_test, target_train, target_test = data + # Compile model + model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + # Train model + model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + # Evaluate model + score = model.evaluate(input_test, target_test, verbose=0) + print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + # Return trained model + return model +``` + +### Instantiating the training process + +Finally, we end up with the last definition: instantiating the training process. + +As you can see, you start with the base model here as well as retrieving the dataset. + +Then, you load the process configuration, and create a loop that iterates for as many times that you have configured the process to iterate. + +In each iteration, you train and evaluate the current model (the base model in the first loop; the base model + (N-1) layers in each subsequent Nth loop) and add an extra layer. + +``` +def training_process(): + """ Run the training process. """ + # Create the base model + model = create_base_model() + # Get data + data = get_dataset() + # Apply greedy layer-wise training + num_layers_to_add = get_global_configuration() + for i in range(num_layers_to_add): + # Train and evaluate current model + model = train_model(model, data) + # Add extra layer + model = add_extra_layer(model) +``` + +Then, you instruct the Python interpreter to start the training process when your Python script starts: + +``` +if __name__ == "__main__": + training_process() +``` + +### Full model code + +Now, everything should run! 
:) + +Here's the full model code for when you want to get started immediately: + +``` +import tensorflow +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten + + +def get_global_configuration(): + """ Retrieve configuration of the training process. """ + num_layers_to_add = 10 + return num_layers_to_add + + +def get_model_configuration(): + """ Retrieve configuration for the model. """ + img_width, img_height = 32, 32 + no_classes = 10 + batch_size = 250 + no_epochs = 25 + validation_split = 0.2 + verbosity = 0 + return img_width, img_height, no_classes, batch_size,\ + no_epochs, validation_split, verbosity + + +def get_dataset(): + """ Load and convert dataset into inputs and targets """ + # Load relevant model configuration + img_width, img_height, no_classes, _, _, _, _ = get_model_configuration() + + # Load cifar10 dataset + (input_train, target_train), (input_test, target_test) = cifar10.load_data() + + # Reshape data + input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 3) + input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 3) + input_shape = (img_width, img_height, 3) + + # Parse numbers as floats + input_train = input_train.astype('float32') + input_test = input_test.astype('float32') + + # Convert into [0, 1] range. + input_train = input_train / 255 + input_test = input_test / 255 + + # Convert target vectors to categorical targets + target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) + target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + + # Return data + return input_train, input_test, target_train, target_test + + +def create_base_model(): + """ Create a base model: one Dense layer. """ + # Retrieve relevant model configuration + _, _, no_classes, _, _, _, _ = get_model_configuration() + # Create model instance and add initial layers + model = Sequential() + model.add(Flatten()) + model.add(Dense(256, activation='relu')) + model.add(Dense(no_classes, activation='softmax')) + # Return model + return model + + +def add_extra_layer(model): + """ Add an extra Dense layer to the model. """ + # Define the layer that must be added + layer_to_add = Dense(256, activation='relu') + # Temporarily assign the output layer to a variable + output_layer = model.layers[-1] + # Set all upstream layers to nontrainable + for layer in model.layers: + layer.trainable = False + # Remove output layer and add new layer + model.pop() + model.add(layer_to_add) + # Re-add output layer + model.add(output_layer) + # Return trained model, with extra layer + return model + + +def train_model(model, data): + """ Compile and train a model. 
""" + # Retrieve relevant model configuration + _, _, _, batch_size, no_epochs, validation_split, verbosity = get_model_configuration() + # Decompose data into components + input_train, input_test, target_train, target_test = data + # Compile model + model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + # Train model + model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + # Evaluate model + score = model.evaluate(input_test, target_test, verbose=0) + print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + # Return trained model + return model + + +def training_process(): + """ Run the training process. """ + # Create the base model + model = create_base_model() + # Get data + data = get_dataset() + # Apply greedy layer-wise training + num_layers_to_add = get_global_configuration() + for i in range(num_layers_to_add): + # Train and evaluate current model + model = train_model(model, data) + # Add extra layer + model = add_extra_layer(model) + + +if __name__ == "__main__": + training_process() + +``` + +* * * + +## Results + +When I ran the code, these are the results that were written to my screen: + +``` +Test loss: 1.4798256158828735 / Test accuracy: 0.4846999943256378 +Test loss: 1.3947865962982178 / Test accuracy: 0.513700008392334 +Test loss: 1.4665762186050415 / Test accuracy: 0.5048999786376953 +Test loss: 1.666954517364502 / Test accuracy: 0.5002999901771545 +Test loss: 1.9360666275024414 / Test accuracy: 0.48730000853538513 +Test loss: 2.1698007583618164 / Test accuracy: 0.48739999532699585 +Test loss: 2.333308219909668 / Test accuracy: 0.48019999265670776 +Test loss: 2.470284938812256 / Test accuracy: 0.48190000653266907 +Test loss: 2.5734057426452637 / Test accuracy: 0.47859999537467957 +Test loss: 2.6469039916992188 / Test accuracy: 0.4790000021457672 +``` + +The relatively poor performance can be explained easily - CIFAR10 is a relatively complex dataset and Dense layers aren't really suitable for image classification, at least not for deriving features. In addition, the model was trained for a relatively few amount of epochs. + +Still, it becomes clear that you can derive a suitable number of hidden layers for this problem by means of greedy layer-wise training: today's model performs best when it has 2 hidden layers, after which performance deteriorates. + +![](images/image-1024x562.png) + +That's it! Today, you have learned to apply greedy layer-wise training procedures for training your neural network with TensorFlow and Keras :) + +If you have any questions, comments or suggestions, feel free to leave a message in the comments section below 💬 I will then try to answer you as quickly as possible. For now, thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). [Greedy layer-wise training of deep networks](https://proceedings.neurips.cc/paper/2006/file/5da713a690c067105aeb2fae32403405-Paper.pdf). In _Advances in neural information processing systems_ (pp. 153-160). + +Wikipedia. (2012, December 7). _Rectifier (neural networks)_. Wikipedia, the free encyclopedia. 
Retrieved January 8, 2022, from [https://en.wikipedia.org/wiki/Rectifier\_(neural\_networks)](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) diff --git a/grouped-convolutions-with-tensorflow-2-and-keras.md b/grouped-convolutions-with-tensorflow-2-and-keras.md new file mode 100644 index 0000000..9573c62 --- /dev/null +++ b/grouped-convolutions-with-tensorflow-2-and-keras.md @@ -0,0 +1,115 @@ +--- +title: "Grouped convolutions with TensorFlow 2 and Keras" +date: "2022-01-29" +categories: + - "deep-learning" + - "frameworks" +tags: + - "computer-vision" + - "convolutional-neural-networks" + - "convolutions" + - "deep-learning" + - "grouped-convolutions" + - "keras" + - "machine-learning" + - "resnet" + - "tensorflow" +--- + +Improving your convolution performance does not have to be difficult - one way to achieve this is by using **grouped convolutions**. By splitting the filter maps in your convolutional layers into multiple disjoint groups, it's possible to reduce the parameters in your network, while having the network learn better features. + +How? That's what you will discover by reading today's article. Firstly, you will read about grouped convolutions, from a ResNeXt point of view. Then, you'll learn about why they can improve network and training performance. Finally, you will take a look at implementing these convolutions with TensorFlow and Keras. + +Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## What are grouped convolutions? + +In their paper introducing the [ResNeXt architecture](https://arxiv.org/abs/1611.05431), Xie et al. (2017) noted that there are multiple ways of improving model performance. A relatively standard one is to make the model deeper - that is, for example, adding more convolutional layers to learn a deeper hierarchy of features that can be used for classification. + +Making the neural network wider, by increasing the number of feature maps learned at every level, is another option, to increase feature richness benefiting optimization. + +However, in their work, they state that there is another way: by increasing the cardinality + +> Experiments demonstrate that in- creasing cardinality is a more effective way of gaining accu- racy than going deeper or wider, especially when depth and width starts to give diminishing returns for existing models. +> +> Xie et al. (2017) + +Cardinality, here, is defined as "the size of the set of transformations". Because things may still be a bit vague now, let's make things visual for better understanding. + +### Starting from a simple residual block... + +Suppose that you are training a convolutional architecture. To be more specific, you're using ResNet blocks as the backbone for your classifier. A simple residual block - with a regular mapping and a skip connection - can look as follows: + +![](images/simple-resnet-block.png) + +The creators of the ResNeXt architecture suggest that by splitting this set of transformations into multiple sub sets, performance can increase, because the sub sets become separate feature learners at a specific level in the feature hierarchy. + +The number of sub sets at a specific level is the cardinality of that specific level. For example, if we split the residual block into sub sets with a cardinality of 2, this would be the outcome. We see a similarly complex model, where the outputs of both subsets are summated and then the skip connection is added back. 
+
+![](images/cardinality-2.png)
+
+The ResNeXt authors have found empirical improvements of their architecture over classic ResNet. However, the split-transform-summate approach from above is not the only possible approach. For example, it is also possible to perform split-transform-concatenate, after which the concatenation is processed by another convolutional layer to preserve the feature map dimensionality of the whole block.
+
+### Using grouped convolutions
+
+Now, getting to the point of this article, another approach is to use **grouped convolutions**. Grouped convolutions have been around since the very start of the deep learning revolution: AlexNet already used them to split training across multiple GPUs.
+
+![](images/grouped.png)
+
+Now, what is a grouped convolution?
+
+> \[A\] group is convolved separately with filters / groups filters. The output is the concatenation of all the groups results along the channel axis.
+>
+> TensorFlow (n.d.)
+
+Normally, in a regular convolution, we have filters (which slide or convolve over the input feature maps). For example, we can have 32 filters that slide over the input feature maps. These 32 filters are convolved at the same time, over the whole input.
+
+When using grouped convolutions, we separate the filters into disjoint groups - i.e., groups of filters that convolve over distinct feature maps. For example, if we split the 32 filters into 2 groups of 16, the first group convolves over the first 50% of the input feature maps, while the second convolves over the remaining 50%.
+
+Note that the percentage is relative - let's illustrate this with another example. If we have 60 input feature maps and 256 output feature maps, and we use 4 groups, each group convolves over 15 (25%) of the input feature maps.
+
+### Benefits of using grouped convolutions
+
+Using grouped convolutions has multiple benefits compared to using normal convolutions:
+
+1. **Hardware efficiency.** By splitting the convolution procedure into disjoint groups, training can be parallelized over GPUs quite easily - for example, by using one GPU per group.
+2. **Reduced number of trainable parameters.** The wider one's convolutional layer, the more parameters are used. By using grouped convolutions, the number of parameters is reduced significantly.
+3. **Better model performance!** Now, that's something that is quite surprising - after all, splitting the convolution into groups yields a model with fewer parameters, so you might expect performance to drop rather than improve. Ioannou (2017) discusses this in [an interesting article](https://blog.yani.ai/filter-group-tutorial/) - something that was underrecognized is that using grouped convolutions means learning better representations.
+
+But why? Make sure to read the article if you want to learn about it in more detail, but the gist of the argument is that grouped convolutions perform better representation learning because **irrelevant correlations between features across layers are left out**. For example, if you have a regular convolution with 32 filters in one layer and 64 in the next, the network correlates all 32 filters with all 64 filters: the gradient update that changes the 32 filters depends on the gradient update generated for the 64 filters further downstream. In other words, every correlation between the 64 and 32 filters affects network performance.
+
+Using grouped convolution breaks down the filters into separate and disjoint groups. 
In other words, these groups do not know about each other when being trained. In sum, this means that (should we use grouped convolutions with 2 groups for each layer) now only 16 filters are correlated with 32 filters. Increasing the number of groups both reduces parameters and improves performance, by having actually many small networks into one. + +* * * + +## Implementing grouped convolutions with TensorFlow 2 and Keras + +Using grouped convolutions with TensorFlow 2 and Keras is actually really easy. The only thing that you will need to do is using the `groups` attribute in specifying your convolutional layer (whether that is a `Conv1D`, `Conv2D` or `Conv3D` layer). + +> A positive integer specifying the number of groups in which the input is split along the channel axis. Each group is convolved separately with filters / groups filters. The output is the concatenation of all the groups results along the channel axis. Input channels and filters must both be divisible by groups. +> +> TensorFlow (n.d.) + +For example, if you have a two-dimensional convolutional layer that outputs 64 feature maps, you can turn it into a grouped convolution that outputs 4x16 feature maps by simply specifying this in layer initialization: + +``` +Conv2D(64, (3, 3), groups=4) +``` + +That's it! You now understand what grouped convolutions are, why they can be useful and beneficial to your neural network, and how you can use them within TensorFlow 2 and Keras 😎 If you have any questions, comments or suggestions, feel free to leave a message in the comments section below 💬 I will then try to answer you as quickly as possible. For now, thank you for reading MachineCurve today and happy engineering! + +* * * + +## References + +Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). [Aggregated residual transformations for deep neural networks.](https://arxiv.org/abs/1611.05431) In _Proceedings of the IEEE conference on computer vision and pattern recognition_ (pp. 1492-1500). + +TensorFlow. (n.d.). _Tf.keras.layers.Conv2D_. [https://www.tensorflow.org/api\_docs/python/tf/keras/layers/Conv2D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Conv2D) + +Ioannou, Y. (2017, August 10). _A tutorial on filter groups (Grouped convolution)_. A Shallow Blog about Deep Learning. [https://blog.yani.ai/filter-group-tutorial/](https://blog.yani.ai/filter-group-tutorial/) diff --git a/he-xavier-initialization-activation-functions-choose-wisely.md b/he-xavier-initialization-activation-functions-choose-wisely.md new file mode 100644 index 0000000..9c0576e --- /dev/null +++ b/he-xavier-initialization-activation-functions-choose-wisely.md @@ -0,0 +1,204 @@ +--- +title: "He/Xavier initialization & activation functions: choose wisely" +date: "2019-09-16" +categories: + - "buffer" + - "deep-learning" +tags: + - "activation-functions" + - "deep-learning" + - "initializers" + - "neural-networks" + - "weight-initialization" +--- + +Deep learning models require to be initialized. Their layers have activation functions to make neuron outputs nonlinear. But how to initialize? And how to choose an activation function? We covered those questions in different blogs. 
Today, we'll cover a different topic: + +**The intrinsic relationship between the Xavier and He initializers and certain activation functions.** + +You're right, we focus on a niche within the overlap between weight initialization and activation functions - and cover how Xavier and He initializers require one to choose certain activation functions over others, and vice-versa. + +However, if you're interested in the other topics, feel free to also read these blogs: + +- [What is weight initialization?](https://machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/) +- [Random initialization: vanishing and exploding gradients](https://machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) +- [ReLU, Sigmoid and Tanh: today's most used activation functions](https://machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) + +Let's go! :-) After reading this article, you will understand... + +- The basics of weight initialization. +- Why choosing an initializer depends on your choice for activation functions. +- How He and Xavier initialization must be applied differently. + +* * * + +**Update 05/Feb/2021:** ensured that article is up to date. + +**Update 07/Oct/2020:** clarified the meaning of \[latex\]N\[/latex\] in the initialization strategies. + +* * * + +\[toc\] + +* * * + +## Recap: the ingredients of this blog + +Before I can make my point with respect to the He and Xavier initializers and their relationships to activation functions, we must take a look at the individual ingredients of this blog first. With those, I mean weight initialization and activation functions. We'll briefly cover these next and also provide links to blogs that cover them in more detail. + +Subsequently, we move on to He and Xavier initialization and our final point. However, if you're well aware of initializers and activation functions, feel free to skip this section altogether. It must be all very familiar to you. + +### What is initialization? + +Neural networks are collections of neurons - that's nothing strange. + +But how do neurons operate? + +By producing an operation called a _dot product_ between a _weights vector_ and an _input vector_. A _bias value_ is added to this product and the whole is subsequently passed to an _activation function_. + +Since all neurons do this, a system emerges that can adapt to highly complex data. + +During optimization, which occurs every time data is fed to the network (either after each sample or after all of them, or somewhere in between), the _weights vectors_ are slightly adapted to simply better cover the patterns represented by the training set. + +However, you'll need to start somewhere - the weights vectors cannot be empty once you start training. Hence, they must be initialized. _That's_ weight initialization. + +_Read more about initialization here: [What is weight initialization?](https://machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/)_ + +#### Initializers + +Weight initialization is performed by means of an initializer. There are many ways of initializing your neural network, of which some are better - or, more nicely, less naïve - than others. For example, you may choose to initialize your weights as zeros, but then your model won't improve. + +Additionally, you may also choose to initialize them randomly. We then get somewhere, but face the vanishing and exploding gradient problems. 
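+
+In Keras, this choice boils down to a single argument. As a minimal sketch - the layer size is arbitrary and only the `kernel_initializer` argument matters here - the two naive options above look like this:
+
+```
+from tensorflow.keras.layers import Dense
+
+# All-zeros initialization: the model won't improve during training
+zeros_layer = Dense(64, activation='relu', kernel_initializer='zeros')
+
+# Plain random initialization: works better, but - as we'll see next -
+# it invites vanishing and exploding gradients
+random_layer = Dense(64, activation='relu', kernel_initializer='random_normal')
+```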
+ +_If you wish to understand more about initializers, click the link above_ 👆 + +#### Vanishing & exploding gradients + +When you initialize your weights randomly, the values are probably close to zero given the probability distributions with which they are initialized. Since optimization essentially chains the optimizations in the 'downstream' layers (i.e., the ones closer to the output) when calculating the weights improvement in the 'upstream' ones (e.g., the one you're currently trying to optimize), you'll face either two things: + +- When your weights and hence your gradients are close to zero, the gradients in your upstream layers **vanish** because you're multiplying small values and e.g. 0.1 x 0.1 x 0.1 x 0.1 = 0.0001. Hence, it's going to be difficult to find an optimum, since your upstream layers learn slowly. +- The opposite can also happen. When your weights and hence gradients are > 1, multiplications become really strong. 10 x 10 x 10 x 10 = 1000. The gradients may therefore also **explode**, causing number overflows in your upstream layers, rendering them untrainable (even dying off the neurons in those layers). + +In both cases, your model will never reach its theoretical optimum. We'll see that He and Xavier initializers will substantially safeguard yourself from the vanishing and exploding gradients problems. However, let's briefly recap on activation functions first. + +_Read more about vanishing and exploding gradients here:_ +_[Random initialization: vanishing and exploding gradients](https://machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/)_ + +### What are activation functions? + +As we saw in the recap on weight initialization, neural networks are essentially a system of individual neurons, which produce outputs given an input (being the _input vector_). + +If we don't add activation functions, we find our network to behave poorly: it simply does not converge well to your real-world data. + +Why is that the case? + +The operation, without the activation function, is _linear_: you simply multiply values and add a bias value. That's some linear operations. + +Hence, without the activation function, your model will behave as if it is linear. That, we don't want, because real world data is pretty much always nonlinear. + +Therefore, activation functions enter the playing field. + +An activation is a mathematical function that simply takes an input which may or may not be linear (it just takes any real valued number) and converts it into another real valued number. Since the function _itself_ behaves nonlinearly, the neural network will behave as such too. We can now handle much more complex data. Great! + +#### ReLU, Sigmoid and Tanh + +In today's world, there are three widely used activation functions: Rectified Linear Unit (ReLU), Sigmoid and Tanh. ReLU is most widely used because it is an improvement over Sigmoid and Tanh. 
Nevertheless, improvement is still possible, as we can see by clicking the link below 👇
+
+_Read more about activation functions here: [ReLU, Sigmoid and Tanh: today’s most used activation functions](https://machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/)_
+
+* * *
+
+## He and Xavier initialization against gradient issues
+
+In his paper _[On weight initialization in deep neural networks](https://arxiv.org/abs/1704.08863)_, Siddharth Krishna Kumar identifies mathematically what the problem is with vanishing and exploding gradients and why He and Xavier (or Glorot) initialization do work against this problem.
+
+He argues as follows:
+
+**Deep neural networks face the difficulty that the variance of the layer outputs decreases the deeper the data flows into the network.**
+
+The problem with this is what we've seen in our post about _vanishing gradients_: slow model convergence.
+
+In _[Why are deep neural networks hard to train?](http://neuralnetworksanddeeplearning.com/chap5.html)_, the author of the Neural Networks and Deep Learning website helps us illustrate Kumar's point by means of the Sigmoid activation function.
+
+Suppose that your neural network is equipped with the Sigmoid activation function. The neuron outputs will flow through this function to become nonlinear, and the Sigmoid derivative will be used during optimization:
+
+[![](images/sigmoid_and_deriv-1024x511.jpeg)](https://machinecurve.com/wp-content/uploads/2019/09/sigmoid_and_deriv.jpeg)
+
+Sigmoid and its derivative
+
+As you can see, there are two problems with the Sigmoid function and its behavior during optimization:
+
+- When variance is really high, the _absolute value_ of the gradient will be low and the network will learn very slowly;
+- When variance is really _low_, the gradient will move in a very small range, and hence the network will also learn very slowly.
+
+This especially occurs when weights are drawn from a [standard normal distribution](https://machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/), since most weights will then be small, lying between -1 and 1.
+
+Kumar argued that it's best to have variances of ≈ 1 through all layers. This way, slow learning can be mitigated quite successfully. The fun thing is, He and Xavier initialization attempt to ensure such variance in layer outputs by default. But first, a brief look into the sensitivity of ReLU.
+
+* * *
+
+### Why is ReLU less sensitive to this problem?
+
+This is why we generally use [ReLU as our activation function of choice](https://machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/).
+
+This is ReLU and its derivative:
+
+[![](images/relu_and_deriv-1024x511.jpeg)](https://machinecurve.com/wp-content/uploads/2019/09/relu_and_deriv.jpeg)
+
+As you can see, the derivative of ReLU behaves differently. If the original input is < 0, the derivative is 0, else it is 1. This observation emerges from the way ReLU is designed.
+
+Hence, it no longer matters whether the variance is 1 or 100: for both the positive and negative numbers drawn from such a sample, the gradient is always either zero or one. ReLU is therefore not bothered much by vanishing and exploding gradients, contrary to Sigmoid and Tanh:
+
+![](images/tanh_and_deriv-1024x511.jpeg)
+
+Tanh and its derivative
+
+Let's now take a look at He and Xavier initialization.
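+
+Before we do, a small NumPy sketch can make Kumar's variance argument tangible. This is just a toy illustration - the layer width, depth and scaling factors are assumptions for the example, not code from the paper:
+
+```
+import numpy as np
+
+np.random.seed(42)
+
+def output_variance(std_fn, n_layers=10, width=512):
+    """ Push random inputs through a stack of ReLU layers and return the variance of the final outputs. """
+    x = np.random.randn(1000, width)
+    for _ in range(n_layers):
+        # Draw weights with the standard deviation produced by std_fn
+        w = np.random.randn(width, width) * std_fn(width)
+        x = np.maximum(0, x @ w)  # ReLU activation
+    return x.var()
+
+# Naive small random weights: variance collapses towards zero
+print(output_variance(lambda n: 0.01))
+# He-style scaling, std = sqrt(2 / N): variance stays at a stable order of magnitude
+print(output_variance(lambda n: np.sqrt(2.0 / n)))
+```
+
+With the naive choice, the output variance vanishes within a handful of layers; with the \[latex\]\\sqrt{2/N}\[/latex\] scaling it remains stable across all layers - which is exactly the behavior that the initializers below formalize.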
+ +### Xavier initialization + +In his work, Kumar argued that when variance of the layer outputs (and hence the downstream layer inputs) is not ≈ 1, depending on the activation function, models will converge more slowly, especially when these are < 1. + +For "activation functions differentiable at 0", Kumar derives a generic weight initialization strategy. With this strategy, which essentially assumes random initialization from e.g. the standard normal distribution but then with a specific variance that yields output variances of 1, he derives the so-called "Xavier initialization" for the Tanh activation function: + +\\begin{equation} v^{2} = 1/N \\end{equation} + +### He initialization + +When your neural network is ReLU activated, He initialization is one of the methods you can choose to bring the variance of those outputs to approximately one (He et al., 2015). + +Although it attempts to do the same, He initialization is different than Xavier initialization (Kumar, 2017; He et al., 2015). This difference is related to the nonlinearities of the ReLU activation function, which make it non-differentiable at \[latex\]x = 0\[/latex\]. However, Kumar indeed proves mathematically that for the ReLU activation function, the best weight initialization strategy is to initialize the weights randomly but with this variance: + +\\begin{equation} v^{2} = 2/N \\end{equation} + +...which is He initialization. + +* * * + +## Summary: choose wisely + +Weight initialization is very important, as "all you need is a good init" (Mishkin & Matas, 2015). It's however important to choose a proper weight initialization strategy in order to maximize model performance. We've seen that such strategies are dependent on the activation functions that are used in the model. + +For Tanh based activating neural nets, the Xavier initialization seems to be a good strategy, which essentially performs random initialization from a distribution with a variance of \[latex\]1/N\[/latex\]. + +Here, \[latex\]N\[/latex\] is the number of input neurons to a particular layer. + +For Sigmoid based activation functions, this is not the case, as was derived in the Kumar paper (Kumar, 2017). + +ReLU activating networks, which are pretty much the standard ones today, benefit from the He initializer - which does the same thing, but with a different variance, namely \[latex\]2/N\[/latex\]. + +This way, your weight init strategy is pinpointed to your neural net's ideosyncrasies, which at least theoretically makes it better. I'm looking forward to hearing from your experience as to whether you also see these results in practice. Leave a comment below if you're feeling like sharing 👇 + +Thanks for reading and happy engineering! 😄 + +* * * + +## References + +Kumar, S. K. (2017). On weight initialization in deep neural networks. _CoRR_, _abs/1704.08863_. Retrieved from [http://arxiv.org/abs/1704.08863](http://arxiv.org/abs/1704.08863) + +He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. _2015 IEEE International Conference on Computer Vision (ICCV)_. [doi:10.1109/iccv.2015.123](http://doi.org/10.1109/iccv.2015.123) + +Mishkin, D., & Matas, J. (2015). All you need is a good init. _arXiv preprint arXiv:1511.06422_. Retrieved from [https://arxiv.org/abs/1511.06422](https://arxiv.org/abs/1511.06422) + +Neural networks and deep learning. (n.d.). Why are deep neural networks hard to train? 
Retrieved from [http://neuralnetworksanddeeplearning.com/chap5.html](http://neuralnetworksanddeeplearning.com/chap5.html) diff --git a/help-fight-covid-19-participate-in-the-cord-19-challenge.md b/help-fight-covid-19-participate-in-the-cord-19-challenge.md new file mode 100644 index 0000000..30ba6f2 --- /dev/null +++ b/help-fight-covid-19-participate-in-the-cord-19-challenge.md @@ -0,0 +1,35 @@ +--- +title: "Help fight COVID-19: participate in the CORD-19 challenge" +date: "2020-03-17" +categories: + - "news" +tags: + - "artificial-intelligence" + - "covid-19" + - "deep-learning" + - "machine-learning" +--- + +We all know it by now: the novel coronavirus, resulting in COVID-19, is spreading across the globe. In haste, governments are taking unprecedented measures such as total lockdown (France and Italy) and controlled spread (Netherlands). In doing so, they attempt to reduce the impact of the virus on the countries' health systems, awaiting a vaccine to be developed and considered safe. + +However, we as data science, data engineering and machine learning communities might just be able to help fight the virus - especially in times where other work _might_ be getting less. + +The CORD-19 challenge is a Kaggle challenge launched by the Allen Institute for AI in partnership with the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, and the National Library of Medicine - National Institutes of Health, in coordination with The White House Office of Science and Technology Policy. + +_COVID-19 picture: Miguel Á. Padriñán_, _Pexels.com_ + +## The challenge + +It comes with a dataset of more than 29.000 scholarly articles: + +> In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up. +> +> [COVID-19 Open Research Dataset Challenge (CORD-19)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) + +## Tasks + +There is a wide range of tasks available, each with $1000 euro in prizes for the winner, sponsored by Kaggle: + +[![](images/image-922x1024.png)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) + +If you have some spare time, it might definitely be worth a look - and perhaps, even a try. [Click here to go to the challenge.](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) diff --git a/how-does-the-softmax-activation-function-work.md b/how-does-the-softmax-activation-function-work.md new file mode 100644 index 0000000..896039b --- /dev/null +++ b/how-does-the-softmax-activation-function-work.md @@ -0,0 +1,476 @@ +--- +title: "How does the Softmax activation function work?" 
+date: "2020-01-08" +categories: + - "deep-learning" + - "frameworks" +tags: + - "activation-function" + - "deep-learning" + - "machine-learning" + - "neural-network" + - "softmax" + - "training-process" +--- + +When you're creating a neural network for classification, you're likely trying to solve either a binary or a multiclass classification problem. In the latter case, it's very likely that the activation function for your final layer is the so-called **Softmax activation function**, which results in a multiclass probability distribution over your target classes. + +However, what is this activation function? How does it work? And why does the way it work make it useful for use in neural networks? Let's find out. + +In this blog, we'll cover all these questions. We first look at how Softmax works, in a primarily intuitive way. Then, we'll illustrate why it's useful for neural networks/machine learning when you're trying to solve a multiclass classification problem. Finally, we'll show you how to use the Softmax activation function with deep learning frameworks, by means of an example created with Keras. + +This allows you to understand what Softmax is, what it does and how it can be used. + +Ready? Let's go! 😎 + +* * * + +\[toc\] + +* * * + +## How does Softmax work? + +Okay: Softmax. It always "returns a probability distribution over the target classes in a multiclass classification problem" - these are often my words when I have to explain intuitively how Softmax works. + +But let's now dive in a little bit deeper. + +What does "returning a probability distribution" mean? And why is this useful when we wish to perform multiclass classification? + +### Logits layer and logits + +We'll have to take a look at the structure of a neural network in order to explain this. Suppose that we have a neural network, such as the - very high-level variant - one below: + +[![](images/logits.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/logits.png) + +The final layer of the neural network, _without the activation function_, is what we call the **"logits layer"** (Wikipedia, 2003). It simply provides the final outputs for the neural network. In the case of a four-class multiclass classification problem, that will be four neurons - and hence, four outputs, as we can see above. + +Suppose that these are the outputs, or our **logits**: + +[![](images/logits_with_outputs.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/logits_with_outputs.png) + +These essentially tell us something about our target classes, but from the outputs above, we can't make sense of it yet.... are they likelihoods? No, because can we have a negative one? Uh... + +### Multiclass classification = generating probabilities + +In a way, however, _predicting_ which target class some input belongs to is related to a _probability distribution_. For the setting above, if you would know the probabilities of the value being any of the four possible outcomes, you could simply take the \[latex\]argmax\[/latex\] of these discrete probabilities and find the class outcome. Hence, if we could convert the logits above into a probability distribution, that would be awesome - we'd be there! + +Let's explore this idea a little bit further :) + +If we would actually want to convert our logits into a probability distribution, we'll need to first take a look at what a probability distribution is. 
+
+#### Kolmogorov's axioms
+
+From probability theory class at university, I remember that probability theory as a whole can be described by its foundations, the so-called probability axioms or Kolmogorov's axioms. They are named after Andrey Kolmogorov, who introduced the axioms in 1933 (Wikipedia, 2001).
+
+They are as follows (Wikipedia, 2001):
+
+- The probability of an event - i.e., of something happening - is a non-negative real number.
+- The probability that at least one of the events in the sample space occurs is 1, i.e. the sum of all the individual probabilities is 1.
+- The probability that any one of a sequence of disjoint (mutually exclusive) events occurs equals the sum of the individual event probabilities.
+
+For reasons of clarity: in percentage terms, 1 = 100%, and 0.25 would be 25%.
+
+Now, the third axiom is not so much of interest for today's blog post, but the first two are.
+
+From them, it follows that _the probability of something occurring_ must be a non-negative real number, e.g. \[latex\]0.238\[/latex\]. Since the sum of probabilities must be equal to \[latex\]1\[/latex\], no probability can be \[latex\]> 1\[/latex\]. Hence, any probability therefore lies somewhere in the range \[latex\]\[0, 1\]\[/latex\].
+
+Okay, we can work with that. However, there's one more explanation left before we can explore possible approaches towards converting the logits into a multiclass probability distribution: the difference between a _continuous_ and a _discrete_ probability distribution.
+
+#### Discrete vs continuous distributions
+
+To deepen our understanding of the problem above, we'll have to take a look at the differences between discrete and continuous probability distributions.
+
+According to Wikipedia (2001), this is a discrete probability distribution:
+
+> A **discrete probability distribution** is a probability distribution that can take on a countable number of values.
+>
+> Wikipedia (2001): Discrete probability distribution
+
+A continuous one, on the other hand:
+
+> A **continuous probability distribution** is a probability distribution with a cumulative distribution function that is [absolutely continuous](https://en.wikipedia.org/wiki/Absolute_continuity).
+>
+> Wikipedia (2001): Continuous probability distribution
+
+So, while a discrete distribution can take on a certain (countable) number of values - four, perhaps ;-) - and is therefore rather 'blocky' with one probability per value, a continuous distribution can take _any_ value, and probabilities are expressed as being in a range.
+
+### Towards a discrete probability distribution
+
+As you might have noticed, I already gave away the answer as to whether the neural network above benefits from converting the logits into a _discrete_ or _continuous_ distribution.
+
+To play captain obvious: it's a discrete probability distribution.
+
+For each outcome (each neuron represents the outcome for a target class), we'd love to know the individual probabilities, but of course they must be relative to the other target classes in the machine learning problem. Hence, probability distributions, and specifically discrete probability distributions, are the way to go! :)
+
+But how do we convert the logits into a probability distribution? We use Softmax!
+
+### The Softmax function
+
+The Softmax function allows us to express our inputs as a discrete probability distribution. Mathematically, this is defined as follows:
+
+\[latex\]Softmax(x\_i) = \\frac{exp(x\_i)}{\\sum\_j exp(x\_j)}\[/latex\]
+
+Intuitively, this can be defined as follows: for each value (i.e. input) in our input vector, the Softmax value is the _exponent of the individual input_ divided by a sum of _the exponents of all the inputs_.
+
+This ensures that multiple things happen:
+
+- Negative inputs will be converted into positive values, thanks to the exponential function.
+- Each output will be in the interval \[latex\](0, 1)\[/latex\].
+- As the _denominator_ in each Softmax computation is the same, the values become proportional to each other, which makes sure that together they sum to 1.
+
+This, in turn, allows us to "interpret them as probabilities" (Wikipedia, 2006). Larger input values correspond to larger probabilities, at exponential scale, once more due to the exponential function.
+
+Let's now go back to the initial scenario that we outlined above.
+
+[![](images/logits_with_outputs.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/logits_with_outputs.png)
+
+We can now convert our logits into a discrete probability distribution:
+
+| **Logit value** | Softmax computation | Softmax outcome |
+| --- | --- | --- |
+| 2.0 | \[latex\] \\frac{exp(x\_i)}{\\sum\_j exp(x\_j)} = \\frac{exp(2.0)}{exp(2.0) + exp(4.3) + exp(1.2) + exp(-3.1)} \[/latex\] | 0.087492 |
+| 4.3 | \[latex\] \\frac{exp(x\_i)}{\\sum\_j exp(x\_j)} = \\frac{exp(4.3)}{exp(2.0) + exp(4.3) + exp(1.2) + exp(-3.1)} \[/latex\] | 0.872661 |
+| 1.2 | \[latex\] \\frac{exp(x\_i)}{\\sum\_j exp(x\_j)} = \\frac{exp(1.2)}{exp(2.0) + exp(4.3) + exp(1.2) + exp(-3.1)} \[/latex\] | 0.039312 |
+| \-3.1 | \[latex\] \\frac{exp(x\_i)}{\\sum\_j exp(x\_j)} = \\frac{exp(-3.1)}{exp(2.0) + exp(4.3) + exp(1.2) + exp(-3.1)} \[/latex\] | 0.000533 |
+| **Sum** | | 0.999998 |
+| **(rounded)** | | 1 |
+
+Let's see if the outcome adheres to Kolmogorov's probability axioms that we discussed above, to verify whether it really _is_ a valid probability distribution :)
+
+1. **Each probability must be a non-negative real number**. This is true for our outcomes: each is real-valued and positive.
+2. **The sum of probabilities must be 1**. This is also true for our outcomes: the sum of the rounded values above is \[latex\]\\approx 1\[/latex\], due to rounding. The _true_ sum is 1.
+
+In fact, for our logits scenario, any input would satisfy these criteria. First of all, the denominator for any of the inputs would be the same, so they are normalized into the \[latex\](0, 1)\[/latex\] range, summing together to 1. What's more, as we can see, due to the nature of the exponential function, any input indeed yields a positive real number when fed to the Softmax function:
+
+[![](images/softmax_logits.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/softmax_logits.png)
+
+This also explains why our \[latex\]logit = 4.3\[/latex\] produces such a large probability of \[latex\]p \\approx 0.872661\[/latex\] :)
+
+This, in turn, means: hooray, we can use Softmax for generating a probability distribution! 🎉
+
+...but we still don't know _why_ Softmax and its distribution-generating characteristics make it so useful for training neural networks. Let's find out :)
+
+* * *
+
+## Why Softmax in neural networks?
+
+If we're looking for an answer to why Softmax is so useful for neural networks, we'll have to look at three things.
+
+Firstly, we'll have to explore why we cannot use argmax directly instead of approximating its outcome with Softmax. 
+
+Then, we'll have to look at the benefits of using exponents over traditional normalization and at the benefits of using Euler's constant as the base for the exponent.
+
+Finally, we're going to find out what this means for the optimization process, and why this is exactly what we want from a neural net.
+
+### Why no argmax directly?
+
+Recall that we have a neural network with a logits layer that has these outputs:
+
+[![](images/logits_with_outputs.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/logits_with_outputs.png)
+
+If you're thinking very strictly today, you may wonder about this: why don't we simply take the `argmax` value as our activation function? Doesn't it provide you with the same result?
+
+That is, with \[latex\]\\textbf{x} = \[2.0, 4.3, 1.2, -3.1\]\[/latex\] being the input to some `argmax` function, the output would be \[latex\]\[0, 1, 0, 0\]\[/latex\]. This is great, because we now have the output value!
+
+But is it correct?
+
+We don't know, as we don't know what our input is. If we did, we could check.
+
+Now suppose that we input an image that should have been class 4. This is problematic, as our output is \[latex\]\[0, 1, 0, 0\]\[/latex\] - or class 2!
+
+We'd need to improve!
+
+By default, in neural networks, optimization techniques like [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or [adaptive optimizers](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) are used for this purpose. For each trainable parameter, backpropagation computes the gradient of the [loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) with respect to that parameter, chaining through the intermediate layers, and the optimizer subsequently adapts the parameter's weights.
+
+But this requires that a function is differentiable. And `argmax` is not - or useless where it is (Shimis, n.d.). Either the gradient is zero almost everywhere (because if you move a really small bit on the argmax function, the outputs remain the same), or it is undefined, as argmax is not continuous (Shimis, n.d.). By consequence, argmax cannot be used when training neural networks with gradient descent based optimization.
+
+Softmax can: besides having nice properties with regard to normalization (as we saw before), it can be differentiated. Hence, it's very useful for optimizing your neural network.
+
+Now, you may wonder: all right, I believe that I can't use argmax. But why do I have to use Softmax instead? Why can't I just normalize without any exponents (Vega, n.d.)?
+
+### Benefits of the exponent
+
+That is, instead of writing...
+
+\[latex\] \\frac{exp(x\_i)}{\\sum\_j exp(x\_j)} = \\frac{exp(2.0)}{exp(2.0) + exp(4.3) + exp(1.2) + exp(-3.1)} \[/latex\]
+
+...you would want to write...
+
+\[latex\]\\frac{2.0}{2.0 + 4.3 + 1.2 + (-3.1)}\[/latex\]
+
+It makes perfect sense, but let's now take a look at what we _want_ our output to be. Even though we don't use it, don't we want our output to be like argmax? That the actual class arrives on top?
+
+Now take a look at our logits. 
+
+With argmax, they would convert to:
+
+| **Logit value** | Argmax computation | Argmax outcome |
+| --- | --- | --- |
+| 2.0 | \[latex\]argmax(2.0, 4.3, 1.2, -3.1)\[/latex\] | \[latex\]\[0, 1, 0, 0\]\[/latex\] |
+| 4.3 | \[latex\]argmax(2.0, 4.3, 1.2, -3.1)\[/latex\] | \[latex\]\[0, 1, 0, 0\]\[/latex\] |
+| 1.2 | \[latex\]argmax(2.0, 4.3, 1.2, -3.1)\[/latex\] | \[latex\]\[0, 1, 0, 0\]\[/latex\] |
+| \-3.1 | \[latex\]argmax(2.0, 4.3, 1.2, -3.1)\[/latex\] | \[latex\]\[0, 1, 0, 0\]\[/latex\] |
+
+As we saw before, our Softmax converts to \[latex\]\[0.09, 0.87, 0.04, 0.00\]\[/latex\]. This is really close!
+
+But what would happen with "normal", or exponent-free, division?
+
+| **Logit value** | Regular division | Division outcome |
+| --- | --- | --- |
+| 2.0 | \[latex\]\\frac{2.0}{2.0 + 4.3 + 1.2 + (-3.1)}\[/latex\] | 0.455 |
+| 4.3 | \[latex\]\\frac{4.3}{2.0 + 4.3 + 1.2 + (-3.1)}\[/latex\] | 0.977 |
+| 1.2 | \[latex\]\\frac{1.2}{2.0 + 4.3 + 1.2 + (-3.1)}\[/latex\] | 0.273 |
+| \-3.1 | \[latex\]\\frac{-3.1}{2.0 + 4.3 + 1.2 + (-3.1)}\[/latex\] | \-0.705 ??? 😕 |
+
+As we can see, the values no longer make sense, and no longer even adhere to Kolmogorov's axioms - so they do not represent a valid probability distribution!
+
+Hence, we use Softmax.
+
+Now, one final question: why do we use the base of the natural logarithm \[latex\]e\[/latex\] with Softmax? Why don't we use a constant, say, \[latex\]f(x) = 3^x\[/latex\] instead of \[latex\]f(x) = e^x\[/latex\]?
+
+This has to do with the derivatives (Vega, n.d.; CliffsNotes, n.d.):
+
+- For \[latex\]f(x) = e^x\[/latex\], the derivative is \[latex\]f'(x) = e^x\[/latex\].
+- For \[latex\]f(x) = a^x\[/latex\], where \[latex\]a\[/latex\] is some constant, the derivative is \[latex\]f'(x) = (\\ln(a)) \* a^x\[/latex\].
+
+The derivative for \[latex\]e^x\[/latex\] is thus much nicer, and hence preferred.
+
+### Maximizing logit values for class outcomes
+
+All right. We can use Softmax to generate a discrete probability distribution over the target classes, as represented by the neurons in the logits layer.
+
+Now, before we work on an example model with Keras, it's time to briefly stop and think about what happens during optimization.
+
+As you likely know, during the forward pass in the [high-level supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), your training data is fed to the model. The predictions are compared with the ground truth, i.e. the targets, and eventually summarized in a loss value. Based on this loss value, backpropagation computes the gradient for improvement, and the optimizer subsequently performs this improvement based on its idiosyncrasies. This iterative process stops when the model performs adequately enough.
+
+So far so good, but what happens when you design a neural network with `num_classes` output neurons, as well as a Softmax layer?
+
+In practice, it's important to understand that your model will likely learn to map certain classes to certain logits - i.e., it learns to maximize certain logit values for certain class outcomes. Doing so, the training process effectively learns to "steer" inputs to outputs, generating a practically useful machine learning model.
+
+This is good: for new inputs from some class, the odds increase that the class outcome equals the ground truth. This, with the other reasons from above, is why Softmax is so useful for neural networks.
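+
+Before we move to Keras, here is a quick NumPy sketch that reproduces the numbers above - just an illustration using our running example logits, not part of the upcoming model:
+
+```
+import numpy as np
+
+logits = np.array([2.0, 4.3, 1.2, -3.1])
+
+# Softmax: exponentiate, then divide by the sum of all exponents
+softmax = np.exp(logits) / np.sum(np.exp(logits))
+print(softmax)                 # roughly [0.0875 0.8727 0.0393 0.0005], sums to 1
+
+# Naive, exponent-free normalization: produces a negative "probability"
+print(logits / logits.sum())   # roughly [0.455 0.977 0.273 -0.705]
+
+# Argmax: picks the winning class, but cannot be differentiated
+print(np.argmax(softmax))      # 1, i.e. the second class: [0, 1, 0, 0]
+```
+
+In practice, implementations subtract the largest logit from all logits before exponentiating, to keep the computation numerically stable; the outcome is identical.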
+ +Let's now take a look at an example :) + +* * * + +## Softmax example with Keras + +Now, let's move from theory to practice - we're going to code! + +In fact, we're going to code an example model with Keras that makes use of the Softmax function for classification. More specifically, it's going to be a densely-connected neural network that will learn to classify samples into one of four classes. Fortunately (...and as intended 🤡), the training data (which we'll generate as part of the process) is separable in 2D space, albeit not linearly: + +[![](images/example_nonlinear.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/example_nonlinear.png) + +Time to open up your editor and start coding! 😎 + +### Imports + +First, we'll define a few imports: + +``` +''' + Keras model to demonstrate Softmax activation function. +''' +import keras +from keras.models import Sequential +from keras.layers import Dense +from keras.utils import to_categorical +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +``` + +Most importantly, we use Keras and a few of its modules to build the model. Under the hood, it will run on Tensorflow. Additionally, we'll import Matplotlib's Pyplot library for visualizing the data, Numpy for number processing and Scikit-learn for generating the data. + +Therefore, make sure that you have these dependencies installed before you run this model. In the simplest case, you can install them with `pip install`: + +``` +pip install keras tensorflow matplotlib numpy scikit-learn +``` + +(however, you might wish to install them with Anaconda instead!) + +### Model config + +Next up is the model configuration. We define here how many samples we'll generate, how much of them are used for _testing_ the trained model (250), where in 2D space our clusters are located, how many clusters we've got and which loss function is to be used (indeed, as we expect with Softmax activation at our final layer, [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/)). + +``` +# Configuration options +num_samples_total = 1000 +training_split = 250 +cluster_centers = [(15,0), (15,15), (0,15), (30,15)] +num_classes = len(cluster_centers) +loss_function_used = 'categorical_crossentropy' +``` + +### Generating data + +After configuring our model, it's time to generate some data. We use Scikit-Learn's `make_blobs` for this purpose, which allows us to generate clusters of samples as illustrated in the plot above. We generate them according to our config, i.e., based on the cluster centers, the number of samples to be generated in total and the number of classes we want. + +``` +# Generate data +X, targets = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 1.5) +categorical_targets = to_categorical(targets) +X_training = X[training_split:, :] +X_testing = X[:training_split, :] +Targets_training = categorical_targets[training_split:] +Targets_testing = categorical_targets[:training_split].astype(np.integer) +``` + +Once data has been generated, we can convert the targets into one-hot encoded vectors, in order to make them compatible with [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/). Finally, we make the split between training and testing data. 
+ +Once this is done, we can set the shape of our input data, as we know this by now: + +``` +# Set shape based on data +feature_vector_length = len(X_training[0]) +input_shape = (feature_vector_length,) +print(f'Feature shape: {input_shape}') +``` + +We can also generate the visualization you saw earlier: + +``` +plt.scatter(X_training[:,0], X_training[:,1]) +plt.title('Nonlinear data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() +``` + +### Model architecture + +Now that we have imported the dependencies that we need and have configured the model, it's time to define its _architecture_. + +It's going to be a very simple one; a densely-connected one, to be precise. It will have three layers, of which one is an output layer. The first layer takes in data of `input_shape` shape, activates by means of [ReLU](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) and hence requires [He weight init](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/). It has a (12, ) output shape. + +The second layer works similarly, but learns has an (8, ) output shape instead. + +The final layer, our output layer, learns `num_classes` outputs. As in our case, `num_classes = 4`, it aligns with the scenario we've been discussing throughout this blog post. What's more, rather than ReLU activation, it uses Softmax, so we'll end up with a multiclass probability distribution! + +``` +# Create the model +model = Sequential() +model.add(Dense(12, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(8, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(num_classes, activation='softmax')) +``` + +### Compilation, data fitting and evaluation + +We can subsequently compile (i.e. configure) the model based on the [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) we configured as well as the [optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) we want to use, and set additional metrics (accuracy due to the fact that it's intuitive for humans). + +Then, we fit the training data to the model, train it for 30 iterations (epochs) and use a batch size of 5. 20% of the training data will be used for validation purposes and all output is shown on screen with verbosity mode set to True. + +``` +# Configure the model and start training +model.compile(loss=loss_function_used, optimizer=keras.optimizers.adam(lr=0.001), metrics=['accuracy']) +history = model.fit(X_training, Targets_training, epochs=30, batch_size=5, verbose=1, validation_split=0.2) + +# Test the model after training +test_results = model.evaluate(X_testing, Targets_testing, verbose=1) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%') +``` + +Once data has been fit, it's time to test the model. We do so by means of `model.evaluate`, feeding it the testing data. The outcome is shown on screen. + +If desired, the `history` object can be used to [visualize the training process](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/). + +### Full model code + +Here's the full model code, if you wish to start playing around right away: + +``` +''' + Keras model to demonstrate Softmax activation function. 
+''' +import keras +from keras.models import Sequential +from keras.layers import Dense +from keras.utils import to_categorical +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs + +# Configuration options +num_samples_total = 1000 +training_split = 250 +cluster_centers = [(15,0), (15,15), (0,15), (30,15)] +num_classes = len(cluster_centers) +loss_function_used = 'categorical_crossentropy' + +# Generate data +X, targets = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 1.5) +categorical_targets = to_categorical(targets) +X_training = X[training_split:, :] +X_testing = X[:training_split, :] +Targets_training = categorical_targets[training_split:] +Targets_testing = categorical_targets[:training_split].astype(np.integer) + +# Set shape based on data +feature_vector_length = len(X_training[0]) +input_shape = (feature_vector_length,) +print(f'Feature shape: {input_shape}') + +# Generate scatter plot for training data +plt.scatter(X_training[:,0], X_training[:,1]) +plt.title('Nonlinear data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Create the model +model = Sequential() +model.add(Dense(12, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(8, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(num_classes, activation='softmax')) + +# Configure the model and start training +model.compile(loss=loss_function_used, optimizer=keras.optimizers.adam(lr=0.001), metrics=['accuracy']) +history = model.fit(X_training, Targets_training, epochs=30, batch_size=5, verbose=1, validation_split=0.2) + +# Test the model after training +test_results = model.evaluate(X_testing, Targets_testing, verbose=1) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%') +``` + +## Results + +Once run, you should find an extremely well-performing model (which makes sense, as the data is separable nonlinearly, which our model is capable of): + +``` +Test results - Loss: 0.002027431168593466 - Accuracy: 100.0% +``` + +Of course, in practice, your machine learning models will be more complex - and so will your data - but that wasn't the goal of this blog post. Rather, besides learning about Softmax in theory, you've now also seen how to apply it in practice ;) + +* * * + +## Summary + +This blog post revolved around the Softmax activation function. What is it? How does it work? Why is it useful for neural networks? And how can we implement it in practice, using Keras? Those are the questions that we answered. + +In doing so, we saw that Softmax is an activation function which converts its inputs - likely the logits, a.k.a. the outputs of the last layer of your neural network when no activation function is applied yet - into a discrete probability distribution over the target classes. Softmax ensures that the criteria of probability distributions - being that probabilities are nonnegative realvalued numbers and that the sum of probabilities equals 1 - are satisfied. This is great, as we can now create models that learn to maximize logit outputs for inputs that belong to a particular class, and by consequence also maximize the probability distribution. Simply taking \[latex\]argmax\[/latex\] then allows us to pick the class prediction, e.g. showing it on-screen in object detectors, image classifiers and text classifiers. + +I hope you've learnt something today. 
If you did, I'd appreciate if you left a comment in the comments section below! 😊 Please do the same if you have any questions or when you have remarks, as I'll try to read everything and answer whenever possible. + +For now, thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2001, March 24). Probability distribution. Retrieved from [https://en.wikipedia.org/wiki/Probability\_distribution](https://en.wikipedia.org/wiki/Probability_distribution) + +Wikipedia. (2001, September 13). Probability axioms. Retrieved from [https://en.wikipedia.org/wiki/Probability\_axioms](https://en.wikipedia.org/wiki/Probability_axioms) + +Wikipedia. (2006, July 28). Softmax function. Retrieved from [https://en.wikipedia.org/wiki/Softmax\_function](https://en.wikipedia.org/wiki/Softmax_function) + +Shimis. (n.d.). argmax differentiable? Retrieved from [https://www.reddit.com/r/MachineLearning/comments/4e2get/argmax\_differentiable](https://www.reddit.com/r/MachineLearning/comments/4e2get/argmax_differentiable) + +Vega. (n.d.). In softmax classifier, why use exp function to do normalization? Retrieved from [https://datascience.stackexchange.com/a/24112](https://datascience.stackexchange.com/a/24112) + +CliffsNotes. (n.d.). Differentiation of Exponential and Logarithmic Functions. Retrieved from [https://www.cliffsnotes.com/study-guides/calculus/calculus/the-derivative/differentiation-of-exponential-and-logarithmic-functions](https://www.cliffsnotes.com/study-guides/calculus/calculus/the-derivative/differentiation-of-exponential-and-logarithmic-functions) + +Wikipedia. (2003, January 21). Logit. Retrieved from [https://en.wikipedia.org/wiki/Logit](https://en.wikipedia.org/wiki/Logit) diff --git a/how-to-build-a-convnet-for-cifar-10-and-cifar-100-classification-with-keras.md b/how-to-build-a-convnet-for-cifar-10-and-cifar-100-classification-with-keras.md new file mode 100644 index 0000000..368a962 --- /dev/null +++ b/how-to-build-a-convnet-for-cifar-10-and-cifar-100-classification-with-keras.md @@ -0,0 +1,605 @@ +--- +title: "How to build a ConvNet for CIFAR-10 and CIFAR-100 classification with Keras?" +date: "2020-02-09" +categories: + - "deep-learning" + - "frameworks" +tags: + - "cifar10" + - "cifar100" + - "classifier" + - "cnn" + - "convolutional-neural-networks" + - "deep-learning" + - "keras" + - "machine-learning" + - "tensorflow" +--- + +Convolutional neural networks are great tools for building image classifiers. They have been used thoroughly since the 2012 deep learning breakthrough, and have led to interesting applications such as classifiers and object detectors. + +But why are they so useful for classifying images? And how can we build one with Keras on TensorFlow 2.0? That's what today's blog post will look at. + +Firstly, we'll study why ConvNets are so suitable when your goal is to build an image classifier. Then, we'll actually build one - by using the CIFAR-10 and CIFAR-100 datasets. After inspecting the datasets, which is what we do first, we build a Keras based model using the new TensorFlow 2.0 style of implementing them. This way, you should have Python based code examples that will help you implement such classifiers yourself. + +Are you ready? Let's go! 😎 + +* * * + +\[toc\] + +* * * + +## Why Convolutional Neural Networks suit image classification + +We all know that numbers are built from digits, and more specifically, the digits 0 to 9. 
+ +Now, say that we show you a few of these digits, handwritten ones: + +[![](images/emnist-mnist.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/emnist-mnist.png) + +Check out ["Making more datasets available for Keras"](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/) for this dataset. + +Likely, you know which digits they are instantly. The first row: 4 - 1 - 4 -1. The second: 6 - 7 - 6 - 8. And so on. + +But have you ever thought about _why_ you can recognize them so fast? And why you know that the two 4s are 4s, even though they are written differently? + +### Decomposition of images into smaller, generic parts + +The answer is a little bit more complex than this (i.e., we leave the discussion about induction vs deduction out of scope here), but it essentially boils down to this: + +1. Your brain decomposes (or "breaks down") the image it sees into smaller parts. +2. These parts, in return, take some kind of "generic shape". While the bottom part of the second 4 is written in a curvy way, and the first in a cursive way, we still know that it's the bottom part of the 4. We thus instantly recognize it as the "bottom part", regardless of the precise shape it takes. + +Now, **[convolutional neural networks](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/)**, together with extra additions such as [pooling layers](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/), attempt to mimic this process. They break down input images into smaller parts in ways they have learnt during training. Preferably, these smaller parts are "generic", so that a wide range of input data should yield the same conclusion. Let's take a look at how this works. + +### How a convolutional layer works + +Convolutional neural networks are composed of, among others, convolutional layers. They are often present within the first part of the network, together with layers that are related to them. The second part, then, is composed of Dense layers often. They generate the actual classification based on the features that were extracted by the convolutional layers. + +Here is what a convolutional layer does, and why it is such a good feature extractor, at a high level: + +[![](images/CNN.png)](https://www.machinecurve.com/wp-content/uploads/2019/09/CNN.png) + +The input to this convolutional layer is a \[latex\]H \\times W\[/latex\] image, where \[latex\]H\[/latex\] is the height and \[latex\]W\[/latex\] is the width of this image. These days, most images are RGB - red, green and blue - and hence have 3 image channels. This is not different in the scenario above. + +Now, the convolutional layer works with \[latex\]N\[/latex\] so called "kernels". The value for \[latex\]N\[/latex\] can be configured by the machine learning engineer. These kernels, which have a fixed \[latex\]H\_{kernel}\[/latex\] and \[latex\]W\_{kernel}\[/latex\] that are often much smaller than the input (e.g. 3x3), have the same amount of channels as the input data (3 in our case). They are initialized with "weights", and this is what makes learning possible (as we will see later). + +Now, this "kernel" (which is 5 x 5 pixels in the schematic drawing below) slides (or "convolves") over the input data. In doing so, for each position it takes, it multiplies the weight at some point with the corresponding pixel in your input data, element-wise. 
This means that all the individual multiplications are added together, and that the output of that particular kernel-input multiplication is 1 pixel: + +[![](images/Cnn_layer-1.jpg)](https://www.machinecurve.com/wp-content/uploads/2018/11/Cnn_layer-1.jpg) + +Now, sliding over the entire image horizontally and vertically, it produces many of such "outputs" - rendering the output on the right in the image above. This output, which is called a "feature map", is smaller than the input data, and essentially contains the input data in a more "abstract" fashion. Now, as there are \[latex\]N\[/latex\] kernels, there will be \[latex\]N\[/latex\] such feature maps produced by a convolutional layer. + +### Feature detection and the "smaller parts" + +The fun thing, here, is that the network can be trained. That is, the weights can be adapted. During this training process, the network as a whole will produce one output value. This output value can be compared to the true target - a.k.a. the "ground truth". The difference between the two can be captured in a [loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) that can subsequently be used for [optimizing the model](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/). This way, iteratively, the model can learn weights that yield the most optimal outputs. + +Now, possibly and most preferably, what will these weights become? + +Indeed - **the individual parts of the images, such as the "bottom part of the four", that represents a four "together with its top part".** + +This is why the _convolutional layers are said to be feature extractors_ in a convolutional neural network. They break down the images into smaller parts (feature maps that get more abstract when you go downstream in a ConvNet), a process that is guided by the desired outputs in the training process. This way, you'll eventually get a Convolutional neural network that learns to detect "parts" in the images that are very discriminative with respect to the final outcome. + +And that's precisely what you want when you're training an image classifier! 😎 + +### Adding Pooling to make the parts generic + +The convolutional layers you - theoretically - apply so far do indeed result in a "spatial hierarchy", where the outputs of the subsequent convolutional layers get smaller every time. However, the hierarchy will look very much like the one on the right of this drawing: + +![](images/hierarchies.png) + +Thus, even though you have a spatial hierarchy, it's not very _sharp_. This, in return, will mean that even though you do break apart the inputs into smaller, more abstract blocks, the network will still be sensitive to e.g. the shape of the bottom part of the 4. + +What's more, it's still not "translation invariant", which means that it's also sensitive to the _orientation, size, and position_ of the particular element. In the case of the four, if the top part were cut off and the bottom part was shifted to the top, leaving blank space at the bottom, the network may now not detect it as a 4 anymore. + +Adding **pooling layers** may [help you resolve this issue](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/). Similar to convolutional layers, pooling layers slide over the inputs, but instead of multiplying the parts with some learnt weights, they compute a hard value such as \[latex\]max()\[/latex\]. 
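+
+To make this concrete, here is a small NumPy sketch - with made-up numbers - of what a 2x2 max pooling operation with stride 2 does to a single 4x4 feature map:
+
+```
+import numpy as np
+
+# A single, made-up 4x4 feature map
+feature_map = np.array([
+    [1, 3, 2, 0],
+    [4, 8, 1, 1],
+    [0, 2, 7, 5],
+    [1, 1, 3, 6]
+])
+
+# 2x2 max pooling with stride 2: keep only the maximum of every 2x2 block
+pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
+print(pooled)
+# [[8 2]
+#  [2 7]]
+```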
+ +- [![](images/Max-Pooling-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Max-Pooling-1.png) + +- [![](images/Max-Pooling-2.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Max-Pooling-2.png) + + +As you can see, a pooling layer - Max Pooling in the image above - substantially reduces the size of your feature map, and in this case takes the maximum value. What this means is this: + +- The most important feature in the feature map will be used. As the convolutional layer directly in front of the pooling layer will likely learn to _detect the object(s) of interest_, this is likely the object we want to detect. +- It does not matter in which of the four red positions the object is present; it will always be taken along into the pooling layer's output. + +This way, we introduce "feature invariance" into the model. Together, the convolutional layer both "learns parts" and "learns them in a generic way". Exactly what we want :) + +Now that we understand the two most important parts of a ConvNet, it's time to build one. Please note that it's possible to use additional layers such as [Dropout](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/), and that you can [apply padding when desired](https://www.machinecurve.com/index.php/2020/02/07/what-is-padding-in-a-neural-network/), but this is optional. Let's now take a look at the datasets that we'll use today! 📷 + +* * * + +## Image datasets: the CIFAR-10 and CIFAR-100 datasets + +The CIFAR datasets were introduced by Krizhevsky & Hinton (2009) and were named after the Canadian Institute for Advanced Research (CIFAR). They come in two ways: the CIFAR-10 datasets, with ten classes, and the CIFAR-100 dataset, with one hundred classes. Let's inspect them in more detail now 🕵️‍♀️ + +### The CIFAR-10 dataset + +The **CIFAR-10 dataset** contains contains 60.000 32x32 pixel RGB images across 10 classes – which means 6.000 per class. These are the classes that it supports (Krizhevsky & Hinton, 2009): + +
+- Airplane, Automobile, Bird, Cat, Deer
+- Dog, Frog, Horse, Ship, Truck
+ +A few samples: + +- [![](images/10885.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/10885.jpg) + +- [![](images/18017.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/18017.jpg) + +- [![](images/15330.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/15330.jpg) + +- [![](images/13749.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/13749.jpg) + +- [![](images/12403.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/12403.jpg) + +- [![](images/11312.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/11312.jpg) + +- [![](images/3576.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/3576.jpg) + +- [![](images/834.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/834.jpg) + +- [![](images/47056.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/47056.jpg) + +- [![](images/43819.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/43819.jpg) + +- [![](images/14650.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/14650.jpg) + +- [![](images/1523.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/1523.jpg) + + +The dataset is split into 50.000 images for training and 10.000 images for testing purposes. + +### The CIFAR-100 dataset + +**CIFAR-100** is effectively the "parent dataset" for the CIFAR-10 one. It contains many images across 100 non-overlapping classes. It also contains 60.000 samples in total, which means that each class only has 600 samples instead of 6.000 (as with the CIFAR-10 one). + +- [![](images/33582.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/33582.jpg) + +- [![](images/30218.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/30218.jpg) + +- [![](images/29735.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/29735.jpg) + +- [![](images/29119.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/29119.jpg) + +- [![](images/27872.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/27872.jpg) + +- [![](images/27757.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/27757.jpg) + +- [![](images/27260.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/27260.jpg) + +- [![](images/26544.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/26544.jpg) + +- [![](images/26247.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/26247.jpg) + +- [![](images/21402.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/21402.jpg) + +- [![](images/18167.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/18167.jpg) + +- [![](images/15743.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/15743.jpg) + + +These are the classes present within CIFAR-100 (Krizhevsky & Hinton, 2009): + +
+- Beaver, Dolphin, Otter, Seal, Whale
+- Aquarium fish, Flatfish, Ray, Shark, Trout
+- Orchids, Poppies, Roses, Sunflowers, Tulips
+- Bottles, Bowls, Cans, Cups, Plates
+- Apples, Mushrooms, Oranges, Pears, Sweet peppers
+- Clock, Computer keyboard, Lamp, Telephone, Television
+- Bed, Chair, Couch, Table, Wardrobe
+- Bee, Beetle, Butterfly, Caterpillar, Cockroach
+- Bear, Leopard, Lion, Tiger, Wolf
+- Bridge, Castle, House, Road, Skyscraper
+- Cloud, Forest, Mountain, Plain, Sea
+- Camel, Cattle, Chimpanzee, Elephant, Kangaroo
+- Fox, Porcupine, Possum, Raccoon, Skunk
+- Crab, Lobster, Snail, Spider, Worm
+- Baby, Boy, Girl, Man, Woman
+- Crocodile, Dinosaur, Lizard, Snake, Turtle
+- Hamster, Mouse, Rabbit, Shrew, Squirrel
+- Maple, Oak, Palm, Pine, Willow
+- Bicycle, Bus, Motorcycle, Pickup truck, Train
+- Lawn-mower, Rocket, Streetcar, Tank, Tractor
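+
+As a side note: each row above groups five related classes, and CIFAR-100 also ships with 20 coarse "superclass" labels next to the 100 fine-grained ones. If you want to inspect both yourself, here is a minimal sketch using the Keras dataset loader (the print statements are just for illustration):
+
+```
+from tensorflow.keras.datasets import cifar100
+
+# label_mode can be 'fine' (100 classes) or 'coarse' (20 superclasses)
+(X_train, y_train_fine), (_, _) = cifar100.load_data(label_mode="fine")
+(_, y_train_coarse), (_, _) = cifar100.load_data(label_mode="coarse")
+
+print(X_train.shape)       # (50000, 32, 32, 3)
+print(y_train_fine[:5])    # integer labels in [0, 99]
+print(y_train_coarse[:5])  # integer labels in [0, 19]
+```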
+ +Like the CIFAR-10 dataset, the CIFAR-100 dataset is also split in a 50.000/10.000 fashion (Krizhevsky & Hinton, 2009). + +* * * + +## Keras neural network for CIFAR-10 classification + +Now that we know what our datasets look like, let's take a look at some code! 👩‍💻 + +Open up a code editor, create a file (e.g. `cifar10.py`) and let's go :) + +### What you'll need to run the model + +...but wait: one small intermezzo. Obviously, you cannot run this model out of the blue: you'll need to install a few dependencies before you can run it. Don't worry, you don't need to install much, but you do need at least: + +- **TensorFlow**, and preferably TensorFlow 2.0+: `pip install tensorflow` or, if you have a strong GPU, `pip install tensorflow-gpu`. +- **Keras**, if you don't use TensorFlow 2.0+ (otherwise, the right version comes preinstalled): `pip install keras`. +- **Numpy**, for number processing: `pip install numpy`. +- **Matplotlib**, for generating plots: `pip install matplotlib`. + +### Model imports + +Time to write some actual code! We'll start with the model imports. As our model consists of these elements... + +- The **CIFAR-10 dataset**; +- The **Sequential API**, which allows us to stack the individual layers nicely together; +- The **Conv2D, MaxPooling2D, Flatten** and **Dense** layers; +- **[Adam optimization](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam)** with **[sparse categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/)**; +- Visualizations of the [model history](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/); + +...you'll need to import these dependencies. You can do this as follows: + +``` +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +import matplotlib.pyplot as plt +``` + +### Model configuration + +Now, let's set some configuration options for our model: + +``` +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 32, 32, 3 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 100 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 +``` + +What do they mean is what you may wonder now. Let's find out: + +- The **batch size** is the amount of samples that will be [fed forward](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#forward-pass) in your model at once, after which the loss value is computed. You could either feed the model the [entire training batch, one sample every time or a minibatch](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/#variants-of-gradient-descent) - and you can set this value by specifying `batch_size`. +- The **image width, image height and number of channels**. Width and height are 32, respectively, and number of channels is 3, as the dataset contains RGB images. +- The **loss function** used to compare predictions with ground truth during training. We use [sparse categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#sparse-categorical-crossentropy). 
We skip the "why" for now - I'll show you later why we use _sparse_ instead of regular categorical crossentropy loss.
+- The **number of classes** and **number of epochs** (or iterations), which we set to 10 and 100, respectively. We set the first to 10 because we have ten distinct classes - the ten object categories of CIFAR-10. The second is set to 100 because I'm assuming that we'll have passed maximum model performance by then. We don't want to be training infinitely, as this induces [overfitting](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/#how-well-does-your-model-perform-underfitting-and-overfitting).
+- The **optimizer**, or the method by which we update the weights of our neural network. We use [Adam optimization](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam) - which is a relatively state-of-the-art optimizer and common in today's neural networks.
+- 20% of our training data will be used for **validation purposes**; that is, used to test the model with non-training-data _during_ training.
+- Verbosity mode is set to "1", which means "True", which means that all the output is displayed on screen. This is good for understanding what happens during training, but it's best to turn it off when you _actually_ train models, as it slows down the training process.
+
+### Loading & preparing CIFAR-10 data
+
+Now, let's load some CIFAR-10 data. We can do so easily because Keras provides intuitive access to the dataset by design:
+
+```
+# Load CIFAR-10 data
+(input_train, target_train), (input_test, target_test) = cifar10.load_data()
+```
+
+The next step is to determine the shape of one sample. This is required by Keras to understand what data it can expect in the input layer of your neural network. You can do so as follows:
+
+```
+# Determine shape of the data
+input_shape = (img_width, img_height, img_num_channels)
+```
+
+Next, two technical things. Firstly, we'll convert our data into `float32` format, which speeds up training. Then, we normalize the data, into the \[latex\]\[0, 1\]\[/latex\] range.
+
+```
+# Parse numbers as floats
+input_train = input_train.astype('float32')
+input_test = input_test.astype('float32')
+
+# Normalize data
+input_train = input_train / 255
+input_test = input_test / 255
+```
+
+### Creating the model architecture
+
+We can then create the architecture of our model. First, we'll instantiate the `Sequential` API and assign it to `model` - this is like the foundation of your model, the Lego board onto which you can "click" bricks, a.k.a. layers.
+
+Next, it's time to stack a few layers. Firstly, we'll use three convolutional blocks - which is the nickname I often use for convolutional layers with some related ones. In this case, the related layer that is applied every time is a `MaxPooling2D` one directly after the `Conv2D` layer. As you can see, each time, the number of feature maps increases - from 32, to 64, to 128. This is done because the model then learns a limited number of "generic" patterns (32) and a larger number of patterns unique to the image (128). Max Pooling ensures translation invariance, as we discussed before.
+
+After the convolutional blocks, we add a `Flatten` layer. The `Dense` layers, which are responsible for generating the actual classifications, only work with one-dimensional data. Flatten makes this happen: it converts the multidimensional feature maps into one-dimensional shape. Great!
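+
+To make the effect of `Flatten` concrete, here is a tiny NumPy illustration - the 2x2x128 shape is just an example of what the final feature maps could look like:
+
+```
+import numpy as np
+
+# A made-up batch of 4 samples, each with 2x2 feature maps across 128 channels
+feature_maps = np.random.rand(4, 2, 2, 128)
+
+# What Flatten does, conceptually: turn every sample into one long vector
+flattened = feature_maps.reshape(4, -1)
+print(flattened.shape)  # (4, 512) - i.e. 2 * 2 * 128 values per sample
+```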
+
+As said, the Dense layers ensure that classification is possible. As you can see, in terms of the number of outputs per layer, we create an information bottleneck that eventually converges to `no_classes` - thus 10 - outputs, exactly the number of unique classes in our dataset. As we're using the [Softmax activation function](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/), we'll get a discrete multiclass probability distribution as our output for any input. From this distribution, we can draw the one with the highest value, which is the most likely class for our input. There we go, our classifier is ready! Or isn't it? 😉
+
+```
+# Create the model
+model = Sequential()
+model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
+model.add(MaxPooling2D(pool_size=(2, 2)))
+model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
+model.add(MaxPooling2D(pool_size=(2, 2)))
+model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
+model.add(MaxPooling2D(pool_size=(2, 2)))
+model.add(Flatten())
+model.add(Dense(256, activation='relu'))
+model.add(Dense(128, activation='relu'))
+model.add(Dense(no_classes, activation='softmax'))
+```
+
+### Compiling the model & fitting data
+
+To be fair: no, it's not :) What we did so far was create the _skeleton_ for our model. We don't have a _model_ yet, as it must be compiled first. This can be done by calling `model.compile`. As you can see, it involves specifying settings for the training process, such as the loss function and the optimizer. What's more, and what I always prefer, is to add accuracy as an additional metric, due to it being intuitive for humans.
+
+Once the model is compiled, we _do_ have a model, but it's not yet trained. We can start the training process by calling `model.fit`, which fits our data (in this case our training data and the corresponding targets) and specifies some settings for our training process, ones that we configured before.
+
+Here, it also becomes clear why we decided to use _sparse_ categorical crossentropy instead of _true_ categorical crossentropy. Categorical crossentropy requires our data to be categorical, which can e.g. be achieved with `to_categorical`, i.e. one-hot encoding of your target vectors.
+
+Our data is not categorical by nature: our targets are integers in the range \[latex\]\[0, 9\]\[/latex\]. But why convert them, I'd argue, if there is a loss function which does the same as _true_ categorical crossentropy but works with integer targets? Indeed, [_sparse_ categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/) is exactly this loss function. Hence, we choose it over the other one :)
+
+```
+# Compile the model
+model.compile(loss=loss_function,
+              optimizer=optimizer,
+              metrics=['accuracy'])
+
+# Fit data to model
+history = model.fit(input_train, target_train,
+            batch_size=batch_size,
+            epochs=no_epochs,
+            verbose=verbosity,
+            validation_split=validation_split)
+```
+
+### Generating evaluation metrics & visualizations
+
+We're almost there. As you can see, we assigned the results of `model.fit` to a `history` object. This will allow us to see the _testing_ results as well as [generate nice plots of the training process](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/).
Here's the code: + +``` +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Visualize history +# Plot history: Loss +plt.plot(history.history['val_loss']) +plt.title('Validation loss history') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.show() + +# Plot history: Accuracy +plt.plot(history.history['val_accuracy']) +plt.title('Validation accuracy history') +plt.ylabel('Accuracy value (%)') +plt.xlabel('No. epoch') +plt.show() +``` + +Ready! We have a functional Keras model now 😊 Open up a terminal which has the sofware dependencies installed, `cd` into the folder where your Python code is located, and run e.g. `python cifar10.py`. The training process should now begin! :) + +### Full model code + +If you wish to obtain the full model code at once, here you go: + +``` +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +import matplotlib.pyplot as plt + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 32, 32, 3 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 100 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Load CIFAR-10 data +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Visualize history +# Plot history: Loss +plt.plot(history.history['val_loss']) +plt.title('Validation loss history') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.show() + +# Plot history: Accuracy +plt.plot(history.history['val_accuracy']) +plt.title('Validation accuracy history') +plt.ylabel('Accuracy value (%)') +plt.xlabel('No. epoch') +plt.show() +``` + +### The results - how well does our CIFAR-10 CNN perform? + +Once the training process finishes, it's time to look at some statistics. Firstly, the test results from `model.evaluate`: + +``` +Test loss: 2.931418807697296 / Test accuracy: 0.6948000192642212 +``` + +In approximately 70% of the cases, our model was correct. 
This is in line with the _validation_ accuracies visualized below across the epochs. But, quite importantly, take a look at the _loss values_ now! At first, loss went down pretty fast, reached a minimum at about the 5th epoch, and then went up again - substantially. + +This is a clear sign that our model is overfitting, or that it is highly adapted to our _training dataset_. This may mean that its performance on data it has never seen before is worse than if the training process was stopped at e.g. the fifth epoch. Take a look at these blog posts if you wish to reduce the impact of overfitting: + +- [What is Dropout? Reduce overfitting in your neural networks](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/) +- [What are L1, L2 and Elastic Net Regularization in neural networks?](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/) +- [Avoid wasting resources with EarlyStopping and ModelCheckpoint in Keras](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/) + +- [![](images/val_acc.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/val_acc.png) + +- [![](images/val_loss.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/val_loss.png) + + +* * * + +## Keras neural network for CIFAR-100 classification + +Let's now take a look how to create a Keras model for the CIFAR-100 dataset :) + +### From CIFAR-10 to CIFAR-100 + +In order to ensure that this blog post stays within check in terms of length, we'll take the model we just created for the CIFAR-10 dataset as our base model. In fact, this is smart for another reason: the CIFAR-100 dataset, like the CIFAR-10 one, has 60.000 samples of shape \[latex\](32, 32, 3)\[/latex\]. + +Essentially, moving from CIFAR-10 to CIFAR-100 is thus very easy! First, let's change the import so that it supports CIFAR-100: + +``` +from tensorflow.keras.datasets import cifar100 +``` + +Instead of `cifar10`, you'll import `cifar100`. Then, you change it in a similar way in the `load_data` part of your model: + +``` +# Load CIFAR-100 data +(input_train, target_train), (input_test, target_test) = cifar100.load_data() +``` + +Finally, also make sure to change the number of classes from ten to one hundred: `no_classes = 100`. + +Ready to go! Open up a new terminal, or use your same terminal, `cd` to the folder and run e.g. `python cifar100.py`. 
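+
+To recap, these are the only lines that differ from the CIFAR-10 script shown earlier - everything else stays exactly the same:
+
+```
+from tensorflow.keras.datasets import cifar100
+
+no_classes = 100
+
+# Load CIFAR-100 data
+(input_train, target_train), (input_test, target_test) = cifar100.load_data()
+```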
+ +### Full model code + +Here's the full model for CIFAR-100, if you wish to use it directly: + +``` +from tensorflow.keras.datasets import cifar100 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +import matplotlib.pyplot as plt + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 32, 32, 3 +loss_function = sparse_categorical_crossentropy +no_classes = 100 +no_epochs = 100 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Load CIFAR-100 data +(input_train, target_train), (input_test, target_test) = cifar100.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Visualize history +# Plot history: Loss +plt.plot(history.history['val_loss']) +plt.title('Validation loss history') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.show() + +# Plot history: Accuracy +plt.plot(history.history['val_accuracy']) +plt.title('Validation accuracy history') +plt.ylabel('Accuracy value (%)') +plt.xlabel('No. epoch') +plt.show() +``` + +### The results - how well does our CIFAR-100 CNN perform? + +As you can see, our CIFAR-100 image classifier performs worse than the CIFAR-10 one. This is not strange: the CIFAR-100 one has ten times as many classes that it can choose from. + +Neither is it strange that we observe overfitting again, by looking at the plot that displays validation loss. Our best model, with an accuracy of just above 35%, is achieved around epoch number 20. After then, performance deteriorates. + +- ![](images/100_acc.png) + +- ![](images/100_loss.png) + + +* * * + +## Summary + +In this blog post, we looked at how we can implement a CNN based classifier with Keras for the CIFAR-10 and CIFAR-100 datasets. Firstly, we explored why ConvNets are so good for building image classifiers: having convolutional layers work as "feature extractors" essentially allows you to let the model take care of feature engineering as well. This included a discussion on why pooling layers may improve the effectiveness of your CNN even further. 
+ +Then, we looked at the datasets - the CIFAR-10 and CIFAR-100 image datasets, with hundreds to thousands of samples across ten or one hundred classes, respectively. This was followed by implementations of CNN based classifiers using Keras with TensorFlow 2.0, one of the more popular deep learning frameworks used today. + +I hope you've learnt something from this blog post! 😊 If you did, feel free to leave a comment in the comments box below. Please do the same if you have questions, if you spot mistakes, or when you have other remarks. I'll try to answer as soon as I can! + +Thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +MachineCurve. (2020, January 6). Exploring the Keras Datasets. Retrieved from [https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/) + +MachineCurve. (2020, February 2). Convolutional Neural Networks and their components for computer vision. Retrieved from [https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) + +MachineCurve. (2019, September 24). How to create a CNN classifier with Keras? Retrieved from [https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) + +Keras. (n.d.). Datasets: CIFAR10 small image classification. Retrieved from [https://keras.io/datasets/#cifar10-small-image-classification](https://keras.io/datasets/#cifar10-small-image-classification) + +Keras. (n.d.). Datasets: CIFAR100 small image classification. Retrieved from [https://keras.io/datasets/#cifar100-small-image-classification](https://keras.io/datasets/#cifar100-small-image-classification) + +Krizhevsky, A., & Hinton, G. (2009). _[Learning multiple layers of features from tiny images](https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf)_ (Vol. 1, No. 4, p. 7). Technical report, University of Toronto (alternatively: [take a look at their website](https://www.cs.toronto.edu/~kriz/cifar.html)!). diff --git a/how-to-build-a-resnet-from-scratch-with-tensorflow-2-and-keras.md b/how-to-build-a-resnet-from-scratch-with-tensorflow-2-and-keras.md new file mode 100644 index 0000000..0f22a32 --- /dev/null +++ b/how-to-build-a-resnet-from-scratch-with-tensorflow-2-and-keras.md @@ -0,0 +1,1073 @@ +--- +title: "How to build a ResNet from scratch with TensorFlow 2 and Keras" +date: "2022-01-20" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "keras" + - "machine-learning" + - "residual-network" + - "resnet" + - "tensorflow" +--- + +In computer vision, residual networks or ResNets are still one of the core choices when it comes to training neural networks. These networks, which implement building blocks that have skip connections _over_ the layers within the building block, perform much better than plain neural networks. In today's article, you're going to take a practical look at these neural network types, by building one yourself - from scratch! + +After reading this tutorial, you will understand... 
+ +- **What residual networks (ResNets) are.** +- **How to build a configurable ResNet from scratch with TensorFlow and Keras.** +- **What performance can be achieved with a ResNet model on the CIFAR-10 dataset.** + +In other words, by learning to build a ResNet from scratch, you will learn to understand what happens thoroughly. + +Are you ready? Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## What are residual networks (ResNets)? + +Training a neural network is really difficult. Especially in the early days of the _deep learning_ _revolution_, people often didn't know why their neural networks converged to an optimum... and neither why they did not. + +If you're familiar with machine learning (and likely you are when reading this tutorial), you have heard about vanishing and exploding gradients. These two problems made training neural networks really difficult. However, interestingly and strangely, even when replacing classic activation functions with ReLU nonlinearities and adding Batch Normalization, a problem persisted. He et al. (2016) clearly described it in their paper _Deep residual learning for image recognition:_ a neural network that has more layers would possibly perform worse than one with fewer layers. + +And this goes directly against what should be possible in theory. In fact, a neural network with more layers is increasingly able to learn the feature representations that are necessary for achieving good performance. But adding layers made performance worse. Strange, isn't it? + +[Shattering gradients](https://www.machinecurve.com/index.php/2022/01/13/resnet-a-simple-introduction/#shattering-gradients-problem), where neural network gradients resemble white noise during optimization, may lie at the basis of this problem. And [residual networks](https://www.machinecurve.com/index.php/2022/01/13/resnet-a-simple-introduction/#introducing-residual-networks-resnets) or [ResNets](https://www.machinecurve.com/index.php/2022/01/13/resnet-a-simple-introduction/#introducing-residual-networks-resnets) for short help overcome this problem. A ResNet is a neural network that is composed of _residual building blocks_: weighted layers to which a _skip connection_ is added. This skip connection allows information to pass more freely, and gradients to be more realistic. The image below shows a residual building block: + +![](images/image-3.png) + +Source: He, K., Zhang, X., Ren, S., & Sun, J. (2016). [Deep residual learning for image recognition.](https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf) In _Proceedings of the IEEE conference on computer vision and pattern recognition_ (pp. 770-778). + +In practice, using today's deep learning libraries, building the skip connection is really easy. The skip connection \[latex\]\\textbf{x}\[/latex\] displayed in the image can simply be added to the output of the regular block. As you will see, however, this sometimes produces issues related to dimensionality and feature map size (i.e., width and height). He et al. describe two ways of resolving this, and you will explore both in the remainder of this tutorial: + +- An **identity mapping**, which simply maps the input to the output, adding padding or reducing feature map size where necessary. +- A **projection mapping**, which uses convolutions to generate an output that 'clicks' onto the next residual building block. 
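+
+In code, the core idea is compact. Below is a minimal, illustrative sketch of a single residual addition using the Keras functional API - it assumes the input already has `filters` channels so the addition works without any identity or projection mapping; the full, configurable implementation follows later in this tutorial:
+
+```
+from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, Add
+
+def minimal_residual_block(x, filters):
+  # Keep a reference to the input: this is the skip connection
+  x_skip = x
+
+  # The regular mapping: two 3x3 convolutions with BatchNorm and ReLU
+  x = Conv2D(filters, (3, 3), padding="same")(x)
+  x = BatchNormalization()(x)
+  x = Activation("relu")(x)
+  x = Conv2D(filters, (3, 3), padding="same")(x)
+  x = BatchNormalization()(x)
+
+  # Add the skip connection back onto the output, then activate
+  x = Add()([x, x_skip])
+  return Activation("relu")(x)
+```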
+ +If you're interested in the theory behind ResNets, you can [read this article](https://www.machinecurve.com/index.php/2022/01/13/resnet-a-simple-introduction/#introducing-residual-networks-resnets). Let's now take a closer look at building a simple ResNet. In today's tutorial, we're going to use TensorFlow 2 and Keras for doing so. + +* * * + +## Building a simple ResNet with TensorFlow + +Now that you understand what residual networks are, it's time to build one! Today, you'll use TensorFlow and the Keras Sequential API for this purpose. But first, let's take a look at the dataset that you will be training your ResNet model on. + +In creating the ResNet (more technically, the ResNet-20 model) we will follow the design choices made by He et al. (2016) as much as possible. That way, we hope to create a ResNet variant that is as proper as possible. Whenever we deviate from He et al.'s design decisions (and that happens only marginally), we will provide arguments for doing so. + +### Today's dataset: CIFAR-10 + +The CIFAR-10 dataset is a widely known dataset in the world of computer vision. + +> The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. +> +> Krizhevsky (n.d.) + +It is a slightly more complex dataset compared to MNIST and hence neural networks will have a bit more difficulty to achieve good performance on the dataset. As you can see in the image below, CIFAR-10 contains a broad range of common objects - like frog, truck, deer, automobile, and so forth. + +![](images/cifar10_images.png) + +### What you'll need to run this model + +Now that you understand a few things about the dataset that you will be training the model with, it's time to get coding! + +First, you'll need to ensure that you can actually _run_ the model. In other words, you'll need to make sure that you have all the dependencies installed onto your system. + +For today's code, that will be relatively easy. You will need the following in order to run the model successfully: + +- A recent version of Python. +- A recent version of the `numpy` package. +- Obviously, a recent version (2.x) of `tensorflow`, which comes with the Keras library for building a neural network. + +### Let's start writing some code: TensorFlow imports + +Enough theory for now - it's time to start writing some code! + +Open up your code editor, create a file (e.g. `resnet.py`) or a Jupyter Notebook, and write down these imports: + +``` +import os +import numpy as np +import tensorflow +from tensorflow.keras import Model +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.layers import Add, GlobalAveragePooling2D,\ + Dense, Flatten, Conv2D, Lambda, Input, BatchNormalization, Activation +from tensorflow.keras.optimizers import schedules, SGD +from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint +``` + +Let's take a brief look at why you will need them: + +- With `os`, you will perform file IO operations - which makes sense given the fact that you're going to process some input data through a neural network. +- With `numpy`, abbreviated `np`, you will manipulate the input data per the paper's data augmentation choices - we will come back to that. +- Then, you'll import `tensorflow`. Besides the library itself, you will also need to import some sub dependencies: + - You'll use the `Model` class for instantiating the ResNet that you will be creating. 
+ - Obviously, you'll need the `cifar10` dataset - it's nice that [Keras comes with datasets](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/) that can be used out of the box. + - A variety of Keras layers are also necessary. These are all described in standard deep learning literature (e.g. [Conv2d](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) or [Dense](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/)). Why they are necessary can be found in the He et al. (2016) paper. + - You're going to implement the learning rate scheduler functionality as `schedules` and the Stochastic Gradient Descent optimizer for optimization purposes. If you're a TensorFlow expert, you'll recognize that weight decay as described in the He et al. paper is not a part of this optimizer. Once again, later in the tutorial, we'll come back to why we use regular SGD instead. + - Finally, you'll also import some TensorFlow callbacks, being `TensorBoard` and `ModelCheckpoint` - for [visualizing your training results](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/) and [saving your model](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/), respectively. + +### Model configuration + +You will see that we rely heavily on Python definitions - atomic building blocks that can and will be reused throughout the code. The most widely used component from all the ones that you will create today is `model_configuration`, which serves to group and output tunable configuration options for your neural network. + +Let's briefly walk through them. + +First of all, you're going to load the input samples from the CIFAR-10 dataset, because you will need them for computing a few elements in this definition. + +Then, you're writing the generic configuration: + +- You specify the **width, height and the number of image channels** for a CIFAR-10 sample. +- You specify the **batch size**. We set it to 128, because it's one of the design decisions made by He et al. +- As CIFAR-10 has 10 classes, we set `**num_classes**` to 10. +- He et al. choose a **45/5 validation split**. As 5/(45+5) = 0.1, that's the value for our validation split. In other words, 90% of our `input_train` samples will be used for training, while 10% will be used for validation. +- Keras will run in **verbosity mode**. That is, it will write its outputs to the terminal, so that you have a better idea about training progress. +- In their paper, He et al. specify a value called **n**. Recall that it stands for the number of residual block groups and that it also relates to the number of layers present in your ResNet. In today's network, we set `n = 3`, yielding `6n + 2 = 20` layers. Indeed, we are building a ResNet-20 model. However, by simply tuning this value for `n`, you can easily change it into e.g. `6n + 2 = 6*9 + 2` or a ResNet-56 model. +- The **initial number of feature maps** is set by means of `init_fm_dim`. He et al. choose an initial value of 16 feature maps, which increases by a factor two when the feature map size halves. +- Recall that He et al. describe two **shortcut types** - the `identity` shortcut and the `projection` shortcut. In their work, they used the identity shortcut for their CIFAR-10 experiments. 
When training this network with identity shortcuts, you will find better performance compared to projection shortcuts, as described by the He et al. paper as well. However, by simply changing the variable, a different shortcut type will be used. +- Using the size of your training and validation (sub) datasets, the **number of steps per epoch** is computed. Here, we rely on another design decision made in the He et al. paper - namely that they trained their ResNet with **64000 iterations**. Using that maximum, we compute the number of steps per epoch for our training and validation data, as well as the number of epochs themselves. +- We then define some hyperparameter related options: + - The **loss function** - which is [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/), a pretty standard loss function for [multiclass classification problems](https://www.machinecurve.com/index.php/2019/10/17/how-to-use-categorical-multiclass-hinge-with-keras/). + - The **learning rate scheduler**. Initially, the optimizer will use a learning rate of `0.1`. This is a pretty intense learning rate. It will help you to achieve big leaps forward during your initial epochs, but you will subsequently overstep the optimum every time. This is why He et al. use learning rate scheduling in their ResNet - they divide the learning rate by 10 after 32000 and 48000 iterations (i.e., after 50% and 75% of training has completed). We can achieve this through TensorFlow's `PiecewiseConstantDecay`. + - In the paper, He et al. discuss using **He initialization** and hence we set it as our initializer. + - The final hyperparameter related option is choosing an **optimizer**. Here's where we differ slightly from the He et al. findings. In their work, they use Stochastic Gradient Descent (SGD) with the learning rate schedule discussed. They also use a momentum term of 0.9 and weight decay of 0.0001. When developing this ResNet, we found that the Adam optimizer did not work - but that was unexpected. However, we also found that the SGD with weight decay implementation in TensorFlow (more specifically, TensorFlow Addons' [SGDW optimizer](https://www.tensorflow.org/addons/api_docs/python/tfa/optimizers/SGDW)) did not work properly too! We could only reproduce results similar to those reported in the paper by using default _SGD_, with momentum, but without weight decay. That's why you'll use `SGD`. +- Finally, what remains is to initialize the two callbacks - the TensorBoard callback for visualizing your results and the ModelCheckpoint callback so that an instance of your model is saved after every epoch. +- Your configuration is returned as a Python dictionary. + +Quite a bit of a discussion, I agree, but well - this allows you to keep configuration in one place! :D + +``` +def model_configuration(): + """ + Get configuration variables for the model. + """ + + # Load dataset for computing dataset size + (input_train, _), (_, _) = load_dataset() + + # Generic config + width, height, channels = 32, 32, 3 + batch_size = 128 + num_classes = 10 + validation_split = 0.1 # 45/5 per the He et al. paper + verbose = 1 + n = 3 + init_fm_dim = 16 + shortcut_type = "identity" # or: projection + + # Dataset size + train_size = (1 - validation_split) * len(input_train) + val_size = (validation_split) * len(input_train) + + # Number of steps per epoch is dependent on batch size + maximum_number_iterations = 64000 # per the He et al. 
paper + steps_per_epoch = tensorflow.math.floor(train_size / batch_size) + val_steps_per_epoch = tensorflow.math.floor(val_size / batch_size) + epochs = tensorflow.cast(tensorflow.math.floor(maximum_number_iterations / steps_per_epoch),\ + dtype=tensorflow.int64) + + # Define loss function + loss = tensorflow.keras.losses.CategoricalCrossentropy(from_logits=True) + + # Learning rate config per the He et al. paper + boundaries = [32000, 48000] + values = [0.1, 0.01, 0.001] + lr_schedule = schedules.PiecewiseConstantDecay(boundaries, values) + + # Set layer init + initializer = tensorflow.keras.initializers.HeNormal() + + # Define optimizer + optimizer_momentum = 0.9 + optimizer_additional_metrics = ["accuracy"] + optimizer = SGD(learning_rate=lr_schedule, momentum=optimizer_momentum) + + # Load Tensorboard callback + tensorboard = TensorBoard( + log_dir=os.path.join(os.getcwd(), "logs"), + histogram_freq=1, + write_images=True + ) + + # Save a model checkpoint after every epoch + checkpoint = ModelCheckpoint( + os.path.join(os.getcwd(), "model_checkpoint"), + save_freq="epoch" + ) + + # Add callbacks to list + callbacks = [ + tensorboard, + checkpoint + ] + + # Create config dictionary + config = { + "width": width, + "height": height, + "dim": channels, + "batch_size": batch_size, + "num_classes": num_classes, + "validation_split": validation_split, + "verbose": verbose, + "stack_n": n, + "initial_num_feature_maps": init_fm_dim, + "training_ds_size": train_size, + "steps_per_epoch": steps_per_epoch, + "val_steps_per_epoch": val_steps_per_epoch, + "num_epochs": epochs, + "loss": loss, + "optim": optimizer, + "optim_learning_rate_schedule": lr_schedule, + "optim_momentum": optimizer_momentum, + "optim_additional_metrics": optimizer_additional_metrics, + "initializer": initializer, + "callbacks": callbacks, + "shortcut_type": shortcut_type + } + + return config +``` + +### Loading the dataset + +Because we just worked so hard, it's now time to create a very simple def - haha! :D + +Using `load_dataset`, you will be able to load CIFAR-10 data. It returns four arrays with data: + +- A combination of `(input_train, target_train)`, representing your training samples and their corresponding targets. +- Secondly, `(input_test, target_test)`, which covers your testing samples. + +``` +def load_dataset(): + """ + Load the CIFAR-10 dataset + """ + return cifar10.load_data() +``` + +### Preprocessing the dataset + +Let's now take a look at what must be done for image preprocessing. + +> The network inputs are 32×32 images, with the per-pixel mean subtracted. +> +> He et al. (2016) + +Image _preprocessing wise_, there's only a small amount of preprocessing necessary - subtracting the per-pixel mean from each input image. + +Then, He et al. also apply data augmentation to the input data: + +- Adding 4 pixels on each side by means of padding. +- Randomly sampling a 32 x 32 pixel crop from the padded image or its horizontal flip. + +> We follow the simple data augmentation in \[24\] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. +> +> He et al. (2016) + +Let's now implement this in a definition called `preprocessed_dataset`. In the def, we'll be using `ImageDataGenerator`s for flowing the data, allowing us to specify a variety of data augmentation options. + +...but unfortunately, performing padding and cropping is not part of TensorFlow's data augmentation options by default. 
+ +Fortunately, [on his website](https://jkjung-avt.github.io/keras-image-cropping/), Jung (2018) proposed a method for generating random crops of a specific size from an input image. Let's use these definitions and pay a lot of gratitude to the author: + +``` +def random_crop(img, random_crop_size): + # Note: image_data_format is 'channel_last' + # SOURCE: https://jkjung-avt.github.io/keras-image-cropping/ + assert img.shape[2] == 3 + height, width = img.shape[0], img.shape[1] + dy, dx = random_crop_size + x = np.random.randint(0, width - dx + 1) + y = np.random.randint(0, height - dy + 1) + return img[y:(y+dy), x:(x+dx), :] + + +def crop_generator(batches, crop_length): + """Take as input a Keras ImageGen (Iterator) and generate random + crops from the image batches generated by the original iterator. + SOURCE: https://jkjung-avt.github.io/keras-image-cropping/ + """ + while True: + batch_x, batch_y = next(batches) + batch_crops = np.zeros((batch_x.shape[0], crop_length, crop_length, 3)) + for i in range(batch_x.shape[0]): + batch_crops[i] = random_crop(batch_x[i], (crop_length, crop_length)) + yield (batch_crops, batch_y) +``` + +We can implement them in our `preprocessed_dataset` def. + +- First, you load the dataset arrays with `load_dataset()`. +- You'll then retrieve some necessary configuration options from the configuration dictionary. +- You now use `tensorflow.pad` to pad 4 pixels on each side in the 2nd and 3rd dimension of your input data. Recall that in TensorFlow, which follows a channels-last strategy, a batch of data can be described as being `(batch_size, rows, cols, channels)`. We don't need to manipulate the batch size or the channels, only the rows and columns. That's why we use these dimensions only. +- Then, you convert the scalar targets (i.e., integer values) into categorical format by means of [one-hot encoding](https://www.machinecurve.com/index.php/2020/11/24/one-hot-encoding-for-machine-learning-with-tensorflow-and-keras/). This way, you'll be able to use categorical crossentropy loss. +- Now, you define a data generator for your training data. A data generator will make available some inputs by means of the [generator principle](https://www.machinecurve.com/index.php/2020/04/06/using-simple-generators-to-flow-data-from-file-with-keras/). Today, you'll use the following options: + - With `validation_split`, you'll indicate what part of the data must be used for validation purposes. + - Horizontal flip implements the "or its horizontal flip" part from the data augmentation design in He et al. to our padded input image before random cropping. + - Rescaling by 1/255 is performed to ensure that [gradients don't explode](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/). + - Finally, you'll use TensorFlow's default ResNet preprocessing for doing the rest of your preprocessing work. + +> The images are converted from RGB to BGR, then each color channel is zero-centered with respect to the ImageNet dataset, without scaling. +> +> TensorFlow (n.d.) + +- From this `train_generator`, you'll generate the training and validation batches. Using `.flow`, you'll flow the training data to the data generator, taking only the training or validation part depending on the subset configuration. Then, you'll use `crop_generator` to convert the batches (which are 40x40 padded and possibly flipped images) to 32x32 format again, i.e., the "random crop". 
+- Then, you'll do the same for testing data, except for the flipping and padding/cropping - this is also per the He et al. paper. +- Finally, you return the training, validation and test batches. + +> For testing, we only evaluate the single view of the original 32×32 image. +> +> He et al. (2016) + +``` +def preprocessed_dataset(): + """ + Load and preprocess the CIFAR-10 dataset. + """ + (input_train, target_train), (input_test, target_test) = load_dataset() + + # Retrieve shape from model configuration and unpack into components + config = model_configuration() + width, height, dim = config.get("width"), config.get("height"),\ + config.get("dim") + num_classes = config.get("num_classes") + + # Data augmentation: perform zero padding on datasets + paddings = tensorflow.constant([[0, 0,], [4, 4], [4, 4], [0, 0]]) + input_train = tensorflow.pad(input_train, paddings, mode="CONSTANT") + + # Convert scalar targets to categorical ones + target_train = tensorflow.keras.utils.to_categorical(target_train, num_classes) + target_test = tensorflow.keras.utils.to_categorical(target_test, num_classes) + + # Data generator for training data + train_generator = tensorflow.keras.preprocessing.image.ImageDataGenerator( + validation_split = config.get("validation_split"), + horizontal_flip = True, + rescale = 1./255, + preprocessing_function = tensorflow.keras.applications.resnet50.preprocess_input + ) + + # Generate training and validation batches + train_batches = train_generator.flow(input_train, target_train, batch_size=config.get("batch_size"), subset="training") + validation_batches = train_generator.flow(input_train, target_train, batch_size=config.get("batch_size"), subset="validation") + train_batches = crop_generator(train_batches, config.get("height")) + validation_batches = crop_generator(validation_batches, config.get("height")) + + # Data generator for testing data + test_generator = tensorflow.keras.preprocessing.image.ImageDataGenerator( + preprocessing_function = tensorflow.keras.applications.resnet50.preprocess_input, + rescale = 1./255) + + # Generate test batches + test_batches = test_generator.flow(input_test, target_test, batch_size=config.get("batch_size")) + + return train_batches, validation_batches, test_batches +``` + +### Creating the Residual block + +Now, it's time for creating the actual residual block. Recall from the section recapping ResNets above that a residual block is composed of two methods: + +- The regular mapping. +- A skip connection. + +Using the Functional API, we can effectively create these paths and finally merge them back together. + +So, in the definition, you will first load the initializer from your model configuration. You will need it for initializing the `Conv2D` layers that you will specify next. + +Then, you create the skip connection - `x_skip` - based on the input `x`. You will later re-add this variable to the output of your residual block, effectively creating the skip connection as described in the section above. + +Next up is performing the original mapping. Per the He et al. paper, each residual block is composed of 2 convolutional layers with a 3x3 kernel size. Depending on whether you'll need to match the size of your first `Conv2D` layer with the output filter maps (which is a lower amount), you'll be using a different stride. + +> Then we use a stack of 6n layers with 3×3 convolutions on the feature maps of sizes {32, 16, 8} respectively, with 2n layers for each feature map size. +> +> He et al. 
paper
+
+Each layer is followed by Batch Normalization and a ReLU activation function.
+
+Then it's time to add the skip connection. You will do this by means of `Add()`. However, sometimes, the number of filters in `x` no longer matches the number of filters in `x_skip`... which happens because the number of feature maps is increased with each group of residual blocks.
+
+There are multiple ways of overcoming this issue:
+
+> (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter free (the same as Table 2 and Fig. 4 right); (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections.
+>
+> He et al. paper
+
+We can implement these so-called _identity_ shortcuts by padding zeros to the left and right side of your channel dimension, using the `Lambda` layer. This layer type essentially allows us to manipulate our Tensors in any way, returning the result. It works as follows:
+
+- The skip connection `x_skip` is first subsampled by taking every other row and column (`[:, ::2, ::2, :]`), so that its 2nd and 3rd dimensions (rows and columns) are reduced by a factor 2 and match the strided convolution. Then, `number_of_filters//4` zeros are padded to each side of the channel dimension. In other words, the number of channels is doubled (which is necessary for the next group of residual blocks), with 50% of the new channels added on each side.
+
+Another option is a _projection_ mapping. You then simply use a `Conv2D` layer with a 1x1 kernel size and a stride of 2 for generating the projection.
+
+As He et al. use these parameter-free identity shortcuts (option A) in their CIFAR-10 experiments, the configuration is set to `identity` by default. You can change it to `projection` by adapting the model configuration.
+
+Finally, the combined output/skip connection is nonlinearly activated with ReLU before being passed to the next residual block.
+
+```
+def residual_block(x, number_of_filters, match_filter_size=False):
+    """
+    Residual block with identity or projection shortcut.
+    """
+    # Retrieve initializer
+    config = model_configuration()
+    initializer = config.get("initializer")
+
+    # Create skip connection
+    x_skip = x
+
+    # Perform the original mapping
+    if match_filter_size:
+        x = Conv2D(number_of_filters, kernel_size=(3, 3), strides=(2,2),\
+            kernel_initializer=initializer, padding="same")(x_skip)
+    else:
+        x = Conv2D(number_of_filters, kernel_size=(3, 3), strides=(1,1),\
+            kernel_initializer=initializer, padding="same")(x_skip)
+    x = BatchNormalization(axis=3)(x)
+    x = Activation("relu")(x)
+    x = Conv2D(number_of_filters, kernel_size=(3, 3),\
+        kernel_initializer=initializer, padding="same")(x)
+    x = BatchNormalization(axis=3)(x)
+
+    # Perform matching of filter numbers if necessary
+    if match_filter_size and config.get("shortcut_type") == "identity":
+        x_skip = Lambda(lambda x: tensorflow.pad(x[:, ::2, ::2, :], tensorflow.constant([[0, 0,], [0, 0], [0, 0], [number_of_filters//4, number_of_filters//4]]), mode="CONSTANT"))(x_skip)
+    elif match_filter_size and config.get("shortcut_type") == "projection":
+        x_skip = Conv2D(number_of_filters, kernel_size=(1,1),\
+            kernel_initializer=initializer, strides=(2,2))(x_skip)
+
+    # Add the skip connection to the regular mapping
+    x = Add()([x, x_skip])
+
+    # Nonlinearly activate the result
+    x = Activation("relu")(x)
+
+    # Return the result
+    return x
+```
+
+### Creating the ResidualBlocks structure
+
+Now that we have the structure of a residual block, it's time to create the logic for specifying _all_ our residual blocks. You can do so as follows:
+
+- First, as always, you retrieve the model configuration, and from it the initial filter size.
+- The paper suggests that a ResNet is built using a stack of `6n` layers with `2n` layers for each feature map size. `6n` layers divided by `2n` layers for each feature map size, means that there will be `6n/2n = 3` groups of residual blocks, with 3 filter map sizes. + - Indeed, He et al. use filter map sizes of 16, 32 and 64, respectively. +- Using a `for` loop, we can simply iterate over this number of groups. +- Each block in our code has 2 weighted layers (see the 2 Conv layers above, excluding the one for the skip connection should a _projection_ mapping be used), and each group has `2n` layers (per the paper and defined above). This means that there will be `2n/2 = n` blocks per group. That's why you'll create another for loop, creating `n` blocks. +- The rest is simple: if it's the second layer group or higher and it's the first block within the group, you increase the filter size by a factor two (per the paper) and then specify the `residual_block`, instructing it to match filter sizes (by manipulating the input and the skip connection using the identity or projection mapping). +- If not, you simply specify the `residual_block`. + +For example, with `n = 3`, this yields `6n = 6*3 = 18` layers in your residual blocks and `2n = 2*3 = 6` layers per group. Indeed, with 3 groups, this matches. Finally, with `n = 3`, you will have `6n+2 = 6 * 3 + 2 = 20` layers in your network. Indeed, that's a ResNet-20! :) + +``` +def ResidualBlocks(x): + """ + Set up the residual blocks. + """ + # Retrieve values + config = model_configuration() + + # Set initial filter size + filter_size = config.get("initial_num_feature_maps") + + # Paper: "Then we use a stack of 6n layers (...) + # with 2n layers for each feature map size." + # 6n/2n = 3, so there are always 3 groups. + for layer_group in range(3): + + # Each block in our code has 2 weighted layers, + # and each group has 2n such blocks, + # so 2n/2 = n blocks per group. + for block in range(config.get("stack_n")): + + # Perform filter size increase at every + # first layer in the 2nd block onwards. + # Apply Conv block for projecting the skip + # connection. + if layer_group > 0 and block == 0: + filter_size *= 2 + x = residual_block(x, filter_size, match_filter_size=True) + else: + x = residual_block(x, filter_size) + + # Return final layer + return x +``` + +### Model base: stacking your building blocks + +Then, after creating the structure for the residual blocks, it's time to finalize the model by specifying its base structure. Recall that a ResNet is composed of `6n+2` weighted layers, and that you have created `6n` such layers so far. Two more to go! + +From the paper: + +> The first layer is 3×3 convolutions (...) The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. +> +> He et al. + +Let's add them: + +- First of all, the inputs to your neural network are passed to the `Input` layer. This is a default Keras layer that is capable of picking up inputs served to the model. +- Then, you create the initial 3x3 kernel size `Conv2D` layer with the initial number of filter maps, a 1x1 stride, zeros padding if necessary and the kernel initializer specified in the model configuration (remember, that would be He initialization, in line with the paper). +- This is followed by Batch Normalization and ReLU activation - pretty standard in these networks. +- Then, you let the input pass through the `6n` `ResidualBlocks` that you created above. 
+- Subsequently, your data flows through a `GlobalAveragePooling2D` nonweighted layer, performing global average pooling. +- Finally, your data is flattened, so that it can be processed by a fully-connected layer (`Dense` layer), also initialized using He initialization. This outputs a `(num_classes, )` shaped logits Tensor, which in the case of CIFAR-10 is `(10, )` because of `num_classes = 10`. +- Finally, references to `inputs` and `outputs` are returned so that the model can be initialized. + +``` +def model_base(shp): + """ + Base structure of the model, with residual blocks + attached. + """ + # Get number of classes from model configuration + config = model_configuration() + initializer = model_configuration().get("initializer") + + # Define model structure + # logits are returned because Softmax is pushed to loss function. + inputs = Input(shape=shp) + x = Conv2D(config.get("initial_num_feature_maps"), kernel_size=(3,3),\ + strides=(1,1), kernel_initializer=initializer, padding="same")(inputs) + x = BatchNormalization()(x) + x = Activation("relu")(x) + x = ResidualBlocks(x) + x = GlobalAveragePooling2D()(x) + x = Flatten()(x) + outputs = Dense(config.get("num_classes"), kernel_initializer=initializer)(x) + + return inputs, outputs +``` + +### Model initialization + +Now on to the simple part: model initialization. + +You have built your ResNet model, and it's time to initialize it. Doing so is easy but requires the layer structure: for this, you simply call the `model_base` definition using some input parameters representing input sample shape `shp`, and you assign its outputs to `inputs, outputs`. + +Now that you have these layer references, you can actually initialize the model by means of the Keras `Model` class. You specify the inputs and outputs and give it a name, like `resnet` (per the model configuration). + +Then, you compile the model with `model.compile` using the loss function, optimizer and additional metrics configured in your model configuration, print the model summary, and return the model. + +Time to start training! :) + +``` +def init_model(): + """ + Initialize a compiled ResNet model. + """ + # Get shape from model configuration + config = model_configuration() + + # Get model base + inputs, outputs = model_base((config.get("width"), config.get("height"),\ + config.get("dim"))) + + # Initialize and compile model + model = Model(inputs, outputs, name=config.get("name")) + model.compile(loss=config.get("loss"),\ + optimizer=config.get("optim"),\ + metrics=config.get("optim_additional_metrics")) + + # Print model summary + model.summary() + + return model +``` + +### Model training + +Keras has a high-level API available for training your model. Recall that we have created our training and validation batches using the `ImageDataGenerator` before, and that we have an initialized `model` at this stage. + +We simply pass these to the `model.fit` with a large variety of other configuration options specified in the model configuration. + +This will start the training process and return the trained `model` for evaluation. + +``` +def train_model(model, train_batches, validation_batches): + """ + Train an initialized model. 
+    """
+
+    # Get model configuration
+    config = model_configuration()
+
+    # Fit data to model
+    model.fit(train_batches,
+              batch_size=config.get("batch_size"),
+              epochs=config.get("num_epochs"),
+              verbose=config.get("verbose"),
+              callbacks=config.get("callbacks"),
+              steps_per_epoch=config.get("steps_per_epoch"),
+              validation_data=validation_batches,
+              validation_steps=config.get("val_steps_per_epoch"))
+
+    return model
+```
+
+### Model evaluation
+
+Evaluation is even simpler: you simply pass the test batches to `model.evaluate` and output the test scores.
+
+```
+def evaluate_model(model, test_batches):
+    """
+    Evaluate a trained model.
+    """
+    # Evaluate model
+    score = model.evaluate(test_batches, verbose=0)
+    print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
+```
+
+### Wrapping up things
+
+So far, we have individual building blocks:
+
+- Building blocks for loading the data.
+- Building blocks for building the model layer structure.
+- Building blocks for initializing the model.
+- ...and for training and evaluating the model.
+
+Let's now combine everything so that we end up with working code!
+
+You will do this in `training_process`:
+
+- First, you retrieve the training, validation and testing batches using `preprocessed_dataset()`.
+- Then, you initialize a compiled ResNet model with `init_model()`.
+- This is followed by training the model using the training and validation batches, by calling `train_model()`.
+- Finally, you'll evaluate the trained model with `evaluate_model()`.
+
+```
+def training_process():
+    """
+    Run the training process for the ResNet model.
+    """
+
+    # Get dataset
+    train_batches, validation_batches, test_batches = preprocessed_dataset()
+
+    # Initialize ResNet
+    resnet = init_model()
+
+    # Train ResNet model
+    trained_resnet = train_model(resnet, train_batches, validation_batches)
+
+    # Evaluate trained ResNet model post training
+    evaluate_model(trained_resnet, test_batches)
+```
+
+That's pretty much it!
+
+The only thing that remains is starting the training process when your Python script starts:
+
+```
+if __name__ == "__main__":
+    training_process()
+```
+
+And voila! You have a working ResNet model :)
+
+By tweaking the `n` parameter in the configuration settings (like He et al. did by setting it to 3, 5, 7 and 9), you can simply spawn, train and evaluate a ResNet-20, ResNet-32, ResNet-44 or ResNet-56 model.
+
+* * *
+
+## Results for our ResNet-20 on the CIFAR-10 dataset
+
+These are our results when training a ResNet-20 (`n = 3`) on the CIFAR-10 dataset. Training took approximately 45 minutes:
+
+[![](images/epoch_learning_rate-1024x192.png)](https://www.machinecurve.com/wp-content/uploads/2022/01/epoch_learning_rate.png)
+
+[![](images/epoch_accuracy-1024x190.png)](https://www.machinecurve.com/wp-content/uploads/2022/01/epoch_accuracy.png)
+
+Clearly, the effects of the learning rate scheduler are visible around epochs 90 and 135, both in terms of the learning rate applied (above) and accuracy (validation accuracy is blue; training accuracy is orange).
+
+Subsequently, during model evaluation using our testing data, we found the following scores:
+
+```
+Test loss: 0.6111826300621033 / Test accuracy: 0.8930000066757202
+```
+
+With a `1 - 0.893 = 0.107` test error, results are similar to those found in the ResNet paper (`0.0875`). Possibly, the omission of weight decay (which we left out for reasons of non-convergence) played a role here. In that case, you may want to give TensorFlow Addons' `SGDW` optimizer a try.
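+
+As a quick, untested sketch of what that could look like: the snippet below swaps the plain `SGD` instance created in `model_configuration()` for `SGDW` from TensorFlow Addons, re-adding the 0.0001 weight decay and 0.9 momentum used by He et al. It assumes that the `tensorflow-addons` package is installed (`pip install tensorflow-addons`); the variable names mirror the configuration above.
+
+```
+# Sketch only: SGD with decoupled weight decay via TensorFlow Addons.
+# Assumes `pip install tensorflow-addons`; intended as a drop-in
+# replacement for the SGD optimizer defined in model_configuration().
+import tensorflow_addons as tfa
+from tensorflow.keras.optimizers import schedules
+
+# Same piecewise constant learning rate schedule as before
+boundaries = [32000, 48000]
+values = [0.1, 0.01, 0.001]
+lr_schedule = schedules.PiecewiseConstantDecay(boundaries, values)
+
+# 0.0001 weight decay and 0.9 momentum, per He et al.
+# Note: SGDW decouples weight decay from the learning rate, so you may
+# want to lower the decay yourself when the learning rate drops.
+optimizer = tfa.optimizers.SGDW(
+    weight_decay=1e-4,
+    learning_rate=lr_schedule,
+    momentum=0.9
+)
+```
+
+You would then assign this `optimizer` to the `optim` key of the configuration dictionary; everything else can stay the same. We have not verified whether this closes the gap with the paper's 8.75% error, so treat it as a starting point for experimentation.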
+ +* * * + +## Full model code + +If you want to get started immediately, here is the full model code for building a ResNet from scratch using TensorFlow 2 and Keras: + +``` +import os +import numpy as np +import tensorflow +from tensorflow.keras import Model +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.layers import Add, GlobalAveragePooling2D,\ + Dense, Flatten, Conv2D, Lambda, Input, BatchNormalization, Activation +from tensorflow.keras.optimizers import schedules, SGD +from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint + + +def model_configuration(): + """ + Get configuration variables for the model. + """ + + # Load dataset for computing dataset size + (input_train, _), (_, _) = load_dataset() + + # Generic config + width, height, channels = 32, 32, 3 + batch_size = 128 + num_classes = 10 + validation_split = 0.1 # 45/5 per the He et al. paper + verbose = 1 + n = 3 + init_fm_dim = 16 + shortcut_type = "identity" # or: projection + + # Dataset size + train_size = (1 - validation_split) * len(input_train) + val_size = (validation_split) * len(input_train) + + # Number of steps per epoch is dependent on batch size + maximum_number_iterations = 64000 # per the He et al. paper + steps_per_epoch = tensorflow.math.floor(train_size / batch_size) + val_steps_per_epoch = tensorflow.math.floor(val_size / batch_size) + epochs = tensorflow.cast(tensorflow.math.floor(maximum_number_iterations / steps_per_epoch),\ + dtype=tensorflow.int64) + + # Define loss function + loss = tensorflow.keras.losses.CategoricalCrossentropy(from_logits=True) + + # Learning rate config per the He et al. paper + boundaries = [32000, 48000] + values = [0.1, 0.01, 0.001] + lr_schedule = schedules.PiecewiseConstantDecay(boundaries, values) + + # Set layer init + initializer = tensorflow.keras.initializers.HeNormal() + + # Define optimizer + optimizer_momentum = 0.9 + optimizer_additional_metrics = ["accuracy"] + optimizer = SGD(learning_rate=lr_schedule, momentum=optimizer_momentum) + + # Load Tensorboard callback + tensorboard = TensorBoard( + log_dir=os.path.join(os.getcwd(), "logs"), + histogram_freq=1, + write_images=True + ) + + # Save a model checkpoint after every epoch + checkpoint = ModelCheckpoint( + os.path.join(os.getcwd(), "model_checkpoint"), + save_freq="epoch" + ) + + # Add callbacks to list + callbacks = [ + tensorboard, + checkpoint + ] + + # Create config dictionary + config = { + "width": width, + "height": height, + "dim": channels, + "batch_size": batch_size, + "num_classes": num_classes, + "validation_split": validation_split, + "verbose": verbose, + "stack_n": n, + "initial_num_feature_maps": init_fm_dim, + "training_ds_size": train_size, + "steps_per_epoch": steps_per_epoch, + "val_steps_per_epoch": val_steps_per_epoch, + "num_epochs": epochs, + "loss": loss, + "optim": optimizer, + "optim_learning_rate_schedule": lr_schedule, + "optim_momentum": optimizer_momentum, + "optim_additional_metrics": optimizer_additional_metrics, + "initializer": initializer, + "callbacks": callbacks, + "shortcut_type": shortcut_type + } + + return config + + +def load_dataset(): + """ + Load the CIFAR-10 dataset + """ + return cifar10.load_data() + + +def random_crop(img, random_crop_size): + # Note: image_data_format is 'channel_last' + # SOURCE: https://jkjung-avt.github.io/keras-image-cropping/ + assert img.shape[2] == 3 + height, width = img.shape[0], img.shape[1] + dy, dx = random_crop_size + x = np.random.randint(0, width - dx + 1) + y = np.random.randint(0, height - dy + 
1) + return img[y:(y+dy), x:(x+dx), :] + + +def crop_generator(batches, crop_length): + """Take as input a Keras ImageGen (Iterator) and generate random + crops from the image batches generated by the original iterator. + SOURCE: https://jkjung-avt.github.io/keras-image-cropping/ + """ + while True: + batch_x, batch_y = next(batches) + batch_crops = np.zeros((batch_x.shape[0], crop_length, crop_length, 3)) + for i in range(batch_x.shape[0]): + batch_crops[i] = random_crop(batch_x[i], (crop_length, crop_length)) + yield (batch_crops, batch_y) + + +def preprocessed_dataset(): + """ + Load and preprocess the CIFAR-10 dataset. + """ + (input_train, target_train), (input_test, target_test) = load_dataset() + + # Retrieve shape from model configuration and unpack into components + config = model_configuration() + width, height, dim = config.get("width"), config.get("height"),\ + config.get("dim") + num_classes = config.get("num_classes") + + # Data augmentation: perform zero padding on datasets + paddings = tensorflow.constant([[0, 0,], [4, 4], [4, 4], [0, 0]]) + input_train = tensorflow.pad(input_train, paddings, mode="CONSTANT") + + # Convert scalar targets to categorical ones + target_train = tensorflow.keras.utils.to_categorical(target_train, num_classes) + target_test = tensorflow.keras.utils.to_categorical(target_test, num_classes) + + # Data generator for training data + train_generator = tensorflow.keras.preprocessing.image.ImageDataGenerator( + validation_split = config.get("validation_split"), + horizontal_flip = True, + rescale = 1./255, + preprocessing_function = tensorflow.keras.applications.resnet50.preprocess_input + ) + + # Generate training and validation batches + train_batches = train_generator.flow(input_train, target_train, batch_size=config.get("batch_size"), subset="training") + validation_batches = train_generator.flow(input_train, target_train, batch_size=config.get("batch_size"), subset="validation") + train_batches = crop_generator(train_batches, config.get("height")) + validation_batches = crop_generator(validation_batches, config.get("height")) + + # Data generator for testing data + test_generator = tensorflow.keras.preprocessing.image.ImageDataGenerator( + preprocessing_function = tensorflow.keras.applications.resnet50.preprocess_input, + rescale = 1./255) + + # Generate test batches + test_batches = test_generator.flow(input_test, target_test, batch_size=config.get("batch_size")) + + return train_batches, validation_batches, test_batches + + + +def residual_block(x, number_of_filters, match_filter_size=False): + """ + Residual block with + """ + # Retrieve initializer + config = model_configuration() + initializer = config.get("initializer") + + # Create skip connection + x_skip = x + + # Perform the original mapping + if match_filter_size: + x = Conv2D(number_of_filters, kernel_size=(3, 3), strides=(2,2),\ + kernel_initializer=initializer, padding="same")(x_skip) + else: + x = Conv2D(number_of_filters, kernel_size=(3, 3), strides=(1,1),\ + kernel_initializer=initializer, padding="same")(x_skip) + x = BatchNormalization(axis=3)(x) + x = Activation("relu")(x) + x = Conv2D(number_of_filters, kernel_size=(3, 3),\ + kernel_initializer=initializer, padding="same")(x) + x = BatchNormalization(axis=3)(x) + + # Perform matching of filter numbers if necessary + if match_filter_size and config.get("shortcut_type") == "identity": + x_skip = Lambda(lambda x: tensorflow.pad(x[:, ::2, ::2, :], tensorflow.constant([[0, 0,], [0, 0], [0, 0], [number_of_filters//4, 
number_of_filters//4]]), mode="CONSTANT"))(x_skip) + elif match_filter_size and config.get("shortcut_type") == "projection": + x_skip = Conv2D(number_of_filters, kernel_size=(1,1),\ + kernel_initializer=initializer, strides=(2,2))(x_skip) + + # Add the skip connection to the regular mapping + x = Add()([x, x_skip]) + + # Nonlinearly activate the result + x = Activation("relu")(x) + + # Return the result + return x + + +def ResidualBlocks(x): + """ + Set up the residual blocks. + """ + # Retrieve values + config = model_configuration() + + # Set initial filter size + filter_size = config.get("initial_num_feature_maps") + + # Paper: "Then we use a stack of 6n layers (...) + # with 2n layers for each feature map size." + # 6n/2n = 3, so there are always 3 groups. + for layer_group in range(3): + + # Each block in our code has 2 weighted layers, + # and each group has 2n such blocks, + # so 2n/2 = n blocks per group. + for block in range(config.get("stack_n")): + + # Perform filter size increase at every + # first layer in the 2nd block onwards. + # Apply Conv block for projecting the skip + # connection. + if layer_group > 0 and block == 0: + filter_size *= 2 + x = residual_block(x, filter_size, match_filter_size=True) + else: + x = residual_block(x, filter_size) + + # Return final layer + return x + + +def model_base(shp): + """ + Base structure of the model, with residual blocks + attached. + """ + # Get number of classes from model configuration + config = model_configuration() + initializer = model_configuration().get("initializer") + + # Define model structure + # logits are returned because Softmax is pushed to loss function. + inputs = Input(shape=shp) + x = Conv2D(config.get("initial_num_feature_maps"), kernel_size=(3,3),\ + strides=(1,1), kernel_initializer=initializer, padding="same")(inputs) + x = BatchNormalization()(x) + x = Activation("relu")(x) + x = ResidualBlocks(x) + x = GlobalAveragePooling2D()(x) + x = Flatten()(x) + outputs = Dense(config.get("num_classes"), kernel_initializer=initializer)(x) + + return inputs, outputs + + +def init_model(): + """ + Initialize a compiled ResNet model. + """ + # Get shape from model configuration + config = model_configuration() + + # Get model base + inputs, outputs = model_base((config.get("width"), config.get("height"),\ + config.get("dim"))) + + # Initialize and compile model + model = Model(inputs, outputs, name=config.get("name")) + model.compile(loss=config.get("loss"),\ + optimizer=config.get("optim"),\ + metrics=config.get("optim_additional_metrics")) + + # Print model summary + model.summary() + + return model + + +def train_model(model, train_batches, validation_batches): + """ + Train an initialized model. + """ + + # Get model configuration + config = model_configuration() + + # Fit data to model + model.fit(train_batches, + batch_size=config.get("batch_size"), + epochs=config.get("num_epochs"), + verbose=config.get("verbose"), + callbacks=config.get("callbacks"), + steps_per_epoch=config.get("steps_per_epoch"), + validation_data=validation_batches, + validation_steps=config.get("val_steps_per_epoch")) + + return model + + +def evaluate_model(model, test_batches): + """ + Evaluate a trained model. + """ + # Evaluate model + score = model.evaluate(test_batches, verbose=0) + print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + + +def training_process(): + """ + Run the training process for the ResNet model. 
+ """ + + # Get dataset + train_batches, validation_batches, test_batches = preprocessed_dataset() + + # Initialize ResNet + resnet = init_model() + + # Train ResNet model + trained_resnet = train_model(resnet, train_batches, validation_batches) + + # Evalute trained ResNet model post training + evaluate_model(trained_resnet, test_batches) + + +if __name__ == "__main__": + training_process() +``` + +* * * + +## References + +MachineCurve. (2022, January 13). _ResNet, a simple introduction – MachineCurve_. [https://www.machinecurve.com/index.php/2022/01/13/resnet-a-simple-introduction/](https://www.machinecurve.com/index.php/2022/01/13/resnet-a-simple-introduction/) + +He, K., Zhang, X., Ren, S., & Sun, J. (2016). [Deep residual learning for image recognition.](https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf) In _Proceedings of the IEEE conference on computer vision and pattern recognition_ (pp. 770-778). + +Krizhevsky, A. (n.d.). _CIFAR-10 and CIFAR-100 datasets_. Department of Computer Science, University of Toronto. [https://www.cs.toronto.edu/~kriz/cifar.html](https://www.cs.toronto.edu/~kriz/cifar.html) + +Jung, J. K. (2018, April 16). _Extending Keras' ImageDataGenerator to support random cropping_. JK Jung's blog. [https://jkjung-avt.github.io/keras-image-cropping/](https://jkjung-avt.github.io/keras-image-cropping/) + +TensorFlow. (n.d.). _Tf.keras.applications.resnet50.preprocess\_input_. [https://www.tensorflow.org/api\_docs/python/tf/keras/applications/resnet50/preprocess\_input](https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/preprocess_input) diff --git a/how-to-build-a-u-net-for-image-segmentation-with-tensorflow-and-keras.md b/how-to-build-a-u-net-for-image-segmentation-with-tensorflow-and-keras.md new file mode 100644 index 0000000..d5de385 --- /dev/null +++ b/how-to-build-a-u-net-for-image-segmentation-with-tensorflow-and-keras.md @@ -0,0 +1,1268 @@ +--- +title: "How to build a U-Net for image segmentation with TensorFlow and Keras" +date: "2022-01-30" +categories: + - "deep-learning" + - "frameworks" +tags: + - "classification" + - "computer-vision" + - "deep-learning" + - "image-segmentation" + - "keras" + - "machine-learning" + - "neural-network" + - "tensorflow" + - "unet" +--- + +Computer vision has a few sub disciplines - and image segmentation is one of them. If you're segmenting an image, you're deciding about what is visible in the image at pixel level (when performing classification) - or inferring relevant real-valued information from the image at pixel level (when performing regression). + +One of the prominent architectures in the image segmentation community is **U-Net**. Having been named after its shape, the fully-convolutional architecture first contracts an image followed by its expansion into the outcome. While this contracting path builds up a hierarchy of learned features, skip connections help transform these features back into a relevant model output in the expansive path. + +While you can learn more about the U-net architecture [by clicking this link](https://www.machinecurve.com/index.php/2022/01/28/u-net-a-step-by-step-introduction/), this article focuses on a practical implementation. Today, you will learn to build a U-Net architecture from scratch. You will use TensorFlow and Keras for doing so. Firstly, you're going to briefly cover the components of a U-Net at a high level. This is followed by a step-by-step tutorial for implementing U-Net yourself. 
Finally, we're going to train the network on the Oxford-IIIT Pet Dataset from scratch, show you what can be achieved _and_ how to improve even further!
+
+So, after reading this tutorial, you will understand...
+
+- **What the U-Net architecture is and what its components are.**
+- **How to build a U-Net yourself using TensorFlow and Keras.**
+- **What performance you can achieve with your implementation and how to improve even further.**
+
+Are you ready? Let's take a look! 😎
+
+* * *
+
+\[toc\]
+
+* * *
+
+## What is a U-Net?
+
+When you ask a computer vision engineer about _image segmentation_, it's likely that the term **U-Net** will be mentioned somewhere in their explanation!
+
+The U-Net, which is named after its shape, is a convolutional architecture originally proposed by Ronneberger et al. (2015) for use in the biomedical sciences. More specifically, it is used for cell segmentation, and worked really well compared to approaches previously used in the field.
+
+MachineCurve has an [in-depth article explaining U-Net](https://www.machinecurve.com/index.php/2022/01/28/u-net-a-step-by-step-introduction/), and here we will review the components at a high level only. U-Nets are composed of three component groups:
+
+1. **A contracting path**. Visible to the left in the image below, groups of convolutions and pooling layers are used to downsample the image, sometimes even halving it in size. The contracting path learns a hierarchy of features at varying levels of granularity.
+2. **An expansive path**. To the right, you see groups of _upsampling layers_ (whether simple interpolation layers or transposed convolutions) that upsample the resolution of the input image. In other words, from the contracted input, the network tries to construct a higher-resolution output.
+3. **Skip connections.** Besides having the lower-level feature maps as input to the upsampling process, U-Net also receives information from the contracting path's same-level layer. This mitigates the information bottleneck at the lowest layer of the U, which would otherwise effectively 'drop' the signal coming from the layers higher up in the U.
+
+Note that in the original U-Net architecture, the width and height of the output are lower than the input width and height (388x388 pixels for the output versus 572x572 pixels for the input). This is a property of the architecture itself and can be avoided by using another standard architecture (such as ResNet) as your backbone. This will be covered in another article.
+
+With architectures like U-Net, it becomes possible to learn features important to specific images, while using this information to generate a higher-resolution output. Maps representing class indices at pixel level are one example of such output. And by reading further, you will learn to build a U-Net for doing so!
+
+![](images/unet-1-1024x868.png)
+
+Inspired by Ronneberger et al. (2015)
+
+* * *
+
+## Building a U-Net with TensorFlow and Keras
+
+Now that you understand how U-Net works at a high level, it's time to build one. Open up your IDE and create a Python file (such as `unet.py`) or open up a Jupyter Notebook. Also ensure that you have installed the prerequisites, which follow next. We can then start writing some code!
+
+### Prerequisites
+
+For running today's code, it's important that you have installed some dependencies into your environment.
+
+First of all, you will need a recent version of Python - 3.x, preferably 3.9+.
+
+In addition, you will need `tensorflow`, `tensorflow_datasets` and `matplotlib`. 
These can be installed through `pip` package manager. When installed, you're ready to go! + +### Today's structure + +Building a U-Net model can be grouped into three separate groups, besides specifying model imports: + +1. Defining the configuration of your U-Net model, so that it can be reused throughout your code. +2. Defining the building blocks of your U-Net. +3. Defining the process definitions to train and evaluate your U-Net model. + +Afterwards, you will merge everything together into a working whole. + +Let's begin with model configuration! :) + +### Imports + +Your first lines of code will cover the imports that you will need in the rest of your code. Let's walk through them briefly: + +- Python `os` represents operating system functions such as constructing file paths. You will need it when loading your dataset. +- TensorFlow speaks pretty much for itself, doesn't it? :) +- A variety of layers will be used in your model. As we are working with Keras for building your neural network, they must be imported from `tensorflow.keras.layers`. You will use two-dimensional convolutional layers (`Conv2D`), two-dimensional max pooling (`MaxPool2D`), transposed convolutions (`Conv2DTranspose`), and more general layers, such as the `Input` layer (representing the input batch), `Activation` (representing a nonlinear activation function), `Concatenate` for Tensor concatenation and `CenterCrop` for taking a crop of the skip connections to match shapes (this will be discussed later). +- In addition, you will need to import the `Model` class for constructing your U-Net, He normal initialization via `HeNormal`, `Adam` for optimization including learning rate scheduling functionality (`schedules`), and sparse categorical crossentropy (`SparseCategoricalCrossentropy`) as your loss function. +- Recall that TensorFlow has a variety of callbacks that make your modelling life easier. An example of these callbacks is the TensorBoard callback, which allows you to have your training progress exported to a great tool for visualization. Finally, you will import a Keras `util` called `plot_model` for plotting the structure of your model. +- What rests are other imports. Our dataset is represented in `tensorflow_datasets` and finally you will also need Matplotlib's `pyplot` librari for visualization purposes. + +``` +import os +import tensorflow +from tensorflow.keras.layers import Conv2D,\ + MaxPool2D, Conv2DTranspose, Input, Activation,\ + Concatenate, CenterCrop +from tensorflow.keras import Model +from tensorflow.keras.initializers import HeNormal +from tensorflow.keras.optimizers import schedules, Adam +from tensorflow.keras.losses import SparseCategoricalCrossentropy +from tensorflow.keras.callbacks import TensorBoard +from tensorflow.keras.utils import plot_model +import tensorflow_datasets as tfds +import matplotlib.pyplot as plt +``` + +### U-Net configuration definition + +In my view, it's bad practice to scatter a variety of configuration options throughout your model. Rather, I prefer to define them in one definition, allowing me to reuse them across the model (and should I ever need to deploy my model into a production setting, I can for example provide my configuration through a JSON environment variable which can be easily read into Python as a `dict`). Here's what the configuration definition looks like. Below, we'll discuss the components: + +``` +''' + U-NET CONFIGURATION +''' +def configuration(): + ''' Get configuration. 
''' + + return dict( + data_train_prc = 80, + data_val_prc = 90, + data_test_prc = 100, + num_filters_start = 64, + num_unet_blocks = 3, + num_filters_end = 3, + input_width = 100, + input_height = 100, + mask_width = 60, + mask_height = 60, + input_dim = 3, + optimizer = Adam, + loss = SparseCategoricalCrossentropy, + initializer = HeNormal(), + batch_size = 50, + buffer_size = 50, + num_epochs = 50, + metrics = ['accuracy'], + dataset_path = os.path.join(os.getcwd(), 'data'), + class_weights = tensorflow.constant([1.0, 1.0, 2.0]), + validation_sub_splits = 5, + lr_schedule_percentages = [0.2, 0.5, 0.8], + lr_schedule_values = [3e-4, 1e-4, 1e-5, 1e-6], + lr_schedule_class = schedules.PiecewiseConstantDecay + ) +``` + +- Recall that a dataset must be split into a **training set, validation set and testing set**. The training set is the largest and primary set, allowing you to make forward & backward passes and optimization during your training process. However, because you have seen this dataset, a validation set is used during training to evaluate performance after every epoch. Finally, because the model may eventually overfit on this validation set too, there is a testing set, which is not used during training at all. Rather, it is used during model evaluation to find whether your model performs on data that it has not seen before. If it does so, it's more likely to work in the real world, too. + - In your model configuration, `data_train_prc`, `data_val_prc` and `data_test_prc` are used to represent the percentage at which the specific split ends. In the configuration above, 80, 90 and 100 mean that 0-80% of your dataset will be used for training purposes, 80-90% (i.e. 10% in total) for validation and 90-100% (10%, too) for testing. You will see later that it's good to specify them in this way, because `tfds.load` allows us to recombine the two datasets (train/test) and split them into three! +- The number of feature maps generated at the first U-net convolutional block will be 64. In total, your network will consist of 3 U-Net blocks (the sketch above has 5, but we found 3 to work better on this dataset) and will have 3 feature maps in the _final 1x1 Conv layer_. It's set to 3 because our dataset has three possible classes to assign to each pixel - in other words, it should be equal to the number of classes in your dataset. +- The width and height of our input image will be 100 pixels. Dimensionality of the input will be 3 channels (it's an RGB image). +- The width and height of the output mask will be 60 pixels. Indeed, in the original U-Net architecture input and output size is not equal to each other! +- Model wise, the Adam optimizer, sparse categorical crossentropy and He normal initialization are used. For the Adam optimizer, we use a learning rate schedule called `PiecewiseConstantDecay`. This schedule ensures that the learning rate is set to a preconfigured value after a predefined amount of training time. We start with a learning rate of `3e-4` (i.e., 0.0003) and decrease to `1e-4`, `1e-5` and `1e-6` after 20%, 50% and 80% of training. Decreasing your learning rate will help you move towards an optimum in a better way. [Read here why.](https://www.machinecurve.com/index.php/2019/11/11/problems-with-fixed-and-decaying-learning-rates/) +- Training wise, we generate batches of 50 pixels and perform shuffling with a 50 buffer size, and train the model for 50 epochs. +- As an additional metric, we use `accuracy`. 
+- Our dataset will be located in the current working directory, `data` sub folder. 5 sub splits are used for validation purposes. +- When you are training with an imbalanced dataset, it can be a good idea to assign class weights to the target predictions. This will put more importance on the weights that are underrepresented. + +Okay, this was the important but relatively boring part. Let's now build some U-Net blocks! :) + +### U-Net building blocks + +Recall that a U-Net is composed of a **contracting path**, which itself is built from **convolutional blocks**, and an **expansive path** built from **upsampling blocks**. At each individual level (except for the last level in the contractive path, which is connected to the head of the expansive path) the output of a convolutional block is connected to an upsampling block via a skip connection. + +You will start with building a convolutional block and creating many of them in the contracting path. Then, you will do the same for the upsampling block and the expansive path. + +#### The convolutional block + +Here's the structure of your `conv_block`: + +``` +''' + U-NET BUILDING BLOCKS +''' + +def conv_block(x, filters, last_block): + ''' + U-Net convolutional block. + Used for downsampling in the contracting path. + ''' + config = configuration() + + # First Conv segment + x = Conv2D(filters, (3, 3),\ + kernel_initializer=config.get("initializer"))(x) + x = Activation("relu")(x) + + # Second Conv segment + x = Conv2D(filters, (3, 3),\ + kernel_initializer=config.get("initializer"))(x) + x = Activation("relu")(x) + + # Keep Conv output for skip input + skip_input = x + + # Apply pooling if not last block + if not last_block: + x = MaxPool2D((2, 2), strides=(2,2))(x) + + return x, skip_input +``` + +Each convolutional block, per the Ronneberger et al. (2015) paper, is composed of two 3x3 convolutional blocks the output of which are each ReLU activated. Per the configuration, He initialization is used ([because we use ReLU activation](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/)). + +> It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2 for downsampling. +> +> Ronneberger et al. (2015) + +Recall from the image above that at each level, the output of the convolutions in the convolutional block is passed as a skip connection to the first upsampling layer in the upsampling block at the corresponding level. + +Max pooling is applied to the same output, so that the output can be used by the next convolutional block. + +![](images/afbeelding-3.png) + +In the code above, you can see that the output of the convolutional layers is assigned to `skip_input`. Subsequently, if this is not the last convolutional block, you will see that `MaxPool2D` is applied with a 2x2 pool size and stride 2. + +Both the processed Tensor `x` and the skip connection `skip_input` are returned. Note that this also happens in the last layer! It's only what whe do with the returned values that counts, and you will see that we don't use the skip connection when it's the last layer when creating the full contracting path. + +#### Contracting path and skip connections + +Which, as if it is meant to be, is right now! :) + +Let's create another definition called `contracting_path`. In it, you will construct the convolutional block that belong to the contracting path. 
Per your code above, these convolutional blocks will perform feature learning at their level of hierarchy and subsequently perform max pooling to make the Tensors ready for the next convolutional block. + +In the original U-Net, at each "downsampling step" (i.e., max pooling, although a regular convolution is a downsampling step too, strictly speaking), the number of feature channels is doubled. + +> At each downsampling step we double the number of feature channels. +> +> Ronneberger et al. (2015) + +And you will need to take this into account when creating your contracting path. This is why you will use the utility function `compute_number_of_filters` (you will define it next) to compute the number of filters used within each convolutional block. Given the starting number of 64, that will be 64, 128 and 256 for the 3-block U-Net that you are building today (per your model configuration). For the original 5-block U-Net in Ronneberger et al. (2014), that would be 64, 128, 256, 512 and 1024. + +Next, you create a list where the Tensors provided by the convolutions can be stored. It serves as a container for the skip connections. + +Now, it's time to create the actual blocks. By using `enumerate` you can create an enumerator that outputs `(index, value)`, and you are doing that to create a `for` loop that provides both the block number (`index`) and the number of filters in that particular block (`block_num_filters`). In the loop, you check if it's the last block, and let the input pass through the convolutional block setting the number of filters given the level of your convolutional block. + +Then, if it's not the last block, you'll add the `skip_input` to the `skip_inputs` container. + +Finally, you return both `x` (which now has passed through the entire contracting path) and the `skip_inputs` skip connection Tensors produced when doing so. + +``` +def contracting_path(x): + ''' + U-Net contracting path. + Initializes multiple convolutional blocks for + downsampling. + ''' + config = configuration() + + # Compute the number of feature map filters per block + num_filters = [compute_number_of_filters(index)\ + for index in range(config.get("num_unet_blocks"))] + + # Create container for the skip input Tensors + skip_inputs = [] + + # Pass input x through all convolutional blocks and + # add skip input Tensor to skip_inputs if not last block + for index, block_num_filters in enumerate(num_filters): + + last_block = index == len(num_filters)-1 + x, skip_input = conv_block(x, block_num_filters,\ + last_block) + + if not last_block: + skip_inputs.append(skip_input) + + return x, skip_inputs +``` + +#### Utility function: computing number of feature maps + +In the `contracting_path` definition, you were using `compute_number_of_filters` to compute the number of filters that must be used / feature maps that must be generated at a specific convolutional block. + +This utility function is actually really simple: you take the number of filters in your first convolutional block (which, per your model configuration is 64) and multiply it with \[latex\]2^{\\text{level}}\[/latex\]. For example, at the third level (with index = 2) your convolutional block has \[latex\]64 \\times 2^2 = 256\[/latex\] filters. + +``` +def compute_number_of_filters(block_number): + ''' + Compute the number of filters for a specific + U-Net block given its position in the contracting path. 
+ ''' + return configuration().get("num_filters_start") * (2 ** block_number) +``` + +#### The upsampling block + +So far, you have created code for downsampling your input data. It's now time to shape the building blocks for the expansive path. Let's add another definition, which you'll call `upconv_block`. It takes some input, an expected number of filters, a skip input Tensor corresponding to the hierarchical level of your upsampling block, and information about whether it's the last block. + +![](images/afbeelding-4.png) + +Per the design of U-Net, the first step is performing upsampling. In the image to the right, for example, a 52x52x512 Tensor is upsampled to a 104x104x512 Tensor. + +In computer vision models, there are two primary ways of performing **upsampling**: + +- **By means of interpolation.** This is the classic approach and is used by Ronneberger et al. (2015). An interpolation function, such as bicubic interpolation, is used to compute the missing pixels. In TensorFlow and Keras, this functionality is covered by the [Upsampling](https://www.machinecurve.com/index.php/2019/12/11/upsampling2d-how-to-use-upsampling-with-keras/) blocks. +- **By means of learned upsampling with [transposed convolutions](https://www.machinecurve.com/index.php/2019/09/29/understanding-transposed-convolutions/).** Another approach would be using transposed convolutions, which are convolutions that work the other way around. Instead of using learned kernels/filters to _down_sample a larger image, they _up_sample the image, but also by using learned kernels/filters! In TensorFlow, these are represented by meansa of `[ConvXDTranspose](https://www.machinecurve.com/index.php/2019/12/10/conv2dtranspose-using-2d-transposed-convolutions-with-keras/)`. You will be using this type of upsampling because it is (1) more common today and (2) makes the whole model use trainable parameters where possible. + +So, the first processing that happens to your input Tensor `x` is upsampling by means of `Conv2DTranspose`. + +Then it's time to discuss the following important detail - the **crop** that is applied to the skip connection. + +Note that the shape of the first two dimensions of the output of your convolutional block at arbitrary level _L_ is larger than the shape of these dimensions at the corresponding upsampling block. For example, in the example below you see that a skip connection of shape 136x136 pixels must be concatenated with a 104x104 pixel Tensor. + +> Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that halves the number of feature channels, **a concatenation with the correspondingly cropped feature map from the contracting path**, and two 3x3 convolutions, each followed by a ReLU. +> +> Ronneberger et al. (2015) + +This is not possible. Ronneberger et al. (2015), in their original implementation of U-Net, mitigate this problem by taking a _center crop_ from the feature maps generated by the convolutional block. This center crop has the same width and height of the upsampled Tensor; in our case, that is 104x104 pixels. Now, both Tensors can be concatenated. + +![](images/afbeelding-5.png) + +To make this crop, you use TensorFlow's `CenterCrop` layer to take a center crop from the skip input using the target width and height as specified by the upsampled Tensor. + +Then, you use the `Concatenate` layer to concatenate the cropped skip input with the upsampled Tensor, after which you can proceed with processing the whole. 
This, per the Ronneberger et al. (2015) and the quote above, is performed using two 3x3 convolutions followed by ReLU activation each. + +> At the final layer a 1x1 convolution is used to map each 64- component feature vector to the desired number of classes. +> +> Ronneberger et al. (2015) + +Finally, at the last layer, you apply an 1x1 convolution (preserving the width and height dimensions) that outputs a Tensor with C for the third dimension. C, here, represents the desired number of classes - something we have in our model configuration as `num_filters_end`, and indeed, that is three classes for today's dataset! :) + +Here's the code for creating your upsampling block: + +``` +def upconv_block(x, filters, skip_input, last_block = False): + ''' + U-Net upsampling block. + Used for upsampling in the expansive path. + ''' + config = configuration() + + # Perform upsampling + x = Conv2DTranspose(filters//2, (2, 2), strides=(2, 2),\ + kernel_initializer=config.get("initializer"))(x) + shp = x.shape + + # Crop the skip input, keep the center + cropped_skip_input = CenterCrop(height = x.shape[1],\ + width = x.shape[2])(skip_input) + + # Concatenate skip input with x + concat_input = Concatenate(axis=-1)([cropped_skip_input, x]) + + # First Conv segment + x = Conv2D(filters//2, (3, 3), + kernel_initializer=config.get("initializer"))(concat_input) + x = Activation("relu")(x) + + # Second Conv segment + x = Conv2D(filters//2, (3, 3), + kernel_initializer=config.get("initializer"))(x) + x = Activation("relu")(x) + + # Prepare output if last block + if last_block: + x = Conv2D(config.get("num_filters_end"), (1, 1), + kernel_initializer=config.get("initializer"))(x) + + return x +``` + +#### Expansive path using skip connections + +As with the contracting path, you will also need to compose the upsampling layers in your expansive path. + +Similar to the contracting path, you will also compute the number of filters for the blocks in your expansive path. This time, however, you start counting at the end - i.e., at the number of blocks minus one, because you are working from a high number of filters to a low number of filters. + +Then, you iterate over the number of filters, compute whether it's the last block and compute the _level_ to take the skip input from, and pass the Tensor through your upsampling block. + +Now, should you feed your Tensor to all the blocks if they were composed, they would make a complete pass through the contracting path and the expansive path. Time to stitch together your U-Net components! + +``` +def expansive_path(x, skip_inputs): + ''' + U-Net expansive path. + Initializes multiple upsampling blocks for upsampling. + ''' + num_filters = [compute_number_of_filters(index)\ + for index in range(configuration()\ + .get("num_unet_blocks")-1, 0, -1)] + + skip_max_index = len(skip_inputs) - 1 + + for index, block_num_filters in enumerate(num_filters): + skip_index = skip_max_index - index + last_block = index == len(num_filters)-1 + x = upconv_block(x, block_num_filters,\ + skip_inputs[skip_index], last_block) + + return x +``` + +#### U-Net builder + +...which is something that we can do with the `build_unet` definition that you will create now. + +It is a relatively simple definition. It constructs the input shape by means of the configured height, width and dimensionality of your input data, and then passes this to an `Input` layer - which is TensorFlow's way of representing input data. 
+ +Your inputs are then passed through the `contracting_path`, which yields the contracted data _and_ the outputs of each convolutional block for the skip connections. + +These are then fed to the `expansive_path` which produces the expanded data. Note that we choose to explicitly _not_ model a Softmax activation function, because we push it to the loss function, [as prescribed by TensorFlow](https://datascience.stackexchange.com/questions/73093/what-does-from-logits-true-do-in-sparsecategoricalcrossentropy-loss-function). Finally, we initialize the `Model` class with our input data as our starting point and the expanded data as our ending point. The model is named `U-Net`. + +``` +def build_unet(): + ''' Construct U-Net. ''' + config = configuration() + input_shape = (config.get("input_height"),\ + config.get("input_width"), config.get("input_dim")) + + # Construct input layer + input_data = Input(shape=input_shape) + + # Construct Contracting path + contracted_data, skip_inputs = contracting_path(input_data) + + # Construct Expansive path + expanded_data = expansive_path(contracted_data, skip_inputs) + + # Define model + model = Model(input_data, expanded_data, name="U-Net") + + return model +``` + +### U-Net training process definitions + +Now that you have created the model building blocks, it's time to start creating definitions for training your U-Net. These are the ones that you will create: + +- Initializing the model. +- Loading the dataset. +- Data preprocessing. +- Training callbacks. +- Data visualization. + +#### Initializing the model + +You have a definition for creating a model. However, that's just a skeleton - because a model needs to be initialized with a loss function, an optimizer needs to be configured, and so forth. + +Let's thus create a definition called `init_model` which allows you to do this. It accepts the steps per epoch, which come from your dataset configuration that will be added later. + +The following happens within this definition: + +- Configuration is loaded and the model skeleton is built. +- The loss function is initialized as well as additional metrics and the number of epochs. Note that with `from_logits=True`, you instruct TensorFlow that the output of your model are logits rather than a Softmaxed output. When configured, the loss function performs Softmax activation before computing loss. +- The learning rate schedule is constructed from the percentages by computing the boundaries - which are the number of iterations that must be passed. Note that an iteration here is a batch of data being fed through the network; the number of samples divided by your batch size yields the number of iterations in one epoch). So, to compute the boundaries, we take the number of epochs, the particular percentage, and the number of steps (batches) per epoch. You then initialize the learning rate schedule with the boundaries and corresponding learning rate values (which are discussed in the section about model configuration). +- Then, the optimizer is initialized with the learning rate schedule. +- Now, you can compile your model as is standard with TensorFlow models. +- Some utilities will now describe your model - both [visually](https://www.machinecurve.com/index.php/2019/10/07/how-to-visualize-a-model-with-keras/) and by means of a [summary](https://www.machinecurve.com/index.php/2020/04/01/how-to-generate-a-summary-of-your-keras-model/). +- Finally, you return the initialized `model`. 
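+
+To make the boundary computation concrete before we look at the code, here is a tiny, purely illustrative example (the numbers are made up; your actual `steps_per_epoch` follows from the dataset size and batch size):
+
+```
+# Illustrative only: 50 epochs, 100 batches per epoch, drops at 20%/50%/80%
+num_epochs = 50
+steps_per_epoch = 100
+lr_schedule_percentages = [0.2, 0.5, 0.8]
+
+boundaries = [int(num_epochs * percentage * steps_per_epoch)
+              for percentage in lr_schedule_percentages]
+print(boundaries)  # [1000, 2500, 4000] -> iterations at which the LR drops
+```
+
+In other words, with these numbers the learning rate would drop from 3e-4 to 1e-4 after 1,000 batches, to 1e-5 after 2,500 batches, and to 1e-6 after 4,000 batches.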
+ +``` +''' + U-NET TRAINING PROCESS BUILDING BLOCKS +''' + +def init_model(steps_per_epoch): + ''' + Initialize a U-Net model. + ''' + config = configuration() + model = build_unet() + + # Retrieve compilation input + loss_init = config.get("loss")(from_logits=True) + metrics = config.get("metrics") + num_epochs = config.get("num_epochs") + + # Construct LR schedule + boundaries = [int(num_epochs * percentage * steps_per_epoch)\ + for percentage in config.get("lr_schedule_percentages")] + lr_schedule = config.get("lr_schedule_class")(boundaries, config.get("lr_schedule_values")) + + # Init optimizer + optimizer_init = config.get("optimizer")(learning_rate = lr_schedule) + + # Compile the model + model.compile(loss=loss_init, optimizer=optimizer_init, metrics=metrics) + + # Plot the model + plot_model(model, to_file="unet.png") + + # Print model summary + model.summary() + + return model +``` + +This is what your model looks like, visually. Indeed, that's a U shape! :) + +![](images/model-219x1024.png) + +#### Loading the dataset + +For training your model in today's tutorial, you will be using the Oxford-IIT Pets dataset that was published in Parkhi et al. (2012): + +> We have created a 37 category pet dataset with roughly 200 images for each class. The images have a large variations in scale, pose and lighting. All images have an associated ground truth annotation of breed, head ROI, and pixel level trimap segmentation. +> +> Parkhi et al. (2012) + +We're using it because it's available in [TensorFlow datasets](https://www.tensorflow.org/datasets/catalog/oxford_iiit_pet), making loading it more easy, and because it has a segmentation max available out of the box. For example, here is an input image with the corresponding segmentation mask: + +![](images/afbeelding-6.png) + +Source: Parkhi et al. (2012); TensorFlow Datasets. + +Loading the dataset is quite simple. Because the TensorFlow dataset contains training and testing data _only_, and because you will need three splits (train, val and test), you will _redefine_ the split per your model configuration, and pass it to `tfds.load`. By returning info (`with_info=True`), you will be able to read some metadata interesting later. + +``` +def load_dataset(): + ''' Return dataset with info. ''' + config = configuration() + + # Retrieve percentages + train = config.get("data_train_prc") + val = config.get("data_val_prc") + test = config.get("data_test_prc") + + # Redefine splits over full dataset + splits = [f'train[:{train}%]+test[:{train}%]',\ + f'train[{train}%:{val}%]+test[{train}%:{val}%]',\ + f'train[{val}%:{test}%]+test[{val}%:{test}%]'] + + # Return data + return tfds.load('oxford_iiit_pet:3.*.*', split=splits, data_dir=configuration()\ + .get("dataset_path"), with_info=True) +``` + +#### Dataset preprocessing + +Datasets require preprocessing before they can be used in deep learning models. That's why today's tutorial will also require you to write some preprocessing code. To be more precise, you will perform the following preprocessing: + +- Preprocessing **at sample level**, including **image normalization**. +- **Data augmentation** to artificially increase the size of your dataset. +- Computing **sample weights** to balance between overrepresented and underrepresented classes in your segmentation masks. +- Preprocessing at **dataset level**, combining all previous bullet points. + +Let's now write code for each of these bullet points. 
+
+Performing **image normalization** simply involves casting your Tensors to `float32` format and dividing by `255.0`. In addition to this, you subtract 1 from the mask's classes, because they range from 1-3 and we want them to range from 0-2:
+
+```
+def normalize_sample(input_image, input_mask):
+    ''' Normalize input image and mask class. '''
+    # Cast image to float32 and divide by 255
+    input_image = tensorflow.cast(input_image, tensorflow.float32) / 255.0
+
+    # Bring classes into range [0, 2]
+    input_mask -= 1
+
+    return input_image, input_mask
+```
+
+Next, you implement this in your definition for **sample-level preprocessing**. The input image is resized to the size specified in your model configuration, and the same is true for your mask. Finally, both the input image and mask are normalized, and returned.
+
+```
+def preprocess_sample(data_sample):
+    ''' Resize and normalize dataset samples. '''
+    config = configuration()
+
+    # Resize image
+    input_image = tensorflow.image.resize(data_sample['image'],\
+        (config.get("input_width"), config.get("input_height")))
+
+    # Resize mask
+    input_mask = tensorflow.image.resize(data_sample['segmentation_mask'],\
+        (config.get("mask_width"), config.get("mask_height")))
+
+    # Normalize input image and mask
+    input_image, input_mask = normalize_sample(input_image, input_mask)
+
+    return input_image, input_mask
+```
+
+**Data augmentation** allows TensorFlow to perform arbitrary image manipulations on your input Tensors. In today's tutorial, you will implement data augmentation by having the samples flipped horizontally and vertically at random. We use the same seed across the calls to ensure that your inputs and labels are manipulated in the same way.
+
+```
+def data_augmentation(inputs, labels):
+    ''' Perform data augmentation. '''
+    # Use the same seed for deterministic randomness over both inputs and labels.
+    seed = 36
+
+    # Feed data through layers
+    inputs = tensorflow.image.random_flip_left_right(inputs, seed=seed)
+    inputs = tensorflow.image.random_flip_up_down(inputs, seed=seed)
+    labels = tensorflow.image.random_flip_left_right(labels, seed=seed)
+    labels = tensorflow.image.random_flip_up_down(labels, seed=seed)
+
+    return inputs, labels
+```
+
+Next up is computing **sample weights**. Given the weights for each class, you first normalize them by dividing by their `reduce_sum`. Subsequently, you look up the weight belonging to each mask pixel with `gather`, and return the result as an extra Tensor to be used in `model.fit` - a short standalone illustration of this follows below.
+
+```
+def compute_sample_weights(image, mask):
+    ''' Compute sample weights for the image given class. '''
+    # Compute relative weight of class
+    class_weights = configuration().get("class_weights")
+    class_weights = class_weights/tensorflow.reduce_sum(class_weights)
+
+    # Compute same-shaped Tensor as mask with sample weights per
+    # mask element.
+    sample_weights = tensorflow.gather(class_weights, indices=\
+        tensorflow.cast(mask, tensorflow.int32))
+
+    return image, mask, sample_weights
+```
+
+Finally, you can combine all the definitions above in **dataset-level preprocessing**. Depending on the dataset type, this is performed differently:
+
+- When preprocessing your **training data** or **validation data**, sample-level preprocessing, data augmentation and class weighting are performed, including some utility processing to improve the training process.
+- The utility functions and class weighting are left out when preprocessing your **testing data**, because they are not necessary: during testing, the model is not trained.
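+
+Before moving on to the dataset-level preprocessing code below, here is the promised standalone illustration of the `gather` trick. The class weights match the ones from the model configuration, while the 2x2 mask is made up purely for illustration:
+
+```
+import tensorflow
+
+# Class weights as in the configuration, normalized so that they sum to 1
+class_weights = tensorflow.constant([1.0, 1.0, 2.0])
+class_weights = class_weights / tensorflow.reduce_sum(class_weights)  # [0.25, 0.25, 0.5]
+
+# A made-up 2x2 mask containing the classes 0, 1 and 2
+mask = tensorflow.constant([[0, 2], [1, 2]])
+
+# For every mask pixel, gather picks the weight of that pixel's class
+sample_weights = tensorflow.gather(class_weights, indices=mask)
+print(sample_weights.numpy())
+# [[0.25 0.5 ]
+#  [0.25 0.5 ]]
+```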
+ +``` +def preprocess_dataset(data, dataset_type, dataset_info): + ''' Fully preprocess dataset given dataset type. ''' + config = configuration() + batch_size = config.get("batch_size") + buffer_size = config.get("buffer_size") + + # Preprocess data given dataset type. + if dataset_type == "train" or dataset_type == "val": + # 1. Perform preprocessing + # 2. Cache dataset for improved performance + # 3. Shuffle dataset + # 4. Generate batches + # 5. Repeat + # 6. Perform data augmentation + # 7. Add sample weights + # 8. Prefetch new data before it being necessary. + return (data + .map(preprocess_sample) + .cache() + .shuffle(buffer_size) + .batch(batch_size) + .repeat() + .map(data_augmentation) + .map(compute_sample_weights) + .prefetch(buffer_size=tensorflow.data.AUTOTUNE)) + else: + # 1. Perform preprocessing + # 2. Generate batches + return (data + .map(preprocess_sample) + .batch(batch_size)) +``` + +#### Training callbacks + +What's left is writing some utility functions. If you're familiar with TensorFlow, it's likely that you know about the [Keras callbacks](https://www.machinecurve.com/index.php/2020/11/10/an-introduction-to-tensorflow-keras-callbacks/). These can be used to allow certain actions to take place at specific steps in your training process. + +Today, we're using these callbacks to integrate TensorBoard logging into your model. This way, you'll be able to evaluate progress and model training during and after your training process. + +``` +def training_callbacks(): + ''' Retrieve initialized callbacks for model.fit ''' + return [ + TensorBoard( + log_dir=os.path.join(os.getcwd(), "unet_logs"), + histogram_freq=1, + write_images=True + ) + ] +``` + +#### Data visualization + +The last utility function is related to data visualization. We want to understand what the performance of our model will be, so we're going to construct a visualization util that displays the **source image**, the **actual mask**, the **predicted mask** and the **predicted mask overlayed on top of the source image**. For doing so, we'll need to create a function that generates a mask from the model prediction: + +``` +def probs_to_mask(probs): + ''' Convert Softmax output into mask. ''' + pred_mask = tensorflow.argmax(probs, axis=2) + return pred_mask +``` + +Across the third dimension, it simply takes the class index with the maximum value and returns it. Indeed, that's equal to picking a class. + +You integrate this in `generate_plot`, which uses Matplotlib to generate four plots with the source image, actual mask, predicted mask and the overlay: + +``` +def generate_plot(img_input, mask_truth, mask_probs): + ''' Generate a plot of input, truthy mask and probability mask. 
''' + fig, axs = plt.subplots(1, 4) + fig.set_size_inches(16, 6) + + # Plot the input image + axs[0].imshow(img_input) + axs[0].set_title("Input image") + + # Plot the truthy mask + axs[1].imshow(mask_truth) + axs[1].set_title("True mask") + + # Plot the predicted mask + predicted_mask = probs_to_mask(mask_probs) + axs[2].imshow(predicted_mask) + axs[2].set_title("Predicted mask") + + # Plot the overlay + config = configuration() + img_input_resized = tensorflow.image.resize(img_input, (config.get("mask_width"), config.get("mask_height"))) + axs[3].imshow(img_input_resized) + axs[3].imshow(predicted_mask, alpha=0.5) + axs[3].set_title("Overlay") + + # Show the plot + plt.show() +``` + +### Merging everything together into a working example + +The final step is merging everything together into an example that works: + +``` +def main(): + ''' Run full training procedure. ''' + + # Load config + config = configuration() + batch_size = config.get("batch_size") + validation_sub_splits = config.get("validation_sub_splits") + num_epochs = config.get("num_epochs") + + # Load data + (training_data, validation_data, testing_data), info = load_dataset() + + # Make training data ready for model.fit and model.evaluate + train_batches = preprocess_dataset(training_data, "train", info) + val_batches = preprocess_dataset(validation_data, "val", info) + test_batches = preprocess_dataset(testing_data, "test", info) + + # Compute data-dependent variables + train_num_samples = tensorflow.data.experimental.cardinality(training_data).numpy() + val_num_samples = tensorflow.data.experimental.cardinality(validation_data).numpy() + steps_per_epoch = train_num_samples // batch_size + val_steps_per_epoch = val_num_samples // batch_size // validation_sub_splits + + # Initialize model + model = init_model(steps_per_epoch) + + # Train the model + model.fit(train_batches, epochs=num_epochs, batch_size=batch_size,\ + steps_per_epoch=steps_per_epoch, verbose=1, + validation_steps=val_steps_per_epoch, callbacks=training_callbacks(),\ + validation_data=val_batches) + + # Test the model + score = model.evaluate(test_batches, verbose=0) + print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + + # Take first batch from the test images and plot them + for images, masks in test_batches.take(1): + + # Generate prediction for each image + predicted_masks = model.predict(images) + + # Plot each image and masks in batch + for index, (image, mask) in enumerate(zip(images, masks)): + generate_plot(image, mask, predicted_masks[index]) + if index > 4: + break + + +if __name__ == '__main__': + main() +``` + +### Full code example + +If you want to get started immediately, that is possible too :) Here is the full model code: + +``` +import os +import tensorflow +from tensorflow.keras.layers import Conv2D,\ + MaxPool2D, Conv2DTranspose, Input, Activation,\ + Concatenate, CenterCrop +from tensorflow.keras import Model +from tensorflow.keras.initializers import HeNormal +from tensorflow.keras.optimizers import schedules, Adam +from tensorflow.keras.losses import SparseCategoricalCrossentropy +from tensorflow.keras.callbacks import TensorBoard +from tensorflow.keras.utils import plot_model +import tensorflow_datasets as tfds +import matplotlib.pyplot as plt + + +''' + U-NET CONFIGURATION +''' +def configuration(): + ''' Get configuration. 
''' + + return dict( + data_train_prc = 80, + data_val_prc = 90, + data_test_prc = 100, + num_filters_start = 64, + num_unet_blocks = 3, + num_filters_end = 3, + input_width = 100, + input_height = 100, + mask_width = 60, + mask_height = 60, + input_dim = 3, + optimizer = Adam, + loss = SparseCategoricalCrossentropy, + initializer = HeNormal(), + batch_size = 50, + buffer_size = 50, + num_epochs = 25, + metrics = ['accuracy'], + dataset_path = os.path.join(os.getcwd(), 'data'), + class_weights = tensorflow.constant([1.0, 1.0, 2.0]), + validation_sub_splits = 5, + lr_schedule_percentages = [0.2, 0.5, 0.8], + lr_schedule_values = [3e-4, 1e-4, 1e-5, 1e-6], + lr_schedule_class = schedules.PiecewiseConstantDecay + ) + + +''' + U-NET BUILDING BLOCKS +''' + +def conv_block(x, filters, last_block): + ''' + U-Net convolutional block. + Used for downsampling in the contracting path. + ''' + config = configuration() + + # First Conv segment + x = Conv2D(filters, (3, 3),\ + kernel_initializer=config.get("initializer"))(x) + x = Activation("relu")(x) + + # Second Conv segment + x = Conv2D(filters, (3, 3),\ + kernel_initializer=config.get("initializer"))(x) + x = Activation("relu")(x) + + # Keep Conv output for skip input + skip_input = x + + # Apply pooling if not last block + if not last_block: + x = MaxPool2D((2, 2), strides=(2,2))(x) + + return x, skip_input + + +def contracting_path(x): + ''' + U-Net contracting path. + Initializes multiple convolutional blocks for + downsampling. + ''' + config = configuration() + + # Compute the number of feature map filters per block + num_filters = [compute_number_of_filters(index)\ + for index in range(config.get("num_unet_blocks"))] + + # Create container for the skip input Tensors + skip_inputs = [] + + # Pass input x through all convolutional blocks and + # add skip input Tensor to skip_inputs if not last block + for index, block_num_filters in enumerate(num_filters): + + last_block = index == len(num_filters)-1 + x, skip_input = conv_block(x, block_num_filters,\ + last_block) + + if not last_block: + skip_inputs.append(skip_input) + + return x, skip_inputs + + +def upconv_block(x, filters, skip_input, last_block = False): + ''' + U-Net upsampling block. + Used for upsampling in the expansive path. + ''' + config = configuration() + + # Perform upsampling + x = Conv2DTranspose(filters//2, (2, 2), strides=(2, 2),\ + kernel_initializer=config.get("initializer"))(x) + shp = x.shape + + # Crop the skip input, keep the center + cropped_skip_input = CenterCrop(height = x.shape[1],\ + width = x.shape[2])(skip_input) + + # Concatenate skip input with x + concat_input = Concatenate(axis=-1)([cropped_skip_input, x]) + + # First Conv segment + x = Conv2D(filters//2, (3, 3), + kernel_initializer=config.get("initializer"))(concat_input) + x = Activation("relu")(x) + + # Second Conv segment + x = Conv2D(filters//2, (3, 3), + kernel_initializer=config.get("initializer"))(x) + x = Activation("relu")(x) + + # Prepare output if last block + if last_block: + x = Conv2D(config.get("num_filters_end"), (1, 1), + kernel_initializer=config.get("initializer"))(x) + + return x + + +def expansive_path(x, skip_inputs): + ''' + U-Net expansive path. + Initializes multiple upsampling blocks for upsampling. 
+ ''' + num_filters = [compute_number_of_filters(index)\ + for index in range(configuration()\ + .get("num_unet_blocks")-1, 0, -1)] + + skip_max_index = len(skip_inputs) - 1 + + for index, block_num_filters in enumerate(num_filters): + skip_index = skip_max_index - index + last_block = index == len(num_filters)-1 + x = upconv_block(x, block_num_filters,\ + skip_inputs[skip_index], last_block) + + return x + + +def build_unet(): + ''' Construct U-Net. ''' + config = configuration() + input_shape = (config.get("input_height"),\ + config.get("input_width"), config.get("input_dim")) + + # Construct input layer + input_data = Input(shape=input_shape) + + # Construct Contracting path + contracted_data, skip_inputs = contracting_path(input_data) + + # Construct Expansive path + expanded_data = expansive_path(contracted_data, skip_inputs) + + # Define model + model = Model(input_data, expanded_data, name="U-Net") + + return model + + +def compute_number_of_filters(block_number): + ''' + Compute the number of filters for a specific + U-Net block given its position in the contracting path. + ''' + return configuration().get("num_filters_start") * (2 ** block_number) + + +''' + U-NET TRAINING PROCESS BUILDING BLOCKS +''' + +def init_model(steps_per_epoch): + ''' + Initialize a U-Net model. + ''' + config = configuration() + model = build_unet() + + # Retrieve compilation input + loss_init = config.get("loss")(from_logits=True) + metrics = config.get("metrics") + num_epochs = config.get("num_epochs") + + # Construct LR schedule + boundaries = [int(num_epochs * percentage * steps_per_epoch)\ + for percentage in config.get("lr_schedule_percentages")] + lr_schedule = config.get("lr_schedule_class")(boundaries, config.get("lr_schedule_values")) + + # Init optimizer + optimizer_init = config.get("optimizer")(learning_rate = lr_schedule) + + # Compile the model + model.compile(loss=loss_init, optimizer=optimizer_init, metrics=metrics) + + # Plot the model + plot_model(model, to_file="unet.png") + + # Print model summary + model.summary() + + return model + + +def load_dataset(): + ''' Return dataset with info. ''' + config = configuration() + + # Retrieve percentages + train = config.get("data_train_prc") + val = config.get("data_val_prc") + test = config.get("data_test_prc") + + # Redefine splits over full dataset + splits = [f'train[:{train}%]+test[:{train}%]',\ + f'train[{train}%:{val}%]+test[{train}%:{val}%]',\ + f'train[{val}%:{test}%]+test[{val}%:{test}%]'] + + # Return data + return tfds.load('oxford_iiit_pet:3.*.*', split=splits, data_dir=configuration()\ + .get("dataset_path"), with_info=True) + + +def normalize_sample(input_image, input_mask): + ''' Normalize input image and mask class. ''' + # Cast image to float32 and divide by 255 + input_image = tensorflow.cast(input_image, tensorflow.float32) / 255.0 + + # Bring classes into range [0, 2] + input_mask -= 1 + + return input_image, input_mask + + +def preprocess_sample(data_sample): + ''' Resize and normalize dataset samples. 
''' + config = configuration() + + # Resize image + input_image = tensorflow.image.resize(data_sample['image'],\ + (config.get("input_width"), config.get("input_height"))) + + # Resize mask + input_mask = tensorflow.image.resize(data_sample['segmentation_mask'],\ + (config.get("mask_width"), config.get("mask_height"))) + + # Normalize input image and mask + input_image, input_mask = normalize_sample(input_image, input_mask) + + return input_image, input_mask + + +def data_augmentation(inputs, labels): + ''' Perform data augmentation. ''' + # Use the same seed for deterministic randomness over both inputs and labels. + seed = 36 + + # Feed data through layers + inputs = tensorflow.image.random_flip_left_right(inputs, seed=seed) + inputs = tensorflow.image.random_flip_up_down(inputs, seed=seed) + labels = tensorflow.image.random_flip_left_right(labels, seed=seed) + labels = tensorflow.image.random_flip_up_down(labels, seed=seed) + + return inputs, labels + + +def compute_sample_weights(image, mask): + ''' Compute sample weights for the image given class. ''' + # Compute relative weight of class + class_weights = configuration().get("class_weights") + class_weights = class_weights/tensorflow.reduce_sum(class_weights) + + # Compute same-shaped Tensor as mask with sample weights per + # mask element. + sample_weights = tensorflow.gather(class_weights,indices=\ + tensorflow.cast(mask, tensorflow.int32)) + + return image, mask, sample_weights + + +def preprocess_dataset(data, dataset_type, dataset_info): + ''' Fully preprocess dataset given dataset type. ''' + config = configuration() + batch_size = config.get("batch_size") + buffer_size = config.get("buffer_size") + + # Preprocess data given dataset type. + if dataset_type == "train" or dataset_type == "val": + # 1. Perform preprocessing + # 2. Cache dataset for improved performance + # 3. Shuffle dataset + # 4. Generate batches + # 5. Repeat + # 6. Perform data augmentation + # 7. Add sample weights + # 8. Prefetch new data before it being necessary. + return (data + .map(preprocess_sample) + .cache() + .shuffle(buffer_size) + .batch(batch_size) + .repeat() + .map(data_augmentation) + .map(compute_sample_weights) + .prefetch(buffer_size=tensorflow.data.AUTOTUNE)) + else: + # 1. Perform preprocessing + # 2. Generate batches + return (data + .map(preprocess_sample) + .batch(batch_size)) + + +def training_callbacks(): + ''' Retrieve initialized callbacks for model.fit ''' + return [ + TensorBoard( + log_dir=os.path.join(os.getcwd(), "unet_logs"), + histogram_freq=1, + write_images=True + ) + ] + + +def probs_to_mask(probs): + ''' Convert Softmax output into mask. ''' + pred_mask = tensorflow.argmax(probs, axis=2) + return pred_mask + + +def generate_plot(img_input, mask_truth, mask_probs): + ''' Generate a plot of input, truthy mask and probability mask. 
''' + fig, axs = plt.subplots(1, 4) + fig.set_size_inches(16, 6) + + # Plot the input image + axs[0].imshow(img_input) + axs[0].set_title("Input image") + + # Plot the truthy mask + axs[1].imshow(mask_truth) + axs[1].set_title("True mask") + + # Plot the predicted mask + predicted_mask = probs_to_mask(mask_probs) + axs[2].imshow(predicted_mask) + axs[2].set_title("Predicted mask") + + # Plot the overlay + config = configuration() + img_input_resized = tensorflow.image.resize(img_input, (config.get("mask_width"), config.get("mask_height"))) + axs[3].imshow(img_input_resized) + axs[3].imshow(predicted_mask, alpha=0.5) + axs[3].set_title("Overlay") + + # Show the plot + plt.show() + + +def main(): + ''' Run full training procedure. ''' + + # Load config + config = configuration() + batch_size = config.get("batch_size") + validation_sub_splits = config.get("validation_sub_splits") + num_epochs = config.get("num_epochs") + + # Load data + (training_data, validation_data, testing_data), info = load_dataset() + + # Make training data ready for model.fit and model.evaluate + train_batches = preprocess_dataset(training_data, "train", info) + val_batches = preprocess_dataset(validation_data, "val", info) + test_batches = preprocess_dataset(testing_data, "test", info) + + # Compute data-dependent variables + train_num_samples = tensorflow.data.experimental.cardinality(training_data).numpy() + val_num_samples = tensorflow.data.experimental.cardinality(validation_data).numpy() + steps_per_epoch = train_num_samples // batch_size + val_steps_per_epoch = val_num_samples // batch_size // validation_sub_splits + + # Initialize model + model = init_model(steps_per_epoch) + + # Train the model + model.fit(train_batches, epochs=num_epochs, batch_size=batch_size,\ + steps_per_epoch=steps_per_epoch, verbose=1, + validation_steps=val_steps_per_epoch, callbacks=training_callbacks(),\ + validation_data=val_batches) + + # Test the model + score = model.evaluate(test_batches, verbose=0) + print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + + # Take first batch from the test images and plot them + for images, masks in test_batches.take(1): + + # Generate prediction for each image + predicted_masks = model.predict(images) + + # Plot each image and masks in batch + for index, (image, mask) in enumerate(zip(images, masks)): + generate_plot(image, mask, predicted_masks[index]) + if index > 4: + break + + +if __name__ == '__main__': + main() +``` + +* * * + +## Training our U-Net + +Now, let's train our model! Open up a terminal, navigate to the location where your Python script is located, and run it. You should see the training process start quickly :) + +Training our U-Net yielded this performance for me when training it from scratch, i.e. with He initialized weights: + +[![](images/epoch_accuracy-1-1024x190.png)](https://www.machinecurve.com/wp-content/uploads/2022/01/epoch_accuracy-1.png) + +Training accuracy (orange) and validation accuracy (blue). + +[![](images/epoch_learning_rate-1-1024x192.png)](https://www.machinecurve.com/wp-content/uploads/2022/01/epoch_learning_rate-1.png) + +The learning rate over the epochs. The learning rate schedule is clearly visible. + +### Examples of image segmentations generated with our model + +Recall that after training, the model takes some examples from the testing set and outputs the results. 
Here's what your U-Net will produce:
+
+[![](images/1-1024x384.png)](https://www.machinecurve.com/wp-content/uploads/2022/01/1.png)
+
+[![](images/2-1024x384.png)](https://www.machinecurve.com/wp-content/uploads/2022/01/2.png)
+
+[![](images/3-1024x384.png)](https://www.machinecurve.com/wp-content/uploads/2022/01/3.png)
+
+[![](images/4-1024x384.png)](https://www.machinecurve.com/wp-content/uploads/2022/01/4.png)
+
+[![](images/5-1024x384.png)](https://www.machinecurve.com/wp-content/uploads/2022/01/5.png)
+
+[![](images/6-1024x384.png)](https://www.machinecurve.com/wp-content/uploads/2022/01/6.png)
+
+### Improving model performance by model pretraining
+
+Indeed, while some examples (the dog) produce pretty good overlays, for others (one of the cats) the prediction is a lot worse.
+
+One of the key reasons for this is dataset size - despite being a relatively large dataset, the Pets dataset is _really small_ compared to other, more real-world datasets. While data augmentation has likely improved the results, it's not a magic method that can fix all your problems.
+
+Besides increasing the size of your dataset, however, there is a method that will work too - **not starting with randomly initialized weights**. Rather, it can be a good idea to **pretrain your model**, for example using the ImageNet dataset. That way, your model has already learned to detect generic patterns, and you can use those pretrained weights to initialize your U-Net.
+
+There are many packages available that allow you to construct U-Nets for TensorFlow and Keras by using contemporary ConvNets as backbones (ResNet, and so forth). Even better, they provide pretrained weights for these backbones, allowing you to take off from a much better starting point!
+
+Creating a U-Net based image segmentation model by using a pretrained backbone will be covered in other articles. Keep reading MachineCurve to learn more about this!
+
+We can wrap up by saying that you've done it - you created a U-Net from scratch! 🎉 If you have any questions, comments or suggestions, feel free to leave a message in the comments section below 💬 I will then try to answer you as quickly as possible. For now, thank you for reading MachineCurve today and happy engineering!
+
+* * *
+
+## References
+
+Ronneberger, O., Fischer, P., & Brox, T. (2015, October). [U-net: Convolutional networks for biomedical image segmentation.](https://arxiv.org/abs/1505.04597) In _International Conference on Medical image computing and computer-assisted intervention_ (pp. 234-241). Springer, Cham.
+
+Parkhi, O. M., Vedaldi, A., Zisserman, A., & Jawahar, C. V. (2012, June). [Cats and dogs.](https://ieeexplore.ieee.org/abstract/document/6248092) In _2012 IEEE conference on computer vision and pattern recognition_ (pp. 3498-3505). IEEE.
diff --git a/how-to-check-if-your-deep-learning-model-is-underfitting-or-overfitting.md b/how-to-check-if-your-deep-learning-model-is-underfitting-or-overfitting.md
new file mode 100644
index 0000000..21a70f4
--- /dev/null
+++ b/how-to-check-if-your-deep-learning-model-is-underfitting-or-overfitting.md
@@ -0,0 +1,226 @@
+---
+title: "How to check if your Deep Learning model is underfitting or overfitting?"
+date: "2020-12-01" +categories: + - "deep-learning" + - "geen-categorie" + - "svms" +tags: + - "deep-learning" + - "fit" + - "keras" + - "loss-function" + - "loss-value" + - "overfitting" + - "pytorch" + - "tensorflow" + - "underfitting" + - "validation-loss" +--- + +Training a Deep Learning model means that you have to balance between finding a model that works, i.e. that has _predictive power_, and one that works in many cases, i.e. a model that can _generalize well_. + +This is a difficult task, because the balance is precise, and can sometimes be difficult to find. + +In this article, we will be precisely looking at this balance. We will first take a look at what training a Deep Learning model involves by taking a high-level perspective. We then move forward and check what it means to _overfit_ and _underfit_ those models, and that the balance in between these two is crucial for Machine Learning success. We'll then show you how you can check whether your model is underfitting, and also whether it is overfitting. + +This way, you can ensure that you have a model that works in many cases, rather than just a few training ones. + +Let's take a look! 😎 + +**Update 13/Jan/2021:** Ensured that the article is up-to-date. Added a quick answer to the top of the article, changed header information and added links to other articles. + +* * * + +\[toc\] + +* * * + +## Quick Answer: How to see if your model is underfitting or overfitting? + +[![Finding optimal learning rates with the Learning Rate Range Test – MachineCurve](images/UnderOver.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/UnderOver.png) + +Use these steps to determine if your machine learning model, deep learning model or neural network is currently **underfit** or **overfit**. + +1. **Ensure that you are using validation loss next to training loss in the training phase.** +2. **When your validation loss is decreasing, the model is still underfit.** +3. **When your validation loss is increasing, the model is overfit.** +4. **When your validation loss is equal, the model is either perfectly fit or in a local minimum.** + +If you want to understand the _whys_ in more detail, make sure to keep reading the rest of this tutorial! 🚀 + +* * * + +## Training a Deep Learning model: a high-level process + +If we want to understand the concepts of underfitting and overfitting, we must place it into the context of training a Deep Learning model. That's why I think that we should take a look at how such a model is trained first. + +At a high level, training such a model involves three main phases. These phases are cyclical, meaning that training a Deep Learning model is an iterative process. These are the three main components of a training step: + +1. **Feeding samples to the Deep Learning model.** During a training step, samples from your training dataset are fed forward through the model. We call this the _forwards pass_. For each sample that is fed forward, a prediction is generated. +2. **Comparing the predictions and the ground truth.** The predictions that are generated by the Deep Learning model are compared with the _actual_ target values for the samples, which are called the _ground truth_. These comparisons are then jointly combined into a so-called _loss score_ by a [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/). This score indicates how bad your model performs at this time. +3. 
**Improving the Deep Learning model.** Using a technique called _backpropagation_, we can then (automatically) compute the contribution of the different parts of the Neural Network to the loss score. If we know this contribution, we know in which direction to move in order to improve - this direction is called a _gradient_. Using those gradients and the model, we can use [optimizers](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) for actually optimizing the model.
+
+With these three steps, you'll eventually get a model that is optimized for the dataset that it is trained with.
+
+![](images/feed-1024x404.jpg)
+
+* * *
+
+## Overfitting and underfitting Machine Learning models
+
+Suppose that we have the following observations, where a relationship \[latex\]\\text{X} \\rightarrow \\text{y}\[/latex\] exists:
+
+![](images/30samples.png)
+
+We can generate a predictive model that captures this relationship and allows us to predict any value for \[latex\]\\text{y}\[/latex\] within the domain of \[latex\]\\text{x}\[/latex\] displayed in the plot:
+
+![](images/30good.png)
+
+_Fitting a model_ is another term that is used for this process of building and training a Deep Learning model.
+
+Although it sounds simple, it can actually be really difficult to do just that. If we end up with a model that has the fit as displayed above, we have struck a precise balance between a model that is _underfit_ and one that is highly _overfit_.
+
+Time to take a look at the two in more detail.
+
+### What is underfitting a Machine Learning Model?
+
+Sometimes, your Deep Learning model is not able to capture the relationship between your independent variables and your dependent variable(s). In that case, we have **underfit** our model.
+
+> Underfitting occurs when a statistical model cannot adequately capture the underlying structure of the data. An under-fitted model is a model where some parameters or terms that would appear in a correctly specified model are missing.
+>
+> Wikipedia (2003)
+
+In the case of our Deep Learning model, the relationship between \[latex\]\\text{X} \\rightarrow \\text{y}\[/latex\] cannot be captured properly if the model is underfit, and a plot of the fit would look like this:
+
+![](images/30under.png)
+
+Underfitting can have many causes and, by consequence, many possible fixes:
+
+- You haven't trained your model for long enough, and adding extra training time might help you generate a better fit.
+- You haven't trained with the appropriate architecture/model type. For example, if your dataset is nonlinear and you used [linear activation functions](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/), your Deep Learning model will not be able to properly capture the patterns from the dataset.
+
+In other words, underfitting occurs when the model shows [high bias and low variance](https://www.machinecurve.com/index.php/2020/11/02/machine-learning-error-bias-variance-and-irreducible-error-with-python/).
+
+### What is overfitting a Machine Learning model?
+
+Above, we looked at one side of the balance between a good fit and a poor one. Let's now take a look at the other one, i.e., what happens when your model is **overfit**.
+
+> The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e. the noise) as if that variation represented underlying model structure.
+>
+> Wikipedia (2003)
+
+In each dataset, noise is present besides the patterns that actually describe the relationship. If we train our model in such a way that it also captures this noise in great detail, we are training a model that is _overfit_. In other words, it will work very well for the data that it is trained on, but does it also work with data that comes from the real world? After all, that noise may not be present there.
+
+If we visualize what overfitting means for our setting, we get the following visualization:
+
+![](images/30over.png)
+
+Here, we can clearly see that our model captures much of the noise: what should be a smooth, quadratic rise (the function we actually visualized is \[latex\]x^2\[/latex\] on the domain \[latex\]x \\in \[0, 10\]\[/latex\]) is now a noisy one. We have therefore introduced noise into our model which is not present in our 'real world' (i.e., in \[latex\]x^2\[/latex\])!
+
+In the real world, the odds are relatively low that you will produce a model that is _underfit_. Overfitting is the problem: today's real-world datasets are often highly complex and have many variables. Capturing noise is really easy these days!
+
+* * *
+
+## Checking for underfitting/overfitting
+
+### Visualizing loss
+
+When you are training a Deep Learning model, for example with Keras, you specify a [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/). This function, which produces a value that tells you something about how bad your model performs, can be computed on training data or on validation data. We specify the loss function in the `model.compile` step.
+
+During training, we always need to rely on validation data when estimating the performance of our model (relying on the training data for estimating model performance is like checking your own homework). That's why we specify a validation split in `model.fit`.
Together, this can look as follows: + +``` +# Compile the model +model.compile(loss=categorical_crossentropy, + optimizer=Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(X_train, y_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split_size) +``` + +When we start the training process, we can now both see `loss` (loss based on training data) and `val_loss` (loss based on validation data) besides our epochs: + +``` +Epoch 1/25 +5832/5832 [==============================] - 1s 203us/sample - loss: 2.2721 - accuracy: 0.1811 - val_loss: 2.2729 - val_accuracy: 0.1590 +Epoch 2/25 +5832/5832 [==============================] - 0s 27us/sample - loss: 2.2347 - accuracy: 0.2172 - val_loss: 2.2235 - val_accuracy: 0.2879 +Epoch 3/25 +5832/5832 [==============================] - 0s 23us/sample - loss: 2.1743 - accuracy: 0.2723 - val_loss: 2.1465 - val_accuracy: 0.2934 +Epoch 4/25 +5832/5832 [==============================] - 0s 25us/sample - loss: 2.0708 - accuracy: 0.3014 - val_loss: 2.0082 - val_accuracy: 0.3132 +Epoch 5/25 +5832/5832 [==============================] - 0s 25us/sample - loss: 1.9090 - accuracy: 0.3527 - val_loss: 1.8271 - val_accuracy: 0.3722 +Epoch 6/25 +5832/5832 [==============================] - 0s 23us/sample - loss: 1.7152 - accuracy: 0.4504 - val_loss: 1.6274 - val_accuracy: 0.5435 +Epoch 7/25 +5832/5832 [==============================] - 0s 24us/sample - loss: 1.5153 - accuracy: 0.6020 - val_loss: 1.4348 - val_accuracy: 0.6580 +Epoch 8/25 +5832/5832 [==============================] - 0s 23us/sample - loss: 1.3304 - accuracy: 0.6965 - val_loss: 1.2606 - val_accuracy: 0.6929 +Epoch 9/25 +5832/5832 [==============================] - 0s 25us/sample - loss: 1.1723 - accuracy: 0.7443 - val_loss: 1.1096 - val_accuracy: 0.7676 +``` + +We can now manually plot this loss or [use TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/) if we want to visualize loss in realtime. + +A plot of your validation loss, after some epochs, will look somewhat like this: + +![Finding optimal learning rates with the Learning Rate Range Test – MachineCurve](images/UnderOver.png) + +### Checking for underfitting + +Using validation loss, we can find whether our model is underfitting. Recall that when it does, the model can still gain predictive power without being _too_ trained on the training dataset itself. + +The general rule for calling a model underfit is as follows: + +- **A model can be considered underfit if your validation loss is still decreasing.** + +For this reason: if your validation loss decreases, don't stop the training process. + +### Checking for overfitting + +Recall, however, that overfitting is the bigger problem these days. Whereas it is relatively easy to fight underfitting (just keep the training process running), avoiding overfitting is more difficult. + +But how can we know whether our model is overfitting in the first place? + +Here, too, we have a general rule: + +- **A model can be considered overfit if your validation loss has been increasing (for some time)**. + +Think about what this means: if validation loss starts to increase, the model - when trained - becomes _progressively worse_ during the check. + +In other words, it has now started learning patterns that are not present within the dataset. + +Especially around the optimum loss can oscillate a bit, so don't stop training immediately; perhaps you are at a local optimum, and loss will decrease further. 
However, if it keeps increasing, you can stop the training process. + +![](images/image-1024x372.png) + +### Find the optimum! + +Training a Deep Learning model involves finding the balance between a model that is underfit and one that is overfit, yielding a model that has a _good fit_. As you saw above, we can find this optimum where the change in loss for some iteration is ~0 for a series of epochs. At precisely that time, we can stop the training process. + +It is possible to instruct TensorFlow to train your Keras model while automatically monitoring for this balance. Using both the `EarlyStopping` and `ModelCheckpoint` callbacks, you can ensure that your model [stops training when this optimum is reached](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/). Isn't that easy! + +* * * + +## Summary + +In this article, we looked at the concepts of overfitting and underfitting for your Deep Learning model. Because, well, what do they mean - and how can we check if our model is underfit or if it is overfit? First of all, we saw what happens during a Deep Learning training process from a high-level perspective. We saw that samples are fed forward through the model, that predictions are generated and compared with the actual targets, and that the model is subsequently improved (partially) based on these computations. + +We also saw that when our model is underfit, it is not yet capable of capturing the relevant patterns within the dataset. In other words, if our dataset represents a quadratic function, the function that would be fit would likely be in the shape of a line. If our model is overfit, however, any noise in the dataset disturbs the fit significantly - meaning that it is too focused on the training data at hand. + +Validation loss can be used for checking whether your model is underfitting or whether it is overfitting. If you plot validation loss, by configuring it in `model.compile` and `model.fit` in Keras and subsequently generating a plot in TensorBoard, you can estimate how your model is doing. When validation loss keeps decreasing, your model is still underfit. When it continues to rise, overfitting is occurring. You therefore want a model that is trained to be at the precise balance between the two, where the change in loss is zero. + +I hope that you have learned something from this article. If you did, please feel free to leave a comment in the comments section below 💬 Please feel free to do the same if you have any questions, comments or other remarks. Regardless of that, thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2003, January 23). _Overfitting_. Wikipedia, the free encyclopedia. 
Retrieved November 30, 2020, from [https://en.wikipedia.org/wiki/Overfitting](https://en.wikipedia.org/wiki/Overfitting)
diff --git a/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api.md b/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api.md
new file mode 100644
index 0000000..5951ff3
--- /dev/null
+++ b/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api.md
@@ -0,0 +1,504 @@
+---
+title: "How to create an MLP classifier with TensorFlow 2 and Keras"
+date: "2019-07-27"
+categories:
+  - "buffer"
+  - "frameworks"
+  - "svms"
+tags:
+  - "classifier"
+  - "keras"
+  - "mlp"
+  - "multilayer-perceptron"
+  - "neural-networks"
+---
+
+In one of my previous blogs, I showed why [you can't truly create a Rosenblatt's Perceptron](https://machinecurve.com/index.php/2019/07/24/why-you-cant-truly-create-rosenblatts-perceptron-with-keras/) with Keras. Fortunately for this lovely Python framework, Rosenblatt's was only the first in many developments with respect to neural networks. Since Rosenblatt published his work in 1957-1958, many years have passed and, consequently, many algorithms have been developed.
+
+One class of algorithms that stands out relatively often is the class of so-called Multilayer Perceptrons. I often like to call them _basic neural networks_, since they have the shape that people usually come up with when they talk about neural nets. They aren't complex, really, yet they are much more powerful than the single-neuron ones.
+
+In this blog, I'll show you how to create a basic MLP classifier with TensorFlow 2.0 using the `tf.keras` Sequential API. But before we can do that, we must do one thing first: cover a little bit of history about MLPs. I always think it's important to place learning in a historical context, and that's why I always include brief histories in my blogs.
+
+And then, we'll code it in Keras and test it with a real dataset. If you're feeling lucky today, you might also be interested in finding the code on [GitHub](https://github.com/christianversloot/keras-multilayer-perceptron).
+
+After reading this tutorial, you will...
+
+- Have an idea about the history of Multilayer Perceptrons.
+- Be able to code and test an MLP with TensorFlow 2.0 and Keras - with many code examples, including a full one.
+- Understand why it's better to use Convolutional layers in addition to Dense ones when working with image data.
+
+Let's go! 🚀
+
+* * *
+
+**Update 17/01/2021:** added code example to the top of the article. Updated article structure and header information. Made clear that the article is written for TensorFlow 2.0 and made sure that it is up to date for 2021.
+
+**Update 29/09/2020:** ensured that model has been adapted to `tf.keras` to work with TensorFlow 2.x. Also added full model code and repaired minor textual errors.
+
+**Update 29/09/2020:** repaired mistake related to `num_classes` variable. Credits to Alexandre L for reporting!
+
+* * *
+
+\[toc\]
+
+* * *
+
+## Code example: Multilayer Perceptron with TensorFlow 2.0 and Keras
+
+Here is a full code example for a Multilayer Perceptron created with TensorFlow 2.0 and Keras. It is used to classify samples from the MNIST dataset. If you want to understand it in more detail, or why it's better to use Conv2D layers in addition to Dense layers when handling image data, make sure to read the rest of this tutorial too!
+ +``` +# Imports +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +from tensorflow.keras.utils import to_categorical + +# Configuration options +feature_vector_length = 784 +num_classes = 10 + +# Load the data +(X_train, Y_train), (X_test, Y_test) = mnist.load_data() + +# Reshape the data - MLPs do not understand such things as '2D'. +# Reshape to 28 x 28 pixels = 784 features +X_train = X_train.reshape(X_train.shape[0], feature_vector_length) +X_test = X_test.reshape(X_test.shape[0], feature_vector_length) + +# Convert into greyscale +X_train = X_train.astype('float32') +X_test = X_test.astype('float32') +X_train /= 255 +X_test /= 255 + +# Convert target classes to categorical ones +Y_train = to_categorical(Y_train, num_classes) +Y_test = to_categorical(Y_test, num_classes) + +# Set the input shape +input_shape = (feature_vector_length,) +print(f'Feature shape: {input_shape}') + +# Create the model +model = Sequential() +model.add(Dense(350, input_shape=input_shape, activation='relu')) +model.add(Dense(50, activation='relu')) +model.add(Dense(num_classes, activation='softmax')) + +# Configure the model and start training +model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy']) +model.fit(X_train, Y_train, epochs=10, batch_size=250, verbose=1, validation_split=0.2) + +# Test the model after training +test_results = model.evaluate(X_test, Y_test, verbose=1) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]}%') +``` + +* * * + +## History: a Multilayer Perceptron + +The Rosenblatt perceptron triggered a fairly big controversy in the field of AI. But before I can proceed with this, we must go back to the 1940s and the 1950s first. It was the age of cybernetics. In this field, although it is possibly better described as a movement than a scientific field, people attempted to study how human beings and machines could work together to advance the world. + +As with any fairly new field of science or practice, the cybernetics movement was rather hype-saturated. Although prominent figures such as Alan Turing participated in cybernetic research, dreams often went beyond what was realistic at the time (Rid, 2016). However, that can be said about many things in retrospect... :-) + +Two main streams of thought emerged in the 1950s for making the cybernetic dreams a reality (Olazaran, 1996). The first was the _neural net_ stream. This stream, in which [Frank Rosenblatt](https://machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/) played a prominent role, was about automated learning in a network-like fashion: by attempting to mimic the human brain through artificial neural networks, they argued, learning could be automated. + +The other stream of thought had a radically different point of view. In this stream, the symbolic one, "symbolic expressions stand for words, propositions and other conceptual entities" (Olazaran, 1996). By manipulating these propositions, possibly linking them together, knowledge about the world could be captured and manipulated - and by consequence, intelligent machines could emerge. One of the most prominent thought leaders in the field of symbolic AI was [Marvin Minsky](https://en.wikipedia.org/wiki/Marvin_Minsky) (Olazaran, 1996). 
+ +### The perceptron controversy + +When Rosenblatt demonstrated his perceptron in the late 1950s, he made it quite clear what he thought it would be capable of in many years: + +> The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. +> +> A summary of Rosenblatt's remarks (The New York Times, 1958). + +Minsky and other people who thought symbolic AI was the way forward got furious about these claims. With strong rhetoric, they argued that Rosenblatt only introduced a hype and did not stress upon the limitations of the Perceptron enough (Olazaran, 1996). + +In fact, they essentially thought that "(...) Frank Rosenblatt's work was a waste of time" (Olazaran, 1996). And they set out to show it ... in the work _Perceptrons_, which was published in the late 1960s. + +In this work, they showed that perceptrons had fundamental problems which made learning as envisioned by Rosenblatt impossible, and claimed that no further research should be undertaken in the neural net niche. The main problem was that a single-layer perceptron could not successfully [represent the XOR function](https://medium.com/@lucaspereira0612/solving-xor-with-a-single-perceptron-34539f395182). Mathematically, this was possible with perceptrons that were stacked into multiple layers, but optimization of those would be way too heavy in terms of computational costs. + +### The first AI winter ... and the second + +The consequences of this attack were large: much funding for neural net projects was withdrawn and no new funding was approved by many organizations. As a result, many people working on neural nets were transferred to other fields of study or entirely abandoned their field in favor of symbolic AI. + +This is what is known as the first AI winter. The focus of AI research eventually shifted entirely towards symbolic AI. + +However, when symbolic AI was _institutionalized_, as Olazaran calls it, many problems also came to light with the symbolic approach (Olazaran, 1996). That is, when much research attraction was drawn and many paths in which symbolic AI could be applied were explored, various problems were found with the symbolic approach. One of the primary ones was that the relatively fuzzy context in which humans often operate cannot be captured by machines that fully operate on the rules of logic. + +The consequence? The same as for neural net research in the 1960s ... enter the second AI winter. + +### New momentum for neural networks + +Fortunately, the field of neural net research was not abandoned entirely. Particularly, certain scholars invented what is called the _backpropagation algorithm_. By slightly altering the way a perceptron operates, e.g. by having it use a [continuous rather than a discontinuous function](https://machinecurve.com/index.php/2019/07/24/why-you-cant-truly-create-rosenblatts-perceptron-with-keras/), much progress could be made. Particularly, researchers were since able to optimize it by using a descending-down-the-hill approach, computing the error backwards throughout the layers. They were now especially able to _train perceptrons that were stacked in multiple layers_, or **multilayer perceptrons**. Finally! One of the primary problems of the 1950s-1960s was overcome. + +Minsky and folks were quick to respond with the notion that this revival did not mean that e.g. their remarks about computational costs were no longer accurate. 
Indeed, they were still right about this, but machine learning by means of neural nets remained here to stay. In the years since, we've seen many incremental improvements and a fair share of breakthroughs, of which the deep learning hype is the latest development. + +* * * + +## Coding an MLP with TensorFlow 2.0 and Keras + +Now that we know a thing or two about how the AI field has moved from single-layer perceptrons to deep learning (albeit on a high level), we can focus on the multilayer perceptron (MLP) and actually code one. + +We'll use Keras for that in this post. Keras is a very nice API for creating neural networks in Python. It runs as an abstraction layer on top of frameworks like TensorFlow, Theano and CNTK and makes creating neural networks very easy. + +Under the condition that you know what you're doing, obviously. + +Because now, everyone can mix together some neural network building blocks and create a neural network. Optimizing is however a different story. + +All right. Let's first describe the dataset that we'll use for creating our MLP. + +### The MNIST dataset + +We use the MNIST database, which stands for Modified National Institute of Standards and Technology (LeCun et al., 1998). It is one of the standard datasets that is used throughout the machine learning community, often for educational purposes. + +In simple English, it's just a database of handwritten numbers that are 28 by 28 pixels. They've been used in the early days of neural networks in one of the first practical applications of AI, being a digit recognizer for handwritten numbers. More information on MNIST is available [here](https://en.wikipedia.org/wiki/MNIST_database). + +And this is what these numbers look like: + +![](images/mnist.png) + +### Today's imports + +Okay, let's start work on our MLP in Keras. We must first create a Python file in which we'll work. As your first step, create a file called `model.py` and open it in a text or code editor. + +Also make sure that your machine is ready to run Keras and TensorFlow. Make sure that it has Python installed as well, preferably 3.6+. You'll need this to actually run your code. + +If you wish to visualize your data, you also need Matplotlib. This is however not mandatory for your model. + +Let's now import the essential Python packages: + +``` +# Imports +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +from tensorflow.keras.utils import to_categorical +``` + +Why we import the `keras` package should make sense by now. The same applies to the import of the `mnist` dataset. For the others, let's quickly look into why we import them. + +First, the `Sequential` model. It's one of the two APIs that Keras supports (the other being the `Functional` API). The Sequential one is often used by beginning ML engineers. It offers less flexibility but makes creating neural networks easier. Especially for educational purposes, like this blog, the Sequential API is a very good choice. + +Then, the `Dense` layer. Keras supports a wide number of layers, such as convolutional ones if one aims to build a [Convolutional Neural Network](https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/). However, we don't: our goal is to build a Multilayer Perceptron. Those aren't built of spectacular layers; rather, it's simply a stack of so-called densely-connected ones. 
That means that an arbitrary neuron is connected to all neurons in the subsequent layer. It looks as follows:
+
+![](images/Basic-neural-network.jpg)
+
+Next is the `to_categorical` util. We don't need it immediately, but will require it later. It has to do with the structure of the MNIST dataset, specifically the number of target classes. Contrary to the [single-layer perceptron](https://machinecurve.com/index.php/2019/07/24/why-you-cant-truly-create-rosenblatts-perceptron-with-keras/) that we created, which was a binary classification problem, we're dealing with a multiclass classification problem this time - simply because we have 10 classes, the numbers 0-9.
+
+#### Small detour: categorical cross entropy
+
+For those problems, we need a loss function that is called _categorical crossentropy._ In plain English, I always compare it with a purple elephant 🐘.
+
+Suppose that the relationships in the real world (which are captured by your training data) together compose a purple elephant (a.k.a. distribution). We next train a machine learning model that attempts to be as accurate as the original data; hence attempting to classify data as that purple elephant. How well the model is capable of doing that is what is called a _loss_, and the loss function allows one to compare one distribution (elephant) with the other (hopefully the same elephant). Cross entropy allows one to compare those. We can't use the binary variant (it only compares two elephants), but need the _categorical_ one (which can compare multiple elephants). This however requires us to 'lock' the set of elephants first, to prevent another one from being added somehow. This is called _categorical data_: it belongs to a fixed set of categories (Chollet, 2017).
+
+\[mathjax\]
+
+However, the MNIST targets, which are just numbers (_and numbers can take any value!_), are not categorical. With `to_categorical`, we can turn the numbers into categorical data. For example, if we have a three-class classification problem with the possible classes being \[latex\]\\{ 0, 1, 2 \\}\[/latex\], the numbers 0, 1 or 2 are encoded into categorical vectors. One categorical vector looks as follows:
+
+\\begin{equation} \\textbf{y} = \\begin{bmatrix}0 \\\\ 1 \\\\ 0\\end{bmatrix} \\end{equation}
+
+...or in plain English:
+
+- Class 0: false.
+- Class 1: true.
+- Class 2: false.
+
+_Categorical data is fixed with respect to the possible outcomes; categorical crossentropy therefore requires your data to be fixed (categorical)_.
+
+And `to_categorical` serves this purpose - a small standalone demonstration follows just before we load the data.
+
+### Loading your data
+
+Next, we can assign some configuration variables:
+
+```
+# Configuration options
+feature_vector_length = 784
+num_classes = 10
+```
+
+One MNIST sample is an image of 28 by 28 pixels. An interesting observation that I made a while ago is that MLPs don't support multidimensional data like images natively. What you'll have to do is to _flatten_ the image, in the sense that you take all the rows and concatenate them into one massive row. Since 28 times 28 is 784, our feature vector ([which, for the Pima dataset SLP, contained only 8 features](https://machinecurve.com/index.php/2019/07/24/why-you-cant-truly-create-rosenblatts-perceptron-with-keras/#loading-dependencies-and-data)) will contain 784 features (pixels).
+
+The MNIST dataset contains 60.000 images in its training set. Each image belongs to one of ten classes. Hence, the `num_classes` is 10.
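+
+As promised, here is a small, standalone demonstration of `to_categorical` at work. The targets below are made up purely for illustration; with ten classes, as in MNIST, the vectors simply become ten elements long:
+
+```
+from tensorflow.keras.utils import to_categorical
+
+# A few made-up targets from a three-class problem
+targets = [0, 1, 2, 1]
+print(to_categorical(targets, 3))
+# [[1. 0. 0.]
+#  [0. 1. 0.]
+#  [0. 0. 1.]
+#  [0. 1. 0.]]
+
+# A single MNIST-style target, 5, with num_classes = 10
+print(to_categorical(5, 10))
+# [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
+```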
Finally, we can load the data:

```
# Load the data
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

# Reshape the data - MLPs do not understand such things as '2D'.
# Reshape to 28 x 28 pixels = 784 features
X_train = X_train.reshape(X_train.shape[0], feature_vector_length)
X_test = X_test.reshape(X_test.shape[0], feature_vector_length)

# Scale the pixel values into the [0, 1] range
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

# Convert target classes to categorical ones
Y_train = to_categorical(Y_train, num_classes)
Y_test = to_categorical(Y_test, num_classes)
```

We'll use the Keras-provided `mnist.load_data()` to load the MNIST dataset relatively easily. The function returns two tuples: one with training data; the other with testing data. The `X` elements represent the feature vectors (which at that point in time are still 28x28 pixels); the `Y` elements represent the targets (at that point still being numbers, i.e. 0-9).

The next step is to `reshape` the data: we argued that the 28x28 must be converted into 784 to be suitable for MLPs. That's what we do there - we reshape the features to `feature_vector_length` for both the training and testing features.

Next, we scale the data. The MNIST images are already greyscale, with integer pixel values between 0 and 255; by casting them to floats and dividing by 255, we normalize them into the \[0, 1\] range. The absolute scale of a pixel value does not matter for recognizing a digit - only the pattern does - and normalized inputs generally make training a neural network easier and more stable.

Finally, we'll do what we discussed before - convert the data into categorical format by means of the `to_categorical` function. Rather than being _scalars_, such as \[latex\]0\[/latex\] or \[latex\]4\[/latex\], one target _vector_ will subsequently look as follows:

\\begin{equation} \\textbf{y} = \\begin{bmatrix}0 \\\\ 0 \\\\ 0 \\\\ 0 \\\\ 0 \\\\ 1 \\\\ 0 \\\\ 0 \\\\ 0 \\\\ 0\\end{bmatrix} \\end{equation}

Obviously, the target here is 5.

### Intermezzo: visualizing certain features

Perhaps you would like to visualize your features first in order to get a better feeling for them. You can do that by means of `matplotlib`. If you execute `imshow` on either a testing or training sample _before_ you convert it into MLP-ready data, you can see the data you'll be working with.

Code:

```
# Imports
import tensorflow
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

# Configuration options
feature_vector_length = 784
num_classes = 10

# Load the data
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

# Visualize one sample
import matplotlib.pyplot as plt
plt.imshow(X_train[0], cmap='Greys')
plt.show()
```

Result:

[![](images/mnist_visualized.jpeg)](https://machinecurve.com/wp-content/uploads/2019/07/mnist_visualized.jpeg)

### Creating the Multilayer Perceptron

All right, let's continue ... the next step is actually creating the MLP in your code:

```
# Set the input shape
input_shape = (feature_vector_length,)
print(f'Feature shape: {input_shape}')

# Create the model
model = Sequential()
model.add(Dense(350, input_shape=input_shape, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
```

Question: have you got any idea about the shape of the data that we'll feed into the MLP once we fit the data?

\[latex\](784, )\[/latex\].

We'll feed it a one-dimensional feature vector that contains 784 features.
That's why we assign `feature_vector_length` converted into tuple format to `input_shape` and use it later in the `model`.

As discussed before, the Keras Sequential API is used for creating the model. We'll next add three layers to our MLP - two hidden layers and an output layer:

- The first hidden layer has 350 output neurons and takes the input of 784 input neurons, which are represented by an input layer specified by the `input_shape` argument. We activate using Rectified Linear Unit (ReLU), which is one of the [standard activation functions](https://machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/#todays-activation-functions) used today. Below, you'll see how it activates.
- The second hidden layer has 50 output neurons and activates by means of ReLU. You'll by now notice that we funnel the information into an increasingly dense format. This way, the model will be capable of learning the most important patterns, which helps it generalize to new data.
- Finally, there's an output layer, which has `num_classes` output neurons and activates by means of `Softmax`. The number of neurons equals the number of scalars in your output vector. Since that data must be categorical for categorical cross entropy, and thus the number of scalar values in your target vector equals the number of classes, it makes sense why `num_classes` is used. Softmax, the activation function, is capable of generating a so-called multiclass probability distribution. That is, for every class it computes the probability that a certain feature vector belongs to it.

[![](images/relu-1024x511.png)](https://machinecurve.com/wp-content/uploads/2019/05/relu.png)

How Rectified Linear Unit a.k.a. ReLU activates.

### MLP hyperparameters

Ok, we just configured the model _architecture_... but we didn't yet cover _how it learns_.

We can configure precisely that by means of the model's hyperparameters:

```
# Configure the model and start training
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=10, batch_size=250, verbose=1, validation_split=0.2)
```

As discussed before, we use categorical crossentropy as our loss function (Chollet, 2017). We use the `Adam` optimizer for optimizing our model. It combines various improvements to traditional stochastic gradient descent (Kingma and Ba, 2014; Ruder, 2016). Adam is the de facto standard optimizer used today (Chollet, 2017).

Accuracy is highly intuitive to humans, so we'll use that alongside our categorical crossentropy loss.

Next, we fit the training data to our model. We choose 10 epochs - that is, ten full passes over the training data - a batch size of 250, verbosity mode 1 and a validation split of 20%. The latter splits the 60.000 training samples into 48.000 used for training and 12.000 used for validation during optimization.

All right, let's go.
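One optional but useful addition: `model.fit` returns a `History` object, which records loss and accuracy per epoch. You can use it to draw curves similar to the plot shown in the next section. A minimal sketch, assuming Matplotlib is installed and that you capture the return value of `fit`:

```
# Optional: capture the History object and plot the loss curves
# (replace the fit call above with this one)
history = model.fit(X_train, Y_train, epochs=10, batch_size=250, verbose=1, validation_split=0.2)

import matplotlib.pyplot as plt
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
```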
+ +* * * + +## Testing your MLP implementation + +Execute your code in Python, in an environment where TensorFlow and Keras are installed: + +`python model.py` + +It then starts training, which should be similar to this: + +``` +2019-07-27 20:35:33.356042: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3026 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1) +48000/48000 [==============================] - 54s 1ms/step - loss: 1.8697 - acc: 0.5851 - val_loss: 0.4227 - val_acc: 0.8801 +Epoch 2/10 +48000/48000 [==============================] - 72s 1ms/step - loss: 0.3691 - acc: 0.8939 - val_loss: 0.3069 - val_acc: 0.9122 +Epoch 3/10 +48000/48000 [==============================] - 73s 2ms/step - loss: 0.2737 - acc: 0.9222 - val_loss: 0.2296 - val_acc: 0.9360 +Epoch 4/10 +48000/48000 [==============================] - 62s 1ms/step - loss: 0.2141 - acc: 0.9385 - val_loss: 0.1864 - val_acc: 0.9477 +Epoch 5/10 +48000/48000 [==============================] - 61s 1ms/step - loss: 0.1785 - acc: 0.9482 - val_loss: 0.1736 - val_acc: 0.9495 +Epoch 6/10 +48000/48000 [==============================] - 75s 2ms/step - loss: 0.1525 - acc: 0.9549 - val_loss: 0.1554 - val_acc: 0.9577 +Epoch 7/10 +48000/48000 [==============================] - 79s 2ms/step - loss: 0.1304 - acc: 0.9620 - val_loss: 0.1387 - val_acc: 0.9597 +Epoch 8/10 +48000/48000 [==============================] - 94s 2ms/step - loss: 0.1118 - acc: 0.9677 - val_loss: 0.1290 - val_acc: 0.9622 +Epoch 9/10 +48000/48000 [==============================] - 55s 1ms/step - loss: 0.0988 - acc: 0.9705 - val_loss: 0.1232 - val_acc: 0.9645 +Epoch 10/10 +48000/48000 [==============================] - 55s 1ms/step - loss: 0.0862 - acc: 0.9743 - val_loss: 0.1169 - val_acc: 0.9676 +10000/10000 [==============================] - 21s 2ms/step +``` + +Or, visually: + +[![](images/image-2.png)](https://machinecurve.com/wp-content/uploads/2019/07/image-2.png) + +As you can see, training loss decreases rapidly. This is perfectly normal, as the model always learns most during the early stages of optimization. Accuracies converge after only one epoch, and still improve during the 10th, albeit slightly. + +Validation loss is also still decreasing during the 10th epoch. This means that although the model already performs well (accuracies of 96.8%!), it can still improve further without losing its power to generalize to data it has never seen. In other words, our model is still underfit... perhaps, increasing the number of `epochs` until validation loss increases again might yield us an even better model. + +However, this was all observed from validation data. What's best is to test it with the actual testing data that was generated earlier: + +``` +# Test the model after training +test_results = model.evaluate(X_test, Y_test, verbose=1) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]}%') +``` + +Testing against the testing data will ensure that you've got a reliable metric for testing the model's power for generalization. This is because every time, during optimization which is done based on validation data, information about the validation data leaks into the model. 
Since the validation data is a statistical sample which also deviates slightly from the actual population in terms of, say, mean and variance, you get into trouble when you rely on it too much.

However, for our attempt, the test results are positive:

```
Test results - Loss: 0.1073538348050788 - Accuracy: 0.9686%
```

Similar - almost 97% (note that the accuracy is printed as a fraction; multiply it by 100 for the actual percentage)! That's great 😎

* * *

## Full model code

It's of course also possible to obtain the full code for this model:

```
# Imports
import tensorflow
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

# Configuration options
feature_vector_length = 784
num_classes = 10

# Load the data
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

# Reshape the data - MLPs do not understand such things as '2D'.
# Reshape to 28 x 28 pixels = 784 features
X_train = X_train.reshape(X_train.shape[0], feature_vector_length)
X_test = X_test.reshape(X_test.shape[0], feature_vector_length)

# Scale the pixel values into the [0, 1] range
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255

# Convert target classes to categorical ones
Y_train = to_categorical(Y_train, num_classes)
Y_test = to_categorical(Y_test, num_classes)

# Set the input shape
input_shape = (feature_vector_length,)
print(f'Feature shape: {input_shape}')

# Create the model
model = Sequential()
model.add(Dense(350, input_shape=input_shape, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

# Configure the model and start training
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs=10, batch_size=250, verbose=1, validation_split=0.2)

# Test the model after training
test_results = model.evaluate(X_test, Y_test, verbose=1)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]}%')
```

* * *

## Wrapping up: why you'd better use CNNs rather than MLPs for image data

All right. We were successful in creating a multilayer perceptron that classifies the MNIST dataset with a very high accuracy: we achieved a success rate of about 97% on 10.000 images. That's pretty cool, isn't it?

Yep.

But...

...we can do better.

MLPs were very popular years back (say, in the 2000s), but when it comes to image data, they have been overtaken in popularity and effectiveness by [Convolutional Neural Networks](https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) (CNNs). If you wish to create an image classifier, I'd suggest looking at them, perhaps combining them with MLPs in some kind of ensemble classifier. Don't use MLPs only.

### More observations

- I trained CNNs before. In my experience, they train a lot faster on the MNIST dataset than the MLP we just built. It's rather easy to explain this: the further you move to the right in your stack of CNN layers, the more abstract your data becomes, which speeds up the training process. Compare this to MLPs, which learn from the entire feature vector at once; the funneling approach may be effective, but it isn't as efficient as the sparse, weight-sharing connectivity of CNNs. Another reason to look at CNNs!
- Another observation is that when you wish to use MLPs, image-like data must be flattened into a one-dimensional feature vector first.
Otherwise, you simple cannot use them for image data. CNNs often come with multidimensional convolutional layers, like the `Conv2D` and `Conv3D` ones in Keras. CNNs therefore save you preprocessing time and _computational costs_ if you deal with a lot of data. +- As we noted before, when you use Softmax and - by consequence - categorical crossentropy, the number of neurons in your final layer must be equal to the number of target classes present in your dataset. This has to do with the fact that you're converting your data into categorical format first, which effectively converts your target scalar into a target vector with `num_classes` scalars (of the values 0 and 1). + +I hope you enjoyed this post and have learnt more about MLPs, creating them in Keras, the history of moving from perceptrons to modern algorithms and, finally, why you better use CNNs for image like data. If you've got any remaining questions or if you have got remarks whatsoever, please feel free to leave a comment below 👇 I'm happy to receive your remarks so that we can together improve this post. Questions will be answered as soon as I can. + +Thank you... and happy engineering! 😎 + +_The code for this work is also available on_ [_GitHub_](https://github.com/christianversloot/keras-multilayer-perceptron)_._ + +## References + +Chollet, F. (2017). _Deep Learning with Python_. New York, NY: Manning Publications. + +Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. Retrieved from [https://arxiv.org/abs/1412.6980](https://arxiv.org/abs/1412.6980) + +LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. _Proceedings of the IEEE_, _86_(11), 2278-2324. [doi:10.1109/5.726791](http://doi.org/10.1109/5.726791) + +Olazaran, M. (1996). A Sociological Study of the Official History of the Perceptrons Controversy. _Social Studies of Science_, _26_(3), 611-659. [doi:10.1177/030631296026003005](http://doi.org/10.1177/030631296026003005) + +Rid, T. (2016). _Rise of the Machines: the lost history of cybernetics_. Scribe Publications. + +Ruder, S. (2016). An overview of gradient descent optimization algorithms. Retrieved from [https://arxiv.org/abs/1609.04747](https://arxiv.org/abs/1609.04747) + +The New York Times. (1958, July 8). NEW NAVY DEVICE LEARNS BY DOING; Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser. Retrieved from [https://www.nytimes.com/1958/07/08/archives/new-navy-device-learns-by-doing-psychologist-shows-embryo-of.html](https://www.nytimes.com/1958/07/08/archives/new-navy-device-learns-by-doing-psychologist-shows-embryo-of.html) diff --git a/how-to-create-a-cnn-classifier-with-keras.md b/how-to-create-a-cnn-classifier-with-keras.md new file mode 100644 index 0000000..9dcb89c --- /dev/null +++ b/how-to-create-a-cnn-classifier-with-keras.md @@ -0,0 +1,518 @@ +--- +title: "How to create a CNN with TensorFlow 2.0 and Keras?" +date: "2019-09-17" +categories: + - "deep-learning" + - "frameworks" +tags: + - "classifier" + - "convolutional-neural-networks" + - "deep-learning" + - "keras" + - "mnist" + - "python" +--- + +In the last couple of years, much buzz has emerged related to deep learning. Especially in the field of computer vision, much progress has been made with respect to replacing more traditional models with deep learning models that show very promising performance. For example, Tesla's autopilot relies on such models to a great extent. + +But how do we create such _Convolutional Neural Networks_ (CNNs)? 
This blog explains it by means of the Keras deep learning framework for Python. We'll first look at the concept of a classifier, CNNs themselves and their components. We then continue with a real Keras / Python implementation for classifying numbers using the MNIST dataset. + +The code used in this blog is also available freely at [GitHub](https://github.com/christianversloot/keras-cnn). + +In this tutorial, you will... + +- Understand the basic concepts behind Convolutional Neural Networks. +- Learn how to implement a ConvNet classifier with TensorFlow 2.0 and Keras. +- See how you can evaluate the CNN after it was trained. + +Let's go! 😎 + +* * * + +**Update 18/Jan/2021:** updated article title and tutorial information. + +**Update 11/Jan/2021:** ensured timeliness of the article and updated header information. Added quick code example to the top of the article for people who want to immediately get started. Also updated article structure. + +**Update 17/Nov/2020:** fixed error where Keras 1.x `backend` import after update remained in some of the code. + +**Update 03/Nov/2020:** made the code compatible with TensorFlow 2.x. Also added links to newer blog articles that are valuable extensions to this article. + +* * * + +\[toc\] + +* * * + +## Code example: ConvNet with TensorFlow and Keras + +Below, you'll find a full-fledged code example for a Convolutional Neural Network based classifier created with TensorFlow and Keras. Make sure to read the article below if you want to understand the code and the concepts in full detail. + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Convert into [0, 1] range. 
+input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +* * * + +## Basic ingredients + +Before we can start off with creating our model in Keras, we must take a look at the basic ingredients of this post first. That means that this section will give you a brief introduction to the concept of a classifier. It will also tell you something about the nature of Convolutional Neural Networks. + +We'll keep it brief, of course. We won't be looking into these topics deeply, since we've got other posts for that. However, it's necessary to understand them (or to recap them) if you wish to understand what happens in the actual Keras model. + +If you're already very familiar with those basic concepts in machine learning / deep learning, feel free to continue to the next section. If not, let's go! :-) + +### What is a classifier? + +Suppose that you work in the field of separating non-ripe tomatoes from the ripe ones. It's an important job, one can argue, because we don't want to sell customers tomatoes they can't process into dinner. It's the perfect job to illustrate what a human [classifier](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/) would do. + +Humans have a perfect eye to spot tomatoes that are not ripe or that have any other defect, such as being rotten. They derive certain characteristics for those tomatoes, e.g. based on color, smell and shape: + +- If it's green, it's likely to be unripe (or: not sellable); +- If it smells, it is likely to be unsellable; +- The same goes for when it's white or when fungus is visible on top of it. + +If none of those occur, it's likely that the tomato can be sold. + +We now have _two classes_: sellable tomatoes and non-sellable tomatoes. + +Human classifiers _decide about which class an object (a tomato) belongs to._ + +The same principle occurs again in machine learning and deep learning. + +Only then, we replace the human with a machine learning model. We're then using machine learning for _classification_, or for deciding about some "model input" to "which class" it belongs. + +Especially when we deal with image-like inputs, Convolutional Neural Networks can be very good at doing that. + +### What is a Convolutional Neural Network? + +I tend to compare the working of convolutional neural networks with magnifiers. + +Suppose that you have an image. 
In the case of the humans classifying tomatoes above this would be the continuous stream of image-like data that is processed by our brain and is perceived with our eyes. In the case of artificial classification with machine learning models, it would likely be input generated from a camera such as a webcam. + +You wish to detect certain characteristics from the object in order to classify them. This means that you'll have to make a _summary_ of those characteristics that gets more abstract over time. For example, with the tomatoes above, humans translate their continuous stream of observation into a fixed set of intuitive rules about when to classify a tomato as non-sellable; i.e., the three rules specified above. + +Machine learning models and especially [convolutional neural networks (CNNs)](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) do the same thing. + +![](images/Cnn_layer-1.jpg) + +Summarizing with a convolutional layer, as if you're using a magnifier. + +They essentially, like a magnifier, take a look at a very small part of the image and magnify the colors into a much denser bundle. By consequence, we're generating some kind of _summary_ of the initial image. + +When we do that multiple times in a row, by adding multiple layers of such _convolutions_, we end up with a very abstract summary of the original image. This abstract summary can then be compared with some _average_ sellable tomato and non-sellable tomato learnt through training, and hence can be classified by a machine learning model. + +And by consequence, especially with the large amounts of data that are used today, the machine learning community has been able to create very well-performing deep learning models that have accuracies of more than 99%! + +_Please note_: due to reasons of simplicity, I left out the often-common layers like max pooling and batch normalization from the description above. It makes the story better to understand. + +Hence, by creating an abstract summary with a Convolutional Neural Network, it's going to be possible to train a classifier that can _assign an object (the image) into a class_. Just like humans do. We'll show it next with a simple and clear example using the MNIST dataset in Keras. + +* * * + +## Today's dataset + +For the model that we'll create today, we're going to use the MNIST [dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/). The dataset, or the Modified National Institute of Standards and Technology database, contains many thousands of 28x28 pixel images of handwritten numbers, like this: + +![](images/mnist-visualize.png) + +It's a very fine dataset for practicing with CNNs in Keras, since the dataset is already pretty normalized, there is not much noise and the numbers discriminate themselves relatively easily. Additionally, much data is available. + +Hence, let's go and create our CNN! :-) + +* * * + +## Creating a CNN with TensorFlow 2.0 and Keras + +### Software dependencies + +We always start with listing certain dependencies that you'll need to install before you can run the model on your machine. Those are for today: + +- **Python**: version 3.5-3.8. +- **TensorFlow**: `pip install tensorflow`. +- If you wish to generate plots, it's also wise to install **Numpy** (if it's not a peer dependency of the previous ones) and **Matplotlib**. + +Preferably, you'll install these in an Anaconda environment. 
[Read here how to do that.](https://towardsdatascience.com/installing-keras-tensorflow-using-anaconda-for-machine-learning-44ab28ff39cb) + +### Creating the model + +The first step of creating the machine learning model is creating a folder, e.g. `keras-cnn`, with a file in it, e.g. `model.py`. + +#### Model dependencies + +In this file, we'll first import the dependencies that we require later on: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +``` + +Obviously, we need **Keras** since it's the framework we're working with. We import the `mnist` dataset and benefit from the fact that it [comes with Keras by default](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/) - we don't have a lot of trouble using it. + +With respect to the layers, we will primarily use the **Conv2D** and **Dense** layers - I would say that these constitute the _core_ of your deep learning model. The Conv2D layers will provide these _magnifier_ operations that I discussed before, at two dimensions (like on the image above). That means: it slides with a small 2D box over a larger 2D box, being the image. It goes without saying that one can also apply 3D convolutional layers (for analyzing videos, with boxes sliding over a larger box) and 1D convolutional layers (for analyzing e.g. timeseries, with 'pixels' / points on a line sliding over the line). + +We use the Dense layers later on for generating predictions (_classifications_) as it's the structure used for that. + +However, we'll also use **[Dropout](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/)**, **Flatten** and **[MaxPooling2D](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/)**. A max pooling layer is often added after a Conv2D layer and it also provides a magnifier operation, although a different one. In the 2D case, it also slides with a box over the image (or in that case, the 'convolutional maps' generated by the first convolutional layer, i.e. the summarized image) and for every slide picks the maximum value for further propagation. In short, it generates an even stronger summary and can be used to induce sparsity when data is large. + +Flatten connects the convolutional parts of the layer with the Dense parts. Those latter ones can only handle flat data, e.g. onedimensional data, but convolutional outputs are anything but onedimensional. Flatten simply takes all dimensions and concatenates them after each other. + +With Dropout, we're essentially breaking tiny bits of the magnifier directly in front of it. This way, a little bit of noise is introduced into the summary during training. Since we're breaking the magnifiers randomly, the noise is somewhat random as well and hence cannot be predicted in advance. Perhaps counterintuitively, it tends to improve model performance and reduce overfitting: the variance between training images increases without becoming too large. This way, a 'weird' slice of e.g. a tomato can perhaps still be classified correctly. 
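Before we configure the model, here is a quick, standalone illustration of the max pooling operation described above - a minimal NumPy sketch that is not part of the Keras code we are about to write, applying 2x2 max pooling to a toy 4x4 'image':

```
# Standalone illustration of 2x2 max pooling (not part of the CNN code below)
import numpy as np

image = np.array([[1, 3, 2, 1],
                  [4, 6, 5, 2],
                  [7, 2, 9, 1],
                  [3, 1, 4, 8]])

pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # Take the maximum of each non-overlapping 2x2 block
        pooled[i, j] = image[2*i:2*i+2, 2*j:2*j+2].max()

print(pooled)
# [[6. 5.]
#  [7. 9.]]
```

Each 2x2 block is reduced to a single value - exactly the 'even stronger summary' described above.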
#### Model configuration

We'll next configure the CNN itself:

```
# Model configuration
img_width, img_height = 28, 28
batch_size = 250
no_epochs = 25
no_classes = 10
validation_split = 0.2
verbosity = 1
```

Since the MNIST images are 28x28 pixels, we define `img_width` and `img_height` to be 28. We use a batch size of 250 samples, which means that 250 samples are fed forward every time before a model improvement is calculated. We'll do 25 `epochs`, or passing _all_ data 25 times (in batches of 250 samples, many batches per epoch), and have 10 `classes`: the numbers 0-9. We also use 20% of the training data, or 0.2, for `validation` during optimization. Finally, we wish to see as much output as possible, thus configure the training process to be `verbose`.

#### Loading and preparing MNIST data

We next load and prepare the MNIST data. The code looks somewhat complex, but it is actually really simple:

```
# Load MNIST dataset
(input_train, target_train), (input_test, target_test) = mnist.load_data()

# Reshape data
input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
input_shape = (img_width, img_height, 1)

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Convert into [0, 1] range.
input_train = input_train / 255
input_test = input_test / 255
```

We first load the dataset with `mnist.load_data()` and then reshape our input data (the feature vectors). As you can see with the `input_shape`, it's the way your data must be built up to be handled correctly by the framework.

We then parse the numbers as floats, specifically 32-bit floats. This [optimizes the trade-off between memory and number precision](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/#float32-in-your-ml-model-why-its-great) over e.g. integers and 64-bit floats.

Finally, we scale all (numeric!) image samples into the interval \[0, 1\] by dividing them by 255. The MNIST images are already greyscale; this step simply normalizes the pixel intensities. Why can we do this without losing information? Because we don't care about the absolute intensity of a pixel, only about the pattern that the pixels form together - and normalized inputs make training easier.

#### Preparing target vectors with `to_categorical`

We next convert our target vectors, which are integers (`0-9`) into _categorical data_:

```
# Convert target vectors to categorical targets
target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes)
target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes)
```

In a different post explaining [how to create MLPs with Keras](https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/), I explained the need for categorical data as being dependent on the loss function (the means of computing the difference between actual targets and generated predictions when passing the data forward):

> 
> For those problems, we need a loss function that is called _categorical crossentropy._ In plain English, I always compare it with a purple elephant 🐘.
> 
> Suppose that the relationships in the real world (which are captured by your training data) together compose a purple elephant (a.k.a. distribution). We next train a machine learning model that attempts to be as accurate as the original data; hence attempting to classify data as that purple elephant.
How well the model is capable of doing that is what is called a _loss_, and the loss function allows one to compare one distribution (elephant) with the other (hopefully the same elephant). Cross entropy allows one to compare those. We can't use the binary variant (it only compares two elephants), but need the _categorical_ one (which can compare multiple elephants). This however requires us to 'lock' the set of elephants first, to avoid that another one is added somehow. This is called _categorical data_: it belongs to a fixed set of categories (Chollet, 2017).
> 
> [How to create a basic MLP classifier with the Keras Sequential API](https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/)

I suggest clicking the link above if you wish to understand `to_categorical` at a deeper level. We'll need it again here, since we have 10 categories of data - the numbers 0 to 9 - and never include an 11th category in this scenario. Hence, we apply it in our model.

#### Creating your model architecture

We then create the architecture of the model:

```
# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(no_classes, activation='softmax'))
```

We first define the `model` itself to be using the `Sequential` API, or, a stack of layers that together compose the Convolutional Neural Network.

We start off with a two-dimensional convolutional layer, or a Conv2D layer. It learns 32 [filters](https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/#convolutional-layers), or feature maps, based on the data. The kernel, or the small image that slides over the larger one, is 3x3 pixels. As expected, we use the [ReLU](https://machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) activation function for nonlinearity. In the first layer, we also specify the `input_shape` of our data, as determined by the reshape operation earlier.

The Conv2D layer is followed by a [MaxPooling2D](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/) layer with a pool size of 2 x 2. That is, we further summarize the output of the Conv2D layer by sliding another window, 2x2 pixels in size, over the learnt filters. For every _slide_, it takes the maximum value (hence max pooling) within the 2x2 box and passes it on. Hence, each 2x2 = 4 pixel block is turned into a one-pixel output. This greatly reduces memory requirements while keeping your model performance mostly intact.

Finally, before repeating the convolutional layers, we add [Dropout](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/). Dropout, as said, essentially breaks the magnifiers we discussed at the start of this blog. Hence, a little bit of random noise is introduced during training. This greatly reduces the odds of overfitting. It does so by converting certain inputs to 0, and does so randomly.
The parameter `0.25` is the dropout rate, or the fraction of input units to drop (in this case, 25% of the inputs is converted to 0).

Since we wish to summarize further, we repeat the Conv2D process (although learning _more_ filters this time), the MaxPooling2D process and the Dropout process.

It's then likely that the summary is _general_ enough to compare new images and assign them one of the classes 0-9. We must however convert the many filters learnt and processed to a _flat_ structure before it can be processed by the part that actually generates the predictions. Hence, we use the Flatten layer. Subsequently, we let the data pass through two Dense layers, of which the first is `ReLU`\-activated and the second one is `Softmax`\-activated. [Softmax activation](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) essentially generates a _multiclass probability distribution_: for each of the classes 0-9, it computes the probability that the item belongs to that class, with all probabilities summing to 1. _This is also why we must have categorical data: it's going to be difficult to add an 11th class on the fly._

Note that the number of output neurons is `no_classes` for the final layer for the same reason: since `no_classes` probabilities must be computed, we must have `no_classes` different outputs so that for every class a unique output exists.

#### Model compilation & starting training

We then compile the model and start the training by _fitting the data_:

```
# Compile the model
model.compile(loss=tensorflow.keras.losses.categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adam(),
              metrics=['accuracy'])

# Fit data to model
model.fit(input_train, target_train,
          batch_size=batch_size,
          epochs=no_epochs,
          verbose=verbosity,
          validation_split=validation_split)
```

Model compilation essentially _configures_ the model architecture that was created in the previous section. We decide about the _loss function_, the _optimizer_, and the additional _metrics_ that will be used during the training process. We'll briefly cover them next:

- The **[loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/)** can be used to compute the difference between the actual targets (as indicated by the training and/or testing data) and the targets generated by the model during an arbitrary epoch. The higher the difference, or the higher the loss, the worse the model performs. The goal of the machine learning training process is therefore to _minimize loss_.
- Each machine learning scenario needs a different loss function. Since we deal with _classification_, we must use a function called cross entropy. It essentially compares the predicted class distribution with the actual targets, penalizing the model more strongly the further its predictions diverge from the true classes. Since our data is categorical in nature, we use **[categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/17/how-to-use-categorical-multiclass-hinge-with-keras/)**.
- We use **Adaptive Moment Estimation** or **[Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/)** for optimization. It's one of the de facto standard optimizers that are used today.
- For reasons of being more intuitive to humans, we also use **accuracy** as a metric.

We next _fit the data to the model_, or in plain English start the training process.
We do so by feeding the training data (both inputs and targets), specifying the batch size, number of epochs, verbosity and validation split configured before. + +And then let's see what happens! + +#### Adding test metrics for testing generalization + +...except that you'll need to add metrics for _testing_ as well. After training with the training and validation data, which essentially tells you something about the model's _predictive performance_, you also wish to test it for _generalization_ - or, whether it works well when data is used that the model has [never seen before](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/). That's why you created the train / test split in the first place. Now is the time to add a test, or an evaluation step, to the model - which executes just after the training process ends: + +``` +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +### Final model + +In the process, altogether you've created this: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Convert into [0, 1] range. +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +It's a complete Keras model that can now be run in order to find its performance and to see whether it works. Let's go! + +Open your terminal, preferably an Anaconda environment, and ensure that all the necessary dependencies are installed and are in working order. + +Then, navigate to the folder with e.g. 
`cd /path/to/folder` and execute your model with Python: e.g. `python model.py`. You should see Keras starting up, running the training process in TensorFlow, and displaying the results of the epochs. + +* * * + +## Model performance + +With the training process configured above, this is what you'll likely see: + +``` +Epoch 1/25 +48000/48000 [==============================] - 7s 145us/step - loss: 0.3609 - acc: 0.8909 - val_loss: 0.1040 - val_acc: 0.9711 +Epoch 2/25 +48000/48000 [==============================] - 3s 71us/step - loss: 0.0981 - acc: 0.9694 - val_loss: 0.0625 - val_acc: 0.9820 +Epoch 3/25 +48000/48000 [==============================] - 3s 70us/step - loss: 0.0674 - acc: 0.9785 - val_loss: 0.0599 - val_acc: 0.9827 +Epoch 4/25 +48000/48000 [==============================] - 3s 70us/step - loss: 0.0549 - acc: 0.9824 - val_loss: 0.0454 - val_acc: 0.9863 +Epoch 5/25 +48000/48000 [==============================] - 3s 71us/step - loss: 0.0451 - acc: 0.9858 - val_loss: 0.0364 - val_acc: 0.9896 +Epoch 6/25 +48000/48000 [==============================] - 4s 74us/step - loss: 0.0370 - acc: 0.9888 - val_loss: 0.0333 - val_acc: 0.9908 +Epoch 7/25 +48000/48000 [==============================] - 4s 73us/step - loss: 0.0317 - acc: 0.9896 - val_loss: 0.0367 - val_acc: 0.9892 +Epoch 8/25 +48000/48000 [==============================] - 4s 74us/step - loss: 0.0283 - acc: 0.9911 - val_loss: 0.0327 - val_acc: 0.9904 +Epoch 9/25 +48000/48000 [==============================] - 4s 76us/step - loss: 0.0255 - acc: 0.9912 - val_loss: 0.0345 - val_acc: 0.9902 +Epoch 10/25 +48000/48000 [==============================] - 4s 76us/step - loss: 0.0215 - acc: 0.9930 - val_loss: 0.0290 - val_acc: 0.9929 +Epoch 11/25 +48000/48000 [==============================] - 4s 76us/step - loss: 0.0202 - acc: 0.9934 - val_loss: 0.0324 - val_acc: 0.9913 +Epoch 12/25 +48000/48000 [==============================] - 4s 77us/step - loss: 0.0198 - acc: 0.9935 - val_loss: 0.0298 - val_acc: 0.9919 +Epoch 13/25 +48000/48000 [==============================] - 5s 107us/step - loss: 0.0173 - acc: 0.9942 - val_loss: 0.0326 - val_acc: 0.9916 +Epoch 14/25 +48000/48000 [==============================] - 4s 79us/step - loss: 0.0148 - acc: 0.9947 - val_loss: 0.0319 - val_acc: 0.9910 +Epoch 15/25 +48000/48000 [==============================] - 4s 79us/step - loss: 0.0127 - acc: 0.9955 - val_loss: 0.0316 - val_acc: 0.9917 +Epoch 16/25 +48000/48000 [==============================] - 4s 85us/step - loss: 0.0135 - acc: 0.9954 - val_loss: 0.0347 - val_acc: 0.9907 +Epoch 17/25 +48000/48000 [==============================] - 4s 85us/step - loss: 0.0124 - acc: 0.9959 - val_loss: 0.0297 - val_acc: 0.9919 +Epoch 18/25 +48000/48000 [==============================] - 4s 85us/step - loss: 0.0118 - acc: 0.9957 - val_loss: 0.0306 - val_acc: 0.9917 +Epoch 19/25 +48000/48000 [==============================] - 4s 84us/step - loss: 0.0112 - acc: 0.9960 - val_loss: 0.0303 - val_acc: 0.9924 +Epoch 20/25 +48000/48000 [==============================] - 4s 84us/step - loss: 0.0094 - acc: 0.9968 - val_loss: 0.0281 - val_acc: 0.9924 +Epoch 21/25 +48000/48000 [==============================] - 4s 85us/step - loss: 0.0098 - acc: 0.9966 - val_loss: 0.0306 - val_acc: 0.9923 +Epoch 22/25 +48000/48000 [==============================] - 4s 84us/step - loss: 0.0094 - acc: 0.9967 - val_loss: 0.0320 - val_acc: 0.9921 +Epoch 23/25 +48000/48000 [==============================] - 4s 85us/step - loss: 0.0068 - acc: 0.9979 - val_loss: 0.0347 - val_acc: 0.9917 +Epoch 24/25 
+48000/48000 [==============================] - 5s 100us/step - loss: 0.0074 - acc: 0.9974 - val_loss: 0.0347 - val_acc: 0.9916 +Epoch 25/25 +48000/48000 [==============================] - 4s 85us/step - loss: 0.0072 - acc: 0.9975 - val_loss: 0.0319 - val_acc: 0.9925 + +Test loss: 0.02579820747410522 / Test accuracy: 0.9926 +``` + +In 25 epochs, the model has achieved a _validation accuracy_ of approximately 99.3%. That's great: in most of the cases, the model was successful in predicting the number that was input to the network. What's even better is that it shows similar performance for the _generalization test_ executed near the end, with the test data: similarly, test accuracy is 99.3%. Model loss is even better than _during_ training! + +* * * + +## Summary + +In this blog, we've seen how to create a Convolutional Neural Network classifier for image-like data. We introduced the concepts of classifiers and CNNs and built one in Keras, harnessing the MNIST numbers dataset for reasons of simplicity. We explained the design considerations that we made as well. + +If you're interested in the code, you might also take a look at [GitHub](https://github.com/christianversloot/keras-cnn). + +I really hope that this blog post has helped you in understanding the concepts of CNNs and CNNs in Keras. If you have any questions, remarks, comments whatsoever, please feel free to leave a comment below. + +Happy engineering! 😊 + +* * * + +## References + +Chollet, F. (2017). _Deep Learning with Python_. New York, NY: Manning Publications. + +MNIST handwritten digit database, Yann LeCun, Corinna Cortes and Chris Burges. (n.d.). Retrieved from [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/) + +Keras. (n.d.). Core Layers. Retrieved from [https://keras.io/layers/core/](https://keras.io/layers/core/) + +When should I use tf.float32 vs tf.float64 in TensorFlow? (n.d.). Retrieved from [https://www.quora.com/When-should-I-use-tf-float32-vs-tf-float64-in-TensorFlow](https://www.quora.com/When-should-I-use-tf-float32-vs-tf-float64-in-TensorFlow) + +MachineCurve. (2019, July 27). How to create a basic MLP classifier with the Keras Sequential API – MachineCurve. Retrieved from [https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/](https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/) diff --git a/how-to-create-a-confusion-matrix-with-scikit-learn.md b/how-to-create-a-confusion-matrix-with-scikit-learn.md new file mode 100644 index 0000000..c028884 --- /dev/null +++ b/how-to-create-a-confusion-matrix-with-scikit-learn.md @@ -0,0 +1,368 @@ +--- +title: "How to create a confusion matrix with Scikit-learn?" +date: "2020-05-05" +categories: + - "frameworks" +tags: + - "confusion-matrix" + - "machine-learning" + - "model-evaluation" + - "support-vector-machine" + - "visualization" +--- + +After training a supervised machine learning model such as a classifier, you would like to know how well it works. + +This is often done by setting apart a small piece of your data called the **test set**, which is used as data that the model has never seen before. + +If it performs well on this dataset, it is likely that the model performs well on other data too - if it is sampled from the same distribution as your test set, of course. 
+ +Now, when you test your model, you feed it the data - and compare the predictions with the ground truth, measuring the number of true positives, true negatives, false positives and false negatives. These can subsequently be visualized in a visually appealing **confusion matrix**. + +In today's blog post, we'll show you how to create such a confusion matrix with Scikit-learn, one of the most widely used frameworks for machine learning in today's ML community. By means of an example created with Python, we'll show you step-by-step how to generate a matrix with which you can visually determine the performance of your model easily. + +All right, let's go! :) + +* * * + +\[toc\] + +* * * + +## A confusion matrix in more detail + +Training your machine learning model involves its evaluation. In many cases, you have set apart a test set for this. + +The test set is a dataset that the trained model has never seen before. Using it allows you to test whether the model has overfit, or adapted to the training data too well, or whether it still generalizes to new data. + +This allows you to ensure that your model does not perform very poorly on new data while it still performs really good on the training set. That wouldn't really work in practice, would it :) + +Evaluation with a test set often happens by feeding all the samples to the model, generating a prediction. Subsequently, the predictions are compared with the _ground truth_ - or the true targets corresponding to the test set. These can subsequently be used for computing various metrics. + +But they can also be used to demonstrate model performance in a visual way. + +Here is an example of a confusion matrix: + +[![](images/cf_matrix.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/cf_matrix.png) + +To be more precise, it is a _normalized_ confusion matrix. Its axes describe two measures: + +- The **true labels**, which are the ground truth represented by your test set. +- The **predicted labels**, which are the predictions generated by the machine learning model for the features corresponding to the true labels. + +It allows you to easily compare how well your model performs. For example, in the model above, for all true labels 1, the predicted label is 1. This means that all samples from class 1 were classified correctly. Great! + +For the other classes, performance is also good, but a little bit worse. As you can see, for class 2, some samples were predicted as being part of classes 0 and 1. + +In short, it answers the question "For my true labels / ground truth, how well does the model predict?". + +It's also possible to start from a prediction point of view. In this case, the question would change to "For my predicted label, how many predictions are actually part of the predicted class?". It's the opposite point of view, but could be a valid question in many machine learning cases. + +Most preferably, the entire set of true labels is equal to the set of predicted labels. In those cases, you would see zeros everywhere except for the line from the top left to the bottom right. In practice, however, this does not happen often. 
Likely, the plot is much more scattered, like this SVM classifier where many support vectors are necessary to draw [a decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) that _does not work perfectly, but adequately enough:_

- [![](images/likethis.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/likethis.png)

- [![](images/likethis2.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/likethis2.png)

- [![](images/likekthis3.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/likekthis3.png)


* * *

## Creating a confusion matrix with Python and Scikit-learn

Let's now see if we can create a confusion matrix ourselves. We will be using Python and Scikit-learn, one of the most widely used frameworks for machine learning today.

Creating a confusion matrix involves various steps:

1. **Generating an example dataset.** This one makes sense: we need data to train our model on. We'll therefore be generating data first, so that we can make an adequate choice for a ML model class next.
2. **Picking a machine learning model class.** Obviously, if we want to evaluate a model, we need to train a model. We'll choose a particular type of model first that fits the characteristics of our data.
3. **Constructing and training the ML model.** The consequence of the first two steps is that we end up with a trained model.
4. **Generating the confusion matrix.** Finally, based on the trained model, we can create our confusion matrix.

### Software dependencies you need to install

Very briefly, but importantly: if you wish to run this code, you must make sure that you have certain software dependencies installed. Here they are:

- You need to install **Python**, which is the platform that our code runs on, version 3.6+.
- You need to install **Scikit-learn**, the machine learning framework that we will be using today: `pip install -U scikit-learn`.
- You need to install **Numpy** for numbers processing: `pip install numpy`.
- You need to install **Matplotlib** for visualizing the plots: `pip install matplotlib`.
- Finally, if you wish to generate a plot of decision boundaries (not required), you also need to install **Mlxtend:** `pip install mlxtend`.

\[affiliatebox\]

### Generating an example dataset

The first step is generating an example dataset. We will be using Scikit-learn for this purpose too. First, create a file called `confusion-matrix.py`, and open it in a code editor. The first thing we do is add the imports:

```
# Imports
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
```

The `make_blobs` function from Scikit-learn allows us to generate 'blobs', or clusters, of samples. Those blobs are centered around some point, and the samples are scattered around this point based on some standard deviation. This gives you flexibility regarding both the position and the structure of your generated dataset, in turn allowing you to experiment with a variety of ML models without having to worry about the data.

As we will evaluate the model, we need to ensure that the dataset is split between training and testing data. Scikit-learn also allows us to do this, with `train_test_split`. We therefore import that one too.
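Small aside: later on, we will use Scikit-learn's plotting helper to draw the matrix, but if you ever just need the raw counts, `sklearn.metrics.confusion_matrix` computes them directly from true and predicted labels. A minimal, self-contained sketch (the labels below are made up purely for illustration):

```
# Aside: computing raw confusion matrix counts without plotting
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(confusion_matrix(y_true, y_pred))
# [[1 1 0]
#  [0 2 0]
#  [1 0 1]]
```

Rows represent the true labels and columns the predicted labels, just like in the plots that we will generate below.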
+
+#### Configuration options
+
+Next, we can define a number of configuration options:
+
+```
+# Configuration options
+blobs_random_seed = 42
+centers = [(0,0), (5,5), (0,5), (2,3)]
+cluster_std = 1.3
+frac_test_split = 0.33
+num_features_for_samples = 4
+num_samples_total = 5000
+```
+
+The **random seed** describes the initialization of the pseudo-random number generator - in our case, the one used by `train_test_split` for splitting the data. As you may know, no random number generator is truly random, and each one is initialized differently. Configuring a fixed seed ensures that every time you run the script, this part of the process behaves in the same way. If weird behavior occurs, you know that it's likely not the random number generator.
+
+The **centers** describe the centers in two-dimensional space of our blobs of data. As you can see, we have 4 blobs today.
+
+The **cluster standard deviation** describes the standard deviation with which a sample is drawn from the sampling distribution used by the random point generator. We set it to 1.3; a lower number produces clusters that are better separable, and vice-versa.
+
+The **fraction of the train/test split** determines how much data is split off for testing purposes. In our case, that's 33% of the data.
+
+The **number of features for our samples** is passed as `n_features`. Note that this describes the dimensionality of each sample, not the number of classes. Because we provide explicit two-dimensional `centers`, Scikit-learn derives the dimensionality from those centers, so our samples will have two features regardless of this setting.
+
+Finally, the **number of samples generated** is pretty self-explanatory. We set it to 5000 samples. That's not too much data, but more than sufficient for the educational purposes of today's blog post.
+
+#### Generating the data
+
+Next up is the call to `make_blobs` and to `train_test_split` for actually generating and splitting the data:
+
+```
+# Generate data
+inputs, targets = make_blobs(n_samples = num_samples_total, centers = centers, n_features = num_features_for_samples, cluster_std = cluster_std)
+X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=frac_test_split, random_state=blobs_random_seed)
+```
+
+#### Saving the data (optional)
+
+Once the data is generated, you may choose to save it to file. This is an optional step - and I include it because I want to re-use the same dataset every time I run the script (e.g. because I am tweaking a visualization). If you use the code below, you can run the script once - the data is then saved in the `.npy` file. When you subsequently comment out the `np.save` call, and possibly also the data generation calls, you'll always load the same data from file.
+
+Then, you can tweak away at your visualization easily without having to deal with new data all the time :)
+
+```
+# Save and load temporarily
+np.save('./data_cf.npy', (X_train, X_test, y_train, y_test))
+X_train, X_test, y_train, y_test = np.load('./data_cf.npy', allow_pickle=True)
+```
+
+Should you wish to visualize the data, this is of course possible:
+
+```
+# Generate scatter plot for training data
+plt.scatter(X_train[:,0], X_train[:,1])
+plt.title('Linearly separable data')
+plt.xlabel('X1')
+plt.ylabel('X2')
+plt.show()
+```
+
+### Picking a machine learning model class
+
+Now that we have our code for generating the dataset, we can take a look at the output to determine what kind of model we could use:
+
+![](images/rcf_data.png)
+
+I can derive a few characteristics from this dataset (which, obviously, I also built in up front ;-) ).
+
+First of all, the number of features is low: only two - as our data is two-dimensional.
This is good, because then we likely don't face the curse of dimensionality, and a wider range of ML models is applicable. + +Next, when inspecting the data from a closer point of view, I can see a gap between what seem to be blobs of data (it is also slightly visible in the diagram above): + +[![](images/possibly_separable.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/possibly_separable.png) + +This suggests that the data may be separable, and possibly even linearly so (yes, of course, I know this is the case ;-) ). + +Third, and finally, the number of samples is relatively low: only 5.000 samples are present. Neural networks with their relatively large amount of trainable parameters would likely start overfitting relatively quickly, so they wouldn't be my preferable choice. + +However, traditional machine learning techniques to the rescue. A [Support Vector Machine](https://www.machinecurve.com/index.php/2020/05/03/creating-a-simple-binary-svm-classifier-with-python-and-scikit-learn/), which attempts to construct a decision boundary between separable blobs of data, can be a good candidate here. Let's give it a try: we're going to construct and train an SVM and see how well it performs through its confusion matrix. + +### Constructing and training the ML model + +As we have seen in the post linked above, we can also use Scikit-learn to construct and train a SVM classifier. Let's do so next. + +#### Model imports + +First, we'll have to add a few extra imports to the top of our script: + +``` +from sklearn import svm +from sklearn.metrics import plot_confusion_matrix +from mlxtend.plotting import plot_decision_regions +``` + +(The Mlxtend one is optional, as we discussed at 'what you need to install', but could be useful if you wish to [visualize the decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) later.) + +#### Training the classifier + +First, we initialize the SVM classifier. I'm using a `linear` kernel because I suspect (actually, I'm confident, as we constructed the data ourselves) that the data is linearly separable: + +``` +# Initialize SVM classifier +clf = svm.SVC(kernel='linear') +``` + +Then, we fit the training data - starting the training process: + +``` +# Fit data +clf = clf.fit(X_train, y_train) +``` + +That's it for training the machine learning model! The classifier variable, or `clf`, now contains a reference to the trained classifier. By calling `clf.predict`, you can now generate predictions for new data. + +### Generating the confusion matrix + +But let's take a look at generating that confusion matrix now. As we discussed, it's part of the evaluation step, and we use it to visualize its predictive and generalization power on the _test set_. + +Recall that we compare the predictions generated during evaluation with the ground truth available for those inputs. + +The `plot_confusion_matrix` call takes care of this for us, and we simply have to provide it the classifier (`clf`), the test set (`X_test` and `y_test`), a color map and whether to normalize the data. + +``` +# Generate confusion matrix +matrix = plot_confusion_matrix(clf, X_test, y_test, + cmap=plt.cm.Blues, + normalize='true') +plt.title('Confusion matrix for our classifier') +plt.show(matrix) +plt.show() +``` + +![](images/rcf_matrix.png) + +Normalization, here, involves converting back the data into the \[0, 1\] format above. 
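A small version note: `plot_confusion_matrix` has since been deprecated and was removed in newer Scikit-learn releases (1.2 and later). If your version no longer has it, `ConfusionMatrixDisplay.from_estimator` is the equivalent call - a sketch, assuming the same fitted `clf` and test data as above:
+
+```
+# Equivalent plot on newer Scikit-learn versions, where plot_confusion_matrix is gone
+from sklearn.metrics import ConfusionMatrixDisplay
+
+ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test,
+                                      cmap=plt.cm.Blues,
+                                      normalize='true')
+plt.title('Confusion matrix for our classifier')
+plt.show()
+```
+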
If you leave out normalization, you get the number of samples that are part of that prediction: + +[![](images/samples-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/samples-1.png) + +Here are some other visualizations that help us explain the confusion matrix (for the [boundary plot](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/), you need to install Mlxtend with `pip install mlxtend`): + +- ![](images/rcf_boundary.png) + +- ![](images/rcf_sup.png) + + +``` +# Get support vectors +support_vectors = clf.support_vectors_ + +# Visualize support vectors +plt.scatter(X_train[:,0], X_train[:,1]) +plt.scatter(support_vectors[:,0], support_vectors[:,1], color='red') +plt.title('Linearly separable data with support vectors') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Plot decision boundary +plot_decision_regions(X_test, y_test, clf=clf, legend=2) +plt.show() +``` + +It's clear that we need many support vectors (the red samples) to generate the decision boundary. Given the relative _unclarity_ of the separability between the data points, this is not unexpected. I'm actually quite satisfied with the performance of the model, as demonstrated by the confusion matrix (relatively blue diagonal line). + +The only class that underperforms is class 3, with a score of 0.68. It's still acceptable, but is lower than preferred. This can be explained by looking at the class in the [decision boundary plot](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/). Here, it's clear that it's the middle class - the reds. As those samples are surrounded by the other ones, it's clear that the model has had significant difficulty generating the decision boundary. We might for example counter this by using a different kernel function which takes this into account, ensuring better separability. However, that's not the core of today's post. + +### Full model code + +Should you wish to obtain the full model code, that's of course possible. 
Here you go :) + +``` +# Imports +from sklearn.datasets import make_blobs +from sklearn.model_selection import train_test_split +import numpy as np +import matplotlib.pyplot as plt +from sklearn import svm +from sklearn.metrics import plot_confusion_matrix +from mlxtend.plotting import plot_decision_regions + +# Configuration options +blobs_random_seed = 42 +centers = [(0,0), (5,5), (0,5), (2,3)] +cluster_std = 1.3 +frac_test_split = 0.33 +num_features_for_samples = 4 +num_samples_total = 5000 + +# Generate data +inputs, targets = make_blobs(n_samples = num_samples_total, centers = centers, n_features = num_features_for_samples, cluster_std = cluster_std) +X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=frac_test_split, random_state=blobs_random_seed) + +# Save and load temporarily +np.save('./data_cf.npy', (X_train, X_test, y_train, y_test)) +X_train, X_test, y_train, y_test = np.load('./data_cf.npy', allow_pickle=True) + +# Generate scatter plot for training data +plt.scatter(X_train[:,0], X_train[:,1]) +plt.title('Linearly separable data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Initialize SVM classifier +clf = svm.SVC(kernel='linear') + +# Fit data +clf = clf.fit(X_train, y_train) + +# Generate confusion matrix +matrix = plot_confusion_matrix(clf, X_test, y_test, + cmap=plt.cm.Blues) +plt.title('Confusion matrix for our classifier') +plt.show(matrix) +plt.show() + +# Get support vectors +support_vectors = clf.support_vectors_ + +# Visualize support vectors +plt.scatter(X_train[:,0], X_train[:,1]) +plt.scatter(support_vectors[:,0], support_vectors[:,1], color='red') +plt.title('Linearly separable data with support vectors') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Plot decision boundary +plot_decision_regions(X_test, y_test, clf=clf, legend=2) +plt.show() +``` + +\[affiliatebox\] + +## Summary + +That's it for today! In this blog post, we created a confusion matrix with Python and Scikit-learn. After studying what a confusion matrix is, and how it displays true positives, true negatives, false positives and false negatives, we gave a step-by-step example for creating one yourself. + +The example included generating a dataset, picking a suitable machine learning model for the dataset, constructing, configuring and training it, and finally interpreting the results i.e. the confusion matrix. This way, you should be able to understand what is happening and why I made certain choices. + +I hope you've learnt something from today's blog post! :) If you did, I would really appreciate it if you left a comment in the comments section 💬 Please do the same if you have questions or remarks. I'll happily answer and improve my blog post where necessary. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +\[scikitbox\] + +* * * + +## References + +Raschka, S. (n.d.). _Home - mlxtend_. Site not found · GitHub Pages. [https://rasbt.github.io/mlxtend/](https://rasbt.github.io/mlxtend/) + +_Scikit-learn_. (n.d.). scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved May 3, 2020, from [https://scikit-learn.org/stable/index.html](https://scikit-learn.org/stable/index.html) + +Scikit-learn. (n.d.). _1.4. Support vector machines — scikit-learn 0.22.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. 
Retrieved May 3, 2020, from [https://scikit-learn.org/stable/modules/svm.html#classification](https://scikit-learn.org/stable/modules/svm.html#classification) + +Scikit-learn. (n.d.). _Confusion matrix — scikit-learn 0.22.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved May 5, 2020, from [https://scikit-learn.org/stable/auto\_examples/model\_selection/plot\_confusion\_matrix.html](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html) + +Scikit-learn. (n.d.). _Sklearn.metrics.plot\_confusion\_matrix — scikit-learn 0.22.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved May 5, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot\_confusion\_matrix.html#sklearn.metrics.plot\_confusion\_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html#sklearn.metrics.plot_confusion_matrix) diff --git a/how-to-create-a-multilabel-svm-classifier-with-scikit-learn.md b/how-to-create-a-multilabel-svm-classifier-with-scikit-learn.md new file mode 100644 index 0000000..0ab91e3 --- /dev/null +++ b/how-to-create-a-multilabel-svm-classifier-with-scikit-learn.md @@ -0,0 +1,196 @@ +--- +title: "How to create a Multilabel SVM classifier with Scikit-learn" +date: "2020-11-12" +categories: + - "frameworks" + - "svms" +tags: + - "classification" + - "confusion-matrix" + - "multilabel-classification" + - "scikit-learn" + - "support-vector-machine" + - "svm" +--- + +Classification comes in many flavors. For example, if you need to categorize your input samples into one out of two classes, you are dealing with a binary classification problem. Is the number of classes > 2, the problem is a multiclass one. But now, what if you won't classify your input sample into _one_ out of many classes, but rather into _some_ of the many classes? + +That would be a multilabel classification problem and we're going to cover it from a Support Vector Machine perspective in this article. + +Support Vector Machines can be used for building classifiers. They are natively equipped to perform binary classification tasks. However, they cannot perform multiclass and multilabel classification natively. Fortunately, there are techniques out there with which this becomes possible. How the latter - multilabel classification - can work with an SVM is what you will see in this article. It is structured as follows. + +Firstly, we'll take a look at multilabel classification in general. What is it? What can it be used for? And how is it different from multi_class_ classification? This is followed by looking at multilabel classification with Support Vector Machines. In particular, we will look at why multilabel classification is not possible natively. Fortunately, the Scikit-learn library for machine learning provides a `MultiOutputClassifier` module, with which it _is_ possible to create a multilabel SVM! We cover implementing one with Scikit-learn and Python step by step in the final part of this article. + +Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## What is multilabel classification? + +Imagine that you're an employee working in a factory. Your task is to monitor a conveyor belt which is forwarding two types of objects: a yellow rotated-and-square-shaped block and a blue, circular one. When an object is near the end of the conveyor belt, you must label it with two types of labels: its _color_ and its _shape_. 
+ +In other words, the labels yellow and square are attached to the yellow squares, while blue and circular end up with the blue circles. + +This is a human-powered **multilabel classifier**. Human beings inspect objects, attach \[latex\]N\[/latex\] labels to them (here \[latex\]N = 2\[/latex\]), and pass them on - possibly into a bucket or onto another conveyor belt for packaging. So far, so good. + +![](images/whatisclassification6.png) + +Human beings can however be quite a bottleneck in such a process. Because it is so repetitive, it can become boring, and if humans don't like something, it's to be bored at work. In addition, the work is very continuous and hence tiring, increasing the odds of human error. In other words, wouldn't it be a good idea to replace the human being with a machine here? The result would be a reduction in error rates while humans might be happier, doing more creative work. + +That's where Machine Learning comes into play. If we can learn to distinguish the yellow objects from the blue ones, we can build an automated system that attaches the labels for us. Since machines never get tired and work with what they have learnt from observations, they could potentially be a good replacement in our conveyor belt scenario. + +There are many algorithms with which multilabel classification can be implemented. Neural Networks also belong to that category and are very popular these days. However, another class of algorithms with which a multilabel classifier can be created is that of Support Vector Machines. Let's now take a look at what SVMs are, how they work, and how we can create a multilabel classifier with them. + +* * * + +## Multilabel classification with Support Vector Machines + +If we want to build a multilabel classifier with Support Vector Machines, we must first know how they work. For this reason, we will now take a brief look at what SVMs are conceptually and how they work. In addition, we'll provide some brief insight into why a SVM cannot be used for multilabel classification _natively_. This provides the necessary context for understanding how we _can make it work_ regardless, and you will understand the technique and the need for it better. + +Let's now cut to the chase. + +A [Support Vector Machine](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) is a class of Machine Learning algorithms which uses _kernel functions_ to learn a decision boundary between two classes (or learn a function for regression, should you be doing that). This decision boundary is of _maximum margin_ between the two classes, meaning that it is _equidistant_ from classes one and two. In the figure below, that would be the class of black items and the class of white ones. In addition, determining the boundary (which is called a _hyperplane_) is performed by means of _support vectors_. + +All right, that's quite a lot of complexity, so let's break it apart into plainer English. + +In the figure below, you can see three decision boundaries \[latex\]H\_1\[/latex\], \[latex\]H\_2\[/latex\] and \[latex\]H\_3\[/latex\]. These decision boundaries are also called hyperplanes because they are `N-1` dimensional compared to the feature space itself. In other words, in the figure below, we have a two-dimensional feature space (axes \[latex\]X\_1\[/latex\] and \[latex\]X\_2\[/latex\]) and have three one-dimensional lines (i.e. 
hyperplanes) that serve as candidate decision boundaries: indeed, \[latex\]H\_1\[/latex\], \[latex\]H\_2\[/latex\] and \[latex\]H\_3\[/latex\].
+
+\[latex\]H\_1\[/latex\] is not a decision boundary at all, because it cannot distinguish between the classes. The other two _are_ decision boundaries, because they can successfully be used to separate the classes from each other. But which is best? Obviously, that's \[latex\]H\_3\[/latex\], even intuitively. But why is that the case? Let's look at the decision boundary in more detail.
+
+If you look at the line more closely, you can see that it is precisely in the middle of the area between the samples from each class _that are closest to each other_. These samples are called the _support vectors_, and hence the name _Support Vector_ Machine. They effectively support the algorithm in learning the decision boundary. Now, recall that the line is precisely in the middle of the area in between those support vectors. This means that the line is _equidistant_ to the two classes: on both sides, the distance to the nearest samples is the same. This in turn means that our decision boundary is of _maximum margin_ - it has the highest margin between the classes and is hence the best decision boundary that can be found.
+
+![](images/Svm_separating_hyperplanes_SVG.svg_-1024x886.png)
+
+Hyperplanes and data points. The [image](https://en.wikipedia.org/wiki/Support-vector_machine#/media/File:Svm_separating_hyperplanes_(SVG).svg) is not edited. Author: [Zack Weinberg](https://commons.wikimedia.org/w/index.php?title=User:ZackWeinberg&action=edit&redlink=1), derived from [Cyc’s](https://commons.wikimedia.org/w/index.php?title=User:Cyc&action=edit&redlink=1) work. License: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/legalcode)
+
+### Why SVMs can't perform multiclass and multilabel classification natively
+
+An unfortunate consequence of the way that SVMs learn their decision boundary is that they cannot be used for [multilabel or multiclass classification](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/) natively. The reason why is simple: for a decision boundary to be a decision boundary in an SVM, the hyperplane (in our two-dimensional feature space, that's a line) must be _equidistant_ from the classes in order to ensure _maximum margin_.
+
+We can see that if we were to add another class, generating a multiclass classification scenario, this would no longer be the case: at most, we can guarantee equidistance between two of the classes - discarding this property with respect to all other classes. The way an SVM works thus means that it cannot be used for multiclass classification natively, but fortunately there are many approaches (such as [One-vs-One/One-vs-Rest](https://www.machinecurve.com/index.php/2020/11/11/creating-one-vs-rest-and-one-vs-one-svm-classifiers-with-scikit-learn/)) which can be used. [Error-Correcting Output Codes](https://www.machinecurve.com/index.php/2020/11/12/using-error-correcting-output-codes-for-multiclass-svm-classification/) are another means for generating a multiclass SVM classifier.
+
+The other case would be multilabel classification. Here, we don't assign one out of multiple classes to the input sample; rather, we assign _multiple_ classes to it. The number of classes assigned can in theory be equal to the total number of classes available, but often this is not the case. Now let's take a look at assigning multiple labels with an SVM.
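+
+To make this limitation concrete, here is a quick, standalone sketch showing that a plain Scikit-learn SVM refuses a multilabel target out of the box - fitting it on a two-column `y` raises an error:
+
+```
+# Sketch: a plain SVM does not accept a multilabel target natively.
+import numpy as np
+from sklearn.svm import LinearSVC
+
+# Four toy samples with two features each...
+X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
+# ...and two labels per sample (e.g. 'type' and 'color')
+y = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
+
+try:
+    LinearSVC().fit(X, y)
+except ValueError as error:
+    print('Native multilabel fit fails:', error)
+```
+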
The SVM is quite rigid - it has a [relatively high bias](https://www.machinecurve.com/index.php/2020/11/02/machine-learning-error-bias-variance-and-irreducible-error-with-python/) - in terms of the function that is learned: one line separating two classes from each other. There is simply no way that multiple labels per sample can be learned with that single boundary. This is why, next to multiclass classification, multilabel classification cannot be performed natively with SVMs.
+
+### Using a trick for creating a multilabel SVM classifier
+
+As usual, people have found workarounds for creating a multilabel classifier with SVMs. The answer lies in the fact that a multilabel classification problem, which effectively involves assigning multiple labels to an instance, can be converted into multiple single-label classification problems - one per label. While this increases the computational complexity of your Machine Learning problem, it _is_ thus possible to create a multilabel SVM based classifier.
+
+Since manually splitting the problem into many classification problems would be a bit cumbersome, we will now take a look at how we can implement multilabel classification with Scikit-learn.
+
+* * *
+
+## Implementing a MultiOutputClassifier SVM with Scikit-learn
+
+Scikit-learn provides the `MultiOutputClassifier` functionality, which wraps any regular classifier into a multilabel one by fitting one classifier per label. For this reason, it will also work with an SVM. Let's first generate two blobs of data which represent the `classes`, or the 'type' from the assembly line scenario above:
+
+```
+import numpy as np
+from sklearn.datasets import make_blobs
+
+# Configuration options
+num_samples_total = 10000
+cluster_centers = [(5,5), (3,3)]
+num_classes = len(cluster_centers)
+
+# Generate data
+X, classes = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.30)
+colors = np.random.randint(0, 2, size=len(classes))
+```
+
+This looks as follows - two blobs of data, each belonging to one class. Do note that we also create `colors`, which is an array of the same shape as the `classes` array. It is filled randomly for the sake of simplicity. This array contains the second label (color) that we will be using in this multilabel classification setting.
+
+![](images/classes.png)
+
+We can now use Scikit-learn to generate a multilabel SVM classifier. Here, we assume that our data is linearly separable. For the `classes` array, we will see that this is the case. For the `colors` array, this is not necessarily true, since we generate it randomly. For this reason, you might wish to look for a particular _[kernel function](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/#what-if-data-is-not-linearly-separable-kernels)_ that makes the data separable with a linear decision boundary if you were to use this code in a production setting. Always ensure that your data is, or can become, linearly separable before using SVMs!
+
+- First of all, we ensure that all our dependencies are imported. We import the `pyplot` API from Matplotlib for visualizing our results. Numpy is used for some numbers processing, and we import some `sklearn` dependencies as well. More specifically, we use `make_blobs` for data generation, `MultiOutputClassifier` for the multilabel classifier, `LinearSVC` for the (linear!)
SVM, `train_test_split` for splitting the data into a training and testing set, and finally `multilabel_confusion_matrix` and `ConfusionMatrixDisplay` for generating and visualizing [a confusion matrix](https://www.machinecurve.com/index.php/2020/05/05/how-to-create-a-confusion-matrix-with-scikit-learn/). +- We then specify some configuration options, such as the number of samples to generate, the cluster centers, and the number of classes. We can see here that we define two centers, and hence have two classes for the first label. +- We then generate the data with the spec we provided in the previous bullet point. In addition, we create an array of the same shape for the second label - `colors`. We initialize it randomly for the sake of simplicity. While linearity is guaranteed for the first label, we might not find it for the second due to this reason! +- We then combine the training labels into one array so that we can generate a split between training and testing data. This is what we do directly afterwards. +- Then, we initialize the SVM classifier and turn it into a multilabel one. The `n_jobs=-1` attribute indicates that all available processor functionality can be used for learning the classifiers. +- We then `.fit` the data to the classifier, meaning that we start the training process. After fitting is complete, the trained classifier is available in `multilabel_classifier`. We can then call `.predict` to generate predictions for our testing data. +- Comparing the `y_test` (actual ground truth labels) and `y_test_pred` (predicted labels) can be done by means of a confusion matrix (follows directly after the code segment). We can create a confusion matrix for each label with `multilabel_confusion_matrix`, and then plot it with `ConfusionMatrixDisplay` using Matplotlib. + +That's it - we have now created a multilabel Support Vector Machine! Now, ensure that `sklearn`, `matplotlib` and `numpy` are installed onto your system / into your environment, and run the code. 
+ +``` +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from sklearn.multioutput import MultiOutputClassifier +from sklearn.svm import LinearSVC +from sklearn.model_selection import train_test_split +from sklearn.metrics import multilabel_confusion_matrix, ConfusionMatrixDisplay + +# Configuration options +num_samples_total = 10000 +cluster_centers = [(5,5), (3,3)] +num_classes = len(cluster_centers) + +# Generate data +X, classes = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.30) +colors = np.random.randint(0, 2, size=len(classes)) + +# Combine training labels +y = np.vstack((classes, colors)).T + +# Split into training and testing data +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) + +# Create the SVM +svm = LinearSVC(random_state=42) + +# Make it an Multilabel classifier +multilabel_classifier = MultiOutputClassifier(svm, n_jobs=-1) + +# Fit the data to the Multilabel classifier +multilabel_classifier = multilabel_classifier.fit(X_train, y_train) + +# Get predictions for test data +y_test_pred = multilabel_classifier.predict(X_test) + +# Generate multiclass confusion matrices +matrices = multilabel_confusion_matrix(y_test, y_test_pred) + +# Plotting matrices: code +cmd = ConfusionMatrixDisplay(matrices[0], display_labels=np.unique(y_test)).plot() +plt.title('Confusion Matrix for label 1 (type)') +plt.show() +cmd = ConfusionMatrixDisplay(matrices[1], display_labels=np.unique(y_test)).plot() +plt.title('Confusion Matrix for label 2 (color)') +plt.show() +``` + +You'll then get two popups with confusion matrices: + +- ![](images/mlabel_2.png) + +- ![](images/mlabel_1.png) + + +We can clearly see that our initial estimations with regards to the dataset were true. For the linearly separable label (i.e. the `classes` label), our Confusion Matrix illustrates perfect behavior - with no wrong predictions. For the `colors` label (which was randomly generated based on the `classes` label) we see worse performance: this label is predicted right in only 50% of the cases. Now, this is of course due to the fact that this label was generated randomly. If, say, we added colors based on the class, we would also see good performance here. + +A next step a ML engineer would undertake now is finding out how to make the data for the second label linearly separable by means of a kernel function. That's however outside the scope of this article. We did manage to create a multilabel SVM though! :) + +* * * + +## Summary + +In this article, we looked at creating a multilabel Support Vector Machine with Scikit-learn. Firstly, we looked at what multilabel classification is and how it is different than multiclass and binary classification. More specifically, a multilabel classifier assigns multiple labels to an input sample, e.g. the labels color and type if we are looking at an assembly line scenario. This is contrary to the multiclass and binary classifiers which assign just one class to an input sample. + +Then, we looked at Support Vector Machines work in particular and why their internals are at odds with how multilabel classification works. Fortunately, people have sought to fix this, and we thus continued with making it work. More specifically, we used Scikit-learn's `MultiOutputClassifier` for wrapping the SVM into a situation where multiple classifiers are generated that together predict the labels. 
By means of a confusion matrix, we then inspected the performance of our model, and provided insight in what to do when a confusion matrix does not show adequate performance. + +I hope that you have learned something from this article! If you did, I would be happy to hear from you, so please feel free to leave a comment in the comments section below 💬 If you have other remarks or suggestions, please leave a message as well. I'd love to hear from you! Anyway, thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2005, February 21). _Equidistant_. Wikipedia, the free encyclopedia. Retrieved November 11, 2020, from [https://en.wikipedia.org/wiki/Equidistant](https://en.wikipedia.org/wiki/Equidistant) + +Scikit-learn. (n.d.). _1.12. Multiclass and multilabel algorithms — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved November 12, 2020, from [https://scikit-learn.org/stable/modules/multiclass.html#multioutput-classification](https://scikit-learn.org/stable/modules/multiclass.html#multioutput-classification) + +Scikit-learn. (n.d.). _Sklearn.multioutput.MultiOutputClassifier — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved November 12, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html](https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) diff --git a/how-to-create-a-neural-network-for-regression-with-pytorch.md b/how-to-create-a-neural-network-for-regression-with-pytorch.md new file mode 100644 index 0000000..11e0813 --- /dev/null +++ b/how-to-create-a-neural-network-for-regression-with-pytorch.md @@ -0,0 +1,443 @@ +--- +title: "How to create a neural network for regression with PyTorch" +date: "2021-07-20" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "mlp" + - "multilayer-perceptron" + - "neural-network" + - "pytorch" + - "regression" +--- + +In many examples of Deep Learning models, the model target is classification - or the assignment of a class to an input sample. However, there is another class of models too - that of regression - but we don't hear as much about regression compared to classification. + +Time to change that. Today, we're going to build a neural network for regression. We will be using the PyTorch deep learning library for that purpose. After reading this article, you will... + +- **Understand what regression is and how it is different from classification.** +- **Be able to build a Multilayer Perceptron based model for regression using PyTorch.** + +Are you ready? Let's take a look! + +* * * + +\[toc\] + +* * * + +## What is regression? + +Deep Learning models are systems of trainable components that can learn a _mappable function_. Such a function can be represented as \[latex\]\\textbf{x} \\rightarrow \\text{y}\[/latex\] at a high level, where some input \[latex\]\\textbf{x}\[/latex\] is mapped to an output \[latex\]\\text{y}\[/latex\]. + +Given the [universal approximation theorem](https://www.machinecurve.com/index.php/2019/07/18/can-neural-networks-approximate-mathematical-functions/), they should even be capable of approximating any mathematical function! 
The exact _mapping_ is learned through the high-level training process, in which example data that contains this mapping is fed through the model, after which the error is computed and propagated backwards, and the model is optimized.
+
+There is a wide variety of such mappings:
+
+- An image of a **cat** belongs to the class _cat_, whereas an image of a **dog** belongs to the class _dog_.
+- The bounding box drawn here **contains an object**, whereas another one does not.
+- And so forth.
+
+These are all examples of **classification**. They answer whether a particular _instance_ is present or not. Is that cat present? Yes or no. Is that dog present? Yes or no. Does it contain the object? Yes or no. You can think of such problems as assigning each input to one - or sometimes multiple - discrete bins.
+
+Regression involves the same mappable function, but the output is not a bin-like (i.e. a discrete) value. The mappable function \[latex\]\\textbf{x} \\rightarrow \\text{y}\[/latex\] still converts the input data \[latex\]\\textbf{x}\[/latex\] to an output \[latex\]\\text{y}\[/latex\], but instead of a discrete value, \[latex\]\\text{y}\[/latex\] is continuous.
+
+In other words, \[latex\]\\text{y}\[/latex\] can take any value that belongs to a particular range (for example, the real numbers). Values such as \[latex\]\\text{y} = 7.23\[/latex\] or \[latex\]\\text{y} = -12.77438\[/latex\] are perfectly normal. Learning a model that maps an input \[latex\]\\textbf{x}\[/latex\] to a continuous target variable is a process called **regression**. It is now easy to see why such models are quite frequently used to solve numeric problems - such as predicting the yield of a crop or the expected risk level in a financial model.
+
+* * *
+
+## Creating an MLP regression model with PyTorch
+
+In a different article, we already looked at building a [classification model](https://www.machinecurve.com/index.php/2021/01/26/creating-a-multilayer-perceptron-with-pytorch-and-lightning/) with PyTorch. Here, instead, you will learn to build a model for **regression**. We will be using the PyTorch deep learning library, which is one of the most frequently used libraries at the time of writing. Creating a regression model is actually really easy when you break down the process into smaller parts:
+
+1. Firstly, we will make sure that we **import all the dependencies** needed for today's code.
+2. Secondly, we're going to ensure that we have our **training data** available. This data, which is the [Boston Housing Dataset](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/boston_housing), provides a set of variables that may together make it possible to predict the house price (the target variable).
+3. Subsequently, the **neural network** will be created. This will be a **Multilayer Perceptron** based model, which is essentially a stack of layers containing neurons that can be [trained](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process).
+4. The training dataset, which is by now represented as a `torch.utils.data.Dataset`, will need to be used in the model. The fourth step is to ensure that the **dataset is prepared into a `DataLoader`**, which ensures that data is shuffled and batched appropriately.
+5. Then, we **pick a loss function and initialize it**. We also **init the model and the optimizer** (Adam).
+6. 
Finally, we create the **training loop**, which effectively contains the [high-level training process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) captured in code. + +Let's get to work! 👩‍💻 Create a file or Notebook, e.g. `regression-mlp.py`, and write along :) + +### Today's dataset + +The **[Boston House Prices Regression dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#boston-housing-price-regression-dataset)** contains 506 observations that relate certain characteristics with the price of houses (in $1000s) in Boston in some period. + +Some observations about this data (from [this article](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#boston-housing-price-regression-dataset)): + +> The minimum house price is $5000, while the maximum house price is $50.000. This may sound weird, but it’s not: house prices have risen over the decades, and the study that produced this data is from 1978 (Harrison & Rubinfeld, 1978). Actually, around 1978 prices of ≈$50.000 were quite the median value, so this dataset seems to contain relatively cheaper houses (or the Boston area was cheaper back then – I don’t know; Martin, 2017). +> +> The mean house price was $22.533. +> +> Variance in house prices is $84.587. +> +> MachineCurve (2020) + +These are variables available in the dataset: + +> **CRIM** per capita crime rate by town +> +> **ZN** proportion of residential land zoned for lots over 25,000 sq.ft. +> +> **INDUS** proportion of non-retail business acres per town +> +> **CHAS** Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) +> +> **NOX** nitric oxides concentration (parts per 10 million) +> +> **RM** average number of rooms per dwelling +> +> **AGE** proportion of owner-occupied units built prior to 1940 +> +> **DIS** weighted distances to five Boston employment centres +> +> **RAD** index of accessibility to radial highways +> +> **TAX** full-value property-tax rate per $10,000 +> +> **PTRATIO** pupil-teacher ratio by town +> +> **B** 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town +> +> **LSTAT** % lower status of the population +> +> **MEDV** Median value of owner-occupied homes in $1000’s +> +> MachineCurve (2020) + +Obviously, **MEDV** is the median value and hence the target variable. + +### Imports + +The first thing that we have to do is specifying the imports that will be used for today's regression model. First of all, we need `torch`, which is the representation of PyTorch in Python. We will also need its `nn` library, which is the _neural networks_ library and contains neural network related functionalities. The `DataLoader` with which we will batch and shuffle the dataset is imported as well, and that's it for the PyTorch imports. + +Next to PyTorch, we will also import two parts (the `load_boston` and `StandardScaler` components) from Scikit-learn. We will need them for loading and preparing the data; they represent as the source and [a preparation mechanism](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/), respectively. + +``` +import torch +from torch import nn +from torch.utils.data import DataLoader +from sklearn.datasets import load_boston +from sklearn.preprocessing import StandardScaler +``` + +### Representing the Dataset + +Above, you saw that we use Scikit-learn for importing the Boston dataset. 
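+
+One caveat here: `load_boston` has been deprecated and is removed from recent Scikit-learn releases (1.2 and later), partly because of ethical concerns around the dataset. If your installed version no longer ships it, you could swap in another tabular regression dataset, for example `fetch_california_housing`, as in the sketch below - note that it has 8 features instead of 13, so the first `nn.Linear` layer of the model further below would then need to change accordingly:
+
+```
+# Optional alternative if load_boston is unavailable in your Scikit-learn version.
+# California Housing has 8 features, so nn.Linear(13, 64) would become nn.Linear(8, 64).
+from sklearn.datasets import fetch_california_housing
+
+X, y = fetch_california_housing(return_X_y=True)
+print(X.shape, y.shape)  # (20640, 8) (20640,)
+```
+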
Because it is not directly compatible with PyTorch, we cannot simply feed the data to our PyTorch neural network. For doing so, it needs to be prepared. This is actually quite easy: we can create a PyTorch `Dataset` for this purpose. + +A PyTorch dataset simply is a class that extends the `Dataset` class; in our case, we name it `BostonDataset`. It has three defs: `__init__` or the constructor, where most of the work is done, `__len__` returning dataset length, and `__getitem__` for retrieving an individual item using an index. + +In the constructor, we receive `X` and `y` representing inputs and targets and possibly a `scale_data` variable for [standardization](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/), being `True` by default. We then check whether the data already has Tensor format - it really needs to be non-Tensor format to be processed. Subsequently, depending on whether we want our data to be [standardized](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) (which is smart), we apply the `StandardScaler` and immediately transform the data after fitting the scaler to the data. Next, we represent the inputs (`X`) and targets (`y`) as instance variables of each `BostonDataset` object. + +``` +class BostonDataset(torch.utils.data.Dataset): + ''' + Prepare the Boston dataset for regression + ''' + + def __init__(self, X, y, scale_data=True): + if not torch.is_tensor(X) and not torch.is_tensor(y): + # Apply scaling if necessary + if scale_data: + X = StandardScaler().fit_transform(X) + self.X = torch.from_numpy(X) + self.y = torch.from_numpy(y) + + def __len__(self): + return len(self.X) + + def __getitem__(self, i): + return self.X[i], self.y[i] +``` + +### Creating the neural network + +![](images/Basic-neural-network.jpg) + +The regression model that we will create today will be a [Multilayer Perceptron](https://www.machinecurve.com/index.php/2021/01/26/creating-a-multilayer-perceptron-with-pytorch-and-lightning/). It is the classic prototype of a neural network which you can see on the right as well. + +In other words, a Multilayer Perceptron has _multi_ple _layers_ of _perceptrons_. A [Perceptron](https://www.machinecurve.com/index.php/2019/07/24/why-you-cant-truly-create-rosenblatts-perceptron-with-keras/) goes back into the 1950s and was created by an American psychologist named Frank Rosenblatt. It involves a learnable _neuron_ which can learn a mapping between `X` and `y`. + +Recall that this is precisely what we want to create. The Rosenblatt Perceptron, however, turned out to be incapable of mapping _all possible functions_ - not surprising given the fact that it is one neuron only. + +Multilayer Perceptrons change the internals of the original Perceptron and stack them in layers. In addition, they apply [nonlinear activation functions](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/) to the individual neurons, meaning that they can also capture nonlinear patterns in datasets. + +The result is that Multilayer Perceptrons can produce behavior that outperforms human judgment, although more recent approaches such as Convolutional Neural Networks and Recurrent Neural Networks are more applicable to some problems (such as computer vision and time series prediction). + +Now, let's get back to writing some code. 
Our regression Multilayer Perceptron can be created by means of a class called `MLP` which is a sub class of the `nn.Module` class; the PyTorch representation of a neural network. + +In the constructor (`__init__`), we first init the superclass as well and specify a `nn.Sequential` set of layers. Sequential here means that input first flows through the first layer, followed by the second, and so forth. We apply three linear layers with two ReLu activation functions in between. The first `nn.Linear` layer takes 13 inputs. This is the case because we have 13 different variables in the Boston dataset, all of which we will use (which may be suboptimal; you may wish to apply e.g. [PCA](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/) first). It converts the 13 inputs into 64 outputs. The second takes 64 and generates 32, and the final one takes the 32 ReLU-activated outputs and learns a mapping between them and _one output value_. + +Yep, that one output value is precisely the target variable that should be learned! + +In the `forward` pass, we simply feed the input data (`x`) through the model (`self.layers`) and return the result. + +``` +class MLP(nn.Module): + ''' + Multilayer Perceptron for regression. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(13, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 1) + ) + + + def forward(self, x): + ''' + Forward pass + ''' + return self.layers(x) +``` + +### Preparing the dataset + +Now that we have specified a representation of the dataset and the model, it is time that we start using them. + +``` +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Load Boston dataset + X, y = load_boston(return_X_y=True) +``` + +The `if` statement written above means that the code following it will only run when the script is run with the Python interpreter. The first thing we do is fixing the initialization vector of our pseudorandom number generator, so that inconsistencies between random number generations will not yield to inconsistent end results. + +Next, we actually use Scikit-learn's `load_boston` call to load the Boston housing dataset. The input data is assigned to the `X` variable while the corresponding targets are assigned to `y`. + +We can next actually prepare our dataset in PyTorch format by creating a `BostonDataset` object with our training data. In other words, we will init the class that we created above! + +Now that we have a PyTorch-compatible dataset, it still cannot be used directly. We will need to batch and shuffle the dataset first. This essentially means changing the order of the inputs and targets randomly, so that no hidden patterns in data collection can disturb model training. Following this, we generate _batches_ of data - so that we can feed them through the model batched, given possible hardware constraints. We config the model to use 10 samples per batch, but this can be configured depending on your own hardware. + +``` + # Prepare Boston dataset + dataset = BostonDataset(X, y) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) +``` + +### Picking a loss function + +Recall from the high-level supervised training process that a loss function is used to compare model predictions and true targets - essentially computing **how poor the model performs**. 
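+
+To get a feeling for why this choice matters, here is a tiny sketch - with made-up numbers - that compares PyTorch's MSE and L1 (MAE) losses on the same predictions, one of which is a large outlier:
+
+```
+import torch
+from torch import nn
+
+# Made-up predictions and targets; the last prediction is a large outlier
+predictions = torch.tensor([[2.0], [3.0], [25.0]])
+targets = torch.tensor([[2.5], [3.5], [5.0]])
+
+mse = nn.MSELoss()
+mae = nn.L1Loss()
+
+print(mse(predictions, targets))  # ~133.5 - the squared outlier dominates
+print(mae(predictions, targets))  # 7.0 - grows only linearly with the outlier
+```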
+
+Picking a [loss function](https://www.machinecurve.com/index.php/2021/07/19/how-to-use-pytorch-loss-functions/#pytorch-regression-loss-function-examples) thus has to be done relative to the characteristics of your data. For example, if your dataset has many outliers, [Mean Squared Error](https://www.machinecurve.com/index.php/2021/07/19/how-to-use-pytorch-loss-functions/#mean-squared-error-mse-loss-nn-mseloss) loss may not be a good idea. In that case, [L1 or Mean Absolute Error (MAE) loss](https://www.machinecurve.com/index.php/2021/07/19/how-to-use-pytorch-loss-functions/#mean-absolute-error-mae-l1-loss-nn-l1loss) can be a better choice. In other words, first perform Exploratory Data Analysis on the variables you will be working with. Are there many outliers? Are the values close together? Depending on that, you will be able to pick an appropriate loss function to start with. Trial and error will tell whether you need to change it after training a few times.
+
+### Initializing the model, loss function and optimizer
+
+For the sake of simplicity, we will be using MAE loss (i.e., `nn.L1Loss`) today. Now that we have defined the MLP and prepared the data, we can initialize the `MLP` and the `loss_function`. We also initialize the optimizer, which adapts the weights of our model (i.e. makes it better) after the error (loss) has been computed and propagated backwards. We will be using Adam, which is quite a standard optimizer, with a relatively default learning rate of `1e-4`.
+
+```
+  # Initialize the MLP
+  mlp = MLP()
+
+  # Define the loss function and optimizer
+  loss_function = nn.L1Loss()
+  optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4)
+```
+
+### Training loop
+
+What remains is the creation of the training loop!
+
+You can see that in this loop, the following happens:
+
+- During training, we iterate over the entire training dataset for a fixed number of **epochs**. In today's model, we set the number of epochs to 5. This may be insufficient for minimizing the loss of your model, but to illustrate how training works we set it to 5.
+- At the start of every epoch, we set the value for `current_loss` (the running loss in the epoch) to zero.
+- Next, we iterate over the `DataLoader`. Recall that our data loader contains the shuffled and batched data. In other words, we iterate over all the batches - which have a maximum size of `batch_size` as configured above. The total number of batches is `ceil(dataset_length / batch_size)` (this means that if your `batch_size` is 10 and you have a dataset with 10006 samples, your number of batches will be 1001 - with the final batch containing 6 samples only).
+- We perform some conversions (e.g. floating point conversion and reshaping) on the `inputs` and `targets` in the current batch.
+- We then zero the gradients in the optimizer. This ensures that gradients computed for previous batches (relevant in every batch after the first one of an epoch) do not accumulate into the current update. This is followed by the **forward pass**, the error computation using our loss function, the **backward pass**, and finally the **optimization**.
+- The batch loss is added to the running loss value for the current epoch. In addition, after every tenth batch, some statistics about the current state of affairs are printed.
+ +``` + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get and prepare inputs + inputs, targets = data + inputs, targets = inputs.float(), targets.float() + targets = targets.reshape((targets.shape[0], 1)) + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 10 == 0: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### MLP for regression with PyTorch - full code example + +It may be the case that you want to use all the code immediately.. In that case, here you go! :) + +``` +import torch +from torch import nn +from torch.utils.data import DataLoader +from sklearn.datasets import load_boston +from sklearn.preprocessing import StandardScaler + +class BostonDataset(torch.utils.data.Dataset): + ''' + Prepare the Boston dataset for regression + ''' + + def __init__(self, X, y, scale_data=True): + if not torch.is_tensor(X) and not torch.is_tensor(y): + # Apply scaling if necessary + if scale_data: + X = StandardScaler().fit_transform(X) + self.X = torch.from_numpy(X) + self.y = torch.from_numpy(y) + + def __len__(self): + return len(self.X) + + def __getitem__(self, i): + return self.X[i], self.y[i] + + +class MLP(nn.Module): + ''' + Multilayer Perceptron for regression. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(13, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 1) + ) + + + def forward(self, x): + ''' + Forward pass + ''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Load Boston dataset + X, y = load_boston(return_X_y=True) + + # Prepare Boston dataset + dataset = BostonDataset(X, y) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.L1Loss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get and prepare inputs + inputs, targets = data + inputs, targets = inputs.float(), targets.float() + targets = targets.reshape((targets.shape[0], 1)) + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 10 == 0: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +* * * + +## Summary + +In this article, you have learned to... 
+ +- **Understand what regression is and how it is different from classification.** +- **Build a Multilayer Perceptron based model for regression using PyTorch**. + +I hope that this article was useful for your understanding and growth! If it was, please let me know through the comments section below 💬 Please also let me know if you have any questions or suggestions for improvement. I'll try to adapt my article :) + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## Sources + +PyTorch. (n.d.). _L1Loss — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.L1Loss.html#torch.nn.L1Loss](https://pytorch.org/docs/stable/generated/torch.nn.L1Loss.html#torch.nn.L1Loss) + +MachineCurve. (2020, November 2). _Why you can't truly create Rosenblatt's Perceptron with Keras – MachineCurve_. [https://www.machinecurve.com/index.php/2019/07/24/why-you-cant-truly-create-rosenblatts-perceptron-with-keras/](https://www.machinecurve.com/index.php/2019/07/24/why-you-cant-truly-create-rosenblatts-perceptron-with-keras/) + +MachineCurve. (2021, January 12). _Rosenblatt's Perceptron with Python – MachineCurve_. [https://www.machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/](https://www.machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/) + +MachineCurve. (2020, November 16). _Exploring the Keras datasets – MachineCurve_. [https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/) diff --git a/how-to-create-a-variational-autoencoder-with-keras.md b/how-to-create-a-variational-autoencoder-with-keras.md new file mode 100644 index 0000000..49da1c9 --- /dev/null +++ b/how-to-create-a-variational-autoencoder-with-keras.md @@ -0,0 +1,855 @@ +--- +title: "How to create a variational autoencoder with Keras?" +date: "2019-12-30" +categories: + - "deep-learning" + - "frameworks" +tags: + - "autoencoder" + - "deep-learning" + - "keras" + - "machine-learning" + - "neural-networks" + - "variational-autoencoder" +--- + +In a [different blog post](https://www.machinecurve.com/index.php/2019/12/24/what-is-a-variational-autoencoder-vae/), we studied the concept of a _Variational Autoencoder_ (or VAE) in detail. The models, which are generative, can be used to manipulate datasets by learning the distribution of this input data. + +But there's a difference between theory and practice. While it's always nice to understand neural networks in theory, it's always even more fun to actually create them with a particular framework. It makes them really _usable_. + +Today, we'll use the Keras deep learning framework to create a convolutional variational autoencoder. We subsequently train it on the MNIST dataset, and also show you what our latent space looks like as well as new samples generated from the latent space. + +But first, let's take a look at what VAEs are. + +Are you ready? + +Let's go! 😎 + +**Update 17/08/2020:** added a fix for an issue with vae.fit(). + +\[toc\] + +* * * + +## Recap: what are variational autoencoders? + +If you are already familiar with variational autoencoders or wish to find the implementation straight away, I'd suggest to skip this section. In any other case, it may be worth the read. + +### How do VAEs work? 
+ +Contrary to a [normal autoencoder](https://www.machinecurve.com/index.php/2019/12/24/what-is-a-variational-autoencoder-vae/#about-normal-autoencoders), which learns to encode some input into a point in _latent space_, Variational Autoencoders (VAEs) learn to encode multivariate probability distributions into latent space, given their configuration usually Gaussian ones: + +[![](images/vae-encoder-decoder-1024x229.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/vae-encoder-decoder.png) + +Sampling from the distribution gives a point in latent space that, given the distribution, is oriented around some mean value \[latex\]\\mu\[/latex\] and standard deviation \[latex\]\\sigma\[/latex\], like the points in this two-dimensional distribution: + +[![](images/MultivariateNormal.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/MultivariateNormal.png) + +Combining this with a [Kullback-Leibler divergence segment](https://www.machinecurve.com/index.php/2019/12/21/how-to-use-kullback-leibler-divergence-kl-divergence-with-keras/) in the loss function leads to a latent space that is both _continuous_ and _complete_: for every point sampled close to the distribution's mean and standard deviation (which is in our case the standard normal distribution) the output should be both _similar to samples around that sample_ and _should make sense_. + +[![](images/vae_space.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/vae_space.png) + +_Continuity and completeness in the latent space._ + +### What can you do with VAEs? + +Besides the regular stuff one can do with an autoencoder (like denoising and dimensionality reduction), the principles of a VAE outlined above allow us to use variational autoencoders for generative purposes. + +[![](images/fmnist_dmax_plot.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/fmnist_dmax_plot.png) + +_Samples generated with a VAE trained on the Fashion MNIST dataset._ + +I would really recommend my blog ["What is a Variational Autoencoder (VAE)?"](https://www.machinecurve.com/index.php/2019/12/24/what-is-a-variational-autoencoder-vae/) if you are interested in understanding VAEs in more detail. However, based on the high-level recap above, I hope that you now both understand (1) how VAEs work at a high level and (2) what this allows you to do with them: using them for generative purposes. + +Let's now take a look at how we will use VAEs today 😊 + +* * * + +## Creating a VAE with Keras + +### What we'll create today + +Today, we'll use the [Keras](https://keras.io) deep learning framework for creating a VAE. It consists of three individual parts: the encoder, the decoder and the VAE as a whole. We do so using the Keras Functional API, which allows us to combine layers very easily. + +The MNIST dataset will be used for training the autoencoder. This dataset contains thousands of 28 x 28 pixel images of handwritten digits, as we can see below. As such, our autoencoder will learn the distribution of handwritten digits across (two)dimensional latent space, which we can then use to manipulate samples into a format we like. 
+ +[![](images/mnist.png)](https://www.machinecurve.com/wp-content/uploads/2019/07/mnist.png) + +_Samples from the MNIST dataset_ + +This is the structure of the encoder: + +``` +Model: "encoder" +__________________________________________________________________________________________________ +Layer (type) Output Shape Param # Connected to +================================================================================================== +encoder_input (InputLayer) (None, 28, 28, 1) 0 +__________________________________________________________________________________________________ +conv2d_1 (Conv2D) (None, 14, 14, 8) 80 encoder_input[0][0] +__________________________________________________________________________________________________ +batch_normalization_1 (BatchNor (None, 14, 14, 8) 32 conv2d_1[0][0] +__________________________________________________________________________________________________ +conv2d_2 (Conv2D) (None, 7, 7, 16) 1168 batch_normalization_1[0][0] +__________________________________________________________________________________________________ +batch_normalization_2 (BatchNor (None, 7, 7, 16) 64 conv2d_2[0][0] +__________________________________________________________________________________________________ +flatten_1 (Flatten) (None, 784) 0 batch_normalization_2[0][0] +__________________________________________________________________________________________________ +dense_1 (Dense) (None, 20) 15700 flatten_1[0][0] +__________________________________________________________________________________________________ +batch_normalization_3 (BatchNor (None, 20) 80 dense_1[0][0] +__________________________________________________________________________________________________ +latent_mu (Dense) (None, 2) 42 batch_normalization_3[0][0] +__________________________________________________________________________________________________ +latent_sigma (Dense) (None, 2) 42 batch_normalization_3[0][0] +__________________________________________________________________________________________________ +z (Lambda) (None, 2) 0 latent_mu[0][0] + latent_sigma[0][0] +================================================================================================== +Total params: 17,208 +Trainable params: 17,120 +Non-trainable params: 88 +``` + +And the decoder: + +``` +__________________________________________________________________________________________________ +Model: "decoder" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +decoder_input (InputLayer) (None, 2) 0 +_________________________________________________________________ +dense_2 (Dense) (None, 784) 2352 +_________________________________________________________________ +batch_normalization_4 (Batch (None, 784) 3136 +_________________________________________________________________ +reshape_1 (Reshape) (None, 7, 7, 16) 0 +_________________________________________________________________ +conv2d_transpose_1 (Conv2DTr (None, 14, 14, 16) 2320 +_________________________________________________________________ +batch_normalization_5 (Batch (None, 14, 14, 16) 64 +_________________________________________________________________ +conv2d_transpose_2 (Conv2DTr (None, 28, 28, 8) 1160 +_________________________________________________________________ +batch_normalization_6 (Batch (None, 28, 28, 8) 32 +_________________________________________________________________ +decoder_output (Conv2DTransp (None, 
28, 28, 1) 73 +================================================================= +Total params: 9,137 +Trainable params: 7,521 +Non-trainable params: 1,616 +``` + +And, finally, the VAE as a whole: + +``` +_________________________________________________________________ +Model: "vae" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +encoder_input (InputLayer) (None, 28, 28, 1) 0 +_________________________________________________________________ +encoder (Model) [(None, 2), (None, 2), (N 17208 +_________________________________________________________________ +decoder (Model) (None, 28, 28, 1) 9137 +================================================================= +Total params: 26,345 +Trainable params: 24,641 +Non-trainable params: 1,704 +``` + +From the final summary, we can see that indeed, the VAE takes in samples of shape \[latex\](28, 28, 1)\[/latex\] and returns samples in the same format. Great! 😊 + +Let's now start working on our model. Open up your Explorer / Finder, navigate to some folder, and create a new Python file, e.g. `variational_autoencoder.py`. Now, open this file in your code editor, and let's start coding! 😎 + +### What you'll need to run the model + +Before we begin, it's important that you ensure that you have all the required dependencies installed on your system: + +- First of all, you'll need the **Keras deep learning framework**, with which we are creating the VAE. +- It's best if you used the **Tensorflow** backend (on top of which Keras can run). However, Theano and CNTK work as well (for Python). +- By consequence, it's preferred if you run Keras with **Python**, version 3.6+. +- You'll also need **Numpy**, for number processing, and **Matplotlib**, for visualization purposes. + +### Model imports + +Let's now declare everything that we will import: + +- Keras, obviously. +- From Keras Layers, we'll need convolutional layers and [transposed convolutions](https://www.machinecurve.com/index.php/2019/09/29/understanding-transposed-convolutions/), which we'll use for the autoencoder. Given our usage of the Functional API, we also need Input, Lambda and Reshape, as well as Dense and Flatten. +- We'll import BatchNormalization as well to ensure that the mean and variance of our layer's inputs remains close to (0, 1) during training. This benefits the training process. +- We'll import the `Model` container from `keras.models`. This allows us to instantiate the models eventually. +- The `mnist` dataset is the dataset we'll be training our VAE on. +- With `binary_crossentropy`, we can compute reconstruction loss. +- Our backend (`K`) contains calls for tensor manipulations, which we'll use. +- Numpy is used for number processing and Matplotlib for plotting the visualizations on screen. + +This is the code that includes our imports: + +``` +''' + Variational Autoencoder (VAE) with the Keras Functional API. +''' + +import keras +from keras.layers import Conv2D, Conv2DTranspose, Input, Flatten, Dense, Lambda, Reshape +from keras.layers import BatchNormalization +from keras.models import Model +from keras.datasets import mnist +from keras.losses import binary_crossentropy +from keras import backend as K +import numpy as np +import matplotlib.pyplot as plt +``` + +### Loading data + +Next thing: importing the MNIST dataset. Since MNIST is part of the Keras Datasets, we can import it easily - by calling `mnist.load_data()`. Love the Keras simplicity! 
+ +``` +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() +``` + +### Model configuration + +Importing the data is followed by setting config parameters for data and model. + +``` +# Data & model configuration +img_width, img_height = input_train.shape[1], input_train.shape[2] +batch_size = 128 +no_epochs = 100 +validation_split = 0.2 +verbosity = 1 +latent_dim = 2 +num_channels = 1 +``` + +The width and height of our configuration settings is determined by the training data. In our case, they will be `img_width = img_height = 28`, as the MNIST dataset contains samples that are 28 x 28 pixels. + +Batch size is set to 128 samples per (mini)batch, which is quite normal. The same is true for the number of epochs, which was set to 100. 20% of the training data is used for validation purposes. This is also quite normal. Nothing special here. + +Verbosity mode is set to True (by means of `1`), which means that all the output is shown on screen. + +The final two configuration settings are of relatively more interest. First, the latent space will be two-dimensional - which means that a significant information bottleneck will be created which should yield good results with autoencoders on relatively simple datasets. Finally, the `num_channels` parameter can be configured to equal the number of _image channels_: for RGB data, it's 3 (red - green - blue), and for grayscale data (such as MNIST), it's 1. + +### Data preprocessing + +The next thing is data preprocessing: + +``` +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_height, img_width, num_channels) +input_test = input_test.reshape(input_test.shape[0], img_height, img_width, num_channels) +input_shape = (img_height, img_width, num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 +``` + +First, we reshape the data so that it takes the shape (X, 28, 28, 1), where X is the number of samples in either the training or testing dataset. We also set (28, 28, 1) as `input_shape`. + +Next, we parse the numbers as floats, which presumably speeds up the training process, and normalize it, which the neural network appreciates. And that's it already for data preprocessing :-) + +### Creating the encoder + +Now, it's time to create the encoder. This is a three-step process: firstly, we define it. Secondly, we perform something that is known as the _reparameterization trick_ in order to allow us to link the encoder to the decoder later, to instantiate the VAE as a whole. But before that, we instantiate the encoder first, as our third and final step. + +#### Encoder definition + +The first step in the three-step process is the definition of our encoder. 
Following the connection process of the Keras Functional API, we link the layers together: + +``` +# # ================= +# # Encoder +# # ================= + +# Definition +i = Input(shape=input_shape, name='encoder_input') +cx = Conv2D(filters=8, kernel_size=3, strides=2, padding='same', activation='relu')(i) +cx = BatchNormalization()(cx) +cx = Conv2D(filters=16, kernel_size=3, strides=2, padding='same', activation='relu')(cx) +cx = BatchNormalization()(cx) +x = Flatten()(cx) +x = Dense(20, activation='relu')(x) +x = BatchNormalization()(x) +mu = Dense(latent_dim, name='latent_mu')(x) +sigma = Dense(latent_dim, name='latent_sigma')(x) +``` + +Let's now take a look at the individual lines of code in more detail. + +- The first layer is the `Input` layer. It accepts data with `input_shape = (28, 28, 1)` and is named _encoder\_input_. It's actually a pretty dumb layer, haha 😂 +- Next up is a [two-dimensional convolutional layer](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/), or Conv2D in Keras terms. It learns 8 filters by deploying a 3 x 3 kernel which it convolves over the input. It has a stride of two which means that it skips over the input during the convolution as well, speeding up the learning process. It employs 'same' padding and [ReLU activation](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/). Do note that officially, it's best to use [He init](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/) with ReLU activating layers. However, since the dataset is relatively small, it shouldn't be too much of a problem if you don't. +- Subsequently, we use Batch Normalization. This layer ensures that the outputs of the Conv2D layer that are input to the next Conv2D layer have a steady mean and variance, likely \[latex\]\\mu = 0.0, \\sigma = 1.0\[/latex\] (plus some \[latex\]\\epsilon\[/latex\], an error term to ensure numerical stability). This benefits the learning process. +- Once again, a Conv2D layer. It learns 16 filters and for the rest is equal to the first Conv2D layer. +- BatchNormalization once more. +- Next up, a `Flatten` layer. It's a relatively dumb layer too, and only serves to flatten the multidimensional data from the convolutional layers into one-dimensional shape. This has to be done because the densely-connected layers that we use next require data to have this shape. +- The next layer is a Dense layer with 20 output neurons. It's the autoencoder bottleneck we've been talking about. +- BatchNormalization once more. +- The next two layers, `mu` and `sigma`, are actually not separate from each other - look at the previous layer they are linked to (both `x`, i.e. the Dense(20) layer). The first outputs the mean values \[latex\]\\mu\[/latex\] of the encoded input and the second one outputs the stddevs \[latex\]\\sigma\[/latex\]. With these, we can sample the random variables that constitute the point in latent space onto which some input is mapped. + +That's for the layers of our encoder 😄 The next step is to retrieve the shape of the _final Conv2D output_: + +``` +# Get Conv2D shape for Conv2DTranspose operation in decoder +conv_shape = K.int_shape(cx) +``` + +We'll need it when defining the layers of our decoder. I won't bother you with the details yet, as they are best explained when we're a bit further down the road. 
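To make this a bit more tangible already, here is a quick illustrative check - not part of the final script - of what `conv_shape` holds for the encoder defined above. With 28 x 28 x 1 inputs and two stride-2 convolutions, the last Conv2D layer outputs 7 x 7 feature maps with 16 filters:

```
# Illustrative only: inspect the shape captured above.
# For the encoder defined in this post, this prints (None, 7, 7, 16):
# (batch dimension, height, width, number of filters).
print(conv_shape)
```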
Just remember to come back here if you wonder why we need this `conv_shape` value in the decoder, okay? 😉

Let's now take a look at the second part of our encoder segment: the _reparameterization trick_.

#### Reparameterization trick

For a mathematically sound explanation of the so-called "reparameterization trick" introduced to VAEs by Kingma & Welling (2013), I must refer you to Gregory Gundersen's ["The Reparameterization Trick"](http://gregorygundersen.com/blog/2018/04/29/reparameterization/) - but I'll try to explain the need for reparameterization briefly.

If you use neural networks (or, to be more precise, gradient descent) for optimizing the variational autoencoder, you effectively minimize some expected loss value, which can be estimated with Monte-Carlo techniques (Huang, n.d.). However, this requires that the loss function is differentiable, which is not necessarily the case, because it depends on the parameters of some probability distribution that we don't know about. In this case, it's possible to rewrite the equation, but then it _no longer has the form of an expectation_, making it impossible to use the Monte-Carlo techniques that were usable before.

However, if we can _reparameterize_ the sample fed to the function into the shape \[latex\]\\mu + \\sigma \\times \\epsilon\[/latex\], it becomes possible to use gradient descent for estimating the gradients accurately (Gundersen, n.d.; Huang, n.d.).

And that's precisely what we'll do in our code. We "sample" the value for \[latex\]z\[/latex\] from the computed \[latex\]\\mu\[/latex\] and \[latex\]\\sigma\[/latex\] values by computing `mu + K.exp(sigma / 2) * eps`. Note that, given the KL divergence term we will define later, our `sigma` layer effectively learns the _log variance_ \[latex\]\\log \\sigma^2\[/latex\] rather than the standard deviation itself, which is why the standard deviation is recovered with `K.exp(sigma / 2)`.

```
# Define sampling with reparameterization trick
def sample_z(args):
  mu, sigma = args
  batch = K.shape(mu)[0]
  dim = K.int_shape(mu)[1]
  eps = K.random_normal(shape=(batch, dim))
  return mu + K.exp(sigma / 2) * eps
```

We then use this with a `Lambda` layer to ensure that correct gradients are computed during the backwards pass based on our values for `mu` and `sigma`:

```
# Use reparameterization trick to ensure correct gradient
z = Lambda(sample_z, output_shape=(latent_dim, ), name='z')([mu, sigma])
```

#### Encoder instantiation

Now, it's time to instantiate the encoder - taking inputs through input layer `i`, and outputting the values generated by the `mu`, `sigma` and `z` layers (i.e., the individual means and standard deviations, and the point sampled from the random variable represented by them):

```
# Instantiate encoder
encoder = Model(i, [mu, sigma, z], name='encoder')
encoder.summary()
```

Now that we've got the encoder, it's time to start working on the decoder :)

### Creating the decoder

Creating the decoder is a bit simpler and boils down to a two-step process: defining it, and instantiating it.

#### Decoder definition

Firstly, we'll define the layers of our decoder - just as we've done when defining the structure of our encoder.
```
# =================
# Decoder
# =================

# Definition
d_i = Input(shape=(latent_dim, ), name='decoder_input')
x = Dense(conv_shape[1] * conv_shape[2] * conv_shape[3], activation='relu')(d_i)
x = BatchNormalization()(x)
x = Reshape((conv_shape[1], conv_shape[2], conv_shape[3]))(x)
cx = Conv2DTranspose(filters=16, kernel_size=3, strides=2, padding='same', activation='relu')(x)
cx = BatchNormalization()(cx)
cx = Conv2DTranspose(filters=8, kernel_size=3, strides=2, padding='same', activation='relu')(cx)
cx = BatchNormalization()(cx)
o = Conv2DTranspose(filters=num_channels, kernel_size=3, activation='sigmoid', padding='same', name='decoder_output')(cx)
```

- Our decoder also starts with an `Input` layer, the `decoder_input` layer. It takes input with the shape `(latent_dim, )`, which as we will see is the vector we sampled for `z` with our encoder.
- Since we'd like to upsample the point in latent space with [Conv2DTranspose](https://www.machinecurve.com/index.php/2019/12/10/conv2dtranspose-using-2d-transposed-convolutions-with-keras/) layers, in exactly the opposite (symmetrical) order to how we downsampled with our encoder, we must first bring the data from shape `(latent_dim, )` back into some shape that can be reshaped into the _output shape_ of the last convolutional layer of our encoder.
- **This is why you needed the `conv_shape` variable.** We thus now add a `Dense` layer with `conv_shape[1] * conv_shape[2] * conv_shape[3]` output neurons, which converts the latent space into many outputs.
- We next use a `Reshape` layer to convert the output of the Dense layer into the output shape of the last convolutional layer: `(conv_shape[1], conv_shape[2], conv_shape[3]) = (7, 7, 16)`. Sixteen filters learnt with 7 x 7 pixels per filter.
- We then use `Conv2DTranspose` and `BatchNormalization` in the exact opposite order as in our encoder to upsample our data into 28 x 28 pixels (which is equal to the width and height of our inputs). However, we still have 8 filters, so the shape so far is `(28, 28, 8)`.
- We therefore add a final `Conv2DTranspose` layer which does nothing to the width and height of the data, but ensures that the number of filters learned equals `num_channels`. For MNIST data, where `num_channels = 1`, this means that the shape of our output will be `(28, 28, 1)`, just as it has to be :) This last layer also uses Sigmoid activation, which allows us to use binary crossentropy loss when computing the reconstruction loss part of our loss function.

#### Decoder instantiation

The next thing we do is instantiate the decoder:

```
# Instantiate decoder
decoder = Model(d_i, o, name='decoder')
decoder.summary()
```

It takes the inputs from the decoder input layer `d_i` and outputs whatever is output by the output layer `o`. Simple :)

### Creating the whole VAE

Now that the encoder and decoder are complete, we can create the VAE as a whole:

```
# =================
# VAE as a whole
# =================

# Instantiate VAE
vae_outputs = decoder(encoder(i)[2])
vae = Model(i, vae_outputs, name='vae')
vae.summary()
```

If you think about it, the _outputs_ of the entire VAE are the _original inputs_, encoded by the _encoder_, and decoded by the _decoder_.

That's how we arrive at `vae_outputs = decoder(encoder(i)[2])`: inputs `i` are encoded by the `encoder` into `[mu, sigma, z]` (the individual means and standard deviations with the sampled `z` as well).
We then take the sampled `z` values (hence the `[2]`) and feed them to the `decoder`, which ensures that we arrive at correct VAE output.

We then instantiate the model: `i` serves as our input, and `vae_outputs` as the output. We call the model `vae`, because it simply is.

### Defining custom VAE loss function

Now that we have defined our model, we can proceed with model configuration. Usually, with neural networks, this is done with `model.compile`, where a loss function is specified such as [binary crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/). However, when we look [at how VAEs are optimized](https://www.machinecurve.com/index.php/2019/12/24/what-is-a-variational-autoencoder-vae/#second-difference-kl-divergence-reconstruction-error-for-optimization), we see that it's not a simple loss function that is used: we use reconstruction loss (in our case, binary crossentropy loss) _together with_ [KL divergence loss](https://www.machinecurve.com/index.php/2019/12/21/how-to-use-kullback-leibler-divergence-kl-divergence-with-keras/) to ensure that our latent space is both **[continuous and complete](https://www.machinecurve.com/index.php/2019/12/24/what-is-a-variational-autoencoder-vae/#continuity-and-completeness)**.

We define it as follows:

```
# Define loss
def kl_reconstruction_loss(true, pred):
  # Reconstruction loss
  reconstruction_loss = binary_crossentropy(K.flatten(true), K.flatten(pred)) * img_width * img_height
  # KL divergence loss
  kl_loss = 1 + sigma - K.square(mu) - K.exp(sigma)
  kl_loss = K.sum(kl_loss, axis=-1)
  kl_loss *= -0.5
  # Total loss = 50% rec + 50% KL divergence loss
  return K.mean(reconstruction_loss + kl_loss)
```

- Our `reconstruction_loss` is the binary crossentropy value computed for the flattened `true` values (representing our targets, i.e. our ground truth) and the `pred` prediction values generated by our VAE. It's multiplied with `img_width` and `img_height` to reduce the impact of flattening.
- Our KL divergence loss is computed with the closed-form expression implemented above (Wiseodd, 2016).
- We weigh reconstruction loss and KL divergence loss equally: we sum the two and return their mean over the batch.

### Compilation & training

Now that we have defined our custom loss function, we can compile our model. We do so using the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) and our `kl_reconstruction_loss` custom loss function.

```
# Compile VAE
vae.compile(optimizer='adam', loss=kl_reconstruction_loss)

# Train autoencoder
vae.fit(input_train, input_train, epochs = no_epochs, batch_size = batch_size, validation_split = validation_split)
```

Once compiled, we can call `vae.fit` to start the training process. Note that we set `input_train` both as our features and targets, as is usual with autoencoders. For the rest, we configure the training process as defined previously, in the model configuration step.

* * *

## Visualizing VAE results

Even though you can now actually train your VAE, it's best to wait _just a bit more_ - because we'll add some code for visualization purposes:

- We will visualize our test set inputs mapped onto the latent space. This allows us to check the continuity and completeness of our latent space.
- We will also visualize a uniform walk across latent space to see how sampling from it will result in output that actually makes sense.
This is actually the end result we'd love to see :) + +Some credits first, though: the code for the two visualizers was originally created (and found by me) in the Keras Docs, at the link [here](https://keras.io/examples/variational_autoencoder_deconv/), as well as in François Chollet's blog post, [here](https://blog.keras.io/building-autoencoders-in-keras.html). All credits for the original ideas go to the authors of these articles. I made some adaptations to the code to accomodate for this blog post: + +- First of all, I split the visualizers into two separate definitions. Originally, there was one definition, that generated both visualizations. However, I think that having them separated gives you more flexibility. +- Additionally, I ensured that multi-channeled data can be visualized as well. The original code was created _specifically_ for MNIST, which is only one-channel. RGB datasets, such as CIFAR10, are three-dimensional. This required some extra code to make it work based on the autoencoder we created before. + +### Visualizing inputs mapped onto latent space + +Visualizing inputs mapped onto the latent space is simply taking some input data, feeding it to the encoder, taking the mean values \[latex\]\\mu\[/latex\] for the predictions, and plotting them in a scatter plot: + +``` +# ================= +# Results visualization +# Credits for original visualization code: https://keras.io/examples/variational_autoencoder_deconv/ +# (François Chollet). +# Adapted to accomodate this VAE. +# ================= +def viz_latent_space(encoder, data): + input_data, target_data = data + mu, _, _ = encoder.predict(input_data) + plt.figure(figsize=(8, 10)) + plt.scatter(mu[:, 0], mu[:, 1], c=target_data) + plt.xlabel('z - dim 1') + plt.ylabel('z - dim 2') + plt.colorbar() + plt.show() +``` + +### Visualizing samples from the latent space + +Visualizing samples from the latent space entails a bit more work. First, we'll have to create a figure filled with zeros, as well as a linear space around \[latex\](\\mu = 0, \\sigma = 1)\[/latex\] we can iterate over (from \[latex\]domain = range = \[-4, +4\]\[/latex\]). We take a sample from the grid (determined by our current \[latex\]x\[/latex\] and \[latex\]y\[/latex\] positions) and feed it to the decoder. We then replace the zeros in our `figure` with the output, and finally plot the entire figure on screen. This includes reshaping one-dimensional (i.e., grayscale) input if necessary. + +``` + +def viz_decoded(encoder, decoder, data): + num_samples = 15 + figure = np.zeros((img_width * num_samples, img_height * num_samples, num_channels)) + grid_x = np.linspace(-4, 4, num_samples) + grid_y = np.linspace(-4, 4, num_samples)[::-1] + for i, yi in enumerate(grid_y): + for j, xi in enumerate(grid_x): + z_sample = np.array([[xi, yi]]) + x_decoded = decoder.predict(z_sample) + digit = x_decoded[0].reshape(img_width, img_height, num_channels) + figure[i * img_width: (i + 1) * img_width, + j * img_height: (j + 1) * img_height] = digit + plt.figure(figsize=(10, 10)) + start_range = img_width // 2 + end_range = num_samples * img_width + start_range + 1 + pixel_range = np.arange(start_range, end_range, img_width) + sample_range_x = np.round(grid_x, 1) + sample_range_y = np.round(grid_y, 1) + plt.xticks(pixel_range, sample_range_x) + plt.yticks(pixel_range, sample_range_y) + plt.xlabel('z - dim 1') + plt.ylabel('z - dim 2') + # matplotlib.pyplot.imshow() needs a 2D array, or a 3D array with the third dimension being of shape 3 or 4! 
+ # So reshape if necessary + fig_shape = np.shape(figure) + if fig_shape[2] == 1: + figure = figure.reshape((fig_shape[0], fig_shape[1])) + # Show image + plt.imshow(figure) + plt.show() +``` + +### Calling the visualizers + +Using the visualizers is however much easier: + +``` +# Plot results +data = (input_test, target_test) +viz_latent_space(encoder, data) +viz_decoded(encoder, decoder, data) +``` + +## Time to run it! + +Let's now run our model. Open up a terminal which has access to all the required dependencies, `cd` to the folder where your Python file is located, and run it, e.g. `python variational_autoencoder.py`. + +The training process should now begin with some visualizations being output after it finishes! :) + +### If you get an error with vae.fit() + +Marc, one of our readers, reported an issue with the model when running the VAE with TensorFlow 2.3.0 (and possibly also newer versions): https://github.com/tensorflow/probability/issues/519 + +By adding the following line of code, this issue can be resolved: + +``` +tf.config.experimental_run_functions_eagerly(True) +``` + +## Full VAE code + +Even though I would recommend to read the entire post first before you start playing with the code (because the structures are intrinsically linked), it may be that you wish to take the full code and start fiddling right away. In this case, having the full code at once may be worthwhile to you, so here you go 😊 + +``` +''' + Variational Autoencoder (VAE) with the Keras Functional API. +''' + +import keras +from keras.layers import Conv2D, Conv2DTranspose, Input, Flatten, Dense, Lambda, Reshape +from keras.layers import BatchNormalization +from keras.models import Model +from keras.datasets import mnist +from keras.losses import binary_crossentropy +from keras import backend as K +import numpy as np +import matplotlib.pyplot as plt + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Data & model configuration +img_width, img_height = input_train.shape[1], input_train.shape[2] +batch_size = 128 +no_epochs = 100 +validation_split = 0.2 +verbosity = 1 +latent_dim = 2 +num_channels = 1 + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_height, img_width, num_channels) +input_test = input_test.reshape(input_test.shape[0], img_height, img_width, num_channels) +input_shape = (img_height, img_width, num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# # ================= +# # Encoder +# # ================= + +# Definition +i = Input(shape=input_shape, name='encoder_input') +cx = Conv2D(filters=8, kernel_size=3, strides=2, padding='same', activation='relu')(i) +cx = BatchNormalization()(cx) +cx = Conv2D(filters=16, kernel_size=3, strides=2, padding='same', activation='relu')(cx) +cx = BatchNormalization()(cx) +x = Flatten()(cx) +x = Dense(20, activation='relu')(x) +x = BatchNormalization()(x) +mu = Dense(latent_dim, name='latent_mu')(x) +sigma = Dense(latent_dim, name='latent_sigma')(x) + +# Get Conv2D shape for Conv2DTranspose operation in decoder +conv_shape = K.int_shape(cx) + +# Define sampling with reparameterization trick +def sample_z(args): + mu, sigma = args + batch = K.shape(mu)[0] + dim = K.int_shape(mu)[1] + eps = K.random_normal(shape=(batch, dim)) + return mu + K.exp(sigma / 2) * eps + +# Use reparameterization trick to ....?? 
+z = Lambda(sample_z, output_shape=(latent_dim, ), name='z')([mu, sigma]) + +# Instantiate encoder +encoder = Model(i, [mu, sigma, z], name='encoder') +encoder.summary() + +# ================= +# Decoder +# ================= + +# Definition +d_i = Input(shape=(latent_dim, ), name='decoder_input') +x = Dense(conv_shape[1] * conv_shape[2] * conv_shape[3], activation='relu')(d_i) +x = BatchNormalization()(x) +x = Reshape((conv_shape[1], conv_shape[2], conv_shape[3]))(x) +cx = Conv2DTranspose(filters=16, kernel_size=3, strides=2, padding='same', activation='relu')(x) +cx = BatchNormalization()(cx) +cx = Conv2DTranspose(filters=8, kernel_size=3, strides=2, padding='same', activation='relu')(cx) +cx = BatchNormalization()(cx) +o = Conv2DTranspose(filters=num_channels, kernel_size=3, activation='sigmoid', padding='same', name='decoder_output')(cx) + +# Instantiate decoder +decoder = Model(d_i, o, name='decoder') +decoder.summary() + +# ================= +# VAE as a whole +# ================= + +# Instantiate VAE +vae_outputs = decoder(encoder(i)[2]) +vae = Model(i, vae_outputs, name='vae') +vae.summary() + +# Define loss +def kl_reconstruction_loss(true, pred): + # Reconstruction loss + reconstruction_loss = binary_crossentropy(K.flatten(true), K.flatten(pred)) * img_width * img_height + # KL divergence loss + kl_loss = 1 + sigma - K.square(mu) - K.exp(sigma) + kl_loss = K.sum(kl_loss, axis=-1) + kl_loss *= -0.5 + # Total loss = 50% rec + 50% KL divergence loss + return K.mean(reconstruction_loss + kl_loss) + +# Compile VAE +vae.compile(optimizer='adam', loss=kl_reconstruction_loss) + +# Train autoencoder +vae.fit(input_train, input_train, epochs = no_epochs, batch_size = batch_size, validation_split = validation_split) + +# ================= +# Results visualization +# Credits for original visualization code: https://keras.io/examples/variational_autoencoder_deconv/ +# (François Chollet). +# Adapted to accomodate this VAE. +# ================= +def viz_latent_space(encoder, data): + input_data, target_data = data + mu, _, _ = encoder.predict(input_data) + plt.figure(figsize=(8, 10)) + plt.scatter(mu[:, 0], mu[:, 1], c=target_data) + plt.xlabel('z - dim 1') + plt.ylabel('z - dim 2') + plt.colorbar() + plt.show() + +def viz_decoded(encoder, decoder, data): + num_samples = 15 + figure = np.zeros((img_width * num_samples, img_height * num_samples, num_channels)) + grid_x = np.linspace(-4, 4, num_samples) + grid_y = np.linspace(-4, 4, num_samples)[::-1] + for i, yi in enumerate(grid_y): + for j, xi in enumerate(grid_x): + z_sample = np.array([[xi, yi]]) + x_decoded = decoder.predict(z_sample) + digit = x_decoded[0].reshape(img_width, img_height, num_channels) + figure[i * img_width: (i + 1) * img_width, + j * img_height: (j + 1) * img_height] = digit + plt.figure(figsize=(10, 10)) + start_range = img_width // 2 + end_range = num_samples * img_width + start_range + 1 + pixel_range = np.arange(start_range, end_range, img_width) + sample_range_x = np.round(grid_x, 1) + sample_range_y = np.round(grid_y, 1) + plt.xticks(pixel_range, sample_range_x) + plt.yticks(pixel_range, sample_range_y) + plt.xlabel('z - dim 1') + plt.ylabel('z - dim 2') + # matplotlib.pyplot.imshow() needs a 2D array, or a 3D array with the third dimension being of shape 3 or 4! 
+ # So reshape if necessary + fig_shape = np.shape(figure) + if fig_shape[2] == 1: + figure = figure.reshape((fig_shape[0], fig_shape[1])) + # Show image + plt.imshow(figure) + plt.show() + +# Plot results +data = (input_test, target_test) +viz_latent_space(encoder, data) +viz_decoded(encoder, decoder, data) +``` + +* * * + +## Results + +Now, time for the results :) + +Training the model for 100 epochs yields this visualization of the latent space: + +[![](images/mnist_100_latentspace.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/mnist_100_latentspace.png) + +As we can see, around \[latex\](0, 0)\[/latex\] our latent space is pretty continuous as well as complete. Somewhere around \[latex\](0, -1.5)\[/latex\] we see some holes, as well as near the edges (e.g. \[latex\](3, -3)\[/latex\]). We can see these issues in the actual sampling too: + +[![](images/mnist_digits.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/mnist_digits.png) + +Especially in the right corners, we see the issue with completeness, which yield outputs that do not make sense. Some issues with continuity are visible wherever the samples are _blurred_. However, generally speaking, I'm quite happy with the results! 😎 + +However, let's see if we can make them even better :) + +* * * + +## DCGAN-like architecture + +In their paper "[Unsupervised representation learning with deep convolutional generative adversarial networks](https://arxiv.org/abs/1511.06434)", Radford et al. (2015) introduce the concept of a _deep convolutional generative adversarial network_, or DCGAN. While a GAN represents the other branch of generative models, results have suggested that deep convolutional architectures for generative models may produce better results with VAEs as well. + +So, as an extension of our original post, we've changed the architecture of our model into deeper and wider convolutional layers, in line with Radford et al. (2015). 
I changed the `encoder` into: + +``` +i = Input(shape=input_shape, name='encoder_input') +cx = Conv2D(filters=128, kernel_size=5, strides=2, padding='same', activation='relu')(i) +cx = BatchNormalization()(cx) +cx = Conv2D(filters=256, kernel_size=5, strides=2, padding='same', activation='relu')(cx) +cx = BatchNormalization()(cx) +cx = Conv2D(filters=512, kernel_size=5, strides=2, padding='same', activation='relu')(cx) +cx = BatchNormalization()(cx) +cx = Conv2D(filters=1024, kernel_size=5, strides=2, padding='same', activation='relu')(cx) +cx = BatchNormalization()(cx) +x = Flatten()(cx) +x = Dense(20, activation='relu')(x) +x = BatchNormalization()(x) +mu = Dense(latent_dim, name='latent_mu')(x) +sigma = Dense(latent_dim, name='latent_sigma')(x) +``` + +And the `decoder` into: + +``` +# Definition +d_i = Input(shape=(latent_dim, ), name='decoder_input') +x = Dense(conv_shape[1] * conv_shape[2] * conv_shape[3], activation='relu')(d_i) +x = BatchNormalization()(x) +x = Reshape((conv_shape[1], conv_shape[2], conv_shape[3]))(x) +cx = Conv2DTranspose(filters=1024, kernel_size=5, strides=2, padding='same', activation='relu')(x) +cx = BatchNormalization()(cx) +cx = Conv2DTranspose(filters=512, kernel_size=5, strides=2, padding='same', activation='relu')(cx) +cx = BatchNormalization()(cx) +cx = Conv2DTranspose(filters=256, kernel_size=5, strides=2, padding='same', activation='relu')(cx) +cx = BatchNormalization()(cx) +cx = Conv2DTranspose(filters=128, kernel_size=5, strides=2, padding='same', activation='relu')(cx) +cx = BatchNormalization()(cx) +o = Conv2DTranspose(filters=num_channels, kernel_size=3, activation='sigmoid', padding='same', name='decoder_output')(cx) +``` + +While our original VAE had approximately 26.000 trainable parameters, this one has approximately 9M: + +``` +_________________________________________________________________ +Model: "vae" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +encoder_input (InputLayer) (None, 28, 28, 1) 0 +_________________________________________________________________ +encoder (Model) [(None, 2), (None, 2), (N 4044984 +_________________________________________________________________ +decoder (Model) (None, 28, 28, 1) 4683521 +================================================================= +Total params: 8,728,505 +Trainable params: 8,324,753 +Non-trainable params: 403,752 +``` + +However, even after training it for only 5 epochs, results have become considerably better: + +[![](images/latent-space-visualized.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/latent-space-visualized.png) + +Latent space (left) also looks better compared to our initial VAE (right): + +- [![](images/latent-space-without-outliers.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/latent-space-without-outliers.png) + +- [![](images/mnist_100_latentspace.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/mnist_100_latentspace.png) + + +However, what is interesting, is that the left one is a _zoom_, actually, as we also have some outliers now: + +[![](images/latent-space-with-outliers.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/latent-space-with-outliers.png) + +Interesting result :) + +## Summary + +In this blog post, we've seen how to create a variational autoencoder with Keras. We first looked at what VAEs are, and why they are different from regular autoencoders. 
We then created a neural network implementation with Keras and explained it step by step, so that you can easily reproduce it yourself while _understanding_ what happens. + +In order to compare our initial 26K-parameter VAE, we expanded the architecture to resemble a DCGAN-like architecture of approx. 9M parameters, for both the encoder and the decoder. This yielded better results, but also increased the number of outliers. + +I hope you've learnt something from this article :) If you did, please let me know by leaving a comment in the comments section below! 👇 If you have questions or remarks, please do the same! + +Thank you for reading MachineCurve today and happy engineering 😎 + +* * * + +## References + +Keras. (n.d.). Variational autoencoder deconv. Retrieved from [https://keras.io/examples/variational\_autoencoder\_deconv/](https://keras.io/examples/variational_autoencoder_deconv/) + +Gundersen, G. (2018, April 29). The Reparameterization Trick. Retrieved from [http://gregorygundersen.com/blog/2018/04/29/reparameterization/](http://gregorygundersen.com/blog/2018/04/29/reparameterization/) + +Kingma, D. P., & Welling, M. (2013). [Auto-encoding variational bayes](https://arxiv.org/abs/1312.6114). _arXiv preprint arXiv:1312.6114_. + +Huang, G. (n.d.). Reparametrization Trick · Machine Learning. Retrieved from [https://gabrielhuang.gitbooks.io/machine-learning/content/reparametrization-trick.html](https://gabrielhuang.gitbooks.io/machine-learning/content/reparametrization-trick.html) + +Wiseodd. (2016, December 10). Variational Autoencoder: Intuition and Implementation. Retrieved from [http://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder/](http://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder/) + +Keras Blog. (n.d.). Building Autoencoders in Keras. Retrieved from [https://blog.keras.io/building-autoencoders-in-keras.html](https://blog.keras.io/building-autoencoders-in-keras.html) + +Radford, A., Metz, L., & Chintala, S. (2015). [Unsupervised representation learning with deep convolutional generative adversarial networks](https://arxiv.org/abs/1511.06434). _arXiv preprint arXiv:1511.06434_. diff --git a/how-to-easily-create-a-train-test-split-for-your-machine-learning-model.md b/how-to-easily-create-a-train-test-split-for-your-machine-learning-model.md new file mode 100644 index 0000000..2c30932 --- /dev/null +++ b/how-to-easily-create-a-train-test-split-for-your-machine-learning-model.md @@ -0,0 +1,207 @@ +--- +title: "How to create a train/test split for your Machine Learning model?" +date: "2020-11-16" +categories: + - "frameworks" + - "svms" +tags: + - "machine-learning" + - "testing-data" + - "train-test-split" + - "training-data" + - "training-split" +--- + +When you are training a Supervised Machine Learning model, such as a Support Vector Machine or Neural Network, it is important that you split your dataset into at least a training dataset and a testing dataset. This can be done in many ways, and I often see a variety of manual approaches for doing this. Scikit-learn however can easily be leveraged for this purpose, allowing you to create a train/test split for your Machine Learning model. In this article, we'll find out how. + +First of all, we'll take a look at _why_ it's wise to generate a training and testing dataset. We will see that this involves the difference between the model's capability for _prediction_ and _generalization_. This includes looking at validation data for Neural networks. 
Secondly, we'll show you how to create a train/test split with Scikit-learn for a variety of use cases. First of all, we'll show you the most general scenario - creating such a split for pretty much any dataset that can be loaded into memory. Subsequently, we'll show you how this can be done for a multilabel classification/multilabel regression dataset. Then, we look at HDF5 data, and show you how we can generate such a split if we load data from file. Finally, as the `tf.keras.datasets` module is used very frequently to practice with ML, we'll show you how to create such a split there as well.

Enough introduction for now - let's take a look! :)

* * *

\[toc\]

* * *

## Why split your dataset into training and testing data?

Before we look at _how_ we can split your dataset into a training and a testing dataset, let's first take a look at _why_ we should do this in the first place.

Training a Supervised Machine Learning model is conceptually really simple and involves the following three-step process:

1. **Feed samples to (an initialized) model**: samples from your dataset are fed forward through the model, generating predictions.
2. **Compare predictions and ground truth:** the predictions are compared with the _true_ labels corresponding to the samples, allowing us to identify how bad the model performs.
3. **Improve:** based on the optimization metric, we can change the model's internals here and there, so that it (hopefully) performs better during the next iteration.

Obviously, the process then starts again at (1), and it repeats until the _error score_ (the metric which identifies how bad the model performs) drops below some threshold, until a certain (fixed) amount of iterations has passed, or until the model [no longer improves](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/).

![](images/feed-1024x404.jpg)

When you keep performing these iterations, the model will continue to improve - because it can perfectly exploit all the spurious patterns in your dataset.

But what if those spurious patterns are not present in the real-world data you will generate predictions for after training? What if the model is hence trained on patterns that are _unique_ to the training dataset, and are absent or only scantily present in the dataset for inference?

Then, put briefly, you have a problem.

And it is also why you will split your dataset into a **training dataset** and a **testing dataset**. By doing so, you can still perform the iterations displayed above, continuously improving the model. But, on the other hand, you will now also have a dataset available that your trained model has never seen before and that can hence be used to identify whether, besides _predicting adequately_, the model is also capable of **generalizing**. You don't want a model that performs well on your training data but performs poorly during inference.

Having a testing dataset partially helps you get rid of this problem!

Common splits are 80% training data and 20% testing data, called **simple hold-out splits**, but [more advanced approaches](https://www.machinecurve.com/index.php/2020/02/18/how-to-use-k-fold-cross-validation-with-keras/) can also be used.

![](images/feed-2.jpg)

### Another split: training/validation data

Traditional Machine Learning algorithms, such as Support Vector Machines, attempt to optimize an error function in order to find the best-performing model.
The _change_ that is applied here does not depend on the model itself, but only on the error function that is optimized.

If you are training Neural networks, this is different. Here, the error function _is dependent on the neurons_, and hence the data you fed forward can be used to trace back error to the neurons that contributed significantly to it.

By consequence, improvement in a Neural network is achieved by computing the improvement (gradient) and then applying it in some form of [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/).

If you use the training set for both feeding data forward and improving the model, you're getting yourself into trouble again. Here's why: improvement would then be a case of the butcher inspecting their own meat. Just like with the training data and testing data, optimizing using training data will mean that you will _always_ tend to move towards capturing patterns present in the training set only. You don't want to touch the testing data until you have finished training, so you must figure out a different solution.

This solution is simple: we'll apply another split when training a Neural network - a **training/validation split**. Here, we use the training data available after the split (in our case 80%) and split it again, usually following an 80/20 split as well.

![](images/feed-3.jpg)

* * *

## Creating a train/test split with Scikit-learn

Now that we know why train/test splits - and possibly train/validation splits - are important, we can take a look at how we can create such splits ourselves. We're going to use Scikit-learn for this purpose, which is an extensive Machine Learning library for Python. More specifically, we're going to leverage `sklearn.model_selection.train_test_split` to create train/test splits for our Machine Learning models. Note that the call is model agnostic and involves data only: it can be used with Scikit-learn models, but also with TensorFlow/Keras, PyTorch, and other libraries.

We look at four different settings:

- Creating a train/test split for any dataset.
- Creating a train/test split for a multilabel dataset.
- Creating a train/test split for HDF5 data.
- Creating a train/test split for a `tf.keras.datasets` dataset.

### Train/test split for any dataset

If you have an arbitrary dataset, e.g. one generated with Scikit's `make_blobs` function, you likely have feature vectors (a.k.a. input samples) and corresponding targets. Often, those are assigned to variables called `X` and `y`, or `inputs` and `targets`, et cetera. For example, this is how we can create blobs of data:

```
from sklearn.datasets import make_blobs

# Configuration options
num_samples_total = 10000
cluster_centers = [(5,5), (3,3), (1,5)]
num_classes = len(cluster_centers)

# Generate data
X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.30)
```

We can then easily create a train/test split:
Using a `random_state`, we can seed the random numbers generator to make its behavior replicable. + +### Train/test split for a multilabel dataset + +Suppose that we have a [multilabel dataset](https://www.machinecurve.com/index.php/2020/11/12/how-to-create-a-multilabel-svm-classifier-with-scikit-learn/): + +``` +from sklearn.datasets import make_multilabel_classification + +# Configuration options +n_samples = 10000 +n_features = 6 +n_classes = 3 +n_labels = 2 +n_epochs = 50 +random_state = 42 + +# Generate data +X, y = make_multilabel_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes, n_labels=n_labels, random_state=random_state) +``` + +It's then also really easy to split it into a train/test dataset: + +``` + +from sklearn.model_selection import train_test_split +# Split into training and testing data +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=random_state) +``` + +Here, too, we apply a 80/20 train/test split. + +### Train/test split for HDF5 data + +In many cases, training data is available in HDF5 files - and [we can then load it using H5Py](https://www.machinecurve.com/index.php/2020/04/13/how-to-use-h5py-and-keras-to-train-with-data-from-hdf5-files/), with an example here: + +``` +import h5py + +# Load data +f = h5py.File('./data.hdf5', 'r') +X = f['image'][...] +y = f['label'][...] +f.close() +``` + +We can also then generate a train/test split as follows: + +``` + +from sklearn.model_selection import train_test_split +# Split into training and testing data +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=random_state) +``` + +### Train/test splits for a tf.keras.datasets dataset + +Did you know that TensorFlow 2.x provides a variety of datasets by default, the so-called `tf.keras.datasets` [module](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/)? + +Loading a dataset is really easy: + +``` +from tensorflow.keras.datasets import cifar10 + +# CIFAR-10 +(X_train, y_train), (X_test, y_test) = cifar10.load_data() +``` + +This loads the CIFAR10 dataset, which can be used with Computer Vision models and contains a variety of images, which look as follows. + +> The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. +> +> University of Toronto (n.d.) + +![](images/cifar10_visualized.png) + +While Keras already loads data in a train/test split fashion, you could generate an additional split - e.g. a 50/50 one - in the following way: + +``` +from sklearn.model_selection import train_test_split +# Split into training and testing data +X_one, X_two, y_one, y_two = train_test_split(X_train, y_train, test_size=0.50, random_state=random_state) +``` + +* * * + +## Summary + +In this article, we looked at generating a train/test split for your Machine Learning models. First of all, we looked at why this is necessary. We saw that training a Supervised Machine Learning model effectively means that you iteratively optimize it, and that you can over-involve spurious patterns in your training set if you continue improving based on the training set only. That's why you need a testing dataset, which is something that can be achieved with a train/test split. In the case of Neural networks, a validation set split off from the remaining training data can be useful too. + +After the theoretical part, we moved forward by looking at how to implement train/test splits with Scikit-learn and Python. 
We saw that with Scikit's `train_test_split`, generating such a split is a no-brainer. We gave examples for four settings: using any basic dataset, using a multilabel dataset, using a HDF5-loaded dataset, and using a `tensorflow.keras.datasets` driven dataset (for further splits). + +I hope that you have learned something by reading today's article. If you did, I'd love to hear from you, so please feel free to leave a message in the comments section 💬 Please do the same if you have any questions or suggestions for improvement. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +University of Toronto. (n.d.). _CIFAR-10 and CIFAR-100 datasets_. Department of Computer Science, University of Toronto. [https://www.cs.toronto.edu/~kriz/cifar.html](https://www.cs.toronto.edu/~kriz/cifar.html) + +Scikit-learn. (n.d.). _Sklearn.model\_selection.train\_test\_split — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved November 16, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.model\_selection.train\_test\_split.html](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) diff --git a/how-to-evaluate-a-keras-model-with-model-evaluate.md b/how-to-evaluate-a-keras-model-with-model-evaluate.md new file mode 100644 index 0000000..a530448 --- /dev/null +++ b/how-to-evaluate-a-keras-model-with-model-evaluate.md @@ -0,0 +1,239 @@ +--- +title: "How to evaluate a TensorFlow 2.0 Keras model with model.evaluate" +date: "2020-11-03" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "generalization" + - "keras" + - "model-evaluation" + - "model-evaluate" + - "overfitting" +--- + +Training a supervised machine learning model means that you want to achieve two things: firstly, a model that performs - in other words, that it can successfully predict what class a sample should belong to, or what value should be output for some input. Secondly, while predictive power is important, your model should also be able to generalize well. In other words, it should also be able to predict relatively correctly for input samples that it hasn't seen before. + +This often comes at a trade-off: the trade-off between underfitting and overfitting. You don't want your model to lose too much of its predictive power, i.e. being overfit. However, you neither want it to be _too good_ for the data it is trained on - causing it to be overfit, and losing its ability to generalize to data that it hasn't seen before. + +And although it may sound strange, this _can_ actually cause problems, because the training dataset and inference samples should not necessarily come from a sample with an approximately equal distribution! + +Measuring the balance between underfitting and overfitting can be done by [splitting](https://www.machinecurve.com/index.php/2020/02/18/how-to-use-k-fold-cross-validation-with-keras/) the dataset into three subsets: training data, validation data and testing data. The first two ensure that the model is trained (training data) and steered away from overfitting (validation data), while the latter can be used to test the model after it has been trained. In this article, we'll focus on the latter. + +First, we will look at the balance between underfitting and overfitting in more detail. Subsequently, we will use the `tensorflow.keras` functionality for evaluating your machine learning model, called `model.evaluate`. 
This includes a full Keras example, where we train a model and subsequently evaluate it.
+
+Let's take a look! 😎
+
+* * *
+
+\[toc\]
+
+* * *
+
+## Why evaluate Keras models?
+
+Great question - why do we need to evaluate TensorFlow/Keras models in the first place?
+
+To answer it, we must take a look at how a supervised machine learning model [is trained](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process). Following the supervised learning process linked before, we note that samples from a _training set_ are fed forward, after which an average error value is computed and subsequently used for model [optimization](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/).
+
+The samples in a training set are often derived from some kind of population. For example, if we want to measure voting behavior in a population, we often take a representative sample. We therefore don't measure the behavior of the entire population - which would be really inefficient - but instead assume that if our sample is large enough, its distribution approaches the distribution of the entire population.
+
+In other words, we generalize the smaller sample to the population.
+
+While this often leads to good results, it can also be really problematic.
+
+This emerges from the fact that we don't know whether our sample distribution is equal to the population distribution. While exact equality is hard to achieve, we should do our best to make the two distributions as equal as possible. And we cannot verify that they are equal without thorough analysis - and even then only to a limited extent, because we can only compare our sample against other, larger samples.
+
+Now, if you would train a supervised machine learning model with the training set, you would train until it is no longer **underfit**. This means that the model is capable of correctly generating predictions for the samples in your generalized population. However, we must also ensure that it is not **overfit** - meaning that it was trained _too closely_ on the distribution of your training set. If the distributions don't match, the model will show worse performance when it is used in practice.
+
+Model evaluation helps us to avoid falling into the underfitting/overfitting trap. Before training the model, we split off and set apart some data from the training set, called a **testing dataset**. Preferably, we split it off randomly - in order to ensure that the distributions of the testing set and remaining training set samples are relatively equal. After training the model, we then feed the test samples to the model. When it performs well for those samples, we can be more confident that our model can work in practice.
+
+![](images/nonlinear-1-1024x514.png)
+
+Especially models with [high variance](https://www.machinecurve.com/index.php/2020/11/02/machine-learning-error-bias-variance-and-irreducible-error-with-python/) are sensitive to overfitting.
+
+* * *
+
+## Working with model.evaluate
+
+If you look at the TensorFlow API, the `model.evaluate` functionality for model evaluation is part of the `tf.keras.Model` class, which "groups layers into an object with training and inference features" (Tf.keras.Model, n.d.). 
+ +It looks like this: + +``` +evaluate( + x=None, y=None, batch_size=None, verbose=1, sample_weight=None, steps=None, + callbacks=None, max_queue_size=10, workers=1, use_multiprocessing=False, + return_dict=False +) +``` + +With these attributes: + +- `x` and `y` representing the samples and targets of your testing data, respectively. +- The `batch_size` representing the number of samples fed through `evaluate` at once. Default, it's `None`, and then equals to 32. +- With `verbose`, it is possible to show a progress bar (`1`) or nothing (`0`). +- If you wish to increase the importance of some test scores of some samples (e.g. the first half), you can use `sample_weight` to specify an 1D or 2D array with weights in order to weigh individual samples (the 1D case) or timesteps (the 2D case). +- The `steps` represents the total number of batches before evaluating is declared finished. If the number of batches available based on the batch size (i.e. `int(len(test_data) / batch_size)`) is higher than `steps`, only `steps` batches will be fed forward. If set to `None`, it will continue until exhausted (i.e. until all batches have been fed forward). +- With `callbacks`, it is possible to attach callbacks to the evaluation process. +- If you use a generator, you can specify generator specific functionality with `max_queue_size`, `workers` and `use_multiprocessing`. +- If you want a Python dictionary instead of a Python list, you can set `return_dict` to `True` in order to let the `evaluate` function return a dictionary. + +* * * + +## A full Keras example + +Let's now take a look at creating a TensorFlow/Keras model that uses `model.evaluate` for model evaluation. + +We first create the following TensorFlow model. + +- We import the TensorFlow imports that we need. We also use the [extra\_keras\_datasets](https://github.com/christianversloot/extra_keras_datasets) module as we are training the model on the [EMNIST](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/) dataset. +- We specify some configuration options for the model. +- We load the EMNIST dataset, reshape the data (to make it compatible with TensorFlow), convert the data into `float32` format ([read here why](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/#float32-in-your-ml-model-why-its-great)), and then scale the data to the \[latex\]\[0, 1\]\[/latex\] range. +- We then create and compile the model, and fit the data, i.e. construct and complete the training process. + +Click [here](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) if you wish to understand creating a Convolutional Neural Network in more detail. 
+ +``` +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from extra_keras_datasets import emnist + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load EMNIST dataset +(input_train, target_train), (input_test, target_test) = emnist.load_data(type='digits') + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Cast numbers to float32 +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=sparse_categorical_crossentropy, + optimizer=Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +As we saw, training a model is only one step - your other task as a ML engineer is to see whether your model generalizes well. + +For this, during the loading operation, we loaded both training data and testing data. + +You can now use `model.evaluate` in order to generate evaluation scores and print them in your console. + +- We call `evaluate` on the `model` with the testing data - verbosity off, so we don't see output on the screen. +- As our main loss function is [sparse categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/) (see above) and our additional metric is accuracy, the `score` variable contains the scores in that particular other. Hence, `score[0]` represents crossentropy, and `score[1]` represents accuracy. We finally call `print()` to output the scores on screen. + +``` +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +Running the model will first train our model and subsequently print the evaluation metrics: + +``` +Test loss: 0.0175113923806377 / Test accuracy: 0.9951000213623047 +``` + +* * * + +## Keras model.evaluate if you're using a generator + +In the example above, we used `load_data()` to load the dataset into variables. This is easy, and that's precisely the goal of my Keras extensions library. However, many times, practice is a bit less ideal. In those cases, many approaches to importing your training dataset are out there. 
Three of them are, for example:
+
+- [Creating a Keras model with HDF5 files and H5Py](https://www.machinecurve.com/index.php/2020/04/13/how-to-use-h5py-and-keras-to-train-with-data-from-hdf5-files/)
+- [Creating a Keras model with HDF5 files and HDF5Matrix](https://www.machinecurve.com/index.php/2020/04/26/how-to-use-hdf5matrix-with-keras/)
+- [Creating a Keras model with data flowing from files using a generator](https://www.machinecurve.com/index.php/2020/04/06/using-simple-generators-to-flow-data-from-file-with-keras/)
+
+With the former two, you likely still end up with lists of training samples - i.e., having to load them into variables and thus in memory. For these cases, the example above can be used. But did you know that it is also possible to flow data from your system into the model? In other words, did you know that you can use a [generator](https://www.machinecurve.com/index.php/2020/04/06/using-simple-generators-to-flow-data-from-file-with-keras/) to train your machine learning model?
+
+And it is also possible to evaluate a model using `model.evaluate` if you are using a generator. Say, for example, that you are using the following generator:
+
+```
+import numpy as np
+
+# Generator that flows batches of data from a CSV file
+def generate_arrays_from_file(path, batchsize):
+    inputs = []
+    targets = []
+    batchcount = 0
+    while True:
+        with open(path) as f:
+            for line in f:
+                # Each line contains one input value and one target value
+                x, y = line.split(',')
+                inputs.append(x)
+                targets.append(y)
+                batchcount += 1
+                # Yield a batch of exactly `batchsize` samples, then start a new one
+                if batchcount >= batchsize:
+                    X = np.array(inputs, dtype='float32')
+                    y = np.array(targets, dtype='float32')
+                    yield (X, y)
+                    inputs = []
+                    targets = []
+                    batchcount = 0
+```
+
+Then you can evaluate the model by passing the generator to the evaluation function. Make sure to use a different `path` compared to your training dataset, since these need to be strictly separated.
+
+```
+# Generate generalization metrics
+score = model.evaluate(generate_arrays_from_file('./five_hundred_evaluation_samples.csv', batch_size), verbose=0)
+print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
+```
+
+Here, we would have a CSV file with five hundred evaluation samples - and we feed them forward with `batch_size` sized sample batches. In our case, that would be 2 steps for each evaluation round, as we configured `batch_size` to be 250.
+
+Note that you don't have to pass targets here, as they are obtained from the generator (Tf.keras.Model, n.d.).
+
+* * *
+
+## Summary
+
+In this article, we looked at model evaluation, and most specifically the usage of `model.evaluate` in TensorFlow and Keras. Firstly, we looked at the need for evaluating your machine learning model. We saw that it is necessary to do that because models must work in practice, and because it is easy to overfit them in some cases.
+
+We then moved forward to practice, and demonstrated how `model.evaluate` can be used to evaluate TensorFlow/Keras models based on the loss function and other metrics specified in the training process. This included an example. Another example was also provided for people who train their Keras models by means of a generator and want to evaluate them.
+
+I hope that you have learnt something from today's article! If you did, please feel free to leave a comment in the comments section 💬 I'd love to hear from you. Please do the same if you have questions or other comments. Where possible, I'd love to help you out. Thank you for reading MachineCurve today and happy engineering! 😎
+
+* * *
+
+## References
+
+_Tf.keras.Model_. (n.d.). TensorFlow. 
[https://www.tensorflow.org/api\_docs/python/tf/keras/Model#evaluate](https://www.tensorflow.org/api_docs/python/tf/keras/Model#evaluate) diff --git a/how-to-find-the-value-for-keras-input_shape-input_dim.md b/how-to-find-the-value-for-keras-input_shape-input_dim.md new file mode 100644 index 0000000..7b404d5 --- /dev/null +++ b/how-to-find-the-value-for-keras-input_shape-input_dim.md @@ -0,0 +1,295 @@ +--- +title: "How to find the value for Keras input_shape/input_dim?" +date: "2020-04-05" +categories: + - "deep-learning" + - "frameworks" +tags: + - "dataset" + - "deep-learning" + - "input-shape" + - "machine-learning" + - "rank" + - "shape" + - "tensor" +--- + +Developing a machine learning model with today's tools is much easier than it was years ago. Keras is one of the deep learning frameworks that can be used for developing deep learning models - and it's actually my lingua franca for doing so. + +One of the aspects of building a deep learning model is specifying the shape of your input data, so that the model knows how to process it. In today's blog, we'll look at precisely this tiny part of building a machine learning model with Keras. We'll answer these questions in particular: + +- What is the "shape" of any data? +- What are the `input_shape` and `input_dim` properties in Keras? +- Given an arbitrary dataset, how can we find the shape of the dataset as a whole? +- How can we convert the shape we identified into sample size, so that Keras understands it? +- How does all this come together - i.e., can we build an example of a Keras model that shows how it's done? + +Are you ready? + +Let's go! :) + +* * * + +\[toc\] + +* * * + +## The first layer in your Keras model: specifying input shape or input dim + +Here's a very simple neural network: + +![](images/Basic-neural-network.jpg) + +It has three layers. In yellow, you see the input layer. This layer is like the entry point to the layers which process the information - it often simply takes the data that you serve the network, feeding it to the hidden layers, in blue. These layers are primarily responsible for processing towards the expected end result (which could be a correct classification, for example). Then, there is the output layer, which - depending on the problem such as regression or classification - is simply one parameter or a few of them. Depending on how you configure this layer, the output can be e.g. a probability distribution over the classes that are present in your classification problem. + +Now, let's go back and take a look at the input layer. Understand that we have a neural network - which is eager to process data - and a dataset. The dataset contains samples, and often thousands or even hundreds of thousands of them. Each sample is fed to the network in sequential order. When all of them are fed, we say that _one epoch_ was completed - or, in plainer English, one iteration. + +There is an obvious connection between the input layer and each individual sample. They must be of the same shape. If you imagine a scenario where a kid has to put a square block into one of three possible holes: a square hole, a circular hole or a rectangular hole. Now, you'll immediately see what action the kid has to take: match the shape of the hole with the shape of the object. + +The same is true for input datasets. Each sample must match the shape of the input layer for the connection to be established. If both shapes aren't equal, the network cannot process the data that you're trying to feed it. 
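+As a quick illustration of this point - note that this is just a hypothetical sketch with a made-up layer size and random data, not part of the example that we will build later in this article - feeding samples whose shape does not match the `input_shape` of the first layer will make Keras raise an error:
+
+```
+import numpy as np
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense
+
+# A simple model whose input layer expects samples of shape (10,)
+model = Sequential()
+model.add(Dense(4, input_shape=(10,)))
+
+# 32 samples with 10 features each: the shapes match, so this works
+matching_samples = np.random.rand(32, 10)
+model.predict(matching_samples)
+
+# 32 samples with 8 features each: the shapes don't match, so this raises an error
+# mismatching_samples = np.random.rand(32, 8)
+# model.predict(mismatching_samples)  # ValueError about an incompatible input shape
+```
+
+The exact error message differs between Keras versions, but the cause is always the same: the shape of the individual samples does not equal the shape that the input layer expects.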
+ +With this understanding, let's now take a look at the _rank_ and the _shape_ of Tensors (or arrays) in more detail, before we continue with how Keras input layers expect to receive information about such shapes by means of the `input_shape` and `input_dim` properties. + +### The rank and shape of a Tensor (or Array, if you wish) + +Say that we have this Array: + +``` +[[1, 2, 3], [4, 5, 6]] +``` + +Which, if fed to a framework that runs on top of TensorFlow, is converted into Tensor format - which is TensorFlow's representation for numeric data (TensorFlow, n.d.) + +Now, we can distinguish between _rank_ and _shape_ (TensorFlow, n.d.). The distinction is simple: + +- The **rank** of a Tensor represents the _number of dimensions_ for your Tensor. +- The **shape** of a Tensor represents the _number of samples within each dimension_. + +Tensors can be multidimensional. That is, they are representations in "some" mathematical space. Just like we can position ourselves at some (x, y, z) position in 3D space and compare our position with someone else's, Tensors are representations in some space. From this, and TensorFlow (n.d.), it follows that: + +- A **rank-0** **Tensor** is a scalar value; a number, that has magnitude, but no direction. +- A **rank-1 Tensor** is a vector; it has magnitude _and_ direction; +- A **rank-2 Tensor** is a matrix; it is a table of numbers; +- A **rank-3 Tensor** is a cube of numbers. + +[![](images/rankshape.png)](https://www.machinecurve.com/wp-content/uploads/2020/04/rankshape.png) + +From the image above, what follows with respect to shape: + +- There's no shape for the rank-0 Tensor, because it has no dimensions. The shape would hence be an empty array, or `[]`. +- The rank-1 Tensor has a shape of `[3]`. +- The rank-2 Tensor has a shape of `[3, 6]`: three rows, six columns. +- The rank-3 Tensor has a shape of `[2, 2, 2]`: each axis has so many elements. + +### Keras input layers: the `input_shape` and `input_dim` properties + +Now that we know about the rank and shape of Tensors, and how they are related to neural networks, we can go back to Keras. More specifically, let's take a look at how we can connect the _shape of your dataset_ to the input layer through the `input_shape` and `input_dim` properties. + +Let's begin with `input_shape`: + +``` +model = Sequential() +model.add(Dense(4, input_shape=(10,)) +``` + +Here, the input layer would expect a one-dimensional array with 10 elements for input. It would produce 4 outputs in return. + +#### Input shape + +It's actually really simple. The input shape parameter simply tells the input layer **what the shape of one sample looks like** (Keras, n.d.). Adding it to your input layer, will ensure that a match is made. + +#### Input dim + +Sometimes, though, you just have one dimension - which is the case with one-dimensional / flattened arrays, for example. In this case, you can also simply use `input_dim`: specifying the number of elements within that first dimension only. For example: + +``` +model = Sequential() +model.add(Dense(32, input_dim=784)) +``` + +This would make the input layer expect a one-dimensional array of 784 elements as each individual sample. It would produce 32 outputs. This is the kind of information bottleneck that we often want to see! + +* * * + +## Using Numpy to find the shape of your dataset + +Now, suppose that I'm loading an example dataset - such as the MNIST dataset from the [Keras Datasets](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/). 
+ +That would be something like this: + +``` +from tensorflow.keras.datasets import mnist +(x_train, y_train), (x_test, y_test) = mnist.load_data() +``` + +Now, how can we find the _shape_ of the dataset? + +Very simple - we can use the [Numpy](https://numpy.org/) package used for numbers processing! + +Let's add this import to the top: + +``` +import numpy as np +``` + +And then we add this to the bottom: + +``` +training_set_shape = x_train.shape +print(training_set_shape) +``` + +Yielding this as a whole: + +``` +import numpy as np +from tensorflow.keras.datasets import mnist +(x_train, y_train), (x_test, y_test) = mnist.load_data() +training_set_shape = x_train.shape +print(training_set_shape) +``` + +Let's now run it and see what happens. + +``` +$ python datasettest.py +2020-04-05 19:22:27.146991: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll +(60000, 28, 28) +``` + +Et voila: a shape of `(60000, 28, 28)`. From this, we can derive that we have 60.000 samples - of 28 x 28 pixels. As the number of image channels is not present, we can assume that it's 1 - and that the images thus must be grayscale. There we go! + +* * * + +## Altering the shape to sample level + +Unfortunately, we're not there yet. We cannot use this shape as our `input_shape`. This latter has to be the input shape of _one sample_, remember? Not the shape of the dataset as a whole. + +Now, from the `(60000, 28, 28)`, which elements contribute to our knowledge about the shape at sample level? + +Indeed, the 28 and 28 - while the 60.000 is not of interest (after all, at sample level, this would be 1). + +Now, with images, we would often use Convolutional Neural Networks. In those models, we use [Conv](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) layers, which expect the `input_shape` in a very specific way. Specifically, they expect it as follows: `(x_shape, y_shape, channels)`. We already have `x_shape` and `y_shape`, which are both 28. We don't have `channels` yet, but do know about its value: 1. By consequence, our value for `input_shape` will be `(28, 28, 1)`! + +However, we can also automate this, for the case when we want to use a different image dataset. We simply add the following: + +``` +number_of_channels = 1 +sample_shape = (training_set_shape[1], training_set_shape[2], number_of_channels) +``` + +We could even expand on our prints: + +``` +print(f'Dataset shape: {training_set_shape}') +print(f'Sample shape: {sample_shape}') +``` + +Indeed, it would yield the same output: + +``` +$ python datasettest.py +2020-04-05 19:28:28.235295: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll +Dataset shape: (60000, 28, 28) +Sample shape: (28, 28, 1) +``` + +* * * + +## A Keras example + +Now that we know about Tensor shapes, their importance for neural network input layers, and how to derive the sample shape for a dataset, let's now see if we can expand this to a real Keras model. + +For this, we'll be analyzing the [simple two-dimensional ConvNet](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) that we created in a different blog post. 
+ +Here is the code - you can find the analysis below it: + +``` +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 32, 32, 3 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 25 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Load CIFAR-10 data +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Display a model summary +model.summary() + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +Specifically, we can observe: + +- That the `img_width` and `img_height` are 32. This is correct, as we're now using a different dataset - see `cifar10.load_data()` - where the images are 32 x 32 pixels. +- The value for `img_num_channels` was set to 3. This is also correct, because the CIFAR10 dataset contains RGB images - which have three image channels. So no 1 anymore - and our final sample shape will be `(32, 32, 3)`. +- We subsequently set the comuted `input_shape` as the `input_shape` of our first Conv2D layer - specifying the input layer implicitly (which is just how it's done with Keras). + +There we go - we can now actually determine the input shape for our data and use it to create Keras models! 😎 + +* * * + +## Summary + +In this blog post, we've looked at the Keras `input_shape` and `input_dim` properties. We did so in quite a chained way, by first looking at the link between neural network input layers and the shape of your dataset - and specifically, the shape at sample level. + +Additionally, we looked at the concepts of rank and shape in order to understand the foundations of a layer's input shape / dim in the first place. + +Then, we looked at how the Keras framework for deep learning implements specifying the input shape / dimension by means of the beforementioned properties. This included looking at how to determine the input shape for your dataset at dataset level, converting it into sample level shape, and subsequently using it in an actual Keras model. We provided a simple example by means of a ConvNet implementation. I really hope that it helps you build your Keras models - as I know that it's often these simple steps that get you stuck! 
+ +If you have any questions, remarks, or other comments - feel free to drop a message in the comments section below! 😎 I'll happily answer and help you build your Keras model. Thank you for reading MachineCurve today and happy engineering! + +\[kerasbox\] + +* * * + +## References + +_TensorFlow tensors_. (n.d.). TensorFlow. [https://www.tensorflow.org/guide/tensor#top\_of\_page](https://www.tensorflow.org/guide/tensor#top_of_page) + +Keras. (n.d.). _Guide to the sequential model_. [https://keras.io/getting-started/sequential-model-guide/](https://keras.io/getting-started/sequential-model-guide/) diff --git a/how-to-generate-a-summary-of-your-keras-model.md b/how-to-generate-a-summary-of-your-keras-model.md new file mode 100644 index 0000000..e4ae14e --- /dev/null +++ b/how-to-generate-a-summary-of-your-keras-model.md @@ -0,0 +1,234 @@ +--- +title: "How to generate a summary of your Keras model?" +date: "2020-04-01" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "keras" + - "model-summary" + - "neural-network" + - "summary" +--- + +When building a neural network with the Keras framework for deep learning, I often want to have a quick and dirty way of checking whether everything is all right. That is, whether my layers output data correctly, whether my parameters are in check, and whether I have a good feeling about the model as a whole. + +Keras model summaries help me do this. They provide a text-based overview of what I've built, which is especially useful when I have to add symmetry such as with autoencoders. But how to create these summaries? And why are they so useful? We'll discover this in today's blog post. + +Firstly, we'll look at some high-level building blocks which I usually come across when I build neural networks. Then, we continue by looking at how Keras model summaries help me during neural network development. Subsequently, we generate one ourselves, by adding it to an example Keras ConvNet. This way, you'll be able to generate model summaries too in your Keras models. + +Are you ready? Let's go! 😊 + +* * * + +\[toc\] + +* * * + +## High-level building blocks of a Keras model + +I've created quite a few neural networks over the past few years. While everyone has their own style in creating them, I always see a few high-level building blocks return in my code. Let's share them with you, as it will help you understand the model with which we'll be working today. + +First of all, you'll always state **the imports of your model**. For example, you import Keras - today often as `tensorflow.keras.something`, but you'll likely import Numpy, Matplotlib and other libraries as well. + +Next, and this is entirely personal, you'll find the **model configuration**. The model compilation and model training stages - which we'll cover soon - require configuration. This configuration is then spread across a number of lines of code, which I find messy. That's why I always specify a few Python variables storing the model configuration, so that I can refer to those when I actually configure the model. + +Example variables are the batch size, the size of your input data, your [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/), the [optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) that you will use, and so on. + +Once the model configuration was specified, you'll often **load and preprocess your dataset**. 
Loading the dataset can be done in a multitude of ways - you can load data from file, you can use the [Keras datasets](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/), it doesn't really matter. Below, we'll use the latter scenario. Preprocessing is done in a minimal way - in line with the common assumption within the field of deep learning that models will take care of feature extraction themselves as much as possible - and often directly benefits the training process. + +Once data is ready, you next **specify the architecture of your neural network**. With Keras, you'll often use the Sequential API, because it's easy. It allows you to stack individual layers on top of each other simply by calling `model.add`. + +Specifying the architecture actually means creating the skeleton of your neural network. It's a design, rather than an actual model. To make an actual model, we move to the **model compilation** step - using `model.compile`. Here, we actually _instantiate_ the model with all the settings that we configured before. Once compiled, we're ready to start training. + +**Starting the training process** is what we finally do. By using `model.fit`, we fit the dataset that we're training with to the model. The training process should now begin as configured by yourself. + +Finally, once training has finished, you wish to **evaluate** the model against data that it hasn't yet seen - to find out whether it _really_ performs and did not simply [overfit](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/) to your training set. We use `model.evaluate` for this purpose. + +* * * + +## Model summaries + +...what is lacking, though, is some quick and dirty information about your model. Can't we generate some kind of summary? + +Unsurprisingly, we can! 😀 It would look like this: + +``` +Model: "sequential" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +conv2d (Conv2D) (None, 30, 30, 32) 896 +_________________________________________________________________ +conv2d_1 (Conv2D) (None, 28, 28, 64) 18496 +_________________________________________________________________ +conv2d_2 (Conv2D) (None, 26, 26, 128) 73856 +_________________________________________________________________ +flatten (Flatten) (None, 86528) 0 +_________________________________________________________________ +dense (Dense) (None, 128) 11075712 +_________________________________________________________________ +dense_1 (Dense) (None, 10) 1290 +================================================================= +Total params: 11,170,250 +Trainable params: 11,170,250 +Non-trainable params: 0 +_________________________________________________________________ +``` + +There are multiple benefits that can be achieved from generating a model summary: + +- Firstly, you have that quick and dirty overview of the components of your Keras model. The names of your layers, their types, as well as the shape of the data that they output and the number of trainable parameters. +- Secondly, with respect to the shape of your output data, this is beneficial if - for example - you have a mismatch somewhere. 
This can happen in the case of an [autoencoder](https://www.machinecurve.com/index.php/2019/12/19/creating-a-signal-noise-removal-autoencoder-with-keras/), where you effectively link two funnels together in order to downsample and upsample your data. As you want to have perfect symmetry, model summaries can help here. +- Thirdly, with respect to the number of parameters, you can make a guess as to where overfitting is likely and why/where you might face computational bottlenecks. The more trainable parameters your model has, the more computing power you need. What's more, if you provide an overkill of trainable parameters, your model might also be more vulnerable to overfitting, especially when the total size of your model or the size of your dataset does not account for this. + +Convinced? Great 😊 + +* * * + +## Generating a model summary of your Keras model + +Now that we know some of the high-level building blocks of a Keras model, and know how summaries can be beneficial to understand your model, let's see if we can actually generate a summary! + +For this reason, we'll give you an example [Convolutional Neural Network](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) for two-dimensional inputs. Here it is: + +``` +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 32, 32, 3 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 25 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Load CIFAR-10 data +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +Clearly, all the high-level building blocks are visible: + +- The imports speak for themselves. +- The model configuration variables tell us that we'll be using [sparse categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/) and the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/). We will train for ten epochs (or iterations) and feed the model 50 samples at once. 
+- We load the CIFAR10 dataset, which contains everyday objects - see below for some examples. Once it's loaded, we do two three things: firstly, we'll determine the shape of our data - to be used in the first model layer. Secondly, we cast the numbers into `float32` format, which might speed up the training process when you are using a GPU powered version of Keras. Thirdly, and finally, we scale the data, to ensure that we don't face massive weight swings during the optimization step after each iteration. As you can see, we don't really do feature engineering _in terms of the features themselves_, but rather, we do some things to benefit the training process. +- We next specify the model architecture: three [Conv2D layers](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) for feature extraction, followed by a Flatten layer, as our Dense layers - which serve to generate the classification - can only handle one-dimensional data. +- Next, we compile the skeleton into an actual model and subsequently start the training process. +- Once training has finished, we evaluate and show the evaluation on screen. + +[![](images/cifar10_images.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/cifar10_images.png) + +Now, how to add that summary? + +Very simple. + +Add `model.summary()` to your code, perhaps with a nice remark, like `# Display a model summary`. Like this: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Display a model summary +model.summary() + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) +``` + +Running the model again then nicely presents you the model summary: + +``` +Model: "sequential" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +conv2d (Conv2D) (None, 30, 30, 32) 896 +_________________________________________________________________ +conv2d_1 (Conv2D) (None, 28, 28, 64) 18496 +_________________________________________________________________ +conv2d_2 (Conv2D) (None, 26, 26, 128) 73856 +_________________________________________________________________ +flatten (Flatten) (None, 86528) 0 +_________________________________________________________________ +dense (Dense) (None, 128) 11075712 +_________________________________________________________________ +dense_1 (Dense) (None, 10) 1290 +================================================================= +Total params: 11,170,250 +Trainable params: 11,170,250 +Non-trainable params: 0 +_________________________________________________________________ +``` + +Nice! 🎆 + +* * * + +## Summary + +In this blog post, we looked at generating a model summary for your Keras model. This summary, which is a quick and dirty overview of the layers of your model, display their output shape and number of trainable parameters. Summaries help you debug your model and allow you to immediately share the structure of your model, without having to send all of your code. 
+ +For this to work, we also looked at some high-level components of a Keras based neural network that I often come across when building models. Additionally, we provided an example ConvNet to which we added a model summary. + +Although it's been a relatively short blog post, I hope that you've learnt something today! If you did, or didn't, or when you have questions/remarks, please leave a comment in the comments section below. I'll happily answer your comment and improve my blog post where necessary. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +_Keras_. (n.d.). Utils: Model Summary. [https://keras.io/utils/#print\_summary](https://keras.io/utils/#print_summary) diff --git a/how-to-label-your-dataset-for-yolo-object-detection.md b/how-to-label-your-dataset-for-yolo-object-detection.md new file mode 100644 index 0000000..5a4ca30 --- /dev/null +++ b/how-to-label-your-dataset-for-yolo-object-detection.md @@ -0,0 +1,186 @@ +--- +title: "How to label your dataset for YOLO Object Detection" +date: "2021-03-30" +categories: + - "buffer" + - "frameworks" +tags: + - "object-detection" + - "yolo" + - "yololabel" + - "you-only-look-once" +--- + +The YOLO object detector family - where YOLO stands for You Only Look Once - is one of the most widely known and used types of Object Detectors. Already at the fifth version in early 2021, YOLO can be trained relatively easily and is optimized for speed - hence looking _once_. + +Training your own YOLO object detector requires that you provide a labeled dataset. In this tutorial, we're going to take a look at how you can do that. Using a tool called YoloLabel, which works on Windows and macOS, you will learn how you can generate bounding boxes for your own YOLO object detection model. + +After reading it, you will know... + +- **How YoloLabel can be used for performing your labeling task.** +- **How YoloLabel is installed.** +- **What the structure of a YOLO label file is.** + +Let's take a look! 🚀 + +* * * + +\[toc\] + +* * * + +## Using YoloLabel for generating labels + +Let's take a look at how you can use [YoloLabel](https://github.com/developer0hye/Yolo_Label) for generating bounding boxes for your YOLO object detection dataset. YoloLabel is a tool specifically designed to optimize the labeling process. Rather than using a so-called "drag and drop" method, which is implemented by many labeling tools, YoloLabel favors a "click twice" approach where you click to start generating a bounding box, and click again to stop doing that. + +See for yourself how this reduces strain on your arm: + +- ![](images/48674135-6fe49400-eb8c-11e8-963c-c343867b7565.gif) + +- ![](images/48674136-71ae5780-eb8c-11e8-8d8f-8cb511009491.gif) + + +Instead of "drag and drop", YoloLabel implements "twice click" for labeling. This method is more convenient than the first. Source: [YoloLabel](https://github.com/developer0hye/Yolo_Label), Copyright (c) 2019 Yonghye Kwon, images licensed according to [MIT License](https://github.com/developer0hye/Yolo_Label/blob/master/LICENSE), no changes made. + +### Installing YoloLabel + +YoloLabel runs on both Windows and macOS machines. Installation instructions can be found [here](https://github.com/developer0hye/Yolo_Label#install-and-run). + +### Performing your labeling task + +Let's now take a look at labeling some data. This involves a few steps: + +1. Opening YoloLabel +2. Loading the dataset and label file +3. Labeling your data +4. 
Possibly switching between label classes + +Let's start with opening YoloLabel. + +#### Opening YoloLabel + +If you have installed YoloLabel, you either have a file called `YoloLabel.exe` (on Windows) or `YoloLabel.app` available (on macOS). Double click this file. The following window should open: + +![](images/image-13-1024x624.png) + +You can see that the window is divided in three main blocks. + +The top left block will display the images that must be labeled. The right block will show the label classes and their color, while the lower block provides control blocks - such as selecting which dataset to open. + +From top to bottom, left to right, this is what the blocks do: + +- **Prev:** go back to the previous image. +- **Next:** go forward to the next image. +- **Slider:** manually pick an image to label. +- **Progress bar:** see how many images you have already labeled, and how many images are in the dataset in total. +- **Open Files:** load a dataset and label file for labeling. +- **Change Directory:** open a new dataset and label file for labeling. +- **Save:** save all bounding boxes generated in the current image. +- **Remove:** remove the _image_ from the dataset. + - Removing a bounding box can be done by performing a right click on the bounding box you want to remove. + +#### Loading the dataset and label file + +Say that we have a dataset that includes the following two pictures: + +- [![](images/pexels-ashley-williams-685382-1024x604.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/03/pexels-ashley-williams-685382.jpg) + +- [![](images/pexels-helena-lopes-1015568-1024x683.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/03/pexels-helena-lopes-1015568.jpg) + + +Source: pexels.com, Pexels License + +And that it is available at some location, say `C:/Users/chris/MachineCurve/labeling/images`. + +Clicking **Open Files** will require you to open the directory where your images are saved: + +- ![](images/step_1.png) + +- ![](images/step_2.png) + + +However, after doing so, it will also request that you provide a _Label List File_ - which is a `.txt` file or a `.names` file. + +A label list file contains all the labels that you want to possibly attach to your bounding boxes - and hence represents all the classes that can be present in an image. In other words, they are your target classes. + +A `labels.txt` file contains all class labels that you want to use in your labeling task, one per line: + +``` +human +animal +vehicle +plant +traffic light +``` + +Note that with YOLO, classes are not represented by text - but rather, by index. In other words, `human` is class `0`, `animal` class `1`, and so on. Make sure to take this into account when handling the labels or fusing the labels from two possibly different datasets! + +#### Labeling your data + +Once you complete these two steps, you will see the first image in your dataset and two lines - a vertical and a horizontal one - following your mouse cursor. You will also see the labels from your `labels.txt` file on the right, each having a unique color. + +It's now time to label your data. + +![](images/image-14-1024x638.png) + +Labeling is really easy. It's simply drawing a bounding box around the objects in the image. For example, for the women playing football/soccer (depending on what country you're from ⚽), we can label as follows: + +#### Switching between label classes + +Sometimes, more than just one class is visible within an image. 
For example, for this photo from a cafe, you can see that a human is there - but that the same is true for plants. In that case, you may need to switch between classes. + +Which can also be done easily by simply clicking the class you currently want to label for. + +![](images/image-15.png) + +And voila, you have now labeled a human being and a few plants: + +![](images/image-16-1024x636.png) + +### Inspecting your label files + +Now that we have generated some labels, we can take a look at how YoloLabel converts them into label files. These label files contain all the information YOLO needs for understanding where particular objects are in an input image. + +Let's go back to the women: + +![](images/pexels-ashley-williams-685382-1024x604.jpg) + +This is the label file: + +``` +0 0.558203 0.501340 0.080469 0.294906 +0 0.357031 0.492627 0.162500 0.435657 +0 0.216016 0.502681 0.194531 0.434316 +``` + +Interesting :) + +It will be much easier to understand what the numbers above mean if you knew that they represent `class - center_x - center_y - width - height`. In other words, the class number, the horizontal position of the center of its bounding box, the vertical position of the center of its bounding box, its width and its height. + +All values are relative, meaning that e.g. `0.558203` for `center_x` means that the center is at `55.8203%` of the image width. And indeed: the goal keeper is at approximately 55% of the image in terms of width, and indeed at approximately 50% of the screen. The most left woman is at approximately 22% width and 50% height. And the middle one takes approximately 16.3% of the screen's width and 43.6% of the screen's height. + +This way, it no longer matters if YOLO resizes the image - it still knows where the bounding boxes are. + +Great stuff! 😎 + +* * * + +## Summary + +You Only Look Once, or YOLO, is a family of object detection algorithms that is highly popular today. Training your own YOLO model means that you will need to provide a labeled dataset. In this tutorial, you have seen how you can use a tool called YoloLabel for doing that. You now know... + +- **How YoloLabel can be used for performing your labeling task.** +- **How YoloLabel is installed.** +- **What the structure of a YOLO label file is.** + +I hope that it was useful for your learning process! Please feel free to share what you have learned in the comments section 💬 I’d love to hear from you. Please do the same if you have any questions or other remarks. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +GitHub. (n.d.). _Developer0hye/Yolo\_Label_. [https://github.com/developer0hye/Yolo\_Label](https://github.com/developer0hye/Yolo_Label) + +GitHub. (n.d.). _No CSV export · Issue #39 · developer0hye/Yolo\_Label_. [https://github.com/developer0hye/Yolo\_Label/issues/39](https://github.com/developer0hye/Yolo_Label/issues/39) diff --git a/how-to-normalize-or-standardize-a-dataset-in-python.md b/how-to-normalize-or-standardize-a-dataset-in-python.md new file mode 100644 index 0000000..090c566 --- /dev/null +++ b/how-to-normalize-or-standardize-a-dataset-in-python.md @@ -0,0 +1,360 @@ +--- +title: "How to Normalize or Standardize a Dataset in Python?" 
+date: "2020-11-19" +categories: + - "frameworks" + - "svms" +tags: + - "data-preprocessing" + - "dataset" + - "deep-learning" + - "feature-scaling" + - "machine-learning" + - "normalization" + - "preprocessing" + - "standardization" +--- + +Training a Supervised Machine Learning model involves feeding forward data from a training dataset, through the model, generating predictions. These predictions are then compared with what is known as the _ground truth_, or the corresponding targets for the training data. Subsequently, the model is improved, by minimizing a cost, error or [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/). + +It is important to prepare your dataset before feeding it to your model. When you pass through data without doing so, the model may show some very interesting behavior - and training can become really difficult, if not impossible. In those cases, when inspecting your model code, it could very well be the case that you forgot to apply **normalization** or **standardization**. What are they? Why are they necessary? And how do they work? Precisely that is what we will look at in this article. + +Firstly, we will take a look at why you need a normalized or standardized dataset. Subsequently, we'll move forward and see how those techniques actually work. Finally, we give a lot of step-by-step examples by using Scikit-learn and Python for making your dataset ready for Machine Learning models. + +Let's take a look! :) + +**Update 08/Dec/2020:** added references to PCA article. + +* * * + +\[toc\] + +* * * + +## Normalization and Standardization for Feature Scaling + +Before studying the _what_ of something, I always think that it helps studying the _why_ first. At least, it makes you understand why you have to apply certain techniques or methods. The same is true for Normalization and Standardization. Why are they necessary? Let's take a look at this in more detail. + +### They are required by Machine Learning algorithms + +When you are training a Supervised Machine Learning model, you are feeding forward data through the model, generating predictions, and subsequently improving the model. As you read in the introduction, this is achieved by minimizing a cost/error/loss function, and it allows us to optimize models in their unique ways. + +For example, a [Support Vector Machine](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) is optimized by finding support vectors that support the decision boundary with the greatest margin between two classes, effectively computing a distance metric. Neural networks use [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) for optimization, which involves walking down the loss landscape into the direction where loss improves most. And there are many other ways. Now, here are some insights about why datasets must be scaled for Machine Learning algorithms (Wikipedia, 2011): + +- Gradient descent converges much faster when the dataset is scaled. +- If the model depends on measuring distance (think SVM), the distances are comparable after the dataset was scaled. In fact, if it is _not_ scaled, computation of the loss can be "governed by this particular feature" if the feature has a really big scale compared to other features (Wikipedia, 2011). 
+- If you apply [regularization](https://www.machinecurve.com/index.php/2020/01/26/which-regularizer-do-i-need-for-training-my-neural-network/), you must also apply scaling, because otherwise some features may be penalized more than strictly necessary.
+
+### They help Feature Selection too
+
+Suppose that we are given a dataset of a **runner's diary** and that our goal is to learn a predictive model between some of the variables and runner performance. What we would normally do in those cases is perform a feature selection procedure, because we cannot simply feed the model every feature, for two reasons:
+
+1. **The curse of dimensionality:** if we look at our dataset as a _feature space_ with each feature (i.e., column) representing one dimension, our space would be multidimensional if we use many features. The more dimensions we add, the more training data we need; this need increases exponentially. By consequence, although we should use sufficient features, we don't want to use every one of them.
+2. **We don't want to use features that contribute insignificantly.** Some features (columns) contribute to the output less significantly than others. It could be that, when they are removed, the model still performs well - but at a significantly lower computational cost. We therefore want to be able to select the features that contribute most significantly.
+
+> In machine learning problems that involve learning a "state-of-nature" from a finite number of data samples in a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required to ensure that there are several samples with each combination of values.
+>
+> Wikipedia (n.d.) about the curse of dimensionality
+
+We would e.g. apply algorithms such as [_Principal Component Analysis_ (PCA)](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/) to help us determine which features are most important. If we look at how these algorithms work, we see that e.g. PCA extracts new features based on the _principal directions_ in the dataset, i.e. the directions in your data where variance is largest (Scikit-learn, n.d.).
+
+> Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from their average value.
+>
+> Wikipedia (2001)
+
+Let's keep this in mind when looking at the following dataset:
+
+![](images/diary.png)
+
+Here, the variance of the variable _Time offset_ is larger than that of the variable _Distance run_.
+
+PCA will therefore naturally select the Time offset variable over the Distance run variable, because the [eigenpairs](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/) are more significant there.
+
+However, this does not necessarily mean that it _is_ in fact more important - because we cannot compare the variances directly. Only if the variances are comparable, and hence the scales are equal in the _unit they represent_, can we confidently use algorithms like PCA for feature selection. That's why we must find a way to make our variables comparable.
+
+### Introducing Feature Scaling
+
+And, speaking most generally, that method is called **feature scaling** - and it is applied during the data preprocessing step.
+
+> Feature scaling is a method used to normalize the range of independent variables or features of data. 
In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. +> +> Wikipedia (2011) + +There are two primary ways for feature scaling which we will cover in the remainder of this article: + +- **Rescaling**, or _min-max normalization:_ we scale the data into one of two ranges: \[latex\]\[0, 1\]\[/latex\] or \[latex\]\[a, b\]\[/latex\], often \[latex\]\[-1, 1\]\[/latex\]. +- **Standardization**, or _Z-score normalization_: we scale the data so that the mean is zero and variance is 1. + +Let's now cover each of the three methods in more detail, find out how they work, and identify when they are used best. + +* * * + +## Rescaling (min-max normalization) + +Rescaling, or **min-max normalization**, is a simple method for bringing your data into one out of two ranges: \[latex\]\[0, 1\]\[/latex\] or \[latex\]\[a, b\]\[/latex\]. It highly involves the minimum and maximum values from the dataset in normalizing the data. + +### How it works - the \[0, 1\] way + +Suppose that we have the following array: + +``` +dataset = np.array([1.0, 12.4, 3.9, 10.4]) +``` + +Min-max normalization for the range \[latex\]\[0, 1\]\[/latex\] can be defined as follows: + +``` +normalized_dataset = (dataset - min(dataset)) / (max(dataset) - min(dataset)) +``` + +In a naïve way, using Numpy, we can therefore normalize our data into the \[latex\]\[0, 1\]\[/latex\] range in the following way: + +``` +import numpy as np +dataset = np.array([1.0, 12.4, 3.9, 10.4]) +normalized_dataset = (dataset - np.min(dataset)) / (np.max(dataset) - np.min(dataset)) +print(normalized_dataset) +``` + +This indeed yields an array where the lowest value is now `0.0` and the biggest is `1.0`: + +``` +[0. 1. 0.25438596 0.8245614 ] +``` + +### How it works - the \[a, b\] way + +If instead we wanted to scale it to some other arbitrary range - say \[latex\]\[0, 1.5\]\[/latex\], we can apply min-max normalization but then for the \[latex\]\[a, b\]\[/latex\] range, where \[latex\]a\[/latex\] and \[latex\]b\[/latex\] can be chosen yourself. + +We can use the following formula for normalization: + +``` +normalized_dataset = a + ((dataset - min(dataset)) * (b - a) / (max(dataset) - min(dataset))) +``` + +Or, for the dataset from the previous section, using a naïve Python implementation: + +``` +import numpy as np +a = 0 +b = 1.5 +dataset = np.array([1.0, 12.4, 3.9, 10.4]) +normalized_dataset = a + ((dataset - np.min(dataset)) * (b - a) / (np.max(dataset) - np.min(dataset))) +print(normalized_dataset) +``` + +Which yields: + +``` +[0. 1.5 0.38157895 1.23684211] +``` + +### Applying the MinMaxScaler from Scikit-learn + +Scikit-learn, the popular machine learning library used frequently for training many _traditional_ Machine Learning algorithms provides a module called `MinMaxScaler`, and it is part of the `sklearn.preprocessing` API. + +It allows us to fit a scaler with a predefined range to our dataset, and subsequently perform a transformation for the data. The code below gives an example of how to use it. + +- We import `numpy` as a whole and the `MinMaxScaler` from `sklearn.preprocessing`. +- We define the NumPy array that we just defined before, but now, we have to reshape it: `.reshape(-1, 1)`. This is a Scikit-learn requirement for arrays with just one feature per array item (which in our case is true, because we are using scalar values). +- We then initialize the `MinMaxScaler` and here we also specify our \[latex\]\[a, b\]\[/latex\] range: `feature_range=(0, 1.5)`. 
Of course, as \[latex\]\[0, 1\]\[/latex\] is also an \[latex\]\[a, b\]\[/latex\] range, we can implement that one as well using `MinMaxScaler`. +- We then fit the data to our scaler, using `scaler.fit(dataset)`. This way, it becomes capable of transforming datasets. +- We finally transform the `dataset` using `scaler.transform(dataset)` and print the result. + +``` +import numpy as np +from sklearn.preprocessing import MinMaxScaler +dataset = np.array([1.0, 12.4, 3.9, 10.4]).reshape(-1, 1) +scaler = MinMaxScaler(feature_range=(0, 1.5)) +scaler.fit(dataset) +normalized_dataset = scaler.transform(dataset) +print(normalized_dataset) +``` + +And indeed, after printing, we can see that the outcome is the same as obtained with our naïve approach: + +``` +[[0. ] + [1.5 ] + [0.38157895] + [1.23684211]] +``` + +* * * + +## Standardization (Z-scale normalization) + +In the previous example, we normalized our dataset based on the minimum and maximum values. Mean and standard deviation are however not _standard,_ meaning that the mean is zero and that the standard deviation is one. + +``` +print(normalized_dataset) +print(np.mean(normalized_dataset)) +print(np.std(normalized_dataset)) +``` + +``` +[[0. ] + [1.5 ] + [0.38157895] + [1.23684211]] +0.7796052631578947 +0.611196249385709 +``` + +Because the bounds of our normalizations would not be equal, it would still be (slightly) unfair to compare the outcomes e.g. with [PCA](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/). + +For example, if we used a different dataset, our results would be different: + +``` +import numpy as np +from sklearn.preprocessing import MinMaxScaler +dataset = np.array([2.4, 6.2, 1.8, 9.0]).reshape(-1, 1) +scaler = MinMaxScaler(feature_range=(0, 1.5)) +scaler.fit(dataset) +normalized_dataset = scaler.transform(dataset) +print(normalized_dataset) +print(np.mean(normalized_dataset)) +print(np.std(normalized_dataset)) + +[[0.125 ] + [0.91666667] + [0. ] + [1.5 ]] +0.6354166666666665 +0.6105090942538584 +``` + +This is where **standardization** or _Z-score normalization_ comes into the picture. Rather than using the minimum and maximum values, we use the mean and standard deviation from the data. By consequence, all our features will now have zero mean and unit variance, meaning that we can now compare the variances between the features. + +### How it works + +The formula for standardization is as follows: + +``` +standardized_dataset = (dataset - mean(dataset)) / standard_deviation(dataset)) +``` + +In other words, for each sample from the dataset, we subtract the mean and divide by the standard deviation. By removing the mean from each sample, we effectively move the samples towards a mean of 0 (after all, we removed it from all samples). In addition, by dividing by the standard deviation, we yield a dataset where the values describe _by how much of the standard deviation_ they are offset from the mean. + +### Python example + +This can also be implemented with Python: + +``` +import numpy as np +dataset = np.array([1.0, 2.0, 3.0, 3.0, 3.0, 2.0, 1.0]) +standardized_dataset = (dataset - np.average(dataset)) / (np.std(dataset)) +print(standardized_dataset) +``` + +Which yields: + +``` +[-1.37198868 -0.17149859 1.02899151 1.02899151 1.02899151 -0.17149859 + -1.37198868] +``` + +In Scikit-learn, the `sklearn.preprocessing` module provides the `StandardScaler` which helps us perform the same action in an efficient way. 
+ +``` +import numpy as np +from sklearn.preprocessing import StandardScaler +dataset = np.array([1.0, 2.0, 3.0, 3.0, 3.0, 2.0, 1.0]).reshape(-1, 1) +scaler = StandardScaler() +scaler.fit(dataset) +standardized_dataset = scaler.transform(dataset) +print(standardized_dataset) +print(np.mean(standardized_dataset)) +print(np.std(standardized_dataset)) +``` + +With as outcome: + +``` +[[-1.37198868] + [-0.17149859] + [ 1.02899151] + [ 1.02899151] + [ 1.02899151] + [-0.17149859] + [-1.37198868]] +3.172065784643304e-17 +1.0 +``` + +We see that the mean is _really_ close to 0 (\[latex\]3.17 \\times 10^{-17}\[/latex\]) and that standard deviation is one. + +* * * + +## Normalization vs Standardization: when to use which one? + +Many people have the question **when to use normalization, and when to use standardization?** This is a valid question - and I had it as well. + +Most generally, the rule of thumb would be to **use min-max normalization if you want to normalize the data while keeping some differences in scales (because units remain different), and use standardization if you want to make scales comparable (through standard deviations)**. + +The example below illustrates the effects of standardization. In it, we create Gaussian data, stretch one of the axes with some value to make them relatively incomparable, and plot the data. This clearly indicates the stretched blobs in an absolute sense. Then, we use standardization and plot the data again. We now see that both the mean has moved to \[latex\](0, 0)\[/latex\] _and_ that when the data is standardized, the variance of the axes is pretty similar! + +If we hadn't applied feature scaling here, algorithms like [PCA](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/) would have pretty much fooled us. ;-) + +``` +# Imports +import matplotlib.pyplot as plt +from sklearn.datasets import make_gaussian_quantiles +from sklearn.preprocessing import StandardScaler + +# Make Gaussian data +plt.title("Gaussian data, two classes, mean at (2,3)") +X1, Y1 = make_gaussian_quantiles(n_features=2, n_classes=2, n_samples=1000, mean=(2,3)) + +# Stretch one of the axes +X1[:, 1] = 2.63 * X1[:, 1] + +# Plot data +plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1, + s=25, edgecolor='k') +axes = plt.gca() +axes.set_xlim([-5, 20]) +axes.set_ylim([-5, 20]) +plt.show() + +# Standardize Gaussian data +scaler = StandardScaler() +scaler.fit(X1) +X1 = scaler.transform(X1) + +# Plot standardized data +plt.title("Gaussian data after standardization, two classes, mean at (0,0)") +plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1, + s=25, edgecolor='k') +axes = plt.gca() +axes.set_xlim([-5, 20]) +axes.set_ylim([-5, 20]) +plt.show() +``` + +- [![](images/gauss0.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/gauss0.png) + +- [![](images/gauss1.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/gauss1.png) + + +* * * + +## Summary + +In this article, we looked at Feature Scaling for Machine Learning. More specifically, we looked at Normalization (min-max normalization) which brings the dataset into the \[latex\]\[a, b\]\[/latex\] range. In addition to Normalization, we also looked at Standardization, which allows us to convert the scales into _amounts of standard deviation_, making the axes comparable for e.g. algorithms like PCA. + +We illustrated our reasoning with step-by-step Python examples, including some with standard Scikit-learn functionality. 
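+
+One practical note that is easy to overlook when applying these techniques: in a real train/test setup, you would typically fit the scaler on your training data only and then reuse that fitted scaler to transform your test data, so that no information about the test set leaks into preprocessing. A minimal sketch with placeholder data (the variable names and values are just illustrative):
+
+```
+import numpy as np
+from sklearn.preprocessing import StandardScaler
+
+# Placeholder data: pretend these are one feature from a train/test split
+X_train = np.array([1.0, 2.0, 3.0, 3.0, 2.0]).reshape(-1, 1)
+X_test = np.array([2.5, 0.5]).reshape(-1, 1)
+
+scaler = StandardScaler()
+X_train_scaled = scaler.fit_transform(X_train)  # fit the mean/std on training data only
+X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std for the test data
+```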
+ +I hope that you have learned something from this article! If you did, feel free to leave a message in the comments section 💬 Please do the same if you have questions or other comments. I'd love to hear from you! Thank you for reading MachineCurve today and happy engineering 😎 + +* * * + +## References + +Wikipedia. (2011, December 15). _Feature scaling_. Wikipedia, the free encyclopedia. Retrieved November 18, 2020, from [https://en.wikipedia.org/wiki/Feature\_scaling](https://en.wikipedia.org/wiki/Feature_scaling) + +Scikit-learn. (n.d.). _Importance of feature scaling — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved November 18, 2020, from [https://scikit-learn.org/stable/auto\_examples/preprocessing/plot\_scaling\_importance.html](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html) + +Wikipedia. (n.d.). _Curse of dimensionality_. Wikipedia, the free encyclopedia. Retrieved November 18, 2020, from [https://en.wikipedia.org/wiki/Curse\_of\_dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) + +Wikipedia. (2001, June 30). _Variance_. Wikipedia, the free encyclopedia. Retrieved November 18, 2020, from [https://en.wikipedia.org/wiki/Variance](https://en.wikipedia.org/wiki/Variance) diff --git a/how-to-perform-affinity-propagation-with-python-in-scikit.md b/how-to-perform-affinity-propagation-with-python-in-scikit.md new file mode 100644 index 0000000..280f66f --- /dev/null +++ b/how-to-perform-affinity-propagation-with-python-in-scikit.md @@ -0,0 +1,246 @@ +--- +title: "Affinity Propagation Tutorial: Example with Scikit-learn" +date: "2020-04-18" +categories: + - "deep-learning" + - "frameworks" +tags: + - "affinity-propagation" + - "clustering" + - "machine-learning" + - "python" + - "scikit-learn" + - "unsupervised-learning" +--- + +Say you've got a dataset where there exist relationships between individual samples, and your goal is to identify groups of related samples within the dataset. Clustering, which is part of the class of unsupervised machine learning algorithms, is then the way to go. But what clustering algorithm to apply when you do not really know the number of clusters? + +Enter **Affinity Propagation**, a gossip-style algorithm which derives the number of clusters by mimicing social group formation by passing messages about the popularity of individual samples as to whether they're part of a certain group, or even if they are the leader of one. This algorithm, which can estimate the number of clusters/groups in your dataset itself, is the topic of today's blog post. + +Firstly, we'll take a theoretical look at Affinity Propagation. **What is it** - and how does the group formation analogy work? **How does it work** in more detail, i.e. mathematically? And what kind of messages are sent, and how are those popularity metrics determined? How does the algorithm converge? We'll look at them first. + +Next, we provide an example implementation of Affinity Propagation using Scikit-learn and Python. We explain our model code step by step, so that you can understand what is happening piece by piece. For those who already have some experience and wish to play right away, the full model code is also available. Hence, today's blog post is both theoretical and practical - my favorite type of blog! + +In this tutorial, you will learn... 
+
+- **How to perform Affinity Propagation clustering with Scikit-learn.**
+- **What Affinity Propagation is.**
+- **How Affinity Propagation works.**
+
+* * *
+
+\[toc\]
+
+* * *
+
+## Example code: How to perform Affinity Propagation with Scikit-learn?
+
+With this **quick example** you will be able to start using **Affinity Propagation with Scikit-learn** immediately. Copy and paste the code into your project and you are ready to go. If you want to understand how Affinity Propagation works in more detail, or learn how to write the code step-by-step, make sure to read the rest of this tutorial.
+
+```
+from sklearn.datasets import make_blobs
+from sklearn.cluster import AffinityPropagation
+
+# Generate data
+X, targets = make_blobs(n_samples = 50, centers = [(20,20), (4,4)], n_features = 2, center_box=(0, 1), cluster_std = 1)
+
+# Fit Affinity Propagation with Scikit
+afprop = AffinityPropagation(max_iter=250)
+afprop.fit(X)
+cluster_centers_indices = afprop.cluster_centers_indices_
+n_clusters_ = len(cluster_centers_indices)
+
+# Predict the cluster for all the samples
+P = afprop.predict(X)
+```
+
+* * *
+
+## What is Affinity Propagation?
+
+Do you remember high school, where groups of people formed - and you could only become a member of a particular group _if the group's leaders_ thought you were cool?
+
+Although the analogy might be a bit far-fetched, I think this is how Affinity Propagation for clustering can be explained in plain English. For a **set of data points**, a "group formation" process begins, where each **sample** competes with the other samples in order to gain group membership. The ones with the most group capital - the group leaders - are called **exemplars** (Scikit-learn, n.d.).
+
+The interesting thing about this machine learning technique is that you don't have to configure the number of clusters in advance, unlike [K-means clustering](https://www.machinecurve.com/index.php/2020/04/16/how-to-perform-k-means-clustering-with-python-in-scikit/) (Scikit-learn, n.d.). The main drawback is the complexity: it's not one of the cheapest machine learning algorithms in terms of the computational resources that are required (Scikit-learn, n.d.). Hence, it's a suitable technique for "small to medium sized datasets" only (Scikit-learn, n.d.).
+
+### A little bit more detail
+
+Now that we understand Affinity Propagation at a high level, it's time to take a more detailed look. We'll look at a couple of things:
+
+- How the algorithm works, at a high level;
+- What kind of messages are propagated;
+- How the scores in those messages are computed;
+- How the message scores are updated after each iteration, and thus how the true clusters are formed.
+
+First of all, as with any clustering algorithm, Affinity Propagation is iterative: it runs for a number of iterations before it is done. Contrary to K-means clustering, where convergence is determined with some threshold value, with Affinity Propagation you configure a _number of iterations_ to complete. After that, the algorithm assumes convergence and will return the resulting clusters (Scikit-learn, n.d.).
+
+### Two types of messages are propagated
+
+During each iteration, each sample broadcasts two types of messages to the other samples (Scikit-learn, n.d.). The first is called the **responsibility** \[latex\]r(i,k)\[/latex\] - which is the "evidence that sample \[latex\]k\[/latex\] should be the exemplar for sample \[latex\]i\[/latex\]" (Scikit-learn, n.d.).
I always remember it as follows: the greater the _expected group leadership_ of \[latex\]k\[/latex\], the greater the _responsibility_ for the group. That's how you know that the responsibility, seen from the point of view of \[latex\]i\[/latex\], always tells you something about the importance of \[latex\]k\[/latex\] for the group.
+
+The other type of message that is sent is the **availability**. This is the opposite of the responsibility: how certain \[latex\]i\[/latex\] is that it should choose \[latex\]k\[/latex\] as the exemplar, i.e. _how available it is to join that particular group_ (Scikit-learn, n.d.). In the high school case: if you are somewhat willing to join a semi-cool group, but much more willing to join the really cool group, your availability is much higher for the really cool one. The responsibility, in turn, tells you something about whose acceptance you need to join the group, i.e. the most likely group leader or exemplar.
+
+### Computing the scores for responsibility and availability
+
+Let's now take an even closer look at the concepts of responsibility and availability. Now that we know what they represent at a high level, it's time that we look at them in detail - which means mathematically.
+
+#### Responsibility
+
+Here's the formula for responsibility (Scikit-learn, n.d.):
+
+\[latex\]r(i, k) \\leftarrow s(i, k) - max \[ a(i, k') + s(i, k') \\forall k' \\neq k \]\[/latex\]
+
+Let's now decompose this formula into plain English. We start at the left. Here, \[latex\]r(i,k)\[/latex\] is once again the _responsibility_ that sample \[latex\]k\[/latex\] is the exemplar for sample \[latex\]i\[/latex\]. But what determines it? Two components: \[latex\]s(i, k)\[/latex\] and \[latex\]max \[ a(i, k') + s(i, k') \\forall k' \\neq k \]\[/latex\].
+
+The first is the _similarity_ between samples \[latex\]i\[/latex\] and \[latex\]k\[/latex\]. If they are highly similar, the odds are high that \[latex\]k\[/latex\] should be \[latex\]i\[/latex\]'s exemplar. However, this is not the full story, as we cannot look at similarity _only_ - the other samples will also try to convince \[latex\]i\[/latex\] that they are more suitable exemplars. Hence, the similarity is _relative_, and that's why we need to subtract that big \[latex\]max\[/latex\] value. It looks complex, but it simply boils down to "the maximum combined availability and similarity over all the other samples \[latex\]k'\[/latex\], where \[latex\]k'\[/latex\] is never \[latex\]k\[/latex\]". We simply subtract the similarity _and_ the willingness of \[latex\]k\[/latex\]'s "biggest competitor" in order to show \[latex\]k\[/latex\]'s relative strength as an exemplar.
+
+#### Availability
+
+That looks complex, but is actually relatively easy. And so is the formula for the availability (Scikit-learn, n.d.):
+
+\[latex\]a(i, k) \\leftarrow min \[0, r(k, k) + \\sum\_{i'~s.t.~i' \\notin {i, k}}{r(i', k)}\]\[/latex\]
+
+As we can see, the availability is the minimum of zero and the sum of two terms: the responsibility of \[latex\]k\[/latex\] to itself (i.e. how important it considers itself to be as an exemplar or group leader) and the sum of the responsibilities of all other samples \[latex\]i'\[/latex\] towards \[latex\]k\[/latex\], where \[latex\]i'\[/latex\] is neither \[latex\]i\[/latex\] nor \[latex\]k\[/latex\]. Thus, in terms of group formation, a sample will become more available to a potential exemplar if that exemplar considers itself important and the other samples around it agree.
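+
+To make these two formulas a bit more tangible, here is a small, naïve NumPy sketch of a single round of responsibility and availability updates for a toy dataset. Treat it purely as a reading aid, not as Scikit-learn's actual implementation: the real algorithm repeats and smooths these updates (more on that below), treats the self-availability \[latex\]a(k,k)\[/latex\] slightly differently, and uses the diagonal of the similarity matrix (the so-called _preference_) to influence how many exemplars emerge. The toy data values are of course arbitrary.
+
+```
+import numpy as np
+
+# Toy dataset: four one-dimensional samples, so the messages are 4x4 matrices
+X = np.array([1.0, 1.2, 5.0, 5.3])
+n = len(X)
+
+# Similarity s(i, k): negative squared distance between samples
+S = -np.square(X[:, None] - X[None, :])
+
+# Start with all messages at zero
+R = np.zeros((n, n))  # responsibilities r(i, k)
+A = np.zeros((n, n))  # availabilities a(i, k)
+
+# One round of responsibility updates:
+# r(i,k) = s(i,k) - max over k' != k of [ a(i,k') + s(i,k') ]
+for i in range(n):
+    for k in range(n):
+        others = [kp for kp in range(n) if kp != k]
+        R[i, k] = S[i, k] - np.max(A[i, others] + S[i, others])
+
+# One round of availability updates:
+# a(i,k) = min(0, r(k,k) + sum over i' not in {i,k} of r(i',k))
+for i in range(n):
+    for k in range(n):
+        others = [ip for ip in range(n) if ip not in (i, k)]
+        A[i, k] = min(0.0, R[k, k] + np.sum(R[others, k]))
+
+print(np.round(R, 2))
+print(np.round(A, 2))
+```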
+ +### Updating the scores: how clusters are formed + +Now that we know about the formulae for responsibility and availability, let's take a look at how scores are updated after every iteration (Scikit-learn, n.d.): + +\[latex\]r\_{t+1}(i, k) = \\lambda\\cdot r\_{t}(i, k) + (1-\\lambda)\\cdot r\_{t+1}(i, k)\[/latex\] + +\[latex\]a\_{t+1}(i, k) = \\lambda\\cdot a\_{t}(i, k) + (1-\\lambda)\\cdot a\_{t+1}(i, k)\[/latex\] + +Very simple: every update, we take \[latex\]\\lambda\[/latex\] of the old value and merge it with \[latex\](1-\\lambda)\[/latex\] of the new value. This lambda, which is also called "damping value", is a smoothing factor to ensure a smooth transition; it avoids large oscillations during the optimization process. + +Altogether, Affinity Propagation is therefore an algorithm which: + +- Estimates the number of clusters itself. +- Is useful for small to medium sized datasets given the computational expensiveness. +- Works by "gossiping" around as if it is attempting to form high school groups of students. +- Updates itself through small and smooth updates to the "attractiveness" of individual samples across time, i.e. after every iteration. +- Where the attractiveness is determined _for a sample_, answering the question "can this be the leader of the group I want to belong to?" and _for the sample itself_ ("what's the evidence that I'm a group leader?"). + +Let's now take a look how to implement it with Python and Scikit-learn! :) + +* * * + +## Implementing Affinity Propagation with Python and Scikit-learn + +Here they are again, the clusters that we also saw in our blog about [K-means clustering](https://www.machinecurve.com/index.php/2020/04/16/how-to-perform-k-means-clustering-with-python-in-scikit/), although we have fewer samples today: + +![](images/afp_cluster.png) + +Remember how we generated them? Open up a Python file and name it \`affinity.py\`, add the imports (which are Scikit-learn, Numpy and Matplotlib)... + +``` +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from sklearn.cluster import AffinityPropagation +``` + +We then add a few configuration options: the number of samples in total we generate, the centers of the clusters, as well as the number of classes that we will generate samples for. Those are all to be used in `make_blobs`, which generates the clusters and assigns them to \[latex\]X\[/latex\] and \[latex\]targets\[/latex\], respectively. + +We save them with Numpy and subsequently load them and assign them to \[latex\]X\[/latex\] again. Those two lines of code aren't necessary for your model to run, but if you want to compare across settings, you likely don't want to generate samples at random every time. By saving them once, and subsequently commenting out `save` and `make_blobs`, you'll load them from file again and again :) + +``` +# Configuration options +num_samples_total = 50 +cluster_centers = [(20,20), (4,4)] +num_classes = len(cluster_centers) + +# Generate data +X, targets = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 1) + +np.save('./clusters.npy', X) +X = np.load('./clusters.npy') +``` + +We then fit the data to the Affinity Propagation algorithm, after we loaded it, which just takes two lines of code. 
In another two lines, we derive characteristics such as the exemplars and by consequence the number of clusters: + +``` +# Fit AFfinity Propagation with Scikit +afprop = AffinityPropagation(max_iter=250) +afprop.fit(X) +cluster_centers_indices = afprop.cluster_centers_indices_ +n_clusters_ = len(cluster_centers_indices) +``` + +Finally, by using the algorithm we fit, we predict for all our samples to which cluster they belong: + +``` +# Predict the cluster for all the samples +P = afprop.predict(X) +``` + +And finally visualize the outcome: + +``` +# Generate scatter plot for training data +colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426', P)) +plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True) +plt.title(f'Estimated number of clusters = {n_clusters_}') +plt.xlabel('Temperature yesterday') +plt.ylabel('Temperature today') +plt.show() +``` + +Here it is! 😊👇 + +![](images/afp_clustered.png) + +### Full model code + +Should you wish to obtain the full model code at once, so that you can start working with it straight away - here you go! 😎 + +``` +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from sklearn.cluster import AffinityPropagation + +# Configuration options +num_samples_total = 50 +cluster_centers = [(20,20), (4,4)] +num_classes = len(cluster_centers) + +# Generate data +X, targets = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 1) + +np.save('./clusters.npy', X) +X = np.load('./clusters.npy') + +# Fit AFfinity Propagation with Scikit +afprop = AffinityPropagation(max_iter=250) +afprop.fit(X) +cluster_centers_indices = afprop.cluster_centers_indices_ +n_clusters_ = len(cluster_centers_indices) + +# Predict the cluster for all the samples +P = afprop.predict(X) + +# Generate scatter plot for training data +colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426', P)) +plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True) +plt.title(f'Estimated number of clusters = {n_clusters_}') +plt.xlabel('Temperature yesterday') +plt.ylabel('Temperature today') +plt.show() +``` + +* * * + +## Summary + +In today's blog post, we looked at the Affinity Propagation algorithm. This clustering algorithm allows machine learning engineers to cluster their datasets by means of "messaging". Resembling how groups are formed at high school, where the group leaders decide who gets in and who has to choose another, the pull game is played by the algorithm as well. + +By looking at the messages that are propagated, the responsibility and availability metrics that are sent with these messages, and how it converges iteratively, we first understood the theoretical part of the Affinity Propagation algorithm. This was followed by a practical example using Python and Scikit-learn, where we explained implementing Affinity Propagation step by step. For those interested, the model as a whole is also available above. + +I hope you've learnt something today! I certainly did - I never worked with this algorithm before. If you have any questions, please feel free to leave a message in the comments section below - I'd appreciate it 💬👇. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Scikit-learn. (n.d.). _2.3. Clustering — scikit-learn 0.22.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. 
Retrieved April 18, 2020, from [https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation](https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation) diff --git a/how-to-perform-fruit-classification-with-deep-learning-in-keras.md b/how-to-perform-fruit-classification-with-deep-learning-in-keras.md new file mode 100644 index 0000000..c75525f --- /dev/null +++ b/how-to-perform-fruit-classification-with-deep-learning-in-keras.md @@ -0,0 +1,418 @@ +--- +title: "How to Perform Fruit Classification with Deep Learning in Keras" +date: "2020-04-08" +categories: + - "deep-learning" + - "frameworks" +tags: + - "classifier" + - "dataset" + - "deep-learning" + - "fruit" + - "keras" + - "machine-learning" +--- + +Fruits are very common in today's world - despite the abundance of fast food and refined sugars, fruits remain widely consumed foods. During production of fruits, it might be that they need to be sorted, to give just one example. Traditionally being performed mechanically, today, deep learning based techniques _could_ augment or even take over this process. + +In today's blog post, we're going to work towards an example model with which fruits can be classified. Using the Fruits 360 dataset, we'll build a model with Keras that can classify between 10 different types of fruit. Relatively quickly, and with example code, we'll show you how to build such a model - step by step. For this to work, we'll first take a look at deep learning and ConvNet-based classification and fruit classification use cases. Then, we start our work. + +Are you ready? Let's go! 😎 + +* * * + +\[toc\] + +* * * + +## Deep learning: about classification and ConvNets + +Before diving into creating some model code, I think that it's nice to take a look at some theory first. + +In this case, theory about Convolutional Neural Networks, which are the type of deep learning model that we will be using today. + +When humans take a look at images, they automatically slice the images into tiny fractions of recognizable objects - for example, a door is built out of a piece of wood, with often some paint, and a door handle. + +![](images/CNN.jpg) + +* * * + +## Fruit classification use cases + +Now, besides the educational aspects, why would classifying fruits be a good idea? + +Can we think of some use cases as to why we could use Deep Learning for fruit classification? + +Here's one, for starters: + +https://www.youtube.com/watch?v=h0-NS3z-EXo + +I really like what they built! 😀 + +Now, this immediately suggests a proper use case for fruit classification: separating ripe fruits from the others. Another one: checking whether certain fruits are spoiled, e.g. because fungus is present on the skin. Whereas this process was performed primarily mechanically until now, Deep Learning can also be used. + +Let's see if we can create such a model ourselves! :) + +* * * + +## Today's dataset: Fruits 360 + +For today's blogpost, we will be using the **Fruits 360 dataset**. The dataset is marketed as follows: "Fruits 360 dataset: A dataset of images containing fruits and vegetables". At least you know what you're getting 😉 + +It can be [downloaded at Kaggle](https://www.kaggle.com/moltean/fruits) and be used under the - this is somewhat unclear - MIT License or CC-BY-SA 4.0 license. In any case, the original work that generated this dataset can be found back in the references of this blog - check out Muresan & Oltean (2018) if you wish to know more. 
+ +![](images/fruits.png) + +Some characteristics of the data: + +- **Total number of images:** 82.213 images +- **Training set size:** 61.488 images +- **Test set size:** 20.622 images +- **Train/test split:** approximately 75/25 +- **Number of classes:** 120 fruits and vegetables +- **Image size:** 100 x 100 pixels, RGB. + +We must also note that different varieties of the same fruit are stored as belonging to different classes. As we'll see later, for today's blog post, there is some merging to do :) + +### Visualization code + +Should you wish to generate the visualization yourself, here's the code that I used. Some pointers: + +- Make sure to have Matplotlib and PIL installed. +- Create a directory, say `imagecopies`, and copy some of the dataset files there. +- Configure the number of `files_per_row` based on the number of files that you've copied (e.g., you cannot set `files_per_row = 3` if you have copied 5 files) - or copy more files. +- Run the code! + +``` +# Imports +import os +import matplotlib.pyplot as plt +from PIL import Image +import math + +# Configuration +dir_with_examples = './imagecopies' +files_per_row = 3 + +# List the directory and perform computations +files_in_dir = os.listdir(dir_with_examples) +number_of_cols = files_per_row +number_of_rows = int(len(files_in_dir) / number_of_cols) + +# Generate the subplots +fig, axs = plt.subplots(number_of_rows, number_of_cols) +fig.set_size_inches(8, 5, forward=True) + +# Map each file to subplot +for i in range(0, len(files_in_dir)): + file_name = files_in_dir[i] + image = Image.open(f'{dir_with_examples}/{file_name}') + row = math.floor(i / files_per_row) + col = i % files_per_row + axs[row, col].imshow(image) + axs[row, col].axis('off') + +# Show the plot +plt.show() +``` + +### Dataset structure + +After downloading and unpacking the dataset, we can see two folders that are of interest for today's blog post: the **Train** folder, with all the training data, and the **Test** data, with all the testing data (once again, split in an approximate 75/25 fashion). + +Going one level deeper, here's (a part of) the contents of the Train folder: + +[![](images/image-1-1024x576.png)](https://www.machinecurve.com/wp-content/uploads/2020/04/image-1.png) + +As you can see, the fruits are indeed stored nicely together - on a variety basis. That means that we can see, for instance, 'pear', 'pear abate', 'pear forelle', and so on. Opening up the folder for 'Banana' yields the following: + +[![](images/image-1024x584.png)](https://www.machinecurve.com/wp-content/uploads/2020/04/image.png) + +Many bananas! 🍌 + +### Preparing our dataset for Deep Learning + +Before we continue, there is a trade-off to make: will we **create a model that highly specializes in variety**, or will we create **a generic model that can be capable of recognizing some fruits?** + +This is a relatively common question: do we keep at a more general level, or not? In the case where we want to specialize, we likely have to create a deeper model, which is hence more complex, and also more prone to [overfitting](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/). The other scenario might sound great, but it comes with a drawback too: the varieties of different fruit don't always match well. For example, when we want to generalize all pear varieties into one class: + +![](images/pears-1024x464.png) + +...we would generalize a lot of varieties that look a lot like each other...right? 
Nope: + +- ![](images/18_100.jpg) + +- ![](images/26_100.jpg) + +- ![](images/19_100.jpg) + +- ![](images/119_100.jpg) + +- ![](images/10_100.jpg) + +- ![](images/10_100-1.jpg) + + +Those are all pears! + +In today's blog, for the sake of simplicity, we'll skip this question altogether. Instead, we'll do this: + +- Besides the _Training_ and _Test_ folders, we create two additional ones: _Training\_smaller_ and _Test\_smaller_. +- From the original _Training_ and _Test_ folders, we copy the fruits we wish to classify for. Hence, we avoid the question by manually selecting a few fruit classes that we wish to distinguish. In my case, I chose a random set of classes - but make sure that they're identical in both folders. In my case, I chose Apricot, Avocado, Banana, Blueberry, Cauliflower, Cocos, Eggplant, Hazelnut, Kiwi and Limes. +- Now, copy those folders into the _Training\_smaller_ and _Test\_smaller_ folders, respectively. + +This should be the contents of that particular folder: + +![](images/image-3.png) + +* * * + +## Building a Keras model for fruit classification + +Time to create an actual machine learning model! In today's blog, we're using the **Keras framework** ([keras.io](http://keras.io)) for deep learning. Created by François Chollet, the framework works on top of TensorFlow (2.x as of recently) and provides a much simpler interface to the TF components. In my opinion, it makes creating machine learning models really easy code-wise... but you still need to know what you're doing ML-wise! ;-) + +### Model imports + +As always, the first thing we do is import our dependencies. As you can see, we'll use the Sequential API, which allows us to stack each neural network layer on top of each other easily. We also import Dense, Flatten and Conv2D - [the default layers in such a network](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/). Then, we import [sparse categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/) for computing [loss](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/), [Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam) for optimization, and an ImageDataGenerator for loading our images from folder. + +``` +# Imports +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from tensorflow.keras.preprocessing.image import ImageDataGenerator +``` + +### Data & model configuration + +The next step - not unsurprisingly given what we normally do - is specifying some configuration options. Today, we'll split them between _data related configuration_ and _model configuration_. + +The data configuration is simple: we simply set the paths to the training data and the testing data. + +The model configuration is a little bit more complex, but not too difficult. + +We specify the batch size to be 25 - which means that 25 samples are fed to the model for training [during every forward pass](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process). The image width is 25x25 pixels, and as we are using RGB images, the number of channels is 3. + +25 by 25 pixels? Strange! Didn't you write that our input data is 100 by 100 pixels? 
+ +Indeed - and you're a very diligent reader :) However, as you will see later, we're going to resize our samples to 25 by 25 pixels to speed up the training process. Good catch though! + +For loss, as said, we'll be using [sparse categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#sparse-categorical-crossentropy), which can work with integer targets. As the number of classes we're using is 10, we set `no_classes` to 10. The number of epochs (or iterations) is set to 25, which is low - very low - but is okay for education purposes. As we shall see, with 10 classes, we get some very good performance regardless. In normal settings, you would usually have thousands of epochs, though. For optimization, we use the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam) - which is a good default choice, and extends traditional gradient descent with local parameter updates and momentum-like optimization ([click here for more information](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam)). Verbosity is set to 1, which means `True`, which means that all the output is displayed on screen. Normally, we set this to False, as prints slightly slow down the training process, but that's not the purpose of today's post - so we keep it on. + +``` +# Data configuration +training_set_folder = './fruits-360/Training_smaller' +test_set_folder = './fruits-360/Test_smaller' + +# Model configuration +batch_size = 25 +img_width, img_height, img_num_channels = 25, 25, 3 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 25 +optimizer = Adam() +verbosity = 1 +``` + +### Loading & preparing the data + +Next, we load and prepare the data. First, we set the `input_shape` - which is [required for the input layer](https://www.machinecurve.com/index.php/2020/04/05/how-to-find-the-value-for-keras-input_shape-input_dim/). Then, we create a _generator_ - an `ImageDataGenerator`, to be precise. + +What is such a generator? For a simple one, [click here](https://www.machinecurve.com/index.php/2020/04/06/using-simple-generators-to-flow-data-from-file-with-keras/), but let's try to explain things in layman's terms here as well. + +A generator looks like an iterative function, i.e. some kind of loop, which you can use to 'generate' new samples. Although this might sound weird, it's not - because you can also use such generators to _read existing ones_ \- but for the model, they're like new. + +Now, in Keras, `ImageDataGenerators` can be configured substantially - allowing you to specify things like image augmentation, and so on. We don't do this today. The only thing we do is _rescale_ the data, so that the values are closer to a mean of 0 and a variance of 1. This is often recommended, as it helps the training process, with fewer weight swings during optimization. + +Next, we _feed data to the data generator_. We do so with `flow_from_directory`, which allows us to load all the data from folder. We specify the folder where our training data is located, specify `save_to_dir` - which saves the intermediate samples to some directory, in `jpeg` format - as well as batch size and `class_mode` (_sparse_ because of our loss funciton). Then, `target_size` is set to `(25, 25)` - that's the resizing I just discussed! 
+ +``` +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Create a generator +train_datagen = ImageDataGenerator( + rescale=1./255 +) +train_datagen = train_datagen.flow_from_directory( + training_set_folder, + save_to_dir='./adapted-images', + save_format='jpeg', + batch_size=batch_size, + target_size=(25, 25), + class_mode='sparse') +``` + +### Specifying the model architecture + +Now, back to the defaults - a model architecture that is very similar to the one we created in [our blog about Conv2D](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/). + +It's very simple: using the Sequential API, we stack four convolutional layers for feature extraction, subsequently flatten the feature maps into a one-dimensional input for the densely-connected layers, which generate a [multiclass probability distribution with Softmax](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/). This distribution, adhering to the laws of probability theory, give us the ultimate class prediction - precisely what we want. + +Additionally, we display a [model summary](https://www.machinecurve.com/index.php/2020/04/01/how-to-generate-a-summary-of-your-keras-model/) for visualization purposes. + +``` +# Create the model +model = Sequential() +model.add(Conv2D(16, kernel_size=(5, 5), activation='relu', input_shape=input_shape)) +model.add(Conv2D(32, kernel_size=(5, 5), activation='relu')) +model.add(Conv2D(64, kernel_size=(5, 5), activation='relu')) +model.add(Conv2D(128, kernel_size=(5, 5), activation='relu')) +model.add(Flatten()) +model.add(Dense(16, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Display a model summary +model.summary() +``` + +### Model compilation & starting the training process + +As our final step, we _compile the model_ - which means instantiating it, as we had previously created the skeleton / the framework only - and _fit_ the data (by means of the generator), so that the training process is started. Note that we use the configuration options that we defined previously to configure both the instantiation and the training process. + +``` +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Start training +model.fit( + train_datagen, + epochs=no_epochs, + shuffle=False) +``` + +### Full model code + +It's also possible to obtain the model code as a whole, if you wish to start playing around immediately. 
Here you go: + +``` +# Imports +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from tensorflow.keras.preprocessing.image import ImageDataGenerator + +# Data configuration +training_set_folder = './fruits-360/Training_smaller' +test_set_folder = './fruits-360/Test_smaller' + +# Model configuration +batch_size = 25 +img_width, img_height, img_num_channels = 25, 25, 3 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 25 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Create a generator +train_datagen = ImageDataGenerator( + rescale=1./255 +) +train_datagen = train_datagen.flow_from_directory( + training_set_folder, + save_to_dir='./adapted-images', + save_format='jpeg', + batch_size=batch_size, + target_size=(25, 25), + class_mode='sparse') + +# Create the model +model = Sequential() +model.add(Conv2D(16, kernel_size=(5, 5), activation='relu', input_shape=input_shape)) +model.add(Conv2D(32, kernel_size=(5, 5), activation='relu')) +model.add(Conv2D(64, kernel_size=(5, 5), activation='relu')) +model.add(Conv2D(128, kernel_size=(5, 5), activation='relu')) +model.add(Flatten()) +model.add(Dense(16, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Display a model summary +model.summary() + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Start training +model.fit( + train_datagen, + epochs=no_epochs, + shuffle=False) +``` + +* * * + +## Running the model: our results + +Time to run the model. Save your code somewhere, e.g. as `cnn.py`, and make sure that the _folder references_ in the data configuration are pointed correctly. Also make sure that you have the dependencies installed: at least TensorFlow with version 2.0+. Then, open up a terminal, `cd` to the folder where you stored your file, and run e.g. `python cnn.py`. 
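+
+As an optional extension: we configured `test_set_folder` at the start of the script, but never actually used it. If you also want to check how the trained model performs on the held-out test set, you could append something along the following lines to the end of `cnn.py`. This is only a minimal sketch that reuses `model`, `test_set_folder`, `batch_size` and the imports from the full code above - adapt it to your own setup.
+
+```
+# Optional: evaluate the trained model on the (smaller) test set
+test_datagen = ImageDataGenerator(
+  rescale=1./255
+)
+test_datagen = test_datagen.flow_from_directory(
+    test_set_folder,
+    batch_size=batch_size,
+    target_size=(25, 25),
+    class_mode='sparse')
+
+# Returns the loss and the accuracy metric that we compiled the model with
+test_loss, test_accuracy = model.evaluate(test_datagen)
+print(f'Test loss: {test_loss} - Test accuracy: {test_accuracy}')
+```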
+ +The training process should begin: + +``` +Model: "sequential" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +conv2d (Conv2D) (None, 21, 21, 16) 1216 +_________________________________________________________________ +conv2d_1 (Conv2D) (None, 17, 17, 32) 12832 +_________________________________________________________________ +conv2d_2 (Conv2D) (None, 13, 13, 64) 51264 +_________________________________________________________________ +conv2d_3 (Conv2D) (None, 9, 9, 128) 204928 +_________________________________________________________________ +flatten (Flatten) (None, 10368) 0 +_________________________________________________________________ +dense (Dense) (None, 16) 165904 +_________________________________________________________________ +dense_1 (Dense) (None, 10) 170 +================================================================= +Total params: 436,314 +Trainable params: 436,314 +Non-trainable params: 0 +_________________________________________________________________ +Train for 199 steps +Epoch 1/25 +2020-04-08 19:25:01.858098: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll +2020-04-08 19:25:02.539857: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll +2020-04-08 19:25:04.305871: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows +Relying on driver to perform ptx compilation. This message will be only logged once. +199/199 [=======================> +``` + +...and eventually end with quite some _training_ performance: + +``` +Epoch 25/25 +199/199 [==============================] - 13s 64ms/step - loss: 1.7623e-06 - accuracy: 1.0000 +``` + +The next step you could now take is [check for overfitting](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/), and if it occurs, apply techniques like [Dropout](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/) - or [L2 regularization](https://www.machinecurve.com/index.php/2020/01/23/how-to-use-l1-l2-and-elastic-net-regularization-with-keras/). Perhaps, also ensure that your data set gets bigger, possibly more varied or augmented through the Image Data Generator. But that's for another time! ;-) + +* * * + +## Summary + +In today's blog post, we looked at convolutional neural networks - and how they can be used for Fruit Classification with Deep Learning. We took a look at the Fruits 360 dataset, which is created by the authors of the article referenced below, and is a nice dataset containing a large variety of fruits. + +Subsequently, we created an actual example, with the Keras Deep Learning framework. With the example, we trained a model that could attain adequate training performance quickly. Left to do: checking for overfitting, adapting, and making things even better. + +Thank you for reading MachineCurve today! 😎 If you have any questions, remarks or comments, please feel free to drop a message in the comments section below 👇. I'll happily answer the questions and help where I can. Happy engineering! 
👩‍💻 + +\[kerasbox\] + +* * * + +## References + +Horea Muresan, [Mihai Oltean](https://mihaioltean.github.io/), [Fruit recognition from images using deep learning](https://www.researchgate.net/publication/321475443_Fruit_recognition_from_images_using_deep_learning), Acta Univ. Sapientiae, Informatica Vol. 10, Issue 1, pp. 26-42, 2018. diff --git a/how-to-perform-k-means-clustering-with-python-in-scikit.md b/how-to-perform-k-means-clustering-with-python-in-scikit.md new file mode 100644 index 0000000..31d92e0 --- /dev/null +++ b/how-to-perform-k-means-clustering-with-python-in-scikit.md @@ -0,0 +1,341 @@ +--- +title: "K-means Clustering tutorial: example with Scikit-learn" +date: "2020-04-16" +categories: + - "frameworks" + - "svms" +tags: + - "clustering" + - "k-means" + - "k-means-clustering" + - "machine-learning" + - "python" + - "scikit-learn" + - "unsupervised-learning" +--- + +While deep learning algorithms belong to today's fashionable class of machine learning algorithms, there exists more out there. Clustering is one type of machine learning where you do not feed the model a training set, but rather try to derive characteristics from the dataset at run-time in order to structure the dataset in a different way. It's part of the class of unsupervised machine learning algorithms. + +**K-means clustering** is such an algorithm, and we will scrutinize it in today's blog post. We'll first take a look at what it is, by studying the steps it takes for generating clusters. We then take a look at the inertia metric, which is used to compute whether the algorithm needs to continue or whether it's done, i.e. whether there is convergence. This is followed by taking a look at convergence itself and in what cases K-means clustering may not be useful. + +The theoretical part is followed by a practical implementation by means of a Python script. It provides an example implementation of K-means clustering with [**Scikit-learn**](https://www.machinecurve.com/index.php/how-to-use-scikit-learn-for-machine-learning-with-python-mastering-scikit/), one of the most popular Python libraries for machine learning used today. Altogether, you'll thus learn about the theoretical components of K-means clustering, while having an example explained at the same time. + +In this tutorial, you will learn... + +- **What K-means clustering is.** +- **How K-means clustering works, including the random and `kmeans++` initialization strategies.** +- **Implementing K-means clustering with Scikit-learn and Python.** + +Let's take a look! 🚀 + +**Update 11/Jan/2021:** added [quick example](https://www.machinecurve.com/index.php/2020/04/16/how-to-perform-k-means-clustering-with-python-in-scikit/#quick-answer-how-to-perform-k-means-clustering-with-python-in-scikit-learn) to performing K-means clustering with Python in Scikit-learn. + +**Update 08/Dec/2020:** added references to PCA article. + +* * * + +\[toc\] + +* * * + +## Example code: How to perform K-means clustering with Python in Scikit-learn? + +Here's a [quick answer](https://www.machinecurve.com/index.php/2020/04/16/how-to-perform-k-means-clustering-with-python-in-scikit/#full-model-code) to performing K-means clustering with Python and Scikit-learn. Make sure to read the full article if you wish to understand what happens in full detail! 
+ +``` +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from sklearn.cluster import KMeans + +# Configuration options +num_samples_total = 1000 +cluster_centers = [(20,20), (4,4)] +num_classes = len(cluster_centers) + +# Generate data +X, targets = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 2) + +np.save('./clusters.npy', X) +X = np.load('./clusters.npy') + +# Fit K-means with Scikit +kmeans = KMeans(init='k-means++', n_clusters=num_classes, n_init=10) +kmeans.fit(X) + +# Predict the cluster for all the samples +P = kmeans.predict(X) + +# Generate scatter plot for training data +colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426', P)) +plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True) +plt.title('Two clusters of data') +plt.xlabel('Temperature yesterday') +plt.ylabel('Temperature today') +plt.show() +``` + +* * * + +## What is K-means clustering? + +Suppose that we have a dataset \[latex\]X\[/latex\], which contains many n-dimensional vectors \[latex\]\\mathbf{x\_1} \\mathbf{x\_2}, ..., \\mathbf{x\_n}\[/latex\]. Say, \[latex\]n = 2\[/latex\], then \[latex\]\\mathbf{x\_1}\[/latex\] could be \[latex\]\[3.12, 4.14\]\[/latex\]. Mapping this one onto a two-dimensional space, i.e. a plane, gives this: + +[![](images/point.png)](https://www.machinecurve.com/wp-content/uploads/2020/04/point.png) + +Say that the vectors that we described abstractly above are structured in a way that they form "blobs", like we merged two datasets of temperature measurements - one with measurements from our thermostat, measuring indoor temperatures of ~20 degrees Celcius, the other with measurements from our fridge, of say ~4 degrees Celcius. The vertical axis shows the temperature of today, whereas the horizontal one displays the temperature at the same time yesterday. + +That would likely make the point above a fridge measured temperature. The whole set of measurements would be this: + +[![](images/clusters_2-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/04/clusters_2-1.png) + +Now, suppose that we want to understand whether a sample belongs to the "fridge" cluster or the "room temperature" cluster. Visually, we can easily decide whether it's one or the other: there's enough space between the two blobs of data points to accurately assess whether it's been the fridge or the living room. + +But what if we want to do this algorithmically? + +**K-means clustering** is what can be useful in this scenario. It allows us to reach this result: + +![](images/clustered.png) + +For every sample clear whether it's a room temperature one (red) or a fridge temperature one (blue), determined algorithmically! + +### Introducing K-means clustering + +Now, while this is a very simple example, K-means clustering can be applied to problems that are way more difficult, i.e. problems where you have multiple clusters, and even where you have multidimensional data (more about that later). Let's first take a look at what K-means clustering is. + +For this, we turn to our good old friend Wikipedia - and cherry pick the most important aspects of a relatively abstract definition: + +> k-means clustering is a method (...) that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. 
+> +> Wikipedia (2020) + +Let's break that one apart into pieces that we can understand atomically: + +- You have a dataset with some length \[latex\]n\[/latex\]. +- The goal is clustering, which means that you want to create "groups" of data, like in the scenario above. +- You have control over the number of groups (clusters) that is created: it'll be \[latex\]k\[/latex\] clusters, configured upfront. As you can imagine, \[latex\] k \\leq n\[/latex\]. +- Now the abstract part: each sample in your dataset is assigned to the cluster where the distance to the "mean" of that cluster is lowest. With mean, we literally mean the "center point" of the particular cluster. This way, the sample is assigned to the most likely "group" of data points. + +Let's take a look at how the algorithm works. + +### The K-means clustering algorithm + +For this, we turn to the Scikit-learn website, which explains [it nicely in plain English](https://scikit-learn.org/stable/modules/clustering.html#k-means): + +1. **Initialization**: directly after starting it, the initial centroids (cluster centers) are chosen. Scikit-learn supports two ways for doing this: firstly, `random`, which selects \[latex\]k\[/latex\] samples from the dataset at random. Secondly, `k-means++`, which [optimizes this process](https://en.wikipedia.org/wiki/K-means%2B%2B). +2. **Centroid assignment:** each sample in the dataset is assigned to the nearest centroid. +3. **Centroid correction:** new centroids are created by computing new means for the assignments created in step 2. +4. **Difference comparison:** for each centroid, the difference between old and new is compared, and the algorithm stops when the difference is lower than a threshold called `inertia`, or `tolerance`. Otherwise, it moves back to step 2. + +A very simple and elegant but powerful algorithm indeed! + +https://www.youtube.com/watch?v=IJt62uaZR-M + +### Inertia / Within-cluster sum-of-squares criterion + +While we expressed the algorithm above in very plain ways, we can also express things a bit more mathematically. For example, we can take a look at K-means clustering as an algorithm which attempts to minimize the **inertia** or the **within-cluster sum-of-squares criterion** (Scikit-learn, n.d.). It does so by picking centroids - thus, centroids that minimize this value. + +How's this value determined? Well, as follows (Scikit-learn, n.d.): + +\[latex\]\\sum\_{i=0}^{n}\\min\_{\\mu\_j \\in C}(||x\_i - \\mu\_j||^2)\[/latex\] + +Let's break down the formula. + +The first part, the **sigma sign**, essentially tells you that the value is a _sum_ of something for all \[latex\]n\[/latex\] samples in your dataset. Nothing special for now. But what is this something? + +A minimum. To be more precise, a minimum of the **squares** of the difference between **each sample** and the **mean** of a particular cluster. + +When this value is minimized, the clusters are said to be internally coherent (Scikit-learn, n.d.) and movement in the "centroid correction" step will be low. If it's zero, it has converged to an optimum. In Scikit, we specify a certain threshold value which, if the inertia is lower, considers the algorithm to have converged. This speeds up the fitting process. + +### On convergence of K-means clustering + +Given enough time, K-means clustering will always converge to an optimum (Scikit-learn, n.d.). However, this does not necessarily have to be the global optimum - it can be a local one as well. 
According to Scikit-learn (n.d.), this is entirely dependent on the initialization of the centroids; that is, whether we're using a `random` initialization strategy or `k-means++`.

In the random case, it's obvious that the initialization may produce _very good_ results sometimes, _mediocre_ to _good_ results often, and _very poor_ results at other times. That's the thing with (figuratively) flipping a coin to decide which samples to include ;-)

The `k-means++` strategy works a bit differently. Let's take a look at the random strategy again in order to explain why `k-means++` often works better. In the random strategy, nobody can ensure that the selected samples are _far away from each other_. Although the odds are small, they might be _all very close to each other_. In that case, convergence will become a very difficult and time-consuming job (Scikit-learn, n.d.). We obviously don't want that.

K-means++ ensures that the centroids are "\[generally\] distant from each other" (Scikit-learn, n.d.). As you can imagine, this proves to be a substantial improvement with respect to convergence and especially the speed of it (Scikit-learn, n.d.).

### The drawbacks of K-means clustering - when is it a bad choice?

If you look at [this page](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html), you'll see that K-means clustering does not always work. Specifically, things won't work out well in these cases (Scikit-learn, n.d.):

- **When your dataset has more blobs of data than the number of blobs you configure**. For obvious reasons, K-means clustering will then fail. The fact that the user must configure the number of clusters is one possible point of failure as well. Always look closely at your dataset before you apply K-means, is the advice!
- **When you don't have isotropic blobs**. Fancy words, I know, but isotropic means something like "nicely shaped" - i.e., equally wide and equally high. If they're not (and you will see this when you click the link above), K-means will detect halves of clusters, merging them together.
- **If clusters aren't convex**, or truly separable. In those cases, the algorithm might get confused, as you can see with the link above as well.
- Finally, **if your dimensionality is too high**. In the scenario above, we have a dimensionality of 2, but the more dimensions you add, the more time it will take for clustering to complete. This is due to the nature of the [Euclidean distance](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adamax) that is computed for inertia. Hence, you'll have to apply dimensionality reduction first - with techniques like [Principal Components Analysis (PCA)](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/), for example.

Think carefully about whether K-means suits your dataset before naïvely making the choice to "just" make it work. It might simply not work!

* * *

## Implementing K-means clustering with Python and Scikit-learn

Now that we have covered much theory with regards to K-means clustering, I think it's time to give some example code written in Python. For this purpose, we're using the `scikit-learn` library, which is one of the most widely known libraries for applying machine learning models. Specifically, it's widely used for applying the relatively _traditional_ types of machine learning, i.e. the non-deep learning models.

Let's open up your Finder / Explorer.
Create a file called `scikit-blobs.py`. Open this file in your code editor and ensure that the following dependencies are installed on your system: + +- Scikit-learn +- Matplotlib +- Numpy + +If they are, great! Let's continue :D + +### Generating convex and isotropic clusters + +The first thing we do before we can apply K-means clustering with Scikit-learn is generating those **convex and isotropic clusters**. In plainer English, those are clusters which are separable and equally wide and high. Without English and with a visualization, I mean this: + +![](images/clusters_2-1.png) + +Ah, so that's what you meant is what you'll likely think now 😂 Oops :) + +For this to work, we'll first have to state our imports: + +``` +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from sklearn.cluster import KMeans +``` + +Those are _all_ the imports for today, not just those for generating the blobs (which would be the `make_blobs` import). What's more, we also import `KMeans` from Scikit-learn, `numpy` for number processing and the `PyPlot` library from `matplotlib` for visualizing the clusters (i.e. generating that visualization above). + +Now that we have specified our imports, it's time to set a few configuration options: + +``` +# Configuration options +num_samples_total = 1000 +cluster_centers = [(20,20), (4,4)] +num_classes = len(cluster_centers) +``` + +Those are really simple: + +- We'll be generating 1000 samples in total. +- They will be spread over 2 clusters, the first of which is located at approximately \[latex\](x, y) = (20, 20)\[/latex\], the other at \[latex\](4, 4)\[/latex\]. +- The `num_classes` i.e. the number of clusters is, pretty obviously, the `len(cluster_centers)` - i.e. 2. + +We then generate the data: + +``` +# Generate data +X, targets = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 2) +``` + +Generating the data simply equates calling the `make_blobs` definition from Scikit-learn, which does all the hard work. We specify the centers and number of samples that we configured before, as well as the number of features. We set a standard deviation of 2 - which means that the samples we generate at those two locations are distributed around the centers with a high likelihood of a deviation of \[latex\]\\pm 2\[/latex\]. + +Should you wish to save the data so that you can reuse the _exact_ positions later (e.g. in the cases where you want to generate different visualizations), you might add this code - which simply saves the data and reloads it immediately, for you to apply accordingly. It's not necessary though. + +``` +np.save('./clusters.npy', X) +X = np.load('./clusters.npy') +``` + +### Applying the K-means clustering algorithm + +Time for applying K-means clustering! + +First, we instantiate the algorithm: + +``` +# Fit K-means with Scikit +kmeans = KMeans(init='k-means++', n_clusters=num_classes, n_init=10) +kmeans.fit(X) +``` + +Here, we choose an initialization strategy (which is either `random` or `k-means++`, of which the latter will likely save us computation time so we choose it), the number of clusters, and `n_init`, which does this: + +> Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n\_init consecutive runs in terms of inertia. 
+> +> [Sklearn.cluster.KMeans (n.d.)](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) + +Once we did this, it's time to actually _fit the data_ and generate the cluster predictions: + +``` +# Predict the cluster for all the samples +P = kmeans.predict(X) +``` + +That's it already - K-means clustering is complete! If you wish to generate that visualization with the two classes colored differently, you might also want to add this: + +``` +# Generate scatter plot for training data +colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426', P)) +plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True) +plt.title('Two clusters of data') +plt.xlabel('Temperature yesterday') +plt.ylabel('Temperature today') +plt.show() +``` + +### Full model code + +Should you wish to obtain the full model code at once immediately - that's possible too, of course. Here you go: + +``` +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from sklearn.cluster import KMeans + +# Configuration options +num_samples_total = 1000 +cluster_centers = [(20,20), (4,4)] +num_classes = len(cluster_centers) + +# Generate data +X, targets = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 2) + +np.save('./clusters.npy', X) +X = np.load('./clusters.npy') + +# Fit K-means with Scikit +kmeans = KMeans(init='k-means++', n_clusters=num_classes, n_init=10) +kmeans.fit(X) + +# Predict the cluster for all the samples +P = kmeans.predict(X) + +# Generate scatter plot for training data +colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426', P)) +plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True) +plt.title('Two clusters of data') +plt.xlabel('Temperature yesterday') +plt.ylabel('Temperature today') +plt.show() +``` + +### Results + +The results are pretty clear, aren't they: + +[![](images/clustered.png)](https://www.machinecurve.com/wp-content/uploads/2020/04/clustered.png) + +Pretty much immediately (given the small number of samples and the fact that the blobs are highly separable), we have performed K-means clustering for the first time! + +* * * + +## Summary + +In this blog post, we looked at K-means clustering with Python and Scikit-learn. More specifically, we looked at a couple of questions: + +- What precisely is K-means clustering? +- How does K-means clustering work? +- What is inertia with K-means clustering? +- What are the drawbacks of using K-means clustering; i.e., when is it not smart to use it? +- How to implement K-means clustering with Python and Scikit-learn? Can you give an example? + +I hope you've learnt something today! :D If you did, feel free to leave a comment in the comments section below 👇 Thank you for reading MachineCurve today and happy engineering! 😊 + +* * * + +## References + +Wikipedia. (2020, April 13). _K-means clustering_. Wikipedia, the free encyclopedia. Retrieved April 14, 2020, from [https://en.wikipedia.org/wiki/K-means\_clustering](https://en.wikipedia.org/wiki/K-means_clustering) + +Scikit-learn. (n.d.). _2.3. Clustering — scikit-learn 0.22.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. 
Retrieved April 14, 2020, from [https://scikit-learn.org/stable/modules/](https://scikit-learn.org/stable/modules/clustering.html#k-means)[c](https://scikit-learn.org/stable/modules/clustering.html#k-means)[lustering.html#k-means](https://scikit-learn.org/stable/modules/clustering.html#k-means) + +Wikipedia. (2020, April 12). _K-means++_. Wikipedia, the free encyclopedia. Retrieved April 14, 2020, from [https://en.wikipedia.org/wiki/K-means%2B%2B](https://en.wikipedia.org/wiki/K-means%2B%2B) + +_Sklearn.cluster.KMeans — scikit-learn 0.22.2 documentation_. (n.d.). scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved April 16, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) diff --git a/how-to-perform-mean-shift-clustering-with-python-in-scikit.md b/how-to-perform-mean-shift-clustering-with-python-in-scikit.md new file mode 100644 index 0000000..4efa122 --- /dev/null +++ b/how-to-perform-mean-shift-clustering-with-python-in-scikit.md @@ -0,0 +1,204 @@ +--- +title: "How to perform Mean Shift clustering with Python in Scikit?" +date: "2020-04-23" +categories: + - "frameworks" + - "svms" +tags: + - "clustering" + - "mean-shift" + - "scikit-learn" + - "unsupervised-learning" +--- + +Suppose that you have a dataset in which you want to discover groups, or clusters, that share certain characteristics. There are various unsupervised machine learning techniques that can be used to do this. As we've seen in other blogs, [K-means clustering](https://www.machinecurve.com/index.php/2020/04/16/how-to-perform-k-means-clustering-with-python-in-scikit/) and [Affinity Propagation](https://www.machinecurve.com/index.php/2020/04/18/how-to-perform-affinity-propagation-with-python-in-scikit/) can be used if you have good data or small data, respectively. + +But in both cases, _the clusters need to be separated_. Or you may need to configure the number of clusters in advance. Now, your machine learning problem may be such that none of those two criteria are met. What to do? + +Enter **Mean Shift** clustering, a clustering approach for discovering "blobs in a smooth density of samples" (Scikit-learn, n.d.). That is, precisely what you want - discovering clusters if your data is not separated without configuring the number of clusters. + +In today's blog post, we will explore Mean Shift in more detail. First, we'll take a look at Mean Shift clustering. What is it? How does it work intuitively? And when does it work well, and when shouldn't you use Mean Shift? Those are the theoretical questions that we will be looking at. + +Then, we will move towards practice - and provide an implementation of Mean Shift clustering with Python and the Scikit-learn framework for machine learning. We explain our code step by step, which ensures that you can implement the model at your own pace. + +Are you ready? Let's go! :) + +* * * + +\[toc\] + +* * * + +## What is Mean Shift clustering? + +Here we are again - a scenario where we have blobs of data. In this case, we have three clusters: + +[![](images/clusters.png)](https://www.machinecurve.com/wp-content/uploads/2020/04/clusters.png) + +If you look closely at those clusters, you'll see for every cluster that the number of points is highest around the centers of the cluster. + +We can also rephrase this into the observation that the **density** of points of a cluster is highest near its center, or centroid. 
+

Generalizing this statement, for any cluster, we can thus find the likely center by looking at the density of points at a particular spot in the diagram above. Hence, we can also find the _number_ of clusters, and estimate the approximate centers of those clusters that we identified.

This is what the Mean Shift algorithm for clustering does. It looks at the "modes" of the density - the spots where it is highest - and will iteratively shift points in the plot towards the closest mode, resulting in a number of clusters and the ability to assign a sample to a cluster after fitting is complete (ML | mean-shift clustering, 2019).

This way, even when your clusters aren't perfectly separated, Mean Shift will likely be able to detect them anyway (Scikit-learn, n.d.).

When your dataset is relatively small, Mean Shift works quite well (Scikit-learn, n.d.). This changes when you have a large one - because the algorithm is quite expensive, to say the least. It would be wise to use Mean Shift for small to medium-sized datasets only.

* * *

## Implementing Mean Shift clustering with Python and Scikit-learn

Let's now take a look at how to implement Mean Shift clustering with Python. We'll be using the Scikit-learn framework, which is one of the popular machine learning frameworks used today. We'll be trying to successfully cluster those three clusters:

![](images/clusters.png)

Yep, those are the clusters that we just showed you, indeed :)

Now, open up a code editor and create a Python file (e.g. `meanshift.py`), so that we can start. The first thing we do is add the imports for today's code:

```
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import MeanShift, estimate_bandwidth
```

We'll use Matplotlib for generating visualizations, Numpy for some number processing and Scikit-learn functionality for generating the dataset (i.e., the unclustered blobs of data) and the actual clustering operation.

Once we defined the imports, we can set the configuration options:

```
# Configuration options
num_samples_total = 10000
cluster_centers = [(5,5), (3,3), (1,1)]
num_classes = len(cluster_centers)
```

We'll be generating 10000 samples in total, across 3 clusters.

Then, it's time to generate the data:

```
# Generate data
X, targets = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.30)
```

With `make_blobs`, we can let Scikit-learn make the blobs we want. We specify the configuration that we just defined, and set a cluster standard deviation of 0.30. This can be pretty much anything, and I'd recommend that you play around a bit before you start the actual clustering.

For reproducibility, though, you might wish to save the dataset you generated. That's why we use Numpy in today's code, for saving the data - and reloading it back into run-time immediately:

```
np.save('./clusters.npy', X)
X = np.load('./clusters.npy')
```

This code is not strictly necessary, but after running it once, you can comment out the `make_blobs` and `np.save` lines and simply load the exact same dataset again.

Next, we'll come to Mean Shift specific functionality.
First, we define what is known as the "bandwidth" of the algorithm - as you can see here: + +``` +# Estimate bandwith +bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500) +``` + +As discussed, Mean Shift "looks around" and determines the direction where a sample must move to - i.e. where the cluster centroid likely is. However, it would be too expensive computationally to do so for _all_ the samples - because then the algorithm would get stuck, put simply. + +That's why the "bandwidth" helps - it simply defines an area around the samples where Mean Shift should look in order to determine the most probable path given density estimation. But what should this bandwidth value be? That's where `estimate_bandwidth` comes in, and it estimates the most suitable bandwidth based on your dataset. + +We immediately use the bandwidth in the instantiation of the Mean Shift algorithm, after which we fit the data and generate some consequential data, such as the number of labels: + +``` +# Fit Mean Shift with Scikit +meanshift = MeanShift(bandwidth=bandwidth) +meanshift.fit(X) +labels = meanshift.labels_ +labels_unique = np.unique(labels) +n_clusters_ = len(labels_unique) +``` + +Then, we generate predictions for all the samples in our dataset: + +``` +# Predict the cluster for all the samples +P = meanshift.predict(X) +``` + +And finally, we generate a visualization to see whether our clustering operation is successful: + +``` +# Generate scatter plot for training data +colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426' if x == 2 else '#67c614', P)) +plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True) +plt.title(f'Estimated number of clusters = {n_clusters_}') +plt.xlabel('Temperature yesterday') +plt.ylabel('Temperature today') +plt.show() +``` + +Now, let's run it! Open up a terminal where Scikit-learn, Numpy and Matplotlib are accessible, and execute the Python file - i.e. `python meanshift.py`. After some time, you should find a result that looks like this: + +![](images/clusters_mean.png) + +Mission complete! 🚀 + +### Full model code + +Should you wish to obtain the full model code at once, that is also possible. Here you go: + +``` +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from sklearn.cluster import MeanShift, estimate_bandwidth + +# Configuration options +num_samples_total = 10000 +cluster_centers = [(5,5), (3,3), (1,1)] +num_classes = len(cluster_centers) + +# Generate data +X, targets = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.30) + +np.save('./clusters.npy', X) +X = np.load('./clusters.npy') + +# Estimate bandwith +bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500) + +# Fit Mean Shift with Scikit +meanshift = MeanShift(bandwidth=bandwidth) +meanshift.fit(X) +labels = meanshift.labels_ +labels_unique = np.unique(labels) +n_clusters_ = len(labels_unique) + +# Predict the cluster for all the samples +P = meanshift.predict(X) + +# Generate scatter plot for training data +colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426' if x == 2 else '#67c614', P)) +plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True) +plt.title(f'Estimated number of clusters = {n_clusters_}') +plt.xlabel('Temperature yesterday') +plt.ylabel('Temperature today') +plt.show() +``` + +* * * + +## Summary + +In today's blog post, we looked at the Mean Shift algorithm for clustering. 
Based on an example, we looked at how it works intuitively - and subsequently presented a step-by-step explanation of how to implement Mean Shift with Python and Scikit-learn. + +I hope you've learnt something from today's post! If you did, feel free to leave a comment in the comments section below 👇 Please feel free to do the same if you have any questions or remarks - I'll happily answer them. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Scikit-learn. (n.d.). _2.3. Clustering — scikit-learn 0.22.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved April 18, 2020, from [https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation](https://scikit-learn.org/stable/modules/clustering.html#affinity-propagation) + +_ML | mean-shift clustering_. (2019, May 16). GeeksforGeeks. [https://www.geeksforgeeks.org/ml-mean-shift-clustering/](https://www.geeksforgeeks.org/ml-mean-shift-clustering/) diff --git a/how-to-perform-multioutput-regression-with-svms-in-python.md b/how-to-perform-multioutput-regression-with-svms-in-python.md new file mode 100644 index 0000000..f93f455 --- /dev/null +++ b/how-to-perform-multioutput-regression-with-svms-in-python.md @@ -0,0 +1,258 @@ +--- +title: "How to perform Multioutput Regression with SVMs in Python" +date: "2020-11-17" +categories: + - "frameworks" + - "svms" +tags: + - "machine-learning" + - "multioutput-regression" + - "regression" + - "scikit-learn" + - "support-vector-machine" + - "support-vector-regression" + - "support-vectors" +--- + +Support Vector Machines can be used for performing regression tasks - we know that [from another article](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/). But did you know that it is also possible to use them for creating _multioutput_ regression models - that is, training it for regressing two values at the same time? Precisely that is what we will cover in today's article: we're going to build a **multioutput regression** model using Support Vector Machines with Python and Scikit-learn. + +The article is structured as follows. Firstly, we'll take a look at _regression_ with Support Vector Machines. I can understand that this sounds a bit counterintuitive, as SVMs are traditionally used for [classification tasks](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/). But indeed, they can be used for regression as well! Secondly, we'll cover Multioutput Regression - and how it works conceptually. This is followed by building an actual multioutput regression SVM ourselves. For this, we'll be using Scikit-learn, a Python-based machine learning library. + +Let's go! + +* * * + +\[toc\] + +* * * + +## Regression with Support Vector Machines: how it works + +If you have some experience with building Machine Learning models, you know that [Support Vector Machines](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) can be used for a wide range of classification tasks. Indeed, it is possible to use them in many ways for creating an automated system which assigns inputs to two or more classes, or even multiple classes to an input sample. 
+ +- [Creating One-vs-Rest and One-vs-One SVM Classifiers with Scikit-learn](https://www.machinecurve.com/index.php/2020/11/11/creating-one-vs-rest-and-one-vs-one-svm-classifiers-with-scikit-learn/) +- [Using Error-Correcting Output Codes with Scikit-learn for multiclass SVM classification](https://www.machinecurve.com/index.php/2020/11/12/using-error-correcting-output-codes-for-multiclass-svm-classification/) +- [How to create a Multilabel SVM classifier with Scikit-learn](https://www.machinecurve.com/index.php/2020/11/12/how-to-create-a-multilabel-svm-classifier-with-scikit-learn/) + +It is perhaps less known that Support Vector Machines can be used for regression tasks as well. In this section, we will discuss why this is possible. + +### SVMs are maximum-margin models + +Before we can understand why SVMs are usable for regression, it's best if we take a look at how they can be used for classification tasks. From the articles linked above, we know that Support Vector Machines are **maximum-margin models** when they are applied to classification problems: when learning a decision boundary, they attempt to generate a boundary such that it maximizes its distance to class 0, but also its distance to class 1. This property is called _equidistance_ and ensures that we have the best possible decision boundary for our dataset. + +If you look closely at the decision boundaries plotted in the figure below, we can see that \[latex\]H\_1\[/latex\] is no decision boundary it all (it is not capable of separating class 0 and class 1), \[latex\]H\_2\[/latex\] works but is a bit short in relation to class 0, while \[latex\]H\_3\[/latex\] maximizes the distance between the two classes. + +![](images/Svm_separating_hyperplanes_SVG.svg_-1024x886.png) + +Hyperplanes and data points. The [image](https://en.wikipedia.org/wiki/Support-vector_machine#/media/File:Svm_separating_hyperplanes_(SVG).svg)is not edited. Author: [Zack Weinberg](https://commons.wikimedia.org/w/index.php?title=User:ZackWeinberg&action=edit&redlink=1), derived from [Cyc’s](https://commons.wikimedia.org/w/index.php?title=User:Cyc&action=edit&redlink=1) work. License: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/legalcode) + +We also see some lines between samples and the decision boundaries, which are also called _hyperplanes_ (because they are `N-1` dimensional, i.e., in our two-dimensional plane plotted above, the boundary is a one-dimensional line). Those lines indicate that those samples were used to construct a particular boundary. As they essentially _support_ the construction of the boundary, they are called _support vectors_ - and hence we can guess why SVMs are called that way. + +The goal of finding a maximum-margin boundary is to find a set of support vectors for each class where the distance between support vectors for each class to the decision boundary is equal - while also ensuring that a minimum amount of samples is classified incorrectly. + +And by consequence, we can use them to build [a classifier](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/). If you want to move from the linear case towards nonlinear data, I suggest you take a look at [this article](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) which covers kernel functions, but for now, we'll move forward to using SVMs for regression. 
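Before we do, a small aside: if you want to see those support vectors for yourself, the snippet below is a minimal, illustrative sketch - it is not taken from the articles linked above, just a hand-rolled example using Scikit-learn's `SVC` - that fits a linear maximum-margin classifier on two blobs and prints the samples that ended up supporting the decision boundary:

```
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two small, separable blobs acting as class 0 and class 1
X, y = make_blobs(n_samples=100, centers=[(0, 0), (5, 5)], cluster_std=1.0, random_state=42)

# Fit a linear, maximum-margin classifier
clf = SVC(kernel='linear')
clf.fit(X, y)

# The samples that 'support' the decision boundary
print(clf.support_vectors_)  # coordinates of the support vectors
print(clf.n_support_)        # number of support vectors per class
```

After fitting, `support_vectors_` holds the coordinates of the support vectors and `n_support_` the number of support vectors per class - exactly the samples that carry the maximum-margin boundary.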
+

### Using Support Vectors to perform regression

Because indeed, SVMs can also be used to perform regression tasks. We know that the decision boundary that was learned in the figure above can be used to separate between the two classes. We call this a _discrete_ problem - there are two possible outcomes: class 0 for everything above the line, and class 1 for everything below the line. Classification problems are good examples of discrete Machine Learning problems.

Regression, however, is a continuous problem: one input is mapped to a real-valued output, a number, and hence there is no such thing as "above the line" or "below the line" for an outcome. Rather, we must use the boundary _itself_ in order to generate the outcome. If we wanted to find a perfect boundary for our continuous data, however, the problem would grow exponentially, because a _precise, maximum-margin fit_ is really difficult in those cases.

This puts extra emphasis on the correctness _and_ time complexity of the boundary, but it is possible to use Support Vector Machines to perform what is known as **Support Vector Regression** (SVR). A penalty-free area is captured around the maximum-margin decision boundary, called the _error tube_, where errors are accepted; this is a consequence of the fact that it must learn to compute continuous outputs. The goal of SVR is to find a tube that is as small as possible, without compromising too much in model complexity and training time.

Imagine that all the samples in the figure above don't belong to a particular class - but they just are what they are, samples, and they represent some \[latex\]x \\rightarrow y\[/latex\] mapping from one continuous input to a continuous output value. Obviously, when performing a regression task, you want the regressed function to be somewhere in the middle of the samples. This makes Support Vector Machines a good fit for (linear, and if not linear, using some kernel function with the kernel trick) regression problems: using support vectors near the middle of your dataset, it will regress a function that maps those inputs to outputs.

### Epsilon-SVR and nu-SVR

There are in fact two types of Support Vector Regression: epsilon-based SVR (\[latex\]\\epsilon\[/latex\]-SVR) and nu-SVR (\[latex\]\\nu\[/latex\]-SVR). They differ by means of the control that they offer you over the regression problem (StackExchange, n.d.):

- When using **nu-SVR**, you have control over the _total number of support vectors used_ but not necessarily over the error that is acceptable (often yielding smaller but possibly worse models).
- When using **epsilon-SVR**, you have control over the _error_ _that is acceptable_ but not necessarily over the number of support vectors used (often yielding better but larger models).

> Depending of what I want, I choose between the two. If I am really desperate for a small solution (fewer support vectors) I choose \[latex\]\\nu\[/latex\]-SVR and **hope** to obtain a decent model. But if I really want to control the amount of error in my model and go for the best performance, I choose \[latex\]\\epsilon\[/latex\]-SVR and **hope** that the model is not too complex (lots of support vectors).
> 
> StackExchange, n.d.

* * *

## How does Multioutput Regression work?

We can even generalize our single-output SVR model into a **multioutput regression** model. Constructing one is actually pretty simple:

- Multiple regressors are trained for the problem, covered in a _multioutput regressor_ wrapper.
+- This wrapper takes input and distributes it to the single-output regressors that are embedded in it. +- Predictions generated by the single-output regressors are combined and served as a multi-output regression. + +Pretty simple, isn't it? + +![](images/mor-1024x516.jpg) + +* * * + +## Building a Multioutput Regression SVM with Scikit-learn + +Now that we understand how SVMs can be used for regression tasks, and how we can generalize a single-output SVR into a multi-output one, we can take a look at how to create one with Scikit-learn. + +Open up your code editor, create a file (e.g. `mor.py`), and let's code! :) + +### Imports + +The first thing we always do (simply because it's necessary) is import all the dependencies into our Python script. Today, we will be using Scikit-learn, so the assumption is that you have it installed onto your system (and into your specific Python environment, if you use them). + +If you don't have it, you can easily install it, e.g. with `pip install scikit-learn`. + +We next import the dependencies - note that they are available as `sklearn` rather than `scikit-learn`. + +- We import `make_regression` from `sklearn.datasets` because it will help us create the dataset for today's regression problem (recall that up to now, we have no dataset :) ) +- From `sklearn.multioutput` we import `MultiOutputRegressor` - it's the wrapper we discussed in the previous section. +- As we will convert an SVR model into a multioutput regressor, we must import `SVR` from `sklearn.svm`. +- After generating the dataset with `make_regression`, we must split it into [train/test sets](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/). We can do so using `sklearn.model_selection`'s `train_test_split`. +- Finally, we import `mean_squared_error` and `mean_absolute_error` from `sklearn.metrics` for evaluating our model. Those are default [error functions for regression problems](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#loss-functions-for-regression). + +``` +from sklearn.datasets import make_regression +from sklearn.multioutput import MultiOutputRegressor +from sklearn.svm import SVR +from sklearn.model_selection import train_test_split +from sklearn.metrics import mean_squared_error, mean_absolute_error +``` + +### Generating and processing the dataset + +After the imports, it's time to make a dataset: + +- We will use `make_regression`, which generates a regression problem for us. +- We create 25.000 samples (i.e. input-target pairs) by setting `n_samples` to 25000. +- Each input part of the input-target-pairs has 3 features, or columns; we therefore set `n_features` to 3. +- The output part of the input-target-pairs has 2 targets, or values to be regressed; we therefore set `n_targets` to 2. Note that our multioutput regressor will therefore be a two-output regressor. +- Using `random_state`, we seed our regression problem by using the same random number initialization. 
+ +``` + +# Generate dataset +X, y = make_regression(n_samples=25000, n_features=3, n_targets=2, random_state=33) +``` + +After generating the dataset, we must process it by [splitting it into a training and testing dataset](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/): + +``` + +# Train/test split +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=33) +``` + +### Building the SVR single-output regressor + +We can then move forward and construct the SVR regressor: + +- Here, we set the value for \[latex\]\\epsilon\[/latex\] (epsilon) to `0.2`. It specifies the width of the 'error tube' where no penalty is assigned to mispredictions, effectively allowing us to take values close to the edges of the error tube as support vectors. +- If we want to apply regularization, we can also apply values for `C` - more information [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html). + +``` +# Create the SVR regressor +svr = SVR(epsilon=0.2) +``` + +### Wrapping the SVR into a MultiOutputRegressor + +We can then easily wrap the SVR into our imported `MultiOutputRegressor`: + +``` +# Create the Multioutput Regressor +mor = MultiOutputRegressor(svr) +``` + +### Fitting and evaluating the regressor + +Finally, we can fit the training data (`X_train`) and `y_train`) to our `MultiOutputRegressor`. This starts the training process. Once fitting the data is complete, we can generate `y_pred` prediction values for our testing inputs `X_test`. Using the [mean squared error and mean absolute error](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#loss-functions-for-regression), we can then evaluate model performance: + +``` +# Train the regressor +mor = mor.fit(X_train, y_train) + +# Generate predictions for testing data +y_pred = mor.predict(X_test) + +# Evaluate the regressor +mse_one = mean_squared_error(y_test[:,0], y_pred[:,0]) +mse_two = mean_squared_error(y_test[:,1], y_pred[:,1]) +print(f'MSE for first regressor: {mse_one} - second regressor: {mse_two}') +mae_one = mean_absolute_error(y_test[:,0], y_pred[:,0]) +mae_two = mean_absolute_error(y_test[:,1], y_pred[:,1]) +print(f'MAE for first regressor: {mae_one} - second regressor: {mae_two}') +``` + +### Full model code + +Should you wish to obtain the full code just at once, that's of course also possible. 
In that case, here you go :) + +``` +from sklearn.datasets import make_regression +from sklearn.multioutput import MultiOutputRegressor +from sklearn.svm import SVR +from sklearn.model_selection import train_test_split +from sklearn.metrics import mean_squared_error, mean_absolute_error + +# Generate dataset +X, y = make_regression(n_samples=25000, n_features=3, n_targets=2, random_state=33) + +# Train/test split +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=33) + +# Create the SVR regressor +svr = SVR(epsilon=0.2) + +# Create the Multioutput Regressor +mor = MultiOutputRegressor(svr) + +# Train the regressor +mor = mor.fit(X_train, y_train) + +# Generate predictions for testing data +y_pred = mor.predict(X_test) + +# Evaluate the regressor +mse_one = mean_squared_error(y_test[:,0], y_pred[:,0]) +mse_two = mean_squared_error(y_test[:,1], y_pred[:,1]) +print(f'MSE for first regressor: {mse_one} - second regressor: {mse_two}') +mae_one = mean_absolute_error(y_test[:,0], y_pred[:,0]) +mae_two = mean_absolute_error(y_test[:,1], y_pred[:,1]) +print(f'MAE for first regressor: {mae_one} - second regressor: {mae_two}') +``` + +Running it gives the following performance: + +``` +MSE for first regressor: 141.01769634969892 - second regressor: 455.162512288481 +MAE for first regressor: 2.522852872893534 - second regressor: 5.167553576426942 +``` + +Not _too_ bad, but not really great either. Enough room for further optimization! :) + +* * * + +## Summary + +In today's article, we looked at how to create a multioutput regression scenario for Support Vector Machine based regressors - or Support Vector Regression for short. For doing so, we started by looking at how Support Vector Machines work in general. In other words, we looked at how they generate maximum-margin hyperplanes as decision boundaries, when they are used for classification. + +We then moved forward to regression problems by looking at how those hyperplanes can be used for regression problems, i.e. by constructing an error tube around the regressed function where errors are not penalized. This speeds up the training process and it makes Support Vector Regression actually possible. We also saw that there are two types of SVR, epsilon-SVR and nu-SVR, which allow you to configure the acceptable amount of error or the expected amount of support vectors used, respectively. + +When we understood SVR, we moved forward by creating a multioutput regressor for them. We saw that it is as simple as wrapping the problem with functionality that generates one single-output regression function for each problem, then combining the results into one multi-output output. This was demonstrated by a Scikit-learn based example, where we implemented a multi-output SVR model in a step-by-step fashion, explaining the details as well. + +I hope that you have learned something from today's article! If you did, please feel free to leave a message in the comments section 💬 Please do the same if you have questions or other remarks. I'd love to hear from you and will respond whenever I can. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +MachineCurve. (2019, October 22). _Intuitively understanding SVM and SVR_. [https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) + +MachineCurve. (2020, October 22). 
_3 variants of classification problems in machine learning_. [https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/)

Sayad, S. (n.d.). _Support vector regression_. Data Mining Map. [https://www.saedsayad.com/support\_vector\_machine\_reg.htm](https://www.saedsayad.com/support_vector_machine_reg.htm)

StackExchange. (n.d.). _Difference between ep-SVR and nu-SVR (and least squares SVR)_. Cross Validated. [https://stats.stackexchange.com/questions/94118/difference-between-ep-svr-and-nu-svr-and-least-squares-svr](https://stats.stackexchange.com/questions/94118/difference-between-ep-svr-and-nu-svr-and-least-squares-svr)

Scikit-learn. (n.d.). _Sklearn.svm.SVR — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved November 17, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)

Scikit-learn. (n.d.). _1.12. Multiclass and multilabel algorithms — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved November 17, 2020, from [https://scikit-learn.org/stable/modules/multiclass.html#multioutput-regression](https://scikit-learn.org/stable/modules/multiclass.html#multioutput-regression)
diff --git a/how-to-predict-new-samples-with-your-keras-model.md b/how-to-predict-new-samples-with-your-keras-model.md
new file mode 100644
index 0000000..b2b99b3
--- /dev/null
+++ b/how-to-predict-new-samples-with-your-keras-model.md
@@ -0,0 +1,430 @@
+---
+title: "How to predict new samples with your TensorFlow / Keras model?"
+date: "2020-02-21"
+categories:
+  - "buffer"
+  - "deep-learning"
+  - "frameworks"
+tags:
+  - "keras"
+  - "machine-learning"
+  - "model"
+  - "neural-network"
+  - "neural-networks"
+  - "predict"
+---

Training machine learning models can be awesome if they are accurate. However, you then also want to use them in production.

But how to do so?

The first step is often to allow the models to _generate new predictions_, for data that you - instead of Keras - feed it.

This blog zooms in on that particular topic. By providing a Keras based example using TensorFlow 2.0+, it will show you how to create a Keras model, train it, save it, load it and subsequently use it to generate new predictions. It's the first step of deploying your model into a production setting :)

Are you ready? Let's go! 😎

**Update 11/Jan/2021:** added quick example to the article.

**Update 03/Nov/2020:** fixed textual error.

* * *

\[toc\]

* * *

## Example code: using model.predict() for predicting new samples

With this example code, you can start using `model.predict()` straight away. Note that `input_train` refers to your (preprocessed) training data; the full example later in this article shows how it is loaded.

```
from tensorflow.keras.models import load_model
import numpy as np

# File path
filepath = './path_to_model'

# Load the model
model = load_model(filepath, compile = True)

# A few random samples
use_samples = [5, 38, 3939, 27389]
samples_to_predict = []

# Add the samples to the list of samples to predict
for sample in use_samples:
  samples_to_predict.append(input_train[sample])

# Convert into Numpy array
samples_to_predict = np.array(samples_to_predict)

# Generate predictions for samples
predictions = model.predict(samples_to_predict)
print(predictions)
```

* * *

## Today's Keras model

Let's first take a look at the Keras model that we will be using today for showing you how to generate predictions for new data.
+ +It's an adaptation of the Convolutional Neural Network that we trained to demonstrate [how sparse categorical crossentropy loss works](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/). Today's one works for TensorFlow 2.0 and the integrated version of Keras; hence, I'd advise to use this variant instead of the traditional `keras` package. + +[![](images/dig_4-300x225.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/dig_4.png) + +Now, I won't cover all the steps describing _how_ this model is built - take a look at the link above if you wish to understand this in more detail. However, very briefly: + +- The model loads data from the EMNIST Digits dataset, which contains many samples of digits 0 to 9. To do this, we use our [Extra Keras Datasets](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/) package. +- It prepares the data by reshaping it (adding the number of channels, which Keras requires), casting the data into the `float32` type, and scaling. +- It creates the ConvNet architecture: three convolutional blocks with [Max Pooling](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/) for spatial hierarchy and [Dropout](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/) against overfitting. Using Flatten, and Dense layers that end with a [Softmax activation](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/), we get a multiclass probability distribution. +- It compiles the model and fits the data. +- Finally, it evaluates the model based on the test set. + +Here's the code - add it to a file called e.g. 
`keras-predictions.py`: + +``` +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from extra_keras_datasets import emnist + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load EMNIST dataset +(input_train, target_train), (input_test, target_test) = emnist.load_data(type='digits') + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Cast numbers to float32 +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=sparse_categorical_crossentropy, + optimizer=Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +* * * + +## Saving and loading the model + +If we want to generate new predictions for future data, it's important that we save the model. It really is: if you don't, you'd have to retrain the model every time you want to use it. This is bad for two reasons: if you have data at scale, this is a terrible process, and your models may no longer be comparable. + +Let's thus find a way to save our model! + +Fortunately, Keras offers a built-in facility for saving your models. Today, we do so using the new TensorFlow `SavedModel` approach. However, the former way of working is also still available. [Check out this post if you wish to check out saving models using both approaches in more detail](https://www.machinecurve.com/index.php/2020/02/14/how-to-save-and-load-a-model-with-keras/). + +### Saving + +Now, let's add some extra code to your model so that we can save and load the model :) + +First, add the `save_model` and `load_model` definitions to our imports - replace the line where you import `Sequential` with: + +``` +from tensorflow.keras.models import Sequential, save_model, load_model +``` + +Then, create a folder in the folder where your `keras-predictions.py` file is stored. Make sure to name this folder `saved_model` or, if you name it differently, change the code accordingly - because you next add this at the end of your model file: + +``` +# Save the model +filepath = './saved_model' +save_model(model, filepath) +``` + +In line with how saving Keras models works, it saves the `model` instance at the `filepath` (i.e. that folder) that you specified. + +Hooray! 
We now saved our trained model 🎉 + +### Loading + +Loading the model for future usage is really easy - it's a two-line addition: + +``` +# Load the model +model = load_model(filepath, compile = True) +``` + +Your model is now re-loaded from `filepath` and compiled automatically (i.e., the `model.compile` step is performed; you can also do this manually if you like). + +_Note that saving and loading your model during run-time of one Python file makes no sense at all: why would you write a model to your file system and load it in the same run? Yeah, you're right :)_ _The goal is however to make your model re-usable across many Python files. Hence, in any practical setting, you'd use `save_model` during the training run, while you'd use `load_model` in e.g. another script._ + +* * * + +## Generating predictions + +With a loaded model, it's time to show you how to generate predictions with your Keras model! :) + +Firstly, let's add Matplotlib to our imports - which allows us to generate visualizations. Then, also add Numpy, for number processing: + +``` +import matplotlib.pyplot as plt +import numpy as np +``` + +Then, we'll add some code for visualizing the samples that we'll be using in today's post: + +``` +# A few random samples +use_samples = [5, 38, 3939, 27389] + +# Generate plots for samples +for sample in use_samples: + # Generate a plot + reshaped_image = input_train[sample].reshape((img_width, img_height)) + plt.imshow(reshaped_image) + plt.show() +``` + +Here they are: + +- [![](images/dig_4.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/dig_4.png) + +- [![](images/dig_2.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/dig_2.png) + +- [![](images/dig_3.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/dig_3.png) + +- [![](images/dig_1.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/dig_1.png) + + +We then extend this code so that we can actually store the samples temporarily for prediction later: + +``` +# A few random samples +use_samples = [5, 38, 3939, 27389] +samples_to_predict = [] + +# Generate plots for samples +for sample in use_samples: + # Generate a plot + reshaped_image = input_train[sample].reshape((img_width, img_height)) + plt.imshow(reshaped_image) + plt.show() + # Add sample to array for prediction + samples_to_predict.append(input_train[sample]) +``` + +Then, before feeding them to the model, we convert our list into a Numpy array. This allows us to compute shape and allows Keras to handle the data more smoothly: + +``` +# Convert into Numpy array +samples_to_predict = np.array(samples_to_predict) +print(samples_to_predict.shape) +``` + +The output of the `print` statement: `(4, 28, 28, 1)`. + +Correct ✅ We indeed added 4 images of 28x28 pixels with one channel per image. 
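A small side note: the same shape requirements hold for images that do _not_ come from the training set. The snippet below is a hypothetical sketch - `new_image` stands in for an image you loaded yourself, e.g. with PIL or OpenCV, and is filled with random values here - showing how such an image could be brought into the `(1, 28, 28, 1)` shape and `[0, 1]` range that this model expects:

```
# Hypothetical example: preparing one new 28x28 grayscale image for prediction.
# `new_image` stands in for an image you loaded yourself; here it is random data.
new_image = np.random.randint(0, 256, size=(28, 28))

# Cast to float32 and scale to [0, 1], exactly like the training data
new_image = new_image.astype('float32') / 255

# Add a batch axis and a channel axis: (28, 28) -> (1, 28, 28, 1)
new_image = new_image.reshape(1, 28, 28, 1)
print(new_image.shape)  # (1, 28, 28, 1)
```

The resulting array can be passed to `model.predict` in exactly the same way as the samples we selected from the training set.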
+ +The next step is to generate the predictions: + +``` +# Generate predictions for samples +predictions = model.predict(samples_to_predict) +print(predictions) +``` + +The output here seems to be a bit jibberish at first: + +``` +[[8.66183618e-05 1.06925681e-05 1.40683464e-04 4.31487868e-09 + 7.31811961e-05 6.07917445e-06 9.99673367e-01 7.10965661e-11 + 9.43153464e-06 1.98050812e-10] + [6.35617238e-04 9.08200348e-10 3.23482091e-05 4.98994159e-05 + 7.29685112e-08 4.77315152e-05 4.25152575e-06 4.23201502e-10 + 9.98981178e-01 2.48882337e-04] + [9.99738038e-01 3.85520025e-07 1.05982785e-04 1.47284098e-07 + 5.99268958e-07 2.26216093e-06 1.17733900e-04 2.74483864e-05 + 3.30203284e-06 4.03360673e-06] + [3.42538192e-06 2.30619257e-09 1.29460409e-06 7.04832928e-06 + 2.71432992e-08 1.95419183e-03 9.96945918e-01 1.80040043e-12 + 1.08795590e-03 1.78136176e-07]] +``` + +Confused? 😕 Don't be! + +Remember that we used the [Softmax activation function](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) when creating our model. This activation function doesn't compute _the prediction_, but rather a _discrete probability distribution over the target classes_. In simple English, this means that Softmax computes the probability that the input belongs to a particular class, for each class. The values in each row summate to 1 - or 100%, which is a characteristic of a valid probability distribution. + +Now, we can finalize our work by _actually_ finding out what our predicted classes are - by taking the `argmax` values (or "maximum argument", index of the maximum value) for each element in the list with predictions: + +``` +# Generate arg maxes for predictions +classes = np.argmax(predictions, axis = 1) +print(classes) +``` + +This outputs `[6 8 0 6]`. Yeah! ✅ 🎉 + +- [![](images/dig_4.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/dig_4.png) + +- [![](images/dig_2.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/dig_2.png) + +- [![](images/dig_3.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/dig_3.png) + +- [![](images/dig_1.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/dig_1.png) + + +_Note that the code above trains with and predicts with both the training data. While this is bad practice when evaluating a model, it is acceptable when you're confident that your model generalizes to new data. 
I indeed am that it will generalize to new MNIST-like data, and hence I didn't make the split here._ + +## Full code + +If you're interested, you can find the code as a whole here: + +``` +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential, save_model, load_model +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from extra_keras_datasets import emnist +import matplotlib.pyplot as plt +import numpy as np + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load EMNIST dataset +(input_train, target_train), (input_test, target_test) = emnist.load_data(type='digits') + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Cast numbers to float32 +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=sparse_categorical_crossentropy, + optimizer=Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + +# # Save the model +filepath = './saved_model' +save_model(model, filepath) + +# Load the model +model = load_model(filepath, compile = True) + +# A few random samples +use_samples = [5, 38, 3939, 27389] +samples_to_predict = [] + +# Generate plots for samples +for sample in use_samples: + # Generate a plot + reshaped_image = input_train[sample].reshape((img_width, img_height)) + plt.imshow(reshaped_image) + plt.show() + # Add sample to array for prediction + samples_to_predict.append(input_train[sample]) + +# Convert into Numpy array +samples_to_predict = np.array(samples_to_predict) +print(samples_to_predict.shape) + +# Generate predictions for samples +predictions = model.predict(samples_to_predict) +print(predictions) + +# Generate arg maxes for predictions +classes = np.argmax(predictions, axis = 1) +print(classes) +``` + +* * * + +## Summary + +In today's blog post, we looked at how to _generate predictions with a Keras model_. We did so by coding an example, which did a few things: + +- Load EMNIST digits from the [Extra Keras Datasets](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/) module. +- Prepare the data. +- Define and train a Convolutional Neural Network for classification. +- Save the model. +- Load the model. 
- Generate new predictions with the loaded model and validate that they are correct.

I hope you've learnt something from today's post, even though it was a bit smaller than usual :) Please let me know in the comments section what you think 💬

Thank you for reading MachineCurve today and happy engineering! 😎

\[kerasbox\]

* * *

## References

MachineCurve. (2020, January 10). Making more datasets available for Keras. Retrieved from [https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/)

MachineCurve. (2020, February 11). How to use sparse categorical crossentropy in Keras? Retrieved from [https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/)

Cohen, G., Afshar, S., Tapson, J., & van Schaik, A. (2017). EMNIST: an extension of MNIST to handwritten letters. Retrieved from [http://arxiv.org/abs/1702.05373](http://arxiv.org/abs/1702.05373)
diff --git a/how-to-predict-new-samples-with-your-pytorch-model.md b/how-to-predict-new-samples-with-your-pytorch-model.md
new file mode 100644
index 0000000..f404917
--- /dev/null
+++ b/how-to-predict-new-samples-with-your-pytorch-model.md
@@ -0,0 +1,233 @@
---
title: "How to predict new samples with your PyTorch model?"
date: "2021-02-10"
categories:
  - "buffer"
  - "deep-learning"
  - "frameworks"
---

Training a neural network with PyTorch also means that you'll have to deploy it one day - and this requires that you'll add code for predicting new samples with your model. In this tutorial, we're going to take a look at doing that, and show you how you can generate predictions for new samples with your own model.

It is structured as follows. Firstly, we will be taking a look at actually creating a neural network with PyTorch. We'll briefly walk you through the creation of a Multilayer Perceptron with the framework, which serves as the basis for predicting new samples. This is followed by actually predicting new samples after training the model. Altogether, after reading this tutorial, you will understand...

- **How to create a PyTorch model from a high-level perspective.**
- **How you can generate predictions for new samples with your PyTorch model after training.**

Let's take a look! 🚀

* * *

\[toc\]

* * *

## Today's PyTorch model

In another tutorial, we showed you [how to create a Multilayer Perceptron with PyTorch](https://www.machinecurve.com/index.php/2021/01/26/creating-a-multilayer-perceptron-with-pytorch-and-lightning/). What follows is the code for doing so. If you want to understand all the details, I recommend clicking the link to follow that particular tutorial.

However, here, we will cover it briefly, so that you understand what is happening when you are running the code.

- First, the dependencies. You will need a fresh installation of Python, e.g. 3.6+, but preferably newer. In addition, you'll need PyTorch (`torch`) and the `torchvision` module because you'll train your model on the MNIST dataset.
- Second, the `nn.Module` class. This class represents the neural network, in this case the Multilayer Perceptron. In the `__init__` definition, you specify the layers of your model - here, using the `nn.Sequential` wrapper which stacks all the layers on top of each other.
Using `forward`, you specify the forward pass, or what happens when you let a sample pass through the model. As you can see, you feed it through the layers and return the results.
- Third, the runtime code. Here, you actually prepare the MNIST data, initialize the MLP, define the loss function and optimizer, and define a custom training loop - for 5 iterations, or epochs. In the training loop, for every epoch, you feed the samples forward minibatch by minibatch, compute the loss, compute the gradients in the backward pass, and optimize the model.
- Finally, once all 5 epochs have passed, you print a message that the training process has completed.

```
import os
import torch
from torch import nn
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms

class MLP(nn.Module):
  '''
    Multilayer Perceptron.
  '''
  def __init__(self):
    super().__init__()
    self.layers = nn.Sequential(
      nn.Flatten(),
      nn.Linear(28 * 28 * 1, 64), # MNIST images are 28x28 pixels with 1 channel
      nn.ReLU(),
      nn.Linear(64, 32),
      nn.ReLU(),
      nn.Linear(32, 10)
    )


  def forward(self, x):
    '''Forward pass'''
    return self.layers(x)


if __name__ == '__main__':

  # Set fixed random number seed
  torch.manual_seed(42)

  # Prepare MNIST dataset
  dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
  trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1)

  # Initialize the MLP
  mlp = MLP()

  # Define the loss function and optimizer
  loss_function = nn.CrossEntropyLoss()
  optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4)

  # Run the training loop
  for epoch in range(0, 5): # 5 epochs at maximum

    # Print epoch
    print(f'Starting epoch {epoch+1}')

    # Set current loss value
    current_loss = 0.0

    # Iterate over the DataLoader for training data
    for i, data in enumerate(trainloader, 0):

      # Get inputs
      inputs, targets = data

      # Zero the gradients
      optimizer.zero_grad()

      # Perform forward pass
      outputs = mlp(inputs)

      # Compute loss
      loss = loss_function(outputs, targets)

      # Perform backward pass
      loss.backward()

      # Perform optimization
      optimizer.step()

      # Print statistics
      current_loss += loss.item()
      if i % 500 == 499:
        print('Loss after mini-batch %5d: %.3f' %
              (i + 1, current_loss / 500))
        current_loss = 0.0

  # Process is complete.
  print('Training process has finished.')
```

* * *

## After training: predicting new samples with your PyTorch model

The first thing to do when you want to generate new predictions is to add `matplotlib` and `numpy` to your imports:

```
import matplotlib.pyplot as plt
import numpy as np
```

You can then add the following code to predict new samples with your PyTorch model:

- You first have to disable grad with `torch.no_grad()` or NumPy will not work properly.
- This is followed by specifying information about the item from the MNIST dataset that you want to generate predictions for. You specify an `index`, load the item, and split it into an `image` and a `true_target`. Obviously, this can also be one of the images from your own dataset.
- Generating a prediction is simple - you simply feed it to your `mlp` instance (here, `mlp` is the name of the instantiated `nn.Module` based network, and can be anything depending on how you named the variable where you initialized your neural network).
- The `prediction` is a list of scores across the possible classes (the model outputs logits, since `nn.CrossEntropyLoss` applies Softmax internally) - and you therefore have to turn it into a `predicted_class` variable with `np.argmax`.
This takes the argument with the highest value, i.e. the highest probability. +- This is followed by reshaping the `image` into a shape that can be used by Matplotlib for visualization. The default MNIST dataset represents images as `(1, 28, 28)` whereas Matplotlib requires `(28, 28, 1)`. +- Finally, you visualize the image, and set the prediction compared to the actual target as the `title`. + +``` + # Disable grad + with torch.no_grad(): + + # Retrieve item + index = 256 + item = dataset[index] + image = item[0] + true_target = item[1] + + # Generate prediction + prediction = mlp(image) + + # Predicted class value using argmax + predicted_class = np.argmax(prediction) + + # Reshape image + image = image.reshape(28, 28, 1) + + # Show result + plt.imshow(image, cmap='gray') + plt.title(f'Prediction: {predicted_class} - Actual target: {true_target}') + plt.show() +``` + +These are some of the results: + +- ![](images/pred_3.png) + +- ![](images/pred_2.png) + +- ![](images/pred_1.png) + + +* * * + +## Predicting new samples with a loaded PyTorch mdoel + +You can also use a [saved model](https://www.machinecurve.com/index.php/2021/02/03/how-to-save-and-load-a-pytorch-model/) for inference: + +``` + # Disable grad + with torch.no_grad(): + + # Retrieve item + index = 333 + item = dataset[index] + image = item[0] + true_target = item[1] + + # Loading the saved model + save_path = './mlp.pth' + mlp = MLP() + mlp.load_state_dict(torch.load(save_path)) + mlp.eval() + + # Generate prediction + prediction = mlp(image) + + # Predicted class value using argmax + predicted_class = np.argmax(prediction) + + # Reshape image + image = image.reshape(28, 28, 1) + + # Show result + plt.imshow(image, cmap='gray') + plt.title(f'Prediction: {predicted_class} - Actual target: {true_target}') + plt.show() +``` + +It also works: + +![](images/pred_4.png) + +* * * + +## Recap + +In this tutorial, we looked at how you can generate new predictions with your trained PyTorch model. Using a Multilayer Perceptron trained on the MNIST dataset, you have seen that it is very easy to perform inference - as easy as simply feeding the samples to your model instance. + +Using code examples, you have seen how to perform this, as well as for the case when you load your saved PyTorch model in order to generate predictions. + +I hope that you have learned something from this article! If you did, please feel free to leave a message in the comments section below 💬 Please do the same if you have any questions or remarks whatsoever. I'd love to hear from you :) + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +StackExchange. (n.d.). _What is the use of torch.no\_grad in pytorch?_ Data Science Stack Exchange. [https://datascience.stackexchange.com/questions/32651/what-is-the-use-of-torch-no-grad-in-pytorch](https://datascience.stackexchange.com/questions/32651/what-is-the-use-of-torch-no-grad-in-pytorch) diff --git a/how-to-save-and-load-a-model-with-keras.md b/how-to-save-and-load-a-model-with-keras.md new file mode 100644 index 0000000..36130bb --- /dev/null +++ b/how-to-save-and-load-a-model-with-keras.md @@ -0,0 +1,264 @@ +--- +title: "How to save and load a model with Keras?" +date: "2020-02-14" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "keras" + - "load-model" + - "machine-learning" + - "save-model" + - "tensorflow" +--- + +So far, at MachineCurve, we have primarily focused on how to train models with Keras. 
This is nice, but a bit useless if we cannot save the models that we've trained. Training is expensive and we shouldn't want to retrain a model every time we want to use it. + +Now, fortunately, the Keras deep learning framework supports _saving trained models and loading them for later use_. This is exactly what we want! + +In this blog post, we will therefore find out how it works. Firstly, we'll train a model, which serves as our case for today's blog. Secondly, we'll find out how we can save this model, either as an HDF5 file ("old style saving") or a SavedModel file ("new style saving"). Finally, we'll load this model again, and show you how to generate new predictions with it. + +Are you ready? Let's go! 😎 + +* * * + +\[toc\] + +* * * + +## Our training scenario + +Before we can show you how to save and load your Keras model, we should define an example training scenario - because if we don't, there is nothing to save :D + +So, for this purpose, we'll be using this model today: + +``` +from tensorflow.keras.datasets import mnist +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.models import Sequential +from tensorflow.keras.optimizers import Adam + +# Model configuration +batch_size = 150 +img_width, img_height = 28, 28 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 25 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data +input_train = input_train.reshape((input_train.shape[0], img_width, img_height, 1)) +input_test = input_test.reshape((input_test.shape[0], img_width, img_height, 1)) +input_shape = (img_width, img_height, 1) + +# Cast input to float32 +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Train the model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +It's an adaptation of our [Keras model for valid padding](https://www.machinecurve.com/index.php/2020/02/08/how-to-use-padding-with-keras/#how-to-use-valid-padding-with-keras), where the architecture is optimized to the structure of our dataset (for example, we're using [sparse categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/) because our targets are integers rather than one-hot encoded vectors). + +Now, at a high level, this is what the code above does: + +1. We import the elements of Keras that we need, from the TensorFlow 2.0 installation. +2. We specify some configuration options for our model, such as model hyperparameters (loss function, optimizer), options for training (batch size, number of epochs, verbosity mode) and so on. +3. 
We load and prepare MNIST data, by reshaping it, casting it into `float32` format which presumably speeds up training, and normalizing the data. +4. We create the simple yet possibly useful architecture for the model. +5. We compile the model (i.e. instantiate the skeleton we created in the fourth step) and fit the data (i.e. start the training process). + +With this model, we can now take a look at how to save your model, the architecture only, the weights only, and so on. This includes a look at how to _load_ that stuff again too, haha! :) + +* * * + +## Saving your whole model + +But first, saving the model. + +In order to save whole models, Keras provides the `save_model` definition: + +``` +tf.keras.models.save_model( + model, + filepath, + overwrite=True, + include_optimizer=True, + save_format=None, + signatures=None, + options=None +) +``` + +You can provide these attributes (TensorFlow, n.d.): + +- `model` (required): the _model instance_ that we want to save. In the case of the model above, that's the `model` object. +- `filepath` (required): the path where we wish to write our model to. This can either be a `String` or a `h5py.File` object. In the first case, i.e. the String, the Python file system will write the model to the path specified by the String. In the latter case, a HDF object was opened or created with `h5py`, to which one can write data. Specifying this object will let you write data there, without having to use the Python file system redundantly. +- `overwrite` (defaults to `True`): if the user must be asked to overwrite existing files at the `filepath`, or whether we can simply write away. +- `include_optimizer` (defaults to True): whether we wish to save the state of the optimizer too. This may seem odd at first, but indeed, optimizers also have their state! For example, the Adam optimizer works so well because it applies [momentum-like optimization with local optimization](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam). I can imagine that _this state_, and then especially with respect to local optimization, could be saved. The `include_optimizer` attribute allows you to do so. +- `save_format`: whether you wish to save the file in `tf` or `h5` format. The latter represents a HDF5 file and was the standard option in TensorFlow 1.0. However, in version 2.0+, this was changed into the [SavedModel](https://www.tensorflow.org/guide/saved_model) format. Given this change, it defaults to `tf` in 2.0+ and `h5` in earlier versions. It's up to you to decide what fits best :) +- `signatures`: it's possible to add custom methods to TensorFlow (and hence Keras) models. These are called "signatures". If you wish to save them together with your model, you can do so - by specifying them here. +- `options`: if you wish to save model options too, you could use `options`, which should be a `SaveOptions` [instance](https://www.tensorflow.org/api_docs/python/tf/saved_model/SaveOptions). + +### Saving our model in SavedModel format + +Now, let's take a look at what this means for our model. + +Fortunately, it's a simple one, so we can simply specify the model and the filepath and we're done. Add this to your code and run it to train the model: + +``` +# Save the model +filepath = './saved_model' +save_model(model, filepath) +``` + +Don't forget to add `save_model` to your imports and to create a directory called `save_model` at the `filepath` you specify. 
+ +``` +from tensorflow.keras.models import Sequential, save_model +``` + +After running the model, indeed, our `save_model` folder is now full of model files: + +[![](images/image.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/image.png) + +### Saving our model in HDF5 format + +Now, if we wanted to save our model into HDF (`.h5`) format, we would change the `save_model` call into: + +``` +# Save the model +filepath = './saved_model' +save_model(model, filepath, save_format='h5') +``` + +(You might wish to add `.h5` as a suffix to the filepath, but this is up to you.) + +If you created a folder `saved_model` as before, you would get this error: + +``` +OSError: Unable to create file (unable to open file: name = './saved_model', errno = 13, error message = 'Permission denied', flags = 13, o_flags = 302) +``` + +The reason why this error occurs is that the HDF5 file format ensures that data is contained, i.e. that it is hierarchically structured in just _one_ file. You thus have to remove the directory and run the code again, and voila: + +![](images/image-1.png) + +* * * + +## Loading the whole model + +Now that we have a saved model, we can demonstrate how to _load_ it again - in order to generate predictions. + +The first thing that we'll have to do if we wish to load our Keras model is adding a few extra imports. Firstly, add `load_model` to your `tensorflow.keras.models` import: + +``` +from tensorflow.keras.models import Sequential, save_model, load_model +``` + +Also make sure to import `numpy`, as we'll need to compute an `argmax` value for our Softmax activated model prediction later: + +``` +import numpy as np +``` + +We can then load the model: + +``` +# Load the model +loaded_model = load_model( + filepath, + custom_objects=None, + compile=True +) +``` + +Et voila, you've loaded your model :) + +Now, while `filepath` is pretty clear, what do `custom_objects` and `compile` mean? + +> If the model you want to load includes custom layers or other custom classes or functions, you can pass them to the loading mechanism via the `custom_objects` argument. +> +> Keras (n.d.; FAQ) + +Indeed - by default, custom objects are not saved with the model. You can however specify them with the `custom_objects` attribute upon loading it, like this (Keras, n.d.): + +``` +model = load_model('my_model.h5', custom_objects={'AttentionLayer': AttentionLayer}) +``` + +Now, the `compile` indicates whether the model must be compiled or not. It's `True` by default. If you set it to `False`, you'll have to compile it manually again using `model.compile`, but in return you'll get the freedom to tweak the configuration options a bit. + +### Predictions for new data + +With the model we loaded, we can generate predictions for new data: + +``` +# Generate a prediction with loaded model +sample_index = 788 +sample_input, sample_target = input_test[sample_index], target_test[sample_index] +sample_input_array = np.array([sample_input]) +predictions = loaded_model.predict(sample_input_array) +prediction = np.argmax(predictions[0]) +print(f'Ground truth: {sample_target} - Prediction: {prediction}') +``` + +Here, for sample `788`, we take the true input and true target, feed the input to the model, and store the prediction. Subsequently, we print it, to check whether it's correct when we run the `py` file: + +``` +Ground truth: 9 - Prediction: 9 +``` + +Hooray! 
🎉 + +* * * + +## Summary + +In this blog post, we saw how we can utilize Keras facilities for saving and loading models: i.e., the `save_model` and `load_model` calls. Through them, we've been able to train a Keras model, save it to disk in either HDF5 or SavedModel format, and load it again. + +I hope this blog was useful for you! If it was, feel free to leave a comment in the comments section below 💬 Please do the same if you have questions, when you think I've made a mistake or when you have other remarks. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +TensorFlow. (n.d.). tf.keras.models.save\_model. Retrieved from [https://www.tensorflow.org/api\_docs/python/tf/keras/models/save\_model](https://www.tensorflow.org/api_docs/python/tf/keras/models/save_model) + +TensorFlow. (n.d.). tf.keras.models.load\_model. Retrieved from [https://www.tensorflow.org/api\_docs/python/tf/keras/models/load\_model](https://www.tensorflow.org/api_docs/python/tf/keras/models/load_model) + +Keras. (n.d.). FAQ: Handling custom layers (or other custom objects) in saved models. Retrieved from [https://keras.io/getting-started/faq/#handling-custom-layers-or-other-custom-objects-in-saved-models](https://keras.io/getting-started/faq/#handling-custom-layers-or-other-custom-objects-in-saved-models) diff --git a/how-to-save-and-load-a-pytorch-model.md b/how-to-save-and-load-a-pytorch-model.md new file mode 100644 index 0000000..0861ff9 --- /dev/null +++ b/how-to-save-and-load-a-pytorch-model.md @@ -0,0 +1,123 @@ +--- +title: "How to save and load a PyTorch model?" +date: "2021-02-03" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "load-model" + - "machine-learning" + - "pytorch" + - "save-model" +--- + +You don't train deep learning models without using them later. Instead, you want to save them, in order to load them later - allowing you to perform inference activities. + +In this tutorial, we're going to take a look at **saving and loading your models created with PyTorch**. PyTorch is one of the leading frameworks for deep learning these days and is widely used in the deep learning industry. After reading it, you will understand... + +- How you can use `torch.save` for saving your PyTorch model. +- How you can load the model by initializing the skeleton and loading the state. + +Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## Saving a PyTorch model + +Suppose that you have created a PyTorch model, say a simple Multilayer Perceptron, like this. + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Conv2d(1, 5, kernel_size=3), + nn.Flatten(), + nn.Linear(26 * 26 * 5, 300), + nn.ReLU(), + nn.Linear(300, 64), + nn.ReLU(), + nn.Linear(64, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) +``` + +You can then define a [training loop](https://www.machinecurve.com/index.php/2021/01/26/creating-a-multilayer-perceptron-with-pytorch-and-lightning/#defining-the-training-loop) in order to train the model, in this case with the MNIST dataset. Note that we don't repeat creating the training loop here - click the link to see how this can be done. 
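To give you an impression of what such a loop could look like without repeating the full tutorial, here is a heavily condensed sketch. It assumes the `MLP` class defined above is in scope; the dataset, batch size, learning rate and number of epochs are illustrative choices only, not requirements for saving a model.

```
import os
import torch
from torch import nn
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader
from torchvision import transforms

# Prepare the MNIST dataset and wrap it in a DataLoader
dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
trainloader = DataLoader(dataset, batch_size=10, shuffle=True)

# Initialize the network (the MLP class defined above), loss function and optimizer
mlp = MLP()
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4)

# Train for a few epochs
for epoch in range(5):
  for inputs, targets in trainloader:
    optimizer.zero_grad()                   # Zero the gradients
    outputs = mlp(inputs)                   # Forward pass
    loss = loss_function(outputs, targets)  # Compute loss
    loss.backward()                         # Backward pass
    optimizer.step()                        # Optimization step
  print(f'Epoch {epoch+1} completed')
```

Once a loop like this has finished, the trained weights live inside the `mlp` object - and that is exactly the state we will save next.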
+ +After training, it is possible that you have found a model that is useful in the real world. + +In other words, a well-performing model that must be saved. + +And saving a deep learning model with PyTorch is actually really easy - the only thing that you have to do is call `torch.save`, like this: + +``` +# Saving the model +save_path = './mlp.pth' +torch.save(mlp.state_dict(), save_path) +``` + +Here, you define a path to a PyTorch (`.pth`) file, and save the state of the model (i.e. the weights) to that particular file. Note that `mlp` here is the initialization of the neural network, i.e. we executed `mlp = MLP()` during the construction of your training loop. `mlp` is thus any object instantiated based on your `nn.Module` extending neural network class. + +When you run your model next time, the state gets saved to a file called `./mlp.pth`. + +* * * + +## Loading a saved PyTorch model + +...but things don't end there. When you saved a PyTorch model, you likely want to load it at a different location. + +For inference, for example, meaning that you will use it in a deployment setting for generating predictions. + +Loading the model is however really easy and involves the following steps: + +1. Initializing the model skeleton. +2. Loading the model state from a file defined at a particular path. +3. Setting the state of your model to the state just loaded. +4. Evaluating the model. + +``` +# Loading the model +mlp = MLP() +mlp.load_state_dict(torch.load(save_path)) +mlp.eval() +``` + +That's it! + +* * * + +## Recap + +After training a deep learning model with PyTorch, it's time to use it. This requires you to save your model. In this tutorial, we covered how you can **save and load your PyTorch models** using `torch.save` and `torch.load`. + +I hope that you have learned something from this article, despite it being really short - and shorter than you're used to when reading this website! Still, there's no point in writing a lot of text when the important things can be said with only few words, is there? :) + +If you have questions, please feel free to reach out in the comments section below 💬 + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +PyTorch. (n.d.). [https://pytorch.org](https://pytorch.org/) diff --git a/how-to-use-batch-normalization-with-keras.md b/how-to-use-batch-normalization-with-keras.md new file mode 100644 index 0000000..29ea07c --- /dev/null +++ b/how-to-use-batch-normalization-with-keras.md @@ -0,0 +1,446 @@ +--- +title: "How to use Batch Normalization with Keras?" +date: "2020-01-15" +categories: + - "deep-learning" + - "frameworks" +tags: + - "batch-normalization" + - "deep-learning" + - "keras" + - "machine-learning" + - "neural-network" + - "training-process" +--- + +The inputs to individual layers in a neural network can be normalized to speed up training. This process, called Batch Normalization, attempts to resolve an issue in neural networks called internal covariate shift. + +But how does it work? And how does it work in terms of code - with the Keras deep learning framework? We'll take a look at these questions in this blog. + +Firstly, we'll provide a recap on Batch Normalization to ensure that you've gained some _conceptual_ understanding, or that it has been revived. This includes a discussion on the problem, why it occurs during training, and how Batch Normalization may resolve it. 
+ +Then, we move on to the actual Keras part - by providing you with an example neural network using Batch Normalization to learn classification on the KMNIST dataset. Each step of the code which creates the neural network is explained so that you understand how it works. + +Are you ready? Let's go! :) + +* * * + +\[toc\] + +* * * + +## Recap: about Batch Normalization + +Before we start coding, let's take a brief look at [Batch Normalization](https://www.machinecurve.com/index.php/2020/01/14/what-is-batch-normalization-for-training-neural-networks/) again. We start off with a discussion about _internal covariate shift_ and how this affects the learning process. Subsequently, as the need for Batch Normalization will then be clear, we'll provide a recap on Batch Normalization itself to understand what it does. + +### Training a supervised ML model + +Suppose that you have this neural network, which is composed of Dropout neurons: + +[![](images/dropout.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/dropout.png) + +Following the [high-level supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), training such a neural network is a multi-step process: + +- Feeding your training data to the network in a _feedforward_ fashion, in which each layer processes your data further. +- This leads to a prediction for every sample. +- This prediction can be compared to the actual target value (the "ground truth"), to see how well the model performs. +- How well, or strictly speaking how _bad_ the model performs is reflected in the _[loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/)_. +- Improving the neural network means firstly, identifying the necessary change in the weights of each neuron with respect to the loss value, and possibly with respect to the intermediate layers as well. +- Secondly, by means of an optimizer like [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or an [adaptive optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), the weights get updated based on these necessary changes (also called gradients). + +### Internal covariate shift + +Now take a look at the neural network from a per-layer point of view. Each layer takes some input, performs a linear operation using the input vector and the weights vector, feeds the data into a nonlinear activation function, and passes the data to the next layer or the output. + +Neural networks train fast if the distribution of the data remains the same, and especially if it is normalized to the range of \[latex\](\\mu = 0, \\sigma = 1)\[/latex\]. This is not the case when no Batch Normalization is applied: by training the network (i.e. changing the weights of the individual neurons), the outputs for every layer change, which means that the distribution of input data for every layer will change during every iteration. + +We call this _internal covariate shift_ (Ioffe & Szegedy, 2015). It is bad, because it can slow down learning. Fortunately, it can be avoided - and Batch Normalization is a way of doing so. + +### Batch Normalization normalizes layer inputs on a per-feature basis + +As we saw before, neural networks train fast if the distribution of the input data remains similar over time. 
Batch Normalization helps you do this by doing two things: _normalizing the input value_ and _scaling and shifting it_. + +**Normalizing the value:** \[latex\]\\hat{x}\_B^{(k)} \\leftarrow \\frac{x\_B{ ^{(k)} } - \\mu\_B^{(k)}}{\\sqrt{ \\sigma^2{ \_B^{(k)} } + \\epsilon}}\[/latex\] + +Every input \[latex\]x\_B{ ^{(k)}}\[/latex\] is normalized by first subtracting input sample mean \[latex\] \\mu\_B^{(k)} \[/latex\] and then dividing by \[latex\] \\sqrt{ \\sigma^2{ \_B^{(k)} } + \\epsilon} \[/latex\], which is the square root of the variance of the input sample, plus some \[latex\] \\epsilon \[/latex\]. Do note: + +- Whenever we mention "sample" we mean just _one dimension_ of the feature vectors in our minibatch, as normalization is done _per dimension_. This means, for e.g. the feature vector \[latex\]\[2.31, 5.12, 0.12\]\[/latex\], Batch Normalization is applied _three times_, so once per dimension. +- Contrary to _true_ \[latex\](0, 1)\[/latex\] normalization, a small value represented by \[latex\]\\epsilon\[/latex\] is added to the square root, to ensure that the denominator is nonzero - avoiding division by zero errors. + +**Scaling and shifting:** \[latex\]y\_i \\leftarrow \\gamma\\hat{x} \_B ^{(k)} + \\beta\[/latex\]. + +With some activation functions (such as the Sigmoid activation function), normalizing inputs to have the \[latex\](0, 1)\[/latex\] distribution may result in a different issue: they'll activate almost linearly as they primarily activate in the linear segment of the activation function. + +[Here](https://www.machinecurve.com/index.php/2020/01/14/what-is-batch-normalization-for-training-neural-networks/#scaling-and-shifting), I explain this in more detail, and why this needs to be avoided. + +By _scaling_ the value with some \[latex\]\\gamma\[/latex\] and _shifting_ the value with some \[latex\]\\beta\[/latex\], this problem can be avoided. The values for these are learnt during training. + +### Batch Normalization in the Keras API + +In the Keras API (TensorFlow, n.d.), Batch Normalization is defined as follows: + +``` +keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None) +``` + +Put simply, Batch Normalization can be added as easily as adding a `BatchNormalization()` layer to your model, e.g. with `model.add`. However, if you wish, local parameters can be tuned to steer the way in which Batch Normalization works. These parameters are as follows: + +- **Axis**: the axis of your data which you like Batch Normalization to be applied on. Usually, this is not of importance, but if you have a channels-first Conv layer, it must be set to 1. +- **Momentum**: the momentum that is to be used on the moving mean and the moving variance. +- **Epsilon**: the value for \[latex\]\\epsilon\[/latex\] that is to be used in the normalization step, to avoid division by zero (also see the Batch Normalization formula above). +- **Center**: if `True`, the value for \[latex\]\\beta\[/latex\] is used; if `False`, it's ignored. +- **Scale**: if `True`, the value for \[latex\]\\gamma\[/latex\] is used; if `False`, it's ignored. +- **Beta initializer, regularizer and constraint:** these define the Keras initializer, regularizer and constraints for the _beta_ i.e. the center factor. 
They give you more control over how the network learns the values during training. +- **Gamma initializer, regularizer and constraint:** these define the Keras initializer, regularizer and constraints for the _gamma_ i.e. the scaling factor. They give you more control over how the network learns the values during training. +- **Moving mean initializer, moving variance initializer:** these define the Keras initializers for the moving mean and moving variance. + +Why the moving mean and variance, you say? + +This has to do with how Batch Normalization works _during training time_ versus _inference time_. + +During training time, there's a larger minibatch available which you can use to compute sample mean and sample variance. + +However, during inference, the sample size is _one_. There's no possibility to compute an average mean and an average variance - because you have _one_ value only, which may be an outlier. Having the moving _mean_ and moving _variance_ from the training process available during inference, you can use these values to normalize during inference. Smart and simple, but a great fix for this issue :) + +* * * + +## Today's model + +Let's take a look at the model we're going to create today :) First, we'll see what dataset we're going to use - being the KMNIST datset. This is followed by a discussion about the model we'll be creating in this tutorial. + +### The dataset + +For the dataset, we're using the KMNIST dataset today: + +[![](images/kmnist-kmnist.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/kmnist-kmnist.png) + +It is a drop-in replacement for the MNIST dataset: + +- The training set has 60.000 samples; +- The testing set has 10.000 samples; +- Each sample is a 28x28 pixels image; +- Each sample belongs to one of 10 target classes. + +#### Using the `extra-keras-datasets` module + +We use the `extra-keras-datasets` module to load our dataset. This module, which we created and discussed [in a different blog post](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/), attempts to replicate the `keras.datasets` way of loading data into your machine learning projects, albeit with different datasets. As we believe that making more datasets easily available boosts adoption of a framework, especially by people who are just starting out, we've been making available additional datasets for Keras through this module. It may be worthwhile to check it out separately! 
+ +Installing this module is required if you wish to run the model (if you don't, you may also replace it with `keras.datasets.mnist`), and can be done very easily: + +``` +pip install extra-keras-datasets +``` + +With one line, a dataset can be imported into your model: + +``` +from extra_keras_datasets import kmnist +``` + +And subsequently loading the data into the particular variables is also easy: + +``` +(input_train, target_train), (input_test, target_test) = kmnist.load_data(type='kmnist') +``` + +### The model architecture + +This is the architecture of today's model, which we generated with [Net2Vis](https://www.machinecurve.com/index.php/2020/01/07/visualizing-keras-neural-networks-with-net2vis-and-docker/) (Bäuerle & Ropinski, 2019): + +- [![](images/graph-1-1-1024x173.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/graph-1-1.png) + +- [![](images/legend-1-1024x108.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/legend-1.png) + + +Our model has two _convolutional_ blocks followed by two _dense_ layers: + +- Each convolutional block contains a Conv2D layer and a MaxPooling2D layer, whose outputs are normalized with BatchNormalization layers. +- The convolutional blocks will learn the feature maps, and will thus learn to generate activations for certain _parts_ within the images, such as edges. +- With a Flatten layer, the contents of the feature maps are converted into a one-dimensional Tensor that can be used in the Dense layers. +- The Dense layers together produce the classification. The input to the final Dense layer from the first one is also BatchNormalized. + +* * * + +## Keras implementation + +Let's now see how we can implement this model with Keras :) We'll be using the TensorFlow 2.0 approach to Keras, which is the currently preferred way of using the library. This unfortunately means that it's no longer possible to use Keras with Theano or CNTK. However, if you wish to still use it, it may still work to replace `tensorflow.keras` with `keras`, i.e. the original library. + +Creating the model is a multi-step process: + +- First, we import everything that we need. +- Then, we set the model configuration. +- This is followed by loading and preparing the dataset. +- Subsequently, we define the architecture of our model in line with what we defined above. +- Then, we compile the model and fit the data, i.e. start the training process. +- Once this finishes, we generate evaluation metrics based on our testing set. + +Let's go! Open your Explorer or Finder, navigate to some folder, and create a Python file, e.g. `model_batchnorm.py`. Next, open this file in your code editor - so that we can start coding :) + +### Model imports + +These are our model imports: + +``` +from extra_keras_datasets import kmnist +import tensorflow +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras.layers import BatchNormalization +``` + +We can describe them as follows: + +- We'll import the main `tensorflow` library so that we can import Keras stuff next. +- Then, from `models`, we import the `Sequential` API - which allows us to stack individual layers nicely and easily. +- Then, from `layers`, we import `Dense`, `Flatten`, `Conv2D`, `MaxPooling2D` and `BatchNormalization` - i.e., the layers from the architecture that we specified. +- Finally, we import the `kmnist` dataset from the `extra_keras_datasets` library. 
+ +### Model configuration + +We can then set the configuration for our model: + +``` +# Model configuration +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +``` + +The batch size of our minibatches is set to 250, which balances well between gradient preciseness and memory requirements. We'll train for 25 epochs (which could be higher if you wish, just configure it to a different number :)) and tell the model that we have 10 classes that it can classify into - i.e., the 10 KMNIST classes. 20% of the training data will be used for validation purposes and with verbosity mode set to True, all output will be displayed on screen. + +### Loading & preparing data + +The next step is loading the data. Given the simplicity with which `extra-keras-datasets` can be used, thanks to the original `keras.datasets` module, this is definitely easy: + +``` +# Load KMNIST dataset +(input_train, target_train), (input_test, target_test) = kmnist.load_data(type='kmnist') +``` + +Subsequent processing of the data so that it is prepared for training is a bit more complex, but it is neither very difficult: + +``` +# Shape of the input sets +input_train_shape = input_train.shape +input_test_shape = input_test.shape +``` + +With this step, we obtain the shape of our `input_train` and `input_test` datasets, i.e. our _features_. We'll use the first to set the shape of our Keras input data next - which are image height (shape dim 1), image width (shape dim 2) and the number of channels (just one): + +``` +# Keras layer input shape +input_shape = (input_train_shape[1], input_train_shape[2], 1) +``` + +Channels have to be included because Keras expects them during training. + +Next, because the data does not have yet the channels property, we'll have to reshape our data to include it there as well: + +``` +# Reshape the training data to include channels +input_train = input_train.reshape(input_train_shape[0], input_train_shape[1], input_train_shape[2], 1) +input_test = input_test.reshape(input_test_shape[0], input_test_shape[1], input_test_shape[2], 1) +``` + +Now, the bulk of the work is done. We next convert the data to `float32` format which presumably speeds up training: + +``` +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') +``` + +And finally normalize the data: + +``` +# Normalize input data +input_train = input_train / 255 +input_test = input_test / 255 +``` + +We're now ready to define the architecture. + +### Defining the model architecture + +Since we already discussed the architecture of our model above, its components won't be too surprising. However, what is still lacking is the actual _code_ for our architecture - so let's write it now and explain it afterwards: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(BatchNormalization()) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(BatchNormalization()) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(BatchNormalization()) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(BatchNormalization()) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(BatchNormalization()) +model.add(Dense(no_classes, activation='softmax')) +``` + +What this code does is create an instance of a `model` based on the `Sequential` API. 
Subsequently, the convolutional, pooling, batch normalization and Dense layers are stacked with `model.add`. + +Some things we haven't included in the architectural discussion before: + +- **Activation functions: for the intermediate layers**: we use the [ReLU activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) in our convolutional and Dense layers, except for the last one. ReLU is the de facto standard activation function used today and we hence use it in our model. Given the small size of our dataset, we omit applying [He init](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/), which is preferred over Xavier/Glorot init when using ReLU. +- **Activation function for the final layer:** in this layer, we're using the [Softmax activation function](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/), which generates a probability distribution over the target classes, from which we can select the most likely class. + +### Model compilation & fitting data + +The next step is model compilation: + +``` +# Compile the model +model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) +``` + +As you can see, model compilation is essentially _instantiating_ the model architecture we defined before with the model configuration we set before. We use [sparse categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/), which combines nicely with our integer target values - so that we don't have to convert these into categorical format before we start training. To optimize the model, we use the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam), and add accuracy as an additional metric. + +Then, we fit the data to our model, a.k.a. starting the training process: + +``` +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +We fit the input training set with its corresponding targets, and train according to the preconfigured `batch_size` and `no_epochs`, with verbosity mode set to on and the `validation_split` set as before (i.e., to 20%). + +Note that the `history` object can be used for [visualizing the training process / the improvements over epochs](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/) later. + +### Generating evaluation metrics + +The final step is generating evaluation metrics with our test set, to see whether our model generalizes to unseen data: + +``` +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +Now, we can start training! 
:) + +### Full model code + +Should you wish instead to obtain the full code for the model at once, here you go :) + +``` +from extra_keras_datasets import kmnist +import tensorflow +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras.layers import BatchNormalization + +# Model configuration +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load KMNIST dataset +(input_train, target_train), (input_test, target_test) = kmnist.load_data(type='kmnist') + +# Shape of the input sets +input_train_shape = input_train.shape +input_test_shape = input_test.shape + +# Keras layer input shape +input_shape = (input_train_shape[1], input_train_shape[2], 1) + +# Reshape the training data to include channels +input_train = input_train.reshape(input_train_shape[0], input_train_shape[1], input_train_shape[2], 1) +input_test = input_test.reshape(input_test_shape[0], input_test_shape[1], input_test_shape[2], 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize input data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(BatchNormalization()) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(BatchNormalization()) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(BatchNormalization()) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(BatchNormalization()) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(BatchNormalization()) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metric s +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +* * * + +## Results + +To start training, open up a terminal which has the required software dependencies installed (i.e. `tensorflow` 2.0+ and the `extra-keras-datasets` module), `cd` to the folder where your Python file is located, and run it with e.g. `python model_batchnorm.py`. + +Most likely, the training process will then begin, and you should see the test results once it finishes. Here are the results over the epochs shown visually. They were generated by means of the `history` object (note that you must add [extra code](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/) to make this work): + +- [![](images/accuracy.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/accuracy.png) + +- [![](images/loss-3.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/loss-3.png) + + +As you can see, the model performs well. Obviously, for practical settings, this will be different as your data set is likely much more complex, but I'm curious whether Batch Normalization will help ensure faster convergence in your models! 
Please let me know in the comments section below :) + +* * * + +## Summary + +In this blog post, we've looked at how to apply Batch Normalization in your Keras models. This included a discussion about the concept of internal covariate shift and why this may slow down the learning process. Additionally, we provided a recap on the concept of Batch Normalization and how it works, and why it may reduce these issues. + +This was followed by a Keras implementation using the TensorFlow 2.0 way of working. The full code was split into small blocks which contained an explanation. This way, I hope that you understood well why I coded what I coded. + +A long story short: I hope you've learnt something today! If you did, I'd love to know what, and you can leave a comment below. Please do the same if you have questions left or remarks that you wish to express. Thank you for reading MachineCurve today and happy engineering! 😊 + +\[kerasbox\] + +* * * + +## References + +Ioffe, S., & Szegedy, C. (2015). [Batch normalization: Accelerating deep network training by reducing internal covariate shift](https://arxiv.org/abs/1502.03167). _arXiv preprint arXiv:1502.03167_. + +Bäuerle, A., & Ropinski, T. (2019). [Net2Vis: Transforming Deep Convolutional Networks into Publication-Ready Visualizations](https://arxiv.org/abs/1902.04394). arXiv preprint arXiv:1902.04394. + +MachineCurve. (2020, January 14). What is Batch Normalization for training neural networks? Retrieved from [https://www.machinecurve.com/index.php/2020/01/14/what-is-batch-normalization-for-training-neural-networks/](https://www.machinecurve.com/index.php/2020/01/14/what-is-batch-normalization-for-training-neural-networks/) + +Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., & Ha, D. (2018). Deep learning for classical Japanese literature. arXiv preprint arXiv:1812.01718. Retrieved from [https://arxiv.org/abs/1812.01718](https://arxiv.org/abs/1812.01718) + +TensorFlow. (n.d.). tf.keras.layers.BatchNormalization. Retrieved from [https://www.tensorflow.org/api\_docs/python/tf/keras/layers/BatchNormalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization) diff --git a/how-to-use-binary-categorical-crossentropy-with-keras.md b/how-to-use-binary-categorical-crossentropy-with-keras.md new file mode 100644 index 0000000..09a0f7b --- /dev/null +++ b/how-to-use-binary-categorical-crossentropy-with-keras.md @@ -0,0 +1,765 @@ +--- +title: "How to use binary & categorical crossentropy with TensorFlow 2 and Keras?" +date: "2019-10-22" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "binary-crossentropy" + - "categorical-crossentropy" + - "crossentropy" + - "deep-learning" + - "keras" + - "loss-function" + - "machine-learning" +--- + +Recently, I've been covering many of the deep learning [loss functions](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) that can be used - by converting them into actual Python code with the Keras deep learning framework. + +Today, in this post, we'll be covering _binary crossentropy_ and _categorical crossentropy_ - which are common loss functions for binary (two-class) classification problems and categorical (multi-class) classification problems. + +Note that another post on [sparse categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/) extends this post, and particularly the categorical crossentropy one. 
+ +What we'll do today: first, we recap the maths and _intuition_ behind the two crossentropies, since it's imperative to understand them before we implement them. This includes a comparison between the cross-entropies and another type of loss function that can be used, being hinge loss. + +We then continue, explaining what we need to run the models and introducing the datasets we'll use today (we generate them ourselves). + +Subsequently, we cover the implementation for both the binary crossentropy Keras model and the categorical one - in detail. We discuss each individual block to ensure that you understand what happens in the code. + +After reading this tutorial, you will understand... + +- **What the binary and categorical crossentropy loss functions do.** +- **How to use binary crossentropy loss with TensorFlow 2 based Keras.** +- **How to use categorical crossentropy loss with TensorFlow 2 based Keras.** + +Let's go! 😎 + +_Note that the full code for the models we create in this blog post is also available through my [Keras Loss Functions repository](https://github.com/christianversloot/keras-loss-functions) on GitHub._ + +* * * + +**Update 09/Mar/2021:** updated the tutorial to use `CategoricalCrossentropy` loss without explicitly setting `Softmax` in the final layer, by using `from_logits = True`. This pushes computing the probability distribution into the categorical crossentropy loss function and is more stable numerically. + +**Update 10/Feb/2021:** updated the tutorial to ensure that all code examples reflect TensorFlow 2 based Keras, so that they can be used with recent versions of the library. + +* * * + +\[toc\] + +* * * + +## Example code: binary & categorical crossentropy with TF2 and Keras + +This example code shows quickly how to use **binary and categorical crossentropy loss with TensorFlow 2 and Keras**. You can easily copy it to your model code and use it within your neural network. However, if you want to understand the loss functions in more detail and why they should be applied to certain classification problems, make sure to read the rest of this tutorial as well 🚀 + +``` +loss_function_used = 'binary_crossentropy' # or use categorical_crossentropy +model.compile(loss=loss_function_used, optimizer=tensorflow.keras.optimizers.Adam(lr=0.001), metrics=['accuracy']) +``` + +* * * + +## Recap on the cross-entropies + +As promised, we'll first provide some recap on the intuition (and a little bit of the maths) behind the cross-entropies. We start with the binary one, subsequently proceed with categorical crossentropy and finally discuss how both are different from e.g. hinge loss. + +### Binary crossentropy for binary classification + +Binary crossentropy in maths: + +![](images/image-5-1024x122.png) + +Don't let the maths scare you away... just read on! 😉 + +It can be visualized as follows: + +- ![](images/bce-1-1024x421.png) + + Binary crossentropy, target = 1 + +- ![](images/bce_t0-1024x459.png) + + Binary crossentropy, target = 0 + + +Well, what you need to know first is this: **binary crossentropy** works with **binary classification problems**, which is a difficult term for the _simple observation_ that your sample either belongs to class one (e.g. "diabetes") or class zero ("no diabetes"). Binary classification in most cases boils down to a true/false problem, where you want to classify new samples into one class or another. 
+
+This also means that in your training set, each feature vector out of the many that your set contains (a feature vector contains the descriptive variables that together represent some relationship about the pattern you wish to discover) belongs to one of two targets: zero or one, or \[latex\]{ 0, 1 }\[/latex\].
+
+Now, if we take a look at the [high-level machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) for supervised learning problems (which classification belongs to), we see that training encompasses many _forward passes_ - you essentially feed your training set to the current model, generate predictions, compare them to the actual targets, compute a loss value (hey, that's what we cover today!) and subsequently optimize by slightly adapting the model's internals. That way, you hope that your model improves when you repeat the process, eventually finding a model that performs really well.
+
+With this context, the equation above becomes a lot less scary. First, let's introduce some additional information:
+
+- The binary cross entropy is computed _for each sample_ once the prediction is made. That means that upon feeding many samples, you compute the binary crossentropy many times, subsequently e.g. adding all results together to find the final crossentropy value.
+- The formula above therefore covers the binary crossentropy _per sample_.
+- For an arbitrary forward pass, this means that the binary crossentropy requires two input values - `t`, which is the actual target value for the sample (thus either zero or one) and `p`, which is the prediction generated by the model (likely anything between zero and one if you used the correct [activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) in your final layer).
+- Now, feeding `t` and `p` to the logarithmic function above generates the curves visualized above - which, intuitively interpreted, simply tell you that loss increases the further your `p` moves away from the actual `t`. The loss increases ever more steeply the further away you are, which means that false predictions are not only penalized, but that _confident false predictions_ (e.g. ones that are really off) are penalized more significantly than less confident mistakes.
+
+Hope you now understand the binary crossentropy intuitively 😄
+
+### Categorical crossentropy for multiclass classification
+
+Next up: categorical crossentropy.
+
+While _binary crossentropy_ can be used for _binary_ classification problems, not many classification problems are binary. Take for example the problems where the answer is not implicitly a true/false question, such as "diabetes" versus "no diabetes". The MNIST dataset is a clear example: there are [10 possible classes](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/).
+
+In that case, binary crossentropy cannot be used.
+
+Enter categorical crossentropy!
+
+![](images/image-6.png)
+
+Again, don't let the maths scare you away 😊
+
+The equation looks slightly more complex, and it is, but we can once again explain it extremely intuitively.
+
+What you'll first have to understand is that with _categorical_ crossentropy the targets must be _categorical_: that is, they cannot be integer-like (in the MNIST dataset the targets are integers ranging from 0-9) but must say for all possible classes whether the target belongs to the class or not.
+
+We do so by converting the integer targets into _categorical format_, or vectorized format, like \[latex\]\[0, 0, 0, 1, 0, 0, 0, 0, 0, 0\]\[/latex\] for the MNIST dataset - this target represents class 3, by the way.
+
+Similar to binary crossentropy, categorical crossentropy is computed for each sample and eventually merged together - hence, the formula above takes once again two inputs: prediction `p` and target `t`, where _both_ are categorical.
+
+How can predictions be categorical? They cannot be converted into categorical format from numeric format, can they?
+
+Well, you're right - but it's not exactly what happens. Instead of converting the data into categorical format, with categorical crossentropy we apply a categorical [activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) (such as Softmax) which generates a _multiclass probability distribution_.
+
+That's a difficult term which simply tells us that it outputs a vector (hence categorical format!) containing for each class the probability that the sample belongs to the class, all probabilities together being 1 (for 100%).
+
+If for example we have a target \[latex\]\[0, 0, 1\]\[/latex\], a confident and well-trained ML model would in that case output e.g. \[latex\]\[0, 0.05, 0.95\]\[/latex\], to give just one example.
+
+Now that we understand that both inputs to categorical crossentropy are of categorical format, we can proceed to scrutinize the actual function.
+
+For the sake of simplicity, here it is again:
+
+![](images/image-6.png)
+
+The maths tell us that we iterate over all classes \[latex\]C\[/latex\] that our machine learning problem describes.
+
+The maths tell us too that some _observation_ is used in the computation - hence the \[latex\]o,c\[/latex\] with the `t` and `p`.
+
+But what is an observation? We can look at this through the lens of the Sigma, or the loop / iteration, which simply iterates over all the possible classes.
+
+On each iteration, the particular element in both the _target vector_ and the _prediction vector_ is inspected and used in the computation - that is what is meant by an _observation_: inspecting a particular value that is part of a bigger whole, in this case both categorical vectors.
+
+For each observation, the logarithmic computation is made, which resembles the binary crossentropy one.
+
+However, there is one interesting detail: for all \[latex\]t\[/latex\] unequal to the actual target value, the result of this computation is 0, since \[latex\]t\_{o,c}\[/latex\] is 0 in that case.
+
+This way, categorical crossentropy allows us to compute the loss value for multiclass classification problems - while remaining flexible _with respect to the actual target class_.
+
+### Crossentropy vs hinge loss
+
+As we've seen theoretically and will see practically, crossentropy loss can be successfully used in classification problems.
+
+We'll now briefly look at a different question: **why use crossentropy loss and not hinge loss?** That is, [binary hinge](https://www.machinecurve.com/index.php/2019/10/15/how-to-use-hinge-squared-hinge-loss-with-keras/) and [multiclass hinge](https://www.machinecurve.com/index.php/2019/10/17/how-to-use-categorical-multiclass-hinge-with-keras/) loss can both be used as well instead of binary and multiclass crossentropy.
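+
+Before comparing them on a chart, you can get a quick numerical feel for how differently the two losses penalize the same prediction. The sketch below is standalone and not part of the models we build later; it sweeps over a range of made-up predictions for a positive sample and prints both loss values (note that, according to its documentation, Keras' `Hinge` loss converts 0/1 labels to -1/+1 internally):
+
+```
+import numpy as np
+from tensorflow.keras.losses import BinaryCrossentropy, Hinge
+
+# Sweep of hypothetical predictions for a sample whose target is 1
+predictions = np.linspace(0.05, 0.95, 10)
+
+for p in predictions:
+    # Binary crossentropy: -log(p) for a target of 1
+    bce = BinaryCrossentropy()([[1.0]], [[p]]).numpy()
+    # Hinge: max(0, 1 - p), since the 0/1 target is treated as +1 internally
+    hinge = Hinge()([[1.0]], [[p]]).numpy()
+    print(f'p = {p:.2f} - binary crossentropy: {bce:.3f} - hinge: {hinge:.3f}')
+```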
+ +Here, we see hinge loss and binary crossentropy loss plotted together: + +[![](images/hinge_binary-1024x524.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/hinge_binary.png) + +Although you'll have to find out which one works best for your ML problem by means of experimentation, these are some points as to the differences of hinge loss and crossentropy loss: + +- Hinge loss will attempt to maximize the margin between your classes, whereas crossentropy loss will attempt to maximize the likelihood of the prediction being the target class (Varma, n.d.). This is a fundamentally different approach: cross entropy requires a probability distribution while hinge loss does not (Why Isn't Cross Entropy Used in SVM?, n.d.). +- However, both are reported to perform as well as each other( What loss function should I use for binary detection in face/non-face detection in CNN?, n.d.). +- In terms of interpreting the outputs, you'll likely prefer the crossentropy outputs since they tell you something about how likely the sample belongs to one class. In the binary case, the real number between 0 and 1 tells you something about the binary case, whereas the categorical prediction tells you something about the multiclass case. Hinge loss just generates a number, but does not compare the classes (softmax+cross entropy v.s. square regularized hinge loss for CNNs, n.d.). +- However, hinge loss - which is simply a \[latex\]max()\[/latex\] function, is easier to compute than crossentropy loss computationally, which requires computing logarithms (Tay, n.d.). The same applies to its derivative: the derivative of a logarithm is more complex than the derivative of a maximum function, which can be rewritten as a [piecewise function](https://ell.stackexchange.com/questions/14700/what-is-the-name-of-the-branched-function-maths-in-correct-english). Hence, you'll get results faster when using hinge loss: "If you need real time decisions with a lesser \[sic\] accuracy depend on it" (Caliskan, n.d.). For crossentropy: "If you have any sexy idea like deep learning try to understand it." +- Hinge loss also introduces sparsity into your machine learning problem, while crossentropy loss does not (Tay, n.d.). + +So in short: if you'll favor sparsity and are not interested in maximum likelihood estimation (in plainer English: if you have a true/false problem, or a one-of-easily-separable-classes problem), it's likely that you'll gain faster and adequate results with hinge loss. Give it a try. If it doesn't work well, switch to crossentropy loss, as "just using this loss function to train your ML model will make it work relatively well" (Pham, n.d.). + +**If you wish to dive deeper into these losses, take a look at those articles:** + +- [About loss and loss functions](https://machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) +- [How to use sparse categorical crossentropy in Keras?](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/) +- [How to use hinge & squared hinge loss with Keras?](https://www.machinecurve.com/index.php/2019/10/15/how-to-use-hinge-squared-hinge-loss-with-keras/) +- [How to use categorical / multiclass hinge with Keras?](https://www.machinecurve.com/index.php/2019/10/17/how-to-use-categorical-multiclass-hinge-with-keras/) + +* * * + +## Let's build some Keras models! + +Enough theory for now - let's create something! 
😎
+
+### What you'll need to run them
+
+If you want to run the model, you'll need to install these dependencies first:
+
+- **Python**, since we'll run everything in this language. You need Python 3.8+ to make it work.
+- **TensorFlow 2.x**, which is the deep learning framework of our choice today. It includes `tensorflow.keras`, the tightly coupled Keras API.
+- **Matplotlib**, for visualizing the dataset.
+- **Numpy**, for processing numbers and data.
+- **Scikit-learn**, for generating the simple dataset that we will use today.
+- **Mlxtend**, which you can use to visualize the [decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) of your model.
+
+### Today's datasets
+
+We use two simple datasets today - a **circles** dataset for the binary classification problem and a **clusters** dataset for the multiclass one.
+
+We generate them with Scikit-learn and use these simple ones on purpose, because we don't want to distract you from the goal of today's blog post - which is to create Keras models _with particular loss functions_. The goal is not to clean data whatsoever, so we use simple datasets that are easily separable.
+
+Before we proceed to dissecting the code, we'll show you the datasets first.
+
+This is the _circles_ dataset from our _binary_ classification scenario:
+
+[![](images/22_bce_circles.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/22_bce_circles.png)
+
+And this is the _clusters_ one from our _multiclass_ scenario:
+
+![](images/22_cce_clusters.png)
+
+* * *
+
+## Binary crossentropy Keras model
+
+Let's now create the Keras model using binary crossentropy. Open up some folder in your File Explorer (whether Apple, Windows or Linux 😉 - I just don't know all the names of the explorers in the different OSes) and create a file called `binary-cross-entropy.py`. Open up the file in some code editor and yep, we can write some code 😄
+
+### Imports
+
+First, we'll have to import everything we need to run the model:
+
+```
+'''
+  TensorFlow 2 based Keras model discussing Binary Cross Entropy loss.
+'''
+import tensorflow
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense
+from tensorflow.keras.losses import BinaryCrossentropy
+import matplotlib.pyplot as plt
+import numpy as np
+from sklearn.datasets import make_circles
+from mlxtend.plotting import plot_decision_regions
+```
+
+As discussed, we're using Keras, Matplotlib, Numpy, Scikit-learn and Mlxtend. We use the Keras Sequential API for stacking our model's layers and use densely-connected layers, or Dense layers, only. We use the PyPlot library from Matplotlib, `make_circles` from Scikit-learn and `plot_decision_regions` from Mlxtend, which we use to visualize the decision boundary of our model later.
+
+### Model configuration
+
+Next, we specify some model configuration options:
+
+```
+# Configuration options
+num_samples_total = 1000
+training_split = 250
+loss_function_used = BinaryCrossentropy()
+```
+
+Our `make_circles` call will generate 1000 samples in total, of which 250 will be set apart as testing data. We will use `BinaryCrossentropy` as our loss function, which is not surprising.
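+
+Before we wire this loss into a model, we can make the maths from earlier tangible with a quick, standalone sanity check. The snippet below is not part of the model script; it uses hand-picked targets `t` and predictions `p` (made up for illustration) to compare a manual implementation of the per-sample binary crossentropy formula with the value Keras computes:
+
+```
+import numpy as np
+from tensorflow.keras.losses import BinaryCrossentropy
+
+# Hand-picked example values: the further p drifts away from t,
+# the larger the per-sample loss should become.
+t = np.array([1.0, 1.0, 0.0, 0.0])
+p = np.array([0.95, 0.60, 0.40, 0.05])
+
+# Manual per-sample binary crossentropy: -(t * log(p) + (1 - t) * log(1 - p))
+manual_bce = -(t * np.log(p) + (1 - t) * np.log(1 - p))
+print(manual_bce)  # confident correct predictions yield small losses
+
+# Keras averages the per-sample values into one loss value for the batch
+keras_bce = BinaryCrossentropy()(t, p).numpy()
+print(keras_bce, manual_bce.mean())  # approximately equal
+```
+
+If you flip one of the predictions to a confidently wrong value (e.g. `0.95` for a target of `0`), you will see the corresponding per-sample loss explode - exactly the behaviour the curves above describe.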
+
+### Dataset generation, preparation & visualization
+
+Next, we generate the dataset:
+
+```
+# Generate data
+X, targets = make_circles(n_samples = num_samples_total, factor=0.1)
+X_training = X[training_split:, :]
+X_testing = X[:training_split, :]
+Targets_training = targets[training_split:]
+Targets_testing = targets[:training_split]
+```
+
+First, we call `make_circles` and generate the 1000 samples. Our `factor`, which scales the inner circle relative to the outer one, is 0.1 - which means that the circles are relatively far apart. This benefits separability and hence makes the learning problem easier (which benefits this blog post for the sake of simplicity).
+
+Next, we split data into training and testing data, before we move on to data visualization:
+
+```
+# Generate scatter plot for training data
+plt.scatter(X_training[:,0], X_training[:,1])
+plt.title('Nonlinear data')
+plt.xlabel('X1')
+plt.ylabel('X2')
+plt.show()
+```
+
+These Matplotlib calls produce the visualization that we saw before:
+
+[![](images/22_bce_circles.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/22_bce_circles.png)
+
+The only thing left before we can create the model architecture is the specification of the _shape_ of our input feature vector. Shape is meant literally here: what does the vector look like? How many dimensions does the vector have and how many elements will be covered by each dimension? Images, for example, are two-dimensional, videos three-dimensional and sound waves one-dimensional.
+
+In our case, we'll use a one-dimensional feature vector with a shape of (2, ) - one dimension, two elements in that dimension (the X and Y values represented in the visualization above).
+
+This looks as follows when coded:
+
+```
+# Set the input shape
+feature_vector_shape = len(X_training[0])
+input_shape = (feature_vector_shape,)
+```
+
+### Model architecture & configuration
+
+Next, we can actually create the architecture of our model:
+
+```
+# Create the model
+model = Sequential()
+model.add(Dense(12, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
+model.add(Dense(8, activation='relu', kernel_initializer='he_uniform'))
+model.add(Dense(1, activation='sigmoid'))
+```
+
+It's a really simple one. We use the Keras Sequential API which allows us to stack the layers nicely. We use four layers, of which two are hidden. The first hidden layer has twelve neurons, the ReLU [activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) and hence the He uniform [initialization](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/). The `input_shape` represents the input to this layer, and hence the structure of the first layer (being the input layer) of your neural network.
+
+A similar design is used for the second layer, although our data will be slightly more abstract now and I feel like 8 neurons are enough. This is often a process of trial and error: you don't want to create information bottlenecks by squeezing all information into very few neurons, but you also don't want to use way too many, since the model may not be able to summarize enough and hence might produce degraded performance.
+
+The final layer has just one output neuron, which produces one value - indeed, that prediction between zero and one. The Sigmoid function is capable of producing this output: with a range of (0, 1), it converts any input to a value in that interval.
We now have an architecture that allows us to separate two classes. Let's move on to model configuration. + +[![](images/sigmoid-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/05/sigmoid.png) + +The Sigmoid activation function produces outputs between zero and one. + +Configuring the model means applying the _hyperparameters_, or the configuration settings. We do this next: + +``` +# Configure the model and start training +model.compile(loss=loss_function_used, optimizer=tensorflow.keras.optimizers.Adam(lr=0.001), metrics=['accuracy']) +history = model.fit(X_training, Targets_training, epochs=30, batch_size=5, verbose=1, validation_split=0.2) +``` + +Particularly, we specify the loss function used, as well as the optimizer (Adam, since it's the default optimizer with relatively good performance across many ML problems) and possibly some additional metrics - such as accuracy in our case, since humans can interpret accuracy more intuitively than e.g. Crossentropy loss. + +Next, we fit the data to the model architecture. This actually starts the training process. We train for 30 iterations, or epochs, with a batch size of 5 - which benefits the memory used. Verbosity is set to `true` so that we can see full model output and 20% of the training data is used for validation purposes. Since all data is randomly shuffled upon creation, we do not need to worry about certain biases here, nor with the training/test split. + +### Model testing & visualization + +The training process will evaluate model performance continuously during training. However, this is the model's _predictive power_, i.e. how well it performs against data it has seen before (or based on which it has optimized, i.e. the validation dataset). + +However, in order to make the model useful to the real world, it must also work well against data it has _never seen before_. You cannot include all possible data in your training set and you don't want your model to be very off when you use it in the real world, simply because it was trained against the training set data too much. + +That's why you generate a _testing set_, which you can use to evaluate the model's _generalization performance_ afterwards. We can do that in Keras by calling the `evaluate` call on the `model` instance: + +``` +# Test the model after training +test_results = model.evaluate(X_testing, Targets_testing, verbose=1) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%') +``` + +Next, we [plot the decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) for our model with the _testing data_ to see how well it performs, once again with data it has never seen before: + +``` +# Plot decision boundary +plot_decision_regions(X_testing, Targets_testing, clf=model, legend=2) +plt.show() +``` + +And we [plot the model's training process history](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/), to find out whether it has improved, whether it can still improve any further and whether it is not overfitting: + +``` +# Visualize training process +plt.plot(history.history['loss'], label='Binary crossentropy loss (training data)') +plt.plot(history.history['val_loss'], label='Binary crossentropy loss (validation data)') +plt.title('Binary crossentropy loss for circles') +plt.ylabel('Binary crossentropy loss value') +plt.yscale('log') +plt.xlabel('No. 
epoch') +plt.legend(loc="upper left") +plt.show() +``` + +Click one or both of the two links above if you wish to understand the visualization code in more detail. + +### The results + +Now that we have the full code, we can actually run it to find out how well it performs. Let's open up a Python terminal, e.g. an Anaconda prompt or your regular terminal, `cd` to the folder and execute `python binary-cross-entropy.py`. + +The training process will then start and eventually finish, while you'll see a visualization of the data you generated first. The outputs will be something like this: + +[![](images/22_bce_db.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/22_bce_db.png) + +As you can see, with binary crossentropy, the Keras model has learnt to generate a decision boundary that allows us to distinguish between both classes accurately. This is unsurprising, since we allowed the circles to be very well separable, and this is represented in model history: + +[![](images/22_bce_history.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/22_bce_history.png) + +... when 30 epochs passed, the model was still improving, also when tested with validation data. Hence, it was not overfitting yet - unsurprising again given the separability of our circles. This was once again confirmed by the _test set evaluation_ which produced an accuracy of 100% - as illustrated in the plot with the decision boundary. + +### Whole model code + +If you wish to copy the whole model at once, here you go: + +``` +''' + TensorFlow 2 based Keras model discussing Binary Cross Entropy loss. +''' +import tensorflow +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +from tensorflow.keras.losses import BinaryCrossentropy +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_circles +from mlxtend.plotting import plot_decision_regions + +# Configuration options +num_samples_total = 1000 +training_split = 250 +loss_function_used = BinaryCrossentropy() + +# Generate data +X, targets = make_circles(n_samples = num_samples_total, factor=0.1) +X_training = X[training_split:, :] +X_testing = X[:training_split, :] +Targets_training = targets[training_split:] +Targets_testing = targets[:training_split] + +# Generate scatter plot for training data +plt.scatter(X_training[:,0], X_training[:,1]) +plt.title('Nonlinear data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Set the input shape +feature_vector_shape = len(X_training[0]) +input_shape = (feature_vector_shape,) + +# Create the model +model = Sequential() +model.add(Dense(12, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(8, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(1, activation='sigmoid')) + +# Configure the model and start training +model.compile(loss=loss_function_used, optimizer=tensorflow.keras.optimizers.Adam(lr=0.001), metrics=['accuracy']) +history = model.fit(X_training, Targets_training, epochs=30, batch_size=5, verbose=1, validation_split=0.2) + +# Test the model after training +test_results = model.evaluate(X_testing, Targets_testing, verbose=1) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%') + +# Plot decision boundary +plot_decision_regions(X_testing, Targets_testing, clf=model, legend=2) +plt.show() + +# Visualize training process +plt.plot(history.history['loss'], label='Binary crossentropy loss (training data)') +plt.plot(history.history['val_loss'], 
label='Binary crossentropy loss (validation data)') +plt.title('Binary crossentropy loss for circles') +plt.ylabel('Binary crossentropy loss value') +plt.yscale('log') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() +``` + +* * * + +## Categorical crossentropy Keras model + +Let's now move on with the categorical crossentropy case. I won't cover repetitive details here, such as why we need certain imports. If you wish to understand the whys here in more detail, please refer to the previous part about binary crossentropy. Where new details arise, I will obviously cover them. + +We implement the categorical crossentropy variant by creating a file called `categorical-cross-entropy.py` in a code editor. Let's start! 😎 + +### Imports + +We first put in place the imports: + +``` +''' + TensorFlow 2 based Keras model discussing Categorical Cross Entropy loss. +''' +import tensorflow +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +from tensorflow.keras.losses import CategoricalCrossentropy +from tensorflow.keras.utils import to_categorical +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from mlxtend.plotting import plot_decision_regions +``` + +Instead of `make_circles` as with the binary crossentropy model, we'll use `make_blobs` here - it allows us to make clusters of data instead of to draw two circles. Additionally, we import `to_categorical` from `keras.utils`, which ensures that we can convert our integer targets into categorical format. + +### Model configuration + +Next, we specify the configuration of our model: + +``` +# Configuration options +num_samples_total = 1000 +training_split = 250 +cluster_centers = [(15,0), (15,15), (0,15), (30,15)] +num_classes = len(cluster_centers) +loss_function_used = CategoricalCrossentropy(from_logits=True) +``` + +In particular, `cluster_centers` and `num_classes` are new while the loss function used has changed into `CategoricalCrossentropy`. We set `from_logits = True` because of Softmax constraints - we'll come back to that later. `cluster_centers` describes the centers of the, in this case, four clusters across the two dimensions of the space in which we visualize the clusters. + +If you wish, you can increase the number of clusters, reduce the number of clusters, relocate the clusters - keeping them easily separable or reducing or even removing separability, to see how the model performs. I really suggest you do this to get additional intuition for how generation of decision boundaries performs! + +`num_classes` is simply the length of the `cluster_centers` list, since the clusters must actually be created 😉 + +### Dataset generation, preparation & visualization + +Dataset generation is similar to the binary case but different at two essential points: + +``` +# Generate data +X, targets = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 1.5) +categorical_targets = to_categorical(targets) +X_training = X[training_split:, :] +X_testing = X[:training_split, :] +Targets_training = categorical_targets[training_split:] +Targets_testing = categorical_targets[:training_split].astype(np.int32) +``` + +First, as introduced earlier, we use `make_blobs` instead of `make_circles`. 
This requires us to use additional variables such as `n_samples` (`num_samples_total`), `centers` (`cluster_centers`), `n_features` (`num_classes`) and other configuration options such as `cluster_std`, which determines how big the clusters are (by setting their standard deviation from the cluster's center).
+
+Additionally, we convert targets into categorical format by applying `to_categorical` before we split them into training and testing targets.
+
+Setting the shape remains identical, but is still required:
+
+```
+# Set shape based on data
+feature_vector_length = len(X_training[0])
+input_shape = (feature_vector_length,)
+print(f'Feature shape: {input_shape}')
+```
+
+The same applies to the visualization code for the dataset:
+
+```
+# Generate scatter plot for training data
+plt.scatter(X_training[:,0], X_training[:,1])
+plt.title('Nonlinear data')
+plt.xlabel('X1')
+plt.ylabel('X2')
+plt.show()
+```
+
+### Model architecture & configuration
+
+We use the same architecture, except for the output layer:
+
+```
+# Create the model
+model = Sequential()
+model.add(Dense(12, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
+model.add(Dense(8, activation='relu', kernel_initializer='he_uniform'))
+model.add(Dense(num_classes, activation='linear'))
+```
+
+We remember from the intuitive introduction to binary and categorical crossentropy that, contrary to the binary case, categorical crossentropy expects a _multiclass probability distribution_, or the relative likelihood that the sample belongs to any of the classes within your ML problem.
+
+Hence, we must:
+
+1. Find an activation function that supports this. The Sigmoid function generates a single real value between zero and one, which is not a multiclass probability distribution. Fortunately, the **Softmax** activation function is capable of this. However, it's built into `CategoricalCrossentropy` loss if `from_logits = True`, and in fact, this is expected to be more numerically stable. That's why we activate _linearly_ here and set `from_logits` to `True` in our loss definition above.
+2. Change the number of output neurons. Remember, we don't generate an _integer_ target anymore, but instead a _list_ of probabilities, one per class in the ML problem. Hence, you'll need `num_classes` neurons to generate the predictions by means of Softmax; in our case, that's 4 neurons.
+
+We kept model compilation & data fitting the same:
+
+```
+# Configure the model and start training
+model.compile(loss=loss_function_used, optimizer=tensorflow.keras.optimizers.Adam(lr=0.001), metrics=['accuracy'])
+history = model.fit(X_training, Targets_training, epochs=30, batch_size=5, verbose=1, validation_split=0.2)
+```
+
+### Model testing & visualization
+
+The same for model performance testing:
+
+```
+# Test the model after training
+test_results = model.evaluate(X_testing, Targets_testing, verbose=1)
+print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%')
+```
+
+However, visualizing model performance is done a bit differently when it comes to [visualizing the decision boundaries](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/). As we have seen in the [hinge loss case](https://www.machinecurve.com/index.php/2019/10/17/how-to-use-categorical-multiclass-hinge-with-keras/), Mlxtend does not support categorical data natively when plotting the model's decision boundaries. This was fixed with the help of dr.
Sebastian Raschka (see [https://github.com/rasbt/mlxtend/issues/607](https://github.com/rasbt/mlxtend/issues/607)), Mlxtend's creator, who provided additional code that wraps the model into one generating non-categorical (i.e., integer) target data, even though it was trained to produce categorical data.
+
+We therefore need to add this code next:
+
+```
+'''
+  The Onehot2Int class is used to adapt the model so that it generates non-categorical data.
+  This is required by the `plot_decision_regions` function.
+  The code is courtesy of dr. Sebastian Raschka at https://github.com/rasbt/mlxtend/issues/607.
+  Copyright (c) 2014-2016, Sebastian Raschka. All rights reserved. Mlxtend is licensed as https://github.com/rasbt/mlxtend/blob/master/LICENSE-BSD3.txt.
+  Thanks!
+'''
+# No hot encoding version
+class Onehot2Int(object):
+
+  def __init__(self, model):
+    self.model = model
+
+  def predict(self, X):
+    y_pred = self.model.predict(X)
+    return np.argmax(y_pred, axis=1)
+
+# fit keras_model
+keras_model_no_ohe = Onehot2Int(model)
+
+# Plot decision boundary
+plot_decision_regions(X_testing, np.argmax(Targets_testing, axis=1), clf=keras_model_no_ohe, legend=3)
+plt.show()
+'''
+  Finish plotting the decision boundary.
+'''
+```
+
+It will allow us to generate the decision boundary plot 😄
+
+Finally, we visualize the training process as we have done with the binary case:
+
+```
+# Visualize training process
+plt.plot(history.history['loss'], label='Categorical crossentropy loss (training data)')
+plt.plot(history.history['val_loss'], label='Categorical crossentropy loss (validation data)')
+plt.title('Categorical crossentropy loss for clusters')
+plt.ylabel('Categorical crossentropy loss value')
+plt.yscale('log')
+plt.xlabel('No. epoch')
+plt.legend(loc="upper left")
+plt.show()
+```
+
+### The results
+
+As you can see in the decision plot visualization, the categorical crossentropy based model has been able to distinguish between the four classes quite accurately:
+
+[![](images/22_cce_db.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/22_cce_db.png)
+
+Only in the orange and red areas were there some misclassifications. But well, that happens!
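+
+One practical side effect of combining a _linear_ output layer with `from_logits = True` is that `model.predict` now returns logits rather than probabilities. If you want the multiclass probability distribution back - for example, to inspect individual test samples - you can apply Softmax manually. A minimal sketch, assuming the trained `model` and `X_testing` from the code above:
+
+```
+import numpy as np
+import tensorflow
+
+# The model outputs logits, so apply Softmax manually to obtain probabilities
+logits = model.predict(X_testing)
+probabilities = tensorflow.nn.softmax(logits, axis=1).numpy()
+predicted_classes = np.argmax(probabilities, axis=1)
+
+print(probabilities[0])      # e.g. something like [0.01, 0.02, 0.94, 0.03]
+print(predicted_classes[0])  # the corresponding class index
+```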
+ +Loss is also going down, although less smoothly: + +[![](images/22_cce_history.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/22_cce_history.png) + +All in all, I'm happy with the performance of this model too 😄 + +### Whole model code + +If you wish to obtain the whole model at once, you can find it here: + +``` +import tensorflow +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +from tensorflow.keras.losses import CategoricalCrossentropy +from tensorflow.keras.utils import to_categorical +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from mlxtend.plotting import plot_decision_regions + +# Configuration options +num_samples_total = 1000 +training_split = 250 +cluster_centers = [(15,0), (15,15), (0,15), (30,15)] +num_classes = len(cluster_centers) +loss_function_used = CategoricalCrossentropy(from_logits=True) + +# Generate data +X, targets = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 1.5) +categorical_targets = to_categorical(targets) +X_training = X[training_split:, :] +X_testing = X[:training_split, :] +Targets_training = categorical_targets[training_split:] +Targets_testing = categorical_targets[:training_split].astype(np.int32) + +# Set shape based on data +feature_vector_length = len(X_training[0]) +input_shape = (feature_vector_length,) +print(f'Feature shape: {input_shape}') + +# Generate scatter plot for training data +plt.scatter(X_training[:,0], X_training[:,1]) +plt.title('Nonlinear data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Create the model +model = Sequential() +model.add(Dense(12, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(8, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(num_classes, activation='linear')) + +# Configure the model and start training +model.compile(loss=loss_function_used, optimizer=tensorflow.keras.optimizers.Adam(lr=0.001), metrics=['accuracy']) +history = model.fit(X_training, Targets_training, epochs=30, batch_size=5, verbose=1, validation_split=0.2) + +# Test the model after training +test_results = model.evaluate(X_testing, Targets_testing, verbose=1) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%') + +''' + The Onehot2Int class is used to adapt the model so that it generates non-categorical data. + This is required by the `plot_decision_regions` function. + The code is courtesy of dr. Sebastian Raschka at https://github.com/rasbt/mlxtend/issues/607. + Copyright (c) 2014-2016, Sebastian Raschka. All rights reserved. Mlxtend is licensed as https://github.com/rasbt/mlxtend/blob/master/LICENSE-BSD3.txt. + Thanks! +''' +# No hot encoding version +class Onehot2Int(object): + + def __init__(self, model): + self.model = model + + def predict(self, X): + y_pred = self.model.predict(X) + return np.argmax(y_pred, axis=1) + +# fit keras_model +keras_model_no_ohe = Onehot2Int(model) + +# Plot decision boundary +plot_decision_regions(X_testing, np.argmax(Targets_testing, axis=1), clf=keras_model_no_ohe, legend=3) +plt.show() +''' + Finish plotting the decision boundary. 
+''' + +# Visualize training process +plt.plot(history.history['loss'], label='Categorical crossentropy loss (training data)') +plt.plot(history.history['val_loss'], label='Categorical crossentropy loss (validation data)') +plt.title('Categorical crossentropy loss for clusters') +plt.ylabel('Categorical crossentropy loss value') +plt.yscale('log') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() +``` + +* * * + +## Summary + +In this blog post, we did quite some things. We first covered the binary crossentropy and categorical crossentropy loss functions intuitively, explaning them - with some maths - in plain English. Subsequently, we provided two example implementations with the Keras deep learning framework. + +I hope you've learnt something from this blog post! If you did, let me know - I would really appreciate your comments in the comment area below 👇 If you have questions, or if you feel like I made some mistakes, let me know! I'll happily answer you, to try and help you move forward, or improve the post. Thanks! 😊 + +_Note that the full code for the models we created in this blog post is also available through my [Keras Loss Functions repository](https://github.com/christianversloot/keras-loss-functions) on GitHub._ + +* * * + +## References + +About loss and loss functions – MachineCurve. (2019, October 15). Retrieved from [https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) + +Keras. (n.d.). Losses. Retrieved from [https://keras.io/losses/](https://keras.io/losses/) + +Varma, R. (n.d.). Picking Loss Functions - A comparison between MSE, Cross Entropy, and Hinge Loss. Retrieved from [https://rohanvarma.me/Loss-Functions/](https://rohanvarma.me/Loss-Functions/) + +Why Isn't Cross Entropy Used in SVM? (n.d.). Retrieved from [https://stats.stackexchange.com/a/284413](https://stats.stackexchange.com/a/284413) + +What loss function should I use for binary detection in face/non-face detection in CNN? (n.d.). Retrieved from [https://stats.stackexchange.com/a/228668](https://stats.stackexchange.com/a/228668) + +softmax+cross entropy v.s. square regularized hinge loss for CNNs. (n.d.). Retrieved from [https://stats.stackexchange.com/questions/299876/softmaxcross-entropy-v-s-square-regularized-hinge-loss-for-cnns#comment570017\_299876](https://stats.stackexchange.com/questions/299876/softmaxcross-entropy-v-s-square-regularized-hinge-loss-for-cnns#comment570017_299876) + +Tay, J. (n.d.). Jonathan Tay's answer to What are the advantages of hinge loss over log loss? Retrieved from [https://www.quora.com/What-are-the-advantages-of-hinge-loss-over-log-loss/answer/Jonathan-Tay](https://www.quora.com/What-are-the-advantages-of-hinge-loss-over-log-loss/answer/Jonathan-Tay) + +Caliskan, K. (n.d.). Kerem Caliskan's answer to What is the advantage/disadvantage of Hinge-loss compared to cross-entropy? Retrieved from [https://www.quora.com/What-is-the-advantage-disadvantage-of-Hinge-loss-compared-to-cross-entropy/answer/Kerem-Caliskan](https://www.quora.com/What-is-the-advantage-disadvantage-of-Hinge-loss-compared-to-cross-entropy/answer/Kerem-Caliskan) + +Pham, H. (n.d.). Hieu Pham's answer to When should you use cross entropy loss and why? Retrieved from [https://www.quora.com/When-should-you-use-cross-entropy-loss-and-why/answer/Hieu-Pham-20](https://www.quora.com/When-should-you-use-cross-entropy-loss-and-why/answer/Hieu-Pham-20) + +Raschka, S. (n.d.). Home - mlxtend. 
Retrieved from [http://rasbt.github.io/mlxtend/](http://rasbt.github.io/mlxtend/) + +Raschka, S. (2018). MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack. _Journal of Open Source Software_, _3_(24), 638. [doi:10.21105/joss.00638](https://joss.theoj.org/papers/10.21105/joss.00638) + +StackOverflow. (n.d.). _From\_logits=True and from\_logits=False get different training result for tf.losses.CategoricalCrossentropy for UNet_. Stack Overflow. [https://stackoverflow.com/questions/57253841/from-logits-true-and-from-logits-false-get-different-training-result-for-tf-loss](https://stackoverflow.com/questions/57253841/from-logits-true-and-from-logits-false-get-different-training-result-for-tf-loss) + +TensorFlow. (n.d.). _Tf.keras.losses.CategoricalCrossentropy_. [https://www.tensorflow.org/api\_docs/python/tf/keras/losses/CategoricalCrossentropy](https://www.tensorflow.org/api_docs/python/tf/keras/losses/CategoricalCrossentropy) diff --git a/how-to-use-categorical-multiclass-hinge-with-keras.md b/how-to-use-categorical-multiclass-hinge-with-keras.md new file mode 100644 index 0000000..78b299c --- /dev/null +++ b/how-to-use-categorical-multiclass-hinge-with-keras.md @@ -0,0 +1,514 @@ +--- +title: "How to use categorical / multiclass hinge with TensorFlow 2 and Keras?" +date: "2019-10-17" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "categorical-hinge-loss" + - "deep-learning" + - "hinge-loss" + - "keras" + - "loss-function" + - "machine-learning" + - "mlxtend" +--- + +Recently, I've been looking into [loss functions](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) - and specifically these questions: What is their purpose? How does the concept of loss work? And more practically, how can loss functions be implemented with the TensorFlow 2 based Keras framework for deep learning? + +This resulted in blog posts that e.g. covered [huber loss](https://www.machinecurve.com/index.php/2019/10/12/using-huber-loss-in-keras/) and [hinge & squared hinge loss](https://www.machinecurve.com/index.php/2019/10/15/how-to-use-hinge-squared-hinge-loss-with-keras/). Today, in this tutorial, we'll extend the latter to multiclass classification: we cover **categorical hinge loss**, or multiclass hinge loss. How can categorical hinge / multiclass hinge be implemented with TF2 based Keras? That's what well find out today. + +After reading this tutorial, you will understand... + +- What it means to go from binary hinge loss to multiclass hinge loss. +- How categorical (multiclass) hinge loss works. +- How `tensorflow.keras.losses.CategoricalHinge` can be used in your TensorFlow 2 based Keras model. + +Let's go! 😎 + +* * * + +**Update 10/Feb/2021:** ensure that article is up to date. Code examples now reflect TensorFlow 2 ecosystem and have been upgraded from TensorFlow/Keras 1.x. + +* * * + +\[toc\] + +* * * + +## Code example: multiclass hinge loss with TensorFlow 2 based Keras + +This code example demonstrates quickly **how to use categorical (multiclass) hinge loss with TensorFlow 2 based Keras**. You can use this in your model straight away. 
If you want to understand the background details for multiclass hinge, make sure to read the rest of this tutorial as well 🚀 + +``` +loss_function_used = 'categorical_hinge' +model.compile(loss=loss_function_used, optimizer=optimizer_used, metrics=additional_metrics) +``` + +* * * + +## From binary hinge to multiclass hinge + +In that previous blog, we looked at _hinge loss_ and _squared hinge loss_ - which actually helped us to generate a decision boundary between two classes and hence a classifier, but yep - two classes only. + +Hinge loss and squared hinge loss can be used for [binary classification problems](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/). + +Unfortunately, many of today's problems aren't binary, but rather, multiclass: the number of possible target classes is \[latex\]> 2\[/latex\]. + +And hinge and squared hinge do not accommodate for this. + +But **categorical hinge loss**, or multiclass hinge loss, does - and it is available in Keras! + +* * * + +## How does multiclass hinge work? + +_Multiclass hinge_ was introduced by researchers Weston and Watkins (Wikipedia, 2011): + +![](images/image-2-1024x170.png) + +What this means in plain English is this: + +**For a prediction \[latex\]y\[/latex\], take all \[latex\]y\[/latex\] values unequal to \[latex\]t\[/latex\], and compute the individual losses. Eventually, sum them together to find the multiclass hinge loss.** + +The name _categorical hinge loss_, which is also used in place of multiclass hinge loss, already implies what's happening here: + +We first convert our regular targets into categorical data. That is, if we have three possible target classes {0, 1, 2}, an arbitrary target (e.g. 2) would be converted into categorical format (in that case, \[latex\]\[0, 0, 1\]\[/latex\]). + +Next, _for any sample_, our DL model generates a multiclass probability distribution over all possible target classes. That is, for the total probability of 100% (or, statistically, \[latex\]1\[/latex\]) it generates the probability that any of the possible categorical classes is the actual target class (in the scenario above, e.g. \[latex\]\[0.25, 0.25, 0.50\]\[/latex\] - which would mean _class two_, but with some uncertainty. + +Computing the loss - the difference between _actual target and predicted targets_ - is then equal to computing the hinge loss for _taking the prediction for all the computed classes, except for the target class, since loss is always 0 there_. The hinge loss computation itself is similar to the [traditional hinge loss](https://www.machinecurve.com/index.php/2019/10/15/how-to-use-hinge-squared-hinge-loss-with-keras/). + +Categorical hinge loss can be optimized as well and hence used for generating decision boundaries in multiclass machine learning problems. Let's now see how we can implement it with TensorFlow 2 based Keras. + +* * * + +## Today's dataset: extending the binary case + +...which requires defining a dataset first :-) + +In our post covering [traditional hinge loss](https://www.machinecurve.com/index.php/2019/10/15/how-to-use-hinge-squared-hinge-loss-with-keras/), we generated data ourselves because this increases simplicity. + +We'll do so as well in today's blog. Specifically, we create a dataset with three separable clusters that looks as follows: + +[![](images/mh_3.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/mh_3.png) + +How? Let's find out. 
+
+First, open some folder and create a Python file where you'll write your code - e.g. `multiclass-hinge.py`.
+
+Next, open a development environment as well as the file, and you can start coding 😊
+
+### Importing software dependencies
+
+First, we add the imports:
+
+```
+'''
+  Keras model discussing Categorical (multiclass) Hinge loss.
+'''
+import tensorflow
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense
+from tensorflow.keras.utils import to_categorical
+import matplotlib.pyplot as plt
+import numpy as np
+from sklearn.datasets import make_blobs
+from mlxtend.plotting import plot_decision_regions
+```
+
+We need **TensorFlow 2** (`pip install tensorflow`) since we build the model by means of its APIs and functionalities. From its `tensorflow.keras` representation of [Keras](https://keras.io), we need:
+
+- The **Sequential API**, which allows us to stack neural network layers;
+- The **densely-connected layer type**, since we'll build our network by means of such layers.
+
+We also need **Matplotlib** for generating visualizations of our dataset, **Numpy** for basic number processing, **Scikit-learn** for generating the dataset and **Mlxtend** for [visualizing the decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) of our model.
+
+### Model & data configuration
+
+We next add some configuration options:
+
+```
+# Configuration options
+num_samples_total = 3000
+training_split = 1000
+num_classes = 3
+loss_function_used = 'categorical_hinge'
+learning_rate_used = 0.03
+optimizer_used = tensorflow.keras.optimizers.Adam(lr=learning_rate_used)
+additional_metrics = ['accuracy']
+num_epochs = 30
+batch_size = 5
+validation_split = 0.2 # 20%
+```
+
+The dataset contains 3000 samples in total, divided over three clusters (and hence three classes), as we saw in the image above. The `training_split` value is 1000, which means that 1000 samples are split off the training set to serve as testing data.
+
+Next, we specify the hyperparameters. Obviously, we'll use categorical hinge loss. We set the learning rate to 0.03 since traditional hinge required a [more aggressive value](https://www.machinecurve.com/index.php/2019/10/15/how-to-use-hinge-squared-hinge-loss-with-keras/) than the 0.001 that is the default in Keras. We use the Adam optimizer and configure it to use this learning rate, which is very common today since Adam is the de facto standard optimizer used in DL projects.
+
+As an additional metric, we specify accuracy, as we have done before in many of our blog posts. Accuracy is more intuitively understandable to humans.
+
+The model will train for 30 epochs with a batch size of 5 samples per forward pass, and 20% of the training data (400 of the 2000 remaining training samples) will be used for validating each epoch as validation data.
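+
+Before we generate any data, we can make the hinge formula from earlier a bit more tangible with a standalone sanity check. Note that, according to the TensorFlow documentation, Keras' `CategoricalHinge` compares the score of the target class against the _largest_ competing class score (`max(0, 1 + neg - pos)`), rather than summing over every non-target class. The target and prediction vectors below are made up for illustration:
+
+```
+import numpy as np
+from tensorflow.keras.losses import CategoricalHinge
+
+# Made-up example: the true class is class 2 (categorical format), and the
+# model assigns it the highest probability - but not by a large margin.
+y_true = np.array([[0.0, 0.0, 1.0]])
+y_pred = np.array([[0.25, 0.25, 0.50]])
+
+# Manual computation: pos = score for the target class,
+# neg = highest score among the non-target classes.
+pos = np.sum(y_true * y_pred, axis=-1)            # 0.50
+neg = np.max((1.0 - y_true) * y_pred, axis=-1)    # 0.25
+manual_hinge = np.maximum(0.0, 1.0 + neg - pos)   # 0.75: the margin of 1 is not yet reached
+
+keras_hinge = CategoricalHinge()(y_true, y_pred).numpy()
+print(manual_hinge, keras_hinge)  # both approximately 0.75
+```
+
+Only when the target class outscores its strongest competitor by a margin of at least 1 does the loss become 0 - which is exactly the margin-maximizing behaviour we want from a hinge loss.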
+
+### Generating a dataset
+
+Next, we can generate the data:
+
+```
+# Generate data
+X, targets = make_blobs(n_samples = num_samples_total, centers = [(0,0), (15,15), (0,15)], n_features = num_classes, center_box=(0, 1), cluster_std = 1.5)
+categorical_targets = to_categorical(targets)
+X_training = X[training_split:, :]
+X_testing = X[:training_split, :]
+Targets_training = categorical_targets[training_split:]
+Targets_testing = categorical_targets[:training_split].astype(np.integer)
+
+# Set shape based on data
+feature_vector_length = len(X_training[0])
+input_shape = (feature_vector_length,)
+```
+
+We use Scikit-learn's `make_blobs` function to generate data. It simply does as it suggests: it generates blobs of data, or clusters of data, where you specify them to be. Specifically, it generates `num_samples_total` samples (3000 in our case, see the model configuration section) and splits them across three clusters centered at \[latex\]{ (0, 0), (15, 15), (0,15) }\[/latex\]. The standard deviation within each cluster is set to 1.5 to ensure that they remain separable.
+
+Next, we must convert our target values (which are one of \[latex\]{ 0, 1, 2 }\[/latex\]) into [categorical format](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/#preparing-target-vectors-with-to_categorical) since our _categorical hinge loss_ requires categorical format (and hence no integer targets such as \[latex\]2\[/latex\], but categorical vectors like \[latex\]\[0, 0, 1\]\[/latex\]).
+
+Subsequently, we can split our feature vectors and target vectors according to the `training_split` we configured in our model configuration. Note that we add `.astype(np.integer)` to the testing targets. We do this because when visualizing categorical data, the Mlxtend library requires the vector contents to be _integers_ (instead of floating point numbers).
+
+Finally, we set the `input_shape` based on the length of our feature vector, which originates from the training data.
+
+### Visualizing our dataset
+
+We can finally visualize the data we generated:
+
+```
+# Generate scatter plot for training data
+plt.scatter(X_training[:,0], X_training[:,1])
+plt.title('Three clusters ')
+plt.xlabel('X1')
+plt.ylabel('X2')
+plt.show()
+```
+
+...which, as illustrated before, looks like this:
+
+[![](images/mh_3.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/mh_3.png)
+
+As illustrated before, this is what is generated 😎
+
+We can work with this!
+
+* * *
+
+## Creating the multiclass hinge Keras model
+
+### What you'll need to run this model
+
+If you wish to run this model on your machine, you'll need to install some dependencies to make the code work. First of all, you need **Keras**, the deep learning framework with which this model is built. It's the most essential dependency and can be installed by installing **TensorFlow 2.x** today, e.g. [2.4.0](https://www.machinecurve.com/index.php/2020/11/05/saying-hello-to-tensorflow-2-4-0/). It is then available as `tensorflow.keras`.
+
+Additionally, you'll need the de facto standard Python libraries Matplotlib, Numpy and Scikit-learn - they can be installed with `pip` quite easily.
+
+Another package, which can also be installed with `pip`, is Sebastian Raschka's [Mlxtend](https://github.com/rasbt/mlxtend).
We use it to visualize the decision boundary of our model.
+
+### Creating the model architecture
+
+We will create a very simple model today, a four-layered (two hidden layers, one input layer and one output layer) MLP:
+
+```
+# Create the model
+model = Sequential()
+model.add(Dense(4, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
+model.add(Dense(2, activation='relu', kernel_initializer='he_uniform'))
+model.add(Dense(num_classes, activation='tanh'))
+```
+
+More specifically, we use the Keras Sequential API which allows us to stack multiple layers on top of each other. We subsequently `add` the Dense or densely-connected layers; the first having four neurons, the second two, and the last `num_classes`, or three in our case. The hidden layers activate by means of the ReLU [activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) and hence are initialized with [He uniform init](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/). The last layer activates with tanh.
+
+### Model configuration & training
+
+Next, we configure the model and start the training process:
+
+```
+# Configure the model and start training
+model.compile(loss=loss_function_used, optimizer=optimizer_used, metrics=additional_metrics)
+history = model.fit(X_training, Targets_training, epochs=num_epochs, batch_size=batch_size, verbose=1, validation_split=validation_split)
+```
+
+It's as simple as calling `model.compile` with the settings that we configured under model configuration, followed by `model.fit` which fits the training data to the model architecture specified above. The training history is saved in the `history` object which we can use for [visualization purposes](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/).
+
+Next, we must add some more code for testing the model's ability to generalize to data it hasn't seen before.
+
+* * *
+
+## Model performance
+
+### Generalization power with testing set
+
+In order to test model performance, we add some code that evaluates the model with the testing set:
+
+```
+# Test the model after training
+test_results = model.evaluate(X_testing, Targets_testing, verbose=1)
+print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%')
+```
+
+What it will do is this: it takes the testing data (both features and targets) and feeds them through the model, comparing the predicted targets with the actual targets. Since the model has never seen the data before, it tells us something about the degree of overfitting that occurred during training. When the model performs well during validation _but also during testing_, it's useful in practice.
+
+### Visualizing the decision boundary
+
+[Visualizing the decision boundaries](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) of the model (remember, we have a three-class classification problem!) is the next step.
+
+I must admit, I had a little help from dr. Sebastian Raschka here, the creator of Mlxtend (also see [https://github.com/rasbt/mlxtend/issues/607](https://github.com/rasbt/mlxtend/issues/607)). As noted before, we had to convert our targets into categorical format, or e.g. \[latex\]target = 2\[/latex\] into \[latex\]target = \[0, 0, 1\]\[/latex\].
Mlxtend does not natively support this, but fortunately, Raschka helped out by creating a helper class that embeds the model yet converts the way it makes predictions (back into non-categorical format). This looks as follows: + +``` +''' + The Onehot2Int class is used to adapt the model so that it generates non-categorical data. + This is required by the `plot_decision_regions` function. + The code is courtesy of dr. Sebastian Raschka at https://github.com/rasbt/mlxtend/issues/607. + Copyright (c) 2014-2016, Sebastian Raschka. All rights reserved. Mlxtend is licensed as https://github.com/rasbt/mlxtend/blob/master/LICENSE-BSD3.txt. + Thanks! +''' +# No hot encoding version +class Onehot2Int(object): + + def __init__(self, model): + self.model = model + + def predict(self, X): + y_pred = self.model.predict(X) + return np.argmax(y_pred, axis=1) + +# fit keras_model +keras_model_no_ohe = Onehot2Int(model) + +# Plot decision boundary +plot_decision_regions(X_testing, np.argmax(Targets_testing, axis=1), clf=keras_model_no_ohe, legend=3) +plt.show() +''' + Finish plotting the decision boundary. +''' +``` + +### Visualizing the training process + +Finally, we can [visualize the training process](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/) itself by adding some extra code - which essentially plots the Keras `history` object with Matplotlib: + +``` +# Visualize training process +plt.plot(history.history['loss'], label='Categorical Hinge loss (training data)') +plt.plot(history.history['val_loss'], label='Categorical Hinge loss (validation data)') +plt.title('Categorical Hinge loss for circles') +plt.ylabel('Categorical Hinge loss value') +plt.yscale('log') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() +``` + +### How does the model perform? + +Now that we've completed our code, we can actually run the model! + +Open up a terminal where you have access to the software dependencies required to run the code, `cd` to the directory where your file is located, and execute e.g. `python multiclass-hinge.py`. 
+ +After the visualization of your dataset (with the three clusters), you'll see the training process run and complete - as well as model evaluation with the testing set: + +``` +Epoch 1/30 +2019-10-16 19:39:12.492536: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll +1600/1600 [==============================] - 1s 906us/step - loss: 0.5006 - accuracy: 0.6950 - val_loss: 0.3591 - val_accuracy: 0.6600 +Epoch 2/30 +1600/1600 [==============================] - 1s 603us/step - loss: 0.3397 - accuracy: 0.6681 - val_loss: 0.3528 - val_accuracy: 0.6500 +Epoch 3/30 +1600/1600 [==============================] - 1s 615us/step - loss: 0.3398 - accuracy: 0.6681 - val_loss: 0.3721 - val_accuracy: 0.7425 +Epoch 4/30 +1600/1600 [==============================] - 1s 617us/step - loss: 0.3379 - accuracy: 0.8119 - val_loss: 0.3512 - val_accuracy: 0.8500 +Epoch 5/30 +1600/1600 [==============================] - 1s 625us/step - loss: 0.3368 - accuracy: 0.8869 - val_loss: 0.3515 - val_accuracy: 0.8600 +Epoch 6/30 +1600/1600 [==============================] - 1s 608us/step - loss: 0.3358 - accuracy: 0.8906 - val_loss: 0.3506 - val_accuracy: 0.9325 +Epoch 7/30 +1600/1600 [==============================] - 1s 606us/step - loss: 0.3367 - accuracy: 0.9344 - val_loss: 0.3532 - val_accuracy: 0.9375 +Epoch 8/30 +1600/1600 [==============================] - 1s 606us/step - loss: 0.3365 - accuracy: 0.9375 - val_loss: 0.3530 - val_accuracy: 0.9425 +Epoch 9/30 +1600/1600 [==============================] - 1s 625us/step - loss: 0.3364 - accuracy: 0.9419 - val_loss: 0.3528 - val_accuracy: 0.9475 +Epoch 10/30 +1600/1600 [==============================] - 1s 627us/step - loss: 0.3364 - accuracy: 0.9450 - val_loss: 0.3527 - val_accuracy: 0.9500 +Epoch 11/30 +1600/1600 [==============================] - 1s 606us/step - loss: 0.3363 - accuracy: 0.9506 - val_loss: 0.3525 - val_accuracy: 0.9525 +Epoch 12/30 +1600/1600 [==============================] - 1s 642us/step - loss: 0.3366 - accuracy: 0.9425 - val_loss: 0.3589 - val_accuracy: 0.6475 +Epoch 13/30 +1600/1600 [==============================] - 1s 704us/step - loss: 0.3526 - accuracy: 0.8606 - val_loss: 0.3506 - val_accuracy: 0.9850 +Epoch 14/30 +1600/1600 [==============================] - 1s 699us/step - loss: 0.3364 - accuracy: 0.9925 - val_loss: 0.3502 - val_accuracy: 0.9875 +Epoch 15/30 +1600/1600 [==============================] - 1s 627us/step - loss: 0.3363 - accuracy: 0.9944 - val_loss: 0.3502 - val_accuracy: 0.9875 +Epoch 16/30 +1600/1600 [==============================] - 1s 670us/step - loss: 0.3363 - accuracy: 0.9937 - val_loss: 0.3502 - val_accuracy: 0.9875 +Epoch 17/30 +1600/1600 [==============================] - 1s 637us/step - loss: 0.3362 - accuracy: 0.9694 - val_loss: 0.3530 - val_accuracy: 0.9400 +Epoch 18/30 +1600/1600 [==============================] - 1s 637us/step - loss: 0.3456 - accuracy: 0.9744 - val_loss: 0.3537 - val_accuracy: 0.9825 +Epoch 19/30 +1600/1600 [==============================] - 1s 635us/step - loss: 0.3347 - accuracy: 0.9975 - val_loss: 0.3501 - val_accuracy: 0.9950 +Epoch 20/30 +1600/1600 [==============================] - 1s 644us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3501 - val_accuracy: 0.9950 +Epoch 21/30 +1600/1600 [==============================] - 1s 655us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3501 - val_accuracy: 0.9950 +Epoch 22/30 +1600/1600 [==============================] - 1s 636us/step - loss: 0.3344 - accuracy: 0.9994 - 
val_loss: 0.3501 - val_accuracy: 0.9950
Epoch 23/30
1600/1600 [==============================] - 1s 648us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3501 - val_accuracy: 0.9950
Epoch 24/30
1600/1600 [==============================] - 1s 655us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3501 - val_accuracy: 0.9950
Epoch 25/30
1600/1600 [==============================] - 1s 656us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3501 - val_accuracy: 0.9950
Epoch 26/30
1600/1600 [==============================] - 1s 641us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3501 - val_accuracy: 0.9950
Epoch 27/30
1600/1600 [==============================] - 1s 644us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3500 - val_accuracy: 0.9950
Epoch 28/30
1600/1600 [==============================] - 1s 666us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3500 - val_accuracy: 0.9950
Epoch 29/30
1600/1600 [==============================] - 1s 645us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3500 - val_accuracy: 0.9950
Epoch 30/30
1600/1600 [==============================] - 1s 669us/step - loss: 0.3344 - accuracy: 0.9994 - val_loss: 0.3500 - val_accuracy: 0.9950
1000/1000 [==============================] - 0s 46us/step
Test results - Loss: 0.3260095896720886 - Accuracy: 99.80000257492065%
```

In my case, the model was able to achieve very high accuracy - 99.8% on the testing set, with a validation accuracy of 99.5% in the final epochs! Indeed, the decision boundaries allow us to classify the majority of samples correctly:

![](images/mh_boundary-1024x587.png)

...and the training process looks like this:

![](images/mh_loss-1024x564.png)

Loss-wise, model performance pretty much maxes out just after the first few epochs...

...which is not surprising, given the fact that our datasets are quite separable by nature, or perhaps, _by design_ 😉 The relative ease with which the datasets are separable allows us to focus on the topic of this blog post, which was the categorical hinge loss.

All in all, we've got a working model using categorical hinge in Keras!
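As a final check on what the loss itself is doing, here is a small NumPy sketch - not part of the tutorial code above - of how the categorical hinge value can be computed for a single prediction. It follows the `max(0, neg - pos + 1)` structure used by Keras' `categorical_hinge`, where `pos` is the score for the true class and `neg` is the largest of the masked scores for the other classes:

```
import numpy as np

def categorical_hinge(y_true, y_pred):
    # pos: score assigned to the true class
    # neg: highest remaining entry after masking the true class to zero
    pos = np.sum(y_true * y_pred, axis=-1)
    neg = np.max((1.0 - y_true) * y_pred, axis=-1)
    return np.maximum(0.0, neg - pos + 1.0)

# One-hot target for class 2, with a good and a bad prediction
y_true = np.array([[0.0, 0.0, 1.0]])
good_pred = np.array([[-0.8, -0.9, 0.9]])  # true class clearly wins, margin of 1 not fully met
bad_pred = np.array([[0.7, -0.2, -0.5]])   # wrong class wins

print(categorical_hinge(y_true, good_pred))  # [0.1]
print(categorical_hinge(y_true, bad_pred))   # [2.2]
```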
* * *

## All code merged together

When merging all code together, we get this:

```
'''
  Keras model discussing Categorical (multiclass) Hinge loss.
'''
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs
from mlxtend.plotting import plot_decision_regions

# Configuration options
num_samples_total = 3000
training_split = 1000
num_classes = 3
loss_function_used = 'categorical_hinge'
learning_rate_used = 0.03
optimizer_used = tensorflow.keras.optimizers.Adam(lr=learning_rate_used)
additional_metrics = ['accuracy']
num_epochs = 30
batch_size = 5
validation_split = 0.2 # 20%

# Generate data
X, targets = make_blobs(n_samples = num_samples_total, centers = [(0,0), (15,15), (0,15)], n_features = num_classes, center_box=(0, 1), cluster_std = 1.5)
categorical_targets = to_categorical(targets)
X_training = X[training_split:, :]
X_testing = X[:training_split, :]
Targets_training = categorical_targets[training_split:]
Targets_testing = categorical_targets[:training_split].astype(np.int32)

# Determine the shape of the feature vectors (must happen after data generation)
feature_vector_length = len(X_training[0])
input_shape = (feature_vector_length,)

# Generate scatter plot for training data
plt.scatter(X_training[:,0], X_training[:,1])
plt.title('Three clusters')
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()

# Create the model
model = Sequential()
model.add(Dense(4, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(num_classes, activation='tanh'))

# Configure the model and start training
model.compile(loss=loss_function_used, optimizer=optimizer_used, metrics=additional_metrics)
history = model.fit(X_training, Targets_training, epochs=num_epochs, batch_size=batch_size, verbose=1, validation_split=validation_split)

# Test the model after training
test_results = model.evaluate(X_testing, Targets_testing, verbose=1)
print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%')

'''
  The Onehot2Int class is used to adapt the model so that it generates non-categorical data.
  This is required by the `plot_decision_regions` function.
  The code is courtesy of Dr. Sebastian Raschka at https://github.com/rasbt/mlxtend/issues/607.
  Copyright (c) 2014-2016, Sebastian Raschka. All rights reserved. Mlxtend is licensed as https://github.com/rasbt/mlxtend/blob/master/LICENSE-BSD3.txt.
  Thanks!
'''
# Wrapper that converts one-hot predictions back into integer class indices
class Onehot2Int(object):

    def __init__(self, model):
        self.model = model

    def predict(self, X):
        y_pred = self.model.predict(X)
        return np.argmax(y_pred, axis=1)

# Wrap the trained Keras model
keras_model_no_ohe = Onehot2Int(model)

# Plot decision boundary
plot_decision_regions(X_testing, np.argmax(Targets_testing, axis=1), clf=keras_model_no_ohe, legend=3)
plt.show()

# Visualize training process
plt.plot(history.history['loss'], label='Categorical Hinge loss (training data)')
plt.plot(history.history['val_loss'], label='Categorical Hinge loss (validation data)')
plt.title('Categorical Hinge loss for the three-blobs dataset')
plt.ylabel('Categorical Hinge loss value')
plt.yscale('log')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()
```

* * *

## Summary

In this blog post, we've seen how categorical hinge extends binary (normal) hinge loss and squared hinge loss to multiclass classification problems.
We considered the loss mathematically, but also built up an example with Keras that allows us to use categorical hinge with a real dataset, generating visualizations of the training process and decision boundaries as well. This concludes today's post. + +I hope you've learnt something here. If you did, I'd appreciate it if you let me know! 😊 You can do so by leaving a comment below 👇 Thanks a lot - and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2011, September 16). Hinge loss. Retrieved from [https://en.wikipedia.org/wiki/Hinge\_loss](https://en.wikipedia.org/wiki/Hinge_loss) + +Raschka, S. (n.d.). Home - mlxtend. Retrieved from [http://rasbt.github.io/mlxtend/](http://rasbt.github.io/mlxtend/) + +Raschka, S. (2018). MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack. _Journal of Open Source Software_, _3_(24), 638. [doi:10.21105/joss.00638](https://joss.theoj.org/papers/10.21105/joss.00638) + +About loss and loss functions – MachineCurve. (2019, October 15). Retrieved from [https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) + +Keras. (n.d.). Losses. Retrieved from [http://keras.io/losses](http://keras.io/losses) diff --git a/how-to-use-conv2d-with-keras.md b/how-to-use-conv2d-with-keras.md new file mode 100644 index 0000000..8393c4a --- /dev/null +++ b/how-to-use-conv2d-with-keras.md @@ -0,0 +1,349 @@ +--- +title: "How to use Conv2D with Keras?" +date: "2020-03-30" +categories: + - "deep-learning" + - "frameworks" +tags: + - "conv2d" + - "convolutional-neural-networks" + - "deep-learning" + - "keras" + - "machine-learning" + - "neural-networks" + - "tutorial" +--- + +One of the most widely used layers within the Keras framework for deep learning is the **Conv2D layer**. However, especially for beginners, it can be difficult to understand what the layer is and what it does. + +For this reason, we'll explore this layer in today's blog post. What is the Conv2D layer? How is it related to Convolutional Neural Networks? What does the "2D" mean - two dimensions? And how to actually implement it? + +Those are questions that we'll answer today. Firstly, we'll take a look at ConvNets in general - discussing what they are and how they play an important role in today's deep learning world. Secondly, we move on to the Keras framework, and study how they are represented by means of Conv2D layers. Thirdly, we'll implement an actual model, guiding you through the code step by step. Finally, we'll run the model, and discuss our results. + +Are you ready? Let's go! + +* * * + +\[toc\] + +* * * + +## Some theory about Conv2D: about convolutional neural networks + +In my opinion, it's important to dive a bit into concepts first before we discuss code, as there's no point in giving you code examples if you don't understand _why_ things are as they are. + +Now, let's take a look at some theory related to the Keras Conv2D layer. here, we'll discuss three things: + +- **What is a neural network?** Very briefly, we'll take a look at what neural networks are - and why today's ones are different than the ones from the past, and other machine learning models. +- **What are convolutional neural networks?** There are many such models. Convolutional ones, have really thrived over the past few years. What are they? And how are they related to Conv2D layers? We'll take a look. 
- **What is the impact of ConvNets / how are they used in practice?** ConvNets have especially boosted the popularity of deep learning because of their interesting applications, especially in computer vision. We'll explore a few of these.

Okay, let's begin.

### What is a neural network?

For hundreds of years, philosophers and scientists have been interested in the human brain. The brain, you must know, is extremely efficient in what it does - allowing humans to think with unprecedented complexity and sheer flexibility.

Some of those philosophers and scientists have also been interested in finding out how to _make an artificial brain_ - that is, use a machine to mimic the functionality of the brain.

Obviously, this was not possible until the era of computing. That is, only since the 1950s, when computers emerged as a direct consequence of the Second World War, could scientists actually _build_ artificially intelligent systems.

![](images/photo-of-head-bust-print-artwork-724994-1024x736.jpg)

Initially, researchers like [Frank Rosenblatt](https://www.machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/) attempted to mimic the neural structure of the brain. As you likely know - you probably have some background in neural networks, or at least know what they are - our brains consist of individual neurons and "highways", called synapses, in between them.

Neurons fire based on inputs, and by consequence trigger synapses to become stronger over time - allowing entire patterns of information processing to be shaped within the brain, giving humans the ability to think and act in very complex ways.

Whereas the so-called "Rosenblatt Perceptron" (click the link above if you want to know more) was just _one_ artificial neuron, today's neural networks are complex networks of many neurons, like this:

[![](images/ComplexNeuralNetwork.png)](https://www.machinecurve.com/wp-content/uploads/2017/09/ComplexNeuralNetwork.png)

A complex neural network. These and even more complex neural nets provide different layers of possibly non-linear functionality, and may thus be used in deep learning.

What you see above is known as a "fully connected" network: each neuron is connected to every neuron in the next layer, all the way from the input layer to the output layer. Growing complexity means that the number of connections grows rapidly. This isn't good news, as it means that (1) the time required to train the network increases significantly and (2) the network is more prone to ["overfitting"](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/). Are there more efficient ways, perhaps?

### What are convolutional neural networks?

There are!

And convolutional neural networks, or ConvNets for short, are one of them. They are primarily used for computer vision tasks - although they have emerged in the area of text processing as well. Not spoiling too much - we'll show some examples in the next section - let's now take a look at what makes ConvNets different.

The answer to this question is relatively simple: ConvNets also contain layers that _are not fully connected_, and are built in a different way - convolutional layers.

Let's schematically draw two such layers.

![](images/Cnn_layer-1.jpg)

On the left, you see the first layer - and the pixels of, say, your input image.
The yellow part is the "convolutional layer", and more precisely, one of its filters (convolutional layers often contain many such filters, which are learnt based on the data). It slides over the input image and summarizes a small box of pixels into _just one_ value. Repeating this process across many layers, we generate a small, very abstract image that we can use for classification.

### How are ConvNets used in practice?

As a result of this, we see many interesting applications of convolutional layers these days - especially in the field of computer vision. For example, object detectors use ConvNets to "detect" known objects within images or even videos, allowing you to draw bounding boxes and act based on the observation:

https://www.youtube.com/watch?v=yQwfDxBMtXg

* * *

## The Keras framework: Conv2D layers

Such layers are also represented within the Keras deep learning framework. For two-dimensional inputs, such as images, they are represented by `keras.layers.Conv2D`: the Conv2D layer!

In more detail, this is its exact representation (Keras, n.d.):

```
keras.layers.Conv2D(filters, kernel_size, strides=(1, 1), padding='valid', data_format=None, dilation_rate=(1, 1), activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
```

Now, what does each attribute mean?

- **Filters** represents the number of filters that should be learnt by the convolutional layer. From the schematic drawing above, you should understand that each filter slides over the input image, generating a "feature map" as output.
- The **kernel size** represents the number of pixels in height and width that should be summarized, i.e. the two-dimensional width and height of the filter.
- The **stride** tells us how the kernel moves over the input image. If the stride is 1, it slides pixel by pixel. With a stride of 2, it moves two pixels at a time, skipping one; with a stride of 3 it skips two, and so on.
- The **[padding](https://www.machinecurve.com/index.php/2020/02/07/what-is-padding-in-a-neural-network/)** tells us what happens when the kernels/filters don't fit, for example because the input image has a width and height that do not match with the combination of kernel size and stride.
- Depending on the backend you're using Keras with, the _channels_ (each image has colour channels, e.g. 3 channels with Red-Green-Blue or RGB) are in the _first_ dimension or the _last_. Hence, the **data format** represents whether it's a channels first or channels last approach. With recent versions of Keras, which support TensorFlow only, this is no longer a concern.
- If you're using dilated convolutions, the **dilation rate** can be specified as well.
- The **[activation function](https://www.machinecurve.com/index.php/2019/06/11/why-you-shouldnt-use-a-linear-activation-function/)** to which the linear output of the Conv2D layer is fed to make it nonlinear can be specified too.
- A **bias value** can be added to each layer in order to scale the learnt function vertically. This possibly improves training results. It can be configured here, especially if you _don't_ want to use biases. By default, it's enabled.
- The **[initializer](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/)** for the kernels and the biases can be configured too, as well as **[regularizers](https://www.machinecurve.com/index.php/2020/01/23/how-to-use-l1-l2-and-elastic-net-regularization-with-keras/)** and **constraints**.
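To see how a few of these attributes come together, here is a minimal sketch - the numbers are arbitrary and purely illustrative: a layer with 16 filters, a 3x3 kernel, a stride of 1 and `'same'` padding, applied to 32x32 RGB inputs.

```
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

# One Conv2D layer with several of the attributes discussed above spelled out
model = Sequential()
model.add(Conv2D(16, kernel_size=(3, 3), strides=(1, 1), padding='same',
                 activation='relu', kernel_initializer='he_uniform',
                 input_shape=(32, 32, 3)))
model.summary()  # Output shape (None, 32, 32, 16): 'same' padding keeps width and height intact
```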
* * *

## Implementing a Keras model with Conv2D

Let's now see how we can implement a Keras model using Conv2D layers. It's important to remember that we need **Keras** for this to work, and more specifically we need the newest version. That means that it's best to install **TensorFlow** version 2.0+, which supports Keras out of the box. I cannot explain here how to install TensorFlow, but if you Google for "installing TensorFlow", you'll most likely find a perfect example.

Obviously, you'll also need a recent version of Python - possibly, using Anaconda.

### Full model code

This is the model that we'll be coding today. Don't worry - I will walk you through every step, but here's the code as a whole for those who just wish to copy and play:

```
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam

# Model configuration
batch_size = 50
img_width, img_height, img_num_channels = 32, 32, 3
loss_function = sparse_categorical_crossentropy
no_classes = 10
no_epochs = 25
optimizer = Adam()
validation_split = 0.2
verbosity = 1

# Load CIFAR-10 data
(input_train, target_train), (input_test, target_test) = cifar10.load_data()

# Determine shape of the data
input_shape = (img_width, img_height, img_num_channels)

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Scale data
input_train = input_train / 255
input_test = input_test / 255

# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(no_classes, activation='softmax'))

# Compile the model
model.compile(loss=loss_function,
              optimizer=optimizer,
              metrics=['accuracy'])

# Fit data to model
history = model.fit(input_train, target_train,
                    batch_size=batch_size,
                    epochs=no_epochs,
                    verbose=verbosity,
                    validation_split=validation_split)

# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
```

Let's now study the model in more detail.

### The imports

The first thing we'll need to do is import some things:

```
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv2D
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam
```

We'll be using the CIFAR10 dataset today. Later, we'll see what it looks like. We also use the Sequential API. This allows us to stack the layers nicely. Subsequently, we import the Dense (short for densely-connected), Flatten and Conv2D layers.
+ +The principle here is as follows: + +- The **Conv2D layers** will transform the input image into a very abstract representation. +- This representation can be used by **densely-connected layers** to generate a classification. +- However, as Dense layers can only handle one-dimensional data, we have to convert the multidimensional feature map output by the final Conv2D layer into one-dimensional format first. We can do so with the **Flatten** layer. + +Next, we import the optimizer and the [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/). These will help us with improving the model: the [optimizer](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) adapts the weights, while the loss function computes the difference between the predictions and the ground truth of your training dataset. Loss functions are tailored to the problem you're trying to solve. For multiclass classification scenarios, which is what we're doing today, [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/17/how-to-use-categorical-multiclass-hinge-with-keras/) is a good choice. However, as our dataset targets are integers rather than vectors, we use the [sparse equivalent](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/). + +### The model configuration + +Next up, the model configuration. + +- We set the **batch size** to 50. This means that 50 samples are fed to the model in each step. +- We also define the **image width, height and the number of channels**. As our dataset contains 32x32 pixel RGB images, we set them to `32, 32, 3`, respectively. +- For the **loss function** and **optimizer**, we set the values that we just discussed. +- As our dataset has 10 classes, we set **no\_classes** to 10. +- We will train our model for 25 iterations, or **epochs**. Usually, the number of epochs is a very large number, but as this is an educational scenario, we keep it low. Experiment with a few settings to see how it works! +- We use 20% of the training data for **validation purposes** - i.e., to see how well your model performs after each iteration. This helps spot whether our model is [overfitting](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/). +- Finally, we set **verbosity** mode to 1 - or True, showing all the output on screen. + +``` +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 32, 32, 3 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 25 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 +``` + +### Loading and preparing our dataset + +The third step is to load and prepare our dataset. + +But wait, what dataset will we be using? + +Let's take a look at the CIFAR10 dataset: + +[![](images/cifar10_images.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/cifar10_images.png) + +These are just a few samples from this dataset - as you can see, it contains many common day classes such as truck, deer, and automobile. 
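If you'd like to inspect a handful of samples yourself, a small sketch along these lines will plot the first five training images together with their class indices - note that it assumes Matplotlib is installed, even though the rest of this post doesn't need it:

```
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import cifar10

# Load CIFAR-10 and show the first five training images with their class indices
(input_train, target_train), (input_test, target_test) = cifar10.load_data()
fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for i, ax in enumerate(axes):
    ax.imshow(input_train[i])
    ax.set_title(f'Class {target_train[i][0]}')
    ax.axis('off')
plt.show()
```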
We'll load and prepare it as follows: + +``` +# Load CIFAR-10 data +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 +``` + +The first step, loading the data from the [Keras datasets wrapper](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/), should be clear. The same goes for _determining the shape of our data_ - which is done based on the configuration settings that we discussed earlier. + +Now, for the other two steps, these are just technicalities. By casting our data into `float32`, the training process will presumably be faster if you run it on a GPU. Scaling the data ensures that we have smaller weight updates, benefiting the final outcome. + +### Specifying model architecture + +Now that all the "work upfront" is complete, we can actually specify the model: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) +``` + +First, we instantiate the `Sequential` API - literally laying the foundation on top of which we can stack layers. + +As you can see, we specify three `Conv2D` layers in sequential order, with 3x3 kernel sizes, [ReLU activation](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/) and 32, 64 and 128 filters, respectively. + +Next, we use Flatten, and have two Dense layers to generate the classification. The last layer doesn't activate with ReLU, but with Softmax instead. This allows us to generate [a true multiclass probability distribution](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/), which is what we need if we want to answer the question "which class is most likely?". + +Now that we have specified the architecture, or the framework, we can _compile_ (or initialize) the model and _fit the data_ (i.e., start training). + +### Model compilation and fitting the data + +Keras allows you to do so quite easily: with `model.compile` and `model.fit`. The `compile` call allows you to specify the loss function, the optimizer and additional metrics, of which we use accuracy, as it's intuitive to humans. + +Then, with `fit`, we can fit the `input_train` and `target_train` (i.e. the inputs and targets of our training set) to the model, actually starting the training process. We do so based on the options that we configured earlier, i.e. batch size, number of epochs, verbosity mode and validation split. + +``` +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +### Model evaluation + +The final step is to evaluate our model after we performed training. Keras allows you to do so with `model.evaluate`. 
As you can see, instead of the training dataset, we're using testing data here: `input_test` and `target_test`. This way, we can be sure that we test the model with data that it hasn't seen before during training, evaluating its power to generalize to new data (which happens in real-world settings all the time!). Evaluation is done in a non-verbose way, and the results are printed on screen. + +``` +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +* * * + +## Running our model + +Now, let's run the model. Say that you save your model as `model.py`. Open a terminal, `cd` to the folder where your file is located, and run `python model.py`. You should see the training process begin on screen. + +``` +2020-03-30 20:25:58.336324: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows +Relying on driver to perform ptx compilation. This message will be only logged once. +40000/40000 [==============================] - 35s 887us/sample - loss: 1.4811 - accuracy: 0.4676 - val_loss: 1.1419 - val_accuracy: 0.5906 +Epoch 2/25 +40000/40000 [===================> +``` + +Once it finishes, you should get an evaluation that is close to this: + +``` +Test loss: 4.3477772548675535 / Test accuracy: 0.6200000047683716 +``` + +62% is not _extremely_ good, but it's not very bad either. For sure that we can improve a lot, but that wasn't the point of this blog post! 😉 + +* * * + +## Summary + +In this blog post, we looked at how two-dimensional convolutional layers can be used with the Keras deep learning framework. Having studied a little bit of neural network theory up front, and diving into the concepts of convolutional layers, we quickly moved on to Keras and its Conv2D representation. With an example model, which we looked at step by step, we showed you how you can create a Keras ConvNet yourself. + +I hope you've learnt something from today's blog post. If you did, please feel free to leave a comment in the comments section below! Please do the same if you have questions or when you have remarks - I'll happily answer and improve my blog post if necessary. + +For now, thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +Keras. (n.d.). _Convolutional layers: Conv2D_. Home - Keras Documentation. [https://keras.io/layers/convolutional/#conv2d](https://keras.io/layers/convolutional/#conv2d) diff --git a/how-to-use-cropping-layers-with-keras.md b/how-to-use-cropping-layers-with-keras.md new file mode 100644 index 0000000..82abb91 --- /dev/null +++ b/how-to-use-cropping-layers-with-keras.md @@ -0,0 +1,533 @@ +--- +title: "How to use Cropping layers with TensorFlow and Keras?" +date: "2020-02-04" +categories: + - "deep-learning" + - "frameworks" +tags: + - "conv2d" + - "convolutional-neural-networks" + - "cropping" + - "deep-learning" + - "keras" + - "machine-learning" + - "neural-networks" +--- + +Sometimes, your data set may consist of e.g. images from which you only need to use a tiny bit in your neural network. Cropping the images manually prior to training would then be a possible option. However, this can be done smarter, with the Keras Cropping layers, which perform all the work for you. + +This blog covers these layers. Firstly, we'll take a look at why cropping may be necessary in the first place. 
Secondly, we introduce the Cropping layers from the Keras API, and then proceed with a simple Cropping example. We then extend this simple example to a CNN based classifier using Cropped data, and finally take a look at how Cropping may also be used at the level of feature maps rather than input data.

Are you ready? Let's go! 😎

**Update 05/Nov/2020:** made code compatible with TensorFlow 2.x and fixed some other issues.

* * *

\[toc\]

* * *

## Why cropping?

Say that you have an experiment in which you photograph a variety of books. The size of the books is similar, and you're using a tripod in order to keep the camera stable and at a fixed position with respect to the table. You lay down each book and take a picture, and will have a set of pictures of books in the end, assigned to a category (e.g., "thriller").

Now imagine that your goal is to classify the book into a category based on the cover. What's more, you have another dataset available - a set of pictures with labels - where you did precisely the same thing. However, rather than the table of your first dataset - which is a wooden one - here, the table you used is made of white plastic.

This is problematic, as we don't want the prediction to be determined by the _material of the table_, or the table at all!

Intuitively, the fix for this problem would be to "cut off" the table from each book. That is, we simply remove the edges, so that the cover of the book remains. It's a simple and elegant fix which is called "cropping". And indeed, it's the way forward - suppose that in this case, the "2" is the book and the surrounding blue is the table:

- [![](images/crop_4.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/crop_4.png)


Cropping allows us to focus on the book alone rather than its unique combination with the table.

Now, a naïve way would be to crop all your images manually - that is, use a software tool like Paint to remove the edge from each image. This is an extremely time-intensive process, which is not what we want. What's more, the precision of our cropping may be off by a few pixels every time. This introduces instability into the dataset.

Instead, we'll perform _cropping_ in our neural network! Keras, the deep learning framework with which we will work today, has such layers available natively. Let's explore them! 😎

* * *

## Cropping in the Keras API

Cropping often goes hand in hand with [Convolutional layers](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/), which themselves are used for feature extraction from one-dimensional (i.e. time), two-dimensional (i.e. spatial) or three-dimensional (i.e. 3D spatial or spatiotemporal a.k.a. spatial over time) data.

Hence, it shouldn't surprise you that Keras offers three types of Cropping layers: `Cropping1D`, `Cropping2D` and `Cropping3D`, to be used with the dimensionality of your dataset and often the corresponding `Conv` layer(s) (Keras, n.d.).

Firstly, there is the one-dimensional variant:

```
tensorflow.keras.layers.Cropping1D(cropping=(1, 1))
```

It has one simple attribute: `cropping`, which specifies "how many units should be trimmed off at the beginning and end of the cropping dimension" (Keras, n.d.). That is, if your input is an array of shape \[latex\](20, )\[/latex\] and you apply a `cropping` of \[latex\](2, 5)\[/latex\], then it will be \[latex\](13, )\[/latex\] with 2 values cut off the front and 5 off the back.
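To make that concrete, here is a minimal sketch with `Cropping1D` - note that the layer expects a trailing features axis, so the \[latex\](20, )\[/latex\] array is fed in with shape `(20, 1)`:

```
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Cropping1D

# A (20,) sequence, reshaped to (1, 20, 1): one sample, 20 steps, 1 feature
data = np.arange(20).reshape(1, 20, 1).astype('float32')

model = Sequential()
model.add(Cropping1D(cropping=(2, 5), input_shape=(20, 1)))

cropped = model.predict(data)
print(cropped.shape)     # (1, 13, 1): 2 values cut off the front, 5 off the back
print(cropped[0, :, 0])  # [ 2.  3.  4. ... 14.]
```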
You could also provide `cropping` as an integer `i`, which then equates to `(i, i)`.

Secondly, there is the two-dimensional variant:

```
tensorflow.keras.layers.Cropping2D(cropping=((0, 0), (0, 0)), data_format=None)
```

It is pretty similar to the one-dimensional case, but there are minor differences:

- The `cropping` attribute now specifies a _box_ for cropping, with the structure being `((top_crop, bottom_crop), (left_crop, right_crop))` (Keras, n.d.). However, if you wish to perform a symmetric crop (i.e. remove as much in terms of height from top and bottom and width from left and right), you could also specify the two-element tuple `(symmetric_height_crop, symmetric_width_crop)` (Keras, n.d.). If you only apply an int value, Keras will perform a symmetric crop with `width = height`, like the 1D case.
- New is the `data_format`, which is nothing more than `channels_last` or `channels_first`, depending on how the backend you use Keras with processes images (i.e., whether channels are present in the first or last dimensions of your tensor). TensorFlow, by default, utilizes a channels-last approach, and given the deep integration between Keras and TensorFlow, `channels_last` is the default option (StackOverflow, n.d.; Keras, n.d.).

Thirdly, and finally, there is the 3D Cropping layer:

```
tensorflow.keras.layers.Cropping3D(cropping=((1, 1), (1, 1), (1, 1)), data_format=None)
```

- Here, you can also specify the `cropping` attribute as a dimension-based crop, i.e. `((left_dim1_crop, right_dim1_crop), (left_dim2_crop, right_dim2_crop), (left_dim3_crop, right_dim3_crop))`, but with `(symmetric_dim1_crop, symmetric_dim2_crop, symmetric_dim3_crop)` it's also possible to create a symmetric crop again (Keras, n.d.).
- Similarly, `data_format` can be set here to `channels_first` or `channels_last`, with the latter being the default (StackOverflow, n.d.; Keras, n.d.).

Now that we know how the Cropping layers are represented in the Keras API, it's time for some coding work! 😄

* * *

## A simple Cropping2D example

Let's start with a simple example that demonstrates what the `Cropping` layers do. More precisely, we'll be using the `Cropping2D` layer from Keras, using the TensorFlow 2.0+ variant so we're future-proof.

### The imports

Open up a code editor and create a file, e.g. `cropping2d.py`. Then, the first step is adding the imports:

- The `Sequential` API from `tensorflow.keras.models`, so we can stack everything together nicely.
- The `Cropping2D` layer from `tensorflow.keras.layers`.
- The `mnist` dataset from `tensorflow.keras.datasets`, i.e. the [Keras datasets module](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/).
- The `PyPlot` API from Matplotlib, for generating some plots.
- Finally, `Numpy`, for number processing.

```
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Cropping2D
from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np
```

### The dataset

Next, it's time to load the MNIST dataset and select a sample:

```
# Load MNIST data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
input_image_shape = (28, 28, 1)
input_image = x_train[25].reshape(input_image_shape)
```

Here, we use `load_data()` to load the full dataset. However, we only use one sample at a time, in this case the 26th (with `i = 25`) from the training dataset (i.e. `x_train[25]`).
On the fly, we also reshape it so that channels are supported: the 28x28 pixel image with shape `(28, 28)` is reshaped into `(28, 28, 1)`. + +### The architecture + +Subsequently, we can specify the model architecture... _which is pretty simple:_ + +``` +# Create the model +model = Sequential() +model.add(Cropping2D(cropping=((5, 5), (5, 5)), input_shape=input_image_shape)) +model.summary() +``` + +It's simply an instantiation of the `Sequential` API, to which we add the `Cropping2D` layer, and generate a summary using `model.summary`. In the Cropping layer, we specify the `cropping` attribute that specifies the box that _must be kept_: in this case the box 5 pixels from the left; from the right; from the top, and from the bottom. + +In the _Results_ section, you'll see that this indeed cuts off the blank edges 😉 But first, add some code which actually generates the cropped output... through a _prediction_. + +### The cropping action + +By specifying the `input_shape` in the `model.add` section, the model automatically builds/compiles, and - as we shall see in the _Results_ section as well - since our model doesn't have any trainable parameters, we don't need to call `model.fit`. + +Hence, we can continue straight away with generating a 'prediction' - i.e. feeding the `input_image` to the `model` through `model.predict`. Do note that our model expects an array of inputs, and thus we must wrap it with a list: + +``` +# Perform actual cropping +model_inputs = np.array([input_image]) +outputs_cropped = model.predict(model_inputs) +``` + +As the model predicts for a list, the outputs are also a list, and we need to take the first element: + +``` +# Get output +outputs_cropped = outputs_cropped[0] +``` + +Finally, we can visualize the input and output together with Matplotlib: + +``` +# Visualize input and output +fig, axes = plt.subplots(1, 2) +axes[0].imshow(input_image[:, :, 0]) +axes[0].set_title('Original image') +axes[1].imshow(outputs_cropped[:, :, 0]) +axes[1].set_title('Cropped input') +fig.suptitle(f'Original and cropped input') +fig.set_size_inches(9, 5, forward=True) +plt.show() +``` + +### Results + +First, the summary generated with `model.summary()`: + +``` +Model: "sequential" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +cropping2d (Cropping2D) (None, 18, 18, 1) 0 +================================================================= +Total params: 0 +Trainable params: 0 +Non-trainable params: 0 +_________________________________________________________________ +``` + +As you can see, there are no trainable parameters whatsoever - the `Cropping2D` layer only crops the inputs based on the `cropping` attribute that was specified! + +Then, three examples of the cropped inputs: + +- [![](images/crop_3.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/crop_3.png) + +- [![](images/crop_2.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/crop_2.png) + +- [![](images/crop_1.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/crop_1.png) + + +Indeed, the blank box around the digits has been removed! 
😎 + +### Full code + +Should you wish to obtain the full code for this simple application of Keras `Cropping2D` layers, here you go: + +``` +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Cropping2D +from tensorflow.keras.datasets import mnist +import matplotlib.pyplot as plt +import numpy as np + +# Load MNIST data +(x_train, y_train), (x_test, y_test) = mnist.load_data() +input_image_shape = (28, 28, 1) +input_image = x_train[25].reshape(input_image_shape) + +# Create the model +model = Sequential() +model.add(Cropping2D(cropping=((5, 5), (5, 5)), input_shape=input_image_shape)) +model.summary() + +# Perform actual cropping +model_inputs = np.array([input_image]) +outputs_cropped = model.predict(model_inputs) + +# Get output +outputs_cropped = outputs_cropped[0] + +# Visualize input and output +fig, axes = plt.subplots(1, 2) +axes[0].imshow(input_image[:, :, 0]) +axes[0].set_title('Original image') +axes[1].imshow(outputs_cropped[:, :, 0]) +axes[1].set_title('Cropped input') +fig.suptitle(f'Original and cropped input') +fig.set_size_inches(9, 5, forward=True) +plt.show() +``` + +* * * + +## Training a ConvNet with Cropping2D inputs + +[![](images/model_cropping2d-84x300.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/model_cropping2d.png) + +Let's now make the example a little bit more complex. Rather than creating a model which allows an input image to be cropped, we'll apply Cropping layers to a [Convolutional Neural Network based classifier](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) - in order to find out what it does in terms of performance when it is trained on the MNIST dataset. + +On the right, you'll see the architecture that we will create today: a convolutional neural network that eventually leads to densely-connected layer based classification. + +Let's take a look at some code! 😎 Open up a code editor, create a new file (e.g. `model_cropping2d.py`) and start coding :) + +### Model imports + +Firstly, we'll define the imports for our model: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D, Cropping2D +``` + +We import `tensorflow`, as we'll need it later to specify e.g. the [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/). Then, from `tensorflow.keras`, we import a couple of things: + +- Firstly, from `.datasets`, we import the `mnist` dataset. +- From `.models`, we import the `Sequential` API which will allow us to stack the layers quite nicely. +- Then, from `.layers`, we import `Dense`, `Dropout` and `Flatten` - all necessary for the latter part (i.e. the classifier) of the model or for reducing overfitting (i.e. [Dropout](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/)). 
+- Then, from `.layers`, we import the layers used for feature extracting: the `Conv2D` layer for the actual extraction, the `MaxPooling2D` layer for [downsampling and introducing translation invariance](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/) and `Cropping2D` for cropping, obviously :) + +### Model configuration + +Now that our imports are defined, we can set the configuration options for the model: + +``` +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +``` + +As the MNIST digits have a width and height of 28 pixels, we set both `img_width` and `img_height` to 28. Then, the batch size is set to 250 - which is a fair balance between memory requirements and [gradient preciseness](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/). The number of epochs is set to 25 - which is low, but which should be more than enough for a simple dataset like MNIST. The number of classes is set to 10, which equals the distinct number of digits present within the dataset - i.e. the digits 0 to 9. + +Finally, 20% of the training data is used for validation purposes (i.e. validating model performance for every epoch) and verbosity mode is set to True (through `1`), outputting everything in your terminal (and in my experience slightly slowing down the training process due to the speed of these operations - turn it off when you use it for real!). + +### Loading and preparing data + +When the model configuration options are set, we can load the MNIST dataset. We do so by calling the `load_data()` definition that is present within the [Keras datasets module](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/): + +``` +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() +``` + +The first subsequent activity is reshaping the data into the correct format, so that it can be consumed by the neural network. + +``` +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) +``` + +We subsequently cast the numbers into `float32` type. 
This makes learning more [precise](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/#float32-in-your-ml-model-why-its-great): + +``` +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') +``` + +The next step is to divide the data points by 255 in order to normalize the data into the \[latex\]\[0, 1\]\[/latex\] range: + +``` +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 +``` + +The final step that is left is to convert the targets into categorical format through one-hot encoding, so that [categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) can be used: + +``` +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) +``` + +### Model architecture + +The next step now that we have prepared the data for use in our neural network is to specify the actual _architecture_... i.e., the skeleton of your model: + +``` +# Create the model +model = Sequential() +model.add(Cropping2D(cropping=((5, 5), (5, 5)), input_shape=input_shape)) +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) +``` + +It works as follows: + +- We instantiate the `Sequential` API so that we can stack our layers nicely. +- We first add the `Cropping2D` layer we used in the simple example - so that our MNIST data will be cropped and that the "blank" box around it will be cut off. +- Then, we feed the data into two convolutional blocks that are composed of `Conv2D`, `MaxPooling2D` and `Dropout` layers. Here, feature extraction, downsampling, ensuring translation invariance and reducing overfitting takes place - twice. +- Subsequently, we flatten the highly dimensional outputs of the last convolutional block so that we can feed them to the `Dense` layers, for classification. +- Each layer utilizes a [ReLU activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) except for the last - which ensures that a multiclass probability distribution is generated by means of [Softmax](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/). +- The Dropout rate is set to 0.25, which is relatively low - even better results may be achieved with a rate of 0.5; however, we set it to 0.25 in order to keep the model comparable to the classic CNN we created [in another blog post](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/). +- The convolutional layers extract 32 and 64 [feature maps](https://www.machinecurve.com/index.php/2019/12/03/what-do-convnets-see-visualizing-filters-with-activation-maximization/), respectively - 32 relatively "generic" ones and 64 more "specific" ones to the data. We use 3x3 pixel kernels and 2x2 pools, reducing the size of the feature maps by 50% each time. 
+- The first Dense layer has 256 neurons - and is already a "bottleneck" for the highly dimensional flattened data (this is a good thing). The second Dense layer is an even greater bottleneck and generates _ten_ outputs only, one "importance" score per class. Letting those flow through the Softmax activation function mentioned earlier ensures that you can talk about the final output in a probabalistic way, and pick the most likely class. + +### Model compilation & data fitting + +The next step is model compilation, or "configuring" the model skeleton. For this, we use [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) and the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam). Accuracy is added as a more intuitive metric. + +``` +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) +``` + +Next, we `fit` the data to the model - and assign the output to the `history` object. With this object, it will be possible to [visualize e.g. the history of the training process](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/). Here, we also _actually_ set the configuration options that we set before. + +``` +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +### Model evaluation + +Finally, we can generate some evaluation metrics based on the test set: + +``` +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +...possibly, you can also add the code here if you wish to visualize [the history of your training process](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/). + +### Comparing Cropped CNN to Uncropped CNN + +To evaluate, we trained the CNN defined above with the 'classic' one - i.e., the one without cropped data - as a baseline model. Here are the results in terms of accuracies and loss values: + +- [![](images/classic_cropped_accuracy.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/classic_cropped_accuracy.png) + +- [![](images/classic_cropped_loss.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/classic_cropped_loss.png) + + +As you can see, the cropped CNN performs worse on the validation dataset - although the difference is not _enormous_. The same is visible within the testing data: + +``` +Cropped CNN - Test loss: 0.030419354529614792 / Test accuracy: 0.9904999732971191 +Classic CNN - Test loss: 0.028982607408999137 / Test accuracy: 0.9926999807357788 +``` + +Even though the model does not show better results, we at least found out about how to apply Cropping layers with Keras :) + +### Full model code + +Should you wish to obtain the full model code instead? 
That's possible :) Here you go: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D, Cropping2D + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Cropping2D(cropping=((5, 5), (5, 5)), input_shape=input_shape)) +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +* * * + +## Applying Cropping layers to ConvNet feature maps + +Cropping does not necessarily have to take place at the level of input data only. Rather, you can also apply it at the level of the _feature maps_, i.e. the features extracted by the convolutional layers. This is demonstrated here: + +``` +# Create the model +model = Sequential() +model.add(Cropping2D(cropping=((5, 5), (5, 5)), input_shape=input_shape)) +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu')) +model.add(Cropping2D(cropping=((2, 2), (2, 2)))) +``` + +As you can see, after the second `Cropping2D` layer, there are still 32 feature maps, but they're four pixels less wide and less high - which is in line with the `(2, 2), (2, 2)` crop we defined! 😊 + +``` +Model: "sequential" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +cropping2d (Cropping2D) (None, 18, 18, 1) 0 +_________________________________________________________________ +conv2d (Conv2D) (None, 16, 16, 32) 320 +_________________________________________________________________ +cropping2d_1 (Cropping2D) (None, 12, 12, 32) 0 +_________________________________________________________________ +``` + +* * * + +## Summary + +In this blog post, we looked at a couple of things related to Cropping layers. 
Firstly, we identified the need for cropping - by a fictional scenario using books. Discarding the naïve way of using tools like Microsoft Paint for this purpose, we introduced the concept of a Cropping layer. + +This was followed by an introduction to how these layers are represented within the Keras API. Finally, we saw three examples - a simple Cropping example, a Cropping layer with a CNN classifier, and an example displaying how Cropping can be used on the feature maps rather than the input data. + +I hope you've learnt something today. If you did, I'd really appreciate it if you left a comment in the comments box below! 💬😊 Please do the same if you have questions, remarks or if you find mistakes. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +Keras. (n.d.). Convolutional Layers. Retrieved from [https://keras.io/layers/convolutional/](https://keras.io/layers/convolutional/) + +StackOverflow. (n.d.). Channels first vs Channels last - what doe these mean? Retrieved from [https://stackoverflow.com/questions/56754574/channels-first-vs-channels-last-what-doe-these-mean](https://stackoverflow.com/questions/56754574/channels-first-vs-channels-last-what-doe-these-mean) diff --git a/how-to-use-dropout-with-keras.md b/how-to-use-dropout-with-keras.md new file mode 100644 index 0000000..47f79e5 --- /dev/null +++ b/how-to-use-dropout-with-keras.md @@ -0,0 +1,379 @@ +--- +title: "How to use Dropout with Keras?" +date: "2019-12-18" +categories: + - "deep-learning" + - "frameworks" +tags: + - "convolutional-neural-networks" + - "deep-learning" + - "deep-neural-network" + - "dropout" + - "keras" + - "machine-learning" + - "regularization" + - "regularizer" +--- + +When you have a dataset of limited size, overfitting is quite a problem. That is, while your training results might be good, it's likely that they don't generalize to data that has not been seen during training. + +This severely impacts the production usability of your machine learning module. + +Fortunately, with regularization techniques, it's possible to reduce overfitting. Dropout is such a technique. In this blog post, we cover how to implement Keras based neural networks with Dropout. We do so by firstly recalling the basics of Dropout, to understand at a high level what we're working with. Secondly, we take a look at how Dropout is represented in the Keras API, followed by the design of a ConvNet classifier of the CIFAR-10 dataset. We subsequently provide the implementation with explained example code, and share the results of our training process. + +Ready? Let's go! 😊 + +\[toc\] + +## Recap: what is Dropout? + +Before discussing the implementation of Dropout in the Keras API, the design of our model and its implementation, let's first recall what Dropout is and how it works. + +In our blog post ["What is Dropout? Reduce overfitting in your neural networks"](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/), we looked at what Dropout is theoretically. In short, it's a regularizer technique that reduces the odds of overfitting by dropping out neurons at random, during every epoch (or, when using a minibatch approach, during every minibatch). + +![](images/dropout.png) + +Dropping out neurons happens by attaching Bernoulli variables to the neural outputs (Srivastava et al., 2014). 
These variables, which take the value of \[latex\]1\[/latex\] with probability \[latex\]p\[/latex\] and 0 with \[latex\]1-p\[/latex\], help reduce overfitting by "making the presence of other (..) units unreliable". This way, neural networks cannot generate what Srivastava et al. call complex co-adaptations that do not generalize to unseen data. + +By consequence, the occurrence of overfitting is reduced. + +Let's now continue with some Dropout best practices. If you wish to understand the concepts behind Dropout in more detail, I'd like to point you to [this blog](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/). + +### Dropout best practices + +When working on software projects, and hence when working on machine learning development, it's always best to take a look at some best practices. Srivastava et al. (2014), who discussed Dropout in their work ["Dropout: A Simple Way to Prevent Neural Networks from Overfitting"](http://jmlr.org/papers/v15/srivastava14a.html), empirically found some best practices which we'll take into account in today's model: + +- While it's best to determine the value for parameter \[latex\]p\[/latex\] with a validation set, it's perfectly fine to set it to \[latex\]p \\approx 0.5\[/latex\]. This value has shown the best empirical results when being tested with the MNIST dataset. +- To avoid holes in your input data, the authors argued that you best set \[latex\]p\[/latex\] for the input layer to \[latex\]1.0\[/latex\] - effectively the same as not applying Dropout there. +- Dropout seems to work best when a combination of max-norm regularization (in Keras, with the [MaxNorm constraint](https://keras.io/constraints/#maxnorm)), high learning rates that [decay](https://www.machinecurve.com/index.php/2019/11/11/problems-with-fixed-and-decaying-learning-rates/#what-is-learning-rate-decay) to smaller values, and high momentum is used as well. + +Any optimizer can be used. Given the benefits of the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam) (momentum-like optimization with locally adapted weights), we're using that one today, as well as the best practices mentioned above. + +## Dropout in the Keras API + +Within Keras, Dropout is represented as one of the _Core layers_ (Keras, n.d.): + +``` +keras.layers.Dropout(rate, noise_shape=None, seed=None) +``` + +It can be added to a Keras deep learning model with `model.add` and contains the following attributes: + +- **Rate**: the parameter \[latex\]p\[/latex\] which determines the odds of dropping out neurons. When you did not validate which \[latex\]p\[/latex\] works best for you with a validation set, recall that it's best to set it to \[latex\]rate \\approx 0.5\[/latex\] for hidden layers and \[latex\]rate \\approx 0.1\[/latex\] for the input layer (note that \[latex\]rate \\approx 0.1\[/latex\] equals \[latex\]p \\approx 0.9\[/latex\] - Keras turns the logic upside down, making _rate_ the odds of _dropping out_ rather than _keeping_ neurons!) +- **Noise shape:** if you wish to share noise across one of (batch, timesteps, features), you can set the noise shape for this purpose. 
[Read more about noise shape here.](https://stackoverflow.com/questions/46585069/keras-dropout-with-noise-shape) +- **Seed**: if you wish to fixate the pseudo-random generator that determines whether the Bernoulli variables are 1 or 0 (e.g., to rule out issues with the number generator), then you can set some seed by specifying an integer value here. + +**Important:** once more, the drop rate (or 'rate') in Keras determines the odds of dropping out neurons - instead of keeping them. In effect, with respect to the parameter \[latex\]p\[/latex\] defined by Srivastava et al. (2014) when discussing Dropout, `rate` thus effectively means \[latex\]1-p\[/latex\]. If 75% of the neurons are kept with \[latex\]p = 0.75\[/latex\], `rate` must be \[latex\]0.25\[/latex\]. + +## Designing a ConvNet classifier with Dropout + +Let's now take a look how to create a neural network with Keras that makes use of Dropout for reducing overfitting. For this purpose, we're creating a [convolutional neural network](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) for image classification. Next, we discuss the dataset we're using today and the design of our model. + +### Today's dataset + +These are a few samples from the CIFAR-10 dataset, which we will use today: + +[![](images/cifar10_images.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/cifar10_images.png) + +The CIFAR-10 dataset is one of the standard machine learning datasets and contains thousands of small natural images, divided in 10 classes. For example, it contains pictures of cats, trucks, and ships. It's one of the default choices when you want to show how certain models work. + +### Model architecture + +Next, the architecture of our model. Today, it looks like this: + +![](images/model-4.png) + +This architecture, which contains two Conv2D layers followed by Max Pooling, as well as two Densely-connected layers, worked best in some empirical testing up front - so I chose it to use in the real training process. + +Note that Dropout is applied with \[latex\]rate = 0.50\[/latex\], and that - which is not visible in this diagram - max-norm regularization is applied as well, in each layer (also the Dense ones). The Conv2D layers learn 64 filters each and convolve with a 3x3 kernel over the input. The max pooling pool size will be 2 x 2 pixels. + +The activation functions in the hidden layer are [ReLU](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/), and by consequence, we use [He uniform init](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/) as our weight initialization strategy. + +### What you'll need to run the model + +If you wish to run today's model, you'll need **Keras** - one of the popular deep learning frameworks these days. For this to run, you'll need one of the backends (preferably **Tensorflow**) as well as **Python** (or, although not preferably, R). + +## Implementing the classifier with Dropout + +Okay, let's create the Keras ConvNet :) + +Open up your Explorer, navigate to some folder, and create a file called `model_dropout.py`. Now open this file in your code editor of choice. 
There we go, we can start coding :) + +### Model imports + +The first thing we need to do is to list our imports: + +``` +import keras +from keras.datasets import cifar10 +from keras.models import Sequential +from keras.layers import Dense, Dropout, Flatten +from keras.layers import Conv2D, MaxPooling2D +from keras import backend as K +from keras.constraints import max_norm +``` + +We'll use the `keras` deep learning framework, from which we'll use a variety of functionalities. From `keras.datasets`, we import the CIFAR-10 dataset. It's a nice shortcut: Keras contains API pointers to datasets like MNIST and CIFAR-10, which means that you can load them with only a few lines of code. This way, we don't get buried with a lot of data loading work, so that we can fully focus on creating the model. + +From `keras.layers`, we import `Dense` (the densely-connected layer type), `Dropout` (which [serves to regularize](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/)), `Flatten` (to link the convolutional layers with the Dense ones), and finally `Conv2D` and `MaxPooling2D` - the conv & related layers. + +We also import the `Sequential` model, which allows us to stack the layers nicely on top of each other, from `keras.models`. + +Next, we import the Keras `backend` for some data preparation functionalities. + +Finally, we import the `max_norm` Constraints, which is a Dropout best practice and should improve the model significantly. + +### Model configuration + +Next, we can specify some configuration parameters for the model: + +``` +# Model configuration +img_width, img_height = 32, 32 +batch_size = 250 +no_epochs = 55 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +max_norm_value = 2.0 +``` + +CIFAR-10 samples are 32 pixels wide and 32 pixels high, and therefore we set `img_width = img_height = 32`. Batch size is set to 250, which empirically worked best for CIFAR-10 with my model. I set the number of epochs to 55, because - as we shall see - the differences between _dropout_ and _no dropout_ will be pretty clear by then. + +The number of classes our model will be able to handle - `no_classes` - is 10, which is the number of classes supported by the CIFAR-10 dataset. Verbosity mode is set to 1 (or `True`), sending all output to screen. 20% of the training data will be used for validation purposes. + +Finally, `max_norm_value` is set to 2.0. This value specifies the maximum norm that is acceptable for the max-norm regularization with the MaxNorm Keras constraint. Empirically, I found that 2.0 is a good value for today's model. However, if you use it with some other model and/or another dataset, you must experiment a bit to find a suitable value yourself. + +### Loading & preparing data + +The next steps to add are related to loading and preparing the CIFAR-10 dataset: + +``` +# Load CIFAR10 dataset +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Reshape data based on channels first / channels last strategy. +# This is dependent on whether you use TF, Theano or CNTK as backend. 
+# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py +if K.image_data_format() == 'channels_first': + input_train = input_train.reshape(input_train.shape[0],3, img_width, img_height) + input_test = input_test.reshape(input_test.shape[0], 3, img_width, img_height) + input_shape = (3, img_width, img_height) +else: + input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 3) + input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 3) + input_shape = (img_width , img_height, 3) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = keras.utils.to_categorical(target_train, no_classes) +target_test = keras.utils.to_categorical(target_test, no_classes) +``` + +With the Keras `load_data` call, it's possible to load CIFAR-10 very easily into variables for features and targets, for the training and testing datasets. + +Once the data has been loaded, we reshape it based on the backend we're using - i.e., Tensorflow, Theano and CNTK - so that no matter the backend, the data has a uniform shape. + +Next, we parse numbers as floats, which presumably speeds up the training process. Subsequently, we normalize the data, which neural networks appreciate. Finally, we apply `to_categorical`, to ensure that [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) can be used for this multiclass classification problem. + +### Defining the architecture + +Once the data has been loaded, we can define the architecture: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(64, kernel_size=(3, 3), kernel_constraint=max_norm(max_norm_value), activation='relu', input_shape=input_shape, kernel_initializer='he_uniform')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.50)) +model.add(Conv2D(64, kernel_size=(3, 3), kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.50)) +model.add(Flatten()) +model.add(Dense(256, activation='relu', kernel_constraint=max_norm(max_norm_value), kernel_initializer='he_uniform')) +model.add(Dense(no_classes, activation='softmax')) +``` + +It's in line with the architectural diagram [we discussed earlier](#model-architecture). It has two Conv2D and related layers, two Dense layers, and outputs a multiclass probability distribution for a sample, with the Softmax activation function. + +### Compilation & training + +The next step is to compile the model. Compiling, or configuring the model, allows you to specify a [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/), an [optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) and additional metrics, such as accuracy. As said, we use categorical crossentropy loss to determine the difference between prediction and actual target. Additionally, we use the Adam optimizer - pretty much one of the standard optimizers today. 
+ +``` +# Compile the model +model.compile(loss=keras.losses.categorical_crossentropy, + optimizer=keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split +) +``` + +Once our model has been configured, we can `fit` the training data to the model! We do so by specifying the `input_train` and `target_train` variables, as well as batch size, number of epochs, verbosity mode and the validation split. We set their values [earlier](#model-configuration). + +### Model evaluation + +The final step is adding a metric for evaluation with the test set - to identify how well it generalizes to data it has not seen before. This allows us to compare various models, which we will do next. + +``` +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +### Full model code + +If you wish to copy the entire model at once, here you go: + +``` +import keras +from keras.datasets import cifar10 +from keras.models import Sequential +from keras.layers import Dense, Dropout, Flatten +from keras.layers import Conv2D, MaxPooling2D +from keras import backend as K +from keras.constraints import max_norm + +# Model configuration +img_width, img_height = 32, 32 +batch_size = 250 +no_epochs = 55 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +max_norm_value = 2.0 + +# Load CIFAR10 dataset +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Reshape data based on channels first / channels last strategy. +# This is dependent on whether you use TF, Theano or CNTK as backend. +# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py +if K.image_data_format() == 'channels_first': + input_train = input_train.reshape(input_train.shape[0],3, img_width, img_height) + input_test = input_test.reshape(input_test.shape[0], 3, img_width, img_height) + input_shape = (3, img_width, img_height) +else: + input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 3) + input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 3) + input_shape = (img_width , img_height, 3) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = keras.utils.to_categorical(target_train, no_classes) +target_test = keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(64, kernel_size=(3, 3), kernel_constraint=max_norm(max_norm_value), activation='relu', input_shape=input_shape, kernel_initializer='he_uniform')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.50)) +model.add(Conv2D(64, kernel_size=(3, 3), kernel_constraint=max_norm(max_norm_value), activation='relu', kernel_initializer='he_uniform')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.50)) +model.add(Flatten()) +model.add(Dense(256, activation='relu', kernel_constraint=max_norm(max_norm_value), kernel_initializer='he_uniform')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=keras.losses.categorical_crossentropy, + optimizer=keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model 
model.fit(input_train, target_train,
          batch_size=batch_size,
          epochs=no_epochs,
          verbose=verbosity,
          validation_split=validation_split
)

# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
```

### Running the model

It's now time to run the model. Open up a terminal, `cd` to the folder where you put your file, and execute `python model_dropout.py`. Training then starts!

## Training results

Generally speaking, the models converge at accuracies of approximately 65-75%, which is not uncommon for the CIFAR-10 dataset. However, what's important is to see whether the model is actually _overfitting_ - and we can do so by inspecting the loss value.

After all, what is a model with 75% accuracy worth when it is overconfident and its loss keeps deteriorating? You'd still not benefit from it in practice.

I ran the model multiple times, each time comparing the following situations:

- Dropout vs No Dropout;
- Dropout _with_ vs Dropout _without_ max-norm regularization;
- Dropout with the Adam optimizer vs Dropout with the SGD optimizer.

### Dropout vs No dropout

The difference is enormous for the Dropout vs No dropout case, clearly demonstrating the benefits of Dropout for reducing overfitting. As you can see, and primarily by taking a look at the loss value, the model without Dropout starts overfitting pretty soon - and does so significantly.

The model with Dropout, however, shows no signs of overfitting, and its loss keeps decreasing. You even end up with a model that significantly outperforms the no-Dropout case, even in terms of accuracy. That's great news - we didn't do all our work for nothing!

- [![](images/acc-2-1024x528.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/acc-2.png)

- [![](images/loss-2-1024x528.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/loss-2.png)


### Dropout with vs without Max-norm regularization

Let's now take a look at what happens when we apply max-norm regularization versus when we leave it out.

As you can see, the difference is less significant than in the Dropout/No-dropout case, but it still matters. Our \[latex\]norm = 2.0\[/latex\] max-norm regularization (i.e., our MaxNorm Keras constraint) ensures that overfitting does not happen, whereas the no-max-norm case starts overfitting slightly. Indeed, Srivastava et al.'s (2014) results can be confirmed: adding max-norm regularization to Dropout leads to even better performance.

- [![](images/acc-3-1024x537.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/acc-3.png)

- [![](images/loss-3-1024x537.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/loss-3.png)


### Dropout with Adam vs Dropout with SGD

The results for this comparison clearly indicate that Adam performs much better than traditional SGD when Dropout is applied. Likely, this is the case because Adam [combines momentum and local parameter updates](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam) - benefiting the training process irrespective of Dropout.

- ![](images/acc-4-1024x537.png)

- ![](images/loss-4-1024x537.png)


## Summary

In today's blog post, we've seen how to implement Dropout with Keras. Based on some theory, we implemented a ConvNet with Python that makes use of Dropout to reduce the odds of overfitting.

That Dropout really works was confirmed by our experiments.
Having trained on the CIFAR-10 dataset, the ConvNet we created experiences substantial overfitting when Dropout is omitted, while no overfitting is reported with Dropout added.

Max-norm regularization indeed benefits Dropout, reducing the odds of overfitting even further. Finally, it's also become clear that when using Dropout, it might be a good idea to use Adam and not traditional SGD.

Thank you for reading MachineCurve today and I hope you've learnt something from this article! 😀 If you did, I'd be happy to hear from you - feel free to leave a comment in the comments box below. Thanks again, and happy engineering! 😎

## References

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014, June 15). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Retrieved from [http://jmlr.org/papers/v15/srivastava14a.html](http://jmlr.org/papers/v15/srivastava14a.html)

MachineCurve. (2019, December 16). What is Dropout? Reduce overfitting in your neural networks. Retrieved from [https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks)

Keras. (n.d.). Core Layers: Dropout. Retrieved from [https://keras.io/layers/core/#dropout](https://keras.io/layers/core/#dropout)

diff --git a/how-to-use-elu-with-keras.md b/how-to-use-elu-with-keras.md
new file mode 100644
index 0000000..74a436a
--- /dev/null
+++ b/how-to-use-elu-with-keras.md
@@ -0,0 +1,359 @@
---
title: "How to use ELU with Keras?"
date: "2019-12-09"
categories:
  - "deep-learning"
  - "frameworks"
tags:
  - "activation-function"
  - "activation-functions"
  - "deep-learning"
  - "elu"
  - "keras"
  - "machine-learning"
  - "neural-network"
---

The Exponential Linear Unit is a new type of activation function that attempts to resolve both the defects of traditional ReLU activation and those of the functions that already attempted to resolve them: Leaky ReLU and PReLU.

But what is wrong with ReLU in the first place? And why was ELU suggested in place of Leaky ReLU and PReLU? We'll find out in this blog. We start with ReLU: why it's better than classic activation functions, but also why it introduces new problems of its own. We then cover PReLU and Leaky ReLU and see how, while they resolve ReLU's problems, they also introduce a new one: noise sensitivity.

ELU, which we cover subsequently, attempts to resolve this problem by introducing a saturation value at the negative part of the input spectrum. We show how to implement this with Python by providing a Keras example, using a ConvNet that is trained on the MNIST dataset. The results suggest that ELU might benefit you, but only if you train for many epochs, possibly with deeper networks.

\[toc\]

## Recap: what was the point with ReLU, again?

Rectified Linear Unit, or [ReLU](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) for short, is one of the most widely used activation functions these days. It works really well, and because it can be used across a wide range of machine learning problems, it has grown into what it is today. It is also a really simple activation function, outputting zero for all \[latex\]x < 0\[/latex\] and outputting \[latex\]x\[/latex\] (i.e., the input) in all the other cases.
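
To make that definition very concrete, here is a minimal, purely illustrative NumPy sketch of what ReLU computes - the function name `relu` is our own and not part of any library we use later:

```
import numpy as np

def relu(x):
    # ReLU outputs zero for negative inputs and the input itself otherwise
    return np.maximum(0.0, x)

# A few example inputs and their ReLU outputs
print(relu(np.array([-2.0, 0.0, 3.5])))  # [0.  0.  3.5]
```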
+ +[![](images/relu_and_deriv-1024x511.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/09/relu_and_deriv.jpeg) + +Among others, this makes your model _sparse_, since many of the inputs result in neurons that are deactivated: only the important neurons will keep firing and playing a role in the training process. + +Another benefit is related to the gradients produced by the ReLU activation function. + +### No vanishing gradients + +As you might recall from the [high-level supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), during training, your trainingset is _fed forward_, generating predictions based on the current state of the model. These predictions are subsequently converted into a loss value, which can be used to optimize the model's weights - repeating this process over and over again, until you stop training. + +But how to improve? From the article about [gradient descent based optimization](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or [adaptive optimization](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), we recall that two elements play a role here: first, the optimizer, and second, backpropagation. + +The optimizer will _actually change the weights in order to improve the model_. But if you want to change the weights, you must know how by much they should change - in theory. This is what the _gradient_ is for, or the change that should be applied to a neuron with respect to the current loss value. + +However, neural networks are layered, and their neurons - present in these layers - are linked to each other through artificial synapses. This means that if we wish to compute the gradient for a particular layer, we always have to take into account the gradients of the layers in between that particular layer and the loss value. We essentially have to compute the gradient while taking into account some layer, some other layer, (....and so on...) and finally the prediction error (a.k.a. loss value). + +[![](images/sigmoid_and_deriv-1024x511.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/09/sigmoid_and_deriv.jpeg) + +As can be seen from the plot, activation functions like the Sigmoid function produce gradients that cannot be more than 0.25 given any input. In most cases, the value is even smaller, converging to zero for large positive and large negative numbers. + +This is bad, especially for really large networks - i.e., the ones that we see today, with many (i.e., dozens) of layers. + +Because when chaining the gradients together in these cases, you would for four layers in between find a gradient of (0.25^4) = 0.00390625 at max for the particular upstream layer. Welcome to what is called the **[vanishing gradients problem.](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/)** In those cases, upstream layers learn very slowly or do not converge at all, essentially wasting all your resources as you will never get the results you want. + +[![](images/relu_and_deriv-1024x511.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/09/relu_and_deriv.jpeg) + +Fortunately, ReLU is not sensitive to the vanishing gradients problem - as can be seen from the plot above. The gradient is either zero or one. 
No more vanishing gradients 👯‍♂️

Besides its simplicity (computing the output boils down to the computationally inexpensive operation `max(x, 0)`), this is actually one of the reasons why ReLU is so popular today.

### Dying ReLUs instead

Unfortunately, the party ends now 😑

The fact that gradients are **one or zero** introduces an entirely new problem: the **dying ReLU problem**.

What is the dying ReLU problem? Let's take a look at the ReLU gradient again:

[![](images/relu_and_deriv-1024x511.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/09/relu_and_deriv.jpeg)

The gradient is _either zero or one_.

While small gradients silence the chained gradients for upstream layers, having _one_ zero ReLU gradient somewhere within the chain of gradients will silence your layer entirely.

Which means that your neuron cannot improve, and that it is effectively dead.

Given enough such dead neurons, your network will once again learn very slowly or fail to converge at all.

You're back in trouble again.

What's more, since the outputs are zero for all negative inputs and equal to the input when they are positive, the mean activation of any ReLU neuron is nonzero. This impacts the next layers, most presumably slowing down the learning process compared to activation functions that _do_ activate close to zero (Clevert et al., 2015).

Fortunately, new activation functions are to the rescue.

## Recap: what's wrong with Leaky ReLU and PReLU?

These functions all _change the ReLU formula_ slightly in order to overcome some of the problems:

- **[Leaky ReLU](https://www.machinecurve.com/index.php/2019/11/12/using-leaky-relu-with-keras/)** sets the negative part of the formula to really small but nonzero outputs (the inputs are being multiplied by some parameter \[latex\]\\alpha\[/latex\]), which means that dying neurons are no longer present.
- **[PReLU](https://www.machinecurve.com/index.php/2019/12/05/how-to-use-prelu-with-keras/)** recognizes that setting \[latex\]\\alpha\[/latex\] manually in advance of training means that certain assumptions about the data and the model have to be made. Such assumptions may not hold or may not be fully perfect for the particular ML problem, which means that performance may deteriorate. PReLU generalizes Leaky ReLU to a situation where \[latex\]\\alpha\[/latex\] is made input-specific and becomes trainable. As with Leaky ReLU, this avoids the dying ReLU problem.

Unfortunately, while they do contribute towards a better activation function, these functions still do not solve all the well-known issues.

In their paper "Fast and accurate deep network learning by exponential linear units", Clevert et al. (2015) argue that they introduce new issues. While they are not too sensitive to the vanishing gradients problem and remove the dying ReLU problem from the equation, they have no such thing as a "noise-robust deactivation state" (Clevert et al., 2015).

[![](images/leaky_relu.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/leaky_relu.png)

What this means can be derived from the visualization above. For positive inputs, the Leaky ReLU activation function displayed behaves like traditional ReLU. For negative inputs, the outputs are small but nonzero. So far, so good.

But what happens if, for example, we input -5,000,000? While this does not happen quite often - we hope - the input would still be very negative.
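
To make this concrete, here is a small, purely illustrative NumPy sketch (ours, not from the paper) of what Leaky ReLU outputs for such an extreme negative input:

```
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha * x for negative inputs, x otherwise - note that there is no lower bound
    return np.where(x < 0, alpha * x, x)

# Even a hugely negative input is passed through (scaled by alpha) instead of saturating
print(leaky_relu(np.array([-5000000.0])))  # [-50000.]
```

In other words, there is no fixed negative saturation level: the more negative the input, the more negative the output.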

The risk of this happening increases when the Leaky ReLU \[latex\]\\alpha\[/latex\] is increased (steepening the curve), or when the same happens with the learned PReLU \[latex\]\\alpha\[/latex\]s.

_Any such noise will thus interfere with training - and this is a new problem introduced by Leaky ReLU and PReLU, according to Clevert et al._

## What are ELUs?

This is why they propose a new activation function, called the **Exponential Linear Unit** (or ELU), which shares the benefits of PReLU/Leaky ReLU yet improves upon them as well (Clevert et al., 2015):

- ELU is not too sensitive to vanishing gradients and removes the dying ReLU problem.
- Mean ELU activations are closer to zero, which is estimated to make the learning process faster - a benefit shared with PReLU and Leaky ReLU.
- ELU saturates to a fixed negative value with decreasing input, making it relatively robust to noise.

ELU can be written down mathematically as:

\\begin{equation} f(x) = \\begin{cases} x, & \\text{if}\\ x \\geq 0 \\\\ \\alpha(\\exp(x) - 1), & \\text{otherwise} \\\\ \\end{cases} \\end{equation}

Do note that according to the paper, \[latex\]\\alpha > 0\[/latex\] must hold for ELU to work. This must be the case since \[latex\]\\alpha\[/latex\] represents the absolute value of the negative saturation level; by definition of the formula above, this must be larger than zero.

This looks as follows:

[![](images/elu_avf.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/elu_avf.png)

And this is the gradient function:

[![](images/elu_deriv.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/elu_deriv.png)

As you can see, the _vanishing gradients_ and the _dying neurons_ are gone (see the gradient plot) - and the function saturates to \[latex\]f(x) = -1.0\[/latex\], as configured with \[latex\]\\alpha = 1.0\[/latex\]. According to Clevert et al. (2015), this makes ELU "well suited for deep neural networks with many layers (...) \[enabling\] faster learning \[as well through mean activations close to zero\]".

### Empirical tests with ELU

Clevert et al. (2015) validated the effectiveness of the ELU activation function with multiple standard datasets:

- With the **MNIST dataset**, which contains 70k grayscale images of digits and hence 10 classes, median activation was closer to zero and a faster decrease in error rate was reported.
- With the **CIFAR10 dataset**, which contains 60k color images in 10 categories and thus classes, ELU based networks showed significantly lower test error rates compared to other architectures.
- With the **CIFAR100 dataset**, which contains 60k color images in 100 categories and thus classes, the same results were reported.
- With the **ImageNet dataset**, which contains 1.4M color images in 1000 categories, a faster decrease in error rate was reported, as well as lower error rates.

Note that [He initialization](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/) (He et al., 2015) was used throughout all cases, since we're working with ReLU-like activation functions here, which are traditionally incompatible with standard Xavier (or Glorot) [weight initialization](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/).

I would like to refer you to the [original paper](#references) if you wish to look at the results in more detail.

## Implementing ELUs with Keras

Let's now see if we can achieve similar results when implementing ELUs with Keras.
I'm especially curious to see whether we can replicate them with the MNIST dataset, as this has been difficult with Leaky ReLU and PReLU. + +Why this occurs? Presumably due to the relative ease of training given the discriminative power of the dataset, as well as the relative shallowness of the network, making it less sensitive to e.g. vanishing gradients. + +We'll therefore code a Keras model today 😀 + +I won't explain the model here except its few ideosyncrasies, since it's the Keras CNN [we coded in another blog](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/). Hence, if you wish to understand the architectural components of this model in more detail, I'd recommend you take a look at the other blog post 😄 + +``` +import keras +from keras.datasets import mnist +from keras.models import Sequential +from keras.layers import Dense, Dropout, Flatten +from keras.layers import Conv2D, MaxPooling2D +from keras.initializers import Constant +from keras import backend as K +from keras.layers import ELU +import matplotlib.pyplot as plt + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +elu_alpha = 0.1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data based on channels first / channels last strategy. +# This is dependent on whether you use TF, Theano or CNTK as backend. +# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py +if K.image_data_format() == 'channels_first': + input_train = input_train.reshape(input_train.shape[0], 1, img_width, img_height) + input_test = input_test.reshape(input_test.shape[0], 1, img_width, img_height) + input_shape = (1, img_width, img_height) +else: + input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) + input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) + input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data [0, 1]. 
+input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = keras.utils.to_categorical(target_train, no_classes) +target_test = keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), input_shape=input_shape, kernel_initializer='he_normal')) +model.add(ELU(alpha=elu_alpha)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), kernel_initializer='he_normal')) +model.add(ELU(alpha=elu_alpha)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, kernel_initializer='he_normal')) +model.add(ELU(alpha=elu_alpha)) +model.add(Dense(no_classes, activation='softmax', kernel_initializer='he_normal')) + +# Compile the model +model.compile(loss=keras.losses.categorical_crossentropy, + optimizer=keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss for Keras ELU CNN: {score[0]} / Test accuracy: {score[1]}') + +# Visualize model history +plt.plot(history.history['accuracy'], label='Training accuracy') +plt.plot(history.history['val_accuracy'], label='Validation accuracy') +plt.title('ELU training / validation accuracies') +plt.ylabel('Accuracy') +plt.xlabel('Epoch') +plt.legend(loc="upper left") +plt.show() + +plt.plot(history.history['loss'], label='Training loss') +plt.plot(history.history['val_loss'], label='Validation loss') +plt.title('ELU training / validation loss values') +plt.ylabel('Loss value') +plt.xlabel('Epoch') +plt.legend(loc="upper left") +plt.show() +``` + +These are the differences: + +- We configure an `elu_alpha` value in the model configuration section, which simply specifies the \[latex\]\\alpha\[/latex\] value for the ELU activation layers. +- We apply [He initialization](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/) to the Conv2D and Dense layers, in line with Clevert et al. (2015) given the findings of He et al. (2015). 
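
To build some intuition for what the `elu_alpha` parameter from the first bullet controls, here is a small stand-alone NumPy sketch (our own, separate from the Keras model above) of the ELU formula:

```
import numpy as np

def elu(x, alpha=0.1):
    # x for x >= 0, alpha * (exp(x) - 1) otherwise
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

# Negative inputs saturate towards -alpha rather than being cut off at zero
print(elu(np.array([-10.0, -1.0, 0.0, 2.0]), alpha=0.1))
# approximately [-0.1, -0.0632, 0., 2.]
```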
+ +Generating the evaluation metrics & visualizations is also in line with what we've seen in the blog about [visualizing the training process](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/): + +``` +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss for Keras ELU CNN: {score[0]} / Test accuracy: {score[1]}') + +# Visualize model history +plt.plot(history.history['accuracy'], label='Training accuracy') +plt.plot(history.history['val_accuracy'], label='Validation accuracy') +plt.title('ELU training / validation accuracies') +plt.ylabel('Accuracy') +plt.xlabel('Epoch') +plt.legend(loc="upper left") +plt.show() + +plt.plot(history.history['loss'], label='Training loss') +plt.plot(history.history['val_loss'], label='Validation loss') +plt.title('ELU training / validation loss values') +plt.ylabel('Loss value') +plt.xlabel('Epoch') +plt.legend(loc="upper left") +plt.show() +``` + +## Results + +I've trained this architecture both with ELU (\[latex\]\\alpha = 1.0\[/latex\]) and with traditional ReLU, for both Xavier/Glorot (standard) weight initialization and He initialization (as recommended for ReLUs). + +### ReLU/ELU with Xavier/Glorot init + +With Xavier/Glorot init, ELU performs slightly worse than traditional ReLU: loss is higher, and ELU activation seems to have started overfitting. + +``` +Test loss for Keras ReLU CNN: 0.03084432035842483 / Test accuracy: 0.9915000200271606 +Test loss for Keras ELU CNN: 0.04917487791230358 / Test accuracy: 0.9905999898910522 +``` + +- [![](images/elu_acc.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/elu_acc.png) + +- [![](images/elu_loss.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/elu_loss.png) + +- [![](images/elu_relu.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/elu_relu.png) + + +### ReLU/ELU with He init + +With He init, however, ELU performs slightly _better_ (albeit really slightly!) than ReLU in terms of loss. What's more, the steep increase towards the 25th epoch is gone, possibly showing the benefit of He init when using ELU. + +``` +Test loss for Keras ReLU CNN: 0.03047580350333262 / Test accuracy: 0.9918000102043152 +Test loss for Keras ELU CNN: 0.029303575038436554 / Test accuracy: 0.9922000169754028 +``` + +- [![](images/elu_he_loss.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/elu_he_loss.png) + +- [![](images/elu_he_acc.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/elu_he_acc.png) + +- [![](images/elu_he_relu.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/elu_he_relu.png) + + +### ReLU & ELU, He init, 250 epochs + +These are the results when training the model with ReLU and ELU activations, with He init, for 250 epochs, with Dropout increased to 0.5 (from 0.25) to avoid overfitting: + +``` +Test loss for Keras ReLU CNN: 0.042991068468006335 / Test accuracy: 0.9923999905586243 +Test loss for Keras ELU CNN: 0.08624260328077216 / Test accuracy: 0.9908999800682068 +``` + +- [![](images/long_elu_loss-1024x294.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/long_elu_loss.png) + +- [![](images/long_elu_acc-1024x294.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/long_elu_acc.png) + +- [![](images/long_elu_relu-1024x294.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/long_elu_relu.png) + + +The results are both positive and negative. 
Yes, we do observe in line with the authors that ELU results in faster convergence and hence a sped-up training process, but we _also_ observe that overfitting occurs faster when ConvNets are trained with ELU. Hence, when considering ELU, you may wish to use [EarlyStopping with ModelCheckpointing](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/) in parallel to stop at precisely the correct point. + +### Interpretation + +He init is necessary when using ELU, that's clear. Even then, for smaller models, it converges to values that are close to ReLU. + +This made me think: **the fact that ELU performs less powerful here, does not mean that ELU is useless.** On the contrary: + +- It may still benefit you when training deeper networks. +- When training for 250 epochs, the suggested speedier convergence is indeed observed, but we also note that ELU results in stronger overfitting. +- It may be the case that a particular setting is hampering learning. Perhaps, it's the Dropout from the architecture, which was not present in Clevert et al. (2015)? Who knows. Additional research is required into this. +- Perhaps, we find improvements if we let it train for ten times as long (i.e., 250 epochs). + +Hence: consider ELU when you face dying ReLUs and wish to avoid vanishing gradients, but consider it carefully, taking [proper mitigation measures](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/). + +## Summary + +In this blog post, we've seen how Exponential Linear Units attempt to resolve both the problems of traditional activation functions (being vanishing gradients) and the problems of the newer ones trying to resolve additional problems (being dying ReLUs). Recognizing that these newer functions have nonzero mean activations, which slow down the training process, Clevert et al. (2015) have added a nonzero negative part that saturates to a particular value when activating. This, they argue, makes the activation more robust to noise. + +We provided a Keras implementation of this so-called ELU activation function. How it works, can be found above. Empirically, with a small neural network with only few epochs, ELU showed similar results to traditional ReLU. When the number of epochs was increased, the speedier convergence was found, at the risk of increased overfitting. As we noted, this does not necessarily mean that ELU is not useful. You'll just need to apply it carefully - while additional R&D is required to validate these findings further. + +Thanks for reading MachineCurve today and happy engineering! 😎 + +## References + +Clevert, D. A., Unterthiner, T., & Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (elus). _[arXiv preprint arXiv:1511.07289](https://arxiv.org/abs/1511.07289)_. + +He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. _2015 IEEE International Conference on Computer Vision (ICCV)_. [doi:10.1109/iccv.2015.123](https://arxiv.org/abs/1502.01852) + +Keras. (n.d.). Advanced Activations Layers: Exponential Linear Unit. 
Retrieved from [https://keras.io/layers/advanced-activations/#elu](https://keras.io/layers/advanced-activations/#elu) diff --git a/how-to-use-ftswish-with-keras.md b/how-to-use-ftswish-with-keras.md new file mode 100644 index 0000000..435e62e --- /dev/null +++ b/how-to-use-ftswish-with-keras.md @@ -0,0 +1,450 @@ +--- +title: "How to use FTSwish with Keras?" +date: "2020-01-06" +categories: + - "deep-learning" + - "frameworks" +tags: + - "activation-function" + - "activation-functions" + - "ftswish" + - "keras" +--- + +Flatten-T Swish is a new (2018) activation function that attempts to find the best of both worlds between traditional ReLU and traditional Sigmoid. + +However, it's not readily available within the Keras deep learning framework, which only covers the standard activation functions like ReLU and Sigmoid. + +Therefore, in today's blog, we'll implement it ourselves. First, we'll take a look at FTSwish - by providing a recap - and then implement it using Keras. This blog also includes an example model which uses FTSwish, and evaluates the model after training. + +Are you ready? + +Let's go! 😊 + +* * * + +\[toc\] + +* * * + +## Recap: what is FTSwish? + +In our blog post "[What is the FTSwish activation function?](https://www.machinecurve.com/index.php/2020/01/03/what-is-the-ftswish-activation-function/)" we looked at what the Flatten-T Swish or FTSwish activation function is like. Here, we'll recap the essentials, so that you can understand with ease what we're going to build next. + +We can define FTSwish as follows: + +\\begin{equation} FTSwish: f(x) = \\begin{cases} T, & \\text{if}\\ x < 0 \\\\ \\frac{x}{1 + e^{-x}} + T, & \\text{otherwise} \\\\ \\end{cases} \\end{equation} + +It's essentially a combination of the **ReLU** and **Sigmoid** activation functions, with some threshold `T` which ensures that negative inputs always yield nonzero outputs. + +It looks as follows: + +[![](images/ftswish-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/ftswish-1.png) + +And indeed, it does resemble Swish in a way: + +[![](images/relu_swish-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/relu_swish.png) + +* * * + +## Defining FTSwish as a Keras activation function + +Keras has a range of activation functions available, but FTSwish is not one of them. Fortunately, it's possible to define your own activations, so yes: we can still use FTSwish with Keras :) Let's now find out how. + +In any Keras model, you'll first have to import the backend you're working with, in order to provide tensor-specific operations such as `maximum`: + +``` +from keras import backend as K +``` + +We can then define the FTSwish activation function as follows: + +``` +# Define +t = -1.0 +def ftswish(x): + return K.maximum(t, K.relu(x)*K.sigmoid(x) + t) +``` + +Let's break the definition down into understandable steps: + +- The value for `t` is the threshold value \[latex\]T\[/latex\], which in our case is -1.0. It ensures that negative inputs saturate to this value. Its value can be different, but take a look at the derivative plot to ensure that you'll have a smooth one. +- Next, the `def` (definition) ensures that we can use `ftswish` as some kind of function - mapping some input to an output. It also means that we can simply feed it to Keras later, to be used in processing. +- Note that FTSwish _combines_ ReLU and Sigmoid with a threshold value for _positive_ inputs, in a way that it can be broken apart in a multiplication: + - `K.relu` is the ReLU part. 
    - `K.sigmoid` is the Sigmoid part.
    - Multiplying them yields the ReLU/Sigmoid part of the FTSwish activation function.
    - Adding the threshold is simply adding `t` to the outcome of the multiplication.
- Note that `ReLU`, which is \[latex\]0\[/latex\] for negative inputs and \[latex\]x\[/latex\] for others, can be rewritten to \[latex\]max(0, x)\[/latex\] (indeed: \[latex\]x = 4\[/latex\] yields an output of 4, while \[latex\]x = -2\[/latex\] yields 0. This is in line with the ReLU definition). Hence, given the formula for FTSwish above, we can rewrite it as a `max` between `t` (the negative output) and the ReLU/Sigmoid combination (the positive output).
- We're using `K` instead of `np` because we're performing these operations on multidimensional tensors.

* * *

## Example model using FTSwish

Let's now create an example with Keras :) Open up your Explorer or Finder, navigate to some folder, and create a Python file, e.g. `model_ftswish.py`.

### What you'll need to run this model

- **Python**, which we'll write our code in. Preferably, use Python 3.6+.
- **Keras**, which is the deep learning framework we're using.
- **Tensorflow**, which is now the preferred backend for Keras.
- **Matplotlib** and **Numpy**, to support our model in terms of visualization and number processing.

### Model imports

Now, open up `model_ftswish.py` in a code editor and start coding :) First, we'll add the imports:

```
'''
  Keras model using Flatten-T Swish (FTSwish) activation function
  Source for FTSwish activation function:
  Chieng, H. H., Wahid, N., Ong, P., & Perla, S. R. K. (2018). Flatten-T Swish: a thresholded ReLU-Swish-like activation function for deep learning. arXiv preprint arXiv:1812.06247.
  https://arxiv.org/abs/1812.06247
'''
import keras
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
import matplotlib.pyplot as plt
import numpy as np
```

As expected, we'll import `keras` and a number of its sub-packages: the `cifar10` dataset (which we'll use today), the `Sequential` API for easy stacking of our layers, all the layers that are common in a ConvNet, and the Keras backend (which, in our case, maps to Tensorflow). Finally, we also import `pyplot` from Matplotlib and `numpy`.

### Model configuration

Next, it's time to set some configuration values:

```
# Model configuration
img_width, img_height = 32, 32
batch_size = 250
no_epochs = 100
no_classes = 10
validation_split = 0.2
verbosity = 1
```

The [CIFAR-10 dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#cifar-10-small-image-classification) which we're using today contains 32 x 32 pixel images across 10 different classes. Hence, `img_width = img_height = 32`, and `no_classes = 10`. The `batch_size` is 250, which is a fairly reasonable setting based on experience ([click here to find out why to balance between high batch sizes and memory requirements](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/)). We train for 100 `epochs`, and use 20% of our training data for validation purposes. We output everything on screen by setting `verbosity` to True.
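
Before we feed the model any data, it can be helpful to sanity-check the `ftswish` definition from earlier on a few sample values. Here is a minimal NumPy mirror of that definition (our own, purely for inspection - the model itself keeps using the `K`-based version):

```
import numpy as np

t = -1.0

def ftswish_np(x):
    # NumPy version of max(T, relu(x) * sigmoid(x) + T)
    relu = np.maximum(0.0, x)
    sigmoid = 1.0 / (1.0 + np.exp(-x))
    return np.maximum(t, relu * sigmoid + t)

# All negative inputs map to T = -1.0; positive inputs follow the ReLU/Sigmoid product shifted by T
print(ftswish_np(np.array([-3.0, 0.0, 2.0])))  # approximately [-1., -1., 0.7616]
```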
+ +### Loading & preparing the data + +We next load the CIFAR-10 data: + +``` +# Load CIFAR-10 dataset +(input_train, target_train), (input_test, target_test) = cifar10.load_data() +``` + +Which easily loads the CIFAR-10 samples into our training and testing variables: + +- [![](images/45028.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/45028.jpg) + +- [![](images/42180.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/42180.jpg) + +- [![](images/41192.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/41192.jpg) + +- [![](images/40969.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/40969.jpg) + +- [![](images/38811.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/38811.jpg) + +- [![](images/38333.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/38333.jpg) + +- [![](images/38151.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/38151.jpg) + +- [![](images/37932.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/37932.jpg) + +- [![](images/37591.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/37591.jpg) + +- [![](images/36450.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/36450.jpg) + +- [![](images/36144.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/36144.jpg) + +- [![](images/28291.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/28291.jpg) + +- [![](images/28222.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/28222.jpg) + +- [![](images/27569.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/27569.jpg) + +- [![](images/27447.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/27447.jpg) + + +_A few CIFAR-10 samples._ + +After loading, we reshape the data based on the channels first/channels last approach used by our backend (to ensure that we can use a fixed `input_shape`): + +``` +# Reshape data based on channels first / channels last strategy. +# This is dependent on whether you use TF, Theano or CNTK as backend. 
+# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
+if K.image_data_format() == 'channels_first':
+    input_train = input_train.reshape(input_train.shape[0], 3, img_width, img_height)
+    input_test = input_test.reshape(input_test.shape[0], 3, img_width, img_height)
+    input_shape = (3, img_width, img_height)
+else:
+    input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 3)
+    input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 3)
+    input_shape = (img_width, img_height, 3)
+```
+
+Then, we parse our numbers into `float32` format, which presumably speeds up our training process:
+
+```
+# Parse numbers as floats
+input_train = input_train.astype('float32')
+input_test = input_test.astype('float32')
+```
+
+This is followed by normalizing our data to the \[latex\]\[0, 1\]\[/latex\] range, which is appreciated by the neural network during optimization:
+
+```
+# Normalize data
+input_train = input_train / 255
+input_test = input_test / 255
+```
+
+Finally, we convert our targets into _categorical format_, which allows us to use [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) later:
+
+```
+# Convert target vectors to categorical targets
+target_train = keras.utils.to_categorical(target_train, no_classes)
+target_test = keras.utils.to_categorical(target_test, no_classes)
+```
+
+### Adding the defined FTSwish activation function
+
+We can next add the definition of the FTSwish activation function we created earlier:
+
+```
+# Define
+t = -1.0
+def ftswish(x):
+    return K.maximum(t, K.relu(x)*K.sigmoid(x) + t)
+```
+
+### Creating the model architecture
+
+Then, we can create the architecture of our model:
+
+```
+# Create the model
+model = Sequential()
+model.add(Conv2D(64, kernel_size=(3, 3), activation=ftswish, input_shape=input_shape, kernel_initializer='he_normal'))
+model.add(MaxPooling2D(pool_size=(2, 2)))
+model.add(Dropout(0.5))
+model.add(Conv2D(128, kernel_size=(3, 3), activation=ftswish, kernel_initializer='he_normal'))
+model.add(MaxPooling2D(pool_size=(2, 2)))
+model.add(Dropout(0.5))
+model.add(Flatten())
+model.add(Dense(512, kernel_initializer='he_normal', activation=ftswish))
+model.add(Dense(256, kernel_initializer='he_normal', activation=ftswish))
+model.add(Dense(no_classes, activation='softmax', kernel_initializer='he_normal'))
+```
+
+It's a relatively simple ConvNet, with two Conv2D layers, max pooling, [Dropout](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/) and finally Dense layers for classification. We use [He init](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/) because our activation function resembles ReLU.
+
+### Compiling the model
+
+Next, we can compile the model:
+
+```
+# Compile the model
+model.compile(loss=keras.losses.categorical_crossentropy,
+              optimizer=keras.optimizers.Adam(),
+              metrics=['accuracy'])
+```
+
+Because we are facing a multiclass classification problem with one-hot encoded target vectors (by virtue of calling `to_categorical`), we'll be using [categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/).
If you wish to skip the conversion to categorical targets, you might want to replace this with [sparse categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/), which supports integer targets. + +For optimization, we use the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam) - the default choice for today's neural networks. Finally, we specify `accuracy` as an additional metric, which is more intuitive than crossentropy loss. + +### Fitting the data + +Then, we fit the training data, configuring the model in line with how we specified our model configuration before: + +``` +# Fit data to model +history_FTSwish = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +### Evaluation metrics & model history + +The final thing we do is adding code for evaluation (using our _testing data_) and [visualizing the training process](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/): + +``` +# Generate evaluation metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss for Keras FTSwish CNN: {score[0]} / Test accuracy: {score[1]}') + +# Visualize model history +plt.plot(history_FTSwish.history['accuracy'], label='Training accuracy') +plt.plot(history_FTSwish.history['val_accuracy'], label='Validation accuracy') +plt.title('FTSwish training / validation accuracies') +plt.ylabel('Accuracy') +plt.xlabel('Epoch') +plt.legend(loc="upper left") +plt.show() + +plt.plot(history_FTSwish.history['loss'], label='Training loss') +plt.plot(history_FTSwish.history['val_loss'], label='Validation loss') +plt.title('FTSwish training / validation loss values') +plt.ylabel('Loss value') +plt.xlabel('Epoch') +plt.legend(loc="upper left") +plt.show() +``` + +### Full model code + +It's also possible to get the full model code at once, should you wish to start playing around with it. In that case, here you go: + +``` +''' + Keras model using Flatten-T Swish (FTSwish) activation function + Source for FTSwish activation function: + Chieng, H. H., Wahid, N., Ong, P., & Perla, S. R. K. (2018). Flatten-T Swish: a thresholded ReLU-Swish-like activation function for deep learning. arXiv preprint arXiv:1812.06247. + https://arxiv.org/abs/1812.06247 +''' +import keras +from keras.datasets import cifar10 +from keras.models import Sequential +from keras.layers import Dense, Dropout, Flatten +from keras.layers import Conv2D, MaxPooling2D +from keras import backend as K +import matplotlib.pyplot as plt +import numpy as np + +# Model configuration +img_width, img_height = 32, 32 +batch_size = 250 +no_epochs = 100 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load CIFAR-10 dataset +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Reshape data based on channels first / channels last strategy. +# This is dependent on whether you use TF, Theano or CNTK as backend. 
+# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py +if K.image_data_format() == 'channels_first': + input_train = input_train.reshape(input_train.shape[0], 3, img_width, img_height) + input_test = input_test.reshape(input_test.shape[0], 3, img_width, img_height) + input_shape = (3, img_width, img_height) +else: + input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 3) + input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 3) + input_shape = (img_width, img_height, 3) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = keras.utils.to_categorical(target_train, no_classes) +target_test = keras.utils.to_categorical(target_test, no_classes) + +# Define +t = -1.0 +def ftswish(x): + return K.maximum(t, K.relu(x)*K.sigmoid(x) + t) + +# Create the model +model = Sequential() +model.add(Conv2D(64, kernel_size=(3, 3), activation=ftswish, input_shape=input_shape, kernel_initializer='he_normal')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.5)) +model.add(Conv2D(128, kernel_size=(3, 3), activation=ftswish, kernel_initializer='he_normal')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.5)) +model.add(Flatten()) +model.add(Dense(256, kernel_initializer='he_normal', activation=ftswish)) +model.add(Dense(no_classes, activation='softmax', kernel_initializer='he_normal')) + +# Compile the model +model.compile(loss=keras.losses.categorical_crossentropy, + optimizer=keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +history_FTSwish = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + + +# Generate evaluation metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss for Keras FTSwish CNN: {score[0]} / Test accuracy: {score[1]}') + +# Visualize model history +plt.plot(history_FTSwish.history['accuracy'], label='Training accuracy') +plt.plot(history_FTSwish.history['val_accuracy'], label='Validation accuracy') +plt.title('FTSwish training / validation accuracies') +plt.ylabel('Accuracy') +plt.xlabel('Epoch') +plt.legend(loc="upper left") +plt.show() + +plt.plot(history_FTSwish.history['loss'], label='Training loss') +plt.plot(history_FTSwish.history['val_loss'], label='Validation loss') +plt.title('FTSwish training / validation loss values') +plt.ylabel('Loss value') +plt.xlabel('Epoch') +plt.legend(loc="upper left") +plt.show() +``` + +* * * + +## Results + +Now that you have finished creating the model, it's time to train it - and to see the results :) + +Open up a terminal that supports the dependencies listed above, `cd` into the folder where your Python file is located, and run `python model_ftswish.py`. The training process should begin. + +Once it finishes, you should also be able to see the results of the evaluation & visualization steps: + +``` +Test loss for Keras FTSwish CNN: 2.3128050004959104 / Test accuracy: 0.6650999784469604 +``` + +As you can see, loss is still quite high, and accuracy relatively low - it's only correct in 2/3 of cases. This likely occurs because the CIFAR-10 dataset is relatively complex (with various objects in various shapes), which means that it's likely overfitting. 
Additional techniques such as data augmentation may help here.
+
+But is it overfitting? Let's take a look at the visualizations.
+
+Visually, the training process looks as follows.
+
+- [![](images/f_loss.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/f_loss.png)
+
+- [![](images/f_acc.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/f_acc.png)
+
+
+Indeed: overfitting starts once the model hits the 66% accuracy mark. From then on, performance in terms of loss gets worse and worse. Fixing this is not within the scope of this post, which is about FTSwish. However, adding extra Conv2D layers, using Batch Normalization, or using data augmentation may be worthwhile.
+
+* * *
+
+## Summary
+
+In this blog post, we've seen how to create and use the Flatten-T Swish (FTSwish) activation function with Keras. It included a recap of the FTSwish activation function, which was followed by an example implementation of the activation function.
+
+I hope you've learnt something from today's blog post! Thanks for reading MachineCurve and happy engineering 😎
+
+* * *
+
+## References
+
+Chieng, H. H., Wahid, N., Ong, P., & Perla, S. R. K. (2018). [Flatten-T Swish: a thresholded ReLU-Swish-like activation function for deep learning](https://arxiv.org/abs/1812.06247). _arXiv preprint arXiv:1812.06247_.
diff --git a/how-to-use-h5py-and-keras-to-train-with-data-from-hdf5-files.md b/how-to-use-h5py-and-keras-to-train-with-data-from-hdf5-files.md
new file mode 100644
index 0000000..97d86b7
--- /dev/null
+++ b/how-to-use-h5py-and-keras-to-train-with-data-from-hdf5-files.md
@@ -0,0 +1,275 @@
+---
+title: "How to use H5Py and Keras to train with data from HDF5 files?"
+date: "2020-04-13"
+categories:
+  - "deep-learning"
+  - "frameworks"
+tags:
+  - "dataset"
+  - "deep-learning"
+  - "h5py"
+  - "hdf5"
+  - "keras"
+  - "machine-learning"
+  - "mnist"
+---
+
+In the many simple educational cases where people show you how to build Keras models, data is often loaded from the [Keras datasets module](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/) - where loading the data is as simple as adding one line of Python code.
+
+However, it's much more common that data is delivered in the HDF5 file format - and then you might get stuck, especially if you're a beginner.
+
+How can you use this format for your machine learning model? How can you train a model with data stored in the HDF5 format? That's what we will look at in today's blog post. We'll be studying the Hierarchical Data Format, as the data format is called, as well as how to access such files in Python - with `h5py`. Then, we actually create a Keras model that is trained with MNIST data, but this time not loaded from the Keras Datasets module - but from HDF5 files instead.
+
+Do note that there's also a different way of working with HDF5 files in Keras - namely, with the HDF5Matrix util. While it works great, I found it difficult to _adapt data_ when using it (in my case, I wanted to add a channel axis to single-channel images stored in HDF5 format, which isn't really possible with HDF5Matrix, as we shall see later). So: if your dataset already has the correct structure, it's wise to use that util; if it needs adapting first, you can proceed with this blog post. We'll cover the HDF5Matrix in a different one.
+
+Are you ready? Let's go! 😊
+
+* * *
+
+\[toc\]
+
+* * *
+
+## What is an HDF5 file?
+
+You see them every now and then: HDF5 files.
Let's see what such a file is before we actually start working with them. If we go to Wikipedia, we see that... + +> Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data. +> +> Wikipedia (2004) + +It's a file format that is specifically designed for large datasets. That might be what we need sometimes for our machine learning projects! + +Let's now take a slightly closer look at the structure of the HDF format, specifically for HDF5 files - as in my opinion, the HDF4 format is outdated. + +It consists of **datasets** and **groups**, where (Wikipedia, 2004)... + +- Datasets are multidimensional arrays of a homogeneous type +- Groups are container structures which can hold datasets and other groups. + +According to Wikipedia, this creates a truly hierarchical data structure. The multidimensional array structure can hold our data, whereas targets and labels can be split between two different datasets. Finally, the different _classes_ of your dataset, spread between two datasets per class (target / label), can be structured into multiple groups. + +A very handy format indeed! + +https://www.youtube.com/watch?v=q14F3WRwSck + +* * * + +## Why use HDF5 instead of CSV/text when storing ML datasets? + +There is a wide range of possible file types which you can use to store data. HDF5 is one example, but you could also use SQL based solutions like SQLite, or plain text files / CSVs. However, if we take a look at a post by Alex I. (n.d.), HDF5 has some advantages over these data types: + +1. While databases can be an advantage in terms of data that cannot be stored in memory, they are often slower than HDF5 files. You must make this trade-off depending on the size of your dataset. +2. The same goes for text files. While they can be "fairly space-efficient" (especially when compressed substantially), they are slower to use as "parsing text is much, much slower than HDF". +3. While "other binary formats" like Numpy arrays are quite good, they are not as widely supported as HDF, which is the "lingua franca or common interchange format". + +The author also reports that whereas "a certain small dataset" took 2 seconds to read as HDF, 1 minute to read as JSON, and 1 hour to write to database. + +You get the point :) + +* * * + +## A Keras example + +Now, let's take a look if we can create a simple [Convolutional Neural Network](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) which operates with the [MNIST dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#mnist-database-of-handwritten-digits), stored in HDF5 format. + +Fortunately, this dataset is readily available at [Kaggle for download](https://www.kaggle.com/benedictwilkinsai/mnist-hd5f), so make sure to create an account there and download the **train.hdf5** and **test.hdf5** files. + +### The differences: the imports & how to load the data + +Our HDF5 based model is not too different compared to any other Keras model. In fact, the only differences are present at the start - namely, an extra import as well as a different way of loading the data. That's what we'll highlight in this post primarily. If you wish to understand the ConvNet creation process in more detail, I suggest you also take a look at [this blog](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/). + +### The imports + +The imports first. 
The only thing that we will add to the imports we already copied from that other blog is the `import h5py` statement: + +``` +import h5py +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +``` + +This is what H5py does: + +> **HDF5 for Python** +> The h5py package is a Pythonic interface to the HDF5 binary data format. +> +> H5py (n.d.) + +We can thus use it to access the data, which we'll do now. + +### Loading the data + +Let's put the model configuration in your file next: + +``` +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 28, 28, 1 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 25 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 +``` + +Followed by loading and reshaping the input data into the correct [input shape](https://www.machinecurve.com/index.php/2020/04/05/how-to-find-the-value-for-keras-input_shape-input_dim/) (i.e. _length_ of the datasets times `(28, 28, 1)` as MNIST contains grayscale 28x28 pixels images). Here's the code for that: + +``` +# Load MNIST data +f = h5py.File('./train.hdf5', 'r') +input_train = f['image'][...] +label_train = f['label'][...] +f.close() +f = h5py.File('./test.hdf5', 'r') +input_test = f['image'][...] +label_test = f['label'][...] +f.close() + +# Reshape data +input_train = input_train.reshape((len(input_train), img_width, img_height, img_num_channels)) +input_test = input_test.reshape((len(input_test), img_width, img_height, img_num_channels)) +``` + +...interpreting it is actually pretty simple. We use `h5py` to load the two HDF5 files, one with the training data, the other with the testing data. + +From the HDF5 files, we retrieve the `image` and `label` datasets, where the `[...]` indicates that we retrieve every individual sample - which means 60.000 samples in the training case, for example. + +Don't forget to close the files once you've finished working with them, before starting the reshaping process. + +That's pretty much it with respect to loading data from HDF5! + +### Full model code + +We can now add the other code which creates, configures and trains the Keras model, which means that we end with this code as a whole: + +``` +import h5py +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 28, 28, 1 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 25 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Load MNIST data +f = h5py.File('./train.hdf5', 'r') +input_train = f['image'][...] +label_train = f['label'][...] +f.close() +f = h5py.File('./test.hdf5', 'r') +input_test = f['image'][...] +label_test = f['label'][...] 
+f.close() + +# Reshape data +input_train = input_train.reshape((len(input_train), img_width, img_height, img_num_channels)) +input_test = input_test.reshape((len(input_test), img_width, img_height, img_num_channels)) + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Display a model summary +model.summary() + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, label_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, label_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +### Let's run it + +Now, save this model - e.g. as `h5model.py` - and open a terminal. `cd` to the folder where your file is located and execute it with `python h5model.py`. + +Make sure that TensorFlow 2.x is installed, as well as `h5py`: + +- [Installing TensorFlow 2.x onto your system](https://www.tensorflow.org/install); +- [Installing H5py onto your system](http://docs.h5py.org/en/stable/build.html). + +Then, you should see the training process begin - as we are used to: + +``` +Model: "sequential" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +conv2d (Conv2D) (None, 26, 26, 32) 320 +_________________________________________________________________ +conv2d_1 (Conv2D) (None, 24, 24, 64) 18496 +_________________________________________________________________ +conv2d_2 (Conv2D) (None, 22, 22, 128) 73856 +_________________________________________________________________ +flatten (Flatten) (None, 61952) 0 +_________________________________________________________________ +dense (Dense) (None, 128) 7929984 +_________________________________________________________________ +dense_1 (Dense) (None, 10) 1290 +================================================================= +Total params: 8,023,946 +Trainable params: 8,023,946 +Non-trainable params: 0 +_________________________________________________________________ +Train on 48000 samples, validate on 12000 samples +Epoch 1/25 +2020-04-13 15:15:25.949751: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll +2020-04-13 15:15:26.217503: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll +2020-04-13 15:15:27.236616: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows +Relying on driver to perform ptx compilation. This message will be only logged once. +48000/48000 [========================= +``` + +We've done the job! 
😊 + +* * * + +## Summary + +In this blog post, we answered the question _how to use datasets represented in HDF5 files for training your Keras model?_ Despite the blog being relatively brief, I think that it helps understanding what HDF5 is, how we can use it in Python through h5py, and how we can subsequently prepare the HDF5-loaded data for training your Keras model. + +Hopefully, you've learnt something new today! If you did, I'd appreciate a comment - please feel free to leave one in the comments section below. Please do the same if you have any questions or other remarks. In any case, thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +## References + +Wikipedia. (2004, May 4). _Hierarchical data format_. Wikipedia, the free encyclopedia. Retrieved April 13, 2020, from [https://en.wikipedia.org/wiki/Hierarchical\_Data\_Format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) + +Alex I. (n.d.). _Hierarchical data format. What are the advantages compared to alternative formats?_ Data Science Stack Exchange. [https://datascience.stackexchange.com/a/293](https://datascience.stackexchange.com/a/293) + +BenedictWilkinsAI. (n.d.). _Mnist - Hdf5_. Kaggle: Your Machine Learning and Data Science Community. [https://www.kaggle.com/benedictwilkinsai/mnist-hd5f](https://www.kaggle.com/benedictwilkinsai/mnist-hd5f) + +H5py. (n.d.). _HDF5 for Python — h5py 2.10.0 documentation_. [https://docs.h5py.org/en/stable/index.html](https://docs.h5py.org/en/stable/index.html) diff --git a/how-to-use-hdf5matrix-with-keras.md b/how-to-use-hdf5matrix-with-keras.md new file mode 100644 index 0000000..722ffc7 --- /dev/null +++ b/how-to-use-hdf5matrix-with-keras.md @@ -0,0 +1,420 @@ +--- +title: "How to use HDF5Matrix with Keras?" +date: "2020-04-26" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "h5py" + - "hdf5" + - "hdf5matrix" + - "keras" + - "machine-learning" + - "neural-networks" + - "tensorflow" +--- + +In machine learning, when performing supervised learning, you'll have to load your dataset from somewhere - and then [feed it to the machine learning model](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process). Now, there are multiple ways for loading data. + +A CSV file is one example, as well as a text file. It works really well if you're looking for simplicity: loading a dataset from a text-based file is really easy with Python. + +The downside? Scalability. Reading text files is slow. And while this does not really matter when your dataset is small, it can become a true burden when you have dataset with millions and millions of rows. + +As we've seen, the HDF5 format - the Hierarchical Data Format - comes to the rescue. This format, which stores the data into a hierarchy of groups and datasets (hence the name, plus version 5), is faster to read, as [we've seen before](https://www.machinecurve.com/index.php/2020/04/13/how-to-use-h5py-and-keras-to-train-with-data-from-hdf5-files/). It's also easily integrable with Python: with the `h5py` library, we can load our data into memory, and subsequently feed it to the machine learning training process. + +Now, did you know that Keras already partially automates those steps? In fact, it does: the creators already provide a `util` that allows you to load a HDF5 based dataset easily, being the `HDF5Matrix`. Great! + +In today's blog post, we'll take a look at this util. 
Firstly, we'll take a brief look at the HDF5 data format, followed by inspecting the HDF5Matrix util. Subsequently, we will provide an example of how to use it with an actual Keras model - both in preparing the dataset (should you need to, as we will see) and training the model. + +Are you ready? + +Let's go! :) + +* * * + +\[toc\] + +* * * + +## The HDF5 file format + +Well, let's take a look at the HDF5 file format first - because we must know what we're using, before doing so, right? + +Here we go: + +> _Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data._ +> +> Wikipedia (2004) + +As we can read in [our other blog post on HDF5](https://www.machinecurve.com/index.php/2020/04/13/how-to-use-h5py-and-keras-to-train-with-data-from-hdf5-files/), it is characterized as follows: + +- It's a dataset that is designed for large datasets. Could be great for our ML projects! +- It consists of datasets and groups, where datasets are multidimensional arrays of a homogeneous type. Groups are container structures which can hold datasets and other groups. +- This way, we can group different sub datasets into one hierarchical structure, which we can transfer and interpret later. No more shuffling with CSV columns and delimiters, and so on. No - an efficient format indeed. + +Here's a video for those who wish to understand HDF5 in more detail: + +https://www.youtube.com/watch?v=q14F3WRwSck&feature=emb\_title + +* * * + +## The HDF5Matrix + +Time to study the HDF5Matrix util. Well, it's not too exciting - haha :) In fact, this is what it is: + +> Representation of HDF5 dataset which can be used instead of a Numpy array. +> +> Keras (n.d.) + +It's as simple as that. + +In the Keras API, it is represented in two ways: + +- In 'old' Keras i.e. `keras`, as `keras.utils.io_utils.HDF5Matrix` +- In 'new' Keras i.e. `tensorflow.keras`, as `tensorflow.keras.utils.HDF5Matrix`. + +Take this into account when specifying your imports ;-) + +Now, it also has a few options that can be configured by the machine learning engineer (Keras, n.d.): + +- **datapath**: string, path to a HDF5 file +- **dataset**: string, name of the HDF5 dataset in the file specified in datapath +- **start**: int, start of desired slice of the specified dataset +- **end**: int, end of desired slice of the specified dataset +- **normalizer**: function to be called on data when retrieved + +* * * + +## Training a Keras model with HDF5Matrix + +They pretty much speak for themselves, so let's move on to training a Keras model with HDF5Matrix. + +### Today's dataset + +Today's dataset will be the MNIST one, which we know pretty well by now - it's a numbers dataset of handwritten digits: + +[![](images/mnist-visualize.png)](https://www.machinecurve.com/wp-content/uploads/2019/06/mnist-visualize.png) + +Now, let’s take a look if we can create a simple [Convolutional Neural Network](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) which operates with the [MNIST dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#mnist-database-of-handwritten-digits), stored in HDF5 format. + +Fortunately, this dataset is readily available at [Kaggle for download](https://www.kaggle.com/benedictwilkinsai/mnist-hd5f), so make sure to create an account there and download the **train.hdf5** and **test.hdf5** files. 
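+
+Before we look at why the data needs adapting, it can help to peek inside the downloaded files first, so you know which datasets they contain and what shape they have. A minimal sketch with `h5py` (assuming the files sit in your working directory and contain the `image` and `label` datasets that we use throughout this post):
+
+```
+import h5py
+
+# List the datasets in the training file, with their shapes
+with h5py.File('./train.hdf5', 'r') as f:
+    for name in f.keys():
+        print(name, f[name].shape)
+
+# Expected output along these lines:
+# image (60000, 28, 28)
+# label (60000,)
+```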
+ +### Why we need to adapt our data - and how to do it + +Unfortunately - and this is the reason why we created that other blog post - the dataset cannot be used directly with a Keras model, for multiple reasons: + +1. The shape of the input datasets (i.e. the training inputs and the testing inputs) is wrong. When image data is single-channel (and the MNIST dataset is), the dataset is often delivered without the channel dimension (in our case, that would equal a shape of `(60000, 28, 28)` for the training set instead of the desired `(60000, 28, 28, 1)`). If we were to use the data directly, we would get an error like `ValueError: Error when checking input: expected conv2d_input to have 4 dimensions, but got array with shape (60000, 28, 28)`. +2. The dataset is ill-prepared for neural networks, as it's unscaled. That is, the grayscale data has values somewhere between `[0, 255]` - that's the nature of grayscale data. Now, the distance between 0 and 255 is relatively far - especially so if we could also rescale the data so that the distance becomes much smaller, i.e. `[0, 1]`. While the relationships in the data won't change, the representation of the data does, and it really helps the training process - as it reduces the odds of weight swings during optimization. +3. What's more, it could also be cast into `float32` format, which presumably speeds up the training process on GPUs. + +We will thus have to adapt our data. While the `HDF5Matrix` util provides the _normalizer_ function, it doesn't work when our data has the wrong shape - we still get that `ValueError`. + +That's why we created that other blog post about applying `h5py` directly first. + +#### Imports and configuration + +But today, we'll make sure to adapt the data so that we can run it with `HDF5Matrix` too. Let's take a look. Make sure that `h5py` is installed with `pip install h5py`. Then open up a code editor, create a file such as `hdf5matrix_prepare.py` and write some code: + +``` +import h5py +``` + +This one speaks for itself. We import the `h5py` library. + +``` +# Configuration +img_width, img_height, img_num_channels = 28, 28, 1 +``` + +This one does too. We set a few configuration options, being the image width, image height, and number of channels. As we know MNIST to be 28x28 px single-channel images, we set the values to 28, 28 and 1. + +#### Loading the MNIST data + +Then, we load the MNIST data: + +``` +# Load MNIST data +f = h5py.File('./train.hdf5', 'r') +input_train = f['image'][...] +label_train = f['label'][...] +f.close() +f = h5py.File('./test.hdf5', 'r') +input_test = f['image'][...] +label_test = f['label'][...] +f.close() +``` + +Here, we load the `image` and `label` datasets into memory for both the training and testing HDF5 files. The `[...]` part signals that we load the entire dataset into memory. If your dataset is too big, this might fail. You then might wish to rewrite the code so that you process the dataset in slices. + +#### Reshaping, casting and scaling + +Now that we have loaded the data, it's time to adapt our data to resolve the conflicts that we discussed earlier. First, we'll reshape the data: + +``` +# Reshape data +input_train = input_train.reshape((len(input_train), img_width, img_height, img_num_channels)) +input_test = input_test.reshape((len(input_test), img_width, img_height, img_num_channels)) +``` + +The code speaks pretty much for itself. We set the shape to be equal to the size of the particular array, and the values for the image that we configured before. 
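+
+If you want to be sure that the reshape worked before moving on, a quick check of the array shapes doesn't hurt. A small sketch, assuming you kept the variable names from above:
+
+```
+# The channel dimension should now be present
+print(input_train.shape)  # expected: (60000, 28, 28, 1)
+print(input_test.shape)   # expected: (10000, 28, 28, 1)
+```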
+ +Casting and scaling is also pretty straight-forward: + +``` +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 +``` + +#### Saving the adapted data + +Then, we can save the data into new files - being `train_reshaped.hdf5` and `test_reshaped.hdf5`: + +``` +# Save reshaped training data +f = h5py.File('./train_reshaped.hdf5', 'w') +dataset_input = f.create_dataset('image', (len(input_train), img_width, img_height, img_num_channels)) +dataset_label = f.create_dataset('label', (len(input_train),)) +dataset_input[...] = input_train +dataset_label[...] = label_train +f.close() + +# Save reshaped testing data +f = h5py.File('./test_reshaped.hdf5', 'w') +dataset_input = f.create_dataset('image', (len(input_test), img_width, img_height, img_num_channels)) +dataset_label = f.create_dataset('label', (len(input_test),)) +dataset_input[...] = input_test +dataset_label[...] = label_test +f.close() +``` + +#### Full preprocessing code + +If you wish to obtain the full code for preprocessing at once - of course, that's possible. Here you go :) + +``` +import h5py + +# Configuration +img_width, img_height, img_num_channels = 28, 28, 1 + +# Load MNIST data +f = h5py.File('./train.hdf5', 'r') +input_train = f['image'][...] +label_train = f['label'][...] +f.close() +f = h5py.File('./test.hdf5', 'r') +input_test = f['image'][...] +label_test = f['label'][...] +f.close() + +# Reshape data +input_train = input_train.reshape((len(input_train), img_width, img_height, img_num_channels)) +input_test = input_test.reshape((len(input_test), img_width, img_height, img_num_channels)) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 + +# Save reshaped training data +f = h5py.File('./train_reshaped.hdf5', 'w') +dataset_input = f.create_dataset('image', (len(input_train), img_width, img_height, img_num_channels)) +dataset_label = f.create_dataset('label', (len(input_train),)) +dataset_input[...] = input_train +dataset_label[...] = label_train +f.close() + +# Save reshaped testing data +f = h5py.File('./test_reshaped.hdf5', 'w') +dataset_input = f.create_dataset('image', (len(input_test), img_width, img_height, img_num_channels)) +dataset_label = f.create_dataset('label', (len(input_test),)) +dataset_input[...] = input_test +dataset_label[...] = label_test +f.close() +``` + +### Training the model + +At this point, we have HDF5 files that we can actually use to train a [Keras based ConvNet](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/)! + +Let's take a look. + +Small note: if you wish to understand how to create a ConvNet with Keras `Conv2D` layers, I'd advise you click the link above - as it will take you through all the steps. There's no point repeating them here. Below, we will primarily focus on the `HDF5Matrix` linkage to the Keras model, in order not to confuse ourselves. 
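+
+One small aside before we start: if your data had only needed scaling - and not reshaping - you would not have needed the preprocessing step at all, because the `normalizer` argument that we saw earlier is applied when data is retrieved. A sketch of that hypothetical alternative:
+
+```
+from tensorflow.keras.utils import HDF5Matrix
+
+# Scale grayscale values to [0, 1] on the fly, without writing a new HDF5 file.
+# Note: this only works if the stored shape is already what the model expects.
+input_train = HDF5Matrix('./train.hdf5', 'image', normalizer=lambda data: data / 255.0)
+```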
+ +#### Imports and model configuration + +First, we specify the imports - make sure that `tensorflow` and especially TensorFlow 2.x is installed on your system: + +``` +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from tensorflow.keras.utils import HDF5Matrix +``` + +Then, we set the model configuration: + +``` +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 28, 28, 1 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 25 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 +``` + +#### Loading data with HDF5Matrix + +Now, we can show how the `HDF5Matrix` works: + +``` +# Load MNIST data +input_train = HDF5Matrix('./train_reshaped.hdf5', 'image') +input_test = HDF5Matrix('./test_reshaped.hdf5', 'image') +label_train = HDF5Matrix('./train_reshaped.hdf5', 'label') +label_test = HDF5Matrix('./test_reshaped.hdf5', 'label') +``` + +Yep. It's that simple. We assign the output of calling `HDF5Matrix` to arrays, and specify the specific HDF5 dataset that we wish to load :) + +#### Model specification, compilation and training + +Subsequently, we perform the common steps of model specification, compilation and training a.k.a. fitting the data to the compiled model: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(5, 5), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(5, 5), activation='relu')) +model.add(Conv2D(128, kernel_size=(5, 5), activation='relu')) +model.add(Flatten()) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Display a model summary +model.summary() + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, label_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, label_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +Et voila, we have a Keras model that can be trained with `HDF5Matrix`! :) Running the model in your ML environment, should indeed yield a training process that starts: + +``` +Epoch 1/25 +2020-04-26 19:44:12.481645: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll +2020-04-26 19:44:12.785200: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll +2020-04-26 19:44:13.975734: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows +Relying on driver to perform ptx compilation. This message will be only logged once. 
+48000/48000 [==============================] - 13s 274us/sample - loss: 0.1159 - accuracy: 0.9645 - val_loss: 0.0485 - val_accuracy: 0.9854 +Epoch 2/25 +48000/48000 [======================> +``` + +#### Full model code + +Once again, should you wish to obtain the full model code in order to play straight away - here you go + +``` +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from tensorflow.keras.utils import HDF5Matrix + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 28, 28, 1 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 25 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Load MNIST data +input_train = HDF5Matrix('./train_reshaped.hdf5', 'image') +input_test = HDF5Matrix('./test_reshaped.hdf5', 'image') +label_train = HDF5Matrix('./train_reshaped.hdf5', 'label') +label_test = HDF5Matrix('./test_reshaped.hdf5', 'label') + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(5, 5), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(5, 5), activation='relu')) +model.add(Conv2D(128, kernel_size=(5, 5), activation='relu')) +model.add(Flatten()) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Display a model summary +model.summary() + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, label_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, label_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +* * * + +## Summary + +In this blog post, we looked at HDF5 data and the Keras `HDF5Matrix` for loading your data from HDF5 file directly. Firstly, we discussed the HDF5 format itself - which stands for Hierarchical Data Format, version 5, and is composed of groups of groups and groups of datasets that together form a hierarchical data structure. It is especially useful for large datasets, as they can be easily transferred (it's as simple as transferring one file) and loaded quite quickly (it's one of the faster formats, especially compared to text based files). + +Subsequently, we looked at the HDF5Matrix implementation within the Keras API. It can be used to load datasets from HDF5 format into memory directly, after which you can use them for training your Keras deep learning model. + +Finally, we provided an example implementation with TensorFlow 2.x based Keras and Python - to show you how you can do two things: + +- Adapt the dataset a priori to using it with `h5py`, as some raw datasets must be reshaped, scaled and cast; +- Training a Keras neural network with the adapted dataset. + +That's it for today! :) I hope you've learnt something from this blog post. If you did, please feel free to leave a message in the comments section below 💬👇. Please do the same if you have any questions or remarks - I'll happily answer you. Thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +## References + +Keras. (n.d.). _I/O Utils_. Home - Keras Documentation. 
[https://keras.io/io\_utils/](https://keras.io/io_utils/) + +Wikipedia. (2004, May 4). _Hierarchical data format_. Wikipedia, the free encyclopedia. Retrieved April 13, 2020, from [https://en.wikipedia.org/wiki/Hierarchical\_Data\_Format](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) diff --git a/how-to-use-hinge-squared-hinge-loss-with-keras.md b/how-to-use-hinge-squared-hinge-loss-with-keras.md new file mode 100644 index 0000000..7e2874a --- /dev/null +++ b/how-to-use-hinge-squared-hinge-loss-with-keras.md @@ -0,0 +1,505 @@ +--- +title: "How to use hinge & squared hinge loss with TensorFlow 2 and Keras?" +date: "2019-10-15" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "hinge" + - "hinge-loss" + - "keras" + - "loss-function" + - "machine-learning" + - "neural-networks" + - "squared-hinge-loss" + - "training-process" +--- + +In order to discover the ins and outs of the Keras deep learning framework, I'm writing blog posts about [commonly used loss functions](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/), subsequently implementing them with Keras to practice and to see how they behave. + +Today, we'll cover two closely related loss functions that can be used in neural networks - and hence in TensorFlow 2 based Keras - that behave similar to how a [Support Vector Machine](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) generates a decision boundary for classification: the **hinge loss** and **squared hinge loss**. + +In this blog, you'll first find a brief introduction to the two loss functions, in order to ensure that you intuitively understand the maths before we move on to implementing one. + +Next, we introduce today's dataset, which we ourselves generate. Subsequently, we implement both hinge loss functions with TensorFlow 2 based Keras, and discuss the implementation so that you understand what happens. Before wrapping up, we'll also show model performance. + +After reading this tutorial, you will understand... + +- **How hinge loss and squared hinge loss work.** +- **What the differences are between the two.** +- **How to implement hinge loss and squared hinge loss with TensorFlow 2 based Keras.** + +Let's go! 😎 + +_Note that the full code for the models we create in this blog post is also available through my [Keras Loss Functions repository](https://github.com/christianversloot/keras-loss-functions) on GitHub._ + +* * * + +**Update 08/Feb/2021:** ensure that article is up to date. Utilizes TensorFlow 2 APIs now to make it compatible with current versions of TensorFlow. + +* * * + +\[toc\] + +* * * + +## Example code: (squared) hinge loss with TF 2 / Keras + +This example code shows you how to use hinge loss and squared hinge loss easily. If you want to understand how it works, what the differences are and how to apply it to a full Keras model more deeply, make sure to read the rest of this tutorial as well! + +``` +loss_function_used = 'hinge' # or use 'squared_hinge' +model.compile(loss=loss_function_used, optimizer=tensorflow.keras.optimizers.Adam(lr=0.03), metrics=['accuracy']) +``` + +* * * + +## Brief recap + +### What is hinge loss? + +In our blog post on loss functions, we defined the **hinge loss** as follows (Wikipedia, 2011): + +![](images/image-1.png) + +Maths can look very frightning, but the explanation of the above formula is actually really easy. 
+
+When you're training a machine learning model, you effectively feed forward your data, generating predictions, which you then compare with the actual targets to generate some cost value - that's the loss value. In the case of the hinge loss formula, you multiply the prediction (\[latex\]y\[/latex\]) with the actual target for that prediction (\[latex\]t\[/latex\]), subtract this product from 1, and subsequently compute the maximum value between 0 and the result of that computation.
+
+For every sample, our target variable \[latex\]t\[/latex\] is either +1 or -1.
+
+This means that:
+
+- When \[latex\]t = y\[/latex\], e.g. \[latex\]t = y = 1\[/latex\], loss is \[latex\]max(0, 1 - 1) = max(0, 0) = 0\[/latex\] - or perfect.
+- When \[latex\]t\[/latex\] is very different from \[latex\]y\[/latex\], say \[latex\]t = 1\[/latex\] while \[latex\]y = -1\[/latex\], loss is \[latex\]max(0, 2) = 2\[/latex\].
+- When \[latex\]y\[/latex\] is not exactly correct, but only slightly off (e.g. \[latex\]t = 1\[/latex\] while \[latex\]y = 0.9\[/latex\]), loss is \[latex\]max(0, 0.1) = 0.1\[/latex\].
+
+This looks as follows if the target is \[latex\]+1\[/latex\] - for all predictions >= 1, loss is zero (the prediction is correct or even "more than correct"), whereas loss increases as the prediction moves away from the target.
+
+[![](images/hinge_loss-1024x507.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/10/hinge_loss.jpeg)
+
+What effectively happens is that hinge loss will attempt to maximize the margin of the decision boundary between the two groups that must be discriminated in your machine learning problem. In that way, it looks somewhat like how [Support Vector Machines](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) work, but it's also different (e.g., with hinge loss in Keras there is no such thing as support vectors).
+
+### What is squared hinge loss?
+
+Suppose that you need to draw a very fine decision boundary. In that case, you wish to punish larger errors more significantly than smaller errors. **Squared hinge loss** may then be what you are looking for, especially when you have already considered the hinge loss function for your machine learning problem.
+
+![](images/hinge_squared-1024x511.png)
+
+Squared hinge loss is nothing else than the square of the output of the hinge's \[latex\]max(...)\[/latex\] function. It generates a loss landscape as illustrated above, compared to regular hinge loss.
+
+As you can see, larger errors are punished more significantly than with traditional hinge, whereas smaller errors are punished slightly less severely.
+
+Additionally, especially around \[latex\]target = +1.0\[/latex\] in the situation above (if your target were \[latex\]-1.0\[/latex\], it would apply there too), the traditional hinge loss function is relatively non-smooth, just like the ReLU activation function around \[latex\]x = 0\[/latex\]. Although it is very unlikely, this might impact how your model optimizes, since the loss landscape is not smooth. With squared hinge, the function is smooth - but it is more sensitive to larger errors (outliers).
+
+Therefore, choose carefully! 😉
+
+* * *
+
+## Start implementing: today's dataset
+
+Now that we know what hinge loss and squared hinge loss are, we can start our actual implementation. We'll first have to implement & discuss our dataset in order to be able to create a model.
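+
+If you first want to get a numerical feel for the difference between the two loss functions, you can compute both for a few example predictions. A minimal sketch using the built-in `Hinge` and `SquaredHinge` loss classes (the model below will simply use the string identifiers `'hinge'` / `'squared_hinge'`, which map to these):
+
+```
+import tensorflow as tf
+
+y_true = [[1.0], [1.0], [1.0]]
+y_pred = [[0.9], [0.0], [-1.0]]  # slightly off, far off, completely wrong
+
+hinge = tf.keras.losses.Hinge(reduction=tf.keras.losses.Reduction.NONE)
+squared_hinge = tf.keras.losses.SquaredHinge(reduction=tf.keras.losses.Reduction.NONE)
+
+print(hinge(y_true, y_pred).numpy())          # approximately [0.1, 1.0, 2.0]
+print(squared_hinge(y_true, y_pred).numpy())  # approximately [0.01, 1.0, 4.0]
+```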
+ +Before you start, it's a good idea to create a file (e.g. `hinge-loss.py`) in some folder on your machine. Then, you can start off by adding the necessary software dependencies: + +``` +''' + Keras model discussing Hinge loss. +''' +import tensorflow.keras +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_circles +from mlxtend.plotting import plot_decision_regions +``` + +First, and foremost, you need the [Keras deep learning framework](https://www.machinecurve.com/index.php/mastering-keras/), which allows you to create neural network architectures relatively easily. From Keras, you'll import the Sequential API and the Dense layer (representing densely-connected layers, or the [MLP-like layers](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/) you always see when people use neural networks in their presentations). + +You'll subsequently import the PyPlot API from Matplotlib for visualization, Numpy for number processing, `make_circles` from Scikit-learn to generate today's dataset and Mlxtend for [visualizing the decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) of your model. + +### What you'll need to run it + +Hence, this is what you need to run today's code: + +- Python, preferably 3.8+ +- TensorFlow 2, preferably [2.4.0+](https://www.machinecurve.com/index.php/2020/11/05/saying-hello-to-tensorflow-2-4-0/) +- Matplotlib +- Numpy +- Scikit-learn +- Mlxtend + +...preferably in an Anaconda environment so that your packages run isolated from other Python ones. + +### Generate the data + +As indicated, we can now generate the data that we use to demonstrate how hinge loss and squared hinge loss works. We generate data today because it allows us to entirely focus on the loss functions rather than cleaning the data. Of course, you can also apply the insights from this blog posts to other, real datasets. + +We first specify some configuration options: + +``` +# Configuration options +num_samples_total = 1000 +training_split = 250 +``` + +Put very simply, these specify _how many samples are generated in total_ and how many are _split off the training set_ to form the testing set. With this configuration, we generate 1000 samples, of which 750 are training data and 250 are testing data. You'll later see that the 750 training samples are subsequently split into true training data and validation data. + +Next, we actually generate the data: + +``` +# Generate data +X, targets = make_circles(n_samples = num_samples_total, factor=0.1) +targets[np.where(targets == 0)] = -1 +X_training = X[training_split:, :] +X_testing = X[:training_split, :] +Targets_training = targets[training_split:] +Targets_testing = targets[:training_split] +``` + +We first call `make_circles` to generate `num_samples_total` (1000 as configured) for our machine learning problem. `make_circles` does what it suggests: it generates two circles, a larger one and a smaller one, which are separable - and hence perfect for machine learning blog posts 😄 The `factor` parameter, which should be \[latex\]0 < factor < 1\[/latex\], determines how close the circles are to each other. The lower the value, the farther the circles are positioned from each other. + +We next convert all zero targets into -1. Why? 
Very simple: `make_circles` generates targets that are either 0 or 1, which is very common in those scenarios. Zero or one would in plain English be 'the larger circle' or 'the smaller circle', but since targets are numeric in Keras they are 0 and 1. + +Hinge loss doesn't work with zeroes and ones. Instead, targets must be either +1 or -1. Hence, we'll have to convert all zero targets into -1 in order to support Hinge loss. + +Finally, we split the data into training and testing data, for both the feature vectors (the \[latex\]X\[/latex\] variables) and the targets. + +### Visualizing the data + +We can now also visualize the data, to get a feel for what we just did: + +``` +# Generate scatter plot for training data +plt.scatter(X_training[:,0], X_training[:,1]) +plt.title('Nonlinear data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() +``` + +This looks as follows: + +[![](images/hinge_nonlienar.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/hinge_nonlienar.png) + +As you can see, we have generated two circles that are composed of individual data points: a large one and a smaller one. These are perfectly separable, although not linearly. + +(With traditional SVMs one would have to perform the [kernel trick](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/#what-if-data-is-not-linearly-separable-kernels) in order to make data linearly separable in kernel space. With neural networks, this is less of a problem, since the layers [activate nonlinearly](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/#what-is-an-activation-function).) + +* * * + +## Implementing hinge & squared hinge in TensorFlow 2 / Keras + +Now that we have a feel for the dataset, we can actually implement a `tensorflow.keras` model that makes use of hinge loss and, in another run, squared hinge loss, in order to show you how it works. + +### Model configuration + +As usual, we first define some variables for model configuration by adding this to our code: + +``` +# Set the input shape +feature_vector_shape = len(X_training[0]) +input_shape = (feature_vector_shape,) +loss_function_used = 'hinge' +print(f'Feature shape: {input_shape}') +``` + +We set the shape of our feature vector to the _length_ of the _first sample from our training set_. If this sample is of length 3, this means that there are three features in the feature vector. Since the array is only one-dimensional, the shape would be a one-dimensional vector of length 3. Since our training set contains X and Y values for the data points, our `input_shape` is (2,). + +Obviously, we use `hinge` as our loss function. Using squared hinge loss is possible too by simply changing `hinge` into `squared_hinge`. That's up to you! + +### Model architecture + +Next, we define the architecture for our model: + +``` +# Create the model +model = Sequential() +model.add(Dense(4, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(2, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(1, activation='tanh')) +``` + +We use the Keras Sequential API, which allows us to stack multiple layers easily. Contrary to other blog posts, e.g. 
ones where we created an [MLP for classification](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/) or [regression](https://www.machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/), I decided to add three layers instead of two. This is because the dataset is slightly more complex: the decision boundary cannot be represented as a line, but must be a circle separating the smaller one from the larger one. Hence, I thought, a bit more capacity for processing the data would be useful.
+
+The layers activate with [Rectified Linear Unit](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/#rectified-linear-unit-relu) or ReLU, except for the last one, which activates by means of [Tanh](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/#tangens-hyperbolicus-tanh). I chose ReLU because it is the de facto standard activation function and requires the fewest computational resources without compromising predictive performance. I chose Tanh because of the way the predictions must be generated: they should end up in the range \[-1, +1\], given the way Hinge loss works (remember why we had to convert our generated targets from zero to minus one?).
+
+Tanh does precisely this: it squashes its input into the open interval (-1, +1) - the values -1 and +1 themselves are never reached exactly, but this doesn't matter much in practice. It looks like this:
+
+[![](images/tanh-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/05/tanh.png)
+
+The kernels of the ReLU activating layers are initialized with He uniform init instead of Glorot init, because this approach [works better](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/) mathematically for ReLU-like activations.
+
+Information is eventually converted into one prediction: the target. Hence, the final layer has _one_ neuron. The intermediate ones have fewer neurons, in order to stimulate the model to generate more abstract representations of the information during the feedforward procedure.
+
+### Hyperparameter configuration & starting model training
+
+Now that we know what architecture we'll use, we can perform hyperparameter configuration. We can also actually start training our model.
+
+However, first, the hyperparameters:
+
+```
+# Configure the model and start training
+model.compile(loss=loss_function_used, optimizer=tensorflow.keras.optimizers.Adam(lr=0.03), metrics=['accuracy'])
+```
+
+The loss function used is, indeed, `hinge` loss. We use Adam for optimization and manually configure the learning rate to 0.03, since initial experiments showed that the default learning rate is often insufficient to learn the decision boundary. In your case, you may have to experiment with the learning rate as well; it can be configured in the same place. As an additional metric, we included accuracy, since it is easier for humans to interpret.
+
+Now the actual training process:
+
+```
+history = model.fit(X_training, Targets_training, epochs=30, batch_size=5, verbose=1, validation_split=0.2)
+```
+
+We _fit_ the training data (`X_training` and `Targets_training`) to the model architecture and allow it to optimize for 30 epochs, or full passes over the training data.
Each batch that is fed forward through the network during an epoch contains five samples, which gives reasonably accurate gradient estimates without the extra training time that comes with even smaller batches. Verbosity mode is set to 1, so that everything is printed on screen during the training process, which helps your understanding. As highlighted before, we split the training data into _true training data_ and _validation data_: 20% of the training data is used for validation.
+
+Hence, from the 1000 samples that were generated, 250 are used for testing, 600 are used for training and 150 are used for validation (600 + 150 + 250 = 1000).
+
+### Testing & visualizing model performance
+
+We store the results of the fitting (training) procedure into a `history` object, which allows us to actually [visualize model performance across epochs](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/). But first, we add code for testing the model for its generalization power:
+
+```
+# Test the model after training
+test_results = model.evaluate(X_testing, Targets_testing, verbose=1)
+print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%')
+```
+
+Then a [plot of the decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) based on the testing data:
+
+```
+# Plot decision boundary
+plot_decision_regions(X_testing, Targets_testing, clf=model, legend=2)
+plt.show()
+```
+
+And eventually, the [visualization for the training process](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/):
+
+```
+# Visualize training process
+plt.plot(history.history['loss'], label='Hinge loss (training data)')
+plt.plot(history.history['val_loss'], label='Hinge loss (validation data)')
+plt.title('Hinge loss for circles')
+plt.ylabel('Hinge loss value')
+plt.yscale('log')
+plt.xlabel('No. epoch')
+plt.legend(loc="upper left")
+plt.show()
+```
+
+(A logarithmic scale is used because loss drops significantly during the first epoch, which would distort the plot if scaled linearly.)
+
+* * *
+
+## The results: model performance
+
+Now, if you followed the process until now, you have a file called `hinge-loss.py`. Open up a terminal which can access your setup (e.g. Anaconda Prompt or a regular terminal), `cd` to the folder where your `.py` is stored and execute `python hinge-loss.py`. The training process should then start.
+
+These are the results.
+
+### Hinge loss
+
+For hinge loss, we found that validation accuracy went to 100% almost immediately. This is not surprising: the dataset is quite well separable (the distance between circles is large), the model was given enough capacity to interpret relatively complex data, and a relatively aggressive learning rate was set.
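+
+To see where the very small loss values in the logs below come from, it can help to compute the loss by hand for a few example predictions. The snippet below is purely illustrative - it is not part of `hinge-loss.py` and the numbers are made up - but it mirrors the formula behind Keras' `hinge` and `squared_hinge` losses, \[latex\]max(0, 1 - y \cdot \hat{y})\[/latex\], using plain Numpy:
+
+```
+import numpy as np
+
+# Targets are -1 or +1; predictions come from the Tanh output, so they lie in (-1, +1)
+y_true = np.array([1.0, 1.0, -1.0, -1.0])
+y_pred = np.array([0.9, 0.3, -0.8, 0.2])  # the last prediction sits on the wrong side
+
+# Hinge loss: max(0, 1 - y_true * y_pred), averaged over the samples
+hinge = np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))
+
+# Squared hinge loss: the same term, squared before averaging
+squared_hinge = np.mean(np.maximum(0.0, 1.0 - y_true * y_pred) ** 2)
+
+print(f'Hinge loss: {hinge}')                  # 0.55
+print(f'Squared hinge loss: {squared_hinge}')  # 0.495
+```
+
+Samples that are predicted on the correct side of the boundary with enough margin contribute (almost) nothing, while wrong or uncertain predictions are punished - quadratically so for squared hinge. Because Tanh never outputs exactly -1 or +1, the loss approaches zero but never quite reaches it, which is exactly what the logs show.
+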
This is the visualization of the training process using a **logarithmic scale**: + +[![](images/logarithmic_performance-1024x537.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/logarithmic_performance.png) + +The decision boundary: + +[![](images/hinge_db.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/hinge_db.png) + +Or in plain text: + +``` +Epoch 1/30 +600/600 [==============================] - 1s 1ms/step - loss: 0.4317 - accuracy: 0.6083 - val_loss: 0.0584 - val_accuracy: 1.0000 +Epoch 2/30 +600/600 [==============================] - 0s 682us/step - loss: 0.0281 - accuracy: 1.0000 - val_loss: 0.0124 - val_accuracy: 1.0000 +Epoch 3/30 +600/600 [==============================] - 0s 688us/step - loss: 0.0097 - accuracy: 1.0000 - val_loss: 0.0062 - val_accuracy: 1.0000 +Epoch 4/30 +600/600 [==============================] - 0s 693us/step - loss: 0.0054 - accuracy: 1.0000 - val_loss: 0.0038 - val_accuracy: 1.0000 +Epoch 5/30 +600/600 [==============================] - 0s 707us/step - loss: 0.0036 - accuracy: 1.0000 - val_loss: 0.0027 - val_accuracy: 1.0000 +Epoch 6/30 +600/600 [==============================] - 0s 692us/step - loss: 0.0026 - accuracy: 1.0000 - val_loss: 0.0020 - val_accuracy: 1.0000 +Epoch 7/30 +600/600 [==============================] - 0s 747us/step - loss: 0.0019 - accuracy: 1.0000 - val_loss: 0.0015 - val_accuracy: 1.0000 +Epoch 8/30 +600/600 [==============================] - 0s 717us/step - loss: 0.0015 - accuracy: 1.0000 - val_loss: 0.0012 - val_accuracy: 1.0000 +Epoch 9/30 +600/600 [==============================] - 0s 735us/step - loss: 0.0012 - accuracy: 1.0000 - val_loss: 0.0010 - val_accuracy: 1.0000 +Epoch 10/30 +600/600 [==============================] - 0s 737us/step - loss: 0.0010 - accuracy: 1.0000 - val_loss: 8.4231e-04 - val_accuracy: 1.0000 +Epoch 11/30 +600/600 [==============================] - 0s 720us/step - loss: 8.6515e-04 - accuracy: 1.0000 - val_loss: 7.1493e-04 - val_accuracy: 1.0000 +Epoch 12/30 +600/600 [==============================] - 0s 786us/step - loss: 7.3818e-04 - accuracy: 1.0000 - val_loss: 6.1438e-04 - val_accuracy: 1.0000 +Epoch 13/30 +600/600 [==============================] - 0s 732us/step - loss: 6.3710e-04 - accuracy: 1.0000 - val_loss: 5.3248e-04 - val_accuracy: 1.0000 +Epoch 14/30 +600/600 [==============================] - 0s 703us/step - loss: 5.5483e-04 - accuracy: 1.0000 - val_loss: 4.6540e-04 - val_accuracy: 1.0000 +Epoch 15/30 +600/600 [==============================] - 0s 728us/step - loss: 4.8701e-04 - accuracy: 1.0000 - val_loss: 4.1065e-04 - val_accuracy: 1.0000 +Epoch 16/30 +600/600 [==============================] - 0s 732us/step - loss: 4.3043e-04 - accuracy: 1.0000 - val_loss: 3.6310e-04 - val_accuracy: 1.0000 +Epoch 17/30 +600/600 [==============================] - 0s 733us/step - loss: 3.8266e-04 - accuracy: 1.0000 - val_loss: 3.2392e-04 - val_accuracy: 1.0000 +Epoch 18/30 +600/600 [==============================] - 0s 782us/step - loss: 3.4199e-04 - accuracy: 1.0000 - val_loss: 2.9011e-04 - val_accuracy: 1.0000 +Epoch 19/30 +600/600 [==============================] - 0s 755us/step - loss: 3.0694e-04 - accuracy: 1.0000 - val_loss: 2.6136e-04 - val_accuracy: 1.0000 +Epoch 20/30 +600/600 [==============================] - 0s 768us/step - loss: 2.7671e-04 - accuracy: 1.0000 - val_loss: 2.3608e-04 - val_accuracy: 1.0000 +Epoch 21/30 +600/600 [==============================] - 0s 778us/step - loss: 2.5032e-04 - accuracy: 1.0000 - val_loss: 2.1384e-04 - val_accuracy: 1.0000 +Epoch 22/30 
+600/600 [==============================] - 0s 725us/step - loss: 2.2715e-04 - accuracy: 1.0000 - val_loss: 1.9442e-04 - val_accuracy: 1.0000 +Epoch 23/30 +600/600 [==============================] - 0s 728us/step - loss: 2.0676e-04 - accuracy: 1.0000 - val_loss: 1.7737e-04 - val_accuracy: 1.0000 +Epoch 24/30 +600/600 [==============================] - 0s 680us/step - loss: 1.8870e-04 - accuracy: 1.0000 - val_loss: 1.6208e-04 - val_accuracy: 1.0000 +Epoch 25/30 +600/600 [==============================] - 0s 738us/step - loss: 1.7264e-04 - accuracy: 1.0000 - val_loss: 1.4832e-04 - val_accuracy: 1.0000 +Epoch 26/30 +600/600 [==============================] - 0s 702us/step - loss: 1.5826e-04 - accuracy: 1.0000 - val_loss: 1.3628e-04 - val_accuracy: 1.0000 +Epoch 27/30 +600/600 [==============================] - 0s 802us/step - loss: 1.4534e-04 - accuracy: 1.0000 - val_loss: 1.2523e-04 - val_accuracy: 1.0000 +Epoch 28/30 +600/600 [==============================] - 0s 738us/step - loss: 1.3374e-04 - accuracy: 1.0000 - val_loss: 1.1538e-04 - val_accuracy: 1.0000 +Epoch 29/30 +600/600 [==============================] - 0s 762us/step - loss: 1.2326e-04 - accuracy: 1.0000 - val_loss: 1.0645e-04 - val_accuracy: 1.0000 +Epoch 30/30 +600/600 [==============================] - 0s 742us/step - loss: 1.1379e-04 - accuracy: 1.0000 - val_loss: 9.8244e-05 - val_accuracy: 1.0000 +250/250 [==============================] - 0s 52us/step +Test results - Loss: 0.0001128034592838958 - Accuracy: 100.0% +``` + +We can see that validation loss is still decreasing together with training loss, so the model is not overfitting yet. + +Reason why? Simple. My thesis is that this occurs because the data, both in the training and validation set, is perfectly separable. The decision boundary is crystal clear. 
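+
+One more practical note before we move on: the trained model outputs continuous values in (-1, +1), not hard class labels. If you need crisp -1/+1 predictions, you can threshold the output at zero. This is a minimal sketch rather than part of the original script, and it assumes the trained `model` and the `X_testing` array from the code above:
+
+```
+# Generate raw predictions: values in (-1, +1) because of the Tanh output layer
+raw_predictions = model.predict(X_testing)
+
+# Threshold at zero to obtain crisp -1 / +1 class labels
+crisp_predictions = np.where(raw_predictions >= 0, 1, -1)
+
+print(raw_predictions[:5].flatten())
+print(crisp_predictions[:5].flatten())
+```
+
+The thresholded labels are in the same -1/+1 format as the targets, so you can compare them directly.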
+ +### Squared hinge loss + +By changing `loss_function_used` into `squared_hinge` we can now show you results for squared hinge: + +``` +loss_function_used = 'squared_hinge' +``` + +Visually, it looks as follows: + +- ![](images/sqh-generated.png) + +- ![](images/sqh-history-1024x537.png) + +- ![](images/sqh-db.png) + + +And once again plain text: + +``` +Epoch 1/30 +600/600 [==============================] - 1s 1ms/step - loss: 0.2361 - accuracy: 0.7117 - val_loss: 0.0158 - val_accuracy: 1.0000 +Epoch 2/30 +600/600 [==============================] - 0s 718us/step - loss: 0.0087 - accuracy: 1.0000 - val_loss: 0.0050 - val_accuracy: 1.0000 +Epoch 3/30 +600/600 [==============================] - 0s 727us/step - loss: 0.0036 - accuracy: 1.0000 - val_loss: 0.0026 - val_accuracy: 1.0000 +Epoch 4/30 +600/600 [==============================] - 0s 723us/step - loss: 0.0020 - accuracy: 1.0000 - val_loss: 0.0016 - val_accuracy: 1.0000 +Epoch 5/30 +600/600 [==============================] - 0s 723us/step - loss: 0.0014 - accuracy: 1.0000 - val_loss: 0.0011 - val_accuracy: 1.0000 +Epoch 6/30 +600/600 [==============================] - 0s 713us/step - loss: 9.7200e-04 - accuracy: 1.0000 - val_loss: 8.3221e-04 - val_accuracy: 1.0000 +Epoch 7/30 +600/600 [==============================] - 0s 697us/step - loss: 7.3653e-04 - accuracy: 1.0000 - val_loss: 6.4083e-04 - val_accuracy: 1.0000 +Epoch 8/30 +600/600 [==============================] - 0s 688us/step - loss: 5.7907e-04 - accuracy: 1.0000 - val_loss: 5.1182e-04 - val_accuracy: 1.0000 +Epoch 9/30 +600/600 [==============================] - 0s 712us/step - loss: 4.6838e-04 - accuracy: 1.0000 - val_loss: 4.1928e-04 - val_accuracy: 1.0000 +Epoch 10/30 +600/600 [==============================] - 0s 698us/step - loss: 3.8692e-04 - accuracy: 1.0000 - val_loss: 3.4947e-04 - val_accuracy: 1.0000 +Epoch 11/30 +600/600 [==============================] - 0s 723us/step - loss: 3.2525e-04 - accuracy: 1.0000 - val_loss: 2.9533e-04 - val_accuracy: 1.0000 +Epoch 12/30 +600/600 [==============================] - 0s 735us/step - loss: 2.7692e-04 - accuracy: 1.0000 - val_loss: 2.5270e-04 - val_accuracy: 1.0000 +Epoch 13/30 +600/600 [==============================] - 0s 710us/step - loss: 2.3846e-04 - accuracy: 1.0000 - val_loss: 2.1917e-04 - val_accuracy: 1.0000 +Epoch 14/30 +600/600 [==============================] - 0s 773us/step - loss: 2.0745e-04 - accuracy: 1.0000 - val_loss: 1.9093e-04 - val_accuracy: 1.0000 +Epoch 15/30 +600/600 [==============================] - 0s 718us/step - loss: 1.8180e-04 - accuracy: 1.0000 - val_loss: 1.6780e-04 - val_accuracy: 1.0000 +Epoch 16/30 +600/600 [==============================] - 0s 730us/step - loss: 1.6039e-04 - accuracy: 1.0000 - val_loss: 1.4876e-04 - val_accuracy: 1.0000 +Epoch 17/30 +600/600 [==============================] - 0s 698us/step - loss: 1.4249e-04 - accuracy: 1.0000 - val_loss: 1.3220e-04 - val_accuracy: 1.0000 +Epoch 18/30 +600/600 [==============================] - 0s 807us/step - loss: 1.2717e-04 - accuracy: 1.0000 - val_loss: 1.1842e-04 - val_accuracy: 1.0000 +Epoch 19/30 +600/600 [==============================] - 0s 722us/step - loss: 1.1404e-04 - accuracy: 1.0000 - val_loss: 1.0641e-04 - val_accuracy: 1.0000 +Epoch 20/30 +600/600 [==============================] - 1s 860us/step - loss: 1.0269e-04 - accuracy: 1.0000 - val_loss: 9.5853e-05 - val_accuracy: 1.0000 +Epoch 21/30 +600/600 [==============================] - 0s 768us/step - loss: 9.2805e-05 - accuracy: 1.0000 - val_loss: 8.6761e-05 - val_accuracy: 
1.0000 +Epoch 22/30 +600/600 [==============================] - 0s 753us/step - loss: 8.4169e-05 - accuracy: 1.0000 - val_loss: 7.8690e-05 - val_accuracy: 1.0000 +Epoch 23/30 +600/600 [==============================] - 0s 727us/step - loss: 7.6554e-05 - accuracy: 1.0000 - val_loss: 7.1713e-05 - val_accuracy: 1.0000 +Epoch 24/30 +600/600 [==============================] - 0s 720us/step - loss: 6.9799e-05 - accuracy: 1.0000 - val_loss: 6.5581e-05 - val_accuracy: 1.0000 +Epoch 25/30 +600/600 [==============================] - 0s 715us/step - loss: 6.3808e-05 - accuracy: 1.0000 - val_loss: 5.9929e-05 - val_accuracy: 1.0000 +Epoch 26/30 +600/600 [==============================] - 0s 695us/step - loss: 5.8448e-05 - accuracy: 1.0000 - val_loss: 5.4957e-05 - val_accuracy: 1.0000 +Epoch 27/30 +600/600 [==============================] - 0s 730us/step - loss: 5.3656e-05 - accuracy: 1.0000 - val_loss: 5.0587e-05 - val_accuracy: 1.0000 +Epoch 28/30 +600/600 [==============================] - 0s 760us/step - loss: 4.9353e-05 - accuracy: 1.0000 - val_loss: 4.6493e-05 - val_accuracy: 1.0000 +Epoch 29/30 +600/600 [==============================] - 0s 750us/step - loss: 4.5461e-05 - accuracy: 1.0000 - val_loss: 4.2852e-05 - val_accuracy: 1.0000 +Epoch 30/30 +600/600 [==============================] - 0s 753us/step - loss: 4.1936e-05 - accuracy: 1.0000 - val_loss: 3.9584e-05 - val_accuracy: 1.0000 +250/250 [==============================] - 0s 56us/step +Test results - Loss: 4.163062170846388e-05 - Accuracy: 100.0% +``` + +As you can see, squared hinge works as well. Comparing the two decision boundaries - + +- ![](images/hinge_db.png) + + Decision boundary hinge + +- ![](images/sqh-db.png) + + Decision boundary squared hinge + + +...it seems to be the case that the decision boundary for squared hinge is _closer_, or _tighter_. Perhaps due to the _smoothness_ of the loss landscape? However, this cannot be said for sure. + +* * * + +## Summary + +In this blog post, we've seen how to create a machine learning model with Keras by means of the **hinge loss** and the **squared hinge loss** cost functions. We introduced hinge loss and squared hinge intuitively from a mathematical point of view, then swiftly moved on to an actual implementation. Results demonstrate that hinge loss and squared hinge loss can be successfully used in nonlinear classification scenarios, but they are relatively sensitive to the separability of your dataset (whether it's linear or nonlinear does not matter). Perhaps, binary crossentropy is less sensitive - and we'll take a look at this in a next blog post. + +For now, it remains to thank you for reading this post - I hope you've been able to derive some new insights from it! Please let me know what you think by writing a comment below 👇, I'd really appreciate it! 😊 Thanks and happy engineering! + +_Note that the full code for the models we created in this blog post is also available through my [Keras Loss Functions repository](https://github.com/christianversloot/keras-loss-functions) on GitHub._ + +* * * + +## References + +Wikipedia. (2011, September 16). Hinge loss. Retrieved from [https://en.wikipedia.org/wiki/Hinge\_loss](https://en.wikipedia.org/wiki/Hinge_loss) + +About loss and loss functions – MachineCurve. (2019, October 15). Retrieved from [https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) + +Intuitively understanding SVM and SVR – MachineCurve. 
(2019, September 20). Retrieved from [https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) + +Mastering Keras – MachineCurve. (2019, July 21). Retrieved from [https://www.machinecurve.com/index.php/mastering-keras/](https://www.machinecurve.com/index.php/mastering-keras/) + +How to create a basic MLP classifier with the Keras Sequential API – MachineCurve. (2019, July 27). Retrieved from [https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/) + +How to visualize the decision boundary for your Keras model? – MachineCurve. (2019, October 11). Retrieved from [https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) diff --git a/how-to-use-k-fold-cross-validation-with-keras.md b/how-to-use-k-fold-cross-validation-with-keras.md new file mode 100644 index 0000000..cd4ea1a --- /dev/null +++ b/how-to-use-k-fold-cross-validation-with-keras.md @@ -0,0 +1,714 @@ +--- +title: "How to use K-fold Cross Validation with TensorFlow 2 and Keras?" +date: "2020-02-18" +categories: + - "buffer" + - "frameworks" + - "svms" +tags: + - "dataset" + - "k-fold-cross-validation" + - "split" + - "training-process" + - "training-split" + - "validation" +--- + +When you train supervised machine learning models, you'll likely try multiple models, in order to find out how good they are. Part of this process is likely going to be the question _how can I compare models objectively?_ + +Training and testing datasets have been invented for this purpose. By splitting a small part off your full dataset, you create a dataset which (1) was not yet seen by the model, and which (2) you assume to approximate the distribution of the _population_, i.e. the real world scenario you wish to generate a predictive model for. + +Now, when generating such a split, you should ensure that your splits are relatively unbiased. In this blog post, we'll cover one technique for doing so: **K-fold Cross Validation**. Firstly, we'll show you how such splits can be made naïvely - i.e., by a simple hold out split strategy. Then, we introduce K-fold Cross Validation, show you how it works, and why it can produce better results. This is followed by an example, created with Keras and Scikit-learn's KFold functions. + +Are you ready? Let's go! 😎 + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +* * * + +**Update 12/Feb/2021:** added TensorFlow 2 to title; some styling changes. + +**Update 11/Jan/2021:** added code example to start using K-fold CV straight away. + +**Update 04/Aug/2020:** clarified the (in my view) necessity of validation set even after K-fold CV. + +**Update 11/Jun/2020:** improved K-fold cross validation code based on reader comments. + +* * * + +\[toc\] + +* * * + +## Code example: K-fold Cross Validation with TensorFlow and Keras + +This quick code can be used to perform K-fold Cross Validation with your TensorFlow/Keras model straight away. If you want to understand it in more detail, make sure to read the rest of the article below! 
+ +``` +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from sklearn.model_selection import KFold +import numpy as np + +# Merge inputs and targets +inputs = np.concatenate((input_train, input_test), axis=0) +targets = np.concatenate((target_train, target_test), axis=0) + +# Define the K-fold Cross Validator +kfold = KFold(n_splits=num_folds, shuffle=True) + +# K-fold Cross Validation model evaluation +fold_no = 1 +for train, test in kfold.split(inputs, targets): + + # Define the model architecture + model = Sequential() + model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) + model.add(MaxPooling2D(pool_size=(2, 2))) + model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) + model.add(MaxPooling2D(pool_size=(2, 2))) + model.add(Flatten()) + model.add(Dense(256, activation='relu')) + model.add(Dense(128, activation='relu')) + model.add(Dense(no_classes, activation='softmax')) + + # Compile the model + model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + + + # Generate a print + print('------------------------------------------------------------------------') + print(f'Training for fold {fold_no} ...') + + # Fit data to model + history = model.fit(inputs[train], targets[train], + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity) + + # Generate generalization metrics + scores = model.evaluate(inputs[test], targets[test], verbose=0) + print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%') + acc_per_fold.append(scores[1] * 100) + loss_per_fold.append(scores[0]) + + # Increase fold number + fold_no = fold_no + 1 +``` + +* * * + +## Evaluating and selecting models with K-fold Cross Validation + +Training a [supervised machine learning model](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) involves changing model weights using a _training set_. Later, once training has finished, the trained model is tested with new data - the _testing set_ - in order to find out how well it performs in real life. + +When you are satisfied with the performance of the model, you train it again with the entire dataset, in order to finalize it and use it in production (Bogdanovist, n.d.) + +However, when checking how well the model performance, the question _how to split the dataset_ is one that emerges pretty rapidly. K-fold Cross Validation, the topic of today's blog post, is one possible approach, which we'll discuss next. + +However, let's first take a look at the concept of generating train/test splits in the first place. Why do you need them? Why can't you simply train the model with all your data and then compare the results with other models? We'll answer these questions first. + +Then, we take a look at the efficient but naïve _simple hold-out splits_. This way, when we discuss K-fold Cross Validation, you'll understand more easily why it can be more useful when comparing performance between models. Let's go! + +### Why using train/test splits? 
- On finding a model that works for you + +Before we'll dive into the approaches for generating train/test splits, I think that it's important to take a look at _why we should split them_ in the first place when evaluating model performance. + +For this reason, we'll invent a model evaluation scenario first. + +#### Generating many predictions + +Say that we're training a few models to classify images of digits. We train a [Support Vector Machine](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) (SVM), a [Convolutional Neural Network](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) (CNN) and a [Densely-connected Neural Network](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/) (DNN) and of course, hope that each of them predicts "5" in this scenario: + +[![](images/EvaluationScenario-1024x366.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/EvaluationScenario.png) + +Our goal here is to use the model that performs best in production, a.k.a. "really using it" :) + +The central question then becomes: **how well does each model perform?** + +Based on their performance, we can select a model that can be used in real life. + +However, if we wish to determine model performance, we should generate a whole bunch of predictions - preferably, thousands or even more - so that we can compute metrics like accuracy, or [loss](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/). Great! + +#### Don't be the student who checks his own homework + +Now, we'll get to the core of our point - i.e., why we need to generate splits between training and testing data when evaluating machine learning models. + +We'll require an understanding of the high-level supervised machine learning process for this purpose: + +[![](images/High-level-training-process-1024x973.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/09/High-level-training-process.jpg) + +It can be read as follows: + +- In the first step, all the training samples (in blue on the left) are fed forward to the machine learning model, which generates predictions (blue on the right). +- In the second step, the predictions are compared with the "ground truth" (the real targets) - which results in the computation of a [loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/). +- The model can subsequently be optimized by steering the model away from the error, by changing its weights, in the backwards pass of the gradient with respect to (finally) the loss value. +- The process then starts again. Presumably, the model performs better this time. + +As you can imagine, the model will improve based on the _loss generated by the data_. This data is a _sample_, which means that there is always a difference between the _sample distribution_ and the _population distribution_. In other words, there is always a difference between _what your data tells that the patterns are_ and _what the patterns are in the real world_. This difference can be really small, but it's there. + +Now, if you let the model train for long enough, it will adapt substantially to the dataset. This also means that the impact of the difference will get larger and larger, relative to the patterns of the real-world scenario. 
If you've trained it for too long - [a problem called overfitting](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/) - the difference may cause your model to no longer work when real world data is fed to it.
+
+Generating a split between training data and testing data can help you solve this issue. By training your model using the training data, you can let it train for as long as you want. Why? Simple: you have the testing data to evaluate model performance afterwards, using data that is (1) presumably representative of the real world and (2) not yet seen by the model. If the model is highly overfit, this will be clear, because it will perform very poorly during the evaluation step with the testing data.
+
+Now, let's take a look at how we can do this. We'll start with simple hold-out splits :)
+
+### A naïve approach: simple hold-out split
+
+Say that you've got a dataset of 10,000 samples. It hasn't been split into a training and a testing set yet. Generally speaking, an 80/20 split is acceptable. That is, 80% of your data - 8,000 samples in our case - will be used for training purposes, while 20% - 2,000 - will be used for testing.
+
+We can thus simply draw a boundary at 8,000 samples, like this:
+
+[![](images/Traintest.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/Traintest.png)
+
+We call this _simple hold-out split_, as we simply "hold out" the last 2,000 samples (Chollet, 2017).
+
+It can be a highly effective approach. What's more, it's also very inexpensive in terms of the computational power you need. However, it's also a very naïve approach, as you'll have to keep these edge cases in mind all the time (Chollet, 2017):
+
+1. **Data representativeness**: all datasets, which are essentially samples, must represent the patterns in the population as much as possible. This becomes especially important when you generate samples from a sample (i.e., from your full dataset). For example, if the first part of your dataset has pictures of ice cream, while the latter one only represents espressos, trouble is guaranteed when you generate the split as displayed above. Random shuffling may help you solve these issues.
+2. **The arrow of time**: if you have a time series dataset, your dataset is likely ordered chronologically. If you'd shuffle randomly, and then perform simple hold-out validation, you'd effectively "\[predict\] the future given the past" (Chollet, 2017). Such temporal leaks don't benefit model performance.
+3. **Data redundancy**: if some samples appear more than once, a simple hold-out split with random shuffling may introduce redundancy between training and testing datasets. That is, identical samples belong to both datasets. This is problematic too, as data used for training thus leaks into the dataset for testing implicitly.
+
+Now, as we can see, while a simple hold-out split based approach can be effective and will be efficient in terms of computational resources, it also requires you to monitor for these edge cases continuously.
+
+\[affiliatebox\]
+
+### K-fold Cross Validation
+
+A more expensive and less naïve approach would be to perform K-fold Cross Validation. Here, you set some value for \[latex\]K\[/latex\] and (hey, what's in a name 😋) the dataset is split into \[latex\]K\[/latex\] partitions of equal size. \[latex\]K - 1\[/latex\] partitions are used for training, while one is used for testing. This process is repeated \[latex\]K\[/latex\] times, with a different partition used for testing each time.
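+
+To make this more tangible before we apply it to a neural network, here is a minimal sketch of what such a split looks like in code, using Scikit-learn's `KFold` on a tiny made-up dataset (the data itself is meaningless; it only serves to show the indices per fold):
+
+```
+import numpy as np
+from sklearn.model_selection import KFold
+
+# A toy dataset with 10 samples of 2 features each
+X = np.arange(20).reshape(10, 2)
+
+# K = 5: every fold holds out 2 samples for testing and uses 8 for training
+kfold = KFold(n_splits=5, shuffle=True, random_state=42)
+
+for fold_no, (train_index, test_index) in enumerate(kfold.split(X), start=1):
+    print(f'Fold {fold_no} - train indices: {train_index} - test indices: {test_index}')
+```
+
+Every sample ends up in the test partition exactly once, which is precisely the property that makes the evaluation less dependent on one particular split. Scikit-learn also provides `StratifiedKFold` and `TimeSeriesSplit` for the variations discussed next.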
+
+For example, this would be the scenario for our dataset with \[latex\]K = 5\[/latex\] (i.e., once again the 80/20 split, but then 5 times!):
+
+[![](images/KTraintest.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/KTraintest.png)
+
+For each split, the same model is trained, and performance is displayed per fold. For evaluation purposes, you can obviously also average it across all folds. While this produces better estimates, K-fold Cross Validation also increases training cost: in the \[latex\]K = 5\[/latex\] scenario above, the model must be trained 5 times.
+
+Let's now extend our viewpoint with a few variations of K-fold Cross Validation :)
+
+If you have no computational limitations whatsoever, you might wish to try a special case of K-fold Cross Validation, called Leave One Out Cross Validation (or LOOCV, Khandelwal 2019). LOOCV means \[latex\]K = N\[/latex\], where \[latex\]N\[/latex\] is the number of samples in your dataset. As the number of models trained is maximized, the precision of the model performance average is maximized too, but so is the cost of training due to the sheer number of models that must be trained.
+
+If you have a classification problem, you might also wish to take a look at Stratified Cross Validation (Khandelwal, 2019). It extends K-fold Cross Validation by ensuring that the distribution of the target classes is approximately the same in every split, which keeps the folds balanced - especially useful when your classes are imbalanced. Scikit-learn's `StratifiedKFold`, for example, supports both binary and multiclass targets.
+
+Finally, if you have a time series dataset, you might wish to use Time-series Cross Validation (Khandelwal, 2019). [Check here how it works.](https://medium.com/datadriveninvestor/k-fold-and-other-cross-validation-techniques-6c03a2563f1e#4a74)
+
+* * *
+
+## Creating a Keras model with K-fold Cross Validation
+
+Now that we understand how K-fold Cross Validation works, it's time to code an example with the Keras deep learning framework :)
+
+Coding it will be a multi-stage process:
+
+- Firstly, we'll take a look at what we need in order to run our model successfully.
+- Then, we take a look at today's model.
+- Subsequently, we add K-fold Cross Validation, train the model instances, and average performance.
+- Finally, we output the performance metrics on screen.
+
+### What we'll need to run our model
+
+For running the model, we'll need to install a set of software dependencies. For today's blog post, they are as follows:
+
+- TensorFlow 2.0+, which includes the Keras deep learning framework;
+- The most recent version of scikit-learn;
+- Numpy.
+
+That's it, already! :)
+
+### Our model: a CIFAR-10 CNN classifier
+
+Now, today's model.
+
+We'll be using a [convolutional neural network](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) that can be used to classify CIFAR-10 images into a set of 10 classes. The images are varied, as you can see here:
+
+[![](images/cifar10_images.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/cifar10_images.png)
+
+Now, my goal is not to replicate the process of creating the model here, as we already did that in our blog post ["How to build a ConvNet for CIFAR-10 and CIFAR-100 classification with Keras?"](https://www.machinecurve.com/index.php/2020/02/09/how-to-build-a-convnet-for-cifar-10-and-cifar-100-classification-with-keras/). Take a look at that post if you wish to understand the steps that lead to the model below.
+ +_(Do note that this is a small adaptation, where we removed the third convolutional block for reasons of speed.)_ + +Here is the full model code of the original CIFAR-10 CNN classifier, which we can use when adding K-fold Cross Validation: + +``` +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +import matplotlib.pyplot as plt + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 32, 32, 3 +loss_function = sparse_categorical_crossentropy +no_classes = 100 +no_epochs = 100 +optimizer = Adam() +verbosity = 1 + +# Load CIFAR-10 data +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Visualize history +# Plot history: Loss +plt.plot(history.history['val_loss']) +plt.title('Validation loss history') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.show() + +# Plot history: Accuracy +plt.plot(history.history['val_accuracy']) +plt.title('Validation accuracy history') +plt.ylabel('Accuracy value (%)') +plt.xlabel('No. epoch') +plt.show() +``` + +### Removing obsolete code + +Now, let's slightly adapt the model in order to add K-fold Cross Validation. + +Firstly, we'll strip off some code that we no longer need: + +``` +import matplotlib.pyplot as plt +``` + +We will no longer generate the visualizations, and besides the import we thus also remove the part generating them: + +``` +# Visualize history +# Plot history: Loss +plt.plot(history.history['val_loss']) +plt.title('Validation loss history') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.show() + +# Plot history: Accuracy +plt.plot(history.history['val_accuracy']) +plt.title('Validation accuracy history') +plt.ylabel('Accuracy value (%)') +plt.xlabel('No. epoch') +plt.show() +``` + +### Adding K-fold Cross Validation + +Secondly, let's add the `KFold` code from `scikit-learn` to the imports - as well as `numpy`: + +``` +from sklearn.model_selection import KFold +import numpy as np +``` + +Which... + +> Provides train/test indices to split data in train/test sets. Split dataset into k consecutive folds (without shuffling by default). +> +> Scikit-learn (n.d.) 
[sklearn.model\_selection.KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) + +Precisely what we want! + +We also add a new configuration value: + +``` +num_folds = 10 +``` + +This will ensure that our \[latex\]K = 10\[/latex\]. + +What's more, directly after the "normalize data" step, we add two empty lists for storing the results of cross validation: + +``` +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Define per-fold score containers <-- these are new +acc_per_fold = [] +loss_per_fold = [] +``` + +This is followed by a concat of our 'training' and 'testing' datasets - remember that K-fold Cross Validation makes the split! + +``` +# Merge inputs and targets +inputs = np.concatenate((input_train, input_test), axis=0) +targets = np.concatenate((target_train, target_test), axis=0) +``` + +Based on this prior work, we can add the code for K-fold Cross Validation: + +``` +fold_no = 1 +for train, test in kfold.split(input_train, target_train): +``` + +Ensure that all the `model` related steps are now wrapped inside the `for` loop. Also make sure to add a couple of extra `print` statements and to replace the inputs and targets to `model.fit`: + +``` +# K-fold Cross Validation model evaluation +fold_no = 1 +for train, test in kfold.split(inputs, targets): + + # Define the model architecture + model = Sequential() + model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) + model.add(MaxPooling2D(pool_size=(2, 2))) + model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) + model.add(MaxPooling2D(pool_size=(2, 2))) + model.add(Flatten()) + model.add(Dense(256, activation='relu')) + model.add(Dense(128, activation='relu')) + model.add(Dense(no_classes, activation='softmax')) + + # Compile the model + model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + + + # Generate a print + print('------------------------------------------------------------------------') + print(f'Training for fold {fold_no} ...') + + # Fit data to model + history = model.fit(inputs[train], targets[train], + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity) +``` + +We next replace the "test loss" `print` with one related to what we're doing. Also, we increase the `fold_no`: + +``` + # Generate generalization metrics + scores = model.evaluate(inputs[test], targets[test], verbose=0) + print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%') + acc_per_fold.append(scores[1] * 100) + loss_per_fold.append(scores[0]) + + # Increase fold number + fold_no = fold_no + 1 +``` + +Here, we simply print a "score for fold X" - and add the accuracy and sparse categorical crossentropy loss values to the lists. + +Now, why do we do that? + +Simple: at the end, we provide an overview of all scores and the averages. This allows us to easily compare the model with others, as we can simply compare these outputs. 
Add this code at the end of the model, but make sure that it is _not_ wrapped inside the `for` loop: + +``` +# == Provide average scores == +print('------------------------------------------------------------------------') +print('Score per fold') +for i in range(0, len(acc_per_fold)): + print('------------------------------------------------------------------------') + print(f'> Fold {i+1} - Loss: {loss_per_fold[i]} - Accuracy: {acc_per_fold[i]}%') +print('------------------------------------------------------------------------') +print('Average scores for all folds:') +print(f'> Accuracy: {np.mean(acc_per_fold)} (+- {np.std(acc_per_fold)})') +print(f'> Loss: {np.mean(loss_per_fold)}') +print('------------------------------------------------------------------------') +``` + +#### Full model code + +Altogether, this is the new code for your K-fold Cross Validation scenario with \[latex\]K = 10\[/latex\]: + +\[affiliatebox\] + +``` +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from sklearn.model_selection import KFold +import numpy as np + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 32, 32, 3 +loss_function = sparse_categorical_crossentropy +no_classes = 100 +no_epochs = 25 +optimizer = Adam() +verbosity = 1 +num_folds = 10 + +# Load CIFAR-10 data +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Define per-fold score containers +acc_per_fold = [] +loss_per_fold = [] + +# Merge inputs and targets +inputs = np.concatenate((input_train, input_test), axis=0) +targets = np.concatenate((target_train, target_test), axis=0) + +# Define the K-fold Cross Validator +kfold = KFold(n_splits=num_folds, shuffle=True) + +# K-fold Cross Validation model evaluation +fold_no = 1 +for train, test in kfold.split(inputs, targets): + + # Define the model architecture + model = Sequential() + model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) + model.add(MaxPooling2D(pool_size=(2, 2))) + model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) + model.add(MaxPooling2D(pool_size=(2, 2))) + model.add(Flatten()) + model.add(Dense(256, activation='relu')) + model.add(Dense(128, activation='relu')) + model.add(Dense(no_classes, activation='softmax')) + + # Compile the model + model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + + + # Generate a print + print('------------------------------------------------------------------------') + print(f'Training for fold {fold_no} ...') + + # Fit data to model + history = model.fit(inputs[train], targets[train], + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity) + + # Generate generalization metrics + scores = model.evaluate(inputs[test], targets[test], verbose=0) + print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {scores[0]}; {model.metrics_names[1]} of {scores[1]*100}%') + acc_per_fold.append(scores[1] * 100) + loss_per_fold.append(scores[0]) + + # Increase fold number + fold_no = fold_no + 
1 + +# == Provide average scores == +print('------------------------------------------------------------------------') +print('Score per fold') +for i in range(0, len(acc_per_fold)): + print('------------------------------------------------------------------------') + print(f'> Fold {i+1} - Loss: {loss_per_fold[i]} - Accuracy: {acc_per_fold[i]}%') +print('------------------------------------------------------------------------') +print('Average scores for all folds:') +print(f'> Accuracy: {np.mean(acc_per_fold)} (+- {np.std(acc_per_fold)})') +print(f'> Loss: {np.mean(loss_per_fold)}') +print('------------------------------------------------------------------------') +``` + +* * * + +## Results + +Now, it's time to run the model, to see whether we can get some nice results :) + +Say, for example, that you saved the model as `k-fold-model.py` in some folder. Open up your command prompt - for example, Anaconda Prompt - and `cd` to the folder where your file is stored. Make sure that your dependencies are installed and then run `python k-fold-model.py`. + +If everything goes well, the model should start training for 25 epochs per fold. + +### Evaluating the performance of your model + +During training, it should produce batches like this one: + +``` +------------------------------------------------------------------------ +Training for fold 3 ... +Train on 43200 samples, validate on 10800 samples +Epoch 1/25 +43200/43200 [==============================] - 9s 200us/sample - loss: 1.5628 - accuracy: 0.4281 - val_loss: 1.2300 - val_accuracy: 0.5618 +Epoch 2/25 +43200/43200 [==============================] - 7s 165us/sample - loss: 1.1368 - accuracy: 0.5959 - val_loss: 1.0767 - val_accuracy: 0.6187 +Epoch 3/25 +43200/43200 [==============================] - 7s 161us/sample - loss: 0.9737 - accuracy: 0.6557 - val_loss: 0.9869 - val_accuracy: 0.6522 +Epoch 4/25 +43200/43200 [==============================] - 7s 169us/sample - loss: 0.8665 - accuracy: 0.6967 - val_loss: 0.9347 - val_accuracy: 0.6772 +Epoch 5/25 +43200/43200 [==============================] - 8s 175us/sample - loss: 0.7792 - accuracy: 0.7281 - val_loss: 0.8909 - val_accuracy: 0.6918 +Epoch 6/25 +43200/43200 [==============================] - 7s 168us/sample - loss: 0.7110 - accuracy: 0.7508 - val_loss: 0.9058 - val_accuracy: 0.6917 +Epoch 7/25 +43200/43200 [==============================] - 7s 161us/sample - loss: 0.6460 - accuracy: 0.7745 - val_loss: 0.9357 - val_accuracy: 0.6892 +Epoch 8/25 +43200/43200 [==============================] - 8s 184us/sample - loss: 0.5885 - accuracy: 0.7963 - val_loss: 0.9242 - val_accuracy: 0.6962 +Epoch 9/25 +43200/43200 [==============================] - 7s 156us/sample - loss: 0.5293 - accuracy: 0.8134 - val_loss: 0.9631 - val_accuracy: 0.6892 +Epoch 10/25 +43200/43200 [==============================] - 7s 164us/sample - loss: 0.4722 - accuracy: 0.8346 - val_loss: 0.9965 - val_accuracy: 0.6931 +Epoch 11/25 +43200/43200 [==============================] - 7s 161us/sample - loss: 0.4168 - accuracy: 0.8530 - val_loss: 1.0481 - val_accuracy: 0.6957 +Epoch 12/25 +43200/43200 [==============================] - 7s 159us/sample - loss: 0.3680 - accuracy: 0.8689 - val_loss: 1.1481 - val_accuracy: 0.6938 +Epoch 13/25 +43200/43200 [==============================] - 7s 165us/sample - loss: 0.3279 - accuracy: 0.8850 - val_loss: 1.1438 - val_accuracy: 0.6940 +Epoch 14/25 +43200/43200 [==============================] - 7s 171us/sample - loss: 0.2822 - accuracy: 0.8997 - val_loss: 1.2441 - val_accuracy: 0.6832 
+Epoch 15/25 +43200/43200 [==============================] - 7s 167us/sample - loss: 0.2415 - accuracy: 0.9149 - val_loss: 1.3760 - val_accuracy: 0.6786 +Epoch 16/25 +43200/43200 [==============================] - 7s 170us/sample - loss: 0.2029 - accuracy: 0.9294 - val_loss: 1.4653 - val_accuracy: 0.6820 +Epoch 17/25 +43200/43200 [==============================] - 7s 165us/sample - loss: 0.1858 - accuracy: 0.9339 - val_loss: 1.6131 - val_accuracy: 0.6793 +Epoch 18/25 +43200/43200 [==============================] - 7s 171us/sample - loss: 0.1593 - accuracy: 0.9439 - val_loss: 1.7192 - val_accuracy: 0.6703 +Epoch 19/25 +43200/43200 [==============================] - 7s 168us/sample - loss: 0.1271 - accuracy: 0.9565 - val_loss: 1.7989 - val_accuracy: 0.6807 +Epoch 20/25 +43200/43200 [==============================] - 8s 190us/sample - loss: 0.1264 - accuracy: 0.9547 - val_loss: 1.9215 - val_accuracy: 0.6743 +Epoch 21/25 +43200/43200 [==============================] - 9s 207us/sample - loss: 0.1148 - accuracy: 0.9587 - val_loss: 1.9823 - val_accuracy: 0.6720 +Epoch 22/25 +43200/43200 [==============================] - 7s 167us/sample - loss: 0.1110 - accuracy: 0.9615 - val_loss: 2.0952 - val_accuracy: 0.6681 +Epoch 23/25 +43200/43200 [==============================] - 7s 166us/sample - loss: 0.0984 - accuracy: 0.9653 - val_loss: 2.1623 - val_accuracy: 0.6746 +Epoch 24/25 +43200/43200 [==============================] - 7s 168us/sample - loss: 0.0886 - accuracy: 0.9691 - val_loss: 2.2377 - val_accuracy: 0.6772 +Epoch 25/25 +43200/43200 [==============================] - 7s 166us/sample - loss: 0.0855 - accuracy: 0.9697 - val_loss: 2.3857 - val_accuracy: 0.6670 +Score for fold 3: loss of 2.4695983460744224; accuracy of 66.46666526794434% +------------------------------------------------------------------------ +``` + +Do note the increasing validation loss, a clear [sign of overfitting](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/). 
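+
+If you want to counter that overfitting rather than just observe it, one option - not part of the script above, but a straightforward extension - is to hold out a validation split within each fold and stop training once validation loss stops improving. A sketch of how the `model.fit` call inside the fold loop could then look, using Keras' `EarlyStopping` callback:
+
+```
+from tensorflow.keras.callbacks import EarlyStopping
+
+# Stop when validation loss has not improved for 3 consecutive epochs
+early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
+
+# Fit data to model, holding out 20% of this fold's training data for validation
+history = model.fit(inputs[train], targets[train],
+                    batch_size=batch_size,
+                    epochs=no_epochs,
+                    verbose=verbosity,
+                    validation_split=0.2,
+                    callbacks=[early_stopping])
+```
+
+Whatever you choose, keep the configuration identical across all folds (and across any other models you evaluate), so that the per-fold scores remain comparable.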
+ +And finally, after the 10th fold, it should display the overview with results per fold and the average: + +``` +------------------------------------------------------------------------ +Score per fold +------------------------------------------------------------------------ +> Fold 1 - Loss: 2.4094747734069824 - Accuracy: 67.96666383743286% +------------------------------------------------------------------------ +> Fold 2 - Loss: 1.768296229839325 - Accuracy: 67.03333258628845% +------------------------------------------------------------------------ +> Fold 3 - Loss: 2.4695983460744224 - Accuracy: 66.46666526794434% +------------------------------------------------------------------------ +> Fold 4 - Loss: 2.363724467277527 - Accuracy: 66.28333330154419% +------------------------------------------------------------------------ +> Fold 5 - Loss: 2.083754387060801 - Accuracy: 65.51666855812073% +------------------------------------------------------------------------ +> Fold 6 - Loss: 2.2160572570165 - Accuracy: 65.6499981880188% +------------------------------------------------------------------------ +> Fold 7 - Loss: 1.7227793588638305 - Accuracy: 66.76666736602783% +------------------------------------------------------------------------ +> Fold 8 - Loss: 2.357142448425293 - Accuracy: 67.25000143051147% +------------------------------------------------------------------------ +> Fold 9 - Loss: 1.553109979470571 - Accuracy: 65.54999947547913% +------------------------------------------------------------------------ +> Fold 10 - Loss: 2.426255855560303 - Accuracy: 66.03333353996277% +------------------------------------------------------------------------ +Average scores for all folds: +> Accuracy: 66.45166635513306 (+- 0.7683473645622098) +> Loss: 2.1370193102995554 +------------------------------------------------------------------------ +``` + +This allows you to compare the performance across folds, and compare the averages of the folds across model types you're evaluating :) + +In our case, the model produces accuracies of 60-70%. This is acceptable, but there is still room for improvement. But hey, that wasn't the scope of this blog post :) + +### Model finalization + +If you're satisfied with the performance of your model, you can _finalize_ it. There are two options for doing so: + +- Save the best performing model instance (check ["How to save and load a model with Keras?"](https://www.machinecurve.com/index.php/2020/02/14/how-to-save-and-load-a-model-with-keras/) - do note that this requires retraining because you haven't saved models with the code above), and use it for generating predictions. +- Retrain the model, but this time with all the data - i.e., without making the train/test split. Save that model, and use it for generating predictions. I do suggest to continue using a validation set, as you want to know when the model [is overfitting](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/). + +Both sides have advantages and disadvantages. The advantages of the first are that you don't have to retrain, as you can simply use the best-performing fold which was saved _during_ the training procedure. As retraining may be expensive, this could be an option, especially when your model is large. However, the disadvantage is that you simply miss out a percentage of your data - which may bring your training sample closer to the actual patterns in the _population_ rather than your _sample_. 
If that's the case, then the second option is better. + +However, that's entirely up to you! :) + +* * * + +## Summary + +In this blog post, we looked at the concept of model evaluation: what is it? Why would we need it in the first place? And how to do so objectively? If we can't evaluate models without introducing bias of some sort, there's no point in evaluating at all, is there? + +We introduced simple hold-out splits for this purpose, and showed that while they are efficient in terms of the required computational resources, they are also naïve. K-fold Cross Validation is \[latex\]K\[/latex\] times more expensive, but can produce significantly better estimates because it trains the models for \[latex\]K\[/latex\] times, each time with a different train/test split. + +To illustrate this further, we provided an example implementation for the Keras deep learning framework using TensorFlow 2.0. Using a Convolutional Neural Network for CIFAR-10 classification, we generated evaluations that performed in the range of 60-70% accuracies. + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope you've learnt something from today's blog post. If you did, feel free to leave a comment in the comments section! If you have questions, you can add a comment or ask a question with the button on the right. Please do the same if you spotted mistakes or when you have other remarks. I'll happily answer your comments and will improve my blog if that's the best thing to do. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +Scikit-learn. (n.d.). sklearn.model\_selection.KFold — scikit-learn 0.22.1 documentation. Retrieved February 17, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.model\_selection.KFold.html](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) + +Allibhai, E. (2018, October 3). Holdout vs. Cross-validation in Machine Learning. Retrieved from [https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f](https://medium.com/@eijaz/holdout-vs-cross-validation-in-machine-learning-7637112d3f8f) + +Chollet, F. (2017). _Deep Learning with Python_. New York, NY: Manning Publications. + +Khandelwal, R. (2019, January 25). K fold and other cross-validation techniques. Retrieved from [https://medium.com/datadriveninvestor/k-fold-and-other-cross-validation-techniques-6c03a2563f1e](https://medium.com/datadriveninvestor/k-fold-and-other-cross-validation-techniques-6c03a2563f1e) + +Bogdanovist. (n.d.). How to choose a predictive model after k-fold cross-validation? Retrieved from [https://stats.stackexchange.com/a/52277](https://stats.stackexchange.com/a/52277) + +* * * diff --git a/how-to-use-k-fold-cross-validation-with-pytorch.md b/how-to-use-k-fold-cross-validation-with-pytorch.md new file mode 100644 index 0000000..e7adbf0 --- /dev/null +++ b/how-to-use-k-fold-cross-validation-with-pytorch.md @@ -0,0 +1,889 @@ +--- +title: "How to use K-fold Cross Validation with PyTorch?" +date: "2021-02-02" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "k-fold-cross-validation" + - "machine-learning" + - "model-evaluation" + - "neural-network" + - "pytorch" + - "testing-data" + - "train-test-split" +--- + +Machine learning models must be evaluated with a test set after they have been trained. 
We do this to ensure that models have not overfit and to ensure that they work with real-life datasets, which may have slightly deviating distributions compared to the training set. + +But in order to make your model really robust, simply evaluating with a train/test split may not be enough. + +For example, take the situation where you have a dataset composed of samples from two classes. Most of the samples in the first 80% of your dataset belong to class A, whereas most of the samples in the other 20% belong to class B. If you would take a simple 80/20 hold-out split, then your datasets would have vastly different distributions - and evaluation might result in wrong conclusions. + +That's something what you want to avoid. In this article, you'll therefore learn about another technique that can be applied - K-fold Cross Validation. By generating train/test splits across multiple folds, you can perform multiple training and testing sessions, with different splits. You'll also see how you can use K-fold Cross Validation with PyTorch, one of the leading libraries for neural networks these days. + +After reading this tutorial, you will... + +- **Understand why K-fold Cross Validation can improve your confidence in model evaluation results.** +- **Have an idea about how K-fold Cross Validation works.** +- **Know how to implement K-fold Cross Validation with PyTorch.** + +* * * + +**Update 29/Mar/2021:** fixed possible issue with weight leaks. + +**Update 15/Feb/2021:** fixed small textual error. + +* * * + +\[toc\] + +* * * + +## Summary and code example: K-fold Cross Validation with PyTorch + +Model evaluation is often performed with a hold-out split, where an often 80/20 split is made and where 80% of your dataset is used for training the model. and 20% for evaluating the model. While this is a simple approach, it is also very naïve, since it assumes that data is representative across the splits, that it's not a time series dataset and that there are no redundant samples within the datasets. + +K-fold Cross Validation is a more robust evaluation technique. It splits the dataset in \[latex\]k-1\[/latex\] training batches and 1 testing batch across \[latex\]k\[/latex\] folds, or situations. Using the training batches, you can then train your model, and subsequently evaluate it with the testing batch. This allows you to train the model for multiple times with different dataset configurations. Even better, it allows you to be more confident in your model evaluation results. + +Below, you will see a **full example of using K-fold Cross Validation with PyTorch**, using Scikit-learn's `KFold` functionality. It can be used on the go. If you want to understand things in more detail, however, it's best to continue reading the rest of the tutorial as well! 🚀 + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader, ConcatDataset +from torchvision import transforms +from sklearn.model_selection import KFold + +def reset_weights(m): + ''' + Try resetting model weights to avoid + weight leakage. 
+ ''' + for layer in m.children(): + if hasattr(layer, 'reset_parameters'): + print(f'Reset trainable parameters of layer = {layer}') + layer.reset_parameters() + +class SimpleConvNet(nn.Module): + ''' + Simple Convolutional Neural Network + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Conv2d(1, 10, kernel_size=3), + nn.ReLU(), + nn.Flatten(), + nn.Linear(26 * 26 * 10, 50), + nn.ReLU(), + nn.Linear(50, 20), + nn.ReLU(), + nn.Linear(20, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Configuration options + k_folds = 5 + num_epochs = 1 + loss_function = nn.CrossEntropyLoss() + + # For fold results + results = {} + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare MNIST dataset by concatenating Train/Test part; we split later. + dataset_train_part = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=True) + dataset_test_part = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=False) + dataset = ConcatDataset([dataset_train_part, dataset_test_part]) + + # Define the K-fold Cross Validator + kfold = KFold(n_splits=k_folds, shuffle=True) + + # Start print + print('--------------------------------') + + # K-fold Cross Validation model evaluation + for fold, (train_ids, test_ids) in enumerate(kfold.split(dataset)): + + # Print + print(f'FOLD {fold}') + print('--------------------------------') + + # Sample elements randomly from a given list of ids, no replacement. + train_subsampler = torch.utils.data.SubsetRandomSampler(train_ids) + test_subsampler = torch.utils.data.SubsetRandomSampler(test_ids) + + # Define data loaders for training and testing data in this fold + trainloader = torch.utils.data.DataLoader( + dataset, + batch_size=10, sampler=train_subsampler) + testloader = torch.utils.data.DataLoader( + dataset, + batch_size=10, sampler=test_subsampler) + + # Init the neural network + network = SimpleConvNet() + network.apply(reset_weights) + + # Initialize optimizer + optimizer = torch.optim.Adam(network.parameters(), lr=1e-4) + + # Run the training loop for defined number of epochs + for epoch in range(0, num_epochs): + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = network(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished. 
Saving trained model.') + + # Print about testing + print('Starting testing') + + # Saving the model + save_path = f'./model-fold-{fold}.pth' + torch.save(network.state_dict(), save_path) + + # Evaluationfor this fold + correct, total = 0, 0 + with torch.no_grad(): + + # Iterate over the test data and generate predictions + for i, data in enumerate(testloader, 0): + + # Get inputs + inputs, targets = data + + # Generate outputs + outputs = network(inputs) + + # Set total and correct + _, predicted = torch.max(outputs.data, 1) + total += targets.size(0) + correct += (predicted == targets).sum().item() + + # Print accuracy + print('Accuracy for fold %d: %d %%' % (fold, 100.0 * correct / total)) + print('--------------------------------') + results[fold] = 100.0 * (correct / total) + + # Print fold results + print(f'K-FOLD CROSS VALIDATION RESULTS FOR {k_folds} FOLDS') + print('--------------------------------') + sum = 0.0 + for key, value in results.items(): + print(f'Fold {key}: {value} %') + sum += value + print(f'Average: {sum/len(results.items())} %') +``` + +* * * + +## What is K-fold Cross Validation? + +Suppose that your goal is to build a classifier that correctly classifies input images - like in the example below. You input an image that represents a handwritten digit, and the output is expected to be 5. + +There are myriad ways for building such a classifier in terms of the model type that can be chosen. But which is best? You have to evaluate each model in order to find how well it works. + +![](images/EvaluationScenario-1024x366.png) + +### Why using train/test splits for model evaluation? + +Model evaluation happens after a machine learning model has been trained. It ensures that the model works with real-world data too by feeding samples from an dataset called the _test set_, which contains samples that the model has not seen before. + +By comparing the subsequent predictions with the ground truth labels that are also available for these samples, we can see how well the model performs on this dataset. And we can thus also see how well it performs on data from the real world, if we used that during model evaluation. + +However, we have to be cautious when evaluating our model. We cannot simply use the data that we trained the model with, to avoid becoming a student who grades their own homework. + +Because that is what would happen when you evaluated with your training data: as the model has learned to capture patterns related to that particular dataset, the model might perform poorly if these patterns were spurious and therefore not present within real-world data. Especially with [high-variance](https://www.machinecurve.com/index.php/2020/11/02/machine-learning-error-bias-variance-and-irreducible-error-with-python/) models, this can become a problem. + +Instead, we evaluate models with that _test set_, which has been selected and contains samples not present within the training set. But how to construct this test set is another question. There are multiple methods for doing so. Let's take a look at a naïve strategy first. We then understand why we might apply K-fold Cross Validation instead. + +### Simple hold-out splits: a naïve strategy + +Here's that naïve way, which is also called a **simple hold-out split:** + +![](images/Traintest.png) + +With this technique, you simply take a part of your original dataset, set it apart, and consider that to be testing data. 
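+
+As an aside - and purely to make the idea concrete - here is a minimal sketch of such a hold-out split, using Scikit-learn's `train_test_split`. This helper is not part of the PyTorch example in this tutorial, and the data below is made up for illustration:
+
+```
+from sklearn.model_selection import train_test_split
+
+# Hypothetical features and labels
+X = list(range(100))
+y = [i % 2 for i in range(100)]
+
+# Hold out 20% of the samples as a test set; shuffling is enabled by default
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
+
+print(len(X_train), len(X_test))  # 80 20
+```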
+ +Traditionally, such splits are taken in an 80/20 fashion, where 80% of the data is used for training the model, and 20% is used for evaluating it. + +There are a few reasons why this is a naïve approach: you’ll have to keep these edge cases in mind all the time (Chollet, 2017): + +1. **Data representativeness**: all datasets, which are essentially samples, must represent the patterns in the population as much as possible. This becomes especially important when you generate samples from a sample (i.e., from your full dataset). For example, if the first part of your dataset has pictures of ice cream, while the latter one only represents espressos, trouble is guaranteed when you generate the split as displayed above. Random shuffling may help you solve these issues. +2. **The arrow of time**: if you have a time series dataset, your dataset is likely ordered chronologically. If you’d shuffle randomly, and then perform simple hold-out validation, you’d effectively “\[predict\] the future given the past” (Chollet, 2017). Such temporal leaks don’t benefit model performance. +3. **Data redundancy**: if some samples appear more than once, a simple hold-out split with random shuffling may introduce redundancy between training and testing datasets. That is, identical samples belong to both datasets. This is problematic too, as data used for training thus leaks into the dataset for testing implicitly. + +That's why it's often a better idea to validate your model more robustly. Let's take a look at K-fold Cross Validation for doing so. + +### Introducing K-fold Cross Validation + +What if we could try multiple variations of this train/test split? + +We would then have a model that is evaluated much more robustly. + +And precisely that is what **K-fold Cross Validation** is all about. + +![](images/KTraintest.png) + +In K-fold Cross Validation, you set a number \[latex\]k\[/latex\] to any integer value \[latex\]> 1\[/latex\], and \[latex\]k\[/latex\] splits will be generated. Each split has \[latex\]1/k\[/latex\] samples that belong to a test dataset, while the rest of your data can be used for training purposes. + +As in each split a different part of the training data will be used for validation purposes, you effectively train and evaluate your model multiple times, allowing you to tell whether it works with more confidence than with a simple hold-out split. + +Let's now take a look at how we can implement K-fold Cross Validation with PyTorch! + +* * * + +## Implementing K-fold Cross Validation with PyTorch + +Now that you understand how K-fold Cross Validation works, let's take a look at how you can apply it with PyTorch. Using K-fold CV with PyTorch involves the following steps: + +1. Ensuring that your dependencies are up to date. +2. Stating your model imports. +3. Defining the `nn.Module` class of your neural network, as well as a weights reset function. +4. Adding the preparatory steps in your runtime code. +5. Loading your dataset. +6. Defining the K-fold Cross Validator, and generating folds. +7. Iterating over each fold, training and evaluating another model instance. +8. Averaging across all folds to get final performance. + +### What you'll need to run the code + +Running this example requires that you have installed the following dependencies: + +- **Python**, to run everything. Make sure to install 3.8+, although it'll also run with slightly older versions. +- **PyTorch**, which is the deep learning library that you are training the models with. 
+- **Scikit-learn**, for generating the folds. + +Let's open up a code editor and create a file, e.g. called `kfold.py`. Obviously, you might also want to run everything inside a Jupyter Notebook. That's up to you :) + +### Model imports + +The first thing we do is specifying the model imports. We import these Python modules: + +- For file input/output, we use `os`. +- All PyTorch functionality is imported as `torch`. We also have some sub imports: + + - Neural network functionality is imported as `nn`. + + - The `DataLoader` that we import from `torch.utils.data` is used for passing data to the neural network. + - The `ConcatDataset` will be used for concatenating the train and test parts of the MNIST dataset, which we'll use for training the model. K-fold CV means that you generate the splits yourself, so you don't want PyTorch to do this for you - as you'd effectively lose data. +- We also import specific functionality related to Computer Vision - using `torchvision`. First, we import the `MNIST` dataset from `torchvision.datasets`. We also import `transforms` from Torch Vision, which allows us to convert the data into Tensor format later. +- Finally, we import `KFold` from `sklearn.model_selection` to allow us to perform K-fold Cross Validation. + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader, ConcatDataset +from torchvision import transforms +from sklearn.model_selection import KFold +``` + +### Model class + +Time to start with some real work! + +Let's define a simple convolutional neural network, i.e. a `SimpleConvNet`, that utilizes the `nn.Module` base class - and thus effectively implements a PyTorch neural network. + +We can implement it as follows, by specifying the `__init__` constructor definition and the forward pass. In the `__init__` definition, we specify the neural network as a Sequential stack of PyTorch layers. You can see that we use one convolutional layer (`Conv2d`) with ReLU activations and some `Linear` layers responsible for generating the predictions. Due to the simplicity of the MNIST dataset, this should suffice. We store the stack in `self.layers`, which we use in the forward pass, as defined in the `forward` definition. Here, we simply pass the data - available in `x` - to the layers. + +``` +class SimpleConvNet(nn.Module): + ''' + Simple Convolutional Neural Network + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Conv2d(1, 10, kernel_size=3), + nn.ReLU(), + nn.Flatten(), + nn.Linear(26 * 26 * 10, 50), + nn.ReLU(), + nn.Linear(50, 20), + nn.ReLU(), + nn.Linear(20, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) +``` + +Before the `class`, we will also add a `def` called `reset_weights`. During the folds, it will be used to reset the parameters of the model. This way, we ensure that the model is trained with weights that are initialized (pseudo)randomly, avoiding weight leakage. + +``` +def reset_weights(m): + ''' + Try resetting model weights to avoid + weight leakage. + ''' + for layer in m.children(): + if hasattr(layer, 'reset_parameters'): + print(f'Reset trainable parameters of layer = {layer}') + layer.reset_parameters() +``` + +### Runtime code + +Now that we have defined the model class, it's time to write some runtime code. By runtime code, we mean that you'll write code that actually runs when you run the Python file or the Jupyter Notebook. 
The class you defined before namely specifies a skeleton, and you'll have to initialize it first in order to have it run. We'll do that next. More specifically, our runtime code covers the following aspects: + +1. The **preparatory steps**, where we perform some (no surprise) preparation steps for running the model. +2. **Loading the dataset**; the MNIST one, to be precise. +3. **Defining the K-fold Cross Validator** to generate the folds. +4. Then, **generating the splits** that we can actually use for training the model, which we also do - once for every fold. +5. After training for every fold, we **evaluate the performance for that fold**. +6. Finally, we perform **performance evaluation for the model - across the folds**. + +Actually, it's that easy! :) + +#### Preparatory steps + +Below, we define some preparation steps that are executed prior to starting the training process across the set of folds. You can see that we run everything within the `__main__` name, meaning that this code only runs when we execute the Python file. In this part, we do the following things: + +1. We set the configuration options. We'll generate 5 folds (by setting \[latex\]k = 5\[/latex\]), we train for 1 epoch (normally, this value is much higher, but here we only want to illustrate K-fold CV to work), and we set `nn.CrossEntropyLoss` as our loss function. +2. We define a dictionary that will store the results for every fold. +3. We set a fixed random number seed, meaning that all our pseudo random number initializers will be initialized using the same initialization token. + +``` +if __name__ == '__main__': + + # Configuration options + k_folds = 5 + num_epochs = 1 + loss_function = nn.CrossEntropyLoss() + + # For fold results + results = {} + + # Set fixed random number seed + torch.manual_seed(42) +``` + +#### Loading the MNIST dataset + +We then load the MNIST dataset. If you're used to working with the PyTorch datasets, you may be familiar with this code already. However, the third line may still be a bit unclear - but it's actually really simple to understand what is happening here. + +We simply merge together the `train=True` and `train=False` parts of the MNIST dataset, which is already split in a simple hold-out split by PyTorch's `torchvision`. + +And we don't want that - recall that K-fold Cross Validation generates the train/test splits across \[latex\]k\[/latex\] folds, where \[latex\]k-1\[/latex\] parts are used for training your model and 1 part for model evaluation. + +To solve this, we simply load both parts, and then concatenate them in a `ConcatDataset` object. Don't worry about shuffling the data - you'll see that this is taken care of next. + +``` + # Prepare MNIST dataset by concatenating Train/Test part; we split later. + dataset_train_part = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=True) + dataset_test_part = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=False) + dataset = ConcatDataset([dataset_train_part, dataset_test_part]) +``` + +#### Defining the K-fold Cross Validator + +Because next, we _do_ define the shuffle - when we initialize the K-fold Cross Validator. Here, we set `shuffle=True`, meaning that shuffling occurs before the data is split into batches. `k_folds` indicates the number of folds, as you would have expected. 
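+
+One small aside: if you want the generated folds to be identical across runs - for example, when comparing several model variants on exactly the same splits - Scikit-learn's `KFold` also accepts a `random_state` argument, which is only meaningful when `shuffle=True`. A minimal variation on the validator defined below (here with 5 splits, matching `k_folds` in this example):
+
+```
+from sklearn.model_selection import KFold
+
+# Same validator as below, but with reproducible shuffling across runs
+kfold = KFold(n_splits=5, shuffle=True, random_state=42)
+```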
+ +``` + # Define the K-fold Cross Validator + kfold = KFold(n_splits=k_folds, shuffle=True) + + # Start print + print('--------------------------------') +``` + +#### Generating the splits and training the model for a fold + +We can now generate the splits and train our model. You can do so by defining a loop where you iterate over the splits, specifying the `fold` and the list of identifiers of the _training_ and _testing_ samples for that particular fold. These can be used for performing the actual training process. + +Within the for loop, we first perform a `print` statement, indicating the current fold. You then perform the training process. This involves the following steps: + +- Sampling the actual elements from the `train_ids` or `test_ids` with a `SubsetRandomSampler`. A sampler can be used within a `DataLoader` to use particular samples only; in this case based on identifiers, because the `SubsetRandomSampler` samples elements randomly from a list, _without replacements_. In other words, you create two subsamplers that adhere to the split as specified within the `for` loop. +- With the data loaders, you'll actually sample these samples from the full `dataset`. You can use any batch size that fits in memory, but a batch size of 10 works well in pretty much all of the cases. +- After preparing the dataset for this particular fold, you initialize the neural network by initializing the class - using `SimpleConvNet()`. +- Then, when the neural network is initialized, you can initialize the optimizer for this particular training session - in this case, we use Adam, with a `1e-4` learning rate. +- In PyTorch, you'll have to define [your own training loop](https://www.machinecurve.com/index.php/2021/01/26/creating-a-multilayer-perceptron-with-pytorch-and-lightning/). It's relatively simple: you iterate over the number of epochs; within an epoch, over the minibatches; per minibatch, you perform the forward pass, the backward pass and subsequent optimization. That's what is happening here. Click the link if you want to understand this process in more detail. + +``` + # K-fold Cross Validation model evaluation + for fold, (train_ids, test_ids) in enumerate(kfold.split(dataset)): + + # Print + print(f'FOLD {fold}') + print('--------------------------------') + + # Sample elements randomly from a given list of ids, no replacement. 
+ train_subsampler = torch.utils.data.SubsetRandomSampler(train_ids) + test_subsampler = torch.utils.data.SubsetRandomSampler(test_ids) + + # Define data loaders for training and testing data in this fold + trainloader = torch.utils.data.DataLoader( + dataset, + batch_size=10, sampler=train_subsampler) + testloader = torch.utils.data.DataLoader( + dataset, + batch_size=10, sampler=test_subsampler) + + # Init the neural network + network = SimpleConvNet() + + # Initialize optimizer + optimizer = torch.optim.Adam(network.parameters(), lr=1e-4) + + # Run the training loop for defined number of epochs + for epoch in range(0, num_epochs): + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = network(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished. Saving trained model.') +``` + +#### Fold evaluation + +After training a model within a particular fold, you must evaluate it too. That's what we'll do next. First, we save the model - so that it will be usable for generating productions later, should you want to re-use it. We then perform [model evaluation activities](https://www.machinecurve.com/index.php/2021/01/27/testing-pytorch-and-lightning-models/) - iterating over the `testloader` and generating predictions for all the samples in the test batch/test part of the fold split. We compute accuracy after evaluation, `print` it on screen, and add it to the `results` dictionary for that particular fold. + +``` + + # Print about testing + print('Starting testing') + + # Saving the model + save_path = f'./model-fold-{fold}.pth' + torch.save(network.state_dict(), save_path) + + # Evaluation for this fold + correct, total = 0, 0 + with torch.no_grad(): + + # Iterate over the test data and generate predictions + for i, data in enumerate(testloader, 0): + + # Get inputs + inputs, targets = data + + # Generate outputs + outputs = network(inputs) + + # Set total and correct + _, predicted = torch.max(outputs.data, 1) + total += targets.size(0) + correct += (predicted == targets).sum().item() + + # Print accuracy + print('Accuracy for fold %d: %d %%' % (fold, 100.0 * correct / total)) + print('--------------------------------') + results[fold] = 100.0 * (correct / total) +``` + +#### Model evaluation + +Finally, once all folds have passed, we have the `results` for every fold. Now, it's time to perform full model evaluation - and we can do so more robustly because we have information from across all the folds. Here's how you can show the results for every fold, and then print the average on screen. + +It allows you to do two things + +1. See whether your model performs well across all the folds; this is true if the accuracies for every fold don't deviate too significantly. +2. If they do, you know in which fold, and can take a closer look at the data to see what is happening there. 
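+
+As a small optional extension - not part of the original code below - you could also report the spread of the per-fold accuracies, which makes the first check easier to judge at a glance. A sketch using Python's built-in `statistics` module, with made-up accuracy values standing in for the contents of the `results` dictionary:
+
+```
+import statistics
+
+# Hypothetical per-fold accuracies, e.g. the values stored in `results`
+accuracies = [91.9, 91.8, 91.9, 90.4, 90.8]
+
+print(f'Average: {statistics.mean(accuracies):.2f} %')
+print(f'Std dev: {statistics.stdev(accuracies):.2f} %')
+```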
+ +``` + # Print fold results + print(f'K-FOLD CROSS VALIDATION RESULTS FOR {k_folds} FOLDS') + print('--------------------------------') + sum = 0.0 + for key, value in results.items(): + print(f'Fold {key}: {value} %') + sum += value + print(f'Average: {sum/len(results.items())} %') +``` + +#### Full code + +Instead of reading the explanation above, you might also be interested in simply running the code. If so, here it is 😊 + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader, ConcatDataset +from torchvision import transforms +from sklearn.model_selection import KFold + +def reset_weights(m): + ''' + Try resetting model weights to avoid + weight leakage. + ''' + for layer in m.children(): + if hasattr(layer, 'reset_parameters'): + print(f'Reset trainable parameters of layer = {layer}') + layer.reset_parameters() + +class SimpleConvNet(nn.Module): + ''' + Simple Convolutional Neural Network + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Conv2d(1, 10, kernel_size=3), + nn.ReLU(), + nn.Flatten(), + nn.Linear(26 * 26 * 10, 50), + nn.ReLU(), + nn.Linear(50, 20), + nn.ReLU(), + nn.Linear(20, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Configuration options + k_folds = 5 + num_epochs = 1 + loss_function = nn.CrossEntropyLoss() + + # For fold results + results = {} + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare MNIST dataset by concatenating Train/Test part; we split later. + dataset_train_part = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=True) + dataset_test_part = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=False) + dataset = ConcatDataset([dataset_train_part, dataset_test_part]) + + # Define the K-fold Cross Validator + kfold = KFold(n_splits=k_folds, shuffle=True) + + # Start print + print('--------------------------------') + + # K-fold Cross Validation model evaluation + for fold, (train_ids, test_ids) in enumerate(kfold.split(dataset)): + + # Print + print(f'FOLD {fold}') + print('--------------------------------') + + # Sample elements randomly from a given list of ids, no replacement. 
+ train_subsampler = torch.utils.data.SubsetRandomSampler(train_ids) + test_subsampler = torch.utils.data.SubsetRandomSampler(test_ids) + + # Define data loaders for training and testing data in this fold + trainloader = torch.utils.data.DataLoader( + dataset, + batch_size=10, sampler=train_subsampler) + testloader = torch.utils.data.DataLoader( + dataset, + batch_size=10, sampler=test_subsampler) + + # Init the neural network + network = SimpleConvNet() + network.apply(reset_weights) + + # Initialize optimizer + optimizer = torch.optim.Adam(network.parameters(), lr=1e-4) + + # Run the training loop for defined number of epochs + for epoch in range(0, num_epochs): + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = network(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished. Saving trained model.') + + # Print about testing + print('Starting testing') + + # Saving the model + save_path = f'./model-fold-{fold}.pth' + torch.save(network.state_dict(), save_path) + + # Evaluationfor this fold + correct, total = 0, 0 + with torch.no_grad(): + + # Iterate over the test data and generate predictions + for i, data in enumerate(testloader, 0): + + # Get inputs + inputs, targets = data + + # Generate outputs + outputs = network(inputs) + + # Set total and correct + _, predicted = torch.max(outputs.data, 1) + total += targets.size(0) + correct += (predicted == targets).sum().item() + + # Print accuracy + print('Accuracy for fold %d: %d %%' % (fold, 100.0 * correct / total)) + print('--------------------------------') + results[fold] = 100.0 * (correct / total) + + # Print fold results + print(f'K-FOLD CROSS VALIDATION RESULTS FOR {k_folds} FOLDS') + print('--------------------------------') + sum = 0.0 + for key, value in results.items(): + print(f'Fold {key}: {value} %') + sum += value + print(f'Average: {sum/len(results.items())} %') +``` + +* * * + +## After evaluation, what's next? + +Running the code gives you the following result for 5 folds with one epoch per fold. + +``` +-------------------------------- +FOLD 0 +-------------------------------- +Starting epoch 1 +Loss after mini-batch 500: 1.875 +Loss after mini-batch 1000: 0.810 +Loss after mini-batch 1500: 0.545 +Loss after mini-batch 2000: 0.450 +Loss after mini-batch 2500: 0.415 +Loss after mini-batch 3000: 0.363 +Loss after mini-batch 3500: 0.342 +Loss after mini-batch 4000: 0.373 +Loss after mini-batch 4500: 0.331 +Loss after mini-batch 5000: 0.295 +Loss after mini-batch 5500: 0.298 +Training process has finished. Saving trained model. 
+Starting testing +Accuracy for fold 0: 91 % +-------------------------------- +FOLD 1 +-------------------------------- +Starting epoch 1 +Loss after mini-batch 500: 1.782 +Loss after mini-batch 1000: 0.727 +Loss after mini-batch 1500: 0.494 +Loss after mini-batch 2000: 0.419 +Loss after mini-batch 2500: 0.386 +Loss after mini-batch 3000: 0.367 +Loss after mini-batch 3500: 0.352 +Loss after mini-batch 4000: 0.329 +Loss after mini-batch 4500: 0.307 +Loss after mini-batch 5000: 0.297 +Loss after mini-batch 5500: 0.289 +Training process has finished. Saving trained model. +Starting testing +Accuracy for fold 1: 91 % +-------------------------------- +FOLD 2 +-------------------------------- +Starting epoch 1 +Loss after mini-batch 500: 1.735 +Loss after mini-batch 1000: 0.723 +Loss after mini-batch 1500: 0.501 +Loss after mini-batch 2000: 0.412 +Loss after mini-batch 2500: 0.364 +Loss after mini-batch 3000: 0.366 +Loss after mini-batch 3500: 0.332 +Loss after mini-batch 4000: 0.319 +Loss after mini-batch 4500: 0.322 +Loss after mini-batch 5000: 0.292 +Loss after mini-batch 5500: 0.293 +Training process has finished. Saving trained model. +Starting testing +Accuracy for fold 2: 91 % +-------------------------------- +FOLD 3 +-------------------------------- +Starting epoch 1 +Loss after mini-batch 500: 1.931 +Loss after mini-batch 1000: 1.048 +Loss after mini-batch 1500: 0.638 +Loss after mini-batch 2000: 0.475 +Loss after mini-batch 2500: 0.431 +Loss after mini-batch 3000: 0.394 +Loss after mini-batch 3500: 0.390 +Loss after mini-batch 4000: 0.373 +Loss after mini-batch 4500: 0.383 +Loss after mini-batch 5000: 0.349 +Loss after mini-batch 5500: 0.350 +Training process has finished. Saving trained model. +Starting testing +Accuracy for fold 3: 90 % +-------------------------------- +FOLD 4 +-------------------------------- +Starting epoch 1 +Loss after mini-batch 500: 2.003 +Loss after mini-batch 1000: 0.969 +Loss after mini-batch 1500: 0.556 +Loss after mini-batch 2000: 0.456 +Loss after mini-batch 2500: 0.423 +Loss after mini-batch 3000: 0.372 +Loss after mini-batch 3500: 0.362 +Loss after mini-batch 4000: 0.332 +Loss after mini-batch 4500: 0.316 +Loss after mini-batch 5000: 0.327 +Loss after mini-batch 5500: 0.304 +Training process has finished. Saving trained model. +Starting testing +Accuracy for fold 4: 90 % +-------------------------------- +K-FOLD CROSS VALIDATION RESULTS FOR 5 FOLDS +-------------------------------- +Fold 0: 91.87857142857143 % +Fold 1: 91.75 % +Fold 2: 91.85 % +Fold 3: 90.35714285714286 % +Fold 4: 90.82142857142857 % +Average: 91.33142857142857 % +``` + +Indeed, this is the MNIST dataset, for which we get great results with only limited iterations - but that was something that we expected :) + +However, what we also see is that performance is relatively equal across the folds - so we don't see any weird outliers that skew our model evaluation efforts. + +This ensures that the distribution of the data was relatively equal across splits and that it will likely work on real-world data _if_ it has a relatively similar distribution. + +Generally, what I would now do often is to retrain the model with the _full dataset_, without evaluation on a hold-out split (or with a really small one - e.g. 5%). We have already seen that it generalizes and that it does so across folds. We can now use all the data at hand to boost performance perhaps slightly further. 
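+
+If you want to follow that route, a minimal sketch of such a final training run could look as follows. It is not part of the example above, but simply reuses `SimpleConvNet`, `dataset`, `loss_function` and `num_epochs` from it, now iterating over the full `ConcatDataset` without any fold-specific sampler; the batch size and learning rate are kept the same as before:
+
+```
+# Finalize the model: train once more, now on the full dataset
+final_network = SimpleConvNet()
+final_optimizer = torch.optim.Adam(final_network.parameters(), lr=1e-4)
+full_loader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True)
+
+for epoch in range(0, num_epochs):
+    for inputs, targets in full_loader:
+        final_optimizer.zero_grad()
+        loss = loss_function(final_network(inputs), targets)
+        loss.backward()
+        final_optimizer.step()
+
+# Save the finalized model, so it can be used for generating predictions later
+torch.save(final_network.state_dict(), './model-final.pth')
+```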
+ +I'd love to know what you think about this too, as this is a strategy that confused some people in my [K-fold Cross Validation for TensorFlow](https://www.machinecurve.com/index.php/2020/02/18/how-to-use-k-fold-cross-validation-with-keras/) tutorial. + +* * * + +## Recap + +In this tutorial, we looked at applying K-fold Cross Validation with the PyTorch framework for deep learning. We saw that K-fold Cross Validation generates \[latex\]k\[/latex\] different situations called _folds_ using your dataset, where the data is split in \[latex\]k-1\[/latex\] training batches and 1 test batch per fold. K-fold Cross Validation can be used for evaluating your PyTorch model more thoroughly, giving you more confidence in the fact that performance hasn't been skewed by a weird outlier in your dataset. + +Besides theoretical stuff, we also provided a PyTorch example that shows how you can apply K-fold Cross Validation with the framework combined with Scikit-learn's `KFold` functionality. I hope that you have learned something from it. If you did, please feel free to leave a message in the comments section below 💬 I'd love to hear from you. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Chollet, F. (2017). _Deep Learning with Python_. New York, NY: Manning Publications. + +PyTorch Lightning. (2021, January 12). [https://www.pytorchlightning.ai/](https://www.pytorchlightning.ai/) + +PyTorch. (n.d.). [https://pytorch.org](https://pytorch.org/) diff --git a/how-to-use-kullback-leibler-divergence-kl-divergence-with-keras.md b/how-to-use-kullback-leibler-divergence-kl-divergence-with-keras.md new file mode 100644 index 0000000..b2566ba --- /dev/null +++ b/how-to-use-kullback-leibler-divergence-kl-divergence-with-keras.md @@ -0,0 +1,245 @@ +--- +title: "How to use Kullback-Leibler divergence (KL divergence) with Keras?" +date: "2019-12-21" +categories: + - "deep-learning" + - "frameworks" +tags: + - "autoencoder" + - "deep-learning" + - "keras" + - "kl-divergence" + - "kullback-leibler-divergence" + - "loss-function" + - "machine-learning" + - "neural-networks" +--- + +When you train a supervised machine learning model, [you feed forward data](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), generating predictions on the fly. The comparison of these predictions with the actual targets valid for the samples used during training can be used to [optimize your model](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/). + +But how to compare? That's a valid question. + +There are various so-called _loss functions_ these days, which essentially present you the [difference between true target and prediction](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions). The Kullback-Leibler divergence (or KL Divergence for short) is one of these. Seeing it in the Keras docs spawned a lot of questions. What is KL divergence? How does it work as a loss function? In what kind of machine learning (or deep learning) problems can it be used? And how can I implement it? + +All valid questions which I'll try to answer in this blog article. First, I'll discuss _what the KL divergence is_ - and (spoiler alert) - it's nothing more than a comparison metric for two probability distributions. Subsequently, I'll cover use cases for KL divergence in deep learning problems. 
This is followed by a look at the Keras API, to find out how KL divergence is defined in the Losses section. Finally, we implement a Keras model with a KL divergence loss value, and see how it works.
+
+Are you ready? Let's go!
+
+* * *
+
+\[toc\]
+
+* * *
+
+## Comparing two probability distributions: KL divergence
+
+Okay, let's take a look at the first question: **what is the Kullback-Leibler divergence?**
+
+When diving into this question, I came across a really good article relatively quickly. At Count Bayesie's website, the article ["Kullback-Leibler Divergence Explained"](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained) provides a really intuitive yet mathematically sound explanation in plain English. It lies at the basis of my attempt at explaining the KL divergence, augmented with a few extra sources. That said, I definitely recommend taking a look at the original!
+
+Small note: contrary to Count Bayesie's article, I'll start my discussion from a supervised machine learning point of view.
+
+### ML based probability distribution
+
+Suppose that you have a probability distribution. Some activation functions generate one: the Softmax activation function, for example, generates a probability distribution over the classes in your supervised machine learning setting.
+
+Now what if - contrary to the Softmax situation, where the categorical crossentropy loss function is often used, which effectively takes into account only the predicted probability for the true class - you wish to compare the _predicted distribution_ with some _actual distribution_?
+
+As we will see, there are situations when this happens. In those cases, you can use the Kullback-Leibler Divergence, which is an adaptation of the entropy metric that is common in information theory (Count Bayesie, n.d.).
+
+### From entropy based information size to expected information loss
+
+But what is entropy? Mathematically, it can be defined as follows (Wikipedia, 2001).
+
+\\begin{equation} H(X) = -\\sum p(X)\\log p(X) \\end{equation}
+
+Intuitively, it's the _expected amount of information conveyed by data from some distribution_. In plain English, it's something like this (assuming that \[latex\]log\_2\[/latex\] is used): "the minimum number of bits it would take us to encode our information" (Count Bayesie, 2017).
+
+The entropy for some probability distribution thus tells you, given some data, how much information is in it. Knowing this, we can also find out _how much is lost_ when you change the distribution.
+
+Because that's what you do when you're performing deep learning activities: your feedforward-generated predictions effectively form a probability distribution ("there is some probability that the value lies between \[latex\]x\[/latex\] or \[latex\]y\[/latex\] / takes value \[latex\]x\[/latex\]"), and hence can be compared with the true distribution for the sample (i.e., your training dataset).
+
+Now - if your optimizer adapts its weights, the predictions change, and so does the probability distribution generated by your model. If only you could measure the loss of information between the model-based probability distributions and the distribution of the actual training dataset... then, you could do some optimization.
+
+### KL divergence
+
+Well, you can!! 😎
+
+By slightly adapting the formula for entropy, we arrive at the **Kullback-Leibler divergence** (Count Bayesie, 2017)!
It can be defined as follows (Wikipedia, 2004): + +\\begin{equation} KL (P || Q) = \\sum p(X) \\log ( p(X) \\div q(X) ) \\end{equation} + +In plain English, this effectively tells you how much entropy you lose or gain when you would change probability distributions (recognize that \[latex\]\\log ( p(X) \\div q(X) ) = \\log p(X) - \\log q(X)\[/latex\], Count Bayesie 2017). + +It's hence not surprising that the KL divergence is also called _relative entropy_. It's the gain or loss of entropy when switching from distribution one to distribution two (Wikipedia, 2004) - and it allows us to compare two probability distributions. + +Let's now take a look which ML problems require KL divergence loss, to gain some understanding when it can be useful. + +* * * + +## Use cases for KL divergence in machine learning problems + +But when to use KL divergence in your machine learning projects? + +Based on some Googling, I found that there are some use cases when Kullback-Leibler divergence is quite useful: + +- Primarily, it is used in **Variational Autoencoders** (Count Bayesie, 2017; Shafkat, 2018). These autoencoders learn to encode samples into a latent probability distribution. From this latent distribution, a sample can be drawn that can be fed to a decoder which outputs e.g. an image. It's one of the types of _generative models_ currently being fashionable for generating e.g. [pictures of humans](https://www.machinecurve.com/index.php/2019/07/17/this-person-does-not-exist-how-does-it-work/) (although strictly speaking, for the linked blog a different type of model - a GAN - has been used). +- However, KL divergence can also be used in **_multi_class classification scenarios** (Moreno, n.d.). These problems, which traditionally use the Softmax function and use one-hot encoded target data, are naturally suitable to KL divergence since Softmax "normalizes \[data\] into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers" (Wikipedia, 2006). In plain English: the output tells you, for some sample \[latex\]x\[/latex\], the odds of being present in the input image. Since KL divergence works with probability distributions, it's very much usable here. +- Funnily, KL divergence is also used for **replacing _Least Squares minimization_** in models (Kosheleva & Kreinovich, 2018). In regression models, the loss function to minimize is usually the error (prediction minus target), often squared. While the simplicity of such loss functions pays off in terms of efficacy, they are notoriously sensitive to noise (especially when the predictions generated by the feedforward operation are everything but part of the normal distribution). Rather counterintuitively, KL divergence has appeared here as an interesting replacement - as it works on the distribution level rather than the sample level. + +* * * + +## Kullback-Leibler divergence in the Keras API + +The Keras API defines the KL divergence as follows (Keras, n.d.): + +``` +keras.losses.kullback_leibler_divergence(y_true, y_pred) +``` + +This means that it can simply be defined as 'kullback\_leibler\_divergence' in your models. Simple :-) + +* * * + +## Implementing a Keras model with KL divergence + +Let's now see whether it's possible to implement a model with Keras that makes use of the KL divergence. As we've seen, it's possible to use KL divergence in some ML problems - and multiclass classification with Softmax function is one of them, because it generates probability distributions. 
These can be compared with KL divergence, and hence training can take place with it. + +We'll therefore slightly adapt a [ConvNet created in another blog post](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) to use KL divergence. This way, you don't have to immerse yourself in an entirely new model (assuming that you've read the linked post) yet can see how KL divergence can be used with Keras. + +### Configuring the loss function during Keras model compilation + +And it's simple, actually. It just involves specifying it as the used `loss` function during the model compilation step: + +``` +# Compile the model +model.compile(loss=keras.losses.kullback_leibler_divergence, + optimizer=keras.optimizers.Adam(), + metrics=['accuracy']) +``` + +That's it! + +### Full Keras CNN code + +Here's the full ConvNet code, including KL divergence: + +``` +import keras +from keras.datasets import cifar10 +from keras.models import Sequential +from keras.layers import Dense, Dropout, Flatten +from keras.layers import Conv2D, MaxPooling2D +from keras import backend as K + +# Model configuration +img_width, img_height = 32, 32 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load CIFAR10 dataset +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Reshape data based on channels first / channels last strategy. +# This is dependent on whether you use TF, Theano or CNTK as backend. +# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py +if K.image_data_format() == 'channels_first': + input_train = input_train.reshape(input_train.shape[0],3, img_width, img_height) + input_test = input_test.reshape(input_test.shape[0], 3, img_width, img_height) + input_shape = (3, img_width, img_height) +else: + input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 3) + input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 3) + input_shape = (img_width , img_height, 3) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data. +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = keras.utils.to_categorical(target_train, no_classes) +target_test = keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.50)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.50)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=keras.losses.kullback_leibler_divergence, + optimizer=keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split +) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +* * * + +## Results + +I ran the model twice, then changed to categorical crossentropy loss, and ran it twice too. 
This allows for some comparison between KL divergence and categorical crossentropy loss, which is normally used in multiclass classification with one-hot encoded vectors.
+
+In 25 epochs, performance is very similar. Therefore, I'd say you could use either one if you're facing the choice between categorical crossentropy and KL divergence.
+
+- [![](images/kld4.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/kld4.png)
+
+- [![](images/kld3.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/kld3.png)
+
+- [![](images/kld2.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/kld2.png)
+
+- [![](images/kld1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/kld1.png)
+
+
+* * *
+
+## Summary
+
+In this blog, we looked at what KL divergence is and how it can be used in neural networks. To illustrate this, we created an example implementation of a convolutional neural network, built with the Keras deep learning framework in Python. This example, which makes use of KL divergence loss, performs on par with the traditionally used categorical crossentropy loss.
+
+I hope you've learnt something from this blog post, even though it's a bit shorter than usual. If you did, I'd love to know - so feel free to leave a comment in the comments box below! Please also do so when you have questions or when you spot mistakes in my text. I'll happily improve it and will then list you in the references list 😊
+
+Thank you for reading MachineCurve today and happy engineering! 😎
+
+* * *
+
+## References
+
+Count Bayesie. (2017, May 10). Kullback-Leibler Divergence Explained. Retrieved from [https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained)
+
+Wikipedia. (2004, February 13). Kullback–Leibler divergence. Retrieved from [https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler\_divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence)
+
+Wikipedia. (2001, July 9). Entropy (information theory). Retrieved from [https://en.wikipedia.org/wiki/Entropy\_(information\_theory)](https://en.wikipedia.org/wiki/Entropy_(information_theory))
+
+Shafkat, I. (2018, April 5). Intuitively Understanding Variational Autoencoders. Retrieved from [https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf](https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf)
+
+Kosheleva, O., & Kreinovich, V. (2018). Why deep learning methods use KL divergence instead of least squares: a possible pedagogical explanation. [https://digitalcommons.utep.edu/cs\_techrep/1192](https://digitalcommons.utep.edu/cs_techrep/1192)
+
+Moreno. (n.d.). Alexander Moreno's answer to What are some applications of the KL-divergence in machine learning? Retrieved from [https://www.quora.com/What-are-some-applications-of-the-KL-divergence-in-machine-learning/answer/Alexander-Moreno-1](https://www.quora.com/What-are-some-applications-of-the-KL-divergence-in-machine-learning/answer/Alexander-Moreno-1)
+
+Keras. (n.d.). Losses. Retrieved from [https://keras.io/losses/#kullback\_leibler\_divergence](https://keras.io/losses/#kullback_leibler_divergence)
+
+Wikipedia. (2006, July 28). Softmax function.
Retrieved from [https://en.wikipedia.org/wiki/Softmax\_function](https://en.wikipedia.org/wiki/Softmax_function) diff --git a/how-to-use-l1-l2-and-elastic-net-regularization-with-keras.md b/how-to-use-l1-l2-and-elastic-net-regularization-with-keras.md new file mode 100644 index 0000000..b9539d0 --- /dev/null +++ b/how-to-use-l1-l2-and-elastic-net-regularization-with-keras.md @@ -0,0 +1,649 @@ +--- +title: "How to use L1, L2 and Elastic Net Regularization with TensorFlow 2.0 and Keras?" +date: "2020-01-23" +categories: + - "deep-learning" + - "frameworks" +tags: + - "elastic-net-regularization" + - "keras" + - "l1-regularization" + - "l2-regularization" + - "machine-learning" + - "regularization" + - "regularizer" +--- + +Regularizers, or ways to reduce the complexity of your machine learning models - can help you to get models that generalize to new, unseen data better. L1, L2 and Elastic Net regularizers are the ones most widely used in today's machine learning communities. + +But what are these regularizers? Why are they needed in the first place? And, most importantly, how can I implement them in my Keras model? + +Those questions will be answered in today's blog post. + +Firstly, we'll provide a recap on L1, L2 and Elastic Net regularization. In the recap, we look at the need for regularization, how a regularizer is attached to the loss function that is minimized, and how the L1, L2 and Elastic Net regularizers work. We do so intuitively, but we don't hide the maths when necessary. + +However, the primary aspect of this blog post is the Keras based set of examples that show the wide range of kernel, bias and activity based regularizers that are available within the framework. Using a CNN based model, we show you how L1, L2 and Elastic Net regularization can be applied to your Keras model - as well as some interesting results for that particular model. + +After completing this tutorial, you will know... + +- How to use `tensorflow.keras.regularizers` in your TensorFlow 2.0/Keras project. +- What L1, L2 and Elastic Net Regularization is, and how it works. +- What the impact is of adding a regularizer to your project. + +**Update 16/Jan/2021:** ensured that post is up to date for 2021 and and that works with TensorFlow 2.0+. Also added a code example to the beginning of this article so that you can get started quickly. + +* * * + +\[toc\] + +* * * + +## Example code: L1, L2 and Elastic Net Regularization with TensorFlow 2.0 and Keras + +With these code examples, you can immediately apply L1, L2 and Elastic Net Regularization to your TensorFlow or Keras project. If you want to understand the regularizers in more detail as well as using them, make sure to read the rest of this tutorial as well. Please note that these regularizers can also be used as `bias_regularizer` and `activity_regularizer`, not just `kernel_regularizer`. + +### L1 Regularization example + +``` +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', kernel_regularizer=tensorflow.keras.regularizers.l1(0.01))) +``` + +### L2 Regularization example + +``` +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', kernel_regularizer=tensorflow.keras.regularizers.l2(0.01))) +``` + +### Elastic Net (L1+L2) Regularization example + +``` +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', kernel_regularizer=tensorflow.keras.regularizers.l1_l2(l1=0.01, l2=0.01))) +``` + +* * * + +## Recap: what are L1, L2 and Elastic Net Regularization? 
+ +In our blog post ["What are L1, L2 and Elastic Net Regularization in neural networks?"](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/), we looked at the concept of regularization and the L1, L2 and Elastic Net Regularizers. We'll implement these in this blog post, using the Keras deep learning framework. + +However, before we actually start looking into the Keras API and coding our Keras based example, it's important to understand the basics of regularization and the basics of the regularizers. + +Here, we'll therefore cover these basics in order to provide a recap. Firstly, we'll discuss why we need a regularizer in the first place. Secondly, we'll take a look at L1 and L2 Regularization. Finally, we study Elastic Net Regularization in a bit more detail. Please refer to the blog post linked above for a more detailed explanation. + +### The need for regularization + +Training a supervised machine learning model equals learning a mapping for a function \[latex\]\\hat{y}: f(\\textbf{x})\[/latex\], where \[latex\]\\textbf{x}\[/latex\] is an input vector and \[latex\]\\hat{y}\[/latex\] is the predicted output value. Given the fact that it's supervised, you have the "ground truth" \[latex\]y\[/latex\] available for all \[latex\]\\textbf{x}\[/latex\] in your training set and hence, your definition of a well-performing machine learning model is to achieve \[latex\]\\hat{y} \\approx y\[/latex\] for your entire training set. + +This can be achieved by going through the iterative [high-level supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), which means that you feed your training set to the model, generate predictions, compare these with ground truth, summarize them in a [loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#loss), which you then use to [optimize](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) the weights of your model, before starting a new iteration. This way, you might be able to find a mapping for which \[latex\]\\hat{y} \\approx y\[/latex\] is true to a great extent. + +[![](images/poly_both.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/poly_both.png) + +In the exemplary scenario of the blog post linked above, we did however see that many mappings can be learned based on your training data. In the plot above, this becomes clear with a simple polyfit: for a few blue training data samples, it may learn the orange mapping, but there's no guarantee that it doesn't learn the blue one instead. + +As you can imagine, the blue one is much less scalable to new data, as it's very unlikely that real-world data produces such large oscillations in such a small domain. It's probably highly overfit i.e. too adapted to the training data. + +Can this be avoided? + +Yes, to some extent: by adding **a regularizer**, you may enforce the training process to steer towards relatively "simple" weights, which may make your model more generic and thus scalable. + +### Loss based regularizer + +From above, we know that the supervised machine learning process produces some loss value. Let's now take a look at this loss value in a bit more detail, as it's important to understand what a regularizer does. 
The first step is to define the loss value at a high level; say, it's \[latex\]L(f, \\textbf{x}, y)\[/latex\], where \[latex\]f\[/latex\] is the model, \[latex\]\\textbf{x}\[/latex\] some input vector and \[latex\]y\[/latex\] the corresponding ground truth value.
+
+Now, the loss value is determined by a _loss function_. Loss functions provide a mathematical way of comparing two values. Exemplary ones are [binary crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) (which compares a ground truth value with a predicted output) and [hinge loss](https://www.machinecurve.com/index.php/2019/10/15/how-to-use-hinge-squared-hinge-loss-with-keras/). But as we don't want to get into too much detail here, we simply define the output of the loss function as \[latex\]L\_{function}(f, \\textbf{x}, y)\[/latex\]. So:
+
+\[latex\] L(f, \\textbf{x}, y) = L\_{function}(f, \\textbf{x}, y)\[/latex\]
+
+The objective during training is to minimize this value, and hence the function:
+
+\[latex\] \\min L(f, \\textbf{x}, y) = \\min L\_{function}(f, \\textbf{x}, y)\[/latex\]
+
+Now back to the regularizer. The _goal for using it_ is to _produce simpler models which scale to more generic data_. This means that you'll have to do something with the _weights_ of your model, and the only way of doing so is during the _optimization step_.
+
+However, measuring the need for regularizing is not something we want to do _during_ optimization. Take a look at loss: the _measurement_ is performed just _before_ the optimization step, after which its outcome - the loss value - is used for optimizing the model.
+
+Can't we do something similar with a regularizer?
+
+Yes, we can: there's no argument as to why we cannot provide a measurement for the _need for regularization_ directly in advance of optimization. In fact, we can even add it to the loss value \[latex\] L(f, \\textbf{x}, y)\[/latex\]! This way, the need for regularization given some model weights is taken into account during optimization, together with the comparison between ground truth and predicted value. This way, you may actually arrive at models which are simple _and_ where \[latex\]\\hat{y} \\approx y\[/latex\].
+
+We do so as follows:
+
+\[latex\] L(f, \\textbf{x}, y) = L\_{function}(f, \\textbf{x}, y) + R(f)\[/latex\]
+
+After which the minimization operation becomes:
+
+\[latex\] \\min L(f, \\textbf{x}, y) = \\min ( L\_{function}(f, \\textbf{x}, y) + R(f) )\[/latex\]
+
+Let's now take a look at two possible instantiations for \[latex\]R(f)\[/latex\], i.e. two actual regularizers: L1 (or Lasso) regularization and L2 (or Ridge) regularization.
+
+### L1 and L2 Regularization
+
+When L1 Regularization is applied to one of the layers of your neural network, \[latex\]R(f)\[/latex\] is instantiated as \[latex\] \\sum\_{i=1}^{n} | w\_i | \[/latex\], where \[latex\]w\_i\[/latex\] is the value for one of your \[latex\]n\[/latex\] weights in that particular layer. This instantiation computes the L1 norm for a vector, which is also called the "taxicab norm" as it computes and adds together the lengths between the origin and the value along the axis for a particular dimension.
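+
+As a quick numerical illustration (a minimal sketch using NumPy; the weight values are made up for this example and do not come from any model in this post), computing this L1 penalty for a small weight vector looks like this:
+
+```
+import numpy as np
+
+# Hypothetical weight vector for a single layer
+w = np.array([0.5, -1.2, 0.0, 3.0])
+
+# L1 norm: sum of absolute values -> 0.5 + 1.2 + 0.0 + 3.0 = 4.7
+l1_penalty = np.sum(np.abs(w))
+print(l1_penalty)  # 4.7
+```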
+
+[![](images/l1_component.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l1_component.png)
+
+Applying L1 regularization ensures that given a relatively constant \[latex\] L\_{function}(f, \\textbf{x}, y) \[/latex\] your weights take very small values of \[latex\]\\approx 0\[/latex\], as the L1 value for \[latex\]x = 0\[/latex\] is lowest. Indeed, likely, your weights will even [become _zero_](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/#on-model-sparsity), due to the fact that the L1 derivative is constant. Applying L1 to your neural network's layers thus pushes them to drop out weights that do not contribute to their predictive power significantly enough, and thus leads to sparse models.
+
+However, it may be that you don't want models to be sparse. This may be the case if you face the "small, fat data problem", where you don't have a lot of samples, but the samples you've got are high-dimensional. Another case would be correlated data: if your features have high pairwise correlation coefficients, dropping out the effect of certain variables through dropping out weights would be a bad idea, as you would effectively lose information.
+
+In this case, L2 regularization may be applied. For L2, \[latex\]R(f)\[/latex\] is instantiated as \[latex\] \\sum\_{i=1}^{n} w\_i^2 \[/latex\], where \[latex\]w\_i\[/latex\] is the value for one of your \[latex\]n\[/latex\] weights in that particular layer. As it's quadratic, it produces a characteristic plot:
+
+[![](images/l2_comp.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l2_comp.png)
+
+Applying L2 regularization does lead to models where the weights will get relatively small values, i.e. where they are simple. This is similar to applying L1 regularization. However, contrary to L1, L2 regularization [does not push your weights to be _exactly zero_](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/#why-l1-yields-sparsity-and-l2-likely-does-not). This is also caused by the derivative: contrary to L1, where the derivative is a constant (it's either +1 or -1), the L2 derivative is \[latex\]2x\[/latex\]. This means that the closer you get to zero, the smaller the derivative gets, and hence the smaller the update. As with the case of dividing \[latex\]1\[/latex\] by \[latex\]2\[/latex\], then \[latex\]\\frac{1}{2}\[/latex\] by \[latex\]2\[/latex\], then \[latex\]\\frac{1}{4}\[/latex\] by \[latex\]2\[/latex\], and so on, you never reach _zero_, but the values get _really small_. For the situations where L1 cannot be applied, L2 is a good candidate for regularization.
+
+### Elastic Net Regularization
+
+However, applying L2 comes with one drawback: reduced interpretability, as no weights are pushed to exactly zero. What's more, it may be the case that you do not exactly know which regularizer to apply, as you don't have sufficient prior knowledge about your dataset. Finally, it can also be that you find insufficient results with either one, but think you could benefit from something in between.
+
+Say hello to Elastic Net Regularization, which was introduced by Zou & Hastie (2005).
It effectively instantiates \[latex\]R(f)\[/latex\] as a linear combination of L1 and L2 regularization:
+
+\[latex\] L(f, \\textbf{x}, y) = L\_{function}(f, \\textbf{x}, y) + \\lambda\_1 \\sum\_{i=1}^{n} | w\_i | + \\lambda\_2 \\sum\_{i=1}^{n} w\_i^2 \[/latex\]
+
+In the original paper, \[latex\]\\lambda\_1\[/latex\] can also be defined as \[latex\]1 - \\alpha\[/latex\] and \[latex\]\\lambda\_2\[/latex\] as \[latex\]\\alpha\[/latex\]. This makes the impact of both relative to each other, with \[latex\]\\alpha = 1\[/latex\] giving L2 regularization and \[latex\]\\alpha = 0\[/latex\] giving L1 regularization. All the values in between produce a weighted mix of the two.
+
+According to Zou & Hastie (2005) and many practitioners, Elastic Net Regularization produces better results and can be used more naïvely, e.g. when little prior knowledge is available about the dataset.
+
+![](images/penalty-values.png)
+
+Now that we know some details about the regularizers, let's find out how they are represented by the Keras API.
+
+### Which lambda values do I need?
+
+It's very difficult, if not impossible, to give an answer to this question, as the most suitable values for \[latex\]\\lambda\[/latex\] are data-dependent (Google Developers, n.d.).
+
+However, it's best to use values \[latex\]> 0\[/latex\] (otherwise, the regularizer would be dead). Also, it's best not to use lambdas that are too high (risking underfitting) but neither lambdas that are too low (making the regularizer ineffective, increasing the odds of overfitting) (Google Developers, n.d.). However, generally speaking, they should be rather lower than higher. For example, as we shall see, the default value within the Keras framework is \[latex\]\\lambda = 0.01\[/latex\] (TensorFlow, 2021).
+
+* * *
+
+## Regularizers in the Keras API
+
+If we take a look at the Keras docs, we get a sense of how regularization works in Keras. First of all, "the penalties are applied on a per-layer basis" - which means that you can use different regularizers on different layers in your neural network (TensorFlow, 2021).
+
+Secondly, for each layer, regularization can be performed on one (or all) of three areas within the layer (TensorFlow, 2021):
+
+- The **kernel**, through `kernel_regularizer`, which applies regularization to the kernel a.k.a. the actual weights;
+- The **bias** value, through `bias_regularizer`, which applies regularization to the bias, which shifts the layer outputs;
+- The **activity** value, through `activity_regularizer`, which applies the regularizer to the _output of the layer_, i.e. the activation value (which is the combination of the weights + biases with the input vector, fed through the [activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/)) (Tonutti, 2017).
+
+To each of the three, an instance of the `tensorflow.keras.regularizers.Regularizer` class can be supplied in order for regularization to work (TensorFlow, 2021). Soon, we'll cover the L1, L2 and Elastic Net instances of this class by means of an example, which are represented as follows (TensorFlow, 2021):
+
+```
+tensorflow.keras.regularizers.l1(0.)
+tensorflow.keras.regularizers.l2(0.)
+tensorflow.keras.regularizers.l1_l2(l1=0.01, l2=0.01)
+```
+
+In short, this way, you can either regularize _parts_ of what happens in the neural network layer, or the combination of the parts by means of the _output_. That's quite some flexibility, isn't it? :)
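+
+For instance, attaching a regularizer to each of the three areas of a single `Dense` layer could look like this (a minimal sketch; we simply use the L2 variant with the default strength of 0.01 everywhere, but any `Regularizer` instance can be plugged in):
+
+```
+from tensorflow.keras.layers import Dense
+from tensorflow.keras import regularizers
+
+layer = Dense(256, activation='relu',
+              kernel_regularizer=regularizers.l2(0.01),    # penalizes the weights
+              bias_regularizer=regularizers.l2(0.01),      # penalizes the bias values
+              activity_regularizer=regularizers.l2(0.01))  # penalizes the layer outputs
+```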
+
+Let's now take a look at how the regularizers can be applied in a neural network.
+
+* * *
+
+## Keras L1, L2 and Elastic Net Regularization examples
+
+Here's the model that we'll be creating today. It was generated with [Net2Vis](https://www.machinecurve.com/index.php/2020/01/07/visualizing-keras-neural-networks-with-net2vis-and-docker/), a cool web based visualization library for Keras models (Bäuerle & Ropinski, 2019):
+
+- [![](images/graph-4.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/graph-4.png)
+
+- [![](images/legend-2.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/legend-2.png)
+
+
+As you can see, it's a convolutional neural network. It takes 28 x 28 pixel images as input, learns 32 and 64 filters in two Conv2D layers and applies max pooling twice, together with some Dropout. The resulting feature maps are fed to the Dense layers through a Flatten operation; the Dense layers generate the final prediction, which is a classification into 47 output classes through a Softmax activation function.
+
+**Read more:**
+
+- [Visualizing Keras neural networks with Net2Vis and Docker](https://www.machinecurve.com/index.php/2020/01/07/visualizing-keras-neural-networks-with-net2vis-and-docker/)
+- [How does the Softmax activation function work?](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/)
+
+The dataset that we'll be using today is the EMNIST dataset. It adds _letters_ to the traditional MNIST dataset, as you can see in the plot below. For this to work, we use the [Extra Keras Datasets](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/) module.
+
+[![](images/emnist-balanced.png)](https://github.com/christianversloot/extra_keras_datasets/raw/master/assets/emnist-balanced.png)
+
+The steps for creating today's model are as follows:
+
+- Stating the imports;
+- Setting the model configuration;
+- Loading and preparing the data;
+- Creating the model architecture;
+- Configuring the model;
+- Fitting the data;
+- Generating evaluation metrics.
+
+### Stating the imports
+
+For today's model, we'll be using TensorFlow 2.0 and the corresponding built-in facilities for Keras. From them, we import the Sequential API, the layers specified above and the `regularizers` module. Besides Keras, we'll also use Numpy for numbers processing and [extra-keras-datasets](https://pypi.org/project/extra-keras-datasets/) for loading the data. Finally, Matplotlib is used for visualizing the model history. Make sure to have these dependencies installed before you run the model.
+
+```
+import tensorflow.keras
+from extra_keras_datasets import emnist
+import numpy as np
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense, Dropout, Flatten
+from tensorflow.keras.layers import Conv2D, MaxPooling2D
+from tensorflow.keras import regularizers
+import matplotlib.pyplot as plt
+```
+
+### Setting the model configuration
+
+The next step is to define the configuration for our model. First, we set the characteristics of our input image: its width, its height, the number of channels and - based on these - the input shape for one sample.
+
+We also specify batch size, the number of epochs, and the number of classes (47, because we now have capitalized and lowercase letters as well as digits!). The validation split - i.e. how much training data will be set apart for model validation - is set to 20%, and through verbosity mode, we output everything on screen.
+ +``` +# Model configuration +img_width, img_height, num_channels = 28, 28, 1 +input_shape = (img_height, img_width, num_channels) +batch_size = 250 +no_epochs = 25 +no_classes = 47 +validation_split = 0.2 +verbosity = 1 +``` + +### Loading and preparing data + +The first step in loading the data is to use the [Extra Keras Datasets](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/) module and call `load_data()`: + +``` +# Load EMNIST dataset +(input_train, target_train), (input_test, target_test) = emnist.load_data() +``` + +Next, we add the number of channels to the EMNIST dataset through a `reshape` operation, as they are traditionally not present: + +``` +# Add number of channels to EMNIST data +input_train = input_train.reshape((len(input_train), img_height, img_width, num_channels)) +input_test = input_test.reshape((len(input_test), img_height, img_width, num_channels)) +``` + +We then convert the data types into `float32` format, which presumably speeds up training: + +``` +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') +``` + +We then normalize the data: + +``` +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 +``` + +Finally, we convert the targets into categorical format, which allows us to use [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/): + +``` +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) +``` + +### The model part of a neural network + +We can next create the architecture for our Keras model. Depending on the regularizer you wish to use, you can choose one of the next combinations. Here, we'll show examples for: + +- L1 Kernel/Bias regularization; +- L1 Activity regularization; +- L2 Kernel/Bias regularization; +- L2 Activity regularization; +- Elastic Net Kernel/Bias regularization; +- Elastic Net Activity regularization. 
+ +Obviously, you're free to mix and match if desired :) + +#### L1 Kernel/Bias regularization + +Applying L1 regularization to the kernel and bias values goes as follows: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, kernel_regularizer=regularizers.l1(0.01), bias_regularizer=regularizers.l1(0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', kernel_regularizer=regularizers.l1(0.01), bias_regularizer=regularizers.l1(0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu', kernel_regularizer=regularizers.l1(0.01), bias_regularizer=regularizers.l1(0.01))) +model.add(Dense(no_classes, activation='softmax', kernel_regularizer=regularizers.l1(0.01), bias_regularizer=regularizers.l1(0.01))) +``` + +#### L1 Activity regularization + +Regularizing activity instead is also simple: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, activity_regularizer=regularizers.l1(0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', activity_regularizer=regularizers.l1(0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu', activity_regularizer=regularizers.l1(0.01))) +model.add(Dense(no_classes, activation='softmax', activity_regularizer=regularizers.l1(0.01))) +``` + +#### L2 Kernel/Bias regularization + +Switching from L1 to L2 regularization for your kernel and bias values is simply replacing L1 for L2: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, kernel_regularizer=regularizers.l2(0.01), bias_regularizer=regularizers.l2(0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', kernel_regularizer=regularizers.l2(0.01), bias_regularizer=regularizers.l2(0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu', kernel_regularizer=regularizers.l2(0.01), bias_regularizer=regularizers.l2(0.01))) +model.add(Dense(no_classes, activation='softmax', kernel_regularizer=regularizers.l2(0.01), bias_regularizer=regularizers.l2(0.01))) +``` + +#### L2 Activity regularization + +The same goes for activity regularization: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, activity_regularizer=regularizers.l2(0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', activity_regularizer=regularizers.l2(0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu', activity_regularizer=regularizers.l2(0.01))) +model.add(Dense(no_classes, activation='softmax', activity_regularizer=regularizers.l2(0.01))) +``` + +#### Elastic Net Kernel/Bias regularization + +Elastic net, or L1 + L2 regularization, can also be added easily to regularize kernels and biases: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 
3), activation='relu', input_shape=input_shape, kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01), bias_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01), bias_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu', kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01), bias_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))) +model.add(Dense(no_classes, activation='softmax', kernel_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01), bias_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))) +``` + +#### Elastic Net Activity regularization + +Once again, the same is true for activity regularization: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, activity_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', activity_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu', activity_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))) +model.add(Dense(no_classes, activation='softmax', activity_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))) +``` + +### Compiling the model + +We then `compile` the model to use categorical crossentropy loss and the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam). Accuracy is added as an additional metric, which is more understandable to humans: + +``` +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) +``` + +### Fitting the data + +Then, we `fit` the data to the model. Here, we set the configuration options that we defined earlier. It starts the training process: + +``` +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +### Adding generalization metrics + +Once the model has finished training, you'll want to test it with data that the model has never seen before. This is the `input_test` and `target_test` data available to us. By calling `model.evaluate` with this data, we get the results of testing it with the test data: + +``` +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +By means of the `history` object to which we assigned the output of `model.fit`, we can [visualize the training process](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/). This way, you can find out how the loss value and/or accuracy value has evolved over time, for both training and validation data. 
Here is the code that generates a plot for training/validation loss and training/validation accuracy values: + +``` +# Plot history: Loss +plt.plot(history.history['loss'], label='Training data') +plt.plot(history.history['val_loss'], label='Validation data') +plt.title('L1/L2 Activity Loss') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() + +# Plot history: Accuracy +plt.plot(history.history['accuracy'], label='Training data') +plt.plot(history.history['val_accuracy'], label='Validation data') +plt.title('L1/L2 Activity Accuracy') +plt.ylabel('%') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() +``` + +### Full model code + +It may be that you just want the model, in order to start playing around. For this purpose, here you've got the full model code at once - just replace the regularizers with the ones you need, possibly guided by the examples from above) ;) + +``` +import tensorflow.keras +from extra_keras_datasets import emnist +import numpy as np +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras import regularizers +import matplotlib.pyplot as plt + +# Model configuration +img_width, img_height, num_channels = 28, 28, 1 +input_shape = (img_height, img_width, num_channels) +batch_size = 250 +no_epochs = 25 +no_classes = 47 +validation_split = 0.2 +verbosity = 1 + +# Load EMNIST dataset +(input_train, target_train), (input_test, target_test) = emnist.load_data() + +# Add number of channels to EMNIST data +input_train = input_train.reshape((len(input_train), img_height, img_width, num_channels)) +input_test = input_test.reshape((len(input_test), img_height, img_width, num_channels)) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, activity_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', activity_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu', activity_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))) +model.add(Dense(no_classes, activation='softmax', activity_regularizer=regularizers.l1_l2(l1=0.01, l2=0.01))) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Plot history: Loss +plt.plot(history.history['loss'], label='Training data') +plt.plot(history.history['val_loss'], label='Validation data') 
+plt.title('L1/L2 Activity Loss')
+plt.ylabel('Loss value')
+plt.xlabel('No. epoch')
+plt.legend(loc="upper left")
+plt.show()
+
+# Plot history: Accuracy
+plt.plot(history.history['accuracy'], label='Training data')
+plt.plot(history.history['val_accuracy'], label='Validation data')
+plt.title('L1/L2 Activity Accuracy')
+plt.ylabel('%')
+plt.xlabel('No. epoch')
+plt.legend(loc="upper left")
+plt.show()
+```
+
+* * *
+
+## Results
+
+The results, which were obtained with regularizers having \[latex\]\\lambda = 0.01\[/latex\] (except for one, the Extended L2 regularizer), suggest a few things:
+
+- **On no regularization:** results are quite good. It serves as a baseline and has these evaluation metrics: `Test loss: 0.4031164909011506 / Test accuracy: 0.8728723526000977`.
+- **On L1 regularization:** For EMNIST data, the assumption that sparsity must be introduced into the model seems to be **false**. I'm not sure, but perhaps this can be generalized to many image related problems (do you have any experience? Tell me by leaving a comment!). As we can see, both L1 Kernel/Bias and Activity regularization produce very poor results.
+- **On L2 regularization**: results are good, with accuracies of 85%+ with the activity regularizer. Results are a bit lower with the kernel/bias regularizers. The evaluation metrics for the L2 activity regularizer based model: `Test loss: 0.37115383783553507 / Test accuracy: 0.8901063799858093`.
+- **On L2 regularization vs No regularization:** L2 regularization with \[latex\]\\lambda = 0.01\[/latex\] results in a model that has a lower test loss and a higher accuracy (a 2 percentage point increase).
+- **On extended L2 regularization:** to find out whether this effect gets stronger with an increased impact of the regularizer, we retrained the L2 Activity regularized model with \[latex\]\\lambda = 0.10\[/latex\]. The evaluation metrics: `Test loss: 0.5058084676620808 / Test accuracy: 0.8836702108383179`. Loss is clearly worse.
+- **On Elastic Net regularization:** here, results are poor as well. Apparently, the false sparsity assumption also leads to very poor results here, due to the L1 component of the Elastic Net regularizer. Fortunately, L2 works!
+
+Next, you'll find all the `history` based [plots of the training process](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/) for each regularizer / regularizer combination created above.
+ +### No regularization + +- [![](images/no_a.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/no_a.png) + +- [![](images/no_l.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/no_l.png) + + +### L1 Kernel/Bias regularization + +- [![](images/l1_kb_a.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l1_kb_a.png) + +- [![](images/l1_kb.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l1_kb.png) + + +### L1 Activity regularization + +- [![](images/l1_a_a.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l1_a_a.png) + +- [![](images/l1_a.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l1_a.png) + + +### L2 Kernel/Bias regularization + +- [![](images/l2_a.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l2_a.png) + +- [![](images/l2_kb.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l2_kb.png) + + +### L2 Activity regularization + +- [![](images/l2_a_a.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l2_a_a.png) + +- [![](images/l2_a_l.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l2_a_l.png) + + +### Extended L2 Activity regularization + +Here, \[latex\]\\lambda = 0.10\[/latex\], to find out whether the increased impact of the regularizer improves the model. + +- [![](images/extended_a.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/extended_a.png) + +- [![](images/extended_l.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/extended_l.png) + + +### Elastic Net Kernel/Bias regularization + +- [![](images/l1l2_a.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l1l2_a.png) + +- [![](images/l1l2_l.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l1l2_l.png) + + +### Elastic Net Activity regularization + +- [![](images/l1_l2_a_a.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l1_l2_a_a.png) + +- [![](images/l1_l2_a.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l1_l2_a.png) + + +* * * + +## Summary + +In this blog post, you've seen examples of how to implement L1, L2 and Elastic Net Regularizers with Keras. We saw that various Keras layer types support the regularizers, and that they can be applied at the level of kernels and biases, but also at the level of layer activations. This all was preceded by a recap on the concept of a regularizer, and why we need them in the first place. + +By doing so, I hope that I've provided a blog post which helps you to create regularized Keras models. Please let me know if it was useful by leaving a comment in the comments box below 😊👇 Please do the same if you have questions or remarks, or when you spot a mistake, so that I can improve the blog post. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. _Journal of the royal statistical society: series B (statistical methodology)_, _67_(2), 301-320. + +MachineCurve. (2020, January 21). What are L1, L2 and Elastic Net Regularization in neural networks? Retrieved from [https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks) + +TensorFlow. (2021). _Module: Tf.keras.regularizers_. 
[https://www.tensorflow.org/api\_docs/python/tf/keras/regularizers](https://www.tensorflow.org/api_docs/python/tf/keras/regularizers) + +Tonutti, M. (2017). Keras: Difference between Kernel and Activity regularizers. Retrieved from [https://stackoverflow.com/questions/44495698/keras-difference-between-kernel-and-activity-regularizers](https://stackoverflow.com/questions/44495698/keras-difference-between-kernel-and-activity-regularizers) + +Bäuerle, A., & Ropinski, T. (2019). [Net2Vis: Transforming Deep Convolutional Networks into Publication-Ready Visualizations](https://arxiv.org/abs/1902.04394). arXiv preprint arXiv:1902.04394. + +Google Developers. (n.d.). Regularization for Simplicity: Lambda. Retrieved from [https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/lambda](https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/lambda) diff --git a/how-to-use-l1-l2-and-elastic-net-regularization-with-pytorch.md b/how-to-use-l1-l2-and-elastic-net-regularization-with-pytorch.md new file mode 100644 index 0000000..1667c9a --- /dev/null +++ b/how-to-use-l1-l2-and-elastic-net-regularization-with-pytorch.md @@ -0,0 +1,431 @@ +--- +title: "How to use L1, L2 and Elastic Net regularization with PyTorch?" +date: "2021-07-21" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "elastic-net-regularization" + - "l1-regularization" + - "l1l2-regularization" + - "l2-regularization" + - "machine-learning" + - "model-complexity" + - "neural-networks" + - "regularization" + - "regularizer" +--- + +Training a neural network means that you will need to strike a balance between _optimization_ and _over-optimization_. Over-optimized models work really well on your training set, but due to their complexity - by taking the oddities within a training dataset as part of the mapping that is to be performed - they can fail really hard when the model is used in production. + +Regularization techniques can be used to mitigate these issues. In this article, we're going to take a look at [L1, L2 and Elastic Net Regularization](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/). Click on the previous link to understand them in more detail in terms of theory, because this article focuses on their implementation in PyTorch. After reading it, you will... + +- **Understand why you need regularization in your neural network.** +- **See how L1, L2 and Elastic Net (L1+L2) regularization work in theory.** +- **Be able to use L1, L2 and Elastic Net (L1+L2) regularization in PyTorch, by means of examples.** + +Ready? Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## Why you need regularization + +Training a neural network involves creating a mapping between an array of input variables \[latex\]\\textbf{x}\[/latex\] to an independent variable, often called \[latex\]\\text{y}\[/latex\]. Recall that a mapping between such variables can be expressed mathematically, and that a mapping is represented by a function - say, \[latex\]f\[/latex\]. In this case, the mapping of the actual function is as follows: \[latex\]\\text{y}: f(\\textbf{x})\[/latex\]. + +The way the mapping is performed is dependent on the way that you create it, or _fit_ it. For example, in the image below, we generated two such mappings using exactly the same input data - the set of points. The first is a polyfit with three degrees of freedom, creating the yellow line. 
The second has ten degrees of freedom, creating the blue line.
+
+Which mapping is more realistic, you say? Yellow or blue?
+
+If you said yellow, you're right. Such extremities in mappings that are visible in the blue one are often very unlikely to be true, and likely occur due to excessive sensitivity of the model to oddities in your data set.
+
+![](images/poly_both.png)
+
+Training a neural network involves using your input data (the set of \[latex\]\\textbf{x}\[/latex\]s) to generate predictions for each sample (the corresponding set of \[latex\]\\text{y}\[/latex\]). The network has trainable components that can jointly attempt to approximate the mapping, \[latex\]\\text{y}: f(\\textbf{x})\[/latex\]. The approximation is then called \[latex\]\\hat{\\text{y}}: f(\\textbf{x})\[/latex\], from _y hat_.
+
+When [feeding forward our samples and optimizing our model](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) we do not know whether our model will learn a mapping like the one in yellow or the one in blue. Rather, it will learn a mapping that minimizes the [loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#loss). This can lead to a situation where a mapping like the one in blue is learned, while such extremities are unwanted.
+
+Adding **regularization** to your neural network, and specifically to the computed loss values, can help you in guiding the model towards learning a mapping that looks more like the one in yellow. After the loss (i.e., the model error) has been computed for every forward pass, regularization adds _another value_ to the loss function - and this value is higher when the model is more complex, while lower when it is less complex. In other words, the model is punished for complexity. This leads to a trained model that is as good as it can be while being as simple as it can be at the same time.
+
+Beyond [Dropout](https://www.machinecurve.com/index.php/2021/07/07/using-dropout-with-pytorch/), which is another mechanism for regularization, there are three main candidates that are used frequently:
+
+- **L1 Regularization**, also called Lasso Regularization, involves adding the absolute value of all weights to the loss value.
+- **L2 Regularization**, also called Ridge Regularization, involves adding the squared value of all weights to the loss value.
+- **Elastic Net Regularization**, which combines L1 and L2 Regularization in a weighted way.
+
+Now that we understand what regularization is and which key regularizers there are, let's take a closer look at each - including examples for implementing them with PyTorch.
+
+Let's get to work! 😎
+
+* * *
+
+## Example of L1 Regularization with PyTorch
+
+Suppose that you are using binary crossentropy loss with your PyTorch based classifier. You want to implement **L1 Regularization**, which effectively means that \[latex\]\\sum\_{i=1}^{n} | w\_i |\[/latex\] is added to the loss.
+
+Here, \[latex\]n\[/latex\] represents the number of individual weights, and you can see that we iterate over these weights. We then take the absolute value for each value \[latex\]w\_i\[/latex\] and sum everything together.
+
+In other words, L1 Regularization loss can be implemented as follows:
+
+\[latex\]\\text{full\_loss = original\_loss + } \\sum\_{i=1}^{n} | w\_i |\[/latex\]
+
+Here, `original_loss` is binary crossentropy. However, it can be pretty much any loss function that you desire!
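+
+In code terms, and ahead of the full example below, the idea boils down to something like this minimal sketch (here, `model`, `original_loss` and `l1_weight` are just placeholder names for illustration; the complete, runnable version follows in the next code block):
+
+```
+# Hypothetical sketch: add the L1 penalty of all trainable parameters to the loss
+l1_weight = 0.01  # assumed regularization strength
+l1_penalty = sum(parameter.abs().sum() for parameter in model.parameters())
+full_loss = original_loss + l1_weight * l1_penalty
+```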
+ +Implementing **L1 Regularization** with PyTorch can be done in the following way. + +- We specify a class `MLP` that extends PyTorch's `nn.Module` class. In other words, it's a neural network using PyTorch. +- To the class, we add a `def` called `compute_l1_loss`. This is an implementation of taking the absolute value and summing all values for `w` in a particular trainable parameter. +- In the training loop specified subsequently, we specify a L1 weight, collect all parameters, compute L1 loss, and add it to the loss function before error backpropagation. +- We also print the L1 component of our loss when printing statistics. + +Here is the full example for L1 Regularization with PyTorch: + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + def compute_l1_loss(self, w): + return torch.abs(w).sum() + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare CIFAR-10 dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Compute L1 loss component + l1_weight = 1.0 + l1_parameters = [] + for parameter in mlp.parameters(): + l1_parameters.append(parameter.view(-1)) + l1 = l1_weight * mlp.compute_l1_loss(torch.cat(l1_parameters)) + + # Add L1 loss component + loss += l1 + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + minibatch_loss = loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.5f (of which %.5f L1 loss)' % + (i + 1, minibatch_loss, l1)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +* * * + +## Example of L2 Regularization with PyTorch + +Implementing **L2 Regularization** with PyTorch is also easy. Understand that in this case, we don't take the absolute value for the weight values, but rather their squares. In other words, we add \[latex\]\\sum\_f{ \_{i=1}^{n}} w\_i^2\[/latex\] to the loss component. In the example below, you can find how L2 Regularization can be used with PyTorch: + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + def compute_l2_loss(self, w): + return torch.square(w).sum() + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare CIFAR-10 dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Compute l2 loss component + l2_weight = 1.0 + l2_parameters = [] + for parameter in mlp.parameters(): + l2_parameters.append(parameter.view(-1)) + l2 = l2_weight * mlp.compute_l2_loss(torch.cat(l2_parameters)) + + # Add l2 loss component + loss += l2 + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + minibatch_loss = loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.5f (of which %.5f l2 loss)' % + (i + 1, minibatch_loss, l2)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### Different way of adding L2 loss + +L2 based weight decay can also be implemented by setting a delta value for `weight_decay` in the optimizer. + +> **weight\_decay** ([_float_](https://docs.python.org/3/library/functions.html#float)_, optional_) – weight decay (L2 penalty) (default: 0) +> +> PyTorch (n.d.) + +For example: + +``` +optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4, weight_decay=1.0) +``` + +* * * + +## Example of Elastic Net (L1+L2) Regularization with PyTorch + +It is also possible to perform **Elastic Net Regularization** with PyTorch. This type of regularization essentially computes a **weighted combination of L1 and L2 loss**, with the weights of both summing to `1.0`. In other words, we add \[latex\]\\lambda\_{L1} \\times \\sum\_f{ \_{i=1}^{n}} | w\_i | + \\lambda\_{L2} \\times \\sum\_f{ \_{i=1}^{n}} w\_i^2\[/latex\] to the loss component: + +\[latex\]\\text{full\_loss = original\_loss + } \\lambda\_{L1} \\times \\sum\_f{ \_{i=1}^{n}} | w\_i | + \\lambda\_{L2} \\times \\sum\_f{ \_{i=1}^{n}} w\_i^2 \[/latex\] + +In this example, Elastic Net (L1 + L2) Regularization is implemented with PyTorch: + +- You can see that the MLP class representing the neural network provides two `def`s which are used to compute L1 and L2 loss, respectively. +- In the training loop, these are applied, in a weighted fashion (with weights of 0.3 and 0.7, respectively). +- The loss components are also printed on-screen when the statistics are printed. + +``` +class MLP(nn.Module): + ''' + Multilayer Perceptron. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + def compute_l1_loss(self, w): + return torch.abs(w).sum() + + def compute_l2_loss(self, w): + return torch.square(w).sum() + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare CIFAR-10 dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Specify L1 and L2 weights + l1_weight = 0.3 + l2_weight = 0.7 + + # Compute L1 and L2 loss component + parameters = [] + for parameter in mlp.parameters(): + parameters.append(parameter.view(-1)) + l1 = l1_weight * mlp.compute_l1_loss(torch.cat(parameters)) + l2 = l2_weight * mlp.compute_l2_loss(torch.cat(parameters)) + + # Add L1 and L2 loss components + loss += l1 + loss += l2 + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + minibatch_loss = loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.5f (of which %.5f L1 loss; %0.5f L2 loss)' % + (i + 1, minibatch_loss, l1, l2)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +* * * + +## Summary + +By reading this article, you have... + +- **Understood why you need regularization in your neural network.** +- **Seen how L1, L2 and Elastic Net (L1+L2) regularization work in theory.** +- **Been able to use L1, L2 and Elastic Net (L1+L2) regularization in PyTorch, by means of examples.** + +I hope that this article was useful for you! :) If it was, please feel free to let me know through the comments section 💬 Please let me know as well if you have any questions or other remarks. Where necessary, I will make sure to adapt the article. + +What remains is to thank you for reading MachineCurve today. Happy engineering! 😎 + +* * * + +## Sources + +PyTorch. (n.d.). _Adam — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam) + +StackOverflow. (n.d.). _L1 norm as regularizer in Pytorch_. Stack Overflow. [https://stackoverflow.com/questions/46797955/l1-norm-as-regularizer-in-pytorch](https://stackoverflow.com/questions/46797955/l1-norm-as-regularizer-in-pytorch) diff --git a/how-to-use-lisht-activation-function-with-keras.md b/how-to-use-lisht-activation-function-with-keras.md new file mode 100644 index 0000000..9ea8efe --- /dev/null +++ b/how-to-use-lisht-activation-function-with-keras.md @@ -0,0 +1,532 @@ +--- +title: "How to use LiSHT activation function with TensorFlow 2 based Keras?" 
+date: "2019-11-17" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "activation-function" + - "deep-learning" + - "keras" + - "lisht" + - "tensorflow" +--- + +It's very likely that you will use the [ReLU activation function](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/) when creating a neural network. This is unsurprising, since there is a vast landscape of literature that suggests that ReLU performs better than today's other two standard activation functions, Sigmoid and Tanh. + +Nevertheless, ReLU has its shortcomings: when you don't configure your network properly, or when you use data that is not normalized before training, the outputs of your neurons may swing substantially during the first phases of training. Since gradients for ReLU are either zero or one, it may be that you cannot escape _zeroes_ when your initial neuron outputs are really small. We then call your neuron dead, and with many dead neurons, you essentially deprive your neural network from its ability to achieve acceptable performance. + +Fortunately, new activation functions have been designed that attempt to reduce the impact of this inherent shortcoming of the ReLU activation function. For example, [Swish](https://www.machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/) was designed to make ReLU more smooth. However, [LiSHT](https://www.machinecurve.com/index.php/2019/11/17/beyond-swish-the-lisht-activation-function/) is a very new activation function that attempts to reduce the ReLU shortcomings indirectly. It essentially manipulates the Sigmoid function (which does not result in dying neurons, but in [vanishing gradients](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) instead - which is just as worse). Fortunately, with LiSHT, the impact of this vanishing gradients problem is much less severe, and it may thus be a good candidate that hovers between ReLU and Sigmoid. + +But if LiSHT is to gain traction in the machine learning community, it must be usable for your own machine learning projects. This renders the question: **how can LiSHT be implemented with Keras**? Precisely the question that we'll attempt to answer with this blog. + +First, we provide a brief recap about LiSHT, although this will primarily be a reference [to our other blog post](https://www.machinecurve.com/index.php/2019/11/17/beyond-swish-the-lisht-activation-function/). Subsequently, we'll use the Keras deep learning framework to implement LiSHT into a [ConvNet](https://www.machinecurve.com/index.php/2019/10/18/a-simple-conv3d-example-with-keras/) that is trained for [classifying the MNIST image dataset](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/). Before wrapping up, we'll also show how the model performs - and compare it to the results of _standard_ ReLU activation as well as _Leaky_ ReLU. + +After reading this tutorial, you will... + +- **Understand what the LiSHT activation function is, and how it can be useful.** +- **Know how you can define your own LiSHT function with TensorFlow 2 / Keras.** +- **See how to use this activation functions in a real TensorFlow 2 / Keras model.** + +Let's go! 😄 + +**Update 17/Mar/2021:** updated the article to ensure that it is up-to-date in 2021. Checked the article for renewal and updated the code so that it can be used with TensorFlow 2. 
+ +* * * + +\[toc\] + +* * * + +## Full code example: LiSHT with TensorFlow and Keras + +It can be the case that you want a quick and full example where the **LiSHT activation function is applied**. Below, you can see a fully working example for TensorFlow 2 based Keras. If you want to understand LiSHT in mored detail, or want to find out how all the code works, then make sure to read the rest of this tutorial as well! 🚀 + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +import numpy as np +import matplotlib.pyplot as plt + +# LiSHT +def LiSHT(x): + return x * tensorflow.math.tanh(x) + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data: [0, 1]. +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation=LiSHT, input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation=LiSHT)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation=LiSHT)) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Plot history: Crossentropy loss +plt.plot(history.history['loss'], label='Crossentropy loss (training data)') +plt.plot(history.history['val_loss'], label='Crossentropy loss (validation data)') +plt.title('Crossentropy loss') +plt.ylabel('Loss value') +plt.xlabel('Epochs') +plt.legend(loc="upper left") +plt.show() + +# Plot history: Accuracies +plt.plot(history.history['accuracy'], label='Accuracy (training data)') +plt.plot(history.history['val_accuracy'], label='Accuracy (validation data)') +plt.title('Accuracies') +plt.ylabel('Accuracy') +plt.xlabel('Epochs') +plt.legend(loc="upper left") +plt.show() +``` + +* * * + +## Recap: what is LiSHT? + +LiSHT is a relatively new activation function, proposed by Roy et al. in their early 2019 [paper on ArXiv](https://arxiv.org/abs/1901.05894). 
It stands for a **Linearly Scaled Hyperbolic Tangent** and is non-parametric in the sense that `tanh(x)` is scaled linearly with `x` without the need for manual configuration by means of some parameter. + +Its formula - \[latex\]LiSHT(x) = x \\times tanh(x)\[/latex\] leads to the following visualization, where LiSHT is visualized in green: + +[![](images/lisht_visualized-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lisht_visualized.png) + +In terms of the derivative, this has the effect that the _range_ of the derivative function - and hence the computed gradients - is expanded. This is expected to reduce the impact of the vanishing gradients problem. I'd recommend to read [MachineCurve's other blog post](https://www.machinecurve.com/index.php/2019/11/17/beyond-swish-the-lisht-activation-function/) for more information about the theoretical aspects of the LiSHT activation function. + +[![](images/lisht_derivs-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lisht_derivs.png) + +* * * + +## Creating your own activation function with Keras + +In this blog post, we'll focus on how to implement LiSHT with Keras instead. Keras, the deep learning framework for Python that I prefer due to its flexibility and ease of use, supports the creation of custom activation functions. You can do so by creating a regular Python definition and subsequently assigning this def as your activation function. + +For example, for LiSHT: + +``` +# LiSHT +def LiSHT(x): + return x * K.tanh(x) +``` + +Where `K` is the Keras backend, imported as `from keras import backend as K` . + +Note that using Numpy directly does not work when creating a custom function with Keras - you'll run into the following error: + +``` +NotImplementedError: Cannot convert a symbolic Tensor (2nd_target:0) to a numpy array. +``` + +The fix is simple - replace your Numpy based tanh (i.e. `np.tanh(x)`) with the Keras based one - `K.tanh(x)`. Contrary to Numpy, the `K` backend performs the tanh operation at the tensor level. + +Subsequently, you can use the created def in arbitrary Keras layers - e.g. with the Sequential API: + +``` +model.add(Dense(256, activation=LiSHT)) +``` + +* * * + +## Creating your LiSHT model with Keras + +Now that we've found how to create custom activation functions with Keras, we can start working on our LiSHT CNN. First, we'll take a look at the dataset we're going to use today, and the model architecture that we're going to create. Subsequently, we create the model, start training and discuss model performance. + +### Today's dataset + +A supervised machine learning model requires a dataset that can be used for training. For the sake of simplicity, we use a relatively simple one today: the MNIST dataset. + +This dataset contains thousands of handwritten digits - i.e., numbers between 0 and 9 - that are all 28 by 28 pixels. It is one of the standard datasets used in computer vision education for its simplicity and extensiveness, and hence is a good candidate for explaining how to create the model. + +[![](images/mnist-visualize.png)](https://www.machinecurve.com/wp-content/uploads/2019/06/mnist-visualize.png) + +What's even better: the Keras API contains a pointer to the MNIST dataset already. That is, we can import the data and assign it to Python variables quite easily - by calling `load_data` with some Keras function. This is also why it's good to use MNIST in an educational setting. + +All right, let's now find out what you need in order to run the model. 
+ +### What you'll need to run the model + +Put simply, these are the software requirements for running this Keras model: + +- **Keras itself** - which is obvious. +- By consequence, you'll also need to install **Python**, preferably version 3.6+. +- You also need one of the backends: **Tensorflow, Theano or CNTK**. We prefer Tensorflow, since it has been integrated deeply in today's Keras versions (strictly speaking, it's the other way around, but OK). +- Finally, you'll also need **Numpy** and **Matplotlib** for data processing and visualization purposes. + +### Let's go: stating our imports + +Now that we know what we need, we can actually create our model. Open up your file explorer, navigate to a directory of your choice and create a Python file, such as `model_lisht.py`. Next, open this file in your code editor of choice (which preferably supports Python syntax highlighting). We can now start coding! 😄 + +We first define our imports: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +import numpy as np +import matplotlib.pyplot as plt +``` + +As you can see, they relate strongly to the imports specified in the previous section. You'll import TensorFlow, the MNIST dataset, the Sequential API, and a variety of Keras layers. + +Finally, you import Numpy and Matplotlib - as said, for data processing and visualization purposes. + +### Defining LiSHT and model configuration + +The next thing we do is defining LiSHT in terms of a Python definition: + +``` +# LiSHT +def LiSHT(x): + return x * tensorflow.math.tanh(x) +``` + +Quite simple, actually - we transform some input \[latex\]x\[/latex\] into an output that follows the LiSHT equation of \[latex\]x \\times tanh(x)\[/latex\]. + +For doing so, we use `tensorflow.math` based `tanh` because it can run with Tensors adequately. + +Subsequently, we add variables for model configuration: + +``` +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +``` + +As MNIST images are 28 pixels wide and 28 pixels high, we specify `img_height` and `img_width` to be 28. We also use a batch size of 250, which means that - even though we don't truly use the gradient descent optimizer - we're taking a [minibatch](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) approach. + +Twenty-five epochs are used for training. This is just a fixed number and is based on my estimate that with a relatively simple dataset quite accurate performance must be achievable without extensive training. In your own projects, you must obviously configure the number of epochs to an educated estimate of your own, or use smart techniques like [EarlyStopping](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/) instead. + +We use 20% of our training data for validation purposes and set model verbosity to `True` (by means of '1'), essentially outputting everything on screen. This is useful for educational settings, but slightly slows down the training process. Choose wisely in your own project :) + +### Importing and preparing data + +We next import our data. 
As said before, this is essentially a one-line statement due to the way the MNIST dataset is integrated in TensorFlow / the Keras library: + +``` +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() +``` + +When running, it downloads the dataset automatically, and if you downloaded it before, it will use your cache to speed up the training process. + +Next, we add code for data processing and preparation: + +``` +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data: [0, 1]. +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) +``` + +This code essentially: + +- Reshapes data based using the channels-last strategy by default required in TensorFlow 2. +- Parses numbers as floats, which is estimated to speed up the training process (Quora, n.d.). +- Normalizes the data. +- Converts target vectors into categorical format, allowing us to use [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) for evaluating training and validation performance. + +### Creating our model architecture + +Next, we specify the architecture of our model. We use two convolutional blocks with max pooling and dropout, as well as two densely-connected layers. Please [refer to this post](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) if you wish to understand these blocks in more detail. Here's the code: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation=LiSHT, input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation=LiSHT)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation=LiSHT)) +model.add(Dense(no_classes, activation='softmax')) +``` + +See how we're using LiSHT here? + +We simply add the `LiSHT` Python definition to the layers by specifying it as the `activation` attribute. Note that we omit the quotes (`'`) i.e. we don't supply the definition as Strings, but as the definition. This allows Keras to directly use our custom LiSHT activation function. + +### Compiling model and starting training + +Next, we compile our model with the hyperparameters set in the _model configuration_ section and start our training process. + +We store the training history in the `history` object, for [visualizing model performance](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/) over time. 
+ +``` +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +### Evaluation metrics and visualizations + +We next add code for testing the _generalization power_ of our model and for visualizing the model history: + +``` +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Plot history: Crossentropy loss +plt.plot(history.history['loss'], label='Crossentropy loss (training data)') +plt.plot(history.history['val_loss'], label='Crossentropy loss (validation data)') +plt.title('Crossentropy loss') +plt.ylabel('Loss value') +plt.xlabel('Epochs') +plt.legend(loc="upper left") +plt.show() + +# Plot history: Accuracies +plt.plot(history.history['accuracy'], label='Accuracy (training data)') +plt.plot(history.history['val_accuracy'], label='Accuracy (validation data)') +plt.title('Accuracies') +plt.ylabel('Accuracy') +plt.xlabel('Epochs') +plt.legend(loc="upper left") +plt.show() +``` + +### Full model code + +...and finally arrive at the full model as specified as follows: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +import numpy as np +import matplotlib.pyplot as plt + +# LiSHT +def LiSHT(x): + return x * tensorflow.math.tanh(x) + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data: [0, 1]. 
+input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation=LiSHT, input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation=LiSHT)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation=LiSHT)) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Plot history: Crossentropy loss +plt.plot(history.history['loss'], label='Crossentropy loss (training data)') +plt.plot(history.history['val_loss'], label='Crossentropy loss (validation data)') +plt.title('Crossentropy loss') +plt.ylabel('Loss value') +plt.xlabel('Epochs') +plt.legend(loc="upper left") +plt.show() + +# Plot history: Accuracies +plt.plot(history.history['accuracy'], label='Accuracy (training data)') +plt.plot(history.history['val_accuracy'], label='Accuracy (validation data)') +plt.title('Accuracies') +plt.ylabel('Accuracy') +plt.xlabel('Epochs') +plt.legend(loc="upper left") +plt.show() +``` + +* * * + +## How well does LiSHT perform? + +Here, we'll take a look how well LiSHT performs. Consider these checks to be relatively quick in nature - using the MNIST (and hence a simple) dataset, using 25 epochs only, without any statistical tests whatsoever. + +They thus do not say _everything_ about how well LiSHT performs, but give you an idea. First, I compare LiSHT with traditional ReLU, by retraining the TensorFlow 2 / Keras based CNN we created before and comparing histories. Subsequently, I compare LiSHT to Leaky ReLU, also by retraining the particular CNN. Let's find out how well it performs! + +### LiSHT performance in general + +This is how well LiSHT performs as a baseline: + +[![](images/lisht_ce_loss.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lisht_ce_loss.png) + +[![](images/lisht_accuracy.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lisht_accuracy.png) + +These graphs seem to be quite normal: fast-increasing/fast-decreasing accuracy and loss values at first, slowing down when the number of epochs increase. LiSHT also generalizes well with the MNIST dataset, achieving test accuracy of 99.2%. + +``` +Test loss: 0.02728017502297298 / Test accuracy: 0.9922000169754028 +``` + +### Comparing LiSHT to ReLU + +I compared LiSHT with ReLU by training the same ConvNet with both LiSHT and ReLU, with exactly the same settings. 
These are the results:
+
+- [![](images/lisht_relu_acc.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lisht_relu_acc.png)
+    
+- [![](images/lisht_relu_ce.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lisht_relu_ce.png)
+    
+
+Results are close, but in terms of _validation_ loss and accuracy, ReLU seems to beat LiSHT.
+
+But if you use a model in practice, it's likely that you're more interested in generalization power - and we also have these metrics.
+
+```
+LiSHT - Test loss: 0.023941167211306312 / Test accuracy: 0.9930999875068665
+ReLU - Test loss: 0.025442042718966378 / Test accuracy: 0.9922999739646912
+```
+
+Even though the difference is small (only ~0.08%), LiSHT performs better than ReLU in this case. What's more, it seems that it is not only more confident about its predictions, but also produces slightly better results, as indicated by the lower test loss value. This is promising, but possibly not statistically significant.
+
+In other words, LiSHT and ReLU do not produce substantially different results in a simple training scenario like this one. Let's now take a look at Leaky ReLU performance vs LiSHT.
+
+### Comparing LiSHT to Leaky ReLU
+
+- [![](images/lisht_leaky_acc.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lisht_leaky_acc.png)
+    
+- [![](images/lisht_leaky_ce.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lisht_leaky_ce.png)
+    
+
+The differences are a bit larger when comparing LiSHT with Leaky ReLU. The new activation function performs better, as can be seen in the plots. This is also visible when testing the models with our test dataset:
+
+```
+LiSHT - Test loss: 0.031986971908376474 / Test accuracy: 0.9905999898910522
+Leaky ReLU - Test loss: 0.04289412204660111 / Test accuracy: 0.9879000186920166
+```
+
+* * *
+
+## Summary
+
+In this blog post, we implemented the LiSHT activation function with TensorFlow 2, using the Keras deep learning library. Empirically, with a simple test, we showed that it performs well compared to ReLU, and even better compared to Leaky ReLU. Note that this may mean that it does not matter much whether you use ReLU or LiSHT when you don't face the dying ReLU problem. If you do face it, however, it might be the case that LiSHT actually gives you better results, given its expanded derivative range, while not being too sensitive to the vanishing gradients problem. However, that's for another time 😊
+
+Thanks for reading MachineCurve today, and I hope you've learnt something! If you did, please feel free to leave a comment below 👇 I'll happily respond to your comments and answer any questions you may have.
+
+Thanks again - and happy engineering! 😎
+
+* * *
+
+## References
+
+Roy, S. K., Manna, S., Dubey, S. R., & Chaudhuri, B. B. (2019). LiSHT: Non-Parametric Linearly Scaled Hyperbolic Tangent Activation Function for Neural Networks. [_arXiv preprint arXiv:1901.05894_.](https://arxiv.org/abs/1901.05894)
+
+Quora. (n.d.). When should I use tf.float32 vs tf.float64 in TensorFlow? Retrieved from [https://www.quora.com/When-should-I-use-tf-float32-vs-tf-float64-in-TensorFlow](https://www.quora.com/When-should-I-use-tf-float32-vs-tf-float64-in-TensorFlow)
diff --git a/how-to-use-logcosh-with-keras.md b/how-to-use-logcosh-with-keras.md
new file mode 100644
index 0000000..69f7fe2
--- /dev/null
+++ b/how-to-use-logcosh-with-keras.md
@@ -0,0 +1,340 @@
+---
+title: "How to use Logcosh with TensorFlow 2 and Keras?"
+date: "2019-10-23"
+categories:
+  - "buffer"
+  - "deep-learning"
+  - "frameworks"
+tags:
+  - "deep-learning"
+  - "keras"
+  - "logcosh"
+  - "loss-function"
+  - "machine-learning"
+  - "regression"
+---
+
+There are two main branches in the domain of supervised machine learning problems: _classification_ and _regression_. While you assign a sample to a fixed set of groups with classification, you're doing something very different when regressing. In fact, your regression model estimates a numeric value for the sample, such as the predicted oxygen level given certain input values (for example, the number of people currently in the room).
+
+In order to [train your machine learning model](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), you need to optimize it. That is, the model adapts itself iteratively, based on the inputs on the left (which you feed through the model) and [a loss function on the right](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#loss-functions), which computes how far off the model's predictions are from the actual targets.
+
+For regression problems, there is a wide array of well-known loss functions that can be used. MAE, MSE, RMSE, MAPE - they're all usable in such problems, but all have their drawbacks. MAE, for example, is rather insensitive when the average error is small; MSE, on the other hand, lets the computed error explode when you have outliers in your dataset, substantially distorting the computed error.
+
+Another loss function, which attempts to combine the best of both worlds, is the **Logcosh loss function**. It works like the MSE, but is smoothed towards large errors (presumably caused by outliers) so that the final error score isn't impacted as heavily.
+
+In this blog post, we will first introduce the Logcosh loss intuitively. We do so by providing the maths, the function plot, and an intuitive explanation as to what happens under the hood.
+
+Subsequently, we provide an implementation of a regression model with Keras that makes use of Logcosh loss. Beyond creating the model, we will also run it, discuss model performance, and summarize our observations so that you can make a proper choice about the loss function to use.
+
+_Note that the full code for the models we create in this blog post is also available through my [Keras Loss Functions repository](https://github.com/christianversloot/keras-loss-functions) on GitHub._
+
+**After reading this article, you will understand...**
+
+- How Logcosh loss works.
+- Why Logcosh loss can work better than MSE.
+- How to implement Logcosh loss with TensorFlow 2.
+
+**If you wish to understand loss functions in more detail...**
+
+- [All our blogs about loss functions, some with Keras implementations.](https://www.machinecurve.com/index.php/tag/loss-function/)
+
+**Updates:**
+
+- 01/Mar/2021: updated code examples to reflect TensorFlow 2, ensuring that the code can be used with the recent major TensorFlow version. Also made some textual and structural improvements to the article.
+
+* * *
+
+\[toc\]
+
+* * *
+
+## Code example: Logcosh with TensorFlow 2 based Keras
+
+Logcosh loss can be configured in the model compilation step, i.e. in `model.compile`. In the code example below, you can see how Logcosh loss is used with TensorFlow. Make sure to read the rest of the article to understand the loss function and its use in more detail.
+ +``` +# Configure the model and start training +model.compile(loss='logcosh', optimizer='adam', metrics=['mean_absolute_error']) +history = model.fit(x_train, y_train, epochs=250, batch_size=1, verbose=1, validation_split=0.2) +``` + +* * * + +## Intro: Logcosh loss + +Let's first cover Logcosh loss intuitively. + +“Log-cosh is the logarithm of the hyperbolic cosine of the prediction error.” (Grover, 2019). Oops, that's not intuitive but nevertheless quite important - this is **the maths behind Logcosh loss:** + +![](images/image-3.png) + +Don't be scared by the maths though, because we'll discuss Logcosh by means of its visualization - which is this one: + +[![](images/logcosh-1024x433.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/10/logcosh.jpeg) + +As you can see, Logcosh loss for some target value (in this case, target = 0), is zero at the target value, and increases when the predicted value is further away from the target value. + +The [TensorFlow docs](https://www.tensorflow.org/api_docs/python/tf/keras/losses/logcosh) write this about Logcosh loss: + +> `log(cosh(x))` is approximately equal to `(x ** 2) / 2` for small `x` and to `abs(x) - log(2)` for large `x`. This means that 'logcosh' works mostly like the mean squared error, but will not be so strongly affected by the occasional wildly incorrect prediction. +> +> Source: [TensorFlow docs](https://www.tensorflow.org/api_docs/python/tf/keras/losses/logcosh), taken from [About loss and loss functions](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#logcosh) + +It is therefore something like the MSE when you're training a regression model, but then with a degree of built-in protection against "wildly incorrect predictions" that are likely caused by outlier samples. + +This interesting property will be tested today, since we'll now start implementing our model with Keras :-) + +* * * + +## Today's dataset + +But first, the dataset. + +We will be using the **[Boston Housing Prices Regression](https://keras.io/datasets/#boston-housing-price-regression-dataset)** [dataset](https://keras.io/datasets/#boston-housing-price-regression-dataset), which is one of the datasets that is available in the Keras API by default. It allows us to focus on the Logcosh loss aspects of the implementation rather than importing and cleaning the data, and hence ease of use. + +> The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. +> +> [StatLib Datasets Archive](http://lib.stat.cmu.edu/datasets/) + +What does it look like? + +Let's find out! + +It contains these variables, according to the [StatLib website](http://lib.stat.cmu.edu/datasets/boston): + +- **CRIM** per capita crime rate by town +- **ZN** proportion of residential land zoned for lots over 25,000 sq.ft. 
+- **INDUS** proportion of non-retail business acres per town
+- **CHAS** Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
+- **NOX** nitric oxides concentration (parts per 10 million)
+- **RM** average number of rooms per dwelling
+- **AGE** proportion of owner-occupied units built prior to 1940
+- **DIS** weighted distances to five Boston employment centres
+- **RAD** index of accessibility to radial highways
+- **TAX** full-value property-tax rate per $10,000
+- **PTRATIO** pupil-teacher ratio by town
+- **B** 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
+- **LSTAT** % lower status of the population
+- **MEDV** Median value of owner-occupied homes in $1000's
+
+As you can see, the target value for the predictions is the **median home value in $1000s.** The dataset contains quite a few other _features_, which tell us something about that particular house - such as the crime rate, industry (not retail!) around the house, air pollution, and so on. Together, it is assumed, these variables will be able to tell us something about the median value of the home.
+
+When applying the same dataset with [Huber loss](https://www.machinecurve.com/index.php/2019/10/12/using-huber-loss-in-keras/), we found a mean absolute error of approximately $3,639. Not too bad, but not spot on either.
+
+Let's now find out whether we can improve this score when applying Logcosh loss instead.
+
+* * *
+
+## Building the Keras model
+
+To build the model, open up your file explorer, navigate to some folder (or perhaps, create one), and create a file called `logcosh-loss.py`.
+
+We can now start coding. We do so by first inspecting what we need in order to successfully run this model, i.e., our software dependencies. Subsequently, we will construct the model and discuss each step in detail.
+
+### What you'll need to run the model
+
+You will need to install these software dependencies if you wish to run the model on your machine:
+
+- **Python**, for actually running the code. Ensure that you have Python 3.8+ installed.
+- **TensorFlow 2** (any of the 2.x versions), which includes `tensorflow.keras` out of the box.
+- **Numpy**, for number processing.
+- **Matplotlib**, for visualization.
+
+Preferably, you have these installed in an Anaconda environment, but this is not strictly necessary.
+
+### Model imports
+
+We start our code by writing down the software imports we need to actually run and support the model:
+
+```
+'''
+  Keras model demonstrating Logcosh loss
+'''
+import tensorflow
+from tensorflow.keras.datasets import boston_housing
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense
+import numpy as np
+import matplotlib.pyplot as plt
+```
+
+Obviously, we'll need the `boston_housing` dataset, which we can import from `tensorflow.keras.datasets`. When it's not already present on your system, Keras will automatically download the dataset (from an S3 storage location) and put it in place so that your model can be trained. If it was downloaded before, it will be loaded from cache.
+
+Additionally, we need the `Sequential` model, as we will use the Sequential API, with which we will stack multiple densely-connected or `Dense` layers.
+
+Numpy is used for number processing and Matplotlib is used for visualization purposes (i.e., for [visualizing model performance across epochs](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/)).
+ +### Loading and preparing the dataset + +Next, we load and prepare the dataset, which is as easy as writing this: + +``` +# Load data +(x_train, y_train), (x_test, y_test) = boston_housing.load_data() + +# Set the input shape +shape_dimension = len(x_train[0]) +input_shape = (shape_dimension,) +print(f'Feature shape: {input_shape}') +``` + +Under 'load data', you effectively load the training and testing data from Keras, specifically `load_data()` on `boston_housing`, which you imported before. Simple as that. + +Subsequently, you'll set the `input_shape` which describes the structure of one sample in your training and testing sets. In this case, a sample is one-dimensional, containing `len(x_train[0])` values, or the number of values in the array of your first feature vector - the 13 variables defined above. + +### Creating the model architecture + +Next, we specify the architecture of our model: + +``` +# Create the model +model = Sequential() +model.add(Dense(16, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(8, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(1, activation='linear')) +``` + +It's really simple - we're using the Sequential API, which allows us to stack the subsequent layers on top of each other. These layers are all densely-connected, or `Dense`, and have 16, 8 and 1 neuron(s), respectively. The hidden layers (the first two we're adding) use ReLU activation and He uniform init, [which is wise](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/). The first hidden layer specifies the input shape and hence the number of neurons in your initial layer. The final layer has one neuron (since only one value - the one being predicted with the regressed function - is output) and activates linearly, so that the predicted value is output. + +### Model configuration & training + +We next specify code for _hyperparameter tuning_ (or model configuration) and starting the actual training process (or, in Keras terms, _fitting the data to your model architecture_): + +``` +# Configure the model and start training +model.compile(loss='logcosh', optimizer='adam', metrics=['mean_absolute_error']) +history = model.fit(x_train, y_train, epochs=250, batch_size=1, verbose=1, validation_split=0.2) +``` + +Here, we specify various configuration options such as the loss value (Logcosh), the optimizer, additional metrics (we also use MAE so that we can compare with the [Huber loss](https://www.machinecurve.com/index.php/2019/10/12/using-huber-loss-in-keras/) variant), and so on. Fitting the data also requires us to specify certain options, such as the number of epochs, the batch size, and the validation split. We store the results of the training process in a `history` object so that we can visualize the model's performance. 
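+
+If you'd rather pass a callable than the `'logcosh'` string - or want to see what the loss actually computes - you could also define it yourself. The snippet below is just a minimal sketch; TensorFlow's built-in version (the `'logcosh'` string, or `tf.keras.losses.LogCosh()`) uses a more numerically robust formulation and remains preferable in practice:
+
+```
+import tensorflow as tf
+
+# Logcosh written out directly: the mean of log(cosh(prediction error)).
+# Note: cosh can overflow for very large errors; TensorFlow's built-in
+# implementation guards against this, which is why it is preferred.
+def custom_logcosh(y_true, y_pred):
+  error = y_pred - y_true
+  return tf.reduce_mean(tf.math.log(tf.math.cosh(error)), axis=-1)
+
+# Drop-in replacement for the string identifier used above:
+# model.compile(loss=custom_logcosh, optimizer='adam', metrics=['mean_absolute_error'])
+```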
+ +### Model testing & performance visualization code + +Next, we add code which evaluates the model against the _testing set_ to test whether it generalizes properly to data it has never seen before: + +``` +# Test the model after training +test_results = model.evaluate(x_test, y_test, verbose=1) +print(f'Test results - Loss: {test_results[0]} - MAE: {test_results[1]}') +``` + +Next, we visualize training history, for both the Logcosh loss value as the additional MAE metric: + +``` +# Plot history: Logcosh loss and MAE +plt.plot(history.history['loss'], label='Logcosh loss (training data)') +plt.plot(history.history['val_loss'], label='Logcosh loss (validation data)') +plt.title('Boston Housing Price Dataset regression model - Logcosh loss') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() + +plt.title('Boston Housing Price Dataset regression model - MAE') +plt.plot(history.history['mean_absolute_error'], label='MAE (training data)') +plt.plot(history.history['val_mean_absolute_error'], label='MAE (validation data)') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() +``` + +* * * + +## Whole model code + +This concludes our implementation! + +If you're interested in the model as a whole - here you go: + +``` +''' + Keras model demonstrating Logcosh loss +''' +import tensorflow +from tensorflow.keras.datasets import boston_housing +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +import numpy as np +import matplotlib.pyplot as plt + +# Load data +(x_train, y_train), (x_test, y_test) = boston_housing.load_data() + +# Set the input shape +shape_dimension = len(x_train[0]) +input_shape = (shape_dimension,) +print(f'Feature shape: {input_shape}') + +# Create the model +model = Sequential() +model.add(Dense(16, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(8, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(1, activation='linear')) + +# Configure the model and start training +model.compile(loss='logcosh', optimizer='adam', metrics=['mean_absolute_error']) +history = model.fit(x_train, y_train, epochs=250, batch_size=1, verbose=1, validation_split=0.2) + +# Test the model after training +test_results = model.evaluate(x_test, y_test, verbose=1) +print(f'Test results - Loss: {test_results[0]} - MAE: {test_results[1]}') + +# Plot history: Logcosh loss and MAE +plt.plot(history.history['loss'], label='Logcosh loss (training data)') +plt.plot(history.history['val_loss'], label='Logcosh loss (validation data)') +plt.title('Boston Housing Price Dataset regression model - Logcosh loss') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() + +plt.title('Boston Housing Price Dataset regression model - MAE') +plt.plot(history.history['mean_absolute_error'], label='MAE (training data)') +plt.plot(history.history['val_mean_absolute_error'], label='MAE (validation data)') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() +``` + +* * * + +## Model performance + +Now, we can proceed with the actual training process. Open up a terminal, such as the Anaconda prompt or a regular `cmd`, `cd` to the folder where your Python file is located, and execute `python logcosh-loss.py`. The training process should then start and eventually stop after 250 epochs (or iterations), with a test MAE of approximately $3.5k. 
+ +``` +Epoch 250/250 +323/323 [==============================] - 1s 4ms/step - loss: 2.5377 - mean_absolute_error: 3.1209 - val_loss: 3.4016 - val_mean_absolute_error: 4.0356 +102/102 [==============================] - 0s 127us/step +Test results - Loss: 2.8869358511532055 - MAE: 3.4803783893585205 +``` + +This means that the model performed better than when using Huber loss! + +We cannot say this with statistical significance, obviously, since for the two groups n = 1 and no actual test is used, but the result is somewhat promising. We cannot say this for sure either, but perhaps the slight improvement is caused by the fact that Logcosh loss is less sensitive to outliers -- but this can be tested only by running the training process time after time, applying statistical tests and more detailed scrutiny of the predictions themselves. + +Two nice plots are generated as well which see how the model improves - at large at first, but more slowly towards the end; just as it should be. + +[![](images/logcosh_loss.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/logcosh_loss.png) + +[![](images/logcosh_mae.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/logcosh_mae.png) + +* * * + +## Summary + +In this blog post, we've seen how to implement Logcosh loss with Keras, the deep learning framework for Python. We provided an actual implementation that we discussed in detail, and an intuitive explanation of Logcosh loss and its benefits compared to e.g. MSE. I hope you've learnt something new today 😊 and would appreciate your comment below 👇 Thanks and happy engineering! 👋 + +_Note that the full code for the models we created in this blog post is also available through my [Keras Loss Functions repository](https://github.com/christianversloot/keras-loss-functions) on GitHub._ + +* * * + +## References + +About loss and loss functions – MachineCurve. (2019, October 15). Retrieved from [https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) + +Grover, P. (2019, September 25). 5 Regression Loss Functions All Machine Learners Should Know. Retrieved from [https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0](https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0) + +Carnegie Mellon University StatLib. (n.d.). Boston house-price data. Retrieved from [http://lib.stat.cmu.edu/datasets/boston](http://lib.stat.cmu.edu/datasets/boston) diff --git a/how-to-use-padding-with-keras.md b/how-to-use-padding-with-keras.md new file mode 100644 index 0000000..c9fd30a --- /dev/null +++ b/how-to-use-padding-with-keras.md @@ -0,0 +1,256 @@ +--- +title: "How to use padding with Keras?" +date: "2020-02-08" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "keras" + - "machine-learning" + - "neural-network" + - "neural-networks" + - "padding" +--- + +Sometimes, you don't want the shape of your convolutional outputs to reduce in size. Other times, you wish to append zeroes to the inputs of your Conv1D layers. Padding - same/zero padding and causal padding - can help here. This blog post illustrates how, by providing example code for the Keras framework. + +However, before we do so, we firstly recap on the concept of padding. What is it, again? And why do we need it? This is followed by the Python examples. + +Are you ready? Let's go! 
😎 + +* * * + +\[toc\] + +* * * + +## Recap: what is padding? + +Convolutional layers induce spatial hierarchy. That is, generally speaking, they reduce the size of your input data for every layer the data passes through - allowing neural networks to learn both very _specific_ and very _abstract_ aspects of your input data. + +However, sometimes you don't want this to happen: you want the size of your input data to stay the same. In that case, padding can help by adding [zeros, constants or different numbers around the reduced input](https://www.machinecurve.com/index.php/2020/02/07/what-is-padding-in-a-neural-network/). This way, the size of your input data remains the same. + +[![](images/reflection_pad.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/02/reflection_pad.jpg) + +[Reflection padding](https://www.machinecurve.com/index.php/2020/02/07/what-is-padding-in-a-neural-network/#reflection-padding) can be used for this purpose. + +In a different scenario, you have one dimensional data representing a time series. Two values in your _feature data_ causally determine a _target_, i.e., together they produce the outcome. However, if you train a `Conv1D` model with both the _inputs_ and the _targets_, effectively, the target will "predate" the input data. As this is weird, [causal padding](https://www.machinecurve.com/index.php/2020/02/07/what-is-padding-in-a-neural-network/#causal-padding) can be applied in order to add zeroes to your input data, as you can see here: + +![](images/Causalpad-4-1024x262.jpg) + +* * * + +## Types of padding supported by Keras + +Make sure to take a look at our blog post ["What is padding in a neural network?"](https://www.machinecurve.com/index.php/2020/02/07/what-is-padding-in-a-neural-network/) in order to understand padding and the different types in more detail. In this blog post, we'll take a look at _implementations_ - using the Keras framework, to be precise. This framework, which today works with TensorFlow 2.0, allows you to apply padding to your convolutional neural network. + +However, not all types of padding from the blog post linked above are supported. Keras supports these types of padding: + +- Valid padding, a.k.a. no padding; +- Same padding, a.k.a. zero padding; +- Causal padding. + +In this blog post, we'll look at each of them from a Keras point of view. That is, we don't explain them thoroughly (this is the purpose of the blog post linked above), but rather provide actual code! 👩‍💻 This way, you should be able to build ConvNets with these types of padding yourself. + +Now, let's open your code editor and go! 😎 + +* * * + +## How to use Valid Padding with Keras? + +Building a model with Keras often consists of three steps: + +- Instantiating the model, e.g. with the Sequential API; +- Stacking the layers on top of each other; +- Compiling the model. + +Once these have been completed, data can be fit to the model, after which the training process starts :) + +As you likely understand by now, **applying valid padding** happens during the model building phase. More precisely, it happens during the _stacking_ phase, where you add the individual layers to the model that has been constructed so far. + +So, for example, a simple model with three convolutional layers using the Keras Sequential API always starts with the `Sequential` instantiation: + +``` +# Create the model +model = Sequential() +``` + +### Adding the Conv layers + +Subsequently, the three `Conv` layers can be added. 
In our case, they are two-dimensional ones, as our ConvNet was used for image classification. Do note that at two layers `padding='valid'` is specified, whereas it is omitted in the second layer. This is for a reason - as you'll see towards the end of this section! + +The value for `input_shape = (28, 28, 1)`. + +``` +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, padding='valid')) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', padding='valid')) +``` + +Next, we can add a `Flatten` layer - which flattens the multidimensional outputs of the last `Conv2D` layer into one-dimensional format - and two `Dense` layers, [which generate a multiclass probability distribution using Softmax](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/). This is perfect for classification 😎 + +``` +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) +``` + +### Full model code + +The full stack of layers: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, padding='valid')) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', padding='valid')) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) +``` + +### Model summary and the effects of Valid Padding + +Now, it's time to add `model.summary()` and run the Python code in your terminal. You should see a summary appear: + +``` +Model: "sequential_1" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +conv2d_1 (Conv2D) (None, 26, 26, 32) 320 +_________________________________________________________________ +conv2d_2 (Conv2D) (None, 24, 24, 64) 18496 +_________________________________________________________________ +conv2d_3 (Conv2D) (None, 22, 22, 128) 73856 +_________________________________________________________________ +flatten_1 (Flatten) (None, 61952) 0 +_________________________________________________________________ +dense_1 (Dense) (None, 256) 15859968 +_________________________________________________________________ +dense_2 (Dense) (None, 10) 2570 +================================================================= +Total params: 15,955,210 +Trainable params: 15,955,210 +Non-trainable params: 0 +_________________________________________________________________ +``` + +Analyzing the summary, the effect of valid padding is clear - and it is also clear why it equals "no padding". That is, unsurprisingly, **no padding is applied when using valid padding**. + +For each layer, clearly, the feature map dimensions (i.e. width and height) are reduced, from 28x28 to 22x22 pixels directly before the `Flatten` layer. Just as convolutional layers should work when no padding is applied! :) + +For this reason, we told you that we omitted the value for `padding` on the second layer on purpose. It was to show you that it doesn't matter whether you apply it or not - in both cases, feature map dimensions get reduced. + +Let's now take a look at "same" or "zero" padding - which _doesn't reduce the feature maps in size._ + +* * * + +## How to use Same / Zero Padding with Keras? 
+
+Models that have same or zero padding are not too different from the ones using valid padding. Just like those, such models - when using the `Sequential` API - are initialized first:
+
+```
+# Create the model
+model = Sequential()
+```
+
+After which the `Conv` layers are added. In our case, they are `Conv2D` again, with 'same' as the value for `padding` for all three layers:
+
+```
+model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, padding='same'))
+model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', padding='same'))
+model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', padding='same'))
+```
+
+Then, like the "valid" padding scenario, we add a Flatten layer and two Dense ones, ending with a Softmax activated output:
+
+```
+model.add(Flatten())
+model.add(Dense(256, activation='relu'))
+model.add(Dense(no_classes, activation='softmax'))
+```
+
+Now, we've got our model :)
+
+### Full model code
+
+Here too, you can obtain the full model code at once if you wish:
+
+```
+# Create the model
+model = Sequential()
+model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape, padding='same'))
+model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', padding='same'))
+model.add(Conv2D(128, kernel_size=(3, 3), activation='relu', padding='same'))
+model.add(Flatten())
+model.add(Dense(256, activation='relu'))
+model.add(Dense(no_classes, activation='softmax'))
+```
+
+### Model summary and the effects of Same / Zero Padding
+
+What if we run `model.summary()` here?
+
+This would be the output:
+
+```
+Model: "sequential_1"
+_________________________________________________________________
+Layer (type)                 Output Shape              Param #
+=================================================================
+conv2d_1 (Conv2D)            (None, 28, 28, 32)        320
+_________________________________________________________________
+conv2d_2 (Conv2D)            (None, 28, 28, 64)        18496
+_________________________________________________________________
+conv2d_3 (Conv2D)            (None, 28, 28, 128)       73856
+_________________________________________________________________
+flatten_1 (Flatten)          (None, 100352)            0
+_________________________________________________________________
+dense_1 (Dense)              (None, 256)               25690368
+_________________________________________________________________
+dense_2 (Dense)              (None, 10)                2570
+=================================================================
+Total params: 25,785,610
+Trainable params: 25,785,610
+Non-trainable params: 0
+_________________________________________________________________
+```
+
+Indeed, as we can now observe, the _shape_ of our feature maps has stayed the same - 28x28 pixels! :)
+
+* * *
+
+## How to use Causal Padding with Keras?
+
+Applying causal padding is simple: just pass `padding='causal'` to your `Conv1D` layer to pad zeroes to the front of your inputs.
+
+```
+model.add(Conv1D(32, kernel_size=4, activation='relu', input_shape=input_shape, padding='causal'))
+```
+
+* * *
+
+## Summary
+
+In this blog post, we looked at how to implement padding with Keras. Firstly, we looked at what padding is at a high level - followed by example code for valid, same and causal padding.
+
+I hope you've learnt something today! If you did, please leave a comment in the comments section below 😊 Please do the same if you have remarks or questions. I'll happily answer and improve my blog post where possible!
+
+Thanks for reading MachineCurve today and happy engineering 😎
+
+\[kerasbox\]
+
+* * *
+
+## References
+
+Keras. (n.d.). 
Convolutional Layers. Retrieved from [https://keras.io/layers/convolutional/](https://keras.io/layers/convolutional/)
+
+TensorFlow. (n.d.). tf.keras.layers.ZeroPadding1D. Retrieved from [https://www.tensorflow.org/api\_docs/python/tf/keras/layers/ZeroPadding1D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/ZeroPadding1D)
+
+TensorFlow. (n.d.). tf.keras.layers.ZeroPadding2D. Retrieved from [https://www.tensorflow.org/api\_docs/python/tf/keras/layers/ZeroPadding2D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/ZeroPadding2D)
+
+TensorFlow. (n.d.). tf.keras.layers.ZeroPadding3D. Retrieved from [https://www.tensorflow.org/api\_docs/python/tf/keras/layers/ZeroPadding3D](https://www.tensorflow.org/api_docs/python/tf/keras/layers/ZeroPadding3D)
diff --git a/how-to-use-prelu-with-keras.md b/how-to-use-prelu-with-keras.md
new file mode 100644
index 0000000..b9ef021
--- /dev/null
+++ b/how-to-use-prelu-with-keras.md
@@ -0,0 +1,380 @@
+---
+title: "How to use PReLU with Keras?"
+date: "2019-12-05"
+categories:
+  - "deep-learning"
+  - "frameworks"
+tags:
+  - "activation-function"
+  - "activation-functions"
+  - "convolutional-neural-networks"
+  - "deep-learning"
+  - "keras"
+  - "machine-learning"
+  - "mnist"
+  - "neural-network"
+  - "neural-networks"
+  - "parametric-relu"
+  - "prelu"
+  - "relu"
+---
+
+Rectified Linear Unit, or [ReLU](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/), is considered to be the standard activation function of choice for today's neural networks. Even though time has passed since its introduction and many new activation functions have been introduced, ReLU is still recommended everywhere.
+
+The reason for this is twofold: first, it is a very simple activation function. As such, it is computationally less expensive than others (such as [Sigmoid and Tanh](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/)), which means that fewer computational resources are required for training your model. Second, it is highly generalizable. That is, it's difficult to use activation functions in practice if they work well on _some data_, while performing poorly on other data. If their performance is relatively independent of the dataset, they're good.
+
+And ReLU has both properties.
+
+However, it also has problems - or, rather, challenges. While it does not suffer from the [vanishing gradients problem](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/), it does suffer from [dying neurons](https://www.machinecurve.com/index.php/2019/10/15/leaky-relu-improving-traditional-relu/) instead. For this reason, various new activation functions have been proposed in the past. **Parametric Rectified Linear Unit**, or PReLU, is one of them, and we will cover it in this blog (He et al., 2015).
+
+First, we'll provide a recap on ReLU: why is it so useful for many neural networks? What are the challenges we just introduced, and why is PReLU different? Subsequently, we'll give an example implementation of PReLU for your Keras based neural network. This includes a comparison with standard ReLU, in our case with the MNIST dataset. This way, you can both understand _why_ PReLU may be useful, _and_ immediately use it in practice.
+
+All right, let's go!
😎 + +\[toc\] + +## Recap on ReLU: no vanishing gradients, but dying neurons instead + +ReLU - such a nice activation function: it is highly usable, as it generalizes acceptably well to pretty much any machine learning problem. It's also really simple, as it's just a two-path decision given some input: + +\\begin{equation} f(x) = \\begin{cases} 0, & \\text{if}\\ x < 0 \\\\ x, & \\text{otherwise} \\\\ \\end{cases} \\end{equation} + +The output equals the input for all positive inputs, and zero for all others. + +This can be visualized as follows: + +[![](images/relu-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/05/relu.png) + +### No vanishing gradients + +During [optimization](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), the neural network optimizer uses the gradient (computed with backpropagation) in order to find its path towards better weights. + +With traditional activation functions, such as the Sigmoid function, this gradient - which can be computed by letting the input pass through the first-order derivative of the original function - gets a lot smaller: + +[![](images/sigmoid_deriv-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/sigmoid_deriv.png) + +In fact, the maximum output for any input is \[latex\]\\approx 0.25\[/latex\], while in most cases it is even smaller. + +A gradient is always generated with respect to the error generated based on the predictions at the model's tail. For upstream layers, this means that _all the gradients of the layers in between layer X and the error must be included_. In mathematical terms, this means that they are chained - multiplied - by the backpropagation algorithm, for finding the gradient at some layer. This is exactly the problem: when the gradients are 0.25 at max, chaining four gradients results in a 0.25^4 = 0.00390625 gradient at max. + +You've just found out about the **[vanishing gradients problem](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/)**. For upstream layers, gradients vanish when the wrong activation function is used. With relatively shallow networks, this was not much of a problem, but learning gets severely impaired when your neural network gets deep - and they do these days. Upstream layers simply learn very slowly - or might no longer converge at all! + +Fortunately, ReLU does not suffer from this problem, as can be seen when its gradients are visualized: + +[![](images/derivatives-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/05/derivatives.png) + +The gradient is either _zero_ or _one_. No more vanishing gradients! 😁 + +### The 'dying ReLU' problem + +...but I'm sorry to spoil the fun here 😐 + +Can you imagine what happens when the input at some layer is \[latex\]\\leq 0\[/latex\]? + +Indeed, the gradient for that layer is zero - **and so are all the gradients for the upstream layers, as the zero is included in the chain from error to layer gradient**. + +Welcome to the _dying ReLU problem_. + +When your data is not normalized properly or when you use a wrong [weight initialization](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/) strategy, it may be that large parts of your neural network will find gradients of **zero** during optimization. This simply means that they can no longer participate in learning. 
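+
+To make this concrete: below is a tiny check - a sketch that assumes TensorFlow 2 is available (the code later in this post uses standalone Keras) - showing that ReLU's gradient for a negative input is exactly zero:
+
+```
+import tensorflow as tf
+
+# For a negative pre-activation, ReLU outputs 0 and its local gradient is 0.
+# Any gradient chained through this activation therefore becomes 0 as well.
+x = tf.constant(-2.0)
+with tf.GradientTape() as tape:
+  tape.watch(x)
+  y = tf.nn.relu(x)
+print(tape.gradient(y, x).numpy())  # 0.0 -> nothing flows back through this unit
+```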
+ +The effect: your model does learn very slowly or, once again, does not converge at all. Indeed, that's an entire waste of resources. + +Consequently, ReLU is not problem-free either. + +## What is PReLU? + +Kaiming He and others, in one of the landmark papers of machine learning research, recognized this problem and set out to find a solution (He et al., 2015). They started with a couple of facts: + +- Tremendous improvements in recognition performance have been reported for computer vision models in the years prior to the paper. +- One of the primary drivers of these improvements is the ReLU activation function, which is a significant improvement over traditional Sigmoid and Tanh. +- Nevertheless, ReLU is not problem-free. +- One attempt to fix this problem - with [Leaky ReLU](https://www.machinecurve.com/index.php/2019/11/12/using-leaky-relu-with-keras/) - is not sufficient: while it indeed resolves the _dying ReLU problem_ by setting the inputs \[latex\]<= 0\[/latex\] to very small but nonzero values, empirical testing hasn't resulted in significant performance improvements. + +The authors argue that this might occur because the \[latex\]\\alpha\[/latex\] parameter, which configures the steepness of the nonzero negative outputs, must be set by the user before starting the training process. + +Why, they argued, can't this parameter be learnt during training? And, even better, why - for every neuron - can't we learn \[latex\]\\alpha\[/latex\] _per input element_, instead of a global alpha for all the input dimensions? + +This is what **Parametric Rectified Linear Unit** or **PReLU** is all about. Being a generalization of Leaky ReLU, the alpha value need no longer to be configured by the user, but is learnt during training instead. It is therefore entirely dependent on the data, and not on the engineer's guess, and hence it is estimated that this variant of ReLU both avoids the dying ReLU problem and shows performance improvements. + +### How PReLU works + +PReLU is actually not so different from [traditional ReLU](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/): + +\\begin{equation} f(x\_i) = \\begin{cases} x\_i, & \\text{if}\\ x\_i > 0 \\\\ \\alpha\_ix\_i, & \\text{otherwise} \\\\ \\end{cases} \\end{equation} + +Note that \[latex\]x\_i\[/latex\] is just one feature from the feature vector \[latex\]\\textbf{x}\[/latex\]. As the authors argue: "\[t\]he subscript _i_ in \[latex\]\\alpha\_i\[/latex\] indicates that we allow the nonlinear activation to vary on different channels" (He et al., 2015). This is called **channel-wise** PReLU and is the default setting. If you choose to learn the same \[latex\]\\alpha\[/latex\] for all features, you use what is known as **channel-shared** PReLU (He et al., 2015). + +This brings us to the following insights: + +- When all \[latex\]a\_i\[/latex\] values (or the global alpha, if channel-shared PReLU is used) are learnt to **zeros**, PReLU effectively behaves like traditional ReLU. +- When they are learnt to _small values_, you effectively have Leaky ReLU (see the image below for an example). +- In any other case, you benefit from the generalization: you don't have traditional ReLU nor Leaky ReLU, but have a variant that is better suited to your input data. + +[![](images/leaky_relu.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/leaky_relu.png) + +Learning the values for \[latex\]\\alpha\[/latex\] takes place by adding a few extra parameters to the network. 
In computational terms, the effects on resource requirements are negligible, and especially so in the channel-shared variant (meaning that only one parameter needs to be added). Traditional backpropagation is used for computing the alpha gradients, and optimization is performed with [momentum gradient descent](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#momentum) (He et al., 2015). In Keras, that would be the optimizer of your choice instead, I'd guess. + +As with any set of weights, the \[latex\]\\alpha\[/latex\] values must be initialized when the training process commences. The weight initialization strategy used by the authors is to initialize all \[latex\]\\alpha\[/latex\]s to 0.25. Backprop and the optimizer will then take over. + +### Empirical performance tests by the authors + +The authors performed various experiments with a deep model, using various datasets (He et al., 2015), but primarily this one: + +- Baseline: using traditional ReLU, trained on ImageNet 2012. +- Comparison: using PReLU, trained on ImageNet 2012. + +**Result:** 1.2% gain in error rate, which is a significant reduction. The channel-wise strategy performs better than the channel-shared strategy, and is therefore preferred. + +Many other tests were performed, where PReLU continously outperformed ReLU. + +I've become curious now - since these tests were performed with deep models. For those, due to the nature of the dying ReLU problem, and the vanishing gradients problem in the case of Leaky ReLU, the sensitivity to such problems is quite large. But for more shallow models, like very simple neural nets, I consistenly see that the differences between traditional ReLU and these variants of ReLU are low. Will we see any differences when we train a simple Keras model with PReLU and ReLU, using the same dataset - the MNIST dataset - as well as the architecture? + +Let's find out. 😁 + +## Implementing PReLU in your Keras models + +### What you'll need to run the model + +As with many of the tutorials at MachineCurve, you'll need to install a set of dependencies if you wish to run the model we're creating today: + +- As usual, you'll need **Python**, and preferably version 3.6+. +- You'll also need to install **Keras**, the deep learning framework we are using today. +- Similarly, you need to install one of the backends Keras runs on - and preferably **Tensorflow**, as it has deeply integrated with it recently. +- Finally, you'll need **Matplotlib** for generating some visualizations. This is not mandatory, but it's likely that you already have Matplotlib installed on your system, so it shouldn't be too much of an issue. + +### Inspecting the Keras API for PReLU + +When taking a look at the Keras API for PReLU, we see this: + +``` +keras.layers.PReLU(alpha_initializer='zeros', alpha_regularizer=None, alpha_constraint=None, shared_axes=None) +``` + +As you can see, the PReLU layer comes with an initializer, regularizer and constraint possibility, as well as something called `shared_axes`: + +- With the **initializer**, or `alpha_initializer`, you define how the \[latex\]\\alpha\[/latex\] weights are initialized. You can use any of the [Keras initializers](https://keras.io/initializers/) for this purpose - and even define your own one. He et al. (2015) use \[latex\]\\alpha = 0.25\[/latex\]. +- With the **regularizer**, or `alpha_regularizer`, it's possible to _regulate_ weight swings by applying penalties to outliers. 
You can use any of the [Keras regularizers](https://keras.io/regularizers/) for this purpose. He et al. (2015) do not use a regularizer, arguing that L2 regularization in particular "tends to push \[alphas\] to zero, and thus biases PReLU towards ReLU". L1 regularization, or combined L1 and L2 regularization, may also bias the activation function towards [Leaky ReLU](https://www.machinecurve.com/index.php/2019/11/12/using-leaky-relu-with-keras/) (Uthmān, 2017).
+- With the **constraint**, or `alpha_constraint`, you can set fixed limits on the network parameters during training. You can use any of the [Keras constraints](https://keras.io/constraints/) for this purpose. He et al. (2015) do not use constraints, so that the activation function is allowed to become non-monotonic.
+- With **shared\_axes**, you can share the learnt alphas across specific axes of the input. This is useful in ConvNets, where you typically want to learn one alpha per filter and share it across the spatial dimensions.
+
+By default, only the `alpha_initializer` value is set, namely to zero initialization (`zeros`). On Keras' GitHub page, Arseny Kravchenko points out that this might be wrong, as He et al. (2015) initialize alpha to 0.25 (Kravchenko, n.d.). In fact, he argues, zero initialization may lead to worse performance. We'll take this into account when inspecting model performance on the MNIST dataset and run the model twice: once with zeros init, once with 0.25 init.
+
+### Today's model and dataset
+
+Today, we'll be creating a simple [ConvNet](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/). We use two convolutional blocks with MaxPooling and Dropout, followed by two densely-connected layers. The architecture is visualized next:
+
+[![](images/model-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/model-1.png)
+
+We'll be training our ConvNet so that it becomes able to classify digits from the MNIST dataset. This dataset, which contains 28x28 pixel images of handwritten digits, is quite extensive, yet simple enough to use in demonstrations & tutorials. It looks as follows:
+
+[![](images/mnist.png)](https://www.machinecurve.com/wp-content/uploads/2019/07/mnist.png)
+
+### Model code
+
+It's time to write some code!
+
+Open up your file explorer, navigate to some folder, and create a file, e.g. `model_prelu.py`. Now open up a code editor, open that file, and start coding 😀
+
+#### Imports
+
+First, we'll have to import the dependencies that we listed before:
+
+```
+import keras
+from keras.datasets import mnist
+from keras.models import Sequential
+from keras.layers import Dense, Dropout, Flatten
+from keras.layers import Conv2D, MaxPooling2D
+from keras.initializers import Constant
+from keras import backend as K
+from keras.layers import PReLU
+import matplotlib.pyplot as plt
+```
+
+#### Model configuration
+
+Secondly, we'll add configuration parameters for the ConvNet and the training process.
We use the ones we defined and elaborated on in our [tutorial on Keras CNNs](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/), so if you wish to understand the _whys_, I'd recommend that you also read that blog post :)
+
+```
+# Model configuration
+img_width, img_height = 28, 28
+batch_size = 250
+no_epochs = 25
+no_classes = 10
+validation_split = 0.2
+verbosity = 1
+```
+
+#### Data import & preparation
+
+Subsequently, we import the data and prepare it for training:
+
+```
+# Load MNIST dataset
+(input_train, target_train), (input_test, target_test) = mnist.load_data()
+
+# Reshape data based on channels first / channels last strategy.
+# This is dependent on whether you use TF, Theano or CNTK as backend.
+# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py
+if K.image_data_format() == 'channels_first':
+    input_train = input_train.reshape(input_train.shape[0], 1, img_width, img_height)
+    input_test = input_test.reshape(input_test.shape[0], 1, img_width, img_height)
+    input_shape = (1, img_width, img_height)
+else:
+    input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
+    input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
+    input_shape = (img_width, img_height, 1)
+
+# Parse numbers as floats
+input_train = input_train.astype('float32')
+input_test = input_test.astype('float32')
+
+# Normalize data into the [0, 1] range
+input_train = input_train / 255
+input_test = input_test / 255
+
+# Convert target vectors to categorical targets
+target_train = keras.utils.to_categorical(target_train, no_classes)
+target_test = keras.utils.to_categorical(target_test, no_classes)
+```
+
+- We simply load the MNIST data with `load_data`.
+- Then, we reshape the data based on whether the image strategy is channels first or channels last (this depends on the backend you are using).
+- Next, we parse the input data as floats, which reportedly speeds up the training process.
+- After that, we normalize the data into the \[0, 1\] range.
+- Finally, we convert the integer target vectors into categorical format (i.e. true/false vector format) so that we can use [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/).
+
+#### Defining model architecture
+
+Subsequently, we specify the architecture in line with the image shown above:
+
+```
+# Create the model
+model = Sequential()
+model.add(Conv2D(32, kernel_size=(3, 3), input_shape=input_shape))
+model.add(PReLU(alpha_initializer=Constant(value=0.25)))
+model.add(MaxPooling2D(pool_size=(2, 2)))
+model.add(Dropout(0.25))
+model.add(Conv2D(64, kernel_size=(3, 3)))
+model.add(PReLU(alpha_initializer=Constant(value=0.25)))
+model.add(MaxPooling2D(pool_size=(2, 2)))
+model.add(Dropout(0.25))
+model.add(Flatten())
+model.add(Dense(256))
+model.add(PReLU(alpha_initializer=Constant(value=0.25)))
+model.add(Dense(no_classes, activation='softmax'))
+```
+
+Note that here, we perform alpha initialization by setting `alpha_initializer` to `Constant(value=0.25)`, i.e. in line with the strategy proposed by He et al. (2015). However, you may also wish to initialize with all zeros. You can then either replace `value=0.25` with `value=0` or replace `Constant(...)` with `'zeros'`. Of course, you can also use any of the other [Keras initializers](https://keras.io/initializers/).
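+
+As a side note: with the configuration above, each PReLU layer that follows a `Conv2D` layer learns a separate alpha for every individual element of its input feature map. If you would rather learn one alpha per filter - i.e. share it across the spatial dimensions - you can pass the `shared_axes` argument. The fragment below is purely a sketch of the first convolutional block, and assumes the channels-last data format, where axes 1 and 2 are the spatial dimensions:
+
+```
+# Sketch: share alpha across the spatial axes (height and width),
+# so that only one alpha per filter is learnt.
+model.add(Conv2D(32, kernel_size=(3, 3), input_shape=input_shape))
+model.add(PReLU(alpha_initializer=Constant(value=0.25), shared_axes=[1, 2]))
+```
+
+This substantially reduces the number of extra parameters: from one per output element to one per filter.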
+
+#### Model configuration & training
+
+Next, we configure the model - in Keras terms: 'compile the model with our hyperparameters as its configuration'. Subsequently, we fit the data to our compiled model, which starts the training process:
+
+```
+# Compile the model
+model.compile(loss=keras.losses.categorical_crossentropy,
+              optimizer=keras.optimizers.Adam(),
+              metrics=['accuracy'])
+
+# Fit data to model
+history = model.fit(input_train, target_train,
+                    batch_size=batch_size,
+                    epochs=no_epochs,
+                    verbose=verbosity,
+                    validation_split=validation_split)
+```
+
+#### Test and visualization metrics
+
+Finally, we provide metrics for testing the model - based on the test set, to see how well it generalizes to new data - and for [visualizing the training process](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/).
+
+```
+# Generate generalization metrics
+score = model.evaluate(input_test, target_test, verbose=0)
+print(f'Test loss for Keras PReLU CNN: {score[0]} / Test accuracy: {score[1]}')
+
+# Visualize model history
+plt.plot(history.history['accuracy'], label='Training accuracy')
+plt.plot(history.history['val_accuracy'], label='Validation accuracy')
+plt.title('PReLU training / validation accuracies')
+plt.ylabel('Accuracy')
+plt.xlabel('Epoch')
+plt.legend(loc="upper left")
+plt.show()
+
+plt.plot(history.history['loss'], label='Training loss')
+plt.plot(history.history['val_loss'], label='Validation loss')
+plt.title('PReLU training / validation loss values')
+plt.ylabel('Loss value')
+plt.xlabel('Epoch')
+plt.legend(loc="upper left")
+plt.show()
+```
+
+## Results
+
+Now open up a terminal, `cd` to the folder where your Python script is located, and run e.g. `python model_prelu.py`. You should see the training process start, and the model should train nicely with PReLU.
+
+My initial observation about PReLU is that it's slower than traditional ReLU. This makes sense, given the fact that more computations need to be made (rather than a simple `max` operation) across more dimensions (given the channel-wise strategy we're using). However, if it were to result in better performance, that could be a perfectly acceptable trade-off.
+
+Beyond the script above, we also compared PReLU loss with ReLU loss. We used the same dataset, the same hyperparameters, and the same model architecture - we only changed the activation functions. We did so for both alpha-zero initialization and alpha-0.25 initialization. Let's see whether (1) there is a difference between the two, and (2) whether PReLU works better on MNIST or whether it doesn't, answering my question about the necessity of altered ReLUs on small datasets.
+
+### Alpha init = 'zeros'
+
+The test losses for both CNNs, with alphas initialized to zero, are as follows:
+
+```
+Test loss for Keras ReLU CNN: 0.02572663464686841 / Test accuracy: 0.992900013923645
+Test loss for Keras PReLU CNN: 0.030429376099322735 / Test accuracy: 0.9926999807357788
+```
+
+...loss for traditional ReLU seems to be lower! This also becomes clear from the visualizations:
+
+- [![](images/acc.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/acc.png)
+
+- [![](images/loss.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/loss.png)
+
+- [![](images/comparison.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/comparison.png)
+
+Hence, for alpha-zero initialization, we can say that PReLU does not necessarily perform better than traditional ReLU.
Additionally, it results in more loss oscillations, although we cannot say for certain whether this is more than bad luck. In any case, for small datasets, PReLU seems to be less important than for larger ones (given the nature of the dying ReLU and vanishing gradient problems).
+
+Let's now find out what happens when we use our alpha-0.25 strategy.
+
+### Alpha init = 0.25
+
+In terms of test loss, PReLU still performs worse:
+
+```
+Test loss for Keras ReLU CNN: 0.02390692584049343 / Test accuracy: 0.9926999807357788
+Test loss for Keras PReLU CNN: 0.030004095759327037 / Test accuracy: 0.9929999709129333
+```
+
+- [![](images/acc-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/acc-1.png)
+
+- [![](images/loss-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/loss-1.png)
+
+- [![](images/comp.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/comp.png)
+
+Nevertheless, the loss seems to oscillate less than with our alpha-zero strategy. The final test loss is not significantly better in the 0.25 case, however. Additionally, we can still conclude that for smaller networks, PReLU does not significantly improve model performance.
+
+## Summary
+
+In this blog post, we've discussed possible problems with the ReLU activation function and introduced a possible solution - the Parametric ReLU, or PReLU. We discussed the findings by the authors (He et al., 2015) and showed that, in their tests with large networks, PReLU yields significant performance improvements over traditional ReLU.
+
+We also provided an example implementation of a Keras based CNN using PReLU, with both zero initialization and alpha-0.25 initialization, the latter of which is recommended by the authors. Our empirical tests with a _smaller network_ show that PReLU does not yield better-performing models compared with ReLU when trained on the MNIST dataset. PReLU did, however, yield slower training times, probably because it is more computationally complex than ReLU. Hence, when choosing activations for practical applications, I'd say you'd best consider the _depth_ of your network before choosing between traditional or extended ReLUs.
+
+Thanks for reading MachineCurve today and happy engineering! 😎
+
+## References
+
+He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. _2015 IEEE International Conference on Computer Vision (ICCV)_. [doi:10.1109/iccv.2015.123](https://arxiv.org/abs/1502.01852)
+
+Keras. (n.d.). Advanced Activations Layers: PReLU. Retrieved from [https://keras.io/layers/advanced-activations/#prelu](https://keras.io/layers/advanced-activations/#prelu)
+
+Kravchenko, A. (n.d.). Inaccurate initialization for PReLU layer · Issue #9810 · keras-team/keras. Retrieved from [https://github.com/keras-team/keras/issues/9810](https://github.com/keras-team/keras/issues/9810)
+
+Uthmān. (2017, July 16). Don't Regularize Me, Bro.
Retrieved from [https://discuss.pytorch.org/t/dont-regularize-me-bro/4946](https://discuss.pytorch.org/t/dont-regularize-me-bro/4946)
diff --git a/how-to-use-pytorch-loss-functions.md b/how-to-use-pytorch-loss-functions.md
new file mode 100644
index 0000000..e15b2c7
--- /dev/null
+++ b/how-to-use-pytorch-loss-functions.md
@@ -0,0 +1,1748 @@
+---
+title: "How to use PyTorch loss functions"
+date: "2021-07-19"
+categories:
+  - "buffer"
+  - "deep-learning"
+  - "frameworks"
+tags:
+  - "bceloss"
+  - "bcewithlogitsloss"
+  - "binary-crossentropy"
+  - "categorical-crossentropy"
+  - "crossentropy"
+  - "deep-learning"
+  - "hinge"
+  - "hinge-loss"
+  - "huber-loss"
+  - "l1-loss"
+  - "loss-function"
+  - "mae-loss"
+  - "margin-loss"
+  - "mse-loss"
+  - "neural-networks"
+  - "nllloss"
+  - "pytorch"
+  - "smooth-l1-loss"
+  - "softmarginloss"
+---
+
+Loss functions are an important component of a neural network. Interfacing between the forward and backward pass within a Deep Learning model, they effectively compute how _poorly_ a model performs - in other words, how big its _loss_ is. In this article, we're going to cover how to use a variety of PyTorch loss functions for classification and regression.
+
+After reading this article, you will...
+
+- **Understand what the role of a loss function in a neural network is.**
+- **Be familiar with a variety of PyTorch based loss functions for classification and regression.**
+- **Be able to use these loss functions in your Deep Learning models.**
+
+Are you ready? Let's take a look! 😎
+
+* * *
+
+\[toc\]
+
+* * *
+
+## What is a loss function?
+
+Training a Deep Learning model involves what I call a _high-level training process_. This process is visualized below. It all starts with a training dataset, which - in the case of classification and regression - contains a set of descriptive variables (features) that jointly are capable of predicting some target variable.
+
+Training the Deep Learning model, which often is a neural network, involves sequentially performing a forward pass and a backward pass, followed by optimization. In the forward pass, the dataset is fed to the network (in a batched fashion). This leads to predictions for the targets, which can then be compared with the _true_ labels. No prediction is perfect, and hence there will be an error value. Using _backpropagation_, this error can be propagated backwards through the neural network. Subsequently, with an optimizer, the model can be changed slightly in the hope that it performs better next time. By repeating this process over and over again, the model can improve and _learn_ to generate accurate predictions.
+
+Let's get back to this error value. As the name suggests, it is used to illustrate how _poorly_ the model performs. In Deep Learning jargon, this value is also called a _loss value_. It is computed by means of a _loss function_. There are many functions that can be used for this purpose. Choosing one depends on the problem you are solving (i.e. classification or regression), the characteristics of your dataset, and quite frequently on trial and error. In the rest of this article, we're going to walk through many of the loss functions available in PyTorch. Let's take a look!
+
+![](images/High-level-training-process-1024x973.jpg)
+
+* * *
+
+## PyTorch Classification loss function examples
+
+The first category of loss functions that we will take a look at is that of **classification models**.
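+
+Before walking through the individual loss functions, here is a minimal sketch of the pattern that every example below follows: compute predictions in the forward pass, compute the loss on (predictions, targets), propagate it backwards, and let the optimizer take a step. The tiny `nn.Linear` model, the random batch and the choice of `nn.CrossEntropyLoss` are illustrative assumptions only:
+
+```
+import torch
+from torch import nn
+
+# Toy model, loss function and optimizer (illustrative choices)
+model = nn.Linear(10, 2)
+loss_function = nn.CrossEntropyLoss()
+optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
+
+# One fake batch: 4 samples with 10 features, and integer class targets
+inputs = torch.randn(4, 10)
+targets = torch.tensor([0, 1, 1, 0])
+
+# Forward pass -> loss value -> backward pass -> optimization step
+optimizer.zero_grad()
+outputs = model(inputs)
+loss = loss_function(outputs, targets)
+loss.backward()
+optimizer.step()
+
+print(f'Loss value: {loss.item()}')
+```
+
+Each of the sections that follow plugs a different loss function (and, where needed, a different final layer) into this same loop.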
+ +### Binary Cross-entropy loss, on Sigmoid (`nn.BCELoss`) example + +**Binary cross-entropy** **loss** or **BCE Loss** compares a target \[latex\]t\[/latex\] with a prediction \[latex\]p\[/latex\] in a logarithmic and hence exponential fashion. In neural network implementations, the value for \[latex\]t\[/latex\] is either 0 or 1, while \[latex\]p\[/latex\] can take any value between 0 and 1. This is the formula for binary cross-entropy loss: + +![](images/image-5-1024x122.png) + +When visualizing BCE loss for a target value of 1, you can see that loss increases exponentially when the prediction approaches the opposite - 0, in our case. + +This suggests that small deviations are punished albeit lightly, whereas big prediction errors are punished significantly. + +![](images/bce-1-1024x421.png) + +This makes binary cross-entropy loss a good candidate for **binary classification** problems, where a classifier has two classes. + +Implementing binary cross-entropy loss with PyTorch is easy. It involves the following steps: + +1. Ensuring that the output of your neural network is a value between 0 and 1. Recall that the Sigmoid activation function can be used for this purpose. This is why we apply `nn.Sigmoid()` in our neural network below. +2. Ensuring that you use `nn.BCELoss()` as your loss function of choice during the training loop. + +A full example of using binary cross-entropy loss is given next, using the `torchvision.datasets.FakeData` dataset: + +``` +import os +import torch +from torch import nn +from torchvision.datasets import FakeData +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 3, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 1), + nn.Sigmoid() + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare FakeData dataset + dataset = FakeData(size=15000, image_size=(3, 28, 28), num_classes=2, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, num_workers = 4, pin_memory = True) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.BCELoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Prepare targets + targets = targets \ + .type(torch.FloatTensor) \ + .reshape((targets.shape[0], 1)) + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 10 == 0: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. 
+ print('Training process has finished.') +``` + +### Binary Cross-entropy loss, on logits (`nn.BCEWithLogitsLoss`) + +Simple binary cross-entropy loss (represented by `nn.BCELoss` in PyTorch) computes BCE loss on the predictions \[latex\]p\[/latex\] generated in the range `[0, 1]`. + +However, it is possible to generate more numerically stable variant of binary cross-entropy loss by _combining_ the Sigmoid and the BCE Loss into one loss function: + +> This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability. +> +> PyTorch (n.d.) + +This trick is summarized [here](https://en.wikipedia.org/wiki/LogSumExp#log-sum-exp_trick_for_log-domain_calculations). + +In PyTorch, this is combined into the `**nn.BCEWithLogitsLoss**` function. The difference between `nn.BCEWithLogitsLoss` and `nn.BCELoss` is that BCE with Logits loss _adds_ the Sigmoid function into the loss function. With simple BCE Loss, you will have to add Sigmoid to the neural network, whereas with BCE With Logits Loss you will not. + +Here is an example demonstrating `nn.BCEWithLogitsLoss` using the `torchvision.datasets.FakeData` dataset: + +``` +import os +import torch +from torch import nn +from torchvision.datasets import FakeData +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 3, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 1) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare FakeData dataset + dataset = FakeData(size=15000, image_size=(3, 28, 28), num_classes=2, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, num_workers = 4, pin_memory = True) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.BCEWithLogitsLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Prepare targets + targets = targets \ + .type(torch.FloatTensor) \ + .reshape((targets.shape[0], 1)) + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 10 == 0: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### Negative log likelihood loss (`nn.NLLLoss`) + +The previous two loss functions involved binary classification. In other words, they can be used for a classifier that works with two possible targets only - a class 0 and a class 1. + +However, many classification problems involve more than two classes. 
The MNIST dataset (`torchvision.datasets.MNIST`) is a good example of such a classification problem: in MNIST, there is one class per digit, and hence there are 10 classes. + +**Negative log likelihood loss** (represented in PyTorch as `nn.NLLLoss`) can be used for this purpose. Sometimes also called _categorical cross-entropy_, it computes the _negative log likelihood_ of each prediction, and multiplies each log prediction with the true target value. + +For example, if we have a three-class classification problem with a sample where the third target class is the true target class, our target vector is as follows: `[0 0 1]`. + +Say that our model predicts a 60% likelihood that it's the second class: `[0.3 0.6 0.1]`. In that case, our negative log likelihood loss is as follows: + +`-(0 * log(0.3) + 0 * log(0.6) + 1 * log(0.1)) = 1` + +Now suppose that after a few more epochs, it successfully starts predicting the third: `[0.3 0.2 0.5]`. Ensure that loss now becomes `0.30`. When it is even more confident (say `[0.05 0.05 0.90]`), loss is `0.045`. In other words, using this loss, you can create a multiclass classifier + +![](images/image-6.png) + +Note as well that by consequence, you can also model a binary classification problem this way: it is then a multiclass classification problem with two classes. + +As you can see above, the _prediction_ of our classifier should be a _pseudo probability distribution_ over all the target classes. The softmax activation function serves this purpose. Using `nn.NLLLoss` therefore requires that we use a Softmax activated output in our neural network. `nn.LogSoftmax` is faster than pure `nn.Softmax`, however; that's why we are using `nn.LogSoftmax` in the `nn.NLLLoss` example for PyTorch below. + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10), + nn.LogSoftmax(dim = 1) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare MNIST dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.NLLLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. 
+ print('Training process has finished.') +``` + +### Cross-entropy loss (`nn.CrossEntropyLoss`) + +Recall that `nn.NLLLoss` requires the application of a `Softmax` (or `LogSoftmax`) layer. As with the difference between `BCELoss` and `BCEWithLogitsLoss`, combining the Softmax and the `NLLLoss` into one likely allows you to benefit from computational benefits (PyTorch, n.d.). That's why you can also choose to use `nn.CrossEntropyLoss` instead. + +This is an example of `nn.CrossEntropyLoss` with a PyTorch neural network. Note that the final layer does not use any `Softmax` related loss; this is already built into the loss function! + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare MNIST dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### Poisson Negative log likelihood loss (`nn.PoissonNLLLoss`) + +Suppose that your multiclass classification targets are drawn from a Poisson distribution (PyTorch, n.d.); that you already know this fact from your exploratory data analysis. You can then use the **Poisson Negative log likelihood loss** instead of regular `nn.NLLLoss`. In PyTorch (n.d.), this loss is described as follows: + +![](images/image-1.png) + +An example using Poisson Negative log likelihood loss with PyTorch is as follows: + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10), + nn.LogSoftmax(dim = 1) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare MNIST dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.PoissonNLLLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### Gaussian Negative log likelihood loss (`nn.GaussianNLLLoss`) + +Suppose that your multiclass classification targets are drawn from a Gaussian distribution (PyTorch, n.d.). Loss can then be computed differently - by using the **Gaussian Negative log likelihood loss**. This loss function is represented within PyTorch (n.d.) as `nn.GaussianNLLLoss`. + +![](images/image.png) + +This is an example of using Gaussian Negative log likelihood loss with PyTorch. + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10), + nn.LogSoftmax(dim = 1) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare MNIST dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.GaussianNLLLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### Hinge embedding loss (`nn.HingeEmbeddingLoss`) + +In PyTorch, the **Hinge Embedding Loss** is defined as follows: + +![](images/image-2.png) + +It can be used to measure whether two inputs (`x` and `y`) are similar, and works only if `y`s are either 1 or -1. + +``` +import os +import torch +from torch import nn +from torchvision.datasets import FakeData +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 3, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 1), + nn.Tanh() + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare FakeData dataset + dataset = FakeData(size=15000, image_size=(3, 28, 28), num_classes=2, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, num_workers = 4, pin_memory = True) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.HingeEmbeddingLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # For this example, change zero targets into -1 + targets[targets == 0] = -1 + + # Prepare targets + targets = targets \ + .type(torch.FloatTensor) \ + .reshape((targets.shape[0], 1)) + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 10 == 0: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### Two-class soft margin loss (`nn.SoftMarginLoss`) + +The two-class soft margin loss optimizes the following formula (PyTorch, n.d.): + +![](images/image-3.png) + +It can be used in binary classification problems as follows: + +``` +import os +import torch +from torch import nn +from torchvision.datasets import FakeData +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 3, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 1), + nn.Tanh() + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare FakeData dataset + dataset = FakeData(size=15000, image_size=(3, 28, 28), num_classes=2, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True, num_workers = 4, pin_memory = True) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.SoftMarginLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # For this example, change zero targets into -1 + targets[targets == 0] = -1 + + # Prepare targets + targets = targets \ + .type(torch.FloatTensor) \ + .reshape((targets.shape[0], 1)) + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 10 == 0: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### Multi-class margin loss (`nn.MultiMarginLoss`) + +For multiclass classification problems, a **multi-class hinge loss** can be used represented by `nn.MultiMarginLoss` (PyTorch, n.d.): + +![](images/image-4.png) + +Here is an examplke using `nn.MultiMarginLoss` with PyTorch for **multi-class single-label classification problems:** + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10), + nn.LogSoftmax(dim = 1) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare MNIST dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.MultiMarginLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### Multilabel soft margin loss (`nn.MultiLabelSoftMarginLoss`) + +In **multilabel classification problems**, the neural network learns to predict multiple labels for an input sample. It can also be viewed as solving a _tagging problem_, as you are essentially assigning multiple tags (instead of just one class) to _one_ input sample. + +Multilabel soft margin loss (implemented in PyTorch as `nn.MultiLabelSoftMarginLoss`) can be used for this purpose. Here is an example with PyTorch. If you look closely, you will see that: + +- We use the MNIST dataset for this purpose. By replacing the targets with one of three multilabel Tensors, we are simulating a multilabel classification problem. Note that there is no resemblence whatsoever between targets and inputs, as this is simply an example. +- The final Linear layer outputs a 10-dimensional Tensor, which makes sense since we need 10 logits per sample. + +``` +import os +import numpy as np +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + ''' + Forward pass + Note that + ''' + fp = self.layers(x) + return fp + + +def draw_label(label): + if label < 5: + return [0, 0, 0, 1, 0, 0, 0, 1, 0, 0] + if label < 8: + return [1, 0, 0, 1, 0, 0, 0, 1, 0, 0] + else: + return [0, 0, 0, 0, 1, 0, 0, 1, 1, 0] + +def replace_labels(labels): + ''' Randomly replace labels ''' + new_labels = [] + for label in labels: + new_labels.append(draw_label(label)) + return torch.from_numpy(np.array(new_labels)) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare MNIST dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.MultiLabelSoftMarginLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + targets = replace_labels(targets) + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### Kullback-Leibler Divergence (KL Divergence) loss (`nn.KLDivLoss`) + +KL Divergence can be used for [Variational Autoencoders, multiclass classification and replacing Least Squares regression](https://www.machinecurve.com/index.php/2019/12/21/how-to-use-kullback-leibler-divergence-kl-divergence-with-keras/). Here is an example that uses KL Divergence with PyTorch: + +``` +import os +import numpy as np +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 1), + nn.Sigmoid() + ) + + + def forward(self, x): + ''' + Forward pass + Note that + ''' + fp = self.layers(x) + return fp + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare MNIST dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.KLDivLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + targets = targets.float() + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +* * * + +## PyTorch Regression loss function examples + +Let's now take a look at PyTorch loss functions for regression models. + +### Mean Absolute Error (MAE) / L1 Loss (`nn.L1Loss`) + +**Mean Absolute Error** (MAE) is one of the loss functions for regression. This is what it looks like: + +![](images/image-16-1024x185.png) + +As you can see, it simply computes the difference between the input `x` and the expected value for `y`, then computing the absolute value (so that the outcome is always positive). It then averages this error. + +Below, you will see an example of MAE loss (also called L1 Loss) within PyTorch, using `nn.L1Loss` and the Boston Housing dataset: + +``` +import os +import numpy as np +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +from sklearn.datasets import load_boston +from sklearn.preprocessing import StandardScaler + +class BostonDataset(torch.utils.data.Dataset): + ''' + Prepare the Boston dataset for regression + ''' + + def __init__(self, X, y, scale_data=True): + if not torch.is_tensor(X) and not torch.is_tensor(y): + # Apply scaling if necessary + if scale_data: + X = StandardScaler().fit_transform(X) + self.X = torch.from_numpy(X) + self.y = torch.from_numpy(y) + + def __len__(self): + return len(self.X) + + def __getitem__(self, i): + return self.X[i], self.y[i] + + +class MLP(nn.Module): + ''' + Multilayer Perceptron for regression. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(13, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 1) + ) + + + def forward(self, x): + ''' + Forward pass + Note that + ''' + fp = self.layers(x) + return fp + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Load Boston dataset + X, y = load_boston(return_X_y=True) + + # Prepare Boston dataset + dataset = BostonDataset(X, y) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.L1Loss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get and prepare inputs + inputs, targets = data + inputs, targets = inputs.float(), targets.float() + targets = targets.reshape((targets.shape[0], 1)) + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 10 == 0: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### Mean Squared Error (MSE) loss (`nn.MSELoss`) + +The **Mean Squared Error** loss (or `nn.MSELoss`) essentially performs the same, but then doesn't compute the _absolute value_ but rather the _square_ of the difference. This also leads to the fact that all negatives are gone (squaring a negative value yields a positive one), but is better when the difference between errors is relatively small. Note that this comes at the cost of being sensitive to outliers. + +![](images/image-14-1024x296.png) + +This is an example of using MSE Loss with PyTorch, which is provided as `nn.MSELoss`: + +``` +import os +import numpy as np +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +from sklearn.datasets import load_boston +from sklearn.preprocessing import StandardScaler + +class BostonDataset(torch.utils.data.Dataset): + ''' + Prepare the Boston dataset for regression + ''' + + def __init__(self, X, y, scale_data=True): + if not torch.is_tensor(X) and not torch.is_tensor(y): + # Apply scaling if necessary + if scale_data: + X = StandardScaler().fit_transform(X) + self.X = torch.from_numpy(X) + self.y = torch.from_numpy(y) + + def __len__(self): + return len(self.X) + + def __getitem__(self, i): + return self.X[i], self.y[i] + + +class MLP(nn.Module): + ''' + Multilayer Perceptron for regression. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(13, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 1) + ) + + + def forward(self, x): + ''' + Forward pass + Note that + ''' + fp = self.layers(x) + return fp + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Load Boston dataset + X, y = load_boston(return_X_y=True) + + # Prepare Boston dataset + dataset = BostonDataset(X, y) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.MSELoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get and prepare inputs + inputs, targets = data + inputs, targets = inputs.float(), targets.float() + targets = targets.reshape((targets.shape[0], 1)) + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 10 == 0: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### Smooth MAE / L1 Loss (`nn.SmoothL1Loss`) + +Recall from above that in comparison, MAE Loss (L1 Loss) works better when there are many outliers, while MSE Loss works better when there are few outliers and relatively small differences between errors. However, sometimes you want to use a loss function that is precisely in between these two. **Smooth MAE Loss** can then be used. Being provided as `nn.SmoothL1Loss`, the error is computed in a squared fashion if the error is smaller than a value for beta (i.e. benefiting from the MSE part). In all other cases, a value similar to the MAE is computed. + +The `beta` parameter is configurable in the `nn.SmoothL1Loss(...)` initialization. + +![](images/image-5.png) + +``` +import os +import numpy as np +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +from sklearn.datasets import load_boston +from sklearn.preprocessing import StandardScaler + +class BostonDataset(torch.utils.data.Dataset): + ''' + Prepare the Boston dataset for regression + ''' + + def __init__(self, X, y, scale_data=True): + if not torch.is_tensor(X) and not torch.is_tensor(y): + # Apply scaling if necessary + if scale_data: + X = StandardScaler().fit_transform(X) + self.X = torch.from_numpy(X) + self.y = torch.from_numpy(y) + + def __len__(self): + return len(self.X) + + def __getitem__(self, i): + return self.X[i], self.y[i] + + +class MLP(nn.Module): + ''' + Multilayer Perceptron for regression. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(13, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 1) + ) + + + def forward(self, x): + ''' + Forward pass + Note that + ''' + fp = self.layers(x) + return fp + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Load Boston dataset + X, y = load_boston(return_X_y=True) + + # Prepare Boston dataset + dataset = BostonDataset(X, y) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.SmoothL1Loss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get and prepare inputs + inputs, targets = data + inputs, targets = inputs.float(), targets.float() + targets = targets.reshape((targets.shape[0], 1)) + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 10 == 0: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### Huber loss (`nn.HuberLoss`) + +**Huber loss** is another loss function that can be used for regression. Depending on a value for `delta`, it is computed in a different way - put briefly, when errors are small, the error itself is part of the square, whereas it's the delta in the case of large errors: + +![](images/image-4-1024x284.png) + +Visually, Huber loss looks as follows given different deltas: + +![](images/huberloss.jpeg) + +In other words, by tweaking the value for `delta`, we can adapt the loss function's sensitivity to outliers. It is therefore also a value that lies somewhere between MSE and MAE loss. + +Being available as `nn.HuberLoss` (with a configurable `delta` parameter), it can be used in the following way: + +``` +import os +import numpy as np +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +from sklearn.datasets import load_boston +from sklearn.preprocessing import StandardScaler + +class BostonDataset(torch.utils.data.Dataset): + ''' + Prepare the Boston dataset for regression + ''' + + def __init__(self, X, y, scale_data=True): + if not torch.is_tensor(X) and not torch.is_tensor(y): + # Apply scaling if necessary + if scale_data: + X = StandardScaler().fit_transform(X) + self.X = torch.from_numpy(X) + self.y = torch.from_numpy(y) + + def __len__(self): + return len(self.X) + + def __getitem__(self, i): + return self.X[i], self.y[i] + + +class MLP(nn.Module): + ''' + Multilayer Perceptron for regression. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(13, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 1) + ) + + + def forward(self, x): + ''' + Forward pass + Note that + ''' + fp = self.layers(x) + return fp + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Load Boston dataset + X, y = load_boston(return_X_y=True) + + # Prepare Boston dataset + dataset = BostonDataset(X, y) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.HuberLoss(delta=1.0) + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get and prepare inputs + inputs, targets = data + inputs, targets = inputs.float(), targets.float() + targets = targets.reshape((targets.shape[0], 1)) + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 10 == 0: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +* * * + +## Summary + +In this article, you have... + +- **Learned what the role of a loss function in a neural network is.** +- **Been familiarized with a variety of PyTorch based loss functions for classification and regression.** +- **Been able to use these loss functions in your Deep Learning models.** + +I hope that this article was useful to you! If it was, please feel free to drop a comment in the comments section 💬 Feel free to do the same if you have questions or other remarks. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## Sources + +PyTorch. (n.d.). _BCELoss — PyTorch 1.7.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html) + +PyTorch. (n.d.). _BCEWithLogitsLoss — PyTorch 1.8.1 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html) + +PyTorch. (2019, March 7). _Difference between cross-entropy loss or log likelihood loss?_ PyTorch Forums. [https://discuss.pytorch.org/t/difference-between-cross-entropy-loss-or-log-likelihood-loss/38816/2](https://discuss.pytorch.org/t/difference-between-cross-entropy-loss-or-log-likelihood-loss/38816/2) + +PyTorch. (n.d.). _NLLLoss — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) + +PyTorch. (n.d.). _CrossEntropyLoss — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) + +PyTorch. (n.d.). _SoftMarginLoss — PyTorch 1.9.0 documentation_. 
[https://pytorch.org/docs/stable/generated/torch.nn.SoftMarginLoss.html](https://pytorch.org/docs/stable/generated/torch.nn.SoftMarginLoss.html) + +PyTorch. (n.d.). _HingeEmbeddingLoss — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.HingeEmbeddingLoss.html#torch.nn.HingeEmbeddingLoss](https://pytorch.org/docs/stable/generated/torch.nn.HingeEmbeddingLoss.html#torch.nn.HingeEmbeddingLoss) + +PyTorch. (n.d.). _MultiMarginLoss — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.MultiMarginLoss.html](https://pytorch.org/docs/stable/generated/torch.nn.MultiMarginLoss.html) + +PyTorch. (n.d.). _MultiLabelSoftMarginLoss — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.MultiLabelSoftMarginLoss.html#torch.nn.MultiLabelSoftMarginLoss](https://pytorch.org/docs/stable/generated/torch.nn.MultiLabelSoftMarginLoss.html#torch.nn.MultiLabelSoftMarginLoss) + +PyTorch. (n.d.). _MultiLabelMarginLoss — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.MultiLabelMarginLoss.html#torch.nn.MultiLabelMarginLoss](https://pytorch.org/docs/stable/generated/torch.nn.MultiLabelMarginLoss.html#torch.nn.MultiLabelMarginLoss) + +MachineCurve. (2019, December 22). _How to use kullback-leibler divergence (KL divergence) with Keras?_ [https://www.machinecurve.com/index.php/2019/12/21/how-to-use-kullback-leibler-divergence-kl-divergence-with-keras/](https://www.machinecurve.com/index.php/2019/12/21/how-to-use-kullback-leibler-divergence-kl-divergence-with-keras/) + +PyTorch. (n.d.). _HuberLoss — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.HuberLoss.html#torch.nn.HuberLoss](https://pytorch.org/docs/stable/generated/torch.nn.HuberLoss.html#torch.nn.HuberLoss) + +PyTorch. (n.d.). _L1Loss — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.L1Loss.html#torch.nn.L1Loss](https://pytorch.org/docs/stable/generated/torch.nn.L1Loss.html#torch.nn.L1Loss) + +PyTorch. (n.d.). _MSELoss — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss) + +PyTorch. (n.d.). _SmoothL1Loss — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.SmoothL1Loss.html#torch.nn.SmoothL1Loss](https://pytorch.org/docs/stable/generated/torch.nn.SmoothL1Loss.html#torch.nn.SmoothL1Loss) diff --git a/how-to-use-sparse-categorical-crossentropy-in-keras.md b/how-to-use-sparse-categorical-crossentropy-in-keras.md new file mode 100644 index 0000000..2a01f42 --- /dev/null +++ b/how-to-use-sparse-categorical-crossentropy-in-keras.md @@ -0,0 +1,421 @@ +--- +title: "How to use sparse categorical crossentropy with TensorFlow 2 and Keras?" +date: "2019-10-06" +categories: + - "buffer" + - "frameworks" +tags: + - "categorical-crossentropy" + - "loss-function" + - "sparse-categorical-crossentropy" +--- + +For multiclass classification problems, many online tutorials - and even François Chollet's book _Deep Learning with Python_, which I think is one of the most intuitive books on deep learning with Keras - use **[categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/)** for computing the [loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) of your neural network. 
+ +However, traditional categorical crossentropy requires that your data is one-hot encoded and hence converted into categorical format. Often, this is not what your dataset looks like when you'll start creating your models. Rather, you likely have feature vectors with integer targets - such as 0 to 9 for the numbers 0 to 9. + +This means that you'll have to convert these targets first. In Keras, this can be done with `to_categorical`, which essentially applies one-hot encoding to your training set's targets. When applied, you can start using categorical crossentropy. + +But did you know that there exists another type of loss - **sparse categorical crossentropy** - with which you can leave the integers as they are, yet benefit from crossentropy loss? I didn't when I just started with Keras, simply because pretty much every article I read performs one-hot encoding before applying regular categorical crossentropy loss. + +In this blog, we'll figure out how to _build a convolutional neural network with sparse categorical crossentropy loss_. + +We'll create an actual CNN with Keras. It'll be a simple one - an extension of a [CNN that we created before](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/), with the MNIST dataset. However, doing that allows us to compare the model in terms of its performance - to actually see whether sparse categorical crossentropy does as good a job as the regular one. + +**After reading this tutorial, you will...** + +- Understand what `to_categorical` does when creating your TensorFlow/Keras models. +- Why it's not necessary if you have integer labels/targets, but why you will have to change your loss function. +- How `sparse_categorical_crossentropy` loss can be useful in that case. + +Let's go! 😎 + +_Note that model code is also available [on GitHub](https://github.com/christianversloot/keras-cnn/blob/master/model_sparse.py)._ + +* * * + +**Update 28/Jan/2021:** Added summary and code example to get started straight away. Performed textual improvements, changed header information and slight addition to title of the tutorial. + +**Update 17/Nov/2020:** Made the code examples compatible with TensorFlow 2 + +**Update 01/Feb/2020**: Fixed an error in full model code. + +* * * + +\[toc\] + +* * * + +## Summary and code example: tf.keras.losses.sparse\_categorical\_crossentropy + +Training a neural network involves passing data forward, through the model, and comparing predictions with ground truth labels. This comparison is done by a loss function. In multiclass classification problems, **categorical crossentropy loss** is the loss function of choice. However, it requires that your labels are one-hot encoded, which is not always the case. + +In that case, **sparse categorical crossentropy loss** can be a good choice. This loss function performs the same type of loss - categorical crossentropy loss - but works on integer targets instead of one-hot encoded ones. Saves you that `to_categorical` step which is common with TensorFlow/Keras models! + +``` +# Compile the model +model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) +``` + +* * * + +## Sparse categorical crossentropy vs normal categorical crossentropy + +Have you also seen lines of code like these in your Keras projects? 
+
+```
+target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes)
+target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes)
+```
+
+Most likely, you have - because many blogs explaining how to create multiclass classifiers with Keras apply **categorical crossentropy**, which requires you to one-hot encode your target vectors.
+
+Now you may wonder: what is one-hot encoding?
+
+### One-hot encoding
+
+Suppose that you have a classification problem with four target classes: { 0, 1, 2, 3 }.
+
+Your dataset likely comes in this flavor: `{ feature vector } -> target`, where your target is an integer value from { 0, 1, 2, 3 }.
+
+However, as we saw [in another blog on categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#categorical-crossentropy), its mathematical structure doesn't allow us to feed it integers directly.
+
+We'll have to convert the targets into categorical format first - with one-hot encoding, or `to_categorical` in Keras.
+
+You'll effectively transform your targets into this:
+
+- For class 0: \[latex\]\[1, 0, 0, 0\]\[/latex\];
+- For class 1: \[latex\]\[0, 1, 0, 0\]\[/latex\];
+- For class 2: \[latex\]\[0, 0, 1, 0\]\[/latex\];
+- For class 3: \[latex\]\[0, 0, 0, 1\]\[/latex\].
+
+Note that when you have more classes, the trick goes on and on - you simply create **n**\-dimensional vectors, where **n** equals the number of unique classes in your dataset.
+
+### Categorical crossentropy
+
+When your targets are converted into categorical format, you can apply **[categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/)**:
+
+![](images/image-6.png)
+
+Don't worry - it's a human pitfall to always think defensively when we see maths.
+
+It's not so difficult at all, to be frank, so make sure to read on!
+
+What you see is the categorical crossentropy formula. What it does is actually really simple: it iterates over all the possible classes `C` that the model can predict during the forward pass of your training process.
+
+For each class, it multiplies the target value for that class - which, with one-hot encoded targets, is 1 for the actual class and 0 for all others - with the (natural) logarithm of the predicted probability for that class. As a consequence, only one term contributes to the loss: the one for the _actual_ target class. The resulting negative log value grows rapidly as the predicted probability for that class moves further away from 1:
+
+[![](images/bce-1024x469.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/bce.png)
+
+### Sparse categorical crossentropy
+
+Now, it could be the case that your dataset is not categorical at first ... and possibly, that it is too large to one-hot encode in memory with `to_categorical`. In that case, it would be rather difficult to use categorical crossentropy, since it depends on categorical data.
+
+However, when you have integer targets instead of categorical vectors as targets, you can use **sparse categorical crossentropy**. It's an integer-based version of the categorical crossentropy loss function, which means that we don't have to convert the targets into categorical format anymore - a short sketch of the difference follows just below, before we introduce the dataset.
+
+* * *
+
+## Creating a CNN with TensorFlow 2 and Keras
+
+Let's now create a CNN with Keras that uses sparse categorical crossentropy. In some folder, create a file called `model.py` and open it in some code editor.
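+
+Before we build the model, here is a minimal, standalone sketch that makes the difference between the two target formats concrete. It is a hypothetical example (the label values are made up, and it is not part of `model.py`); it assumes that TensorFlow 2.x and NumPy are installed:
+
+```
+import numpy as np
+from tensorflow.keras.utils import to_categorical
+
+# Integer targets: usable as-is with sparse categorical crossentropy
+integer_targets = np.array([0, 2, 1, 3])
+print(integer_targets.shape)   # (4,)
+
+# One-hot encoded targets: required for regular categorical crossentropy
+onehot_targets = to_categorical(integer_targets, num_classes=4)
+print(onehot_targets.shape)    # (4, 4)
+print(onehot_targets[0])       # [1. 0. 0. 0.]
+```
+
+Both arrays describe exactly the same labels - sparse categorical crossentropy simply lets you skip the conversion step.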
+
+### Today's dataset: MNIST
+
+As usual, like in our previous blog on [creating a (regular) CNN with Keras](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/), we use the MNIST dataset. This dataset, which contains thousands of 28x28 pixel handwritten digits (individual numbers from 0-9), is one of the standard datasets in machine learning education because it is simple and comes in a clean, normalized format. The images are also relatively small and available in large quantity, which benefits the predictive and generalization power of your model when trained properly. This way, one can really focus on the machine learning aspects of an exercise, rather than on data related issues.
+
+Let's go!
+
+### Software dependencies
+
+If we wish to run the sparse categorical crossentropy Keras CNN, it's necessary to install a few software tools:
+
+- Obviously, you need **TensorFlow**, version 2.x (i.e. some version of 2), which comes with Keras installed as `tensorflow.keras`.
+- By consequence, you'll need to install peer dependencies such as **NumPy**. You'll also need them for processing the data.
+- In order to run any of those, you need to have a working **Python** installation; preferably, your Python version is 3.6+.
+
+Preferably, you run your model in an **Anaconda** environment. This way, you will be able to install your packages in a unique environment with which other packages do not interfere. Mingling Python packages is often a tedious job, which often leads to trouble. Anaconda resolves this by allowing you to use _environments_, or isolated sandboxes in which your code can run. Really recommended!
+
+### Our model
+
+This will be our model for today:
+
+```
+import tensorflow
+from tensorflow.keras.datasets import mnist
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense, Dropout, Flatten
+from tensorflow.keras.layers import Conv2D, MaxPooling2D
+
+# Model configuration
+img_width, img_height = 28, 28
+batch_size = 250
+no_epochs = 25
+no_classes = 10
+validation_split = 0.2
+verbosity = 1
+
+# Load MNIST dataset
+(input_train, target_train), (input_test, target_test) = mnist.load_data()
+
+# Reshape data
+input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
+input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
+input_shape = (img_width, img_height, 1)
+
+# Parse numbers as floats
+input_train = input_train.astype('float32')
+input_test = input_test.astype('float32')
+
+# Normalize data
+input_train = input_train / 255
+input_test = input_test / 255
+
+# Create the model
+model = Sequential()
+model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
+model.add(MaxPooling2D(pool_size=(2, 2)))
+model.add(Dropout(0.25))
+model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
+model.add(MaxPooling2D(pool_size=(2, 2)))
+model.add(Dropout(0.25))
+model.add(Flatten())
+model.add(Dense(256, activation='relu'))
+model.add(Dense(no_classes, activation='softmax'))
+
+# Compile the model
+model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
+              optimizer=tensorflow.keras.optimizers.Adam(),
+              metrics=['accuracy'])
+
+# Fit data to model
+model.fit(input_train, target_train,
+          batch_size=batch_size,
+          epochs=no_epochs,
+          verbose=verbosity,
+          validation_split=validation_split)
+
+# Generate generalization metrics
+score = model.evaluate(input_test, target_test, verbose=0)
+print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
+```
+
+Let's break creating the model apart.
+
+#### Adding imports
+
+First, we add our imports - packages and functions that we'll need for our model to work as intended.
+
+```
+import tensorflow
+from tensorflow.keras.datasets import mnist
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense, Dropout, Flatten
+from tensorflow.keras.layers import Conv2D, MaxPooling2D
+```
+
+More specifically, we...
+
+- Import **TensorFlow** itself. We need the top-level `tensorflow` name later on, when we reference the loss function and the Adam optimizer while compiling the model.
+- Import the **MNIST** dataset. It comes with Keras by default because it's a perfect dataset for educational purposes. When you use a model with this dataset for the first time, Keras will download the dataset automatically, after which it is stored locally - and you don't ever have to worry about downloading the dataset again. Very user friendly.
+- Import the **Sequential API** \- which is one of the two APIs with which engineers can create Keras based models, the other being the Functional API. As Sequential is relatively easier than Functional, we'll use it for this tutorial.
+- Import the **Dense** layer, the **Dropout** function and the **Flatten** layer. Dense layers are used for the classification part of the CNN; Dropout randomly drops out neurons during training, which reduces the odds of overfitting, and Flatten converts the multidimensional output of the convolutional layers (which interpret your images) into a one-dimensional vector to be used by the Dense layers (for classifying the interpretations into the correct classes).
+- Additionally, we import **Conv2D** and **MaxPooling2D**, which are used for image interpretation and downscaling - i.e., the first part of your CNN.
+
+#### Model configuration
+
+Next up, model configuration:
+
+```
+# Model configuration
+img_width, img_height = 28, 28
+batch_size = 250
+no_epochs = 25
+no_classes = 10
+validation_split = 0.2
+verbosity = 1
+```
+
+We specify image width and image height, which are 28 for both given the images in the MNIST dataset. We specify a batch size of 250, which means that during training 250 images at once will be processed. When all images have been processed once, we complete an **epoch**, of which we will have 25 in total during the training of our model. Additionally, we specify the number of classes in advance - 10, the numbers 0 to 9. 20% of our training set will be set apart for validating the model after every epoch, and we set model verbosity to 1 - which means that all possible output is actually displayed on screen.
+
+#### Preparing MNIST data
+
+Next, we load and prepare the MNIST data:
+
+```
+# Load MNIST dataset
+(input_train, target_train), (input_test, target_test) = mnist.load_data()
+
+# Reshape data
+input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
+input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
+input_shape = (img_width, img_height, 1)
+```
+
+What we do is simple - we use `mnist.load_data()` to load the MNIST data into four Python variables, representing inputs and targets for both the training and testing datasets.
+
+Additionally, we reshape the data into (samples, width, height, channels) format, so that TensorFlow will accept it.
+
+#### Additional preparations
+
+Additionally, we perform some other preparations which concern the _data_ instead of how it is handled by your system:
+
+```
+# Parse numbers as floats
+input_train = input_train.astype('float32')
+input_test = input_test.astype('float32')
+
+# Normalize data
+input_train = input_train / 255
+input_test = input_test / 255
+```
+
+We first parse the numbers as floats.
This benefits the optimization step of the training process.
+
+Additionally, we normalize the data, which benefits the training process as well.
+
+#### Model architecture
+
+We then create the architecture of our model:
+
+```
+# Create the model
+model = Sequential()
+model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
+model.add(MaxPooling2D(pool_size=(2, 2)))
+model.add(Dropout(0.25))
+model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
+model.add(MaxPooling2D(pool_size=(2, 2)))
+model.add(Dropout(0.25))
+model.add(Flatten())
+model.add(Dense(256, activation='relu'))
+model.add(Dense(no_classes, activation='softmax'))
+```
+
+To be frank: the architecture of our model doesn't really matter for showing that sparse categorical crossentropy really works. In fact, you can use the architecture you think is best for your machine learning problem. However, we use the architecture above because it is very generic and hence works well in many simple classification scenarios:
+
+- We use two convolutional blocks which comprise a 2-dimensional convolutional layer, max pooling and Dropout. The convolutional layer interprets the features into feature maps, which are subsequently downsampled (made smaller / less granular) by the max pooling operation. Subsequently, Dropout randomly drops out neurons to reduce the odds of overfitting - i.e., the risk that your model is tailored too specifically to your training data, and might not work anymore with data it has never seen.
+- We then flatten the multidimensional output of the convolutional blocks into a one-dimensional vector that can be handled by the densely-connected layers.
+- We use two Dense layers which essentially give the CNN its classification power. We set the number of output neurons of the last layer to `no_classes`, which in the case of the MNIST dataset is 10: each neuron generates the probability (summing to one over all neurons together) that the input belongs to one of the 10 classes in the MNIST scenario. Note that ReLU is used as an [activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) throughout all layers except the last one, given its simplicity and relative power in today's deep learning problems. The last layer uses a Softmax activation, which essentially generates a multiclass probability distribution over all the classes that are available in your targets.
+
+#### Model compilation: hyperparameter tuning
+
+We next compile the model, which involves configuring it by setting hyperparameters such as the loss function and the optimizer:
+
+```
+# Compile the model
+model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
+              optimizer=tensorflow.keras.optimizers.Adam(),
+              metrics=['accuracy'])
+```
+
+We specify the loss function used - **sparse categorical crossentropy!** We use it together with the Adam optimizer, which is one of the standard ones used today in very generic scenarios, and use accuracy as an additional metric, since it is more intuitive to humans.
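+
+As a side note: TensorFlow 2.x also exposes this loss as a class, `tensorflow.keras.losses.SparseCategoricalCrossentropy`. To the best of my knowledge, the sketch below is an equivalent way of compiling the model - we won't use it in the rest of this tutorial, but you may encounter it in other code bases:
+
+```
+# Equivalent compilation with the class-based loss
+model.compile(loss=tensorflow.keras.losses.SparseCategoricalCrossentropy(),
+              optimizer=tensorflow.keras.optimizers.Adam(),
+              metrics=['accuracy'])
+```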
+ +#### Training and evaluation + +Next, we fit the data following the specification created in the model configuration step and specify evaluation metrics that test the trained model with the testing data: + +``` +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +Now, we can start the training process. Open a command prompt, possible the Anaconda one navigating to your environment by means of `conda activate `, and navigate to the folder storing `model.py` by means of the `cd` function. + +Next, start the training process with Python: `python model.py`. + +* * * + +## Model performance + +You should then see something like this: + +``` +48000/48000 [==============================] - 21s 431us/step - loss: 0.3725 - acc: 0.8881 - val_loss: 0.0941 - val_acc: 0.9732 +Epoch 2/25 +48000/48000 [==============================] - 6s 124us/step - loss: 0.0974 - acc: 0.9698 - val_loss: 0.0609 - val_acc: 0.9821 +Epoch 3/25 +48000/48000 [==============================] - 6s 122us/step - loss: 0.0702 - acc: 0.9779 - val_loss: 0.0569 - val_acc: 0.9832 +Epoch 4/25 +48000/48000 [==============================] - 6s 124us/step - loss: 0.0548 - acc: 0.9832 - val_loss: 0.0405 - val_acc: 0.9877 +Epoch 5/25 +48000/48000 [==============================] - 6s 122us/step - loss: 0.0450 - acc: 0.9861 - val_loss: 0.0384 - val_acc: 0.9873 +Epoch 6/25 +48000/48000 [==============================] - 6s 122us/step - loss: 0.0384 - acc: 0.9877 - val_loss: 0.0366 - val_acc: 0.9886 +Epoch 7/25 +48000/48000 [==============================] - 5s 100us/step - loss: 0.0342 - acc: 0.9892 - val_loss: 0.0321 - val_acc: 0.9907 +Epoch 8/25 +48000/48000 [==============================] - 5s 94us/step - loss: 0.0301 - acc: 0.9899 - val_loss: 0.0323 - val_acc: 0.9898 +Epoch 9/25 +48000/48000 [==============================] - 4s 76us/step - loss: 0.0257 - acc: 0.9916 - val_loss: 0.0317 - val_acc: 0.9907 +Epoch 10/25 +48000/48000 [==============================] - 4s 76us/step - loss: 0.0238 - acc: 0.9922 - val_loss: 0.0318 - val_acc: 0.9910 +Epoch 11/25 +48000/48000 [==============================] - 4s 82us/step - loss: 0.0214 - acc: 0.9928 - val_loss: 0.0324 - val_acc: 0.9905 +Epoch 12/25 +48000/48000 [==============================] - 4s 85us/step - loss: 0.0201 - acc: 0.9934 - val_loss: 0.0296 - val_acc: 0.9907 +Epoch 13/25 +48000/48000 [==============================] - 4s 88us/step - loss: 0.0173 - acc: 0.9940 - val_loss: 0.0302 - val_acc: 0.9914 +Epoch 14/25 +48000/48000 [==============================] - 4s 79us/step - loss: 0.0157 - acc: 0.9948 - val_loss: 0.0306 - val_acc: 0.9912 +Epoch 15/25 +48000/48000 [==============================] - 4s 85us/step - loss: 0.0154 - acc: 0.9949 - val_loss: 0.0308 - val_acc: 0.9910 +Epoch 16/25 +48000/48000 [==============================] - 4s 84us/step - loss: 0.0146 - acc: 0.9950 - val_loss: 0.0278 - val_acc: 0.9918 +Epoch 17/25 +48000/48000 [==============================] - 4s 84us/step - loss: 0.0134 - acc: 0.9954 - val_loss: 0.0302 - val_acc: 0.9911 +Epoch 18/25 +48000/48000 [==============================] - 4s 79us/step - loss: 0.0129 - acc: 0.9956 - val_loss: 0.0280 - val_acc: 0.9922 +Epoch 19/25 +48000/48000 [==============================] - 4s 80us/step - loss: 0.0096 - acc: 0.9968 - val_loss: 0.0358 - 
val_acc: 0.9908 +Epoch 20/25 +48000/48000 [==============================] - 4s 79us/step - loss: 0.0114 - acc: 0.9960 - val_loss: 0.0310 - val_acc: 0.9899 +Epoch 21/25 +48000/48000 [==============================] - 4s 86us/step - loss: 0.0086 - acc: 0.9970 - val_loss: 0.0300 - val_acc: 0.9922 +Epoch 22/25 +48000/48000 [==============================] - 4s 88us/step - loss: 0.0088 - acc: 0.9970 - val_loss: 0.0320 - val_acc: 0.9915 +Epoch 23/25 +48000/48000 [==============================] - 4s 87us/step - loss: 0.0080 - acc: 0.9971 - val_loss: 0.0320 - val_acc: 0.9919 +Epoch 24/25 +48000/48000 [==============================] - 4s 87us/step - loss: 0.0083 - acc: 0.9969 - val_loss: 0.0416 - val_acc: 0.9887 +Epoch 25/25 +48000/48000 [==============================] - 4s 86us/step - loss: 0.0083 - acc: 0.9969 - val_loss: 0.0334 - val_acc: 0.9917 +Test loss: 0.02523074444185986 / Test accuracy: 0.9932 +``` + +25 epochs as configured, with impressive scores in both the validation and testing phases. It pretty much works as well as the classifier created with categorical crossentropy - and I actually think the difference can be attributed to the relative randomness of the model optimization process: + +``` +Epoch 25/25 +48000/48000 [==============================] - 4s 85us/step - loss: 0.0072 - acc: 0.9975 - val_loss: 0.0319 - val_acc: 0.9925 + +Test loss: 0.02579820747410522 / Test accuracy: 0.9926 +``` + +* * * + +## Recap + +Well, today, we've seen how to create a Convolutional Neural Network (and by consequence, any model) with **sparse categorical crossentropy** in Keras. If you have integer targets in your dataset, which happens in many cases, you usually perform `to_categorical` in order to use multiclass crossentropy loss. With sparse categorical crossentropy, this is no longer necessary. This blog demonstrated this by means of an example Keras implementation of a CNN that classifies the MNIST dataset. + +Model code is also available [on GitHub](https://github.com/christianversloot/keras-cnn/blob/master/model_sparse.py), if it benefits you. + +I hope this blog helped you - if it did, or if you have any questions, let me know in the comments section! 👇 I'm happy to answer any questions you may have 😊 Thanks and enjoy coding! + +* * * + +## References + +Chollet, F. (2017). _Deep Learning with Python_. New York, NY: Manning Publications. + +Keras. (n.d.). Losses. Retrieved from [https://keras.io/losses/](https://keras.io/losses/) + +How to create a CNN classifier with Keras? – MachineCurve. (2019, September 24). Retrieved from [https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras) + +About loss and loss functions – MachineCurve. (2019, October 4). Retrieved from [https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) diff --git a/how-to-use-tensorboard-with-keras.md b/how-to-use-tensorboard-with-keras.md new file mode 100644 index 0000000..713c1b1 --- /dev/null +++ b/how-to-use-tensorboard-with-keras.md @@ -0,0 +1,726 @@ +--- +title: "How to use TensorBoard with TensorFlow 2 and Keras?" 
+date: "2019-11-13" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "keras" + - "machine-learning" + - "neural-network" + - "tensorboard" + - "tensorflow" + - "visualization" +--- + +If you want to visualize how your Keras model performs, it's possible to use MachineCurve's tutorial for [visualizing the training process](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/). Additionally, if you wish to visualize the model yourself, you can [use another tutorial](https://www.machinecurve.com/index.php/2019/10/07/how-to-visualize-a-model-with-keras/). + +But are they they only options you've got? + +No - not at all! + +You may also wish to use TensorBoard, for example. + +In this blog post, we'll discover what TensorBoard is, what you can use it for, and how it works with Keras. We specifically take a look at how TensorBoard is integrated into the Keras API by means of [callbacks](https://www.machinecurve.com/index.php/2020/11/10/an-introduction-to-tensorflow-keras-callbacks/), and we take a look at the specific Keras callback that can be used to control TensorBoard. + +This is followed by an example implementation of TensorBoard into your Keras model - by means of [our Keras CNN](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) and the [CIFAR10 dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/). This way, you'll understand _what it is and how it works_, allowing you to easily implement TensorBoard in your own deep learning model. + +Let's go! 😎 + +In this tutorial, you will learn... + +- **What TensorBoard is and why it can be useful.** +- **How TensorBoard is implemented in the TensorFlow/Keras library.** +- **How you can add TensorBoard to your Keras model, and the configuration options you can set.** +- **What TensorBoard looks like after you added it to your model.** + +**Update 01/Mar/2021:** made some style improvements and changed title to reflect that this code works with any TF 2 based version, not just TF 2.0. + +**Update 13/Jan/2021:** added summary of what you will learn above. Also added internal links about TensorBoard to pages on this website. + +**Update 11/Jan/2021:** updated this article to 2021. Ensured compatibility with TensorFlow 2.x, removed references to Theano and CNTK, and added information about PyTorch usage. Fixed spelling mistakes. Added links to other articles on this website. Updated references and article metadata. Added code example for quick usage. + +* * * + +\[toc\] + +* * * + +## Code example: using TensorBoard with TensorFlow and Keras + +You can use the code example below to get started immediately. If you want to understand TensorBoard and how it can be used in more detail, make sure to continue and read the rest of the article below. + +``` +# The 'model' variable is a compiled Keras model. + +# Import TensorBoard +from tensorflow.keras.callbacks import TensorBoard + +# Define Tensorboard as a Keras callback +tensorboard = TensorBoard( + log_dir='.\logs', + histogram_freq=1, + write_images=True +) +keras_callbacks = [ + tensorboard +] + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split, + callbacks=keras_callbacks) +``` + +* * * + +## What is TensorBoard? 
+ +Let's turn to the TensorFlow docs for a more elaborate description, as they can describe it best: + +> In machine learning, to improve something you often need to be able to measure it. TensorBoard is a tool for providing the measurements and visualizations needed during the machine learning workflow. It enables tracking experiment metrics like loss and accuracy, visualizing the model graph, projecting embeddings to a lower dimensional space, and much more. +> +> [TensorBoard - Get Started](https://www.tensorflow.org/tensorboard/get_started) + +In short, TensorBoard helps you better understand your machine learning model that you generated with TensorFlow. It allows you to measure various aspects - such as the [weights](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/), biases, gradients of your model - as well as how they progressed during training (i.e., across epochs). Additionally, you can [visualize model performance](https://www.machinecurve.com/index.php/2019/12/03/visualize-keras-models-overview-of-visualization-methods-tools/) over time, visualize classes in a multidimensional space, and so on. + +Quite exciting! + +Fortunately, TensorBoard integrates natively with Keras. Let's find out how it does next 😊 + +* * * + +## TensorBoard and the Keras API + +Keras provides TensorBoard in the form of a _[callback](https://www.machinecurve.com/index.php/2020/11/10/an-introduction-to-tensorflow-keras-callbacks/)_, which is "a set of functions to be applied at given stages of the training procedure" (Keras, n.d.). According to the Keras website, they can be used to take a look at the model's internals and statistics during training, but also afterwards. + +An exemplary combination of Keras callbacks is [EarlyStopping and ModelCheckpoint](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/), which you can use to (1) identify whether your model's performance is still increasing, and if not, stop it, while (2) always [saving](https://www.machinecurve.com/index.php/2020/02/14/how-to-save-and-load-a-model-with-keras/) the best model to disk. + +In January 2021, Keras defined the **TensorBoard** callback as follows (Keras, n.d.): + +``` +tf.keras.callbacks.tensorboard_v1.TensorBoard(log_dir='./logs', histogram_freq=0, batch_size=32, write_graph=True, write_grads=False, write_images=False, embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None, embeddings_data=None, update_freq='epoch') +``` + +Let's break the arguments for the `TensorBoard` callback apart and describe what they do and how they work. This is mainly based on the description provided in the Keras API docs for the TensorBoard callback (TensorFlow, n.d.): + +- With `log_dir` you specify the path to the directory where Keras saves the log files that you can later read when starting the actual TensorBoard. +- The histogram frequency, or `histogram_freq`, determines the frequency (in number of epochs) for compute weight histograms for all layers of the model (Sunside, n.d.). If `histogram_freq = 0`, no histograms will be computed, and computing them requires validation data to be present. [Click here to understand them more deeply.](#about-histogram_freq-what-are-weight-histograms) +- If you choose to compute these histograms, you can also specify the batch size with the `batch_size` attribute. It defaults to 32, but is only relevant if you compute histograms. 
The larger the batch size, the higher the memory requirements for your system. +- If you don't wish to visualize the graph of your model, as you can also do [with Keras-native functions](https://www.machinecurve.com/index.php/2019/10/07/how-to-visualize-a-model-with-keras/), you can choose to set `write_graph` to `False`. By default, it's set to `True` and your network graph will be visualized. +- While `histogram_freq` can be used to visualize the histograms of your layer _weights_ across epochs, you may also wish to visualize the distribution of _gradients_ for every layer. If `histogram_freq = True`, you can also set `write_grads` to True, which generates gradient histograms as well. +- Some people are more into numbers while others are more into visuals. For the latter, setting `write_images` to `True` results in the fact that Keras generates images for the weights based on the numbers present. This may allow you to spot patterns in weight changes, even if you're not too much into numbers! Very useful. +- The Keras TensorBoard callback also provides quite some functions related to _embeddings_. We'll cover these next. However, like me, it may be that your knowledge about embeddings isn't really... up to date, to say it nicely. In that case, I've attempted to explain the concept of embeddings briefly as part of this blog post - [and you can find it here](#about-embeddings-in-tensorboard-what-are-they). I hope it helps you. +- With `embeddings_freq`, like the `histogram_freq`, you can specify how often embeddings should be saved in the logs. The number you specify is the number of epochs. If set to zero, no embeddings will be saved. +- The attribute `embeddings_layer_names` can be used to specify the layers at which embeddings should be learnt. They do not necessarily have to be learnt at the most downstream layer of your neural network, but can be learnt e.g. in the middle, allowing you to find out whether e.g. a model has too many layers. With this attribute, you can specify the layers at which the embeddings should be learnt and visualized in TensorBoard. +- In `embeddings_metadata`, you provide a dictionary of file names per layer in `embeddings_layer_names`. Each file essentially contains the _targets_ of the samples you'll use next, in corresponding order. +- In `embeddings_data`, you specify the data which should be used to generate the embeddings learnt. That is, you generate the visualization based on `embeddings_data` and `embeddings_metadata`. While often, test data is suitable, you can also use data that does not belong to training and testing data. This is up to you. +- Finally, with `update_freq`, you can specify how often data should be written to `log_dir` logs. You can either configure `batch`, `epoch` or an integer number. If you set `update_freq` to `batch`, logs will be written after each batch (which, in Keras, you set as `batch_size` when calling `model.fit`). This may especially be useful when you use a minibatch [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) (or [gradient descent-like](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/)) optimizer. However, when you set the update frequency to `epoch`, it writes logs after each epoch, as if it were a batch SGD approach. You can also configure the update frequency to be an integer value `n`, which means that logs are written every `n` samples. The Keras docs warn you here: "writing too frequently (...) 
can slow down your training" (Keras, n.d.). + +Next, we'll dive more deeply into questions arising from the Keras API, specifically these ones: + +- What are weight histograms? +- What do weight images look like? +- What are TensorBoard embeddings? +- Does TensorBoard also work with PyTorch? + +If you're interested in any or multiple of those, I'd say: read on! 😎 + +If not, that's also fine 😊 - but in that case, it's best to [click here and move to the section about implementing the Keras model with TensorBoard](#implementing-tensorboard-into-your-keras-model). + +### About `histogram_freq`: what are weight histograms? + +Yep, those fancy weight histograms you can specify with `histogram_freq` when configuring TensorBoard with Keras 😎 + +Well, what are they? + +Let's take a look at some examples: + +![](images/weight_histogram_1.jpg) + +![](images/weight_histogram_2.jpg) + +Here, we see the _bias_ values and the true _weight_ values for the second and first [convolutional layer](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) in our model (which we specify later 😉), but then when we configured it to train for only 5 epochs. + +A weight histogram essentially tells you something about the distribution of your weights: visualize a histogram as many buckets where you can drop balls into. When a weight equals the (very small) range for one bucket, you drop the weight (represented by the ball) into that particular bucket. At the end, you can take a global look and see how many weights have landed in every bucket. This is your histogram. + +By inspecting weight histograms over time (i.e., by epoch), you can see how the distribution of weights or biases has changed over time. This leads to three important observations: + +- If the weight histograms for some layers didn't change at all in terms of their shape (e.g., the second image above), you can easily say that this layer didn't participate in learning. Question yourself if this layer is actually necessary, and if your architecture requires a change. +- If the weight histograms changed much, your layer contributed significantly to learning. Good! Also question (and perhaps test empirically) whether adding additional layers might capture the dataset even better. +- If the weight histograms changed significantly, but not extremely (e.g. in the first histogram, where the peak of the histogram essentially moved to the left over time), it's likely that you don't really need to change your architecture. + +Yep - what seemed difficult at first is actually really simple 😊 + +### About `write_images`: what do weight images look like? + +It may be that you wish to compare weights and how they changed over time. You may do so numerically, by inspecting the _numbers_ - but this doesn't work for everyone. + +Especially when you're very visually oriented, you may wish to compare visuals rather than numbers, and yay - TensorBoard with Keras supports this. By configuring `write_images`, you can actually visualize the weights as an image: + +![](images/weight_images.jpg) + +That's great! + +### About embeddings in TensorBoard: what are they? + +When I came across the embeddings part of the TensorBoard callback, I was quite confused. + +What the \*bleep\* are they? + +It cost me quite some time to understand them, and I bet you don't fully understand them (yet!) either, since you're reading this text. + +I'll try to explain them as intuitively as possible. 
We'll do so by means of the TensorFlow docs, which you can find in the references as TensorFlow (n.d.). + +If you wish to represent words, or objects - since we can also generate embeddings of images, such as the ones present in the MNIST or CIFAR10 datasets - you'll have to convert them into _categorical vectors_ or _integers_. These approaches are relatively inefficient and arbitrary, which either results in longer training times or missing relationships between objects. + +[Embeddings](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/) enter the picture here. They allow you to map an object (say, a word, or an image) into a high-dimensional space by specifying coordinates for each dimension. For a two-dimensional space, that would be (x, y) coordinates. Often, you would want to use higher-dimensional spaces, because this makes your representation more accurate. + +Let's make it more intuitive. Take a look at this point cloud: + +https://www.youtube.com/watch?v=MPHPJ5mBTA8 + +...which allows you to map points into a three-dimensional space: they are represented by (x, y, z) coordinates. _What_ you plot is the actual color and other details of the point as measured by e.g. a LiDAR device. + +Now suppose that instead of LiDAR measurement, you map samples from the MNIST dataset to (x, y, z) coordinates. That is, you map these images to the point cloud, generating a three-dimensional space... _an embedding_. + +The question you now may have is as follows: **how are [embeddings](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/) generated?** + +Well, they are learnt - at least in TensorFlow (TensorFlow, n.d.). Embeddings are initialized randomly, and the table representing classes vertically and the many dimensions horizontally is learnt during the training process. This way, you can generate really cool embeddings, such as this one for the MNIST dataset: + +https://www.youtube.com/watch?v=0W6o4chVxfk + +### Can TensorBoard be used when using PyTorch? + +Yes, TensorBoard [can be used with PyTorch](https://pytorch.org/docs/stable/tensorboard.html). + +* * * + +_You may also be interested in:_ + +- [How to visualize your Keras model without TensorBoard?](https://www.machinecurve.com/index.php/2019/10/07/how-to-visualize-a-model-with-keras/) + +* * * + +## Implementing TensorBoard into your Keras model + +### What model will we create today? + +Simple. We're going to use the [Keras CNN](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) that we already created before, as it is relatively simple and achieves adequate performance across a wide range of machine learning tasks. + +If you wish to understand how to create a convolutional neural network with Keras, specifically using Conv2D layers, click the link above 😄 + +### What dataset will we use? + +We're going to use the [CIFAR10 dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/). This one, like the MNIST one, also comes with Keras by default (Keras, n.d.). It's a 32x32 pixel dataset representing common objects across ten classes, and when visualized looks as follows: + +![](images/cifar10_images.png) + +Ten randomly generated samples from the CIFAR10 dataset. 
+ +If you wish to read more about the dataset, please feel free to look around at the [University of Toronto website](https://www.cs.toronto.edu/~kriz/cifar.html), which is home of the dataset 😊 + +### What you'll need to run this model + +As with any software scenario, you'll need a fair share of dependencies if you wish to run the TensorBoard based Keras CNN successfully: + +- Obviously, you'll need **TensorFlow** version 2.x, which includes Keras by default. +- For both, you'll need a recent version of **Python**. +- Additionally, but only if you wish to visualize the dataset, you'll need **Matplotlib**. + +### Specifying the imports + +Now that we know _what_ you need, we can actually create the model. + +Open your Explorer and create a file in a directory of your choice with a name of your choice, e.g. `model_cifar10_tensorboard.py`, as I called it. + +Let's next specify the imports: + +``` +import tensorflow.keras +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras.callbacks import TensorBoard +from time import time +``` + +They are really simple: + +- We import Keras itself; +- We import the CIFAR10 dataset; +- We import the Sequential API for stacking our layers on top of each other; +- We import the Dense, Dropout, Flatten, [Conv2D](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) and [MaxPooling2D](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/) layers - refer to [this post](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) if you wish to understand them in more detail. +- We import `TensorBoard` from the Keras callbacks. + +### Model configuration & loading CIFAR10 data + +Next, we configure the model, which is essentially just a configuration of image width and height, batch size, epochs, classes, [validation split](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/) and verbosity - just the regular stuff: + +``` +# Model configuration +img_width, img_height = 32, 32 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +``` + +And (which is why I love Keras) we next [import the data](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/) with one line of code (or two, if we include the comment): + +``` +# Load CIFAR10 dataset +(input_train, target_train), (input_test, target_test) = cifar10.load_data() +``` + +### Optional: data visualization + +Now, if you wish to visualize the data, a.k.a. create this plot... 
+ +![](images/cifar10_images.png) + +...this is what you'll have to add: + +``` +# Visualize CIFAR10 dataset +import matplotlib.pyplot as plt +classes = { + 0: 'airplane', + 1: 'automobile', + 2: 'bird', + 3: 'cat', + 4: 'deer', + 5: 'dog', + 6: 'frog', + 7: 'horse', + 8: 'ship', + 9: 'truck' +} +fig, axes = plt.subplots(2,5, sharex=True) +axes[0,0].imshow(input_train[0]) +axes[0,1].imshow(input_train[1]) +axes[0,2].imshow(input_train[2]) +axes[0,3].imshow(input_train[3]) +axes[0,4].imshow(input_train[4]) +axes[1,0].imshow(input_train[5]) +axes[1,1].imshow(input_train[6]) +axes[1,2].imshow(input_train[7]) +axes[1,3].imshow(input_train[8]) +axes[1,4].imshow(input_train[9]) +axes[0,0].set_title(classes[target_train[0][0]]) +axes[0,1].set_title(classes[target_train[1][0]]) +axes[0,2].set_title(classes[target_train[2][0]]) +axes[0,3].set_title(classes[target_train[3][0]]) +axes[0,4].set_title(classes[target_train[4][0]]) +axes[1,0].set_title(classes[target_train[5][0]]) +axes[1,1].set_title(classes[target_train[6][0]]) +axes[1,2].set_title(classes[target_train[7][0]]) +axes[1,3].set_title(classes[target_train[8][0]]) +axes[1,4].set_title(classes[target_train[9][0]]) +axes[0,0].set_axis_off() +axes[0,1].set_axis_off() +axes[0,2].set_axis_off() +axes[0,3].set_axis_off() +axes[0,4].set_axis_off() +axes[1,0].set_axis_off() +axes[1,1].set_axis_off() +axes[1,2].set_axis_off() +axes[1,3].set_axis_off() +axes[1,4].set_axis_off() +plt.show() +``` + +Which is a bunch of code that just tells Matplotlib to visualize the first ten samples of the CIFAR10 dataset you just imported as one plot with ten sub plots. + +### Data preparation + +Now, we can prepare our data, which comprises these steps: + +- Setting the [shape of our input data](https://www.machinecurve.com/index.php/2020/04/05/how-to-find-the-value-for-keras-input_shape-input_dim/), which is of shape `(img_width, img_height, 3)` because we're working with RGB and hence three-channel images. +- Converting the input data into `float32` format, which apparently [speeds up the training process](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/) (Quora, n.d.). +- Input data [normalization](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/). +- Generating [categorical data from the integer targets](https://www.machinecurve.com/index.php/2020/11/24/one-hot-encoding-for-machine-learning-with-tensorflow-and-keras/), allowing us to use [categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) for computing loss. 
+ +``` +# Set input shape +input_shape = (img_width, img_height, 3) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) +``` + +### Model architecture + +Next, we can specify the model architecture: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) +``` + +It's a relatively simple [convolutional](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) architecture, with two convolutional blocks comprising [conv layers](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/), [max pooling](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/) and [dropout](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/), followed by two Dense layers, with flattening in between. + +Refer to [this post](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) if you wish to understand this architecture in more detail. + +### Model compilation & fitting data + +Next, we can compile our model - i.e., add our configuration or the model's hyperparameters - and fit the data: + +``` +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Define Tensorboard as a Keras callback +tensorboard = TensorBoard( + log_dir='.\logs', + histogram_freq=1, + write_images=True +) +keras_callbacks = [ + tensorboard +] + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split, + callbacks=keras_callbacks) +``` + +Compiling the model involves specifying a loss function ([categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/)), an [optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) and an additional metric - which is not too exciting. + +Fitting the data to the compiled model is neither (you just specify the data, the number of epochs, batch size, and so on) - except for one thing: the additional [callbacks](https://www.machinecurve.com/index.php/2020/11/10/an-introduction-to-tensorflow-keras-callbacks/) variable that we added. + +And this `callbacks` variable refers to `keras_callbacks`, which is an array of Keras callbacks that we apply to this model - in this case, `tensorboard`! + +Tensorboard, or `tensorboard`, in its own is the implementation as defined by the Keras API. 
In our case, we save logs at `.\logs`, generate weight histograms after each epochs, and do write weight images to our logs. Take a look at the API spec above if you wish to understand the choices you can make. + +### Model evaluation + +Finally, we add this evaluation code which tells you how well the trained model performs based on the testing data - i.e., how well it [generalizes](https://www.machinecurve.com/index.php/2020/12/01/how-to-check-if-your-deep-learning-model-is-underfitting-or-overfitting/) to data it has never seen before: + +``` +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +### Full model code + +If you're interested in the full model code - here you go: + +``` +import tensorflow.keras +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras.callbacks import TensorBoard +from time import time + +# Model configuration +img_width, img_height = 32, 32 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load CIFAR10 dataset +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Visualize CIFAR10 dataset +import matplotlib.pyplot as plt +classes = { + 0: 'airplane', + 1: 'automobile', + 2: 'bird', + 3: 'cat', + 4: 'deer', + 5: 'dog', + 6: 'frog', + 7: 'horse', + 8: 'ship', + 9: 'truck' +} +fig, axes = plt.subplots(2,5, sharex=True) +axes[0,0].imshow(input_train[0]) +axes[0,1].imshow(input_train[1]) +axes[0,2].imshow(input_train[2]) +axes[0,3].imshow(input_train[3]) +axes[0,4].imshow(input_train[4]) +axes[1,0].imshow(input_train[5]) +axes[1,1].imshow(input_train[6]) +axes[1,2].imshow(input_train[7]) +axes[1,3].imshow(input_train[8]) +axes[1,4].imshow(input_train[9]) +axes[0,0].set_title(classes[target_train[0][0]]) +axes[0,1].set_title(classes[target_train[1][0]]) +axes[0,2].set_title(classes[target_train[2][0]]) +axes[0,3].set_title(classes[target_train[3][0]]) +axes[0,4].set_title(classes[target_train[4][0]]) +axes[1,0].set_title(classes[target_train[5][0]]) +axes[1,1].set_title(classes[target_train[6][0]]) +axes[1,2].set_title(classes[target_train[7][0]]) +axes[1,3].set_title(classes[target_train[8][0]]) +axes[1,4].set_title(classes[target_train[9][0]]) +axes[0,0].set_axis_off() +axes[0,1].set_axis_off() +axes[0,2].set_axis_off() +axes[0,3].set_axis_off() +axes[0,4].set_axis_off() +axes[1,0].set_axis_off() +axes[1,1].set_axis_off() +axes[1,2].set_axis_off() +axes[1,3].set_axis_off() +axes[1,4].set_axis_off() +plt.show() + +# Set input shape +input_shape = (img_width, img_height, 3) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Convert them into black or white: [0, 1]. 
+input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Define Tensorboard as a Keras callback +tensorboard = TensorBoard( + log_dir='.\logs', + histogram_freq=1, + write_images=True +) +keras_callbacks = [ + tensorboard +] + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split, + callbacks=keras_callbacks) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +* * * + +## Starting the training process + +Now, you can start the training process by simply opening a terminal that covers the dependencies that we listed before. `cd` to the directory you saved your model to, and start training with Python, with e.g. `python model_cifar10_tensorboard.py`. + +Normally, your model would start the training process, but I ran into this error at first: + +``` +tensorflow.python.framework.errors_impl.NotFoundError: Failed to create a directory: ./logs/1573629879\train; No such file or directory [Op:CreateSummaryFileWriter] +``` + +...if you're facing this, take a close look at how you specify the `log_dir` in your `TensorBoard` callback. If you specify your directory as e.g. `/this/is/a/dir`, it won't work. 
Instead, when you specify your logs directory as `\this\is\a\dir`, it will 😊 + +Now, the training process should commence, and you'll have to wait a bit to see results flowing in 😄 + +In my case, for the training process and the evaluation step, the results were as follows: + +``` +40000/40000 [==============================] - 6s 151us/step - loss: 1.7320 - accuracy: 0.3748 - val_loss: 1.4869 - val_accuracy: 0.4693 +Epoch 2/25 +40000/40000 [==============================] - 4s 103us/step - loss: 1.3774 - accuracy: 0.5084 - val_loss: 1.2976 - val_accuracy: 0.5448 +Epoch 3/25 +40000/40000 [==============================] - 4s 103us/step - loss: 1.2614 - accuracy: 0.5557 - val_loss: 1.1788 - val_accuracy: 0.5959 +Epoch 4/25 +40000/40000 [==============================] - 4s 105us/step - loss: 1.1687 - accuracy: 0.5863 - val_loss: 1.1033 - val_accuracy: 0.6199 +Epoch 5/25 +40000/40000 [==============================] - 4s 103us/step - loss: 1.1042 - accuracy: 0.6119 - val_loss: 1.0838 - val_accuracy: 0.6263 +Epoch 6/25 +40000/40000 [==============================] - 4s 101us/step - loss: 1.0471 - accuracy: 0.6307 - val_loss: 1.0273 - val_accuracy: 0.6428 +Epoch 7/25 +40000/40000 [==============================] - 4s 102us/step - loss: 0.9940 - accuracy: 0.6548 - val_loss: 0.9785 - val_accuracy: 0.6638 +Epoch 8/25 +40000/40000 [==============================] - 4s 102us/step - loss: 0.9554 - accuracy: 0.6669 - val_loss: 0.9411 - val_accuracy: 0.6739 +Epoch 9/25 +40000/40000 [==============================] - 4s 104us/step - loss: 0.9162 - accuracy: 0.6781 - val_loss: 0.9323 - val_accuracy: 0.6745 +Epoch 10/25 +40000/40000 [==============================] - 4s 106us/step - loss: 0.8866 - accuracy: 0.6895 - val_loss: 0.8977 - val_accuracy: 0.6880 +Epoch 11/25 +40000/40000 [==============================] - 4s 104us/step - loss: 0.8484 - accuracy: 0.7008 - val_loss: 0.8895 - val_accuracy: 0.6926 +Epoch 12/25 +40000/40000 [==============================] - 4s 108us/step - loss: 0.8119 - accuracy: 0.7171 - val_loss: 0.8796 - val_accuracy: 0.6969 +Epoch 13/25 +40000/40000 [==============================] - 4s 110us/step - loss: 0.7792 - accuracy: 0.7264 - val_loss: 0.8721 - val_accuracy: 0.6986 +Epoch 14/25 +40000/40000 [==============================] - 4s 108us/step - loss: 0.7480 - accuracy: 0.7369 - val_loss: 0.8384 - val_accuracy: 0.7127 +Epoch 15/25 +40000/40000 [==============================] - 4s 111us/step - loss: 0.7214 - accuracy: 0.7481 - val_loss: 0.8160 - val_accuracy: 0.7191 +Epoch 16/25 +40000/40000 [==============================] - 4s 112us/step - loss: 0.6927 - accuracy: 0.7575 - val_loss: 0.8109 - val_accuracy: 0.7215 +Epoch 17/25 +40000/40000 [==============================] - 5s 113us/step - loss: 0.6657 - accuracy: 0.7660 - val_loss: 0.8163 - val_accuracy: 0.7203 +Epoch 18/25 +40000/40000 [==============================] - 5s 113us/step - loss: 0.6400 - accuracy: 0.7747 - val_loss: 0.7908 - val_accuracy: 0.7282 +Epoch 19/25 +40000/40000 [==============================] - 5s 115us/step - loss: 0.6099 - accuracy: 0.7869 - val_loss: 0.8109 - val_accuracy: 0.7236 +Epoch 20/25 +40000/40000 [==============================] - 5s 115us/step - loss: 0.5808 - accuracy: 0.7967 - val_loss: 0.7823 - val_accuracy: 0.7364 +Epoch 21/25 +40000/40000 [==============================] - 5s 120us/step - loss: 0.5572 - accuracy: 0.8049 - val_loss: 0.7999 - val_accuracy: 0.7231 +Epoch 22/25 +40000/40000 [==============================] - 5s 114us/step - loss: 0.5226 - accuracy: 0.8170 - val_loss: 
0.8005 - val_accuracy: 0.7310 +Epoch 23/25 +40000/40000 [==============================] - 5s 114us/step - loss: 0.5092 - accuracy: 0.8203 - val_loss: 0.7951 - val_accuracy: 0.7341 +Epoch 24/25 +40000/40000 [==============================] - 5s 124us/step - loss: 0.4890 - accuracy: 0.8291 - val_loss: 0.8062 - val_accuracy: 0.7257 +Epoch 25/25 +40000/40000 [==============================] - 5s 124us/step - loss: 0.4627 - accuracy: 0.8370 - val_loss: 0.8146 - val_accuracy: 0.7325 +Test loss: 0.823152067565918 / Test accuracy: 0.7271000146865845 +``` + +* * * + +## Viewing model performance in TensorBoard + +We can now inspect model performance in TensorBoard! + +Once the [training process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) is finished, execute this command in the same terminal: + +``` +tensorboard --logdir=./logs +``` + +(note that you may wish to change the `--logdir` flag if you used a different `log_dir` in your Keras callback) + +This is what you likely see next: + +``` +Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all +TensorBoard 2.0.0 at http://localhost:6006/ (Press CTRL+C to quit) +``` + +In that case, you can open a browser, navigate to http://localhost:6006, and see TensorBoard! 😍 + +### TensorBoard Scalars tab + +...which opens with the Scalars tab: + +[![](images/image-1-1024x505.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/image-1.png) + +This tab essentially shows you how the training process happened over time by showing you the [loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) (for [training and validation data](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/)) as well as other metrics. On the left, you can configure the charts. + +### TensorBoard Images tab + +Next, you can move to the Images tab: + +[![](images/image-2-1024x505.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/image-2.png) + +This tab shows you the weight images that you could configure with `write_images`. + +### TensorBoard Graphs tab + +The Graphs tab shows you the network graph created by Keras when training your model: + +[![](images/image-6-1024x480.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/image-6.png) + +### TensorBoard Distributions tab + +The Distributions tab shows you how the weights and biases are distributed per iteration: + +[![](images/image-4-1024x505.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/image-4.png) + +### TensorBoard Histograms tab + +Finally, the Histograms tab shows the weights histograms which help you determine how the model learnt: + +[![](images/image-5-1024x505.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/image-5.png) + +In the case of our model that learns to classify based on the [CIFAR10 dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/), it becomes clear that it has primarily learnt by _slightly_ adapting the [weights](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/) of some models, and by _steering_ the biases quite substantially. This eventually reflects a 72.7% test accuracy. 
+ +* * * + +## Summary + +In this blog post, we've seen quite a few things with respect to TensorBoard and Keras: + +- What it is; +- What it can be used for; +- How to use TensorBoard with Keras; +- And how its results can subsequently be viewed in TensorBoard. + +I hope you've learnt something from today's blog post! If you did, feel free to leave comment below 👇, especially if you have questions or think I made mistakes and/or can improve my post in any way. + +Thank you for visiting MachineCurve today and happy engineering 😊 + +* * * + +## References + +Keras. (n.d.). Callbacks. Retrieved from [https://keras.io/callbacks/#callback](https://keras.io/callbacks/#callback) + +TensorFlow. (n.d.). _Tf.keras.callbacks.TensorBoard_. [https://www.tensorflow.org/api\_docs/python/tf/keras/callbacks/TensorBoard](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/TensorBoard) + +Sunside. (n.d.). Understanding TensorBoard (weight) histograms. Retrieved from [https://stackoverflow.com/questions/42315202/understanding-tensorboard-weight-histograms](https://stackoverflow.com/questions/42315202/understanding-tensorboard-weight-histograms) + +TensorFlow. (n.d.). Word embeddings. Retrieved from [https://www.tensorflow.org/tutorials/text/word\_embeddings#word\_embeddings\_2](https://www.tensorflow.org/tutorials/text/word_embeddings#word_embeddings_2) + +Quora. (n.d.). When should I use tf.float32 vs tf.float64 in TensorFlow? Retrieved from [https://www.quora.com/When-should-I-use-tf-float32-vs-tf-float64-in-TensorFlow](https://www.quora.com/When-should-I-use-tf-float32-vs-tf-float64-in-TensorFlow) + +TensorFlow. (n.d.). Get started with TensorBoard. Retrieved from [https://www.tensorflow.org/tensorboard/get\_started](https://www.tensorflow.org/tensorboard/get_started) + +Keras. (n.d.). Callbacks – TensorBoard. Retrieved from [https://keras.io/callbacks/#tensorboard](https://keras.io/callbacks/#tensorboard) + +PyTorch. (n.d.). _Torch.utils.tensorboard — PyTorch 1.7.0 documentation_. [https://pytorch.org/docs/stable/tensorboard.html](https://pytorch.org/docs/stable/tensorboard.html) diff --git a/how-to-use-tensorboard-with-pytorch.md b/how-to-use-tensorboard-with-pytorch.md new file mode 100644 index 0000000..4f64577 --- /dev/null +++ b/how-to-use-tensorboard-with-pytorch.md @@ -0,0 +1,579 @@ +--- +title: "How to use TensorBoard with PyTorch" +date: "2021-11-10" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "machine-learning" + - "model-visualization" + - "neural-networks" + - "pytorch" + - "scalars" + - "tensorboard" + - "visualization" + - "weight-histograms" +--- + +When your Deep Learning model is training, you are interested in how well it performs. Often, it's possible to output a variety of metrics on the fly. But did you know that there are tools for visualizing how performance evolves over time - and even allowing you to see performance over time _after_ training was finished? + +TensorBoard is one such tool. Originally intended for the TensorFlow library (including Keras models), it's a web application which reads log files from a directory and displays a variety of charts that can be very useful. Fun fact: it's also available for PyTorch! And precisely that is what we're going to build in today's article. + +First of all, we're going to start with taking a look at TensorBoard. What is it capable of doing? How is TensorBoard available in PyTorch (hint: through the `SummaryWriter`)? 
We then move on to a real example that adds TensorBoard to your PyTorch model.
+
+Are you ready? Let's take a look! 😎
+
+* * *
+
+\[toc\]
+
+* * *
+
+## What is TensorBoard?
+
+The people who create a tool usually know best how to describe it - and the same is true for the creators of TensorBoard:
+
+> _In machine learning, to improve something you often need to be able to measure it. TensorBoard is a tool for providing the measurements and visualizations needed during the machine learning workflow. It enables tracking experiment metrics like_ [loss](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) _and accuracy, visualizing the model graph, projecting embeddings to a lower dimensional space, and much more._
+>
+> [TensorBoard – Get Started](https://www.tensorflow.org/tensorboard/get_started)
+
+In other words, it's a tool for visualizing the machine learning experiments you performed in a variety of ways.
+
+Indeed, it's possible to generate a large number of plots when using TensorBoard. For example, using **weight histograms**, it's possible to see how the distribution of your layer weights evolved over time - in this case, over five epochs:
+
+![](images/weight_histogram_1.jpg)
+
+In another screen, you can see how loss has evolved over the epochs:
+
+![](images/image-1-1024x505.png)
+
+[And so on. And so on.](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/#viewing-model-performance-in-tensorboard)
+
+Installing TensorBoard must be done separately from your PyTorch install. Doing so is not difficult, fortunately, and can be done by simply executing `pip` via `pip install tensorboard`.
+
+* * *
+
+## TensorBoard in PyTorch using the `SummaryWriter`
+
+TensorBoard was originally developed for TensorFlow. As you saw above, it is also available for PyTorch! But how? Through the `SummaryWriter`:
+
+> The SummaryWriter class provides a high-level API to create an event file in a given directory and add summaries and events to it. The class updates the file contents asynchronously. This allows a training program to call methods to add data to the file directly from the training loop, without slowing down training.
+>
+> PyTorch (n.d.)
+
+Great!
+
+This means that we can create a `SummaryWriter` (or, fully: `torch.utils.tensorboard.writer.SummaryWriter`) and use it to write away the data that we want.
+
+Recall from the article [linked above](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/#viewing-model-performance-in-tensorboard) that TensorBoard provides a variety of tabs:
+
+- The **scalar tab** for showing how the training process happened over time by means of displaying scalars (e.g., in a line plot).
+- The **images tab** for showing images written away during the training process.
+- The **graphs tab** showing the network graph created by (in the original case) TensorFlow during training.
+- The **distributions tab** showing the distributions of the weights and biases of your network for every iteration.
+- The **histograms tab** showing the weight and bias histograms helping you determine how the model learned what it learned.
+- The **embeddings tab** for visualizing learned embeddings.
+
+It's possible to write from PyTorch to each of these tabs:
+
+- Using `add_scalar` (or `add_scalars`), you can write scalar data to the scalar tab.
+- By means of `add_image` (or `add_images`), images can be written to the images tab.
Besides, it is also possible to write Matplotlib figures using `add_figure`. And if you have videos (for example by having an array with multiple images), `add_video` can be used. +- Through `add_graph`, graph data can be written to TensorBoard's graphs tab. +- With `add_histogram`, you can write histogram data to the histogram tab. +- Via `add_embedding`, embedding data can be written to the embeddings tab. +- You can also use `add_audio` for audio data and `add_text` for text data. `add_mesh` can be used for 3D point cloud data. +- And there is a lot more! + +In other words, it's possible to fully recreate the TensorBoard experience you're used to when coming from TensorFlow... but then using PyTorch! + +* * * + +## Adding TensorBoard to your PyTorch model + +Let's now take a look at _how_ we can use TensorBoard with PyTorch by means of an example. + +Please ensure that you have installed both PyTorch (and its related packages such as `torchvision`) as well as TensorBoard (if not: `pip install tensorboard`) before continuing. + +Adding TensorBoard to your PyTorch model will take a few simple steps: + +1. Starting with a simple Convolutional Neural Network. +2. Initializing the `SummaryWriter` which allows us to write to TensorBoard. +3. Writing away some scalar values, both individually and in groups. +4. Writing away images, graphs and histograms. + +This will give you a rough idea how TensorBoard can be used, leaving sufficient [room for experimentation](https://pytorch.org/docs/stable/generated/torch.nn.functional.pad.html#torch.nn.functional.pad) with all the other TensorBoard functionality available in PyTorch. + +### A simple ConvNet to start with + +In a different article, [we created a simple Convolutional Neural Network](https://www.machinecurve.com/index.php/2021/07/08/convolutional-neural-networks-with-pytorch/) for classifying MNIST digits. 
Let's use that code here and expand it with the `SummaryWriter` for TensorBoard logging.
+
+```
+import os
+import torch
+from torch import nn
+from torchvision.datasets import MNIST
+from torch.utils.data import DataLoader
+from torchvision import transforms
+
+class ConvNet(nn.Module):
+    '''
+    Simple Convolutional Neural Network
+    '''
+    def __init__(self):
+        super().__init__()
+        self.layers = nn.Sequential(
+            nn.Conv2d(1, 10, kernel_size=3),
+            nn.ReLU(),
+            nn.Conv2d(10, 5, kernel_size=3),
+            nn.ReLU(),
+            nn.Flatten(),
+            nn.Linear(24 * 24 * 5, 64),
+            nn.ReLU(),
+            nn.Linear(64, 32),
+            nn.ReLU(),
+            nn.Linear(32, 10)
+        )
+
+
+    def forward(self, x):
+        '''Forward pass'''
+        return self.layers(x)
+
+
+if __name__ == '__main__':
+
+    # Set fixed random number seed
+    torch.manual_seed(42)
+
+    # Prepare MNIST dataset
+    dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
+    trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1)
+
+    # Initialize the ConvNet
+    convnet = ConvNet()
+
+    # Define the loss function and optimizer
+    loss_function = nn.CrossEntropyLoss()
+    optimizer = torch.optim.Adam(convnet.parameters(), lr=1e-4)
+
+    # Run the training loop
+    for epoch in range(0, 5): # 5 epochs at maximum
+
+        # Print epoch
+        print(f'Starting epoch {epoch+1}')
+
+        # Set current loss value
+        current_loss = 0.0
+
+        # Iterate over the DataLoader for training data
+        for i, data in enumerate(trainloader, 0):
+
+            # Get inputs
+            inputs, targets = data
+
+            # Zero the gradients
+            optimizer.zero_grad()
+
+            # Perform forward pass
+            outputs = convnet(inputs)
+
+            # Compute loss
+            loss = loss_function(outputs, targets)
+
+            # Perform backward pass
+            loss.backward()
+
+            # Perform optimization
+            optimizer.step()
+
+            # Print statistics
+            current_loss += loss.item()
+            if i % 500 == 499:
+                print('Loss after mini-batch %5d: %.3f' %
+                      (i + 1, current_loss / 500))
+                current_loss = 0.0
+
+    # Process is complete.
+    print('Training process has finished.')
+```
+
+### Initializing the SummaryWriter
+
+Add the `SummaryWriter` to your imports first:
+
+```
+import os
+import torch
+from torch import nn
+from torchvision.datasets import MNIST
+from torch.utils.data import DataLoader
+from torch.utils.tensorboard import SummaryWriter
+from torchvision import transforms
+```
+
+Then, directly after the `__name__` check, initialize it:
+
+```
+if __name__ == '__main__':
+
+    # Initialize the SummaryWriter for TensorBoard
+    # Its output will be written to ./runs/
+    writer = SummaryWriter()
+```
+
+We can now use TensorBoard within PyTorch :)
+
+### Writing scalar values and groups to TensorBoard from PyTorch
+
+If we inspect the code above, a prime candidate for writing to TensorBoard is the **loss value**. It is a simple value and hence can be represented as a _scalar_, and thus be written using `add_scalar`.
+
+First, we add a new counter just after we start the training loop:
+
+```
+    # Run the training loop
+    loss_idx_value = 0
+```
+
+Then, we add the `add_scalar` call to our code - we write away the `current_loss` variable for the current index value, which we then increase by one.
+
+```
+            # Print statistics
+            current_loss += loss.item()
+            writer.add_scalar("Loss", current_loss, loss_idx_value)
+            loss_idx_value += 1
+            if i % 500 == 499:
+                print('Loss after mini-batch %5d: %.3f' %
+                      (i + 1, current_loss / 500))
+                current_loss = 0.0
+```
+
+Because `current_loss` is reset after every 500th minibatch, we're likely going to see a _wavy_ pattern.
+ +Let's now run the Python script - and when training finishes, you can start TensorBoard as follows _from the directory where your script is located_: + +``` +tensorboard --logdir=runs +``` + +You should then see the following: + +``` +(pytorch) C:\Users\Chris\Test>tensorboard --logdir=runs +Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all +TensorBoard 2.6.0 at http://localhost:6006/ (Press CTRL+C to quit) +``` + +Let's go visit [localhost](http://localhost:6006/)! + +Indeed, a wavy pattern. This makes sense, because loss is reset continuously. Fortunately, we also see a lower loss range - indicating that our maximum loss per epoch is decreasing, suggesting that the model gets better. + +![](images/image-9.png) + +If we want, we can also group multiple graphs in a **scalar group**, this way: + +``` + # Run the training loop + loss_idx_value = 0 + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = convnet(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + writer.add_scalar("Loss/Minibatches", current_loss, loss_idx_value) + loss_idx_value += 1 + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Write loss for epoch + writer.add_scalar("Loss/Epochs", current_loss, epoch) +``` + +We can now see loss at _minibatch level_ (including the resets) and _epoch level_ (which is more smooth) - all under the umbrella term of "Loss". This way, you can construct multiple groups which contain many visualizations. + +![](images/image-2.png) + +### Writing network graphs, images and weight histograms to TensorBoard using PyTorch + +#### Graphs + +Using `add_graph`, you can write the network graph to TensorBoard so that it can be visualized. This is how to do it: + +``` + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Write the network graph at epoch 0, batch 0 + if epoch == 0 and i == 0: + writer.add_graph(convnet, input_to_model=data[0], verbose=False) +``` + +Et voila: + +![](images/image-8.png) + +#### Images + +Suppose that you have _image data_ available during the training process - for example, random selections from your batches, weight visualizations or e.g. Activation Maximization values - then you can use `add_image` in the following way: + +``` + # Write an image at every batch 0 + if i == 0: + writer.add_image("Example input", inputs[0], global_step=epoch) +``` + +At every 0th minibatch (i.e. at the start of every epoch), this writes the first input image to TensorBoard. + +![](images/image-7.png) + +#### Weight histograms + +Visualizing weight histograms in TensorBoard takes a bit more time, because PyTorch doesn't natively support making weight histograms. Since its APIs are really accessible, it is not very difficult to replicate this behavior either. 
Let's take a look at what must be done for passing weight histograms to TensorBoard: + +_Please do note that the functionality below only works with `Conv2d` and `Linear` layers. All others will be skipped. If you need other layers, please feel free to let me know through the comments, and I will try to add them!_ + +``` +def weight_histograms_conv2d(writer, step, weights, layer_number): + weights_shape = weights.shape + num_kernels = weights_shape[0] + for k in range(num_kernels): + flattened_weights = weights[k].flatten() + tag = f"layer_{layer_number}/kernel_{k}" + writer.add_histogram(tag, flattened_weights, global_step=step, bins='tensorflow') + + +def weight_histograms_linear(writer, step, weights, layer_number): + flattened_weights = weights.flatten() + tag = f"layer_{layer_number}" + writer.add_histogram(tag, flattened_weights, global_step=step, bins='tensorflow') + + +def weight_histograms(writer, step, model): + print("Visualizing model weights...") + # Iterate over all model layers + for layer_number in range(len(model.layers)): + # Get layer + layer = model.layers[layer_number] + # Compute weight histograms for appropriate layer + if isinstance(layer, nn.Conv2d): + weights = layer.weight + weight_histograms_conv2d(writer, step, weights, layer_number) + elif isinstance(layer, nn.Linear): + weights = layer.weight + weight_histograms_linear(writer, step, weights, layer_number) +``` + +Add these Python `def`s above the `__main__` check. Here's what they do, from the bottom to the top one: + +- With `weight_histograms`, you iterate over all layers in the model, check whether the layer is a `nn.Conv2d` or `nn.Linear` type of layer, and then proceed with the relevant definition. +- In `weight_histograms_linear`, we take the layer weights, flatten everything, and pass them to `writer` for making the histogram. +- In `weight_histograms_conv2d`, we do almost the same, except for doing it at _kernel_ level. This is how it's done in TensorFlow too. 
+ +You can then add the call to your code: + +``` + # Run the training loop + loss_idx_value = 0 + for epoch in range(0, 5): # 5 epochs at maximum + + # Visualize weight histograms + weight_histograms(writer, epoch, convnet) +``` + +It's then possible to see weight histograms in your TensorBoard page (_note that the network below was trained for 30 epochs_): + +![](images/image-6-1024x566.png) + +* * * + +## Full model code + +Should you wish to use everything at once, here you go: + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torch.utils.tensorboard import SummaryWriter +from torchvision import transforms +import numpy as np + +class ConvNet(nn.Module): + ''' + Simple Convolutional Neural Network + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Conv2d(1, 10, kernel_size=3), + nn.ReLU(), + nn.Conv2d(10, 5, kernel_size=3), + nn.ReLU(), + nn.Flatten(), + nn.Linear(24 * 24 * 5, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +def weight_histograms_conv2d(writer, step, weights, layer_number): + weights_shape = weights.shape + num_kernels = weights_shape[0] + for k in range(num_kernels): + flattened_weights = weights[k].flatten() + tag = f"layer_{layer_number}/kernel_{k}" + writer.add_histogram(tag, flattened_weights, global_step=step, bins='tensorflow') + + +def weight_histograms_linear(writer, step, weights, layer_number): + flattened_weights = weights.flatten() + tag = f"layer_{layer_number}" + writer.add_histogram(tag, flattened_weights, global_step=step, bins='tensorflow') + + +def weight_histograms(writer, step, model): + print("Visualizing model weights...") + # Iterate over all model layers + for layer_number in range(len(model.layers)): + # Get layer + layer = model.layers[layer_number] + # Compute weight histograms for appropriate layer + if isinstance(layer, nn.Conv2d): + weights = layer.weight + weight_histograms_conv2d(writer, step, weights, layer_number) + elif isinstance(layer, nn.Linear): + weights = layer.weight + weight_histograms_linear(writer, step, weights, layer_number) + + +if __name__ == '__main__': + + # Initialize the SummaryWriter for TensorBoard + # Its output will be written to ./runs/ + writer = SummaryWriter() + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare CIFAR-10 dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the ConvNet + convnet = ConvNet() + + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(convnet.parameters(), lr=1e-4) + + # Run the training loop + loss_idx_value = 0 + for epoch in range(0, 30): # 5 epochs at maximum + + + # Visualize weight histograms + weight_histograms(writer, epoch, convnet) + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + if i > 1000: + break + + # Get inputs + inputs, targets = data + + # Write the network graph at epoch 0, batch 0 + if epoch == 0 and i == 0: + writer.add_graph(convnet, input_to_model=data[0], verbose=True) + + # Write an image at every batch 0 + if i == 0: + writer.add_image("Example input", 
inputs[0], global_step=epoch) + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = convnet(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + writer.add_scalar("Loss/Minibatches", current_loss, loss_idx_value) + loss_idx_value += 1 + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Write loss for epoch + writer.add_scalar("Loss/Epochs", current_loss, epoch) + + # Process is complete. + print('Training process has finished.') +``` + +That's it! You have successfully integrated your PyTorch model with TensorBoard. + +* * * + +## References + +PyTorch. (n.d.). _Torch.utils.tensorboard — PyTorch 1.7.0 documentation_. [https://pytorch.org/docs/stable/tensorboard.html](https://pytorch.org/docs/stable/tensorboard.html) + +TensorFlow. (n.d.). _Get started with TensorBoard_. [https://www.tensorflow.org/tensorboard/get\_started](https://www.tensorflow.org/tensorboard/get_started) diff --git a/how-to-use-upsample-for-upsampling-with-pytorch.md b/how-to-use-upsample-for-upsampling-with-pytorch.md new file mode 100644 index 0000000..099639f --- /dev/null +++ b/how-to-use-upsample-for-upsampling-with-pytorch.md @@ -0,0 +1,180 @@ +--- +title: "How to use Upsample for upsampling with PyTorch" +date: "2021-12-28" +categories: + - "deep-learning" + - "frameworks" + - "geen-categorie" +tags: + - "computer-vision" + - "convolutional-neural-networks" + - "deep-learning" + - "machine-learning" + - "pytorch" + - "upsample" + - "upsampling" +--- + +Within computer vision, upsampling is a relatively common practice these days. Whereas Convolutional layers and Pooling layers make inputs _smaller_, or _downsample the inputs_, we sometimes want to perform the opposite as well. + +This is called Upsampling, and in today's tutorial you're going to learn how you can perform upsampling with the PyTorch deep learning library. + +Upsampling is commonly used within encoder-decoder architectures and within Generative Adversarial Networks, such as [StyleGAN](https://www.machinecurve.com/index.php/2021/12/27/stylegan-a-step-by-step-introduction/). + +In today's tutorial, we will take a look at three different things: + +1. **What upsampling involves**. Conceptually, and very briefly, we're taking a look at what happens when an image is upsampled. +2. **The PyTorch Upsample layer**. We take a look at how upsampling is implemented within PyTorch, one of today's leading deep learning libraries. +3. **A PyTorch based Upsample example**. You will also move from theory into practice, by learning how to perform upsampling within PyTorch by means of an example. + +Are you ready? Let's take a look 😎 + +* * * + +\[toc\] + +* * * + +## What is upsampling? + +Here's the Wikipedia explanation of upsampling: + +> When upsampling is performed on a sequence of samples of a _signal_ or other continuous function, it produces an approximation of the sequence that would have been obtained by sampling the signal at a higher rate (or [density](https://en.wikipedia.org/wiki/Dots_per_inch), as in the case of a photograph). +> +> Wikipedia (2004) + +In other words: you have an input, in computer vision frequently an image, that has a specific size. For example, you have an MNIST image that is 28 x 28 pixels and has one color channel. That is, a grayscale image. 
+
+![](images/mnist-visualize.png)
+
+Instead of 28x28 pixels, you want the image to be 56x56 pixels. This is when, in the words of the Wikipedia page, you will need to _produce an approximation as if you'd sampled at a higher rate or density_. In other words, if you imagine one MNIST sample to be a photograph, when upsampling you'd _approximate_ as if _you would have made a larger-pixel image with better equipment_.
+
+If you cannot distinguish between the approximation and the true image, upsampling has succeeded. As you will see next, there are multiple interpolation algorithms for upsampling - but let's take a look at a use case for upsampling first.
+
+* * *
+
+## Upsampling use: encoder-decoder architectures
+
+Below, you can see the architecture of the [StyleGAN](https://www.machinecurve.com/index.php/2021/12/27/stylegan-a-step-by-step-introduction/) generative adversarial network. The left side produces a so-called _latent vector_ which is used subsequently in the _synthesis network_ that produces an output picture:
+
+![](images/sampling_normalization.png)
+
+The synthesis network consists of a number of blocks that produce an image of a specific resolution, which is then used to increase image size even further. For example, in the picture above you see a 4 x 4 resolution for the first block, followed by an 8 x 8 pixel resolution, all the way to a 1024 x 1024 pixels image size.
+
+Between each block, _upsampling_ takes place. After the last adaptive instance normalization element in each block, an upsample step is performed to increase the _current output_ to something larger than the image output of the next block. Using a Convolutional layer, important input features from the previous block are learned by the next block, to which noise and styles are then added for control and randomness in the image synthesis process.
+
+Read the [StyleGAN article](https://www.machinecurve.com/index.php/2021/12/27/stylegan-a-step-by-step-introduction/) for a deep dive into that specific GAN, but hopefully this makes it clear how upsampling can be used within your neural network! :)
+
+* * *
+
+## PyTorch Upsample layer
+
+In PyTorch, upsampling is built into the `torch.nn.Upsample` class representing a layer called `Upsample` that can be added to your neural network:
+
+> Upsamples a given multi-channel 1D (temporal), 2D (spatial) or 3D (volumetric) data.
+>
+> PyTorch (n.d.)
+
+In other words, it works with 1D, 2D and 3D data:
+
+- 1D data is a one-dimensional array and is associated with timeseries (with one list element representing one time step). This is why 1D data is called _temporal_.
+- 2D data is a two-dimensional array and associated with images, _spatial_.
+- 3D data is a three-dimensional array and often associated with real-world data (pointcloud scans, and so forth) or videos.
+
+The `Upsample` layer is made available in the following way:
+
+`torch.nn.Upsample(size=None, scale_factor=None, mode='nearest', align_corners=None)`
+
+### Configurable attributes
+
+These attributes can be configured:
+
+- With `size`, the target output size can be specified. For example, if you have a 28 x 28 pixel image you wish to upsample to 56 x 56 pixels, you specify `size=(56, 56)`.
+- If you don't use `size`, you can instead specify a `scale_factor`, which scales the inputs.
+- Through `mode`, it is possible to configure the interpolation algorithm used for filling in the empty pixels created after image shape was increased.
It's possible to pick one of `'nearest'`, `'linear'`, `'bilinear'`, `'bicubic'` and `'trilinear'`. By default, it's `nearest`. +- Finally, to handle certain upsampling algorithms (linear, bilinear, trilinear), it's possible to set `align_corners` to `True`. This way, the corner points keep the same value whatever the interpolation output. + +* * * + +## PyTorch Upsample example + +The example below shows you how you can use upsampling in a 2D setting, with images from the MNIST dataset. + +It contains multiple parts: + +- **The imports**. We're going to depend on certain Python features and external libraries, such as `torch`, `torchvision` and `matplotlib`. As we're working with the MNIST dataset, we need to import it as well. These imports are fairly standard when creating a machine learning model with PyTorch. +- **The nn.Module** **\- a.k.a. the neural network**. As this tutorial involves using the Upsampling functionality within PyTorch, today's neural network is called `UpsampleExample`. It does only one thing: stack one `Upsample` layer in a `Sequential` block, which resizes inputs to `(56, 56)` shape and uses nearest neighbor interpolation for filling up the 'empty' pixels. The `forward` definition simply feeds the inputs to the layers and returns the result. +- **The main segment**. Firstly, we prepare the MNIST dataset by creating an instance of the `MNIST` class, which downloads the data if necessary. In addition, a Tensorfication of the input data is performed before any data will be passed to the neural network. Secondly, a `DataLoader` is initialized on top of the `dataset`, which shuffles and selects data using a preconfigured batch size (of 10 images). Thirdly, the upsample example is initialized, and we perform an iteration over the (first) batch. For each batch, we feed the data through the neural network, and pick the first example for visualization with Matplotlib. + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +import matplotlib.pyplot as plt + +class UpsampleExample(nn.Module): + ''' + Simple example for upsampling + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Upsample(size=(56,56), mode = 'nearest') + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Prepare MNIST + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the upsample_example + upsample_example = UpsampleExample() + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Take just one input + before_upsampling = inputs + + # Perform forward pass + after_upsampling = upsample_example(before_upsampling)[0] + + # Visualize subplots + fig, (ax1, ax2) = plt.subplots(1, 2) + ax1.imshow(before_upsampling[0].reshape((28, 28))) + ax2.imshow(after_upsampling.reshape(56, 56)) + plt.show() +``` + +After upsampling, this is what the inputs look like: + +[![](images/afterupsampling-1024x535.png)](https://www.machinecurve.com/wp-content/uploads/2021/12/afterupsampling.png) + +On the left, you can see the image before upsampling. On the right, the image after upsampling. 
+
+You can see that the image pretty much stayed the same - but from the axes, you can see that it became bigger.
+
+From 28x28 pixels (the default sample shape of an MNIST sample), the image is now 56 x 56 pixels.
+
+Successfully upsampled with PyTorch! :D
+
+* * *
+
+## References
+
+PyTorch. (n.d.). _Upsample — PyTorch 1.10.1 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.Upsample.html](https://pytorch.org/docs/stable/generated/torch.nn.Upsample.html)
+
+Wikipedia. (2004, December 23). _Upsampling_. Wikipedia, the free encyclopedia. Retrieved December 28, 2021, from [https://en.wikipedia.org/wiki/Upsampling](https://en.wikipedia.org/wiki/Upsampling)
diff --git a/how-to-visualize-a-model-with-keras.md b/how-to-visualize-a-model-with-keras.md
new file mode 100644
index 0000000..bc8c41b
--- /dev/null
+++ b/how-to-visualize-a-model-with-keras.md
@@ -0,0 +1,294 @@
+---
+title: "How to visualize a model with TensorFlow 2 and Keras?"
+date: "2019-10-07"
+categories:
+  - "buffer"
+  - "frameworks"
+tags:
+  - "architecture"
+  - "deep-learning"
+  - "keras"
+  - "neural-network"
+  - "visualization"
+---
+
+Every now and then, you might need to demonstrate your Keras model structure. There are one or two things that you may do when this need arises. First, you may send the person who needs this overview your code, requiring them to derive the model architecture themselves. If you're nicer, you send them a diagram of your architecture.
+
+...but creating such diagrams is often a hassle when you have to do it manually. Solutions like www.draw.io are used quite often in those cases, because they are (relatively) quick and dirty, allowing you to create diagrams fast.
+
+However, there's a better solution: the built-in `plot_model` facility within Keras. It allows you to create a visualization of your model architecture. In this blog, I'll show you how to create such a visualization. Specifically, I focus on the model itself, discussing its architecture so that you fully understand what happens. Subsequently, I'll list some software dependencies that you'll need - including a highlight about a bug in Keras that results in a weird error related to `pydot` and GraphViz, which are used for visualization. Finally, I present the code used for visualization and the end result.
+
+**After reading this tutorial, you will...**
+
+- Understand what the `plot_model()` util in TensorFlow 2.0/Keras does.
+- Know what types of plots it generates.
+- Have created a neural network and visualized its structure.
+
+_Note that model code is also available [on GitHub](https://github.com/christianversloot/keras-visualizations)._
+
+* * *
+
+**Update 22/Jan/2021:** ensured that the tutorial is up-to-date and reflects code for TensorFlow 2.0. It can now be used with recent versions of the library. Also performed some header changes and textual improvements based on the switch from Keras 1.0 to TensorFlow 2.0. Also added an example of horizontal plotting.
+
+* * *
+
+\[toc\]
+
+* * *
+
+## Code example: using plot\_model for visualizing the model
+
+If you want to get started straight away, here is the code that you can use for visualizing your TensorFlow 2.0/Keras model with `plot_model`:
+
+```
+from tensorflow.keras.utils import plot_model
+plot_model(model, to_file='model.png')
+```
+
+Make sure to read the rest of this tutorial if you want to understand everything in more detail!
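+
+If you want a slightly richer diagram straight away, the same util also accepts a few optional arguments - for example `show_shapes` and `show_layer_names`, which are covered in more detail later in this tutorial:
+
+```
+from tensorflow.keras.utils import plot_model
+# show_shapes adds the output shape of every layer to the diagram;
+# show_layer_names toggles whether layer names are displayed.
+plot_model(model, to_file='model.png', show_shapes=True, show_layer_names=True)
+```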
+ +* * * + +## Today's to-be-visualized model + +To show you how to visualize a Keras model, I think it's best if we discussed one first. + +Today, we will visualize the [Convolutional Neural Network](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) that we created earlier to demonstrate the benefits of using CNNs over densely-connected ones. + +This is the code of that model: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Set input shape +sample_shape = input_train[0].shape +img_width, img_height = sample_shape[0], sample_shape[1] +input_shape = (img_width, img_height, 1) + +# Reshape data +input_train = input_train.reshape(len(input_train), input_shape[0], input_shape[1], input_shape[2]) +input_test = input_test.reshape(len(input_test), input_shape[0], input_shape[1], input_shape[2]) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Convert them into black or white: [0, 1]. +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +What does it do? + +I'd suggest that you [read the post](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) if you wish to understand it very deeply, but I'll briefly cover it here. + +It simply classifies the MNIST dataset. This dataset contains 28 x 28 pixel images of digits, or numbers between 0 and 9, and our CNN classifies them with a staggering 99% accuracy. It does so by combining two convolutional blocks (which consist of a two-dimensional convolutional layer, two-dimensional max pooling and dropout) with densely-conneted layers. It's the best of both worlds in terms of interpreting the image _and_ generating final predictions. + +But how to visualize this model's architecture? Let's find out. + +* * * + +## Built-in `plot_model` util + +Utilities. I love them, because they make my life easier. 
They're often relatively simple functions that can be called upon to perform small but useful actions. Don't be fooled, however, because these actions often benefit one's efficiency greatly - in this case, not having to draw a model architecture yourself in tools like draw.io.
+
+I'm talking about the `plot_model` util, which comes [delivered with Keras](https://keras.io/visualization/#model-visualization).
+
+It allows you to create a visualization of your Keras neural network.
+
+More specifically, the Keras docs define it as follows:
+
+```
+from tensorflow.keras.utils import plot_model
+plot_model(model, to_file='model.png')
+```
+
+From the **Keras utilities**, one needs to import the function, after which it can be used with very minimal parameters:
+
+- The **model instance**, or the model that you created - whether you created it now or preloaded it instead from a model [saved to disk](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/).
+- And the `to_file` parameter, which essentially specifies a location on disk where the model visualization is stored.
+
+If you wish, you can supply some additional parameters as well:
+
+- The **show\_shapes** argument (`False` by default), which controls whether the _shape_ of the layer outputs is shown in the graph. This would be beneficial if besides the architecture you also need to understand _how it transforms data_.
+- With **show\_dtype** (`False` by default) you can indicate whether to show layer data types on the plot.
+- The **show\_layer\_names** argument (`True` by default) which determines whether the names of the layers are displayed.
+- The **rankdir** argument (`TB` by default) can be used to indicate whether you want a vertical or horizontal plot. `TB` is vertical, `LR` is horizontal.
+- The **expand\_nested** argument (`False` by default) controls how nested models are displayed.
+- The **dpi** argument controls the dpi value of the image.
+
+However, likely, for a simple visualization, you don't need them. Let's now take a look at what we would need if we were to create such a visualization.
+
+* * *
+
+## Software dependencies
+
+If you wish to run the code presented in this blog successfully, you'll need to install certain software dependencies:
+
+- **TensorFlow 2.0**, or any subsequent version, which makes sense given the fact that we're using a Keras util for model visualization;
+- **Python**, preferably 3.8+, which is required if you wish to run Keras.
+- **Graphviz**, which is graph visualization software. Keras uses it to generate the visualization of your neural network. [You can install Graphviz from their website.](https://graphviz.gitlab.io/download/)
+
+Preferably, you'll run this from an **Anaconda** environment, which allows you to run these packages in an isolated fashion. Note that many people report that a `pip` based installation of Graphviz doesn't work; rather, you'll have to install it separately into your host OS from their website. Bummer!
+
+### Keras bug: `pydot` failed to call GraphViz
+
+When trying to visualize my Keras neural network with `plot_model`, I ran into this error:
+
+```
+'`pydot` failed to call GraphViz.'
+OSError: `pydot` failed to call GraphViz.Please install GraphViz (https://www.graphviz.org/) and ensure that its executables are in the $PATH.
+```
+
+...which essentially made sense at first, because I didn't have Graphviz installed.
+ +...but which didn't after I installed it, because the error kept reappearing, even after restarting the Anaconda terminal. + +Sometimes, it helps to install `pydotplus` as well with `pip install pydotplus`. Another solution, although not preferred, is to downgrade your `pydot` version. + +* * * + +## Visualization code + +When adapting the code from my original CNN, scrapping away the elements I don't need for visualizing the model architecture, I end up with this: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras.utils import plot_model + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Set input shape +sample_shape = input_train[0].shape +img_width, img_height = sample_shape[0], sample_shape[1] +input_shape = (img_width, img_height, 1) + +# Number of classes +no_classes = 10 + +# Reshape data +input_train = input_train.reshape(len(input_train), input_shape[0], input_shape[1], input_shape[2]) +input_test = input_test.reshape(len(input_test), input_shape[0], input_shape[1], input_shape[2]) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +plot_model(model, to_file='model.png') +``` + +You'll first perform the **imports** that you still need in order to successfully run the Python code. Specifically, you'll import the Keras library, the Sequential API and certain layers - this is obviously dependent on what _you_ want. Do you want to use the Functional API? That's perfectly fine. Other layers? Fine too. I just used them since the CNN is exemplary. + +Note that I also imported `plot_model` with `from tensorflow.keras.utils import plot_model` and reshaped the data to accomodate for the Conv2D layer. + +Speaking about architecture: that's what I kept in. Based on the Keras Sequential API, I apply the two convolutional blocks as discussed previously, before flattening their output and feeding it to the densely-connected layers generating the final prediction. And, of course, we need `no_classes = 10` to ensure that our final `Dense` layer works as well. + +However, in this case, no such prediction is generated. Rather, the `model` instance is used by `plot_model` to generate a model visualization stored at disk as `model.png`. Likely, you'll add hyperparameter tuning and data fitting later on - but hey, that's not the purpose of this blog. + +* * * + +## End result + +And your final end result looks like this: + +![](images/model.png) + +* * * + +## Making a horizontal TF/Keras model plot + +Indeed, above we saw that we can use the `rankdir` attribute (which is set to `TB` i.e. vertical by default) to generate a horizontal plot! This is new, and highly preferred, as we sometimes don't want these massive vertical plots. + +Making a horizontal plot of your TensorFlow/Keras model simply involves adding the `rankdir='LR'` a.k.a. 
_horizontal_ attribute: + +``` +plot_model(model, to_file='model.png', rankdir='LR') +``` + +Which gets you this: + +[![](images/model.png)](https://www.machinecurve.com/wp-content/uploads/2021/01/model.png) + +Awesome! + +* * * + +## Summary + +In this blog, you've seen how to create a Keras model visualization based on the `plot_model` util provided by the library. I hope you found it useful - let me know in the comments section, I'd appreciate it! 😎 If not, let me know as well, so I can improve. For now: happy engineering! 👩‍💻 + +_Note that model code is also available [on GitHub](https://github.com/christianversloot/keras-visualizations)._ + +* * * + +## References + +How to create a CNN classifier with Keras? – MachineCurve. (2019, September 24). Retrieved from [https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) + +Keras. (n.d.). Visualization. Retrieved from [https://keras.io/visualization/](https://keras.io/visualization/) + +Avoid wasting resources with EarlyStopping and ModelCheckpoint in Keras – MachineCurve. (2019, June 3). Retrieved from [https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/) + +pydot issue · Issue #7 · XifengGuo/CapsNet-Keras. (n.d.). Retrieved from [https://github.com/XifengGuo/CapsNet-Keras/issues/7#issuecomment-536100376](https://github.com/XifengGuo/CapsNet-Keras/issues/7#issuecomment-536100376) + +TensorFlow. (2021). _Tf.keras.utils.plot\_model_. [https://www.tensorflow.org/api\_docs/python/tf/keras/utils/plot\_model](https://www.tensorflow.org/api_docs/python/tf/keras/utils/plot_model) diff --git a/how-to-visualize-support-vectors-of-your-svm-classifier.md b/how-to-visualize-support-vectors-of-your-svm-classifier.md new file mode 100644 index 0000000..0f42440 --- /dev/null +++ b/how-to-visualize-support-vectors-of-your-svm-classifier.md @@ -0,0 +1,172 @@ +--- +title: "How to visualize support vectors of your SVM classifier?" +date: "2020-05-05" +categories: + - "frameworks" + - "svms" +tags: + - "machine-learning" + - "scikit-learn" + - "support-vector-machine" + - "support-vectors" + - "visualization" +--- + +In today's world filled with buzz about deep neural networks, Support Vector Machines remain a widely used class of machine learning algorithms. + +The machines, which construct a hyperplane that aims to separate between classes in your dataset by maximizing the margin using _support vectors_, are still pretty useful when your number of samples is relatively low - given that your number of features does not exceed the number of samples. + +In the case of using SVMs for classification - they can also be used for regression - it could be valuable to visualize the support vectors of your SVM classifier. Doing so could help you determine how separable your dataset is, to give just one example: if many support vectors are necessary, the model had more difficulty generating the boundary than when say, one vector on each side was used to find the boundary. What's more, it helps you find out where precisely the decision boundary is located in your dataset. + +That's why in today's blog post, we will be looking at visualizing the support vectors that are used when constructing the decision boundary of your SVM classifier. 
Firstly, we will take a look at Support Vector Machines for classification and support vectors. What are they? How are they chosen? What does maximum-margin mean? Those questions will be answered. + +Subsequently, we'll move on to a practical example using Python and Scikit-learn. For an example dataset, which we will generate in this post as well, we will show you how a [simple SVM can be trained](https://www.machinecurve.com/index.php/2020/05/03/creating-a-simple-binary-svm-classifier-with-python-and-scikit-learn/) and how you can subsequently visualize the support vectors. We will do this step-by-step, so that you understand everything that happens. + +All right, let's go! :) + +* * * + +\[toc\] + +* * * + +## Support Vector Machines and support vectors + +Support Vector Machines (SVMs) are a well-known and widely-used class of machine learning models traditionally used in classification. They can be used to generate a decision boundary between classes for both linearly separable and nonlinearly separable data. + +Formally, SVMs construct a hyperplane in feature space. Here, a hyperplane is a subspace of dimensionality N-1, where N is the number of dimensions of the feature space itself. For example, in the two-dimensional feature space of the example below (representing a plane), the hyperplane (a line) illustrated in red separates the ‘black’ class from the ‘white’ class. Model optimization is performed by finding a maximum-margin decision boundary for the hyperplane, by using so called _support vectors_ (hence the name of the model class). Support vectors lie at the ‘front line’ between the two classes and are of importance for separating the data. By maximizing the margin between the hyperplane and those support vectors, the confidence about separability between the samples is maximized, and so is model performance. + +SVMs can be used efficiently with linearly separable data. For nonlinearly separable data, such as the features in the example below, they need to apply what is known as the _kernel trick_ first. This trick, which is an efficient mathematical mapping of the original samples onto a higher-dimensional mathematical space by means of a kernel function, can make linear separability between the original samples possible. This allows SVMs to work with nonlinearly separable data too, although determining kernel suitability as well as kernel application and testing is a human task and can be exhausting. + +![](images/Kernel_Machine-1.png) + +Source: [Alisneaky on Wikipedia](https://commons.wikimedia.org/wiki/File:Kernel_Machine.png) / CC0 license + +* * * + +## Constructing an SVM with Python and Scikit-learn + +[![](images/separable_data.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/separable_data.png) + +_Today's dataset the SVM is trained on: clearly, two blobs of separable data are visible._ + +Constructing and training a Support Vector Machine is not difficult, as we could see [in a different blog post](https://www.machinecurve.com/index.php/2020/05/03/creating-a-simple-binary-svm-classifier-with-python-and-scikit-learn/). In fact, with Scikit-learn and Python, it can be as easy as 3 lines of code. + +That's why today, given the focus of this post, we don't focus on creating the SVM itself step-by-step. Instead, I'd like to point you to the link referenced above if you wish to understand SVM creation in more detail. Here, we'll focus on visualizing the SVM's support vectors. 
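+
+To follow along, you'll need a working Python environment with Scikit-learn, NumPy and Matplotlib available. If they're missing, something like `pip install scikit-learn numpy matplotlib` (or the `conda` equivalent, depending on how your environment is set up) usually does the trick.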
+ +Here's the full code for generating a dataset, performing a 66/33 train/test split and training the linear SVM: + +``` +# Imports +from sklearn.datasets import make_blobs + +from sklearn.model_selection import train_test_split +import numpy as np +import matplotlib.pyplot as plt +from sklearn import svm +from sklearn.metrics import plot_confusion_matrix + +# Configuration options +blobs_random_seed = 42 +centers = [(0,0), (5,5)] +cluster_std = 1.5 +frac_test_split = 0.33 +num_features_for_samples = 2 +num_samples_total = 1000 + +# Generate data +inputs, targets = make_blobs(n_samples = num_samples_total, centers = centers, n_features = num_features_for_samples, cluster_std = cluster_std) +X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=frac_test_split, random_state=blobs_random_seed) + +# Save and load temporarily +np.save('./datasv.npy', (X_train, X_test, y_train, y_test)) +X_train, X_test, y_train, y_test = np.load('./datasv.npy', allow_pickle=True) + +# Generate scatter plot for training data +plt.scatter(X_train[:,0], X_train[:,1]) +plt.title('Linearly separable data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Initialize SVM classifier +clf = svm.SVC(kernel='linear') + +# Fit data +clf = clf.fit(X_train, y_train) +``` + +* * * + +## Visualizing your SVM's support vectors + +According to Scikit-learn's website, there are three variables attached to the trained `clf` (= classifier) object that are of interest when you want to do something with the support vectors of your model: + +- The **support\_** variable, which holds the index numbers of the samples from your training set that were found to be the support vectors. +- The **n\_support\_** variable, which produces the number of support vectors for every class. +- The **support\_vectors\_** variable, which produces the support vectors themselves - so that you don't need to perform an array search after using **support\_**. + +Let's now take a look at each one in more detail. + +If you wanted to retrieve the index numbers of the support vectors for your SVM, you would need to add this code: + +``` +# Get support vector indices +support_vector_indices = clf.support_ +print(support_vector_indices) +``` + +Which, in our case, produces: + +``` +[ 66 108 138 267 319 367 427 536 548 562 606 650 4 9 99 126] +``` + +I count 16 vectors. 
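If you prefer not to count by hand, a quick check on the trained `clf` object confirms the total, using the same attributes listed above:
+
+```
+# Count the support vectors programmatically
+print(len(clf.support_))            # total number of support vectors
+print(clf.support_vectors_.shape)   # (number of support vectors, number of features)
+```
+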
+Indeed, if you look at the number of support vectors per class, you can verify how these 16 vectors are distributed across the two classes - note that they do not need to be spread evenly:
+
+```
+# Get number of support vectors per class
+support_vectors_per_class = clf.n_support_
+print(support_vectors_per_class)
+```
+
+Then, finally, by simply using Matplotlib to visualize the training set and stacking the support vectors on top, we can visualize the support vectors and the training set:
+
+```
+# Get support vectors themselves
+support_vectors = clf.support_vectors_
+
+# Visualize support vectors
+plt.scatter(X_train[:,0], X_train[:,1])
+plt.scatter(support_vectors[:,0], support_vectors[:,1], color='red')
+plt.title('Linearly separable data with support vectors')
+plt.xlabel('X1')
+plt.ylabel('X2')
+plt.show()
+```
+
+Of course, by leaving out the first `plt.scatter`, you can visualize the support vectors only, if that's what you're interested in :)
+
+Et voila - we have a nice plot of our support vectors:
+
+[![](images/support_vectors.png)](https://www.machinecurve.com/wp-content/uploads/2020/05/support_vectors.png)
+
+\[affiliatebox\]
+
+## Summary
+
+In this blog post, we looked at Support Vector Machines and, more precisely, support vectors. What are those vectors? How do they play a role in deciding about the decision boundary when an SVM is trained? In the first part of this blog, we looked at those questions from a theoretical point of view.
+
+Building further on top of an existing MachineCurve blog article, which constructs and trains a simple binary SVM classifier, we then looked at how support vectors for an SVM can be visualized. By using Python and Scikit-learn, we provided a step-by-step example of how to do this. The end result: a nice Matplotlib-based plot with visualized support vectors. Obviously, this will work with multiclass SVMs as well.
+
+I hope you've learned something from today's blog post. If you did, please feel free to leave a comment below 💬 Please do the same if you have any questions or other remarks, or when you spot mistakes in my post. I'll happily answer your questions and repair my blog, when necessary.
+
+Thank you for reading MachineCurve today and happy engineering! 😎
+
+\[scikitbox\]
+
+* * *
+
+## References
+
+Scikit-learn. (n.d.). _Sklearn.svm.SVC — scikit-learn 0.22.2 documentation_. Retrieved May 3, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)

diff --git a/how-to-visualize-the-decision-boundary-for-your-keras-model.md b/how-to-visualize-the-decision-boundary-for-your-keras-model.md
new file mode 100644
index 0000000..f24d15a
--- /dev/null
+++ b/how-to-visualize-the-decision-boundary-for-your-keras-model.md
@@ -0,0 +1,527 @@
+---
+title: "Visualizing the decision boundary of your TensorFlow 2 / Keras model"
+date: "2019-10-11"
+categories:
+  - "buffer"
+  - "deep-learning"
+tags:
+  - "classifier"
+  - "decision-boundary"
+  - "keras"
+  - "mlxtend"
+  - "visualization"
+---
+
+When you build a classifier, you're effectively learning a mathematical model to draw a _decision boundary_ that can separate between the classes present in your data set's targets.
+
+Different algorithms take different approaches to generating such decision boundaries.
Neural networks learn them differently, dependent on the optimizer, activation function(s) and loss function used in your training setting. They support multiclass classification quite natively in many cases. + +Support Vector machines learn them by finding a maximum-margin boundary between the two (!) classes in your ML problem. Indeed, SVMs do not work for more than two classes, and many SVMs have to be trained and merged to support multiclass classification. + +Linear classifiers generate a linear decision boundary, which can happen in a multitude of ways - whether with SVMs, neural networks or more traditional techniques such as just fitting a line. + +And so on. + +But how do we visualize such a decision boundary? Especially: **how do I visualize the decision boundary for my Keras classifier?** That's what we'll answer in this blog post today. By means of the library Mlxtend created by Raschka (2018), we show you by means of example code how to visualize the decision boundaries of classifiers for both linearly separable and nonlinear data. + +After reading this tutorial, you will... + +- Understand how to visualize the decision boundary of your TensorFlow 2/Keras classifier with Mlxtend. +- See how this works with linear and nonlinear data. +- Have walked through a full example demonstrating how to visualize the decision boundary. + +Are you ready? + +Let's go! 😎 + +_Note that code is also available on [GitHub](https://github.com/christianversloot/keras-visualizations), in my Keras Visualizations repository._ + +* * * + +**Update 25/Jan/2021:** updated code examples to TensorFlow 2. Added code example near the top of the tutorial so that people can get started immediately. Also updated header information and title to reflect availability of TensorFlow 2 code. + +* * * + +\[toc\] + +* * * + +## Code example: visualizing the decision boundary of your model + +This code example provides a **full example showing how to visualize the decision boundary of your TensorFlow / Keras model**. If you want to understand it in more detail, in particular the usage of Mlxtend's `plot_decision_regions`, make sure to read the rest of this tutorial as well! 
+
+```
+# Imports
+import tensorflow.keras
+from tensorflow.keras.datasets import mnist
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense
+from tensorflow.keras.utils import to_categorical
+import matplotlib.pyplot as plt
+import numpy as np
+from sklearn.datasets import make_moons
+from mlxtend.plotting import plot_decision_regions
+
+# Configuration options
+num_samples_total = 1000
+training_split = 250
+
+# Generate data
+X, targets = make_moons(n_samples = num_samples_total)
+targets[np.where(targets == 0)] = -1
+X_training = X[training_split:, :]
+X_testing = X[:training_split, :]
+Targets_training = targets[training_split:]
+Targets_testing = targets[:training_split]
+
+# Generate scatter plot for training data
+plt.scatter(X_training[:,0], X_training[:,1])
+plt.title('Nonlinear data')
+plt.xlabel('X1')
+plt.ylabel('X2')
+plt.show()
+
+# Set the input shape
+feature_vector_shape = len(X_training[0])
+input_shape = (feature_vector_shape,)
+print(f'Feature shape: {input_shape}')
+
+# Create the model
+model = Sequential()
+model.add(Dense(50, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
+model.add(Dense(1, activation='tanh'))
+
+# Configure the model and start training
+model.compile(loss='squared_hinge', optimizer='adam', metrics=['accuracy'])
+model.fit(X_training, Targets_training, epochs=50, batch_size=25, verbose=1, validation_split=0.2)
+
+# Test the model after training
+test_results = model.evaluate(X_testing, Targets_testing, verbose=1)
+print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%')
+
+# Plot decision boundary
+plot_decision_regions(X_testing, Targets_testing, clf=model, legend=2)
+plt.show()
+```
+
+* * *
+
+## An example with linearly separable data
+
+Now that we know what a decision boundary is, we can try to visualize some of them for our Keras models. Here, we'll provide an example for visualizing the decision boundary with linearly separable data.
+
+That is, data which can be separated by drawing a line between the clusters. Typically, this is seen with classifiers and particularly [Support Vector Machines](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) (which maximize the margin between the line and the two clusters), but also with neural networks.
+
+Let's start. First, create a file in some folder called `decision_boundary_linear_data.py`, in which you'll add the following code.
+
+### Importing dependencies
+
+We first import the required dependencies:
+
+```
+# Imports
+import tensorflow.keras
+from tensorflow.keras.datasets import mnist
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense
+from tensorflow.keras.utils import to_categorical
+import matplotlib.pyplot as plt
+import numpy as np
+from sklearn.datasets import make_blobs
+from mlxtend.plotting import plot_decision_regions
+```
+
+We use **TensorFlow 2.0** for training our machine learning model, which includes a tightly coupled version of Keras through `tensorflow.keras`. Additionally, we'll import **Matplotlib**, which we need to visualize our dataset. **Numpy** is imported for preprocessing the data, **Scikit-learn**'s function `make_blobs` is imported for generating the linearly separable clusters of data and **Mlxtend** is used for visualizing the decision boundary.
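+
+If Mlxtend is not installed in your environment yet, it can usually be added with `pip install mlxtend` - the exact command may differ depending on how your Python environment is set up.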
+ +### Configuration options + +Next, we set some configuration options: + +``` +# Configuration options +num_samples_total = 1000 +training_split = 250 +``` + +The number of samples used in our visualization experiment is 1000 - to keep the training process fast, while still being able to show the predictive power of our model. + +We use 250 samples of them as _testing_ data by splitting them off the total dataset. + +Let's now generate data for the experiment. + +### Generating data + +With the help of the Scikit-learn library we generate data using the `make_blobs` function. It generates `n_samples` data points at the centers (0, 0) and (15, 15). The `n_features` is two: our samples have an (x, y) value on a 2D-space. The standard deviation of our cluster is set at 2.5. This allows us to add some spread without losing linear separability. + +``` +# Generate data +X, targets = make_blobs(n_samples = num_samples_total, centers = [(0,0), (15,15)], n_features = 2, center_box=(0, 1), cluster_std = 2.5) +targets[np.where(targets == 0)] = -1 +X_training = X[training_split:, :] +X_testing = X[:training_split, :] +Targets_training = targets[training_split:] +Targets_testing = targets[:training_split] +``` + +Scikit-learn's `make_blobs` generates numbers as targets, starting at 0. However, we will use [Hinge loss](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#hinge) in an attempt to maximize the decision boundary between our clusters. This should be possible given its separability. Hinge loss does not understand a target value of 0; rather, targets must be -1 or +1. Hence, we next convert all zero targets into minus one. + +We finally split between training and testing data given the number of splitoff values that we configured earlier. + +### Visualizing our dataset + +We next visualize our data: + +``` +# Generate scatter plot for training data +plt.scatter(X_training[:,0], X_training[:,1]) +plt.title('Linearly separable data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() +``` + +Put simply, we generate a scatter plot with Matplotlib, which clearly shows linear separability for our dataset: + +[![](images/final_lin_sep.png)](blob:https://www.machinecurve.com/2f4d2ae3-1171-4ade-b5a8-3f164f3d5717) + +### Model configuration, training & testing + +We next add the (relatively basic) Keras model used today: + +``` +# Set the input shape +feature_vector_shape = len(X_training[0]) +input_shape = (feature_vector_shape,) +print(f'Feature shape: {input_shape}') + +# Create the model +model = Sequential() +model.add(Dense(50, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(1, activation='tanh')) + +# Configure the model and start training +model.compile(loss='squared_hinge', optimizer='adam', metrics=['accuracy']) +model.fit(X_training, Targets_training, epochs=50, batch_size=25, verbose=1, validation_split=0.2) +``` + +We configure the input shape and next define the model architecture - we use Keras's Sequential API and let the data pass through two densely-connected layers. Two such layers should be sufficient for generating a successful decision boundary since our data is relatively simple - and in fact, linearly separable. + +Do note that since we use the ReLU [activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/), we cannot use Glorot uniform initialization - the default choice in Keras. 
Rather, we must [use He initialization](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/), and choose to do so with a uniform distribution. + +Next, we compile the model, using [squared hinge](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#squared-hinge) as our loss function, Adam as our optimizer (it's the de facto standard one used today) and accuracy as an additional metric - pretty much the choices I always make when creating models with Keras. + +Next, we fit the training data to the model, perform 50 iterations (or epochs) with batch sizes of 25, and use 20% of our 750 training samples for validating the outcomes of the training process after every epoch. Verbosity is set to 1 to show what happens during training. + +Subsequently, we add another default metric, which tests the final model once it stops training against the test set - to also show its power to _generalize_ to data the model has not seen before. + +``` +# Test the model after training +test_results = model.evaluate(X_testing, Targets_testing, verbose=1) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%') +``` + +### Plotting decision boundaries with Mlxtend + +Finally, we add code for visualizing the model's decision boundary. We use [Mlxtend](http://rasbt.github.io/mlxtend/) for this purpose, which is "a Python library of useful tools for the day-to-day data science tasks". Great! + +What's even better is that we can visualize the decision boundary of our Keras model with only two lines of code: + +``` +# Plot decision boundary +plot_decision_regions(X_testing, Targets_testing, clf=model, legend=2) +plt.show() +``` + +Note that we use our testing data for this rather than our training data, that we input the instance of our Keras model and that we display a legend. 
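+
+One caveat: depending on your Mlxtend and TensorFlow versions, `plot_decision_regions` may complain that the classifier does not return plain class labels, because `model.predict` outputs continuous values. If that happens, a thin wrapper around the model can help. Consider the sketch below a rough workaround rather than a definitive fix - the class name and the thresholding at zero are simply choices that match the -1/+1 targets used here:
+
+```
+# Hypothetical wrapper that maps continuous model outputs to -1/+1 class labels
+class KerasClassifierWrapper:
+    def __init__(self, model):
+        self.model = model
+    def predict(self, X):
+        predictions = self.model.predict(X)
+        return np.where(predictions.flatten() >= 0, 1, -1)
+
+# plot_decision_regions(X_testing, Targets_testing, clf=KerasClassifierWrapper(model), legend=2)
+```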
+ +Altogether, this is the code for the entire experiment: + +``` +# Imports +import tensorflow.keras +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +from tensorflow.keras.utils import to_categorical +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from mlxtend.plotting import plot_decision_regions + +# Configuration options +num_samples_total = 1000 +training_split = 250 + +# Generate data +X, targets = make_blobs(n_samples = num_samples_total, centers = [(0,0), (15,15)], n_features = 2, center_box=(0, 1), cluster_std = 2.5) +targets[np.where(targets == 0)] = -1 +X_training = X[training_split:, :] +X_testing = X[:training_split, :] +Targets_training = targets[training_split:] +Targets_testing = targets[:training_split] + +# Generate scatter plot for training data +plt.scatter(X_training[:,0], X_training[:,1]) +plt.title('Linearly separable data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Set the input shape +feature_vector_shape = len(X_training[0]) +input_shape = (feature_vector_shape,) +print(f'Feature shape: {input_shape}') + +# Create the model +model = Sequential() +model.add(Dense(50, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(1, activation='tanh')) + +# Configure the model and start training +model.compile(loss='squared_hinge', optimizer='adam', metrics=['accuracy']) +model.fit(X_training, Targets_training, epochs=50, batch_size=25, verbose=1, validation_split=0.2) + +# Test the model after training +test_results = model.evaluate(X_testing, Targets_testing, verbose=1) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%') + +# Plot decision boundary +plot_decision_regions(X_testing, Targets_testing, clf=model, legend=2) +plt.show() +``` + +### Running the model + +Running the code requires that you installed all dependencies mentioned earlier; preferably in an Anaconda environment to keep them isolated. Next, you can open up a terminal, navigate to the folder your file is located in and hit e.g. `python decision_boundary_linear_data.py`. What you will see is that Keras starts training the model, but that also the visualization above and the decision boundary visualization is generated for you. 
+ +``` +Epoch 1/50 +600/600 [==============================] - 5s 8ms/step - loss: 1.4986 - acc: 0.4917 - val_loss: 1.0199 - val_acc: 0.6333 +Epoch 2/50 +600/600 [==============================] - 0s 107us/step - loss: 0.7973 - acc: 0.6933 - val_loss: 0.6743 - val_acc: 0.7400 +Epoch 3/50 +600/600 [==============================] - 0s 102us/step - loss: 0.6273 - acc: 0.7467 - val_loss: 0.6020 - val_acc: 0.7800 +Epoch 4/50 +600/600 [==============================] - 0s 102us/step - loss: 0.5472 - acc: 0.7750 - val_loss: 0.5241 - val_acc: 0.8200 +Epoch 5/50 +600/600 [==============================] - 0s 93us/step - loss: 0.4313 - acc: 0.8000 - val_loss: 0.4170 - val_acc: 0.8467 +Epoch 6/50 +600/600 [==============================] - 0s 97us/step - loss: 0.2492 - acc: 0.8283 - val_loss: 0.1900 - val_acc: 0.8800 +Epoch 7/50 +600/600 [==============================] - 0s 107us/step - loss: 0.1199 - acc: 0.8850 - val_loss: 0.1109 - val_acc: 0.9133 +Epoch 8/50 +600/600 [==============================] - 0s 98us/step - loss: 0.0917 - acc: 0.9000 - val_loss: 0.0797 - val_acc: 0.9200 +Epoch 9/50 +600/600 [==============================] - 0s 96us/step - loss: 0.0738 - acc: 0.9183 - val_loss: 0.0603 - val_acc: 0.9200 +Epoch 10/50 +600/600 [==============================] - 0s 98us/step - loss: 0.0686 - acc: 0.9200 - val_loss: 0.0610 - val_acc: 0.9200 +Epoch 11/50 +600/600 [==============================] - 0s 101us/step - loss: 0.0629 - acc: 0.9367 - val_loss: 0.0486 - val_acc: 0.9333 +Epoch 12/50 +600/600 [==============================] - 0s 108us/step - loss: 0.0574 - acc: 0.9367 - val_loss: 0.0487 - val_acc: 0.9267 +Epoch 13/50 +600/600 [==============================] - 0s 102us/step - loss: 0.0508 - acc: 0.9400 - val_loss: 0.0382 - val_acc: 0.9467 +Epoch 14/50 +600/600 [==============================] - 0s 109us/step - loss: 0.0467 - acc: 0.9483 - val_loss: 0.0348 - val_acc: 0.9533 +Epoch 15/50 +600/600 [==============================] - 0s 108us/step - loss: 0.0446 - acc: 0.9467 - val_loss: 0.0348 - val_acc: 0.9467 +Epoch 16/50 +600/600 [==============================] - 0s 109us/step - loss: 0.0385 - acc: 0.9583 - val_loss: 0.0280 - val_acc: 0.9533 +Epoch 17/50 +600/600 [==============================] - 0s 100us/step - loss: 0.0366 - acc: 0.9583 - val_loss: 0.0288 - val_acc: 0.9467 +Epoch 18/50 +600/600 [==============================] - 0s 105us/step - loss: 0.0320 - acc: 0.9633 - val_loss: 0.0227 - val_acc: 0.9733 +Epoch 19/50 +600/600 [==============================] - 0s 100us/step - loss: 0.0289 - acc: 0.9633 - val_loss: 0.0224 - val_acc: 0.9733 +Epoch 20/50 +600/600 [==============================] - 0s 107us/step - loss: 0.0264 - acc: 0.9683 - val_loss: 0.0202 - val_acc: 0.9733 +Epoch 21/50 +600/600 [==============================] - 0s 99us/step - loss: 0.0251 - acc: 0.9767 - val_loss: 0.0227 - val_acc: 0.9667 +Epoch 22/50 +600/600 [==============================] - 0s 95us/step - loss: 0.0247 - acc: 0.9750 - val_loss: 0.0170 - val_acc: 0.9800 +Epoch 23/50 +600/600 [==============================] - 0s 101us/step - loss: 0.0210 - acc: 0.9833 - val_loss: 0.0170 - val_acc: 0.9800 +Epoch 24/50 +600/600 [==============================] - 0s 104us/step - loss: 0.0192 - acc: 0.9833 - val_loss: 0.0148 - val_acc: 0.9933 +Epoch 25/50 +600/600 [==============================] - 0s 105us/step - loss: 0.0191 - acc: 0.9833 - val_loss: 0.0138 - val_acc: 0.9867 +Epoch 26/50 +600/600 [==============================] - 0s 103us/step - loss: 0.0169 - acc: 0.9867 - val_loss: 0.0128 - val_acc: 0.9933 
+Epoch 27/50 +600/600 [==============================] - 0s 105us/step - loss: 0.0157 - acc: 0.9867 - val_loss: 0.0121 - val_acc: 1.0000 +Epoch 28/50 +600/600 [==============================] - 0s 103us/step - loss: 0.0150 - acc: 0.9883 - val_loss: 0.0118 - val_acc: 0.9933 +Epoch 29/50 +600/600 [==============================] - 0s 106us/step - loss: 0.0140 - acc: 0.9883 - val_loss: 0.0112 - val_acc: 1.0000 +Epoch 30/50 +600/600 [==============================] - 0s 105us/step - loss: 0.0131 - acc: 0.9917 - val_loss: 0.0101 - val_acc: 1.0000 +Epoch 31/50 +600/600 [==============================] - 0s 110us/step - loss: 0.0123 - acc: 0.9917 - val_loss: 0.0099 - val_acc: 1.0000 +Epoch 32/50 +600/600 [==============================] - 0s 111us/step - loss: 0.0119 - acc: 0.9917 - val_loss: 0.0102 - val_acc: 0.9933 +Epoch 33/50 +600/600 [==============================] - 0s 116us/step - loss: 0.0116 - acc: 0.9933 - val_loss: 0.0093 - val_acc: 1.0000 +Epoch 34/50 +600/600 [==============================] - 0s 108us/step - loss: 0.0107 - acc: 0.9933 - val_loss: 0.0085 - val_acc: 1.0000 +Epoch 35/50 +600/600 [==============================] - 0s 102us/step - loss: 0.0100 - acc: 0.9933 - val_loss: 0.0081 - val_acc: 1.0000 +Epoch 36/50 +600/600 [==============================] - 0s 103us/step - loss: 0.0095 - acc: 0.9917 - val_loss: 0.0078 - val_acc: 1.0000 +Epoch 37/50 +600/600 [==============================] - 0s 105us/step - loss: 0.0093 - acc: 0.9967 - val_loss: 0.0079 - val_acc: 1.0000 +Epoch 38/50 +600/600 [==============================] - 0s 104us/step - loss: 0.0088 - acc: 0.9950 - val_loss: 0.0072 - val_acc: 1.0000 +Epoch 39/50 +600/600 [==============================] - 0s 98us/step - loss: 0.0085 - acc: 0.9967 - val_loss: 0.0069 - val_acc: 1.0000 +Epoch 40/50 +600/600 [==============================] - 0s 103us/step - loss: 0.0079 - acc: 0.9983 - val_loss: 0.0066 - val_acc: 1.0000 +Epoch 41/50 +600/600 [==============================] - 0s 103us/step - loss: 0.0075 - acc: 0.9967 - val_loss: 0.0065 - val_acc: 1.0000 +Epoch 42/50 +600/600 [==============================] - 0s 101us/step - loss: 0.0074 - acc: 0.9950 - val_loss: 0.0060 - val_acc: 1.0000 +Epoch 43/50 +600/600 [==============================] - 0s 101us/step - loss: 0.0072 - acc: 0.9967 - val_loss: 0.0057 - val_acc: 1.0000 +Epoch 44/50 +600/600 [==============================] - 0s 105us/step - loss: 0.0071 - acc: 0.9950 - val_loss: 0.0056 - val_acc: 1.0000 +Epoch 45/50 +600/600 [==============================] - 0s 105us/step - loss: 0.0065 - acc: 0.9983 - val_loss: 0.0054 - val_acc: 1.0000 +Epoch 46/50 +600/600 [==============================] - 0s 110us/step - loss: 0.0062 - acc: 0.9983 - val_loss: 0.0056 - val_acc: 1.0000 +Epoch 47/50 +600/600 [==============================] - 0s 105us/step - loss: 0.0059 - acc: 0.9983 - val_loss: 0.0051 - val_acc: 1.0000 +Epoch 48/50 +600/600 [==============================] - 0s 103us/step - loss: 0.0057 - acc: 0.9983 - val_loss: 0.0049 - val_acc: 1.0000 +Epoch 49/50 +600/600 [==============================] - 0s 101us/step - loss: 0.0056 - acc: 0.9983 - val_loss: 0.0047 - val_acc: 1.0000 +Epoch 50/50 +600/600 [==============================] - 0s 105us/step - loss: 0.0054 - acc: 0.9983 - val_loss: 0.0050 - val_acc: 1.0000 +250/250 [==============================] - 0s 28us/step +Test results - Loss: 0.007074932985007763 - Accuracy: 99.2% +``` + +As you can see, during training validation accuracy goes to 1 or 100%. 
Testing the model with the testing dataset yields an accuracy of 99.2%. That's quite good news!
+
+And the visualized decision boundary?
+
+![](images/final_lin_sep_db.png)
+
+Let's now take a look at an example with nonlinear data.
+
+* * *
+
+## An example with nonlinear data
+
+Now what if we have nonlinear data? We can do the same!
+
+We'll have to change a few lines in our code, though. Let's first replace the `make_blobs` import by `make_moons`:
+
+```
+from sklearn.datasets import make_moons
+```
+
+Next, also replace the call to this function under _Generate data_ with this:
+
+```
+X, targets = make_moons(n_samples = num_samples_total)
+```
+
+What happens? Well, unlike the linearly separable data, two shapes resembling half moons are generated; they cannot be linearly separated, at least in regular feature space:
+
+![](images/moons.png)
+
+Running the code with these adaptations (the full code can be retrieved next) shows that the Keras model is actually able to perform hinge-loss based nonlinear separation pretty successfully:
+
+```
+Epoch 50/50
+600/600 [==============================] - 0s 107us/step - loss: 0.0748 - acc: 0.9233 - val_loss: 0.0714 - val_acc: 0.9400
+250/250 [==============================] - 0s 26us/step
+Test results - Loss: 0.07214225435256957 - Accuracy: 91.59999976158142%
+```
+
+And it looks as follows:
+
+![](images/moons_decision.png)
+
+### Full model code
+
+```
+# Imports
+import tensorflow.keras
+from tensorflow.keras.datasets import mnist
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense
+from tensorflow.keras.utils import to_categorical
+import matplotlib.pyplot as plt
+import numpy as np
+from sklearn.datasets import make_moons
+from mlxtend.plotting import plot_decision_regions
+
+# Configuration options
+num_samples_total = 1000
+training_split = 250
+
+# Generate data
+X, targets = make_moons(n_samples = num_samples_total)
+targets[np.where(targets == 0)] = -1
+X_training = X[training_split:, :]
+X_testing = X[:training_split, :]
+Targets_training = targets[training_split:]
+Targets_testing = targets[:training_split]
+
+# Generate scatter plot for training data
+plt.scatter(X_training[:,0], X_training[:,1])
+plt.title('Nonlinear data')
+plt.xlabel('X1')
+plt.ylabel('X2')
+plt.show()
+
+# Set the input shape
+feature_vector_shape = len(X_training[0])
+input_shape = (feature_vector_shape,)
+print(f'Feature shape: {input_shape}')
+
+# Create the model
+model = Sequential()
+model.add(Dense(50, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform'))
+model.add(Dense(1, activation='tanh'))
+
+# Configure the model and start training
+model.compile(loss='squared_hinge', optimizer='adam', metrics=['accuracy'])
+model.fit(X_training, Targets_training, epochs=50, batch_size=25, verbose=1, validation_split=0.2)
+
+# Test the model after training
+test_results = model.evaluate(X_testing, Targets_testing, verbose=1)
+print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%')
+
+# Plot decision boundary
+plot_decision_regions(X_testing, Targets_testing, clf=model, legend=2)
+plt.show()
+```
+
+* * *
+
+## Summary
+
+In this blog, we've seen how to visualize the decision boundary of your Keras model by means of Mlxtend, a Python library that extends the toolkit of today's data scientists.
We saw that we only need **two lines of code** to provide for a basic visualization which clearly demonstrates the presence of the decision boundary. + +I hope you've learnt something from this blog! Please let me know in a comment 👇 if you have any questions, any remarks or when you have comments. I'll happily improve my blog if I made mistakes or forgot crucial information - and I'm also very eager to hear what you've done with the information! Thanks and happy engineering! 😊💬 + +_Note that code is also available on [GitHub](https://github.com/christianversloot/keras-visualizations), in my Keras Visualizations repository._ + +* * * + +## References + +Raschka, S. (n.d.). Home - mlxtend. Retrieved from [http://rasbt.github.io/mlxtend/](http://rasbt.github.io/mlxtend/) + +Raschka, S. (2018). MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack. _Journal of Open Source Software_, _3_(24), 638. [doi:10.21105/joss.00638](https://joss.theoj.org/papers/10.21105/joss.00638) + +Intuitively understanding SVM and SVR – MachineCurve. (2019, September 20). Retrieved from [https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) + +About loss and loss functions: Hinge loss – MachineCurve. (2019, October 4). Retrieved from [https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#hinge](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#hinge) + +About loss and loss functions: Squared hinge loss – MachineCurve. (2019, October 4). Retrieved from [https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#squared-hinge](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#squared-hinge) + +ReLU, Sigmoid and Tanh: today's most used activation functions – MachineCurve. (2019, September 4). Retrieved from [https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) + +He/Xavier initialization & activation functions: choose wisely – MachineCurve. (2019, September 18). Retrieved from [https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/) diff --git a/how-to-visualize-the-encoded-state-of-an-autoencoder-with-keras.md b/how-to-visualize-the-encoded-state-of-an-autoencoder-with-keras.md new file mode 100644 index 0000000..3d4ec26 --- /dev/null +++ b/how-to-visualize-the-encoded-state-of-an-autoencoder-with-keras.md @@ -0,0 +1,681 @@ +--- +title: "How to visualize the encoded state of an autoencoder with Keras?" +date: "2019-12-26" +categories: + - "deep-learning" + - "frameworks" +tags: + - "autoencoder" + - "deep-learning" + - "encoded-state" + - "keract" + - "keras" + - "machine-learning" + - "visualization" +--- + +Autoencoders are special types of neural networks which learn to convert inputs into lower-dimensional form, after which they convert it back into the original or some related output. 
A variety of interesting applications has emerged for them: denoising, dimensionality reduction, input reconstruction, and - with a particular type of autoencoder called Variational Autoencoder - even [generative actions](https://www.machinecurve.com/index.php/2019/12/24/what-is-a-variational-autoencoder-vae/). + +This lower-dimensional form is also called the encoded state. But what does it look like? Can we visualize the encoded state of an autoencoder? And how do we do that for a Keras model? Those are questions that I'll answer in this blog post. Firstly, we'll dive into the concept of an autoencoder, to recap - or to inject - your high-level understanding of them. Next, we'll take a look at what we're going to do today, i.e., generate visualizations of the encoded state. Subsequently, we provide an example for encoded state visualization with both the Keras Functional API and the Keras Sequential API. + +Are you ready? Let's go! 😎 + +\[toc\] + +* * * + +## Recap: what are autoencoders and what are they used for? + +We talk about _encoded states_ and _autoencoders_ - but what are they? + +Likely, you already know what is meant with these concepts. In that case, I'd suggest that you skip this section and start at the next one. + +However, it's also possible that you _don't know yet_ what they are, or that your familiarity with these concepts is growing - but that you can't dream them yet 😴 + +In that case, read on :-) + +At a high level, this is an autoencoder: + +![](images/Autoencoder.png) + +It's a combination of building blocks which allows you to feed it some input, which is encoded, decoded and then returned to the user. Depending on how you configure it (in terms of input and desired outputs), they can be used for: + +- [Input reconstruction](https://www.machinecurve.com/index.php/2019/12/19/creating-a-signal-noise-removal-autoencoder-with-keras/); +- [Noise reduction](https://www.machinecurve.com/index.php/2019/12/20/building-an-image-denoiser-with-a-keras-autoencoder-neural-network/); +- Dimensionality reduction. + +Besides the input image and the 'reconstructed image' (or denoised image), there are more building blocks: an encoder, a decoder and an encoded state. + +What are they? + +Let's take at the encoder first. When you feed an autoencoder input, the input has multiple dimensions. For example, a 28 x 28 pixels input image has 28 x 28 = 784 dimensions, which all take a real value (i.e., a number with some decimals that can be positive and negative). The _encoder_ reduces the dimensionality of your input to the extent that it can be overseen by compressing information and discarding useless information (e.g. sampling noise), to e.g. 25 dimensions. This is called the _encoded state_, which you can feed to the _decoder_. The decoder, in turn, attempts to build up a new image, such as a reconstruction of the input. How good the autoencoder works is determined by the loss function with which it is trained, and is determined by how similar the output is to the input. + +An important word in the previous paragraph is _trained_. Indeed, the encoder and decoder segments of autoencoders must be trained. Usually, neural networks are employed for this purpose, as they are [universal function approximators](https://www.machinecurve.com/index.php/2019/07/18/can-neural-networks-approximate-mathematical-functions/) and can by consequence _learn_ the mapping from input to encoded state, and from encoded state to reconstruction. 
This is really important to remember when you're talking about autoencoders. While they are usually associated with neural networks, they're not the same. Rather, they are _implemented_ with neural nets. + +* * * + +## Visualizing the encoded state: what we want to achieve + +You may now wonder: _what's the goal of this blog post?_ + +It's to visualize the **encoded state**, when a sample is fed to the autoencoder. + +This can be useful in situations when you use autoencoders for dimensionality reduction, and you consider the encoded states to be features for e.g. [Support Vector Machines](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/). In those cases, you may wish to look at your encoded states in order to find whether dimensions must be added or dropped, and so on. In other words, you wish to generate images like this one: + +[![](images/encoded-state-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/encoded-state.png) + +Fortunately, this is possible! 🎉 + +**Keract** ([link to their GitHub](https://github.com/philipperemy/keract)) is a nice toolkit with which you can "get the activations (outputs) and gradients for each layer of your Keras model" (Rémy, 2019). We already covered Keract before, in a blog post illustrating how to use it for [visualizing the hidden layers in your neural net](https://www.machinecurve.com/index.php/2019/12/02/visualize-layer-outputs-of-your-keras-classifier-with-keract/), but we're going to use it again today. + +This time, we'll be using it to visualize the encoded state - which, in terms of the neural network implementation of your autoecoder, is nothing else than a visualization of the output of the _encoder segment_, i.e. the final layer of the neural network segment that represents your autoencoder's encoder. + +Let's find out how we can do that, with both the Keras Functional API and the Keras Sequential API! 😀 + +### What you'll need to run the model + +Today, we create two variants of one autoencoder: one with the Keras Functional API, and one with the Keras Sequential API. This allows you to choose what fits best for you, and use code for both scenarios in your machine learning projects. + +Regardless of the API that we will use, you need to install a set of software dependencies if you wish to run the autoencoders successfully in your local system: + +- **Keras**, which is the deep learning framework we will use today. +- **Python**, which is the language in which Keras models are created. Preferably, use Python 3.6 or newer. +- One of the backends supported by Keras, and since the deep integration with TensorFlow (since v2.0) I'd say that **Tensorflow** is the best choice. +- **Keract**, for visualizing the encoded state: `pip install keract`. +- **Matplotlib**, for plotting the visualizations on screen. + +Are you ready? Lets go! 😎 + +* * * + +## Visualizing encoded state with a Keras Functional API autoencoder + +We'll first take a look at how encoded states can be visualized when building autoencoders with the Keras Functional API. While it's a bit harder to structure your models, it allows easier access to intermediate layers compared to the Keras Sequential API. + +And access to these intermediate layers is what we need. Remember that autoencoders contain an _encoder_ segment, as well as a _decoder_ segment, which are trained together but have separate tasks. Additionally, the _autoencoder_ must be considered as a whole. 
Separating between layers and/or segments is thus necessary when creating autoencoders. While this is certainly possible with the Sequential API (as we will show later in this blog post), you'll make your life easier when you use the Functional API.
+
+In order to start, open up your Explorer/Finder, navigate to some folder, and create a Python file, e.g. `autoencoder_encodedstate_functional.py`. Next, open this file in your code editor. Now, we can start!
+
+### Model imports
+
+The first thing we must do is import the dependencies:
+
+```python
+'''
+  Visualizing the encoded state of a simple autoencoder created with the Keras Functional API
+  with Keract.
+'''
+import keras
+from keras.layers import Input, Dense
+from keras.datasets import mnist
+from keras.models import Model
+from keract import get_activations, display_activations
+import matplotlib.pyplot as plt
+```
+
+We'll need the `Input` and `Dense` layers today: Input for serving as the input layer of the neural network implementing our autoencoder, Dense as the hidden layer that generates the encoding. With the `mnist` dataset, we'll train our autoencoder, and use the `Model` instance to instantiate our models later on. Additionally, we'll need some functions from `keract`, and need to import the Matplotlib PyPlot API as `plt`.
+
+### Model configuration
+
+Next up: model configuration. As usual, we'll define the width and height of our images, and multiply them in order to define our initial dimension (which is 28 x 28 = 784 pixels). Batch size is set to 128, which is a common default, and the number of epochs is kept low deliberately (as the model seems to converge quite quickly). 20% of the training data is used for validation purposes, verbosity mode is set to 1 (i.e., all output is shown on screen), and our encoder reduces the dimensionality to 50 (from 784).
+
+```python
+# Model configuration
+img_width, img_height = 28, 28
+initial_dimension = img_width * img_height
+batch_size = 128
+no_epochs = 10
+validation_split = 0.2
+verbosity = 1
+encoded_dim = 50
+```
+
+### Data loading and preprocessing
+
+Now, it's time to load data and preprocess it.
+
+First, we use the built-in Keras functionality for loading the MNIST dataset: `mnist.load_data()`. This automatically downloads the data from some S3 endpoint and puts it into local cache, which allows you to use it without any difficulty.
+
+Secondly, we'll reshape the data into our `initial_dimension` - i.e., from the (28, 28, 1) format (28 width, 28 height and 1 channel) into 784 values, everything merged together. We finally represent our new shape in `input_shape`.
+
+Subsequently, we parse the numbers in our data as floats - specifically, `float32` - which presumably speeds up the training process.
+
+Finally, we normalize the data, so that it's in the \[0, 1\] range. This is appreciated by the neural network during training.
+
+```python
+# Load MNIST dataset
+(input_train, target_train), (input_test, target_test) = mnist.load_data()
+
+# Reshape data
+input_train = input_train.reshape(input_train.shape[0], initial_dimension)
+input_test = input_test.reshape(input_test.shape[0], initial_dimension)
+input_shape = (initial_dimension, )
+
+# Parse numbers as floats
+input_train = input_train.astype('float32')
+input_test = input_test.astype('float32')
+
+# Normalize data
+input_train = input_train / 255
+input_test = input_test / 255
+```
+
+### Defining the layers of the autoencoder
+
+Next, we can define the layers that we will use in the autoencoder.
The `inputs` layer does what you think it does: it serves to "take in" the input following the `input_shape` determined before.
+
+Recognizable by the `(inputs)` code in the next layer, we can tell that it's fed to the `encoding_layer`, which is a densely-connected layer with `encoded_dim` (= 50, in our case) neurons, [ReLU activation](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/), [and by consequence He init](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/).
+
+The decoding layer, which takes the input from the encoding layer, is once again densely-connected. Its output equals the `initial_dimension`, which results in the same shape as we fed it in the first place. It activates with a Sigmoid activation function, so that we can use [binary crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) for computing loss (as we will see next).
+
+```python
+# Define the layers
+inputs = Input(shape=input_shape)
+encoding_layer = Dense(encoded_dim, activation='relu', kernel_initializer='he_normal')(inputs)
+decoding_layer = Dense(initial_dimension, activation='sigmoid')(encoding_layer)
+```
+
+### Intermezzo: definitions versus instantiations
+
+Above, we defined the layers of our autoencoder:
+
+- The input layer;
+- The encoding layer;
+- The decoding layer.
+
+We also told Keras in our code how information should flow with these layers: from the input layer, through the encoding layer, through the decoding layer, to become output again.
+
+However, we don't have a model yet. It's important to understand that even though we defined the _architecture_ of the model, we haven't actually _got a model instance_ yet. We'll create one now :)
+
+### Instantiating the full autoencoder and encoder/decoder segments
+
+Now that we have defined our layers, we can instantiate them, i.e. convert them to true Model instances.
+
+#### Full autoencoder
+
+First, for the full autoencoder, we essentially tell the `Model` that we wish to create one with `inputs` (the inputs layer) as the starting layer, and `decoding_layer` (the output of the decoder) as the final layer.
+
+```python
+# Instantiate the autoencoder
+autoencoder = Model(inputs, decoding_layer, name='full_autoencoder')
+```
+
+#### Encoder segment
+
+For the encoder, we only want the flow to go from `inputs` to `encoding_layer`:
+
+```python
+# Instantiate the encoder
+encoder = Model(inputs, encoding_layer, name='encoder')
+```
+
+#### Decoder segment
+
+For the decoder, we'll have to take a slightly different approach. Contrary to the encoder, which is not dependent on anything, the decoder is dependent on the _learnt_ encoder. Hence, we'll need to include it in our code: simply adding the `decoding_layer` here would suggest that we split encoder and decoder.
+
+Hence, we'll first define a new pseudo-Input layer, which takes inputs with shape `(encoded_dim, )`. Next, we retrieve the final (i.e., the decoding) layer from our instantiated `autoencoder`. This layer is used in the `decoder` model, which has the `encoded_input` as its input (which makes sense) and `final_ae_layer(encoded_input)` as output. Although being slightly more difficult, this makes sense as well: it's just the _encoded input_ fed to the _decoder layer_ we already _trained before_.
+ +```python +# Instantiate the decoder +encoded_input = Input(shape=(encoded_dim, )) +final_ae_layer = autoencoder.layers[-1] +decoder = Model(encoded_input, final_ae_layer(encoded_input), name='decoder') +``` + +### Compile autoencoder & encoder segment + +Next, we can compile both the autoencoder and the encoder segment. We'll need to compile them both, as we'll use them later to generate an image of input and reconstructed output (hence we need `autoencoder`) and visualize the encoded state (hence we need `encoder`). If you only need one of them, it's of course fine to drop any of them. + +We'll use the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) and [binary crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/), pretty much the standard settings for today's networks with Sigmoid activation functions. + +```python +# Compile the autoencoder +encoder.compile(optimizer='adam', loss='binary_crossentropy') +autoencoder.compile(optimizer='adam', loss='binary_crossentropy') +``` + +It's now time to get some insights about the structure of our models as well. + +To achieve this, we generate summaries: + +```python +# Give us some insights +autoencoder.summary() +encoder.summary() +decoder.summary() +``` + +### Fitting data + +Finally, we'll fit the data - our `input_train` serves both as features and as targets, specify the number of epochs and batch size as configured before, and do the same for the validation split. + +```python +# Fit data +autoencoder.fit(input_train, input_train, epochs=no_epochs, batch_size=batch_size, validation_split=validation_split) +``` + +Almost ready! + +### Visualizing a sample and reconstruction + +The only thing left is generating some visualizations, which is why you're here in the first place. + +The first thing to add is taking a sample from the _testing_ dataset, which you'll feed to the `autoencoder` (i.e. the encoder _and_ decoder segments) in order to retrieve a reconstruction. + +```python +# ============================================= +# Take a sample for visualization purposes +# ============================================= +input_sample = input_test[:1] +reconstruction = autoencoder.predict([input_sample]) +``` + +The next thing we do is plot this reconstruction together with the `input_sample`. This allows us to see how well the autoencoder performs: + +```python +# ============================================= +# Visualize input-->reconstruction +# ============================================= +fig, axes = plt.subplots(1, 2) +fig.set_size_inches(6, 3.5) +input_sample_reshaped = input_sample.reshape((img_width, img_height)) +reconsstruction_reshaped = reconstruction.reshape((img_width, img_height)) +axes[0].imshow(input_sample_reshaped) +axes[0].set_title('Original image') +axes[1].imshow(reconsstruction_reshaped) +axes[1].set_title('Reconstruction') +plt.show() +``` + +### Visualizing the encoded state + +Next, we can visualize the encoded state. As said, we'll use Keract for this, which allows you to visualize activations in neural network layers made with Keras. 
Since it abstracts much of the coding away, it's actually really easy to generate these visualizations: + +```python +# ============================================= +# Visualize encoded state with Keract +# ============================================= +activations = get_activations(encoder, input_sample) +display_activations(activations, cmap="gray", save=False) +``` + +The steps are simple: + +- Get the activations for the `encoder` model (_we use the encoder because we want to visualize the encoded state!)_ from the `input_sample` we took before. +- Next, display these activations with `display_activations`, using the gray colormap and without saving them. + +That's it! + +### Full model code + +Should you wish to copy the full model code at once, here you go 😀 + +```python +''' + Visualizing the encoded state of a simple autoencoder created with the Keras Functional API + with Keract. +''' +import keras +from keras.layers import Input, Dense +from keras.datasets import mnist +from keras.models import Model +from keract import get_activations, display_activations +import matplotlib.pyplot as plt + +# Model configuration +img_width, img_height = 28, 28 +initial_dimension = img_width * img_height +batch_size = 128 +no_epochs = 10 +validation_split = 0.2 +verbosity = 1 +encoded_dim = 50 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], initial_dimension) +input_test = input_test.reshape(input_test.shape[0], initial_dimension) +input_shape = (initial_dimension, ) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Define the layers +inputs = Input(shape=input_shape) +encoding_layer = Dense(encoded_dim, activation='relu', kernel_initializer='he_normal')(inputs) +decoding_layer = Dense(initial_dimension, activation='sigmoid')(encoding_layer) + +# Instantiate the autoencoder +autoencoder = Model(inputs, decoding_layer, name='full_autoencoder') + +# Instantiate the encoder +encoder = Model(inputs, encoding_layer, name='encoder') + +# Instantiate the decoder +encoded_input = Input(shape=(encoded_dim, )) +final_ae_layer = autoencoder.layers[-1] +decoder = Model(encoded_input, final_ae_layer(encoded_input), name='decoder') + +# Compile the autoencoder +encoder.compile(optimizer='adam', loss='binary_crossentropy') +autoencoder.compile(optimizer='adam', loss='binary_crossentropy') + +# Give us some insights +autoencoder.summary() +encoder.summary() +decoder.summary() + +# Fit data +autoencoder.fit(input_train, input_train, epochs=no_epochs, batch_size=batch_size, validation_split=validation_split) + +# ============================================= +# Take a sample for visualization purposes +# ============================================= +input_sample = input_test[:1] +reconstruction = autoencoder.predict([input_sample]) + +# ============================================= +# Visualize input-->reconstruction +# ============================================= +fig, axes = plt.subplots(1, 2) +fig.set_size_inches(6, 3.5) +input_sample_reshaped = input_sample.reshape((img_width, img_height)) +reconsstruction_reshaped = reconstruction.reshape((img_width, img_height)) +axes[0].imshow(input_sample_reshaped) +axes[0].set_title('Original image') +axes[1].imshow(reconsstruction_reshaped) +axes[1].set_title('Reconstruction') +plt.show() + 
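
# Optional sketch (not strictly needed for the visualizations): the standalone
# encoder and decoder segments can also be chained manually. Encoding the sample
# and then decoding it again should yield roughly the same reconstruction as the
# full autoencoder, since the segments share their weights with it. The variable
# names below are purely illustrative.
encoded_sample = encoder.predict(input_sample)
decoded_sample = decoder.predict(encoded_sample)
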

# =============================================
# Visualize encoded state with Keract
# =============================================
activations = get_activations(encoder, input_sample)
display_activations(activations, cmap="gray", save=False)
```

### Results

Now, it's time to run the model.

Open up a terminal, `cd` to the folder where your `.py` file is located, and run it: e.g. `python autoencoder_encodedstate_functional.py`. You should see the training process begin, and once it finishes, some visualizations should start popping up.

The results are pretty awesome.

This is the first visualization we'll get:

[![](images/reconstruction.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/reconstruction.png)

As you can see, the MNIST digits are so distinctive that even with a simple structure (Dense layers only!), the autoencoder is capable of generating accurate reconstructions 😊

Next, the visualization of the encoded state - which is, once again, why you're here. Here you go:

[![](images/encoded-state-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/encoded-state.png)

Looks good to me! 😎

* * *

## Visualizing encoded state with a Keras Sequential API autoencoder

Visualizing the encoded state of an autoencoder created with the Keras Sequential API is a bit harder, because you don't have as much control over the individual layers as you'd like to have.

However, it's possible nevertheless :)

Open up your Explorer/Finder again, navigate to some folder, and create another Python file: `autoencoder_encodedstate_sequential.py`. Now open this file in your code editor. We can start coding again :)

### Model imports

First, we'll add the imports:

```python
'''
 Visualizing the encoded state of a simple autoencoder created with the Keras Sequential API
 with Keract.
'''
import keras
from keras.layers import Dense
from keras.datasets import mnist
from keras.models import Sequential
from keract import get_activations, display_activations
import matplotlib.pyplot as plt
from keras import backend as K
```

We use Keras for creating the deep learning model. From Keras, we import the densely-connected (Dense) layer and the MNIST dataset - just as we did before. However, this time, instead of the `Model` container, we import the `Sequential` API. This allows us to stack the layers easily.

Next, we import the functions from Keract that we can use to generate the visualizations. This is followed by the Matplotlib PyPlot API, which we'll use to plot the visualizations generated with Keract. Finally, we import the Keras backend; strictly speaking it isn't needed for the final code, but it comes in handy if you want to extract the encoder's intermediate output yourself.

### Model configuration

Next, we specify the configuration of our model:

```python
# Model configuration
img_width, img_height = 28, 28
initial_dimension = img_width * img_height
batch_size = 128
no_epochs = 10
validation_split = 0.2
verbosity = 1
encoded_dim = 50
```

It's identical to the configuration of the model created with the Functional API.

### Import & process dataset

The same goes for importing and preprocessing the data: it's identical to the Functional API version. First, we import the data, reshape it into the correct format, parse the numbers as floats (to speed up the training process) and normalize the data to the \[0, 1\] range.

```python
# Load MNIST dataset
(input_train, target_train), (input_test, target_test) = mnist.load_data()

# Reshape data
input_train = input_train.reshape(input_train.shape[0], initial_dimension)
input_test = input_test.reshape(input_test.shape[0], initial_dimension)
input_shape = (initial_dimension, )

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Normalize data
input_train = input_train / 255
input_test = input_test / 255
```

### Defining the full autoencoder

We next define the autoencoder in full. We do so by assigning a new instance of the `Sequential` model type to the `autoencoder` variable. Subsequently, we add a densely-connected layer that has `encoded_dim` outputs (i.e. neurons), and thus learns the encoded state.

It makes use of [ReLU activation](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/) and by consequence [He initialization](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/). The `input_shape` is specified here as well; with the Sequential API, the input layer itself is created under the hood.

Next, we add another densely-connected layer, which converts the representation of `encoded_dim` dimensionality back into the `initial_dimension`, and thus serves as the decoder segment. It makes use of Sigmoid activation in order to allow us to use [binary crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/).

```python
# Define the 'autoencoder' full model
autoencoder = Sequential()
autoencoder.add(Dense(encoded_dim, activation='relu', kernel_initializer='he_normal', input_shape=input_shape))
autoencoder.add(Dense(initial_dimension, activation='sigmoid'))
```

### Compilation & fitting data

Next up: compiling the model (using the default Adam optimizer and binary crossentropy loss), outputting a summary, and fitting the data. Note that, as usual with an autoencoder, we feed `input_train` both as features and as targets. We train the model as specified in the configuration in terms of number of epochs, batch size and validation split.

```python
# Compile the autoencoder
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Give us some insights
autoencoder.summary()

# Fit data
autoencoder.fit(input_train, input_train, epochs=no_epochs, batch_size=batch_size, validation_split=validation_split)
```

### Visualizing a sample

We next take a sample, generate a reconstruction using the trained model, and visualize it using Matplotlib. This code is no different from the code used with the Functional API.
+ +```python +# ============================================= +# Take a sample for visualization purposes +# ============================================= +input_sample = input_test[:1] +reconstruction = autoencoder.predict([input_sample]) + +# ============================================= +# Visualize input-->reconstruction +# ============================================= +fig, axes = plt.subplots(1, 2) +fig.set_size_inches(6, 3.5) +input_sample_reshaped = input_sample.reshape((img_width, img_height)) +reconsstruction_reshaped = reconstruction.reshape((img_width, img_height)) +axes[0].imshow(input_sample_reshaped) +axes[0].set_title('Original image') +axes[1].imshow(reconsstruction_reshaped) +axes[1].set_title('Reconstruction') +plt.show() +``` + +### Visualizing the encoded state + +Next, we visualize the encoded state ... _and here is one difference compared to the Functional API_. + +Since Keras does not allow us to consider layers and their outputs easily (and to make it compatible with add-ons like Keract), we'll have to feed Keract the entire `autoencoder` instance. This means that you will get visualizations for both the _encoder_ and the _decoder_ segment (the latter of which is simply the output that is visualized later, but has to be reshaped yet into 28 x 28 pixels format). It's unfortunate, but it's how we'll have to do it. The silver lining: there's not much to visualize here, so it won't take you a lot of extra time :) + +```python +# ============================================= +# Visualize encoded state with Keract +# ============================================= +activations = get_activations(autoencoder, input_sample) +display_activations(activations, cmap="gray", save=False) +``` + +Now, we're ready! Time to start up your terminal again, `cd` into the folder where your Sequential API model is stored, and run `python autoencoder_encodedstate_sequential.py`. Training should begin and visualizations should pop up once it finishes. Let's go! 😎 + +### Full model code + +Should you wish to obtain the full code for the Sequential version at once, here you go: + +```python +''' + Visualizing the encoded state of a simple autoencoder created with the Keras Sequential API + with Keract. 
+''' +import keras +from keras.layers import Dense +from keras.datasets import mnist +from keras.models import Sequential +from keract import get_activations, display_activations +import matplotlib.pyplot as plt +from keras import backend as K + +# Model configuration +img_width, img_height = 28, 28 +initial_dimension = img_width * img_height +batch_size = 128 +no_epochs = 10 +validation_split = 0.2 +verbosity = 1 +encoded_dim = 50 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], initial_dimension) +input_test = input_test.reshape(input_test.shape[0], initial_dimension) +input_shape = (initial_dimension, ) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Define the 'autoencoder' full model +autoencoder = Sequential() +autoencoder.add(Dense(encoded_dim, activation='relu', kernel_initializer='he_normal', input_shape=input_shape)) +autoencoder.add(Dense(initial_dimension, activation='sigmoid')) + +# Compile the autoencoder +autoencoder.compile(optimizer='adam', loss='binary_crossentropy') + +# Give us some insights +autoencoder.summary() + +# Fit data +autoencoder.fit(input_train, input_train, epochs=no_epochs, batch_size=batch_size, validation_split=validation_split) + +# ============================================= +# Take a sample for visualization purposes +# ============================================= +input_sample = input_test[:1] +reconstruction = autoencoder.predict([input_sample]) + +# ============================================= +# Visualize input-->reconstruction +# ============================================= +fig, axes = plt.subplots(1, 2) +fig.set_size_inches(6, 3.5) +input_sample_reshaped = input_sample.reshape((img_width, img_height)) +reconsstruction_reshaped = reconstruction.reshape((img_width, img_height)) +axes[0].imshow(input_sample_reshaped) +axes[0].set_title('Original image') +axes[1].imshow(reconsstruction_reshaped) +axes[1].set_title('Reconstruction') +plt.show() + +# ============================================= +# Visualize encoded state with Keract +# ============================================= +activations = get_activations(autoencoder, input_sample) +display_activations(activations, cmap="gray", save=False) +``` + +### Results + +Time for some results :-) + +As with the Functional API version, the Sequential API based autoencoder learns to reconstruct the inputs pretty accurately. Additionally, you'll also get a visualization of the encoded state: + +- [![](images/sequential_rec.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/sequential_rec.png) + +- [![](images/sequential_encodedstate-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/sequential_encodedstate.png) + + +This time, as indicated before, you'll also get an extra visualization - being what is output by the decoder before it's reshaped into 28 x 28 pixels format. It's simply the way Keract works, and given the relative inflexibility of the Sequential API there's not much we can do about it. + +[![](images/sequential_output-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/sequential_output.png) + +Mission achieved! 🎉 + +* * * + +## Summary + +In this blog post, we've seen how the encoded state of an autoencoder created with Keras can be visualized. 
We first looked at the concepts behind autoencoders, and how they can be implemented by using neural networks. This included an analysis of the concept 'encoded state' and how autoencoders learn it. + +In order to demonstrate how it works, we created an example with Keras which contains two densely-connected (Dense) autoencoders being trained on the MNIST dataset. The two, one of which is created with the Functional API and the other with the Sequential API, learnt to reconstruct the MNIST digits pretty accurately. With Keract, we finally visualized the encoded state with only a few lines of code. + +I hope that you've learnt something new today 😀 Please let me know what you think by dropping a comment in the comments section below 👍 Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Rémy, P. (2019, November 28). Keract. Retrieved from [https://github.com/philipperemy/keract](https://github.com/philipperemy/keract) diff --git a/how-to-visualize-the-training-process-in-keras.md b/how-to-visualize-the-training-process-in-keras.md new file mode 100644 index 0000000..0eb87f0 --- /dev/null +++ b/how-to-visualize-the-training-process-in-keras.md @@ -0,0 +1,262 @@ +--- +title: "Visualizing training performance with TensorFlow 2 and Keras" +date: "2019-10-08" +categories: + - "buffer" + - "frameworks" +tags: + - "deep-learning" + - "keras" + - "neural-network" + - "visualization" +--- + +Sometimes, you don't want to visualize the [architecture](https://www.machinecurve.com/index.php/2019/10/07/how-to-visualize-a-model-with-keras/) of your Keras model, but rather you wish to show the training process. + +One way of achieving that is by exporting all the loss values and accuracies manually, adding them to an Excel sheet - before generating a chart. + +Like I did a while ago 🙈 + +[![](images/image-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/07/image-2.png) + +It goes without saying that there are smarter ways for doing that. In today's blog, we'll cover how to visualize the training process in Keras - just like above, but then with a little piece of extra code. This blog covers precisely what you need in order to generate such plots, it discusses the Keras `History` object which contains the data you'll need and presents the visualization code. + +In this tutorial, you will learn... + +- That a History object is attached to `model.fit` in TensorFlow/Keras and that it provides useful information. +- What the structure of this History object is. +- How to visualize the contents of the History object to see model performance across epochs. + +Let's go! + +_Note that model code is also available [on GitHub](https://github.com/christianversloot/keras-visualizations)._ + +* * * + +**Update 26/Jan/2021:** updated the article. It now uses TensorFlow 2 meaning that it works with recent versions of the library. Additionally, the headers were changed, and a few textual corrections were made. + +* * * + +\[toc\] + +* * * + +## Code example: visualizing the History object of your TensorFlow model + +Here is a **simple but complete example** that can be used for visualizing the performance of your TensorFlow model during training. It utilizes the `history` object, which is returned by calling `model.fit()` on your Keras model. This example visualizes the _training loss_ and _validation loss_, which can e.g. be MAE. 
+ +If you want to understand everything in more detail - such as how this History object works - then make sure to read the rest of this tutorial as well! :) + +``` +from tensorflow.keras.models import Sequential +import matplotlib.pyplot as plt + +# Some TensorFlow/Keras model +model = Sequential() +model.compile() +history = model.fit() + +# Plot history: MAE +plt.plot(history.history['loss'], label='MAE (training data)') +plt.plot(history.history['val_loss'], label='MAE (validation data)') +plt.title('MAE for Chennai Reservoir Levels') +plt.ylabel('MAE value') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() +``` + +* * * + +## What you'll need + +Since we're creating some actual code, you'll likely wish to run it on your machine. For this to work, you need to install certain software dependencies. Specifically: + +- You'll need **Python** to run Keras, preferably 3.8+ +- A recent version of **TensorFlow**, 2.1.0+ for example. +- What's also necessary is **Matplotlib** and, by consequence, **SciPy**. + +Preferably, you run these in an Anaconda environment that isolates these packages from your other development environments. It saves you a lot of struggle as packages could otherwise interfere with each other. + +* * * + +## The model we'll work with today + +In this blog we want to visualize the training process of a Keras model. This requires that we'll work with an actual model. We use this simple one today: + +``` +# Load dependencies +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +import numpy as np + +# Load data +dataset = np.loadtxt('./chennai_reservoir_levels.csv', delimiter='|', skiprows=1, usecols=(1,2,3,4)) + +# Shuffle dataset +np.random.shuffle(dataset) + +# Separate features and targets +X = dataset[:, 0:3] +Y = dataset[:, 3] + +# Set the input shape +input_shape = (3,) +print(f'Feature shape: {input_shape}') + +# Create the model +model = Sequential() +model.add(Dense(16, input_shape=input_shape, activation='relu')) +model.add(Dense(8, activation='relu')) +model.add(Dense(1, activation='linear')) + +# Configure the model and start training +model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_squared_error']) +model.fit(X, Y, epochs=25, batch_size=1, verbose=1, validation_split=0.2) +``` + +Why such a simple one? Well - it's not about the model today, so we should keep most complexity out of here. The regular reader recognizes that this is the regression MLP that we created [earlier](https://www.machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/). It loads Chennai, India based water reservoir water levels and attempts to predict the levels at one given the levels in the other 3 reservoirs. It does so by means of the Keras Sequential API and densely-conencted layers and MAE as a [regression loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#loss-functions-for-regression), with MSE as an additional one. It performs training in 25 epochs. + +Let's create a file called `history_visualization.py` and paste the above code into it. + +* * * + +## The `History` object + +When running this model, Keras maintains a so-called `History` object in the background. This object keeps all loss values and other metric values in memory so that they can be used in e.g. [TensorBoard](https://www.tensorflow.org/tensorboard/r1/summaries), in Excel reports or indeed for our own custom visualizations. 
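
As a quick aside: if you prefer the Excel route, the `History` object's `history` dictionary (more on its exact structure below) can easily be dumped to a CSV file that Excel opens directly. This is just a minimal sketch - the `history_to_csv` helper and the `training_history.csv` filename are illustrative names, not part of Keras:

```
import csv

def history_to_csv(history, path='training_history.csv'):
    # history.history maps each metric name to a list of per-epoch values
    metrics = list(history.history.keys())
    rows = zip(*[history.history[m] for m in metrics])
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['epoch'] + metrics)
        for epoch, values in enumerate(rows, start=1):
            writer.writerow([epoch] + list(values))
```
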

The history object is the output of the `fit` operation. Hence, it can be accessed in your Python script by slightly adapting that row in the above code to:

`history = model.fit(X, Y, epochs=25, batch_size=1, verbose=1, validation_split=0.2)`

In the Keras docs, we find:

> The `History.history` attribute is a dictionary recording training loss values and metrics values at successive epochs, as well as validation loss values and validation metrics values (if applicable).
>
> [Keras docs on model visualization](https://keras.io/visualization/#model-visualization)

Also add `print(history.history)` so that we can inspect the history before we visualize it, to get a feel for its structure.

It indeed outputs the model history (note that for simplicity we trained with only 5 epochs):

```
{'val_loss': [281.05517045470464, 281.0461930366744, 282.3450624835175, 283.21272195725317, 278.22250578392925], 'val_mean_squared_error': [131946.00690089026, 131610.73269158995, 132186.26299269326, 133621.92045977595, 131213.40662287443], 'loss': [319.1303724563634, 279.54961594772305, 277.2224043372698, 276.19018290098035, 276.37119589065435], 'mean_squared_error': [210561.46019607811, 132310.933269216, 131070.35584168187, 131204.38709398077, 131249.8484192732]}
```

Or, when nicely formatted:

```
{
    "val_loss":[
        281.05517045470464,
        281.0461930366744,
        282.3450624835175,
        283.21272195725317,
        278.22250578392925
    ],
    "val_mean_squared_error":[
        131946.00690089026,
        131610.73269158995,
        132186.26299269326,
        133621.92045977595,
        131213.40662287443
    ],
    "loss":[
        319.1303724563634,
        279.54961594772305,
        277.2224043372698,
        276.19018290098035,
        276.37119589065435
    ],
    "mean_squared_error":[
        210561.46019607811,
        132310.933269216,
        131070.35584168187,
        131204.38709398077,
        131249.8484192732
    ]
}
```

It nicely displays all the metrics that we defined: **MAE** ("loss" and "val\_loss", i.e. for both training and validation data) and **MSE** as an additional metric.

Since this is a simple Python dictionary structure, we can easily use it for visualization purposes.

* * *

## Visualizing the model history

Let's now add an extra import - for Matplotlib, our visualization library:

`import matplotlib.pyplot as plt`

Next, ensure that the number of epochs is at 25 again.

### Visualizing the MAE

Let's now add a piece of code that visualizes the MAE:

```
# Plot history: MAE
plt.plot(history.history['loss'], label='MAE (training data)')
plt.plot(history.history['val_loss'], label='MAE (validation data)')
plt.title('MAE for Chennai Reservoir Levels')
plt.ylabel('MAE value')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()
```

Note that since you defined MAE to be the official loss value (`loss='mean_absolute_error'`), you'll have to use `loss` and `val_loss` in the History object. Above, we additionally add labels, a title and a legend, which eventually gives us this:

[![](images/mae-1024x565.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/mae.png)

### Visualizing the MSE

Similarly, we can add a visualization of our MSE value - but here, we'll have to use `mean_squared_error` and `val_mean_squared_error` instead, because they are an additional metric (`metrics=['mean_squared_error']`).

```
# Plot history: MSE
plt.plot(history.history['mean_squared_error'], label='MSE (training data)')
plt.plot(history.history['val_mean_squared_error'], label='MSE (validation data)')
plt.title('MSE for Chennai Reservoir Levels')
plt.ylabel('MSE value')
plt.xlabel('No. epoch')
plt.legend(loc="upper left")
plt.show()
```

This is the output for our training process:

[![](images/mse-1024x563.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/mse.png)

### Interpreting the training process

What can we observe from the training process?

- Both the _validation_ MAE and MSE are very sensitive to weight swings over the epochs, but the general trend goes downward. This is good.
- Especially the _training_ loss decreases very rapidly in the beginning, and only slightly once the number of epochs increases. This is normal and is a good sign.
- The _training_ MAE and MSE are less sensitive to these swings, which makes sense: the model is optimized on this data directly, whereas it never sees the validation data during optimization. This is also good.
- What can be improved is that apparently the model can improve even further: _validation loss_ is still decreasing at the 25th epoch. This means that the model is not yet overfitting to the data and that its predictive power can be increased. The solution: more epochs.

* * *

## Summary

As you can see, visualizing the training process of your Keras model can help you understand how the model performs. While you can do this manually with e.g. Excel, we've seen in this blog that you can also use built-in Keras utils (namely, the `History` object) to generate an overview of your training process. With Matplotlib, this history can subsequently be visualized.

I hope you've learnt something today - if so, please let me know in the comments; I'd appreciate your remarks! 😊 Feel free to leave a comment as well if you have any questions or if you think this blog can be improved. I'll happily edit the text. Happy engineering!

_Note that model code is also available [on GitHub](https://github.com/christianversloot/keras-visualizations)._

* * *

## Resources

Keras. (n.d.). Visualization. Retrieved from [https://keras.io/visualization/#model-visualization](https://keras.io/visualization/#model-visualization)

Creating an MLP for regression with Keras – MachineCurve. (2019, July 30). Retrieved from [https://www.machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/](https://www.machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/)

How to visualize a model with Keras? – MachineCurve. (2019, October 7). Retrieved from [https://www.machinecurve.com/index.php/2019/10/07/how-to-visualize-a-model-with-keras/](https://www.machinecurve.com/index.php/2019/10/07/how-to-visualize-a-model-with-keras/)

TensorBoard: Visualizing Learning. (n.d.).
Retrieved from [https://www.tensorflow.org/tensorboard/r1/summaries](https://www.tensorflow.org/tensorboard/r1/summaries)
b/images/Causalpad-3-1024x429.jpg differ diff --git a/images/Causalpad-4-1024x262.jpg b/images/Causalpad-4-1024x262.jpg new file mode 100644 index 0000000..6f4086a Binary files /dev/null and b/images/Causalpad-4-1024x262.jpg differ diff --git a/images/Cnn_layer-1.jpg b/images/Cnn_layer-1.jpg new file mode 100644 index 0000000..dcc3b25 Binary files /dev/null and b/images/Cnn_layer-1.jpg differ diff --git a/images/ComplexNeuralNetwork.png b/images/ComplexNeuralNetwork.png new file mode 100644 index 0000000..9d695f7 Binary files /dev/null and b/images/ComplexNeuralNetwork.png differ diff --git a/images/Diagram-1-1024x590.png b/images/Diagram-1-1024x590.png new file mode 100644 index 0000000..0ab6a99 Binary files /dev/null and b/images/Diagram-1-1024x590.png differ diff --git a/images/Diagram-10-1.png b/images/Diagram-10-1.png new file mode 100644 index 0000000..da58233 Binary files /dev/null and b/images/Diagram-10-1.png differ diff --git a/images/Diagram-11.png b/images/Diagram-11.png new file mode 100644 index 0000000..7e7c7f5 Binary files /dev/null and b/images/Diagram-11.png differ diff --git a/images/Diagram-12.png b/images/Diagram-12.png new file mode 100644 index 0000000..afd0d5a Binary files /dev/null and b/images/Diagram-12.png differ diff --git a/images/Diagram-13-771x1024.png b/images/Diagram-13-771x1024.png new file mode 100644 index 0000000..d2b5f5e Binary files /dev/null and b/images/Diagram-13-771x1024.png differ diff --git a/images/Diagram-14-1.png b/images/Diagram-14-1.png new file mode 100644 index 0000000..ed6177d Binary files /dev/null and b/images/Diagram-14-1.png differ diff --git a/images/Diagram-15.png b/images/Diagram-15.png new file mode 100644 index 0000000..d4a09e2 Binary files /dev/null and b/images/Diagram-15.png differ diff --git a/images/Diagram-17-627x1024.png b/images/Diagram-17-627x1024.png new file mode 100644 index 0000000..1cf39d7 Binary files /dev/null and b/images/Diagram-17-627x1024.png differ diff --git a/images/Diagram-18.png b/images/Diagram-18.png new file mode 100644 index 0000000..b43cc63 Binary files /dev/null and b/images/Diagram-18.png differ diff --git a/images/Diagram-19.png b/images/Diagram-19.png new file mode 100644 index 0000000..35977c0 Binary files /dev/null and b/images/Diagram-19.png differ diff --git a/images/Diagram-20-1024x282.png b/images/Diagram-20-1024x282.png new file mode 100644 index 0000000..eea88ca Binary files /dev/null and b/images/Diagram-20-1024x282.png differ diff --git a/images/Diagram-21.png b/images/Diagram-21.png new file mode 100644 index 0000000..aa20caa Binary files /dev/null and b/images/Diagram-21.png differ diff --git a/images/Diagram-22.png b/images/Diagram-22.png new file mode 100644 index 0000000..22daf6e Binary files /dev/null and b/images/Diagram-22.png differ diff --git a/images/Diagram-23.png b/images/Diagram-23.png new file mode 100644 index 0000000..213f82e Binary files /dev/null and b/images/Diagram-23.png differ diff --git a/images/Diagram-3.png b/images/Diagram-3.png new file mode 100644 index 0000000..b855477 Binary files /dev/null and b/images/Diagram-3.png differ diff --git a/images/Diagram-32-1-1024x991.png b/images/Diagram-32-1-1024x991.png new file mode 100644 index 0000000..dcdc85d Binary files /dev/null and b/images/Diagram-32-1-1024x991.png differ diff --git a/images/Diagram-32-1.png b/images/Diagram-32-1.png new file mode 100644 index 0000000..bb773e9 Binary files /dev/null and b/images/Diagram-32-1.png differ diff --git a/images/Diagram-33-1024x352.png b/images/Diagram-33-1024x352.png 
new file mode 100644 index 0000000..84f15de Binary files /dev/null and b/images/Diagram-33-1024x352.png differ diff --git a/images/Diagram-34-1024x353.png b/images/Diagram-34-1024x353.png new file mode 100644 index 0000000..fb14d44 Binary files /dev/null and b/images/Diagram-34-1024x353.png differ diff --git a/images/Diagram-36-1024x353.png b/images/Diagram-36-1024x353.png new file mode 100644 index 0000000..bec817b Binary files /dev/null and b/images/Diagram-36-1024x353.png differ diff --git a/images/Diagram-37.png b/images/Diagram-37.png new file mode 100644 index 0000000..026c0e3 Binary files /dev/null and b/images/Diagram-37.png differ diff --git a/images/Diagram-38-1024x505.png b/images/Diagram-38-1024x505.png new file mode 100644 index 0000000..899051c Binary files /dev/null and b/images/Diagram-38-1024x505.png differ diff --git a/images/Diagram-39-1024x436.png b/images/Diagram-39-1024x436.png new file mode 100644 index 0000000..f20ada9 Binary files /dev/null and b/images/Diagram-39-1024x436.png differ diff --git a/images/Diagram-4-1.png b/images/Diagram-4-1.png new file mode 100644 index 0000000..c7f9c92 Binary files /dev/null and b/images/Diagram-4-1.png differ diff --git a/images/Diagram-40-1024x385.png b/images/Diagram-40-1024x385.png new file mode 100644 index 0000000..080f561 Binary files /dev/null and b/images/Diagram-40-1024x385.png differ diff --git a/images/Diagram-41-1024x334.png b/images/Diagram-41-1024x334.png new file mode 100644 index 0000000..97b2ce0 Binary files /dev/null and b/images/Diagram-41-1024x334.png differ diff --git a/images/Diagram-42-1024x319.png b/images/Diagram-42-1024x319.png new file mode 100644 index 0000000..f2136f1 Binary files /dev/null and b/images/Diagram-42-1024x319.png differ diff --git a/images/Diagram-43-1024x342.png b/images/Diagram-43-1024x342.png new file mode 100644 index 0000000..c570426 Binary files /dev/null and b/images/Diagram-43-1024x342.png differ diff --git a/images/Diagram-44-1024x625.png b/images/Diagram-44-1024x625.png new file mode 100644 index 0000000..1159d3a Binary files /dev/null and b/images/Diagram-44-1024x625.png differ diff --git a/images/Diagram-5.png b/images/Diagram-5.png new file mode 100644 index 0000000..45f29d3 Binary files /dev/null and b/images/Diagram-5.png differ diff --git a/images/Diagram-6.png b/images/Diagram-6.png new file mode 100644 index 0000000..8a2f0db Binary files /dev/null and b/images/Diagram-6.png differ diff --git a/images/Diagram-7.png b/images/Diagram-7.png new file mode 100644 index 0000000..bbb8870 Binary files /dev/null and b/images/Diagram-7.png differ diff --git a/images/Diagram-8-1.png b/images/Diagram-8-1.png new file mode 100644 index 0000000..bd2ea21 Binary files /dev/null and b/images/Diagram-8-1.png differ diff --git a/images/Diagram-9.png b/images/Diagram-9.png new file mode 100644 index 0000000..1d512f1 Binary files /dev/null and b/images/Diagram-9.png differ diff --git a/images/Dropout-neuron.png b/images/Dropout-neuron.png new file mode 100644 index 0000000..8ad5568 Binary files /dev/null and b/images/Dropout-neuron.png differ diff --git a/images/EvaluationScenario-1024x366.png b/images/EvaluationScenario-1024x366.png new file mode 100644 index 0000000..a1e4b42 Binary files /dev/null and b/images/EvaluationScenario-1024x366.png differ diff --git a/images/FeatureVectorFeatureSpace.png b/images/FeatureVectorFeatureSpace.png new file mode 100644 index 0000000..6663ba0 Binary files /dev/null and b/images/FeatureVectorFeatureSpace.png differ diff --git a/images/GAN-1024x431.jpg 
b/images/GAN-1024x431.jpg new file mode 100644 index 0000000..c646aae Binary files /dev/null and b/images/GAN-1024x431.jpg differ diff --git a/images/GAN.jpg b/images/GAN.jpg new file mode 100644 index 0000000..103beee Binary files /dev/null and b/images/GAN.jpg differ diff --git a/images/GLS.png b/images/GLS.png new file mode 100644 index 0000000..c08241e Binary files /dev/null and b/images/GLS.png differ diff --git a/images/Global-Average-Pooling-2.png b/images/Global-Average-Pooling-2.png new file mode 100644 index 0000000..462b222 Binary files /dev/null and b/images/Global-Average-Pooling-2.png differ diff --git a/images/Global-Average-Pooling-3.png b/images/Global-Average-Pooling-3.png new file mode 100644 index 0000000..2c5dc72 Binary files /dev/null and b/images/Global-Average-Pooling-3.png differ diff --git a/images/Global-Max-Pooling-1.png b/images/Global-Max-Pooling-1.png new file mode 100644 index 0000000..872422a Binary files /dev/null and b/images/Global-Max-Pooling-1.png differ diff --git a/images/Global-Max-Pooling-3.png b/images/Global-Max-Pooling-3.png new file mode 100644 index 0000000..db3bdc0 Binary files /dev/null and b/images/Global-Max-Pooling-3.png differ diff --git a/images/High-level-training-process-1024x973.jpg b/images/High-level-training-process-1024x973.jpg new file mode 100644 index 0000000..75d8634 Binary files /dev/null and b/images/High-level-training-process-1024x973.jpg differ diff --git a/images/KTraintest.png b/images/KTraintest.png new file mode 100644 index 0000000..9c4d872 Binary files /dev/null and b/images/KTraintest.png differ diff --git a/images/Kernel_Machine-1.png b/images/Kernel_Machine-1.png new file mode 100644 index 0000000..f87b947 Binary files /dev/null and b/images/Kernel_Machine-1.png differ diff --git a/images/Kernel_Machine.png b/images/Kernel_Machine.png new file mode 100644 index 0000000..f87b947 Binary files /dev/null and b/images/Kernel_Machine.png differ diff --git a/images/LSTM-1-1024x657.png b/images/LSTM-1-1024x657.png new file mode 100644 index 0000000..b440628 Binary files /dev/null and b/images/LSTM-1-1024x657.png differ diff --git a/images/LSTM-1024x657.png b/images/LSTM-1024x657.png new file mode 100644 index 0000000..e71c9f8 Binary files /dev/null and b/images/LSTM-1024x657.png differ diff --git a/images/LSTM-2-1024x657.png b/images/LSTM-2-1024x657.png new file mode 100644 index 0000000..0ab7b7b Binary files /dev/null and b/images/LSTM-2-1024x657.png differ diff --git a/images/LSTM-3-1024x657.png b/images/LSTM-3-1024x657.png new file mode 100644 index 0000000..9cc36ef Binary files /dev/null and b/images/LSTM-3-1024x657.png differ diff --git a/images/LSTM-4-1024x657.png b/images/LSTM-4-1024x657.png new file mode 100644 index 0000000..cccc0fa Binary files /dev/null and b/images/LSTM-4-1024x657.png differ diff --git a/images/LSTM-5.png b/images/LSTM-5.png new file mode 100644 index 0000000..24b2357 Binary files /dev/null and b/images/LSTM-5.png differ diff --git a/images/ML-supervised.png b/images/ML-supervised.png new file mode 100644 index 0000000..301ec75 Binary files /dev/null and b/images/ML-supervised.png differ diff --git a/images/ML-unsupervised.png b/images/ML-unsupervised.png new file mode 100644 index 0000000..322937a Binary files /dev/null and b/images/ML-unsupervised.png differ diff --git a/images/Max-Pooling-1.png b/images/Max-Pooling-1.png new file mode 100644 index 0000000..a15c66d Binary files /dev/null and b/images/Max-Pooling-1.png differ diff --git a/images/Max-Pooling-2.png b/images/Max-Pooling-2.png 
new file mode 100644 index 0000000..4cc75b1 Binary files /dev/null and b/images/Max-Pooling-2.png differ diff --git a/images/Max-Pooling.png b/images/Max-Pooling.png new file mode 100644 index 0000000..2a84025 Binary files /dev/null and b/images/Max-Pooling.png differ diff --git a/images/MaximumCounterexample.png b/images/MaximumCounterexample.png new file mode 100644 index 0000000..c9d21a1 Binary files /dev/null and b/images/MaximumCounterexample.png differ diff --git a/images/MultivariateNormal.png b/images/MultivariateNormal.png new file mode 100644 index 0000000..e57714a Binary files /dev/null and b/images/MultivariateNormal.png differ diff --git a/images/Normal-neuron.png b/images/Normal-neuron.png new file mode 100644 index 0000000..7163223 Binary files /dev/null and b/images/Normal-neuron.png differ diff --git a/images/OPTICS.svg_-1024x700.png b/images/OPTICS.svg_-1024x700.png new file mode 100644 index 0000000..6ed623e Binary files /dev/null and b/images/OPTICS.svg_-1024x700.png differ diff --git a/images/Perceptron-1024x794.png b/images/Perceptron-1024x794.png new file mode 100644 index 0000000..12973ea Binary files /dev/null and b/images/Perceptron-1024x794.png differ diff --git a/images/Perceptron_with_bias-1024x907.png b/images/Perceptron_with_bias-1024x907.png new file mode 100644 index 0000000..4938c34 Binary files /dev/null and b/images/Perceptron_with_bias-1024x907.png differ diff --git a/images/Saddle_Point_between_maxima.png b/images/Saddle_Point_between_maxima.png new file mode 100644 index 0000000..29fcfd8 Binary files /dev/null and b/images/Saddle_Point_between_maxima.png differ diff --git a/images/Saddle_point.png b/images/Saddle_point.png new file mode 100644 index 0000000..00930ae Binary files /dev/null and b/images/Saddle_point.png differ diff --git a/images/Slaiency-1.png b/images/Slaiency-1.png new file mode 100644 index 0000000..23e391e Binary files /dev/null and b/images/Slaiency-1.png differ diff --git a/images/Slaiency.png b/images/Slaiency.png new file mode 100644 index 0000000..0b25c9b Binary files /dev/null and b/images/Slaiency.png differ diff --git a/images/StyleGAN.drawio-1.png b/images/StyleGAN.drawio-1.png new file mode 100644 index 0000000..3366702 Binary files /dev/null and b/images/StyleGAN.drawio-1.png differ diff --git a/images/StyleGAN.drawio-925x1024.png b/images/StyleGAN.drawio-925x1024.png new file mode 100644 index 0000000..ec727b0 Binary files /dev/null and b/images/StyleGAN.drawio-925x1024.png differ diff --git a/images/Svm_separating_hyperplanes_SVG.svg_-1024x886.png b/images/Svm_separating_hyperplanes_SVG.svg_-1024x886.png new file mode 100644 index 0000000..d787720 Binary files /dev/null and b/images/Svm_separating_hyperplanes_SVG.svg_-1024x886.png differ diff --git a/images/T-SNE_visualisation_of_word_embeddings_generated_using_19th_century_literature-1024x695.png b/images/T-SNE_visualisation_of_word_embeddings_generated_using_19th_century_literature-1024x695.png new file mode 100644 index 0000000..6d3ee9b Binary files /dev/null and b/images/T-SNE_visualisation_of_word_embeddings_generated_using_19th_century_literature-1024x695.png differ diff --git a/images/Traintest.png b/images/Traintest.png new file mode 100644 index 0000000..9e9e58e Binary files /dev/null and b/images/Traintest.png differ diff --git a/images/UnderOver.png b/images/UnderOver.png new file mode 100644 index 0000000..c6eae93 Binary files /dev/null and b/images/UnderOver.png differ diff --git a/images/Unfold_through_time.png b/images/Unfold_through_time.png new file 
mode 100644 index 0000000..2bd5975 Binary files /dev/null and b/images/Unfold_through_time.png differ diff --git a/images/Uniform_Distribution_PDF_SVG.svg_-1024x732.png b/images/Uniform_Distribution_PDF_SVG.svg_-1024x732.png new file mode 100644 index 0000000..6c976bd Binary files /dev/null and b/images/Uniform_Distribution_PDF_SVG.svg_-1024x732.png differ diff --git a/images/Which-regularizer-do-I-need-2-794x1024.png b/images/Which-regularizer-do-I-need-2-794x1024.png new file mode 100644 index 0000000..f590658 Binary files /dev/null and b/images/Which-regularizer-do-I-need-2-794x1024.png differ diff --git a/images/Zero.drawio.png b/images/Zero.drawio.png new file mode 100644 index 0000000..4daba83 Binary files /dev/null and b/images/Zero.drawio.png differ diff --git a/images/acc-1.png b/images/acc-1.png new file mode 100644 index 0000000..5a0c940 Binary files /dev/null and b/images/acc-1.png differ diff --git a/images/acc-2-1024x528.png b/images/acc-2-1024x528.png new file mode 100644 index 0000000..c659382 Binary files /dev/null and b/images/acc-2-1024x528.png differ diff --git a/images/acc-2.png b/images/acc-2.png new file mode 100644 index 0000000..c5d5332 Binary files /dev/null and b/images/acc-2.png differ diff --git a/images/acc-3-1024x537.png b/images/acc-3-1024x537.png new file mode 100644 index 0000000..9f99170 Binary files /dev/null and b/images/acc-3-1024x537.png differ diff --git a/images/acc-4-1024x537.png b/images/acc-4-1024x537.png new file mode 100644 index 0000000..8611213 Binary files /dev/null and b/images/acc-4-1024x537.png differ diff --git a/images/acc.png b/images/acc.png new file mode 100644 index 0000000..49a8c41 Binary files /dev/null and b/images/acc.png differ diff --git a/images/accuracy.png b/images/accuracy.png new file mode 100644 index 0000000..a371676 Binary files /dev/null and b/images/accuracy.png differ diff --git a/images/act-1.png b/images/act-1.png new file mode 100644 index 0000000..44c12c8 Binary files /dev/null and b/images/act-1.png differ diff --git a/images/act.png b/images/act.png new file mode 100644 index 0000000..b48d7e6 Binary files /dev/null and b/images/act.png differ diff --git a/images/action-artificial-intelligence-device-595804-1024x683.jpg b/images/action-artificial-intelligence-device-595804-1024x683.jpg new file mode 100644 index 0000000..f80a352 Binary files /dev/null and b/images/action-artificial-intelligence-device-595804-1024x683.jpg differ diff --git a/images/ad-1024x521.png b/images/ad-1024x521.png new file mode 100644 index 0000000..5436b6e Binary files /dev/null and b/images/ad-1024x521.png differ diff --git a/images/adult-adventure-backpack-287240-1024x767.jpg b/images/adult-adventure-backpack-287240-1024x767.jpg new file mode 100644 index 0000000..2450ede Binary files /dev/null and b/images/adult-adventure-backpack-287240-1024x767.jpg differ diff --git a/images/af_ups.png b/images/af_ups.png new file mode 100644 index 0000000..d54d269 Binary files /dev/null and b/images/af_ups.png differ diff --git a/images/afbeelding-3.png b/images/afbeelding-3.png new file mode 100644 index 0000000..06693ce Binary files /dev/null and b/images/afbeelding-3.png differ diff --git a/images/afbeelding-4.png b/images/afbeelding-4.png new file mode 100644 index 0000000..c87b429 Binary files /dev/null and b/images/afbeelding-4.png differ diff --git a/images/afbeelding-5.png b/images/afbeelding-5.png new file mode 100644 index 0000000..0a2bfab Binary files /dev/null and b/images/afbeelding-5.png differ diff --git a/images/afbeelding-6.png 
b/images/afbeelding-6.png new file mode 100644 index 0000000..88e3908 Binary files /dev/null and b/images/afbeelding-6.png differ diff --git a/images/afbeelding.png b/images/afbeelding.png new file mode 100644 index 0000000..96286d2 Binary files /dev/null and b/images/afbeelding.png differ diff --git a/images/afp_cluster.png b/images/afp_cluster.png new file mode 100644 index 0000000..7fa8c9b Binary files /dev/null and b/images/afp_cluster.png differ diff --git a/images/afp_clustered.png b/images/afp_clustered.png new file mode 100644 index 0000000..a1f34fc Binary files /dev/null and b/images/afp_clustered.png differ diff --git a/images/afterupsampling-1024x535.png b/images/afterupsampling-1024x535.png new file mode 100644 index 0000000..adab250 Binary files /dev/null and b/images/afterupsampling-1024x535.png differ diff --git a/images/airplane-1.png b/images/airplane-1.png new file mode 100644 index 0000000..df89efc Binary files /dev/null and b/images/airplane-1.png differ diff --git a/images/airplane-2.png b/images/airplane-2.png new file mode 100644 index 0000000..d4f135a Binary files /dev/null and b/images/airplane-2.png differ diff --git a/images/airplane.png b/images/airplane.png new file mode 100644 index 0000000..f08ae48 Binary files /dev/null and b/images/airplane.png differ diff --git a/images/art-black-and-white-blur-724994-1024x736.jpg b/images/art-black-and-white-blur-724994-1024x736.jpg new file mode 100644 index 0000000..499b7b2 Binary files /dev/null and b/images/art-black-and-white-blur-724994-1024x736.jpg differ diff --git a/images/art-blue-skies-clouds-335907-1024x686.jpg b/images/art-blue-skies-clouds-335907-1024x686.jpg new file mode 100644 index 0000000..2694fc0 Binary files /dev/null and b/images/art-blue-skies-clouds-335907-1024x686.jpg differ diff --git a/images/assorted-clothes-996329-1-1024x683.jpg b/images/assorted-clothes-996329-1-1024x683.jpg new file mode 100644 index 0000000..3d7106c Binary files /dev/null and b/images/assorted-clothes-996329-1-1024x683.jpg differ diff --git a/images/automobile-1.png b/images/automobile-1.png new file mode 100644 index 0000000..b702793 Binary files /dev/null and b/images/automobile-1.png differ diff --git a/images/automobile.png b/images/automobile.png new file mode 100644 index 0000000..e781be5 Binary files /dev/null and b/images/automobile.png differ diff --git a/images/bart-1024x449.jpg b/images/bart-1024x449.jpg new file mode 100644 index 0000000..8e80ea0 Binary files /dev/null and b/images/bart-1024x449.jpg differ diff --git a/images/bce-1-1024x421.png b/images/bce-1-1024x421.png new file mode 100644 index 0000000..1e8aaaa Binary files /dev/null and b/images/bce-1-1024x421.png differ diff --git a/images/bce-1-300x123.png b/images/bce-1-300x123.png new file mode 100644 index 0000000..5e48862 Binary files /dev/null and b/images/bce-1-300x123.png differ diff --git a/images/bce-1024x469.png b/images/bce-1024x469.png new file mode 100644 index 0000000..fc69928 Binary files /dev/null and b/images/bce-1024x469.png differ diff --git a/images/bce_t0-1024x459.png b/images/bce_t0-1024x459.png new file mode 100644 index 0000000..266ae1b Binary files /dev/null and b/images/bce_t0-1024x459.png differ diff --git a/images/beautiful-facial-expression-female-834949-1024x683.jpg b/images/beautiful-facial-expression-female-834949-1024x683.jpg new file mode 100644 index 0000000..3997170 Binary files /dev/null and b/images/beautiful-facial-expression-female-834949-1024x683.jpg differ diff --git a/images/bidirectional-1024x414.png 
b/images/bidirectional-1024x414.png new file mode 100644 index 0000000..be5f931 Binary files /dev/null and b/images/bidirectional-1024x414.png differ diff --git a/images/bigdl-logo-bw.jpg b/images/bigdl-logo-bw.jpg new file mode 100644 index 0000000..5054512 Binary files /dev/null and b/images/bigdl-logo-bw.jpg differ diff --git a/images/bin.png b/images/bin.png new file mode 100644 index 0000000..3f30bef Binary files /dev/null and b/images/bin.png differ diff --git a/images/bird-1.png b/images/bird-1.png new file mode 100644 index 0000000..c0f730d Binary files /dev/null and b/images/bird-1.png differ diff --git a/images/bird.png b/images/bird.png new file mode 100644 index 0000000..e93e30d Binary files /dev/null and b/images/bird.png differ diff --git a/images/black-background-brain-close-up-818563-1024x576.jpg b/images/black-background-brain-close-up-818563-1024x576.jpg new file mode 100644 index 0000000..7c4a6d3 Binary files /dev/null and b/images/black-background-brain-close-up-818563-1024x576.jpg differ diff --git a/images/block1_conv1_11.jpg b/images/block1_conv1_11.jpg new file mode 100644 index 0000000..779eb8b Binary files /dev/null and b/images/block1_conv1_11.jpg differ diff --git a/images/block1_conv1_12.jpg b/images/block1_conv1_12.jpg new file mode 100644 index 0000000..07500f4 Binary files /dev/null and b/images/block1_conv1_12.jpg differ diff --git a/images/block1_conv1_15.jpg b/images/block1_conv1_15.jpg new file mode 100644 index 0000000..120c476 Binary files /dev/null and b/images/block1_conv1_15.jpg differ diff --git a/images/block1_conv1_2.jpg b/images/block1_conv1_2.jpg new file mode 100644 index 0000000..dc60380 Binary files /dev/null and b/images/block1_conv1_2.jpg differ diff --git a/images/block1_conv1_25.jpg b/images/block1_conv1_25.jpg new file mode 100644 index 0000000..43eed7a Binary files /dev/null and b/images/block1_conv1_25.jpg differ diff --git a/images/block1_conv1_5.jpg b/images/block1_conv1_5.jpg new file mode 100644 index 0000000..b0b7cf8 Binary files /dev/null and b/images/block1_conv1_5.jpg differ diff --git a/images/block2_conv1_100.jpg b/images/block2_conv1_100.jpg new file mode 100644 index 0000000..e74c197 Binary files /dev/null and b/images/block2_conv1_100.jpg differ diff --git a/images/block2_conv1_26.jpg b/images/block2_conv1_26.jpg new file mode 100644 index 0000000..0f26ed0 Binary files /dev/null and b/images/block2_conv1_26.jpg differ diff --git a/images/block2_conv1_33.jpg b/images/block2_conv1_33.jpg new file mode 100644 index 0000000..b386d7b Binary files /dev/null and b/images/block2_conv1_33.jpg differ diff --git a/images/block2_conv1_39.jpg b/images/block2_conv1_39.jpg new file mode 100644 index 0000000..76fe8db Binary files /dev/null and b/images/block2_conv1_39.jpg differ diff --git a/images/block2_conv1_84.jpg b/images/block2_conv1_84.jpg new file mode 100644 index 0000000..3fc6789 Binary files /dev/null and b/images/block2_conv1_84.jpg differ diff --git a/images/block2_conv1_97.jpg b/images/block2_conv1_97.jpg new file mode 100644 index 0000000..26c6240 Binary files /dev/null and b/images/block2_conv1_97.jpg differ diff --git a/images/block3_conv2_123.jpg b/images/block3_conv2_123.jpg new file mode 100644 index 0000000..c0a2b18 Binary files /dev/null and b/images/block3_conv2_123.jpg differ diff --git a/images/block3_conv2_162.jpg b/images/block3_conv2_162.jpg new file mode 100644 index 0000000..e6cbe59 Binary files /dev/null and b/images/block3_conv2_162.jpg differ diff --git a/images/block3_conv2_17.jpg 
b/images/block3_conv2_17.jpg new file mode 100644 index 0000000..ee6d692 Binary files /dev/null and b/images/block3_conv2_17.jpg differ diff --git a/images/block3_conv2_185.jpg b/images/block3_conv2_185.jpg new file mode 100644 index 0000000..f989695 Binary files /dev/null and b/images/block3_conv2_185.jpg differ diff --git a/images/block3_conv2_21.jpg b/images/block3_conv2_21.jpg new file mode 100644 index 0000000..4ca380e Binary files /dev/null and b/images/block3_conv2_21.jpg differ diff --git a/images/block3_conv2_3.jpg b/images/block3_conv2_3.jpg new file mode 100644 index 0000000..a124415 Binary files /dev/null and b/images/block3_conv2_3.jpg differ diff --git a/images/block4_conv1_100.jpg b/images/block4_conv1_100.jpg new file mode 100644 index 0000000..2fb9f5a Binary files /dev/null and b/images/block4_conv1_100.jpg differ diff --git a/images/block4_conv1_294.jpg b/images/block4_conv1_294.jpg new file mode 100644 index 0000000..98f30e6 Binary files /dev/null and b/images/block4_conv1_294.jpg differ diff --git a/images/block4_conv1_461.jpg b/images/block4_conv1_461.jpg new file mode 100644 index 0000000..845fc3a Binary files /dev/null and b/images/block4_conv1_461.jpg differ diff --git a/images/block4_conv1_69.jpg b/images/block4_conv1_69.jpg new file mode 100644 index 0000000..eb66ce6 Binary files /dev/null and b/images/block4_conv1_69.jpg differ diff --git a/images/block4_conv1_78.jpg b/images/block4_conv1_78.jpg new file mode 100644 index 0000000..adfbadf Binary files /dev/null and b/images/block4_conv1_78.jpg differ diff --git a/images/block4_conv1_97.jpg b/images/block4_conv1_97.jpg new file mode 100644 index 0000000..9cba5ec Binary files /dev/null and b/images/block4_conv1_97.jpg differ diff --git a/images/block5_conv2_136.jpg b/images/block5_conv2_136.jpg new file mode 100644 index 0000000..9bc06c4 Binary files /dev/null and b/images/block5_conv2_136.jpg differ diff --git a/images/block5_conv2_222.jpg b/images/block5_conv2_222.jpg new file mode 100644 index 0000000..85eeee6 Binary files /dev/null and b/images/block5_conv2_222.jpg differ diff --git a/images/block5_conv2_247.jpg b/images/block5_conv2_247.jpg new file mode 100644 index 0000000..65addf7 Binary files /dev/null and b/images/block5_conv2_247.jpg differ diff --git a/images/block5_conv2_479.jpg b/images/block5_conv2_479.jpg new file mode 100644 index 0000000..0d0bab3 Binary files /dev/null and b/images/block5_conv2_479.jpg differ diff --git a/images/block5_conv2_480.jpg b/images/block5_conv2_480.jpg new file mode 100644 index 0000000..861e9d2 Binary files /dev/null and b/images/block5_conv2_480.jpg differ diff --git a/images/block5_conv2_53.jpg b/images/block5_conv2_53.jpg new file mode 100644 index 0000000..6886312 Binary files /dev/null and b/images/block5_conv2_53.jpg differ diff --git a/images/bookshelves-chair-desk-1546912.jpg b/images/bookshelves-chair-desk-1546912.jpg new file mode 100644 index 0000000..3cd710d Binary files /dev/null and b/images/bookshelves-chair-desk-1546912.jpg differ diff --git a/images/boston_boxplot.png b/images/boston_boxplot.png new file mode 100644 index 0000000..df01138 Binary files /dev/null and b/images/boston_boxplot.png differ diff --git a/images/boston_boxplot_test-1.png b/images/boston_boxplot_test-1.png new file mode 100644 index 0000000..973bb2f Binary files /dev/null and b/images/boston_boxplot_test-1.png differ diff --git a/images/boston_boxplot_train-2.png b/images/boston_boxplot_train-2.png new file mode 100644 index 0000000..7d4cba8 Binary files /dev/null and 
b/images/boston_boxplot_train-2.png differ diff --git a/images/boundary.png b/images/boundary.png new file mode 100644 index 0000000..46908a2 Binary files /dev/null and b/images/boundary.png differ diff --git a/images/boxplot.jpg b/images/boxplot.jpg new file mode 100644 index 0000000..67e7fcf Binary files /dev/null and b/images/boxplot.jpg differ diff --git a/images/cabinet-data-data-center-325229-1024x358.jpg b/images/cabinet-data-data-center-325229-1024x358.jpg new file mode 100644 index 0000000..58ad4a3 Binary files /dev/null and b/images/cabinet-data-data-center-325229-1024x358.jpg differ diff --git a/images/cardinality-2.png b/images/cardinality-2.png new file mode 100644 index 0000000..db15625 Binary files /dev/null and b/images/cardinality-2.png differ diff --git a/images/cat-1.png b/images/cat-1.png new file mode 100644 index 0000000..6aa607a Binary files /dev/null and b/images/cat-1.png differ diff --git a/images/cat-2.png b/images/cat-2.png new file mode 100644 index 0000000..187626b Binary files /dev/null and b/images/cat-2.png differ diff --git a/images/cat.png b/images/cat.png new file mode 100644 index 0000000..88b2908 Binary files /dev/null and b/images/cat.png differ diff --git a/images/catmask.png b/images/catmask.png new file mode 100644 index 0000000..34554e4 Binary files /dev/null and b/images/catmask.png differ diff --git a/images/causal-1024x445.png b/images/causal-1024x445.png new file mode 100644 index 0000000..ddacf5d Binary files /dev/null and b/images/causal-1024x445.png differ diff --git a/images/cf_matrix.png b/images/cf_matrix.png new file mode 100644 index 0000000..0901907 Binary files /dev/null and b/images/cf_matrix.png differ diff --git a/images/chennai_oli_2018151.jpg b/images/chennai_oli_2018151.jpg new file mode 100644 index 0000000..d1f222d Binary files /dev/null and b/images/chennai_oli_2018151.jpg differ diff --git a/images/chennai_oli_2019170.jpg b/images/chennai_oli_2019170.jpg new file mode 100644 index 0000000..2a77129 Binary files /dev/null and b/images/chennai_oli_2019170.jpg differ diff --git a/images/cifar10_images.png b/images/cifar10_images.png new file mode 100644 index 0000000..9fd673e Binary files /dev/null and b/images/cifar10_images.png differ diff --git a/images/cifar10_visualized.png b/images/cifar10_visualized.png new file mode 100644 index 0000000..0cd7e77 Binary files /dev/null and b/images/cifar10_visualized.png differ diff --git a/images/classes-1.png b/images/classes-1.png new file mode 100644 index 0000000..2776c26 Binary files /dev/null and b/images/classes-1.png differ diff --git a/images/classes.png b/images/classes.png new file mode 100644 index 0000000..2776c26 Binary files /dev/null and b/images/classes.png differ diff --git a/images/classic-autoencoder.png b/images/classic-autoencoder.png new file mode 100644 index 0000000..313d234 Binary files /dev/null and b/images/classic-autoencoder.png differ diff --git a/images/classic_autoencoder-1024x853.png b/images/classic_autoencoder-1024x853.png new file mode 100644 index 0000000..987f187 Binary files /dev/null and b/images/classic_autoencoder-1024x853.png differ diff --git a/images/classic_cropped_accuracy.png b/images/classic_cropped_accuracy.png new file mode 100644 index 0000000..7c4e14f Binary files /dev/null and b/images/classic_cropped_accuracy.png differ diff --git a/images/classic_cropped_loss.png b/images/classic_cropped_loss.png new file mode 100644 index 0000000..3b86d45 Binary files /dev/null and b/images/classic_cropped_loss.png differ diff --git 
a/images/classic_drawing-1024x853.png b/images/classic_drawing-1024x853.png new file mode 100644 index 0000000..6d9bed8 Binary files /dev/null and b/images/classic_drawing-1024x853.png differ diff --git a/images/clr.png b/images/clr.png new file mode 100644 index 0000000..5af0ca2 Binary files /dev/null and b/images/clr.png differ diff --git a/images/clr_decay.png b/images/clr_decay.png new file mode 100644 index 0000000..cb1e073 Binary files /dev/null and b/images/clr_decay.png differ diff --git a/images/clustered.png b/images/clustered.png new file mode 100644 index 0000000..44fd0ec Binary files /dev/null and b/images/clustered.png differ diff --git a/images/clusters.png b/images/clusters.png new file mode 100644 index 0000000..5b4e979 Binary files /dev/null and b/images/clusters.png differ diff --git a/images/clusters_2-1.png b/images/clusters_2-1.png new file mode 100644 index 0000000..fbc7489 Binary files /dev/null and b/images/clusters_2-1.png differ diff --git a/images/clusters_mean.png b/images/clusters_mean.png new file mode 100644 index 0000000..0bde54c Binary files /dev/null and b/images/clusters_mean.png differ diff --git a/images/cnn_layer.jpg b/images/cnn_layer.jpg new file mode 100644 index 0000000..190f3d5 Binary files /dev/null and b/images/cnn_layer.jpg differ diff --git a/images/combined-1.png b/images/combined-1.png new file mode 100644 index 0000000..15ca7bf Binary files /dev/null and b/images/combined-1.png differ diff --git a/images/combined.png b/images/combined.png new file mode 100644 index 0000000..72a2dbd Binary files /dev/null and b/images/combined.png differ diff --git a/images/comp.png b/images/comp.png new file mode 100644 index 0000000..9223396 Binary files /dev/null and b/images/comp.png differ diff --git a/images/comparison.png b/images/comparison.png new file mode 100644 index 0000000..8bf6911 Binary files /dev/null and b/images/comparison.png differ diff --git a/images/computerized_filter-2.jpg b/images/computerized_filter-2.jpg new file mode 100644 index 0000000..cf6aba5 Binary files /dev/null and b/images/computerized_filter-2.jpg differ diff --git a/images/computerized_image_2.jpg b/images/computerized_image_2.jpg new file mode 100644 index 0000000..3bef000 Binary files /dev/null and b/images/computerized_image_2.jpg differ diff --git a/images/conf_matrix.png b/images/conf_matrix.png new file mode 100644 index 0000000..c27a660 Binary files /dev/null and b/images/conf_matrix.png differ diff --git a/images/confused-digital-nomad-electronics-874242-1024x682.jpg b/images/confused-digital-nomad-electronics-874242-1024x682.jpg new file mode 100644 index 0000000..eb30ce7 Binary files /dev/null and b/images/confused-digital-nomad-electronics-874242-1024x682.jpg differ diff --git a/images/connection-data-desk-1181675-1024x683.jpg b/images/connection-data-desk-1181675-1024x683.jpg new file mode 100644 index 0000000..490e4b0 Binary files /dev/null and b/images/connection-data-desk-1181675-1024x683.jpg differ diff --git a/images/constant23.png b/images/constant23.png new file mode 100644 index 0000000..0d8c61f Binary files /dev/null and b/images/constant23.png differ diff --git a/images/constantpad.jpg b/images/constantpad.jpg new file mode 100644 index 0000000..791df0c Binary files /dev/null and b/images/constantpad.jpg differ diff --git a/images/contracting-222x300.png b/images/contracting-222x300.png new file mode 100644 index 0000000..18d78d0 Binary files /dev/null and b/images/contracting-222x300.png differ diff --git a/images/conv-new.png 
b/images/conv-new.png new file mode 100644 index 0000000..c5d5b8c Binary files /dev/null and b/images/conv-new.png differ diff --git a/images/conv-new2.png b/images/conv-new2.png new file mode 100644 index 0000000..cf17a10 Binary files /dev/null and b/images/conv-new2.png differ diff --git a/images/conv.png b/images/conv.png new file mode 100644 index 0000000..e934e07 Binary files /dev/null and b/images/conv.png differ diff --git a/images/conv2d_1-1024x577.png b/images/conv2d_1-1024x577.png new file mode 100644 index 0000000..de3f921 Binary files /dev/null and b/images/conv2d_1-1024x577.png differ diff --git a/images/conv2d_2-1024x577.png b/images/conv2d_2-1024x577.png new file mode 100644 index 0000000..3919d1a Binary files /dev/null and b/images/conv2d_2-1024x577.png differ diff --git a/images/conv_matrix-1.jpg b/images/conv_matrix-1.jpg new file mode 100644 index 0000000..74af79b Binary files /dev/null and b/images/conv_matrix-1.jpg differ diff --git a/images/convnet_fig.png b/images/convnet_fig.png new file mode 100644 index 0000000..498587a Binary files /dev/null and b/images/convnet_fig.png differ diff --git a/images/corepoints-1.png b/images/corepoints-1.png new file mode 100644 index 0000000..5010f80 Binary files /dev/null and b/images/corepoints-1.png differ diff --git a/images/corepoints-3-1024x504.png b/images/corepoints-3-1024x504.png new file mode 100644 index 0000000..17202d7 Binary files /dev/null and b/images/corepoints-3-1024x504.png differ diff --git a/images/corepoints-4.png b/images/corepoints-4.png new file mode 100644 index 0000000..8abf7c4 Binary files /dev/null and b/images/corepoints-4.png differ diff --git a/images/corepoints-5.png b/images/corepoints-5.png new file mode 100644 index 0000000..5f253ea Binary files /dev/null and b/images/corepoints-5.png differ diff --git a/images/corepoints.png b/images/corepoints.png new file mode 100644 index 0000000..a5db7e1 Binary files /dev/null and b/images/corepoints.png differ diff --git a/images/corereach-1.png b/images/corereach-1.png new file mode 100644 index 0000000..7aa51df Binary files /dev/null and b/images/corereach-1.png differ diff --git a/images/corereach-2.png b/images/corereach-2.png new file mode 100644 index 0000000..1009c70 Binary files /dev/null and b/images/corereach-2.png differ diff --git a/images/covids.png b/images/covids.png new file mode 100644 index 0000000..23f3cf3 Binary files /dev/null and b/images/covids.png differ diff --git a/images/crop_1.png b/images/crop_1.png new file mode 100644 index 0000000..3c6255c Binary files /dev/null and b/images/crop_1.png differ diff --git a/images/crop_2.png b/images/crop_2.png new file mode 100644 index 0000000..2c27880 Binary files /dev/null and b/images/crop_2.png differ diff --git a/images/crop_3.png b/images/crop_3.png new file mode 100644 index 0000000..1c252a5 Binary files /dev/null and b/images/crop_3.png differ diff --git a/images/crop_4.png b/images/crop_4.png new file mode 100644 index 0000000..177d2cc Binary files /dev/null and b/images/crop_4.png differ diff --git a/images/darts-1.png b/images/darts-1.png new file mode 100644 index 0000000..43e60e2 Binary files /dev/null and b/images/darts-1.png differ diff --git a/images/darts-1024x768.jpg b/images/darts-1024x768.jpg new file mode 100644 index 0000000..d87e66f Binary files /dev/null and b/images/darts-1024x768.jpg differ diff --git a/images/darts-1024x925.png b/images/darts-1024x925.png new file mode 100644 index 0000000..03c8e58 Binary files /dev/null and b/images/darts-1024x925.png differ diff 
--git a/images/dataset.png b/images/dataset.png new file mode 100644 index 0000000..2a92421 Binary files /dev/null and b/images/dataset.png differ diff --git a/images/deer-1.png b/images/deer-1.png new file mode 100644 index 0000000..1e88a87 Binary files /dev/null and b/images/deer-1.png differ diff --git a/images/deer.png b/images/deer.png new file mode 100644 index 0000000..71d5b0a Binary files /dev/null and b/images/deer.png differ diff --git a/images/derivative_linear-1024x537.png b/images/derivative_linear-1024x537.png new file mode 100644 index 0000000..56b6ba1 Binary files /dev/null and b/images/derivative_linear-1024x537.png differ diff --git a/images/derivatives-1024x511.png b/images/derivatives-1024x511.png new file mode 100644 index 0000000..5ef275e Binary files /dev/null and b/images/derivatives-1024x511.png differ diff --git a/images/diary.png b/images/diary.png new file mode 100644 index 0000000..d30a8b9 Binary files /dev/null and b/images/diary.png differ diff --git a/images/differences-1.jpg b/images/differences-1.jpg new file mode 100644 index 0000000..296ce41 Binary files /dev/null and b/images/differences-1.jpg differ diff --git a/images/dig_1.png b/images/dig_1.png new file mode 100644 index 0000000..edcf9ba Binary files /dev/null and b/images/dig_1.png differ diff --git a/images/dig_2.png b/images/dig_2.png new file mode 100644 index 0000000..5272f89 Binary files /dev/null and b/images/dig_2.png differ diff --git a/images/dig_3.png b/images/dig_3.png new file mode 100644 index 0000000..046a356 Binary files /dev/null and b/images/dig_3.png differ diff --git a/images/dig_4-300x225.png b/images/dig_4-300x225.png new file mode 100644 index 0000000..f423fbc Binary files /dev/null and b/images/dig_4-300x225.png differ diff --git a/images/dig_4.png b/images/dig_4.png new file mode 100644 index 0000000..34defd1 Binary files /dev/null and b/images/dig_4.png differ diff --git a/images/dog-1.png b/images/dog-1.png new file mode 100644 index 0000000..e74db9d Binary files /dev/null and b/images/dog-1.png differ diff --git a/images/dog-2.png b/images/dog-2.png new file mode 100644 index 0000000..17ea95a Binary files /dev/null and b/images/dog-2.png differ diff --git a/images/dog.png b/images/dog.png new file mode 100644 index 0000000..9680963 Binary files /dev/null and b/images/dog.png differ diff --git a/images/dog2.png b/images/dog2.png new file mode 100644 index 0000000..b5026de Binary files /dev/null and b/images/dog2.png differ diff --git a/images/dropout.png b/images/dropout.png new file mode 100644 index 0000000..8d03e2b Binary files /dev/null and b/images/dropout.png differ diff --git a/images/ecoc_boundary.png b/images/ecoc_boundary.png new file mode 100644 index 0000000..4b48ba2 Binary files /dev/null and b/images/ecoc_boundary.png differ diff --git a/images/ecoc_conf.png b/images/ecoc_conf.png new file mode 100644 index 0000000..c7ed958 Binary files /dev/null and b/images/ecoc_conf.png differ diff --git a/images/eig.png b/images/eig.png new file mode 100644 index 0000000..7a4d902 Binary files /dev/null and b/images/eig.png differ diff --git a/images/elephas-logo.png b/images/elephas-logo.png new file mode 100644 index 0000000..e88169b Binary files /dev/null and b/images/elephas-logo.png differ diff --git a/images/elephas.gif b/images/elephas.gif new file mode 100644 index 0000000..5de799c Binary files /dev/null and b/images/elephas.gif differ diff --git a/images/elu_acc.png b/images/elu_acc.png new file mode 100644 index 0000000..597499f Binary files /dev/null and 
b/images/elu_acc.png differ diff --git a/images/elu_avf.png b/images/elu_avf.png new file mode 100644 index 0000000..d39378c Binary files /dev/null and b/images/elu_avf.png differ diff --git a/images/elu_deriv.png b/images/elu_deriv.png new file mode 100644 index 0000000..d6d7da5 Binary files /dev/null and b/images/elu_deriv.png differ diff --git a/images/elu_he_acc.png b/images/elu_he_acc.png new file mode 100644 index 0000000..1801589 Binary files /dev/null and b/images/elu_he_acc.png differ diff --git a/images/elu_he_loss.png b/images/elu_he_loss.png new file mode 100644 index 0000000..467490f Binary files /dev/null and b/images/elu_he_loss.png differ diff --git a/images/elu_he_relu.png b/images/elu_he_relu.png new file mode 100644 index 0000000..b4f07f6 Binary files /dev/null and b/images/elu_he_relu.png differ diff --git a/images/elu_loss.png b/images/elu_loss.png new file mode 100644 index 0000000..c958231 Binary files /dev/null and b/images/elu_loss.png differ diff --git a/images/elu_relu.png b/images/elu_relu.png new file mode 100644 index 0000000..f6e012b Binary files /dev/null and b/images/elu_relu.png differ diff --git a/images/emb_acc.png b/images/emb_acc.png new file mode 100644 index 0000000..61f704e Binary files /dev/null and b/images/emb_acc.png differ diff --git a/images/emb_loss.png b/images/emb_loss.png new file mode 100644 index 0000000..709de23 Binary files /dev/null and b/images/emb_loss.png differ diff --git a/images/emnist-balanced.png b/images/emnist-balanced.png new file mode 100644 index 0000000..8ec684a Binary files /dev/null and b/images/emnist-balanced.png differ diff --git a/images/emnist-byclass.png b/images/emnist-byclass.png new file mode 100644 index 0000000..cfcbfa1 Binary files /dev/null and b/images/emnist-byclass.png differ diff --git a/images/emnist-bymerge.png b/images/emnist-bymerge.png new file mode 100644 index 0000000..835d2e0 Binary files /dev/null and b/images/emnist-bymerge.png differ diff --git a/images/emnist-digits.png b/images/emnist-digits.png new file mode 100644 index 0000000..386db75 Binary files /dev/null and b/images/emnist-digits.png differ diff --git a/images/emnist-letters.png b/images/emnist-letters.png new file mode 100644 index 0000000..cd82405 Binary files /dev/null and b/images/emnist-letters.png differ diff --git a/images/emnist-mnist.png b/images/emnist-mnist.png new file mode 100644 index 0000000..d42c19e Binary files /dev/null and b/images/emnist-mnist.png differ diff --git a/images/empty_vector.png b/images/empty_vector.png new file mode 100644 index 0000000..c53d15e Binary files /dev/null and b/images/empty_vector.png differ diff --git a/images/encoded-state-1024x511.png b/images/encoded-state-1024x511.png new file mode 100644 index 0000000..bbd886f Binary files /dev/null and b/images/encoded-state-1024x511.png differ diff --git a/images/envs-1.jpg b/images/envs-1.jpg new file mode 100644 index 0000000..5311b3e Binary files /dev/null and b/images/envs-1.jpg differ diff --git a/images/epoch0_batch0-1.jpg b/images/epoch0_batch0-1.jpg new file mode 100644 index 0000000..bd30502 Binary files /dev/null and b/images/epoch0_batch0-1.jpg differ diff --git a/images/epoch0_batch0.jpg b/images/epoch0_batch0.jpg new file mode 100644 index 0000000..08cffbb Binary files /dev/null and b/images/epoch0_batch0.jpg differ diff --git a/images/epoch0_batch100.jpg b/images/epoch0_batch100.jpg new file mode 100644 index 0000000..1897d58 Binary files /dev/null and b/images/epoch0_batch100.jpg differ diff --git a/images/epoch0_batch200-1.jpg 
0000000..3795d3f Binary files /dev/null and b/images/samples.png differ diff --git a/images/sampling-1.png b/images/sampling-1.png new file mode 100644 index 0000000..3adaf25 Binary files /dev/null and b/images/sampling-1.png differ diff --git a/images/sampling.png b/images/sampling.png new file mode 100644 index 0000000..5516ffe Binary files /dev/null and b/images/sampling.png differ diff --git a/images/sampling_normalization.png b/images/sampling_normalization.png new file mode 100644 index 0000000..87096f2 Binary files /dev/null and b/images/sampling_normalization.png differ diff --git a/images/searchspace.png b/images/searchspace.png new file mode 100644 index 0000000..c9cfb61 Binary files /dev/null and b/images/searchspace.png differ diff --git a/images/selu.png b/images/selu.png new file mode 100644 index 0000000..8f70c51 Binary files /dev/null and b/images/selu.png differ diff --git a/images/separable_data.png b/images/separable_data.png new file mode 100644 index 0000000..9f81dcc Binary files /dev/null and b/images/separable_data.png differ diff --git a/images/sequential_encodedstate-1024x511.png b/images/sequential_encodedstate-1024x511.png new file mode 100644 index 0000000..9bb55f0 Binary files /dev/null and b/images/sequential_encodedstate-1024x511.png differ diff --git a/images/sequential_output-1024x511.png b/images/sequential_output-1024x511.png new file mode 100644 index 0000000..fe062db Binary files /dev/null and b/images/sequential_output-1024x511.png differ diff --git a/images/sequential_rec.png b/images/sequential_rec.png new file mode 100644 index 0000000..b38e2e9 Binary files /dev/null and b/images/sequential_rec.png differ diff --git a/images/seven.png b/images/seven.png new file mode 100644 index 0000000..e948e4c Binary files /dev/null and b/images/seven.png differ diff --git a/images/sgd_only-1024x537.png b/images/sgd_only-1024x537.png new file mode 100644 index 0000000..8c8d055 Binary files /dev/null and b/images/sgd_only-1024x537.png differ diff --git a/images/sgd_only_v-1024x537.png b/images/sgd_only_v-1024x537.png new file mode 100644 index 0000000..9cda338 Binary files /dev/null and b/images/sgd_only_v-1024x537.png differ diff --git a/images/ship-1.png b/images/ship-1.png new file mode 100644 index 0000000..3ac0df8 Binary files /dev/null and b/images/ship-1.png differ diff --git a/images/ship.png b/images/ship.png new file mode 100644 index 0000000..868be8e Binary files /dev/null and b/images/ship.png differ diff --git a/images/shopout.jpg b/images/shopout.jpg new file mode 100644 index 0000000..ed78833 Binary files /dev/null and b/images/shopout.jpg differ diff --git a/images/sigmoid-1024x511.png b/images/sigmoid-1024x511.png new file mode 100644 index 0000000..8a40f94 Binary files /dev/null and b/images/sigmoid-1024x511.png differ diff --git a/images/sigmoid_and_deriv-1024x511.jpeg b/images/sigmoid_and_deriv-1024x511.jpeg new file mode 100644 index 0000000..3537dbd Binary files /dev/null and b/images/sigmoid_and_deriv-1024x511.jpeg differ diff --git a/images/sigmoid_deriv-1024x511.png b/images/sigmoid_deriv-1024x511.png new file mode 100644 index 0000000..8862535 Binary files /dev/null and b/images/sigmoid_deriv-1024x511.png differ diff --git a/images/sigmoid_deriv.png b/images/sigmoid_deriv.png new file mode 100644 index 0000000..ee66bba Binary files /dev/null and b/images/sigmoid_deriv.png differ diff --git a/images/signal_compaction-1.png b/images/signal_compaction-1.png new file mode 100644 index 0000000..c7ca51d Binary files /dev/null and 
b/images/signal_compaction-1.png differ diff --git a/images/simple-resnet-block.png b/images/simple-resnet-block.png new file mode 100644 index 0000000..93c1d1f Binary files /dev/null and b/images/simple-resnet-block.png differ diff --git a/images/simple_upsampling.png b/images/simple_upsampling.png new file mode 100644 index 0000000..1cafebb Binary files /dev/null and b/images/simple_upsampling.png differ diff --git a/images/sinusoidal.png b/images/sinusoidal.png new file mode 100644 index 0000000..75dd604 Binary files /dev/null and b/images/sinusoidal.png differ diff --git a/images/sinx_approximated-1024x537.jpeg b/images/sinx_approximated-1024x537.jpeg new file mode 100644 index 0000000..04d01e5 Binary files /dev/null and b/images/sinx_approximated-1024x537.jpeg differ diff --git a/images/sinx_more_data-1024x537.jpeg b/images/sinx_more_data-1024x537.jpeg new file mode 100644 index 0000000..3ef1b76 Binary files /dev/null and b/images/sinx_more_data-1024x537.jpeg differ diff --git a/images/skips_example.png b/images/skips_example.png new file mode 100644 index 0000000..c679363 Binary files /dev/null and b/images/skips_example.png differ diff --git a/images/sklrn.png b/images/sklrn.png new file mode 100644 index 0000000..a6425f2 Binary files /dev/null and b/images/sklrn.png differ diff --git a/images/small_house.jpg b/images/small_house.jpg new file mode 100644 index 0000000..b6796f6 Binary files /dev/null and b/images/small_house.jpg differ diff --git a/images/softmax_logits.png b/images/softmax_logits.png new file mode 100644 index 0000000..3de6f22 Binary files /dev/null and b/images/softmax_logits.png differ diff --git a/images/sparse.png b/images/sparse.png new file mode 100644 index 0000000..2d78a6f Binary files /dev/null and b/images/sparse.png differ diff --git a/images/sqh-db.png b/images/sqh-db.png new file mode 100644 index 0000000..e4ac503 Binary files /dev/null and b/images/sqh-db.png differ diff --git a/images/sqh-generated.png b/images/sqh-generated.png new file mode 100644 index 0000000..7da679c Binary files /dev/null and b/images/sqh-generated.png differ diff --git a/images/sqh-history-1024x537.png b/images/sqh-history-1024x537.png new file mode 100644 index 0000000..00dded8 Binary files /dev/null and b/images/sqh-history-1024x537.png differ diff --git a/images/standards.png b/images/standards.png new file mode 100644 index 0000000..5d38303 Binary files /dev/null and b/images/standards.png differ diff --git a/images/step_1.png b/images/step_1.png new file mode 100644 index 0000000..84448ee Binary files /dev/null and b/images/step_1.png differ diff --git a/images/step_2.png b/images/step_2.png new file mode 100644 index 0000000..66fdb6e Binary files /dev/null and b/images/step_2.png differ diff --git a/images/step_decay.png b/images/step_decay.png new file mode 100644 index 0000000..e248ed9 Binary files /dev/null and b/images/step_decay.png differ diff --git a/images/stl10-1.png b/images/stl10-1.png new file mode 100644 index 0000000..c70be29 Binary files /dev/null and b/images/stl10-1.png differ diff --git a/images/street_bboxes_mc-1024x684.jpg b/images/street_bboxes_mc-1024x684.jpg new file mode 100644 index 0000000..99f8be0 Binary files /dev/null and b/images/street_bboxes_mc-1024x684.jpg differ diff --git a/images/street_mc-1024x684.jpg b/images/street_mc-1024x684.jpg new file mode 100644 index 0000000..7a2689c Binary files /dev/null and b/images/street_mc-1024x684.jpg differ diff --git a/images/stylegan-teaser-1024x614.png b/images/stylegan-teaser-1024x614.png new file 
mode 100644 index 0000000..c64ced8 Binary files /dev/null and b/images/stylegan-teaser-1024x614.png differ diff --git a/images/support_vectors.png b/images/support_vectors.png new file mode 100644 index 0000000..81147b4 Binary files /dev/null and b/images/support_vectors.png differ diff --git a/images/supportvectors.png b/images/supportvectors.png new file mode 100644 index 0000000..6f1bd69 Binary files /dev/null and b/images/supportvectors.png differ diff --git a/images/svhn-extra.png b/images/svhn-extra.png new file mode 100644 index 0000000..cc6b930 Binary files /dev/null and b/images/svhn-extra.png differ diff --git a/images/svhn-normal.png b/images/svhn-normal.png new file mode 100644 index 0000000..fdd0091 Binary files /dev/null and b/images/svhn-normal.png differ diff --git a/images/swish-1024x511.png b/images/swish-1024x511.png new file mode 100644 index 0000000..abdf3d9 Binary files /dev/null and b/images/swish-1024x511.png differ diff --git a/images/swish_deriv-1024x511.png b/images/swish_deriv-1024x511.png new file mode 100644 index 0000000..95c838a Binary files /dev/null and b/images/swish_deriv-1024x511.png differ diff --git a/images/swish_formula.png b/images/swish_formula.png new file mode 100644 index 0000000..6e3e3dd Binary files /dev/null and b/images/swish_formula.png differ diff --git a/images/tanh-1024x511.png b/images/tanh-1024x511.png new file mode 100644 index 0000000..5a92f2b Binary files /dev/null and b/images/tanh-1024x511.png differ diff --git a/images/tanh_and_deriv-1024x511.jpeg b/images/tanh_and_deriv-1024x511.jpeg new file mode 100644 index 0000000..f08a1b7 Binary files /dev/null and b/images/tanh_and_deriv-1024x511.jpeg differ diff --git a/images/taxicab1.png b/images/taxicab1.png new file mode 100644 index 0000000..bd63b85 Binary files /dev/null and b/images/taxicab1.png differ diff --git a/images/taxicab2.png b/images/taxicab2.png new file mode 100644 index 0000000..9a70c8e Binary files /dev/null and b/images/taxicab2.png differ diff --git a/images/tf-2.jpg b/images/tf-2.jpg new file mode 100644 index 0000000..63b5016 Binary files /dev/null and b/images/tf-2.jpg differ diff --git a/images/thispersondoesnotexist-1-1022x1024.jpg b/images/thispersondoesnotexist-1-1022x1024.jpg new file mode 100644 index 0000000..48fe04d Binary files /dev/null and b/images/thispersondoesnotexist-1-1022x1024.jpg differ diff --git a/images/time_decay.png b/images/time_decay.png new file mode 100644 index 0000000..0af94f3 Binary files /dev/null and b/images/time_decay.png differ diff --git a/images/tree-1024x535.png b/images/tree-1024x535.png new file mode 100644 index 0000000..c2c3f14 Binary files /dev/null and b/images/tree-1024x535.png differ diff --git a/images/triangular-300x140.png b/images/triangular-300x140.png new file mode 100644 index 0000000..bd069dc Binary files /dev/null and b/images/triangular-300x140.png differ diff --git a/images/triangular.png b/images/triangular.png new file mode 100644 index 0000000..d8ccef8 Binary files /dev/null and b/images/triangular.png differ diff --git a/images/truck-1.png b/images/truck-1.png new file mode 100644 index 0000000..806d09b Binary files /dev/null and b/images/truck-1.png differ diff --git a/images/truck-2.png b/images/truck-2.png new file mode 100644 index 0000000..017d730 Binary files /dev/null and b/images/truck-2.png differ diff --git a/images/truck.png b/images/truck.png new file mode 100644 index 0000000..4d820b4 Binary files /dev/null and b/images/truck.png differ diff --git a/images/twoclusters.png 
b/images/twoclusters.png new file mode 100644 index 0000000..b80c86f Binary files /dev/null and b/images/twoclusters.png differ diff --git a/images/twoclustersclustered.png b/images/twoclustersclustered.png new file mode 100644 index 0000000..bf9cc83 Binary files /dev/null and b/images/twoclustersclustered.png differ diff --git a/images/undercomplete.png b/images/undercomplete.png new file mode 100644 index 0000000..9767709 Binary files /dev/null and b/images/undercomplete.png differ diff --git a/images/unet-1-1024x868.png b/images/unet-1-1024x868.png new file mode 100644 index 0000000..8edbbfb Binary files /dev/null and b/images/unet-1-1024x868.png differ diff --git a/images/unidirectional-1024x414.png b/images/unidirectional-1024x414.png new file mode 100644 index 0000000..781f6f5 Binary files /dev/null and b/images/unidirectional-1024x414.png differ diff --git a/images/usps.png b/images/usps.png new file mode 100644 index 0000000..16a2ebe Binary files /dev/null and b/images/usps.png differ diff --git a/images/vae-encoder-decoder-1024x229.png b/images/vae-encoder-decoder-1024x229.png new file mode 100644 index 0000000..385dcb8 Binary files /dev/null and b/images/vae-encoder-decoder-1024x229.png differ diff --git a/images/vae-encoder-x.png b/images/vae-encoder-x.png new file mode 100644 index 0000000..f282bd8 Binary files /dev/null and b/images/vae-encoder-x.png differ diff --git a/images/vae-encoder.png b/images/vae-encoder.png new file mode 100644 index 0000000..7e5215a Binary files /dev/null and b/images/vae-encoder.png differ diff --git a/images/vae_mlp-300x180.png b/images/vae_mlp-300x180.png new file mode 100644 index 0000000..259767a Binary files /dev/null and b/images/vae_mlp-300x180.png differ diff --git a/images/vae_mnist.png b/images/vae_mnist.png new file mode 100644 index 0000000..5ebeac8 Binary files /dev/null and b/images/vae_mnist.png differ diff --git a/images/vae_space.png b/images/vae_space.png new file mode 100644 index 0000000..38a7838 Binary files /dev/null and b/images/vae_space.png differ diff --git a/images/val_acc.png b/images/val_acc.png new file mode 100644 index 0000000..6cce3f8 Binary files /dev/null and b/images/val_acc.png differ diff --git a/images/val_loss.png b/images/val_loss.png new file mode 100644 index 0000000..cb1bc19 Binary files /dev/null and b/images/val_loss.png differ diff --git a/images/validpad-300x300.jpg b/images/validpad-300x300.jpg new file mode 100644 index 0000000..a2526f9 Binary files /dev/null and b/images/validpad-300x300.jpg differ diff --git a/images/vg_0.png b/images/vg_0.png new file mode 100644 index 0000000..24fabfb Binary files /dev/null and b/images/vg_0.png differ diff --git a/images/vizprinc.png b/images/vizprinc.png new file mode 100644 index 0000000..ce9dd43 Binary files /dev/null and b/images/vizprinc.png differ diff --git a/images/weight_histogram_1.jpg b/images/weight_histogram_1.jpg new file mode 100644 index 0000000..7175a3f Binary files /dev/null and b/images/weight_histogram_1.jpg differ diff --git a/images/weight_histogram_2.jpg b/images/weight_histogram_2.jpg new file mode 100644 index 0000000..5e8a7fd Binary files /dev/null and b/images/weight_histogram_2.jpg differ diff --git a/images/weight_images.jpg b/images/weight_images.jpg new file mode 100644 index 0000000..cfca627 Binary files /dev/null and b/images/weight_images.jpg differ diff --git a/images/whatisclassification.png b/images/whatisclassification.png new file mode 100644 index 0000000..52fe644 Binary files /dev/null and 
b/images/whatisclassification.png differ diff --git a/images/whatisclassification2.png b/images/whatisclassification2.png new file mode 100644 index 0000000..b04d338 Binary files /dev/null and b/images/whatisclassification2.png differ diff --git a/images/whatisclassification5.png b/images/whatisclassification5.png new file mode 100644 index 0000000..fef40ce Binary files /dev/null and b/images/whatisclassification5.png differ diff --git a/images/whatisclassification6.png b/images/whatisclassification6.png new file mode 100644 index 0000000..c9e14db Binary files /dev/null and b/images/whatisclassification6.png differ diff --git a/images/with_dropout-1024x497.png b/images/with_dropout-1024x497.png new file mode 100644 index 0000000..f122108 Binary files /dev/null and b/images/with_dropout-1024x497.png differ diff --git a/images/x2_1000-1024x537.jpeg b/images/x2_1000-1024x537.jpeg new file mode 100644 index 0000000..3f2671d Binary files /dev/null and b/images/x2_1000-1024x537.jpeg differ diff --git a/images/x2_approximated-1024x537.jpeg b/images/x2_approximated-1024x537.jpeg new file mode 100644 index 0000000..b9aaff9 Binary files /dev/null and b/images/x2_approximated-1024x537.jpeg differ diff --git a/images/x2noise-300x225.png b/images/x2noise-300x225.png new file mode 100644 index 0000000..6337029 Binary files /dev/null and b/images/x2noise-300x225.png differ diff --git a/images/x2noise.png b/images/x2noise.png new file mode 100644 index 0000000..6932693 Binary files /dev/null and b/images/x2noise.png differ diff --git a/images/x2sample-300x225.png b/images/x2sample-300x225.png new file mode 100644 index 0000000..0b0208f Binary files /dev/null and b/images/x2sample-300x225.png differ diff --git a/images/x2sample.png b/images/x2sample.png new file mode 100644 index 0000000..f280515 Binary files /dev/null and b/images/x2sample.png differ diff --git a/images/z0.png b/images/z0.png new file mode 100644 index 0000000..8a01bae Binary files /dev/null and b/images/z0.png differ diff --git a/images/z9.png b/images/z9.png new file mode 100644 index 0000000..f0565ef Binary files /dev/null and b/images/z9.png differ diff --git a/images/z9_o.png b/images/z9_o.png new file mode 100644 index 0000000..5053284 Binary files /dev/null and b/images/z9_o.png differ diff --git a/images/zero_padding.png b/images/zero_padding.png new file mode 100644 index 0000000..4992698 Binary files /dev/null and b/images/zero_padding.png differ diff --git a/images/zero_padding_1d-1-1024x147.png b/images/zero_padding_1d-1-1024x147.png new file mode 100644 index 0000000..3cd6591 Binary files /dev/null and b/images/zero_padding_1d-1-1024x147.png differ diff --git a/implementing-relu-sigmoid-and-tanh-in-keras.md b/implementing-relu-sigmoid-and-tanh-in-keras.md new file mode 100644 index 0000000..270dcd4 --- /dev/null +++ b/implementing-relu-sigmoid-and-tanh-in-keras.md @@ -0,0 +1,324 @@ +--- +title: "ReLU, Sigmoid and Tanh with TensorFlow 2 and Keras" +date: "2019-09-09" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "activation-functions" + - "deep-learning" + - "keras" + - "relu" + - "sigmoid" + - "tanh" +--- + +In a recent tutorial, we looked at [widely used activation functions](https://machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) in today's neural networks. More specifically, we checked out Rectified Linear Unit (ReLU), Sigmoid and Tanh (or hyperbolic tangent), together with their benefits and drawbacks. + +However, it all remained theory. 
+ +In this blog post, we'll move towards implementation. Because: how do you build neural networks with ReLU, Sigmoid and Tanh in Keras, one of today's popular deep learning frameworks? + +If you're interested in the inner workings of the activation functions, check out the link above. + +If you wish to implement them, make sure to read on! 😎 + +In this tutorial, you will... + +- Understand what the ReLU, Tanh and Sigmoid activations are. +- See where to apply these activation functions in your TensorFlow 2.0 and Keras model. +- Walk through an end-to-end example of implementing ReLU, Tanh or Sigmoid in your Keras model. + +Note that the results are [also available on GitHub](https://github.com/christianversloot/relu-tanh-sigmoid-keras). + +* * * + +**Update 18/Jan/2021:** ensured that the tutorial is up to date for 2021. Also revisited header information. + +**Update 03/Nov/2020:** made code compatible with TensorFlow 2.x. + +* * * + +\[toc\] + +* * * + +## Code examples: using ReLU, Tanh and Sigmoid with TF 2.0 and Keras + +These code examples show how you can add ReLU, Sigmoid and Tanh to your TensorFlow 2.0/Keras model. If you want to understand the activation functions in more detail, or see how they fit in a Keras model as a whole, make sure to continue reading! + +### Rectified Linear Unit (ReLU) + +``` +model.add(Dense(12, input_shape=(8,), activation='relu')) +model.add(Dense(8, activation='relu')) +``` + +### Sigmoid + +``` +model.add(Dense(12, input_shape=(8,), activation='sigmoid')) +model.add(Dense(8, activation='sigmoid')) +``` + +### Tanh + +``` +model.add(Dense(12, input_shape=(8,), activation='tanh')) +model.add(Dense(8, activation='tanh')) +``` + +* * * + +## Recap: ReLU, Tanh and Sigmoid + +Before we begin, a small recap on the concept of an activation function and the [three most widely used ones today](https://machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/). + +Neural networks are composed of layers of individual neurons which can take vector data as input and subsequently either fire to some extent or remain silent. + +Each individual neuron multiplies an input vector with its weights vector to compute the so-called dot product, subsequently adding a bias value, before emitting the output. + +However, the multiplication and addition operations are linear and, by consequence, neural networks that apply only them can handle linear data well, but nothing more. + +This is not desirable because most real-world data is nonlinear in nature. For example, it's really hard to draw a line through an image to separate an object from its surroundings. + +Hence, activation functions are applied to neural networks: the linear output is first input into such a function before being emitted to the next layer. Since activation functions are nonlinear, the linear input will be transformed into nonlinear output. When applied to all neurons, the system as a whole becomes nonlinear, capable of learning from highly complex, nonlinear data. + +ReLU, Sigmoid and Tanh are today's most widely used activation functions. From these, ReLU is the most prominent one and the de facto standard one during deep learning projects because it is resistant against the [vanishing and exploding gradients](https://machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) problems, whereas Sigmoid and Tanh are not. Hence, it's good practice to start with ReLU and expand from there.
However, this must always be done with its challenges in mind: ReLU is not perfect and is [continuously improved](https://machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/). + +Now that we have a little background on these activation functions, we can introduce the dataset we're going to use to implement neural networks with ReLU, Sigmoid and Tanh in Keras. + +* * * + +## Today's dataset + +Today, we're going to use a dataset that we used before when discussing [Rosenblatt Perceptrons and Keras](https://machinecurve.com/index.php/2019/07/24/why-you-cant-truly-create-rosenblatts-perceptron-with-keras/): the **Pima Indians Diabetes Database**. + +This is what it does: + +> This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. +> +> Source: [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database) + +The nice thing about this dataset is that it is relatively simple. Hence, we can fully focus on the implementation rather than having to be concerned about data related issues. Additionally, it is freely available at [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database), under a CC0 license. This makes it the perfect choice for a blog like this. + +The dataset very simply tries to predict the following: + +- **Outcome:** Whether a person has diabetes (1) or not (0), the 0 and 1 being the target values. + +For machine learning projects, it allows you to find correlations between (combinations of) those input values and the target values: + +- **Pregnancies:** the number of times one has been pregnant; +- **Glucose:** one's plasma glucose concentration; +- **BloodPressure:** one's diastolic (lower) blood pressure value in mmHg. +- **SkinThickness:** the thickness of one's skin fold at the triceps, in mm. +- **Insulin:** one's 2-hour serum insulin level; +- **BMI:** one's Body Mass Index; +- **Diabetes pedigree function:** one's sensitivity to diabetes e.g. based on genetics; +- **Age:** one's age in years. + +* * * + +## General model parts + +Today, we'll build a very simple model to illustrate our point. More specifically, we will create a [multilayer perceptron](https://machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/) with Keras - but then three times, each time with a different activation function. + +To do this, we'll start by creating three files - one per activation function: `relu.py`, `sigmoid.py` and `tanh.py`. In each, we'll add general parts that are shared across the model instances. + +Note that you'll need the dataset as well. You could either download it from [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database) or take a look at GitHub, where it is present as well. Save the `pima_dataset.csv` file in the same folder as your `*.py` files. + +We begin with the dependencies: + +``` +# Load dependencies +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +import numpy as np +``` + +They are really simple today. We use the Keras Sequential API, which is the simplest of two and allows us to add layers sequentially, or in a line. 
We also import the `Dense` layer, which is short for densely-connected, or the layer types that are traditionally present in a multilayer perceptron. + +Additionally, we import `numpy` for reading the file and preparing the dataset. + +Second, we load the data: + +``` +# Load data +dataset = np.loadtxt('./pima_dataset.csv', delimiter=',') +``` + +Since the data is comma-separated, we set the `delimiter` to a comma. + +We then separate the input data and the target data: + +``` +# Separate input data and target data +X = dataset[:, 0:8] +Y = dataset[:, 8] +``` + +In the CSV file, the data is appended together. That is, each row contains both the input data (the data used for training) and the outcomes (0/1) that are related to the input data. They need to be split if we want to train the model. We do so with the code above. It essentially takes 8 columns and makes it input data (columns 0-7), and one (the 8th) as target data. + +We then start off with the model itself and instantiate the Sequential API: + +``` +# Create the Perceptron +model = Sequential() +``` + +We're then ready to add some activation function-specific code. We'll temporarily indicate its position with a comment: + +``` +# ActivationFunction-specific code here +``` + +...and continue with our final general steps: + +``` +model.add(Dense(1, activation='sigmoid')) +model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) + +# Train the model +model.fit(X, Y, epochs=225, batch_size=25, verbose=1, validation_split=0.2) +``` + +What we do first is adding the _final_ layer in the model: a Dense layer with one neuron and a Sigmoid activation function. This is what we need: since our classification problem is binary, we need one output neuron (that outputs a value between class 0 and class 1). The [Sigmoid](https://machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) activation function allows us to do exactly that. Hence, we use it in our final layer too. + +Compiling the model with binary crossentropy (we have a binary classification problem), the Adam optimizer (an extension of stochastic gradient descent that allows local parameter optimization and adds momentum) and accuracy is what we do second. + +We finally fit the data (variables `X` and `Y` to the model), using 225 epochs with a batch size of 25. We set verbosity mode to 1 to see what happens and allow for a validation split of `0.2`: 20% of the data will be used for validating the training process after each epoch. + +* * * + +## Activation function-specific implementations + +Now, it's time to add activation function-specific code. In all of the below cases, this is the part that you'll need to replace: + +``` +# ActivationFunction-specific code here +``` + +* * * + +## TensorFlow 2.0 and Keras = Neural networks, made easy + +As we may recall from the introduction of this blog _or_ the Keras website, this is the framework's goal: + +> It was developed with a focus on enabling fast experimentation. _Being able to go from idea to result with the least possible delay is key to doing good research._ + +It is therefore no surprise that changing the activation function is very easy if you're using the standard ones. Essentially, Keras allows you to specify an activation function per layer by means of the `activation` parameter. As you can see above, we used this parameter to specify the Sigmoid activation in our final layer. The standard ones are available. 
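+ +Note that the `activation` argument does not only accept these string identifiers - it also accepts a callable, which is the route you would take for an activation function that you define yourself. Below is a minimal sketch, assuming the imports and the `model = Sequential()` object from the general parts above; `my_swish` is purely an illustrative name, not something Keras ships under that name: + +``` +from tensorflow.keras import backend as K + +# Self-defined activation: Swish-like, i.e. x * sigmoid(x) +def my_swish(x): +    return x * K.sigmoid(x) + +# Pass the callable instead of a string identifier +model.add(Dense(12, input_shape=(8,), activation=my_swish)) +model.add(Dense(8, activation=my_swish)) +```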
+ +Today, Keras is tightly coupled to TensorFlow 2.0, and is still one of the key libraries for creating your neural networks. This article was adapted to reflect the latest changes in TensorFlow and works with any TensorFlow 2 version. + +Best of all, if the activation function of your choice - for example [Swish](https://machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/) - is not available, you can create it yourself and add it as a function. Take a look at the Swish post to find an example. + +### Adding ReLU to your model + +By consequence, if we wish to implement a neural network with ReLU, we do this: + +``` +model.add(Dense(12, input_shape=(8,), activation='relu')) +model.add(Dense(8, activation='relu')) +``` + +### Adding Sigmoid to your model + +...and with Sigmoid: + +``` +model.add(Dense(12, input_shape=(8,), activation='sigmoid')) +model.add(Dense(8, activation='sigmoid')) +``` + +### Adding Tanh to your model + +...or Tanh: + +``` +model.add(Dense(12, input_shape=(8,), activation='tanh')) +model.add(Dense(8, activation='tanh')) +``` + +Eventually, your code will look like this: + +``` +# Load dependencies +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +import numpy as np + +# Load data +dataset = np.loadtxt('./pima_dataset.csv', delimiter=',') + +# Separate input data and target data +X = dataset[:, 0:8] +Y = dataset[:, 8] + +# Create the Perceptron +model = Sequential() + +# Model layers +model.add(Dense(12, input_shape=(8,), activation='relu')) +model.add(Dense(8, activation='relu')) +model.add(Dense(1, activation='sigmoid')) + +# Model compilation +model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) + +# Model training +model.fit(X, Y, epochs=225, batch_size=25, verbose=1, validation_split=0.2) +``` + +The models are also available on [GitHub](https://github.com/christianversloot/relu-tanh-sigmoid-keras). + +* * * + +## Model performance: some observations + +Let's now see if we can train the model. We can simply run a model by e.g. writing `python relu.py`, `python sigmoid.py` or `python tanh.py`, depending on the model you wish to train. + +Note that you need a fully operational deep learning environment to make it work. This means that you'll need Python (preferably 3.8+), and that you'll need TensorFlow and Keras as well as NumPy. Preferably, you'll have this installed in a dedicated Anaconda environment, so that you have a pure deep learning environment each time you're training. + +When you have one, the training process starts upon execution of the command, and eventually this will be your output. + +For ReLU: + +``` +Epoch 225/225 +614/614 [==============================] - 0s 129us/step - loss: 0.4632 - acc: 0.7785 - val_loss: 0.5892 - val_acc: 0.7143 +``` + +For Tanh: + +``` +Epoch 225/225 +614/614 [==============================] - 0s 138us/step - loss: 0.5466 - acc: 0.7003 - val_loss: 0.6839 - val_acc: 0.6169 +``` + +For Sigmoid: + +``` +Epoch 225/225 +614/614 [==============================] - 0s 151us/step - loss: 0.5574 - acc: 0.7280 - val_loss: 0.6187 - val_acc: 0.7013 +``` + +The results suggest that Tanh performs worse than ReLU and Sigmoid. This is explainable through the lens of its range: since we have a binary classification problem, both Sigmoid and ReLU are naturally better suited for this task, particularly the Sigmoid function. Specifically, its binary crossentropy loss value is much higher than e.g.
ReLU, although this one can also be improved much further - but that's not the point of this blog. + +As you can see, it's always wise to consider multiple activation functions. In my master's thesis, I found that in some cases Tanh works better than ReLU. Since the practice of deep learning is often more art than science, it's always worth a try. + +* * * + +## Summary + +In this blog, we've been introduced to activation functions and the most widely ones used today at a high level. Additionally, we checked the Pima Indians Diabetes Dataset and its contents and applied it with Keras to demonstrate how to create neural networks with the ReLU, Tanh and Sigmoid activation functions - [see GitHub](https://github.com/christianversloot/relu-tanh-sigmoid-keras). I hope you've found the answers to your challenges and hope you'll keep engineering! 😎 + +* * * + +## References + +TensorFlow. (2021). _Module: Tf.keras.activations_. [https://www.tensorflow.org/api\_docs/python/tf/keras/activations](https://www.tensorflow.org/api_docs/python/tf/keras/activations) + +Keras. (n.d.). Activations. Retrieved from [https://keras.io/activations/](https://keras.io/activations/) + +Kaggle. (n.d.). Pima Indians Diabetes Database. Retrieved from [https://www.kaggle.com/uciml/pima-indians-diabetes-database](https://www.kaggle.com/uciml/pima-indians-diabetes-database) diff --git a/introducing-pca-with-python-and-scikit-learn-for-machine-learning.md b/introducing-pca-with-python-and-scikit-learn-for-machine-learning.md new file mode 100644 index 0000000..73d91b7 --- /dev/null +++ b/introducing-pca-with-python-and-scikit-learn-for-machine-learning.md @@ -0,0 +1,888 @@ +--- +title: "Introducing PCA with Python and Scikit-learn for Machine Learning" +date: "2020-12-07" +categories: + - "frameworks" + - "svms" +tags: + - "data-preprocessing" + - "deep-learning" + - "feature-extraction" + - "feature-scaling" + - "machine-learning" + - "neural-networks" + - "pca" + - "principal-component-analysis" +--- + +Training a Supervised Machine Learning model - whether that is a traditional one or a Deep Learning model - involves a few steps. The first is feeding forward the data through the model, generating predictions. The second is comparing those predictions with the actual values, which are also called ground truth. The third, then, is to optimize the model based on the minimization of some objective function. + +In this iterative process, the model gets better and better, and sometimes it even gets _really_ good. + +But what data will you feed forward? + +Sometimes, your input sample will contain many _columns_, also known as features. It is common knowledge (especially with traditional models) that using every column in your Machine Learning model will mean trouble, the so-called curse of dimensionality. In this case, you'll have to selectively handle the features you are working with. In this article, we'll cover **Principal Component Analysis** (PCA), which is one such way. It provides **a gentle but extensive introduction to feature extraction for your Machine Learning model with PCA.** + +It is structured as follows. First of all, we'll take a look at what PCA is. We do this through the lens of the Curse of Dimensionality, which explains why we need to reduce dimensionality especially with traditional Machine Learning algorithms. This also involves the explanation of the differences between Feature Selection and Feature Extraction technique, which have a different goal. 
PCA, which is part of the Feature Extraction branch of techniques, is then introduced. + +When we know sufficiently about PCA conceptually, we'll take a look at it from a Python point of view. For a sample dataset, we're going to perform PCA in a step-by-step fashion. We'll take a look at all the individual components. Firstly, we'll compute the covariance matrix for the variables. Then, we compute the eigenvectors and eigenvalues, and select which ones are best. Subsequently, we compose the PCA projection matrix for mapping the data onto the axes of the principal components. This allows us to create entirely new dimensions which capture most of the variance from the original dataset, at a fraction of the dimensions. Note that SVD can also be used instead of eigenvector decomposition; we'll also take a look at that. + +Once we clearly understand how PCA happens by means of the Python example, we'll show you how you don't have to reinvent the wheel if you're using PCA. If you understand what's going on, it's often better to use a well-established library for computing the PCA. Using Scikit-learn's `sklearn.decomposition.PCA` API, we will finally show you how to compute principal components and apply them to perform dimensionality reduction for your dataset. + +All right. Enough introduction for now. + +Let's get to work! 😎 + +**Update 11/Jan/2021:** added quick code example to start using PCA straight away. Also corrected a few spelling issues. + +* * * + +\[toc\] + +* * * + +## Code example: using PCA with Python + +This quick code example allows you to start using Principal Component Analysis with Python immediately. If you want to understand the concepts and code in more detail, make sure to read the rest of this article :) + +``` +from sklearn import datasets +from sklearn.decomposition import PCA +from sklearn.preprocessing import StandardScaler + +# Load Iris dataset +iris = datasets.load_iris() +X = iris.data +y = iris.target + +# Standardize +scaler = StandardScaler() +scaler.fit(X) +X = scaler.transform(X) + +# PCA +pca = PCA(n_components=2) +pca.fit(X) +print(pca.explained_variance_ratio_) +print(pca.components_) +X = pca.transform(X) +``` + +* * * + +## What is Principal Component Analysis? + +Before we dive in to the specifics of PCA, I think we should first take a look at why it can be really useful for Machine Learning projects. For this reason, we will first take a look at Machine Learning projects and the Curse of Dimensionality, which is especially present when using older Machine Learning algorithms (Support Vector Machines, Logistic Regression, ...). + +Then, we'll discuss what can be done against it - _dimensionality reduction_ - and explain the difference between Feature Selection and Feature Extraction. Finally, we'll get to PCA - and provide a high-level introduction. + +### Machine Learning and the Curse of Dimensionality + +If you are training a Supervised Machine Learning model, at a high level, you are following a three-step, iterative process: + +![](images/feed-1024x404.jpg) + +Since Supervised Learning means that you have a dataset at your disposal, the first step in training a model is **feeding the samples to the model**. For every sample, a prediction is generated. Note that at the first iteration, the model has just been initialized. The predictions therefore likely make no sense at all. + +This becomes especially evident from what happens in the second step, **where predictions and ground truth (= actual targets) are compared**. 
This comparison produces an [error or loss value](https://www.machinecurve.com/index.php/2020/11/02/machine-learning-error-bias-variance-and-irreducible-error-with-python/) which illustrates how bad the model performs. + +The third step is then really simple: you **improve the model**. Depending on the Machine Learning algorithm, optimization happens in different ways. In the case of Neural networks, gradients are computed with backpropagation, and subsequently [optimizers](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) are used for changing the model internals. Weights can also be changed by minimizing one function only; it just depends on the algorithm. + +You then start again. Likely, because you have optimized the model, the predictions are a little bit better now. You simply keep iterating until you are satisfied with the results, and then you stop the training process. + +#### Underfitting and overfitting a model + +When you are performing this iterative process, you are effectively moving from a model that is _underfit_ to a model that demonstrates a _good fit_. If you want to understand these concepts in more detail, [this article can help](https://www.machinecurve.com/index.php/2020/12/01/how-to-check-if-your-deep-learning-model-is-underfitting-or-overfitting/), but let's briefly take a look at them here as well. + +In the first stages of the training process, your model is likely not able to capture the patterns in your dataset. This is visible in the left part of the figure below. The solution is simple: just keep training until you achieve the right fit for the dataset (that's the right part). Now, you _can't keep training forever_. If you do, the model will learn to focus too much on patterns hidden within your training dataset - patterns that may not be present in other real-world data at all; patterns truly specific to the sample with which you are training. + +The result: a model tailored to your specific dataset, visible in the middle part of the figure. + +In other words, training a Machine Learning model involves finding a good balance between a model that is underfit and a model that is overfit. Fortunately, many techniques are available [to help you with this](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/), but it's one of the most common problems in Supervised ML today. + +- [![](images/30under.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/30under.png) + +- [![](images/30over.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/30over.png) + +- [![](images/30good.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/30good.png) + + +On the left: a model that is underfit with respect to the data. In the middle: a model that is overfit with respect to the data. On the right: the fit that we were looking for. + +#### Having a high-dimensional feature vector + +I think the odds are that I can read your mind at this point. + +Overfitting, underfitting, and training a Machine Learning model - how are they related to Principal Component Analysis? + +That's a fair question. What I want to do is to illustrate why a large dataset - in terms of the number of columns - can significantly increase the odds that your model will overfit. 
+ +Suppose that you have the following feature vector: + +\[latex\]\\textbf{x} = \[1.23, -3.00, 45.2, 9.3, 0.1, 12.3, 8.999, 1.02, -2.45, -0.26, 1.24\]\[/latex\] + +This feature vector is 11-dimensional. + +Now suppose that you have 200 samples. + +Will a Machine Learning model be able to _generalize_ across all eleven dimensions? In other words, do we have sufficient samples to cover large parts of the domains for all features in the vector (i.e., all the axes in the 11-dimensional space)? Or does it look like a cheese with (massive) holes? + +I think it's the latter. Welcome to the Curse of Dimensionality. + +#### The Curse of Dimensionality + +Quoted from Wikipedia: + +> In machine learning problems that involve learning a "state-of-nature" from a finite number of data samples in a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required to ensure that there are several samples with each combination of values. +> +> Wikipedia (n.d.) + +In other words, that's what we just described. + +The point with "\[ensuring\] that there are several samples with each combination of values" is that when this is performed well, you will likely be able to train a model that (1) performs well and (2) generalizes well across many settings. With 200 samples, however, it's _100% certain_ that you don't meet this requirement. The effect is simple: your model will overfit to the data at hand, and it will become worthless if it is used with data from the real world. + +Since increasing dimensionality means an ever-growing need for more data, the only way out of this curse is to reduce the number of dimensions in our dataset. This is called Dimensionality Reduction, and we'll now take a look at two approaches - Feature Selection and Feature Extraction. + +### Dimensionality Reduction: Feature Selection vs Feature Extraction + +We saw that if we want to decrease the odds of overfitting, we must reduce the dimensionality of our data. While this can easily be done in theory (we can simply cut off a few dimensions, who cares?), this gets slightly difficult in practice (which dimension to cut... because, how do I know which one contributes most?). + +And what if _each dimension contributes an equal amount to the predictive power of the model?_ What then? + +In the field of Dimensionality Reduction, there are two main approaches that you can use: Feature Selection and Feature Extraction. + +- **Feature Selection** involves "the process of selecting a subset of relevant features (variables, predictors) for use in model construction" (Wikipedia, 2004). In other words, Feature Selection approaches attempt to measure the contribution of each feature, so that you can keep the ones that contribute most. What must be clear is that your model will be trained with the _original_ variables; however, with only a few of them. + - Feature Selection can be a good idea if you already think that most variance within your dataset can be explained by a few variables. If the others are truly non-important, then you can easily discard them without losing too much information. +- **Feature Extraction**, on the other hand, "starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant" (Wikipedia, 2003). In other words, _a derived dataset will be built_ that can be used for training your Machine Learning model. It is lower-dimensional compared to the original dataset.
It will be as informative as possible (i.e., as much information from the original dataset is pushed into the new variables) while non-redundant (i.e., we want to avoid a situation where information present in one new variable is also present in another new variable). In other words, we get a lower-dimensional dataset that explains most variance in the dataset, while keeping things relatively simple. + - Especially in the case where each dimension contributes an equal amount, Feature Extraction can be preferred over Feature Selection. The same is true if you have no clue about the contribution of each variable to the model's predictive power. + +### Introducing Principal Component Analysis (PCA) + +Now that we are aware of the two approaches, it's time to get to the point. We'll now introduce PCA, a Feature Extraction technique, for dimensionality reduction. + +**Principal Component Analysis** is defined as follows: + +> Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest. +> +> Wikipedia (2002) + +Well, that's quite a technical description, isn't it? And what are "principal components"? + +> The principal components of a collection of points in a real p-space are a sequence of \[latex\]p\[/latex\] direction vectors, where the \[latex\]i^{th}\[/latex\] vector is the direction of a line that best fits the data while being orthogonal to the first \[latex\]i - 1\[/latex\] vectors. +> +> Wikipedia (2002) + +I can perfectly understand it if you still have no idea what PCA is after reading those two quotes. I had the same experience. For this reason, let's break things down step-by-step. + +**The goal of PCA:** finding a set of vectors (principal components) that best describe the spread and direction of your data across its many dimensions, allowing you to subsequently pick the top-\[latex\]n\[/latex\] best-describing ones for reducing the dimensionality of your feature space. + +**The steps of PCA:** + +1. If you have a dataset, its spread can be expressed in orthonormal vectors - the principal directions of the dataset. Orthonormal, here, means that the vectors are orthogonal to each other (i.e. they have an angle of 90°) and are of size 1. +2. By sorting these vectors in order of importance (by looking at their relative contribution to the spread of the data as a whole), we can find the dimensions of the data which explain most variance. +3. We can then reduce the number of dimensions to the most important ones only. +4. And finally, we can project our dataset onto these new dimensions, called the principal components, performing dimensionality reduction without losing much of the information present in the dataset. + +**The how:** + +Although we will explain _how_ later in this article, we'll now visually walk through performing PCA at a high level. This allows you to understand _what happens_ first, before we dive into _how it happens._ + +Another important note is that for step (1), decomposing your dataset into vectors can be done in two different ways - by means of (a) eigenvector decomposition of the covariance matrix, or (b) Singular Value Decomposition. Later in this article, we'll walk through both approaches step-by-step.
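+ +For readers who prefer code over prose, these four steps can also be previewed in a handful of NumPy lines. Note that this is a deliberately simplified sketch: it assumes a standardized dataset `X` of shape `(n_samples, n_features)`, uses the eigendecomposition route only, and skips the details that we will cover step-by-step later in this article. The name `pca_sketch` is purely illustrative: + +``` +import numpy as np + +def pca_sketch(X, n_components): +    # Step 1: express the spread of the data in eigenpairs of the covariance matrix +    covariance_matrix = np.cov(X.T) +    eig_values, eig_vectors = np.linalg.eigh(covariance_matrix) +    # Step 2: sort the eigenpairs in descending order of importance +    order = np.argsort(eig_values)[::-1] +    eig_values, eig_vectors = eig_values[order], eig_vectors[:, order] +    # Step 3: keep only the most important principal directions +    projection_matrix = eig_vectors[:, :n_components] +    # Step 4: project the dataset onto the principal components +    return np.dot(X, projection_matrix) +```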
+ +#### Expressing the spread of your dataset in vectors + +Suppose that we generate a dataset based on two overlapping blobs which we consider to be part of just one dataset: + +``` +from sklearn.datasets import make_blobs +import matplotlib.pyplot as plt +import numpy as np + +# Configuration options +num_samples_total = 1000 +cluster_centers = [(1,1), (1.25,1.5)] +num_classes = len(cluster_centers) + +# Generate data +X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.15) + +# Make plot +plt.scatter(X[:, 0], X[:, 1]) +axes = plt.gca() +axes.set_xlim([0, 2]) +axes.set_ylim([0, 2]) +plt.show() +``` + +...which looks as follows: + +![](images/pca_1.png) + + +If you look closely at the dataset, you can see that it primarily spreads into two directions. These directions are from the upper right corner to the lower left corner and from the lower right middle to the upper left middle. Those directions are different from the **axis directions**, which are orthogonal to each other: the x and y axes have an angle of 90 degrees. + +No other set of directions will explain as much of the variance as the one we mentioned above. + +After [standardization](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/), we can visualize the directions as a pair of two vectors. These vectors are called the **principal directions** of the data (StackExchange, n.d.). There are as many principal directions as the number of dimensions; in our case, there are two. + +![](images/pca_2.png) + +We call these vectors **eigenvectors**. Their length is represented by what is known as an **eigenvalue**. They play a big role in PCA because of the following reason: + +> \[The eigenvectors and related\] eigenvalues explain the variance of the data along the new feature axes. +> +> Raschka (2015) + +In other words, they allow us to capture both the (1) direction and (2) magnitude of the spread in your dataset. + +Notice that the vectors are orthogonal to each other. Also recall that our axes are orthogonal to each other. You can perhaps now imagine that it becomes possible to perform a transformation of your dataset, so that the directions of the axes are equal to the directions of the eigenvectors. In other words, we change the "viewpoint" of our data, so that the axes and vectors have equal directions. + +This is the core of PCA: projecting the data to our principal directions, which are then called **principal components**. + +The benefit here is that while the _eigenvectors_ tell us something about the directions of our projection, the corresponding _eigenvalues_ tell us something about the **importance** of that particular principal direction in explaining the variance of the dataset. This allows us to easily discard the directions that don't contribute sufficiently. That's why before projecting the dataset onto the principal components, we must first sort the vectors and reduce the number of dimensions. + +#### Sorting the vectors in order of importance + +Once we know the eigenvectors and eigenvalues that explain the spread of our dataset, we must sort them in order of descending importance. This allows us to perform dimensionality reduction, as we can keep the principal directions which contribute most significantly to the spread in our dataset.
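+ +In NumPy terms, this joint sort boils down to a single `argsort`, as described in the next paragraph. Here is a small, self-contained sketch with made-up eigenpairs, purely for illustration: + +``` +import numpy as np + +# Made-up eigenpairs for illustration (eigenvectors are the columns) +eig_values = np.array([0.5, 1.5]) +eig_vectors = np.array([[0.71, -0.71], [0.71, 0.71]]) + +# Jointly sort the eigenpairs: largest eigenvalue (most important direction) first +order = np.argsort(eig_values)[::-1] +eig_values = eig_values[order] +eig_vectors = eig_vectors[:, order] +```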
+
+Sorting is simple: we sort the list with eigenvalues in descending order and ensure that our list with eigenvectors is sorted in the same way. In other words, the pairs of eigenvectors and eigenvalues are jointly sorted in descending order based on the eigenvalue. As the largest eigenvalues indicate the directions that explain most of the spread in your dataset, they must be at the top of the list.
+
+For the example above, we can see that the eigenvalue for the downward-oriented eigenvector exceeds the one for the upward-oriented vector. If we draw a line through the dataset that overlaps with this vector, we can also see that the variance along that line (where variance is defined as the sum of squared distances of each point to the line's mean) is the largest. There is simply no other line we could draw for which the variance is larger.
+
+In fact, the total (relative) contribution of the eigenvectors to the spread for our example is as follows:
+
+```
+[0.76318124 0.23681876]
+```
+
+(We'll look at how we can determine this later.)
+
+So, for our example above, we now have a sorted list with eigenpairs.
+
+#### Reducing the number of dimensions
+
+As we saw above, the first eigenpair explains 76.3% of the spread in our dataset, whereas the second one explains only 23.7%. Jointly, they explain 100% of the spread, which makes sense.
+
+Using PCA for dimensionality reduction now allows us to take the biggest-contributing vectors (if your original feature space was, say, 10-dimensional, it is likely that a smaller set of vectors explains most of the variance) and only move forward with those.
+
+If our goal was to reduce dimensionality to one, we would now move forward and take the eigenvector that contributes 0.763 for data projection. Note that this implies that we lose 0.237 worth of information about the spread, but in return get a lower number of dimensions.
+
+Clearly, this example with only two dimensions isn't very useful in practice, as two dimensions can easily be handled by Machine Learning algorithms, but the approach is incredibly useful if you have many dimensions to work with.
+
+#### Projecting the dataset
+
+Once we have chosen the number of eigenvectors that we will use for dimensionality reduction (i.e. our target number of dimensions), we can project the data onto the principal components - or component, in our case.
+
+This means that we will be changing the axes so that they are now equal to the eigenvectors.
+
+In the example below, we project our data onto one eigenvector. We can see that only the \[latex\]x\[/latex\] axis has values after projecting, and that hence our feature space has been reduced to one dimension.
+
+We have thus used PCA for dimensionality reduction.
+
+![](images/pca_3.png)
+
+#### The how of generating eigenpairs: Eigenvector Decomposition or Singular Value Decomposition
+
+Above, we covered the general steps of performing Principal Component Analysis. Recall that they are as follows:
+
+1. Decomposing the dataset into a set of eigenpairs.
+2. Sorting the eigenpairs in descending order of importance.
+3. Selecting the \[latex\]n\[/latex\] most important eigenpairs, where \[latex\]n\[/latex\] is the desired number of dimensions.
+4. Projecting the data onto the \[latex\]n\[/latex\] selected eigenvectors, so that their directions become the directions of our axes.
+
+In step (1), we simply mentioned that we can express the spread of our data by means of eigenpairs. On purpose, we didn't explain _how_ this can be done, for the sake of simplicity.
+ +In fact, there are two methods that are being used for this purpose today: **Eigenvector Decomposition** (often called "EIG") and **Singular Value Decomposition** ("SVD"). Using different approaches, they can be used to obtain the same end result: expressing the spread of your dataset in eigenpairs, the principal directions of your data, which can subsequently be used to reduce the number of dimensions by projecting your dataset to the most important ones, the principal components. + +While mathematically and hence formally you can obtain the same result with both, in practice PCA-SVD is numerically more stable (StackExchange, n.d.). For this reason, you will find that most libraries and frameworks favor a PCA-SVD implementation over a PCA-EIG one. Nevertheless, you can still achieve the same result with both approaches! + +In the next sections, we will take a look at clear and step-by-step examples of PCA with EIG and PCA with SVD, allowing you to understand the differences intuitively. We will then look at `sklearn.decomposition.PCA`, Scikit-learn's implementation of Principal Component Analysis based on PCA-SVD. There is no need to perform PCA manually if there are great tools out there, after all! ;-) + +* * * + +## PCA-EIG: Eigenvector Decomposition with Python Step-by-Step + +One of the ways in which PCA can be performed is by means of **Eigenvector Decomposition (EIG)**. More specifically, we can use the covariance matrix of our \[latex\]N\[/latex\]-dimensional dataset and decompose it into \[latex\]N\[/latex\] eigenpairs. We can do this as follows: + +1. **Standardizing the dataset:** EIG based PCA only works well if the dataset is centered and has a mean of zero (i.e. \[latex\]\\mu = 0.0\[/latex\]). We will use [standardization](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) for this purpose, which also scales the data to a standard deviation of one (\[latex\]\\sigma = 1.0\[/latex\]). +2. **Computing the covariance matrix of the variables:** a covariance matrix indicates how much variance each individual variable has, and how much they 'covary' - in other words, how much certain variables move together. +3. **Decomposing the covariance matrix into eigenpairs:** mathematically, we can rewrite the covariance matrix so that we can get a set of eigenvectors and eigenvalues, or eigenpairs. +4. **Sorting the eigenpairs in decreasing order of importance**, to find the principal directions in your dataset which contribute to the spread most significantly. +5. **Selecting the variance contribution of your principal directions and selecting \[latex\]n\[/latex\] principal components:** if we know the relative contributions to the spread for each principal direction, we can perform dimensionality reduction by selecting only the \[latex\]n\[/latex\] most contributing principal components. +6. **Building the projection matrix** for projecting our original dataset onto the principal components. + +We can see that steps (1), (4), (5) and (6) are general - we also saw them above. Steps (2) and (3) are specific to PCA-EIG and represent the core of what makes eigenvector decomposition based PCA unique. We will now cover each step in more detail, including step-by-step examples with Python. Note that the example in this section makes use of native / vanilla Python deliberately, and that Scikit-learn based implementations of e.g. 
[standardization](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) and PCA will be used in another section. + +### Using the multidimensional Iris dataset + +If we want to show how PCA works, we must use a dataset where the number of dimensions \[latex\]> 2\[/latex\]. Fortunately, Scikit-learn provides the Iris dataset, which can be used to classify three groups of Iris flowers based on four characteristics (and hence features or dimensions): petal length, petal width, sepal length and sepal width. + +This code can be used for visualizing two dimensions every time: + +``` +from sklearn import datasets +import matplotlib.pyplot as plt +import numpy as np + +# Configuration options +dimension_one = 1 +dimension_two = 3 + +# Load Iris dataset +iris = datasets.load_iris() +X = iris.data +y = iris.target + +# Shape +print(X.shape) +print(y.shape) + +# Dimension definitions +dimensions = { + 0: 'Sepal Length', + 1: 'Sepal Width', + 2: 'Petal Length', + 3: 'Petal Width' +} + +# Color definitions +colors = { + 0: '#b40426', + 1: '#3b4cc0', + 2: '#f2da0a', +} + +# Legend definition +legend = ['Iris Setosa', 'Iris Versicolour', 'Iris Virginica'] + +# Make plot +colors = list(map(lambda x: colors[x], y)) +plt.scatter(X[:, dimension_one], X[:, dimension_two], c=colors) +plt.title(f'Visualizing dimensions {dimension_one} and {dimension_two}') +plt.xlabel(dimensions[dimension_one]) +plt.ylabel(dimensions[dimension_two]) +plt.show() +``` + +This yields the following plots, if we play with the dimensions: + +- [![](images/iris-mix.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/iris-mix.png) + +- [![](images/iris-petal.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/iris-petal.png) + +- [![](images/iris-sepal.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/iris-sepal.png) + + +The images illustrate that two of the Iris flowers cannot be linearly separated, but that this group _can_ be separated from the other Iris flower. Printing the shape yields the following: + +``` +(150, 4) +(150,) +``` + +...indicating that we have only 150 samples, but that our feature space is four-dimensional. Clearly a case where feature extraction _could_ be beneficial for training our Machine Learning model. + +### Performing standardization + +We first add Python code for [standardization](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/), which brings our data to \[latex\]\\mu = 0.0, \\sigma = 1.0\[/latex\] by performing \[latex\]x = \\frac{x - \\mu}{\\sigma}\[/latex\] for each dimension (MachineCurve, 2020). 
+ +``` +# Perform standardization +for dim in range(0, X.shape[1]): + print(f'Old mean/std for dim={dim}: {np.average(X[:, dim])}/{np.std(X[:, dim])}') + X[:, dim] = (X[:, dim] - np.average(X[:, dim])) / np.std(X[:, dim]) + print(f'New mean/std for dim={dim}: {np.abs(np.round(np.average(X[:, dim])))}/{np.std(X[:, dim])}') + +# Make plot +colors = list(map(lambda x: colors[x], y)) +plt.scatter(X[:, dimension_one], X[:, dimension_two], c=colors) +plt.title(f'Visualizing dimensions {dimension_one} and {dimension_two}') +plt.xlabel(dimensions[dimension_one]) +plt.ylabel(dimensions[dimension_two]) +plt.show() +``` + +And indeed: + +``` +Old mean/std for dim=0: 5.843333333333334/0.8253012917851409 +New mean/std for dim=0: 0.0/1.0 +Old mean/std for dim=1: 3.0573333333333337/0.4344109677354946 +New mean/std for dim=1: 0.0/0.9999999999999999 +Old mean/std for dim=2: 3.7580000000000005/1.759404065775303 +New mean/std for dim=2: 0.0/1.0 +Old mean/std for dim=3: 1.1993333333333336/0.7596926279021594 +New mean/std for dim=3: 0.0/1.0 +``` + +### Computing the covariance matrix of your variables + +The next step is computing the covariance matrix for our dataset. + +> In probability theory and statistics, a covariance matrix (…) is a square matrix giving the covariance between each pair of elements of a given random vector. +> +> Wikipedia (2003) + +If you're not into mathematics, I can understand that you don't know what this is yet. Let's therefore briefly take a look at a few aspects related to a covariance matrix before we move on, based on Lambers (n.d.). + +**A variable:** such as \[latex\]X\[/latex\]. A mathematical representation of one dimension of the data set. For example, if \[latex\]X\[/latex\] represents \[latex\]\\text{petal width}\[/latex\], numbers such as \[latex\]1.19, 1.20, 1.21, 1.18, 1.16, ...\[/latex\] which represent the petal width for one flower can all be described by variable \[latex\]X\[/latex\]. + +**Variable mean:** the average value for the variable. Computed as the sum of all available values divided by the number of values summed together. As petal width represents `dim=3` in the visualization above, with a mean of \[latex\]\\approx 1.1993\[/latex\], we can see how the numbers above fit. + +**Variance:** describing the "spread" of data around the variable. Computed as the sum of squared differences between each number and the mean, i.e. the sum of \[latex\](x - \\mu)^2\[/latex\] for each number. + +**Covariance:** describing the _joint variability_ (or joint spread) of two variables. For each pair of numbers from both variables, covariance is computed as \[latex\]Cov(x, y) = (x - \\mu\_x)(y - \\mu\_y)\[/latex\]. + +**Covariance matrix for \[latex\]n\[/latex\] variables:** a matrix representing covariances for each pair of variables from some set of variables (dimensions) \[latex\]V = \[X, Y, Z, ....\]\[/latex\]. + +A covariance matrix for two dimensions \[latex\]X\[/latex\] and \[latex\]Y\[/latex\] looks as follows: + +\[latex\]\\begin{pmatrix}Cov(X, X) & Cov(X, Y)\\\\ Cov(Y, X) & Cov(Y, Y)\\end{pmatrix}\[/latex\] + +Fortunately, there are some properties which make covariance matrices interesting for PCA (Lambers, n.d.): + +- \[latex\]Cov(X, X) = Var(X)\[/latex\] +- \[latex\]Cov(X, Y) = Cov(Y, X)\[/latex\]. 
+ +By consequence, our covariance matrix is a symmetrical and square, \[latex\]n \\times n\[/latex\] matrix and can hence also be written as follows: + +\[latex\]\\begin{pmatrix}Var(X) & Cov(X, Y)\\\\ Cov(Y, X) & Var(Y)\\end{pmatrix}\[/latex\] + +We can compute the covariance matrix by generating a \[latex\]n \\times n\[/latex\] matrix and then filling it by iterating over its rows and columns, setting the value to the average covariance for each respective number from both variables: + +``` +# Compute covariance matrix +cov_matrix = np.empty((X.shape[1], X.shape[1])) # 4 x 4 matrix +for row in range(0, X.shape[1]): + for col in range(0, X.shape[1]): + cov_matrix[row][col] = np.round(np.average([(X[i, row] - np.average(X[:, row]))*(X[i, col]\ + - np.average(X[:, col])) for i in range(0, X.shape[0])]), 2) +``` + +If we compare our self-computed covariance matrix with one generated with NumPy's `np.cov`, we can see the similarities: + +``` +# Compare the matrices +print('Self-computed:') +print(cov_matrix) +print('NumPy-computed:') +print(np.round(np.cov(X.T), 2)) + +> Self-computed: +> [[ 1. -0.12 0.87 0.82] +> [-0.12 1. -0.43 -0.37] +> [ 0.87 -0.43 1. 0.96] +> [ 0.82 -0.37 0.96 1. ]] +> NumPy-computed: +> [[ 1.01 -0.12 0.88 0.82] +> [-0.12 1.01 -0.43 -0.37] +> [ 0.88 -0.43 1.01 0.97] +> [ 0.82 -0.37 0.97 1.01]] +``` + +### Decomposing the covariance matrix into eigenvectors and eigenvalues + +Above, we have expressed the spread of our dataset across the dimensions in our covariance matrix. Recall that PCA works by expressing this spread in terms of _vectors_, called eigenvectors, which together with their corresponding eigenvalues tell us something about the direction and magnitude of the spread. + +The great thing of EIG-PCA is that we can **decompose the covariance matrix into eigenvectors and eigenvalues.** + +We can do this as follows: + +\[latex\]\\mathbf C = \\mathbf V \\mathbf L \\mathbf V^\\top\[/latex\] + +Here, \[latex\]\\mathbf V\[/latex\] is a matrix of _eigenvectors_ where each column is an eigenvector, \[latex\]\\mathbf L\[/latex\] is a diagonal matrix with eigenvalues and \[latex\]\\mathbf V^\\top\[/latex\] is the transpose of \[latex\]\\mathbf V\[/latex\]. + +We can use NumPy's `numpy.linalg.eig` to compute the eigenvectors for this square array: + +``` +# Compute the eigenpairs +eig_vals, eig_vect = np.linalg.eig(cov_matrix) +print(eig_vect) +print(eig_vals) +``` + +This yields the following: + +``` +[[ 0.52103086 -0.37921152 -0.71988993 0.25784482] + [-0.27132907 -0.92251432 0.24581197 -0.12216523] + [ 0.57953987 -0.02547068 0.14583347 -0.80138466] + [ 0.56483707 -0.06721014 0.63250894 0.52571316]] +[2.91912926 0.91184362 0.144265 0.02476212] +``` + +![](images/eig.png) + +If we compute how much each principal dimension contributes to variance explanation, we get the following: + +``` +# Compute variance contribution of each vector +contrib_func = np.vectorize(lambda x: x / np.sum(eig_vals)) +var_contrib = contrib_func(eig_vals) +print(var_contrib) +print(np.sum(var_contrib)) +> [0.72978232 0.2279609 0.03606625 0.00619053] +> 1.0 +``` + +In other words, the first principal dimension contributes for 73%; the second one for 23%. If we therefore reduce the dimensionality to two, we get to keep approximately \[latex\]73 + 23 = 96%\[/latex\] of the variance explanation. 
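+
+When your dataset has many more dimensions, it is often easier to inspect these contributions cumulatively. A small, optional addition to the code above, reusing the `var_contrib` array that we just computed (the values in the comment simply follow from summing the contributions printed above):
+
+```
+# Cumulative variance explained by the first n principal directions
+# (roughly [0.730 0.958 0.994 1.000] for the example above)
+print(np.cumsum(var_contrib))
+```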
+ +### Sorting the eigenpairs in decreasing order of importance + +Even though the eigenpairs above have already been sorted, it's a thing we must definitely do - especially when you perform the decomposition in eigenpairs in a different way. + +Sorting the eigenpairs happens by eigenvalue: the eigenvalues must be sorted in a descending way; the corresponding eigenvectors must therefore also be sorted equally. + +``` +# Sort eigenpairs +eigenpairs = [(np.abs(eig_vals[x]), eig_vect[:,x]) for x in range(0, len(eig_vals))] +eig_vals = [eigenpairs[x][0] for x in range(0, len(eigenpairs))] +eig_vect = [eigenpairs[x][1] for x in range(0, len(eigenpairs))] +print(eig_vals) +``` + +This yields sorted eigenpairs, as we can see from the eigenvalues: + +``` +[2.919129264835876, 0.9118436180017795, 0.14426499504958146, 0.024762122112763244] +``` + +### Selecting n principal components + +Above, we saw that 96% of the variance can be explained by only two of the dimensions. We can therefore reduce the dimensionality of our feature space from \[latex\]n = 4\[/latex\] to \[latex\]n = 2\[/latex\] without losing much of the information. + +### Building the projection matrix + +The final thing we must do is generate the **projection matrix** and **project our original data onto the (two) principal components** (Raschka, 2015): + +``` +# Build the projection matrix +proj_matrix = np.hstack((eig_vect[0].reshape(4,1), eig_vect[1].reshape(4,1))) +print(proj_matrix) + +# Project onto the principal components +X_proj = X.dot(proj_matrix) +``` + +### Voilà, you've performed PCA + +If we now plot the projected data, we get the following plot: + +``` +# Make plot of projection +plt.scatter(X_proj[:, 0], X_proj[:, 1], c=colors) +plt.title(f'Visualizing the principal components') +plt.xlabel('Principal component 1') +plt.ylabel('Principal component 2') +plt.show() +``` + +![](images/vizprinc.png) + +That's it! You just performed Principal Component Analysis using Eigenvector Decomposition and have reduced dimensionality to two without losing much of the information in the dataset. + +* * * + +## PCA-SVD: Singular Value Decomposition with Python Step-by-Step + +Above, we covered performing Principal Component Analysis with Eigenvector Decomposition of the dataset's covariance matrix. A more numerically stable method is using **Singular Value Decomposition** on the data matrix itself instead of Eigenvector Decomposition on its covariance matrix. In this section, we'll cover the SVD approach in a step-by-step fashion, using Python. + +Note that here as well, we'll use a vanilla / native Python approach to performing PCA, since it brings more clarity. In the next section, we'll use the framework provided tools (i.e. `sklearn.decomposition.PCA`) instead of the native ones. + +### Starting with the standardized Iris dataset + +In the PCA-SVD approach, we also use the Iris dataset as an example. Using the code below, we'll load the Iris data and perform [standardization](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/), which means that your mean will become \[latex\]\\mu = 0.0\[/latex\] and your standard deviation will become \[latex\]\\sigma = 1.0\[/latex\]. 
+ +``` +from sklearn import datasets +import matplotlib.pyplot as plt +import numpy as np + +# Load Iris dataset +iris = datasets.load_iris() +X = iris.data +y = iris.target + +# Shape +print(X.shape) +print(y.shape) + +# Color definitions +colors = { + 0: '#b40426', + 1: '#3b4cc0', + 2: '#f2da0a', +} + +# Legend definition +legend = ['Iris Setosa', 'Iris Versicolour', 'Iris Virginica'] + +# Perform standardization +for dim in range(0, X.shape[1]): + print(f'Old mean/std for dim={dim}: {np.average(X[:, dim])}/{np.std(X[:, dim])}') + X[:, dim] = (X[:, dim] - np.average(X[:, dim])) / np.std(X[:, dim]) + print(f'New mean/std for dim={dim}: {np.abs(np.round(np.average(X[:, dim])))}/{np.std(X[:, dim])}') +``` + +### Performing SVD on the data matrix + +In the EIG variant of PCA, we computed the covariance matrix of our dataset, and then performed Eigenvector Decomposition on this matrix to find the eigenvectors and eigenvalues. We could then use these to sort the most important ones and project our dataset onto the most important ones, i.e. the principal components. + +In the SVD variant, we compute the **singular values** of the **data matrix** instead. It is a generalization of the Eigenvector Decomposition, meaning that it can also be used on non-square and non-symmetric matrices (which in the EIG case required us to use the covariance matrix, which satisfies both criteria). + +``` +# Compute SVD +u, s, vh = np.linalg.svd(X.T, full_matrices=True) +``` + +In SVD, we decompose a matrix into three components: + +- **Unitary arrays** \[latex\]U\[/latex\] +- **Vectors with the singular values** \[latex\]s\[/latex\] +- **Unitary arrays** \[latex\]vh\[/latex\] + +Here, the columns of the unitary arrays give results equal to the eigenvectors of the covariance matrix in the PCA-EIG approach, and the singular value vectors are equal to the square roots of the eigenvalues of the covariance matrix (StackExchange, n.d.). + +### Translating SVD outputs to usable vectors and values + +In other words, by performing SVD on the data matrix, we can create the same results as with the PCA-EIG approach. With that approach, the eigenvectors of the covariance matrix were as follows: + +``` +[[ 0.52103086 -0.37921152 -0.71988993 0.25784482] + [-0.27132907 -0.92251432 0.24581197 -0.12216523] + [ 0.57953987 -0.02547068 0.14583347 -0.80138466] + [ 0.56483707 -0.06721014 0.63250894 0.52571316]] +``` + +Now compare them to the output of `vh`: + +``` +print(vh) + +> [[ 0.52106591 -0.26934744 0.5804131 0.56485654] +> [-0.37741762 -0.92329566 -0.02449161 -0.06694199] +> [ 0.71956635 -0.24438178 -0.14212637 -0.63427274] +> [ 0.26128628 -0.12350962 -0.80144925 0.52359713]] +``` + +Except for the sign, the _columns_ of `vh` equal the _rows_ of the EIG-based eigenvectors. + +### Sorting eigenvalues and eigenvectors + +In the PCA-EIG scenario, you had to sort eigenpairs in descending order of the eigenvalues. `np.linalg.svd` already sorts in descending order, so this is no longer necessary. + +### Selecting n components + +Here, too, we can simply select \[latex\]n\[/latex\] components. As with the PCA-EIG scenario, here we also take \[latex\]n = 2\[/latex\] and hence reduce our dimensionality from 4 to 2. + +### Building the projection matrix + +We can now easily build the projection matrix as we did in the PCA-EIG case, project our data onto the principal components, and make a plot of the projection. 
+ +``` +# Build the projection matrix +proj_matrix = np.hstack((vh[0].reshape(4,1), vh[1].reshape(4,1))) +print(proj_matrix) + +# Project onto the principal components +X_proj = X.dot(proj_matrix) + +# Make plot of projection +colors = list(map(lambda x: colors[x], y)) +plt.scatter(X_proj[:, 0], X_proj[:, 1], c=colors) +plt.title(f'Visualizing the principal components') +plt.xlabel('Principal component 1') +plt.ylabel('Principal component 2') +plt.show() +``` + +The end result: + +![](images/pcasvd.png) + +It's the same! + +* * * + +## Easy PCA with Scikit-learn for real datasets + +In the previous two sections, we manually computed the principal components and manually projected our dataset onto these components - for the sake of showing you how stuff works. + +Fortunately, this task is not necessary when using modern Machine Learning libraries such as Scikit-learn. Instead, it provides the functionality for PCA out of the box, through `sklearn.decomposition.PCA`. Really easy! + +To be more precise, Scikit-learn utilizes PCA-SVD for computing the Principal Components of your dataset. Let's now take a look at how Scikit's approach works, so that you can finish this article both knowing how (1) PCA-EIG and PCA-SVD work (previous sections) and (2) how you can implement PCA pragmatically (this section). + +### Restarting with the Iris dataset + +Here, too, we start with the Iris dataset: + +``` +from sklearn import datasets +import matplotlib.pyplot as plt +import numpy as np +from sklearn.decomposition import PCA +from sklearn.preprocessing import StandardScaler + +# Load Iris dataset +iris = datasets.load_iris() +X = iris.data +y = iris.target +``` + +### Performing Scikit-learn based standardization + +As we could read in [another article](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/), Scikit-learn provides standardization out of the box through the `StandardScaler`, so we also implement it here: + +``` +# Standardize +scaler = StandardScaler() +scaler.fit(X) +X = scaler.transform(X) +``` + +### Performing sklearn.decomposition.PCA + +We can then easily implement PCA as follows. First, we initialize `sklearn.decomposition.PCA` and instruct it to extract two principal components (just like we did before) based on the Iris dataset (recall that `X = iris.data`): + +``` +# PCA +pca = PCA(n_components=2) +pca.fit(X) +``` + +We can then already print information about the analysis: + +``` +print(pca.explained_variance_ratio_) +print(pca.components_) +``` + +``` +> [0.72962445 0.22850762] +> [[ 0.52106591 -0.26934744 0.5804131 0.56485654] +> [ 0.37741762 0.92329566 0.02449161 0.06694199]] +``` + +We can see that our explained variance ratio is equal to the ones we found manually; that the same is true for the PCA components. + +We can now easily project the data onto the principal components with `.transform(X)`: + +``` +X = pca.transform(X) +``` + +Visualizing the data... + +``` +# Color definitions +colors = { + 0: '#b40426', + 1: '#3b4cc0', + 2: '#f2da0a', +} + +# Make plot of projection +colors = list(map(lambda x: colors[x], y)) +plt.scatter(X[:, 0], X[:, 1], c=colors) +plt.title(f'Visualizing the principal components with Scikit-learn based PCA') +plt.xlabel('Principal component 1') +plt.ylabel('Principal component 2') +plt.show() +``` + +...gives the following result: + +![](images/sklrn.png) + +Voila, precisely as we have seen before! 
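+
+As a side note: if you would rather not hard-code the number of components, Scikit-learn can also select it for you based on a target fraction of explained variance. A minimal sketch of this approach (the 0.95 threshold is just an example value):
+
+```
+from sklearn import datasets
+from sklearn.decomposition import PCA
+from sklearn.preprocessing import StandardScaler
+
+# Load and standardize the Iris data
+X = StandardScaler().fit_transform(datasets.load_iris().data)
+
+# A float between 0 and 1 tells PCA to keep just enough components
+# to explain at least that fraction of the variance
+pca = PCA(n_components=0.95)
+X_reduced = pca.fit_transform(X)
+print(X_reduced.shape)
+print(pca.explained_variance_ratio_)
+```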
+ +### Full PCA code + +The full code for performing the PCA with Scikit-learn on the Iris dataset is as follows: + +``` +from sklearn import datasets +import matplotlib.pyplot as plt +import numpy as np +from sklearn.decomposition import PCA +from sklearn.preprocessing import StandardScaler + +# Load Iris dataset +iris = datasets.load_iris() +X = iris.data +y = iris.target + +# Standardize +scaler = StandardScaler() +scaler.fit(X) +X = scaler.transform(X) + +# PCA +pca = PCA(n_components=2) +pca.fit(X) +print(pca.explained_variance_ratio_) +print(pca.components_) +X = pca.transform(X) + +# Color definitions +colors = { + 0: '#b40426', + 1: '#3b4cc0', + 2: '#f2da0a', +} + +# Make plot of projection +colors = list(map(lambda x: colors[x], y)) +plt.scatter(X[:, 0], X[:, 1], c=colors) +plt.title(f'Visualizing the principal components with Scikit-learn based PCA') +plt.xlabel('Principal component 1') +plt.ylabel('Principal component 2') +plt.show() +``` + +* * * + +## Summary + +In this article, we read about performing Principal Component Analysis on the dimensions of your dataset for the purpose of dimensionality reduction. Some datasets have many features and few samples, meaning that many Machine Learning algorithms will be struck by the curse of dimensionality. Feature extraction approaches like PCA, which attempt to construct a lower-dimensional feature space based on the original dataset, can help reduce this curse. Using PCA, we can attempt to recreate our feature space with fewer dimensions _and_ with minimum information loss. + +After defining the context for applying PCA, we looked at it from a high-level perspective. We saw that we can compute eigenvectors and eigenvalues and sort those to find the principal directions in your dataset. After generating a projection matrix for these directions, we can map our dataset onto these directions, which are then called the principal components. But _how_ these eigenvectors can be derived was explained later, because there are two methods for doing so: using Eigenvector Decomposition (EIG) and the more generalized Singular Value Decomposition (SVD). + +In two step-by-step examples, we saw how we can apply both PCA-EIG and PCA-SVD for performing a Principal Component Analysis. In the first case, we saw that we can compute a covariance matrix for the standardized dataset which illustrates the variances and covariances of its variables. This matrix can then be decomposed into eigenvectors and eigenvalues, which illustrate the direction and magnitude of the spread expressed by the covariance matrix. Sorting the eigenpairs, we can select the principal directions that contribute most to variance, generate the projection matrix and project our data. + +While PCA-EIG works well with symmetric and square matrices (and hence with our covariance matrix), it can be numerically unstable. That's why PCA-SVD is very common in today's Machine Learning libraries. In another step-by-step example, we looked at how the SVD can be used directly on the standardized data matrix for deriving the eigenvectors we also found with PCA-EIG. They can be used for generating a projection matrix which allowed us to arrive at the same end result as when performing PCA-EIG. + +Finally, knowing how PCA-EIG and PCA-SVD work, we moved to a Scikit-learn based implementation of Principal Component Analysis. Because why reinvent the wheel if good implementations are already available? 
Using the Scikit `StandardScaler` and `PCA` implementations, we performed the standardization and (SVD-based) PCA that we also performed manually, once again finding the same results. + +It's been a thorough read, that's for sure. Still, I hope that you have learned something. Please share this article or drop a comment in the comments section below if you find it useful 💬 Please do the same when you have additional questions, remarks or suggestions for improvement. Where possible, I'll respond as quickly as I can. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (n.d.). _Curse of dimensionality_. Wikipedia, the free encyclopedia. Retrieved December 3, 2020, from [https://en.wikipedia.org/wiki/Curse\_of\_dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) + +Wikipedia. (2004, November 17). _Feature selection_. Wikipedia, the free encyclopedia. Retrieved December 3, 2020, from [https://en.wikipedia.org/wiki/Feature\_selection](https://en.wikipedia.org/wiki/Feature_selection) + +Wikipedia. (2003, June 8). _Feature extraction_. Wikipedia, the free encyclopedia. Retrieved December 3, 2020, from [https://en.wikipedia.org/wiki/Feature\_extraction](https://en.wikipedia.org/wiki/Feature_extraction) + +Wikipedia. (2002, August 26). _Principal component analysis_. Wikipedia, the free encyclopedia. Retrieved December 7, 2020, from [https://en.wikipedia.org/wiki/Principal\_component\_analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) + +StackExchange. (n.d.). _Relationship between SVD and PCA. How to use SVD to perform PCA?_ Cross Validated. [https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca](https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca) + +Wikipedia. (n.d.). _Eigenvalues and eigenvectors_. Wikipedia, the free encyclopedia. Retrieved December 7, 2020, from [https://en.wikipedia.org/wiki/Eigenvalues\_and\_eigenvectors](https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors) + +Lambers, J. V. (n.d.). _PCA - Mathematical Background_. [https://www.math.usm.edu/lambers/cos702/cos702\_files/docs/PCA.pdf](https://www.math.usm.edu/lambers/cos702/cos702_files/docs/PCA.pdf) + +Raschka, S. (2015, January 27). _Principal component analysis_. Dr. Sebastian Raschka. [https://sebastianraschka.com/Articles/2015\_pca\_in\_3\_steps.html](https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html) + +StackExchange. (n.d.). _Why does Andrew Ng prefer to use SVD and not EIG of covariance matrix to do PCA?_ Cross Validated. [https://stats.stackexchange.com/questions/314046/why-does-andrew-ng-prefer-to-use-svd-and-not-eig-of-covariance-matrix-to-do-pca](https://stats.stackexchange.com/questions/314046/why-does-andrew-ng-prefer-to-use-svd-and-not-eig-of-covariance-matrix-to-do-pca) + +MachineCurve. (2020, November 19). _How to normalize or standardize a dataset in Python? – MachineCurve_. [https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) + +Wikipedia. (2003, March 4). _Covariance matrix_. Wikipedia, the free encyclopedia. Retrieved December 7, 2020, from [https://en.wikipedia.org/wiki/Covariance\_matrix](https://en.wikipedia.org/wiki/Covariance_matrix) + +NumPy. (n.d.). _Numpy.linalg.svd — NumPy v1.19 manual_. 
[https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html](https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html)
+
+StackExchange. (n.d.). _Understanding the output of SVD when used for PCA_. Cross Validated. [https://stats.stackexchange.com/questions/96482/understanding-the-output-of-svd-when-used-for-pca](https://stats.stackexchange.com/questions/96482/understanding-the-output-of-svd-when-used-for-pca)
diff --git a/introduction-to-isotropic-architectures-in-computer-vision.md b/introduction-to-isotropic-architectures-in-computer-vision.md
new file mode 100644
index 0000000..80474b3
--- /dev/null
+++ b/introduction-to-isotropic-architectures-in-computer-vision.md
@@ -0,0 +1,77 @@
+---
+title: "Introduction to Isotropic architectures in Computer Vision"
+date: "2021-11-07"
+categories:
+  - "deep-learning"
+tags:
+  - "computer-vision"
+  - "isotropic-architectures"
+  - "transformer"
+---
+
+If you are new to Deep Learning, or have worked with neural networks for some time, it's likely that you're familiar with Convolutional layers. These have been the standard building blocks for computer vision models since 2012, when AlexNet had a breakthrough and boosted the era of Deep Learning that we are still building on top of these days.
+
+Maybe you're also familiar with recent developments in Natural Language Processing - with Transformer based architectures in particular. Using a concept called self-attention, models can be taught to connect related linguistic concepts and hence better understand language. Generative architectures like the GPT series and understanding architectures like BERT have been notable developments in this space.
+
+Recently, however, Transformers have been adapted to the Vision domain. The concept of an **Isotropic architecture** has emerged from these developments. Isotropic architectures keep an equal size and shape for all elements throughout the network. Contrary to more pyramid-shaped architectures, recent research suggests that isotropic architectures may improve model performance, or even match state-of-the-art performance with much lighter components.
+
+In this article, we'll dive a bit deeper into isotropic architectures. What are they? How do they compare to classic pyramid-shaped Convolutional architectures? Let's take a look.
+
+* * *
+
+\[toc\]
+
+* * *
+
+## Standard building blocks for Computer Vision: Convolutional layers
+
+This is what happens within a standard 2D convolutional layer:
+
+![](images/CNN.jpg)
+
+A kernel with some `(width, height)` and `C` channels is slid from left to right and, while doing so, from top to bottom (i.e., convolved) over an input image with some width and height and `C` channels (recall that RGB images have `C=3`). In fact, multiple such kernels are convolved during one forward pass. For each step of this slide, a scalar output value is produced using element-wise multiplications.
+
+The output of a convolutional layer is a feature map: a 3D block with some height and width and `N` 'channels', which represent the result of each of the `N` kernels.
+
+By effectively summarizing regions of the image in a feature map, and then summarizing these in another layer, and another, and another, it's possible to learn connections _within_ regions and _between_ regions in an image. In other words: it then no longer matters whether the object is in a certain region... as the object gets detected anyway. This invariance to where an object appears is one of the strongest virtues of a Convolutional Neural Network.
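+
+To make the shape bookkeeping concrete, here is a minimal TensorFlow/Keras sketch. The 32x32 pixel input and the 8 filters are arbitrary example values, not something prescribed by any particular architecture:
+
+```
+import tensorflow as tf
+
+# One 2D convolutional layer: 8 kernels of size 3x3 are slid over a 32x32 RGB input
+inputs = tf.keras.Input(shape=(32, 32, 3))
+feature_maps = tf.keras.layers.Conv2D(filters=8, kernel_size=(3, 3))(inputs)
+
+# Prints (None, 30, 30, 8): height and width shrink, and we get 8 feature maps
+print(feature_maps.shape)
+```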
+ +### Pyramid structure + +Given what was mentioned above for every kernel - that kernels slide over the image, effectively outputting a scalar value and hence a summary of a region - it's easy to see that feature maps get smaller for every layer if you stack multiple layers together. + +This pyramid structure is also very common for Convolutional Neural Networks: the shape of the data changes downstream, as well as the size of the network. Previously, it was thought that such pyramid structures introduce an information bottleneck to your neural networks. But does it actually improve performance? Isotropic architectures change this way of thinking about neural networks. + +* * * + +## Transformers and Mixers in Computer Vision: Isotropic architectures + +If you have followed the field of Natural Language Processing in recent years, you must know about the fact that [Transformer architectures](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) have really boosted progress over there. These architectures, which combine an **encoding segment** with a **decoding segment** (or have one of them) with the concept of **self-attention** really changed the way language models have been created. + +I would suggest to click the link above to read more about what Transformers do. For now, it's enough to know that they have recently been applied in Computer Vision problems as well. For example, the Vision Transformer (Dosovitskiy et al., 2020) has reached state-of-the-art performance when pretrained and then finetuned using massive image datasets. In doing so, images have been divided into patches, and these patches (turned into an embedding) have been used as input for the Transformer architecture. + +The same is true for Mixer architectures like MLP-Mixer (Tolstikhin et al., 2021) and ConvMixer (of which the authors are yet unknown; see the reference at the bottom of this page), which _strip_ the Transformer architecture but rather keep the patches, to point at the question whether it's actually the image patches that ensure that SOTA performance can be reached. + +![](images/Diagram-32-1.png) + +Both Transformer architectures and Mixer architectures are part of the class of **isotropic architectures**. To understand what they are, let's take a look at what the word _isotropic_ means first: + +> (of an object or substance) having a physical property which has the same value when measured in different directions. +> (of a property or phenomenon) not varying in magnitude according to the direction of measurement. + +In other (simpler) words, when you take a look at the value going through an _isotropic_ network, it doesn't change in size. + +And precisely that is what an isotropic architecture is. Isotropic architectures do not produce pyramid shaped data transformations, but rather _fixed_ ones where data does not change in shape and size, like in the image below. + +![](images/image.png) + +The structure of a Mixer Layer from MLP-Mixer (Tolstikhin et al., 2021). As you can see, the input data (the various patches) are used in a variety of ways (primarily by Transposing them) but are not _changed_. In other words, the data size and shape is kept intact, and hence the architecture is isotropic rather than pyramidal. + +* * * + +## References + +Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., ... & Dosovitskiy, A. (2021). 
[Mlp-mixer: An all-mlp architecture for vision.](https://arxiv.org/abs/2105.01601) _arXiv preprint arXiv:2105.01601_. + +_Patches are all you need?_ (n.d.). OpenReview. [https://openreview.net/forum?id=TVHS5Y4dNvM](https://openreview.net/forum?id=TVHS5Y4dNvM) + +Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). [An image is worth 16x16 words: Transformers for image recognition at scale](https://arxiv.org/abs/2010.11929). _arXiv preprint arXiv:2010.11929_. diff --git a/introduction-to-transformers-in-machine-learning.md b/introduction-to-transformers-in-machine-learning.md new file mode 100644 index 0000000..ff95198 --- /dev/null +++ b/introduction-to-transformers-in-machine-learning.md @@ -0,0 +1,460 @@ +--- +title: "Introduction to Transformers in Machine Learning" +date: "2020-12-28" +categories: + - "deep-learning" +tags: + - "deep-learning" + - "natural-language-processing" + - "text-analysis" + - "transformer" + - "transformers" +--- + +When you talk about Machine Learning in Natural Language Processing these days, all you hear is one thing - Transformers. Models based on this Deep Learning architecture have taken the NLP world by storm since 2017. In fact, they are the go-to approach today, and many of the approaches build on top of the original Transformer, one way or another. + +Transformers are however not simple. The original Transformer architecture is quite complex and the same is true for many of the spin-off architectures. For this reason, we will take a look at the vanilla Transformer architecture proposed by Vaswani et al. back in 2017. It lies at the basis of exploring many other Transformer architectures on [this page](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/). It won't be maths-heavy, but rather intuitive, so that many people can understand what is going on under the hood of a vanilla Transformer. + +The article is structured as follows. First, we'll take a look at why Transformers have emerged in the first place - by taking a look at the problems of their predecessors, primarily [LSTMs](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/) and GRUs. Then, we're going to take a look at the Transformer architecture holistically, i.e. from a high level. This is followed by a more granular analysis of the architecture, as we will first take a look at the encoder segment and then at the decoder segment. Finally, we're going to cover how a Transformer can be trained. + +Ready? Let's go! 😎 + +* * * + +\[toc\] + +* * * + +## Why Transformers? + +Machine Learning in Natural Language Processing has traditionally been performed with recurrent neural networks. Recurrent, here, means that when a sequence is processed, the hidden state (or 'memory') that is used for generating a prediction for a token is also passed on, so that it can be used when generating the subsequent prediction. + +> A **recurrent neural network** (**RNN**) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs. +> +> Wikipedia (2005) + +Recurrent networks have been around for some time. One of the first ones was a simple or _vanilla_ recurrent network, or vanilla RNN. 
It is the top left image in the gallery below. As you can see, upon generating a prediction, the updated hidden state is passed to itself, so that it can be used in any subsequent prediction. When unfolded, we can clearly see how this works with a variety of input tokens and output predictions. + +While recurrent networks were able to boost the state-of-the-art in Natural Language Processing at the time, they also experienced a series of drawbacks / bottlenecks: + +1. Because of the way in which hidden states were passed, RNNs were highly sensitive to the [vanishing gradients problem](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/). Especially with longer sequences, the chain of gradients used for optimization can be so long that actual gradients in the first layers are really small. In other words, as with any network struck by vanishing gradients, the most upstream layers learn almost nothing. +2. The same is true for memory: the hidden state is passed to the next prediction step, meaning that most of the contextual information available is related to what the model has seen in the short term. With classic RNNs, models therefore face a long-term memory issue, in that they are good at short-term memory but very bad at longer-term memory. +3. Processing happens sequentially. That is, each word in a phrase has to be passed through the recurrent network, after which a prediction is returned. As recurrent networks _can_ be intensive in terms of the computational requirements, it can take a while before an output prediction is generated. This is an inherent problem with recurrent networks. + +Fortunately, in the 2010s, **[Long Short-Term Memory](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/)** networks (LSTMs, top right) and **Gated Recurrent Units** (GRUs, bottom) were researched and applied to resolve many of the three issues above. LSTMs in particular, through the cell like structure where memory is retained, are robust to the vanishing gradients problem. What's more, because memory is now maintained separately from the previous cell output (the \[latex\]c\_{t}\[/latex\] flow in the LSTM image below, for example), both are capable of storing longer-term memory. + +Especially when the **attention mechanism** was invented on top of it, where instead of the hidden state a weighted context vector is provided that weighs the outputs of all previous prediction steps, long-term memory issues were diminishing rapidly. The only standing problem remained that processing had to be performed sequentially, imposing a significant resource bottleneck on training a model for Natural Language Processing. + +- [![](images/2560px-Recurrent_neural_network_unfold.svg_.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/2560px-Recurrent_neural_network_unfold.svg_.png) + +- [![](images/1920px-LSTM_cell.svg_.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/1920px-LSTM_cell.svg_.png) + +- [![](images/2560px-Gated_Recurrent_Unit_base_type.svg_.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/2560px-Gated_Recurrent_Unit_base_type.svg_.png) + + +(Left top) A fully recurrent network. Created by [fdeloche](https://commons.wikimedia.org/wiki/User:Ixnay) at [Wikipedia](https://en.wikipedia.org/wiki/Recurrent_neural_network#/media/File:Recurrent_neural_network_unfold.svg), licensed as [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0). 
No changes were made. +(Right top) An LSTM cell. Created by [Guillaume Chevalier](https://commons.wikimedia.org/w/index.php?title=User:GChe&action=edit&redlink=1) (svg by Ketograff) at [Wikipedia](https://en.wikipedia.org/wiki/Long_short-term_memory#/media/File:LSTM_cell.svg), licensed as [CC BY 4.0](https://creativecommons.org/licenses/by/4.0). +(Bottom) A GRU cell. Created by [Jeblad](https://commons.wikimedia.org/wiki/User:Jeblad) at [Wikipedia](https://en.wikipedia.org/wiki/Gated_recurrent_unit#/media/File:Gated_Recurrent_Unit,_base_type.svg), licensed as [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0) (no changes made). + +* * * + +## What are Transformers? + +In a landmark work from 2017, Vaswani et al. claimed that [Attention is all you need](https://arxiv.org/abs/1706.03762) - in other words, that recurrent building blocks are not necessary in a Deep Learning model for it to perform really well on NLP tasks. They proposed a new architecture, the **Transformer**, which is capable of maintaining the attention mechanism while processing sequences in parallel: all words together rather than on a word-by-word basis. + +This architecture has obliterated the final issue from the three mentioned above, namely that sequences have to be processed sequentially, incurring a lot of computational cost. With Transformers, parallelism has become real. + +As we shall see in different articles, Transformer based architectures come in different flavors. Based off the traditional Transformer architecture, researchers and engineers have experimented significantly and brought about changes. However, the original Transformer architecture looks as follows: + +![](images/1_BHzGVskWGS_3jEcYYi6miQ-842x1024.png) + +Source: Vaswani et al. (2017) + +As we can see, it has two intertwined segments: + +- An **encoder segment**, which takes inputs from the source language, generates an embedding for them, encodes positions, computes where each word has to attend to in a multi-context setting, and subsequently outputs some intermediary representation. +- A **decoder segment**, which takes inputs from the target language, generates an embedding for them with encoded positions, computes where each word has to attend to, and subsequently combines encoder output with what it has produced so far. The outcome is a prediction for the next token, by means of a [Softmax](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) and hence argmax class prediction (where each token, or word, is a class). + +The original Transformer is therefore a classic sequence-to-sequence model. + +Do note that Transformers can be used for a variety of language tasks, ranging from natural language understanding (NLU) to natural language generation (NLG). For this reason, it can be the case that source and target languages are identical, but this is not _necessarily_ the case. + +If you're saying that you are a bit overwhelmed right now, I can understand. I had the same when I first read about Transformers. That's why we will now take a look at both the encoder and decoder segments individually, taking a close look at each and individual step. We're going to cover them as intuitively as we can, using the translators analogy. + +### The translators analogy + +Suppose that our goal is to build a language model capable of translating German text into English. In the classic scenario, with more classic approaches, we would learn a model which is capable of making the translation directly. 
In other words, we are teaching _one_ translator to translate German into English. This means that the translator needs to be able to speak both languages fluently, understand the relationships between words in the two languages, and so on. While this will work, it's not scalable.
+
+Transformers work differently because they use an encoder-decoder architecture. Think about it as if you're working with two translators. The first translator is capable of translating German into some intermediary, universal language. Another translator is capable of translating that language into English. At every translation task, you'll let translations pass through the intermediary language first. This will work as well as the classic approaches (in terms of whether the model yields any usable result). However, it is also scalable: we can use the intermediary language to train a model for summarizing text, for example. We don't need to train for the first translation task anymore.
+
+In different articles, we shall see that this pretraining and fine-tuning paradigm is very prevalent today, especially with the BERT-like architectures, which take the encoder segment from the original Transformer, pretrain it on a massive dataset and allow people to fine-tune it to various tasks themselves. However, for now, we'll stick to the original Transformer. In it, the \[latex\]\\text{German} \\rightarrow \\text{Intermediary language}\[/latex\] translation task would be performed by the encoder segment, in this analogy yielding the intermediary state as the _intermediary language_. The \[latex\]\\text{Intermediary language} \\rightarrow \\text{English}\[/latex\] translation task is then performed by the decoder segment.
+
+Let's now take a look at both segments in more detail.
+
+![](images/Diagram-1-1024x590.png)
+
+* * *
+
+## The encoder segment
+
+The encoder segment of a Transformer is responsible for converting inputs into some intermediary, high-dimensional representation. Visually, it looks as follows. The encoder segment is composed of a couple of individual components:
+
+- **Input Embeddings**, which convert tokenized inputs into vector format so that they can be used. The original work by Vaswani et al. (2017) utilizes [learned embeddings](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/), meaning that the token-to-vector conversion process is learned along with the main Machine Learning task (i.e. learning the sequence-to-sequence model).
+- **Positional Encodings**, which slightly change the vector outputs of the embedding layer, adding positional information to these vectors.
+- **The actual encoder segment**, which learns to output an attended representation of the input vectors, and is composed of the following sub-segments:
+    - The **multi-head attention segment**, which performs multi-head self-attention, adds the residual connection and then performs layer normalization.
+    - The **feed forward segment**, which generates the encoder output for each token.
+    - The encoder segment can be repeated \[latex\]N\[/latex\] times; Vaswani et al. (2017) chose \[latex\]N = 6\[/latex\].
+
+Let's now take a look at each of the encoder's individual components in more detail.
+
+![](images/Diagram-3.png)
+
+### From textual inputs to Input Embedding
+
+You'll train a Transformer with a textual dataset. As you would imagine, such a dataset consists of phrases (and often of pairs of phrases that correspond to each other).
For example, in an English-French dataset, the English phrase `I go to the store` would correspond to `Je vais au magasin` in French.
+
+#### Tokenization
+
+However, we cannot feed text to Machine Learning models - TensorFlow, for example, is a _numbers processing_ library, and optimization techniques also work with numbers.
+
+We hence have to find a way to express text in numbers. We can do this by means of **tokenization**, which allows us to express text as a list of integers. The `tf.keras` [Tokenizer](https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/text.py), for example, allows us to perform two things (Nuric, 2018):
+
+- _Generating a vocabulary based on text._ We start with an empty Python dictionary, `{}`, and slowly but surely fill it with each distinct word, so that e.g. `dictionary["I"] = 1`, `dictionary["go"] = 2`, and so on.
+- _Converting words into integers using the vocabulary._ Based on the vocabulary, which is obviously filled with a whole lot of words, we can convert phrases into integer-based sequences. For example, `I go to the store`, through `["I", "go", "to", "the", "store"]`, may become `[1, 2, 39, 49, 128]`. Obviously, the integers here are decided by how the vocabulary is generated.
+
+#### One-hot encoding words is not practical
+
+Suppose that we have generated a word index with a Tokenizer on 45.000 distinct words. We then have a Python dictionary with 45.000 keys, so `len(keys) = 45000`. The next step would be to tokenize each phrase from the dataset, so that for example `["I", "go", "to", "the", "store"]` becomes `[1, 2, 39, 49, 128]`, and `["I", "will", "go", "now"]` becomes `[1, 589, 2, 37588]`. The numbers here are arbitrary of course and determined by the Tokenizer.
+
+Because these variables are categorical, we must express them in a different way - e.g. by means of [one-hot encoding](https://www.machinecurve.com/index.php/2020/11/24/one-hot-encoding-for-machine-learning-with-tensorflow-and-keras/) (KDNuggets, n.d.). However, with very large word vocabularies, this is highly inefficient. For example, in our dictionary above, each token would be a 45.000-dimensional vector! Hence, for small vocabularies, one-hot encoding can be a good way of expressing text. For larger vocabularies, we need a different approach.
+
+#### Word embeddings
+
+However, we have a solution.
+
+We can use **word embeddings** in that case:
+
+> Word embedding is any of a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension.
+>
+> Wikipedia (2014)
+
+In other words, if we can learn to map our tokens to vectors, we can possibly find a unique vector for each word in a much lower-dimensional space. We can see this in the visualization below. For 10.000 words, it becomes possible to visualize them in a three-dimensional space (with only small information loss by virtue of the application of [PCA](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/)), whereas we would have used 10.000-dimensional vectors if we applied one-hot encoding.
[![](images/image-5-1024x648.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/image-5.png)

A plot from the Word2Vec 10K dataset, with three [principal components](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/) plotted in a three-dimensional space, using the [Embedding projector](http://projector.tensorflow.org/). The word 'routine' is highlighted.

#### Vanilla Transformers use learned input embeddings

Vanilla Transformers use a [learned input embedding layer](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/) (Vaswani et al., 2017). This means that the embedding is learned on the fly [rather than using a pretrained embedding](https://wikipedia2vec.github.io/wikipedia2vec/pretrained/), such as a pretrained Word2Vec embedding, which would also be an option. Learning the embedding on the fly ensures that each word can be mapped to a vector properly, improving effectiveness (no word is missed out).

The learned embedding produces vectors of dimension \[latex\]d\_{\\text{model}}\[/latex\], where Vaswani et al. (2017) set \[latex\]d\_{\\text{model}} = 512\[/latex\]. \[latex\]d\_{\\text{model}}\[/latex\] is also the dimensionality of the outputs of all the sub layers in the model.

![](images/Diagram-4-1.png)

According to Vaswani et al. (2017), the weight matrix is shared between the input embedding and output embedding layers, as well as with the pre-softmax linear transformation. In the embedding layers, the weights are also multiplied by \[latex\]\\sqrt{d\_{\\text{model}}}\[/latex\] for stability. Sharing weights is a design decision, which is not strictly necessary and _can_ even hurt performance, as illustrated by this answer:

> **The source and target embeddings can be shared or not**. This is a design decision. They are normally shared if the token vocabulary is shared, and this normally happens when you have languages with the same script (i.e. the Latin alphabet). If your source and target languages are e.g. English and Chinese, which have different writing systems, your token vocabularies would probably not be shared, and then the embeddings wouldn't be shared either.
>
> Ncasas (2020)

### Positional Encoding

Classic approaches to building models for language understanding or generation have benefited from the sequential order of processing. Because words had to be processed sequentially, models became aware of common orderings (such as `I am`) because the hidden state including `I` was always passed prior to processing `am`.

With Transformers, this is no longer the case, as we know that such models have no recurrent aspects, but use attention only. When an entire phrase is fed to a Transformer model, it is not necessarily processed in order, and hence the model is not aware of any positional order of the words within a phrase.

![](images/Diagram-5.png)

Using **positional encodings**, we add a vector indicating the relative position of a word to the word vectors generated by the embedding layer. This is a simple vector addition: \[latex\]\\textbf{v}\_{encoded} = \\textbf{v}\_{embedding} + \\textbf{v}\_{encoding}\[/latex\]. You can imagine this as a restructuring operation where common vectors are positioned more closely together.

Vaswani et al. (2017) use a maths based (more specifically a sine and cosine based) approach to positional encoding.
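A minimal NumPy sketch of this sine and cosine based encoding could look as follows. This is an illustration of the idea rather than the exact implementation from the paper; the sequence length of 50 is an arbitrary example value:

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sine/cosine positional encodings in the spirit of Vaswani et al. (2017)."""
    positions = np.arange(max_len)[:, np.newaxis]             # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                          # (max_len, d_model)
    encoding = np.zeros((max_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])               # even dimensions use a sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])               # odd dimensions use a cosine
    return encoding

# The encodings are simply added to the embedded tokens:
# position_encoded = embedded_tokens + positional_encoding(50, 512)[:sequence_length]
print(positional_encoding(50, 512).shape)  # (50, 512)
```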
By letting the position and dimension flow through a sine or cosine function depending on its oddity or evenness, we can generate positional encodings that we can use to position-encode the embeddings output. The outcome of this step is a vector which has much of the information of the embedding retained, but then with some information about relative positions (i.e. how words are related) added. + +### N times the encoder segment + +Generating the input embedding and applying positional encoding were the preparatory steps, allowing us to use textual data in our Transformer model. It's now time to look at the _actual_ encoder segment. + +We must note first that whatever we'll discuss here can be repeated \[latex\]N\[/latex\] times; stacked, if you will. When stacking encoders, the output of each encoder is used as input for the next encoder, generating an ever-more abstract encoding. While stacking encoders can definitely improve model performance through generalization, it is also computationally intensive. Vaswani et al. (2017) chose to set \[latex\]N = 6\[/latex\] and hence use 6 encoders stacked on top of each other. + +Each encoder segment is built from the following components: + +- A **multi-head attention block**. This block allows us to perform self-attention over each sequence (i.e., for each phrase that we feed the model, determine on a per-token (per-word) basis which other tokens (words) from the phrase are relevant to that token; thus where to attend to when reading that token/word). +- A **feed-forward block**. After generating attention for each token (word), we must generate a \[latex\]d\_{\\text{model}} \\text{-dimensional}\[/latex\] and thus 512-dimensional vector that encodes the token. The feed forward block is responsible for performing this. +- **Residual** **connections**. A residual connection is a connection that does not flow through a complex block. We can see two residual connections here: one flowing from the input to the first Add & Norm block; another one from there to the second block. Residual connections allow the models to optimize more efficiently, because technically speaking gradients can flow freely from the end of the model to the start. +- **Add & Norm blocks**. In these blocks, the output from either the Multi-head attention block or the Feed-forward block is merged with the residual (by means of addition), the result of which is subsequently layer normalized. + +While the inputs to an encoder segment are therefore either embedded and position-normalized tokens or the output from a previous encoder segment, an encoder therefore learns to generate a context-aware intermediate representation for the input tokens (the encoding). Through this context-awareness, achieved by the self-attention performed, Transformers can do their trick. + +![](images/Diagram-6.png) + +#### Multi-Head attention + +Let's now zoom into the individual components of the encoder segment in more detail. The first block that the input will flow through is the **multi-head attention block**. It is composed of multiple so-called **scaled dot-product attention** **blocks**, which we'll now take a closer look at. + +Visually, such a scaled block looks as follows (Vaswani et al., 2017): + +![](images/Diagram-7.png) + +Scaled Dot-Product Attention + +You can see that it has _three inputs_ - **queries** (Q), **keys** (K) and **values** (V). 
This means that the position-encoded input vectors are first split into three separate streams and hence matrices; we shall see that this happens by means of 3 different Linear layers. + +In Vaswani et al. (2017), these Q, K and V values are described as follows: + +> An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. +> +> Vaswani et al. (2017) + +However, a more intuitive description is provided by Dontloo (2019): + +> The key/value/query concepts come from retrieval systems. For example, when you type a query to search for some video on Youtube, the search engine will map your **query** against a set of **keys** (video title, description etc.) associated with candidate videos in the database, then present you the best matched videos (**values**). +> +> Dontloo (2019) + +I hope this allows you to understand better what the role of queries, keys and values in Transformers is. + +Important note: above, I write _vector_s and _matrices_, because all tokens are processed in parallel! This means that all the position-encoded input vectors are passed through the 3 Linear layers and hence form a matrix. It is important to understand here that they are processed jointly, to understand how self-attention through the score matrix works next. + +However, if we actually want to present the best matched videos, we need to identify the attention points - which videos are most relevant given some inputs? That's why in the image above, you see a `MatMul` operation between the queries and keys. It is a matrix multiplication where the query output is multiplied by the keys matrix to generate a **scores matrix.** + +![](images/Diagram-9.png) + +A score matrix can look as follows: + +![](images/Diagram-10-1.png) + +It illustrates the importance of certain words in a phrase given one word in a phrase in an absolute sense. However, they are not yet comparable. Traditionally, a [Softmax function](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) can be used to generate (pseudo-)probabilities and hence make the values comparable. + +However, if you take a look at the flow image above, you can see that prior to applying Softmax we first apply a scaling function. We apply this scaling because of the possible sensitivity of Softmax to vanishing gradients, which is what we don't want. We scale by dividing all values by \[latex\]\\sqrt{d\_k}\[/latex\], where \[latex\]d\_k\[/latex\] is the dimensionality of the queries and keys. + +We then compute the Softmax outputs, which immediately shows for a word which other words from the phrase are important in the context of that word. + +- [![](images/Diagram-11.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/Diagram-11.png) + +- [![](images/Diagram-12.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/Diagram-12.png) + + +The remaining step is matrix multiplying the scores matrix containing the attention weights with the _values_, effectively keeping the values for which the model has learned that they are most important. + +And this is how **self-attention** works, but then scaled - which is why Vaswani et al. (2017) call it **scaled dot-product self-attention**. 
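In code, the flow from the image above - matrix multiplication, scaling, Softmax and the final multiplication with the values - could be sketched as follows. This is a bare NumPy illustration without the Linear layers or the (optional) mask:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(queries, keys, values):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as in Vaswani et al. (2017)."""
    d_k = keys.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)  # MatMul between queries and keys, then scaling
    weights = softmax(scores)                 # per-token attention weights
    return weights @ values                   # weighted sum over the values

# Toy example: a phrase of 4 tokens with 64-dimensional queries, keys and values
q = k = v = np.random.rand(4, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # (4, 64)
```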
+ +##### Multiple heads + +However, the encoder block is called multi-head attention. What is this thing called multiple heads? Here you go, visually: + +![](images/Diagram-8-1.png) + +Multi-Head Attention + +By **copying the linear layers** that generate the queries, keys and values matrices, letting them have different weights, we can learn multiple representations of these queries, keys and values. + +In human language, you can visualize this as if you are looking at the same problem from different angles, rather than just one angle (i.e. the self-attention we just covered). By learning multiple representations, the Transformer becomes more and more context-aware. As you can see, the outputs of the linear layers are sent to separate scaled dot-product attention blocks, which output the importance of the values; these are concatenated and passed through a Linear layer again. + +Each individual combination of building blocks is called an **attention head**. Since multiple attention heads are present in one encoder segment, this block is called a **multi-head attention block**. It performs scaled dot-product attention for every block, then concatenates all the outputs and lets it flow through a Linear layer, which once again produces a 512-dimensional output value. + +Note that the dimensionality of every attention head is \[latex\]d\_\\text{model}/h\[/latex\] where \[latex\]h\[/latex\] is the number of attention heads. Vaswani et al. (2017) used a model dimensionality of 512, and used 8 parallel heads, so head dimensionality in their case was \[latex\]512/8 = 64\[/latex\]. + +#### Adding residual and Layer Normalization + +The output of the multi-head attention block is first added with the residual connection, which recall is the position-encoded input embedding for all the vectors. This is a simple addition operation. After adding, a [layer normalization](https://arxiv.org/abs/1607.06450) operation is performed on the outcome, before it is passed to the Feed-forward segment. Applying layer normalization stabilizes the training process, and adding the residual connection does the same. + +![](images/Diagram-13-771x1024.png) + +#### Feed Forward layer + +After the layer normalization has been completed, the inputs are passed to a set of Feed Forward layers. Each position (i.e. token) is passed through this network individually, according to Vaswani et al. (2017): it "is applied to each position separately and identically". Each Feed Forward network contains two Linear layers with one [ReLU activation function](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/) in between. + +![](images/Diagram-14-1.png) + +#### Adding residual and Layer Normalization + +In the case of the Feed Forward network, too, a residual is first branched off the input, for the sake of flowing gradients. It is added to the outputs of the Feed Forward network which are subsequently Layer Normalized. + +It is the final operation before the _encoded input_ leaves the encoder segment. It can now be used further (like in BERT, which we will cover in another article) or serve as the (partial) input for the decoder segment of the original Transformer, which we will cover now. + +* * * + +## The decoder segment + +Okay, so far we understand how the encoder segment works - i.e. how inputs are converted into an intermediate representation. Let's now take a look at the **decoder segment**. 
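Before we do, it may help to see the encoder block described above condensed into code. The sketch below is a simplified illustration using standard Keras layers - it omits details such as dropout and is not the exact implementation from Vaswani et al. (2017):

```python
import tensorflow as tf

def encoder_block(x: tf.Tensor, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048) -> tf.Tensor:
    """One simplified Transformer encoder block: multi-head self-attention and a feed forward network, each followed by Add & Norm."""
    # Multi-head self-attention over the inputs, then the residual connection and layer normalization
    attention = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
    attended = attention(query=x, value=x, key=x)
    x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + attended)

    # Position-wise feed forward network: two Dense layers with a ReLU in between
    feed_forward = tf.keras.Sequential([
        tf.keras.layers.Dense(d_ff, activation="relu"),
        tf.keras.layers.Dense(d_model),
    ])
    return tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + feed_forward(x))

# Toy example: one phrase of 10 position-encoded, 512-dimensional token vectors
tokens = tf.random.normal((1, 10, 512))
print(encoder_block(tokens).shape)  # (1, 10, 512)
```

With this recap in place, let's move on to the decoder.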
The decoder segment of a Transformer is responsible for converting the intermediary, high-dimensional representation into predictions for output tokens. Visually, it looks as follows. The decoder segment is composed of a couple of individual components:

- **Output Embeddings**, which convert tokenized outputs into vector format - just like the embeddings used for the inputs. The only difference here is that outputs are shifted right by one position. This, together with the masked multi-head attention segment, ensures that predictions for any position can only depend on the known outputs at positions before it (Vaswani et al., 2017). In other words, it is ensured that predictions depend on the past only, not on the future.
- **Positional Encodings**, which like the input positional encodings slightly change the vector outputs of the embedding layer, adding positional information to these vectors.
- The **actual decoder segment**, which is composed of the following sub segments:
    - The **masked multi-head attention segment**, which performs multi-head self-attention on the outputs, but does so in a masked way, so that positions depend on the past only.
    - The **multi-head attention segment**, which performs multi-head attention over a combination of the (_encoded_) inputs and the outputs, so that the model learns to correlate encoded inputs with desired outputs.
    - The **feed forward segment**, which processes each token individually.
- Finally, there is a **linear** layer which generates [logits](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/#logits-layer-and-logits) and a **Softmax** layer which generates [pseudoprobabilities](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/#logits-layer-and-logits). By taking the argmax value of this prediction, we know which token should be taken and added to the tokens already predicted.

Let's now take a look at each of the decoder's individual components in more detail.

![](images/Diagram-17-627x1024.png)

### Output Embedding

Like in the _encoder_, the inputs to the decoder segment are also **embedded** first. Of course, this happens with the _outputs_, which are the target phrases from the sentence pairs with which vanilla Transformers are trained. Here, too, learned embeddings are used, and Vaswani et al. (2017) share the weight matrix of both embedding layers, and the pre-Softmax linear layer visualized above.

![](images/Diagram-15.png)

### Positional Encoding

Exactly the same sine- and cosine-based [positional encoding](#positional-encoding) is performed in the decoder segment as in the encoder segment.

### N times the decoder segment

The first two elements of the decoder segment are equal in functionality to the first two elements of the encoder segment. Now we'll take a look at (a few) differences, because we're going to look at the **decoder segment** itself - which is also replicated \[latex\]N\[/latex\] times (with \[latex\]N = 6\[/latex\] in Vaswani et al.'s work).

The decoder segment is composed of three sub segments:

- A **masked multi-head attention** **segment**, where self-attention is applied to (masked) outputs, so that the model learns to which _previous_ tokens it must attend given some token.
- A **multi-head attention segment**, where attention is applied to the encoded inputs (serving as keys and values) and the combination of masked multi-head attention outputs / input residual (serving as queries), being the gateway where encoded inputs and target outputs are merged.
- A **feedforward segment**, which is applied position-wise to each token passed along.

Finally, there is a small additional appendix - a **linear layer** and a **Softmax activation function**. These will take the output of the decoder segment and transform it into a logits output (i.e. a value based output for each of the tokens in the vocabulary) and a [pseudoprobability output](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) which assigns probabilities to each of the possible token outputs given the logit values. By simply taking the \[latex\]\\text{argmax}\[/latex\] value from these outputs, we can identify the word that is the most likely prediction here.

We'll take a look at all these aspects in more detail now.

#### Masked Multi-Head Attention

The first sub segment to which the position-encoded embedded input is fed is called the **masked multi-head attention segment**. It is, for the most part, a regular attention segment with one irregularity:

![](images/Diagram-18.png)

It is regular in the sense that here too, we have queries, keys and values. The queries and keys are matrix multiplied, yielding a score matrix, which is then combined with the values matrix in order to apply self-attention to the target values, i.e. determine which of the output values are most important.

In other words, the flow is really similar to the flow of the multi-head attention segment in the encoder:

![](images/Diagram-7.png)

Except for one key difference: this segment is part of the _decoder_, which is responsible for predicting which target must be output next.

And if I'm constructing a phrase, as a human being, I cannot rely on all future words for producing the next word. Rather, I can only rely on the words that I have produced before. This is why the _classic_ multi-head attention block does not work in the decoder segment, because the same thing applies here as well: when predicting a token, the decoder should not be able to be aware of future outputs (and especially their attention values), for the simple reason that it would otherwise be able to glimpse into the future when predicting for the present.

The flow above will therefore not work and must be adapted. Vaswani et al. (2017) do so by adding a mask into the flow of the multi-head attention layer, making it a **masked** multi-head attention layer.

But what is this _mask_ about?

![](images/Diagram-19.png)

Recall that the matrix multiplication (_MatMul_) between queries and keys yields a score matrix, which is scaled and then put through a Softmax layer. When this happens, we get (conditional) pseudoprobabilities for each token/word that tell us something about the word importance _given another word_ (or token). But as you can see, this is problematic if we don't want to look into the future: if we are predicting the next token after `I`, which should be `am`, we don't want to know that `doing ok` comes after it; humans simply don't know this when they are producing words on the fly.

- ![](images/Diagram-9.png)

- ![](images/Diagram-12.png)

That's why a **mask** is applied to the scaled score matrix prior to generating pseudoprobabilities with Softmax.
That is, if this is our score matrix...

![](images/Diagram-10-1.png)

...we apply what is known as a **look-ahead mask**. It is a simple matrix addition: we add another matrix to the scores matrix, where values are either zero or minus infinity. As you can see, all values that may be visible for a token (i.e. all previous values) are set to zero, so they remain the same. The others (Vaswani et al. (2017) call them _illegal connections_) are combined with minus infinity and hence yield minus infinity as the value.

![](images/Diagram-20-1024x282.png)

If we then apply Softmax, we can see that the importance for all values that _lie in the future_ is set to zero. They're no longer important. When masked, the model learns to attend to values from the past only when predicting for the present. This is a very important characteristic that allows Transformers to generalize to unseen data better.

![](images/Diagram-21.png)

#### Adding residual and Layer Normalization

As is common in the Transformer architecture, the masked multi-head attention segment also makes use of **residuals** and **layer normalization**. In other words, a residual connecting the input embedding to the addition layer is added, combining the output of the masked multi-head attention segment with the original position-encoded output embedding. This allows gradients to flow more freely, benefiting the training process. Layer normalization stabilizes the training process further, yielding better results.

#### Regular Multi-Head Attention with Encoder Output

The second sub-segment in the decoder segment is the **multi-head attention segment**. This is a regular multi-head attention segment which computes a non-masked score matrix between queries and keys and then applies it to the values, yielding an attention-based outcome.

Contrary to the encoder segment, which computes self-attention over the inputs, this segment performs attention slightly differently. The keys and values are based on the output of the _encoder segment_, while the queries are based on the (masked and attended) outputs generated by the block below. In other words, the scores for putting attention on certain words are determined by the _inputs_ that have been encoded before, in combination with the target produced so far.

And this makes a lot of sense, because as we shall see, vanilla Transformers are trained on datasets with pairs in different languages (Vaswani et al., 2017). For example, if the goal is to translate `I am doing okay` into German, attention between languages is somewhat similar, and hence attention generated from the encoded input can be used for generating a decoder prediction, actually spawning sequence-to-sequence abilities for a Transformer model.

That this actually happens can also be seen in the figures below: the queries, which are generated from the masked multi-head attention segment and the residual combined previously, are matrix multiplied with the keys taken from the encoder output to form the score matrix, which is then applied to the values - also taken from the encoder output. In other words, this segment combines encoder output with target output, and hence generates the ability to make the 'spillover' from source language into target language (or, more generally, from source text into target text).

- [![](images/Diagram-22.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/Diagram-22.png)

- [![](images/Diagram-7.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/Diagram-7.png)

#### Adding residual and Layer Normalization

Here, too, we add the residual and perform Layer Normalization before we move forward.
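As a short recap before the feed forward layer, the look-ahead mask described above can be sketched in a few lines of NumPy. The 4-token score matrix is random here, purely for illustration:

```python
import numpy as np

def look_ahead_mask(size: int) -> np.ndarray:
    """0 for positions a token may attend to (itself and the past), minus infinity for future positions."""
    future = np.triu(np.ones((size, size)), k=1)   # 1s above the diagonal mark the 'illegal connections'
    return np.where(future == 1, -np.inf, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Add the mask to a toy 4x4 score matrix before applying Softmax
scores = np.random.rand(4, 4)
print(softmax(scores + look_ahead_mask(4)))  # attention to future tokens is exactly zero
```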
#### Feed Forward layer

As in the encoder, a Feed Forward network composed of two linear layers and a [ReLU activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) is applied position-wise.

![](images/Diagram-14-1.png)

#### Adding residual and Layer Normalization

The results of this network are added to another residual and subsequently a final Layer Normalization operation is performed.

#### Generating a token prediction

After the residual has been added and the layer has been normalized (visible in the figure as **Add & Norm**), we can start working towards the actual prediction of a token (i.e., a word). This is achieved by means of a linear layer and a Softmax activation function. In this linear layer, which shares the weight matrix with the embedding layers, logits are generated - i.e. the importance of each token given the encoded inputs and the decoded outputs. With a [Softmax function](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/), we can generate output (pseudo)probabilities for all the tokens in our vocabulary.

Selecting the token prediction is then really simple. By taking the maximum argument (\[latex\]\\text{argmax}\[/latex\]) value, we can select the token that should be predicted next given the inputs and outputs sent into the model.

![](images/Diagram-23.png)

Et voilà, that's the architecture of a vanilla Transformer!

* * *

## Training a Transformer

Vanilla Transformers are so-called **sequence-to-sequence models**, [converting input sequences into target sequences](https://www.machinecurve.com/index.php/2020/12/21/from-vanilla-rnns-to-transformers-a-history-of-seq2seq-learning/). This means that they should be trained on bilingual datasets if the task is machine translation.

For example, Vaswani et al. (2017) have trained the vanilla Transformer on the WMT 2014 English-to-German translation dataset, i.e. training for a translation task.

The training set of this dataset has 4.5 million pairs of phrases (Stanford, n.d.):

![](images/image-6-1024x380.png)

All phrases have corresponding ones in German or at least German-like text:

![](images/image-7-1024x393.png)

* * *

## Summary

Transformers are taking the world of Natural Language Processing by storm. But their architectures are relatively complex and it takes quite some time to understand them sufficiently. That's why in this article we have looked at the architecture of vanilla Transformers, as proposed by Vaswani et al. in a 2017 paper.

This architecture, which lies at the basis of all Transformer related activities today, has solved one of the final problems in sequence-to-sequence models: that of sequential processing. No recurrent segments are necessary anymore, meaning that networks can benefit from parallelism, significantly boosting the training process. In fact, today's Transformers are trained with millions of sequences, if not more.

To provide the necessary context, we first looked at what Transformers are and why they are necessary. We then moved forward looking at the encoder and decoder segments.

We saw that in the encoder segment, inputs are first passed through a (learned) input embedding, which converts integer based tokens into vectors having lower dimensionality.
These are then position encoded by means of sine and cosine functions, to add information about the relative position of tokens into the embedding - information naturally available in traditional models due to the sequential nature of processing, but now lost given the parallelism. After these preparation steps, the inputs are fed to the encoder segment, which learns to apply self-attention. In other words, the model itself learns what parts of a phrase are important when a particular word is looked at. This is achieved by multi-head attention and a feedforward network.

The decoder segment works in a similar way, albeit a bit differently. First of all, the outputs are embedded and position encoded, after which they are also passed through a multi-head attention block. This block however applies a look-ahead mask when generating the scores matrix, to ensure that the model cannot look at words down the line when predicting a word in the present. In other words, it can only use past words in doing so. Subsequently, another multi-head attention block is added, combining the encoded inputs (serving as keys and values) with the attended outputs (serving as queries). This combination is passed to a feedforward segment, which finally allows us to generate a token prediction by means of an additional Linear layer and a Softmax activation function.

Vanilla Transformers are trained on bilingual datasets if they are used for translation tasks. An example of such datasets is the WMT 2014 English-to-German dataset, which contains English and German phrases; it was used by Vaswani et al. (2017) for training their Transformer.

[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/)

Transformers have become prominent architectures since 2017 and are continuously being researched today. I hope that this article has helped you gain a better understanding of why they improve traditional approaches and, more importantly, how they work. If you have any questions, please feel free to ask them in the comments section below 💬 You can also click the **Ask Questions** button on the right. Please feel free to drop a message as well if you have comments or wish to put forward suggestions for improvement. I'm looking forward to hearing from you! 😎

Thank you for reading MachineCurve today and happy engineering!

* * *

## References

Wikipedia. (2005, April 7). _Recurrent neural network_. Wikipedia, the free encyclopedia. Retrieved December 23, 2020, from [https://en.wikipedia.org/wiki/Recurrent\_neural\_network](https://en.wikipedia.org/wiki/Recurrent_neural_network)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008.

Nuric. (2018). _What does Keras Tokenizer method exactly do?_ Stack Overflow. [https://stackoverflow.com/a/51956230](https://stackoverflow.com/a/51956230)

KDNuggets. (n.d.). _Data representation for natural language processing tasks_. KDnuggets. [https://www.kdnuggets.com/2018/11/data-representation-natural-language-processing.html](https://www.kdnuggets.com/2018/11/data-representation-natural-language-processing.html)

Wikipedia. (2014, August 14). _Word embedding_. Wikipedia, the free encyclopedia. Retrieved December 24, 2020, from [https://en.wikipedia.org/wiki/Word\_embedding](https://en.wikipedia.org/wiki/Word_embedding)

Ncasas. (2020).
_Weights shared by different parts of a transformer model_. Data Science Stack Exchange. [https://datascience.stackexchange.com/a/86363](https://datascience.stackexchange.com/a/86363) + +Dontloo. (2019). _What exactly are keys, queries, and values in attention mechanisms?_ Cross Validated. [https://stats.stackexchange.com/a/424127](https://stats.stackexchange.com/a/424127) + +Wikipedia. (2002, October 22). _Matrix multiplication_. Wikipedia, the free encyclopedia. Retrieved December 24, 2020, from [https://en.wikipedia.org/wiki/Matrix\_multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication) + +Stanford. (n.d.). _The Stanford natural language processing group_. The Stanford Natural Language Processing Group. [https://nlp.stanford.edu/projects/nmt/](https://nlp.stanford.edu/projects/nmt/) diff --git a/intuitive-introduction-to-bert.md b/intuitive-introduction-to-bert.md new file mode 100644 index 0000000..16b2448 --- /dev/null +++ b/intuitive-introduction-to-bert.md @@ -0,0 +1,229 @@ +--- +title: "Intuitive Introduction to BERT" +date: "2021-01-04" +categories: + - "deep-learning" +tags: + - "bert" + - "language-model" + - "natural-language-processing" + - "transformer" + - "transformers" +--- + +Transformers are taking the world of NLP by storm. After being introduced in Vaswani et al.'s _Attention is all you need_ work back in 2017, they - and particularly their self-attention mechanism requiring no recurrent elements to be used anymore - have proven to show state-of-the-art performance on a wide variety of language tasks. + +Nevertheless, what's good can still be improved, and this process has been applied to Transformers as well. After the introduction of the 'vanilla' Transformer by Vaswani and colleagues, a group of people at OpenAI have [used just the decoder segment](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/) and built a model that works great. However, according to Devlin et al., the authors of a 2018 paper about pretrained Transformers in NLP, they do one thing wrong: the attention that they apply is [unidirectional](https://www.machinecurve.com/index.php/question/what-are-unidirectional-language-models/). + +This hampers learning unnecessarily, they argue, and they proposed a bidirectional variant instead: BERT, or **Bidirectional Encoder Representations from Transformers**. It is covered in this article. Firstly, we'll briefly take a look at _finetuning-based approaches in NLP_, which is followed by BERT as well. It is necessary to get sufficient context for reading about how BERT works: we'll cover both the architecture i.e. the _what_ and how BERT is trained i.e. the _why_. This includes a detailed look at how the inputs to a BERT model must be constructed. + +You'll take away from this article: + +- Understanding how fine-tuning approaches are different from feature-based approaches. +- How inputs to BERT are structured. +- How BERT works. + +Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## Finetuning-based approaches in NLP + +A BERT Transformer follows the so-called **finetuning-based approach** in Natural Language Processing. It is different than the **feature-based approach**, which is also used commonly, and more thoroughly in older language models or models that haven't been pretrained. + +Because pretraining is tightly coupled to finetuning, in the sense that they are very much related. 
Let's take a look at this approach in more detail and then compare it with the feature-based approach also mentioned above.

If you take a look at the image below, you'll see a schematic representation of a finetuning-based approach in NLP. Note that we will be using the same model architecture and often the same model for all the tasks, visualized in green. The yellow blocks represent model states, and specifically the state of the weights when we're talking about neural networks.

Of course, we start with a neural network that has been initialized pseudorandomly. We'll then train it using an unlabeled corpus, which is often big. The task performed is often language modeling: is the predicted next token actually the next token? It allows us to use large, unlabeled datasets to train a model that can detect very generic linguistic patterns in text: the _pretrained model_.

We do however often want to create a machine learning model that can perform one task really well. This is where _finetuning_ comes in: using a labeled corpus, which is often smaller, we can then train the pretrained model further, with an additional or replacing NLP task. The end result is a model that has been pretrained on the large unlabeled corpus and which is finetuned to a specific language task, such as summarization, text generation in a particular domain, or translation.

[![](images/Diagram-39-1024x436.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/Diagram-39.png)

Finetuning-based approaches are different from feature-based approaches, which use pretrained models to generate features that are then used as features in a separate model. In other words, with finetuning, we keep training the same model all the time - so all weights can be trained jointly on the downstream task - whereas in a feature-based approach we chain two models together in a pipeline.

Performing pretraining allows us to use unlabeled datasets. This is good news, because labeling data is expensive, and by consequence most datasets that are labeled are small. Training a machine learning model however requires large datasets for sufficient generalization. Combining unlabeled and labeled data into a **semi-supervised approach**, with pretraining and finetuning, allows us to benefit from the best of both worlds.

Let's now take a look at how BERT utilizes finetuning for achieving significant capabilities on language tasks.

* * *

## How BERT works: an introduction

BERT was introduced in a 2018 paper by Devlin et al. called _Bert: Pre-training of deep bidirectional transformers for language understanding._ BERT, which stands for **B**idirectional **E**ncoder **R**epresentations from **T**ransformers, is today [widely used within Google Search](https://blog.google/products/search/search-language-understanding-bert/), to give just one example.

Let's take a look at how it works. Firstly, we'll cover the _why_: we will see that BERT was proposed to overcome the issue of unidirectionality in previous Transformer approaches, such as GPT. Then, we'll take a look at the Transformer encoder segment, which BERT borrows from the original Transformer proposed by Vaswani et al. (2017). Finally, we'll take a look at how BERT is pretrained, as well as how it can be finetuned across many language understanding tasks.

### Why BERT?

One of the first questions that I had when reading the BERT paper was "why"? Why BERT? What makes it better than other approaches, such as the vanilla Transformer proposed by Vaswani et al.
(2017) or the [GPT model](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/) which utilizes the decoder segment of the original Transformer together with pretraining?

The authors argue as follows:

> We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training.
>
> Devlin et al. (2018)

What does this mean? We'll have to briefly take a look at e.g. the GPT model to see why **unidirectional models** can underperform.

From our [article about GPT](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/): "The input is then served to a masked multi-head attention segment, which computes self-attention in a unidirectional way. Here, the residual is added and the result is layer normalized."

Indeed, GPT (which uses the Transformer decoder segment [autoregressively during pretraining](https://www.machinecurve.com/index.php/2020/12/29/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning/)) and the original Transformer (which performs [Seq2Seq](https://www.machinecurve.com/index.php/2020/12/29/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning/)) apply a mask in one of the attention modules - the _masked multi-head self-attention subsegment_ in the decoder segment.

For any token, this mask sets the values for all future tokens to minus infinity, as can be seen in the example below. For example, for the token "am", "doing ok" is set to minus infinity, so that after applying [Softmax activation](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) the attention to future tokens is zero. This ensures that the model cannot peek at tokens that lie in the future.

![](images/Diagram-20-1024x282.png)

![](images/Diagram-21.png)

We call this a **left-to-right model** because attention is applied in a left-to-right fashion: only words to the left of a token are attended to, whereas tokens to the right are ignored. As this is one direction, we call such models _unidirectional_. Devlin et al. (2018) argue that this is suboptimal or even harmful during finetuning:

> For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (…) Such restrictions are sub-optimal for sentence-level tasks, and could be very harmful when applying finetuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.
>
> Devlin et al. (2018)

The _why_ is related to the context that is provided to a token during processing. During pretraining, unidirectionality in language models is not of much concern, given the training task performed by GPT during pretraining ("given all previous tokens, predict the next one" - a strict left-to-right or right-to-left task, depending on the language).

During finetuning, the problem becomes clearer. If we want to finetune to a specific task, not only do previous tokens become important, but _future_ tokens with respect to some token will be too. Let's draw a human analogy.
If our task is to summarize, we'll first read the text once, which can be compared to the "pretraining step" - because your brain effectively guesses which token (i.e. word) comes next based on what you've read so far. + +However, if your finetuning task is then to learn to generate a summary for the particular text, you won't read the text in a left-to-right fashion and then write the summary. Rather, you'll read back and forth, compare context from the past with context from 'the future', i.e. the left and the right, and then write your summary. + +_That_ is why Devlin et al. (2018) argue why previous Transformers underperform given what they should be capable of: the masked self-attention layer is not suitable for many finetuning tasks, at least intuitively. And they set out to prove their idea: the creation of a **bidirectional** language model, where token attention is generated in a bidirectional way. + +Say hello to BERT :) + +### Transformer encoder segment + +First: the architecture. Understanding BERT requires you to [understand the Vanilla Transformer first](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/), because BERT utilizes the encoder segment of the original Transformer as the architecture. + +This segment looks as follows: + +![](images/Diagram-6.png) + +It has two subsegments: + +- The **multi-head attention segment**, which computes self-attention over the inputs, then adds back the residual and layer normalizes everything. The attention head can be split into multiple segments, hence the name _multi-head_. + - The multi-head attention segment differentiates itself from the _masked_ multi-head attention segment [used by the GPT model](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/) and is why Devlin et al. (2018) propose BERT. It's exactly the same, except for the mask. In other words, this is how bidirectionality is added to the self-attention mechanism. +- The **feedforward segment**, which is applied to each individual input, after which the residual and layer normalization is performed once again. + +BERT specifically comes in two flavors: + +- **BERT base** (\[latex\]\\text{BERT}\_\\text{BASE}\[/latex\]), which has 12 Encoder Segments stacked on top of each other, has 768-dimensional intermediate state, and utilizes 12 attention heads (with hence 768/12 = 64-dimensional attention heads). +- **BERT large** (\[latex\]\\text{BERT}\_\\text{LARGE}\[/latex\]), which has 24 Encoder Segments, 1024-dimensional intermediate state, and 16 attention heads (64-dimensional attention heads again). + +The models are huge: the BERT base model has 110 million trainable parameters; the BERT large model has 340 million (Devlin et al., 2018). In comparison, classic vanilla ConvNets have hundreds of thousands [to a few million](https://www.machinecurve.com/index.php/2020/01/31/reducing-trainable-parameters-with-a-dense-free-convnet-classifier/). Training them hence requires a lot of resources: in the case of [GPT](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/), it's not strange to find that they have to be pretrained for a month using massive machinery, whereas fine-tuning is more cost efficient. + +Speaking about training, let's now take a look at how BERT is actually trained. 
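Before we do, a quick aside: these sizes are easy to verify with the HuggingFace `transformers` library. This library is not part of the original work and is only used here as an illustration - the snippet assumes it is installed and downloads the pretrained checkpoint:

```python
from transformers import BertModel

# Load the pretrained BERT base model and count its parameters
model = BertModel.from_pretrained("bert-base-uncased")
print(model.num_parameters())  # roughly 110 million, in line with Devlin et al. (2018)
```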
### Data flow through a BERT model: from inputs to outputs

When discussing the pretraining of a language model like BERT, we must take a look at two things: the _task_ that is used during pretraining, as well as the _datasets_ used for pretraining.

With respect to the task, we must actually say _tasks_: during pretraining of BERT, the model is trained for a combination of two tasks. The first task is a **Masked Language Model** (MLM) task and the second one is a **Next Sentence Prediction** (NSP) task.

We'll cover the tasks themselves later. If we want to understand them, we must first take a look at what the input to BERT looks like. Put simply: each input to a BERT model is either a whole sentence or two sentences packed together. All the words are tokenized and (through BERT) converted into a [word embedding](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/).

Visually, this input looks as follows. Below the image, we'll cover each component in plain English.

![](images/Diagram-44-1024x625.png)

The bottom row represents an array with a variety of tokens. These tokens are separated into four elements:

- A **\[CLS\]** token, which is the "classification token". It signals that a new combination is input _and_ its output value can later be used for sentence-level predictions during fine-tuning, as it will learn to contain sentence-level information.
- A **Tok 1 ... Tok N** ordered list of tokens, containing the tokens from the first sentence.
- A **\[SEP\]** token which separates two sentences, if necessary.
- A **Tok 1 ... Tok M** ordered list of tokens, containing the tokens from the second sentence.

#### From tokens to embeddings

The first component of the encoder segment (not visualized in the image above) is a word embedding. Word embeddings allow us to convert tokenized textual inputs (i.e., integers representing tokens) into a vector-based format, which decreases dimensionality and hence improves representation. What's more, we can also embed similar words closely together, which is not possible with other approaches such as one-hot encoding.

BERT utilizes WordPiece embeddings for this purpose, with a 30.000-token vocabulary (Devlin et al., 2018). The whole sequence from **\[CLS\]** to the final **Tok M** is first embedded within the Transformer, all at once.

#### BERT outputs

BERT utilizes the encoder segment, meaning that it outputs some vector \[latex\]T\_i\[/latex\] for every token. The first vector, \[latex\]T\_0\[/latex\], is also called \[latex\]C\[/latex\] in the BERT paper: it is the "class vector" that contains sentence-level information (or, in the case of multiple sentences, information about the sentence pair). All other vectors are vectors representing information about the specific token.

#### Using outputs in language tasks

In other words, structuring BERT this way allows us to perform **sentence-level tasks** and **token-level tasks**. If we use BERT and want to work with sentence-level information, we build on top of the \[latex\]C\[/latex\] vector. If we want to perform tasks related to tokens only, we can use the individual token vectors. It's a really awesome way to add versatility to a machine learning model.
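As a quick illustration of this input structure, a pretrained BERT tokenizer (again taken from the HuggingFace `transformers` library, which is only used here as an example) produces exactly the \[CLS\] ... \[SEP\] ... \[SEP\] layout described above when it is fed a sentence pair:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A sentence pair, as used for e.g. the Next Sentence Prediction task covered below
encoded = tokenizer("I went to the store.", "I bought some milk.")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'i', 'went', 'to', 'the', 'store', '.', '[SEP]', 'i', 'bought', 'some', 'milk', '.', '[SEP]']
```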
### Pre-training step

Now that we understand how inputs are represented in BERT, we can revisit the tasks used for pretraining the model. The first task is a Masked Language Modeling task, or MLM. The second is a Next Sentence Prediction task, or NSP.

Both are necessary for the model to generalize well across a wide set of tasks during fine-tuning: the Masked Language Modeling task, as we shall see, will allow us to learn token-level information and hence information specific to arbitrary words. The Next Sentence Prediction task, however, will allow the model to learn sentence-level information through the \[latex\]C\[/latex\] vector.

Let's now take a look at each of the tasks in more detail.

#### Task 1: Masked Language Modeling (MLM)

The first task performed during pretraining is a **Masked Language Modeling** (MLM) task. It looks like the autoregressive Language Modeling task performed by the [GPT model](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/), which involves predicting the next token given the previous tokens, but it is in fact slightly different.

In Masked Language Modeling, an input sequence of tokens is provided, but with some of these tokens masked. The goal of the model is then to learn to predict the correct tokens that are hidden by the mask. If it can do so, it can learn token-level information given the context of the token.

In BERT, this is done as follows. 15% of all word embedded tokens is masked at random. From this 15%, 80% of the tokens is replaced with a special token called \[MASK\], 10% is replaced with a random token and 10% is left alone. This ensures that masking is relatively random and that the model does not zoom in on the \[MASK\] token itself, which is available during pretraining but not during fine-tuning.

#### Task 2: Next Sentence Prediction (NSP)

The other task is **Next Sentence Prediction** (NSP). This task ensures that the model learns sentence-level information. It is also really simple, and is the reason why the BERT inputs can sometimes be a pair of sentences. NSP involves textual entailment, or understanding the relationship between two sentences.

Next Sentence Prediction, given two sentences A and B, essentially involves predicting whether B is the next sentence given A or whether it is not.

Constructing a training dataset for this task is simple: given an unlabeled corpus, we take a phrase and take the phrase that actually follows it for the 50% of cases where B should be the next sentence. For the other 50%, where this is not the case, we take another phrase at random given A (Devlin et al., 2018). This way, we can construct a dataset where there is a 50/50 split between 'is next' and 'is not next' sentence pairs.

#### Pre-training data

As we can see from the BERT input structure above, during pretraining, BERT is trained jointly on both the MLM and NSP tasks. We can also see that the input structure supports this through the specific way of inputting data by means of the \[CLS\] token, the \[SEP\] token and the two token sequences.

BERT is pretrained on two datasets. The first dataset that is being used is the BooksCorpus dataset, which contains 800 million words from "more than 7.000 unpublished books. It includes many genres and hence texts from many domains, such as adventure, fantasy and romance", we wrote in our article about GPT, which is pretrained on [the same dataset](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/).

BooksCorpus is however not the only dataset that is used for pretraining. The English Wikipedia dataset, with 2500 million words, is used as well. First, all lists, tables, headers and images are removed from the texts, because they have no linguistic representation whatsoever and are specific to Wikipedia (Devlin et al., 2018). Then, the texts are used.
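Coming back to the MLM task for a moment: the masking strategy described above (15% of tokens, split 80/10/10) can be sketched as follows. The token ids and the \[MASK\] id are made up for this illustration:

```python
import random

MASK_TOKEN_ID = 103      # made-up id for the [MASK] token
VOCABULARY_SIZE = 30000  # BERT uses a WordPiece vocabulary of roughly this size

def mask_tokens(token_ids):
    """Mask 15% of the tokens: 80% become [MASK], 10% a random token, 10% stay unchanged."""
    masked = list(token_ids)
    for i in range(len(masked)):
        if random.random() < 0.15:        # select 15% of all tokens
            roll = random.random()
            if roll < 0.8:                # 80% of selected tokens: replace with [MASK]
                masked[i] = MASK_TOKEN_ID
            elif roll < 0.9:              # 10%: replace with a random token
                masked[i] = random.randrange(VOCABULARY_SIZE)
            # remaining 10%: leave the token as it is
    return masked

print(mask_tokens([12, 544, 2001, 87, 9]))
```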
The result is a model that is pretrained and which can be used for fine-tuning tasks.

### Fine-tuning step

According to Devlin et al., fine-tuning the BERT architecture is straightforward. And it actually is, because the way BERT works (i.e. self-attention over all the inputs by nature of the Transformer architecture) and how inputs are structured (i.e. the joint availability of sentence-level information and token-level information) allow for a wide variety of language tasks to which the model can be fine-tuned.

It thus does not matter whether your downstream task involves single texts or text pairs: BERT can handle it. Structuring the text itself depends on the task to be performed; the roles of sentences A and B are similar to (Devlin et al., 2018):

- **Sentence pairs** in paraphrasing tasks.
- **Hypothesis-premise** pairs in textual entailment tasks.
- **Question-answer** pairs in question answering.
- **Text-empty** pairs in text classification.

Yes, you read it right: sentence B is empty if your goal is to fine-tune for text classification. There simply is no sentence after the \[SEP\] token.

Fine-tuning is also really inexpensive (Devlin et al., 2018). Using a Cloud TPU and a standard dataset, fine-tuning can be completed with more than adequate results within an hour. If you're using a GPU, it'll take only a few hours. And fortunately, there are many datasets available to which BERT can be finetuned. In fact, in their work, the authors have achieved state-of-the-art results (at the time) with their model architecture. That's really nice!

* * *

## Summary

In this article, we provided an intuitive introduction to the BERT model. BERT, which is one of the relatively state-of-the-art approaches in Natural Language Processing these days (in fact, many models have sprung off the original BERT model), works by using the encoder segment from the original Transformer due to its bidirectionality benefits. By performing a joint sentence-level and token-level pretraining task by means of Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), BERT can be used for downstream tasks that require sentence-level and/or token-level information.

In other words, the pretrained version of BERT is a really good starting point for your own language related models.

The benefit of BERT compared to previous approaches such as GPT is bidirectionality. In the attention segment, both during pretraining and fine-tuning, the self-attention subsegment does not use a mask for hiding "inputs from the future". The effect is that inputs can learn both from the past and from the future, which can be necessary for many downstream tasks.

[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/)

I hope that you have learned something from this article! If you did, please feel free to leave a message in the comments section. I'd love to hear from you 💬 If you have questions, please feel free to leave a question through the Ask Questions button above. Where possible, I'll try to answer as soon as I can.

Thank you for reading MachineCurve today and happy engineering! 😎

* * *

## References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018).
[Bert: Pre-training of deep bidirectional transformers for language understanding](https://arxiv.org/abs/1810.04805). _arXiv preprint arXiv:1810.04805_.

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). [Improving language understanding by generative pre-training](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf).
diff --git a/intuitive-introduction-to-openai-gpt.md b/intuitive-introduction-to-openai-gpt.md
new file mode 100644
index 0000000..2ed5e61
--- /dev/null
+++ b/intuitive-introduction-to-openai-gpt.md
@@ -0,0 +1,203 @@
+---
+title: "Intuitive Introduction to OpenAI GPT"
+date: "2021-01-02"
+categories:
+  - "deep-learning"
+tags:
+  - "gpt"
+  - "huggingface"
+  - "language-model"
+  - "natural-language-processing"
+  - "nlp"
+  - "openai"
+  - "transformer"
+  - "transformers"
+---

Natural Language Processing is one of the fields where Machine Learning has really boosted progress in the past few years. One of the reasons why there was such progress is of course the [Transformer architecture](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) introduced in 2017. However, in addition to that, it's unlikely that you haven't heard about the **GPT class** of language models. This class, which includes the GPT-2 and GPT-3 architectures, has been attracting global attention since its models can produce text which resembles text written by humans.

In fact, Microsoft has acquired an [exclusive license to the GPT-3 language model](https://blogs.microsoft.com/blog/2020/09/22/microsoft-teams-up-with-openai-to-exclusively-license-gpt-3-language-model/), which will likely give it a prominent role in its cloud environment. In addition to that, many other cloud services using GPT-like models are seeing the light of day. Language models like these can possibly change the world of text in unprecedented ways.

But how does the GPT class of models work? In this article, we'll cover the first model from that range: the **OpenAI GPT** (i.e. GPT-1) model. It was proposed in a 2018 paper by Radford et al. and produced state-of-the-art results at the time. This article will explain the GPT model as intuitively as possible.

It is structured as follows. Firstly, we'll take a look at performing semi-supervised learning in NLP models - i.e., pretraining on large unlabeled corpora (the unsupervised part) and subsequent fine-tuning on relatively small, labeled corpora (the supervised part). Using this approach, it becomes possible to use the large, pretrained model for building a very task-specific model.

Following this is the actual introduction to GPT. We're going to find out how it utilizes the decoder segment [of the original Transformer](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) as its base architecture. We will also cover the hyperparameters used for training the decoder segment in pre-training and in fine-tuning. This way, you'll understand how GPT works in detail - without a lot of heavy maths. Looking at fine-tuning, we will also cover the variety of tasks that the GPT model was fine-tuned on, and see how it performs.

Finally, we are going to look at a few extra takeaways of the GPT paper. We'll find out what the effect is of 'locking' certain layers of the pretrained model in terms of performance deterioration.
We'll also see that the pretrained model shows zero-shot behavior, meaning that _some_ performance is achieved when it has not had _any_ fine-tuning. This suggests that the unsupervised language model also learns to recognize linguistic patterns within the text. Finally, we'll compare the performance of Transformer based architectures for semi-supervised learning to that of [LSTMs](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/). + +* * * + +\[toc\] + +* * * + +## How GPT is trained: Semi-supervised learning for NLP + +Before we can take a look at how GPT works (and how it is trained precisely), we must take a look at the general approach that it utilizes. According to Radford et al. (2018), GPTs fall under the category of semi-supervised learning. + +> Our work broadly falls under the category of semi-supervised learning for natural language. This paradigm has attracted significant interest, with applications to tasks like sequence labeling or text classification. +> +> Radford et al. (2018) + +Semi-supervised learning is composed of an _unsupervised_ component and a _supervised_ component (hence the name _semi-_supervised). They are the following: + +1. **Pretraining**, which is _unsupervised_, utilizes an unlabeled corpus of (tokenized) text. Here, the goal is not to find a model that works well for a specific task, but rather to find a good _initialization point_ from which to start when learning for a specific task (Radford et al., 2018). +2. **Fine-tuning**, which is _supervised_, utilizes a labeled corpus of (tokenized) text specifically tailored to a specific language task, such as summarization, text classification or sentiment analysis. + +The approach has attracted significant interest because it demonstrates to improve the performance of language models significantly (Radford et al., 2018). One of the key reasons for this observation is that there is a scarcity of labeled datasets; they are often also labeled for one particular domain. Unlabeled text, however, _does_ contain all the patterns, but has no labels. It is also much more abundant compared to labeled text. If we can extract certain linguistic patterns from the unlabeled text, we might find a better starting point from which to specialize further. For this latter job, we can use the labeled but often much smaller dataset. + +Semi-supervised learning for natural language has been visualized in the figure below. In green, we can see three tasks: a pretraining task and two finetuning tasks. The pretraining task utilizes a large corpus of unlabeled text to pretrain the model. Using the pretrained model, we can then use different corpora that are task-oriented for finetuning. The outcome is a model that is finetuned to a specific task, but which benefits from pretraining significantly (Radford et al., 2018). + +[![](images/Diagram-39-1024x436.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/Diagram-39.png) + +* * * + +## How GPT works: an introduction + +Now that we know what semi-supervised learning for natural language involves, we can actually take a look at GPT and how it works. We'll do this in three parts. Firstly, we're going to take a look at the _architecture_ - because we'll need to understand the model that is trained first. The next thing we'll cover is the _pre-training task_, which is formulated as a language modeling task. 
Finally, we're going to cover _fine-tuning_ and give you a wide range of example tasks that the pre-trained GPT model can specialize to, as well as the corresponding datasets (Radford et al., 2018). + +### Using the Transformer decoder segment + +From the original article about the [Transformer architecture](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/), we know that the version proposed by Vaswani et al. (2017) is composed of an **encoder segment** and a **decoder segment**. + +The encoder segment converts the original sequence into a hidden and intermediary representation, whereas the decoder segment converts this back into a target sequence. Being a classic [Seq2Seq model](https://www.machinecurve.com/index.php/2020/12/29/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning/), the classic Transformer allows us to perform e.g. translation using neural networks. + +The GPT based Transformer extends this work by simply taking the decoder segment and stacking it 12 times, like visualized here: + +[![](images/Diagram-37.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/Diagram-37.png) + +As you can see, it has both the masked multi-head attention segment, the feed forward segment, the residuals and their corresponding addition & layer normalization steps. + +This, in other words, means that: + +1. First, the (learned) [embedding](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/#output-embedding) is [position embedded](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/#positional-encoding_1) (which contrary to the classic Transformer is also performed using a learned embedding). +2. The input is then served to a [masked multi-head attention segment](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/#masked-multi-head-attention), which computes self-attention in a [unidirectional way](https://www.machinecurve.com/index.php/question/what-are-unidirectional-language-models/). Here, the residual is added and the result is layer normalized. +3. The result is then passed through a position-wise feedforward network, meaning that every token is passed individually and that the result is merged back together. Once again, the residual is added and the result is layer normalized. +4. The outcome either passes to the next decoder segment or is the output of the model as a whole. + +### Pre-training task + +Pretraining of the GPT Transformer is performed with the [BooksCorpus dataset](https://www.machinecurve.com/index.php/question/what-does-the-bookscorpus-dataset-look-like/). This dataset, which is unfortunately not wholly distributed anymore but can be reconstructed (see the link for more information), contains more than 7.000 unpublished books (Radford et al., 2018). It includes many genres and hence texts from many domains, such as adventure, fantasy and romance. + +An excerpt from the corpus, [found here](https://twitter.com/theshawwn/status/1301852133319294976), is as follows: + +> _April Johnson had been crammed inside an apartment in San Francisco for two years, as the owners of the building refurbished it, where they took a large three story prewar home and turned it into units small enough where she felt a dog’s kennel felt larger than where she was living and it would be a step up. 
And with the walls so thin, all she could do was listen to the latest developments of her new neighbors. Their latest and only developments were the sex they appeared to be having late at night on the sofa, on the kitchen table, on the floor, and in the shower. But tonight the recent development occurred in the bed. If she had her way she would have preferred that they didn’t use the bed for sex because for some reason it was next to the paper thin wall which separated her apartment from theirs._

Once more: pretraining happens in an unsupervised way, meaning that there are no labels whatsoever to help us steer the training process in the right direction. What we _can_ do with our large corpus of tokens \[latex\]\\{T\_1, ..., T\_n\\}\[/latex\], however, is apply a (sliding) **context window** of length \[latex\]k\[/latex\]. In other words, we can structure our text into the following windows: \[latex\]\\{T\_1, T\_2, T\_3\\}\[/latex\], \[latex\]\\{T\_2, T\_3, T\_4\\}\[/latex\], and so on, here with \[latex\]k = 3\[/latex\].

If we then feed a context window to the GPT model, we can predict the next token - e.g. \[latex\]T\_4\[/latex\] in the case of the \[latex\]\\{T\_1, T\_2, T\_3\\}\[/latex\] window:

![](images/Diagram-38-1024x505.png)

The goal is then to maximize the following objective function. Here is what optimization of GPT looks like:

![](images/image.png)

Source: Radford et al. (2018)

This function is a really complex way of writing down the following:

[![](images/bce-1-1024x421.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/bce-1-1024x421.png)

- For each token \[latex\]T\_i\[/latex\] (in the formula also called \[latex\]u\_i\[/latex\]) in the corpus \[latex\]U\[/latex\], [we compute the log loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) of the probability that it occurs given the context window \[latex\]u\_{i-k}, ..., u\_{i-1}\[/latex\], i.e. the \[latex\]k\[/latex\] tokens prior to token \[latex\]i\[/latex\].
- In plain English, this means: we let the model output the probability that token \[latex\]u\_i\[/latex\] is the next token given the context window of length \[latex\]k\[/latex\], and compute the log loss for this probability, indicating how far off the prediction is.
    - In the image on the right, you can see that when the prediction is 100% correct, the loss is 0; when it gets worse, the loss increases exponentially.
- If we sum this together for all tokens \[latex\]i \\in U\[/latex\], we get the loss as a whole, and we can perform backpropagation-based error computation and subsequent optimization. In fact, GPT is optimized with [Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam) with a learning rate schedule with a maximum rate of 2.5e-4.
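If you prefer code over formulas, the following toy sketch (plain Python with NumPy, not the actual GPT implementation) shows how a sliding context window of length \[latex\]k\[/latex\] produces next-token targets and how the log probabilities are summed into the objective. The `predict_next_token_probs` function is a hypothetical stand-in for the Transformer decoder.

```python
# Toy sketch of the autoregressive objective: sliding context window + summed log probabilities.
import numpy as np

tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = sorted(set(tokens))
k = 3  # context window length

def predict_next_token_probs(context):
    """Hypothetical stand-in for the model: one probability per vocabulary entry."""
    probs = np.random.rand(len(vocab))
    return probs / probs.sum()

log_likelihood = 0.0
for i in range(k, len(tokens)):
    context = tokens[i - k:i]   # e.g. ["the", "cat", "sat"]
    target = tokens[i]          # e.g. "on"
    probs = predict_next_token_probs(context)
    log_likelihood += np.log(probs[vocab.index(target)])  # log P(u_i | u_{i-k}, ..., u_{i-1})

print("Objective to maximize (log likelihood):", log_likelihood)
print("Loss to minimize:", -log_likelihood)
```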
Radford et al. (2018) ran the training process for 100 epochs with a [minibatch](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) approach, using minibatches of 64 randomly sampled sequences of 512 tokens each.

> Our approach requires an expensive pre-training step - 1 month on 8 GPUs. (...) The model does fine-tune to new tasks very quickly which helps mitigate the additional resource requirements.
>
> OpenAI (2020)

Environmentally, pretraining the GPT model is not efficient. As you can see above, the whole pretraining operation - the full 100 epochs - cost 1 month and required the full utilization of 8 GPUs. Fortunately, OpenAI released the model weights for the pretrained model jointly with their paper. This means that we can use the pretrained GPT model for fine-tuning to more specific tasks. This, according to OpenAI, can be performed really quickly. That's some better news!

Let's now take a look at how we can use the pretrained model for fine-tuning.

### Fine-tuning task

Once the GPT model has been pretrained, it can be fine-tuned. This involves a labeled dataset, which Radford et al. (2018) call \[latex\]C\[/latex\]. Each instance contains a sequence of tokens \[latex\]\\{x^1, x^2, ..., x^m\\}\[/latex\], as well as a label \[latex\]y\[/latex\]. The sequence is passed through the pretrained Transformer architecture, after which it is passed through a linear layer with weights \[latex\]W\_y\[/latex\] and [Softmax activation](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) for [multiclass prediction](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/).

![](images/image-1.png)

In other words, we predict pseudoprobabilities over all the classes (which are task-specific). For example, these are possible tasks and corresponding classes:

- **Classification problem (e.g. sentiment analysis):** two or more classes (e.g. positive, neutral and negative in the case of sentiment analysis).
- **Textual entailment:** two classes (entailment, no entailment).
- **Similarity:** one class outcome from two options.
- **Multiple choice:** one class outcome from multiple options.

By taking the `argmax` of the outcome, we can find the class that is most likely.

#### Textual representation during fine-tuning

We saw above that during fine-tuning, text is fed to the Transformer as a sequence of tokens. Obviously, there are many fine-tuning tasks, four of which have been defined above; they were retrieved from Radford et al.'s work.

As these tasks are all different, we must also represent texts differently when inputting them to the model.

Take **classification**, with sentiment analysis as an example. When we perform classification, we simply tokenize the text, add a start and an end (extract) token, and feed it to the Transformer. By sticking a Linear layer on top of it, e.g. with a [binary or multiclass loss function](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) with Softmax for activation, we can create a model that classifies text.

![](images/Diagram-40-1024x385.png)

If the task is related to **textual entailment**, i.e. showing directionality in text, we structure the text slightly differently. First of all, the start and end tokens are present again, but we now also have a delimiter token. This token separates a _premise_ sequence from a _hypothesis_ sequence. Here, the premise can be "If you're sick, you cannot go to school." whereas the hypothesis can be "It is legal to not go to school if you're not feeling well".

![](images/Diagram-41-1024x334.png)

Another use case for GPT is **similarity detection**. Here, we also see the start, delimiter and end tokens again, but the application as a whole is structured a bit differently. As you can see, we cannot assume that text 1 always precedes text 2, i.e. we cannot assume any directionality between the texts. For this reason, Radford et al.
(2018) fine-tune GPT for similarity detection by feeding the sequences in both orders through the Transformer, simply adding the Transformer outcomes together, before feeding them to a linear classifier for similarity detection (i.e. similar / not similar, possibly with a score).

![](images/Diagram-42-1024x319.png)

Finally, there's **question answering** and **common sense reasoning**. In the case of question answering, a context and an answer, with the corresponding start, delimiter and end tokens, are passed through the GPT Transformer and fed to a Linear model. In the case of **common sense reasoning** (i.e. multiple choice based reasoning about common sense), we do this for each answer, then perform a Softmax based operation on the outcomes of all the Linear layers. Jointly trained, the model then learns to perform common sense reasoning.

![](images/Diagram-43-1024x342.png)

As you can see, the GPT Transformer can be fine-tuned on a wide variety of tasks given the pretrained model. This requires structuring the text a bit differently given the use case and possibly the dataset, but the same architecture and pretrained model can be used over and over again. This is one of the reasons why GPT is used quite widely these days and why it is present in e.g. the [HuggingFace Transformers library](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/).

* * *

## Extra takeaways

Let's now take a look at three extra takeaways from the Radford et al. (2018) paper, which they obtained through ablation studies:

1. **Whether fine-tuning more layers of the model yields better performance**.
2. **Whether zero-shot learning provides some accuracy.** In other words, whether performing the downstream task with the pretrained model only - without any fine-tuning epochs - yields some performance to begin with. If so, this suggests that the pretrained model itself is capable of understanding some language.
3. **What the performance differences are between [LSTM networks](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/) and (GPT based) Transformer ones**.

### More layers used in fine-tuning means better performance

First of all, there is the number of Transformer layers that is fine-tuned. Recall that in any form of transfer learning, it is not necessary that the _whole_ model is transferred. In fact, we can 'lock' certain layers to keep them untouched during fine-tuning. Radford et al. (2018) find that the more layers remain unlocked, the better the fine-tuned model performs. This was not entirely unexpected.

### Zero-shot learning provides (some) accuracy

What's more and perhaps more surprising is that zero-shot learning provides some accuracy on a variety of language tasks. Zero-shot learning here means that the model is used for performing the downstream tasks _without_ being fine-tuned first, i.e. by using the pretrained model only.

This zero-shot approach shows that the pretrained model performs relatively poorly on the downstream tasks, but it does show _some_ performance. This suggests that pretraining supports the learning of a wide variety of task relevant functionality (Radford et al., 2018). In other words, it explains why pretraining does significantly improve language models.

### Transformers vs LSTMs

In finding the effectiveness of the GPT Transformer based model, Radford et al.
(2018) have also trained a 2048 unit single layer [LSTM network](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/). On average, across many of the tasks, performance dropped significantly when the LSTM was used instead of the Transformer. This clearly demonstrates that Transformer based models in general, and GPT in particular, _do_ improve performance compared to previous approaches.

* * *

## Summary

In this article, we have introduced the OpenAI GPT model architecture used for language modeling. It is a Transformer-based approach, and this article is one of many that will follow about Transformers and specific architectures built on them. In doing so, we first saw that GPT based models are trained in a semi-supervised approach, with a general pretraining step followed by task-specific fine-tuning.

We then proceeded by looking at how GPT works: we saw that it uses the decoder segment from the original Transformer, which is pretrained on the BooksCorpus dataset in an autoregressive way. Once pretrained, which takes a significant amount of time (one month!), we can use it to perform specific fine-tuning. There is a wide variety of datasets (including your own) that can be used for this purpose. We did see however that texts must be structured in a particular way when fine-tuning is to be performed. More specifically, we also looked at how to represent text in the case of classification tasks, textual entailment tasks, question answering tasks and similarity detection tasks.

To finalize, we also appreciated three extra takeaways from the Radford et al. (2018) paper that may be present across many Transformer based approaches. Firstly, we saw that fine-tuning more layers yields better performing models compared to when only one or a few layers (i.e. Transformer segments) are fine-tuned. Secondly, we saw that zero-shot learning (i.e. performing the downstream task with the pretrained model, so without extra fine-tuning epochs) already provides some performance. This suggests that pretraining _really_ provides the performance boost that we suspected it to provide. Thirdly, and finally, the GPT architecture also demonstrates that the Transformer based architecture performs much better than previous LSTM-based approaches, as was experimentally identified during an ablation study.

[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/)

I hope that you have learned something from this article. If you did, please feel free to leave a message in the comments section 💬 Please do the same if you have any questions, or use the **Ask Questions** button on the right. I'd love to hear from you :)

Thank you for reading MachineCurve today and happy engineering! 😎

* * *

## References

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). [Improving language understanding by generative pre-training](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf).

OpenAI. (2020, March 2). _Improving language understanding with unsupervised learning_.
[https://openai.com/blog/language-unsupervised/](https://openai.com/blog/language-unsupervised/) diff --git a/intuitively-understanding-svm-and-svr.md b/intuitively-understanding-svm-and-svr.md new file mode 100644 index 0000000..961842e --- /dev/null +++ b/intuitively-understanding-svm-and-svr.md @@ -0,0 +1,261 @@ +--- +title: "Understanding SVM and SVR for Classification and Regression" +date: "2019-09-20" +categories: + - "buffer" + - "svms" +tags: + - "classifier" + - "kernel" + - "machine-learning" + - "regression" + - "support-vector-machine" + - "support-vector-regression" +--- + +There is a lot of interest in deep learning models today: deep neural networks show beyond-average performance on many tasks, having spawned a new AI hype as well as many interesting and truly valuable AI based applications. + +Does that mean, however, that we should forget about the more traditional approaches to machine learning? + +No, we don't. The reason why is simple - they see things that deep learning models don't see. Given their different mathematical structure, the errors produced by those techniques are often _different ones_ than the DL models. + +This sounds bad, but the exact opposite is true - because the models can be combined. When doing that, you might actually find the _ensemble to perform better_. This is the result of all the different errors cancelling each other out (Chollet, 2017). + +Before neural networks, Support Vector Machines (SVMs) were very popular for generating classifiers. Support Vector Regression (SVR) is its regression equivalent. In this blog, we'll cover SVMs and SVRs. After reading it, you will understand... + +- Where SVMs and SVR are located on the spectrum between supervised vs unsupervised vs reinforcement learning. +- How Support Vector Machines work as maximum-margin classifiers. +- How SVM can be extended to Regression and what this takes. + +* * * + +**Update 05/Feb/2021:** ensured that article is up to date. + +* * * + +\[toc\] + +* * * + +## Summary: Support Vector Machines and Support Vector Regression + +![](images/Svm_separating_hyperplanes_SVG.svg_-1024x886.png) + +Hyperplanes and data points. The [image](https://en.wikipedia.org/wiki/Support-vector_machine#/media/File:Svm_separating_hyperplanes_(SVG).svg)is not edited. Author: [Zack Weinberg](https://commons.wikimedia.org/w/index.php?title=User:ZackWeinberg&action=edit&redlink=1), derived from [Cyc's](https://commons.wikimedia.org/w/index.php?title=User:Cyc&action=edit&redlink=1) work. License: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/legalcode) + +When you are training a Machine Learning model, there is a wide variety of algorithms to choose from. Today, neural networks are very popular methods for training a classification or regression model, but there are additional ones. Take Support Vector Machines, or their regression equivalent, Support Vector Regression. While these models have more [bias](https://www.machinecurve.com/index.php/2020/11/02/machine-learning-error-bias-variance-and-irreducible-error-with-python/) by design compared to neural networks, they might work better in cases where data is scarce. + +This article discusses in detail but intuitively how **Support Vector Machines** and **Support Vector Regression works**. Here, we'll already cover things briefly. If you look at the image on the right, you see a binary classification problem. In other words, a supervised learning problem. You see black and white circles. 
The goal of any machine learning model used for classification is to find a decision boundary, i.e. a line (or, more strictly, a N-1 dimensional object called hyperplane for the N-dimensional feature space; here N=2, so the hyperplane is a line) that can distinguish between the classes. + +Support Vector Machines (SVMs) here are so-called maximum-margin classifiers. This means that they will attempt to maximize the distance between the closest vectors of each class and the line. These closest vectors are called **support vectors**, and hence the name _Support Vector_ Machine. Hyperplane `H3` is best, as you can see it maximizes the equal distance between the two classes. It's better than `H2`, which is also capable of performing a classification but is not as good as `H3`, as well as better than `H1`, which is not capable of classifying at all. + +The problem with SVMs however is that they can **(1) only be used for binary classification and (2) require a kernel function provided by humans to learn data**. In other words, you'll have to provide some estimate about the structure of your data, and it will then only work for binary classification problems. At least, out of the box. By [combining various binary SVMs together](https://www.machinecurve.com/index.php/2020/11/11/creating-one-vs-rest-and-one-vs-one-svm-classifiers-with-scikit-learn/), you can still use it in a **multiclass way**. + +SVMs can also be used for regression; then, the goal is to estimate a hyperplane that can be used for regression. It works in a similar way, although in regression a so-called 'error tube' is added where errors are not penalized, reflecting the increased complexity of the problem (i.e. from a discrete problem with a few classes regression problems are continuous problems with infinite possible outcomes). + +### Additional reading + +Also make sure to read the following articles if you are interested in SVMs and SVR: + +- [Creating a simple binary SVM classifier with Python and Scikit-learn](https://www.machinecurve.com/index.php/2020/05/03/creating-a-simple-binary-svm-classifier-with-python-and-scikit-learn/) +- [How to visualize support vectors of your SVM classifier?](https://www.machinecurve.com/index.php/2020/05/05/how-to-visualize-support-vectors-of-your-svm-classifier/) +- [Creating One-vs-Rest and One-vs-One SVM Classifiers with Scikit-learn](https://www.machinecurve.com/index.php/2020/11/11/creating-one-vs-rest-and-one-vs-one-svm-classifiers-with-scikit-learn/) +- [Using Error-Correcting Output Codes with Scikit-learn for multiclass SVM classification](https://www.machinecurve.com/index.php/2020/11/12/using-error-correcting-output-codes-for-multiclass-svm-classification/) +- [How to create a Multilabel SVM classifier with Scikit-learn](https://www.machinecurve.com/index.php/2020/11/12/how-to-create-a-multilabel-svm-classifier-with-scikit-learn/) +- [How to perform Multioutput Regression with SVMs in Python](https://www.machinecurve.com/index.php/2020/11/17/how-to-perform-multioutput-regression-with-svms-in-python/) +- [Using Radial Basis Functions for SVMs with Python and Scikit-learn](https://www.machinecurve.com/index.php/2020/11/25/using-radial-basis-functions-for-svms-with-python-and-scikit-learn/) + +Let's now dive into SVMs and SVR more deeply! 🚀 + +* * * + +## Before getting started: some basic ML ingredients + +Before we can do so, we must first take a look at some basic ingredients of machine learning, before we can continue with SVMs and SVR. 
If you're already very familiar with these concepts, feel free to skip to the next section. If not, let's go! + +### Supervised vs unsupervised vs reinforcement + +In machine learning, you'll work goal-oriented: there is a problem to be solved and a machine learning model may be the solution to that problem. A problem may spawn a wide variety of ML scenarios, which can broadly be categorized into _supervised learning,_ _unsupervised learning_ and _reinforcement learning_. + +In a supervised learning scenario, we have a so-called training set. This training set consists of many samples, usually in the form of vectors. These vectors, that are also called feature vectors, contain individual features, or values characterizing some domain. For example, the features _height_, _weight_, _BMI_ and _percentage of fat_ may characterize one's _body posture_ (the name of the feature vector, in this case). + +In supervised learning, what you'll also find is that for each feature vector, there exists a so-called _target variable_. This target variable essentially correlates the feature vector with some outcome. Usually, the target variable is highly related to the problem you're trying to solve with machine learning. For example, in the situation above you might be interested in training a machine learning model that predicts the likelihood that one has diabetes (outcome 'yes', or 1) or no diabetes (outcome 'no', or 0) in five years from now. + +Unsupervised learning scenarios exist as well. In those, you don't have target values, but merely a dataset in which your goal is to detect certain patterns. For example, you may wish to find certain groups in your data set - this is a typical scenario when you wish to group buyers based on their purchase records. + +In the last category, reinforcement learning, you don't really have a dataset with which you either train a model or find patterns. Rather, you start with a dumb agent, which displays certain behavior. After each display of behavior, you'll tell the agent whether their action is _right_ or _wrong_. As a result, over time, it will perform the action described implicitly by the _goal_ present in your judgment. It will display _goal-oriented behavior_. A perfect example is displayed in this video, where agents learn to play hide and seek: + +https://www.youtube.com/watch?v=kopoLzvh5jY + +### Supervised ML: classification vs regression + +SVMs and SVR are classic examples of supervised machine learning techniques. We'll therefore narrow down on supervised ML. We must next differentiate between classification and regression. + +In a different blog, I already explained what classification is: + +> Suppose that you work in the field of separating non-ripe tomatoes from the ripe ones. It's an important job, one can argue, because we don't want to sell customers tomatoes they can't process into dinner. It's the perfect job to illustrate what a human classifier would do. +> +> Humans have a perfect eye to spot tomatoes that are not ripe or that have any other defect, such as being rotten. They derive certain characteristics for those tomatoes, e.g. based on color, smell and shape: +> +> If it's green, it's likely to be unripe (or: not sellable); +> If it smells, it is likely to be unsellable; +> The same goes for when it's white or when fungus is visible on top of it. +> +> If none of those occur, it's likely that the tomato can be sold. +> +> We now have _two classes_: sellable tomatoes and non-sellable tomatoes. 
+> +> Human classifiers _decide about which class an object (a tomato) belongs to._ +> +> [How to create a CNN classifier with Keras?](https://machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) + +This can also be done by a machine learning model: the numbers behind the tomato images as features in a feature vector and the outcome (sellable or non-sellable) as targets. + +And Support Vector Machines (SVM) are methods to generate such classifiers. We'll cover their inner workings next. + +...because _regression_ is left. In this supervised scenario, you don't pick a class for a feature vector, but rather, you'll estimate a _real number_ (an either positive or negative number that can have an infinite amount of decimals) for the input. For the diabetes case above, rather than estimating _yes or no_, you might wish to estimate the _probability_ of getting diabetes. + +In the case of support vectors, Support Vector Regression is your way to go for supervised regression scenarios. Hence, let's cover their internals. + +* * * + +## Support Vector Machines + +How do SVMs work? We'll cover the inner workings of Support Vector Machines first. They are used for classification problems, or assigning classes to certain inputs based on what was learnt previously. + +Suppose that we have a dataset that is linearly separable: + +![](images/linearly_separable_dataset-1.png) + +We can simply draw a line in between the two groups and separate the data. As we've seen for e.g. the [Rosenblatt Perceptron](https://machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/), it's then possible to classify new data points into the correct group, or class. + +However, with much data, a linear classifier might not be such a good idea: every sample is taken into account for generating the decision boundary. What's more, linear classifiers do not find the _optimum decision boundary._ When data is linearly separable, _at least one_, but often many decision boundaries exist. Yet, which one is optimal? + +Support Vector Machines can very well handle these situations because they do two things: they _maximize the margin_ and they do so by means of _support vectors_. + +### Maximum-margin classifier + +In SVM scenario, a decision boundary is also called a hyperplane which, given that you have N dimensions, is N-1-dimensional. Hence, in the two-dimensional setting below, the hyperplane is one-dimensional - thus, a line. + +We do however see three hyperplanes: H1, H2 and H3: + +![](images/Svm_separating_hyperplanes_SVG.svg_-1024x886.png) + +Hyperplanes and data points. The [image](https://en.wikipedia.org/wiki/Support-vector_machine#/media/File:Svm_separating_hyperplanes_(SVG).svg)is not edited. Author: [Zack Weinberg](https://commons.wikimedia.org/w/index.php?title=User:ZackWeinberg&action=edit&redlink=1), derived from [Cyc's](https://commons.wikimedia.org/w/index.php?title=User:Cyc&action=edit&redlink=1) work. License: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/legalcode) + +The first hyperplane, H1, does not separate the classes and is therefore not a decision boundary at all. + +The second hyperplane, H2, does separate the classes, but it is easy to understand that it does only so with a _small margin_: the distance from the closest _black_ data point and the closest _white_ data point to the hyperplane is relatively small.. A more optimal hyperplane can be found. + +And that's H3. 
As you can see, the distance between the closest vectors and the line is much larger now - in fact, it is the largest _margin_ that can be found. We call this the maximum margin, and hence, the SVM is a _maximum-margin classifier_.

As we've seen, those vectors - or data points - play an important role in finding the optimum decision boundary. We hence call them _support vectors_, and they allow SVMs to have some very interesting properties compared to linear classifiers.

### Support Vectors

Let's look closely at the SVM plot again, and especially at the separating decision boundaries H2 and H3. We can make some very interesting observations:

- The decision boundaries seem to be determined only by the vectors that are closest to those boundaries, the support vectors. H2 and H3 have different support vectors;
- The best decision boundary is the one where the distance between the support vectors and the decision boundary is the largest.

![](images/Svm_separating_hyperplanes_SVG.svg_-1024x886.png)

Hyperplanes and data points. The [image](https://en.wikipedia.org/wiki/Support-vector_machine#/media/File:Svm_separating_hyperplanes_(SVG).svg) is not edited. Author: [Zack Weinberg](https://commons.wikimedia.org/w/index.php?title=User:ZackWeinberg&action=edit&redlink=1), derived from [Cyc's](https://commons.wikimedia.org/w/index.php?title=User:Cyc&action=edit&redlink=1) work. License: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/legalcode)

This essentially means that you'll only need to consider the closest vectors - the support vectors - when computing the outcome for a new sample. All the other vectors can be ignored.

### Optimizing a SVM: cost function

When you optimize a SVM, you minimize a cost function. In the case of SVM, the cost function has _one_ optimum, which is the decision boundary that produces the largest margin. Given how you initialize the weights of your decision boundary before the training process starts (this may be done randomly), it may take some time to find this optimum. Many frameworks therefore allow you to specify a breakoff point, where the decision boundary might not be best, but good enough for your model to be usable.

### SVM variations

There are two flavors of SVM: **C-SVM** based classification and **nu-SVM** classification. While essentially they try to do the same thing - finding the optimum decision boundary by minimizing a cost function - the actual _cost function_ differs. The difference relates to _how errors are penalized_ during training. [Click here if you wish to look at the formulas](http://www.statsoft.com/textbook/support-vector-machines). It does however not seem to matter much which variation is used (Ferdi, n.d.). Rather, one's choice seems to be based on how intuitive one thinks it is for computing loss.
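To see these ideas in practice, here is a minimal scikit-learn sketch that fits a linear C-SVM on a toy, linearly separable dataset and inspects its support vectors. The dataset and hyperparameters are illustrative assumptions, not recommendations.

```python
# Minimal sketch: a maximum-margin (linear) SVM and its support vectors, using scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two linearly separable groups, comparable to the black and white circles above
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

clf = SVC(kernel="linear", C=1.0)  # C-SVM with a linear kernel
clf.fit(X, y)

# Only these vectors determine where the decision boundary ends up
print("Support vectors per class:", clf.n_support_)
print("Support vectors:\n", clf.support_vectors_)

# Classifying a new sample only requires the learned boundary
print("Prediction for a new sample:", clf.predict([[0.0, 0.0]]))
```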
### What if data is not linearly separable? Kernels

It may of course also be the case that your data is not linearly separable. What to do?

![](images/Kernel_Machine.png)

Author: [Alisneaky](https://commons.wikimedia.org/w/index.php?title=User:Alisneaky&action=edit&redlink=1), [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)

We benefit from the fact that Support Vector Machines are so-called _kernel machines_.

Kernel functions take a data point in some mathematical space as an input and map it to another mathematical space.

Although very abstract, this means that SVMs are capable of changing the shape of the dataset without actually changing the interrelationships between the samples, for the reason that all samples are mapped in the same way.

This is great news, because we might now be able to find a function that maps our non-linearly separable dataset into one which _does have a linear separation between the two classes_. This is visually represented in the image above.

There are many kernels in use today. The Gaussian kernel is pretty much the standard one. From there, one can experiment further to see whether data can become linearly separable. If your data is not linearly separable at first, classification by means of a linear SVM is a bad idea, and kernels must be used.

* * *

## Support Vector Regression

Above, we looked at applying support vectors for classification, i.e., SVMs. However, did you know that support vectors can also be applied to regression scenarios - where you estimate a real number instead of assigning classes to inputs?

### How does Support Vector Regression work?

Support Vector Regression maintains all the interesting properties from Support Vector Machines (Saed Sayad, n.d.). Given data points, it attempts to find the curve that describes them best. However, rather than having the curve act as a decision boundary in a classification problem, in SVR, a match is found between some vector and the _position_ on the curve.

_It's a regression scenario after all_.

And support vectors participate in finding the closest match between the data points and the actual function that is represented by them. Intuitively, when we maximize the margin between the _support vectors_ and the regressed curve, we get closest to the actual curve (because there is always some noise present in the statistical samples). It also follows that we can discard all the vectors that are not support vectors, for the simple reason that they are likely statistical outliers.

The result is a regressed function!

### Kernels and SVR

Kernels can also be applied in SVR. Hence, it is possible to regress a _non-linear function_, or a curve, using SVR. Similarly, the non-linear data is mapped onto a space that makes the data _linear_. In the case of SVR, however, the goal of this mapping is not to separate two groups, but to represent the data as a straight line, so that the contribution of the support vectors to the regression problem can be computed.

### SVR variations

With SVM, we saw that there are two variations: C-SVM and nu-SVM. In that case, the difference lies in the cost function that is to be optimized, especially in the hyperparameter that configures the loss to be computed.

The same happens in SVR: it comes with epsilon-SVM and nu-SVM regression, or epsilon-SVR and nu-SVR.

The difference lies in _what you wish to control during the training process_ (Pablo Rivas, n.d.).

In the case of epsilon-SVR, or ep-SVR, you wish to control the _maximum allowable error_ for your regression setting. That's great if you wish to find the best possible model without caring much about computational resources. As a result, you don't control the number of support vectors that is used during optimization: it could be few, it could be many. And with many, it could be that you need many resources.

On the other hand, with nu-SVR, you'll control the _number of support vectors_ instead. As a result, you don't control the maximum amount of error allowed in the model - it is estimated for you. By consequence, you likely need fewer or simpler resources than with ep-SVR, but you will likely find a slightly larger error.

Depending on your needs, you should choose the type of SVR that fits your machine learning problem.
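As a concrete illustration of this trade-off, the following scikit-learn sketch fits both variants on the same noisy curve; the values for `epsilon`, `nu` and `C` are illustrative, not recommendations.

```python
# Minimal sketch: epsilon-SVR versus nu-SVR on the same noisy data, using scikit-learn.
import numpy as np
from sklearn.svm import SVR, NuSVR

rng = np.random.RandomState(42)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)  # a noisy sine curve

# epsilon-SVR: you fix the width of the error tube (epsilon) up front
eps_svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)

# nu-SVR: you bound the fraction of support vectors (nu) instead
nu_svr = NuSVR(kernel="rbf", C=1.0, nu=0.5).fit(X, y)

print("epsilon-SVR uses", eps_svr.support_vectors_.shape[0], "support vectors")
print("nu-SVR uses", nu_svr.support_vectors_.shape[0], "support vectors")
```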
* * *

## Recap

In this blog, we attempted to arrive at an intuitive understanding of generic machine learning concepts and eventually Support Vector Machines and Support Vector Regression. We identified the need for kernels and kernel functions, and saw how cost functions are optimized with both SVM and SVR - and checked out C-SVM, nu-SVM, ep-SVR and nu-SVR.

I hope you've learnt something from this blog, or that it helped you understand those concepts rather intuitively! 😄 If you have any questions, remarks, or other comments, please feel free to leave a comment below 👇 I'll happily answer and where I see fit improve my post. Thanks in advance and happy engineering! 😎

### Additional reading

Also make sure to read the following articles if you are interested in SVMs and SVR:

- [Creating a simple binary SVM classifier with Python and Scikit-learn](https://www.machinecurve.com/index.php/2020/05/03/creating-a-simple-binary-svm-classifier-with-python-and-scikit-learn/)
- [How to visualize support vectors of your SVM classifier?](https://www.machinecurve.com/index.php/2020/05/05/how-to-visualize-support-vectors-of-your-svm-classifier/)
- [Creating One-vs-Rest and One-vs-One SVM Classifiers with Scikit-learn](https://www.machinecurve.com/index.php/2020/11/11/creating-one-vs-rest-and-one-vs-one-svm-classifiers-with-scikit-learn/)
- [Using Error-Correcting Output Codes with Scikit-learn for multiclass SVM classification](https://www.machinecurve.com/index.php/2020/11/12/using-error-correcting-output-codes-for-multiclass-svm-classification/)
- [How to create a Multilabel SVM classifier with Scikit-learn](https://www.machinecurve.com/index.php/2020/11/12/how-to-create-a-multilabel-svm-classifier-with-scikit-learn/)
- [How to perform Multioutput Regression with SVMs in Python](https://www.machinecurve.com/index.php/2020/11/17/how-to-perform-multioutput-regression-with-svms-in-python/)
- [Using Radial Basis Functions for SVMs with Python and Scikit-learn](https://www.machinecurve.com/index.php/2020/11/25/using-radial-basis-functions-for-svms-with-python-and-scikit-learn/)

* * *

## References

Chollet, F. (2017). _Deep Learning with Python_. New York, NY: Manning Publications.

Ferdi. (n.d.). c-classification SVM vs nu-classification SVM in e1071 R. Retrieved from [https://stats.stackexchange.com/a/312904](https://stats.stackexchange.com/a/312904)

Statsoft. (n.d.). Support Vector Machines (SVM). Retrieved from [http://www.statsoft.com/textbook/support-vector-machines](http://www.statsoft.com/textbook/support-vector-machines)

Saed Sayad. (n.d.). Support Vector Regression. Retrieved from [https://www.saedsayad.com/support\_vector\_machine\_reg.htm](https://www.saedsayad.com/support_vector_machine_reg.htm)

Pablo Rivas. (n.d.). Difference between ep-SVR and nu-SVR (and least squares SVR).
Retrieved from [https://stats.stackexchange.com/a/167545](https://stats.stackexchange.com/a/167545) diff --git a/leaky-relu-improving-traditional-relu.md b/leaky-relu-improving-traditional-relu.md new file mode 100644 index 0000000..fe4114b --- /dev/null +++ b/leaky-relu-improving-traditional-relu.md @@ -0,0 +1,122 @@ +--- +title: "Leaky ReLU: improving traditional ReLU" +date: "2019-10-15" +categories: + - "buffer" + - "deep-learning" +tags: + - "activation-functions" + - "deep-learning" + - "neural-networks" + - "relu" +--- + +The **Leaky ReLU** is a type of [activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) which comes across many machine learning blogs every now and then. It is suggested that it is an improvement of traditional ReLU and that it should be used more often. + +But how is it an improvement? How does Leaky ReLU work? In this blog, we'll take a look. We identify what ReLU does and why this may be problematic in some cases. We then introduce Leaky ReLU and argue why its design can help reduce the impact of the problems of traditional ReLU. Subsequently, we briefly look into whether it is actually better and why traditional ReLU is still in favor today. + +After reading this tutorial, you will... + +- Understand how ReLU works. +- See why using ReLU can be problematic at times. +- How Leaky ReLU helps resolve these problems. + +Let's take a look 🚀 + +* * * + +**Update 08/Feb/2021:** ensure that article is up-to-date. + +* * * + +\[toc\] + +* * * + +## Brief recap: what is ReLU and how does it work? + +Rectified Linear Unit, or ReLU, is one of the most common [activation functions](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) used in neural networks today. It is added to layers in neural networks to add _nonlinearity_, which is required to handle today's ever more complex and nonlinear datasets. + +Each neuron computes a [dot product and adds a bias value](https://www.machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/) before the value is output to the neurons in the subsequent layer. These mathematical operations are linear in nature. This is not bad if we were training the model against a dataset that is linearly separable (in the case of classification) or where a line needs to be estimated (when regressing). + +However, if data is nonlinear, we face problems. Linear neuron outputs ensure that the system as a whole, thus the entire neural network, behaves linearly. By consequence, it cannot handle such data, which is very common today: the MNIST dataset, which we used for showing how to build [classifiers in Keras](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/), is nonlinear - and it is one of the simpler ones! + +Activation functions come to the rescue by adding nonlinearity. They're placed directly after the neural outputs and do nothing else but converting some input to some output. Because the mathematical functions used are nonlinear, the output is nonlinear - which is exactly what we want, since now the system behaves nonlinearly and nonlinear data is supported! + +Note that although activation functions are pretty much nonlinear all the time, it's of course also possible to use the identity function \[latex\]f(x) = x\[/latex\] as an activation function. It would be pointless, but it can be done. + +Now ReLU. 
It can be expressed as follows:

\\begin{equation} f(x) = \\begin{cases} 0, & \\text{if}\\ x < 0 \\\\ x, & \\text{otherwise} \\\\ \\end{cases} \\end{equation}

And visualized in this way:

[![](images/relu-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/05/relu.png)

For all values \[latex\]\\geq 0\[/latex\] it behaves linearly, but the function as a whole is nonlinear, because it outputs zeroes for all negative inputs.

Hence, it can be used as a nonlinear activation function.

It's grown very popular and may be the most popular activation used today - it is more popular than the older [Sigmoid and Tanh](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) activation functions - for the reason that it can be computed relatively inexpensively. Computing ReLU is equal to computing \[latex\]ReLU(x) = max(0, x)\[/latex\], which is much less expensive than the exponents or trigonometric operations necessary otherwise.

* * *

## Problems with ReLU

However, it's not a silver bullet, and every now and then you'll run into trouble when using ReLU. It doesn't happen often - which makes it highly generalizable across machine learning domains and machine learning problems - but you may run into some issues.

Firstly, ReLU is not continuously differentiable. At \[latex\]x = 0\[/latex\], the breaking point between \[latex\]x\[/latex\] and 0, the gradient cannot be computed. This is not too problematic, but can very lightly impact training performance.

Secondly, and more gravely, ReLU sets all values < 0 to zero. This is beneficial in terms of sparsity, as the network will adapt to ensure that the most important neurons have values of > 0. However, this is a problem as well, since the gradient of 0 is 0 and hence neurons arriving at large negative values cannot recover from being stuck at 0. The neuron effectively dies, and hence the problem is known as the _dying ReLU problem_. You're especially vulnerable to it when your neurons are not initialized properly or when your data is not normalized very well, causing significant weight swings during the first phases of optimizing your model. The impact of this problem may be that your network essentially stops learning and underperforms.

* * *

## Introducing Leaky ReLU

What if you _caused a slight but significant information leak_ in the left part of ReLU, i.e. the part where the output is always 0?

This is the premise behind **Leaky ReLU**, one of the possible newer activation functions that attempts to minimize one's sensitivity to the _dying ReLU problem_.

Mathematically, it is defined as follows (Maas et al., 2013):

\\begin{equation} f(x) = \\begin{cases} 0.01x, & \\text{if}\\ x < 0 \\\\ x, & \\text{otherwise} \\\\ \\end{cases} \\end{equation}

Leaky ReLU can be visualized as follows:

[![](images/leaky_relu.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/leaky_relu.png)

If you compare this with the image for traditional ReLU above, you'll see that for all \[latex\]inputs < 0\[/latex\], the outputs are slightly descending. The thesis is that these small numbers reduce the death of ReLU activated neurons. This way, you'll have to worry less about the initialization of your neural network and the normalization of your data. Although these topics remain important, they are slightly less critical.
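To make the difference concrete, here is a small NumPy sketch of both activation functions; the 0.01 slope follows the definition above, and in Keras a comparable effect can be obtained with the built-in `LeakyReLU` layer.

```python
# Minimal sketch: traditional ReLU versus Leaky ReLU, implemented with NumPy.
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # For x < 0 the output 'leaks' with slope alpha instead of being zero
    return np.where(x < 0, alpha * x, x)

inputs = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print("ReLU:      ", relu(inputs))        # [0. 0. 0. 0.5 3.]
print("Leaky ReLU:", leaky_relu(inputs))  # [-0.03 -0.005 0. 0.5 3.]
```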
* * *

## Does Leaky ReLU really work?

Next, the question: **does Leaky ReLU really work?** That is, does it really reduce the likelihood that your ReLU activating network dies off?

Let's try and find out.

Nouroz Rahman isn't convinced:

> However, I personally don’t think _Leaky ReLU_ provides any advantage over _ReLU_, holistically, considering both training and accuracy although some papers claimed to achieve that. That’s why _Leaky ReLU_ is trivial in deep learning and honestly speaking, I have never used it or thought of the necessity of using it.
>
> [Nouroz Rahman](https://www.quora.com/What-are-the-advantages-of-using-Leaky-Rectified-Linear-Units-Leaky-ReLU-over-normal-ReLU-in-deep-learning/answer/Nouroz-Rahman)

In a 2018 study, Pedamonti argues that Leaky ReLU and ReLU performance on the MNIST dataset is similar. Even though the problem of dying neural networks may now be solved theoretically, it can be the case that it simply doesn't happen very often - and that in those cases, normal ReLU works just as well. "It's simple, it's fast, it's standard" - someone argued. And I tend to agree.

* * *

## Summary

In this blog post, we've seen which challenges ReLU-activated neural networks can face. We also introduced Leaky ReLU, which attempts to resolve issues with traditional ReLU that are related to dying neural networks. We can conclude that in many cases traditional / normal ReLU remains a sensible choice, and that Leaky ReLU benefits you in those cases where you suspect your neurons are dying. I'd say: use ReLU if you can, and other linear rectifiers if you need to.

Happy engineering! 😊

* * *

## References

Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models. Retrieved from [https://www.semanticscholar.org/paper/Rectifier-Nonlinearities-Improve-Neural-Network-Maas/367f2c63a6f6a10b3b64b8729d601e69337ee3cc](https://www.semanticscholar.org/paper/Rectifier-Nonlinearities-Improve-Neural-Network-Maas/367f2c63a6f6a10b3b64b8729d601e69337ee3cc)

What are the advantages of using Leaky Rectified Linear Units (Leaky ReLU) over normal ReLU in deep learning? (n.d.). Retrieved from [https://www.quora.com/What-are-the-advantages-of-using-Leaky-Rectified-Linear-Units-Leaky-ReLU-over-normal-ReLU-in-deep-learning](https://www.quora.com/What-are-the-advantages-of-using-Leaky-Rectified-Linear-Units-Leaky-ReLU-over-normal-ReLU-in-deep-learning)

Pedamonti, D. (2018). Comparison of non-linear activation functions for deep neural networks on MNIST classification task. _arXiv preprint [arXiv:1804.02763](https://arxiv.org/pdf/1804.02763.pdf)_.

diff --git a/life-3-0-review-being-human-in-the-age-of-ai.md b/life-3-0-review-being-human-in-the-age-of-ai.md
new file mode 100644
index 0000000..04a3c70
--- /dev/null
+++ b/life-3-0-review-being-human-in-the-age-of-ai.md
@@ -0,0 +1,128 @@
---
title: "Life 3.0 Review: Being Human in the Age of AI"
date: "2019-11-04"
categories:
  - "books-about-ai"
tags:
  - "agi"
  - "artificial-intelligence"
  - "life-3-0"
  - "narrow-ai"
  - "physics"
  - "superintelligence"
---

According to Max Tegmark, the author of the **Life 3.0** book, any form of life can be classified into three distinct groups:

- Life 1.0, where life can't adapt hardware (their bodies) and software (their thinking and actions) itself - instead, that was the role of evolution across generations.
- Life 2.0, where life can't adapt hardware but can adapt software itself.
- Life 3.0, where life can adapt both.
+ +_(the Amazon advertisement contains an affiliate link for MachineCurve)._ + +...the book is about the third, and especially the wide array of possible paths towards there - plus their consequences. + +Why a review here? Simple - artificial intelligence, and its quest for artificial general intelligence, may spawn something that is more intelligent than human beings: a superintelligence. + +And Life 3.0 explores the consequences of such superintelligence. Let's find out how it does. + +\[toc\] + +## Welcome to the most important conversation of our time + +The book starts with the Omega team, which is an organization that has secretly created the world's first superintelligence, called Prometheus. At first, its makers ensure that they earn a lot of money by deploying its intelligence - but they do not stop there. + +No, on the contrary - since Prometheus is smarter than humans and is capable of improving itself, the world experiences a mind-blowing explosion of technical innovation, financial change and weakening of political systems. At last, they achieved what has never been achieved before - that the world is controlled by _one_ system. + +Even though this scenario seems to be very unlikely, it's not. Today, some organizations are already pursuing [artificial general intelligence](https://en.wikipedia.org/wiki/Artificial_general_intelligence#Artificial_general_intelligence_research) through technological research. Even though we can't tell for sure whether we'll eventually achieve AGI and/or superintelligence, we can't tell we don't either. + +By consequence, Max Tegmark welcomes us to the most important conversation of our time. + +![](images/action-artificial-intelligence-device-595804-1024x683.jpg) + +We don't need to be scared of robots. True superintelligent systems, Tegmark argues, will live inside of servers and other computing mediums. + +## Setting the stage with today's narrow AI + +In chapter 1, Tegmark sets the stage. He explores the concept of life - 1.0, 2.0 and 3.0, as outlined above - and introduces the thought that artificial intelligence research may allow us to reach Life 3.0 before the end of this century. + +Such research does not happen in a moral vacuum. Instead, a fascinating and intense discussion has emerged about the kind of future we can likely expect when Life 3.0 enters our daily lives. He introduces three important streams of thought: _technoskeptics_, who believe that AGI is so difficult that it will still take centuries before we'll realize it; the _utopian/singularity stream of thought_ which welcomes AGI because, they argue, it will improve the world beyond expectations, and finally the _beneficial AI movement_. This latter argues that AI systems can bring both good and bad, and that we should undertake much exploration into finding out what it should look like. + +After invalidating many false assumptions about artificial general intelligence and superintelligence, Tegmark explores the concepts of intelligence, memory, computation and learning in chapter 2. He links them to the fact that they are _independent of a medium_ - that is, where humans are limited by their brains, e.g. memory isn't limited by its hardware. Instead, we just invent new hardware to accomodate additional growth, as we have seen in the past. This ensures that the laws of physics are the only ones that we must respect, an important observation that Tegmark explores further in later chapters. 
+ +Of course - artificial intelligence has been around for a while, and it already impacts today's world. Chapter 3 explores the impact of contemporary AI on society in its many forms. For example, AI can already benefit our financial markets, self-driving cars, and healthcare. But this spawns a new question: how can we ensure that AI is reliable, and that it does what we want? Life 3.0 further explores this question in terms of autonomous weapons, autonomous legal systems, as well as AI and our jobs. The interesting thing here is that these changes are already happening all around us. We're already having [medical robots](https://emerj.com/ai-sector-overviews/artificial-intelligence-medical-robotics/), [AI can already diagnose eye disease](https://ai.googleblog.com/2016/11/deep-learning-for-detection-of-diabetic.html) and many jobs [can disappear as a result of AI](https://www.drdouggreen.com/2018/16-jobs-that-will-disappear-in-the-next-20-years-due-to-artificial-intelligence-from-alux-com/). Even when you cannot agree with the rest of the book, which is (as we shall see) much more abstract, this chapter is _very real_. + +[![](images/cabinet-data-data-center-325229-1024x358.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/11/cabinet-data-data-center-325229.jpg) + +## Superintelligence: from intelligence explosion to cosmic exploration + +It may (or may not) happen that we achieve artificial general intelligence one day - i.e., AI that is as intelligent and as versatile as human beings. This is entirely different than today's AI systems, which are very _narrow_ and hence good at one task while poor at all the others. + +AGI, on the other hand, will be able to sense its shortcomings and eliminate them through learning. + +This may result in what is called an _intelligence explosion_, where AGI systems improve themselves to levels beyond human comprehension - while doing so faster and faster. + +### The consequences of exploding intelligence + +Chapter 4 explores the consequences of such an intelligence explosion. What does it mean for the global systems that are in place today? Can they co-exist with superintelligent AI or will they be replaced? + +Can we control an intelligence explosion? If not, what will happen? And how does the speed with which superintelligent systems improve themselves influence its impact on the world? + +Those, and other questions, are introduced in this chapter, which bridges between AI as we know it and AI we couldn't even imagine - for better or for worse. + +### The next 10k years after superintelligence emerges + +We have no idea about what the world will look like after we create superintelligence. In fact, Tegmark argues in chapter 5, there are many streams of thought on this matter! + +For example, will we create peaceful superintelligent systems that are either friendly by design or because humans control them to be that way? + +Or, on the other side of the spectrum, will superintelligent systems replace human beings altogether, because they see fit? + +We just don't know - and consensus on this matter seems to be far away. Every scenario that is drawn (Tegmark draws more scenarios than illustrated above and elaborates substantially on almost each of them) has benefits _and_ drawbacks. The only thing we all agree on is that we have to find out the best approach _before superintelligence exists_. + +So far, Tegmark's arguments could be digested easily. 
We're now approaching the final stages of the book, and here Tegmark adds something of his background into the narrative: chapter 6 discusses the legacy of superintelligence from the point of view of a physicist. + +We saw earlier that superintelligence is limited only by the laws of physics. These laws are however more complex than we think, and they can be bended in favor of the superintelligence. For example, superintelligent systems are expected to have a significant energy consumption. And superintelligence that decides to colonize our cosmos has to deal with the speed of light when it communicates its thoughts internally. Chapter 6 explores these laws of physics and shortcuts which superintelligence may take. + +Even though it's incredibly interesting matter, even for people without a background in physics (like me), it was tougher to understand these parts of the book than the previous ones. And that is also true about the final two chapters of Life 3.0: the ones about _goal-driven behavior_ and _consciousness_. + +![](images/art-black-and-white-blur-724994-1024x736.jpg) + +## Goal-driven behavior & consciousness + +AIs are often thought of as _goal-driven_, that is, they have some goal that they will attempt to achieve no matter what. This latter is likely especially true for superintelligent systems, which may attempt to achieve their goals by eliminating everything that stands between them and their goals. + +But what is a goal? And where does goal-driven behavior originate from? Tegmark explores these questions in chapter 8. He does so by going back to his roots - physics - once again, explaining that goal-oriented behavior is rooted in the laws governing all matter. + +He next links this thought on goal-driven behavior to human beings and how we attempt to achieve our goals, before moving on to intelligent machines and how they may be equipped with goal-driven behavior. This obviously requires taking a look at ethical aspects and the question of _how to equip such systems with goals_, without eliminating ourselves. Once again, Tegmark challenges us to help think about how we should shape our future with superintelligence. + +The final chapter covers one of the most challenging topics surrounding superintelligent systems - the one of _consciousness_. What is consciousness? (We humans don't know for sure.) Do AI systems have consciousness - and by consequence, can they suffer, do they have rights, and does turning AI systems off equal murder? Those are questions that are intrinsically linked to this topic. + +Tegmark uses a multi-stage approach to covering his views on consciousness. First, he explores _which parts of the brain are responsible for consciousness_. This is a scientific topic and hence falsifiable, and can hence be explored relatively easily. However, the other two - how physics relates to consciousness and why such thing exists - are much harder to answer. He therefore doesn't even try - which does make total sense. + +## Life 3.0 is as objective as it can be + +A book about superintelligence equals speculation - period. We don't know whether we will eventually achieve it, nor when that will be the case. Life 3.0 is by consequence a very speculative book. + +_But it speculates as objectively as it can do_. + +Tegmark, being a scientist, takes a very broad, objective and chronological stance to the discussion on superintelligence. 
For example, instead of the authors of other books, such as James Barrat in _Our Final Invention_ (who takes a very defeatist view to superintelligent AIs), I feel that Tegmark illuminates equally all streams of thought on superintelligence. He also covers a broad range of outcomes of superintelligence, all their benefits and drawbacks, without preferring one explicitly (except perhaps one time: he clearly argues to be part of the _beneficial AI movement_ himself. That is, he's somewhere in the middle between skepticisim and utopianism, which perhaps allows his book to be _sans_ any speculative dogmas). + +The only thing you'll need to be able to handle is the fact that Tegmark is a physicist. Even though he writes in popular-scientific language, allowing you to digest the content with relative ease, the last three chapters are conceptually difficult. Since in the chapters on AI's cosmic colonization, goal-driven behavior and consciousness Tegmark links the physics and concepts to superintelligence relatively late, it requires some persistence to continue reading. But when you look back once you've finished them, you feel that you've just finished a story that _is indeed one of the more important ones of our time._ + +Life 3.0 is a very interesting book for the reader who aims to deepen his or her understanding of how today's narrow AI may grow into superintelligence - and the giant leap we still have to make to let this happen safely. I'd recommend it for sure. + +Will you read the book? Or did you read it already? Let me know what you think about it in the comments section below! 👇 I'm very curious to find out! + +## Check prices + +[![](//ws-na.amazon-adsystem.com/widgets/q?_encoding=UTF8&MarketPlace=US&ASIN=1101970316&ServiceVersion=20070822&ID=AsinImage&WS=1&Format=_SL160_&tag=webn3rd-20)](https://www.amazon.com/gp/product/1101970316/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1101970316&linkCode=as2&tag=webn3rd-20&linkId=f72f2bd211e0dc8dfa8e6a2649be65d4)![](//ir-na.amazon-adsystem.com/e/ir?t=webn3rd-20&l=am2&o=1&a=1101970316) + +**Life 3.0: Being Human in the Age of Artificial Intelligence** +Max Tegmark, 2018 +ISBN 978-1-101-94659-6 +Vintage Books + +[Check prices at Amazon (affiliate link).](https://amzn.to/34mGKlY) diff --git a/linking-maths-and-intuition-rosenblatts-perceptron-in-python.md b/linking-maths-and-intuition-rosenblatts-perceptron-in-python.md new file mode 100644 index 0000000..2fdac47 --- /dev/null +++ b/linking-maths-and-intuition-rosenblatts-perceptron-in-python.md @@ -0,0 +1,407 @@ +--- +title: "Linking maths and intuition: Rosenblatt's Perceptron in Python" +date: "2019-07-23" +categories: + - "svms" +tags: + - "machine-learning" + - "neural-networks" + - "rosenblatt-perceptron" +--- + +According to Wikipedia, Frank Rosenblatt is an "American psychologist notable in the field of artificial intelligence". + +And notable, he is. + +Rosenblatt is the inventor of the so-called **Rosenblatt Perceptron**, which is one of the first algorithms for supervised learning, invented in 1958 at the Cornell Aeronautical Laboratory. + +The blogs I write on MachineCurve.com are educational in two ways. First, I use them to structure my thoughts on certain ML related topics. Second, if they help me, they could help others too. This blog is one of the best examples: it emerged from my struggle to identify **why it is difficult** to implement Rosenblatt's Perceptron with modern machine learning frameworks. 
+ +Turns out that has to do with the means of optimizing one's model - a.k.a. the Perceptron Learning Rule vs [Stochastic Gradient Descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/). I'm planning to dive into this question in detail in another blog. This article describes the work I preformed _before_ being able to answer it - or, programming a Perceptron myself, understanding how it attempts to find the best [decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/). It **provides a tutorial** for **implementing the Rosenblatt Perceptron yourself**. + +I will first introduce the Perceptron in detail by discussing some of its history as well as its mathematical foundations. Subsequently, I will move on to the Perceptron Learning Rule, demonstrating how it improves over time. This is followed by a Python based Perceptron implementation that is finally demonstrated with a real dataset. + +Of course, if you want to start working with the Perceptron right away, you can find example code for the Rosenblatt Perceptron in the first section. + +If you run into questions during the read, or if you have any comments, please feel free to write a comment in the comment box near the bottom 👇 I'm happy to provide my thoughts and improve this post whenever I'm wrong. I hope to hear from you! + +**Update 13/Jan/2021:** Made article up-to-date. Added quick example to answer question _how to implement Rosenblatt Perceptron with Python?_ Performed changes to article structure. Added links to other articles. It's now ready for 2021! + +* * * + +\[toc\] + +* * * + +## Answer: implementing Rosenblatt Perceptron with Python + +Some people just want to start with code before they read further. That's why in this section, you'll find a **fully functional example** of the Rosenblatt Perceptron, created with Python. It shows a class that is initialized, that has a training loop (`train` definition) and which can generate predictions once trained (through `predict`). If you want to understand the Perceptron in more detail, make sure to read the rest of this tutorial too! + +``` +import numpy as np + +# Basic Rosenblatt Perceptron implementation +class RBPerceptron: + + # Constructor + def __init__(self, number_of_epochs = 100, learning_rate = 0.1): + self.number_of_epochs = number_of_epochs + self.learning_rate = learning_rate + + # Train perceptron + def train(self, X, D): + # Initialize weights vector with zeroes + num_features = X.shape[1] + self.w = np.zeros(num_features + 1) + # Perform the epochs + for i in range(self.number_of_epochs): + # For every combination of (X_i, D_i) + for sample, desired_outcome in zip(X, D): + # Generate prediction and compare with desired outcome + prediction = self.predict(sample) + difference = (desired_outcome - prediction) + # Compute weight update via Perceptron Learning Rule + weight_update = self.learning_rate * difference + self.w[1:] += weight_update * sample + self.w[0] += weight_update + return self + + # Generate prediction + def predict(self, sample): + outcome = np.dot(sample, self.w[1:]) + self.w[0] + return np.where(outcome > 0, 1, 0) +``` + +* * * + +## A small introduction - what is a Perceptron? 
+
+A Perceptron is a [binary classifier](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/#variant-1-binary-classification) that was invented by [Frank Rosenblatt](https://en.wikipedia.org/wiki/Frank_Rosenblatt) in 1958, while he was working on a US government funded research project at the Cornell Aeronautical Laboratory. It was based on recent advances in mimicking the human brain, in particular the MCP architecture that had been introduced earlier by McCulloch and Pitts.
+
+This architecture attempted to mimic the way neurons operate in the brain: given certain inputs, they fire, and their firing behavior can change over time. By [allowing the same to happen in an artificial neuron](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/#individual-neurons), researchers at the time argued, machines could become capable of approximating human intelligence.
+
+...well, that was a slight overestimation, I'd say 😄 Nevertheless, the Perceptron lies at the basis of where we've come today. It's therefore a very interesting topic to study deeper. Next, I will therefore scrutinize its mathematical building blocks, before moving on to implementing one in Python.
+
+\[ad\]
+
+### Mathematical building blocks
+
+When you [train a supervised machine learning model](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), it must somehow capture the information that you're giving it. The Perceptron does this by means of a _[weights vector](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/)_, or `**w**` that determines the exact position of the [decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) and is learnt from the data.
+
+If you input new data, say in an _input vector_ `**x**`, you'll simply have to pinpoint this vector with respect to the learnt weights, to decide on the class.
+
+Mathematically, this is represented as follows:
+
+\[mathjax\]
+
+\\begin{equation} f(x) = \\begin{cases} 1, & \\text{if}\\ \\textbf{w}\\cdot\\textbf{x}+b > 0 \\\\ 0, & \\text{otherwise} \\\\ \\end{cases} \\end{equation}
+
+Here, you can see why it is a binary classifier: it simply assigns the data to class '0' or class '1'. This is done based on the output of the multiplication of the weights and input vectors, with a bias value added.
+
+When you multiply two vectors, you're computing what is called a dot product. A dot product is the sum of the pair-wise multiplications of the individual scalars in the vectors. This means that e.g. \[latex\]w\_1x\_1\[/latex\] is computed and summed together with \[latex\]w\_2x\_2\[/latex\], \[latex\]w\_3x\_3\[/latex\] and so on ... until \[latex\]w\_nx\_n\[/latex\]. Mathematically:
+
+\\begin{equation} \\begin{split} &z=\\sum\_{i=1}^{n} w\_ix\_i + b \\\\ &= w\_1x\_1 + ... + w\_nx\_n + b \\\\ \\end{split} \\end{equation}
+
+When this output value is larger than 0, it's class 1, otherwise it's class 0. In other words: [binary classification](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/#variant-1-binary-classification).
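+
+To make the dot product and the thresholding a bit more tangible, here is a quick sketch in NumPy. The weights, bias and input values are just made-up numbers for illustration purposes - they are not learnt from data yet:
+
+```
+import numpy as np
+
+# Example weights vector, bias and input vector (illustrative values only)
+w = np.array([0.4, -0.2, 0.7])
+b = 0.1
+x = np.array([1.0, 2.0, 0.5])
+
+# Compute z = w . x + b
+z = np.dot(w, x) + b
+
+# Apply the threshold: class 1 if z > 0, else class 0
+prediction = 1 if z > 0 else 0
+print(z, prediction)  # prints a value close to 0.45, and class 1
+```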
+
+### The Perceptron, visually
+
+Visually, this looks as follows:
+
+![](images/Perceptron-1024x794.png)
+
+All right - we now have a mathematical structure for automatically deciding about the class. Weights vector `**w**` and bias value `_b_` are used for setting the decision boundary. We did however not yet cover how the Perceptron is updated. Let's find out now!
+
+* * *
+
+## Before optimizing: moving the bias into the weights vector
+
+Rosenblatt did not only provide the model of the perceptron, but also the method for optimizing it.
+
+This however requires that we first move the bias value into the weights vector.
+
+This sounds strange, but it is actually a very elegant way of making the equation simpler.
+
+\[ad\]
+
+As you recall, this is how the Perceptron can be defined mathematically:
+
+\\begin{equation} f(x) = \\begin{cases} 1, & \\text{if}\\ \\textbf{w}\\cdot\\textbf{x}+b > 0 \\\\ 0, & \\text{otherwise} \\\\ \\end{cases} \\end{equation}
+
+Of which \[latex\]\\textbf{w}\\cdot\\textbf{x}+b\[/latex\] could be written as:
+
+\\begin{equation} \\begin{split} &z=\\sum\_{i=1}^{n} w\_ix\_i + b \\\\ &= w\_1x\_1 + ... + w\_nx\_n + b \\\\ \\end{split} \\end{equation}
+
+We now add the bias to the weights vector as \[latex\]w\_0\[/latex\] and choose \[latex\]x\_0 = 1\[/latex\]. This looks as follows:
+
+![](images/Perceptron_with_bias-1024x907.png)
+
+This allows us to rewrite \[latex\]z\[/latex\] as follows - especially recall that \[latex\]w\_0 = b\[/latex\] and \[latex\]x\_0 = 1\[/latex\]:
+
+\\begin{equation} \\begin{split} & z = \\sum\_{i=0}^{n} w\_ix\_i \\\\ & = w\_0x\_0 + w\_1x\_1 + ... + w\_nx\_n \\\\ & = b \\cdot 1 + w\_1x\_1 + ... + w\_nx\_n \\\\ & = w\_1x\_1 + ... + w\_nx\_n + b \\end{split} \\end{equation}
+
+As you can see, it is still equal to the original way of writing it:
+
+\\begin{equation} \\begin{split} &z=\\sum\_{i=1}^{n} w\_ix\_i + b \\\\ &= w\_1x\_1 + ... + w\_nx\_n + b \\\\ \\end{split} \\end{equation}
+
+This way, we got rid of the bias \[latex\]b\[/latex\] in our main equation, which will greatly help us with what we'll do now: update the weights in order to optimize the model.
+
+* * *
+
+## Training the model
+
+We'll use what is called the _Perceptron Learning Rule_ for that purpose. But first, we need to show you how the model is actually trained - by showing the pseudocode for the entire training process.
+
+We'll have to make a couple of assumptions at first:
+
+1. There is the weights vector `w` which, at the beginning, is [uninitialized](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/).
+2. You have a set of training values, such as \[latex\]T = \\{ (x\_1, d\_1), (x\_2, d\_2), ..., (x\_n, d\_n) \\}\[/latex\]. Here, \[latex\]x\_n\[/latex\] is a specific feature vector, while \[latex\]d\_n\[/latex\] is the corresponding target value.
+3. We ensure that \[latex\]w\_0 = b\[/latex\] and \[latex\]x\_0 = 1\[/latex\].
+4. We will have to configure a _[learning rate](https://www.machinecurve.com/index.php/2019/11/06/what-is-a-learning-rate-in-a-neural-network/)_ or \[latex\]r\[/latex\], which determines by how much the model weights change with every update. This is a number between 0 and 1. We use \[latex\]r = 0.1\[/latex\] in the Python code that follows next.
+
+This is the pseudocode:
+
+1. Initialize the weights vector `**w**` to zeroes or random numbers.
+2. For every \[latex\](x\_n, d\_n)\[/latex\] in \[latex\]T\[/latex\]:
+    1. Compute the output value for the input vector \[latex\]x\_n\[/latex\].
Mathematically, that's \[latex\]d'\_n: f(x\_n) = w\_nx\_n\[/latex\]. + 2. Compare the output value \[latex\]d'\_n\[/latex\] with target value \[latex\]d\_n\[/latex\]. + 3. Update the weights according to the Perceptron Learning Rule: \[latex\]w\_\\text{n,i}(t+1) = w\_\\text{n,i}(t) + r \\cdot (d\_n - d'\_n) \\cdot x\_\\text{n,i}\[/latex\] for all features (scalars) \[latex\]0 \\leq i \\leq|w\_n|\[/latex\]. + +Or, in plain English: + +- First [initialize the weights](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/) randomly or to zeroes. +- Iterate over every feature in the data set. +- Compute the output value. +- Compare if it matches, and 'push' the weights into the right direction (i.e. the \[latex\]d\_n - d'\_n\[/latex\] part) slightly with respect to \[latex\]x\_\\text{n,i}\[/latex\], as much as the learning rate \[latex\]r\[/latex\] allows. + +This means that the weights are updated for every sample from the dataset. + +This process may be repeated until some criterion is reached, such as a specific number of errors, or - if you are adventurous - full convergence (i.e., the number of errors is 0). + +* * * + +## Perceptron in Python + +Now let's see if we can code a Perceptron in Python. Create a new folder and add a file named `p.py`. In it, let's first import numpy, which we'll need for some number crunching: + +``` +import numpy as np +``` + +We'll create a class that is named `RBPerceptron`, or Rosenblatt's Perceptron. Classes in Python have a specific structure: they must be defined as such (by using `class`) and can contain Python definitions which must be coupled to the class through `self`. Additionally, it may have a constructor definition, which in Python is called `__init__`. + +### Class definition and constructor + +So let's code the class: + +``` +# Basic Rosenblatt Perceptron implementation +class RBPerceptron: +``` + +Next, we want to allow the engineer using our Perceptron to configure it before he or she starts the training process. We would like them to be able to configure two variables: + +- The number of epochs, or rounds, before the model stops the training process. +- The learning rate \[latex\]r\[/latex\], i.e. the determinant for the size of the weight updates. + +We'll do that as follows: + +``` + # Constructor + def __init__(self, number_of_epochs = 100, learning_rate = 0.1): + self.number_of_epochs = number_of_epochs + self.learning_rate = learning_rate +``` + +The `__init__` definition nicely has a self reference, but also two attributes: `number_of_epochs` and `learning_rate`. These are preconfigured, which means that if those values are not supplied, those values serve as default ones. By default, the model therefore trains for 100 epochs and has a default learning rate of 0.1 + +However, since the user can manually provide those, they must also be set. We need to use them globally: the number of epochs and the learning rate are important for the training process. By consequence, we cannot simply keep them in the context of our Python definition. Rather, we must add them to the instance variables of the class. This can be done by assigning them to the class through `self`. 
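+
+As a quick illustration of these defaults: once the class is complete (see the full code further below), it could be instantiated with custom settings like this - the exact values used here are just an example:
+
+```
+# Uses the defaults: 100 epochs, learning rate 0.1
+perceptron = RBPerceptron()
+
+# Overrides the defaults
+perceptron = RBPerceptron(number_of_epochs = 250, learning_rate = 0.05)
+```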
+ +\[ad\] + +### The training definition + +All right, the next part - the training definition: + +``` + # Train perceptron + def train(self, X, D): + # Initialize weights vector with zeroes + num_features = X.shape[1] + self.w = np.zeros(num_features + 1) + # Perform the epochs + for i in range(self.number_of_epochs): + # For every combination of (X_i, D_i) + for sample, desired_outcome in zip(X, D): + # Generate prediction and compare with desired outcome + prediction = self.predict(sample) + difference = (desired_outcome - prediction) + # Compute weight update via Perceptron Learning Rule + weight_update = self.learning_rate * difference + self.w[1:] += weight_update * sample + self.w[0] += weight_update + return self +``` + +The definition it self must once again have a `self` reference, which is provided. However, it also requires the engineer to pass two attributes: `X`, or the set of input samples \[latex\]x\_1 ... x\_n\[/latex\], as well as `D`, which are their corresponding targets. + +Within the definition, we first initialize the weights vector as discussed above. That is, we assign it with zeroes, and it is `num_features + 1` long. This way, it can both capture the features \[latex\]x\_1 ... x\_n\[/latex\] as well as the bias \[latex\]b\[/latex\] which was assigned to \[latex\]x\_0\[/latex\]. + +Next, the training process. This starts by creating a `for` statement that simply ensures that the program iterates over the `number_of_epochs` that were configured by the user. + +During one iteration, or epoch, every combination of \[latex\](x\_i, d\_i)\[/latex\] is iterated over. In line with the pseudocode algorithm, a prediction is generated, the difference is computed, and the weights are updated accordingly. + +After the training process has finished, the model itself is returned. This is not necessary, but is relatively convenient for later use by the ML engineer. + +### Generating predictions + +Finally, the model must also be capable of generating predictions, i.e. computing the dot product \[latex\]\\textbf{w}\\cdot\\textbf{x}\[/latex\] (where \[latex\]b\[/latex\] is included as \[latex\]w\_0\[/latex\]). + +We do this relatively elegantly, thanks to another example of the Perceptron algorithm provided by [Sebastian Raschka](https://sebastianraschka.com/Articles/2015_singlelayer_neurons.html#artificial-neurons-and-the-mcculloch-pitts-model): we first compute the dot product for all weights except \[latex\]w\_0\[/latex\] and subsequently add this one as the bias weight. Most elegantly, however, is how the prediction is generated: with `np.where`. This allows an engineer to generate predictions for a batch of samples \[latex\]x\_i\[/latex\] at once. It looks as follows: + +``` + # Generate prediction + def predict(self, sample): + outcome = np.dot(sample, self.w[1:]) + self.w[0] + return np.where(outcome > 0, 1, 0) +``` + +### Final code + +All right - when integrated, this is our final code. + +You can also check it out on [GitHub](https://github.com/christianversloot/rosenblatts-perceptron). 
+ +``` +import numpy as np + +# Basic Rosenblatt Perceptron implementation +class RBPerceptron: + + # Constructor + def __init__(self, number_of_epochs = 100, learning_rate = 0.1): + self.number_of_epochs = number_of_epochs + self.learning_rate = learning_rate + + # Train perceptron + def train(self, X, D): + # Initialize weights vector with zeroes + num_features = X.shape[1] + self.w = np.zeros(num_features + 1) + # Perform the epochs + for i in range(self.number_of_epochs): + # For every combination of (X_i, D_i) + for sample, desired_outcome in zip(X, D): + # Generate prediction and compare with desired outcome + prediction = self.predict(sample) + difference = (desired_outcome - prediction) + # Compute weight update via Perceptron Learning Rule + weight_update = self.learning_rate * difference + self.w[1:] += weight_update * sample + self.w[0] += weight_update + return self + + # Generate prediction + def predict(self, sample): + outcome = np.dot(sample, self.w[1:]) + self.w[0] + return np.where(outcome > 0, 1, 0) +``` + +* * * + +## Testing with a dataset + +All right, let's now test our implementation of the Perceptron. For that, we'll need a dataset first. Let's generate one with Python. Go to the same folder as `p.py` and create a new one, e.g. `dataset.py`. Use this file for the next steps. + +### Generating the dataset + +We'll first import `numpy` and generate 50 zeros and 50 ones. + +We then combine them into the `targets` list, which is now 100 long. + +We'll then use the normal distribution to generate two samples that do not overlap of both 50 samples. + +Finally, we concatenate the samples into the list of input vectors `X` and set the desired targets `D` to the targets generated before. + +``` +# Import libraries +import numpy as np + +# Generate target classes {0, 1} +zeros = np.zeros(50) +ones = zeros + 1 +targets = np.concatenate((zeros, ones)) + +# Generate data +small = np.random.normal(5, 0.25, (50,2)) +large = np.random.normal(6.5, 0.25, (50,2)) + +# Prepare input data +X = np.concatenate((small,large)) +D = targets +``` + +### Visualizing the dataset + +It's always nice to get a feeling for the data you're working with, so let's first visualize the dataset: + +``` +import matplotlib.pyplot as plt +plt.scatter(small[:,0], small[:,1], color='blue') +plt.scatter(large[:,0], large[:,1], color='red') +plt.show() +``` + +It should look like this: + +![](images/linearly_separable_dataset-1.png) + +Let's next train our Perceptron with the entire [training set](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/) `X` and the corresponding desired targets `D`. + +\[ad\] + +We must first initialize our Perceptron for this purpose: + +``` +from p import RBPerceptron +rbp = RBPerceptron(600, 0.1) +``` + +Note that we use 600 epochs and set a learning rate of 0.1. Let's now train our model: + +``` +trained_model = rbp.train(X, D) +``` + +The training process should be completed relatively quickly. 
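+
+If you want to quickly sanity check the trained model before plotting anything, you could feed it a couple of hand-picked samples - one roughly at the center of each blob we generated above - and inspect the predicted classes. This assumes you are following along in `dataset.py`, where `rbp` and `trained_model` are defined:
+
+```
+import numpy as np
+samples = np.array([[5.0, 5.0], [6.5, 6.5]])
+print(rbp.predict(samples))  # Should typically print [0 1] for a successfully trained model
+```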
+We can now visualize the Perceptron and its [decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) with a library called [mlxtend](http://rasbt.github.io/mlxtend/) - once again, the credits for using this library go out to [Sebastian Raschka](https://sebastianraschka.com/Articles/2015_singlelayer_neurons.html#artificial-neurons-and-the-mcculloch-pitts-model).
+
+If you don't have it already, install it first by means of `pip install mlxtend`.
+
+Subsequently, add this code:
+
+```
+from mlxtend.plotting import plot_decision_regions
+plot_decision_regions(X, D.astype(np.integer), clf=trained_model)
+plt.title('Perceptron')
+plt.xlabel('X1')
+plt.ylabel('X2')
+plt.show()
+```
+
+You should now see the same data with the Perceptron decision boundary successfully separating the two classes:
+
+![](images/perceptron_with_boundary.png)
+
+There you go, Rosenblatt's Perceptron in Python!
+
+* * *
+
+## References
+
+Bernard (2018, December). Align equation left. Retrieved from [https://tex.stackexchange.com/questions/145657/align-equation-left](https://tex.stackexchange.com/questions/145657/align-equation-left)
+
+Raschka, S. (2015, March 24). Single-Layer Neural Networks and Gradient Descent. Retrieved from [https://sebastianraschka.com/Articles/2015\_singlelayer\_neurons.html#artificial-neurons-and-the-mcculloch-pitts-model](https://sebastianraschka.com/Articles/2015_singlelayer_neurons.html#artificial-neurons-and-the-mcculloch-pitts-model)
+
+Perceptron. (2003, January 22). Retrieved from [https://en.wikipedia.org/wiki/Perceptron#Learning\_algorithm](https://en.wikipedia.org/wiki/Perceptron#Learning_algorithm)
diff --git a/longformer-transformers-for-long-sequences.md b/longformer-transformers-for-long-sequences.md
new file mode 100644
index 0000000..c28a475
--- /dev/null
+++ b/longformer-transformers-for-long-sequences.md
@@ -0,0 +1,16 @@
+---
+title: "Longformer: Transformers for Long Sequences"
+date: "2021-03-11"
+categories:
+  - "buffer"
+  - "deep-learning"
+tags:
+  - "deep-learning"
+  - "language-model"
+  - "longformer"
+  - "machine-learning"
+  - "nlp"
+  - "transformer"
+---
+
+Transformers have really changed the NLP world, in part due to their self-attention component. But this component is problematic in the sense that it has quadratic computational and memory growth with sequence length: the QK^T multiplication (Queries times transposed Keys) in the self-attention component produces an attention matrix that grows quadratically with the number of tokens. By consequence, Transformers cannot be trained on really long sequences because resource requirements are just too high. BERT, for example, sets a maximum sequence length of 512 tokens.
diff --git a/machine-learning-error-bias-variance-and-irreducible-error-with-python.md b/machine-learning-error-bias-variance-and-irreducible-error-with-python.md
new file mode 100644
index 0000000..3a6ad24
--- /dev/null
+++ b/machine-learning-error-bias-variance-and-irreducible-error-with-python.md
@@ -0,0 +1,268 @@
+---
+title: "Machine Learning Error: Bias, Variance and Irreducible Error with Python"
+date: "2020-11-02"
+categories:
+  - "frameworks"
+  - "svms"
+tags:
+  - "bias"
+  - "error"
+  - "loss-function"
+  - "loss-value"
+  - "machine-learning"
+  - "supervised-learning"
+  - "variance"
+---
+
+Supervised Machine Learning is one of the most prominent branches of Machine Learning these days. Using a labeled training set and an adequate model, it is possible to create an ML model that demonstrates very impressive results.
As we will see, training such a model involves a cycle of _feeding forward_ data through the model, _observing how bad it performs_, and subsequently _optimizing it to make it better_. + +After some threshold is passed, the training process stops, and the model is trained. + +The "observing how bad it performs" part is the focus of today's article. Because the **error**, with which the former is expressed, is interesting - because it is composed of multiple error subtypes. This article will focus on these subtypes, which are the **bias error**, the **variance error** and the **irreducible error**. We will find out what error is in general, what those subtypes are, and how we can decompose a TensorFlow ML model into the error subtypes. + +Let's take a look! :) + +* * * + +\[toc\] + +* * * + +## Error in supervised machine learning: what is it? + +From the article about loss and loss functions, we know about the [high-level supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process): + +1. Samples from a labeled dataset are inserted into the model - this is called "feeding the samples forward". +2. The machine learning model generates a prediction for each sample. +3. All predictions are compared to the labels, called the ground truth, and a _loss value_ is output. +4. Based on the loss value, the loss is computed backwards, to find the optimizations for the individual parts of the machine learning model. +5. By means of some optimization mechanism (e.g. [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or [Adaptive optimization](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/)), the model is optimized. + +Above, we talked about the "observing how bad it performs" part of training a supervised machine learning model. Note that "how bad" and "loss" have relatively similar meaning - and yes, they are connected. + +The "loss value" effectively shows you _how bad the model performs_ - in other words, how much off the model is compared to the ground truth, on average. + +Hence, this loss value is also called the **model error**. Your goal as a machine learning engineer is to create a dataset, find a suitable algorithm, and tweak it accordingly, to generate a model that performs _and_ generalizes well. In other words, it must be accurate in terms of the prediction, and work in a wide range of cases - even with data that the model has never seen before. + +And of course, this is quite a challenge. + +![](images/High-level-training-process-1024x973.jpg) + +* * * + +## Error subtypes + +Above, we saw that feeding forward the training samples results in a loss or _error_ value. It can be used subsequently for improving your Machine Learning model. Now, we'll dive into the error concept, and will see that it can be decomposed in a few distinct subtypes: **bias error**, **variance error** and **irreducible error**. + +Put simply, the subtypes together compose the notion of 'error' in the following manner: + +_Error = Bias error + Variance error + Irreducible error._ + +- **Bias error:** how strict the model generalizes to some designated set of functions. +- **Variance error:** how much the estimated function will change when the algorithm is trained with differing datasets. +- **Irreducible error:** error that is neither bias or variance error and is hence relatively random. 
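+
+For those who prefer a more formal view: for the mean squared error - which we will also use later in this article - this decomposition is commonly written as follows, where \[latex\]\\hat{f}(x)\[/latex\] is the trained model's prediction and \[latex\]\\sigma^2\[/latex\] is the irreducible error:
+
+\\begin{equation} E\\left[ (y - \\hat{f}(x))^2 \\right] = \\left( \\text{Bias}\\left[ \\hat{f}(x) \\right] \\right)^2 + \\text{Var}\\left[ \\hat{f}(x) \\right] + \\sigma^2 \\end{equation}
+
+Note that this exact form holds for squared-error loss; for other loss functions the decomposition looks slightly different, but the intuition - error is made up of bias, variance and irreducible error - stays the same.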
+ +### Bias error + +In Dietterich and Kong (1995), we find that Mitchell (1980) introduces the concept of **bias error** as follows: + +> "Any basis for choosing one generalization \[hypothesis\] over another, other than strict consistency with the observed training instances." + +While this sounds relatively vague - likely on purpose, for generalization purposes - we can relatively easily convert it into a definition that resonates well with ML researchers and engineers: + +Bias involves an _assumption of the Machine Learning model that the target function to learn is part of a set of target functions_. In other words, if a model can only learn - or fit - a few example functions, it is a **high-bias** model. If the model can learn many functions instead, it is a **low-bias** model. + +> Bias, in the context of Machine Learning, is a type of error that occurs due to erroneous assumptions in the learning algorithm. +> +> StackExchange (n.d.) + +High bias, by making a lot of assumptions about the target function, simplifies the model and makes the fit less computationally intensive. + +For example, linear regression is a high-bias model, as it attempts to learn fit data to a function of the form \[latex\]y = a \\times x + b\[/latex\], and nothing else: + +![](images/linear-1024x514.png) + +Bias error quantifies the amount of error that can be attributed to this assumption. In the plot above, we can see that due to the high-bias property of the linear learner, the bias error shall be quite high. + +- **Models with high bias:** linear regression, logistic regression, linear classification, linear neural networks, linear SVMs +- **Models with low bias:** [nonlinear neural networks](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/), [nonlinear Support Vector Machines](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/), decision trees. + +Your choice for a ML algorithm should never be entirely dependent on the bias assumption of the model. For example, if you have a linear dataset, there is no need to start with neural networks - instead, a linear classifier or linear regression model would likely be able to achieve similar performance at a fraction of the computational cost. Therefore, make sure to think about the characteristics of your dataset, the bias property, but also make sure to consider what we will study next: the variance error. + +### Variance error + +While the bias of a model tells us something about how rigid it is towards fitting a particular function, the **variance** of our model is related to our datasets: + +> Variance, in the context of Machine Learning, is a type of error that occurs due to a model's sensitivity to small fluctuations in the training set. +> +> StackExchange (n.d.) + +Say, for example that we are training the same machine learning model with two different datasets. Model-wise, everything is the same - the algorithm is the same, the hyperparameter configuration is the same, and so on. The only thing that differs is the dataset. + +Here, it must also be noted that we do not know whether the distributions of the datasets are _exactly_ the same - they could be, but do not necessarily have to be. However, they're close. + +If our model is a **high-variance** model, it is really sensitive to changes in the dataset, and hence could show highly different performance - even when the changes are small. 
If it's **low-variance**, it's not so sensitive.
+
+Especially when the model is [overfit](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/), the model generally has high variance - and visually, decision boundaries of such models look like this:
+
+![](images/nonlinear-1-1024x514.png)
+
+- **Models with low variance:** linear regression, logistic regression, linear classification, linear neural networks, linear SVMs
+- **Models with high variance:** [nonlinear neural networks](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/), [nonlinear Support Vector Machines](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/), decision trees.
+
+### Irreducible error
+
+Some of the model error cannot be ascribed to bias or variance. This **irreducible error** is, for example, the random noise that is inherent to the data itself - fluctuations that no model, however well chosen, can capture.
+
+If we want to reduce the impact of model bias, we can choose a machine learning algorithm that is relatively low-bias - that is, increase model complexity and sensitivity. If we want to reduce model sensitivity to changes in data, we can pick a machine learning algorithm that is more rigid. We cannot remove irreducible error from the machine learning model.
+
+It's simply something that we have to live with.
+
+### The Bias-Variance trade-off
+
+In writing the article, I have dropped some hints that bias and variance may be related to each other.
+
+- If you read the article with a critical mind, you perhaps noticed that the list of models with low/high variance is exactly the opposite of the list for low/high bias.
+- In the section about irreducible error, reducing the effect of one (say, bias) meant moving into the direction of the other (say, variance).
+
+And the opposite is also true. In fact, bias and variance are related. This is true for statistics and hence also for the field of machine learning. It is known as the **bias-variance trade-off**.
+
+> The bias–variance trade-off implies that a model should balance underfitting and overfitting: Rich enough to express underlying structure in data and simple enough to avoid fitting spurious patterns
+>
+> Belkin et al. (2019)
+
+If you compare generating a machine learning model with playing a game of darts and aiming for bull's eye, your optimal end result would be a darts board where all arrows are in the middle of the board:
+
+[![](images/darts-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/darts-1.png)
+
+In Machine Learning terms, this is a model with **low bias** and **low variance**.
+
+It is both rich enough "to express underlying structure in data" (i.e., all arrows near the desired spot, being the center) and simple enough to "avoid fitting spurious patterns" (i.e., no arrows scattered around the board). In other words, it is a model whose predictions are "spot on" and "not scattered".
+
+In fact, we can extend the darts board to all four cases between low/high bias and low/high variance.
+
+- If your **bias is low** and your **variance is high**, your darts arrows will be near the center but will show some scattering (ML: capable of fitting many patterns, but with some sensitivity to data changes).
+- If your **bias is high** and your **variance is low**, the darts arrows will be near each other, but not near the center (ML: not so sensitive to data changes, but too biased, and hence predictions that are collectively off).
+- If your **bias is high** and your **variance is high**, the darts arrows will both be scattered and away from the center (ML: too sensitive _and_ not capable of generating precise predictions).
+- If your **bias is low** and your **variance is low**, your model is spot on without scattering. This is what you want.
+
+[![](images/darts-1024x925.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/darts.png)
+
+The **trade-off** in the bias-variance trade-off means that you have to choose between giving up bias and giving up variance in order to generate a model that really works. If you choose a machine learning algorithm with more bias, it will often reduce variance, making it less sensitive to data. This can be good, unless the bias means that the model becomes too rigid. The opposite is also true: if you give up rigidity only to find that the model shows too much sensitivity, you've crossed the balance between bias and variance in the wrong direction.
+
+Your end goal as a ML engineer is to find the sweet spot between bias and variance.
+
+This is no easy task, and it is dependent on your dataset, the computational resources at your disposal (high-bias models are often less resource-intensive than low-bias models, while high-variance models tend to require more resources), and your ML experience.
+
+Here, the only lesson is that practice makes perfect.
+
+* * *
+
+## Decomposing your ML error value into error subtypes
+
+When you train a machine learning model, the [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) that you pick allows you to observe loss values throughout the training process and during the model evaluation step.
+
+This loss value can be decomposed into bias and variance error by means of Sebastian Raschka's [Mlxtend](http://rasbt.github.io/mlxtend/) Python library, with which one can also [plot the decision boundary of a classifier](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/).
+
+More specifically, this can be done by means of the `bias_variance_decomp` functionality available in the library. Let's see how it works with the TensorFlow model listed here. It is the [MLP for regression that we created](https://www.machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/) with the Chennai Reservoir Level Dataset ([click here for the dataset](https://www.machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/#getting-familiar-with-the-data-the-chennai-water-crisis)).
+
+- We import a variety of functionality: the bias-variance decomposition functionality from Mlxtend, some TensorFlow things, NumPy and the train/test split helper from Scikit-learn.
+- We then load the data from a CSV file and shuffle the dataset.
+- We separate features and targets and split the data into a 66/33 train/test split.
+- We configure the input shape and subsequently create, configure and fit the model.
+- We then evaluate the model and display the loss results.
+- Finally, we use Mlxtend to decompose the loss into bias and variance loss: we set the loss function to Mean Squared Error, let it simulate bias and variance for 100 iterations, and initialize it with a random seed `46` (this can be any number) - see the standalone sketch right after this list for a minimal example of this function.
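+
+Before walking through that TensorFlow example, it can help to see `bias_variance_decomp` in isolation. The sketch below is a minimal, self-contained illustration - it assumes that Mlxtend and Scikit-learn are installed, and it uses a generated toy dataset and a decision tree regressor instead of the Chennai data and the Keras model:
+
+```
+from mlxtend.evaluate import bias_variance_decomp
+from sklearn.datasets import make_regression
+from sklearn.model_selection import train_test_split
+from sklearn.tree import DecisionTreeRegressor
+
+# Generate a simple, noisy regression dataset
+X, y = make_regression(n_samples=500, n_features=3, noise=10.0, random_state=42)
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
+
+# A decision tree is a typical low-bias / high-variance learner
+tree = DecisionTreeRegressor(random_state=42)
+
+# Decompose the Mean Squared Error into bias and variance over 100 rounds
+avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
+    tree, X_train, y_train, X_test, y_test,
+    loss='mse',
+    num_rounds=100,
+    random_seed=46)
+
+print('Average expected loss: %.3f' % avg_expected_loss)
+print('Average bias: %.3f' % avg_bias)
+print('Average variance: %.3f' % avg_var)
+```
+
+Now, back to the TensorFlow model for the Chennai dataset.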
+ +Running the model gives us the components of the **bias error** and the **variance error**. + +- Note that in November 2020, the Mlxtend package could not yet generate a decomposition for a Keras model. However, this was added quite recently, but was not available in the `pip` release yet. Therefore, make sure to install/upgrade Mlxtend from GitHub (GitHub, n.d.), by means of: + +``` +pip install git+git://github.com/rasbt/mlxtend.git +``` + +Here is the code. + +``` +# Imports +from mlxtend.evaluate import bias_variance_decomp +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +import numpy as np +from sklearn.model_selection import train_test_split + +# Load data +dataset = np.loadtxt('./chennai_reservoir_levels.csv', delimiter='|', skiprows=1, usecols=(1,2,3,4)) + +# Shuffle dataset +np.random.shuffle(dataset) + +# Separate features and targets +X = dataset[:, 0:3] +y = dataset[:, 3] + +# Split + into train/test sets +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) + +# Set the input shape +input_shape = (3,) +print(f'Feature shape: {input_shape}') + +# Create the model +model = Sequential() +model.add(Dense(16, input_shape=input_shape, activation='relu')) +model.add(Dense(8, activation='relu')) +model.add(Dense(1, activation='linear')) + +# Configure the model and start training +model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_squared_error']) +model.fit(X_train, y_train, epochs=25, batch_size=1, verbose=1, validation_split=0.2) + +# Test the model after training +test_results = model.evaluate(X_test, y_test, verbose=1) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%') + +# Using Mlxtend to decompose loss +avg_expected_loss, avg_bias, avg_var = bias_variance_decomp( + model, X_train, y_train, X_test, y_test, + loss='mse', + num_rounds=100, + random_seed=46) + +print('Average expected loss: %.3f' % avg_expected_loss) +print('Average bias: %.3f' % avg_bias) +print('Average variance: %.3f' % avg_var) +``` + +* * * + +## Summary + +In this article, we looked at the trade-off between bias error and variance error. We saw that machine learning model error can be decomposed into subtypes, being the bias - or the rigidity of the model in terms of the functions that it can learn - and variance - being the sensitivity to fluctuations in the training data. + +Through an example of a darts board, we saw that we want to strike a fair balance between bias and variance - and that we would likely achieve a model with relatively low bias _and_ variance. + +Unfortunately, we also saw that this is not always possible. Still, here, it is also important to find a balance between the two. For this reason, we illustrated how the error from a TensorFlow model can be decomposed into bias and variance by means of Mlxtend. + +I hope that you have learnt something from today's article! If you did, please leave a comment in the comments section below - I would definitely appreciate it 😊 Please do the same if you have questions, remarks or other comments. Where possible, I will make sure to answer them! + +Thank you for reading MachineCurve today and happy engineering 😎 + +* * * + +## References + +Raschka, S. (n.d.). _Mlxtend.evaluate - mlxtend_. Site not found · GitHub Pages. 
[https://rasbt.github.io/mlxtend/api\_subpackages/mlxtend.evaluate/#bias\_variance\_decomp](https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.evaluate/#bias_variance_decomp) + +Dietterich, T. G., & Kong, E. B. (1995). _[Machine learning bias, statistical bias, and statistical variance of decision tree algorithms](http://www.cems.uwe.ac.uk/~irjohnso/coursenotes/uqc832/tr-bias.pdf)_. Technical report, Department of Computer Science, Oregon State University. + +Mitchell, T. M. (1980). [The need for biases in learning generalizations.](http://www-cgi.cs.cmu.edu/~tom/pubs/NeedForBias_1980.pdf) Tech. rep. CBMTR-117, Rutgers University, New Brunswick, NJ. + +Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). [Reconciling modern machine-learning practice and the classical bias–variance trade-off](https://www.pnas.org/content/116/32/15849.short). _Proceedings of the National Academy of Sciences_, _116_(32), 15849-15854. + +StackExchange. (n.d.). _What is the meaning of term variance in machine learning model?_ Data Science Stack Exchange. [https://datascience.stackexchange.com/a/37350](https://datascience.stackexchange.com/a/37350) + +Dartboard icon made by [Freepik](https://www.flaticon.com/authors/freepik "Freepik") from [www.flaticon.com](https://www.flaticon.com/ "Flaticon") + +GitHub. (n.d.). _Bias\_variance\_decomp for keras sequential? · Issue #719 · rasbt/mlxtend_. [https://github.com/rasbt/mlxtend/issues/719](https://github.com/rasbt/mlxtend/issues/719) diff --git a/making-more-datasets-available-for-keras.md b/making-more-datasets-available-for-keras.md new file mode 100644 index 0000000..ddcdb8e --- /dev/null +++ b/making-more-datasets-available-for-keras.md @@ -0,0 +1,251 @@ +--- +title: "Making more datasets available for Keras" +date: "2020-01-10" +categories: + - "deep-learning" + - "frameworks" +tags: + - "dataset" + - "deep-learning" + - "extra-keras-datasets" + - "keras" + - "machine-learning" +--- + +Thanks to modern deep learning frameworks like Keras, it's very easy to use particular datasets - which are included in the framework by default. However, the amount of datasets available is often quite low, as the creators likely have more important things to do than integrate all public datasets that are available on the Internet. + +This blog post introduces the `extra-keras-datasets` module, which extends `tensorflow.keras.datasets` with additional ones. So far, we've included the EMNIST dataset, the KMNIST ones, as well as SVHN and STL-10, and we're adding more regularly. + +Let's explore these new Keras datasets! + +**Update 16/Nov/2020:** made the references to `keras.datasets` compatible with TensorFlow 2.x. + +* * * + +\[toc\] + +* * * + +## The Keras Datasets module + +In a different blog post, we [explored the Keras Datasets module](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/). The module, which can be used in your Keras models by importing `tensorflow.keras.datasets`, allows you to load datasets very easily: often, it's simply enough to call `load_data()` and there you go. + +![](images/exploring_keras_datasets.jpg) + +The module contains various **image recognition datasets** - being MNIST, CIFAR-10, CIFAR-100, Fashion-MNIST - as well as **text classification datasets** - Reuters Newswires and IMDB sentiment - and a **regression dataset** (Boston House Prices). + +However, the number of datasets is relatively small, especially when you're experimenting a lot (such as for my blog posts on MachineCurve. 
I'll usually use MNIST or the CIFAR datasets, but I'm a bit fed up with them). However, importing various other datasets requires quite some extra code, which makes the explanations with regard to the Keras models less accessible for beginners. Unfortunately, no module is available to support additional datasets... **until now!**
+
+* * *
+
+## Say hi to Extra Keras Datasets
+
+The **[Extra Keras Datasets module](https://github.com/christianversloot/extra_keras_datasets)** is a drop-in replacement for `tensorflow.keras.datasets`. Under the license provided by Keras, it makes use of its way of _downloading_ data, and offers the same `load_data()` definition to load particular datasets.
+
+[![](images/extra_k_logo_neg-300x173.png)](https://github.com/christianversloot/extra_keras_datasets)
+
+So far, we support a small range of additional datasets, and we're extending on a daily to weekly basis. These are the datasets supported so far:
+
+- EMNIST
+- KMNIST
+- SVHN
+- STL-10
+
+Before we continue with exploring the datasets themselves, let's take a look at the installation procedure first, so that you can start straight away :)
+
+### Installing the Extra Keras Datasets
+
+The installation process is fairly straightforward:
+
+```
+pip install extra-keras-datasets
+```
+
+It should also check for and, if necessary, install the dependencies that are required to run it successfully. Note that while the package is _installed_ as `extra-keras-datasets` (with dashes), it is _imported_ in Python as `extra_keras_datasets` (with underscores), since Python identifiers cannot contain dashes.
+
+Let's now take a look at which datasets are available :)
+
+* * *
+
+### EMNIST
+
+The EMNIST dataset, which stands for Extended MNIST, is an extension of the MNIST dataset based on the original NIST dataset. It comes in multiple flavors:
+
+- Balanced, which contains a balanced number of letters and digits.
+- ByClass, which is unbalanced.
+- ByMerge, which is also unbalanced.
+- Digits, which are the digits only.
+- Letters, which are the letters only.
+- Classic MNIST, which is the MNIST dataset as we know it.
+
+Let's now take a look at these datasets in a bit more detail.
+
+#### Balanced
+
+The `balanced` dataset contains digits as well as uppercase and lowercase handwritten letters. It contains 131.600 characters across 47 balanced classes.
+
+```
+from extra_keras_datasets import emnist
+(input_train, target_train), (input_test, target_test) = emnist.load_data(type='balanced')
+```
+
+[![](images/emnist-balanced.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/emnist-balanced.png)
+
+* * *
+
+#### ByClass
+
+The `byClass` dataset also contains digits as well as uppercase and lowercase letters, but it's unbalanced. Hence, the dataset is substantially larger, with 814.255 characters across 62 unbalanced classes. The classes for this dataset are \[0-9\], \[a-z\] and \[A-Z\] (Cohen et al., 2017).
+
+```
+from extra_keras_datasets import emnist
+(input_train, target_train), (input_test, target_test) = emnist.load_data(type='byclass')
+```
+
+[![](images/emnist-byclass.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/emnist-byclass.png)
+
+* * *
+
+#### ByMerge
+
+The same is true for `byMerge`, but it's built up slightly differently. It also contains 814.255 characters, but has 47 unbalanced classes only. It merges classes where the similarity between uppercase and lowercase letters is too large, which could otherwise confuse your model. The merged classes are C, I, J, K, L, M, O, P, S, U, V, W, X, Y and Z, resulting in 47 instead of 62 classes (Cohen et al., 2017).
+ +``` +from extra-keras-datasets import emnist +(input_train, target_train), (input_test, target_test) = emnist.load_data(type='bymerge') +``` + +[![](images/emnist-bymerge.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/emnist-bymerge.png) + +* * * + +#### Digits + +The `digits` dataset contains 280.000 characters across 10 balanced classes; these are the digits only. + +``` +from extra-keras-datasets import emnist +(input_train, target_train), (input_test, target_test) = emnist.load_data(type='digits') +``` + +[![](images/emnist-digits.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/emnist-digits.png) + +* * * + +#### Letters + +The `letters` dataset contains 145.600 characters across 26 balanced classes; these are the handwritten letters only. + +``` +from extra-keras-datasets import emnist +(input_train, target_train), (input_test, target_test) = emnist.load_data(type='letters') +``` + +[![](images/emnist-letters.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/emnist-letters.png) + +* * * + +#### Classic MNIST + +The `mnist` dataset is actually the same as traditional MNIST, with 70.000 characters across 10 balanced classes, equaling `tensorflow.keras.datasets.mnist`. + +``` +from extra-keras-datasets import emnist +(input_train, target_train), (input_test, target_test) = emnist.load_data(type='mnist') +``` + +[![](images/emnist-mnist.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/emnist-mnist.png) + +* * * + +### KMNIST + +#### Kuzushiji-MNIST + +This is a drop-in replacement for MNIST, but then with 70.000 28x28 images of Japanese Kuzushiji characters. These are considered to be slightly more difficult than the digits of the MNIST dataset. + +``` +from extra-keras-datasets import kmnist +(input_train, target_train), (input_test, target_test) = kmnist.load_data(type='kmnist') +``` + +[![](images/kmnist-kmnist.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/kmnist-kmnist.png) + +* * * + +#### Kuzushiji-49 + +This is an extension of the Kuzishiji-MNIST dataset, offering 270.912 images across 49 classes. + +``` +from extra-keras-datasets import kmnist +(input_train, target_train), (input_test, target_test) = kmnist.load_data(type='k49') +``` + +[![](images/kmnist-k49.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/kmnist-k49.png) + +* * * + +### SVHN + +The Street View House Numbers dataset (SVHN) contains 32x32 cropped images of house numbers obtained from Google Street View. + +#### Normal + +The `normal` variant contains 73.257 digits for training and 26.032 for testing. + +``` +from extra-keras-datasets import svhn +(input_train, target_train), (input_test, target_test) = svhn.load_data(type='normal') +``` + +[![](images/svhn-normal.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/svhn-normal.png) + +* * * + +#### Extra + +The `extra` dataset extends the `normal` one with 531.131 extra samples, which are less difficult (Netzer et al., 2011). The dataset then totals 604.388 digits for training and 26.032 digits for testing. + +``` +from extra-keras-datasets import svhn +(input_train, target_train), (input_test, target_test) = svhn.load_data(type='extra') +``` + +[![](images/svhn-extra.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/svhn-extra.png) + +* * * + +### STL-10 + +The STL-10 dataset is an image recognition dataset for developing unsupervised feature learning, deep learning, self-taught learning algorithms. 
It contains 5.000 training images and 8.000 testing images, and represents 10 classes in total (airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck). + + ``` + from extra_keras_datasets import stl10 + (input_train, target_train), (input_test, target_test) = stl10.load_data() + ``` + + [![](images/stl10-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/stl10-1.png) + + * * * + + ## Summary + + In this blog post, we've introduced the `extra-keras-datasets` module. It extends the original `tensorflow.keras.datasets` module with additional datasets. So far, the EMNIST, KMNIST, SVHN and STL-10 datasets have been made available for easy use. We're extending this module on a weekly to monthly basis, so stay tuned! :) + + Thank you for reading MachineCurve today and happy engineering! 😎 + + \[kerasbox\] + + * * * + + ## References + + Cohen, G., Afshar, S., Tapson, J., & van Schaik, A. (2017). EMNIST: an extension of MNIST to handwritten letters. Retrieved from [http://arxiv.org/abs/1702.05373](http://arxiv.org/abs/1702.05373) + + Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., & Ha, D. (2018). Deep learning for classical Japanese literature. arXiv preprint arXiv:1812.01718. Retrieved from [https://arxiv.org/abs/1812.01718](https://arxiv.org/abs/1812.01718) + + Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. Retrieved from [http://ufldl.stanford.edu/housenumbers/nips2011\_housenumbers.pdf](http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf) + [http://ufldl.stanford.edu/housenumbers/](http://ufldl.stanford.edu/housenumbers/) + + Coates, A., Ng, A., & Lee, H. (2011, June). An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics (pp. 215-223). Retrieved from [http://cs.stanford.edu/~acoates/papers/coatesleeng\_aistats\_2011.pdf](http://cs.stanford.edu/~acoates/papers/coatesleeng_aistats_2011.pdf) diff --git a/measuring-sparsity-during-training-tensorflow-pruningsummaries.md b/measuring-sparsity-during-training-tensorflow-pruningsummaries.md new file mode 100644 index 0000000..edb2a4a --- /dev/null +++ b/measuring-sparsity-during-training-tensorflow-pruningsummaries.md @@ -0,0 +1,424 @@ +--- +title: "Measuring sparsity during training: TensorFlow PruningSummaries" +date: "2020-10-01" +categories: + - "frameworks" +tags: + - "edge-ai" + - "machine-learning" + - "pruning" + - "tensorboard" + - "tensorflow" + - "model-optimization" +--- + +The human brain is very efficient: it can handle extremely complex cognitive situations with relative ease and energy efficiency. It also has a very high computational power relative to its size. Modern neural networks, which are at the forefront of the deep learning field, don't have that power-size ratio. Instead, for good performance in narrow domains, deep learning models need to become really big. Think half a gigabyte big for some pretrained models that can be used for detecting objects and so on. + +Fortunately, there are model optimization techniques such as [quantization](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/) and [pruning](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/) that help with reducing the size of your model without losing a lot of its predictive power.
This benefits speed and inference power, especially when running optimized models on edge devices - which is a trend I am convinced we will hear a lot more about in the years to come. Using [TensorFlow techniques](https://www.machinecurve.com/index.php/2020/09/29/tensorflow-pruning-schedules-constantsparsity-and-polynomialdecay/), we can reduce our models up to 10 times in size. + + Now, let's zoom in on pruning. With e.g. the `PolynomialDecay` pruning schedule in TensorFlow, it is possible to configure _when_ pruning should start during training, as well as _how much pruning should be performed at a particular time_. That is, it is possible to increase the degree of pruning slowly but surely over the training process. This way, model weights can become more robust against the loss of weights incurred by pruning. + + However, applying pruning can still be a black box - you know when during the training process pruning will approximately begin, and when it will end. You'll also know at what sparsity level pruning will start, and at what level it will end. However, if you want to **measure the degree of sparsity created by pruning during training**, you're blind. Say that you observe a model that no longer performs well after pruning. At what sparsity level does the loss start to increase? And what sparsity level does the model finally end up with, anyway? Simply applying `PolynomialDecay` or `ConstantSparsity` based pruning will apply the pruning - but from a black box perspective. + + Fortunately, when creating your Keras model using TensorFlow, a callback called `PruningSummaries` is available, which is executed after each epoch during the training process. By using this callback, you can make information about the pruning process available to your [TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/) instance. And precisely that is what we will be looking at in today's blog article. **How can pruning summaries open the black box of pruning during the training process?** Firstly, we'll take a brief look at what pruning is to provide you with necessary context about the pruning process. Subsequently, we'll introduce `PruningSummaries` and what they do according to the TensorFlow documentation. Then, we'll add them to our Keras example created in our article about [pruning](https://www.machinecurve.com/index.php/2020/09/29/tensorflow-pruning-schedules-constantsparsity-and-polynomialdecay/), and show you how they work by means of another example. + + Let's take a look 😎 + + \[toc\] + + * * * + + ## A brief recap: what is pruning, again? + + Before we take a look at pruning summaries, it is worthwhile to look at what pruning is in the first place - because we must understand _why_ we apply pruning summaries at all. We can't do that without understanding what pruning is, what it does, and how it helps you during model optimization. + + Obviously, if you already know about pruning, it's perfectly okay to skip this part :) + + ### The need for pruning + + Now, suppose that you have an idea for a machine learning application of which the goal is large-scale image recognition. That is, you want to create a model that can distinguish between 1.000 classes - without making many mistakes. You also don't have a large (say, 1-2 million samples) dataset available for training.
In those cases, it could be worthwhile to use the [VGG-16](https://keras.io/api/applications/vgg/#vgg16-function) architecture (Simonyan & Zisserman, 2014), which ships with Keras with weights pretrained on the ImageNet dataset. + + This is a great thing, as it might ensure that you can move forward with your idea, but it comes at a cost: running the model will be quite expensive in terms of the computational resources that are required. As we saw [here](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/#the-need-for-model-optimization), the weights for VGG-16 are approximately 500MB in size. This means that you'll have to load a model of half a gigabyte, and then use it to generate new predictions! Think about how many neurons are in the network, and how computationally expensive computing all the intermediate steps between the first layer and the outcome would be (note that there are 16 layers with many weights between input and output). + + ### What happens when training a neural network + + From our article about the [high-level supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), we know that training a supervised model involves moving features from a training set forward through layers in a neural network, subsequently computing the error (or loss value) and finally performing backwards error computation and [optimization](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) (MachineCurve, 2019). + + Here, moving features forward means that, per sample, we compute vector multiplications between the so-called _feature vector_ (representing the sample for which an outcome must be predicted) and _weight vectors_ (which cover part of the learning performed by the neural network; all weights together capture the entirety of patterns learnt by the network). + + The same happens when you let a new sample pass through the network in a forward fashion in order to generate a new prediction. With very large networks with many weights, this explains why they can sometimes be _very slow_ when generating new predictions, a process called model inference. + + ![](images/High-level-training-process-1024x973.jpg) + + ### Adding pruning to the training process + + Pruning involves an answer to the following question: **can we drop the weights that don't contribute significantly to the predictive power of a machine learning model during training?** This means two things: + + 1. We can try and drop all weights that don't contribute, in order to make our model faster; + 2. _Without_ losing predictive power, i.e., without making the model significantly worse in its ability to generate predictions. + + With [magnitude-based pruning](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/), this is precisely what happens: weights that contribute insignificantly to the model's outcome are dropped. They are not _truly_ removed from the model, because that would be impossible architecturally. However, they are removed in the sense that the **weights are set to zero**.
This creates what is known as [sparsity](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/#saving-storage-and-making-things-faster-with-magnitude-based-pruning), and this has many benefits: + +> Why does setting model weights to zero help optimize a model, and make it smaller? Gale et al. (2019) answer this question: “models can be stored and transmitted compactly using sparse matrix formats”. This benefits from the fact that “\[sparse\] data is by nature more easily [compressed](https://en.wikipedia.org/wiki/Data_compression) and thus requires significantly less [storage](https://en.wikipedia.org/wiki/Computer_data_storage).” (Wikipedia, 2003). In addition, beyond compression, computation-wise programming code (such as computing `x`+`y`) can be made faster (e.g., it can be omitted if `x` or `y` are sparse, or both – `x+0` = `x`, and so on), benefiting processing – _inference_, in our case. +> +> _TensorFlow model optimization: An introduction to pruning – MachineCurve_. (2020, September 23). MachineCurve. [https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/#saving-storage-and-making-things-faster-with-magnitude-based-pruning](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/#saving-storage-and-making-things-faster-with-magnitude-based-pruning) + +So, inducing **sparsity** by means of pruning helps make them smaller, makes them faster to load, and makes inference faster, by benefiting from the properties of sparse data in terms of storage and programming of libraries such as TensorFlow. That sounds great, doesn't it? + +In TensorFlow, it is possible to apply pruning by means of the `ConstantSparsity` and `PolynomialDecay` [pruning schedules](https://www.machinecurve.com/index.php/2020/09/29/tensorflow-pruning-schedules-constantsparsity-and-polynomialdecay/). However, they provide what is a relative black box - you know that pruning will be applied, when it will start and when it will end, and how much sparsity will change over the course of your training process. However, you can't really look _inside_ and look at sparsity levels given a training step - you can only estimate them based on your configuration, but not measure them. + +That's why we'll now take a look at `PruningSummaries`, which is a technique to measure the degree of sparsity of your neural network when applying pruning. + +* * * + +## Measuring pruning during training: PruningSummaries + +If you've read my blog post about [TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/), or have used it before, you know that it can be used to follow the progress of your machine learning model that's training in realtime. In the post, I'll cover TensorBoard in more detail. Briefly, however: + +- TensorBoard is a TensorFlow-delivered web application which allows you to track the inner workings of your ML model in multiple ways, including how training has progressed. As mentioned, this can be done in realtime. +- TensorBoard can be enabled by letting the training process write log data to a folder - you can do so by means of a Keras callback. 
+- A callback, in this sense, is a piece of code that is run every time something happens - for example, at the start of an epoch or at the start of a [batch](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/#minibatch-gradient-descent) being fed forward through your model. It can also run at the end, if that's how the callback was programmed. It can be run by means of the `TensorBoard` callback. + +The TensorFlow Model Optimization toolkit actually piggybacks on TensorBoard functionality by providing another callback called `PruningSummaries`. It runs after every epoch and logs information about the pruning process to TensorBoard. In particular, it logs the degree of sparsity of the model at a particular epoch. In short, it's "\[a\] Keras callback for adding pruning summaries to \[TensorBoard\]" (PruningSummaries, n.d.). + +### PruningSummaries in the TFMOT API + +PruningSummaries are available in the optimization toolkit's API, and the callback is hence documented. Here's what it looks like: + +``` +tfmot.sparsity.keras.PruningSummaries( + log_dir, update_freq='epoch', **kwargs +) +``` + +It's a really simple callback with literally two arguments - `log_dir`, i.e. where to write the logs, and `update_freq`, which means that it runs every epoch. + +(By checking the [source code](https://github.com/tensorflow/model-optimization/blob/master/tensorflow_model_optimization/python/core/sparsity/keras/pruning_callbacks.py#L92-L151), I haven't observed what other values for `update_freq` are possible, so I'm assuming that it only runs on a per epoch basis.) + +* * * + +## Adding pruning summaries to TensorBoard: a Keras example + +Well, that's all for theory. Let's now see how to add `PruningSummaries` to the Keras-based pruning example created in [my other article](https://www.machinecurve.com/index.php/2020/09/29/tensorflow-pruning-schedules-constantsparsity-and-polynomialdecay/). Do note that instead of the `ConstantSparsity` applied there, I will apply `PolynomialDecay` here, because we saw that it works a bit better. What's more, TensorBoard should now nicely show increasing sparsity levels when the numbers of epochs increase. + +### Model imports, CNN definition & model compilation + +First of all, we add model imporst, define our ConvNet and compile our model. As this isn't really different from creating the [ConvNet itself](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/), I won't really explain how to do that here. If you wish to understand this in more detail, please click the link. + +Make sure to create a Python file (e.g. 
`pruningsummaries.py`) or open a Jupyter Notebook, and add the following code: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential, save_model +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +import tempfile +import tensorflow_model_optimization as tfmot +import numpy as np + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +pruning_epochs = 30 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() +input_shape = (img_width, img_height, 1) + +# Reshape data for ConvNet +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize [0, 255] into [0, 1] +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) +``` + +### Configuring pruning + +Now that we have a compiled model, we can convert it so that it is wrapped with pruning functionality - which allows pruning to happen: + +``` +# Load functionality for adding pruning wrappers +prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude + +# Compute the pruning end step +num_images = input_train.shape[0] * (1 - validation_split) +end_step = np.ceil(num_images / batch_size).astype(np.int32) * pruning_epochs + +# Define pruning configuration +pruning_params = { + 'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.00, + final_sparsity=0.875, + begin_step=0.2*end_step, + end_step=end_step) +} +model_for_pruning = prune_low_magnitude(model, **pruning_params) + +# Recompile the model +model_for_pruning.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) +``` + +First, we define the function that will wrap our model with pruning functionality as `prune_low_magnitude`. We subsequently compute the `end_step`, which is effectively the number of batches per epoch times the number of epochs - i.e., the total number of steps taken to train the model. + +Then, we define our pruning parameters. More specifically, we define `PolynomialDecay` as our pruning schedule. We start with a sparsity of 0% and end with a sparsity of theoretically 87.5%. 
Note that we only start pruning at 20% of the training process, to give the model a little bit of time to learn weights without immediately facing loss incurred by pruned weights. Pruning ends at 100% of the training process. + +Finally, we actually _add_ the wrappers by calling `prune_low_magnitude` with our model and the pruning parameters that we defined above. Model recompilation is necessary for the pruning to work, and we do that last. + +### Defining callbacks + +Next, it's time to define the callbacks that Keras will use during training: + +``` +# Model callbacks +log_dir = '.\logs' +callbacks = [ + tfmot.sparsity.keras.UpdatePruningStep(), + tensorflow.keras.callbacks.TensorBoard(log_dir=log_dir, profile_batch = 100000000, histogram_freq=0, batch_size=32, write_graph=True, write_grads=False, write_images=False, embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None, embeddings_data=None, update_freq='epoch'), + tfmot.sparsity.keras.PruningSummaries( + log_dir, update_freq='epoch' + ) +] +``` + +We use three of them: + +- First of all, we're using the `UpdatePruningStep`. It updates the pruning step (literally: step = step + 1) after each batch. Without it, pruning doesn't work, and TensorFlow will in fact throw an error. +- The `TensorBoard` callback, which we use to write _regular_ data to TensorBoard. Click [here](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/#tensorboard-and-the-keras-api) for an explanation of all the attributes passed along. + - **Important:** note that we, weirdly perhaps, added `profile_batch = 100000000` to our `TensorBoard` callback. This was necessary, because otherwise Keras would throw `ValueError: Must enable trace before export` when using `PruningSummaries`. +- The `PruningSummaries` callback which writes pruning summaries to TensorBoard as well. + +As is clear, we write all logs to our `log_dir`. Note that `.\logs` means that it will work on Windows. Perhaps, you will need to change it into `./logs` if you're running Linux or Mac OS. 
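If you want the same script to run on Windows, Linux and Mac OS without manually switching between `.\logs` and `./logs`, one option - a small sketch on my part, not part of the original example - is to build the path with Python's standard `os.path.join`, which uses the separator of the operating system it runs on:

```
import os

# Build a platform-independent path to the log folder used by TensorBoard and PruningSummaries
log_dir = os.path.join('.', 'logs')
print(log_dir)  # '.\logs' on Windows, './logs' on Linux and Mac OS
```

Note that `os` is not imported in the code above, so you would have to add `import os` to the top of your script if you go this route.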
+ +### Starting the training process with pruning + +Then, it's time to start the training process - including pruning: + +``` +# Fitting data +model_for_pruning.fit(input_train, target_train, + batch_size=batch_size, + epochs=pruning_epochs, + verbose=verbosity, + callbacks=callbacks, + validation_split=validation_split) +``` + +### Model evaluation + +Finally, we test our trained and pruned model with our test dataset (split off above, in the first part of the model code) and print the evaluation scores: + +``` +# Generate generalization metrics +score_pruned = model_for_pruning.evaluate(input_test, target_test, verbose=0) +print(f'Pruned CNN - Test loss: {score_pruned[0]} / Test accuracy: {score_pruned[1]}') +``` + +### Full model code + +If you wish to obtain the full model code at once - that's of course possible :) Here you go: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential, save_model +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +import tempfile +import tensorflow_model_optimization as tfmot +import numpy as np + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +pruning_epochs = 30 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() +input_shape = (img_width, img_height, 1) + +# Reshape data for ConvNet +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize [0, 255] into [0, 1] +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Load functionality for adding pruning wrappers +prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude + +# Compute the pruning end step +num_images = input_train.shape[0] * (1 - validation_split) +end_step = np.ceil(num_images / batch_size).astype(np.int32) * pruning_epochs + +# Define pruning configuration +pruning_params = { + 'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.00, + final_sparsity=0.875, + begin_step=0.2*end_step, + end_step=end_step) +} +model_for_pruning = prune_low_magnitude(model, **pruning_params) + +# Recompile the model +model_for_pruning.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Model callbacks +log_dir = '.\logs' +callbacks = [ + 
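  # UpdatePruningStep (below) advances the pruning step after every batch - without it, pruning does not work.
  # PruningSummaries then logs the model's per-epoch sparsity to the same TensorBoard log directory.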
tfmot.sparsity.keras.UpdatePruningStep(), + tensorflow.keras.callbacks.TensorBoard(log_dir=log_dir, profile_batch = 100000000, histogram_freq=0, batch_size=32, write_graph=True, write_grads=False, write_images=False, embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None, embeddings_data=None, update_freq='epoch'), + tfmot.sparsity.keras.PruningSummaries( + log_dir, update_freq='epoch' + ) +] + +# Fitting data +model_for_pruning.fit(input_train, target_train, + batch_size=batch_size, + epochs=pruning_epochs, + verbose=verbosity, + callbacks=callbacks, + validation_split=validation_split) + +# Generate generalization metrics +score_pruned = model_for_pruning.evaluate(input_test, target_test, verbose=0) +print(f'Pruned CNN - Test loss: {score_pruned[0]} / Test accuracy: {score_pruned[1]}') +``` + +* * * + +## Results: evaluation & TensorBoard screenshots + +Let's now run the model. Open up your Python development environment with the necessary libraries installed (primarily, `tensorflow` and the `tfmot` toolkit), and run `python pruningsummaries.py` (or run it in your Notebook). + +If all goes well, your model should start training. + +### Model evaluation scores + +Once it ends, you'll see a model evaluation score - in our case, pruning hasn't impacted accuracy, which is great but expected: + +``` +Pruned CNN - Test loss: 0.024733462060560124 / Test accuracy: 0.9921000003814697 +``` + +### Starting TensorBoard + +More interestingly, however, is what information `PruningSummaries` provide in TensorBoard. Let's open the board by means of a terminal that works from the same directory as where your `*.py` file is located: + +``` +tensorboard --logdir=./logs +``` + +If all goes well, you should see the following message: + +``` +Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all +TensorBoard 2.0.2 at http://localhost:6006/ (Press CTRL+C to quit) +``` + +You can now navigate to [http://localhost:6006/](http://localhost:6006/) to see the results. + +### What PruningSummaries look like in TensorFlow + +When clicking one of the layers, you should see information about how the model became sparser during training: + +![](images/image-1.png) + +Indeed: the model started getting sparser at approximately 20% of the training process, and achieved between 85-90% sparsity. Precisely as expected! + +* * * + +## Summary + +Pruning can help your models get sparser, and hence faster to run and faster to load, especially on edge devices. This benefits model inference and the deployment of AI in the field. Today's machine learning libraries, such as TensorFlow, provide functionality for pruning through its Model Optimization Toolkit. While pruning often yields great benefits in terms of model size without losing model performance, it's still a bit of a black box. + +By means of `PruningSummaries`, however, it's possible to see how pruning has induced sparsity within your machine learning model during the training process. The callback, which logs information about model sparsity to TensorBoard, allows you to see precisely when your model started getting sparser and what sparsity was achieved when. This opens the black box we mentioned before. + +This article also provided an example with Keras. By means of a ConvNet based classifier, classifying the MNIST dataset into one out of ten classes, we saw that sparsity of 87.5% could be achieved without a bad-performing model. 
While this is a relatively artificial example, it shows how pruning can be really effective. + +If you have any questions, remarks or other comments, please feel free to leave a comment in the comments section below! I'd love to hear from you 💬 Thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +_VGG16 - Convolutional network for classification and detection_. (2018, November 21). Neurohive - Neural Networks. [https://neurohive.io/en/popular-networks/vgg16/](https://neurohive.io/en/popular-networks/vgg16/) + +Simonyan, K., & Zisserman, A. (2014). [Very deep convolutional networks for large-scale image recognition](https://arxiv.org/abs/1409.1556). _arXiv preprint arXiv:1409.1556_. + +MachineCurve. (2019, December 21). _The high-level supervised learning process_. [https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) + +Universität Tübingen. (n.d.). _Magnitude based pruning_. Kognitive Systeme | Universität Tübingen. [https://www.ra.cs.uni-tuebingen.de/SNNS/UserManual/node249.html](https://www.ra.cs.uni-tuebingen.de/SNNS/UserManual/node249.html) + +_TensorFlow model optimization: An introduction to pruning – MachineCurve_. (2020, September 23). MachineCurve. [https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/#saving-storage-and-making-things-faster-with-magnitude-based-pruning](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/#saving-storage-and-making-things-faster-with-magnitude-based-pruning) + +_Tfmot.sparsity.keras.PruningSummaries_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/sparsity/keras/PruningSummaries](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras/PruningSummaries) + +_TensorFlow/model-optimization_. (n.d.). GitHub. [https://github.com/tensorflow/model-optimization/blob/master/tensorflow\_model\_optimization/python/core/sparsity/keras/pruning\_callbacks.py#L92-L151](https://github.com/tensorflow/model-optimization/blob/master/tensorflow_model_optimization/python/core/sparsity/keras/pruning_callbacks.py#L92-L151) diff --git a/ml-against-covid-19-detecting-disease-with-tensorflow-keras-and-transfer-learning.md b/ml-against-covid-19-detecting-disease-with-tensorflow-keras-and-transfer-learning.md new file mode 100644 index 0000000..ff3b283 --- /dev/null +++ b/ml-against-covid-19-detecting-disease-with-tensorflow-keras-and-transfer-learning.md @@ -0,0 +1,821 @@ +--- +title: "ML against COVID-19: detecting disease with TensorFlow, Keras and transfer learning" +date: "2020-11-05" +categories: + - "geen-categorie" +--- + +Since March 2020, the world is in crisis: the SARS-CoV-2 coronavirus is sweeping across the world. Many countries are currently in lockdown or have imposed strict social distancing measures. Some even fear that the world as we know it will never return - in the sense that even after vaccins are allowed onto the market, people will continue to show different behavior. + +Now, the goal of this article is not to impose any additional fears onto my readers. Instead, I'm hoping to demonstrate that technology can play a very positive role in fighting the disease. 
We all know that in recent years, Machine Learning - and especially Deep Neural Networks - have made lots of progress. Especially in the area of Computer Vision, models can increasingly support and sometimes even replace humans in narrow domains. + +And that is why Machine Learning can also be used to help doctors diagnose COVID-19 disease. + +In this article, we'll create a **Convolutional Neural Network** that can help diagnose COVID-19 on **Radiography images**. It is structured as follows. Firstly, we'll take a look at COVID-19 in general (perhaps, this will be increasingly relevant for those who read this article years from now). Subsequently, we'll cover the detection of COVID-19 on Radiography images, and the dataset that we will be using today. + +Subsequently, we'll move on to the real work. We first demonstrate that a **vanilla ConvNet** can already be used to generate a classifier. However, as more state-of-the-art results can potentially be achieved with pretraining, we also demonstrate how a neural network can be trained **on top of a ConvNet trained on ImageNet**: that is, a **transfer learning** setting. Finally, we'll evaluate how well the models work. + +Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## COVID-19: some context, and how ML can help + +[![](images/49534865371_7219ecfbcd_k-1024x800.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/11/49534865371_7219ecfbcd_k-scaled.jpg) + +SARS-CoV-2 pictured by a transmission electron microscope. Credit: [NIAID](https://www.flickr.com/photos/niaid/49534865371/), license: [CC BY 2.0](https://creativecommons.org/licenses/by/2.0), no changes made. + +Somewhere between October and December 2019, in China, a virus of type coronavirus likely spilled over from an animal onto humans. A Chinese doctor named Li Wenliang worked at the Wuhan Central Hospital in China and recognized a strange pattern in December - there had been some hospital admissions with symptoms that looked like SARS, the virus that led to an epidemic in 2003 (Hegerty, 2020). + +On 30 December, he warned others about the outbreak, and eventually even died because he got ill himself. And he was right: the virus indeed looked like SARS. In fact, it's called **SARS-CoV-2**, and it's very closely related to the original SARS virus. What he didn't know at the time, however, was what an impact the virus would make on the world: + +- The Wuhan area in China went into lockdown in the first months of 2020. +- Being insufficient, the virus spread to other parts of Asia in January and February. +- In early March, during Alpine season, the virus was present within Europe and spread extremely fast across European countries. +- Equally easily, the virus was imported into the United States and South America. + +By consequence, at the time of writing, we're in a semi-lockdown because of a surge in cases in what is called the _second wave_. + +Hospitals are flooding with patients - and we can only keep patient levels at relatively adequate levels by following strict social distancing measures. Doctors and nurses are exhausted, less-than-critical care is postponed, and the health system is cracking. In sum, this is what happened in March - and what can happen again: + +https://www.youtube.com/watch?v=Ee7FRSPo76M + +Fortunately, new technologies can always help in making life easier. 
This is especially true for **Artificial Intelligence** and its sub branch **Machine Learning.** In my master's research, for example, I studied how [Convolutional Neural Networks](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) can help geophysicists with analyzing Ground Penetrating Radar images. This geophysical technique is used to scan the underground, in order to register cables and pipelines and avoid excavation damages during construction activities. Applying neural networks, in my case for the recognition of material type of an underground object, can make the analyst's job more efficient - they can perform work faster and have a decision-support tool at their disposal. + +I started wondering - can't we do the same for COVID-19? Would there be a dataset available with which a Machine Learning model can be created that helps medical staff? Recognizing that medical decisions are _especially_ critical and that technologies should not be applied without care, I still wanted to experiment. Because _if_ Machine Learning can make the life of COVID-19 medical staff easier, the whole system benefits. Therefore, let's find out what's possible. + +* * * + +## Detecting COVID-19 on Radiography images + +On [Kaggle](https://www.kaggle.com/tawsifurrahman/covid19-radiography-database), which is an awesome community for data science minded people, I found the **COVID-19 Radiography Database**: + +> A team of researchers from Qatar University, Doha, Qatar and the University of Dhaka, Bangladesh along with their collaborators from Pakistan and Malaysia in collaboration with medical doctors have created a database of chest X-ray images for COVID-19 positive cases along with Normal and Viral Pneumonia images. In our current release, there are 219 COVID-19 positive images, 1341 normal images and 1345 viral pneumonia images. We will continue to update this database as soon as we have new x-ray images for COVID-19 pneumonia patients. +> +> COVID-19 Radiography Database, n.d. + +- The dataset contains images of **COVID-19 pneumonia,** **other viral pneumonia** and **normal chests.** +- For all three [classes](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/), the images were generated by means of radiography. +- The images are quite big - 1024 x 1024 pixels (and can be removed by us). +- There is a class imbalance between especially COVID-19 and the other classes. This is not surprising given the fact that COVID-19 is a relatively new disease, but this should be taken into account when creating the Machine Learning model. + +Samples look as follows: + +![](images/covids.png) + +* * * + +## Creating a Machine Learning model for detecting COVID-19 + +Now that we have obtained the necessary context and some insight in today's dataset, we're moving on to the practical part: creating a neural network for [classifying the images](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/). + +For this task, we'll be using [TensorFlow](https://www.tensorflow.org/) and [Keras](http://keras.io). If you're no Machine Learning engineer - TensorFlow is one of the state-of-the-art machine learning libraries for training neural networks. Keras, on the other hand, is an abstraction layer on top of TensorFlow, created to make creating TF models easier. 
In fact, they are so tightly coupled that Keras has actually been embedded into TensorFlow these days. + +Creating the model involves a set of steps: + +1. **Specifying the model imports.** +2. **Defining configuration options for the model.** +3. **Creating ImageDataGenerators which can flow data from file.** +4. **Preparing the dataset.** +5. **Creating and compiling the [ConvNet](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/).** +6. **Fitting the data to the network.** +7. **Evaluating the network.** + +### Specifying model imports + +Let's open up a file or [Notebook](https://www.machinecurve.com/index.php/2020/10/07/easy-install-of-jupyter-notebook-with-tensorflow-and-docker/) where you can write some code. Make sure that you have Python and TensorFlow installed; preferably the newest versions of the language and library. + +We can then add our imports. + +- We will use `os` to perform file operations. +- We import `tensorflow` for obvious reasons. +- We will use the `EarlyStopping` and `ModelCheckpoint` callbacks to stop training when our [loss](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) metrics [stop improving](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/), so that we always have the best model at our disposal. +- We import the `Conv2D` and `MaxPooling` layers for [feature extraction](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) and [feature map reduction](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/), respectively. +- Subsequently, further layer imports are `Dense`, `Dropout` and `Flatten` - in order to generate predictions (Flatten/Dense) and avoid overfitting ([Dropout](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/)). +- As we create our model with the Keras Sequential API, we import `Sequential`. +- Finally, we import `ImageDataGenerator`, with which we'll be able to flow data from files instead of storing everything into memory. + +``` +import os +import tensorflow +from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.models import Sequential +from tensorflow.keras.preprocessing.image import ImageDataGenerator +``` + +### Defining configuration options + +Next, it's time to define some options for model configuration. + +- Our `target_size_scalar` variable will later be used to specify width and height of the images _after_ they will be resized from 1024 x 1024 pixels. +- Our [batch size](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/#minibatch-gradient-descent) is set to 250. +- The number of epochs (i.e. [iterations](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process)) is set to 1000, but the number of iterations will be lower because we use Early Stopping. +- The number of classes is 3. +- 20% of our data will be used for validation purposes. +- Verbosity, i.e. model output in your terminal, is set to 1, or `True`. 
+- The `path` and `path_test` determine the paths towards your training/validation and testing data, respectively. We'll get back to this. +- The `input_shape` is common for an image: `(w, h, d)` - with our target size scalar representing width and height. +- The labels speak for themselves. +- The `checkpoint_path` is the path towards the file where [ModelCheckpoint](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/) will save our model. + +``` +# Model configuration +target_size_scalar = 50 # 50 x 50 pixels +batch_size = 250 +no_epochs = 1000 +no_classes = 3 +validation_split = 0.2 +verbosity = 1 +path = './covid/COVID-19 Radiography Database' +path_test = './covid/COVID-19 Radiography Database Test' +input_shape = (target_size_scalar, target_size_scalar, 3) +labels = ['COVID-19', 'NORMAL', 'Viral Pneumonia'] +checkpoint_path=f'{os.path.dirname(os.path.realpath(__file__))}/covid-convnet.h5' +``` + +### Creating ImageDataGenerators for data import + +Next, we create three `ImageDataGenerators` for our training, validation and testing data. + +Image data generators can be used for "\[generating\] batches of tensor image data with real-time data augmentation" (Keras Team n.d.). Generators loop over the data in batches and are hence a useful mechanism for feeding data to the training process that starts later. + +- We set `rescale` to `1./255`, meaning that each pixel is multiplied with `1/255`, to ensure that it's in the \[latex\]\[0, 1\]\[/latex\] range. This benefits training (omitting this can even make the model untrainable). +- For the training and validation `ImageDataGenerator`, we specify the 20% validation split. Next, we will see how this nicely leads to a training and validation dataset. + +``` +# Create ImageDataGenerators for training, validation and testing data +training_idg = ImageDataGenerator( + rescale = 1./255, + validation_split = validation_split +) + +validation_idg = ImageDataGenerator( + rescale = 1./255, + validation_split = validation_split +) + +testing_idg = ImageDataGenerator( + rescale = 1./255 +) +``` + +We often don't want to load all our data into memory just at once. Sometimes, this is even impossible - e.g. when you are training with really big datasets. + +In those cases, `flow_from_directory` can be a nice technique. It connects to your `ImageDataGenerator` and essentially flows the batches of images from a directory that is to be specified. Below, we flow data from directory for the training, validation and testing generators. + +- We set the `directory` to `path` for the first two generators, specifying the path to our training dataset. The test generator has `path_test` set as the path. +- Our classes had been specified by `labels`, so we configure this here. +- We apply a `seed` of 28 (this can be any integer close to and above 0), for `shuffling`, which is `True` in our case. In other words, we randomly shuffle the dataset when flowing from directory. However, if we do so with a `seed`, we know that our random initializer is generated in the same way. Doing this across the training and validation generator means that we'll have a nice validation dataset. We specify the differences between the two by means of `subset`. +- Our batch sizes and target sizes are also specified accordingly. 
+ +``` +# Flow from directory for the IDGs +train_generator = training_idg.flow_from_directory( + directory = path, + classes = labels, + seed = 28, + batch_size = batch_size, + shuffle = True, + target_size=(target_size_scalar, target_size_scalar), + subset = 'training' +) + +val_generator = validation_idg.flow_from_directory( + directory = path, + classes = labels, + seed = 28, + batch_size = batch_size, + shuffle = True, + target_size=(target_size_scalar, target_size_scalar), + subset = 'validation' +) + +test_generator = testing_idg.flow_from_directory( + directory = path_test, + classes = labels, + batch_size = batch_size, + target_size=(target_size_scalar, target_size_scalar) +) +``` + +### Preparing the dataset + +Now that we have specified where the data loads from and how it loads, it's time to prepare the dataset. + +The first and most important thing you have to do is [downloading the data](https://www.kaggle.com/tawsifurrahman/covid19-radiography-database). Once ready, unpack the data, and move the `COVID-19 Radiography Database` into a folder called `covid`, which itself is located at the level where your model code is being written. + +Then create another folder in `./covid` called `COVID-19 Radiography Database Test`. Also create the sub folders, i.e. `COVID-19`, `NORMAL` and `VIRAL PNEUMONIA`. Now, go to the `COVID-19 Radiography Database` folder. From each sub folder (i.e. each class folder), cut 40 samples and move them to the respective folders in the `COVID-19 Radiography Database Test` folder. Preferably, do so randomly. This way, we're creating a test dataset while leaving training data behind. + +In total, your training dataset (including validation data) will therefore be of this quantity: + +- **COVID-19 images:** 179 +- **Normal images:** 1301 +- **Other viral pneumonia images:** 1305 + +Once again, we stress that this dataset is highly imbalanced between COVID-19 and the other classes. We'll take this into account later, when fitting data to the model. + +### Creating and compiling the ConvNet + +Our next step is creating the ConvNet. + +- We use the `Sequential` API, which allows us to stack layers on top of each other with `model.add`. +- We use one two-dimensional Conv layer that is [ReLU-activated](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/). +- This is followed by a [Max Pooling layer](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/) to further downsample the feature maps. +- We also apply [Dropout](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/) to help avoid overfitting. +- Then, we flatten the multidimensional data into one-dimensional format with `Flatten`, and apply `Dense` layers to generate the predictions. +- The final layer is [Softmax-activated](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) in order to generate a multiclass probability distribution over the 3 classes (because `no_classes = 3`). + +``` +# Create the ConvNet +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.5)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) +``` + +The layer stack above does not represent a full model yet - but rather, a model skeleton. 
We must compile the model in order to bring it to life. + +- Since we are dealing with categorical data, we're using [categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) loss. +- We optimize with the Adam optimizer, one of the [widely used optimizers](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) these days. +- For metrics, we specify a wide variety of them - accuracy, of course, but also the more specific ones such as precision, recall, and true/false positives/negatives. + +``` +# Compile the ConvNet +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=[ + 'accuracy', + tensorflow.keras.metrics.TruePositives(), + tensorflow.keras.metrics.FalsePositives(), + tensorflow.keras.metrics.TrueNegatives(), + tensorflow.keras.metrics.FalseNegatives(), + tensorflow.keras.metrics.Precision(), + tensorflow.keras.metrics.Recall() + ]) +``` + +### Fitting data to the network + +Now that our model has been created and compiled, we can fit data to the model - i.e., start the training process. + +#### Balancing the classes + +However, we stressed twice that our dataset is highly imbalanced. If we would simply fit the data to the model, this could be problematic, because the imbalance would perhaps favor the two other classes simply for being more present. + +That's why we must use **class weights** when fitting the data. With them, we can increase the importance of less-frequent classes, while decreasing the importance of more-frequent ones. + +But how to compute the weights? + +If you have [Scikit-learn](https://scikit-learn.org/stable/) installed onto your system, you can apply `sklearn.utils.class_weight.compute_class_weight` to compute the class weights relative to each other. I created a relatively naïve (i.e. non-generalizable) Python snippet that computes the weights for precisely the training class imbalance for our COVID-19 classifier. + +``` +import numpy as np +import sklearn + +# Classes +num_one = 179 +num_two = 1301 +num_three = 1305 + +# Numpy arrays +arr_one = np.full(num_one, 0) +arr_two = np.full(num_two, 1) +arr_three = np.full(num_three, 2) + +# Concat and unique +all_together = np.concatenate((arr_one, arr_two, arr_three)) +unique = np.unique(all_together) + +# Compute and print weights +weights = sklearn.utils.class_weight.compute_class_weight('balanced', unique, all_together) +print(weights) +``` + +Running it gives the following weights. They make sense: if we multiply 179 with \[latex\]\\approx 5.18\[/latex\] and then divide it by \[latex\]\\approx 0.71\[/latex\], we get \[latex\]\\approx 1301\[/latex\]. The same is true for the others. The weights ensure that the classes are balanced. + +``` +[5.18621974 0.71355368 0.71136654] +``` + +In our code, we therefore now add the weights as follows: + +``` + +# Compute weights +class_weights = {0: 5.18621974, 1: 0.71355368, 2: 0.71136654} +``` + +#### Defining callbacks + +Before, we noted that we use [EarlyStopping and ModelCheckpoint](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/) for stopping the training process when the loss metrics do no longer improve. Make sure to add the following code to add the callbacks as well. + +- With `EarlyStopping`, the training process stops early - i.e. if some `monitor` no longer improves. 
In our case, that monitor is the validation loss. We want to minimize that loss, so our `mode` is set to `min`. We consider an epoch to be a non-improvement if the loss improvement is `< 0.01`. We are patient for five non-improving epochs/iterations, after which the training process stops. +- With `ModelCheckpoint`, we can save the model after every iteration. However, we ensure that the best is saved only, based on the same `monitor` and `mode`. + +``` +# Define callbacks +keras_callbacks = [ + EarlyStopping(monitor='val_loss', patience=5, mode='min', min_delta=0.01), + ModelCheckpoint(checkpoint_path, monitor='val_loss', save_best_only=True, mode='min') +] +``` + +#### Starting the training process with model.fit + +Now, it's time to start the training process. + +- We specify the `train_generator` for our training data and the `val_generator` for our validation data. We also add the callbacks, class weights, define the number of epochs to 1.000 (but, remember, it'll stop early), and set verbosity (i.e. output) to `True`. + +``` +# Fit data to model +model.fit(train_generator, + epochs=no_epochs, + verbose=verbosity, + class_weight=class_weights, + callbacks=keras_callbacks, + validation_data=val_generator) +``` + +### Model evaluation + +Finally, we add some code that helps us [evaluate](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/) the model after it was trained. We test the model with the `test_generator` to find out if it generalizes well. We do so by listing each individual metric that we defined before. + +``` +# Generalization key value pairs +kvp = { + 0: 'Categorical crossentropy loss', + 1: 'Accuracy', + 2: 'True positives', + 3: 'False positives', + 4: 'True negatives', + 5: 'False negatives', + 6: 'Precision', + 7: 'Recall' +} + +# Generate generalization metrics +scores = model.evaluate(test_generator, verbose=1) +print('Test results:') +for index, score in enumerate(scores): + print(f'-> {kvp[index]}: {score}') +``` + +### Full model code + +Should you wish to obtain the full model code - that is of course also possible :D Here you go. 
+ +``` +import os +import tensorflow +from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.models import Sequential +from tensorflow.keras.preprocessing.image import ImageDataGenerator + +# Model configuration +target_size_scalar = 50 # 50 x 50 pixels +batch_size = 250 +no_epochs = 1000 +no_classes = 3 +validation_split = 0.2 +verbosity = 1 +path = './covid/COVID-19 Radiography Database' +path_test = './covid/COVID-19 Radiography Database Test' +input_shape = (target_size_scalar, target_size_scalar, 3) +labels = ['COVID-19', 'NORMAL', 'Viral Pneumonia'] +checkpoint_path=f'{os.path.dirname(os.path.realpath(__file__))}/covid-convnet.h5' + +# Create ImageDataGenerators for training, validation and testing data +training_idg = ImageDataGenerator( + rescale = 1./255, + validation_split = validation_split +) + +validation_idg = ImageDataGenerator( + rescale = 1./255, + validation_split = validation_split +) + +testing_idg = ImageDataGenerator( + rescale = 1./255 +) + +# Flow from directory for the IDGs +train_generator = training_idg.flow_from_directory( + directory = path, + classes = labels, + seed = 28, + batch_size = batch_size, + shuffle = True, + target_size=(target_size_scalar, target_size_scalar), + subset = 'training' +) + +val_generator = validation_idg.flow_from_directory( + directory = path, + classes = labels, + seed = 28, + batch_size = batch_size, + shuffle = True, + target_size=(target_size_scalar, target_size_scalar), + subset = 'validation' +) + +test_generator = testing_idg.flow_from_directory( + directory = path_test, + classes = labels, + batch_size = batch_size, + target_size=(target_size_scalar, target_size_scalar) +) + +# Create the ConvNet +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.5)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the ConvNet +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=[ + 'accuracy', + tensorflow.keras.metrics.TruePositives(), + tensorflow.keras.metrics.FalsePositives(), + tensorflow.keras.metrics.TrueNegatives(), + tensorflow.keras.metrics.FalseNegatives(), + tensorflow.keras.metrics.Precision(), + tensorflow.keras.metrics.Recall() + ]) + +# Compute weights +class_weights = {0: 5.18621974, 1: 0.71355368, 2: 0.71136654} + +# Define callbacks +keras_callbacks = [ + EarlyStopping(monitor='val_loss', patience=5, mode='min', min_delta=0.01), + ModelCheckpoint(checkpoint_path, monitor='val_loss', save_best_only=True, mode='min') +] + +# Fit data to model +model.fit(train_generator, + epochs=no_epochs, + verbose=verbosity, + class_weight=class_weights, + callbacks=keras_callbacks, + validation_data=val_generator) + +# Generalization key value pairs +kvp = { + 0: 'Categorical crossentropy loss', + 1: 'Accuracy', + 2: 'True positives', + 3: 'False positives', + 4: 'True negatives', + 5: 'False negatives', + 6: 'Precision', + 7: 'Recall' +} + +# Generate generalization metrics +scores = model.evaluate(test_generator, verbose=1) +print('Test results:') +for index, score in enumerate(scores): + print(f'-> {kvp[index]}: {score}') +``` + +* * * + +## How well does it work? + +Time to train your model! 
Run your model with Python or run your Jupyter Notebook - and the training process will start. After a while, it'll end as well, because the loss no longer improves. These were the test results for my training process:
+
+```
+Test results:
+-> Categorical crossentropy loss: 0.337981641292572
+-> Accuracy: 0.8500000238418579
+-> True positives: 101.0
+-> False positives: 17.0
+-> True negatives: 223.0
+-> False negatives: 19.0
+-> Precision: 0.8559321761131287
+-> Recall: 0.8416666388511658
+```
+
+On the test dataset, an **85% accuracy** was achieved for the final `model`. This means that in 85% of cases, the predicted class represented the actual class. A precision and recall of approximately 84-86% also suggest good model performance.
+
+And that's with a vanilla ConvNet!
+
+### Testing the saved ConvNet
+
+The test above was generated for the final `model`. However, as we built a bit of `patience` into the `EarlyStopping` callback, the training process did not stop at the exact minimum in terms of [loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/). In other words, if we [load the saved model](https://www.machinecurve.com/index.php/2020/02/14/how-to-save-and-load-a-model-with-keras/) and call [evaluate](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/) on it, we might find even better performance.
+
+I created the following snippet for testing a saved model.
+
+- We import the TensorFlow building blocks that we need.
+- We set the configuration options for model evaluation.
+- We generate a testing `ImageDataGenerator` and configure it to flow data from directory - the Test set directory, to be precise.
+- We then [load the model](https://www.machinecurve.com/index.php/2020/02/14/how-to-save-and-load-a-model-with-keras/) and [evaluate it](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/) with the `ImageDataGenerator`.
+ +``` +import os +import tensorflow +from tensorflow.keras.preprocessing.image import ImageDataGenerator +from tensorflow.keras.models import load_model + +# Model configuration +target_size_scalar = 50 +batch_size = 15 +path_test = './covid/COVID-19 Radiography Database Test' +input_shape = (target_size_scalar, target_size_scalar, 3) +labels = ['COVID-19', 'NORMAL', 'Viral Pneumonia'] +checkpoint_path=f'{os.path.dirname(os.path.realpath(__file__))}/covid-convnet.h5' + + +# Generators +testing_idg = ImageDataGenerator( + rescale = 1./255 +) + +test_generator = testing_idg.flow_from_directory( + directory = path_test, + classes = labels, + batch_size = batch_size, + target_size=(target_size_scalar, target_size_scalar) +) + +# Load +model = load_model( + checkpoint_path, + custom_objects=None, + compile=True +) + +# Generalization key value pairs +kvp = { + 0: 'Categorical crossentropy loss', + 1: 'Accuracy', + 2: 'True positives', + 3: 'False positives', + 4: 'True negatives', + 5: 'False negatives', + 6: 'Precision', + 7: 'Recall' +} + +# Generate generalization metrics +scores = model.evaluate(test_generator, verbose=1) +print('Test results:') +for index, score in enumerate(scores): + print(f'-> {kvp[index]}: {score}') +``` + +The results: + +``` +Test results: +-> Categorical crossentropy loss: 0.28413824876770377 +-> Accuracy: 0.8833333253860474 +-> True positives: 106.0 +-> False positives: 13.0 +-> True negatives: 227.0 +-> False negatives: 14.0 +-> Precision: 0.8907563090324402 +-> Recall: 0.8833333253860474 +``` + +Even better! With a vanilla ConvNet, at the loss minimum found during the training process, we find an 88.3% accuracy and an 88-89% precision-recall. That's awesome! 😎 + +Now, while performance is already good, we can try and find whether we can boost performance even further, by applying a technique called _pretraining_ or _Transfer Learning_. + +* * * + +## Can performance be improved with Transfer Learning? + +Transfer Learning is an area of Machine Learning "that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem" (Wikipedia, 2006). + +> For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks. +> +> Wikipedia (2006) + +Transfer learning can be really useful if you're a bit stuck training a Machine Learning model for a particular problem, while you have a well-performing one for a closely related problem. + +In the case of Computer Vision, where ConvNets are used which learn [increasingly abstract feature representations](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/), this effectively means that parts of another well-performing ConvNet - i.e. the parts that learn very generic features - can be used to boost the performance of another ConvNet. + +In our case, this means that if we can find a **pretrained** ConvNet - as it is called - and connect it to the ConvNet we created before, we might be able to boost performance even further. + +### Adding a pretrained InceptionV3 model to our code + +The snippet below shows how an `InceptionV3` architecture with weights trained on the `ImageNet` dataset can be used as a base for our COVID-19 classifier. + +- First of all, note that we had to use the [Keras Functional API](https://keras.io/guides/functional_api/) instead of the Sequential one - otherwise, we could not glue the models together. 
+- We load the `InceptionV3` architecture with `ImageNet` weights into `basemodel` - and greatly benefit from the fact that Keras already defines it as a `keras.application`. +- On top of the `.output` of the `basemodel`, we stack a `GlobalAveragePooling2D` layer as well as a `Flatten` layer. +- We then increase the number of `Dense` layers. If we didn't do that, the bottleneck created by the single `Dense` layer would be too extreme and the model wouldn't learn. +- For the rest, the model is pretty similar: we apply `Dropout` and the same set of metrics as before. The two things that have also changed are the `target_size_scalar` (I thought 256 x 256 pixel images would be better for a deep architecture, so that it can learn more features) and `batch_size` - reduced significantly for memory reasons. + +``` +import os +import tensorflow +from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint +from tensorflow.keras.layers import GlobalAveragePooling2D, Input +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.models import Model +from tensorflow.keras.preprocessing.image import ImageDataGenerator +from tensorflow.keras.applications import InceptionV3 + +# Model configuration +target_size_scalar = 256 # 256 x 256 pixels +batch_size = 15 +no_epochs = 1000 +no_classes = 3 +validation_split = 0.2 +verbosity = 1 +path = './covid/COVID-19 Radiography Database' +path_test = './covid/COVID-19 Radiography Database Test' +input_shape = (target_size_scalar, target_size_scalar, 3) +labels = ['COVID-19', 'NORMAL', 'Viral Pneumonia'] +checkpoint_path=f'{os.path.dirname(os.path.realpath(__file__))}/covid-convnet-pretrained.h5' + +# Create ImageDataGenerators for training, validation and testing data +training_idg = ImageDataGenerator( + rescale = 1./255, + validation_split = validation_split +) + +validation_idg = ImageDataGenerator( + rescale = 1./255, + validation_split = validation_split +) + +testing_idg = ImageDataGenerator( + rescale = 1./255 +) + +# Flow from directory for the IDGs +train_generator = training_idg.flow_from_directory( + directory = path, + classes = labels, + seed = 28, + batch_size = batch_size, + shuffle = True, + target_size=(target_size_scalar, target_size_scalar), + subset = 'training' +) + +val_generator = validation_idg.flow_from_directory( + directory = path, + classes = labels, + seed = 28, + batch_size = batch_size, + shuffle = True, + target_size=(target_size_scalar, target_size_scalar), + subset = 'validation' +) + +test_generator = testing_idg.flow_from_directory( + directory = path_test, + classes = labels, + batch_size = batch_size, + target_size=(target_size_scalar, target_size_scalar) +) + +# Load the InceptionV3 application +basemodel = InceptionV3( + include_top = False, + weights = 'imagenet', + input_tensor = Input(input_shape), +) +basemodel.trainable = True + +# Create the ConvNet +headmodel = basemodel.output +headmodel = GlobalAveragePooling2D()(headmodel) +headmodel = Flatten()(headmodel) +headmodel = Dense(200, activation='relu')(headmodel) +headmodel = Dropout(0.5)(headmodel) +headmodel = Dense(100, activation='relu')(headmodel) +headmodel = Dropout(0.5)(headmodel) +headmodel = Dense(50, activation='relu')(headmodel) +headmodel = Dropout(0.5)(headmodel) +headmodel = Dense(no_classes, activation='softmax')(headmodel) +model = Model(inputs = basemodel.input, outputs = headmodel) + +# Compile the ConvNet +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + 
optimizer=tensorflow.keras.optimizers.Adam(), + metrics=[ + 'accuracy', + tensorflow.keras.metrics.TruePositives(), + tensorflow.keras.metrics.FalsePositives(), + tensorflow.keras.metrics.TrueNegatives(), + tensorflow.keras.metrics.FalseNegatives(), + tensorflow.keras.metrics.Precision(), + tensorflow.keras.metrics.Recall() + ]) + +# Compute weights +class_weights = {0: 5.18621974, 1: 0.71355368, 2: 0.71136654} + +# Define callbacks +keras_callbacks = [ + EarlyStopping(monitor='val_loss', patience=5, mode='min', min_delta=0.01), + ModelCheckpoint(checkpoint_path, monitor='val_loss', save_best_only=True, mode='min') +] + +# Fit data to model +model.fit(train_generator, + epochs=no_epochs, + verbose=verbosity, + class_weight=class_weights, + callbacks=keras_callbacks, + validation_data=val_generator) + +# Generalization key value pairs +kvp = { + 0: 'Categorical crossentropy loss', + 1: 'Accuracy', + 2: 'True positives', + 3: 'False positives', + 4: 'True negatives', + 5: 'False negatives', + 6: 'Precision', + 7: 'Recall' +} + +# Generate generalization metrics +scores = model.evaluate(test_generator, verbose=1) +print('Test results:') +for index, score in enumerate(scores): + print(f'-> {kvp[index]}: {score}') +``` + +A few changes to the testing snippet are necessary to run the test with the pretrained ConvNet: + +``` + +target_size_scalar = 256 +batch_size = 15 +... +checkpoint_path=f'{os.path.dirname(os.path.realpath(__file__))}/covid-convnet-pretrained.h5' +``` + +Results: + +``` +Test results: +-> Categorical crossentropy loss: 0.22990821907296777 +-> Accuracy: 0.925000011920929 +-> True positives: 110.0 +-> False positives: 8.0 +-> True negatives: 232.0 +-> False negatives: 10.0 +-> Precision: 0.9322034120559692 +-> Recall: 0.9166666865348816 +``` + +Our pretrained ConvNet boosts acuracy, precision and recall even further! 😎 + +* * * + +## Summary + +COVID-19 is currently sweeping our world. The pandemic, that spilled over from an animal to a human approximately one year ago, is currently having a significant impact on the world and global economies. Even worse, medical systems are cracking under the intense load of COVID patients currently in hospitals throughout the world. + +In this article, we looked at how Machine Learning can benefit medical professionals in the COVID-19 area. Recognizing that real medical usage would require a lot of additional testing and such, today, we created a Convolutional Neural Network that can classify COVID-19 on Radiography images, distinguishing the images from normal ones and images with other viral pneumonias. In fact, by applying pretraining, we could achieve accuracies of 92.5%. Really awesome. + +Improvements remain possible, of course. For example, medical science suggests that CT scans can be an even better data source for recognizing COVID. What's more, we could potentially _combine_ the two data sources in some kind of an ensemble model. And perhaps we can also add other medical data - blood data to give an example - with e.g. the levels of immunoglobin or typical COVID-related cytokines, to even tell what the disease timeline for an individual is. + +However, that would be too much for one article! :) + +Altogether, I hope that this article was useful for both inspiration and Machine Learning education - I did learn a lot from writing it. 
If you have any comments, questions or other remarks, please feel free to leave a comment in the comments section below 💬 Otherwise, thank you for reading MachineCurve today, stay healthy 😷 and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2020, February 5). _Coronavirus disease 2019_. Wikipedia, the free encyclopedia. Retrieved November 4, 2020, from [https://en.wikipedia.org/wiki/Coronavirus\_disease\_2019](https://en.wikipedia.org/wiki/Coronavirus_disease_2019) + +M.E.H. Chowdhury, T. Rahman, A. Khandakar, R. Mazhar, M.A. Kadir, Z.B. Mahbub, K.R. Islam, M.S. Khan, A. Iqbal, N. Al-Emadi, M.B.I. Reaz, M. T. Islam, [“Can AI help in screening Viral and COVID-19 pneumonia?”](https://arxiv.org/ftp/arxiv/papers/2003/2003.13145.pdf) IEEE Access, Vol. 8, 2020, pp. 132665 - 132676. + +Wikipedia. (2020, January 5). _COVID-19 pandemic_. Wikipedia, the free encyclopedia. Retrieved November 4, 2020, from [https://en.wikipedia.org/wiki/COVID-19\_pandemic#Background](https://en.wikipedia.org/wiki/COVID-19_pandemic#Background) + +Hegarty, S. (2020, February 6). _The Chinese doctor who tried to warn others about coronavirus_. BBC News. [https://www.bbc.com/news/world-asia-china-51364382](https://www.bbc.com/news/world-asia-china-51364382) + +_COVID-19 radiography database_. (n.d.). Kaggle: Your Machine Learning and Data Science Community. [https://www.kaggle.com/tawsifurrahman/covid19-radiography-database](https://www.kaggle.com/tawsifurrahman/covid19-radiography-database) + +Keras Team. (n.d.). _Keras documentation: Image data preprocessing_. Keras: the Python deep learning API. [https://keras.io/api/preprocessing/image/#imagedatagenerator-class](https://keras.io/api/preprocessing/image/#imagedatagenerator-class) + +Wikipedia. (2006, February 1). _Transfer learning_. Wikipedia, the free encyclopedia. Retrieved November 5, 2020, from [https://en.wikipedia.org/wiki/Transfer\_learning](https://en.wikipedia.org/wiki/Transfer_learning) diff --git a/neural-network-activation-visualization-with-tf-explain.md b/neural-network-activation-visualization-with-tf-explain.md new file mode 100644 index 0000000..82bc5c2 --- /dev/null +++ b/neural-network-activation-visualization-with-tf-explain.md @@ -0,0 +1,509 @@ +--- +title: "Neural network Activation Visualization with tf-explain" +date: "2020-04-27" +categories: + - "deep-learning" + - "frameworks" +tags: + - "activation" + - "activation-function" + - "cifar10" + - "conv2d" + - "convolutional-neural-networks" + - "mnist" + - "tf-explain" + - "visualization" +--- + +Deep learning models and especially neural networks have been used thoroughly over the past few years. There are many success cases in the press about the application of those models. One of the primary categories in which those are applied is the field of computer vision, mainly thanks to the 2012 revolution in [Convolutional Neural Networks](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/). + +However, until recently, it was very difficult to understand how a neural network arrived at its outcome - i.e., its prediction. + +Society however doesn't work that way. In order to facilitate adoption of AI into business processes, results must be explainable. For example, if access to a building is denied based on a computer vision model, the law might require that this happens based on a valid reason. If there's no such reason, it would be illegal to keep that person out. 
Now, this is just an example, but it demonstrates the necessity of explaining AI models, and by consequence machine learning models.
+
+Tf-explain is a framework for enhancing the interpretability and explainability of AI models created with TensorFlow 2.x based Keras. It offers a wide range of techniques for visualizing the outcomes and decision criteria of neural networks - primarily Convolutional Neural Networks.
+
+In this blog post, we'll look at one such technique: Activation Visualization. It does what the name suggests - for a layer in a ConvNet, it visualizes how an input is processed through that layer and what each subsequent feature map looks like.
+
+This post is structured as follows. Firstly, we'll look at the conceptual nature of an activation - by taking a look at a layer of a ConvNet and how input is processed there, including why activation functions are necessary. Subsequently, we'll introduce `tf-explain` and provide a bit of background information. Following this, we'll implement a [Keras ConvNet](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) based on TensorFlow 2.x and perform Activation Visualization with `tf-explain` to show you how it works.
+
+All code for doing so is included with the relevant steps and is broken down into small parts so that it is easy to understand. I hope this post will be useful for those who wish to use AI explainability techniques with Keras themselves. Let's go! :)
+
+* * *
+
+\[toc\]
+
+\[affiliatebox\]
+
+* * *
+
+## What are activations?
+
+In this tutorial, we'll primarily focus on Convolutional Neural Networks, also called CNNs or ConvNets. Those networks are the primary drivers of the deep learning revolution in computer vision.
+
+Now, what is a ConvNet?
+
+While we already have a [very detailed post on the matter](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/), let's briefly repeat what is written there.
+
+First of all, a Convolutional Neural Network is not a special type of neural network - as with any, it's a set of trainable parameters which is trained through the [supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) with its feedforward operation and subsequent gradient based optimization.
+
+The difference is that within Convolutional Neural Networks, the parameters are structured a bit differently than in a regular fully connected network.
+
+### ConvNets at a high level
+
+Let's take a look at a ConvNet schematically:
+
+![](images/convnet_fig.png)
+
+Source: [gwding/draw\_convnet](https://github.com/gwding/draw_convnet)
+
+On the left, you see the `Inputs` layer, which accepts a 32 x 32 pixel image with 3 image channels - i.e. an RGB image.
+
+As you can see, in a loupe-like structure, the inputs are reduced in size. This happens by means of the _convolution_ operation - a kernel of a few by a few pixels (5x5 in the `Inputs` layer) that slides over the entire input - horizontally and vertically. Doing so, it multiplies the kernel values (the _learnt weights_) with the _input it covers_ in element-wise multiplications. The output represents the "loupe result" in the downstream layer. All outputs for an image combined form a _feature map_, and often, as we can see here too, there are many kernels in a layer - resulting in many feature maps.
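+
+To make the sliding-kernel idea a bit more tangible, here is a minimal NumPy sketch of a single 3x3 kernel producing one feature map for a 5x5 single-channel input. It is purely illustrative - no padding, strides or multiple channels - and not how Keras implements `Conv2D` internally:
+
+```
+import numpy as np
+
+# Toy 5x5 single-channel 'image' and one 3x3 kernel (the learnt weights)
+image = np.random.rand(5, 5)
+kernel = np.random.rand(3, 3)
+
+# Sliding the kernel over the input yields a (5 - 3 + 1) x (5 - 3 + 1) feature map
+feature_map = np.zeros((3, 3))
+for i in range(3):
+    for j in range(3):
+        patch = image[i:i+3, j:j+3]                 # input covered by the kernel
+        feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
+
+print(feature_map.shape)  # (3, 3)
+```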
+ +Learning is ascribed to the kernels, which have trainable weights that can be adapted to respond to inputs. This happens in the [optimization part of the machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#backwards-pass). It closes the cycle between inputs, outputs and subsequent model improvement - it's really that simple conceptually ;-) + +As you can understand, the feature maps at the most downstream convolutional layer (possibly followed by [pooling layers](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/)) are a very abstract representation of the original input image. This is beneficial in two ways: + +1. Different images of the same object (such as two different cats) will quite resemble each other when they have been "louped" into very abstract format. +2. The abstract representations are common representations of a particular class (such as "cat") and are thus useful for actual classification. + +That's why we say that convolutional layers are _feature extractors_, whereas the actual classification happens in the subsequent fully-connected layers - as we are used to with any neural network. + +That's why ConvNets are slightly different, but conceptually similar to _traditional_ neural networks. + +### Linearity of element-wise multiplications and activation functions + +In a traditional neural network, the operation performed for some input vector \[latex\]\\textbf{x}\[/latex\] is \[latex\]output(\\textbf{x}) = \\textbf{w} \\times \\textbf{x}\[/latex\]. Here, vector \[latex\]\\textbf{x}\[/latex\] is the input vector (e.g. \[latex\]\[1.23, 3.77, -2.19\]\[/latex\] for a three-dimensional input vector) and \[latex\]\\textbf{w}\[/latex\] is the trainable weights vector of the same dimensionality. + +This multiplication is done on an element-wise basis, i.e. \[latex\]1.23 \\times \\mathbf{w\_1}\[/latex\], and so on. For ConvNets, things work a bit differently, but the point remains standing - **this operation is linear**. + +That's quite a problem, to say the least, for it doesn't matter how big your neural network is - if the chain of processing throughout all layers is linear, _[you can only handle linear data](https://www.machinecurve.com/index.php/2019/06/11/why-you-shouldnt-use-a-linear-activation-function/)_. + +And pretty much all of today's data is nonlinear. + +The solution is simple and elegant: you can place an [activation function](https://www.machinecurve.com/index.php/2020/01/24/overview-of-activation-functions-for-neural-networks/) directly behind the output of the layer. This function, which is pretty much always nonlinear (such as \[latex\]sin(x)\[/latex\]), converts the linear output into something nonlinear and subsequently ensures that it is passed to the next layer for processing. This way, it is ensured that all processing is nonlinear - and suddenly it becomes possible to handle nonlinear datasets. The output of this function is what we will call the **activation**, hence the name activation function. Today, we'll be visualizing the activations of ConvNet layers. + +Today's most common activation function is the [Rectified Linear Unit (ReLU)](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/). 
Other ones used still are [Tanh and Sigmoid](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/), while there are also newer ones, such as [Leaky ReLU, PReLU, and Swish](https://www.machinecurve.com/index.php/2020/01/24/overview-of-activation-functions-for-neural-networks/). They all try to improve on top of each other. However, in most cases, ReLU suffices. + +- [![](images/swish-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/05/swish.png) + +- [![](images/sigmoid-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/05/sigmoid.png) + +- [![](images/tanh-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/05/tanh.png) + +- [![](images/relu-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/05/relu.png) + + +Common activation functions + +* * * + +## Introducing tf-explain + +Let's now move towards the core of this post: the `tf-explain` framework 😎 Created by [Sicara](https://www.sicara.com/), it is a [collection](https://github.com/sicara/tf-explain) of "Interpretability Methods for tf.keras models with Tensorflow 2.0" (Tf-explain, n.d.). + +Great! A collection of techniques that are usable with the modern implementation of Keras, which has migrated into the TensorFlow framework as of version 2.0. + +What's more, `tf-explain` has implemented a wide range of techniques that have been proposed by scientists in a range of academic papers (Tf-explain, n.d.). As of April 2020, these include, but may no longer be limited to: + +1. Activations Visualization +2. Vanilla Gradients +3. Gradients\*Inputs +4. Occlusion Sensitivity +5. Grad CAM (Class Activation Maps) +6. SmoothGrad +7. Integrated Gradients + +...and others are on their development roadmap: + +1. GradCAM++ +2. Guided SmoothGrad +3. LRP + +Sounds really cool - and installation is simple: `pip install tf-explain`. That's it - and it's usable for both the TensorFlow CPU and GPU based models :) + +\[affiliatebox\] + +* * * + +## Visualizing ConvNet activations with tf-explain and Keras + +Now that we understand what `tf-explain` is and what it does, we can actually do some work. Today, we will visualize the ConvNet activations with `tf-explain` for a simple ConvNet [created with Keras](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/). + +Recall: in a ConvNet, activations are the outputs of layers, and our technique will allow us to see the feature maps that are generated by a Keras model. + +### Today's model + +As the point of this blog post is to illustrate how `tf-explain` can be used for visualizing activations, I will not focus on creating the neural network itself. Instead, we have another blog post for that - being ["How to use Conv2D with Keras?"](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/). Click the link to find a detailed, step-by-step explanation about the model that we will use in this blog post. + +In short, our ConvNet will be able to classify the [CIFAR10 dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#cifar-10-small-image-classification): + +[![](images/cifar10_visualized.png)](https://www.machinecurve.com/wp-content/uploads/2019/06/cifar10_visualized.png) + +As you can see, it is an image clasification dataset with 32x32 pixel RGB images of everyday objects. The images are distributed across 10 classes. 
+ +Here's the full model code from the ["How to use Conv2D with Keras?"](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) post: + +``` +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 32, 32, 3 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 100 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Load CIFAR-10 data +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +Now, there are two paths forward with respect to generating the Activation Visualizations with `tf-explain`: + +1. **Visualizing the activations during the training process.** This allows you to determine how your model's trainable parameters evolve during training, and whether you might have an intuitively better intermediate result that you better use as your final model. What's even better is that you can use [TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/) to track your visualizations _during_ training. +2. **Visualizing the activations after training.** This helps you determine whether your final model works well. + +Of course, we'll cover both variants next. + +### Visualizing the activations during training + +The first thing to do when we want to visualize the activations during the training process is installing `tf-explain`, if you didn't already do so. It's really easy: `pip install tf-explain`. Make sure to do so in the environment where your `tensorflow` and other dependencies are also installed. 
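+
+If you want to verify that the installation succeeded, a quick sanity check is to run the following few lines from that same environment - the printed version number will of course differ on your system:
+
+```
+import tensorflow as tf
+import tf_explain  # will raise ImportError if the installation did not succeed
+
+print(tf.__version__)
+```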
+ +#### Adding tf-explain to your imports + +The first thing we have to do is adding the `ActivationsVisualizationCallback` to the imports we already have: + +``` +from tf_explain.callbacks.activations_visualization import ActivationsVisualizationCallback +``` + +...so that they become: + +``` +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from tf_explain.callbacks.activations_visualization import ActivationsVisualizationCallback +``` + +#### Creating a Keras callback: the ActivationsVisualizationCallback + +As you could have guessed by now, using `tf-explain` during training works by means of _callbacks_. Those are pieces of functionality supported by Keras that run _while_ training, and can e.g. be used to [stop the training process and save data on the fly](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/). + +So, in order to make this work, define a new callback just below the `model.compile` step: + +``` +# Define the Activation Visualization callback +output_dir = './visualizations' +callbacks = [ + ActivationsVisualizationCallback( + validation_data=(input_test, target_test), + layers_name=['visualization_layer'], + output_dir=output_dir, + ), +] +``` + +Firstly, we specify the `output_dir`, and set it to `./visualizations`. This means that a new folder called _visualizations_ will be created in the current folder, and the callback will dump the files for generating the visualization there. + +Then, the `callbacks` array. All Keras callbacks must be provided to the model jointly, that is, together. Hence, we usually put them in an array. Today, the only callback is the `ActivationsVisualizationCallback`, so it might be a bit redundant - but the array is still necessary. + +In the callback, we specify a few things: + +- **Validation data:** which, in our case, is our testing data. +- **Layers name:** the names of the layers that we want to visualize. +- The **output directory**. + +Now, wait a second! Layers name? What is this? + +Well, in Keras, every layer gets assigned a name. Take a look at the [model summaries we can generate](https://www.machinecurve.com/index.php/2020/04/01/how-to-generate-a-summary-of-your-keras-model/), to give just one example. You'll see names like `conv2d_1` - but you can also provide your own. Let's do this, and replace the second `model.add` with: + +``` +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', name='visualization_layer')) +``` + +Now, `tf-explain` will understand what layer must be visualized. + +#### Fit data to your model with the callback appended + +Now that we have prepared our callback, it's time to use it. This is really as simple as adding the callbacks to the training process, i.e., to `model.fit`: + +``` +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split, + callbacks=callbacks) +``` + +I.e.: `callbacks=callbacks`. + +**Pro tip:** for the time being, also make sure to set your `no_epochs` in the configuration to 2, especially if you just want to test. As we shall see, the visualization will become pretty big since we have to visualize 64 kernels across 2 epochs. + +Time to run the model! 
Save your code as `activation-visualization-training.py` (or some other Python file), open up a terminal / environment where the dependencies are installed (being Tensorflow 2.x and `tf-explain`), and run the model: + +``` +python activation-visualization-training.py +``` + +The training process will start: + +``` +Relying on driver to perform ptx compilation. This message will be only logged once. +40000/40000 [==============================] - 106s 3ms/sample - loss: 1.5383 - accuracy: 0.4464 - val_loss: 1.2711 - val_accuracy: 0.5510 +Epoch 2/2 +40000/40000 [======================> +``` + +Contrary to what you are used to, time between epochs will be a little bit longer - as the results will have to be written to disk. Don't let this discourage you, though :) + +#### Full model code + +Now, for those who wish to obtain the full model code at once - for example, to start playing with the code straight away - here you go :) + +``` +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from tf_explain.callbacks.activations_visualization import ActivationsVisualizationCallback + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 32, 32, 3 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 2 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Load CIFAR-10 data +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', name='visualization_layer')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Define the Activation Visualization callback +output_dir = './visualizations' +callbacks = [ + ActivationsVisualizationCallback( + validation_data=(input_test, target_test), + layers_name=['visualization_layer'], + output_dir=output_dir, + ), +] + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split, + callbacks=callbacks) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +#### Results + +When the training process has finished, you should see a file in your `./visualizations.py` folder that is named like `events.out.tfevents.1588008411.DESKTOP-PJKJ0UE.13452.235.v2`. + +It could be located in some folder with another timestamp based name. + +This could be odd, but it isn't. 
Recall from the introduction that _visualizations during training_ are generated in TFEvents format, and can be visualized using [TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/).
+
+So let's do that. Open up your terminal again (possibly the same one you trained your model in), `cd` to the folder where your `.py` file is located, and start TensorBoard:
+
+```
+tensorboard --logdir=./visualizations
+```
+
+By default, TensorBoard will load on `localhost` at port `6006`:
+
+```
+Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
+TensorBoard 2.1.0 at http://localhost:6006/ (Press CTRL+C to quit)
+```
+
+Opening up that web page in your browser will pretty much directly show you the Activation Visualizations generated during training. What's more, on the left, it's also possible to adapt brightness and contrast:
+
+![](images/image-9-1024x481.png)
+
+My visualizations are still pretty weird. That's likely due to the fact that I quit the training process after the second epoch, at a time when the model was not yet adequate (`validation_accuracy = 0.63...`). However, as the file grew quite quickly (~126MB already) and the purpose of this number of epochs was simply to demonstrate that it works, I left it at that :)
+
+All right! Visualizing the activations during training: ✅. Let's proceed with visualizing activations after training 🚀
+
+\[affiliatebox\]
+
+### Visualizing the activations after training
+
+If you want to interpret how your model works once it has finished training, you might wish to use `tf-explain` and its Activation Visualization _after_ the training process.
+
+We'll cover this scenario next.
+
+#### Adding tf-explain to your imports
+
+Of course, the first thing that must be done is adding `tf-explain` to the imports. A bit counterintuitively, the Activation Visualizer is called differently here: `ExtractActivations`. What's more - and this makes sense - it's located in the `.core` part of the package, and not in `.callbacks`.
+
+That's why this must be added to the imports:
+
+```
+from tf_explain.core.activations import ExtractActivations
+```
+
+So that they become:
+
+```
+from tensorflow.keras.datasets import cifar10
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense, Flatten, Conv2D
+from tensorflow.keras.losses import sparse_categorical_crossentropy
+from tensorflow.keras.optimizers import Adam
+from tf_explain.core.activations import ExtractActivations
+```
+
+#### Instantiate the ExtractActivations explainer
+
+Now that we have imported the `tf-explain` functionality that we need, we can instantiate the explainer directly below `model.fit`:
+
+```
+# Define the Activation Visualization explainer
+index = 250
+image = input_test[index].reshape((1, 32, 32, 3))
+label = target_test[index]
+data = ([image], [label])
+explainer = ExtractActivations()
+grid = explainer.explain(data, model, layers_name='visualization_layer')
+explainer.save(grid, '.', 'act.png')
+```
+
+Let's take a look at this code line by line:
+
+- At the first line, we set `index` to 250. That means: sample 250. It can be set to any number, as long as it's a valid index in the dataset you're using.
+- At the second line, we define the `image` based on the `index`. We also have to reshape it from `(32, 32, 3) -> (1, 32, 32, 3)`, because `tf-explain` throws an error otherwise.
+- At the third line we define the `label` based on the `index`. +- We merge them together into a `data` tuple of samples at the fourth line. Do note that you could add multiple samples here: for example, a `second_image` (resulting in `[image, second_image]`), and the same for the labels. The labels seem not to be required. +- Subsequently, we instantiate the `explainer` next. +- Then, we instruct it to `explain` the `data` based on the `model` we trained, for the `layers_name` we defined. +- Then, we `save` the end result into `act.png`. + +As with the _during training explanation_, we must specify the layer name here as well - so replace the second `model.add` in your code with: + +``` +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', name='visualization_layer')) +``` + +#### Results + +Time to run it! 😎 Open up your terminal, `cd` to the folder where your Python file is located, and run it - e.g. `python activation-visualization-trained.py`: + +``` +Relying on driver to perform ptx compilation. This message will be only logged once. +40000/40000 [==============================] - 24s 596us/sample - loss: 1.4864 - accuracy: 0.4670 - val_loss: 1.1722 - val_accuracy: 0.5926 +Epoch 2/10 +40000/40000 [=================> +``` + +Great, we have a running training process :) Once it finishes, your activation visualization should be visible in the `act.png` file. In my case, it's a bit black-ish. What does it look like with your dataset? I'd love to know! + +![](images/act.png) + +For the [MNIST dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#mnist-database-of-handwritten-digits), and a specific sample of number 4, it looks like this: + +![](images/act-1.png) + +That's more promising :) + +* * * + +\[affiliatebox\] + +## Summary + +In this blog post, we looked at Activation Visualization for neural network interpretability with `tf-explain` and Keras. Firstly, we looked at Convolutional Neural Networks and their activations in general. Subsequently, we introduced `tf-explain`. + +This was followed by a step-by-step explanation of the framework for visualizing data _during_ training with TensorBoard, and _after_ training with manual action. All code is included in the post. + +I hope you've learnt something today! If you have any questions, remarks or other comments, please feel free to leave a comment in the comments section below 💬 Thank you for reading MachineCurve today and happy engineering 😎 + +\[kerasbox\] + +* * * + +## References + +Tf-explain. (n.d.). _tf-explain documentation_. tf-explain — tf-explain documentation. [https://tf-explain.readthedocs.io/en/latest/](https://tf-explain.readthedocs.io/en/latest/) diff --git a/object-detection-for-images-and-videos-with-tensorflow-2-x.md b/object-detection-for-images-and-videos-with-tensorflow-2-x.md new file mode 100644 index 0000000..5c50051 --- /dev/null +++ b/object-detection-for-images-and-videos-with-tensorflow-2-x.md @@ -0,0 +1,678 @@ +--- +title: "Object Detection for Images and Videos with TensorFlow 2.0" +date: "2021-01-15" +categories: + - "frameworks" +tags: + - "computer-vision" + - "convolutional-neural-networks" + - "deep-learning" + - "machine-learning" + - "object-detection" + - "tensorflow" +--- + +Object detection is one of the areas in Deep Learning where much progress has been made. Using a variety of models, we can detect objects in photos and - by consequence - also in videos. Even real-time object detection using webcam images is a common thing these days! 
+ +In this tutorial, we will build an **object detection system with TensorFlow**. Specifically, we will be using the TensorFlow Object Detection API. In a step-by-step fashion, you will install all the necessary dependencies, take a look at pretrained models in the TensorFlow Model Zoo, and build the object detector. + +In other words, after reading this tutorial, you will... + +- Understand what you need to install for building a TensorFlow based object detector. +- Know where to find pretrained models and download them to your system. +- Have built an actual object detector system that can be used with photos and videos. + +And because images always say more than 1.000 words, you will create a system that can do this: + +![](images/ezgif-3-15f84305f6f1.gif) + +Let's take a look! 🔥 + +* * * + +\[toc\] + +* * * + +## Code example: fully functional Object Detection with TensorFlow 2.x + +With this fully functional **example of object detection with TensorFlow**, you can get started quickly. If you want to understand everything in more detail, make sure to read the rest of this tutorial below. We're going to walk through every step, so that you'll understand exactly how to build such a system yourself. Good luck! + +**⚠ Pay attention to the following things when running this example straight away:** + +1. Make sure that you have installed TensorFlow, OpenCV and the TensorFlow Object Detection API. I built this with [TensorFlow 2.4.0](https://www.machinecurve.com/index.php/2020/11/05/saying-hello-to-tensorflow-2-4-0/). +2. Download the pretrained model that you want to use for object detection. +3. Ensure that you correctly configure the path to the Object Detection API, the model checkpoint and the labels. Also make sure to set the model name correctly. +4. Optionally, comment out the `os.environ(...)` call if you want to run the code on your GPU. Of course, this only works if your TensorFlow is GPU-enabled. 
+ +``` +# Specify model imports +from object_detection.builders import model_builder +from object_detection.utils import config_util +from object_detection.utils import label_map_util +from object_detection.utils import visualization_utils as viz_utils +import cv2 +import numpy as np +import os +import tensorflow as tf + +# Disable GPU if necessary +os.environ['CUDA_VISIBLE_DEVICES'] = '-1' + +# Create object detector +class TFObjectDetector(): + + # Constructor + def __init__(self, path_to_object_detection = './models/research/object_detection/configs/tf2',\ + path_to_model_checkpoint = './checkpoint', path_to_labels = './labels.pbtxt',\ + model_name = 'ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8'): + self.model_name = model_name + self.pipeline_config_path = path_to_object_detection + self.pipeline_config = os.path.join(f'{self.pipeline_config_path}/{self.model_name}.config') + self.full_config = config_util.get_configs_from_pipeline_file(self.pipeline_config) + self.path_to_model_checkpoint = path_to_model_checkpoint + self.path_to_labels = path_to_labels + self.setup_model() + + + # Set up model for usage + def setup_model(self): + self.build_model() + self.restore_checkpoint() + self.detection_function = self.get_model_detection_function() + self.prepare_labels() + + + # Build detection model + def build_model(self): + model_config = self.full_config['model'] + assert model_config is not None + self.model = model_builder.build(model_config=model_config, is_training=False) + return self.model + + + # Restore checkpoint into model + def restore_checkpoint(self): + assert self.model is not None + self.checkpoint = tf.train.Checkpoint(model=self.model) + self.checkpoint.restore(os.path.join(self.path_to_model_checkpoint, 'ckpt-0')).expect_partial() + + + # Get a tf.function for detection + def get_model_detection_function(self): + assert self.model is not None + + @tf.function + def detection_function(image): + image, shapes = self.model.preprocess(image) + prediction_dict = self.model.predict(image, shapes) + detections = self.model.postprocess(prediction_dict, shapes) + return detections, prediction_dict, tf.reshape(shapes, [-1]) + + return detection_function + + + # Prepare labels + # Source: https://github.com/tensorflow/models/blob/master/research/object_detection/colab_tutorials/inference_tf2_colab.ipynb + def prepare_labels(self): + label_map = label_map_util.load_labelmap(self.path_to_labels) + categories = label_map_util.convert_label_map_to_categories( + label_map, + max_num_classes=label_map_util.get_max_label_map_index(label_map), + use_display_name=True) + self.category_index = label_map_util.create_category_index(categories) + self.label_map_dict = label_map_util.get_label_map_dict(label_map, use_display_name=True) + + # Get keypoint tuples + # Source: https://github.com/tensorflow/models/blob/master/research/object_detection/colab_tutorials/inference_tf2_colab.ipynb + def get_keypoint_tuples(self, eval_config): + tuple_list = [] + kp_list = eval_config.keypoint_edge + for edge in kp_list: + tuple_list.append((edge.start, edge.end)) + return tuple_list + + + # Prepare image + def prepare_image(self, image): + return tf.convert_to_tensor( + np.expand_dims(image, 0), dtype=tf.float32 + ) + + + # Perform detection + def detect(self, image, label_offset = 1): + # Ensure that we have a detection function + assert self.detection_function is not None + + # Prepare image and perform prediction + image = image.copy() + image_tensor = self.prepare_image(image) + detections, 
predictions_dict, shapes = self.detection_function(image_tensor) + + # Use keypoints if provided + keypoints, keypoint_scores = None, None + if 'detection_keypoints' in detections: + keypoints = detections['detection_keypoints'][0].numpy() + keypoint_scores = detections['detection_keypoint_scores'][0].numpy() + + # Perform visualization on output image/frame + viz_utils.visualize_boxes_and_labels_on_image_array( + image, + detections['detection_boxes'][0].numpy(), + (detections['detection_classes'][0].numpy() + label_offset).astype(int), + detections['detection_scores'][0].numpy(), + self.category_index, + use_normalized_coordinates=True, + max_boxes_to_draw=25, + min_score_thresh=.40, + agnostic_mode=False, + keypoints=keypoints, + keypoint_scores=keypoint_scores, + keypoint_edges=self.get_keypoint_tuples(self.full_config['eval_config'])) + + # Return the image + return image + + + # Predict image from folder + def detect_image(self, path, output_path): + + # Load image + image = cv2.imread(path) + + # Perform object detection and add to output file + output_file = self.detect(image) + + # Write output file to system + cv2.imwrite(output_path, output_file) + + + # Predict video from folder + def detect_video(self, path, output_path): + + # Set output video writer with codec + fourcc = cv2.VideoWriter_fourcc(*'mp4v') + out = cv2.VideoWriter(output_path, fourcc, 25.0, (1920, 1080)) + + # Read the video + vidcap = cv2.VideoCapture(path) + frame_read, image = vidcap.read() + count = 0 + + # Iterate over frames and pass each for prediction + while frame_read: + + # Perform object detection and add to output file + output_file = self.detect(image) + + # Write frame with predictions to video + out.write(output_file) + + # Read next frame + frame_read, image = vidcap.read() + count += 1 + + # Release video file when we're ready + out.release() + + +if __name__ == '__main__': + detector = TFObjectDetector('../../tf-models/research/object_detection/configs/tf2', './checkpoint', './labels.pbtxt', 'ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8') + detector.detect_image('./1.jpg', './1o.jpg') + detector.detect_video('./1v.mp4', './v1o.mp4') +``` + +* * * + +## Building an object detector: prerequisites + +In order to build an object detection system with the TensorFlow Object Detection API, you will need to complete the following three steps: + +1. **Install TensorFlow and OpenCV**. We need TensorFlow for, well, TF functionality, and OpenCV for Image I/O. Normally, these are already installed onto your system, but for the sake of completeness we include them here. +2. **Install the TensorFlow Object Detection API**. This extra set of functionalities must be installed separately. We will take a look at how we can do this. +3. **Find an appropriate pretrained model in the TensorFlow Model Zoo**. In this Zoo, the creators of TensorFlow have put a variety of pretrained models using different model architectures. We're going to take a brief look at them and make a choice for a model. + +![](images/tf-2.jpg) + +### Installing TensorFlow and OpenCV + +The first step to complete before we actually build the object detector is **installing TensorFlow and OpenCV**. + +Here, we assume that you have Python installed on your system already. If not, [make sure to install that as well](https://www.python.org/downloads/). + +[Installing TensorFlow](https://www.tensorflow.org/install) is really easy these days. 
Run the following two commands from within a terminal that has access to Python:
+
+```
+# Requires the latest pip
+pip install --upgrade pip
+
+# Current stable release for CPU and GPU
+pip install tensorflow
+```
+
+It first upgrades `pip` to the latest version and then installs TensorFlow. Whereas previously you had to specify manually whether you wanted the CPU or GPU version, this is no longer the case today. Simply install `tensorflow`, and the GPU version will be used if you have set up your GPU correctly. In fact, you'll be able to switch back and forth between GPU and CPU if you want, but we'll get back to that later.
+
+Installing OpenCV isn't difficult either: `pip install opencv-python` should do the trick.
+
+Now that you have the base packages installed, we can take a look at the TensorFlow Object Detection API.
+
+![](images/parkout.jpg)
+
+A dog... oh wow 🐶😂
+
+### Installing the TensorFlow Object Detection API
+
+On GitHub, specifically in `tensorflow/models`, you can find the [Object Detection API](https://github.com/tensorflow/models/tree/master/research/object_detection):
+
+> The TensorFlow Object Detection API is an open source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models.
+>
+> TensorFlow (n.d.)
+
+As the name suggests, it can be used for object detection purposes. In particular, it offers functionality to load pretrained models and to add bounding boxes to images and videos. This is great, as our object detection system can use these APIs, meaning that we don't have to develop everything ourselves.
+
+We'll take a look at the pretrained models later. Let's install the Object Detection API first. This assumes that you [have Git installed onto your system](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git). Also ensure that you can run the `protoc` command: [here's how](https://stackoverflow.com/questions/47704968/protoc-command-not-found-linux).
+
+1. First, clone the whole `tensorflow/models` repository. Make sure to clone one level deep only. Execute this command for cloning the repository: `git clone --depth 1 https://github.com/tensorflow/models`
+2. Now, navigate into the correct directory with `cd models/research/` and execute `protoc object_detection/protos/*.proto --python_out=.`
+3. Then copy the setup file into the current directory using `cp object_detection/packages/tf2/setup.py .`
+4. Finally, install the Object Detection API with `pip` via `python -m pip install .`
+
+### TensorFlow Model Zoo: pretrained models for Object Detection
+
+Our object detection system will be built on top of a TensorFlow model that is capable of detecting objects - so far no surprise. Training such a model involves the following steps:
+
+- Collecting a large number of images with a variety of objects.
+- Labeling all these images, taking care to [ensure class balance](https://www.machinecurve.com/index.php/2020/11/10/working-with-imbalanced-datasets-with-tensorflow-and-keras/) when doing so.
+- Training a model.
+
+This takes quite a lot of effort that you likely don't want to spend. Fortunately, the folks at TensorFlow have made available a variety of pretrained object detection models in the [TensorFlow Detection Model Zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md).
+
+> We provide a collection of detection models pre-trained on the COCO 2017 dataset.
These models can be useful for out-of-the-box inference if you are interested in categories already in those datasets. +> +> TensorFlow (n.d.) + +These object detectors have been pretrained and are available in the TensorFlow Object Detection API (with the underlying model architectures written in parentheses): + +- **CenterNet (HourGlass104, Resnet50 V1, Resnet101 V1, Resnet50 V2).** +- **EfficientDet (D0, D1, D2, D3, D4, D5, D6, D7).** +- **SSD (MobileNet V1 FPN, V2, V2 FPNLite; ResNet50 V1; Resnet101 V1).** +- **Faster R-CNN (ResNet50; ResNet101; ResNet152; Inception ResNet V2).** +- **Mask R-CNN (Inception ResNet V2).** +- **ExtremeNet**. + +Of course, you can also choose to build your own - but that's a more advanced use case not covered by this tutorial. + +Today, we're going to use the [SSD MobileNet V2 FPNLite 640x640](http://download.tensorflow.org/models/object_detection/tf2/20200711/ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8.tar.gz) model. You can literally [choose any model from the Zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md), but this pretrained model is only 20MB and can therefore be downloaded by many people with different internet speeds. + +Let's now build that detector of ours! + +![](images/shopout.jpg) + +* * * + +## Building the object detector + +Here, we're going to take a look at building the object detection system. Doing so can be split up in three separate but sequential parts: + +1. **Laying the foundation.** Here, we're going to specify the imports, define the class, write what happens on initialization, and write preparatory definitions. +2. **Writing the detection functions.** This is the core of the detector - it allows us to perform detections in general, and specifically generate predictions for images and videos. +3. **Creating the detection calls.** Finally, once our detector is ready, we add some extra code which ensures that we can actually use it. + +Make sure to open your code editor and create a Python file, e.g. `objectdetector.py`. Let's write some code! 😎 + +### Part 1: Laying the foundation + +Recall that the TensorFlow Object Detection API is a framework on top of TensorFlow for building object detectors. In other words, it's a layer on top of a well-known library for creating machine learning models. We're going to add another layer on top of this API, being an object detector layer that can use the Object Detection API. + +Creating the foundation of this `TFObjectDetector` involves adding the Python imports, disabling the GPU if necessary, creating the `TFObjectDetector` class and initializing it, writing the setup mechanism for the object detector, and finally creating some helper functions. + +#### Python imports + +The first code always involves Python imports, and today is not different: + +``` +# Specify model imports +from object_detection.builders import model_builder +from object_detection.utils import config_util +from object_detection.utils import label_map_util +from object_detection.utils import visualization_utils as viz_utils +import cv2 +import numpy as np +import os +import tensorflow as tf +``` + +As you can see, we import many functions from the `object_detection` package - which represents the TensorFlow Object Detection API. We'll use `model_builder` for building the detection model (i.e. the SSD MobileNet model). With `config_util`, we can load the configuration which tells TensorFlow to load the correct model. 
The labels representing the class names can be loaded with the `label_map_util`, and `viz_utils` will be useful for adding the bounding boxes to the image or the video. + +OpenCV (`cv2`) will be used for image input/output, NumPy (`np`) for numbers processing, `os` for operating system functions, and finally we import TensorFlow as well. + +#### Disable the GPU if necessary + +The second step is to disable the GPU, but this is **optional** - in other words, only if you want to. Especially when you have a GPU but when it is misconfigured, this can be useful. You then simply have to erase all CUDA visible devices from the visible environment. If you don't use the GPU version of TensorFlow, this code can be omitted. + +``` +# Disable GPU if necessary +os.environ['CUDA_VISIBLE_DEVICES'] = '-1' +``` + +#### Create the class and initializer + +Now, it's time for the real work. Let's create a class called `TFObjectDetector` which covers all the functionalities of our object detector. + +``` +# Create object detector +class TFObjectDetector(): +``` + +We immediately add the `__init__` definition which represents the constructor of the class. In other words, it is executed immediately upon object creation (in plainer English, loading the `TFObjectDetector`). Note that it takes the following inputs: + +- The **path to object detection**, which represents the path to the TensorFlow 2.x configuration files for the Object Detection API installed on your system. +- The **path to the model checkpoint** for the model that you are running (in our case, the SSD MobileNet model). +- The **path to the labels file** which will allow us to construct a dictionary mapping class ids to textual labels. +- The **model name**. + +⚠ We'll cover setting the inputs to values for your situation later, when we will actually use the detector. + +In the constructor, we do quite a few things. Firstly, we fill a lot of _instance variables_ so that our inputs can be reused throughout the detector. We also load the pipeline configuration that is available in the Object Detection API folder, specifically the one for our model. We also load the full configuration and finally call `self.setup_model()`. + +This starts the setup mechanism for our model, which we'll take a look at now. + +``` +# Create object detector +class TFObjectDetector(): + + # Constructor + def __init__(self, path_to_object_detection = './models/research/object_detection/configs/tf2',\ + path_to_model_checkpoint = './checkpoint', path_to_labels = './labels.pbtxt',\ + model_name = 'ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8'): + self.model_name = model_name + self.pipeline_config_path = path_to_object_detection + self.pipeline_config = os.path.join(f'{self.pipeline_config_path}/{self.model_name}.config') + self.full_config = config_util.get_configs_from_pipeline_file(self.pipeline_config) + self.path_to_model_checkpoint = path_to_model_checkpoint + self.path_to_labels = path_to_labels + self.setup_model() +``` + +#### Setup mechanism + +The setup mechanism is responsible for setting up the model in the background and making our object detector ready for usage. It involves the following steps: + +1. **Building the model with the model configuration loaded in the `__init__` function.** +2. **Restoring the model to a [checkpoint](https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint) i.e. to a particular state to which it was trained.** +3. 
**Retrieving the model detection function, which is a `tf.function` that can be used for generating predictions.** +4. **Preparing the labels, i.e. generating the mapping between class ids and textual labels.** + +Let's group the execution of these steps in the `setup_model()` definition. Recall that this definition is called in the `__init__` definition specified above and thus at the creation of our object detector. + +``` + # Set up model for usage + def setup_model(self): + self.build_model() + self.restore_checkpoint() + self.detection_function = self.get_model_detection_function() + self.prepare_labels() +``` + +We can next create `build_model()`: + +``` + # Build detection model + def build_model(self): + model_config = self.full_config['model'] + assert model_config is not None + self.model = model_builder.build(model_config=model_config, is_training=False) + return self.model +``` + +This definition retrieves the configuration, ensures that it exists and builds the model. It assigns the model to the instance variables so that it can be reused across our object detector. + +With `restore_checkpoint()`, we can set the model back to the checkpointed position / state provided by the TensorFlow Detection Model Zoo. + +``` + # Restore checkpoint into model + def restore_checkpoint(self): + assert self.model is not None + self.checkpoint = tf.train.Checkpoint(model=self.model) + self.checkpoint.restore(os.path.join(self.path_to_model_checkpoint, 'ckpt-0')).expect_partial() +``` + +We can then generate a `tf.function` for detection. This function utilizes our model, preprocesses the image, generates the prediction, postprocesses the detections and returns everything. + +``` + # Get a tf.function for detection + def get_model_detection_function(self): + assert self.model is not None + + @tf.function + def detection_function(image): + image, shapes = self.model.preprocess(image) + prediction_dict = self.model.predict(image, shapes) + detections = self.model.postprocess(prediction_dict, shapes) + return detections, prediction_dict, tf.reshape(shapes, [-1]) + + return detection_function +``` + +Finally, we generate a definition called `prepare_labels()`. Note that it was created by the people at TensorFlow and that it is responsible for mapping class identifiers to textual labels. It sets these to the instance variables. + +``` + # Prepare labels + # Source: https://github.com/tensorflow/models/blob/master/research/object_detection/colab_tutorials/inference_tf2_colab.ipynb + def prepare_labels(self): + label_map = label_map_util.load_labelmap(self.path_to_labels) + categories = label_map_util.convert_label_map_to_categories( + label_map, + max_num_classes=label_map_util.get_max_label_map_index(label_map), + use_display_name=True) + self.category_index = label_map_util.create_category_index(categories) + self.label_map_dict = label_map_util.get_label_map_dict(label_map, use_display_name=True) +``` + +#### Helper functions + +So far, we have created a foundation that is capable of preparing the object detector. We only need to create two more helper functions to finish this part. The first restructures keypoint tuples and the second one prepares the image, i.e. converting it into a Tensor. 
+ +``` + # Get keypoint tuples + # Source: https://github.com/tensorflow/models/blob/master/research/object_detection/colab_tutorials/inference_tf2_colab.ipynb + def get_keypoint_tuples(self, eval_config): + tuple_list = [] + kp_list = eval_config.keypoint_edge + for edge in kp_list: + tuple_list.append((edge.start, edge.end)) + return tuple_list + + + # Prepare image + def prepare_image(self, image): + return tf.convert_to_tensor( + np.expand_dims(image, 0), dtype=tf.float32 + ) +``` + +### Part 2: Writing the detection functions + +Wohoo, we're at part 2 already! In this part, we'll write the detection functions. More specifically, we create three definitions: + +1. A **general detection function**. This function contains all general detection code, which can be reused across detection for images and detection for videos. +2. **Detecting images**. This code will be used specifically for object detection in images. +3. **Detecting videos**. This code will be used for object detection in videos. + +#### General detection function + +The first definition is the general detection function. General here means that it contains the detection functionality shared across detecting on images and on videos. In other words, things that would be pointless to add twice! It contains the following segments: + +- First of all, we check whether the detection function (see above, in Part 1) `is not None`, meaning that it must be set or we can't perform detection. +- We then prepare the image by copying it and converting it into a Tensor. This is followed by generating the `detections`, a dictionary with the `predictions`, and an object containing `shapes` information. +- If keypoints are available, we use them. +- We then add the bounding boxes with our predictions to the image using the `viz_utils` APIs provided by the Object Detection API. +- Finally, we return the image with bounding boxes. + +``` + # Perform detection + def detect(self, image, label_offset = 1): + # Ensure that we have a detection function + assert self.detection_function is not None + + # Prepare image and perform prediction + image = image.copy() + image_tensor = self.prepare_image(image) + detections, predictions_dict, shapes = self.detection_function(image_tensor) + + # Use keypoints if provided + keypoints, keypoint_scores = None, None + if 'detection_keypoints' in detections: + keypoints = detections['detection_keypoints'][0].numpy() + keypoint_scores = detections['detection_keypoint_scores'][0].numpy() + + # Perform visualization on output image/frame + viz_utils.visualize_boxes_and_labels_on_image_array( + image, + detections['detection_boxes'][0].numpy(), + (detections['detection_classes'][0].numpy() + label_offset).astype(int), + detections['detection_scores'][0].numpy(), + self.category_index, + use_normalized_coordinates=True, + max_boxes_to_draw=25, + min_score_thresh=.40, + agnostic_mode=False, + keypoints=keypoints, + keypoint_scores=keypoint_scores, + keypoint_edges=self.get_keypoint_tuples(self.full_config['eval_config'])) + + # Return the image + return image +``` + +#### Detect function for images + +Detecting objects on any image is now easy. It simply involves reading the image from a `path` with OpenCV, calling the general detection definition, and writing the output to the `output_path`. 
+ +``` + # Predict image from folder + def detect_image(self, path, output_path): + + # Load image + image = cv2.imread(path) + + # Perform object detection and add to output file + output_file = self.detect(image) + + # Write output file to system + cv2.imwrite(output_path, output_file) +``` + +#### Detect function for videos + +Detecting objects on a video is a bit more difficult, but also still pretty easy. Recall that a video is nothing more than a set of images, often with 25 frames - and thus images - per second of video. We will use that characteristic when performing object detection on videos! + +This segment is composed of the following steps: + +- We first set the output video writer and the codec. This allows us to write each frame with bounding boxes drawn on top of it to the output video. This essentially means reconstructing the video frame by frame, but then with bounding boxes. +- We then read the video from `path` using OpenCV's `VideoCapture` functionality. +- Using `vidcap.read()`, we read the first frame (`image`) and indicate whether we read it successfully. We also set the frame `count` to zero. +- Now, we loop over the frame, perform detection (see that this is nothing more than detection on images!), and write the frame to the output video. We then read then ext frame, and continue until no frames can be read anymore (i.e. until `frame_read != True`). +- Once we have processed every frame, we release the output video using `out.release()`. + +``` + # Predict video from folder + def detect_video(self, path, output_path): + + # Set output video writer with codec + fourcc = cv2.VideoWriter_fourcc(*'mp4v') + out = cv2.VideoWriter(output_path, fourcc, 25.0, (1920, 1080)) + + # Read the video + vidcap = cv2.VideoCapture(path) + frame_read, image = vidcap.read() + count = 0 + + # Iterate over frames and pass each for prediction + while frame_read: + + # Perform object detection and add to output file + output_file = self.detect(image) + + # Write frame with predictions to video + out.write(output_file) + + # Read next frame + frame_read, image = vidcap.read() + count += 1 + + # Release video file when we're ready + out.release() +``` + +### Part 3: Creating the detection calls + +Parts 1 and 2 conclude the creation of our `TFObjectDetector` class and hence our object detector. Now that we have finished it, it's time to call it. We can do so with the following code. + +``` +if __name__ == '__main__': + detector = TFObjectDetector('../../tf-models/research/object_detection/configs/tf2', './checkpoint', './labels.pbtxt', 'ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8') + detector.detect_image('./shop.jpg', './shopout.jpg') + detector.detect_video('./video.mp4', './videooutput.mp4') +``` + +This code does the following: + +- When it is run _directly_, i.e. not within the context of another class, it first creates a new instance of the `TFObjectDetector`. Here, we pass the following information: + - The _absolute_ or _relative_ path to the **`tf2` config folder** of your `tensorflow/models` cloned GitHub repository. + - The _absolute_ or _relative_ path to the **model checkpoint** folder of the model you downloaded. In the SSD MobileNet case that we use, untar the folder, open it, and you will see the `./checkpoint` folder. Refer there. + - The _absolute_ or _relative_ path to the **labels file** that is used for the mapping between class indices and label names. 
If you don't have it, [you can download it here](https://github.com/tensorflow/models/blob/master/research/object_detection/data/mscoco_label_map.pbtxt) for any of the TensorFlow Detection Model Zoo models. + - The **name of the model**. In our case, that's indeed the difficult name we specify. You can also use any of the other names from the Model Zoo, but then make sure to use the correct checkpoint as well. +- It performs image detection on some image called `./shop.jpg`, storing the output (i.e. the image with overlaying bounding boxes) at `./shopout.jpg`. +- It performs video detection on some video called `./video.mp4` with output at `./videooutput.mp4`. + +* * * + +## Running the object detector + +Let's now take a look at some results of running the object detector. + +_These photos and videos have been downloaded and used under a [Pexels License](https://www.pexels.com/license/)._ + +### On photos + +- [![](images/1o-1024x684.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/01/1o-scaled.jpg) + +- [![](images/3o-1024x557.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/01/3o-scaled.jpg) + +- [![](images/4o-1024x558.jpg)](https://www.machinecurve.com/wp-content/uploads/2021/01/4o-scaled.jpg) + + +### On videos + +- ![](images/ezgif-3-55b348749f13.gif) + +- ![](images/ezgif-3-7996ebfb2d38.gif) + + +* * * + +## Summary + +There are many use cases for object detection in Machine Learning. In this tutorial, you have learned how you can build an object detection system yourself. Using the TensorFlow Object Detection API and a pretrained model, you have been able to perform object detection on images and on videos. + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope that you have learned something from this article 💬 If you did, please feel free to leave a message in the comments section below. Please do the same if you have any questions, or click the **Ask Questions** button on the right. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +_TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc._ + +TensorFlow. (2020, September 9). _TensorFlow/models_. GitHub. [https://github.com/tensorflow/models/blob/master/research/object\_detection/g3doc/tf2\_detection\_zoo.md](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md) + +TensorFlow. (2020, 11). _TensorFlow/models_. GitHub. [https://github.com/tensorflow/models/blob/master/research/object\_detection/colab\_tutorials/inference\_tf2\_colab.ipynb](https://github.com/tensorflow/models/blob/master/research/object_detection/colab_tutorials/inference_tf2_colab.ipynb) + +TensorFlow. (n.d.). _TensorFlow/models_. GitHub. 
[https://github.com/tensorflow/models/tree/master/research/object\_detection](https://github.com/tensorflow/models/tree/master/research/object_detection) + +"[Speed/accuracy trade-offs for modern convolutional object detectors.](https://arxiv.org/abs/1611.10012)" Huang J, Rathod V, Sun C, Zhu M, Korattikara A, Fathi A, Fischer I, Wojna Z, Song Y, Guadarrama S, Murphy K, CVPR 2017 diff --git a/one-hot-encoding-for-machine-learning-with-python-and-scikit-learn.md b/one-hot-encoding-for-machine-learning-with-python-and-scikit-learn.md new file mode 100644 index 0000000..07a8724 --- /dev/null +++ b/one-hot-encoding-for-machine-learning-with-python-and-scikit-learn.md @@ -0,0 +1,226 @@ +--- +title: "One-Hot Encoding for Machine Learning with Python and Scikit-learn" +date: "2020-11-24" +categories: + - "frameworks" + - "svms" +tags: + - "categorical-crossentropy" + - "categorical-data" + - "classification" + - "data-preprocessing" + - "one-hot-encoding" + - "scikit-learn" + - "sparse-categorical-crossentropy" + - "tensorflow" +--- + +Machine Learning models work with numbers. That is, they are mathematical models which improve themselves by performing mathematical optimization. It possibly makes the hype a little bit less fascinating, but it's the truth. Now, when you look at this from a real-world point of view, you might get into a struggle soon when you look at datasets. Datasets are almost never numbers only. For example, if your dataset contains categories, you have no numbers in your dataset. Neither is the case when your dataset contains an ordered list of names, e.g. to illustrate the winners in some kind of competition. + +Machine Learning models don't support such data natively. + +Fortunately, with **one-hot encoding**, we can ensure that we can _still_ use these features - simply by converting them into numeric vector format in a smart way. This article illustrates how we can do that with Python and Scikit-learn. Firstly, however, we will look at one-hot encoding in more detail. What is it? Why apply it in the first place? Once we know the answers, we'll move on to the Python example. There, we explain step by step how to use the Scikit-learn `OneHotEncoder` feature. + +* * * + +\[toc\] + +* * * + +## What is One-Hot Encoding? + +The natural question that we might need to answer first before we move towards a practical implementation is the one related to the _what_. What is one-hot encoding? And how does it work? + +If we look at Wikipedia, we read the following: + +> In digital circuits and machine learning, a **one-hot** is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0). +> +> Wikipedia (2005) + +In other words, if we have a set of bits (recall that these can have 0/1 values only), a one-hot encoded combination means that _one_ of the set is 1 while the _others_ are zero. Hence 'one-hot' encoding: there is one that is 'hot', or activated, while the others are 'cold'. + +Let's take a look at an example. + +If we want to express the decimal numbers 0-3 into binary format, we see that they can be expressed as a set of two bits: the bits take all forms between 00 and 11 to express the decimal numbers. + +
| Decimal | Binary | One-hot |
| --- | --- | --- |
| 0 | 00 | 0001 |
| 1 | 01 | 0010 |
| 2 | 10 | 0100 |
| 3 | 11 | 1000 |
However, this expression does not align with the definition of one-hot encoding: _there is no single high_ in the latter case. If we added more bits, e.g. expressed 7 in binary format (111), we would clearly see that this is a recurring problem.

On the right of the table, we also see the expression of the binary format into one-hot encoded format. Here, the expression ranges from 0001 to 1000, and there is only one _hot_ value per encoding. This illustrates the use of one-hot encoding in expressing values.

### Why apply One-Hot Encoding?

Machine Learning models work with numeric data only. That is, they cannot natively accept text data and learn from it. This follows from the way Machine Learning models are trained. If you are training one in a supervised way, you feed samples forward through the model, which generates predictions. You then compare the predictions with the corresponding labels (called _ground truth_) and compute [how bad the model performs](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/). Then, you improve the model, and you repeat the cycle.

![](images/feed-1024x404.jpg)

Of course, for the third step, there are many different approaches for improving a Machine Learning model. Many of them depend on the algorithm that you are using. In the case of Neural Networks, for example, the contribution of neurons to the loss function can be computed with a technique called backpropagation. If we know the contribution, we also know (by means of a concept called _gradients_, or the slope of loss change given some change in neuron parameters) in which direction we must change the weights if we want to improve the model.

Then, using an [optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), we can actually change the weights.

Such operations do, however, require that data is available in numeric format. The neuron weights are expressed as numbers. For example, this can be a weights vector: \[latex\]\[2.61, 3.92, -2.4, 0.11, 1.11\]\[/latex\]. This also means that in step (1), feeding forward samples to models, computations must be made with respect to these weight vectors in order to learn patterns. And indeed, that is what happens: an input vector \[latex\]\\textbf{x}\[/latex\] to a neuron is multiplied with the weights vector \[latex\]\\textbf{w}\[/latex\], after which a bias value - \[latex\]b\[/latex\] - is added. This output is then fed through an [activation function](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/) and serves as one of the output values of the Neural layer.

![](images/layer-act-1024x227.png)

The point, here, is that in order to make this computation, the input / feature vector \[latex\]\\textbf{x}\[/latex\] must contain numbers. If it contains text, it will fail: there is no way in which we can multiply numbers (the weights vector) with text (the feature vector). The short sketch below the next table makes this concrete.

The problem is that there are many cases where data comes in the form of text - take for example the case of **categorical data** (Wikipedia, 2012). When data is of this type, it assigns 'groups' to samples - e.g. in the case of a health check. The _group_ variable here is categorical, with the possible values being _Healthy_ and _Unhealthy_.
| Age | Group |
| --- | --- |
| 12 | Healthy |
| 24 | Unhealthy |
| 54 | Healthy |
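To see concretely why this fails, here is a minimal sketch - the weight values are arbitrary examples, not taken from any trained model - that tries to multiply a numeric weights vector with one row from the table above:

```
import numpy as np

weights = np.array([2.61, 3.92])      # Example numeric weights vector
sample = np.array([12, 'Healthy'])    # One row from the table: Age and Group

# NumPy silently converts the whole sample into strings, so the
# multiplication that a neuron would perform cannot be computed
try:
    print(weights * sample)
except TypeError as error:
    print(f'Cannot compute the weighted sum: {error}')
```

Running this prints an error rather than a weighted sum - which is exactly the problem that one-hot encoding solves for the _Group_ column.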
+ +### Why One-Hot Encoding helps in this case + +If you are somewhat creative, you can already start to see the relationships between the previous two sections. Here is the primary one: if you want to express categorical data into numeric format, you can use one-hot encoding for doing so. + +Let's take the _Group_ example from the previous section to illustrate how. The case is pretty simple, actually: we can represent the Group values as a set of two bits. For example, if the person is Unhealthy, the category can be expressed as \[latex\]\[0 \\ 1\]\[/latex\], while Healthy can be expressed as \[latex\]\[1 \\ 0\]\[/latex\]. Naturally, we see that we now have a numeric (vector based) representation of our categories, which we can use in our Machine Learning model. + +Long story short: one-hot encoding is of great help when solving [classification problems](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/). + +### One-Hot Encoding and multidimensional settings + +However, there is a catch when it comes to one-hot encoding your data. Suppose that you have a textual dataset with phrases like this: + +- hi there +- i am chris + +Applying one-hot encoding to the text can be done as follows: + +\[latex\]\[1, 0, 0, 0, 0\] \\rightarrow \\text{hi}\[/latex\] + +\[latex\]\[0, 1, 0, 0, 0\] \\rightarrow \\text{there} \[/latex\] + +\[latex\]\[0, 0, 1, 0, 0\] \\rightarrow \\text{i} \[/latex\] + +\[latex\]\[0, 0, 0, 1, 0\] \\rightarrow \\text{am} \[/latex\] + +\[latex\]\[0, 0, 0, 0, 1\] \\rightarrow \\text{chris} \[/latex\] + +If your corpus is big, this will become problematic, because you get one-hot encoded vectors with _many_ dimensions (here, there are just five). Hence, one-hot encoding is as limited as it is promising: while it can help you fix the issue of textual data with a relatively lower-dimensional case, it is best not to use it when you have many categories or when you want to convert text into numbers. In those cases, learning an [Embedding](https://www.machinecurve.com/index.php/2020/03/03/classifying-imdb-sentiment-with-keras-and-embeddings-dropout-conv1d/) can be the way to go. + +* * * + +## A Python Example: One-Hot Encoding for Machine Learning + +Now that we know about one-hot encoding and how to apply it in theory, it's time to start using it in practice. Let's take a look at two settings and apply the `OneHotEncoder` from Scikit-learn. The first setting is a simple one: we simply one-hot encode an array with categorical values, representing the _Group_ feature from a few sections back. The second setting is a more real-world one, where we apply one-hot encoding to the TensorFlow/Keras based [MNIST dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/). + +Let's take a look. + +### One-Hot Encoding a NumPy Array + +Suppose that we express the Group feautre, with _healthy, unhealthy and healthy_ as a NumPy array. We can then use Scikit-learn for converting the values into a one-hot encoded array, because it offers the `sklearn.preprocessing.OneHotEncoder` module. + +- We first import the `numpy` module for converting a Python list into a NumPy array, and the `preprocessing` module from Scikit-learn. +- We then initialize the `OneHotEncoder` and define the data into the `health` variable. Note the reshaping operation, which is necessary for data that is unidimensional. +- We then fit the `health` variable to the `ohe` variable, which contains the `OneHotEncoder`. 
- We then perform a `.transform(..)` operation on two elements from the array with features: first on a Healthy group member, and then on an Unhealthy one. We expect the outcome to be `[1, 0]` for the healthy group, and `[0, 1]` for the unhealthy group.
- After the transform, we convert the data into array format and print it to standard output.

```
import numpy as np
from sklearn import preprocessing

ohe = preprocessing.OneHotEncoder()
health = np.array([['Healthy'], ['Unhealthy'], ['Healthy']]).reshape(-1, 1)
ohe.fit(health)
encoded_healthy = ohe.transform([health[0]]).toarray()
encoded_unhealthy = ohe.transform([health[1]]).toarray()

print(f'Healthy one-hot encoded: {encoded_healthy}')
print(f'Unhealthy one-hot encoded: {encoded_unhealthy}')
```

And indeed:

```
Healthy one-hot encoded: [[1. 0.]]
Unhealthy one-hot encoded: [[0. 1.]]
```

### One-Hot Encoding Dataset Targets

Let's now take a look at a real-world dataset. We can load the [MNIST dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#mnist-database-of-handwritten-digits), which is a dataset of handwritten numbers, as follows:

```
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
```

The data looks like this:

![](images/mnist.png)

Now let's print one of the `y` values on screen:

```
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(y_test[123])
```

Outcome: `6`. Clearly, this sample belongs to class 6 - the seventh class, as the classes range from 0 to 9 - and hence represents the digit 6. However, this is not a one-hot encoded representation! If we are to train our Neural network, we can use [sparse categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/) for computing [loss](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) in this case. However, if we _do_ want to use [categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/17/how-to-use-categorical-multiclass-hinge-with-keras/) instead (which makes no sense in this case, but we want to show one-hot encoding, so we go forward with it anyway), we must one-hot encode our target values first.

Let's see how we can do this with Scikit-learn.

- In the imports, we specify NumPy, preprocessing and now also the `mnist` import from `tensorflow.keras.datasets`.
- We then define and initialize the `OneHotEncoder`.
- We load the MNIST data and then reshape it - the reshape operation is required by Scikit-learn for performing one-hot encoding.
- We then perform fit and transform operations with the `OneHotEncoder` initialization for both the training and the testing segments of our dataset.
- We finally print the results for the `y` variable we checked earlier.
+ +``` +import numpy as np +from sklearn import preprocessing +from tensorflow.keras.datasets import mnist + +# Define the One-hot Encoder +ohe = preprocessing.OneHotEncoder() + +# Load MNIST data +(x_train, y_train), (x_test, y_test) = mnist.load_data() + +# Reshape data +y_train = y_train.reshape(-1, 1) +y_test = y_test.reshape(-1, 1) + +# Fit and transform training data +ohe.fit(y_train) +transformed_train = ohe.transform(y_train).toarray() + +# Fit and transform testing data +ohe.fit(y_test) +transformed_test = ohe.transform(y_test).toarray() + +# Print results +print(f'Value without encoding: {y_test[123]}') +print(f'Value with encoding: {transformed_test[123]}') +``` + +The result: + +``` +Value without encoding: [6] +Value with encoding: [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.] +``` + +Since the range of `y` values ranges from 0-9, we would expect a one-hot encoded vector with ten items - and this is the case. What's more, our `y = 6` value is reflected by setting the 7th value (`i = 6`) in our one-hot encoded array to 1, while the rest remains at 0. Great! + +* * * + +## Summary + +Machine Learning models require numeric data for training. Sometimes, however, your dataset is not numeric - think about converting text into Machine Learning input, or handling categorical data. In those cases, one-hot encoding can be used for making your Machine Learning dataset usable for your ML projects. + +In this article, we looked at applying one-hot encoding. We saw that it involves creating a set of bits where for each unique combination one bit is set to 1 ('hot'), while the others are set to zero ('cold'). For this reason, the technique is called one-hot encoding. We saw that it naturally fits making categorical data usable in Machine Learning problems and can help us significantly when we are solving a classification problem. + +In the practical part of this article, we looked at how we can use Python and Scikit-learn to perform one-hot encoding. We applied Scikit's `OneHotEncoder` to a normal NumPy array, which reflected a simple one-hot encoding scenario with the _Healthy_ and _Unhealthy_ feature values we used in one of the earlier sections. In the second example, we loaded the MNIST data from TensorFlow, and applied one-hot encoding to make our targets compatible with categorical crossentropy loss. We saw that the data is nicely converted, which is nice! + +I hope that you have learned something from today's article. If you did, please feel free to leave a message in the comments section below 💬 Please feel free to do the same if you have questions or other remarks. I'd love to hear from you and will respond whenever possible. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2005, June 16). _One-hot_. Wikipedia, the free encyclopedia. Retrieved November 24, 2020, from [https://en.wikipedia.org/wiki/One-hot](https://en.wikipedia.org/wiki/One-hot) + +Wikipedia. (2012, March 30). _Statistical data type_. Wikipedia, the free encyclopedia. 
Retrieved November 24, 2020, from [https://en.wikipedia.org/wiki/Statistical\_data\_type](https://en.wikipedia.org/wiki/Statistical_data_type) diff --git a/one-hot-encoding-for-machine-learning-with-tensorflow-and-keras.md b/one-hot-encoding-for-machine-learning-with-tensorflow-and-keras.md new file mode 100644 index 0000000..a7196b3 --- /dev/null +++ b/one-hot-encoding-for-machine-learning-with-tensorflow-and-keras.md @@ -0,0 +1,305 @@ +--- +title: "One-Hot Encoding for Machine Learning with TensorFlow 2.0 and Keras" +date: "2020-11-24" +categories: + - "frameworks" + - "svms" +tags: + - "categorical-crossentropy" + - "data-preprocessing" + - "keras" + - "neural-network" + - "neural-networks" + - "one-hot-encoding" + - "sparse-categorical-crossentropy" + - "tensorflow" +--- + +When you are training a Supervised Machine Learning model, you are effectively feeding forward data through the model, comparing the predictions, and improving the model internals - iteratively. These are mathematical operations and hence data must be numeric if we want to train a Neural network using TensorFlow and Keras. In many cases, this is the case. For example, images can be expressed as numbers; more specifically, the color values for the pixels of the image. + +However, some datasets cannot be expressed as a number natively. For example, when you have features that represent group membership - for example, a feature called _football club_ and where the contents can be _FC Barcelona, Manchester United_ or _AC Milan_ - the data is not numeric. Does this mean that we cannot use those for building a predictive model? No. On the contrary. We will show you how we can still use these features in TensorFlow and Keras models by using a technique called **one-hot encoding**. This article specifically focuses on that. + +It is structured as follows. Firstly, we will take a look at one-hot encoding in more detail. What is it? How does it relate to _categorical crossentropy loss_, a type of loss that is used for training multiclass Neural Networks? Those are the questions that will provide the necessary context for applying one-hot encoding to a dataset. The latter is what we will show then, by giving you an example of applying one-hot encoding to a [Keras dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/), covering how to use `to_categorical` when training a Neural Network step by step. + +Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## What is One-Hot Encoding? + +Before we dive into any practical part, I always tend to find it important that we know about what we are building. Hence, I think that it's important that we take a look at the concept of **one-hot encoding** in more detail first, and why it must be applied. + +If you have read some other articles on MachineCurve (if not: [click](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/)), you know that optimizing a Neural Network involves three main steps: + +1. Feeding samples to the model, generating predictions. We call this the _forward pass_. +2. Comparing the predictions with the corresponding labels for the samples, also known as the _ground truth_. This results in a score illustrating how bad the model performs, also called the _loss_. +3. Improving the model by computing the individual contribution of model parameters to the loss and applying an optimizer to actually change the weights of the neural network. 
+ +![](images/feed-1024x404.jpg) + +We also know that step (1), feeding forward the samples through the model, involves a system of linear computations (\[latex\]\\textbf{w} \\times \\textbf{x} + b\[/latex\]) and mapping those to [nonlinear outputs](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/). Here, \[latex\]\\textbf{w}\[/latex\] represents the so-called _weights vector_, which captures (parts of) the patterns that have been learned by the Machine Learning model. \[latex\]\\textbf{x}\[/latex\] is also called the feature vector and represents a _row_ from the input dataset. Bias is expressed as \[latex\]b\[/latex\], and the activation function is often [Rectified Linear Unit](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) these days. + +![](images/layer-act-1024x227.png) + +Clearly, from this overview, we can see that the linear operation involves a multiplication of two vectors and the addition of a scalar value. This all suggests that both \[latex\]\\textbf{w}\[/latex\], \[latex\]\\textbf{x}\[/latex\] and \[latex\]b\[/latex\] must be numeric. And indeed: there is no such thing as a text-number vector multiplication that is used within Neural Networks, and hence _indeed_ all data must be numeric. + +There are many features that are numeric by nature: + +- Age +- Time offset +- Pixel value for the pixel of an image + +...and so on. + +### What to do when data isn't numeric + +But not all data is numeric. For example, if we have a feature called _Healthiness_, we can either express one as being 'Healthy' or as 'Unhealthy'. This is text based data and hence conversion must take place if we want to use it in our Machine Learning model. + +**One-hot encoding** is an approach that we can follow if we want to convert such non-numeric (but rather categorical) data into a usable format. + +> _In digital circuits and machine learning, a_ **one-hot** _is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0)._ +> +> Wikipedia (2005) + +In other words, we can express the categories into 'sets of bits' (recall that they can only take values between 0 and 1) so that for each set of bits, only one bit is true all the time, while all the others are zero. For example, for our Healthiness case, we can express the categories with two bits: + +- \[latex\]\\text{Healthy} \\rightarrow \[0 \\ 1\]\[/latex\] +- \[latex\]\\text{Unhealthy} \\rightarrow \[1 \\ 0\]\[/latex\] + +Really simple! + +If we want to express more categories, we can simply add more bits. E.g. if we wanted to add the 'Unknown category', we would simply increase the number of bits that represent the one-hot encoding: + +- \[latex\]\\text{Healthy} \\rightarrow \[0 \\ 0 \\ 1\]\[/latex\] +- \[latex\]\\text{Unhealthy} \\rightarrow \[0 \\ 1 \\ 0\]\[/latex\] +- \[latex\]\\text{Unknown} \\rightarrow \[1 \\ 0 \\ 0\]\[/latex\] + +### Training Neural Networks with Categorical Crossentropy Loss + +When we are training a Neural Network with TensorFlow, we always use `categorical_crossentropy_loss` when we are working with categorical data (and often, are trying to solve a [multiclass classification problem](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/)). 
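As a quick illustration - a minimal sketch that assumes TensorFlow 2.x and an arbitrary integer encoding of Unknown = 0, Unhealthy = 1 and Healthy = 2 - the `to_categorical` utility that we will use later in this article produces exactly the one-hot vectors listed above:

```
import numpy as np
from tensorflow.keras.utils import to_categorical

# Arbitrary integer encoding: Unknown = 0, Unhealthy = 1, Healthy = 2
labels = np.array([2, 1, 0])  # Healthy, Unhealthy, Unknown

print(to_categorical(labels, num_classes=3))
# [[0. 0. 1.]   <- Healthy
#  [0. 1. 0.]   <- Unhealthy
#  [1. 0. 0.]]  <- Unknown
```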
+ +As we can read on the page about [loss functions](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#categorical-crossentropy), **categorical crossentropy loss** uses the prediction from our model for the true target to compute _how bad the model performs_. As we can read on that page as well, we see that this loss function requires data to be categorical - and hence, one-hot encoded. + +### How One-Hot Encoding fits CCE Loss + +For this reason, it is desirable to work with _categorical_ (and hence one-hot encoded) target data when we are using categorical crossentropy loss. This requires that we convert the targets into this format prior to training the Neural Network. + +If we don't have one-hot encoded targets in the dataset, but integers instead to give just one example, it could be a good idea to use a different loss function. For example, [sparse categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/) works with categorical targets where the targets are expressed as integer values, to give just an example. If you have a binary classification problem, and hence work with a [Sigmoid activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) generating a prediction \[latex\] p \\in \[0, 1\]\[/latex\], you will want to use [binary crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) instead. + +One simple rule to remember: use categorical crossentropy loss when your Neural Network dataset has one-hot encoded target values! + +Let's now take a look at how this works with a real example. + +* * * + +## Using TensorFlow and Keras for One-Hot Encoding + +TensorFlow is a widely used Machine Learning library for creating Neural Networks. Having been around for a while, it is one of the primary elements of the toolkit of a Machine Learning engineer (besides libraries like [Scikit-learn](https://www.machinecurve.com/index.php/how-to-use-scikit-learn-for-machine-learning-with-python-mastering-scikit/) and PyTorch). I'm quite fond of the library and have been using it for some time now. One of the main benefits is that it makes the life of Machine Learning engineers much easier. + +> TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications. +> +> TensorFlow (n.d.) + +The quote above states that "developers \[can\] easily build (...) ML powered applications". This is primarily due to the deep integration of the Keras library with TensorFlow, into `tensorflow.keras`. In the beginning of the TensorFlow era, TF provided its own APIs for constructing neural networks - and they are still available in `tensorflow.nn`. However, the learning curve for constructing them was steep and you had to have a lot of expertise when you wanted to create one. That's why Keras was born, an abstraction layer on top of TensorFlow (and originally also Theano and CNTK) with which people could easily build their Neural Networks. + +The goal: speeding up iteration, as engineers should not have to worry about code, but rather about the principles - and hence the model structure - behind the code. 
+ +Today, TensorFlow and Keras are tightly coupled and deeply integrated, and the difference between the two is vastly disappearing. We will now use the Keras API within TensorFlow (i.e., `tensorflow.keras`) to construct a [Convolutional Neural Network](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) that is capable of classifying digits from [the MNIST dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/). Let's go! + +### Taking a look at the MNIST dataset + +The MNIST dataset? Although the odds are that you already know what this dataset is all about, there may be some readers who don't know about this dataset yet. As you can see below, it's a Computer Vision dataset - and it contains thousands of small grayscale images. More specifically, the images represent handwritten digits, and thus the numbers 0 to 9. + +It is one of the most widely used datasets in Machine Learning education because it is so easy to use (as [we shall see](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/), it is embedded into Keras as `tensorflow.keras.datasets.mnist`) and because the classifiers that are trained on it perform really well. For this reason, we will be using MNIST as well today. + +![](images/mnist-visualize.png) + +Loading data from the MNIST dataset is really easy. Let's open up a code editor, create a Python file and specify some imports - as well as a call to `load_data()`, with which we can load the MNIST dataset: + +``` +from tensorflow.keras.datasets import mnist +(X_train, y_train), (X_test, y_test) = mnist.load_data() + +print(X_train.shape) +print(y_train.shape) +print(X_test.shape) +print(y_test.shape) +``` + +If we run it, we see this text appear on screen after a while: + +``` +(60000, 28, 28) +(60000,) +(10000, 28, 28) +(10000,) +``` + +In other words, we can see that our [training set](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/) contains 60000 28x28 samples (as the shape of one input value seems to be \[latex\](28, 28)\[/latex\], we also see that our images are grayscale - if they were RGB, shape would have been \[latex\](28, 28, 3)\[/latex\] per sample and hence \[latex\](60000, 28, 28, 3)\[/latex\] for the whole array). Our testing set contains 10000 samples of the same format. + +### Inspecting a sample in more detail + +Let's now inspect one sample in more detail. + +``` +index = 128 +print(y_train[index]) +print(y_train[index].shape) +``` + +The output is as follows: + +``` +1 +() +``` + +We can see that the _actual_ \[latex\]y\[/latex\] value for index 128 is 1 - meaning that it represents the number 1. The shape is \[latex\]()\[/latex\] and hence we are _really_ talking about a scalar value. + +If we would create a Neural Network, the best choice for this dataset would be to apply [sparse categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/) - for the simple reason that we don't have to apply one-hot encoding if we use that loss function. Because we do want to show you how one-hot encoding works with TensorFlow and Keras, we do use [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/17/how-to-use-categorical-multiclass-hinge-with-keras/) instead, so we must apply one-hot encoding to the samples. 
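To make that choice concrete, here is a minimal sketch - the layer stack is only a placeholder, not the model we will build below - showing the two routes: keeping the integer targets and compiling with sparse categorical crossentropy, or one-hot encoding the targets first and compiling with categorical crossentropy, which is the route we take next:

```
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.utils import to_categorical

# Placeholder classifier: 10 output neurons, one per MNIST digit
def build_model():
    model = Sequential([
        Flatten(input_shape=(28, 28)),
        Dense(256, activation='relu'),
        Dense(10, activation='softmax')
    ])
    return model

# Route 1: integer targets, sparse categorical crossentropy
sparse_model = build_model()
sparse_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# sparse_model.fit(X_train, y_train, ...)

# Route 2: one-hot encoded targets, categorical crossentropy
onehot_model = build_model()
onehot_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# onehot_model.fit(X_train, to_categorical(y_train), ...)
```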
+ +### Applying One-Hot Encoding to the samples + +If we need to convert our dataset into categorical format (and hence one-hot encoded format), we can do so using Scikit-learn's `OneHotEncoder` [module](https://www.machinecurve.com/index.php/2020/11/24/one-hot-encoding-for-machine-learning-with-python-and-scikit-learn/). However, TensorFlow also offers its own implementation: `tensorflow.keras.utils.to_categorical`. It's a utility function which allows us to convert integer targets into categorical and hence one-hot encoded ones. + +And if the library that you are using for building your Neural Network offers a one-hot encoder out of the box, why use Scikit-learn's variant instead? There is nothing wrong with the latter, but there would be simply no point in doing so :) + +Now, let's add to the imports: + +``` +from tensorflow.keras.utils import to_categorical +``` + +And to the end of our code: + +``` +y_train = to_categorical(y_train) +y_test = to_categorical(y_test) + +print(y_train[index]) +print(y_train[index].shape) +``` + +The output for this part is now as follows: + +``` +[0. 1. 0. 0. 0. 0. 0. 0. 0. 0.] +(10,) +``` + +We can clearly see that our target vector has ten values (by means of the \[latex\](10,)\[/latex\] shape), one for each individual digit. The first is one while the others are zero, indicating that we are talking about the number 1, but then in one-hot encoded format. Exactly the same as our original integer value! + +### Creating a ConvNet that classifies the MNIST digits + +Let's now clean up our code a bit. Make sure that it looks as follows: + +- Import some modules that you need for the code. +- Load the MNIST dataset. +- Convert targets into one-hot encoded format. + +``` +# Imports +from tensorflow.keras.datasets import mnist +from tensorflow.keras.utils import to_categorical + +# Load dataset +(X_train, y_train), (X_test, y_test) = mnist.load_data() + +# Convert targets into one-hot encoded format +y_train = to_categorical(y_train) +y_test = to_categorical(y_test) +``` + +We can now continue and add more code for constructing the actual ConvNet. Read [here](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) if you wish to receive more instructions about doing this; we'll simply show the code next. 
+ +``` +# Imports +from tensorflow.keras.datasets import mnist +from tensorflow.keras.utils import to_categorical, normalize +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import categorical_crossentropy +from tensorflow.keras.optimizers import Adam +import numpy as np + +# Load dataset +(X_train, y_train), (X_test, y_test) = mnist.load_data() + +# Configuration options +no_classes = len(np.unique(y_train)) +img_width, img_height = 28, 28 +validation_split = 0.20 +no_epochs = 25 +verbosity = 1 +batch_size = 250 + +# Reshape data +X_train = X_train.reshape(X_train.shape[0], img_width, img_height, 1) +X_test = X_test.reshape(X_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Convert targets into one-hot encoded format +y_train = to_categorical(y_train) +y_test = to_categorical(y_test) + +# Normalize the data +X_train = normalize(X_train) +X_test = normalize(X_test) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=categorical_crossentropy, + optimizer=Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(X_train, y_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +When running the code, we can see that our model starts training successfully: + +``` +Epoch 1/25 +48000/48000 [==============================] - 12s 251us/sample - loss: 0.2089 - accuracy: 0.9361 - val_loss: 0.0858 - val_accuracy: 0.9738 +Epoch 2/25 +48000/48000 [==============================] - 36s 741us/sample - loss: 0.0555 - accuracy: 0.9828 - val_loss: 0.0607 - val_accuracy: 0.9821 +Epoch 3/25 +48000/48000 [==============================] - 45s 932us/sample - loss: 0.0295 - accuracy: 0.9905 - val_loss: 0.0605 - val_accuracy: 0.9807 +Epoch 4/25 +18500/48000 [==========>...................] - ETA: 25s - loss: 0.0150 - accuracy: 0.9957 +``` + +* * * + +## Summary + +Training Machine Learning models requires that your data is numeric. While this is true in many cases, some features represent groups of data - the categorical features. This is especially true for target values. In order to use them in your Machine Learning model, especially a Neural Network in the context of this article, you might want to one-hot encode your target data. This article looked at one-hot encoding in more detail. + +Firstly, we looked at what one-hot encoding involves. More specifically, we saw that it allows us to convert categorical data expressed in integer format (e.g. the groups 'Healthy' and 'Unhealthy' in sets of bits where for each set just one value equals one and all the others equal zeros). This allows us to uniquely express groups and text based data for usage in Machine Learning models. We also looked at the necessity for categorical (and hence one-hot encoded) data when using categorical crossentropy loss, which is common in today's Neural Networks. + +After finishing looking at theory, we moved forward to a practical example: showing how TensorFlow and Keras can be used for one-hot encoding a dataset. 
Specifically, using the TensorFlow `to_categorical` utility function, we saw how we can convert integer based targets for the MNIST dataset into one-hot encoded targets, after which categorical crossentropy loss is usable (as demonstrated by a neural network implemented towards the end). If you do however have such data, you might also wish to use sparse categorical crossentropy loss instead - there's no need to convert at all, but that was just done for the sake of this article. + +I hope that you have learned something from today's article! If you did, please feel free to leave a comment in the comments section below 💬 Please do the same when you have any questions or other remarks. Regardless, thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2005, June 16). _One-hot_. Wikipedia, the free encyclopedia. Retrieved November 24, 2020, from [https://en.wikipedia.org/wiki/One-hot](https://en.wikipedia.org/wiki/One-hot) + +TensorFlow. (n.d.). [https://www.tensorflow.org/](https://www.tensorflow.org/) diff --git a/overview-of-activation-functions-for-neural-networks.md b/overview-of-activation-functions-for-neural-networks.md new file mode 100644 index 0000000..367118b --- /dev/null +++ b/overview-of-activation-functions-for-neural-networks.md @@ -0,0 +1,170 @@ +--- +title: "Overview of activation functions for neural networks" +date: "2020-01-24" +categories: + - "deep-learning" +tags: + - "activation-function" + - "activation-functions" + - "deep-learning" + - "machine-learning" + - "neural-networks" +--- + +The neurons of neural networks perform operations that are linear: they multiple an _input vector_ with a _weights vector_ and add a bias - operations that are linear. + +By consequence, they are not capable of learning patterns in nonlinear data, except for the fact that _activation functions_ can be added. These functions, to which the output of a neuron is fed, map the linear data into a nonlinear range, and hence introduce the nonlinearity that the system as a whole needs for learning nonlinear data. Hence, it's not strange that activation functions are also called "nonlinearities", even though - strictly speaking - \[latex\]f(x) = x\[/latex\] can also be an activation function. + +In this blog post, we provide an overview of activation functions covered on MachineCurve. It allows you to quickly identify common activation functions and navigate to those which are interesting to you, in order to learn more about them in more detail. We cover traditional activation functions like Sigmoid, Tanh and ReLU, but also the newer ones like Swish (and related activation functions) as well as Leaky and Parametric ReLU (and related ones). + +Are you ready? Let's go! 😎 + +**Update June 2020:** added possible instability and computational intensity of Swish to provide a better balance between advantages and disadvantages. + +* * * + +\[toc\] + +* * * + +## Sigmoid + +One of the traditional activation functions is the Sigmoid activation function. I consider it one of the most widely known activation functions known and perhaps used today, except for ReLU. It converts a domain of \[latex\]x \\in \[ -\\infty  , \\infty\]\[/latex\] into the range \[latex\]y \\in \[ 0, 1 \]\[/latex\], with the greatest change present in the \[latex\]x \\in \[-4, +4\]\[/latex\] interval. + +Using Sigmoid possibly introduces two large bottlenecks into your machine learning project. 
Firstly, the outputs are not symmetrical around the origin; that is, for \[latex\]x = 0\[/latex\], \[latex\]y = 0.5\[/latex\]. This might slow down convergence to the optimum solution. + +Secondly, the derivative of Sigmoid has a maximum output of \[latex\]\\approx 0.25\[/latex\] for \[latex\]x = 0\[/latex\]. This means that chaining gradients, as is done during neural network optimization, produces very small gradients for upstream layers. Very large neural networks experience this problem as the _vanishing gradients problem_, and it may slow down learning or even make it impossible. + +Hence, for today's ML projects: it's perfectly fine to use Sigmoid, if you consider its limitations and know that possibly better activation functions are available. + +[![](images/sigmoid_and_deriv-1024x511.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/09/sigmoid_and_deriv.jpeg) + +**Read more:** [ReLU, Sigmoid and Tanh: today’s most used activation functions](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/#sigmoid) + +* * * + +## Tanh + +Another commonly used activation function known and used since many years is the _Tangens hyperbolicus_, or Tanh activation function. It takes values from the entire domain and maps them onto the range \[latex\]y \\in \[-1, +1\]\[/latex\]. + +Even though it _does_ provide symmetry around the origin, it's still sensitive to vanishing gradients. The next activation function was identified to counter this problem. + +[![](images/tanh_and_deriv-1024x511.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/09/tanh_and_deriv.jpeg) + +**Read more:** [ReLU, Sigmoid and Tanh: today’s most used activation functions](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/#tangens-hyperbolicus-tanh) + +* * * + +## Rectified Linear Unit (ReLU) + +Perhaps the most widely known and used activation function today: the Rectified Linear Unit, or ReLU activation function. It activates as either \[latex\]x\[/latex\] for all \[latex\]x > 0\[/latex\], and as zero for all other values in the domain. + +In terms of the derivative, this means that the gradient is either _zero_ or _one_. This is both good and bad. It's good because models are sparse (all inputs \[latex\]x < 0\[/latex\] are not taken into account) and because the vanishing gradients problem no longer occurs (for positive gradients, the gradient is always one). + +It's bad because we're now opening ourselves to an entirely new problem: the _dying ReLU problem_. It may sometimes be the case that the sparsity-inducing effect of the zero activations for all negative inputs results in too many neurons that produce zeroes yet cannot recover. In other words, they "die off". This also produces models which can no longer successfully learn. + +Nevertheless, ReLU is still the way to go in many cases these days. + +[![](images/relu_and_deriv-1024x511.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/09/relu_and_deriv.jpeg) + +**Read more:** [ReLU, Sigmoid and Tanh: today’s most used activation functions](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/#rectified-linear-unit-relu) + +* * * + +## Leaky ReLU + +Now onto some fixes for the dying ReLU problem. 
Leaky ReLU is the first: by means of a hyperparameter called \[latex\]\\alpha\[/latex\], the machine learning engineer can configure the outputs for the negative domain to be very small, but nonzero. This can be seen in the plot below. + +As a result, the gradient for the negative domain is no longer zero, and the neurons no longer die off. This comes at the cost of non-sparse models, and does not always work (especially because you use simple models, it doesn't really work better than traditional ReLU in my experience), but empirical tests have shown quite some success in larger cases. Worth a try! + +[![](images/leaky_relu.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/leaky_relu.png) + +**Read more:** [Using Leaky ReLU with Keras](https://www.machinecurve.com/index.php/2019/11/12/using-leaky-relu-with-keras/) + +* * * + +## Parametric ReLU (PReLU) + +Leaky ReLU works with some \[latex\]\\alpha\[/latex\] that must be configured by the machine learning engineer. Generalizing from here, Parametric ReLU (or PReLU) takes this job from the engineer and puts it in the training process. + +That is, it adds a few extra parameters to the neural network, which represent the alpha parameter (either one alpha per dimension of your data, or one alpha for all dimensions - this can be set by you). Optimization then determines the best alpha for your dataset and continuously adapts it based on training progress. + +![](images/loss.png) + +**Read more:** [How to use PReLU with Keras?](https://www.machinecurve.com/index.php/2019/12/05/how-to-use-prelu-with-keras/) + +* * * + +## ELU + +The authors of the Exponential Linear Unit (ELU) activation function recognize that Leaky ReLU and PReLU contribute to resolving the issues with activation functions to quite a good extent. However, they argued, their fixes introduced a new issue: the fact that there is no "noise-deactivation state" and that by consequence, the models are not robust to noise. + +What does this mean? Put very simply, the fact that the negative domain produces negative outputs means that for very large negative numbers, the outputs may still be considerable. This means that noise can still introduce disbalance into the model. + +For this reason, the authors propose ELU: an activation function that looks like ReLU, has nonzero outputs for the negative domain, yet (together with its gradient) saturates to some value (which can be configured with an \[latex\]\\alpha\[/latex\] parameter), so that the model is protected from the impact of noise. + +[![](images/elu_avf.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/elu_avf.png) + +**Read more:** [How to use ELU with Keras?](https://www.machinecurve.com/index.php/2019/12/09/how-to-use-elu-with-keras/) + +* * * + +## Softmax + +Now something entirely different: from activation functions that are used on hidden layers, we'll move to an output activation function as a small intermezzo. Let's take a look at the Softmax activation function. + +Softmax is quite widely used in classification, and especially when you're trying to solve a multiclass classification problem with [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/). 
Softmax works very nicely and quite intuitively: by interrelating all the values in some vector, and converting them into numbers that adhere to the principles of probability theory, Softmax essentially computes a discrete probability distribution over the values in your vector. When these values represent the outputs of a neural network based classifier, you effectively compute a probability distribution over the target classes for each sample. This allows you to select a "most probable class" and has contributed to e.g. neural network based object detectors.
+
+![](images/softmax_logits.png)
+
+**Read more:** [How does the Softmax activation function work?](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/)
+
+* * *
+
+## Swish
+
+Back to the ReLU-like activation functions. Another activation function which attempts to mimic ReLU is the Swish activation function, which was invented by a Google Brain team. It provides ReLU style activations for the positive domain, introduces smoothness around \[latex\]x \\approx 0\[/latex\], allows negative inputs close to the origin to result in negative outputs, and saturates to \[latex\]y \\approx 0\[/latex\] for large negative inputs. Quite understandably, Swish has produced good results in the authors' empirical tests. However, it is more computationally intensive than, say, ReLU, which may impact the resources you need for training (Deep Learning University, 2020). It can also be unstable, impacting the training process. Therefore, proceed with caution.
+
+[![](images/swish_deriv-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/swish_deriv.png)
+
+**Read more:** [Why Swish could perform better than ReLu](https://www.machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/)
+
+* * *
+
+## FTSwish
+
+Another Swish style activation function is called Flatten-T Swish. Effectively combining the ReLU and Sigmoid activation functions into one, it attempts to resolve many of the issues related to traditional activation functions:
+
+[![](images/ftswish-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/ftswish-1.png)
+
+**Read more:**
+
+- [What is the FTSwish activation function?](https://www.machinecurve.com/index.php/2020/01/03/what-is-the-ftswish-activation-function/)
+- [How to use FTSwish with Keras?](https://www.machinecurve.com/index.php/2020/01/06/how-to-use-ftswish-with-keras/)
+
+* * *
+
+## LiSHT
+
+Another activation function is LiSHT. It works differently compared to more traditional activation functions: negative inputs are converted into positive outputs. However, in terms of the derivative, this produces negative gradients for negative inputs, which eventually saturate to zero. This may be good for both model sparsity and training power. It might thus be worth a try!
+
+![](images/lisht_visualized-1024x511.png)
+
+**Read more:**
+
+- [Beyond Swish: the LiSHT activation function](https://www.machinecurve.com/index.php/2019/11/17/beyond-swish-the-lisht-activation-function/)
+- [How to use LiSHT activation function with Keras?](https://www.machinecurve.com/index.php/2019/11/17/how-to-use-lisht-activation-function-with-keras/)
+
+* * *
+
+## Summary
+
+In this blog post, you found an overview of commonly used activation functions and newer ones, which attempt to solve the problems related to these activation functions.
Most notably, such problems are the vanishing gradients problem and the dying ReLU problem. For each activation function, we provided references to additional blog articles which study the activation function in more detail. + +Please do note that in a fast-changing landscape like the ML one, this overview can never be complete. Therefore, if you know about a new activation function which must really be covered, please feel free to leave a comment in the comments section. I'll then try to add it as soon as possible. Please leave a comment too if you have any questions, or when you spot issues in this blog. + +Thanks for reading MachineCurve today and happy engineering! 😎 + +## References + +Deep Learning University. (2020, June 8). _Swish as an activation function in neural network_. [https://deeplearninguniversity.com/swish-as-an-activation-function-in-neural-network/](https://deeplearninguniversity.com/swish-as-an-activation-function-in-neural-network/) diff --git a/performing-dbscan-clustering-with-python-and-scikit-learn.md b/performing-dbscan-clustering-with-python-and-scikit-learn.md new file mode 100644 index 0000000..c706881 --- /dev/null +++ b/performing-dbscan-clustering-with-python-and-scikit-learn.md @@ -0,0 +1,391 @@ +--- +title: "DBSCAN clustering tutorial: example with Scikit-learn" +date: "2020-12-09" +categories: + - "frameworks" + - "svms" +tags: + - "clustering" + - "dbscan" + - "machine-learning" + - "python" + - "scikit-learn" + - "unsupervised-learning" +--- + +There are many algorithms for clustering available today. DBSCAN, or **density-based spatial clustering of applications with noise**, is one of these clustering algorithms. It can be used for clustering data points based on _density_, i.e., by grouping together areas with many samples. This makes it especially useful for performing clustering under noisy conditions: as we shall see, besides clustering, DBSCAN is also capable of detecting noisy points, which can - if desired - be discarded from the dataset. + +In this article, we will be looking at DBScan in more detail. Firstly, we'll take a look at an example use case for clustering, by generating two blobs of data where some nosiy samples are present. Then, we'll introduce DBSCAN based clustering, both its concepts (core points, directly reachable points, reachable points and outliers/noise) and its algorithm (by means of a step-wise explanation). Subsequently, we're going to implement a DBSCAN-based clustering algorithm with Python and Scikit-learn. This allows us to both _understand_ the algorithm and _apply_ it. + +In this tutorial, you will learn... + +- **The concepts behind DBSCAN.** +- **How the DBSCAN algorithm works.** +- **How you can implement the DBSCAN algorithm yourself, with Scikit-learn.** + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +Let's take a look! 😎 + +**Update 11/Jan/2021:** added quick-start code example. + +* * * + +\[toc\] + +* * * + +## Code example: how to perform DBSCAN clustering with Scikit-learn? + +With this quick example you can get started with DBSCAN in Python immediately. If you want to understand how the algorithm works in more detail, or see step-by-step examples for coding the clustering method, make sure to read the full article below! 
+ +``` +from sklearn.datasets import make_blobs +from sklearn.cluster import DBSCAN +import numpy as np + +# Configuration options +num_samples_total = 1000 +cluster_centers = [(3,3), (7,7)] +num_classes = len(cluster_centers) +epsilon = 1.0 +min_samples = 13 + +# Generate data +X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.5) + +# Compute DBSCAN +db = DBSCAN(eps=epsilon, min_samples=min_samples).fit(X) +labels = db.labels_ + +no_clusters = len(np.unique(labels) ) +no_noise = np.sum(np.array(labels) == -1, axis=0) + +print('Estimated no. of clusters: %d' % no_clusters) +print('Estimated no. of noise points: %d' % no_noise) +``` + +* * * + +## What is clustering? + +DBSCAN is a clustering algorithm and is part of the class of Unsupervised Learning algorithms. But what is clustering? Let's first take a look at a definition: + +> **Cluster analysis** or **clustering** is the task of grouping a set of objects in such a way that objects in the same group (called a **cluster**) are more similar (in some sense) to each other than to those in other groups (clusters). +> +> Wikipedia (2004) + +Aha! + +It allows us to select groups from datasets based on shared characteristics for samples within a particular group. + +That's interesting, because - to give just one example - we can use clustering to generate a _labeled dataset_ (e.g. to select classes if we don't have them) for creating a predictive model. What's more, as we shall see in this article, clustering can also be used for detecting noisy samples, which can possibly be removed prior to training a Supervised Learning model. + +Another vast array of examples is available [here](https://en.wikipedia.org/wiki/Cluster_analysis#Applications). + +* * * + +## Introducing DBSCAN clustering + +DBSCAN is an algorithm for performing cluster analysis on your dataset. + +Before we start any work on implementing DBSCAN with Scikit-learn, let's zoom in on the algorithm first. As we read above, it stands for **density-based spatial clustering of applications with noise**, which is quite a complex name for a relatively simple algorithm. But we can break it apart so that we can intuitively grasp what it does. _Density-based_ means that it will zoom into areas that have great _density_, or in other words a large amount of samples closely together. Since clusters are dense, this focus on density is good. + +_Spatial_ clustering means that it performs clustering by performing actions in the feature space. In other words, whereas some clustering techniques [work by sending messages between points](https://www.machinecurve.com/index.php/2020/04/18/how-to-perform-affinity-propagation-with-python-in-scikit/), DBSCAN performs distance measures in the space to identify which samples belong to each other. _Clustering_ speaks for itself, and _applications with noise_ means that the technique can be used with noisy datasets. We shall see why this is the case next, because we will now look at the fundamental concepts of DBScan: core points, directly reachable points, reachable points and outliers. + +### The concepts of DBScan + +Before we start looking at these concepts, we must generate an imaginary dataset first. Here it is. Suppose that we are dealing with a two-dimensional feature space where our samples can be expressed as points (i.e. as \[latex\](X\_1, X\_2)\[/latex\] coordinates). 
It could then look like this:
+
+![](images/samples.png)
+
+When performing DBSCAN, two parameters must be provided before the algorithm is run. The first is the **epsilon value**, or \[latex\]\\epsilon\[/latex\]. This value indicates some distance around a point, which can be visualized as a circle with a radius of \[latex\]\\epsilon\[/latex\] around the point. Note that each point has the same epsilon, but that we draw the circle for just one point below.
+
+The second is the **minimum number of samples**. This number indicates the minimum number of samples (including the point itself) that should be within the epsilon range (i.e., the circle) for a point to be considered a _core point_. We will now look at what core points are.
+
+![](images/samples-1.png)
+
+#### Core Points
+
+Suppose that we have some epsilon \[latex\]\\epsilon\[/latex\] and set the minimum number of points to 3. We will now look at two points of the dataset. On the left, we look at the above point, while on the right, we look at one of the middle points.
+
+> A point _p_ is a _core point_ if at least minPts points are within distance _ε_ of it (including _p_).
+>
+> Wikipedia (2007)
+
+In other words, in our example, a point is a core point if at least 3 points, including itself, are within the circle. As becomes clear, both points that we are looking at are so-called core points.
+
+The great thing about core points is that they are likely part of a cluster, because they are in the vicinity of other points. That's why they are so important in the DBSCAN algorithm.
+
+![](images/corepoints.png)
+
+If the dataset were larger (e.g. because we zoomed into a particular area), and another point were inspected, we could arrive at the conclusion that it is not a core point. The example below illustrates why: there are only two points, including itself, in the \[latex\]\\epsilon\[/latex\] based vicinity of the point. Since \[latex\]\\text{minPts} = 3\[/latex\] and \[latex\]2 < 3\[/latex\], this is not a core point.
+
+![](images/corepoints-1.png)
+
+#### Directly Reachable Points
+
+If a point is not a core point, we must check whether it is **directly reachable**.
+
+> A point _q_ is _directly reachable_ from _p_ if point _q_ is within distance _ε_ from core point _p_. Points are only said to be directly reachable from core points.
+>
+> Wikipedia (2007)
+
+In the example above, we saw that the extra point we were looking at is not a core point. But is it directly reachable?
+
+It seems to be the case:
+
+- The closest point to the point we were looking at is a core point, since its \[latex\]\\epsilon\[/latex\] circle contains 4 points, which exceeds the minimum of 3.
+- The point itself lies within the \[latex\]\\epsilon\[/latex\] circle for the closest core point.
+
+This means that it is directly reachable.
+
+![](images/corepoints-3-1024x504.png)
+
+#### Reachable Points
+
+Another concept in DBSCAN is the one of **reachable points:**
+
+> A point _q_ is _reachable_ from _p_ if there is a path _p_1, ..., _pn_ with _p_1 = _p_ and _pn_ = _q_, where each _p__i_+1 is directly reachable from _pi_. Note that this implies that the initial point and all points on the path must be core points, with the possible exception of _q_.
+>
+> Wikipedia (2007)
+
+Points are reachable from some point if we can draw a path to it, through points directly reachable from the points on the path (i.e. core points on the path), to the specific point.
In our example, B is reachable from A, and we display just one of the paths through which B can be reached.
+
+![](images/corepoints-4.png)
+
+#### Outliers
+
+If a point is not reachable from any other point, it is called an outlier:
+
+> All points not reachable from any other point are _outliers_ or _noise points_.
+>
+> Wikipedia (2007)
+
+In other words, if we cannot draw a path from a core point to another point (i.e. if it's not directly reachable nor reachable from the particular point), it's considered an outlier. This is what makes DBSCAN so good for clustering with outlier detection: it can signal outliers natively.
+
+![](images/corepoints-5.png)
+
+### How everything fits together: DBScan in pseudocode
+
+Now that we know about all the DBSCAN concepts, i.e. the _what_, we can dive into the _how_. In other words, it's time to look at how DBSCAN works. Funnily, despite the complex name, the algorithm is really simple (Wikipedia, 2007):
+
+1. We set values for \[latex\]\\epsilon\[/latex\] and \[latex\]\\text{minPts}\[/latex\].
+2. We randomly select a point from the samples that has not been checked before.
+3. We retrieve the \[latex\]\\epsilon-\\text{neighborhood}\[/latex\] for this point. If its size equals or exceeds \[latex\]\\text{minPts}\[/latex\], we signal it as the start of a cluster. Otherwise, we label the point as noise.
+4. We signal the \[latex\]\\epsilon-\\text{neighborhood}\[/latex\] as being part of the cluster. This means that for each point of that neighborhood, its own \[latex\]\\epsilon-\\text{neighborhood}\[/latex\] is added to the cluster as well, and so on, and so on. We continue until no further point can be added to the cluster. Note that a point originally labeled as noise can now also become part of this cluster (it may be part of the \[latex\]\\epsilon-\\text{neighborhood}\[/latex\] of one of the other points), or of another cluster later, because:
+5. We now start at (2) again, until all points have been checked and labeled.
+
+By expanding clusters one by one, we can slowly but surely build each cluster in full, and do not end up with many separate cluster indications that are actually part of the same cluster. Of course, this is something that we can control by setting \[latex\]\\epsilon\[/latex\] and \[latex\]\\text{minPts}\[/latex\], and it depends on the dataset (requiring your own exploratory data analysis first). In addition, labeling points as noise means that after clustering has finished, we can simply show and count the points that remain labeled as noise, and possibly remove them from our dataset.
+
+If for some \[latex\]\\epsilon\[/latex\] the value for \[latex\]\\text{minPts} = 4\[/latex\], this would be the outcome: many core points, some points that are not core points but directly reachable from core points and hence part of the cluster, and some points that are not reachable and hence outliers. In other words, we have one cluster here, including the green and red points, where the two blue ones are outliers.
+
+![](images/samples-2-1.png)
+
+* * *
+
+## Performing DBSCAN-based clustering with Scikit-learn
+
+All right, you should now have a fair understanding of how the DBSCAN algorithm works and hence how it can be used for clustering. Let's convert our knowledge into code by writing a script that is capable of performing clustering on some data. We'll be using Scikit-learn for this purpose, since it makes available `DBSCAN` within its `sklearn.cluster` API, and because Python is the de facto standard language for ML engineering today.
+
+Let's open a code editor and create a file named e.g. `dbscan.py`.
+
+### Adding the imports
+
+The first thing that we do is add the imports:
+
+- We'll import `make_blobs` from `sklearn.datasets` for generating the blob-based dataset in the next section.
+- From `sklearn.cluster` we import `DBSCAN`, which allows us to perform the clustering.
+- NumPy (as `np`) will be used for number processing.
+- Finally, we'll use the Matplotlib PyPlot API (`plt`) for visualizing the generated dataset after clustering.
+
+```
+from sklearn.datasets import make_blobs
+from sklearn.cluster import DBSCAN
+import numpy as np
+import matplotlib.pyplot as plt
+```
+
+### Generating a dataset
+
+For generating the dataset, we'll do two things: specifying some configuration options and using them when calling `make_blobs`. Note that we also specify `epsilon` and `min_samples`, which will later be used for the clustering operation.
+
+```
+# Configuration options
+num_samples_total = 1000
+cluster_centers = [(3,3), (7,7)]
+num_classes = len(cluster_centers)
+epsilon = 1.0
+min_samples = 13
+
+# Generate data
+X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.5)
+```
+
+The clusters look as follows (in your case, they will look slightly different since they are generated randomly).
+
+![](images/twoclusters.png)
+
+For replicability, it can be wise to save the data once, after the first run - if you then comment out the `np.save(...)` line and keep the `np.load(...)` line, you'll always load the same data from the `clusters.npy` file. This is however not required.
+
+```
+np.save('./clusters.npy', X)
+X = np.load('./clusters.npy')
+```
+
+### Initializing DBScan and computing the clusters
+
+We can now initialize DBScan and compute the clusters.
+
+- We initialize `DBSCAN` with our values for `epsilon` and `min_samples`.
+- We then immediately fit the data to DBSCAN, meaning that clustering will start.
+- We load the generated labels (i.e. cluster indices) into `labels` after clustering has finished.
+
+```
+# Compute DBSCAN
+db = DBSCAN(eps=epsilon, min_samples=min_samples).fit(X)
+labels = db.labels_
+```
+
+In our case, printing the number of clusters and number of noisy samples yields 2 clusters with 0 noisy samples due to our selection of \[latex\]\\epsilon = 1.0; \\text{minPts} = 13\[/latex\]. In your case, the exact numbers may differ slightly since the data is generated randomly. Shuffling around with epsilon values (i.e. making the circle bigger) or the minimum number of samples (depending on the density of your clusters) will yield different results!
+
+```
+no_clusters = len(np.unique(labels))
+no_noise = np.sum(np.array(labels) == -1, axis=0)
+
+print('Estimated no. of clusters: %d' % no_clusters)
+print('Estimated no. of noise points: %d' % no_noise)
+```
+
+(Outcome:)
+
+```
+Estimated no. of clusters: 2
+Estimated no. of noise points: 0
+```
+
+### Plotting the clustered data
+
+Finally, we can generate a scatter plot for our training data. Since we have two clusters, we use a simple lambda function that selects either one color or the other. If you have multiple clusters, you can easily generalize this lambda function [with a dictionary approach](https://www.machinecurve.com/index.php/question/how-to-give-multiple-colors-when-plotting-clusters/).
+ +``` +# Generate scatter plot for training data +colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426', labels)) +plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True) +plt.title('Two clusters with data') +plt.xlabel('Axis X[0]') +plt.ylabel('Axis X[1]') +plt.show() +``` + +The end result is indeed two clusters, as intended: + +![](images/twoclustersclustered.png) + +### Full model code + +Here's the full code for those who aim to use it straight away: + +``` +from sklearn.datasets import make_blobs +from sklearn.cluster import DBSCAN +import numpy as np +import matplotlib.pyplot as plt + +# Configuration options +num_samples_total = 1000 +cluster_centers = [(3,3), (7,7)] +num_classes = len(cluster_centers) +epsilon = 1.0 +min_samples = 13 + +# Generate data +X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.5) + +np.save('./clusters.npy', X) +X = np.load('./clusters.npy') + +# Compute DBSCAN +db = DBSCAN(eps=epsilon, min_samples=min_samples).fit(X) +labels = db.labels_ + +no_clusters = len(np.unique(labels) ) +no_noise = np.sum(np.array(labels) == -1, axis=0) + +print('Estimated no. of clusters: %d' % no_clusters) +print('Estimated no. of noise points: %d' % no_noise) + +# Generate scatter plot for training data +colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426', labels)) +plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True) +plt.title('Two clusters with data') +plt.xlabel('Axis X[0]') +plt.ylabel('Axis X[1]') +plt.show() +``` + +* * * + +## Removing noise from the dataset after clustering + +If we adapt the value for \[latex\]\\epsilon\[/latex\] and set it to 0.3, we get different results: + +``` +Estimated no. of clusters: 3 +Estimated no. of noise points: 50 +``` + +In particular, the algorithm is now capable of detecting noisy samples, as we can see in the image below. However, removing the noisy samples after performing DBSCAN is easy - and requires just four lines of extra code. This is the because because DBSCAN sets the labels for noisy samples to `-1`; this is its way of "signaling a label as noisy". + +Adding the lines before generating the scatter plot shows that samples that are labeled as noise are removed from the dataset. + +For this reason, we can also use DBSCAN as a noise removal algorithm, e.g. before applying [SVM based classification](https://www.machinecurve.com/index.php/2020/05/03/creating-a-simple-binary-svm-classifier-with-python-and-scikit-learn/), to find better decision boundaries. + +``` +# Remove the noise +range_max = len(X) +X = np.array([X[i] for i in range(0, range_max) if labels[i] != -1]) +labels = np.array([labels[i] for i in range(0, range_max) if labels[i] != -1]) + +# Generate scatter plot for training data +colors = list(map(lambda x: '#000000' if x == -1 else '#b40426', labels)) +plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True) +plt.title(f'Noise removed') +plt.xlabel('Axis X[0]') +plt.ylabel('Axis X[1]') +plt.show() +``` + +- [![](images/noisy.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/noisy.png) + +- [![](images/noiseremoved.png)](https://www.machinecurve.com/wp-content/uploads/2020/12/noiseremoved.png) + + +* * * + +## Summary + +In this article, we looked at DBSCAN based clustering in multiple ways. Firstly, we looked at cluster analysis or clustering in general - what is it? What is it used for? 
As we could see in this article, there are some interesting areas where such techniques can be employed.
+
+We then introduced DBSCAN, which stands for density-based spatial clustering of applications with noise, and is a widely used clustering algorithm. We looked at the algorithm and the conceptual building blocks first. We saw that core points are named so if at least \[latex\]\\text{minPts}\[/latex\] points are located within \[latex\]\\epsilon\[/latex\] distance of the point. All the points within this circle are directly reachable. If we can construct a path from a point to another, non-directly reachable point, through other core points, the point is finally said to be reachable. All points that are not reachable are considered to be outliers, or noise.
+
+The algorithm itself is then really simple. Starting from one point, it attempts to build a cluster by grouping its \[latex\]\\epsilon-\\text{neighborhoods}\[/latex\], i.e. directly reachable points for the point. If no such point is available, it is labeled as noise. If some are available, for these points, their directly reachable points are added, and so on, until the cluster cannot be expanded any further. Then, it selects another non-visited point and performs the same steps, until all points have been visited. We then know the clusters and the noisy points.
+
+Knowing about the building blocks and how the algorithm works conceptually, we then moved on and provided a Python implementation for DBSCAN using Scikit-learn. We saw that with only a few lines of Python code, we were able to generate a dataset, apply DBSCAN clustering to it, visualize the clusters, and even remove the noisy points. The latter makes our dataset cleaner without losing much of the core information available in the clusters.
+
+[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/)
+
+I hope that you have learned something from today's article. If you have any questions, you can leave a comment in the comments section below 💬. You can also click the green button to the right 🟢, where you can ask your questions in our **Ask Questions** forum. Please feel free to leave a comment as well if you have other remarks or suggestions for improvement. I'd love to hear from you.
+
+Thank you for reading MachineCurve today and happy engineering! 😎
+
+* * *
+
+## References
+
+Wikipedia. (2007, October 16). _Dbscan_. Wikipedia, the free encyclopedia. Retrieved December 8, 2020, from [https://en.wikipedia.org/wiki/DBSCAN](https://en.wikipedia.org/wiki/DBSCAN)
+
+Scikit-learn. (n.d.). _Sklearn.cluster.DBSCAN — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved December 9, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN)
+
+Wikipedia. (2004, May 21). _Cluster analysis_. Wikipedia, the free encyclopedia.
Retrieved December 9, 2020, from [https://en.wikipedia.org/wiki/Cluster\_analysis](https://en.wikipedia.org/wiki/Cluster_analysis) diff --git a/performing-linear-regression-with-python-and-scikit-learn.md b/performing-linear-regression-with-python-and-scikit-learn.md new file mode 100644 index 0000000..c89a552 --- /dev/null +++ b/performing-linear-regression-with-python-and-scikit-learn.md @@ -0,0 +1,233 @@ +--- +title: "Performing Linear Regression with Python and Scikit-learn" +date: "2020-12-10" +categories: + - "frameworks" + - "svms" +tags: + - "fit" + - "least-squares" + - "linear-regression" + - "machine-learning" + - "ordinary-least-squares" + - "python" + - "regression" + - "scikit-learn" +--- + +Sometimes, life is easy. There are times when you are building a Machine Learning model for regression and you find your data to be linear. In other words, a regression model can be fit by means of a straight line. While these cases are relatively rare, **linear regression** is still a useful tool for in your Machine Learning toolkit. + +What is Linear Regression? And how does it work? That's what we will investigate in today's Machine Learning article. It is structured as follows. First of all, we will be introducing Linear Regression conceptually, specifically Ordinary Least Squares based Linear Regression. We'll look at what regression is in the first place, and then introduce the linear variant - explaining the maths behind it in an intuitive way, so that it'll be entirely clear what is going on. We also cover how Linear Regression is performed, i.e., how after regressing a fit the model is improved, yielding better fits. + +Subsequently, we'll move from theory into practice, and implement Linear Regression with Python by means of the Scikit-learn library. We will generate a dataset where a linear fit can be made, apply Scikit's `LinearRegression` for performing the Ordinary Least Squares fit, and show you with step-by-step examples how you can implement this yourself. + +Let's take a look :) + +* * * + +\[toc\] + +* * * + +## Introducing Linear Regression + +In this section, we will be looking at how Linear Regression is performed by means of an Ordinary Least Squares fit. For doing so, we will first take a look at regression in general - what is it, and how is it useful? Then, we'll move forward to Linear Regression, followed by looking at the different types for performing regression analysis linearly. Finally, we zoom in on the specific variant that we will be using in this article - Oridnary Least Squares based linear regression - and will explore how it works. + +Of course, since we're dealing with a method for Machine Learning, we cannot fully move away from maths. However, I'm not a big fan of writing down a lot of equations without explaining them. For this reason, we'll explain the math in terms of _intuitions_, so that even though when you cannot fully read the equations, you will understand what is going on. + +### What is Regression? + +Most generally, we can define regression as follows: + +> Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables (often called 'predictors', 'covariates', or 'features'). + +In other words, suppose that we have the following dataset: + +| **No. Projects completed** | **No. Successful project** | **No. 
Positive reviews** | **Salary increase (%/100)** |
+| --- | --- | --- | --- |
+| 2 | 2 | 1 | 0.05 |
+| 4 | 1 | 2 | 0.00 |
+| 1 | 0 | 1 | 0.00 |
+| 2 | 2 | 5 | 0.12 |
+| 3 | 3 | 2 | 0.10 |
+| 4 | 2 | 1 | 0.05 |
+| … | … | … | … |
+
+And suppose that our goal is to build a predictive model where we explore whether any or a combination of the variables \[latex\]\\text{projects\_completed}\[/latex\], \[latex\]\\text{successful\_projects}\[/latex\] or \[latex\]\\text{positive\_reviews}\[/latex\] can predict the annual salary increase, i.e. \[latex\]\\text{salary\_increase}\[/latex\].
+
+In other words, we explore whether:
+
+\[latex\]\\text{\\{projects\_completed, successful\_projects, positive\_reviews\\}} \\rightarrow \\text{salary\_increase}\[/latex\]
+
+Here, \[latex\]\\text{salary\_increase}\[/latex\] is a _continuous variable_, meaning that it can take any 'real value', i.e. any positive and negative number with decimals. Salary increases can be 0.00, even negative (if our salary would decrease, e.g. -0.05), or really positive if we performed well (0.12 or 12%, to give just one example).
+
+[Contrary to classification](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/), where we attempt to assign some inputs to one of multiple categories (and where hence the output is a _discrete_ variable), this is a regression problem. Generating a predictive model here thus means that we attempt to capture patterns which allow us to make a mapping between input values and a real-valued outcome. In other words, we attempt to estimate the salary increase based on the input variables.
+
+Here, the salary increase is the dependent variable, whereas the three others are the independent ones.
+
+### What is Linear Regression?
+
+When we perform the regression in a linear way, i.e. by fitting a straight line through the data, we call our approach a **Linear Regression** problem.
+
+In the example below, you can see what is meant by Linear Regression. You can see a dataset with points in a two-dimensional space, e.g. with variables \[latex\]x\[/latex\] and \[latex\]y\[/latex\]. This regression problem is called a _**Simple**_ **Linear Regression** problem, because there is "one explanatory variable" (i.e., \[latex\]x\[/latex\]; Wikipedia, 2005).
+
+In that case, the regression problem can be written as \[latex\]y = \\alpha + \\beta x\[/latex\]. The slope of the line is represented by \[latex\]\\beta\[/latex\], whereas \[latex\]\\alpha\[/latex\] is the y-intercept (i.e. the value for \[latex\]y\[/latex\] where the line crosses the y-axis). In the image below, the y intercept is 5. If you've had some maths in high school, you likely recognize the function \[latex\] y = ax + b\[/latex\] here. It's exactly the same.
+
+![](images/1920px-Linear_regression.svg_-1024x677.png)
+
+However, not every Linear Regression problem is a _simple_ one. When there is more than one explanatory variable, we call the regression problem one of _multiple_ variables, and hence **Multiple Linear Regression**, also known as multivariable linear regression. In that case, we can write the formula as follows (Wikipedia, 2001):
+
+\[latex\]y\_i = \\beta\_0 + \\beta\_1x\_{i1} + … + \\beta\_px\_{ip} + \\epsilon\_i\[/latex\]
+
+In other words, the outcome is a combination of the input values from the input vector \[latex\]\\textbf{x}\[/latex\] multiplied by the corresponding weights, which have been learned during the fit. Generating the _outcome_ of the function, once fit, is therefore really simple, as the small sketch below illustrates.
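+
+For example, here is a minimal sketch with made-up coefficients - the \[latex\]\\beta\[/latex\] values below are illustrative assumptions, not values actually fitted on the table above:
+
+```
+import numpy as np
+
+# Assumed, made-up coefficients: an intercept and one weight per input variable
+beta_0 = 0.01                          # intercept
+beta = np.array([0.005, 0.02, 0.015])  # weights for the three input variables
+
+# One input vector: projects completed, successful projects, positive reviews
+x = np.array([3, 2, 4])
+
+# Predicted outcome: intercept plus the weighted sum of the inputs
+y_hat = beta_0 + np.dot(beta, x)
+print(round(y_hat, 3))  # 0.125
+```
+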
But let's now take a better look at how the fit is made, because that is the core of the Linear Regression type that we will be using today.
+
+### Linear Regression Types
+
+Indeed, we speak of the _type_ of Linear Regression problem, because there are multiple ways to solve such a problem. The _solving_ here involves estimating the values for \[latex\]\\beta\_i\[/latex\], where \[latex\]i \\in \\{0, 1, ..., p\\}\[/latex\]. These are common methods for solving a linear regression problem:
+
+- **Least-squares estimation:** in this class of methods, the goal is to minimize the sum of squared errors. There are three primary techniques that are in use here: Ordinary Least Squares (OLS), Weighted Least Squares (WLS) and Generalized Least Squares (GLS). We will be using OLS in this article.
+- **Maximum-likelihood estimation:** we can also use a probability based way of estimating the coefficients, should the distribution of the error terms be known.
+- **Other techniques**, such as Bayesian linear regression, Quantile regression, Mixed models, Principal component regression, and so on. These are used less often.
+
+Above, you read that we will be using Ordinary Least Squares regression. Let's now take a look at how it works in more detail.
+
+### How is Ordinary Least Squares Linear Regression performed?
+
+With Ordinary Least Squares regression, the goal is to minimize the sum of squared errors by means of some hyperplane. Recall the concept of a hyperplane from [Support Vector Machines](https://www.machinecurve.com/index.php/2020/11/25/using-radial-basis-functions-for-svms-with-python-and-scikit-learn/): if our feature space has \[latex\]N\[/latex\] dimensions, a hyperplane is \[latex\]N-1\[/latex\]-dimensional. In other words, in the image above, which shows a two-dimensional feature space, our hyperplane is the line.
+
+Indeed, regression always attempts to generate a hyperplane which allows us to produce real-valued output for the input vector that we provide.
+
+Suppose that we generate some samples:
+
+```
+from sklearn.datasets import make_blobs
+import numpy as np
+import matplotlib.pyplot as plt
+
+# Configuration options
+num_samples_total = 150
+cluster_centers = [(3,3), (3.3, 3.3), (3.6, 3.6), (4, 4)]
+num_features = 2
+
+# Generate data
+X, _ = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_features, center_box=(0, 1), cluster_std = 0.10)
+
+# Generate scatter plot for training data
+plt.scatter(X[:,0], X[:,1], marker="o", picker=True)
+plt.title(f'Samples')
+plt.xlabel('x')
+plt.ylabel('y')
+plt.show()
+```
+
+![](images/samples-3.png)
+
+As the data seems to be somewhat linear, we can draw a line through it, which represents a fit to the data. This fit was generated with NumPy's `polyfit` function, with a first-degree polynomial fit (i.e. a linear fit):
+
+![](images/fit.png)
+
+If we select one point (randomly), draw a vertical line to the hyperplane and measure its distance, we have measured the **residual** for that point. The residual, here, is the difference between the _observed_ value and the _estimated_ value. In other words, it tells us something about how well the model has performed when generating the prediction for that point. The larger the residual, the worse the model performs.
+
+As you can see, the (absolute value for the) residual here is relatively large.
+
+![](images/residual.png)
+
+Residuals are calculated as \[latex\]y\_i - \\hat{y\_i}\[/latex\], where \[latex\]y\_i\[/latex\] is the observed value (the value from the dataset) and \[latex\]\\hat{y\_i}\[/latex\] is the prediction. As you can see, if the line lies above the observed/dataset value, \[latex\]y\_i < \\hat{y\_i}\[/latex\], and \[latex\]y\_i > \\hat{y\_i}\[/latex\] otherwise.
+
+Now, a naïve approach for computing how good the fit is would be to simply sum all residuals: \[latex\]\\sum\_{i=0}^{p} y\_i - \\hat{y\_i}\[/latex\]. But is this a good approach?
+
+No.
+
+It is quite problematic, to say the least. As you can see, the line is fit somewhere in the middle of the data. Approximately 50% of the samples lie above the fit while the other half lies below the fit. If we would just sum all the residuals, positive and negative residuals would cancel each other out, and we would expect the outcome of the sum to be somewhere close to zero - as if the model were not off for many of the samples, which it clearly is. That doesn't work.
+
+Fortunately, some smart people have thought about a relatively easy fix: what if, instead of taking the residual value for each point, we would take the residual value squared? In other words, what if we would take \[latex\](y\_i - \\hat{y\_i})^2\[/latex\] and hence compute \[latex\]\\sum\_{i=0}^{p} (y\_i - \\hat{y\_i})^2\[/latex\], which is known as the **sum of squared residuals**, **error sum of squares** or **residual sum of squares**?
+
+Our problem is solved. And so is the regression problem, because if we minimize this sum and select the argument, i.e. \[latex\]\\text{argmin} \\sum\_{i=0}^{p} (y\_i - \\hat{y\_i})^2\[/latex\], we'll find the set of weights / coefficients / values \[latex\]\\beta\[/latex\] with which we can compute the output value. Since the function has a global minimum, there is a unique set of values with which the sum is minimized (Wikipedia, 2001).
+
+We will now take a look at how we can implement OLS based Linear Regression with Python.
+
+* * *
+
+## Implementing OLS Linear Regression with Python and Scikit-learn
+
+Let's now take a look at how we can generate a fit using **Ordinary Least Squares** based Linear Regression with Python. We will be using the Scikit-learn Machine Learning library, which provides a `LinearRegression` implementation of the OLS regressor in the `sklearn.linear_model` API.
+
+Here's the code. Ensure that you have Scikit-learn installed on your machine (`pip install scikit-learn`), as well as `numpy` and `matplotlib`. We'll walk through it step-by-step first, before showing the full code.
+
+- As the first step, we define the imports for our model. From `sklearn.datasets` we import `make_blobs`, which allows us to create the blobs of data which jointly compose the linear-like dataset. We import `LinearRegression` for generating the OLS fit. Finally, we import the Matplotlib PyPlot API as `plt` for visualizing the fit towards the end.
+- Then, we specify some configuration options. We'll generate 150 samples randomly, at four centers (which overlap due to a low cluster standard deviation). Each sample has two features.
+- We then generate the blobs and construct the true `X` and `y` variables based on the data in `X`. Since Scikit-learn requires us to reshape the data when working with 1D arrays, we'll use `.reshape(-1, 1)` here, which makes it a 2D array again (with a second dimension of size one).
+- Then, we perform the OLS based fit: we instantiate the linear regressor by means of `LinearRegression()` and immediately fit the data by means of `.fit(X, y)`.
We print the formula of the fit, which in our case is `y = 0.9553431556159293x + 0.15085887517191932`. +- Then, we generate predictions for the original data, so that we can visualize the fit on top of the data itself, which we finally do with the `.scatter(...)` and `.plot(...)` calls. The rest is Matplotlib boilerplate code. + +Voilà, performing an Ordinary Least Squares based linear fit is easy - especially now that you know how it works under the hood! + +``` +from sklearn.datasets import make_blobs +from sklearn.linear_model import LinearRegression +import numpy as np +import matplotlib.pyplot as plt + +# Configuration options +num_samples_total = 150 +cluster_centers = [(3,3), (3.3, 3.3), (3.6, 3.6), (4, 4)] +num_features = 2 + +# Generate data +X, _ = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_features, center_box=(0, 1), cluster_std = 0.10) + +# Reshape X and create y +y = X[:,1].reshape(-1, 1) +X = X[:,0].reshape(-1, 1) + +# Perform OLS fit +reg = LinearRegression().fit(X, y) +print('y = {}x + {}'.format(reg.coef_[0][0], reg.intercept_[0])) + +# Generate predictions +y_pred = reg.predict(X) + +# Generate scatter plot for training data +plt.scatter(X, y, marker="o", picker=True) +plt.plot(X, y_pred, color='red') +plt.title(f'Samples and OLS Linear Regression fit') +plt.xlabel('x') +plt.ylabel('y') +plt.show() +``` + +This is the plot that is generated after the fit is complete: + +![](images/fit-1.png) + +* * * + +## Summary + +In this article, we focused on performing Regression Analysis with Python, and more specifically, performing a Linear Regression analysis for some dataset by using an Ordinary Least Squares fit. We first looked at regression in general. We saw that it is used to predict a continuous dependent variable using a set of independent variables. If it's a linear fit that is generated, we call it linear regression. Ordinary Least Squares fit is one of the techniques for estimating the coefficients / values for the linear function, and it works by minimizing the sum of squared residuals, which are the distances between the points and the estimations. + +Once we knew how OLS based Linear Regression works conceptually, we moved towards a more practical part. Using Python and Scikit-learn, we implemented an OLS based regression model using its `LinearRegression` model. We saw that we were capable of generating a fit which captures the data as good as it can. In a step-by-step example, you have seen how you can create such a model yourself. + +I hope that this article was useful to you and that you have learned something today! If you did, please feel free to share it in the comments section below 💬 If you have any questions, I'd love to hear from you through the **Ask a question** button, which allows you to ask your question to our readers. + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2004, July 15). _Regression analysis_. Wikipedia, the free encyclopedia. Retrieved December 10, 2020, from [https://en.wikipedia.org/wiki/Regression\_analysis](https://en.wikipedia.org/wiki/Regression_analysis) + +Wikipedia. (2001, May 20). _Linear regression_. [https://en.wikipedia.org/wiki/Linear\_regression](https://en.wikipedia.org/wiki/Linear_regression) + +Wikipedia. (2005, September 1). _Simple linear regression_. Wikipedia, the free encyclopedia. 
Retrieved December 10, 2020, from [https://en.wikipedia.org/wiki/Simple\_linear\_regression](https://en.wikipedia.org/wiki/Simple_linear_regression) + +Wikipedia. (2005, March 26). _Ordinary least squares_. [https://en.wikipedia.org/wiki/Ordinary\_least\_squares](https://en.wikipedia.org/wiki/Ordinary_least_squares) + +NumPy. (n.d.). _Numpy.polyfit — NumPy v1.19 manual_. [https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html) + +Scikit-learn. (n.d.). _Sklearn.linear\_model.LinearRegression — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved December 10, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.linear\_model.LinearRegression.html#sklearn.linear\_model.LinearRegression.fit](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit) diff --git a/performing-optics-clustering-with-python-and-scikit-learn.md b/performing-optics-clustering-with-python-and-scikit-learn.md new file mode 100644 index 0000000..0110849 --- /dev/null +++ b/performing-optics-clustering-with-python-and-scikit-learn.md @@ -0,0 +1,332 @@ +--- +title: "Performing OPTICS clustering with Python and Scikit-learn" +date: "2020-12-15" +categories: + - "frameworks" + - "svms" +tags: + - "clustering" + - "dbscan" + - "machine-learning" + - "optics" + - "scikit-learn" + - "sklearn" + - "unsupervised-learning" +--- + +Unsupervised Machine Learning problems involve clustering, adding samples into groups based on some measure of similarity because no labeled training data is available. There are many algorithms for clustering available today. OPTICS, or _Ordering points to identify the clustering structure,_ is one of these algorithms. It is very similar to [DBSCAN](https://www.machinecurve.com/index.php/2020/12/09/performing-dbscan-clustering-with-python-and-scikit-learn/), which we already covered in another article. In this article, we'll be looking at how to use OPTICS for clustering with Python. + +It is structured as follows. Firstly, in order to provide you with the necessary context, we will briefly look at clustering. We will see what it is and how it works generally speaking. Then, we'll move on to the conceptual details of OPTICS. We will take a look at its components, the algorithm and its dendogram output called a _reachability plot_, and how to generate clusters from the diagram. + +Once we know the ins and outs of the components and the algorithm, we move forward to a practical implementation using `OPTICS` in Scikit-learn's `sklearn.cluster` module. We will see how we can generate a dataset for which we can generate clusters, and will apply OPTICS to generate them. Using this step-by-step example, you will see how you can build an OPTICS based clustering model with Python. + +In other words, after reading this article, you'll both know how OPTICS works and have the skills to apply it to your own Machine Learning problem. + +Let's take a look! :) + +* * * + +\[toc\] + +* * * + +## What is clustering? + +Before we start looking at how OPTICS works, it is worthwhile to consider clustering in general first. Because, well, what is clustering? + +Let's take a look at a definition. 
+ +> **Cluster analysis** or **clustering** is the task of grouping a set of objects in such a way that objects in the same group (called a **cluster**) are more similar (in some sense) to each other than to those in other groups (clusters). +> +> Wikipedia (2004) + +In other words, suppose that we have a dataset like this one: + +![](images/twoclusters.png) + +Intuitively, we can already see that there are two groups of data: one towards the bottom left part of the plot, another towards the upper right part of the plot. + +But the machine doesn't know this yet. Rather, it just has an array with samples and their corresponding \[latex\]X\[0\]\[/latex\] and \[latex\]X\[1\]\[/latex\] positions, allowing us to generate the plot. + +Clustering algorithms are designed to select points which look like each other (and hence have high similarity) and assign them to the same group. In other words, if such an algorithm is deployed for the dataset visualized above, the desired end goal is that it finds the samples in the left corner so similar with respect to the ones in the right corner that it assigns group 0 to the bottom left ones, and 1 to the upper right ones. + +> **Unsupervised learning** is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision. +> +> Wikipedia (2003) + +In doing so, they have no information about the actual groups; they just have the positions. This is why clustering algorithms are called _unsupervised_: no pre-existing labels are there, and yet they are capable of finding patterns allowing us to group the samples. Really nice! + +OPTICS is such a clustering algorithm. Now that we know about clustering in general, let's take a look at how it works :) + +* * * + +## Introducing OPTICS: a relative of DBSCAN + +**Ordering points to identify the clustering structure**, or OPTICS, is an algorithm for density based clustering. It's quite an old algorithm already, as it was presented in 1999. Nevertheless, it is still a good algorithm today - not everything that's no longer new and shiny must be discarded. It is similar to the [DBSCAN algorithm](https://www.machinecurve.com/index.php/2020/12/09/performing-dbscan-clustering-with-python-and-scikit-learn/) for clustering, an extension even, and hence borrows some of its components as well as its algorithmic components. + +Let's take a look at OPTICS here. Firstly, we'll take a look at OPTICS' components, followed by taking a look at its algorithm. The outcome of this algorithm is a [dendrogram](https://en.wikipedia.org/wiki/Dendrogram) (which shows the tree-like structure of the data by means of the _reachability distance_, which is one of the components that we will cover next). Once we know the output of the algorithm, we'll move on to interpreting this diagram, answering the questions how we can generate the clusters from this reachability plot. + +### OPTICS components + +Let's first take a look at the components of the OPTICS method. + +#### Epsilon parameter + +The first parameter is the epsilon parameter, or \[latex\]\\epsilon\[/latex\]. It is a distance parameter in the sense that for any point \[latex\]p\[/latex\], the epsilon defines a distance around the point, like this: + +![](images/samples-1.png) + +#### MinPts parameter + +Another parameter is \[latex\]\\text{minPts}\[/latex\], or the _minimum amount of points_ parameter. 
It is used together with epsilon because it illustrates how many points must be within the \[latex\]\\epsilon\[/latex\] distance of a point \[latex\]p\[/latex\] (including the point) in order to form a cluster. + +#### Core points + +When the point \[latex\]p\[/latex\] has \[latex\]\\text{minPts}\[/latex\] within its \[latex\]\\epsilon\[/latex\] distance including itself, we say that it is a core point and that it has sufficient amount of points in its \[latex\]\\epsilon \\text{-neighborhood}\[/latex\] for becoming one. A core point always represents a cluster. Possibly, it is still in formation, meaning that it will merge with other clusters later. We'll see how this happens when we take a look at the OPTICS algorithm. + +![](images/corepoints.png) + +#### Core distance + +If you have read the article about [DBSCAN](https://www.machinecurve.com/index.php/2020/12/09/performing-dbscan-clustering-with-python-and-scikit-learn/), you might have thought that many of these concepts are familiar. And in fact, they are! All concepts covered so far are also components of the DBSCAN algorithm. The next one, **core distance**, is however unique to OPTICS. Let's take a look. + +Core distance is defined as follows. For any point \[latex\]p\[/latex\] with some epsilon \[latex\]\\epsilon\[/latex\] and hence an epsilon neighborhood \[latex\]N\_\\epsilon(p)\[/latex\]: + +\[mathjax\] + +\\begin{equation} +\\text{core-dist}\_{\\epsilon , \\text{minPts}}(p) = +\\begin{cases} +\\text{undefined}, & \\text{if}\\ |N\_{\\epsilon}(p)| < \\text{minPts} \\\\ +\\text{minPts-th smallest distance in } N\_{\\epsilon}(p), & \\text{otherwise} \\ +\\end{cases} +\\end{equation} + +In other words, the core distance is undefined if the number of points in the neighborhood (including \[latex\]p\[/latex\] itself) is lower than the minimum number of points required. This makes sense: if the point is no core point, it does not have a core distance. + +In the other case, however, it's the \[latex\]\\text{minPts-th smallest distance in } N\_{\\epsilon}(p)\[/latex\]. This is a very generic description of the point in the epsilon neighborhood located farthest away from the core point; in the case of \[latex\]\\text{minPts} = 3\[/latex\], it would be the 3rd smallest distance. For this reason, the core distance also describes the **minimum value for epsilon in order to keep the point a core point.** Using the distance, in the algorithm, we can keep merging clusters by (1) knowing that they are close to a core point and hence reachable thus part of the cluster, and (2) do so in an extensive way, growing the cluster time after time. + +If this is a bit abstract to you, don't worry - we'll get back to this when describing the algorithm! + +![](images/corereach-1.png) + +#### Reachability distance + +While the core distance expresses the _minimum distance to keep a point a core point_, the **reachability distance** expresses the distance which is reachable from a core point. 
+ +It is expressed as follows in terms of an arbitrary point \[latex\]o\[/latex\] that is reached from a point \[latex\]p\[/latex\]: + +\\begin{equation} +\\text{reach-dist}\_{\\epsilon , \\text{minPts}}(o, p) = +\\begin{cases} +\\text{undefined}, & \\text{if}\\ |N\_{\\epsilon}(p)| < \\text{minPts} \\\\ +\\text{max}(\\text{core-dist}\_{\\epsilon , \\text{minPts}}(p), dist(p,o)), & \\text{otherwise} \\ +\\end{cases} +\\end{equation} + +If \[latex\]p\[/latex\]'s epsilon neighborhood has insufficient points, it is not a core point and hence cannot be used in reaching another point. This is similar to direct reachability and reachability in DBSCAN. For this reason, if this happens, the reachability distance is set to undefined. + +If it is a core point, the reachability distance is either the core distance or the distance between \[latex\]p\[/latex\] and \[latex\]o\[/latex\], whichever is bigger. In other words, any point within either the core or reachability distance can be reached from that particular core point. This allows us to continue constructing clusters. + +In the example below, the reachability distance from the core point to point 1 equals the core distance, because it is bigger. However, for a random point R, the reachability distance equals the _distance_ to that point, because that one is bigger than the core distance. + +![](images/corereach-2.png) + +To summarize: + +- Points that are part of local clusters can be identified by means of **core points**, using the concepts of epsilon and minimum number of points borrowed from the DBSCAN algorithm. +- After a local cluster has been identified, points in the vicinity must be identified for whether they are part of the cluster or not. For this reason we compute the **core distance**, the minimum distance from a point in order to remain a core point, and the **reachability distance** for another point, which expresses how far away the point is located from that point. + +It's now time to look at the OPTICS algorithm itself. We shall see that from one core point forward, the algorithm will continue to search for cluster extensions by (1) identifying whether possible extensions are dense enough, by means of core distance, and (2) what their distance from the most dense parts of the cluster are, by ordering based on reachability distance. + +### OPTICS algorithm + +Let's now take a closer look at the OPTICS algorithm. We'll start our algorithm with an ordered list that is empty. We also maintain a list with processed points. + +`ordered list = empty list`. + +`processed points = empty list` + +Here's how OPTICS works. + +#### The main loop + +The main part of the OPTICS algorithm is the **main loop** (Wikipedia, 2009). It describes the optics function: + +- The function `OPTICS` can be called with a database (`DB`), and values for epsilon and minimum amount of points. +- For each point in the database, we first set reachability distance to `undefined`; we must compute it later. +- Then, for each unprocessed point, we perform the following: + - We get the \[latex\]\\epsilon \\text{-neighborhood}\[/latex\] for the point. + - We mark p as processed (we looked at it). + - We push p to the ordered list (it's the first point we're looking at). + - We now look at the core distance of p: if it's not undefined (i.e. if it is a core point), we will look further. If it is no core point, we move on to the next unprocessed point. + - For core points, we initialize an empty priority queue i.e. 
a queue where the most important values are read from first. We then call the update function which we will discuss in the next section, which orders the priority queue based on reachability distance. + - For the ordered priority queue (where we shall see that lowest reachability distance from the core point \[latex\]p\[/latex\] and hence the closest points are covered first), for each point, we get its neighbors. We then mark the point \[latex\]q\[/latex\] as processed and output it to the ordered list. If it's a core point as well, we can extend the priority queue as the clusters are close to each other and likely belong to the same bigger cluster. Extending the priority queue through update means that more points are added to the reachability-distance ordered Seeds list. + - In other words, the algorithm keeps expanding on a particular point _until_ none of the unprocessed points have a core distance anymore (i.e. aren't core points). These are the outliers. + +``` +function OPTICS(DB, eps, MinPts) is + for each point p of DB do + p.reachability-distance = UNDEFINED + for each unprocessed point p of DB do + N = getNeighbors(p, eps) + mark p as processed + output p to the ordered list + if core-distance(p, eps, MinPts) != UNDEFINED then + Seeds = empty priority queue + update(N, p, Seeds, eps, MinPts) + for each next q in Seeds do + N' = getNeighbors(q, eps) + mark q as processed + output q to the ordered list + if core-distance(q, eps, MinPts) != UNDEFINED do + update(N', q, Seeds, eps, MinPts) +``` + +In the next sections, we will both describe the update function and how we can derive clusters from the output of the algorithm. + +#### The update function + +Above, at two places, a call is made to an `update` function which updates the Seeds queue, or in other words, the priority queue. Updating always happens because either the queue is empty (in other words, at the start of the algorithm, or when all close points have been covered and unprocessed points remain) or because new points have been found to extend the cluster with. + +Updating happens in the following way: + +- First of all, for the point \[latex\]p\[/latex\]for which the neighborhood is passed along, the core distance is computed. In other words, we then know what the minimum distance is to keep the neighborhood a true neighborhood. +- For each point in the neighborhood, if not processed, we compute the reachability distance. If it's undefined (i.e. because the point has never been touched before) we set the reachability distance and insert it to the priority queue at precisely that distance. +- If it is already set, though, we update the queue and move it forward if the new reachability distance is lower than the old one. + +In other words, during the extension, we continuously change the order in the priority queue, where points with lower reachability distance (and hence points closer to the cluster's center points) are added earlier. 
+ +``` +function update(N, p, Seeds, eps, MinPts) is + coredist = core-distance(p, eps, MinPts) + for each o in N + if o is not processed then + new-reach-dist = max(coredist, dist(p,o)) + if o.reachability-distance == UNDEFINED then // o is not in Seeds + o.reachability-distance = new-reach-dist + Seeds.insert(o, new-reach-dist) + else // o in Seeds, check for improvement + if new-reach-dist < o.reachability-distance then + o.reachability-distance = new-reach-dist + Seeds.move-up(o, new-reach-dist) +``` + +#### Combining the main loop and update function + +Combining the main loop and the update function, we therefore see the following behavior emerge: + +1. We keep iterating until each point has been processed. +2. For each unprocessed point, we attempt to process as many related points by taking a look at neighborhoods and extending these neighborhoods until no close points can be found anymore. +3. All distances are added to an ordered list. This ordered list hence, for each cluster, contains reachability distances in an ascending way. Outliers represent the cutoff points for determining the clusters. We will illustrate how this works by means of the reachability plot. + +### Generating clusters from the reachability plot + +The 'ordered list' we just covered above is displayed in the image below, a so-called **reachability plot**. In this plot, the reachability distances for each point are mapped. Clearly, from one point, we can see how the clusters have been extended in a reachability distance-based ascending ordering: for the blue cluster, the distance ascends until it moves towards a red one; then towards a green one. + +The valleys therefore represent clusters and can be considered as being clusters, whereas the peaks represent the cutoff points, sometimes even with outliers in between. + +Generating clusters from the reachability plot can therefore naïvely be performed by means of taking a cutoff value, where an arbitrary y value marks the change in cluster. If done so, the clustering algorithm works in a similar way to DBSCAN, because a threshold is taken just like DBSCAN does. However, more advanced methods exist as well (Wikipedia, 2009). + +![](images/OPTICS.svg_-1024x700.png) + +### OPTICS vs DBSCAN + +OPTICS shares many components with DBSCAN. The epsilon value and minimum number of points are shared and, together with the concepts of core points and reachable points (by implication of the reachability distance) overlap. One of the key differences is the way that clusters are computed: rather than by picking a fixed value (DBSCAN), a different method can be applied with OPTICS. + +For this reason, OPTICS is preferable over DBSCAN when your clusters have varying density. In other cases, the choice for algorithm does not really matter. + +* * * + +## Building an OPTICS model with Python and Scikit-learn + +Now that we understand how OPTICS works, we can take a look at implementing it with Python and Scikit-learn. + +With the following code, we can perform OPTICS based clustering on a random blob-like dataset. It works as follows. + +- First of all, we make all the imports; `make_blobs` for generating the data, `OPTICS` for clustering, and NumPy and Matplotlib for numbers processing and visualization, respectively. +- Then, we specify a range of configuration options. We will generate 1000 samples in total around two centers, so that we'll get two blobs of data. 
We set epsilon and min\_samples to values that we derived during testing, as well as the method for clustering and the distance metric. + - The values for `cluster_method` can be `xi` and `dbscan`. With `xi`, a cluster-specific method will be used for extracting clusters. With `dbscan`, a fixed threshold will be used for extracting the clusters from the recahability plot. + - Many metrics can be specified under `metric`. The Minkowski distance is the default one. See all metrics [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html). +- We next generate data: two blobs of data, with `make_blobs`. +- Based on this data, we perform OPTICS-based clustering, with epsilon, minimum number of samples, cluster method and metric defined. We immediately fit the data so that the clusters are generated. +- We then print some information about the number of clusters and noisy samples, and finally generate a scatter plot. + +``` +from sklearn.datasets import make_blobs +from sklearn.cluster import OPTICS +import numpy as np +import matplotlib.pyplot as plt + +# Configuration options +num_samples_total = 1000 +cluster_centers = [(3,3), (7,7)] +num_classes = len(cluster_centers) +epsilon = 2.0 +min_samples = 22 +cluster_method = 'xi' +metric = 'minkowski' + +# Generate data +X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.5) + +# Compute OPTICS +db = OPTICS(max_eps=epsilon, min_samples=min_samples, cluster_method=cluster_method, metric=metric).fit(X) +labels = db.labels_ + +no_clusters = len(np.unique(labels) ) +no_noise = np.sum(np.array(labels) == -1, axis=0) + +print('Estimated no. of clusters: %d' % no_clusters) +print('Estimated no. of noise points: %d' % no_noise) + +# Generate scatter plot for training data +colors = list(map(lambda x: '#3b4cc0' if x == 1 else '#b40426', labels)) +plt.scatter(X[:,0], X[:,1], c=colors, marker="o", picker=True) +plt.title(f'OPTICS clustering') +plt.xlabel('Axis X[0]') +plt.ylabel('Axis X[1]') +plt.show() +``` + +Running the algorithm yields the following scatter plot: + +![](images/optics.png) + +We can also easily generate the reachability plot: + +``` +# Generate reachability plot +reachability = db.reachability_[db.ordering_] +plt.plot(reachability) +plt.title('Reachability plot') +plt.show() +``` + +![](images/rplot.png) + +* * * + +## Summary + +In this article, we took a look at the OPTICS algorithm for clustering. Similar to the DBSCAN algorithm, but notably different, it can be used for clustering when the density of your clusters is different. This is something that DBSCAN cannot do so well. + +First of all, we looked at what clustering is in the first place. We saw that clustering algorithms can be used to group samples in a dataset based on similarity. Then, we moved on to OPTICS, and studied its components. We saw that similar to DBSCAN, OPTICS also works with epsilon and a minimum number of points, which specifies a distance around a point and a minimum number of points (including the point itsefl) to be within this distance in order to classify the point as a core point. Core points represent dense points, and using the core distance and reachability distance, OPTICS is capable of grouping samples together. + +We saw that OPTICS works by ordering based on reachability distance while expanding the clusters at the same time. 
The output of the OPTICS algorithm is therefore an ordered list of reachability distances, which by means of thresholds or different techniques we can split into clusters. This way, we're able of generating clusters for groups of data that have varying densities. + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope that you have learned something from today's article. If you did, please feel free to leave a message in the comments section! 💬 Please also leave remarks and comments, or leave them through the **Ask Questions** button. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2009, April 21). _OPTICS algorithm_. Wikipedia, the free encyclopedia. Retrieved December 9, 2020, from [https://en.wikipedia.org/wiki/OPTICS\_algorithm](https://en.wikipedia.org/wiki/OPTICS_algorithm) + +Scikit-learn. (n.d.). _Sklearn.cluster.OPTICS — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved December 9, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html) + +Wikipedia. (2004, May 21). _Cluster analysis_. Wikipedia, the free encyclopedia. Retrieved December 11, 2020, from [https://en.wikipedia.org/wiki/Cluster\_analysis](https://en.wikipedia.org/wiki/Cluster_analysis) + +Wikipedia. (2003, May 25). _Unsupervised learning_. Wikipedia, the free encyclopedia. Retrieved December 11, 2020, from [https://en.wikipedia.org/wiki/Unsupervised\_learning](https://en.wikipedia.org/wiki/Unsupervised_learning) diff --git a/problems-with-fixed-and-decaying-learning-rates.md b/problems-with-fixed-and-decaying-learning-rates.md new file mode 100644 index 0000000..2b0ecb9 --- /dev/null +++ b/problems-with-fixed-and-decaying-learning-rates.md @@ -0,0 +1,391 @@ +--- +title: "Problems with fixed and decaying learning rates" +date: "2019-11-11" +categories: + - "buffer" + - "deep-learning" +tags: + - "adaptive-optimizers" + - "artificial-intelligence" + - "deep-learning" + - "learning-rate" + - "machine-learning" + - "optimizer" +--- + +Learning rates come in various flavors and can be used to influence the learning process. More specifically, they are meant to ensure that gradient updates are not too large as they are set to small values by default. In a [different blog post](https://www.machinecurve.com/index.php/2019/11/06/what-is-a-learning-rate-in-a-neural-network/), we covered them conceptually, but highlighted that fixed and decaying learning rates come with their set of challenges. + +What these challenges are is what we'll cover in this blog post. For fixed learning rates, we will compare [SGD](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) learning rates that are either too large or too small with the baseline scenario, which is the Keras default learning rate for the SGD optimizer. + +For learning rate decay, we'll show you how it improves the learning process, but why setting the default one in advance and choosing a decay scheme might still influence the training results. Finally, we'll show you a few possible solutions for the problem. + +If you still think that this post covers the machine learning problem you're dealing with - let's go! 😎 + +**Update 02/Nov/2021:** fixed bug in model code with missing reshape and `input_shape` variable. 
**Update 01/Mar/2021:** ensure that article is up to date for 2021. Replaced TF 1 based code with TensorFlow 2 based code, so that it can be used with recent versions of the library. Some other improvements as well.

**Update 01/Feb/2020:** added link to [Learning Rate Range Test](https://www.machinecurve.com/index.php/2020/02/20/finding-optimal-learning-rates-with-the-learning-rate-range-test/).

* * *

\[toc\]

* * *

## Problems with Fixed Learning Rates

In order to show the issues you may encounter when using fixed learning rates, we'll use a [CNN based image classifier](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) that we created before. This model uses the MNIST dataset for demonstration purposes; this dataset is used in educational settings quite often.

The code of our model can be found by clicking the link above or by scrolling slightly to the bottom of this post, under 'Model code'.

First, we will create our baseline by training our CNN with the default learning rate. This allows us to show that the model does actually perform well. Next, we'll show what happens when we increase the learning rate: your model no longer performs. We then show that decreasing the learning rate is not a solution either, since it will tremendously slow down your learning process.

### Baseline: default learning rate

This is a [visualization of the performance](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/) of the model with our baseline scenario:

[![](images/fixed_lr_baseline.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/fixed_lr_baseline.png)

It shows a very normal learning curve: a steep descent during the first few epochs, after which the model gets close to the minimum (whether local or global!) and learning stabilizes. The final loss value on the training data is approximately 0.01, whereas it's 0.05 on the validation data - that's pretty good. Test accuracy, in this case, confirmed the results:

```
Test loss: 0.02863448634357819 / Test accuracy: 0.9919000267982483
```

### Too large fixed learning rate: overshooting the loss minimum

Now, what happens when we set the learning rate to \[latex\]0.5\[/latex\], which the machine learning community considers a really large one?

(Note that in this case, 50% of the computed gradient change is actually used to change the model's weights!)

In the really large case, this is the result:

[![](images/fixed_lr_really_large.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/fixed_lr_really_large.png)

```
Test loss: 2.3188612442016603 / Test accuracy: 0.10100000351667404
```

Initially, loss was really large - and while the loss does decrease substantially during the first epoch, nothing happens during the subsequent ones. Rather, test loss is 2.32 (instead of 0.029 in the baseline scenario) and accuracy is only 10.1%. Really large learning rates therefore don't work: even if the minimum can be found at all, the optimizer continuously overshoots it.

Now, what happens if we decrease the learning rate to a value that is still large - but generally speaking, acceptably large?
+ +That is, we're now using a learning rate of \[latex\]0.01\[/latex\]: + +[![](images/fixed_lr_large.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/fixed_lr_large.png) + +``` +Test loss: 0.045392444870159344 / Test accuracy: 0.9887999892234802 +``` + +Yes, we're finding convergence again with really good test accuracy! 😎 + +...but still, we're being impacted by the fact that the learning rate seems to be too large: + +- First, the test loss is approximately twice as high compared with the baseline scenario: 0.045 instead of 0.029. This likely occurs because the model cannot find the actual minimum, since the learning rate is too large and we overshoot our minimum every time. +- Secondly, compared to the [baseline plot](https://www.machinecurve.com/wp-content/uploads/2019/11/fixed_lr_baseline.png), we can observe that our loss value oscillates more heavily. This is also the result of a less subtle learning rate, compared with the baseline scenario. + +All in all, comparing our baseline, we can see that while increasing the learning rate _may_ help you find convergence faster, _it may also be destructive for learning_. Choose wisely! Start with the default LR and perhaps increase it in small steps, [visualize training history](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/) and watch for oscillation in your plots, and stop slightly before this occurs. + +### Too small fixed learning rate: extremely slow convergence + +Okay. We have seen that increasing the learning rate helps, but only to some extent. It allows you to find convergence faster, but at some point it ensures that your model no longer converges. Instead, you then find very poor model performance. + +You may now think: okay, but what happens when I _decrease_ instead of _increase_ the learning rate? Does the same pattern emerge then? + +Let's find out. + +We'll first use a learning rate of \[latex\]0.00001\[/latex\]. Note that our baseline learning rate is \[latex\]0.001\[/latex\], so ours is now 100 times smaller. + +[![](images/fixed_lr_small.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/fixed_lr_small.png) + +``` +Test loss: 0.17123142351806164 / Test accuracy: 0.9513999819755554 +``` + +We can make a few observations here: + +- First, the history plot shows that our history is much smoother than the ones found with the larger learning rate. This makes sense, since a smaller learning rate essentially means that you're taking smaller steps when performing [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/), and hence your path downhill is smoother. +- Secondly, we unfortunately find higher loss values compared to our baseline scenario. Training and validation loss approximate 0.25, while test loss is approximately 0.171. Compare this with our baseline, where test loss was 0.029, and you see what happens - smaller loss values lead to smoother learning curves, but result in slower convergence (while theoretically your model still converges by simply increasing the number of epochs, things like [vanishing gradients](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) or using the [Adadelta optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adagrad) could then result in finding no convergence at all!) 
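If you want to reproduce these fixed-rate runs yourself, the only thing that changes between experiments is the learning rate passed to the optimizer in the compile step. Below is a minimal sketch, assuming the TensorFlow 2.x Keras SGD optimizer and a small placeholder model of my own - the full baseline code, which uses Adam, follows further below under 'Model code':

```
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import SGD

# Minimal placeholder model - substitute the CNN from the 'Model code' section below
model = Sequential([
    Flatten(input_shape=(28, 28, 1)),
    Dense(10, activation='softmax')
])

# The fixed learning rate under test: e.g. 0.5, 0.01, 0.001 or 0.00001
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(learning_rate=0.00001),
              metrics=['accuracy'])
```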
This pattern gets even stronger when we decrease our learning rate again, once more 100 times. In fact, the steps are now too small to find model convergence quickly:

[![](images/fixed_lr_really_small.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/fixed_lr_really_small.png)

```
Test loss: 2.2723510555267334 / Test accuracy: 0.16249999403953552
```

### Fixing your learning rate is resource inefficient

We can thus conclude that while fixed learning rates benefit you in terms of _simplicity_, they have multiple drawbacks:

- Fixed learning rates that are too large prevent your model from converging to the loss minimum, because it always overshoots this minimum during training.
- Fixed learning rates that are too small may result in the same, but now because your steps are so small that it (theoretically) takes almost infinitely long to reach the minimum.
- Hence, there is a range in between in which learning rates result in quick and adequate convergence. Your learning rate is best configured to lie in this range.
- Unfortunately, this range is dependent on the loss landscape that is generated by your dataset. You can only find this landscape by either visualizing it, or experimenting with many trial runs of your training process.
- Additionally, the range is dependent on the configuration of the other hyperparameters in your machine learning model, which themselves are also dependent on the dataset. This introduces quite some complexity.
- Hence, fixed learning rates are flawed, since they require setting a learning rate a priori - either finding no or less-than-optimal convergence, or wasting a lot of resources.
- Could there be a better solution? We'll explore learning rate decay schemes in the next section.

### Model code

This is the code that we used for our model, more specifically for our baseline setting. It uses the [Adam adaptive optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) and its default learning rate of \[latex\]0.001\[/latex\]. Note that you can adapt the learning rate under 'Compile the model'.

```
'''
  Problems with fixed and decaying Learning Rates:
  Fixed Learning Rate - Baseline Scenario
'''
import tensorflow
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
import matplotlib.pyplot as plt

# Model configuration
img_width, img_height = 28, 28
batch_size = 250
no_epochs = 25
no_classes = 10
validation_split = 0.2
verbosity = 1

# Load MNIST dataset
(input_train, target_train), (input_test, target_test) = mnist.load_data()

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Convert them into black or white: [0, 1].
+input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Reshape everything +input_train = input_train.reshape(input_train.shape[0], 28, 28, 1) +input_test = input_test.reshape(input_test.shape[0], 28, 28, 1) + +# Set input shape +input_shape = (28, 28, 1) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(learning_rate=0.001), + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Plot history +plt.plot(history.history['loss'], label='Categorical crossentropy loss (training data)') +plt.plot(history.history['val_loss'], label='Categorical crossentropy loss (validation data)') +plt.title('Categorical crossentropy loss for Fixed LR / Baseline scenario') +plt.ylabel('Categorical crossentropy loss value') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() +``` + +* * * + +## Problems with Learning Rate Decay + +In the previous section, we found that fixed learning rates can be used, but that they are inherently flawed if you cannot find an adequate, or even the best, fixed learning rate quickly. + +Learning Rate Decay may reduce your challenge. + +### What is Learning Rate Decay? + +Why should it be necessary to keep your learning rate fixed - that is the premise behind decaying learning rates. It's just that simple: a decaying learning rate is a learning rate that gets smaller and smaller as the number of epochs increases. + +This allows you to start with a relatively large learning rate, while benefiting from smaller (or even small!) ones towards your final stages of training. + +In terms of the training process, this is beneficial: at the beginning, a relatively large learning rate is necessary in order to set huge steps, while you wish to set increasingly smaller steps when you approach the loss minimum. + +Decay schemes are thus a better idea than fixed learning rates, and there are many of them (Lau, 2017): + +**Linear decay**, well, decays your learning rate linearly. That is, it decreases with a fixed rate, until it reaches 0: + +``` +l_lr = initial_learning_rate +def linear_decay(epoch): + lr_decay = 0.00001 + global l_lr + l_lr = l_lr - lr_decay + return max(l_lr, 0) +``` + +![](images/linear_decay.png) + +**Step decay** allows you to drop the learning rates in exponentially smaller steps, every few epochs. In the case below, the learning rate drops step-wise every 15 epochs. The first drop is 0.5, the second 0.025, then 0.0125, and so on. 
+ +``` +def step_decay(epoch): + lr_drop_by = 0.5 + lr_drop_every = 15 + return initial_learning_rate * math.pow( + lr_drop_by, math.floor((1+epoch)/lr_drop_every) + ) +``` + +![](images/step_decay.png) + +**Time decay** decreases the learning rate overtime. Decay starts slowly at first, to ensure that the learning rate remains relatively large during the early phases of the training process. Subsequently, decay gets larger, but slows down towards the end. Compared with linear and step decay, time decay is smooth. This might reduce oscillation around your loss curve. + +``` +td_lr = initial_learning_rate +def time_decay(epoch): + lr_decay = 0.0000015 + global td_lr + td_lr *= (1 / (1 + lr_decay * epoch)) + return td_lr +``` + +[![](images/time_decay.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/time_decay.png) + +**Exponential decay** is similar to time decay, but - compare the plots! - is different. Contrary to time decay, which decays slowly at first, exponential decay decays fastest at first, only to decrease decay with increasing epochs. Similar to time decay, it's also smooth. + +[![](images/exponential_decay.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/exponential_decay.png) + +As discussed, learning rate decay schemes do improve the learning process by reducing the impact of fixed learning rates. Nevertheless, decay schemes also come with their own set of peculiarities, of which two are primarily relevant: **setting the default learning rate in advance** is still necessary, while it's _not_ known in advance **which decay scheme is best**. + +### Setting the Default Learning Rate in advance + +One of the main drawbacks of fixed learning rates was that it must be set in advance, even though it's dependent on the loss landscape, which itself is dependent on the dataset and how you configured the rest of your hyperparameters. + +Learning rate decay schemes partially resolve this problem: by setting a quite high learning rate and applying a decay scheme, you can (1) ensure that your model still converges and (2) that its steps are smaller once you get closer to the loss minimum. + +This is good. + +But there's still a catch: the default learning rate, i.e. the learning rate from which decay starts, is still _fixed_. And while the range of decay-fixed learning rates is larger than true-fixed learning rates (by virtue of LR decay), you still have to make assumptions about your default learning rate - because it's once again dependent on the data _and_ the other hyperparameters. + +While learning rate decay schemes therefore make your life easier, they're still not a full solution. + +### Choosing a Decay Scheme + +Another choice you'll have to make _in advance_ is which decay scheme you'll use. As we saw above, multiple decay schemes exist - schemes that are linear, that are dependent on time, or work exponentially. There is no fixed law that prescribes which decay scheme to use in which situation, and your choice is often dependent on experience. + +...and experience is always related to the structure of the dataset you're using, which means that choosing a decay scheme is also dependent on your dataset and hence the other hyperparameters. + +The necessity of choosing a decay scheme in advance is therefore, together with choosing a default learning rate a priori, why there might still be better options. + +* * * + +## Possible solutions + +This does not mean that your life is lost. 
No, on the contrary - both fixed learning rates and learning rate decay schemes often result in well-performing models, especially if your data has high volume and variety.

Nevertheless, it may be that you wish to optimize your model to the max _or_ that you find mediocre performance and wish to investigate whether it's your learning rate that is to blame.

You may in that case try a few of these solutions.

### Using adaptive optimizers

First of all - you may wish to use an [adaptive optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/). Normal [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) updates all your weights in a similar fashion: it applies the gradient update times the learning rate and subtracts this update from the model's current weights.

However, over time, you may wish to decrease the updates for weights that have already updated quite often (since they apparently do not contribute to model improvement) while increasing the updates for weights that haven't updated often yet (because perhaps, they may contribute after all).

What you need is an _adaptive optimizer_ such as [Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam), [Adadelta](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adadelta) or [AdaMax](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adamax), which adapts the size of the weight updates per parameter during training.

Yes - for most of them, you'll still need to configure a default learning rate. However, the good news **is that this choice impacts the training process far less**, because the optimizer is able to alter the _impact_ of each update itself. Even better, if you use [Adadelta](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adadelta), you don't need to configure a default learning rate at all.

(Note that adaptive optimizers are not [without challenges](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#challenges-with-adaptive-optimizers-new-ones) - as is pretty much everything in deep learning. Do not use them blindly, but use your common sense - if they don't work, perhaps try SGD with one of the other options instead.)

### Learning Rate Range Test

In 2017, in two papers, Smith (2017) and Smith & Topin (2017) produced an interesting idea: _what if you could determine the optimal learning rate empirically_?

This is the premise behind the **[Learning Rate Range Test](https://www.machinecurve.com/index.php/2020/02/20/finding-optimal-learning-rates-with-the-learning-rate-range-test/)**, which essentially allows you to test a range of learning rates by training the model once, but with an exponentially increasing learning rate.

Take for example the [Keras LR Finder](https://github.com/surmenok/keras_lr_finder), which is based on the algorithm described in Smith (2017) and essentially:

> Plots the change of the loss function of a Keras model when the learning rate is exponentially increasing.

Generating plots like this:

![](images/image.png)

Result of the Learning Rate Range Test for a CNN I trained for my master's thesis. Clearly, loss is lowest with learning rates in the range of \[0.01, 0.1\].
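If you want a feel for what such a test does under the hood, the sketch below implements the core idea with TensorFlow 2.x Keras: train briefly, exponentially increase the learning rate after every batch, and record the loss as you go. Note that this is an illustrative implementation only - the `lr_range_test` helper, its defaults and the use of the running loss from `logs['loss']` are my own assumptions, not the Keras LR Finder's actual API.

```
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

def lr_range_test(model, X, y, start_lr=1e-6, end_lr=1e0, batch_size=250, epochs=1):
    '''Sweep the learning rate exponentially and record the (running) loss per batch.'''
    steps = int(np.ceil(len(X) / batch_size)) * epochs
    factor = (end_lr / start_lr) ** (1.0 / steps)
    lrs, losses = [], []

    def on_batch_end(batch, logs):
        lr = float(tf.keras.backend.get_value(model.optimizer.learning_rate))
        lrs.append(lr)
        losses.append(logs['loss'])
        # Exponentially increase the learning rate for the next batch
        tf.keras.backend.set_value(model.optimizer.learning_rate, lr * factor)

    # Start at the lower bound and train briefly while sweeping upwards
    tf.keras.backend.set_value(model.optimizer.learning_rate, start_lr)
    model.fit(X, y, batch_size=batch_size, epochs=epochs, verbose=0,
              callbacks=[tf.keras.callbacks.LambdaCallback(on_batch_end=on_batch_end)])

    # Plot loss against the (log-scaled) learning rate
    plt.semilogx(lrs, losses)
    plt.xlabel('Learning rate (log scale)')
    plt.ylabel('Loss')
    plt.show()
```

A compiled model such as the one from the 'Model code' section above can be passed in directly; packages like the Keras LR Finder linked before wrap exactly this kind of sweep for you, including smoothing of the loss curve.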
The Learning Rate Range Test thus allows you to estimate a default learning rate, which you can then use in a fixed sense or with a decay scheme. Even though you'll still have to fix your learning rate a priori before starting the real training process, you can now find an estimate that actually works best, or that at least approaches an optimal value quite closely.

If you wish to implement it with Keras, [take a look here](https://www.machinecurve.com/index.php/2020/02/20/finding-optimal-learning-rates-with-the-learning-rate-range-test/#implementing-the-learning-rate-range-test-with-keras).

### Cyclical Learning Rates

Smith (2017) doesn't only describe the LR Range Test, however. In fact, the author combines the test with another new concept: a **cyclical learning rate**.

The concept is very simple:

**Just move back and forth between a relatively large learning rate and a lower one, in a zig-zag way, between some bounds.**

Like this:

![](images/clr.png)

The bounds can be determined by means of the LR Range Test: in the case above, e.g. \[latex\]10^{-2}\[/latex\] and \[latex\]10^{-1}\[/latex\].

How does this benefit you compared to a 'traditional' learning rate, you may now wonder?

Well, this has actually yielded quite good results, and this is perhaps why: it combines the best of _large steps_ with the best of _small steps_. By taking large steps, you may move quickly towards a minimum, and even overshoot a local minimum if it seems that you're getting stuck. The smaller steps will then ensure, once you're closer to a minimum, that you will be able to find the true minimum without overshooting.

Especially when you're in the global minimum, it's relatively unlikely that you'll overshoot it when the learning rate switches back from small to large. In fact, when you combine this strategy with [EarlyStopping and/or ModelCheckpointing](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/), it's even more unlikely that you'll overshoot your global minimum.

* * *

## Summary: use common sense and don't be afraid to experiment

As we've seen, it's very much possible to achieve quite well-performing models with fixed learning rates and learning rate decay schemes. Still, they face some challenges that cannot be ignored if you wish to optimize your model to the fullest. Adaptive optimizers, the Learning Rate Range Test and Cyclical Learning Rates may then help you, but even those must be applied with caution and common sense.

The lesson learnt here could then perhaps be summarized as follows:

**Benefit from the best of theory, but don't be afraid to experiment.**

Spend resources on finding your best hyperparameters - but do so wisely, rather than naïvely. For example, use the LR Range Test with open source libraries like the Keras LR Finder to find a good learning rate empirically, and don't make guesses yourself. Try cyclical learning rates and/or adaptive optimizers for some time, but make sure to compare them with traditional SGD with e.g. a fine-tuned learning rate decay too.

...when working with common sense, I'm confident that you're on your way to creating a well-performing model! 😎

I hope you've learnt something from reading this blog post. If you did, I'd appreciate a heads up about what you did - and if anything was unclear.
Just leave a comment in the comments box below 👇 and I'll be happy to answer any questions, or improve my blog where necessary. + +Thanks - and happy engineering! 😄 + +* * * + +## References + +Lau, S. (2017, August 1). Learning Rate Schedules and Adaptive Learning Rate Methods for Deep Learning. Retrieved from [https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1) + +Smith, L. N. (2017, March). [Cyclical learning rates for training neural networks.](https://ieeexplore.ieee.org/abstract/document/7926641/) In _2017 IEEE Winter Conference on Applications of Computer Vision (WACV)_ (pp. 464-472). IEEE. + +Smith, L. N., & Topin, N. (2017). Exploring loss function topology with cyclical learning rates. _[arXiv preprint arXiv:1702.04283](https://arxiv.org/abs/1702.04283)_[.](https://arxiv.org/abs/1702.04283) + +Surmenok, P. (2018, July 14). Estimating an Optimal Learning Rate For a Deep Neural Network. Retrieved from [https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0](https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0) + +surmenok/keras\_lr\_finder. (2019, October 19). Retrieved from [https://github.com/surmenok/keras\_lr\_finder](https://github.com/surmenok/keras_lr_finder) diff --git a/python-feature-scaling-with-outliers-in-your-dataset.md b/python-feature-scaling-with-outliers-in-your-dataset.md new file mode 100644 index 0000000..7803361 --- /dev/null +++ b/python-feature-scaling-with-outliers-in-your-dataset.md @@ -0,0 +1,250 @@ +--- +title: "Python Feature Scaling with Outliers in your Dataset" +date: "2020-11-19" +categories: + - "frameworks" + - "svms" +tags: + - "data-preprocessing" + - "dataset" + - "deep-learning" + - "feature-scaling" + - "machine-learning" + - "normalization" + - "outliers" + - "robust-scaling" + - "scikit-learn" + - "standardization" + - "tensorflow" +--- + +When you are training Machine Learning models, data preprocessing is an important activity. It is sometimes even crucial to the success of your project that your dataset is adequately prepared. **Feature Scaling**, adapting the scales of your features so that they become comparable, can be crucial to the performance provided by the model. Fortunately, there are many methods and techniques that can be applied for Feature Scaling. + +However, if you have a dataset where many outliers are present, especially one of the two most important techniques - Standardization - might not perform so well. This article zooms in on this problem and looks at Robust Scaling, which is a way to overcome this problem. It is structured as follows. Firstly, we will look at why Feature Scaling is important and sometimes even necessary for Machine Learning algorithms - to give you the appropriate context for the rest of the article. We then look at why Feature Scaling with especially Standardization can be difficult when your dataset contains (extreme) outliers. Subsequently, we introduce Robust Scaling, and show you how it works by means of the `RobustScaler` implementation in Scikit-learn. The examples include Scikit-learn models and TensorFlow/Keras models. + +Let's take a look! 😎 + +**Update 08/Dec/2020:** added references to PCA article. 
+ +* * * + +\[toc\] + +* * * + +## Why Feature Scaling is necessary for Machine Learning algorithms + +If you have trained Machine Learning models before, or if you have looked closely at the articles at MachineCurve, you have likely seen lines like these sometimes: + +``` +# Convert into [0, 1] range. +input_train = input_train / 255 +input_test = input_test / 255 +``` + +These lines rescale the (in this case) grayscale input data from the \[latex\]\[0, 255\]\[/latex\] range into the \[latex\]\[0, 1\]\[/latex\] range. It is one of the methods of **Feature Scaling**, which is often necessary for your Machine Learning projects. It involves reducing the _range_ (i.e. the minimum and maximum values) to a small interval, without losing the relationships between the individual samples. In fact, in many cases, you get really weird performance (close-to-zero accuracy or infinite loss with Neural Networks, for example) if you don't apply it. + +But why do you have to apply it in the first place? What makes Machine Learning algorithms or more specifically the datasets so dependent on Feature Scaling? + +Feature Scaling can be necessary because of one or more of the following reasons: + +- Feature Selection and Extraction +- Convergence of your Machine Learning algorithm +- Regularization applied in your algorithm + +Let's take a look at the individual reasons in more detail now, and then introduce Normalization and Standardization for performing Feature Scaling. + +### Feature Selection and Extraction + +Suppose that you have a dataset where two variables are candidates for being the _predictor variable_, i.e. the independent value which is used in the \[latex\]\\text{independent variable} \\rightarrow \\text{dependent variable}\[/latex\] relationship of a predictive model. + +If we would plot those variables, without knowing what they represent, the plot could look like this: + +![](images/gauss0.png) + +Often, if a variable has a big range, its variance is also bigger compared to variables which have a small range. + +> _Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from their average value._ +> +> Wikipedia (2001) + +And variables with greater variance often contribute more significantly to the relationship of a predictive model, for the simple reason that they capture more possible values from the domain of input variables. + +Here's the catch, though: for feature extraction and selection, we often use algorithms like [Principal Component Analysis (PCA)](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/), which are dependent on variable variance for extracting the variables that contribute most significantly to the spread of our dataset. + +But if the variable scales are incompatible and hence cannot be compared, the comparison and hence the application of Feature Selection algorithm is pointless. + +If you want to apply Feature Selection (which is true in many cases), you first want to make your scales comparable. This can be achieved with Normalization or, even more accurately, with Standardization. It is one of the reasons why Feature Scaling can improve your ML model performance: **you will truly find the variables that contribute most.** + +### Machine Learning convergence + +Another reason why you should consider applying Feature Scaling is due to the convergence of your Machine Learning model. 
+ +**Some Machine Learning algorithms are dependent on Feature Scaling should they converge to an optimal solution well, or converge at all.** + +For example, some algorithms utilize distance metrics for learning the [decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) of the model. You can imagine that in a variable \[latex\]X\_1\[/latex\] with a range of \[latex\]\[0, 102500\]\[/latex\] the distances are much bigger compared to a variable \[latex\]X\_2\[/latex\] with a \[latex\]\[0, 1\]\[/latex\] range. Now, should they both be used in generating a prediction (e.g. in a relationship that looks like \[latex\]{{X\_1}, {X\_2}} \\rightarrow y\[/latex\], then much more emphasis will be put on the distances measured for \[latex\]X\_1\[/latex\]. + +This significantly distorts the impact of the other, smaller-range variables, and is another reason why you may wish to apply Feature Scaling. + +### Regularization + +The third reason is related to [regularization](https://www.machinecurve.com/index.php/2020/01/26/which-regularizer-do-i-need-for-training-my-neural-network/), which is used for controlling the weights of the model. For example, L1 (Lasso) regularization ensures that models are sparse, by dropping out weights that contribute insignificantly, while L2 (Ridge) keeps weights small without making models sparse. + +Here's the catch when you use a non-scaled dataset with regularization: applying regularizers involves computing distance metrics. We saw above what happens when distance metrics are computed and the ranges of your variables vary significantly - things go south. In the case of regularization, we should ensure that Feature Scaling is applied, **which ensures that penalties are applied appropriately** (Wikipedia, 2011). + +### Normalization and Standardization for Feature Scaling + +Above, we saw that Feature Scaling can be applied to [normalize or standardize](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) your features. As the names already suggest, there are two main candidates for normalization and standardization: + +- **Min-max normalization:** here, the values are scaled to a \[latex\]\[0, 1\]\[/latex\] range (or possibly an \[latex\]\[a, b\]\[/latex\] range) using the minimum and maximum values. Although simple and very efficient, the means and standard deviations of the variables are still unequal, meaning that they remain a bit incompatible, although the situation improves a lot. +- **Standardization** (or **z-score normalization**): here, for all variables, the mean is brought to zero and the standard deviation to one. This makes the scales fully compatible because the values now express the _differences from the mean in standard deviations_, which are always the same. The technique is best applied with Gaussian data, although it can also work with other data in many cases. Just see for yourself which one works best. + +You can click on the referenced link above to dive into Normalization and Standardization in more detail. In the remainder of this article, we will look at why Feature Scaling using Standardization can become problematic when your dataset contains (extreme) outliers, and how to handle these situations. 
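Before we do, here is a quick reference for how both techniques can be applied with Scikit-learn's `sklearn.preprocessing` module. The small array `X` is just an assumed placeholder for your own feature matrix:

```
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Assumed example: two features with very different ranges
X = np.array([[1.0, 100.0],
              [2.0, 250.0],
              [3.0, 400.0]])

# Min-max normalization: every feature is rescaled into the [0, 1] range
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization (z-score): zero mean and unit variance per feature
X_standardized = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standardized)
```

In a real training pipeline, you would `fit` the scaler on the training data only and re-use it to `transform` the validation and test sets, so that no information leaks between the splits.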
+ +* * * + +## Problems with Feature Scaling with Outliers in your Dataset + +In the article referenced above, we saw that datasets can become surprisingly comparable when Standardization is applied - in this case, the variance of one feature seems to be comparable after Standardization. This means that both variables are equally important, which is not something we thought when we first saw the dataset on the left! + +- ![](images/gauss0.png) + +- ![](images/gauss1.png) + + +As expected, the entire feature space would now be centered around \[latex\](\\mu = 0, \\sigma = 1)\[/latex\]. The new range is approximately \[latex\]\[-3.66, 3.16\]\[/latex\]. + +``` +print(np.mean(X1)) +print(np.std(X1)) +print(f'Range: [{np.min(X1)}, {np.max(X1)}]') + +> 4.920508445138694e-16 +> 1.0000000000000004 +> Range: [-3.65964765666819, 3.1606977253547752] +``` + +Now let's take a look at what happens when we regenerate the dataset but then introduce outliers - in approximately 20% of the cases, the samples are multiplied by 10-25 so that they are really off: + +- ![](images/outliers-1.png) + +- ![](images/outliers2-1.png) + + +The `StandardScaler` used for applying Standardization to this dataset nicely generates a standard dataset centered around \[latex\](\\mu = 0, \\sigma = 1)\[/latex\]. + +But this is misleading if we look at the new range! + +``` +9.090506125630782e-16 +1.0000000000000004 +Range: [-3.549687574954954, 4.685632224899642] +``` + +\[latex\]\[-3.55, 4.69\]\[/latex\]? Compare this to the range of the dataset without outliers, which was \[latex\]\[-3.66, 3.16\]\[/latex\]. + +If this looks strange to you, it is because it _is_. After standardization, it looks like as if the dataset never had outliers in the first place: the 'lowest' values have scantily moved, while the outliers were moved _significantly_ towards zero mean and unit variance. We now have a distorted dataset which could potentially allow you to detect patterns which are not there, by masking the presence of outliers! + +This is why applying Standardization for Feature Scaling can be problematic and must be dealt with appropriately. + +* * * + +## How to perform Feature Scaling with Outliers + +Let's now take a look at how we can perform Feature Scaling when we have (extreme) outliers in our dataset. For doing so, we can apply a technique called _Robust Scaling_, which comes delivered in Scikit-learn by means of the `sklearn.preprocessing.RobustScaler` module. + +### Introducing the RobustScaler + +Where Z-score Standardization removes the mean and then divides by the standard deviation to ensure that the mean is zero and the scales become comparable in terms of standard deviation, we saw that this _overperforms_ in the case of outliers. + +Using the `RobustScaler` in Scikit-learn, we can overcome this problem, by scaling the dataset appropriately - reducing the range, while keeping the outliers, so that they keep contributing to feature importance and model performance. + +> This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile). +> +> Scikit-learn (n.d.) + +- ![](images/robust.png) + +- ![](images/robust2.png) + + +### Using the RobustScaler with Python for Scikit-learn and TensorFlow models + +Applying Robust Scaling with the `RobustScaler` is really easy and works both for Scikit-learn and TensorFlow models. 
Suppose that we generate the originally Gaussian data from the plots above, and then stretch one of the axes by `2.63` and then stretch 20% of the data more by multiplying it with a number between \[latex\]\[10, 25\]\[/latex\]. We then have a dataset available in `X1` which is also what we would have when e.g. training a [Scikit-learn](https://www.machinecurve.com/index.php/2020/11/12/using-error-correcting-output-codes-for-multiclass-svm-classification/) or a [TensorFlow](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) model. + +``` +# Imports +from sklearn.datasets import make_gaussian_quantiles +from sklearn.preprocessing import RobustScaler +import random +import numpy as np + +# Make Gaussian data +X1, Y1 = make_gaussian_quantiles(n_features=2, n_classes=2, n_samples=1000, mean=(2,3)) + +# Stretch one of the axes with random numbers +x_new = [] +X1[:, 1] = 2.63 * X1[:, 1] +for x in X1[:, 1]: + outlier_generation = random.uniform(0, 1) + if outlier_generation >= 0.80: + x = x * random.uniform(10, 25) + x_new.append(x) +X1[:, 1] = x_new + +# Robustly Scale Gaussian data +scaler = RobustScaler() +scaler.fit(X1) +X1 = scaler.transform(X1) + +# Print statistics +print(np.mean(X1)) +print(np.std(X1)) +print(f'Range: [{np.min(X1)}, {np.max(X1)}]') +``` + +Applying the `RobustScaler` is then really easy: + +- Firstly, we initialize the `RobustScaler` and assign it to `scaler`. +- Secondly, we `.fit` the `X1` dataset to the `scaler` variable. +- We then `.transform` the `X1` dataset and re-assign it to `X1`. + +Printing the statistics then yields the following values for mean, standard deviation and range: + +``` +2.4222758827852466 +7.942996751183663 +Range: [-2.500445964741244, 61.19964933941751] +``` + +Not entirely zero mean and unit variance, but much better than what it was _before_ applying Feature Scaling that is robust to outliers: + +``` +18.09362233921902 +45.25650019182367 +Range: [-1.9328478564634346, 379.4841869132472] +``` + +* * * + +## Summary + +Feature Scaling is a very important data preprocessing task when training Machine Learning models. It can be critical to the success of your Feature Selection algorithms, if you apply any, but also to the convergence of your Machine Learning algorithm and to the regularizer, if applied. That's why Normalization and Standardization are heavily used in many Machine Learning models. + +Unfortunately, many datasets do however contain outliers, and especially Standardization is not robust to these outliers, significantly masking their significance and possibly giving you a model that performs due to false reasons. + +Robust Feature Scaling by means of the `RobustScaler` in Scikit-learns can help you fix this issue. By scaling data according to the quantile range rather than the standard deviation, it reduces the range of your features while keeping the outliers in. In this article, we also looked at how we can implement Robust Scaling with Scikit-learn, and use it for Scikit-learn and TensorFlow models. + +I hope that you have learned something from this article. If you did, please feel free to leave a message in the comments section below 💬 Please do the same when you have other comments or questions. I'd love to hear from you! Thank you for reading MachineCurve today and happy engineering 😎 + +* * * + +## References + +Wikipedia. (2001, June 30). _Variance_. Wikipedia, the free encyclopedia. 
Retrieved November 18, 2020, from [https://en.wikipedia.org/wiki/Variance](https://en.wikipedia.org/wiki/Variance) + +Wikipedia. (2011, December 15). _Feature scaling_. Wikipedia, the free encyclopedia. Retrieved November 18, 2020, from [https://en.wikipedia.org/wiki/Feature\_scaling](https://en.wikipedia.org/wiki/Feature_scaling) + +Scikit-learn. (n.d.). _Sklearn.preprocessing.RobustScaler — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved November 19, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler) + +Scikit-learn. (2020, November 19). _How to normalize or standardize a dataset in Python? – MachineCurve_. MachineCurve. [https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) diff --git a/quick-and-easy-gpu-tpu-acceleration-for-pytorch-with-huggingface-accelerate.md b/quick-and-easy-gpu-tpu-acceleration-for-pytorch-with-huggingface-accelerate.md new file mode 100644 index 0000000..fd8726a --- /dev/null +++ b/quick-and-easy-gpu-tpu-acceleration-for-pytorch-with-huggingface-accelerate.md @@ -0,0 +1,292 @@ +--- +title: "Quick and easy GPU & TPU acceleration for PyTorch with HuggingFace Accelerate" +date: "2022-01-07" +categories: + - "deep-learning" + - "frameworks" + - "geen-categorie" +tags: + - "acceleration" + - "deep-learning" + - "gpu" + - "huggingface" + - "machine-learning" + - "tpu" +--- + +Deep learning benefits from Graphical Processing Units (GPUs) and Tensor Processing Units (TPUs) because of the way they handle the necessary computations during model training. GPU and TPU based acceleration can thus help you speed up your model training process greatly. + +Unfortunately, accelerating your PyTorch model on a GPU or TPU has quite a bit of overhead in native PyTorch: you'll need to assign the data, the model, the optimizer, and so forth, to the `device` object that contains a reference to your accelerator. It's very easy to forget it just once, and then your model breaks. + +In today's article, we're going to take a look at **HuggingFace Accelerate** - a PyTorch package that abstracts away the overhead and allows you to accelerate your neural network with only a few lines of Python code. In other words, it allows you to **quickly and easily accelerate your deep learning model with GPU and TPU**. + +Let's take a look! :) + +* * * + +\[toc\] + +* * * + +## What is HuggingFace Accelerate? + +If you're familiar to the machine learning world, it's likely that you have heard of HuggingFace already - because they are known for their [Transformers library](https://www.machinecurve.com/index.php/getting-started-with-huggingface-transformers/). HuggingFace itself is a company providing an AI community "building the future of AI". + +And that's why they provide [a lot more libraries](https://github.com/huggingface) which can be very useful to you as a machine learning engineer! 
+ +In today's article, we're going to take a look at **quick and easy** **accelerating** for your **PyTorch deep learning model** using your **GPU or TPU.** + +This can be accomplished with `accelerate`, a [HuggingFace package](https://github.com/huggingface/accelerate) that can be described in the following way: + +> 🚀 A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision. +> +> GitHub (n.d.) + +Who doesn't want to benefit from speed when you have the hardware available? + +Let's continue by looking at how it works :D + +* * * + +## How to install HuggingFace Accelerate? + +Installing HuggingFace is very easy. Obviously, you will need to have a recent install of Python and PyTorch (the package was tested with Python 3.6+ and PyTorch 1.4.0+). Then, it's only the execution of a `pip` command: + +``` +pip install accelerate +``` + +* * * + +## Easy GPU/TPU acceleration for PyTorch - Python example + +Now that you have installed HuggingFace Accelerate, it's time to accelerate our PyTorch model 🤗 + +Obviously, a model is necessary if you want to accelerate it, so that is why we will use a model that we created before, [in another blog article](https://www.machinecurve.com/index.php/2021/01/26/creating-a-multilayer-perceptron-with-pytorch-and-lightning/). It's a simple Multilayer Perceptron that is trained for classification with the CIFAR-10 dataset, and you will find an explanation as to how it works when clicking the link. + +Today, however, we will simply use it for acceleration with HuggingFace Accelerate. Here, you can find the code - which, as you can see, has no references to `cuda` whatsoever and hence runs on CPU by default: + +``` +import os +import torch +from torch import nn +from torchvision.datasets import CIFAR10 +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(32 * 32 * 3, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare CIFAR-10 dataset + dataset = CIFAR10(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. 
+ print('Training process has finished.') +``` + +The first thing that you will need to do is ensuring that HuggingFace `accelerate` is imported. You can do this by adding the following to the imports: + +``` +from accelerate import Accelerator +``` + +Immediately afterwards, you then initialize the accelerator: + +``` +accelerator = Accelerator() +``` + +That's pretty much it when it comes to loading stuff, you can now immediately use it by accelerating the model (`mlp`), the optimizer (`optimizer`) and `DataLoader` (`trainloader`) - just before the training loop of your MLP: + +``` + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Accelerate the model, optimizer and trainloader + mlp, optimizer, trainloader = accelerator.prepare(mlp, optimizer, trainloader) +``` + +Now, the only thing you will need to do is changing the backward pass by the functionality provided by the accelerator, so that it is performed in an accelerated way: + +``` + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + accelerator.backward(loss) +``` + +That's it - here's the full code if you want to get started straight away :) + +``` +import os +import torch +from torch import nn +from torchvision.datasets import CIFAR10 +from torch.utils.data import DataLoader +from torchvision import transforms +from accelerate import Accelerator + +accelerator = Accelerator() + +class MLP(nn.Module): + ''' + Multilayer Perceptron. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(32 * 32 * 3, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare CIFAR-10 dataset + dataset = CIFAR10(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Accelerate the model, optimizer and trainloader + mlp, optimizer, trainloader = accelerator.prepare(mlp, optimizer, trainloader) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + accelerator.backward(loss) + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +That's it! + +You have accelerated your PyTorch model by letting it use your GPU or TPU when available! + +If you have any questions, comments or suggestions, feel free to leave a message in the comments section below 💬 I will then try to answer you as quickly as possible. 
For now, thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +GitHub. (n.d.). _Huggingface/accelerate: 🚀 a simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision_. [https://github.com/huggingface/accelerate](https://github.com/huggingface/accelerate) diff --git a/random-initialization-vanishing-and-exploding-gradients.md b/random-initialization-vanishing-and-exploding-gradients.md new file mode 100644 index 0000000..f1d092c --- /dev/null +++ b/random-initialization-vanishing-and-exploding-gradients.md @@ -0,0 +1,241 @@ +--- +title: "Vanishing and exploding gradients" +date: "2019-08-30" +categories: + - "deep-learning" +tags: + - "deep-learning" + - "exploding-gradients" + - "initializers" + - "neural-networks" + - "vanishing-gradients" + - "weight-initialization" +--- + +Neural networks must be initialized before one can start training them. As with any aspect of deep learning, however, there are many ways in which this can be done. Random initialization of the neural weights is one of those ways. In fact, it is quite often suggested as being _the_ way of initializing your neural networks. + +This might however not exactly be the case due to two problems: the _vanishing gradients problem_ and the _exploding gradients problem_. In this blog, we'll take a look at those problems and will find means to overcome them to a great extent. + +Before we can do that, we must however first provide a small recap on the necessity for weight initialization in the first place. This must be followed by a discussion on random initialization and how that is achieved. Once we understand how neural networks are optimized, we can introduce the two problems and the possible solutions. + +Let me know in the comments if you've got any remarks, questions or tips. Thanks! :-) + +**Update 11/Jan/2021:** checked correctness of the article and updated header information. + +\[toc\] + +\[ad\] + +## The necessity for weight initialization + +I always think that a little bit of context must be provided before we move to the details. + +The context in this case would be as follows: why is weight initialization necessary in the first place? + +Although I primarily wrote on weight initialization [in another blog post](https://machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/), I will briefly cover it here again. + +Put very simply, a neural network is composed of various neurons. Those neurons are a combination of a linear operation that I call _vector dot product plus bias_ and a possibly nonlinear operation called the _activation_. + +In this latter, also known as the activation function, nonlinearity is added to the linear output of the linear operation. If this wouldn't be done, the neural network would not perform better than a linear one - and all the progress that has occurred over the previous years wouldn't have been possible. + +We'll cover activation functions in more detail in a later blog. + +The first part, the linear operation itself, is what is interesting today. During this operation, a so-called _input vector_ is multiplied with a _weights vector_, after which a bias value is added to the outcome of this multiplication. Let's break the vectors apart slightly more: + +- The **input vector** contains the sample you currently feed into the neural network. In the first layer, this is the actual data, in subsequent layers, it contains the outputs of the neurons in the previous layer. 
- The **weights vector** contains the idiosyncrasies, or unique patterns, that the neuron has learnt from the data. It is essentially how the neural network learns: because each neuron is capable of learning a subset of the patterns hidden in the data, the network as a whole can identify many of them.

However, before the training process starts, all weights vectors must be initialized - that is, configured with some numbers. They simply cannot be empty, because an empty vector cannot be multiplied properly. As you have probably guessed by now, there are many initializers... of which _random initialization_ is one of the most widely known ones.

\[ad\]

## Random initialization

Random initialization, as you would have probably guessed by now, initializes the weights randomly ;-)

There exist two ways to achieve random initialization: by means of a normal distribution and a uniform distribution.

### Uniform distribution

This is the uniform distribution:

[![](images/Uniform_Distribution_PDF_SVG.svg_-1024x732.png)](https://machinecurve.com/wp-content/uploads/2019/08/Uniform_Distribution_PDF_SVG.svg_.png)

The uniform distribution. Thanks to the creator of this [work](https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)#/media/File:Uniform_Distribution_PDF_SVG.svg): © [IkamusumeFan](https://commons.wikimedia.org/wiki/User:IkamusumeFan) at Wikipedia, licensed under [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/legalcode).

Don't be scared, it's actually really easy to interpret :-)

\[mathjax\]

What you see is the _probability density_ of the uniform distribution, and it essentially says this: when I draw a number randomly, the density is \[latex\]1/(b-a)\[/latex\] for values in the range \[latex\]a <= x <= b\[/latex\] and 0 for values outside this range. In other words, every value between \[latex\]a\[/latex\] and \[latex\]b\[/latex\] is equally likely, and values outside that range never occur.

Fun fact: this is a continuous distribution. That is, there is an infinite amount of real numbers in the interval specified above. By consequence, the probability that you find one specific _number_ is 0. [Read here why](https://stats.stackexchange.com/questions/60702/why-is-the-probability-zero-for-any-given-value-of-a-normal-distribution).

Usually, it is possible to give as input the following variables when configuring the uniform distribution for deep learning:

- The **minimum value** that should be selected.
- The **maximum value** that should be selected.
- A **seed number** to fix the random number generator. Seeding is sometimes necessary because random number generators aren't random; they're [pseudo-random](https://curiosity.com/topics/why-computers-can-never-generate-truly-random-numbers-curiosity/). Hence, you'll want to have the same peculiarities of pseudo-randomness (i.e., deviations from true randomness) every time you use the generator, because otherwise your weights would be initialized differently on every run.

The minimum value in this case is \[latex\]a\[/latex\] and the maximum value is \[latex\]b\[/latex\].

### Normal distribution

This is the normal distribution:

[![](images/1920px-Normal_Distribution_PDF.svg_-1024x654.png)](https://machinecurve.com/wp-content/uploads/2019/08/1920px-Normal_Distribution_PDF.svg_.png)

Credits: [Inductiveload at Wikipedia](https://commons.wikimedia.org/wiki/User:Inductiveload)

Like the uniform distribution, the normal distribution is a continuous one as well.

It's in fact one of the most widely used probability distributions; many natural phenomena can be described according to this distribution, if configured properly. A small illustrative sketch of drawing initial weights from both distributions is shown below.
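Before we look at how the normal distribution is configured, here is a small sketch - added for illustration, with arbitrary parameter values - of what drawing a weights vector from both distributions could look like in NumPy:

```
# Illustrative sketch: drawing five initial weights from a uniform and from a
# normal distribution. Ranges, mean, standard deviation and seed are arbitrary.
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed: reproducible pseudo-randomness

uniform_weights = rng.uniform(low=-0.05, high=0.05, size=5)  # a = -0.05, b = 0.05
normal_weights = rng.normal(loc=0.0, scale=0.05, size=5)     # mean 0, std 0.05

print(uniform_weights)
print(normal_weights)
```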
For the normal distribution, specifically, one can configure the **mean** and the **standard deviation**, and once again **seed** the distribution to a specific (pseudo-)random number generator.

If you've had some statistics, you probably know what mean and standard deviation are. If not, [check this out](http://www.ltcconline.net/greenl/courses/201/descstat/mean.htm).

Fun fact: compared with the uniform distribution, where you manually configure the _range_ of possible values, you don't do that with the normal distribution.

Theoretically, that means that you could find any real number as a result. However, as you can see in the image above - e.g. in the standard normal distribution displayed in red - it is most likely that your number will fall within the \[-3, +3\] range.

\[ad\]

### Which distribution to choose

Now that you know that both the uniform and the normal distribution are used quite often in deep neural networks, you may wonder: which distribution to use, then?

...if you would choose to initialize them randomly, of course.

A post on [StackExchange](https://datascience.stackexchange.com/a/13362) answers this question for us: it seems to be the case that it doesn't really matter.

Or at least, that it's very unclear whether one is better than the other.

In fact, the author of that post points to the [Glorot](http://www.jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) and [He](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf) 'initialization papers' - in which Glorot et al. and He et al. discuss the problems with random initialization - and notes that the two papers simply use one distribution each: the Glorot paper uses a uniform distribution and the He paper a normal one.

The choice of statistical distribution for randomly initializing your weights thus seems to be up to you.

...if you would initialize them randomly, of course.

Because random initialization itself can become problematic under some conditions: you may then face the _vanishing gradients_ and _exploding gradients_ problems. Before we introduce those, we take a brief look at how most neural networks are optimized.

\[ad\]

## Optimizing neural networks: gradient descent & backprop

When you train a neural network, you essentially feed it data for which it makes a prediction, computing the error - or loss - afterwards.

This is called a forward pass of your data.

However, one iteration comprises a forward and a backwards pass.

Now, you can view the loss as a mathematical function that is highly characteristic of your data. Functions can be optimized, i.e., their minimum can be found.

And where does the model perform best? At minimum loss, of course.

### Computing gradients

So by computing the derivative of the loss function, you arrive at a gradient for improving the _final hidden layer_ of your neural network. By moving your weights slightly in the direction opposite to the gradient, your model is expected to 'walk' a little bit towards the loss minimum, and hence improve.

We call this gradient descent, and we often use the [stochastic](https://datascience.stackexchange.com/a/36451) one.

### Chaining gradients

However, a neural network consists of multiple layers, not just one.

We cannot simply take the gradient again for the last-but-one hidden layer, since it is intrinsically connected to the last one.
Hence, when computing the gradient for this layer, you always need to include the gradient that was already computed for the layer behind it - the gradients are chained together by means of the chain rule.

For the next layer, you'll repeat this process, but then also including the last-but-one hidden layer.

And so on.

You're essentially creating a chain of gradients, which you will multiply to find the gradient for improving the current layer.

We call this backpropagation.

As you can see, optimizing a neural network thus comprises:

- A forward pass of the data, computing the current error or _loss_;
- A backwards pass of the data, computing the improvement by means of
    - Backpropagation for computing the gradient given the layer you wish to optimize.
    - (Stochastic) gradient descent or a more advanced optimizer that takes the gradient and moves the neural network weights into the right direction, essentially walking down the 'mountain of loss'.

\[ad\]

## Vanishing gradients

Chaining gradients by multiplying them to find the gradient for an arbitrary layer presents you with a weird peculiarity: the so-called vanishing gradients problem.

Take the normal distribution as an example: the majority of the values you draw from it are relatively small in magnitude. In fact, the odds are largest that you randomly select a number that is larger than -1 and smaller than 1, i.e. \[latex\]-0.9999999999(..) < x < 0.99999999999(..)\[/latex\].

![](images/1920px-Normal_Distribution_PDF.svg_-1024x654.png)

Suppose that all your neurons output \[latex\]0.1\[/latex\] - a bit strange, but it makes reasoning about vanishing gradients easier. Suppose that you have some layers and a gradient improvement of 0.03. Five layers upstream, with an activation function that outputs between 0 and 1, the gradient improvement for the sixth given the others could be something like 0.1 x 0.1 x 0.1 x 0.1 x 0.1 x 0.03.

And that's a very small number: 0.0000003.

The vanishing gradients problem thus means that your most upstream layers will learn _very slowly_, because essentially the computed gradient is _very small_ due to the way the gradients are chained together.

In practice, that could mean that you need near-infinite time and computing resources to end up at the optimum, i.e. the minimum loss.

And we simply don't want that - we want the best possible model.

\[ad\]

## Exploding gradients

Similarly, the _exploding gradients problem_ may happen during training. Essentially, certain neurons die off because they experience what is known as an overflow - the number becomes too large to be handled by computer memory.

Why does this occur?

Suppose that you initialize your weights randomly. It does not really matter which initializer you use. You can then imagine that very likely, the behavior produced by the random weights during the forward pass generates a very large loss, simply because it does not match the underlying data distribution at all.

What happens? The weight update - that is, the gradient - could be really large. This is especially the case when random numbers are drawn that are > 1 or < -1. Due to the same chaining effect, but now with numbers larger than 1, we get into trouble: instead of numbers that keep getting smaller, we're observing numbers that keep getting bigger.

And eventually, this causes a number overflow with NaNs, or Not-a-Number, as a result. The effect: your learning process is severely hampered.

## What to do against these problems?

Fortunately, there's a fix.
Thanks to certain scientists - particularly the authors of the [Glorot](http://www.jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) and [He](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf) papers - more advanced initializers have emerged that could be of help in mitigating the vanishing _and_ exploding gradients.

### Xavier and He initialization

These initializers, which are known as the Xavier (or Glorot) and He initializers and available in e.g. [Keras](https://keras.io/initializers/), essentially do one thing: they scale the randomly drawn weights in such a way that the variance of each layer's outputs - and of the gradients flowing back through it - stays close to 1 across the layers.

This way, the chained gradients neither shrink away nor blow up, and the problems are avoided to a great extent.

The two initializers differ in how they scale the drawn weights to achieve this. By consequence, they are best used with different activation functions. Specifically, He initialization was developed for ReLU-activated networks and is therefore best used on those. For others, Xavier (or Glorot) initialization generally works best.

\[ad\]

### Experiment!

Despite all those mitigation techniques that work in theory, there is one piece of advice that is generally true for data science and by consequence machine learning projects: **experiment!**

Find out what works and see what fails - and adapt your approach based on what you see. At the same time, try to understand what happens inside of your black box, to derive more generic insights from your observations that can be re-used later.

Put very simply: all theory is nice, but it has to work for you. And there is only one way towards that, i.e. learning by doing.

## Recap

In this blog, we've seen how random initialization works and why it's better than all-zeros initialization. We also know why it is necessary in the first place. However, we were also introduced to some fundamental problems with random initialization, being the vanishing and exploding gradients problems. By applying more advanced initializers, like the He initializer or the Xavier (or Glorot) initializer, we may eventually avoid these problems and arrive at a well-performing model.

I truly hope that you've learnt something from this blog post. I would highly appreciate your comment below 👇😎 Please let me know if you have any questions, any remarks or suggestions for improvement. I'd be happy to apply these, since only that way we can arrive at the best possible post.

Thanks and happy engineering! 😄

## References

Alese, E. (2018, June 10). The curious case of the vanishing & exploding gradient. Retrieved from [https://medium.com/learn-love-ai/the-curious-case-of-the-vanishing-exploding-gradient-bf58ec6822eb](https://medium.com/learn-love-ai/the-curious-case-of-the-vanishing-exploding-gradient-bf58ec6822eb)

Glorot, X., & Bengio, Y. (2010). _Understanding the difficulty of training deep feedforward neural networks_. Paper presented at International Conference on Artificial Intelligence and Statistics, Sardinia, Italy.

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. _2015 IEEE International Conference on Computer Vision (ICCV)_. [doi:10.1109/iccv.2015.123](http://doi.org/doi:10.1109/iccv.2015.123)

Keras. (n.d.). Initializers. Retrieved from [https://keras.io/initializers/](https://keras.io/initializers/)

When to use (He or Glorot) normal initialization over uniform init? And what are its effects with Batch Normalization? (n.d.).
Retrieved from [https://datascience.stackexchange.com/questions/13061/when-to-use-he-or-glorot-normal-initialization-over-uniform-init-and-what-are/13362#13362](https://datascience.stackexchange.com/questions/13061/when-to-use-he-or-glorot-normal-initialization-over-uniform-init-and-what-are/13362#13362) + +Yadav, S. (2018, November 9). Weight Initialization Techniques in Neural Networks. Retrieved from [https://towardsdatascience.com/weight-initialization-techniques-in-neural-networks-26c649eb3b78](https://towardsdatascience.com/weight-initialization-techniques-in-neural-networks-26c649eb3b78) diff --git a/reducing-trainable-parameters-with-a-dense-free-convnet-classifier.md b/reducing-trainable-parameters-with-a-dense-free-convnet-classifier.md new file mode 100644 index 0000000..6003026 --- /dev/null +++ b/reducing-trainable-parameters-with-a-dense-free-convnet-classifier.md @@ -0,0 +1,480 @@ +--- +title: "Reducing trainable parameters with a Dense-free ConvNet classifier" +date: "2020-01-31" +categories: + - "deep-learning" + - "frameworks" +tags: + - "convolutional-neural-networks" + - "dense" + - "global-average-pooling" + - "max-pooling" + - "neural-networks" +--- + +When you Google around for questions like "how to create an image classifier", it's possible that you end up on pages which explain how to create such neural networks with e.g. Keras. In pretty much all of the cases, you'll see that there is a fixed structure for creating those networks: + +_You'll use convolutional layers as feature extractors while you use Dense layers for generating the classification._ + +Did you however know that you can also take a different approach, which may be less intense in terms of computational requirements? Did you know that you might not even lose much predictive performance while doing so? + +Replacing the Dense layers with a Global Average Pooling based model does the trick. And this blog post shows you how it's done by means of an example model. + +But firstly, we'll take a look at using Global Average Pooling in theory. What are pooling layers? What does Global Average Pooling do and why can it be useful for replacing Dense layers when creating a classifier? We must understand these questions first before we actually start writing some code. + +However, code is included - don't worry. By means of a Keras model using TensorFlow 2.0, we build a classifier step by step, providing explanations for each part of the model. Finally, we validate the model, and show you the results. + +Are you ready? Let's go! 😎 + +**Update 05/Nov/2020:** removed channels first/last check for CNTK/Theano/TF backends because Keras is now tightly integrated with TF 2.x; rendering the check obsolete. In other words, made the code compatible with TensorFlow 2.x. + +* * * + +\[toc\] + +* * * + +## Using Global Average Pooling to replace Dense layers + +Before we begin, I think it's important to take a look at the concept of pooling - and specifically Global Average Pooling - first. It's only going to be a brief introduction so as to save you time ([click here if you want to read a more detailed discussion](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/)). However, with this understanding, I think you can better understand what happens in the code later, and why. + +Let's begin our analysis with what pooling layers are. + +### What are pooling layers? 
+ +When training a [convolutional neural network](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/), your goal is to build a spatial hierarchy of increasingly abstract representations of your input data. Doing so allows you to feed new, but slightly different input, without any consequence for the classifier or regression model - because the convolutional layers, a.k.a. the feature extractors, have still produced intermediate outputs that are highly similar. + +So, in short, your goal is to build a hierarchy that is similar to the one on the left, versus the one on the right: + +![](images/hierarchies.png) + +Convolutional layers partially achieve this by downsampling the inputs. However, they are expensive, as each layer contains _trainable parameters_ which must be optimized during training for the layer to be useful. + +But downsampling _is_ necessary in order to achieve a spatial hierarchy like the one we chose above. How do we achieve this? + +Often, we take a look at pooling layers for this purpose. Pooling layers create a small "pool" of data (often a few by a few pixels), which slides over the input data. That's similar to convolutional layers - they do the same - but what happens _inside the pool_ is different. Rather than a pairwise multiplication between the input vector and the learnt weights vector (explaining the relative computational expensiveness of the layer sketched above), a cheap operation such as `max` is performed. Indeed, Max Pooling is one of the most widely used pooling layers: + +[![](images/Max-Pooling-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Max-Pooling-1.png) + +As a result of these pools, their sliding process and the cheap operation, they achieve _downsampling_ as well - but in a much cheaper way: + +[![](images/Max-Pooling-2.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Max-Pooling-2.png) + +_Side note: there are additional benefits when using Max Pooling. Take a look at the [blog post scrutinizing it in more detail](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/#max-pooling) if you wish to understand which ones they are._ + +### What does Global Average Pooling do? + +Another form of pooling is the so-called Global Average Pooling. It's different from Max Pooling in two ways: + +- The size of the pool equals the size of the input data. +- Instead of a `max` operation, an `avg` operation is performed. Rather than taking the brightest value (which yields sensitivity to noise), the average of the input is taken, smoothing everything together. + +Visually, this looks as follows: + +[![](images/Global-Average-Pooling-2.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Global-Average-Pooling-2.png) + +### Why can Global Average Pooling replace Dense layers in classifiers? + +Traditionally, ConvNet based classifiers work as follows: + +- The convolutional layers serve as feature extractors, learning the features of importance in your input data. +- A Flatten layer is used to convert the multidimensional data into one-dimensional format. +- This allows Dense or densely-connected layers to take the input and generate a class based prediction, often using the [Softmax activation function](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/). 
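Before we make that replacement concrete, here is a tiny numeric illustration - added as a sketch, not part of the original article - of the two pooling operations described above, computed with plain NumPy:

```
# Small numeric sketch: 2x2 max pooling vs. global average pooling applied to
# a single, made-up 4x4 feature map.
import numpy as np

feature_map = np.array([
    [1., 2., 0., 1.],
    [4., 3., 1., 0.],
    [0., 1., 5., 2.],
    [2., 0., 3., 1.],
])

# 2x2 max pooling with stride 2: keep the maximum of every 2x2 block
max_pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(max_pooled)           # -> [[4. 1.], [2. 5.]]: downsampled to 2x2

# Global average pooling: the pool covers the whole map, so one value remains
print(feature_map.mean())   # -> 1.625
```

In Keras, these operations are available as the `MaxPooling2D` and `GlobalAveragePooling2D` layers, which we will use below.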
+ +Global Average Pooling can be used to replace these Dense layers in classifiers. + +By removing the Dense and Flatten layers, we have a model left with a set of convolutional layers that serve as feature extractors. Now, we can do this: + +- Add another Conv layer with active padding, which learns \[latex\]N\[/latex\] feature maps, where \[latex\]N\[/latex\] is the number of target classes. +- Add a Global Average Pooling layer, which transforms the \[latex\]W x H\[/latex\] feature maps into 1 x 1 maps, effectively producing "class predictions" that are not yet interrelated (like in the final Dense layer before they are fed to Softmax). +- Add a Softmax layer, which generates a multiclass probability distribution over the feature maps and by consequence the target classes. + +There we go: we have a ConvNet for classification which does not use Dense layers! As we'll see, it significantly reduces the number of trainable parameters, and yields quite adequate results. However, let's first introduce the dataset and full model architecture :) + +* * * + +## Creating the model + +Now that we understand the theory sufficiently, we can move on to the practical part: creating the model. Let's take a look at today's dataset first. Then, we inspect the model architecture in more detail. Subsequently, we'll start writing some code! :) + +### Today's dataset + +As today's dataset, we'll be using the KMNIST dataset from our [Extra Keras Datasets module](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/). This module makes available various additional datasets in the style of Keras `load_data` functionality. This way, you can use a variety of datasets in your models quite easily, allowing you to try different datasets than e.g. MNIST all the time. + +With regards to the dataset itself: the KMNIST dataset, as you can see, replaces MNIST digits with Japanese characters. It's a drop-in dataset for MNIST: it has the same number of output classes (10) and the same number of samples (60k in total for training). + +[![](images/kmnist-kmnist.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/kmnist-kmnist.png) + +### Model architecture + +[![](images/model-137x300.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/model.png) + +Let's take a look at the model that we will be creating today: a [Convolutional Neural Network](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) for image classification. + +On the right, you see the architecture of the particular model. Click on the image to make it larger, so that we can look at it in more detail. + +Obviously, an `InputLayer` takes in the data. The input shape, as we will see, is `(28, 28, 1)`, as the images that we will be feeding the model are 28 x 28 pixels and have one image channel only. + +We then have two convolutional blocks: a `Conv2D` layer followed by `MaxPooling2D` for [downsampling](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/) and `Dropout` for [regularization](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/). Finally, the data is added to another `Conv2D`, which generates a number of feature maps equal to the number of classes, and then to `GlobalAveragePooling2D`, which generates the average value for each feature map. 
Since this is output to a [`Softmax` activation function](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/), we get the same probability distribution as we would get with [a classic Flatten/Dense based structure](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/). + +### What you'll need to run this model + +You need to have a few software dependencies installed if you wish to successfully run this model: + +- **TensorFlow**, and this must be version 2.0+ +- **Matplotlib**, for generating plots. +- The [**Extra Keras Datasets**](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/) module, but you can also use `tensorflow.keras.datasets.mnist` instead. + +All right - let's start writing some code! 😎 Open some code editor and create a file. For example, name it `model_no_dense.py`. If possible, enable Python syntax checking. Now, let's go! + +### Model imports + +First, we add the imports for our model. Given the fact that it is suggested to use the TensorFlow built-in facilities for Keras since TensorFlow released TF 2.0, we'll do precisely that. For this reason, we import Keras from Tensorflow: `import tensorflow.keras`. + +Subsequently, we import the `kmnist` dataset from our `extra_keras_datasets` module. + +From Keras itself, we subsequently import the `Sequential` API, allowing us to stack the individual layers nicely. Speaking about layers, we use a few of them: first of all the `Conv2D`, `MaxPooling2D` and `GlobalAveragePooling2D` layers, as we discussed above - they're the real workhorses here. Additionally, we import `Dropout` and `Activation`, as they'll complement the others and finalize the architecture of our model. + +We finally import Matplotlib for generating some plots towards the end of this blog post. + +``` +import tensorflow.keras +from extra_keras_datasets import kmnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Activation, Dropout +from tensorflow.keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D +import matplotlib.pyplot as plt +``` + +### Model configuration + +Next, we configure the model. Image width and image height are 28 x 28 pixels, the batch size is 25 (which is on the low end of the spectrum, but given the low memory requirements for the KMNIST dataset this is perfectly fine) and the number of epochs - or iterations - is 25. This is also low, but for demonstration purposes this is fine too (and as we will see, our control model will converge to very good values pretty quickly). 20% of our training data will be used for validation purposes, and given our setting for verbosity mode all output is displayed on screen. + +``` +# Model configuration +img_width, img_height = 28, 28 +batch_size = 25 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +``` + +### Loading and preparing the dataset + +Now that we have imported our dependencies and set the configuration options for the model, it's time to import the data. We do so by calling the `load_data()` definition from our [module](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/): + +``` +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = kmnist.load_data() +``` + +This imports the data into the four variables. + +It's now necessary to reshape the input data and to set the input shape for the model differently. 
We can do so as follows, with default code provided by the Keras team on GitHub: + +``` +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) +``` + +We then convert the numbers into `float32` format, which presumably speeds up the training process: + +``` +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') +``` + +And subsequently normalize the data to bring it closer to the \[latex\] \[-1, 1\] \[/latex\] range: + +``` +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 +``` + +Finally, in terms of data preparation, we convert our target vectors - which are provided in integer format - into categorical format, by generating one-hot encoded vectors with `to_categorical`. This allows us to use [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) during the optimization process. + +``` +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) +``` + +### Defining the model architecture + +Next, we code the model architecture. It's equal to what we discussed above, so the discussion here won't be very lengthy. What wasn't included above, is that we use [ReLU](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) as an activation function on our hidden layers. + +What's more, the Conv2D layer which converts the 64 feature maps into `no_classes = 10` ones uses `padding='same'`, which ensures that the _size_ of the feature maps remains equal to the ones generated by the previous layer (we do so because we aim to compare the performance of this model with a regular Dense based CNN later, and we wish to keep the data shapes equal). + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(no_classes, kernel_size=(3, 3), padding='same', activation='relu')) +model.add(GlobalAveragePooling2D()) +model.add(Activation('softmax')) +``` + +### Compiling the model and starting the training process + +At this point, we have a prepared dataset, a set of configuration options and a model architecture. So, in short, we have a skeleton, but no instantiated model yet. That's what we do next, by calling `model.compile` with the configuration options that we specified earlier: + +``` +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) +``` + +Accuracy is added as an additional metric because it's much more intuitive for humans. + +Now that we have a compiled model, we can actually fit the data, i.e. start the training process. 
Once again, this is done in line with the configuration options that we defined above: + +``` +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +### Generating evaluation metrics + +The Keras training process will output various metrics on screen during training. It will display training loss and (in our case) accuracy, and it will do the same for the validation data. However, this is data _that the model has seen during training_. While they are very good indicators for the _predictive_ power of our model, they cannot be relied upon for telling us how well it _generalizes to new data_. + +This is why we split the dataset into `input_train` and `input_test` before. The testing data, which the model hasn't seen yet during training, can be used for evaluating the model. We can do this by calling `model.evaluate`: + +``` +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +This is it! You have a working Keras model now 😎 Open up an command prompt with which you can access the required software dependencies. Navigate to the folder where your file is located using `cd`, and then execute the script with Python: `python model_no_dense.py`. You should see the training process begin :) + +### Full model code + +Should you wish to obtain the full code at once instead, here you go: + +``` +import tensorflow.keras +from extra_keras_datasets import kmnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Activation, Dropout +from tensorflow.keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D +import matplotlib.pyplot as plt + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 25 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = kmnist.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(no_classes, kernel_size=(3, 3), padding='same', activation='relu')) +model.add(GlobalAveragePooling2D()) +model.add(Activation('softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate 
# generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
```

* * *

## Results with and without Dense

Obviously, we trained the model ourselves too! 😋 Let's now take a look at some results.

Despite showing that it _is possible_ to create a CNN based classifier without Dense layers, we're also interested in _how well it works_. For this, we need to compare the model we defined above with some other model; preferably, this is a classic one. Hence, in this section, we define a control model first.

Then, for both, we show the number of trainable parameters, which tells you something about the complexity of the model and the corresponding computational requirements.

Subsequently, we show you the evaluation metrics generated for both models with the testing data. Finally, this is visualized by some visualizations of the [training history](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/). This enables you to compare the progress of e.g. the loss value for each model over time.

### Control model

Here's the control model. It's a classic CNN with the two `Conv2D` based blocks, a `Flatten` layer, and two `Dense` layers:

```
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(no_classes, activation='softmax'))
```

If you add this code to your model, you must ensure that you also import the additional layers; your imports will thus become:

```
from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense
from tensorflow.keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D
```

### Trainable parameters

As indicated earlier, the number of trainable parameters tells you something about the complexity and computational requirements of your machine learning model. With `model.summary()`, we can generate an overview of the models.
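As a quick sanity check - this snippet is an addition and not part of the original code - the parameter counts that `model.summary()` reports below can also be reproduced by hand:

```
# Reproducing the trainable parameter counts from the summaries by hand.
def conv2d_params(kernel_size, channels_in, filters):
    # (kernel_height * kernel_width * input_channels + 1 bias) per filter
    return (kernel_size * kernel_size * channels_in + 1) * filters

def dense_params(inputs, units):
    # one weight per input/unit pair, plus one bias per unit
    return inputs * units + units

print(conv2d_params(3, 1, 32))        # 320
print(conv2d_params(3, 32, 64))       # 18496
print(conv2d_params(3, 64, 10))       # 5770   (the extra Conv2D in the GAP model)
print(dense_params(5 * 5 * 64, 256))  # 409856 (Flatten outputs 5 x 5 x 64 = 1600)
print(dense_params(256, 10))          # 2570
```

The Dense layer operating on the flattened 1,600-element feature vector is responsible for the bulk of the parameters - which is exactly what the Global Average Pooling variant avoids.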
Here is the overview for the Global Average Pooling based model: + +``` +Model: "GlobalAveragePoolingBased" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +conv2d (Conv2D) (None, 26, 26, 32) 320 +_________________________________________________________________ +max_pooling2d (MaxPooling2D) (None, 13, 13, 32) 0 +_________________________________________________________________ +dropout (Dropout) (None, 13, 13, 32) 0 +_________________________________________________________________ +conv2d_1 (Conv2D) (None, 11, 11, 64) 18496 +_________________________________________________________________ +max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64) 0 +_________________________________________________________________ +dropout_1 (Dropout) (None, 5, 5, 64) 0 +_________________________________________________________________ +conv2d_2 (Conv2D) (None, 5, 5, 10) 5770 +_________________________________________________________________ +global_average_pooling2d (Gl (None, 10) 0 +_________________________________________________________________ +activation (Activation) (None, 10) 0 +================================================================= +Total params: 24,586 +Trainable params: 24,586 +Non-trainable params: 0 +_________________________________________________________________ +``` + +And this is the one for the traditional one, the classic CNN: + +``` +Model: "Traditional" +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +conv2d_3 (Conv2D) (None, 26, 26, 32) 320 +_________________________________________________________________ +max_pooling2d_2 (MaxPooling2 (None, 13, 13, 32) 0 +_________________________________________________________________ +dropout_2 (Dropout) (None, 13, 13, 32) 0 +_________________________________________________________________ +conv2d_4 (Conv2D) (None, 11, 11, 64) 18496 +_________________________________________________________________ +max_pooling2d_3 (MaxPooling2 (None, 5, 5, 64) 0 +_________________________________________________________________ +dropout_3 (Dropout) (None, 5, 5, 64) 0 +_________________________________________________________________ +flatten (Flatten) (None, 1600) 0 +_________________________________________________________________ +dense (Dense) (None, 256) 409856 +_________________________________________________________________ +dense_1 (Dense) (None, 10) 2570 +================================================================= +Total params: 431,242 +Trainable params: 431,242 +Non-trainable params: 0 +_________________________________________________________________ +``` + +As you can see, the number of trainable parameters for the Global Average Pooling based model is substantially lower than for the classic one. The second overview, and especially the `dense` layer in this overview, illustrates the difference: large Dense layers add quite a set of trainable parameters due to their high connectedness. The question, however, is this: + +Does the substantial reduction in trainable parameters also influence the _performance_ of the model? 
Let's take a look 😉 + +### Evaluation metrics + +The evaluation metrics show us that our model performs only slightly worse than the classic one: + +``` +Global Average Pooling: Test loss: 0.3354340976119041 / Test accuracy: 0.9086999893188477 +Classic: Test loss: 0.3033015901445908 / Test accuracy: 0.9617000222206116 +``` + +...but this minor deterioration comes at the benefit of an approximately 18 times reduction in trainable parameters! Perhaps, we can increase the amount of trainable parameters a bit in our new model by e.g. adding another convolutional block, to capture the patterns in a more detailed way. But that's something for a different post! :) + +### Training history + +The plots of our training history also demonstrate that the performance of our models is converging towards each other. However, what is apparent is that the Global Average Pooling based model takes longer to arrive at loss values that are similar to the regular, classic CNN. Perhaps, this occurs because the trainable parameters in the Dense layers are omitted, and that the individual `Conv2D` layer added to capture the "classification" process with only 10 feature maps takes longer to learn. + +- [![](images/gap_loss.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/gap_loss.png) + +- [![](images/gap_acc.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/gap_acc.png) + + +All in all, some pretty awesome results! 😎 + +* * * + +## Summary + +In this blog post, we created a Keras based model to show you that ConvNets do not necessarily require Dense layers at the end if we want to use them for classification purposes. Rather, with Global Average Pooling, they can work as well - at a fraction of the trainable parameters. + +The blog post specifically looked at these elements: + +- **What pooling layers are and why they are useful**. We studied the concept of pooling and argued why pooling layers benefit training: they ensure that you can better build the spatial hierarchy of abstractness required in your ConvNets, without losing detail unnecessarily. +- **What Global Average Pooling is and how it can be used to remove Dense layers from the ConvNet**: with Global Average Pooling, which has pools equal sizes equal to the size of input data, it's possible to replicate the Dense layer based process at the end of classic ConvNets. We discussed how this works. +- **A Keras based example of the former.** Using the Keras framework for deep learning, we showed how it _really_ works by providing example code. What's more, we validated our model by means of evaluation metrics and plots, showing that - with a little bit of extra time to converge to a minimum - the Global Average Pooling way of working also gets close to what can be achieved with a classic ConvNet. + +I hope you've learnt something from today's blog post! If you did, please let me know in the comments - I appreciate your feedback 😊 Please do the same if you have questions or remarks, or if you found mistakes. I'll happily improve. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +MachineCurve. (2020, January 30). What are Max Pooling, Average Pooling, Global Max Pooling and Global Average Pooling? 
Retrieved from [https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/)

diff --git a/relu-sigmoid-and-tanh-todays-most-used-activation-functions.md b/relu-sigmoid-and-tanh-todays-most-used-activation-functions.md new file mode 100644 index 0000000..103a829 --- /dev/null +++ b/relu-sigmoid-and-tanh-todays-most-used-activation-functions.md @@ -0,0 +1,225 @@

---
title: "ReLU, Sigmoid and Tanh: today's most used activation functions"
date: "2019-09-04"
categories:
  - "deep-learning"
tags:
  - "activation-functions"
  - "deep-learning"
  - "relu"
  - "sigmoid"
  - "tanh"
---

Today's deep neural networks can handle highly complex data sets. For example, object detectors have grown capable of predicting the positions of various objects in real-time; timeseries models can handle many variables at once, and many other applications can be imagined.

The question is: why can those networks handle such complexity? More specifically, why can they do what previous machine learning models were much less capable of?

There are many answers to this question. Primarily, the answer lies in the depth of the neural network - it allows networks to handle more complex data. However, a part of the answer lies in the application of various **activation functions** as well - and particularly the non-linear ones most used today: ReLU, Sigmoid and Tanh.

In this blog, we will find out a couple of things:

- What an activation function is;
- Why you need an activation function;
- An introduction to the Sigmoid activation function;
- An introduction to the Tanh, or tangens hyperbolicus, activation function;
- An introduction to the Rectified Linear Unit, or ReLU, activation function.

Are you ready? Let's go! :-)

* * *

**Update 17/Jan/2021:** checked the article to ensure that it is up to date in 2021. Also added a short section with the key information from this article.

* * *

\[toc\]

* * *

## In short: the ReLU, Sigmoid and Tanh activation functions

In today's deep learning practice, three so-called **activation functions** are used widely: the Rectified Linear Unit (ReLU), Sigmoid and Tanh activation functions.

Activation functions in general are used to convert linear outputs of a neuron into [nonlinear outputs](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/), ensuring that a neural network can learn nonlinear behavior.

**Rectified Linear Unit (ReLU)** does so by outputting `x` for all `x >= 0` and `0` for all `x < 0`. In other words, it [equals](https://www.machinecurve.com/index.php/question/why-does-relu-equal-max0-x/) `max(x, 0)`. This simplicity makes it computationally much cheaper than the **Sigmoid activation function** and the **Tangens hyperbolicus (Tanh)** activation function, which use more complex formulas and are computationally more expensive. In addition, ReLU is not sensitive to vanishing gradients, whereas the other two are, slowing down learning in your network. Also known to generalize well, it is unsurprising to see that ReLU is the most widely used activation function today.

* * *

## What is an activation function?

You probably recall the structure of a basic neural network, which in deep learning terms is composed of _densely-connected layers:_

![](images/Basic-neural-network.jpg)

In this network, every neuron is composed of a weights vector and a bias value. When a new vector is input, the neuron computes the dot product between the weights and the input vector, adds the bias value and outputs the resulting scalar.

...until it doesn't.

Because, put very simply: both the dot product and the scalar addition are _linear_ operations.

Hence, when you take this value as the neuron output and do this for every neuron, you have a system that behaves linearly.

And as you probably know, _most data is highly nonlinear_. Since linear neural networks would not be capable of e.g. generating a decision boundary in those cases, there would be no point in applying them when generating predictive models.

The system as a whole must therefore be nonlinear.

**Enter the activation function.**

This function, which is placed directly behind every neuron, takes as input the linear neuron output and generates a nonlinear output based on it, often deterministically (i.e., when you input the same value twice, you'll get the same result).

This way, with every neuron generating a nonlinear output based on its linear one, the system behaves nonlinearly as well and by consequence becomes capable of handling nonlinear data.

### Activation outputs increase with input

Neural networks are inspired by the human brain. Although very simplistic, they can be considered to resemble the way human neurons work: they are part of large neural networks as well, with synapses - or pathways - in between. Given neural inputs, human neurons activate and pass signals to other neurons.

The system as a whole results in human brainpower as we know it.

If you wish to resemble this behavior in neural network activation functions, you'll need to resemble human neuron activation as well. Relatively trivial is the notion that in human neural networks outputs tend to increase when stimulation, or input to the neuron, increases. By consequence, this is also often the case in artificial ones.

Hence, we're looking for mathematical formulae that take linear input, generate a nonlinear output _and_ increase or remain stable as their input increases (a.k.a., are monotonically non-decreasing).

### Towards today's prominent activation functions

Today, three activation functions are most widely used: the **Sigmoid** function, the Tangens hyperbolicus or **tanh** and the Rectified Linear Unit, or **ReLU**. Next, we'll take a look at them in more detail.

* * *

## Sigmoid

Below, you'll see the (generic) **sigmoid** function, also known as the logistic curve:

[![](images/sigmoid-1024x511.png)](https://machinecurve.com/wp-content/uploads/2019/05/sigmoid.png)

Mathematically, it can be represented as follows:

\[mathjax\]

\\begin{equation} y = f(x) = \\frac{1}{1 + e^{-x}} \\end{equation}

As you can see in the plot, the function increases only slowly for very negative or very positive inputs, while the greatest increase can be found around \[latex\]x = 0\[/latex\]. The range of the function is \[latex\](0, 1)\[/latex\]; i.e., towards high values for \[latex\]x\[/latex\] the function approaches 1, but never equals it.

The Sigmoid function allows you to do multiple things.
First, as we recall from our post on [why true Rosenblatt perceptrons cannot be created in Keras](https://machinecurve.com/index.php/2019/07/24/why-you-cant-truly-create-rosenblatts-perceptron-with-keras/), step functions used in those ancient neurons are not differentiable and hence gradient descent for optimization cannot be applied. Second, when we implemented the Rosenblatt perceptron ourselves with the [Perceptron Learning Rule](https://machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/), we noticed that in a binary classification problem, the decision boundary is optimized per neuron and will find one of the possible boundaries if they exist. This gets easier with the Sigmoid function, since it is more smooth (Majidi, n.d.). + +Additionally, and perhaps primarily, we use the Sigmoid function because it outputs between \[latex\](0, 1)\[/latex\]. When estimating a probability, this is perfect, because probabilities have a very similar range of \[latex\]\[0, 1\]\[/latex\] (Sharma, 2019). Especially in binary classification problems, when we effectively estimate the probability that the output is of some class, Sigmoid functions allow us to give a very weighted estimate. The output \[latex\]0.623\[/latex\] between classes A and B would indicate "slightly more of B". With a step function, the output would have likely been \[latex\]1\[/latex\], and the nuance disappears. + +* * * + +## Tangens hyperbolicus: Tanh + +Another widely used activation function is the tangens hyperbolicus, or hyperbolic tangent / **tanh** function: + +[![](images/tanh-1024x511.png)](https://machinecurve.com/wp-content/uploads/2019/05/tanh.png) + +It works similar to the Sigmoid function, but has some differences. + +First, the change in output accelerates close to \[latex\]x = 0\[/latex\], which is similar with the Sigmoid function. + +It does also share its asymptotic properties with Sigmoid: although for very large values of \[latex\]x\[/latex\] the function approaches 1, it never actually equals it. + +On the lower side of the domain, however, we see a difference in the range: rather than approaching \[latex\]0\[/latex\] as minimum value, it approaches \[latex\]-1\[/latex\]. + +### Differences between tanh and Sigmoid + +You may now probably wonder what the differences are between tanh and Sigmoid. I did too. + +Obviously, the range of the activation function differs: \[latex\](0, 1)\[/latex\] vs \[latex\](-1, 1)\[/latex\], as we have seen before. + +Although this difference seems to be very small, it might have a large effect on model performance; specifically, how fast your model converges towards the most optimal solution (LeCun et al., 1998). + +This is related to the fact that they are symmetric around the origin. Hence, they produce outputs that are close to zero. Outputs close to zero are best: during optimization, they produce the least weight swings, and hence let your model converge faster. This will really be helpful when your models are very large indeed. + +As we can see, the **tanh** function is symmetric around the origin, where the **Sigmoid** function is not. Should we therefore always choose tanh? + +Nope - it comes with a set of problems, or perhaps more positively, _challenges_. + +* * * + +## Challenges of Sigmoid and Tanh + +The paper by LeCun et al. was written in 1998 and the world of deep learning has come a long way... identifying challenges that had to be solved in order to bring forward the deep learning field. 
+ +First of all, we'll have to talk about _model sparsity_ (DaemonMaker, n.d.). The less complex the model is during optimization, the faster it will converge, and the more likely it is that you'll find a mathematical optimum in time. + +And _complexity_ can be viewed as the _number of unimportant neurons_ that are still in your model. The fewer of them, the better - or _sparser_ - your model is. + +Sigmoid and Tanh essentially produce non-sparse models because their neurons pretty much always produce an output value: when the ranges are \[latex\](0, 1)\[/latex\] and \[latex\](-1, 1)\[/latex\], respectively, the output either cannot be zero or is zero with very low probability. + +Hence, if certain neurons are less important in terms of their weights, they cannot be 'removed', and the model is not sparse. + +Another possible issue with the output ranges of those activation functions is the so-called [vanishing gradients problem](https://machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) (DaemonMaker, n.d.). During optimization, data is fed through the model, after which the outcomes are compared with the actual target values. This produces what is known as the loss. Since the loss can be considered to be an (optimizable) mathematical function, we can compute the gradient towards the zero derivative, i.e. the mathematical optimum. + +Neural networks however comprise many layers of neurons. We would essentially have to repeat this process over and over again for every layer with respect to the downstream ones, and subsequently chain them. That's what backpropagation is. Subsequently, we can optimize our models with gradient descent or a similar optimizer. + +When neuron outputs are very small (i.e. \[latex\] -1 < output < 1\[/latex\]), the chains produced during optimization will get smaller and smaller towards the upstream layers. This will cause them to learn very slowly, and make it questionable whether they will converge to their optimum at all: enter the _vanishing gradients problem_. + +A more detailed review on this problem can be found [here](https://machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/). + +* * * + +## Rectified Linear Unit: ReLU + +In order to improve on these observations, another activation was introduced. This activation function, named Rectified Linear Unit or **ReLU**, is the de facto first choice for most deep learning projects today. It is much less sensitive to the problems mentioned above and hence improves the training process. + +It looks as follows: + +[![](images/relu-1024x511.png)](https://machinecurve.com/wp-content/uploads/2019/05/relu.png) + +And can be represented as follows: + +\\begin{equation} f(x) = \\begin{cases} 0, & \\text{if}\\ x < 0 \\\\ x, & \\text{otherwise} \\\\ \\end{cases} \\end{equation} + +Or, in plain English, it produces a zero output for all inputs smaller than zero; and \[latex\]x\[/latex\] for all other inputs. Hence, for all \[latex\]inputs <= 0\[/latex\], it produces zero outputs. + +### Sparsity + +This benefits sparsity substantially: in almost half the cases, now, the neuron doesn't fire anymore. This way, neurons can be made silent if they are not too important anymore in terms of their contribution to the model's predictive power. 
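
To make the sparsity argument concrete, here is a small NumPy sketch (the input values are just an example) that applies the three activation functions to the same set of neuron outputs. ReLU produces exact zeros for the entire non-positive half of its inputs, whereas Sigmoid and Tanh essentially never output exactly zero:

```
import numpy as np

x = np.linspace(-3, 3, 7)  # example neuron outputs: [-3, -2, -1, 0, 1, 2, 3]

sigmoid = 1 / (1 + np.exp(-x))  # range (0, 1): never exactly zero
tanh = np.tanh(x)               # range (-1, 1): zero only at exactly x = 0
relu = np.maximum(0, x)         # exact zeros for all x <= 0

print(relu)  # [0. 0. 0. 0. 1. 2. 3.] -> four of the seven activations are silent
```
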
+ +### Fewer vanishing gradients + +It also reduces the impact of vanishing gradients, because the gradient is always a constant: the derivative of \[latex\]f(x) = 0\[/latex\] is 0 while the derivative of \[latex\]f(x) = x\[/latex\] is 1. Models hence learn faster and more evenly. + +### Computational requirements + +Additionally, ReLU does need much fewer computational resources than the Sigmoid and Tanh functions (Jaideep, n.d.). The function that essentially needs to be executed to arrive at ReLU is a `max` function: \[latex\]max(0, x)\[/latex\] produces 0 when \[latex\]x < 0\[/latex\] and x when \[latex\]x >= 0\[/latex\]. That's ReLU! + +Now compare this with the formulas of the Sigmoid and tanh functions presented above: those contain exponents. Computing the output of a max function is much simpler and less computationally expensive than computing the output of exponents. For one calculation, this does not matter much, but note that in deep learning many such calculations are made. Hence, ReLU reduces your need for computational requirements. + +### ReLU comes with additional challenges + +This does however not mean that ReLU itself does not have certain challenges: + +- Firstly, it tends to produce very large values given its non-boundedness on the upside of the domain (Jaideep, n.d.). Theoretically, infinite inputs produce infinite outputs. +- Secondly, you will face the _dying ReLU problem_ (Jaideep, n.d.). If a neuron's weights are moved towards the zero output, it may be the case that they eventually will no longer be capable of recovering from this. They will then continually output zeros. This is especially the case when your network is poorly initialized, or when your data is poorly normalized, because the first rounds of optimization will produce large weight swings. When too many neurons output zero, you end up with a dead neural network - the dying ReLU problem. +- Thirdly: Small values, even the non-positive ones, may be of value; they can help capture patterns underlying the dataset. With ReLU, this cannot be done, since all outputs smaller than zero are zero. +- Fourthly, the transition point from \[latex\]f(x) = 0\[/latex\] to \[latex\]f(x) = x\[/latex\] is not smooth. This will impact the loss landscape during optimization, which will not be smooth either. This may (slightly albeit significantly) hamper model optimization and slightly slow down convergence. + +To name just a few. + +Fortunately, new activation functions have been designed to overcome these problems in especially very large and/or very deep networks. A prime example of such functions is [Swish](https://machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/); another is Leaky ReLU. The references navigate you to blogs that cover these new functions. + +* * * + +## Recap + +In this blog, we dived into today's standard activation functions as well as their benefits and possible drawbacks. You should now be capable of making a decision as to which function to use. Primarily, though, it's often best to start with ReLU; then try tanh and Sigmoid; then move towards new activation functions. This way, you can experimentally find out which works best. However, take notice of the resources you need, as you may not necessarily be able to try all choices. + +Happy engineering! :-) + +* * * + +## References + +Panchal, S. (n.d.). What are the benefits of using a sigmoid function? Retrieved from [https://stackoverflow.com/a/56334780](https://stackoverflow.com/a/56334780) + +Majidi, A. 
(n.d.). What are the benefits of using a sigmoid function? Retrieved from [https://stackoverflow.com/a/56337905](https://stackoverflow.com/a/56337905) + +Sharma, S. (2019, February 14). Activation Functions in Neural Networks. Retrieved from [https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6) + +LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. -. (1998). Efficient BackProp. _Lecture Notes in Computer Science_, 9-50. [doi:10.1007/3-540-49430-8\_2](http://doi.org/10.1007/3-540-49430-8_2) + +DaemonMaker. (n.d.). What are the advantages of ReLU over sigmoid function in deep neural networks? Retrieved from [https://stats.stackexchange.com/a/126362](https://stats.stackexchange.com/a/126362) + +Jaideep. (n.d.). What are the advantages of ReLU over sigmoid function in deep neural networks? Retrieved from [https://stats.stackexchange.com/questions/126238/what-are-the-advantages-of-relu-over-sigmoid-function-in-deep-neural-networks](https://stats.stackexchange.com/questions/126238/what-are-the-advantages-of-relu-over-sigmoid-function-in-deep-neural-networks) diff --git a/resnet-a-simple-introduction.md b/resnet-a-simple-introduction.md new file mode 100644 index 0000000..9e7a3a4 --- /dev/null +++ b/resnet-a-simple-introduction.md @@ -0,0 +1,128 @@ +--- +title: "ResNet, a simple introduction" +date: "2022-01-13" +categories: + - "deep-learning" +tags: + - "deep-learning" + - "degradation-problem" + - "exploding-gradients" + - "machine-learning" + - "resnet" + - "shattering-gradients" + - "vanishing-gradients" +--- + +Residual networks or ResNets for short have been key elements in the computer vision oriented deep learning community. In this article, you will take a conceptual look at these networks, and the problems that they solve. However, we'll shy away from the underlying maths, and keep explanations as simple as possible. + +In other words, after reading this tutorial, you will... + +- **Understand why neural networks _should_ improve performance with increasing depth... in theory.** +- **Why the _shattering gradients_ problem results in degraded performance with depth, a.k.a. the _degradation problem_.** +- **How ResNets reduce shattering gradients and yield better performance, and what they look like architecturally/component-wise.** + +Are you ready? Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## Adding more layer should improve performance... in theory + +In 2012, during the AlexNet deep learning breakthrough reviving interest in the technique, people realized en masse that deep neural networks are feature learners. + +What this means can be best explained by comparing them to other, more traditional techniques, such as logistic regression or Support Vector Machines. In these model types, machine learning engineers and/or data scientists first had to engineer features (i.e. the model input columns) explicitly and manually. In other words, extensive feature analysis and often feature selection had to be performed, e.g. with [PCA](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/). When relevant features were selected, they had to be engineered as well (e.g., by applying specific [filters](https://learnopencv.com/image-filtering-using-convolution-in-opencv/) over them to make them suitable for the machine learning method). + +With neural networks, this was no longer the case. 
Especially with the addition of [convolutional layers](https://www.machinecurve.com/index.php/2021/07/08/convolutional-neural-networks-with-pytorch/), neural networks became able to learn to detect important parts of images relevant to e.g. classification outcomes, by making the filters/kernels learnable. + +When stacking multiple convolutional layers, the feature maps learned in each subsequent layer become more high level. For example, in the image below, we see a ConvNet with multiple Conv layers and two Dense layers for eventual prediction. When visualizing the filters/kernels of these ConvNets, low-level concepts (such as eyes) are still visible within images. The more downstream one gets, the more generic these patterns get. + +![](images/conv.png) + +You can imagine that when you are training a neural network that classifies between cats and dogs, the more downstream you get, the more generic a representation of cats and dogs emerges. The more generic the representation, the more cats are recognized as cats and dogs as dogs. In other words, it should then become much easier for the Dense layers to distinguish between the classes. + +Obviously, this leads to the popularity of _deep_ learning. Deep here refers to the depth of the neural networks being used, as frequently deeper networks perform better. In theory, these networks should be [universal function approximators](https://www.machinecurve.com/index.php/2019/07/18/can-neural-networks-approximate-mathematical-functions/) and be able to learn anything. But can they? Let's take a look at some problems neural network practitioners ran into in the early days of deep learning. + +* * * + +## From vanishing and exploding gradients to the "degradation problem" + +In the early days, while practitioners realized that they had to use [nonlinear activation functions](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/) in order to let their neural networks perform, they were pretty much used to Sigmoid and Tanh. Recall from the [high-level machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) that when a forward pass has completed, the loss is computed backwards throughout the network, yielding a gradient for each trainable parameter, after which the network adapts in the direction of each individual gradient. + +Throughout the stack of layers, this backpropagation is performed with the chain rule - in other words, by means of gradient multiplications. For example, for the 3rd layer away from the loss in the hierarchy, the loss and the gradients computed for the two layers in between influence the gradients in the 3rd layer. This influence is expressed by means of multiplications. + +Let's now take a look at the image below, where you can see the Sigmoid activation function and its first derivative. You can see that the maximum value of the derivative is ~0.25, and that it's <0.1 for a large part of the domain. + +Recall that this derivative is used for gradient computation. Now imagine what happens when you multiply these values across the layers. If your network has 10 layers, the gradient at the layer is impacted by a multiplication with values in the order of `0.25^9` - indeed, a very small number. The gradient at the most upstream layers are thus very small when using these classic activation functions, resulting in very slow (or even the impossibility of) training. 
It's what many of you know as the **vanishing gradients problem.** + +![sigmoid_deriv – MachineCurve](images/sigmoid_deriv.png) + +Fortunately, the vanishing gradients problem can be resolved by means of the [ReLU activation function](https://www.machinecurve.com/index.php/2021/01/21/using-relu-sigmoid-and-tanh-with-pytorch-ignite-and-lightning/). + +The opposite of small gradients is also possible; in other words, your model can also suffer from the **exploding gradients problem**. This problem occurs when your data was not normalized properly, after which during optimization gradients become very large. The effect of multiplying these large gradients is that gradients in the most upstream layers will become _very_ large and possibly become larger than the maximum values for the data type (e.g., float), after which they become `NaN`. This results in model instability and, once again, the impossibility of converging to a solution every now and then. + +[Batch Normalization](https://www.machinecurve.com/index.php/2020/01/14/what-is-batch-normalization-for-training-neural-networks/) combined with ReLU resolves the vanishing and exploding gradients problems to a large extent. + +However, there is another problem, as observed by He et al. in their 2016 paper. If we take a look at a graph cited from [their paper](https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf), we can spot the problem instantly: + +![](images/image-1-1024x542.png) + +Source: He, K., Zhang, X., Ren, S., & Sun, J. (2016). [Deep residual learning for image recognition.](https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf) In _Proceedings of the IEEE conference on computer vision and pattern recognition_ (pp. 770-778). + +Both training **and** testing performance degrades when neural networks such as the one sketched above, with a stack of neural layers (called "plain networks" by He et al.), become deeper. + +Clearly, the 56-layer network performs worse than the 20-layer network. + +This is very counterintuitive, because theory suggests that deeper networks are better feature learners and should hence perform better. When replicating this problem with "identity mappings", in the case of the 20-vs-56-layer comparison e.g. by training a network with 20 layers and another with 20 layers and 36 identity mappings that return their inputs, they still encountered this problem. Since network performance degrades, they coined it the **degradation problem**. + +In other words, there's a divergence between what theory suggests and practice proves. But why? Let's take a look at another paper investigating the gradients of increasingly deep networks. + +* * * + +## Shattering gradients problem + +In that paper, which is called [The shattered gradients problem: If resnets are the answer, then what is the question?](https://arxiv.org/pdf/1702.08591.pdf), Balduzzi et al. investigate why ResNets (you'll learn about them in the next section) of a certain depth perform better compared to plain networks of the same or even smaller depth. + +They built a neural network that learns to map scalars to scalars, in other words, `2` to `2`, to give an example (their input domain was `[-2, 2]`). Note that they claim that the network itself will likely not be useful for practice, but that it is a good "laboratory" candidate with which the problem can be looked at in detail. + +The image below suggests what happens in increasingly deep networks. 
Showing the _gradients_ for each input value as well as the _covariance of the gradients between the inputs_, it becomes clear that there is structure in the gradients of a shallow network (left-most column). In other words, since close inputs produce similar gradients, the network can slowly but surely converge to a locally or globally optimal solution.

When the network is made a lot deeper (the (b) column towards the left), this structure disappears, and the similarity between gradients now resembles white noise. In other words, since nearly identical inputs can now produce significantly different gradients, finding an optimal solution becomes increasingly difficult with depth. Balduzzi et al. coin this the **shattering gradients problem** and suggest that it's one of the key reasons for the degradation problem mentioned before.

Interestingly, you can also see the results of the analysis for a 50-layer ResNet. Clearly, the similarity between gradients - measured by means of their covariance - is worse compared to the 1-layer "plain" network, but it's much better compared to the 24-layer plain one - and the ResNet is twice as deep! In fact - something that also becomes clear when plotting the autocorrelation for each model type ([paper, page 3](https://arxiv.org/pdf/1702.08591.pdf)) - ResNet gradient similarity resembles that of brown noise.

In other words, ResNet gradients remain somewhat similar with increasing depth - i.e., ResNets reduce the shattering gradients problem - allowing much deeper networks to be trained. Now that we understand why they can be better, let's actually take a look at what they look like :)

![](images/image-2-1024x480.png)

Source: Balduzzi, D., Frean, M., Leary, L., Lewis, J. P., Ma, K. W. D., & McWilliams, B. (2017, July). [The shattered gradients problem: If resnets are the answer, then what is the question?](https://arxiv.org/pdf/1702.08591.pdf). In _International Conference on Machine Learning_ (pp. 342-350). PMLR.

* * *

## Introducing residual networks (ResNets)

He et al. (2016) found that neural networks could be made deeper while remaining performant by designing them within a framework they call **residual learning**. For this reason, these networks are called **residual networks** or **ResNets** for short.

If a stack of neural layers (for example, the two layers in the image below) needs to learn some mapping `H(x)`, it can do so by simply being stacked. This yields the "plain network" scenario that you just learned can be problematic when depth is increased. However, it can also be framed in a different way... by decomposing `H(x)` into separate components.

For example, what if we let the stack of neural layers learn another mapping instead - `F(x)`, where `F(x) = H(x) - x`? Then, we can construct the original mapping by means of `H(x) = F(x) + x`.

Interestingly, we can implement that mapping very easily in our neural network by using the concept of a **skip connection**. The input to our stack learning `F(x)` will simply be added to the stack's output afterwards, creating `F(x) + x`, or in other words the original mapping `H(x)`. Note that He et al. choose to do the element-wise addition _before_ the stack's final ReLU activation.

![](images/image-3.png)

Source: He, K., Zhang, X., Ren, S., & Sun, J. (2016). [Deep residual learning for image recognition.](https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf) In _Proceedings of the IEEE conference on computer vision and pattern recognition_ (pp. 770-778).
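
To make this more tangible, here is a minimal Keras sketch of such a residual block. It is a simplification: it assumes the block's input already has `filters` channels, and it omits the Batch Normalization and projection shortcuts that full ResNets use when dimensions change:

```
from tensorflow.keras import layers

def residual_block(x, filters=64):
    # Keep a reference to the block's input: this is the skip connection (identity mapping).
    shortcut = x
    # Two stacked layers learn the residual mapping F(x).
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, 3, padding='same')(y)  # note: no activation here yet
    # Element-wise addition reconstructs H(x) = F(x) + x ...
    y = layers.Add()([y, shortcut])
    # ... and the final ReLU is applied after the addition, as in He et al. (2016).
    return layers.Activation('relu')(y)
```
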
+ +By wrapping many such "residual blocks" on top of each other, you can create a neural network that learns a series of these decomposed mappings. From the Balduzzi et al. paper, you can see that doing so allows gradients to be more similar with increasing depth, benefiting convergence to some optimum and hence allowing better feature learning. + +While being a relatively old paper, ResNets are worth discussing today, because they are still widely used in the computer vision community, despite the emergence of some other approaches. If you have any questions, comments or suggestions, feel free to leave a message in the comments section below 💬 I will then try to answer you as quickly as possible. For now, thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +He, K., Zhang, X., Ren, S., & Sun, J. (2016). [Deep residual learning for image recognition.](https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf) In _Proceedings of the IEEE conference on computer vision and pattern recognition_ (pp. 770-778). + +Balduzzi, D., Frean, M., Leary, L., Lewis, J. P., Ma, K. W. D., & McWilliams, B. (2017, July). [The shattered gradients problem: If resnets are the answer, then what is the question?](https://arxiv.org/pdf/1702.08591.pdf). In _International Conference on Machine Learning_ (pp. 342-350). PMLR. diff --git a/rise-of-the-machines-review-ais-broader-context.md b/rise-of-the-machines-review-ais-broader-context.md new file mode 100644 index 0000000..cdfe9af --- /dev/null +++ b/rise-of-the-machines-review-ais-broader-context.md @@ -0,0 +1,78 @@ +--- +title: "Rise of the Machines Review: AI's broader context" +date: "2019-10-23" +categories: + - "books-about-ai" +tags: + - "cybernetics" + - "machine-learning" + - "technology" + - "thomas-rid" +--- + +Since 2012, there has been increasing attention for machine learning and especially [deep learning](https://www.machinecurve.com/index.php/2018/11/23/what-is-deep-learning-exactly/), the ML branch that benefits from more advanced techniques and increased computing resources to equal human-level performance in some domains. + + + +However, neither machine learning nor artificial intelligence are isolated when considering global technology developments. Rather, they both belong to one big story which connects "war machines, computer networks, social media, ubiquitous surveillance and virtual reality" - according to Kevin Kelly, the founder of Wired. + +_(the Amazon advertisement contains an affiliate link for MachineCurve)._ + +If you're interested in either the business or technology aspects of AI and/or machine learning, I would like to recommend the 2016 book **Rise of the Machines: the lost history of cybernetics** by Thomas Rid, a German professor specializing in political science at Johns Hopkins University in the USA (Wikipedia, 2016). + +Rid, who specializes in how politics shape information technology and how IT shapes politics (and society) in return, with Rise of the Machines provides an excellent and thorough overview of the global trends in WW2 and post-WW2 technology developments. + +## Setting the stage + +The book begins in the autumn of 1940, when German fighter pilots raid the city of London, the Londoners being unknowing about how the concept of war would change in the years to come (Rid, 2016). 
The Battle of Britain would be fertile ground for many technology developments, such as radar technology capable of tracking enemy aircraft, variable-time fuse shells which explode when near the enemy, and so on.

However, it wouldn't stop there. Rather, with scientists like Turing inventing the Bombe - an electromechanical machine that cracked German Enigma codes - a technology revolution was about to begin (Wikipedia, 2001). Rid presents how this technology story has unfolded over many decades.

Starting with the movement of _cybernetics_, which in the later 1940s emerged to make sense of how technology had evolved, Rid shows how, through _controlling_ the environment by means of _feedback_, humans and machines had now been coupled tightly at an unprecedented scale.

This is followed by chapters on _automation_, discussing the long-term effects of such human-machine symbiosis on e.g. employability, on _organisms_, discussing whether technology could physically integrate with organisms, to _culture_ and _space_, introducing how technology spawned entirely new subcultures such as cyberpunk as well as new territory called _cyberspace_.

![](images/beautiful-facial-expression-female-834949-1024x683.jpg)

We think virtual reality is a new development - it's not. It's intrinsically linked to the concept _cyberspace_ discussed in Rid's book. Photographer: Bruce Mars, Pexels License.

But who controls _cyberspace_? Rid's book continues until today, discussing the battle between anarchists and governments about who owns the vast digital lands, and how the introduction of cryptography has substantially polarized this debate.

Entering today's world, the book discusses _war_ again - digital war indeed, introducing cybercrime and cyberattacks, the type of warfare that is ubiquitous today. In its conclusion, Rid is spot on: "cybernetics started at war - and eventually came back to war" (Rid, 2016). Today's world has digitized and Rid's book tells us how it did.

## AI's broader context

Now, why would this book be a recommendation if you're interested in AI?

I get the question - let me explain.

Nothing in this world happens in isolation. Any action is triggered by some previous action and will trigger another action - or perhaps a few of them - which in turn spawn more actions, and so on. Hence, I think that it's important to study _context_ when discussing some phenomenon, and preferably as objectively as possible.

The same is true for Artificial Intelligence. Did you know that in the World War 2 era, Turing already undertook thought experiments about AI, questioning whether it was possible - with the Turing test as a prime example? That the ideas about today's narrow AI systems - namely that they often work best when they support humans (i.e., human-machine symbiosis) - are grounded in decades-old concepts?

That the ideas put forward by the so-called singularity movement, claiming that superintelligent AI will create an exponentially better world for humans to live in, have been here since the late 1950s?

(And that the same is true for the apocalyptic thoughts about the same superintelligent technology?)

Well, you get the point.

If you wish to understand today's AI developments, you'll have to consider them in _the broad context of technological history_. Thomas Rid's Rise of the Machines is, although written by an academic and hence sometimes a little challenging to plough through, an excellent chronology of how technology has shaped the world.
Absolute recommendation! + +## Check prices + +[![](//ws-na.amazon-adsystem.com/widgets/q?_encoding=UTF8&MarketPlace=US&ASIN=1925228649&ServiceVersion=20070822&ID=AsinImage&WS=1&Format=_SL160_&tag=webn3rd-20)](https://www.amazon.com/gp/product/1925228649/ref=as_li_tl?ie=UTF8&camp=1789&creative=9325&creativeASIN=1925228649&linkCode=as2&tag=webn3rd-20&linkId=b5d0411bb65cd79b43f28c4e9cc81f78)![](//ir-na.amazon-adsystem.com/e/ir?t=webn3rd-20&l=am2&o=1&a=1925228649) + +**Rise of the Machines: the lost history of cybernetics** +Thomas Rid, 2016 +ISBN 9781925228649 +Scribe Publications + +[Check prices at Amazon (affiliate link).](https://amzn.to/32RkFM9) + +## References + +Wikipedia. (2016, October 20). Thomas Rid. Retrieved from [https://en.wikipedia.org/wiki/Thomas\_Rid](https://en.wikipedia.org/wiki/Thomas_Rid) + +Rid, T. (2016). _Rise of the Machines: the lost history of cybernetics_. Scribe Publications. + +Wikipedia. (2001, November 12). Alan Turing. Retrieved from [https://en.wikipedia.org/wiki/Alan\_Turing#Bombe](https://en.wikipedia.org/wiki/Alan_Turing#Bombe) diff --git a/saying-hello-to-tensorflow-2-4-0.md b/saying-hello-to-tensorflow-2-4-0.md new file mode 100644 index 0000000..2183d53 --- /dev/null +++ b/saying-hello-to-tensorflow-2-4-0.md @@ -0,0 +1,227 @@ +--- +title: "Saying hello to TensorFlow 2.4.0" +date: "2020-11-05" +categories: + - "frameworks" +tags: + - "deep-learning" + - "keras" + - "machine-learning" + - "neural-networks" + - "tensorflow" +--- + +Although there are many approaches to creating deep learning models these days, [TensorFlow](http://tensorflow.org) is one of the most widely known ones. Two days ago, they released **TensorFlow 2.4.0-rc0**, a TF pre-release, with a lot of major features and improvements. In this article, we're welcoming TensorFlow 2.4.0 and look at what's changed. + +**Update 05/Nov/2020:** fixed quite a bit of spelling issues. Sorry about that! + +* * * + +\[toc\] + +* * * + +## Major Features and Improvements + +According to the [GitHub release page](https://github.com/tensorflow/tensorflow/releases/tag/v2.4.0-rc0), TensorFlow 2.4.0 will have these major features and improvements: + +- Experimental support will be added to `tf.distribute` for training your Keras models asynchronously. A `ParameterServerStrategy` was added for this purpose. Below, we'll cover this in more detail. +- The `MultiWorkerMirroredStrategy` was moved into stable, and is no longer experimental. Many bug fixes have been applied in between the experimental and stable APIs. With the strategy, you can distribute your training process across many GPUs across many worker machines. +- A new module called `tf.experimental.numpy` was added - and it's a NumPy-compatible API for writing TensorFlow programs. We'll cover it below in more detail. +- Support for `TensorFloat-32` was added on Ampere based GPUs. +- The Keras `Functional` API was refactored in a major way. While the refactor primarily targeted the internals, some functions have changed on the outside as well. +- The `tf.keras.mixed_precision` API was moved into stable, allowing you to use 16-bit floats during training. +- Changes to TF Profiler were made; also, TFLite Profiler is available for Android. Below, we'll cover them in more detail. +- TensorFlow `pip` packages now require you to have installed CUDA11 and cuDNN 8.0.2 onto your system. 
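
As a quick check for that last point, recent TensorFlow versions let you inspect which CUDA and cuDNN versions your installed wheel was built against - a small sketch:

```
import tensorflow as tf

print(tf.__version__)                 # e.g. 2.4.0-rc0
print(tf.sysconfig.get_build_info())  # dict that includes the CUDA/cuDNN build versions
```
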
+ +* * * + +## Changes in more detail + +Let's take a look at the major features/improvements in more detail :) + +### Keras async training + +While many people who start with TensorFlow train their neural networks on just one machine with one GPU, it is possible to extend your training setup in multiple ways: + +- You can use multiple GPUs on your machine. +- You can use a [cloud setup](https://www.machinecurve.com/index.php/2020/10/16/tensorflow-cloud-easy-cloud-based-training-of-your-keras-model/) for using multiple GPUs on multiple machines. + +How you setup your training process can be configured by a **distribution strategy**, available through the `tf.distribute` API in TensorFlow. Now, a new strategy was added - called `ParameterServerStrategy`: + +``` +tf.distribute.experimental.ParameterServerStrategy( + cluster_resolver, variable_partitioner=None +) +``` + +Generally, if you would use a cluster of machines for training your neural network, you would do so in a data-parallel way, by splitting your dataset into multiple batches, [training instances of the same model with those batches](https://www.machinecurve.com/index.php/2020/10/22/distributed-training-tensorflow-and-keras-models-with-apache-spark/#data-parallelism-vs-model-parallelism), and subsequently aggregating the parameters changes into a change in the full model. + +This can be done **synchronously** and **asynchronously**, which differs in the way how model variables of the _full model_ are updated. + +> _Synchronous_, or more commonly _sync_, training is where the updates from each replica are aggregated together before updating the model variables. This is in contrast to _asynchronous_, or _async_ training, where each replica updates the model variables independently. You may also have replicas partitioned into groups which are in sync within each group but async between groups. +> +> TensorFlow (n.d.) + +The `ParameterServerStrategy` introduces **parameter server training** and hence asynchronous training to TensorFlow, which allows you to use a cluster of workers and parameter servers. + +> As a result, failures of some workers do not prevent the cluster from continuing the work, and this allows the cluster to train with instances that can be occasionally unavailable (e.g. preemptible or spot instances). +> +> TensorFlow (n.d.) + +This greatly boosts parallel training, especially now that Amazon has released [EC P4d Instances](https://aws.amazon.com/ec2/instance-types/p4/) for Machine Learning, which run in AWS EC2 UltraClusters. + +### Into stable: MultiWorkerMirroredStrategy + +A synchronous method that [used to be experimental](https://www.machinecurve.com/index.php/2020/10/16/tensorflow-cloud-easy-cloud-based-training-of-your-keras-model/#multiworkermirroredstrategy), called the `MultiWorkerMirroredStrategy`, is being moved from experimental into stable (TensorFlow, n.d.): + +``` +tf.distribute.MultiWorkerMirroredStrategy( + cluster_resolver=None, communication_options=None +) +``` + +Using the distribution strategy, you can train your model in a setup across **multiple workers, each with potentially multiple GPUs**. This is a strategy that can be employed in [cloud-based training](https://www.machinecurve.com/index.php/2020/10/16/tensorflow-cloud-easy-cloud-based-training-of-your-keras-model/). 
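
As a rough sketch of how this looks in practice - the model and dataset here are placeholders, and each worker additionally needs a proper `TF_CONFIG` cluster configuration - you wrap model construction and compilation in the strategy's scope:

```
import tensorflow as tf

# Sketch: create the strategy and build/compile the model within its scope.
# Assumes TF_CONFIG is set on every worker participating in the cluster.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
  model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
  ])
  model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# model.fit(...) then runs synchronously across all workers.
```
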
+ +![](images/pexels-manuel-geissinger-325229-1024x358.jpg) + +Photo by **[Manuel Geissinger](https://www.pexels.com/@artunchained?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels)** from **[Pexels](https://www.pexels.com/photo/interior-of-office-building-325229/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels)**. + +### Experimental TensorFlow NumPy-compatible API + +New to TensorFlow in version 2.4.0 is the `tensorflow.experimental.numpy` API: + +> This module provides a subset of NumPy API, built on top of TensorFlow operations. APIs are based on and have been tested with NumPy 1.16 version. +> +> TensorFlow (n.d.) + +As a **subset of NumPy**, i.e. not all components are implemented and more will be added later, it is fully interoperable with NumPy. In addition, as it is built on top of TensorFlow, the API interoperates seamlessly with TensorFlow. + +The reason why this was added seems to be **performance**, mainly. + +- TensorFlow Numpy uses highly optimized TensorFlow kernels dispatchable on CPUs, GPUs and TPUs. +- Compiler optimizations are also performed. + +Generally, it seems to be the case that if your NumPy workloads have complex operations, performance benefits become clear. For smaller or not-so-complex workloads, TensorFlow (n.d.) suggests to still use NumPy instead. + +Here is a comparison for a [Sigmoid](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/) activation function implemented with NumPy and TensorFlow NumPy: + +![png](images/output_p-fs_H1lkLfV_0.png) + +Credits: [TensorFlow (n.d.)](https://www.tensorflow.org/guide/tf_numpy). Licensed under the [Creative Commons Attribution 4.0 License](https://creativecommons.org/licenses/by/4.0/), no changes were made. + +### TensorFloat-32 on Ampere based GPUs + +Data can be represented with many types of math - using `integers`, for example, but also 32-bit floating-point numbers i.e. `float32` are possible. Generally, floating-point math is _precise_ but also comes at a cost: many bits and hence lots of memory are necessary for training and eventually deploying your machine learning model. + +Earlier this year, **TensorFloat-32 was introduced** and was made the new math mode in the new A100 GPUs from NVIDIA, which run on the Ampere architecture. + +> TensorFloat-32 is the new math mode in [NVIDIA A100 GPUs](http://www.nvidia.com/a100) for handling the matrix math also called tensor operations used at the heart of AI and certain HPC applications. +> +> NVIDIA (2020) + +Floating-point math utilizes a significand, base and exponent to represent a number (Wikipedia, n.d.): + +\[latex\]significand \\times base^{exponent}\[/latex\] + +TensorFloat-32 (TF32) improves upon regular 32-bits floating-point numbers (FP32) by reducing the bit size for the float significand (a.k.a. mantissa) and exponent, making computation less resource intensive, boosting speed and capabilities of a GPU. + +> TF32 uses the same 10-bit mantissa as the half-precision (FP16) math, shown to have more than sufficient margin for the precision requirements of AI workloads. And TF32 adopts the same 8-bit exponent as FP32 so it can support the same numeric range. +> +> NVIDIA (2020) + +TensorFlow 2.4.0 adds support for TF32 format for Ampere based GPUs; it is enabled by default. 
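
If full float32 precision matters for your use case, TF32 execution can be switched off again via the experimental config API - a minimal sketch:

```
import tensorflow as tf

# Sketch: disable TensorFloat-32 execution if you need full float32 precision.
tf.config.experimental.enable_tensor_float_32_execution(False)

# ... and check whether it is currently enabled.
print(tf.config.experimental.tensor_float_32_execution_enabled())
```
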
+ +### Keras Functional API refactoring + +Those who are used to creating Keras models know that there are two main approaches to creating one - using the more rigid but accessible `Sequential API` or the more flexible but relatively difficult `Functional` API. + +The table below gives a small example for a `model` and the subsequent addition of one `Dense` layer for the Sequential and Functional APIs. + +
| Sequential API | Functional API |
| --- | --- |
| `model = Sequential()` <br> `model.add(Dense(256, activation='relu', input_shape=input_shape))` | `inputs = keras.Input(shape=input_shape)` <br> `outputs = Dense(256, activation="relu")(inputs)` <br> `model = keras.Model(inputs=inputs, outputs=outputs)` |
+ +Constructing a model and adding the layer in the Sequential (left) and Functional (right) APIs. + +In TensorFlow 2.4.0, the Functional API had a major refactor, making it more reliable, stable and performant when constructing Keras models. + +While the **refactor mostly involved internals**, some external calls might require a change - check the [breaking changes section of the release](https://github.com/tensorflow/tensorflow/releases/tag/v2.4.0-rc0) to see if this is applicable to your model. + +![](images/pexels-fernando-arcos-211122-1024x681.jpg) + +Photo by **[Fernando Arcos](https://www.pexels.com/@ferarcosn?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels)** from **[Pexels](https://www.pexels.com/photo/under-construction-signage-on-laptop-keyboard-211122/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels)** + +### Into stable: Keras mixed precision API + +Recall the floating-point arithmetic that we covered above. Also recall that floating-point numbers increase precision compared to integers, but also require more bits. + +Generally speaking, 32-bit floats and 16-bit floats are used for this purpose. They do however present a trade-off: using `float32` format is more stable, while `float16` is faster. Using `tensorflow.keras.mixed_precision`, it was already possible to **combine both 16-bit and 32-bit floating point types**. + +> Mixed precision is the use of both 16-bit and 32-bit floating-point types in a model during training to make it run faster and use less memory. By keeping certain parts of the model in the 32-bit types for numeric stability, the model will have a lower step time and train equally as well in terms of the evaluation metrics such as accuracy. +> +> TensorFlow (n.d.) + +Using mixed precision, training your model could become faster without losing too much performance in terms of accuracy and so on. With TensorFlow 2.4.0, `tensorflow.keras.mixed_precision` was moved from `experimental` into `stable`. + +### TensorFlow Profiler changes + +If you want to understand why your TensorFlow model performs in a certain way, e.g. because you have changed hardware, you can use the [TensorFlow Profiler](https://www.tensorflow.org/guide/profiler): + +> Use the tools available with the Profiler to track the performance of your TensorFlow models. See how your model performs on the host (CPU), the device (GPU), or on a combination of both the host and device(s). +> +> Profiling helps you understand the hardware resource consumption (time and memory) of the various TensorFlow operations (ops) in your model and resolve performance bottlenecks and ultimately, make the model execute faster. +> +> TensorFlow (n.d.) + +Note from above that the strategy was moved into `stable`. This requires that the Profiler is adapted for a multi-worker strategy as well. In TensorFlow 2.4.0, the Profiler [adds support](https://www.tensorflow.org/guide/profiler#profiling_apis) for a multi-worker setup: + +``` +# E.g. your worker IP addresses are 10.0.0.2, 10.0.0.3, 10.0.0.4, and you +# would like to profile for a duration of 2 seconds. 
+tf.profiler.experimental.client.trace( + 'grpc://10.0.0.2:8466,grpc://10.0.0.3:8466,grpc://10.0.0.4:8466', + 'gs://your_tb_logdir', + 2000) +``` + +_(Credits for the code snippet: TensorFlow, licensed under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0))_ + +In addition, a [TensorFlow Lite Profiler for Android](https://www.tensorflow.org/lite/performance/measurement#trace_tensorflow_lite_internals_in_android) is now available. + +### TensorFlow pip packages CUDA/cuDNN change + +Finally, from TensorFlow 2.4.0 onwards, `pip` packages are now built with different CUDA and cuDNN versions: + +- **CUDA:** 11 +- **cuDNN:** 8.0.2 + +* * * + +## Summary + +In this article, we said hello to TensorFlow version 2.4.0, which is now available in pre-release, and looked at its major features and improvements. Generally speaking, new things focus on distributed training, model optimization and library optimization (through a major refactor of the Functional API). Really new is the addition of the `tensorflow.experimental.numpy` API, which brings an interoperable subset of NumPy functionality to TensorFlow, for performance reasons. + +I hope that you've learnt something new today. Please don't hesitate to drop a comment in the comments section below if you have any questions 💬 Please do the same if you have other comments. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +_TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc._ + +TensorFlow. (n.d.). _Releases · TensorFlow/TensorFlow_. GitHub. [https://github.com/tensorflow/tensorflow/releases](https://github.com/tensorflow/tensorflow/releases) + +TensorFlow. (n.d.). _Tf.distribute.experimental.ParameterServerStrategy_. [https://www.tensorflow.org/api\_docs/python/tf/distribute/experimental/ParameterServerStrategy?version=nightly](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/ParameterServerStrategy?version=nightly) + +TensorFlow. (n.d.). _Module: Tf.distribute_. [https://www.tensorflow.org/api\_docs/python/tf/distribute](https://www.tensorflow.org/api_docs/python/tf/distribute) + +TensorFlow. (n.d.). _Tf.distribute.MultiWorkerMirroredStrategy_. [https://www.tensorflow.org/api\_docs/python/tf/distribute/MultiWorkerMirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/MultiWorkerMirroredStrategy) + +TensorFlow. (n.d.). _Module: Tf.experimental.numpy_. [https://www.tensorflow.org/api\_docs/python/tf/experimental/numpy](https://www.tensorflow.org/api_docs/python/tf/experimental/numpy) + +NVIDIA. (2020, May 18). _NVIDIA blogs: Tensorfloat-32 accelerates AI training HPC upto 20x_. The Official NVIDIA Blog. [https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) + +Wikipedia. (2001, November 11). _Floating-point arithmetic_. Wikipedia, the free encyclopedia. Retrieved November 5, 2020, from [https://en.wikipedia.org/wiki/Floating-point\_arithmetic](https://en.wikipedia.org/wiki/Floating-point_arithmetic) + +TensorFlow. (n.d.). _Mixed precision_. [https://www.tensorflow.org/guide/mixed\_precision](https://www.tensorflow.org/guide/mixed_precision) + +TensorFlow. (n.d.). _Optimize TensorFlow performance using the profiler_. 
[https://www.tensorflow.org/guide/profiler](https://www.tensorflow.org/guide/profiler) diff --git a/simple-multi-options-a-b-n-test-with-multi-armed-bandit-in-python.md b/simple-multi-options-a-b-n-test-with-multi-armed-bandit-in-python.md new file mode 100644 index 0000000..91a8a6f --- /dev/null +++ b/simple-multi-options-a-b-n-test-with-multi-armed-bandit-in-python.md @@ -0,0 +1,277 @@ +--- +title: "Simple Multi-options A/B/n test with Multi-Armed Bandit in Python" +date: "2021-10-05" +categories: + - "reinforcement-learning" +tags: + - "a-b-test" + - "a-b-n-test" + - "bandit" + - "bandits" + - "machine-learning" + - "multi-armed-bandit" + - "multi-armed-bandits" + - "q-value" + - "reinforcement-learning" +--- + +The hardened Machine Learning professional knows that there are three key branches of ML: supervised learning, unsupervised learning and reinforcement learning. In the latter, agents learn to translate state and possibly other internal knowledge into decisions - which impact state, and by consequence, influence the agent's next decision. + +In a way, this is how humans operate. + +The class of **Multi-Armed Bandits** is a simple way of looking at Reinforcement Learning. In this article, we're going to take a look at a simple form of these bandits - the **A/B/n testing scenario**. This is a generalization of A/B testing to multiple choices. It's simple because it does not use state - it only learns from rewards for a particular action in the past. In that way, it's also single-step; it does not look beyond the decision that is currently to be made. However, while simplicity gets you up to speed quickly, there are drawbacks too. We're going to cover everything - including a **step-by-step Python example** for implementing your own A/B/n test. + +Are you ready? Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## About Multi-Armed Bandits and A/B/n testing + +Before we're going to write code, let's take a look at what a **Multi-Armed Bandit or MAB problem** actually is: + +> \[A multi-armed bandit problem\] is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice. +> +> Wikipedia (2005) + +Quite a complex definition! Broken apart, it means that: + +- There is a fixed limited set of resources. In other words, at every run, a choice must be made. +- The choice is between competing choices, meaning that we can only choose one (or a few), but not all choices. +- While doing so, we must maximize our gain, or make choices so that we profit in the best possible way - always selecting the good choices while leaving the poor ones. +- However, we're not fully aware about what is the best choice - we may learn so over time. + +We've now arrived at the **exploration/exploitation** **dilemma** that is continuously present within Reinforcement Learning. Recall from the definition that we always want to make the best possible choice. However, at the beginning, we don't even know what the best possible choice is. We must learn that! In other words, we must first _explore_ all possible choices, until we can _exploit_ our knowledge with a reasonable amount of certainty. 
+ +Where to break off exploration in favor of exploitation, or even choosing to do so (it's also possible to mix both together in multiple ways, as we shall look at in further articles about ε-greedy MABs and Thompson Sampling) must be set in a good way. In essence, this is also a MAB problem in itself :) + +### The advertising scenario + +In the remainder of this article, we're going to work on creating our own A/B/n test. We do so with an **advertising setting**, which is actually a very common setting for A/B/n tests. Suppose that we have a website (like MachineCurve) where ads must be shown. We don't know anything about the user and we don't collect any data in order to understand the user better. Still, we'd love to maximize our revenue, which can be done by maximizing the amount of clicks per amount of views - or the **click-through rate (CTR)**. + +This is a classic MAB problem. At the start, we simply have three ads, and we don't know anything about their performance. We'll have to _explore_ their performance first, before we can show the best ad to our visitors, and _exploit_ our knowledge to _maximize_ our revenue. + +Suppose that these are the three ads. To allow comparison afterwards, we model them to have Binomial distributions with success probabilities (or CTRs) of 1%, 2.4% and 3%, respectively. A Binomial distribution effectively represents a trial (in our case, just _one_ - showing the ad) where the outcome is (1 - success - click) with some probability `p`, and (0 - fail - no click) with probability `1 - p`. In other words, for our three ads, the odds that someone clicks on the ad will be 1%, 2.4% and 3%, respectively, but _we don't know this officially_. + +![](images/Ads.drawio-1024x146.png) + +Let's know take a look at modeling this with Python. + +* * * + +## Creating a Multi-Armed Bandit for A/B/n testing with Python + +Building a MAB for A/b/n testing with Python involves the following steps: + +1. Importing all the dependencies, which are just two in our case. +2. Representing an ad: creating a `SimpleAdvertisement` class which can be shown. +3. Generating three `SimpleAdvertisements`. +4. Setting the variables for our A/B/n scenario. +5. The exploration phase - finding which ad performs best experimentally. +6. The exploitation phase - maximizing profits with the best ad. +7. Plotting the results. + +### Importing dependencies + +The first step would be to import our dependencies. For today's code, you'll only rely on `numpy` for numbers processing and `matplotlib` for visualizing the outcomes. Both can be installed with `pip`, through `pip install numpy`, for example. + +``` +import numpy as np +import matplotlib.pyplot as plt +``` + +### Representing an advertisement: `SimpleAdvertisement` class + +The next step is to represent the blueprint of an advertisement. If you look at what such an ad should do - it should display. Displaying should return a reward, which is either a click or no click. + +Recall that we generate clicks following a [Binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution) which returns 1 (success) with probability `p`. For that reason, we allow configurable `p`s to be passed in the constructor, which we then use to generate a reward when the advertisement is shown in `show`. + +``` +class SimpleAdvertisement(): + """ Representation of a simple advertisement.""" + + def __init__(self, p): + """ + Constructor. Set p value for the binomial distribution + that models user click behavior for this advertisement. 
+ A p-value represents the odds of a click, 0 <= p <= 1. + """ + self.p = p + + def show(self): + """ + Fictitiously show an advertisement. Return a reward: + either 0 (no click) or 1 (click). Draw just once (n) + and draw successfully (click) with probability p. + """ + return np.random.binomial(n=1, p=self.p) +``` + +### Generating three ads + +Let's now generate three ads and put them in a list. Note that each advertisement has a different parameter: `0.01` for ad 1, `0.024` for ad 2 and `0.03` for ad 3. Indeed, these are the CTRs that we saw above - or in other words, the `p values` which represent the probability that our binomial sample returns a `1` (click). + +``` +# Generate the advertisements +advertisement_one = SimpleAdvertisement(0.01) +advertisement_two = SimpleAdvertisement(0.024) +advertisement_three = SimpleAdvertisement(0.03) +advertisements = [advertisement_one, advertisement_two, advertisement_three] +``` + +### Setting scenario variables + +Now that you have created the advertisements, you can set the global variables for our A/B/n test. The number of tests represents the number of exploration iterations in which the best ad is chosen. The number of production runs represents the number of subsequent exploitation iterations in which we continue with the chosen ad, to find a final score. Of course, we hope that this score approximates the CTR of `0.03`, which is that of our best-performing ad. + +Number of ads is simply the number of advertisements created - three. The average rewards over time list is used for storing the average reward after every exploration/exploitation step, so that we can generate a plot later. The reward sum is indeed a sum of all the rewards, and the `N_impres` represents the number of impressions. + +What's left is the `Q_values`. This is an important term in Reinforcement Learning problems and hence also in MAB problems. A **Q-value**, also called **action value**, represents a weighted average of all rewards over time, and is a measure of how well a certain choice performs. You'll see that our task involves picking the ad with the highest Q-value! + +``` +# Set the scenario's variables +num_tests = 12500 +num_prod = 50000 +num_ads = len(advertisements) +average_rewards_over_time = [] +N_impres = np.zeros(num_ads, dtype=np.int) +Q_values = np.zeros(num_ads) +reward_sum = 0 +``` + +### The exploration phase: A/B/n testing + +We've now arrived at the **exploration phase**. In this phase, we run our tests and continuously update the Q values given the reward we received after picking an advertisement, so that we can pick the best-performing advertisement later. + +In the code below, this happens: + +- We iterate over the number of tests, as indicated before. +- We randomly choose one of the advertisements, and set its object reference. +- We show the advertisement and observe whether we have a click (a reward of 1) or no click (a reward of 0). +- We then update the number of impressions for the advertisement and the Q value. As you can see with the Q value, we _add_ the difference between the reward and the current Q value, but in a _weighted_ way - as the number of impressions increases, the less important this Q value update is. +- Finally, we increase the reward sum, compute the average reward over time and append it to all average rewards observed so far. + +``` +def a_b_n_test(num_test, ads): + """ Run A/B/n testing phase. """ + global reward_sum + # Iterate over the test range. 
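+    # In each iteration below: pick a random ad, observe a click (1) or
+    # no click (0), and update that ad's Q value with an incremental mean:
+    #   Q_new = Q_old + (1 / N) * (reward - Q_old)
+    # which equals the average of all rewards observed for that ad so far.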
+    for test in range(num_test):
+        # Pick an advertisement at random.
+        chosen_ad_idx = np.random.randint(len(ads))
+        chosen_ad = ads[chosen_ad_idx]
+        # Observe reward for advertisement [click = 1, no click = 0]
+        reward = chosen_ad.show()
+        # Increase counter for ad and Q/action value
+        N_impres[chosen_ad_idx] = N_impres[chosen_ad_idx] + 1
+        Q_values[chosen_ad_idx] += (1 / N_impres[chosen_ad_idx]) * (reward - Q_values[chosen_ad_idx])
+        # Increase total reward
+        reward_sum += reward
+        average_reward_so_far = reward_sum / (test + 1)
+        average_rewards_over_time.append(average_reward_so_far)
+```
+
+### The exploitation phase: running the chosen ad in production
+
+In the **exploitation phase**, we have selected an ad which we can now run in production. We'll actually pick the best ad from the exploration phase when merging everything together a bit further below, but let's take a look at the code for running the advertisement in production first.
+
+```
+def a_b_n_prod(num_prod, best_ad):
+    """ Run the best ad in production. """
+    global reward_sum
+    # Iterate over the production range.
+    for prod in range(num_prod):
+        # Observe reward for advertisement [click = 1, no click = 0]
+        reward = best_ad.show()
+        # Increase total reward
+        reward_sum += reward
+        average_reward_so_far = reward_sum / (prod + num_tests + 1)
+        average_rewards_over_time.append(average_reward_so_far)
+```
+
+As you can see, we run a number of iterations - `num_prod`. Using the best advertisement, we observe another reward, with which we increase the reward sum and append the updated average reward over time.
+
+### Plotting the results
+
+Before we merge everything together, there's only one thing left - and that is plotting the results.
+
+As you can see below, using `matplotlib`, we plot the average rewards over time. Precisely at the cutoff point between the exploration and exploitation phases (which in our case is after `num_tests = 12500` iterations), we draw a vertical line and write some text, to indicate the change in behavior. Finally, we set a title.
+
+```
+def plot_phases(selected_ad, selected_ad_score):
+    """
+    Plot the reward for the exploration and exploitation phases.
+    """
+    plt.plot(average_rewards_over_time)
+    plt.axvline(x=num_tests, linestyle='--', color='gray') # Plot vertical line at cutoff for exploration
+    plt.text(num_tests-6000, 0, 'Exploration', color='gray')
+    plt.text(num_tests+2000, 0, 'Exploitation', color='gray')
+    plt.title(f"Average reward over time - Selected ad {selected_ad} (score: {selected_ad_score}) - Best ad: 3 (score: 0.03)")
+    plt.show()
+```
+
+### Merging everything together
+
+Next up is merging everything together. In the `ad_scenario()` def, you'll actually create the scenario, and run the exploration and exploitation phases. As you can see:
+
+1. You first run the `a_b_n_test` or exploration scenario on _all_ advertisements.
+2. Using the Q values, you pick the index of the best-performing ad, and display it on screen.
+3. You then pick the ad using this `best_ad_index` and run it in production for `num_prod` iterations.
+4. The overall ad performance is printed on screen.
+5. And a plot is generated.
+
+```
+def ad_scenario():
+    """
+    Run an advertisement based A/B/n Multi-Armed Bandit scenario.
+    Select the best out of three ads, specified above, then run
+    in production. The proper ad was chosen if the average reward
+    over time approximates the highest p value (0.03) chosen
+    with the advertisements at the top of this code.
+    """
+    # 1. 
Run A/B/n test (exploration) on advertisements + a_b_n_test(num_tests, advertisements) + # 2. Pick best ad after testing + best_ad_index = np.argmax(Q_values) + print("="*50) + print(f"Best-performing advertisement after exploration is Ad {best_ad_index+1}") + print("="*50) + print(f"Score board:") + for i in range(len(advertisements)): + print(f"> Ad {i+1} - {Q_values[i]}") + print("="*50) + # 3. Run the ad in production + a_b_n_prod(num_prod, advertisements[best_ad_index]) + # 4. Print overall ad performance + print(f"Global average reward over time: {average_rewards_over_time[-1]}") + print("="*50) + # 5. Plot the performance + plot_phases(best_ad_index, Q_values[best_ad_index]) + + + +if __name__ == '__main__': + ad_scenario() +``` + +### Results and how to improve upon simple A/B/n testing + +Now, it's time to run the code. Quite quickly, you should observe the following: + +[![](images/ad-1024x521.png)](https://www.machinecurve.com/wp-content/uploads/2021/10/ad.png) + +After exploration, **advertisement 2** was the best-performing advertisement (with a CTR of `0.0288` or 2.88%). Recall that the best ad is actually **advertisement 3**, with a CTR of 3%. In this case, while the CTR is close, you could have done better! + +This is one of the drawbacks of a simple A/B/n test: you have to configure the length of the exploration phase yourself. It can be too long, and then you waste precious amounts of time selecting a candidate that could have already been running in production. It can also be too short, which is the case above, and then you lose money - because you wanted to be too fast. + +While quick, it's also dirty, and another drawback of this method is that selected choices (in this case, our selected advertisement) can no longer be changed in the exploitation phase. Indeed, once chosen, you're stuck with a potentially underperforming choice. Finally, during exploration, you'll quickly note that some ads perform quite poorly. As you select advertisements at random during that phase, you'll keep seeing these ads come by - even though you know that they perform worse than your top candidates. With a regular A/B/n test, you cannot drop such ads, unnecessarily lengthening your exploration phase. + +Methods like ε-greedy MABs and Thompson Sampling help you make better choices regarding the exploration/exploitation trade-off. We'll cover them in future articles. + +* * * + +## References + +Wikipedia. (2005, October 7). _Multi-armed bandit_. Wikipedia, the free encyclopedia. Retrieved October 4, 2021, from [https://en.wikipedia.org/wiki/Multi-armed\_bandit](https://en.wikipedia.org/wiki/Multi-armed_bandit) diff --git a/storing-web-app-machine-learning-predictions-in-a-sql-database.md b/storing-web-app-machine-learning-predictions-in-a-sql-database.md new file mode 100644 index 0000000..6c0a8bc --- /dev/null +++ b/storing-web-app-machine-learning-predictions-in-a-sql-database.md @@ -0,0 +1,698 @@ +--- +title: "Storing web app machine learning predictions in a SQL database" +date: "2020-04-13" +categories: + - "deep-learning" + - "frameworks" +tags: + - "database" + - "deployment" + - "fastapi" + - "keras" + - "postgresql" + - "predict" + - "predictions" +--- + +In a previous blog post, we looked at how we could deploy a Keras model [by means of an API](https://www.machinecurve.com/index.php/2020/03/19/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi/). 
That is, once it's ready, we wrap an internet-ready environment around it, so that we can use it in the field - for generating predictions. This way, we can really use our model! + +In that blog post, we actually got an MNIST-trained ConvNet running, having it generate the correct predictions for any numeric inputs that we fed it. + +Now, while deploying the model with an API is a nice achievement, we can do more. For example, we might be interested in all the predictions that are generated with the machine learning model when it's deployed in the field. We thus have to add some kind of data storage to make this work. Let's do this! + +In today's blog post, we'll be using the code [that we created before](https://www.machinecurve.com/index.php/2020/03/19/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi/#full-model-code) and extend it - by means of a PostgreSQL database, so that we can store the predictions. Now, as this might be new territory for you, let me warn you in advance: PostgreSQL databases, or relational databases in general, aren't good choices when you'll be using your model in high-volume settings - like, big data big settings. They will simply fail and there are other solutions for that. But I do think that stepping from a simple machine learning model to solutions such as CassandraDB or Hadoop based appending is a bridge too far. It simply won't allow you to understand _why_ SQL databases have limits when it comes to vast quantities of data. That's why we'll do this post anyway :) + +So, what we're going to do is this: + +- We'll be documenting the flow of data. That is, we discuss the model we're going to deploy, the deployment itself - for those who haven't read that other blog post - and eventually how data is moved into the SQL database. +- We'll be discussing how to set up a basic SQL database for storing predictions. This includes that discussion about data volume through the benefits and drawbacks of relational databases - consistency and, surprisingly, consistency ;-) We also discuss why we use PostgreSQL here. +- Then, we'll take a look at how PostgreSQL and Python can be linked. We discuss things like SQL injection, why this must be avoided at all cost and how the tools we'll use can help you achieve this. +- Having covered all the theory, we move on to the interesting part - actually writing some code! We'll take the code we wrote for deploying our Keras model with FastAPI and extend it with storing the predictions into our PostgreSQL database. What's more, we'll also make a call with which we can retrieve all predictions, and one where we can retrieve one in particular. +- We then run it altogether and see how everything works. + +I hope it'll benefit you. Let's go! 😎 + +* * * + +\[toc\] + +* * * + +## Today's flow: model → deployment → predictions into SQL database + +During the supervised machine learning process, you [feed forward](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) samples, which result in predictions, which results in a loss value, which results in optimization during yet another iteration. + +Eventually, you end up with a machine learning model that works well - and if you do it well, it works _really_ well. That's the **model** stage of today's flow. We're not going to show you here how you can train a machine learning model. 
For example, take a look at [this blog post](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) if you wish to understand this in more detail. Rather, we'll be using the end result to demonstrate how to insert the predictions into a database. + +Now, that's the first step. If you want your model to work well in real life, you'll have to deploy it. You can deploy it in a web application, for example. For deployments like those, you need a means - and a REST API can be one of the means that allows your frontend web application to communicate with the machine learning model. In another blog post, we already [wrapped a FastAPI REST API around a Keras machine learning model](https://www.machinecurve.com/index.php/2020/03/19/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi/). + +That's the **deployment** stage of today's development flow. + +The third and final stage will be the new one: **inserting the predictions into a SQL database**. SQL databases are a class of databases that can be considered relational in nature, and are hence called Relational Database Management Systems (RDBMS). In such databases, you create "entities" (for example a Bus and a TimeTable), and subsequently create relationships between the individual instances (say, bus "9301" has this "timetable"). This is a common format for representing everyday objects and it's therefore unsurprising that it's one of the most widely used database paradigms today. + +Now, you might wonder, what is SQL? It's a _language_ (Wikipedia, 2001). In more detail, it's a language for _querying_ relational databases - and it stands for Structured Query Language. It allows you to insert, update, delete and select values to or from the database. For example, if we wanted to add a new bus: + +``` +INSERT INTO buses (bus_no) VALUES ('9302'); +``` + +Yet another bus added in a language that is understandable for humans. Let's now pinpoint this discussion to setting up a SQL database for storing predictions made by our deployed machine learning model. + +* * * + +## Setting up a basic SQL database for storing predictions + +A common thing in the software industry is the mantra that standards (and, by extension, technologies) are always extended...because everyone wants to rule the world. + +As mentioned, this does apply to technologies too. The open source Operating System Linux [has an enormous amount of variations](https://en.wikipedia.org/wiki/List_of_Linux_distributions) available, some of which remain open source, others of which are proprietary. + +[![](images/standards.png)](https://www.machinecurve.com/wp-content/uploads/2020/04/standards.png) + +Source - [xkcd: Standards](https://xkcd.com/927/) + +The same is true for relational databases. We have MySQL, we have Oracle, we have MariaDB, and so on. + +### Why PostgreSQL? + +But for me, the winner is always PostgreSQL. I've come from a MariaDB background and while it was certainly good too, PostgreSQL really trumps it. It's open source - which means that you don't have to worry about integrating it with your software stack. It's extensible - for example, there is an extremely mature GIS module available for PostgreSQL, allowing you to use native data types and easily convert between the enormous amount (see the pattern?) of coordinate systems available throughout the world. Third of all, it also supports _non-relational_ data types like JSON (Chiessi, 2018). 
+ +### Our database design + +When I make a database design, I always take a look at the _objects_ that we're trying to process into the database. + +Take that buses scenario from above. Clearly, the classes of objects (or entities, in database terms) that we'd have to model if we were drawing a diagram for that scenario are Buses and TimeTables. It could easily be extended with, say, Drivers, and so on - but this isn't a blog about buses. You do however now get the point. + +Now back to our machine learning scenario. When we read the [FastAPI blog post](https://www.machinecurve.com/index.php/2020/03/19/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi/), we can derive a few interesting pointers that suggest some entities that can be modeled by us: + +- **Predictions:** this will be the key entity. It's also clear what it does - store the predictions made for some input. +- **Inputs:** that gets us to the second most important class. While not strictly necessary, it can be wise to store the inputs too. In our case, those would be the images that were fed to the machine learning model. While strictly speaking it _might not be wise to store images in relational databases directly_ (there are better solutions for that, e.g. object storage), we're not going to make our post more confusing than it should be. + +#### The diagram + +Let's now take a look at the diagram in more detail. This is what I came up with for today's blog post: + +![](images/erd-1-1.png) + +A fairly simple database model. We have two tables: **Inputs** and **Predictions**. + +The Inputs table has a primary key (a unique identifier) called `id` and allows us to store the `image`, as text. Why as text, you might wonder? Well: because we'll convert the [input image](https://www.machinecurve.com/index.php/2020/03/19/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi/#defining-the-prediction-route) into Base64 format - so that we can easily store, retrieve and view it again afterwards. + +Then, the Predictions table. It has a unique identifier as well, but also a foreign key to the Inputs table. It essentially links the Predictions to the Input. It also has a `predictions` attribute of type `json`, which stands for JavaScript Object Notation. We choose this data type because the _structure of our predictions depends on the model we're deploying_. For example, in the [tutorial](https://www.machinecurve.com/index.php/2020/03/19/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi/#defining-the-prediction-route), we have a model that utilizes [Softmax](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) to generate a probability distribution over 10 classes. Not every model does this, so we need a generic data type for storing our predictions. JSON does the trick. Now, you might wonder - why don't you use a SQL array? I thought about this, and chose JSON, because SQL arrays would make it more difficult to deploy regression models, which simply generate a numeric value. However, if you're really keen on SQL arrays, you're free to adapt the code that we will write later! :) + +### Creating our database and tables in the database + +Now, assuming that you have installed [PostgreSQL](https://www.postgresql.org/download/) onto your system, as well as [PgAdmin](https://www.pgadmin.org/download/) or a different PostgreSQL client, it's time to create your database. 
First, if you don't understand how to create a database from here, it might be wise to take a look at this video: + +https://www.youtube.com/watch?v=lG2Nes-wi54 + +Source: [Creating a PostgreSQL database with pgAdmin and logging into it - Denys on Data](https://www.youtube.com/watch?v=lG2Nes-wi54) + +Once you're logged in, it's time to execute the SQL queries for generating the database: + +``` +CREATE TABLE Inputs ( + id serial PRIMARY KEY, + image text +); + +CREATE TABLE Predictions ( + id serial PRIMARY KEY, + inputId integer, + predictions json, + FOREIGN KEY (inputId) REFERENCES Inputs(id) +); +``` + +### The benefit of relational databases: consistency + +All right, now that we have a functioning database, we _could_ move on to the code. + +But I think that's not so much of a good idea for the simple reason that you'll need to understand why PostgreSQL (and any relational database) is useful for scenarios where your data volumes aren't _too high_. + +Too high here meaning big data high. But first, let's take a look at why we _would_ use relational databases in the first place. + +If you're already working in an organization, you'll likely know that _data is a mess_. Various proprietary technologies that have been used for years and are now extremely outdated? Common practice. Various standards being patched every time? Not uncommon. And so on. Massive ERP systems that will deliver you grey hairs? Yep, I understand your frustration. Data is a pain in the ass. + +But generally, _within_ a relational database, the problem shouldn't be too big. Instead, here, we benefit from the _consistency_ principle of a relational database. This principle simply means that all data is consistent, and the database ensures that it is. That is, each input is processed accordingly, no data is lost, and no data is linked ("related") in wrong ways. + +This is a great benefit, because you don't want your bus to depart with the wrong time table, the wrong driver, and so on. You neither want your ML prediction to be linked to the wrong input, or not even inserted at all, generating an orphan input in the database. That's great! + +### The drawback of relational databases: consistency, and what it means for volume + +...but it's also a drawback. The fact that consistency emerges in relational database means that it _takes time to process new inputs_. Not much time, but too much time to handle the massive data volumes that we see today. For example, for platforms like YouTube or Facebook, using relational database technologies at the core of what they're doing simply doesn't work. They need something bigger. + +One of the solutions that we don't cover today is CassandraDB. It gives up strict consistency for availability. While it's always available, it might not always give back the correct result - but the likelihood for this is very low. It's just the trade-off we need to make when creating solutions for big data. + +But enough about Cassandra for now. Let's put our focus back onto what we were doing: inserting the predictions generated by our FastAPI machine learning model into a PostgreSQL database. + +* * * + +## PostgreSQL and Python: how does it work? + +For today's blog post, we're going to be using the [psycopg2 database adapter](https://pypi.org/project/psycopg2/). This adapter links Python code with your PostgreSQL database and can be installed very easily, with `pip install psycopg2`. + +Before we start, there's just one thing we'll have to check first: SQL injection, and why to avoid it. 
+ +### SQL injection: what it is and why to avoid it + +Let's take a look at this comic from xkcd, which is very popular in the world of programmers: + +[![](images/exploits_of_a_mom.png)](https://www.machinecurve.com/wp-content/uploads/2020/04/exploits_of_a_mom.png) + +Source - [xkcd: Exploits of a Mom](https://xkcd.com/327/) + +It's a classic example of SQL injection. + +Wikipedia describes it as follows: + +> **SQL injection** is a code injection technique, used to attack data-driven applications, in which malicious SQL statements are inserted into an entry field for execution (e.g. to dump the database contents to the attacker). +> +> [Wikipedia (2004)](https://en.wikipedia.org/wiki/SQL_injection) + +In the comic above, all students were removed from the system because the school's software was not protected well against SQL injection :) + +We don't want this to happen to our inputs and predictions, for obvious reasons. We'll have to protect our API against SQL injection, as that's the _attack vector_ for those attacks in our case. + +Fortunately, if used right, `psycopg2` sanitizes your queries automatically for you. Read [here](https://www.psycopg.org/docs/usage.html#query-parameters) more about what you must definitely _not_ do when using this adapter. + +Let's now extend the previous FastAPI code with some PostgreSQL based calls! + +* * * + +## Extending our previous FastAPI deployment code with PostgreSQL + +Before we extend our code, I think it might be nice to take a look at what we have so far. Here it is - we'll discuss it below the code: + +``` +# Imports +from fastapi import FastAPI, File, UploadFile, HTTPException +from PIL import Image +from pydantic import BaseModel +from tensorflow.keras.models import load_model +from typing import List +import io +import numpy as np +import sys + +# Load the model +filepath = './saved_model' +model = load_model(filepath, compile = True) + +# Get the input shape for the model layer +input_shape = model.layers[0].input_shape + +# Define the FastAPI app +app = FastAPI() + +# Define the Response +class Prediction(BaseModel): + filename: str + contenttype: str + prediction: List[float] = [] + likely_class: int + +# Define the main route +@app.get('/') +def root_route(): + return { 'error': 'Use GET /prediction instead of the root route!' 
} + +# Define the /prediction route +@app.post('/prediction/', response_model=Prediction) +async def prediction_route(file: UploadFile = File(...)): + + # Ensure that this is an image + if file.content_type.startswith('image/') is False: + raise HTTPException(status_code=400, detail=f'File \'{file.filename}\' is not an image.') + + try: + # Read image contents + contents = await file.read() + pil_image = Image.open(io.BytesIO(contents)) + + # Resize image to expected input shape + pil_image = pil_image.resize((input_shape[1], input_shape[2])) + + # Convert from RGBA to RGB *to avoid alpha channels* + if pil_image.mode == 'RGBA': + pil_image = pil_image.convert('RGB') + + # Convert image into grayscale *if expected* + if input_shape[3] and input_shape[3] == 1: + pil_image = pil_image.convert('L') + + # Convert image into numpy format + numpy_image = np.array(pil_image).reshape((input_shape[1], input_shape[2], input_shape[3])) + + # Scale data (depending on your model) + numpy_image = numpy_image / 255 + + # Generate prediction + prediction_array = np.array([numpy_image]) + predictions = model.predict(prediction_array) + prediction = predictions[0] + likely_class = np.argmax(prediction) + + return { + 'filename': file.filename, + 'contenttype': file.content_type, + 'prediction': prediction.tolist(), + 'likely_class': likely_class + } + except: + e = sys.exc_info()[1] + raise HTTPException(status_code=500, detail=str(e)) +``` + +In short, this code... + +- Loads the Keras model that we trained [earlier](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/). +- Starts a FastAPI app, which is a REST API. +- Defines the response and generates a warning to the root route that the API must be used differently. +- Specifies a `POST /prediction` route which (1) makes the input image uniform and (2) generates the actual predictions, returning them in the API response. + +### Storing inputs and predictions + +Let's first extend this code with storing the prediction. For this to work, we'll need to add a few more imports at the top: + +``` +import psycopg2 +import base64 +from io import BytesIO +import json +``` + +Well, for the first, we know what it does. We need the second and third in order to convert the input image into Base64 format, which is one way of storing the input data (see 'database diagram' above for a bit of elaboration on why I chose this format). Then, we also need `json`, for storing the predictions later on. + +Then, directly below the `app = FastAPI()` statement, we make the connection to our database: + +``` +# Make a connection to the database +conn = psycopg2.connect("dbname=mcsample user=postgres password=postgres") +``` + +You can specify any [database parameter](https://www.postgresqltutorial.com/postgresql-python/connect/) you want. 
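+
+For example, if your database doesn't run locally with the default settings, you can also pass the connection details explicitly with keyword arguments. This is just a sketch - the credentials and host below are placeholders that you should replace with your own:
+
+```
+# Connect with explicit parameters instead of a single connection string
+conn = psycopg2.connect(
+    dbname='mcsample',
+    user='postgres',
+    password='postgres',
+    host='127.0.0.1',
+    port=5432
+)
+```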
+ +We then specify two new definitions - `store_input` and `store_prediction`: + +``` +# Store an input image +def store_input(image): + # Convert input into Base64 + buffered = BytesIO() + image.save(buffered, format='JPEG') + img_str = base64.b64encode(buffered.getvalue()) + img_base64 = bytes("data:image/jpeg;base64,", encoding='utf-8') + img_str + base_string = img_base64.decode("utf-8") + # Create a cursor + cur = conn.cursor() + # Define the query + sql = """INSERT INTO inputs(image) + VALUES(%s) RETURNING id;""" + # Perform the query + cur.execute(sql, (base_string,)) + # Get the input id + input_id = cur.fetchone()[0] + # Commit and close + conn.commit() + cur.close() + # Return the input id + return input_id + +# Store a prediction +def store_prediction(prediction, input_id): + # Convert prediction into json + prediction = json.dumps(prediction.tolist()) + # Create a cursor + cur = conn.cursor() + # Define the query + sql = """INSERT INTO predictions(inputId, predictions) + VALUES(%s, %s) RETURNING id;""" + # Perform the query + cur.execute(sql, (input_id,prediction)) + # Get the prediction id + prediction_id = cur.fetchone()[0] + # Commit and close + conn.commit() + cur.close() + # Return the prediction id + return prediction_id +``` + +The flow is relatively equal in both cases: we take the input, create what is known as a "cursor", define the query, and execute it, before closing the cursor again. Then, we return the identifier of the newly stored input or prediction. + +Just before the return statement in our `POST /prediction` call, we add these lines of code: + +``` +# Store the input +input_id = store_input(pil_image) + +# Store the prediction +prediction_id = store_prediction(prediction, input_id) +``` + +Now, all predictions should be stored to the database. But let's extend it a little bit more! + +### Retrieving all predictions from the database + +The next thing we'll specify is a _new call_ - `GET /predictions`. It simply retrieves all predictions from the database: + +``` +# Get all predictions +@app.get('/predictions/') +def get_predictions(): + # Create a cursor + cur = conn.cursor() + # Define the query + sql = """SELECT * FROM predictions ORDER BY id DESC;""" + # Perform the query + cur.execute(sql) + # Get the predictions + predictions = cur.fetchall() + # Commit and close + conn.commit() + cur.close() + # Return the predictions + return predictions +``` + +### Retrieving one prediction by id + +Sometimes, though, you only want to retrieve _just one_ prediction, instead of all of them. In that case, we should also add some code for that. 
+ +``` +# Get all predictions +@app.get('/prediction/{prediction_id}') +def get_prediction(prediction_id: str): + # Create a cursor + cur = conn.cursor() + # Define the query + sql = """SELECT p.predictions, i.image + FROM predictions p, inputs i + WHERE p.inputId = i.id + AND p.id = %s;""" + # Perform the query + cur.execute(sql,(prediction_id,)) + # Get the prediction + prediction = cur.fetchone() + # Commit and close + conn.commit() + cur.close() + # Check if we have a prediction + if prediction is not None: + return prediction + else: + return { + 'message': f'No prediction with id {prediction_id} could be found' + } +``` + +### Full code + +In total, this yields the following code: + +``` +# Imports +from fastapi import FastAPI, File, UploadFile, HTTPException +from PIL import Image +from pydantic import BaseModel +from tensorflow.keras.models import load_model +from typing import List +import io +import numpy as np +import sys +import psycopg2 +import base64 +from io import BytesIO +import json + +# Load the model +filepath = './saved_model' +model = load_model(filepath, compile = True) + +# Get the input shape for the model layer +input_shape = model.layers[0].input_shape + +# Define the FastAPI app +app = FastAPI() + +# Make a connection to the database +conn = psycopg2.connect("dbname=mcsample user=postgres password=aime") + +# Define the Response +class Prediction(BaseModel): + filename: str + contenttype: str + prediction: List[float] = [] + likely_class: int + +# Define the main route +@app.get('/') +def root_route(): + return { 'error': 'Use GET /prediction instead of the root route!' } + +# Store an input image +def store_input(image): + # Convert input into Base64 + buffered = BytesIO() + image.save(buffered, format='JPEG') + img_str = base64.b64encode(buffered.getvalue()) + img_base64 = bytes("data:image/jpeg;base64,", encoding='utf-8') + img_str + base_string = img_base64.decode("utf-8") + # Create a cursor + cur = conn.cursor() + # Define the query + sql = """INSERT INTO inputs(image) + VALUES(%s) RETURNING id;""" + # Perform the query + cur.execute(sql, (base_string,)) + # Get the input id + input_id = cur.fetchone()[0] + # Commit and close + conn.commit() + cur.close() + # Return the input id + return input_id + +# Store a prediction +def store_prediction(prediction, input_id): + # Convert prediction into json + prediction = json.dumps(prediction.tolist()) + # Create a cursor + cur = conn.cursor() + # Define the query + sql = """INSERT INTO predictions(inputId, predictions) + VALUES(%s, %s) RETURNING id;""" + # Perform the query + cur.execute(sql, (input_id,prediction)) + # Get the prediction id + prediction_id = cur.fetchone()[0] + # Commit and close + conn.commit() + cur.close() + # Return the prediction id + return prediction_id + +# Define the /prediction route +@app.post('/prediction/', response_model=Prediction) +async def prediction_route(file: UploadFile = File(...)): + + # Ensure that this is an image + if file.content_type.startswith('image/') is False: + raise HTTPException(status_code=400, detail=f'File \'{file.filename}\' is not an image.') + + try: + # Read image contents + contents = await file.read() + pil_image = Image.open(io.BytesIO(contents)) + + # Resize image to expected input shape + pil_image = pil_image.resize((input_shape[1], input_shape[2])) + + # Convert from RGBA to RGB *to avoid alpha channels* + if pil_image.mode == 'RGBA': + pil_image = pil_image.convert('RGB') + + # Convert image into grayscale *if expected* + if input_shape[3] and 
input_shape[3] == 1: + pil_image = pil_image.convert('L') + + # Convert image into numpy format + numpy_image = np.array(pil_image).reshape((input_shape[1], input_shape[2], input_shape[3])) + + # Scale data (depending on your model) + numpy_image = numpy_image / 255 + + # Generate prediction + prediction_array = np.array([numpy_image]) + predictions = model.predict(prediction_array) + prediction = predictions[0] + likely_class = np.argmax(prediction) + + # Store the input + input_id = store_input(pil_image) + + # Store the prediction + prediction_id = store_prediction(prediction, input_id) + + return { + 'filename': file.filename, + 'contenttype': file.content_type, + 'prediction': prediction.tolist(), + 'likely_class': likely_class, + 'input_id': input_id, + 'prediction_id': prediction_id + } + except: + e = sys.exc_info()[1] + raise HTTPException(status_code=500, detail=str(e)) + +# Get all predictions +@app.get('/predictions/') +def get_predictions(): + # Create a cursor + cur = conn.cursor() + # Define the query + sql = """SELECT * FROM predictions ORDER BY id DESC;""" + # Perform the query + cur.execute(sql) + # Get the predictions + predictions = cur.fetchall() + # Commit and close + conn.commit() + cur.close() + # Return the predictions + return predictions + +# Get all predictions +@app.get('/prediction/{prediction_id}') +def get_prediction(prediction_id: str): + # Create a cursor + cur = conn.cursor() + # Define the query + sql = """SELECT p.predictions, i.image + FROM predictions p, inputs i + WHERE p.inputId = i.id + AND p.id = %s;""" + # Perform the query + cur.execute(sql,(prediction_id,)) + # Get the prediction + prediction = cur.fetchone() + # Commit and close + conn.commit() + cur.close() + # Check if we have a prediction + if prediction is not None: + return prediction + else: + return { + 'message': f'No prediction with id {prediction_id} could be found' + } +``` + +* * * + +## Running it altogether + +Let's now see if we can run it :) As with the [FastAPI tutorial](https://www.machinecurve.com/index.php/2020/03/19/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi/), we [run it with uvicorn](https://www.machinecurve.com/index.php/2020/03/19/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi/#running-the-deployed-model). Open up a terminal, `cd` to the directory where your `main.py` file is stored (it's the file we created with the FastAPI instance, so if you don't have it yet because you started here, create one with your code) and execute `uvicorn main:app --reload`. Then, the app should start: + +``` +uvicorn main:app --reload +INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit) +INFO: Started reloader process [20780] +2020-04-13 12:19:41.537433: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll +2020-04-13 12:19:44.494767: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll +2020-04-13 12:19:45.353113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: +name: GeForce GTX 1050 Ti with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.4175 +pciBusID: 0000:01:00.0 +2020-04-13 12:19:45.360620: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check. 
+2020-04-13 12:19:45.367452: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 +2020-04-13 12:19:45.371210: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2 +2020-04-13 12:19:45.387313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: +name: GeForce GTX 1050 Ti with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.4175 +pciBusID: 0000:01:00.0 +2020-04-13 12:19:45.408220: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check. +2020-04-13 12:19:45.414763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 +2020-04-13 12:19:46.094212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix: +2020-04-13 12:19:46.099454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0 +2020-04-13 12:19:46.102651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N +2020-04-13 12:19:46.107943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2998 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1) +INFO: Started server process [24148] +INFO: Waiting for application startup. +INFO: Application startup complete. +``` + +Time to go! + +### Generating a prediction + +Generating a new prediction is not done differently than [previously](https://www.machinecurve.com/index.php/2020/03/19/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi/#running-the-deployed-model): + +[![](images/image-2-1024x248.png)](https://www.machinecurve.com/wp-content/uploads/2020/03/image-2.png) + +[![](images/image-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/03/image-1.png) + +Yielding the correct prediction indeed: + +``` +{ + "filename": "mnist_sample.png", + "contenttype": "image/png", + "prediction": [ + 0.0004434768052306026, + 0.003073320258408785, + 0.008758937008678913, + 0.0034302924759685993, + 0.0006626666290685534, + 0.0021806098520755768, + 0.000005191866875975393, + 0.9642654657363892, + 0.003465399844571948, + 0.013714754022657871 + ], + "likely_class": 7 +} +``` + +This time, though, it's also stored in the database: + +[![](images/image-4-1024x432.png)](https://www.machinecurve.com/wp-content/uploads/2020/04/image-4.png) + +And so are the predictions: + +[![](images/image-5.png)](https://www.machinecurve.com/wp-content/uploads/2020/04/image-5.png) + +### Retrieving the predictions + +Time to check our next calls: retrieving all predictions and just one. As expected, `GET /predictions` nicely returns all the predictions that were stored in our database: + +![](images/image-6.png) + +Whereas the `GET /prediction/{id}` call, say `GET /prediction/4`, nicely returns the prediction as well as the Base64 input image: + +[![](images/image-7.png)](https://www.machinecurve.com/wp-content/uploads/2020/04/image-7.png) + +Which, using an online [Base64 image decoder](https://codebeautify.org/base64-to-image-converter), can be converted into the original input again: + +[![](images/image-8-1024x616.png)](https://www.machinecurve.com/wp-content/uploads/2020/04/image-8.png) + +Nice! We have a working database storage for our machine learning predictions! 
:) + +* * * + +## Summary + +In this blog post, we looked at how to store the predictions generated by your Keras machine learning model into a PostgreSQL database. For this, we looked at the generic flow from training the model towards storing the predictions after deployment first. This was followed by a brief introduction to relational database management systems and PostgreSQL in particular. + +Afterwards, we designed our database schema, looked at the PostgreSQL/Python connector to be used and extended the Python code for model deployment that we created previously. It allows us to store the predictions into the database, while also retrieving all predictions or just one. + +I hope you've learnt something today! If you did, I'd appreciate it if you left a comment in the comments section below 👇 Please do the same if you have any questions or remarks. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2001, June 28). _SQL_. Wikipedia, the free encyclopedia. Retrieved April 12, 2020, from [https://en.wikipedia.org/wiki/SQL](https://en.wikipedia.org/wiki/SQL) + +Chiessi, L. (2018, September 21). _Why should I use PostgreSQL as database in my startup/Company_. Medium. [https://medium.com/we-build-state-of-the-art-software-creating/why-should-i-use-postgresql-as-database-in-my-startup-company-96de2fd375a9](https://medium.com/we-build-state-of-the-art-software-creating/why-should-i-use-postgresql-as-database-in-my-startup-company-96de2fd375a9) + +Wikipedia. (2004, March 14). _SQL injection_. Wikipedia, the free encyclopedia. Retrieved April 13, 2020, from [https://en.wikipedia.org/wiki/SQL\_injection](https://en.wikipedia.org/wiki/SQL_injection) diff --git a/stylegan-a-step-by-step-introduction.md b/stylegan-a-step-by-step-introduction.md new file mode 100644 index 0000000..ffbb98d --- /dev/null +++ b/stylegan-a-step-by-step-introduction.md @@ -0,0 +1,255 @@ +--- +title: "StyleGAN, a step-by-step introduction" +date: "2021-12-27" +categories: + - "deep-learning" +tags: + - "deep-learning" + - "gan" + - "generative-adversarial-networks" + - "neural-network" + - "stylegan" +--- + +Generative Adversarial Networks have been the go-to machine learning technique for generative content in the past few years. Seemingly magically converting random inputs into highly detailed outputs, they have found applications in generating images, generating music, and even generating medication. + +**StyleGAN** is a GAN type that really moved the state-of-the-art in GANs forward. When the paper introducing StyleGAN, "[A style-based generator architecture for generative adversarial networks](https://arxiv.org/abs/1812.04948)" by Karras et al. (2018) appeared, GANs required heavy regularization and were not able to produce such stunning results as they are known for today. + +In this article, we'll dive deep into the StyleGAN architecture. Firstly, we introduce the high-level architecture of a classic or vanilla GAN, so that we can subsequently introduce StyleGAN's high-level architecture and compare both. This already sheds some light on high-level differences and how StyleGAN is radically different compared to approaches that were prominent at the time. Then, we'll take a look at each individual StyleGAN component and discuss it in detail. This way, you'll also learn about what's beyond the high level details, and understand the impact of each individual component. + +Are you ready? Let's go! 
😎
+
+* * *
+
+\[toc\]
+
+* * *
+
+## Classic GANs, a recap
+
+Before we dive into StyleGAN, let's take a look at the high-level architecture of a classic Generative Adversarial Network first. As the figure below shows, it is composed of two main components - a **generator**, which generates fake images, and a **discriminator**, which has the task of correctly distinguishing between fake and real images.
+
+The discriminator is trained with real images, which have a specific statistical distribution - the _data distribution_. The generator takes a sample from some distribution - also called the _latent_ distribution because after training, it is structured in such a way that it mimics the data distribution - and converts it into a fake image.
+
+Both real and fake images are fed to the discriminator during training, after which a loss value is computed. Both models are optimized given this loss value. The discriminator will face an increasingly hard time telling fake images from real ones, because the generator will be able to generate progressively more accurate outputs. The reverse holds as well: the generator will become better and better, because it will find idiosyncrasies in the data that it can exploit.
+
+In other words, the scenario can be viewed as a counterfeiter versus the police situation, where the counterfeiter becomes progressively better, until the discriminator may no longer be capable of distinguishing fake images from real ones. That's the moment when the generator is ready for the real world: its latent distribution then almost equals the data distribution and it's capable of generating realistic images on its own.
+
+![](images/GAN.jpg)
+
+### Problems with classic Generative Adversarial Networks
+
+The first GAN paper was quite a breakthrough when it appeared in 2014: in "[Generative adversarial networks](https://arxiv.org/abs/1406.2661)", Goodfellow et al. (2014) introduced the training procedure that was discussed above. Soon, however, it became clear that training a classic GAN results in a few issues - as becomes clear from Karras et al. (2018) as well. Here are three of the main issues solved by StyleGAN:
+
+- **Generators operate as black boxes.** Latent spaces of classic GANs were poorly understood at the time of writing the Karras et al. paper.
+- **GANs must be heavily regularized.** The game played between the generator and discriminator is a delicate one - and it proved very easy for one of the two to overpower the other early in the training process. When this happens, the other cannot recover, and expectations never materialize. Heavy regularization must be applied to solve this issue.
+- **There is little control over image synthesis.** A great latent space is structured according to some kind of order. In other words, if I were to pick a sample and move a bit, the generated image should at least resemble the image of my picked sample. And changes should be comparable across [generators](https://www.machinecurve.com/index.php/2021/03/24/an-introduction-to-dcgans/) of [different kinds](https://www.machinecurve.com/index.php/2021/03/25/conditional-gans-cgans-explained/). But they aren't.
+
+Let's now take a look at StyleGAN. Rather than building a whole image from a latent vector, it uses the latent space to _control_ the synthesis process. In other words, rather than providing the foundation for generation, StyleGAN provides the steering wheel with which it's possible to control what can be generated. 
And more smartly, it separates noisy and stochastic details (such as the generation of where hairs are located) from more fixed components (such as whether a person in a generated image is wearing glasses). Let's take a look at StyleGAN at a high level now. + +* * * + +## StyleGAN, a high-level overview + +The figure below shows you the high-level architecture of StyleGAN, as found in Karras et al. (2018). + +There are two vertical blocks involved: + +- The **mapping network**, called \[latex\]f\[/latex\], is visible on the left. It maps a (normalized) latent vector \[latex\]\\textbf{z} \\in Z\[/latex\] into another vector \[latex\]\\textbf{w}\[/latex\] from an intermediate latent space, called \[latex\]W\[/latex\]. This mapping network is a simple set of fully-connected feedforward layers. +- The **synthesis network**, called \[latex\]g\[/latex\] and visible on the right, uses \[latex\]\\textbf{w}\[/latex\] to generate a "style" that controls the image synthesis process. It begins with a Constant, \[latex\]4 \\times 4 \\times 512\[/latex\] dimensional vector. Scaled noise samples (\[latex\]\\text{B}\[/latex\]) are generated and added to this Constant tensor. Subsequently, the style (\[latex\]\\text{A}\[/latex\]) is added via Adaptive Instance Normalization (AdaIN) operations, after which a convolution operation is applied. This is followed by another noise addition and AdaIN-based styling operation. We then arrive at an image at a 4x4 pixel resolution. In the next block, the image is upsampled, and the same is performed again, arriving at an 8x8 pixel resolution. This is repeated until the image is 1024x1024 pixels. + +Clearly, we can already see a big difference between classic GANs and StyleGAN. The latent vector \[latex\]\\textbf{z}\[/latex\] is no longer used directly in the image synthesis process. Interestingly, and even surprising the authors of the StyleGAN paper, starting with a Constant tensor was possible and even produced good results. + +Rather than being the foundation of the image synthesis process, \[latex\]\\textbf{z}\[/latex\] is now used to generate styles that _control_ the synthesis process. + +If you do not understand everything that was written above, don't worry. It's an extreme summarization and only highlights what happens at a high level. If you want to dive into StyleGAN in depth, let's now spend some time looking at the details. If, however, you're having trouble understanding basic GAN concepts such as a _latent space_ or _latent vector_, it may be best to read the [introduction to GANs article](https://www.machinecurve.com/index.php/2021/03/23/generative-adversarial-networks-a-gentle-introduction/) first. + +![](images/StyleGAN.drawio-925x1024.png) + +StyleGAN architecture. Source: Karras et al. (2018) + +* * * + +## StyleGAN in more detail + +We will now look at the mapping and synthesis networks and their individual components in more detail. This allows you to get a detailed understanding of how StyleGAN works. + +### The mapping network f + +We start with the mapping network, also called \[latex\]f\[/latex\]. It takes a latent vector \[latex\]\\textbf{z}\[/latex\] sampled from the original latent distribution and performs a learned mapping to an intermediate latent vector, \[latex\]\\textbf{w}\[/latex\]. This mapping is performed with a stack of fully-connected layers in a neural network. 
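+
+To make this concrete before walking through the individual steps, here is a minimal Keras sketch of such a mapping network. It is not the official implementation - just an illustration that assumes 512-dimensional \[latex\]\\textbf{z}\[/latex\] and \[latex\]\\textbf{w}\[/latex\] vectors, eight fully-connected layers and LeakyReLU activations, all of which are discussed in the subsections below:
+
+```
+import tensorflow as tf
+from tensorflow.keras import layers
+
+def build_mapping_network(latent_dim=512, num_layers=8):
+    """ Sketch of StyleGAN's mapping network f: z -> w. """
+    z = tf.keras.Input(shape=(latent_dim,), name='z')
+    w = z
+    for _ in range(num_layers):
+        # Nonlinear fully-connected mapping (slope 0.2 is a common choice)
+        w = layers.Dense(latent_dim)(w)
+        w = layers.LeakyReLU(0.2)(w)
+    return tf.keras.Model(z, w, name='mapping_network')
+```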
#### Sampling latent vectors z

![](images/sample_normalization.png)

Before any forward pass - whether during training or inference - the latent vector \[latex\]\\textbf{z}\[/latex\] is **sampled from the original latent distribution**.

A standard normal distribution is used for sampling the latent vectors \[latex\]\\textbf{z}\[/latex\] in StyleGAN. This is a common distribution to sample from when it comes to GANs.

According to the paper, its latent space is 512-dimensional (Karras et al., 2018).

#### Latent vector normalization

Neural networks are notorious for suffering from poor performance when inputs aren't normalized or, even better, standardized. By means of a **normalization step**, the vector can be made ready for input. [Min-max normalization](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) is one of the options. [Standardization](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) is too.

If you use a _standard normal distribution_ in your StyleGAN implementation, it's questionable whether you'll need this normalization step - as your inputs will already have zero mean and unit variance. Still, it doesn't hurt to keep it in.

#### The stack of fully-connected feedforward layers to generate intermediate latent vector w

Your (potentially normalized) sampled latent vector \[latex\]\\textbf{z}\[/latex\] is now ready for input. **It's fed to the actual _mapping network_**, which is a neural network with 8 trainable [fully connected (or Dense) layers](https://www.machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/), a.k.a. a Multilayer Perceptron or MLP. It produces another vector, an intermediate latent vector \[latex\]\\textbf{w}\[/latex\]. This is the latent vector that will be used by the synthesis network for generating the output image.

The mapping is nonlinear, meaning that each fully-connected layer has an activation function, typically a ReLU or LeakyReLU one.

![](images/mapping_network.png)

Intermediate latent vector w is also 512-dimensional (Karras et al., 2018).

Now, the question you're likely asking is: **why do we need such a mapping in the first place?**

For this, we'll have to take a look at a concept called _entanglement_. When something is entangled, it....

> \[has become\] twisted together with or caught in.
>
> Google, when searching for 'entangled definition'

If a latent space were _disentangled_, it would consist of linear subspaces (Karras et al., 2018). In normal English, this means that there are parts of the dimensions of the latent space that control certain aspects of the image.

For example, if our 512-dimensional latent space \[latex\]Z\[/latex\] were fully disentangled and part of a GAN trained on faces, dimension 1 would control glasses, dimension 2 hair, dimension 3 face shape, and so on. By simply moving in one dimension, one would have full control over one specific aspect of the image, and generating images of choice would be really easy.

Unfortunately, GANs usually don't have disentangled spaces. We saw it before - classic GANs offer the machine learning engineer little control over their latent space. This is a simpler way of saying that latent spaces are entangled.
The authors propose that having a mapping network convert the originally sampled latent vector \[latex\]\\textbf{z}\[/latex\] into an intermediate vector \[latex\]\\textbf{w}\[/latex\] from a learned and intermediate latent distribution \[latex\]W\[/latex\] ensures that sampling for the synthesis process is not done from a _fixed distribution_ - such as the standard normal distribution with all its characteristics and idiosyncrasies. Rather, it is performed from a _learned distribution_. This distribution is learned in such a way that it is as disentangled as possible - pressure that originates from the generator itself, because it produces better outcomes that way (Karras et al., 2018).

Indeed, having such a network improves all metrics that describe distribution entanglement and the eventual synthesis performed from the learned latent distribution compared with the data distribution from the real images. As 8 layers in the mapping network produced the best result, 8 layers are chosen (Karras et al., 2018).

#### So, in other words

- A latent vector \[latex\]\\textbf{z}\[/latex\] is sampled from a chosen distribution, usually a standard normal distribution, and is 512-dimensional.
- It's fed through an 8-layer nonlinear MLP, producing a 512-dimensional intermediate latent vector \[latex\]\\textbf{w}\[/latex\] that will be used by the synthesis network to control the styles of the image being generated.
- The nonlinear learned mapping is necessary to reduce entanglement of the latent space used by the synthesis network (generator). This allows StyleGAN to significantly improve control over the latent space as well as to produce better results.

### The synthesis network g

Now that we understand why the _mapping network_ produces an intermediate latent vector, and how it does that, it's time to see how it's used for generating the output image. In other words, let's take a look at the **synthesis network**. This network is also called \[latex\]g\[/latex\].

![](images/StyleGAN.drawio-1.png)

A high-level overview of the first part of the synthesis network, up to and including the 8x8 resolution.

### Synthesis blocks

In the image above, you can see that StyleGAN's synthesis network utilizes **synthesis blocks** - which progressively build the image by upsampling the image resolution from 4x4, to 8x8, to 16x16, ... eventually to 1024x1024 pixels.

The core components of each StyleGAN synthesis block are:

- **Upsampling (except for the first synthesis block).** The output of the previous synthesis block is made bigger so that it can subsequently be processed.
- **Convolution layer.**
- **Adaptive Instance Normalization (AdaIN).**
- **Style vectors (A) and noise vectors (B)**.

We'll take a look at each individual component in more detail next. You will then discover what each component does in StyleGAN. Let's first start with the beginning point of the first synthesis block: the `Constant` Tensor that the image is built from!

#### The Constant starting point in the first synthesis block

Yes, you heard it correctly - the starting point of StyleGAN's synthesis block is a **constant value**.

This is a complete departure from the design of previous GANs, which all started from the sample drawn from latent space.

This input is learned, and initialized as ones (Karras et al., 2018). In other words, it changes slightly during training, just like any other trainable parameter - but it is the same for every image that is synthesized.
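To make the idea of a learned constant input a bit more tangible, here is a small, hypothetical sketch in TensorFlow - not the official implementation, just an illustration of a trainable 4x4x512 tensor, initialized as ones and repeated for every sample in a batch.

```
import tensorflow as tf

# Illustrative sketch: a learned 4x4x512 starting tensor, initialized as ones.
# It is a trainable variable, so it changes during training like any other weight,
# but it is identical for every image that is synthesized.
constant_input = tf.Variable(
    initial_value=tf.ones((1, 4, 4, 512)),
    trainable=True,
    name='constant_input')

batch_size = 8
# Repeat the single learned tensor for every sample in the batch
x = tf.tile(constant_input, [batch_size, 1, 1, 1])
print(x.shape)  # (8, 4, 4, 512)
```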
#### Styles and noise, two important synthesis elements

The Constant Tensor is now processed by the rest of the synthesis block. Although we'll discuss each component in the block in more detail now, it's important to know that there are two high-level concepts that will come back:

- **Styles.** The Constant is like the 'backbone' being used for any kind of synthesis. Like a painter, who has a specific style, the _high-level components_ in a generated image are adapted by what is known as a 'style'.
- **Noise.** If you'd feed the same Constant Tensor to the synthesis block with the same styles, you'd get precisely the same image over and over again. This is not realistic: in the real world, factors like wind have an impact on how someone is pictured. In other words, there is randomness involved in generating a picture, and this is achieved through noise.

#### How noise is generated and added

![](images/image.png)

The first thing that happens with the Constant value is that **_noise_** is added to it.

The need for noise is best explained by looking at the hair in a picture.

Suppose that the picture below was generated by StyleGAN. It wasn't, but suppose it is. You can see that it contains a variety of components which all have different granularity:

- The **lower-granularity components,** such as the head (and specifically its position), the torso, and so forth. For each instance of class _human_, they are relatively similar.
- The **higher-granularity components**, instead - like the hair - differ _between people_ but also _between pictures of the same person_. The position of one's hair in a picture is dependent on relatively deterministic choices - like one's hair style - but also on seemingly random impacts, like wind.

Noise is what determines these higher-granularity components. The position of the woman's hair in the picture below? If it were generated by a StyleGAN, it wouldn't have been driven by the styles you will hear about next, but by randomness - and thus noise.

The noise Tensor is drawn from a Gaussian distribution (Karras et al., 2018).

![](images/pexels-daria-shevtsova-880474-819x1024.jpg)

#### How w is converted into styles

Now that we know how noise adds randomness to a generated picture, it's time to take a look at _**styles**_ and how they control the image synthesis process.

This starts with the latent vector **w** generated by the mapping network.

This vector is fed to what is called **A** in the overview below - the learned _affine transformations_ part of the neural network.

> In [Euclidean geometry](https://en.wikipedia.org/wiki/Euclidean_geometry), an **affine transformation**, or an **affinity** (from the Latin, _affinis_, "connected with"), is a [geometric transformation](https://en.wikipedia.org/wiki/Geometric_transformation) that preserves [lines](https://en.wikipedia.org/wiki/Line_(geometry)) and [parallelism](https://en.wikipedia.org/wiki/Parallelism_(geometry)) (but not necessarily [distances](https://en.wikipedia.org/wiki/Euclidean_distance) and [angles](https://en.wikipedia.org/wiki/Angle)).
>
> https://en.wikipedia.org/wiki/Affine\_transformation

If, for example, we have the vector \[latex\]\\begin{bmatrix}2 \\\\ 3 \\end{bmatrix}\[/latex\], an affine transform could produce the vector \[latex\]\\begin{bmatrix}4 \\\\ 6 \\end{bmatrix}\[/latex\] (a scaling by factor 2: the _lines_ in space are preserved, only made longer, so _distances_ are not preserved).
Conceptually, this means that affine transforms can _change the image components without overhauling the image_, because the affine transformation outputs must be "connected with" the input, being the latent vector **w**.

The input vector **w** is transformed into style \[latex\]\\textbf{y}\[/latex\] where \[latex\]\\textbf{y} = (\\textbf{y}\_s, \\textbf{y}\_b)\[/latex\]. These are the _scale_ and _bias_ components of the style, respectively (and you will learn below how they are used). They contain one scale and one bias value for each feature map of the synthesis Tensor they will control.

The affine transformations are learned during training and hence are the components that can be used to _control_ the image synthesis process for the lower-granularity components, such as the hair style, skin color, and so forth - whereas, remember, the randomness is used to control the _position_ of, for example, the individual hairs.

You should now be able to explain how _styles_ and _randomness_ allow us to generate a unique image. Let's now take a more precise look at _how_ styles can control image generation.

![](images/image-1.png)

#### Adaptive Instance Normalization based style addition

The **how** of style addition boils down to combining two ingredients:

- The **noise-added Constant Tensor** (in the first, 4x4 pixels synthesis block) or the **produced Tensor so far** (for the other synthesis blocks).
- The **style** produced by the affine transformation, applied by means of **Adaptive Instance Normalization** (AdaIN).

This is what AdaIN looks like:

![](images/image-2.png)

Here, \[latex\]\\textbf{x}\_i\[/latex\] is the \[latex\]i\[/latex\]th feature map from the input Tensor (i.e., the \[latex\]i\[/latex\]th channel), and \[latex\]\\textbf{y}\[/latex\] is the style generated by the affine transformation.

You can see in the middle part that the feature map is first normalized (or rather, [standardized](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/)) to zero mean and unit variance - and subsequently _scaled_ by the \[latex\]i\[/latex\]th element from the style's scale component, after which the \[latex\]i\[/latex\]th bias component is added.

In other words, AdaIN ensures that the generated _styles_ can **control** the (normalized) synthesis input by changing scale and/or bias. This is how styles control the image synthesis process on the noise-added input Tensor!

#### The second and higher synthesis blocks - upsampling, then control

The text above primarily focused on the first synthesis block - the output of which is a 4 by 4 pixels image. As you can imagine, this is barely enough to be impressed with :)

![](images/image-3-300x290.png)

Subsequent synthesis blocks (8x8, 16x16, up to 1024x1024 pixels) work slightly differently compared to the first synthesis block:

- First, **bilinear upsampling** is applied to upsample the image, after which a **[2d Convolutional layer with 3x3 kernel](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/)** processes the upsampled result in a learned way.
- Subsequently, **noise is added**, after which an **AdaIN** operation for style control is performed.
- Then, another **convolution** is applied with a similar Convolutional layer, after which another noise-and-style-control step is performed.

For the second block, this yields an 8x8 pixel image, and so forth.

#### The end result

The end result of StyleGAN when trained on faces is really cool!
:) + +![](images/stylegan-teaser-1024x614.png) + +Source: Karras et al. (2018) and [https://github.com/NVlabs/stylegan/blob/master/stylegan-teaser.png](https://github.com/NVlabs/stylegan/blob/master/stylegan-teaser.png), [CC Attribution-NonCommercial 4.0 International license](https://creativecommons.org/licenses/by-nc/4.0/). + +* * * + +## References + +Karras, T., Laine, S., & Aila, T. (2018). [A style-based generator architecture for generative adversarial networks.](https://arxiv.org/abs/1812.04948) _arXiv preprint arXiv:1812.04948_. + +Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … & Bengio, Y. (2014). [Generative adversarial networks.](https://arxiv.org/abs/1406.2661) _arXiv preprint arXiv:1406.2661_. diff --git a/tensorflow-cloud-easy-cloud-based-training-of-your-keras-model.md b/tensorflow-cloud-easy-cloud-based-training-of-your-keras-model.md new file mode 100644 index 0000000..01c8cda --- /dev/null +++ b/tensorflow-cloud-easy-cloud-based-training-of-your-keras-model.md @@ -0,0 +1,667 @@ +--- +title: "TensorFlow Cloud: easy cloud-based training of your Keras model" +date: "2020-10-16" +categories: + - "deep-learning" + - "frameworks" +tags: + - "cloud" + - "deep-learning" + - "keras" + - "tensorflow" + - "training-process" +--- + +Training a supervised machine learning model does often require a significant amount of resources. With ever-growing datasets and models that continuously become deeper and sometimes wider, the computational cost of getting a well-performing model increases day after day. That's why it's sometimes not worthwhile to train your model on a machine that is running on-premise: the cost of buying and maintaining such a machine doesn't outweigh the benefits. + +In those cases, cloud platforms come to the rescue. By means of various service offerings, many cloud vendors - think Amazon Web Services, Microsoft Azure and Google Cloud Platform - have pooled together resources that can be used and paid for as you use them. For example, they allow you to train your model with a few heavy machines, while you simply turn them off after you've finished training your model. In many cases, the limited costs of this approach (especially compared to the cost of owning and maintaining a heavy-resource machine) really makes training your models off-premises worthwhile. + +Traditionally, training a model in the cloud hasn't been stupidly easy. TensorFlow Cloud changes this. By simply connecting to the Google Cloud Platform, with a few lines of code, it allows you to train your Keras models in the cloud. This is great, because a training job can even be started from your own machine. What's more, if desired, TensorFlow Cloud supports parallelism - meaning that you can use _multiple_ machines for training, all at once! While training a model in the cloud was not difficult before, doing so distributed was. + +In this article, we'll be exploring TensorFlow Cloud in more detail. Firstly, we'll be looking at the need for cloud-based training, by showing the need for training with heavy equipment as well as the cost of getting such a device. We then also argue for why cloud services can help you reduce the cost without losing the benefits of such heavy machinery. Subsequently, we introduce the Google Cloud AI Platform, with which TensorFlow Cloud connects for training your models. This altogether gives us the context we need for getting towards the real work. + +The real work, here, is TensorFlow cloud itself. 
We'll introduce it by looking at the TensorFlow Cloud API and especially at the cloud strategies that can be employed, allowing you to train your model in a distributed way. After the introduction, we will show how TensorFlow Cloud can be installed and linked to your Keras model. Finally, we demonstrate how a Keras model can actually be trained in the cloud. This concludes today's article.

Let's take a look! :)

**Update 02/Nov/2020:** fixed issue with file name in Step 2.

* * *

\[toc\]

* * *

## The need for cloud-based training: resources required for training

Deep learning has been very popular for eight years now, as of 2020. Especially in the fields of Computer Vision and Natural Language Processing, deep learning models have, in many cases, outperformed previously state-of-the-art non-ML approaches.

For example, only today, I was impressed because the municipality of my nation's capital - Amsterdam - has trained and deployed deep learning models to detect garbage alongside the road, or to detect whether people maintain mandatory social distancing measures against the [COVID-19 pandemic](https://www.machinecurve.com/index.php/2020/03/17/help-fight-covid-19-participate-in-the-cord-19-challenge/). They in fact used a variety of open source frameworks and libraries, _and_ pretrained models - a great feat!

Now, they also argued that training some of the models was costly in terms of the computational resources that are required. For example, in a case study where a data engineer showed how a model was created for detecting bicycle road signs on road pavement, he argued that approximately 150 GB of data was to be used for training. It took three full hours to train the model on four NVIDIA Tesla V100 GPUs, which are among the fastest currently on the market.

With a cost of approximately $11,500 for just **one** 32GB GPU (and hence roughly $46k for four), the investment would be enormous if you were to purchase a machine for your deep learning workloads. For $46k, you only have the GPUs! Now, even worse, it's very likely that you wouldn't run deep learning workloads 24/7, all the time between the 'go live' moment and the end-of-life of the purchased hardware.

This is in effect a waste of money.

Cloud vendors, such as Amazon Web Services, Digital Ocean, Microsoft Azure and Google Cloud Platform, have recognized this matter and have very competitive offerings available for deep learning workloads. For example, at the time of writing, a 64GB EC2 P3 machine with Amazon costs only $12.24 per hour. Yes: those 3 hours of training would now cost less than $40. That makes training worthwhile!

* * *

## Training ML models in Google Cloud AI Platform

Google's offering for training deep learning models is embedded in the [Google Cloud AI Platform](https://cloud.google.com/ai-platform). It's one platform to build, deploy and manage machine learning models, as the introductory text argues:

![](images/image-14-1024x566.png)

In fact, it supports multiple phases. At a high level, the AI Platform provides functionality for **preparing your dataset**, **building your model**, **validating your model** and **deploying your model**. Effectively, it allows you to do these things in the individual stages:

- **Prepare stage:** labeling your dataset as well as storing and retrieving it to and from Google BigQuery.
+- **Build:** playing around with code in Notebooks, training your models on highly powered machines, and applying AutoML functionality for [training automation](https://www.machinecurve.com/index.php/2020/06/09/automating-neural-network-configuration-with-keras-tuner/). +- **Validate:** once a model is trained, the AI Platform allows you to perform activities related to explainable AI, and black-box optimization with a tool called Vizier. +- **Deploy:** once trained, validated and considered ready for production usage, a model can be deployed through Google too. + +![](images/image-16.png) + +Today, we will be using only a minor part of the AI Platform: from the **build** step, we'll be using the highly powered machines to train our model with. In addition, we'll be using the [Google Cloud Container Registry](https://cloud.google.com/container-registry) to temporarily store the Docker image that we'll build with TensorFlow Cloud, to be run by the AI Platform machines. As we're now talking about TensorFlow Cloud, let's inspect it in more detail, before we move on to actual coding stuff :) + +* * * + +## The next five years of Keras: introducing TensorFlow Cloud + +In early 2020, during the Scaled Machine Learning Conference, there was a talk by François Chollet - the creator of the Keras framework that is being widely used for training deep learning models today. + +Below, you can see the talk, but in it, Chollet argued that three key developments will drive the next five years of Keras development: + +1. **Automation** +2. **Scale & Cloud** +3. **Into the real world** + +https://www.youtube.com/watch?v=HBqCpWldPII + +### Automating models & applying them in the real world + +Here, with **automation**, Chollet means that developments like [automated hyperparameter tuning](https://www.machinecurve.com/index.php/2020/06/09/automating-neural-network-configuration-with-keras-tuner/), architecture search and even Automated Machine Learning (AutoML) will help commoditize the field of Machine Learning. Gone will be the days where practicing Deep Learning will be considered a field only accessible to people who are highly familiar with mathematics and complex programming. No, instead, the ML power user (and perhaps even the more introductory user!) will provide a dataset and desired model outcomes, and some automation program will find the best set of architectural and hyper-architectural principles to apply. + +With **into the real world**, Chollet argues that future Keras developments will focus on getting Deep Learning models _out there_. By packaging data preprocessing with the model, to give just an example, models can be run in the field much more robustly. In addition, it's likely that edge equipment shall be used more often, requiring the need to [optimize models](https://www.machinecurve.com/index.php/tag/model-optimization/). + +And so on, and so on! + +### Scaling & cloud-based training + +However, related to this blog post is **scaling the model and training them in the cloud**. Chollet argues that it's sometimes better to _not_ train models on-premise, for the reason that clouds provide services dedicated to training your models efficiently. We saw that training a deep learning model will be fastest when you have a heavyweight GPU in your machine. It becomes even faster by having many of such GPUs and applying a local distributed training strategy. 
However, it's unlikely that your deep learning machine runs 24/7, making it inefficient in terms of the total cost of ownership for such a machine. + +That's why in the video, Chollet introduced **TensorFlow Cloud**. It's a means for training your TensorFlow model in the cloud. In fact, the [TensorFlow Cloud GitHub page](https://github.com/tensorflow/cloud) describes it as follows: + +> The TensorFlow Cloud repository provides APIs that will allow to easily go from debugging, training, tuning your Keras and TensorFlow code in a local environment to distributed training/tuning on Cloud. +> +> TensorFlow/cloud (n.d.) + +Let's now take a look at those APIs, or primarily, the TensorFlow Cloud `run` API. + +### The TensorFlow Cloud API + +Within the `tensorflow_cloud` module that will be available upon installing TensorFlow Cloud (we will get to that later), a definition called `run` is available in order to let your model train in the cloud. This definition will do multiple things: + +1. Making your Keras code cloud ready (TensorFlow/cloud, n.d.) +2. Packaging your model code into a [Docker container](https://www.machinecurve.com/index.php/2020/10/07/easy-install-of-jupyter-notebook-with-tensorflow-and-docker/#what-is-docker) which can be deployed in the cloud for training. +3. Subsequently, deploying this container and training the model with the TensorFlow training (distribution) strategy of your choice. +4. Write logs to a cloud-hosted [TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/). + +Here is the arguments list of the `def` - we'll describe the arguments soon, and show an example later in this article: + +``` +def run( + entry_point=None, + requirements_txt=None, + docker_config="auto", + distribution_strategy="auto", + chief_config="auto", + worker_config="auto", + worker_count=0, + entry_point_args=None, + stream_logs=False, + job_labels=None, + **kwargs +): + """Runs your Tensorflow code in Google Cloud Platform. +``` + +Those are the arguments: + +- The **entry\_point** describes where TensorFlow Cloud must pick up your Python code (e.g. a file called `keras.py`) or your [Notebook](https://www.machinecurve.com/index.php/2020/10/07/easy-install-of-jupyter-notebook-with-tensorflow-and-docker/) (`*.ipynb`) for preprocessing and packaging it into a Docker container. +- The **requirements\_txt** (optional) is the file path to a file called `requirements.txt` where you can specify additional `pip` packages to be installed. +- The **docker\_config** (optional) allows you to configure additional settings for the Docker container. For example, by configuring the `base_image`, you can specify a custom Docker image to start with as base image, and with `image_build_bucket` you can specify a bucket where Google Cloud Platform stores the built container if you choose to build it in the cloud. It defaults to 'auto', which means that default settings are used. +- The **distribution\_strategy** (optional) allows you to pick a training strategy for your cloud-based training. Those are distributed TensorFlow training strategies which we will cover in more detail later in this article. By default, it is set to 'auto', which means that an appropriate strategy is automatically inferred based on `chief_config`, `worker_config` and `worker_count`. +- In a distributed training scenario, it is often the case that one machine is the leader whereas others are followers. 
This ensures that there will be no decision-making issues related to the coordination of information about e.g. the state of a machine. The **chief\_config** allows you to pick the Google Cloud Platform machine type for your 'chief', a.k.a. the leader. By default, it is set to 'auto', which means the deployment of a `COMMON_MACHINE_CONFIGS.T4_1X` machine (8 cpu cores, 30GB memory, 1 Nvidia Tesla T4). +- The **worker\_config** (optional) describes the machine type of your workers, a.k.a. the followers. By default, it is also set to auto, meaning the deployment of a `COMMON_MACHINE_CONFIGS.T4_1X` machine (8 cpu cores, 30GB memory, 1 Nvidia Tesla T4). +- The **worker\_count** (optional) describes the number of workers you want to deploy besides your chief. +- The **entry\_point\_args** (optional), which should be a list of Strings, represents extra arguments to be input to the program run through the `entry_point`. +- If enabled, **stream\_logs** (optional, default False) streams logs back from the training job in the cloud. +- With **job\_labels** (optional), you can specify up to 64 key-value pairs of labels and values that together organize the cloud training jobs. Very useful in a scenario where you'll train a lot of TensorFlow models in the cloud, as you can organize stuff programmatically instead of manually. + +### Available machine configurations + +For `worker_config` and `chief_config`, there are many out-the-box machine configurations available within [TensorFlow Cloud](https://github.com/tensorflow/cloud/blob/master/src/python/tensorflow_cloud/core/machine_config.py): + +``` +COMMON_MACHINE_CONFIGS = { + "CPU": MachineConfig( + cpu_cores=4, + memory=15, + accelerator_type=AcceleratorType.NO_ACCELERATOR, + accelerator_count=0, + ), + "K80_1X": MachineConfig( + cpu_cores=8, + memory=30, + accelerator_type=AcceleratorType.NVIDIA_TESLA_K80, + accelerator_count=1, + ), + "K80_4X": MachineConfig( + cpu_cores=16, + memory=60, + accelerator_type=AcceleratorType.NVIDIA_TESLA_K80, + accelerator_count=4, + ), + "K80_8X": MachineConfig( + cpu_cores=32, + memory=120, + accelerator_type=AcceleratorType.NVIDIA_TESLA_K80, + accelerator_count=8, + ), + "P100_1X": MachineConfig( + cpu_cores=8, + memory=30, + accelerator_type=AcceleratorType.NVIDIA_TESLA_P100, + accelerator_count=1, + ), + "P100_4X": MachineConfig( + cpu_cores=16, + memory=60, + accelerator_type=AcceleratorType.NVIDIA_TESLA_P100, + accelerator_count=4, + ), + "P4_1X": MachineConfig( + cpu_cores=8, + memory=30, + accelerator_type=AcceleratorType.NVIDIA_TESLA_P4, + accelerator_count=1, + ), + "P4_4X": MachineConfig( + cpu_cores=16, + memory=60, + accelerator_type=AcceleratorType.NVIDIA_TESLA_P4, + accelerator_count=4, + ), + "V100_1X": MachineConfig( + cpu_cores=8, + memory=30, + accelerator_type=AcceleratorType.NVIDIA_TESLA_V100, + accelerator_count=1, + ), + "V100_4X": MachineConfig( + cpu_cores=16, + memory=60, + accelerator_type=AcceleratorType.NVIDIA_TESLA_V100, + accelerator_count=4, + ), + "T4_1X": MachineConfig( + cpu_cores=8, + memory=30, + accelerator_type=AcceleratorType.NVIDIA_TESLA_T4, + accelerator_count=1, + ), + "T4_4X": MachineConfig( + cpu_cores=16, + memory=60, + accelerator_type=AcceleratorType.NVIDIA_TESLA_T4, + accelerator_count=4, + ), + "TPU": MachineConfig( + cpu_cores=None, + memory=None, + accelerator_type=AcceleratorType.TPU_V3, + accelerator_count=8, + ), +} +``` + +### Cloud distribution strategies + +We saw that it is possible to pick a particular cloud **distribution strategy** when training a 
TensorFlow model by means of TensorFlow cloud. A distribution strategy is a common term for those who are already used to distributed training, but then locally, using multiple GPUs in e.g. one or multiple machines. + +Here are the cloud distribution strategies available within TensorFlow Cloud. We'll cover them in more detail next: + +- **No distribution:** a CPU-based chief without any workers. +- **OneDeviceStrategy:** a GPU-based chief without any workers. +- **MirroredStrategy:** a GPU-based chief with multiple GPUs, without any additional workers. +- **MultiWorkerMirroredStrategy:** a GPU-based chief with multiple GPUs, as well as additional workers +- **TPUStrategy:** a TPU-based config with a chief having 1 CPU and one worker having a TPU. +- **Custom distribution strategy:** picking a custom strategy is also possible, requiring you to turn off TensorFlow Cloud based distribution strategies. + +Below, we'll cover those strategies in more detail. The code samples come from the [TensorFlow Cloud GitHub page](https://github.com/tensorflow/cloud) and are licensed under the [Apache License 2.0](https://github.com/tensorflow/cloud/blob/master/LICENSE). + +#### No distribution + +If you choose to train `some_model.py` in the cloud without any distribution strategy, TensorFlow Cloud will interpret your choice as "I don't want any GPUs to train with". In this case, it will spawn a CPU-based machine - i.e. one chief that runs on a CPU - and trains your model there. This could be an interesting choice from a cost perspective or getting used to using TensorFlow cloud, but it's not the best choice in terms of training a model at scale. + +``` +tfc.run(entry_point='some_model.py', + chief_config=tfc.COMMON_MACHINE_CONFIGS['CPU']) +``` + +#### OneDeviceStrategy + +If you choose for a `OneDeviceStrategy`, TensorFlow Cloud will spawn a chief in Google Cloud Platform with one GPU attached. More specifically, it will spawn a `T4_1X` strategy, which means a machine having 8 cpu cores, 30GB memory, and 1 Nvidia Tesla T4. This will already give your training process a significant boost. Remember that by means of `chief_config`, you can choose to use another machine type, e.g. if you want a more powerful GPU. + +``` +tfc.run(entry_point='some_model.py') +``` + +#### MirroredStrategy + +Choosing a `MirroredStrategy` will equal a mirrored strategy in local distributed training - that is, it will benefit from multiple GPUs on one machine. In this case, a machine will be spawned with 4 V100 GPUs, which will give your training process an enormous boost: + +``` +tfc.run(entry_point='some_model.py', + chief_config=tfc.COMMON_MACHINE_CONFIGS['V100_4X']) +``` + +#### MultiWorkerMirroredStrategy + +If that's still not enough, you can also deploy a `MultiWorkerMirroredStrategy`. A mouth full of words, this strategy effectively combines the MirroredStrategy with a multi-device scenario. That is, here, beyond the `chief_config`, you also specify the config of your workers, as well as the number of workers you want. In the example below, one chief will run with one V100 GPU, whereas two workers will be spawned with 8 V100 GPUs each. + +This will result in an extremely fast training process, but will also result in significant cost if you still have a substantial training operation. However, so do the other GPU-based and cloud strategies, so choose wisely! 
+ +``` +tfc.run(entry_point='some_model.py', + chief_config=tfc.COMMON_MACHINE_CONFIGS['V100_1X'], + worker_count=2, + worker_config=tfc.COMMON_MACHINE_CONFIGS['V100_8X']) +``` + +#### TPUStrategy + +If, however, you don't want to train your model with a GPU but with a TPU (a processing unit specifically tailored to Tensors and hence a good choice for training TensorFlow models), you can do so by employing a `TPUStrategy`: + +``` +tfc.run(entry_point="some_model.py", + chief_config=tfc.COMMON_MACHINE_CONFIGS["CPU"], + worker_count=1, + worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"]) +``` + +Here, the chief runs with a CPU, while you have one TPU-based worker. At the time of writing, this only works with TensorFlow 2.1, which is the only version currently supported to run with a TPU-based strategy. (Please leave a comment if this is no longer the case, and I'll make sure to adapt!) + +#### Custom distribution strategy + +Sometimes, all the strategies mentioned before do not meet your requirements. In this case, it becomes possible to define a custom distribution strategy - but you must do so within your model code, as with regular distributed TensorFlow. In order not to have TensorFlow cloud interfere with your custom distribution strategy, you must turn it off in the TF Cloud code: + +``` +tfc.run(entry_point='some_model.py', + distribution_strategy=None, + worker_count=2) +``` + +* * * + +## Installing TensorFlow Cloud + +Let's now take a look at how to install TensorFlow Cloud :) + +### Step 1: ensure that you meet the dependencies + +First of all, it is important that you ensure that you have met all the dependencies for running TensorFlow cloud on your machine. Those dependencies are as follows: + +- **Python:** you'll need Python version >= 3.5 (check with `python -V`) +- **TensorFlow**: you'll need TensorFlow 2.x, since you'll train a Keras model. +- **Google Cloud Project**: TensorFlow Cloud will run as this project. +- **An authenticated GCP Service account**: you'll need to authenticate as yourself if you wish to run jobs in the cloud, so you'll also need this. +- **Google AI Platform APIs enabled**: the AI platform is used for the deployment of the built Docker images on the Google Cloud Platform. +- **Docker, or a Google Cloud Storage bucket/Google Cloud Build**: although the creators of TensorFlow Cloud argue that it's better to use Google Cloud Build to build the Docker container, you can optionally do so as well with your local Docker daemon. +- **Authentication to the Docker Container Registry**. +- **Nbconvert** (optional) if you want to train directly from a Jupyter Notebook. + +If you **don't** meet the requirements, don't worry. While you will need to install the correct Python version yourself, and will need to create your own Google account, we will take you through all the necessary steps + +### Step 2: becoming Google Cloud Ready + +The first step is becoming 'cloud ready' with the Google Cloud Platform. We assume that you have already created a Google Account. + +Becoming Cloud Ready involves two primary steps: + +1. Creating a Google Cloud Project that has the necessary APIs enabled. +2. Creating a GCP service account for authenticating to the cloud from your machine. 
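Both steps are walked through using the Google Cloud Console below. If you prefer the command line, the `gcloud` CLI offers equivalents for most of them; the snippet below is a rough sketch with a hypothetical project ID and service account name, and it is not part of the official TensorFlow Cloud instructions - you would still need to grant roles and download a key as described in Step 2B, so treat the console-based walkthrough as the reference.

```
# Hypothetical project ID and service account name - adjust to your own situation
gcloud projects create tf-cloud-demo
gcloud config set project tf-cloud-demo

# Enable the AI Platform Training & Prediction API and the Compute Engine API
gcloud services enable ml.googleapis.com compute.googleapis.com

# Create a service account (roles and a JSON key still need to be added, see Step 2B)
gcloud iam service-accounts create tf-cloud-sa
```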
+ +#### Step 2A: create a Google Cloud Project with enabled APIs + +First, navigate to the [Project Selector page](https://console.cloud.google.com/projectselector2) in the Google Cloud Platform, which looks like this: + +![](images/image-6-1024x218.png) + +Select an existing project or create a new one by clicking 'Create Project'. This should look like this: + +![](images/image-8.png) + +Then, make sure to enable billing: you can find out how [here](https://cloud.google.com/billing/docs/how-to/modify-project). + +If you have created a Google Cloud Project (e.g. named `TF Cloud`, like me) and have enabled billing, it's time to enable the APIs that we need for running TensorFlow Cloud! Click [here](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component&_ga=2.89617243.1196059648.1602526829-739722337.1600260254), which will allow you to enable the AI Platform Training & Prediction API and the Compute Engine API in your project. + +![](images/image-10.png) + +#### Step 2B: creating a GCP service account for authentication + +The next step is ensuring that you created what is called a Service Account for authenticating yourself from your machine. This can be done [here](https://console.cloud.google.com/apis/credentials/serviceaccountkey?_ga=2.88036184.1196059648.1602526829-739722337.1600260254). After having navigated to this page, follow the following steps: + +1. From the dropdown list **Service account**, click **New service account**. +2. Enter a name in the **Service account name** field. +3. Add the following **Role** items: **Machine Learning Engineer > ML Engine Admin**, and **Storage > Storage Object Admin**. Those roles ensure that you can (1) train your model in the cloud, and (2) store your built container in Google Cloud Storage. +4. Click **Create**. Now, a JSON file containing the key of your service account is downloaded to your computer. Move it into some relatively persistent directory, e.g. `/home/username/tensorflow/credentials.json`. +5. Export the path to the JSON file to `GOOGLE_APPLICATION_CREDENTIALS`: + 1. On Linux or MacOS, run `export GOOGLE_APPLICATION_CREDENTIALS="[PATH]"` in a terminal, with `[PATH]` replaced by the path to your credentials file. + 2. On Windows, open PowerShell (right mouse click on the Start Windows logo, then PowerShell, preferably Admin rights enabled), and run `$env:GOOGLE_APPLICATION_CREDENTIALS="[PATH]"`, with `[PATH`\] replaced by the path to your credentials file. You might also wish to do so via the Start menu, then Environment variables (type this into your Win8/10 search bar), adding the `GOOGLE_APPLICATION_CREDENTIALS` variable in the GUI. Sometimes, this works better than the other way. + +![](images/image-12.png) + +### Step 3: install and configure latest TensorFlow Cloud + +Now that you have enabled the necessary Google Cloud APIs and have become cloud ready for training your model, it's time to install TensorFlow Cloud. While the steps 1 and 2 looked a bit complex, installing TensorFlow Cloud is really easy and can be done with `pip`: + +``` +pip install -U tensorflow-cloud +``` + +### Step 4: install Docker if you don't want to build your containers in the cloud + +If you don't want to build the Docker container in which your training job runs in the cloud, you can also build it with a running Docker daemon on your machine. In that case, you must install Docker on your machine. Please find instructions for doing so here: [Get Docker](https://docs.docker.com/get-docker/). 
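Optionally, at this point you can do a quick sanity check to verify that the packages import correctly and that your service account credentials are picked up. This is just a small illustrative snippet, not part of the official instructions; it assumes the `google-auth` package, which is installed as a dependency of the Google Cloud tooling.

```
import tensorflow as tf
import tensorflow_cloud as tfc
import google.auth

# TensorFlow should be a 2.x version
print(tf.__version__)

# google.auth.default() picks up GOOGLE_APPLICATION_CREDENTIALS; it raises a
# DefaultCredentialsError if no valid service account key can be found
credentials, project_id = google.auth.default()
print(project_id)
```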
### Step 5: install `nbconvert` if you'll train from a Jupyter Notebook

If you want to train in the cloud from a Jupyter Notebook, you must convert it into a workable format first. [Nbconvert](https://nbconvert.readthedocs.io/en/latest/) can be used for this purpose. Hence, if you want to run cloud-based training from a Jupyter Notebook - which is entirely optional, as it can be run from `.py` files as well - then `nbconvert` must be installed as follows:

```
pip install nbconvert
```

* * *

## Training your Keras model

Now that we have installed TensorFlow Cloud, it's time to look at training the Keras model itself - i.e., in the cloud. This involves four main steps:

1. Creating a Keras model
2. Training it locally, to ensure that it works
3. Adding code for TensorFlow Cloud
4. Running it!

Let's take a look in more detail.

### Step 1: pick a Keras model to train

For training our Keras model in the cloud, we need - well, a Keras model. The model below is a relatively simple [Convolutional Neural Network](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/), for the creation of which you can find a detailed article [here](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/). Within a few iterations, it can reach significant accuracies on the [MNIST dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/). Precisely this simplicity is what makes the model easy to follow, and why it's good for an educational setting like this. Obviously, you wouldn't normally train such easy and simple models in the cloud.

Open up your code editor, create a file called `model.py`, and add the following code.
+ +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 1 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() +input_shape = (img_width, img_height, 1) + +# Reshape data for ConvNet +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize [0, 255] into [0, 1] +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +### Step 2: train it locally, but briefly, to see that it works + +Before running your model in the cloud, we should ensure that it can train properly. That is, that it loads data properly, that it configures the model properly, and that the training process starts. + +This is why we set the `no_epochs` to 1 in the configuration options above. It will train for just one iteration, and will show us whether it works. + +Open up a terminal, `cd` to the folder where your `model.py` file is located, and run `python model.py`. If all is well, possibly including the download of the MNIST dataset, you should eventually see the following: + +``` +48000/48000 [==============================] - 6s 132us/sample - loss: 0.3527 - accuracy: 0.8942 - val_loss: 0.0995 - val_accuracy: 0.9712 +Test loss: 0.08811050849966705 / Test accuracy: 0.9728000164031982 +``` + +### Step 3: invoking it with TensorFlow cloud + +Time to add TensorFlow Cloud! + +Create another file called `cloud.py`, and open it in your code editor. 
+ +First of all, make sure to add TensorFlow Cloud into your new file: + +``` +import tensorflow_cloud as tfc +``` + +Then add the `run` call for TensorFlow Cloud: + +``` +# Add TensorFlow cloud +tfc.run( + entry_point='model.py', + distribution_strategy='auto', + requirements_txt='requirements.txt', + chief_config=tfc.COMMON_MACHINE_CONFIGS['V100_4X'], + worker_count=0) +``` + +Here, we specify that the model we want to run resides in `model.py` (which it does), that we let TF Cloud determine the distribution strategy, that additional requirements are specified in `requirements.txt` (so make sure to create that file too, even though you can leave it empty), and that we will run our chief on a machine that has 4 Tesla V100 GPUs. We don't use any workers. + +### Step 4: Run it! + +Time to run it! Open up your terminal, navigate to the folder where your `cloud.py` and `model.py` files are located, and run `python cloud.py`. + +#### Error: Python version mismatch + +The first time I ran my `cloud.py`, I got this error: + +``` +>>> from google.auth.transport import mtls +Traceback (most recent call last): + File "", line 1, in +ImportError: cannot import name 'mtls' from 'google.auth.transport' +``` + +Strange! For some reason, it seemed that an old version of `google-auth` was installed or came installed with TensorFlow Cloud. I'm not sure, but if you're running into this issue, the fix is as follows: install version 1.17.2 or newer, like this. + +``` +pip install google-auth==1.17.2 +``` + +Then, when you run again, it starts building the Docker container: + +![](images/image-13.png) + +This takes quite some time, because downloading the TensorFlow base image is quite resource-intensive. + +#### Error: GCP unauthorized + +The next error you may face now is the following: + +``` +RuntimeError: Docker image publish failed: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication +``` + +Run `gcloud auth configure-docker` and confirm the settings: + +``` + { + "credHelpers": { + "gcr.io": "gcloud", + "marketplace.gcr.io": "gcloud", + "eu.gcr.io": "gcloud", + "us.gcr.io": "gcloud", + "staging-k8s.gcr.io": "gcloud", + "asia.gcr.io": "gcloud" + } +} + +Do you want to continue (Y/n)? y +``` + +#### Publishing! + +When you now run again, it should seem as if the process freezes at publishing: + +``` +INFO:tensorflow_cloud.core.containerize:Publishing docker image: gcr.io/tf-cloud-292417/tf_cloud_train:f852b25b_2b3c_4f92_a70a_d0544c99e02c +``` + +This is in fact _good_, because it is actually sending your built image into the Google Container Registry. Here, it will be available for the AI Platform to run. + +#### TypeError: default() got an unexpected keyword argument 'quota\_project\_id' + +If you're getting this error, you might wish to update `google-auth` with `pip`: + +``` +pip install --upgrade google-auth +``` + +It could be the case that your `google-auth` version was old, too old for the `quota_project_id` keyword argument. 
In my case, the upgrade fixed the issue, and as we can see I had an old version installed: + +``` +Installing collected packages: google-auth + Attempting uninstall: google-auth + Found existing installation: google-auth 1.17.2 + Uninstalling google-auth-1.17.2: + Successfully uninstalled google-auth-1.17.2 +Successfully installed google-auth-1.22.1 +``` + +#### PermissionError: \[WinError 32\] The process cannot access the file because it is being used by another process + +On Windows, you may now run into the error mentioned above. After building the Docker image and publishing it to the Google Cloud Registry, the Python script crashes. At least, that's what happens at the time of writing. + +The reason is simple: TensorFlow Cloud creates temporary files for e.g. the Docker image, and attempts to remove those when the publish is completed. This is very neat, because that's what should be done - however, Python (at least on Windows) expects the files to be _closed_ before they can be removed. Because TensorFlow Cloud doesn't close the files, the script crashes with the expected error mentioned above. + +I've provided [a fix](https://github.com/tensorflow/cloud/pull/218) which is hopefully merged with the main repository soon. For the time being, you can adapt your files manually. After that, jobs are submitted successfully: + +``` +Job submitted successfully. +Your job ID is: tf_cloud_train_eb615bf8_795a_4138_92dd_6f0a81abde40 +``` + +#### Looking at the training process in Google Cloud Platform + +You can now take a look at the Google Cloud Platform to see how your job performs: + +``` +Please access your training job information here: +https://console.cloud.google.com/mlengine/jobs/tf_cloud_train_eb615bf8_795a__SOME_OTHER + +Please access your training job logs here: https://console.cloud.google.com/logs/viewer?resource=ml_job%2Fjob_id%2Ftf_cloud_train_eb615bf8_795a_4138_SOME_OTHER +``` + +Those URLs were made shorter on purpose. + +When doing so, you'll first see that your job starts running. It can take a while for the job to prepare, but eventually it will start: + +[![](images/image-17-1024x428.png)](https://www.machinecurve.com/wp-content/uploads/2020/10/image-17.png) + +With the second URL, you can follow the logs in real-time: + +[![](images/image-18-1024x470.png)](https://www.machinecurve.com/wp-content/uploads/2020/10/image-18.png) + +Clearly, the MNIST-based model rushed through the 100 epochs, as expected. It was able to perform epochs of 40k samples with < 1 second per epoch. That's truly impressive! + +### Don't forget to remove your files! + +When ready, it's important to don't forget removing your files, because otherwise you might end up paying for Storage: + +- From the **Container Registry:** [https://console.cloud.google.com/gcr](https://console.cloud.google.com/gcr) +- From the **Cloud Storage** bucket: [https://console.cloud.google.com/storage](https://console.cloud.google.com/storage) +- The reference to the Job in AI Platform doesn't have to be removed, because you're only paying for the Compute time it cost. + +* * * + +## Summary + +In this article, we focused on training a Keras deep learning model in the cloud. Doing so is becoming increasingly important these days, as models become deeper, data sets become larger, and the trade-off between purchasing on-premise GPUs versus cloud-based GPUs is increasingly being won by the cloud. This makes the need for training your machine learning models in the cloud significant. 
+ +Unfortunately, training a TensorFlow/Keras model in the cloud was relatively difficult up to now. Not too difficult, but it cost some time to set up a specific machine learning instance such as in Amazon AMI, copy the model there, and then train it. Fortunately, for TensorFlow and Keras models, today there is TensorFlow Cloud: an extension of TensorFlow that packages your model into a Docker container, and then stores it in the Google Cloud Container Registry for training in the Google Cloud AI Platform. + +In the article, we introduced the AI Platform, but then also focused on the characteristics of TensorFlow cloud. TensorFlow Cloud allows you to train your models by means of a variety of distribution strategies, letting you determine what kind of machine you need and what hardware it should contain, or whether you need a multi-machine-multi-GPU setup. It then builds your Docker container and starts the training process. + +I hope that you've learnt a lot from this article. Although I read quite a bit about training deep learning models in the cloud, until now, I never actually sent one there for training. It's really great to see the progress made in this regard. Please feel free to leave a comment if you have any questions, remarks or other comments. I'll happily answer them and improve my article to make it better. Thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +_TensorFlow/cloud_. (n.d.). GitHub. [https://github.com/tensorflow/cloud](https://github.com/tensorflow/cloud) + +Chollet, F. (2020). _Franćois Chollet - Keras: The Next Five Years_. YouTube. [https://www.youtube.com/watch?v=HBqCpWldPII](https://www.youtube.com/watch?v=HBqCpWldPII) + +_Tf.distribute.experimental.MultiWorkerMirroredStrategy_. (n.d.). TensorFlow. [https://www.tensorflow.org/api\_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/experimental/MultiWorkerMirroredStrategy) + +_Tf.distribute.MirroredStrategy_. (n.d.). TensorFlow. [https://www.tensorflow.org/api\_docs/python/tf/distribute/MirroredStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy) + +_Tf.distribute.OneDeviceStrategy_. (n.d.). TensorFlow. [https://www.tensorflow.org/api\_docs/python/tf/distribute/OneDeviceStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/OneDeviceStrategy) + +_Tf.distribute.TPUStrategy_. (n.d.). TensorFlow. [https://www.tensorflow.org/api\_docs/python/tf/distribute/TPUStrategy](https://www.tensorflow.org/api_docs/python/tf/distribute/TPUStrategy) + +_NVIDIA Tesla V100 price analysis_. (2018, August 17). Microway. [https://www.microway.com/hpc-tech-tips/nvidia-tesla-v100-price-analysis/](https://www.microway.com/hpc-tech-tips/nvidia-tesla-v100-price-analysis/) + +_Amazon EC2 P3 – Ideal for machine learning and HPC - AWS_. (n.d.). Amazon Web Services, Inc. [https://aws.amazon.com/ec2/instance-types/p3/](https://aws.amazon.com/ec2/instance-types/p3/) diff --git a/tensorflow-eager-execution-what-is-it.md b/tensorflow-eager-execution-what-is-it.md new file mode 100644 index 0000000..3d05ce8 --- /dev/null +++ b/tensorflow-eager-execution-what-is-it.md @@ -0,0 +1,160 @@ +--- +title: "TensorFlow Eager Execution: what is it?" 
+date: "2020-09-13" +categories: + - "frameworks" +tags: + - "deep-learning" + - "eager-execution" + - "machine-learning" + - "tensorflow" +--- + +Looking at the [Effective TensorFlow 2](https://www.tensorflow.org/guide/effective_tf2) guide, we can see what major changes have occurred between TensorFlow 1 and 2.x. While some are relatively straightforward, such as the API Cleanup changes, others are less so. For example, something is written about _eager execution_: + +> TensorFlow 1.X requires users to manually stitch together an [abstract syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree) (the graph) by making `tf.*` API calls. It then requires users to manually compile the abstract syntax tree by passing a set of output tensors and input tensors to a `session.run()` call. TensorFlow 2.0 executes eagerly (like Python normally does) and in 2.0, graphs and sessions should feel like implementation details. +> +> Effective TensorFlow 2 (n.d.) + +Now, while I have a background in software engineering (and since a few years machine learning engineering), I still find the text above really technical... especially for beginners. + +What is eager execution? Why has the change been made, and what are the benefits for people who are using TensorFlow, possibly with TensorFlow based Keras? + +Very interesting questions, indeed - especially if you want to get to know the TensorFlow framework in a better way. In order to understand eager execution at a high level, I've written this article, in which I will try to outline the answers to the questions above. Firstly, we'll cover the old way of working - that is, creating a computational graph, and requiring Sessions in order to run this graph. Being relatively inefficient for modeling purposes, we'll then cover how TensorFlow has changed - towards executing eagerly, no longer requiring that graph. It allows us to compare both approaches, and see - in my point of view - why this is much better for modeling. Finally, we'll cover briefly how to find whether your TensorFlow runs with Eager Execution enabled. + +Are you ready? Let's go! 😎 + +* * * + +\[toc\] + +* * * + +## Creating a computational graph + +Suppose that we have three Tensors, which all three represent a constant number: + +``` +import tensorflow as tf +one = tf.constant([12]) +two = tf.constant([3]) +three = tf.constant([2]) +``` + +Our goal would be to multiply the first two Tensors - thus `one` and `two` - first, followed by a subtraction - the result of the multiplication minus `three`. + +``` +multres = tf.math.multiply(one, two) +``` + +And subsequently, the substraction: + +``` +subres = multres - three +``` + +### Sequence of events + +Usually, you would write it down in a sequence, like this, so that once you run your Python script, it gets executed at once: + +``` +import tensorflow as tf +one = tf.constant([12]) +two = tf.constant([3]) +three = tf.constant([2]) +multres = tf.math.multiply(one, two) +subres = multres - three +``` + +Humans think that things flow as follows: + +- Python first computes the values for `one`, `two` and `three`. +- Subsequently, it would compute the result for `multres` being 12 \* 3 = 36 +- Then, finally it would compute the result for `subres` being 36 - 2 = 34. + +Now, that isn't precisely how TensorFlow would work by default prior to version 2.x, and by option prior to version 1.7. + +### Graph based computation + +Instead, it would first create a _graph_ based on your input. 
A graph can be defined as "a structure amounting to a set of objects in which some pairs of the objects are in some sense "related"" (Wikipedia, 2003).
+
+Visually, that would look something like this (note that I've likely omitted _many_ things for the sake of simplicity):
+
+![](images/graph-1.png)
+
+It's effectively a skeleton of what needs to happen when you _really_ run things - as if you wrote down a set of steps to be executed once your program starts. Those who have used TensorFlow for quite some time will still recognize this: everything in TensorFlow had to be started within a `tf.Session` - the instantiation of that graph - before anything could happen.
+
+The benefit of using graphs is that, as we mentioned before, they effectively compose a set of _steps_ about what needs to happen - which greatly helps when a model has to be rebuilt on, say, another machine.
+
+On the other hand, this is incredibly frustrating when you are fine-tuning your machine learning model: you literally have to compile the _whole_ model over and over again. It's also a hassle when you want to store intermediate output from your model. What's more, it's unlike how Python normally works - where any operation returns its result immediately, instead of some intermediate representation like "one x two".
+
+* * *
+
+## Executing models eagerly
+
+While TensorFlow used computational graphs until version 1.7, the developers of PyTorch, the other popular framework for deep learning, recognized the potential bottleneck posed by this way of working - and ensured that their framework _was not so static_ (Chopra, 2018). As PyTorch became increasingly popular, TensorFlow provided a break from static computational graphs in TF 1.7: it added _eager execution_ to the core framework by moving it out of `contrib`, where experimental additions live.
+
+Eager execution "is an imperative programming environment that evaluates operations immediately, without building graphs: operations return concrete values instead of constructing a computational graph to run later" (Tensorflow, n.d.). In plainer English, this means that static graphs are a thing of the past. Rather, each operation performed in TensorFlow immediately returns the value (so "36" instead of "one x two"), which is subsequently used as is in the next operation ("36 - 2 = 34" instead of "multres - three produces _some final result_").
+
+### Benefits of eager execution
+
+According to Tensorflow (n.d.), this provides various benefits that were already recognized in, and driving, the PyTorch ecosystem:
+
+> _An intuitive interface_—Structure your code naturally and use Python data structures. Quickly iterate on small models and small data.
+>
+> _Easier debugging_—Call ops directly to inspect running models and test changes. Use standard Python debugging tools for immediate error reporting.
+>
+> _Natural control flow_—Use Python control flow instead of graph control flow, simplifying the specification of dynamic models.
+
+With respect to the intuitive interface, this makes a lot of sense. Python makes use of 'eager execution' by default: if you multiply 12 by 3, you won't get some kind of intermediate result, but rather, it will output 36 immediately. Sessions were a purely TensorFlow thing for the experienced Python developer, and with eager execution enabled, the necessity for them has disappeared. This provides an easier interface for Python developers who are new to TensorFlow and allows one's code to be cleaner.
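+
+To make this contrast concrete, here is a small sketch - my own illustration, not code from the TensorFlow guide - of the multiplication example from earlier, written once in the old graph-and-Session style and once in the eager style that TensorFlow 2.x uses by default:
+
+```
+import tensorflow as tf
+
+# Old, graph-based style (TensorFlow 1.x). Shown as comments, because it only
+# runs with the tf.compat.v1 API and eager execution disabled:
+#
+#   one = tf.constant([12])
+#   two = tf.constant([3])
+#   multres = tf.math.multiply(one, two)   # just a node in the graph, no value yet
+#   with tf.Session() as sess:
+#       print(sess.run(multres))           # [36] only appears here
+
+# Eager style (TensorFlow 2.x default): the value is available immediately.
+one = tf.constant([12])
+two = tf.constant([3])
+multres = tf.math.multiply(one, two)
+print(multres.numpy())  # [36]
+```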
+ +Easier debugging makes sense as well. As the outputs of your TensorFlow operations are numbers instead of intermediate results, it's now very easy to output intermediate results - such as the outputs of intermediate layers of your machine learning model - in order to debug it. + +In fact, it allows you to be aware of how certain changes produce certain impacts immediately - and you can indeed do so with standard Python debugging tools, which can read default output rather than those intermediate results. + +The point about _natural control flow_ was already covered above, but it's true: there's no _un-Pythonic_ graphs anymore, but regular Python operations instead. Hence, I'd say that this is valid as well - and indeed, a benefit :) + +So, in short, eager execution provides **clear benefits** over graph mode: it's more intuitive to the Python developer, making use of TensorFlow more natural and hence easier, providing cleaner code and faster debugging. Sounds good! + +* * * + +## Does your TensorFlow have Eager Execution enabled? + +All TensorFlow 2.x versions should come with eager execution enabled by default. If you are still running an 1.x version or want to find whether it's running eagerly anyway, you could execute this code to find out whether that's the case for your machine learning model: + +``` +import tensorflow as tf +tf.executing_eagerly() +``` + +If it outputs `True`, then you know that your model runs with eager execution enabled. + +* * * + +## Summary + +In this blog post, we looked at eager execution - enabled by default in TensorFlow 2.x - and what it is. What's more, we also looked at why it is different compared to static graphs used in earlier versions of the machine learning framework. + +Firstly, we started with an example of how graphs were used before. While this can be a very elegant solution when you have to export models and reconstruct them on other machines, it is a hassle when you have to debug models and want to use intermediate results. Especially since PyTorch was much more dynamic, the TensorFlow team introduced eager execution in TF 1.7 and enabled it by default in 2.x versions. + +Funnily, in my point of view, that _major_ change has happened in the 1.x to 2.x TensorFlow transition - and hence, that's why eager execution is a point in TensorFlow (n.d.). If you're very new to TensorFlow, and if you've never worked with 1.x versions in your career, then you won't even _know_ about graphs in the first place. Still, I hope that you've learnt something from this article if that's the case - and also if that's not the case. Please leave a comment in the comments section below if you have any questions, remarks or suggestions. I'd love to hear from you and will respond where possible. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +_Effective TensorFlow 2_. (n.d.). TensorFlow. [https://www.tensorflow.org/guide/effective\_tf2](https://www.tensorflow.org/guide/effective_tf2) + +_Graph (discrete mathematics)_. (2003, September 23). Wikipedia, the free encyclopedia. Retrieved September 13, 2020, from [https://en.wikipedia.org/wiki/Graph\_(discrete\_mathematics)](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)) + +Chopra, S. (2018, September 15). _Eager execution in TensorFlow : A more pythonic way of building models_. Medium. 
[https://medium.com/coding-blocks/eager-execution-in-tensorflow-a-more-pythonic-way-of-building-models-e461810618c8](https://medium.com/coding-blocks/eager-execution-in-tensorflow-a-more-pythonic-way-of-building-models-e461810618c8) + +_Importance of using TensorFlow eager execution for developers_. (2020, April 26). Analytics India Magazine. [https://analyticsindiamag.com/beginners-guide-to-tensorflow-eager-execution-machine-learning-developers/](https://analyticsindiamag.com/beginners-guide-to-tensorflow-eager-execution-machine-learning-developers/) + +Aggarwal, K. (2018, April 9). _A brief guide to TensorFlow eager execution_. Medium. [https://towardsdatascience.com/eager-execution-tensorflow-8042128ca7be](https://towardsdatascience.com/eager-execution-tensorflow-8042128ca7be) + +_Eager execution_. (n.d.). TensorFlow. [https://www.tensorflow.org/guide/eager](https://www.tensorflow.org/guide/eager) diff --git a/tensorflow-model-optimization-an-introduction-to-pruning.md b/tensorflow-model-optimization-an-introduction-to-pruning.md new file mode 100644 index 0000000..0733dc1 --- /dev/null +++ b/tensorflow-model-optimization-an-introduction-to-pruning.md @@ -0,0 +1,768 @@ +--- +title: "TensorFlow model optimization: an introduction to Pruning" +date: "2020-09-23" +categories: + - "frameworks" +tags: + - "edge-ai" + - "optimizer" + - "pruning" + - "quantization" + - "tensorflow" + - "model-optimization" +--- + +Enjoying the benefits of machine learning models means that they are deployed in the field after training has finished. However, if you're counting on great speed with which predictions for new data - called model inference - are generated, then it's possible that you're getting a bit intimidated. If you _really_ want your models to run with speed, it's likely that you'll have to buy powerful equipment - like massive GPUs - which come at significant cost. + +If you don't, your models will run slower; sometimes, really slow - especially when your models are big. And big models are very common in today's state-of-the-art in machine learning. + +Fortunately, modern machine learning frameworks such as TensorFlow attempt to help machine learning engineers. Through extensions such as TF Lite, methods such as [quantization](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/) can be used to optimize your model. While with quantization the number representation of your machine learning model is adapted to benefit size and speed (often at the cost of precision), we'll take a look at **model pruning** in this article. Firstly, we'll take a look at why model optimization is necessary. Subsequently, we'll introduce pruning - by taking a look at how neural networks work as well as questioning why we should keep weights that don't contribute to model performance. + +Following the theoretical part of this article, we'll build a Keras model and subsequently apply pruning to optimize it. This shows you how to apply pruning to your TensorFlow/Keras model with a real example. Finally, when we know how to do is, we'll continue by _combining_ pruning with quantization for compound optimization. Obviously, this also includes adding quantization to the Keras example that we created before. + +Are you ready? Let's go! 😎 + +**Update 02/Oct/2020:** added reference to article about pruning schedules as a suggestion. 
+ +* * * + +\[toc\] + +* * * + +## The need for model optimization + +Machine learning models can be used for a wide variety of use cases, for example the detection of objects: + +https://www.youtube.com/watch?v=\_zZe27JYi8Y + +If you're into object detection, it's likely that you have heard about machine learning architectures like RCNN, Faster-RCNN, YOLO (recently, version 5 was released!) and others. Those are increasingly state-of-the-art architectures that can be used to detect objects very efficiently based on a training dataset. + +The architectures are composed of a pipeline that includes a feature extraction model, region proposal network, and subsequently a classification model (Data Science Stack Exchange, n.d.). By consequence, this pipeline is capable of extracting interesting features from your input data, detecting regions of interest for classification, and finally classifying those regions - resulting in videos like the one above. + +Now, while they are very performant in terms of object detection, the neural networks used for classifying (and sometimes also for feature extraction/region selection) also come at a downside: **_they are very big_.** + +For example, the neural nets, which can include [VGG-16](https://neurohive.io/en/popular-networks/vgg16/), [RESNET-50](https://towardsdatascience.com/understanding-and-coding-a-resnet-in-keras-446d7ff84d33), and others, have the following size when used as a `tf.keras` application (for example, as a convolutional base): + +| Model | Size | Top-1 Accuracy | Top-5 Accuracy | Parameters | Depth | +| --- | --- | --- | --- | --- | --- | +| [Xception](https://keras.io/api/applications/xception) | 88 MB | 0.790 | 0.945 | 22,910,480 | 126 | +| [VGG16](https://keras.io/api/applications/vgg/#vgg16-function) | 528 MB | 0.713 | 0.901 | 138,357,544 | 23 | +| [VGG19](https://keras.io/api/applications/vgg/#vgg19-function) | 549 MB | 0.713 | 0.900 | 143,667,240 | 26 | +| [ResNet50](https://keras.io/api/applications/resnet/#resnet50-function) | 98 MB | 0.749 | 0.921 | 25,636,712 | \- | +| [ResNet101](https://keras.io/api/applications/resnet/#resnet101-function) | 171 MB | 0.764 | 0.928 | 44,707,176 | \- | +| [ResNet152](https://keras.io/api/applications/resnet/#resnet152-function) | 232 MB | 0.766 | 0.931 | 60,419,944 | \- | +| [ResNet50V2](https://keras.io/api/applications/resnet/#resnet50v2-function) | 98 MB | 0.760 | 0.930 | 25,613,800 | \- | +| [ResNet101V2](https://keras.io/api/applications/resnet/#resnet101v2-function) | 171 MB | 0.772 | 0.938 | 44,675,560 | \- | +| [ResNet152V2](https://keras.io/api/applications/resnet/#resnet152v2-function) | 232 MB | 0.780 | 0.942 | 60,380,648 | \- | +| [InceptionV3](https://keras.io/api/applications/inceptionv3) | 92 MB | 0.779 | 0.937 | 23,851,784 | 159 | +| [InceptionResNetV2](https://keras.io/api/applications/inceptionresnetv2) | 215 MB | 0.803 | 0.953 | 55,873,736 | 572 | +| [MobileNet](https://keras.io/api/applications/mobilenet) | 16 MB | 0.704 | 0.895 | 4,253,864 | 88 | +| [MobileNetV2](https://keras.io/api/applications/mobilenet/#mobilenetv2-function) | 14 MB | 0.713 | 0.901 | 3,538,984 | 88 | +| [DenseNet121](https://keras.io/api/applications/densenet/#densenet121-function) | 33 MB | 0.750 | 0.923 | 8,062,504 | 121 | +| [DenseNet169](https://keras.io/api/applications/densenet/#densenet169-function) | 57 MB | 0.762 | 0.932 | 14,307,880 | 169 | +| [DenseNet201](https://keras.io/api/applications/densenet/#densenet201-function) | 80 MB | 0.773 | 0.936 | 20,242,984 | 201 | +| 
[NASNetMobile](https://keras.io/api/applications/nasnet/#nasnetmobile-function) | 23 MB | 0.744 | 0.919 | 5,326,716 | \- | +| [NASNetLarge](https://keras.io/api/applications/nasnet/#nasnetlarge-function) | 343 MB | 0.825 | 0.960 | 88,949,818 | \- | +| [EfficientNetB0](https://keras.io/api/applications/efficientnet/#efficientnetb0-function) | 29 MB | \- | \- | 5,330,571 | \- | +| [EfficientNetB1](https://keras.io/api/applications/efficientnet/#efficientnetb1-function) | 31 MB | \- | \- | 7,856,239 | \- | +| [EfficientNetB2](https://keras.io/api/applications/efficientnet/#efficientnetb2-function) | 36 MB | \- | \- | 9,177,569 | \- | +| [EfficientNetB3](https://keras.io/api/applications/efficientnet/#efficientnetb3-function) | 48 MB | \- | \- | 12,320,535 | \- | +| [EfficientNetB4](https://keras.io/api/applications/efficientnet/#efficientnetb4-function) | 75 MB | \- | \- | 19,466,823 | \- | +| [EfficientNetB5](https://keras.io/api/applications/efficientnet/#efficientnetb5-function) | 118 MB | \- | \- | 30,562,527 | \- | +| [EfficientNetB6](https://keras.io/api/applications/efficientnet/#efficientnetb6-function) | 166 MB | \- | \- | 43,265,143 | \- | +| [EfficientNetB7](https://keras.io/api/applications/efficientnet/#efficientnetb7-function) | 256 MB | \- | \- | 66,658,687 | \- | + +Source: Keras Team (n.d.) + +Some are approximately half a gigabyte with more than 100 million trainable parameters. That's _really_ big! + +The consequences of using those models is that you'll need very powerful hardware in order to perform what is known as **model inference** - or generating new predictions for new data that is input to the trained model. This is why most machine learning settings are centralized and often cloud-based: cloud vendors such as Amazon Web Services, Azure and [DigitalOcean](https://m.do.co/c/2cbc2c399ad5) _(affiliate link)_ provide GPU-based or heavy compute-based machines for running machine learning inference. + +Now, this is good if your predictions can be batch oriented or when some delay is acceptable - but if you want to respond to observations in the field, with a very small delay between observation and a response - this is unacceptable. + +Very large models, however, cannot run in the field, for the simple reason that insufficiently powerful hardware is available in the field. Embedded devices simply aren't good enough to equal performance of their cloud-based competitors. This means that you'll have to trade-off model performance by using smaller ones. + +Fortunately, modern deep learning frameworks provide a variety of techniques to optimize your machine learning models. As we have seen in another blog post, changing the number representation into a less-precise but smaller variant - a technique called [quantization](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/) - helps already. In this blog post, we'll take a look at another technique: **model pruning**. Really interesting, especially if you combine the two - as we shall do later! :) + +* * * + +## Introducing Pruning + +Adapting a definition found at Wikipedia (2006) for decision trees, pruning in general means "simplifying/compressing and optimizing a \[classifier\] by removing sections of the \[classifier\] that are uncritical and redundant to classify instances" (Wikipedia, 2006). 
Hence, while with quantization models are optimized by changing their number representation, pruning allows you to optimize models by removing parts that don't contribute much to the outcome.
+
+I can imagine that it's difficult to visualize this if you don't fully understand how neural networks operate from the inside. Therefore, let's take a look at how they work before we continue introducing pruning.
+
+### Neural network maths: features and weights
+
+Taken from our blog post about loss and loss functions, we can sketch a [high-level machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) for supervised learning scenarios - such as training a classifier or a regression model:
+
+![](images/High-level-training-process-1024x973.jpg)
+
+Training such a model involves a cyclical process, where **features** (or data inputs) are fed to a machine learning model that is initially [initialized](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/) quite randomly, after which predictions are compared with the actual outcomes - or the ground truth. After comparison, the model is adapted, after which the process restarts. This way, models are improved incrementally, and "learning" takes place.
+
+If we talk about initializing a machine learning model, we're talking about initializing its **weights**. Each machine learning model has a large number of weights that can be trained, i.e., where learning can be captured. Both weights and features are vectors. Upon the forward pass (i.e., passing a feature, generating a prediction), the inputs for every layer are fed to the weights, after which they are vector multiplied. The collective outcome (another vector) is subsequently passed to the next layer. The system as a whole generates the prediction, and can be used for generating highly complex predictions due to its [nonlinearity](https://www.machinecurve.com/index.php/2019/06/11/why-you-shouldnt-use-a-linear-activation-function/).
+
+- [Read more about weights and features here.](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/)
+- [Read here why fully random weight initialization might not be a good idea.](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/)
+
+### Why keep weights that don't contribute?
+
+Now, you can possibly imagine that the contribution of each individual weight to model performance is not equal. Just like in a group of people that attempts to reach a common goal, the input of some people is more important than the input of others. This could be unconscious - for example, because somebody is having a bad day - or on purpose. Whichever it is doesn't matter - the absolute contribution is what does.
+
+Now, if some people (or in our case, neural network weights) do not contribute significantly, it could be that the cost of keeping them in (in terms of model sparsity and hence optimization) is larger than the cost of removing them from the model. That's precisely what **pruning** does: remove weights that do not contribute from your machine learning model. It does so quite ingeniously, as we shall see.
+
+#### Saving storage and making things faster with magnitude-based pruning
+
+In TensorFlow, we'll prune our models using **magnitude-based pruning**. This method, which is really simple, removes the smallest weights after each epoch (Universität Tübingen, n.d.). The sketch below illustrates the basic idea.
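+
+As a toy illustration - plain NumPy, not the actual TensorFlow implementation that we will use later - magnitude-based pruning boils down to something like this:
+
+```
+import numpy as np
+
+# Toy sketch of magnitude-based pruning: zero out every weight whose
+# absolute value does not exceed some threshold lambda.
+weights = np.array([0.43, -0.02, 0.19, -0.87, 0.003, 0.65])
+threshold = 0.05
+
+pruned = np.where(np.abs(weights) > threshold, weights, 0.0)
+sparsity = 1.0 - np.count_nonzero(pruned) / pruned.size
+
+print(pruned)    # [ 0.43  0.    0.19 -0.87  0.    0.65]
+print(sparsity)  # 0.333...: a third of the weights is now zero
+```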
+
+In fact, the pruning method is so simple that it compares the absolute size of each weight with some threshold lambda (Nervana Systems, n.d.):
+
+\[latex\]thresh(w\_i) = \\begin{cases} w\_i & \\text{if } |w\_i| \\gt \\lambda \\\\ 0 & \\text{if } |w\_i| \\leq \\lambda \\end{cases}\[/latex\]
+
+According to Universität Tübingen (n.d.), this method often yields quite good results - no worse than more advanced methods.
+
+The reason this method works lies in the effect of weights that are set to zero. As we recall, within a neuron, some input vector \[latex\]\\textbf{x}\[/latex\] is multiplied with the weights vector \[latex\]\\textbf{w}\[/latex\]. If the weights in the vector are set to zero, the outcome will always be zero. This, in effect, ensures that the neuron no longer contributes to model performance.
+
+But why, you might ask, does setting model weights to zero help optimize a model and make it smaller? Gale et al. (2019) answer this question: "models can be stored and transmitted compactly using sparse matrix formats". This benefits from the fact that "\[sparse\] data is by nature more easily [compressed](https://en.wikipedia.org/wiki/Data_compression) and thus requires significantly less [storage](https://en.wikipedia.org/wiki/Computer_data_storage)" (Wikipedia, 2003). In addition, beyond compression, computations (such as `x`+`y`) can be made faster: they can be simplified or skipped entirely if `x` or `y` - or both - are sparse (`x+0` = `x`, and so on), benefiting processing - _inference_, in our case.
+
+#### Now what happens to my accuracy?
+
+Okay, fair enough - the simplicity of magnitude-based pruning combined with the benefits of sparse matrices definitely helps optimize your model. But what does this mean for model performance?
+
+Often, not much. The weights that contribute to model performance most significantly often do not get removed. Still, you may observe _minor_ performance deterioration. For those cases, it is possible to fine-tune your model after pruning. This means that once pruning has been performed (whether after an epoch or after you have finished training an early version of your model), you can continue training; your model will then attempt to get back to convergence with only a minority of the weights.
+
+* * *
+
+## Pruning: a Keras example
+
+Great! We now know what pruning is all about, and, more specifically, we understand how _magnitude-based pruning_ exploits the storage and computational benefits of sparse matrices. And who doesn't love its simplicity? :) That's why it's time to move from theory into practice, and see whether we can actually create a Keras model to which we apply pruning.
+
+### Installing the TensorFlow Model Optimization toolkit
+
+For pruning, we'll be using the TensorFlow Model Optimization toolkit, which "minimizes the complexity of optimizing machine learning inference" (TensorFlow Model Optimization, n.d.). It's a collection of interesting tools for optimizing your TensorFlow models.
+
+You must first install it using `pip`, so that would be your first step to take:
+
+```
+pip install --user --upgrade tensorflow-model-optimization
+```
+
+### Using our Keras ConvNet
+
+In another blog post, we saw how to create a [Convolutional Neural Network with Keras](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/).
Here, I'll re-use that code, for its sheer simplicity - it does nothing more than create a small CNN and train it with the MNIST dataset. It'll be the starting point of our pruning exercise. Here it is - if you wish to understand it in more detail, I'd recommend taking a look at the page we just linked to before: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +import tempfile +import tensorflow_model_optimization as tfmot +import numpy as np + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 10 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() +input_shape = (img_width, img_height, 1) + +# Reshape data for ConvNet +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize [0, 255] into [0, 1] +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +Also make sure to store your model to a _temporary file_, so that you can compare the sizes of the original and the pruned model later: + +``` +# Store file +_, keras_file = tempfile.mkstemp('.h5') +save_model(model, keras_file, include_optimizer=False) +print(f'Baseline model saved: {keras_file}') +``` + +### Loading and configuring pruning + +Time to add pruning functionality to our model code! + +We'll first add this: + +``` +# Load functionality for adding pruning wrappers +prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude +``` + +What it does is loading the `prune_low_magnitude` [functionality](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras/prune_low_magnitude) from TensorFlow (Tfmot.sparsity.keras.prune\_low\_magnitude, n.d.). `prune_low_magnitude` simply modifies a layer by making it ready for pruning. It does so by wrapping a `keras` model with pruning functionality, more specifically by ensuring that the model's layers are prunable. 
This only _loads_ the functionality, we'll actually call it later. + +Upon loading the pruning wrappers, we will set pruning configuration: + +``` +# Finish pruning after 5 epochs +pruning_epochs = 5 +num_images = input_train.shape[0] * (1 - validation_split) +end_step = np.ceil(num_images / batch_size).astype(np.int32) * pruning_epochs + +# Define pruning configuration +pruning_params = { + 'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.40, + final_sparsity=0.70, + begin_step=0, + end_step=end_step) +} +model_for_pruning = prune_low_magnitude(model, **pruning_params) +``` + +Here, the following happens: + +- We configure the length of the pruning process by means of the number of `epochs` that the model will prune for, and fine-tune. +- We load the number of images used in our training set, minus the validation data. +- We compute the `end_step` of our pruning process given batch size, the number of images as well as the number of epochs. +- We subsequently define configuration for the pruning operation through `pruning_params`. We define a pruning schedule using `PolynomialDecay`, which means that sparsity of the model increases with increasing number of `epochs`. Initially, we set the model to be 40% sparse, increasingly getting sparser to eventually 70%. We begin at 0, and end at `end_step`. +- Finally, we actually call the `prune_low_magnitude` functionality (which generates the prunable model) from our initial `model` and the defined `pruning_params`. + +**Suggestion:** make sure to read our [article about PolynomialDecay and ConstantSparsity pruning schedules](https://www.machinecurve.com/index.php/2020/09/29/tensorflow-pruning-schedules-constantsparsity-and-polynomialdecay/) to find out more about these particular schedules. + +### Starting the pruning process + +After configuring the pruning process, we can actually recompile the model (this is necessary because we added pruning functionality), and start the pruning process. We must use the `UpdatePruningStep` callback here, because it propagates optimizer activities to the pruning process (Tfmot.sparsity.keras.UpdatePruningStep, n.d.). + +``` +# Recompile the model +model_for_pruning.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Model callbacks +callbacks = [ + tfmot.sparsity.keras.UpdatePruningStep() +] + +# Fitting data +model_for_pruning.fit(input_train, target_train, + batch_size=batch_size, + epochs=pruning_epochs, + verbose=verbosity, + callbacks=callbacks, + validation_split=validation_split) +``` + +### Measuring pruning effectiveness + +Once pruning finishes, we must measure its effectiveness. We can do so in two ways: + +- By measuring how much performance has changed, compared to before pruning; +- By measuring how much model size has changed, compared to before pruning. + +We'll do so with the following lines of code: + +``` +# Generate generalization metrics +score_pruned = model_for_pruning.evaluate(input_test, target_test, verbose=0) +print(f'Pruned CNN - Test loss: {score_pruned[0]} / Test accuracy: {score_pruned[1]}') +print(f'Regular CNN - Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +Those ones are simple. They evaluate the pruned model with the testing data and subsequently print the outcome, as well as the (previously obtained) outcome of the original model. 
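+
+Besides loss and accuracy, you may also want to verify how sparse the pruned model has actually become. The snippet below is a small extra check - not part of the original example, and it assumes the `model_for_pruning` object from above - that strips the pruning wrappers and counts the zero-valued weights per layer:
+
+```
+import numpy as np
+import tensorflow_model_optimization as tfmot
+
+# Strip the pruning wrappers and report per-layer sparsity.
+stripped = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
+for layer in stripped.layers:
+    for w in layer.get_weights():
+        if w.ndim > 1:  # only look at kernels, not biases
+            sparsity = 1.0 - np.count_nonzero(w) / w.size
+            print(f'{layer.name}: {sparsity:.1%} of the weights are zero')
+```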
+ +Next, we export it again - just like we did before - to ensure that we can compare it: + +``` +# Export the model +model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning) +_, pruned_keras_file = tempfile.mkstemp('.h5') +save_model(model_for_export, pruned_keras_file, include_optimizer=False) +print(f'Pruned model saved: {keras_file}') +``` + +Subsequently (thanks to Pruning Keras Example (n.d.)) we can compare the size of the Keras model. To illustrate the benefits of pruning, we must use a compression algorithm like `gzip`, after which we can compare the sizes of both models. Recall that pruning generates sparsity, and that sparse matrices can be saved very efficiently when compressed. That's why `gzip`s are useful for demonstration purposes. We first create a `def` that can be used for compression, and subsequently call it twice: + +``` +# Measuring the size of your pruned model +# (source: https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras#fine-tune_pre-trained_model_with_pruning) + +def get_gzipped_model_size(file): + # Returns size of gzipped model, in bytes. + import os + import zipfile + + _, zipped_file = tempfile.mkstemp('.zip') + with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f: + f.write(file) + + return os.path.getsize(zipped_file) + +print("Size of gzipped baseline Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file))) +print("Size of gzipped pruned Keras model: %.2f bytes" % (get_gzipped_model_size(pruned_keras_file))) +``` + +### Runtime outcome + +Now, it's time to run it. Save your file as e.g. `pruning.py`, and run it from a Python environment where you have `tensorflow` 2.x installed as well as `numpy` and the `tensorflow_model_optimization` toolkit. + +First, regular training will start, followed by the pruning process, and then effectiveness is displayed on screen. First, with respect to model performance (i.e., loss and accuracy): + +``` +Pruned CNN - Test loss: 0.0218335362634185 / Test accuracy: 0.9923999905586243 +Regular CNN - Test loss: 0.02442687187876436 / Test accuracy: 0.9915000200271606 +``` + +The pruned model even performs slightly better than the regular one. This is likely because we trained the initial model for only 10 epochs, and subsequently continued with pruning afterwards. It's very much possible that the model had not yet converged; that moving towards convergence has continued in the pruning process. Often, performance deteriorates a bit, but should do so only slightly. + +Then, with respect to model size: + +``` +Size of gzipped baseline Keras model: 1601609.00 bytes +Size of gzipped pruned Keras model: 679958.00 bytes +``` + +Pruning definitely made our model smaller - 2.35 times! 
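+
+As a quick sanity check on that number, you could compute the compression factor directly with the `get_gzipped_model_size` helper and the file paths defined above:
+
+```
+# Compute the compression factor from the two gzipped file sizes.
+baseline_size = get_gzipped_model_size(keras_file)
+pruned_size = get_gzipped_model_size(pruned_keras_file)
+print('Compression factor: %.2fx' % (baseline_size / pruned_size))  # roughly 2.35x for the run above
+```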
+ +### Full model code + +If you wish to obtain the full model code at once - here you go: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential, save_model +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +import tempfile +import tensorflow_model_optimization as tfmot +import numpy as np + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 10 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() +input_shape = (img_width, img_height, 1) + +# Reshape data for ConvNet +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize [0, 255] into [0, 1] +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Regular CNN - Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Store file +_, keras_file = tempfile.mkstemp('.h5') +save_model(model, keras_file, include_optimizer=False) +print(f'Baseline model saved: {keras_file}') + +# Load functionality for adding pruning wrappers +prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude + +# Finish pruning after 5 epochs +pruning_epochs = 5 +num_images = input_train.shape[0] * (1 - validation_split) +end_step = np.ceil(num_images / batch_size).astype(np.int32) * pruning_epochs + +# Define pruning configuration +pruning_params = { + 'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.40, + final_sparsity=0.70, + begin_step=0, + end_step=end_step) +} +model_for_pruning = prune_low_magnitude(model, **pruning_params) + +# Recompile the model +model_for_pruning.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Model callbacks +callbacks = [ + tfmot.sparsity.keras.UpdatePruningStep() +] + +# Fitting data +model_for_pruning.fit(input_train, target_train, + batch_size=batch_size, + epochs=pruning_epochs, + verbose=verbosity, + callbacks=callbacks, + validation_split=validation_split) + +# Generate generalization metrics 
+score_pruned = model_for_pruning.evaluate(input_test, target_test, verbose=0) +print(f'Pruned CNN - Test loss: {score_pruned[0]} / Test accuracy: {score_pruned[1]}') +print(f'Regular CNN - Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Export the model +model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning) +_, pruned_keras_file = tempfile.mkstemp('.h5') +save_model(model_for_export, pruned_keras_file, include_optimizer=False) +print(f'Pruned model saved: {keras_file}') + +# Measuring the size of your pruned model +# (source: https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras#fine-tune_pre-trained_model_with_pruning) + +def get_gzipped_model_size(file): + # Returns size of gzipped model, in bytes. + import os + import zipfile + + _, zipped_file = tempfile.mkstemp('.zip') + with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f: + f.write(file) + + return os.path.getsize(zipped_file) + +print("Size of gzipped baseline Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file))) +print("Size of gzipped pruned Keras model: %.2f bytes" % (get_gzipped_model_size(pruned_keras_file))) +``` + +* * * + +## Combining Pruning with Quantization for compound optimization + +Above, we saw how we can apply **pruning** to our TensorFlow model to make it smaller without losing much performance. Doing so, we achieved a model that was 2.35 times smaller than the original one. However, it's possible to make the model even smaller. We can do so by means of [quantization](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/). If you're interested in what it is, I'd recommend you read the blog post for much detail. Here, we'll look at it very briefly and subsequently add it to our Keras example to gain even further improvements in model size. + +### What is quantization? + +Quantization, in short, means to change the number representation of your machine learning model (whether that's weights or also activations) in order to make it smaller. + +By default, TensorFlow and Keras work with `float32` format. Using 32-bit floating point numbers, it's possible to store really large numbers with great precision. However, the fact that 32 bits can be used makes the model not so efficient in terms of storage - and neither in terms of speed (`float` operations are usually best run on GPUs, and this is cumbersome if you want to deploy your model in the field). + +Quantization means changing this number representation. For example, using `float16` quantization, one can convert parts of the model from `float32` into `float16` format - approximately reducing model size by 50%, without losing much performance. Other approaches allow you to quantize into `int8` format (possibly losing quite some performance while gaining 4x size boost) or combined `int8`/`int16` format (best of both worlds). Fortunately, it's also possible to make your model quantization-aware, meaning that it simulates quantization during training so that the layers can already adapt to performance loss incurred by quantization. + +In short, once the model has been pruned - i.e., stripped off non-contributing weights - we can subsequently add quantization. It should make the model even smaller in a compound way: 2.35 times size reduction should theoretically, using `int8` quantization, mean a 4 x 2.35 = 9.4 times reduction in size! 
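+
+To get a feeling for what quantization does to the individual numbers, here is a tiny toy sketch - purely illustrative, and much simpler than TFLite's actual dynamic range quantization - that maps a few `float32` 'weights' onto `int8` values using a single scale factor:
+
+```
+import numpy as np
+
+# Toy sketch of int8 quantization: one scale factor for the whole tensor.
+weights = np.array([0.12, -0.034, 0.5601, -0.91], dtype=np.float32)
+scale = np.max(np.abs(weights)) / 127.0
+
+quantized = np.round(weights / scale).astype(np.int8)
+dequantized = quantized.astype(np.float32) * scale
+
+print(quantized)    # [  17   -5   78 -127] -> one byte per weight instead of four
+print(dequantized)  # close to, but not exactly, the original weights
+```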
+ +### Adding quantization to our Keras example + +Let's now take a look how we can add quantization to a pruned TensorFlow model. More specifically, we'll add [dynamic range quantization](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/#post-training-dynamic-range-quantization), which quantizes the weights, but not necessarily model activations. + +Adding quantization first requires you to add a `TFLite` converter. This converter converts your TensorFlow model into TensorFlow Lite equivalent, which is what quantization will run against. Converting the model into a Lite model allows us to specify a model optimizer - `DEFAULT` or dynamic range quantization, in our case. Finally, we `convert()` the model: + +``` +# Convert into TFLite model and convert with DEFAULT (dynamic range) quantization +stripped_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning) +converter = tensorflow.lite.TFLiteConverter.from_keras_model(stripped_model) +converter.optimizations = [tensorflow.lite.Optimize.DEFAULT] +tflite_model = converter.convert() +``` + +Note that we must first strip the pruning wrappers from the model, creating a `stripped_model`. When the model has completed quantization, we can save it and print its size to see how much things have improved: + +``` +# Save quantized model +_, quantized_and_pruned_tflite_file = tempfile.mkstemp('.tflite') + +with open(quantized_and_pruned_tflite_file, 'wb') as f: + f.write(tflite_model) + +# Additional details +print("Size of gzipped pruned and quantized TFlite model: %.2f bytes" % (get_gzipped_model_size(quantized_and_pruned_tflite_file))) +``` + +Running again yields: + +``` +Size of gzipped baseline Keras model: 1601609.00 bytes +Size of gzipped pruned Keras model: 679958.00 bytes +Size of gzipped pruned and quantized TFlite model: 186745.00 bytes +``` + +...meaning: + +- Size improvement original --> pruning: 2.35x +- Size improvement pruning --> quantization: 3.64x +- Total size improvement pruning + quantization: 8.58x + +Almost 9 times smaller! 
😎 + +### Full model code: pruning + quantization + +Should you wish to run the pruning and quantization code at once, here you go: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential, save_model +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +import tempfile +import tensorflow_model_optimization as tfmot +import numpy as np + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 10 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() +input_shape = (img_width, img_height, 1) + +# Reshape data for ConvNet +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize [0, 255] into [0, 1] +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Regular CNN - Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Store file +_, keras_file = tempfile.mkstemp('.h5') +save_model(model, keras_file, include_optimizer=False) +print(f'Baseline model saved: {keras_file}') + +# Load functionality for adding pruning wrappers +prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude + +# Finish pruning after 5 epochs +pruning_epochs = 5 +num_images = input_train.shape[0] * (1 - validation_split) +end_step = np.ceil(num_images / batch_size).astype(np.int32) * pruning_epochs + +# Define pruning configuration +pruning_params = { + 'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.40, + final_sparsity=0.70, + begin_step=0, + end_step=end_step) +} +model_for_pruning = prune_low_magnitude(model, **pruning_params) + +# Recompile the model +model_for_pruning.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Model callbacks +callbacks = [ + tfmot.sparsity.keras.UpdatePruningStep() +] + +# Fitting data +model_for_pruning.fit(input_train, target_train, + batch_size=batch_size, + epochs=pruning_epochs, + verbose=verbosity, + callbacks=callbacks, + 
validation_split=validation_split) + +# Generate generalization metrics +score_pruned = model_for_pruning.evaluate(input_test, target_test, verbose=0) +print(f'Pruned CNN - Test loss: {score_pruned[0]} / Test accuracy: {score_pruned[1]}') +print(f'Regular CNN - Test loss: {score[0]} / Test accuracy: {score[1]}') + +# Export the model +model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning) +_, pruned_keras_file = tempfile.mkstemp('.h5') +save_model(model_for_export, pruned_keras_file, include_optimizer=False) +print(f'Pruned model saved: {keras_file}') + +# Measuring the size of your pruned model +# (source: https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras#fine-tune_pre-trained_model_with_pruning) + +def get_gzipped_model_size(file): + # Returns size of gzipped model, in bytes. + import os + import zipfile + + _, zipped_file = tempfile.mkstemp('.zip') + with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f: + f.write(file) + + return os.path.getsize(zipped_file) + +print("Size of gzipped baseline Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file))) +print("Size of gzipped pruned Keras model: %.2f bytes" % (get_gzipped_model_size(pruned_keras_file))) + +# Convert into TFLite model and convert with DEFAULT (dynamic range) quantization +stripped_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning) +converter = tensorflow.lite.TFLiteConverter.from_keras_model(stripped_model) +converter.optimizations = [tensorflow.lite.Optimize.DEFAULT] +tflite_model = converter.convert() + +# Save quantized model +_, quantized_and_pruned_tflite_file = tempfile.mkstemp('.tflite') + +with open(quantized_and_pruned_tflite_file, 'wb') as f: + f.write(tflite_model) + +# Additional details +print("Size of gzipped pruned and quantized TFlite model: %.2f bytes" % (get_gzipped_model_size(quantized_and_pruned_tflite_file))) +``` + +* * * + +## Summary + +This article demonstrated how TensorFlow models can be optimized using pruning. By means of pruning, which means to strip off weights that contribute insufficiently to model outcomes, models can be made sparser. Sparse models, in return, can be stored more efficiently, and can also _run_ more efficiently due to smart run-time effects in many programming languages and frameworks. + +Beyond theory, we also looked at a practical scenario - where you're training a Convolutional Neural Network using Keras. After training, we first applied pruning using `PolynomialDecay`, which reduced model size 2.35 times. Then, we also added quantization - which we covered in another blog post but means changing the number representation of your model - and this reduced model size even further, to a total size reduction of 8.5 times compared to our initial model. Awesome! + +I hope you have learnt a lot about model optimization from this blog article. I myself did when researching pruning and quantization! If you have any questions or remarks, please feel free to leave a comment in the comments section below 💬 I'm looking forward to hearing from you. Thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +Universität Tübingen. (n.d.). _Magnitude based pruning_. Kognitive Systeme | Universität Tübingen. [https://www.ra.cs.uni-tuebingen.de/SNNS/UserManual/node249.html](https://www.ra.cs.uni-tuebingen.de/SNNS/UserManual/node249.html) + +_Trim insignificant weights_. (n.d.). TensorFlow. 
[https://www.tensorflow.org/model\_optimization/guide/pruning](https://www.tensorflow.org/model_optimization/guide/pruning) + +_YOLOv5 is here_. (2020, August 4). Roboflow Blog. [https://blog.roboflow.com/yolov5-is-here](https://blog.roboflow.com/yolov5-is-here) + +_Is faster RCNN the same thing as VGG-16, RESNET-50, etc... or not?_ (n.d.). Data Science Stack Exchange. [https://datascience.stackexchange.com/questions/54548/is-faster-rcnn-the-same-thing-as-vgg-16-resnet-50-etc-or-not](https://datascience.stackexchange.com/questions/54548/is-faster-rcnn-the-same-thing-as-vgg-16-resnet-50-etc-or-not) + +_VGG16 - Convolutional network for classification and detection_. (2018, November 21). Neurohive - Neural Networks. [https://neurohive.io/en/popular-networks/vgg16/](https://neurohive.io/en/popular-networks/vgg16/) + +Dwivedi, P. (2019, March 27). _Understanding and coding a ResNet in Keras_. Medium. [https://towardsdatascience.com/understanding-and-coding-a-resnet-in-keras-446d7ff84d33](https://towardsdatascience.com/understanding-and-coding-a-resnet-in-keras-446d7ff84d33) + +Keras Team. (n.d.). _Keras documentation: Keras applications_. Keras: the Python deep learning API. [https://keras.io/api/applications/](https://keras.io/api/applications/) + +_TensorFlow model optimization: An introduction to quantization – MachineCurve_. (2020, September 16). MachineCurve. [https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/) + +_Decision tree pruning_. (2006, June 7). Wikipedia, the free encyclopedia. Retrieved September 22, 2020, from [https://en.wikipedia.org/wiki/Decision\_tree\_pruning](https://en.wikipedia.org/wiki/Decision_tree_pruning) + +_Pruning - Neural network distiller_. (n.d.). Site not found · GitHub Pages. [https://nervanasystems.github.io/distiller/algo\_pruning.html](https://nervanasystems.github.io/distiller/algo_pruning.html) + +Gale, T., Elsen, E., & Hooker, S. (2019). [The state of sparsity in deep neural networks](https://arxiv.org/pdf/1902.09574.pdf). _arXiv preprint arXiv:1902.09574_. + +_Sparse matrix_. (2003, October 15). Wikipedia, the free encyclopedia. Retrieved September 22, 2020, from [https://en.wikipedia.org/wiki/Sparse\_matrix](https://en.wikipedia.org/wiki/Sparse_matrix) + +_Computational advantages of sparse matrices - MATLAB & Simulink_. (n.d.). MathWorks - Makers of MATLAB and Simulink - MATLAB & Simulink. [https://www.mathworks.com/help/matlab/math/computational-advantages-of-sparse-matrices.html](https://www.mathworks.com/help/matlab/math/computational-advantages-of-sparse-matrices.html) + +_TensorFlow model optimization_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/guide](https://www.tensorflow.org/model_optimization/guide) + +_Tfmot.sparsity.keras.prune\_low\_magnitude_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/sparsity/keras/prune\_low\_magnitude](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras/prune_low_magnitude) + +_Tfmot.sparsity.keras.UpdatePruningStep_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/sparsity/keras/UpdatePruningStep](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras/UpdatePruningStep) + +_Pruning in Keras example_. (n.d.). TensorFlow. 
[https://www.tensorflow.org/model\_optimization/guide/pruning/pruning\_with\_keras#fine-tune\_pre-trained\_model\_with\_pruning](https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras#fine-tune_pre-trained_model_with_pruning) diff --git a/tensorflow-model-optimization-an-introduction-to-quantization.md b/tensorflow-model-optimization-an-introduction-to-quantization.md new file mode 100644 index 0000000..78c32d2 --- /dev/null +++ b/tensorflow-model-optimization-an-introduction-to-quantization.md @@ -0,0 +1,283 @@ +--- +title: "TensorFlow model optimization: an introduction to Quantization" +date: "2020-09-16" +categories: + - "deep-learning" + - "frameworks" +tags: + - "edge-ai" + - "latency" + - "optimizer" + - "quantization" + - "storage" + - "tensorflow" + - "model-optimization" + - "tflite" +--- + +Since the 2012 breakthrough in machine learning, spawning the hype around deep learning - that should have mostly passed by now, favoring more productive applications - people around the world have worked on creating machine learning models for pretty much everything. Personally, to give an example, I have spent time creating a machine learning model for recognizing the material type of underground utilities using ConvNets for my master's thesis. It's really interesting to see how TensorFlow and other frameworks, such as Keras in my case, can be leveraged to create powerful AI models. Really fascinating! + +Despite this positivity, critical remarks cannot be left out. While the research explosion around deep learning has focused on finding alternatives to common loss functions, the effectiveness of Batch Normalization and Dropout, and so on, practical problems remain huge. One class of such practical problems is related to deploying your model in the real world. During training, and especially if you use one of the more state-of-the-art model architectures, you'll create a _very big model_. + +Let's repeat this, but then in bold: **today's deep learning models are often very big**. Negative consequences of model size are that very powerful machines are required for inference (i.e. generating predictions for new data) or to even get them running. Until now, those machines have been deployed in the cloud. In situations where you want to immediately respond in the field, creating a cloud connection is not the way to go. That's why today, a trend is visible where machine learning models are moving to the edge. There is however nobody who runs very big GPUs in the field, say at a traffic sign, to run models. Problematic! + +Unless it isn't. Today, fortunately, many deep learning tools have built-in means to optimize machine learning models. TensorFlow and especially the TensorFlow Lite set of tools provide many. In this blog, we'll cover **quantization**, effectively a means to reduce the size of your machine learning model by rounding `float32` numbers to nearest smaller-bit ones. + +* * * + +\[toc\] + +* * * + +## AI at the edge: the need for model optimization + +Let's go back to the core of my master's thesis that I mentioned above - the world of underground utilities. Perhaps, you have already experienced outages some times, but in my country - the Netherlands - things go wrong _once every three minutes_. With 'wrong', I mean the occurrence of a utility strike. Consequences are big: annually, direct costs are approximately 25 million Euros, with indirect costs maybe ten to fifteen times higher. 
+ +Often, utility strikes happen because information about utilities present in the underground is outdated or plainly incorrect. For this reason, there are companies today that specialize in scanning and subsequently mapping those utilities. For this purpose, among others, they use a device called a _ground penetrating radar_ (GPR). Using a GPR, which emits radio waves into the ground and subsequently stores the reflections, geophysicists scan and then generate maps of what's subsurface. + +Performing such scans and generating those maps is a tedious task. First of all, the engineers have to walk hundreds of meters to perform the scanning activities. Subsequently, they must scrutinize all those hundreds of meters - often in a repetitive way. Clearly, this presents opportunities for automation. And that's what I attempted to do in my master's thesis: amplify the analyst's knowledge by using machine learning - and specifically today's ConvNets - to automatically classify objects on GPR imagery with respect to radar size. + +https://www.youtube.com/watch?v=oQaRfA7yJ0g + +While very interesting from a machine learning point of view, that should not be the end goal commercially. The holy grail would be to equip a GPR device with a machine learning model that is very accurate and which generalizes well. When this happens, engineers who dig in the underground can perform those scans _themselves_, and subsequently _analyze for themselves where they have to be cautious_. What an optimization that would be compared to current market conditions, which are often unfavorable for all parties involved. + +Now, if that were the goal, we'd have to literally _run_ the machine learning model on the GPR device as well. That's where we repeat what we discussed in the beginning of this blog: given the sheer size of today's deep learning models, that's practically impossible. Nobody will equip a hardware device used in the field with a very powerful GPU. And if they would, where would they get electricity from? It's unlikely that it can be powered by a simple solar panel. + +Here emerges the need for creating machine learning models that run in the field. In business terms, we call this **Edge AI** - indeed, AI is moving from centralized orchestrations in clouds to the edge, where it can be applied instantly and where insights can be passed to actuators immediately. But doing so requires that models become efficient - much more efficient. Fortunately, many frameworks - TensorFlow included - provide means for doing so. Next, we'll cover TensorFlow Lite's methods for optimization related to **quantization**. Other optimization methods, such as pruning, will be discussed in future blogs. + +* * * + +## Introducing Quantization + +Optimizing a machine learning model can be beneficial in multiple ways (TensorFlow, n.d.). Primarily, **size reduction**, **latency reduction** and **accelerator compatibility** can be reasons to optimize one's machine learning model. With respect to reducing _model size_, benefits are as follows (TensorFlow, n.d.): + +> **Smaller storage size:** Smaller models occupy less storage space on your users' devices. For example, an Android app using a smaller model will take up less storage space on a user's mobile device. +> +> **Smaller download size:** Smaller models require less time and bandwidth to download to users' devices.
+> +> **Less memory usage:** Smaller models use less RAM when they are run, which frees up memory for other parts of your application to use, and can translate to better performance and stability. + +That's great from a cost perspective, as well as a user perspective. The benefits of _latency reduction_ compound this effect: because the model is smaller and more efficient, it takes less time to let a new sample pass through it - reducing the time between generating a prediction and _receiving_ that prediction. Finally, with respect to _accelerator compatibility_, it's possible to achieve extremely good results when combining optimization with TPUs, which are specifically designed to run TensorFlow models (TensorFlow, n.d.). Altogether, optimization can greatly increase machine learning cost performance while keeping model performance at similar levels. + +### Float32 in your ML model: why it's great + +By default, TensorFlow (and Keras) use the `float32` number representation while training machine learning models: + +``` +>>> tf.keras.backend.floatx() +'float32' +``` + +Floats or _floating-point numbers_ are "arithmetic using formulaic representation of real numbers as an approximation to support a trade-off between range and precision. For this reason, floating-point computation is often found in systems which include very small and very large real numbers, which require fast processing times" (Wikipedia, 2001). Put plainly, it's a way of representing _real_ numbers (a.k.a., numbers like 1.348348348399...), ensuring processing speed, while having only a minor trade-off between range and precision. This is contrary to _integer_ numbers, which can only be whole (say, 10, or 3, or 52). + +Floats are always stored in a fixed number of _bits_, or 0/1 combinations. The same is true for integers. The number after `float` in `float32` represents the number of bits with which Keras works by default: 32. Therefore, it works with 32-bit floating point numbers. As you can imagine, such `float32`s can store significantly more precise data compared to `int32` - they can represent 2.12, for example, while `int32` can only represent 2 or 3. That's the first benefit of using floating point systems in your machine learning model. + +This directly translates into another benefit of using `float`s in your deep learning model. Training a machine learning model is a continuous process (Stack Overflow, n.d.). This means that weight initialization, backpropagation and subsequent model optimization - a.k.a. [the high-level training process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) - benefits from _very precise numbers_. Integers can only represent whole numbers, such as 2 and 3. Floats can represent any real number in between the two. Because of this, computations during training can be much more precise, benefiting the performance of your model. Using floating point numbers is therefore great during training. + +### Float32 in your ML model: why it's not so great + +However, if you want to deploy your model, the fact that it was trained using `float32` is not so great. The precision that benefits the training process comes at a cost: the cost of storing that precision. For example, compared to the integer 3000, the number 3000.1298289 requires a much more elaborate representation in order to be stored. This, in turn, makes your model bigger and less efficient during inference.
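+ +To make that storage argument a bit more tangible, here is a small, hypothetical NumPy-only sketch. It is not how TensorFlow Lite quantizes models internally - it simply shows how much space the same weight matrix occupies at different precisions; the layer shape and the naive `int8` scaling are assumptions chosen purely for illustration: + +``` +import numpy as np + +# A hypothetical layer with 1000 x 1000 weights, stored as float32 (the Keras default) +weights_fp32 = np.random.rand(1000, 1000).astype(np.float32) + +# The same weights, stored with lower precision +weights_fp16 = weights_fp32.astype(np.float16) +weights_int8 = (weights_fp32 * 127).astype(np.int8)  # naive scaling, for illustration only + +print(weights_fp32.nbytes)  # 4000000 bytes +print(weights_fp16.nbytes)  # 2000000 bytes +print(weights_int8.nbytes)  # 1000000 bytes +``` + +Same weights, lower precision - and the storage footprint halves or quarters accordingly. That trade-off between precision and size is exactly what quantization exploits.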
+ +### What model quantization involves + +Quantization helps solve this problem. Following TensorFlow (n.d.), "\[it\] works by reducing the precision of the numbers used to represent a model's parameters". Hence, we simply cut off the precision - that is, from 321.36669 to 321. Hoping that the difference wouldn't impact the model in a major way, we can cut model size significantly. In the blog post ["_Here’s why quantization matters for AI_."](https://www.qualcomm.com/news/onq/2019/03/12/heres-why-quantization-matters-ai), Qualcomm (2020) neatly demonstrates why quantization helps reduce the size of your model by means of an example: + +- In order to represent 3452.3194 in floating point numbers, you would need a 32-bit float, thus `float32`. +- Quantizing that number to 3452 however requires an 8-bit integer only, `int8`, which means that you can reserve 24 fewer bits for representing the approximation of that float! + +Now that we know what quantization is and how it benefits model performance, it's time to take a look at the quantization approaches supported by TensorFlow Lite. TF Lite is a collection of tools used for model optimization (TensorFlow lite guide, n.d.). It can be used after regular TensorFlow to reduce the size and hence increase the efficiency of your trained TF models; it can also be installed on edge devices to run the optimized models. TF Lite supports the following methods of quantization: + +- **Post-training float16 quantization**: quantizing of model weights and activations from `float32` to `float16`. +- **Post-training dynamic range quantization**: quantizing of model weights and activations from `float32` to `int8`. On inference, weights are dequantized back into `float32` (TensorFlow, n.d.). +- **Post-training integer quantization**: converting `float32` activations and model weights into `int8` format. For this reason, it is also called _full integer quantization_. +- **Post-training integer quantization with int16 activations**, also called _16x8 quantization_, allows you to quantize `float32` weights and activations into `int8` and `int16`, respectively. +- **Quantization-aware training:** here, the model is made aware of subsequent quantization activities during training, emulating inference-time quantization during the training process. + +### Post-training float16 quantization + +One of the simplest quantization approaches is to convert the model's `float32` based weights into `float16` format (TensorFlow, n.d.). This effectively means that the size of your model is reduced by 50%. While the reduction in size is lower compared to other quantization methods (especially the `int` based ones, as we will see), its benefit is that your models will still run on GPUs - and will run faster, most likely. This does not mean, however, that they cannot run on CPUs instead (TensorFlow, n.d.). + +- More information about `float16` quantization: [Post-training float16 quantization](https://www.tensorflow.org/lite/performance/post_training_float16_quant) + +### Post-training dynamic range quantization + +It's also possible to quantize dynamically - meaning that model weights get quantized into `int8` format from `float32` format (TensorFlow, n.d.). This means that your model will become 4 times smaller, or 25% of the original size - twice the reduction achieved by the post-training `float16` quantization discussed above. What's more, model activations can be quantized as well, but only during inference time.
+ +While models get smaller with dynamic range quantization, you lose the possibility of running your model for inference on a GPU or TPU. Instead, you'll have to use a CPU for this purpose. + +- More information about dynamic range quantization: [Post-training dynamic range quantization](https://www.tensorflow.org/lite/performance/post_training_quant) + +### Post-training integer quantization (full integer quantization) + +Another, more thorough approach to quantization is to convert "all model math" into `int` format. More precisely, everything in your model is converted from `float32` into `int8` format (TensorFlow, n.d.). This also means that the activations of your model are converted into `int` format, compared to dynamic range quantization, which does so during inference time only. This method is also called "full integer quantization". + +Integer quantization helps if you want to run your model on a CPU or even an Edge TPU, which requires integer operations in order to accelerate model performance. What's more, it's also likely that you'll have to perform integer quantization when you want to run your model on a microcontroller. Still, despite the model getting smaller (4 times - from 32-bit into 8-bit) and faster, you'll have to think carefully: changing floats into ints removes _precision_, as we discussed earlier. Do you accept the possibility that model performance is altered? You'll have to thoroughly test this if you use integer quantization. + +- More information about full integer quantization: [Post-training integer quantization](https://www.tensorflow.org/lite/performance/post_training_integer_quant) + +### Post-training integer quantization with int16 activations (16x8 integer quantization) + +Another approach that actually extends the former is post-training integer quantization with int16 activations (TensorFlow, n.d.). Here, weights are converted into `int8`, but activations are converted into `int16` format. Because of this, the method is often called "16x8 integer quantization" (TensorFlow, n.d.). It has the benefit that model size is still reduced - because weights are still in `int8` format. However, for inference, greater accuracy is achieved compared to full integer quantization through activation quantization into `int16` format. + +- More information about 16x8 integer quantization: [Post-training integer quantization with int16 activations](https://www.tensorflow.org/lite/performance/post_training_integer_quant_16x8) + +### Quantization-aware training + +All previous approaches to quantization require you to train a full `float32` model first, after which you apply one of the forms of quantization to optimize the model. While this is easier, model accuracy could possibly benefit when the model is already aware during training that it will eventually be quantized using one of the quantization approaches discussed above. Quantization-aware training allows you to do this (TensorFlow, n.d.), by emulating inference-time quantization during the fitting process. Doing so allows your model to learn parameters that are _robust_ against the loss of precision incurred by quantization (Tfmot.quantization.keras.quantize\_model, n.d.). + +Generally, quantization-aware training is a three-step process: + +1. Train a regular model through `tf.keras` +2. Make it quantization-aware by applying the related API, allowing it to learn those loss-robust parameters. +3. Quantize the model using one of the approaches mentioned above - a minimal sketch of this flow follows below this list.
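+ +To make those three steps a bit more concrete, here is a minimal sketch of what quantization-aware training can look like with the TensorFlow Model Optimization Toolkit. It assumes that `model` is an already-trained Keras model and that `input_train` and `target_train` hold its training data (for example, the MNIST ConvNet that we build later in this post); treat it as an outline under those assumptions rather than a full recipe: + +``` +import tensorflow as tf +import tensorflow_model_optimization as tfmot + +# Step 2: wrap the trained model so that quantization is emulated during fine-tuning +q_aware_model = tfmot.quantization.keras.quantize_model(model) +q_aware_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy']) + +# Briefly fine-tune, so the model can learn parameters that are robust to quantization +q_aware_model.fit(input_train, target_train, batch_size=250, epochs=1, validation_split=0.2) + +# Step 3: quantize the fine-tuned model, here by converting it with TFLite's default optimization +converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model) +converter.optimizations = [tf.lite.Optimize.DEFAULT] +quantized_tflite_model = converter.convert() +```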
+ +- More information about quantization-aware training: [Quantization aware training](https://www.tensorflow.org/model_optimization/guide/quantization/training.md) + +### Which quantization method to choose for your ML model + +It can be difficult to choose a quantization method for your machine learning model. The table below suggests which quantization method could be best for your use case. It seems to be a trade-off between _benefits_ and _hardware_, primarily. Here are some general heuristics: + +- If you want to run your quantized model on a GPU, you must use `float16` quantization. +- If you want to benefit from greatest speedups, full `int8` quantization is best. +- If you want to ensure that model performance does not deteriorate significantly when performing full int quantization, you could choose 16x8 quantization instead. On the downside, models remain a bit bigger, and speedups seem to be a bit lower. +- If you're sure that you want to run your model on a CPU, dynamic range quantization is likely useful to you. +- Quantization-aware training is beneficial prior to performing quantization. + +| Technique | Benefits | Hardware | +| --- | --- | --- | +| Dynamic range quantization | 4x smaller, 2x-3x speedup | CPU | +| Full integer quantization | 4x smaller, 3x+ speedup | CPU, Edge TPU, Microcontrollers | +| 16x8 integer quantization | 3-4x smaller, 2x-3x+ speedup | CPU, possibly Edge TPU, Microcontrollers | +| Float16 quantization | 2x smaller, GPU acceleration | CPU, GPU | + +Benefits of optimization methods and hardware that supports it. Source: [TensorFlow](https://www.tensorflow.org/lite/performance/post_training_quantization), licensed under the [Creative Commons Attribution 4.0 License](https://creativecommons.org/licenses/by/4.0/), no changes were made except for the new 16x8 integer quantization row. + +* * * + +## An example: model quantization for a Keras model + +Let's now implement (dynamic range) quantization for a model trained with `tf.keras`, to give an example - and to learn myself as well :) For this, we'll be using a relatively straight-forward [ConvNet created with Keras](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) that is capable of classifying the MNIST dataset. It allows us to focus on the new aspects of quantization rather than having to worry about how the original neural network works. + +### CNN classifier code + +Here's the full code for the CNN classifier which serves as our starting point. It constructs a two-Conv-layer neural network combined with [max pooling](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/) and [Dropout](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/). It trains on the MNIST dataset, which is first converted from `uint8` format into `float32` format - precisely because of that precision mentioned in the beginning of this blog post. The rest for the code speaks for itself; if not, I'd recommend reading the ConvNet post linked above. 
+ +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 1 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() +input_shape = (img_width, img_height, 1) + +# Reshape data for ConvNet +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize [0, 255] into [0, 1] +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +### Adding dynamic range quantization + +The next step is creating a `TFLiteConverter` which can convert our Keras model into a TFLite representation. Here, we specify that it must optimize the model when doing so, using the `tf.lite.Optimize.DEFAULT` optimization method. In practice, this [reflects](https://www.tensorflow.org/lite/performance/post_training_integer_quant#convert_using_dynamic_range_quantization) dynamic range quantization. + +``` +# Convert into a TFLite model with DEFAULT (dynamic range) quantization +import tensorflow as tf  # alias the 'tensorflow' import from above, so that tf.lite is available + +converter = tf.lite.TFLiteConverter.from_keras_model(model) +converter.optimizations = [tf.lite.Optimize.DEFAULT] +tflite_model = converter.convert() +``` + +Et voila! You can now [save](https://www.reddit.com/r/tensorflow/comments/f1ec02/how_do_i_save_a_converted_tensorflow_lite_model/) your TFLite model, and re-use it on your edge device. + +* * * + +## Summary + +In this article, we looked at quantization for model optimization - in order to make trained machine learning models smaller and faster without incurring performance loss. Quantization involves converting numbers into another number representation, most often from `float32` (TensorFlow default) into `float16` or `int8` formats. This allows models to be smaller, benefiting storage requirements, and often faster, benefiting inference.
+ +We covered multiple forms of quantization: `float16` quantization, where model size is cut in half, as well as full-integer and 16x8-based integer quantization, and finally dynamic range quantization. This involved analyzing how your use case is benefited by one of those approaches, as some work on GPUs, while others work better on CPUs, and so forth. Finally, we covered quantization-aware training, which can be performed prior to quantization, in order to make models robust to loss incurred by quantization by emulating quantization at training time. + +I hope this post was useful for your machine learning projects and that you have learned a lot - I definitely did when looking at this topic! In future blogs, we'll cover more optimization aspects from TensorFlow, such as pruning. For now, however, I'd like to point out that if you have any questions or comments, please feel free to leave a comment in the comments section below 💬 Please do the same if you have any comments, remarks or suggestions for improvement; I'll happily adapt the blog to include your feedback. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +_Model optimization_. (n.d.). TensorFlow. [https://www.tensorflow.org/lite/performance/model\_optimization](https://www.tensorflow.org/lite/performance/model_optimization) + +_Floating-point arithmetic_. (2001, November 11). Wikipedia, the free encyclopedia. Retrieved September 15, 2020, from [https://en.wikipedia.org/wiki/Floating-point\_arithmetic](https://en.wikipedia.org/wiki/Floating-point_arithmetic) + +_Why do I have to convert "uint8" into "float32"_. (n.d.). Stack Overflow. [https://stackoverflow.com/questions/59986353/why-do-i-have-to-convert-uint8-into-float32](https://stackoverflow.com/questions/59986353/why-do-i-have-to-convert-uint8-into-float32) + +_Here’s why quantization matters for AI_. (2020, August 25). Qualcomm. [https://www.qualcomm.com/news/onq/2019/03/12/heres-why-quantization-matters-ai](https://www.qualcomm.com/news/onq/2019/03/12/heres-why-quantization-matters-ai) + +_TensorFlow lite guide_. (n.d.). TensorFlow. [https://www.tensorflow.org/lite/guide](https://www.tensorflow.org/lite/guide) + +_Post-training dynamic range quantization_. (n.d.). TensorFlow. [https://www.tensorflow.org/lite/performance/post\_training\_quant](https://www.tensorflow.org/lite/performance/post_training_quant) + +_Post-training float16 quantization_. (n.d.). TensorFlow. [https://www.tensorflow.org/lite/performance/post\_training\_float16\_quant](https://www.tensorflow.org/lite/performance/post_training_float16_quant) + +_Post-training integer quantization with int16 activations_. (n.d.). TensorFlow. [https://www.tensorflow.org/lite/performance/post\_training\_integer\_quant\_16x8](https://www.tensorflow.org/lite/performance/post_training_integer_quant_16x8) + +_Post-training integer quantization_. (n.d.). TensorFlow. [https://www.tensorflow.org/lite/performance/post\_training\_integer\_quant](https://www.tensorflow.org/lite/performance/post_training_integer_quant) + +_Post-training quantization_. (n.d.). TensorFlow. [https://www.tensorflow.org/lite/performance/post\_training\_quantization](https://www.tensorflow.org/lite/performance/post_training_quantization) + +_TensorFlow/TensorFlow_. (n.d.). GitHub. [https://github.com/tensorflow/tensorflow/tree/r1.14/tensorflow/contrib/quantize](https://github.com/tensorflow/tensorflow/tree/r1.14/tensorflow/contrib/quantize) + +_Tfmot.quantization.keras.quantize\_model_. 
(n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/quantization/keras/quantize\_model](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/quantization/keras/quantize_model) diff --git a/tensorflow-model-optimization-introducing-weight-clustering.md b/tensorflow-model-optimization-introducing-weight-clustering.md new file mode 100644 index 0000000..ccb7667 --- /dev/null +++ b/tensorflow-model-optimization-introducing-weight-clustering.md @@ -0,0 +1,462 @@ +--- +title: "TensorFlow model optimization: introducing weight clustering" +date: "2020-10-06" +categories: + - "frameworks" +tags: + - "clustering" + - "edge-ai" + - "machine-learning" + - "tensorflow" + - "model-optimization" +--- + +Today's state-of-the-art deep learning models are deep - which means that they represent a large hierarchy of layers which themselves are composed of many weights often. The consequence of their depth is that when saving [model weights](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/) after training, the resulting files can become [really big](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/#the-need-for-model-optimization). This poses relatively large storage requirements to hardware where the model runs on. In addition, as running a model after it was trained involves many vector multiplications in [the forward pass of data](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), compute requirements are big as well. + +Often, running such machine learning models in the field is quite impossible due to these resource requirements. This means that cloud-based hardware, such as heavy GPUs, are often necessary to generate predictions with acceptable speed. + +Now, fortunately, there are ways to optimize one's model. In other articles, we studied [quantization](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/) which changes number representation and [pruning](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/) for zeroing out weights that contribute insignificantly to model performance. However, there is another technique: **weight clustering**. In short, and we shall look into the technique in more detail in this article, it involves reduction of model size by clustering layer weights and subsequently changing the weights that belong to a cluster from their own representation into that of their cluster centroids. + +Now, I can imagine that this all sounds a bit abstract. Let's therefore move forward quickly and take a look in more detail. Firstly, we'll cover the need for model optimization - briefly, as we have done this in the articles linked above as well. Secondly, we'll take a look at what weight clustering is conceptually - and why it could work. Then, we cover `tfmot.clustering`, the weight clustering representation available in the TensorFlow Model Optimization Toolkit. Finally, we'll create a Keras model ourselves, and subsequently attempt to reduce its size by applying weight clustering. We also take a look at whether clustering the weights of a pruned and quantized model makes the model even smaller, and what it does to accuracy. 
+ +\[toc\] + +* * * + +## The need for model optimization + +We already saw it in the introduction of this article: machine learning models that are very performant these days are often also very big. The reason why is twofold. First of all, after the [2012 deep learning breakthrough](https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/), people found that by making neural networks deeper and deeper, learned representations could be much more complex. Hence, model performance increased while data complexity did too - which is a good thing if you're trying to build models that should work in the real world. + +Now, as we saw above, a neural network is essentially a system of neurons, with _model weights_, that are initialized and subsequently optimized. When the neural network is deep, and could potentially be broad as well, the number of so-called _trainable parameters_ is huge! That's the second reason why today's neural networks are very big: their architecture or way of working requires them to be so, when combined with the need for deep networks emerging from the 2012 breakthrough. + +When machine learning models are big, it becomes more and more difficult to run them without having dedicated hardware for doing so. In particular, Graphical Processing Units (GPUs) are required if you want to run very big models at speed. Loading the models, getting them to run, and getting them to run at adequate speed - this all gets increasingly difficult when the model gets bigger. + +In short, running models in the field is not an easy task today. Fortunately, for the TensorFlow framework, there are methods available for optimizing your neural network. While we covered [quantization](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/) and [pruning](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/) in another article, we're going to focus on the third method here today: **weight clustering**. + +Let's take a look! + +* * * + +## Weight clustering for model optimization + +Training a neural network is a supervised learning operation: it is trained following the [high-level supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), involving training samples and their corresponding ground truth. However, if you are already involved with Machine Learning, you'll likely also know that there is a branch of techniques that fall under the umbrella of unsupervised learning. [Clustering](https://www.machinecurve.com/index.php/2020/04/16/how-to-perform-k-means-clustering-with-python-in-scikit/) is one of those techniques: without any training samples, an algorithm attempts to identify 'clusters' of similar samples. + +![](images/weight_images.jpg) + +A representation of model weights in TensorBoard. + +They can be used for many purposes - and as we shall see, they can also be used for model optimization by means of clustering weights into groups of similar ones. + +### High-level supervised ML process + +Identifying how this works can be done by zooming in to the [supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process). 
We know that during training it works by means of a forward pass and subsequent optimization, and that this happens iteratively. In more detail, this is a high-level description of that flow: + +- Before the first iteration, weights are [initialized pseudorandomly with some statistical deviation](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/). +- In the first iteration, samples are fed forward - often in batches of samples - after which predictions are generated. +- These predictions are compared with ground truth and converge into a _[loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/)_, which is subsequently used to [optimize](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) i.e. adapt model weights. +- The iteration is repeated until the preconfigured number of iterations has been completed or a [threshold](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/) is met. + +This means that after every iteration (i.e. attempt to train the model), weights are adapted. Essentially, this can be characterized as a continuous 'system state change', where the state of the system of weights changes because the weights are adapted. Once training finishes, the state remains constant - until the model is subsequently trained further e.g. with additional data. + +### Weight representation + +Now, weights themselves are represented mathematically by means of vectors. Those vectors contain numbers given some dimensionality, which can be configured by the ML engineer. All those numbers capture a small part of the learning performed, while the system of numbers (scalars) / vectors as a whole captures all the patterns that were identified in the dataset with respect to the predicted value. + +Using blazing-fast mathematical programming libraries, we can subsequently perform many computations at once in order to train the model (i.e. the forward pass) or to perform model inference (generating predictions for new samples, which is essentially also a forward pass, but then without subsequent optimization). + +### Clustering weights for model compression benefits + +If weights are represented numerically, it is possible to apply [clustering](https://www.machinecurve.com/index.php/2020/04/23/how-to-perform-mean-shift-clustering-with-python-in-scikit/) techniques to them in order to identify groups of similar weights. This is precisely how **weight clustering for model optimization** works. By applying a clustering technique, it is possible to reduce the number of unique weights that are present in a machine learning model (TensorFlow, n.d.). + +How this works is as follows. First of all, you need a trained model - where the system of weights can successfully generate predictions. Applying weight clustering based optimization to this model involves grouping the weights of layers into \[latex\]N\[/latex\] clusters, where \[latex\]N\[/latex\] is configurable by the Machine Learning engineer. This is performed using some clustering algorithm (we will look at this in more detail later). + +If there's a cluster of samples, it's possible to compute a value that represents the middle of a cluster. This value is called a **centroid** and plays a big role in clustering based model optimization. Here's why: we can argue that the centroid value is the 'average value' for all the weights in the particular cluster.
If you remove a bit from one vector in the cluster to move towards the centroid, and add a bit to another vector in that same cluster, one could argue that - holistically, i.e. from a systems perspective - the model shouldn't lose too much of its predictive power. + +And that's precisely what weight clustering based optimization does (TensorFlow, n.d.). Once clusters are computed, all weights in the cluster are adapted to the cluster's centroid value. This brings benefits in terms of model compression: values that are equal can be compressed better. People from TensorFlow have performed tests and have seen up to 5x model compression improvements _without_ losing predictive performance in the machine learning model (TensorFlow, n.d.). That's great! + +Applying weight clustering based optimization can therefore be a great addition to your existing toolkit, which should include [quantization](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/) and [pruning](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/). + +Now that we know what weight clustering based optimization involves, it's time to take a look at how weight clustering based model optimization is implemented in TensorFlow. + +* * * + +## Weight clustering in the TensorFlow Model Optimization Toolkit + +For those who use TensorFlow for creating their neural networks, I have some good news: optimizing machine learning inference is relatively easy, because it can be done with what is known as the [TensorFlow Model Optimization Toolkit, or TFMOT](https://www.tensorflow.org/model_optimization/guide). This toolkit provides functionality for quantization, pruning and weight clustering and works with the Keras models you already created with TensorFlow 2.x. + +In this section, we'll be looking at **four components** of weight clustering in TFMOT, namely: + +1. **Cluster\_weights(...):** used for wrapping your regular Keras model with weight clustering wrappers, so that clustering can happen. +2. **CentroidInitialization:** used for computation of the initial values of the cluster centroids used in weight clustering. +3. **Strip\_clustering(...):** used for stripping the wrappers off your clustering-ready Keras model, to get back to normal. +4. **Cluster\_scope(...):** used when deserializing (i.e. loading) your weight clustered neural network. + +Let's now take a look at each of them in more detail. + +### Enabling clustering: cluster\_weights(...) + +A regular Keras model cannot be weight clustered as it lacks certain functionality for doing so. That's why we need to _wrap_ the model with this functionality, which clusters weights during training. It is essentially the way to configure weight clustering for your Keras model. Do note, however, as we shall see in the tips later in this article, that you should only cluster a model that already shows acceptable performance, e.g. because it has already been trained.
+ +Applying `cluster_weights(...)` works as follows (source: [TensorFlow](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/clustering/keras/cluster_weights), license: [Creative Commons Attribution 4.0 License](https://creativecommons.org/licenses/by/4.0/), no changes): + +``` +clustering_params = { + 'number_of_clusters': 8, + 'cluster_centroids_init': + CentroidInitialization.DENSITY_BASED +} + +clustered_model = cluster_weights(original_model, **clustering_params) +``` + +Here, we define the number of clusters we want, as well as how the centroids are initialized - a configuration option that we will look at in more detail next. Subsequently, we pass the clustering parameters into `cluster_weights(...)` together with our original model. The `clustered_model` that remains can then be used for clustering. + +### Determining centroid initialization: CentroidInitialization + +From the section above, we know that weight clustering involves clustering the weights (no shit, sherlock) but then also replacing the weights that are part of a cluster with the centroids of that particular cluster. This achieves the benefits in terms of compression that we talked about. + +Understanding that there are multiple [algorithms](https://www.machinecurve.com/index.php/2020/04/23/how-to-perform-mean-shift-clustering-with-python-in-scikit/) for [clustering](https://www.machinecurve.com/index.php/2020/04/16/how-to-perform-k-means-clustering-with-python-in-scikit/) yields the question if certain alterations are present within the TFMOT based weights clustering technique as well. + +Now, while it seems to be the case that the _clustering algorithm itself cannot be chosen_ (it seems like [K-means is used under the hood](https://www.machinecurve.com/index.php/2020/04/16/how-to-perform-k-means-clustering-with-python-in-scikit/)), it's possible to choose what is known as a **centroid initialization**. Here's what centroid initialization involves. When starting clustering, as we saw in the previous section, the Machine Learning engineer can configure a number of clusters for either the model or the layers that they intend to cluster. + +Those _clusters_ need to be initialized - that is, they need to be placed somewhere in sample space, before the clustering algorithm can work towards convergence. This initial placement is called the initialization of the centers of the clusters, also known as the centroids. In TensorFlow model optimization, a strategy for doing so can be chosen by means of a `CentroidInitialization` parameter. You can choose from the following centroid initialization strategies: + +- **Density-based initialization:** using the density of the sample space, the centroids are initialized, as to distribute them with more centroids present in more dense areas of the feature space. +- **Linear initialization:** centroids are initialized evenly spaced between the minimum and maximum weight values present, ignoring any density. +- **Random initialization:** as the name suggests, cluster centroids are chosen by sampling randomly in between the minimum and maximum weight values present. +- **Kmeans++-based initialization:** using Kmeans++, the cluster centroids are initialized. + +### Stripping clustering wrappers: strip\_clustering(...) + +We know that we had to apply `cluster_weights(...)` in order to wrap the model with special functionality in order to be able to apply clustering in the first place. 
However, this functionality is no longer required once the model has been weight clustered - especially because it's the _weights_ that are clustered, and they belong to the original model. + +That's why it's best, and even required, to remove the clustering wrappers if you wish to see the benefits from clustering in terms of reduction of model size when compressed. `strip_clustering(...)` can be used for this purpose. Applying it is really simple: you pass the clustered model, and get a stripped model, like this: + +``` +model = tensorflow.keras.Model(...) +wrapped_model = cluster_weights(model) +stripped_model = strip_clustering(wrapped_model) +``` + +### Model deserialization: cluster\_scope(...) + +Sometimes, however, you [save a model](https://www.machinecurve.com/index.php/2020/02/14/how-to-save-and-load-a-model-with-keras/) when it is wrapped with clustering functionality: + +``` +model = tensorflow.keras.Model(...) +wrapped_model = cluster_weights(model) +tensorflow.keras.models.save_model(wrapped_model, './some_path') +``` + +If you then [load the model](https://www.machinecurve.com/index.php/2020/02/14/how-to-save-and-load-a-model-with-keras/) with `load_model`, things will go south! This originates from the fact that you are trying to load a _regular_ Keras model, i.e. a model without wrappers, while in fact you saved the model _with_ clustering wrappers. + +Fortunately, TFMOT provides functionality to run the loading operation within a `cluster_scope`, which means that it takes into account the fact that it is loading a model that has been wrapped with clustering functionality: + +``` +model = tensorflow.keras.Model(...) +wrapped_model = cluster_weights(model) +file_path = './some_path' +tensorflow.keras.models.save_model(wrapped_model, file_path) + +with tfmot.clustering.keras.cluster_scope(): + loaded_model = tensorflow.keras.models.load_model(file_path) +``` + +* * * + +## Tips for applying weight clustering + +If you want to apply weight clustering based optimization, it's good to follow a few best practices. Here, we've gathered a variety of tips from throughout the web that help you get started with this model optimization technique (TensorFlow, n.d.): + +- Weight clustering can be combined with **[post-training quantization](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/)**. This should bring even more benefits compared to weight clustering based optimization or quantization based optimization alone. +- A model should already be trained before weight based clustering is performed. Contrary to e.g. [pruning](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/), where sparsity can be increased while the model is training, weight clustering does not work in parallel with the training process. It must be applied after training finishes. +- If you apply clustering to layers that precede a batch normalization layer, the benefits are reduced. This is likely due to the normalizing effect of Batch Normalization layers. +- It could be that clustering weights for all layers leads to unacceptable accuracies or other loss scores. In those cases, it is possible to cluster only a few layers. Click [here](https://www.tensorflow.org/model_optimization/guide/clustering/clustering_comprehensive_guide#cluster_some_layers_sequential_and_functional_models) to find out more if that's what you want to do. +- Apparently, downstream layers (i.e.
the later layers in your neural network) have _more redundant parameters_ compared to layers early in the neural network (TensorFlow, n.d.). Here, weight clustering based optimization should provide the biggest benefits. If you want to cluster only a few layers, it could be worthwhile to optimize those later layers instead of early ones. +- Critical layers (e.g. attention layers) should not be clustered; for example, because attention could be lost. +- If you're optimizing a few layers only using weight based optimization, it's important to freeze the first few layers. This ensures that they remain _constant_; if you don't, it could be the case that their weights change in order to accommodate the changes in the later layers. You often don't want this to happen. +- The way the algorithm computes the centroids of the clusters "plays a key role in (..) model accuracy" (TensorFlow, n.d.). Generally, linear > density based centroid initialization, and linear > random based centroid initialization. Sometimes, though, the others _are_ better, however only in a minority of cases. Do make sure to test all of them, but if you want to use some heuristics, there they are. +- Fine tuning a model during weight clustering must be done with a learning rate that is lower than the one used in training. This ensures that there won't be any jumpiness in terms of the weights, but that instead the 'optimization steps' performed jointly with clustering are really small. +- If you want to see the compression benefits, you must **both** use `strip_clustering` (which removes the clustering wrappers) and a compression algorithm (such as `gzip`). If you don't, you won't see the benefits. + +* * * + +## Example: weight clustering your Keras model + +Let's now take a step away from all the theory - we're going to code an example that applies weight clustering based optimization to a Keras model 😎 + +### Defining the ConvNet + +For this example, we're going to create a simple [Convolutional Neural Network](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) with Keras that is trained to recognize digits from the MNIST [dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/). If you're familiar with Machine Learning, you're well aware that this dataset is used in educational settings very often. Precisely that is the reason that we are also using this dataset here today. In fact, it's a model that is practically _guaranteed_ to perform well (if trained adequately), often with accuracies of 95-97% and more. + +Do note that if you wish to run the model code, you will need `tensorflow` 2.x as well as the TensorFlow Model Optimization Toolkit or `tfmot`. If you don't have it already, you must also install NumPy. Here's how to install them: + +- **TensorFlow:** `pip install tensorflow` +- **TensorFlow Model Optimization Toolkit:** `pip install --upgrade tensorflow-model-optimization` +- **NumPy:** `pip install numpy` + +![](images/mnist.png) + +Samples from the MNIST dataset. + +### Compiling and training the ConvNet + +The first step is relatively simple, and we'll skip the explanations for this part. If you don't understand them yet but would like to do so, I'd recommend clicking the link to the ConvNet article above, where I explain how this is done. + +Now, open up some file editor, create a file - e.g. `clustering.py`. It's also possible to use a Jupyter Notebook for this purpose.
Then, add this code, which imports the necessary functionality, defines the architecture for our neural network, compiles it and subsequently fits it i.e. starts the training process: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential, save_model +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +import tempfile +import tensorflow_model_optimization as tfmot +import numpy as np + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +no_epochs = 15 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() +input_shape = (img_width, img_height, 1) + +# Reshape data for ConvNet +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize [0, 255] into [0, 1] +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +### Generating evaluation metrics for the ConvNet + +After fitting the data to your model, you have exhausted your training set _and_ your validation dataset. That is, you can't use both datasets in order to test how well it performs - because both have played a role in the training process. + +You don't want to be the butcher who checks their own meat, don't you? + +Instead, in the code above, we have split off a part of the dataset (in fact, Keras did that for us) which we can use for _testing_ purposes. It allows us to test how well our model performs when it is ran against samples it hasn't seen before. In ML terms, we call this _testing how well the model generalizes._ + +With Keras, you can easily evaluate model performance: + +``` +# Generate generalization metrics for original model +score = model.evaluate(input_test, target_test, verbose=0) +``` + +### Storing the ConvNet to file + +Later in this article, we're going to compare the size of a compressed-and-saved model that was optimized with weights clustering to the size of a compressed-and-saved original model. If we want to do this, we must save the original model to a temporary file. 
Here's how we do that, so let's add this code next: + +``` +# Store file +_, keras_file = tempfile.mkstemp('.h5') +save_model(model, keras_file, include_optimizer=False) +print(f'Baseline model saved: {keras_file}') +``` + +### Configuring weight clustering for the ConvNet + +Now that we have trained, evaluated and saved the original ConvNet, we can move forward with the actual weight clustering related operations. The first thing we're going to do is configure how TensorFlow will cluster weights during finetuning. + +To do so, we're going to create a dictionary with the `number_of_clusters` we want the clustering algorithm to find and how the cluster centroids are initialized: + +``` +# Define clustering parameters +clustering_params = { + 'number_of_clusters': 14, + 'cluster_centroids_init': tfmot.clustering.keras.CentroidInitialization.LINEAR +} +``` + +We want 14 clusters. In line with the tips from above, we're using a `CentroidInitialization.LINEAR` strategy for applying weight clustering here. + +### Compiling and finetuning the clustered model + +Then, it's time to wrap our trained `model` with clustering functionality configured according to our `clustering_params`: + +``` +# Cluster the model +wrapped_model = tfmot.clustering.keras.cluster_weights(model, **clustering_params) +``` + +We're now almost ready to finetune our model with clustered weights. However, recall from the tips mentioned above that it is important to decrease the learning rate when doing so. That's why we're redefining our [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) with a learning rate of `1e-5`, which is lower than Adam's default of `1e-3`: + +``` +# Decrease learning rate (see tips in article!) +decreased_lr_optimizer = tensorflow.keras.optimizers.Adam(lr=1e-5) +``` + +We then recompile the model and finetune _for just one epoch_: + +``` +# Compile wrapped model +wrapped_model.compile( + loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=decreased_lr_optimizer, + metrics=['accuracy']) + +# Finetuning +wrapped_model.fit(input_train, target_train, + batch_size=batch_size, + epochs=1, + verbose=verbosity, + validation_split=validation_split) +``` + +### Evaluating the clustered model + +Here, too, we must investigate how well the clustered model generalizes. We add the same metrics - evaluating the finetuned `wrapped_model` this time - _and also print the outcomes of the previous evaluation step:_ + +``` +# Generate generalization metrics for clustered model +clustered_score = wrapped_model.evaluate(input_test, target_test, verbose=0) +print(f'Regular CNN - Test loss: {score[0]} / Test accuracy: {score[1]}') +print(f'Clustered CNN - Test loss: {clustered_score[0]} / Test accuracy: {clustered_score[1]}') +``` + +### Comparing the clustered and original models + +For comparing the clustered and original models, we must do a few things: + +1. Remember to use `strip_clustering(...)` in order to convert our wrapped model back into a regular Keras model. +2. Store our file. +3. Gzip both of our models, and run our example.
First of all, we strip the wrappers and store our file: + +``` +# Strip clustering +final_model = tfmot.clustering.keras.strip_clustering(wrapped_model) + +# Store file +_, keras_file_clustered = tempfile.mkstemp('.h5') +save_model(final_model, keras_file_clustered, include_optimizer=False) +print(f'Clustered model saved: {keras_file_clustered}') +``` + +Then, we're using a Python function provided by TensorFlow (Apache 2.0 licensed) to get the size of our gzipped model: + +``` +# Measuring the size of your clustered model +# (source: https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras#fine-tune_pre-trained_model_with_pruning) + +def get_gzipped_model_size(file): + # Returns size of gzipped model, in bytes. + import os + import zipfile + + _, zipped_file = tempfile.mkstemp('.zip') + with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f: + f.write(file) + + return os.path.getsize(zipped_file) +``` + +The last thing is comparing the sizes of both models when compressed: + +``` +print("Size of gzipped original Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file))) +print("Size of gzipped clustered Keras model: %.2f bytes" % (get_gzipped_model_size(keras_file_clustered))) +``` + +* * * + +## Running the example + +Time to run the example! + +Open up your Python environment, such as your terminal or your Notebook, and run the code - e.g. with `python clustering.py`. You will likely observe the following: + +1. Your model will train for 15 epochs, and will achieve quite low loss scores and high accuracies relatively soon - it's the MNIST dataset, after all. +2. Your model will then train for 1 epoch, and likely, this will be significantly slower than each of the 15 epochs (remember that clustering is applied here under the hood). +3. Your model will then print both the evaluation and the compression comparison scores. + +In my case, this produced the following numbers: + +``` +Regular CNN - Test loss: 0.02783038549570483 / Test accuracy: 0.9919999837875366 +Clustered CNN - Test loss: 0.027621763848347473 / Test accuracy: 0.9919000267982483 +Size of gzipped original Keras model: 1602422.00 bytes +Size of gzipped clustered Keras model: 196180.00 bytes +``` + +We see a reduction in size of **more than 8 times** with a _very small loss of performance_. That's awesome! 😎 + +* * * + +## Summary + +Today's machine learning models can become very large, hampering things like model inference in the field. Another factor that is impacted is storage: weights must both be stored and loaded, impacting performance of your Edge AI scenario and incurring additional costs. + +Fortunately, with modern machine learning libraries like TensorFlow, it is possible to apply a variety of optimization techniques to your trained ML models. In other posts, we focused on quantization and pruning. In this article, we looked at weight clustering: the application of an unsupervised clustering algorithm to cluster the weights of your machine learning model into \[latex\]N\[/latex\] clusters. How this optimizes your machine learning model is relatively easy to see: as weights within the clusters are set to the centroid values for each cluster, model compression benefits are achieved, as the same numbers can be compressed more easily. + +In the remainder of the article, we specifically looked at how weight clustering based model optimization is presented within the API of the TensorFlow Model Optimization Toolkit.
We looked at how Keras models can be wrapped with clustering functionality, what initialization strategies for the cluster centroids can be used, how models can be converted back into regular Keras models after training and finally how wrapped models can be deserialized. + +We extended this analysis by means of an example, where we trained a simple Keras CNN on the MNIST dataset and subsequently applied weight clustering. We noticed that the size of our compressed Keras model was reduced by more than 8 times with only a very small reduction in performance. Very promising indeed! + +I hope that you have learnt a lot from this article - I did, when researching :) Please feel free to leave a message if you have any remarks, questions or other suggestions for the improvement of this post. If not, thanks for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +_Module: Tfmot.clustering_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/clustering](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/clustering) + +_Module: Tfmot.clustering.keras_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/clustering/keras](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/clustering/keras) + +_Module: Tfmot.clustering.keras_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/clustering/keras](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/clustering/keras) + +_Tfmot.clustering.keras.CentroidInitialization_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/clustering/keras/CentroidInitialization](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/clustering/keras/CentroidInitialization) + +_Tfmot.clustering.keras.cluster\_scope_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/clustering/keras/cluster\_scope](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/clustering/keras/cluster_scope) + +_Tfmot.clustering.keras.cluster\_weights_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/clustering/keras/cluster\_weights](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/clustering/keras/cluster_weights) + +_Tfmot.clustering.keras.strip\_clustering_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/clustering/keras/strip\_clustering](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/clustering/keras/strip_clustering) + +_Weight clustering in Keras example_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/guide/clustering/clustering\_example](https://www.tensorflow.org/model_optimization/guide/clustering/clustering_example) + +_Weight clustering_. (n.d.). TensorFlow. 
[https://www.tensorflow.org/model\_optimization/guide/clustering](https://www.tensorflow.org/model_optimization/guide/clustering) diff --git a/tensorflow-pruning-schedules-constantsparsity-and-polynomialdecay.md b/tensorflow-pruning-schedules-constantsparsity-and-polynomialdecay.md new file mode 100644 index 0000000..15c27ef --- /dev/null +++ b/tensorflow-pruning-schedules-constantsparsity-and-polynomialdecay.md @@ -0,0 +1,376 @@ +--- +title: "TensorFlow pruning schedules: ConstantSparsity and PolynomialDecay" +date: "2020-09-29" +categories: + - "frameworks" +tags: + - "constant-sparsity" + - "edge-ai" + - "optimizer" + - "polynomial-decay" + - "pruning" + - "sparsity" + - "tensorflow" + - "model-optimization" +--- + +Today's deep learning models can become very large. That is, the weights of some contemporary model architectures are already approaching 500 gigabytes if you're working with pretrained models. In those cases, it is very difficult to run the models on embedded hardware, requiring cloud technology to run them successfully for model inference. + +This is problematic when you want to generate predictions in the field that are accurate. Fortunately, today's deep learning frameworks provide a variety of techniques to help make models smaller and faster. In other blog articles, we covered two of those techniques: [quantization](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/) and [magnitude-based pruning](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/). Especially when combining the two, it is possible to significantly reduce the size of your deep learning models for inference, while making them faster and while keeping them as accurate as possible. + +They are interesting paths to making it possible to run your models at the edge, so I'd recommend the linked articles if you wish to read more. In this blog post, however, we'll take a more in-depth look at pruning in TensorFlow. More specifically, we'll first take a look at pruning by providing a brief and high-level recap. This allows the reader who hasn't read the posts linked before to get an idea what we're talking about. Subsequently, we'll be looking at the TensorFlow Model Optimization API, and specifically the `tfmot.sparsity.keras.PruningSchedule` functionality, which allows us to use preconfigured or custom-designed pruning schedules. + +Once we understand `PruningSchedule`, it's time to take a look at two methods for pruning that come with the TensorFlow Model Optimization toolkit: the `ConstantSparsity` method and the `PolynomialDecay` method for pruning. We then converge towards a practical example with Keras by using `ConstantSparsity` to make our model sparser. If you want to get an example for `PolynomialDecay`, click [here](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/#pruning-a-keras-example) instead. + +Enough introduction for now! Let's start :) + +* * * + +\[toc\] + +* * * + +## A brief recap on Pruning + +If we train a machine learning model by means of a training, validation and testing dataset, we're following a methodology that is called [supervised learning](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process). 
If you look at the name, it already tells you much about how it works: by _supervising_ the learning process, you'll allow the model to learn to generate successful predictions for new situations. Supervision, here, means to let the model learn and check its predictions against the true outcomes later. It is a highly effective form of machine learning and is used very often in today's machine learning settings.

### Training a machine learning model: the iterative learning process

If we look at supervised learning in more detail, we can characterize it as follows:

![](images/High-level-training-process-1024x973.jpg)

We start our training process with a model where the weights are [initialized pseudorandomly](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/), with small adjustments to account for vanishing and exploding gradients. A model "weight" is effectively a vector that contains (part of the) learnt ability, and stores it numerically. All model weights, which are stored in a hierarchical fashion through layers, together capture all the patterns that have been learnt during training. Generating a new prediction involves a vector multiplication between the first-layer weight vectors and the vector of your input sample, subsequently passing the output to the next layer, and repeating the process for all downstream layers. The end result is one prediction, which can be a predicted class or a regressed real-valued number.

In terms of the machine learning process outlined above, we call feeding the training data to the model a _forward pass_. When data is passed forward, a prediction is computed for the input vector. In fact, this is done for all input vectors, generating as many predictions as there are training rows. Now that all the predictions are in, we can compare them with the ground truth - hence the supervision. In doing so, we can compute an average that represents the average error in the model, called a _[loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/)_. Using this loss value, we can compute the error contribution of individual neurons and subsequently perform optimization using [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or [modern optimizers](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/).

Repeating this process allows us to continuously adapt our weights until the loss value is lower than a predefined threshold, after which we (perhaps [automatically](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/)) stop the training process.

### Model optimization: pruning and quantization

Many of today's state-of-the-art machine learning architectures are [really big](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/#the-need-for-model-optimization) - 100 MB is no exception, and some architectures are 500 MB when they are trained. As we understand from the introduction and the linked article, it's highly impractical if not impossible to run those models adequately on embedded hardware, such as devices in the field.

They will then either be too _slow_ or they _cannot be loaded at all_.
+ +Using [pruning](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/) and [quantization](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/), we can attempt to reduce model size. We studied pruning in detail in a different blog article. Let's now briefly cover what it is before we continue by studying the different types of pruning available in TensorFlow. + +### Applying pruning to keep the important weights only + +If we train a machine learning model, we can attempt to find out how much every model weight contributes to the final outcome. It should be clear that if a weight does not contribute significantly, it is not worth it to keep it in the model. In fact, there are many reasons why those weights should be thrown out - a.k.a., set to zero, making things _sparse_, as this is called: + +- Compressing the model will be much more effective given the fact that sparse data can be compressed much better, decreasing the requirements for model storage. +- Running the model will be faster because sparse representations will always produce zero outputs (i.e., multiplying anything with 0 yields 0). Programmatically, this means that libraries don't have to perform vector multiplications when weights are sparse - making the prediction faster. +- Loading the model on embedded software will also be faster given the previous two reasons. + +This is effectively what pruning does: it checks which weights contribute most, and throws out everything else that contributes less than a certain threshold. This is called [magnitude-based pruning](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/) and is applied in TensorFlow. Since pruning happens during training, the weights that _do_ contribute significantly enough can adapt to the impact of the weights-thrown-out, making the model as a whole robust against sparsity on the fly. + +While one must be very cautious still, since pruning (and quantization) can significantly impact model performance, both pruning and quantization can be great methods for optimizing your machine learning models. + +* * * + +## Pruning in TensorFlow + +Now that we know how supervised machine learning models are trained and how pruning works conceptually, we can take a look at how TensorFlow provides methods for pruning. Specifically, this is provided through the TensorFlow Model Optimization toolkit, which must be installed separately (and is no core feature of TF itself, but integrates natively). + +For pruning, it provides two methods: + +- `ConstantSparsity` based pruning, which means that sparsity is kept constant during training. +- `PolynomialDecay` based pruning, which means that the degree of sparsity is changed during training. + +### Generic terminology + +Before we can look into `ConstantSparsity` and `PolynomialDecay` pruning schedules in more detail, we must take a look at some generic terminology first. More specifically, we'll discuss pruning schedules - implemented by means of a `PruningSchedule` - as well as pruning steps. + +#### Pruning schedule + +Applying pruning to a TensorFlow model must be done by means of a **pruning schedule** (PruningSchedule, n.d.). It "specifies when to prune layer and the sparsity(%) at each training step". 
More specifically: + +> PruningSchedule controls pruning during training by notifying at each step whether the layer's weights should be pruned or not, and the sparsity(%) at which they should be pruned. + +Essentially, it provides the necessary wrapper for pruning to take place in a scalable way. That is, while the pruning schedule _instance_ (such as `ConstantSparsity`) determines how pruning must be done, the `PruningSchedule` class provides the _skeleton_ for communicating the schedule. That is, it produces information about whether a layer should be pruned at a particular pruning step (by means of `should_prune`) and if so, what sparsity it must be pruned for. + +#### Pruning steps + +Now that we know about a `PruningSchedule`, we understand that it provides the skeleton for a pruning schedule to work. Any pruning schedule instance will thus tell you about whether pruning should be applied and what sparsity should be generated, but it will do so for a particular _step._ This terminology - **pruning steps** \-confused me, because well, what is a step? Is it equal to an epoch? If it is, why isn't it called epoch? If it's not, what is it? + +In order to answer this question, I first looked at the source code for `PruningSchedule` on GitHub. As we know, TensorFlow is open source, and hence its code is available for everyone to see (TensorFlow/model-optimization, 2020). While it provides code that outputs whether to prune (`_should_prune_in_step`), it does not provide any explanation for the concept of a step. + +However, in the [article about pruning](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/), we saw that we must add the `UpdatePruningStep` callback to the part where pruning is applied. That is, after an epoch or a batch, it is applied to the model in question (Keras Team, n.d.). For this reason, it would be worthwhile to continue the search in the source code for the `UpdatePruningStep` callback. + +Here, we see the following: + +``` + def on_train_batch_begin(self, batch, logs=None): + tuples = [] + for layer in self.prunable_layers: + tuples.append((layer.pruning_step, self.step)) + + K.batch_set_value(tuples) + self.step = self.step + 1 +``` + +This code is executed _upon the start of every batch_. To illustrate, if your training set has 1000 samples and you have a batch size of 250, every epoch will consist of 4 batches. Per epoch, the code above will be called 4 times. + +In it, the pruning step is increased by one: `self.step = self.step + 1`. + +This means that every _batch_ during your training process represents a pruning step. This is also why in the pruning article, [we configured the end\_step](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/#loading-and-configuring-pruning) as follows: + +``` +end_step = np.ceil(num_images / batch_size).astype(np.int32) * pruning_epochs +``` + +That's the number of images divided by the batch size (i.e., the number of steps per epoch) times the number of epochs; this produces the total number of steps performed during pruning. + +### ConstantSparsity based pruning + +TensorFlow's **constant sparsity** during pruning can be characterized as follows (ConstantSparsity, n.d.): + +> Pruning schedule with constant sparsity(%) throughout training. + +As it inherits from the `PruningSchedule` defined above, it must implement all the Python definitions and can hence be used directly in pruning. 

It accepts the following arguments (source: [TensorFlow](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras/ConstantSparsity) - [Creative Commons Attribution 4.0 License](https://creativecommons.org/licenses/by/4.0/), no edits):

| Argument | Description |
| --- | --- |
| `target_sparsity` | A scalar float representing the target sparsity value. |
| `begin_step` | Step at which to begin pruning. |
| `end_step` | Step at which to end pruning. `-1` by default. `-1` implies continuing to prune till the end of training. |
| `frequency` | Only apply pruning every `frequency` steps. |

Those arguments allow you to configure pruning to your needs. With a constant target sparsity, you set the degree of sparsity that is applied whenever pruning happens. When that happens is determined by the `begin_step` as well as the `end_step` and `frequency`. Should you wish to apply pruning only to the final part of training, you can configure this through `begin_step`; the same goes for applying it to the entire training process, only the first part, and other configurations. How often pruning must be applied can be configured by means of `frequency` (default frequency = 100).

It returns the following data (TensorFlow/model-optimization, n.d.) with respect to `should_prune` and `sparsity`:

```
    return (self._should_prune_in_step(step, self.begin_step, self.end_step,
                                       self.frequency),
            tf.constant(self.target_sparsity, dtype=tf.float32))
```

It thus indeed returns a constant sparsity value to prune for.

### PolynomialDecay based pruning

If you recall some basic maths, you might remember what is known as a _polynomial function_. Such functions, for example `x` squared, raise an input `x` to some power. This can also be applied in pruning to make the applied pruning level non-constant. Using **polynomial decay based sparsity**, the sparsity level can be increased (or decreased) over the course of training, at a speed determined by the polynomial. It is represented by the `PolynomialDecay` schedule in TensorFlow:

> Pruning Schedule with a PolynomialDecay function.

It also inherits from `PruningSchedule`, so it implements all the necessary functionality for it to be used in pruning directly.

It accepts the following arguments (source: [TensorFlow](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras/PolynomialDecay) - [Creative Commons Attribution 4.0 License](https://creativecommons.org/licenses/by/4.0/), no edits):

| Argument | Description |
| --- | --- |
| `initial_sparsity` | Sparsity (%) at which pruning begins. |
| `final_sparsity` | Sparsity (%) at which pruning ends. |
| `begin_step` | Step at which to begin pruning. |
| `end_step` | Step at which to end pruning. |
| `power` | Exponent to be used in the sparsity function. |
| `frequency` | Only apply pruning every `frequency` steps. |

Here, the user must provide an _initial sparsity_ as well as a _final sparsity_ percentage. Similar to constant sparsity, a begin and end step and a frequency must be passed along as well. New here is the `power` argument, which represents the exponent of the polynomial function to be used for computing sparsity.

### When to use which pruning schedule?

It seems that there has been no extensive investigation into what pruning schedule must be used under what conditions.
For example, Zhu and Gupta (2017) have investigated the effects of what we know as `PolynomialDecay` for a variety of sparsity levels ranging between 50 and 90%, and found that sparsity does not significantly hamper accuracy.

In my view - and I will test this later in this blog - this partially occurs because of how polynomial decay is implemented. Their training process started at a low sparsity level (0%, in fact), which was then increased (to 50/75/87.5% for three scenarios, respectively). In all scenarios, the sparsity increase started at the 2000th step, in order to allow the model to start its path towards convergence without being hurt by sparsity-inducing methods already.

The effect of this strategy is that while the model already starts converging, sparsity is introduced slowly. Model weights can take this impact into account and become robust to the effect of weights being dropped, similar to [quantization-aware training](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/#quantization-aware-training) in the case of quantization. I personally think that this is a better strategy compared to `ConstantSparsity`, which immediately increases sparsity levels from 0% to the constant sparsity level that was configured.

* * *

## A code example with ConstantSparsity

Next, we will provide an example that trains a model with `ConstantSparsity` applied. The model code is equal to the version applying `PolynomialDecay`, but applies `ConstantSparsity` instead, with a sparsity of 87.5%. We start applying sparsity after 20% of the training process has finished, i.e. after `0.2 * end_step`, and continue pruning until `end_step`, i.e. for the rest of the pruning steps.

We train for 30 epochs, as a ConvNet-based MNIST classifier will already see good performance after only a few epochs.

Should you wish to get additional explanation or see the code for `PolynomialDecay`, click [here](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/#pruning-a-keras-example) instead.
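
For a quick impression of the difference between the two schedules, the sketch below shows what a roughly equivalent `PolynomialDecay` configuration could look like. It is illustrative only: it reuses the `model`, `prune_low_magnitude` and `end_step` variables that are defined in the full example following this snippet, and ramps sparsity from 0% up to the same 87.5% over the same range of steps.

```
# Illustrative sketch only: a PolynomialDecay schedule that is roughly
# equivalent to the ConstantSparsity configuration used below.
# Sparsity ramps up from 0% to 87.5% between begin_step and end_step.
pruning_params_polynomial = {
  'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(initial_sparsity=0.00,
                                                           final_sparsity=0.875,
                                                           begin_step=int(0.2*end_step),
                                                           end_step=end_step)
}
model_for_pruning = prune_low_magnitude(model, **pruning_params_polynomial)
```

Swapping in this dictionary would be the main change needed to compare both schedules; the full example itself uses `ConstantSparsity`.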
Here is the full code for creating, training, pruning, saving and comparing a pruned Keras model with `ConstantSparsity`: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential, save_model +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +import tempfile +import tensorflow_model_optimization as tfmot +import numpy as np + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +pruning_epochs = 30 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() +input_shape = (img_width, img_height, 1) + +# Reshape data for ConvNet +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize [0, 255] into [0, 1] +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Load functionality for adding pruning wrappers +prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude + +# Finish pruning after 10 epochs +num_images = input_train.shape[0] * (1 - validation_split) +end_step = np.ceil(num_images / batch_size).astype(np.int32) * pruning_epochs + +# Define pruning configuration +pruning_params = { + 'pruning_schedule': tfmot.sparsity.keras.ConstantSparsity(target_sparsity=0.875, + begin_step=0.2*end_step, + end_step=end_step) +} +model_for_pruning = prune_low_magnitude(model, **pruning_params) + +# Recompile the model +model_for_pruning.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Model callbacks +callbacks = [ + tfmot.sparsity.keras.UpdatePruningStep() +] + +# Fitting data +model_for_pruning.fit(input_train, target_train, + batch_size=batch_size, + epochs=pruning_epochs, + verbose=verbosity, + callbacks=callbacks, + validation_split=validation_split) + +# Generate generalization metrics +score_pruned = model_for_pruning.evaluate(input_test, target_test, verbose=0) +print(f'Pruned CNN - Test loss: {score_pruned[0]} / Test accuracy: {score_pruned[1]}') + +# Export the model +model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning) +_, pruned_keras_file = tempfile.mkstemp('.h5') +save_model(model_for_export, pruned_keras_file, include_optimizer=False) +print(f'Pruned model saved: {pruned_keras_file}') + +# Measuring the size of your pruned model +# 
(source: https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras#fine-tune_pre-trained_model_with_pruning)

def get_gzipped_model_size(file):
  # Returns size of gzipped model, in bytes.
  import os
  import zipfile

  _, zipped_file = tempfile.mkstemp('.zip')
  with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
    f.write(file)

  return os.path.getsize(zipped_file)

print("Size of gzipped pruned Keras model: %.2f bytes" % (get_gzipped_model_size(pruned_keras_file)))
```

* * *

## Comparing the effects of ConstantSparsity and PolynomialDecay based pruning

Earlier, we saw that there has been no large-scale investigation into what method of pruning works best in TensorFlow. Although intuitively, it feels as if `PolynomialDecay` pruned models produce more robust models, this is simply an intuition and must be tested. Training the `ConstantSparsity` model with the classifier above for 30 epochs yields the following results:

```
Pruned CNN - Test loss: 0.03168991032061167 / Test accuracy: 0.9886000156402588
Size of gzipped pruned Keras model: 388071.00 bytes
```

It performs really well, but this is expected from models classifying MNIST digits.

Subsequently retraining the PolynomialDecay based one from the [other post](https://www.machinecurve.com/index.php/2020/09/23/tensorflow-model-optimization-an-introduction-to-pruning/#pruning-a-keras-example), but then following the Zhu & Gupta (2017) setting (sparsity at 0% at first, up to 87.5% - equaling the constant sparsity of the other model; beginning at 20% of the training process), this is the outcome of training with polynomial decay:

```
Pruned CNN - Test loss: 0.02177981694244372 / Test accuracy: 0.9926999807357788
Size of gzipped pruned Keras model: 384305.00 bytes
```

Recall that the size of the baseline model trained in that other post was much larger:

```
Size of gzipped baseline Keras model: 1601609.00 bytes
```

In short, pruning seems to work both ways in terms of reducing model size. `PolynomialDecay` based sparsity seems to work slightly better (slightly higher accuracy and especially a roughly 30% lower loss value). It also produced a smaller model in size. Now, while this is an N = 1 experiment, which cannot definitively answer whether it is better than `ConstantSparsity`, the intuitions are still standing. We challenge others to perform additional experiments in order to find out.

* * *

## Summary

In this article, we studied pruning in TensorFlow in more detail. Before, we covered quantization and pruning for model optimization, but for the latter there are multiple ways of doing so in TensorFlow. This blog post looked at those methods and their differences.

Before being able to compare the pruning schedules, we provided a brief recap of how supervised machine learning models are trained, and how they can be pruned. By discussing the forward pass, computation of the loss value and subsequently backward computation of the error and optimization, we saw how models are trained. We also saw what pruning does to weights, and how the sparsity this brings benefits model storage, model loading and model inference, especially on hardware at the edge.

Then, we looked at the pruning schedules available in TensorFlow: `ConstantSparsity` and `PolynomialDecay`. Both inheriting from the `PruningSchedule` class, they provide functionalities that determine whether a particular layer must be pruned during a particular step, and to what sparsity.
Generally, the constant sparsity applies a constant sparsity when it prunes a layer, while the polynomial decay pruning schedule induces a sparsity level based on a polynomial function, from a particular sparsity level to another. + +Finally, we provided an example using Keras, TensorFlow's way of creating machine learning models. In comparing the outcomes, we saw that `PolynomialDecay` based sparsity / pruning works slightly better than `ConstantSparsity`, which was expected intuitively. + +I hope you've learnt a lot by reading this post! I did, when researching :) Please feel free to leave a comment in the comments section below if you have any questions, remarks or other suggestions for improvement 💬 Thank you for reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +_Module: Tfmot.sparsity_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/sparsity](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity) + +_Tfmot.sparsity.keras.ConstantSparsity_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/sparsity/keras/ConstantSparsity](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras/ConstantSparsity) + +_Tfmot.sparsity.keras.PolynomialDecay_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/sparsity/keras/PolynomialDecay](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras/PolynomialDecay) + +_Tfmot.sparsity.keras.PruningSchedule_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/sparsity/keras/PruningSchedule](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras/PruningSchedule) + +_TensorFlow/model-optimization_. (2020, 30). GitHub. [https://github.com/tensorflow/model-optimization/blob/0f6dd5aeb818c5f61123fc1d5642435ea0f5cd70/tensorflow\_model\_optimization/python/core/sparsity/keras/pruning\_callbacks.py#L46](https://github.com/tensorflow/model-optimization/blob/0f6dd5aeb818c5f61123fc1d5642435ea0f5cd70/tensorflow_model_optimization/python/core/sparsity/keras/pruning_callbacks.py#L46) + +_TensorFlow/model-optimization_. (2020, January 10). GitHub. [https://github.com/tensorflow/model-optimization/blob/0f6dd5aeb818c5f61123fc1d5642435ea0f5cd70/tensorflow\_model\_optimization/python/core/sparsity/keras/pruning\_schedule.py#L41](https://github.com/tensorflow/model-optimization/blob/0f6dd5aeb818c5f61123fc1d5642435ea0f5cd70/tensorflow_model_optimization/python/core/sparsity/keras/pruning_schedule.py#L41) + +_Tfmot.sparsity.keras.UpdatePruningStep_. (n.d.). TensorFlow. [https://www.tensorflow.org/model\_optimization/api\_docs/python/tfmot/sparsity/keras/UpdatePruningStep](https://www.tensorflow.org/model_optimization/api_docs/python/tfmot/sparsity/keras/UpdatePruningStep) + +Keras Team. (n.d.). _Keras documentation: Callbacks API_. Keras: the Python deep learning API. [https://keras.io/api/callbacks/](https://keras.io/api/callbacks/) + +_TensorFlow/model-optimization_. (n.d.). GitHub. 
[https://github.com/tensorflow/model-optimization/blob/0f6dd5aeb818c5f61123fc1d5642435ea0f5cd70/tensorflow\_model\_optimization/python/core/sparsity/keras/pruning\_schedule.py#L137-L180](https://github.com/tensorflow/model-optimization/blob/0f6dd5aeb818c5f61123fc1d5642435ea0f5cd70/tensorflow_model_optimization/python/core/sparsity/keras/pruning_schedule.py#L137-L180) + +_TensorFlow/model-optimization_. (n.d.). GitHub. [https://github.com/tensorflow/model-optimization/blob/0f6dd5aeb818c5f61123fc1d5642435ea0f5cd70/tensorflow\_model\_optimization/python/core/sparsity/keras/pruning\_schedule.py#L183-L262](https://github.com/tensorflow/model-optimization/blob/0f6dd5aeb818c5f61123fc1d5642435ea0f5cd70/tensorflow_model_optimization/python/core/sparsity/keras/pruning_schedule.py#L183-L262) + +Zhu, M., & Gupta, S. (2017). [To prune, or not to prune: exploring the efficacy of pruning for model compression](https://arxiv.org/abs/1710.01878). _arXiv preprint arXiv:1710.01878_. diff --git a/testing-pytorch-and-lightning-models.md b/testing-pytorch-and-lightning-models.md new file mode 100644 index 0000000..1f7a98d --- /dev/null +++ b/testing-pytorch-and-lightning-models.md @@ -0,0 +1,543 @@ +--- +title: "Testing PyTorch and Lightning models" +date: "2021-01-27" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "machine-learning" + - "model-evaluation" + - "neural-network" + - "neural-networks" + - "pytorch" + - "pytorch-lightning" + - "testing-data" + - "train-test-split" +--- + +Model evaluation is key in validating whether your machine learning or deep learning model really works. This procedure, where you test whether your model really works against data it has never seen before - on data _with_ and _without_ the distribution of your training data - ensures that your model is useful in practice. Because hey, what would be the benefits of using a model if it doesn't work? + +Deep learning frameworks use different approaches for evaluating your models. This tutorial zooms into the PyTorch world, and covers evaluating your model with either PyTorch or PyTorch Lightning. After reading the tutorial, you will... + +- Understand why it is good practice to evaluate your model after training. +- Have built an evaluation approach for your PyTorch model. +- Have also built such an approach for your PyTorch Lightning model. + +* * * + +\[toc\] + +* * * + +## Summary and code examples: evaluating your PyTorch or Lightning model + +Training a neural network involves feeding forward data, comparing the predictions with the ground truth, generating a loss value, computing gradients in the backwards pass and subsequent optimization. This cyclical process is repeated until you manually stop the training process or when it is configured to stop automatically. You train your model with a training dataset. + +However, if you want to use your model in the real world, you must evaluate - or test - it with data that wasn't seen during the training process. The reason for this is that if you would evaluate your model with your training data, it would equal a student who is grading their own exams, and you don't want that. That's why today, we'll show you how to evaluate your PyTorch and PyTorch Lightning models. Below, there are two full-fledged examples for doing so. If you want to understand things in more detail, make sure to read the rest of this tutorial as well :) + +### Classic PyTorch + +Testing your PyTorch model requires you to, well, create a PyTorch model first. 
This involves defining a `nn.Module` based model and adding a custom training loop. Once this process has finished, testing happens, which is performed using a custom testing loop. Here's a **full example of model evaluation in PyTorch**. If you want to understand things in more detail, or want to build this approach step-by-step, make sure to read the rest of this tutorial as well! :) + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. + ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare MNIST dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=True) + dataset_test = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=False) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + trainloader_test = torch.utils.data.DataLoader(dataset_test, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop for 15 epochs + for epoch in range(0, 15): + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') + + # Print about testing + print('Starting testing') + + # Saving the model + save_path = './mlp.pth' + torch.save(mlp.state_dict(), save_path) + + # Testing loop + correct, total = 0, 0 + with torch.no_grad(): + + # Iterate over the test data and generate predictions + for i, data in enumerate(trainloader_test, 0): + + # Get inputs + inputs, targets = data + + # Generate outputs + outputs = mlp(inputs) + + # Set total and correct + _, predicted = torch.max(outputs.data, 1) + total += targets.size(0) + correct += (predicted == targets).sum().item() + + # Print accuracy + print('Accuracy: %d %%' % (100 * correct / total)) +``` + +### PyTorch Lightning + +Another way of using PyTorch is with Lightning, a lightweight library on top of PyTorch that helps you organize your code. In Lightning, you must specify testing a little bit differently... with `.test()`, to be precise. Like the training loop, it removes the need to define your own custom testing loop with a lot of boilerplate code. In the `test_step` within the model, you can specify precisely what ought to happen when performing model evaluation. 
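
Zooming in on just that part first: a minimal `test_step` could look like the sketch below. It is a simplified, loss-only version - it assumes the module defines `self.layers` and `self.ce` as in the full example that follows, which also computes accuracy.

```
# Simplified sketch of a test_step, assuming self.layers and self.ce
# are defined on the LightningModule (as in the full example below).
def test_step(self, batch, batch_idx):
    x, y = batch
    x = x.view(x.size(0), -1)   # flatten the MNIST images
    y_hat = self.layers(x)      # forward pass
    loss = self.ce(y_hat, y)    # compute test loss
    return {'test_loss': loss}
```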
+ +Here, you'll find a **full example for model evaluation with PyTorch Lightning**. If you want to understand Lightning in more detail, make sure to read on as well! + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +import pytorch_lightning as pl + +class MLP(pl.LightningModule): + + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + self.ce = nn.CrossEntropyLoss() + + def forward(self, x): + return self.layers(x) + + def training_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.ce(y_hat, y) + self.log('train_loss', loss) + return loss + + def test_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.ce(y_hat, y) + y_hat = torch.argmax(y_hat, dim=1) + accuracy = torch.sum(y == y_hat).item() / (len(y) * 1.0) + output = dict({ + 'test_loss': loss, + 'test_acc': torch.tensor(accuracy), + }) + return output + + def configure_optimizers(self): + optimizer = torch.optim.Adam(self.parameters(), lr=1e-4) + return optimizer + + +if __name__ == '__main__': + + # Load the datasets + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=True) + dataset_test = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=False) + + # Set seed + pl.seed_everything(42) + + # Initialize model and Trainer + mlp = MLP() + trainer = pl.Trainer(auto_scale_batch_size='power', gpus=1, deterministic=True, max_epochs=15) + + # Perform training + trainer.fit(mlp, DataLoader(dataset, num_workers=15, pin_memory=True)) + + # Perform evaluation + trainer.test(mlp, DataLoader(dataset_test, num_workers=15, pin_memory=True)) +``` + +* * * + +## Why evaluate your model after training? + +At a high level, training a deep neural network involves two main steps: the first is the forward pass, and the second is the backwards pass and subsequent optimization. + +When you start training a model, you'll initialize the weights and biases of the neurons pseudorandomly. During the first iteration, which is also called an epoch, all the data from your training set is fed through the model, generating predictions. This is called the forward pass. The predictions from this forward pass are compared with the actual targets for these training samples, which are called ground truth. The offset between the predictions and the targets is known as a loss value. At the beginning of a training process, loss values are relatively high. + +Once the loss value is known, we perform the backwards pass. Here, we compute the contribution of the individual neurons to the error. Having computed this contribution, which is also known as a gradient, we can perform optimization with an optimizer such as gradient descent or Adam. Optimization slightly changes the weights into the opposite direction of the gradients, and it likely makes the model better. We then start a new iteration, or epoch, and the process starts again. + +![](images/High-level-training-process-1024x973.jpg) + +Once you finish training the model, you want to use it in the real world. But can it easily be applied there? Who guarantees that it actually works, and that it didn't capture some spurious patterns present in the training set? 
Relevant questions which must be answered by means of **model evaluation**. + +From this high-level process description, it does however become clear that the data from the training set is used in optimization, i.e. for making the model better. This is true for the _actual_ training data as well as the validation data, which come from the same dataset but which are used for slightly different purposes. This is problematic if we want to evaluate the model, because we cannot simply rely on this data for evaluation purposes. If we would do that, it would equal a student grading their own exams. In other words, we need different data for this purpose. + +Testing data comes at the rescue here. By generating a train/test split before training the model, setting apart a small portion of the training data, we can evaluate our model with data that was never seen during the training process. In other words, the student is no longer grading their own homework. This ensures that we create models that are more likely to work in the real world if evaluation passes. And precisely that is what we are now going to do. We'll show you how to evaluate your models created with PyTorch or PyTorch Lightning. + +* * * + +## Evaluating your PyTorch model + +Let's now take a look at how we can evaluate a model that was created with PyTorch. + +### The model we will evaluate + +This is the model that we want to evaluate. If you want to understand how it works, make sure to [read this tutorial](https://www.machinecurve.com/index.php/2021/01/26/creating-a-multilayer-perceptron-with-pytorch-and-lightning/#classic-pytorch_1) too. + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare MNIST dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=True) + dataset_test = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=False) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + trainloader_test = torch.utils.data.DataLoader(dataset_test, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for 15 epochs + for epoch in range(0, 15): + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +### Adding evaluation code + +As you can see in the code above, PyTorch requires you to define many aspects of the training process yourself. For example, we have defined the entire training loop above. The same is true for model evaluation. In classic PyTorch, we also have to define our own testing loop. + +We can define the testing loop so in the following way. + +1. We print that testing starts and save the model, so that we can use it layer (and test it separately, if we wanted to do that). +2. We define the testing loop: + 1. We first set `torch.no_grad()` to ensure no gradients are updated, and set `correct` and `total` (the number of correct and total number of values processed during testing) to zero. + 2. We then iterate over the test data generator. + 3. During every minibatch iteration, we decompose the data into inputs and targets, generate the outputs, compare the predictions with the ground truth values, and update the `total` and `correct` variables. Here, `torch.max(outputs.data, 1)` looks complex, but it is simple - it simply takes a look at the _indices_ of the classes that have the highest maximum value. Now that's a smart approach, because these are the indices of our classes too! In one line of code, we can make our predictions comparable with the targets. +3. Finally, we print the accuracy. 
+ +``` + + # Print about testing + print('Starting testing') + + # Saving the model + save_path = './mlp.pth' + torch.save(mlp.state_dict(), save_path) + + # Testing loop + correct, total = 0, 0 + with torch.no_grad(): + + # Iterate over the test data and generate predictions + for i, data in enumerate(trainloader_test, 0): + + # Get inputs + inputs, targets = data + + # Generate outputs + outputs = mlp(inputs) + + # Set total and correct + _, predicted = torch.max(outputs.data, 1) + total += targets.size(0) + correct += (predicted == targets).sum().item() + + # Print accuracy + print('Accuracy: %d %%' % (100 * correct / total)) +``` + +### Results + +After running the model for 15 epochs, we get an accuracy of 96% on the MNIST dataset: + +``` +... +Starting epoch 15 +Loss after mini-batch 500: 0.080 +Loss after mini-batch 1000: 0.083 +Loss after mini-batch 1500: 0.079 +Loss after mini-batch 2000: 0.090 +Loss after mini-batch 2500: 0.075 +Loss after mini-batch 3000: 0.089 +Loss after mini-batch 3500: 0.081 +Loss after mini-batch 4000: 0.069 +Loss after mini-batch 4500: 0.086 +Loss after mini-batch 5000: 0.085 +Loss after mini-batch 5500: 0.091 +Loss after mini-batch 6000: 0.085 +Training process has finished. +Starting testing +Accuracy: 96 % +``` + +* * * + +## Evaluating your PyTorch Lightning model + +Today, many engineers who are used to PyTorch are using PyTorch Lightning, a library that runs on top of classic PyTorch and which helps you organize your code. Below, we'll also show you how to evaluate your model when created with PyTorch Lightning. + +### The model we will evaluate + +The PyTorch model that will be used for testing is similar to the one created with classic PyTorch above: + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms +import pytorch_lightning as pl + +class MLP(pl.LightningModule): + + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(28 * 28 * 1, 64), + nn.ReLU(), + nn.Linear(64, 32), + nn.ReLU(), + nn.Linear(32, 10) + ) + self.ce = nn.CrossEntropyLoss() + + def forward(self, x): + return self.layers(x) + + def training_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.ce(y_hat, y) + self.log('train_loss', loss) + return loss + + def test_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.ce(y_hat, y) + y_hat = torch.argmax(y_hat, dim=1) + accuracy = torch.sum(y == y_hat).item() / (len(y) * 1.0) + output = dict({ + 'test_loss': loss, + 'test_acc': torch.tensor(accuracy), + }) + return output + + def configure_optimizers(self): + optimizer = torch.optim.Adam(self.parameters(), lr=1e-4) + return optimizer +``` + +### Adding evaluation code + +Frankly, most of the evaluation code was already added in the code example above. More precisely, in the `test_step`, we perform a forward pass for each minibatch (`batch`), compute test loss and accuracy, and return everything as a dictionary. + +What remains now is to add the runtime code, which loads the datasets (both training and testing data), sets the seed of the random number generator, initializes the model and the Trainer object, and performs training and evaluation. 

```
if __name__ == '__main__':

  # Load the datasets
  dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=True)
  dataset_test = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor(), train=False)

  # Set seed
  pl.seed_everything(42)

  # Initialize model and Trainer
  mlp = MLP()
  trainer = pl.Trainer(auto_scale_batch_size='power', gpus=1, deterministic=True, max_epochs=15)

  # Perform training
  trainer.fit(mlp, DataLoader(dataset, num_workers=15, pin_memory=True))

  # Perform evaluation
  trainer.test(mlp, DataLoader(dataset_test, num_workers=15, pin_memory=True))
```

### Results

Running the model and evaluation gives a 96% accuracy again!

```
Epoch 0: 100%|██████████| 60000/60000 [09:23<00:00, 106.57it/s, loss=0.0544, v_num=14]
Testing: 100%|██████████| 10000/10000 [00:46<00:00, 214.89it/s]
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_acc': tensor(.96), 'test_loss': tensor(0.0024, device='cuda:0')}
--------------------------------------------------------------------------------
```

* * *

## Recap

In this tutorial, we looked at evaluating your PyTorch and PyTorch Lightning models after they have been trained. This is important, because a model that passes the training process successfully can still turn out to be useless in the real world. In fact, testing your model is a crucial step that must not be skipped.

Above, you saw step-by-step examples for performing model evaluation with your PyTorch and PyTorch Lightning models. For the first library, you saw that adding a custom testing loop allowed you to perform evaluation - manually iterating over the testing data and computing accuracy. For Lightning, testing is much more automated - the only thing you had to do is specify a `test_step()` and call `trainer.test()`.

I hope that you have learned something from today's tutorial! If you did, please feel free to drop a message below 💬 I'd love to hear from you. Please do the same if you have any questions or remarks.

Thank you for reading MachineCurve today and happy engineering! 😎

* * *

## References

PyTorch Lightning. (2021, January 12). [https://www.pytorchlightning.ai/](https://www.pytorchlightning.ai/)

PyTorch. (n.d.). [https://pytorch.org](https://pytorch.org/)

PyTorch. (n.d.). _ReLU — PyTorch 1.7.0 documentation_.
[https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU) diff --git a/the-differences-between-artificial-intelligence-machine-learning-more.md b/the-differences-between-artificial-intelligence-machine-learning-more.md new file mode 100644 index 0000000..9fd85a8 --- /dev/null +++ b/the-differences-between-artificial-intelligence-machine-learning-more.md @@ -0,0 +1,592 @@ +--- +title: "The differences between AI, machine learning & more" +date: "2017-09-30" +categories: + - "applied-ai" +--- + +We're being flooded with data related buzzwords these days :) + +Artificial intelligence. Big data. Business analytics. Machine learning. Deep learning. And we should not forget the mother of all buzzwords, _data science_. :D We use every term to describe a field that processes or analyses data to achieve something. + +But what is something? + +And if every field works with data, are they different at all? + +If they are, do they overlap as well? + +In this article, I'll describe the differences between these fields from my perspective. As you may [read](https://www.machinecurve.com/index.php/about/), I have a background in business & IT and have started learning machine learning on my own. This was one of the first articles I wrote - and it helped me a lot in getting a proper understanding of the fields. Below, I'll first argue why I think that it's important to draw clear boundaries between these fieldsthis. I'll then write about each field individually. + +**Update February 2020** \- Added links to other MachineCurve blog posts. + +## **Table of contents** + +\[toc depth="2" title=""\] + +## **How I started with machine learning** + +When I [started to study machine learning](https://www.machinecurve.com/index.php/dissecting-deep-learning/), I first looked at various algorithms that are out there. But rather quickly, I was lost in a random forest (pun intended) full of algorithms. I realised that I needed to create a high-level overview first before I could dive into this extremely interesting world. + +So I focused on making an overview of machine learning algorithms. + +## **The need for a high-level overview** + +Soon enough, I was lost again. If I'm interested in _machine learning_, I thought, there should be some line that clearly distinguishes between machine learning and _everything else_. + +But what does this line look like, if it exists at all? + +That's when I knew I first had to focus on the differences between widely used terms that are related to what I'm doing. I found these terms online and in my academic literature: + +- Data science +- Artificial intelligence +- Machine learning +- Deep learning +- Big data and business analytics + +So I started Googling. Turned out that the machine learning field itself doesn't even properly know its boundaries. For some reason, people cannot properly give a definition of - for example - the differences between machine learning and artificial intelligence. However, for some reason, they know that machine learning is part of AI while the opposite is not true. :D I'll get back to that later. + +## **Clear boundaries = clear thinking** + +I think it's important to draw sharp lines between the fields, especially because they are buzzwords. Like _[data](https://www.quora.com/Is-Data-Science-a-meme-term-or-a-buzzword)_ [science](https://www.quora.com/Is-Data-Science-a-meme-term-or-a-buzzword): can you exactly describe what it is? 
The thing with buzzwords is that everybody talks about them, but nobody knows _what_ they are and _how_ to properly use them. Only a few minds have the power to see through the hype. These types often: + +- Know exactly what a word is about, but most importantly what it is not about; +- Know how they can use technology; +- And, most importantly, know how to successfully apply it in business. + +These three describe one of my most important beliefs about technology: + +_It's simple to make it run, it's difficult to make it fly._ + +This belief provides me with a clear direction: I first have to be able to draw these boundaries before I can even start thinking about algorithms at all. If I can't do that, I'll most likely end up contributing to the confusion. + +A long story short: we need to figure out these boundaries. + +But I also need to provide you with a clear overview of the fields. That's why I'll also have to take a look at the overlap between them - because what I did find out is that they are not mutually exclusive. Based on my Google research, I got a clearer idea about the fields, their developments, boundaries and limitations. And that's how we end up with the next image. It reflects my views on these boundaries and overlap. Let's take a look: + +![Differences between business analytics, BI, big data, data science, machine learning, deep learning, AI and statistical methods.](images/differences-1.jpg) + +I'll now cover the fields in more detail. I'll try to explain how they are different and how they overlap. If you need very specific information, I advise you to use the table of contents at the top of this post. + + + + + +## **Data science** + +Let's start with data science since it's the most used buzzword in business these days. Like me at the beginning, you may have your ideas about what a data scientist is. About what they do. And what they don't, or can't. In my view, data scientists are curious, pragmatic and persevering jack-of-all-trades...with in addition a rather exceptional entrepreneurial, mathematical and technological skillset. + +Let me explain that in more detail. + +### **It started in science** + +A short timeline of how data science became what it is today: + +#### Mid-to-late 20th century + +- 1960: Peter Naur, a prominent Danish computer scientist, used the term as a replacement for _computer science_. +- 1974: The same Peter Naur moved the term to the field of data processing methods, probably realising that computer science is about more than data alone. +- 1996: For the first time in history, an academic conference is named after data science: _Data science, classification, and related methods._ +- 1997: A professor asks whether "Statistics = Data Science?" in his inaugural lecture. This can be considered the start of data science outside of the computer science world. + +#### 21st century + +- 2001: A statistician introduces data science as an independent academic discipline. Data science, according to the statistician, is about more than the maths alone, given the advances of modern computing. +- 2002: Academics started the Data Science Journal, followed by more journals in the period from 2003 to 2005. +- 2012: Harvard Business Review claims that data scientists have [the sexiest job of the 21st century](https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century). I consider this article to be the bridge from academia to business (although I do not agree with their observation that they themselves coined the term _data science_). 
Everything before the HBR article was focused on statistics and academia. Everything afterwards was focused on _applying_ this knowledge to solve business problems. One could say that since then, demand for a data scientist has skyrocketed. + +### **Data scientists care about the end, not the means** + +Data scientists aren't engineers. OK, I get your caution, and I do agree - they use engineering disciplines in their work. But that's not everything! Data scientists don't care about how technology works. They care about the phenomena they investigate. If their work can benefit business and, consequently, the customer, they are happy. + +At the same time, they are also entrepreneurial. Data scientists have the ability to identify new business opportunities in data. But they must sometimes be patient. If a data scientist's approach towards their end goal does not provide the expected results, they don't give up. They find a second approach so they can still achieve what they promise. + +And they do so in creative ways. The [Harvard Business Review article](https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century) describes a data scientist who used the resemblance between a DNA sequencing problem and a fraud scheme to apply methods used in DNA sequencing. The scientist successfully tackled the problem for business. + +### **Data science is a process** + +Their entrepreneurial nature means that data scientists must have a broad general knowledge. Nevertheless, as I wrote, they have a toolbox filled with very valuable technologies and statistical knowledge. This means that they must also be specialised in certain areas in order to succeed: + +#### Statistics + +And then mainly in applying statistical methods in order to solve a problem in a data-driven way. A data scientist does often not care about the exact mathematical foundations of the formulas. Although I'll dive into some mathematical foundations on this website, I myself don't care either about the exact mathematical proof. + +#### Technology + +For structuring large quantities of data in order to prepare them for analysis, a data scientist must know about data cleansing and structuring methods. Often, these require knowledge about technologies like Hadoop and algorithms like MapReduce - as well as knowledge about various machine learning algorithms. Therefore, a data scientist must have specialised knowledge about technology - data processing technology in particular. In fact, one of the interviewees in the HBR article - a recruiter - does not care about whether data scientists have statistical knowledge. Their ability to write code is much more important. + +#### Business + +Data scientists must be able to feel what's happening in an organisation. You cannot figure out a solution to a problem if you do not see or cannot understand it. Therefore, new data scientists do not necessarily need to have a background in statistics or technology. They may even come from business - or other rarities like ecology :D + +#### Communication + +Some of you may feel offended, but computer scientists and mathematicians are not the greatest of communicators every now and then. Sometimes, their explanations become very complex - so complex that a business manager does no longer understand how he may benefit from the solution being discussed. However, the manager definitely needs that solution because it solves his problem. Therefore, data scientists need to be very communicative. 
I've read about one of the success factors for hiring a data scientist -- and one recruiter said that he always checked whether an applicant could explain a problem and his solution in layman's terms. + +Together, they form the process of starting with a problem --> processing and analysing data --> finding a solution. Therefore, rather than an umbrella term for interdisciplinary elements, I believe that _data science is a process_! + +### It includes business analytics + +Beller & Barnett's definition of business analytics is as follows ([thanks to Wikipedia](https://en.wikipedia.org/wiki/Business_analytics)): + +> Business analytics (BA) refers to the skills, technologies, practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning. + +And as discussed, in my view data science is the process of solving business problems in a data-driven way. Data scientists have explicit knowledge about how a business works, but also the technological and mathematical skills in order to perform the data processing & analysis. + +Consequently, I believe that business analytics - with its _skills, technologies and practices for continuous iterative exploration and investigation (...) to drive business planning_ - is a part of data science. Data science itself is a bit broader: + +- It does not merely look at past business data, but attempts to include everything if it's interesting for problem solving - data science may therefore not only be explanatory, but also predictive in nature. +- The goal of business analytics is gaining business insights and driving business planning. While gaining insights is in line with data science's goal of understanding phenomena, I think that data science does not need to be about _driving business planning_ alone (but in practice, obviously, this happens most of the time!) +- Data scientists are creative in the sense that they may use various approaches to solving a problem (remember the DNA sequencing method?) - business analysts use a more tech&business oriented skillset/toolbox. + +### **Now what is a data scientist?** + +As a result, you may end up conceptualising a data scientist. The data scientist is the proud owner of a toolbox that is filled with various technologies and statistical methods. He has the skills to use them in order to prepare data for analysis, perform the analysis, and communicate the analysis - often to business. It is not uncommon to see that data scientists often make themselves irreplaceable :) + +Yet, a formal definition is to come. But given the fact that we're dealing with a hype here, I'm not sure whether it will come soon... + +But who cares! We can now see through the hype. ;) + + + + + +## **Artificial intelligence** + +Let's now move to artificial intelligence. As you can see in the fields overview above, in my view artificial intelligence is oriented around technology and mathematics. It is truly an engineering discipline and has little to do with business. Funnily enough, it is strongly grounded in philosophical developments too - as well as on dreams about the future. Below, I'll describe some of my observations. + +### **It's an engineering umbrella term** + +In my opinion, the term _artificial intelligence_ is very vague. The way I see it - AI is an umbrella term for various disciplines related to mathematics and technology. Examples of these may be machine learning, deep learning, rule-based learning, and more. 
+ +There are multiple activities within AI: + +- _Research activities_ focus on the validation of existing methods and techniques and the development of new ones. It's mostly academics who focus on these activities, but sometimes organisations also have their own R&D for artificial intelligence. +- _Philosophical activities_ discuss developments within artificial intelligence. These discussions range from the application of AI in the future world, AI ethics, dangers and opportunities. Thinking about new, currently non-existent forms of AI is also a part of AI related philosophy. +- _Tech activities_ focus on the actual technological part: engineers making intelligence happen. Pure maths does not make the world better - it only does so if it is applied (I'm sorry, mathematicians :) ). Engineers are often responsible for these tech activities - developing new algorithms based on state-of-the-art research, fitting algorithms into an existing AI infrastructure, and showing that it works. + +You may miss _business activities_ in this list. I personally believe that bridging the gap between technology and business is not a part of AI. Delivering value by applying AI is the job of a data scientist. + +### **The goal of AI: making machines intelligent** + +At least, you could say that is the highest level goal. But what is intelligence? + +#### Types of intelligence + +Within AI, people often come up with two types of intelligence: strong (or general) AI and weak (or applied) AI. Artificial general intelligence works towards a machine that is intelligent enough to perform any task a human being can do too. + +A weak AI system focuses on one particular task at hand - for example playing chess or classifying an email. + +Today however, no system is known that meets this definition for general AI, and every work so far is forcibly known as weak AI. + +However, some AI scientists and engineers work towards general AI by creating bigger and better weak AI, combining various systems, methods and technologies in smart ways. + +But some also work on improving weak AI. + +Therefore, you cannot get a _single goal for AI_ when you ask different people in the field. Actually, you'll see two categories - which are strongly related to general and weak AI. You'll mostly hear that the goal of AI is either one of two: + +- Working towards technological singularity +- Improving weak AI + +#### Working towards technological singularity + +Ok, you may have never heard of that term before :) However, I believe that the goal of those working on general AI is, maybe indirectly, _working towards technological singularity._ + +If we manage to make a general AI one day, the machine has the capability to perform tasks like a human being. But we may go one step further: we may make a machine that is more intelligent than a human being, i.e. a machine that is _superintelligent_. This machine may then - by accident or by the grace of intelligence - realise that it can make a better version of itself by either developing a new, more intelligent machine - or rewriting its own code. + +This more intelligent machine, once it runs, could then also realise that it can make a better version of itself - in less time than its 'parent'. We then enter a loop in which exponential technological growth is triggered. 
Even better superintelligent machines would make human intelligence very dumb, to say the least :) + +One of the founding fathers of the singularity concept is the famous mathematician John von Neumann: + +> "The accelerating progress of technology and changes in the mode of human life, give the appearance of approaching some essential singularity in the history of the race beyond which human affairs, as we know them, could not continue" - _in Ulam, Stanislaw (May 1958). ["Tribute to John von Neumann"](https://docs.google.com/file/d/0B-5-JeCa2Z7hbWcxTGsyU09HSTg/edit?pli=1). 64, #3, part 2. Bulletin of the American Mathematical Society: 5._ + +Another pioneer is Ray Kurzweil, an American inventor and computer scientist. + +Both argue that today, human beings cannot imagine the so-called post-singularity world. I tend to agree. + +#### Is singularity science fiction? + +I can see you thinking: this is pure science fiction. And yes, today that's the case. :) I'm convinced that we'll first have to work towards general AI before we may start thinking about technological singularity. Nevertheless, you see striking technological advances every now and then. Facebook, for example, had to shut down [a language generating AI](https://www.forbes.com/sites/tonybradley/2017/07/31/facebook-ai-creates-its-own-language-in-creepy-preview-of-our-potential-future/) because engineers simply no longer understood how to interpret the language. + +However, while people working on _general AI_ may work towards singularity, there is another group that focuses on weak AI. These are the folks that work on: + +#### Improving the quality of weak AI + +Like I said earlier, weak AI focuses on one task at hand. + +And although there is no general AI today, the folks working on general AI have a different goal in mind than those working on weak AI. + +Because yes, there are groups of researchers and engineers who purely want to improve the machine's intelligence with regards to one task. Examples are: + +- Emulating the human brain - through neural networks. The brain, however, has various different tasks. The Natural Language Processing field for example works on methods and techniques for interpreting and processing language. These folks have no interest in self-driving cars. +- Liberating human beings from their duties. A robot cleaning device may learn how to move based on its environment. It is an intelligent machine, but its only task is cleaning a room. + +#### The AI effect + +A funny but frustrating effect within AI is known as the AI effect or AI paradox. Whenever some AI program achieves something, this something is seen as no longer intelligent - but as "just a computation". It means that the definition of machine intelligence changes all the time, but also that AI researchers are only working on what "failed" so far. + +Tough life for them! ;) + + + + + +## **Machine learning** + +Besides data science and artificial intelligence, machine learning is also a widely used buzzword. But what is machine learning? How is it different from both data science and AI? Below, I'll discuss my observations. + +### **What is machine learning?** + +If we take a look at [Wikipedia](https://en.wikipedia.org/wiki/Machine_learning), we'll see a very clear definition of machine learning: + +> Machine learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. + +This gives a very clear indication about what machine learning is: + +1. A field within computer science; +2. 
That makes computers learn based on data; +3. But computers learn by themselves, so there is no explicit programming! + +And also what it is not: + +1. An _intelligent_ system that works with human-written rules (which would maybe classify as AI, but not as machine learning) + +Learning can happen in multiple ways. Before we can proceed with a more detailed view on machine learning, we first have to make a very important distinction between supervised and unsupervised learning. + +#### Supervised learning + +Suppose that we have a very simple data set (like a spreadsheet) with these elements: _name, length, sex._ We want to know whether we can predict somebody's _sex_ based on their _name_ and their _length_. This is a **supervised learning** problem. In supervised learning, you deliver data samples that both contain the predictor(s) - _name and length -_ and the target value(s) - _sex_. You then let a machine learning algorithm learn which names and lengths often correspond with a certain sex. When you input a new combination of name and length into the system, you get a prediction for this person's sex. + +The above supervised learning problem is known as a **classification** problem. In a classification problem, you attempt to predict a _discrete_ value. This may be new to you, I understand - but a discrete function can take only certain values (like _David_, or _Peter_, or _Grace_, or _Sheila_). + +Another type of supervised learning problem is a **regression** problem. Here, you attempt to predict a _continuous_ value. That's a value which may take any value within a range. An example regression situation would be to predict the _length_ (e.g. 1.24 metres) based on one's _sex_ and _age_. + +In both cases, you first train the algorithm with your initial data set - also known as your **training data**. The _black box_ then learns to predict a value (either discrete or continuous) based on your data set. If it finishes training, you may input new data (e.g. _Frank, 1.81 metres)_. It then predicts whether Frank is male or female. We all know the answer ;-) + +\[caption id="attachment\_84" align="aligncenter" width="932"\]![Supervised learning](images/ML-supervised.png) Supervised learning\[/caption\] + +#### Unsupervised learning + +In an **unsupervised learning** problem, you do not provide the target values. The machine learning algorithm has to find out on its own which pattern exists within the data. + +One type of unsupervised learning is **clustering**. Suppose that you have a geographical data set with lightning strikes. You want to find out which lightning strikes belong to the same lightning storm. Then, a clustering algorithm may help you - it can detect which strikes belong _to the same cluster_ and hence, with a certain level of certainty, to the same lightning storm. + +Another type of unsupervised learning is **dimensionality reduction**. Here, you try to reduce the complexity of your initial data. Suppose that you have multiple fields (in machine learning also known as features) which contain the same information, like _age_ as well as _date of birth_. In dimensionality reduction, you remove the _age_ feature because you can compute the person's current age by subtracting their _date of birth_ from today's date. Dimensionality reduction reduces the complexity of the data, making it easier for a machine learning algorithm to handle the data. This, in return, may save you a lot of time! 
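+
+To make the clustering example above a little more concrete, here is a minimal sketch. It assumes that scikit-learn is installed and uses made-up lightning strike coordinates - it illustrates the idea, and is not a production pipeline:
+
+```
+from sklearn.cluster import KMeans
+import numpy as np
+
+# Made-up (longitude, latitude) pairs for lightning strikes - two rough groups
+strikes = np.array([
+    [4.89, 52.37], [4.91, 52.35], [4.88, 52.36],   # around storm A
+    [5.69, 50.85], [5.72, 50.87], [5.70, 50.86],   # around storm B
+])
+
+# We only provide the data - no target values - and ask for two clusters
+kmeans = KMeans(n_clusters=2, random_state=42).fit(strikes)
+print(kmeans.labels_)  # e.g. [0 0 0 1 1 1]: strikes grouped per (likely) storm
+```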
+ +In both cases of unsupervised learning, you provide the black box with data - but not with target values. The _black box_ attempts to find patterns itself - and then comes with a prediction. + +\[caption id="attachment\_85" align="aligncenter" width="937"\]![Unsupervised learning](images/ML-unsupervised.png) Unsupervised learning\[/caption\] + +#### Inside the black box - an output function + +You may think that this black box is a mystery, but the machine learning engineer is less convinced about that. It is in fact to a large extent understandable what happens inside a machine learning system, due to the nature of most machine learning algorithms. + +In the images above, I introduced the concept of an **output function**. The basic notation of such a function may be the following: + +Y = f(X) + +Which means this: for a value X inserted in the output function, you get value Y. + +And this is actually the same as the examples we sketched before: + +If you would input _name_ and _length_ as X values, you would get _sex_ as a return value - the Y value. + +#### So many algorithms + +You may now think that machine learning is very easy. To some extent, that is true. It is really about creating and improving this output function in order to maximise the predictability of the function - and the prediction as a whole. + +But you may now wonder about this: why are there so many machine learning algorithms around? We have - for example - various classifiers such as linear classifiers, but also support vector machines, linear regression and logistic regression, perceptrons, neural networks, ...and the list goes on. + +Here's why: the output function can be _tweaked_ in various ways, and every machine learning algorithm does that in a slightly different way. + +We know that certain algorithms perform better under certain circumstances, such as the nature of data, the type of data, the quality of data or even the number of samples. Consequently, we know that based on our data set and problem type (supervised vs unsupervised - and then classification/regression or clustering/dimensionality reduction!), we can choose an algorithm that tweaks the output function in the best way. + +As a result, we end up with a large spectrum of algorithms. I want to treat them all on this website, eventually :) + +### **The difference between machine learning and data science / business analytics** + +In my opinion, data science encompasses more than just the machine aspect. If you recall the goals of machine learning... + +1. Making computers learn from data; +2. Without humans explicitly programming the actual learning + +...you can clearly see the difference when you compare it to the elements which I think should be the core competencies of a data scientist: + +1. A feeling for business; +2. A good communicative skillset; +3. A grounded technological understanding (in order to create infrastructures like Hadoop/MapReduce, and to code machine learning algorithms); +4. As well as a statistical/mathematical understanding of what's happening (so the system can be continuously improved). + +We may see machine learning as an integral part of a data scientist's job - and in this way they do overlap. However, in my view, data science is more than just machines alone. + +### **The difference between machine learning and AI** + +Another vague difference may the one between machine learning and artificial intelligence. 
The highest-level goal of artificial intelligence is 'making machines intelligent' (up to the level of _general AI_, you may recall). Therefore, machine learning is part of artificial intelligence - it is a way to make machines intelligent. + +But are there more ways to make machines intelligent? + +Yes, and an exemplary case may be rule-based intelligence - where the rules are programmed by a human being. Given the definition of machine learning such a system cannot be considered a machine learning system, but it may demonstrate intelligence after all. Consequently, these systems may be considered artificial intelligence - until the moment that they work well, due to this _AI effect_ we discussed above :D + +To summarise: machine learning is a part of artificial intelligence, but AI is not only about machine learning. + + + + + +## **Deep learning** + +Another buzzword that I see a lot is [_deep learning_](https://www.machinecurve.com/index.php/2018/11/23/what-is-deep-learning-exactly/). Questions that often arise are about the difference between deep learning and machine learning. Below, I'll discuss my views on these differences. + +### **It's a part of machine learning** + +According to the [Wikipedia page about deep learning](https://en.wikipedia.org/wiki/Deep_learning), this is the definition of the field: + +> Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, partially supervised or unsupervised. + +Oops. :) Let's try and find out what it is in plainer English. We can do that by breaking up the definition into some important parts: + +- It's somehow related to machine learning, so we can say that it is a part of it; +- It has something to do with _learning data representations_, which are the opposite of task-specific algorithms. +- Learning can be supervised, partially supervised or unsupervised. + +Additionally, I found a lot of information on the internet suggesting that deep learning has something to do with very complex neural networks. We'll therefore also have to take a look at these and sketch the basics in order to understand what deep learning is. + +#### Additional characteristics + +Wikipedia describes additional characteristics which at least for me make things a bit more clear: + +> 1\. use a cascade of many layers of [nonlinear processing](https://en.wikipedia.org/wiki/Nonlinear_filter "Nonlinear filter") units for [feature extraction](https://en.wikipedia.org/wiki/Feature_extraction "Feature extraction") and transformation. Each successive layer uses the output from the previous layer as input. The algorithms may be [supervised](https://en.wikipedia.org/wiki/Supervised_learning "Supervised learning") or [unsupervised](https://en.wikipedia.org/wiki/Unsupervised_learning "Unsupervised learning") and applications include pattern analysis (unsupervised) and classification (supervised). +> +> 2\. are based on the (unsupervised) learning of multiple levels of features or representations of the data. Higher level features are derived from lower level features to form a hierarchical representation. +> +> 3\. are part of the broader machine learning field of learning representations of data. +> +> 4\. learn multiple levels of representations that correspond to different levels of abstraction; the levels form a hierarchy of concepts. + +If they don't for you, hang on tight. 
I'll try to explain them in layman's terms below :-) + +#### Layers of nonlinear processing units for feature extraction & transformation + +Ok, that's a lot of difficult words :) + +We begin with the most important ones here: feature extraction and feature transformation. + +##### Feature extraction + +In the case of **feature extraction**, a machine learning engineer starts with an initial dataset and retrieves features from this data set. Suppose that we have a data set with attributes _name, age, length, sex_. Let's suppose that one of the rows in the dataset is about Peter, who is 28 years old, and is a male of 1.88 meters. _Peter, 28,_ _1.88_ and _male_ are so-called **features**. They are put together in a **feature vector**, which is like a collection of features for one instance - one row. + +\[caption id="attachment\_96" align="alignright" width="450"\]![Mapping a 3D feature vector onto a 3D feature space. MonthlyIncome = 3500, Age = 50, Length = 1.75](images/FeatureVectorFeatureSpace.png) Mapping a 3D feature vector onto a 3D feature space. MonthlyIncome = 3500, Age = 50, Length = 1.75\[/caption\] + +A feature vector is n-dimensional: n is the number of features in the vector. In our case above, we have a 4-dimensional feature vector. It is impossible to make a 4-dimensional space understandable for human beings, but we can visualise the 3D space of a 3-dimensional space, the plane of a 2-dimensional space, and the line of a 1-dimensional space. + +These directly translate into the concept of a feature space. + +A **feature space** is the space onto which you can map feature vectors. The image of the space near this text shows how you can visualise this process. I'll discuss feature vectors and feature spaces in more detail in another article, since we focus on the differences between machine learning and deep learning in this one. For now, it's enough to remember that a feature vector contains the necessary data (features) for one instance (a row, about e.g., a client in a shop). Make sure to remember that in any practical machine learning _and_ deep learning situation, the feature space is filled with dozens of feature vectors! + +##### Feature transformation + +In the case of **feature transformation**, features are used to create a different feature. It's like combining old data to generate new data, based on logic. + +Feature extraction and feature transformation are two common processes in both machine learning and deep learning. Normally, features are extracted  in a linear way, which means that it is changed with a linear function of its input. + +Y = 2X, for example, is a linear function. + +But linear functions have their limitations given the fact that they always separate the feature space into two halfs of equal size. + +Non-linear functions do not necessarily do that, and that's what makes them interesting for automating the discovery of features. Deep learning uses these non-linear functions a lot, contrary to other subsets of machine learning - which quite strongly rely on linear functions. In fact, in deep learning algorithms use multiple layers of non-linear functions, which serve like a chain. This way, they can autonomously detect new features. + +And that's why deep learning is called a part of the field of _learning representations of data_. + +\[caption id="attachment\_97" align="aligncenter" width="962"\]![](images/ComplexNeuralNetwork.png) A complex neural network. 
These and even more complex neural nets provide different layers of possibly non-linear functionality, and may thus be used in deep learning.\[/caption\] + +#### Learning representations of data field + +Normally, machine learning engineers need to prepare the data set and manually select the features with which they will work. They often don't like this part of the job, since getting actual results is much more interesting :) However, as we will see when we dive into the _big data_ term below, data is never clean on arrival. It needs to be cleaned. + +The layers of non-linear processing units allow researchers and engineers to automatically retrieve features and transform them into new ones. More importantly, their algorithms can do this autonomously - without interaction with these stakeholders. They can do that because they simply do no longer have the limitations of linearity, and the system can be learnt to recognise everything - just like a human can. + +Contrary to machine learning, in deep learning it is possible to automatically learn new data based on existing data. + +That's what makes deep learning so unique and interesting at the same time! + +#### Multiple levels of features + +This ability to learn does not work at one conceptual level. Deep learning algorithms are capable of generalising their knowledge about features into _more general concepts_. An example would be a system that detects what you're going to do when you wake up. A traditional machine learning algorithm may provide you with the suggestion to make some coffee based on behaviour shown in the past - since you always seem to drink coffee in the morning. However, a deep learning algorithm may recognise that making coffee is a _process_, and that for making coffee you do need _coffee beans_. If it somehow knows (maybe via sensors) that you no longer have coffee beans, or rather you'll run out of them after your first coffee in the morning, it suggests to buy new ones. + +Like me, you may now begin to realise that we're dealing with an entirely different type of machine intelligence here - one that leans much more towards _general AI_ than anything we have seen before. + +#### Ok, can you summarise? + +Yes, of course I can. I understand that it has become a complex story :D + +Deep learning is: + +- A totally different way of learning and generalising from data when compared to traditional machine learning; +- Nevertheless a part of machine learning; +- Making sure that learning can be done at various conceptual hierarchies, compared to traditional machine learning - hence the name _deep_ learning; +- About making learning an autonomous process, rather than a human-driven one; +- Using very complex (neural) algorithms - sometimes engineers do no longer get what's happening _inside_; +- Really promising for the future, since deep learning systems are already better than humans when performing a certain task. +- Very threatening at the same time, given the fact that human beings don't know what happens when intelligent systems become superintelligent. + +#### How is it related to neural networks? + +Like I said, I see a lot of statements on the internet which suggest that deep learning is closely related to (complex) neural networks. + +Some even say that _deep learning = neural networks_. + +And that may be correct, at least for today :) Neural networks seem to be one of the better choices today for performing deep learning activities. 
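+
+Before comparing them in words, here is a minimal Keras sketch of what 'adding depth' means in practice. The layer sizes and input shape below are arbitrary and purely illustrative - the point is only that the 'deep' variant stacks more hidden layers than the shallow one:
+
+```
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense
+
+# A shallow network: a single hidden layer
+shallow = Sequential([
+    Dense(16, activation='relu', input_shape=(4,)),
+    Dense(1, activation='sigmoid'),
+])
+
+# A deeper network: several stacked hidden layers, each using the previous layer's output as input
+deep = Sequential([
+    Dense(64, activation='relu', input_shape=(4,)),
+    Dense(64, activation='relu'),
+    Dense(32, activation='relu'),
+    Dense(16, activation='relu'),
+    Dense(1, activation='sigmoid'),
+])
+
+shallow.summary()
+deep.summary()
+```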
+
+Let's briefly compare traditional neural networks with the ones used in deep learning.
+
+The difference with 'normal' neural networks lies in both the number of hidden layers and the number of neurons. In the image above (scroll up just a bit until you see the neural network), you'll see 2 blue-ish hidden layers. It's quite a shallow network. Complex ones used in deep learning may have a lot more hidden layers.
+
+The image above, however, does not limit the number of input neurons, the yellow ones on the left. Deep neural networks may also have a large number of input neurons, and subsequently a large number of neurons in the following hidden layers.
+
+This allows for great computational power, but also a lot of complexity which may impact the results!
+
+### **It's a hype today**
+
+Like anything that leans towards _doing magical things with data_, deep learning is one of the hype words in today's business environment. I do however think that there is a promising future ahead for deep learning, if we can see through the nonsense and focus on the work at hand. What we should also not forget is to work on how business can benefit from deep learning. You may have seen that my explanation has become very technical - and that's a logical consequence of deep learning's very complex nature. But ultimately, it is not important how it works - it is important _that_ it works. For humans.
+
+
+
+
+
+## **Big data and business analytics**
+
+In my experience, when you talk to business people about processing data in order to achieve business results, they often do not come up with _data science_ - unless they are IT staff themselves. Rather, they will tell you about their attempts at creating value by using big data, or their approach regarding business analytics.
+
+But big data projects fail very often. The same applies to business analytics. But why is that? And what are the differences between the two? I'll discuss my observations below.
+
+### **Both are business-driven**
+
+One of my funnier observations is that you'll almost never hear an engineer talk about his _big data_ problems. He either speaks about _machine learning_ or something similar or describes his problems in very technical terms, like "my infrastructure was a b\*\*\*\* today".
+
+Consequently, you can imagine that both big data and business analytics are business-driven. It's mostly business that speaks about the potential that processing and analysing these large quantities of data has.
+
+However, you need a totally different person in order to speak to an engineer...to actually make it work. And given my observations about data scientists above, I think they may be suitable persons to bridge this gap.
+
+### **Big data characteristics**
+
+But what _is_ big data?
+
+Let's use Gartner's initial definition of big data, stated in their [2012 IT glossary](https://web.archive.org/web/20170718161704/https://research.gartner.com/definition-whatis-big-data):
+
+> Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation
+
+We see three important Vs in this definition: **volume, velocity and variety**. These became known as the _three Vs_, three important pillars to define big data. Today, the three V model is still widely used, albeit extended in some ways.
+
+Below, I'll discuss Gartner's three V model, which by now has been extended into the _four Vs of big data_.
I'll also discuss variability as an additional characteristic. + +#### The four Vs of big data + +Big data itself can often be characterised along four different Vs, namely: + +- Volume +- Variety +- Velocity +- Veracity + +#### Volume + +The **volume** attribute refers to the _big_ in big data. Big data can be characterised by the large volume of data that is generated and, subsequently, stored. We can make two important observations here, which are highly relevant for a big data enthusiast: + +- The volume of data increases rapidly; +- The growth rate of data volume increases as well, which means that the increase in data is going increasingly faster; + +#### Variety + +Next to volume, big data can be characterised by its **variety**. It's not about one simple data set - it's about the massive amount of data that can be stored. If we focus on a customer, we may store a lot of things about him: + +- His personal details +- Where the customer lives or works +- His relationships (family members, friends, acquaintances) +- The way he moves through a store +- The way he behaves online, both on a store's website and other websites +- ...and who knows what? + +Companies tend to store as much as they can. Consequently, a key skill of a big data engineer (who may actually be a data scientist) is to reduce the dimensionality of the data. In plainer English: being capable of precisely selecting the data he needs, putting focus on the interesting parts of the data set (and then obviously, interesting for business). Given the enormous variety within the data, this is a complex task! + +#### Velocity + +The **velocity** characteristic is about the speed with which data is processed. Large data processing infrastructures like Hadoop in combination with MapReduce process a lot of data - even more than traditional data management systems could handle. + +In fact, MapReduce is a great example to illustrate why I think that data scientists are entrepreneurial. Data scientists need to provide business value based on data. Folks at Google realised they could no longer process the increasing quantities of data with the algorithms they had in place. So rather than trying to fix a broken system, they designed a new algorithm. MapReduce, as it is called, breaks down data in smart ways in order to reduce its quantity without information loss. In fact, it is one of the most used - and cited - algorithms these days within the big data field! + +#### Veracity + +It now all sounds very great. And while big data offers great potential to businesses, data quality may rather often be low. **Veracity** is about the fluctuating quality of data - one of the main pitfalls in big data. It's then up to the big data engineer to make it better. He can do so in multiple ways: + +- By smartly combining multiple data sources in order to retrieve the information in another way; +- By applying various data cleansing algorithms in order to end up with a less complex data set; +- et cetera! + +#### Additional characteristic: variability + +Some folks related to big data use **variability** as well. It's about the consistency of data - which is partially about data quality, but which may also qualify to be a distinct characteristic. Nevertheless, imagine you're working for a logistics company that has integrated its IT systems with the systems of another company. You both agreed to store data of your clients and their routes - and share it so you could both improve your market position. 
Suppose that at first, the other company sends data about various things, e.g.: + +_{ ShipName, ContainerNo, DepartureHarbour, ArrivalHarbour, ExpectedArrivalDate }_ + +... because the company received an invoice from shipper A, who sends it in this format. + +Then they send another bunch of data with this format: + +_{ Shipping, ContainerInfo: { No, Harbour: { Arrival, Departure }, ArrivalDate, Expected }_ + +Because another shipper sent it this way. + +This shows the variability a big data engineer has to deal with. + +### The limitations of big data projects + +Big data projects often fail. There is a multitude of factors that contribute to these failures: + +- Business simply overestimates what's possible in big data analysis today; +- It becomes impossible/uneconomical to clean data, rendering analysis undoable; +- There is not enough data in order to boost model accuracies; +- There is a scope creep - big data projects need to be scoped in order to succeed. If you want to do it all in one project, you'll definitely fail; +- and more! + +### **The differences between big data and business analytics** + +I've discussed business analytics before and like I said, I think it's a part of the data science process. In order to save you time scrolling to business analytics, here is the definition by Beller & Barnett: + +> Business analytics (BA) refers to the skills, technologies, practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning. + +We can immediately see the differences between big data and business analytics if we compare both along the Vs: handling volume, velocity, variety, veracity and variability are not the characteristics or drivers of business analytics. In fact, business analytics cares about gaining insights for business planning. + +I do however think that there is some overlap. In fact, I think that a data scientist is also partially performing business analytics when handling big data (and yes, I know these are a lot of buzzwords in one phrase, but you get me). Therefore, these may all be interchangeable, rendering the discussion whether all these terms are necessary at all. + +But let's skip that discussion for now. + +### **The differences between big data and machine learning** + +Another distinction that is often made is about the **difference between** **big data and machine learning**. We know about big data that the 4 Vs are important. But what is machine learning? You could read about these two main characteristics above: + +- Making computers learn from data; +- Without humans explicitly programming the actual learning + +Once again, there is nothing about volume, velocity, variety, veracity and variability. + +But there is some overlap here as well - with data science. + +While _machine learning_ (and to a larger extent deep learning too) is the toolbox for making learning from data possible, big data is the term that provides the higher-level overview of any business data infrastructure. You can see the relationship - and then data science enters the picture again. If we remember well, my view on a data scientist is that he attempts to provide value to a business by solving a problem through processing and analysis of data - maybe even large quantities. For this, he needs a proper understanding of the business, a good communicative skillset, a proper technological understanding and a mathematical basis. You can see that for a data scientist technology meets business. 
+
+Long story short - for a data scientist machine learning may be one of his tools; big data may be one of his goals.
+
+
+
+
+
+## **A brief overview of the differences**
+
+If we could make a high-level overview of the fields you could read about above, I would come to this:
+
+- Data science is about bridging the gap between data technology and data-driven problems by bringing a good technological understanding & business-/problem-oriented mentality to the table;
+- Artificial intelligence is about making machines intelligent - the goals may vary from superintelligence to very task-specific intelligence;
+- Machine learning is about learning from data without explicit human programming, but with human-configured features;
+- Deep learning makes the feature learning process autonomous. For this, the field uses very complex algorithms;
+- Big data is about the characteristics of the data - volume, velocity, variety, veracity (and maybe variability) - and the value it may provide, but less about technology.
+- Big data may be applied within any field, like ecology, astronomy, meteorology, but also in business. Business analytics tends to focus on applying data about past events to a business problem in order to learn for the future.
+
+I hope you've enjoyed my article. If you have any comments, questions or suggestions, please feel free to leave a comment below :)
diff --git a/the-tapas-transformer-table-parsing-with-bert.md b/the-tapas-transformer-table-parsing-with-bert.md
new file mode 100644
index 0000000..cba5823
--- /dev/null
+++ b/the-tapas-transformer-table-parsing-with-bert.md
@@ -0,0 +1,17 @@
+---
+title: "The TAPAS Transformer: Table Parsing with BERT"
+date: "2021-03-05"
+categories:
+  - "buffer"
+  - "deep-learning"
+tags:
+  - "deep-learning"
+  - "machine-learning"
+  - "nlp"
+  - "table-parsing"
+  - "tapas"
+  - "transformer"
+  - "transformers"
+---
+
+TAPAS (Table Parser) is a weakly supervised Transformer-based question answering model that reasons over tables _without_ generating logical forms. Instead, it predicts a minimal program by jointly selecting a relevant subset of table cells and the most likely aggregation operator to be executed on top of these cells. This allows TAPAS to learn operations from natural language without requiring an explicit formalism.
diff --git a/this-person-does-not-exist-how-does-it-work.md b/this-person-does-not-exist-how-does-it-work.md
new file mode 100644
index 0000000..3ea14f3
--- /dev/null
+++ b/this-person-does-not-exist-how-does-it-work.md
@@ -0,0 +1,96 @@
+---
+title: "This Person Does Not Exist - how does it work?"
+date: "2019-07-17"
+categories:
+  - "applied-ai"
+  - "deep-learning"
+tags:
+  - "deep-learning"
+  - "gans"
+  - "generative-adversarial-networks"
+  - "generative-models"
+  - "images"
+---
+
+In the news recently: the website [This Person Does Not Exist](http://thispersondoesnotexist.com). This website does nothing else than show you a face. When you refresh the website, a new face is shown. And another one. And another one. And so on. It is perhaps a weird title for a website, but that's intended. In the box at the bottom right of the website we read this: _produced by a (...) network_. Huh, a real-looking person who doesn't exist? ...yes, seriously ... every person you see on this website, _does not exist_.
+
+In this tech blog, we dive into the deep to find out how this is possible. You will see that we will be covering a machine learning technique known as a GAN - a generative adversarial network, in which two networks play a game against each other.
We'll look into the relatively short history of this way of thinking about machine learning. In doing so, we take a short side step towards game theory. Finally, we will look at the specific case of _This Person Does Not Exist_ and the building blocks that together compose the machine learning aspects of the website. + +Sounds difficult? Not too difficult, if you take some time to read this blog. And don't worry, I will do my best to discuss GANs in layman's terms. Please let me know in the comments whether I succeeded in that - I can only learn from your responses :) + +[![](images/thispersondoesnotexist-1-1022x1024.jpg)](https://machinecurve.com/wp-content/uploads/2019/07/thispersondoesnotexist-1-1022x1024.jpg) + +This person was generated by a neural network. Source: [thispersondoesnotexist.com](http://thispersondoesnotexist.com) + +## Game theory and zero-sum games + +It's possible that you play a game in which only one reward can be shared over all participants. Playing chess and playing tennis are perfect examples of such a game: one person wins, which means that the other one loses. Or, in case of playing chess, it's a tie. If you would note the scores for all players in those situations and subtract them from one another, you would get the following: + +- **1-0:** Player 1 (+1 win), player 2 (-1 win) = together 0 win; +- **0-1:** Player 1 (-1 win), player 2 (+1 win) = together 0 win; +- **Tie:** Player 1 (½ win), player 2 (½ win) = together 0 win. + +\[ad\] + +In all cases, those type of games yields a _sum of zero_ with respect to the distribution of scores. It therefore won't surprise you that such a game is called a zero-sum game. It's one of the most important elements from a mathematical field known as game theory, because besides games like chess it can also be applied to more complex systems. Unfortunately, war, to give just one example, is often also a zero-sum game. + +[![](images/pexels-photo-209679-1024x663.jpeg)](https://machinecurve.com/wp-content/uploads/2019/07/pexels-photo-209679-1024x663.jpeg) + + +_Concepts from game theory are applicable to neural networks. Photographer: Alexas\_Fotos, Pixabay License._ + +## Generative adversarial networks + +All right, let's continue with the core topic of this blog: the website This Person Does Not Exist. We've seen what a zero-sum game is, but now we will have to apply it in the area of machine learning. The website was made by using a technique known as a _generative adversarial network_, also known as GAN. We'll have to break that term into its distinct parts if we would like to know what it means: + +- **Generative:** it makes something; +- **Adversarial:** something battles against each other in some kind of game; +- **Network:** two neural networks, in this case. + +In short: a GAN is composed of two neural networks which, by playing against each other and trying to let each other lose, make something. + +\[ad\] + +And 'something' is pretty varied these days. After modern applications of GANs emerged in 2014, networks have been developed which can produce pictures of the interior, of shoes, bags and clothing. But related networks are now also capable of _playing videos ahead of time_, which means: to upload 10 seconds of video, allowing the model to predict the next two. Another one: in 2017, a work was [published](https://arxiv.org/abs/1702.01983) in which the development of GANs that can make pictures older is discussed. 
Its application can be extended to missing children, who have possibly grown older but whose case was never resolved.
+
+GANs are thus a new technique in the arsenal of the machine learning engineer, one which spawns a wide range of new applications. Not only _predictive power_, like with other models, but also some _creative power!_
+
+But then, how does a GAN work exactly?
+
+Schematically, its inner workings look as follows.
+
+[![](images/GAN-1024x431.jpg)](https://machinecurve.com/wp-content/uploads/2019/07/GAN-1024x431.jpg)
+
+A GAN, schematically.
+
+It all starts with what we call a _noise vector_. A vector is an abstract representation of what you can also consider to be some sort of list. In machine learning, data is converted into numbers in nine out of ten cases. A noise vector could therefore also be seen as a list of random numbers. The vector, or the list, is input to the first neural network, which is called the _generator network._ This generator is capable of converting a large amount of noise into a larger and more accurate picture, layer after layer. But it's a _fake_ picture, though!
+
+\[ad\]
+
+The fake picture is fed into the second neural network, which is also known as the _discriminator_. This network, which has been trained with real pictures, is capable of doing the opposite: breaking down the image into individual components to determine the category to which the picture belongs. In the case of a GAN, the categories are _fake_ and _real_. In a way, you can thus see the generator as the criminal and the discriminator as the cop, which has to catch the criminal.
+
+## GANs and their training
+
+How well catching the criminal works is something we know when we finish one epoch - a round of training. For every sample from the validation set, for which a target (fake or real) is available, it is determined how much the predicted value differs from the real one. We call this the _loss_.
+
+Just as with any other neural network, this loss value can be used to optimize the model. Optimizing the model is too complex for this blog, but with a very elegant mathematical technique one simply calculates the shortest path from the mountain top (the worst loss value) towards the valley (the best loss value). Based on one training epoch the model is capable of adapting both the _generator_ and the _discriminator_ for improvement, after which a new epoch can start.
+
+Perhaps you can imagine that whatever the generator produces is dependent on the discriminator. With every machine learning model, the goal is to maximize the gain, which also means minimizing the loss. When the discriminator becomes better and better at predicting whether an image is real or fake (and consequently yields a higher loss), the generator must improve time after time to get away with its attempt to fool the cop (making the loss lower). The discriminator, however, gets better and better at recognising the _real pictures_ which we fed to this neural network. Consequently, if the generator wants to keep up with the discriminator, it must make itself better and better at generating images that look like the real ones the discriminator has seen.
+
+And the most recent consequence of those developments within GANs is the set of pictures on ThisPersonDoesNotExist. It also explains why we're speaking about an _adversarial network_, in which two neural networks play a zero-sum game against each other... what one wins in terms of loss, is what the other loses.
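+
+To make the cop-and-criminal game a bit more concrete, here is a minimal sketch of a GAN training loop in PyTorch. Everything in it - the tiny generator and discriminator, the layer sizes, the random 'real' images and the learning rates - is made up purely for illustration; it is emphatically not the network behind This Person Does Not Exist:
+
+```
+import torch
+import torch.nn as nn
+
+# Tiny illustrative networks - far smaller than anything used for real face generation
+generator = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
+discriminator = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
+
+opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
+opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
+bce = nn.BCELoss()
+
+real_images = torch.rand(32, 784)  # stand-in for a batch of real pictures
+
+for epoch in range(5):
+    # 1. Train the discriminator (the cop): real images should score 1, fakes 0
+    fake_images = generator(torch.randn(32, 64)).detach()  # detach: don't update the generator here
+    d_loss = bce(discriminator(real_images), torch.ones(32, 1)) + \
+             bce(discriminator(fake_images), torch.zeros(32, 1))
+    opt_d.zero_grad()
+    d_loss.backward()
+    opt_d.step()
+
+    # 2. Train the generator (the criminal): it 'wins' when the cop labels its fakes as real
+    g_loss = bce(discriminator(generator(torch.randn(32, 64))), torch.ones(32, 1))
+    opt_g.zero_grad()
+    g_loss.backward()
+    opt_g.step()
+
+    print(f"epoch {epoch}: d_loss={d_loss.item():.3f}, g_loss={g_loss.item():.3f}")
+```
+
+In a real implementation the networks are convolutional, far deeper, and trained on huge face datasets - but the alternating 'train the cop, then train the criminal' loop stays conceptually the same.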
+
+[![](images/black-background-brain-close-up-818563-1024x576.jpg)](https://machinecurve.com/wp-content/uploads/2019/07/black-background-brain-close-up-818563-1024x576.jpg)
+
+Photographer: David Cassolato, Pexels License.
+
+## How This Person Does Not Exist is unique
+
+Yet, the story does not end there. Generative adversarial networks work in some kind of cop-and-criminal relationship in order to produce very interesting results. But _This Person Does Not Exist_ had a different goal: showing that very accurate and also very large (1024 x 1024 pixels and larger) pictures can be generated at a reasonable speed.
+
+\[ad\]
+
+That's exactly what the bottleneck of GANs was at the time. Early GANs worked quite well, but were not too accurate (resulting in [vague pictures](https://hackernoon.com/what-are-creative-adversarial-networks-cans-bb81d09aa235)) or could only make smaller images. In 2018, NVIDIA's AI research team proposed a solution: the ProGAN network, which composes the generator in a very specific way. It is different in the sense that it builds the picture layer after layer, where the layers get bigger and more accurate. For example, the first layer is 4 by 4 pixels, the second 8 by 8, and so on. The interesting part of this way of working is that every new layer can benefit from the less granular results of the previous ones. In fact, it does not have to find out everything on its own. As we all know, _extending something that already exists_ is much easier than starting out of the blue. ProGAN was thus a small breakthrough in the field of generative adversarial networks.
+
+But that still doesn't end the story. The GAN that is built into This Person Does Not Exist is named StyleGAN, and is an upgrade of ProGAN. NVIDIA's AI team added various new elements, which allow practitioners to control more aspects of the network. For example, they can better separate the generator and the discriminator, which ensures less dependence of the generator on the training set. This allows one to, for example, reduce discrimination in the generated pictures. Nevertheless, separating those remains a challenge, which spawns a wide array of research opportunities for generative adversarial networks for the coming years!
+
+All in all, we saw that GANs allow the introduction of creativity in machine learning. That's simply a totally new approach to machine learning. I am very curious about the new application areas that we will see in the coming period. I'll keep you up to date... :-)
diff --git a/training-your-neural-network-with-cyclical-learning-rates.md b/training-your-neural-network-with-cyclical-learning-rates.md
new file mode 100644
index 0000000..da48fa6
--- /dev/null
+++ b/training-your-neural-network-with-cyclical-learning-rates.md
@@ -0,0 +1,435 @@
+---
+title: "Training your Neural Network with Cyclical Learning Rates"
+date: "2020-02-25"
+categories:
+  - "deep-learning"
+  - "frameworks"
+tags:
+  - "cyclical-learning-rate"
+  - "deep-learning"
+  - "keras"
+  - "learning-rate"
+  - "learning-rate-range-test"
+  - "machine-learning"
+---
+
+At a high level, training supervised machine learning models involves a few easy steps: feeding data to your model, computing loss based on the differences between predictions and ground truth, and using loss to improve the model with an optimizer.
+
+However, practice isn't so simple.
For example, it's possible to choose multiple optimizers - ranging from traditional [Stochastic Gradient Descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) to [adaptive optimizers](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), which are also very common today.

Say that you settle for the first - Stochastic Gradient Descent (SGD). Likely, in your deep learning framework, you'll see that the _learning rate_ is a parameter that can be configured, usually with a preconfigured default value.

Now, what is this learning rate? Why do we need it? And more importantly - what value do we choose? We'll start our blog by taking a brief look at these questions, referring to other blog posts that we wrote for more details.

We do so because the core of this blog post is the concept of a **Cyclical Learning Rate**, introduced by Smith (2017). In his research, Smith showed that such learning rates can perform much better than classic fixed or decaying learning rates, and can even be competitive with adaptive optimizers. That is, they can reach lower loss in much less time. That would be great for optimization, which is often a long wait - especially when your models are large.

But why does this work? And what types of Cyclical Learning Rates (CLRs) are out there? How do you configure Cyclical Learning Rates? Many new questions - which are all valid, and will be answered in this blog post :) Beyond the first part, focusing on learning rates at a high level, we'll focus on these things:

1. Introducing the concept of a Cyclical Learning Rate and why it can improve the performance of your machine learning model.
2. Showing you that the learning rate cycles of CLRs can be linear, parabolic and sinusoidal.
3. Explaining how to configure CLRs once you choose to use them.
4. Building a real-world Cyclical Learning Rate example with Keras.

Now, that's quite some work we'll do today! Fear not, I'll make sure to guide you through the process as smoothly as possible, explaining every choice I make as we go.

Are you ready? Let's go! 😎

* * *

\[toc\]

* * *

## What are learning rates and why do we need them?

The first thing we need to do before we can introduce Cyclical Learning Rates - and explain why they are useful - is to introduce you to the concept of a learning rate.

If you already know what learning rates are, I suggest you skip this section. However, if you're interested in sharpening your understanding, then make sure to read on :)

Training a supervised machine learning model, as we already illustrated above, can be captured in a [few steps](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) that are easy in theory. One of these steps is computing the _gradient_, i.e. the estimated change, that can be used to change the model - doing so would likely improve it in the next iteration.

Backpropagation, with its "change with respect to layer X (...) with respect to the loss value" logic, is used to compute the gradient for a particular layer. Upstream gradients are often more complex to compute, with problems like the [vanishing gradients problem](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) as a result.

However, we're not here today to complain about issues with gradients :) Rather, we're here to see what learning rates are.
For this, we'll have to stick to the gradient, but we'll also require a little bit of imagination. Suppose that you are walking down a mountain, and your goal is to end at the _valley_; that is, the _global minimum_ of that particular relief.

Such a mountainous scenario can be represented by the mathematical plot below:

![](images/MaximumCounterexample.png)

_Source: [Sam Derbyshire](https://en.wikipedia.org/wiki/User:Sam_Derbyshire "wikipedia:User:Sam Derbyshire") at [Wikipedia](https://en.wikipedia.org/wiki/ "wikipedia:") [CC BY-SA 3.0](http://creativecommons.org/licenses/by-sa/3.0/ "Creative Commons Attribution-Share Alike 3.0"), [Link](https://commons.wikimedia.org/w/index.php?curid=48728184)_

However, it must be clear that the _valley_ depicted here - i.e., the _red dot_ - is only a local valley. As you can see, by walking further, you can descend even further. However, let's now assume that you're in the top left part while descending, with the aim of arriving at that red dot.

What do you do?

You set steps. Steps in the direction of the red dot. This is the role of the backpropagation algorithm: computing the gradient (i.e. the direction of the step) that can be taken towards the red dot.

But how large should these steps be? That's a critical question as well. If your steps are too small, it will take an eternity before you arrive at the red dot. If they are too large, you may never arrive at all, because you keep overstepping the dot, left and right, back and forth.

This is where the learning rate enters the picture. With the learning rate, you control the step size. As you might guess by now, a high-level description of the formula to alter the weights of your machine learning model for every iteration is this:

_Weight update = Previous weights - Learning rate x Gradient_

By setting the learning rate to small values (e.g. \[latex\]0.001\[/latex\]), you ensure that steps are small enough to converge towards the minimum and arrive at a position close to it. If you set it too large (e.g. \[latex\]0.5\[/latex\]), you might overshoot the minimum every time. However, a large learning rate would speed up learning in the beginning, while you're still at the top of the mountain and can afford to take large steps. This is precisely [why fixed learning rates aren't a good idea](https://www.machinecurve.com/index.php/2019/11/11/problems-with-fixed-and-decaying-learning-rates/), and why you need to be careful with decaying learning rates as well.

* * *

## Introducing Cyclical Learning Rates

In his paper, Smith (2017) argues that "increasing the learning rate might have a short term negative effect and yet achieve a longer term beneficial effect".

But why is this the case?

Let's take a look at a nasty phenomenon that you can encounter when training machine learning models - saddle points.

### Saddle points are problematic for machine learning success

Wikipedia (2004) defines a saddle point as follows:

> In mathematics, a **saddle point** or **minimax point** is a point on the surface of the graph of a function where the slopes (derivatives) in orthogonal directions are all zero (a critical point), but which is not a local extremum of the function.
>
> [Wikipedia (2004)](https://en.wikipedia.org/wiki/Saddle_point)

Indeed, it's a point where the gradient is zero - while it's not a minimum.
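To see this in action, here is a small, purely illustrative check for the textbook saddle function \[latex\]f(x, y) = x^2 - y^2\[/latex\]: at the origin, the gradient is zero, yet the point is not a minimum.

```
import numpy as np

# The textbook saddle function: f(x, y) = x^2 - y^2
def f(x, y):
    return x ** 2 - y ** 2

# Its gradient: (df/dx, df/dy) = (2x, -2y)
def gradient(x, y):
    return np.array([2 * x, -2 * y])

# At the origin, the gradient is zero...
print(gradient(0.0, 0.0))  # [0. 0.]

# ...yet the origin is not a minimum: moving along the y axis decreases f further
print(f(0.0, 0.0))  # 0.0
print(f(0.0, 0.5))  # -0.25, lower than the value at the critical point
```

In other words, gradient-based optimization can come to a halt at such a point even though better positions exist right next to it.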
Such points often look like the plots below: the surface curves upward in one direction, while curving downward in the orthogonal direction:

- [![](images/Saddle_point.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/Saddle_point.png)
    
- [![](images/Saddle_Point_between_maxima.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/Saddle_Point_between_maxima.png)
    

_Two landscapes with saddle points. On the left, it's most visible - while on the right, it's in between two maxima. | Left: By [Nicoguaro](//commons.wikimedia.org/wiki/User:Nicoguaro "User:Nicoguaro") - Own work, [CC BY 3.0](https://creativecommons.org/licenses/by/3.0 "Creative Commons Attribution 3.0"), [Link](https://commons.wikimedia.org/w/index.php?curid=20570051) | Right: By [Nicoguaro](//commons.wikimedia.org/wiki/User:Nicoguaro "User:Nicoguaro") - Own work, [CC BY 4.0](https://creativecommons.org/licenses/by/4.0 "Creative Commons Attribution 4.0"), [Link](https://commons.wikimedia.org/w/index.php?curid=48854962)_

As noted above, saddle points are infamous for getting in the way of adequately performing machine learning models.

This occurs because _the gradient is zero, while they don't represent a minimum_.

As you know, during optimization, a model will compute the gradient given some loss and will steer your weights in a direction where the gradient becomes zero.

As minima have zero gradients, this is good - except for the fact that saddle points have them too. Once your weights end up at a saddle point, where the gradient is also zero, your model will stop improving.

This is bad.

What's more, saddle points may be even worse than local minima, where gradients are also zero and which are thus also a problem when your goal is to find the global minimum (Dauphin et al., 2015).

Hence, we need a way out of there. Various adaptations to the optimization process, such as momentum, are of help here - the rolling momentum, for example, may ensure that your updates "shoot" past the saddle point, so that your model can continue improving.

### Cyclical Learning Rates against saddle points

Precisely this problem is why Smith (2017) argued that "increasing the learning rate might have a short term negative effect and yet achieve a longer term beneficial effect".

From everything above, we can observe that when the learning rate gets too small around local minima and saddle points, we can't escape them anymore.

However, if learning rates are too large globally, then we will no longer find the global minimum.

Now, if we increase the learning rate, the effect on the short term may be negative - a higher loss value, for example because loss moves up the mountain - while the long-term effect is positive, as you escape saddle points and local minima.

However, we don't want the learning rate to increase all the time: over time, you should likely be near your global minimum, and with an ever increasing learning rate you would overstep the minimum time after time.

Here, the concept of a **Cyclical Learning Rate** or CLR may help. Introduced by Smith (2017), a CLR simply means that your learning rate moves back and forth between a low and a high learning rate. Thus: when it's high, you can escape saddle points and local minima, while stepping close to your global minimum when it's low. Or, indeed, experience a short term negative effect and yet achieve a longer term beneficial one.
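As a quick illustration, here is a minimal NumPy sketch of one of the forms we will discuss next - the _triangular_ policy from Smith (2017). The bound values and step size are illustrative only, not recommendations:

```
import numpy as np

def triangular_clr(iteration, step_size=2000, base_lr=0.05, max_lr=1.50):
    """Triangular cyclical learning rate (Smith, 2017).

    The learning rate moves linearly from base_lr up to max_lr and back,
    completing one full cycle every 2 * step_size iterations.
    """
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * np.maximum(0, 1 - x)

# The learning rate oscillates between the two bounds:
for it in [0, 1000, 2000, 3000, 4000]:
    print(it, triangular_clr(it))
# 0 -> 0.05, 1000 -> 0.775, 2000 -> 1.50, 3000 -> 0.775, 4000 -> 0.05
```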
Let's now take a look at some of these Cyclical Learning Rates.

### Forms of Cyclical Learning Rates: linear, parabolic and sinusoidal CLRs

In his paper, Smith (2017) describes three types of CLRs. The first is a linear one, also known as a triangular one, which looks as follows:

[![](images/triangular.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/triangular.png)

As you can see, the learning rate moves back and forth between a low one (\[latex\]bound\_{min} = 0.05\[/latex\]) and quite a high one (\[latex\]bound\_{max} = 1.50\[/latex\]). The same is true for the next one, except that the movement is _smooth_ here - it's a sinusoidal one:

[![](images/sinusoidal.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/sinusoidal.png)

A third one reported in Smith (2017) is neither linear nor sinusoidal, but rather parabolic in nature:

[![](images/parabolic.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/parabolic.png)

All three of them share the characteristic that the learning rate moves back and forth between a _minimum_ and a _maximum_ bound, ensuring that saddle points and local minima can be escaped while your training process can still settle close to the global minimum. Smith's experiments with these various forms showed that the results were equivalent. For the sake of simplicity, Smith (2017) thus chose to use triangular learning rates in the rest of his work.

* * *

### Decaying the bounds over time

In some cases, it's desirable to let the bounds decay over time (Smith, 2017). This ensures that the learning rate varies less and less as the epochs pass - that is, presumably, as you approach the global minimum. Below, you'll see an example of a parabolic-like CLR with exponential bound decay. Another approach lets the learning rates decay in a triangular fashion, i.e. by cutting the difference between the bounds in half after every cycle.

[![](images/clr_decay.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/clr_decay.png)

* * *

## Approach to using CLRs in your neural network

Let's now take a look at **a step-wise approach** for using and configuring Cyclical Learning Rates in your neural network.

Globally, configuration goes as follows:

1. You choose to use CLRs in your neural network.
2. You choose a style of cycling (i.e., linear, parabolic or sinusoidal) and whether your bounds decay over time.
3. You set the _cycle length_; that is, the length of each phase of your cycle.
4. You set the _minimum bound_ and the _maximum bound_, and possibly, your _decay policy_.

Above, we already saw that triangular (a.k.a. linear) cycling is usually the pragmatic choice. Let's now take a look at finding cycle lengths and minimum and maximum bounds.

### How to find cycle length for Cyclical Learning Rates?

Smith (2017) describes how to choose the cycle length as follows:

> The length of a cycle and the input parameter stepsize can be easily computed from the number of iterations in an epoch. An epoch is calculated by dividing the number of training images by the batchsize used. For example, CIFAR-10 has 50,000 training images and the batchsize is 100 so an epoch = 50,000/100 = 500 iterations. The final accuracy results are actually quite robust to cycle length but experiments show that it often is good to set stepsize equal to 2 − 10 times the number of iterations in an epoch. For example, setting stepsize = 8 ∗ epoch with the CIFAR-10 training run (as shown in Figure 1) only gives slightly better results than setting stepsize = 2 ∗ epoch.
>
> Smith (2017)
+ +According to Smith (2017), choosing a cycle length for CLRs is pretty easy: + +**You compute the number of iterations in an epoch and set the cycle length - i.e. the `stepsize` input parameter - to 2 to 10 times this value.** + +As we know, one epoch is the full forward pass of all the samples in your training set. So, for example, if your training set has 60.000 images, and you use a batch size of 250, your step size must be configured to be within \[latex\]2 \* \\frac{60.000}{250} = 2 \* 240 = 480\[/latex\] and \[latex\]10 \* \\frac{60.000}{250} = 10 \* 240 = 2400\[/latex\] . + +### How to find minimum and maximum bounds for Cyclical Learning Rates? + +Another question that is raised often is _how to find the minimum and the maximum bounds_ for Cyclical Learning Rates. + +Smith (2017) also provides an answer to this question. + +This one's easy too: **we use the Learning Rate Range Test**. + +Indeed, that test [which we already encountered](https://www.machinecurve.com/index.php/2020/02/20/finding-optimal-learning-rates-with-the-learning-rate-range-test/) when estimating proper starting LRs for learning rate decay. + +This time, we'll use it a little bit differently, though. With the Learning Rate Range Test, we let the model run for several epochs, while the learning rate increases over time. For every learning rate, we get the loss value - and this information tells us something about the bounds we need to set. + +For the linear policy, the mechanism is simple (Smith, 2017): + +- The base learning rate is set to the minimum value that we want to test for; +- The max learning rate to the maximum value that we want to test for; +- The `stepsize` is equaled to the `number of iterations` in the range test. +- When the loss value starts to decrease rapidly (or accuracy starts to increase rapidly), note the value for the learning rate. +- Do the same when the improvement starts to flatten, i.e. when a plateau occurs. + +Use these values for the `min_bound` and `max_bound` in your Cyclical Learning Rate. This is a good value, as this range captures the biggest improvement and hence the optimum learning rate is likely somewhere within the bounds - and thus encountered during the cycles (Smith, 2017). + +* * * + +## A Keras example for Cyclical Learning Rates + +Let's now take a look at how we can implement Cyclical Learning Rates with the Keras framework for deep learning. + +To make this work, we use two great open source implementations of: + +- The **[Learning Rate Range Test](https://gist.github.com/WittmannF/c55ed82d27248d18799e2be324a79473)**, which was created by Fernando Wittmann ([mirror](https://gist.github.com/christianversloot/f5d647503b47249adbd1f9633183ea49)); +- The **[Cyclical Learning Rates](https://github.com/bckenstler/CLR)**, which were created by Brad Kenstler ([mirror](https://github.com/christianversloot/CLR)). + +### Today's Keras model + +The first thing that we have to do is define today's Keras model. We'll use a model that is very similar to the one created for [sparse categorical crossentropy](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/), with a few adaptations. If you want to learn how to build this model from the ground up, it's wise if you read the post linked above. Therefore, we'll continue with (a large part) of that model here without further explanation. + +Create a folder containing a Python file, such as `base_model.py`. 
In this file, add the following code: + +``` +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +import matplotlib.pyplot as plt +from CLR.clr_callback import CyclicLR +from LRF.lr_finder import LRFinder + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 28, 28, 1 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 15 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Load CIFAR-100 data +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Parse numbers as floats +input_train = input_train.astype('float32').reshape((input_train.shape[0], img_width, img_height, img_num_channels)) +input_test = input_test.astype('float32').reshape((input_test.shape[0], img_width, img_height, img_num_channels)) + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) +``` + +### A Learning Rate Range Test for Cyclical Learning Rates + +Do note above that one thing is different compared to the sparse categorical crossentropy model: we import the code for Cyclical Learning Rates as `CLR` and the one for the Learning Rate Range Test as `LRFinder` - as you can see here: + +``` +from CLR.clr_callback import CyclicLR +from LRF.lr_finder import LRFinder +``` + +But how to install these packages? + +Let's take a look. + +#### Adding the LRFinder package + +Adding the package for the Learning Rate Range test is simple. + +- In the folder where your `base_model.py` file is stored, create a folder named `LRF`. +- In this new folder, create a file called `lr_finder.py`. +- Add the code [you can find here](https://gist.github.com/WittmannF/c55ed82d27248d18799e2be324a79473) ([mirror](https://gist.github.com/christianversloot/f5d647503b47249adbd1f9633183ea49)) to this file. + +Now you can use Fernando Wittmann's Learning Rate Range Test implementation with your Keras model! :) + +#### Adding the CLR package + +Adding the package for Cyclical Learning Rates is also quite easy. It requires that you have installed Git on your machine - search for "how to install Git" on Google if you haven't installed Git yet. + +Open a terminal and `cd` to the folder where your `base_model.py` file is stored. + +Then execute a Git clone: `git clone https://github.com/bckenstler/CLR.git` (use the [mirror](https://github.com/christianversloot/CLR) if the repository is no longer available). + +Now, the CLR repository should clone to a new folder called `CLR` - which is precisely what you need in your `base_model.py`. + +Voila, you can now also use Brad Kenstler's CLR implementation with your Keras modal :) + +Time to use them! 😎 + +### A Learning Rate Range Test for Cyclical Learning Rates + +The first thing we'll have to find out is the cycle length. 
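Following Smith's rule of thumb from the section above, a small, purely illustrative calculation (the numbers assume the MNIST setup of `base_model.py`, with 60,000 training images and a batch size of 50) shows the range in which the `stepsize` could fall:

```
# Rule of thumb from Smith (2017): set step_size to 2 - 10 times
# the number of iterations (batches) in one epoch.
training_samples = 60000   # size of the MNIST training set used above
batch_size = 50            # as configured in base_model.py

iterations_per_epoch = training_samples // batch_size   # 1200
min_step_size = 2 * iterations_per_epoch                # 2400
max_step_size = 10 * iterations_per_epoch               # 12000
```

Later on, we will use a factor of 4, which falls nicely within this range.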
+ +Then, we need to identify the minimum and maximum bounds. The Learning Rate Range Test is what we can use for this. Let's add some code for using `LRFinder`: + +``` +## +## LR Finder specific code +## + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) +``` + +First, we compile the model - that is, create a true instance with our specified loss function and optimizer. + +Then, we specify a few configuration options: + +``` +# Configuration settings for LR finder +start_lr = 1e-4 +end_lr = 1e0 +no_epochs = 10 +``` + +We will try to find the best learning rate within the range between \[latex\]10^{-4}\[/latex\] and \[latex\]10^0 = 1\[/latex\]. We do so in ten epochs. + +We then define the Learning Rate Range Test as `lr_finder` and add it as a Keras callback to `model.fit`: + +``` +# Define LR finder callback +lr_finder = LRFinder(min_lr=start_lr, max_lr=end_lr) + +# Perform LR finder +model.fit(input_train, target_train, batch_size=batch_size, callbacks=[lr_finder], epochs=no_epochs) +``` + +Now, when we run the Python file (i.e. `python base-model.py`), the training process for finding the optimal learning rate should begin. Once it finished, you should see a visualization pop up immediately which looks somewhat like this one: + +[![](images/lrf_mnist-1024x512.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/lrf_mnist.png) + +Interpreting this plot leads to the conclusion that a decrease in loss (i.e., model improvement) starts immediately - which means that we'll choose \[latex\]10^{-4}\[/latex\] as the lower bound for our cyclical learning rate. + +We observe a plateau around \[latex\]10^{-2}\[/latex\], after which loss values become unstable. Hence, we choose this as the value for our upper bound, and set the bounds accordingly next. + +### Applying CLRs in the Keras model + +Now that we know which bounds we'll use, we can remove all Learning Rate Range Test specific code. That is, remove everything up to and including: + +``` +## +## LR Finder specific code +## +``` + +Ensure that your code now ends at the `model.add`s for the layers. + +If they do, let's move on - and add the Cyclical Learning Rate implementation. + +The first thing that we'll have to do is to specify the options: + +``` +# Set CLR options +clr_step_size = int(4 * (len(input_train)/batch_size)) +base_lr = 1e-4 +max_lr = 1e-2 +mode='triangular' +``` + +Clearly, our learning rate range is configured as we found it to be optimal. What's more, we specify the `clr_step_size` in line with the estimates provided by Smith (2017): within 2 to 10 times the number of iterations per epoch - i.e. the length of our training set divided by the batch size. + +[![](images/triangular-300x140.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/triangular.png) + +The `mode` is set to triangular: that's equal to linear mode. We don't use `triangular2` or `exp_range`, which are also supported and would represent decaying bounds. + +We can then define the callback for our Keras model: + +``` +# Define the callback +clr = CyclicLR(base_lr=base_lr, max_lr=max_lr, step_size=clr_step_size, mode=mode) +``` + +The only thing that is left by then is model compilation with `model.compile` and starting the training process with `model.fit`. Note the use of the callback during the `fit`! 
+ +``` +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split, + callbacks=[clr]) +``` + +There you go! If you call `python base_model.py` again, training will now begin with a cyclical learning rate 😎 + +* * * + +## Summary + +In this blog post, we looked at the concept of Cyclical Learning Rates - a type of learning rate configuration introduced by Leslie Smith, and specifically Smith (2017). + +In order to introduce them successfully, we first looked at learning rates. What are they? How should they be configured? Why shouldn't they preferably be constant or decay over time? That is, because humans still need to configure them - and guess them - even though a test is available for this these days. Additionally, static learning rates will be too low in the first stages of the training process, while too high in the later stages. + +Cyclical Learning Rates can solve this. By letting the learning rate oscillate back and forth between a lower and an upper bound, it's possible to avoid this - while even overcoming the problem of saddle points and local minima. We discussed the forms of CLRs available, as well as the decay of the bounds of your CLR. + +Then, we moved on to an implementation for the Keras deep learning framework - by using open source additions to Keras, created by third party developers. Thanks guys! :) + +I hope this blog post has helped you understand learning rates and specifically Cyclical ones. If it did, please drop a message in the comments box below 👇 I'd be happy to read your comment! Make sure to do the same if you have questions, if you spot a mistake or if you have any general remarks. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Smith, L. N. (2017, March). [Cyclical learning rates for training neural networks](https://arxiv.org/abs/1506.01186). In _2017 IEEE Winter Conference on Applications of Computer Vision (WACV)_ (pp. 464-472). IEEE. + +Wikipedia. (2004, May 7). Saddle point. Retrieved from [https://en.wikipedia.org/wiki/Saddle\_point](https://en.wikipedia.org/wiki/Saddle_point) + +Dauphin, Y., De Vries, H., & Bengio, Y. (2015). [Equilibrated adaptive learning rates for non-convex optimization](http://papers.nips.cc/paper/5870-equilibrated-adaptive-learning-rates-for-non-convex-optimization). In _Advances in neural information processing systems_ (pp. 1504-1512). + +Kenstler, B. (2018, March 11). CLR. Retrieved from [https://github.com/bckenstler/CLR](https://github.com/bckenstler/CLR) + +Wittmann, F. (n.d.). Learning Rate Finder as a Keras Callback. 
Retrieved from [https://gist.github.com/WittmannF/c55ed82d27248d18799e2be324a79473](https://gist.github.com/WittmannF/c55ed82d27248d18799e2be324a79473) diff --git a/transformers-for-long-text-code-examples-with-longformer.md b/transformers-for-long-text-code-examples-with-longformer.md new file mode 100644 index 0000000..0553d66 --- /dev/null +++ b/transformers-for-long-text-code-examples-with-longformer.md @@ -0,0 +1,248 @@ +--- +title: "Transformers for Long Text: Code Examples with Longformer" +date: "2021-03-12" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "code-examples" + - "deep-learning" + - "longformer" + - "machine-learning" + - "transformer" +--- + +Transformer models have been boosting NLP for a few years now. Every now and then, new additions make them even more performant. Longformer is one such extension, as it can be used for long texts. + +While being applied for many tasks - think [machine translation](https://www.machinecurve.com/index.php/2021/02/16/easy-machine-translation-with-machine-learning-and-huggingface-transformers/), [text summarization](https://www.machinecurve.com/index.php/2020/12/21/easy-text-summarization-with-huggingface-transformers-and-machine-learning/) and [named-entity recognition](https://www.machinecurve.com/index.php/2021/02/11/easy-named-entity-recognition-with-machine-learning-and-huggingface-transformers/) - classic Transformers always have faced difficulties when texts became too long. This results from the self-attention mechanism applied in these models, which in terms of time and memory consumption scales quadratically with sequence length. + +Longformer makes Transformers available to long texts by introducing a sparse attention mechanism and combining it with a global, task specific one. More about that [can be read here](https://www.machinecurve.com/index.php/question/what-is-the-longformer-transformer-and-how-does-it-work/). In this tutorial, you're going to work with actual Longformer instances, for a variety of tasks. More specifically, after reading it, you will know... + +- **How to use Longformer based Transformers in your Machine Learning project.** +- **What is necessary for using Longformer for Question Answering, Text Summarization and Masked Language Modeling (Missing Text Prediction).** +- **That Longformer is really capable of handling large texts, as we demonstrate in our examples.** + +Let's take a look! 🚀 + +* * * + +\[toc\] + +* * * + +## What is the Longformer model? + +Ever since Transformer models have been introduced in 2017, they have brought about change in the world of NLP. With a variety of architectures, such as [BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) and [GPT](https://www.machinecurve.com/index.php/2021/01/02/intuitive-introduction-to-openai-gpt/), a wide range of language tasks have been improved to sometimes human-level quality... and in addition, with libraries like HuggingFace Transformers, applying them has been democratized significantly. 
As a consequence, we can now create pipelines for [machine translation](https://www.machinecurve.com/index.php/2021/02/16/easy-machine-translation-with-machine-learning-and-huggingface-transformers/), [text summarization](https://www.machinecurve.com/index.php/2020/12/21/easy-text-summarization-with-huggingface-transformers-and-machine-learning/) and [named-entity recognition](https://www.machinecurve.com/index.php/2021/02/11/easy-named-entity-recognition-with-machine-learning-and-huggingface-transformers/) with only a few lines of code.

Classic Transformers - including GPT and BERT - have one problem though: the time and memory complexity of the [self-attention function](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/#multi-head-attention). As you may recall, this function applies queries, keys and values by means of \[latex\]Q\[/latex\], \[latex\]K\[/latex\] and \[latex\]V\[/latex\] generations from the input embeddings - and more specifically, it performs a multiplication of the sort \[latex\]QK^T\[/latex\]. This multiplication is the problem: time and memory complexity increase quadratically with sequence length.

This means that when your sequences (and thus your input length) are really long, Transformers cannot process them anymore - simply because too much time or too many resources are required. To mitigate this, classic Transformers and BERT- and GPT-like approaches truncate text and sometimes adapt their architecture to specific tasks.

What we want, however, is a Transformer that can handle long texts without the need for any significant changes.

That's why **Longformer** was introduced. It changes the attention mechanism by applying _dilated sliding window based attention_, where each token has a 'window' of tokens around that particular token - including dilation - for which attention is computed. In other words, attention is now more _local_ rather than global. To ensure that some global patterns are captured as well (e.g. specific attention to particular tokens), _global attention_ is added as well - but this is more task specific. We have covered the details of Longformer [in another article](https://www.machinecurve.com/index.php/question/what-is-the-longformer-transformer-and-how-does-it-work/), so make sure to head there if you want to understand Longformer in more detail. Let's now take a look at the example text that we will use today, and then move forward to the code examples.

### Today's example text

To show you that Longformer works with really long texts in a variety of tasks, we'll use some segments from the [Wikipedia page about Germany](https://en.wikipedia.org/wiki/Germany) (Wikipedia, 2001). More specifically, we will be using this text:

```
Germany (German: Deutschland, German pronunciation: [ˈdɔʏtʃlant]), officially the Federal Republic of Germany,[e] is a country at the intersection of Central and Western Europe. It is situated between the Baltic and North seas to the north, and the Alps to the south; covering an area of 357,022 square kilometres (137,847 sq mi), with a population of over 83 million within its 16 constituent states. It borders Denmark to the north, Poland and the Czech Republic to the east, Austria and Switzerland to the south, and France, Luxembourg, Belgium, and the Netherlands to the west. Germany is the second-most populous country in Europe after Russia, as well as the most populous member state of the European Union.
Its capital and largest city is Berlin, and its financial centre is Frankfurt; the largest urban area is the Ruhr. + +Various Germanic tribes have inhabited the northern parts of modern Germany since classical antiquity. A region named Germania was documented before AD 100. In the 10th century, German territories formed a central part of the Holy Roman Empire. During the 16th century, northern German regions became the centre of the Protestant Reformation. Following the Napoleonic Wars and the dissolution of the Holy Roman Empire in 1806, the German Confederation was formed in 1815. In 1871, Germany became a nation-state when most of the German states unified into the Prussian-dominated German Empire. After World War I and the German Revolution of 1918–1919, the Empire was replaced by the semi-presidential Weimar Republic. The Nazi seizure of power in 1933 led to the establishment of a dictatorship, World War II, and the Holocaust. After the end of World War II in Europe and a period of Allied occupation, Germany was divided into the Federal Republic of Germany, generally known as West Germany, and the German Democratic Republic, East Germany. The Federal Republic of Germany was a founding member of the European Economic Community and the European Union, while the German Democratic Republic was a communist Eastern Bloc state and member of the Warsaw Pact. After the fall of communism, German reunification saw the former East German states join the Federal Republic of Germany on 3 October 1990—becoming a federal parliamentary republic led by a chancellor. + +Germany is a great power with a strong economy; it has the largest economy in Europe, the world's fourth-largest economy by nominal GDP, and the fifth-largest by PPP. As a global leader in several industrial, scientific and technological sectors, it is both the world's third-largest exporter and importer of goods. As a developed country, which ranks very high on the Human Development Index, it offers social security and a universal health care system, environmental protections, and a tuition-free university education. Germany is also a member of the United Nations, NATO, the G7, the G20, and the OECD. It also has the fourth-greatest number of UNESCO World Heritage Sites. +``` + +### Software requirements + +To run the code that you will create in the next sections, it is important that you have installed a few things here and there. Make sure to have an environment (preferably) or a global Python environment running on your machine. Then make sure that HuggingFace Transformers is installed through `pip install transformers`. As HuggingFace Transformers runs on top of either PyTorch or TensorFlow, install any of the two. + +Note that the code examples below are built for PyTorch based HuggingFace. They can be adapted to TensorFlow relatively easily, usually by prepending `TF` before the model you are importing, e.g. `TFAutoModel`. + +### Moving forward + +Now, we can move forward to showing you how to use Longformer. Specifically, you're going to see code for these tasks: + +- Question Answering +- Text Summarization +- Masked Language Modeling (Predicting missing text). + +Let's take a look! 🚀 + +* * * + +## Longformer and Question Answering + +Longformer can be used for question answering tasks. This requires that the pretrained Longformer is fine-tuned so that it is tailored to the task. 
Today, you're going to use a Longformer model that has been fine-tuned on the [SQuAD v1](https://www.machinecurve.com/index.php/question/what-is-the-squad-dataset/) language task. + +This is a question answering task using the Stanford Question Answering Dataset (SQuAD). + +Creating the code involves the following steps: + +1. **Imports:** we'll need PyTorch itself to take an `argmax` with gradients later, so we must import it through `import torch`. Then, we also need the `AutoTokenizer` and the `AutoModelForQuestionAnswering` from HuggingFace `transformers`. +2. **Initialization of tokenizer and model.** Secondly, we need to get our tokenizer and model up and running. For doing so, we'll be using a model that is available in the HuggingFace Model Hub - the `valhalla/longformer-base-4096-finetuned-squadv1` model. As you can see, it's the Longformer base model fine-tuned on SQuAD v1. As with [any fine-tuned Longformer model](https://www.machinecurve.com/index.php/question/what-is-the-longformer-transformer-and-how-does-it-work/), it can support up to 4096 tokens in a sequence. +3. **Specifying the text and the question**. The `text` contains the context that is used by Longformer for answering the question. As you can imagine, it's the text that we specified above. For the `question`, we're interested in the size of Germany's economy by national GDP (Germany has the fourth-largest economy can be read in the text). +4. **Tokenization of the input text**. Before we can feed the text to our Longformer model, we must tokenize it. We simply feed question and text to the tokenizer and return PyTorch tensors. From these, we can extract the input identifiers, i.e. the unique token identifiers in the vocabulary for the tokens from our input text. +5. **Getting the attention mask**. Recall that Longformer works with sparse local attention and task-specific global attention. For question answering, the tokenizer generates the attention mask; this was how the tokenizer was trained. That's why we can also extract the attention mask from the encoding. Note that global attention is applied to tokens related to the question only. +6. **Getting the predictions**. Once we have tokenized our input and retrieved the atetntion mask, we can get the predictions. +7. **Converting the predictions into the answer, and printing the answer on screen.** The seventh and final step is to actually convert the identifiers to tokens, which we then decode and print on our screen. + +``` +import torch +from transformers import AutoTokenizer, AutoModelForQuestionAnswering + +# Initialize the tokenizer +tokenizer = AutoTokenizer.from_pretrained("valhalla/longformer-base-4096-finetuned-squadv1") + +# Initialize the model +model = AutoModelForQuestionAnswering.from_pretrained("valhalla/longformer-base-4096-finetuned-squadv1") + +# Specify text and question +text = """Germany (German: Deutschland, German pronunciation: [ˈdɔʏtʃlant]), officially the Federal Republic of Germany,[e] is a country at the intersection of Central and Western Europe. It is situated between the Baltic and North seas to the north, and the Alps to the south; covering an area of 357,022 square kilometres (137,847 sq mi), with a population of over 83 million within its 16 constituent states. It borders Denmark to the north, Poland and the Czech Republic to the east, Austria and Switzerland to the south, and France, Luxembourg, Belgium, and the Netherlands to the west. 
Germany is the second-most populous country in Europe after Russia, as well as the most populous member state of the European Union. Its capital and largest city is Berlin, and its financial centre is Frankfurt; the largest urban area is the Ruhr.Various Germanic tribes have inhabited the northern parts of modern Germany since classical antiquity. A region named Germania was documented before AD 100. In the 10th century, German territories formed a central part of the Holy Roman Empire. During the 16th century, northern German regions became the centre of the Protestant Reformation. Following the Napoleonic Wars and the dissolution of the Holy Roman Empire in 1806, the German Confederation was formed in 1815. In 1871, Germany became a nation-state when most of the German states unified into the Prussian-dominated German Empire. After World War I and the German Revolution of 1918–1919, the Empire was replaced by the semi-presidential Weimar Republic. The Nazi seizure of power in 1933 led to the establishment of a dictatorship, World War II, and the Holocaust. After the end of World War II in Europe and a period of Allied occupation, Germany was divided into the Federal Republic of Germany, generally known as West Germany, and the German Democratic Republic, East Germany. The Federal Republic of Germany was a founding member of the European Economic Community and the European Union, while the German Democratic Republic was a communist Eastern Bloc state and member of the Warsaw Pact. After the fall of communism, German reunification saw the former East German states join the Federal Republic of Germany on 3 October 1990—becoming a federal parliamentary republic led by a chancellor.Germany is a great power with a strong economy; it has the largest economy in Europe, the world's fourth-largest economy by nominal GDP, and the fifth-largest by PPP. As a global leader in several industrial, scientific and technological sectors, it is both the world's third-largest exporter and importer of goods. As a developed country, which ranks very high on the Human Development Index, it offers social security and a universal health care system, environmental protections, and a tuition-free university education. Germany is also a member of the United Nations, NATO, the G7, the G20, and the OECD. It also has the fourth-greatest number of UNESCO World Heritage Sites.""" +question = "How large is Germany's economy by nominal GDP?" + +# Tokenize the input text +encoding = tokenizer(question, text, return_tensors="pt") +input_ids = encoding["input_ids"] + +# Get attention mask (local + global attention) +attention_mask = encoding["attention_mask"] + +# Get the predictions +start_scores, end_scores = model(input_ids, attention_mask=attention_mask).values() + +# Convert predictions into answer +all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist()) +answer_tokens = all_tokens[torch.argmax(start_scores) :torch.argmax(end_scores)+1] +answer = tokenizer.decode(tokenizer.convert_tokens_to_ids(answer_tokens)) + +# Print answer +print(answer) +``` + +The results: + +``` +fourth-largest +``` + +Yep indeed, Germany has the fourth-largest economy by nominal GDP. Great! :D + +* * * + +## Longformer and Text Summarization + +Next up is text summarization. This can also be done with Transformers. Compared to other tasks such as question answering, summarization is a _generative_ activity that also greatly benefits from a lot of _context_. 
That's why traditionally, sequence-to-sequence architectures have been useful for this purpose. + +That's why in the example below, we are using a Longformer2RoBERTa architecture, which utilizes Longformer as the encoder segment, and RoBERTa as the decoder segment. It was fine-tuned on the CNN/DailyMail dataset, which is a common one in the field of text summarization. + +So strictly speaking, this is not a full Longformer model, but Longformer merely plays a part in the whole stack. Nevertheless, it works pretty well, as we shall see! + +This is how we build the model + +- **Imports:** we import the `LongformerTokenizer` and the `EncoderDecoderModel` (which is what you'll need for [Seq2Seq](https://www.machinecurve.com/index.php/2020/12/29/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning/)!) +- **Loading model and tokenizer:** our model is an instance of `patrickvonplaten/longformer2roberta-cnn_dailymail-fp16`, which contains the full Seq2Seq model. However, as the encoder segment is Longformer, we can use the Longformer tokenizer - so we use `allenai/longformer-base-4096` there. +- **Specifying the article:** the text from above. +- **Tokenization, summarization and conversion:** we feed the `article` into the tokenizer, return the input ids from the PyTorch based Tensors, and then generate the summary with our `model`. Once the summary is there, we use the `tokenizer` again for decoding the output identifiers into readable text. We skip special tokens. +- **Printing the summary on screen:** to see if it works :) + +``` +from transformers import LongformerTokenizer, EncoderDecoderModel + +# Load model and tokenizer +model = EncoderDecoderModel.from_pretrained("patrickvonplaten/longformer2roberta-cnn_dailymail-fp16") +tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096") + +# Specify the article +article = """Germany (German: Deutschland, German pronunciation: [ˈdɔʏtʃlant]), officially the Federal Republic of Germany,[e] is a country at the intersection of Central and Western Europe. It is situated between the Baltic and North seas to the north, and the Alps to the south; covering an area of 357,022 square kilometres (137,847 sq mi), with a population of over 83 million within its 16 constituent states. It borders Denmark to the north, Poland and the Czech Republic to the east, Austria and Switzerland to the south, and France, Luxembourg, Belgium, and the Netherlands to the west. Germany is the second-most populous country in Europe after Russia, as well as the most populous member state of the European Union. Its capital and largest city is Berlin, and its financial centre is Frankfurt; the largest urban area is the Ruhr.Various Germanic tribes have inhabited the northern parts of modern Germany since classical antiquity. A region named Germania was documented before AD 100. In the 10th century, German territories formed a central part of the Holy Roman Empire. During the 16th century, northern German regions became the centre of the Protestant Reformation. Following the Napoleonic Wars and the dissolution of the Holy Roman Empire in 1806, the German Confederation was formed in 1815. In 1871, Germany became a nation-state when most of the German states unified into the Prussian-dominated German Empire. After World War I and the German Revolution of 1918–1919, the Empire was replaced by the semi-presidential Weimar Republic. 
The Nazi seizure of power in 1933 led to the establishment of a dictatorship, World War II, and the Holocaust. After the end of World War II in Europe and a period of Allied occupation, Germany was divided into the Federal Republic of Germany, generally known as West Germany, and the German Democratic Republic, East Germany. The Federal Republic of Germany was a founding member of the European Economic Community and the European Union, while the German Democratic Republic was a communist Eastern Bloc state and member of the Warsaw Pact. After the fall of communism, German reunification saw the former East German states join the Federal Republic of Germany on 3 October 1990—becoming a federal parliamentary republic led by a chancellor.Germany is a great power with a strong economy; it has the largest economy in Europe, the world's fourth-largest economy by nominal GDP, and the fifth-largest by PPP. As a global leader in several industrial, scientific and technological sectors, it is both the world's third-largest exporter and importer of goods. As a developed country, which ranks very high on the Human Development Index, it offers social security and a universal health care system, environmental protections, and a tuition-free university education. Germany is also a member of the United Nations, NATO, the G7, the G20, and the OECD. It also has the fourth-greatest number of UNESCO World Heritage Sites.""" + +# Tokenize and summarize +input_ids = tokenizer(article, return_tensors="pt").input_ids +output_ids = model.generate(input_ids) + +# Get the summary from the output tokens +summary = tokenizer.decode(output_ids[0], skip_special_tokens=True) + +# Print summary +print(summary) +``` + +The results: + +``` +Germany is the second-most populous country in Europe after Russia. +It is the country's second-largest economy and the most populous member state of the European Union. +Germany is also a member of the United Nations, the G7, the OECD and the G20. +``` + +Quite a good summary indeed! + +* * * + +## Longformer and Masked Language Modeling / Predicting Missing Text + +Next up is Masked Language Modeling using Longformer. Recall that [MLM](https://www.machinecurve.com/index.php/2021/03/02/easy-masked-language-modeling-with-machine-learning-and-huggingface-transformers/) is a technique used for pretraining BERT-style models. When applied, parts of the text are masked, and the goal of the model is to predict the original text. If it can do so correctly and at scale, it effectively learns the relationships between text and therefore generates the supervision signal through the attention mechanism. + +Let's see if we can get this to work with Longformer, so that we can apply MLM to longer texts. As you can see we apply the mask just after the text starts: `officially the Federal Republic of Germany,[e] is a {mask}`. + +That should be _country_, indeed, so let's see if we can get the model to produce that. + +1. **Imports and pipeline init:** HuggingFace Transformers offers a [`pipeline` for Masked Language Modeling](https://www.machinecurve.com/index.php/2021/03/02/easy-masked-language-modeling-with-machine-learning-and-huggingface-transformers/), the `fill-mask` pipeline. We can initialize it with the `allenai/longformer-base-4096` model. This base model is the MLM pretrained base model that still requires fine-tuning for task specific behavior. However, because it was pretrained with MLM, we can also _use_ it for MLM and thus Predicting Missing Text. 
We thus load the `pipeline` API from `transformers`. +2. **Loading the mask token:** the `mlm.tokenizer` has a specific `mask_token`. We simplify it by referring to it as `mask`. +3. **Masking the text:** we specify the text, but then apply `{mask}` to where `country` is written in the original text. +4. **Perform MLM:** we then feed the `text` to our `mlm` pipeline to obtain the result, which we then print on screen. + +``` +from transformers import pipeline + +# Initialize MLM pipeline +mlm = pipeline('fill-mask', model='allenai/longformer-base-4096') + +# Get mask token +mask = mlm.tokenizer.mask_token + +# Get result for particular masked phrase +text = f"""Germany (German: Deutschland, German pronunciation: [ˈdɔʏtʃlant]), officially the Federal Republic of Germany,[e] is a {mask} at the intersection of Central and Western Europe. It is situated between the Baltic and North seas to the north, and the Alps to the south; covering an area of 357,022 square kilometres (137,847 sq mi), with a population of over 83 million within its 16 constituent states. It borders Denmark to the north, Poland and the Czech Republic to the east, Austria and Switzerland to the south, and France, Luxembourg, Belgium, and the Netherlands to the west. Germany is the second-most populous country in Europe after Russia, as well as the most populous member state of the European Union. Its capital and largest city is Berlin, and its financial centre is Frankfurt; the largest urban area is the Ruhr.Various Germanic tribes have inhabited the northern parts of modern Germany since classical antiquity. A region named Germania was documented before AD 100. In the 10th century, German territories formed a central part of the Holy Roman Empire. During the 16th century, northern German regions became the centre of the Protestant Reformation. Following the Napoleonic Wars and the dissolution of the Holy Roman Empire in 1806, the German Confederation was formed in 1815. In 1871, Germany became a nation-state when most of the German states unified into the Prussian-dominated German Empire. After World War I and the German Revolution of 1918–1919, the Empire was replaced by the semi-presidential Weimar Republic. The Nazi seizure of power in 1933 led to the establishment of a dictatorship, World War II, and the Holocaust. After the end of World War II in Europe and a period of Allied occupation, Germany was divided into the Federal Republic of Germany, generally known as West Germany, and the German Democratic Republic, East Germany. The Federal Republic of Germany was a founding member of the European Economic Community and the European Union, while the German Democratic Republic was a communist Eastern Bloc state and member of the Warsaw Pact. After the fall of communism, German reunification saw the former East German states join the Federal Republic of Germany on 3 October 1990—becoming a federal parliamentary republic led by a chancellor.Germany is a great power with a strong economy; it has the largest economy in Europe, the world's fourth-largest economy by nominal GDP, and the fifth-largest by PPP. As a global leader in several industrial, scientific and technological sectors, it is both the world's third-largest exporter and importer of goods. As a developed country, which ranks very high on the Human Development Index, it offers social security and a universal health care system, environmental protections, and a tuition-free university education. 
Germany is also a member of the United Nations, NATO, the G7, the G20, and the OECD. It also has the fourth-greatest number of UNESCO World Heritage Sites.""" +result = mlm(text) + +# Print result +print(result) +``` + +When we observe the results (we cut off the text at the masked token; it continues in the real results), we can see that it is capable of predicting `country` indeed! + +``` +[{'sequence': "Germany (German: Deutschland, German pronunciation: [ˈdɔʏtʃlant]), officially the Federal Republic of Germany,[e] is a country +``` + +Great! + +* * * + +## Summary + +In this tutorial, we covered practical aspects of the Longformer Transformer model. Using this model, you can now process really long texts, by means of the simple change in attention mechanism compared to the one used in classic Transformers. Put briefly, you have learned... + +- **How to use Longformer based Transformers in your Machine Learning project.** +- **What is necessary for using Longformer for Question Answering, Text Summarization and Masked Language Modeling (Missing Text Prediction).** +- **That Longformer is really capable of handling large texts, as we demonstrate in our examples.** + +I hope that this article was useful to you! If it was, please let me know through the comments 💬 Please do the same if you have any questions or other comments. I'd love to hear from you :) + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +HuggingFace. (n.d.). _Allenai/longformer-base-4096 · Hugging face_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/allenai/longformer-base-4096](https://huggingface.co/allenai/longformer-base-4096) + +HuggingFace. (n.d.). _Valhalla/longformer-base-4096-finetuned-squadv1 · Hugging face_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/valhalla/longformer-base-4096-finetuned-squadv1](https://huggingface.co/valhalla/longformer-base-4096-finetuned-squadv1) + +HuggingFace. (n.d.). _Patrickvonplaten/longformer2roberta-cnn\_dailymail-fp16 · Hugging face_. Hugging Face – On a mission to solve NLP, one commit at a time. [https://huggingface.co/patrickvonplaten/longformer2roberta-cnn\_dailymail-fp16](https://huggingface.co/patrickvonplaten/longformer2roberta-cnn_dailymail-fp16) + +Wikipedia. (2001, November 9). _Germany_. Wikipedia, the free encyclopedia. Retrieved March 12, 2021, from [https://en.wikipedia.org/wiki/Germany](https://en.wikipedia.org/wiki/Germany) diff --git a/tutorial-building-a-hot-dog-not-hot-dog-classifier-with-tensorflow-and-keras.md b/tutorial-building-a-hot-dog-not-hot-dog-classifier-with-tensorflow-and-keras.md new file mode 100644 index 0000000..aa95731 --- /dev/null +++ b/tutorial-building-a-hot-dog-not-hot-dog-classifier-with-tensorflow-and-keras.md @@ -0,0 +1,480 @@ +--- +title: "Tutorial: building a Hot Dog - Not Hot Dog classifier with TensorFlow and Keras" +date: "2020-10-20" +categories: + - "deep-learning" + - "frameworks" +tags: + - "keras" + - "machine-learning" + - "neural-network" + - "tensorflow" +--- + +People who start with creating machine learning models, including the deep learning ones / neural networks that are popular today, often want to start with relatively simple models. They feel as if there is a steep learning curve to getting up to speed with the libraries being used. Truth be told: such a learning curve exists. And designing well-scoped exercises can be of great help when you want to understand how those models work. 
At least, they did for me.

That's why in today's article, we will be creating a relatively simple ConvNet classifier that is capable of distinguishing between Hot Dogs and Not Hot Dogs. Inspired by a television series, we set out to create such a machine learning model by means of Python, TensorFlow, Keras and OpenCV. Don't worry about its complexity: we will explain each part of the model step-by-step and show you how you can neatly structure your model into different parts. This way, you'll be able to grasp the concepts and produce something that is really tangible.

Let's start! 😀

* * *

\[toc\]

* * *

## Ehhh... Hot Dog - Not Hot Dog?

First of all: you might think that I'm a bit weird for making a classifier that can distinguish between hot dogs and non-hot dogs, and perhaps I am. However, take a look at this fragment from HBO's Silicon Valley series:

https://www.youtube.com/watch?v=pqTntG1RXSY

Here, Jian-Yang, portrayed by Jimmy O. Yang, demonstrates a classifier which, to everyone's surprise, turns out to be a [binary one](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/) that can only distinguish between _hotdogs_ and _everything else_ (that is, not hot dog).

As creating such a binary classifier should be relatively simple using today's machine learning libraries, we're going to create a similar classifier. Let's see if we can replicate what they did there!

* * *

## Today's deep learning libraries: TensorFlow and Keras

For doing so, we're going to use two libraries with which you are likely already familiar. For those who are not, let's take a look at them briefly.

> TensorFlow is an end-to-end open source platform for machine learning. It has a comprehensive, flexible ecosystem of tools, libraries and community resources that lets researchers push the state-of-the-art in ML and developers easily build and deploy ML powered applications.
>
> TensorFlow (n.d.)

First, TensorFlow. As we saw in the quote above, it is a library that emerged from Google Brain and represents the convergence of the deep learning ideas that were present within Google. Originally a research project, through the application of those ideas within many Google products such as Speech Recognition, Images and Search, it has transformed into a production-level library for machine learning. It's even made open source: all source code is [publicly available](https://github.com/tensorflow/tensorflow) and can be adapted by the open source community. This has really [boosted](https://trends.google.com/trends/explore?date=all&q=tensorflow) adoption.

https://www.youtube.com/watch?v=oZikw5k\_2FM

One of the key benefits of using a library like TensorFlow is that the **abyss between research-level machine learning and production-level machine learning is removed**. In the past, researchers with "crazy ideas" would use particular software to test out their ideas. If the ideas worked, they would ideally be moved into a production setting with little effort. Unfortunately, the software researchers used was not production ready, e.g. for reasons of scalability. This was a real bottleneck for adopting the state-of-the-art in ML into production. With libraries like TensorFlow, models can easily be moved from research settings into production ones, greatly improving the speed of your organization's ML lifecycle.

One of the key drawbacks of _original_ TensorFlow is that it's difficult.
The learning curve to start working with TensorFlow is steep; in the early days, it cost a lot of time to become a true TensorFlow expert. This is where Keras comes in. **Keras is the high-level API of TensorFlow 2.0**: an approachable, highly-productive interface for solving machine learning problems, with a focus on modern deep learning (Keras Team, n.d.). The keyword here is _iteration_: we don't want to spend a lot of time tracing bugs or other mistakes, but rather want to test a lot of variations of your model, to find which variation works best:

> It provides essential abstractions and building blocks for developing and shipping machine learning solutions with high iteration velocity. Keras empowers engineers and researchers to take full advantage of the scalability and cross-platform capabilities of TensorFlow 2.0: you can run Keras on TPU or on large clusters of GPUs, and you can export your Keras models to run in the browser or on a mobile device.
>
> Keras Team (n.d.)

Together, TensorFlow 2.x and Keras form one of the silver bullets currently in use within the deep learning community. We're also going to use them in today's article. However, let's first take a look at the technical aspects of the machine learning model that we will be creating today: a Convolutional Neural Network.

* * *

## Today's model: a ConvNet-based classifier

A **Convolutional Neural Network** is a type of neural network that is used in Computer Vision and Natural Language Processing tasks quite often due to the fact that it can learn to _extract relevant features from the input data_.

I can imagine that this sounds a bit too difficult already, so I'm going to break things apart. We're going to look at what Convolutional Neural Networks (or ConvNets, or CNNs) are, based on the image that follows next, as well as its components:

![](images/convnet_fig.png)

A CNN architecture. Source: [gwding/draw\_convnet](https://github.com/gwding/draw_convnet)

### What is a ConvNet?

Altogether, a ConvNet is a neural network that can do two things really well if it is trained properly:

1. **Generate predictions for new input samples.**
2. **Extract relevant features from the input data to generate those predictions.**

It is not surprising to find (1) with this class of machine learning models, or with any machine learning model, because it is the essence of the [supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process): training a model with some data in order to make it capable of generating new predictions.

Number (2) is more intriguing, especially if you look at it through the lens of the Computer Vision field, because in the pre-CNN era CV models were not capable of doing this. In fact, researchers and engineers employed a wide variety of [feature extraction techniques](https://en.wikipedia.org/wiki/Feature_extraction) in order to reduce the dimensionality of the input data. As you can imagine, a 200 x 200 pixel RGB image has 40,000 pixels times 3 channels = 120,000 _features_ that the model should be taking into account. In the pre-CNN era, this was a serious bottleneck, and dimensionality had to be reduced - requiring manual work and tweaking.

ConvNets changed this in 2012 (Gershgorn, 2017). In an annual image classification competition, one ConvNet - a type of solution that had never been proposed there before - outranked all the other competitors, which did not use such layers.
The year after, pretty much everyone started using ConvNets. Years later, we've seen another machine learning hype, and ConvNet performance has led to near-100% accuracies in very narrow domains with adequately large datasets. The effect of (2) is truly impressive, to say the least.

We're now going to study the architecture from right to left, because we want to see how it arrives at a particular prediction - and predictions always happen near the end of a neural network. A ConvNet:

- Has an **output layer**, which outputs the predictions of the model. This can either be a [binary prediction or a multiclass/multilabel prediction](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/).
- Has **dense layers** (or _densely-connected layers_), which take as input some features and generate more abstract representations of the patterns captured within those features. In doing so, they make the output layer capable of generating the correct prediction. In the past, features that were input into the first dense layer were collected by human beings. Today, convolutional layers are used.
- Those **convolutional layers** are what make the ConvNet a ConvNet rather than a regular neural network. If you know what happens when you let sunlight move through a magnifying glass, you're already on your way to understanding what ConvNets do. If you perform this activity with sunlight, you'll find that the light converges into a smaller area - essentially, the light's energy gathers there, and colors are more abstract (no clear shapes can be recognized). The same happens within convolutional layers. Input features are the "light", which is transformed through a 'magnifying glass', after which a smaller and more abstract representation is output. By stacking multiple convolutional layers on top of each other (as you can see in the image above, with two Conv layers and two [Max pooling](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/) layers from left to right), you can make the model learn to extract increasingly abstract features. From those lower-dimensional feature representations (called feature maps), the dense layers can generate their predictions.
- Finally, there is an **input layer** where the original input is presented to the neural network.

As we can see, this stack of various layers builds benefit on top of benefit. Today, we'll be using a stack of layers like this one to generate our Hot Dog - Not Hot Dog model. But first, let's take a look at the dataset that we are using.

![](images/pexels-pixabay-268460.jpg)

* * *

## Getting ready for creating the model

Before we can build the model, it's important that you ensure that your development environment is ready for... well, development. Running the machine learning model that we'll create next requires you to have installed the following software and respective version numbers:

- **Python:** version 3.x, preferably 3.6+
- **TensorFlow:** version 2.x, preferably one of the newest versions. `pip install tensorflow`, after installing Python.
- **Numpy**: `pip install numpy`, after installing Python.
- **OpenCV:** `pip install opencv-python`, after installing Python.

It's often best to install those packages in what is called an _environment_, an isolated development area where installs from one project don't interfere with those from another.
Take a look at [Anaconda](http://conda.io) if you want to learn more about this topic and get started with environments. + +* * * + +## Building your model + +Time to get ready for building your model! Open up your development environment, load some folder, and create a Python file - say, `hotdog.py`. Obviously, it's also fine to use a [Jupyter Notebook](https://www.machinecurve.com/index.php/2020/10/07/easy-install-of-jupyter-notebook-with-tensorflow-and-docker/), but then it's a notebook rather than an individual Python file that you create. + +### Adding the necessary imports + +Now it's time to build the model. + +The first step in building it is adding the necessary imports. Primarily, we're using `tensorflow` - and its `tensorflow.keras` sub imports. Specifically, that will be the `Sequential` API for constructing your model (which allows you to stack layers on top of each other using `model.add`), and the `Dense`, `Conv2D` and `Flatten` layers. We also use `numpy`, the `os` util from Python itself as well as OpenCV, by means of `cv2`. + +Make sure to add this code to your `hotdog.py` file: + +``` +import tensorflow +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Conv2D, Flatten +import numpy as np +import os +import cv2 +``` + +### Adding the model configuration + +After specifying the imports, we can add the configuration options for our model: + +``` +# Configuration +img_width, img_height = 25, 25 +input_shape = (img_width, img_height, 1) +batch_size = 10 +no_epochs = 25 +no_classes = 2 +validation_split = 0.2 +verbosity = 1 +``` + +As we shall see, we'll be using 25 x 25 pixel images that are grayscale (hence the `1` in the `input_shape`), use a batch size of 10 (our data set will be relatively small), 25 iterations, 2 classes (not hot dog = 0 / hot dog = 1), and 20% of our data will be used for [validation purposes](https://www.machinecurve.com/index.php/2020/02/18/how-to-use-k-fold-cross-validation-with-keras/). We make the training process verbose, meaning that all results will be printed on screen. + +### Loading and preprocessing of the dataset + +The next step is loading and preprocessing of your dataset. For today's model, we're using the [Hot Dog-Not Hot Dog dataset](https://www.kaggle.com/dansbecker/hot-dog-not-hot-dog). Make sure to create an account at Kaggle in order to download the dataset. After downloading, unzip the data, rename the folder into `hotdog` and move the folder to the folder where your `hotdog.py` file is located. + +To give you an idea about the dataset: it's a few hundred pictures of hot dogs, and a few hundred pictures of foods that aren't hotdogs. Here are four samples: + +![](images/notresized.png) + +After downloading the dataset, it's time to write some code that (1) loads the data from that particular folder and (2) preprocesses it. Here it is: + +``` +# Load data +def load_data(data_type='train', class_name='hot_dog'): + instances = [] + classes = [] + for filepath in os.listdir(f'hotdog/{data_type}/{class_name}'): + resized_image = cv2.imread(f'hotdog/{data_type}/{class_name}/{format(filepath)}', 0) + resized_image = cv2.resize(resized_image, (img_width, img_height)) + instances.append(resized_image) + classes.append(0 if class_name == 'not_hot_dog' else 1) + return (instances, classes) +``` + +If we look at it, we see that this definition - once used - does a couple of things: + +1. It allows you to specify the `data_type` and the `class_name`. 
By default, it attempts to load `hot_dog` images from the `train` folder. +2. It lists the contents of the folder specified with the previous parameters available in the `hotdog` folder. That's why you had to rename the folder and move it to the folder where your Python script is located! +3. It loads the image using `imread`, as a grayscale image - by means of the `0` specification. We don't want colors of images to interfere with the prediction, as it's all about shape. This is especially important in the case of small datasets, which can be biased. That's why we load the images as grayscale ones. +4. We resize the images to 25 x 25 pixels, in line with the model configuration specified above. Resizing is necessary for two reasons. Firstly, images can be really large sometimes, and this can hamper learning. It's usually best to train your models with images relatively small in size. Secondly, your model will accept inputs only when they have the shape of the input specified in the Input layer (which we shall cover next). That's why all images must be using the same number of color channels (that is, either RGB or grayscale, but not both) and be of the same size. +5. We append the resized image to the list of `instances`, and the corresponding [class number](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/) to the list of `classes`. +6. We output a tuple with the `instances` and `classes`. + +After loading and preprocessing, our images should be both in grayscale and resized. Indeed, they now look as follows: + +![](images/resized.png) + +### Creating the model skeleton + +Now we have defined a function for loading and preprocessing the data, we can move on and create a function that creates the model skeleton. Such a skeleton is essentially the representation of the model building blocks - i.e., the architecture. The model itself is not yet alive, and will be instantiated after specifying the skeleton, as we shall see. + +Make sure to add this code for specifying the model skeleton: + +``` +# Model creation +def create_model(): + model = Sequential() + model.add(Conv2D(4, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) + model.add(Conv2D(8, kernel_size=(3, 3), activation='relu')) + model.add(Conv2D(12, kernel_size=(3, 3), activation='relu')) + model.add(Flatten()) + model.add(Dense(256, activation='relu')) + model.add(Dense(no_classes, activation='softmax')) + return model +``` + +The steps performed are simple: a `model` is created with the `Sequential` API, three [convolutional layers](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) are stacked on top of each other (note the increasing number of feature maps with increasing abstractness; we benefit most from learning the abstract representations), a `Flatten` operation which allows the output feature maps to be input by the `Dense` layers, which finally generate a [multiclass probability distribution using Softmax](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/). + +It finally returns the `model` after creating the skeleton. 
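If you want to sanity-check the skeleton before moving on, you can instantiate it and print a summary. This is a quick optional check rather than part of the original script; with the configuration above (25 x 25 grayscale inputs), the commented output shapes follow from the unpadded 3x3 convolutions:

```
# Optional sanity check of the architecture defined in create_model()
model = create_model()
model.summary()
# Expected output shapes (batch dimension shown as None):
#   Conv2D  -> (None, 23, 23, 4)
#   Conv2D  -> (None, 21, 21, 8)
#   Conv2D  -> (None, 19, 19, 12)
#   Flatten -> (None, 4332)
#   Dense   -> (None, 256)
#   Dense   -> (None, 2)
```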
### Instantiating the model

Instantiating the model, _making it alive_, involves the `model` we just built - as well as a compilation step:

```
# Model compilation
def compile_model(model):
    model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
                  optimizer=tensorflow.keras.optimizers.Adam(),
                  metrics=['accuracy'])
    return model
```

Here, we specify things like the [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) (we use [sparse categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/) because our targets, our `y` values, are integers rather than one-hot encoded vectors - it is functionally equal to [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/)). We also specify the optimizer, which can be [gradient descent-based](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or [Adaptive](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), like Adam. In addition, we specify extra metrics - accuracy, in this case. We then return the `model` again to be used by the next step.

### Running the training process

...this next step is actually starting the training process!

We define another function for doing this:

```
# Model training
def train_model(model, X_train, y_train):
    model.fit(X_train, y_train,
              batch_size=batch_size,
              epochs=no_epochs,
              verbose=verbosity,
              shuffle=True,
              validation_split=validation_split)
    return model
```

Here, we accept the `model` as well as the features and corresponding targets from the training set. Using configuration options specified in the model configuration (such as batch size, number of epochs, and verbosity) we start the training process. We do so by calling `model.fit`, which essentially fits the data to the model and attempts to find the [global loss minimum](https://www.machinecurve.com/index.php/2020/02/26/getting-out-of-loss-plateaus-by-adjusting-learning-rates/). Once training has finished, which in our case happens after 25 iterations (or `epochs`), the trained `model` is returned.

### Generating evaluation metrics

We can then test our model by means of applying `model.evaluate`:

```
# Model testing
def test_model(model, X_test, y_test):
    score = model.evaluate(X_test, y_test, verbose=0)
    print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
    return model
```

This function accepts the trained `model` as well as the features and targets of your testing dataset. It evaluates the model with those samples and prints test loss and accuracy. For convenience reasons, this function also returns the trained (and now tested) `model`.

### Connecting the building blocks

What we did in the sections above is create the building blocks of today's machine learning exercise. We didn't connect them yet, which currently makes them rather meaningless as a whole. That's why it's now time to connect the dots, and specifically do these two things:

1. Load and merge training and testing data
2.
Constructing the model + +#### Loading and merging training and testing data + +This step is really easy: + +``` +# CLICKING EVERYTHING TOGETHER +# Load and merge training data +X_train_nh, y_train_nh = load_data(data_type='train', class_name='not_hot_dog') +X_train_h, y_train_h = load_data(data_type='train', class_name='hot_dog') +X_train = np.array(X_train_nh + X_train_h) +X_train = X_train.reshape((X_train.shape[0], img_width, img_height, 1)) +y_train = np.array(y_train_nh + y_train_h) + +# Load and merge testing data +X_test_nh, y_test_nh = load_data(data_type='test', class_name='not_hot_dog') +X_test_h, y_test_h = load_data(data_type='test', class_name='hot_dog') +X_test = np.array(X_test_nh + X_test_h) +X_test = X_test.reshape((X_test.shape[0], img_width, img_height, 1)) +y_test = np.array(y_test_nh + y_test_h) +``` + +For both data sets, we use `load_data` to retrieve our hot dog / not hot dog data, and eventually merge the two sub datasets each time and create a `np.array` with all the training and testing data, respectively. + +#### Constructing the model + +Finally, constructing the model is essentially connecting the functions we defined above: + +``` +# Create and train the model +model = create_model() +model = compile_model(model) +model = train_model(model, X_train, y_train) +model = test_model(model, X_test, y_test) +``` + +### Full model code + +As we now have a functional model, I can imagine that it would be preferable to some people that they can copy and paste the full model code at once. Especially for you, here you go! :) + +``` +import tensorflow +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Conv2D, Flatten +import numpy as np +import os +import cv2 + +# Configuration +img_width, img_height = 25, 25 +input_shape = (img_width, img_height, 1) +batch_size = 10 +no_epochs = 25 +no_classes = 2 +validation_split = 0.2 +verbosity = 1 + +# Load data +def load_data(data_type='train', class_name='hot_dog'): + instances = [] + classes = [] + for filepath in os.listdir(f'hotdog/{data_type}/{class_name}'): + read_image = cv2.imread(f'hotdog/{data_type}/{class_name}/{format(filepath)}', 0) + resized_image = cv2.resize(read_image, (img_width, img_height)) + instances.append(resized_image) + classes.append(0 if class_name == 'not_hot_dog' else 1) + return (instances, classes) + +# Model creation +def create_model(): + model = Sequential() + model.add(Conv2D(4, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) + model.add(Conv2D(8, kernel_size=(3, 3), activation='relu')) + model.add(Conv2D(12, kernel_size=(3, 3), activation='relu')) + model.add(Flatten()) + model.add(Dense(256, activation='relu')) + model.add(Dense(no_classes, activation='softmax')) + return model + +# Model compilation +def compile_model(model): + model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + return model + +# Model training +def train_model(model, X_train, y_train): + model.fit(X_train, y_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + shuffle=True, + validation_split=validation_split) + return model + +# Model testing +def test_model(model, X_test, y_test): + score = model.evaluate(X_test, y_test, verbose=0) + print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + return model + +# CLICKING EVERYTHING TOGETHER +# Load and merge training data +X_train_nh, y_train_nh = load_data(data_type='train', 
class_name='not_hot_dog') +X_train_h, y_train_h = load_data(data_type='train', class_name='hot_dog') +X_train = np.array(X_train_nh + X_train_h) +X_train = X_train.reshape((X_train.shape[0], img_width, img_height, 1)) +y_train = np.array(y_train_nh + y_train_h) + +# Load and merge testing data +X_test_nh, y_test_nh = load_data(data_type='test', class_name='not_hot_dog') +X_test_h, y_test_h = load_data(data_type='test', class_name='hot_dog') +X_test = np.array(X_test_nh + X_test_h) +X_test = X_test.reshape((X_test.shape[0], img_width, img_height, 1)) +y_test = np.array(y_test_nh + y_test_h) + +# Create and train the model +model = create_model() +model = compile_model(model) +model = train_model(model, X_train, y_train) +model = test_model(model, X_test, y_test) +``` + +* * * + +## Running the model: results + +Running the model unfortunately does not provide spectacular results: + +``` +Epoch 1/25 +2020-10-20 21:52:07.059086: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll +2020-10-20 21:52:07.628550: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll +2020-10-20 21:52:09.160540: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows +Relying on driver to perform ptx compilation. This message will be only logged once. +398/398 [==============================] - 4s 9ms/sample - loss: 6.8960 - accuracy: 0.5503 - val_loss: 3.5099 - val_accuracy: 0.1300 +Epoch 2/25 +398/398 [==============================] - 0s 477us/sample - loss: 0.5785 - accuracy: 0.7588 - val_loss: 0.9329 - val_accuracy: 0.6600 +Epoch 3/25 +398/398 [==============================] - 0s 485us/sample - loss: 0.1795 - accuracy: 0.9322 - val_loss: 1.9922 - val_accuracy: 0.3700 +Epoch 4/25 +398/398 [==============================] - 0s 460us/sample - loss: 0.0868 - accuracy: 0.9774 - val_loss: 1.7145 - val_accuracy: 0.4800 +Epoch 5/25 +398/398 [==============================] - 0s 467us/sample - loss: 0.0387 - accuracy: 0.9975 - val_loss: 1.3653 - val_accuracy: 0.5800 +Epoch 6/25 +398/398 [==============================] - 0s 489us/sample - loss: 0.0201 - accuracy: 1.0000 - val_loss: 1.7234 - val_accuracy: 0.5000 +Epoch 7/25 +398/398 [==============================] - 0s 434us/sample - loss: 0.0117 - accuracy: 1.0000 - val_loss: 2.1737 - val_accuracy: 0.4500 +Epoch 8/25 +398/398 [==============================] - 0s 508us/sample - loss: 0.0077 - accuracy: 1.0000 - val_loss: 2.2219 - val_accuracy: 0.4600 +Epoch 9/25 +398/398 [==============================] - 0s 449us/sample - loss: 0.0055 - accuracy: 1.0000 - val_loss: 2.3434 - val_accuracy: 0.4700 +Epoch 10/25 +398/398 [==============================] - 0s 447us/sample - loss: 0.0041 - accuracy: 1.0000 - val_loss: 2.3332 - val_accuracy: 0.4700 +Epoch 11/25 +398/398 [==============================] - 0s 441us/sample - loss: 0.0031 - accuracy: 1.0000 - val_loss: 2.6542 - val_accuracy: 0.4600 +Epoch 12/25 +398/398 [==============================] - 0s 467us/sample - loss: 0.0024 - accuracy: 1.0000 - val_loss: 2.8659 - val_accuracy: 0.4100 +Epoch 13/25 +398/398 [==============================] - 0s 433us/sample - loss: 0.0018 - accuracy: 1.0000 - val_loss: 2.7200 - val_accuracy: 0.4500 +Epoch 14/25 +398/398 [==============================] - 0s 425us/sample - loss: 0.0014 - accuracy: 1.0000 - val_loss: 3.0100 - val_accuracy: 0.4300 +Epoch 15/25 +398/398 [==============================] - 0s 
424us/sample - loss: 0.0010 - accuracy: 1.0000 - val_loss: 3.0143 - val_accuracy: 0.4500 +Epoch 16/25 +398/398 [==============================] - 0s 441us/sample - loss: 8.0917e-04 - accuracy: 1.0000 - val_loss: 3.2440 - val_accuracy: 0.4200 +Epoch 17/25 +398/398 [==============================] - 0s 450us/sample - loss: 6.2771e-04 - accuracy: 1.0000 - val_loss: 3.2514 - val_accuracy: 0.4300 +Epoch 18/25 +398/398 [==============================] - 0s 432us/sample - loss: 5.1242e-04 - accuracy: 1.0000 - val_loss: 3.2235 - val_accuracy: 0.4500 +Epoch 19/25 +398/398 [==============================] - 0s 439us/sample - loss: 4.3272e-04 - accuracy: 1.0000 - val_loss: 3.4012 - val_accuracy: 0.4300 +Epoch 20/25 +398/398 [==============================] - 0s 427us/sample - loss: 3.6860e-04 - accuracy: 1.0000 - val_loss: 3.3259 - val_accuracy: 0.4500 +Epoch 21/25 +398/398 [==============================] - 0s 439us/sample - loss: 3.1128e-04 - accuracy: 1.0000 - val_loss: 3.4801 - val_accuracy: 0.4400 +Epoch 22/25 +398/398 [==============================] - 0s 428us/sample - loss: 2.6993e-04 - accuracy: 1.0000 - val_loss: 3.5010 - val_accuracy: 0.4400 +Epoch 23/25 +398/398 [==============================] - 0s 450us/sample - loss: 2.3627e-04 - accuracy: 1.0000 - val_loss: 3.5777 - val_accuracy: 0.4300 +Epoch 24/25 +398/398 [==============================] - 0s 448us/sample - loss: 2.1087e-04 - accuracy: 1.0000 - val_loss: 3.4808 - val_accuracy: 0.4800 +Epoch 25/25 +398/398 [==============================] - 0s 442us/sample - loss: 1.7988e-04 - accuracy: 1.0000 - val_loss: 3.6907 - val_accuracy: 0.4400 +``` + +With a 52.6% test accuracy, the model performs only slightly better than simply tossing a coin: + +``` +Test loss: 3.108407344818115 / Test accuracy: 0.5260000228881836 +``` + +* * * + +## Summary + +In this article, we looked at how we can create a machine learning model that is capable of distinguishing hot dogs from foods that aren't a hot dog. Inspired by the television series Silicon Valley, we set out to replicate what was shown there. + +For doing so, we first took a look at how Computer Vision models work these days - that is, by means of Convolutional Neural Networks. The benefits of using convolutional layers include the fact that those layers _learn_ what the most important features are, instead of the human need for selecting those features using a variety of techniques. This makes machine learning much more robust, as we have seen since 2012, when the first ConvNet was applied at massive scale. + +We then moved on from theory into practice - and found how we can create a ConvNet that classifies Hot Dog / Not Hot Dog using the similarly-named dataset available on Kaggle. We created Python code using TensorFlow, Keras and OpenCV which is freely available for you and is explained step-by step. + +I hope that you've learnt something interesting today! If you did, please feel free to drop a message in the comments section below 💬 Please do the same if you have questions, have suggestions for improvement of this article, or other comments. Thank you f or reading MachineCurve today and happy engineering! 😎 + +\[kerasbox\] + +* * * + +## References + +Keras Team. (n.d.). _Keras documentation: About Keras_. Keras: the Python deep learning API. [https://keras.io/about/](https://keras.io/about/) + +TensorFlow. (n.d.). [https://www.tensorflow.org/](https://www.tensorflow.org/) + +Gershgorn, D. (2017, July 26). _The data that transformed AI research—and possibly the world_. Quartz. 
[https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/](https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/) + +_Hot dog - Not hot dog_. (n.d.). Kaggle: Your Machine Learning and Data Science Community. [https://www.kaggle.com/dansbecker/hot-dog-not-hot-dog/data](https://www.kaggle.com/dansbecker/hot-dog-not-hot-dog/data) diff --git a/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi.md b/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi.md new file mode 100644 index 0000000..787ff8d --- /dev/null +++ b/tutorial-how-to-deploy-your-convnet-classifier-with-keras-and-fastapi.md @@ -0,0 +1,439 @@ +--- +title: "Tutorial: How to deploy your ConvNet classifier with Keras and FastAPI" +date: "2020-03-19" +categories: + - "deep-learning" + - "frameworks" +tags: + - "api" + - "convolutional-neural-networks" + - "deep-learning" + - "deployment" + - "http" + - "keras" + - "machine-learning" + - "model" +--- + +Training machine learning models is fun - but what if you found a model that really works? You'd love to deploy it into production, so that others can use it. + +In today's blog post, we'll show you how to do this for a ConvNet classifier using Keras and FastAPI. It begins with the software dependencies that we need. This is followed by today's model code, and finally showing you how to run the deployed model. + +Are you ready? Let's go! :) + +* * * + +\[toc\] + +* * * + +## Software dependencies + +In order to complete today's tutorial successfully, and be able to run the model, it's key that you install these software dependencies: + +- FastAPI +- Pillow +- Pydantic +- TensorFlow 2.0+ +- Numpy + +Let's take a look at the dependencies first. + +### FastAPI + +With FastAPI, we'll be building the _groundwork_ for the machine learning model deployment. + +What it is? Simple: + +> FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.6+ based on standard Python type hints. +> +> FastAPI. (n.d.). [https://fastapi.tiangolo.com/](https://fastapi.tiangolo.com/) + +With the framework, we can build a web service that accepts requests over HTTP, allows us to receive inputs, and subsequently send the machine learning prediction as the response. + +Installing goes through `pip`, with `pip install fastapi`. What's more, you'll also need an ASGI (or Asynchronous Server Gateway Interface) server, such as _uvicorn_: `pip install uvicorn`. + +### Pillow + +Then Pillow: + +> Pillow is the friendly PIL fork by [Alex Clark and Contributors](https://github.com/python-pillow/Pillow/graphs/contributors). PIL is the Python Imaging Library by Fredrik Lundh and Contributors. +> +> _Pillow — Pillow (PIL Fork) 3.1.2 documentation_. (n.d.). Pillow — Pillow (PIL Fork) 7.0.0 documentation. [https://pillow.readthedocs.io/en/3.1.x/index.html](https://pillow.readthedocs.io/en/3.1.x/index.html) + +We can use Pillow to manipulate images - which is what we'll do, as the inputs for our ConvNet are images. Installation, once again, goes through `pip`: + +``` +pip install Pillow +``` + +### Pydantic + +Now, the fun thing with web APIs is that you can send pretty much anything to them. For example, if you make any call (whether it's a GET one with parameters or a PUT, POST or DELETE one with a body), you can send any data along with your request. 
+ +Now, the bad thing with such possibility is that people may send data that is incomprehensible for the machine learning model. For example, it wouldn't work if text was sent instead of an image, or if the image was sent in the wrong way. + +Pydantic comes to the rescue here: + +> Data validation and settings management using python type annotations. +> +> Pydantic.[https://pydantic-docs.helpmanual.io/](https://pydantic-docs.helpmanual.io/) + +With this library, we can check whether all data is ok :) + +### TensorFlow 2.0+ + +The need for TensorFlow is obvious - we're deploying a machine learning model. + +What's more, we need TensorFlow 2.0+ because of its deep integration with modern Keras, as the model that we'll deploy is a Keras based one. + +Fortunately, installing TensorFlow is easy - especially when you're running it on your CPU. [Click here to find out how](https://www.tensorflow.org/install). + +### Numpy + +Now, last but not least, Numpy. As we all know what it is and what it does, I won't explain it here :) We'll use it for data processing. + +* * * + +## Today's code + +Next up: the code for today's machine learning model deployment 🦾 It consists of three main parts: + +- Importing all the necessary libraries. +- Loading the model and getting the input shape. +- Building the FastAPI app. + +The latter of which is split into three sub stages: + +- Defining the Response. +- Defining the main route. +- Defining the `/prediction` route. + +Ready? Let's go! :) Create a Python file, such as `main.py`, on your system, and open it in a code editor. Now, we'll start writing some code :) + +### Just a break: what you'll have to do before you go further + +Not willing to interrupt, but there are two things that you'll have to do first before you actually build your API: + +- Train a machine learning model with Keras, [for example with the MNIST dataset](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) (we assume that your ML model handles the MNIST dataset from now on, but this doesn't really matter as the API works with all kinds of CNNs). +- Save the model instance, so that you can load it later. [Find out here how](https://www.machinecurve.com/index.php/2020/02/14/how-to-save-and-load-a-model-with-keras/). + +### Model imports + +The first thing to do is to state all the model imports: + +``` +# Imports +from fastapi import FastAPI, File, UploadFile, HTTPException +from PIL import Image +from pydantic import BaseModel +from tensorflow.keras.models import load_model +from typing import List +import io +import numpy as np +import sys +``` + +Obviously, we'll need parts from `FastAPI`, `PIL` (Pillow), `pydantic` and `tensorflow`, as well as `numpy`. But we'll also need a few other things: + +- For the list data type, we'll use `typing` +- For input/output operations (specifically, byte I/O), we'll be using `io` +- Finally, we'll need `sys` - for listening to Exception messages. + +### Loading the model and getting input shape + +Next, we [load the model](https://www.machinecurve.com/index.php/2020/02/14/how-to-save-and-load-a-model-with-keras/): + +``` +# Load the model +filepath = './saved_model' +model = load_model(filepath, compile = True) +``` + +This assumes that your model is in the new TensorFlow 2.0 format. If it's not, click the link above, as we describe there how to save it in the 1.0 format - this is directly applicable here. 
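As a point of reference, such a folder is what you get when saving a trained Keras model without a file extension. A minimal sketch, assuming you trained some Keras `model` on MNIST in a separate script (the variable and path names are illustrative):

```
# In your training script, after model.fit(...) has completed:
model.save('./saved_model')  # writes the TensorFlow SavedModel folder loaded above
```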
+ +Then, we get the input shape _as expected by the model_: + +``` +# Get the input shape for the model layer +input_shape = model.layers[0].input_shape +``` + +That is, we _wish to know what the model expects_ - so that we can transform any inputs into this shape. We do so by studying the `input_shape` of the first (`i = 0`) layer of our model. + +### Building the FastAPI app + +Second stage already! Time to build the actual groundwork. First, let's define the FastAPI app: + +``` +# Define the FastAPI app +app = FastAPI() +``` + +#### Defining the Response + +Then, we can define the Response - or the output that we'll serve if people trigger our web service once it's live. It looks like this: + +``` +# Define the Response +class Prediction(BaseModel): + filename: str + contenttype: str + prediction: List[float] = [] + likely_class: int +``` + +It contains four parts: + +- The file name, or `filename`; +- The `contenttype`, or the content type that was found +- A `prediction`, which is a list of floats - [remember how Softmax generates outputs in this way?](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) +- A `likely_class`, which is the most likely class predicted by the model. + +#### Defining the main route + +Now, we'll define the main route - that is, when people navigate to your web API directly, without going to the `/prediction` route. It's a very simple piece of code: + +``` +# Define the main route +@app.get('/') +def root_route(): + return { 'error': 'Use GET /prediction instead of the root route!' } +``` + +It simply tells people to use the correct route. + +#### Defining the /prediction route + +The `/prediction` route is a slightly longer one: + +``` +# Define the /prediction route +@app.post('/prediction/', response_model=Prediction) +async def prediction_route(file: UploadFile = File(...)): + + # Ensure that this is an image + if file.content_type.startswith('image/') is False: + raise HTTPException(status_code=400, detail=f'File \'{file.filename}\' is not an image.') + + try: + # Read image contents + contents = await file.read() + pil_image = Image.open(io.BytesIO(contents)) + + # Resize image to expected input shape + pil_image = pil_image.resize((input_shape[1], input_shape[2])) + + # Convert from RGBA to RGB *to avoid alpha channels* + if pil_image.mode == 'RGBA': + pil_image = pil_image.convert('RGB') + + # Convert image into grayscale *if expected* + if input_shape[3] and input_shape[3] == 1: + pil_image = pil_image.convert('L') + + # Convert image into numpy format + numpy_image = np.array(pil_image).reshape((input_shape[1], input_shape[2], input_shape[3])) + + # Scale data (depending on your model) + numpy_image = numpy_image / 255 + + # Generate prediction + prediction_array = np.array([numpy_image]) + predictions = model.predict(prediction_array) + prediction = predictions[0] + likely_class = np.argmax(prediction) + + return { + 'filename': file.filename, + 'contenttype': file.content_type, + 'prediction': prediction.tolist(), + 'likely_class': likely_class + } + except: + e = sys.exc_info()[1] + raise HTTPException(status_code=500, detail=str(e)) +``` + +Let's break it into pieces: + +- We define the route and the response model, and specify as the parameter that a `File` can be uploaded into the attribute `file`. +- Next, we check the content type of the file - to ensure that it's an image (all image content types start with `image/`, like `image/png`). If it's not, we throw an error - `HTTP 400 Bad Request`. 
+- Then, we open up a `try/catch` block, where if anything goes wrong the error will be caught gracefully and nicely sent as a Response (`HTTP 500 Internal Server Error`). +- In the `try/catch` block, we first read the contents of the image - into a Byte I/O structure, which acts as a temporary byte storage. We can feed this to `Image` from Pillow, allowing us to actually _open_ the image sent over the network, and manipulate it programmatically. +- Once it's opened, we resize the image so that it meets the `input_shape` of our model. +- Then, we convert the image into `RGB` if it's `RGBA`, to avoid alpha channels (our model hasn't been trained for this). +- If required by the ML model, we convert the image into grayscale. +- Then, we convert it into Numpy format, so that we can manipulate it, and then _scale the image_ (this is dependent on your model! As we scaled [it before training](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/), we need to do so here too or we get an error) +- Finally, we can generate a prediction and return the Response in the format that we specified. + +* * * + +## Running the deployed model + +That's it already! Now, open up a terminal, navigate to the folder where your `main.py` file is stored, and run `uvicorn main:app --reload` : + +``` +INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit) +INFO: Started reloader process [8960] +2020-03-19 20:40:21.560436: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll +2020-03-19 20:40:25.858542: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll +2020-03-19 20:40:26.763790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: +name: GeForce GTX 1050 Ti with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.4175 +pciBusID: 0000:01:00.0 +2020-03-19 20:40:26.772883: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check. +2020-03-19 20:40:26.780372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 +2020-03-19 20:40:26.787714: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2 +2020-03-19 20:40:26.797795: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: +name: GeForce GTX 1050 Ti with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.4175 +pciBusID: 0000:01:00.0 +2020-03-19 20:40:26.807064: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check. 
+2020-03-19 20:40:26.815504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 +2020-03-19 20:40:29.059590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix: +2020-03-19 20:40:29.065990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0 +2020-03-19 20:40:29.071096: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N +2020-03-19 20:40:29.076811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2998 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1) +INFO: Started server process [19516] +INFO: Waiting for application startup. +INFO: Application startup complete. +``` + +Now, your API has started successfully. + +Time to send a request. I'll use Postman for this, which is a HTTP client that is very useful. + +I'll send this MNIST sample, as my model was trained on the MNIST dataset: + +![](images/image-1.png) + +Specifying all the details: + +[![](images/image-2-1024x248.png)](https://www.machinecurve.com/wp-content/uploads/2020/03/image-2.png) + +Results in this output: + +``` +{ + "filename": "mnist_sample.png", + "contenttype": "image/png", + "prediction": [ + 0.0004434768052306026, + 0.003073320258408785, + 0.008758937008678913, + 0.0034302924759685993, + 0.0006626666290685534, + 0.0021806098520755768, + 0.000005191866875975393, + 0.9642654657363892, + 0.003465399844571948, + 0.013714754022657871 + ], + "likely_class": 7 +} +``` + +Oh yeah! 🎉 + +* * * + +## Summary + +In this blog post, we've seen how machine learning models can be deployed by means of a web based API. I hope you've learnt something today. If you did, please leave a comment in the comments section! :) + +Sorry for the long delay in blogs again and happy engineering. See you soon! 😎 + +### Full model code + +If you wish to obtain the code at once, here you go: + +``` +# Imports +from fastapi import FastAPI, File, UploadFile, HTTPException +from PIL import Image +from pydantic import BaseModel +from tensorflow.keras.models import load_model +from typing import List +import io +import numpy as np +import sys + +# Load the model +filepath = './saved_model' +model = load_model(filepath, compile = True) + +# Get the input shape for the model layer +input_shape = model.layers[0].input_shape + +# Define the FastAPI app +app = FastAPI() + +# Define the Response +class Prediction(BaseModel): + filename: str + contenttype: str + prediction: List[float] = [] + likely_class: int + +# Define the main route +@app.get('/') +def root_route(): + return { 'error': 'Use GET /prediction instead of the root route!' 
} + +# Define the /prediction route +@app.post('/prediction/', response_model=Prediction) +async def prediction_route(file: UploadFile = File(...)): + + # Ensure that this is an image + if file.content_type.startswith('image/') is False: + raise HTTPException(status_code=400, detail=f'File \'{file.filename}\' is not an image.') + + try: + # Read image contents + contents = await file.read() + pil_image = Image.open(io.BytesIO(contents)) + + # Resize image to expected input shape + pil_image = pil_image.resize((input_shape[1], input_shape[2])) + + # Convert from RGBA to RGB *to avoid alpha channels* + if pil_image.mode == 'RGBA': + pil_image = pil_image.convert('RGB') + + # Convert image into grayscale *if expected* + if input_shape[3] and input_shape[3] == 1: + pil_image = pil_image.convert('L') + + # Convert image into numpy format + numpy_image = np.array(pil_image).reshape((input_shape[1], input_shape[2], input_shape[3])) + + # Scale data (depending on your model) + numpy_image = numpy_image / 255 + + # Generate prediction + prediction_array = np.array([numpy_image]) + predictions = model.predict(prediction_array) + prediction = predictions[0] + likely_class = np.argmax(prediction) + + return { + 'filename': file.filename, + 'contenttype': file.content_type, + 'prediction': prediction.tolist(), + 'likely_class': likely_class + } + except: + e = sys.exc_info()[1] + raise HTTPException(status_code=500, detail=str(e)) +``` + +\[kerasbox\] + +* * * + +## References + +FastAPI. (n.d.). [https://fastapi.tiangolo.com/](https://fastapi.tiangolo.com/) + +_Pillow — Pillow (PIL Fork) 3.1.2 documentation_. (n.d.). Pillow — Pillow (PIL Fork) 7.0.0 documentation. [https://pillow.readthedocs.io/en/3.1.x/index.html](https://pillow.readthedocs.io/en/3.1.x/index.html) + +Pydantic.[https://pydantic-docs.helpmanual.io/](https://pydantic-docs.helpmanual.io/) diff --git a/u-net-a-step-by-step-introduction.md b/u-net-a-step-by-step-introduction.md new file mode 100644 index 0000000..32ae064 --- /dev/null +++ b/u-net-a-step-by-step-introduction.md @@ -0,0 +1,217 @@ +--- +title: "U-Net, a step-by-step introduction" +date: "2022-01-28" +categories: + - "deep-learning" +tags: + - "computer-vision" + - "deep-learning" + - "image-segmentation" + - "machine-learning" + - "neural-networks" + - "unet" +--- + +Computer vision has many sub fields - and image segmentation is one of them. By classifying each individual pixel of an image or applying regression to these pixels, it's possible to generate very precise interpretations of input images. + +Such interpretations can be very useful in high-precision fields such as medical biology or autonomous driving, to give just a few examples. + +One of the key architectures in image segmentation is U-Net. In this article, you're going to take a look at the original U-Net architecture proposed in 2015 by Ronneberger et al. You're going to learn about the contracting path, the expansive path, and the skip connections - and how when put together they give a U shape. In addition, we're discussing how modern U-Nets are built up with a variety of ConvNet backbones. + +So, after reading this introduction, you will understand: + +- **What image segmentation is at a pixel level, for both classification and regression.** +- **How U-Net can be used for image segmentation through the contractive & expansive paths using skip connections.** +- **What backbones are being used today in U-Net like architectures.** + +Let's take a look! 
😎 + +* * * + +\[toc\] + +* * * + +## Solving image segmentation + +Classic computer vision approaches using deep learning solutions focused on classification. AlexNet, as created by Krizhevsky et al. (2012), is an example. It uses convolutional layers for feature learning, after which a set of densely-connected layers is attached for assigning a class to the input image. + +This allows you to distinguish between cats and dogs, or [hotdog/no hotdog](https://www.machinecurve.com/index.php/2020/10/20/tutorial-building-a-hot-dog-not-hot-dog-classifier-with-tensorflow-and-keras/), to give just a few examples. + +Now, there is much more that you can do with images! For example, it should be possible to detect objects within them. If you're familiar with deep learning and computer vision already, it's more than likely that you have heard about approaches like YOLO or [Transformers](https://www.machinecurve.com/index.php/2022/01/04/easy-object-detection-with-python-huggingface-transformers-and-machine-learning/) that draw boxes around objects present in an image. + +However, it should be possible to add even more precision to your image. While classic object detectors draw bounding boxes, **image segmentation** approaches perform their work at pixel level. In other words, each pixel is inspected and assigned a class. This allows you to draw very sharp boxes separating your objects. + +![](images/cat.png) + +![](images/catmask.png) + +Source: Parkhi et al. (2012) + +### Classification at pixel level + +If you've built classification models before, you know that Softmax is used to generate a pseudo probability distribution over your target classes. You can then simply take the maximum argument of your Softmaxed output Tensor to find the class that your sample belongs to. This is called a **segmentation mask**. + +At the tail of an image segmentation network, the output Tensor does not represent the whole image - but rather, it's output at pixel level. This means that Softmax will be applied at the pixel level and that you can take the maximum argument for each pixel to find the class it belongs to. + +### For regression, too + +If we simply left the Softmax out, and set the number of output channels to 1 (or > 1 if you have multiple dependent regression variables), you'll get a linear output for each pixel. This is similar to what you would do when [building a regression model](https://www.machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/), but then once again at pixel level. By using a loss function like MAE or MSE, you can now perform image segmentation for regression, too. + +It also means that pretrained models for image segmentation for classification can be adapted for regression scenarios, by simply removing the Softmax activation (or only changing the loss function if in fact, Softmax is applied in the loss function - both are possible). + +Examples of image segmentation for regression are the following: + +- Pixel level depth estimation +- Map value estimation + +In the rest of this article, you will learn about U-Net, which is a model for image segmentation. It was introduced by Ronneberger et al. in a 2015 paper on image segmentation in the biomedical sciences. In the original work, U-Net is used for classification. + +Let's take a look! 😎 + +* * * + +## U-Net: a high-level perspective + +The image below represents the U-Net. As the network is composed of layer groups that are shaped like an U, it's not surprising where the name comes from. 
The left part and foot of the U is called the **contracting path**, whereas the right part is called the **expansive path**. Jointly, and with the help of **skip connections**, U-nets downsample an input image to learn about its salient features, to reconstruct the input (or a derived product, like a segmentation mask) via upsampling. + +![](images/unet-1-1024x868.png) + +Inspired by Ronneberger et al. (2015) + +Let's now take a look at the individual components, to begin with the **contracting path.** + +In their work on U-Net, Ronneberger et al. (2015) started with a regular convolutional neural network. Each ConvNet is what they call a **contracting network**. In more simple terms, this means that the convolutional layers (and possibly pooling layers) scale down the feature maps (outputs) in size. + +For example, if the input to a regular ConvNet is a 32x32x3 image (indeed, that can be a CIFAR-10 sample), a simple 2x2 convolutional layer with regular stride and 12 feature maps would produce a 30x30x12 output. By stacking multiple convolutional layers (and possibly, pooling layers) on top of each other, it becomes possible to scale down the original input to - for example - 10x10x64. + +Recall that this allows convolutional neural networks to learn a hierarchy of features, from the more detailed ones (at the start of the network) to the more coarse-grained ones (towards the tail of the network). Because the network looks a bit like a pyramid (it gets less wide the more downstream we get), it _contracts_, and hence is called a _contracting network_. + +### From contracting to upsampling + +In U-Net, the contracting network is used, but is extended with an _upsampling_ network to reconstruct an output at a specific resolution. This network is called the **expansive path**. + +> The main idea \[..\] is to supplement a usual contracting network by successive layers, where pooling operators are replaced by upsampling operators. +> +> Ronneberger et al. (2015) + +Note that in contemporary variations of U-Net, the output is always of equal size to the input in the width and height dimensions (e.g., 32x32x3 --> 32x32xC). In the original U-Net, however, this was not the case! + +### The need for skip connections + +Now, suppose that you have a working stack of contracting and upsampling layers and thus something that looks very much like a U-Net. It's time to start training and you do so without using pretrained weights, but rather weights initialized by e.g. He init or Xavier init, depending on the activation functions you use. + +When performing a forward pass of your images, they are first passed through the contracting part of your U-Net-like model. Then, from the contracted input, they are upsampled again to find e.g. a mask. When upsampling from the contracted (or _summarized_, to view it more conceptually) input at the end of your contracting network, upsampling to find what looks like the original image is really difficult - because you're doing it blindly! At that point in the network, you have absolutely no information about what the original input looked like - except for the summary. If you would imagine it to be a choice between many doors, you would effectively need to walk through every door, see if it's the right door to move through, and if not walk back and try another. + +> In order to localize, high resolution features from the contracting path are combined with the upsampled output. +> +> Ronneberger et al. 
(2015) + +A little bit of steering towards the correct door would thus be very helpful. In U-Net, this is achieved through **skip connections**, which pass the output of each level in the contracting network to the corresponding level in the upsampling network. Jointly with the current state of upsampling, it is used to upsample even further. By combining the _high-level information about the input image_ (which comes from the end of the contracting network) with the _high-level information about the level_, results are expected to be much better. + +> The resulting network is applicable to various biomedical segmentation problems. +> +> Ronneberger et al. (2015) + +Turned out it would be applicable to more than just biomedical segmentation problems, too! With tens of thousands of citations, the Ronneberger et al. (2015) paper is one of the key papers in deep learning based image segmentation. U-Net is a widely used architecture, and remains one of the possible choices in image segmentation today. + +* * * + +## Individual U-Net building blocks + +Now that you understand the high-level U-Net architecture, it's a good idea to take a look at its individual components. We begin with the contracting path, followed by the expansive path and the skip connections. + +### Contracting path + +![](images/contracting-222x300.png) + +Inspired by Ronneberger et al. (2015) + +The contracting part is composed of multiple building blocks of convolutional layers, the so-called **convolutional blocks**. + +The number of convolutional blocks is configurable, but is five in the original paper. + +Note that each convolutional block, except for the last, lets its output be used as a skip connection, so in the five-block setting we have four skip connections in total. + +> The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2 for downsampling. +> +> Ronneberger et al. (2015) + +Each convolutional block in the U-Net contracting path is composed of two **convolutional layers** with a **3x3** **kernel** size. They are not padded. Recall that when a 3x3 kernel is used with stride 1, image height and width reduce by 2 pixels each. This is clearly visible in the image displaying the contracting path. For example, the input image is contracted from 572x572 pixels to 570x570 and then 568x568 pixels in the convolutional block. + +Each convolutional layer is followed by a ReLU activation function. + +![](images/CNN-onechannel.png) + +Between each convolutional block, a **max pooling** operation is performed. It uses a 2x2 pool size and a stride 2 for downsampling. This means that the input image is cut in half width and height wise. This is compensated with a doubling in feature map size. Since the number of feature maps at the start is 64, we end up with a 5-time double, thus 1024 feature maps at the bottom of the U. + +So, in other words, _relatively_, the U-Net learns fewer features with high resolution and more features with lower resolution. This allows for a better balance in resource use. + +Through the contracting path, an input image is downsampled, and a lot of features are learned. These can now be used for upsampling, both from the output of the contractive path and the skip connections. Let's now take a look at how this is done. 
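Before moving on to the expansive path, it can help to see the contracting block just described in code. Below is a minimal sketch, assuming TensorFlow/Keras; the 572x572x1 input size and the 64 starting filters follow the numbers quoted from Ronneberger et al. (2015), while the function name and the single-block setup are purely illustrative and not the authors' original implementation.

```python
# A minimal sketch of one U-Net contracting block: two unpadded 3x3 convolutions
# with ReLU, followed by 2x2 max pooling with stride 2. Illustrative only.
import tensorflow as tf
from tensorflow.keras import layers

def contracting_block(x, filters):
    # Two 3x3 convolutions without padding: each reduces width/height by 2 pixels
    x = layers.Conv2D(filters, kernel_size=3, padding='valid', activation='relu')(x)
    x = layers.Conv2D(filters, kernel_size=3, padding='valid', activation='relu')(x)
    skip = x                                    # kept aside for the skip connection
    # 2x2 max pooling with stride 2 halves width and height
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    return x, skip

inputs = tf.keras.Input(shape=(572, 572, 1))
x, skip = contracting_block(inputs, filters=64)    # 572 -> 570 -> 568 -> 284
tf.keras.Model(inputs, x).summary()
```

Stacking five of these blocks - doubling `filters` at every level and omitting the pooling operation in the last one - produces the contracting path and the four skip connections described above.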
### Expansive path

![](images/expansive-228x300.png)

Inspired by Ronneberger et al. (2015)

Now that you understand the contracting path, it's time to look at the expansive path. It is similar to what happens in the contracting path, but in reverse.

Instead of downsampling the input image, upsampling is performed. For this reason, the expansive path contains an equal number of blocks compared to the contracting path, but these are **upsampling blocks**.

The general procedure for these upsampling blocks is as follows:

1. Upsampling is performed on the feature maps at a specific level, to effectively go one level upwards in terms of feature map size. Upsampling also halves the number of feature maps.
2. The skip connection is center cropped to the width and height of the upsampled input. Note that the skip connection has the same number of feature maps as the upsampled input.
3. Both are concatenated, with the skip connection first followed by the upsampled input. The number of feature maps is now doubled.
4. With two convolutional layers with 3x3 kernels, the width and height shrink slightly again (the convolutions are unpadded) and the number of feature maps is reduced to the original number. So, for example, when the concatenation yields 512x2 = 1024 feature maps, the Conv layers yield 512 again.

For upsampling, up-convolution layers with a 2x2 kernel and stride 2 are used. The convolutional layers once again have a 3x3 kernel size and stride 1, followed by ReLU activation.

> Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. (...) At the final layer a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes.
>
> Ronneberger et al. (2015)

This means that eventually you'll end up with an upsampled image with the initial number of feature maps. Depending on your problem - for example, classification with two possible classes - a Convolution with a 1x1 kernel is now used to reduce the number of feature maps. The width and height remain the same, but the feature maps are then reduced to - say - 2.

This, of course, can be followed by Softmax activation (either as a separate layer or, preferably, pushed to the loss function) and categorical crossentropy loss for training your image segmentation model for classification. In case you want to use U-Net for regression, you leave Softmax out and let the output be linear, with (most probably) 1 output feature map. In regression scenarios, loss functions like MSE and MAE would be good starting points.

### Skip connections

Now that you understand both the contracting and expansive paths, it's time to look into the final core aspect of U-Nets in more detail. Although briefly mentioned in the previous sections, it's important to understand what happens to the **skip connections**.

Recall from the section about the contractive path that skip connections are generated **at every level, except for the last one**. Indeed, in the image you can see that skip connections emerge from the four upper contractive levels, while the last one (serving as the bottom of the U) does not generate a skip connection.

![](images/skips_example.png)

A skip connection (gray) at the bottom of the U.
Note that the last level has no skip connections. Inspired by Ronneberger et al. (2015) + +These skip connections are then re-used at the expansive layers at the opposite side of the U. Do note, however, that in the original setup of U-Net feature map width and height of the skip connection are unequal to (in fact, larger than) those of the upsampled inputs. In other words, it's not possible to concatenate them straight away, if you were to build U-Net. + +In their work, Ronneberger et al. (2015) fixed this problem by taking a center crop from the skip connection - a crop of the size of the upsampled input. So, for example, if your skip connection has feature maps of size 136 x 136 pixels while the upsampled input has 104 x 104, the 136 x 136 pixels image is center cropped to 104 x 104 pixels by removing (136-104)/2 pixels from the top, bottom, left and right. Now, both can be concatenated. + +> The cropping is necessary due to the loss of border pixels in every convolution. +> +> Ronneberger et al. (2015) + +U-Net greatly benefits from skip connections in the sense that they pass information about "what the network saw" at the particular level in the contracting path, to the expansive path. Recall the door scenario: if you have thousands of options i.e. doors to walk through, you can do so blindly but end up taking many wrong paths - resulting in a long time to find the good direction. If you have some steering, you might make a few judgment errors here and there, but you can find the correct door much quicker. + +### A practical note: different backbones in modern U-Nets + +So far, you have looked at how the U-Net architecture was implemented in the original work by Ronneberger et al. Over the years, many people have experienced with different setups for U-Nets, including pretraining on e.g. ImageNet and then finetuning to their specific image segmentation tasks. + +This means that today, you will likely use a U-Net that no longer utilizes the original architecture as proposed above - but it's still a good starting point, because the contractive path, expansive path and the skip connections remain the same. + +Common backbones for U-Net architectures these days are ResNet, ResNeXt, EfficientNet and DenseNet architectures. Often, these have been pretrained on the ImageNet dataset, so that many common features have already been learned. By using these backbone U-Nets, initialized with pretrained weights, it's likely that you can reach convergence on your segmentation problem much faster. + +That's it! You have now a high-level understanding of U-Net and its components 😎 If you have any questions, comments or suggestions, feel free to leave a message in the comments section below 💬 I will then try to answer you as quickly as possible. For now, thank you for reading MachineCurve today and happy engineering! + +* * * + +## References + +Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). [Imagenet classification with deep convolutional neural networks.](https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf) _Advances in neural information processing systems_, _25_, 1097-1105. + +Ronneberger, O., Fischer, P., & Brox, T. (2015, October). [U-net: Convolutional networks for biomedical image segmentation.](https://arxiv.org/abs/1505.04597) In _International Conference on Medical image computing and computer-assisted intervention_ (pp. 234-241). Springer, Cham. + +Parkhi, O. M., Vedaldi, A., Zisserman, A., & Jawahar, C. V. (2012, June). 
[Cats and dogs.](https://www.robots.ox.ac.uk/~vgg/publications/2012/parkhi12a/parkhi12a.pdf) In _2012 IEEE conference on computer vision and pattern recognition_ (pp. 3498-3505). IEEE. diff --git a/understanding-separable-convolutions.md b/understanding-separable-convolutions.md new file mode 100644 index 0000000..008e979 --- /dev/null +++ b/understanding-separable-convolutions.md @@ -0,0 +1,266 @@ +--- +title: "Understanding separable convolutions" +date: "2019-09-23" +categories: + - "buffer" + - "deep-learning" +tags: + - "convolutional-neural-networks" + - "deep-learning" + - "kernel" + - "machine-learning" +--- + +Over the past years, convolutional neural networks (CNNs) have led to massive achievements in machine learning projects. The class of deep learning models has specifically boomed in computer vision, spawning many applications such as [snagging parking spaces with a webcam and a CNN](https://medium.com/@ageitgey/snagging-parking-spaces-with-mask-r-cnn-and-python-955f2231c400). + +That's great! + +But those networks come at a cost. Training them is relatively costly. Not necessarily in money, because computing power is relatively cheap (the most powerful deep learning instance at AWS costs $33/hour in February 2021), but in time. When you have a massive dataset -which is a necessity when you aim to achieve extremely high performance- you will face substantial training times. It's not uncommon to see that training a deep learning model takes two weeks when the dataset is really big. + +This is especially unfavorable when your goal is to test whether your model works and, thus, when you want to iterate quickly. + +Although the landscape is slowly changing with GPUs that are becoming exponentially powerful, training convolutional neural networks still takes a lot of time. The main culprit: the number of multiplications during the training process. + +**After reading this article, you will understand...** + +- Why traditional convolutions yield good performance, but require many computational resources. +- How spatially separable convolutions can reduce the computational requirements, but that they work in only a minority of cases. +- Why depthwise separable convolutions resolve this problem _and_ achieve computational efficiency. + +Let's take a look! 🚀 + +* * * + +**Update 05/Feb/2021:** ensure that the article is up to date. + +* * * + +\[toc\] + +* * * + +## Summary: how separable convolutions improve neural network performance + +**Convolutional Neural Networks** have allowed significant progress to be made in the area of Computer Vision. This is especially true for really deep networks with many convolutional layers. These layers, however, require significant resources to be trained. For example, one convolutional layer trained on 15x15x3 pixel images will already require more than 45.000 multiplications to be made... per image! + +**Spatially separable convolutions** help solve this problem. They are convolutions that can be separated across their spatial axis, meaning that one large convolution (e.g. the original Conv layer) can be split into smaller ones that when convolved sequentially produce the same result. 
By consequence, the number of multiplications goes down, while the outcome stays the same.

[![](images/CNNaltogether.png)](https://machinecurve.com/wp-content/uploads/2019/09/CNNaltogether.png)

The downside of these convolutions is that they cannot be used everywhere, since only a minority of kernels is spatially separable. To the rescue here are **depthwise separable convolutions**. This technique splits the convolution differently, into a depthwise convolution and a pointwise convolution. The depthwise convolution applies the kernel to each individual channel only. The pointwise convolution then convolves over all channels at once, but only with a 1x1 kernel. As you can see in the image, you get the same result as with the original Conv layer, but at only 20% of the multiplications required. A substantial reduction!

If you wish to understand everything written above in more detail, make sure to read the rest of this article as well 🚀

* * *

## A traditional convolution

Understanding separable convolutions requires understanding traditional ones first. Because I often try to favor [development use](https://machinecurve.com/index.php/mastering-keras/) of deep learning over pure theory, I had to look into the inner workings of those traditional layers again. Since this provides valuable insights (or a valuable recap) about convolutions, and I think you'll better understand separable ones because of it, I'll include my review first.

By consequence, we'll first look into traditional convolutions. This is such a convolution:

![](images/CNN.png)

Specifically, it's the inner workings of the **first** convolutional layer in your neural network: it takes an RGB image as its input.

### RGB image and channels

As you know, RGB images can be represented by their _width_, by their _height_ and by their _channels_.

Channels?

Yes, channels: each RGB image is composed of three channels that each describe the 'colorness' of the particular pixel. They do so at the levels _red_, _green_ and _blue_; hence, it's called an _RGB_ image. Above, you'll therefore see the input represented by a cube that itself is composed of the three RGB channels of width W and height H.

### Kernels

As you see, the convolutional layer also contains N so-called _kernels_. A kernel is a very small piece of 'memory' that through training becomes capable of deriving particular features from the image. Kernels are typically 1x1, 3x3 or 5x5 pixels and they 'slide' over the image:

![](images/Cnn_layer-1.jpg)

What they essentially do is compute element-wise multiplications between the filter and the part of the image currently _under inspection_.

That is, suppose that your filter is 3x3 pixels and currently in the upper left corner of your image. Pixel (1,1) of the image is multiplied with kernel element (1,1); (1,2) with (1,2), and so forth. All those values are summed and subsequently represent _one scalar_ in the feature map, illustrated on the right in the image above.

### Kernels and multiple channels

When N=1, we arrive at the situation above: a two-dimensional box is slid over the image that has one channel and the result is a summary of the image.
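To make this sliding-window arithmetic concrete, here is a tiny NumPy sketch of such a single-channel convolution with stride 1 and no padding; the image and kernel values are made up purely for illustration.

```python
# Sketch of a single-channel convolution with stride 1 and no padding (illustrative values).
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)   # a made-up 5x5 single-channel 'image'
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                  # a made-up 3x3 kernel

out_h, out_w = image.shape[0] - 3 + 1, image.shape[1] - 3 + 1
feature_map = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        # element-wise multiplication of the 3x3 patch with the kernel, then summing
        feature_map[i, j] = np.sum(image[i:i+3, j:j+3] * kernel)

print(feature_map.shape)  # (3, 3): a 5x5 input with a 3x3 kernel yields a 3x3 feature map
```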
What confused me was what happens when there are multiple channels, like in the image we've seen before:

![](images/CNN.png)

The kernel itself here is 3x3x3, and there are N of them; yet, the feature map that is the result of the convolution operation is HxWxN.

I then found this video which perfectly explained what happens:

https://www.youtube.com/watch?v=KTB\_OFoAQcc

In essence, the fact that the kernel is three-dimensional (WxHxM, with M=3 in the RGB situation above) effectively means that a _cube_ convolves over the _multichanneled_ image. Just like the pair-wise multiplications above, the three-dimensional multiplications also result in one scalar value per slide. Hence, each WxHxM kernel produces one channel in the feature map, and using N such kernels yields a feature map with N channels.

* * *

## Traditional convolutions require many resources

Very often, your neural network is not composed of one convolutional layer. Rather, a few of them summarize your image into an abstract representation that can be used for classification with densely-connected layers that behave like [MLPs](https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/).

However, a traditional convolution is expensive in terms of the resources that you'll need during training.

We'll investigate next why this is the case.

Suppose that your training set contains 15x15 pixel RGB images (3 channels!) and that you're using 10 3x3x3 pixel kernels to convolve over your training data.

In _one_ convolution on _one_ input image (i.e., one 3x3x3 slide over the first 3x3x3 pixels of your RGB image), you'll do 3x3x3 = 27 multiplications to find the first scalar value.

However, we chose to use 10 kernels, so we'll have 270 multiplications for the first 3x3 pixels of your image.

Since we're not using padding, the kernel will have to slide over 13 (15-3+1 = 13) patches, both horizontally and vertically. Hence, per image, we'll have to make 270 x 13 x 13 = 45630 multiplications.

We can generalize this to the following formula when we're not using padding:

**Multiplications per image = Kernel width x Kernel height x Number of channels x Number of kernels x Number of vertical slides x Number of horizontal slides**.

Given that the MNIST dataset shipped with [Keras](https://machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) contains ~60k images, of which ~48k are training data, you get the point: convolutions are expensive - and this was only the first convolutional layer.

I'm covering separable convolutions in this blog today because they might be the (partial) answer to these computational requirements. They do the same trick while requiring far fewer resources. Let's start with spatially separable convolutions. Following those, we cover depthwise separable convolutions. For both, we'll show how they might improve the resource requirements for your machine learning projects, and save resources when you're developing convolutional neural nets.

* * *

## Spatially separable convolutions

Spatially separable convolutions, sometimes briefly called _separable convolutions_ (Chollet (2017), although this does not fully cover depthwise separable convolutions), are convolutions that can be separated across their spatial axes.

That is, they can be split into smaller convolutions that, when convolved sequentially, produce the same result.
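Before looking at a concrete separable kernel, here is a small Python sketch of the multiplication-count formula derived above, assuming stride 1 and no padding; the helper name is my own, and the example numbers simply reproduce the 45630 multiplications computed in the previous section. The same kind of count is used below to compare the normal and the separated kernels.

```python
# Sketch of the multiplication-count formula from the previous section (stride 1, no padding).
def multiplications_per_image(kernel_w, kernel_h, channels, n_kernels, img_w, img_h):
    horizontal_slides = img_w - kernel_w + 1
    vertical_slides = img_h - kernel_h + 1
    return kernel_w * kernel_h * channels * n_kernels * vertical_slides * horizontal_slides

# Traditional convolution from above: 10 kernels of 3x3x3 on a 15x15 RGB image
print(multiplications_per_image(3, 3, 3, 10, 15, 15))  # 45630
```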
+ +In [_A Basic Introduction to Separable Convolutions_](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728), Chi-Feng Wang argues that "\[o\]ne of the most famous convolutions that can be separated spatially is the Sobel kernel, used to detect edges": + +\[latex\] \\begin{bmatrix} -1 & 0 & 1 \\\\ -2 & 0 & 2 \\\\ -1 & 0 & 1 \\end{bmatrix} = \\begin{bmatrix} 1 \\\\ 2 \\\\ 1 \\end{bmatrix} \\times \\begin{bmatrix} -1 & 0 & 1 \\end{bmatrix} \[/latex\] + +### Convolution with normal kernel + +Suppose that you're performing a normal convolution operation with this kernel on a 15x15 pixel grayscale image (hence, 1 channel), and only use one kernel and no padding. + +Remember the formula? + +**Multiplications per image = Kernel width x Kernel height x Number of channels x Number of kernels x Number of vertical slides x Number of horizontal slides**. + +Or: 3x3x1x1x13x13 = 1521 multiplications. + +### Spatially separated kernel + +With the above kernel, you would first convolve the 3x1 kernel and subsequently the 1x3 kernel. This yields for both kernels: + +3x1 kernel: 3x1x1x1x13x15 = 585 multiplications. + +1x3 kernel: 1x3x1x1x15x13 = 585 multiplications. + +585 + 585 = **1170 multiplications.** + +Yet, you'll have the same result as with the original kernel! + +Spatially separable kernels can thus yield the same result with fewer multiplications, and hence you require fewer computational resources. + +### The problem with spatially separable kernels + +Then why use traditional convolution at all, you would say? + +Well, this is perfectly illustrated in [_A Basic Introduction to Separable Convolutions_](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728). + +The point is that only a minority of kernels is spatially separable. Most can't be separated that way. If you would therefore rely on spatially separable kernels while training a convolutional neural network, you would limit the network significantly. Likely, the network won't perform as well as the one trained with traditional kernels, even though it requires fewer resources. + +Depthwise separable convolutions might now come to the rescue ;-) + +* * * + +## Depthwise separable convolutions + +A depthwise separable convolution benefits from the same characteristic as spatially separable convolutions, being that splitting the kernels into two smaller ones yields the same result with fewer multiplications, but does so differently. Effectively, two operations are performed in depthwise separable convolutions - sequentially (Geeks for Geeks, 2019): + +1. Depthwise convolutions; +2. Pointwise convolutions. + +### Depthwise convolutions + +As we've seen above, normal convolutions over volumes convolve over the entire volume, i.e. over all the channels at once, producing a WidthxHeightx1 volume for every kernel. Using N kernels therefore produces a WidthxHeightxN volume called the feature map. + +In depthwise separable convolutions, particularly the first operation - the depthwise convolution - this does not happen in that way. Rather, each channel is considered separately, and _one filter per channel_ is convolved over _that channel only_. See the example below: + +![](images/CNNdepthwise-1.png) + +Here, we would use 3 one-channel filters (M=3), since we're interpreting an RGB image. 
Contrary to traditional convolutions, the result is no _end result_, but rather, an intermediate result that is to be interpreted further in the second phase of the convolutional layer, the pointwise convolution. + +### Pointwise convolutions + +From the intermediate result onwards, we can then continue with what are called _pointwise convolutions_. Those are filters of 1x1 pixels but which cover all the M intermediate channels generated by the filters, in our case M=3. + +And since we're trying to equal the original convolution, we need N of them. Remember that a convolution over a volume produces a SomeWidth x SomeHeight x 1 volume, as the element-wise multiplications performed over three dimensions result in a one-dimensional scalar value. If we would thus apply one such pointwise filter, we would end up with a Hfm x Wfm x 1 volume. As the original convolution produced a Hfm x Wfm x N volume, we need N such pointwise filters. + +I visualized this process below: + +![](images/CNNpointwise.png) + +### Depthwise separable convolutions altogether + +When taken altogether, this is how depthwise separable convolutions produce the same result as the original convolution: + +![](images/CNNaltogether.png) + +First, using depthwise convolutions using M filters, an intermediate result is produced, which is then processed into the final volume by means of the pointwise convolutions. Taking those volumes together, M volume x N volume, yields that the operation is equal to the original kernel volume: (3x3x1 times 1x1xM = 3x3xM = 3x3x3, the volume of our N original kernels indeed). Since we have N such filters, we produce the same result as with our N original kernels. + +![](images/CNN-1.png) + +### How many multiplications do we save? + +We recall from convolving with our traditional kernel that we required**3x3x3x10x13x13 = 45630 multiplications** to do so successfully for one image. + +How many multiplications do we need for one image when we're using a depthwise separated convolutional layer? How many multiplications do we save? + +Remember that we used a 15x15 pixel image without padding. We'll use the same for the depthwise separable convolution. We split our calculation into the number of multiplications for the depthwise and pointwise convolutions and subsequently add them together. + +All right, for the depthwise convolution we multiply the _number of convolutions in one full range of volume convolving_times the _number of channels_ times the _number of multiplications per convolution_: + +- Number of convolutions in one full range of volume convolving is Horizontal movements x Vertical movements: + - Horizontal movements = (15 - 3 + 1) = 13 + - Vertical movements = (15 - 3 + 1) = 13 + - One full range of convolving has 13 x 13 = 169 individual convolutions. +- The number of channels is 3, so we do 3 full ranges of volume convolving. +- The number of multiplications per individual convolution equals 3x3x1 since that's the volume of each individual filter. + +Hence, the number of multiplications in the depthwise convolutional operation is 13 x 13 x 3 x 3 x 3 x 1 = **4563**. 
For the pointwise convolution, we compute the _number of convolutions in one full range of volume convolving over the intermediate result_ times the _number of filters_ times the _number of multiplications per convolution_:

- Number of convolutions in one full range of volume convolving is Horizontal movements x Vertical movements:
    - Horizontal movements = 13, since our kernel is 1x1xM;
    - Vertical movements = 13 for the same reason;
    - Note that the intermediate result was reduced from 15x15x3 to 13x13x3, hence the movements above are 13.
    - One full range of convolving therefore has 13 x 13 = 169 individual convolutions.
- The number of filters in our case is N, and we used N = 10 in the original scenario.
- The number of multiplications per convolution in our case is 1x1xM, since that's our kernel volume, and M = 3 since we used 3 channels, hence 3.

So for the pointwise convolution that's 13 x 13 x 10 x 1 x 1 x 3 = **5070**.

Together, that's 5070 + 4563 = **9633** multiplications, down from the original 45630!

That's a substantial reduction in the number of multiplications, while keeping the same result!

* * *

## Recap

Today, we've seen how spatially separable and depthwise separable convolutions might significantly reduce the resource requirements for your convolutional neural networks without - in most cases - giving up accuracy. If you're looking to optimize your convolutional neural network, you should definitely look into those!

In the discussion, we've seen that you're more likely to find those improvements with depthwise separable convolutions, since not many kernels can be split spatially - which is a drawback for your convnets. However, even with depthwise separable convolutions, you'll likely find substantial optimization.

I hope that this blog was useful for understanding those convolutions more deeply - writing about them has at least helped me gain understanding. I therefore definitely wish to thank the articles I reference below for providing many valuable insights and, when you're interested in separable convolutions, I definitely recommend checking them out!

If you have any questions, remarks, comments whatsoever - feel free to leave a comment below! 👇 When possible, I'll happily answer _or_ adapt my blog in order to make it better. Thanks and happy engineering! 😄

* * *

## References

Wang, C. (2018, August 14). A Basic Introduction to Separable Convolutions. Retrieved from [https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)

Geeks for Geeks. (2019, August 28). Depth wise Separable Convolutional Neural Networks. Retrieved from [https://www.geeksforgeeks.org/depth-wise-separable-convolutional-neural-networks/](https://www.geeksforgeeks.org/depth-wise-separable-convolutional-neural-networks/)

Chollet, F. (2017). Xception: Deep Learning with Depthwise Separable Convolutions. _2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)_.
[doi:10.1109/cvpr.2017.195](http://doi.org/10.1109/cvpr.2017.195) diff --git a/understanding-transposed-convolutions.md b/understanding-transposed-convolutions.md new file mode 100644 index 0000000..aa13360 --- /dev/null +++ b/understanding-transposed-convolutions.md @@ -0,0 +1,234 @@ +--- +title: "Understanding transposed convolutions" +date: "2019-09-29" +categories: + - "buffer" + - "deep-learning" +tags: + - "computer-vision" + - "convolutional-neural-networks" + - "deep-learning" + - "machine-learning" + - "transposed-convolution" +--- + +Recently, we've looked at convolutional layers and certain variations to see how they can be used in machine learning problems. Today, we'll focus on a variant called _transposed convolution_, which can be used for upsampling images (making them larger) or finding the original representation of a convolutional filter map. + +We'll first cover a normal convolution before we introduce transposed ones. We do so by means of the convolution matrix. Hope you'll enjoy! + +After reading this tutorial, you will understand... + +- What normal convolutions do. +- How transposed convolutions can be used for reversing the output of your ConvNet. +- Applications of transposed convolutions. + +Let's take a look! 🚀 + +* * * + +**Update 09/Feb/2021:** ensure that article is up to date. + +**Update 01/Mar/2020:** adapted images for the "normal convolution" to make them equal to the convolution matrix example. + +* * * + +\[toc\] + +* * * + +## Summary: understanding transposed convolutions + +Convolutional Neural Networks are used for computer vision projects and can be used to automatically extract features from inputs like photos and videos. These neural networks employ so-called convolutional layers that convolve (slide) over the input image, try to detect patterns, and adapt weights accordingly during the training process - allowing learning to occur. + +Sometimes, however, you want the opposite to happen: invert the output of a convolutional layer and reconstruct the original input. This is for example the case with autoencoders, where you use normal convolutions to learn an encoded state and subsequently decode them into the original inputs. If done successfully, the encoded state can be used as a lower-dimensional representation of your input data, for dimensionality reduction. + +Transposed convolutional layers can be used for this purpose. Rather than performing interpolation, they learn a set of weights that can be used to reconstruct original inputs. They can be trained jointly with convolutional layers during the training process. In this article, we'll cover transposed convolutions in more detail. We'll show you how they work and how they are applied. + +* * * + +## What does a normal convolution do? + +If we wish to understand transposed convolutions, we must be able to compare them with something - and that something, in our case, is a normal convolution. + +Let's look at one: + +[![](images/conv-new.png)](https://www.machinecurve.com/wp-content/uploads/2020/03/conv-new.png) + +More specifically, we're looking at a convolution of a one-channel image: this is likely a grayscale image. Normally, you would convolve over multiple channels, and you would likely use multiple kernels. For the sake of simplicity, our image has one channel and we use N = 1 kernels. + +It must now follow why the 2x2 kernel produces a 2x2 output when convolving over a 3x3 image. I'll briefly recap it next. 
+ +When the convolution process starts, the kernel is placed at the upper left corner. It performs element-wise multiplications and hence, produces a scalar output (a number) for the overlapping area. It then moves one step to the right, performs the same thing, but then cannot move any further to the right. + +It then simply moves one down, if possible, and does the same trick again. Once it can no longer go to the right, it will attempt to move one down, but cannot do so for the simple reason that we've already reached the end of the image. The convolution operation then stops. Note that in the first row, two scalar values were produced, as well as in the second row. These two times two scalar values produce the 2x2 output displayed in the image above. + +_Note that we assume a stride of 1 in this example._ + +If you wish to understand normal convolutions in more detail, I suggest that you take a look at [this post](https://machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) before moving on. + +* * * + +## The goal: reconstructing the original input + +Now what if your goal is to do the opposite: given a _summary_, i.e. the result of the convolution, reconstructing the original input? + +We call this "upsampling". + +Like this: + +[![](images/conv-new2.png)](https://www.machinecurve.com/wp-content/uploads/2020/03/conv-new2.png) + +You have multiple options. + +### Traditional interpolation techniques + +First and foremost, it is possible to use traditional interpolation techniques to make your image larger. For example, you could use bicubic or bilinear interpolation to achieve the result you're interested in. However, they're not too flexible: they simply compute an estimate of the interpolated pixel values based on their surroundings. In the case of making images larger without losing a sense of detail, we might be interested in a _different approach_ - one where the _means of interpolation_ is learnt based on the target data. Regular and transposed convolutions then enter the spotlight. + +### Regular convolutions - or not? + +First, as described more lengthily in [Dumoulin & Francesco (2016)](http://arxiv.org/abs/1603.07285), you can employ a regular convolution operation. This, however, might not be the most efficient route towards reconstructing the original image: + +> Finally note that it is always possible to emulate a transposed convolution with a direct convolution. The disadvantage is that it usually involves adding many columns and rows of zeros to the input, resulting in a much less efficient implementation. +> +> Dumoulin & Francesco (2016) + +(note that the paper gives many examples - it's a good recommendation if you wish to understand it in even greater detail!) + +Let's now look at our third possible approach: a _transposed convolution._ + +* * * + +## Transposed Convolution + +Rather, we must find another way of doing so. Enter the transposed convolution. We'll discuss this way of working next in a multi-stage fashion. Firstly, we describe how forward and backwards passes are normally covered by a convolutional layer and how they are inverted in a transposed convolution. + +Secondly, we represent the normal convolution with a Convolution Matrix - it's the first step to demonstrating the power of a transposed convolution. Additionally, we'll compute the normal convolution output based on this matrix to demonstrate that we'll achieve the same result. 
+ +Subsequently, we'll introduce the transposed convolution based on the Convolution Matrix we defined earlier, and show that it's indeed possible to reconstruct the original input. + +### Representing the normal convolution with a Convolution Matrix + +Let's now see how we can represent the normal convolution by means of a Convolution Matrix. We can use this matrix to demonstrate how the Transposed Convolution works (Dumoulin & Visin, 2016). + +Suppose that we're performing convolutions with a 2x2 kernel on a 3x3 input image, like this: + +![](images/normal_conv.jpg) + +With our understanding of how regular convolutions work, it's not surprising to find that we'll end up with a 2x2 output or feature map: + +![](images/image-2.png) + +We can also represent this operation as a Convolution Matrix. + +**What is a convolution matrix?** + +It's a matrix which demonstrates all positions of the kernel on the original image, like this: + +![](images/conv_matrix-1.jpg) + +One who looks closely, will notice that each row represents a position of the kernel on top of the image: in the first row, for example, the kernel is positioned at the top left of the image. The {1, 2, 0} at the first row of the convolution matrix therefore represents the effect of the convolution at the first row of the input image. The {2, 1, 0} represents the effect of the convolution at the second row of the input image. Since at this point in time, the convolution is not applied in either the 3rd column or the 3rd row, either the third column value of the first and second row _and_ all the third row values are 0. + +Note that when the kernel moves to the right in the second iteration, the current position is represented by the second row of the convolution matrix, and so on. The convolution matrix therefore describes the full convolutional operation of the kernel on the input image. + +![](images/explained-1.jpg) + +### Computing the normal convolution output with a Convolution Matrix + +The convolution matrix can be used to compute the output of a normal convolution. Doing so is really simple, namely, by flattening the input image into a (9x1) feature vector: + +![](images/image-1.png) + +It's possible to represent the 3 x 3 image as an 1 x 9 image instead, which essentially allows you to contain the same amount of data in the same ordering - by breaking the 1 x 9 image apart after each 3rd block and stacking the 1 x 3 blocks together, you'll arrive at the 3 x 3 image again. + +The fun thing is that we can multiply this (9x1) matrix with the (4x9) convolution matrix and hence achieve a (4x9) x (9x1) = (4x1) output: + +![](images/image-6.png) + +When turning it around, breaking it into blocks of two and stacking them vertically, we see that it's the same as what we saw earlier: + +![](images/image-2.png) + +We can thus express the convolutional operation by means of a convolution matrix! + +### From output back to the input: the Transposed Convolution + +Now suppose that this is your input: + +![](images/image-10.png) + +While this is your desired output: + +![](images/image-12.png) + +The energy is preserved while the image was upsampled. That's our goal. + +Or, in other words: you're trying to do the opposite - going backwards from a summarized version of the original input to the original input, rather than creating the summary. 
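As a preview of where we are heading, here is a small NumPy sketch of both directions: the forward pass we just expressed as a matrix product, and the transposed matrix that the next paragraphs introduce to go back from the 4-element summary to a 3x3 output. The kernel and input values below are made up for illustration and do not match the numbers in the figures.

```python
# Sketch: a 2x2 kernel on a 3x3 input expressed as a (4x9) convolution matrix,
# plus the transposed matrix used to map the 4-element output back to 3x3.
# Kernel and input values are illustrative only.
import numpy as np

k = np.array([[1., 2.],
              [3., 4.]])                          # illustrative 2x2 kernel
image = np.arange(9, dtype=float).reshape(3, 3)   # illustrative 3x3 input

# Build the convolution matrix: one row per kernel position (2x2 output -> 4 rows)
conv_matrix = np.zeros((4, 9))
for row, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    placed = np.zeros((3, 3))
    placed[i:i+2, j:j+2] = k                      # place the kernel at this position
    conv_matrix[row] = placed.flatten()

output = conv_matrix @ image.flatten()            # (4x9) x (9x1) = (4x1)
print(output.reshape(2, 2))                       # same as sliding the kernel directly

# Going backwards: the transposed matrix maps the 4-element summary to 9 values (3x3)
reconstructed = conv_matrix.T @ output            # (9x4) x (4x1) = (9x1)
print(reconstructed.reshape(3, 3))
```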
+ +We'll now find why we call the type of convolution _transposed_, as we can also represent this by means of the convolution matrix - although not the original one, but its transpose: + +![](images/image-5.png) + +a.k.a. the one where the columns have become rows and vice-versa. + +The input can also be represented as follows: + +![](images/image-11.png) + +You may now have noticed that once again, we're in a situation in which we can perform a matrix multiplication: we now have a (9x4) matrix and a (4x1) matrix, which we can multiply to arrive at a (9x1) matrix or, when broken apart, the (3x3) matrix we were looking for! + +![](images/image-13.png) + +### Implementations in deep learning frameworks: normal convolution, transposed matrix + +There are effectively two ways of implementing the transposed convolution (Theano, n.d.): + +- By applying a regular convolution with many zeroes in padding, to effectively add a _fractional_ stride - however, this is very inefficient as you'll substantially increase the number of multiplications that are necessary to find your results. +- By applying a regular convolution, however also swapping the so-called _forward_ and _backwards pass_. That is, in a normal convolution, when data is fed forward, the output shape gets smaller due to the properties of such convolutions. However, when you swap them, you'll effectively compute the backwards pass - and make the input larger. This _swap_ can be achieved by transposing the convolution matrix, indeed similar as to what we just saw above! Hence, also, the name _transposed_ convolutions. + +Frameworks such as TensorFlow and Theano implement transposed convolutions in this way or by using a very similar one. + +### Learnable kernels are what make Transposed convolutions different + +Now, one may wonder: + +**Why should I use transposed convolutions rather than traditional interpolation techniques?** + +Although this is slightly dependent on why you intend to use such convolutions, there may be very good reasons for doing so. + +Note that in both the **regular convolution matrix** and the **transposed one**, the non-zero fields are determined by the _kernel_. + +And the kernel is _learnt over time_, as during model optimization they are adapted continuously to better reflect the relationships underlying your dataset. + +What this means is that, contrary to regular interpolation techniques, you can _learn_ kernels first (e.g. by applying regular convolution operations on your data), and subsequently using them to define your _transposed convolution_, in order to find your original data - or at least, something that hopefully looks like it. + +This, in itself, allows one to use them for very interesting applications. + +* * * + +## Applications of transposed convolutions + +Firstly, it is of course possible to perform upsampling operations with transposed convolutions. That is, when you have a smaller image, you can make it larger by applying transposed convolutions. Note that you'll have to learn the weights in order to do so - which might mean that you're better off with traditional interpolation if you want fast results. + +Secondly, transposed convolutions can be used in [Generative Adversarial Networks](https://www.machinecurve.com/index.php/2019/07/17/this-person-does-not-exist-how-does-it-work/) (Shibuya, 2019). 
Those networks randomly generate a small matrix and use fractionally-strided convolutions (another name to describe transposed convolutions, but then perhaps in the relatively inefficient implementation of regular convolutions with fractional strides) to upsample them to true images. The weights that have been learnt in the process allow for the upsampling of the random noise to the actual image. + +Thirdly, they're also used in _segmantic segmentation_, where you wish to [classify](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/#what-is-a-classifier) each pixel of an image into a certain category (Shibuya, 2019). They work by generating predictions for intermediate results achieved with convolutions, subsequently upsamling those to find predictions for the originally-shaped input image. + +Finally, and perhaps more recently, they are used in what is called a _convolutional autoencoder_. In those, convolutional layers are used to find an _encoding_ for some input, i.e. a representation of the original input in much lower dimensionality. A clear example would be a radar image with a landmine and one without a landmine; for the latter, one could train an autoencoder to find a particular encoding. However, autoencoders also contain a _decoding_ side: given some encoded input, it attempts to find the original output (by, unsurprisingly, upsampling layers such as transposed convolutions). By comparing the original input image with the output image, it's possible to identify whether the input belongs to a certain class. In our case, that would be landmine yes/no, and 'yes' would only be the case if the decoded encoding really looks like the original image, since then the landmine-trained encoding worked well. + +* * * + +## Summary + +In this blog, we've seen how transposed convolutions work and what they can be used for. We've looked at convolution matrices that can be used for upsampling and demonstrated how both the regular convolution matrix and the transposed one are defined by the kernels that can be learnt. Interesting application areas are upsampling, GANs, semantic segmentation and autoencoders. + +Thank you for reading! I hope this blog has made things a bit more clear :-) Would you kindly let me know in the comments? 👇 If not - or if you have questions or remarks - please feel free to write a comment as well! I'm more than happy to reply. Should I've made a mistake, which can happen, I'll gladly adapt my blog and note you as a contributor. Thank you! + +* * * + +## References + +Dumoulin, Vincent, en Francesco Visin. “A guide to convolution arithmetic for deep learning”. arXiv:1603.07285 \[cs, stat\], March 2016. arXiv.org, [http://arxiv.org/abs/1603.07285](http://arxiv.org/abs/1603.07285). + +Theano. (n.d.). Convolution arithmetic tutorial — Theano 1.0.0 documentation. Retrieved from [http://deeplearning.net/software/theano/tutorial/conv\_arithmetic.html#transposed-convolution-arithmetic](http://deeplearning.net/software/theano/tutorial/conv_arithmetic.html#transposed-convolution-arithmetic) + +Shibuya, N. (2019, January 1). Up-sampling with Transposed Convolution. 
Retrieved from [https://medium.com/activating-robotic-minds/up-sampling-with-transposed-convolution-9ae4f2df52d0](https://medium.com/activating-robotic-minds/up-sampling-with-transposed-convolution-9ae4f2df52d0) diff --git a/upsampling2d-how-to-use-upsampling-with-keras.md b/upsampling2d-how-to-use-upsampling-with-keras.md new file mode 100644 index 0000000..661316c --- /dev/null +++ b/upsampling2d-how-to-use-upsampling-with-keras.md @@ -0,0 +1,589 @@ +--- +title: "UpSampling2D: how to use upsampling with Keras?" +date: "2019-12-11" +categories: + - "deep-learning" + - "frameworks" +tags: + - "autoencoder" + - "conv2d" + - "conv2dtranspose" + - "convolutional-neural-networks" + - "deep-learning" + - "keras" + - "machine-learning" + - "transposed-convolution" + - "upsampling2d" +--- + +The _Convolutional layers_ section of the Keras API contains the so-called UpSampling2D layer. But what does it do? And how can it be used in real neural networks? This is not clear up front, but there are some interesting applications. + +In today's blog post, we'll cover the concept of upsampling - first with a very simple example using UpSampling2D and bilinear interpolation. We then extend this idea to the concept of an autoencoder, where the Keras upsampling layer can be used together with convolutional layers in order to construct (or reconstruct) some image based on an encoded state. This shows how UpSampling2D can be used with Keras. Of course, we'll also cover the differences with transposed convolutions - being the Conv2DTranspose layer. + +All right, let's go! 😀 + +\[toc\] + +## What is upsampling? + +Suppose that you have the following list: + +`[0. 0.5 1. 1.5]` + +...which can be reshaped into a (2, 2) image: + +``` +[[0. 0.5] + [1. 1.5]] +``` + +That, in turn can be visualized as follows: + +[![](images/pre_up_plot.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/pre_up_plot.png) + +It's a bit blocky, isn't it? + +Wouldn't it be a good idea if we applied some _smoothing_ here, so that we could get something like this? + +[![](images/af_ups.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/af_ups.png) + +That's a lot better, isn't it? 😊 + +We just applied an upsampling operation - we made the image larger and larger (look at the axes!), yet also applied _interpolation_, hence averaging, creating the nice smoothness. + +### Simple upsampling example with Keras UpSampling2D + +Keras, the deep learning framework I really like for creating deep neural networks, provides an upsampling layer - called [UpSampling2D](https://keras.io/layers/convolutional/#upsampling2d) - which allows you to perform this operation within your neural networks. In fact, the plots were generated by using the Keras Upsampling2D layers in an upsampling-only model. + +Let's see how we did that, understanding upsampling in more detail, before we move on to more advanced examples. + +``` +import keras +from keras.models import Sequential +from keras.layers import UpSampling2D +import matplotlib.pyplot as plt +import numpy as np +``` + +First, we import some libraries that we need: + +- Keras, being the deep learning framework that provides the UpSampling2D layer. +- The Sequential API, which we will use to stack multiple UpSamplign2D layers on top of each other. +- UpSampling2D itself, of course. +- Matplotlib, more specifically its PyPlot library, to generate the visualizations. +- Numpy, to reshape the original list into an image-like format (see the example above, with the four-number list). 
+ +Then, we generate some data and measure some input-related values, such as the shape, as well as the shape of the entire _model input_ (which requires some notion about [image channels](https://en.wikipedia.org/wiki/Channel_(digital_image)), hence adding an extra 1): + +``` +# Generate some data +input_flattened = np.arange(0, 2, 0.5) +input_image = np.reshape(input_flattened, (2, 2, 1)) +input_image_shape = np.shape(input_image) +input_image_shape = (input_image_shape[0], input_image_shape[1], 1) +``` + +Next, we specify a simple model architecture: + +``` +# Create the model +model = Sequential() +model.add(UpSampling2D((32, 32), input_shape=input_image_shape, interpolation='bilinear')) +model.summary() +model.summary() +``` + +As you can see, we use UpSampling2D five times. The settings are to be understood as follows: + +- (32, 32) is the _size_ of the upsampling operation - i.e., how many times upsampling must take place. In our case, we upsample 32 times. +- The input shape is the shape of the model input that we just determined before. +- The interpolation setting is the choice for interpolation algorithm you use - it's possible to use [bilinear](https://en.wikipedia.org/wiki/Bilinear_interpolation) and [nearest neighbor](https://en.wikipedia.org/wiki/Nearest-neighbor_interpolation) interpolation. + +Next, we generate a 'prediction' - even though we already know the outcome of our Upsampling operation :) + +``` +# Perform upsampling +model_inputs = np.array([input_image]) +outputs_upsampled = model.predict(model_inputs) + +# Get output +output_upsampled = outputs_upsampled[0] +``` + +Finally, we visualize the original and the upsampled version together: + +``` +# Visualize input and output +fig, axes = plt.subplots(1, 2) +axes[0].imshow(input_image[:, :, 0]) +axes[0].set_title('Original image') +axes[1].imshow(output_upsampled[:, :, 0]) +axes[1].set_title('Upsampled input') +fig.suptitle(f'Original and upsampled input, method = bilinear interpolation') +plt.show() +``` + +Which produces the following result: + +[![](images/simple_upsampling.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/simple_upsampling.png) + +Unsurprisingly, that's quite equal to the examples we saw above 😎 + +If you wish to get the code of this simple example in full, here you go: + +``` +import keras +from keras.models import Sequential +from keras.layers import UpSampling2D +import matplotlib.pyplot as plt +import numpy as np + +# Generate some data +input_flattened = np.arange(0, 2, 0.5) +input_image = np.reshape(input_flattened, (2, 2, 1)) +input_image_shape = np.shape(input_image) +input_image_shape = (input_image_shape[0], input_image_shape[1], 1) + +# Create the model +model = Sequential() +model.add(UpSampling2D((32, 32), input_shape=input_image_shape, interpolation='bilinear')) +model.summary() +model.summary() + +# Perform upsampling +model_inputs = np.array([input_image]) +outputs_upsampled = model.predict(model_inputs) + +# Get output +output_upsampled = outputs_upsampled[0] + +# Visualize input and output +fig, axes = plt.subplots(1, 2) +axes[0].imshow(input_image[:, :, 0]) +axes[0].set_title('Original image') +axes[1].imshow(output_upsampled[:, :, 0]) +axes[1].set_title('Upsampled input') +fig.suptitle(f'Original and upsampled input, method = bilinear interpolation') +plt.show() +``` + +* * * + +## Applying UpSampling2D in neural networks + +All right - while that example was very simple, it's likely not why you're here: you wish to understand how UpSampling2D can be used in 
_real_ applications. + +That makes sense, so let's take a look at one advanced application: autoencoders. + +### Advanced usage of UpSampling2D: autoencoders + +This is an autoencoder: + +[![](images/Autoencoder.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/Autoencoder.png) + +In short, it's a network that is composed of the following components: + +- Some input, which in the case above is an _image_ (this is not necessary per se). +- An _encoder_ function, which has been learnt and encodes the input into lower-dimensional form. +- By consequence, an encoded state, which represents the encoding. +- A _decoder_ function, which has also been learnt and attempts to decode the encoded state into what has been learnt. In our case, this was the reconstructed image, but it may also be something entirely different. + +There is a large number of autoencoder types, but if we are speaking about convolutional autoencoders, it's possible to build them with [transposed convolutions](https://www.machinecurve.com/index.php/2019/09/29/understanding-transposed-convolutions/) (in Keras: [Conv2DTranspose](https://www.machinecurve.com/index.php/2019/12/10/conv2dtranspose-using-2d-transposed-convolutions-with-keras/)) or with upsampling (UpSampling2D, what we saw above) and regular convolutions. Click the links if you wish to know more about this first approach. In this post, we'll cover the latter. + +### How is UpSampling2D different from Conv2DTranspose? + +What's important, before we actually continue and create a Keras model based on UpSampling2D and Conv2D layers, is to understand that it is similar to Conv2DTranspose, but slightly different (StackExchange, n.d.). + +First, upsampling layers are not trainable. In Keras, the Tensorflow backend simply calls the function `resize_images`, which simply resizes the image by means of interpolation (StackExchange, n.d.). Transposed convolutions are trainable: while upsampling layers use a mathematical definition (i.e., interpolation), transposed convolutions _learn_ how to upsample, and are hence highly data-specific. + +One argument in favor of upsampling layers could thus be that you have data required to be upsampled, while sharing quite some differences within the dataset. If you have relatively similar data (such as the MNIST digits), then transposed convolutions might be the better choice. This all depends on you. + +Second, upsampling layers do not suffer from the so-called checkerboard effect - while transposed convolutions do, if you don't configure them well (Odena et al., 2016). Because of the way convolutions (and also transposed convolutions) slide over the (encoded) image, if you don't configure your stride and kernel size well, they overlap, producing checkerboard-like structures in your image. [Take a look at this post to find a really good explanation with examples.](https://distill.pub/2016/deconv-checkerboard/) + +### Why UpSampling2D and Conv2D must be used together + +When performing such an upsampling operation, e.g. with the Upsampling2D layer in Keras, you must always apply Conv2D as well? + +The why is explained very well in chapter 4 of [“A guide to convolution arithmetic for deep learning”](https://arxiv.org/abs/1603.07285) by Dumoulin & Visin (2016): the combination of upsampling and the convolution, if applied well, equals the effect of the transposed convolution. 
+
+## Building your model with Keras
+
+Let's now see if we can actually build the model with Keras!
+
+### What we'll create today
+
+Remember that picture of the autoencoder?
+
+[![](images/Autoencoder.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/Autoencoder.png)
+
+That's what we will build, and it looks like this:
+
+![](images/model-3-241x1024.png)
+
+We first use Conv2D and MaxPooling layers to downsample the image (i.e., the encoder part), and subsequently use UpSampling2D and Conv2D to upsample it into our desired format (i.e., the decoder part, which in our case attempts reconstructing the original input). Note that the upsampling and convolutional layers [must be used together](#why-upsampling2d-and-conv2d-must-be-used-together), because their combined effect is equivalent to that of [transposed convolutions](https://www.machinecurve.com/index.php/2019/12/10/conv2dtranspose-using-2d-transposed-convolutions-with-keras/). Traditionally, before Conv2DTranspose was available and fast enough, upsampling and Conv2D were really popular, and this combination was even used by François Chollet, the creator of the Keras framework (Keras Blog, n.d.).
+
+Today, the general consensus is this: "While the transpose convolution is more efficient, the article advocates for upsampling + convolution since it does not suffer from the checkerboard artifact" (StackExchange, n.d.). For your practical setting, check whether you are sensitive to the checkerboard effect (TL;DR: it happens more often when you have image-like data with very flashy colors and high contrasts), and choose your approach accordingly (TL;DR: the checkerboard effect with Conv2DTranspose can be avoided by configuring your stride and kernel size correctly, see Odena et al. 2016).
+
+### What you'll need to run the model
+
+Now, let's see what we need to run the model:
+
+- **Python**, obviously, since we create the code in this language.
+- **Keras**, as well, which is the deep learning framework we're using today.
+- One of the Keras backends - and preferably **Tensorflow** (or Tensorflow GPU), given its deep integration with Keras today.
+- **Matplotlib**, for generating visualizations (not mandatory, but you'll have to remove a few lines of code later if you wish to omit it).
+- **Numpy**, for numerical processing.
+
+### Model imports
+
+Open up your file explorer, and at a location of your choice, create a Python file called `upsampling2d.py`. Open this file in your code editor, and let's start coding 😊
+
+We first import the things we need:
+
+```
+import keras
+from keras.datasets import mnist
+from keras.models import Sequential
+from keras.layers import Conv2D, UpSampling2D, MaxPooling2D
+import matplotlib.pyplot as plt
+import numpy as np
+```
+
+We'll need the `mnist` dataset as we're going to use it for training our autoencoder. We need the `Sequential` API for stacking all the layers, in this case being `Conv2D`, `UpSampling2D` and `MaxPooling2D` (check the architectural diagram above to see where they fit in).
+
+Additionally, we need the Matplotlib Pyplot library and Numpy.
+
+### Configuration options
+
+Next, we specify some configuration options:
+
+```
+# Model configuration
+img_width, img_height = 28, 28
+batch_size = 25
+no_epochs = 25
+no_classes = 10
+validation_split = 0.2
+verbosity = 1
+```
+
+As we're using the MNIST dataset today (see the image below), we set width and height to 28 pixels. We use a batch size of 25, which allows us to capture slightly more gradient accuracy with respect to the balance between [batch gradient descent and stochastic gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) (even though we don't use a plain gradient descent optimizer, the effect is similar). The number of classes is, by definition of the distinct number of digits available, ten - zero to nine. We use 20% of our training data for validation, and set verbosity to 1, outputting everything on screen. While this slows down the training process slightly, it helps you understand and see what happens. Set it to 0 if you wish to suppress this output.
+
+[![](images/mnist.png)](https://www.machinecurve.com/wp-content/uploads/2019/07/mnist.png)
+
+### Loading & preparing data
+
+Next, we load, reshape, cast and normalize the data:
+
+```
+# Load MNIST dataset
+(input_train, target_train), (input_test, target_test) = mnist.load_data()
+
+# Reshape data
+input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
+input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
+input_shape = (img_width, img_height, 1)
+
+# Parse numbers as floats
+input_train = input_train.astype('float32')
+input_test = input_test.astype('float32')
+
+# Normalize data
+input_train = input_train / 255
+input_test = input_test / 255
+```
+
+As we use the MNIST dataset, it makes sense to use the Keras API, which provides this dataset out of the box. It must however be reshaped into the correct shape, being the image width, image height and _one channel_. Subsequently, we cast the data type into float32 format, which presumably speeds up the training process. Finally, we normalize the data to the \[latex\]\[0, 1\]\[/latex\] range, which benefits the optimization process.
+
+### Creating the model architecture
+
+Next, we create the model architecture in line with the architectural visualization from earlier:
+
+```
+# Create the model
+model = Sequential()
+model.add(Conv2D(8, (2, 2), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=input_shape))
+model.add(MaxPooling2D((2, 2), padding='same'))
+model.add(Conv2D(8, (2, 2), activation='relu', kernel_initializer='he_uniform', padding='same'))
+model.add(MaxPooling2D((2, 2), padding='same'))
+model.add(Conv2D(8, (2, 2), strides=(2,2), activation='relu', kernel_initializer='he_uniform', padding='same'))
+model.add(Conv2D(8, (2, 2), activation='relu', kernel_initializer='he_uniform', padding='same'))
+model.add(UpSampling2D((2, 2), interpolation='bilinear'))
+model.add(Conv2D(8, (2, 2), activation='relu'))
+model.add(UpSampling2D((2, 2), interpolation='bilinear'))
+model.add(Conv2D(8, (2, 2), activation='relu', kernel_initializer='he_uniform', padding='same'))
+model.add(UpSampling2D((2, 2), interpolation='bilinear'))
+model.add(Conv2D(1, (2, 2), activation='sigmoid', padding='same'))
+
+model.summary()
+```
+
+We use the Conv2D, MaxPooling2D and UpSampling2D layers as defined before. What's important to note is that we use `bilinear` interpolation, which empirically does not produce different results from `nearest` interpolation - at least in this case.
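+
+Note that this similarity refers to the reconstructions of the _trained_ autoencoder - on raw data, the two interpolation modes do produce different numbers. If you want to inspect that difference yourself, here is a small standalone sketch with a toy input that is not part of this tutorial:
+
+```
+import numpy as np
+from keras.models import Sequential
+from keras.layers import UpSampling2D
+
+# A 2x2 toy input with one channel, wrapped in a batch dimension
+toy_input = np.array([[[[1.0], [2.0]], [[3.0], [4.0]]]])
+
+for mode in ['bilinear', 'nearest']:
+    # A model that only upsamples the input by a factor of 2
+    upsampler = Sequential()
+    upsampler.add(UpSampling2D((2, 2), interpolation=mode, input_shape=(2, 2, 1)))
+    print(mode)
+    print(upsampler.predict(toy_input)[0, :, :, 0])
+```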
+
+One more thing: as we activate with `relu`, [we must use He init](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/), and hence we do so.
+
+The `model.summary()` call generates a nice summary on the fly:
+
+```
+_________________________________________________________________
+Layer (type)                 Output Shape              Param #
+=================================================================
+conv2d_1 (Conv2D)            (None, 28, 28, 8)         40
+_________________________________________________________________
+max_pooling2d_1 (MaxPooling2 (None, 14, 14, 8)         0
+_________________________________________________________________
+conv2d_2 (Conv2D)            (None, 14, 14, 8)         264
+_________________________________________________________________
+max_pooling2d_2 (MaxPooling2 (None, 7, 7, 8)           0
+_________________________________________________________________
+conv2d_3 (Conv2D)            (None, 4, 4, 8)           264
+_________________________________________________________________
+conv2d_4 (Conv2D)            (None, 4, 4, 8)           264
+_________________________________________________________________
+up_sampling2d_1 (UpSampling2 (None, 8, 8, 8)           0
+_________________________________________________________________
+conv2d_5 (Conv2D)            (None, 7, 7, 8)           264
+_________________________________________________________________
+up_sampling2d_2 (UpSampling2 (None, 14, 14, 8)         0
+_________________________________________________________________
+conv2d_6 (Conv2D)            (None, 14, 14, 8)         264
+_________________________________________________________________
+up_sampling2d_3 (UpSampling2 (None, 28, 28, 8)         0
+_________________________________________________________________
+conv2d_7 (Conv2D)            (None, 28, 28, 1)         33
+=================================================================
+Total params: 1,393
+Trainable params: 1,393
+Non-trainable params: 0
+```
+
+Only 1.4K trainable parameters. Shouldn't be too difficult to train this model :-)
+
+### Compiling model & fitting data
+
+Next, we compile the model and fit the data:
+
+```
+# Compile and fit data
+model.compile(optimizer='adam', loss='binary_crossentropy')
+model.fit(input_train, input_train,
+        epochs=no_epochs,
+        batch_size=batch_size,
+        validation_split=validation_split)
+```
+
+Compiling is done with the default choices - being the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) and [binary crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/). As we wish to reconstruct the original input, we set `input_train` to be both the _input_ and the _target_, and further configure the number of epochs, batch size and validation split as configured before.
+
+### Generating & visualizing reconstructions
+
+Generating reconstructions and visualizing them simply boils down to two things:
+
+- Taking the first _n_ (in our case, `n = 8`) samples from the test set (note that you may select these samples in any way you like), and subsequently generating predictions (i.e. an encoded state followed by a reconstruction) for these inputs.
+- Once they have been generated, visualizing input and reconstruction together, per sample.
+ +It can be done with the following code: + +``` +# Generate reconstructions +num_reconstructions = 8 +samples = input_test[:num_reconstructions] +targets = target_test[:num_reconstructions] +reconstructions = model.predict(samples) + +# Plot reconstructions +for i in np.arange(0, num_reconstructions): + # Get the sample and the reconstruction + sample = samples[i][:, :, 0] + reconstruction = reconstructions[i][:, :, 0] + input_class = targets[i] + # Matplotlib preparations + fig, axes = plt.subplots(1, 2) + # Plot sample and reconstruciton + axes[0].imshow(sample) + axes[0].set_title('Original image') + axes[1].imshow(reconstruction) + axes[1].set_title('Reconstruction with UpSampling2D') + fig.suptitle(f'MNIST target = {input_class}') + plt.show() +``` + +### Full model code + +If you are interested in the full model code only, which is perfectly fine, here you go: + +``` +import keras +from keras.datasets import mnist +from keras.models import Sequential +from keras.layers import Conv2D, UpSampling2D, MaxPooling2D +import matplotlib.pyplot as plt +import numpy as np + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 25 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 0 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(8, (2, 2), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=input_shape)) +model.add(MaxPooling2D((2, 2), padding='same')) +model.add(Conv2D(8, (2, 2), activation='relu', kernel_initializer='he_uniform', padding='same')) +model.add(MaxPooling2D((2, 2), padding='same')) +model.add(Conv2D(8, (2, 2), strides=(2,2), activation='relu', kernel_initializer='he_uniform', padding='same')) +model.add(Conv2D(8, (2, 2), activation='relu', kernel_initializer='he_uniform', padding='same')) +model.add(UpSampling2D((2, 2), interpolation='bilinear')) +model.add(Conv2D(8, (2, 2), activation='relu')) +model.add(UpSampling2D((2, 2), interpolation='bilinear')) +model.add(Conv2D(8, (2, 2), activation='relu', kernel_initializer='he_uniform', padding='same')) +model.add(UpSampling2D((2, 2), interpolation='bilinear')) +model.add(Conv2D(1, (2, 2), activation='sigmoid', padding='same')) + +model.summary() + +# Compile and fit data +model.compile(optimizer='adam', loss='binary_crossentropy') +model.fit(input_train, input_train, + epochs=no_epochs, + batch_size=batch_size, + validation_split=validation_split) + +# Generate reconstructions +num_reconstructions = 8 +samples = input_test[:num_reconstructions] +targets = target_test[:num_reconstructions] +reconstructions = model.predict(samples) + +# Plot reconstructions +for i in np.arange(0, num_reconstructions): + # Get the sample and the reconstruction + sample = samples[i][:, :, 0] + reconstruction = reconstructions[i][:, :, 0] + input_class = targets[i] + # Matplotlib preparations + fig, axes = plt.subplots(1, 2) + # Plot sample and reconstruciton + axes[0].imshow(sample) + axes[0].set_title('Original image') + axes[1].imshow(reconstruction) + 
axes[1].set_title('Reconstruction with UpSampling2D') + fig.suptitle(f'MNIST target = {input_class}') + plt.show() +``` + +## The results: visualized reconstructions + +Now open up a terminal, `cd` into the folder where your `upsampling2d.py` file is located, and execute `python upsampling2d.py`. When you have all the dependencies, you'll notice that the training process will start - possibly with a download of the MNIST dataset first. + +Once the training process finishes, it's likely that you'll arrive at a loss value of approximately 0.11. While this is quite good, it's a bit worse than the [Conv2DTranspose](https://www.machinecurve.com/index.php/2019/12/10/conv2dtranspose-using-2d-transposed-convolutions-with-keras/) we achieved of approximately 0.05. + +Visualizing the inputs and reconstructions produces this result: + +- [![](images/1-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/1-1.png) + +- [![](images/2-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/2-1.png) + +- [![](images/3-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/3-1.png) + +- [![](images/4-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/4-1.png) + +- [![](images/5-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/5-1.png) + +- [![](images/6-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/6-1.png) + +- [![](images/7-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/7-1.png) + +- [![](images/8-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/8-1.png) + + +### Comparison with Conv2DTranspose reconstructions + +The losses are different - approximately 0.11 for the UpSampling2D model against 0.05 for the Conv2DTranspose model. + +I was curious to see whether these results are clearly visible in the visualizations, so I've put together the UpSampling2D and Conv2DTranspose reconstructions together with the original inputs. + +The answer, unfortunately, is yes - the differences in loss are visible. 
+ +Take a look for yourself: + +- [![](images/8-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/8-1.png) + +- [![](images/8.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/8.png) + +- [![](images/7-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/7-1.png) + +- [![](images/7.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/7.png) + +- [![](images/6-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/6-1.png) + +- [![](images/6.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/6.png) + +- [![](images/5-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/5-1.png) + +- [![](images/5.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/5.png) + +- [![](images/4-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/4-1.png) + +- [![](images/4.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/4.png) + +- [![](images/3-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/3-1.png) + +- [![](images/3.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/3.png) + +- [![](images/2-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/2-1.png) + +- [![](images/2.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/2.png) + +- [![](images/1-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/1-1.png) + +- [![](images/1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/1.png) + + +### UpSampling2D vs Conv2DTranspose ease of use + +What's more, I found creating the model with UpSampling2D and Conv2D layers slightly more difficult than using Conv2DTranspose. + +This was not necessarily due to getting the _correct shape_ - going back towards the (28, 28, 1) input shape - but primarily due to _getting the loss low enough with my architecture_. I felt that it was more difficult to achieve the Conv2DTranspose loss with UpSampling2D and Conv2D - which can be seen in the comparison above. + +## Summary + +However, this does not mean that you should skip on UpSampling2D/Conv2D altogether. No: we saw in today's blog post that it's the traditional choice, now often replaced by transposed convolutions, but still useful if you face checkerboard patterns in your reconstructions. + +Today, we saw what upsampling is, how UpSampling2D can be used in Keras, and how you can combine it with Conv2D layers (and MaxPooling2D) to generate an 'old-fashioned' autoencoder. + +Hope you've learnt something today! 😊 If you did, please let me know in the comments box below. But please do the same if you didn't, if you have questions, or when you have other remarks. I'll then try to improve this blog post based on your feedback 😁 + +Thank you for reading MachineCurve today and happy engineering! 😎 + +## References + +Keras. (n.d.). Convolutional Layers: UpSampling2D. Retrieved from [https://keras.io/layers/convolutional/#upsampling2d](https://keras.io/layers/convolutional/#upsampling2d) + +StackExchange. (n.d.). In CNN, are upsampling and transpose convolution the same? Retrieved from [https://stats.stackexchange.com/questions/252810/in-cnn-are-upsampling-and-transpose-convolution-the-same](https://stats.stackexchange.com/questions/252810/in-cnn-are-upsampling-and-transpose-convolution-the-same) + +Odena, A., Dumoulin, V., & Olah, C. (2016, October 17). Deconvolution and Checkerboard Artifacts. Retrieved from [https://distill.pub/2016/deconv-checkerboard/](https://distill.pub/2016/deconv-checkerboard/) + +Dumoulin, Vincent, en Francesco Visin. 
“A guide to convolution arithmetic for deep learning”. arXiv:1603.07285 \[cs, stat\], March 2016. arXiv.org, [http://arxiv.org/abs/1603.07285](http://arxiv.org/abs/1603.07285). + +StackOverflow. (n.d.). What is the the difference between performing upsampling together with strided transpose convolution and transpose convolution with stride 1 only? Retrieved from [https://stackoverflow.com/questions/48226783/what-is-the-the-difference-between-performing-upsampling-together-with-strided-t](https://stackoverflow.com/questions/48226783/what-is-the-the-difference-between-performing-upsampling-together-with-strided-t) + +Keras Blog. (n.d.). Building Autoencoders in Keras. Retrieved from [https://blog.keras.io/building-autoencoders-in-keras.html](https://blog.keras.io/building-autoencoders-in-keras.html) diff --git a/using-constant-padding-reflection-padding-and-replication-padding-with-keras.md b/using-constant-padding-reflection-padding-and-replication-padding-with-keras.md new file mode 100644 index 0000000..9d598b8 --- /dev/null +++ b/using-constant-padding-reflection-padding-and-replication-padding-with-keras.md @@ -0,0 +1,374 @@ +--- +title: "Using Constant Padding, Reflection Padding and Replication Padding with TensorFlow and Keras" +date: "2020-02-10" +categories: + - "deep-learning" + - "frameworks" +tags: + - "constant-padding" + - "convolutional-neural-networks" + - "keras" + - "neural-network" + - "padding" + - "reflection-padding" + - "replication-padding" +--- + +If you're training Convolutional Neural Networks with Keras, it may be that you don't want the size of your feature maps to be smaller than the size of your inputs. For example, because you're using a Conv layer in an autoencoder - where your goal is to generate a final feature map, not reduce the size of its output. + +Fortunately, this is possible with [padding](https://www.machinecurve.com/index.php/2020/02/08/how-to-use-padding-with-keras/), which essentially puts your feature map inside a frame that combined has the same size as your input data. Unfortunately, the Keras framework for deep learning only supports Zero Padding by design. This is especially unfortunate because there are types of padding - such as Reflection Padding and Replication Padding - which [may interfere less with the distribution of your data](https://www.machinecurve.com/index.php/2020/02/07/what-is-padding-in-a-neural-network/#reflection-padding) during training. + +Now, there's no point in giving up :) That's why we got inspired by an answer on StackOverflow and got to work (StackOverflow, n.d.). By consequence, this blog post presents implementations of Constant Padding, Reflection Padding and Replication Padding to be used with TensorFlow 2.0 based Keras. The implementations are available for 1D and 2D data. Besides the implementation, it will also show you how to use them in an actual Keras model 👩‍💻. + +Are you ready? Let's go! 😎 + +**Update 05/Nov/2020:** added 'TensorFlow' to the title in order to reflect the deep integration between TensorFlow and Keras in TensorFlow 2.x. + +* * * + +\[toc\] + +* * * + +## Recap: what is padding and why is it useful? 
+
+Suppose that you are training a [convolutional neural network](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/), which is a type of neural network where so-called "convolutional layers" serve as feature extractors:
+
+[![](images/CNN-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/09/CNN-1.png)
+
+In the drawing above, some input data (which is likely an RGB image) of height \[latex\]H\[/latex\] and width \[latex\]W\[/latex\] is fed to a convolutional layer. This layer, which slides (or "convolves") \[latex\]N\[/latex\] kernels of size 3x3x3 over the input, produces \[latex\]N\[/latex\] so-called "feature maps" as output. Through the _weights_ of the kernels, which have been [optimized](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) based on the training dataset, the neural network learns to recognize features in the input image.
+
+Note that often, a convolutional neural network consists of quite a few convolutional layers stacked on top of each other. In this case, the feature map that is the output of the first layer is used as the input of the second, and so on.
+
+Now, due to the way such layers work, the size of one feature map (e.g. \[latex\]H\_{fm}\[/latex\] and \[latex\]W\_{fm}\[/latex\] in the image above) [is _smaller_](https://www.machinecurve.com/index.php/2020/02/07/what-is-padding-in-a-neural-network/#conv-layers-might-induce-spatial-hierarchy) than the size of the input to the layer (\[latex\]H\[/latex\] and \[latex\]W\[/latex\]). However, sometimes, you don't want this to happen. Rather, you wish that the size of the feature map is equal to - or perhaps larger than - the size of your input data.
+
+Padding can be used to achieve this. By wrapping the outcome in some "frame", you can ensure that the size of the outputs is equal to that of the input. However, what does this frame look like? In our [article about padding](https://www.machinecurve.com/index.php/2020/02/07/what-is-padding-in-a-neural-network/), we saw that zeros are often used for this. However, we also saw that this might result in worse performance due to the fact that zero padding is claimed to interfere with the distribution of your dataset. _Reflection padding_ and _replication padding_ are introduced as possible fixes for this issue, together with _constant padding_.
+
+Unfortunately, Keras does not support this, [as it only supports zero padding](https://www.machinecurve.com/index.php/2020/02/08/how-to-use-padding-with-keras/). That's why the rest of this blog will introduce constant padding, reflection padding and replication padding to Keras. The code below is compatible with TensorFlow 2.0 based Keras and hence should still work for quite some time from now. If not, feel free to leave a message in the comments box, and I'll try to fix it for you :)
+
+Let's take a look at the first type: using constant padding with Keras 😎
+
+* * *
+
+## Constant padding
+
+The first type of padding that we'll make available for Keras: **constant padding** 😉
+
+### What is constant padding?
+
+Let's take a look at what constant padding does by means of this schematic drawing:
+
+[![](images/constantpad.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/02/constantpad.jpg)
+
+As you can see, the feature maps that are the output of the `Conv2D` layer that is applied to the input data, are smaller than the input data itself.
This is perfectly normal, and normally, one would apply [zero padding](https://www.machinecurve.com/index.php/2020/02/08/how-to-use-padding-with-keras/#how-to-use-same-zero-padding-with-keras). However, can't we pad with a constant value \[latex\]c\[/latex\] instead of zeros? + +Yes! + +This is what constant padding does: the "frame" around the feature maps which ensures that their size equals the size of the input data, is filled with the specified \[latex\]c\[/latex\]. Let's now take a look at Keras implementations for 1D and 2D data :) + +### Keras ConstantPadding1D + +First, constant padding for 1D data - a.k.a. `ConstantPadding1D`: + +``` +from tensorflow import pad +from tensorflow.keras.layers import Layer + +''' + 1D Constant Padding + Attributes: + - padding: (padding_left, padding_right) tuple + - constant: int (default = 0) +''' +class ConstantPadding1D(Layer): + def __init__(self, padding=(1, 1), constant=0, **kwargs): + self.padding = tuple(padding) + self.constant = constant + super(ConstantPadding1D, self).__init__(**kwargs) + + def compute_output_shape(self, input_shape): + return input_shape[1] + self.padding[0] + self.padding[1] + + def call(self, input_tensor, mask=None): + padding_left, padding_right = self.padding + return pad(input_tensor, [[0, 0], [padding_left, padding_right], [0, 0]], mode='CONSTANT', constant_values=self.constant) +``` + +The code above effectively defines a new layer type for Keras, which we call `ConstantPadding1D`. It's defined as a class and hence can be initialized multiple times. It is composed of three definitions: + +- `__init__`, which is the class constructor, and serves to lift the variables passed on creation (`padding` and `constant`, respectively) into class scope, which means that every definition can use them. +- `compute_output_shape`, which does what it suggests: it computes the output shape for the layer. In our case, that's the new shape of our `Conv1D` output data, _after_ padding is applied as well. +- `call`, which is where the data (the `input_tensor`) flows through. + +#### Results for 1D Constant Padding + +Now, let's take a look at whether it works. If we applied `ConstantPadding1D` with `constant = 0` and `padding = (5, 4)` after a `Conv1D` layer with a `kernel_size = 10`, we should expect to see Zero Padding applied to 1D data: + +[![](images/zero_padding_1d-1-1024x147.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/zero_padding_1d-1.png) + +Indeed, the left and the right of the padded feature map clearly show the zeroes being padded successfully. This is supported even more by the fact that if changed into `constant = 23`, the padding changes color, as expected: + +[![](images/23_pad_1d-1024x147.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/23_pad_1d.png) + +In both padding cases, note that the "left side" of the input is very dark, and that this darkness is also visible in the feature map. 
This provides some trust that it's the actual feature map that we visualize :) + +### Keras ConstantPadding2D + +Here's `ConstantPadding2D`: + +``` +from tensorflow import pad +from tensorflow.keras.layers import Layer + +''' + 2D Constant Padding + Attributes: + - padding: (padding_width, padding_height) tuple + - constant: int (default = 0) +''' +class ConstantPadding2D(Layer): + def __init__(self, padding=(1, 1), constant=0, **kwargs): + self.padding = tuple(padding) + self.constant = constant + super(ConstantPadding2D, self).__init__(**kwargs) + + def compute_output_shape(self, input_shape): + return (input_shape[0], input_shape[1] + 2 * self.padding[0], input_shape[2] + 2 * self.padding[1], input_shape[3]) + + def call(self, input_tensor, mask=None): + padding_width, padding_height = self.padding + return pad(input_tensor, [[0,0], [padding_height, padding_height], [padding_width, padding_width], [0,0] ], mode='CONSTANT', constant_values=self.constant) +``` + +The code is pretty similar to the one of `ConstantPadding1D`: + +- It still represents a new Keras layer, having the `__init__`, `compute_output_shape` and `call` definitions. +- The output shape that is computed by `compute_output_shape` differs from the 1D version, for the simple reason that both produce a different shape :) +- The `paddings` attribute that is applied to the `pad` function is also different, and suitable for 2D padding. + +#### Results for 2D Constant Padding + +Now, time for results :) Applying `ConstantPadding2D` with `constant = 0` equals Zero Padding: + +[![](images/zero_padding.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/zero_padding.png) + +However, the strength of `ConstantPadding2D` over Keras built-in `ZeroPadding2D` is that you can use any constant, as with `ConstantPadding1D`. For example, with `constant = 23`, this is what you get: + +[![](images/constant23.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/constant23.png) + +Great! :D + +* * * + +## Reflection padding + +The second type of padding that we'll make available for Keras: **reflection padding** 😉 + +### What is reflection padding? + +In order to understand reflection padding, it's important that we first take a look at this schematic drawing of \[latex\](1, 2)\[/latex\] padding which, by coincidence ;-), we call "reflection padding": + +[![](images/reflection_pad.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/02/reflection_pad.jpg) + +Let's now take a look at the first row of our unpadded input, i.e. the yellow box. It's \[latex\]\[3, 5, 1\]\[/latex\]. Reflection padding essentially uses the contents of this row for padding the values directly next to it. For example, move to the right, i.e. from 3 to 1. Then, move one additional box to the right - you'll find a 5. Hey, that's the middle value of our row. Then, you'll find a 3. Hey, that's the first value! And so on. You see the same happening on the left, and on top. + +Reflection padding thus "reflects" the row into the padding. This is useful because it ensures that your outputs will transition "smoothly" into the padding. Possibly, this improves the performance of your model, because padded inputs will still look like the original ones in terms of data distribution (Liu et al., 2018). + +### Keras ReflectionPadding1D + +Here's the implementation for 1D data, i.e. 
`ReflectionPadding1D`: + +``` +from tensorflow import pad +from tensorflow.keras.layers import Layer + +''' + 1D Reflection Padding + Attributes: + - padding: (padding_left, padding_right) tuple +''' +class ReflectionPadding1D(Layer): + def __init__(self, padding=(1, 1), **kwargs): + self.padding = tuple(padding) + super(ReflectionPadding1D, self).__init__(**kwargs) + + def compute_output_shape(self, input_shape): + return input_shape[1] + self.padding[0] + self.padding[1] + + def call(self, input_tensor, mask=None): + padding_left, padding_right = self.padding + return pad(input_tensor, [[0, 0], [padding_left, padding_right], [0, 0]], mode='REFLECT') +``` + +Once again, the class is similar to the paddings we've already seen. However, the contents of the `call` operation are different. Particularly, the `mode` has changed, from `CONSTANT` into `REFLECT`. What's more, the `constant_values` attribute was removed, and so was the `self.constant` assignment, simply because we don't need them here. + +#### Results for 1D Reflection Padding + +Now, when we apply this padding to an 1D input, we can see how it works. Firstly, the kernel reduces the input into a feature map - as you can see, the "dark area" on the left of your input has moved to approximately position 5 in the feature map. It's also 5 times smaller, which makes sense, given our large kernel size. + +However, what you'll also see, is that from the "max value" in your feature map at around 5, exact symmetry is visible on the left and on the right. This is indeed reflection padding in action! + +[![](images/reflection_1d-1024x147.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/reflection_1d.png) + +### Keras ReflectionPadding2D + +Now, let's take a look at 2D Reflection Padding, or `ReflectionPadding2D`: + +``` +from tensorflow import pad +from tensorflow.keras.layers import Layer + +''' + 2D Reflection Padding + Attributes: + - padding: (padding_width, padding_height) tuple +''' +class ReflectionPadding2D(Layer): + def __init__(self, padding=(1, 1), **kwargs): + self.padding = tuple(padding) + super(ReflectionPadding2D, self).__init__(**kwargs) + + def compute_output_shape(self, input_shape): + return (input_shape[0], input_shape[1] + 2 * self.padding[0], input_shape[2] + 2 * self.padding[1], input_shape[3]) + + def call(self, input_tensor, mask=None): + padding_width, padding_height = self.padding + return pad(input_tensor, [[0,0], [padding_height, padding_height], [padding_width, padding_width], [0,0] ], 'REFLECT') +``` + +The value for `compute_output_shape` is equal to `ConstantPadding2D`. So is the `call` operation, except for `CONSTANT -> REFLECT` and the removal of `self.constant`. Nothing too exciting :) + +#### Results for 2D Reflection Padding + +...except for the results, perhaps :) Using 2D data, the effect of reflection padding is even more visible. As you can see, with a relatively large kernel, the input is reduced to a more abstract feature map. However, the feature map has the same size as the input data, and shows perfect symmetry around the edges. Reflection padding in action! :) + +[![](images/reflection.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/reflection.png) + +* * * + +## Replication padding + +The third type of padding that we'll make available for Keras: **replication padding** 😉 + +### What is replication padding? 
+
+Replication padding is pretty similar to reflection padding, actually, and attempts to achieve the same outcome: that the distribution of your data is disturbed as little as possible (Liu et al., 2018).
+
+However, it does so in a slightly different way:
+
+[![](images/replication_pad.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/replication_pad.png)
+
+Instead of a pure _reflection_, like Reflection Padding, _replication_ padding makes a copy of the input, reverses it, and then applies it. Take a look at the first row again: \[latex\]\[3, 5, 1\]\[/latex\] -> \[latex\]\[1, 5, 3\]\[/latex\], after which it's applied. In the results, you should thus see a broader "transition zone" from _input_ to _padding_.
+
+In TensorFlow, replication padding is not known by the name "replication padding". Instead, in TF, it's called "symmetric padding". Hence, we'll use `SYMMETRIC` as our padding mode throughout the 1D and 2D examples that will follow next.
+
+### Keras ReplicationPadding1D
+
+Here's the code for `ReplicationPadding1D`:
+
+```
+from tensorflow import pad
+from tensorflow.keras.layers import Layer
+
+'''
+  1D Replication Padding
+  Attributes:
+    - padding: (padding_left, padding_right) tuple
+'''
+class ReplicationPadding1D(Layer):
+    def __init__(self, padding=(1, 1), **kwargs):
+        self.padding = tuple(padding)
+        super(ReplicationPadding1D, self).__init__(**kwargs)
+
+    def compute_output_shape(self, input_shape):
+        return input_shape[1] + self.padding[0] + self.padding[1]
+
+    def call(self, input_tensor, mask=None):
+        padding_left, padding_right = self.padding
+        return pad(input_tensor, [[0, 0], [padding_left, padding_right], [0, 0]], mode='SYMMETRIC')
+```
+
+Compared with the reflection padding implementation, not much changed: `REFLECT -> SYMMETRIC`.
+
+#### Results for 1D Replication Padding
+
+Now, let's take a look at the results :)
+
+Indeed, it's clear that the "transition zone" between input and padding is broader than with reflection padding: the gray zone around position 5 is wider. This, obviously, is caused by the _copy_ instead of _reflection_ that replication padding makes:
+
+[![](images/replication_1d-1024x147.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/replication_1d.png)
+
+### Keras ReplicationPadding2D
+
+Now, the code for 2D Replication Padding a.k.a. `ReplicationPadding2D`:
+
+```
+from tensorflow import pad
+from tensorflow.keras.layers import Layer
+
+'''
+  2D Replication Padding
+  Attributes:
+    - padding: (padding_width, padding_height) tuple
+'''
+class ReplicationPadding2D(Layer):
+    def __init__(self, padding=(1, 1), **kwargs):
+        self.padding = tuple(padding)
+        super(ReplicationPadding2D, self).__init__(**kwargs)
+
+    def compute_output_shape(self, input_shape):
+        return (input_shape[0], input_shape[1] + 2 * self.padding[0], input_shape[2] + 2 * self.padding[1], input_shape[3])
+
+    def call(self, input_tensor, mask=None):
+        padding_width, padding_height = self.padding
+        return pad(input_tensor, [[0,0], [padding_height, padding_height], [padding_width, padding_width], [0,0] ], 'SYMMETRIC')
+```
+
+#### Results for 2D Replication Padding
+
+Here, the difference between replication and reflection padding is visible even better. The feature map is generated - i.e., it's more abstract than the input data - and it is padded smoothly. However, the "transition zone" is broader than with reflection padding - and this is clearly visible on the bottom right of the feature map:
+
+[![](images/replication.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/replication.png)
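+
+To see the difference between the two TensorFlow padding modes numerically, here is a small standalone sketch, using the \[latex\]\[3, 5, 1\]\[/latex\] row from the drawings above (it is not part of the Keras layers themselves):
+
+```
+import tensorflow as tf
+
+# A single row, padded by two values on each side
+row = tf.constant([[3, 5, 1]])
+
+# Reflection: mirrors around the edge, without repeating the edge value itself
+print(tf.pad(row, [[0, 0], [2, 2]], mode='REFLECT').numpy())    # [[1 5 3 5 1 5 3]]
+
+# Symmetric (replication): appends a reversed copy, so the edge value is repeated
+print(tf.pad(row, [[0, 0], [2, 2]], mode='SYMMETRIC').numpy())  # [[5 3 3 5 1 1 5]]
+```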
+
+* * *
+
+## Using these paddings in a Keras model
+
+Okay, time to show you how to use these paddings :)
+
+We've made `ConstantPadding`, `ReflectionPadding` and `ReplicationPadding` available for `Conv1D` and `Conv2D` layers in Keras, but we still don't know how to use them.
+
+Here's an example for 2D padding, where we create a `Sequential` model, apply a `Conv2D` layer and subsequently apply both replication and constant padding. Obviously, this will produce a padded feature map that is larger than our original input. However, we just wanted to show how to apply replication/reflection padding, _and_ constant padding, as you require an additional parameter there :)
+
+```
+model = Sequential()
+model.add(Conv2D(img_num_channels, kernel_size=(5, 5), activation='linear', input_shape=input_shape, kernel_initializer=Ones(), bias_initializer=Ones()))
+model.add(ReplicationPadding2D(padding=(3, 3)))
+model.add(ConstantPadding2D(padding=(5, 4), constant=23))
+```
+
+* * *
+
+## Summary
+
+In this blog post, you found how to use Constant Padding, Reflection Padding and Replication Padding with Keras using TensorFlow. The blog started with a recap on padding, showing that you might need it if you want your Conv-generated feature maps to be of equal size to your input data. This was followed by discussing how Keras only supports zero padding, while more advanced paddings are available.
+
+We subsequently provided Python based implementations of these paddings, and gave an example of how to apply them in your Keras models.
+
+I hope you've learnt something from this blog or that it was useful! :) If it was, feel free to leave a comment in the comments section. Please do the same if you think I made mistakes, or when you have questions or remarks.
+
+Thank you for reading MachineCurve today and happy engineering! 😎
+
+\[kerasbox\]
+
+* * *
+
+## References
+
+StackOverflow. (n.d.). Reflection padding Conv2D. Retrieved from [https://stackoverflow.com/questions/50677544/reflection-padding-conv2d](https://stackoverflow.com/questions/50677544/reflection-padding-conv2d)
+
+MachineCurve. (2020, February 9). What is padding in a neural network? Retrieved from [https://www.machinecurve.com/index.php/2020/02/07/what-is-padding-in-a-neural-network/](https://www.machinecurve.com/index.php/2020/02/07/what-is-padding-in-a-neural-network/)
+
+Liu, G., Shih, K. J., Wang, T. C., Reda, F. A., Sapra, K., Yu, Z., … & Catanzaro, B. (2018). [Partial convolution based padding](https://arxiv.org/abs/1811.11718). _arXiv preprint arXiv:1811.11718_.
+
+TensorFlow. (n.d.). tf.pad.
Retrieved from [https://www.tensorflow.org/api\_docs/python/tf/pad](https://www.tensorflow.org/api_docs/python/tf/pad) diff --git a/using-deep-learning-for-classifying-mail-digits.md b/using-deep-learning-for-classifying-mail-digits.md new file mode 100644 index 0000000..9c4f301 --- /dev/null +++ b/using-deep-learning-for-classifying-mail-digits.md @@ -0,0 +1,284 @@ +--- +title: "Using Deep Learning for Classifying Mail Digits" +date: "2020-12-03" +categories: + - "deep-learning" + - "frameworks" +tags: + - "convolutional-neural-networks" + - "deep-learning" + - "digit-classification" + - "digits" + - "extra-keras-datasets" + - "keras" + - "neural-network" + - "neural-networks" + - "tensorflow" +--- + +Deep Learning, the subset of Machine Learning which employs Deep Neural Networks for generating models, can be used for many things. Today, it's being used in Google Translate, in Uber's app, and in many [Computer Vision applications](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/). One of these examples is digit classification in mail delivery. This article will focus on this use case. We will teach you to build your own Neural Network for Digit Classification, but not with the MNIST dataset - which is a pretty common dataset. Rather, we'll be using the USPS Handwritten Digits Dataset, which is made available by scanning many pieces of mail and extracting the digits from them. + +The article is structured as follows. Firstly, we'll be taking a look at mail digit classification. Why can it help in the first place? Then, we'll move on to our dataset, the USPS Handwritten Digits Dataset. We will show you how we can use [extra-keras-datasets](https://github.com/christianversloot/extra_keras_datasets) for easy loading of the dataset, and then explore it further. Once we are familiar with the dataset, we will build and train the Deep Learning model, using Python, TensorFlow and Keras. Then, we'll run it, and you will see how it performs. + +Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## Why Mail Digit Classification? + +Even though the number of mail sent (and I'm not talking about email here, haha, but real mail) has been decreasing for some years (at least in the country where I live), the number of pieces that has to be processed is still enormous. So enormous even, that we can no longer handle them by hand if we expect next-day delivery, sometimes even same-day delivery. + +We therefore have to automate away many of the parts of the mail delivery process. This is especially fruitful in process steps that require many repetitive actions, such as the ones undertaken in a distribution center (the video below shows what's going on in just _one_ USPS distribution center during Christmastime). For example, if a piece of mail comes in, the address must be registered and it must be distributed to some part of the distribution center where mail for a particular region is gathered. + +Is scanning the address a highly complex operation requiring large amounts of creativity? + +No. + +Rather, it's a highly repetitive task: read address, move piece of mail. Read address, move piece of mail. And so on, and so on. + +For this reason, automation, and the employment of Machine Learning algorithms - which learn to recognize patterns from datasets and which can later employ these learnings to handle new observations - can be really worthwhile in mail distribution centers. 
In fact, many such algorithms have been around for years, using camera techniques and smart actuators for distributing the mail into the right buckets. It speeds up the mail sorting process and hence the time between sending your mail and a happy smile at the receiving end of the mail process :) + +In this article, we'll try to find whether we can build (parts of) such a mail classification system ourselves. We'll be specifically focusing on **digits**, being the numbers 0 to 9. Using Computer Vision technology and Deep Learning, we will build a Neural network capable of classifying mail digits correctly in many cases. Let's take a look at our dataset first. + +https://www.youtube.com/watch?v=A3UCmTr5RBk + +* * * + +## Our dataset: the USPS Handwritten Digits Dataset + +For building our Neural Network, we will be using the **USPS Handwritten Digits Dataset**. It is a dataset made available in Hull (1994). It is in fact quite an extensive dataset: + +> An image database for handwritten text recognition research is described. Digital images of approximately 5000 city names, 5000 state names, 10000 ZIP Codes, and 50000 alphanumeric characters are included. +> +> Hull (1994) + +It was constructed as follows: + +> Each image was scanned from mail in a working post office at 300 pixels/in in 8-bit gray scale on a high-quality flat bed digitizer. The data were unconstrained for the writer, style, and method of preparation. These characteristics help overcome the limitations of earlier databases that contained only isolated characters or were prepared in a laboratory setting under prescribed circumstances. +> +> Hull (1994) + +Let's now take a look at the data in a bit more detail. In order for easy accessibility, we have made available the dataset in our [Extra Keras Datasets package](https://github.com/christianversloot/extra_keras_datasets) which can be installed really easily: `pip install extra-keras-datasets`. We can then call `load_data(...)` to load the data, as we can see here: + +``` +from extra_keras_datasets import usps + +# Load dataset +(X_train, y_train), (X_test, y_test) = usps.load_data() +``` + +Using Matplotlib, we can then visualize the dataset. If we visualize six random images, we get the following results. Clearly, we are working with small images (they are pixelated if we make them bigger). The dataset also seems to be prepared nicely: numbers are centered, and all numbers are surrounded by a black box (likely, by taking the inverse color after scanning). What's more, the numbers are grayscale digits, which also removes the aspect of color from the equation. + +![](images/usps.png) + +* * * + +## Building the Deep Learning model + +Now that we are aware of the dataset, we can start building our Deep Learning model. We will use TensorFlow and specifically the `tensorflow.keras` API for building the model. TensorFlow is one of the leading Machine Learning libraries that is being used these days and can be used for constructing Neural networks. Building our network involves the following steps which together create Python code: + +1. **Adding imports:** we depend on other packages for building our Neural network. We have to import the specific components that we require first. +2. **Specifying the configuration options:** configuring a Neural network involves specifying some configuration options. +3. **Loading the dataset:** using the Extra Keras Datasets package, we'll easily load the dataset into our code. We will also perform preprocessing activities. +4. 
**Creating the model skeleton:** we then actually create the Neural network, or more specifically the model skeleton. We will then know what our model _looks like_, but it's not real yet. +5. **Compiling the model:** when compiling the model, we make it real, by instantiating it and configuring it. It can now be used. +6. **Fitting data to the model:** in other words, training the model. +7. **Evaluating the model:** checking how well it works after it was trained. + +Let's get to work! Open a code editor, create a file - e.g. `usps.py` - and we can go. + +### Adding the imports + +As we said, the first thing we have to do is adding the imports. + +- First of all, we'll be using the [Extra Keras Datasets](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/) package for importing `usps`, i.e. the USPS Handwritten Digits Dataset. +- We then import the `Sequential` Keras API, which is the foundation for our Keras model. Using this API, we can stack layers on top of each other, which jointly represent the Deep Learning model. +- We will also use a few layers: we'll use [Convolutional ones](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) (`Conv2D`) for 2D data (i.e., images), Densely-connected ones (for generating the actual predictions) and `Flatten` (Dense layers can't handle non-1D data, so we must flatten the outputs of our final Conv layers). +- For optimization, we use the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam) (`tensorflow.keras.optimizers.Adam`) and for [loss](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) we use `categorical_crossentropy` [loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/). +- Finally, because we use categorical crossentropy loss, we must [one-hot encode our targets](https://www.machinecurve.com/index.php/2020/11/24/one-hot-encoding-for-machine-learning-with-tensorflow-and-keras/). Using the `to_categorical` util, we can achieve this. + +``` +from extra_keras_datasets import usps +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from tensorflow.keras.utils import to_categorical +``` + +### Specifying the configuration options + +Now that we have the imports, we can move on to specifying the configuration options. Strictly speaking, this step is not necessary, because it is possible to define all the options _within_ the later parts (compiling the model, fitting the data ...) as well. However, I think that listing them near the top of your model helps with clarity: you can immediately see how your model is configured. Next, we will therefore specify the configuration options for our ML model: + +- Fitting data goes [in batches](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) if you want to avoid exhausting your memory. That's why we have to specify a `batch_size`. We set it to 250 samples, meaning that our [forward pass](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) moves 250 samples through the model, generates predictions, and then optimizes. 
When all batches have passed, the iteration - or epoch - is complete. +- The number of iterations, or `no_epochs`, is set to 150. This means that our model will feed forward samples, generate predictions, and then optimize for 150 times. Is this a good number? We don't know up front. If you want to stop at precisely the good moment, [you can apply callbacks](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/), but for the purpose of this experiment, setting a fixed number of epochs will work well. +- Some of the training data must be used for [validation purposes](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/). In other words, it must be used to steer the training process _while training is happening_, to preserve the testing set for true model evaluation. We therefore set `validation_split_size` to `0.20`, meaning that we use 20% of the training data for validation purposes. +- We set `verbosity` to 1, which will instruct Keras to print all outputs on screen. This slows down the training process slightly, so it's best not to use it for production training settings (if you want small summaries, you can set `verbosity = 2`, otherwise I recommend `verbosity = 0`). However, for this experiment, we actually _want_ everything to be displayed on screen. +- We use the [Adam optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) and [categorical crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) for the optimization process, and specify accuracy as an additional metric, because it is more intuitive for humans. + +``` +# Configuration options +batch_size = 250 +no_epochs = 150 +validation_split_size = 0.20 +verbosity = 1 +optimizer = Adam() +loss_function = categorical_crossentropy +additional_metrics = ['accuracy'] +``` + +### Loading the dataset + +Now that we have specified the configuration options, it's time to load the dataset. Fortunately, with the [Extra Keras Datasets package](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/), this is really easy: + +``` +# Load dataset +(X_train, y_train), (X_test, y_test) = usps.load_data() +``` + +If you don't have this package yet: it can be installed easily, using `pip install extra-keras-datasets`. + +Next, we have to do three things: + +- [Scale the data](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) to the \[latex\]\[0, 1\]\[/latex\] range, which helps the optimization process. +- Reshape the 2D grayscale data (which has dimensions for `width` and `height` only) into a 3D object, because the [Conv2D](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) accepts arrays with 3D inputs (`width`, `height` and `color channels` only). We'll therefore reshape each sample into `(width, height, 1)`, which does not change anything semantically. +- Convert our target values to [one-hot encoded format](https://www.machinecurve.com/index.php/2020/11/24/one-hot-encoding-for-machine-learning-with-tensorflow-and-keras/), which makes them compatible with our usage of categorical crossentropy loss. 
+ +Let's add these tasks to our code: + +``` +# Set to [0, 1] range +X_train = X_train / 255.0 +X_test = X_test / 255.0 + +# Reshape 2D grayscale data into 2D-grayscale-with-one-channel data +X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1) +X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2], 1) + +# Convert targets into one-hot encoded format +y_train = to_categorical(y_train) +y_test = to_categorical(y_test) +``` + +### Creating the model skeleton + +It's now time to create the skeleton for our Neural network. This skeleton describes _what our model looks like_. It does however not create a model that we can use (we must compile it in the next section before we can use it). + +Recall that in a Neural network, we have an input layer, hidden layers and an output layer. In our model skeleton, we describe the structure of our hidden layers and our output layer. Keras [will construct the input layer for us](https://www.machinecurve.com/index.php/2020/04/05/how-to-find-the-value-for-keras-input_shape-input_dim/). What we must do, however, is showing Keras what it must look like. We'll therefore derive the input shape from one sample and specify it as the shape of our input layer shape later: + +``` +# Input shape +input_shape = X_train[0].shape +print(f'Input shape = {input_shape}') +``` + +We can then create the model skeleton: + +- We first initialize the Sequential API into the `model` variable, giving us an empty model to work with. +- We then stack a few layers on top of each other and indirectly on top of the `model` foundation by calling `model.add(...)`. Specifically, we use two Convolutional layers, then Flatten the feature maps generated by the last layer, and use Dense layers for the final prediction ([using Softmax](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/)). +- Specifically note the `input_shape = input_shape` assignment in the first layer, telling Keras what the shape of an input sample looks like! + +``` +# Create the model +model = Sequential() +model.add(Conv2D(16, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(8, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(32, activation='relu')) +model.add(Dense(10, activation='softmax')) +``` + +### Compiling the model + +We can now compile our model skeleton into a functional model. Since we already specified the configuration options before, this is really easy: + +``` +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=additional_metrics) +``` + +### Fitting data to the model + +The same is true for starting the training process, or fitting data to the model: + +``` +# Fit data to model +model.fit(X_train, y_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split_size) +``` + +We explicitly fit the `(X_train, y_train)` data, applying the 80/20 validation split before actually starting the training process. The other configuration options speak for themselves, because we already covered them in the previous section about specifying the configuration options. + +### Evaluating the model + +Should we now run our Python code, we will see that our model starts training, and that it will eventually finish. However, doing that, we can only check whether our model works well _on training data_. 
As we know, [we cannot rely on that data](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/) if we want to [evaluate our model](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/), because it is like checking your own homework. We therefore have to use testing data for evaluation purposes, in order to find out how well our model really works. + +We can easily perform a model evaluation step with the testing data, as follows: + +``` +# Generate generalization metrics +score = model.evaluate(X_test, y_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +* * * + +## Running the model + +Now that we have written code for configuration options, loading and preparing the dataset, constructing and compiling the model, and subsequently training and evaluating it, we can actually run the model. + +Therefore, save your `usps.py` file, open up a terminal where you have TensorFlow and [Extra Keras Datasets](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/) installed, and run `python usps.py`. + +If the dataset hasn't been loaded onto your machine yet, you will first see a loading bar that illustrates the dataset download process. Then, you'll find that your model starts training. + +``` +Epoch 1/150 +5832/5832 [==============================] - 2s 350us/sample - loss: 2.2843 - accuracy: 0.1632 - val_loss: 2.2766 - val_accuracy: 0.1590 +Epoch 2/150 +5832/5832 [==============================] - 0s 26us/sample - loss: 2.2399 - accuracy: 0.2327 - val_loss: 2.2148 - val_accuracy: 0.2954 +Epoch 3/150 +5832/5832 [==============================] - 0s 26us/sample - loss: 2.1096 - accuracy: 0.3009 - val_loss: 1.9877 - val_accuracy: 0.3002 +Epoch 4/150 +5832/5832 [==============================] - 0s 25us/sample - loss: 1.7524 - accuracy: 0.4114 - val_loss: 1.5406 - val_accuracy: 0.5278 +Epoch 5/150 +5832/5832 [==============================] - 0s 24us/sample - loss: 1.2502 - accuracy: 0.6722 - val_loss: 1.0161 - val_accuracy: 0.7313 +Epoch 6/150 +3000/5832 [==============>...............] - ETA: 0s - loss: 0.8810 - accuracy: 0.8007 +``` + +After it has finished, our validation accuracy is close to 96%. This is good, but we cannot trust validation data fully either (because it is used in the training process, it can also leave its marks on the model internals, rendering it a bit suspect). We really have to use testing data, i.e. data that the model has never seen during training, for model evaluation. + +The evaluation result is also displayed on screen: + +``` +Test loss: 0.3841879335271522 / Test accuracy: 0.9217737913131714 +``` + +Clearly, we can see that our model performs a bit worse with testing data, but still - it performs at a 92.2% accuracy. That's good! + +* * * + +## Summary + +In this article, we saw how we can create a Neural network for digits classification with TensorFlow and Keras. Contrary to many other articles, which use the [MNIST dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#mnist-database-of-handwritten-digits), we used a different one - the USPS Handwritten Digits Dataset, available through the Extra Keras Datasets package. + +We firstly saw why digits classification can help in the mail distribution process. We then briefly looked at the dataset, and saw that it is composed of many handwritten digits, scanned from real, USPS-distributed mail. 
Then, we moved on to model construction, showing you how to build and train a TensorFlow/Keras model for digits classification with step-by-step examples. With the dataset and our Neural network, we achieved a 92.2% accuracy on our testing dataset. + +I hope that you have learned something from today's article! If you did, please feel free to leave a comment in the comments section - I'd love to hear from you 💬 Please do the same if you have questions or other remarks. Whenever possible, I'll try to help you move forward in your ML career. + +Regardless of that, thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +J. J. Hull. (1994). A database for handwritten text recognition research. _IEEE Transactions on Pattern Analysis and Machine Intelligence_, _16_(5), 550–554. https://doi.org/10.1109/34.291440 diff --git a/using-dropout-with-pytorch.md b/using-dropout-with-pytorch.md new file mode 100644 index 0000000..f6be41b --- /dev/null +++ b/using-dropout-with-pytorch.md @@ -0,0 +1,165 @@ +--- +title: "Using Dropout with PyTorch" +date: "2021-07-07" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "dropout" + - "machine-learning" + - "neural-network" + - "overfitting" + - "pytorch" +--- + +The Dropout technique can be used for avoiding overfitting in your neural network. It has been around for some time and is widely available in a variety of neural network libraries. Let's take a look at how Dropout can be implemented with PyTorch. + +In this article, you will learn... + +- **How variance and overfitting are related.** +- **What Dropout is and how it works against overfitting.** +- **How Dropout can be implemented with PyTorch**. + +Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## Variance and overfitting + +In our article about the [trade-off between bias and variance](https://www.machinecurve.com/index.php/2020/11/02/machine-learning-error-bias-variance-and-irreducible-error-with-python/), it became clear that models can be high in _bias_ or high in _variance_. Preferably, there is a balance between both. + +To summarize that article briefly, models high in bias are relatively rigid. Linear models are a good example - they assume that your input data has a linear pattern. Models high in variance, however, do not make such assumptions -- but they are sensitive to changes in your training data. + +As you can imagine, striking a balance between rigidity and sensitivity. + +Dropout is related to the fact that deep neural networks have high variance. As you know when you have dealt with neural networks for a while, such models are sensitive to overfitting - capturing noise in the data as if it is part of the real function that must be modeled. + +* * * + +## Dropout + +In their paper [“Dropout: A Simple Way to Prevent Neural Networks from Overfitting”](http://jmlr.org/papers/v15/srivastava14a.html), Srivastava et al. (2014) describe Dropout, which is a technique that temporarily removes neurons from the neural network. + +> With Dropout, the training process essentially drops out neurons in a neural network. +> +> [What is Dropout? Reduce overfitting in your neural networks](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/) + +When certain neurons are dropped, no data flows through them anymore. 
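As a quick, standalone illustration (separate from the full model we will build below), this is roughly what happens when a tensor passes through a PyTorch Dropout layer in training mode. Note that PyTorch also rescales the surviving activations by a factor of \[latex\]1/(1-p)\[/latex\] during training, so that the expected magnitude of the outputs stays the same:

```
import torch
from torch import nn

torch.manual_seed(42)

dropout = nn.Dropout(p=0.5)  # a freshly created module is in training mode by default
x = torch.ones(10)

print(dropout(x))
# Roughly half of the values are zeroed out; the surviving ones are scaled by 1/(1-p) = 2.0
```

During evaluation (i.e. after calling `model.eval()`), Dropout is switched off automatically and simply passes its input through unchanged.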
Dropout is modeled as [Bernoulli variables](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/#bernoulli-variables), which are either zero (0) or one (1). They can be configured with a variable, \[latex\]p\[/latex\], which illustrates the probability (between 0 and 1) with which neurons are dropped. + +When neurons are dropped, they are not dropped permanently: instead, at every epoch (or even minibatch) the network randomly selects neurons that are dropped this time. Neurons that had been dropped before can be activated again during future iterations. + +- For a more detailed explanation of Dropout, see our article [_What is Dropout? Reduce overfitting in your neural networks_](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/). + +* * * + +## Using Dropout with PyTorch: full example + +Now that we understand what Dropout is, we can take a look at how Dropout can be implemented with the PyTorch framework. For this example, we are using a [basic example](https://www.machinecurve.com/index.php/2021/01/26/creating-a-multilayer-perceptron-with-pytorch-and-lightning/) that models a Multilayer Perceptron. We will be applying it to the MNIST dataset (but note that Convolutional Neural Networks are more applicable, generally speaking, for image datasets). + +In the example, you'll see that: + +- We import a variety of dependencies. These include `os` for Python operating system interfaces, `torch` representing PyTorch, and a variety of sub components, such as its neural networks library (`nn`), the `MNIST` dataset, the `DataLoader` for loading the data, and `transforms` for a Tensor transform. +- We define the `MLP` class, which is a PyTorch neural network module (`nn.Module`). Its constructor initializes the `nn.Module` super class and then initializes a `Sequential` network (i.e., a network where layers are stacked on top of each other). It begins by flattening the three-dimensional input (width, height, channels) into a one-dimensional input, then applies a `Linear` layer (MLP layer), followed by Dropout, Rectified Linear Unit. This is then repeated once more, before we end with a final `Linear` layer for the final multiclass prediction. +- The `forward` definition is a relatively standard PyTorch definition that must be included in a `nn.Module`: it ensures that the forward pass of the network (i.e., when the data is fed to the network), is performed by feeding input data `x` through the `layers` defined in the constructor. +- In the `main` check, a random seed is fixed, the dataset is loaded and prepared; the MLP, loss function and optimizer are initialized; then the model is trained. This is the classic PyTorch training loop: gradients are zeroed, a forward pass is performed, loss is computed and backpropagated through the network, and optimization is performed. Finally, after every iteration, statistics are printed. + +``` +import os +import torch +from torch import nn +from torchvision.datasets import MNIST +from torch.utils.data import DataLoader +from torchvision import transforms + +class MLP(nn.Module): + ''' + Multilayer Perceptron. 
+ ''' + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Flatten(), + nn.Linear(28 * 28 * 1, 64), + nn.Dropout(p=0.5), + nn.ReLU(), + nn.Linear(64, 32), + nn.Dropout(p=0.5), + nn.ReLU(), + nn.Linear(32, 10) + ) + + + def forward(self, x): + '''Forward pass''' + return self.layers(x) + + +if __name__ == '__main__': + + # Set fixed random number seed + torch.manual_seed(42) + + # Prepare CIFAR-10 dataset + dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()) + trainloader = torch.utils.data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=1) + + # Initialize the MLP + mlp = MLP() + + # Define the loss function and optimizer + loss_function = nn.CrossEntropyLoss() + optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4) + + # Run the training loop + for epoch in range(0, 5): # 5 epochs at maximum + + # Print epoch + print(f'Starting epoch {epoch+1}') + + # Set current loss value + current_loss = 0.0 + + # Iterate over the DataLoader for training data + for i, data in enumerate(trainloader, 0): + + # Get inputs + inputs, targets = data + + # Zero the gradients + optimizer.zero_grad() + + # Perform forward pass + outputs = mlp(inputs) + + # Compute loss + loss = loss_function(outputs, targets) + + # Perform backward pass + loss.backward() + + # Perform optimization + optimizer.step() + + # Print statistics + current_loss += loss.item() + if i % 500 == 499: + print('Loss after mini-batch %5d: %.3f' % + (i + 1, current_loss / 500)) + current_loss = 0.0 + + # Process is complete. + print('Training process has finished.') +``` + +* * * + +## References + +PyTorch. (n.d.). _Dropout — PyTorch 1.9.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html) + +MachineCurve. (2019, December 17). _What is dropout? Reduce overfitting in your neural networks_. [https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/) diff --git a/using-error-correcting-output-codes-for-multiclass-svm-classification.md b/using-error-correcting-output-codes-for-multiclass-svm-classification.md new file mode 100644 index 0000000..0e0b838 --- /dev/null +++ b/using-error-correcting-output-codes-for-multiclass-svm-classification.md @@ -0,0 +1,192 @@ +--- +title: "Using Error-Correcting Output Codes with Scikit-learn for multiclass SVM classification" +date: "2020-11-12" +categories: + - "frameworks" + - "svms" +tags: + - "classification" + - "ecoc" + - "error-correcting-output-codes" + - "multiclass-classification" + - "scikit-learn" + - "support-vector-machine" + - "svm" +--- + +Classification is a key theme in the area of Supervised Machine Learning. As we saw in [another article](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/), there are multiple forms of classification - binary, multiclass and multilabel. In binary classification, an input sample is categorized into one out of two categories. In other words, into "0 or 1", or "False or True" - you name it. + +While this can be a good approach if you have a binary classification problem, many of today's classification problems are multiclass. 
Think about the [COVID-19 classifier](https://www.machinecurve.com/index.php/2020/11/05/ml-against-covid-19-detecting-disease-with-tensorflow-keras-and-transfer-learning/), for example: it has three classes, namely COVID-19 pneumonia, Other Viral pneumonia and no pneumonia. And there are many more examples. Finally, there is also multilabel classification, where multiple classes (also known as labels in that case) are attached to an input sample. + +A variety of algorithms is natively capable of multiclass classification. Neural networks, for example, can achieve this by learning to generate a multiclass probability distribution with [Softmax](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/). Support Vector Machines (SVMs), on the other hand, cannot do this natively. There are however many approaches to creating a multiclass SVM classifier anyway. Having covered [One-vs-Rest and One-vs-One SVMs](https://www.machinecurve.com/index.php/2020/11/11/creating-one-vs-rest-and-one-vs-one-svm-classifiers-with-scikit-learn/) in another article, we will focus on **Error-Correcting Output Codes** (ECOC) in today's article. It is structured as follows. Firstly, we'll revisit why SVMs aren't capable of performing multiclass classification natively. Then, we introduce ECOCs conceptually. What are they? How can they be used for multiclass classification? Those are the questions we will answer. + +Finally, we will implement an ECOC based Support Vector Machine with the Scikit-learn Machine Learning library. Step by step, we'll look at how one can be constructed. Let's take a look! :) + +* * * + +\[toc\] + +* * * + +## Why SVMs don't support multiclass classification natively + +Suppose that we have the following assembly line, where red, yellow and blue objects are rolling down the line. They must be added to the correct bucket, which is then shipped to diverse customers. This is a **multiclass classification scenario:** + +![](images/whatisclassification5.png) + +As the task at hand (looking at the objects and putting them into the correct buckets) is really repetitive, we could create an automated system that performs the task for us. This system contains what is known as a _multiclass classifier_. It generates a decision boundary between the classes based on a set of characteristics called _features_. If two features would characterize an object (e.g. **shape** and **color**), and if we could plot them on two axes (the image below is fictitious), such a decision boundary _could_ look like this: + +![](images/ovr_boundary.png) + +As said, some Machine Learning algorithms - like the ones that optimize a Neural network - can automatically generate a decision boundary between multiple classes. For [Support Vector Machines](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/), this does not work for the simple reason that SVMs don't support this natively. But why? Let's take a look at how an SVM operates. + +An SVM is known as a _kernel method_ that _maximizes the margin between two classes_ by means of _[support vectors](https://www.machinecurve.com/index.php/2020/05/05/how-to-visualize-support-vectors-of-your-svm-classifier/)_. Kernel methods mean that a so-called _kernel function_ is used to generate a linear separation boundary between two classes by mapping the samples from the original feature space (i.e. axes) onto another one, where linear separation can be achieved. 
If the data is already linearly separable, like the black and white classes in the image below, separation is simple. In other cases, we must use more advanced [kernel functions](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/) for this purpose.

In the figure below, we can observe three decision boundaries, namely \[latex\]H\_1\[/latex\], \[latex\]H\_2\[/latex\] and \[latex\]H\_3\[/latex\]. But which one is best?

- **Is \[latex\]H\_1\[/latex\] best?** No, definitely not. It is not even capable of separating black and white, and is therefore not usable. We often see such decision boundaries in models that have just been initialized, since initialization happens relatively randomly. Soon after, the decision boundary starts to shift.
- **Is \[latex\]H\_2\[/latex\] best?** Neither! Although it _is_ capable of separating black and white, it is only marginally so. Especially if an outlier from the black class is present, the odds are that it will be assigned the wrong class, because it "crosses the line". While \[latex\]H\_2\[/latex\] is a decision boundary that works, we often see such boundaries in stages of the training process where separation has _just_ been achieved, but where more optimal solutions are still available.
- **Is \[latex\]H\_3\[/latex\] best?** Finally, yes. It is both capable of separating the two classes _and_ does so with an equal distance between \[latex\]H\_3\[/latex\] and the nearest black vectors and \[latex\]H\_3\[/latex\] and the nearest white vectors, which are both called _support vectors_. This is called a _maximum margin_ and means that the boundary is _equidistant_ to the two classes (Wikipedia, 2005).

Now this is why Support Vector Machines are called _Support Vector_ Machines. We now also know how they work: by generating a maximum-margin linear decision boundary, by means of a kernel function.

![](images/Svm_separating_hyperplanes_SVG.svg_-1024x886.png)

Hyperplanes and data points. The [image](https://en.wikipedia.org/wiki/Support-vector_machine#/media/File:Svm_separating_hyperplanes_(SVG).svg) is not edited. Author: [Zack Weinberg](https://commons.wikimedia.org/w/index.php?title=User:ZackWeinberg&action=edit&redlink=1), derived from [Cyc’s](https://commons.wikimedia.org/w/index.php?title=User:Cyc&action=edit&redlink=1) work. License: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/legalcode)

What remains unanswered is why SVMs cannot be used for multiclass classification. The answer is in fact really simple. Recall that the decision boundary must be equidistant for an SVM to converge, meaning that it must be as far as possible from both classes - and hence lie perfectly in the middle.

Now suppose that we add a third class, so that our decision boundary splits into three line segments, like in the green/blue/orange figure above. In this case, the line segment between blue and orange is equidistant for those two classes, but not for the green class. The same is true for the other two line segments. Since SVMs always try to find a maximum-margin decision boundary, finding one for more than two classes is impossible. This is why SVMs cannot be used for multiclass classification natively.

Fortunately, there is a solution: training multiple binary classifiers at once and using them jointly for generating a multiclass prediction.
The [One-vs-Rest and One-vs-One](https://www.machinecurve.com/index.php/2020/11/11/creating-one-vs-rest-and-one-vs-one-svm-classifiers-with-scikit-learn/) are two approaches that train multiple classifiers that compete against each other. In this article, we will continue with an interesting but different approach: that of Error-Correcting Output Codes. + +* * * + +## Introducing Error-Correcting Output Codes + +Let's recall that above, we were working on a three-class multiclass classification problem. **Error-Correcting Output Codes** (ECOC) represent a method for doing so by generating a variety of binary classifiers that predict _output codes_. In the table below, we see what is called a three-bit _output code_ for each class. With an output code, a class can be described in some multidimensional space (in this case, a three-dimensional space). In other words, in our case, we can draw a vector \[latex\](0, 0, 1)\[/latex\] in three-dimensional space in order to represent class 0. The same can be done for classes 1 and 2, making them unique. + +Output codes can be used to generate a multiclass classifier, by learning a wide range of binary classifiers that predict specific bits in the output code. For example, in our three-class classification scenario, a three-bit output code is capable of describing each class. Binary classifier 3 (B3) predicts whether the input sample looks more like class 0 or like classes 1/2. B2 predicts whether the input sample looks more like class 1 or like classes 0/2. B1 predicts whether it's class 2 or classes 0/1. + +By aggregating the binary classifier predictions, we get a number of bits - the output code. When generating the multiclass prediction, after aggregating the individual predictions into an output code, the predicted output code is compared to the classes available and their corresponding output code. The closest match is picked as the predicted class. This way, ECOC can be used to generate a multiclass classifier. + +
| Class / Classifier | B1 | B2 | B3 |
|--------------------|----|----|----|
| 0                  | 0  | 0  | 1  |
| 1                  | 0  | 1  | 0  |
| 2                  | 1  | 0  | 0  |
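To make the decoding step concrete, here is a minimal, purely illustrative Python sketch of how a predicted output code can be matched to the closest class code word using the Hamming distance. It is not part of the Scikit-learn implementation we will use later; the code book simply mirrors the table above.

```
import numpy as np

# Code book from the table above: one row per class, one column per binary classifier (B1, B2, B3)
code_book = np.array([
    [0, 0, 1],  # class 0
    [0, 1, 0],  # class 1
    [1, 0, 0],  # class 2
])

# Suppose the three binary classifiers predict these bits for one input sample
predicted_code = np.array([0, 1, 1])

# Hamming distance to every class code word; the closest code word wins
distances = np.sum(code_book != predicted_code, axis=1)
print(distances)             # [1 1 3] - classes 0 and 1 are tied
print(np.argmin(distances))  # 0 - argmin simply picks the first of the tied classes
```

Note how easily ties occur with only three bits: a single wrong bit already makes two classes equally plausible. This is precisely where longer output codes, and thus more binary classifiers, come in.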
+ +We haven't yet discussed why they are called _error-correcting_. The reason is simple. While for an N-class multiclass problem at least N binary classifiers must be trained, it is possible to train many more. For example, if we want to represent the three classes with 15-bit output codes (and hence 15 binary classifiers), this is perfectly possible. This is also what makes the method error-correcting: with each additional classifier, the output codes become more complex and therefore much more tailored to a specific class because _the closest class output code is chosen_. + +Now, if one binary classifier is wrong, the impact would be bigger if the number of classifiers is low (e.g., a wrong prediction in the table above would immediately switch classes). If the number is high, only one out of many bits would be wrong, and the resulting output code could still be closest to the class we actually want the input to be assigned. In other words, using more binary classifiers allows them to "correct the errors of their colleagues". Of course, the more binary classifiers are added, the more resources for training are required. + +Another aspect that we must cover is the initialization of the output codes. In the table above, we initialized the output codes randomly - I did not put any thought in assigning the numbers, except that the output codes must be unique. In many cases, random initialization provides adequate results. However, and especially in the case where many binary classifiers are used, better methods could exist (Scikit-learn, n.d.). It is however beyond the scope of this article to discuss them in detail. + +* * * + +## Implementing an ECOC based SVM with Scikit-learn + +Imagine that you would need to generate a multiclass classifier for this linearly separable dataset in a two-dimensional feature space: + +![](images/linearly.png) + +Constructing the dataset can be done in the following way: + +``` +from sklearn.datasets import make_blobs + +# Configuration options +num_samples_total = 10000 +cluster_centers = [(5,5), (3,3), (1,5)] +num_classes = len(cluster_centers) + +# Generate data +X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.30) +``` + +Note that for both the code above and the code below, we assume that you have installed the common data science dependencies. For today's code, they include `sklearn`, `matplotlib`, `numpy` and (if desired, otherwise code can be scrapped) `mlxtend`. + +Now that we have a dataset that we can use for generating an ECOC based multiclass SVM, we can actually create our model. For creating it, we will use [Scikit-learn](https://scikit-learn.org/stable/), a widely used library for Machine Learning in Python. Let's now take a look at the code for our model and walk through it step-by-step. + +- First of all, we must import all the dependencies that we need. We import `pyplot` from `matplotlib` as `plt` for visualizing the data, like we did above. Numpy will be used for saving the clusters if necessary; this could be useful if you want to perform model changes and run them with _exactly_ the same generated data. We then use many functions from `sklearn`. The `make_blobs` function will be used for data generation. `OutputCodeClassifier` is a wrapper that makes a Scikit-learn model multiclass by adding ECOC classification. `LinearSVC` is a linear SVM, which is adequate for today's article since our data can be separated linearly. 
The `train_test_split` function is used to generate a split between training data and testing data. With `plot_confusion_matrix`, we can generate a [confusion matrix](https://www.machinecurve.com/index.php/2020/05/05/how-to-create-a-confusion-matrix-with-scikit-learn/) for our classifier. Finally, importing from `mlxtend`, we add `plot_decision_regions`, for [visualizing the decision boundaries](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) of our model.
- We then specify configuration options for data generation. We create 10,000 samples around three cluster centers, and hence have 3 classes.
- Subsequently, we generate data by calling `make_blobs` with the number of samples, centers and number of classes specified above. We set a standard deviation of 0.30 in order to make the plots only a little bit scattered, but not too much.
- We then save and load the data to and from a `.npy` file. This is useful if you want to tune the model and train it with exactly the same data: by commenting out the `np.save(...)` line, you'll always reuse the previously generated data. If you don't need this, you can also delete these two lines.
- We then make the train/test split with `train_test_split` and store the split data in `X_train`, `X_test`, `y_train` and `y_test` (i.e. the split feature vectors and corresponding labels).
- We then create the `LinearSVC` SVM and initialize it with a fixed random state.
- We then wrap the `svm` into the `OutputCodeClassifier`, making it an ECOC classifier. The `code_size` argument represents the "percentage of the number of classes to be used to create the code book." (Scikit-learn, n.d.)
- We then fit the training data to the `ecoc_classifier`, starting the training process. After fitting has completed, our trained ECOC classifier remains available in `ecoc_classifier`, because we assign it there.
- The trained classifier can then be evaluated. A [confusion matrix](https://www.machinecurve.com/index.php/2020/05/05/how-to-create-a-confusion-matrix-with-scikit-learn/) is a useful tool for quick-and-dirty analysis of model performance. We use `plot_confusion_matrix` with the testing data and a Matplotlib colormap to generate it. With `normalize='true'`, we instruct Scikit-learn to display the _normalized_ predictions: in other words, we don't see the absolute number of predictions on screen, but rather the proportions between them.
- We finally plot the confusion matrix, and visualize the decision boundary of the trained classifier with `plot_decision_regions`.
+ +``` +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from sklearn.multiclass import OutputCodeClassifier +from sklearn.svm import LinearSVC +from sklearn.model_selection import train_test_split +from sklearn.metrics import plot_confusion_matrix +from mlxtend.plotting import plot_decision_regions + +# Configuration options +num_samples_total = 10000 +cluster_centers = [(5,5), (3,3), (1,5)] +num_classes = len(cluster_centers) + +# Generate data +X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.30) + +# Save/load data +np.save('./clusters.npy', X) +X = np.load('./clusters.npy') + +# Split into training and testing data +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) + +# Create the SVM +svm = LinearSVC(random_state=42) + +# Make it an ECOC classifier +ecoc_classifier = OutputCodeClassifier(svm, code_size=6) + +# Fit the data to the ECOC classifier +ecoc_classifier = ecoc_classifier.fit(X_train, y_train) + +# Evaluate by means of a confusion matrix +matrix = plot_confusion_matrix(ecoc_classifier, X_test, y_test, + cmap=plt.cm.Blues, + normalize='true') +plt.title('Confusion matrix for ECOC classifier') +plt.show(matrix) +plt.show() + +# Plot decision boundary +plot_decision_regions(X_test, y_test, clf=ecoc_classifier, legend=2) +plt.show() +``` + +Running our code, our end result is a SVM-based linear classifier that is capable of learning a correct decision boundary. As with any perfect setting (i.e. our data is linearly separable), our confusion matrix shows great performance. Of course, in the real world, your data is likely not linearly separable and the confusion matrix will be more diffuse. + +- ![](images/ecoc_conf.png) + +- ![](images/ecoc_boundary.png) + + +* * * + +## Summary + +Support Vector Machines do not support multiclass classification natively. Fortunately, by using some techniques, such classifiers can be achieved - even with SVMs. In this article, we covered Error-Correcting Output Codes. These codes, which effectively map input samples to some region of a multidimensional space (and hence a class), can uniquely describe classes through multiple binary classifiers. More than the strictly required number of binary classifiers can be added, reducing the system's sensitivity to errors, making the output codes _error-correcting_. + +We also provided an example of ECOC based multiclass classification with SVMs through the `OutputCodeClassifier` in Scikit-learn. In the example above, we walked through creating it step-by-step, so that you can understand in detail how such classifiers can be constructed. + +I hope that you have learned something from today's article! If you did, please feel free to drop a message in the comments section below 💬 Please do the same if you have any questions or other remarks. I'd love to hear from you and will respond whenever possible. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2005, February 21). _Equidistant_. Wikipedia, the free encyclopedia. Retrieved November 11, 2020, from [https://en.wikipedia.org/wiki/Equidistant](https://en.wikipedia.org/wiki/Equidistant) + +Error-Correcting Output Codes. (n.d.). [https://www.ccs.neu.edu/home/vip/teach/MLcourse/4\_boosting/lecture\_notes/ecoc/ecoc.pdf](https://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/lecture_notes/ecoc/ecoc.pdf) + +Scikit-learn. (n.d.). 
_1.12. Multiclass and multilabel algorithms — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved November 12, 2020, from [https://scikit-learn.org/stable/modules/multiclass.html#error-correcting-output-codes](https://scikit-learn.org/stable/modules/multiclass.html#error-correcting-output-codes) + +Scikit-learn. (n.d.). _Sklearn.multiclass.OutputCodeClassifier — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved November 12, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OutputCodeClassifier.html#sklearn.multiclass.OutputCodeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OutputCodeClassifier.html#sklearn.multiclass.OutputCodeClassifier) diff --git a/using-huber-loss-in-keras.md b/using-huber-loss-in-keras.md new file mode 100644 index 0000000..a2acc66 --- /dev/null +++ b/using-huber-loss-in-keras.md @@ -0,0 +1,313 @@ +--- +title: "Using Huber loss with TensorFlow 2 and Keras" +date: "2019-10-12" +categories: + - "buffer" + - "frameworks" +tags: + - "deep-learning" + - "huber-loss" + - "keras" + - "loss-function" + - "machine-learning" + - "neural-networks" + - "regression" +--- + +The Huber loss function can be used to balance between the Mean Absolute Error, or MAE, and the Mean Squared Error, MSE. It is therefore a good loss function for when you have varied data or only a few outliers. + +But how to implement this loss function in Keras? + +That's what we will find out in this blog. + +We first briefly recap the concept of a loss function and introduce Huber loss. Next, we present a Keras example implementation that uses the Boston Housing Prices Dataset to generate a regression model. + +After reading this tutorial, you will have learned... + +- What loss functions are in neural networks. +- How Huber loss works and how it combines MAE and MSE. +- How `tensorflow.keras.losses.Huber` can be used within your TensorFlow 2 / Keras model. + +Let's get to work! 🚀 + +_Note that the full code is also available on GitHub, in my [Keras loss functions repository](https://github.com/christianversloot/keras-loss-functions)._ + +* * * + +**Update 28/Jan/2021:** updated the tutorial to ensure that it is ready for 2021. The code now runs with TensorFlow 2 based versions and has been updated to use `tensorflow.keras.losses.Huber` instead of a custom Huber loss function. Also updated header information and featured image. + +* * * + +\[toc\] + +* * * + +## Summary and code example: Huber Loss with TensorFlow 2 and Keras + +Loss functions are used to compare predictions with ground truth values after the forward pass when training a neural network. There are many loss functions, and choosing one can be dependent on the dataset that you are training with. For example, in regression problems, you want to use Mean Absolute Error if you have many outliers, while if you don't Mean Squared Error can be a better choice. + +But sometimes, you don't know exactly which of these two is best. In that case, **Huber loss** can be of help. Based on a delta parameter, it shapes itself as a loss function somewhere in between MAE and MSE. This way, you have more control over your neural network. + +In TensorFlow 2 and Keras, Huber loss can be added to the compile step of your model - i.e., to `model.compile`. Here, you'll see **an example of Huber loss with TF 2 and Keras**. 
If you want to understand the loss function in more detail, make sure to read the rest of this tutorial as well! + +``` +model.compile(loss=tensorflow.keras.losses.Huber(delta=1.5), optimizer='adam', metrics=['mean_absolute_error']) +``` + +* * * + +## About loss functions and Huber loss + +When you train machine learning models, you feed data to the network, generate predictions, compare them with the actual values (the targets) and then compute what is known as a _loss_. This loss essentially tells you something about the performance of the network: the higher it is, the worse your networks performs overall. + +There are many ways for computing the loss value. [Huber loss](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#huber-loss) is one of them. It essentially combines the [Mean Absolute Error](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#mean-absolute-error-l1-loss) and the [Mean Squared Error](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#mean-squared-error) depending on some delta parameter, or 𝛿. This parameter must be configured by the machine learning engineer up front and is dependent on your data. + +Huber loss looks like this: + +[![](images/huberloss-1024x580.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/10/huberloss.jpeg) + +As you can see, for target = 0, the loss increases when the error increases. However, the speed with which it increases depends on this 𝛿 value. In fact, Grover (2019) writes about this as follows: **Huber loss approaches MAE when 𝛿 ~ 0 and MSE when 𝛿 ~ ∞ (large numbers.)** + +When you compare this statement with the benefits and disbenefits of both the MAE and the MSE, you'll gain some insights about how to adapt this delta parameter: + +- If your dataset contains large outliers, it's likely that your model will not be able to predict them correctly at once. In fact, it might take quite some time for it to recognize these, if it can do so at all. This results in large errors between predicted values and actual targets, because they're outliers. Since MSE squares errors, large outliers will distort your loss value significantly. If outliers are present, you likely don't want to use MSE. Huber loss will still be useful, but you'll have to use small values for 𝛿. +- If it does not contain many outliers, it's likely that it will generate quite accurate predictions from the start - or at least, from some epochs after starting the training process. In this case, you may observe that the errors are very small overall. Then, one can argue, it may be worthwhile to let the largest small errors contribute more significantly to the error than the smaller ones. In this case, MSE is actually useful; hence, with Huber loss, you'll likely want to use quite large values for 𝛿. +- If you don't know, you can always start somewhere in between - for example, in the plot above, 𝛿 = 1 represented MAE quite accurately, while 𝛿 = 3 tends to go towards MSE already. What if you used 𝛿 = 1.5 instead? You may benefit from both worlds. + +Let's now see if we can complete a regression problem with Huber loss! + +* * * + +## Huber loss example with TensorFlow 2/Keras + +Next, we show you how to use Huber loss with Keras to create a regression model. We'll use the **Boston housing price regression dataset** which comes with Keras by default - that'll make the example easier to follow. Obviously, you can always use your own data instead! 
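Before we do so, it helps to write down the loss itself. For a prediction error \[latex\]a\[/latex\] (the difference between the true target and the prediction), Huber loss is defined as follows:

\\begin{equation} L\_{\\delta}(a) = \\begin{cases} \\frac{1}{2}a^2, & \\text{if}\\ |a| \\leq \\delta \\\\ \\delta \\cdot (|a| - \\frac{1}{2}\\delta), & \\text{otherwise} \\\\ \\end{cases} \\end{equation}

Small errors are squared, just like with MSE, while errors larger than 𝛿 only grow linearly, just like with MAE - which is exactly why the choice of 𝛿 determines which of the two the loss resembles most.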
+ +Since we need to know how to configure 𝛿, we must inspect the data at first. Do the target values contain many outliers? Some statistical analysis would be useful here. + +Only then, we create the model and configure 𝛿 to an estimate that seems adequate. Finally, we run the model, check performance, and see whether we can improve 𝛿 any further. + +### Regression dataset: Boston housing price regression + +Keras comes with datasets on board the framework: they have them stored on some Amazon AWS server and when you load the data, they automatically download it for you and store it in user-defined variables. It allows you to experiment with deep learning and the framework easily. This way, you can get a feel for DL practice and neural networks without getting lost in the complexity of loading, preprocessing and structuring your data. + +The **Boston housing price regression dataset** is one of these [datasets](https://keras.io/datasets/#boston-housing-price-regression-dataset). It is taken by Keras from the Carnegie Mellon University StatLib library that contains many datasets for training ML models. It is described as follows: + +> The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. +> +> [StatLib Datasets Archive](http://lib.stat.cmu.edu/datasets/) + +And contains these variables, according to the StatLib website: + +- **CRIM** per capita crime rate by town +- **ZN** proportion of residential land zoned for lots over 25,000 sq.ft. +- **INDUS** proportion of non-retail business acres per town +- **CHAS** Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) +- **NOX** nitric oxides concentration (parts per 10 million) +- **RM** average number of rooms per dwelling +- **AGE** proportion of owner-occupied units built prior to 1940 +- **DIS** weighted distances to five Boston employment centres +- **RAD** index of accessibility to radial highways +- **TAX** full-value property-tax rate per $10,000 +- **PTRATIO** pupil-teacher ratio by town +- **B** 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town +- **LSTAT** % lower status of the population +- **MEDV** Median value of owner-occupied homes in $1000's + +In total, one sample contains 13 features (CRIM to LSTAT) which together approximate the median value of the owner-occupied homes or MEDV. The structure of this dataset, mapping some variables to a real-valued number, allows us to perform regression. + +Let's now take a look at the dataset itself, and particularly its target values. + +### Does the dataset have many outliers? + +The number of outliers helps us tell something about the value for d that we have to choose. When thinking back to my _Introduction to Statistics_ class at university, I remember that box plots can help visually identify outliers in a statistical sample: + +> Examination of the data for unusual observations that are far removed from the mass of data. These points are often referred to as outliers. 
Two graphical techniques for identifying outliers, scatter plots and box plots, (…) +> +> [Engineering Statistics Handbook](https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm) + +The sample, in our case, is the Boston housing dataset: it contains some mappings between feature variables and target prices, but obviously doesn't represent all homes in Boston, which would be the statistical population. Nevertheless, we can write some code to generate a box plot based on this dataset: + +``` +''' + Generate a BoxPlot image to determine how many outliers are within the Boston Housing Pricing Dataset. +''' +import tensorflow.keras +from tensorflow.keras.datasets import boston_housing +import numpy as np +import matplotlib.pyplot as plt + +# Load the data +(x_train, y_train), (x_test, y_test) = boston_housing.load_data() + +# We only need the targets, but do need to consider all of them +y = np.concatenate((y_train, y_test)) + +# Generate box plot +plt.boxplot(y) +plt.title('Boston housing price regression dataset - boxplot') +plt.show() +``` + +And next run it, to find this box plot: + +[![](images/boston_boxplot.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/boston_boxplot.png) + +Note that we concatenated the _training data_ and the _testing data_ for this box plot. Although the plot hints to the fact that many outliers exist, and primarily at the high end of the statistical spectrum (which does make sense after all, since in life extremely high house prices are quite common whereas extremely low ones are not), we cannot yet conclude that the MSE may not be a good idea. We'll need to inspect the individual datasets too. + +We can do that by simply adapting our code to: + +``` +y = y_train +``` + +or + +``` +y = y_test +``` + +This results in the following box plots: + +- ![](images/boston_boxplot_test-1.png) + +- ![](images/boston_boxplot_train-2.png) + + +Although the number of outliers is more extreme in the training data, they are present in the testing dataset as well. + +Their structure is also quite similar: most of them, if not all, are present in the high end segment of the housing market. + +Do note, however, that the median value for the testing dataset and the training dataset are slightly different. This means that patterns underlying housing prices present in the testing data may not be captured fully during the training process, because the statistical sample is slightly different. However, there is only one way to find out - by actually creating a regression model! + +### Creating the model + +Let's now create the model. Create a file called `huber_loss.py` in some folder and open the file in a development environment. We're then ready to add some code! However, let's analyze first what you'll need to use Huber loss in Keras. + +#### What you'll need to use Huber loss in Keras + +The primary dependency that you'll need is **[TensorFlow 2](https://tensorflow.org)**, one of the two deep learning libraries for Python. In TensorFlow 2, Keras is tightly coupled as `tensorflow.keras` and can therefore be used easily. In fact, today, it's _the_ way to create neural networks with TensorFlow easily. 
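If TensorFlow is not installed on your system yet, it can normally be installed with `pip install tensorflow`.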
+ +#### Model imports + +Now that we can start coding, let's import the Python dependencies that we need first: + +``` +''' + Keras model demonstrating Huber loss +''' +from tensorflow.keras.datasets import boston_housing +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +from tensorflow.keras.losses import Huber +import numpy as np +import matplotlib.pyplot as plt +``` + +Obviously, we need the `boston_housing` dataset from the available Keras datasets. Additionally, we import `Sequential` as we will build our model using the Keras Sequential API. We're creating a very simple model, a [multilayer perceptron](https://www.machinecurve.com/index.php/2019/07/30/creating-an-mlp-for-regression-with-keras/), with which we'll attempt to regress a function that correctly estimates the median values of Boston homes. For this reason, we import `Dense` layers or densely-connected ones. + +We also need `Huber` since that's the loss function we use. Numpy is used for number processing and we use Matplotlib to visualize the end result. + +#### Loading the dataset + +We next load the data by calling the Keras `load_data()` function on the housing dataset and prepare the input layer shape, which we can add to the initial hidden layer later: + +``` +# Load data +(x_train, y_train), (x_test, y_test) = boston_housing.load_data() + +# Set the input shape +shape_dimension = len(x_train[0]) +input_shape = (shape_dimension,) +print(f'Feature shape: {input_shape}') +``` + +#### Preparing the model: architecture & configuration + +Next, we do actually provide the model architecture and configuration: + +``` +# Create the model +model = Sequential() +model.add(Dense(16, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(8, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(1, activation='linear')) + +# Configure the model and start training +model.compile(loss=Huber(delta=1.5), optimizer='adam', metrics=['mean_absolute_error']) +history = model.fit(x_train, y_train, epochs=250, batch_size=1, verbose=1, validation_split=0.2) +``` + +As discussed, we use the Sequential API; here, we use two densely-connected hidden layers and one output layer. The hidden ones activate by means of [ReLU](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) and for this reason require [He uniform initialization](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/). The final layer activates linearly, because it regresses the actual value. + +Compiling the model requires specifying the delta value, which we set to 1.5, given our estimate that we don't want true MAE but that given the outliers identified earlier full MSE resemblence is not smart either. We'll optimize by means of Adam _and also define the MAE as an extra error metric_. This way, we can have an estimate about what the true error is in terms of thousands of dollars: the MAE keeps its domain understanding whereas Huber loss does not. + +Subsequently, we fit the training data to the model, complete 250 epochs with a batch size of 1 (true SGD-like optimization, albeit with Adam), use 20% of the data as validation data and ensure that the entire training process is output to standard output. 
+ +#### Performance testing & visualization + +Finally, we add some code for performance testing and [visualization](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/): + +``` +# Test the model after training +test_results = model.evaluate(x_test, y_test, verbose=1) +print(f'Test results - Loss: {test_results[0]} - MAE: {test_results[1]}') + +# Plot history: Huber loss and MAE +plt.plot(history.history['loss'], label='Huber loss (training data)') +plt.plot(history.history['val_loss'], label='Huber loss (validation data)') +plt.title('Boston Housing Price Dataset regression model - Huber loss') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() + +plt.title('Boston Housing Price Dataset regression model - MAE') +plt.plot(history.history['mean_absolute_error'], label='MAE (training data)') +plt.plot(history.history['val_mean_absolute_error'], label='MAE (validation data)') +plt.ylabel('Loss value') +plt.xlabel('No. epoch') +plt.legend(loc="upper left") +plt.show() +``` + +* * * + +## Model performance for 𝛿 = 1.5 + +Let's now take a look at how the model has optimized over the epochs with the Huber loss: + +[![](images/huber_loss_d1.5-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/huber_loss_d1.5.png) + +And with the MAE: + +[![](images/huber_loss_mae1.5-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/huber_loss_mae1.5.png) + +We can see that overall, the model was still improving at the 250th epoch, although progress was stalling - which is perfectly normal in such a training process. The mean absolute error was approximately $3.639. + +``` +Test results - Loss: 4.502029736836751 - MAE: 3.6392388343811035 +``` + +* * * + +## Recap + +In this blog post, we've seen how the Huber loss can be used to balance between MAE and MSE in machine learning regression problems. By means of the delta parameter, or 𝛿, you can configure which one it should resemble most, benefiting from the fact that you can check the number of outliers in your dataset a priori. I hope you've enjoyed this blog and learnt something from it - please let me know in the comments if you have any questions or remarks. Thanks and happy engineering! 😊 + +* * * + +## References + +Grover, P. (2019, September 25). 5 Regression Loss Functions All Machine Learners Should Know. Retrieved from [https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0](https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0) + +StatLib---Datasets Archive. (n.d.). Retrieved from [http://lib.stat.cmu.edu/datasets/](http://lib.stat.cmu.edu/datasets/) + +Keras. (n.d.). Datasets. Retrieved from [https://keras.io/datasets/](https://keras.io/datasets/) + +Keras. (n.d.). Boston housing price regression dataset. Retrieved from [https://keras.io/datasets/#boston-housing-price-regression-dataset](https://keras.io/datasets/#boston-housing-price-regression-dataset) + +Carnegie Mellon University StatLib. (n.d.). Boston house-price data. Retrieved from [http://lib.stat.cmu.edu/datasets/boston](http://lib.stat.cmu.edu/datasets/boston) + +Engineering Statistics Handbook. (n.d.). 7.1.6. What are outliers in the data? Retrieved from [https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm](https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm) + +TensorFlow. (2021). _Tf.keras.losses.Huber_. 
[https://www.tensorflow.org/api\_docs/python/tf/keras/losses/Huber](https://www.tensorflow.org/api_docs/python/tf/keras/losses/Huber) diff --git a/using-leaky-relu-with-keras.md b/using-leaky-relu-with-keras.md new file mode 100644 index 0000000..5aa8e4d --- /dev/null +++ b/using-leaky-relu-with-keras.md @@ -0,0 +1,431 @@ +--- +title: "Using Leaky ReLU with TensorFlow 2 and Keras" +date: "2019-11-12" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "activation-function" + - "activation-functions" + - "deep-learning" + - "keras" + - "machine-learning" + - "relu" +--- + +Even though the traditional [ReLU activation function](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/) is used quite often, it may sometimes not produce a converging model. This is due to the fact that ReLU maps all negative inputs to zero, with a dead network as a possible result. + +The death of a neural network? How is that even possible? + +Well, you'll find out in this blog 😄 + +We briefly recap on Leaky ReLU, and why it is necessary, and subsequently present how to implement a Leaky ReLU neural network with Keras. Additionally, we'll actually train our model, and compare its performance with a traditional ReLU network. + +**After reading this tutorial, you will...** + +- See how the _dying ReLU problem_ can impact your neural network. +- Understand how the 'negative side' of ReLU causes this problem. +- Learn using Leaky ReLU with TensorFlow, which can help solve this problem. + +Let's go! 😎 + +**Update 01/Mar/2021:** ensure that Leaky ReLU can be used with TensorFlow 2; replaced all old examples with new ones. + +* * * + +\[toc\] + +* * * + +## Recap: what is Leaky ReLU? + +As you likely know, this is how traditional ReLU activates: + +\\begin{equation} f(x) = \\begin{cases} 0, & \\text{if}\\ x < 0 \\\\ x, & \\text{otherwise} \\\\ \\end{cases} \\end{equation} + +That is, the output is \[latex\]x\[/latex\] for all \[latex\]x >= 0\[/latex\], while it's zero for all other \[latex\]x\[/latex\]. + +Generally, this works very well in many neural networks - and in fact, since this makes the model a lot sparser, the training process tends to be impacted only by the features in your dataset that actually contribute to the model's decision power. + +However, there are cases when this sparsity becomes a liability: + +- If you didn't normalize your data before you fed it to your neural network, large changes in your model's weights can occur during the first stages of the training process. When the optimizer becomes less fierce when training progresses, some weights may be just too negative - and they can no longer 'escape' from the zero-ReLU-activation. +- Similarly, when you didn't configure your model's hyperparameters well, this may occur. + +Since the majority of your neurons will be unresponsive, we call the _neural network dead_. Using ReLU may in some cases thus lead to the death of neural networks. While preventable in essence, it happens. Leaky ReLU may in fact help you here. + +Mathematically, Leaky ReLU is defined as follows (Maas et al., 2013): + +\\begin{equation} f(x) = \\begin{cases} 0.01x, & \\text{if}\\ x < 0 \\\\ x, & \\text{otherwise} \\\\ \\end{cases} \\end{equation} + +Contrary to traditional ReLU, the outputs of Leaky ReLU are small and nonzero for all \[latex\]x < 0\[/latex\]. This way, the authors of the paper argue that death of neural networks can be avoided. 
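To make the difference concrete, here is a tiny, purely illustrative NumPy snippet (separate from the Keras model we will build below) that compares both activations for a few example inputs:

```
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 3.0])
alpha = 0.01

relu = np.maximum(0.0, x)                   # [ 0.     0.     0.     3.  ]
leaky_relu = np.where(x > 0, x, alpha * x)  # [-0.02  -0.005  0.     3.  ]

print(relu)
print(leaky_relu)
```

The negative inputs are no longer mapped to exactly zero, which is what keeps a small gradient flowing through otherwise 'dead' neurons.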
We do have to note, though, that there also exists [quite some criticism](https://www.machinecurve.com/index.php/2019/10/15/leaky-relu-improving-traditional-relu/#does-leaky-relu-really-work) as to whether it really works. + +* * * + +## Leaky ReLU and the Keras API + +Nevertheless, it may be that you want to test whether traditional ReLU is to blame when you find that your Keras model does not converge. + +In that case, we'll have to know how to implement Leaky ReLU with Keras, and that's what we're going to do next 😄 + +Let's see what the Keras API tells us about Leaky ReLU: + +> Leaky version of a Rectified Linear Unit. +> It allows a small gradient when the unit is not active: `f(x) = alpha * x for x < 0`, `f(x) = x for x >= 0`. +> +> [Keras Advanced Activation Layers: LeakyReLu](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LeakyReLU) + +It is defined as follows: + +``` +tf.keras.layers.LeakyReLU(alpha=0.3) +``` + +Contrary to our definition above (where \[latex\]\\alpha = 0.01\[/latex\], Keras by default defines alpha as 0.3). This does not matter, and perhaps introduces more freedom: it allows you to experiment with some \[latex\]\\alpha\[/latex\] to find which works best for you. + +What it does? Simple - take a look at the definition from the API docs: `f(x) = alpha * x for x < 0`, `f(x) = x for x >= 0` . + +Alpha _is the slope of the curve for all \[latex\]x < 0\[/latex\]._ + +**One important thing before we move to implementation!** + +With traditional ReLU, you directly apply it to a layer, say a `Dense` layer or a `Conv2D` layer, like this: + +``` +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu', kernel_initializer='he_uniform')) +``` + +You don't do this with Leaky ReLU. Instead, you have to apply it as an additional layer, and import it as such: + +``` +# In your imports +from tensorflow.keras.layers import LeakyReLU +# In your model +# ... upstream model layers +model.add(Conv1D(8, 1, strides=1, kernel_initializer='he_uniform')) +model.add(LeakyReLU(alpha=0.1)) +# ... downstream model layers +``` + +Note my use of the He uniform initializer contrary to Xavier, [which is wise theoretically](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/) when using ReLU or ReLU-like activation functions. + +* * * + +## Implementing your Keras LeakyReLU model + +Now that we know how LeakyReLU works with Keras, we can actually implement a model using it for activation purposes. + +I chose to take the [CNN we created earlier](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/), which I trained on the MNIST dataset: it's relatively easy to train, its dataset already comes out-of-the-box with Keras, and hence it's a good starting point for educational purposes 😎 Additionally, it allows me to compare LeakyReLU performance with traditional ReLU more easily. + +Obviously, Leaky ReLU can also be used in more complex settings - just use a similar implementation as we'll create next. + +### What you'll need to run it + +You will need the following dependencies installed on your system if you want to run this model: + +- **Python**, and preferably version 3.6+. +- **TensorFlow 2** or any recent 2.x version, which contains Keras by default, in `tensorflow.keras`. +- **Matplotlib**, for [visualizing the model history](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/). 
+ +### The dataset we're using + +[![](images/mnist.png)](https://www.machinecurve.com/wp-content/uploads/2019/07/mnist.png) + +To show how Leaky ReLU can be implemented, we're going to build a convolutional neural network image classifier that is [very similar to the one we created with traditional ReLU](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/). + +It is trained with the MNIST dataset and therefore becomes capable of classifying handwritten digits into the correct classes. With normal ReLU, the model achieved very high accuracies. Let's hope that it does here as well! + +### Model file & imports + +Now, open your Explorer, navigate to some folder, and create a Python file - such as `model_leaky_relu.py`. Open a code editor, open the file in your edit, and we can start adding the imports! + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras.layers import LeakyReLU +import matplotlib.pyplot as plt +``` + +### Model configuration + +We can next specify some configuration variables: + +``` +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +leaky_relu_alpha = 0.1 +``` + +The width and height of the handwritten digits provided by the MNIST dataset are 28 pixels. Hence, we specify `img_width` and `img_height` to be 28. + +We will use a [minibatch approach](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) (although strictly speaking, we don't use Gradient Descent but Adam for optimization), with a `batch_size` of 250. We train the model for a fixed amount of iterations, with `no_epochs = 25`, and have 10 classes. This makes sense, as digits range from 0 to 9, which are ten in total. + +20% of our training data will be used for validation purposes, and hence the `validation_split` is 0.2. Verbosity mode is set to True (by means of 'one'), which means that all output is returned to the terminal when running the model. Finally, we set the \[latex\]\\alpha\[/latex\] value for Leaky ReLU; in our case to 0.1. Note that (1) any alpha value is possible _if_ it is equal or larger than zero, and (2) that you may also specify different alpha values for each layer you add Leaky ReLU to. This is however up to you. + +### Data preparation + +We can next proceed with data preparation: + +``` +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Convert them into black or white: [0, 1]. +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) +``` + +This essentially resolves to these steps: + +- Loading the MNIST dataset by calling the Keras API (this is what I meant with relative ease - the dataset is a default Keras dataset, which means that we don't have to write much code for importing, benefiting today's educational purposes). 
+- Reshaping data based on whether your backend (TensorFlow, Theano or CNTK) uses a channels first / channels last approach. +- Next, we parse the training data as `float32` values. This is argued to make the training process faster (Quora, n.d.). +- We subsequently normalize our data. +- Finally, we convert our data into categorical format. That is, we fix the number of categories and convert our integer targets into category vectors. This allows us to use the [categorical crossentropy loss function](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/). + +### Model architecture + +We can next define our model's architecture. + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), input_shape=input_shape)) +model.add(LeakyReLU(alpha=leaky_relu_alpha)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3))) +model.add(LeakyReLU(alpha=leaky_relu_alpha)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256)) +model.add(LeakyReLU(alpha=leaky_relu_alpha)) +model.add(Dense(no_classes, activation='softmax')) +``` + +Note that we're using the Sequential API, which is the easiest one and most suitable for simple Keras problems. We specify two blocks with `Conv2D` layers, apply `LeakyReLU` directly after the convolutional layer, and subsequently apply `MaxPooling2D` and `Dropout`. + +Subsequently, we `Flatten` our input into onedimensional format to allow the `Dense` or densely-connected layers to handle it. The first, which used traditional ReLU in the traditional scenario, is now also followed by Leaky ReLU. The final `Dense` layer has ten output neurons (since `no_classes = 10`) and the activation function is Softmax, to generate the multiclass probability distribution we're looking for as we use categorical data. + +A few important observations: + +- Note that by omitting any activation function for the `Conv2D` layers and the first `Dense` layer, we're essentially telling Keras to use a linear activation function instead. This activates as \[latex\]f(x) = x\[/latex\]. [Normally, this is a bad idea](https://www.machinecurve.com/index.php/2019/06/11/why-you-shouldnt-use-a-linear-activation-function/), but today it is not, as we directly apply Leaky ReLU afterwards. +- The `input_shape` parameter is based on our dataset. +- As discussed before, Leaky ReLU is applied by specifying an extra layer to the model stack, _not by specifying some `activation=''` in the layer you're applying it on!_ + +### Adding model configuration & performing training + +Next, we can specify our hyperparameters and start the training process: + +``` +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +We assign the results of fitting the data to the configured model to the `history` object in order to visualize it later. 
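One important note before running the code above: the first `Conv2D` layer uses an `input_shape` variable, which comes from the reshaping step listed in the data preparation bullets but is not shown in the configuration snippet. A minimal way to define it - assuming a channels-last backend such as TensorFlow, so treat the exact shape as an assumption - is to add the following directly after loading the MNIST data:

```
# Reshape the data into (samples, width, height, channels) format
# and define the input_shape used by the first Conv2D layer.
input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1)
input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1)
input_shape = (img_width, img_height, 1)
```

The same addition applies to the full model code listed further below.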
+ +### Performance testing & visualization + +Finally, we can add code for performance testing and visualization: + +``` +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss for Keras Leaky ReLU CNN: {score[0]} / Test accuracy: {score[1]}') + +# Visualize model history +plt.plot(history.history['accuracy'], label='Training accuracy') +plt.plot(history.history['val_accuracy'], label='Validation accuracy') +plt.title('Leaky ReLU training / validation accuracies') +plt.ylabel('Accuracy') +plt.xlabel('Epoch') +plt.legend(loc="upper left") +plt.show() + +plt.plot(history.history['loss'], label='Training loss') +plt.plot(history.history['val_loss'], label='Validation loss') +plt.title('Leaky ReLU training / validation loss values') +plt.ylabel('Loss value') +plt.xlabel('Epoch') +plt.legend(loc="upper left") +plt.show() +``` + +The first block takes the testing code and generates test loss and test accuracy values - in order to find out whether the trained model generalizes well beyond data it has already seen before. + +The second and third block simply use Matplotlib to visualize the accuracy and loss values over time, i.e. for every epoch or iteration. These can be saved to your system and used in e.g. reports, as we will show next. + +### Full model code + +If you are interested, you can also copy the full model code here: + +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras.layers import LeakyReLU +import matplotlib.pyplot as plt + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +leaky_relu_alpha = 0.1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Convert them into black or white: [0, 1]. 
+input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), input_shape=input_shape)) +model.add(LeakyReLU(alpha=leaky_relu_alpha)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3))) +model.add(LeakyReLU(alpha=leaky_relu_alpha)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256)) +model.add(LeakyReLU(alpha=leaky_relu_alpha)) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss for Keras Leaky ReLU CNN: {score[0]} / Test accuracy: {score[1]}') + +# Visualize model history +plt.plot(history.history['accuracy'], label='Training accuracy') +plt.plot(history.history['val_accuracy'], label='Validation accuracy') +plt.title('Leaky ReLU training / validation accuracies') +plt.ylabel('Accuracy') +plt.xlabel('Epoch') +plt.legend(loc="upper left") +plt.show() + +plt.plot(history.history['loss'], label='Training loss') +plt.plot(history.history['val_loss'], label='Validation loss') +plt.title('Leaky ReLU training / validation loss values') +plt.ylabel('Loss value') +plt.xlabel('Epoch') +plt.legend(loc="upper left") +plt.show() +``` + +* * * + +## Model performance + +Now, we can take a look at how our model performs. Additionally, since we also retrained the [Keras CNN with traditional ReLU](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) as part of creating the model defined above, we can even compare traditional ReLU with Leaky ReLU for the MNIST dataset! + +### LeakyReLU model performance + +Generally speaking, I'm quite satisfied with how the model performed during training. The curves for loss and accuracy are actually pretty normal - large improvements at first, slower improvements at last. Perhaps, the model already starts overfitting slightly, as validation loss is stable after the 10th epoch and perhaps already increasing _very lightly_. However, that's not (too) relevant for now. + +[![](images/lrr_lr_losses.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lrr_lr_losses.png) + +[![](images/lrr_lr_accuracies.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lrr_lr_accuracies.png) + +As we can observe from our evaluation metrics, test accuracy was 99.19% - that's really good! 
+ +``` +Test loss for Keras ReLU CNN: 0.02855007330078265 / Test accuracy: 0.9919000267982483 +``` + +### Comparing LeakyReLU and normal / traditional ReLU + +Comparing our Leaky ReLU model with traditional ReLU produced these results: + +[![](images/lrr_lrr_loss.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lrr_lrr_loss.png) + +[![](images/lrr_lrr_acc.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/lrr_lrr_acc.png) + +With those evaluation metrics for testing: + +``` +Test loss for Keras Leaky ReLU CNN: 0.029994659566788557 / Test accuracy: 0.9927999973297119 +Test loss for Keras ReLU CNN: 0.02855007330078265 / Test accuracy: 0.9919000267982483 +``` + +I'd say they perform equally well. Although the traditional ReLU model seems to perform _slightly_ better than Leaky ReLU during training and testing, it's impossible to say whether this occurs by design or by chance (e.g., due to [pseudo-random weight initialization](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/)). + +* * * + +## Summary + +By consequence, we can perhaps argue - in line with the [criticism we saw before](https://www.machinecurve.com/index.php/2019/10/15/leaky-relu-improving-traditional-relu/#does-leaky-relu-really-work) - that in most cases, Leaky ReLU does not perform better than traditional ReLU. This makes sense, as the leaky variant is only expected to work much better in the cases when you experience many dead neurons. + +Nevertheless, it can be used with Keras, as we have seen in this blog post. We first introduced the concept of Leaky ReLU by recapping on how it works, comparing it with traditional ReLU in the process. Subsequently, we looked at the Keras API and how Leaky ReLU is implemented there. We then used this knowledge to create an actual Keras model, which we also used in practice. By training on the MNIST dataset, we also investigated how well it performs and compared it with traditional ReLU, as we've seen above. + +I hope you've learnt something from this blog post - or that it was useful in other ways 😊 Let me know if you have any questions or if you think that it can be improved. I'll happily answer your comments, which you can leave in the comments box below 👇 + +Thanks again for visiting MachineCurve - and happy engineering! 😎 + +* * * + +## References + +Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models. Retrieved from [https://www.semanticscholar.org/paper/Rectifier-Nonlinearities-Improve-Neural-Network-Maas/367f2c63a6f6a10b3b64b8729d601e69337ee3cc](https://www.semanticscholar.org/paper/Rectifier-Nonlinearities-Improve-Neural-Network-Maas/367f2c63a6f6a10b3b64b8729d601e69337ee3cc) + +Keras. (n.d.). Advanced Activations Layers: LeakyReLU. Retrieved from [https://keras.io/layers/advanced-activations/#leakyrelu](https://keras.io/layers/advanced-activations/#leakyrelu) + +Quora. (n.d.). When should I use tf.float32 vs tf.float64 in TensorFlow? Retrieved from [https://www.quora.com/When-should-I-use-tf-float32-vs-tf-float64-in-TensorFlow](https://www.quora.com/When-should-I-use-tf-float32-vs-tf-float64-in-TensorFlow) + +TensorFlow. (n.d.). _Tf.keras.layers.LeakyReLU_. 
[https://www.tensorflow.org/api\_docs/python/tf/keras/layers/LeakyReLU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LeakyReLU) diff --git a/using-radial-basis-functions-for-svms-with-python-and-scikit-learn.md b/using-radial-basis-functions-for-svms-with-python-and-scikit-learn.md new file mode 100644 index 0000000..4de6cdf --- /dev/null +++ b/using-radial-basis-functions-for-svms-with-python-and-scikit-learn.md @@ -0,0 +1,427 @@ +--- +title: "Using Radial Basis Functions for SVMs with Python and Scikit-learn" +date: "2020-11-25" +categories: + - "frameworks" + - "svms" +tags: + - "classification" + - "classifier" + - "kernel" + - "kernel-function" + - "machine-learning" + - "python" + - "radial-basis-function" + - "scikit-learn" + - "support-vector-machine" + - "support-vectors" + - "svm" +--- + +There is a wide variety of Machine Learning algorithms that you can choose from when building a model. One class of models, Support Vector Machines, is used quite frequently, besides Neural Networks, of course. SVMs, as they are abbreviated, can be used to successfully build nonlinear classifiers, an important benefit of a Machine Learning model. + +However, contrary to Neural Networks, you have to choose the specific kernel with which a mapping towards a linearly separable dataset is created, yourself. **Radial Basis Functions** can be used for this purpose, and they are in fact the default kernel for Scikit-learn's nonlinear SVM module. But what are these functions? And how do they help with SVMs, to generate this "linearly separable dataset"? We take a look at all these questions in this article. + +It is structured as follows. First of all, we take a look at introducing nonlinearity to Support Vector Machines. It shows why linear SVMs have difficulties with fitting on nonlinear data, and includes a brief analysis about how SVMs work in the first place. Secondly, we introduce Radial Basis Functions conceptually, and zoom into the RBF used by Scikit-learn for learning an RBF SVM. This is precisely what we will do thirdly: create an actual RBF based Support Vector Machine with Python and Scikit-learn. We walk you through the process **step-by-step**, so that you can understand each detail and hence grasp the concept as a whole. + +Let's take a look! + +**Update 08/Dec/2020:** added link to PCA article. + +* * * + +\[toc\] + +* * * + +## Introducing nonlinearity to Support Vector Machines + +If we want to understand why Radial Basis Functions can help you with training a Support Vector Machine classifier, we must first take a look at _why_ this is the case. + +And the only way we can do so is by showing when it does _not_ work as expected, so we're going to build [a simple linear SVM classifier](https://www.machinecurve.com/index.php/2020/05/03/creating-a-simple-binary-svm-classifier-with-python-and-scikit-learn/) with Scikit-learn. + +### Creating a linear classifier + +[![](images/classes-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/classes-1.png) + +Suppose that we have a dataset as the one pictured on the right. We can see two blobs of data that are linearly separable. In other words, we can draw a line which is capable of fully separating the two classes from each other. + +We can now create a **linear** **classifier** using **Support Vector Machines**. The code below illustrates how we can do this. + +- **We perform some imports.** First of all, for visualization purposes, we import `matplotlib.pyplot`. Then, we also import `numpy`, for numbers processing. 
From `sklearn`, we import a lot of functions: `make_blobs` for generating the blobs we see on the right, `SVC` which represents a Support Vector Machine Classifier, `train_test_split` for [generating a training and testing set](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/), and two `metrics` for plotting a [confusion matrix](https://www.machinecurve.com/index.php/2020/05/05/how-to-create-a-confusion-matrix-with-scikit-learn/) and displaying accuracy score. Finally, we import `plot_decision_regions` from [Mlxtend](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) to plot the decision boundary of our model. +- **We specify configuration options**. These are mainly related to the dataset that is created for our model. Our blobs will have a total of 2500 samples, there will be two clusters with centers at \[latex\](3, 3)\[/latex\] and \[latex\](5, 5)\[/latex\] (this matches with the image!) and hence an equal number of classes. +- **We generate and process the dataset**. This involves invoking `make_blobs` to generate the linearly separable clusters and generating the [train/test split](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/). +- **We create and train the Support Vector Machine**. This involves initializing the `SVC` and fitting the _training_ data to it. Note explicitly that we use a _linear_ kernel. In other words, we create a SVM that works with linear data - and this is a crucial fact for the rest of this article! +- **We evaluate the model**. We generate a [confusion matrix](https://www.machinecurve.com/index.php/2020/05/05/how-to-create-a-confusion-matrix-with-scikit-learn/), compute accuracy based on [predictions](https://www.machinecurve.com/index.php/2020/02/21/how-to-predict-new-samples-with-your-keras-model/), and [plot the decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) for our model. + +Let's now run the model - ensure that you have installed the Python packages (`matplotlib`, `numpy`, `scikit-learn` and `mlxtend`) and run the code! 
+ +``` +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_blobs +from sklearn.svm import SVC +from sklearn.model_selection import train_test_split +from sklearn.metrics import plot_confusion_matrix, accuracy_score +from mlxtend.plotting import plot_decision_regions + +# Configuration options +num_samples_total = 2500 +cluster_centers = [(5,5), (3,3)] +num_classes = len(cluster_centers) + +# Generate data +X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, center_box=(0, 1), cluster_std = 0.30) + +# Split into training and testing data +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) + +# Create the SVM +svm = SVC(random_state=42, kernel='linear') + +# Fit the data to the SVM classifier +svm = svm.fit(X_train, y_train) + +# Evaluate by means of a confusion matrix +matrix = plot_confusion_matrix(svm, X_test, y_test, + cmap=plt.cm.Blues, + normalize='true') +plt.title('Confusion matrix for linear SVM') +plt.show(matrix) +plt.show() + +# Generate predictions +y_pred = svm.predict(X_test) + +# Evaluate by means of accuracy +accuracy = accuracy_score(y_test, y_pred) +print(f'Model accuracy: {accuracy}') + +# Plot decision boundary +plot_decision_regions(X_test, y_test, clf=svm, legend=2) +plt.show() +``` + +### Evaluating model performance + +After the model finishes training, we get two plots and an accuracy metric printed on screen. + +``` +Model accuracy: 1.0 +``` + +- [![](images/0cf.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/0cf.png) + +- [![](images/0db.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/0db.png) + + +We can see that our classifier works perfectly. Our confusion matrix illustrates that _all_ examples have been classified correctly, and the reason why becomes clear when looking at the decision boundary plot: _it can perfectly separate the blobs_. + +But this is what we already expected, didn't we? ;-) + +### What happens when our data becomes nonlinear? + +Now suppose that instead we had a dataset that cannot be separated linearly, i.e. by drawing a line, like this one: + +[![](images/g1.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/g1.png) + +We can also try to use a linear Support Vector Machine by making a few changes to our model code. + +- Instead of `make_blobs`, we use `make_gaussian_quantiles` to generate the Gaussian data. +- For this reason, we also specify different Configuration options. 
+ +``` +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_gaussian_quantiles +from sklearn.svm import SVC +from sklearn.model_selection import train_test_split +from sklearn.metrics import plot_confusion_matrix, accuracy_score +from mlxtend.plotting import plot_decision_regions + +# Configuration options +num_samples_total = 2500 +gaussian_mean = (2,3) +num_classes_total = 2 +num_features_total = 2 + +# Generate data +X, y = make_gaussian_quantiles(n_features=num_features_total, n_classes=num_classes_total, n_samples=num_samples_total, mean=gaussian_mean) + +# Split into training and testing data +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) + +# Create the SVM +svm = SVC(random_state=42, kernel='linear') + +# Fit the data to the SVM classifier +svm = svm.fit(X_train, y_train) + +# Evaluate by means of a confusion matrix +matrix = plot_confusion_matrix(svm, X_test, y_test, + cmap=plt.cm.Blues, + normalize='true') +plt.title('Confusion matrix for linear SVM') +plt.show(matrix) +plt.show() +``` + +The outcome? + +``` +Model accuracy: 0.6206060606060606 +``` + +Oops. + +- [![](images/2cf.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/2cf.png) + +- [![](images/2db.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/2db.png) + + +Even more oops. + +Clearly, our confusion matrix shows that our model no longer performs so well. The accuracy has also dropped dramatically: from 100% to ~62%. The decision boundary plot clearly shows why: the line which is learned by the _linear_ SVM is simply incapable of learning an appropriate decision boundary for our dataset. + +In fact, when retraining the model for a few times, I saw cases where no line was found at all, dropping the accuracy to 50% (simple guesswork, as you're right in half the cases when your dataset is 50/50 split between the classes and all outputs are guessed to be of the same class). + +But we did also expect that, didn't we? ;-) + +* * * + +## SVM Kernels and Radial Basis Functions + +This article covers Radial Basis Functions (RBFs) and their application within Support Vector Machines for training Machine Learning models. I get it - but the previous section gave you the necessary context to understand why RBFs can be used to allow for training with nonlinear data in some cases. + +### Changing the SVM kernel we use + +In the article about [Support Vector Machines](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/), we read that SVMs are part of the class of **kernel methods**. In addition, they are **maximum-margin classifiers**, and they attempt to maximize the distance from **support vectors** to a **hyperplane** for generating the best decision boundary. + +Let's first cover these terms in more detail, but we'll do so briefly, so that we can move on with full understanding. + +- Support Vector Machines will attempt to learn a _hyperplane_ that separates the data. A hyperplane is always an \[latex\]N-1\[/latex\] dimensional object. Let's take a look at the figure above. We know that our feature space (e.g. all the axes onto which we map our samples) represents two dimensions (and hence there are two features per sample: \[latex\]X\_1\[/latex\] and \[latex\]X\_2\[/latex\]). This can be visualized as a plane. Our _hyperplane_ is therefore \[latex\]N-1 = 2-1 = 1\\text{-dimensional}\[/latex\], and represents a line. +- They will do so by means of _support vectors_. 
These are feature vectors (or their processed variants, e.g. when using [PCA](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/)) that are closest to the hyperplane. They help support the position of the hyperplane by giving input about the _margin_ between them and the hyperplane. The goal is to find a hyperplane (a line, in this case) which maximizes the margin between the _support vectors of each class_ and the hyperplane. In other words, \[latex\]H\_3\[/latex\] is the best hyperplane because it uses few support vectors _and_ ensures that it is as far away from both classes as possible. +- Support Vector Machines are _kernel methods_. They are so because they require _linear separability_ for the hyperplane learning to work well. We saw this in the example above: if our data is linearly separable, it will learn to distinguish between the classes perfectly. If it's not linearly separable, performance goes south. With a _[kernel function](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/#what-if-data-is-not-linearly-separable-kernels)_, however, we can try and make our dataset as linearly separable as possible! Kernel functions map our input to another space where linear separability is sometimes possible, but do so in a smart way using what is known as the _[kernel trick](https://en.wikipedia.org/wiki/Kernel_method#Mathematics:_the_kernel_trick)_, avoiding the actual computational cost. + +Contrary to neural networks, which learn their mappings themselves, kernel functions are not learned - they must be provided. This is why we explicitly stated that our `kernel='linear'` in the example above. We wanted to use a linear kernel, which essentially maps inputs to outputs \[latex\]\\textbf{x} \\rightarrow \\textbf{y}\[/latex\] as follows: \[latex\]\\textbf{y}: f(\\textbf{x}) = \\textbf{x}\[/latex\]. In other words, it makes a linear mapping. It allowed us to demonstrate the linearity requirement of a SVM when no kernel or a linear kernel is used. + +![](images/Svm_separating_hyperplanes_SVG.svg_-1024x886.png) + +Hyperplanes and data points. The [image](https://en.wikipedia.org/wiki/Support-vector_machine#/media/File:Svm_separating_hyperplanes_(SVG).svg)is not edited. Author: [Zack Weinberg](https://commons.wikimedia.org/w/index.php?title=User:ZackWeinberg&action=edit&redlink=1), derived from [Cyc’s](https://commons.wikimedia.org/w/index.php?title=User:Cyc&action=edit&redlink=1) work. License: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/legalcode) + +### Introducing Radial Basis Functions as SVM kernels + +Fortunately, there are many kernel functions that can be used. It's even possible to define your [custom kernel function](https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html), if you want to. + +The point here is that kernel functions must fit your data. It is important that the kernel function you are using ensures that (most of) the data becomes linearly separable: it will be effective only then. + +Now, for some datasets, so-called **Radial Basis Functions** can be used as kernel functions for your Support Vector Machine classifier (or [regression model](https://www.machinecurve.com/index.php/2019/09/20/intuitively-understanding-svm-and-svr/#support-vector-regression)). We will see visually how they can be used with our dataset later in this article, but we will first take a look at what these functions are and how they work. 
+ +> A **radial basis function** (**RBF**) is a real-valued function ![{\textstyle \varphi }](https://wikimedia.org/api/rest_v1/media/math/render/svg/99015519246670af1cb5592e439ad64a27fb4830) whose value depends only on the distance between the input and some fixed point, either the origin, so that ![{\textstyle \varphi (\mathbf {x} )=\varphi (\left\|\mathbf {x} \right\|)}](https://wikimedia.org/api/rest_v1/media/math/render/svg/d7a7e824d6055f994c4c4db1af779491e2e7bb8f), or some other fixed point ![{\textstyle \mathbf {c} }](https://wikimedia.org/api/rest_v1/media/math/render/svg/d0d8239c0502ae3b5f33956596b3309fcb61bbc6), called a _center_ (...) +> +> Wikipedia (2005) + +In other words, if we choose some point, the _output_ of an RBF will be the distance between that point and some fixed point. In other words, we can create a \[latex\]z\[/latex\] dimension with the outputs of this RBF, which essentially get a 'height' based on how far the point is from some point. + +There are in fact many RBF implementations that can be used (Wikipedia, 2005). Scikit-learn implements what is known as the "squared-exponential kernel" (Scikit-learn, n.d.). + +### Scikit-learn's RBF implementation + +This **squared-exponential kernel** can be expressed mathematically as follows (Scikit-learn, n.d.): + +\[latex\]k(x\_i, x\_j) = \\exp\\left(- \\frac{d(x\_i, x\_j)^2}{2l^2} \\right)\[/latex\] + +Here, \[latex\]d(\\cdot,\\cdot)\[/latex\] is the [Euclidian distance](https://en.wikipedia.org/wiki/Euclidean_distance) between two points, and the \[latex\]l\[/latex\] stands for the length scale of the kernel (Scikit-learn, n.d.), which tells us something about the wiggliness of the mapping of our kernel function. + +In other words, the bigger the distance \[latex\]d(x\_i, x\_j)\[/latex\], the larger the value that goes into the exponent, and the lower the \[latex\]z\[/latex\] value will be: + +``` +>>> import numpy as np +>>> np.exp(0) +1.0 +>>> np.exp(-0.5) +0.6065306597126334 +>>> np.exp(-1) +0.36787944117144233 +>>> np.exp(-10) +4.5399929762484854e-05 +``` + +### What happens when we apply an RBF to our nonlinear dataset? + +Let's now apply the RBF kernel to our nonlinear dataset. Recall that our dataset looks as follows: + +[![](images/g1.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/g1.png) + +We can visualize what happens with our dataset in a third axis (which the SVM can use easily for linear separability with the kernel trick) with the following code. + +- We import many things that we need: the MatplotLib 3D plot facilities, the RBF kernel, and the [Z-score normalizer](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) with which we can rescale the dataset to \[latex\](\\mu = 0.0, \\sigma = 1.0)\[/latex\]. +- We then create the 3D Plot, specify the colors definition, generate and scale the data - just as we are familiar with from other articles and the sections above. +- We then generate the \[latex\]z\[/latex\] component for our data by calling the RBF with the default length scale of `1.0`. +- We then plot the data into a 3D scatter chart. 
+ +``` +from mpl_toolkits.mplot3d import Axes3D +from sklearn.gaussian_process.kernels import RBF +from sklearn.datasets import make_gaussian_quantiles +import matplotlib.pyplot as plt +import numpy as np +from sklearn.preprocessing import StandardScaler + +# Create 3D Plot +fig = plt.figure() +ax = fig.add_subplot(111, projection='3d') + +# Colors definition +colors = { +0: '#b40426', +1: '#3b4cc0', +2: '#f2da0a', +3: '#fe5200' +# ... and so on +} + +# Generate data +X, y = make_gaussian_quantiles(n_features=2, n_classes=2, n_samples=2500, mean=(2,3)) + +# Scale data +scaler = StandardScaler() +scaler.fit(X) +X = scaler.transform(X) + +# Generate Z component +z = RBF(1.0).__call__(X)[0] + +# Plot +colors = list(map(lambda x: colors[x], y)) +ax.scatter(X[:, 0], X[:, 1], z, c=colors, marker='o') + +ax.set_xlabel('X Label') +ax.set_ylabel('Y Label') +ax.set_zlabel('Z Label') + +plt.show() +``` + +This is the outcome, visualized from three angles: + +- [![](images/rbf3.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/rbf3.png) + +- [![](images/rbf2.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/rbf2.png) + +- [![](images/rbf1.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/rbf1.png) + + +We recognize aspects from our sections above. For example, the RBF we used maps highest values to points closest to the origin, where the center of our dataset is. In addition, when we look at the data from above, we find back our original 2D Gaussian data. And clearly, in this three-dimensional space, we can even think about learning a hyperplane (a plane, in this case, because our space is now a cube) that can linearly separate much more of the data! + +Let's take a look what happens when we implement our Scikit-learn classifier with the RBF kernel. + +* * * + +## RBF SVMs with Python and Scikit-learn: an Example + +We can easily implement an RBF based SVM classifier with Scikit-learn: the only thing we have to do is change `kernel='linear'` to `kernel='rbf'` during `SVC(...)` initialization. We also change the `plt.title(...)` of our confusion matrix, to illustrate that it was trained with an RBF based SVM. + +For the rest, we configure, generate, split, create, fit and evaluate just as we did above. 
+ +``` +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_gaussian_quantiles +from sklearn.svm import SVC +from sklearn.model_selection import train_test_split +from sklearn.metrics import plot_confusion_matrix, accuracy_score +from mlxtend.plotting import plot_decision_regions + +# Configuration options +num_samples_total = 2500 +cluster_centers = [(5,5), (3,3), (1,5)] +num_classes = len(cluster_centers) + +# Generate data +X, y = make_gaussian_quantiles(n_features=2, n_classes=2, n_samples=2500, mean=(2,3)) + +# Split into training and testing data +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) + +# Create the SVM +svm = SVC(random_state=42, kernel='rbf') + +# Fit the data to the SVM classifier +svm = svm.fit(X_train, y_train) + +# Evaluate by means of a confusion matrix +matrix = plot_confusion_matrix(svm, X_test, y_test, + cmap=plt.cm.Blues, + normalize='true') +plt.title('Confusion matrix for RBF SVM') +plt.show(matrix) +plt.show() + +# Generate predictions +y_pred = svm.predict(X_test) + +# Evaluate by means of accuracy +accuracy = accuracy_score(y_test, y_pred) +print(f'Model accuracy: {accuracy}') + +# Plot decision boundary +plot_decision_regions(X_test, y_test, clf=svm, legend=2) +plt.show() +``` + +### Evaluating model performance + +After fitting the data and hence training the classifier, this is the output for the RBF based classifier: + +``` +Model accuracy: 0.9915151515151515 +``` + +- ![](images/3cm.png) + +- ![](images/3db.png) + + +We're back at great performance, and the decision boundary clearly shows that we can classify (most of) the samples correctly! + +It will also work with data of various other shapes: + +- [![](images/4cm.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/4cm.png) + +- [![](images/4db.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/4db.png) + + +This is the power of Radial Basis Functions when they are used as kernel functions for your SVM classifier. + +### Cautionary remarks + +We saw that RBFs can really boost SVM performance when they are used with nonlinear SVMs. However, towards the end of the article, I must stress one thing that we already touched earlier but which may have been sunk in your memory: + +**While RBFs can be great, they are not the holy grail.** + +In other words: while they can work in many cases, they don't work in many other cases. + +This is because the way that this particular kernel function works, mapping distances between some point and other points. The dataset above clearly fit this purpose because it covered a circle and a ring, where the ring is always farthest away from the center of the circle; and the circle is always closer than the ring. This made that data perfectly suitable for RBFs. + +So, to conclude: pick, or create if none is available, a kernel function that best matches **your** data. Perform exploration on your feature space first; apply kernel functions second. + +* * * + +## Summary + +In this article, we looked at one of the ways forward when your Support Vector Machine does not work because your data is not linear - apply Radial Basis Functions. We first explored how linear data can be classified easily with a Support Vector Machine classifier using Python and Scikit-learn. By changing our data into a nonlinear structure, however, this changed, and it no longer worked. 
+ +We saw that Radial Basis Functions, which measure the distance of a sample to a point, can be used as a kernel function and hence allow for learning a linear decision boundary in nonlinear data, applying the kernel trick. + +Using a variety of visual and code examples, we explained step-by-step how we can use Scikit-learn and Python to apply RBFs for your Support Vector Machine based Machine Learning model. I hope that this article was useful to you and that you have learned something by reading it. If you did, please feel free to leave a message in the comments section 💬 Please do the same if you have any comments or questions. + +Thanks for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2005, July 26). _Radial basis function_. Wikipedia, the free encyclopedia. Retrieved November 25, 2020, from [https://en.wikipedia.org/wiki/Radial\_basis\_function](https://en.wikipedia.org/wiki/Radial_basis_function) + +Scikit-learn. (n.d.). _Sklearn.gaussian\_process.kernels.RBF — scikit-learn 0.23.2 documentation_. scikit-learn: machine learning in Python — scikit-learn 0.16.1 documentation. Retrieved November 25, 2020, from [https://scikit-learn.org/stable/modules/generated/sklearn.gaussian\_process.kernels.RBF.html](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.RBF.html) diff --git a/using-relu-sigmoid-and-tanh-with-pytorch-ignite-and-lightning.md b/using-relu-sigmoid-and-tanh-with-pytorch-ignite-and-lightning.md new file mode 100644 index 0000000..eb01543 --- /dev/null +++ b/using-relu-sigmoid-and-tanh-with-pytorch-ignite-and-lightning.md @@ -0,0 +1,385 @@ +--- +title: "Using ReLU, Sigmoid and Tanh with PyTorch, Ignite and Lightning" +date: "2021-01-21" +categories: + - "buffer" + - "deep-learning" + - "frameworks" +tags: + - "activation-function" + - "activation-functions" + - "deep-learning" + - "deep-neural-network" + - "ignite" + - "lightning" + - "machine-learning" + - "neural-networks" + - "pytorch" + - "relu" + - "sigmoid" + - "tanh" +--- + +Rectified Linear Unit, Sigmoid and Tanh are three activation functions that play an important role in how neural networks work. In fact, if we do not use these functions, and instead use _no_ function, our model will be unable to learn from nonlinear data. + +This article zooms in on ReLU, Sigmoid and Tanh, specifically tailored to the PyTorch ecosystem. With simple explanations and code examples you will understand how they can be used within PyTorch and its variants. In short, after reading this tutorial, you will... + +- Understand what activation functions are and why they are required. +- Know the shape, benefits and drawbacks of ReLU, Sigmoid and Tanh. +- Have implemented ReLU, Sigmoid and Tanh with PyTorch, PyTorch Lightning and PyTorch Ignite. + +All right, let's get to work! 🔥 + +* * * + +\[toc\] + +* * * + +## Summary and example code: ReLU, Sigmoid and Tanh with PyTorch + +Neural networks have boosted the field of machine learning in the past few years. However, they do not work well with nonlinear data natively - we need an activation function for that. Activation functions take any number as input and map inputs to outputs. As any function can be used as an activation function, we can also use nonlinear functions for that goal. + +As results have shown, using nonlinear functions for that purpose ensures that the neural network as a whole can learn from nonlinear datasets such as images.
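To get a feeling for what this mapping looks like in practice, the snippet below applies PyTorch's built-in element-wise versions of the three functions to a handful of example values (the numbers are arbitrary, purely for illustration):

```
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

print(torch.relu(x))     # negative inputs become 0.0, positive inputs pass through unchanged
print(torch.sigmoid(x))  # every input is squashed into the (0, 1) range
print(torch.tanh(x))     # every input is squashed into the (-1, 1) range
```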
+ +The **Rectified Linear Unit (ReLU), Sigmoid and Tanh** **activation functions** are the most widely used activation functions these days. From these three, ReLU is used most widely. All functions have their benefits and their drawbacks. Still, ReLU has mostly stood the test of time, and generalizes really well across a wide range of deep learning problems. + +In this tutorial, we will cover these activation functions in more detail. Please make sure to read the rest of it if you want to understand them better. Do the same if you're interested in better understanding the implementations in PyTorch, Ignite and Lightning. Next, we'll show code examples that help you get started immediately. + +### Classic PyTorch and Ignite + +In classic PyTorch and PyTorch Ignite, you can choose from one of two options: + +1. Add the activation functions `nn.Sigmoid()`, `nn.Tanh()` or `nn.ReLU()` to the neural network itself e.g. in `nn.Sequential`. +2. Add the _functional equivalents_ of these activation functions to the forward pass. + +The first is easier, the second gives you more freedom. Choose what works best for you! + +``` + +import torch.nn.functional as F + +# (1). Add to __init__ if using nn.Sequential +def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(28 * 28, 256), + nn.Sigmoid(), + nn.Linear(256, 128), + nn.Tanh(), + nn.Linear(128, 10), + nn.ReLU() + ) + +# (2). Add functional equivalents to forward() +def forward(self, x): + x = F.sigmoid(self.lin1(x)) + x = F.tanh(self.lin2(x)) + x = F.relu(self.lin3(x)) + return x +``` + +With Ignite, you can now proceed and finalize the model by adding Ignite specific code. + +### PyTorch Lightning + +In Lightning, too, you can choose from one of the two options: + +1. Add the activation functions to the neural network itself. +2. Add the functional equivalents to the forward pass. + +``` +import torch +from torch import nn +import torch.nn.functional as F +import pytorch_lightning as pl + +# (1) IF USED SEQUENTIALLY +class SampleModel(pl.LightningModule): + + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(28 * 28, 256), + nn.Sigmoid(), + nn.Linear(256, 128), + nn.Tanh(), + nn.Linear(128, 56), + nn.ReLU(), + nn.Linear(56, 10) + ) + self.ce = nn.CrossEntropyLoss() + + def forward(self, x): + return self.layers(x) + +# (2) IF STACKED INDEPENDENTLY +class SampleModel(pl.LightningModule): + + def __init__(self): + super().__init__() + self.lin1 = nn.Linear(28 * 28, 256) + self.lin2 = nn.Linear(256, 128) + self.lin3 = nn.Linear(128, 56) + self.lin4 = nn.Linear(56, 10) + self.ce = nn.CrossEntropyLoss() + + def forward(self, x): + x = F.sigmoid(self.lin1(x)) + x = F.tanh(self.lin2(x)) + x = F.relu(self.lin3(x)) + x = self.lin4(x) + return x +``` + +* * * + +## Activation functions: what are they? + +Neural networks are composed of _layers_ of _neurons_. They represent a system that together learns to capture patterns hidden in a dataset. Each individual neuron here processes data in the form `Wx + b`. Here, `x` represents the input vector - which can either be the input data (in the first layer) or any subsequent and partially processed data (in the downstream layers). `b` is the bias and `W` the weights vector, and they represent the trainable components of a neural network. + +Performing `Wx + b` equals making a linear operation. In other words, the mapping from an input value to an output value is always linear. 
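As a small illustration of that linearity, a single neuron's computation is nothing more than a dot product plus a bias (the values below are hypothetical, just to show the operation):

```
import torch

x = torch.tensor([0.5, -1.2, 3.0])   # input vector
W = torch.tensor([0.8, 0.1, -0.4])   # weights, normally learned during training
b = torch.tensor(0.2)                # bias, normally learned during training

output = torch.dot(W, x) + b         # the linear operation Wx + b
print(output)                        # a single scalar, which depends linearly on the inputs
```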
While this works perfectly if you need a model to generate a linear decision boundary, it becomes problematic when you don't. In fact, when you need to learn a decision boundary that is _not_ linear (and there are many such use cases, e.g. in computer vision), you can't if only performing the operation specified before. + +Activation functions come to the rescue in this case. Stacked directly after the neurons, they take the neuron output values and map this linear input to a nonlinear output. By consequence, each neuron, and the system as a whole, becomes capable of learning nonlinear patterns. The exact flow of data flowing through one neuron is visualized below and can be represented by these three steps: + +1. **Input data flows through the neuron, performing the operation `Wx + b`.** +2. **The output of the neuron flows through an activation function, such as ReLU, Sigmoid and Tanh.** +3. **What the activation function outputs is either passed to the next layer or returned as model output.** + +![](images/layer-act-1024x227.png) + +### ReLU, Sigmoid and Tanh are commonly used + +There are many activation functions. In fact, any activation function can be used - even \[latex\]f(x) = x\[/latex\], the linear or identity function. While you don't gain anything compared to using no activation function with that function, it shows that pretty much anything is possible when it comes to activation functions. + +The key consideration that you have to make when creating and using an activation function is the function's computational efficiency. For example, if you would design an activation function that trumps any such function in performance, it doesn't really matter if it is _really_ slow to compute. In those cases, it's more likely that you can gain similar results in the same time span, but then with more iterations and fewer resources. + +That's why today, [three key activation functions](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) are most widely used in neural networks: + +1. **Rectified Linear Unit (ReLU)** +2. **Sigmoid** +3. **Tanh** + +Click the link above to understand these in more detail. We'll now take a look at each of them briefly. + +The **Tanh** and **Sigmoid** activation functions are the oldest ones in terms of neural network prominence. In the plot below, you can see that Tanh converts all inputs into the `(-1.0, 1.0)` range, with the greatest slope around `x = 0`. Sigmoid instead converts all inputs to the `(0.0, 1.0`) range, also with the greatest slope around `x = 0`. **ReLU** is different. This function maps all inputs to `0.0` if `x <= 0.0`. In all other cases, the input is mapped to `x`. + +While being very prominent, all of these functions come with drawbacks. These are the benefits and drawbacks for ReLU, Sigmoid and Tanh: + +- Sigmoid and Tanh suffer greatly from the [vanishing gradients problem](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/). This problem occurs because the derivatives of both functions have a peak value at `x < 1.0`. Neural networks use the chain rule to compute errors backwards through layers. This chain rule effectively _chains_ and thus _multiplies_ gradients. You can imagine what happens when, where `g` is some gradient for a layer, you perform `g * g * g * ...`. The result for the most upstream layers is then very small. In other words, larger networks struggle or even fail learning when Sigmoid or Tanh is used. 
+- In addition, with respect to Sigmoid, the middle point in terms of the `y` value does not lie around `x = 0`. This makes the process somewhat unstable. On the other hand, Sigmoid is a good choice for binary classification problems. Use at your own caution. +- Finally with respect to these two, the functions are more complex than that of ReLU, which essentially boils down to `[max(x, 0)](https://www.machinecurve.com/index.php/question/why-does-relu-equal-max0-x/)`. Computing them is thus slower than when using ReLU. +- While it seems to be the case that ReLU trumps all activation functions - and it surely generalizes to many problems and is really useful, partially due to its computational effectiveness - it has its own unique set of drawbacks. It's not smooth and therefore [not fully differentiable](https://www.machinecurve.com/index.php/question/why-is-relu-not-differentiable-at-x-0/), neural networks can start to [explode](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) because there is no upper limit on the output, and using ReLU also means opening up yourself to the dying ReLU problem. Many activation functions attempting to resolve these problems have emerged, such as [Swish](https://www.machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/), [PReLU](https://www.machinecurve.com/index.php/2019/12/05/how-to-use-prelu-with-keras/) and [Leaky ReLU](https://www.machinecurve.com/index.php/2019/11/12/using-leaky-relu-with-keras/) - and [there are many more](https://www.machinecurve.com/index.php/tag/activation-function). But for some reason, they haven't been able to dethrone ReLU yet, and it is still widely used. + +- [![](images/tanh-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/05/tanh.png) + +- [![](images/relu-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/05/relu.png) + +- [![](images/sigmoid-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/05/sigmoid.png) + + +* * * + +## Implementing ReLU, Sigmoid and Tanh with PyTorch + +Now that we understand how ReLU, Sigmoid and Tanh work, we can take a look at how we can implement with PyTorch. In this tutorial, you'll learn to implement these activation functions with three flavors of PyTorch: + +1. **Classic PyTorch.** This is where it all started and it is PyTorch as we know it. +2. **PyTorch Ignite.** Ignite is a PyTorch-supported approach to streamline your models in a better way. +3. **PyTorch Lightning**. The same is true for Lightning, which focuses on model organization and automation even more. + +Let's start with classic PyTorch. + +### Classic PyTorch + +In classic PyTorch, the suggested way to create a neural network is using a class that utilizes `nn.Module`, the neural networks module provided by PyTorch. 
+ +``` +from torch import nn + +class Model(nn.Module): + def __init__(self): + super(Model, self).__init__() + self.lin1 = nn.Linear(28 * 28, 256) + self.lin2 = nn.Linear(256, 128) + self.lin3 = nn.Linear(128, 10) +``` + +You can also choose to already stack the layers on top of each other, like this, using `nn.Sequential`: + +``` +from torch import nn +import torch.nn.functional as F + +class Model(nn.Module): + def __init__(self): + super(Model, self).__init__() + self.layers = nn.Sequential( + nn.Linear(28 * 28, 256), + nn.Linear(256, 128), + nn.Linear(128, 10) + ) +``` + +As you can see, this way of working resembles that of the `tensorflow.keras.Sequential` API, where you add layers on top of each other using `model.add`. + +#### Adding activation functions + +In an `nn.Module`, you can then add a `forward` definition for the forward pass. The implementation differs based on how you built your neural network above: + +``` + # If stacked on top of each other + def forward(self, x): + return self.layers(x) + + # If stacked independently + def forward(self, x): + x = self.lin1(x) + x = self.lin2(x) + return self.lin3(x) +``` + +Adding **Sigmoid, Tanh or ReLU** to a classic PyTorch neural network is really easy - but it is also dependent on the way that you have constructed your neural network above. When you are using `Sequential` to stack the layers, whether that is in `__init__` or elsewhere in your network, it's best to use `nn.Sigmoid()`, `nn.Tanh()` and `nn.ReLU()`. An example can be seen below. + +If instead you are specifying the layer composition in `forward` - similar to the Keras Functional API - then you must use `torch.nn.functional`, which we import as `F`. You can then wrap the layers with the activation function of your choice, whether that is `F.sigmoid()`, `F.tanh()` or `F.relu()`. Quite easy, isn't it? :D + +``` + +import torch.nn.functional as F + +# Add to __init__ if using nn.Sequential +def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(28 * 28, 256), + nn.Sigmoid(), + nn.Linear(256, 128), + nn.Tanh(), + nn.Linear(128, 10), + nn.ReLU() + ) + +# Add functional equivalents to forward() +def forward(self, x): + x = F.sigmoid(self.lin1(x)) + x = F.tanh(self.lin2(x)) + x = F.relu(self.lin3(x)) + return x +``` + +### PyTorch Ignite + +You can use the classic PyTorch approach from above for adding Tanh, Sigmoid or ReLU to PyTorch Ignite. Model creation in Ignite works in a similar way - and you can then proceed with adding all Ignite-specific functionality. + +### PyTorch Lightning + +In Lightning, you can pretty much repeat the classic PyTorch approach - i.e. use `nn.Sequential` and specify calling the whole system in the `forward()` definition, or create the forward pass yourself. The first is more restrictive but easy, whereas the second gives you more freedom for creating exotic models at the cost of increasing difficulty.
+ +Here's an example of using ReLU, Sigmoid and Tanh when you stack all layers independently and configure data flow yourself in `forward`: + +``` +import torch +from torch import nn +import torch.nn.functional as F +import pytorch_lightning as pl + +class SampleModel(pl.LightningModule): + + # IF STACKED INDEPENDENTLY + def __init__(self): + super().__init__() + self.lin1 = nn.Linear(28 * 28, 256) + self.lin2 = nn.Linear(256, 128) + self.lin3 = nn.Linear(128, 56) + self.lin4 = nn.Linear(56, 10) + self.ce = nn.CrossEntropyLoss() + + def forward(self, x): + x = F.sigmoid(self.lin1(x)) + x = F.tanh(self.lin2(x)) + x = F.relu(self.lin3(x)) + x = self.lin4(x) + return x + + def training_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.ce(y_hat, y) + self.log('train_loss', loss) + return loss + + def configure_optimizers(self): + optimizer = torch.optim.Adam(self.parameters(), lr=1e-3) + return optimizer +``` + +Do note that the functional equivalents of Tanh and Sigmoid are deprecated and may be removed in the future: + +``` +UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead. + warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.") +UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead. + warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.") +``` + +The solution would be as follows. You can also choose to use `nn.Sequential` and add the activation functions to the model itself: + +``` +import torch +from torch import nn +import torch.nn.functional as F +import pytorch_lightning as pl + +class SampleModel(pl.LightningModule): + + # IF USED SEQUENTIALLY + def __init__(self): + super().__init__() + self.layers = nn.Sequential( + nn.Linear(28 * 28, 256), + nn.Sigmoid(), + nn.Linear(256, 128), + nn.Tanh(), + nn.Linear(128, 56), + nn.ReLU(), + nn.Linear(56, 10) + ) + self.ce = nn.CrossEntropyLoss() + + def forward(self, x): + return self.layers(x) + + def training_step(self, batch, batch_idx): + x, y = batch + x = x.view(x.size(0), -1) + y_hat = self.layers(x) + loss = self.ce(y_hat, y) + self.log('train_loss', loss) + return loss + + def configure_optimizers(self): + optimizer = torch.optim.Adam(self.parameters(), lr=1e-3) + return optimizer +``` + +That's it, folks! As you can see, adding ReLU, Tanh or Sigmoid to any PyTorch, Ignite or Lightning model is _a piece of cake_. 🍰 + +If you have any comments, questions or remarks - please feel free to leave a comment in the comments section 💬 I'd love to hear from you. [You can also leave your question here](https://www.machinecurve.com/index.php/machine-learning-questions/). + +Thanks for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +PyTorch Ignite. (n.d.). _Ignite your networks! — ignite master documentation_. PyTorch. [https://pytorch.org/ignite/](https://pytorch.org/ignite/) + +PyTorch Lightning. (2021, January 12). [https://www.pytorchlightning.ai/](https://www.pytorchlightning.ai/) + +PyTorch. (n.d.). [https://pytorch.org](https://pytorch.org/) + +PyTorch. (n.d.). _ReLU — PyTorch 1.7.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU) + +PyTorch. (n.d.). _Sigmoid — PyTorch 1.7.0 documentation_. 
+
+PyTorch. (n.d.). _Tanh — PyTorch 1.7.0 documentation_. [https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html#torch.nn.Tanh](https://pytorch.org/docs/stable/generated/torch.nn.Tanh.html#torch.nn.Tanh)
+
+PyTorch. (n.d.). _Torch.nn.functional — PyTorch 1.7.0 documentation_. [https://pytorch.org/docs/stable/nn.functional.html](https://pytorch.org/docs/stable/nn.functional.html)
diff --git a/using-selu-with-tensorflow-and-keras.md b/using-selu-with-tensorflow-and-keras.md
new file mode 100644
index 0000000..c43387e
--- /dev/null
+++ b/using-selu-with-tensorflow-and-keras.md
@@ -0,0 +1,295 @@
+---
+title: "Using SELU with TensorFlow 2.0 and Keras"
+date: "2021-01-12"
+categories:
+  - "deep-learning"
+  - "frameworks"
+tags:
+  - "activation-function"
+  - "deep-learning"
+  - "machine-learning"
+  - "neural-networks"
+  - "relu"
+  - "selu"
+  - "tensorflow"
+---
+
+Neural networks thrive on nonlinear data only when [nonlinear activation functions](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/) are used. The Rectified Linear Unit, or RELU, is one such activation function - and in fact, it is currently the most widely used one due to its robustness in many settings. But training a neural network can be problematic, even with functions like RELU.
+
+Parts of these problems can be related to the speed of the training process. For example, we know from [Batch Normalization](https://www.machinecurve.com/index.php/2020/01/15/how-to-use-batch-normalization-with-keras/) that it helps speed up the training process, because it [normalizes](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) the inputs to a layer. While this is not necessarily problematic, deep learning engineers must pay attention to how they construct the rest of their model. For example, using Dropout in combination with Batch Normalization might not be a good idea if implemented incorrectly. In addition, Batch Normalization must be explicitly added to a neural network, which might not always be what you want.
+
+In this article, we are going to take a look at the **Scaled Exponential Linear Unit** or **SELU activation function**. This activation function, which has self-normalizing properties, ensures that all outputs are normalized without explicitly adding a normalization layer to your model. What's better is that it can be used relatively easily and that it provides adequate results, according to its authors (Klambauer et al., 2017).
+
+This article is structured as follows. Firstly, we're going to provide a code example that immediately answers the question "how to use SELU with TensorFlow and Keras?". It allows you to get up to speed quickly. After that, we'll go into a bit more detail. First of all, we're going to take a brief look at the need for activation functions to provide some context. This is followed by looking at the SELU activation function, which we'll explore both mathematically and visually. Once we have done that, we take a look at how SELU is implemented in TensorFlow, by means of `tf.keras.activations.selu`. Finally, we build an actual neural network using SELU, and provide step-by-step examples.
+
+After reading this tutorial, you will...
+
+- Understand what activation functions are.
+- Know what SELU is and how SELU relates to RELU. +- See how SELU is implemented in TensorFlow. +- Be capable of building a neural network using SELU. + +Let's take a look! 😊 + +* * * + +\[toc\] + +* * * + +## Code example: using SELU with tf.keras.activations.selu + +This quick example helps you get started with SELU straight away. If you want to know how to use SELU with TensorFlow or Keras, you can use the code below. Do make sure to take a look at the important notes however, they're really important! Read the full article below if you want to understand their _whys_ and the SELU activation function in general in more detail. + +``` +# Using SELU with TensorFlow and Keras - example. +# Important: +# 1. When using SELU, the LecunNormal() initializer must be used. +# 2. When using SELU and Dropout, AlphaDropout() must be used. +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), kernel_initializer=LecunNormal(), activation='selu', input_shape=input_shape)) +model.add(AlphaDropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='selu', kernel_initializer=LecunNormal())) +model.add(Dense(no_classes, activation='softmax')) +``` + +* * * + +## What are activation functions? + +By design, a neural network processes data linearly. Every neuron takes an input vector `x` and multiplies this vector element-wise with vector `w`, which contains **weights**. These weights, in return, are learned by the network, as well as the **bias**. As each neuron learns to process data individually, the system as a whole learns to process the data collectively, because it is trained to do so by means of the [high-level machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process). + +![](images/layer-linear.png) + +Neural networks are therefore perfectly capable of learning [linear decision boundaries](https://www.machinecurve.com/index.php/2020/10/29/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example/): + +![](images/linear-1024x514.png) + +Unfortunately, today's world comes with complex datasets. These datasets often contain patterns that are not linear. If we would train a neural network using the approach mentioned above, that would not work. This is clearly visible in the example that [we visualized above](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/): the neural network is not capable of learning a nonlinear decision boundary. + +### Adding activation functions + +But if we add **activation functions** to the neural network, this behavior changes, and we can suddenly learn to detect nonlinear patterns in our datasets. Activation functions are simple mathematical functions that map some inputs to some outputs, but then in a nonlinear way. We place them directly after the neurons, as we visualized in the image below. + +![](images/layer-act-1024x227.png) + +This is the effect with the data visualized above when a nonlinear activation function is used: + +![](images/nonlinear-1-1024x514.png) + +### About RELU + +One of the most prominent activation functions that is used today is the **[Rectified Linear Unit](https://www.machinecurve.com/index.php/2019/09/09/implementing-relu-sigmoid-and-tanh-in-keras/)**, or **RELU**. 
This activation function effectively boils down to the following output:
+
+\[mathjax\]
+
+\\begin{equation} f(x) = \\begin{cases} 0, & \\text{if}\\ x < 0 \\\\ x, & \\text{otherwise} \\\\ \\end{cases} \\end{equation}
+
+In other words, the output will be zero if `x < 0` and will equal `x` otherwise. Being as simple as implementing [max(x, 0)](https://www.machinecurve.com/index.php/question/why-does-relu-equal-max0-x/), ReLU is a very efficient and easy activation function. It is therefore not surprising that it is widely used today.
+
+![](images/relu-1024x511.png)
+
+* * *
+
+## What is the SELU activation function?
+
+Training a neural network successfully does not depend on an activation function alone. Especially with bigger models, the training process also becomes dependent on a variety of efficiency measures that must be built into the neural network for it to work well. For example, we know that the distribution of layer outputs significantly impacts the speed of the training process. [Batch Normalization](https://www.machinecurve.com/index.php/2020/01/14/what-is-batch-normalization-for-training-neural-networks/) was invented to deal with this, and we can use it easily in TensorFlow by simply [adding it as a layer](https://www.machinecurve.com/index.php/2020/01/15/how-to-use-batch-normalization-with-keras/).
+
+But while Batch Normalization speeds up the training process by normalizing the outputs of each layer, it comes with a few drawbacks. The first one is that it must be added explicitly, incurring additional computational costs that are unnecessary, strictly speaking. In addition, using Batch Normalization together with Dropout is not necessarily a good idea, unless implemented correctly.
+
+That's why Klambauer et al. (2017) argue for the **Scaled Exponential Linear Unit**, or the **SELU activation function**. This activation function combines the benefits of classic RELU with self-normalizing properties, hence removing the necessity to use BatchNorm.
+
+> The activation function of SNNs are "scaled exponential linear units" (SELUs), which induce self-normalizing properties. Using the Banach fixed-point theorem, we prove that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance -- even under the presence of noise and perturbations.
+>
+> Klambauer et al. (2017)
+
+The SELU activation function is defined in the following way:
+
+\\begin{equation} f(x) = \\begin{cases} \\text{scale} \\times \\text{alpha} \\times (exp(x) - 1), & \\text{if}\\ x \\lt 0 \\\\ \\text{scale} \\times x, & \\text{otherwise} \\\\ \\end{cases} \\end{equation}
+
+Here, `alpha=1.67326324` and `scale=1.05070098` (TensorFlow, n.d.).
+
+It has properties that lead the neural network to become **self-normalizing**, meaning that the outputs of each layer are pushed towards a mean (\[latex\]\\mu\[/latex\]) of zero (\[latex\]\\mu = 0.0\[/latex\]) and unit variance (\[latex\]\\sigma^2 = 1.0\[/latex\]). This equals the effect of Batch Normalization, without using Batch Normalization. If this is not _strictly_ possible, the authors show that at least an upper and lower bound on the variance is present, [avoiding the vanishing gradients problem](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) (Klambauer et al., 2017).
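+
+To get a feel for the definition above, here's a quick sketch in plain NumPy - purely an illustration with the `alpha` and `scale` constants listed above, not something you need in order to use SELU in TensorFlow:
+
+```
+import numpy as np
+
+# Constants from the SELU definition above (TensorFlow, n.d.)
+alpha = 1.67326324
+scale = 1.05070098
+
+def selu(x):
+    # scale * alpha * (exp(x) - 1) for x < 0, scale * x otherwise
+    return scale * np.where(x < 0, alpha * (np.exp(x) - 1), x)
+
+# Negative inputs saturate towards -scale*alpha, positive inputs are scaled linearly
+print(selu(np.array([-5.0, -1.0, 0.0, 1.0, 5.0])))
+```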
+
+> We have introduced self-normalizing neural networks for which we have proved that neuron activations are pushed towards zero mean and unit variance when propagated through the network. Additionally, for activations not close to unit variance, we have proved an upper and lower bound on the variance mapping. Consequently, SNNs do not face vanishing and exploding gradient problems.
+>
+> Klambauer et al. (2017)
+
+Visually, the SELU activation function looks as follows:
+
+![](images/selu.png)
+
+* * *
+
+## SELU in TensorFlow
+
+Of course, it is possible to use the **Scaled Exponential Linear Unit** or SELU with TensorFlow and Keras. The example at the top of this page already demonstrates how you can use it within your neural network. In TensorFlow 2.x, the SELU activation function is available as `tf.keras.activations.selu` (TensorFlow, n.d.):
+
+```
+tf.keras.activations.selu(
+    x
+)
+```
+
+The function is really simple - it takes `x` as input and applies the self-normalizing nonlinear mapping that was visualized above.
+
+#### About SELU and Dropout
+
+Note that if you're using Dropout, you must use [AlphaDropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/AlphaDropout) instead of regular Dropout (TensorFlow, n.d.).
+
+#### About SELU and Initializers
+
+Note that for [weight initialization](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/), you must take into account the utilization of SELU (just as you would need to use a [different initializer when using RELU](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/)). If you are using SELU, you must use the `LecunNormal` initializer [instead](https://www.tensorflow.org/api_docs/python/tf/keras/initializers/LecunNormal).
+
+* * *
+
+## Building a neural network using SELU: example
+
+Adding SELU to a TensorFlow / Keras powered neural network is really easy and involves three main steps:
+
+1. **Setting the `activation` attribute to `'selu'`.** As you can see in the example above, all activations are set to SELU through `activation='selu'`. Of course, we don't do this at the last layer, because (as we shall see) we are trying to solve a [multiclass classification problem](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/). For these, we need [Softmax](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/).
+2. **Using the `LecunNormal` kernel initializer**. The TensorFlow docs suggest using this initializer when using SELU, which is related to the fact that [different activation functions need different initializers](https://www.machinecurve.com/index.php/2019/09/16/he-xavier-initialization-activation-functions-choose-wisely/).
+3. **Using `AlphaDropout` instead of `Dropout`.** Another important suggestion made by the docs is to use this type of [Dropout](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/) whenever you need Dropout at all.
+ +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), kernel_initializer=LecunNormal(), activation='selu', input_shape=input_shape)) +model.add(AlphaDropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='selu', kernel_initializer=LecunNormal())) +model.add(Dense(no_classes, activation='softmax')) +``` + +### Fully working neural network with SELU + +We can use these easy steps in the creation of a neural network which can be used for [multiclass classification](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/). In fact, we will be using it for classification of the [MNIST dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/#mnist-database-of-handwritten-digits), which is composed of handwritten digits - a few examples of them visualized on the right. + +In other words, the neural network that we will create is capable of generating a prediction about the digit it sees - giving a number between zero and nine as the output. The code below constructs the neural network and is composed of multiple sections. Read the article about [constructing a ConvNet](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) for more step-by-step instructions, but these are the important remarks: + +1. **Imports section**. We import everything that we need in this section. Recall once more that this also includes the `LecunNormal` initializer and the `AlphaDropout` layer; the latter only if you desire to use [Dropout](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/). +2. **Model configuration**. Here, we set a few configuration options throughout the model. +3. **Loading and preparing the dataset.** With these lines of code, we use `load_data()` to [load the MNIST dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/) and reshape it into the correct [input format](https://www.machinecurve.com/index.php/2020/04/05/how-to-find-the-value-for-keras-input_shape-input_dim/). It also includes parsing numbers as floats, [which might speed up the training process](https://www.machinecurve.com/index.php/2020/09/16/tensorflow-model-optimization-an-introduction-to-quantization/#float32-in-your-ml-model-why-its-great). Finally, it is also [normalized](https://www.machinecurve.com/index.php/2020/11/19/how-to-normalize-or-standardize-a-dataset-in-python/) relatively naïvely and target vectors are [one-hot encoded](https://www.machinecurve.com/index.php/2020/11/24/one-hot-encoding-for-machine-learning-with-tensorflow-and-keras/). +4. **The model is created and compiled**. This involves stacking layers on top of each other with `model.add(..)` and actually initializing the model with `model.compile(..)`, getting us a model that can be trained. +5. **Training the model**. We use the `input_train` and `target_train` variables for this; in other words, [our training dataset](https://www.machinecurve.com/index.php/2020/11/16/how-to-easily-create-a-train-test-split-for-your-machine-learning-model/). +6. **[Evaluating](https://www.machinecurve.com/index.php/2020/11/03/how-to-evaluate-a-keras-model-with-model-evaluate/) the model**. Finally, we evaluate the performance of the model with `input_test` and `target_test`, to see whether it generalizes to data that we haven't seen before. 
+ +``` +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.initializers import LecunNormal +from tensorflow.keras.layers import AlphaDropout +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 5 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Convert into [0, 1] range. +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), kernel_initializer=LecunNormal(), activation='selu', input_shape=input_shape)) +model.add(AlphaDropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='selu', kernel_initializer=LecunNormal())) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +If you are getting memory errors when running this script using your GPU, you might need to add the following code directly after the imports. It limits the growth of GPU memory and allows you to get your code running again. + +``` +gpus = tensorflow.config.experimental.list_physical_devices('GPU') +if gpus: + try: + # Currently, memory growth needs to be the same across GPUs + for gpu in gpus: + tensorflow.config.experimental.set_memory_growth(gpu, True) + logical_gpus = tensorflow.config.experimental.list_logical_devices('GPU') + print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs") + except RuntimeError as e: + # Memory growth must be set before GPUs have been initialized + print(e) +``` + +### Results + +These are the results, which suggest a well-performing model - but this is not unexpected given the simplicity of MNIST. 
+ +``` +192/192 [==============================] - 9s 24ms/step - loss: 0.9702 - accuracy: 0.7668 - val_loss: 0.1748 - val_accuracy: 0.9530 +Epoch 2/5 +192/192 [==============================] - 4s 22ms/step - loss: 0.2187 - accuracy: 0.9349 - val_loss: 0.1293 - val_accuracy: 0.9624 +Epoch 3/5 +192/192 [==============================] - 4s 22ms/step - loss: 0.1411 - accuracy: 0.9569 - val_loss: 0.1153 - val_accuracy: 0.9688 +Epoch 4/5 +192/192 [==============================] - 5s 24ms/step - loss: 0.1068 - accuracy: 0.9667 - val_loss: 0.1097 - val_accuracy: 0.9710 +Epoch 5/5 +192/192 [==============================] - 7s 38ms/step - loss: 0.0889 - accuracy: 0.9715 - val_loss: 0.1014 - val_accuracy: 0.9739 +Test loss: 0.09341142326593399 / Test accuracy: 0.9747999906539917 +``` + +* * * + +## Summary + +The **Scaled Exponential Linear Unit** or **SELU activation function** can be used to combine the effects of RELU and Batch Normalization. It has self-normalizing properties, meaning that the outputs have an upper and lower bound at worst (avoiding vanishing gradients) and activations normalized around zero mean and unit variance at best. This means that Batch Normalization might no longer be necessary, making the utilization of Dropout easier. + +In this article, we looked at activation functions, SELU, and an implementation with TensorFlow. We saw that activation functions help our neural networks learn to handle nonlinear data, whereas SELU combines the effects of RELU (today's most common activation function) with those of Batch Normalization. In TensorFlow and hence Keras, it is implemented as `tf.keras.activations.selu`. + +In an example implementation, we also saw how we can create a neural network using SELU. + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope that this tutorial has been useful to you and that you have learned something! 😀 If you did, please feel free to leave a message in the comments section below 💬 Please do the same if you have any questions or remarks, or click the **Ask Questions** button to the right. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +TensorFlow. (n.d.). _Tf.keras.activations.selu_. [https://www.tensorflow.org/api\_docs/python/tf/keras/activations/selu](https://www.tensorflow.org/api_docs/python/tf/keras/activations/selu) + +Klambauer, G., Unterthiner, T., Mayr, A., & Hochreiter, S. (2017). [Self-normalizing neural networks.](https://arxiv.org/abs/1706.02515) _Advances in neural information processing systems_, _30_, 971-980. diff --git a/using-simple-generators-to-flow-data-from-file-with-keras.md b/using-simple-generators-to-flow-data-from-file-with-keras.md new file mode 100644 index 0000000..3c863dc --- /dev/null +++ b/using-simple-generators-to-flow-data-from-file-with-keras.md @@ -0,0 +1,303 @@ +--- +title: "Using simple generators to flow data from file with Keras" +date: "2020-04-06" +categories: + - "deep-learning" + - "frameworks" +tags: + - "big-data" + - "dataset" + - "deep-learning" + - "generator" + - "keras" + - "large-dataset" + - "machine-learning" +--- + +During development of basic neural networks - such as the ones we build to show you how e.g. [Conv2D layers work](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/) - we often load the whole dataset into memory. This is perfectly possible, because the datasets we're using are relatively small. 
For example, the MNIST dataset has only 60.000 samples in its _training_ part. + +Now, what if datasets are larger? Say, they are 1.000.000 samples, or even more? At some point, it might not be feasible or efficient to store all your samples in memory. Rather, you wish to 'stream' them from e.g. a file. How can we do this with Keras models? That's what we will cover in today's blog post. + +Firstly, we'll take a look at the question as to why: why flow data from a file anyway? Secondly, we'll take a look at _generators_ - and more specifically, _custom_ generators. Those will help you do precisely this. In our discussion, we'll also take a look at how you must fit generators to TensorFlow 2.x / 2.0+ based Keras models. Finally, we'll give you an example - how to fit data from a _very simple CSV file_ with a generator instead of directly from memory. + +**Update 05/Oct/2020:** provided example of using generator for validation data with `model.fit`. + +* * * + +\[toc\] + +* * * + +## Why would you flow data from a file? + +The answer is really simple: sometimes, you don't want to spend all your memory storing the data. + +You wish to use some of that memory for other purposes, too. + +In that case, fitting the data with a custom generator can be useful. + +### Fitting data with a custom generator + +But what is such a generator? For this, we'll have to look at the Python docs: + +> Generator functions allow you to declare a function that behaves like an iterator, i.e. it can be used in a for loop. +> +> [Python (n.d.)](https://wiki.python.org/moin/Generators) + +It already brings us further, but it's still vague, isn't it? + +Ilya Michlin (2019) explains the need for generators in better terms, directly related to machine learning: + +> You probably encountered a situation where you try to load a dataset but there is not enough memory in your machine. As the field of machine learning progresses, this problem becomes more and more common. Today this is already one of the challenges in the field of vision where large datasets of images and video files are processed. +> +> [Michlin (2019)](https://towardsdatascience.com/keras-data-generators-and-how-to-use-them-b69129ed779c) + +Combined with the rather vague explanation, we can get there. + +A generator can be used to "behave like an iterator", "used in a loop" - to get us small parts of some very large data file. These parts, in return, can subsequently be fed to the model for training, to avoid the memory problems that are common in today's machine learning projects. + +Bingo! Generators can help us train with large data files. Nice :) + +* * * + +## Example model + +Let's now take a look at an example with Keras. Suppose that we have this massive but simple dataset - 500.000.000 rows of simple \[latex\]x \\rightarrow y\[/latex\] mappings: + +``` + +x,y +1,1 +2,2 +3,3 +4,4 +5,5 +... +``` + +This file might be called e.g. `five_hundred.csv`. + +As you might expect, this is the linear function \[latex\]y: f(x) = x\[/latex\]. It's one of the most simple regression scenarios that you can encounter. + +Now, let's build a model for this dataset just like we always do - with one exception: we use a generator to load the data rather than loading it in memory. 
Here's what we'll do: + +- We load our imports, which represent the dependencies for today's model; +- We set some basic configuration options - specifically targeted at the dataset that we'll be feeding today; +- We specify the function which loads the data; +- We create the - very simple - model architecture; +- We compile the model; +- We fit the generator to the model. + +Let's go! + +### Loading our imports + +As always, the first thing we do is loading our imports. We import the Sequential API from `tensorflow.keras`, the TensorFlow 2.x way of importing Keras, as well as the Dense layer. As you may understand by now, we'll be building a densely-connected neural network with the Sequential API. Additionally, we also import TensorFlow itself, and Numpy. + +``` +# Load dependencies +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +import numpy as np +import tensorflow +``` + +### Setting some basic configuration options + +Next, we set two configuration options. First, we specify the total number of rows present in the file: + +``` +# Num rows +num_rows = 5e8 # Five hundred million +batch_size = 250 +``` + +`5e8` equals `500e6` which equals 500.000.000. + +Additionally, we feed 250 samples in a minibatch during each iteration. By consequence, in this case, we'll have `2e6` or 2.000.000 steps of 250 samples per epoch, as we will see later. + +### Specifying the function that loads the data + +Now that we have configured our model, we can specify the function that loads the data. It's actually a pretty simple function - a simple Python definition. + +It has a `path` and a `batchsize` attribute, which are used later, and first creates empty arrays with `inputs` and `targets`. What's more, it sets the `batchcount` to 0. Why this latter one is necessary is what we will see soon. + +Subsequently, we keep iterating - we simply set `while True`, which sets a never-ending loop until the script is killed. Every time, we we open the file, and subsequently parse the inputs and targets. Once the batch count equals the batch size that we configured (do note that this happens when we have the _exact same size_, as the batch count starts at 0 instead of 1), we finalize the arrays and subsequently yield the data. Don't forget to reset the `inputs`, `targets` and `batchcount`, though! + +``` +# Load data +def generate_arrays_from_file(path, batchsize): + inputs = [] + targets = [] + batchcount = 0 + while True: + with open(path) as f: + for line in f: + x,y = line.split(',') + inputs.append(x) + targets.append(y) + batchcount += 1 + if batchcount > batchsize: + X = np.array(inputs, dtype='float32') + y = np.array(targets, dtype='float32') + yield (X, y) + inputs = [] + targets = [] + batchcount = 0 +``` + +### Creating the model architecture + +Now that we have specified our function for flowing data from file, we can create the architecture of our model. Today, our architecture will be pretty simple. In fact, it'll be a three-layered model, of which two layers are hidden - the latter one is the output layer, and the input layer is specified implicitly. + +As you can see by the number of output neurons for every layer, slowly but surely, an information bottleneck is created. We use [ReLU](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) for activating in the hidden layers, and `linear` for the final layer. This, in return, suggests that we're dealing with a regression scenario. Unsurprisingly: we are. 
+ +``` +# Create the model +model = Sequential() +model.add(Dense(16, input_dim=1, activation='relu')) +model.add(Dense(8, activation='relu')) +model.add(Dense(1, activation='linear')) +``` + +### Compiling the model + +This latter fact gets even more clear when we look at the `compile` function for our model. As our loss, we use the mean absolute error, which is a typical [loss function for regression problems](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#loss-functions-for-regression). Additionally, we specify the mean squared error, which is one too. Adam is [used for optimizing the model](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#adam) - which is a common choice, especially when you don't really care about optimizers, as we do now (it's not the goal of today's blog post), Adam is an adequate choice. + +``` +# Compile the model +model.compile(loss='mean_absolute_error', + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['mean_squared_error']) +``` + +### Fitting the generator + +Next, we `fit` the generator function - together with the file and batch size - to the model. This will allow the data to flow from file into the model directly. + +Note that on my machine, this file with five hundred million rows exceeds 10GB. If it were bigger, it wouldn't have fit in memory! + +``` +# Fit data to model +model.fit(generate_arrays_from_file('./five_hundred.csv', batch_size), + steps_per_epoch=num_rows / batch_size, epochs=10) +``` + +### Full model code + +Altogether, here's the code as a whole: + +``` +# Load dependencies +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +import numpy as np +import tensorflow + +# Num rows +num_rows = 5e8 # Five hundred million +batch_size = 250 + +# Load data +def generate_arrays_from_file(path, batchsize): + inputs = [] + targets = [] + batchcount = 0 + while True: + with open(path) as f: + for line in f: + x,y = line.split(',') + inputs.append(x) + targets.append(y) + batchcount += 1 + if batchcount > batchsize: + X = np.array(inputs, dtype='float32') + y = np.array(targets, dtype='float32') + yield (X, y) + inputs = [] + targets = [] + batchcount = 0 + +# Create the model +model = Sequential() +model.add(Dense(16, input_dim=1, activation='relu')) +model.add(Dense(8, activation='relu')) +model.add(Dense(1, activation='linear')) + +# Compile the model +model.compile(loss='mean_absolute_error', + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['mean_squared_error']) + +# Fit data to model +model.fit(generate_arrays_from_file('./five_hundred.csv', batch_size), + steps_per_epoch=num_rows / batch_size, epochs=10) +``` + +Running it does indeed start the training process, but it will take a while: + +``` +Train for 2000000.0 steps +Epoch 1/10 +2020-04-06 20:33:35.556364: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll + 352683/2000000 [====>.........................] +``` + +However, we successfully completed our code! 🎉 + +* * * + +## Using model.fit using validation data specified as a generator + +In the model above, I'm working with a generator that flows training data into the model during the forward pass executed by the `.fit` operation when training. Unsurprisingly, some people have asked in the comments section of this post if it is possible to use a generator for validation data too, and if so, how. 
+
+Remember that validation data is used during the training process in order to identify whether the machine learning model [has started overfitting](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/#how-well-does-your-model-perform-underfitting-and-overfitting). Testing data, by contrast, is used to test whether your model generalizes to data that it hasn't seen before.
+
+I thought this wouldn't be possible, as the TensorFlow documentation clearly states this:
+
+> Note that `validation_data` does not support all the data types that are supported in `x`, eg, dict, generator or [`keras.utils.Sequence`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence).
+>
+> _Tf.keras.Model_. (n.d.)
+
+Here, `x` is the input argument, or the training data you're inputting - for which we used a generator above.
+
+One comment to this post argued that contrary to the docs, it is possible to use a generator for validation data. And indeed, it seems to work (Stack Overflow, n.d.). That's why it's also possible to let validation data flow into your model - like this:
+
+```
+# Fit data to model
+model.fit(generate_arrays_from_file('./five_hundred.csv', batch_size),
+          steps_per_epoch=num_rows / batch_size, epochs=10, validation_data=generate_arrays_from_file('./five_hundred_validation_split.csv', batch_size), validation_steps=num_val_steps)
+```
+
+Here:
+
+- We use the same generator for our validation set, which however comes from a different file (`./five_hundred_validation_split.csv`). Do note that we're using the same batch size here, but it can be a different one - validation runs separately from training, at the end of every epoch.
+- Do note that normally, TensorFlow infers the number of `validation_steps` (i.e. how many batches of samples are used for validation) automatically by means of the rule `if None then len(validation_set)/batch_size`. However, the length of our validation set is not known up front, because it's generated from file. We must therefore specify the number of validation steps manually; the value depends on the length of your validation set. If it has one million rows, the validation steps should cover all of them, e.g. `num_val_steps = int(1e6/batch_size)`. If your `batch_size` is 250, the number of validation steps would be 4.000.
+
+Thanks to PhilipT in the comments section for reporting! :)
+
+* * *
+
+## Summary
+
+In this blog post, we looked at the concept of generators - and how we can use them with Keras to overcome the problem of data set size. More specifically, it allows us to use really large datasets when training our Keras model. In an example model, we showed you how to use generators with your Keras model - in our example, with a data file of more than 10GB.
+
+I hope you've learnt something today! If you did, please feel free to leave a comment in the comments section below. Make sure to feel welcome to do the same if you have questions, remarks or suggestions for improvement. Where possible, I will answer and adapt the blog post! :)
+
+Thank you for reading MachineCurve today and happy engineering 😎
+
+\[kerasbox\]
+
+* * *
+
+## References
+
+Python. (n.d.). _Generators - Python Wiki_. Python Software Foundation Wiki Server. Retrieved April 6, 2020, from [https://wiki.python.org/moin/Generators](https://wiki.python.org/moin/Generators)
+
+Michlin, I. (2019, October 6). _Keras data generators and how to use them_. Medium. 
[https://towardsdatascience.com/keras-data-generators-and-how-to-use-them-b69129ed779c](https://towardsdatascience.com/keras-data-generators-and-how-to-use-them-b69129ed779c) + +Keras. (n.d.). _Sequential_. Home - Keras Documentation. [https://keras.io/models/sequential/#fit\_generator](https://keras.io/models/sequential/#fit_generator) + +_Could validation data be a generator in tensorflow.keras 2.0?_ (n.d.). Stack Overflow. [https://stackoverflow.com/questions/60003006/could-validation-data-be-a-generator-in-tensorflow-keras-2-0](https://stackoverflow.com/questions/60003006/could-validation-data-be-a-generator-in-tensorflow-keras-2-0) + +_Tf.keras.Model_. (n.d.). TensorFlow. [https://www.tensorflow.org/api\_docs/python/tf/keras/Model#fit](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit) diff --git a/using-teachable-machine-for-creating-tensorflow-models.md b/using-teachable-machine-for-creating-tensorflow-models.md new file mode 100644 index 0000000..7aa5120 --- /dev/null +++ b/using-teachable-machine-for-creating-tensorflow-models.md @@ -0,0 +1,181 @@ +--- +title: "Using Teachable Machine for creating TensorFlow models" +date: "2020-10-27" +categories: + - "deep-learning" + - "frameworks" +tags: + - "ai-experiments" + - "google" + - "hot-dog" + - "machine-learning" + - "tensorflow" + - "web-browser" +--- + +Although it likely sounds a bit strange, getting started with Machine Learning gets easier day after day. That is - the steep learning curve that the field of ML once presented to engineers who wanted to gets started is going away. + +And I think that this is great news, because the field provides a lot of opportunities for those who wish to obtain both a holistic business perspective as well as an engineering one. For those who master the business and engineering aspects of machine learning, or data science in general, the future is bright. + +In today's article, we will cover one of the ways in which creating TensorFlow based models is getting easier - that is, through Google's AI experiment Teachable Machine. It can be used for generating training data and training a Machine Learning model straight from a web browser. In fact, as we shall see, the trained model can be exported for usage in native TensorFlow, TensorFlow.js and TensorFlow Lite. This is awesome, so let's take a look at what it can do in more detail! + +* * * + +\[toc\] + +* * * + +## Google's AI Experiments + +Did you already know about the existence of Google's AI Experiments? + +The website quotes itself as follows: + +> AI Experiments is a showcase for simple experiments that make it easier for anyone to start exploring machine learning, through pictures, drawings, language, music, and more. +> +> AI Experiments (n.d.) + +For those who aren't engineers - getting excited about the fact that your code runs and your model trains can be boring. What's the point of seeing a few lines of text move on a computer screen, they often argue. While for people like me, and perhaps even us, it's very exciting - something is happening within the machine, a model is learning! + +That's why AI Experiments makes the powerful capabilities of Artificial Intelligence (and hence Machine Learning) very visual. By means of simple experiments / demonstrations, often in the form of games, Google wants to make it very easy to get excited about ML. For example, the website provides a variety of games related to drawing: with **Quick, Draw!**, a neural network will attempt to learn what you're drawing. 
In **Handwriting**, a neural network will attempt to complete a handwriting attempt started by yourself. + +![](images/image-21-1024x538.png) + +Another set of experiments which I think are really cool are related to audio. For example, with the **Freddiemeter**, you can measure how much your singing looks like that of Freddie Mercury's. The **Semi-Conductor** allows you to conduct a digital orchestra based on your webcam, by means of a [PoseNet-like](https://github.com/tensorflow/tfjs-models/tree/master/posenet) Machine Learning architecture. + +![](images/image-22-1024x540.png) + +I'm loving these examples as they really make Machine Learning Accessible! 😎 + +* * * + +## Introducing Teachable Machine + +Despite the fact that those experiments are - in my opinion - quite awesome, I've always thought that one key thing was missing. + +A method for creating your own Machine Learning models from the browser. Or rather, a method for _letting other people create ML models from their browser_. In the workshops that I give for business audiences, people are often thrilled to hear about the possibilities that Machine Learning can bring them. In fact, they really get impressed how the world is changing rapidly into a place where data lies at the core of contemporary organizations. + +...but I think they could be even more impressed when they can actually create their own ML model from scratch, but without having to code, because business people aren't the greatest coders, generally speaking. + +Unfortunately, this never seemed to be possible. Until - for another workshop - I looked at the AI Experiments website again today. When looking around, I found an experiment called **Teachable Machine**. According to its website, it makes it possible for you to "\[t\]rain a computer to recognize your own images, sounds, & poses" (Teachable Machine, n.d.). In fact, you can now create a Machine Learning model from your web browser - and pick from three model types: + +- A Machine Learning model that can **classify audio samples** into one of user-configured classes. +- A Machine Learning model that can **classify images** (or webcam streams) into one of user-configured classes. +- A Machine Learning model that can **classify human poses** into one of user-configured classes. + +Teachable Machine thus only support [classification](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/) as of now, but its website suggests that more model types are added frequently. + +https://www.youtube.com/watch?v=T2qQGqZxkD0&feature=emb\_title + +* * * + +## Getting Started: an Image Project + +Enough bla-bla for now. Time to get started, so let's see if we can generate an Image Project with Teachable Machine. This afternoon, I managed to train a model that can classify my webcam stream into "cup" and "no cup" based on only approximately 150-225 images per class. + +In fact, I generated those images by simply using my webcam to generate a stream of data, including some weird behavior like moving the cup from edge to edge, turning it up side down, and so on. + +### Creating a new project + +The first thing you need to do in order to get started with Teachable Machine is to navigate to their website, [teachablemachine.withgoogle.com](https://teachablemachine.withgoogle.com/). + +Then make sure to click the button 'Get Started'. 
+ +![](images/image-24-1024x432.png) + +The web application then navigates to a page that lets you [create your own project](https://teachablemachine.withgoogle.com/train). We're going to create a classifier that is capable of classifying a webcam stream, so make sure to click 'Image Project'. Of course, if you're playing around yourself, you can also create an Audio Project or a Pose Project. Do note that training the model will take a bit longer with those kind of projects, because the input data is more complex. + +![](images/image-25-1024x501.png) + +### Deciding about your classes + +Then it's time to decide about the classes you're going to train with. Do you want to train a [binary classifier](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/#variant-1-binary-classification) and hence use [Sigmoid](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) for generating the prediction under the hood, or will you make it a [multiclass one](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/#variant-2-multiclass-classification) using [Softmax](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/)? + +Take some time to think about what you want. You could also pick one of these classes: + +- **Cup / no cup**, in case you have a a cup somewhere near you. Being a true engineer, powered by coffee, this is likely the case 😉☕ +- **Hot dog / No Hot dog**, which is [self-explanatory](https://www.machinecurve.com/index.php/2020/10/20/tutorial-building-a-hot-dog-not-hot-dog-classifier-with-tensorflow-and-keras/). +- **Cat / dog**, should you have two animals walking around in your house. +- **Purple / red / green / blue**, if you have papers in various colors nearby. + +Let's now enter these classes. I'm going for the Hot Dog / No Hot Dog scenario, just like the [Hotdog Classifier](https://www.machinecurve.com/index.php/2020/10/20/tutorial-building-a-hot-dog-not-hot-dog-classifier-with-tensorflow-and-keras/). + +![](images/image-36-1024x468.png) + +I first enter 'Hot Dog' by adapting the first class, then 'Not Hot Dog' by adapting the second class. + +![](images/image-37-1024x441.png) + +If you chose the scenario with > 2 classes, then you can use the 'Add a class' button to add another class. + +### Generating training data + +It's now time to generate training data. You can do so in multiple ways. If you have a dataset available (for example the [Hot Dog / Not Hot Dog dataset](https://www.machinecurve.com/index.php/2020/10/20/tutorial-building-a-hot-dog-not-hot-dog-classifier-with-tensorflow-and-keras/)), you can of course use that by clicking the 'Upload' button. You can also use your Webcam if you want to generate data yourself. + +![](images/image-38.png) + +Make sure to have that dataset, or generate it: + +![](images/image-39.png) + +Now upload images from Drive or file. At least 150 are necessary, in my experience. + +![](images/image-40.png) + +Make sure to do the same for 'Not Hot Dog', but then with the other data. + +### Training the model + +Your screen should now look somewhat like this: + +![](images/image-41-1024x468.png) + +Through the **Advanced** button, you can adapt the number of _epochs_ (i.e. 
number of iterations), the batch size (the poorer your hardware, the lower it should be) and the [Learning Rate](https://www.machinecurve.com/index.php/2019/11/06/what-is-a-learning-rate-in-a-neural-network/) - which should be fine at 0.001 in this setting. Generally, you don't want to tweak the settings, and click 'Train Model' straight away. + +So let's do that. The model then starts training. + +![](images/image-42-1024x461.png) + +### Generating new predictions + +When it's done training, Preview Mode becomes available. Here, you can upload a few files - and see whether your model works. + +In our case, it should: + +![](images/image-45.png) + +Correct! + +![](images/image-43.png) + +Also good! + +Awesome! 😎 + +* * * + +## Exporting your model to TensorFlow, TF.js or TF Lite + +Even better is that you can **export the model you just trained**. For example, you can [load it with TensorFlow](https://www.machinecurve.com/index.php/2020/02/14/how-to-save-and-load-a-model-with-keras/), run it in the web browser with [TensorFlow.js](https://www.tensorflow.org/js) or use it with [Model Optimization techniques in TensorFlow Lite](https://www.machinecurve.com/index.php/tag/model-optimization/). Simply click 'Export Model' and use the option you want - you'll even get a code example as part of the deal. + +![](images/image-35.png) + +* * * + +## Summary + +In this relatively brief but in my opinion interesting article, we looked at a technique for training Machine Learning models from your web browser - Teachable Machine, a Google AI Experiment. It demonstrates that Machine Learning is not necessarily a domain for experts and that everyone can train models, if they understand the [basics](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/). + +It allows one to create a classification model for images, audio or human poses, to capture a training set directly from the web browser, and export the model to be ran with true TensorFlow, TF.js or TF Lite. It's a great tool for workshops, allowing people without any experiments to become very proud: + +_I've just created my own Machine Learning model!_ + +I hope that you've enjoyed this article. If you did, please feel free to let me know in the comments section below 💬 Please do the same if you have any other questions, remarks or comments. I'd happily adapt the article whenever necessary, and help you move forward where possible. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +AI experiments. (n.d.). Experiments with Google. [https://experiments.withgoogle.com/collection/ai](https://experiments.withgoogle.com/collection/ai) + +Teachable Machine. (n.d.). 
[https://teachablemachine.withgoogle.com/](https://teachablemachine.withgoogle.com/) diff --git a/visualize-keras-models-overview-of-visualization-methods-tools.md b/visualize-keras-models-overview-of-visualization-methods-tools.md new file mode 100644 index 0000000..c56e0d5 --- /dev/null +++ b/visualize-keras-models-overview-of-visualization-methods-tools.md @@ -0,0 +1,261 @@ +--- +title: "Visualize Keras models: overview of visualization methods & tools" +date: "2019-12-03" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "keract" + - "keras" + - "keras-vis" + - "machine-learning" + - "tensorboard" + - "tensorflow" + - "visualization" +--- + +Visualizing your Keras model, whether it's the architecture, the training process, the layers or its internals, is becoming increasingly important as business requires explainability of AI models. + +But until recently, generating such visualizations was not so straight-forward. + +Fortunately, with respect to the Keras deep learning framework, many visualization toolkits have been developed in the past few years. This has led to a landscape that is scattered and contains many open source toolkits and other elements. That's at least what I found out when I wrote tutorials for many of these recently. + +In this blog post, I've attempted to summarize what exists out there - and create an overview that introduces you to all of them that I know of. I've added links to the respective tutorials where you can find more information if you need it. Perhaps, let this be the starting point of your visualization activities! ...of course, if you know about some tools that I didn't cover here, feel free to add them by dropping a comment 😊 + +Thanks, and let's go! 😎 + +\[toc\] + +## Visualizing model architecture: Keras + +Neural networks, and by consequence Keras models, contain layers. These layers are often stacked in an architecture. When you're interested in this architecture - i.e., when you ask yourself **which layers are part of my neural network?** - it may be wise to visualize the architecture of your Keras model, like this: + +[![](images/model.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/model.png) + +Keras comes with a handy method to generate such a visualization with only one line of code: `plot_model`. At MachineCurve, we've created a tutorial that is dedicated to this topic - how to build a model, train it, while visualizing its architecture. Click the link below if you wish to read more. + +**Read more:** [How to visualize a model with Keras?](https://www.machinecurve.com/index.php/2019/10/07/how-to-visualize-a-model-with-keras/') + +* * * + +## Visualizing model architecture: TensorBoard + +However, since Keras integrates with the TensorFlow backend, it's also possible to use TensorBoard for visualizing the architecture of your model. TensorBoard is a TensorFlow toolkit for generating various visualizations of your neural networks. If you're interested in **what is the architecture of my TensorFlow model?** - as well as various other kinds of visualizations - this tutorial is for you. + +[![](images/image-6-1024x480.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/image-6.png) + +The best thing is that TensorBoard even works - albeit in a limited way - when you use Theano or CNTK as the backend for your Keras models! 
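+
+For completeness: TensorBoard picks up the model graph (and much more) from the logs that the Keras `TensorBoard` callback writes during training. A minimal sketch looks as follows - it assumes you already have a compiled Keras `model` plus training data `X_train` and `y_train`, which are placeholders here:
+
+```
+import tensorflow as tf
+
+# Write logs for this run to ./logs; histogram_freq=1 also logs weight distributions
+tensorboard = tf.keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=1)
+
+# Assumes `model`, `X_train` and `y_train` already exist
+model.fit(X_train, y_train, epochs=10, validation_split=0.2, callbacks=[tensorboard])
+
+# Then run `tensorboard --logdir=./logs` and open http://localhost:6006 in your browser
+```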
+ +**Read more:** [How to use TensorBoard with Keras?](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras) + +* * * + +## Visualizing model architecture: Net2Vis + +Another tool for generating visualizations of the architecture of your Keras generated neural networks is Net2Vis. The tool, which is a React web application and a Python Flask backend, was created by German scholars who found that existing tools would only produce vertical visualizations - making them useless for print media, which often requires horizontal ones. + +[![](images/image-4-1024x568.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-4.png) + +Using Net2Vis is really easy, it supports the Keras Functional and Sequential APIs, and there is a wide range of configuration options available - _even_ color sets for the color blind and those with monochromatic (i.e., grayscale) vision. I really love it! + +What's more, at MachineCurve, we've created a Docker based installation procedure called `net2vis-docker`, which allows you to run it with only one command. + +**Read more:** [Visualizing Keras neural networks with Net2Vis and Docker](https://www.machinecurve.com/index.php/2020/01/07/visualizing-keras-neural-networks-with-net2vis-and-docker/) + +* * * + +## Visualizing model architecture: Netron + +One of the most beautiful tools for visualizing model architectures that I know about is Netron. This tool, which has a cross-platform availability (source code builds and installers for Macs and Windows machines) and supports a variety of frameworks and model formats, allows you to inspect models in a visually appealing way: + +[![](images/image-7-1024x782.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/image-7.png) + +It's also possible to export these plots, so that you can use them in publications. However, contrary to Net2Vis, which generates horizontal visualizations, Netron makes them vertical - and doesn't allow you to switch directions. Especially with deep networks, this results in plots that can hardly be printed. However, despite this observation, I love the visual styles! + +**Read more:** [Visualizing your Neural Network with Netron](https://www.machinecurve.com/index.php/2020/02/27/visualizing-your-neural-network-with-netron/) + +* * * + +## Visualizing the training process: Keras History object + +Besides the architecture of your model, it may be interesting to know something about your training process as well. This is especially important when you want to answer the following questions: + +- How do I know whether my model is overfitting? +- Is my model still underfitting? +- Is training progress stalling? Do I need fewer epochs, or do I need to change my architecture? +- Can I achieve better performance, and do I perhaps need to add more epochs? +- Do I need to change my architecture based on training results? + +[![](images/fixed_lr_small.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/fixed_lr_small.png) + +Visualizing the training process, i.e. the _history_ of your training process, might then be of help. Keras helps you with this by providing a `History` object if you wish to capture this training history. At MachineCurve, we've written a tutorial that helps you make such plots when you wish to visualize them. Doing so is easy: it involves adding a bit of code to _one line of Python_ only, as well as some Matplotlib code for visualizations. Click the link below if you wish to read more. 
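+
+In a nutshell, it can look like this - a minimal sketch that assumes you already have a compiled Keras `model` plus training data `X_train` and `y_train` (placeholder names):
+
+```
+import matplotlib.pyplot as plt
+
+# model.fit returns a History object - that's the 'one line of Python'
+history = model.fit(X_train, y_train, epochs=25, validation_split=0.2, verbose=1)
+
+# Plot training and validation loss over the epochs
+plt.plot(history.history['loss'], label='Training loss')
+plt.plot(history.history['val_loss'], label='Validation loss')
+plt.xlabel('Epoch')
+plt.ylabel('Loss')
+plt.legend()
+plt.show()
+```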
+ +**Read more:** [How to visualize the training process in Keras?](https://www.machinecurve.com/index.php/2019/10/08/how-to-visualize-the-training-process-in-keras/) + +* * * + +## Visualizing the training process: TensorBoard + +As with the architecture of your neural network, you can also generate visualizations of your training process with TensorBoard. Keras natively supports TensorBoard by means of a callback, so integrating it with your model should be really easy. + +[![](images/image-1-1024x505.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/image-1.png) + +As you can see, contrary to History-based visualization, the TensorBoard visualizations are more detailed. They are also more interactive, as you can visualize various options on the fly. This is not possible with the History-object based approach. Nevertheless, both work fine. If you wish to find out how to visualize how training proceeds over time with TensorBoard, the corresponding tutorial at 'Read more' should help you further. + +**Read more:** [How to use TensorBoard with Keras?](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras) + +* * * + +## Visualizing model decision boundaries: Mlxtend + +The success of a classifier is determined by how well it classifies - i.e., assigns new objects to the correct class. During training, it generates what is known as a decision boundary - a dividing line between two or more classes that allows the classifier to generate its prediction. + +Mlxtend, a generic toolkit providing extensions to various machine learning models, was created by dr. Sebastian Raschka to - among others - serve this need. It allows you to visualize the decision boundary of your machine learning model: + +[![](images/mh_boundary-1024x587.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/mh_boundary.png) + +...and, by consequence, also the decision boundary of your Keras model 😄 + +Fun thing is that integrating Mlxtend with your Keras model for visualizing the model's decision boundary is not difficult. Hence, answering the question **How does my model decide between classes?** becomes a lot more transparent. Make sure to read this tutorial if you're interested in those kind of visualizations. + +**Read more:** [How to visualize the decision boundary for your Keras model?](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) + +* * * + +## Visualizing weight/bias distributions: TensorBoard + +TensorBoard once again! 😁 But this time we're not discussing model architectures or the training process. + +No, rather, you may be interested in visualizing the distribution of weights and biases at your layers. TensorBoard supports this natively, and Keras as well through its integration with TensorBoard. The tutorial below helps you with this. + +[![](images/image-4-1024x505.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/image-4.png) + +**Read more:** [How to use TensorBoard with Keras?](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras) + +* * * + +## Visualizing weight/bias change over time: TensorBoard + +While weights and biases of your layers are static with respect to the individual layers, they change over time. 
Visualizing how they change over time helps you answering a lot of questions with respect to the training process: + +- **Which layers contribute most to training?** +- **Which layers do not contribute to training?** +- **Can I remove a layer?** +- **Do I need to add more layers?** +- **Does training happen throughout weights or biases?** + +[![](images/featured_image-1024x505.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/featured_image.png) + +Make sure to read the TensorBoard tutorial if you wish to understand it, as it explains how you can generated _and_ read these charts in order to better understand your neural network. + +**Read more:** [How to use TensorBoard with Keras?](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras) + +* * * + +## Visualizing ConvNet inputs: Activation Maximization + +Convolutional neural networks are, as any neural network, viewed as black boxes very often. What if I told you that there exist methods to visualize these black boxes, and to take a look inside them, in order to find out how your model performs? + +What if you can answer the question: **does my model actually generate its prediction based on the correct input elements?** + +Activation Maximization can help you with this - combined with the `keras-vis` toolkit in Keras. + +The technique essentially uses a trained model and performs inverse operations to find out _which image would be perfect with respect to a class_. That is, what would your input need to be in order to find a particular prediction - in this case, for classes '3', '6' and '4' of the MNIST dataset 😁 + +- [![](images/3-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/3-1.png) + +- [![](images/6-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/6-1.png) + +- [![](images/4-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/4-1.png) + + +If you're very excited about this - cool, so was I when I first found out about this! At MachineCurve, I've written a tutorial that explains how to use Activation Maximization for generating 'perfect class images' that help you understand your Keras ConvNet. I've provided a link at 'Read more'. + +**Read more:** [Visualizing Keras model inputs with Activation Maximization](https://www.machinecurve.com/index.php/2019/11/18/visualizing-keras-model-inputs-with-activation-maximization/) + +* * * + +## Visualizing ConvNet filters: Activation Maximization + +While Activation Maximization can be used at the _output level_ - generating images that represent perfect inputs with respect to some class - it can also be used for visualizing the filters of your ConvNet. + +This answers the question: **What does my CNN see?** + +As well as the following ones: **What patterns have my ConvNet filters/kernels learnt?** + +You get visualizations that look like this: + +- [![](images/block5_conv2_480.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block5_conv2_480.jpg) + +- [![](images/block5_conv2_479.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block5_conv2_479.jpg) + +- [![](images/block5_conv2_136.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block5_conv2_136.jpg) + + +As with input visualizations, we're using `keras-vis` for applying Activation Maximization to ConvNet filters. Similarly, we've written a tutorial as well. Check it out if you wish to find out how this works! + +**Read more:** [What do ConvNets see? 
Visualizing filters with Activation Maximization](https://www.machinecurve.com/index.php/2019/12/03/what-do-convnets-see-visualizing-filters-with-activation-maximization/) + +* * * + +## Visualizing where your ConvNet looks: Saliency maps + +Activation Maximization can be used in order to generate a _perfect representation_: that is, it allows you to find out whether the model has actually learnt to recognize the _correct object_. + +It is irrelevant of input in the sense that only a fixed output and fixed model weights are required, and it will generate the perfect input for you. + +But what if you wish to find an answer to a slightly different question - **given some input, does my model look at the correct object when generating the prediction?** + +Activation Maximization does not help here - but saliency maps do. They essentially highlight which pixels contribute most to generating a prediction, like this: + +- [![](images/sal9.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/sal9.png) + +- [![](images/horse-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/horse-2.png) + + +How saliency maps work and how you can use `keras-vis` to implement them for visualizing the importance within inputs to your Keras model? Check out the tutorial that we wrote for this purpose 😁 Hope it helps you! + +**Read more:** [Visualizing Keras CNN attention: Saliency maps](https://www.machinecurve.com/index.php/2019/11/25/visualizing-keras-cnn-attention-saliency-maps/) + +* * * + +## Visualizing where your ConvNet looks: Grad-CAM activation maps + +While saliency maps help you answer the question _which areas of the input image contribute most to generating the prediction_, you get the answer at a pixel level. Especially with complex images, or when you wish to generate overlays of importance/input, this is difficult. Rather, you would really find the true _areas_ of importance, rather than the _pixels_ of importance. + +Heatmaps may help you here. While they are less granular, they might be a competitor for saliency maps: + +- [![](images/7-2-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/7-2-1.png) + +- [![](images/2-3.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/2-3.png) + + +The link below refers to our tutorial for visualizing where your Keras ConvNet attends to with Grad-CAM activation maps. Once again, we use `keras-vis` for this purpose. When you augment these activation maps with guided backprop (which is similar to generating the saliency maps), your results get even more powerful. 
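To give you a feeling for the amount of code involved, here is a minimal sketch with `keras-vis`. Note the assumptions: a trained Keras `model` whose output layer was given `name='visualized_layer'`, and a single `input_image` matching the model's input shape - the linked tutorial walks through a complete, working setup:

```
# Minimal sketch: generate a Grad-CAM heatmap with keras-vis.
# Assumes a trained Keras `model` whose output layer was given name='visualized_layer',
# and `input_image` (a single image with the model's input shape).
from vis.visualization import visualize_cam
from vis.utils import utils
from keras import activations
import matplotlib.pyplot as plt

# Swap Softmax for Linear in the output layer, as discussed above
layer_index = utils.find_layer_idx(model, 'visualized_layer')
model.layers[layer_index].activation = activations.linear
model = utils.apply_modifications(model)

# Grad-CAM heatmap for class 0
heatmap = visualize_cam(model, layer_index, filter_indices=0, seed_input=input_image)
plt.imshow(heatmap)
plt.show()
```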
+ +**Read more:** [Visualizing Keras CNN attention: Grad-CAM Class Activation Maps](https://www.machinecurve.com/index.php/2019/11/28/visualizing-keras-cnn-attention-grad-cam-class-activation-maps/) + +* * * + +## Visualizing layer outputs: Keract + +You may also be interested in answering the questions: **how do my Keras model's layers activate given some input?** The `keract` toolkit might provide you with an answer to this question, as it allows you to visualize this for one, multiple or all of your layers - by providing heatmaps or simple activation outputs: + +- [![](images/maxpooling-1024x577.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/maxpooling.png) + +- [![](images/2_conv2d_2-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/2_conv2d_2.png) + + +Once again, we have a tutorial for this 😊 + +**Read more:** [Visualize layer outputs of your Keras classifier with Keract](https://www.machinecurve.com/index.php/2019/12/02/visualize-layer-outputs-of-your-keras-classifier-with-keract/) + +* * * + +## Summary + +In this blog post, we've provided an overview of Keras visualization methods & techniques that are currently available. We provided references to a wide variety of blog posts at MachineCurve that may help you further when your interest is visualizing your model's architecture, the training process, or how its layers activate and/or behave. + +If you have any questions or remarks, please leave a comment in the comments box below. **I kindly request that you especially do so when you know about another visualization method that I didn't cover yet!** I will then try my best to cover it as soon as possible 😁 + +Thanks a lot for reading MachineCurve today and happy engineering! 😎 diff --git a/visualize-layer-outputs-of-your-keras-classifier-with-keract.md b/visualize-layer-outputs-of-your-keras-classifier-with-keract.md new file mode 100644 index 0000000..1f12800 --- /dev/null +++ b/visualize-layer-outputs-of-your-keras-classifier-with-keract.md @@ -0,0 +1,304 @@ +--- +title: "Visualize layer outputs of your Keras classifier with Keract" +date: "2019-12-02" +categories: + - "deep-learning" + - "frameworks" +tags: + - "activation" + - "deep-learning" + - "deep-neural-network" + - "keract" + - "keras" + - "machine-learning" + - "model" + - "visualization" +--- + +When you train Convolutional Neural Networks, you wish to understand their performance before you apply them in the real world. This spawns the need for visualization: when you can see how they perform, layer by layer, you can improve them in a more guided fashion. + +This is what is possible with Keract - and not only for Convolutional Neural Networks. This toolkit, which is available as an open source Github repository and `pip` package, allows you to visualize the outputs of any Keras layer for some input. This way, you can trace how your input is eventually transformed into the prediction that is output - possibly identifying bottlenecks in the process - and subsequently improve your model. + +In this blog post, we'll cover precisely this feature of the Keract toolkit. We first argue in more detail why it can be smart to visualize the output of various neural network layers. Subsequently, we introduce Keract, which is followed by the creation of a simple ConvNet that can classify MNIST digits. Note again that you can also use Keract when you don't have a ConvNet - that is, it allows you to visualize Dense layers as well. 
+ +Followed by the creation of our simple classifier, we use Keract to do a couple of things. First, we use it to generate visualizations of the outputs of the model's layers. Subsequently, we show that it can also generate _activation heatmaps_, which are similar to the [Grad-CAM maps](https://www.machinecurve.com/index.php/2019/11/28/visualizing-keras-cnn-attention-grad-cam-class-activation-maps/) which we created in another blog post. Finally, we show that you don't necessarily need ConvNets to use Keract - as indicated - by giving you an example. + +Let's go! 😎 + +**Update April 2020:** updated the code to run with TensorFlow 2.0+ + +\[toc\] + +## Why visualize layer outputs? + +Training your supervised neural network involves feeding forward your training data, generating predictions, and computing a loss score, which is used for [optimization purposes](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/). However, it may be that your optimizer gets stuck after some time - and you would like to know why this occurs and, more importantly, what you could do about it. + +Take for example a [Convolutional Neural Network](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/). Such a network is often composed of two types of layers: convolutional layers, which learn features from the image, that can be used by densely-connected layers for classification purposes. The result is a neural network that can classify images - and with quite some accuracy in many cases! + +However, especially with problems that are less straight-forward, ConvNets can be tough to train. In some cases, it does not even converge. Visualizing layer outputs gets important in those cases. As convolutional layers, together with additional layers such as pooling layers downsample the image - in the sense that it gets smaller and more abstract - it may be the case, for example, that information loss occurs. When this happens, a neural network might no longer be able to discriminate between the classes, and hence show inadequate performance. Solving this might be done by increasing the number of feature maps or by removing a layer. While this increase computational cost, it might also improve model performance. + +Hence: visualization is important. Let's now introduce Keract, which we can use to visualize the outputs of the layers of our Keras models. + +## What is Keract? + +Keract is best summarized as follows: **You have just found a (easy) way to get the activations (outputs) and gradients for each layer of your Keras model ([LSTM](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/), conv nets…)** (Rémy, 2019). + +It is a set of simple yet powerful tools to visualize the outputs (and gradients, but we leave them out of this blog post) of every layer (or a subset of them) of your Keras model. Contrary to many visualization packages, it doesn't only visualize the convolutional layers. Rather, it visualizes output of other layers as well: [LSTMs](https://www.machinecurve.com/index.php/2020/12/29/a-gentle-introduction-to-long-short-term-memory-networks-lstm/), densely-connected ones, and so on. That's great news, as Keract will thus allow you to follow an input throughout the entire model towards its final prediction. 
+ +Let's now implement Keract based visualization using a simple convolutional neural network that classifies the MNIST dataset 😀 As you likely know, this dataset contains thousands of 28x28 pixel images of handwritten digits, i.e. the numbers 0 to 9. Visualizing a subset of them produces this plot: + +[![](images/mnist.png)](https://www.machinecurve.com/wp-content/uploads/2019/07/mnist.png) + +Since the MNIST dataset is integrated with Keras by design, it's very easy to use it. Additionally, models often achieve very high accuracies quite simply, making it one of the better datasets when applying Keras for educational purposes. + +Let's go! 😎 + +## Creating a simple MNIST ConvNet: model architecture + +This is the architecture of the model that we will create today: + +[![](images/model.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/model.png) + +The model's architecture from the input layer to the output Dense layer. Click [here](https://www.machinecurve.com/index.php/2019/10/07/how-to-visualize-a-model-with-keras/) if you want to understand how to make such plots yourself. + +We start with an **input layer**, which is simply receptive to the inputs as they are fed to the model. This input layer passes the data to a **Conv2D** layer, which is a convolutional layer that handles two-dimensional inputs. The layer will output six so-called [feature maps](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/#convolutional-layers), in which the model can detect certain 'features' that separate the classes, to be used for discriminating between the digits 0 to 9. This number - six - is arbitrary: it could have been 32, or 250, but the more feature maps, the more computational resources you need. Additionally, the Keract visualizations that will follow would then consume a substantial amount of space. For this post, we've kept it simple - but feel free to change this number as you desire. + +Upon passing through the Conv2D layer, data flows to a **MaxPooling2D** layer. This layer essentially looks at the data with e.g. a 2x2 pixels block, taking the maximum value for the block at every time. While the information present in the feature map remains relatively intact, the image gets substantially smaller - saving the need for computational resources. If you're having trouble visualizing this in your head - you've found the _exact_ reason why visualization with e.g. Keract helps you with tuning your deep neural networks.... as it will provide the visualization for you. But let's finish analyzing the architecture first! + +Once the feature maps have passed the max pooling layer, they are fed into another **Conv2D** layer, which learns to identify more abstract features (based on more abstract data). This way, with multiple convolutional layers, a _hierarchy_ of features is learnt: highly abstract ones as well as more detailed ones. This benefits the model's discriminative power. + +Once this convolutional layer has generated another set of feature maps - ten in our case - we let the data pass to a **Flatten** layer. This layer simply takes the data, which is multidimensional (by having a width, height and depth - the ten feature maps), and converts it into a onedimensional array. Why this is necessary is simple: the densely-connected layers, or the "true neuron like layers" that we will use next, can only handle one-dimensional data. 
The Flatten layer hence connects the convolutional part of your model with the Dense, or classification, part. + +As said, two **Dense** layers subsequently follow the convolutional part. They allow for actual classification. The final Dense layer uses the **Softmax** activation function, for [multiclass classification](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) purposes. + +## From architecture to code: the model + +Okay, enough bla-bla about the architecture for now. Let's turn it into code! 😁 + +### What you'll need to run the model + +You can't run code when you don't have the appropriate tools. To run today's model, you'll need to install these dependencies: + +- Python, obviously, as we're creating Python code. Please use version 3.6+. +- TensorFlow 2.x, which includes Keras, the deep learning framework we're using today. +- Keract, for generating the neural network visualizations. + +That's it already! 😊 + +### Imports and model preparations + +Open up your Explorer and navigate to some folder. Create a file - and name it e.g. `keract_activations.py`. Now open an editor, open the file, and start coding. What you'll have to code largely aligns with the [Keras CNN tutorial](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/), and especially the first part: + +``` +''' + Visualize layer activations of a tensorflow.keras CNN with Keract +''' + +# ============================================= +# Model to be visualized +# ============================================= +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras import backend as K +from tensorflow.keras import activations + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data based on channels first / channels last strategy. +# This is dependent on whether you use TF, Theano or CNTK as backend. 
+# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py +if K.image_data_format() == 'channels_first': + input_train = input_train.reshape(input_train.shape[0], 1, img_width, img_height) + input_test = input_test.reshape(input_test.shape[0], 1, img_width, img_height) + input_shape = (1, img_width, img_height) +else: + input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) + input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) + input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) +``` + +### The architecture in code + +The architecture, that follows next, equals the architecture we visualized before: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(6, kernel_size=(5, 5), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Conv2D(10, kernel_size=(5, 5), activation='relu')) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) +``` + +### Model compilation & training + +Configuring the model (by tuning hyperparameters) and fitting it to the data (i.e., assigning where the training process should start) is similar to the Keras CNN tutorial again: + +``` +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +### Starting the training process + +Now, when we open up a terminal where the required software dependencies are installed, you most likely experience a working training process. Once the model has trained for 25 epochs, the process finishes and the model evaluation procedure kick in: + +``` +Test loss: 0.043425704475744877 / Test accuracy: 0.9896000027656555 +``` + +99.0% accuracy (okay, 98.96%) - that's great! 😊 + +## Installing Keract + +So far, we haven't done anything different from the Keras CNN tutorial. But that's about to change, as we will now install Keract, the visualization toolkit that we're using to generate model/layer output visualizations & heatmaps today. + +Installation is simple: `pip install keract`. Run this command in your terminal, and that should be it 😁 + +Next up: generating visualizations of the outputs of your layers! + +## Generating layer output visualizations + +Keract comes with a very simple API, and it is very easy to generate output visualizations for your layers. 
It's as simple as this: + +``` +# ============================================= +# Keract visualizations +# ============================================= +from keract import get_activations, display_activations +keract_inputs = input_test[:1] +keract_targets = target_test[:1] +activations = get_activations(model, keract_inputs) +display_activations(activations, cmap="gray", save=False) +``` + +This is what this piece of code does: + +- First, the `get_activations` and `display_activations` functions are imported from Keract. With `get_activations`, you can retrieve the layer activations numerically, i.e. you get arrays with activations. By calling `display_activations` and passing these activations, you can visualize them. +- After the imports, you select the `inputs` and `targets` that must be visualized this time. In this case, we visualize the first sample of the test set. +- Finally, we call `get_activations` for the `model` instance and input image, and subsequently display them with `display_activations`, using the gray colormap. We don't save them, but rather display them on screen. + +Now it's time to run the model again. Likely, you have to start training again (check [ModelCheckpoint](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/) if you wish to avoid this by saving your model instance to file), but when it finishes, visualizations start popping up. + +Remember the architecture? Recall... + +- That we start with a Conv2D layer; +- That this layer is followed by MaxPooling2D; +- Once again followed by Conv2D; +- That subsequently, the data is flattened; +- And that it is finally passed through two Dense layers, generating a prediction. + +Let's now see whether we actually _see this happen_. + +This is the input image that is represented by `input_test[:1]`: + +[![](images/seven.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/seven.png) + +When this is passed to the first layer, recall that **six feature maps** are generated that learn to detect a feature. Think of them as a "coloring mechanism": when you pass them the input they know to detect, the elements of the input picture that match these inputs will "light up", as if given some color. + +In our case, this is what lights up for each feature map when the _seven_ is passed to the first convolutional layer: + +[![](images/0_conv2d_1-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/0_conv2d_1.png) + +As you can see, some feature maps detect the top of the seven, others the bottom, whereas others detect the inner edges, and so on. They all detect different features of your input image. + +The next layer is the Max Pooling layer: + +[![](images/1_maxpooling2d_1-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/1_maxpooling2d_1.png) + +Recall that this layer is used for _downsampling_, i.e., making the image smaller with (preferably) limited information loss. You can see this happening when you compare the visualization of the Max Pooling layer with the Conv2D one above: the activations learnt by the convolutional layer persist, but they get blocky and the total images get smaller. This is precisely what Max Pooling does. + +Next up, another convolutional layer: + +[![](images/2_conv2d_2-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/2_conv2d_2.png) + +Here, **ten feature maps** are learnt, which learn to detect abstract features in the converted input image. 
Nevertheless, you can still detect how they activate for the number seven. + +Next up: the Flatten layer. + +[![](images/3_flatten-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/3_flatten.png) + +It simply converts the multidimensional input into a onedimensional output, being an array, or dots on a line segment. This is fed into a Dense layer which activates with the [ReLU activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/): + +[![](images/4_dense-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/4_dense.png) + +Finally, data arrives at the **Softmax layer**, which essentially generates a prediction. Perhaps unsurprisingly, you can see that all neurons are black (outputting **0**) while only one is **white, or 'one' or 'true'**. When counting the block size from left to right, you'll see that the model output is the **eight** digit - or, the number _seven_ (zero is the first digit) Bingo! 🎉 + +[![](images/5_dense-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/5_dense.png) + +## Generating layer activation heatmaps + +Note that sometimes, you do not only wish to get _output visualizations_ for your layers (showing how the feature maps activate on some aspects of your image), but rather, you wish to generate an input/activation overlay. This can be done with Keract as well, and more specifically with its `display_heatmaps` function. This can be done as follows: + +``` +# ============================================= +# Keract visualizations +# ============================================= +from keract import get_activations, display_heatmaps +keract_inputs = input_test[:1] +keract_targets = target_test[:1] +activations = get_activations(model, keract_inputs) +display_heatmaps(activations, keract_inputs, save=False) +``` + +Resulting in slightly different visualizations _for only the convolutional and convolutional-related layers:_ + +- [![](images/conv2d_1-1024x577.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/conv2d_1.png) + +- [![](images/conv2d_2-1024x577.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/conv2d_2.png) + +- [![](images/maxpooling-1024x577.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/maxpooling.png) + + +### You don't need ConvNets to use Keract + +Note again, as we've seen with `display_activations`, you do not necessarily need to use a ConvNet if you wish to use Keract. Rather, it works with Dense layers and recurrent ones as well. But that's for another blog! 😃 + +## Summary + +Today, we've seen how to visualize the way your Keras model's layers activate by using Keract. We provided an example model that is capable of classifying the MNIST dataset, and subsequently showed how to use Keract in order to visualize how your model's layers activate when passed a new input value. + +I hope you've learnt something today. If you did, I'd appreciate a comment - please feel free to leave one in the comments box below 😁 Thanks a lot, and happy engineering! 😊 + +## References + +Rémy, P. (2019, November 28). Keract. 
Retrieved from [https://github.com/philipperemy/keract](https://github.com/philipperemy/keract) diff --git a/visualizing-gradient-importance-with-vanilla-gradients-and-tf-explain.md b/visualizing-gradient-importance-with-vanilla-gradients-and-tf-explain.md new file mode 100644 index 0000000..83aa4ca --- /dev/null +++ b/visualizing-gradient-importance-with-vanilla-gradients-and-tf-explain.md @@ -0,0 +1,533 @@ +--- +title: "Visualizing gradient importance with Vanilla Gradients and tf-explain" +date: "2020-05-02" +categories: + - "deep-learning" + - "frameworks" +tags: + - "explainability" + - "model-explainability" + - "tf-explain" + - "vanilla-gradients" + - "visualization" +--- + +Machine learning and deep learning are here to stay. After the spectacular rise of deep learning since 2012, much research has been undertaken into how those models need to be trained. This has spawned a significant rise in academic works on machine learning, as well as practical applications. + +Personally, I think the latter is of significance too - machine learning should not remain a research field only. In fact, many companies are already using machine learning in the core of their business. Take Amazon, for example. It's a very data-driven company and harnesses machine learning for generating, say, the products you should likely buy. + +And so does Uber, with demand prediction, crash detection and Estimated Time of Arrival computations, to give just a few examples. + +Now, applications of machine learning can sometimes be critical. For example, in the field of medicine, utilization of computer vision models for inspecting scans can produce very good results - but what if it misses one? + +Machine learning explainability is a key driver of future adoption of ML in production settings. Recently, many approaches for explaining the outcomes of machine learning models have emerged - and then especially so for computer vision related models. + +In this blog post, we will also be looking at one of those approaches for explaining the outcome of Convolutional Neural Networks: at Vanilla Gradients, to be precise. + +What are Vanilla Gradients? How can they be used to determine which pixels of an image contribute most to the outcome? And how can we implement a Keras model and explain it by means of the `tf-explain` framework? + +That's what we will cover today. Firstly, I'll introduce `tf-explain`, which is an awesome framework which allows you to use a variety of model explainability techniques with your Keras model. + +Finally, this is followed by the step-by-step implementation of an actual Keras model by means of `tf-explain`. This way, you'll be able to understand _how_ model explainability works with Vanilla Gradients, _why_ it works that way and how you can _use_ it in practice. + +Ready? All right, let's go! 😎 + +* * * + +\[toc\] + +* * * + +## Introducing tf-explain + +Now that we understand what Vanilla Gradients are, we can take a look at `tf-explain`. Put simply, it is a collection of techniques used for explaining machine learning models (Tf-explain, n.d.). + +A wide range of explainability techniques is supported: + +1. [Activations Visualization](https://www.machinecurve.com/index.php/2020/04/27/neural-network-activation-visualization-with-tf-explain/) +2. _Vanilla Gradients_ +3. Gradients\*Inputs +4. Occlusion Sensitivity +5. Grad CAM (Class Activation Maps) +6. SmoothGrad +7. Integrated Gradients + +…and others are on their development roadmap: + +1. GradCAM++ +2. Guided SmoothGrad +3. 
LRP + +Created by a French company called Sicara, it's really worth a look. Here, you can find it [on GitHub](https://github.com/sicara/tf-explain). + +Installation is simple: `pip install tf-explain`. That’s it – and it’s usable for both the TensorFlow CPU and GPU based models 🙂 + +* * * + +## Vanilla Gradients and your Keras ConvNet + +All right - that's the theory for today. Let's shift our focus to some practice :D Practice, here, meaning explaining your Keras based Convolutional Neural Network with Vanilla Gradients using `tf-explain`. + +What do we need? + +A model, I guess. And it should be a ConvNet. + +### Today's model + +Let's take a look at the model that we will be using first. + +It'll be a simple Convolutional Neural Network that we created in our post explaining the [Keras Conv2D layer type](https://www.machinecurve.com/index.php/2020/03/30/how-to-use-conv2d-with-keras/). + +Why not create the ConvNet here, you'd likely argue. There is one big reason:It would spoil this blog post - which is not about creating a ConvNet, but about applying Vanilla Gradients for visualizing the importance of gradients with respect to your input image. + +That's why we will be using the model that we created before, but we'll adapt it to use the MNIST dataset. + +If you wish to understand how the model was created - that's entirely possible :) Please click the link above to go to the particular post explaining the code below. + +Here's the Python code for today's model. Open up a code editor, create a Python file (such as `vanillagradients.py`) and code away: + +``` +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 28, 28, 1 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 100 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Load MNIST data +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Reshape data +input_train = input_train.reshape((input_train.shape[0], img_width, img_height, img_num_channels)) +input_test = input_test.reshape((input_test.shape[0], img_width, img_height, img_num_channels)) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +\[affiliatebox\] + +### Applying vanilla 
gradients during training + +Tf-explain allows you to apply vanilla gradients in two ways: **during training**, which allows you to visualize progress using TensorBoard, and **after training**, so that you can see how new data responds. + +We'll cover both in this blog post, but here, we will cover the _during training_ visualization (_after_ training is covered below). + +It consists of multipe phases of adaptations to the model code above: + +1. Adding tf-explain to your imports; +2. Creating a Keras callback: the VanillaGradientsCallback; +3. Fitting data to your model with the callback appended. + +Let's start with adding `tf-explain` to our imports. + +#### Adding tf-explain to your imports + +These are the current imports for our Keras model: + +``` +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +``` + +We'll have to add `tf_explain` and specifically the VanillaGradientsCallback, which is done as follows: + +``` +from tf_explain.callbacks.vanilla_gradients import VanillaGradientsCallback +``` + +Also make sure to import the `os` module, the need for which we'll explain later - `import os`. + +...this yields: + +``` +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from tf_explain.callbacks.vanilla_gradients import VanillaGradientsCallback +import os +``` + +#### Creating a Keras callback: the VanillaGradientsCallback + +Now that we have imported the `VanillaGradientsCallback`, it's time to use it in our model. + +We can do so by means of a **Keras callback**. Callbacks are pieces of code that are executed after each iteration, or epoch, and can manipulate the training process. + +For example, with the [ModelCheckpoint and EarlyStopping callbacks](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/), you can ensure that your training process stops precisely in time, while saving the best model instance you've found during the training process. + +This saves you resources and avoids that your [saved model](https://www.machinecurve.com/index.php/2020/02/14/how-to-save-and-load-a-model-with-keras/) has been overfitting for some time. + +However, we can also apply callbacks for using vanilla gradients with `tf-explain`: by means of the `VanillaGradientsCallback`, we can visualize and explain our model during the training process. Here's the code for creating the Keras callback: + +``` +# Defining callbacks +output_dir = './output' +os.mkdir(output_dir) +keras_callbacks = [ + VanillaGradientsCallback( + validation_data=(input_test, target_test), + class_index=0, + output_dir=output_dir, + ), +] +``` + +It contains of 3 separate blocks: + +- The `output_dir`, which specifies the directory where your TensorBoard required files are stored so that visualization can happen. +- The `os.mkdir` call, which generates the `output_dir` in your file system. +- The `keras_callbacks` array, which is the collection of callbacks that will be used during the training process. In today's case, it's only the `VanillaGradientsCallback`. 
We specify our test set as validation data for the callback, set a class that we want to visualize, and specify the output directory. + +#### Fitting data to your model with the callback appended + +We can then add the callback to our `model.fit` operation which starts the training process, to ensure that it is actually used: + +``` +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split, + callbacks=keras_callbacks) +``` + +That's it already! If you open up your terminal where `tf-explain` and TensorFlow 2.x are installed, and run the code, you'll see the training process begin. + +#### Full model code + +If you wish to obtain the full model code at once, that's possible :) Here you go: + +``` +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from tf_explain.callbacks.vanilla_gradients import VanillaGradientsCallback +import os + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 28, 28, 1 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 100 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + +# Load MNIST data +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Reshape data +input_train = input_train.reshape((input_train.shape[0], img_width, img_height, img_num_channels)) +input_test = input_test.reshape((input_test.shape[0], img_width, img_height, img_num_channels)) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Defining callbacks +output_dir = './output' +os.mkdir(output_dir) +keras_callbacks = [ + VanillaGradientsCallback( + validation_data=(input_test, target_test), + class_index=0, + output_dir=output_dir, + ), +] + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split, + callbacks=keras_callbacks) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +Now, open up your terminal again (possibly the same one as you trained your model in), `cd` to the folder where your `.py` file is located, and start TensorBoard: + +``` +tensorboard --logdir=./output +``` + +By default, TensorBoard will load on `localhost` at port `6006`: + +``` +Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all +TensorBoard 2.1.0 at http://localhost:6006/ (Press CTRL+C to quit) +``` + +At that URL, you can find the 
visualizations you need :) + +### Applying Vanilla Gradients to new data + +Sometimes, however, you don't want to use Vanilla Gradients _during_ training, but rather, _after training_, to find how your model behaves... and explain it. + +With Vanilla Gradients, you can do so in two ways: + +1. Set a class to explain, and feed it an image corresponding to the class, and see which parts of the image contribute most to the output. This tells you something about whether the correct parts of the image contribute to the correct class output. +2. Set a class to explain, and feed it an image corresponding to an incorrect class. The output image tells you something about what parts of the image (which is the wrong class) contribute to the output mostly, i.e. to the _error_ mostly. + +Let's give an example. + +#### Adding tf-explain to your imports + +The first thing we do is adding the `VanillaGradients` explainer to our imports: + +``` +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from tf_explain.core.vanilla_gradients import VanillaGradients +import matplotlib.pyplot as plt +import numpy as np +``` + +We also add Matplotlib, for generating visualizations later, and Numpy, for numbers processing. + +#### Instantiating the VanillaGradients explainer + +Then, after `model.fit`, we select a sample - in this case, sample 25 from the test set: + +``` +# Get some sample +sample = 25 +sample_image = np.array(input_test[sample]).reshape((img_width, img_height)) +plt.imshow(sample_image) +plt.show() +``` + +We also visualize it. + +Then, we instantiate the Vanilla Gradients explainer: + +``` +# Instantiate the explainer +explainer = VanillaGradients() +``` + +And explain away: + +``` +# Explain away +sample_array = (np.array([input_test[sample]]), None) +explanation = explainer.explain(sample_array, model, class_index=0) +plt.imshow(explanation) +plt.show() +``` + +...once again visualizing the outcome. 
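If you would like to see how the same input relates to each of the ten MNIST classes, a small extension of the above - reusing the `explainer`, `sample_array` and `model` objects we just defined - generates one explanation per class index:

```
# Sketch: generate a Vanilla Gradients explanation for every MNIST class index,
# reusing the `explainer`, `sample_array` and `model` defined above.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for class_index, ax in enumerate(axes.flat):
    explanation = explainer.explain(sample_array, model, class_index=class_index)
    ax.imshow(explanation)
    ax.set_title(f'Class {class_index}')
    ax.axis('off')
fig.tight_layout()
plt.show()
```

This makes it easy to compare which parts of the input drive the correct class versus the other, incorrect ones.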
+ +#### Full model code + +Should you wish to obtain the full code for your model - that's possible again :) Here you go: + +``` +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Flatten, Conv2D +from tensorflow.keras.losses import sparse_categorical_crossentropy +from tensorflow.keras.optimizers import Adam +from tf_explain.core.vanilla_gradients import VanillaGradients +import matplotlib.pyplot as plt +import numpy as np + +# Model configuration +batch_size = 50 +img_width, img_height, img_num_channels = 28, 28, 1 +loss_function = sparse_categorical_crossentropy +no_classes = 10 +no_epochs = 10 +optimizer = Adam() +validation_split = 0.2 +verbosity = 1 + + +# Load MNIST data +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Determine shape of the data +input_shape = (img_width, img_height, img_num_channels) + +# Reshape data +input_train = input_train.reshape((input_train.shape[0], img_width, img_height, img_num_channels)) +input_test = input_test.reshape((input_test.shape[0], img_width, img_height, img_num_channels)) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Scale data +input_train = input_train / 255 +input_test = input_test / 255 + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(Conv2D(128, kernel_size=(3, 3), activation='relu')) +model.add(Flatten()) +model.add(Dense(128, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=loss_function, + optimizer=optimizer, + metrics=['accuracy']) + +# Fit data to model +history = model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Get some sample +sample = 25 +sample_image = np.array(input_test[sample]).reshape((img_width, img_height)) +plt.imshow(sample_image) +plt.show() + +# Instantiate the explainer +explainer = VanillaGradients() + +# Explain away +sample_array = (np.array([input_test[sample]]), None) +explanation = explainer.explain(sample_array, model, class_index=0) +plt.imshow(explanation) +plt.show() + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +#### Results + +Running your code will allow the training process to start: + +``` +Train on 48000 samples, validate on 12000 samples +2020-05-02 20:35:01.571880: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll +2020-05-02 20:35:01.844852: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll +2020-05-02 20:35:02.834555: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows +Relying on driver to perform ptx compilation. This message will be only logged once. + 9200/48000 [====>.................. +``` + +And once it ends, you should have two visualizations generated :) + +- ![](images/vg_0.png) + +- ![](images/z0.png) + + +On the left, you see the sample you're trying to explain your model with - and on the right, you see which parts of the sample contribute most to the class output. 
There you go :) + +Now for use case 2 - explaining the error. Suppose that we set `class_index = 4` in our explainer (which would correspond to the number 4, as the MNIST dataset has 10 classes, the numbers 0-9), and feed it a 9, we see: + +- ![](images/z9.png) + +- ![](images/z9_o.png) + + +It seems that the distinction is not so strong as I thought it would be. Nevertheless, you can still use Vanilla Gradients to determine _which parts of the input contribute to the output the most_. + +* * * + +\[affiliatebox\] + +## Summary + +In this blog post, we showed you how Vanilla Gradients can be used for explaining ConvNet performance. We started with an introduction to `tf-explain`, which is a great collection of model explanation techniques. This was followed by an example implementation of Vanilla Gradients for your Keras model, both for visualizing during training and after training. We concluded by demonstrating the results of today's post visually. + +I hope you've learnt something today! If you did, please feel free to leave a comment in the comments section below - I would appreciate it 😊💬 Please make sure to do the same if you have any questions, remarks or other comments. I'll be happy to respond. + +Thank you for reading MachineCurve today and happy engineering 😎 + +\[kerasbox\] + +* * * + +## References + +Tf-explain. (n.d.). _tf-explain documentation_. tf-explain — tf-explain documentation. [https://tf-explain.readthedocs.io/en/latest/](https://tf-explain.readthedocs.io/en/latest/) diff --git a/visualizing-keras-cnn-attention-grad-cam-class-activation-maps.md b/visualizing-keras-cnn-attention-grad-cam-class-activation-maps.md new file mode 100644 index 0000000..f650f1e --- /dev/null +++ b/visualizing-keras-cnn-attention-grad-cam-class-activation-maps.md @@ -0,0 +1,347 @@ +--- +title: "Visualizing Keras CNN attention: Grad-CAM Class Activation Maps" +date: "2019-11-28" +categories: + - "deep-learning" + - "frameworks" +tags: + - "class-activation-maps" + - "computer-vision" + - "deep-learning" + - "grad-cam" + - "keras" + - "keras-vis" + - "machine-learning" + - "neural-networks" + - "visualization" +--- + +When training image classifiers, you wish to know that it generates predictions based on what you want the model to see. For example, if you have a classifier that can distinguish cars from buses, it should determine whether the picture contains a bus or a car based on _the vehicle_, rather than the environment. + +....this may sound odd, since a well-performing model ensures that this in order, doesn't it? + +You can't imagine how simple is to disturb the model :-) What if your training set contained buses in snowy environments only, whereas the cars drive in various weather conditions? What if your dataset contains cars at night, while buses drive during daytime? And so on. In those cases, it might be that the model's discriminative powers come from the environment rather than the target, rendering pretty bad performance for buses that drive while it's not snowing, especially at nighttime. + +Fortunately, it's possible to inspect where your ConvNet attends to, with **Class Activation Maps**. In this blog post, we cover the maps offered by the `keras-vis` toolkit: the Grad-CAM class activation maps. We'll first recap why model performance should be visualized in your ML projects, from a high level perspective. Subsequently, we introduce `keras-vis`, and will point you to other blogs on this topic. 
Then, we continue with the real deal: + +- We cover traditional class activation maps and Grad-CAM maps and will cover the reasons why `keras-vis` offers the latter ones. +- We implement the visualizations, using the MNIST dataset. We cover this process step by step, providing you the code with explanations. +- We argue why guided Grad-CAM might result in even better visualizations, but why `keras-vis` does (no longer) support this. +- Subsequently, because we can't help it, we repeat the process for a CIFAR10 CNN 😀 + +All right. Enough introductory text - let's go! 😎 + +\[toc\] + +## Recap: why visualize model performance? + +Machine learning models, or more colloquially _AI models_, have been taking a special role in today's business environment. 'Algorithms', as they are sometimes called as well, are automating away tasks that previously required human knowledge. + +Especially machine learning models, which are trained with large quantities of data, are increasing the speed of this process. This comes with an inherent risk: we often don't know what happens within these models. Explaining their behavior can be hard and difficult. Still, this is one of the most important aspects of machine learning, as - according to Gehrmann et al. (2019): + +- Users give up their agency, or autonomy, and control over the processes automated by machine learning. +- Users are forced to trust models that have been shown to be biased. +- Similarly, users have to rely on these same models. + +## Introducing `keras-vis` + +Hence, scholars have been finding ways to explain model behavior. `keras-vis` is a practical implementation of these attempts. It is a toolkit that can be integrated with your Keras models, and used for visualization. + +Broadly speaking, it comes with three types of visualizations: + +- **[Activation Maximization](https://www.machinecurve.com/index.php/2019/11/18/visualizing-keras-model-inputs-with-activation-maximization/)**, which essentially generates a perfect image of a particular class for a trained model. +- **[Saliency Maps](https://www.machinecurve.com/index.php/2019/11/25/visualizing-keras-cnn-attention-saliency-maps/)**, which - given some input image - tell you something about the importance of each pixel for generating the class decision, hence visualizing where the model looks at when deciding. +- **Class Activation Maps**, and especially Grad-CAM class activation maps, which generate heatmaps at the _convolutional_ level rather than the _dense_ neural layer level, taking into account more spatial details. + +We cover the latter in this blog post. Please click the links above if you wish to understand more about the other two, or if you wish to find examples for them. + +## Traditional and Grad-CAM Class Activation Maps + +Let's first cover the inner workings of class activation maps and Grad-CAMs, or _gradient-weighted class activation maps_, before we continue to the example implementation. Of course, if you're interested in the example only, please feel free to skip this section - but I think it's interesting to see why these visualizations work as they do. + +In fact, we'll have to take an additional step backwards in order to understand Grad-CAMs: by looking at saliency maps. + +As we covered in the [saliency maps blog post](https://www.machinecurve.com/index.php/2019/11/25/visualizing-keras-cnn-attention-saliency-maps/), saliency maps tell you something about the importance of a pixel of the input image. 
In the case of `keras-vis` based saliency maps, this is the importance of a pixel of the input image with respect to _generating the class prediction_, i.e. the output. This is achieved by mathematically asking the following question: how does the class output change when we change the input image? The answer, computed pixel by pixel, is what the saliency map shows.

As you could see in the blog post, they work pretty well in telling you which parts of the image are used for generating the target prediction:

[![](images/frog-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/frog-2.png)

However, we can find suggestions for improvement with respect to saliency maps (Selvaraju et al., 2017):

- These maps, which the above authors call _pixel-space gradient visualizations_, highlight many details in the image, but are not necessarily _class discriminative_ (see the MNIST image below).
- This especially occurs when two similar but different classes occur in an image. In their paper, the authors provide an image of a cat and a dog, and the pixel based visualizations highlight both the cat and the dog for the 'dog' and 'cat' classes, respectively.

[![](images/sal9.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/sal9.png)

We can't be 100% sure whether this activates because of a 9 or because of an 8.

Class activation maps (or CAMs) solve this problem: they are highly class discriminative, exclusively highlighting the regions for the class to be visualized (Selvaraju et al., 2017). This is why traditional CAMs, as proposed by Zhou et al. (2016), have been popular for some time. However - and this is why `keras-vis` makes use of Grad-CAMs (we'll explain these next) - traditional CAMs also come with one big drawback:

**Traditional CAMs can only be used with a small class of ConvNets, i.e. those without densely-connected layers, which directly pass the convolutional feature maps forward to the output layer (Selvaraju et al., 2017).**

This fact makes it hard to use them in real-life models, where [convolutional layers are often followed by densely-connected ones](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/), to generate various [computer vision applications](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/). Fortunately, Selvaraju et al. (2017) propose a generalization of the CAM approach which can be used with _any_ architecture, hence also the ones with densely-connected layers.

It is called **gradient-weighted class activation maps** (Grad-CAM) and works as follows:

- First, the gradient of the _output class prediction_ with respect to the _feature maps of your **last** convolutional layer_ is computed (before the Softmax activation which is common in [multiclass scenarios](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) - hence, we replace it in our implementation with Linear).
- Subsequently, these gradients flow back, and determine the relative importance of these feature maps for the class prediction, by means of global average pooling.
- By computing a combination of the feature maps in this layer, weighted by these importance scores, we get a _gradient-weighted_ CAM heatmap that represents both the _positive_ and _negative_ importance factors for the input image. The positive factors mean that many feature maps participate in the importance of some area with respect to the output class (i.e., the desired class).
Those are the areas that likely contain the object of interest. The negative factors mean that many feature maps participate in the importance of that area with respect to the _other classes_ (as the gradients will be strongly negative). +- Selvaraju et al. simply yet ingeniously propose to pass the heatmap through a [ReLU](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) function to filter out the negative areas, setting them to zero importance, while maintaining importance of the positive areas. + +...which allows us to visualize which parts of an image participate in a class decision, and hence add _explainability_ to the ConvNet's prediction process! + +Let's now see if we can implement this 😎 + +## Implementing the visualizations + +### Today's dataset: MNIST + +In today's implementation, we will generate visualizations for predictions made with a model trained on the MNIST dataset. As you can see in the image below, this dataset contains many small images that represent handwritten digits. That is, we have ten classes: the numbers 0 to 9. + +[![](images/mnist.png)](https://www.machinecurve.com/wp-content/uploads/2019/07/mnist.png) + +### What you'll need to run the models + +As with many MachineCurve tutorials, you'll need to have a few software dependencies installed in order to run the models. For generating Grad-CAMs, dependencies are as follows: + +- **Python**, as you will need to use Keras - the deep learning framework for Python. Make sure to install Python 3.6+. +- **Keras**, which is the deep learning framework we're using today. +- One of the supported backends, being **Tensorflow, Theano or CNTK**. Keras runs on top of these and abstracts the backend into easily comprehensible format. We advise to use Tensorflow, as it is deeply integrated with Keras today. +- **Matplotlib**, for generating plots for the visualizations and colormap information. +- **Numpy**, for data processing. +- **Keras-vis**, the toolkit for generating Grad-CAMs. + +From the blog on [saliency maps](https://www.machinecurve.com/index.php/2019/11/25/visualizing-keras-cnn-attention-saliency-maps/) \- this is important: + +With this latter requirement, there is a catch: `pip install keras-vis` doesn't work, as it will not install the most recent version - which is a version that doesn't work with the most recent versions of Tensorflow/Keras. + +Instead, you'll need to install `keras-vis` a little bit differently, like this: + +``` +pip install https://github.com/raghakot/keras-vis/archive/master.zip +``` + +When doing so, version `0.5.0` will be installed, which is - as of November 2019 - the most recent version: + +``` +>pip install https://github.com/raghakot/keras-vis/archive/master.zip +Collecting https://github.com/raghakot/keras-vis/archive/master.zip + Downloading https://github.com/raghakot/keras-vis/archive/master.zip + \ 58.1MB 819kB/s +Building wheels for collected packages: keras-vis + Building wheel for keras-vis (setup.py) ... done +Successfully built keras-vis +Installing collected packages: keras-vis +Successfully installed keras-vis-0.5.0 +``` + +### Today's model: default Keras CNN + +The first step you'll undertake now is opening File explorer and creating a file such as `class_activation_maps_mnist.py`. In this file, you're going to add your code. Now open a code editor and open the file. Then proceed. + +We're going to use the Keras CNN we created and explained in a different blog for today's MNIST visualizations. 
Hence, I won't explain the model in much detail here, but would like to refer you [to that blog](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) if you wish to know more. Instead, I'll just give you the model code: + +``` +''' + Visualizing how layers represent classes with keras-vis Class Activation Maps (Grad-CAM). +''' + +# ============================================= +# Model to be visualized +# ============================================= +import keras +from keras.datasets import mnist +from keras.models import Sequential +from keras.layers import Dense, Dropout, Flatten +from keras.layers import Conv2D, MaxPooling2D +from keras import backend as K +from keras import activations + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data based on channels first / channels last strategy. +# This is dependent on whether you use TF, Theano or CNTK as backend. +# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py +if K.image_data_format() == 'channels_first': + input_train = input_train.reshape(input_train.shape[0], 1, img_width, img_height) + input_test = input_test.reshape(input_test.shape[0], 1, img_width, img_height) + input_shape = (1, img_width, img_height) +else: + input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) + input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) + input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = keras.utils.to_categorical(target_train, no_classes) +target_test = keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax', name='visualized_layer')) + +# Compile the model +model.compile(loss=keras.losses.categorical_crossentropy, + optimizer=keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +### ...one difference, though + +Being the final layer that is added to the model: + +``` +model.add(Dense(no_classes, activation='softmax', name='visualized_layer')) +``` + +We added `name='visualized_layer'`, which is not present in the model creation in the other blog post. Adding this allows us to use this layer in Grad-CAM visualizations later. + +### Creating the Grad-CAM map + +Now that we have a model instance, we can create a Grad-CAM - let's find out how. 
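If you are curious what the Grad-CAM recipe from the previous section looks like in code before we hand the work over to `keras-vis`, the sketch below is a minimal, standalone illustration. It is not the `keras-vis` implementation: it assumes a trained TensorFlow 2 / `tf.keras` model, and the `grad_cam` helper and its `last_conv_name` argument are names made up for this example.

```
# Minimal Grad-CAM sketch (illustration only, not the keras-vis implementation).
# Assumes a trained tf.keras `model`; `last_conv_name` must be the name of its
# last convolutional layer - both are assumptions made for this example.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_index, last_conv_name):
    # Map the input to (last conv feature maps, class predictions)
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(last_conv_name).output, model.output])

    with tf.GradientTape() as tape:
        conv_maps, predictions = grad_model(np.expand_dims(image, axis=0))
        class_score = predictions[:, class_index]

    # Gradient of the class score w.r.t. the feature maps of the last conv layer
    grads = tape.gradient(class_score, conv_maps)

    # Global average pooling of the gradients: one importance weight per feature map
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))

    # Weighted combination of the feature maps, then ReLU to keep the positive areas
    cam = tf.nn.relu(tf.reduce_sum(conv_maps[0] * weights, axis=-1))

    # Normalize to [0, 1] so that it can be plotted as a heatmap
    return (cam / (tf.reduce_max(cam) + tf.keras.backend.epsilon())).numpy()
```

The `visualize_cam` call we use below wraps roughly these same steps for you, and additionally upsamples the heatmap to the input image size.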
+ +First, we add some additional imports: `keras-vis` as `vis` for visualization purposes, together with Matplotlib (for plotting purposes) and Numpy (for data processing): + +``` +# ============================================= +# Grad-CAM code +# ============================================= +from vis.visualization import visualize_cam, overlay +from vis.utils import utils +import matplotlib.pyplot as plt +import numpy as np +import matplotlib.cm as cm +``` + +Subsequently, we find the index of the layer we will use in the visualizations by means of the `name` we gave it earlier: + +``` +# Find the index of the to be visualized layer above +layer_index = utils.find_layer_idx(model, 'visualized_layer') +``` + +Next, we swap the final layer's Softmax activation function with a Linear activation function, which simply passes input `x` as `x`. As Grad-CAMs are generated by weighing the gradients of the output class prediction with respect to the feature maps of the last convolutional layer of your model, you need to ensure that information passes properly. Softmax breaks this pattern, and hence you need to convert it into Linear. We do that as follows: + +``` +# Swap softmax with linear +model.layers[layer_index].activation = activations.linear +model = utils.apply_modifications(model) +``` + +Next, we specify some samples to visualize: + +``` +# Numbers to visualize +indices_to_visualize = [ 0, 12, 38, 83, 112, 74, 190 ] +``` + +...and actually generate the visualizations: + +``` +# Visualize +for index_to_visualize in indices_to_visualize: + # Get input + input_image = input_test[index_to_visualize] + input_class = np.argmax(target_test[index_to_visualize]) + # Matplotlib preparations + fig, axes = plt.subplots(1, 3) + # Generate visualization + visualization = visualize_cam(model, layer_index, filter_indices=input_class, seed_input=input_image) + axes[0].imshow(input_image[..., 0], cmap='gray') + axes[0].set_title('Input') + axes[1].imshow(visualization) + axes[1].set_title('Grad-CAM') + heatmap = np.uint8(cm.jet(visualization)[..., :3] * 255) + original = np.uint8(cm.gray(input_image[..., 0])[..., :3] * 255) + axes[2].imshow(overlay(heatmap, original)) + axes[2].set_title('Overlay') + fig.suptitle(f'MNIST target = {input_class}') + plt.show() +``` + +- We iterate over all the indices that we wish to visualize. +- For each index, we get the input image and the input class that corresponds to this image. +- We subsequently prepare Matplotlib and generate the visualization for the input image given the input class. +- Next, we plot the original image, the Grad-CAM feature map and the two as an overlay image. +- That's it - we show the plot! + +### The results + +Now, open up a terminal (e.g. `cmd` or your Anaconda terminal - any terminal suffices as long it has the dependencies), and `cd` to the folder where your `class_activation_maps_mnist.py` or similar file is stored. Next, execute the script, via `python class_activation_maps_mnist.py`. + +Once the model finishes training, the plots should start popping up: + +- ![](images/1-3.png) + +- ![](images/1-2-1.png) + +- ![](images/2-3.png) + +- ![](images/3-3.png) + +- ![](images/7-3.png) + +- ![](images/7-2-1.png) + +- ![](images/9-3.png) + + +## Guided Grad-CAM maps and `keras-vis` + +While Grad-CAMs are quite capable to generate heatmaps often, it would be even better if pixel based approaches (such as saliency maps) can be combined with Grad-CAMs. 
Guided Grad-CAMs are a solution to this challenge, as traditional Grad-CAMs are combined with guided backprop in order to generate an even more accurate visualization (Selvaraju et al., 2017). + +While `keras-vis` supports this, maintenance on the toolkit has dropped somewhat. This unfortunately means that it is no longer fully compatible with newer Tensorflow and Keras versions. Although traditional Grad-CAMs do work, the guided (and rectified) versions unfortunately produce errors that cannot be overcome. + +## Summary + +In this blog post, we've seen how to generate gradient-weighted Class Activation Maps (or Grad-CAMs) with the `keras-vis` toolkit, for your Keras model. We explained the conceptual nature of Grad-CAMs and how they differ from pixel based approaches such as saliency maps. We also provided an example implementation that runs with your Keras model. + +I hope you've learnt something from this blog and am looking forward to your comment! 😊 If you have any questions, remarks, or when you think that my blog can be improved - please feel free to drop me a message in the comments box below ⬇. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +## References + +Gehrmann, S., Strobelt, H., Kruger, R., Pfister, H., & Rush, A. M. (2019). Visual Interaction with Deep Learning Models through Collaborative Semantic Inference. _IEEE Transactions on Visualization and Computer Graphics_, 1-1. [doi:10.1109/tvcg.2019.2934595](https://arxiv.org/abs/1907.10739) + +Kotikalapudi, Raghavendra and contributors. (2017). Github / keras-vis. Retrieved from [https://github.com/raghakot/keras-vis](https://github.com/raghakot/keras-vis) + +Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. _2017 IEEE International Conference on Computer Vision (ICCV)_. [doi:10.1109/iccv.2017.74](https://arxiv.org/abs/1610.02391) + +Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning Deep Features for Discriminative Localization. _2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)_. [doi:10.1109/cvpr.2016.319](http://cnnlocalization.csail.mit.edu/) diff --git a/visualizing-keras-cnn-attention-saliency-maps.md b/visualizing-keras-cnn-attention-saliency-maps.md new file mode 100644 index 0000000..fbf5ed1 --- /dev/null +++ b/visualizing-keras-cnn-attention-saliency-maps.md @@ -0,0 +1,532 @@ +--- +title: "Visualizing Keras CNN attention: Saliency maps" +date: "2019-11-25" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "keras" + - "keras-vis" + - "machine-learning" + - "saliency-map" + - "visualization" +--- + +Suppose that you're training an image classifier. You don't have much training data. The classifier is binary and it allows you to distinguish between cats and dogs. However, for the cats, you only have images of the animals photographed at home, sitting on a couch. + +Now, having achieved high accuracies and low losses on your training results, you're very confident that your model is capable of separating cats from dogs. You feed your trained model another image, this time of a cat outside - but for some reason, it outputs _dog_. + +Why does this occur? You've trained the model to separate dogs and cats! 
There might be a simple yet unwanted explanation for this behavior: the model does not actually separate _cats from dogs_, but _the outside environment from the inside environment_, as cats were only photographed inside, whereas dogs were photographed both inside and outside.

You obviously don't want this. For image classifiers, it may thus be a good idea to actually _check_ whether your model uses the interesting parts of your input image to generate the class output. But how to do this?

That's where **saliency maps** enter the picture. They can be used to visualize the _attention_ of your ConvNet, i.e., which parts of an input image primarily help determine the output class. In this blog post, we'll take a look at these saliency maps. We do so by first taking a look at attention and why it's a good idea to visualize it in the first place. Then, we get technical.

In the technical part, we first introduce `keras-vis`, which we use for visualizing these maps. Next, we actually generate saliency maps for visualizing attention for possible inputs to a Keras based CNN trained on the MNIST dataset. Then, we investigate whether this approach also works with the CIFAR10 dataset, which doesn't represent numbers but objects instead.

Hope you're ready, because let's go! 😎

\[toc\]

## Recap: what is attention and why visualize it?

When you look at this text, it's likely that there are various objects that compete for your attention. The titles of this post, for example, or the _related articles_ in the sidebar, all require your _attention_. But when you're interested in understanding how to visualize attention of a ConvNet with saliency maps, what should you look at?

Yes: the text 😉

Now suppose that you have trained a [ConvNet classifier](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) which you can use to generate predictions for images. As discussed before, accuracy is high. But can you be certain that your classifier looks at the important aspects of an image when generating a prediction?

For example, that - when being trained on pictures of cats and dogs - it really looks at the _animal_ for generating the prediction, rather than the _environment_.

(You can guess how easy it is to mislead a model during training when e.g. the cats are all recorded in a snowy environment, while the dogs are not.)

It's important to visualize the decision structure of your ConvNet. Does it really make its prediction based on the object, and not the environment? That's the question, which can be embedded into the broader context of why to visualize the model (Gehrmann et al., 2019), as:

- Users give up their agency, or autonomy, and control over the processes automated by machine learning.
- Users are forced to trust models that have been shown to be biased.
- Similarly, users have to rely on these same models.

## Introducing `keras-vis`

When building a model with Keras, you may wish to visualize the 'attention' of your ConvNet with respect to the object you're trying to classify.

Say hello to `keras-vis`, which [allows you to do precisely this](https://github.com/raghakot/keras-vis). The toolkit, which runs with your Keras model, allows you to visualize models in multiple ways:

- By _[activation maximization](https://www.machinecurve.com/index.php/2019/11/18/visualizing-keras-model-inputs-with-activation-maximization/)_, essentially generating a 'perfect picture' of your classes.
- By _saliency maps_, which we cover next.
+- By _[class activation maps](https://www.machinecurve.com/index.php/2019/11/28/visualizing-keras-cnn-attention-grad-cam-class-activation-maps/)_, which are heatmaps of where the model attends to. + +## Using saliency maps to visualize attention at MNIST inputs + +In this blog post, however, we cover _saliency maps_. Wikipedia defines such a map as: + +> In [computer vision](https://en.wikipedia.org/wiki/Computer_vision), a **saliency map** is an [image](https://en.wikipedia.org/wiki/Image) that shows each [pixel](https://en.wikipedia.org/wiki/Pixel)'s unique quality. +> +> Wikipedia (2015) + +In our case, this unique quality is _how much a pixel contributes to the class prediction_. + +Or, to put it in terms of `keras-vis`: to compute the gradient of output category with respect to the input image. I.e., if we change the input image from an empty image to say, a 'one' as provided by the MNIST dataset, how much do the _output pixels_ of the saliency map change? This tells us something about where the model attends to when generating a prediction. + +Now, how do we implement this? Let's give it a try for the MNIST dataset. + +### Today's dataset: MNIST + +We're going to use a very straight-forward dataset today: the MNIST dataset. This dataset, which stands for _Modified National Institute of Standards and Technology_ dataset, contains thousands of 28x28 pixel handwritten digits, like this: + +[![](images/mnist.png)](https://www.machinecurve.com/wp-content/uploads/2019/07/mnist.png) + +Given the simplicity of the dataset, the deep integration with various Python frameworks for deep learning - including Keras - and the ease of which good results can be obtained, it's one of the better datasets for educational purposes. + +Hence, we're using this dataset in today's Keras CNN. Let's now discover what we need to run the visualizations we'll be creating next. + +### What you'll need to run the models + +What you need is really simple: + +- You need **Keras**, which is the deep learning framework we're using to train the models. +- You need one of the backends, being **Theano, Tensorflow or CNTK** - and preferably TensorFlow, since Keras has been integrated deeply (and doing so increasingly) with this backend. +- You need **Matplotlib** for actually displaying the visualizations on screen. +- Additionally, **Numpy** is required for data processing. +- You finally need `keras-vis` for generating the visualizations. + +With this latter requirement, there is a catch: `pip install keras-vis` doesn't work, as it will not install the most recent version - which is a version that doesn't work with the most recent versions of Tensorflow/Keras. + +Instead, you'll need to install `keras-vis` a little bit differently, like this: + +``` +pip install https://github.com/raghakot/keras-vis/archive/master.zip +``` + +When doing so, version `0.5.0` will be installed, which is - as of November 2019 - the most recent version: + +``` +>pip install https://github.com/raghakot/keras-vis/archive/master.zip +Collecting https://github.com/raghakot/keras-vis/archive/master.zip + Downloading https://github.com/raghakot/keras-vis/archive/master.zip + \ 58.1MB 819kB/s +Building wheels for collected packages: keras-vis + Building wheel for keras-vis (setup.py) ... done +Successfully built keras-vis +Installing collected packages: keras-vis +Successfully installed keras-vis-0.5.0 +``` + +Preferably, you run all requirements in an Anaconda environment, given isolation purposes with respect to other packages. 
However, using Anaconda is not mandatory to make it work. + +### Adding a Keras CNN + +Now that you know what is necessary to train the model and generate the saliency map visualizations, it's time to add a model. + +We simply add the [Keras CNN that we created in a different blog post](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/). + +For the sake of brevity, I'm not repeating the explanation about the architecture and code blocks here. If you wish to understand this in more detail, please feel free to click the link above and read the other blog post - where you'll find all the details. + +Your first step in the context of generating the saliency map visualizations will thus be to open up your Explorer, navigate to some folder, and create a file called e.g. `saliency_maps_mnist.py`. Next, you open your code editor, open up the file, and paste the Keras CNN we created before: + +``` +''' + Visualizing how layers represent classes with keras-vis Saliency Maps. +''' + +# ============================================= +# Model to be visualized +# ============================================= +import keras +from keras.datasets import mnist +from keras.models import Sequential +from keras.layers import Dense, Dropout, Flatten +from keras.layers import Conv2D, MaxPooling2D +from keras import backend as K +from keras import activations + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data based on channels first / channels last strategy. +# This is dependent on whether you use TF, Theano or CNTK as backend. 
+# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py +if K.image_data_format() == 'channels_first': + input_train = input_train.reshape(input_train.shape[0], 1, img_width, img_height) + input_test = input_test.reshape(input_test.shape[0], 1, img_width, img_height) + input_shape = (1, img_width, img_height) +else: + input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) + input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) + input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = keras.utils.to_categorical(target_train, no_classes) +target_test = keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax', name='visualized_layer')) + +# Compile the model +model.compile(loss=keras.losses.categorical_crossentropy, + optimizer=keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') +``` + +### ...the same, except for one thing + +But wait! + +There is _one difference_ with respect to the Keras CNN. + +And it's related to the last layer we'll add: + +``` +model.add(Dense(no_classes, activation='softmax', name='visualized_layer')) +``` + +In this layer, we add `name='visualized_layer'`, which isn't present in the traditional CNN. + +Why is this the case? + +Simple: our saliency map will visualize the attention generated _by some particular layer_, or set of layers, if you will. + +This requires that you assign a name to at least one layer, to be reused in the saliency map code layer. + +Since we're interested in attention in our _final layer_ (i.e., implicitly taking into account all the upstream layers), we're adding the name there. + +Let's now create the saliency map itself 😀 + +### Creating the saliency map + +For this purpose, we first need to add some imports: + +``` +# ============================================= +# Saliency Maps code +# ============================================= +from vis.visualization import visualize_saliency +from vis.utils import utils +import matplotlib.pyplot as plt +import numpy as np +``` + +`vis` represents the `keras-vis` toolkit, and from it, we import `visualize_saliency` (allowing us to perform saliency map visualization) and `utils`. + +Additionally, we import Matplotlib - for generating the visualization plots - and Numpy, for some data processing. + +Next, we find the index of the layer for which we wish to generate the saliency map visualizations - indeed, this is the `visualized_layer` layer name that we assigned to the final layer previously. 
+ +``` +# Find the index of the to be visualized layer above +layer_index = utils.find_layer_idx(model, 'visualized_layer') +``` + +This code simply converts a layer name into a layer index, or a number that specifies where the layer to be visualized can be found in the architecture. + +Next, we swap the final Softmax layer with a linear one: + +``` +# Swap softmax with linear +model.layers[layer_index].activation = activations.linear +model = utils.apply_modifications(model) +``` + +We need to do this because the _saliency map_ generation process essentially performs a reversed process: instead of feeding data forward, and computing how the input should change with respect to the output, we compute it the other way around. That is, we compute how the output changes with respect to a change in input. Softmax, in this case, causes trouble - and that's why we replace it with a linear activation, which essentially passes the data. + +We next specify some samples from our test set for which we wish to generate saliency maps: + +``` +# Numbers to visualize +indices_to_visualize = [ 0, 12, 38, 83, 112, 74, 190 ] +``` + +And finally add code for generating the visualizations: + +``` +# Visualize +for index_to_visualize in indices_to_visualize: + # Get input + input_image = input_test[index_to_visualize] + input_class = np.argmax(target_test[index_to_visualize]) + # Matplotlib preparations + fig, axes = plt.subplots(1, 2) + # Generate visualization + visualization = visualize_saliency(model, layer_index, filter_indices=input_class, seed_input=input_image) + axes[0].imshow(input_image[..., 0]) + axes[0].set_title('Original image') + axes[1].imshow(visualization) + axes[1].set_title('Saliency map') + fig.suptitle(f'MNIST target = {input_class}') + plt.show() +``` + +This code: + +- Loops over all the indices to visualize. +- For each index, retrieves the actual _input image_, and the _index of the input class_. This latter, in this case, is also the number in MNIST (i.e., index 1 is the number 1, and so on). +- Prepares Matplotlib to plot two subplots in one plot: two columns, one row. +- Calls `visualize_saliency` to generate the saliency map visualization. We do this with our `model` instance (which we trained as a Keras CNN), at a particular `layer_index` (which we selecteD), with some `filter_indices` (i.e., the true class we wish to visualize) and some `seed_input` (i.e., the input image we're generating the saliency map for). +- Subsequently plots the original image and the saliency map for each index, adds a global title, and displays the plot on screen. + +### The results + +Now that we've finished coding the model, we can run it. + +Open up a terminal that can access the dependencies you've installed, `cd` to the particular folder you've stored the file into, and hit a command like `python saliency_maps_mnist.py`. You should see Keras starting the training process, likely with the Tensorflow backend. Eventually, when the training process finishes, you should see the results generated with your test data: + +``` +Test loss: 0.03036452561810365 / Test accuracy: 0.9911999702453613 +``` + +In the MNIST case, we created a pretty well functioning model, with 99+% accuracies! + +That's great. 
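Before we look at the plots: if you wonder what `visualize_saliency` conceptually computes, the sketch below shows the basic idea - the gradient of the class score with respect to every input pixel - written with TensorFlow 2 style gradients. It is an illustration only, not `keras-vis` code; the `vanilla_saliency` helper is a name made up for this example, and it assumes a trained `tf.keras` model.

```
# Minimal vanilla saliency sketch (illustration only, not keras-vis code).
# Assumes a trained tf.keras `model`; `vanilla_saliency` is a hypothetical helper.
import numpy as np
import tensorflow as tf

def vanilla_saliency(model, image, class_index):
    image = tf.convert_to_tensor(np.expand_dims(image, axis=0))
    with tf.GradientTape() as tape:
        # Watch the input itself, since we need d(class score) / d(input pixels)
        tape.watch(image)
        class_score = model(image)[:, class_index]
    grads = tape.gradient(class_score, image)
    # Absolute gradient, maximum over the channel axis, normalized to [0, 1]
    saliency = tf.reduce_max(tf.abs(grads), axis=-1)[0]
    return (saliency / (tf.reduce_max(saliency) + tf.keras.backend.epsilon())).numpy()
```

With that intuition in place, back to the results.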
+ +Next, you should find visualizations popping up your screen: + +- [![](images/sal1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/sal1.png) + +- [![](images/sal1-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/sal1-2.png) + +- [![](images/sal2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/sal2.png) + +- [![](images/sal3.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/sal3.png) + +- [![](images/sal7.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/sal7.png) + +- [![](images/sal7-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/sal7-2.png) + +- [![](images/sal9.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/sal9.png) + + +...which show you where the model attends to when generating a class prediction 😎 + +What's great is that apparently, the model recognizes that it should look at positions _near the numbers_ for generating the prediction. This effect is especially visible with seven, three, two and one, and slightly less with the nine. Nevertheless, mission achieved! 😀 + +## Do they also work with CIFAR10 inputs? + +We can next try to do the same thing with the CIFAR10 dataset, which contains various real-world objects: + +![](images/cifar10_visualized.png) + +We create another file, e.g. `saliency_maps_cifar10.py`, and add code that really resembles the MNIST scenario: + +``` +''' + Visualizing how layers represent classes with keras-vis Saliency Maps. +''' + +# ============================================= +# Model to be visualized +# ============================================= +import keras +from keras.datasets import cifar10 +from keras.models import Sequential +from keras.layers import Dense, Dropout, Flatten +from keras.layers import Conv2D, MaxPooling2D +from keras import backend as K +from keras import activations + +# Model configuration +img_width, img_height = 32, 32 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Reshape data based on channels first / channels last strategy. +# This is dependent on whether you use TF, Theano or CNTK as backend. 
+# Source: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py +if K.image_data_format() == 'channels_first': + input_train = input_train.reshape(input_train.shape[0], 3, img_width, img_height) + input_test = input_test.reshape(input_test.shape[0], 3, img_width, img_height) + input_shape = (1, img_width, img_height) +else: + input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 3) + input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 3) + input_shape = (img_width, img_height, 3) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Normalize data +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = keras.utils.to_categorical(target_train, no_classes) +target_test = keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax', name='visualized_layer')) + +# Compile the model +model.compile(loss=keras.losses.categorical_crossentropy, + optimizer=keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + +# ============================================= +# Saliency Maps code +# ============================================= +from vis.visualization import visualize_saliency +from vis.utils import utils +import matplotlib.pyplot as plt +import numpy as np + +# Find the index of the to be visualized layer above +layer_index = utils.find_layer_idx(model, 'visualized_layer') + +# Swap softmax with linear +model.layers[layer_index].activation = activations.linear +model = utils.apply_modifications(model) + +# Numbers to visualize +indices_to_visualize = [ 0, 12, 38, 83, 112, 74, 190 ] + +# Visualize +for index_to_visualize in indices_to_visualize: + # Get input + input_image = input_test[index_to_visualize] + # Class object + classes = { + 0: 'airplane', + 1: 'automobile', + 2: 'bird', + 3: 'cat', + 4: 'deer', + 5: 'dog', + 6: 'frog', + 7: 'horse', + 8: 'ship', + 9: 'truck' + } + input_class = np.argmax(target_test[index_to_visualize]) + input_class_name = classes[input_class] + # Matplotlib preparations + fig, axes = plt.subplots(1, 2) + # Generate visualization + visualization = visualize_saliency(model, layer_index, filter_indices=input_class, seed_input=input_image) + axes[0].imshow(input_image) + axes[0].set_title('Original image') + axes[1].imshow(visualization) + axes[1].set_title('Saliency map') + fig.suptitle(f'CIFAR10 target = {input_class_name}') + plt.show() +``` + +What is different is this: + +- CIFAR10 data is loaded instead of MNIST data. 
- Reshaping the input data considers the 3 RGB channels, instead of just one channel in the MNIST case;
- A `classes` object is added to allow Matplotlib to find the class name for some input target class integer.

When running this again, with e.g. `python saliency_maps_cifar10.py`, we see that the model performs slightly worse - which makes sense, as MNIST is _really, really simple_ in terms of complexity - with these performance metrics:

```
Test loss: 0.8597345282554626 / Test accuracy: 0.7184000015258789
```

And these are the saliency maps for CIFAR10 targets:

- ![](images/airplane-2.png)

- ![](images/cat-2.png)

- ![](images/dog-2.png)

- ![](images/dog2.png)

- ![](images/frog-2.png)

- ![](images/horse-2.png)

- ![](images/truck-2.png)


...attention seems to be in order, and is especially striking with the frog and the horse images. Funnily, the firetruck is recognized by its wheels.

## Rectified and guided backprop

We've successfully generated saliency maps, but can we make the visualizations sharper?

Yes, and `keras-vis` supports this - by modifying the backprop operations performed when generating the visualizations into rectified or guided backprop.

...what's sad, however, is that `keras-vis` has not been updated for quite some time, and the code crashes time after time with newer versions of Keras and TensorFlow.

So, unfortunately, `keras-vis` based saliency maps with rectified and guided backprop do not seem to be an option for the time being ☹ Nevertheless, it was great generating them with 'vanilla' backprop, and to see that they really work!

## Summary

In this blog post, we've seen how to visualize where your ConvNet attends to by means of _saliency maps_. We discussed what is visualized and how you can visualize these maps for your Keras models by means of the `keras-vis` toolkit.

I hope you've learnt something interesting today. If you did, please feel free to leave a comment below 😊 You're invited to do the same when you face questions, or when you have other remarks. I'll happily answer your questions and if necessary adapt my blog.

Thanks a lot and happy engineering! 😎

## References

Gehrmann, S., Strobelt, H., Kruger, R., Pfister, H., & Rush, A. M. (2019). Visual Interaction with Deep Learning Models through Collaborative Semantic Inference. _IEEE Transactions on Visualization and Computer Graphics_, 1-1. [doi:10.1109/tvcg.2019.2934595](https://arxiv.org/abs/1907.10739)

Kotikalapudi, Raghavendra and contributors. (2017). Github / keras-vis. Retrieved from [https://github.com/raghakot/keras-vis](https://github.com/raghakot/keras-vis)

Wikipedia. (2015, December 3). Saliency map. Retrieved from [https://en.wikipedia.org/wiki/Saliency\_map](https://en.wikipedia.org/wiki/Saliency_map)

diff --git a/visualizing-keras-model-inputs-with-activation-maximization.md b/visualizing-keras-model-inputs-with-activation-maximization.md
new file mode 100644
index 0000000..64507db
--- /dev/null
+++ b/visualizing-keras-model-inputs-with-activation-maximization.md
@@ -0,0 +1,604 @@
---
title: "Activation Maximization with TensorFlow 2 based Keras for visualizing model inputs"
date: "2019-11-18"
categories:
  - "buffer"
  - "deep-learning"
  - "frameworks"
tags:
  - "activation-maximization"
  - "keras"
  - "keras-vis"
  - "visualization"
---

Deep neural networks were traditionally black boxes. The mantra "you feed them data, you'll get a working model, but you cannot explain how it works" is still very common today.
Fortunately, however, developments in the fields of machine learning have resulted in _explainable AI_ by means of visualizing the internals of machine learning models. + +In this blog, we'll take a look at a practice called **activation maximization**, which can be used to visualize 'perfect inputs' for your deep neural network. It includes an example implementation for Keras classification and regression models using the `tf-keras-vis` library. + +However, we'll start with some rationale - why visualize model internals in the first place? What is activation maximization and how does it work? And what is `tf-keras-vis`? + +Subsequently, we move on to coding examples for two Keras CNNs: one trained on the MNIST dataset, the other trained on the CIFAR10 dataset. Finally, we'll wrap up our post by looking at what we created. + +After reading this tutorial, you will have learned... + +- **What Activation Maximization is and how it can be used to compute expected inputs for class outcomes.** +- **How to use `tf-keras-vis` for Activation Maximization.** +- **How to apply Activation Maximization to your TensorFlow 2 based Keras models, using the MNIST and CIFAR-10 datasets.** + +Let's go! 😎 + +**Update 17/Mar/2021:** large article update. `keras-vis` does not work with TensorFlow 2, so switched to `tf-keras-vis`. This ensures that the code can be used with TensorFlow 2 based versions of Keras. Also adapted the text to reflect on this. Ensures that the article is up to date for 2021 and beyond. + +* * * + +\[toc\] + +* * * + +## Code example: Activation Maximization with TensorFlow 2 based Keras + +The code below provides a full example of using Activation Maximization with TensorFlow 2 based Keras for visualizing the expected inputs to a model, in order to reach a specific class outcome. For example, in the example below, you'll see what you should input to e.g. get class output for class 4. + +It allows you to get started quickly. If you want to understand more details behind Activation Maximization or `tf-keras-vis`, make sure to read the rest of this tutorial as well! 🚀 + +``` +''' + Visualizing how layers represent classes with keras-vis Activation Maximization. +''' + +# ============================================= +# Model to be visualized +# ============================================= +import tensorflow +from tensorflow.keras.datasets import cifar10 +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras import activations +import numpy as np + +# Model configuration +img_width, img_height = 32, 32 +batch_size = 250 +no_epochs = 1 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load CIFAR-10 dataset +(input_train, target_train), (input_test, target_test) = cifar10.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 3) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 3) +input_shape = (img_width, img_height, 3) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Convert them into black or white: [0, 1]. 
+input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax', name='visualized_layer')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) + +# Generate generalization metrics +score = model.evaluate(input_test, target_test, verbose=0) +print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + +# ============================================= +# Activation Maximization code +# ============================================= +from tf_keras_vis.activation_maximization import ActivationMaximization +import matplotlib.pyplot as plt + +def loss(output): + return (output[0, 0], output[1, 1], output[2, 2], output[3, 3], output[4, 4], output[5, 5], output[6, 6], output[7, 7], output[8, 8], output[9, 9]) + +def model_modifier(m): + m.layers[-1].activation = tensorflow.keras.activations.linear + +# Initialize Activation Maximization +visualize_activation = ActivationMaximization(model, model_modifier) + +# Generate a random seed for each activation +seed_input = tensorflow.random.uniform((10, 28, 28, 3), 0, 255) + +# Generate activations and convert into images +activations = visualize_activation(loss, seed_input=seed_input, steps=512, input_range=(30,150)) +images = [activation.astype(np.float32) for activation in activations] + +# Define classes +classes = { + 0: 'airplane', + 1: 'automobile', + 2: 'bird', + 3: 'cat', + 4: 'deer', + 5: 'dog', + 6: 'frog', + 7: 'horse', + 8: 'ship', + 9: 'truck' +} + +# Visualize each image +for i in range(0, len(images)): + visualization = images[i] + plt.imshow(visualization, cmap='gray') + plt.title(f'CIFAR10 target = {classes[i]}') + plt.show() +``` + +* * * + +## Why visualize model internals? + +Over the past few years we have seen quite a few AI breakthroughs. With the ascent of deep neural networks since 2012 have come self-driving cars, usage of AI in banking, and so on. However, humans still don't trust AI models entirely - and rightfully so, as for example AI has [a gender bias](https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G). It is really important to visualize the neural network which has so far been a _black box_ (Gehrman et al., 2019), as: + +- Users give up their agency, or autonomy, and control over the processes automated by machine learning. +- Users are forced to trust models that have been shown to be biased. +- Similarly, users have to rely on these same models. + +Fortunately, various approaches for studying the internals of the model with respect to how it works have emerged over the past few years. **Activation Maximization** is one of them, and can be used to generate images of the 'best input' for some class. 
We'll take a look at it intuitively now. + +* * * + +## Activation Maximization explained intuitively + +During the [supervised training process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), your neural network learns by adapting its weights, step by step, based on the error generated by the training data fed forward. + +Suppose you're training a classifier. During training, you thus have fixed model inputs and model outputs for these inputs (since your training samples will always have a corresponding class number), and what is dynamic are the weights. These are adapted continuously (Valverde, 2018) in order to generate a model that performs well. + +Now think the other way around. Suppose that you finished training a classifier. How do you know that it was trained correctly? Firstly, you can take a look at the loss value, but this does not tell you everything. Rather, you would want to see what the model _thinks_ belongs to every _class_. So, suppose you're using the MNIST dataset of handwritten digits, you're interested in e.g. what the model thinks is the best visual representation of class '4'. This is hopefully a visualization that somewhat (or perhaps even better, greatly) resembles an actual 4. + +This is what _activation maximization can do:_ **you visualize what a class in your trained model looks like by inverting the process mentioned above**. This time, the weights and the desired output are constant, and the input will be modified as long as neurons that yield the class are maximized (Valverde, 2018). Since only the best possible image will maximize the activation of the neurons that produce this class number as output, you'll find _what the model thinks it sees when you're talking about some class_. + +In the case of the '4' mentioned above, that would be something like this: + +[![](images/4-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/4-1.png) + +* * * + +## Visualizing Keras model performance: say hi to `tf-keras-vis` + +Fortunately, for engineers who use Keras in their deep learning projects, there is a toolkit out there that adds activation maximization to Keras: `tf-keras-vis` ([link](https://github.com/keisen/tf-keras-vis)). Since it integrates with Keras quite well, this is the toolkit of our choice. As we'll be creating actual models, we'll next take a look at what software dependencies you need to install in order to run the models. Additionally, we'll also take a closer look at installing `tf-keras-vis`, which is slightly more complex than you realize now. + +### What you'll need to run the models + +We'll do two things in this tutorial: + +- Create a Keras model (based on a model we created before); +- Visualize the network's inputs with `tf-keras-vis`. + +We hence need the following dependencies: + +- **TensorFlow 2** or any newer versions. +- **Python**, preferably version 3.8+; +- **[Tf-keras-vis](https://github.com/keisen/tf-keras-vis)**, for generating the input visualizations with activation maximization, adapted from `keras-vis` to TensorFlow 2; +- **Matplotlib**, for converting these visualizations into actual plots. + +### Installing `tf-keras-vis` + +Keras-vis for TensorFlow 2 can easily be installed with `pip install tf-keras-vis`. + +* * * + +## Visualizing Keras CNN MNIST inputs + +Let's now create an example of visualizing inputs with activation maximization! + +... 
a simple one, but a very fun one indeed: we'll be visualizing true inputs for the Keras MNIST CNN created [in another blog post](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/). This means that our code will consist of two parts: + +- The Keras MNIST CNN, which can be replaced by your own Keras code, as long as it has some `model` instance. +- The activation maximization visualization code. + +Open up your file explorer, navigate to some directory, and create a file. You can name it as you like, but `activation_maximization_mnist.py` seems to be a good choice for us today, so if you're uninspired perhaps just choose that one. + +### Keras CNN + +We'll first add the code for the [Keras CNN](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/) that we'll visualize. Since this code was already explained here, and an explanation will only distract us from the actual goal of this blog post, I'd like to refer you to the post if you wish to understand the CNN code in more detail. + +``` +''' + Visualizing how layers represent classes with keras-vis Activation Maximization. +''' + +# ============================================= +# Model to be visualized +# ============================================= +import tensorflow +from tensorflow.keras.datasets import mnist +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Dropout, Flatten +from tensorflow.keras.layers import Conv2D, MaxPooling2D +from tensorflow.keras import activations +import numpy as np + +# Model configuration +img_width, img_height = 28, 28 +batch_size = 250 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 + +# Load MNIST dataset +(input_train, target_train), (input_test, target_test) = mnist.load_data() + +# Reshape data +input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 1) +input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 1) +input_shape = (img_width, img_height, 1) + +# Parse numbers as floats +input_train = input_train.astype('float32') +input_test = input_test.astype('float32') + +# Convert them into black or white: [0, 1]. +input_train = input_train / 255 +input_test = input_test / 255 + +# Convert target vectors to categorical targets +target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes) +target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes) + +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) + +# Compile the model +model.compile(loss=tensorflow.keras.losses.categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + +# Fit data to model +model.fit(input_train, target_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + validation_split=validation_split) +``` + +### Activation Maximization code + +#### Imports + +We next add the imports: the most important one is `ActivationMaximization` from `tf-keras-vis`, in order to apply activation maximization. Note that the `-` characters from the `pip` command have changed into `_` here. Secondly, we import Matplotlib, for actually outputting the visualizations. 
+ +``` +# ============================================= +# Activation Maximization code +# ============================================= +from tf_keras_vis.activation_maximization import ActivationMaximization +import matplotlib.pyplot as plt + +``` + +#### Preparations + +``` +def loss(output): + return (output[0, 0], output[1, 1], output[2, 2], output[3, 3], output[4, 4], output[5, 5], output[6, 6], output[7, 7], output[8, 8], output[9, 9]) + +def model_modifier(m): + m.layers[-1].activation = tensorflow.keras.activations.linear +``` + +Next, we prepare our visualization code, by performing two things: + +- Defining a `loss` function, which describes the outputs that must be used in the backwrds computation. Here, `output[i, c]` is used multiple times, where `i` represents the index of the output, and `c` the class index that must be visualized there. For example, in MNIST, your number 2 has class index `2`. +- Defining a `model_modifier`. This swaps the Softmax activation function in our trained model, which is common for multiclass classification problems, with the linear activation function. Why this is necessary can be seen in the images below. Since you're essentially looking backwards, from outputs and fixed weights to inputs, you need a free path from outputs back to inputs. Softmax disturbs this free path by essentially transforming your model data in intricate ways, which makes the activation maximizations no longer understandable to humans. + +![](images/8-2.png) + +You don't want this - so swap Softmax for Linear. + +#### Visualization + +Finally, we add some code for the visualizations. It does the following: + +- Initializing `ActivationMaximization` with our `model` and the `model_modifier` which swaps Softmax for Linear. +- Generating a random seed for each activation, to ensure that its initialization is not biased. +- Generate the activations with `visualize_activation` and our self-defined `loss` function, the `seed_input` seeds and using 512 steps. The latter is recommended by `tf-keras-vis` docs for "generating good images". +- Finally, converting the activations into images, and visualizing them. + +``` +# Initialize Activation Maximization +visualize_activation = ActivationMaximization(model, model_modifier) + +# Generate a random seed for each activation +seed_input = tensorflow.random.uniform((10, 28, 28, 1), 0, 255) + +# Generate activations and convert into images +activations = visualize_activation(loss, seed_input=seed_input, steps=512) +images = [activation.astype(np.float32) for activation in activations] + +# Visualize each image +for i in range(0, len(images)): + visualization = images[i].reshape(28,28) + plt.imshow(visualization) + plt.title(f'MNIST target = {i}') + plt.show() +``` + +### Class visualizations + +Now, let's take a look what happens when we run our model. Obviously, it will first train for 25 epochs and will likely achieve a very high accuracy in the range of 99%. Subsequently, it will start outputting plots one by one. + +And surprisingly, they are quite interpretable for humans! We can recognize numbers - especially 0, 1, 8 and 9 . Really cool! 
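Conceptually, what `ActivationMaximization` does for each class boils down to gradient _ascent_ on the input: keep the weights fixed, and repeatedly nudge the pixels so that the class score goes up. The sketch below is a minimal illustration of that idea, not the `tf-keras-vis` implementation; the `maximize_class` helper is a name made up for this example, and it assumes a trained `tf.keras` model whose final layer already uses a linear activation.

```
# Minimal gradient-ascent sketch of Activation Maximization (illustration only,
# not the tf-keras-vis implementation). Assumes a trained tf.keras `model`
# whose final layer already has a linear activation.
import tensorflow as tf

def maximize_class(model, class_index, shape=(28, 28, 1), steps=512, step_size=1.):
    # Start from random noise and iteratively change the input, not the weights
    image = tf.Variable(tf.random.uniform((1, *shape), 0, 1))
    for _ in range(steps):
        with tf.GradientTape() as tape:
            class_score = model(image)[:, class_index]
        grads = tape.gradient(class_score, image)
        # Normalize the gradient and take a small step uphill
        image.assign_add(step_size * grads / (tf.norm(grads) + 1e-8))
        # Keep the image within the range the model was trained on
        image.assign(tf.clip_by_value(image, 0., 1.))
    return image[0].numpy()
```

In practice, `tf-keras-vis` adds regularization and input modifiers on top of this basic loop, which is why its outputs look cleaner than what such a naive version would give you.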
😎 + +- [![](images/0.png)](https://www.machinecurve.com/wp-content/uploads/2021/03/0.png) + +- [![](images/1.png)](https://www.machinecurve.com/wp-content/uploads/2021/03/1.png) + +- [![](images/2.png)](https://www.machinecurve.com/wp-content/uploads/2021/03/2.png) + +- [![](images/3.png)](https://www.machinecurve.com/wp-content/uploads/2021/03/3.png) + +- [![](images/4.png)](https://www.machinecurve.com/wp-content/uploads/2021/03/4.png) + +- [![](images/5.png)](https://www.machinecurve.com/wp-content/uploads/2021/03/5.png) + +- [![](images/6.png)](https://www.machinecurve.com/wp-content/uploads/2021/03/6.png) + +- [![](images/7.png)](https://www.machinecurve.com/wp-content/uploads/2021/03/7.png) + +- [![](images/8.png)](https://www.machinecurve.com/wp-content/uploads/2021/03/8.png) + +- [![](images/9.png)](https://www.machinecurve.com/wp-content/uploads/2021/03/9.png) + + +### Why swapping Softmax is necessary + +You already saw what happened when you don't swap Softmax for linear. However, for the sake of completeness, this is what you'll get for every class when you _don't_ swap Softmax: + +- [![](images/0-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/0-2.png) + +- [![](images/1-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/1-2.png) + +- [![](images/2-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/2-2.png) + +- [![](images/3-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/3-2.png) + +- [![](images/4-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/4-2.png) + +- [![](images/5-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/5-2.png) + +- [![](images/6-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/6-2.png) + +- [![](images/7-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/7-2.png) + +- [![](images/8-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/8-2.png) + +- [![](images/9-2.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/9-2.png) + + +* * * + +## Visualizing Keras CNN CIFAR10 inputs + +Let's now see what happens when we perform the same operation with the CIFAR10 dataset. We train the same model, once for 25 epochs and once for 100 epochs, and hope that our visualizations somewhat resemble the objects in the dataset. + +This is a random selection from CIFAR10: + +[![](images/cifar10_images.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/cifar10_images.png) + +This is the code used for CIFAR10 visualization. It is really similar to the MNIST one above, so take a look there for explanations: + +``` +''' + Visualizing how layers represent classes with keras-vis Activation Maximization. 
'''

# =============================================
# Model to be visualized
# =============================================
import tensorflow
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPooling2D
from tensorflow.keras import activations
import numpy as np

# Model configuration
img_width, img_height = 32, 32
batch_size = 250
no_epochs = 25
no_classes = 10
validation_split = 0.2
verbosity = 1

# Load CIFAR-10 dataset
(input_train, target_train), (input_test, target_test) = cifar10.load_data()

# Reshape data
input_train = input_train.reshape(input_train.shape[0], img_width, img_height, 3)
input_test = input_test.reshape(input_test.shape[0], img_width, img_height, 3)
input_shape = (img_width, img_height, 3)

# Parse numbers as floats
input_train = input_train.astype('float32')
input_test = input_test.astype('float32')

# Scale the data into the [0, 1] range
input_train = input_train / 255
input_test = input_test / 255

# Convert target vectors to categorical targets
target_train = tensorflow.keras.utils.to_categorical(target_train, no_classes)
target_test = tensorflow.keras.utils.to_categorical(target_test, no_classes)

# Create the model
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(no_classes, activation='softmax', name='visualized_layer'))

# Compile the model
model.compile(loss=tensorflow.keras.losses.categorical_crossentropy,
              optimizer=tensorflow.keras.optimizers.Adam(),
              metrics=['accuracy'])

# Fit data to model
model.fit(input_train, target_train,
          batch_size=batch_size,
          epochs=no_epochs,
          verbose=verbosity,
          validation_split=validation_split)

# Generate generalization metrics
score = model.evaluate(input_test, target_test, verbose=0)
print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')

# =============================================
# Activation Maximization code
# =============================================
from tf_keras_vis.activation_maximization import ActivationMaximization
import matplotlib.pyplot as plt

def loss(output):
    return (output[0, 0], output[1, 1], output[2, 2], output[3, 3], output[4, 4], output[5, 5], output[6, 6], output[7, 7], output[8, 8], output[9, 9])

def model_modifier(m):
    m.layers[-1].activation = tensorflow.keras.activations.linear

# Initialize Activation Maximization
visualize_activation = ActivationMaximization(model, model_modifier)

# Generate a random seed for each activation (CIFAR10 inputs are 32 x 32 x 3)
seed_input = tensorflow.random.uniform((10, 32, 32, 3), 0, 255)

# Generate activations and convert into images
activations = visualize_activation(loss, seed_input=seed_input, steps=512, input_range=(30,150))
images = [activation.astype(np.float32) for activation in activations]

# Define classes
classes = {
    0: 'airplane',
    1: 'automobile',
    2: 'bird',
    3: 'cat',
    4: 'deer',
    5: 'dog',
    6: 'frog',
    7: 'horse',
    8: 'ship',
    9: 'truck'
}

# Visualize each image
for i in range(0, len(images)):
    visualization = images[i]
    plt.imshow(visualization)
    plt.title(f'CIFAR10 target = {classes[i]}')
    plt.show()
```

### Visualizations at 25 epochs

At 25 epochs, it's possible to detect the
shapes of the objects very vaguely - I think this is especially visible at automobiles, deer, horses and trucks. + +- [![](images/airplane.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/airplane.png) + +- [![](images/automobile.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/automobile.png) + +- [![](images/bird.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/bird.png) + +- [![](images/cat.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/cat.png) + +- [![](images/deer.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/deer.png) + +- [![](images/dog.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/dog.png) + +- [![](images/frog.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/frog.png) + +- [![](images/horse.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/horse.png) + +- [![](images/ship.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/ship.png) + +- [![](images/truck.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/truck.png) + + +### Visualizations at 100 epochs + +At 100 epochs, the model specified above is overfitting quite severely - but nevertheless, these are the visualizations: + +- [![](images/airplane-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/airplane-1.png) + +- [![](images/automobile-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/automobile-1.png) + +- [![](images/bird-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/bird-1.png) + +- [![](images/cat-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/cat-1.png) + +- [![](images/deer-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/deer-1.png) + +- [![](images/dog-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/dog-1.png) + +- [![](images/frog-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/frog-1.png) + +- [![](images/horse-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/horse-1.png) + +- [![](images/ship-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/ship-1.png) + +- [![](images/truck-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/truck-1.png) + + +Primarily, they have become 'sharper' - but not necessarily more detailed. Well, this is already questionable at first for a 32x32 pixel image, but also shows that you should not expect _magic_ to happen, despite the possible advantages of methods like activation maximization. + +* * * + +## Summary + +In this blog post, we studied what Activation Maximization is and how you can visualize the 'best inputs' for your CNN classes with `tf-keras-vis`, i.e., with TensorFlow 2 based Keras. Activation Maximization helps you in understanding what happens within your model, which may help you to find hidden biases that - when removed - really improve the applicability of your machine learning model. + +I hope you've learnt something today - for me, it was really interesting to see how it's possible to visualize the model's black box! 😊 If you have any questions, remarks, or other comments, feel free to leave a comment below 👇 I will try to respond as soon as possible. + +Thanks for reading MachineCurve and happy engineering! 😎 + +* * * + +## References + +GitHub. (2021). _Keisen/tf-keras-vis_. [https://github.com/keisen/tf-keras-vis](https://github.com/keisen/tf-keras-vis) + +Kotikalapudi, Raghavendra and contributors. (2017). Github / keras-vis. 
Retrieved from [https://github.com/raghakot/keras-vis](https://github.com/raghakot/keras-vis) + +Valverde, J. M. (2018, June 18). Introduction to Activation Maximization and implementation in Tensorflow. Retrieved from [http://laid.delanover.com/introduction-to-activation-maximization-and-implementation-in-tensorflow/](http://laid.delanover.com/introduction-to-activation-maximization-and-implementation-in-tensorflow/) + +Gehrmann, S., Strobelt, H., Kruger, R., Pfister, H., & Rush, A. M. (2019). Visual Interaction with Deep Learning Models through Collaborative Semantic Inference. _IEEE Transactions on Visualization and Computer Graphics_, 1-1. [doi:10.1109/tvcg.2019.2934595](https://arxiv.org/abs/1907.10739) diff --git a/visualizing-keras-neural-networks-with-net2vis-and-docker.md b/visualizing-keras-neural-networks-with-net2vis-and-docker.md new file mode 100644 index 0000000..8831b88 --- /dev/null +++ b/visualizing-keras-neural-networks-with-net2vis-and-docker.md @@ -0,0 +1,360 @@ +--- +title: "Visualizing Keras neural networks with Net2Vis and Docker" +date: "2020-01-07" +categories: + - "deep-learning" + - "frameworks" +tags: + - "convolutional-neural-networks" + - "keras" + - "neural-network" + - "visualization" +--- + +Visualizing the structure of your neural network is quite useful for publications, such as papers and blogs. + +Today, various tools exist for generating these visualizations - allowing engineers and researchers to generate them either by hand, or even (partially) automated. + +Net2Vis is one such tool: recognizing that current tools have certain flaws, scholars at a German university designed a web application which allows you to visualize Keras-based neural networks automatically. + +In this blog post, we'll take a look at Net2Vis. Firstly, we'll inspect the challenges of current tools in more detail, followed by the introduction of Net2Vis. We then suggest a different way of installing it, by using our Docker-based installation process, saving you quite some time on installing dependencies. Subsequently, we'll talk you through our experience with Net2Vis - and show you what it's capable of. + +Are you ready? All right! Let's go :) + +* * * + +\[toc\] + +* * * + +## What is Net2Vis? + +Let's briefly take a look at what Net2Vis is precisely first :) + +In the scientific field of deep learning, many scholars are writing academic papers about their findings. The same is true for practical findings in industry journals and on industry blogs. Visualizing neural networks is a key element in those reports, as people often appreciate visual structures over large amounts of text. + +However, when looking at the available tools and techniques for visualizing neural networks, Bäuerle & Ropinski (2019) found some key insights about the state of the art of neural network visualization: + +- Most of the time, neural networks are visualized by hand, which consumes a lot of time and induces errors in (even _published_!) papers. +- This time is better spent on improving the model through tuning hyperparameters or training result evaluation. +- Often, print media requires horizontal visualizations, maintaining the natural flow of reading, while still conveying all important information. +- There are a lot of tools available for visualizing neural networks, like [Keras plot\_model](https://www.machinecurve.com/index.php/2019/10/07/how-to-visualize-a-model-with-keras/), but they either do not convey enough information or produce vertical visualizations. 
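To make that last point a bit more concrete, this is roughly what the `plot_model` route mentioned in the list above looks like. It is a minimal sketch - assuming a TensorFlow 2.x based Keras install with `pydot` and `graphviz` available - and it produces exactly the kind of top-down, vertical diagram that is hard to fit into print:

```
# Minimal sketch of the plot_model route, assuming TensorFlow 2.x
# with pydot and graphviz installed.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.utils import plot_model

# A small example ConvNet
model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(10, activation='softmax')
])

# Writes a top-down (vertical) diagram of the architecture to disk
plot_model(model, to_file='model.png', show_shapes=True)
```

Useful for a quick check, but not something you would directly paste into a paper.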
+ +Hence, convinced that the current tool landscape is suboptimal, they set out and created [Net2Vis](https://github.com/viscom-ulm/Net2Vis), a web application for automatically visualizing your Keras neural networks. + +[![](images/image-4-1024x568.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-4.png) + +* * * + +## Using Net2Vis online + +If you don't want to install Net2Vis at all, you can also use it online at the authors' website: [https://viscom.net2vis.uni-ulm.de](https://viscom.net2vis.uni-ulm.de) :) + +* * * + +## Installing Net2Vis + +Installing Net2Vis is quite easy if you use a Mac or a Linux based machine, but it's more difficult when using Windows. However, don't worry, because I created a Docker based variant which also makes using Net2Vis on Windows easy. So, for installing Net2Vis, you can choose from one of two options: + +- Install it with Docker, which means that it runs inside containers. It's however still accessible from your host machine, e.g. your computer. +- Install it manually, which requires you to follow the authors' installation steps. + +### Easy way: install with Docker + +If you wish to install Net2Vis with Docker, keep on reading. + +#### What is Docker? + +Docker's slogan is as follows: + +> Securely build, share and run any application, anywhere +> +> [https://www.docker.com/](https://www.docker.com/) + +This sounds good :) + +Docker is software which allows you to run apps in a virtualized-ish fashion, but not in the style of traditional Virtual Machines as we know them. Rather, Docker runs containerized apps directly in Docker's container engine, on your host machine. This allows you to _still_ use containerized apps, _still_ run apps in an isolated and atomic fashion, _still_ benefit from the benefits of Linux - even on Windows, _without_ having the need to install massive operating systems in virtual machines, consuming a lot of disk space. + +#### What you'll need to run Net2Vis + +There's a couple of things that you must install if you wish to install Net2Vis with Docker: + +- Firstly, you'll need **Docker**. This allows you to run the Net2Vis backend and frontend in two separate containers. + - You can find the installation instructions here: [https://docs.docker.com/install/](https://docs.docker.com/install/) +- Secondly, you'll need **Docker Compose**. Compose allows you to create one file in which you specify the orchestration of separate containers, i.e. how they must be started and what their interdependencies are. This way, you can make your app start with just one command. + - You can find the installation instructions here: [https://docs.docker.com/compose/install](https://docs.docker.com/compose/install) +- Finally, you'll need **Git**, as you'll need to clone my repository from GitHub in order to run it from your local machine. + - You can find the installation instructions here: [https://git-scm.com/book/en/v2/Getting-Started-Installing-Git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) + +#### Installation procedure + +As per the [GitHub repository](https://github.com/christianversloot/net2vis-docker), this is the installation procedure: + +1. Clone the repository: `git clone https://github.com/christianversloot/net2vis-docker.git` +2. Open a terminal, `cd` into the folder where `net2vis-docker` has been unpacked, and run it with Docker compose: + 1. 
`docker-compose up` if you wish to run it in the front so that you can see messages easily, with the downside that it shuts down when you close the terminal; + 2. `docker-compose up -d` if you wish to run it in the background so that it keeps running when you close the terminal, with the downside that you'll have to run `docker logs ` if you wish to see what happens inside. + +If you choose to run it in the frontend, the backend will start first, followed by the frontend. The first time, it will also build the Docker containers :) Startup looks like this: + +``` +Creating backend ... done Recreating frontend ... done Attaching to backend, frontend +backend | * Serving Flask app "server" (lazy loading) +backend | * Environment: production +backend | WARNING: This is a development server. Do not use it in a production deployment. +backend | Use a production WSGI server instead. +backend | * Debug mode: on +backend | * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit) +backend | * Restarting with stat +backend | * Debugger is active! +backend | * Debugger PIN: 181-171-933 +frontend | +frontend | > netviz@0.1.0 start /Net2Vis/net2vis +frontend | > npm-run-all -p watch-css start-js +frontend | +frontend | +frontend | > netviz@0.1.0 start-js /Net2Vis/net2vis +frontend | > react-scripts start +frontend | +frontend | +frontend | > netviz@0.1.0 watch-css /Net2Vis/net2vis +frontend | > npm run build-css && node-sass-chokidar src/ -o src/ --watch --recursive +frontend | +frontend | +frontend | > netviz@0.1.0 build-css /Net2Vis/net2vis +frontend | > node-sass-chokidar src/ -o src/ +frontend | +frontend | Rendering Complete, saving .css file... +frontend | Wrote CSS to /Net2Vis/net2vis/src/styles/index.css +frontend | Wrote 1 CSS files to /Net2Vis/net2vis/src/ +frontend | => changed: /Net2Vis/net2vis/src/styles/index.scss +frontend | Rendering Complete, saving .css file... +frontend | Wrote CSS to /Net2Vis/net2vis/src/styles/index.css +frontend | [HPM] Proxy created: /api -> http://host.docker.internal:5000 +frontend | Starting the development server... +frontend | +frontend | Browserslist: caniuse-lite is outdated. Please run next command `npm update` +frontend | Compiled successfully! +frontend | +frontend | You can now view netviz in the browser. +frontend | +frontend | Local: http://localhost:3000/ +frontend | On Your Network: http://192.168.96.3:3000/ +frontend | +frontend | Note that the development build is not optimized. +frontend | To create a production build, use npm run build. +frontend | +frontend | Compiling... +frontend | Compiled successfully! +``` + +### Original way: install on your host machine + +Of course, you might also wish to omit installing Net2Vis with Docker - for example, because you have a Mac or Linux based system, which enables you to install e.g. Cairo quite easily. In that case, please go [to the original Net2Vis repository and follow its installation instructions](https://github.com/viscom-ulm/Net2Vis). + +* * * + +## How does Net2Vis work? + +When Net2Vis is running, you can access it from http://localhost:3000. When navigating here with your web browser, a blue and white web application will open with a code editor on the left, a settings panel on the right and a visualization in the middle of your screen. + +Let's now walk through each of these :) + +[![](images/image-4-1024x568.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-4.png) + +### The model we'll be visualizing today + +But first - the model we'll be visualizing today. 
It's a simple ConvNet with two Conv2D layers, Max Pooling and [Dropout](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/). Do note that Net2Vis has a particular structure, which in essence boils down to this: + +``` +def get_model(): + model = .... + ..... + return model +``` + +That is, it will look in your code for a definition called `get_model` which should return a Keras model. Both the Sequential and the Functional API are supported. This also means that in essence, Net2Vis doesn't really care about data preprocessing, and you can skip these parts of your code. + +You really only have to add the `model` definition, the layers, the necessary variables as used in the layers, and finally return the `model`. + +Take this example of the ConvNet to be visualized as follows: + +``` +# You can freely modify this file. +# However, you need to have a function that is named get_model and returns a Keras Model. +import keras as k +from keras.models import Sequential +from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense + +def get_model(): + input_shape = (28, 28, 1) + no_classes = 10 + model = Sequential() + model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) + model.add(MaxPooling2D(pool_size=(2, 2))) + model.add(Dropout(0.25)) + model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) + model.add(MaxPooling2D(pool_size=(2, 2))) + model.add(Dropout(0.25)) + model.add(Flatten()) + model.add(Dense(256, activation='relu')) + model.add(Dense(no_classes, activation='softmax')) + + return model +``` + +If you add it _(do note that the code editor can be sensitive at times; it's best to write your code in a regular editor, after which you select the code in Net2Vis, delete it entirely, and ctrl+c/ctrl+v your own code in the empty code editor)_, the default visualization should immediately change into something that resembles this: + +[![](images/image-7-1024x261.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-7.png) + +That's good, because this is the ConvNet we just defined 👇 + +[![](images/image-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-1.png) + +### Interpreting visualizations + +However, the default visualization doesn't provide you with all the details - for example, it doesn't show you which layer is being visualized. + +Fortunately, Net2Vis shows a legend immediately below the visualization: + +![](images/image-8.png) + +Indeed, the colors match the model code we just defined, so we've got the correct one visualized :) + +### Configuring the plots + +[![](images/image-3-150x150.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-3.png) + +On the right side of the web application, there is a large amount of configuration options. Let's take a look at these. + +By default, Net2Vis determines the height of a particular box by the _spatial resolution_ of the layer, whereas the width is determined by the _number of feature channels_ (Bäuerle & Ropinski, 2019). Note that the resolution and number of feature channels of Conv and Dense layers do not influence each other, given the fact that they are incomparable in terms of these determinants. + +#### Plot shape + +It's possible to change these **minimum and maximum visualization heights**. By default, the neural network is 30 pixels high at minimum and 100 at maximum, but let's take a look at what the model looks like when we change these values. 
+ +For example, let's set the maximum height to 350 and the minimum height to 100: + +[![](images/image-9.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-9.png) + +It's also possible to do this with **minimum and maximum visualization widths** **of the individual layers** - say, we change them from 20/80 into 50/100: + +[![](images/image-10-1024x191.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-10.png) + +Increasing or decreasing the **horizontal and vertical spacing** between individual layers is also possible. For example, here, the horizontal spacing was changed from 20 into 50: + +[![](images/image-11-1024x204.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-11.png) + +(note that vertical spacing becomes useful when you have models that branch at one point in time and, possibly, join together later) + +These options give you full control over the shape of your visualization. + +#### Default plot color + +But there is more. Net2Vis also allows you to adapt the **color** of the visualizations, and even provide a **colorless** version for people with monochromatic vision, which means that people can only see in grayscale (Bäuerle & Ropinski, 2019). + +By selecting the color selector on the right, it's possible to select one out of three options: + +- **Palette mode**, which provides a 17-color palette which is used to visualize the layers (Bäuerle & Ropinski, 2019). +- **Interpolation mode**, which uses "farthest point sampling (...) \[to find\] unused colors in hsv color space" (Bäuerle & Ropinski, 2019). +- **Color blindness mode**, which also provides a palette, but instead provides only 8 colors to ensure that color blind people can sufficiently distinguish between layer colors (Bäuerle & Ropinski, 2019). + +- [![](images/graph-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/graph-1.png) + +- [![](images/graph.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/graph.png) + +- [![](images/grap1h.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/grap1h.png) + + +_Palette (top), interpolation (middle) and color blindness mode (bottom)._ + +By selecting 'Disable color', you activate monochromatic mode: + +[![](images/image-12.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-12.png) + +#### Custom plot color + +However, it doesn't end here: users can select their custom colors as well :) + +By clicking on a layer, the palette for that particular layer type can be changed. That is, when selecting a color (one can even set a hex value manually!), it's possible to change the layer style - the other layers of the type change automatically when the color is changed. + +So, beyond space, the user can also fully control the colors of their neural network visualizations with Net2Vis. I love it :) + +[![](images/image-13-1024x375.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-13.png) + +#### Plot labels and extra details + +[![](images/image-5.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-5.png) + +And we're still not there yet :) + +Net2Vis has additional functionality with respect to labels and additional details. As explained by Bäuerle & Ropinski in their paper (Bäuerle & Ropinski, 2019), the goal of Net2Vis is to produce models that are as sparsely and as understandably as possible. 
As a result, the authors have decided to keep as much as possible away from the default visualization - which essentially just distinguishes between layers and shows spatial resolutions and number of kernels used. + +However, it's possible to: + +- Display or hide the dimensions of the input of a particular layer; +- Display or hide the number of features learnt by some particular layer; +- Replace split layers; +- Show the bounding boxes of the input and output samples that must be fed to the models; +- Disable colors as a whole, as we just saw before. + +For example, by selecting the _dimensions label_, the spatial dimensions of the layer inputs start popping up. For example, in our ConvNet, we have a 28 x 28 pixels input which is downsampled to 5 x 5 before it's flattened (green) and fed to the Dense layers (red). This is reminiscent of the MNIST dataset, and indeed, this was the dataset which we trained our model with :) + +![](images/image-14-1024x174.png) + +By selecting the _features label_, the number of features learnt per layer is shown. As we can see, the first Conv2D learns 32 filters, and the second one learns 64. The first Dense layer learns 256 features, while the second one learns 10 - and hey, this is exactly the number of classes present within our MNIST dataset, which suggests a multiclass / categorical data classification problem and a Softmax activation function. + +![](images/image-16.png) + +![](images/image-17.png) + +### Grouping layers + +[![](images/image-6.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-6.png) + +Before configuring layer shape and colors, however, you may wish to take a look at the functionality for grouping layers together. + +For example, in the visualizations above, it's clear that there's a block that repeats twice: the Conv2D-MaxPooling2D-Dropout block. I always tend to call these 'convolutional blocks', as it's a nice summary, and it's possible to visualize these with Net2Vis as well. + +This is especially useful if you have to visualize very large networks, and don't have so much space available in e.g. your paper or on your website. + +By clicking "automatically group", Net2Vis will attempt to find blocks by checking the sequences available within the graph. For example, it will find the convolutional blocks we identified in the plot manually: + +[![](images/image-18-1024x562.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-18.png) + +What's best is that it automatically adapts the legend, and adds the group to it: purple = blue + yellow + brown, i.e. group = Conv2D + MaxPooling2D + Dropout. Great :) + +### Exporting to PDF and SVG + +By clicking the download button in the blue top bar, your visualization will be downloaded to your computer: + +[![](images/image-2.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-2.png) + +It'll download a ZIP file containing: + +- Your visualization, in PDF and SVG format; +- The corresponding legend, also in PDF and SVG format. + +* * * + +## Summary + +In this blog post, we've shown you what Net2Vis is and how it can be used to visualize Keras models easily and consistently. We took a look at the features available within this web application, which was designed by Bäuerle & Ropinski (2019). 
+ +What's more, given the relative complexity of installing Net2Vis on a Windows machine, we made available a Docker based version called `net2vis-docker`, which takes away the complex installation steps and installs Net2Vis for you, granted that you have Docker and Docker Compose installed on your system, as well as Git. + +I hope this blog post has been useful to your machine learning projects. If it has, I'd love to know how, so in that case please leave a comment in the comments box below! 👇 Please do too if you have questions or remarks :) + +Thanks for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Bäuerle, A., & Ropinski, T. (2019). [Net2Vis: Transforming Deep Convolutional Networks into Publication-Ready Visualizations](https://arxiv.org/abs/1902.04394). arXiv preprint arXiv:1902.04394. + +Visual Computing Group (Ulm University). (n.d.). Net2Vis. Retrieved from [https://github.com/viscom-ulm/Net2Vis](https://github.com/viscom-ulm/Net2Vis) + +MachineCurve. (n.d.). net2vis-docker. Retrieved from [https://github.com/christianversloot/net2vis-docker](https://github.com/christianversloot/net2vis-docker) + +Docker. (n.d.). Enterprise Container Platform. Retrieved from [https://www.docker.com/](https://www.docker.com/) diff --git a/visualizing-transformer-behavior-with-ecco.md b/visualizing-transformer-behavior-with-ecco.md new file mode 100644 index 0000000..79e787c --- /dev/null +++ b/visualizing-transformer-behavior-with-ecco.md @@ -0,0 +1,307 @@ +--- +title: "Visualizing Transformer behavior with Ecco" +date: "2021-01-19" +categories: + - "frameworks" +tags: + - "ecco" + - "model-interpretability" + - "transformer" + - "transformers" + - "visualization" +--- + +These days, Transformer based architectures are taking the world of Natural Language Processing by storm. What's more, even more recently, they have also blended the fields of NLP and Computer Vision - with approaches like the Visual Transformer and the DeIT architecture. + +In other words, we can expect many new developments emerge from these fields in the years to come. + +But as with any Machine Learning field, it is not only important to know that your model works - but you must also ensure that you know _**why**_ it works. For example, if you are building a binary classifier that classifies between dogs and cats, you cannot be sure that it's the animal it decides on whether you have a snowy background with many of the dogs, while using indoor pictures with the cats. + +Fortunately, these days, many Machine Learning practitioners build great stuff _and_ release it as open source packages. That's why we can say hello to **Ecco**, which was created by [Jay Alammar](https://jalammar.github.io/). It can be used for interpreting the outputs of (language based) Transformers. Built on top of HuggingFace Transformers and using PyTorch, it will be useful to a wide audience. + +![](images/ezgif.com-gif-maker-4.gif) + +Visualizing the outputs of a Transformer 🤗 Really cool! + +Currently, two methods for visualizing how a Transformer works are supported. Relatively similar to the [saliency maps](https://www.machinecurve.com/index.php/2019/11/25/visualizing-keras-cnn-attention-saliency-maps/) that we know from ConvNets, Ecco can compute the importance of input tokens for the predicted output token, something known as **input saliency**. In addition, it's capable of visualizing how the neurons in Transformer networks activate using **neuron activation**. 
In this tutorial, we will be looking at Ecco in more detail. After reading it, you will...

- Have a bit of background knowledge about Transformers and Model Interpretability.
- Know what Ecco is all about.
- Have built a Transformer visualization using input saliency and neuron activations.

Let's go! 🤗🚀

* * *

\[toc\]

* * *

## Visualizing Transformer models: summary and code examples

- Transformer models are taking the world by storm. With the emergence of models like BERT, GPT-2 and GPT-3, the field of NLP is making a lot of progress. In fact, a few breakthroughs are spilling over into the world of Computer Vision these days, with the emergence of Transformers there as well.
- As with any deep learning model, interpretability is an important thing for model evaluation. If we don't understand the model, how can we ensure that it can be used without adverse events occurring?
- [Ecco](https://github.com/jalammar/ecco) is a library that can be used for visualizing the behavior of your Transformer model. In this tutorial, we'll take a look at what Ecco is, how it works and how it can be used.
- First, we take a look at two ready-to-use code examples for using Ecco for generating input saliency and neuron activation visualizations.

### Visualizing the importance of the input tokens when predicting an output token

```
import ecco

# Load pretrained DistilGPT2 and capture neural activations
lm = ecco.from_pretrained('distilgpt2', activations=True)

# Input text
text = "Frameworks for Machine Learning include: 1. TensorFlow\n2. PyTorch\n3.Scikit-learn\n4."

# Generate 35 tokens to complete the input text.
output = lm.generate(text, generate=35, do_sample=True)

# To view the input saliency
output.saliency()
```

### Visualizing what's going on inside the Transformer during prediction

```
import ecco

# Load pretrained DistilGPT2 and capture neural activations
lm = ecco.from_pretrained('distilgpt2', activations=True)

# Input text
text = "Frameworks for Machine Learning include: 1. TensorFlow\n2. PyTorch\n3.Scikit-learn\n4."

# Generate 35 tokens to complete the input text.
output = lm.generate(text, generate=35, do_sample=True)

# Perform NMF
nmf = output.run_nmf(n_components=10)
nmf.explore()
```

* * *

## Transformers and Model Interpretability

In the field of Machine Learning, **Transformer architectures** are currently very prominent. Having been around since 2017, when a breakthrough paper by Vaswani et al. appeared, Transformers showed the field of NLP - classically relying on recurrent neural networks - that recurrent segments are not necessary for achieving state-of-the-art performance on a variety of language tasks.

Since then, we have seen many Transformer based architectures dominate progress in Natural Language Processing, among which are BERT, GPT-2 and GPT-3. More recently, Transformers have even been adapted to the computer vision domain. With the Visual Transformer and the DeIT Transformer, it even becomes possible to use Conv-free classifiers with near state-of-the-art performance on vision tasks. They even work in the area in between these fields: with DALL-E, we can generate images based on textual inputs.

The original Transformer architecture works by combining an **encoder segment** and a **decoder segment**. Using a technique called multi-head attention many times over, inputs can be converted into a hidden representation and subsequently into an output token. This way, we can e.g. perform Neural Machine Translation.
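As a quick refresher, the core operation inside each attention head is scaled dot-product attention, which Vaswani et al. (2017) define as:

\[latex\]\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V\[/latex\]

Here, \[latex\]Q\[/latex\], \[latex\]K\[/latex\] and \[latex\]V\[/latex\] are the queries, keys and values derived from the input tokens, and \[latex\]d_k\[/latex\] is the dimensionality of the keys. Multi-head attention simply runs this operation several times in parallel with different learned projections and concatenates the results.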
Even Google Search is now primarily running on BERT for performing its natural language understanding.

Speaking about BERT, which is an extension to the original work by Vaswani et al., we must say that work on Transformers has recently split into an encoder vs. decoder battle. Despite some approaches which attempt to bridge the gap (e.g. BART), the GPT camp argues that models must be autoregressive and hence use the decoder segment when performing language generation, whereas the BERT camp argues that the _masked_ attention segment in the decoder unnecessarily limits the model when used for language understanding.

Let's see where things are going with Transformers in the months and years ahead.

![](images/Diagram-32-1-1024x991.png)

The original Transformer architecture, as proposed by Vaswani et al. (2017)

Another field in Machine Learning that has been gaining a lot of traction over the past few years is the field of **Model Interpretability.** Machine learning models have drawn a lot of attention, and this attention was both positive and negative.

While positive attention focused on the possible benefits of Machine Learning and how it could achieve a significant impact within company operations (by finding patterns that previously remained hidden in large datasets available within your organization), negative attention tended to argue that using a Machine Learning model is equal to using a black box.

And there lies truth in both stories. While people can be overconfident about AI's impact, it's not exactly a black box, either: today, [many model visualization approaches](https://www.machinecurve.com/index.php/2019/12/03/visualize-keras-models-overview-of-visualization-methods-tools/) have been made available that can give insight into _why_ a model performs in a certain way.

Today, this is even true for Transformer based models. Let's take a look at Ecco!

* * *

## Introducing Ecco

![](images/68747470733a2f2f61722e706567672e696f2f696d672f6563636f2d6c6f676f2d772d3830302e706e67.png)

Wouldn't it be cool if we could visualize what is happening in a Transformer based language model?

It would greatly help our understanding of such models and allow us to track spurious predictions in order to find out what is going wrong.

Let's introduce **[Ecco](https://github.com/jalammar/ecco)**, an open source (BSD-3) library that can be used on top of HuggingFace Transformers and PyTorch for understanding your NLP model. Created by Jay Alammar, an ML research engineer currently focused on NLP, it's a really cool library to work with.

> Ecco is a python library for explaining Natural Language Processing models using interactive visualizations.
>
> Alammar (2021)

Let's take a look at it quickly!

* * *

## Visualizing your NLP model with Ecco

Currently, Ecco supports two methods for visualizing your language model:

- Using **input saliency**, we can determine the importance of any input token for generating a prediction. In other words, we can look back and see what words have contributed to the word that has just been predicted. This way, we can visualize interrelationships between words in sentences.
- Using **neuron activations**, we can identify the parts of the densely-connected (feedforward) layers in a Transformer architecture that fire when generating predictions for new tokens.

While it works with all kinds of language models in principle, it works best when the model is autoregressive (Alammar, 2021).
+ +### Installing Ecco + +Installing Ecco is really easy. There are two prerequisites to installing Ecco: you must have PyTorch installed ([howto here](https://pytorch.org/get-started/locally/)) and also HuggingFace Transformers ([here](https://github.com/huggingface/transformers)) is required, although Ecco seems to install the latter during the installation process if it's not available on your system. + +Performing the installation itself then equals the execution of just one command: + +``` +pip install ecco +``` + +* * * + +## Visualizing token importance: input saliency + +If you've been a frequent reader of this website, you know that we have written another article about [visualizing attention of a ConvNet](https://www.machinecurve.com/index.php/2019/11/25/visualizing-keras-cnn-attention-saliency-maps/) with saliency maps. Such maps were defined as follows: + +> In computer vision, a saliency map is an image that shows each pixel‘s unique quality. +> +> Wikipedia (2015) + +Indeed, we could use them to visualize the parts of an input image that were most important to a Convolutional Neural Network in generating a prediction. Fortunately, in this case, it's the frog, and not the background! + +![](images/frog-2.png) + +### How input saliency works + +If you are familiar with how a neural network works, understanding the workings of input saliency is not difficult. First recall that training a neural network involves a forward pass, where an input token is fed forward through the model - i.e. the Transformer - after which an output prediction is generated. + +This forward pass results in a loss value which is then used for computing gradients with backpropagation and subsequent optimization with gradient descent or another optimizer. In other words, the forward pass is followed by a backward pass. + +![](images/Slaiency.png) + +Forward and backward pass, i.e. optimizing a neural network + +With input saliency, the question we're asking is exactly the opposite one. In the case above we want to know how the input needs to change with respect to a desired change in the outputs (in our case, the difference between predicted and expected values, or the loss). Input saliency, on the other hand, means understanding the change in _output_ with respect to a change in _input_ values. + +Or, in plainer English, if we know which input value changes the output the most, we know the rudimentary pixel (or in this case, the token) value which contributes most significantly. After multiplying the resulting gradient with the input embedding of the token, and taking the L2 norm, we know the input salience of every particular token. + +![](images/Slaiency-1.png) + +With input saliency, the process is in the opposite direction. + +### Example text + +Generating the **input saliency** map for input tokens of a text is really easy. With Ecco, we can use any pretrained language model (note again that in the current version autoregressive models are the primary focus) available within HuggingFace Transformers. + +For example, let's allow the `distilgpt2` model to further predict based on the following text: + +_Frameworks for Machine Learning include:_ + +1. _TensorFlow_ +2. _PyTorch_ +3. _Scikit-learn_ + +### Code + +We can do so with the following code, which loads the pretrained `distilgpt2` model into Ecco using `transformers`, performs generation of 35 more tokens using the input text, and views input saliency. 
Run this in a [Jupyter Notebook](https://www.machinecurve.com/index.php/2020/10/07/easy-install-of-jupyter-notebook-with-tensorflow-and-docker/):

```
import ecco

# Load pretrained DistilGPT2 and capture neural activations
lm = ecco.from_pretrained('distilgpt2', activations=True)

# Input text
text = "Frameworks for Machine Learning include: 1. TensorFlow\n2. PyTorch\n3.Scikit-learn\n4."

# Generate 35 tokens to complete the input text.
output = lm.generate(text, generate=35, do_sample=True)

# To view the input saliency
output.saliency()
```

### Output

You will then get the following output:

![](images/image-7.png)

It shows the output of the DistilGPT2 Transformer as well as the saliency for generating the first token, which is underlined as Py. GPT2 seems to have some difficulty recognizing that PyTorch was already mentioned before, and funnily even thinks that PyTorch and TensorFlow will merge in the future, but well... that's not in scope for today's article.

You can use your mouse to see which previous tokens were most important for generating the one under inspection.

![](images/ezgif.com-gif-maker-4.gif)

### Observations

Some observations:

- The words **Frameworks** and **Machine Learning** keep getting highlighted all the time, meaning that GPT-2 understands that all these libraries are related to Machine Learning.
- When the token **Torch** is highlighted, the model understands that it is highly related to _Py_.
- The same is true for **Flow** in this context, for which _Tensor_ is very salient.
- All future references of **Scikit** mostly refer back to the original reference of that library.

It's really cool that we can analyze a Transformer model in such a visual way!

* * *

## Visualizing neuron activations

While input saliency tells you something about the _external factors_ of your Transformer model, **neuron activations** tell you what is happening inside the neural network. More specifically, it tells you what is happening in the feedforward subsegments of each Transformer block.

This functionality essentially 'groups' together inputs - in components - that cause the same regions in the Dense classifier to fire. This can provide interesting insights about the behavior of your Transformer model.

Let's take a look at how this works in more detail.

### Code

We can visualize the neuron activations in the following way. Here, `n_components` describes the number of groups that we want to construct. Note that we first generate an output sequence with `lm.generate`, because the NMF components are computed over the activations captured during that generation step.

```
import ecco

# Load pretrained DistilGPT2 and capture neural activations
lm = ecco.from_pretrained('distilgpt2', activations=True)

# Input text
text = "Frameworks for Machine Learning include: 1. TensorFlow\n2. PyTorch\n3.Scikit-learn\n4."

# Generate 35 tokens to complete the input text.
output = lm.generate(text, generate=35, do_sample=True)

# Perform NMF
nmf = output.run_nmf(n_components=10)
nmf.explore()
```

### Observations

After running the DistilGPT2 model, the output of the neuron activations in that particular case looks like this:

![](images/image-8.png)

Output of Transformer neuron activations

We can derive some interesting insights here:

1. In the predicted part (i.e. after _4._), all numbers belong to the same group. The same is true for the numbers in the provided part. DistilGPT2 for some reason does not consider these to be part of the same sequence.
2. In many cases, the capital letters of a word in the list are estimated to belong to the same group, and the same is true for the first few tokens after the capital letter.
3.
It's difficult to see a pattern in the newline characters `\n`. + +As you can see, breaking into the black box is now also possible with Transformer models! 😎 + +* * * + +## Summary + +Transformers are taking the world of NLP by storm, and have also been introduced to the field of Computer Vision recently. However, visualizing the behavior of Transformer models is difficult, and is in its infancy today, while the need for interpretable models is high. + +In this tutorial, we introduced Ecco, a new library for visualizing the behavior of your Transformer model. With Ecco, we can visualize Transformer behavior in two ways. The first is **input saliency**, or the importance of the input tokens in predicting a particular output value. The second is **neuron activations**, which groups the neurons in the feedforward segment(s) of your Transformer into a predefined amount of components, allowing you to see _within_ the model as well. + +Using the code examples, we have also demonstrated how Ecco can be used for visualizing your Transformer language model. + +[Ask a question](https://www.machinecurve.com/index.php/add-machine-learning-question/) + +I hope that you have learned something from this tutorial! 😎 If you did, please feel free to leave a message in the comments section. Please do the same if you have any questions, or click the **Ask Questions** button on the right. + +Thank you for reading MachineCurve today and happy engineering! + +* * * + +## References + +Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). [Attention is all you need](https://arxiv.org/abs/1706.03762). _Advances in neural information processing systems_, _30_, 5998-6008. + +Alammar, J. (2021). _Jalammar/ecco_. GitHub. [https://github.com/jalammar/ecco](https://github.com/jalammar/ecco) + +Wikipedia. (2015, December 3). Saliency map. Retrieved from [https://en.wikipedia.org/wiki/Saliency\_map](https://en.wikipedia.org/wiki/Saliency_map) diff --git a/visualizing-your-neural-network-with-netron.md b/visualizing-your-neural-network-with-netron.md new file mode 100644 index 0000000..57b250b --- /dev/null +++ b/visualizing-your-neural-network-with-netron.md @@ -0,0 +1,167 @@ +--- +title: "Visualizing your Neural Network with Netron" +date: "2020-02-27" +categories: + - "deep-learning" + - "frameworks" +tags: + - "architecture" + - "deep-learning" + - "deep-neural-network" + - "machine-learning" + - "model" + - "neural-network" + - "visualization" +--- + +Neural networks, and especially the deep ones, have achieved many state-of-the-art results over the past few years. Many scholars and practitioners have used them to create cool tools and new techniques, which are used in various real-world scenarios today. + +Let's say that you've identified a new type of architecture that works really well. Now, you wish to communicate about this architecture. How do you do so? And how can you visualize your neural network architecture easily - or inspect it, if you will? + +Netron is such a tool. Being a viewer for neural networks and machine learning models, it generates beautiful visualizations that you can use to clearly communicate the structure of your neural network. What's more, using the tool, you can explore your models in great detail. And best of all, it's a cross-platform tool - which also means Windows and Mac support - and works with a wide range of machine learning frameworks and model formats. + +In this blog post, we'll take a look at Netron. 
First, we'll discuss what it is and what frameworks and model formats it supports. Then, we move on to an example with Keras: we show you how to generate a Netron-ready model output, and how to visualize and inspect it subsequently. + +Let's take a look! :) + +* * * + +\[toc\] + +* * * + +## Introducing Netron + +![](images/image-8-135x300.png) + +Let's now take a look at Netron. Created by Lutz Roeder - from now on cited as Roeder (2020) - is a cross-platform tool for visualizing deep learning models, specifically deep neural networks. + +Or as they describe their tool: **Netron is a viewer for neural network, deep learning and machine learning models** (Roeder, 2020). + +It can generate beautiful visualizations of your neural network and supports a wide range of frameworks and formats. A slice from such a visualization can be seen on the right, and was generated from a Keras model. + +Let's now take a look at the frameworks and formats that are supported by Netron. Then, we'll show you how to install the tool - which is really easy, and given the fact that it's cross-platform, it's supported for Windows and Mac machines as well. + +Then, we continue by providing an example for Keras. + +### What frameworks and formats does Netron support? + +As you can see, Netron supports a wide range of frameworks - and offers experimental support for a wide range of others (Roeder, 2020) :) + +| Framework | Supported? | File types | +| --- | --- | --- | +| ONNX | Supported | .onnx, .pb, .pbtxt | +| Keras | Supported | .h5, .keras | +| Core ML | Supported | .mlmodel | +| Caffe | Supported | .caffemodel, .prototxt | +| Caffe2 | Supported | predict\_net.pb, predict\_net.pbtxt | +| Darknet | Supported | .cfg | +| MXNet | Supported | .model, -symbol.json | +| ncnn | Supported | .param | +| TensorFlow Lite | Supported | .tflite | +| TorchScript | Experimental support | .pt, .pth | +| PyTorch | Experimental support | .pt, .pth | +| TorchScript | Experimental support | .t7 | +| Arm NN | Experimental support | .armnn | +| BigDL | Experimental support | .bigdl, .model | +| Chainer | Experimental support | .npz, .h5 | +| CNTK | Experimental support | .model, .cntk | +| Deeplearning4j | Experimental support | .zip | +| MediaPipe | Experimental support | .pbtxt | +| ML.NET | Experimental support | .zip | +| MNN | Experimental support | .mnn | +| OpenVINO | Experimental support | .xml | +| PaddlePaddle | Experimental support | .zip, \_\_model\_\_ | +| Scikit-learn | Experimental support | .pkl | +| TensorFlow.js | Experimental support | model.json, .pb | +| TensorFlow | Experimental support | .pb, .meta, .pbtxt, .ckpt, .index | + +### Installing Netron + +Installing Netron is pretty easy! :) + +Navigate to the [releases](https://github.com/lutzroeder/netron/releases) page of the Netron repository, select the installer of your choice (for example, `.exe` for Windows systems, `dmg` for Apple systems or the source code if you wish to build it yourself), and ensure that installation completes. + +Netron will then open automatically, and you can also do so from e.g. the Start Menu. + +[![](images/image-3-1024x790.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/image-3.png) + +* * * + +## An example with a Keras model + +Let's now generate an example with a Keras based model. 
We'll be taking a shortcut, and we'll be using one of the `tf.keras.applications` models that come predelivered with Keras - just as simple, and it doesn't distract from the point - showing how Netron works - with large amounts of model code. + +Do note that Netron works with any `model` instance, so models created by yourself will work too! :) + +### Keras code + +Now, open up your Explorer, navigate to some folder, and create a file - say, `netron.py`. Given what we decided above, today's model code will be very brief. Let's start with the imports: + +``` +# Imports +from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2 +``` + +Or, rather, _import_ 😉 + +There is only one: the `MobileNetV2` class of the MobileNet v2 model that we'll be visualizing today. The next thing we do is instantiate it: + +``` +# Define the model +model = MobileNetV2() +``` + +And subsequently, we save it into an HDF5 file: + +``` +# Save the model +model.save('netron_model_mobilenetv2.h5') +``` + +### Exploring the model in Netron + +Now, open up Netron, and import the `netron_model_mobilenetv2.h5` file that can be found in the folder of your `netron.py` file. In no time, the model should open up on screen. When zooming in, the individual layers are clearly and beautifully visualized: + +[![](images/image-6-1024x779.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/image-6.png) + +Now, when you click on layers, additional information becomes visible on screen: + +![](images/image-7-1024x782.png) + +This information includes, but is not limited to: + +- The type of the layer; +- The name of the layer; +- Whether the layer is trainable; +- What the data type is; +- For Convolutional layers, the number of filters, the kernel size, the strides, padding, data format and dilation rate; +- The [activation function](https://www.machinecurve.com/index.php/2020/01/24/overview-of-activation-functions-for-neural-networks/) that is used; +- Whether bias is used; +- And how the kernels and (if applied) biases are [initialized](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/). + +Each layer has its own unique set of characteristics. + +### Exporting visualizations + +It's also possible to export visualizations by using the top menu, then the 'Export' menu button. This allows you to generate PNG images of the models. The only downside is that these architectures aren't very suitable for print, especially if they are very deep: + +[![](images/netron_model_mobilenetv2.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/netron_model_mobilenetv2.png) + +If you wish to use architecture visualizations in print or, more generally, in a horizontal fashion, use [Net2Vis](https://www.machinecurve.com/index.php/2020/01/07/visualizing-keras-neural-networks-with-net2vis-and-docker/) instead. + +* * * + +## Summary + +As you can see, Netron is a very beautiful and easy way to visualize your neural networks. With a wide range of frameworks and model types that is supported, it's truly scalable and usable for many people in the machine learning community. + +It's even possible to export the plots, although you might wish to use a different approach if your goal is to generate plots for print, especially when they are very deep. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Roeder, L. (2020, February 27). lutzroeder/netron. 
Retrieved from [https://github.com/lutzroeder/netron](https://github.com/lutzroeder/netron) diff --git a/wav2vec-2-transformers-for-speech-recognition.md b/wav2vec-2-transformers-for-speech-recognition.md new file mode 100644 index 0000000..5f59132 --- /dev/null +++ b/wav2vec-2-transformers-for-speech-recognition.md @@ -0,0 +1,9 @@ +--- +title: "Wav2vec 2: Transformers for Speech Recognition" +date: "2021-02-16" +categories: + - "buffer" + - "deep-learning" +--- + +[Wav2vec 2](https://arxiv.org/abs/2006.11477) is the successor of the Wav2vec model and was [developed by Facebook AI](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/). It can be used for performing speech recognition tasks. Among others, it can be used for speech 2 text tasks. diff --git a/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks.md b/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks.md new file mode 100644 index 0000000..e35a11b --- /dev/null +++ b/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks.md @@ -0,0 +1,396 @@ +--- +title: "What are L1, L2 and Elastic Net Regularization in neural networks?" +date: "2020-01-21" +categories: + - "deep-learning" +tags: + - "elastic-net-regularization" + - "l1-regularization" + - "l2-regularization" + - "machine-learning" + - "regularization" + - "regularizer" +--- + +When you're training a neural network, you're learning a mapping from some input value to a corresponding expected output value. This is great, because it allows you to create predictive models, but who guarantees that the _mapping_ is correct for the data points that aren't part of your data set? + +That is, how do you ensure that your learnt mapping does not oscillate very heavily if you want a smooth function instead? + +Regularization can help here. With techniques that take into account the complexity of your weights during optimization, you may steer the networks towards a more general, but scalable mapping, instead of a very data-specific one. + +In this blog, we cover these aspects. First, we'll discuss the need for regularization during model training. We then continue by showing how regularizers can be added to the loss value, and subsequently used in optimization. This is followed by a discussion on the three most widely used regularizers, being L1 regularization (or Lasso), L2 regularization (or Ridge) and L1+L2 regularization (Elastic Net). Finally, we provide a set of questions that may help you decide which regularizer to use in your machine learning project. + +Are you ready? Let's go! 😎 + +* * * + +\[toc\] + +* * * + +## The need for regularization during model training + +When you are training a machine learning model, at a high level, you're learning a function \[latex\]\\hat{y}: f(x) \[/latex\] which transforms some _input_ value \[latex\]x\[/latex\] (often a vector, so \[latex\]\\textbf{x}\[/latex\]) into some output value \[latex\]\\hat{y}\[/latex\] (often a scalar value, such as a class when classifying and a real number when regressing). \\ + +Contrary to a regular mathematical function, the exact mapping (to \[latex\]y\[/latex\]) is not known in advance, but is learnt based on the input-output mappings present in your training data (so that \[latex\]\\hat{y} \\approx y\[/latex\] - hence the name, machine learning :) + +This understanding brings us to the need for regularization. 
+ +### Complex mappings vs simple mappings + +Say that you've got a dataset that contains points in a 2D space, like this small one: + +[![](images/points.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/points.png) + +Now suppose that these numbers are reported by some bank, which loans out money (the values on the x axis in $ of dollars). This has an impact on the weekly cash flow within a bank, attributed to the _loan_ and other factors (together represented by the y values). + +The bank suspects that this interrelationship means that it can predict its cash flow based on the amount of money it spends on new loans. In practice, this relationship is likely much more complex, but that's not the point of this thought exercise. + +Machine learning is used to generate a predictive model - a regression model, to be precise, which takes some input (amount of money loaned) and returns a real-valued number (the expected impact on the cash flow of the bank). + +After training, the model is brought to production, but soon enough the bank employees find out that it doesn't work. Upon analysis, the bank employees find that the actual _function_ learnt by the machine learning model is this one: + +[![](images/poly_large.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/poly_large.png) + +The employees instantly know why their model does not work, using nothing more than common sense: + +**The function is way too extreme for the data**. It's nonsense that if the bank would have spent $2.5k on loans, returns would be $5k, and $4.75k for $3.5k spendings, but minus $5k and counting for spendings of $3.25k. + +They'd rather have wanted something like this: + +[![](images/poly_small.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/poly_small.png) + +Which, as you can see, makes a lot more sense: + +[![](images/poly_both.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/poly_both.png) + +### On training machine learning models + +But how do we get there? + +The two functions are generated based on the same data points, aren't they? + +For me, it was simple, because I used a `polyfit` on the data points, to generate either a polynomial function of the third degree or one of the tenth degree. Obviously, the one of the tenth produces the wildly oscillating function. + +Machine learning however does not work this way. Besides not even having the certainty that your ML model will learn the mapping correctly, you also don't know if it will learn a highly specialized mapping or a more generic one. + +Or can you? Let's explore a possible route. + +From our article about [loss and loss functions](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/), you may recall that a supervised model is trained following the high-level supervised machine learning process: + +- Training data is fed to the network in a feedforward fashion. +- The predictions generated by this process are stored, and compared to the actual targets, or the "ground truth". +- The difference between the predictions and the targets can be computed and is known as the loss value. +- Through computing gradients and subsequent [gradient based optimization techniques](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/), the weights of your neural network can be adapted, possibly improving the model. + +This means that optimizing a model equals minimizing the loss function that was specified for it. 
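As a small side note: the `polyfit` comparison described above is easy to reproduce yourself. Below is a rough sketch with made-up data points (so not the bank's actual figures), purely to show how a third-degree fit stays smooth while a tenth-degree fit starts oscillating between the points:

```
# Rough sketch of the polyfit comparison above - the data points are made up
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical (x, y) observations, e.g. loaned amount vs. cash flow impact
x = np.array([2.0, 2.5, 3.0, 3.25, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5])
y = np.array([4.8, 5.0, 4.9, 5.2, 4.75, 5.1, 5.3, 5.0, 5.4, 5.6, 5.5])

# Fit a third-degree and a tenth-degree polynomial to the same points
smooth = np.poly1d(np.polyfit(x, y, deg=3))
wild = np.poly1d(np.polyfit(x, y, deg=10))

# Evaluate both on a fine grid: the tenth-degree fit passes through every
# point, but oscillates heavily in between
grid = np.linspace(x.min(), x.max(), 200)
plt.scatter(x, y, label='data points')
plt.plot(grid, smooth(grid), label='degree 3')
plt.plot(grid, wild(grid), label='degree 10')
plt.legend()
plt.show()
```

How extreme the oscillation gets obviously depends on your data, but the mechanism is exactly the one sketched in the plots above.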
+ +### Loss based regularization + +You can imagine that if you train the model for too long, minimizing the loss function is done based on loss values that are entirely adapted to the dataset it is training on, generating the highly oscillating curve plot that we've seen before. + +This is not what you want. However, you also don't know exactly _the point where you should stop_. + +This is why you may wish to add a regularizer to your neural network. + +Regularizers, which are attached to your _loss value_ often, induce a penalty on large weights or weights that do not contribute to learning. This way, we may get sparser models and weights that are not too adapted to the data at hand. If done well, adding a regularizer should result in models that produce better results for data they haven't seen before. + +Let's take a look at some foundations of regularization, before we continue to the actual regularizers. + +* * * + +## For starters: a little bit of foundation + +Before we do so, however, we must first deepen our understanding of the concept of regularization in conceptual and mathematical terms. + +Say, for example, that you are training a machine learning model, which is essentially a function \[latex\]\\hat{y}: f(\\textbf{x})\[/latex\] which maps some input vector \[latex\]\\textbf{x}\[/latex\] to some output \[latex\]\\hat{y}\[/latex\]. + +From previously, we know that during training, there exists a true target \[latex\]y\[/latex\] to which \[latex\]\\hat{y}\[/latex\] can be compared. + +Say that some function \[latex\]L\[/latex\] computes the loss between \[latex\]y\[/latex\] and \[latex\]\\hat{y}\[/latex\] (or \[latex\]f(\\textbf{x})\[/latex\]). For one sample \[latex\]\\textbf{x}\_i\[/latex\] with corresponding target \[latex\]y\_i\[/latex\], loss can then be computed as \[latex\]L(\\hat{y}\_i, y\_i) = L(f(\\textbf{x}\_i), y\_i)\[/latex\]. + +Total loss can be computed by summing over all the input samples \[latex\]\\textbf{x}\_i ... \\textbf{x}\_n\[/latex\] in your training set, and subsequently performing a minimization operation on this value: + +\[latex\]\\min\_f \\sum\_{i=1}^{n} L(f(\\textbf{x}\_i), y\_i) \[/latex\] + +### Adding a regularizer + +Before, we wrote about regularizers that they "are attached to your _loss value_ often". Indeed, adding some regularizer \[latex\]R(f)\[/latex\] - "regularization for some function \[latex\]f\[/latex\]" - is easy: + +\[latex\] L(f(\\textbf{x}\_i), y\_i) = \\sum\_{i=1}^{n} L\_{ losscomponent}(f(\\textbf{x}\_i), y\_i) + \\lambda R(f) \[/latex\] + +...where \[latex\]\\lambda\[/latex\] is a hyperparameter, to be configured by the machine learning engineer, that determines the relative importance of the regularization component compared to the loss component. + +The above means that the loss _and_ the regularization components are minimized, not the loss component alone. Let's take a look at some scenarios: + +- If the loss component's value is low but the mapping is not generic enough (a.k.a. overfitting), a regularizer value will likely be high. There is still room for minimization. +- If a mapping is very generic (low regularization value) but the loss component's value is high (a.k.a. underfitting), there is also room for minimization. +- The optimum is found when the model is both as generic and as good as it can be, i.e. when both values are as low as they can possible become. 
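To make this a bit more tangible, here is a minimal, framework-free sketch of the formula above, using made-up values and mean squared error as the loss component (deep learning frameworks offer this through their own regularizer APIs, so you normally don't write it yourself):

```
# Minimal sketch: total loss = data loss + lambda * regularization penalty
import numpy as np

def regularized_loss(y_true, y_pred, weights, lambd, regularizer):
    data_loss = np.mean((y_true - y_pred) ** 2)      # the loss component, here MSE
    return data_loss + lambd * regularizer(weights)  # plus the weighted penalty R(f)

# Hypothetical weights, targets and predictions
weights = np.array([0.5, -1.2, 0.0, 3.0])
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.7])

# Any function of the weights can serve as R(f); common choices are covered next
sum_of_absolutes = lambda w: np.sum(np.abs(w))
print(regularized_loss(y_true, y_pred, weights, lambd=0.01, regularizer=sum_of_absolutes))
```

The larger you set `lambd`, the more the regularization component dominates the value that is minimized during training.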
+ +### Instantiating the regularizer function R(f) + +Now, you likely understand that you'll want to have your outputs for \[latex\]R(f)\[/latex\] to minimize as well. But what is this function? What does it look like? It turns out to be that there is a wide range of possible instantiations for the regularizer. + +In the machine learning community, three regularizers are very common: + +- **L1 Regularization**, also known as Lasso Regularization; +- **L2 Regularization**, also known as Ridge Regularization; +- **L1+L2 Regularization**, also known as Elastic Net Regularization. + +Next, we'll cover the three of them. + +* * * + +## L1 Regularization + +**L1 Regularization** (or **Lasso**) adds to so-called L1 Norm to the loss value. A "norm" tells you something about a vector in space and can be used to express useful properties of this vector (Wikipedia, 2004). + +The L1 norm of a vector, which is also called the taxicab norm, computes the absolute value of each vector dimension, and adds them together (Wikipedia, 2004). As computing the norm effectively means that you'll travel the full distance from the starting to the ending point for each dimension, adding it to the distance traveled already, the travel pattern resembles that of a taxicab driver which has to drive the blocks of e.g. New York City; hence the name (Wikipedia, 2004). + +In terms of maths, this can be expressed as \[latex\] R(f) = \\sum\_f{ \_{i=1}^{n}} | w\_i |\[/latex\], where this is an iteration over the \[latex\]n\[/latex\] dimensions of some vector \[latex\]\\textbf{w}\[/latex\]. + +Visually, and hence intuitively, the process goes as follows. Suppose that we have this two-dimensional vector \[latex\]\[2, 4\]\[/latex\]: + +[![](images/empty_vector.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/empty_vector.png) + +...our formula would then produce a computation over two dimensions, for the first: + +[![](images/taxicab1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/taxicab1.png) + +Then the second: + +[![](images/taxicab2.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/taxicab2.png) + +The L1 norm for our vector is thus 6, as you can see: + +\[latex\] \\sum\_{i=1}^{n} | w\_i | = | 4 | + | 2 | = 4 + 2 = 6\[/latex\] + +Adding L1 Regularization to our loss value thus produces the following formula: + +\[latex\] L(f(\\textbf{x}\_i), y\_i) = \\sum\_{i=1}^{n} L\_{ losscomponent}(f(\\textbf{x}\_i), y\_i) + \\lambda \\sum\_{i=1}^{n} | w\_i | \[/latex\] + +...where \[latex\]w\_i\[/latex\] are the values of your model's weights. + +This way, our loss function - and hence our optimization problem - now also includes information about the _complexity_ of our weights. + +### On negative vectors + +Say we had a negative vector instead, e.g. \[latex\]\[-1, -2.5\]\[/latex\]: + +![](images/neg_vec.png) + +As you can derive from the formula above, L1 Regularization takes some value related to the weights, and adds it to the same values for the other weights. As you know, "some value" is the absolute value of the weight or \[latex\]| w\_i |\[/latex\], and we take it for a reason: + +[![](images/l1_component.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l1_component.png) + +Taking the absolute value ensures that _negative values_ contribute to the regularization loss component as well, as the sign is removed and only the, well, absolute value remains. This way, L1 Regularization natively supports negative vectors as well, such as the one above. 
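In code, computing the L1 norm is a one-liner. This small sketch reproduces both example vectors from above:

```
# The L1 (taxicab) norm: sum the absolute value of each vector dimension
import numpy as np

def l1_norm(w):
    return np.sum(np.abs(w))

print(l1_norm(np.array([2.0, 4.0])))    # 6.0, as computed above
print(l1_norm(np.array([-1.0, -2.5])))  # 3.5 - the signs are dropped, so negative vectors work too
```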
+ +### On model sparsity + +Next up: model sparsity. L1 Regularization produces sparse models, i.e. models where unnecessary features don't contribute to their predictive power, which - as an additional benefit - may also speed up models during inference (Google Developers, n.d.). + +But why is this the case? Let's take a closer look (Caspersen, n.d.; Neil G., n.d.). + +This is the derivative for L1 Regularization: + +[![](images/l1_deriv.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l1_deriv.png) + +It's either -1 or +1, and is undefined at \[latex\]x = 0\[/latex\]. + +Now suppose that we have trained a neural network for the first time. We have a loss value which we can use to compute the weight change. Obviously, this weight change will be computed with respect to the _loss component_, but this time, the _regularization component_ (in our case, L1 loss) would also play a role. + +If our loss component were static for some reason (just a thought experiment), our obvious goal would be to bring the regularization component to zero. As you can see, this would be done in small but constant steps, eventually allowing the value to reach minimum regularization loss, at \[latex\]x = 0\[/latex\]. This would essentially "drop" a weight from participating in the prediction, as it's set at _zero_. This is also known as the "model sparsity" principle of L1 loss. + +This theoretical scenario is however not necessarily true in real life. Besides the regularization loss component, the normal loss component participates as well in generating the loss value, and subsequently in gradient computation for optimization. This means that the theoretically constant steps in one direction, i.e. sparse models, are less "straight" in practice. Nevertheless, since the regularization loss component still plays a significant role in computing loss and hence optimization, L1 loss will _still tend to push weights to zero and hence produce sparse models_ (Caspersen, n.d.; Neil G., n.d.). + +### Lasso disadvantages + +Unfortunately, besides the benefits that can be gained from using L1 regularization, the technique also comes at a cost: + +1. Lasso does not work that well in a high-dimensional case, i.e. where the number of _samples_ is lower than the number of _dimensions_ (Tripathi, n.d.; Wikipedia, 2011). This is also called the "large \[latex\]p\[/latex\], small \[latex\]n\[/latex\] case" or the "short, fat data problem", and it's not good because L1 regularization can only select \[latex\]n\[/latex\] variables at most (Duke University, n.d.; Tripathi, n.d.). +2. Secondly, the main benefit of L1 regularization - i.e., that it results in sparse models - could be a disadvantage as well. For example, when you don't need variables to drop out - e.g., because you already performed variable selection - L1 might induce too much sparsity in your model (Kochede, n.d.). The same is true if the relevant information is "smeared out" over many variables, in a correlative way (cbeleites, 2013; Tripathi, n.d.). In this case, having variables dropped out removes essential information. On the contrary, when your information is primarily present in a few variables only, it makes total sense to induce sparsity and hence use L1. +3. Even when you _do_ want variables to drop out, it is reported that L1 regularization does not work as well as, for example, L2 Regularization and Elastic Net Regularization (Tripathi, n.d.). We will cover both of them next. 
+ +Therefore, always make sure to decide whether you need L1 regularization based on your dataset, before blindly applying it. + +* * * + +## L2 Regularization + +Another type of regularization is **L2 Regularization**, also called **Ridge**, which utilizes the L2 norm of the vector: + +\[latex\] R(f) = \\sum\_f{ \_{i=1}^{n}} w\_i^2\[/latex\] + +When added to the regularization equation, you get this: + +\[latex\] L(f(\\textbf{x}\_i), y\_i) = \\sum\_{i=1}^{n} L\_{ losscomponent}(f(\\textbf{x}\_i), y\_i) + \\lambda \\sum\_{i=1}^{n} w\_i^2 \[/latex\] + +Visually, it looks as follows: + +[![](images/l2_comp.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l2_comp.png) + +As you can see, L2 regularization also stimulates your values to approach zero (as the loss for the regularization component is zero when \[latex\]x = 0\[/latex\]), and hence stimulates them towards being very small values. + +However, unlike L1 regularization, it does not push the values to be _exactly zero_. + +### Why L1 yields sparsity and L2 likely does not + +Let's recall the gradient for L1 regularization: + +[![](images/l1_deriv.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l1_deriv.png) + +Regardless of the value of \[latex\]x\[/latex\], the gradient is a constant - either plus or minus one. + +This is also true for very small values, and hence, the expected weight update suggested by the regularization component is quite static over time. This, combined with the fact that the normal loss component will ensure some oscillation, stimulates the weights to take zero values whenever they do not contribute significantly enough. + +However, the situation is different for L2 loss, where the derivative is \[latex\]2x\[/latex\]: + +![](images/l2_deriv.png) + +From this plot, you can see that the closer the weight value gets to zero, the smaller the gradient will become. + +And the smaller the gradient value, the smaller the weight update suggested by the regularization component. + +Much like how you'll never reach zero when you keep dividing 1 by 2, then 0.5 by 2, then 0.25 by 2, and so on, you won't reach zero in this case as well. This is due to the nature of L2 regularization, and especially the way its gradient works. Thus, while L2 regularization will nevertheless produce very small values for non-important values, the models will _not_ be stimulated to be sparse. This is a very important difference between L1 and L2 regularization. + +### Why would you want L2 over L1? + +Primarily due to the L1 drawback that situations where _high-dimensional data where many features are correlated_ will lead to ill-performing models, because relevant information is removed from your models (Tripathi, n.d.). + +### Ridge disadvantage + +Unfortunately, L2 regularization also comes with a disadvantage due to the nature of the regularizer (Gupta, 2017). It is model interpretability: due to the fact that L2 regularization does not promote sparsity, you may end up with an uninterpretable model if your dataset is high-dimensional. + +This may not always be unavoidable (e.g. in the case where you have a correlative dataset), but once again, take a look at your data first before you choose whether to use L1 or L2 regularization. + +* * * + +## Elastic Net Regularization + +Now that we have identified how L1 and L2 regularization work, we know the following: + +- L1 regularization produces sparse models, but cannot handle "small and fat datasets". 
+- L2 regularization can handle these datasets, but can get you into trouble in terms of model interpretability due to the fact that it does not produce the sparse solutions you may wish to find after all. + +But what if we can combine them? + +Say hello to **Elastic Net Regularization** (Zou & Hastie, 2005). It's a linear combination of L1 and L2 regularization, and produces a regularizer that has both the benefits of the L1 (Lasso) and L2 (Ridge) regularizers. Let's take a look at how it works - by taking a look at a _naïve_ version of the Elastic Net first, the Naïve Elastic Net. + +### Naïve Elastic Net + +In their work "Regularization and variable selection via the elastic net", Zou & Hastie (2005) introduce the Naïve Elastic Net as a linear combination between L1 and L2 regularization. With hyperparameters \[latex\]\\lambda\_1 = (1 - \\alpha) \[/latex\] and \[latex\]\\lambda\_2 = \\alpha\[/latex\], the elastic net penalty (or regularization loss component) is defined as: + +\[latex\](1 - \\alpha) | \\textbf{w} |\_1 + \\alpha | \\textbf{w} |^2 \[/latex\] + +Here, the first part is the L1 penalty \[latex\] \\sum\_{i=1}^{n} | w\_i | \[/latex\], while the second part is the L2 penalty \[latex\] \\sum\_f{ \_{i=1}^{n}} w\_i^2 \[/latex\]. The hyperparameter to be tuned in the Naïve Elastic Net is the value for \[latex\]\\alpha\[/latex\] where, \[latex\]\\alpha \\in \[0, 1\]\[/latex\]. + +With Elastic Net Regularization, the total value that is to be minimized thus becomes: + +\[latex\] L(f(\\textbf{x}\_i), y\_i) = \\sum\_{i=1}^{n} L\_{ losscomponent}(f(\\textbf{x}\_i), y\_i) + (1 - \\alpha) \\sum\_{i=1}^{n} | w\_i | + \\alpha \\sum\_{i=1}^{n} w\_i^2 \[/latex\] + +As you can see, for \[latex\]\\alpha = 1\[/latex\], Elastic Net performs Ridge (L2) regularization, while for \[latex\]\\alpha = 0\[/latex\] Lasso (L1) regularization is performed. Tuning the alpha parameter allows you to balance between the two regularizers, possibly based on prior knowledge about your dataset. Visually, we can see this here: + +[![](images/penalty-values.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/penalty-values.png) + +Do note that frameworks often allow you to specify \[latex\]\\lambda\_1\[/latex\] and \[latex\]\\lambda\_2\[/latex\] manually. The penalty term then equals: + +\[latex\]\\lambda\_1| \\textbf{w} |\_1 + \\lambda\_2| \\textbf{w} |^2 \[/latex\] + +The Elastic Net works well in many cases, especially when the final outcome is close to either L1 or L2 regularization only (i.e., \[latex\]\\alpha \\approx 0\[/latex\] or \[latex\]\\alpha \\approx 1\[/latex\]), but performs less adequately when the hyperparameter tuning is different. That's why the authors call it naïve (Zou & Hastie, 2005). The cause for this is "double shrinkage", i.e., the fact that both L2 (first) and L1 (second) regularization tend to make the weights as small as possible. As this may introduce unwanted side effects, performance can get lower. Fortunately, the authors also provide a fix, which resolves this problem. I'd like to point you to the Zou & Hastie (2005) paper for the discussion about correcting it. + +* * * + +## Should I start with L1, L2 or Elastic Net Regularization? + +If you want to add a regularizer to your model, it may be difficult to decide which one you'll need. Generally speaking, it's wise to start with Elastic Net Regularization, because it combines L1 and L2 and generally performs better because it cancels the disadvantages of the individual regularizers (StackExchange, n.d.). 
However, you may wish to make a more informed choice - in that case, read on :) + +Fortunately, there are three questions that you can ask yourself which help you decide where to start. We'll cover these questions in more detail next, but here they are: + +1. **How much prior knowledge \[about the dataset\] do you have?** This includes a judgement call (or an empirical decision) about whether you need regularization in the first place, and how sparse and/or correlative your dataset already is. +2. **How much room for validation do you have?** You might wish to test the method and the hyperparameter before you're spending all your resources on one approach. +3. **What are your computational requirements?** Depending on them, you might wish to choose either L1 or L2 over Elastic Net regularization. + +### How much prior knowledge do you have? + +The first thing that you'll have to inspect is the following: t**he amount of prior knowledge that you have about your dataset**. + +Knowing some crucial details about the data may guide you towards a correct choice, which can be L1, L2 or Elastic Net regularization, no regularizer at all, or a regularizer that we didn't cover here. + +For example, it may be the case that your model does not improve significantly when applying regularization - due to sparsity already introduced to the data, as well as good normalization up front (StackExchange, n.d.). In those cases, you may wish to avoid regularization altogether. + +If you don't, **you'll have to estimate the _sparsity_ and _pairwise correlation_ of and within the dataset** (StackExchange). For this purpose, you may benefit from these references: + +- [How do you calculate how dense or sparse a dataset is?](https://datascience.stackexchange.com/questions/10580/how-do-you-calculate-how-dense-or-sparse-a-dataset-is) +- [Calculating pairwise correlation among all columns](https://stackoverflow.com/questions/33997753/calculating-pairwise-correlation-among-all-columns) + +Depending on your analysis, **you might have enough information to choose a regularizer**. If your dataset turns out to be very sparse already, L2 regularization may be your best choice. The same is true if the dataset has a large amount of pairwise correlations. If it doesn't, and is dense, you may choose L1 regularization instead. If you don't know for sure, or when your metrics don't favor one approach, Elastic Net may be the best choice for now. + +However, before actually starting the training process with a large dataset, you might wish to validate first. You could do the same if you're still unsure. + +### How much room for validation do you have? + +If you have some resources to spare, you may also perform some **validation activities** first, before you start a large-scale training process. These validation activities especially boil down to the following two aspects: + +1. Method testing; +2. Hyperparameter tuning. + +Firstly, and obviously, if you choose to validate, it's important to _validate the method you want to use_. If, when using a representative dataset, you find that some regularizer doesn't work, the odds are that it will neither for a larger dataset. + +Secondly, when you find a method about which you're confident, it's time to estimate _the impact of the_ _hyperparameter_. 
The hyperparameter, which is \[latex\]\\lambda\[/latex\] in the case of L1 and L2 regularization and \[latex\]\\alpha \\in \[0, 1\]\[/latex\] in the case of Elastic Net regularization (or \[latex\]\\lambda\_1\[/latex\] and \[latex\]\\lambda\_2\[/latex\] separately), effectively determines the impact of the _regularizer_ on the loss value that is optimized during training. The stronger you regularize, the sparser your model will get (with L1 and Elastic Net), but this comes at the cost of underperforming when it is too large (Yadav, 2018). + +### What are your computational requirements? + +Thirdly, and finally, you may wish to inform yourself of the **computational requirements** of your machine learning problem. + +Often, and especially with today's movement towards commoditization of hardware, this is not a problem, but Elastic Net regularization is more expensive than Lasso or Ridge regularization applied alone (StackExchange, n.d.). Hence, if your machine learning problem already balances at the edge of what your hardware supports, it may be a good idea to perform additional validation work and/or to try and identify additional knowledge about your dataset, in order to make an informed choice between L1 and L2 regularization. + +Now that you have answered these three questions, it's likely that you have a good understanding of what the regularizers do - and _when_ to apply _which_ one. With this understanding, we conclude today's blog :) + +* * * + +## Summary + +In this article, you've found a discussion about a couple of things: + +1. **The need for regularization.** Primarily, we looked at a fictional scenario where a regression model was estimated based on a few datapoints. Clearly, we saw why a more _generic_ model may be preferred over a very _specific_ one - as we don't want the bank go bankrupt :) +2. **The foundations of a regularizer**. We saw how regularizers are attached to the loss values of a machine learning model, and how they are thus included in the optimization step. Combining the original loss value with the regularization component, models will become simpler with likely losing not much of their predictive abilities. +3. **L1 regularization, or Lasso**. This approach, by using the L1 norm of your weights, ensures that the weights of your model are both small and sparse, dropping out weights that are not relevant. This is especially useful when you have many dimensions that are not correlated, as your models get simpler. However, when you have a small but fat dataset, or when the variables in your dataset correlate quite substantially, L1 regularization may not be suitable for your machine learning problem. +4. **L2 regularization, or Ridge**. By taking the L2 norm of your weights, it ensures that weights get small, but without the zero enforcement. While it is very useful in the cases where L1 regularization is not so useful, the typical datasets suitable for L1 (high-dimensional, high-volume and low-correlation between samples) yield uninterpretable models when L2 loss is used. +5. **Elastic Net regularization**, which has a _naïve_ and a _smarter_ variant, but essentially combines L1 and L2 regularization linearly. It's often the preferred regularizer during machine learning problems, as it removes the disadvantages from both the L1 and L2 ones, and can produce good results. +6. **However, we also looked at questions that help you determine the best regularizer for your machine learning problem**. 
Even though Elastic Net regularization produces good results often, it may not always be the best choice. For example, do you have substantial prior knowledge about your dataset? Do you need regularization at all? Do you have resources to spare for validation activities? Or, on the contrary, do you already balance on the fine line between overshooting your computational limits and staying on track? The answer to these questions may help you further. + +If you have any questions or remarks - feel free to leave a comment 😊 I will happily answer those questions and will improve my blog if you found mistakes. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Wikipedia. (2004, September 16). Norm (mathematics). Retrieved from [https://en.wikipedia.org/wiki/Norm\_(mathematics)](https://en.wikipedia.org/wiki/Norm_(mathematics)) + +Chioka. (n.d.). Differences between L1 and L2 as Loss Function and Regularization. Retrieved from [http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/](http://www.chioka.in/differences-between-l1-and-l2-as-loss-function-and-regularization/) + +Google Developers. (n.d.). Regularization for Sparsity: L1 Regularization. Retrieved from [https://developers.google.com/machine-learning/crash-course/regularization-for-sparsity/l1-regularization](https://developers.google.com/machine-learning/crash-course/regularization-for-sparsity/l1-regularization) + +Neil G. (n.d.). Why L1 regularization can "zero out the weights" and therefore leads to sparse models? Retrieved from [https://stats.stackexchange.com/questions/375374/why-l1-regularization-can-zero-out-the-weights-and-therefore-leads-to-sparse-m](https://stats.stackexchange.com/questions/375374/why-l1-regularization-can-zero-out-the-weights-and-therefore-leads-to-sparse-m) + +Wikipedia. (2011, December 11). Elastic net regularization. Retrieved from [https://en.wikipedia.org/wiki/Elastic\_net\_regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization) + +Khandelwal, R. (2019, January 10). L1 L2 Regularization. Retrieved from [https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2](https://medium.com/datadriveninvestor/l1-l2-regularization-7f1b4fe948f2) + +Caspersen, K. M. (n.d.). Why L1 norm for sparse models. Retrieved from [https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models/159379](https://stats.stackexchange.com/questions/45643/why-l1-norm-for-sparse-models/159379) + +Kochede. (n.d.). What are disadvantages of using the lasso for variable selection for regression? Retrieved from [https://stats.stackexchange.com/questions/7935/what-are-disadvantages-of-using-the-lasso-for-variable-selection-for-regression](https://stats.stackexchange.com/questions/7935/what-are-disadvantages-of-using-the-lasso-for-variable-selection-for-regression) + +cbeleites(https://stats.stackexchange.com/users/4598/cbeleites-supports-monica), What are disadvantages of using the lasso for variable selection for regression?, URL (version: 2013-12-03): [https://stats.stackexchange.com/q/77975](https://stats.stackexchange.com/q/77975) + +Tripathi, M. (n.d.). Are there any disadvantages or weaknesses to the L1 (LASSO) regularization technique? 
Retrieved from [https://www.quora.com/Are-there-any-disadvantages-or-weaknesses-to-the-L1-LASSO-regularization-technique/answer/Manish-Tripathi](https://www.quora.com/Are-there-any-disadvantages-or-weaknesses-to-the-L1-LASSO-regularization-technique/answer/Manish-Tripathi) + +Duke University. (n.d.). _Sparsity and p >> n - Duke Statistical Science_ \[PDF\]. Retrieved from [http://www2.stat.duke.edu/~banks/218-lectures.dir/dmlect9.pdf](http://www2.stat.duke.edu/~banks/218-lectures.dir/dmlect9.pdf) + +Gupta, P. (2017, November 16). Regularization in Machine Learning. Retrieved from [https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a](https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a) + +Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. _Journal of the royal statistical society: series B (statistical methodology)_, _67_(2), 301-320. + +StackExchange. (n.d.). What is elastic net regularization, and how does it solve the drawbacks of Ridge ($L^2$) and Lasso ($L^1$)? Retrieved from [https://stats.stackexchange.com/questions/184029/what-is-elastic-net-regularization-and-how-does-it-solve-the-drawbacks-of-ridge](https://stats.stackexchange.com/questions/184029/what-is-elastic-net-regularization-and-how-does-it-solve-the-drawbacks-of-ridge) + +Yadav, S. (2018, December 25). All you need to know about Regularization. Retrieved from [https://towardsdatascience.com/all-you-need-to-know-about-regularization-b04fc4300369](https://towardsdatascience.com/all-you-need-to-know-about-regularization-b04fc4300369) diff --git a/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling.md b/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling.md new file mode 100644 index 0000000..1fdd3d8 --- /dev/null +++ b/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling.md @@ -0,0 +1,312 @@ +--- +title: "What are Max Pooling, Average Pooling, Global Max Pooling and Global Average Pooling?" +date: "2020-01-30" +categories: + - "deep-learning" + - "frameworks" +tags: + - "average-pooling" + - "deep-learning" + - "global-average-pooling" + - "global-max-pooling" + - "global-pooling" + - "keras" + - "machine-learning" + - "max-pooling" + - "pooling-layers" +--- + +Creating ConvNets often goes hand in hand with pooling layers. More specifically, we often see additional layers like max pooling, average pooling and global pooling. But what are they? Why are they necessary and how do they help training a machine learning model? And how can they be used? + +We answer these questions in this blog post. + +Firstly, we'll take a look at pooling operations from a conceptual level. We explore the inner workings of a ConvNet and through this analysis show how pooling layers may help the spatial hierarchy generated in those models. Then, we continue by identifying four types of pooling - max pooling, average pooling, global max pooling and global average pooling. + +Subsequently, we switch from theory to practice: we show how the pooling layers are represented within Keras, one of the most widely used deep learning frameworks today. Then, we conclude this blog by giving a MaxPooling based example with Keras, using the 2-dimensional variant i.e. `MaxPooling2D`. + +Are you ready? Let's go! 😎 + +* * * + +\[toc\] + +* * * + +## What are pooling operations? 
+ +Suppose that you're training a [convolutional neural network](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/). Your goal is to classify images from a dataset - say, the [SVHN](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/) one. The operation performed by the first convolutional layer in your neural network can be represented as follows: + +[![](images/CNN-1.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/09/CNN-1.jpg) + +The inputs for this layer are images, of height \[latex\]H\[/latex\], width \[latex\]W\[/latex\] and with three channels. Thus, they're likely RGB images. Using a 3x3x3 kernel, a convolution operation is performed over the input image, generating \[latex\]N\[/latex\] so-called "feature maps" of size \[latex\]H\_{fm} \\times W\_{fm}\[/latex\]. One feature map learns one particular feature present in the image. Through [activating](https://www.machinecurve.com/index.php/2019/12/03/what-do-convnets-see-visualizing-filters-with-activation-maximization/), these feature maps contribute to the outcome prediction during training, and for new data as well. \[latex\]N\[/latex\] can be configured by the machine learning engineer prior to starting the training process. + +In the case of the SVHN dataset mentioned above, where the images are 32 x 32 pixels, the first convolution operation (assuming a stride of 1 and no padding whatsoever) would produce feature maps of 30 x 30 pixels; say we set \[latex\]N = 64\[/latex\], then 64 such maps would be produced in this first layer (Chollet, 2017). + +### Downsampling your inputs + +Let's now take one step back and think of the goals that we want to achieve if we were to train a ConvNet successfully. The primary goal, say that we have an image classifier, **is that it classifies the images correctly.** + +If we as humans were to do that, we would look at **both the details and the high-level patterns**. + +Now let's take a look at the concept of a feature map again. In the first layer, you learn a feature map based on very "concrete" aspects of the image. Here, the feature map consists of very low-level elements within the image, such as curves and edges, a.k.a. the **details**. However, we cannot see the **higher-level** **patterns** with just one convolutional layer. We need many, stacked together, to learn these patterns. This is also called building a spatial hierarchy (Chollet, 2017). Good spatial hierarchies summarize the data substantially when moving from bottom to top, and they're like a pyramid. Here's a good one versus a bad one: + +[![](images/hierarchies.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/hierarchies.png) + +_A good spatial hierarchy (left) versus a worse one (right)._ + +As you [likely know](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/), in the convolution operation of a ConvNet, a small block slides over the entire input image, taking element-wise multiplications with the part of the image it currently slides over (Chollet, 2017). This is a relatively expensive operation. Can't this be done in a simpler way? Do we really need to have a hierarchy built up from convolutions only? The answer is no, and pooling operations prove this. + +### Introducing pooling + +Here's one definition of pooling: + +> Pooling is basically “downscaling” the image obtained from the previous layers. 
It can be compared to shrinking an image to reduce its pixel density. +> +> [Hervatte (n.d.)](https://www.quora.com/What-is-pooling-in-a-convolutional-neural-network/answer/Shreyas-Hervatte) + +All right, downscaling it is. But it is also done in a much simpler way: by performing a _hardcoded tensor operation_ such as `max`, rather than through a learned transformation, we don't need the relatively expensive operation of learning the weights (Chollet, 2017). This way, we get a nice and possibly useful spatial hierarchy at a fraction of the cost. + +In the rest of this blog post, we cover four types of pooling operations: + +- Max pooling; +- Average pooling; +- Global max pooling; +- Global average pooling. + +Let's take a look at Max Pooling first. + +* * * + +## Max Pooling + +Suppose that this is one of the 4 x 4 pixels feature maps from our ConvNet: + +[![](images/Max-Pooling.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Max-Pooling.png) + +If we want to downsample it, we can use a pooling operation what is known as "max pooling" (more specifically, this is _two-dimensional_ max pooling). In this pooling operation, a \[latex\]H \\times W\[/latex\] "block" slides over the input data, where \[latex\]H\[/latex\] is the height and \[latex\]W\[/latex\] the width of the block. The stride (i.e. how much it steps during the sliding operation) is often equal to the pool size, so that its effect equals a reduction in height and width. + +For each block, or "pool", the operation simply involves computing the \[latex\]max\[/latex\] value, like this: + +[![](images/Max-Pooling-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Max-Pooling-1.png) + +Doing so for each pool, we get a nicely downsampled outcome, greatly benefiting the spatial hierarchy we need: + +[![](images/Max-Pooling-2.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Max-Pooling-2.png) + +### How Max Pooling benefits translation invariance + +Besides being a cheap replacement for a convolutional layer, there is another reason why max pooling can be very useful in your ConvNet: _translation invariance_ (Na, n.d.). + +When a model is translation invariant, it means that it doesn't matter where an object is present in a picture; it will be recognized anyway. For example, if I hold a phone near my head, or near my pocket - it should be part of the classification both times. + +As you can imagine, achieving translation invariance in your model greatly benefits its predictive power, as you no longer need to provide images where the object is _precisely_ at some desired position. Rather, you can just provide a massive set of images that contain the object, and possibly get a well-performing model. + +Now, how does max pooling achieve translation invariance in a neural network? + +Say that we have a one-pixel object - that's a bit weird, as objects are normally multi-pixel, but it benefits our explanation. The object has the highest contrast and hence generates a high value for the pixel in the input image. Suppose that the 4 at (0, 4) in the red part of the image above is the pixel of our choice. With max pooling, it is still included in the output, as we can see. + +Now imagine that this object, and thus the 4, isn't present at (0, 4), but at (1, 3) instead. Does it disappear from the model? No. Rather, the output of the max pooling layer will still be 4. Hence, it doesn't really matter where the object resides in the red block, as it will be "caught" anyway. 
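If you prefer code over pictures, here is a small NumPy sketch of 2 x 2 max pooling with a stride of 2 over a hypothetical 4 x 4 feature map - note that the pooled output only depends on the maximum within each pool, not on _where_ in the pool that maximum sits:

```
# 2x2 max pooling with stride 2 over a hypothetical 4x4 feature map
import numpy as np

feature_map = np.array([
    [1, 4, 2, 1],
    [0, 3, 1, 2],
    [2, 1, 0, 5],
    [1, 0, 3, 2],
])

# Split the map into 2x2 pools and take the maximum of each pool
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 2]
#  [2 5]]
# Moving the 4 to any other position within its pool yields exactly the same output
```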
+ +That's why max pooling means translation invariance and why it is really useful, except for being relatively cheap. + +Do note however that if the object were in any of the non-red areas, it would be recognized there, but _only_ if there's nothing with a greater pixel value (which is the case for all the elements!). Hence, max pooling does not produce translation invariance if you only provide pictures where the object resides in a _very small_ _area_ all the time. However, if your dataset is varied enough, with the object being in various positions, max pooling does really benefit the performance of your model. + +### Why Max Pooling is the most used pooling operation + +Next, we'll look at Average Pooling, which is another pooling operation. It can be used as a drop-in replacement for Max Pooling. However, when you look at neural network theory (such as Chollet, 2017), you'll see that Max Pooling is preferred all the time. + +Why is this the case? + +The argument is relatively simple: as the objects of interest likely produce the largest pixel values, it shall be more interesting to take the max value in some block than to take an average (Chollet, 2017). + +Oops, now I already gave away what Average Pooling does :) + +* * * + +## Average Pooling + +Another type of pooling layers is the Average Pooling layer. Here, rather than a `max` value, the `avg` for each block is computed: + +[![](images/Average-Pooling.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Average-Pooling.png) + +As you can see, the output is also different - and less extreme compared to Max Pooling: + +[![](images/Average-Pooling-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Average-Pooling-1.png) + +Average Pooling is different from Max Pooling in the sense that it retains much information about the "less important" elements of a block, or pool. Whereas Max Pooling simply throws them away by picking the maximum value, Average Pooling blends them in. This can be useful in a variety of situations, where such information is useful. We'll see one in the next section. + +### Why think about Average Pooling at all? + +On the internet, many arguments pro and con Average Pooling can be found, often suggesting Max Pooling as the alternative. Primarily, the answers deal with the difference mentioned above. + +For example: + +> So, to answer your question, I don’t think average pooling has any significant advantage over max-pooling. But, may be in some cases, where variance in a max pool filter is not significant, both pooling will give same type results. But in extreme cases, max-pooling will provide better results for sure. +> +> [Rahman (n.d.)](https://www.quora.com/What-is-the-benefit-of-using-average-pooling-rather-than-max-pooling/answer/Nouroz-Rahman) + +But also: + +> I would add an additional argument - that max-pooling layers are worse at preserving localization. +> +> Ilan (n.d.) + +Consequently, the only correct answer is this: it is entirely dependent on the problem that you're trying to solve. + +If the position of objects is not important, Max Pooling seems to be the better choice. If it is, it seems that better results can be achieved with Average Pooling. + +* * * + +## Global Max Pooling + +Another type of pooling layer is the Global Max Pooling layer. 
Here, we set the pool size equal to the input size, so that the `max` of the entire input is computed as the output value (Dernoncourt, 2017): + +[![](images/Global-Max-Pooling-3.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Global-Max-Pooling-3.png) + +Or, visualizing it differently: + +[![](images/Global-Max-Pooling-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Global-Max-Pooling-1.png) + +Global pooling layers can be used in a variety of cases. Primarily, it can be used to reduce the dimensionality of the feature maps output by some convolutional layer, to replace Flattening and sometimes even Dense layers in your classifier (Christlein et al., 2019). What's more, it can also be used for e.g. word spotting (Sudholt & Fink, 2016). This is due to the property that it allows detecting _noise_, and thus "large outputs" (e.g. the value 9 in the exmaple above). However, this is also one of the downsides of Global Max Pooling, and like the regular one, we next cover Global Average Pooling. + +* * * + +## Global Average Pooling + +When applying Global Average Pooling, the pool size is still set to the size of the layer input, but rather than the maximum, the average of the pool is taken: + +[![](images/Global-Average-Pooling-2.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Global-Average-Pooling-2.png) + +Or, once again when visualized differently: + +[![](images/Global-Average-Pooling-3.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Global-Average-Pooling-3.png) + +They're often used to replace the fully-connected or densely-connected layers in a classifier. Instead, the model ends with a convolutional layer that generates as many feature maps as the number of target classes, and applies global average pooling to each in order to convert each feature map into one value (Mudau, n.d.). As feature maps can recognize certain elements within the input data, the maps in the final layer effectively learn to "recognize" the presence of a particular class in this architecture. By feeding the values generated by global average pooling into a [Softmax activation function](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/), you once again obtain the multiclass probability distribution that you want. + +What's more, this approach might improve model performance because of the nativeness of the "classifier" to the "feature extractor" (they're both convolutional instead of convolutional/dense), and reduce overfitting because of the fact that there is no parameter to be learnt in the global average pooling layer (Mudau, n.d.). In a different blog post, we'll try this approach and show the results! + +* * * + +## Pooling layers in the Keras API + +Let's now take a look at how Keras represents pooling layers in its API. + +### Max Pooling + +Max Pooling comes in a one-dimensional, two-dimensional and three-dimensional variant (Keras, n.d.). The one-dimensional variant can be used together with Conv1D layers, and thus for temporal data: + +``` +keras.layers.MaxPooling1D(pool_size=2, strides=None, padding='valid', data_format='channels_last') +``` + +Here, the pool size can be set as an integer value through `pool_size`, strides and padding can be applied, and the data format can be set. With strides, which if left `None` will default the `pool_size`, one can define how much the pool "jumps" over the input; in the default case halving it. 
With padding, we may take into account the edges if they were to remain due to incompatibility between pool and input size. Finally, the data format tells us something about the channels strategy (channels first vs channels last) of your dataset. + +Max Pooling is also available for 2D data, which can be used together with Conv2D for spatial data (Keras, n.d.): + +``` +keras.layers.MaxPooling2D(pool_size=(2, 2), strides=None, padding='valid', data_format=None) +``` + +The API is really similar, except for the `pool_size`. It's possible to define it as an integer value (e.g. `pool_size = 3`), but it will be converted to `(3, 3)` internally. Obviously, one can also set a tuple instead, having more flexibility over the shape of your pool. + +3D Max Pooling can be used for spatial or spatio-temporal data (Keras, n.d.): + +``` +keras.layers.MaxPooling3D(pool_size=(2, 2, 2), strides=None, padding='valid', data_format=None) +``` + +Here, the same thing applies for the `pool_size`: it can either be set as an integer value or as a three-dimensional tuple. + +### Average Pooling + +For Average Pooling, the API is no different than for Max Pooling, and hence I won't repeat everything here except for the API representation (Keras, n.d.): + +``` +keras.layers.AveragePooling1D(pool_size=2, strides=None, padding='valid', data_format='channels_last') +keras.layers.AveragePooling2D(pool_size=(2, 2), strides=None, padding='valid', data_format=None) +keras.layers.AveragePooling3D(pool_size=(2, 2, 2), strides=None, padding='valid', data_format=None) +``` + +### Global Max Pooling + +Due to the unique structure of global pooling layers where the pool shape equals the input shape, their representation in the Keras API is really simple. For example, for Global Max Pooling (Keras, n.d.): + +``` +keras.layers.GlobalMaxPooling1D(data_format='channels_last') +keras.layers.GlobalMaxPooling2D(data_format='channels_last') +keras.layers.GlobalMaxPooling3D(data_format='channels_last') +``` + +Here, the only thing to be configured is the `data_format`, which tells us something about the ordering of dimensions in our data, and can be `channels_last` or `channels_first`. + +### Global Average Pooling + +The same can be observed for Global Average Pooling (Keras, n.d.): + +``` +keras.layers.GlobalAveragePooling1D(data_format='channels_last') +keras.layers.GlobalAveragePooling2D(data_format='channels_last') +keras.layers.GlobalAveragePooling3D(data_format='channels_last') +``` + +* * * + +## Conv2D and Pooling example with Keras + +Now that we know what pooling layers are and how they are represented within Keras, we can give an example. For this example, we'll show you the model we created before, to show [how sparse categorical crossentropy worked](https://www.machinecurve.com/index.php/2019/10/06/how-to-use-sparse-categorical-crossentropy-in-keras/). Hence, we don't show you all the steps to creating the model here - click the link to finalize your model. + +But what we do is show you the fragment where pooling is applied. 
Here it is: + +``` +# Create the model +model = Sequential() +model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Conv2D(64, kernel_size=(3, 3), activation='relu')) +model.add(MaxPooling2D(pool_size=(2, 2))) +model.add(Dropout(0.25)) +model.add(Flatten()) +model.add(Dense(256, activation='relu')) +model.add(Dense(no_classes, activation='softmax')) +``` + +Essentially, it's the architecture for our model. Using the Sequential API, you can see that we add Conv2D layers, which are then followed by MaxPooling2D layers with a `(2, 2)` pool size - effectively halving the input every time. The [Dropout layer](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/) helps boost the model's generalization power. + +That's it! Applying pooling layers to Keras models is really easy :) + +* * * + +## Summary + +In this blog post, we saw what pooling layers are and why they can be useful to your machine learning project. Following the general discussion, we looked at max pooling, average pooling, global max pooling and global average pooling in more detail. + +The theory details were followed by a practical section - introducing the API representation of the pooling layers in the Keras framework, one of the most popular deep learning frameworks used today. Finally, we provided an example that used MaxPooling2D layers to add max pooling to a ConvNet. + +I hope you've learnt something from today's blog post. If you did, please let me know. I'm really curious to hear about how you use my content, if you do. In that case, please leave a comment below! 💬👇 Please also drop a message if you have any questions or remarks. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Keras. (n.d.). Pooling Layers. Retrieved from [https://keras.io/layers/pooling/](https://keras.io/layers/pooling/) + +Chollet, F. (2017). _Deep Learning with Python_. New York, NY: Manning Publications. + +Hervatte, S. (n.d.). What is “pooling” in a convolutional neural network? Retrieved from [https://www.quora.com/What-is-pooling-in-a-convolutional-neural-network/answer/Shreyas-Hervatte](https://www.quora.com/What-is-pooling-in-a-convolutional-neural-network/answer/Shreyas-Hervatte) + +Na, X. (n.d.). How exactly does max pooling create translation invariance? Retrieved from [https://www.quora.com/How-exactly-does-max-pooling-create-translation-invariance/answer/Xingyu-Na](https://www.quora.com/How-exactly-does-max-pooling-create-translation-invariance/answer/Xingyu-Na) + +Rahman, N. (n.d.). What is the benefit of using average pooling rather than max pooling? Retrieved from [https://www.quora.com/What-is-the-benefit-of-using-average-pooling-rather-than-max-pooling/answer/Nouroz-Rahman](https://www.quora.com/What-is-the-benefit-of-using-average-pooling-rather-than-max-pooling/answer/Nouroz-Rahman) + +Ilan, S. (n.d.). What is the benefit of using average pooling rather than max pooling? 
Retrieved from [https://www.quora.com/What-is-the-benefit-of-using-average-pooling-rather-than-max-pooling/answer/Shachar-Ilan](https://www.quora.com/What-is-the-benefit-of-using-average-pooling-rather-than-max-pooling/answer/Shachar-Ilan) + +Dernoncourt, F (2017) ([https://stats.stackexchange.com/users/12359/franck-dernoncourt](https://stats.stackexchange.com/users/12359/franck-dernoncourt)), What is global max pooling layer and what is its advantage over maxpooling layer?, URL (version: 2017-01-20): [https://stats.stackexchange.com/q/257325](https://stats.stackexchange.com/q/257325) + +Christlein, V., Spranger, L., Seuret, M., Nicolaou, A., Král, P., & Maier, A. (2019). [Deep Generalized Max Pooling](https://arxiv.org/abs/1908.05040). _arXiv preprint arXiv:1908.05040_. + +Sudholt, S., & Fink, G. A. (2016, October). PHOCNet: A deep convolutional neural network for word spotting in handwritten documents. In _2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR)_ (pp. 277-282). IEEE. + +Mudau, T. ([https://stats.stackexchange.com/users/139737/tshilidzi-mudau](https://stats.stackexchange.com/users/139737/tshilidzi-mudau)), What is global max pooling layer and what is its advantage over maxpooling layer?, URL (version: 2017-11-10): [https://stats.stackexchange.com/q/308218](https://stats.stackexchange.com/q/308218) diff --git a/what-do-convnets-see-visualizing-filters-with-activation-maximization.md b/what-do-convnets-see-visualizing-filters-with-activation-maximization.md new file mode 100644 index 0000000..67d097b --- /dev/null +++ b/what-do-convnets-see-visualizing-filters-with-activation-maximization.md @@ -0,0 +1,350 @@ +--- +title: "What do ConvNets see? Visualizing filters with Activation Maximization" +date: "2019-12-03" +categories: + - "deep-learning" + - "frameworks" +tags: + - "deep-learning" + - "keras" + - "keras-vis" + - "machine-learning" + - "visualization" +--- + +Training a ConvNet can be equal to training a black box: you start the training process, get a model that performs (or not) and that's it. It's then up to you to find out what is possibly wrong, and whether it can be improved any further. This is difficult, since you cannot look inside the black box. + +Or can you? In the past few years, many techniques have emerged that allow you to _take a look inside that black box!_ + +In this blog post, we'll cover Activation Maximization. It can be used to generate a 'perfect representation' for some aspect of your model - and in this case, convolutional filters. We provide an example implementation with `keras-vis` for visualizing your Keras CNNs, and show our results based on the VGG16 model. + +All right - let's go! 😎 + +\[toc\] + +## Recap: what are convolutional filters? + +I find them interesting, these **convolutional neural networks** - you feed them image-like data, they start learning, and you may end up with a model that can correctly identify objects within real images, or classify the real images as a whole. + +However, it's important to understand how convolutional neural networks work if we wish to understand how we can visualize their _filters_ with Activation Maximization (which we will also cover next). 
+ +If you wish to understand convolutional neural networks in more detail, I would like to recommend you read these two blogs: + +- [Convolutional Neural Networks and their components for computer vision](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) +- [Understanding separable convolutions](https://www.machinecurve.com/index.php/2019/09/23/understanding-separable-convolutions/) + +However, if you already have a slight understanding about them or only need to reiterate your existing knowledge, hang on tight - as we'll give you a crash course ConvNets here. + +Recall that this is the generic structure of a ConvNet: + +![](images/CNN-1.png) + +The input might be a W x H RGB image, meaning that the input to the ConvNet is three-dimensional: the width, the height and the red, blue and green channels. + +Once the data is input, it passes through **N kernels** (where N is an integer number, such as 3) or **filters** that have the same dimension. These kernels slide over the input data, performing element-wise multiplications, generating **N feature maps** of width Wfm and height Hfm, depending on the size of the kernel. + +This convolutional operation is often followed by pooling operations, possibly by other convolutional operations, and likely, finally, by densely-connected neural operations, to generate a prediction. It is hence part of the [high-level supervised learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process). + +This also sheds light on how ConvNets actually learn. We saw that for any input, the kernels help determine the feature map. The kernels thus contain the patterns that the model has learnt. As after training the kernels are kept constant, they drive the predictions for all the inputs when a model is put into production (possibly augmented with the weights from the densely-connected layers - it's important to know that convolutions and Dense layers are often combined in ConvNets.) + +But how does it learn? And what does it learn? Even though it might sound difficult, it's actually pretty simple. We know that the kernels contain the learnt information. They thus need to be adapted when learning needs to take place. From the high-level supervised learning process, the concept of a [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions), and the concept of an [optimizer](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/), we know that: + +1. Data is fed forward in full batches, minibatches or a stochastic (single-item) fashion. +2. For every sample, a prediction is generated. +3. The average difference between the predictions and the true targets (which are known in supervised settings) determines how _bad_ the model performs, or - in other words - how high its _loss_ is. How this is computed is determined by the choice of loss function. +4. With backpropagation, the error displayed by the loss can be computed backwards to each neuron, computing what is known as a _gradient_, or the change of loss with respect to changes in neurons. +5. With the optimizer, the (negative of the) computed gradient is applied to the neuron's weights, changing them and likely improving the model as a result. 
The choice of optimizer (such as [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or [adaptive optimizers](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/)) determines how gradients are applied. + +Kernels are nothing but neurons structured differently. Hence, learning can take place by shifting neuron weights, which means that the high-level supervised learning process is responsible for changing the neurons. Recall that kernels are also called filters every now and then. With that in mind, let's now take a look at the concept of Activation Maximization - which we can use to visualize these filters. + +## Recap: what is Activation Maximization? + +In different blog post, we used **Activation Maximization** [to visualize the perfect input to produce some class prediction](https://www.machinecurve.com/index.php/2019/11/18/visualizing-keras-model-inputs-with-activation-maximization/). This is a really powerful idea: we derive whether the model has learnt correctly by _generating some input_ that _maximizes_ _activations_ in order to produce _some output_, which we set in advance - to some class that we wish to check. Really nice results! + +But how does Activation Maximization work? The principle is simple: + +- You keep the output class constant, you keep the weights constant, and change the input to find maximum activations for the constant class. +- If the generated input, which is the 'perfect input for some class' given the trained model, looks accurate, then you can be more confident that the model has learnt correctly. +- If it doesn't, you might wish to inspect learning in more detail with e.g. [TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/), [saliency maps](https://www.machinecurve.com/index.php/2019/11/25/visualizing-keras-cnn-attention-saliency-maps/) or [Grad-CAMs](https://www.machinecurve.com/index.php/2019/11/28/visualizing-keras-cnn-attention-grad-cam-class-activation-maps/), to identify where the model attends and which layers contribute to learning. +- You might do the same with [Keract output visualizations](https://www.machinecurve.com/index.php/2019/12/02/visualize-layer-outputs-of-your-keras-classifier-with-keract/), or read on, to learn how to visualize ConvNet filters with Activation Maximization. + +## Introducing `keras-vis` + +Today, we'll be creating ConvNet filter visualizations with Keras, the deep learning framework that is deeply integrated with TensorFlow and originally created by François Chollet. We're going to use `keras-vis` for this purpose, which is a third-party toolkit for visualizing Keras models, supporting Activation Maximization, Saliency Maps and Grad-CAM class activation maps. + +Or, in their words: + +> keras-vis is a high-level toolkit for visualizing and debugging your trained keras neural net models. +> +> [https://github.com/raghakot/keras-vis](https://github.com/raghakot/keras-vis) + +We will use it to visualize what a Keras based ConvNet sees through (some of its) filters, by means of Activation Maximization. + +### Installing `keras-vis` + +The first step is installing `keras-vis`. Unfortunately, it is a little bit less straight-forward than performing a `pip install keras-vis`. That is due to the status of the `pip` package: it's not up to date, and hence doesn't run with newer Keras versions. + +Fortunately, there is an escape. 
+ +It's actually rather simple, too: first, open up a terminal, preferably the terminal where you have access to all the other dependencies (Python, Keras, and so on). Second, run this command: + +``` +pip install https://github.com/raghakot/keras-vis/archive/master.zip +``` + +It still uses `pip` to install `keras-vis`, but simply installs the most recent version from the [Github](https://github.com/raghakot/keras-vis) repository. + +When you see this (or anything more recent than `0.5.0`), you've successfully installed `keras-vis`: + +``` +>pip install https://github.com/raghakot/keras-vis/archive/master.zip +Collecting https://github.com/raghakot/keras-vis/archive/master.zip + Downloading https://github.com/raghakot/keras-vis/archive/master.zip + \ 58.1MB 819kB/s +Building wheels for collected packages: keras-vis + Building wheel for keras-vis (setup.py) ... done +Successfully built keras-vis +Installing collected packages: keras-vis +Successfully installed keras-vis-0.5.0 +``` + +## Today's model: VGG16 + +Now, let's take a look at today's model. Contrary to other posts, where we used a simple Convolutional Neural Network for visualization purposes (e.g. in our other [Activation Maximization post](https://www.machinecurve.com/index.php/2019/11/18/visualizing-keras-model-inputs-with-activation-maximization/)), we don't use a simple one here today. This is due to the nature of this post: we're interested in generating filter visualizations that clearly differ in their level of abstraction, yet are similar enough to the task that we can include them here. + +Fortunately, the Keras framework comes to the rescue; more specifically, the `keras.applications` module ([Github here](https://github.com/keras-team/keras-applications)). It is delivered with various model architectures included. That's perfect for our task today! 🎉 + +We're using the **VGG16 model** today. This model, which was created by scientists at the Visual Geometry Group (hence VGG) at the University of Oxford, participated in the [ImageNet Large Scale Visual Recognition Challenge of 2014](http://image-net.org/challenges/LSVRC/2014/). It uses many weight layers (sixteen, in our case - hence VGG16) and achieved substantial accuracies in the 2014 competition. If you wish to read more about VGG16, [click here for an excellent resource](https://neurohive.io/en/popular-networks/vgg16/). + +ConvNets can be trained on any dataset. However, what often happens is that large-scale datasets are used for pretraining, only to be slightly altered by subsequent training afterwards, possibly for another purpose - a process called Transfer Learning. Weights trained on these large-scale datasets therefore often come delivered with such models, and that's the case here too. The Keras `VGG16` model can be used directly while it is initialized with [weights trained on the ImageNet dataset](https://keras.io/applications/#vgg16), if you wish. Today, we'll do precisely that: visualizing filters of the VGG16 model when it's initialized on the ImageNet dataset. + +Let's go! 😀 + +## Creating ConvNet filter visualizations + +### What you'll need to run this code + +As usual, you'll need to install a set of dependencies if you wish to run the model & the visualization code: + +- **Python**, as we will create Python code. Preferably, use version 3.6+. I can't guarantee that it works with older versions. +- **Keras**, which is the deep learning framework that we will use today.
+- One of the Keras backends - being Tensorflow, Theano or CNTK - and of these, preferably **Tensorflow**. +- **Keras-vis**, which you can install by following the [instructions above](#installing-keras-vis). +- **Matplotlib**, for saving the visualizations to disk as images. +- **Numpy**, for number processing. + +### Imports & VGG16 initialization + +Now, let's write some code! 😎 + +To start, create a file in some directory, e.g. `activation_maximization_filters.py`. + +Open this file in the code editor of your choice, and write with me: + +``` +''' + ConvNet filter visualization with Activation Maximization on exemplary VGG16 Keras model +''' +from keras.applications import VGG16 +from vis.utils import utils +from vis.visualization import visualize_activation, get_num_filters +from vis.input_modifiers import Jitter +import matplotlib.pyplot as plt +import numpy as np +import random +import os.path +``` + +These are the imports that you'll need for today's tutorial: + +- You import `VGG16` from `keras.applications`, which is the model that we're using today. +- From `keras-vis`, you'll import `utils` (for finding the layer index of the layer to be visualized later), `visualize_activation` and `get_num_filters` (for the visualization part) and `Jitter` (to boost image quality). +- You'll import the PyPlot API from Matplotlib into `plt`. +- Numpy is imported for number processing, `random` is used for drawing a random sample, and `os.path` for selecting the path to write the images to. + +Now, let's define the name of the folder which we'll be writing into: + +``` +# Define the folder name to save into +folder_name = 'filter_visualizations' +``` + +Then define the model: + +``` +# Define the model +model = VGG16(weights='imagenet', include_top=True) +``` + +This loads the `VGG16` model into the `model` variable and initializes it with weights trained on the ImageNet dataset. With `include_top`, the densely-connected layers that generate the prediction are included; if set to `False`, you'll only get the convolutional layers. The latter is especially useful when you wish to use pretrained Keras models as your convolutional base, to train additional layers further. + +### Generating visualizations + +Next, we can generate some visualizations! + +``` +# Iterate over multiple layers +for layer_nm in ['block1_conv1', 'block2_conv1', 'block3_conv2', 'block4_conv1', 'block5_conv2']: +``` + +This part means that your code will iterate over an array that contains various layers: + +- The first Conv layer of the first Conv block; +- The first Conv layer of the second Conv block; +- The second Conv layer of the third Conv block; +- The first Conv layer of the fourth Conv block; +- The second Conv layer of the fifth Conv block. + +That is, it will generate visualizations for (a random selection of) the filters that are part of these blocks. In your case, you may choose any blocks. This, however, comes in two flavors. When using Keras pretrained models, you can look for the layer names in the code available for these models - such as for the VGG16 at Keras' GitHub (search for 'block1\_conv1' on [this page](https://github.com/keras-team/keras-applications/blob/master/keras_applications/vgg16.py), to give you an example).
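If you'd rather not dig through the source code, a quick alternative - shown here as a small sketch, separate from the main script - is to print the layer names of the loaded model directly:

```
from keras.applications import VGG16

# Load VGG16 with ImageNet weights and list every layer name,
# so you can pick the layers you wish to visualize
model = VGG16(weights='imagenet', include_top=True)
for layer in model.layers:
    print(layer.name)
```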
When you do however visualize the Conv filters of your own models, you'll have to name layers yourself when you stack the architecture: + +``` +model.add(Dense(no_classes, activation='softmax', name='dense_layer')) +``` + +When adding these names to the array above, you'll ensure that you're visualizing the correct layers. + +The following code is part of the iteration, which means that it runs every time the loop is activated (in our case, five times, for five layers): + +``` + # Find the particular layer + layer_idx = utils.find_layer_idx(model, layer_nm) +``` + +...this `keras-vis` util finds the correct layer index for the name that we specify. `layer_nm`, in this case, is one of the layer names in the array, e.g. `block1_conv1`. + +``` + # Get the number of filters in this layer + num_filters = get_num_filters(model.layers[layer_idx]) +``` + +We then retrieve the number of filters in this layer. This is also done by applying a nice Keras-vis util. + +Then, we select six filters randomly (with replacement, so there's a small chance that you visualize one or two filters twice - but I've found that this doesn't really happen given the large number of filters present in VGG16. For your own model, this may be different): + +``` + # Draw 6 filters randomly + drawn_filters = random.choices(np.arange(num_filters), k=6) +``` + +Finally, we visualize each filter drawn: + +``` + # Visualize each filter + for filter_id in drawn_filters: + img = visualize_activation(model, layer_idx, filter_indices=filter_id, input_modifiers=[Jitter(16)]) + plt.imshow(img) + img_path = os.path.join('.', folder_name, layer_nm + '_' + str(filter_id) + '.jpg') + plt.imsave(img_path, img) + print(f'Saved layer {layer_nm}/{filter_id} to file!') +``` + +- We iterate over every filter. +- We visualize its activation with `keras_vis`, for the particular `filter_id`, modifying the input with `Jitter` to make the images more clear. +- Subsequently, we draw the array (which is returned by `keras-vis`) with Matplotlib. +- We don't show the visualization on screen, but rather save them to `img_path` (based on the `folder_name`, `layer_nm` and `filter_id` properties) with Matplotlib. +- Finally, we print that the layer was visualized and that the visualization was saved - allowing you to take a look at what was generated! + +## Results + +In my case, these were the results: + +### Block1Conv1 + +For the first Conv layer in the first Conv block, the results are not very detailed. However, filters clearly distinguish from each other, as can be seen from the results: + +- [![](images/block1_conv1_2.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block1_conv1_2.jpg) + +- [![](images/block1_conv1_5.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block1_conv1_5.jpg) + +- [![](images/block1_conv1_11.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block1_conv1_11.jpg) + +- [![](images/block1_conv1_12.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block1_conv1_12.jpg) + +- [![](images/block1_conv1_15.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block1_conv1_15.jpg) + +- [![](images/block1_conv1_25.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block1_conv1_25.jpg) + + +### Block2Conv1 + +In the second block, a little bit more detail becomes visible. Certain stretched patterns seem to be learnt by the filters. 
+ +- [![](images/block2_conv1_26.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block2_conv1_26.jpg) + +- [![](images/block2_conv1_33.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block2_conv1_33.jpg) + +- [![](images/block2_conv1_39.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block2_conv1_39.jpg) + +- [![](images/block2_conv1_84.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block2_conv1_84.jpg) + +- [![](images/block2_conv1_97.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block2_conv1_97.jpg) + +- [![](images/block2_conv1_100.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block2_conv1_100.jpg) + + +### Block3Conv2 + +This gets even clearer in the third block. The stretches are now combined with clear patterns, and even blocky representations, like in the center-bottom visualization. + +- [![](images/block3_conv2_3.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block3_conv2_3.jpg) + +- [![](images/block3_conv2_17.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block3_conv2_17.jpg) + +- [![](images/block3_conv2_21.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block3_conv2_21.jpg) + +- [![](images/block3_conv2_123.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block3_conv2_123.jpg) + +- [![](images/block3_conv2_162.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block3_conv2_162.jpg) + +- [![](images/block3_conv2_185.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block3_conv2_185.jpg) + + +### Block4Conv1 + +Details become visible in the fourth convolutional block. It's still difficult to identify real objects in these visualizations, though. + +- [![](images/block4_conv1_69.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block4_conv1_69.jpg) + +- [![](images/block4_conv1_78.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block4_conv1_78.jpg) + +- [![](images/block4_conv1_97.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block4_conv1_97.jpg) + +- [![](images/block4_conv1_100.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block4_conv1_100.jpg) + +- [![](images/block4_conv1_294.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block4_conv1_294.jpg) + +- [![](images/block4_conv1_461.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block4_conv1_461.jpg) + + +### Block5Conv2 + +This latter becomes possible in the visualizations generated from the fifth block. We see eyes and other shapes, which clearly resemble the objects that this model was trained to identify. + +- [![](images/block5_conv2_53.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block5_conv2_53.jpg) + +- [![](images/block5_conv2_136.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block5_conv2_136.jpg) + +- [![](images/block5_conv2_222.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block5_conv2_222.jpg) + +- [![](images/block5_conv2_247.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block5_conv2_247.jpg) + +- [![](images/block5_conv2_479.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block5_conv2_479.jpg) + +- [![](images/block5_conv2_480.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/12/block5_conv2_480.jpg) + + +This clearly illustrates that the model learns very detailed patterns near the output, i.e. 
in the final layers of the model, whereas more global and abstract ones are learnt in the early layers. It now makes perfect sense why the first two or perhaps three layers of ImageNet trained models are often used in practical settings in order to boost training accuracy: the patterns that are learnt are so general that they do not necessarily represent the _object in question_, but rather the _shape in question_. While both the sun, a football and a volleyball are round, we don't know whether an input is any of those in the first few layers. We do know, however, that it's _round_. + +## Summary + +In this blog post, we've seen how we can use Activation Maximization to generate visualizations for filters in our CNNs, i.e. convolutional neural networks. We provided an example that demonstrates this by means of the `keras-vis` toolkit, which can be used to visualize Keras models. + +I hope you've learnt something today! 😀 If you did, or if you have any questions or remarks, please feel free to leave a comment in the comments box below 👇 Thank you for reading MachineCurve today and happy engineering! 😎 + +## References + +Kotikalapudi, Raghavendra and contributors. (2017). Github / keras-vis. Retrieved from [https://github.com/raghakot/keras-vis](https://github.com/raghakot/keras-vis) + +Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. _[arXiv preprint arXiv:1409.1556](https://arxiv.org/abs/1409.1556)_. + +VGG16 - Convolutional Network for Classification and Detection. (2018, November 21). Retrieved from [https://neurohive.io/en/popular-networks/vgg16/](https://neurohive.io/en/popular-networks/vgg16/) diff --git a/what-is-a-learning-rate-in-a-neural-network.md b/what-is-a-learning-rate-in-a-neural-network.md new file mode 100644 index 0000000..ee664e5 --- /dev/null +++ b/what-is-a-learning-rate-in-a-neural-network.md @@ -0,0 +1,130 @@ +--- +title: "What is a Learning Rate in a Neural Network?" +date: "2019-11-06" +categories: + - "buffer" + - "deep-learning" +tags: + - "artificial-intelligence" + - "backpropagation" + - "deep-learning" + - "gradient-descent" + - "learning-rate" + - "machine-learning" + - "neural-networks" + - "optimizer" +--- + +When creating deep learning models, you often have to configure a _learning rate_ when setting the model's hyperparameters, i.e. when you are configuring your neural network. + +Every time you do that, you might actually wonder like me at first about this: **what is a learning rate?** + +Why _is it there_? And how can you configure it? + +We'll take a look at these questions in this blog post. This requires that we'll take a look at how models optimize first. We do so along the high-level machine learning process that we defined in another blog post. + +Subsequently, we move on with learning rates - both how they work and what they do conceptually _and_ what types of learning rates exist in today's deep learning engineers' toolboxes. + +**After reading this article, you will...** + +- Understand at a high level how models optimize. +- See how learning rates can be used to tune the amount of learning. +- Know what types of learning rates can be used in neural networks. + +Let's go! 😊 + +**Update 01/Mar/2021:** ensure that article is up to date in 2021. + +**Update 01/Feb/2020:** added link to [Learning Rate Range Test](https://www.machinecurve.com/index.php/2020/02/20/finding-optimal-learning-rates-with-the-learning-rate-range-test/). 
+ +* * * + +\[toc\] + +* * * + +## How models optimize + +If we wish to understand what learning rates are and why they are there, we must first take a look at the [high-level machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) for supervised learning scenarios: + +[![](images/High-level-training-process-1024x973.jpg)](https://www.machinecurve.com/wp-content/uploads/2019/09/High-level-training-process.jpg) + +### Feeding data forward and computing loss + +As you can see, neural networks improve iteratively. This is done by feeding the training data forward, generating a prediction for every sample fed to the model. When comparing the predictions with the actual (known) targets by means of a [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/), it's possible to determine how well (or, strictly speaking, how badly) the model performs. + +### Changing model weights with gradient updates + +Subsequently, before starting the second iteration, the model will slightly adapt its internal structure - the weights for each neuron - by using gradients of the loss landscape. With a technique called _backpropagation_, the gradient of the loss with respect to each neuron's weights is computed, by propagating the error backwards through the neurons that lie between that neuron and the output. Backprop allows you to compute these gradients efficiently by smartly using the chain rule, which you've likely encountered in calculus class. + +However, this is always combined with what is known as an _optimizer_, which effectively performs the update. There are many optimizers: [three forms of gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/), where you simply move in the opposite direction of the gradient, are the simplest ones. With [adaptive ones](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), improvements to gradient descent are combined with per-neuron updates. + +However - it's not a good idea for weight updates to be large. When weights swing back and forth, you likely follow a heavily oscillating path towards your global minimum. Additionally, when updates are large, you might continuously overshoot the optimum, getting worse performance than necessary! + +...here's where the learning rate enters the picture 😄 + +* * * + +## Configuring how much is learnt with Learning Rates + +At a highly abstract level, a weight update can be written down as follows: + +`new_weight = old_weight - learning rate * gradient update` + +You take the old weight and subtract the gradient update - but wait: you first multiply the update with the learning rate. + +This learning rate, which you can configure before you start the training process, allows you to make the gradient update smaller. By default, for example in the Stochastic Gradient Descent optimizer built into the Keras deep learning framework, learning rates are relatively small - `0.01` is the default value in Keras' SGD. That essentially means that the _real weight update_ is by default only 1% of the computed gradient update. + +Yep, it'll take you longer to converge, but you're less likely to overshoot and you oscillate less severely across epochs! + +### Types of Learning Rates + +The example above depicts what is known as a _fixed learning rate_.
You set the learning rate in advance and it doesn't change over the epochs. This has both benefits and disbenefits. The primary benefit is that you have to think about your learning rate in very simple terms: you choose one number and that's it. + +And as we shall investigate more deeply in another blog, this is also the drawback of a fixed learning rate. As you know, neural networks learn exponentially during the first few epochs - and fixed learning rates may then be _too small_, which means that you waste resources in terms of opportunity cost. + +[![](images/huber_loss_d1.5-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/huber_loss_d1.5.png) + +Loss values for some training process. As you can see, substantial learning took place initially, changing into slower learning eventually. + +However, towards the more final stages of the learning process, you don't want large learning rates because the learning process slows down. Only gentle and small updates might bring you closer to the minimum, which requires small learning rates. In any other case, you may overshoot the minimum, with worse than possible performance as a result. + +Enter another type of learning rate: a _learning rate decay scheme_. This essentially means that your learning rate gets smaller over time, allowing you to start with a relatively large one, benefiting both from the substantial improvements in the first few epochs and the more gradual ones towards the end. + +There are many options here: it's possible to have your learning rate decay exponentially, linearly, or with some other function. + +It's better than a fixed learning rate for obvious reasons, but learning rate decay schemes suffer from a drawback that also impacts fixed learning rates: the fact that you have to configure them in advance. This is essentially a guess, because you then don't know your exact loss landscape yet. And with any guess, the results may be good, but also disastrous. + +Fortunately, there's also something as a _[Learning Rate Range Test](https://www.machinecurve.com/index.php/2020/02/20/finding-optimal-learning-rates-with-the-learning-rate-range-test/)_, which we'll also cover in a subsequent blog. With this range test, you essentially test average model performance across a range of learning rates. This results in a plot that allows you to pick a starting learning rate based on empirical testing, which you can subsequently use in e.g. a learning rate decay scheme. + +Another type of learning rate we'll cover in another blog is the concept of a _Cyclical Learning Rate_. In this case, the learning rate moves back and forth between a very high and a very low learning rate, in between some bounds that you can specify using the same _range test_ as discussed previously. This is contradictory to the concept of a large learning rate at first and a small one towards the final epochs, but it actually makes a lot of sense. With larger learning rates throughout the entire training process, you can both speed up your training process in the early stages _and_ find an escape route if you're stuck in local minima. Smaller learning rates, which will inevitably follow the larger ones, will then allow you to look around for some time, taking smaller steps towards the minimum close by. Empirical results have shown promising results. 
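Before moving on, here is a minimal sketch of how the first two types - a fixed learning rate and a simple decay scheme - might be configured for Keras' Stochastic Gradient Descent optimizer. The argument names (`lr`, `decay`) follow the classic Keras 2 API and may differ in your version, and `model` is assumed to be a Keras model you have already defined:

```
from keras.optimizers import SGD

# Fixed learning rate: every update applies 1% of the computed gradient
sgd_fixed = SGD(lr=0.01)

# Simple time-based decay: the learning rate shrinks a little after every batch,
# so early epochs learn substantially while later epochs take smaller steps
sgd_decay = SGD(lr=0.1, decay=1e-4)

model.compile(optimizer=sgd_decay,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```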
+ +Especially when you combine decaying learning rates and cyclical learning rates with early cutoff techniques such as [EarlyStopping](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/), it's very much possible to find a well-performing model without risking severe overfitting. + +* * * + +## Summary + +In this blog post, we've looked at the concept of a learning rate at a high level. We explained why they are there in terms of the high-level supervised machine learning process and how they are combined with feeding data forward and model optimization. + +Subsequently, we looked at some types of learning rates that are available and common today: fixed learning rates, learning rate decay schemes, the [Learning Rate Range Test](https://www.machinecurve.com/index.php/2020/02/20/finding-optimal-learning-rates-with-the-learning-rate-range-test/) which can be combined with either learning rate decay _or_ Cyclical Learning Rates, which are an entirely different approach to learning. + +Thanks for reading! If you have any questions or remarks, feel free to leave a comment below 👇 I'll happily answer whenever I can, and will update and/or improve my blog post if necessary. + +* * * + +## References + +Smith, L. N. (2017, March). [Cyclical learning rates for training neural networks.](https://ieeexplore.ieee.org/abstract/document/7926641/) In _2017 IEEE Winter Conference on Applications of Computer Vision (WACV)_ (pp. 464-472). IEEE. + +Smith, L. N., & Topin, N. (2017). Exploring loss function topology with cyclical learning rates. _[arXiv preprint arXiv:1702.04283](https://arxiv.org/abs/1702.04283)_[.](https://arxiv.org/abs/1702.04283) + +Smith, S. L., Kindermans, P. J., Ying, C., & Le, Q. V. (2017). Don't decay the learning rate, increase the batch size. _[arXiv preprint arXiv:1711.00489](https://arxiv.org/abs/1711.00489)_[.](https://arxiv.org/abs/1711.00489) + +MachineCurve. (2019, October 22). About loss and loss functions. Retrieved from [https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) + +MachineCurve. (2019, October 24). Gradient Descent and its variants. Retrieved from [https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) + +MachineCurve. (2019, November 3). Extensions to Gradient Descent: from momentum to AdaBound. Retrieved from [https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) + +Jain, V. (2019, August 5). Cyclical Learning Rates ? The ultimate guide for setting learning rates for Neural Networks. Retrieved from [https://medium.com/swlh/cyclical-learning-rates-the-ultimate-guide-for-setting-learning-rates-for-neural-networks-3104e906f0ae](https://medium.com/swlh/cyclical-learning-rates-the-ultimate-guide-for-setting-learning-rates-for-neural-networks-3104e906f0ae) diff --git a/what-is-a-variational-autoencoder-vae.md b/what-is-a-variational-autoencoder-vae.md new file mode 100644 index 0000000..b81eb89 --- /dev/null +++ b/what-is-a-variational-autoencoder-vae.md @@ -0,0 +1,350 @@ +--- +title: "What is a Variational Autoencoder (VAE)?" 
+date: "2019-12-24" +categories: + - "deep-learning" +tags: + - "autoencoder" + - "deep-learning" + - "deep-neural-network" + - "generative-models" + - "machine-learning" + - "variational-autoencoder" +--- + +Suppose that you have an image of a man with a moustache and one of a man without one. You feed them to a segment of a neural network that returns an approximation of the most important features that determine the image, once per image. You then smartly combine the two approximations into one, which you feed to another part of the same neural network... + +...what do you see? + +If you've trained the neural network well, there's a chance that the output is the _man without moustache, but then with the other person's moustache_. + +Sounds great, doesn't it? + +[![](images/image-300x203.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/image.png) + +_Popularity of VAEs in [Google Trends](https://trends.google.com/trends/explore?date=today%205-y&q=variational%20autoencoder)._ + +Like [GANs](https://www.machinecurve.com/index.php/2019/07/17/this-person-does-not-exist-how-does-it-work/), **Variational Autoencoders** **(VAEs)** can be used for this purpose. Being an adaptation of classic autoencoders, which are used for dimensionality reduction and input denoising, VAEs are _generative_. Unlike the classic ones, with VAEs you can use what they've learnt in order to generate _new samples_. Blends of images, predictions of the next video frame, synthetic music - the list goes on. + +..and on! VAEs have been rising in popularity over the last few years. Let's investigate them in more detail 😁 + +In this blog post, we'll take a _generative_ view towards VAEs. Although strictly speaking, VAEs are autoencoders and can also be used for e.g. denoising, we already have posts about such applications - specifically for [image denoising](https://www.machinecurve.com/index.php/2019/12/20/building-an-image-denoiser-with-a-keras-autoencoder-neural-network/) and [signal denoising](https://www.machinecurve.com/index.php/2019/12/19/creating-a-signal-noise-removal-autoencoder-with-keras/). Here, we'll focus on how to use VAEs for generative purposes. + +This means first covering traditional (or, _vanilla_) autoencoders. What types do exist? And what are they used for? We'll see that they have very interesting applications. But we'll also find out what their limitations are. When your goal is to generate new content, it's difficult if not impossible to use these classic autoencoders. We'll also cover why this is the case. + +We then introduce Variational Autoencoders. We'll cover what they are, and how they are different from traditional autoencoders. The two primary differences - that samples are encoded as two vectors that represent a probability distribution over the latent space rather than a point in latent space _and_ that [Kullback-Leibler divergence](https://www.machinecurve.com/index.php/2019/12/21/how-to-use-kullback-leibler-divergence-kl-divergence-with-keras/) is added to optimization - will be covered in more detail. Through these, we'll see why VAEs are suitable for generating content. + +As an extra, this blog also includes some examples of data generated with VAEs. + +Are you ready? + +Let's go! 😎 + +**Update 08/Dec/2020:** added references to PCA article. + +* * * + +\[toc\] + +* * * + +## About normal autoencoders + +Before we can introduce Variational Autoencoders, it's wise to cover the general concepts behind autoencoders first. 
Those are valid for VAEs as well, but also for the vanilla autoencoders we talked about in the introduction. + +At a high level, this is the architecture of an autoencoder: + +[![](images/Autoencoder.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/Autoencoder.png) + +It takes some data as input, encodes this input into an encoded (or latent) state and subsequently recreates the input, sometimes with slight differences (Jordan, 2018A). + +Autoencoders have an _encoder segment_, which is the mapping between the input data and the encoded or latent state, and a _decoder segment_, which maps between latent state and the reconstructed output value. + +Reconstructions may be the original images: + +[![](images/4-1.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/4-1.png) + +But autoencoders may also be used for [noise reduction](https://www.machinecurve.com/index.php/2019/12/20/building-an-image-denoiser-with-a-keras-autoencoder-neural-network/): + +[![](images/1-5.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/1-5.png) + +The fun thing about autoencoders is that the encoder and decoder segments are _learnt_, because neural networks are used to implement them. They are trained together with the other parts of the network. Usually, the networks as a whole use loss functions such as Mean Squared Error or [Crossentropy loss](https://www.machinecurve.com/index.php/2019/10/22/how-to-use-binary-categorical-crossentropy-with-keras/) (Shafkat, 2018). This way, autoencoders will be very data-specific. This is good news when you wish to have e.g. a tailor-made denoiser, but becomes challenging when you want to use the learnt encoding across various projects. In those cases, e.g. generalized denoising functions such as mean/median sample removal may be more suitable to your problem. + +Let's now take a look at classic autoencoders in more detail and how they are used, so that we can understand why they are problematic if we want to generate new content. + +### Types of vanilla / traditional autoencoders + +Jordan (2018B) defines multiple types of traditional autoencoders: among them, undercomplete autoencoders, sparse autoencoders and denoising autoencoders. Myself, I'd like to add _convolutional autoencoders_ to this list, as well as _recurrent_ autoencoders. They effectively extend undercomplete and sparse autoencoders by using convolutional or recurrent layers instead of Dense ones. + +**Undercomplete** autoencoders involve creating an information bottleneck, by having hidden layers with many fewer neurons than the input and output layers. This way, the neural network is forced to compress much information in fewer dimensions (Jordan, 2018B) - exactly the goal of an autoencoder when generating the encoding. + +[![](images/undercomplete.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/undercomplete.png) + +**Sparse autoencoders**, on the other hand, do have an equal number of neurons in their hidden layers compared to input and output neurons, only not all of them are used or do contribute to the training process (Jordan, 2018B). Regularization techniques like L1 regularization or [Dropout](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/) can serve this purpose, effectively creating the information bottleneck once more. 
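As a small and purely illustrative sketch - the layer sizes and the L1 strength below are hypothetical, not taken from a trained model - such a regularized bottleneck could look as follows in Keras:

```
from keras.layers import Input, Dense
from keras.models import Model
from keras import regularizers

# Sparse autoencoder sketch for 784-pixel inputs (e.g. MNIST):
# the hidden layer is as wide as the input, but L1 activity regularization
# pushes most activations towards zero, creating the information bottleneck.
inputs = Input(shape=(784,))
encoded = Dense(784, activation='relu',
                activity_regularizer=regularizers.l1(1e-5))(inputs)
decoded = Dense(784, activation='sigmoid')(encoded)

sparse_autoencoder = Model(inputs, decoded)
sparse_autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
```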
+ +[![](images/sparse.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/sparse.png) + +When using **Denoising autoencoders**, the goal is no longer to _reconstruct the input data_. Rather, your goal has become _denoising the input data_ by learning the noise (Jordan, 2018B). This is achieved by adding noise to pure inputs, feeding them as samples, while having the original pure samples as targets. Minimizing reconstruction loss then involves learning the noise. At MachineCurve, we have available examples for [signal noise](https://www.machinecurve.com/index.php/2019/12/19/creating-a-signal-noise-removal-autoencoder-with-keras/) and [image noise](https://www.machinecurve.com/index.php/2019/12/20/building-an-image-denoiser-with-a-keras-autoencoder-neural-network/). + +[![](images/3-3.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/3-3.png) + +While traditionally densely-connected layers (or Dense layers) have been used for autoencoders, it's of course also possible to use **convolutional** or **recurrent** layers when creating them. The convolutional ones are useful when you're trying to work with image data or image-like data, while the recurrent ones can e.g. be used for discrete and sequential data such as text. + +### What are normal autoencoders used for? + +There are two main applications for traditional autoencoders (Keras Blog, n.d.): + +- **Noise removal**, as we've seen above. +- **Dimensionality reduction**. As the _encoder segment_ learns representations of your input data with much lower dimensionality, the encoder segments of autoencoders are useful when you wish to perform dimensionality reduction. This can especially be handy when e.g. [PCA](https://www.machinecurve.com/index.php/2020/12/07/introducing-pca-with-python-and-scikit-learn-for-machine-learning/) doesn't work, but you suspect that nonlinear dimensionality reduction does (i.e. using neural networks with nonlinear activation functions). + +You may now think: I have an idea! 💡 It goes as follows: + +_"Okay, my autoencoder learns to map inputs to an encoded representation (the latent state), which is subsequently re-converted into some output. Can't I generate new outputs, then, when I feed a randomly sampled encoded state to the decoder segment of my autoencoder?"_ + +It's a good idea, because intuitively, the decoder must be capable of performing similarly to the generator of a GAN when trained (Rocca, 2019). + +But the answer is _no_ 😥. Traditional autoencoders cannot be used for this. We'll now investigate why. + +* * * + +## The Content Generation problem + +Yes: generating new content with traditional autoencoders is quite challenging, if not impossible. This has to do with how classic autoencoders map their input to the latent space and how the encoded state is represented. If this seems like abracadabra to you - don't worry. I'll try to explain it in plainer English now 😀 + +### How classic autoencoders map input to the latent space + +To illustrate the point, I've trained a classic autoencoder where the encoded state has only 2 dimensions. This allows us to plot digits with Matplotlib. Do note that going from 784 to 2 dimensions is a substantial reduction and will likely lead to more information loss than strictly necessary (indeed, the loss value stalled at around \[latex\]\\approx 0.25\[/latex\], while in a similar network a loss of \[latex\]\\approx 0.09\[/latex\] could be achieved). + +The plot of our encoded space - or latent space - looks as follows.
Each color represents a class: + +[![](images/classic_autoencoder-1024x853.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/classic_autoencoder.png) + +Some classes (the zeroes and especially the ones) are discriminative enough in order to be mapped quite successfully. Others, such as nines, eights and sevens, are less discriminative. This explains the relatively high loss. + +### Continuity and completeness + +However, let's go back to content generation. If we do wish to create new content, we really want our latent space to satisfy two criteria (Rocca, 2019): + +- It must be **continuous**. This means that two close points in the latent space should give two similar outputs when decoded. +- It must be **complete**. This means that a point sampled from the distribution should produce an output that makes sense. + +The thing with classic autoencoders is this: they're likely neither. Let's find out why. + +### Normal autoencoders don't work here + +As an example: suppose that you train a classic autoencoder where your latent space has six dimensions. The encoder segment of the autoencoder will then output a vector with six values. In other words: it outputs a single value per dimension (Jordan, 2018A). + +In the plot of our latent state space above - where we trained a classic autoencoder to encode a space of _two dimensions_ - this would just be a dot somewhere on an (x, y) plane. + +Does this plot, with all the dots, meet the criteria specified above? + +No: it's neither continuous nor complete. Take a look at the plot and at what would happen if I would take a random position in my latent space, decode the output - generating a zero - and then start moving around. + +If the space were _continuous_, it would mean that I'd find a value somewhere between a zero and a five (in terms of shape, not in terms of number!). + +![](images/incontinuous-1024x853.png) + +As you can see, however, I would find outputs like six, seven, one, two, ... anything but a five-ish output. The latent space of a classic autoencoder is hence not continuous. + +But is it complete? + +[![](images/classic_drawing-1024x853.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/classic_drawing.png) + +Nope, it's neither. + +If I'd go back to my original sample, and moved around to a position that would decode as a one, I'd find a hole just beyond the halfway mark. + +The decoder would likely produce utter nonsense here, since it simply hasn't seen anything similar to that particular encoding! (Rocca, 2019) + +I hope it's clear now: the latent spaces of classic autoencoders are neither _continuous_ nor _complete_. They don't produce similar outputs when changing the encoding over short distances in space, and the odds are there that they will produce nonsense when you'll feed encodings the model hasn't seen before. This is why traditional autoencoders cannot be used for content generation. + +### Why does this happen? + +Funnily, the undesirable behavior of classic autoencoders with respect to content generation is perfectly explainable. It's because _they were never trained to do so_. They were trained to approximate the target output (i.e., the reconstructed input, or the denoised input, and so on) in the best way possible (Rocca, 2019). + +In machine learning terms, this means that the only goal of the classic autoencoder is to _minimize reconstruction loss_. Minimizing reconstruction loss needs no continuity or completeness. 
Nope, it only needs learning a way of generating encodings for inputs that maximize reconstruction to the desired output. Whether this happens with or without a continuous and complete state, is of no concern to the autoencoder. + +What's more, factors like the distribution of your training data, the dimension of the latent space configured by the machine learning engineer, and the architecture of your encoder - they all influence the _regularity_ and hence continuity and completeness of your autoencoder's, and thus are factors in explaining why classic autoencoders cannot be used (Rocca, 2019). + +To make a long story short: training an autoencoder that generates a latent space that is both continuous and complete _locally_ (i.e., for some point in space and its direct vicinity) is difficult. Achieving the same but then _globally_ (i.e., for the entire space) is close to impossible when using traditional autoencoders. Such a shame! 😑 + +* * * + +## Say hello to Variational Autoencoders (VAEs)! + +Let's now take a look at a class of autoencoders that _does work_ well with generative processes. It's the class of Variational Autoencoders, or VAEs. They are "powerful generative models" with "applications as diverse as generating fake human faces \[or producing purely synthetic music\]" (Shafkat, 2018). When comparing them with [GANs](https://www.machinecurve.com/index.php/2019/07/17/this-person-does-not-exist-how-does-it-work/), Variational Autoencoders are particularly useful when you wish to _adapt_ your data rather than _purely generating new data_, due to their structure (Shafkat, 2018). + +### How are VAEs different from traditional autoencoders? + +[![](images/vae_mlp-300x180.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/vae_mlp.png) + +They achieve this through two main differences (Shafkat, 2018; Rocca, 2019; Jordan, 2018A): + +- Firstly, recall that classic autoencoders output one value per dimension when mapping input data to latent state. VAEs don't do this: rather, they output a Gaussian probability distribution with some mean \[latex\]\\mu\[/latex\] and standard deviation \[latex\]\\sigma\[/latex\] for every dimension. For example, when the latent state space has seven dimensions, you'd thus get seven probability distributions that together represent state, as a probability distribution across space. +- Secondly, contrary to classic autoencoders - which minimize reconstruction loss only - VAEs minimize a combination of reconstruction loss and a probability comparison loss called [Kullback-Leibler divergence](https://www.machinecurve.com/index.php/2019/12/21/how-to-use-kullback-leibler-divergence-kl-divergence-with-keras/). This enforces the regularization we so deeply need. + +These two differences allow them to be both _continuous_ and, quite often, _complete_, making VAEs candidates for generative processes. + +Let's now take a look at these differences in more detail :) + +### First difference: encodings are probability distributions + +Recall that classic autoencoders encode their inputs as a single point in some multidimensional space. Like this, for five-dimensional encoded space: + +[![](images/classic-autoencoder.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/classic-autoencoder.png) + +VAEs don't do this, and this is their first difference: yes, they still encode inputs to some multidimensional space, but they encode inputs as a _distribution over the latent space_ (Rocca, 2019). 
As part of this, the encoder doesn't output _one vector_ of size \[latex\]N\[/latex\], but instead _two vectors_ of size \[latex\]N\[/latex\]. The first is a vector of means, \[latex\]\\mu\[/latex\], and the second a vector of standard deviations, \[latex\]\\sigma\[/latex\]. + +[![](images/vae-encoder.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/vae-encoder.png) + +The _encoder_ segment of our VAE is what Kingma & Welling (2013) call the _recognition model:_ it's a learnt approximation ("what must encoding \[latex\]z\[/latex\] be given input \[latex\]x\[/latex\]?") of the _true_ posterior \[latex\]p(z | x)\[/latex\]. Since the approximation is learnt, we don't know its exact distribution, but we _do_ know that the true posterior would be Gaussian, so that the \[latex\]z\[/latex\] from our true posterior would be \[latex\]z \\sim \\mathcal{N}(\\mu,\\,\\sigma^{2})\\,\[/latex\] ("z is part of a Gaussian a.k.a. normal distribution with mean \[latex\]\\mu\[/latex\] and standard deviation \[latex\]\\sigma\[/latex\]", Kingma & Welling 2013). + +By consequence, we assume that the _approximated_ posterior distribution (the distribution generated by the encoder) is also distributed \[latex\]\\mathcal{N}(\\mu,\\,\\sigma^{2})\\,\[/latex\]. This, in return, means that we can effectively combine the two vectors into one, if we assume that each element in the new vector is a random variable \[latex\]X \\sim \\mathcal{N}(\\mu,\\,\\sigma^{2})\\,\[/latex\] with the \[latex\]\\mu\[/latex\]s and \[latex\]\\sigma\[/latex\]s being the values from the vectors. + +So: + +[![](images/vae-encoder-x.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/vae-encoder-x.png) + +When we know the encoding of our input, we can randomly sample from all the variables \[latex\]X\[/latex\], selecting a number from the distribution with which the encoding was made. We then feed this number to the decoder, which decodes it into - hopefully 😀 - interpretable output (Shafkat, 2018). + +[![](images/vae-encoder-decoder-1024x229.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/vae-encoder-decoder.png) + +The fact that we sample randomly means that what we feed to the decoder is different every time (i.e., at every epoch during training, and at every inference in production, Jordan 2018A). This means that the reconstructed output is slightly different every time (Shafkat, 2018). + +It's important to understand this property, which is visualized below for a two-dimensional latent space with two Gaussian distributions (red and blue) generating a range of possible sampled \[latex\]X\[/latex\]s (the area in green): + +[![](images/MultivariateNormal.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/MultivariateNormal.png) + +_Even though this work is licensed under CC0, I'd wish to pay thanks to Wikipedia user 'BScan' for creating it: ["Illustration of a multivariate gaussian distribution and its marginals."](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#/media/File:MultivariateNormal.png)_ + +As we can see, the mean values \[latex\]\\mu\[/latex\] for our distributions determine the average center of the range of values, while the \[latex\]\\sigma\[/latex\]s determine the area in green (Shafkat, 2018). + +Now why is this difference - _probability distributions instead of points_ - important? Let's explore. + +Do you remember the two criteria that latent spaces must preferably satisfy if you wish to use autoencoders for generative processes? 
Indeed, they must be _continuous_ and, preferably, _complete_. + +If the space is continuous, two nearby points in the latent space should, when decoded, produce similar results. If it's complete, all points sampled from some area of that space should decode into results that make sense. + +Having the VAE encoder output a probability distribution over the latent space ensures that it's continuous and (theoretically, with infinite iterations) complete over a _local segment_ in the space. That is, for the samples in one class, or for only a few samples together. The illustration above clearly demonstrates this: within the green area, there's only a limited amount of space that is white. What's more, the results are similar: all the samples are drawn from the same probability distribution, which was generated as the encoding for _just one sample_. + +### Second difference: KL divergence + reconstruction error for optimization + +Now imagine what happens when you feed dozens of samples (or, with the size of today's datasets, likely thousands or tens of thousands of samples) to the encoder. Given its learnt internals, it will produce a vector of means and standard deviations for each of them. + +Now imagine that for each vector, we draw the variables \[latex\]X\[/latex\] once, generating various points in the latent space. But now imagine that we do so an infinite number of times, but _without removing the earlier points_. What you'll get is an area in space that becomes entirely filled, with only the bounds unfilled. + +Why this happens is simple: the probability distributions that were encoded by the encoder overlap, and so do the points - especially when you don't stop sampling :-) + +This is the benefit of the _first difference_, covered previously. But we're not there yet. Let's take a look at a visualization of a 2D latent space generated by a VAE that is trained to minimize reconstruction loss: + +![](images/rl_2d-1024x853.png) + +What I see is a distribution that is not _centered_, leaving many holes in between samples, where the decoder will not know how to decode the sampled point, producing nonsensical outputs. + +What I see as well is that when moving across space (ignoring the nonsense data generated in the holes), the results produced are quite similar to each other. Take for example the _zeroes_ generated at the top right of the diagram. Moving a bit to the left and to the top leaves us in the cluster with _sixes_, and yep: a 0 looks quite like a 6, in terms of shape. Zeroes and _ones_ do not look like each other at all, and hey, they are located really far from each other! That's great 🎉 + +In terms of the principles: **the latent spaces generated by VAEs trained to minimize reconstruction loss are _continuous_, but not _complete_.** + +This happens because the neural network has only been trained to minimize _reconstruction loss_ so far. + +Minimizing reconstruction loss in plain English goes like this: "make the output look like the input as much as possible - and take any chance you'll get". This ensures that the model will encode the latent space in a way that _discriminates_ between classes, as much as possible (i.e., if it's not certain whether an input is a zero or a six, it will encode it to be somewhere in between. It will also move samples about which it is very certain as far away as possible, especially when encoding samples at the edges of input space). + +Thus: + +1. Training with reconstruction loss clusters samples that look like each other together. 
This means that each class is clustered together and samples from different classes that look alike are encoded close to each other. Hence, the _continuity_ principle is satisfied. +2. However, there is nothing that ensures that the clusters _overlap to some extent, connecting with each other_. In fact, it may be the case that in order to minimize reconstruction loss, the encoder will encode samples into disjoint clusters, i.e. clusters that _have no overlap!_ By consequence, we must say that the _completeness_ principle is still not satisfied. + +Fortunately, there is a workaround: adding the [Kullback-Leibler divergence](https://www.machinecurve.com/index.php/2019/12/21/how-to-use-kullback-leibler-divergence-kl-divergence-with-keras/) to the loss function. This divergence, which is also called KL divergence, essentially computes the "divergence" between two probability distributions (i.e., how much they look _not_ like each other). + +If we add it to the loss function (currently with reconstruction loss only) to be minimized by the neural network, and configure it to compare the probability distribution generated by the encoder with the standard Gaussian \[latex\]\\mathcal{N}(0, 1^{2})\\,\[/latex\], we get the following plot when retraining the model: + +[![](images/rlkl_2d-1024x853.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/rlkl_2d.png) + +It's clear that _continuity_ is still enforced: zeroes and ones are still on opposite sides of the latent space, while for example the values 6 and 8 are close together. + +However, what also becomes visible is that the _completeness_ principle is now also met to a great extent! + +This happens because the KL divergence loss term increases when the probability distribution generated by the encoder diverges from the \[latex\]\\mathcal{N}(0, 1^{2})\\,\[/latex\] standard normal distribution. Effectively, this means that the neural network is regularized to learn an encoder that produces a probability distribution with \[latex\]\\mu \\approx 0\[/latex\] and \[latex\]\\sigma \\approx 1\[/latex\], "pushing" the probability distributions and hence the sampled \[latex\]X\[/latex\]s close together. + +And this is visible in the illustration above: the entire latent space is built around the point \[latex\](0, 0)\[/latex\] with the majority of samples being within the \[latex\]\[-1, +1\]\[/latex\] domain and range. There are much fewer holes now, making the global space much more _complete_. + +### Recap: why does this help content generation? + +So, in short: + +- VAEs learn encoders that produce probability distributions over the latent space instead of points in the latent space. +- As we sample from these probability distributions during many training iterations, we effectively show the decoder that the entire area around the distribution's mean produces outputs that are similar to the input value. In short, we create a _continuous and complete latent space_ locally. +- By minimizing a loss function that is composed of both reconstruction loss and KL divergence loss (see the sketch after this list), we ensure that the same principles also hold globally - to the greatest extent possible. +- This way, we have a continuous and complete latent space globally - i.e., for all our input samples, and by consequence also similar ones. +- This, in turn, allows us to "walk" across the latent space, and generate output that both makes sense (thanks to completeness) and is similar to what we've seen already on our journey (thanks to the continuity). 
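+
+To make that recap a bit more tangible, here is a minimal NumPy sketch of the two loss components and the sampling step for a single input. It assumes that the encoder outputs a mean vector and a log-variance vector per sample; the function and variable names are mine and purely illustrative, not taken from any particular library:
+
+```python
+import numpy as np
+
+def sample_z(mu, log_var):
+    """Reparameterization: z = mu + sigma * epsilon, with epsilon ~ N(0, 1)."""
+    epsilon = np.random.standard_normal(mu.shape)
+    return mu + np.exp(0.5 * log_var) * epsilon
+
+def vae_loss_sketch(x, x_reconstructed, mu, log_var):
+    """Reconstruction loss plus KL divergence to the standard normal, for one sample."""
+    eps = 1e-7  # numerical stability for the logarithms
+    # Reconstruction loss: binary cross-entropy, summed over all input dimensions.
+    reconstruction_loss = -np.sum(
+        x * np.log(x_reconstructed + eps) + (1 - x) * np.log(1 - x_reconstructed + eps)
+    )
+    # KL divergence between N(mu, sigma^2) and N(0, 1), summed over latent dimensions:
+    # -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
+    kl_loss = -0.5 * np.sum(1 + log_var - np.square(mu) - np.exp(log_var))
+    return reconstruction_loss + kl_loss
+```
+
+A real implementation - for example in Keras - computes the same two terms over minibatches and minimizes their sum with gradient descent; only the bookkeeping differs.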
+ +Let's now take a walk 😂 + +* * * + +## Examples of VAE generated content + +### MNIST dataset + +When training a VAE with the MNIST dataset, this is the latent space (on the left) and the result of selecting points in this space randomly on the right (Keras Blog, n.d.). Clearly, the latent space is continuous _and_ complete, as the generated content shows. + +- [![](images/vae_space.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/vae_space.png) + +- [![](images/vae_mnist.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/vae_mnist.png) + + +_The script to generate these plots was created by François Chollet and can be retrieved [here](https://github.com/keras-team/keras/blob/master/examples/variational_autoencoder.py)._ + +Great! 😎 + +### Fashion MNIST + +With a few easy changes (effectively replacing all references to `mnist` with `fashion_mnist` in the script mentioned above), one can replace the MNIST dataset with the Fashion MNIST dataset. This should be harder for the model, because the fashion items are harder to discriminate than the original MNIST samples. I feel that indeed, the plot of the latent space is a bit blurrier than the plot of the original MNIST dataset - but still, random decodings of points in the latent space show that it works! 🎉 + +- [![](images/fmnist_50_latsp.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/fmnist_50_latsp.png) + +- [![](images/fmnist_50_plot.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/fmnist_50_plot.png) + + +_The script to generate these plots was created by François Chollet and can be retrieved [here](https://keras.io/examples/variational_autoencoder_deconv/)._ + +Now, let's see if we can improve when we regularize even further. + +As with the Dropout best practices, [we applied Dropout](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/) with \[latex\]p = 0.5\[/latex\] in the hidden layers and max-norm regularization with \[latex\]maxnormvalue = 2.0\[/latex\]. It seems to improve the model's ability to discriminate between classes, which also becomes clear from the samples across latent space: + +- [![](images/fmnist_dmax_space.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/fmnist_dmax_space.png) + +- [![](images/fmnist_dmax_plot.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/fmnist_dmax_plot.png) + + +_The script to generate these plots was created by François Chollet and can be retrieved [here](https://keras.io/examples/variational_autoencoder_deconv/)._ + +* * * + +## Summary + +In this blog post, we've looked at the concept of a Variational Autoencoder, or VAE. We did so by looking at classic or 'normal' autoencoders first, as well as their difficulties when it comes to content generation. + +Doing so, we have seen how VAEs may overcome these issues by encoding samples as a probability distribution over the latent space, making it continuous and complete - which allows generative processes to take place. We illustrated this with two examples, a visualization of the MNIST dataset and its latent space as well as the Fashion MNIST dataset. Clearly, the dataset that is easier to discriminate - MNIST - produced a better plot. + +I hope you've learnt something today. If you like my blog, please leave a comment in the comments box below 👇 - I'd really appreciate it! Please do the same if you find mistakes or when you think things could be better. Based on your feedback, I'll try to improve my post where possible. 
+ +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Autoencoder. (2006, September 4). Retrieved from [https://en.wikipedia.org/wiki/Autoencoder](https://en.wikipedia.org/wiki/Autoencoder) + +Shafkat, I. (2018, April 5). Intuitively Understanding Variational Autoencoders. Retrieved from [https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf](https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf) + +Rocca, J. (2019, December 8). Understanding Variational Autoencoders (VAEs). Retrieved from [https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73](https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73) + +Jordan, J. (2018A, July 16). Variational autoencoders. Retrieved from [https://www.jeremyjordan.me/variational-autoencoders/](https://www.jeremyjordan.me/variational-autoencoders/) + +Jordan, J. (2018B, March 19). Introduction to autoencoders. Retrieved from [https://www.jeremyjordan.me/autoencoders/](https://www.jeremyjordan.me/autoencoders/) + +Keras Blog. (n.d.). Building Autoencoders in Keras. Retrieved from [https://blog.keras.io/building-autoencoders-in-keras.html](https://blog.keras.io/building-autoencoders-in-keras.html) + +Kingma, D. P., & Welling, M. (2013). [Auto-encoding variational bayes](https://arxiv.org/abs/1312.6114). _arXiv preprint arXiv:1312.6114_. diff --git a/what-is-batch-normalization-for-training-neural-networks.md b/what-is-batch-normalization-for-training-neural-networks.md new file mode 100644 index 0000000..7f07967 --- /dev/null +++ b/what-is-batch-normalization-for-training-neural-networks.md @@ -0,0 +1,209 @@ +--- +title: "What is Batch Normalization for training neural networks?" +date: "2020-01-14" +categories: + - "deep-learning" +tags: + - "batch-normalization" + - "gradient-descent" + - "minibatch-gradient-descent" + - "training-process" +--- + +Training neural networks is an art rather than a process with a fixed outcome. You don't know whether you'll end up with working models, and there are many aspects that may induce failure for your machine learning project. + +However, over time, you'll also learn a certain set of brush strokes which significantly improve the odds that you'll _succeed_. + +Even though this may sound weird (it did when I started my dive into machine learning theory), I think that the above description is actually true. Once you dive in, there will be a moment when all the pieces start coming together. + +In modern neural network theory, Batch Normalization is likely one of the concepts that you'll encounter during your quest for information. + +It has something to do with _normalizing_ based on _batches of data_... right? Yeah, but that's actually repeating the name in different words. + +Batch Normalization, in fact, helps you overcome a phenomenon called **internal covariate shift**. What this is, and how Batch Normalization works? We'll answer those questions in this blog. + +To be precise: we'll kick off by exploring the concept of an internal covariate shift. What is it? How is it caused? Why does it matter? These are the questions that we'll answer. + +It is followed by the introduction of Batch Normalization. Here, we'll also take a look at what it is, how it works, what it does and why it matters. This way, you'll understand how it can be used to **speed up your training**, or to even save you from situations with **non-convergence**. 
+ +Are you ready? Let's go! 😎 + +* * * + +\[toc\] + +* * * + +## Internal covariate shift: a possible explanation of slow training and non-convergence + +Suppose that you have a neural network, such as this one that has been equipped with [Dropout neurons](https://www.machinecurve.com/index.php/2019/12/16/what-is-dropout-reduce-overfitting-in-your-neural-networks/): + +[![](images/dropout.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/dropout.png) + +As you might recall from the [high-level supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), training a neural network includes a _feedforward operation_ on your training set. During this operation, the data is fed to the neural network, which generates a prediction for each sample that can be compared to the _target data_, a.k.a. the ground truth. + +This results in a [loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#loss) that is computed by some [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#loss-functions). + +Based on the loss function, backpropagation will compute what is known as the _gradient_ to improve the loss, while [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or an [adaptive optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) will actually change the weights of the neurons of your neural network. Based on this change, the model is expected to perform better during the next iteration, in which the process is repeated. + +### Changing input distributions + +Now, let's change your viewpoint. Most likely, you'll have read the previous while visualizing the neural network as a whole. Perfectly fine, as this was intended, but now focus on the network as if it is _a collection of stacked, but individual, layers_. + +Each layer takes some input, transforms this input through interaction with its weights, and outputs the result, to be consumed by the first layer downstream. Obviously, this is not true for the input layer (with the original sample as input) and the output layer (with no subsequent layer), but you get the point. + +Now suppose that we feed the entire training set to the neural network. The first layer will _transform_ this data into _something else_. Statistically, however, this is also a _sample_, which thus has a sample mean and a sample standard deviation. This process repeats itself for each individual layer: the input data can be represented as some statistical sample with mean \[latex\]\\mu\[/latex\] and standard deviation \[latex\]\\sigma\[/latex\]. + +### Internal covariate shift + +Now do note two things: + +- Firstly, the argument above means by consequence that the distribution of input data for some particular layer depends on _all the interactions happening in all the upstream layers_. +- Secondly, this means by consequence that _a change in how one or more of the upstream layer(s) process data_ will change the _input distribution_ for this layer. + +...and what happens when you train your model? Indeed, you change _how the layers process data_, by changing their weights. 
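+
+To see this effect in numbers, here is a tiny, purely illustrative NumPy sketch: a single toy 'layer' transforms the same input batch before and after a simulated weight update, and the mean and standard deviation of its outputs - which are the input distribution for the next layer - shift along with the weights. All names and values below are made up for illustration:
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(42)
+
+# A fixed batch of inputs and a toy layer: outputs = ReLU(inputs @ weights)
+inputs = rng.normal(size=(256, 3))
+weights = rng.normal(size=(3, 4))
+
+def layer_outputs(w):
+    return np.maximum(0.0, inputs @ w)  # ReLU activation
+
+before = layer_outputs(weights)
+
+# Simulate one optimization step: the weights change a little...
+weights_updated = weights + 0.5 * rng.normal(size=weights.shape)
+after = layer_outputs(weights_updated)
+
+# ...and with them, the distribution that the *next* layer receives as input.
+print("Mean/std before update:", before.mean(), before.std())
+print("Mean/std after update: ", after.mean(), after.std())
+```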
+ +Ioffe & Szegedy (2015), in their paper ["Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift"](https://arxiv.org/abs/1502.03167) call this process the **"internal covariate shift"**. They define it as follows: + +> The change in the distribution of network activations due to the change in network parameters during training. +> +> Ioffe, S., & Szegedy, C. (2015). [Batch normalization: Accelerating deep network training by reducing internal covariate shift](https://arxiv.org/abs/1502.03167). _arXiv preprint arXiv:1502.03167_. + +### Why is this bad? + +Put plainly and simply: + +**It slows down training**. + +If you were using a very strict approach towards defining a supervised machine learning model, you would for example say that machine learning produces _a function which maps some input to some output based on some learnt mapping, which equals the mapping made by the true, underlying mapping in your data_. + +This is also true for each layer: each layer essentially is a function which learns to map some input to some output, so that the system as a whole maps the original input to the desired output. + +Now imagine that you're looking at the training process from some distance. Slowly but surely, each layer learns to represent the internal mapping and the system as a whole starts to show the desired behavior. Perfect, isn't it? + +Yes, except that you also see some oscillation during the process. Indeed, you see that the layers make _tiny_ mistakes during training, because they expect the inputs to be of some kind, while they are slightly different. They do know how to handle this, as the changes are very small, but they have to readjust each time they encounter such a change. As a result, the process as a whole takes a bit longer. + +The same is true for the actual machine learning process. The _internal covariance shift_, or the changing distributions of the input data for each hidden layer, mean that each layer requires some extra time to learn the weights which allow the system as a whole to minimize the loss value of the entire neural network. In extreme cases, although this does not happen too often, this shift may even result in non-convergence, or the impossibility of learning the mapping as a whole. This especially occurs in datasets which have not been normalized and are by consequence a poor fit for ML. + +* * * + +## Introducing Batch Normalization + +Speaking about such normalization: rather than leaving it to the machine learning engineer, can't we (at least partially) fix the problem in the neural network itself? + +That's the thought process that led Ioffe & Szegedy (2015) to conceptualize the concept of **Batch Normalization**: by normalizing the inputs to each layer to a learnt representation likely close to \[latex\](\\mu = 0.0, \\sigma = 1.0)\[/latex\], the internal covariance shift is reduced substantially. As a result, it is expected that the speed of the training process is increased significantly. + +But how does it work? + +Let's find out. + +### Per-feature normalization on minibatches + +The first important thing to understand about Batch Normalization is that it works on a per-feature basis. + +This means that, for example, for feature vector \[latex\]\\textbf{x} = \[0.23, 1.26, -2.41\]\[/latex\], normalization is not performed equally for each dimension. Rather, each dimension is normalized individually, based on the sample parameters of the _dimension_. 
+ +The second important thing to understand about Batch Normalization is that it makes use of minibatches for performing the normalization process (Ioffe & Szegedy, 2015). It avoids the computational burden of using the entire training set, while assuming that minibatches approach the dataset's sample distribution if sufficiently large. This is a very smart idea. + +### Four-step process + +Now, the algorithm. For each feature \[latex\]x\_B^{(k)} \[/latex\] in your feature vector \[latex\]\\textbf{x}\_B\[/latex\] (which, for your hidden layers, doesn't contain your features but rather the inputs for that particular layer), Batch Normalization normalizes the values with a four-step process on your minibatch \[latex\]B\[/latex\] (Ioffe & Szegedy, 2015): + +1. **Computing the mean of your minibatch**: \[latex\]\\mu\_B^{(k)} \\leftarrow \\frac{1}{m} \\sum\\limits\_{i=1}^m x\_B{ \_i ^{(k)} } \[/latex\]. +2. **Computing the variance of your minibatch:** \[latex\]\\sigma^2{ \_B^{(k)} } \\leftarrow \\frac{1}{m} \\sum\\limits\_{i=1}^m ( x\_B{ \_i ^{(k)} } - \\mu\_B^{(k)})^2\[/latex\] +3. **Normalizing the value:** \[latex\]\\hat{x}\_B^{(k)} \\leftarrow \\frac{x\_B{ ^{(k)} } - \\mu\_B^{(k)}}{\\sqrt{ \\sigma^2{ \_B^{(k)} } + \\epsilon}}\[/latex\] +4. **Scaling and shifting:** \[latex\]y\_i \\leftarrow \\gamma\\hat{x} \_B ^{(k)} + \\beta\[/latex\]. + +#### Computing mean and variance + +The first two steps are simple and are very common as well as required in a normalization step: **computing the mean** \[latex\]\\mu\[/latex\] and **variance** \[latex\]\\sigma^2\[/latex\] of the \[latex\]k^{\\text{th}}\[/latex\] dimension of your minibatch sample \[latex\]x\_B\[/latex\]. + +#### Normalizing + +These are subsequently used in the **normalization step**, in which the expected distribution is \[latex\](0, 1)\[/latex\] as long as samples in the minibatch have the same distribution and the value for \[latex\]\\epsilon\[/latex\] is neglected (Ioffe & Szegedy, 2015). + +You may ask: indeed, this \[latex\]\\epsilon\[/latex\], why is it there? + +It's for numerical stability (Ioffe & Szegedy, 2015). If the variance \[latex\]\\sigma^2\[/latex\] were zero, one would get a _division by zero_ error. This means that the model would become numerically unstable. The value for \[latex\]\\epsilon\[/latex\] resolves this by taking a very small but nonzero value to counter this effect. + +#### Scaling and shifting + +Now, finally, the fourth step: **scaling and shifting** the normalized input value. I can get why this is weird, as we already completed normalization in the third step. + +> Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation \[latex\]x^{(k)}\[/latex\], a pair of parameters \[latex\]\\gamma^{(k)}\[/latex\], \[latex\]\\beta^{(k)}\[/latex\], which scale and shift the normalized value: +> +> \[latex\]y^{(k)} = \\gamma^{(k)}\\hat{x}^{(k)} + \\beta^{(k)}\[/latex\] +> +> Ioffe, S., & Szegedy, C. (2015). [Batch normalization: Accelerating deep network training by reducing internal covariate shift](https://arxiv.org/abs/1502.03167). _arXiv preprint arXiv:1502.03167_. + +Linear regime of the nonlinearity? Represent the identity transform? What are these? 
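+
+We will unpack that quote in a moment. First, to make the four-step recipe above concrete, here is a minimal NumPy sketch of the forward pass for one minibatch. The names and numbers are illustrative, and a real implementation (for example the Keras layer) additionally keeps moving averages of the mean and variance for use at inference time:
+
+```python
+import numpy as np
+
+def batch_norm_forward(x, gamma, beta, epsilon=1e-5):
+    """Normalize a minibatch x of shape (batch_size, num_features), per feature."""
+    mu = x.mean(axis=0)                        # Step 1: per-feature minibatch mean
+    var = x.var(axis=0)                        # Step 2: per-feature minibatch variance
+    x_hat = (x - mu) / np.sqrt(var + epsilon)  # Step 3: normalize (epsilon for stability)
+    return gamma * x_hat + beta                # Step 4: scale and shift (learnt parameters)
+
+# A toy minibatch with four samples and three features:
+x = np.array([[ 0.20,  1.30, -2.40],
+              [ 0.60,  0.70,  1.00],
+              [-0.10,  1.10,  0.30],
+              [ 0.40,  0.90, -0.80]])
+print(batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3)))
+```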
+ +Let's decomplexify the rather academic English into a plainer variant. + +First, the "linear regime of the nonlinearity". Suppose that we're using the [Sigmoid activation function](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/), which is a nonlinear activation function (a "nonlinearity") and was still quite common in 2015, when the Ioffe & Szegedy paper was written. + +It looks like this: + +[![](images/sigmoid_and_deriv-1024x511.jpeg)](https://www.machinecurve.com/wp-content/uploads/2019/09/sigmoid_and_deriv.jpeg) + +Suppose that we've added it to some arbitrary layer. + +_Without Batch Normalization_, the inputs of this layer do not have a distribution of approximately \[latex\](0, 1)\[/latex\], and hence could theoretically be likelier to take rather large values (e.g. \[latex\]2.5623423...\[/latex\]). + +Suppose that our layer does nothing but pass the data (it makes our case simpler), the _activations_ of those input values produce outputs that have a _nonlinear_ slope: as you can see in the plot above, for inputs to the activation function in the domain \[latex\]\[2, 4\]\[/latex\], the output bends a bit. + +However, for inputs of \[latex\]\\approx 0\[/latex\], this is not the case: the outputs for the input domain of approximately \[latex\]\[-0.5, 0.5\]\[/latex\] don't bend and actually seem to represent a _linear function_. This entirely reduces the effect of nonlinear activation, and by consequence the performance of our model, and might not be what we want! + +...and wait: didn't we normalize to \[latex\](0, 1)\[/latex\], meaning that the inputs to our activation function are likely in the domain \[latex\]\[-1, 1\]\[/latex\] for every layer? Oops 🙊 + +This is why the authors introduce a scaling and shifting operation with some parameters \[latex\]\\gamma\[/latex\] and \[latex\]\\beta\[/latex\], with which the normalization can be adapted during training, in extreme cases even to "represent the identity transform" (a.k.a., what goes in, comes out again - entirely removing the Batch Normalization step). + +The parameters are learnt during training, together with the other parameters (Ioffe & Szegedy, 2015). + +### Continuing our small example + +Now, let's revise our small example from above, with our feature vector \[latex\]\\textbf{x} = \[0.23, 1.26, -2.41\]\[/latex\]. + +Say if we used a minibatch approach with 2 samples per batch (a bit scant, I know, but it's sufficient for the explanation), with another vector \[latex\]\\textbf{x}\_a = \[0.56, 0.75, 1.00\]\[/latex\] in the set, our Batch Normalization step would go as follows (assuming \[latex\]\\gamma = \\beta = 1\[/latex\]): + +| **Features** | **Mean** | **Variance** | **Input** | **Output** | +| --- | --- | --- | --- | --- | +| \[0.23, 0.56\] | 0.395 | 0.054 | 0.23 | \-0.710 | +| \[1.26, 0.75\] | 1.005 | 0.130 | 1.26 | 0.707 | +| \[-2.41, 1.00\] | \-0.705 | 5.81 | \-2.41 | \-0.707 | + +As we can see, with \[latex\]\\gamma = \\beta = 1\[/latex\], our values are normalized to a distribution of approximately \[latex\](0, 1)\[/latex\] - with some \[latex\]\\epsilon\[/latex\] term. + +### The benefits of Batch Normalization + +Theoretically, there are some assumed benefits when using Batch Normalization in your neural network (Ioffe & Szegedy, 2015): + +- The model is less sensitive to hyperparameter tuning. That is, whereas larger learning rates led to non-useful models previously, larger LRs are acceptable now. 
+- Weight initialization is a tad less important now. +- Dropout, which is used to [add noise to benefit training](https://www.machinecurve.com/index.php/2019/12/18/how-to-use-dropout-with-keras/), can be removed. + +### Batch Normalization during inference + +While a minibatch approach speeds up the training process, it is "neither necessary nor desirable during inference" (Ioffe & Szegedy, 2015). When inferring e.g. the class for a new sample, you wish to normalize it based on the _entire_ training set, as it produces better estimates and is computationally feasible. + +Hence, during inference, the Batch Normalization step goes as follows: + +\[latex\]\\hat{x}^{(k)} \\leftarrow \\frac{x\_i^{(k)} - \\mu^{(k)}}{\\sqrt{ \\sigma^2{ ^{(k)} } + \\epsilon}}\[/latex\] + +Where \[latex\]x \\in X\[/latex\] and \[latex\]X\[/latex\] represents the full training data, rather than some minibatch \[latex\]X\_b\[/latex\]. + +* * * + +## Summary + +In this blog post, we've looked at the problem of a relatively slow and non-convergent training process, and noted that Batch Normalization may help reduce the issues with your neural network. By reducing the distribution of the input data to \[latex\](0, 1)\[/latex\], and doing so on a per-layer basis, Batch Normalization is theoretically expected to reduce what is known as the "internal covariance shift", resulting in faster learning. + +I hope you've learnt something from this blog post. If you did, please feel free to leave a comment in the comments box below - I'll happily read and answer :) Please do the same if you have any questions or when you have remarks. Thanks for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Ioffe, S., & Szegedy, C. (2015). [Batch normalization: Accelerating deep network training by reducing internal covariate shift](https://arxiv.org/abs/1502.03167). _arXiv preprint arXiv:1502.03167_. + +Reddit. (n.d.). Question about Batch Normalization. Retrieved from [https://www.reddit.com/r/MachineLearning/comments/3k4ecb/question\_about\_batch\_normalization/](https://www.reddit.com/r/MachineLearning/comments/3k4ecb/question_about_batch_normalization/) diff --git a/what-is-convbert-and-how-does-it-work.md b/what-is-convbert-and-how-does-it-work.md new file mode 100644 index 0000000..800797f --- /dev/null +++ b/what-is-convbert-and-how-does-it-work.md @@ -0,0 +1,16 @@ +--- +title: "What is ConvBERT and how does it work?" +date: "2021-02-26" +categories: + - "buffer" + - "deep-learning" +tags: + - "bert" + - "convbert" + - "huggingface" + - "nlp" + - "transformer" + - "transformers" +--- + +Convolutional BERT (ConvBERT) improves the original BERT by replacing some Multi-headed Self-attention segments with cheaper and naturally local operations, so-called span-based dynamic convolutions. These are integrated into the self-attention mechanism to form a mixed attention mechanism, allowing Multi-headed Self-attention to capture global patterns; the Convolutions focus more on the local patterns, which are otherwise captured anyway. In other words, they reduce the computational intensity of training BERT. diff --git a/what-is-deep-learning-exactly.md b/what-is-deep-learning-exactly.md new file mode 100644 index 0000000..3a16f1e --- /dev/null +++ b/what-is-deep-learning-exactly.md @@ -0,0 +1,188 @@ +--- +title: "What is deep learning exactly?" 
+date: "2018-11-23" +categories: + - "deep-learning" +tags: + - "activation-functions" + - "deep-learning" + - "feature-learning" + - "information-processing" + - "multilayer" + - "neural-networks" + - "nonlinear" + - "representation-learning" +--- + +Recently, I've picked up deep learning both in my professional and spare-time activities. This means that I spent a lot of time learning the general concepts behind this very hot field. On this website, I'm documenting the process for others to repeat. + +But in order to start, you'll have to start with the definition. **What is deep learning, exactly?** If you don't know what it is, you cannot deepen your understanding. + +In this blog, I thus investigate the definition of deep learning in more detail. I'll take a look at the multi-layered information processing, the nonlinear [activation functions](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/), as well as the concept behind representation learning. It's slightly high-level to keep this blog at an adequate complexity, and I will cover the particular topics in more detail in other blogs. + +So keep coming back every now and then to find new information available for you to read, free of charge! :-) + +Hopefully, this blog will put you into the right direction in your quest for information. If you have any questions or remarks, tips and tricks; obviously, they are welcome. Please leave me a message below and I am more than happy to respond. + +Okay, let's give it a start :-) + +**Update February 2020** \- Extended certain areas of the text and added additional links to other MachineCurve articles. + +\[toc\] + +\[ad\] + +## What is deep learning, exactly? + +There seems to be a bit of a definition clash, haha. In all these years, there has been [no agreed upon definition](https://www.machinecurve.com/index.php/2017/09/30/the-differences-between-artificial-intelligence-machine-learning-more/) about what the differences are between artificial intelligence, machine learning and deep learning. Especially for artificial intelligence things get vague with very fuzzy boundaries. + +For deep learning, things tend to get a bit better. + +If we quote Wikipedia's [page about deep learning](https://en.wikipedia.org/wiki/Deep_learning), it writes as follows: _"**Deep learning** (also known as **deep structured learning** or **hierarchical learning**) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms."_ + +We now have a couple of new questions: + +- What does _learning data representations_ mean? +- How are they different than tasks-specific algorithms? + +The book 'Deep Learning Methods and Applications' by Li Deng and Dong Yu provides a synthesis of various definitions based on previous academic research. They highlight that within all these definitions, overlap exists between two key concepts: + +1. Deep learning models are models consisting of multiple layers or stages of nonlinear information processing; +2. Deep learning methods are methods for supervised or unsupervised learning of feature representation at successively higher, more abstract layers. + +This somewhat deepens our understanding from the Wikipedia quote, but we still have some remaining questions. + +- Once again, _what does learning data representation_ or _feature representation_ mean? +- How can we visualize the successively higher, more abstract layers? 
+- What is nonlinear information processing? + +We now have a problem space which we can use to move forward :-) + +\[ad\] + +## Multiple layers of information processing + +Classic methods of machine learning work with just one layer of information processing. + +To make this principle clear, we take one of the simpler variants of these kind of models: a [linear classifier](https://www.machinecurve.com/index.php/2019/06/11/why-you-shouldnt-use-a-linear-activation-function/). + +![](images/linear_classifier.jpg) + +Above, we have the mathematical notation of a linear classifier. I'll now try to explain it more intuitively. + +### Input, output and weights + +Suppose that we have a model. This means that you will have **input** which you feed to the model, and based on the model you get some **output**. In the notation above, vector (for programmers, this is like an array; for anyone else, it is like an ordered list) **x** is the new input you're feeding the model. **y** is the output, for example the class in a [classification problem](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/). + +Vector **w** is called the **weights vector**. This is the "learnt" knowledge for the model. If you train the model, you feed it "input values" with the corresponding "output value". Based on the way the model is built up itself, it attempts to discover patterns in this data. For example, a medium-sized animal which makes a barking sound probably belongs to output... dog. This means that if trained well, when your input vector (x) consists of 'medium-sized' and 'barking', the model's output (y) will be 'dog'. + +### Converting input to output: linear classifier + +In the case of a linear classifier, it works by converting the dot product of the weights and the input vector scalars into the desired output value. It's simply a summated multiplication of the two vector's scalars at the same levels in the vector. This is a situation in which a **linear function** is used to produce the output. + +We can use this to demonstrate how a deep learning network is different than a classic machine learning method. + +If you wish to use the classic method, like the linear classifier above, you feed it with input and you get some output. However, **only one thing happens**. This means that the information is processed just once. In the case of the linear classifier, a dot product between the model's weights and the input scalars is calculated... and that provides the output score. + +For the deep learning methods, things are a bit different. If we wish to demonstrate this, we must take a generic neural network and show it first: + +\[caption id="attachment\_172" align="aligncenter" width="296"\]![](images/296px-Colored_neural_network.svg_.png) _Source: [Colored neural network at Wikipedia](https://en.wikipedia.org/wiki/Artificial_neural_network#/media/File:Colored_neural_network.svg), author: [Glosser.ca](https://commons.wikimedia.org/wiki/User_talk:Glosser.ca), license: [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/nl/legalcode), no changes._\[/caption\] + +I think you did immediately notice that an artificial neural network consists of multiple layers :-) There is one input layer, one output layer and some hidden layers in between. + +These layers, and the nodes within these layers, they are all connected. 
In most cases, [this happens in a feed-forward fashion](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), as you can notice in the image above, but some network architectures exist in which certain information from the past is used to make better predictions in future layers. + +### Converting input to output: neural network + +In both cases, this means that **multiple things happen** when you feed a neural network new input data. This is very much contrary to the linear classifier and all the other classic machine learning methods, in which this is not the case. + +Now you may ask: why is this better than classic machine learning methods? + +The simple answer to this question is: it is not necessarily better. This totally depends on the task. But we have seen that these kinds of network architectures _do_ generally perform better when comparing them to the classic models. + +And here's why. + +\[ad\] + +## Nonlinear activation functions + +We'll have to first look into another aspect of these kinds of models: the so-called [**nonlinear activation functions**](https://www.machinecurve.com/index.php/2020/01/24/overview-of-activation-functions-for-neural-networks/). + +![](images/linear_classifier.jpg) + +We will have to go back to the simple principle of calculating a dot product again. As a short recap: this means calculating the dot product of the weights vector and the input vector. + +Quite frankly, the same thing happens in a neuron, which is the node illustrated in the neural network above. + +### Activation functions + +But neural networks are loosely inspired by how the human brain works. Neurology research used in the development of artificial neural networks tells us that the brain is a collective cooperation between neurons, which process information and 'fire' a signal to other neurons if the information should be processed further. + +This means that the neurons can partially decide that certain signals do not need to be processed further down the chain, whereas for others this is actually important. + +It can be achieved by using an [**activation function**](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/). One such function uses some kind of threshold value to decide whether activation should take place. For example, "fire if value > threshold, otherwise do not fire". In numbers: "1 if value > threshold, 0 otherwise". Many types of activation function exist. + +### Nonlinearity + +This is really different from a regular model, which does not use any kind of activation function, as we saw with the linear classifier. + +In neural networks, activation functions are [**nonlinear**](https://www.machinecurve.com/index.php/2019/06/11/why-you-shouldnt-use-a-linear-activation-function/). We can show the difference by first explaining a linear function: + +y = A(x) = c \* x. + +With a certain weight c (which does not matter for the example), function A produces output value y for the input value x. However, as we can see, this output value is proportional to the input value. If c = 1, we can see that A(1) = 1, A(2) = 2 et cetera. + +Nonlinear functions do not work this way. Their output is not necessarily proportional to the input value (but may be for some ranges within the possible input values). 
For example, one of the most-used nonlinear activation functions is the so-called [**_ReLU_** _activation function_](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/#rectified-linear-unit-relu). If the x values are < 0, the output is 0, else the output is x. This means that for x >= 0, the output is proportional to the input, but if an input scalar x is < 0, it is not proportional. + +So, in nonlinear models, the weighted product calculated by the neuron is then put through an activation function that is nonlinear. Its output, if activated, is sent to the connected neurons in the subsequent layer. + +### Differences with classic models + +The **benefit** of these kinds of activation functions is that data can be handled in a better way. Data is inherently nonlinear, as the world is too. It is therefore very complex [to fully grasp the world in linear models](https://www.machinecurve.com/index.php/2019/06/11/why-you-shouldnt-use-a-linear-activation-function/). Nonlinear activation functions can help identify much more complex patterns in data than a linear model can handle. This partially explains the enormous rise in accuracy for machine learning models since the rise of deep learning. + +\[ad\] + +## Back to the multiple layers + +Now that we know how nonlinear activation functions are an integral part of deep learning, we can go back to the multiple layers story with which we ended prior to reaching a conclusion. + +Every layer adds a level of non-linearity that cannot be captured by another layer. + +For example, suppose that we wish to identify all the data points that lie within the red circle and the orange circle in the diagram below: + +![](images/example.jpg) + +We first greatly benefit from the **nonlinearity** of the neural networks, because these circles are not linear (have you seen a circular line before?). + +With traditional models, it would have been impossible to achieve great accuracy on this kind of problem. + +Partially because of the linearity of these models, but also because they cannot separate the two sub-problems hidden in the problem described above: + +1. First, identify everything that lies _within_ the red circle; +2. Second, identify everything that lies _outside_ the orange circle. + +Combined, these provide the answer to our problem. + +Using a multi-layered neural network, we can train the model to make this separation. One layer will take on the first problem; the second layer will take on the second. Probably, a few additional layers are necessary to "polish" the result, but it illustrates why multiple layers of information processing distinguish deep learning methods from classic machine learning methods, as well as the nonlinearity. + +\[ad\] + +## Learning data / feature representation + +Another important aspect of these deep learning networks is that they **learn data representations**. It is also one of the answers to the sketch drawn above, in which the data is not linearly separable. + +Internally, every layer will learn its own **representation** of the data. This means that it will structure the data in a better way so the task at hand, for example classification, becomes simpler. + +This also means that the data will be more abstract and more high-level for every subsequent layer. It is an essential step in transforming the very dynamic, often heterogeneous data into something from which a computer can distinguish that - for example - it's either A or B. 
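+
+Before we turn to a concrete image-recognition example, here is a minimal sketch that ties the circles problem and the nonlinearity story together: a tiny Keras network with two ReLU-activated hidden layers, trained on scikit-learn's synthetic 'two circles' data. All hyperparameters are illustrative choices, not recommendations:
+
+```python
+from sklearn.datasets import make_circles
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense
+
+# Synthetic data: points on an inner circle (class 1) and an outer circle (class 0).
+X, y = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=42)
+
+# A model without hidden layers (i.e. a linear model) cannot separate these classes;
+# a few small ReLU-activated layers can.
+model = Sequential([
+    Dense(16, activation='relu', input_shape=(2,)),
+    Dense(16, activation='relu'),
+    Dense(1, activation='sigmoid'),
+])
+model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
+model.fit(X, y, epochs=50, batch_size=32, verbose=0)
+
+_, accuracy = model.evaluate(X, y, verbose=0)
+print(f"Training accuracy: {accuracy:.2f}")
+```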
+ +In a concrete example for image recognition in humans, this means that every input image is converted into higher-level concepts. For example, the noses of the various humans involved in the pictures are transformed into a generic nose, and subsequently decomposed in many other simpler, generic concepts. + +This way, once the model sees a new nose, it can attempt to do the same thing - to know that it's a nose, and therefore possibly a human being :-) + +## Conclusion + +In this blog I investigated the definition for deep learning in more detail. I hope it helped you in some way towards becoming better in machine learning and deep learning. I would appreciate your comment and your feedback, especially if you think I made a mistake. This way, we can cooperatively make this blog better, which I would appreciate very much :-) + +## Sources + +- [Wikipedia: Deep learning](https://en.wikipedia.org/wiki/Deep_learning); diff --git a/what-is-dropout-reduce-overfitting-in-your-neural-networks.md b/what-is-dropout-reduce-overfitting-in-your-neural-networks.md new file mode 100644 index 0000000..6309c32 --- /dev/null +++ b/what-is-dropout-reduce-overfitting-in-your-neural-networks.md @@ -0,0 +1,225 @@ +--- +title: "What is Dropout? Reduce overfitting in your neural networks" +date: "2019-12-16" +categories: + - "deep-learning" +tags: + - "deep-learning" + - "dropout" + - "machine-learning" + - "neural-networks" + - "regularization" + - "regularizer" +--- + +When training neural networks, your goal is to produce a model that performs really well. + +This makes perfect sense, as there's no point in using a model that does not perform. + +However, there's a relatively narrow balance that you'll have to maintain when attempting to find a _perfectly well-performing model_. + +It's the balance between _underfitting_ and _overfitting_. + +In order to avoid underfitting (having worse than possible predictive performance), you can continue training, until you experience the other problem - overfitting, a.k.a. being too sensitive to your training data. Both hamper model performance. + +Sometimes, the range in which your model is not underfit nor overfit is really small. Fortunately, it can be extended by applying what is known as a _regularizer_ - a technique that regularizes how your model behaves during training, to delay overfitting for some time. + +Dropout is such a regularization technique. In this blog post, we cover it, by taking a look at a couple of things. Firstly, we dive into the difference between underfitting and overfitting in more detail, so that we get a deeper understanding of the two. Secondly, we introduce Dropout based on academic works and tell you how it works. Thirdly, we will take a look at whether it really works, by describing the various experiments done with this technique. Finally, we will compare traditional Dropout with Gaussian Dropout - and how it changes training your model. + +Ready? Let's go! 😎 + +\[toc\] + +## How well does your model perform? Underfitting and overfitting + +Let's first take a look at what underfitting and overfitting are. + +When starting the [training process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), the weights of your neurons are likely initialized at random or with some other initialization strategy. This means that the error rate, or loss value, will be very high during the first few epochs. 
Take a look at this diagram, where the loss decreases very rapidly during the first few epochs: + +![](images/elu_loss.png) + +When both the training loss and the validation loss decrease, the model is said to be **underfit:** it can still be trained to make better predictions, i.e. to gain in its _predictive power_. + +The action to undertake then is to continue training. + +However, this cannot continue forever. Optimizing a model involves generating predictions with your training data, resulting in loss values and gradients for optimization, which is then performed. Unfortunately, this means that _some of the idiosyncrasies of the data are leaked into the model weights_. That is, since the data is a sample rather than a full population, it is always slightly different from the full population it represents. When you optimize the model for hundreds of epochs with this data, you'll always get an offset with respect to this true population. + +If you were to continue training, your model would adapt more and more to those idiosyncrasies, making it less suitable for data it has never seen before - i.e., other samples from the population. The model is then said to be **overfit:** it is too well-adapted to the training and validation data. + +Overfitting can be detected on plots like the one above by inspecting the validation loss: when it goes up again, while the training loss remains constant or decreases, you know that your model is overfitting. As you can see, the ELU-powered network in the plot above has started overfitting very slightly. + +Both underfitting and overfitting are to be avoided, as your model will perform worse than it could perform theoretically. Fortunately, certain techniques - called regularizers - can be used to reduce the impact of overfitting. **Dropout** is one of them - and we will cover it in this blog. Let's begin by analyzing what Dropout is, what it does and how it works. + +## What is Dropout and how does it work? + +In their paper ["Dropout: A Simple Way to Prevent Neural Networks from Overfitting"](http://jmlr.org/papers/v15/srivastava14a.html), Srivastava et al. (2014) describe the _Dropout_ technique, which is a stochastic regularization technique and should reduce overfitting by (theoretically) combining many different neural network architectures. + +With Dropout, the training process essentially drops out neurons in a neural network. They are temporarily removed from the network, which can be visualized as follows: + +[![](images/dropout.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/dropout.png) + +Note that the connections or synapses are removed as well, and that hence no data flows through these neurons anymore. + +...but only very briefly! This process repeats every epoch (or even every minibatch! - Srivastava et al. 2014) and hence sampling thinned networks happens very often. This should lead to significantly lower generalization error rates (i.e., less overfitting), as "the presence of neurons is made unreliable" (Srivastava et al., 2014). + +This removal of neurons and synapses during training is performed at random, with a parameter \[latex\]p\[/latex\] that is tunable (or, given empirical tests, best set to 0.5 for hidden layers and close to 1.0 for the input layer). This effectively means that, according to the authors, the "thinned" network is sampled from the global architecture, and used for training. 
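+
+As a minimal, illustrative NumPy sketch (not the paper's code): sampling such a thinned layer boils down to drawing a Bernoulli mask with retention probability \[latex\]p\[/latex\] and multiplying it with the layer's activations:
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+def dropout_forward(activations, p=0.5):
+    """Training-time Dropout: keep each neuron with probability p, drop it otherwise."""
+    mask = rng.binomial(n=1, p=p, size=activations.shape)  # Bernoulli(p) per neuron
+    return activations * mask  # dropped neurons output 0 for this forward pass
+
+activations = np.array([0.8, 1.5, 0.3, 2.1, 0.9])
+print(dropout_forward(activations, p=0.5))  # a random subset of the values is zeroed out
+```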
+ +At test time, "it is not feasible to explicitly average the predictions from exponentially many thinned models" (Srivastava et al., 2014). That's true: it would become a computational burden when hundreds of thousands of epochs/minibatches have to be averaged, especially when networks become really large. + +Fortunately, there is a solution - which is simple, but produces the same result: using one neural network, in which the outgoing weights are scaled down according to the \[latex\]p\[/latex\] with which a unit was retained during training. This means that the expected output at training time is the same as the true output at test time, resolving the computational issue and making Dropout usable in practice. + +### Bernoulli variables + +Let's now take a look at how Dropout works mathematically. Don't worry, we don't bury you with maths, but instead we'll try to take a very intuitive point of view. + +Very simplistically, this is how a neuron receives its input: e.g. three upstream neurons in a three-neuron Dense layer send their outputs to the next layer, where they are received as input. Note that for the sake of simplicity we omit the bias values here. + +![](images/Normal-neuron.png) + +Normal neuron (assumed to be without bias) + +It is very simple to go from here to a Dropout neuron, which looks as follows: + +![](images/Dropout-neuron.png) + +Dropout neuron (assumed to be without bias) + +Mathematically, this involves so-called Bernoulli random variables: + +> In probability theory and statistics, the Bernoulli distribution, named after Swiss mathematician Jacob Bernoulli, is the discrete probability distribution of a random variable which takes the value 1 with probability \[latex\]p\[/latex\]. +> +> [Wikipedia on the Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) + +To create Dropout, Srivastava et al. (2014) attached Bernoulli variables to the network's neurons (by multiplying them with neural outputs), "each of which \[have\] probability \[latex\]p\[/latex\] of being 1". The \[latex\]p\[/latex\] value here is selected by the machine learning engineer, usually based on some validation set, or naïvely set to 0.5. + +Inside the network, the Bernoulli variable and its value of 1 or 0 determines whether a neuron is 'dropped out' during this epoch or minibatch feedforward operation. This, in effect, leads to the 'thinned network' that Srivastava et al. (2014) talk about. + +### Why could Dropout reduce overfitting? + +You may now wonder: why do Bernoulli variables attached to regular neural networks, making the network thinner, reduce overfitting? + +For the answer to this question, we will have to take a look at how neural networks are trained. + +Usually, backpropagation and [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or a [similar optimizer](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) is used for this purpose. Given a [loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/), so-called 'gradients' are computed which the optimizer then processes into the network's weights. By using these gradients (with respect to the error rate) to change the weights, the network likely performs slightly better during the next iteration of the training process. 
+ +Computing the gradient is done _with respect to the error_, but also _with respect to what all other units are doing_ (Srivastava et al., 2014). This means that certain neurons, through changes in their weights, may fix the mistakes of other neurons. These, Srivastava et al. (2014) argue, lead to complex co-adaptations that may not generalize to unseen data, resulting in overfitting. + +Dropout, then, prevents these co-adaptations by - as we wrote before - _making the presence of other hidden \[neurons\] unreliable_. Neurons simply cannot rely on other units to correct their mistakes, which reduces the number of co-adaptations that do not generalize to unseen data, and thus presumably reduces overfitting as well. + +## Training neural nets with Dropout + +Training neural networks to which Dropout has been attached is pretty much equal to training neural networks without Dropout. [Stochastic gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or [similar optimizers](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) can be used. The only difference, as reported by Srivastava et al. (2014), can be found when using a mini-batch approach: rather than per epoch, thinned networks are sampled per minibatch. + +Additionally, methods that improve classic SGD - like [momentum](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/#momentum) - can be used as well, and show similar improvements as with regular neural networks (Srivastava et al., 2014). + +What the authors also found to be useful during training is applying _max-norm regularization_, which means constraining the norm of the incoming weight to be bounded by some maximum value \[latex\]c\[/latex\]. This value must be set by the engineer upfront, and determined using a validation set (Srivastava et al., 2014). + +Combining Dropout with max-norm regularization improves performance compared to using Dropout alone, but the authors reported even better results when Dropout and max-norm regularization are combined with two other things: + +- Large, decaying learning rates. +- High momentum. + +According to Srivastava et al. (2014), this can possibly be justified by the following arguments: + +1. Constraining weight vectors makes it possible to use large learning rates without [exploding weights](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/). +2. Dropout noise plus large learning rates then help optimizers "to explore different regions of the weight space that would have otherwise been difficult to reach". +3. Decaying the learning rate then slows down the jumpiness of the exploration process, eventually "settling into a minimum". +4. High momentum allows the network to overcome local minima, increasing the likelihood that the global minimum is found. + +## Does dropout _actually_ work? Experimental results + +With any improvement in machine learning, it's nice to have a theoretical improvement - but it's also important to test whether it really works. Srivastava et al. (2014) performed multiple tests to find out whether Dropout works. Firstly, they used various standard datasets (such as the MNIST dataset) to test whether Dropout improves model performance across a wide range of classification problems. 
+
+Secondly, they checked how it performed with a variety of other regularizers (yielding the insight that max-norm regularization together with Dropout works best - but let's take a look at these results in more detail later), and thirdly, Srivastava et al. (2014) investigated which dropout rates (i.e., which parameter \[latex\]p\[/latex\]) work best and how data size impacts Dropout performance. Let's take a look!
+
+[![](images/mnist.png)](https://www.machinecurve.com/wp-content/uploads/2019/07/mnist.png)
+
+Samples from the MNIST dataset
+
+### Dropout vs no dropout on standard datasets
+
+The authors tested Dropout vs No Dropout on these standard datasets (Srivastava et al., 2014):
+
+- The **MNIST** dataset, which contains thousands of handwritten digits;
+- The **TIMIT** speech benchmark dataset for clean speech recognition;
+- The **CIFAR-10** and **CIFAR-100** datasets, containing tiny natural images in 10 and 100 classes;
+- The **Street View House Numbers** (SVHN) dataset, with images of house numbers collected from Google Street View;
+- The **ImageNet** dataset, which contains many natural images;
+- The **Reuters RCV1** newswire articles dataset. This is a text dataset rather than an image dataset.
+
+[![](images/cifar10_images.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/cifar10_images.png)
+
+Samples from the CIFAR10 dataset
+
+For all datasets, Dropout improved the generalization power of the model. On MNIST, drastically lower test errors were reported, with substantial improvements for all the different architectures that were tested.
+
+Dropout also outperforms regular neural networks on the ConvNets trained on the CIFAR-10, CIFAR-100 and ImageNet datasets.
+
+For the SVHN dataset, another interesting observation could be reported: when Dropout is applied to the convolutional layers, performance also increases. According to the authors, this is interesting, because these layers were previously assumed not to be sensitive to overfitting, as they do not have many parameters (Srivastava et al., 2014). It is argued that adding Dropout to the Conv layers provides noisy inputs to the Dense layers that follow them, which further prevents _them_ from overfitting.
+
+Finally, Dropout also works on the TIMIT speech benchmark dataset and the Reuters RCV1 text dataset, although the improvement on the latter was much smaller than on the vision and speech datasets.
+
+### Dropout vs no dropout with other regularizers
+
+Now that the authors knew that Dropout scales well across a variety of machine learning problems, they investigated it further: _how does it perform with respect to other regularizers?_
+
+Several regularizer methods were tested for preventing overfitting:
+
+- L2 weight decay;
+- Lasso;
+- KL sparsity;
+- Max-norm regularization.
+
+Srivastava et al. (2014) found that when combined with max-norm regularization, Dropout gives even lower generalization errors. In fact, it provided the lowest error reported, followed - at some distance - by Dropout + L2 regularization, and finally the others.
+
+Hence, when applying Dropout, it might also be a good idea to perform max-norm regularization at the same time.
+
+### When does Dropout work best? About Dropout rate and Dataset size
+
+Another question they tried to answer: do the _dropout rate_ (i.e., the \[latex\]p\[/latex\] parameter) and/or the _dataset size_ impact the performance of Dropout and the neural networks it is attached to?
+
+The question must be answered with **yes**. 
+
+#### What is the best value for \[latex\]p\[/latex\]?
+
+First, the parameter \[latex\]p\[/latex\]. By now, we can recall that it is tunable, and must in fact be set up front by the machine learning engineer. The fact that it is tunable leads to the same problem as the one that makes [fixed learning rates a bad idea](https://www.machinecurve.com/index.php/2019/11/11/problems-with-fixed-and-decaying-learning-rates/): you simply don't know up front which \[latex\]p\[/latex\] fits the data best.
+
+Hence, the authors argue, selecting a value for \[latex\]p\[/latex\] must be done by some initial tests with a validation set.
+
+They did so as well - in order to see whether interesting patterns could be found.
+
+And they did find such a pattern: across multiple scenarios, a value of \[latex\]p \\approx 0.5\[/latex\] for the hidden layers seems to result in the best performance when applying Dropout (Srivastava et al., 2014). This is true for all layers except the input one, where \[latex\]p\[/latex\] must be \[latex\]\\approx 1.0\[/latex\]. The latter is presumably the case because the input layer takes the input data, and it's difficult to find patterns when data is dropped at random.
+
+#### How does Dropout perform with respect to dataset size?
+
+According to the authors, a "good regularizer makes it possible to get a good generalization error from models with a large number of parameters trained on small data sets". That is, it performs really well on data it has not seen before - even when trained with little data.
+
+In order to find out whether Dropout regularizes well across various dataset sizes, Srivastava et al. (2014) performed tests with various sizes on the MNIST dataset. The sizes were as follows: "100, 500, 1K, 5K, 10K and 50K chosen randomly from the MNIST training set" (Srivastava et al., 2014).
+
+The authors found that there is a trade-off between when Dropout is necessary and when it is no longer useful. First, the case where the dataset is extremely small: here, even Dropout does not improve performance, simply because the dataset size is too small. The same is true for datasets that are large enough: Dropout then no longer improves the model; rather, model performance gets worse.
+
+Hence, there exists a sweet spot between when Dropout is necessary and when it's smarter not to use it (or to increase dataset size instead). According to Srivastava et al. (2014), there are no heuristics to determine this size; rather, it must be determined with a validation set.
+
+## Gaussian Dropout: Gaussian instead of Bernoulli variables
+
+We recall from above that Dropout works with Bernoulli variables, which take the value 1 with probability \[latex\]p\[/latex\] and 0 with the remaining probability \[latex\]1 - p\[/latex\].
+
+This idea can be generalized to multiplying the activations with random variables from other distributions (Srivastava et al., 2014). In their work, Srivastava et al. found that the Gaussian distribution and hence Gaussian variables work just as well - and perhaps even better.
+
+Applying Gaussian variables can be done in a similar way: thinning networks at training time, and using weighted activations at test and production time (as with regular Dropout). However, the authors choose to use Gaussian Dropout differently - i.e., multiplicatively. Instead of thinning and weighting, Gaussian Dropout is weighted at training time: the activations that are not dropped are multiplied by \[latex\]1/p\[/latex\], rather than by \[latex\]1\[/latex\] as with regular Bernoulli Dropout. 
They are not modified at test time. This equals the previous scenario. + +Gaussian Dropout must be configured by some \[latex\]\\sigma\[/latex\], which in Srivastava et al.'s experiments was set to \[latex\]\\sqrt{(1-p)/p}\[/latex\], where \[latex\]p\[/latex\] is the configuration of the Bernoulli variant (i.e., in naïve cases \[latex\]p \\approx 0.5\[/latex\] for hidden layers and \[latex\]\\approx 1.0\[/latex\] for the input layer). + +## Summary + +In this blog post, we looked at overfitting - and how to avoid it, with Dropout. By looking at what it is, how it works, and _that it works_, we found that it is an interesting technique for application in your deep learning models. + +I hope you've learnt something today - something useful to your ML models 😀 If you did, or when you have questions, please do not hesitate to leave a comment below ⬇! When possible, I'll answer your questions 😊 + +Thank you for reading MachineCurve today and happy engineering! 😎 + +## References + +Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014, June 15). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Retrieved from [http://jmlr.org/papers/v15/srivastava14a.html](http://jmlr.org/papers/v15/srivastava14a.html) + +Wikipedia. (2003, March 20). Bernoulli distribution. Retrieved from [https://en.wikipedia.org/wiki/Bernoulli\_distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) diff --git a/what-is-padding-in-a-neural-network.md b/what-is-padding-in-a-neural-network.md new file mode 100644 index 0000000..6f0eeb3 --- /dev/null +++ b/what-is-padding-in-a-neural-network.md @@ -0,0 +1,237 @@ +--- +title: "What is padding in a neural network?" +date: "2020-02-06" +categories: + - "deep-learning" +tags: + - "convolutional-neural-networks" + - "deep-learning" + - "machine-learning" + - "neural-network" + - "neural-networks" + - "padding" +--- + +Training Convolutional Neural Networks means that your network is composed of two separate parts most of the times. The last part of your network, which often contains densely-connected layers [but doesn't have to](https://www.machinecurve.com/index.php/2020/01/31/reducing-trainable-parameters-with-a-dense-free-convnet-classifier/), generates a classification or regresses a value based on the inputs received by the first Dense layer. + +The first part, however, serves as a "feature extraction" mechanism - it transforms the original inputs into "bits of information" which ensures that the Dense layers perform better (for example, due to the effects of translation invariance; Chollet, 2017). By consequence, the system as a whole allows you to feed it raw inputs, which are processed internally, while you get a probability distribution over a set of classes in return. + +Typically, Convolutional layers are used as feature extractors. Through optimization, these layers learn "kernels" which slide (or convolve) over the input data, generating a number of "feature maps" that can subsequently be used for detecting certain patterns in the data. This is achieved by element-wise multiplications between the _slice_ of input data the filter is currently hovering over, and the _weights_ present within the filter. + +This, in return, effectively means that a spatial hierarchy is created: the more one moves towards the right when inspecting the [model architecture](https://www.machinecurve.com/index.php/2019/10/07/how-to-visualize-a-model-with-keras/), the smaller the inputs and hence feature maps become. 
Sometimes, though, you don't want your input to become smaller - in the case of [an autoencoder](https://www.machinecurve.com/index.php/2019/12/19/creating-a-signal-noise-removal-autoencoder-with-keras/), for example, where you just want to converge the feature maps into one Sigmoid activated output. This can be achieved with the **"padding mechanism"**, which is precisely what we'll cover in this blog post.
+
+Firstly, we'll look into the necessity of padding for some cases. This is followed by a generic description of the various forms of padding that are present within today's two most widely used frameworks for deep learning, being Keras - and thus TensorFlow - and PyTorch (please note that we don't provide code examples; this we'll do in a different blog post). For each of them, we'll check what they do. We also try to find out which one should be used in what scenario. We finally recap on all our learnings and finalize the blog.
+
+This way, you should have a good understanding about both the _necessity_ and the _workings_ of padding upon finishing this blog!
+
+Are you ready? Let's go 😎
+
+* * *
+
+\[toc\]
+
+* * *
+
+## What is padding and why do we need it?
+
+Let's first take a look at what padding is. From this, it becomes clear straight away why we might need it for training our neural network - more specifically, our _ConvNet_, because that's where you'll apply padding pretty much all of the time 😄
+
+Now, in order to find out how padding works, we need to study the internals of a convolutional layer first.
+
+Here you've got one, although it's very generic:
+
+[![](images/CNN.png)](https://www.machinecurve.com/wp-content/uploads/2019/09/CNN.png)
+
+What you see on the left is an RGB input image - width \[latex\]W\[/latex\], height \[latex\]H\[/latex\] and three channels. Hence, this layer is likely the _first layer in your model_; in any other scenario, you'd have feature maps as the input to your layer.
+
+Now, what is a feature map? That's the yellow block in the image. It's a collection of \[latex\]N\[/latex\] two-dimensional "maps" that each represent a particular "feature" that the model has spotted within the image. This is why convolutional layers are known as feature extractors.
+
+Now, this is very nice - but how do we get from input (whether image or feature map) to a feature map? This is through _kernels_, or _filters_, actually. These filters - you configure some number \[latex\]N\[/latex\] per convolutional layer - "slide" (strictly: convolve) over your input data, and have the same number of "channel" dimensions as your input data, but have much smaller widths and heights. For example, for the scenario above, a filter may be 3 x 3 pixels wide and high, but always has 3 channels as our input has 3 channels too.
+
+Now, when they slide over the input - from left to right horizontally, then moving down vertically after a row has been fully captured - they perform _element-wise multiplications_ between what's "currently under investigation" within the input data and the _weights present within the filter_. These weights are equal to the weights of a "classic" neural network, but are structured in a different way. Hence, optimizing a ConvNet involves computing [a loss value](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) for the model and subsequently using [an optimizer](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) to change the weights. 
+
+Through these weights, as you may guess, the model learns to detect the presence of particular features - which, once again, are represented by the feature maps. This closes the circle with respect to how a convolutional layer works :)
+
+### Conv layers might induce spatial hierarchy
+
+[![](images/pad-nopad-conv-1-300x300.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/02/pad-nopad-conv-1.jpg)
+
+If the width and/or height of your kernels is \[latex\]> 1\[/latex\], you'll see that the width and height of the feature map being output gets smaller. This occurs because the kernel slides over the input and computes the element-wise multiplications, but is too large to inspect the "edges" of the input. This is illustrated in the image to the right, where the "red" position is impossible to take and the "green" one is part of the path of the convolution operation.
+
+As the kernel cannot capture the edges, it won't be able to effectively "end" at the final position of your row, resulting in a smaller output width and/or height.
+
+For example, take the model that we generated in our blog post ["Reducing trainable parameters with a Dense-free ConvNet classifier"](https://www.machinecurve.com/index.php/2020/01/31/reducing-trainable-parameters-with-a-dense-free-convnet-classifier/). In the model summary, you clearly see that the output shape gets smaller in terms of width and height. Primarily, this occurs due to [max pooling](https://www.machinecurve.com/index.php/2020/01/30/what-are-max-pooling-average-pooling-global-max-pooling-and-global-average-pooling/), but you also see that the second `Conv2D` layer impacts the width and height of the feature map (and indeed, also the _number_ of maps, but this is not relevant for now).
+
+```
+Model: "GlobalAveragePoolingBased"
+_________________________________________________________________
+Layer (type)                 Output Shape              Param #
+=================================================================
+conv2d (Conv2D)              (None, 26, 26, 32)        320
+_________________________________________________________________
+max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0
+_________________________________________________________________
+dropout (Dropout)            (None, 13, 13, 32)        0
+_________________________________________________________________
+conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496
+```
+
+We call this _a spatial hierarchy._ Indeed, convolutional layers may cause a "hierarchy"-like flow of data through the model. Here, you have a schematic representation of a substantial hierarchy and a less substantial one - which is often considered to be _less efficient_:
+
+[![](images/hierarchies.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/hierarchies.png)
+
+### Padding avoids the loss of spatial dimensions
+
+Sometimes, however, you need to apply filters of a fixed size, but you _don't want to lose width and/or height dimensions in your feature maps_. For example, this is the case when you're [training an autoencoder](https://www.machinecurve.com/index.php/2019/12/20/building-an-image-denoiser-with-a-keras-autoencoder-neural-network/). You need the output images to be of the same size as the input, yet need an [activation function](https://www.machinecurve.com/index.php/2020/01/24/overview-of-activation-functions-for-neural-networks/) like e.g. [Sigmoid](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/) in order to generate them. 
+ +If you would do so with a `Conv` layer, this would become problematic, as you'd reduce the size of your feature maps - and hence would produce outputs unequal in size to your inputs. + +That's not what we want when we create an autoencoder. We want the original output and the original output only ;-) + +Padding helps you solve this problem. Applying it effectively adds "space" around your input data or your feature map - or, more precisely, "extra rows and columns" \[with some instantiation\] (Chollet, 2017). + +[![](images/pad-nopad.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/02/pad-nopad.jpg) + +The consequences of this fact are rather pleasurable, as we can see in the example below. + +[![](images/pad-nopad-conv.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/02/pad-nopad-conv.jpg) + +Adding the "extra space" now allows us to capture the position we previously couldn't capture, and allows us to detect features in the "edges" of your input. This is great! 😊 + +* * * + +## Types of padding + +Now, unfortunately, padding is not a binary option - i.e., it cannot simply be turned on and off. Rather, you can choose which padding you use. Based on the Keras docs (Keras, n.d.) and PyTorch docs (PyTorch, n.d.), we'll cover these types of padding next: + +- Valid padding (or no padding); +- Same padding; +- Causal padding; +- Constant padding; +- Reflection padding; +- Replication padding. + +Please note that the discussion next doesn't contain any Python code. We'll cover the padding options in terms of code in a different blog post ;) + +### Valid padding / no padding + +[![](images/validpad-300x300.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/02/validpad.jpg) + +Valid padding simply means "no padding" (Keras, n.d.). + +This equals the scenario to the right, where capturing the "edges" only is not possible. + +It may seem strange to you that frameworks include an option for valid padding / no padding, as you could simply omit the padding as well. However, this is not strange at all: if you specify some `padding` attribute, there must be a default value. As it may be confusing to perform some padding operation if you didn't specify any, at least Keras chooses to set `padding` to 'valid' if none is provided. By consequence, you can also _specify it yourself_. A bit useless, but possible by design :) + +### Same padding / zero padding + +Another option would be "same padding", also known as "zero padding". Here, the padding ensures that the output has the same shape as the input data, as you can see in the image below (Keras, n.d.). It is achieved by adding "zeros" at the edges of your layer output, e.g. the white space on the right of the image. + +Side note: in Keras, there is an inconsistency between backends (i.e., TensorFlow, Theano and CNTK) [as described here](https://github.com/keras-team/keras/pull/9473#issuecomment-372166860) (Keras, n.d.). However, with TensorFlow 2.0 being the "recommended choice" these days, this shouldn't be too much of a problem. + +[![](images/same-pad.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/02/same-pad.jpg) + +### Constant padding + +A type of padding that really resembles same padding is _constant padding_. Here, the outcome can be the same - the output will have the same shape as the input. However, rather than "zeros" - which is what same padding does - constant padding allows you to pad with a user-specified constant value (PyTorch, n.d.). 
In PyTorch, it is also possible to specify the padding at the boundary level (e.g. pad on the left and the top but not on the right and at the bottom). This obviously breaks with _same padding_ covered earlier; be aware of this. + +[![](images/constantpad.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/02/constantpad.jpg) + +### Causal padding + +Suppose that you have a time series dataset, where _two inputs_ together determine an _output_, in a causal fashion. Like this: + +![](images/Causalpad-2.jpg) + +It's possible to create a model that can handle this by means of a `Conv1D` layer with a kernel of size 2 - the learnt kernel will be able to map the inputs to the outputs successfully. + +But what about the first two targets? + +[![](images/Causalpad-3-1024x429.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/02/Causalpad-3.jpg) + +Although they are valid targets, the _inputs_ are incomplete - that is, there is insufficient input data available in order to successfully use them in the training process (The Blog, n.d.). For the second target, _one_ input - visible in gray - is missing (whereas the second is actually there), while for the first target both aren't there. + +For the first target, there is no real hope for success (as we don't have any input at all and hence do not know which values produce the target value), but for the second, we have a partial picture: we've got half the inputs that produce the target. + +Causal padding on the `Conv1D` layer allows you to include the partial information in your training process. By padding your input dataset with zeros at the front, a causal mapping to the first, missed-out targets can be made (Keras, n.d.; The Blog, n.d.). While the first target will be useless for training, the second can now be used based on the partial information that we have: + +[![](images/Causalpad-4-1024x262.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/02/Causalpad-4.jpg) + +### Reflection padding + +Another type of padding is "reflection padding" (TensorFlow, n.d.). As you can see, it pads the values with the "reflection" or "mirror" of the values directly in the opposite direction of the edge of your to be padded shape. + +For example, if you look at the image below, for the first row of the yellow box (i.e., your shape): + +- If you go to the right, you'll see a 1. Now, you need to fill the padding element directly to the right. What do you find when you move in the _opposite_ direction of the edge? Indeed, a 5. Hence, your first padding value is a 5. When you move further, it's a 3, so the next padding value following the 5 is a 3. And so on. +- In the opposite direction, you get a mirrored effect. Having a 3 at the edge, you'll once again find the 5 (as it's the center value) but the second value for padding will be a 1. +- And so on! + +[![](images/reflection_pad.jpg)](https://www.machinecurve.com/wp-content/uploads/2020/02/reflection_pad.jpg) + +Reflective padding seems to improve the empirical performance of your model (Physincubus, n.d.). 
Possibly, this occurs because of how "zero" based padding (i.e., the "same" padding) and "constant" based padding alter the distribution of your dataset:
+
+https://twitter.com/karpathy/status/720622989289644033
+
+This becomes clear when we actually visualize the padding when it is applied:
+
+- [![](images/zero_padding.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/zero_padding.png)
+
+- [![](images/reflection.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/reflection.png)
+
+
+### Replication padding / symmetric padding
+
+Replication padding looks like reflection padding, but is slightly different (TensorFlow, n.d.). Rather than _reflecting_ around the edge value (which excludes that value from the padding), you take a copy of the row - edge value included - and mirror that copy. Like this:
+
+- You're at the first row again, at the right. You find a 1. What is the next value?
+- Simple: you copy the entire row, mirror it, and start adding it as padding values horizontally. So, for row 1 with \[latex\]\[3, 5, 1\]\[/latex\], this will be \[latex\]\[1, 5, 3\]\[/latex\] being added. As you can see, since we only pad 2 elements in width, there are 1 and 5, but 3 falls off the padding.
+
+![](images/replication_pad.png)
+
+As with reflection padding, replication padding attempts to reduce the impact of "zero" and "constant" padding on the quality of your data by using "plausible data values by re-using what is along the borders of the input" (Liu et al., 2018):
+
+- [![](images/zero_padding.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/zero_padding.png)
+
+- [![](images/replication.png)](https://www.machinecurve.com/wp-content/uploads/2020/02/replication.png)
+
+
+* * *
+
+## Which padding to use when?
+
+There are no hard criteria that prescribe when to use which type of padding. Rather, it's important to understand that padding matters pretty much all the time - because it allows you to preserve information that is present at the borders of your input data, and present there only.
+
+We've seen multiple types of padding. If you have causal data (i.e. multiple inputs that lead to one target value) and use a one-dimensional convolutional layer to improve model efficiency, you might benefit from "causal" padding to stress the importance of causality in your data by ensuring that your target is never present _before all your input data_.
+
+If you have an image classification problem, or wish to use Conv layers differently, causal padding might not be interesting for you. But "zero" padding, "constant" padding, "reflection" padding and "replication" padding may be. All of them add one or multiple columns and/or rows of padded elements around your shape, but each works differently. While zero and constant padding add zeros and constants, reflection and replication padding attempt to preserve the distribution of your data by re-using what's present along the borders. This, scholars like Liu et al. (2018) expect, could improve model performance. Hence, if you're in this scenario, you may wish to start with reflection or replication padding, moving to constant and eventually zero padding if they don't work.
+
+* * *
+
+## Summary
+
+This blog post discussed the necessity of padding that you may encounter in your machine learning problems - and especially when using Conv layers / when creating a ConvNet. It did so by taking a look at convolutional layers, explaining why the borders of your input cannot be inspected when you don't add padding to your inputs.
+
+Subsequently, we discussed various types of padding - valid padding (a.k.a. 
no padding), same (or zero) padding, constant padding, reflection padding and replication padding. Through this discussion, you are now likely able to explain the differences between those types of padding. + +I hope you've learnt something today! If you did, please feel free to leave a comment in the comments section below 😊 Please do the same if you have any questions, remarks or when you spot a mistake. + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +Chollet, F. (2017). _Deep Learning with Python_. New York, NY: Manning Publications. + +Keras. (n.d.). Convolutional Layers. Retrieved from [https://keras.io/layers/convolutional/](https://keras.io/layers/convolutional/) + +PyTorch. (n.d.). torch.nn.modules.padding. Retrieved from [https://pytorch.org/docs/stable/\_modules/torch/nn/modules/padding.html](https://pytorch.org/docs/stable/_modules/torch/nn/modules/padding.html) + +The Blog. (n.d.). Convolutions in Autoregressive Neural Networks. Retrieved from [https://theblog.github.io/post/convolution-in-autoregressive-neural-networks/](https://theblog.github.io/post/convolution-in-autoregressive-neural-networks/) + +TensorFlow. (n.d.). tf.pad. Retrieved from [https://www.tensorflow.org/api\_docs/python/tf/pad](https://www.tensorflow.org/api_docs/python/tf/pad) + +Physincubus. (n.d.). When to use what type of padding for convolution layers? Retrieved from [https://stackoverflow.com/questions/44960987/when-to-use-what-type-of-padding-for-convolution-layers#comment77020477\_44960987](https://stackoverflow.com/questions/44960987/when-to-use-what-type-of-padding-for-convolution-layers#comment77020477_44960987) + +Liu, G., Shih, K. J., Wang, T. C., Reda, F. A., Sapra, K., Yu, Z., ... & Catanzaro, B. (2018). [Partial convolution based padding](https://arxiv.org/abs/1811.11718). _arXiv preprint arXiv:1811.11718_. diff --git a/what-is-the-bart-transformer-in-nlp.md b/what-is-the-bart-transformer-in-nlp.md new file mode 100644 index 0000000..9dbec25 --- /dev/null +++ b/what-is-the-bart-transformer-in-nlp.md @@ -0,0 +1,14 @@ +--- +title: "What is the BART Transformer in NLP?" +date: "2021-02-15" +categories: + - "buffer" + - "deep-learning" +tags: + - "bart" + - "bert" + - "nlp" + - "transformers" +--- + +The Bidirectional and Auto-Regressive Transformer or BART is a Transformer that combines the Bidirectional Encoder (i.e. BERT like) with an Autoregressive decoder (i.e. GPT like) into one Seq2Seq model. In other words, it gets back to the original Transformer architecture proposed by Vaswani, albeit with a few changes. diff --git a/what-is-the-ftswish-activation-function.md b/what-is-the-ftswish-activation-function.md new file mode 100644 index 0000000..5eb7a86 --- /dev/null +++ b/what-is-the-ftswish-activation-function.md @@ -0,0 +1,286 @@ +--- +title: "What is the FTSwish activation function?" +date: "2020-01-03" +categories: + - "deep-learning" +tags: + - "activation-function" + - "activation-functions" + - "deep-learning" + - "ftswish" + - "machine-learning" +--- + +Over the last few years, we've seen the rise of a wide range of activation functions - such as **FTSwish**. Being an improvement to traditional ReLU by blending it with Sigmoid and a threshold value, it attempts to achieve the best of both worlds: ReLU's model sparsity and Sigmoid's smoothness, presumably benefiting the loss surface (MachineCurve, 2019). + +In doing so, it attempts to be on par with or even improve newer activation functions like Leaky ReLU, ELU, PReLU and Swish. 
+ +In this blog post, we'll look at a couple of things. Firstly, we'll look at the concept of an artificial neuron and an activation function - what are they again? Then, we'll continue by looking at the challenges of classic activation - notably, the vanishing gradients problem and the dying ReLU problem. This is followed by a brief recap on the new activation functions mentioned above, followed by an introduction to **Flatten-T Swish**, or FTSwish. We conclude with comparing the performance of FTSwish with ReLU's and Swish' on the MNIST, CIFAR-10 and CIFAR-100 datasets. + +Are you ready? Let's go 😊 + +* * * + +\[toc\] + +* * * + +## Classic activation functions & their challenges + +When creating neural networks, you need to attach activation functions to the individual layers in order to make them work with nonlinear data. Inspiration for them can be traced back to biological neurons, which "fire" when their inputs are sufficiently large, and remain "silent" when they're not. Artificial activation functions tend to show the same behavior, albeit in much less complex ways. + +[![](images/1920px-Drawing_of_a_neuron.svg_-1024x397.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/1920px-Drawing_of_a_neuron.svg_.png) + +_Schematic drawing of a biological neuron. Source: Dana Scarinci Zabaleta at Wikipedia, licensed CC0_ + +And necessary they are! Artificial neural networks, which include today's deep neural networks, operate by multiplying a learnt "weights vector" with the "input vector" at each neuron. This element-wise multiplication is a linear operation, which means that any output is linear. As the system as a whole is now linear, it can handle linear data only. This is not powerful. + +By placing an artificial activation function directly after each neuron, it's possible to mathematically map the linear neuron output (which is input to the activation function) to some nonlinear output, e.g. by using \[latex\]sin(x)\[/latex\] as an activation function. Now, the system as a whole operates nonlinearly and is capable of handling nonlinear data. That's what we want, because pretty much no real-life data is linear! + +### Common activation functions + +Over the years, many nonlinear activation functions have emerged that are widely used. They can now be considered to be legacy activation functions, I would say. These are the primary three: + +- The **[Sigmoid](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/#sigmoid)** activation function has been around for many years now. It maps any input from a real domain to the range \[latex\](0, 1)\[/latex\]. Sigmoid is a good replacement for the Heaviside step function used in [Rosenblatt Perceptrons](https://www.machinecurve.com/index.php/2019/07/24/why-you-cant-truly-create-rosenblatts-perceptron-with-keras/), which made them non-differentiable and useless with respect to Gradient Descent. Sigmoid has also been used in Multilayer Perceptrons used for classification. +- The **[Tangens hyperbolicus](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/#tangens-hyperbolicus-tanh)** or Tanh activation function is quite an oldie, and has also been around for many years. It maps any input from a real domain into a value in the range \[latex\](-1, 1)\[/latex\]. Contrary to Sigmoid, it's symmetrical around the origin, which benefits optimization. However, it's relatively slow during training, while the next one is faster. 
+- The **[Rectified Linear Unit](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/#rectified-linear-unit-relu)** or ReLU is the de facto standard activation function for today's neural networks. Activating to zero for all negative inputs and to the identity \[latex\]f(x) = x\[/latex\] for all nonnegative inputs, it induces sparsity and greatly benefits learning. + +### Problems with common activation functions + +Having been used for many years, both practitioners and researchers have identified certain issues with the previous activation functions that might make training neural nets impossible - especially when the neural networks are larger. + +The first issue, the [**vanishing gradients problem**](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/), occurs when the gradients computed during backpropagation are smaller than 1. Given the fact that each gradient is chained to the downstream layers' gradients to find the update with respect to the error value, you'll easily see where it goes wrong: with many layers, and gradients \[latex\]< 1\[/latex\], the upstream gradients get very small. For example: \[latex\] 0.25 \\times 0.25 \\times 0.25 = 0.25^3 = 0.015625\[/latex\]. And this only for a few layers in the network. Remember that today's deep neural networks can have thousands. The effect of vanishing gradients, which is present particularly with Sigmoid: the most upstream layers learn very slowly, or no longer at all. This severely impacts learning. + +Fortunately, ReLU does not suffer from this problem, as its gradient is either zero (for \[latex\] x < 0\[/latex\]) or one (for the other values). However, the **dying ReLU problem** is a substantial bottleneck for learning here: if _only one_ of the layers in the chain produces a partial gradient of zero, the entire chain _and_ the upstream layers have a zero gradient. This effectively excludes the neurons from participating in learning, once again severely impacting learning. Especially with larger networks, this becomes an issue, which you'll have to deal with. + +* * * + +## New activation functions: Leaky ReLU, PReLU, ELU, Swish + +Over the years, some new activation functions have emerged to deal with this problem. The first is **Leaky ReLU**: it's a traditional ReLU which "leaks" some information on the left side of the function, i.e. where \[latex\]x < 0\[/latex\]. This is visible in the plot below, as you can identify a very gentle slope configured by some \[latex\]\\alpha\[/latex\] parameter. It resolves the dying ReLU problem by ensuring that the gradient value for all \[latex\]x < 0\[/latex\] is also very small, i.e. the \[latex\]\\alpha\[/latex\] that you configured. + +- [Leaky ReLU: improving traditional ReLU](https://www.machinecurve.com/index.php/2019/10/15/leaky-relu-improving-traditional-relu/) +- [Using Leaky ReLU with Keras](https://www.machinecurve.com/index.php/2019/11/12/using-leaky-relu-with-keras/) + +[![](images/leaky_relu.png)](https://www.machinecurve.com/wp-content/uploads/2019/10/leaky_relu.png) + +The downside of Leaky ReLU is that the value for \[latex\]\\alpha\[/latex\] has to be set in advance. Even though an estimate can be made by pretraining with a small subset of your data serving as a validation set, it's still suboptimal. Fortunately, Leaky ReLU can be generalized into what is known as **Parametric ReLU**, or PReLU. 
The value for \[latex\]\\alpha\[/latex\] no longer needs to be set by the machine learning engineer, but instead is learnt during training through a few extra trainable parameters. Here too, the gradient for all \[latex\]x < 0\[/latex\] is (very likely, as a learnt \[latex\]\\alpha = 0\[/latex\] cannot be ignored) small but nonzero, so that the dying ReLU problem is avoided. + +- [How to use PReLU with Keras?](https://www.machinecurve.com/index.php/2019/12/05/how-to-use-prelu-with-keras/) + +Another activation function with which the dying ReLU problem can be avoided is the **Exponential Linear Unit**, or ELU. The creators of this activation function argue that both PReLU and Leaky ReLU still produce issues when inputs are _really_ _large_ and negative, because the negative side of the spectrum does not saturate to some value. They introduce ELU, which both resolves the dying ReLU problem and ensures saturation based on some \[latex\]\\alpha\[/latex\] value. + +[![](images/elu_avf.png)](https://www.machinecurve.com/wp-content/uploads/2019/12/elu_avf.png) + +- [How to use ELU with Keras?](https://www.machinecurve.com/index.php/2019/12/09/how-to-use-elu-with-keras/) + +Another relatively popular new activation function is **Swish**, which really looks like ReLU but is somewhat different: + +[![](images/relu_swish-1024x511.png)](https://www.machinecurve.com/wp-content/uploads/2019/11/relu_swish.png) + +Firstly, it's smooth - which is expected to improve the loss surface during optimization (MachineCurve, 2019). Additionally, it saturates for large negative values, to zero - which is expected to still ensure that the activation function yields model sparsity. However, thirdly, it does produce small but nonzero (negative) outputs for small negative inputs, which is expected to help reduce the dying ReLU problem. Empirical tests with large datasets have shown that Swish may actually be beneficial in settings when larger neural networks are used. + +- [Why Swish could perform better than ReLu](https://www.machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/) + +* * * + +## Introducing Flatten-T Swish (FTSwish) + +Another activation function was introduced in a research paper entitled "[Flatten-T Swish: a thresholded ReLU-Swish-like activation function for deep learning](https://arxiv.org/abs/1812.06247)", by Chieng et al. (2018). **Flatten-T Swish**, or FTSwish, combines the ReLU and Sigmoid activation functions into a new one: + +[![](images/ftswish-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/ftswish-1.png) + +FTSwish can be mathematically defined as follows: + +\\begin{equation} FTSwish: f(x) = \\begin{cases} T, & \\text{if}\\ x < 0 \\\\ \\frac{x}{1 + e^{-x}} + T, & \\text{otherwise} \\\\ \\end{cases} \\end{equation} + +Where \[latex\]T\[/latex\] is a parameter that is called the _threshold value_, and ensures that the negative part of the equation produces negative values (see e.g. the plot, where \[latex\]T = -1.0\[/latex\]). 
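+
+To make this definition more tangible, here is a minimal sketch of FTSwish as a plain NumPy function. Note that this is our own illustrative implementation of the formula above - the function name and the default threshold of \[latex\]T = -1.0\[/latex\] are choices made for this example, not something prescribed by the paper:
+
+```
+import numpy as np
+
+def ftswish(x, T=-1.0):
+    """Flatten-T Swish: returns T for negative inputs,
+    and the Sigmoid-weighted identity plus T otherwise."""
+    x = np.asarray(x, dtype=np.float64)
+    positive = x / (1.0 + np.exp(-x)) + T  # Swish-like branch, shifted by T
+    return np.where(x < 0.0, T, positive)  # flat threshold for x < 0
+
+print(ftswish(np.array([-2.0, 0.0, 2.0])))  # approximately [-1.0, -1.0, 0.76]
+```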
+ +Clearly, we recognize the ReLU and Sigmoid activation functions combined in the positive segment: + +\\begin{equation} Sigmoid: f(x) = \\frac{1}{1 + e^{-x}} \\end{equation} + +\\begin{equation} ReLU: f(x) = \\begin{cases} 0, & \\text{if}\\ x < 0 \\\\ x, & \\text{otherwise} \\\\ \\end{cases} \\end{equation} + +This way, the authors expect that the function can _both_ benefit from ReLU's and Swish's advantages: sparsity with respect to the negative segment of the function, while the positive segment is smooth in terms of the gradients. + +Why both? Let's take a look at these gradients: + +[![](images/ftswish_deriv.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/ftswish_deriv.png) + +_FTSwish derivative_ + +As we can see, the sparsity principle is still true - the neurons that produce negative values are taken out. + +What we also see is that the derivative of FTSwish is smooth, which is what made Swish theoretically better than ReLU in terms of the loss landscape (MachineCurve, 2019). + +However, what I must note is that this function does not protect us from the dying ReLU problem: the gradients for \[latex\]x < 0\[/latex\] are zero, as with ReLU. That's why I'm a bit cautious, especially because Swish has _both_ the smoothness property _and_ the small but nonzero negative values at \[latex\]x \\approx 0\[/latex\] when negative. + +## FTSwish benchmarks + +Only one way to find out if my cautiousness is valid, right? :) Let's do some tests: we'll compare FTSwish with both ReLU and Swish with a ConvNet. + +I used the following [Keras datasets](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/) for this purpose: + +- MNIST; +- CIFAR-10; +- CIFAR-100. + +### MNIST + +For the MNIST based CNN, the architecture was relatively simple - two Conv2D layers, MaxPooling2D, and Dropout, with a limited amount of trainable parameters. This should, in my opinion, still lead to a well-performing model, because MNIST is a relatively discriminative and simple dataset. 
+ +``` +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +conv2d_1 (Conv2D) (None, 26, 26, 32) 320 +_________________________________________________________________ +max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32) 0 +_________________________________________________________________ +dropout_1 (Dropout) (None, 13, 13, 32) 0 +_________________________________________________________________ +conv2d_2 (Conv2D) (None, 11, 11, 64) 18496 +_________________________________________________________________ +max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64) 0 +_________________________________________________________________ +dropout_2 (Dropout) (None, 5, 5, 64) 0 +_________________________________________________________________ +flatten_1 (Flatten) (None, 1600) 0 +_________________________________________________________________ +dense_1 (Dense) (None, 256) 409856 +_________________________________________________________________ +dense_2 (Dense) (None, 10) 2570 +================================================================= +Total params: 431,242 +Trainable params: 431,242 +Non-trainable params: 0 +_________________________________________________________________ +``` + +These are the results: + +- [![](images/acc.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/acc.png) + +- [![](images/loss.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/loss.png) + +- [![](images/comp.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/comp.png) + + +As we can see, FTSwish finds accuracies of 97%+. However, the loss values are slightly worse than the ones reported by training either ReLU or Swish. + +### CIFAR-10 + +With CIFAR-10, I used the same architecture: + +``` +_________________________________________________________________ +Layer (type) Output Shape Param # +================================================================= +conv2d_1 (Conv2D) (None, 26, 26, 32) 320 +_________________________________________________________________ +max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32) 0 +_________________________________________________________________ +dropout_1 (Dropout) (None, 13, 13, 32) 0 +_________________________________________________________________ +conv2d_2 (Conv2D) (None, 11, 11, 64) 18496 +_________________________________________________________________ +max_pooling2d_2 (MaxPooling2 (None, 5, 5, 64) 0 +_________________________________________________________________ +dropout_2 (Dropout) (None, 5, 5, 64) 0 +_________________________________________________________________ +flatten_1 (Flatten) (None, 1600) 0 +_________________________________________________________________ +dense_1 (Dense) (None, 256) 409856 +_________________________________________________________________ +dense_2 (Dense) (None, 10) 2570 +================================================================= +Total params: 431,242 +Trainable params: 431,242 +Non-trainable params: 0 +_________________________________________________________________ +``` + +These are the results: + +- [![](images/acc-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/acc-1.png) + +- [![](images/loss-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/loss-1.png) + +- [![](images/combined.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/combined.png) + + +Contrary to the MNIST case, we can see overfitting occur here, despite the application of Dropout. 
What's more, ReLU seems to perform consistently over time, whereas overfitting definitely occurs with Swish and FTSwish.
+
+### CIFAR-100
+
+Finally, I trained a ConvNet architecture with the CIFAR-100 dataset, which contains 600 images per class across 100 classes. Please do note that I did not use any pretraining and thus no transfer learning, which means that the results will likely be quite poor with respect to the state-of-the-art.
+
+But I'm curious to find out how the activation functions perform, so I didn't focus on creating a transfer learning based model (perhaps, I'll do this later).
+
+This is the architecture, which contains additional trainable parameters:
+
+```
+_________________________________________________________________
+Layer (type)                 Output Shape              Param #
+=================================================================
+conv2d_1 (Conv2D)            (None, 30, 30, 64)        1792
+_________________________________________________________________
+max_pooling2d_1 (MaxPooling2 (None, 15, 15, 64)        0
+_________________________________________________________________
+dropout_1 (Dropout)          (None, 15, 15, 64)        0
+_________________________________________________________________
+conv2d_2 (Conv2D)            (None, 13, 13, 128)       73856
+_________________________________________________________________
+max_pooling2d_2 (MaxPooling2 (None, 6, 6, 128)         0
+_________________________________________________________________
+dropout_2 (Dropout)          (None, 6, 6, 128)         0
+_________________________________________________________________
+flatten_1 (Flatten)          (None, 4608)              0
+_________________________________________________________________
+dense_1 (Dense)              (None, 512)               2359808
+_________________________________________________________________
+dense_2 (Dense)              (None, 256)               131328
+_________________________________________________________________
+dense_3 (Dense)              (None, 100)               25700
+=================================================================
+Total params: 2,592,484
+Trainable params: 2,592,484
+Non-trainable params: 0
+_________________________________________________________________
+```
+
+These are the results:
+
+- [![](images/acc-2.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/acc-2.png)
+
+- [![](images/loss-2.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/loss-2.png)
+
+- [![](images/combined-1.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/combined-1.png)
+
+
+As we can see, the model starts overfitting quite soon, despite the application of Dropout. Overfitting is significantly worse compared to e.g. CIFAR-10, but this makes sense, as the number of samples per class is lower _and_ the number of classes is higher.
+
+Perhaps, the number of Dense parameters that can be trained might play a role as well, given the relatively few trainable parameters in the Conv layers :)
+
+What's more, Swish seems to be most vulnerable to overfitting, followed by FTSwish. Traditional ReLU overfits as well, but seems to be the most resistant of the three.
+
+* * *
+
+## Summary
+
+In this blog post, we've looked at the Flatten-T Swish activation function, also known as FTSwish. What is FTSwish? Why do the authors argue that it might improve ReLU? Why does it look like traditional Swish? We answered these questions above: being a combination between traditional Sigmoid and ReLU, it is expected that FTSwish benefits from both the sparsity of ReLU and the smoothness of Sigmoid. This way, it looks a bit like traditional Swish. 
+
+In addition, we looked at the performance of FTSwish with the MNIST, CIFAR-10 and CIFAR-100 datasets. Our results suggest that with simple models - we did not use a pretrained convolutional base whatsoever - traditional ReLU still performs best. It may be worthwhile to extend these experiments to larger and deeper models in the future, to find out about performance there.
+
+I hope you've learnt something today :) If you did, please leave a comment in the comments box below 😊 Feel free to do so as well if you have questions or if you wish to leave behind remarks, and I'll try to answer them quickly :)
+
+Thanks for reading MachineCurve today and happy engineering! 😎
+
+* * *
+
+## References
+
+Chieng, H. H., Wahid, N., Ong, P., & Perla, S. R. K. (2018). [Flatten-T Swish: a thresholded ReLU-Swish-like activation function for deep learning](https://arxiv.org/abs/1812.06247). _arXiv preprint arXiv:1812.06247_.
+
+Why Swish could perform better than ReLu – MachineCurve. (2019, September 4). Retrieved from [https://www.machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/](https://www.machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/)
diff --git a/what-is-the-t5-transformer-and-how-does-it-work.md b/what-is-the-t5-transformer-and-how-does-it-work.md
new file mode 100644
index 0000000..85fdff2
--- /dev/null
+++ b/what-is-the-t5-transformer-and-how-does-it-work.md
@@ -0,0 +1,14 @@
+---
+title: "What is the T5 Transformer and how does it work?"
+date: "2021-02-15"
+categories:
+  - "buffer"
+  - "deep-learning"
+tags:
+  - "nlp"
+  - "t5"
+  - "transformer"
+  - "transformers"
+---
+
+The **[Text-to-Text Transfer Transformer or T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html)** is a type of [Transformer](https://www.machinecurve.com/index.php/2020/12/28/introduction-to-transformers-in-machine-learning/) that is capable of being trained on a variety of tasks with a uniform architecture. It was created by Google AI and was introduced in the paper “[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)”. Here, we’ll take a look at T5 architecture, pretraining, finetuning — including variations and the conclusions that can be derived from them. It effectively summarizes the above linked paper.
diff --git a/what-is-weight-initialization.md b/what-is-weight-initialization.md
new file mode 100644
index 0000000..22e318a
--- /dev/null
+++ b/what-is-weight-initialization.md
@@ -0,0 +1,172 @@
+---
+title: "Weight initialization in neural networks: what is it?"
+date: "2019-08-22"
+categories:
+  - "deep-learning"
+tags:
+  - "deep-learning"
+  - "exploding-gradients"
+  - "initializers"
+  - "vanishing-gradients"
+  - "weight-initialization"
+---
+
+An important predictor for deep learning success is how you initialize the weights of your model, or **weight initialization** in short. However, for beginning deep learning engineers, it's not always clear at first what it is - partially due to the overload of initializers available in contemporary frameworks.
+
+In this blog, I will introduce weight initialization at a high level by looking at the structure of neural nets and the high-level training process first. Subsequently, we'll move on to weight initialization itself - and why it is necessary - as well as certain ways of initializing your network.
+
+In short, you'll find out why weight initialization is necessary, what it is, how to do it - and how _not_ to do it. 
+ +* * * + +**Update 18/Jan/2021:** ensure that article is up to date in 2021. Also added a brief summary with the contents of the whole article at the top. + +* * * + +\[toc\] + +* * * + +## Summary: what is weight initialization in a neural network? + +Neural networks are stacks of layers. These layers themselves are composed of neurons, mathematical units where the equation `Wx + b` is computed for each input. Here, `x` is the input itself, whereas `b` and `W` are representations for _bias_ and the _weights_, respectively. + +Both components can be used for making a neural network learn. The weights capture most of the available patterns hidden within a dataset, especially when they are considered as a system, i.e. as the neural network as a whole. + +Using weights means that they must be initialized before the neural network can be used. We use a weight initialization strategy for this purpose. A poor strategy would be to initialize with zeros only: in this case, the input vector no longer plays a role, and the neural network cannot learn properly. + +Another strategy - albeit a bit naïve - would be to initialize weights randomly. Very often, this works nicely, except in a few cases. Here, more advanced strategies like He and Xavier initialization must be used. We'll cover all of them in more detail in the rest of this article. Let's take a look! + +* * * + +## The structure of a neural network + +Suppose that we're working with a relatively simple neural net, a [Multilayer Perceptron](https://machinecurve.com/index.php/2019/07/27/how-to-create-a-basic-mlp-classifier-with-the-keras-sequential-api/) (MLP). + +An MLP looks as follows: + +![](images/Basic-neural-network.jpg) + +It has an **input layer** where data flows in. It also has an **output layer** where the prediction (be it classification or regression) flows out. Finally, there are one or multiple **hidden layers** which allow the network to handle complex data. + +Mathematically, one of the neurons in the hidden layers looks as follows: + +\[mathjax\] + +\\begin{equation} \\begin{split} output &= \\textbf{w}\\cdot\\textbf{x} + b \\\\ &=\\sum\_{i=1}^{n} w\_nx\_n + b \\\\ &= w\_1x\_1 + ... + w\_nx\_n + b \\\\ \\end{split} \\end{equation} + +where \[latex\]\\textbf{w}\[/latex\] represents the **weights vector,** \[latex\]\\textbf{x}\[/latex\] the **input vector** and \[latex\]b\[/latex\] the bias value (which is not a vector but a number, a scalar, instead). + +When they are input, they are multiplied by means of a **dot product.** This essentially computes an element-wise vector multiplication of which subsequently the new vector elements are summated. + +Your framework subsequently adds the bias value before the neuron output is complete. + +It's exactly this **weights vector** that can (and in fact, must) be initialized prior to starting the neural network training process. + +But explaining why this is necessary requires us to take a look at the high-level training process of a neural network first. + +* * * + +## The high-level training process + +Training a neural network is an iterative process. In the case of classification and a feedforward neural net such as an MLP, you'll make what is known as a _forward pass_ and a _backward pass_. The both of them comprise one _iteration_. + +In the forward pass, all training data is passed through the network once, generating one prediction per sample. + +Since this is training data, we know the actual target and can compare them with our prediction, computing the difference. 
+ +Generally speaking, the average difference between prediction and target value over our entire training set is what we call the _loss_. There are multiple ways of computing loss, but this is the simplest possible way of imagining it. + +Once we know the loss, we can start our backwards pass: given the loss, and especially the loss landscape (or, mathematically, the loss function), we can compute the error backwards from the output layer to the beginning of the network. We do so by means of _backpropagation_ and use a process called _gradient descent_. If we know the error for an arbitrary neuron, we can adapt its weights slightly (controlled by the _learning rate_) to move a bit into the direction of the error. That way, over time, the neural network adapts to the data set it is being fed. + +We'll cover these difficult terms in later blogs, since they do not further help explaining weight initialization, but together, they can ensure - given appropriate data - that neural networks show learning behavior over many iterations. + +![](images/High-level-training-process-1024x973.jpg) + +The high-level training process, showing a forward and a backwards pass. + +* * * + +## Weight initialization + +Now, we have created sufficient body to explain the need for weight initialization. Put very simply: + +- If one neuron contains a _weights vector_ that represents what a neuron has learnt that is multiplied with an _input vector_ on new data; +- And if the learning process is cyclical, feeding forward all data through the network. + +...it must start somewhere. And indeed, it starts at epoch 0 - or, put simply, at the start. And given the fact that during that first epoch, we'll see a forward pass, the network cannot have empty weights whatsoever. They will have to be _initialized_. + +In short, weight initialization comprises setting up the weights vector for all neurons for the first time, just before the neural network training process starts. As you can see, indeed, it is highly important to neural network success: without weights, the forward pass cannot happen, and so cannot the training process. + +* * * + +## Ways to initialize your network + +Now that we have covered _why_ weight initialization is necessary, we must look briefly into _how_ to initialize your network. The reason why is simple: today's deep learning frameworks contain a quite wide array of initializers, which may or may not work as intended. Here, we'll slightly look into all-zeros initialization, random initialization and using slightly more advanced initializers. + +### All-zeros initialization + +Of course, it is possible to initialize all neurons with all-zero weight vectors. However, this is a very bad idea, since effectively you'll start training your network while all your neurons are dead. It is considered to be poor practice by the deep learning community (Doshi, 2019). + +Here's why: + +Recall that in a neuron, \[latex\]output = \\textbf{w}\\cdot\\textbf{x} + b\[/latex\]. + +Or: + +\\begin{equation} \\begin{split} output &= \\textbf{w}\\cdot\\textbf{x} + b \\\\ &=\\sum\_{i=1}^{n} w\_nx\_n + b \\\\ &= w\_1x\_1 + ... + w\_nx\_n + b \\\\ \\end{split} \\end{equation} + +Now, if you initialize \[latex\]\\textbf{w}\[/latex\] as an all-zeros vector, a.k.a. a list with zeroes, what do you think happens to \[latex\]w1 ... wn\[/latex\]? + +Exactly, they're all zero. 
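A quick numeric check - with made-up input values, purely for illustration - shows what this means for the neuron output:

```
import numpy as np

# All-zero weights vector and some bias value (made-up numbers)
w = np.zeros(4)
b = 0.5

# Two very different input vectors
x_1 = np.array([0.2, -1.3, 4.0, 0.7])
x_2 = np.array([100.0, 55.5, -3.1, 8.8])

# output = w . x + b
print(np.dot(w, x_1) + b)  # 0.5
print(np.dot(w, x_2) + b)  # 0.5 - the input has no influence whatsoever
```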
+

And since anything multiplied by zero is zero, you see that with zero initialization, the input vector \[latex\]\\textbf{x}\[/latex\] no longer plays a role in computing the output of the neuron.

Zero initialization would thus produce poor models that, generally speaking, do not perform better than linear ones (Doshi, 2019).

### Random initialization

It's also possible to perform random initialization. You could use two statistical distributions for this: either the standard normal distribution or the uniform distribution.

Effectively, you'll simply initialize all the weights vectors randomly. Since the weights then hold nonzero values, your neurons aren't dead and will work from the start. You should only expect performance to be (very) low during the first couple of epochs, simply because those random values likely do not correspond to the actual distribution underlying the data.

With random initialization, you'll therefore typically see a loss that decreases quickly at first and plateaus later on.

There are, however, two types of problems that you can encounter when you initialize your weights randomly: the _vanishing gradients problem_ and the _exploding gradients problem_. If you initialize your weights randomly, two scenarios may occur:

- Your weights are very small. Backpropagation, which computes the error backwards, chains various numbers from the loss towards the updateable layer. Since 0.1 x 0.1 x 0.1 x 0.1 is very small, the actual gradient to be taken at that layer is really small (0.0001). Consequently, with random initialization, in the case of very small weights, you may encounter _vanishing gradients_. That is, the farther from the end of the network the update takes place, the slower it goes. This might mean that your model does not reach its optimum in the time you allow it to train.
- In another case, you experience the _exploding gradients_ scenario. In that case, your initialized weights are _very much off_, perhaps because they are really large, and by consequence a large weight swing must take place. Similarly, if this happens throughout many layers, the weight swing may be large: \[latex\]10^6 \\cdot 10^6 \\cdot 10^6 \\cdot 10^6 = 10^\\text{24}\[/latex\]. Two things may happen then: first, because of the large weight swing, you may simply not reach the optimum for that particular neuron (which often requires taking small steps). Second, weight swings can yield number overflows in e.g. Python, so that the language can no longer process those large numbers. The result, `NaN`s (Not a Number), will reduce the power of your network.

So although random initialization is much better than all-zeros initialization, you'll see that it can be improved even further.

Note that we'll cover the vanishing and exploding gradients problems in a different blog, where we'll also introduce more advanced initializers in greater detail. For the scope of this blog, we'll stay at a higher level and will next cover two mitigators for those problems - the He and Xavier initializers.

### Advanced initializers: He, Xavier

If you wanted to reduce the vanishing and exploding gradients problems, how would you do it?

That's exactly the question certain scholars set out to answer in order to improve the initialization procedure for their neural networks.

Two initialization techniques are the result: Xavier (or Glorot) initialization, and He initialization - both of which can be selected by name in modern deep learning frameworks, as the sketch below illustrates.
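As a minimal sketch - with a made-up layer configuration and an eight-feature input shape chosen purely for illustration - this is how such initializers can be selected per layer in Keras:

```
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Minimal sketch: selecting weight initializers by name in Keras.
# 'he_uniform'/'he_normal' are common choices for ReLU-style activations,
# while 'glorot_uniform'/'glorot_normal' (Xavier) suit Tanh/Sigmoid-style ones.
model = Sequential()
model.add(Dense(64, input_shape=(8,), activation='relu',
                kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid',
                kernel_initializer='glorot_uniform'))
```

Swapping the initializer strings is all it takes to experiment with a different strategy.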
+ +Both work relatively similarly, but share their differences - that we will once again cover in another blog in more detail. + +However, in plain English, what they essentially do is that they 'normalize' the initialization value to a value for which both problems are often no longer present. That means that very large initializations will be lowered significantly, while very small ones will be made larger. In effect, both initializers will attempt to produce values around one. + +By consequence, modern deep learning practice often favors these advanced initializers - He initialization and Xavier (or Glorot) initialization - over pure random and definitely all-zeros initialization. + +In this blog, we've seen why weight initialization is necessary, what it is and how to do it. If you have any questions, if you wish to inform me about new initializers or mistakes I made here... I would be really happy if you left a comment below 👇 + +Thank you and happy engineering! 😎 + +* * * + +## References + +Chollet, F. (2017). _Deep Learning with Python_. New York, NY: Manning Publications. + +Doshi, N. (2019, May 2). Deep Learning Best Practices (1) ? Weight Initialization. Retrieved from [https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94](https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94) + +Neural networks and deep learning. (n.d.). Retrieved from [http://neuralnetworksanddeeplearning.com/chap5.html](http://neuralnetworksanddeeplearning.com/chap5.html) + +Wang, C. (2019, January 8). The Vanishing Gradient Problem. Retrieved from [https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484](https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484) diff --git a/which-regularizer-do-i-need-for-training-my-neural-network.md b/which-regularizer-do-i-need-for-training-my-neural-network.md new file mode 100644 index 0000000..ac0b29f --- /dev/null +++ b/which-regularizer-do-i-need-for-training-my-neural-network.md @@ -0,0 +1,225 @@ +--- +title: "Which regularizer do I need for training my neural network?" +date: "2020-01-26" +categories: + - "deep-learning" +tags: + - "deep-learning" + - "elastic-net-regularization" + - "l1-regularization" + - "l2-regularization" + - "machine-learning" + - "regularization" + - "regularizer" +--- + +There are three widely known regularization techniques for neural networks: L1 (or Lasso) regularization, L2 (or Ridge) regularization and Elastic Net regularization, which combines the two, and is also called L1+L2. + +But which regularizer to use in your neural network? Especially for larger machine learning projects and for people who just started with machine learning, this is an often confusing element of designing your neural network. + +In this blog post, I will try to make this process easier, by proposing a simple flowchart for choosing a regularizer for your neural network. It's based both on my experience with adding regularizers as well as what theory suggests about picking a regularization technique. If you have any suggestions, improvements or perhaps extensions, please _feel free to share your experience by leaving a comment at the bottom of the page!_ 😊💬 Based on your feedback, we can improve the flowchart together. + +The structure of this blog post is as follows. Firstly, we'll take a brief look at the basics of regularization - for starters, or for recap, if you wish. 
This includes a brief introduction to the L1, L2 and Elastic Net regularizers. Then, we continue with the flowchart - which includes the steps you could follow for picking a regularizer. Finally, we'll discuss each individual step in more detail, so that you can understand the particular order and why these questions are important.

Are you ready? Let's go! 😎

_PS:_ If you wish to understand regularization or how to use regularizers with Keras in more detail, you may find these two blogs interesting as well:

- [What are L1, L2 and Elastic Net Regularization in neural networks?](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/)
- [How to use L1, L2 and Elastic Net Regularization with Keras?](https://www.machinecurve.com/index.php/2020/01/23/how-to-use-l1-l2-and-elastic-net-regularization-with-keras/)

* * *

\[toc\]

* * *

## The basics of regularization

Before we move on to my answer to the question _which regularizer do I need_, I think it's important to take one step back and look at the basics of regularization first. Through these basics, you'll likely understand the flowchart in a better way. If you already know a thing or two about regularizers, feel free to skip this part and directly move on to the flowchart.

Here, we'll cover these things:

- Why a regularizer could be necessary for your machine learning project;
- L1 Regularization;
- L2 Regularization;
- Elastic Net (L1+L2) Regularization.

### Why a regularizer could be necessary

Firstly, let's take a look at why we need a regularizer in the first place. Suppose that we have a few data points from which we learn a regression model using a neural network:

![](images/poly_both.png)

Which learnt mapping - i.e., the learnt function which maps the input data to the output data - would likely generalize better to unseen data, the light blue one or the orange one? Obviously, this depends on your data, but let's add a little bit of context: the data covers the weekly income versus the weekly net expense of a bank.

In that case, it's very unlikely that the light blue function is what we're looking for. The orange one looks quite nice though, and may even be representative of the actual real-world pattern.

Unfortunately, training a machine learning model is a continuous balance between underfitting and overfitting: finding the sweet spot between a model that predicts _well enough_ and a model that predicts _too well_ - fitting the training data so closely that it produces worse results for the data it hasn't seen yet. Especially when your model overfits, you'll see light blue-like curves starting to appear. You want to avoid this!

One way of doing so is by applying a regularizer to your model. A regularizer, by means of inclusion in the loss value that is minimized, punishes the model for being _too complex_. That is, it penalizes states where the weights of your neural network take on overly large and idiosyncratic values. This way, your weights remain simple, and you may find that the sensitivity of your model to overfitting is reduced.

### Common regularizers

Now, which regularizers are out there? Let's take a look at the most common ones that are available for your machine learning projects today: L1, L2 and Elastic Net Regularization. 
[Click here if you want to understand the regularizers in more detail.](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/) + +#### L1 Regularization + +First of all, L1 Regularization, which is also called Lasso Regularization. By computing and summating the absolute values of weights, the so-called Taxicab norm (or L1 norm) is added to the loss value. This norm, which tells you something about the absolute distance across all dimensions that must be traveled from the origin to the tip of the weights vector, helps you to achieve simple models. What's more, because of the structure of the derivative (which produces constant values), weights that do not contribute sufficiently to the model essentially "drop out", as they are forced to become zero. + +[![](images/l1_component.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l1_component.png) + +#### L2 Regularization + +Secondly, there is L2 Regularization (a.k.a. Ridge Regularization), which is based on the summated squares of the weights. Although it does enforce simple models through small weight values, it doesn't produce sparse models, as the derivative - \[latex\]2x\[/latex\] - produces smaller and smaller gradients (and hence changes) when \[latex\]x\[/latex\] approaches zero. It can thus be useful to use L2 if you have correlative data, or when you think sparsity won't work for your ML problem at hand. + +[![](images/l2_comp.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/l2_comp.png) + +#### Elastic Net Regularization + +When you combine L1 and L2 Regularization, you get Elastic Net Regularization. As you can see, depending on some hyperparameter (or two, if you don't combine the lambdas into the single alpha parameter) that can be configured by the machine learning engineer, it takes the shape of L1, or L2, or something in between. If you don't know which regularizer to apply, it may be worthwhile to try Elastic Net first, especially when you think that regularization may improve model performance over no regularization. + +[![](images/penalty-values.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/penalty-values.png) + +* * * + +## Picking a regularizer - a flowchart + +Now that we have identified the three most common regularizers, here's a flowchart which can help you determine the regularizer that may be useful to your project. I always use these questions to help choose one: + +- Can you take a subsample that has the same distribution as the main sample? +- Do you have resoures available for validation? + - If so, how do the validation experiments perform? +- Do you have prior knowledge about your dataset? + - Are your features correlated? + - If not, do you need the "entire picture"? + +In the next section, we'll analyze these questions individually, using more detail. + +Now, here's the flowchart. Blue means that _a_ _question_ must be answered, yellow means that _an action_ must be taken, and green means that you've arrived at an _outcome_. + +[![](images/Which-regularizer-do-I-need-2-794x1024.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/Which-regularizer-do-I-need-2.png) + +* * * + +## Question to outcome - dissecting the flowchart + +Let's now take a look at how we constructed the flow chart - and why I think that these questions help you choose a regularizer for your machine learning project. + +### Question 1: Can you take a sub sample with the same distribution? 
+

Taking a sub sample of your dataset allows you to perform validation activities, i.e. to pick a likely adequate regularizer by testing up front. However, here, it's important to ensure that your sub sample has the same distribution (i.e. sample mean and sample variance) as the entire sample. This way, it is most likely that good results achieved through validation will generalize to the real training scenario.

#### Yes - take a sub sample and move to question 2

When you can take a sub sample with the same distribution, that would be preferred, as it opens the path to empirically determining which regularizer works best.

[Here is an algorithm that can be used for generating the sub sample.](https://maxhalford.github.io/blog/subsampling-a-training-set-to-match-a-test-set---part-1/)

#### No - move on to question 3

It may be the case that you cannot take a sub sample from your dataset. For example, it may be too much of a hassle. In that case, let's move on to question 3, where the goal is to find a possible starting point based on theoretical knowledge and/or assumptions about your dataset.

### Question 2: Do you have resources available for validation?

If you can generate a subset of your data that has approximately the same distribution as your whole data set, you can use it to validate certain choices with respect to regularizers. Validating your choice and your assumptions may save you quite some resources later. However, this requires you to answer one question in particular:

Do you have resources available for those validation activities?

It's more of a business-related question than a technical one. It isn't difficult to spawn an instance in the cloud with a well-performing GPU, nor is it to use your own dedicated one if available. However, do you have the time? What does it mean for project costs? Those questions need to be answered.

[![](images/image-19-1024x239.png)](https://www.machinecurve.com/wp-content/uploads/2020/01/image-19.png)

_Adding L2 Regularization to a Keras model._

#### Yes - baseline & first regularizer

If you can answer them positively, it's time to start some validation activities. First of all, you'll need a baseline model, with which you can compare the effects of training models with a regularizer. Quite unsurprisingly, such a baseline model could be one where no regularization is applied. Hence, the first step is to train such a baseline model with your sub sample. Do make sure to evaluate it with testing data as well, to see how well it generalizes.

Afterwards, the validation process can start. I always think it's best to start with Elastic Net based regularization here, as it combines L1 and L2. Apply it to the layers of your choice, set the hyperparameters \[latex\]\\lambda\_1\[/latex\] and \[latex\]\\lambda\_2\[/latex\] (or \[latex\]\\alpha\[/latex\], depending on your approach), and train the model - once again with your sub sample. After training, make sure to generate evaluation metrics as well. Then, you might play around a bit with the settings and retrain a couple of times, but depending on the results this might not even be necessary. Move on to question 4.

#### No - move on to question 3

If you can generate a sub sample but don't have resources available for validation, it's best to move to question 3. 
Here, we'll try to decide which regularizer it's best to start with based on possible assumptions and prior knowledge about your dataset. + +### Question 3: Do you have prior knowledge about the dataset? + +Even if you cannot generate a sub sample or perform validation activities, it may be possible to determine a starting point regarding the regularizer that is to be applied. This involves answering two questions: + +- Are your features correlated? +- Do you need the "entire picture"? + +Next, I'll explain them in more detail, so that you'll know how to answer them :) + +#### Are your features correlated? + +Say that you are training a classification model where your features are correlated. For example, in the case of health measurements that may predict the onset of diabetes in five years, a model can be trained that either tells you that the patient is or is not sensitive to diabetes five years from now. + +Likely, features such as Body Mass Index (BMI), fat percentage and blood pressure correlate to some extent. In this case, they provide predictive information individually, but their _interrelationships_ also contribute to the outcome - i.e., whether a person will develop diabetes or not. + +Hence, they cannot simply be dropped out when training the model. So if your answer to the question "are your features correlated?" is yes, it's highly unlikely that applying L1 Regularization will help you achieve better results. By consequence, the same is true for L1+L2 (Elastic Net) Regularization. Hence, in this case, it's best to start with L2 regularization. + +If the answer is "no", let's take a look at the next question. + +#### Do you need the "entire picture"? + +Suppose that you don't have features which describe individual variables, such as when you're using image and video datasets. For example, the images below are samples from the [EMNIST dataset](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/), which means that they are 28 x 28 pixel images with only 1 image channel. By consequence, the number of features present is 28 x 28 = 784. But does each feature tell something about a variable? + +Nope. + +![](images/emnist-letters.png) + +_Sample letters from the [EMNIST dataset](https://www.machinecurve.com/index.php/2020/01/10/making-more-datasets-available-for-keras/)._ + +Do we need the "entire picture" - and in this case, literally, the entire picture? + +Likely - yes. + +If you need the entire picture, in my experience, it may not be worthwhile to apply L1 Regularization, due to its effect on sparsity. Hence, if the answer to this question is "yes", you may also wish to start with L2 Regularization. From there, you may experiment with other regularizers. + +If the answer to this question is "no", I'd recommend L1 Regularization as a starting point. + +### Question 4: How well does L1+L2 perform on the baseline model? + +All right, let's take a look at the fourth and final question. You're likely here because you were able to (1) generate a sub sample with a similar or equal distribution and (2) have resources available for validation. Then, you trained a baseline model (likely without regularization) and then validated the Elastic Net (or L1+L2) regularizer. + +The question to be answered is: does it perform better or worse on the baseline model? + +#### Better: validate with L1 regularization + +If L1+L2 regularization performs better, it's worth a try to perform regularization with L1 alone. 
If you're here, these are the steps forward: + +- Retrain your validation model, but this time use L1 Regularization only; +- If it performs better, your preferred starting point would be L1 Regularization; +- Otherwise, the preferred starting point would be to apply Elastic Net (L1+L2) Regularization during training with your entire data set. + +#### Worse: validate with L2 regularization + +In my experience, if the L1+L2 regularizer performs worse compared to the no-regularization-baseline model, then one of these assumptions is _false_: + +- The features don't correlate; +- You don't need the entire picture. + +Or, in short: in my experience, it's likely the L1 Regularizer component that produces pretty bad performance. Now, these are the steps forward: + +- Retrain your validation model, but this time use L2 Regularization only. +- If the retrained model performs _better_, your preferred starting point would be L2 Regularization. +- If it performs _worse_, the starting point would be to apply no regularization at all. + +* * * + +## Summary + +Which regularizer to apply when training your neural network entirely depends on the data at hand, and some other conditions. In this blog post, we looked at this question in more detail. Through a flowchart, and a more detailed discussion on its individual components, I hope this blog sheds more light on the decision that must be made by a ML engineer every now and then. + +If you did learn something today, please leave a comment in the comments box below - I'd appreciate it! 😊 However, make sure to do the same if you disagree with elements of this blog post, as they generally describe my working process - which does not necessarily have to be the right one! I'll happily improve this blog post and learn from your experience :) + +Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +MachineCurve. (2020, January 21). What are L1, L2 and Elastic Net Regularization in neural networks? Retrieved from [https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/](https://www.machinecurve.com/index.php/2020/01/21/what-are-l1-l2-and-elastic-net-regularization-in-neural-networks/) + +MachineCurve. (2020, January 23). How to use L1, L2 and Elastic Net Regularization with Keras? Retrieved from [https://www.machinecurve.com/index.php/2020/01/23/how-to-use-l1-l2-and-elastic-net-regularization-with-keras/](https://www.machinecurve.com/index.php/2020/01/23/how-to-use-l1-l2-and-elastic-net-regularization-with-keras/) diff --git a/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example.md b/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example.md new file mode 100644 index 0000000..1fb6672 --- /dev/null +++ b/why-nonlinear-activation-functions-improve-ml-performance-with-tensorflow-example.md @@ -0,0 +1,262 @@ +--- +title: "Why nonlinear activation functions improve ML performance - with TensorFlow 2.0 example" +date: "2020-10-29" +categories: + - "deep-learning" +tags: + - "activation-function" + - "activation-functions" + - "linear" + - "machine-learning" + - "nonlinear" + - "tensorflow" +--- + +Machine Learning is here to stay. More and more organizations are grasping the benefits of the technology, as long as it is applied with care - and with a realistic mindset. 
By consequence, demand for ML engineers is high, and the field is working towards increased [commoditization](https://www.machinecurve.com/index.php/2020/10/27/using-teachable-machine-for-creating-tensorflow-models/) and [automation](https://www.machinecurve.com/index.php/2020/06/09/automating-neural-network-configuration-with-keras-tuner/). + +But why was there an explosion of Machine Learning, anyway? Why did Deep Neural Networks grow in popularity exponentially in the years after the 2012 computer vision breakthrough? + +In fact, there are many reasons - for example, that computational capabilities were now sufficient for training very deep models. However, one of the reasons is the fact that **nonlinear activation functions** are used. In this article, we'll figure out why this boosts Machine Learning performance. As we shall see, thanks to such activation functions, we can learn more complex patterns within data, compared to more linear approaches in the past. This includes an example TensorFlow model that demonstrates why nonlinear activations often lead to much better performance compared to a linear one. + +Let's take a look! 😎 + +* * * + +\[toc\] + +* * * + +## Neural networks as a system: layers with activation functions + +I think that it's important to first view neural networks as a system, thus holistically: it's an architecture of layers and activation functions that is _trained_ - and by means of training, we mean a mathematical and iterative optimization process until some kind of threshold is reached or the process stops through user input. + +### Layers + +For example, let's take a look at this basic neural network: + +![](images/Basic-neural-network.jpg) + +It's in fact a really simple one - we see a yellow **input layer**, a red **output layer** and just one blue **hidden layer** in between. + +As you can imagine, the input layer is capable of accepting input to the model that is in some kind of shape. For example, if the model supports a three-dimensional Tensor, that's what you must feed the input layer - or an error will be thrown. It's thus like an ingestion mechanism that feeds forward acceptable input into the next layer. + +The hidden layer(s - just one in this case) attempt to capture patterns hidden within the dataset as a whole through training. Training, as we shall see later in this article, is an iterative optimization process. Patterns are captured by means of [weights](https://www.machinecurve.com/index.php/2019/08/22/what-is-weight-initialization/). When new samples pass through these hidden layers, they thus attempt to 'see' whether certain patterns are present - and if so, the individual components (i.e. neurons) that capture these patterns will 'fire' to the next layer with more strength. + +Finally, the output layer generates the final prediction. For example, [in the case of binary classification](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/), this is a numeric prediction in the range \[latex\]\[0, 1\]\[/latex\] - i.e. a 0% to 100% probability that it's some class. Output \[latex\]0.6666\[/latex\] suggests that it's more likely to be class \[latex\]1\[/latex\], but the model is not entirely sure. 
+

In the case of [multiclass classification](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/), it would be a [probability distribution](https://www.machinecurve.com/index.php/2020/01/08/how-does-the-softmax-activation-function-work/) over the possible output classes - with each class assigned a probability, and all probabilities summing to \[latex\]1\[/latex\].

### Individual neurons

If you zoom in on one of the neurons from the neural network displayed above, you would see this:

![](images/layer-linear.png)

What is happening inside each individual neuron is that the _vector_ with input data called \[latex\]\\textbf{x}\[/latex\], which can be compared to a Python list, is multiplied with a weights vector \[latex\]\\textbf{w}\[/latex\], after which a bias value \[latex\]b\[/latex\] is added. The output is passed to the next layer.

Of course, in the input layer, \[latex\]\\textbf{x}\[/latex\] represents the _feature vector_ - i.e., the list of features that together represents one sample - while it represents the output of previous layers in the hidden layers and the output layer.

For each neuron, vector \[latex\]\\textbf{w}\[/latex\] represents the patterns within the dataset learnt by each individual neuron; the system as a whole captures all the patterns that can possibly be captured by the neural network.

This is already a great step forward - a system of neurons can learn increasingly abstract patterns from a dataset: it's Machine Learning taking place!

But our definition of a neuron until now is also problematic. If you remember some of your mathematics, you recall that the function of a straight line is of the form \[latex\]f(x): y = a \\times x + b\[/latex\]. This really looks like the vector multiplication and bias addition mentioned above!

Indeed, it is exactly the cause of the problem - the weights \[latex\]w\[/latex\] and bias \[latex\]b\[/latex\] that can be learnt by the model effectively allow each neuron to capture a linear pattern - a line. Since the system as a whole performs the same kind of computation, just at massive scale, we can easily see that with the neuron setup from above, the system can only learn linear patterns.

In plain English, in the case of [classification](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/), it can only learn to generate a [separation boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) that has the shape of a line. And in the case of regression, the other form of supervised learning where the outcome is a real number (like 2.349839), it can only learn to predict a linear relationship between input variables.

We're not going to win any wars with neural networks if they look like this, so let's take a look at how to move forward.

### Activation functions

If we zoom out a bit more, especially in the case of newer neural networks, we see that an individual neuron is always followed by another block -- a block called an **activation function**:

![](images/layer-act-1024x227.png)

This function takes the neuron output as its input and generates another output based on this input.

Or, more mathematically speaking, it maps the input to some output:

\[latex\]layer\_output(\\textbf{x}) = activation(\\textbf{w} \\cdot \\textbf{x} + b)\[/latex\].

And this output does not need to be linear! 
In fact, it is perfectly possible to use an \[latex\]activation\[/latex\] function that is nonlinear. + +### A common activation function: ReLU + +One such _nonlinear_ activation function is called the **Rectified Linear Unit**, or ReLU for short. + +It is really simple: when the input \[latex\]x < 0\[/latex\], the output is \[latex\]0\[/latex\]. Otherwise, it is \[latex\]x\[/latex\]. Mathematically, it can also be written as \[latex\]activation(x) = max(x,0)\[/latex\] - as it will become 0 for all negative inputs. This mathematical and hence computational simplicity has led ReLU to become one of the most common activation functions used today. + +Visually, it looks as follows: + +![](images/relu-1024x511.png) + +Thanks to nonlinear activation functions like ReLU, training a neural network becomes training a _nonlinear_ system. This subtle change suddenly allows us to capture nonlinear inputs. + +* * * + +## Nonlinearity is part of why ML exploded + +In fact, in my point of view, it is one of the reasons why the field of Machine Learning has seen such an extreme rise in popularity these last few years. (Okay, the massive performance boost of ConvNets has also contributed - but nonlinearity in activation functions has contributed as well). + +I can certainly imagine that you want some kind of evidence in favor of this statement. For this reason, let's build a TensorFlow model that will first only use linear activation functions. We then show that it doesn't work with a nonlinear dataset, and subsequently move forward with ReLU based activation. + +### A nonlinear dataset + +Let's first construct a nonlinear dataset using [Scikit-learn](https://scikit-learn.org/) - also make sure that you have `matplotlib` and `numpy` on your system: + +``` +# Imports +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_moons + +# Configuration options +num_samples_total = 1000 +training_split = 250 + +# Generate data +X, targets = make_moons(n_samples = num_samples_total) +targets[np.where(targets == 0)] = -1 +X_training = X[training_split:, :] +X_testing = X[:training_split, :] +Targets_training = targets[training_split:] +Targets_testing = targets[:training_split] + +# Generate scatter plot for training data +plt.scatter(X_training[:,0], X_training[:,1]) +plt.title('Nonlinear data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() +``` + +Using Scikit's `make_moons` function, we're going two generate 1000 samples (750 training / 250 testing samples) that together form two moons: + +![](images/nonlinear.png) + +It's simply impossible to create a linear [classifier](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/) here - no straight line can be drawn that separates the two moons :) + +### Constructing a linear TensorFlow model + +We now show you that it doesn't work by generating a linear TensorFlow model. Let's extend our code: + +- We add extra imports, specifically related to `tensorflow` and its Keras API +- We set configuration options for the Machine Learning model. +- We create the model - do note that our `Dense` layer (i.e. the blue hidden layer in the plot above) activates _linearly_, that is, it applies \[latex\]f(x) = x\[/latex\] as an activation function. +- We compile the model. +- We start the fitting i.e. training process. +- We perform light model evaluation activities to see how well it performs on a testing dataset. 
+- We use Mlxtend to [visualize the decision boundary](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) of our model. + +``` +# Imports +import tensorflow.keras +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +import matplotlib.pyplot as plt +import numpy as np +from sklearn.datasets import make_moons +from mlxtend.plotting import plot_decision_regions + +# Configuration options +num_samples_total = 1000 +training_split = 250 + +# Generate data +X, targets = make_moons(n_samples = num_samples_total) +targets[np.where(targets == 0)] = -1 +X_training = X[training_split:, :] +X_testing = X[:training_split, :] +Targets_training = targets[training_split:] +Targets_testing = targets[:training_split] + +# Generate scatter plot for training data +plt.scatter(X_training[:,0], X_training[:,1]) +plt.title('Nonlinear data') +plt.xlabel('X1') +plt.ylabel('X2') +plt.show() + +# Set the input shape +feature_vector_shape = len(X_training[0]) +input_shape = (feature_vector_shape,) +print(f'Feature shape: {input_shape}') + +# Create the model +model = Sequential() +model.add(Dense(50, input_shape=input_shape, activation='linear', kernel_initializer='he_uniform')) +model.add(Dense(1, activation='sigmoid')) + +# Configure the model and start training +model.compile(loss='squared_hinge', optimizer='adam', metrics=['accuracy']) +model.fit(X_training, Targets_training, epochs=50, batch_size=25, verbose=1, validation_split=0.2) + +# Test the model after training +test_results = model.evaluate(X_testing, Targets_testing, verbose=1) +print(f'Test results - Loss: {test_results[0]} - Accuracy: {test_results[1]*100}%') + +# Plot decision boundary +plot_decision_regions(X_testing, Targets_testing, clf=model, legend=2) +plt.show() +``` + +### Why it doesn't work + +After running this code and letting the training process finish, this is the outcome: + +![](images/linear-1024x514.png) + +And it is precisely as expected. Since our neurons activate linearly, i.e. \[latex\]f(x) = x\[/latex\], the neural network as a whole behaves linearly - and can only learn a straight line. + +That straight line is perfectly visible in the plot above. + +Long story short: linearity in a neural network significantly impacts model performance when your dataset is nonlinear. + +### Using ReLU based nonlinear activation + +Let's now replace the model creation part of the code above with the code that follows next. Here, we: + +- Replace the `activation` function with ReLU, a.k.a. \[latex\]max(x, 0)\[/latex\]. +- Add a few extra layers with more neurons per layer in the upstream layers, increasing the detail with which the model will learn a decision boundary (in fact, we're explicitly adding sensitivity to overfitting here, but for the demonstration purpose that is precisely what we want to happen). + +``` +# Create the model +model = Sequential() +model.add(Dense(200, input_shape=input_shape, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(150, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(100, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(50, activation='relu', kernel_initializer='he_uniform')) +model.add(Dense(1, activation='sigmoid')) +``` + +### So... does it work? 
+ +Yes, definitely: + +![](images/nonlinear-1-1024x514.png) + +Simply using nonlinear activation functions makes our neural network behave as a nonlinear system, allowing it to capture a decision boundary in nonlinear datasets as well. + +And precisely that is what I mean with one of the advances that has boosted ML in the last decade! + +* * * + +## Summary + +In this article, we looked at why linearity in a neural network - that is, using no activation function or a linear activation - will significantly impact model performance. This is due to the fact that neurons and the mathematical operations that happen within those neurons are linear operations - and can learn nothing but linear functions. The nonlinearities that can be added, such as Rectified Linear Unit - which is computationally efficient - will ensure that the model can learn a nonlinear decision boundary. + +Using a TensorFlow model, we demonstrated that this is really the case in practice. Using a nonlinear dataset, we saw that no adequate decision boundary can be learnt when the neural network behaves linearly. Converting the linear activation functions into nonlinear ones, on the other hand, meant that learning the decision boundary became a piece of cake, at least for the relatively simple dataset that we used today. + +I hope that you have learnt something from today's article. If you did, please feel free to leave a message in the comments section below! 💬 Please feel free to do the same if you have questions or other remarks. Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +MachineCurve. (2019, October 22). _How to visualize the decision boundary for your Keras model?_ [https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/](https://www.machinecurve.com/index.php/2019/10/11/how-to-visualize-the-decision-boundary-for-your-keras-model/) + +MachineCurve. (2020, February 2). _Why you shouldn't use a linear activation function_. [https://www.machinecurve.com/index.php/2019/06/11/why-you-shouldnt-use-a-linear-activation-function/](https://www.machinecurve.com/index.php/2019/06/11/why-you-shouldnt-use-a-linear-activation-function/) + +MachineCurve. (2020, October 22). _3 variants of classification problems in machine learning_. [https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/) diff --git a/why-swish-could-perform-better-than-relu.md b/why-swish-could-perform-better-than-relu.md new file mode 100644 index 0000000..49cd566 --- /dev/null +++ b/why-swish-could-perform-better-than-relu.md @@ -0,0 +1,89 @@ +--- +title: "Why Swish could perform better than ReLu" +date: "2019-05-30" +categories: + - "deep-learning" +tags: + - "activation-function" + - "deep-learning" + - "swish" +--- + +Neural networks are composed of various layers of neurons. Mathematically, a neuron is nothing but the dot product between the weights vector **w** and the input vector **x**, yielding a scalar value that is passed on to the next layer. + +Except that it isn't. + +If we would pass the scalar value, the model would behave as if it is a [linear one](https://www.machinecurve.com/index.php/2019/06/11/why-you-shouldnt-use-a-linear-activation-function/). In fact, it would only be able to produce linear decision boundaries between the classes you're training the model for. 
To extend neural network behavior to non-linear data, smart minds have invented the _[activation function](https://machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/#what-is-an-activation-function)_ - a function which takes the scalar as its input and maps it to another numerical value. Since activation functions can be non-linear, neural networks have acquired the capability of handling non-linear data. In many applications, the results have been impressive. In this blog, we'll study today's commonly used activation functions and inspect a relatively new player... the Swish activation function. Does it perform better and if so, why is that the case? + +**Update February 2020** - Added links to other MachineCurve blogs and processed small spelling improvements. + +\[toc\] + +\[ad\] + +## Today's activation functions + +In the machine learning community, three major activation functions are used today. + +First, there is the **tanh** activation function. It can be visualized as follows. Clearly, one can see that the entire domain (-∞, ∞) is mapped to a range of (-1, 1). + +![](images/tanh-1024x511.png) + +Second, there is the **sigmoid** or _softstep_ activation function. Its visualization goes as follows. + +![](images/sigmoid-1024x511.png) + +The shape of this function is really similar, but one noticeable difference is that the (-∞, ∞) domain is mapped to the (0, 1) range instead of the (-1, 1) range. + +Finally, the most prominently activation function used today is called Rectified Linear Unit or **ReLU**. All inputs x < 0 are mapped to 0, zeroing out the neuron, whereas for inputs x >= 0 ReLU is linear. It looks as follows. + +![](images/relu-1024x511.png) + +Those activation functions all have their own [benefits and disbenefits](https://machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/). This primarily has to do with how neural networks are optimized - i.e., through [gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/). That is, the gradient is computed with respect to the neural weights, after which the weights are altered based on this gradient and the learning rate. + +\[ad\] + +The derivative of any function at x is simply another function whose input is mapped to another numeric value. We can explain the benefits and disbenefits by visualizing the derivatives of those three activation functions below. + +![](images/derivatives-1024x511.png) + +Now, the deep learning community often deals with two types of problems during training - the [vanishing gradients problem and the exploding gradients](https://machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) problem. In the first, the backpropagation algorithm, which chains the gradients together when computing the error backwards, will find _really small gradients_ towards the left side of the network (i.e., farthest from where error computation started). This problem primarily occurs with the Sigmoid and Tanh activation functions, whose derivatives produce outputs of 0 < x' < 1, except for Tanh which produces x' = 1 at x = 0. When you chain values that are smaller than one, such as 0.2 \* 0.15 \* 0.3, you get really small numbers (in this case 0.009). Consequently, when using Tanh and Sigmoid, you risk having a suboptimal model that might possibly not converge due to vanishing gradients. 
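To make this chaining effect tangible, here is a small NumPy sketch - it uses made-up pre-activation values and ignores the weight terms for simplicity, simply multiplying the Sigmoid derivatives as backpropagation would chain them across five layers:

```
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)  # peaks at 0.25, for x = 0

# Made-up pre-activation values for five consecutive layers
pre_activations = np.array([0.3, -1.2, 2.0, 0.5, -0.8])

# Chaining (multiplying) the local derivatives, as backpropagation does
chained = np.prod(sigmoid_derivative(pre_activations))
print(chained)  # in the order of 1e-4: a vanishing gradient
```

With a derivative of 1 for all positive inputs, this product would not shrink - which is exactly the point made next.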
+ +ReLU does not have this problem - its derivative is 0 when x < 0 and is 1 otherwise. 1x1x1 = 1 and 1x1x0x1 = 0. Hence, no vanishing gradients. What's more, it makes your model sparser, since all gradients which turn to 0 effectively mean that a particular neuron is zeroed out. + +Finally, it is computationally faster. Computing this function - often by simply maximizing between (0, x) - takes substantially fewer resources than computing e.g. the sigmoid and tanh functions. By consequence, ReLU is the de facto [standard activation function](https://www.machinecurve.com/index.php/2020/01/24/overview-of-activation-functions-for-neural-networks/) in the deep learning community today. + +\[ad\] + +## The Swish activation function + +Nevertheless, it does not mean that it cannot be improved. In October 2017, Prajit Ramachandran, Barret Zoph and Quoc V. Le from Google Brain proposed the [Swish activation function](https://arxiv.org/pdf/1710.05941v1.pdf). It is a relatively simple function: it is the multiplication of the input x with the sigmoid function for x - and it looks as follows. + +![](images/swish-1024x511.png) + +![](images/swish_formula.png) + +Upon inspection of this plot your probable first guess is that it looks a lot like ReLU. And that's not a poor guess. Instead, it _does_ look like the de facto standard activation function, with one difference: the domain around 0 differs from ReLU. + +Swish is a smooth function. That means that it does not abruptly change direction like ReLU does near x = 0. Rather, it smoothly bends from 0 towards values < 0 and then upwards again. + +This observation means that it's also non-monotonic. It thus does not remain stable or move in one direction, such as ReLU and the other two activation functions. The authors write in their paper that it is in fact this property which separates Swish from most other activation functions, which do share this monotonicity. + +## Why Swish could be better than ReLu + +In their work, Ramachandran et al. write that their "extensive experiments show that Swish consistently matches or outperforms ReLU on deep networks applied to a variety of challenging domains such as image classification and machine translation". + +This is interesting. Previously proposed replacements for ReLU have shown inconsistent results over various machine learning tasks. If Swish consistently matches or outperforms ReLU, as the authors claim ... well, that would make it a candidate for challenging ReLU at a global scale! + +\[ad\] + +The question that now remains is - if Swish often yields better results than ReLU does, why is that the case? The authors make various observations which attempt to explain this behavior: + +- First, it is bounded below. Swish therefore benefits from sparsity similar to ReLU. Very negative weights are simply zeroed out. +- Second, it is unbounded above. This means that for very large values, the outputs do not saturate to the maximum value (i.e., to 1 for all the neurons). According to the authors of the Swish paper, this is what set ReLU apart from the more traditional activation functions. +- Third, separating Swish from ReLU, the fact that it is a smooth curve means that its output landscape will be smooth. This provides benefits when optimizing the model in terms of convergence towards the minimum loss. +- Fourth, small negative values are zeroed out in ReLU (since f(x) = 0 for x < 0). 
However, those negative values may still be relevant for capturing patterns underlying the data, whereas large negative values may be zeroed out (for reasons of sparsity, as we saw above). The smoothness property and the values of f(x) < 0 for x ≈ 0 yield this benefit. This is a clear win over ReLU. + +The [original work on Swish](https://arxiv.org/pdf/1710.05941v1.pdf) provides very interesting benchmark results for common neural network architectures. + +All in all, if you're feeling a little adventurous in your machine learning project, the Swish activation function may be a candidate for testing. Perhaps, you'll even improve your model in the process. Enjoy engineering! diff --git a/why-you-cant-truly-create-rosenblatts-perceptron-with-keras.md b/why-you-cant-truly-create-rosenblatts-perceptron-with-keras.md new file mode 100644 index 0000000..a56f7fd --- /dev/null +++ b/why-you-cant-truly-create-rosenblatts-perceptron-with-keras.md @@ -0,0 +1,400 @@ +--- +title: "Why you can't truly create Rosenblatt's Perceptron with Keras" +date: "2019-07-24" +categories: + - "deep-learning" + - "frameworks" + - "svms" +tags: + - "keras" + - "neural-network" + - "neural-networks" + - "rosenblatt-perceptron" +--- + +...and what to do about it! + +It was January 1957 when a report was released by Cornell Aeronautical Laboratory. It was written by Frank Rosenblatt and titled _The Perceptron - a Perceiving and Recognizing Automaton_, which aimed to "formulate a brain analogue useful in analysis" (Rosenblatt, 1957). + +In his work, he presented the [perceptron](https://machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/)\- a one-neuron neural network that would eventually lie at the basis of many further developments in this field. + +Since I'm currently investigating historical algorithms _and_ because I use Keras on a daily basis for creating deep neural networks, I was interested in combining both - especially since I saw some blogs on the internet that had applied it too. + +Rather unfortunately, I ran into trouble relatively quickly. And it all had to do with the fact that Keras to me seems unsuitable for creating the Perceptron - you can get close to it, but you cannot replicate it exactly. + +Why? + +That's what I will cover in this blog. First, I'm going to take a look at the internals of a perceptron. I cover how data is propagated through it and how this finally yields a (binary) output with respect to the preferred class. Subsequently, I'll try to replicate it in Keras ... until the point that you'll see me fail. I will then introduce the Perceptron Learning Rule that is used for optimizing the weights of the perceptron, based on one of my previous posts. Based on how deep neural networks are optimized, i.e. through Stochastic Gradient Descent (SGD) or a SGD-like optimizer, I will then show you why Keras cannot be used for single-layer perceptrons. + +Finally, I will try to _get close_ to replication - to see what the performance of single-neuron networks _could_ be for a real-world dataset, being the Pima Indians Diabetes Database. + +Let's hope we won't be disappointed! + +**Update 02/Nov/2020:** made code compatible with TensorFlow 2.x. 
+ +* * * + +\[toc\] + +* * * + +## Some intuition for a perceptron + +Mathematically, a Rosenblatt perceptron can be defined as follows: + +\[mathjax\] + +\\begin{equation} f(x) = \\begin{cases} 1, & \\text{if}\\ \\textbf{w}\\cdot\\textbf{x}+b > 0 \\\\ 0, & \\text{otherwise} \\\\ \\end{cases} \\end{equation} + +However, mathematics is useless unless you understand it - which in my opinion cannot be done without building _intuition_ and _visualization_. Only when you can visualize an equation, and thoroughly understand how it works, you can finally enjoy its beauty. + +Therefore, let's precisely do that. This is a generic sketch of the perceptron as it is defined above: + + +![](images/Perceptron-1024x794.png) + + +In the maths above, you noticed a weights vector `**w**` and an input vector `**x**` that are multiplied. Finally, a bias `b` is added. The class is one if this output is larger than zero. Otherwise, it picks the other class. + +### Computing a dot product + +Let's cover the first part - multiplying the vectors - first. When you do that, it's called a _dot product_. Computing one is actually really simple: the dot product is the sum of the multiplication of the individual vector elements. Visualized, that's `x1` multiplied by `w1`; `x2` and `w2`, et cetera - mathematically: + +\\begin{equation} \\begin{split} z &= \\textbf{w}\\cdot\\textbf{x} \\\\ &=\\sum\_{i=1}^{n} w\_nx\_n \\\\ &= w\_1x\_1 + ... + w\_nx\_n \\\\ \\end{split} \\end{equation} + + +All these individual outputs are summated, as you can see. Subsequently, the bias value is added and the value is passed along to the 'gateway' (real name: unit step) function that assigns it either class 0 or class 1. The output passed to the unit step function looks as follows: + +\\begin{equation} \\begin{split} z &= \\textbf{w}\\cdot\\textbf{x} + b \\\\ &=\\sum\_{i=1}^{n} w\_nx\_n + b \\\\ &= w\_1x\_1 + ... + w\_nx\_n + b \\\\ \\end{split} \\end{equation} + +The step function: + +\\begin{equation} f(x) = \\begin{cases} 1, & \\text{if}\\ \\textbf{w}\\cdot\\textbf{x}+b > 0 \\\\ 0, & \\text{otherwise} \\\\ \\end{cases} \\end{equation} + +It is therefore one of the simplest examples of a binary classifier. + +\[ad\] + +## Let's code it - a Keras based perceptron + +All right, let's see if we can code one. First ensure that you have all necessary dependencies installed, preferably in an Anaconda environment. Those dependencies are as follows: + +- A clean Python installation, preferably 3.6+: [https://www.python.org/downloads](https://www.python.org/downloads/) +- Keras: `pip install keras` +- By consequence, TensorFlow: `pip install tensorflow` (go [here](https://medium.com/@soumyadipmajumder/complete-guide-to-tensorflow-gpu-installation-on-windows-10-36e5858640e9) if you wish to install the GPU version on Windows). + - You may also wish to run it on Theano or CNTK, which are supported by Keras, but I only tested it with TF as a backend. +- Numpy: `pip install numpy`. +- Scipy: `pip install scipy`. + +Create a new folder somewhere on your machine called `simple-perceptron`: + +![](images/image.png) + + + +Open the folder and create one file: `model.py`. + +### Dataset: Pima Indians Diabetes Database + +We'll use the Pima Indians Diabetes Database as our dataset. It's a CC0 (or public domain) dataset that is freely available at [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database). It can be described as follows: + +> This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. 
The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. +> +> Source: [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database) + + +All right, the first step would be to download the dataset, so let's do that first. Download the dataset to the same folder as `model.py` and call it `pima_dataset.csv`. + +\[ad\] + +### Loading dependencies and data + +Now open `model.py` in a text editor or an IDE. First add the dependencies that you'll need: + +``` +# Load dependencies +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +import numpy as np +``` + +Then load your data: + +``` +# Load data +dataset = np.loadtxt('./pima_dataset.csv', delimiter=',') + +# Separate train and test data +X = dataset[:, 0:8] +Y = dataset[:, 8] +``` + +What you do above is less difficult than it looks. First, you use the `numpy` library to use the Pima dataset, which is delimited (i.e. the columns are separated) by a comma. Indeed, when you open the CSV file, you'll see this: + +``` +6,148,72,35,0,33.6,0.627,50,1 +1,85,66,29,0,26.6,0.351,31,0 +8,183,64,0,0,23.3,0.672,32,1 +1,89,66,23,94,28.1,0.167,21,0 +0,137,40,35,168,43.1,2.288,33,1 +...........and so on +``` + +Let's take the first row. + +``` +6,148,72,35,0,33.6,0.627,50,1 +``` + +The numbers \[latex\]\\{6, 148, ..., 50\\}\[/latex\] represent the feature vector \[latex\]\\mathbf{x\_0} = \\{6, 148, 72, 35, 0, 33.6, 0.627, 50\\}\[/latex\]. This feature vector is part of your training set which is the Pima dataset - or \[latex\]\\mathbf{x\_0} \\in X\[/latex\]. + +There is however one value left: \[latex\]1\[/latex\]. This is actually the _desired outcome_, or the class to which this feature vector belongs. The total number of desired outcomes is 2, as the set is \[latex\]Y = \\{ 0, 1 \\}\[/latex\] or, in plainer English: \[latex\]Y = \\{ \\text{no diabetes}, \\text{diabetes} \\}\[/latex\]. Recall why this is the case: the objective of the Pima dataset is to "to diagnostically predict whether or not a patient has diabetes". + +This also explains why you'll do this: + +``` +# Separate train and test data +X = dataset[:, 0:8] +Y = dataset[:, 8] +``` + +In Python, what you're writing for \[latex\]X\[/latex\] is this: for the entire `dataset`, take all rows (`:`) as well as columns 0 up to 8 (excluding 8). Assign the output to `X`. By consequence, `X` - or your set of feature vectors - therefore contains the _actual features_, excluding the targets (which are in column 8). + +Obviously, it's now easy to understand what happens for the desired outcomes or target set `Y`: you'll take the 8th column for all rows. + +Next, create the model and add your Perceptron, which is a Dense layer: + +``` +# Create the Perceptron +model = Sequential() +model.add(Dense()) +``` + +### Problems! + +I now got confused. The [Keras docs](https://keras.io/layers/core/) wanted me to specify an _activation function_ and an _initializer_. + +\[ad\] + +So I started looking around for clues, and then I found this: + +> Based on that, gradient descent can't be used for perceptrons but can be used for conventional neurons that uses the sigmoid activation function (since the gradient is not zero for all x). 
+> +> Source: [Yahia Zakaria, StackOverflow](https://stackoverflow.com/a/40758135), or (Zakaria, 2016). + + +Today's neural networks, which are supported by Keras, apparently use an entirely different method for optimization, I found. Whereas the Rosenblatt Perceptron updates the weights by pushing them slightly into the right direction (i.e. the [Perceptron Learning Rule](https://machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/)), today's neural networks don't do that. Instead, they compute the loss with a so-called loss function, which is differentiable. By minimizing this gradient, the algorithms find the way to the best-performing model. We call this (Stochastic) Gradient Descent. Instead of pushing the weights into the right direction, it's like descending a mountainous path, where your goal is to go to the valley - changing the model weights as you go. + +The next question is then: the Perceptron step function outputs class 0 for all values \[latex\]\\leq 0\[/latex\] and 1 for the rest. Why cannot this be used as a loss function, then? + +Very simple - because the derivative is always zero, except for \[latex\]x = 0\[/latex\]. Consider one of the classes as the output of a function, say for class = 1, and you will get: + +\\begin{equation} \\begin{split} f(x) &= 1 \\end{split} \\end{equation} + +Since \[latex\]x^0\[/latex\] is 1, we can rewrite \[latex\]f(x)\[/latex\] as: + +\\begin{equation} \\begin{split} f(x) &= 1 \\cdot x^0 \\\\ &= 1 \\cdot 1 \\\\ &= 1 \\\\ \\end{split} \\end{equation} + +And you will see that the derivative is 0: + +\\begin{equation} \\begin{split} f'(x) &= \\frac{df}{dx}(1) \\\\ &= \\frac{df}{dx}(1 \\cdot x^0) \\\\ &= 0 \\cdot (1 \\cdot x^\\text{-1}) \\\\ &= 0 \\end{split} \\end{equation} + +What's even worse is that the derivative is _undefined_ for \[latex\]x = 0\[/latex\]. This is the case because a function must be differentiable. Since it 'steps' from 0 to 1 at \[latex\]x = 0\[/latex\], the function is not differentiable at this point, rendering the derivative to be undefined. This can be visualized as follows, but obviously then for \[latex\]x = 0\[/latex\]: + +[![](images/1024px-Right-continuous.svg_-1024x853.png)](https://machinecurve.com/wp-content/uploads/2019/07/1024px-Right-continuous.svg_.png) + +Crap. There goes my plan of creating a Rosenblatt Perceptron with Keras. What to do? + +### Finding an appropriate activation function + +Mathematically, it is impossible to use gradient descent with Rosenblatt's perceptron - and by consequence, that's true for Keras too. + +\[ad\] + +But what if we found a function that _is actually differentiable_ and highly resembles the step function used in the Rosenblatt perceptron? + +We might then be able to pull it off, while accepting _a slight difference compared to the Rosenblatt perceptron_. + +But to me, that's okay. + +The first candidate is the Sigmoid function, which can be mathematically defined as: + +\\begin{equation} \\begin{split} sig(t) = \\frac{\\mathrm{1} }{\\mathrm{1} + e^\\text{-t} } \\end{split} \\end{equation} + +And visualized as follows: + +[![](images/sigmoid-1024x511.png)](https://machinecurve.com/wp-content/uploads/2019/05/sigmoid.png) + +Across a slight interval around \[latex\]x = 0\[/latex\], the Sigmoid function transitions from 0 to 1. It's a differentiable function and is therefore suitable for this Perceptron. + +But can we find an even better one? + +Yes. + +It's the _hard Sigmoid_ function. 
It retains the properties of Sigmoid but transitions less quickly. + +And fortunately, Keras [supports it](https://keras.io/activations/): `tensorflow.keras.activations.hard_sigmoid(x)` ! + +### Let's move on with our implementation + +Note that so far, we have this: + +``` +# Load dependencies +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +import numpy as np + +# Load data +dataset = np.loadtxt('./pima_dataset.csv', delimiter=',') + +# Separate train and test data +X = dataset[:, 0:8] +Y = dataset[:, 8] + +# Create the Perceptron +model = Sequential() +model.add(Dense()) +``` + +We can now add the activation function and the initializer. Since _zero initialization_ (which is what one can do with the [real Rosenblatt Perceptron](https://machinecurve.com/index.php/2019/07/23/linking-maths-and-intuition-rosenblatts-perceptron-in-python/#training-the-model)) is **not a good idea with SGD** (I'll cover this in another post), I'll initialize them with the default Keras initializer, being `glorot_uniform` (or Xavier uniform). + +Let's add the `hard_sigmoid` activation function to the imports: + +``` +from tensorflow.keras.activations import hard_sigmoid +``` + +\[ad\] + +Also define it, together with the initializer and the input shape (remember, 8 columns so 8 features): + +``` +model.add(Dense(1, input_shape=(8,), activation=hard_sigmoid, kernel_initializer='glorot_uniform')) +``` + +### Compiling the model + +We next compile the model, i.e. initialize it. We do this as follows: + +``` +model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) +``` + +Binary cross entropy is the de facto standard loss function for binary classification problems, so we use it too (Chollet, 2018). Why this is a good one will be covered in another blog. The same goes for the Adam optimizer, which is an extension of default gradient descent, resolving some of its challenges.... and we use accuracy because it is more intuitive than loss 😎 + +### Fitting the data to our pseudo-Perceptron + +We'll next fit the data to our pseudo Rosenblatt Perceptron. This essentially tells Keras to start the training process: + +``` +model.fit(X, Y, epochs=225, batch_size=25, verbose=1, validation_split=0.2) +``` + +Note that we've had to configure some options: + +- The **number of epochs**, or the number of iterations of passing through the data and subsequent optimization before the model stops the training process. +- The **batch size**, or the sample size used during epochs. +- We set the output to **verbose**, so that we'll see what happens during execution. +- I'm also splitting 20% of the dataset into **validation data**, which essentially reduces overfitting. + +For this last reason, we'll have to clearly inspect the dataset first. If, say, all non-diabetes cases (class 0) came first, followed by the diabetes class (1), we'd have a problem: + +> The validation data is selected from the last samples in the `x` and `y` data provided, before shuffling. +> +> Source: [Keras docs](https://keras.io/models/model/#fit) + +... a.k.a. our validation data would only have diabetic cases in that case, rendering it highly unreliable. + +However, inspecting the data at random ensures that the dataset seems to be distributed rather randomly with respect to target class: + +``` +.... 
+2,71,70,27,0,28.0,0.586,22,0 +7,103,66,32,0,39.1,0.344,31,1 +7,105,0,0,0,0.0,0.305,24,0 +1,103,80,11,82,19.4,0.491,22,0 +1,101,50,15,36,24.2,0.526,26,0 +5,88,66,21,23,24.4,0.342,30,0 +8,176,90,34,300,33.7,0.467,58,1 +7,150,66,42,342,34.7,0.718,42,0 +1,73,50,10,0,23.0,0.248,21,0 +7,187,68,39,304,37.7,0.254,41,1 +0,100,88,60,110,46.8,0.962,31,0 +0,146,82,0,0,40.5,1.781,44,0 +0,105,64,41,142,41.5,0.173,22,0 +2,84,0,0,0,0.0,0.304,21,0 +8,133,72,0,0,32.9,0.270,39,1 +.... +``` + +All right, let's go! This is our code (it is also available on [GitHub](https://github.com/christianversloot/keras-pseudo-perceptron)): + +``` +# Load dependencies +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense +from tensorflow.keras.activations import hard_sigmoid +import numpy as np + +# Load data +dataset = np.loadtxt('./pima_dataset.csv', delimiter=',') + +# Separate train and test data +X = dataset[:, 0:8] +Y = dataset[:, 8] + +# Create the Perceptron +model = Sequential() +model.add(Dense(1, input_shape=(8,), activation=hard_sigmoid, kernel_initializer='glorot_uniform')) +model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) + +# Train the Perceptron +model.fit(X, Y, epochs=225, batch_size=25, verbose=1, validation_split=0.2) +``` + +\[ad\] + +## Test results + +Does it work? + +``` +2019-07-24 18:44:55.155520: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3026 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1) +``` + +Yes. + +Does it work well? + +``` +Epoch 225/225 +614/614 [==============================] - 0s 111us/step - loss: 5.5915 - acc: 0.6531 - val_loss: 5.7565 - val_acc: 0.6429 +``` + +...and with a different Glorot initialization... + +``` +Epoch 225/225 +614/614 [==============================] - 0s 103us/step - loss: 5.2812 - acc: 0.6596 - val_loss: 6.6020 - val_acc: 0.5844 +``` + +...yes, only slightly. On the validation dataset, the accuracy is only ~60%. + +Why is this the case? + +It's the complexity of the dataset! It's difficult to cram 8 vectors into only one neuron, that's for sure. However, I'm still impressed with the results, though! _Think about creating shallow networks first before starting with deep ones_, is what may be the actual lesson learnt here. + +Altogether, today, you've seen how to use Keras to create a perceptron that mimics Rosenblatt's one. You also saw why a true perceptron cannot be created with Keras because it learns differently. Finally, we showed that this is actually the case and saw our development fail. I hope you've learnt something, and please - I am happy to receive your remarks, whether positive or negative - let me know below 👇 and I'll improve! 😊 + +Happy coding! + +## References + +Ariosa, R. (2018, April 27). MrRobb/keras-zoo. Retrieved from [https://github.com/MrRobb/keras-zoo/blob/master/P%20(Perceptron)/readme.md](https://github.com/MrRobb/keras-zoo/blob/master/P%20(Perceptron)/readme.md) + +Chollet, F. (2017). _Deep Learning with Python_. New York, NY: Manning Publications. + +Rosenblatt, F. (1957). _The Perceptron - a Perceiving and Recognizing Automaton_. Retrieved from UMass website: [https://blogs.umass.edu/brain-wars/files/2016/03/rosenblatt-1957.pdf](https://blogs.umass.edu/brain-wars/files/2016/03/rosenblatt-1957.pdf) + +Zakaria, Y. (2016, November 23). 
Non-smooth and non-differentiable customized loss function tensorflow. Retrieved from [https://stackoverflow.com/a/40758135](https://stackoverflow.com/a/40758135) diff --git a/why-you-shouldnt-use-a-linear-activation-function.md b/why-you-shouldnt-use-a-linear-activation-function.md new file mode 100644 index 0000000..85a72f5 --- /dev/null +++ b/why-you-shouldnt-use-a-linear-activation-function.md @@ -0,0 +1,79 @@ +--- +title: "Why you shouldn't use a linear activation function" +date: "2019-06-10" +categories: + - "deep-learning" +tags: + - "activation-functions" + - "deep-learning" + - "linear" +--- + +In today's deep learning community, three [activation functions are commonly used](https://www.machinecurve.com/index.php/2019/09/04/relu-sigmoid-and-tanh-todays-most-used-activation-functions/): the sigmoid function, the tanh function and the Rectified Linear Unit, or ReLU for short. + +While there exist other activation functions [such as Swish](https://machinecurve.com/index.php/2019/05/30/why-swish-could-perform-better-than-relu/), it has been hard over the years for them to catch up with both the _improvements in predictive power_ required as well as the _generalization over training sets_. Whereas the high performance of ReLU for example generalizes well over various machine learning problems, this hasn't been the case with many other activation functions. + +And there's another question people are asking a lot: **why can't I use a linear activation function when I'm training a deep neural network?** We'll take a look at this question in this blog, specifically inspect the [optimization process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process) of deep neural networks. The answer is relatively simple **\- using a linear activation function means that your model will behave as if it is linear**. And that means that it can no longer handle the complex, non-linear data for which those deep neural nets have boosted performance those last couple of years. + +**Update February 2020** - Added links to other MachineCurve blogs; added table of contents; processed textual improvements. + +\[toc\] + +\[ad\] + +## Optimizing your model: computing gradients, backprop and gradient descent + +When you're building a deep neural network, there are three terms that you'll often hear: + +- A gradient; +- Backpropagation, and finally... +- [Gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/), often the stochastic version (SGD) - or SGD like optimizers. + +Let's take a look [at the training process of a neural network](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process), so that we'll understand the necessity of those three before we move on to studying the behavior of linear activation functions. + +As you know, training a deep neural network goes iteratively, using epochs. This means that small batches of training data are input into the network, after which the [error is computed](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) and the [model is optimized](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/). If all the training data has been input once, an epoch has passed and the same process starts again - until the second, third, fourth, and so on - epochs have passed. + +Suppose that we're at epoch 0 (or 1, if you like). 
The weights of the model have been initialized randomly, or pseudo-randomly. You input your first batch of training data into the model. Obviously, it will perform very poorly, and the [loss](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) - the difference between the actual targets and the predictions for this training data - will be huge. It needs to be improved if we want to use it in real life. + +One way of doing so is by using _gradients_ and _backpropagation_, the latter of which stands for "backwards propagation of errors". While the data has been propagated forwards, the error can be computed backwards. This is done as follows: + +1. We know which loss function we used and how it is instantiated. For this function, we can compute its derivative. That is, we can compute its _gradient_ i.e. how much it changes at some particular location. If we do that for our current spot on the loss curve, we can estimate where to move to in order to improve that particular weight. +2. Backpropagation allows us to descend the gradient with respect to _all the weights_. By chaining the gradients found, it can compute the gradient for any weight - and consequently, can compute improvements with respect to the errors backwards towards the most upstream layer in the network. +3. The _optimizer_, i.e. [SGD](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) or the [SGD like optimizer such as Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), is subsequently capable of altering the weights slightly in an attempt to improve overall network performance. + +\[ad\] + +And this often causes a really fast drop in loss at first, while it gets stable over time: + +[![](images/image-1024x223.png)](https://machinecurve.com/wp-content/uploads/2019/06/image.png) + +An example from my [TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/) + +## The problem with linear activation functions + +As you know, the dot product between the weight vector and the input (or transformed input) vector produced by the neuron itself is linear. It flows through an [activation function](https://www.machinecurve.com/index.php/2020/01/24/overview-of-activation-functions-for-neural-networks/) to, generally, make it non-linear. But neural networks don't care what kind of function you choose for activating neuron output. + +You can thus choose to use \[latex\]f(x) = x\[/latex\], i.e. the identity function, as your activation function. + +But this is often a really bad idea. + +And it all has to do with the gradient of this linear activation function: + +[![](images/derivative_linear-1024x537.png)](https://machinecurve.com/wp-content/uploads/2019/06/derivative_linear.png) + +Yep, it's 1. + +The formula of \[latex\] f'(x) \[/latex\] when \[latex\] f(x) = x \[/latex\]? + +\[ad\] + +\[latex\] f'(x) = 1 \\times x^0 = 1 \* 1 = 1 \[/latex\] + +**You will thus find the same gradient for any neuron output when you use the linear activation function, namely 1.** + +And this impacts neural network training in two fundamental ways: + +1. You cannot apply backpropagation to find how your neural weights should change based on the errors found. This observation emerges from the simple notion that gradients are no longer dependent on the input values (and by consequence, the errors) - they're always the same. 
There's thus simply no point in attempting to find where to improve your model. +2. Your model becomes a linear model because all layers chained can be considered to be a _linear combination_ of individual linear layers. You'll thus at best get some good performance on _linear data_. Forget good performance for non-linear data. + +And that's why you shouldn't use linear activation functions :-) diff --git a/working-with-imbalanced-datasets-with-tensorflow-and-keras.md b/working-with-imbalanced-datasets-with-tensorflow-and-keras.md new file mode 100644 index 0000000..c3b694b --- /dev/null +++ b/working-with-imbalanced-datasets-with-tensorflow-and-keras.md @@ -0,0 +1,327 @@ +--- +title: "Working with Imbalanced Datasets with TensorFlow 2.0 and Keras" +date: "2020-11-10" +categories: + - "deep-learning" + - "frameworks" +tags: + - "dataset" + - "imbalanced-data" + - "keras" + - "tensorflow" +--- + +Training your machine learning model or neural network involves exploratory research activities in order to estimate what your data looks like. This is really important if you want to create a model that performs well, that performs well in many cases _and_ performs well because of why you think it performs well. + +To give a simple example: if you classify between reindeer and wild boar, it is very likely that snow is present in the reindeer pictures, while it's not in the boar ones. In those cases, the model might learn to distinguish based on the presence of snow, which is not what you want. + +In another case, the one of an **imbalanced dataset**, you can also run into trouble. This case is what we will cover in this article. We're going to look at a couple of things. First of all, we'll cover the concept of an imbalanced dataset. What does it mean, to have no balance in your dataset? And what is wrong with it, and why? This gives us enough context to move onto the practical part. + +In that practical part, we'll be taking class imbalances into account with TensorFlow and Keras. We take a look at **undersampling**, **oversampling** and an approach which works by means of **class weights**. In addition, we also look at the concept of **F1 Score**. Through examples, we will demonstrate that it is in fact possible to use an imbalanced dataset while training your machine learning model. Let's take a look! + +**Update 11/Nov/2020:** repaired a number of textual mistakes. Sorry about that! + +* * * + +\[toc\] + +* * * + +## What is an imbalanced dataset? + +In today's article, we will be looking at imbalanced datasets. But what are those datasets? What makes a dataset have _class imbalance_? Let's try and answer those questions through the lens of an example dataset. + +### Introducing the Insurance Cross-sell Prediction Dataset + +For example, let's take a look at the **Insurance Cross-sell Prediction Dataset**. This dataset, [which is available at Kaggle](https://www.kaggle.com/arashnic/imbalanced-data-practice), contains various parameters which indicate whether people who have purchased _Health insurance_ would also be interested in _Vehicle insurance_. + +> The data provided by an Insurance company which is not excluded from other companies to getting advantage of ML. This company provides Health Insurance to its customers. We can build a model to predict whether the policyholders (customers) from past year will also be interested in Vehicle Insurance provided by the company. +> +> Kaggle (n.d.) 
+ +If we are able to generate a predictive model which helps us understand if a customer might be interested in purchasing insurance, we might be able to sell much more effectively. That would be a very valuable use of Machine Learning in business. + +> Cross-selling is the action or practice of selling an additional product or service to an existing customer. +> +> Wikipedia (2005) + +### Inspecting the dataset + +Make sure to download the dataset from Kaggle to a folder if you want to use the code. Copy the `train.csv` file into a particular folder, and rename it into `data.csv`. The `test.csv` file apparently has no corresponding Response (i.e. `y`) targets. + +We use the following code snippet for visualizing the outcomes by means of a histogram. + +``` +import pandas as pd +import matplotlib.pyplot as plt +import numpy as np + +df = pd.read_csv('./data.csv') +plt.rcParams["figure.figsize"] = (10,6) +plt.hist(df['Response'], bins=2, density=True) +plt.ylabel('Number of samples') +plt.xlabel('Interested in insurance (no = 0; yes = 1)') +plt.show() +``` + +It looks as follows. + +- [![](images/insurance.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/insurance.png) + +- [![](images/insurancedensity.png)](https://www.machinecurve.com/wp-content/uploads/2020/11/insurancedensity.png) + + +### Finding class imbalance + +In total, more than 300.000 samples reflect people who have no interest in insurance. Close to 70.000 people _do_ have interest in insurance. This means that for approximately every 0.35 people who want insurance, close to 1.7 want no insurance (approximately 5 want no insurance when 1 wants insurance). + +This suggests that we have found what is known as an **imbalanced dataset**. + +> Imbalance means that the number of data points available for different the classes is different: +> +> If there are two classes, then balanced data would mean 50% points for each of the class. +> +> Kaggle (n.d.) + +Let's now take a look at why you must be careful when creating a Machine Learning model when your dataset is imbalanced. + +* * * + +## What's wrong with imbalanced datasets? + +When training a neural network, you are performing [supervised learning](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process). This effectively involves feeding samples from a training dataset forward, generating predictions, which can be compared to the dataset's corresponding labels: the ground truth. This results in a **loss value** that can subsequently be used for [optimizing the model](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/). + +There are [various loss functions](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/) that are used in neural networks. However, in addition to a loss value, we often use the **accuracy**. It is used because it is very intuitive to human beings, and can be defined as follows: + +\[latex\]Accuracy = \\frac{TP + TN}{TP + TN + FP + FN}\[/latex\] + +Here: + +- **TP:** True Positives, a.k.a. predictions for class 1 that are actually class 1. +- **TN**: True Negatives, a.k.a. predictions for class 0 that are actually class 0. +- **FP:** False Positives, a.k.a. predictions for class 1 that are actually class 0. +- **FN:** False Negatives, a.k.a. predictions for class 0 that are actually class 1. + +The goal is to increase the numbers for TP and TN and, by consequence, reducing the numbers for FP and FN. 
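To make this concrete, here is a minimal sketch with made-up counts (these numbers are purely hypothetical and are not the insurance dataset): a 'classifier' that always predicts class 0 on a heavily imbalanced test set still achieves a high accuracy, even though it never finds a single positive case.

```
# Hypothetical, imbalanced test set: 950 samples of class 0 and 50 of class 1,
# scored by a 'model' that always predicts class 0
tp, tn = 0, 950   # no true positives (class 1 is never predicted); all actual negatives are 'correct'
fp, fn = 0, 50    # no false positives, but every actual positive is missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f'Accuracy: {accuracy:.2%}')  # 95.00%, despite the model being useless for class 1
```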
+ +Now, let's take a look again at our insurance dataset + +![](images/insurance.png) + +If we would train our model with this dataset, the model might actually learn the following behavior: **simply giving 0 as class output because it is correct more often.** In other words, if I would randomly pick one of the two classes, it would be more likely that my answer is correct if I pick 0 compared to picking 1. The model might pick this up and steer towards those _average_ based predictions. + +In this case, you end up with a model that _does_ perform, but not _why_ you want it to perform (i.e., not because the _patterns_ in the dataset indicate certain classes to be more likely, but rather, the relative likelihood that it is a class now determines the prediction outcome). + +For this reason, you don't want to create Machine Learning models that have a significant class imbalance. + +Let's now take a look at some methods for removing the imbalance within your dataset. + +* * * + +## Taking class imbalances into account with TensorFlow and Keras + +In TensorFlow and Keras, you can work with imbalanced datasets in multiple ways: + +1. **Random Undersampling:** drawing a subset from the original dataset, ensuring that you have equal numbers per class, effectively discarding many of the big-quantity class samples. +2. **Random Oversampling:** drawing a subset from the original dataset, ensuring that you have equal numbers per class, effectively copying many of the low-quantity class samples. +3. **Applying class weights:** by making classes with higher data quantities less important in the model optimization process, it is possible to achieve optimization-level class balance. +4. **Working with the F1 score instead of Precision and Recall:** by using a metric that attempts to find a balance between relevance of all results and number of relevant results found, you could reduce the impact of class balance on your model without removing it. + +### Random Undersampling + +If we apply **undersampling** to our model, we effectively _reconstruct_ the dataset - but then ensure that it is balanced. In other words, we ensure that all classes contain an equal amount of samples. By consequence, as can be seen in the figure below, a lot of samples are discarded to regain class balance; balance is found at `min(num_samples_per_class`). Samples are chosen randomly. + +We can perform undersampling as follows. + +``` +import pandas as pd + +# Read CSV +df = pd.read_csv('./data.csv') + +# Count samples per class +classes_zero = df[df['Response'] == 0] +classes_one = df[df['Response'] == 1] + +# Print sizes +print(f'Class 0: {len(classes_zero)}') +print(f'Class 1: {len(classes_one)}') + +# Undersample zero to the size of one +classes_zero = classes_zero.sample(len(classes_one)) + +# Print sizes +print(f'Class 0: {len(classes_zero)}') +print(f'Class 1: {len(classes_one)}') +``` + +Before undersampling, our datasets look as follows: + +``` +Class 0: 319553 +Class 1: 62601 +``` + +Afterwards: + +``` +Class 0: 62601 +Class 1: 62601 +``` + +- **Benefits of undersampling:** regaining class balance. +- **Drawbacks of undersampling:** if your dataset as a whole is small, reducing its size further can make you lose some predictive power. In those cases, you must critically inspect whether undersampling is the right fit for your setting. 
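One practical note: after undersampling, the two class-specific DataFrames still need to be merged back together (and shuffled) before you can derive the `X` and `y` data for training. A minimal sketch, assuming the `classes_zero` and `classes_one` frames from the snippet above - the same idea applies to the oversampled frames in the next section:

```
import pandas as pd

# Merge the undersampled classes back into one DataFrame and shuffle the rows
df_balanced = pd.concat([classes_zero, classes_one]).sample(frac=1, random_state=42)

# Split features and targets again; note that categorical feature columns
# would still need to be encoded before feeding them to a neural network
X = df_balanced.drop(columns=['Response'])
y = df_balanced['Response']
```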
+ +![](images/sampling.png) + +### Random Oversampling + +If we apply **oversampling** instead, we also reconstruct the dataset into a balanced one, but do it in such a way that all our classes find balance at `max(num_samples_per_class)`. While undersampling means discarding samples, here, we copy multiple samples instead to fill the classes that are imbalanced. Here, sampling also happens randomly. + +``` +import pandas as pd + +# Read CSV +df = pd.read_csv('./data.csv') + +# Count samples per class +classes_zero = df[df['Response'] == 0] +classes_one = df[df['Response'] == 1] + +# Print sizes +print(f'Class 0: {len(classes_zero)}') +print(f'Class 1: {len(classes_one)}') + +# Oversample one to the size of zero +classes_one = classes_one.sample(len(classes_zero), replace=True) + +# Print sizes +print(f'Class 0: {len(classes_zero)}') +print(f'Class 1: {len(classes_one)}') +``` + +Before: + +``` +Class 0: 319553 +Class 1: 62601 +``` + +After: + +``` +Class 0: 319553 +Class 1: 319553 +``` + +- **Benefits of oversampling:** regaining class balance. +- **Drawbacks of oversampling:** if you have classes with _very few_ instances, both relative to the other classes and absolutely in terms of total number of samples - you might put too much emphasis on the fact that those samples belong to a specific class. Here, you must ensure that there aren't other instances that _also_ belong to the class but are not yet reflected in the dataset, or you risk missing them when running your model in production. + +![](images/sampling-1.png) + +### Applying class weights: a Keras model + +Instead of changing your dataset, another approach to handling imbalanced datasets involves instructing TensorFlow and Keras to take that class imbalance into account. For this, the `model.fit` function contains a `class_weights` attribute. + +> Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class. +> +> TensorFlow (n.d.) + +Effectively, you're thus telling the training process to keep in mind that some samples come from an underrepresented class. + +Do note that applying class weights does not work adequately with [classic gradient descent](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/) and others. However, if you use [Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/), you will be fine (TensorFlow, n.d.). + +We can first compute the weights with Scikit-learn's `compute_class_weight` function. 
```
import pandas as pd
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Read CSV
df = pd.read_csv('./data.csv')

# Count samples per class
classes_zero = df[df['Response'] == 0]
classes_one = df[df['Response'] == 1]

# Convert parts into NumPy arrays for weight computation
zero_numpy = classes_zero['Response'].to_numpy()
one_numpy = classes_one['Response'].to_numpy()
all_together = np.concatenate((zero_numpy, one_numpy))
unique_classes = np.unique(all_together)

# Compute weights and convert them into the dictionary format
# (class index -> weight) that Keras expects for class_weight
weights = compute_class_weight(class_weight='balanced', classes=unique_classes, y=all_together)
weights = dict(enumerate(weights))
print(weights)
```

And then apply them in `model.fit`:

```
# Fit data to model
model.fit(X, y,
          epochs=no_epochs,
          verbose=verbosity,
          class_weight=weights,
          callbacks=keras_callbacks,
          validation_split=0.2)
```

- **Benefits of applying class weights:** weighted/balanced training process.
- **Drawbacks of applying class weights:** extra computations necessary during the already resource-intensive training process.

### Working with the F1 score instead of Precision/Recall

Many people use **precision** and **recall** for computing model performance.

> The precision is the proportion of relevant results in the list of all returned results. The recall is the ratio of the relevant results returned by the search engine to the total number of the relevant results that could have been returned.
>
> Hacker Noon (2005)

https://www.youtube.com/watch?v=o9A4e7zopu8

However, if your precision is high but your recall is low, you will find mostly relevant results, but the total number of relevant results retrieved is relatively low.

If your recall is high, you will find many of the relevant results, but there may also be many results that are not relevant.

Often, precision and recall are at odds - and you must trade off between _high relevance_ and _a bit of noise_. This trade-off becomes especially important if you have an imbalanced dataset that you'll have to work with.

The F1 score takes both precision and recall and produces another value:

\[latex\]F1 = 2 \\times \\frac{precision \\times recall}{precision + recall}\[/latex\]

If you optimize your model with F1 instead of Precision or Recall, you will find that it attempts to maximize F1 - and hence maximize precision without losing too much on recall, and vice versa. This way, you _can_ keep using imbalanced data, by omitting the misleading evaluation metric (accuracy) in the first place.

- **Benefits of applying F1 score:** balance between precision and recall, omitting accuracy.
- **Drawbacks of applying F1 score:** a theoretically slightly less performant model compared to one trained on a truly balanced dataset.

* * *

## Summary

In this article, we looked at imbalanced datasets - i.e. datasets where the number of samples within each class is strongly non-equal. While slightly imbalanced datasets should not significantly harm ML performance, big class imbalance _can_ cause model performance issues. That's why it's a good idea to take class imbalances into account when creating your Machine Learning model.

The rest of this article therefore focused on a couple of things related to this issue. First, we saw how we can detect a class imbalance in an insurance dataset. Subsequently, we looked at four ways of reducing the issue: by performing undersampling, oversampling, applying class weights in Keras/TensorFlow and changing the evaluation criterion. This way, we can resolve class imbalances, and produce a model that works.
+ +I hope that you have learned something from reading this article! If you have any questions or comments, please feel free to leave a comment below 💬 Please also leave a messge if you have suggestions for improvement. I'll happily respond and change the article where necessary :) Thank you for reading MachineCurve today and happy engineering! 😎 + +* * * + +## References + +_TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc._ + +KDnuggets. (n.d.). _The 5 most useful techniques to handle Imbalanced datasets_. [https://www.kdnuggets.com/2020/01/5-most-useful-techniques-handle-imbalanced-datasets.html](https://www.kdnuggets.com/2020/01/5-most-useful-techniques-handle-imbalanced-datasets.html) + +Analytics Vidhya. (2020, July 24). _10 techniques to deal with Imbalanced classes in machine learning_. [https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/](https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/) + +Kaggle. (n.d.). _Learning from Imbalanced insurance data_. Kaggle: Your Machine Learning and Data Science Community. [https://www.kaggle.com/arashnic/imbalanced-data-practice](https://www.kaggle.com/arashnic/imbalanced-data-practice) + +Wikipedia. (2005, October 11). _Cross-selling_. Wikipedia, the free encyclopedia. Retrieved November 10, 2020, from [https://en.wikipedia.org/wiki/Cross-selling](https://en.wikipedia.org/wiki/Cross-selling) + +Kaggle. (n.d.). _What is an imbalanced dataset? Machine learning | Data science and machine learning_. Kaggle: Your Machine Learning and Data Science Community. [https://www.kaggle.com/getting-started/100018](https://www.kaggle.com/getting-started/100018) + +TensorFlow. (n.d.). _Tf.keras.Model_. [https://www.tensorflow.org/versions/r2.0/api\_docs/python/tf/keras/Model](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/Model) + +Hacker Noon. (2005, November 10). _Idiot’s guide to precision, recall and confusion matrix_. [https://hackernoon.com/idiots-guide-to-precision-recall-and-confusion-matrix-b32d36463556](https://hackernoon.com/idiots-guide-to-precision-recall-and-confusion-matrix-b32d36463556) diff --git a/your-first-machine-learning-project-with-tensorflow-and-keras.md b/your-first-machine-learning-project-with-tensorflow-and-keras.md new file mode 100644 index 0000000..dbbcae4 --- /dev/null +++ b/your-first-machine-learning-project-with-tensorflow-and-keras.md @@ -0,0 +1,443 @@ +--- +title: "Your First Machine Learning Project with TensorFlow 2.0 and Keras" +date: "2020-10-26" +categories: + - "deep-learning" + - "frameworks" +tags: + - "beginners" + - "deep-learning" + - "first-model" + - "keras" + - "machine-learning" + - "neural-network" + - "tensorflow" +--- + +A few years ago, some people argued that being a data scientist meant that you had the sexiest job of the 21st Century. Who's to disagree: being one, you're responsible for [diving into](https://www.machinecurve.com/index.php/2017/09/30/the-differences-between-artificial-intelligence-machine-learning-more/) (sometimes big) datasets, finding relevant results for business, and reporting about them so that new opportunities can be created. This requires that you develop intuition for both business and technology as well as the capability of working in both worlds. It's in fact a very challenging but rewarding path, as salaries for data scientists are quite substantial. 
Now, Machine Learning - which I often describe as automatically finding patterns in datasets that can be used for predictive purposes, by means of some type of model architecture - is one of the sub-branches of data science related jobs. Becoming a machine learning engineer puts you at the technology side of the data science spectrum. Requiring some intuition about mathematics (but not necessarily a maths degree!), as well as some interest and expertise in programming languages such as Python, you could be responsible for creating a variety of predictive models that serve the future needs of the organization you're working for.

Unfortunately, the learning curve for starting with Machine Learning can be relatively steep, in my experience. A few years ago, when I had no significant ML experience whatsoever - I came from a software engineering background - the primary question I had was: "Where do I start?". That's why I've written this article, which helps you start learning how to apply machine learning for generating predictive models. In the article, we'll zoom in on TensorFlow and Keras - two tightly coupled libraries that can be used for predictive modelling - and show you step-by-step how they can be installed. We also demonstrate how a neural network can be created and trained. This way, you'll have set your first steps in the interesting world that is Machine Learning!

Let's take a look! 😎

* * *

\[toc\]

* * *

## TensorFlow and Keras for Machine Learning

The first step of getting started with Machine Learning is getting to know two of the libraries that we will be using for today's article: **TensorFlow** first and **Keras** second. What are those libraries, and what can they be used for?

First of all, TensorFlow is "an end-to-end open source machine learning platform". According to its website, it's a flexible ecosystem of tools that can help you convert your research-oriented Machine Learning ideas into practice, by being a bridge between research ML and production deployments.

With TensorFlow, you can…

- Easily train and deploy models wherever you want, whether that's in the cloud, in your web browser, or on a device;
- Easily perform Machine Learning research because of the ecosystem's flexibility;
- Easily perform model building due to the intuitive, high-level APIs like Keras that run with eager execution.

https://www.youtube.com/watch?v=oZikw5k\_2FM

Ah - there's the second one for today: Keras.

It's a high-level API that nowadays runs on top of TensorFlow, but that was not always the case. Keras has always been a high-level API with the goal of making the creation of deep learning models easier - the core argument for Keras is that native TensorFlow originally had a steep learning curve, which negatively impacted the adoption of Machine Learning. Originally, however, it did not only work on top of TensorFlow, but also on top of Theano and CNTK, other libraries for creating Machine Learning models.

Today, that's no longer the case - Keras is now tightly coupled with TensorFlow in some kind of symbiotic fashion: the `tensorflow.keras` API. TensorFlow's adoption within the community would be more difficult if Keras were not around, and vice versa: Keras needs something to run on top of. Together, they become a very powerful tool that Machine Learning engineers can comfortably work with.

And that's why we'll focus on them in this article explaining how to create your first Machine Learning project.
In fact, we will be creating a Neural network with those tools. But first, let's take a look at how you can install TensorFlow and Keras on your machine. + +* * * + +## Installing TensorFlow with Conda + +When you're a data scientist, using TensorFlow for creating machine learning models is likely not the only thing you will do - or if not it could be that you need other libraries such as PyTorch or [Scikit-learn](https://www.machinecurve.com/index.php/how-to-use-scikit-learn-for-machine-learning-with-python-mastering-scikit/) to do the job. To just give a few examples: + +- It could very much be the case that you will be asked to filter data from CSV files by means of [Pandas](https://pandas.pydata.org/). +- You might need Scikit-learn or PyTorch to run another person's Machine Learning model. +- Or, when it comes to big data, it could be that you need to run data processing jobs on [Apache Spark](https://www.machinecurve.com/index.php/2020/10/22/distributed-training-tensorflow-and-keras-models-with-apache-spark/). + +You will need a variety of Python libraries (also called 'packages') to make this happen - think `pyspark`, `pandas`, `sklearn`, `pytorch`, `tensorflow`, to name just a few. + +And each package has a variety of dependencies, i.e. other packages that it will run on - for example, many of the packages above require e.g. `numpy` and `scipy` as dependencies. + +### The need for environments + +However: + +- It is not necessarily the case that each package works with the latest version of dependencies. For example, older packages (only the very popular packages are updated very frequently) might require you to use older versions of `numpy`. + +By consequence, while many packages can share dependencies, they do not necessarily have to share the same dependency versions. + +And in general, if you think about the packages you install as tools in a toolbox, you likely agree with me that you want to structure your tools nicely. You don't want your toolbox to become a mess. + +Here, the concept of an **environment** can help. By means of an environment, you can organize your toolbox into 'areas' where your tools (i.e., your packages) are stored. Tools cannot be used across environments, but support each other within the environment. Visually, this looks as follows: + +![](images/envs-1.jpg) + +As you can see, environments allow you to install multiple versions of the same tool - i.e. package - across environments. In one environment, you can use one version of a package, while in another environment, another version of the same package can be used. + +This way, you can organize the tools that you need for particular tasks - and have no interference between stuff you need to do your job. + +### Introducing Conda and Miniconda + +Great concepts, environment, but up to now it's entirely theoretical. Let's move into practice by introducing Conda, an open source package management system. As an additional benefit, it can also be used for managing environments like the ones we described above. + +It runs on Windows, macOS and Linux, and allows you to (1) easily create and switch between environments, and (2) install packages quickly. + +![](images/image-19-1024x704.png) + +Installing Conda is very easy as you can install a tool called [Miniconda](https://docs.conda.io/en/latest/miniconda.html#), which makes your job really easy: + +> Miniconda is a free minimal installer for conda. 
It is a small, bootstrap version of Anaconda that includes only conda, Python, the packages they depend on, and a small number of other useful packages, including pip, zlib and a few others. Use the `conda install command` to install 720+ additional conda packages from the Anaconda repository. +> +> Miniconda (n.d.) + +By many, Miniconda is considered to be the Swiss knife among environment managers. While Python native [virtual environments](https://docs.python.org/3/tutorial/venv.html) can do the same thing for you, Minoconda offers all the tools you need in a simple installer. + +The next part of this article assumes that you have Miniconda installed on your system. [Click here](https://docs.conda.io/en/latest/miniconda.html#) to find the installer that works for you. + +### Installing TensorFlow and Keras + +After installing Miniconda, you should be able to run it from your Start Menu by typing 'miniconda' - then hitting Enter once it appears in the search list. + +A terminal should now open displaying `(base)`. This is good - it's the base environment: + +![](images/image-20.png) + +Let's make a new environment: + +``` +conda create --name first-tensorflow +``` + +Here, we'll be making a new environment called `first-tensorflow`. After hitting Enter, you must likely confirm the creation of the environment: + +``` +## Package Plan ## + + environment location: C:\Users\chris\miniconda3\envs\first-tensorflow + + + +Proceed ([y]/n)? y +``` + +After which it is created: + +``` +Preparing transaction: done +Verifying transaction: done +Executing transaction: done +# +# To activate this environment, use +# +# $ conda activate first-tensorflow +# +# To deactivate an active environment, use +# +# $ conda deactivate +``` + +It's now time to **activate** the environment, as the instructions already suggest. This means that you effectively _move over_ to this environment in your toolbox, so that you can install specific packages for use in the environment -- as well as use the ones already installed. + +By writing `conda activate first-tensorflow`, the newly created environment is created, and you can see `(base)` change into `(first-tensorflow)`: + +``` +(base) PS C:\Users\chris> conda activate first-tensorflow +(first-tensorflow) PS C:\Users\chris> +``` + +The next step - installing TensorFlow, which includes Keras - is remarkably simple: + +``` +(first-tensorflow) PS C:\Users\chris> conda install tensorflow==2.1.0 +``` + +Do note that while TensorFlow currently has version 2.3.0 as the latest stable release, [newest versions may not be available in Conda immediately](https://www.anaconda.com/blog/tensorflow-in-anaconda). If you do truly want to use newer versions of TensorFlow, [use this answer](https://stackoverflow.com/a/43729857) to first install `pip` _locally, i.e. within your environment_ and then use its local installment to install e.g. `pip install tensorflow==2.3.0`. + +* * * + +## Your First Neural Network + +Exciting stuff, we're going to build our first neural network using TensorFlow and Keras, which we installed in the previous section. Doing so always boils down to a few, sequential steps: + +1. Taking a look at the dataset that you will be creating your predictive model for, to get an intuition about the patterns that could be hidden within it. +2. Creating a model file and adding the imports of the TensorFlow/Keras components you'll need today. +3. Creating the model itself by adding code to the model file. +4. Starting the training process. +5. 
Observing and interpreting the results. + +### Today's Dataset + +Let's now take a look at the dataset that we will be using today. For your first Machine Learning project, you will build what is known as a [classifier](https://www.machinecurve.com/index.php/2020/10/19/3-variants-of-classification-problems-in-machine-learning/) - a system that allows you to assign an input to one of multiple (sometimes two, sometimes more) buckets. We will train the classifier on the [MNIST dataset](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/), which contains thousands of handwritten digits from 0 to 9: + +![](images/mnist-visualize.png) + +In the ideal case, our classifier will do this really well: + +![](images/mnist-1.jpg) + +Let's find out if we can create such a classifier. Open up your code editor (e.g. Visual Studio Code) and create a file called `first-ml-model.py`. Open the file, so that we can start coding! 😎 + +### Adding the imports + +Time to add to the file the TensorFlow components that we need: + +``` +import tensorflow +from tensorflow.keras.models import Sequential +from tensorflow.keras.layers import Dense, Conv2D, Flatten +``` + +We'll need `tensorflow` in general, and some specific components. We'll be using the `Sequential` API for stacking the layers of our neural network on top of each other (as we shall see), and we require `Dense`, `Conv2D` and `Flatten`, which are general components of Convolutional Neural Networks - neurla networks often used for image processing. + +### Creating the model + +Next, we must specify a few configuration options: + +``` +# Configuration +img_width, img_height = 28, 28 +input_shape = (img_width, img_height, 1) +batch_size = 1000 +no_epochs = 25 +no_classes = 10 +validation_split = 0.2 +verbosity = 1 +``` + +The image width and height are set to 28 pixels, because that's the size of each input image. As we use grayscale images, which are one-channel (contrary to the three channels when images are RGB), our input shape has a `1` as third dimension. We feed data forward into the network with mini batches of 1000 samples. We will train for 25 iterations, and instruct the model that it must take into account 10 classes - the digits 0 to 9. 20% of our data will be used for steering the training process away from a process called overfitting (which involves training too blindly on the training data, leading to generalization issues on unseen data when used in production). We specify the model training process to be verbose, i.e. to specify all the possible output in the terminal (which comes at the cost of speed). + +#### Loading the dataset + +We can next load the dataset: + +``` +# Load data +def load_data(): + return tensorflow.keras.datasets.mnist.load_data(path="mnist.npz") +``` + +Per the Keras datasets `load_data` function, we can easily load the MNIST data - one of the benefits of Keras, which comes with [preinstalled datasets](https://www.machinecurve.com/index.php/2019/12/31/exploring-the-keras-datasets/). + +#### Creating the model + +Then, it's actually time to add the neural network. Using `model.add`, we stack a few layers on top of each other - starting with convolutional layers for extracting features (i.e. selecting what patterns within the dataset must be used for generating predictions) and Dense layers for actually converting the presence of features into a prediction (i.e. the true predictive part). Jointly, they can be expected to produce quite a performance indeed! 
+ +``` +# Model creation +def create_model(): + model = Sequential() + model.add(Conv2D(4, kernel_size=(3, 3), activation='relu', input_shape=input_shape)) + model.add(Conv2D(8, kernel_size=(3, 3), activation='relu')) + model.add(Conv2D(12, kernel_size=(3, 3), activation='relu')) + model.add(Flatten()) + model.add(Dense(256, activation='relu')) + model.add(Dense(no_classes, activation='softmax')) + return model +``` + +#### Compiling the model + +Above, we just created a _skeleton_ of the model. That is, we didn't create a working model yet, but simply described what it should look like. Compiling a model involves specifying an optimization mechanism ([Adam](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/) in our case) and [loss function](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/): + +``` +# Model compilation +def compile_model(model): + model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy, + optimizer=tensorflow.keras.optimizers.Adam(), + metrics=['accuracy']) + return model +``` + +### Training and testing components + +Finally, before we actually start the training process, we must add functionality both for _training_ (which makes sense) and _testing_ (which I should explain in a bit more detail). When a model is trained, it must be tested against data that it hasn't seen before. This ensures that the model _can be generalized_, meaning that it is also effective on data that it has not seen before. If it's not, there's no point in generating the predictive model, is there? Hence, your goal as a machine learning engineer is to create a model that can both predict _and_ generalize. + +``` +# Model training +def train_model(model, X_train, y_train): + model.fit(X_train, y_train, + batch_size=batch_size, + epochs=no_epochs, + verbose=verbosity, + shuffle=True, + validation_split=validation_split) + return model + +# Model testing +def test_model(model, X_test, y_test): + score = model.evaluate(X_test, y_test, verbose=0) + print(f'Test loss: {score[0]} / Test accuracy: {score[1]}') + return model +``` + +### Stacking stuff and starting the training process + +We can now add code which truly loads our data based on the definition we created before: + +``` +# Load data +(X_train, y_train), (X_test, y_test) = load_data() +``` + +We must normalize it to the \[latex\]\[0, 1\]\[/latex\] range: + +``` +# Normalize data +(X_train, X_test) = (X_train / 255.0, X_test / 255.0) +``` + +Also reshape it, because TensorFlow expects our input to have a specific structure: + +``` +# Reshape data +(X_train, X_test) = ( + X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1), + X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2], 1), +) +``` + +And finally, we can create, compile, train and test our model: + +``` +# Create and train the model +model = create_model() +model = compile_model(model) +model = train_model(model, X_train, y_train) +model = test_model(model, X_test, y_test) +``` + +### Results + +Now open up the Miniconda terminal, ensure that you activated your `first-tensorflow` environment, and run the model - `python first-ml-model.py`. 
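If you also want to see the trained model classify an individual digit, you could optionally append a few lines to the end of `first-ml-model.py` before running it. This is a minimal sketch that assumes the variables from the script so far (`model`, `X_test`, `y_test`) are in scope:

```
import numpy as np

# Predict class probabilities for the first test sample
probabilities = model.predict(X_test[0:1])

# The Softmax output contains one probability per digit (0-9);
# np.argmax picks the most likely one
predicted_digit = np.argmax(probabilities[0])
print(f'Predicted digit: {predicted_digit}, actual digit: {int(y_test[0])}')
```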
+
+The model should now start training, and you should see output like this:
+
+```
+Epoch 2/25
+48000/48000 [==============================] - 2s 39us/sample - loss: 0.1141 - accuracy: 0.9679 - val_loss: 0.0874 - val_accuracy: 0.9756
+```
+
+And finally, the test results:
+
+```
+Test loss: 0.0544307005890682 / Test accuracy: 0.9878000020980835
+```
+
+...an accuracy which is close to 100% - really good! 😎
+
+### Full model code
+
+Should you wish to obtain the full model code at once - here you go:
+
+```
+import tensorflow
+from tensorflow.keras.models import Sequential
+from tensorflow.keras.layers import Dense, Conv2D, Flatten
+
+# Configuration
+img_width, img_height = 28, 28
+input_shape = (img_width, img_height, 1)
+batch_size = 1000
+no_epochs = 25
+no_classes = 10
+validation_split = 0.2
+verbosity = 1
+
+# Load data
+def load_data():
+    return tensorflow.keras.datasets.mnist.load_data(path="mnist.npz")
+
+# Model creation
+def create_model():
+    model = Sequential()
+    model.add(Conv2D(4, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
+    model.add(Conv2D(8, kernel_size=(3, 3), activation='relu'))
+    model.add(Conv2D(12, kernel_size=(3, 3), activation='relu'))
+    model.add(Flatten())
+    model.add(Dense(256, activation='relu'))
+    model.add(Dense(no_classes, activation='softmax'))
+    return model
+
+# Model compilation
+def compile_model(model):
+    model.compile(loss=tensorflow.keras.losses.sparse_categorical_crossentropy,
+                  optimizer=tensorflow.keras.optimizers.Adam(),
+                  metrics=['accuracy'])
+    return model
+
+# Model training
+def train_model(model, X_train, y_train):
+    model.fit(X_train, y_train,
+              batch_size=batch_size,
+              epochs=no_epochs,
+              verbose=verbosity,
+              shuffle=True,
+              validation_split=validation_split)
+    return model
+
+# Model testing
+def test_model(model, X_test, y_test):
+    score = model.evaluate(X_test, y_test, verbose=0)
+    print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
+    return model
+
+# Load data
+(X_train, y_train), (X_test, y_test) = load_data()
+
+# Normalize data
+(X_train, X_test) = (X_train / 255.0, X_test / 255.0)
+
+# Reshape data
+(X_train, X_test) = (
+    X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1),
+    X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2], 1),
+)
+
+# Create and train the model
+model = create_model()
+model = compile_model(model)
+model = train_model(model, X_train, y_train)
+model = test_model(model, X_test, y_test)
+```
+
+* * *
+
+## Next steps
+
+Great! You just completed creating your first Machine Learning model with TensorFlow and Keras! 🎉 Perhaps you've gained some momentum now, and you're interested in adding extra bits of knowledge to your Machine Learning toolbox. If so, I would like to recommend the following further reading:
+
+1. Extending the simple monitoring we used (a `model.evaluate` call and validation data) to more extensive monitoring with [TensorBoard](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/).
+2. Understanding [Convolutional Neural Networks](https://www.machinecurve.com/index.php/2018/12/07/convolutional-neural-networks-and-their-components-for-computer-vision/) and what they do in more detail.
+3. Reading more about how models are optimized:
+    1. [The high-level supervised machine learning process](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/#the-high-level-supervised-learning-process)
+    2. [Loss and loss functions](https://www.machinecurve.com/index.php/2019/10/04/about-loss-and-loss-functions/)
+    3. [Gradient descent based optimization](https://www.machinecurve.com/index.php/2019/10/24/gradient-descent-and-its-variants/)
+    4. [Adaptive optimizers](https://www.machinecurve.com/index.php/2019/11/03/extensions-to-gradient-descent-from-momentum-to-adabound/)
+
+* * *
+
+## Summary
+
+Getting started with Machine Learning can be a challenging task because of the steep learning curve. Fortunately, there are high-level libraries like Keras that can make your life easier. In fact, they make everyone's life easier, because machine learning engineers can create and train production-level models with Keras and its counterpart TensorFlow using relatively little code.
+
+In this article, we looked at creating your first Machine Learning model with Keras and TensorFlow. We first looked at what Keras and TensorFlow are - two symbiotic libraries that allow people to produce production-level ML models (TensorFlow) and to do so in an accessible way (Keras). We then moved on to installing TensorFlow (which includes Keras) on your system, and saw what environments are. We also saw how Miniconda can be used for managing environments and installing Python packages. Finally, you created your first neural network with TensorFlow and Keras, using the MNIST dataset.
+
+I hope that this article provided a starting point for your Machine Learning career! If it did, or if you've learnt something from it in general, please feel free to leave a message in the comments section 💬 Please do the same if you have questions or general remarks. I'd be more than happy to answer your messages :)
+
+Thank you for reading MachineCurve today and happy engineering! 😎
+
+* * *
+
+## References
+
+_TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc._
+
+TensorFlow. (n.d.). [https://www.tensorflow.org/](https://www.tensorflow.org/)
+
+_Miniconda — Conda documentation_. (n.d.). Conda — Conda documentation. [https://docs.conda.io/en/latest/miniconda.html#](https://docs.conda.io/en/latest/miniconda.html#)