Autoencoder: Denoise image using UpSampling2D and Conv2DTranspose Layers (Part: 1)
For better understanding, this post is divided into three parts:
This is the introductory part, in which I will discuss some basic terms and processes used in this tutorial. This will help us grasp the concepts and better understand the other parts of this tutorial.
This part will demonstrate how we can use the upsampling method to denoise images. It will be implemented using the notMNIST dataset.
This part is similar to the previous one, but I will use transposed convolution for denoising. It will be covered using the famous MNIST dataset.
Let’s start …
GAN (Generative Adversarial Network)
A Generative Adversarial Network, or GAN, is a framework for estimating generative models via an adversarial process. It provides an architecture for training generative models, such as convolutional neural networks, to create images.
GANs were designed by Ian Goodfellow and other researchers at the University of Montreal in 2014. GAN modeling is an unsupervised learning approach involving two sub-models: a generator model that learns to generate new examples, and a discriminator model that tries to classify examples as real or generated. The process operates in terms of data distributions: typically, the generative network learns to map from a latent space to a data distribution of interest, while the discriminative network distinguishes candidates produced by the generator from the true data distribution. The two neural networks contest with each other in a game; the generative network’s training objective is to increase the error rate of the discriminative network by producing candidates realistic enough that the discriminator judges them to be real rather than generated.
Autoencoder
An autoencoder learns an efficient representation (encoding) of data: it learns how to compress the data and then how to reconstruct, from the compressed representation, an output that is as close to the original input as possible.
The simplest form of an autoencoder is a feedforward, non-recurrent neural network, similar to the single-layer perceptrons that make up a multilayer perceptron (MLP). It consists of an input layer and an output layer connected by one or more hidden layers. The output layer has the same number of nodes (neurons) as the input layer, since its purpose is to reconstruct its inputs (minimizing the difference between the input and the output). Autoencoders are therefore unsupervised learning models: they require no labeled input data.
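To make the idea concrete, here is a minimal, framework-agnostic NumPy sketch of an autoencoder with a narrow hidden layer. The toy data, variable names, and training loop are my own illustration (not the tutorial's code, which uses Keras in the later parts); the encoder and decoder are single linear layers trained by gradient descent to minimize reconstruction error.

```python
import numpy as np

# Toy data lying near a 2-D subspace of an 8-D space, so a
# 2-unit bottleneck is able to reconstruct it well.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))
X /= X.std()

# Encoder (8 -> 2) and decoder (2 -> 8): the hidden layer is narrower
# than the input/output layers, which have the same width.
W_enc = rng.normal(scale=0.5, size=(8, 2))
W_dec = rng.normal(scale=0.5, size=(2, 8))

def mse(a, b):
    return float(np.mean((a - b) ** 2))

initial = mse(X @ W_enc @ W_dec, X)
lr = 0.05
for _ in range(2000):
    Z = X @ W_enc           # encode: compress 8 features to 2
    X_hat = Z @ W_dec       # decode: reconstruct 8 features from 2
    err = (X_hat - X) / len(X)
    g_dec = Z.T @ err               # gradient of the loss w.r.t. decoder
    g_enc = X.T @ (err @ W_dec.T)   # gradient of the loss w.r.t. encoder
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

final = mse(X @ W_enc @ W_dec, X)
# reconstruction error drops substantially as training proceeds
print(initial, final)
```

Note that no labels are used anywhere: the input itself is the training target, which is what makes the autoencoder an unsupervised model.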
Two common types of layers that can be used in the generator model are the upsampling layer (UpSampling2D), which simply doubles the dimensions of the input using nearest-neighbor or bilinear upsampling, and the transposed convolutional layer (Conv2DTranspose), which performs an upscaling convolution whose details are learned during training, similar to a regular Conv2D layer.
Upsampling is the opposite of pooling. Pooling layers reduce the dimensions of the data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. A more advanced technique is unpooling, which reverses max pooling by remembering the location of each maximum in the max-pooling layer and, in the unpooling layer, copying the value back to exactly that location.
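The remember-the-maxima idea can be sketched in a few lines of NumPy. The function names and the toy input below are my own illustration, not a library API: pooling records where each maximum came from, and unpooling places each value back at exactly that location, leaving zeros elsewhere.

```python
import numpy as np

def maxpool_with_indices(x, size=2):
    """Max pooling that also records where each maximum came from."""
    h, w = x.shape
    pooled = np.zeros((h // size, w // size))
    idx = np.zeros((h // size, w // size, 2), dtype=int)
    for i in range(0, h, size):
        for j in range(0, w, size):
            block = x[i:i + size, j:j + size]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            pooled[i // size, j // size] = block[r, c]
            idx[i // size, j // size] = (i + r, j + c)
    return pooled, idx

def unpool(pooled, idx, shape):
    """Copy each pooled value back to its remembered location."""
    out = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = idx[i, j]
            out[r, c] = pooled[i, j]
    return out

x = np.array([[1., 8., 2., 1.],
              [3., 4., 1., 5.],
              [6., 2., 9., 7.],
              [1., 2., 3., 4.]])
pooled, idx = maxpool_with_indices(x)
print(pooled)        # [[8. 5.]
                     #  [6. 9.]]
restored = unpool(pooled, idx, x.shape)
```

The restored array has the original 4x4 shape, with the four maxima back in their original positions and zeros everywhere else.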
A simple version of an unpooling or reverse-pooling layer is called an upsampling layer. It works by repeating the rows and columns of the input. Multiple such layers can be used in a GAN to perform the upsampling needed to transform a small input into a large output image.
For example, a 3x3-pixel input upsampled by a factor of 2 becomes a 6x6-pixel output (or 9x9 with a factor of 3). We can define the interpolation method used to fill in the new rows and columns. By default, the UpSampling2D layer uses the nearest-neighbor algorithm, which simply repeats each row and column. Alternatively, a bilinear interpolation method can be used, which fills in each new pixel with a weighted average of the nearest existing pixels.
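Nearest-neighbor upsampling is easy to reproduce in plain NumPy, which makes it clear that nothing is learned here. The helper below is my own sketch of what Keras's UpSampling2D does with its default nearest-neighbor interpolation:

```python
import numpy as np

def upsample_nearest(x, size=2):
    """Repeat each row and each column `size` times, as
    nearest-neighbor upsampling does."""
    return np.repeat(np.repeat(x, size, axis=0), size, axis=1)

x = np.array([[1, 2],
              [3, 4]])
up = upsample_nearest(x)
print(up)
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```

A 2x2 input becomes 4x4: every pixel is simply duplicated into a 2x2 block, with no trainable parameters involved.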
Transposed convolutions are a more flexible and complex upsampling method than classical nearest-neighbor or bilinear upsampling. These layers upsample an input feature map to a desired output feature map using learnable parameters, so we must specify the number of filters and the kernel size of each filter. One of the main considerations for this layer is the stride. In a traditional convolutional layer, the stride describes how the filter steps across the input, resulting in a smaller output. In a transposed convolution, the stride instead steps across the output, which increases the size of the output. It is also referred to as a fractionally strided convolution, since a stride over the output is equivalent to a fractional stride over the input: for instance, a stride of 2 over the output corresponds to a stride of 1/2 over the input. Strides are responsible for the upscaling effect of transposed convolutions.
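The stride-over-the-output idea can be made concrete with a small NumPy sketch (my own illustration, not Keras code): each input pixel scatters a copy of the kernel, scaled by its value, onto the output at `stride` spacing, which is what a transposed convolution with `padding='valid'` computes.

```python
import numpy as np

def conv2d_transpose(x, k, stride=2):
    """Transposed convolution: every input pixel stamps a scaled copy
    of the kernel onto the output, spaced `stride` apart."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * k
    return out

x = np.ones((2, 2))   # tiny 2x2 input
k = np.ones((3, 3))   # 3x3 kernel (learnable in a real layer)
y = conv2d_transpose(x, k, stride=2)
print(y.shape)  # (5, 5)
```

With a stride of 2, the 2x2 input grows to 5x5 ((2-1)*2 + 3): the stride produces the upscaling, while the kernel values, which would be learned during training, control how the overlapping stamps blend.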
The Conv2DTranspose layer takes images as input directly and outputs the result of the operation. Conv2DTranspose both upsamples and performs a convolution, so we must specify the number of filters and the filter size, as we do for Conv2D layers, as well as a stride, because the upsampling is achieved by the stride behavior of the convolution over the input.
Transposed convolutions are the backbone of modern segmentation and super-resolution algorithms. They provide learnable, general-purpose upsampling of abstract representations.
In the following parts, I will use two different datasets for the two different upsampling methods. Both datasets are equally suitable for our denoising task, and the models can be tuned in the same way for either.
1. MNIST Dataset
The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. It was created by “re-mixing” the samples from NIST’s original datasets. The black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.
The MNIST database contains 60,000 training images and 10,000 testing images. Half of the training set and half of the test set were taken from NIST’s training dataset, while the other half of the training set and the other half of the test set were taken from NIST’s testing dataset.
2. notMNIST Dataset
The notMNIST dataset was created by Yaroslav Bulatov by taking some publicly available fonts and extracting glyphs from them to make a dataset similar to MNIST. There are 10 classes, corresponding to the letters A through J rendered in various typefaces. The image size is 28x28 pixels.
“Judging by the examples, one would expect this to be a harder task than MNIST. This seems to be the case — logistic regression on top of stacked auto-encoder with fine-tuning gets about 89% accuracy whereas same approach gives 98% on MNIST. Dataset consists of small hand-cleaned part, about 19k instances, and large uncleaned dataset, 500k instances. Two parts have approximately 0.5% and 6.5% label error rate. I got this by looking through glyphs and counting how often my guess of the letter didn’t match its unicode value in the font file.”
— Yaroslav Bulatov
I think that’s enough theory. Now let’s dive into the coding parts.