Getting Better Accuracies: Using CNN on ESRGAN Enhanced Images
In this article, we will look at ESRGANs and use ESRGAN-enhanced images to train a CNN for better validation accuracy.
First of all, let's understand the use case for enhancing image quality before feeding the images to the CNN model. We had some images taken off of Google Earth that were not high resolution, and the requirement was to upscale them and see whether the CNN model performed any better on the enhanced images.
WHAT IS A CNN?
CNN stands for Convolutional Neural Network. It is a type of neural network architecture that is particularly effective for image classification and recognition tasks.
CNNs consist of multiple layers of interconnected nodes, called neurons, which are designed to recognize specific features in an image, such as edges or shapes. The neurons are organized into layers, with each layer performing a specific type of data processing on the input image.
The first layer of a CNN detects simple patterns in the input image, such as edges and corners. Subsequent layers build upon these patterns to detect more complex features, such as shapes and textures. Finally, the output layer of the CNN makes a prediction about what the image contains, such as a dog or a car.
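To make the idea of a filter detecting edges concrete, here is a minimal NumPy sketch (not part of the original pipeline) of the convolution operation a CNN layer performs, applied with a hand-written vertical-edge kernel of the kind a network would normally learn:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation, the core operation of a conv layer."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Tiny image with a vertical edge: dark on the left, bright on the right.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# A simple vertical-edge filter; a trained CNN learns kernels like this.
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

response = conv2d(image, kernel)
print(response)  # every position straddles the edge, so all responses are 3
```

Real CNN layers apply many such kernels at once and stack their outputs as channels, but the arithmetic per filter is exactly this.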
CNNs are trained using a large dataset of labeled images, with the network adjusting its internal parameters to optimize its accuracy on the training data. Once the CNN is trained, it can be used to classify new, unseen images with a high degree of accuracy, making it a powerful tool for image recognition and other computer vision tasks.
VGG16
The CNN that was used was VGG16 which is a deep convolutional neural network architecture named after the Visual Geometry Group at the University of Oxford, where it was developed. It is a popular and effective network for image recognition tasks, such as object detection and image classification.
The architecture consists of 16 layers, including 13 convolutional layers that extract features from the input image and 3 fully connected layers that perform the final classification. The convolutional layers use small filters (3x3) and are stacked on top of each other, making the network deeper than many other architectures.
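The 13-conv + 3-FC breakdown can be written out explicitly. The list below follows "configuration D" from the original VGG paper (the same encoding torchvision uses), where numbers are 3x3 conv output channels and "M" marks a 2x2 max-pool:

```python
# VGG16 "configuration D": 3x3 conv channel counts, "M" = 2x2 max-pool.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]

conv_layers = sum(1 for v in VGG16_CFG if v != "M")
fc_layers = 3  # two 4096-unit hidden FC layers plus the classifier head
print(conv_layers, fc_layers, conv_layers + fc_layers)  # 13 3 16
```

Pooling layers carry no weights, which is why only the 13 conv and 3 fully connected layers count toward the "16" in the name.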
VGG16 is known for its simplicity and ease of implementation, as well as its high accuracy on standard image recognition benchmarks. It has been used in a variety of applications, including medical image analysis, autonomous driving, and security systems.
IMAGE UPSCALING
Image upscaling is the process of increasing the size of a digital image without losing too much quality. It is often necessary when you have a low-resolution image and you want to make it larger, or when you want to print an image at a larger size than its original resolution.
Upscaling can be done using software or algorithms that interpolate the pixels in the image to create new ones. This can be a simple linear interpolation or more complex methods like bicubic or Lanczos interpolation. However, upscaling can often result in a loss of quality, as the new pixels are not always accurate representations of the original image.
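As a baseline for comparison, here is a short sketch of classical (non-learned) bicubic upscaling using Pillow; the synthetic input image is just for illustration, and in practice you would load your own file:

```python
from PIL import Image

def upscale_bicubic(img: Image.Image, factor: int = 4) -> Image.Image:
    """Classical upscaling: interpolate new pixels with bicubic filtering."""
    w, h = img.size
    return img.resize((w * factor, h * factor), Image.BICUBIC)

# Demo on a tiny synthetic image; normally: Image.open("low_res.png")
small = Image.new("RGB", (64, 64), color=(120, 40, 200))
large = upscale_bicubic(small, factor=4)
print(large.size)  # (256, 256)
```

This is the kind of interpolation-based method that ESRGAN is designed to outperform: bicubic invents pixels from a fixed local formula, while a learned model hallucinates plausible texture from training data.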
To address this issue, machine learning-based upscaling methods have been developed in recent years, such as ESRGAN (Enhanced Super-Resolution Generative Adversarial Networks), which can create high-quality upscaled images by learning from large datasets of high-resolution images. The model we used to enhance our images was Real-ESRGAN.
REAL ESRGAN
ESRGAN stands for "Enhanced Super-Resolution Generative Adversarial Networks". It is an advanced image upscaling algorithm that uses a deep learning technique called Generative Adversarial Networks (GANs) to enhance the resolution and detail of images.

Real-ESRGAN is an extension of ESRGAN: the "Real" refers to its focus on restoring real-world images, which suffer from complex, unknown degradations (blur, noise, compression) rather than clean downsampling. Real-ESRGAN can be used to upscale images and videos, such as old photos or low-quality footage, to higher resolutions while preserving details and reducing artifacts.
History
- The ESRGAN model was proposed in 2018 by Xintao Wang and colleagues, building on work from the image super-resolution research community. The researchers aimed to improve the quality of image super-resolution using a deep learning technique called Generative Adversarial Networks (GANs). The GAN framework consists of two networks — a generator and a discriminator — that are trained together in a process called adversarial training.
- The generator network is trained to produce high-resolution images from low-resolution inputs, while the discriminator network learns to differentiate between the generated images and real high-resolution images. The two networks compete against each other in a game-like setting until the generator produces images that are indistinguishable from real images by the discriminator.
- The ESRGAN model builds on the earlier SRGAN model, which was also based on GANs. However, ESRGAN incorporates several improvements, including better perceptual quality, sharper edges, and more natural textures. Real-ESRGAN, introduced in 2021, goes a step further: it is trained entirely on synthetically degraded data that mimics real-world blur, noise, and compression, so it restores real-world low-quality images more convincingly than previous models.
Since its introduction, the Real ESRGAN model has been widely used in various applications, including image and video upscaling, restoration, and enhancement. The model has been trained on large datasets of images and can produce high-quality results even when working with low-resolution inputs.
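For reference, the official Real-ESRGAN repository ships an inference script; a typical invocation looks roughly like the sketch below. Exact flags and model names can differ between versions, so treat this as an outline and check the repository's README before running:

```shell
# Clone the official repo and install its dependencies
git clone https://github.com/xinntao/Real-ESRGAN.git
cd Real-ESRGAN
pip install -r requirements.txt
python setup.py develop

# Upscale every image in ./inputs with the x4 general-purpose model;
# results are written to the output directory
python inference_realesrgan.py -n RealESRGAN_x4plus -i inputs -o results
```

The pretrained weights are downloaded automatically on first run; GPU use is optional but strongly recommended for large batches.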
CNNs and ESRGAN vs. Traditional Techniques
CNNs and ESRGAN are often used together to improve results on tasks involving degraded images, such as medical imaging and satellite imagery. In a pipeline like ours, ESRGAN first generates a high-resolution version of the low-resolution input, and a CNN then extracts features from the enhanced image to perform classification. This combination has proven extremely effective in a wide range of applications.
In the past, bicubic interpolation has been the go-to method for image enhancement, but it tends to produce low-quality results because it overlooks the local structure and content of the image. However, CNNs and ESRGAN offer an alternative approach by leveraging deep learning algorithms. CNNs extract features from images and identify patterns, while ESRGAN employs a generative adversarial network (GAN) to learn how to generate high-quality images that closely resemble the original high-resolution images.
These methods are capable of capturing fine details and textures in upscaled images, resulting in higher-quality outcomes than traditional interpolation methods. Furthermore, deep learning-based techniques can be trained on large datasets of high-resolution images, allowing them to learn from a diverse range of examples and improve over time.
Now that we know what CNNs and ESRGANs are, let's head back to the use case at hand. We trained the CNN model, and the accuracy looked like this:
The model made very few errors: only 36 out of 2,798 cases were misclassified by the pretrained VGG16 model, giving an error rate of about 1.29%. This is when we knew that upscaling the images had been the right decision. We then trained a 3-layer-deep CNN from scratch on the same dataset, which gave an accuracy of only 41%.
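The error-rate figure quoted above is simple arithmetic on the confusion counts, which is worth spelling out:

```python
total = 2798          # evaluation cases
misclassified = 36    # cases the pretrained VGG16 got wrong

error_rate = misclassified / total * 100
accuracy = 100 - error_rate
print(f"error rate: {error_rate:.3f}%")  # 1.287%
print(f"accuracy:   {accuracy:.2f}%")    # 98.71%
```

So the pretrained VGG16 on the ESRGAN-enhanced images reaches roughly 98.7% accuracy, against 41% for the small CNN trained from scratch.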
CONCLUSION
- CNN (Convolutional Neural Network) is a type of deep learning algorithm that has been used in many applications like object detection, image classification, and image recognition.
- VGG16 is a specific CNN architecture that is commonly used in computer vision tasks. It consists of 16 layers and is able to accurately classify images into different categories.
- These technologies have a wide range of applications in our lives, from improving medical diagnosis through image enhancement, to powering facial recognition technology for security purposes. They are also used in self-driving cars to recognize objects and navigate safely.
- Traditional image enhancement methods result in a loss of detail and sharpness due to a lack of consideration for the local structure and content of the image. CNNs and ESRGAN address these issues by using deep learning algorithms to extract features from images and generate high-quality images that are visually similar to the original high-resolution images.
It's quite evident that the VGG16 model performs better than the CNN architecture we built on our own, and that using the ESRGAN-enhanced images was a good choice.
If you want to check out the full code, please click the link.

If you liked the article, I would appreciate a clap, and follow me for more.