Tintelligence: Developing a Neural Network to Colorize Black/White Photos and Videos

Group Members: John Peabody, Ka'amed Dieye, Ben Cerbin, Anbo Li

Table of Contents

Abstract

Tintelligence is a convolutional neural network (CNN) that converts black and white photos into color. CNNs can “learn” features and object structures, making them an excellent architecture for image work, and this relatively recent technological advance has been instrumental in archival work. We train our model on the CIFAR-10 image dataset, which contains 50,000 training images and is credited to Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Our best model, trained with Huber Loss, reaches an average training loss of 0.0027 and an average validation loss of 0.0028 (see Results). This work is significant because it can greatly speed up the old-fashioned process of hand-coloring old photos and has applications in museums, academia, documentary-making, and medical imaging—any space where greyscale images need to be colorized.

Introduction

Deprived of the vivid colors that paint modern pictures, black and white footage from the era before color film can easily seem detached from reality. Take documentary footage of World War II, for example. Only when colorized with the same shades one would see in everyday life--often possible only through an expensive, commercial restoration effort like History Channel’s “World War II in Colour”--can these pictures evoke the “realness” of the moments they depict. With significant advancements in AI-driven image colorization, the powerful visual data processing capabilities of convolutional neural networks offer a chance to expedite this process and bring more historical footage to life. The benefits don’t stop there: colorization also has the potential to enhance medical imaging and to create more accessible content for people with colorblindness. Such a process does not come without challenges, however. Using a convolutional neural network trained on a large dataset of color images and their grayscale counterparts, we implement an automatic black-and-white colorizer that produces plausible color schemes.

Within the field of black & white colorization, there has been a great deal of progress and variation across projects. While all of them use convolutional neural networks, the training setups differ. Revathi et al. (4), in “Black and White Image Colorization Using Convolutional Neural Networks,” trained their network on the MIRFLICKR 25k dataset; the proposed method produced strong results with a low mean squared error of 0.3174. Joshi et al. (3), in “Auto-Colorization of Historical Images Using Deep Convolutional Neural Networks,” integrated a deep CNN with Inception-ResNet-v2 and trained their model on a dataset they created of 1,200 historical images comprising old and ancient photographs of Nepal. There is also a tutorial on PyImageSearch by Rosebrock (5) that uses a deep CNN alongside OpenCV, trained on the ImageNet dataset. This method enables the automatic colorization of grayscale images that can convincingly resemble natural color photographs, much like the previously mentioned works.

Building on black & white colorization, there have also been approaches that deviate from realistic or natural colors and apply artistic or aesthetic colorization to images. Zhang et al. (1), in “Colorful Image Colorization,” did not attempt to find the “correct” colorization but instead built a colorizer that generates “plausible” color schemes. Cho et al. (2), who developed PaletteNet, built a deep neural network that takes in an image and a specified color palette and returns the image recolored in the desired palette. These cases show that image colorization can go beyond realistic replication to include artistic projects and aesthetic preferences.

Ethics

It is important to consider the ethics behind our project. The first concern is the data source. Images in the CIFAR-10 dataset were scraped from the internet by researchers at MIT and NYU. We are aware of the high possibility that some people who uploaded these images did not consent to their use in training neural networks. However, given the need for a large collection of images for training CNNs, we believe CIFAR-10 is the best option given the constraints on gathering enough of our own training data. It is worth noting that CIFAR-10 is a popular dataset within the image-research community, and one of its creators, Geoffrey Hinton, recently won a Nobel Prize in Physics for his contributions—an external signal of the dataset's standing. For these reasons, CIFAR-10 was selected and used. Ideally, future datasets will be created with the full consent of everyone involved.

The second concern is how our CNN will be used. We intend for our CNN to capture the “realness” of the past moments in the photos it colorizes. This can benefit family members wishing to better understand their ancestors, historians researching archival photos, and documentary makers trying to appeal to modern audiences. There are also benefits in the medical imaging community for better visualizations. However, we realize that this CNN could be used for harmful purposes, such as digital deception and fraud, and there may be images people wish to remain untouched. Realistically, it is hard to distinguish between good and bad intent, so we will be explicit in our directions about what this CNN should be used for.

Methods

In terms of software, we intend to convert a Jupyter Notebook using PyTorch into a standalone app using Voila. While Gradio is a sound alternative--and one we considered--we think Voila will be most effective for achieving a tangible end result. As for datasets, we anticipate that an ImageNet dataset with its color images converted to grayscale will be adequate (as suggested by previous research). An analysis that focuses on creating realistic grayscale images and testing their believability would be a meaningful route, but it would require test participants; instead, we will analyze the error between the colorized image and the original, non-grayscale image. In the unlikely event that the ImageNet dataset is inadequate--or we want to improve the “randomness” of the validation dataset--we anticipate that we can collect and grayscale a large number of images from the internet. For our model, we are opting to use a pre-trained convolutional neural network from PyTorch’s library as a feature extractor to save time and energy, and then train the custom layers ourselves.
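As a rough sketch of how grayscale/color training pairs can be built from a color dataset, the snippet below wraps a torchvision dataset and converts each image on the fly. The wrapper class, batch size, and dataset choice here are illustrative, not our exact pipeline.

```python
import torch
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()
to_gray = transforms.Grayscale(num_output_channels=1)

class GrayColorPairs(torch.utils.data.Dataset):
    """Wraps a color image dataset and yields (grayscale input, color target) pairs."""
    def __init__(self, base):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img, _ = self.base[idx]          # the class label is ignored for colorization
        color = to_tensor(img)           # 3 x H x W color target in [0, 1]
        gray = to_tensor(to_gray(img))   # 1 x H x W grayscale input in [0, 1]
        return gray, color

train_pairs = GrayColorPairs(datasets.CIFAR10(root="data", train=True, download=True))
loader = torch.utils.data.DataLoader(train_pairs, batch_size=64, shuffle=True)
```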

Results

Link to results graphs/images:
(Prof Clark, if you can see this link, it's because we haven't formatted the images into the website yet. So check out some of our results! The title of the file gives the necessary context.)
https://drive.google.com/drive/folders/1wAMHP02_XYZAwHgQwJqoeCQy2UfEXxgd?usp=sharing

Our best results came from the CIFAR-10 dataset, which we used to train a multi-layer convolutional neural network with an encoder-decoder structure. The model included batch normalization, ReLU activations, and a final sigmoid layer to produce pixel-level predictions. Among the loss functions we tested, Huber Loss performed the best on this architecture.
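A minimal sketch of this kind of encoder-decoder network is shown below. The layer counts and channel widths are illustrative rather than our exact model, but the structure follows the description above: convolutional downsampling with batch normalization and ReLU, transposed-convolution upsampling, and a final sigmoid for pixel-level predictions.

```python
import torch.nn as nn

class ColorizerCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: downsample the 1-channel grayscale input while growing the channel count.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
        )
        # Decoder: upsample back to the input resolution and predict 3 RGB channels.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),  # pixel-level predictions in [0, 1]
        )

    def forward(self, gray):
        return self.decoder(self.encoder(gray))
```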

It’s important to note that we assessed output quality primarily through subjective visual evaluation, as loss values did not always correlate with perceived image quality. For instance, while MSE loss was technically lower on a simpler CNN we experimented with, the more complex CNN produced results that were far more vibrant and visually distinct, taking “riskier” predictions that looked substantially more successful at colorizing grayscale images. Interestingly, this contrasts with our worst-performing attempt: using the same complex CNN on the larger TinyImageNet dataset. Although we suspect bugs in the code may have contributed to the poor results, the model consistently output bland, average brown tones; it seemed to minimize loss by predicting a safe, visually neutral shade at the expense of color diversity and realism.

L1 vs MSE vs Huber Loss

When switching our best model/dataset pair from MSE to L1 loss, we saw that the output colors became noticeably more muted and uniform. While the average training and validation loss dropped to about 0.053, the model struggled to distinguish between green and brown objects, often producing sepia-toned results rather than realistic colorizations. In contrast, Huber Loss outperformed both L1 and MSE. With average training and validation losses of 0.0027 and 0.0028 respectively, the Huber-based model produced images that were more vibrant and visually distinct. It particularly excelled in distinguishing greens from browns—a crucial distinction given the dataset’s abundance of nature imagery. However, it still struggled with reds, which may point to an imbalance or underrepresentation of red tones in the training data.
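Because all three criteria share the same interface in PyTorch, they can be swapped on the same model and data. The sketch below (assuming a recent PyTorch version, where nn.HuberLoss is available) shows the kind of comparison we ran; the evaluation helper is illustrative.

```python
import torch
import torch.nn as nn

# The three loss functions we compared on the same model and data.
losses = {
    "L1": nn.L1Loss(),       # mean absolute error
    "MSE": nn.MSELoss(),     # mean squared error
    "Huber": nn.HuberLoss(), # quadratic for small errors, linear for large ones
}

def average_loss(model, loader, criterion, device="cpu"):
    """Average loss over a loader of (grayscale, color) pairs."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for gray, color in loader:
            pred = model(gray.to(device))
            total += criterion(pred, color.to(device)).item() * gray.size(0)
            count += gray.size(0)
    return total / count
```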

Learning Rate

Our default learning rate was 0.001, which yielded the best results, with an average training loss of 0.0027 and validation loss of 0.0028. When we moderately adjusted the rate to 0.0005 and 0.002, performance declined slightly—[INSERT LOSS NUMBERS HERE]—though the difference was marginal.
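The learning-rate comparison amounts to retraining the same model with only the optimizer's rate changed, as sketched below. The optimizer (Adam) is an assumption for illustration; the model and loader are reused from the sketches above, and only one pass over the data is shown.

```python
import torch

for lr in (0.0005, 0.001, 0.002):
    model = ColorizerCNN()                       # encoder-decoder model sketched above
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.HuberLoss()
    for gray, color in loader:                   # one pass shown; training ran for multiple epochs
        optimizer.zero_grad()
        loss = criterion(model(gray), color)
        loss.backward()
        optimizer.step()
```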

Overall, the model performed well on blues, greens, and browns, and tended to default to a neutral brown when uncertain. We believe this is because those colors are both prevalent in the training data (e.g., skies, vegetation, and soil) and safe from a visual plausibility standpoint. The model likely defaulted to brown in ambiguous regions lacking clear grayscale cues, and may have been penalized more harshly for incorrect red predictions. Without stronger semantic understanding, it seemed hesitant to commit to bold color choices like red.

Discussion

Given that these grayscale images have an original color-photo counterpart, it is important not only to produce a colorized image from the grayscale image but also to compare the colorized image to the original color one. With this in mind, we use the Mean Absolute Error between each colorized pixel and the corresponding pixel in the original image. Naturally, a lower Mean Absolute Error suggests that the neural network was effective at its task, but a higher Mean Absolute Error does not necessarily suggest it was ineffective. If the neural network produces images that are plausible to the eye but vary significantly in actual RGB value from the original, it has at least become an effective plausible colorizer--as opposed to a “replicator.” To elaborate, consider a picture of a blue billiards table with five red balls on it, some dull and some sharp. Given the rules of the game, a billiards table would likely hold balls of a variety of colors--is it unreasonable for the network to colorize each ball differently, especially given that there is variation between them? As the UC Berkeley study cited in this paper suggests, this is not a failure of the neural network, merely an attribute.
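For concreteness, the per-pixel Mean Absolute Error described above reduces to a one-line computation, assuming both images are tensors scaled to [0, 1]:

```python
import torch

def mean_absolute_error(colorized, original):
    """Average absolute per-pixel difference across all RGB channels."""
    return torch.mean(torch.abs(colorized - original)).item()
```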

Conclusion


Using the RGB approach, we found that our best results came from a multi-layer convolutional neural network with an encoder-decoder structure, batch normalization, ReLU activations, and a final sigmoid layer for pixel-level predictions. Huber Loss outperformed both L1 and MSE, and the CIFAR-10 dataset yielded better results than TinyImageNet.

However, the RGB approach appeared to be less effective than the more commonly used Lab-based approaches that others have published online.

Reflection

As we wrapped up this project, we realized that a main limitation in the quality of our results stemmed from predicting RGB values instead of Lab values. The Lab color space is designed to be more perceptually uniform, meaning it better aligns with how humans perceive color differences, which makes it more effective for tasks like colorization. That said, we are still glad we started with RGB: it was the most intuitive first step, and it allowed us to explore the problem in a way that made sense to us as we learned how to implement and train CNNs.
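For reference, a Lab-based pipeline would split each color image into a lightness channel (the model's input) and two color channels (the model's target). The sketch below shows that conversion using scikit-image, which we did not implement ourselves; the array is a stand-in for a real training image.

```python
import numpy as np
from skimage import color  # scikit-image

rgb = np.random.rand(32, 32, 3)   # stand-in for a color training image scaled to [0, 1]
lab = color.rgb2lab(rgb)          # L: lightness (roughly 0-100); a/b: color-opponent channels
L_channel = lab[:, :, 0]          # grayscale-like input a Lab-based model would take
ab_target = lab[:, :, 1:]         # the two color channels the model would learn to predict
```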

The problem with the RGB approach was that the model seemed to paint everything an average brown to reduce loss, and it would not take enough risk to put color into images. As a result, the loss was going down, but the results were hardly improving.

We are proud of how many different model variations we tested throughout the process. However, we do wish we had had a stronger grasp of how to structure and manage datasets. We ended up with two entirely separate codebases, each with its own issues, because that was the only way we figured out how to test different data. In hindsight, it would have been far better to develop a more flexible pipeline that could switch between datasets easily.

If we had gotten the RGB-based CNNs working earlier, we might have realized sooner how difficult it would be to achieve high-quality results. That could have given us time to pivot to the Lab approach, which seems to perform better. Comparing the two approaches—on similar datasets and with models of similar complexity—would have been a compelling experiment.

In the end, one of the biggest lessons we learned is the importance of slowing down and engaging deeply with the existing literature before jumping into a project. Taking the time to learn from others and then laying out a clear, shared approach is crucial. Finally, we are walking away with a deeper understanding that collaboration and coordination are far more effective in the long run than working asynchronously in isolation just to get things done.

While putting this project together, we also came across obstacles related to integrating systems from the related works, as well as version control.