Computer Vision


We have all seen Facebook, Google Photos and high-end mobile phones identify faces and people in photographs. In fact, on some benchmarks such systems have already matched or beaten humans at this task. How do they manage this?

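Image processing was in fact one of the first candidates for machine learning. A single image of 4096x4096 pixels has 2^24 pixels. At three bytes per pixel (one each for red, green and blue), that is 3 * 2^24 bytes of data, roughly 50 MB. Comparing two such images byte by byte in matching positions takes 3 * 2^24 comparisons. Checking every byte of one image against every byte of the other, which is what it would take to find a match at an arbitrary position, takes (3 * 2^24)^2 = 9 * 2^48 comparisons. For Facebook to compare all the images it holds this way, the cost would be enormous. How do they manage it?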
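The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the three bytes per pixel (8-bit RGB) is an assumption, and the variable names are illustrative.

```python
# Back-of-the-envelope sizes for the 4096x4096 example in the text.
WIDTH = HEIGHT = 4096      # 2**12 pixels on each side
BYTES_PER_PIXEL = 3        # assumption: 8-bit RGB, one byte per channel

pixels = WIDTH * HEIGHT                    # 2**24
image_bytes = BYTES_PER_PIXEL * pixels     # 3 * 2**24

# Comparing byte i of one image only with byte i of the other.
aligned_comparisons = image_bytes

# Comparing every byte of one image with every byte of the other.
all_pairs_comparisons = image_bytes ** 2   # 9 * 2**48

print(pixels == 2**24)                       # True
print(image_bytes)                           # 50331648
print(all_pairs_comparisons == 9 * 2**48)    # True
```

Even the cheaper aligned comparison only works when the two photographs line up exactly, which is why raw pixel comparison is not how faces are matched in practice.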

Well, how do we compare two images? Do we compare pixels? Certainly not. Yet we can identify a person in one corner of an image and match them with a person in a different corner of another image. How does that work? Unlike in language, the information in an image is largely localized. The first thing we identify in an image is its edges, and to find an edge we only need to look at a small neighbourhood of pixels, not the entire image at once. This is what Convolutional Neural Networks do: they process small parts of the image at a time.
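To make the idea of local processing concrete, here is a minimal sketch in plain Python that slides a 3x3 Sobel kernel over a tiny grayscale image. It is not a Convolutional Neural Network: the kernel is fixed by hand, whereas a CNN learns its kernels from data. The function name filter3x3 and the toy image are made up for illustration.

```python
# Sobel kernel that responds to left-to-right changes in brightness,
# i.e. to vertical edges.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def filter3x3(image, kernel):
    """Slide a 3x3 kernel over a 2D grayscale image (a list of rows).

    Each output value depends only on a 3x3 neighbourhood of the input.
    Borders are skipped, so the output is 2 smaller in each dimension.
    The kernel is not flipped (cross-correlation), which is also what
    CNN "convolution" layers compute in practice.
    """
    height, width = len(image), len(image[0])
    out = []
    for r in range(height - 2):
        row = []
        for c in range(width - 2):
            total = 0
            for i in range(3):
                for j in range(3):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        out.append(row)
    return out


# Toy 5x6 image: dark left half (0), bright right half (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]

edges = filter3x3(image, SOBEL_X)
for row in edges:
    print(row)   # each row is [0, 36, 36, 0]
```

The output is zero wherever the neighbourhood is uniform and large only where the dark and bright halves meet, so the edge is found without ever looking at the whole image at once.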

Detecting edges, identifying shapes and so on are elementary aspects of vision. There is a lot more to images that remains unsolved. The image below is commonly quoted in this respect.

This image contains a lot more than just shapes and people. It portrays an amazing aspect of a unique personality, and that is what makes it special. We can understand it because we have seen this person before: we already know a lot about him and about what he is doing. How can a machine recognize, and point out, that this image is different from the many other images it has?

Such questions remain open, waiting for the next breakthrough in machine learning.