It’s not easy to teach a computer how to see. You can’t just put a camera on a computer and expect it to see. For example, imagine someone throws you a ball from across the room and you catch it. This seems simple, right? But, in reality, this is one of the most difficult and intricate processes we have ever tried to understand, let alone reproduce. Roughly, what happens when you see the ball in the air is this: light from the ball first strikes your retina, which performs some preliminary analysis and sends the information to the brain, where the visual cortex analyzes the image more thoroughly. This information is then relayed to the rest of the cortex, which compares the image to everything it already knows, determines the dimensions of the object and then decides on an action to perform, i.e., raise your hand(s) and catch the ball. All of this happens in a fraction of a second, with little to no conscious effort, and it almost never fails. If you want a computer to view the world the way a person does, it will rely on both computer vision and image recognition.

What is Computer Vision?

It is best to think of computer vision as the part of the human brain that processes the information received through the eyes. It is what gives a barcode scanner the ability to “see” the stripes on a UPC label, and it is how Apple’s Face ID in its latest iPhone model determines whether or not the face its camera is looking at is, in fact, the owner of the phone.

Computer vision is an integral part of artificial intelligence (AI): it gives machines a sense of sight, but not an inherent understanding of the physical universe. For that, computer vision needs training. Training a computer vision system is best compared to teaching a small child. If you show a child a number or a letter enough times, the child will learn to recognize it. Computer vision, still in its infancy, works the same way: it must be trained to recognize and identify certain objects.

While it is easy to make a computer recognize a specific image, for example a QR code, it is an entirely different story when you would like it to recognize things in states it doesn’t expect. Returning to the child analogy, many children can recognize letters and numbers right away even when they are upside down. This is because our biological neural networks are quite good at interpreting visual information even if the image we are processing does not look exactly the way we expect it to.

Image recognition usually involves creating a neural network that processes the individual pixels of an image. These neural networks are fed with as many pre-labeled images as possible in order to teach them how to recognize similar images.
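As a minimal sketch of this idea, a toy “classifier” can compare a new image’s pixels against pre-labeled examples and return the label of the closest one. The 3×3 “images”, labels and pixel values below are invented for illustration; real systems learn from millions of much larger images.

```python
# Toy illustration: classify a tiny "image" (a flat list of pixel
# intensities) by comparing it to pre-labeled training images.
def distance(a, b):
    # Sum of squared pixel differences between two images.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(image, training_data):
    # Return the label of the closest pre-labeled image.
    label, _ = min(
        ((lbl, distance(image, ex)) for lbl, ex in training_data),
        key=lambda pair: pair[1],
    )
    return label

# Hypothetical pre-labeled 3x3 images: a vertical bar vs. a horizontal bar.
training_data = [
    ("vertical",   [0, 1, 0,  0, 1, 0,  0, 1, 0]),
    ("horizontal", [0, 0, 0,  1, 1, 1,  0, 0, 0]),
]

# An incomplete vertical bar is still closest to the "vertical" example.
print(classify([0, 1, 0,  0, 1, 0,  0, 0, 0], training_data))  # vertical
```

A real neural network does not memorize examples this way, of course, but the principle is the same: the more pre-labeled images it sees, the better it maps raw pixels to labels.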

Challenges to Computer Vision Technology

Even though we have discussed how difficult it is to create a machine that views the world like a human, it is worth pinpointing exactly what makes it so difficult.

Image classification — Basically, this is labeling an image based on its content. There is usually a fixed set of labels, and your model has to predict the label that best fits the image. This is very difficult for a machine, since all it sees in a given image is a grid of numbers.
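At its simplest, “predicting the label that best fits” means scoring the image’s grid of numbers against each label and taking the highest score. The templates and image below are made up for illustration; a real classifier learns its weights from data rather than having them hand-picked.

```python
# Sketch: score each label in a fixed set against the raw pixel numbers
# and pick the best. One hypothetical template per label stands in for a
# learned model here.
def predict(image, templates):
    scores = {
        label: sum(p * w for p, w in zip(image, weights))
        for label, weights in templates.items()
    }
    return max(scores, key=scores.get)

# Hypothetical 3x3 pixel templates for a fixed label set.
templates = {
    "cross": [1, 0, 1,  0, 1, 0,  1, 0, 1],
    "plus":  [0, 1, 0,  1, 1, 1,  0, 1, 0],
}
image = [0, 1, 0,  1, 1, 1,  0, 1, 0]  # looks like a plus sign
print(predict(image, templates))  # plus
```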

Object detection — This involves recognizing the various sub-images and drawing a bounding box around each recognized sub-image. This problem is harder to solve than classification, since the model must predict precise image coordinates in addition to labels. The best-known detection method to date is Faster R-CNN (Region-based Convolutional Neural Network), which localizes the regions of the image that need to be processed and classified.
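Detectors are typically judged by how well a predicted bounding box overlaps the ground-truth box, most commonly via intersection over union (IoU). A small self-contained sketch of that metric:

```python
def iou(box_a, box_b):
    # Boxes given as (x1, y1, x2, y2); IoU = overlap area / union area.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two 10x10 boxes overlapping in a 5x5 corner: 25 / 175 ≈ 0.143.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```

An IoU of 1.0 means a perfect box; detection benchmarks usually count a prediction as correct only above some IoU threshold, such as 0.5.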

Image segmentation — This is the process of partitioning an image based on the objects present, with accurate boundaries. There are two types of image segmentation: semantic segmentation and instance segmentation. In semantic segmentation, you label each pixel with an object class. In instance segmentation, each individual object is labeled separately, so two objects of the same class still receive distinct labels.
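The difference can be shown on a tiny mask: semantic segmentation stops at a per-pixel class map, while instance segmentation additionally separates that map into distinct objects. The sketch below uses simple 4-connected flood fill as a stand-in for a real instance-segmentation model; the 4×4 mask is invented for illustration.

```python
# A tiny 4x4 semantic mask (0 = background, 1 = "cat"). Semantic
# segmentation ends here: both cats share the same class label.
semantic = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

def instance_mask(mask):
    # Give each 4-connected region of foreground pixels a distinct id,
    # separating the two "cat" blobs into instances 1 and 2.
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    next_id = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not out[sy][sx]:
                next_id += 1
                stack = [(sy, sx)]
                while stack:  # flood fill from the seed pixel
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and mask[y][x] and not out[y][x]:
                        out[y][x] = next_id
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return out

for row in instance_mask(semantic):
    print(row)
```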

Image captioning — This is one of the coolest computer vision problems, because it has an added flavor of natural language processing. It involves generating the caption that is most appropriate for your image. Image captioning is image detection plus captioning: the detection can be done by the same Faster R-CNN method mentioned above, and the captioning by an RNN (Recurrent Neural Network).
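As a loose sketch of the captioning half, a recurrent decoder emits one word at a time, each conditioned on the previous one. The tiny vocabulary and hand-picked scores below merely stand in for what a trained RNN would compute from its hidden state and the image features.

```python
# Hand-picked "next word" scores keyed by the previous word; a real RNN
# would compute these from its hidden state and the detected image content.
next_scores = {
    "<start>": {"a": 2.0, "dog": 0.1, "runs": 0.1, "<end>": 0.0},
    "a":       {"a": 0.0, "dog": 2.0, "runs": 0.1, "<end>": 0.0},
    "dog":     {"a": 0.0, "dog": 0.0, "runs": 2.0, "<end>": 0.1},
    "runs":    {"a": 0.0, "dog": 0.0, "runs": 0.0, "<end>": 2.0},
}

def generate(max_len=10):
    # Greedy decoding: repeatedly pick the highest-scoring next word
    # until the end token (or a length cap) is reached.
    word, caption = "<start>", []
    for _ in range(max_len):
        word = max(next_scores[word], key=next_scores[word].get)
        if word == "<end>":
            break
        caption.append(word)
    return " ".join(caption)

print(generate())  # a dog runs
```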

AI vs. The Human Brain

Even though a lot of progress has been made over the years in computer vision, humans are still far better at image understanding. Simply put, machines are narrow-sighted in the sense that they learn by going through a fixed category of images. Although they may have learned from a massive number of images, usually millions, this is still nowhere close to what a human is capable of. This can be attributed to the neocortex, the part of the brain responsible for recognizing patterns, cognition, perception and many other higher-order functions.

There have been some major breakthroughs in the field of AI, such as AlphaGo defeating a world champion at Go and OpenAI’s Dota 2 bot defeating expert players, but these systems are very niche: the Dota 2 bot is specific to Dota 2. The human brain, on the other hand, is very generic, meaning you use the same brain for all of your daily activities. To compete with the human brain, we will need a similarly general artificial intelligence.

Best regards,

Skywell Software team

Please check out our BlockDelta profile for our full contact details.