Firm has now released the open-source code to let developers take part

Latest version of the system is faster to train and far more accurate

Artificial intelligence systems have recently begun to try their hand at writing picture captions, often producing hilarious, and even offensive, blunders.

But Google’s Show and Tell algorithm has almost perfected the craft.

According to the firm, the AI can now describe images with nearly 94 percent accuracy and may even ‘understand’ the context and deeper meaning of a scene.


Google has released the open-source code for its image captioning system, allowing developers to take part, the firm revealed on its research blog.

The AI was first trained in 2014, and has steadily improved in the time since.

Now, the researchers say it is faster to train, and produces more detailed, accurate descriptions.

The most recent version of the system uses the Inception V3 image classification model, and undergoes a fine-tuning phase in which its vision and language components are trained on human-generated captions.

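The pipeline the researchers describe pairs an image encoder with a language-generating decoder: the encoded image features seed a decoder that emits the caption one word at a time. A minimal sketch of that idea, using toy NumPy stand-ins rather than Inception V3 or a trained language model (the vocabulary, weights, and function names here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; real systems use many thousands of words.
vocab = ["<start>", "<end>", "a", "dog", "on", "the", "beach"]
V = len(vocab)
D = 8  # toy feature size (Inception V3 actually produces 2048-d features)

def encode_image(image):
    """Stand-in for the Inception V3 encoder: project pixels to a feature vector."""
    W_enc = rng.standard_normal((image.size, D)) * 0.1
    return image.ravel() @ W_enc

# Stand-ins for trained decoder weights.
W_h = rng.standard_normal((D + V, D)) * 0.1
W_out = rng.standard_normal((D, V)) * 0.1

def greedy_caption(image, max_len=10):
    """Generate a caption greedily: at each step, feed the current state and
    previous word through the decoder and pick the most likely next word."""
    h = encode_image(image)  # image features seed the decoder state
    word = "<start>"
    caption = []
    for _ in range(max_len):
        x = np.concatenate([h, np.eye(V)[vocab.index(word)]])
        h = np.tanh(x @ W_h)                      # update decoder state
        word = vocab[int(np.argmax(h @ W_out))]   # most likely next word
        if word == "<end>":
            break
        caption.append(word)
    return " ".join(caption)

print(greedy_caption(np.ones((4, 4))))
```

With random weights the output is gibberish; the fine-tuning phase described above is what makes the real system's word choices match human captions.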

HOW IT WORKS

The AI can describe exactly what's in a scene.

The system uses the Inception V3 image classification model as the basis for the image encoder, allowing for 93.9 percent classification accuracy. These encodings help the system to recognize various objects in an image.

Then the image model is fine-tuned, allowing the system to describe the objects rather than simply classifying them. So, it can identify the colours in an image, and determine how objects in the image relate to each other.

In this phase, the system's vision and language components are jointly trained on human-generated captions.
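The joint training in the fine-tuning phase amounts to minimising the gap between the model's next-word predictions and the words of a human-written caption, typically measured as cross-entropy. A toy sketch of that objective (the function name and numbers are illustrative, not taken from Google's code):

```python
import numpy as np

def caption_loss(word_probs, target_ids):
    """Average cross-entropy of the model's next-word predictions
    against a human-written caption (the fine-tuning objective)."""
    probs = np.asarray(word_probs)
    ids = np.asarray(target_ids)
    # For each position, take the probability assigned to the human's word.
    picked = probs[np.arange(len(ids)), ids]
    return float(-np.mean(np.log(picked)))

# Two predicted distributions over a 3-word toy vocabulary, and the
# ids of the words the human caption actually used.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
print(caption_loss(probs, [0, 1]))  # lower means predictions match the caption
```

Training nudges the vision and language weights together so this loss falls across the whole set of human captions.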

Examples of its capabilities show the AI can describe exactly what is in a scene, including ‘A person on a beach flying a kite,’ and ‘a blue and yellow train traveling down train tracks.’

Because the system learns from a training set of human captions, it will sometimes reuse those captions for a similar scene.

This, the researchers say, may raise questions about its true capabilities, but while it does ‘regurgitate’ captions where applicable, this is not always the case.

‘So does it really understand the objects and their interactions in each image? Or does it always regurgitate descriptions from the training data?’ the researchers wrote.


'Excitingly, our model does indeed develop the ability to generate accurate new captions when presented with completely new scenes, indicating a deeper understanding of the objects and context in the images.'

An example shared in the blog post shows how the components of separate images come together to generate new captions.

Three separate images of dogs in various situations can thus lead to the accurate description of a photo later on: ‘A dog is sitting on the beach next to a dog.’

‘Moreover,’ the researchers explain, ‘it learns how to express that knowledge in natural-sounding English phrases despite receiving no additional language training other than reading the human captions.’