Image recognition has come a long way over the last few years, and perhaps more than anyone else, Google has brought those advances to end users. To see how far we've come, just try searching through your own images on Google Photos. But recognizing objects (and maybe basic scenes) is only a first step.

In September, Google showed how its approach, using the currently popular deep learning methodology, could not only recognize images of single objects but also classify multiple objects within a single image (think different kinds of fruit in a fruit basket, for example).

Once you can do that, you can also try to create a full natural language description of the image and that’s what Google is doing now. According to a new Google Research paper, the company has now developed a system that can teach itself how to describe a photo like the one below with a very high degree of accuracy.

As Google’s researchers note, the typical approach to this problem would be to first let the computer vision algorithms do their job and then use natural language processing to create a description. That sounds reasonable enough, but the researchers instead suggest that the better approach is to merge “recent computer vision and language models into a single jointly trained system, taking an image and directly producing a human readable sequence of words to describe it.” A similar setup has worked well in machine translation, Google says, where two recurrent neural networks are combined: one encodes the source sentence into a fixed representation and the other decodes it into the target language, word by word. The captioning system works a bit differently, but uses essentially the same approach.
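To make the encoder-decoder idea concrete, here is a deliberately tiny sketch in plain Python. It is not Google's implementation: the real system uses a trained convolutional network as the encoder and a recurrent network as the decoder over a large vocabulary. Here a three-number feature vector stands in for the encoder's output, a hand-set weight table stands in for training, and a greedy loop emits one word at a time conditioned on the previous word.

```python
# Toy sketch of an encoder-decoder captioner (illustrative only).
# All names, the vocabulary, and the weights below are invented.

VOCAB = ["<start>", "a", "dog", "on", "skateboard", "<end>"]

def encode_image(pixels):
    """Stand-in for a CNN encoder: reduce the image to a small feature vector."""
    return [sum(pixels) / len(pixels), max(pixels), min(pixels)]

def decode_step(features, prev_word, weights):
    """Stand-in for one decoder step: score every vocabulary word given the
    previous word (in the real model, the image features genuinely condition
    these scores; here they just nudge everything equally) and pick greedily."""
    scores = {}
    for word in VOCAB:
        scores[word] = weights.get((prev_word, word), 0.0) + 0.001 * sum(features)
    return max(scores, key=scores.get)

def caption(pixels, weights, max_len=10):
    """Produce a word sequence directly from image pixels, end to end."""
    features = encode_image(pixels)
    words, prev = [], "<start>"
    for _ in range(max_len):
        prev = decode_step(features, prev, weights)
        if prev == "<end>":
            break
        words.append(prev)
    return " ".join(words)

# Hand-set "trained" transition weights, purely for illustration.
weights = {("<start>", "a"): 1.0, ("a", "dog"): 1.0, ("dog", "<end>"): 1.0}

print(caption([0.1, 0.5, 0.9], weights))  # → "a dog"
```

The point of the joint design is visible even in this toy: there is no separate "describe what vision found" stage, just one function from pixels to a word sequence, which is what makes the whole pipeline trainable end to end.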

That’s not to say Google’s approach is perfect. Using the BLEU score, which is typically used to compare the quality of machine translations against human reference translations, the computer-generated captions score somewhere between 27 and 59 points, depending on the data set; humans tend to score around 69 points. Still, that’s a huge step forward from earlier approaches, which don’t score above 25 points.
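BLEU itself is simple enough to sketch. The snippet below implements a simplified unigram-only version (clipped word precision times a brevity penalty); the full metric geometrically averages precisions for 1- through 4-grams across a whole corpus. The candidate and reference sentences are invented for illustration.

```python
# Simplified single-reference, unigram-only BLEU (the real metric uses
# n-grams up to 4 and is computed over a corpus, not one sentence).
from collections import Counter
import math

def unigram_bleu(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate word's count by its count in the reference,
    # so repeating a correct word doesn't inflate the score.
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = overlap / len(cand)
    # Brevity penalty: captions shorter than the reference are discounted.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

# Invented example: a machine caption scored against a human reference.
machine = "a dog rides a skateboard"
human = "a dog riding a skateboard"
print(round(100 * unigram_bleu(machine, human)))  # → 80
```

Four of the machine caption's five words appear in the reference, so it scores 80 on this toy scale; a caption with the right objects but awkward phrasing loses points word by word, which is why even good captioning systems land well below the human range.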