The Basics

Here's what needs to happen immediately after a user starts typing:

Do a search on Giphy and start downloading the top results.

Make a transparent image file containing the text to overlay on top of the Giphy search results.

Composite those two pieces together.

Re-encode and return the result to the Telegram servers.

In the video you can see that Telegram returns around 20 inline results every time you pause while typing. That's great for seeing lots of options, but it's also a lot of encoding to do simultaneously. And if the bot isn't fast enough than it definitely won't be worth using.

To make serverside compositing and encoding fast enough usually requires a bunch of hardware or expensive video encoding services. However, there is one thing working in our favor - each gif in the result is completely independent, so it's very parallelizable. We just need a server with enough cores to take advantage of that fact.