The question doesn't say that this is some kind of web-based application. The one thing that caught my eye was:

I'm sampling a data set of billions of elements and every time I need to pick 10 numbers out of it (simplified) and sort them (and make conclusions from the sorted 10 element list).

As a software and hardware engineer, this absolutely screams FPGA to me. I don't know what kind of conclusions you need to draw from the sorted set of numbers or where the data comes from, but I know it would be almost trivial to process somewhere between one hundred million and a billion of these "sort-and-analyze" operations per second. I've done FPGA-assisted DNA sequencing work in the past. It is nearly impossible to beat the massive processing power of FPGAs when the problem is well suited to that type of solution.

At some level, the only limiting factor becomes how quickly you can shovel data into an FPGA and how quickly you can get it out.
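To put a rough number on that limit, here is a back-of-the-envelope sketch. It assumes the FPGA consumes one set of ten 32-bit values per clock; the clock rates and the function name `input_bandwidth_bytes_per_s` are illustrative assumptions of mine, not figures from a real design.

```python
def input_bandwidth_bytes_per_s(clock_hz, elements=10, bytes_per_element=4):
    """Bytes per second needed to feed one set of `elements` values per clock."""
    return clock_hz * elements * bytes_per_element

# Hypothetical clock rates, chosen only to illustrate the scaling.
for mhz in (100, 200, 400):
    gb_per_s = input_bandwidth_bytes_per_s(mhz * 1_000_000) / 1e9
    print(f"{mhz} MHz -> {gb_per_s:.1f} GB/s of input")
```

Even at the low end this is several gigabytes per second of input, which is why the I/O path, rather than the sort itself, tends to set the ceiling.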

As a point of reference, I designed a high-performance real-time image processor that received 32-bit RGB image data at a rate of about 300 million pixels per second. The data streamed through FIR filters, matrix multipliers, lookup tables, spatial edge detection blocks and a number of other operations before coming out the other end. All of this ran on a relatively small Xilinx Virtex2 FPGA with internal clocking spanning from about 33 MHz to, if I remember correctly, 400 MHz. Oh, yes, it also had a DDR2 controller implementation and ran two banks of DDR2 memory.

An FPGA can output a sorted set of ten 32-bit numbers on every clock transition while operating at hundreds of MHz. There would be a short delay at the start of the operation as the data fills the processing pipeline(s). After that you should be able to get one result per clock, or more if the processing can be parallelized by replicating the sort-and-analyze pipeline. The solution, in principle, is almost trivial.
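To illustrate why this maps so well onto hardware, here is a minimal software model of a sorting network. It is a sketch, not HDL and not any particular design: I've used the simple odd-even transposition network because it is easy to verify, and optimized networks need fewer comparators and stages. The names `transposition_network` and `sort_with_network` are my own. The key property is that the sequence of compare-exchange operations is fixed and independent of the data, so each stage can become one pipeline stage of parallel comparators.

```python
from itertools import product

def transposition_network(n):
    """Build an odd-even transposition network for n inputs.

    Returns a list of n stages; each stage is a list of (i, j) index pairs.
    Pairs within a stage touch disjoint elements, so in hardware they could
    all be evaluated in parallel within a single pipeline stage.
    """
    return [[(i, i + 1) for i in range(stage % 2, n - 1, 2)]
            for stage in range(n)]

def sort_with_network(values, stages):
    """Apply the fixed compare-exchange sequence to a copy of `values`."""
    v = list(values)
    for stage in stages:
        for i, j in stage:
            if v[i] > v[j]:          # compare-exchange: smaller value goes to index i
                v[i], v[j] = v[j], v[i]
    return v

STAGES_10 = transposition_network(10)

# Zero-one principle: a comparator network that sorts every 0/1 input
# sorts every input, so 2**10 cases are enough to check the network.
assert all(sort_with_network(bits, STAGES_10) == sorted(bits)
           for bits in product((0, 1), repeat=10))

print(sort_with_network([7, 3, 9, 1, 8, 2, 6, 0, 5, 4], STAGES_10))
```

In a pipelined implementation, a new set of ten values could enter the first stage on every clock while earlier sets move through the later stages, which is where the one-result-per-clock throughput comes from once the pipeline is full.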

The point is: if the application isn't PC-bound and the data stream and processing are "compatible" with an FPGA solution (either stand-alone or as a co-processor card in the machine), there is no way you are going to beat the attainable level of performance with software written in any language, regardless of the algorithm.

I just ran a quick search and found a paper that might be of use to you. It looks like it dates back to 2012. You can do a lot better in performance today (and could even back then). Here it is:

Sorting Networks on FPGAs