I have an index with around 400 million documents stored. The fields generally look like this: systemname, filename, timestamp, message, version, etc.

What I want to get is the first and last document (based on the timestamp) per filename.

What I am doing right now is:

1. Query the whole index
2. Define a window partitioned by filename
3. Sort ascending by timestamp
4. Add a row number
5. Add the maximum row number per window
6. Keep all entries where rownumber = 1 or rownumber = max_rownumber
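The steps above correspond roughly to this sketch (assuming a DataFrame `df` already read from the index, with the column names `filename` and `timestamp` as described above):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Window per filename, ordered by timestamp, for the row numbering
val ordered = Window.partitionBy("filename").orderBy(col("timestamp").asc)

// Unordered window over the same partition, so max() sees the whole group
// rather than a running maximum
val whole = Window.partitionBy("filename")

val firstAndLast = df
  .withColumn("rownumber", row_number().over(ordered))
  .withColumn("max_rownumber", max(col("rownumber")).over(whole))
  .filter(col("rownumber") === 1 || col("rownumber") === col("max_rownumber"))
```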

This is really really slow

Code can be seen here: Getting first and last entry of window

ES Version: 6.0.0
ES Spark Connector: elasticsearch-spark-20 6.0.0
Scala Version: 2.11.8
Spark Version: 2.2

Does anyone have an idea how I could speed this up?

The best approach would probably be to reduce the I/O, but the elasticsearch-spark connector does not support pushing aggregations down to Elasticsearch. I also found no way to query only specific fields of a document; for this use case, querying only 5 out of the 10 fields per document would be enough. The calculation is currently taking over 15 hours on a cluster with 112 cores.
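For reference, the read itself is just the connector's DataFrame load. The elasticsearch-hadoop docs list an `es.read.field.include` option that looks like it should restrict the fields read, but as noted I haven't found a way that actually works here; the index name and node address below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "localhost:9200")  // placeholder node address
  // documented source-filtering option; unverified on 6.0.0
  .option("es.read.field.include", "systemname,filename,timestamp,message,version")
  .load("myindex/doc")                   // placeholder index/type
```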

If someone could help me with a solution, that would be really nice; I am out of options and ideas. Not being able to query aggregations is becoming a dealbreaker for Elasticsearch, but I don't want to switch away from it.