If you start working intensively with Elasticsearch, you cannot avoid understanding its internal data structures. Here I'll try to explain them as clearly as possible:

Inverted Index

Key Characteristics of Inverted Index

Allows very fast full-text searches

Not a good structure for sorting

Created at index-time

Serialized to disk

An inverted index is the basic data structure used for search. It consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears. Consider the following structure.

Term    | Doc_1 | Doc_2 | ... | Doc_X
---------------------------------------
hello   |   X   |   X   |     |
world   |   X   |   X   |     |
java    |       |   X   |     |
perl    |   X   |       |     |
golang  |       |       | ... |   X
...
---------------------------------------

Here, every term has a list of the documents that contain it. Now, if we want to search for "world perl", we just need to find the documents in which each term appears:

Term    | Doc_1 | Doc_2
-------------------------
world   |   X   |   X
perl    |   X   |
-------------------------
Total   |   2   |   1

Both documents match, but the first document has more matches than the second. Keep in mind that at index time the values are subject to tokenization and normalization, a process called analysis.
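To make the lookup above concrete, here is a minimal Python sketch of the idea. It is a toy model, not how Elasticsearch or Lucene implement it: the `analyze` function (lowercase, split on whitespace) is a stand-in for real analysis, and the function names are made up for this example.

```python
from collections import defaultdict


def analyze(text):
    """Toy analysis step (an assumption): lowercase, then split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each unique term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Count, per document, how many of the query terms it contains."""
    matches = defaultdict(int)
    for term in analyze(query):
        for doc_id in index.get(term, ()):
            matches[doc_id] += 1
    # Documents with the most matching terms first; ties broken by document ID.
    return sorted(matches.items(), key=lambda item: (-item[1], item[0]))


docs = {
    "Doc_1": "hello world perl",
    "Doc_2": "hello world java",
}
index = build_inverted_index(docs)
results = search(index, "world perl")
print(results)  # [('Doc_1', 2), ('Doc_2', 1)]
```

The search never scans the documents themselves. It only touches the posting lists of the two query terms, which is why this structure is fast for full-text lookups.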

Doc Values

Key Characteristics of Doc Values

Good for sorting operations

Stores all the values for a single field together in a single column of data

Enabled by default for all field types except text

Created at index-time

Serialized to disk

While indexing, Elasticsearch adds the tokens to the inverted index for search. It also extracts the terms and adds them to a columnar store called doc values.

Doc     | Terms
-----------------------------------------------
Doc_1   | hello, world, perl
Doc_2   | hello, world, java
Doc_3   | We, need, more, golang, tutorials
-----------------------------------------------

Doc values are used in several use cases in Elasticsearch:

Sorting

Aggregations on a field

Certain filters (for example, geolocation filters)

Scripts that refer to fields
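As a rough illustration of why a column per field suits sorting and aggregations, consider the sketch below. It is a simplified model rather than Lucene's actual on-disk format, and the documents, the field names `lang` and `views`, and the helper functions are all invented for the example.

```python
from collections import Counter

# Hypothetical documents; the fields "lang" and "views" are made up.
docs = [
    {"lang": "perl", "views": 30},
    {"lang": "java", "views": 10},
    {"lang": "golang", "views": 20},
    {"lang": "java", "views": 40},
]


def build_doc_values(docs, field):
    """Store all values of one field together, indexed by document position."""
    return [doc.get(field) for doc in docs]


def sort_by_column(column, reverse=False):
    """Return document positions ordered by their value in the column."""
    return sorted(range(len(column)), key=lambda doc_id: column[doc_id], reverse=reverse)


def terms_aggregation(column):
    """Count how many documents hold each distinct value."""
    return Counter(column)


views_column = build_doc_values(docs, "views")
lang_column = build_doc_values(docs, "lang")

order = sort_by_column(views_column)
buckets = terms_aggregation(lang_column)
print(order)          # [1, 2, 0, 3]
print(buckets["java"])  # 2
```

Both operations start from a document and ask for its value, the opposite direction to the inverted index, so they only need to read the one column they care about.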

When the "working set" is smaller than the available memory on a node, the OS will naturally keep all the doc values hot in memory, leading to very fast access. When the "working set" is much larger than the available memory, the OS will start paging doc values in and out of memory as needed.

Fielddata

Key Characteristics of Fielddata

Good for the same operations as doc values (sorting, aggregations, scripts)

But for text fields only

Created at query-time

In-memory data structure

Is not serialized to disk

Is disabled by default (expensive to build and to keep in the heap)

Most fields can use index-time, on-disk doc values for sorting, aggregations, and scripts, but text fields do not support doc values.

Instead, text fields use a query-time in-memory data structure called fielddata. This data structure is built on demand the first time that a field is used for aggregations, sorting, or in a script. It is built by reading the entire inverted index for each segment from disk, inverting the term ↔︎ document relationship, and storing the result in memory, in the JVM heap.
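The "inverting the relationship" step can be sketched as follows. This is only a conceptual model: the module-level cache, the function `load_fielddata`, and the field name `body` are assumptions made for illustration, and real fielddata is built per segment with far more compact structures.

```python
from collections import defaultdict

# Toy inverted index for one field: term -> documents containing it.
inverted_index = {
    "hello": ["Doc_1", "Doc_2"],
    "world": ["Doc_1", "Doc_2"],
    "java": ["Doc_2"],
    "perl": ["Doc_1"],
}

# Stands in for the in-memory (heap) cache; nothing here is written to disk.
_fielddata_cache = {}


def load_fielddata(field, index):
    """Build the doc -> terms view on first use, then reuse the cached copy."""
    if field not in _fielddata_cache:
        doc_to_terms = defaultdict(list)
        # Reading the whole index is what makes the first access expensive.
        for term in sorted(index):
            for doc_id in index[term]:
                doc_to_terms[doc_id].append(term)
        _fielddata_cache[field] = dict(doc_to_terms)
    return _fielddata_cache[field]


fielddata = load_fielddata("body", inverted_index)
print(fielddata["Doc_1"])  # ['hello', 'perl', 'world']
```

Note that the result holds analyzed terms, not the original text, and that every term of every document ends up in memory at once.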

Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn't make sense to do so, since fielddata is expensive in both memory and computation.
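For reference, the two options could look roughly like this as mapping bodies, written here as Python dictionaries. The field name `title` is hypothetical. `fielddata` is the mapping parameter that turns the feature on for a text field, and the `keyword` sub-field is a common alternative that keeps the text field for search while sorting and aggregating on a doc-values-backed field. Treat this as a sketch and check the mapping documentation for your Elasticsearch version.

```python
import json

# Option A: enable fielddata on a text field (built in heap at query time).
fielddata_mapping = {
    "properties": {
        "title": {"type": "text", "fielddata": True},
    }
}

# Option B: keep "title" as text for full-text search and add a keyword
# sub-field ("title.keyword") that uses doc values for sorting/aggregations.
multi_field_mapping = {
    "properties": {
        "title": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword"}},
        }
    }
}

# Serialize to the JSON body that would be sent when creating the mapping.
fielddata_json = json.dumps(fielddata_mapping, indent=2)
multi_field_json = json.dumps(multi_field_mapping, indent=2)
print(multi_field_json)
```

With option B, sorting and aggregations would target `title.keyword` while searches continue to use `title`.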

P.S. Did I forget something? Your comments are welcome!