At Statsbiblioteket we maintain a historical net archive for the Danish parts of the Internet. We index it all in Solr and we recently caught up with the present. Time for a status update. The focus is performance and logistics, not net archive specific features.

Hardware & Solr setup

Search hardware is 4 machines, each with the same specifications & setup:

16 true CPU-cores (32 if we count Hyper Threading)

384GB RAM

25 SSDs @ 1TB (930GB really)

Each machine runs 25 Solr 4.10 instances @ 8GB heap, each instance handling a single shard on a dedicated SSD. Except for machine 4 that only has 5 shards, because it is being filled. Everything coordinated by SolrCloud as a single collection.

As the Solrs are the only thing running on the machines, it follows that there are at least 384GB-25*8GB = 184GB free RAM for disk cache on each machine. As we do not specify Xms, this varies somewhat, with 200GB free RAM on last inspection. As each machine handles 25*900GB = 22TB index data, the amount of disk cache is 200GB/22TB = 1% of index size.

Besides the total size of 70TB / 16 billion documents, the collection has some notable high-cardinality fields, used for grouping and faceting:

domain & host: 16 billion values / 2-3 million unique

url_norm: 16 billion values / ~16 billion unique

links: ~50 billion values / 20 billion+ unique

Workload and performance

The archive is not open to the public, so the amount of concurrent users is low, normally just 1 or 2 at a time. There are three dominant access patterns:

Interactive use: The typical request is a keyword query with faceting on a 4-6 fields (domain, year, mine-type…), sometimes grouped on url_norm and often filtered on one or more of the facets. Corpus probing: Batch extraction jobs, such as using the stats component for calculating the size of all harvested material, for a given year, for all harvested domains separately. Lookup mechanism for content retrieval: Very experimental and used similarly to CDX-lookups + Wayback display. Such lookups are searches for 1-100 url_field:url pairs, OR’ed together, grouped on the url_field and sorted on temporal proximity to a given timestamp.

Due to various reasons, we do not have separate logs for the different scenarios. To give an approximation of interactive performance, a simple test was performed: Extract all terms matching more that 0.01% of the documents, use those terms to create fake multi-term queries (1-4 terms) and perform searches for the queries in a single thread.

The response times for interactive use lies within our stated goal of keeping median response times < 2 seconds. It is not considered a problem that queries with 100M+ hits takes a few more seconds.

The strange low median for non-faceted search in the 10⁸-bucket is due to query size (number of terms in the query) impacting search-speed. The very fast single-term queries dominates this bucket as very few multi-term queries gave enough hits to land in the bucket. The curious can take a look at the measurements, where the raw test result data are also present.

Morale: Those are artificial queries and should only be seen as a crude approximation of interactive use. More complex queries, such as grouping on billions of hits or faceting on the links-field, can take minutes. The so-far-discovered extremely worst-case is 30 minutes.

Secret sauce

Each shard is fully optimized and the corpus is extended by adding new shards, which are build separately. The memory savings from this are large: No need for the extra memory needed for updating indexes (which requires 2 searchers to be temporarily open at the same time), no need for large segment→ordinal maps for high-cardinality faceting. Sparse faceting means lower latency, lower memory footprint and less GC. To verify this, the performance test above was re-taken with vanilla Solr faceting.

Lessons learned so far

Memory. We started out with 256GB of RAM per machine. This worked fine until all the 25 Solr JVMs(machine had expanded up to their 8GB Xmx, leaving ~50GB or 0.25% of the index size free for disk cache. At that point the performance tanked, which should not have come as a surprise as we tested this scenario nearly 2 years ago. Alas, quite foolishly we had relied on the Solr JVMs not expanding all the way up to 8GB.Upping the memory per machine to 384GB, leaving 200GB or 1% of index size free for disk cache ensured that interactive performance was satisfactory.

An alternative would have been to lower the heap for each Solr. The absolute minimum heap for our setup is around 5GB per Solr, but that setup is extremely vulnerable to concurrent requests or memory heavy queries. To free enough RAM for satisfactory disk caching, we would have needed to lower the heaps to 6GB, ruling out faceting on some of the heavier fields and in general having to be careful about the queries issued. Everything works well with 8GB, with the only Out Of Memory incidents having been due to experiment-addicted developers (aka me) issuing stupid requests. Indexing power: Practically all the work on indexing is being done in the Tika-analysis phase. It took about 40 CPU-core years to build the current Version 2 of the index; in real-time it took about 1 year. Fortunately the setup scales practically linear, so next time we’ll try to allocate 12 power houses for 1 month instead of 1 machine for 12 months. Automation: The current setup is somewhat hand-held. Each shard is constructed by running a command, waiting a bit less than a week, manually copying the constructed shard to the search cloud and restarting the cloud (yes, restart).In reality it is not that cumbersome, but a lot of time was wasted with the indexer having finished, with noone around to start the next batch. Besides, the excessively separated index/search setup means that the content currently being indexed into an upcoming shard cannot be searched.

Looking forward

Keep the good stuff: We are really happy about the searchers being non-mutable and on top of fully optimized shards. Increase indexing power: This is a no-brainer and “only” a question of temporary access to more hardware. Don’t diss the cloud: Copying raw shards around and restarting the cloud is messy. Making each shard a separate collection would allow us to use the collections API for moving them around and an alias to access it all as a single collection. Connect the indexers to the searchers: As the indexers only handle a single shard at a time, they are fairly easy to scale so that they can also function as searchers for the shards being build. The connection is trivial to create if #3 is implemented. Upgrade the components: Solr 6 will give us JSON faceting, Streaming, SQL-ish support, graph traversal and more. These aggregations would benefit both interactive use and batch jobs.We could do this by upgrading the shards with Lucene’s upgrade tool, but we would rather perform a whole re-index as we have also need better data in the index. A story for another time.