With the recent release of Apache Lucene and Solr 6, we should familiarize ourselves with the juicy features that come with them. We have the new default Similarity implementation – BM25 – instead of the previously used TF-IDF Similarity, we have improvements in the default Similarity configuration, new dimensional points, spatial module not using third-party libraries, and so on. We will look at all that in the upcoming weeks, but for now let’s dig into one of the biggest additions – Parallel SQL over MapReduce in SolrCloud.

So, what is this Parallel SQL all about? Is this really the ability to run SQL queries in Solr? In short – yes. While we don’t yet have support for complete SQL support yet (e.g., there is no join support), what Solr now gives us is still very powerful, especially for those coming to Solr from the world of relational databases and who may not yet be used to the whole world of Lucene and Solr term-oriented data structures and indices.

How does it work?

Under the hood, the SolrCloud Parallel SQL over MapReduce feature is powered by Solr Streaming API and Facebook Presto SQL Parser project, which is a part of PrestoDB (https://prestodb.io/) and is used not only by Solr, but in another software out there as well, such as crate.io for example.

Internally, Solr is able to provide both limited and unlimited sets of results (using the limit keyword) and lets us use both MapReduce and faceting algorithms. An SQL table is mapped directly to a SolrCloud collection, a column in an SQL table is mapped to a field name in the Solr Collection and we can use terms to leverage full-text searching capabilities.

How to start?

You need to have Solr 6.0 or greater. You can grab it from one of the official Apache mirrors (http://www.apache.org/dyn/closer.cgi/lucene/solr/).

You will need to run Solr in SolrCloud mode. We can do that using the example cloud configuration available with Solr distribution just by running:

bin/solr -e cloud -noprompt

The above command will create the gettingstarted collection with the following layout:

And indexing some data right after SolrCloud started:

bin/post -c gettingstarted example/exampledocs/*.xml

That should result in us having 32 documents indexed and available for searching:

Using SQL in SolrCloud

By default, we have a specialized request handler, called /sql available in Solr. This handled lets us use Parallel SQL functionality in Solr. You can see it being defined in the Solr admin UI:

Now that we have the handler and some data, let’s run the first request using SQL. Let’s retrieve all documents from the gettingstarted collection that have the field inStock equal to true.

To do that using the default query parser in Solr we would run the following command:

curl 'localhost:8983/solr/gettingstarted/select?q=inStock:true&fl=id,title&indent=true'

And that would result in 17 documents being returned. Now let’s do the same using SQL:

curl --data-urlencode 'stmt=select id,name from gettingstarted where inStock = 'true'' http://localhost:8983/solr/gettingstarted/sql

The thing to remember is that we need our data to be URL encoded and the SQL query is provided as the value of the stmt property in the request body. This is done to allow Solr JDBC driver to work (yes, there is one and it is shipped with SolrJ).

And guess what? That really works:

{"result-set":{"docs":[ {"name":["Test with some UTF-8 encoded characters"],"id":"UTF8TEST"}, {"name":["Solr, the Enterprise Search Server"],"id":"SOLR1000"}, {"name":["Canon PowerShot SD500"],"id":"9885A004"}, {"name":["Canon PIXMA MP500 All-In-One Photo Printer"],"id":"0579B002"}, {"name":["ViewSonic VA902B - flat panel display - TFT - 19\""],"id":"VA902B"}, {"name":["Dell Widescreen UltraSharp 3007WFP"],"id":"3007WFP"}, {"name":["One British Pound"],"id":"GBP"}, {"name":["One Krone"],"id":"NOK"}, {"name":["One Dollar"],"id":"USD"}, {"name":["One Euro"],"id":"EUR"}, {"name":["A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM"],"id":"VDBDB1A16"}, {"name":["CORSAIR ValueSelect 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - Retail"],"id":"VS1GB400C3"}, {"name":["CORSAIR XMS 2GB (2 x 1GB) 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) Dual Channel Kit System Memory - Retail"],"id":"TWINX2048-3200PRO"}, {"name":["Apple 60 GB iPod with Video Playback Black"],"id":"MA147LL/A"}, {"name":["Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300"],"id":"6H500F0"}, {"name":["Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133"],"id":"SP2514N"}, {"name":["Test with some GB18030 encoded characters"],"id":"GB18030TEST"}, {"EOF":true,"RESPONSE_TIME":22}]}}

As you can see, the same number of documents have been returned, our SQL query has been properly interpreted and executed. What’s more – the query was distributed, because the gettingstarted collection has two leader shards and one replica each. So that means that all the distributed Solr features are available as well! If your mouth is not watering at this point – it should be! 🙂

Of course, we can also limit the number of documents returned just by including the limit keyword:

curl --data-urlencode 'stmt=select id,name from gettingstarted where inStock = 'true' limit 2' http://localhost:8983/solr/gettingstarted/sql

We can also retrieve the number of matched documents:

curl --data-urlencode 'stmt=select count(*) from gettingstarted where inStock = 'true'' http://localhost:8983/solr/gettingstarted/sql

or order documents by score or by a field value:

curl --data-urlencode 'stmt=select id,name from gettingstarted where inStock = 'true' order by id desc' http://localhost:8983/solr/gettingstarted/sql

What about grouping?

But what about grouping you could ask. Of course, Solr and its Parallel SQL also support up to four group by clauses with the having keyword as well. For example, let’s try to group documents by the _version_ field (we will use it because it is not multi-valued):

curl --data-urlencode 'stmt=select max(_version_) from gettingstarted where inStock = 'true' group by _version_' http://localhost:8983/solr/gettingstarted/sql

And we can of course use the having keyword to limit the groups to those that match our requirements (again, we will use the _version_ field because we don’t have any other single valued field in this collection):

curl --data-urlencode 'stmt=select max(_version_) as max from gettingstarted where inStock = 'true' group by _version_ having max > 1000' http://localhost:8983/solr/gettingstarted/sql

Solr lets us use aggregation functions like:

count(*) – returns the number of documents

avg(field_name) – returns average value for a field in a group

min(field_name) – returns minimum value for a field in a group

max(field_name) – returns maximum value for a field in a group

sum(field_name) – returns sum of values for a field in a group

What else can be done?

Of course this is not all what we can do with the Parallel SQL functionality available in Solr. We can get information about distinct field values:

curl --data-urlencode 'stmt=select distinct id as distId from gettingstarted where inStock = 'true' order by id desc' 'http://localhost:8983/solr/gettingstarted/sql'

We can control the internal execution by adding the aggregationMode parameter set to either facet (which will tell Solr to use facets) or map_reduce (which will use parallel map reduce functionality):

curl --data-urlencode 'stmt=select distinct id as distId from gettingstarted where inStock = 'true' order by id desc' 'http://localhost:8983/solr/gettingstarted/sql?aggregationMode=facet'

And of course we can use the JDBC driver. If you are interested in more detailed information, there is the official cwiki page of Solr documentation available on the topic (https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface). In the future posts we’ll cover streaming aggregations and talk some more about Parallel SQL. Until then, follow @sematext and if you need any help with Solr/SolrCloud or Elasticsearch, please let us know.

Share Twitter

Facebook

LinkedIn

Reddit

Email

