Ryft

Ryft is an FPGA (field-programmable gate array) appliance that hosts and searches data quickly. In this post I will show one way to connect Apache Zeppelin to it for data analysis using Scala code. Previously I showed how to connect Apache Zeppelin to SAP HANA.

The Ryft can quickly search structured and unstructured data without needing to build an index. This ability is attributed to the FPGA, which can filter data on demand. The Ryft uses its four internal FPGA modules to process the data at search time. Other search systems such as Elasticsearch, Solr, Lucene or a database have to build and store an index of the data. Ryft operates without an index.

I have populated my Ryft with a cache of data from Enron. It is a dump of Enron emails obtained from Carnegie Mellon. This was as simple as uploading files to the Ryft and running a command like this:

ryftutil -copy "enron*" -c enron_email -a ryft.volumeintegration.com:8765

In the Zeppelin interface I will be able to search for keywords or phrases in the email files and display the matches. The size of the Enron e-mail archive is 20 megabytes.

Apache Zeppelin

Apache Zeppelin is an open source web notebook that allows a person to write code in many languages to manipulate and visualize data.

To make Apache Zeppelin work with Ryft, I installed Apache Zeppelin on the Ryft appliance and connected the Spark Ryft Connector jar found at this git project. Alternatively, you can download a prebuilt jar.

Follow the directions provided at the spark-ryft-connector project to compile the jar file needed. I compiled the jar file on my local desktop computer. Place the spark-ryft-connector jar file onto the Ryft machine. I did run into one issue that was not documented: the Ryft connector did not work properly and gave the error “java.lang.NoClassDefFoundError: org/apache/spark/Logging”.

I resolved the issue by downloading spark-core_2.11-1.5.2.logging.jar from https://raw.githubusercontent.com/swordsmanliu/SparkStreamingHbase/master/lib/spark-core_2.11-1.5.2.logging.jar and placing it in the zeppelin/interpreter/spark/dep directory.

Now you can create a note in Zeppelin. I am using the Spark interpreter which allows you to write the code in Scala.

First you have to make sure Zeppelin can use the Ryft code in the jar file. Make a dependency paragraph with this code:

%dep

z.reset()

z.load("/home/ryftuser/spark-ryft-connector-2.10.6-0.9.0.jar")

Ryft Query

Now make a new paragraph with the code to make form fields and run the Ryft API commands to perform a search. Figuring these queries out takes a detailed study of the documentation.

These are the commands to prepare and run the query. I show a simple search, a fuzzy Hamming search and a fuzzy edit distance search. The Ryft can perform very fast fuzzy searches with wide edit distances because no index has to be built.

Simple Query

queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
query = SimpleQuery(searchFor.toString)

Hamming Query

queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte, fhs)

Edit Distance Query

queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte)
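To make the difference between the two fuzzy modes concrete, here is a small plain-Scala sketch that is independent of the Ryft API. Hamming distance counts substituted characters between two strings of equal length, while edit distance (Levenshtein) also allows insertions and deletions. The object and method names (FuzzyDistance, hamming, editDistance) are my own illustrations and are not part of the connector; the sketch says nothing about how the Ryft computes these distances internally.

```scala
// Illustrative helpers only; these are not part of the Ryft connector.
object FuzzyDistance {

  /** Number of positions at which two equal-length strings differ. */
  def hamming(a: String, b: String): Int = {
    require(a.length == b.length, "Hamming distance needs equal-length strings")
    a.zip(b).count { case (x, y) => x != y }
  }

  /** Levenshtein distance: minimum number of insertions, deletions and substitutions. */
  def editDistance(a: String, b: String): Int = {
    // prev(j) is the distance between the first i-1 chars of a and the first j chars of b
    var prev = (0 to b.length).toArray
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  def main(args: Array[String]): Unit = {
    println(hamming("mohammad", "mohammed"))      // one substituted character
    println(editDistance("mohammad", "mohamad"))  // one deleted character
  }
}
```

With a distance of 1, a Hamming search for "mohammad" would accept "mohammed" but not "mohamad", because the latter is shorter; an edit distance search would accept both.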

The Search

var searchRDD = sc.ryftRDD(Seq(query), queryOptions)

This produces an RDD that can be manipulated to view the contents using code like the example below.

searchRDD.asInstanceOf[RyftRDD[RyftData]].collect.foreach { ryftData =>
  println(ryftData.offset)
  println(ryftData.length)
  println(ryftData.fuzziness)
  println(ryftData.data.replace("\n", " "))
  println(ryftData.file)
}

The Result in Zeppelin

In addition I have included code that allows the user to click on Show File to see the original e-mail with the relevant text highlighted in bold.

To display the original email, Apache Zeppelin needs access to the part of the file system on the server where I stored the original copies of the email files, so I installed it in a way that allows that access.

Ryft uses a catalog of the emails to perform searches, because it performs better when searching a few large files than many small ones. The catalog feature combines many small files into one large file.

The search results return a filename and offset which Apache Zeppelin uses to retrieve the relevant file and highlight the appropriate match.
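The highlighting step can be sketched as a small pure function. This is a minimal sketch, assuming the offset is a zero-based character position in the file text and the length counts characters; the full listing in this post compares a running character count against the offset instead, so treat the exact indexing here as an assumption to check against your own results. Highlighter and highlight are illustrative names, not part of Zeppelin or the Ryft connector.

```scala
// Illustrative helper only; the offset semantics are an assumption (see the text above).
object Highlighter {

  /** Wraps the matched span in <b> tags, clamping the span to the bounds of the text. */
  def highlight(text: String, offset: Long, length: Int): String = {
    val start = math.max(0L, math.min(offset, text.length.toLong)).toInt
    val end = math.max(start.toLong, math.min(start.toLong + length, text.length.toLong)).toInt
    text.substring(0, start) + "<b>" + text.substring(start, end) + "</b>" + text.substring(end)
  }

  def main(args: Array[String]): Unit = {
    val email = "To: mohammed@example.com\nSubject: hello"
    // Convert newlines to <br> only after inserting the tags, so offsets still line up.
    println(highlight(email, 4, 8).replace("\n", "<br>"))
  }
}
```

Slicing the string once avoids building the document one character at a time, which matters for larger files.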

In the end, Ryft found all instances of the name Mohammad, with various spelling differences, in 0.148 seconds in a dataset of 30 megabytes. When I ran the same search on 48 gigabytes of data, it finished in 5.89 seconds; 94 gigabytes took 12.274 seconds and 102 gigabytes took 13 seconds. These are just quick sample numbers using dumps of many files. Perhaps performance could be improved by consolidating small files into catalogs.
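As a rough sanity check on those timings, dividing the data size by the elapsed time gives an approximate throughput for each of the larger runs. The sketch below only does that arithmetic on the figures quoted above; Throughput and gbPerSecond are illustrative names of my own.

```scala
// Arithmetic on the timings quoted in this post; no Ryft API involved.
object Throughput {

  // (data size in gigabytes, elapsed seconds) for the three larger runs
  val runs: Seq[(Double, Double)] = Seq((48.0, 5.89), (94.0, 12.274), (102.0, 13.0))

  /** Gigabytes scanned per second for one run. */
  def gbPerSecond(sizeGb: Double, seconds: Double): Double = sizeGb / seconds

  def main(args: Array[String]): Unit =
    runs.foreach { case (size, secs) =>
      println(f"$size%.0f GB in $secs%.3f s = ${gbPerSecond(size, secs)}%.2f GB/s")
    }
}
```

All three runs work out to somewhere around 8 gigabytes per second, which suggests that search time grows roughly linearly with data size over this range.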

Zeppelin Editor

You edit the code in the Zeppelin web interface itself, and Zeppelin can hide the code once you have the form fields working. Here is the part of the code that produces the form fields:

val searchFor = z.input("Search String", "mohammad")
val distance = z.input("Search Distance", 2)
var queryType = z.select("Query Type", Seq(("1","Simple"),("2","Hamming"),("3","Edit Distance"))).toString
var surrounding = z.input("Surrounding", "line")
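The form values come back as plain objects, which is why the code in this post calls .toString on them before converting. Conversions such as .toInt and .toByte throw an exception if someone types something unexpected into a field, so it can help to validate the input first. The following is a minimal sketch of one way to do that in plain Scala; FormInput, Surrounding, WholeLine, Width, parseSurrounding and parseDistance are hypothetical names of my own and not part of Zeppelin or the Ryft connector.

```scala
import scala.util.Try

// Hypothetical validation helpers for the form fields; not part of Zeppelin or Ryft.
object FormInput {

  /** How much context to return around a match. */
  sealed trait Surrounding
  case object WholeLine extends Surrounding
  final case class Width(chars: Int) extends Surrounding

  // Form values may be null or non-string objects, so normalise to a trimmed string first.
  private def asText(raw: Any): String = Option(raw).map(_.toString.trim).getOrElse("")

  /** Accepts "line" or a non-negative character count. */
  def parseSurrounding(raw: Any): Either[String, Surrounding] = {
    val text = asText(raw)
    if (text.equalsIgnoreCase("line")) Right(WholeLine)
    else Try(text.toInt).toOption.filter(_ >= 0).map(Width(_))
      .toRight(s"Surrounding must be 'line' or a non-negative number, got '$text'")
  }

  /** Accepts a distance that fits in a non-negative Byte. */
  def parseDistance(raw: Any): Either[String, Byte] =
    Try(asText(raw).toByte).toOption.filter(_ >= 0)
      .toRight(s"Distance must be between 0 and ${Byte.MaxValue}, got '${asText(raw)}'")
}
```

For example, FormInput.parseSurrounding("line") returns Right(WholeLine), FormInput.parseSurrounding("40") returns Right(Width(40)), and FormInput.parseDistance("abc") returns a Left carrying an error message that the note can print instead of failing.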

Putting it all together, we end up with the following code.

%spark
import com.ryft.spark.connector._
import com.ryft.spark.connector.domain.RyftQueryOptions
import com.ryft.spark.connector.query.SimpleQuery
import com.ryft.spark.connector.query.value.{EditValue, HammingValue}
import com.ryft.spark.connector.rdd.RyftRDD
import com.ryft.spark.connector.domain.{fhs, RyftData, RyftQueryOptions}
import scala.language.postfixOps
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import scala.io.Source

def isEmpty(x: String) = x == null || x.isEmpty

// Form fields and default query settings
var queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
val searchFor = z.input("Search String", "mohammad")
val distance = z.input("Search Distance", 2)
var queryType = z.select("Query Type",("2","Hamming"), Seq(("1","Simple"),("2","Hamming"),("3","Edit Distance"))).toString
var surrounding = z.input("Surrounding", "line")
var query = SimpleQuery(searchFor.toString)

if (isEmpty(queryType)) {
  queryType = "2"
}

// Build the query options for the selected query type
if (queryType.toString.toInt == 1) {
  println("simple")
  if (surrounding == "line") {
    queryOptions = RyftQueryOptions("enron_email", "line", 0 toByte)
  } else {
    queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, 0 toByte)
  }
  query = SimpleQuery(searchFor.toString)
} else if (queryType.toString.toInt == 2) {
  println("hamming")
  if (surrounding == "line") {
    queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte, fhs)
  } else {
    queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte, fhs)
  }
} else {
  println("edit")
  if (surrounding == "line") {
    queryOptions = RyftQueryOptions("enron_email", "line", distance.toString.toByte)
  } else {
    queryOptions = RyftQueryOptions("enron_email", surrounding.toString.toInt, distance.toString.toByte)
  }
}

// Run the search
var searchRDD = sc.ryftRDD(Seq(query), queryOptions)
var count = searchRDD.count()
print(s"%html <h2>Count: $count</h2>")

if (count > 0) {
  println(s"Hamming search RDD first: ${searchRDD.first()}")
  println(searchRDD.count())

  // Render the results as an HTML table with a show/hide row for each original file
  print("%html <table>")
  print("<script>")
  println("function showhide(id) { var e = document.getElementById(id); e.style.display = (e.style.display == 'block') ? 'none' : 'block';}")
  print("</script>")
  print("<tr><td>File</td><td>Data</td></tr>")

  searchRDD.asInstanceOf[RyftRDD[RyftData]].collect.foreach { ryftData =>
    print("<tr><td style='width:600px'><a href=javascript:showhide('"+ryftData.file+"')>Show File </a></td>")
    val x = ryftData.data.replace("\n", " ")
    print(s"<td> $x</td></tr>")
    println("<tr id="+ ryftData.file +" style='display:none;'>")
    println("<td style='width:600px'>")

    // Read the original email and wrap the matched span in <b> tags
    val source = Source.fromFile("/home/ryftuser/maildir/"+ryftData.file)
    var theFile = try source.mkString finally source.close()
    var newDoc = ""
    var totalCharCount = 0
    var charCount = 0
    for (c <- theFile) {
      charCount = charCount + 1
      if (totalCharCount + charCount == ryftData.offset) {
        newDoc = newDoc+"<b>"
      } else if (totalCharCount+charCount == ryftData.offset+ryftData.length+1) {
        newDoc = newDoc+"</b>"
      }
      newDoc = newDoc+c
    }
    print(newDoc.replace("\n", "<br>"))
    totalCharCount = totalCharCount + charCount
    println("</td>")
    println("</tr>")
  }
  print("</table>")
}

This should get you started searching data with Zeppelin and Ryft. You can use this interface to experiment with the different edit distances and search queries the Ryft supports. You can also implement additional methods to search by regex, IP addresses, dates and currency.

Please follow us on Facebook and on Twitter at volumeint.