My pals generally act impressed when I show them my noodlings in the J language. I’m pretty sure they’re impressed with J’s speed and power, because it is inarguably fast and powerful, but I’ve also always figured they saw it more as an exercise in obfuscated coding; philistines! While I can generally read my own J code, I must confess that some of the denser tacit style isn’t something I can read naturally without J’s code dissector. I have also been at it for a while, and for a long time went on faith that this skill would come. Notation as a tool of thought is one of the most powerful ideas I’ve come across. The problem is talking people into adopting your notation. Building important pieces of your company around a difficult mathematical notation is a gamble most companies are not willing to take.

Everyone knows about Arthur Whitney and K because of Kx Systems’ database, KDB. Having fiddled around with KDB, and with Eric Iverson and J Software’s Jd, the mind-boggling power of these things on time series and data problems in general makes me wonder why everyone doesn’t use them. Then I remember the first time I looked at things like this:

wavg:{(+/x*y)%+/x}   // K version
wavg=: +/ .* % +/@]  NB. J version

Oh yeah, that’s why J and K adoption are not universal. I mean, I can read it. That doesn’t mean everyone can read it. And I certainly can understand people’s reluctance to learn how to read things like this. It’s not easy.
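For the record, here is what those one-liners actually compute, sketched in plain Python with NumPy (not K, J, or Kerf — just an illustration of the arithmetic):

```python
import numpy as np

def wavg(x, y):
    """Weighted average of y with weights x: sum(x*y) / sum(x).
    x * y is elementwise, so no explicit loop is needed."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return (x * y).sum() / x.sum()

print(wavg([1, 2, 3], [10, 20, 30]))  # (10 + 40 + 90) / 6 ≈ 23.333
```

The point of the APL-family versions isn’t the arithmetic, of course; it’s that the whole definition fits in a dozen characters.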

For the last year and a half, my partner Kevin Lawler has been trying to fix this problem. You may know of him as the author of Kona, the open source version of K3. Kevin’s latest creation is Kerf. Kerf is basically an APL that humans can read, along with one of the highest performance time series databases money can buy. I liked it so much, I quit my interesting and promising day job doing Topological Data Analysis at Ayasdi, and will be dedicating the next few years of my life to this technology.

We know the above code fragments are weighted averages, but mostly because that’s what they’re called in the verb definitions. Mischievous programmers (the types who write code in K and J) might have called them d17 or something. Kerf looks a lot more familiar.

function wavg(x,y) { sum(x*y) / sum(x) }

This is cheating a bit, since K and J don’t have a sum primitive, but it begins to show the utility of organizing your code in a more familiar way. Notice that x * y is done vector-wise; no stinking loops necessary. Expressing the same thing in more primitive Kerf functions looks like this:

function wavg(x,y) { (+ fold x*y) / (+ fold x) }

In J and K, the ‘/’ adverb sticks the verb on its left between all the elements on its right. In Kerf, we call that operation “fold” (we also call adverbs “combinators,” which we think is more descriptive of what they do in Kerf; I think John Earnest came up with the term).

You could also write the whole thing out in terms of for loops if you wanted to, but fold is easier to write, easier to read, and runs faster.
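If fold is unfamiliar, it’s the same idea as Python’s functools.reduce — stick an operator between all the elements of a list. A rough sketch of the Kerf version above, in Python:

```python
from functools import reduce
import operator

def fold(op, xs):
    """Like Kerf's 'op fold xs' or J/K's '/': op between all elements."""
    return reduce(op, xs)

print(fold(operator.add, [1, 2, 3, 4]))  # 10

# wavg written in the same "fold" style as the Kerf definition:
def wavg(x, y):
    return fold(operator.add, [a * b for a, b in zip(x, y)]) / fold(operator.add, x)

print(wavg([1, 2, 3], [10, 20, 30]))
```

The explicit for-loop version would be several lines longer and, in an array language, slower.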

There are a few surprises with Kerf. One is the assignment operator.

a: range(5);
b: repeat(5,1);

KeRF> b
b
[1, 1, 1, 1, 1]

KeRF> a
a
[0, 1, 2, 3, 4]

Seems odd. On the other hand, it looks a lot like json. In fact, you can compose things into a map in a very json-like syntax:

aa:{a: 1 2 3, b:'a bit of data', c:range(10)};

KeRF> aa['a']
aa['a']
[1, 2, 3]

KeRF> aa
aa
{a:[1, 2, 3], b:"a bit of data", c:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}

This seems like syntactic sugar, but it actually helps. For example, if I have to feed the variable ‘aa’ to something that likes to digest json representations of data, it pops out as ascii json:

json_from_kerf(aa)
"{\"a\":[1,2,3],\"b\":\"a bit of data\",\"c\":[0,1,2,3,4,5,6,7,8,9]}"
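Since Kerf maps are essentially json objects, the same round-trip in Python is just a dict and json.dumps — which gives a feel for what json_from_kerf is doing:

```python
import json

# A Python dict mimicking the Kerf map 'aa' above:
aa = {"a": [1, 2, 3], "b": "a bit of data", "c": list(range(10))}

# Compact separators reproduce the whitespace-free output shown above.
print(json.dumps(aa, separators=(",", ":")))
# {"a":[1,2,3],"b":"a bit of data","c":[0,1,2,3,4,5,6,7,8,9]}
```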

OK, no big deal; a language with some APL qualities that speaks json. This is pretty good, but we’d be crazy to attempt to charge money for something like this (Kerf is not open source; Kevin and I have to eat). The core technology is a clean APL that speaks json, but the thing that is worth something is the database engine. Tables in Kerf look like interned maps and are queried in the usual SQL way.

u: {{numbers: 19 17 32 8 2 -1 7, strings: ["A","B","C","D","H","B","Q"]}}

select * from u where numbers>18
┌───────┬───────┐
│numbers│strings│
├───────┼───────┤
│     19│      A│
│     32│      C│
└───────┴───────┘

select numbers from u where strings="B"
┌───────┐
│numbers│
├───────┤
│     17│
│     -1│
└───────┘
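For readers who live in the pandas world, here are the same two queries against the same little table — this is an analogy, not Kerf’s engine, which stores and scans columns natively:

```python
import pandas as pd

# Same table as the Kerf example above.
u = pd.DataFrame({"numbers": [19, 17, 32, 8, 2, -1, 7],
                  "strings": list("ABCDHBQ")})

# select * from u where numbers > 18
print(u[u.numbers > 18])

# select numbers from u where strings = "B"
print(u.loc[u.strings == "B", ["numbers"]])
```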

Now the business with ‘:’ starts to make more sense. Since SQL is part of the language, the ‘=’ sign is busy doing equality tests rather than assignment. Your eyes don’t have to pick out contextual differences or distinguish ‘==’ from ‘=’: everything with an ‘=’ is an equality test, and everything with a ‘:’ is binding a name.

Standard joins are available with left join:

v:{{a: 1 2 2 3, numbers: 19 17 1 99}}

left_join(v,u,"numbers")
┌─┬───────┬───────┐
│a│numbers│strings│
├─┼───────┼───────┤
│1│     19│      A│
│2│     17│      B│
│2│      1│   null│
│3│     99│   null│
└─┴───────┴───────┘
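Same semantics as a pandas left merge, shown here for comparison — keys in v with no match in u (the 1 and the 99) come back null, just as in the Kerf output:

```python
import pandas as pd

u = pd.DataFrame({"numbers": [19, 17, 32, 8, 2, -1, 7],
                  "strings": list("ABCDHBQ")})
v = pd.DataFrame({"a": [1, 2, 2, 3], "numbers": [19, 17, 1, 99]})

# left_join(v, u, "numbers") in Kerf:
print(v.merge(u, on="numbers", how="left"))
```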

For time series, a good time type is important: preferably first class, and with the ability to look at nanoseconds. So are as-of joins.

qq:{{nums: range(10), date: 1999.01.01+ (24 * 3600000000000) * range(10), strg:["a","b","c","d","e","f","g","h","i","j"]}}
vv:{{nums: 10+range(10), date: 1999.01.01+ (12 * 3600000000000) * range(10), strg:["a","b","c","d","e","f","g","h","i","j"]}}

select nums,nums1,mavg(3,nums1),strg,strg1,date from asof_join(vv,qq,[],"date")
┌────┬─────┬────────┬────┬─────┬───────────────────────┐
│nums│nums1│nums11  │strg│strg1│date                   │
├────┼─────┼────────┼────┼─────┼───────────────────────┤
│  10│    0│     0.0│   a│    a│             1999.01.01│
│  11│    0│     0.0│   b│    a│1999.01.01T12:00:00.000│
│  12│    1│0.333333│   c│    b│             1999.01.02│
│  13│    1│0.666667│   d│    b│1999.01.02T12:00:00.000│
│  14│    2│ 1.33333│   e│    c│             1999.01.03│
│  15│    2│ 1.66667│   f│    c│1999.01.03T12:00:00.000│
│  16│    3│ 2.33333│   g│    d│             1999.01.04│
│  17│    3│ 2.66667│   h│    d│1999.01.04T12:00:00.000│
│   ⋮│    ⋮│       ⋮│   ⋮│    ⋮│                      ⋮│
└────┴─────┴────────┴────┴─────┴───────────────────────┘
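An as-of join matches each row of the left table with the most recent row of the right table at or before its timestamp. A simplified analogue of the asof_join above (nums and dates only, omitting the strings and the moving average), using pandas merge_asof:

```python
import pandas as pd

# qq: one row per day; vv: one row every 12 hours, like the Kerf example.
qq = pd.DataFrame({"nums": range(10),
                   "date": pd.date_range("1999-01-01", periods=10, freq="D")})
vv = pd.DataFrame({"nums": [10 + i for i in range(10)],
                   "date": pd.date_range("1999-01-01", periods=10, freq="12h")})

# Each vv row picks up the latest qq row at or before its date;
# suffixes mimic Kerf's nums/nums1 column naming.
print(pd.merge_asof(vv, qq, on="date", suffixes=("", "1")))
```

Each daily qq value gets matched twice (once at midnight, once at noon), reproducing the 0, 0, 1, 1, 2, 2, … pattern in the nums1 column of the Kerf output.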

Kerf is still young and occasionally rough around the edges, but it is quite useful as it exists now; our customers and partners think so, anyway. The only things comparable to it from an engineering standpoint are the other APL-based databases, such as KDB and Jd. We think we have some obvious advantages in usability, and less obvious advantages in the intestines of Kerf. Columnar databases like Vertica and Redshift are great for some kinds of problems, but they don’t really compare: they can’t be extended the way Kerf can, nor are they general-purpose programming systems, which Kerf is.

We also have a lot of crazy ideas for building out Kerf as a large scale distributed analytics system. Kerf is already a suitable terascale database system; we think we could usefully expand out to hundreds of terabytes on data which isn’t inherently time oriented if someone needs such a thing. There is no reason for things like Hadoop and Spark to form the basis of large scale analytic platforms; people simply don’t know any better and make do with junk that doesn’t really work right, because it is already there.

You can download a time-limited version of Kerf from github here.

John Earnest has been doing some great work on the documentation as well.

I’ve also set up a rudimentary way to work with Kerf in emacs.

Also, for quick and dirty exposition of the core functions: a two-page refcard.

Keep up with Kerf at our company website:

www.kerfsoftware.com

Kerf official blog:

getkerf.wordpress.com

A visionary post outlining a sort of “Cloud Kerf”:

http://conceptualorigami.blogspot.com/2010/12/vector-processing-languages-future-of.html