05 Oct 2016

Classifying documents with doc2vec in Python is already easy, but calling doc2vec from Ruby is even easier. Here is an example of preprocessing text data (in this case, man pages) with Ruby and classifying it with doc2vec in Python.

You can call not only Python but also Julia from Ruby. The following is another example, which performs K-means clustering with Julia via Ruby (experimentally ported from the official documentation of Clustering.jl).

VirtualModule, which I've been working on lately, enables you to call arbitrary Python or Julia code from Ruby. Here is yet another example, which performs SVM classification with cross-validation through scikit-learn from Ruby.

It looks perfectly sweet, except that I need to pass the :_ Symbol whenever I call a function that takes no arguments. This is required because Ruby syntax allows methods to be called without parentheses; without the marker, VirtualModule cannot tell whether a message sent to the object is calling a function or accessing an instance variable.
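The ambiguity can be reproduced in plain Ruby. The following is a minimal sketch, not VirtualModule's actual implementation: CallRecorder is a hypothetical class that only records how a message would be interpreted. It shows that `load_iris` and `load_iris()` arrive as the same message, and that a sentinel such as :_ is one way to tell them apart.

```ruby
# Hypothetical recorder (an illustration only, not part of the virtual_module gem).
# It reports how each incoming message would be interpreted by a proxy.
class CallRecorder
  def method_missing(name, *args)
    if args.empty?
      # `obj.name` and `obj.name()` both land here: Ruby does not tell them apart.
      { kind: :attribute, name: name }
    elsif args == [:_]
      # The :_ sentinel marks an intentional zero-argument call.
      { kind: :call, name: name, args: [] }
    else
      { kind: :call, name: name, args: args }
    end
  end
end

recorder = CallRecorder.new
without_parens = recorder.load_iris      # treated as attribute access
with_parens    = recorder.load_iris()    # identical message, same result
with_sentinel  = recorder.load_iris(:_)  # treated as a zero-argument call
with_argument  = recorder.int(2.3)       # ordinary call with one argument
```

Treating a bare message as attribute access is a design choice of this sketch; the point is only that some explicit marker is needed once both meanings must be supported.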

Playing with VirtualModule in REPL

In this section, I'll try my best to show you what is happening inside VirtualModule. I'm assuming the following are already installed on the system:

virtual_module gem (v0.3.0 or higher)

Python or Julia environment

Here we go. First, start irb in the terminal.

debussy:~ remore$ irb -r virtual_module
irb(main):001:0> po = VirtualModule.new(:python=>["sklearn"=>"datasets"])
=> #<Module:0x007fb7e1aee818>

Once you call VirtualModule.new, a Python or Julia process is booted as a background job. If the background job boots successfully, VirtualModule returns a new Module instance, which behaves as a proxy to the background job (in the example above, it is assigned to the local variable po). For simplicity, this document calls it a PO (Proxy Object).

irb(main):002:0> po.int(2.3)
=> 2
irb(main):003:0> po.unknown_method(2.3)
RuntimeError: An error occurred while executing the command in python process: ,name 'unknown_method' is not defined

What a PO does internally is as simple as ABC. In the first example, the PO receives the int(2.3) method call and passes it through to the background process using msgpack. The background job sends the result back, and Fixnum 2 is displayed on the terminal. If an unknown method is called, the background job reports that no such method is defined. Since data transfer is implemented with msgpack, at the moment only the limited set of data types defined by the msgpack spec is converted to Ruby data types.
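The round trip can be sketched in a few lines of Ruby. This is a rough illustration under stated assumptions, not the gem's real code: JSON from the standard library stands in for msgpack, and a hypothetical fake_backend method with a small function table stands in for the Python process. The message layout ("name", "args", "ok", "value", "error") is also invented for this sketch.

```ruby
require "json"

# Stand-in for the background interpreter (assumption: the real one is a
# separate Python or Julia process, not a Ruby hash of lambdas).
BACKEND_FUNCTIONS = {
  "int" => ->(x) { x.to_i },
  "abs" => ->(x) { x.abs }
}.freeze

# Receives a serialized request and returns a serialized response.
def fake_backend(request_json)
  request = JSON.parse(request_json)
  function = BACKEND_FUNCTIONS[request["name"]]
  if function
    JSON.generate("ok" => true, "value" => function.call(*request["args"]))
  else
    JSON.generate("ok" => false, "error" => "name '#{request["name"]}' is not defined")
  end
end

# Proxy that serializes every unknown message and forwards it to the backend.
class ForwardingProxy
  def method_missing(name, *args)
    request = JSON.generate("name" => name.to_s, "args" => args)
    response = JSON.parse(fake_backend(request))
    raise "An error occurred in the backend: #{response["error"]}" unless response["ok"]

    response["value"]
  end
end

proxy = ForwardingProxy.new
converted = proxy.int(2.3) # => 2
```

Calling a name the backend does not know, such as `proxy.unknown_method(2.3)`, raises a RuntimeError in this sketch, which mirrors the behavior seen in the irb session.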

irb(main):004:0> po.datasets
=> #<Module:0x007ffd0906c030>
irb(main):005:0> po.datasets.load_iris(:_)
=> #<Module:0x007ffd09074500>
irb(main):006:0> po.datasets.load_iris(:_).vclass
=> "<class 'sklearn.datasets.base.Bunch'>"
irb(main):007:0> po.datasets.load_iris(:_).data[1].to_a
=> [4.9, 3.0, 1.4, 0.2]

What happens if the data conversion by msgpack fails? The example above shows the behavior when one tries to access a data type that msgpack cannot convert. A PO instance (the local variable po) creates a new PO instance (#<Module:0x007ffd0906c030> etc.) on every method call, until the background process returns a value that msgpack can convert to a Ruby data type.
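The chaining behavior can be imitated with ordinary Ruby objects. The sketch below is only an approximation: FakeDatasets, FakeBunch and FakeNdarray are hypothetical stand-ins for the Python objects, ChainProxy is not the gem's class, and the CONVERTIBLE list is an assumption loosely modeled on the msgpack type system.

```ruby
# Types this sketch treats as convertible (an assumption; the gem's rules may differ).
CONVERTIBLE = [NilClass, TrueClass, FalseClass, Integer, Float, String, Symbol, Array, Hash].freeze

# Hypothetical stand-ins for numpy.ndarray, Bunch and sklearn.datasets.
class FakeNdarray
  def initialize(values)
    @values = values
  end

  def [](index)
    element = @values[index]
    element.is_a?(Array) ? FakeNdarray.new(element) : element
  end

  def to_a
    @values
  end
end

class FakeBunch
  def data
    FakeNdarray.new([[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]])
  end
end

class FakeDatasets
  def load_iris
    FakeBunch.new
  end
end

# Wraps a target object; returns plain values when they are convertible,
# otherwise wraps the result in a new proxy so the chain can continue.
class ChainProxy
  def initialize(target)
    @target = target
  end

  # Class name of the wrapped object, similar in spirit to #vclass.
  def vclass
    @target.class.name
  end

  def method_missing(name, *args)
    args = [] if args == [:_] # :_ marks a zero-argument call
    result = @target.public_send(name, *args)
    convertible?(result) ? result : ChainProxy.new(result)
  end

  def respond_to_missing?(name, include_private = false)
    @target.respond_to?(name) || super
  end

  private

  def convertible?(value)
    CONVERTIBLE.any? { |type| value.is_a?(type) }
  end
end

datasets = ChainProxy.new(FakeDatasets.new)
iris = datasets.load_iris(:_)  # FakeBunch is not convertible, so a new proxy comes back
row = iris.data[1].to_a        # => [4.9, 3.0, 1.4, 0.2]
```

Each intermediate step (`load_iris`, `data`, `[1]`) yields another proxy, and only the final `to_a` returns a plain Ruby Array.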

irb(main):008:0> po.datasets.vclass
=> "<type 'module'>"
irb(main):009:0> iris = po.datasets.load_iris(:_)
=> #<Module:0x007ffd09057568>
irb(main):010:0> iris.target.vclass
=> "<type 'numpy.ndarray'>"
irb(main):011:0> iris.target.vmethods
=> ["T", "__abs__", "__add__", "__and__", "__array__", "__array_finalize__", "__array_interface__", "__array_prepare__", "__array_priority__", "__array_struct__", "__array_wrap__", "__class__", "__contains__", "__copy__", "__deepcopy__", "__delattr__", "__delitem__", "__delslice__", "__div__", "__divmod__", "__doc__", "__eq__", "__float__", "__floordiv__", "__format__", "__ge__", "__getattribute__", "__getitem__", "__getslice__", "__gt__", "__hash__", "__hex__", "__iadd__", "__iand__", "__idiv__", "__ifloordiv__", "__ilshift__", "__imod__", "__imul__", "__index__", "__init__", "__int__", "__invert__", "__ior__", "__ipow__", "__irshift__", "__isub__", "__iter__", "__itruediv__", "__ixor__", "__le__", "__len__", "__long__", "__lshift__", "__lt__", "__mod__", "__mul__", "__ne__", "__neg__", "__new__", "__nonzero__", "__oct__", "__or__", "__pos__", "__pow__", "__radd__", "__rand__", "__rdiv__", "__rdivmod__", "__reduce__", "__reduce_ex__", "__repr__", "__rfloordiv__", "__rlshift__", "__rmod__", "__rmul__", "__ror__", "__rpow__", "__rrshift__", "__rshift__", "__rsub__", "__rtruediv__", "__rxor__", "__setattr__", "__setitem__", "__setslice__", "__setstate__", "__sizeof__", "__str__", "__sub__", "__subclasshook__", "__truediv__", "__xor__", "all", "any", "argmax", "argmin", "argpartition", "argsort", "astype", "base", "byteswap", "choose", "clip", "compress", "conj", "conjugate", "copy", "ctypes", "cumprod", "cumsum", "data", "diagonal", "dot", "dtype", "dump", "dumps", "fill", "flags", "flat", "flatten", "getfield", "imag", "item", "itemset", "itemsize", "max", "mean", "min", "nbytes", "ndim", "newbyteorder", "nonzero", "partition", "prod", "ptp", "put", "ravel", "real", "repeat", "reshape", "resize", "round", "searchsorted", "setfield", "setflags", "shape", "size", "sort", "squeeze", "std", "strides", "sum", "swapaxes", "take", "tobytes", "tofile", "tolist", "tostring", "trace", "transpose", "var", "view"]
irb(main):012:0> iris.target.to_a
=> [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]

Just as Object in Ruby provides #class and #methods to tell you about an object instance, VirtualModule provides the corresponding methods #vclass and #vmethods. #vclass tells you what data type a proxy object refers to, and #vmethods shows a list of the functions you can call. If you'd like to see more examples, please visit my GitHub repo and refer to the other usages. If you have any questions, please feel free to open an issue.

Currently, VirtualModule has numerous drawbacks, as much other OSS software at an early stage does. For example, VirtualModule cannot handle some of Julia's good parts very well, such as multidimensional arrays and functions as first-class objects. I know this series of attempts to call Python and Julia from Ruby is not yet enough to satisfy data scientists' needs, but it has been exciting enough to hold my attention for the last few months. Moreover, in my opinion, both the environment and the situation around scientific computing in Ruby have started to improve these days, as some other sources have also described. If you are interested in what is happening around Ruby these days, I strongly recommend joining the SciRuby community. You will find many other interesting projects and activities there.
