Access 20 Gig or more from LuaJIT while coding in native Lua and minimizing GC speed penalties.

I started using LuaJIT© after first using F#, Python, Julia and C for stock and Forex related predictive work. I am always on the lookout for a language that runs as close to C speed as I can get without having to write low level C all the time.

Lua is a language that feels somewhat like a cross between BASIC and Ruby and has been around for a long time. Lua may be embedded or used stand-alone. It has been embedded into many games, entertainment consoles and other devices as a scripting language. LuaJIT is a just-in-time compiler for Lua that takes what was already a fast interpreted language and, in some of our tests, made it run over 20X faster, with a few tests reaching 80X faster.

LuaJIT seemed like the ideal combination: a language any Ruby or Python programmer would find readable, with fast start-up times, excellent run-time speeds and good error messages.

Almost C Speed but a Memory limit.

I was quite simply amazed at the speed of LuaJIT because it outperformed some of our well optimized Java and F# code. There is a gotcha and it is a big one. Within a couple of weeks of work I ran out of memory, which seemed odd when I could do the same task in Python, F# and Java with no problem. The problem occurred even when the machine had plenty of memory available.

After some Google searches I found there are design limits which prevent the LuaJIT garbage collector and memory allocator from accessing more than 2 Gig of RAM. About 1.8 Gig is where we encountered the problems. In addition, LuaJIT wasn’t real smart about it: instead of stopping and collecting what it could when an allocation was about to fail, it simply aborted, which can be a real pain when an analysis dies 20 minutes into a job.

Mike Pall, the driving force behind LuaJIT, is working on a version 3.0 of the LuaJIT garbage collector, but I needed a solution now or I had no choice but to jump out to C or possibly Go, or fall back to F#.

I already had F# code that ran faster than Java. I wanted to escape the F# .NET on Windows environment, and F# on Mono is very slow. I don’t like where Microsoft was taking F#, or the lack of command line tools that work in isolation from Visual Studio. The worst factor was that F# programmers wanted to use Scala style functional techniques, which took F# from faster than Java to slower than Python, and retraining violated their world view (similar to O.O. purists in the late 90’s). I still feel there is a productivity advantage in high level languages, especially when exploring ideas that may morph many times before you find the perfect, or at least good enough, design. LuaJIT seemed to deliver almost the programming conciseness of Python and F#, but with speed and cross platform portability to boot.

Escape the LuaJIT GC limit

After first discovering the memory limits of LuaJIT and reading the blogs, I thought I had made a mistake porting the code to Lua. After I interacted with the LuaJIT users group, one of its members sent me some code which showed how to store data outside the space managed by the garbage collector but still access the objects almost as if they were native Lua objects.

It turns out the approach he proposed is an elegant way to escape GC hell and has some strategic benefits that are not immediately apparent. These strategic advantages could make LuaJIT an ideal platform for major big data projects. I think it could be better than some common tools because it makes the GC available when needed but allows you to use hand optimized C style structures where it makes sense. This kind of escape could make it competitive even for major projects like Elasticsearch which regularly battle the Java garbage collector.

So far I have been able to access up to 10 Gig on a 64 bit Windows 8.1 machine. If anybody has had a chance to try it on a Linux box with 32 Gig or more, please let me know how far you were able to go.

LuaJIT made it simple to gain the best of the GC and manually managed worlds

The complete sample code below demonstrates creating gigabytes worth of double arrays encoded identically to how they would be in C but accessing them from Lua. Even better, the generated assembler to iterate the array seems very similar to the assembler produced by C.

Download Source as Text file: ffi_non_gc_double_array_v2.lua

-- Declare a C style structure that makes casting easy.
-- (The method table is left empty in this excerpt; the full listing fills it in.)
local DArr = ffi.metatype("struct{uint32_t size; double* a;}", {})
local ptr = ffi.C.malloc(size_in_bytes)  -- allocate memory from C
local tobj = DArr(adj_len, ptr)          -- create a wrapper and cast into double*
-- After this I can access tobj.a as a normal array.
ffi.C.free(tobj.a)  -- return the memory to C
tobj.a = nil        -- makes sure we don't try to re-use it
tobj = nil          -- all cleaned up
-- I could have skipped the structure definition and directly cast the
-- result to a pointer to doubles, but I wanted the size included as part
-- of the core structure.

Using the wrapper is trivially easy

The following example initializes an array of 600,000 doubles, uses it and then deletes it. As you can see, it is about as easy as it could possibly get. This is enabled by a small wrapper written using FFI, which gives access to portions of C libraries from inside Lua. FFI allowed me to call the underlying C malloc and free directly from Lua and cast the results to my desired object type very easily.

local tmparr = double_array(600000, "my test data arr")

The code to free it up again is:

tmparr:done()

The destructor is safe, so if you call :done() a second time it simply returns false.

The code to iterate is native Lua and performs incredibly well.

function avg_arr(tarr, start_ndx, end_ndx)
  local tsum = 0.0
  local num_ele = (end_ndx - start_ndx) + 1
  for x = start_ndx, end_ndx do
    tsum = tsum + tarr[x]
  end
  local avg = tsum / num_ele
  return avg
end

One of the features I don't like about Lua is that if you do not specifically declare something as a local it becomes a global. This can be a nuisance when debugging but there are some tools to help find those issues.
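One simple way to catch accidental globals is a guard on the global table. This is a minimal sketch, not a tool from the LuaJIT distribution: the helper name forbid_new_globals is made up for illustration, and it only traps assignments to names that do not already exist.

```lua
-- Hypothetical helper (name invented for this sketch): once called, any
-- assignment that would create a new global raises an error instead.
local function forbid_new_globals()
  setmetatable(_G, {
    __newindex = function(_, name)
      error("assignment to undeclared global '" .. tostring(name) .. "'", 2)
    end,
  })
end

forbid_new_globals()

-- A properly declared local is unaffected by the guard.
local ok_local = pcall(function()
  local tsum = 0.0
  return tsum
end)

-- A forgotten 'local' now fails loudly instead of silently creating a global.
local ok_global, err = pcall(function()
  tsum_typo = 1.0
end)

print(ok_local, ok_global, err)

-- Remove the guard again so this sketch leaves no side effects behind.
setmetatable(_G, nil)
```

In a real program the guard would be installed after all intentional globals (such as double_array and avg_arr in the listing below) have been defined.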

The Strategic value of Lua FFI for external memory management

The ability to choose when to use the GC and when to step outside the bounds of what it manages is a huge benefit. The ability to access both sets of objects with native Lua at high speed is a huge bonus. Let me explain why in the form of a machine learning (ML) use case.

In machine learning we compute what math geeks call features but you can think of them as a column of numbers such as the SMA(30) which is a moving average across 30 rows of data. These are computed from columns of numbers such as open price, close price, age, weight, etc. Some practitioners call the source data features and computed data facets but the line gets blurry. We then combine these arrays in a wide variety of ways using algorithms like SVM, decision trees, KNN and Bayes.

For 1 minute Forex data a fairly small data set is 250,000 rows with 8 base attributes and generally between 10 and a few hundred derived features. This is a lot of data but you can squeeze it into a 4 Gig memory space if you are very careful. If we have 50 computed features using 64 bit floats then our total memory space is 250,000 rows * (8 base features + 50 computed features) * 8 bytes each = 116 Megabytes. Unfortunately no run-time system is perfectly efficient and this is before we build the model based on these features or use it to predict the future. Our medium data sets are easily 20 times this size. Needless to say 2 Gig and even 4 Gig of RAM doesn’t stretch all that far.
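To make the idea of a computed feature concrete, here is a minimal sketch of a simple moving average over a column held in a native Lua table. The function name sma and the sample prices are made up for illustration, and one assumption is baked in: rows that do not yet have a full window behind them are averaged over however many rows are available.

```lua
-- Simple moving average of `col` over `window` rows.
-- Assumption for this sketch: early rows use a partial window.
local function sma(col, window)
  local out = {}
  local running = 0.0
  for i = 1, #col do
    running = running + col[i]
    if i > window then
      running = running - col[i - window]   -- drop the row leaving the window
    end
    local n = (i < window) and i or window
    out[i] = running / n
  end
  return out
end

local close = {10, 11, 12, 13, 14, 15}       -- illustrative close prices
local sma3 = sma(close, 3)
print(sma3[3], sma3[6])                      --> 11   14
```

Each derived feature is one more column of the same length as the source column, which is why the memory arithmetic in the next paragraph grows so quickly.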
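The arithmetic above is easy to check in a few lines of Lua. The variable names are only for illustration; the numbers are the ones from the paragraph.

```lua
local rows          = 250000
local base_features = 8
local computed      = 50
local bytes_each    = 8     -- one 64 bit float

local small_bytes  = rows * (base_features + computed) * bytes_each
local medium_bytes = small_bytes * 20   -- "easily 20 times this size"

print(string.format("small set : %.0f MB", small_bytes / 1e6))   --> small set : 116 MB
print(string.format("medium set: %.2f GB", medium_bytes / 1e9))  --> medium set: 2.32 GB
```

So the raw columns of a medium data set alone, before any run-time overhead or model data, already land above the roughly 2 Gig ceiling described earlier.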

Once we load the source data and compute the features we use them over and over again to compute intermediate results. Once we have them computed the most critical aspect is very high access speed as we apply them.

Some genetic algorithms will combine features or change the weights of features millions of times and then either rebuild or re-apply the model. The bulk of the run time is generally spent in applying the model in the process of optimization (a form of learning). Some algorithms such as random forests take a lot of processing time to recursively build their models as well. Even when using trees or random forests, choosing the features used in each tree of the forest is an optimization trick that requires consuming the underlying features thousands of times as you rebuild and test different models.

I found it easiest when loading the data to use native Lua tables, which grow automatically as needed but are limited in size by the 2 Gig failure mentioned above. We don’t always know how many rows of data we will be receiving or retaining since we reject some as noise. Once each column (feature) is computed, the only thing that changes is that we add new numbers to the end, which we can easily accommodate by reserving some extra space for new data points. You have to keep the source columns because they need to be referenced as historical data when computing the feature values for new data rows.

In this instance we can use native Lua with the full GC functionality to simplify our load / build loop, and then once we have a column ready we allocate an external C array and copy the data over to it. This moves that column of data outside what the GC must consider and keeps the space managed by the GC smaller, which makes the GC faster.

When I stressed the LuaJIT GC so it had 1.5 Gig of data, the full GC run took between 0.25 and 0.38 seconds even when there were relatively few objects to clean up. When I had only a few megabytes of data managed by the LuaJIT GC it took 0.001 seconds, so there is a huge performance advantage in moving long-lived data outside the purview of the GC. I think this will remain true even with the new and improved GC.

It would be a viable design decision to populate the feature data directly in the external array, but that requires growing and/or copying the data if we need a larger one. This would incur some overhead; it keeps most of the pressure off the LuaJIT GC but would increase the risk of fragmenting the C heap. LuaJIT seems to be good at tail addition with minimal overhead, so I have been building in native tables and then copying over to an external array when complete. If I run short of memory in the native Lua memory space, I may have to change this approach. I will also move the derived ML models to external data structures.
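The build-then-freeze pattern can be sketched in a few lines. This is a stripped down illustration rather than the wrapper from the full listing: it casts the malloc result straight to a double pointer, the column contents are made up, and it needs LuaJIT because it uses the ffi library.

```lua
local ffi = require("ffi")

ffi.cdef[[
void* malloc(size_t size);
void free(void* ptr);
]]

-- Build phase: a native Lua table grows as rows arrive (GC managed).
local column = {}
for row = 1, 1000 do
  column[#column + 1] = row * 1.5        -- illustrative feature values
end

-- Freeze phase: copy into malloc'd memory that the GC never scans.
local n = #column
local raw = ffi.C.malloc(n * ffi.sizeof("double"))
assert(raw ~= nil, "malloc failed")
local ext = ffi.cast("double*", raw)
for i = 1, n do
  ext[i - 1] = column[i]                 -- Lua tables are 1-based, C arrays 0-based
end

column = nil                             -- the table is now garbage
collectgarbage()

local first, last = ext[0], ext[n - 1]
print(first, last)                       --> 1.5   1500

ffi.C.free(raw)                          -- external memory must be freed by hand
```

Note the index shift in the copy loop. The full listing sidesteps it by allocating one extra slot and using indices 1 to n on the C side.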
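A measurement like this can be reproduced with a small timing harness. The sketch below is not the script I used for the numbers above; the object count is arbitrary and the timings will vary by machine, so no expected figures are shown.

```lua
-- Time one full collection cycle; returns CPU seconds.
local function time_full_gc()
  local t0 = os.clock()
  collectgarbage("collect")
  return os.clock() - t0
end

local baseline = time_full_gc()

-- Keep many small objects alive on the Lua heap (count is illustrative).
local live = {}
for i = 1, 200000 do
  live[i] = { i, i + 1.2 }
end

local loaded = time_full_gc()
print(string.format("heap %.1f MB, full GC %.4f s (baseline %.4f s)",
                    collectgarbage("count") / 1024, loaded, baseline))
```

Raising the object count toward the limit of the Lua heap should show the full collection time climbing even though nothing in the table is garbage.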
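For comparison, here is a rough sketch of the alternative: appending directly to an external array and growing it with realloc when it fills up. The buf table and the push function are hypothetical names for this sketch, and doubling the capacity is just one common growth policy that keeps the number of reallocations small. It needs LuaJIT for the ffi library.

```lua
local ffi = require("ffi")

ffi.cdef[[
void* malloc(size_t size);
void* realloc(void* ptr, size_t size);
void free(void* ptr);
]]

local dsize = ffi.sizeof("double")

-- Hypothetical growable buffer: `len` slots used out of `cap` allocated.
local buf = { len = 0, cap = 1024 }
buf.raw = ffi.C.malloc(buf.cap * dsize)
assert(buf.raw ~= nil, "malloc failed")
buf.data = ffi.cast("double*", buf.raw)

local function push(b, value)
  if b.len == b.cap then
    -- Full: double the capacity. realloc may move the block, so the
    -- typed pointer has to be refreshed as well.
    local new_cap = b.cap * 2
    local p = ffi.C.realloc(b.raw, new_cap * dsize)
    assert(p ~= nil, "realloc failed")
    b.raw, b.data, b.cap = p, ffi.cast("double*", p), new_cap
  end
  b.data[b.len] = value      -- zero-based C indexing
  b.len = b.len + 1
end

for i = 1, 5000 do
  push(buf, i)
end

local len, cap, last = buf.len, buf.cap, buf.data[buf.len - 1]
print(len, cap, last)        --> 5000   8192   5000

ffi.C.free(buf.raw)
buf.raw, buf.data = nil, nil
```

Each realloc can copy the whole block and leave a hole behind in the C heap, which is the fragmentation risk mentioned above.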

Java Comparison

If I had this option available in Java I would have used it when optimizing enterprise scale applications. In Java something similar could be done using the JNI interface but this can be a time consuming experience. The normal solution is to allocate a large block of memory in Java and re-use it. The core concept is similar to how I am using these external arrays. The problem is that Java still has a GC and the GC still gets slower as you add more memory, even when the objects are long-lived. This can degrade performance to the point where major Java projects like Elasticsearch have been forced to establish best practices that recommend no more than 32 Gig of memory. I have still seen GC pauses take Elasticsearch to its knees. The .NET garbage collector is pretty good but it still has similar issues even if they manifest under different conditions.

The LuaJIT + FFI approach is superior

I think the external memory concept supported by LuaJIT plus FFI is a superior approach when compared to a fully managed GC environment. I could have used it to build the field caches, bit vectors and query caches in Elasticsearch using external memory, very similar to how I plan to ship features to memory outside the GC. In Java I would also use a rotating set of external character buffers and serialize my search results directly into them, which would take a burden off the GC.

I think Java could benefit from adding support for unmanaged arrays as transparently as LuaJIT + FFI supplies today. If they did, I would be very surprised if Elasticsearch couldn’t triple their load capacity on the same piece of hardware.

Be nice to the GC (garbage collector)

Once everything is ready, predictions are fairly worthless unless they arrive fast enough to apply to the business problem at hand. In the trading arena a few seconds can be a lifetime, so unplanned GC pauses are highly undesirable. One benefit of the LuaJIT FFI external arrays is that I can ship almost everything I am using outside the responsibility of the GC, which gives the GC less work to do and means any pauses are likely to be shorter. If this isn’t good enough, I can go back through the code and only use locals that are re-used. If that still isn’t good enough, I can port the tight loops to C where needed, so that the C code updates one static array with the results of the computation and all Lua has to do is receive the computed results.

Summary

It is still too early to tell if LuaJIT is really ready for prime time but it is showing a lot of promise. I can tell you that I have been able to test sophisticated ideas in LuaJIT as fast as I could have written them in Python, F# or node.js, with better net performance and with readability better than F#. I really like the ability to ship the feature and model data outside the GC. I can also confirm that the same code run in Lua without the JIT is too slow to be viable.

I think LuaJIT has a viable chance of displacing Java in strategic big data ML projects but only if some of the items described in “Notes for the Lua Jit community” are implemented.



Lua Source to manage objects outside the Lua GC space

download source

-- Demonstrates a way to use Lua to manage C memory outside
-- that considered by the GC. This allows nearly normal Lua
-- code to manage larger amounts of memory bypassing the GC.
-- Intended to make it feasible to use the speed of Lua while
-- bypassing the downside normally encountered with GC cleanup.
-- You have to manually call :done() for the objects to free up
-- their memory but that is a pretty low overhead to gain access
-- to more of the memory in a 64 bit machine.
local ffi = require("ffi")
local size_double = ffi.sizeof("double")
local size_char = ffi.sizeof("char")
local size_float = ffi.sizeof("float")
ffi.cdef"void* malloc (size_t size);"
ffi.cdef"void free (void* ptr);"
local chunk_size = 16384
  -- want big enough chunks that we don't fragment the C
  -- memory manager

-- define a structure which contains a size and array of
-- doubles where we dynamically allocate the array using
-- malloc(). Do it this way just in case we want to write C
-- code that needs the size.
local DArr = ffi.metatype(
  -- size, array
  "struct{uint32_t size; double* a;}",
  -- add some methods to our array
  { __index = {

      -- free the external buffer. Returns true the first time and
      -- false if the buffer has already been released.
      done = function(self)
        if self.size == 0 then
          return false
        else
          ffi.C.free(self.a)
          self.a = nil
          self.size = 0
          return true
        end
      end,

      -- copy data elements into our externally managed array from the
      -- supplied src array. Start copying at src[beg_ndx], stop copying
      -- at src[end_ndx], copy into our array starting at self.a[dest_offset]
      copy_in = function(self, src, beg_ndx, end_ndx, dest_offset)
        -- Cannot use memcpy because the source is likely
        -- a native lua array.
        print ("self=", self, " beg_ndx=", beg_ndx, " end_ndx=", end_ndx,
               "dest_offset=", dest_offset)
        local mydata = self.a
        local dest_ndx = dest_offset
        for src_ndx = beg_ndx, end_ndx do
          mydata[dest_ndx] = src[src_ndx]
          dest_ndx = dest_ndx + 1
        end
      end,

      -- copy data elements out of our externally managed array to another
      -- array. Start copying at self.a[beg_ndx], stop copying at self.a[end_ndx]
      -- place elements in dest starting at dest[dest_offset] and working up.
      copy_out = function(self, dest, beg_ndx, end_ndx, dest_offset)
        -- Cannot use memcpy because the dest is likely
        -- a native lua array.
        local mydata = self.a
        local dest_ndx = dest_offset
        for ndx = beg_ndx, end_ndx do
          dest[dest_ndx] = mydata[ndx]
          dest_ndx = dest_ndx + 1
        end
      end,

      -- return true if I still have a valid data pointer.
      -- return false if I have already been destroyed.
      is_valid = function(self)
        print("is_valid() size=", self.size, " self.a=", self.a, " self=", self)
        return self.size ~= 0 and self.a ~= nil
      end,

      -- set every element from start_ndx to end_ndx to anum.
      -- Defaults to the whole buffer, indices 0 .. size - 1.
      fill = function(self, anum, start_ndx, end_ndx)
        if end_ndx == nil then end_ndx = self.size - 1 end
        if start_ndx == nil then start_ndx = 0 end
        local mydata = self.a
        for ndx = start_ndx, end_ndx do
          mydata[ndx] = anum
        end
      end, -- func fill
    },
    -- safety net: release the buffer if the wrapper itself is collected
    __gc = function(self) self:done() end
  }
) -- end DArr()

-------------------
--- Constructor for DArr
------------------
function double_array(leng)
  -- allocate the actual dynamic buffer, rounded up to a whole
  -- number of chunks with room for index leng.
  local size_in_bytes = (leng + 1) * size_double
  local adj_bytes = (math.floor(size_in_bytes / chunk_size) + 1) * chunk_size
  local adj_len = math.floor(adj_bytes / size_double)
  local ptr = ffi.C.malloc(adj_bytes)
  if ptr == nil then
    return nil
  end
  return DArr(adj_len, ptr)
end

function avg_arr(tarr, start_ndx, end_ndx)
  local tsum = 0.0
  local num_ele = (end_ndx - start_ndx) + 1
  for x = start_ndx, end_ndx do
    tsum = tsum + tarr[x]
    --print ("tarr[x]=", tarr[x])
  end
  local avg = tsum / num_ele
  return avg
end

Test Function for DArr Lib

function use_up_memory(targetMeg)
  local tout = {}
  while collectgarbage('count') / 1024 < targetMeg do
    local tx = {}
    tout[#tout+1] = tx
    for i = 1, 100000 do
      tx[i] = i + 1.2
    end
  end
  return tout
end

function basic_test()
  -- uncomment call to waste space to see how the Lua GC interacts
  -- with the external malloc()
  local waste_space = use_up_memory(100)
  -- Change this to 1500 to run Lua close to the limit of its internal GC.
  -- Change to 1000 to use up 1 gig which will cause some of the malloc() to fail.
  local num_ele = 75000
  local tmparr = double_array(num_ele)
  -- Note: Each 75000 array should occupy roughly 600K of RAM.
  -- plus the overhead for our label, size, counter and containing table.

  -- put something interesting into our native lua object
  local nla = {}
  for x = 1, num_ele do
    nla[x] = x
  end
  local nlaavg = avg_arr(nla, 1, num_ele)
  print ("nlaavg = ", nlaavg)

  -- put something interesting into our external object
  local ptr = tmparr.a
  for x = 1, num_ele do
    ptr[x] = x
  end
  local darravg = avg_arr(ptr, 1, num_ele)
  print("darravg=", darravg)
  assert(nlaavg == darravg,
    "Expected average of nla and dla to be identical and they are not")

  -- Demonstrate copying a portion of the lua array into the external buffer
  -- copies elements 100 to 250 into external array into positions
  -- 320 .. 470.
  tmparr:copy_in(nla, 100, 250, 320)
  print ("ptr[320]=", ptr[320])
  assert(ptr[320] == nla[100], "expected ptr[320] to contain " .. nla[100]
    .. " after the copy_in but received " .. ptr[320])

  -- Demonstrate copying a portion back to native lua array
  -- copies elements 74,950 to 75,000 into dest nla positions
  -- 1 to 51.
  local sbeg = num_ele - 50
  local send = num_ele
  tmparr:copy_out(nla, sbeg, send, 1)
  print("nla[1] = ", nla[1])
  assert(nla[1] == tmparr.a[sbeg], "expected nla[1] to contain "
    .. tmparr.a[sbeg] .. " after copy_out but got " .. nla[1])

  print ("pre destroy is_valid=", tmparr:is_valid())
  assert(tmparr:is_valid() == true,
    "tmparr:is_valid() should be true before the delete")

  local dres = tmparr:done() -- destroy and release our memory
  assert(dres == true, "First destroy failed")
  print "successful destroy"
  print ("post destroy is_valid=", tmparr:is_valid())
  assert(tmparr:is_valid() == false,
    "tmparr:is_valid() should be false after done()")

  print "try to do second destroy which should not work"
  --- see what happens when we destroy it a second time
  local qres = tmparr:done()
  assert(qres == false, "Second delete should have failed")
  print "second destroy failed as planned"

  -- Lets see how much memory we are using when we start.
  print("using ", collectgarbage('count') / 1024, " meg before GC")
  collectgarbage()
  print("using ", collectgarbage('count') / 1024, " meg after GC")
  print("Lua GC will not show the externally allocated buffers")
  -- TODO: Figure out FFI Call to get total process memory from Windows.

  -- for num_pass = 1,50 do
  -- Now try to create enough of the external arrays that it would normally
  -- crash Lua.
  local target_num = 3000000000 -- 3.0 gig
  local array_size_bytes = num_ele * size_double
  local num_arr_to_create = math.floor(target_num / array_size_bytes) + 1
  print ("attempting to create ", num_arr_to_create, " arrays each ",
         array_size_bytes, " bytes in size")
  local tholder = {}
  for andx = 1, num_arr_to_create do
    local da = double_array(num_ele)
    if da == nil then
      local mba = (andx * array_size_bytes) / 1000000
      local lua_mem = collectgarbage('count') / 1024
      lua_mem = math.floor(lua_mem * 100) / 100
      local tot_meg = mba + lua_mem
      print ("failed to create andx=", andx, " mb attempt=", mba,
             " total Meg Used=", tot_meg)
    else
      --print ("create andx=", andx, " da=", da, " da.a=", da.a)
      da:fill(andx)
      tholder[andx] = da
    end
  end
  print "finished create"
  print("using ", collectgarbage('count') / 1024, " meg before GC")
  collectgarbage()
  print("using ", collectgarbage('count') / 1024, " meg after GC")

  print "dbl check access every element dbl avg"
  for andx = 1, num_arr_to_create do
    local da = tholder[andx]
    if da ~= nil then
      local darravg = avg_arr(da.a, 1, num_ele)
      --print("andx=", andx, " da=", da, "da.a=",
      --  da.a, " darravg = " , darravg)
      local calcavg = andx
        -- we know we filled each array with its index
        -- value so that is what the average should be.
      local roundavg = math.floor(darravg * 1000) / 1000
        -- have to round because of accumulated floating
        -- point error
      assert(roundavg == calcavg, " avg failed expected " .. calcavg
        .. " got " .. roundavg .. " ndx=" .. andx)
    end
  end
  print "Finished access check"
  print ("using ", collectgarbage('count') / 1024, " meg before GC")
  collectgarbage()
  print ("using ", collectgarbage('count') / 1024, " meg after GC")

  print "Start destroying our external arrays"
  for andx = 1, num_arr_to_create do
    local da = tholder[andx]
    if da ~= nil then
      --print ("destroy andx=", andx, " da=", da, " da.a=", da.a)
      local deleteok = da:done()
      assert(deleteok == true, " delete failed array #" .. andx)
    end
  end
  tholder = nil
  print "Finished destroy"
  print ("using ", collectgarbage('count') / 1024, " meg before GC")
  collectgarbage()
  print ("using ", collectgarbage('count') / 1024, " meg after GC")
  print (" We expect the GC to reclaim some here because"
    .. " we free up the space in our container array")
  -- end
end -- func

------------------------
---- MAIN ----------
------------------------
if arg[0] == "ffi_non_gc_double_array_v2.lua" then
  basic_test()
end

Caveats

For this process to access memory past 4 Gig it must be built in 64 bit mode. You can tell whether it is running in 64 bit or 32 bit mode by starting a copy of LuaJIT and looking at it in the task manager. If it shows a 32 next to it then you are going to be limited to either 2 Gig or 4 Gig of total process space depending on the compiler or OS. I never did get a successful 64 bit build using Cygwin and the native mingw32-make, but after considerable frustration I was able to get a 64 bit build using msvcbuild.
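The same check can be made from inside LuaJIT itself rather than from the task manager. A minimal sketch using the built-in jit and ffi libraries (the output depends on your build, so none is shown):

```lua
local ffi = require("ffi")

-- ffi.abi("64bit") is true only when this LuaJIT binary was built for 64 bit.
local is_64bit = ffi.abi("64bit")

print("LuaJIT version :", jit.version)
print("architecture   :", jit.arch)
print("operating sys  :", jit.os)
print("64 bit build   :", is_64bit)
print("pointer size   :", ffi.sizeof("void*"), "bytes")
```

A pointer size of 4 bytes means a 32 bit build, and the external arrays will hit the process limit no matter how much RAM the machine has.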

Instructions for getting 64 Bit LuaJit to build on Visual Studio 2012 Express Desktop on 64 Bit Windows 8.1

The command line SDK was failing to install on my machine. I had Visual Studio 2012 Express Web previously installed, but it didn’t include the C compiler, so I also installed Visual Studio 2012 Express Desktop.

Microsoft took away the setenv command for 64 bit on Windows 8.1, but they replaced it with a built-in command called “VS2012 X64 Cross Tools Command Prompt”.

Unfortunately Microsoft broke their own tool so it came up with the error “\Microsoft was unexpected at this time” See: http://www.blinnov.com/en/2010/06/04/microsoft-was-unexpected-at-this-time/

They also got the path wrong for the location of a couple dependencies so I added these to the front of the default system path.

After some spelunking I did the following:

Removed the double quotes (”) surrounding any Visual Studio component on the Path statement.

Added the following to the front of the system PATH using Control Panel -> System Properties -> Advanced -> Environment variables

C:\VisualStudio11\VC\bin\x86_amd64;C:\VisualStudio11\VC;C:\VisualStudio11\Common7\IDE;

Once I did this I restarted the VS2012 X64 Cross Tools Command Prompt and ran msvcbuild.bat. It worked great and the new version passed the 4 Gig boundary just fine.

Note: On my machine Visual Studio is installed in c:\VisualStudio11. If yours is installed elsewhere you will have to modify the path accordingly.

Side Note: The build I did for 64 bit Vista on an i7 didn’t work when copied directly to 64 bit Windows 8 running on an i7. This is a little worrisome because every other program I copied between these machines ran without problem. There is an underlying brittleness that will need to be resolved before LuaJIT can hit the mainstream.

Notes for the Lua Jit community

I think the community needs to work to improve the standard build so you get a version with most batteries included, such as sockets, and to provide full build support on every major Windows and Linux platform that includes true 64 bit capabilities.

We need libraries that extend what I have shown here so we could move some of these external arrays into CUDA memory on GPU cards and it would be ideal if we could ship the basic compiled code down to the CUDA code from Lua at the same time.

My GPU cards have 2 Gig built in, so care will have to be used to design our models to fit into the CUDA memory if we want to maximize the speed advantages from the GPUs in the CUDA cards.

Perhaps most important, they need to improve their outreach and start recruiting more engineers and users out of the big data and analytic space. This means time invested in examples that can be directly applied without becoming FFI and 64 bit build experts. It only took a couple of hours to write a pretty good CSV parser similar to the R data frame, but that really should be part of the base package. It must be part of the base package if they want to attract the big data projects that have large budgets. One thing Guido did right with Python is that very early in the project he produced small samples that showed real life use of every Python function. Lua in general and the LuaJIT community in particular need to copy some of his best practices.

I understand and agree with the tight LuaJIT build philosophy due to its roots as an embedded environment. I almost think there are two discrete audiences that have different requirements. I suspect the big data audience will spend more money faster, so perhaps they need a second build focused there. It seems like the tightly focused product is a subset of the larger product, so both could be handled with two build scripts without violating the original philosophy.

The LuaJIT developers need some organization that allows them to collect money to spend on future enhancements so they can keep more of their senior contributors focused on the important technical improvements.

It is kind of scary when the main driver is distracted by unrelated projects. I understand the need for day to day income but this feels risky. It is probably one of the greater risks of adopting the platform. I would rather see him well compensated but dependent on the success and widespread adoption of LuaJIT.

If some of these things happen then I think LuaJIT has a viable chance of displacing Java in strategic big data ML projects.

Disclaimer

I am an expert engineer and distributed architect but have only been using Lua for about a month. There are very likely ways to dramatically improve this code. Please leave a comment or send me a private note about your suggested enhancements. The best way to learn any language is from other experts.

* LuaJIT is Copyright © 2005-2014 Mike Pall, released under the MIT open source license.

The prior version didn’t use the same meta programming style. I am still not sure if one is really better than the other but I think the meta style is a little cleaner. Download prior version.

It is possible to divide the data set and allow a set of processes to work on different pieces. This is particularly effective when computing expensive features. It can overcome the 2 Gig limit simply by shaping the data to stay under that limit. The downside is that it requires design effort to move the data between processes and merge the results. This process is made easier by Hadoop. I love Hadoop but if you can do the same job with less code, less complex logic, less hardware and equally fast then Hadoop is not always the right answer. Needless to say a 20 core Hadoop cluster costs more than a 4 core server so there is always a ROI tradeoff.
