Visualization as an aid for porting a Ruby application to Haskell

At Mpowered we are working on gradually porting a sizable Rails web application to Haskell. The application is a toolkit for B-BBEE scoring in which our customer companies can upload their data and keep track of their scorecards.

In this post, we will outline a major difficulty in translating object-oriented Ruby to Haskell, and present a method for aiding the translation by visualizing the call graph of the Ruby code. In particular, we will focus on identifying calls that require dispatching to select the method. There are two problems with dispatching:

- It happens implicitly in Ruby, and can easily be missed.
- Each dispatch requires a corresponding explicit dispatch in Haskell.

All code associated with this post is found here.

Porting the scoring engine

The scoring engine is one central part of the application that we are currently porting to Haskell. Our aim is to first migrate the code without trying to be too clever (so as not to introduce bugs in the process), and only then work on refactoring and improvements.

The score calculations themselves are not terribly complicated. The main complication comes from the fact that there are many rules and many data sources, and worse – slightly differing sets of rules for different industry sectors. Scorecards for different sectors are typically derived from a standard scorecard by making various alterations and additions. They can also be derived from other derived scorecards. Such hierarchical derivations lend themselves to an object-oriented design where scorecard derivation is expressed as sub-classing.

Most of the Ruby code involves methods calling other methods and fetching data from a database. Here is a simple example of what it might look like:

    def total_spending
      spending_in_category_A + spending_in_category_B
    end

Such definitions can often be translated in a very direct way to Haskell:

    totalSpending :: Haxl Number
    totalSpending = spendingInCategoryA + spendingInCategoryB

(As an aside, note that we are using Haxl to handle the data sources.)

The problem comes when there are multiple versions of a method and Ruby uses method dispatch to select which one to call. The following example illustrates the problem:

    class C1
      def a
        b+1
      end

      def b
        100
      end
    end

    class C2 < C1
      def b
        200
      end
    end

The call to b from a (in the expression b+1) may refer to b in either class C1 or class C2, depending on the class of the object on which we call a:

    > C1.new.a
    => 101
    > C2.new.a
    => 201

Hence, the above example requires additional machinery in Haskell. We need to somehow pass information about the context to a to tell it which b to call. This can be done, for example, with an explicit parameter and a case expression around the call to b, or perhaps more implicitly using type classes. In any case, we need to be aware of the places where dispatch is needed. In a larger program it can be next to impossible to know where these places are, and checking each method call manually is extremely tedious.
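To make the manual check concrete, here is a minimal sketch of how one could ask Ruby which binding of b a given receiver resolves to, using the standard reflection calls Object#method and Method#owner. The classes are the toy example from above; the loop is only for illustration and is not part of the code associated with this post.

```ruby
# Toy classes from the example above.
class C1
  def a
    b + 1
  end

  def b
    100
  end
end

class C2 < C1
  def b
    200
  end
end

# For each class, ask which class owns the `b` that `a` will end up calling.
[C1, C2].each do |klass|
  obj = klass.new
  owner = obj.method(:b).owner
  puts "#{klass}#a calls b defined in #{owner}, result #{obj.a}"
end
```

Doing this by hand for every call site in a large application is exactly the tedious work that the tracing approach is meant to replace.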

Visualizing the call graph

We will now show a technique to visualize the call graph of a Ruby program. This can be a very useful aid in translating the code to Haskell, and, specifically, will allow us to identify the situations that require method dispatching in Haskell.

Extracting the call graph is done in several steps:

1. Run the Ruby code with tracing to extract a call trace.
2. Convert the trace to a call graph:
   - Parse the call trace and build a graph.
   - Identify calls that require dispatch.
3. Visualize/export the graph.

Step 1 will be explained further below. This part of the process is somewhat ad hoc, and needs to be adapted to the application in question.

Steps 2 and 3 are done by this Haskell program.

An optional fourth step is to import the graph into the Neo4j graph database. This allows visualization and querying of very large graphs.

For demonstration we will use a toy application: scholarship.rb. It calculates scholarships for university students. The program itself is not the point, so we will not explain it in detail. We can just note that it has certain similarities to the score calculations we are working with at Mpowered.

There is another reason for not explaining the program: The whole point is to treat it as something of a black box! We want to be able to port Ruby methods to Haskell without having to have a full mental model of the application’s control flow.

Tracing Ruby

scholarship.rb provides a list of students that we can use for testing. The root of the call graph is going to be the method awarded_scholarship_amount. There is nothing stopping us from having several root calls, but in this particular program that method is the one that calls all others.

The following program produces a call trace for the awarded_scholarship_amount method for each student:

    require 'scholarship'
    include TestValues

    trace = []

    students.each do |student|
      trace << ("#### " + student.enrolled_in_program.class.name)
      set_trace_func proc { |event, _, line, method, _, classname|
        if (event == "call" || event == "return")
          trace << sprintf("%s,%d,%s,%s", event, line, method, classname)
        end
      }
      student.awarded_scholarship_amount
      set_trace_func(nil)
    end

    puts trace.join("\n")

(The tracer script is found in tracer.rb.)

Ruby’s set_trace_func is used to set up tracing right before the method call. After the call, tracing is turned off using set_trace_func(nil).

The handler passed to set_trace_func receives events from the Ruby interpreter kernel. We are only interested in the events “call” and “return”, as those are enough to extract a call graph.
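As a self-contained illustration of the same idea, the sketch below traces the toy classes C1 and C2 from earlier instead of scholarship.rb. Note that it uses TracePoint, Ruby's newer interface to the same interpreter events, rather than set_trace_func; that is a substitution made for this example, not what tracer.rb does. The helper name trace_calls is made up here.

```ruby
# Toy classes from earlier in the post.
class C1
  def a
    b + 1
  end

  def b
    100
  end
end

class C2 < C1
  def b
    200
  end
end

# Hypothetical helper: runs the given block with tracing enabled and returns
# one "event,line,method,class" string per Ruby-level call or return.
def trace_calls
  lines = []
  tracer = TracePoint.new(:call, :return) do |tp|
    lines << format("%s,%d,%s,%s", tp.event, tp.lineno, tp.method_id, tp.defined_class)
  end
  tracer.enable { yield }
  lines
end

trace = []
[C1.new, C2.new].each do |obj|
  # The header declares the context; here it is simply the receiver's class.
  trace << "#### #{obj.class.name}"
  trace.concat(trace_calls { obj.a })
end

puts trace.join("\n")
```

The class field is the class in which the method is defined, so the C2 section reports a as belonging to C1 but b as belonging to C2. That is the distinction the call graph relies on.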

The trace looks like this (shortened):

    #### UniversityProgram
    call,68,awarded_scholarship_amount,Student
    call,33,awarded_scholarship_amount,UniversityProgram
    call,26,qualifies_for_scholarship?,UniversityProgram
    call,14,has_scholarship?,UniversityProgram
    return,16,has_scholarship?,UniversityProgram
    call,22,points_required_for_scholarship,UniversityProgram
    call,10,course_points_per_year,UniversityProgram
    return,12,course_points_per_year,UniversityProgram
    call,18,requirement_for_scholarship,UniversityProgram
    return,20,requirement_for_scholarship,UniversityProgram
    return,24,points_required_for_scholarship,UniversityProgram
    return,29,qualifies_for_scholarship?,UniversityProgram
    return,39,awarded_scholarship_amount,UniversityProgram
    return,70,awarded_scholarship_amount,Student
    ...
    #### FancyProgram
    call,68,awarded_scholarship_amount,Student
    call,44,awarded_scholarship_amount,FancyProgram
    call,10,course_points_per_year,UniversityProgram
    return,12,course_points_per_year,UniversityProgram
    call,18,requirement_for_scholarship,UniversityProgram
    return,20,requirement_for_scholarship,UniversityProgram
    return,48,awarded_scholarship_amount,FancyProgram
    return,70,awarded_scholarship_amount,Student
    ...

The header lines (starting with “####”) declare the context of each trace. This information will be useful later when identifying calls that require dispatch: choosing the right value for the context lets us use it as the discriminator in method dispatches. For our particular program, we use the class name of the university program (student.enrolled_in_program) as the context, which, as we will see, gives sufficient information.

Extracting a call graph

As mentioned earlier, a Haskell program is used to extract a graph from the call trace. To prepare the extraction, make sure that you have the dot program installed from Graphviz. Also prepare the Haskell environment by running this in the root directory:

cabal new-build

We are using Cabal to manage Haskell dependencies in this post, but the corresponding commands for Stack should work just as well.

Now we are ready to extract the graph:

ruby -Iruby ruby/tracer.rb | cabal exec runghc CallGraph | dot -T svg > graph.svg

The graph extractor can also generate CSV files which can be imported e.g. into a Neo4j database. See comments in the source code for details.

The method used by CallGraph.hs to extract the graph is roughly like this:

- Traverse the trace from beginning to end, maintaining a stack to keep track of nested method calls. The top of the stack, tos, is the Ruby definition that is currently in focus.
- If the next event is a call to a method m, insert an edge from tos to m and push m to the stack.
- If the next event is a return, pop the stack. (Of course, the return must be for the same method as tos.)
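The traversal above can be sketched in a few lines of Ruby. This is only an illustration of the stack-based idea, not a transcription of CallGraph.hs (which is written in Haskell); the names Node and build_call_graph are invented for the example, and the input is an excerpt of the trace shown earlier.

```ruby
require "set"

# A binding, identified by method name, class name and the line of its "call" event.
Node = Struct.new(:method_name, :class_name, :line)

# Hypothetical helper: turns trace lines into a set of [caller, callee, context] edges.
def build_call_graph(trace_lines)
  edges = Set.new
  stack = []
  context = nil

  trace_lines.each do |raw|
    if raw.start_with?("####")
      # A header starts a new trace: remember its context, reset the stack.
      context = raw.sub("####", "").strip
      stack.clear
      next
    end

    event, line, method_name, class_name = raw.strip.split(",", 4)
    case event
    when "call"
      callee = Node.new(method_name, class_name, line.to_i)
      # The root call has no caller, so it contributes no edge.
      edges << [stack.last, callee, context] unless stack.empty?
      stack.push(callee)
    when "return"
      top = stack.pop
      unless top && top.method_name == method_name && top.class_name == class_name
        raise ArgumentError, "unbalanced trace at: #{raw}"
      end
    end
  end

  edges
end

TRACE = <<~EOS
  #### FancyProgram
  call,68,awarded_scholarship_amount,Student
  call,44,awarded_scholarship_amount,FancyProgram
  call,10,course_points_per_year,UniversityProgram
  return,12,course_points_per_year,UniversityProgram
  call,18,requirement_for_scholarship,UniversityProgram
  return,20,requirement_for_scholarship,UniversityProgram
  return,48,awarded_scholarship_amount,FancyProgram
  return,70,awarded_scholarship_amount,Student
EOS

graph = build_call_graph(TRACE.lines)
graph.each do |from, to, ctx|
  puts format("%s#%s:%d -> %s#%s:%d [%s]",
              from.class_name, from.method_name, from.line,
              to.class_name, to.method_name, to.line, ctx)
end
```

Storing edges in a set means that a call repeated many times within one context shows up only once, which keeps the graph readable.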

Interpreting the graph


We see the call graph of the program scholarship.rb above. The black rectangular boxes represent unique bindings identified by the method name, class name and line number in the source code. (We have omitted file names in this post for simplicity.)

The arrows represent method calls, and each call is labeled with the context in which it is made. As mentioned above, we chose the context to be the class name of the university program.
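For readers unfamiliar with Graphviz, the following sketch shows roughly what such a graph looks like in DOT, the input format of the dot program: box-shaped nodes and arrows labeled with the context. The helper to_dot and the node label format are assumptions made for this example; the actual extractor may lay out its output differently.

```ruby
# Hypothetical helper: renders [from, to, context] triples as Graphviz DOT text.
# Assumes the labels contain no double quotes.
def to_dot(edges)
  lines = ["digraph callgraph {", "  node [shape=box];"]
  edges.each do |from, to, context|
    lines << format('  "%s" -> "%s" [label="%s"];', from, to, context)
  end
  lines << "}"
  lines.join("\n")
end

edges = [
  ["awarded_scholarship_amount:68", "awarded_scholarship_amount:44", "FancyProgram"],
  ["awarded_scholarship_amount:44", "requirement_for_scholarship:18", "FancyProgram"]
]

dot = to_dot(edges)
puts dot
```

Piping this output through dot -T svg produces a picture of the same kind as the one discussed here.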

The red rounded boxes are virtual nodes inserted by the extraction program. They mark places where method dispatch is needed. The condition for inserting a dispatcher is:

Whenever a method makes calls to different bindings with the same method name.

For example, the method awarded_scholarship_amount defined on line 44 calls the method requirement_for_scholarship. The trace reveals that the call goes to two different bindings: either the one on line 18 or the one on line 52. A virtual dispatch node is inserted to highlight this fact. The graph also shows the contexts in which the different calls happen:

line 18 in context FancyProgram

line 52 in context BotanicalProgram
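The dispatch condition can be checked mechanically once the calls are known. Below is a minimal sketch, assuming each observed call is recorded as caller, callee name, callee line and context. The Call struct and find_dispatches are hypothetical names, and class names are left out of the bindings for brevity. The sample data is the example just described, plus one ordinary call taken from the trace.

```ruby
# One observed call: which binding made it, which method name it used,
# which line the invoked binding lives on, and in which context it happened.
Call = Struct.new(:from, :callee_name, :callee_line, :context)

# Hypothetical helper: returns the call sites where one method name
# resolved to more than one binding, with the contexts seen for each binding.
def find_dispatches(calls)
  by_call_site = calls.group_by { |c| [c.from, c.callee_name] }
  targets = by_call_site.transform_values do |group|
    group.group_by(&:callee_line).transform_values { |g| g.map(&:context).uniq }
  end
  targets.select { |_site, by_line| by_line.size > 1 }
end

calls = [
  Call.new("awarded_scholarship_amount:44", "requirement_for_scholarship", 18, "FancyProgram"),
  Call.new("awarded_scholarship_amount:44", "requirement_for_scholarship", 52, "BotanicalProgram"),
  Call.new("awarded_scholarship_amount:44", "course_points_per_year", 10, "FancyProgram")
]

find_dispatches(calls).each do |(from, name), by_line|
  puts "#{from} dispatches on #{name}:"
  by_line.each { |line, contexts| puts "  line #{line} in context #{contexts.join(', ')}" }
end
```

Only the call to requirement_for_scholarship is reported, since course_points_per_year always resolves to the same binding.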

For completeness, below is the Neo4j graph for a large part of the score calculations in our application at Mpowered. It consists of 860 bindings, 886 calls and 49 dispatches. The dispatches are seen as red nodes in the picture.


Porting to Haskell

A simple way to port the Ruby program to Haskell is to make one Haskell definition for each node in the graph, including the dispatcher nodes. The fact that each normal node represents a unique binding helps us make sure that we call the right method at the right place. And the dispatcher nodes identify exactly the places where we need to select the call based on the context.

Assuming that we use a monad for calculation in Haskell, we can keep the context in the monad and define a dispatcher like this:

    requirementForScholarship = do
      cxt <- getContext
      case cxt of
        FancyProgram     -> requirementForScholarship_line18
        BotanicalProgram -> requirementForScholarship_line52

Discussion

The presented method can be seen as a help in understanding the call structure of a Ruby application without necessarily understanding exactly what the program does. When porting Ruby code to a different language, it is convenient to be able to focus on small parts of the program, one by one. For each such part, the visual call graph shows how it relates to the rest of the program, and, in particular, points out the places where a call depends on the context.

Limitations

As we have seen, the presented method is rather ad hoc. In particular, the tracer hard codes the root methods to trace and the values to use as contexts. It can be hard to determine a complete set of methods to use as roots in a larger application. It may also be that a single context value isn’t enough to use for dispatching.

A more serious limitation is one that we haven’t touched upon so far: we cannot be sure that the tracer finds all possible method calls. For example, if a method call is guarded by a conditional that depends on a value in a database, this call may not be performed, even if we include all the relevant root methods in all relevant contexts.

We do not have a good solution to the problem of missed calls. It just shows that the generated graph should not be treated as a complete representation of all possible call paths, but more like a complementary aid in the process of translating the code.