I recently asked around #perl6 as to a mailing list where I might discuss PDL features that I'd like to see in Perl 6. Synopsis 9 is supposed to discuss these features, but the PDLish ideas feels like a straight port of PDL, rather than a rethink of what's important. I wanted to discuss things a bit.

The answer was, "Write a blog post." This post contains what I consider to be the essential ingredients of PDL that I think are easily achievable for Perl 6 v1.0. I want Perl 6 to provide an expressive language for writing operations on high dimensional data. I believe that we (and I do include myself) can get this done by Christmas if others can help me out.

PDL's most important feature is that the low-level looping is implemented in C. I do not expect Perl 6 to produce C code to implement this functionality, and so I should say up front that I do not expect a Perl 6 implementation of these features to be fast. But I don't care about fast for v1.0. I'll be happy if they're simply present. We can work on fast later.

Machine data representation of multidimensional arrays

Perl 6 has support for machine representation of multidimensional arrays, so we have the basic data container already.

autolooping

This is, hands down, the killer feature of PDL.

In PDL parlance, this called "threading", specifically "implicit threading". It has nothing to do with parallel threads, which is why I call it autolooping.

PDL's support for autolooping is why I keep coming back to Perl and PDL in my moments of weakness. No other language supports autolooping as well as PDL does. For a more complete explanation of this concept, see PDL::Threading and the Threading section of PDL::Indexing.

The idea is that you define operations expecting a certain number of dimensions. If the data that you end up analyzing has more dimensions, the higher dimensions are automatically looped over. If I write code that expects a matrix, autolooping would allow it to handle a "stack" of matrices.

The idea goes further, though. If one or some of your input has fewer dimensions, but the extent of that dimension can be ascertained from another array that is supposed to share the dimension's extent, then the higher dimensions for the short data are "filled in" by simply repeating the data! If I expect a matrix but I am given a scalar, it is treated like a matrix of the correct dimensions, all equal to the given scalar. For example, this is how PDL implements adding a vector to each row of a matrix.

Believe it or not, a basic implementation of this is not as difficult as it might at first seem. You need to make sure that the dimensions of all arrays are commensurate, and then you need to tabulate the strides for each dimension of each array. The bookkeeping gets a little tedious, but it's doable and I will be happy to dig into details when it comes time to implement something. I discuss notation and examples at greater length below.

Slices that operate as aliases

In Perl 6, you can create an alias of a variable. Changes you make to the alias modify the original. In PDL, saying my $copy = $pdl_data; automatically creates an alias, since that's how references work in Perl 5.

What's more, with PDL it's possible to take a slice data and assign that slice to a variable. This expression, my $slice = $pdl_data->slice([0, 20, 2]) , creates an alias to every other element of $pdl_data from the first to the 21st element. Modifications to the contents of $slice modify the data in $pdl_data . (This is called "data flow" in PDL parlance.)

It would be really nice for Perl 6 to have array aliases as a language feature—does it?—in which case multi-dimensional aliases should come along for the ride without much fuss.

Arbitrary slices (that also operate as aliases)

This is the second most important feature of PDL.

We all know that in Perl 5 you can slice an arbitrary list of elements from an array by saying @slice = @data[2, 3, 5, 7, 11] . If you happen to assign to such a slice, you can overwrite the values.

PDL takes this one step further, allowing you to alias arbitrary slices. In PDL you might use the index method to pull out a similar slice as the one just demonstrated: $slice = $data->index([2, 3, 5, 7, 11]) . The fact that this is an alias makes it much more expressive than the simple array slicing provided by Perl 5. I can pass this slice along to other methods, which can modify the contents and have the effects reflected in my original data. PDL also has the where method, which makes it particularly easy to identify data with interesting properties. For example, if I want to set all negative numbers to zero, I simply say $data->where($data < 0) .= 0 . In Perl 6, not only can we provide this method, but we can optimize it in ways that PDL cannot.

I know exactly how to implement such aliases in Perl 5 using tied arrays; surely a similar approach can be used for arbitrary slices.

Dimension checking and alignment via function signatures

In Perl 6 you can declare the type of a variable in the function signature. While you may not consider array dimensions to be part of the type, some sort of notation for this kind of validation will be necessary to avoid unwieldy boilerplate. Whether it is implemented in the function signature or via separate syntax, the logic for this sort of checking can be implemented by borrowing parts of the autolooping machinery. For detailed reading, see the links already given under autolooping.

Basically, I want more than just to say, "The two arguments to this function should be 2d arrays." I want to be able to say, sub foo ($first($nx, $ny), $second($nx, $ny)) to declare a function in which $first and $second are autoloop compatible. In fact, I might have a function in which I expect $second to have three dimensions, like this: sub foo ($first($nx, $ny), $second($nx, $nz, $ny)) . The notation here should be identical to the autoloop notation discussed below.

Easy dimension rearrangement

You will notice that I suggest autolooping and dimension alignment in function signatures. I have implied that the dimension size enforcement is based on the order of the dimensions, akin to implicit threading in PDL. If we go that route, then we need an easy way to rearrange dimensions. Suppose you have PDL $data that is 10x5. If you want to access the very last element, you'd look at position (9, 4). With PDL, it's easy to transpose the data and to access it the other way around, i.e. by saying (4, 9).

More on autolooping

As a basic example, suppose I want to loop over the pixels of an RGB image (dimensions 3 by Nx by Ny) and produce a gray scale image (dimensions Nx by Ny). I could easily implement this using for loops. But, what happens if I feed this a movie, i.e. a stack of such images? In that case I would have an array with dimensions (3, Nx, Ny, Nframes) and I would expect an output "movie", an array with dimensions (Nx, Ny, Nframes). The traditional approach is to manually wrap the original loop in yet another loop to handle the higher dimension. The problem is that such an approach is hard to handle in the general case and requires much too much boilerplate. The best solution would be a loop syntax that automatically loops over all frames and applies the code to the data in each frame. I would like Perl 6 to have the keyword autoloop , demonstrated in the two examples below. (See this stack overflow discussion for details about the formula.)

my $gam = 2.2; autoloop (@input(3, $Nx, $Ny), my @output($Nx, $Ny)) { for my $i (0 .. $Nx-1) { for my $j (0 .. $Ny-1) { my ($r, $g, $b) = @input[:, $i, $j]; my $y = 0.2126 * $r**$gam + 0.7152 * $g**$gam + 0.0722 * $b**$gam; @output[$i, $j] = 116 * $y**(1/3) - 16; } } }

In my conception, autoloop not only delimits a block and provides autolooping, but it also serves to declare the dimension variables, scoped within the block, and the output variable my @output($Nx, $Ny) which is scoped to live outside the block.

Notice that the output at $i and $j only depends upon the input at $i and $j . This means I can be even more succinct:

my $gam = 2.2; autoloop (@input(3), my @output()) { my ($r, $g, $b) = @input; my $y = 0.2126 * $r**$gam + 0.7152 * $g**$gam + 0.0722 * $b**$gam; @output[] = 116 * $y**(1/3) - 16; }

With the magic of autolooping, if @input is three dimensional, @output will be two dimensional.

I could wrap this in a function, and specify my high-dimensional array, using a function signature like this:

sub rgb_to_grayscale (@input(3, *)) { my $gam = 2.2; autoloop (@input(3), my @output()) { my ($r, $g, $b) = @input; my $y = 0.2126 * $r**$gam + 0.7152 * $g**$gam + 0.0722 * $b**$gam; @output[] = 116 * $y**(1/3) - 16; } return @output; }

How we get to something a bit trickier: an implementation of convolution with reflective boundary conditions. This is easier to implement than other boundary conditions due to the way that negative subscripts are handled. Notice that the function declaration indicates that I want two 2d arrays, but I do not require that they have the same dimensions.