We’ve recently been doing some work in PolyVox to switch the ordering of voxel data from linear order to Morton order. This work is now complete, and in this relatively technical post I’m going to highlight a couple of the interesting tricks which we came across while doing it. Hopefully these will be beneficial to other people working on the low-level details of voxel engines.

Like many voxel engines, PolyVox allows volume data to be broken down into a number of ‘chunks’. The primary advantage of this approach is that not all the data has to be loaded into memory at the same time, and we can instead load and unload chunks on demand. Note that this is an implementation detail of the our ‘PagedVolume’ class, and algorithms which operate on the data (mesh extractors, raycasting, etc) are not aware that the data is stored in this way.

The size of each chunk can be specified by the user (with the ideal size depending on a number of factors) but typically contain 32³, 64³, or 128³ voxels. The ‘ordering’ of these voxels refers to the way they are laid out in memory, and two possibilities are shown below for the 2D case. The linear ordering is the easiest to understand and can be traversed with a simple nested for loop, but the Morton ordering brings numerous benefits in terms of locality of reference, ease of downsampling, and increased compressibility.

However, the purpose of this blog post is not to explain the benefits of Morton ordering, nor to describe how (x,y,z) positions are mapped to locations on the Morton curve. For this we refer you to the Wikipedia article and the excellent blog posts by Jeroen Baert and Fabian Giesen. Instead, we wish to highlight a couple of optimizations which were useful in our implementation.

Morton index calculation

Jeroen Baert did some extensive tests on the most performant way to determine Morton curve positions from a 3D input position. The conclusion was that it is fastest to make use of a lookup table rather than perform a series of bitwise operations – a perhaps surprising result given the relative cost of processor cycles vs. memory access on modern hardware.

We tested both approaches and were able to verify Jeroen’s original findings. Compared to using a simple linear index for the 3D position, computing and accessing a Morton index took roughly four times as long. By comparison, using the lookup table only took about 40% longer than the linear index, clearly showing the benefits of this approach. None-the-less, a 40% increase in voxel access time is a significant price to pay even given the other advantages of the Morton ordering, and it is here that we make our first useful observation.

Jeroen is working in the context of SVO rendering and is using very large volumes. His function to map a 3D position to on the Morton curve looks as follows:

inline uint64_t mortonEncode_LUT(unsigned int x, unsigned int y, unsigned int z) { uint64_t answer = 0; answer = morton256_z[(z >> 16) & 0xFF ] | // we start by shifting the third byte, since we only look at the first 21 bits morton256_y[(y >> 16) & 0xFF ] | morton256_x[(x >> 16) & 0xFF ]; answer = answer << 48 | morton256_z[(z >> 8) & 0xFF ] | // shifting second byte morton256_y[(y >> 8) & 0xFF ] | morton256_x[(x >> 8) & 0xFF ]; answer = answer << 24 | morton256_z[(z) & 0xFF ] | // first byte morton256_y[(y) & 0xFF ] | morton256_x[(x) & 0xFF ]; return answer; }

The variables ‘morton256_x’, ‘morton256_y’, and ‘morton256_z’ are the lookup tables, and each stores 256 entries (enough to perform the mapping for a single byte) due to the impracticality of having an entry in the lookup table for every possible value of the unsigned int inputs. Construction of the full Morton key is therefore done by using these lookup tables repeatedly to process each byte of the input, and them combining them with more bitwise operations.

For our purposes the size of each chunk is actually very limited, and setting a hard limit of 256^3 seems reasonable. The actual volume can of course be much larger, consisting of many of these chunks. Therefore we apply this hard limit and reduce the above code to just three lookups which are OR’d:

template <typename VoxelType> VoxelType PagedVolume<VoxelType>::Chunk::getVoxel(uint32_t uXPos, uint32_t uYPos, uint32_t uZPos) const { uint32_t index = morton256_x[uXPos] | morton256_y[uYPos] | morton256_z[uZPos]; return m_tData[index]; }

With this change the calculation of the position on the Morton curve is about 1-2% faster than with the linear version, though with such a small improvement it is hard to be sure. At least, we are not paying any extra access cost for the benefits which Morton ordering provides.

Fast neighbourhood access

As well as ensuring that random access to any voxel is as fast as possible, it is also important to consider realistic access patterns. In the context of voxel engines this typically means providing fast access to a given voxel’s neighbours as this is often required for tasks such as filtering, normal estimation, and surface extraction.

Access to such neighbours is trivial with a simple linear ordering. For a given position (x,y,z), if we want to access position (x+1,y,z) then we know it is simply the next voxel in memory. There is no need to work out which chunk it is in (disregarding edge cases here) nor to perform the index calculations. In PolyVox we provide ‘peek…()’ functions to retrieve a neighbour of our current voxel, and so an older (linear) version of PolyVox peeked one voxel in the x direction as follows:

template <typename VoxelType> VoxelType PagedVolume<VoxelType>::Sampler::peekVoxel1px0py0pz(void) const { if(CAN_GO_POS_X(this->mXPosInVolume) ) { return *(mCurrentVoxel + 1); } return this->mVolume->getVoxel(this->mXPosInVolume+1,this->mYPosInVolume,this->mZPosInVolume); }

Note that the ‘CAN_GO_POS_X’ macro was just to ensure we were not on a chunk boundary, because if we were then our clever trick didn’t apply and we fell back on a regular call to getVoxel(). Peeking in multiple directions was more complex but the same principle applied:

template <typename VoxelType> VoxelType PagedVolume<VoxelType>::Sampler::peekVoxel1px1py1pz(void) const { if(CAN_GO_POS_X(this->mXPosInVolume) && CAN_GO_POS_Y(this->mYPosInVolume) && CAN_GO_POS_Z(this->mZPosInVolume) ) { return *(mCurrentVoxel + 1 + this->mVolume->m_uChunkSideLength + this->mVolume->m_uChunkSideLength*this->mVolume->m_uChunkSideLength); } return this->mVolume->getVoxel(this->mXPosInVolume+1,this->mYPosInVolume+1,this->mZPosInVolume+1); }

However, the situation becomes more complex when Morton ordering is applied. In this case, the neighbouring voxel in the x direction is not simply next to the current voxel in memory, and things get even harder when peeking in multiple directions at once. We did not find much information on how to handle this correctly, which is why we decided to present our solution here.

The naive approach is to take the Morton position for our current voxel and reverse the encoding process to obtain the original (x,y,z) position. This can then be modified by adding or subtracting the desired offset to the desired component(s), and the resulting 3D position can then be re-encoded into an index on the Morton curve. Clearly this involves quite a lot of processing.

It is possible to directly combine (‘add’?) two Morton positions as alluded to by this StackOverflow answer. However, this is still relatively expensive and again requires some significant bit-shifting. Update: Fabian Giesen has a much more detailed coverage (and more efficient version) of this approach – see the comments and also Texture tiling and swizzling.

At this point Matt made a useful observation. As he states there, “for a given (x,y,z) which has a Morton position p; if we want to peek at (x+1,y,z) then the amount by which p must increase, Δp, is dependent only on x and is independent of y and z. This same logic holds for peeking in y and z”. In other words, we can make use of three more lookup tables (for x, y, and z) which store the offset for moving a single voxel in each direction.

Such lookup tables can be generated by a simple program such as this (based on this code) which computes the offset between different pairs of adjacent x, y and z positions. The size of the lookup table needs to be at least as large as the largest chunk size we wish to support, though smaller chunk sizes are also supported by this as the elements of a smaller table are a subset of the elements of the large table.

In PolyVox we are making use of three 256 element tables allowing us to support chunks of up to 256^3 voxels. The table for moving/peeking one voxel in the x direction is:

static const std::array<int32_t, 256> deltaX = { 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 3511, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 28087, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 3511, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 224695, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 3511, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 28087, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 3511, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 1797559, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 3511, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 28087, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 3511, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 224695, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 3511, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 28087, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1, 3511, 1, 7, 1, 55, 1, 7, 1, 439, 1, 7, 1, 55, 1, 7, 1 };

and you can find the other tables in the PolyVox source here. We also define a few macros to make using the tables easier:

... #define POS_X_DELTA (deltaX[this->m_uXPosInChunk]) ...

With this is place, the Morton version of our function for peeking one voxel in the x direction becomes as simple as:

template <typename VoxelType> VoxelType PagedVolume<VoxelType>::Sampler::peekVoxel1px0py0pz(void) const { if (CAN_GO_POS_X(this->m_uXPosInChunk)) { return *(mCurrentVoxel + POS_X_DELTA); } return this->mVolume->getVoxel(this->mXPosInVolume+1,this->mYPosInVolume,this->mZPosInVolume); }

and for peeking multiple directions at once:

template <typename VoxelType> VoxelType PagedVolume<VoxelType>::Sampler::peekVoxel1px1py1pz(void) const { if (CAN_GO_POS_X(this->m_uXPosInChunk) && CAN_GO_POS_Y(this->m_uYPosInChunk) && CAN_GO_POS_Z(this->m_uZPosInChunk)) { return *(mCurrentVoxel + POS_X_DELTA + POS_Y_DELTA + POS_Z_DELTA); } return this->mVolume->getVoxel(this->mXPosInVolume+1,this->mYPosInVolume+1,this->mZPosInVolume+1); }

Conclusion

We have found that Morton ordering works well for storing voxel data in chunks inside PolyVox. While we can’t be sure that computing a Morton position is quite as fast as computing a linear position, we can say that the improved memory access time at least makes up for this due to the improved cache locality. Benefits such as improved compression and easier downsampling are then essentially free.

If you are interested in any of our work above then you can check out the PolyVox code in BitBucket (currently you need the develop branch) and look at the PagedVolume*.h/inl files. If you want to see the discussion and tests which led to the conclusions above then you can have a read of our thread in the issue tracker.

In the future we intend to have a few more of these posts covering the low-level details of PolyVox and Cubiquity, so stay tuned!