





You just added a vec3 storage buffer to your GLSL shader... but can’t index it properly. Let us find out why.













vec3 buffers indexing fights back again

I am neck-deep in the refactor of my engine, getting the Vulkan back end into a good state. While working on rendering meshes in a uniform way for both DX12 and VK, I moved from vertex push (that is, using a vertex declaration and having the vertex attributes appear in the vertex shader) to vertex pull (i.e. manually fetching the vertex data in the vertex shader).

I wrote a shader with this snippet of code in it:

layout (set = 1, binding = 0) buffer vertices { vec3 p[]; };

I am usually wary of data type packing for anything that is not 16-byte aligned, especially in constant buffers and arrays, but this was a storage buffer, the closest thing you can get to a normal flat array allocation in GLSL. As you can imagine, this did not go very well: the mesh was not being read properly.

I quickly went into the shader, changed the type to vec4, padded my mesh and voilà! Problem fixed, let us move on! See you in the next blog post!

Not so fast. That would be all well and good, but I wanted to know why. Naive me expected it to work, possibly a bit less efficiently, but still work.

Alignment issues

I decided to ask Matthäus (@NIV_Anteru) to learn more about the underlying details, and he was nice enough to spend the time to help me. His initial thought was that it should have worked with the right layout. Naive me jumped back in and shouted that I did indeed have a layout!

layout (set = 1, binding = 0) buffer vertices

What I failed to understand was that the layout qualifier should also specify the right memory layout. As an example, the line below forces a scalar layout:

layout (scalar, set = 1, binding = 0)

This is one of those moments when you realize you are missing knowledge of a whole set of features of the API! So back to reading the docs. I found some interesting links to look at:

With this new information, we can see that the specification of layout offset and alignment says:

A three- or four-component vector has a base alignment equal to four times its scalar alignment.

That rule forces our vec3 and vec4 to have the same alignment properties. Note that this can’t be fixed simply by using the std430 memory layout; I tried and it is not enough. The actual solution is the extension:

GL_EXT_scalar_block_layout

On the page of the actual extension we find this very important line:

This new layout aligns values only to the scalar components of the block and its composite members.

That is exactly the behaviour we wanted: the alignment of our vec3 drops from 16 bytes to that of a single float (4 bytes), so the array elements end up 12 bytes apart instead of 16.
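To make the two rules concrete, here is a small C++ sketch of how the array stride of a float vector could be derived under each layout. The function names are made up for illustration, and only the vector rules quoted above are modelled (no structs, no matrices), so treat it as a mental model rather than a full layout calculator.

```cpp
#include <cassert>
#include <cstddef>

// Size in bytes of one 32-bit float component.
const std::size_t kFloatSize = 4;

// Round `value` up to the next multiple of `alignment`.
inline std::size_t roundUp(std::size_t value, std::size_t alignment) {
    assert(alignment > 0);
    return (value + alignment - 1) / alignment * alignment;
}

// Base alignment of a float vector under std430:
// scalar -> 4, vec2 -> 8, vec3 and vec4 -> 16 (four times the scalar alignment).
inline std::size_t std430VecAlignment(std::size_t components) {
    assert(components >= 1 && components <= 4);
    if (components == 1) return kFloatSize;
    if (components == 2) return 2 * kFloatSize;
    return 4 * kFloatSize;
}

// Array stride under std430: element size rounded up to its base alignment.
inline std::size_t std430ArrayStride(std::size_t components) {
    return roundUp(components * kFloatSize, std430VecAlignment(components));
}

// Array stride under scalar block layout: elements only need scalar
// alignment, so the stride is simply the size of the vector.
inline std::size_t scalarArrayStride(std::size_t components) {
    assert(components >= 1 && components <= 4);
    return components * kFloatSize;
}
```

For a vec3, std430ArrayStride(3) gives 16 while scalarArrayStride(3) gives 12, which is the whole story of this post in two numbers.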

SPIR-V enters the fight

Matthäus also provided me with this amazing example from Shader Playground that shows what actually happens at the SPIR-V level:

Original shader:

#version 450
#define FIX_IT 0

#if FIX_IT
#extension GL_EXT_scalar_block_layout : require
layout (scalar, set = 1, binding = 0) buffer vertices
#else
layout (set = 1, binding = 0) buffer vertices
#endif
{
    vec3 p[];
};

out gl_PerVertex
{
    vec4 gl_Position;
};

void main()
{
    gl_Position = vec4(p[0], 1);
}

As we can see, a define switches between the two layout declarations of our buffer so that we can compare the results.

Here below is the relevant slice of the SPIR-V output:

Name 17 ""
MemberDecorate 8(gl_PerVertex) 0 BuiltIn Position
Decorate 8(gl_PerVertex) Block
Decorate 14 ArrayStride 16
MemberDecorate 15(vertices) 0 Offset 0
Decorate 15(vertices) Block
Decorate 17 DescriptorSet 1
Decorate 17 Binding 0
2: TypeVoid
3: TypeFunction 2
6: TypeFloat 32
7: TypeVector 6(float) 4
8(gl_PerVertex): TypeStruct 7(fvec4)
9: TypePointer Output 8(gl_PerVertex)
10: 9(ptr) Variable Output
11: TypeInt 32 1
12: 11(int) Constant 0
13: TypeVector 6(float) 3
14: TypeRuntimeArray 13(fvec3)
15(vertices): TypeStruct 14
16: TypePointer StorageBuffer 15(vertices)
17: 16(ptr) Variable StorageBuffer
18: TypePointer StorageBuffer 13(fvec3)

By investigating the SPIR-V we can notice several interesting things. The vertices block is defined as a struct referring to id 14:

Decorate 14 ArrayStride 16
TypeStruct 14

The stride of 16 bytes is attached to id 14, the array wrapped by the struct. A few lines below we see the array type itself and the pointer type used to access its elements:

TypeRuntimeArray 13(fvec3)
TypePointer StorageBuffer 13(fvec3)

The above specifies that we have a runtime array, whose length we don’t know, and that the pointer type for its elements is fvec3. If we put these two pieces of information together, we can see we are defining a pointer to a float vec3 but with a stride of 16. From this we can deduce that our shader would work if we padded our mesh to vec4: no shader changes required, and no need to change the buffer to vec4 (although that is possibly more readable).
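On the CPU side, that padding could look something like the C++ sketch below. The function name and the flat std::vector<float> input are assumptions made for the example, not how my engine actually stores meshes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expand tightly packed xyz positions into xyzw so that consecutive
// elements sit 16 bytes apart, matching "ArrayStride 16" in the SPIR-V.
// `padValue` fills the fourth float; a vec3 p[] declaration never reads it,
// so any value works.
inline std::vector<float> padPositionsToVec4(const std::vector<float>& xyz,
                                             float padValue = 1.0f) {
    assert(xyz.size() % 3 == 0 && "expected 3 floats per vertex");
    const std::size_t vertexCount = xyz.size() / 3;

    std::vector<float> xyzw;
    xyzw.reserve(vertexCount * 4);
    for (std::size_t i = 0; i < vertexCount; ++i) {
        xyzw.push_back(xyz[3 * i + 0]);
        xyzw.push_back(xyz[3 * i + 1]);
        xyzw.push_back(xyz[3 * i + 2]);
        xyzw.push_back(padValue);
    }
    return xyzw;
}
```

The padded vector can then be uploaded as-is; the cost is one extra float per vertex.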

We do have that lovely define, so why don’t we flip it and see what happens? Here is the result:

Name 17 ""
MemberDecorate 8(gl_PerVertex) 0 BuiltIn Position
Decorate 8(gl_PerVertex) Block
Decorate 14 ArrayStride 12
MemberDecorate 15(vertices) 0 Offset 0
Decorate 15(vertices) Block
Decorate 17 DescriptorSet 1
Decorate 17 Binding 0
2: TypeVoid
3: TypeFunction 2
6: TypeFloat 32
7: TypeVector 6(float) 4
8(gl_PerVertex): TypeStruct 7(fvec4)
9: TypePointer Output 8(gl_PerVertex)
10: 9(ptr) Variable Output
11: TypeInt 32 1
12: 11(int) Constant 0
13: TypeVector 6(float) 3
14: TypeRuntimeArray 13(fvec3)
15(vertices): TypeStruct 14
16: TypePointer StorageBuffer 15(vertices)
17: 16(ptr) Variable StorageBuffer
18: TypePointer StorageBuffer 13(fvec3)

Overall the structure is exactly the same, but the game-changing line is this:

Decorate 14 ArrayStride 12

Now the elements of our array only need scalar alignment, 4 bytes for a float, so the stride is simply the size of a vec3: 12 bytes. With this, the shader finally works and behaves as expected with no extra padding.
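To see why the stride matters so much, here is a rough C++ emulation of what p[index] does with a given ArrayStride. Float3, fetchVec3 and toBytes are hypothetical helpers written for this post; the real address computation happens in the driver and hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct Float3 { float x, y, z; };

// Emulates p[index]: read 12 bytes starting at index * arrayStride.
inline Float3 fetchVec3(const std::vector<unsigned char>& buffer,
                        std::size_t index, std::size_t arrayStride) {
    const std::size_t offset = index * arrayStride;
    assert(offset + sizeof(Float3) <= buffer.size() && "read out of bounds");
    Float3 v;
    std::memcpy(&v, buffer.data() + offset, sizeof(Float3));
    return v;
}

// Copies a list of floats into a raw byte buffer, like a buffer upload would.
inline std::vector<unsigned char> toBytes(const std::vector<float>& values) {
    std::vector<unsigned char> bytes(values.size() * sizeof(float));
    if (!bytes.empty()) {
        std::memcpy(bytes.data(), values.data(), bytes.size());
    }
    return bytes;
}
```

With tightly packed positions (0,1,2), (3,4,5), (6,7,8), ..., fetching index 1 with a stride of 12 returns (3,4,5), while a stride of 16 starts one float too late and returns (4,5,6). Every later vertex drifts further, which is the scrambled mesh I was seeing.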

Performance

Since I was messing around with this stuff, I decided to have a look at the actual disassembly for the vec3 vs the vec4.

Here below is the code for the vec4:

s_waitcnt_depctr 0xffe3
buffer_load_dwordx4 v[0:3], v0, s[4:7], 0 offen
s_waitcnt vmcnt(0)
exp pos0, v0, v1, v2, v3 done

Here is the code for the vec3:

s_waitcnt_depctr 0xffe3 // 000000000090: BFA3FFE3
buffer_load_dwordx3 v[0:2], v0, s[4:7], 0 offen // 000000000094: E03C1000 80010000
v_mov_b32 v3, 1.0 // 00000000009C: 7E0602F2
s_waitcnt vmcnt(0) // 0000000000A0: BF8C3F70
exp pos0, v0, v1, v2, v3 done

As we can see, the only real difference is in the memory load: in the case of the vec4 we are loading 16 bytes’ worth of data into v[0:3], whereas for the vec3 we are loading only 12 into v[0:2], plus an extra move of the constant 1.0 into v3. Register pressure is exactly the same in both cases, so the only difference in the amount of code is that extra instruction for the 1.0f value.

Which of the two versions is faster I have no idea; it has to be benchmarked. On the one hand, aligned loads have the advantage that a single load can never touch two cache lines, which should result in maximum efficiency. On the other hand, we do save 25% of bandwidth and increase the chance of cache hits.
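To get a feel for the trade-off, the C++ sketch below counts how many loads would cross a cache-line boundary for a given stride. It assumes the buffer starts on a cache-line boundary and takes the line size as a parameter; the 64 bytes used in the example afterwards is an assumption, not a number I measured on any GPU, and the function name is made up.

```cpp
#include <cassert>
#include <cstddef>

// Counts how many of `count` consecutive element loads touch two cache lines.
// Element i is assumed to start at byte i * stride and span `loadSize` bytes,
// with the buffer itself starting on a cache-line boundary.
inline std::size_t countStraddlingLoads(std::size_t count, std::size_t stride,
                                        std::size_t loadSize,
                                        std::size_t cacheLine) {
    assert(cacheLine > 0 && loadSize > 0 && loadSize <= cacheLine);
    std::size_t straddling = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t firstLine = (i * stride) / cacheLine;
        const std::size_t lastLine = (i * stride + loadSize - 1) / cacheLine;
        if (firstLine != lastLine) {
            ++straddling;
        }
    }
    return straddling;
}
```

With a hypothetical 64-byte line, 16 tightly packed vec3 elements (stride 12) give 2 loads that straddle a line, while 16 padded elements (stride 16) give none. In exchange, the tight version occupies 192 bytes instead of 256, which is the 25% saving mentioned above. This says nothing about which one wins in practice, only what the two sides of the trade look like.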

If you happen to have experience with this, or data, please let me know! I would love to hear it! This will require further investigation.

Conclusion

This is the end of the run down this rabbit hole. It was quite interesting, and I am getting quite the liking for SPIR-V the more I deal with it! Thanks so much to Matthäus for enduring my questions! Give him a follow on Twitter, since he often contributes to very interesting conversations. He runs a great blog with tons of interesting articles, like the one about how compute shaders execute and their working details.

When it comes to my project, for the time being I am using vec4 and moving on to other stuff. I do plan at some point to do a nice pass on the geometry handling in general, where I start using meshoptimizer, compressing the data and so on. That might be a good time to revisit the topic.

If you liked this blog post, share it around and follow me on Twitter! @MGDev91.





