Raymond Hettinger published a nice little micro-benchmark script that compares basic operations like attribute or item access in CPython, and compares their performance across Python versions. Unsurprisingly, Cython performs quite well against the latest CPython 3.8 pre-release, executing most operations 30-50% faster. But the script also allowed me to tune some more performance out of a few operations that were not performing as well. The timings are shown below: first CPython 3.8-pre as a baseline, then, for comparison, the Cython timings with all optimisations disabled that can be controlled by C macros (gcc -DCYTHON_...=0), then the normal (optimised) Cython timings, and finally the newly improved version.

| Benchmark | CPython 3.8 (pre) | Cython 3.0 (no opt) | Cython 3.0 (pre) | Cython 3.0 (tuned) |
|---|---|---|---|---|
| **Variable and attribute read access:** | | | | |
| read_local | 5.5 ns | 0.2 ns | 0.2 ns | 0.2 ns |
| read_nonlocal | 6.0 ns | 0.2 ns | 0.2 ns | 0.2 ns |
| read_global | 17.9 ns | 13.3 ns | 2.2 ns | 2.2 ns |
| read_builtin | 21.0 ns | 0.2 ns | 0.2 ns | 0.1 ns |
| read_classvar_from_class | 23.7 ns | 16.1 ns | 14.1 ns | 14.1 ns |
| read_classvar_from_instance | 20.9 ns | 11.9 ns | 11.2 ns | 11.0 ns |
| read_instancevar | 31.7 ns | 22.3 ns | 20.8 ns | 22.0 ns |
| read_instancevar_slots | 25.8 ns | 16.5 ns | 15.3 ns | 17.0 ns |
| read_namedtuple | 23.6 ns | 16.2 ns | 13.9 ns | 13.5 ns |
| read_boundmethod | 32.5 ns | 23.4 ns | 22.2 ns | 21.6 ns |
| **Variable and attribute write access:** | | | | |
| write_local | 6.4 ns | 0.2 ns | 0.1 ns | 0.1 ns |
| write_nonlocal | 6.8 ns | 0.2 ns | 0.1 ns | 0.1 ns |
| write_global | 22.2 ns | 13.2 ns | 13.7 ns | 13.0 ns |
| write_classvar | 114.2 ns | 103.2 ns | 113.9 ns | 94.7 ns |
| write_instancevar | 49.1 ns | 34.9 ns | 28.6 ns | 29.8 ns |
| write_instancevar_slots | 33.4 ns | 22.6 ns | 16.7 ns | 17.8 ns |
| **Data structure read access:** | | | | |
| read_list | 23.1 ns | 5.5 ns | 4.0 ns | 4.1 ns |
| read_deque | 24.0 ns | 5.7 ns | 4.3 ns | 4.4 ns |
| read_dict | 28.7 ns | 21.2 ns | 16.5 ns | 16.5 ns |
| read_strdict | 23.3 ns | 10.7 ns | 10.5 ns | 12.0 ns |
| **Data structure write access:** | | | | |
| write_list | 28.0 ns | 8.2 ns | 4.3 ns | 4.2 ns |
| write_deque | 29.5 ns | 8.2 ns | 6.3 ns | 6.4 ns |
| write_dict | 32.9 ns | 24.0 ns | 21.7 ns | 22.6 ns |
| write_strdict | 29.2 ns | 16.4 ns | 15.8 ns | 16.0 ns |
| **Stack (or queue) operations:** | | | | |
| list_append_pop | 63.6 ns | 67.9 ns | 20.6 ns | 20.5 ns |
| deque_append_pop | 56.0 ns | 81.5 ns | 159.3 ns | 46.0 ns |
| deque_append_popleft | 58.0 ns | 56.2 ns | 88.1 ns | 36.4 ns |
| **Timing loop overhead:** | | | | |
| loop_overhead | 0.4 ns | 0.2 ns | 0.1 ns | 0.2 ns |
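Conceptually, each row in the table boils down to timing one tiny statement many times with `timeit` and reporting the best per-operation cost in nanoseconds. The following is a minimal sketch of that idea, not Hettinger's actual script; the `bench` helper and the sample class `C` are made up for illustration:

```python
from timeit import Timer

class C:
    x = 1  # class variable, read via "C.x"
    def __init__(self):
        self.y = 2  # instance variable, read via "obj.y"

def bench(name, stmt, setup="pass"):
    # Run the statement in a tight loop, repeat several times,
    # and convert the best total into a per-operation time in ns.
    t = Timer(stmt, setup=setup, globals={"C": C})
    number = 100_000
    best = min(t.repeat(repeat=5, number=number))
    ns = best * 1e9 / number
    print(f"{name:<28} {ns:6.1f} ns")
    return ns

bench("read_classvar_from_class", "C.x")
bench("read_instancevar", "obj.y", setup="obj = C()")
bench("write_instancevar", "obj.y = 3", setup="obj = C()")
```

The reported numbers include a small amount of loop overhead, which is why the original script measures and lists `loop_overhead` separately.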

Some things that are worth noting: