First Try

Using scipy.weave and SSE2 intrinsics gives a marginal improvement. The first invocation is a bit slower because weave has to compile the inline C code and cache the resulting module on disk; subsequent invocations reuse the cached module and are faster:

import numpy
import time
from os import urandom
from scipy import weave

SIZE = 2**20

def faster_slow_xor(aa, bb):
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    numpy.bitwise_xor(numpy.frombuffer(aa, dtype=numpy.uint64), b, b)
    return b.tostring()

code = """
const __m128i* pa = (__m128i*)a;
const __m128i* pend = (__m128i*)(a + arr_size);
__m128i* pb = (__m128i*)b;
__m128i xmm1, xmm2;
while (pa < pend) {
    xmm1 = _mm_loadu_si128(pa); // must use unaligned access
    xmm2 = _mm_load_si128(pb);  // numpy will align at 16 byte boundaries
    _mm_store_si128(pb, _mm_xor_si128(xmm1, xmm2));
    ++pa;
    ++pb;
}
"""

def inline_xor(aa, bb):
    a = numpy.frombuffer(aa, dtype=numpy.uint64)
    b = numpy.fromstring(bb, dtype=numpy.uint64)
    arr_size = a.shape[0]
    weave.inline(code, ["a", "b", "arr_size"], headers=['"emmintrin.h"'])
    return b.tostring()
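To see the compile-and-cache effect mentioned above, timing the first and second call of inline_xor separately is enough (a minimal sketch reusing the definitions above; absolute numbers will differ per machine):

import time
from os import urandom

aa = urandom(SIZE)
bb = urandom(SIZE)

t0 = time.time()
inline_xor(aa, bb)   # first call: weave compiles the inline C and caches the module
t1 = time.time()
inline_xor(aa, bb)   # subsequent calls reuse the cached extension
t2 = time.time()
print("first: %.3fs, second: %.6fs" % (t1 - t0, t2 - t1))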

Second Try

Taking the comments into account, I revisited the code to see whether the copying could be avoided. It turns out I had misread the documentation of the string object, so here is my second try:

support = """ #define ALIGNMENT 16 static void memxor(const char* in1, const char* in2, char* out, ssize_t n) { const char* end = in1 + n; while (in1 < end) { *out = *in1 ^ *in2; ++in1; ++in2; ++out; } } """ code2 = """ PyObject* res = PyString_FromStringAndSize(NULL, real_size); const ssize_t tail = (ssize_t)PyString_AS_STRING(res) % ALIGNMENT; const ssize_t head = (ALIGNMENT - tail) % ALIGNMENT; memxor((const char*)a, (const char*)b, PyString_AS_STRING(res), head); const __m128i* pa = (__m128i*)((char*)a + head); const __m128i* pend = (__m128i*)((char*)a + real_size - tail); const __m128i* pb = (__m128i*)((char*)b + head); __m128i xmm1, xmm2; __m128i* pc = (__m128i*)(PyString_AS_STRING(res) + head); while (pa < pend) { xmm1 = _mm_loadu_si128(pa); xmm2 = _mm_loadu_si128(pb); _mm_stream_si128(pc, _mm_xor_si128(xmm1, xmm2)); ++pa; ++pb; ++pc; } memxor((const char*)pa, (const char*)pb, (char*)pc, tail); return_val = res; Py_DECREF(res); """ def inline_xor_nocopy(aa, bb): real_size = len(aa) a = numpy.frombuffer(aa, dtype=numpy.uint64) b = numpy.frombuffer(bb, dtype=numpy.uint64) return weave.inline(code2, ["a", "b", "real_size"], headers = ['"emmintrin.h"'], support_code = support)

The difference is that the output string is allocated inside the C code. There is no guarantee that it will be aligned on a 16-byte boundary as required by the SSE2 instructions, so the unaligned memory regions at the beginning and the end are XORed using byte-wise access.
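For illustration, here is the head/tail arithmetic from code2 replayed on a made-up buffer address (the address is hypothetical; the point is only that the streamed stores start on a 16-byte boundary and the leftovers are handled byte-wise):

ALIGNMENT = 16
addr = 0x7f3c0000200a        # hypothetical, unaligned address of the output buffer
real_size = 2**20            # SIZE from above, a multiple of 16

tail = addr % ALIGNMENT                  # 10 bytes XORed byte-wise at the end
head = (ALIGNMENT - tail) % ALIGNMENT    # 6 bytes XORed byte-wise at the start

assert (addr + head) % ALIGNMENT == 0                # the SSE2 stores are aligned
assert (real_size - head - tail) % ALIGNMENT == 0    # the SSE2 loop covers the middle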

The input data is handed in via numpy arrays anyway, because weave insists on copying Python str objects to std::strings. frombuffer doesn't copy, so this is fine, but the memory is not guaranteed to be 16-byte aligned, so we need to use _mm_loadu_si128 instead of the faster _mm_load_si128.
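A small sketch of the difference (plain NumPy, no weave): fromstring copies the bytes into a fresh, writable array, while frombuffer only wraps the str's existing read-only buffer, whose address may not be 16-byte aligned:

import numpy
from os import urandom

aa = urandom(64)

copied = numpy.fromstring(aa, dtype=numpy.uint64)   # new, writable buffer
shared = numpy.frombuffer(aa, dtype=numpy.uint64)   # wraps aa's memory, read-only

print(copied.flags.writeable)   # True  - fromstring made a copy we may modify
print(shared.flags.writeable)   # False - frombuffer reuses aa's memory
print(shared.ctypes.data % 16)  # may be non-zero: no 16-byte alignment guarantee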

Instead of _mm_store_si128, we use _mm_stream_si128, which makes sure that any writes are streamed to main memory as soon as possible; this way, the output array does not use up valuable cache lines.
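Since the pointer arithmetic is easy to get wrong, a quick sanity check against the plain NumPy version is worthwhile (a sketch reusing the functions defined above):

from os import urandom

aa = urandom(SIZE)
bb = urandom(SIZE)

expected = faster_slow_xor(aa, bb)
assert inline_xor(aa, bb) == expected
assert inline_xor_nocopy(aa, bb) == expected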

Timings

As for the timings: the slow_xor entry in the first edit referred to my improved version (inline bitwise xor, uint64); I have removed that confusion. slow_xor refers to the code from the original question. All timings are for 1000 runs; a benchmarking sketch follows the results below.

slow_xor: 1.85s (1x)
faster_slow_xor: 1.25s (1.48x)
inline_xor: 0.95s (1.95x)
inline_xor_nocopy: 0.32s (5.78x)
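A timeit harness along these lines can be used to gather such numbers (a sketch, assuming slow_xor is the function from the question and aa, bb are two urandom strings of SIZE bytes as above):

import timeit

setup = ("from __main__ import slow_xor, faster_slow_xor, "
         "inline_xor, inline_xor_nocopy, aa, bb")

for name in ("slow_xor", "faster_slow_xor", "inline_xor", "inline_xor_nocopy"):
    t = timeit.timeit("%s(aa, bb)" % name, setup=setup, number=1000)
    print("%s: %.2fs" % (name, t))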

The code was compiled using gcc 4.4.3 and I've verified that the compiler actually uses the SSE instructions.