
x86/x64 SIMD Instruction List (SSE to AVX512)

MMX register (64-bit) instructions are omitted.

S1=SSE S2=SSE2 S3=SSE3 SS3=SSSE3 S4.1=SSE4.1 S4.2=SSE4.2 V1=AVX V2=AVX2 V5=AVX512

Instructions marked * become scalar instructions (only the lowest element is calculated) when PS/PD/DQ is changed to SS/SD/SI.

The C/C++ intrinsic name is written in blue below each instruction.

AVX/AVX2

Add the prefix 'V' to turn an SSE instruction name into the corresponding AVX instruction name.

Floating-point AVX instructions can perform 256-bit operations on YMM registers.

Integer AVX instructions can use YMM registers starting with AVX2.

To use 256-bit intrinsics, change the prefix _mm to _mm256 and the suffix si128 to si256.
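For example, the same addition written with 128-bit and 256-bit intrinsics (a sketch; the helper names below are illustrative, not part of any API, and the 256-bit part assumes an AVX2 build):

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* SSE2: add four packed 32-bit integers in a 128-bit register */
static inline __m128i add4(__m128i a, __m128i b) {
    return _mm_add_epi32(a, b);
}

#ifdef __AVX2__
#include <immintrin.h>
/* Same operation, 256-bit: prefix _mm -> _mm256 (and si128 -> si256
   for whole-register intrinsics such as _mm256_xor_si256) */
static inline __m256i add8(__m256i a, __m256i b) {
    return _mm256_add_epi32(a, b);
}
#endif

/* Illustrative check: 40 + 2 in every element, read element 0 back */
int demo_sum_lane0(void) {
    __m128i a = _mm_set1_epi32(40);
    __m128i b = _mm_set1_epi32(2);
    int32_t out[4];
    _mm_storeu_si128((__m128i *)out, add4(a, b));
    return out[0];
}
```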

Using YMM registers requires OS support (for Windows, Windows 7 SP1 or later).

A YMM register is basically split into two 128-bit lanes (upper and lower), and most instructions operate within each lane. Horizontal operations (unpacks, shuffles, horizontal calculations, byte shifts, conversions) can behave unexpectedly across lanes. Check the manuals carefully.
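As an illustration of the 2-lane behavior, here is a plain-C model (no AVX needed; the function name is illustrative) of the 256-bit VPUNPCKLDQ: it interleaves the low half of each 128-bit lane independently, not the low half of the whole 256-bit register.

```c
#include <stdint.h>

/* Model of 256-bit VPUNPCKLDQ: each 128-bit lane (4 dwords) is
   processed on its own, interleaving that lane's low two dwords
   of a and b. Elements 4..7 never mix with elements 0..3. */
static void unpacklo_epi32_256(const uint32_t a[8], const uint32_t b[8],
                               uint32_t dst[8]) {
    for (int lane = 0; lane < 2; lane++) {   /* two independent lanes */
        const uint32_t *al = a + 4 * lane;
        const uint32_t *bl = b + 4 * lane;
        uint32_t *dl = dst + 4 * lane;
        dl[0] = al[0]; dl[1] = bl[0];        /* interleave low dwords */
        dl[2] = al[1]; dl[3] = bl[1];
    }
}
```

Note that a "full-width" interleave of elements 0..3 of a and b would instead require cross-lane instructions such as VPERM2I128 or VPERMD.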

AVX512

Instructions noted only "(V5" can be used if the CPUID AVX512F flag is set.

Instructions noted "(V5" together with "+xx" can be used only if both the CPUID AVX512F flag and the AVX512xx flag are set.

Using AVX512 instructions also requires OS support.

The features common to most AVX512 instructions ({k1}{z}, {er}/{sae}, bcst) are not mentioned in each instruction. See this -> AVX512 Memo

Opmask register instructions are here.

This document is intended to help you find the correct name of an instruction you are unsure of, so that you can then look it up in the manuals. Refer to the manuals before coding.

Intel's manuals -> https://software.intel.com/en-us/articles/intel-sdm

When you find an error or have other feedback, please post it via the feedback form or email me at the address at the bottom of this page.


MOVE ?MM = XMM / YMM / ZMM

Conversions

Arithmetic Operations

Compare

Floating-Point Double Single Half when either (or both) is NaN condition unmet condition met condition unmet condition met Exception on QNaN YES NO YES NO YES NO YES NO compare for == VCMPEQ_OSPD* (V1

_mm_cmp_pd CMPEQPD* (S2

_mm_cmpeq_pd VCMPEQ_USPD* (V1

_mm_cmp_pd VCMPEQ_UQPD* (V1

_mm_cmp_pd VCMPEQ_OSPS* (V1

_mm_cmp_ps CMPEQPS* (S1

_mm_cmpeq_ps VCMPEQ_USPS* (V1

_mm_cmp_ps VCMPEQ_UQPS* (V1

_mm_cmp_ps compare for < CMPLTPD* (S2

_mm_cmplt_pd VCMPLT_OQPD* (V1

_mm_cmp_pd CMPLTPS* (S1

_mm_cmplt_ps VCMPLT_OQPS* (V1

_mm_cmp_ps compare for <= CMPLEPD* (S2

_mm_cmple_pd VCMPLE_OQPD* (V1

_mm_cmp_pd CMPLEPS* (S1

_mm_cmple_ps VCMPLE_OQPS* (V1

_mm_cmp_ps compare for > VCMPGTPD* (V1

_mm_cmpgt_pd (S2 VCMPGT_OQPD* (V1

_mm_cmp_pd VCMPGTPS* (V1

_mm_cmpgt_ps (S1 VCMPGT_OQPS* (V1

_mm_cmp_ps compare for >= VCMPGEPD* (V1

_mm_cmpge_pd (S2 VCMPGE_OQPD* (V1

_mm_cmp_pd VCMPGEPS* (V1

_mm_cmpge_ps (S1 VCMPGE_OQPS* (V1

_mm_cmp_ps compare for != VCMPNEQ_OSPD* (V1

_mm_cmp_pd VCMPNEQ_OQPD* (V1

_mm_cmp_pd VCMPNEQ_USPD* (V1

_mm_cmp_pd CMPNEQPD* (S2

_mm_cmpneq_pd VCMPNEQ_OSPS* (V1

_mm_cmp_ps VCMPNEQ_OQPS* (V1

_mm_cmp_ps VCMPNEQ_USPS* (V1

_mm_cmp_ps CMPNEQPS* (S1

_mm_cmpneq_ps compare for !< CMPNLTPD* (S2

_mm_cmpnlt_pd VCMPNLT_UQPD* (V1

_mm_cmp_pd CMPNLTPS* (S1

_mm_cmpnlt_ps VCMPNLT_UQPS* (V1

_mm_cmp_ps compare for !<= CMPNLEPD* (S2

_mm_cmpnle_pd VCMPNLE_UQPD* (V1

_mm_cmp_pd CMPNLEPS* (S1

_mm_cmpnle_ps VCMPNLE_UQPS* (V1

_mm_cmp_ps compare for !> VCMPNGTPD* (V1

_mm_cmpngt_pd (S2 VCMPNGT_UQPD* (V1

_mm_cmp_pd VCMPNGTPS* (V1

_mm_cmpngt_ps (S1 VCMPNGT_UQPS* (V1

_mm_cmp_ps compare for !>= VCMPNGEPD* (V1

_mm_cmpnge_pd (S2 VCMPNGE_UQPD* (V1

_mm_cmp_pd VCMPNGEPS* (V1

_mm_cmpnge_ps (S1 VCMPNGE_UQPS* (V1

_mm_cmp_ps compare for ordered VCMPORD_SPD* (V1

_mm_cmp_pd CMPORDPD* (S2

_mm_cmpord_pd VCMPORD_SPS* (V1

_mm_cmp_ps CMPORDPS* (S1

_mm_cmpord_ps compare for unordered VCMPUNORD_SPD* (V1

_mm_cmp_pd CMPUNORDPD* (S2

_mm_cmpunord_pd VCMPUNORD_SPS* (V1

_mm_cmp_ps CMPUNORDPS* (S1

_mm_cmpunord_ps TRUE VCMPTRUE_USPD* (V1

_mm_cmp_pd VCMPTRUEPD* (V1

_mm_cmp_pd VCMPTRUE_USPS* (V1

_mm_cmp_ps VCMPTRUEPS* (V1

_mm_cmp_ps FALSE VCMPFALSE_OSPD* (V1

_mm_cmp_pd VCMPFALSEPD* (V1

_mm_cmp_pd VCMPFALSE_OSPS* (V1

_mm_cmp_ps VCMPFALSEPS* (V1

_mm_cmp_ps

Floating-Point Double Single Half compare scalar values

to set flag register COMISD (S2

_mm_comieq_sd

_mm_comilt_sd

_mm_comile_sd

_mm_comigt_sd

_mm_comige_sd

_mm_comineq_sd

UCOMISD (S2

_mm_ucomieq_sd

_mm_ucomilt_sd

_mm_ucomile_sd

_mm_ucomigt_sd

_mm_ucomige_sd

_mm_ucomineq_sd COMISS (S1

_mm_comieq_ss

_mm_comilt_ss

_mm_comile_ss

_mm_comigt_ss

_mm_comige_ss

_mm_comineq_ss

UCOMISS (S1

_mm_ucomieq_ss

_mm_ucomilt_ss

_mm_ucomile_ss

_mm_ucomigt_ss

_mm_ucomige_ss

_mm_ucomineq_ss

Bitwise Logical Operations

Integer Floating-Point QWORD DWORD WORD BYTE Double Single Half and PAND (S2

_mm_and_si128 ANDPD (S2

_mm_and_pd ANDPS (S1

_mm_and_ps VPANDQ (V5...

_mm512_and_epi64

etc VPANDD (V5...

_mm512_and_epi32

etc and not PANDN (S2

_mm_andnot_si128 ANDNPD (S2

_mm_andnot_pd ANDNPS (S1

_mm_andnot_ps VPANDNQ (V5...

_mm512_andnot_epi64

etc VPANDND (V5...

_mm512_andnot_epi32

etc or POR (S2

_mm_or_si128 ORPD (S2

_mm_or_pd ORPS (S1

_mm_or_ps VPORQ (V5...

_mm512_or_epi64

etc VPORD (V5...

_mm512_or_epi32

etc xor PXOR (S2

_mm_xor_si128 XORPD (S2

_mm_xor_pd XORPS (S1

_mm_xor_ps VPXORQ (V5...

_mm512_xor_epi64

etc VPXORD (V5...

_mm512_xor_epi32

etc test PTEST (S4.1

_mm_testz_si128

_mm_testc_si128

_mm_testnzc_si128 VTESTPD (V1

_mm_testz_pd

_mm_testc_pd

_mm_testnzc_pd VTESTPS (V1

_mm_testz_ps

_mm_testc_ps

_mm_testnzc_ps VPTESTMQ (V5...

_mm_test_epi64_mask

VPTESTNMQ (V5...

_mm_testn_epi64_mask

VPTESTMD (V5...

_mm_test_epi32_mask

VPTESTNMD (V5...

_mm_testn_epi32_mask

VPTESTMW (V5+BW...

_mm_test_epi16_mask

VPTESTNMW (V5+BW...

_mm_testn_epi16_mask

VPTESTMB (V5+BW...

_mm_test_epi8_mask

VPTESTNMB (V5+BW...

_mm_testn_epi8_mask

ternary operation VPTERNLOGQ (V5...

_mm_ternarylogic_epi64

VPTERNLOGD (V5...

_mm_ternarylogic_epi32



Bit Shift / Rotate

Integer QWORD DWORD WORD BYTE shift left logical PSLLQ (S2

_mm_slli_epi64

_mm_sll_epi64 PSLLD (S2

_mm_slli_epi32

_mm_sll_epi32 PSLLW (S2

_mm_slli_epi16

_mm_sll_epi16 VPSLLVQ (V2

_mm_sllv_epi64 VPSLLVD (V2

_mm_sllv_epi32 VPSLLVW (V5+BW...

_mm_sllv_epi16 shift right logical PSRLQ (S2

_mm_srli_epi64

_mm_srl_epi64 PSRLD (S2

_mm_srli_epi32

_mm_srl_epi32 PSRLW (S2

_mm_srli_epi16

_mm_srl_epi16 VPSRLVQ (V2

_mm_srlv_epi64 VPSRLVD (V2

_mm_srlv_epi32 VPSRLVW (V5+BW...

_mm_srlv_epi16 shift right arithmetic VPSRAQ (V5...

_mm_srai_epi64

_mm_sra_epi64 PSRAD (S2

_mm_srai_epi32

_mm_sra_epi32 PSRAW (S2

_mm_srai_epi16

_mm_sra_epi16 VPSRAVQ (V5...

_mm_srav_epi64 VPSRAVD (V2

_mm_srav_epi32 VPSRAVW (V5+BW...

_mm_srav_epi16 rotate left VPROLQ (V5...

_mm_rol_epi64 VPROLD (V5...

_mm_rol_epi32 VPROLVQ (V5...

_mm_rolv_epi64 VPROLVD (V5...

_mm_rolv_epi32 rotate right VPRORQ (V5...

_mm_ror_epi64 VPRORD (V5...

_mm_ror_epi32 VPRORVQ (V5...

_mm_rorv_epi64 VPRORVD (V5...

_mm_rorv_epi32

Byte Shift

128-bit shift left logical PSLLDQ (S2

_mm_slli_si128 shift right logical PSRLDQ (S2

_mm_srli_si128 packed align right PALIGNR (SS3

_mm_alignr_epi8

Compare Strings

explicit length implicit length return index PCMPESTRI (S4.2

_mm_cmpestri

_mm_cmpestra

_mm_cmpestrc

_mm_cmpestro

_mm_cmpestrs

_mm_cmpestrz PCMPISTRI (S4.2

_mm_cmpistri

_mm_cmpistra

_mm_cmpistrc

_mm_cmpistro

_mm_cmpistrs

_mm_cmpistrz return mask PCMPESTRM (S4.2

_mm_cmpestrm

_mm_cmpestra

_mm_cmpestrc

_mm_cmpestro

_mm_cmpestrs

_mm_cmpestrz PCMPISTRM (S4.2

_mm_cmpistrm

_mm_cmpistra

_mm_cmpistrc

_mm_cmpistro

_mm_cmpistrs

_mm_cmpistrz

Others

LDMXCSR (S1

_mm_setcsr Load MXCSR register STMXCSR (S1

_mm_getcsr Save MXCSR register state

PSADBW (S2

_mm_sad_epu8 Compute sum of absolute differences MPSADBW (S4.1

_mm_mpsadbw_epu8 Performs eight 4-byte wide Sum of Absolute Differences operations to produce eight word integers. VDBPSADBW (V5+BW...

_mm_dbsad_epu8 Double Block Packed Sum-Absolute-Differences (SAD) on Unsigned Bytes
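For reference, a scalar sketch of what one PSADBW group computes (the function name is illustrative): the sum of absolute differences of 8 unsigned bytes, producing one small integer (zero-extended into a 64-bit element by the real instruction).

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of one PSADBW group: |a0-b0| + ... + |a7-b7|.
   The maximum possible sum is 8 * 255 = 2040, so it fits in 16 bits. */
static uint16_t sad8(const uint8_t a[8], const uint8_t b[8]) {
    unsigned sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (unsigned)abs((int)a[i] - (int)b[i]);
    return (uint16_t)sum;
}
```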

PMULHRSW (SS3

_mm_mulhrs_epi16 Packed Multiply High with Round and Scale

PHMINPOSUW (S4.1

_mm_minpos_epu16 Finds the value and location of the minimum unsigned word from one of 8 horizontally packed unsigned words. The resulting value and location (offset within the source) are packed into the low dword of the destination XMM register.

VPCONFLICTQ (V5+CD...

_mm512_conflict_epi64

VPCONFLICTD (V5+CD...

_mm512_conflict_epi32

Detect Conflicts Within a Vector of Packed Dword/Qword Values into Dense Memory/ Register

VPLZCNTQ (V5+CD...

_mm_lzcnt_epi64

VPLZCNTD (V5+CD...

_mm_lzcnt_epi32

Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values

VFIXUPIMMPD* (V5...

_mm512_fixupimm_pd

VFIXUPIMMPS* (V5...

_mm512_fixupimm_ps

Fix Up Special Packed Float64/32 Values VFPCLASSPD* (V5...

_mm512_fpclass_pd_mask

VFPCLASSPS* (V5...

_mm512_fpclass_ps_mask

Tests Types Of a Packed Float64/32 Values VRANGEPD* (V5+DQ...

_mm_range_pd

VRANGEPS* (V5+DQ...

_mm_range_ps

Range Restriction Calculation For Packed Pairs of Float64/32 Values VGETEXPPD * (V5...

_mm512_getexp_pd

VGETEXPPS * (V5...

_mm512_getexp_ps

Convert Exponents of Packed DP/SP FP Values to FP Values VGETMANTPD * (V5...

_mm512_getmant_pd

VGETMANTPS * (V5...

_mm512_getmant_ps

Extract Float64/32 Vector of Normalized Mantissas from Float64/32 Vector

AESDEC (AESNI

_mm_aesdec_si128 Perform an AES decryption round using a 128-bit state and a round key AESDECLAST (AESNI

_mm_aesdeclast_si128 Perform the last AES decryption round using a 128-bit state and a round key AESENC (AESNI

_mm_aesenc_si128 Perform an AES encryption round using a 128-bit state and a round key AESENCLAST (AESNI

_mm_aesenclast_si128 Perform the last AES encryption round using a 128-bit state and a round key AESIMC (AESNI

_mm_aesimc_si128 Perform an inverse mix column transformation primitive AESKEYGENASSIST (AESNI

_mm_aeskeygenassist_si128 Assist the creation of round keys with a key expansion schedule PCLMULQDQ (PCLMULQDQ

_mm_clmulepi64_si128 Perform carryless multiplication of two 64-bit numbers
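For reference, a scalar sketch of what carry-less multiplication means (the real PCLMULQDQ produces a full 128-bit product; this sketch keeps only the low 64 bits, and the function name is illustrative):

```c
#include <stdint.h>

/* Carry-less (GF(2)) multiply, low 64 bits of the product.
   Like schoolbook multiplication, but partial products are combined
   with XOR instead of ADD, so no carries propagate between bits. */
static uint64_t clmul_lo(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}
```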

SHA1RNDS4 (SHA

_mm_sha1rnds4_epu32 Perform Four Rounds of SHA1 Operation SHA1NEXTE (SHA

_mm_sha1nexte_epu32 Calculate SHA1 State Variable E after Four Rounds SHA1MSG1 (SHA

_mm_sha1msg1_epu32 Perform an Intermediate Calculation for the Next Four SHA1 Message Dwords SHA1MSG2 (SHA

_mm_sha1msg2_epu32 Perform a Final Calculation for the Next Four SHA1 Message Dwords SHA256RNDS2 (SHA

_mm_sha256rnds2_epu32 Perform Two Rounds of SHA256 Operation SHA256MSG1 (SHA

_mm_sha256msg1_epu32 Perform an Intermediate Calculation for the Next Four SHA256 Message SHA256MSG2 (SHA

_mm_sha256msg2_epu32 Perform a Final Calculation for the Next Four SHA256 Message Dwords

VPBROADCASTMB2Q (V5+CD...

_mm_broadcastmb_epi64

VPBROADCASTMW2D (V5+CD...

_mm_broadcastmw_epi32

Broadcast Mask to Vector Register

VZEROALL (V1

_mm256_zeroall Zero all YMM registers VZEROUPPER (V1

_mm256_zeroupper Zero upper 128 bits of all YMM registers

MOVNTPS (S1

_mm_stream_ps Non-temporal store of four packed single-precision floating-point values from an XMM register into memory MASKMOVDQU (S2

_mm_maskmoveu_si128 Non-temporal store of selected bytes from an XMM register into memory MOVNTPD (S2

_mm_stream_pd Non-temporal store of two packed double-precision floating-point values from an XMM register into memory MOVNTDQ (S2

_mm_stream_si128 Non-temporal store of double quadword from an XMM register into memory LDDQU (S3

_mm_lddqu_si128 Special 128-bit unaligned load designed to avoid cache line splits MOVNTDQA (S4.1

_mm_stream_load_si128 Provides a non-temporal hint that can cause adjacent 16-byte items within an aligned 64-byte region (a streaming line) to be fetched and held in a small set of temporary buffers ("streaming load buffers"). Subsequent streaming loads to other aligned 16-byte items in the same streaming line may be supplied from the streaming load buffer and can improve throughput.

VGATHERPFxDPS (V5+PF

_mm512_mask_prefetch_i32gather_ps

VGATHERPFxQPS (V5+PF

_mm512_mask_prefetch_i64gather_ps

VGATHERPFxDPD (V5+PF

_mm512_mask_prefetch_i32gather_pd

VGATHERPFxQPD (V5+PF

_mm512_mask_prefetch_i64gather_pd

x=0/1 Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0/T1 Hint VSCATTERPFxDPS (V5+PF

_mm512_prefetch_i32scatter_ps

VSCATTERPFxQPS (V5+PF

_mm512_prefetch_i64scatter_ps

VSCATTERPFxDPD (V5+PF

_mm512_prefetch_i32scatter_pd

VSCATTERPFxQPD (V5+PF

_mm512_prefetch_i64scatter_pd

x=0/1 Sparse Prefetch Packed SP/DP Data Values with Signed Dword, Signed Qword Indices Using T0/T1 Hint with Intent to Write

TIPS

TIP 1: Zero Clear

XOR instructions do the job, for both integer and floating-point data.

Example: Zero all of 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1

pxor xmm1, xmm1

Example: Set 0.0f to 4 floats in XMM1

xorps xmm1, xmm1

Example: Set 0.0 to 2 doubles in XMM1

xorpd xmm1, xmm1
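The same zeroing idioms with intrinsics (compilers generally emit the XOR idiom for _mm_setzero_*; the helper names are illustrative):

```c
#include <emmintrin.h>

static __m128i zero_int(void) { return _mm_setzero_si128(); } /* pxor  */
static __m128  zero_ps(void)  { return _mm_setzero_ps();    } /* xorps */
static __m128d zero_pd(void)  { return _mm_setzero_pd();    } /* xorpd */

/* Illustrative check: every element of every type reads back as zero */
int zeros_ok(void) {
    float f[4]; double d[2]; long long q[2];
    _mm_storeu_ps(f, zero_ps());
    _mm_storeu_pd(d, zero_pd());
    _mm_storeu_si128((__m128i *)q, zero_int());
    return f[0]==0.0f && f[3]==0.0f && d[0]==0.0 && d[1]==0.0
        && q[0]==0 && q[1]==0;
}
```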

TIP 2: Copy the lowest 1 element to other elements in XMM register

Shuffle instructions do the job.

Example: Copy the lowest float element to other 3 elements in XMM1.

shufps xmm1, xmm1, 0

Example: Copy the lowest WORD element to other 7 elements in XMM1

pshuflw xmm1, xmm1, 0
pshufd xmm1, xmm1, 0

Example: Copy the lower QWORD element to the upper element in XMM1

pshufd xmm1, xmm1, 44h ; 01 00 01 00 B = 44h

Is this better?

punpcklqdq xmm1, xmm1
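The same broadcasts with intrinsics (helper names are illustrative):

```c
#include <emmintrin.h>

/* Broadcast the lowest float to all four elements (shufps xmm, xmm, 0) */
static __m128 bcast_lo_ps(__m128 v) {
    return _mm_shuffle_ps(v, v, 0);
}

/* Copy the lower QWORD to the upper one (punpcklqdq xmm, xmm) */
static __m128i bcast_lo_epi64(__m128i v) {
    return _mm_unpacklo_epi64(v, v);
}

/* Illustrative check: 7.0f in element 0 ends up in all elements */
int bcast_ok(void) {
    __m128 v = _mm_setr_ps(7.0f, 1.0f, 2.0f, 3.0f);
    float f[4];
    _mm_storeu_ps(f, bcast_lo_ps(v));
    return f[0]==7.0f && f[1]==7.0f && f[2]==7.0f && f[3]==7.0f;
}
```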

TIP 3: Integer Sign Extension / Zero Extension

Unpack instructions do the job.

Example: Zero extend 8 WORDS in XMM1 to DWORDS in XMM1 (lower 4) and XMM2 (upper 4).

movdqa xmm2, xmm1    ; src data WORD[7] [6] [5] [4] [3] [2] [1] [0]
pxor xmm3, xmm3      ; upper 16-bit to attach to each WORD = all 0
punpcklwd xmm1, xmm3 ; lower 4 DWORDS: 0 [3] 0 [2] 0 [1] 0 [0]
punpckhwd xmm2, xmm3 ; upper 4 DWORDS: 0 [7] 0 [6] 0 [5] 0 [4]

Example: Sign extend 16 BYTES in XMM1 to WORDS in XMM1 (lower 8) and XMM2 (upper 8).

pxor xmm3, xmm3
movdqa xmm2, xmm1
pcmpgtb xmm3, xmm1   ; upper 8-bit to attach to each BYTE = src >= 0 ? 0 : -1
punpcklbw xmm1, xmm3 ; lower 8 WORDS
punpckhbw xmm2, xmm3 ; upper 8 WORDS

Example (intrinsics): Sign extend 8 WORDS in __m128i variable words8 to DWORDS in dwords4lo (lower 4) and dwords4hi (upper 4)

const __m128i izero = _mm_setzero_si128();
__m128i words8hi = _mm_cmpgt_epi16(izero, words8);
__m128i dwords4lo = _mm_unpacklo_epi16(words8, words8hi);
__m128i dwords4hi = _mm_unpackhi_epi16(words8, words8hi);

TIP 4: Absolute Values of Integers

If an integer value is positive or zero, it is already the absolute value. Otherwise, complementing all bits and then adding 1 yields the absolute value (two's-complement negation).

Example: Set absolute values of 8 signed WORDS in XMM1 to XMM1

;                      if src is positive or 0  / if src is negative
pxor xmm2, xmm2
pcmpgtw xmm2, xmm1   ; xmm2 <- 0                / xmm2 <- -1
pxor xmm1, xmm2      ; xor with 0 (do nothing)  / xor with -1 (complement all bits)
psubw xmm1, xmm2     ; subtract 0 (do nothing)  / subtract -1 (add 1)

Example (intrinsics): Set absolute values of 4 DWORDS in __m128i variable dwords4 to dwords4

const __m128i izero = _mm_setzero_si128();
__m128i tmp = _mm_cmpgt_epi32(izero, dwords4);
dwords4 = _mm_xor_si128(dwords4, tmp);
dwords4 = _mm_sub_epi32(dwords4, tmp);

TIP 5: Absolute Values of Floating-Points

Floating-point values are not stored in two's complement, so just clearing the sign (highest) bit yields the absolute value.

Example: Set absolute values of 4 floats in XMM1 to XMM1

; data
align 16
signoffmask dd 4 dup (7fffffffH) ; mask for clearing the highest bit
; code
andps xmm1, xmmword ptr signoffmask

Example (intrinsics): Set absolute values of 4 floats in __m128 variable floats4 to floats4

const __m128 signmask = _mm_set1_ps(-0.0f); // 0x80000000
floats4 = _mm_andnot_ps(signmask, floats4);

TIP 6: Lacking some integer MUL instructions?

Signed/unsigned makes a difference only for the calculation of the upper half of the product. For the lower half, the same instruction can be used for both signed and unsigned.

unsigned WORD * unsigned WORD -> Upper WORD: PMULHUW, Lower WORD: PMULLW

signed WORD * signed WORD -> Upper WORD: PMULHW, Lower WORD: PMULLW
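Combining the two halves into full 32-bit products can be sketched with intrinsics like this (the function names are illustrative):

```c
#include <emmintrin.h>
#include <stdint.h>

/* Full 32-bit products of 8 signed WORD pairs: interleave the PMULLW
   (low half) and PMULHW (high half) results. lo4 receives the products
   of elements 0..3, hi4 those of elements 4..7, as packed int32. */
static void mul_epi16_full(__m128i a, __m128i b, __m128i *lo4, __m128i *hi4) {
    __m128i lo = _mm_mullo_epi16(a, b);  /* pmullw */
    __m128i hi = _mm_mulhi_epi16(a, b);  /* pmulhw */
    *lo4 = _mm_unpacklo_epi16(lo, hi);   /* low word then high word = int32 */
    *hi4 = _mm_unpackhi_epi16(lo, hi);
}

/* Illustrative check: -300 * 500 does not fit in 16 bits but the
   combined 32-bit product is exact */
int mul_full_ok(void) {
    __m128i lo4, hi4;
    int32_t out[4];
    mul_epi16_full(_mm_set1_epi16(-300), _mm_set1_epi16(500), &lo4, &hi4);
    _mm_storeu_si128((__m128i *)out, lo4);
    (void)hi4;
    return out[0] == -150000;
}
```

For unsigned inputs, swap _mm_mulhi_epi16 for _mm_mulhi_epu16 (PMULHUW); the low half stays the same.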

TIP 8: max / min

A bitwise operation after obtaining a mask by comparison does the job.

Example: Compare each signed DWORD in XMM1 and XMM2 and set smaller one to XMM1

; A=xmm1 B=xmm2        if A>B    / if A<=B
movdqa xmm0, xmm1
pcmpgtd xmm1, xmm2   ; xmm1=-1   / xmm1=0
pand xmm2, xmm1      ; xmm2=B    / xmm2=0
pandn xmm1, xmm0     ; xmm1=0    / xmm1=A
por xmm1, xmm2       ; xmm1=B    / xmm1=A

Example (intrinsics): Compare each signed byte in __m128i variables a, b and set larger one to maxAB
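The code for this example appears to be missing from the page; a sketch of the same mask-and-blend technique with intrinsics (helper names are illustrative; PMAXSB itself only arrived with SSE4.1, which is why SSE2 code needs this workaround):

```c
#include <emmintrin.h>
#include <stdint.h>

/* max of 16 signed bytes without SSE4.1's PMAXSB:
   mask = (a > b), then take a where the mask is set and b elsewhere. */
static __m128i max_epi8_sse2(__m128i a, __m128i b) {
    __m128i mask = _mm_cmpgt_epi8(a, b);          /* -1 where a > b */
    return _mm_or_si128(_mm_and_si128(mask, a),   /* a where a > b  */
                        _mm_andnot_si128(mask, b));/* b elsewhere   */
}

/* Illustrative check: max(-5, 3) = 3 in every byte */
int max_ok(void) {
    int8_t out[16];
    _mm_storeu_si128((__m128i *)out,
                     max_epi8_sse2(_mm_set1_epi8(-5), _mm_set1_epi8(3)));
    return out[0] == 3 && out[15] == 3;
}
```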

TIP 10: Set all bits

A PCMPEQx instruction does the job.

Example: set -1 to all of the 2 QWORDS / 4 DWORDS / 8 WORDS / 16 BYTES in XMM1.
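The code line itself appears to be missing here; in assembly it would presumably be pcmpeqd xmm1, xmm1 (any PCMPEQx works, since a register always equals itself). An intrinsics sketch (helper names are illustrative; note that C requires an initialized value, unlike the register self-compare idiom):

```c
#include <emmintrin.h>
#include <stdint.h>

/* All-ones register: comparing a value with itself for equality
   sets every element to -1 (all bits set). */
static __m128i all_ones(void) {
    __m128i x = _mm_setzero_si128(); /* any value works; initialized to
                                        avoid reading an indeterminate
                                        variable in C */
    return _mm_cmpeq_epi32(x, x);
}

/* Illustrative check: all four DWORDS read back as -1 */
int allones_ok(void) {
    int32_t out[4];
    _mm_storeu_si128((__m128i *)out, all_ones());
    return out[0]==-1 && out[1]==-1 && out[2]==-1 && out[3]==-1;
}
```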