Contents

Write string.h functions using string instructions in x86-64 assembly

Introduction

The C standard library offers a bunch of functions (whose declarations can be found in the string.h header) to manage null-terminated strings and arrays. These are among the most used C functions, and they are often implemented as compiler builtins because they are crucial to the speed of programs.

On the other hand, the x86 architecture contains “string instructions”, aimed at implementing operations on strings at the hardware level. Moreover, the x86 architecture has been incrementally enhanced with SIMD instructions over the years, which allow processing multiple bytes of data in one instruction.

In this article we’ll inspect the x86 implementation of the string.h functions in the GNU C library, see how it compares with pure assembly implementations of these functions using string instructions and SIMD, try to explain the choices made by the GNU developers, and help you write better assembly.

Disassembling a call to memcpy

One of the most popular C functions is memcpy. It copies an array of bytes to another, a very common operation, which makes its performance particularly important.

There are several ways you can perform this operation using x86 asm. Let’s see how it is implemented by gcc using this simple C program:

#include <string.h>

#define BUF_LEN 1024
char a[BUF_LEN];
char b[BUF_LEN];

int main(int argc, char *argv[]) {
  memcpy(b, a, BUF_LEN);
  return 0;
}

We can observe the generated asm by using godbolt.

Or compile the code using gcc 14.2: gcc -O1 -g -o string main.c
And then disassemble the executable using:

objdump --source-comment="; " --disassembler-color=extended --disassembler-options=intel --no-show-raw-insn --disassemble=main string

You should get this result:

0000000000401134 <main>:
; 
; int main(int argc, char *argv[]) {
;   memcpy(b, a, BUF_LEN);
  401134:	mov    esi,0x404440
  401139:	mov    edi,0x404040
  40113e:	mov    ecx,0x80
  401143:	rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
;   return 0;
; }
  401146:	mov    eax,0x0
  40114b:	ret

The first surprising thing you notice is that the machine code does not contain any call to the memcpy function. It has been replaced by three mov instructions followed by a mysterious rep movsq instruction.

movs (used here as rep movsq) is one of the five string instructions defined in the “Intel® 64 and IA-32 Architectures Software Developer’s Manual - Volume 1: Basic Architecture”, section 5.1.8.

So it is time to learn more about these string instructions.

The string instructions of x86

String instructions perform operations on array elements pointed to by rsi (the source register) and rdi (the destination register).

Instruction   Description      Effect on registers
movs          Move string      *(rdi++) = *(rsi++)
cmps          Compare string   cmp *(rsi++), *(rdi++)
scas          Scan string      cmp rax, *(rdi++)
lods          Load string      rax = *(rsi++)
stos          Store string     *(rdi++) = rax

Each of these instructions must have a suffix (b, w, d, q) indicating the type of elements pointed to by rdi and rsi (byte, word, doubleword, quadword).

These instructions may also have a prefix indicating how to repeat themselves.

Prefix        Description                                                  Effect on registers
rep           Repeat while the rcx register is not zero                    for(; rcx != 0; rcx--)
repe/repz     Repeat while the rcx register is not zero and ZF is set      for(; rcx != 0 && ZF == true; rcx--)
repne/repnz   Repeat while the rcx register is not zero and ZF is clear    for(; rcx != 0 && ZF == false; rcx--)

The repe/repz and repne/repnz prefixes are used only with the cmps and scas instructions (as they are the only ones modifying the RFLAGS register).

The movs instruction

Now that we have learned more about the string instructions, we can break down the effect of the rep movsq instruction:

  1. copy the quadword pointed by rsi to rdi
  2. add 8 to rsi and rdi so that they point to the next quadword
  3. decrement rcx and repeat until rcx == 0

This is what we would expect memcpy to do except for one thing: bytes are not copied one by one, but in blocks of 8. Here, as the byte size of our arrays is a multiple of 8, we can copy the source array as an array of quadwords. This will necessitate 8 times fewer operations than copying the array one byte at a time.
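
To make the pattern explicit, here is a minimal hand-written equivalent of the code gcc emitted above. It is only a sketch (the name copy_qwords is mine): it assumes n is a non-zero multiple of 8 and that the direction flag is clear, as the System V ABI guarantees.

; void copy_qwords(rdi: void *dst, rsi: const void *src, rdx: size_t n)
copy_qwords:
	mov rcx, rdx	; rcx = n
	shr rcx, 3		; rcx = n / 8 quadwords to copy
	rep movsq		; for(; rcx != 0; rcx--) *(rdi++) = *(rsi++), 8 bytes at a time
	ret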

Let’s change the size of the arrays to 1023 to see how the compiler will react when the array size is not a multiple of 8 anymore:

0000000000401134 <main>:
; 
; int main(int argc, char *argv[]) {
;   memcpy(b, a, BUF_LEN);
  401134:	mov    esi,0x404440
  401139:	mov    edi,0x404040
  40113e:	mov    ecx,0x7f
  401143:	rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
  401146:	mov    eax,DWORD PTR [rsi]
  401148:	mov    DWORD PTR [rdi],eax
  40114a:	movzx  eax,WORD PTR [rip+0x36eb]        # 40483c <a+0x3fc>
  401151:	mov    WORD PTR [rip+0x32e4],ax        # 40443c <b+0x3fc>
  401158:	movzx  eax,BYTE PTR [rip+0x36df]        # 40483e <a+0x3fe>
  40115f:	mov    BYTE PTR [rip+0x32d9],al        # 40443e <b+0x3fe>
;   return 0;
; }
  401165:	mov    eax,0x0
  40116a:	ret

Instead of replacing the rep movsq with a rep movsb instruction, gcc preferred to stop the repetition of the movsq instruction one quadword earlier and add mov instructions to copy the remaining doubleword, word, and byte.

The cmps instruction

The cmps instruction compares the elements pointed to by rsi and rdi and sets the flags accordingly. As cmps sets the ZF flag, we can use the repe/repz and repne/repnz prefixes to, respectively, continue until the elements differ or stop as soon as matching elements are encountered.

Let’s write a basic memcmp function using this instruction:

; int memcmp_cmpsb(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n);
memcmp_cmpsb:
	mov rcx, rdx	; rcx = n
	xor eax, eax	; Set return value to zero
	xor edx, edx    ; rdx = 0
	repe cmpsb		; for(; rcx != 0 and ZF == true; rcx--)
					;	cmp *(rsi++), *(rdi++)
	setb al			; if(ZF == false and CF == true) al = 1
	seta dl			; if(ZF == false and CF == false) dl = 1
	sub eax, edx	; return al - dl
.exit:
	ret

We use the repe cmpsb instruction to iterate over the strings s1 and s2 until two bytes differ.

When we exit the repe cmpsb instruction, the RFLAGS register is set according to the last byte comparison. We can then use the set[cc] instructions to set the bytes al and dl according to the result of that comparison.

The same way the memcpy function copies groups of 8 bytes, we can use the repe cmpsq instruction to compare bytes by groups of 8 (or cmpsd for groups of 4 bytes on 32-bit architectures).

; int memcmp_cmpsq_unaligned(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n);
memcmp_cmpsq_unaligned:
	lea rcx, [rdx + 0x7]	; rcx = n + 7
	and rdx, (8 - 1)		; rdx = n % 8
	shr rcx, 3				; rcx = ceil(n / 8)
	xor eax, eax			; rax = 0
	repe cmpsq				; for(; rcx != 0 and ZF == true; rcx--)
							;	cmp *(rsi++), *(rdi++)
    je .exit                ; If no difference was found return
	mov r8, [rdi - 0x8]	    ; Read the last (unaligned) quadword of s1
	mov r9, [rsi - 0x8]	    ; Read the last (unaligned) quadword of s2
    test rcx, rcx           ; if(rcx != 0)
    jnz .cmp                ;    goto .cmp
    shl rdx, 3              ; rdx = 8 * (n % 8)
    jz .cmp                 ; if(rdx == 0) goto .cmp
    bzhi r8, r8, rdx        ; Keep only the low 8 * (n % 8) bits of r8
    bzhi r9, r9, rdx        ; Keep only the low 8 * (n % 8) bits of r9
.cmp:
	bswap r8				; Convert r8 to big-endian for lexical comparison
	bswap r9				; Convert r9 to big-endian for lexical comparison
	xor ecx, ecx			; Clear rcx: only cl will hold the comparison result
	cmp r8, r9				; Lexical comparison of quadwords
	seta al					; if (r8 > r9) al = 1
	setb cl					; if (r8 < r9) cl = 1
	sub eax, ecx			; return al - cl
.exit:
	ret

To get the result of the comparison, we need to compare the last two quadwords. However, on little-endian systems the least significant byte is the first one in memory, while we want to compare the bytes in lexical order. Hence the need to convert the quadwords to big-endian using the bswap instruction.
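
As a worked example (with made-up contents), suppose the last quadwords of s1 and s2 hold "ABBBBBBB" and "BAAAAAAA", so s1 is lexically smaller:

; r8 = 0x4242424242424241	; "ABBBBBBB" loaded little-endian: 'A' is the low byte
; r9 = 0x4141414141414142	; "BAAAAAAA" loaded little-endian: 'B' is the low byte
; cmp r8, r9 would report r8 > r9, because it weighs the last bytes most heavily
	bswap r8				; r8 = 0x4142424242424242
	bswap r9				; r9 = 0x4241414141414141
	cmp r8, r9				; now r8 < r9, matching the lexical order of s1 and s2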

Zero High Bits Starting with Specified Bit Position

The instruction bzhi is useful when you need to mask out the higher bits of a register. Here when comparing the last quadword we need to erase all bits in r8 and r9 which aren’t “valid” (i.e. which are not part of the input arrays).

You can find documentation about this instruction here.

Warning

This function should only be used on blocks of memory whose size is a multiple of 8 and which are 8-byte aligned.

For production use refer to the Benchmarking section.

The scas instruction

The scas instruction compares the content of rax with the element pointed to by rdi and sets the flags accordingly. We can use it in a similar way to what we did for cmps, taking advantage of the repe/repz and repne/repnz prefixes.

Let’s write a simple strlen function using the scasb instruction:

; size_t strlen(rdi: const char *s)
strlen:
	xor eax, eax	; al = 0, the byte to scan for (the terminating null)
	mov rcx, -1
	repnz scasb		; for(; rcx != 0 and ZF == false; rcx--)
					;	cmp al, *(rdi++)
	not rcx			; before this insn rcx = - (len(rdi) + 2)
	dec rcx			; after this insn rcx = ~(- (len(rdi) + 2)) - 1
					;                     = -(- (len(rdi) + 2)) - 1 - 1
					;                     = len(rdi)
	xchg rax, rcx	; rax = len(rdi)
	ret
Tip
The instruction sequence used here to calculate the length of a text string is a well-known technique that dates back to the 8086 CPU.
Warning

Don’t use this function in production code, as it scans bytes one at a time.

For production, always prefer a loop that examines groups of bytes using the largest available registers (see the Benchmarking section), like the sketch below.
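
Here is such a sketch: a quadword-at-a-time strlen based on the classic "has-zero-byte" bit trick, (x - 0x0101..01) & ~x & 0x8080..80, which is non-zero exactly when x contains a zero byte. The name is mine, and it assumes s is 8-byte aligned so that whole-quadword loads never cross into an unmapped page before the terminator is reached.

; size_t strlen_qword(rdi: const char *s)	; assumes s is 8-byte aligned
strlen_qword:
	mov r8, 0x0101010101010101
	mov r9, 0x8080808080808080
	mov rax, -8
.loop:
	add rax, 8
	mov rcx, [rdi + rax]	; Load the next quadword of the string
	mov rdx, rcx
	sub rdx, r8				; rdx = x - 0x0101..01
	not rcx					; rcx = ~x
	and rdx, rcx
	and rdx, r9				; rdx != 0 iff the quadword contains a zero byte
	jz .loop
	bsf rdx, rdx			; Bit index of the 0x80 marker of the first zero byte
	shr rdx, 3				; Convert the bit index to a byte index
	add rax, rdx			; return offset of the terminator = strlen(s)
	ret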

The lods instruction

The lods instruction loads the element pointed to by rsi into the rax register and increments rsi to point to the next element. As this instruction does nothing more than load a register, it is never used with a repeat prefix (the value would be overwritten on each repetition).

It can, however, be used to examine a string, for instance to find a character:

; char* strchr_lodsb(rdi: const char* s, rsi: int c)
strchr_lodsb:
    xchg rdi, rsi       ; rdi = c, rsi = s
.loop:
    lodsb               ; al = *(rsi++)
    cmp dil, al         ; if(c == al)
    je .end             ;    goto .end
    test al, al         ; if(al != 0)
    jnz .loop           ;    goto .loop
    xor rax, rax        ; return 0
    ret
.end:
    lea rax, [rsi - 1]  ; return rsi - 1
    ret
Warning
For production code always prefer to load data using the largest registers (see the Benchmarking section).

The stos instruction

The stos instruction writes the content of the rax register to the element pointed to by rdi and increments rdi to point to the next element.

Note that according to the “Intel® 64 and IA-32 Architectures Software Developer’s Manual - Volume 1: Basic Architecture 7.3.9.2”:

a REP STOS instruction is the fastest way to initialize a large block of memory.

Actually, this is the way gcc implements memset when it knows the size and alignment of the buffer:

#include <string.h>

#define BUF_LEN 1024

char a[BUF_LEN];

int main(int argc, char *argv[]) {
  memset(a, 1, BUF_LEN);
  return 0;
}
000000000040115a <main>:
; 
; int main(int argc, char *argv[]) {
;   memset(a, 1, BUF_LEN);
  40115a:	mov    edx,0x404460
  40115f:	mov    ecx,0x80
  401164:	movabs rax,0x101010101010101
  40116e:	mov    rdi,rdx
  401171:	rep stos QWORD PTR es:[rdi],rax
;   return 0;
; }
  401174:	mov    eax,0x0
  401179:	ret
Tip

Always use the rep stosq instruction to initialize large blocks of memory.

For small blocks of memory, use an unrolled loop and the largest registers.
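
A hand-written version following this tip could look like the sketch below (assuming the System V calling convention, rdi = dst, esi = c, rdx = n; the name memset_stosq is mine, and the byte tail is handled with rep stosb for brevity):

; void *memset_stosq(rdi: void *dst, esi: int c, rdx: size_t n)
memset_stosq:
	mov r8, rdi					; Save dst for the return value
	movzx eax, sil				; rax = (unsigned char)c
	mov rcx, 0x0101010101010101
	imul rax, rcx				; Replicate the byte into all 8 bytes of rax
	mov rcx, rdx
	shr rcx, 3					; rcx = n / 8 quadwords
	rep stosq					; Fill 8 bytes at a time
	mov rcx, rdx
	and rcx, 7					; rcx = n % 8 remaining bytes
	rep stosb					; Fill the tail one byte at a time
	mov rax, r8					; return dst
	ret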

Let’s turn around

The direction flag

It wouldn’t be as much fun if we couldn’t go backward 😄

On x86, the flags register (RFLAGS) has a direction flag, RFLAGS.DF, which controls the direction of the string operations. This flag can be set and cleared using the std and cld instructions respectively.

std ; SeT Direction
; Here DF = 1, rdi and rsi are decremented
cld ; CLear Direction
; Here DF = 0, rdi and rsi are incremented

Here is a detailed view of the RFLAGS register:

The RFLAGS register

On Intel64, the upper 32 bits of the RFLAGS register are reserved and the lower 32 bits are the same as the EFLAGS register of 32-bit architectures.

The EFLAGS register on Intel64 and IA-32 architectures

Saving and restoring RFLAGS

When you write an assembly subroutine using string instructions you should always:

  • save the state of the RFLAGS register
  • set or clear RFLAGS.DF
  • do your work
  • restore the RFLAGS register

This way, your subroutine will work independently of the state of the direction flag and won’t break other routines relying on this flag.

To do so, you can push RFLAGS to the stack using the pushfq instruction and restore it using the popfq instruction.
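
A minimal sketch of that pattern (the routine name and body are placeholders):

my_backward_routine:
	pushfq			; Save RFLAGS, including the current DF
	std				; RFLAGS.DF = 1: rsi and rdi will be decremented
	; ... string instructions working backward ...
	popfq			; Restore the caller's RFLAGS (and therefore its DF)
	ret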

However, the System V Application Binary Interface for AMD64 states that:

The direction flag DF in the %rFLAGS register must be clear (set to “forward” direction) on function entry and return.

So, if you’re writing code targeting the System V ABI, you may assume that the direction flag is clear and ensure that you keep it clear when leaving your functions.

An example with strrchr

We can code a simple version of strrchr (which looks for the last occurrence of a character in a string) in the spirit of our strchr function by pointing rdi at the end of the string (rdi = s + len(s), the terminating null byte, with rcx = len(s) + 1) and by setting the direction flag.

; char *strrchr(rdi: const char *s, esi: int c);
strrchr:
	push rdi		; push 1st arg on the stack
	push rsi		; push 2nd arg on the stack

	; rcx = strlen(rdi) + 1 (the count includes the terminating null byte)
	call strlen
	mov rcx, rax
	add rcx, 1

	std				; RFLAGS.DF = 1
	pop rax			; rax = c
	pop rdi			; rdi = s
	lea rdi, [rdi + rcx - 1]	; rdi = s + len(s), the terminating null byte
	xor rdx, rdx 	; rdx = 0
	repne scasb		; for(; rcx != 0 and ZF == false; rcx--)
					;	 cmp al, *(rdi--)
	jne .exit		; if(ZF == false) no match was found, return NULL
	lea rdx, [rdi + 1]	; rdx = address of the match
						; (scasb has already decremented rdi past it)
.exit:
	cld				; RFLAGS.DF = 0
	mov rax, rdx 	; return rdx
	ret

We use the System V ABI for our functions, so there is no need to save the RFLAGS register, but we make sure to clear the direction flag before returning from the function.

Warning
You should avoid setting the direction flag, especially when using movs and stos instructions, because, as we’ll see in the last part, it may disable a whole class of optimization known as “fast-string operations”.

Vectorized string instructions

Vectorized string instructions are quite intricate and should not be used in most cases.

If you’re keen to discover one of the strangest instruction families of x86, you can continue, but if you prefer to look at graphs, you can skip to the benchmarking section.

Implicit? Explicit? Index? Mask?

SSE4.2 introduced another set of 4 instructions: The vectorized string instructions.

Instruction   Description
pcmpestri     Packed Compare Explicit Length Strings, Return Index
pcmpistri     Packed Compare Implicit Length Strings, Return Index
pcmpestrm     Packed Compare Explicit Length Strings, Return Mask
pcmpistrm     Packed Compare Implicit Length Strings, Return Mask
Note
Adding a v before any of these instructions makes it VEX.128 encoded and will zero out the upper 128 bits of the ymm registers, which may avoid some performance issues in case you forgot a vzeroupper.

A vectorized string instruction takes three parameters:

pcmpestri xmm1, xmm2/m128, imm8

The first and second arguments of the instructions are meant to contain the string fragments to be compared.

The fragments are considered valid until:

  • The first null byte for the “Implicit Length” versions
  • The length contained in eax (for xmm1) and edx (for xmm2) for the “Explicit Length” versions

The result is:

  • The index of the first match, stored in the ecx register for the “Index” version.
  • A mask of bits/bytes/words (depending on imm8), stored in the xmm0 register for the “Mask” version.

Other information is carried in the RFLAGS register:

Flag       Information
Carry      Result is non-zero
Sign       The first string ends
Zero       The second string ends
Overflow   Least significant bit of the result

The Parity and Adjust flags are reset.

The imm8 control byte

But this doesn’t tell us which comparisons these instructions can perform. Well, they can perform 4 basic operations:

  • Find characters from a set: Finds which of the bytes in the second vector operand belong to the set defined by the bytes in the first vector operand, comparing all 256 possible combinations in one operation.
  • Find characters in a range: Finds which of the bytes in the second vector operand are within the range defined by the first vector operand.
  • Compare strings: Determine if the two strings are identical.
  • Substring search: Finds all occurrences of a substring defined by the first vector operand in the second vector operand.

The operation performed by the vectorized string instruction is controlled by the value of the imm8 byte:

imm8         Description
·······0b    128-bit sources treated as 16 packed bytes.
·······1b    128-bit sources treated as 8 packed words.
······0·b    Packed bytes/words are unsigned.
······1·b    Packed bytes/words are signed.
····00··b    Mode is equal any.
····01··b    Mode is ranges.
····10··b    Mode is equal each.
····11··b    Mode is equal ordered.
···0····b    IntRes1 is unmodified.
···1····b    IntRes1 is negated (1's complement).
··0·····b    Negation of IntRes1 is for all 16 (8) bits.
··1·····b    Negation of IntRes1 is masked by reg/mem validity.
·0······b    Index: the index of the least significant set bit is used (regardless of corresponding input element validity).
             Mask: IntRes2 is returned in the least significant bits of XMM0.
·1······b    Index: the index of the most significant set bit is used (regardless of corresponding input element validity).
             Mask: Each bit of IntRes2 is expanded to byte/word.

The MSB of imm8 has no defined effect and should be 0.

This means that if we want to compare xmm1 and xmm2 for byte equality and get the index of the first non-matching byte, we have to set imm8 = 0b0001'1000

We can define some macros so you do not have to remember all of this:

PACKED_UBYTE 			equ	0b00
PACKED_UWORD 			equ	0b01
PACKED_BYTE 			equ	0b10
PACKED_WORD 			equ	0b11
CMP_STR_EQU_ANY 		equ	(0b00 << 2)
CMP_STR_EQU_RANGES 		equ	(0b01 << 2)
CMP_STR_EQU_EACH 		equ	(0b10 << 2)
CMP_STR_EQU_ORDERED		equ	(0b11 << 2)
CMP_STR_INV_ALL			equ	(0b01 << 4)
CMP_STR_INV_VALID_ONLY	equ (0b11 << 4)
CMP_STRI_FIND_LSB_SET	equ (0b00 << 6)
CMP_STRI_FIND_MSB_SET	equ (0b01 << 6)
CMP_STRM_BIT_MASK		equ (0b00 << 6)
CMP_STRM_BYTE_MASK	    equ (0b01 << 6)

From these definitions we can create the imm8 flag for a vpcmpxstrx instruction with a bitwise OR.

For instance, a flag for the vpcmpestri instruction to get the index of the first differing byte:

BYTEWISE_CMP equ (PACKED_UBYTE | CMP_STR_EQU_EACH | CMP_STR_INV_VALID_ONLY | CMP_STRI_FIND_LSB_SET)

A vectorized memcmp version

Now that we know how to use the vpcmpestri instruction and have defined an imm8 flag for a bytewise comparison of the contents of the xmm registers, we can write a version of memcmp using the vectorized string instructions.

; int memcmp_vpcmpestri_unaligned(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n);
memcmp_vpcmpestri_unaligned:
    xor r10d, r10d
	mov rax, rdx				; rax = n
	test rdx, rdx				; if(n == 0)
	jz .exit					;	 return 0
	vmovdqu xmm2, [rdi + r10]	; xmm2 = rdi[r10]
	vmovdqu xmm3, [rsi + r10]	; xmm3 = rsi[r10]
.loop:
	; Compare xmm2 and xmm3 for equality and write the index of the differing byte in ecx
    ; rax contains the length of the xmm2 string, rdx contains the length of the xmm3 string
	; If all bytes are the same rcx = 16
	vpcmpestri xmm2, xmm3, BYTEWISE_CMP 
	test cx, 0x10				; if(rcx != 16)
	jz .end						; 	break
	add r10, 0x10				; r10 += 16
	vmovdqu xmm2, [rdi + r10]	; xmm2 = rdi[r10]
	vmovdqu xmm3, [rsi + r10]	; xmm3 = rsi[r10]
	sub rdx, 0x10				; rdx -= 16
	ja .loop					; if(rdx > 0) goto .loop
	xor eax, eax				; rax = 0
	ret							; return
.end:
	xor eax, eax				; rax = 0
    cmp rcx, rdx                ; if(index >= rdx)
    jae .exit                   ;    return;
	xor edx, edx				; rdx = 0
    add r10, rcx                ; r10 += index
	mov r8b, [rdi + r10]		; r8b = rdi[r10]
	mov r9b, [rsi + r10]		; r9b = rsi[r10]
	cmp r8b, r9b				; Compare the differing bytes
	seta al						; al = 1 if r8b > r9b
	setb dl						; dl = 1 if r8b < r9b
	sub eax, edx				; return al - dl
.exit:
	ret
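
The implicit-length variants are convenient for null-terminated strings. As an illustration, here is a classic strlen sketch built on pcmpistri with the "equal each" mode (imm8 = PACKED_UBYTE | CMP_STR_EQU_EACH | CMP_STRI_FIND_LSB_SET = 0b0000_1000): since xmm0 holds only null bytes, every element of the first operand is invalid, so IntRes2 is set exactly at the positions where the memory operand is at or past its terminator. The function name is mine, and the 16-byte loads assume the string is 16-byte aligned (or that reading a few bytes past the terminator stays within a mapped page).

; size_t strlen_pcmpistri(rdi: const char *s)	; assumes s is 16-byte aligned
strlen_pcmpistri:
	pxor xmm0, xmm0				; xmm0 = 16 null bytes (implicit length 0)
	mov rax, -16
.loop:
	add rax, 16
	pcmpistri xmm0, [rdi + rax], 0b0000_1000
								; ZF = 1 if the chunk contains a null byte,
								; ecx = index of the first null byte (16 if none)
	jnz .loop					; Keep scanning while no null byte was found
	add rax, rcx				; return offset of the terminator = strlen(s)
	ret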

Benchmarking our string instructions

Benchmark 1: memcpy

At the beginning of the previous part, we disassembled a call to memcpy and saw that gcc had inlined it as a rep movsq instruction.

The compiler can perform this optimization because the alignment and the size of both arrays are known at compile time. Let's add an indirection to the memcpy call so that the compiler can no longer rely on this information.

#include <string.h>

#define BUF_LEN 1023

char a[BUF_LEN];
char b[BUF_LEN];

void *copy(void *restrict dest, const void *restrict src, size_t n) {
  return memcpy(dest, src, n);
}

int main(int argc, char *argv[]) {
  copy(b, a, BUF_LEN);
  return 0;
}
objdump --source-comment="; " --disassembler-color=extended --disassembler-option=intel-mnemonic --no-show-raw-insn --disassemble=copy string
0000000000401126 <copy>:
; #define BUF_LEN 1023
; 
; char a[BUF_LEN];
; char b[BUF_LEN];
; 
; void *copy(void *restrict dest, const void *restrict src, size_t n) {
  401126:	sub    rsp,0x8
;   return memcpy(dest, src, n);
  40112a:	call   401030 <memcpy@plt>
; }
  40112f:	add    rsp,0x8
  401133:	ret

Another situation in which gcc will produce a proper call to the libc memcpy function is when the target architecture has vector extensions. In that case, the compiler knows that the memcpy implementation will use vector instructions, which may be faster than rep movs. You can test it by adding the flag -march=corei7 to your gcc command to see what code gcc produces for an architecture with vector extensions (you can see this in godbolt).

We can now compare different assembly versions of memcpy to its glibc implementation.

I wrote eight versions of a program copying 4 MiB of memory using: an unoptimized for loop, the glibc memcpy function, rep movsb, rep movsq, the SSE2 extension, and the AVX and AVX2 extensions. I also wrote a backward copy to compare the speed of rep movsb when RFLAGS.DF is set.

#include <stddef.h>

__attribute__((optimize("O1"))) void *
memcpy_dummy(void *restrict dst, const void *restrict src, size_t n) {
  void *const ret = dst;
  for (size_t i = 0; i < n; i++)
    *((char *)dst++) = *((char *)src++);
  return ret;
}
; void *memcpy_movsb(rdi: const void dst[.n], rsi: void src[.n], rdx: size_t n)
memcpy_movsb:
    mov rax, rdi    ; rax = dst
	mov rcx, rdx    ; rcx = n
	rep movsb       ; for(; rcx != 0; rcx--)
                    ;    *(rdi++) = *(rsi++)
	ret             ; return rax
; void *memcpy_movsb_std(rdi: const void dst[.n], rsi: void src[.n], rdx: size_t n)
memcpy_movsb_std:
    mov rax, rdi    ; rax = dst
	std             ; Set direction flag
	mov rcx, rdx    ; rcx = n
	sub rdx, 1      ; rdx = n - 1
	add rdi, rdx    ; rdi = dst + (n - 1)
	add rsi, rdx    ; rsi = src + (n - 1)
	rep movsb       ; for(; rcx != 0; rcx--)
                    ;    *(rdi--) = *(rsi--)
	cld             ; Clear direction flag
	ret             ; return rax
; void *memcpy_movb(rdi: const void dst[.n], rsi: void src[.n], rdx: size_t n)
memcpy_movb:
    mov rax, rdi                ; Save dst in rax
    test rdx, rdx               ; if(rdx == 0)
    jz .exit                    ;    goto .exit
    xor ecx, ecx                ; rcx = 0
.loop:
    movzx r8d, byte [rsi+rcx]   ; r8b = rsi[rcx]
    mov byte [rdi+rcx], r8b     ; rdi[rcx] = r8b
    add rcx, 1                  ; rcx++
    cmp rcx, rdx                ; if(rcx != n)
    jne .loop                   ;    goto .loop
.exit:
    ret
; Macro used to copy fewer than 8 bytes
; The macro takes two parameters:
; - A register containing the byte count to copy
; - A register containing the current 8-byte aligned offset
%macro memcpy_epilog_qword 2
    cmp %1, 4                   ; if(r8 < 4)
    jb %%word                    ;    goto .word
    mov eax, [rsi + %2]         ; eax = src[%2]
    mov [rdi + %2], eax         ; dst[%2] = eax
    add %2, 4                   ; %2 += 4
    sub %1, 4                   ; %1 -= 4
%%word:
    cmp %1, 2                   ; if(r8 < 2)
    jb %%byte                    ;    goto .byte
    movzx eax, word [rsi + %2]  ; ax = src[%2]
    mov [rdi + %2], ax          ; dst[%2] = ax
    add %2, 2                   ; %2 += 2
    sub %1, 2                   ; %1 -= 2
%%byte:
    test %1, %1                 ; if(r8 == 0)
    jz %%exit                    ;    goto .exit
    movzx eax, byte [rsi + %2]  ; al = src[%2]
    mov [rdi + %2], al          ; dst[%2] = al
%%exit:
%endmacro

; void *memcpy_movsq(rdi: const void dst[.n], rsi: void src[.n], rdx: size_t n)
memcpy_movsq:
    push rdi                ; Save dst on the stack (popped into rax at .exit)
    cmp rdx, 8                  ; if(n < 8)
    jb .end                     ;    goto .end
    lea rcx, [rsi + 8]          ; rcx = src + 8
    movsq                       ; Copy first quadword
    and rsi, -8                 ; rsi = align(src + 8, 8)
    sub rcx, rsi                ; rcx = (src + 8) - align(src + 8, 8)
    sub rdi, rcx                ; rdi = (dst + 8) - ((src + 8) - align(src + 8, 8))
    lea rdx, [rdx + rcx - 8]    ; rdx = n - (8 - ((src + 8) - align(src + 8, 8)))
    mov rcx, rdx                ; rcx = n - (align(src + 8, 8) - src)
    and rcx, -8                 ; rcx = align(n - (align(src + 8, 8) - src), 8)
    shr rcx, 3                  ; rcx = align(n - (align(src + 8, 8) - src), 8) / 8
	rep movsq                   ; for(; rcx != 0; rcx--)
                                ;    *(rdi++) = *(rsi++)
    and rdx, (8 - 1)            ; rdx = n - (align(src + 8, 8) - src) % 8
.end:
    xor ecx, ecx
    memcpy_epilog_qword rdx, rcx
.exit:
    pop rax
    ret
; void *memcpy_movq(rdi: const void dst[.n], rsi: void src[.n], rdx: size_t n)
memcpy_movq:
    xor ecx, ecx                ; rcx = 0
    mov r9, rdx                 ; r9 = n
    cmp r9, 8                   ; if(n < 8)
    jb .end                     ;    goto .end
    mov r8, [rsi]               ; r8 = *(rsi)
    mov [rdi], r8               ; *(rdi) = r8
    lea rcx, [rdi + 7]          ; rcx = dst + 7
    and rcx, -8                 ; rcx = align((dst + 7), 8)
    sub rcx, rdi                ; rcx = align((dst + 7), 8) - dst
    sub r9, rcx                 ; r9 = dst + n - align((dst + 7), 8)
.loop:
    mov rax, [rsi + rcx]        ; rax = src[rcx]
    mov [rdi + rcx], rax        ; dst[rcx] = rax
    add rcx, 8                  ; rcx += 8
    sub r9, 8                   ; r9 -= 8
    cmp r9, 8                   ; if(r9 >= 8)
    jae .loop                   ;    goto .loop
.end:
    memcpy_epilog_qword r9, rcx
    mov rax, rdi                ; return dst
    ret
; Macro used to copy fewer than 16 bytes
; The macro takes two parameters:
; - A register containing the byte count to copy
; - A register containing the current 16-byte aligned offset
%macro memcpy_epilog_avx 2
    cmp %1, 8
    jb %%dword
    mov rax, [rsi + %2]
    mov [rdi + %2], rax
    add %2, 8
    sub %1, 8
%%dword:
    memcpy_epilog_qword %1, %2
%endmacro

; void *memcpy_avx(rdi: const void dst[.n], rsi: void src[.n], rdx: size_t n)
memcpy_avx:
    xor ecx, ecx                    ; rcx = 0
    cmp rdx, 16                      ; if(n < 16)
    jb .end                         ;    goto .end
    vmovdqu xmm0, [rsi]             ; Copy the first
    vmovdqu [rdi], xmm0             ; 16 bytes
    lea rcx, [rsi + 15]             ; rcx = src + 15
    and rcx, -16                    ; rcx = align((src + 15), 16)
    sub rcx, rsi                    ; rcx = align((src + 15), 16) - src
    sub rdx, rcx                     ; rdx = src + n - align((src + 15), 16)
.loop:
	vmovdqa xmm0, [rsi + rcx]       ; xmm0 = rsi[rcx]
	vmovdqu [rdi + rcx], xmm0       ; rdi[rcx] = xmm0
    add rcx, 16                     ; rcx += 16
    sub rdx, 16                      ; rdx -= 16
    cmp rdx, 16                      ; if(rdx >= 16)
	jae .loop                       ;    goto .loop
.end:
    memcpy_epilog_avx rdx, rcx
    mov rax, rdi
	ret
; Macro used to copy fewer than 32 bytes
; The macro takes two parameters:
; - A register containing the byte count to copy
; - A register containing the current 32-byte aligned offset
%macro memcpy_epilog_avx2 2
    cmp %1, 16
    jb %%qword
    movdqa xmm0, [rsi + %2]
    movdqu [rdi + %2], xmm0
    add %2, 16
    sub %1, 16
%%qword:
    memcpy_epilog_avx %1, %2
%endmacro

; void *memcpy_avx2(rdi: const void dst[.n], rsi: void src[.n], rdx: size_t n)
memcpy_avx2:
    xor ecx, ecx                    ; rcx = 0
    mov r9, rdx                     ; r9 = n
    cmp r9, 32                      ; if(n < 32)
    jb .end                         ;    goto .end
    vmovdqu ymm0, [rsi]             ; Copy the first
    vmovdqu [rdi], ymm0             ; 32 bytes
    lea rcx, [rsi + 31]             ; rcx = src + 31
    and rcx, -32                    ; rcx = align((src + 31), 32)
    sub rcx, rsi                    ; rcx = align((src + 31), 32) - src
    sub r9, rcx                     ; r9 = src + n - align((src + 31), 32)
.loop:
	vmovdqa ymm0, [rsi + rcx]       ; ymm0 = rsi[rcx]
	vmovdqu [rdi + rcx], ymm0       ; rdi[rcx] = ymm0
    add rcx, 32                     ; rcx += 32
    sub r9, 32                      ; r9 -= 32
    cmp r9, 32                      ; if(r9 >= 32)
	jae .loop                       ;    goto .loop
.end:
    memcpy_epilog_avx2 r9, rcx
    mov rax, rdi
    vzeroupper
	ret
Note

Note that for the “dummy version” I forced the optimization level to -O1.

Otherwise, gcc would replace the copy loop in our custom function with a call to memcpy (at -O2) or emit a vectorized loop using the SSE2 extension (at -O3). You can check this in godbolt by changing the optimization level.

This means that when writing casual C code at -O2 or -O3, a simple for loop will often be as efficient as, or more efficient than, an explicit call to memcpy.

Here are the results I got on my “13th Gen Intel(R) Core(TM) i7-1355U (12) @ 5.00 GHz” using the b63 micro-benchmarking tool:

On my hardware, all the implementations are reasonably close; the slowest by far are the backward copy (with RFLAGS.DF set), the for loop and the movb version (the latter two copy only one byte at a time), but you may get different results depending on your CPU.

This is an example of string instructions being nearly as fast as copying with the largest registers of the processor.

According to “Optimizing subroutines in assembly language”:

REP MOVSD and REP STOSD are quite fast if the repeat count is not too small. The largest word size (DWORD in 32-bit mode, QWORD in 64-bit mode) is preferred. Both source and destination should be aligned by the word size or better. In many cases, however, it is faster to use vector registers. Moving data in the largest available registers is faster than REP MOVSD and REP STOSD in most cases, especially on older processors.

The speed of rep movs and rep stos relies heavily on fast-string operations, which need certain processor-dependent conditions to be met.

The “Intel® 64 and IA-32 Architectures Software Developer’s Manual - Volume 1: Basic Architecture 7.3.9.3” defines fast-string operations:

To improve performance, more recent processors support modifications to the processor’s operation during the string store operations initiated with the MOVS, MOVSB, STOS, and STOSB instructions. This optimized operation, called fast-string operation, is used when the execution of one of those instructions meets certain initial conditions (see below). Instructions using fast-string operation effectively operate on the string in groups that may include multiple elements of the native data size (byte, word, doubleword, or quadword).

The general conditions for fast-string operations to happen are:

  • The count of bytes must be high
  • Both source and destination must be aligned (on the size of the greatest registers you have)
  • The direction must be forward
  • The distance between source and destination must be at least the cache line size
  • The memory type for both source and destination must be either write-back or write-combining (you can normally assume the latter condition is met).

So how does the glibc implement the memcpy function?

We have two ways to find out how memcpy works underneath: the first one is to look at the source code of the glibc, the second one is to dump the disassembled machine code in gdb.

We can get the source code of glibc 2.40, the GNU implementation of the C standard library, and verify its signature using these commands:

wget https://ftp.gnu.org/gnu/glibc/glibc-2.40.tar.xz https://ftp.gnu.org/gnu/glibc/glibc-2.40.tar.xz.sig
gpg --recv-keys 7273542B39962DF7B299931416792B4EA25340F8
gpg --verify glibc-2.40.tar.xz.sig glibc-2.40.tar.xz
tar xvf glibc-2.40.tar.xz

You can find the implementation of the memcpy function in the string/memcpy.c file:

/* Copy memory to memory until the specified number of bytes
   has been copied.  Overlap is NOT handled correctly.
   Copyright (C) 1991-2024 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, see
   <https://www.gnu.org/licenses/>.  */

#include <string.h>
#include <memcopy.h>

#ifndef MEMCPY
# define MEMCPY memcpy
#endif

void *
MEMCPY (void *dstpp, const void *srcpp, size_t len)
{
  unsigned long int dstp = (long int) dstpp;
  unsigned long int srcp = (long int) srcpp;

  /* Copy from the beginning to the end.  */

  /* If there not too few bytes to copy, use word copy.  */
  if (len >= OP_T_THRES)
    {
      /* Copy just a few bytes to make DSTP aligned.  */
      len -= (-dstp) % OPSIZ;
      BYTE_COPY_FWD (dstp, srcp, (-dstp) % OPSIZ);

      /* Copy whole pages from SRCP to DSTP by virtual address manipulation,
	 as much as possible.  */

      PAGE_COPY_FWD_MAYBE (dstp, srcp, len, len);

      /* Copy from SRCP to DSTP taking advantage of the known alignment of
	 DSTP.  Number of bytes remaining is put in the third argument,
	 i.e. in LEN.  This number may vary from machine to machine.  */

      WORD_COPY_FWD (dstp, srcp, len, len);

      /* Fall out and copy the tail.  */
    }

  /* There are just a few bytes to copy.  Use byte memory operations.  */
  BYTE_COPY_FWD (dstp, srcp, len);

  return dstpp;
}
libc_hidden_builtin_def (MEMCPY)
Info
The PAGE_COPY_FWD_MAYBE macro is empty for all architectures except the mach architecture, so you can ignore it.

The included sysdeps/generic/memcopy.h file contains the definitions of the macros used by memcpy and an explanation of its internals:

/* The strategy of the memory functions is:

     1. Copy bytes until the destination pointer is aligned.

     2. Copy words in unrolled loops.  If the source and destination
     are not aligned in the same way, use word memory operations,
     but shift and merge two read words before writing.

     3. Copy the few remaining bytes.

   This is fast on processors that have at least 10 registers for
   allocation by GCC, and that can access memory at reg+const in one
   instruction.

   I made an "exhaustive" test of this memmove when I wrote it,
   exhaustive in the sense that I tried all alignment and length
   combinations, with and without overlap.  */

For i386 (i.e. x86_32) architectures, the macro definitions in sysdeps/i386/memcopy.h replace the generic ones, defining the WORD_COPY_FWD and BYTE_COPY_FWD macros in terms of string instructions:

#undef	BYTE_COPY_FWD
#define BYTE_COPY_FWD(dst_bp, src_bp, nbytes)                               \
  do {                                                                      \
    int __d0;                                                               \
    asm volatile(/* Clear the direction flag, so copying goes forward.  */  \
		 "cld\n"                                                            \
		 /* Copy bytes.  */                                                 \
		 "rep\n"                                                            \
		 "movsb" :                                                          \
		 "=D" (dst_bp), "=S" (src_bp), "=c" (__d0) :                        \
		 "0" (dst_bp), "1" (src_bp), "2" (nbytes) :                         \
		 "memory");                                                         \
  } while (0)

#undef	WORD_COPY_FWD
#define WORD_COPY_FWD(dst_bp, src_bp, nbytes_left, nbytes)                      \
  do                                                                            \
    {                                                                           \
      int __d0;                                                                 \
      asm volatile(/* Clear the direction flag, so copying goes forward.  */    \
		   "cld\n"                                                              \
		   /* Copy longwords.  */                                               \
		   "rep\n"                                                              \
		   "movsl" :                                                            \
 		   "=D" (dst_bp), "=S" (src_bp), "=c" (__d0) :                          \
		   "0" (dst_bp), "1" (src_bp), "2" ((nbytes) / 4) :                     \
		   "memory");                                                           \
      (nbytes_left) = (nbytes) % 4;                                             \
    } while (0)

For x86_64, the implementation is much more intricate, as it requires choosing a memcpy implementation according to the vector extensions available. The sysdeps/x86_64/multiarch/memcpy.c file includes ifunc-memmove.h, which defines an IFUNC_SELECTOR function that returns a pointer to a function according to the characteristics of the CPU running the program:

sysdeps/x86_64/multiarch/memcpy.c

/* Multiple versions of memcpy.
   All versions must be listed in ifunc-impl-list.c.
   Copyright (C) 2017-2024 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, see
   <https://www.gnu.org/licenses/>.  */

/* Define multiple versions only for the definition in libc.  */
#if IS_IN (libc)
# define memcpy __redirect_memcpy
# include <string.h>
# undef memcpy

# define SYMBOL_NAME memcpy
# include "ifunc-memmove.h"

libc_ifunc_redirected (__redirect_memcpy, __new_memcpy,
		       IFUNC_SELECTOR ());

# ifdef SHARED
__hidden_ver1 (__new_memcpy, __GI_memcpy, __redirect_memcpy)
  __attribute__ ((visibility ("hidden")));
# endif

# include <shlib-compat.h>
versioned_symbol (libc, __new_memcpy, memcpy, GLIBC_2_14);
#endif

sysdeps/x86_64/multiarch/ifunc-memmove.h

/* Common definition for memcpy, mempcpy and memmove implementation.
   All versions must be listed in ifunc-impl-list.c.
   Copyright (C) 2017-2024 Free Software Foundation, Inc.
   This file is part of the GNU C Library.

   The GNU C Library is free software; you can redistribute it and/or
   modify it under the terms of the GNU Lesser General Public
   License as published by the Free Software Foundation; either
   version 2.1 of the License, or (at your option) any later version.

   The GNU C Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   Lesser General Public License for more details.

   You should have received a copy of the GNU Lesser General Public
   License along with the GNU C Library; if not, see
   <https://www.gnu.org/licenses/>.  */

#include <init-arch.h>

extern __typeof (REDIRECT_NAME) OPTIMIZE (erms) attribute_hidden;

extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_unaligned)
  attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_unaligned_erms)
  attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (avx512_no_vzeroupper)
  attribute_hidden;

extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned)
  attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (evex_unaligned_erms)
  attribute_hidden;

extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned) attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms)
  attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_rtm)
  attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (avx_unaligned_erms_rtm)
  attribute_hidden;

extern __typeof (REDIRECT_NAME) OPTIMIZE (ssse3) attribute_hidden;

extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned)
  attribute_hidden;
extern __typeof (REDIRECT_NAME) OPTIMIZE (sse2_unaligned_erms)
  attribute_hidden;

static inline void *
IFUNC_SELECTOR (void)
{
  const struct cpu_features *cpu_features = __get_cpu_features ();

  if (CPU_FEATURES_ARCH_P (cpu_features, Prefer_ERMS)
      || CPU_FEATURES_ARCH_P (cpu_features, Prefer_FSRM))
    return OPTIMIZE (erms);

  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX512F)
      && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
    {
      if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX512VL))
	{
	  if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
	    return OPTIMIZE (avx512_unaligned_erms);

	  return OPTIMIZE (avx512_unaligned);
	}

      return OPTIMIZE (avx512_no_vzeroupper);
    }

  if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
				   AVX_Fast_Unaligned_Load, ))
    {
      if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, AVX512VL))
	{
	  if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
	    return OPTIMIZE (evex_unaligned_erms);

	  return OPTIMIZE (evex_unaligned);
	}

      if (CPU_FEATURE_USABLE_P (cpu_features, RTM))
	{
	  if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
	    return OPTIMIZE (avx_unaligned_erms_rtm);

	  return OPTIMIZE (avx_unaligned_rtm);
	}

      if (X86_ISA_CPU_FEATURES_ARCH_P (cpu_features,
				       Prefer_No_VZEROUPPER, !))
	{
	  if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
	    return OPTIMIZE (avx_unaligned_erms);

	  return OPTIMIZE (avx_unaligned);
	}
    }

  if (X86_ISA_CPU_FEATURE_USABLE_P (cpu_features, SSSE3)
      /* Leave this as runtime check.  The SSSE3 is optimized almost
         exclusively for avoiding unaligned memory access during the
         copy and by and large is not better than the sse2
         implementation as a general purpose memmove.  */
      && !CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Copy))
    {
      return OPTIMIZE (ssse3);
    }

  if (CPU_FEATURE_USABLE_P (cpu_features, ERMS))
    return OPTIMIZE (sse2_unaligned_erms);

  return OPTIMIZE (sse2_unaligned);
}

You can learn more about indirect functions in glibc in this blog post. What you need to understand here is that glibc determines at runtime, based on your hardware capabilities, which function will actually be called when you call memcpy.

Let’s use gdb to find out how our call to memcpy will be resolved at runtime.

Run the following commands in gdb to run until the first call to memcpy:

b main
run
b memcpy
c
disassemble

You should get something like this:

<+0>:	endbr64
<+4>:	mov    rcx,QWORD PTR [rip+0x13b0f5]     # 0x7ffff7fa0eb0
<+11>:	lea    rax,[rip+0x769e]        			# 0x7ffff7e6d460 <__memmove_erms>
<+18>:	mov    edx,DWORD PTR [rcx+0x1c4]
<+24>:	test   dh,0x90
<+27>:	jne    0x7ffff7e65e41 <memcpy@@GLIBC_2.14+145>
<+29>:	mov    esi,DWORD PTR [rcx+0xb8]
<+35>:	test   esi,0x10000
<+41>:	jne    0x7ffff7e65e48 <memcpy@@GLIBC_2.14+152>
<+43>:	test   dh,0x2
<+46>:	je     0x7ffff7e65e20 <memcpy@@GLIBC_2.14+112>
<+48>:	test   esi,esi
<+50>:	js     0x7ffff7e65e88 <memcpy@@GLIBC_2.14+216>
<+56>:	test   esi,0x800
<+62>:	je     0x7ffff7e65e10 <memcpy@@GLIBC_2.14+96>
<+64>:	and    esi,0x200
<+70>:	lea    rdx,[rip+0xc9d43]        # 0x7ffff7f2fb40 <__memmove_avx_unaligned_erms_rtm>
<+77>:	lea    rax,[rip+0xc9cac]        # 0x7ffff7f2fab0 <__memmove_avx_unaligned_rtm>
<+84>:	cmovne rax,rdx
<+88>:	ret
<+89>:	nop    DWORD PTR [rax+0x0]
<+96>:	test   dh,0x8
<+99>:	je     0x7ffff7e65ea8 <memcpy@@GLIBC_2.14+248>
<+105>:	nop    DWORD PTR [rax+0x0]
<+112>:	test   BYTE PTR [rcx+0x9d],0x2
<+119>:	jne    0x7ffff7e65e78 <memcpy@@GLIBC_2.14+200>
<+121>:	and    esi,0x200
<+127>:	lea    rax,[rip+0x774a]        # 0x7ffff7e6d580 <__memmove_sse2_unaligned_erms>
<+134>:	lea    rdx,[rip+0x76b3]        # 0x7ffff7e6d4f0 <memcpy@GLIBC_2.2.5>
<+141>:	cmove  rax,rdx
<+145>:	ret
<+146>:	nop    WORD PTR [rax+rax*1+0x0]
<+152>:	test   dh,0x20
<+155>:	jne    0x7ffff7e65ddb <memcpy@@GLIBC_2.14+43>
<+157>:	lea    rax,[rip+0xdc23c]        # 0x7ffff7f42090 <__memmove_avx512_no_vzeroupper>
<+164>:	test   esi,esi
<+166>:	jns    0x7ffff7e65e41 <memcpy@@GLIBC_2.14+145>
<+168>:	and    esi,0x200
<+174>:	lea    rdx,[rip+0xda19b]        # 0x7ffff7f40000 <__memmove_avx512_unaligned_erms>
<+181>:	lea    rax,[rip+0xda104]        # 0x7ffff7f3ff70 <__memmove_avx512_unaligned>
<+188>:	cmovne rax,rdx
<+192>:	ret
<+193>:	nop    DWORD PTR [rax+0x0]
<+200>:	and    edx,0x20
<+203>:	lea    rax,[rip+0xdcc3e]        # 0x7ffff7f42ac0 <__memmove_ssse3>
<+210>:	jne    0x7ffff7e65e29 <memcpy@@GLIBC_2.14+121>
<+212>:	ret
<+213>:	nop    DWORD PTR [rax]
<+216>:	and    esi,0x200
<+222>:	lea    rdx,[rip+0xd112b]        # 0x7ffff7f36fc0 <__memmove_evex_unaligned_erms>
<+229>:	lea    rax,[rip+0xd1094]        # 0x7ffff7f36f30 <__memmove_evex_unaligned>
<+236>:	cmovne rax,rdx
<+240>:	ret
<+241>:	nop    DWORD PTR [rax+0x0]
<+248>:	and    esi,0x200
<+254>:	lea    rdx,[rip+0xc124b]        # 0x7ffff7f27100 <__memmove_avx_unaligned_erms>
<+261>:	lea    rax,[rip+0xc11b4]        # 0x7ffff7f27070 <__memmove_avx_unaligned>
<+268>:	cmovne rax,rdx
<+272>:	ret

Instead of breaking inside the code of the memcpy function, the glibc runtime called the IFUNC_SELECTOR. This function returns the pointer that will be used for later calls to memcpy.

Let’s now hit the finish gdb command to see what the selector returns in rax.

Return value of IFUNC_SELECTOR

The IFUNC_SELECTOR returned a pointer to __memmove_avx_unaligned_erms. This function is defined in sysdeps/x86_64/multiarch/memmove-avx-unaligned-erms.S and sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S in handwritten assembly.

But it’s quite difficult to figure out the execution flow while reading this, so let’s put a breakpoint on this symbol, hit continue, and run this function step by step.

Note

The “erms” part of __memmove_avx_unaligned_erms stands for “Enhanced REP MOVSB/STOSB”, which is what we earlier called fast-string operation.

It is named this way in the “Intel® 64 and IA-32 Architectures Optimization Reference Manual: Volume 1”, section 3.7.6, and corresponds to a bit in cpuid which indicates that this feature is available.
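
A minimal sketch of such a runtime check (the function name is mine): the ERMS flag is bit 9 of EBX for CPUID leaf 7, sub-leaf 0.

; int has_erms(void) -- returns 1 if Enhanced REP MOVSB/STOSB is available
has_erms:
	push rbx			; rbx is callee-saved in the System V ABI (and clobbered by cpuid)
	mov eax, 7			; CPUID leaf 7
	xor ecx, ecx		; sub-leaf 0
	cpuid
	xor eax, eax
	bt ebx, 9			; CPUID.(EAX=07H, ECX=0):EBX bit 9 = ERMS
	setc al				; eax = 1 if the bit is set
	pop rbx
	ret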

I highlighted the code reached during the execution of the function:

__memmove_avx_unaligned_erms

<+0>:	endbr64
<+4>:	mov    rax,rdi												; Copy the pointer to destination into rax
<+7>:	cmp    rdx,0x20												; Compare size to 0x20 (32 bytes)
<+11>:	jb     0x7ffff7f27130 <__memmove_avx_unaligned_erms+48>		; If smaller jump
<+13>:	vmovdqu ymm0,YMMWORD PTR [rsi]								; Otherwise loads the first 32 bytes of src into ymm0
<+17>:	cmp    rdx,0x40												; Compare size to 0x40 (64 bytes)
<+21>:	ja     0x7ffff7f271c0 <__memmove_avx_unaligned_erms+192>	; If above jump
<+27>:	vmovdqu ymm1,YMMWORD PTR [rsi+rdx*1-0x20]
<+33>:	vmovdqu YMMWORD PTR [rdi],ymm0
<+37>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x20],ymm1
<+43>:	vzeroupper
<+46>:	ret
<+47>:	nop
<+48>:	cmp    edx,0x10
<+51>:	jae    0x7ffff7f27162 <__memmove_avx_unaligned_erms+98>
<+53>:	cmp    edx,0x8
<+56>:	jae    0x7ffff7f27180 <__memmove_avx_unaligned_erms+128>
<+58>:	cmp    edx,0x4
<+61>:	jae    0x7ffff7f27155 <__memmove_avx_unaligned_erms+85>
<+63>:	cmp    edx,0x1
<+66>:	jl     0x7ffff7f27154 <__memmove_avx_unaligned_erms+84>
<+68>:	mov    cl,BYTE PTR [rsi]
<+70>:	je     0x7ffff7f27152 <__memmove_avx_unaligned_erms+82>
<+72>:	movzx  esi,WORD PTR [rsi+rdx*1-0x2]
<+77>:	mov    WORD PTR [rdi+rdx*1-0x2],si
<+82>:	mov    BYTE PTR [rdi],cl
<+84>:	ret
<+85>:	mov    ecx,DWORD PTR [rsi+rdx*1-0x4]
<+89>:	mov    esi,DWORD PTR [rsi]
<+91>:	mov    DWORD PTR [rdi+rdx*1-0x4],ecx
<+95>:	mov    DWORD PTR [rdi],esi
<+97>:	ret
<+98>:	vmovdqu xmm0,XMMWORD PTR [rsi]
<+102>:	vmovdqu xmm1,XMMWORD PTR [rsi+rdx*1-0x10]
<+108>:	vmovdqu XMMWORD PTR [rdi],xmm0
<+112>:	vmovdqu XMMWORD PTR [rdi+rdx*1-0x10],xmm1
<+118>:	ret
<+119>:	nop    WORD PTR [rax+rax*1+0x0]
<+128>:	mov    rcx,QWORD PTR [rsi+rdx*1-0x8]
<+133>:	mov    rsi,QWORD PTR [rsi]
<+136>:	mov    QWORD PTR [rdi],rsi
<+139>:	mov    QWORD PTR [rdi+rdx*1-0x8],rcx
<+144>:	ret
<+145>:	vmovdqu ymm2,YMMWORD PTR [rsi+rdx*1-0x20]
<+151>:	vmovdqu ymm3,YMMWORD PTR [rsi+rdx*1-0x40]
<+157>:	vmovdqu YMMWORD PTR [rdi],ymm0
<+161>:	vmovdqu YMMWORD PTR [rdi+0x20],ymm1
<+166>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x20],ymm2
<+172>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x40],ymm3
<+178>:	vzeroupper
<+181>:	ret
<+182>:	cs nop WORD PTR [rax+rax*1+0x0]
<+192>:	cmp    rdx,QWORD PTR [rip+0x7a051]        # 0x7ffff7fa1218 <__x86_rep_movsb_threshold>
<+199>:	ja     0x7ffff7f273c0 <__memmove_avx_unaligned_erms+704> ; Jump if size > __x86_rep_movsb_threshold
<+205>:	cmp    rdx,0x100
<+212>:	ja     0x7ffff7f27235 <__memmove_avx_unaligned_erms+309>
<+214>:	vmovdqu ymm1,YMMWORD PTR [rsi+0x20]
<+219>:	cmp    rdx,0x80
<+226>:	jbe    0x7ffff7f27191 <__memmove_avx_unaligned_erms+145>
<+228>:	vmovdqu ymm2,YMMWORD PTR [rsi+0x40]
<+233>:	vmovdqu ymm3,YMMWORD PTR [rsi+0x60]
<+238>:	vmovdqu ymm4,YMMWORD PTR [rsi+rdx*1-0x20]
<+244>:	vmovdqu ymm5,YMMWORD PTR [rsi+rdx*1-0x40]
<+250>:	vmovdqu ymm6,YMMWORD PTR [rsi+rdx*1-0x60]
<+256>:	vmovdqu ymm7,YMMWORD PTR [rsi+rdx*1-0x80]
<+262>:	vmovdqu YMMWORD PTR [rdi],ymm0
<+266>:	vmovdqu YMMWORD PTR [rdi+0x20],ymm1
<+271>:	vmovdqu YMMWORD PTR [rdi+0x40],ymm2
<+276>:	vmovdqu YMMWORD PTR [rdi+0x60],ymm3
<+281>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x20],ymm4
<+287>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x40],ymm5
<+293>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x60],ymm6
<+299>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x80],ymm7
<+305>:	vzeroupper
<+308>:	ret
<+309>:	mov    rcx,rdi
<+312>:	sub    rcx,rsi
<+315>:	cmp    rcx,rdx
<+318>:	jb     0x7ffff7f272f0 <__memmove_avx_unaligned_erms+496>
<+324>:	cmp    rdx,QWORD PTR [rip+0x80fe5]        # 0x7ffff7fa8230 <__x86_shared_non_temporal_threshold>
<+331>:	ja     0x7ffff7f27420 <__memmove_avx_unaligned_erms+800>
<+337>:	lea    r8,[rcx+rdx*1]
<+341>:	xor    r8,rcx
<+344>:	shr    r8,0x3f
<+348>:	and    ecx,0xf00
<+354>:	add    ecx,r8d
<+357>:	je     0x7ffff7f272f5 <__memmove_avx_unaligned_erms+501>
<+363>:	vmovdqu ymm5,YMMWORD PTR [rsi+rdx*1-0x20]
<+369>:	vmovdqu ymm6,YMMWORD PTR [rsi+rdx*1-0x40]
<+375>:	mov    rcx,rdi
<+378>:	or     rdi,0x1f
<+382>:	vmovdqu ymm7,YMMWORD PTR [rsi+rdx*1-0x60]
<+388>:	vmovdqu ymm8,YMMWORD PTR [rsi+rdx*1-0x80]
<+394>:	sub    rsi,rcx
<+397>:	inc    rdi
<+400>:	add    rsi,rdi
<+403>:	lea    rdx,[rcx+rdx*1-0x80]
<+408>:	nop    DWORD PTR [rax+rax*1+0x0]
<+416>:	vmovdqu ymm1,YMMWORD PTR [rsi]
<+420>:	vmovdqu ymm2,YMMWORD PTR [rsi+0x20]
<+425>:	vmovdqu ymm3,YMMWORD PTR [rsi+0x40]
<+430>:	vmovdqu ymm4,YMMWORD PTR [rsi+0x60]
<+435>:	sub    rsi,0xffffffffffffff80
<+439>:	vmovdqa YMMWORD PTR [rdi],ymm1
<+443>:	vmovdqa YMMWORD PTR [rdi+0x20],ymm2
<+448>:	vmovdqa YMMWORD PTR [rdi+0x40],ymm3
<+453>:	vmovdqa YMMWORD PTR [rdi+0x60],ymm4
<+458>:	sub    rdi,0xffffffffffffff80
<+462>:	cmp    rdx,rdi
<+465>:	ja     0x7ffff7f272a0 <__memmove_avx_unaligned_erms+416>
<+467>:	vmovdqu YMMWORD PTR [rdx+0x60],ymm5
<+472>:	vmovdqu YMMWORD PTR [rdx+0x40],ymm6
<+477>:	vmovdqu YMMWORD PTR [rdx+0x20],ymm7
<+482>:	vmovdqu YMMWORD PTR [rdx],ymm8
<+486>:	vmovdqu YMMWORD PTR [rcx],ymm0
<+490>:	vzeroupper
<+493>:	ret
<+494>:	xchg   ax,ax
<+496>:	test   rcx,rcx
<+499>:	je     0x7ffff7f272ea <__memmove_avx_unaligned_erms+490>
<+501>:	vmovdqu ymm5,YMMWORD PTR [rsi+0x20]
<+506>:	vmovdqu ymm6,YMMWORD PTR [rsi+0x40]
<+511>:	lea    rcx,[rdi+rdx*1-0x81]
<+519>:	vmovdqu ymm7,YMMWORD PTR [rsi+0x60]
<+524>:	vmovdqu ymm8,YMMWORD PTR [rsi+rdx*1-0x20]
<+530>:	sub    rsi,rdi
<+533>:	and    rcx,0xffffffffffffffe0
<+537>:	add    rsi,rcx
<+540>:	nop    DWORD PTR [rax+0x0]
<+544>:	vmovdqu ymm1,YMMWORD PTR [rsi+0x60]
<+549>:	vmovdqu ymm2,YMMWORD PTR [rsi+0x40]
<+554>:	vmovdqu ymm3,YMMWORD PTR [rsi+0x20]
<+559>:	vmovdqu ymm4,YMMWORD PTR [rsi]
<+563>:	add    rsi,0xffffffffffffff80
<+567>:	vmovdqa YMMWORD PTR [rcx+0x60],ymm1
<+572>:	vmovdqa YMMWORD PTR [rcx+0x40],ymm2
<+577>:	vmovdqa YMMWORD PTR [rcx+0x20],ymm3
<+582>:	vmovdqa YMMWORD PTR [rcx],ymm4
<+586>:	add    rcx,0xffffffffffffff80
<+590>:	cmp    rdi,rcx
<+593>:	jb     0x7ffff7f27320 <__memmove_avx_unaligned_erms+544>
<+595>:	vmovdqu YMMWORD PTR [rdi],ymm0
<+599>:	vmovdqu YMMWORD PTR [rdi+0x20],ymm5
<+604>:	vmovdqu YMMWORD PTR [rdi+0x40],ymm6
<+609>:	vmovdqu YMMWORD PTR [rdi+0x60],ymm7
<+614>:	vmovdqu YMMWORD PTR [rdx+rdi*1-0x20],ymm8
<+620>:	vzeroupper
<+623>:	ret
<+624>:	data16 cs nop WORD PTR [rax+rax*1+0x0]
<+635>:	nop    DWORD PTR [rax+rax*1+0x0]
<+640>:	vmovdqu ymm1,YMMWORD PTR [rsi+0x20]
<+645>:	test   ecx,0xe00
<+651>:	jne    0x7ffff7f273f2 <__memmove_avx_unaligned_erms+754>
<+653>:	mov    r9,rcx
<+656>:	lea    rcx,[rsi+rdx*1-0x1]
<+661>:	or     rsi,0x3f
<+665>:	lea    rdi,[rsi+r9*1+0x1]
<+670>:	sub    rcx,rsi
<+673>:	inc    rsi
<+676>:	rep movs BYTE PTR es:[rdi],BYTE PTR ds:[rsi]
<+678>:	vmovdqu YMMWORD PTR [r8],ymm0
<+683>:	vmovdqu YMMWORD PTR [r8+0x20],ymm1
<+689>:	vzeroupper
<+692>:	ret
<+693>:	data16 cs nop WORD PTR [rax+rax*1+0x0]
<+704>:	mov    rcx,rdi												; Copy pointer to dest in rcx
<+707>:	sub    rcx,rsi												; rcx = dest - src
<+710>:	cmp    rcx,rdx												; if (rcx < size)
<+713>:	jb     0x7ffff7f272f0 <__memmove_avx_unaligned_erms+496>	;	 jump
<+719>:	mov    r8,rdi												; Copy pointer to dest in r8
<+722>:	cmp    rdx,QWORD PTR [rip+0x80e4f]        # 0x7ffff7fa8228 <__x86_rep_movsb_stop_threshold>
<+729>:	jae    0x7ffff7f27420 <__memmove_avx_unaligned_erms+800>	; if(size >= __x86_rep_movsb_stop_threshold) jump
<+731>:	test   BYTE PTR [rip+0x80e3e],0x1        # 0x7ffff7fa8220 <__x86_string_control>
<+738>:	je     0x7ffff7f27380 <__memmove_avx_unaligned_erms+640>	; if(__x86_string_control & 0x1 == 0) jump
<+740>:	cmp    ecx,0xffffffc0										; if(rcx > 0xffffffc0)
<+743>:	ja     0x7ffff7f2726b <__memmove_avx_unaligned_erms+363>	;	 jump
<+749>:	vmovdqu ymm1,YMMWORD PTR [rsi+0x20]							; Load bytes 32-63 of src into ymm1
<+754>:	sub    rsi,rdi												; rsi = src - dest
<+757>:	add    rdi,0x3f												; rdi = dest + 63 bytes
<+761>:	lea    rcx,[r8+rdx*1]										; rcx = dest + size
<+765>:	and    rdi,0xffffffffffffffc0								; Align rdi on 64 bytes
<+769>:	add    rsi,rdi												; rsi = (src - dest + (dest{64 bytes aligned}))
<+772>:	sub    rcx,rdi												; rcx = dest + size - dest{64 bytes aligned}
<+775>:	rep movs BYTE PTR es:[rdi],BYTE PTR ds:[rsi] 				; OUR REP MOVSB INSTRUCTION !!!
<+777>:	vmovdqu YMMWORD PTR [r8],ymm0								; Copy ymm0 content to dest
<+782>:	vmovdqu YMMWORD PTR [r8+0x20],ymm1							; Copy ymm1 content to dest
<+788>:	vzeroupper													; Reset for SSE2
<+791>:	ret
<+792>:	nop    DWORD PTR [rax+rax*1+0x0]
<+800>:	mov    r11,QWORD PTR [rip+0x80e09]        # 0x7ffff7fa8230 <__x86_shared_non_temporal_threshold>
<+807>:	cmp    rdx,r11
<+810>:	jb     0x7ffff7f27251 <__memmove_avx_unaligned_erms+337>
<+816>:	neg    rcx
<+819>:	cmp    rdx,rcx
<+822>:	ja     0x7ffff7f2726b <__memmove_avx_unaligned_erms+363>
<+828>:	vmovdqu ymm1,YMMWORD PTR [rsi+0x20]
<+833>:	vmovdqu YMMWORD PTR [rdi],ymm0
<+837>:	vmovdqu YMMWORD PTR [rdi+0x20],ymm1
<+842>:	mov    r8,rdi
<+845>:	and    r8,0x3f
<+849>:	sub    r8,0x40
<+853>:	sub    rsi,r8
<+856>:	sub    rdi,r8
<+859>:	add    rdx,r8
<+862>:	not    ecx
<+864>:	mov    r10,rdx
<+867>:	test   ecx,0xf00
<+873>:	je     0x7ffff7f275f0 <__memmove_avx_unaligned_erms+1264>
<+879>:	shl    r11,0x4
<+883>:	cmp    rdx,r11
<+886>:	jae    0x7ffff7f275f0 <__memmove_avx_unaligned_erms+1264>
<+892>:	and    edx,0x1fff
<+898>:	shr    r10,0xd
<+902>:	cs nop WORD PTR [rax+rax*1+0x0]
<+912>:	mov    ecx,0x20
<+917>:	prefetcht0 BYTE PTR [rsi+0x80]
<+924>:	prefetcht0 BYTE PTR [rsi+0xc0]
<+931>:	prefetcht0 BYTE PTR [rsi+0x100]
<+938>:	prefetcht0 BYTE PTR [rsi+0x140]
<+945>:	prefetcht0 BYTE PTR [rsi+0x1080]
<+952>:	prefetcht0 BYTE PTR [rsi+0x10c0]
<+959>:	prefetcht0 BYTE PTR [rsi+0x1100]
<+966>:	prefetcht0 BYTE PTR [rsi+0x1140]
<+973>:	vmovdqu ymm0,YMMWORD PTR [rsi]
<+977>:	vmovdqu ymm1,YMMWORD PTR [rsi+0x20]
<+982>:	vmovdqu ymm2,YMMWORD PTR [rsi+0x40]
<+987>:	vmovdqu ymm3,YMMWORD PTR [rsi+0x60]
<+992>:	vmovdqu ymm4,YMMWORD PTR [rsi+0x1000]
<+1000>:	vmovdqu ymm5,YMMWORD PTR [rsi+0x1020]
<+1008>:	vmovdqu ymm6,YMMWORD PTR [rsi+0x1040]
<+1016>:	vmovdqu ymm7,YMMWORD PTR [rsi+0x1060]
<+1024>:	sub    rsi,0xffffffffffffff80
<+1028>:	vmovntdq YMMWORD PTR [rdi],ymm0
<+1032>:	vmovntdq YMMWORD PTR [rdi+0x20],ymm1
<+1037>:	vmovntdq YMMWORD PTR [rdi+0x40],ymm2
<+1042>:	vmovntdq YMMWORD PTR [rdi+0x60],ymm3
<+1047>:	vmovntdq YMMWORD PTR [rdi+0x1000],ymm4
<+1055>:	vmovntdq YMMWORD PTR [rdi+0x1020],ymm5
<+1063>:	vmovntdq YMMWORD PTR [rdi+0x1040],ymm6
<+1071>:	vmovntdq YMMWORD PTR [rdi+0x1060],ymm7
<+1079>:	sub    rdi,0xffffffffffffff80
<+1083>:	dec    ecx
<+1085>:	jne    0x7ffff7f27495 <__memmove_avx_unaligned_erms+917>
<+1091>:	add    rdi,0x1000
<+1098>:	add    rsi,0x1000
<+1105>:	dec    r10
<+1108>:	jne    0x7ffff7f27490 <__memmove_avx_unaligned_erms+912>
<+1114>:	sfence
<+1117>:	cmp    edx,0x80
<+1123>:	jbe    0x7ffff7f275ba <__memmove_avx_unaligned_erms+1210>
<+1125>:	prefetcht0 BYTE PTR [rsi+0x80]
<+1132>:	prefetcht0 BYTE PTR [rsi+0xc0]
<+1139>:	prefetcht0 BYTE PTR [rdi+0x80]
<+1146>:	prefetcht0 BYTE PTR [rdi+0xc0]
<+1153>:	vmovdqu ymm0,YMMWORD PTR [rsi]
<+1157>:	vmovdqu ymm1,YMMWORD PTR [rsi+0x20]
<+1162>:	vmovdqu ymm2,YMMWORD PTR [rsi+0x40]
<+1167>:	vmovdqu ymm3,YMMWORD PTR [rsi+0x60]
<+1172>:	sub    rsi,0xffffffffffffff80
<+1176>:	add    edx,0xffffff80
<+1179>:	vmovdqa YMMWORD PTR [rdi],ymm0
<+1183>:	vmovdqa YMMWORD PTR [rdi+0x20],ymm1
<+1188>:	vmovdqa YMMWORD PTR [rdi+0x40],ymm2
<+1193>:	vmovdqa YMMWORD PTR [rdi+0x60],ymm3
<+1198>:	sub    rdi,0xffffffffffffff80
<+1202>:	cmp    edx,0x80
<+1208>:	ja     0x7ffff7f27565 <__memmove_avx_unaligned_erms+1125>
<+1210>:	vmovdqu ymm0,YMMWORD PTR [rsi+rdx*1-0x80]
<+1216>:	vmovdqu ymm1,YMMWORD PTR [rsi+rdx*1-0x60]
<+1222>:	vmovdqu ymm2,YMMWORD PTR [rsi+rdx*1-0x40]
<+1228>:	vmovdqu ymm3,YMMWORD PTR [rsi+rdx*1-0x20]
<+1234>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x80],ymm0
<+1240>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x60],ymm1
<+1246>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x40],ymm2
<+1252>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x20],ymm3
<+1258>:	vzeroupper
<+1261>:	ret
<+1262>:	xchg   ax,ax
<+1264>:	and    edx,0x3fff
<+1270>:	shr    r10,0xe
<+1274>:	nop    WORD PTR [rax+rax*1+0x0]
<+1280>:	mov    ecx,0x20
<+1285>:	prefetcht0 BYTE PTR [rsi+0x80]
<+1292>:	prefetcht0 BYTE PTR [rsi+0xc0]
<+1299>:	prefetcht0 BYTE PTR [rsi+0x1080]
<+1306>:	prefetcht0 BYTE PTR [rsi+0x10c0]
<+1313>:	prefetcht0 BYTE PTR [rsi+0x2080]
<+1320>:	prefetcht0 BYTE PTR [rsi+0x20c0]
<+1327>:	prefetcht0 BYTE PTR [rsi+0x3080]
<+1334>:	prefetcht0 BYTE PTR [rsi+0x30c0]
<+1341>:	vmovdqu ymm0,YMMWORD PTR [rsi]
<+1345>:	vmovdqu ymm1,YMMWORD PTR [rsi+0x20]
<+1350>:	vmovdqu ymm2,YMMWORD PTR [rsi+0x40]
<+1355>:	vmovdqu ymm3,YMMWORD PTR [rsi+0x60]
<+1360>:	vmovdqu ymm4,YMMWORD PTR [rsi+0x1000]
<+1368>:	vmovdqu ymm5,YMMWORD PTR [rsi+0x1020]
<+1376>:	vmovdqu ymm6,YMMWORD PTR [rsi+0x1040]
<+1384>:	vmovdqu ymm7,YMMWORD PTR [rsi+0x1060]
<+1392>:	vmovdqu ymm8,YMMWORD PTR [rsi+0x2000]
<+1400>:	vmovdqu ymm9,YMMWORD PTR [rsi+0x2020]
<+1408>:	vmovdqu ymm10,YMMWORD PTR [rsi+0x2040]
<+1416>:	vmovdqu ymm11,YMMWORD PTR [rsi+0x2060]
<+1424>:	vmovdqu ymm12,YMMWORD PTR [rsi+0x3000]
<+1432>:	vmovdqu ymm13,YMMWORD PTR [rsi+0x3020]
<+1440>:	vmovdqu ymm14,YMMWORD PTR [rsi+0x3040]
<+1448>:	vmovdqu ymm15,YMMWORD PTR [rsi+0x3060]
<+1456>:	sub    rsi,0xffffffffffffff80
<+1460>:	vmovntdq YMMWORD PTR [rdi],ymm0
<+1464>:	vmovntdq YMMWORD PTR [rdi+0x20],ymm1
<+1469>:	vmovntdq YMMWORD PTR [rdi+0x40],ymm2
<+1474>:	vmovntdq YMMWORD PTR [rdi+0x60],ymm3
<+1479>:	vmovntdq YMMWORD PTR [rdi+0x1000],ymm4
<+1487>:	vmovntdq YMMWORD PTR [rdi+0x1020],ymm5
<+1495>:	vmovntdq YMMWORD PTR [rdi+0x1040],ymm6
<+1503>:	vmovntdq YMMWORD PTR [rdi+0x1060],ymm7
<+1511>:	vmovntdq YMMWORD PTR [rdi+0x2000],ymm8
<+1519>:	vmovntdq YMMWORD PTR [rdi+0x2020],ymm9
<+1527>:	vmovntdq YMMWORD PTR [rdi+0x2040],ymm10
<+1535>:	vmovntdq YMMWORD PTR [rdi+0x2060],ymm11
<+1543>:	vmovntdq YMMWORD PTR [rdi+0x3000],ymm12
<+1551>:	vmovntdq YMMWORD PTR [rdi+0x3020],ymm13
<+1559>:	vmovntdq YMMWORD PTR [rdi+0x3040],ymm14
<+1567>:	vmovntdq YMMWORD PTR [rdi+0x3060],ymm15
<+1575>:	sub    rdi,0xffffffffffffff80
<+1579>:	dec    ecx
<+1581>:	jne    0x7ffff7f27605 <__memmove_avx_unaligned_erms+1285>
<+1587>:	add    rdi,0x3000
<+1594>:	add    rsi,0x3000
<+1601>:	dec    r10
<+1604>:	jne    0x7ffff7f27600 <__memmove_avx_unaligned_erms+1280>
<+1610>:	sfence
<+1613>:	cmp    edx,0x80
<+1619>:	jbe    0x7ffff7f277aa <__memmove_avx_unaligned_erms+1706>
<+1621>:	prefetcht0 BYTE PTR [rsi+0x80]
<+1628>:	prefetcht0 BYTE PTR [rsi+0xc0]
<+1635>:	prefetcht0 BYTE PTR [rdi+0x80]
<+1642>:	prefetcht0 BYTE PTR [rdi+0xc0]
<+1649>:	vmovdqu ymm0,YMMWORD PTR [rsi]
<+1653>:	vmovdqu ymm1,YMMWORD PTR [rsi+0x20]
<+1658>:	vmovdqu ymm2,YMMWORD PTR [rsi+0x40]
<+1663>:	vmovdqu ymm3,YMMWORD PTR [rsi+0x60]
<+1668>:	sub    rsi,0xffffffffffffff80
<+1672>:	add    edx,0xffffff80
<+1675>:	vmovdqa YMMWORD PTR [rdi],ymm0
<+1679>:	vmovdqa YMMWORD PTR [rdi+0x20],ymm1
<+1684>:	vmovdqa YMMWORD PTR [rdi+0x40],ymm2
<+1689>:	vmovdqa YMMWORD PTR [rdi+0x60],ymm3
<+1694>:	sub    rdi,0xffffffffffffff80
<+1698>:	cmp    edx,0x80
<+1704>:	ja     0x7ffff7f27755 <__memmove_avx_unaligned_erms+1621>
<+1706>:	vmovdqu ymm0,YMMWORD PTR [rsi+rdx*1-0x80]
<+1712>:	vmovdqu ymm1,YMMWORD PTR [rsi+rdx*1-0x60]
<+1718>:	vmovdqu ymm2,YMMWORD PTR [rsi+rdx*1-0x40]
<+1724>:	vmovdqu ymm3,YMMWORD PTR [rsi+rdx*1-0x20]
<+1730>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x80],ymm0
<+1736>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x60],ymm1
<+1742>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x40],ymm2
<+1748>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x20],ymm3
<+1754>:	vzeroupper
<+1757>:	ret

Remember: we are copying 4 MiB of data, so rdx = 0x400000.

__memmove_avx_unaligned_erms performs several tests on the rdx register, which contains the size of our buffer (the 3rd argument in the System V ABI).

Here is each test performed and the result of the conditional jump:

Condition | Taken?
size is greater than 32 bytes
size is greater than 64 bytes
size is greater than __x86_rep_movsb_threshold
dest doesn’t start in src
size is greater than __x86_rep_movsb_stop_threshold
LSB(__x86_string_control) == 0
dest - src is greater than 2^32-64

✅ = taken ❌ = not taken
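
Putting these branches together, the decision chain looks roughly like the following C sketch. The helper names are hypothetical and the structure is simplified; only the two tunables are real symbols taken from the listing above.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers standing for the code paths of the listing above. */
extern void *copy_small(void *dst, const void *src, size_t n);        /* n <= 32          */
extern void *copy_two_ymm(void *dst, const void *src, size_t n);      /* 32 < n <= 64     */
extern void *copy_ymm_loop(void *dst, const void *src, size_t n);     /* unrolled vmovdqu */
extern void *copy_backward(void *dst, const void *src, size_t n);     /* overlapping case */
extern void *copy_non_temporal(void *dst, const void *src, size_t n); /* vmovntdq path    */
extern void *copy_rep_movsb(void *dst, const void *src, size_t n);    /* rep movsb path   */
extern size_t __x86_rep_movsb_threshold;       /* glibc-internal tunables, only */
extern size_t __x86_rep_movsb_stop_threshold;  /* declared here for the sketch  */

void *memmove_dispatch(void *dst, const void *src, size_t n) {
    if (n <= 32)
        return copy_small(dst, src, n);
    if (n <= 64)
        return copy_two_ymm(dst, src, n);
    if (n <= __x86_rep_movsb_threshold)
        return copy_ymm_loop(dst, src, n);
    if ((uintptr_t)dst - (uintptr_t)src < n)        /* dst starts inside src */
        return copy_backward(dst, src, n);
    if (n >= __x86_rep_movsb_stop_threshold)
        return copy_non_temporal(dst, src, n);
    return copy_rep_movsb(dst, src, n);             /* our rep movsb */
}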

We can check the value of the thresholds and of the “string control” constant.

Value of the comparison constants in gdb

And after aligning rsi and rdi on 64 bytes, we call our beloved rep movsb. 😄
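
In C, the alignment dance performed by those few instructions could be sketched like this (a simplified sketch using GCC inline assembly, not glibc's code; the first 64 bytes of dst are assumed to have already been saved in ymm0/ymm1 and stored afterwards, as in the listing):

#include <stddef.h>
#include <stdint.h>

/* Copy the tail of the buffer with rep movsb after aligning the
   destination on 64 bytes, mirroring the register comments above. */
static void rep_movsb_aligned(char *dst, const char *src, size_t n) {
    char *d = (char *)(((uintptr_t)dst + 63) & ~(uintptr_t)63); /* rdi, 64-byte aligned        */
    const char *s = src + (d - dst);                            /* rsi, shifted the same way   */
    size_t count = (size_t)((dst + n) - d);                     /* rcx = dest + size - rdi     */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(s), "+c"(count)
                     :
                     : "memory");
}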

But note that this is very dependent on our memory layout:

  • with size >= 0x700000, memcpy would copy the buffer in blocks of 2^14 bytes using all the ymm registers and prefetching, then finish the copy in 128-byte blocks.
  • with 32 < size <= 64, it would be a simple copy using only ymm0 and ymm1 (first and last 32 bytes)
  • etc…

We can find an explanation for the __x86_rep_movsb_threshold value in this commit message:

In the process of optimizing memcpy for AMD machines, we have found the vector move operations are outperforming enhanced REP MOVSB for data transfers above the L2 cache size on Zen3 architectures. To handle this use case, we are adding an upper bound parameter on enhanced REP MOVSB:’__x86_rep_movsb_stop_threshold'.

As per large-bench results, we are configuring this parameter to the L2 cache size for AMD machines and applicable from Zen3 architecture supporting the ERMS feature. For architectures other than AMD, it is the computed value of non-temporal threshold parameter.

Note that sysdeps/x86/dl-cacheinfo.h has changed since this commit, and nowadays __x86_rep_movsb_stop_threshold is equal to __x86_shared_non_temporal_threshold on all x86 architectures.

You can read all the details of the choices made for the __x86_shared_non_temporal_threshold tunable in sysdeps/x86/dl-cacheinfo.h:

  /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
     of the chip's cache (depending on `cachesize_non_temporal_divisor` which
     is microarch specific. The default is 1/4). For most Intel processors
     with an initial release date between 2017 and 2023, a thread's
     typical share of the cache is from 18-64MB. Using a reasonable size
     fraction of L3 is meant to estimate the point where non-temporal stores
     begin out-competing REP MOVSB. As well the point where the fact that
     non-temporal stores are forced back to main memory would already occurred
     to the majority of the lines in the copy. Note, concerns about the entire
     L3 cache being evicted by the copy are mostly alleviated by the fact that
     modern HW detects streaming patterns and provides proper LRU hints so that
     the maximum thrashing capped at 1/associativity. */
  unsigned long int non_temporal_threshold
      = shared / cachesize_non_temporal_divisor;

  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
     likely have incorrect/incomplete cache info in which case, default to
     3/4 * per-thread L3 to avoid regressions.  */
  unsigned long int non_temporal_threshold_lowbound
      = shared_per_thread * 3 / 4;
  if (non_temporal_threshold < non_temporal_threshold_lowbound)
    non_temporal_threshold = non_temporal_threshold_lowbound;

  /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
     a higher risk of actually thrashing the cache as they don't have a HW LRU
     hint. As well, their performance in highly parallel situations is
     noticeably worse. Zhaoxin processors are an exception, the lowbound is not
     suitable for them based on actual test data.  */
  if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS)
      && cpu_features->basic.kind != arch_kind_zhaoxin)
    non_temporal_threshold = non_temporal_threshold_lowbound;
  /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
     'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
     if that operation cannot overflow. Minimum of 0x4040 (16448) because the
     L(large_memset_4x) loops need 64-byte to cache align and enough space for
     at least 1 iteration of 4x PAGE_SIZE unrolled loop.  Both values are
     reflected in the manual.  */
  unsigned long int maximum_non_temporal_threshold = SIZE_MAX >> 4;
  unsigned long int minimum_non_temporal_threshold = 0x4040;

  /* If `non_temporal_threshold` less than `minimum_non_temporal_threshold`
     it most likely means we failed to detect the cache info. We don't want
     to default to `minimum_non_temporal_threshold` as such a small value,
     while correct, has bad performance. We default to 64MB as reasonable
     default bound. 64MB is likely conservative in that most/all systems would
     choose a lower value so it should never forcing non-temporal stores when
     they otherwise wouldn't be used.  */
  if (non_temporal_threshold < minimum_non_temporal_threshold)
    non_temporal_threshold = 64 * 1024 * 1024;
  else if (non_temporal_threshold > maximum_non_temporal_threshold)
    non_temporal_threshold = maximum_non_temporal_threshold;

When the block to copy is greater than __x86_shared_non_temporal_threshold, the __memmove_avx_unaligned_erms implementation uses an unrolled loop over the AVX2 registers, with prefetching and non-temporal stores performed by the vmovntdq instruction (VEX-encoded move non-temporal double quadword).

The “Intel® 64 and IA-32 Architectures Optimization Reference Manual: Volume 1” section 9.6.1 gives hints on when to use non-temporal stores:

Use non-temporal stores in the cases when the data to be stored is:

  • Write-once (non-temporal).
  • Too large and thus cause cache thrashing.

Non-temporal stores do not invoke a cache line allocation, which means they are not write-allocate. As a result, caches are not polluted and no dirty writeback is generated to compete with useful data bandwidth. Without using non-temporal stores, bus bandwidth will suffer when caches start to be thrashed because of dirty writebacks.
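
As an illustration of the technique (a minimal sketch with AVX intrinsics, not glibc's code), _mm256_stream_si256 emits vmovntdq and the final _mm_sfence plays the role of the sfence seen in the listing above; the sketch assumes dst is 32-byte aligned and n is a multiple of 32:

#include <immintrin.h>
#include <stddef.h>

/* Minimal non-temporal copy: dst must be 32-byte aligned and n a
   multiple of 32 for this simplified version. */
static void stream_copy(void *dst, const void *src, size_t n) {
    __m256i *d = (__m256i *)dst;
    const __m256i *s = (const __m256i *)src;
    for (size_t i = 0; i < n / 32; i++) {
        __m256i v = _mm256_loadu_si256(s + i);  /* regular (cached) load   */
        _mm256_stream_si256(d + i, v);          /* vmovntdq: bypass caches */
    }
    _mm_sfence();   /* order the non-temporal stores before later stores */
}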

The glibc memcpy is a piece of code carefully tailored to be as efficient as possible and to adapt to every hardware configuration and memory layout it may encounter.

Benchmark 2: memset

As you can see with godbolt, gcc will inline a call to memset using the rep stosq instruction under certain circumstances.

This is already a good hint about the efficiency of this instruction.

#include <string.h>

#define BUF_LEN (1 << 13)
char a[BUF_LEN];
int value;

int main(void) {
  memset(a, value, BUF_LEN);
  return 0;
}
gcc -O1 -g -o memset memset.c
; int main(void) {
;   memset(a, value, BUF_LEN);
  401106:       mov    edi,0x404060
  40110b:       movzx  eax,BYTE PTR [rip+0x2f2e]        # 404040 <value>
  401112:       movabs rdx,0x101010101010101
  40111c:       imul   rax,rdx
  401120:       mov    ecx,0x400
  401125:       rep stos QWORD PTR es:[rdi],rax
;   return 0;
; }
  401128:       mov    eax,0x0
  40112d:       ret

To broadcast the first byte of value into all the bytes of the rax register, the compiler puts 0x0101010101010101 in rdx and multiplies rax by rdx.
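
In plain C, the same broadcast is a one-line multiplication (a minimal illustration, not code taken from gcc or glibc):

#include <stdint.h>

/* Replicate the low byte of c into every byte of a 64-bit word,
   exactly like the imul with 0x0101010101010101 above. */
static uint64_t broadcast_byte(unsigned int c) {
    return (uint64_t)(c & 0xff) * 0x0101010101010101ULL;
}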

I wrote a macro to reproduce this:


%macro setup_memset_rax 0
    mov rcx, rdx                ; rcx = n
	movzx eax, sil			    ; Mask the first byte of esi
	mov rdx, 0x0101010101010101	; Load the multiplication mask
	mul rdx 				    ; Replicate al in all rax register (rdx is erased)
    mov rdx, rcx                ; rdx = n
%endmacro


Once again, I wrote several versions of the memset function: an unoptimized for loop, the glibc memset, rep stosb (forward, and backward with the direction flag set), byte and quadword mov loops, rep stosq, and AVX and AVX2 versions.

void *__attribute__((optimize("O1"))) memset_dummy(void *dst, unsigned int c,
                                                   unsigned long n) {
  void *const ret = dst;
  for (int i = 0; i < n; i++)
    *((char *)dst++) = (char)c;
  return ret;
}
section .text
; void *memset_stosb(rdi: void s[.n], rsi: int c, rdx: size_t n);
memset_stosb:
	mov rcx, rdx	; rcx = n
	movzx eax, sil	; eax = c
	mov rsi, rdi	; rdx = s
	rep stosb		; for(; rcx != 0; rcx--)
					;	 *(rdi++) = al
	mov rax, rsi	; return s
	ret
; void *memset_stosb_std(rdi: void s[.n], rsi: int c, rdx: size_t n);
memset_stosb_std:
    std             ; Set direction flag
	movzx eax, sil	; eax = c
	mov rsi, rdi	; rsi = s
	mov rcx, rdx	; rcx = n
    sub rdx, 1      ; rdx = n - 1
    add rdi, rdx    ; rdi = s + (n - 1)
	rep stosb		; for(; rcx != 0; rcx--)
					;	 *(rdi--) = al
	mov rax, rsi	; return s
    cld             ;
	ret
; void *memset_movb(rdi: void s[.n], rsi: int c, rdx: size_t n);
memset_movb:
    test rdx, rdx           ; if(rdx == 0)
    jz .exit                ;    goto .exit
    xor ecx, ecx            ; rcx = 0
.loop:
    mov [rdi + rcx], sil    ; s[rcx] = sil
    add rcx, 1              ; rcx++
    sub rdx, 1              ; rdx--
    jnz .loop               ; if(rdx != 0) goto .loop
.exit:
    mov rax, rdi            ; return s
    ret
; Macro used to set less than a quadword
; The macro take two parameters:
; - A register containing the byte count to copy
; - A register containing the current 8-byte aligned offset
%macro memset_epilog_qword 2
    cmp %1, 4
    jb %%word
    mov [rdi + %2], eax
    add %2, 4
    sub %1, 4
%%word:
    cmp %1, 2
    jb %%byte
    mov [rdi + %2], ax
    add %2, 2
    sub %1, 2
%%byte:
    test %1, %1
    jz %%exit
    mov [rdi + %2], al
%%exit:
%endmacro

; void *memset_stosq(rdi: void s[.n], rsi: int c, rdx: size_t n);
memset_stosq:
    setup_memset_rax
    mov rsi, rdi            ; rsi = s
    cmp rdx, 8              ; if(n < 8)
    jb .end                 ;    goto .end
    mov rcx, rdi            ; rcx = s
    add rdx, rdi            ; rdx = dst + n
    stosq                   ; *(rdi++) = rax
    and rdi, -8             ; rdi = align(dst + 8, 8)
    sub rdx, rdi            ; rdx = dst + n - align(dst + 8, 8)
    mov rcx, rdx            ; rcx = dst + n - align(dst + 8, 8)
    and rdx, (8 - 1)        ; rdx = (dst + n - align(dst + 8, 8)) % 8
    shr rcx, 3              ; rcx = (dst + n - align(dst + 8, 8)) / 8
	rep stosq				; for(; rcx != 0; rcx -= 1)
							;	 *(rdi++) = rax
.end:
    xor ecx, ecx
    memset_epilog_qword rdx, rcx
.exit:
    mov rax, rsi            ; Return s
	ret
; void *memset_movq(rdi: void s[.n], rsi: int c, rdx: size_t n);
memset_movq:
    cmp rdx, 8
    jb .end
    setup_memset_rax
    xor ecx, ecx
.loop:
    mov [rdi + rcx], rax
    add rcx, 8
    sub rdx, 8
    cmp rdx, 8
    jae .loop
.end:
    memset_epilog_qword rdx, rcx
    mov rax, rdi
    ret
; Macro used to set less than a 16 bytes
; The macro take two parameters:
; - A register containing the byte count to copy
; - A register containing the current 16-byte aligned offset
%macro memset_epilog_avx 2
    vmovq rax, xmm0
    cmp %1, 8
    jb %%dword
    mov [rdi + %2], rax
    add %2, 8
    sub %1, 8
%%dword:
    memset_epilog_qword %1, %2
%endmacro

; void *memset_avx(rdi: void s[.n], rsi: int c, rdx: size_t n);
memset_avx:
	movzx esi, sil					; Copy sil in eax with zero-extend
	vmovd xmm0, esi					; Copy the low dword (eax) into xmm0
	vpxor xmm1, xmm1				; Load a shuffle mask filled with zero indexes
	vpshufb xmm0, xmm0, xmm1		; Broadcast the first byte of xmm0 to all xmm0 bytes 
    cmp rdx, 16
    jb .end
    vmovdqu [rdi], xmm0
    lea rcx, [rdi + 15]             ; rcx = src + 15
    and rcx, -16                    ; rcx = align((src + 15), 16)
    sub rcx, rdi                    ; rcx = align((src + 15), 16) - src
    sub rdx, rcx                     ; rdx = src + n - align((src + 15), 16)
    cmp rdx, 16
    jb .end
.loop:
	vmovdqa [rdi + rcx], xmm0		; Copy the 16 bytes of xmm0 to s
    add rcx, 16                     ; rcx += 16
    sub rdx, 16                     ; rdx -= 16
    cmp rdx, 16                     ; if(rdx >= 16)
	jae .loop						;	 goto .loop
.end:
    memset_epilog_avx rdx, rcx
.exit:
	mov rax, rdi					; rax = s
	ret
; Macro used to set less than a 32 bytes
; The macro take two parameters:
; - A register containing the byte count to copy
; - A register containing the current 32-byte aligned offset
%macro memset_epilog_avx2 2
    cmp %1, 16
    jb %%qword
    vmovdqu [rdi + %2], xmm0
    add %2, 16
    sub %1, 16
%%qword:
    memset_epilog_avx %1, %2
%endmacro

; void *memset_avx2(rdi: void s[.n], rsi: int c, rdx: size_t n);
memset_avx2:
    xor ecx, ecx
	movzx esi, sil					; Copy sil in eax with zero-extend
	vmovd xmm1, esi					; Copy the low dword (eax) into xmm0
	vpbroadcastb ymm0, xmm1			; Broadcast the first byte of xmm1 to all ymm0
    cmp rdx, 32
    jb .end
    vmovdqu [rdi], ymm0
    lea rcx, [rdi + 31]             ; rcx = src + 31
    and rcx, -32                    ; rcx = align((src + 31), 32)
    sub rcx, rdi                    ; rcx = align((src + 31), 32) - src
    sub rdx, rcx                     ; rdx = src + n - align((src + 31), 32)
    cmp rdx, 32                     ; if(rdx < 32)
    jb .end                         ;    goto .end
.loop:
	vmovdqa [rdi + rcx], ymm0		; Copy the 32 bytes of ymm0 to s
    add rcx, 32                     ; rcx += 32
    sub rdx, 32                     ; rdx -= 32
    cmp rdx, 32                     ; if(rdx >= 32)
	jae .loop						;	 goto .loop
.end:
    memset_epilog_avx2 rdx, rcx
	mov rax, rdi					; rax = s
	vzeroupper
	ret
Info
Like for the memcpy implementation, without the __attribute__((optimize("O1"))) gcc would replace the loop in our custom function with a call to memset.

Here are the results of the benchmarks on my computer:

Benchmark of memset implementations

We can see that, except for the dummy and movb versions (which are very similar implementations) and stosb_std (which copies backwards), all the implementations manage to use the fast-string operations and show the same low execution time.

Once again, we can explore the glibc implementation using gdb. Like memcpy, it is an indirect function which, in my case, resolves to __memset_avx2_unaligned_erms.

As for memcpy, I highlighted the code reached during the execution of my memset on a 4 MiB block.

__memset_avx2_unaligned_erms

<+0>:	endbr64
<+4>:	vmovd  xmm0,esi
<+8>:	mov    rax,rdi
<+11>:	cmp    rdx,0x20
<+15>:	jb     0x7ffff7f27be0 <__memset_avx2_unaligned_erms+224>
<+21>:	vpbroadcastb ymm0,xmm0
<+26>:	cmp    rdx,0x40
<+30>:	ja     0x7ffff7f27b40 <__memset_avx2_unaligned_erms+64>
<+32>:	vmovdqu YMMWORD PTR [rdi],ymm0
<+36>:	vmovdqu YMMWORD PTR [rdi+rdx*1-0x20],ymm0
<+42>:	vzeroupper
<+45>:	ret
<+46>:	xchg   ax,ax
<+48>:	vmovdqu YMMWORD PTR [rdi-0x40],ymm0
<+53>:	vmovdqu YMMWORD PTR [rdi-0x20],ymm0
<+58>:	vzeroupper
<+61>:	ret
<+62>:	xchg   ax,ax
<+64>:	cmp    rdx,QWORD PTR [rip+0x796c9]        # 0x7ffff7fa1210 <__x86_rep_stosb_threshold>
<+71>:	ja     0x7ffff7f27bc0 <__memset_avx2_unaligned_erms+192>
<+73>:	vmovdqu YMMWORD PTR [rdi],ymm0
<+77>:	vmovdqu YMMWORD PTR [rdi+0x20],ymm0
<+82>:	add    rdi,rdx
<+85>:	cmp    rdx,0x80
<+92>:	jbe    0x7ffff7f27b30 <__memset_avx2_unaligned_erms+48>
<+94>:	vmovdqu YMMWORD PTR [rax+0x40],ymm0
<+99>:	vmovdqu YMMWORD PTR [rax+0x60],ymm0
<+104>:	add    rdi,0xffffffffffffff80
<+108>:	cmp    rdx,0x100
<+115>:	jbe    0x7ffff7f27ba0 <__memset_avx2_unaligned_erms+160>
<+117>:	lea    rdx,[rax+0x80]
<+124>:	and    rdx,0xffffffffffffffe0
<+128>:	vmovdqa YMMWORD PTR [rdx],ymm0
<+132>:	vmovdqa YMMWORD PTR [rdx+0x20],ymm0
<+137>:	vmovdqa YMMWORD PTR [rdx+0x40],ymm0
<+142>:	vmovdqa YMMWORD PTR [rdx+0x60],ymm0
<+147>:	sub    rdx,0xffffffffffffff80
<+151>:	cmp    rdx,rdi
<+154>:	jb     0x7ffff7f27b80 <__memset_avx2_unaligned_erms+128>
<+156>:	nop    DWORD PTR [rax+0x0]
<+160>:	vmovdqu YMMWORD PTR [rdi],ymm0
<+164>:	vmovdqu YMMWORD PTR [rdi+0x20],ymm0
<+169>:	vmovdqu YMMWORD PTR [rdi+0x40],ymm0
<+174>:	vmovdqu YMMWORD PTR [rdi+0x60],ymm0
<+179>:	vzeroupper
<+182>:	ret
<+183>:	nop    WORD PTR [rax+rax*1+0x0]
<+192>:	movzx  eax,sil
<+196>:	mov    rcx,rdx
<+199>:	mov    rdx,rdi
<+202>:	rep stos BYTE PTR es:[rdi],al
<+204>:	mov    rax,rdx
<+207>:	vzeroupper
<+210>:	ret
<+211>:	data16 cs nop WORD PTR [rax+rax*1+0x0]
<+222>:	xchg   ax,ax
<+224>:	vpbroadcastb xmm0,xmm0
<+229>:	cmp    edx,0x10
<+232>:	jge    0x7ffff7f27c00 <__memset_avx2_unaligned_erms+256>
<+234>:	cmp    edx,0x8
<+237>:	jge    0x7ffff7f27c10 <__memset_avx2_unaligned_erms+272>
<+239>:	cmp    edx,0x4
<+242>:	jge    0x7ffff7f27c20 <__memset_avx2_unaligned_erms+288>
<+244>:	cmp    edx,0x1
<+247>:	jg     0x7ffff7f27c30 <__memset_avx2_unaligned_erms+304>
<+249>:	jl     0x7ffff7f27bfe <__memset_avx2_unaligned_erms+254>
<+251>:	mov    BYTE PTR [rdi],sil
<+254>:	ret
<+255>:	nop
<+256>:	vmovdqu XMMWORD PTR [rdi],xmm0
<+260>:	vmovdqu XMMWORD PTR [rdi+rdx*1-0x10],xmm0
<+266>:	ret
<+267>:	nop    DWORD PTR [rax+rax*1+0x0]
<+272>:	vmovq  QWORD PTR [rdi],xmm0
<+276>:	vmovq  QWORD PTR [rdi+rdx*1-0x8],xmm0
<+282>:	ret
<+283>:	nop    DWORD PTR [rax+rax*1+0x0]
<+288>:	vmovd  DWORD PTR [rdi],xmm0
<+292>:	vmovd  DWORD PTR [rdi+rdx*1-0x4],xmm0
<+298>:	ret
<+299>:	nop    DWORD PTR [rax+rax*1+0x0]
<+304>:	mov    BYTE PTR [rdi],sil
<+307>:	mov    BYTE PTR [rdi+0x1],sil
<+311>:	mov    BYTE PTR [rdi+rdx*1-0x1],sil
<+316>:	ret

This is way shorter than the code for memcpy, and there is one big difference: there is a minimum size threshold above which rep stosb is used (__x86_rep_stosb_threshold), but no threshold for a maximum size.

Use of rep stosb

__x86_rep_stosb_threshold = 0x800 = 2048

This means that for blocks larger than 2 KiB (sixty-four 32-byte ymm stores), the memset implementation switches to the rep stosb instruction instead of the unrolled ymm loop.
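
Written as a C sketch (the helper names are hypothetical, only the threshold is a real symbol from the disassembly above), the dispatch boils down to:

#include <stddef.h>

/* Hypothetical helpers standing for the code paths seen above. */
extern void *memset_small(void *dst, int c, size_t n);     /* n < 32            */
extern void *memset_two_ymm(void *dst, int c, size_t n);   /* head + tail store */
extern void *memset_ymm_loop(void *dst, int c, size_t n);  /* unrolled vmovdqa  */
extern void *memset_rep_stosb(void *dst, int c, size_t n); /* rep stosb         */
extern size_t __x86_rep_stosb_threshold;                   /* glibc tunable     */

void *memset_dispatch(void *dst, int c, size_t n) {
    if (n < 32)
        return memset_small(dst, c, n);
    if (n <= 64)
        return memset_two_ymm(dst, c, n);
    if (n > __x86_rep_stosb_threshold)
        return memset_rep_stosb(dst, c, n);
    return memset_ymm_loop(dst, c, n);
}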

Benchmark 3: strlen

If, like many C developers, you’re used to passing around zero-terminated strings, your code may call strlen a bunch of times so this function had better be fast.

We saw a way to write a strlen function with the repne scasb instruction. But there are other ways to write a strlen function using vectorization, either on plain 64-bit registers or with SIMD extensions.

To find a null byte in a quadword we can define a helper macro which takes two parameters:

  • A destination register whose bytes will have their sign bit set only if the corresponding byte in the source register is null.
  • A source register in which to look for null bytes.


%macro find_zero 2
	mov %1, %2
	not %2
	sub %1, [_find_zero_one_vec]
	and %1, %2
	and %1, [_find_zero_sign_mask]
%endmacro

section .rodata
_find_zero_one_vec: dq 0x0101010101010101
_find_zero_sign_mask: dq 0x8080808080808080

This macro uses a bit trick documented here to find a null byte in a quadword. We will use it in the strlen_movq implementation. We cannot use it to code a repne scasq implementation because scasq can only stop on a full 64-bit equality.
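
For reference, here is the same trick written in C (a sketch of the classic "has a zero byte" test, equivalent to the macro above):

#include <stdint.h>

/* Return a word whose sign bits flag the zero bytes of v. Bits above the
   first zero byte may be spurious, but the lowest set bit always marks
   the first zero byte, which is all strlen needs: bsf/tzcnt then gives
   its index. */
static uint64_t find_zero_byte(uint64_t v) {
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}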

#include <stddef.h>

size_t __attribute__((optimize("O1"))) strlen_dummy(const char *s) {
  unsigned int size = 0;
  while (*(s++) != 0)
    size++;
  return size;
}
section .text
; size_t strlen_scasb(rdi: const char *s)
strlen_scasb:
	xor eax, eax	; rax = 0
	mov rcx, -1		; rcx = -1
	repnz scasb		; for(; rcx != 0 and ZF == false; rcx--)
					;	cmp rax, *(rdi++)
	not rcx			; before this insn rcx = - (len(rdi) + 2)
	dec rcx			; after this insn rcx = ~(- (len(rdi) + 2)) - 1
					;                     = -(- (len(rdi) + 2)) - 1 - 1
					;                     = len(rdi)
	mov rax, rcx	; rax = len(rdi)
	ret
; size_t strlen_movb(rdi: const char *s)
strlen_movb:
	xor eax, eax	; rax = 0
.loop:
    movzx rdx, byte [rdi + rax]
    test edx, edx
    jz .end
    add rax, 1
    jmp .loop
.end:
	ret
; size_t strlen_movq(rdi: const char *s)
strlen_movq:
	xor eax, eax            ; rax = 0
	mov rdx, [rdi]          ; rdx = *s
    find_zero rcx, rdx
    jnz .end
    mov rax, rdi
	and rax, -8	            ; rax = align(s, 8)
    sub rax, rdi
.loop:
	add rax, 0x8            ; rax += 8
    find_zero rcx, rdx      ; Put 1 in sign bits of rcx bytes where rdx bytes are 0
    mov rdx, [rdi + rax]    ; rdx = next quadword of s
	jz .loop			    ; loop while no zero byte was found
    sub rax, 8
.end:
	bsf rcx, rcx		    ; find right-most set sign bit
	shr rcx, 3			    ; divide by 8 = byte index
	add rax, rcx		    ; rax += rcx
	ret
; size_t strlen_avx(rdi: const char *s)
strlen_avx:
	xor eax, eax		; rax = 0
	vpxor xmm0, xmm0	; xmm0 = 0
	vmovdqu xmm1, [rdi] ; xmm1 = *(rdi)
	vpcmpeqb xmm1, xmm0	; xmm1 = byte_mask(i => xmm1[i] == xmm0[i])
	vpmovmskb edx, xmm1	; edx = bitmask(i => sign(xmm1[i]))
	test edx, edx		; if(edx != 0)
	jnz .end			; 	goto .end
    mov rax, rdi
	and rax, -16		; align rdi on 16 bytes
    sub rax, rdi
.loop:
	add rax, 0x10		; rax += 16
	vmovdqa xmm1, [rdi + rax]	; xmm1 = *(rdi)
	vpcmpeqb xmm1, xmm0	; xmm1 = byte_mask(i => xmm1[i] == xmm0[i])
	vpmovmskb edx, xmm1	; edx = bitmask(i => sign(xmm1[i]))
	test edx, edx		; if(edx == 0)
	jz .loop			; 	goto .loop
.end:
	bsf edx, edx		; edx = index of the first set bit in edx
	add rax, rdx		; rax += rdx
	ret
; size_t strlen_avx2(rdi: const char *s)
strlen_avx2:
	xor eax, eax		; rax = 0
	vpxor ymm0, ymm0	; ymm0 = 0
	vmovdqu ymm1, [rdi]	; ymm1 = *(rdi)
	vpcmpeqb ymm1, ymm0	; ymm1 = byte_mask(i => ymm1[i] == ymm0[i])
	vpmovmskb edx, ymm1	; edx = bitmask(i => sign(ymm1[i]))
	test edx, edx		; if(edx != 0)
	jnz .end			; 	goto .end
    mov rax, rdi
	and rax, -32		; rax = align(s, 32)
    sub rax, rdi
.loop:
	add rax, 0x20		; rax += 32
	vmovdqa ymm1, [rdi + rax]	; ymm1 = *(rdi)
	vpcmpeqb ymm1, ymm0	; ymm1 = byte_mask(i => ymm1[i] == ymm0[i])
	vpmovmskb edx, ymm1	; edx = bitmask(i => sign(ymm1[i]))
	test edx, edx		; if(edx == 0)
	jz .loop			; 	goto .loop
.end:
	bsf edx, edx		; edx = index of the first set bit in edx
	add rax, rdx		; rax += rdx
	vzeroupper
	ret
; size_t strlen_sse2(rdi: const char *s)
strlen_sse2:
	xor eax, eax        ; rax = 0
	pxor xmm0, xmm0		; xmm0 = 0
	movdqu xmm1, [rdi]	; xmm1 = *(rdi)
	pcmpeqb xmm1, xmm0	; xmm1 = byte_mask(i => xmm1[i] == xmm0[i])
	pmovmskb edx, xmm1	; edx = bitmask(i => sign(xmm1[i]))
	test edx, edx		; if(edx != 0)
	jnz .end			; 	goto .end
    mov rax, rdi
	and rax, -16		; align rdi on 16 bytes
    sub rax, rdi
.loop:
	add rax, 0x10		; rax += 16
	movdqa xmm1, [rdi + rax]	; xmm1 = *(rdi)
	pcmpeqb xmm1, xmm0	; xmm1 = byte_mask(i => xmm1[i] == xmm0[i])
	pmovmskb edx, xmm1	; edx = bitmask(i => sign(xmm1[i]))
	test edx, edx		; if(edx == 0)
	jz .loop			; 	goto .loop
.end:
	bsf edx, edx		; edx = index of the first set bit in edx
	add rax, rdx		; rax += rdx
	ret
Info
Like for the memcpy implementation, without the __attribute__((optimize("O1"))) gcc would replace the loop in our custom function with a call to strlen.

Here are the results of the benchmarks on my computer:

Benchmark of strlen implementations

As we can see, the repne scasb instruction is by far the worst way to write a strlen function, while the standard strlen function outperforms all of our vectorized implementations.

We can again have a look at the glibc implementation of strlen for AVX2:

__strlen_avx2

<+0>:	endbr64
<+4>:	mov    eax,edi
<+6>:	mov    rdx,rdi
<+9>:	vpxor  xmm0,xmm0,xmm0
<+13>:	and    eax,0xfff
<+18>:	cmp    eax,0xfe0
<+23>:	ja     0x7ffff7f29f90 <__strlen_avx2+336>
<+29>:	vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi]
<+33>:	vpmovmskb eax,ymm1
<+37>:	test   eax,eax
<+39>:	je     0x7ffff7f29ec0 <__strlen_avx2+128>
<+41>:	tzcnt  eax,eax
<+45>:	vzeroupper
<+48>:	ret
<+49>:	data16 cs nop WORD PTR [rax+rax*1+0x0]
<+60>:	nop    DWORD PTR [rax+0x0]
<+64>:	tzcnt  eax,eax
<+68>:	sub    edi,edx
<+70>:	inc    edi
<+72>:	add    eax,edi
<+74>:	vzeroupper
<+77>:	ret
<+78>:	xchg   ax,ax
<+80>:	tzcnt  eax,eax
<+84>:	sub    edi,edx
<+86>:	add    edi,0x21
<+89>:	add    eax,edi
<+91>:	vzeroupper
<+94>:	ret
<+95>:	nop
<+96>:	tzcnt  eax,eax
<+100>:	sub    edi,edx
<+102>:	add    edi,0x41
<+105>:	add    eax,edi
<+107>:	vzeroupper
<+110>:	ret
<+111>:	nop
<+112>:	tzcnt  eax,eax
<+116>:	sub    edi,edx
<+118>:	add    edi,0x61
<+121>:	add    eax,edi
<+123>:	vzeroupper
<+126>:	ret
<+127>:	nop
<+128>:	or     rdi,0x1f
<+132>:	vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi+0x1]
<+137>:	vpmovmskb eax,ymm1
<+141>:	test   eax,eax
<+143>:	jne    0x7ffff7f29e80 <__strlen_avx2+64>
<+145>:	vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi+0x21]
<+150>:	vpmovmskb eax,ymm1
<+154>:	test   eax,eax
<+156>:	jne    0x7ffff7f29e90 <__strlen_avx2+80>
<+158>:	vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi+0x41]
<+163>:	vpmovmskb eax,ymm1
<+167>:	test   eax,eax
<+169>:	jne    0x7ffff7f29ea0 <__strlen_avx2+96>
<+171>:	vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi+0x61]
<+176>:	vpmovmskb eax,ymm1
<+180>:	test   eax,eax
<+182>:	jne    0x7ffff7f29eb0 <__strlen_avx2+112>
<+184>:	inc    rdi
<+187>:	or     rdi,0x7f
<+191>:	nop
<+192>:	vmovdqa ymm1,YMMWORD PTR [rdi+0x1]
<+197>:	vpminub ymm2,ymm1,YMMWORD PTR [rdi+0x21]
<+202>:	vmovdqa ymm3,YMMWORD PTR [rdi+0x41]
<+207>:	vpminub ymm4,ymm3,YMMWORD PTR [rdi+0x61]
<+212>:	vpminub ymm5,ymm4,ymm2
<+216>:	vpcmpeqb ymm5,ymm0,ymm5
<+220>:	vpmovmskb ecx,ymm5
<+224>:	sub    rdi,0xffffffffffffff80
<+228>:	test   ecx,ecx
<+230>:	je     0x7ffff7f29f00 <__strlen_avx2+192>
<+232>:	vpcmpeqb ymm1,ymm0,ymm1
<+236>:	vpmovmskb eax,ymm1
<+240>:	sub    rdi,rdx
<+243>:	test   eax,eax
<+245>:	jne    0x7ffff7f29f70 <__strlen_avx2+304>
<+247>:	vpcmpeqb ymm2,ymm0,ymm2
<+251>:	vpmovmskb eax,ymm2
<+255>:	test   eax,eax
<+257>:	jne    0x7ffff7f29f80 <__strlen_avx2+320>
<+259>:	vpcmpeqb ymm3,ymm0,ymm3
<+263>:	vpmovmskb eax,ymm3
<+267>:	shl    rcx,0x20
<+271>:	or     rax,rcx
<+274>:	tzcnt  rax,rax
<+279>:	sub    rdi,0x3f
<+283>:	add    rax,rdi
<+286>:	vzeroupper
<+289>:	ret
<+290>:	data16 cs nop WORD PTR [rax+rax*1+0x0]
<+301>:	nop    DWORD PTR [rax]
<+304>:	tzcnt  eax,eax
<+308>:	sub    rdi,0x7f
<+312>:	add    rax,rdi
<+315>:	vzeroupper
<+318>:	ret
<+319>:	nop
<+320>:	tzcnt  eax,eax
<+324>:	sub    rdi,0x5f
<+328>:	add    rax,rdi
<+331>:	vzeroupper
<+334>:	ret
<+335>:	nop
<+336>:	or     rdi,0x1f
<+340>:	vpcmpeqb ymm1,ymm0,YMMWORD PTR [rdi-0x1f]
<+345>:	vpmovmskb eax,ymm1
<+349>:	sarx   eax,eax,edx
<+354>:	test   eax,eax
<+356>:	je     0x7ffff7f29ec4 <__strlen_avx2+132>
<+362>:	tzcnt  eax,eax
<+366>:	vzeroupper
<+369>:	ret

After aligning the pointer on 32 bytes, __strlen_avx2 checks four 32-byte blocks one by one, then enters an unrolled loop aligned on 128 bytes to find the null byte. In the loop body, the function loads 4 * 32 = 128 bytes at a time, reduces them with the vpminub (Vector Packed MINimum Unsigned Byte) instruction and compares the result against zero using the vpcmpeqb (Vector Packed CoMPare EQual Byte) instruction.
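
A hedged C intrinsics sketch of that reduction step (not glibc's source; it assumes p points to a 32-byte aligned position inside the string and that reading 128 bytes from it is safe):

#include <immintrin.h>

/* One iteration of the 4 x 32-byte check: min-reduce the four blocks,
   then compare the reduction against zero. A zero byte anywhere in the
   128 bytes yields a non-zero mask. */
static int block_has_zero(const char *p) {
    __m256i a = _mm256_load_si256((const __m256i *)p);
    __m256i b = _mm256_load_si256((const __m256i *)(p + 32));
    __m256i c = _mm256_load_si256((const __m256i *)(p + 64));
    __m256i d = _mm256_load_si256((const __m256i *)(p + 96));
    __m256i m = _mm256_min_epu8(_mm256_min_epu8(a, b),
                                _mm256_min_epu8(c, d));       /* vpminub   */
    __m256i z = _mm256_cmpeq_epi8(m, _mm256_setzero_si256()); /* vpcmpeqb  */
    return _mm256_movemask_epi8(z);                           /* vpmovmskb */
}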

Note that in order to find the null byte position in the register, the __strlen_avx2 function uses the tzcnt (Count Trailing Zeroes) instruction instead of the bsf (Bit Scan Forward) instruction.

These two instructions are very similar.
According to the “Intel® 64 and IA-32 Architectures Software Developer’s Manual - Volume 2: Instruction Set Reference 4.3”:

TZCNT counts the number of trailing least significant zero bits in source operand (second operand) and returns the result in destination operand (first operand). TZCNT is an extension of the BSF instruction. The key difference between TZCNT and BSF instruction is that TZCNT provides operand size as output when source operand is zero while in the case of BSF instruction, if source operand is zero, the content of destination operand are undefined. On processors that do not support TZCNT, the instruction byte encoding is executed as BSF.
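
A quick illustration in C (assuming a CPU with BMI1 and compiling with -mbmi; the _tzcnt_u32 intrinsic maps to tzcnt):

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    printf("%u\n", _tzcnt_u32(0x48)); /* 3: index of the lowest set bit    */
    printf("%u\n", _tzcnt_u32(0));    /* 32: defined even for a zero input */
    /* bsf (or __builtin_ctz) leaves the result undefined for a zero
       input; as the quoted manual explains, tzcnt instead returns the
       operand size. */
    return 0;
}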

Benchmark 4: memcmp

We can benchmark the rep cmps instruction by writing several implementations of the memcmp function.

This function can also be implemented using vpcmpestri.

#include <stddef.h>
int memcmp_dummy(const void *s1, const void *s2, size_t n) {
  int diff;
  for (int i = 0; i < n; i++) {
    diff = *((char *)s1++) - *((char *)s2++);
    if (diff != 0)
      return diff < 0 ? -1 : 1;
  }
  return 0;
}
; int memcmp_cmpsb(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n);
memcmp_cmpsb:
	mov rcx, rdx	; rcx = n
	xor eax, eax	; Set return value to zero
	xor edx, edx
	repe cmpsb		; for(; rcx != 0 and ZF == true; rcx--)
					;	cmp *(rsi++), *(rdi++)
	setb al			; if(ZF == false and CF == true) al = 1
	seta dl			; if(ZF == false and CF == false) dl = 1
	sub eax, edx	; return al - dl
	ret
; int memcmp_movb(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n);
memcmp_movb:
	xor eax, eax	; rax = 0
	xor ecx, ecx	; rcx = 0
.loop:
    movzx r8d, byte [rdi+rcx]
    movzx r9d, byte [rsi+rcx]
    add rcx, 1
    cmp r8b, r9b
    jne .end
    sub rdx, 1
    jnz .loop
.end:
	mov edx, 0		; rdx = 0 (mov preserves the flags set by cmp)
	setb al			; if(ZF == false and CF == true) al = 1
	seta dl			; if(ZF == false and CF == false) dl = 1
	sub eax, edx	; return al - dl
	ret
; int memcmp_cmpsq_unaligned(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n);
memcmp_cmpsq_unaligned:
	lea rcx, [rdx + 0x7]	; rcx = n
	and rdx, (8 - 1)		; rdx = n % 8
	shr rcx, 3				; rcx = n / 8
	xor eax, eax			; rax = 0
	repe cmpsq				; for(; rcx != 0 and ZF == true; rcx--)
							;	cmp *(rsi++), *(rdi++)
    je .exit                ; If no difference was found return
	mov r8, [rdi - 0x8]	    ; Read the last (unaligned) quadword of s1
	mov r9, [rsi - 0x8]	    ; Read the last (unaligned) quadword of s2
    test rcx, rcx           ; if(rcx != 0)
    jnz .cmp                ;    goto .cmp
    shl rdx, 3              ; rdx = 8 * (n % 8)
    jz .cmp                 ; if(rdx == 0) goto .cmp
    bzhi r8, r8, rdx        ; Keep only the low 8 * (n % 8) bits of r8
    bzhi r9, r9, rdx        ; Keep only the low 8 * (n % 8) bits of r9
.cmp:
	xor ecx, ecx			; rcx = 0 (it may still hold the repe counter)
	bswap r8				; Convert r8 to big-endian for lexical comparison
	bswap r9				; Convert r9 to big-endian for lexical comparison
	cmp r8, r9				; Lexical comparison of quadwords
	seta al					; if (r8 > r9) al = 1
	setb cl					; if (r8 < r9) cl = 1
	sub eax, ecx			; return eax - ecx
.exit:
	ret
; int memcmp_cmpsq(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n);
memcmp_cmpsq:
	xor eax, eax				; rax = 0
    mov rcx, rdx    			; rcx = n
	shr rcx, 3					; rcx = n / 8
    cmpsq                       ; Compare first quadword (increments rsi and rdi)
    jne .end                    ; If quadwords differ jump to .end
    mov rax, rsi                ; rax = s2
    and rax, -8                 ; rax = align(s2, 8)
    sub rax, rsi                ; rax = - (s2 - align(s2, 8))
    add rdx, 8                  ; rdx = n - 8
    sub rdx, rax                ; rdx = n - 8 + (s2 - align(s2, 8))
    add rsi, rax                ; rsi = s2 - (s2 - align(s2, 8))
    add rdi, rax                ; rdi = s1 - (s2 - align(s2, 8))
	xor eax, eax				; rax = 0
	repe cmpsq					; for(; rcx != 0 and ZF == true; rcx--)
								;	cmp *(rsi++), *(rdi++)
    je .exit
.end:
	mov r8, [rdi - 0x8]	        ; Read the last quadword of s1
	mov r9, [rsi - 0x8]	        ; Read the last quadword of s2
    test rcx, rcx               ; if(rcx != 0)
    jnz .cmp                    ;    goto .cmp
    and rdx, (8 - 1)            ; rdx = (n - 8 + (s2 - align(s2, 8))) % 8
    mov rcx, 8                  ; rcx = 8
    sub rcx, rdx                ; rcx -= rdx
    shl rcx, 3                  ; rcx *= 8
    shl r8, cl                  ; r8 <<= cl
    shl r9, cl                  ; r9 <<= cl
.cmp:
    xor ecx, ecx                ; rcx = 0
	bswap r8					; Convert r8 to big-endian for lexical comparison
	bswap r9					; Convert r9 to big-endian for lexical comparison
	cmp r8, r9					; Lexical comparison of quadwords
	seta al						; if (r8 > r9) al = 1
	setb cl						; if (r8 < r9) cl = 1
	sub eax, ecx				; return eax - ecx
.exit:
	ret
; int memcmp_avx(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n);
memcmp_avx:
	xor eax, eax					; rax = 0
	test rdx, rdx					; if(n == 0)
	jz .exit						;	 return 0
	vmovdqu xmm2, [rdi]				; xmm2 = *(rdi)
	vmovdqu xmm3, [rsi]				; xmm3 = *(rsi)
	vpcmpeqb xmm0, xmm2, xmm3		; xmm0 = byte_mask(i => xmm2[i] == xmm3[i])
	vpmovmskb r8d, xmm0				; Create a mask from the most significant bit of each byte
	cmp r8d, 0xFFFF					; if(xmm2 != xmm3)
	jne .end						; 	 goto .end
    lea rcx, [rdi + 15]
    and rcx, -16                    ; rcx = align((src + 15), 16)
    sub rcx, rdi                    ; rcx = align((src + 15), 16) - src
    sub rdx, rcx                     ; rdx = src + n - align((src + 15), 16)
.loop:
    vmovdqu xmm3, [rsi + rcx]		; xmm3 = rsi[rcx]
	vmovdqa xmm2, [rdi + rcx]		; xmm2 = rdi[rcx]
    add rcx, 0x10
	vpcmpeqb xmm0, xmm2, xmm3		; xmm0 = byte_mask(i => xmm2[i] == xmm3[i])
	vpmovmskb r8d, xmm0				; Create a mask from the most significant bit of each byte
	cmp r8d, 0xFFFF					; if(xmm2 != xmm3)
	jne .end						; 	 goto .end
    cmp rdx, 0x10
	jb .end						; if(rdx < 16) goto .end
    sub rdx, 0x10
    jmp .loop
.end:
	vpcmpgtb xmm1, xmm2, xmm3		; xmm1 = byte_mask(i => xmm2[i] > xmm3[i])
	vpmovmskb r9d, xmm1				; Create a mask from the most significant bit of each byte
	not r8w							; r8w = bitmask(i => xmm2[i] != xmm3[i])
	test r8w, r8w
	jz .exit
	bsf ecx, r8d					; ecx = index of first differing bytes in xmm1/2
    cmp ecx, edx
    jae .exit

	shr r9d, cl						; Find the corresponding bit in r9d	
	xor edx, edx
	test r9d, 0x1					; If it is set
	setnz al						;	 return 1
	setz dl							; else
	sub eax, edx					;	 return -1
.exit:
	ret
; int memcmp_avx2(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n);
memcmp_avx2:
	xor eax, eax				; rax = 0
	test rdx, rdx				; if(n == 0)
	jz .exit					;	 return 0
	vmovdqu ymm2, [rdi]			; ymm2 = *(rdi)
	vmovdqu ymm3, [rsi]			; ymm3 = *(rsi)
	vpcmpeqb ymm0, ymm2, ymm3	; ymm0 = byte_mask(i => ymm2[i] == ymm3[i])
	vpmovmskb r8d, ymm0			; Create a mask from the most significant bit of each byte
	cmp r8d, 0xFFFFFFFF			; if(ymm2 != ymm3)
	jne .end					; 	 goto .end
    lea rcx, [rdi + 31]
    and rcx, -32                ; rcx = align((src + 31), 32)
    sub rcx, rdi                ; rcx = align((src + 31), 32) - src
    sub rdx, rcx                ; rdx = n - (align((src + 31), 32) - src)
.loop:
    vmovdqu ymm3, [rsi + rcx]	; ymm3 = rsi[rcx]
	vmovdqa ymm2, [rdi + rcx]	; ymm2 = rdi[rcx]
    add rcx, 0x20               ; rcx += 32
	vpcmpeqb ymm0, ymm2, ymm3	; ymm0 = byte_mask(i => ymm2[i] == ymm3[i])
	vpmovmskb r8d, ymm0			; Create a mask from the most significant bit of each byte
	cmp r8d, 0xFFFFFFFF			; if(ymm2 != ymm3)
	jne .end					; 	 goto .end
    cmp rdx, 0x20               ; if(rcx < 32)
	jb .end						;    goto .end
    sub rdx, 0x20               ; rdx -= 32
    jmp .loop
.end:
	vpcmpgtb ymm1, ymm2, ymm3	; ymm1 = byte_mask(i => ymm2[i] > ymm3[i])
	vpmovmskb r9d, ymm1			; Create a mask from the most significant bit of each byte
	not r8d						; r8w = bitmask(i => ymm2[i] != ymm3[i])
	test r8d, r8d
	jz .exit
	bsf ecx, r8d				; ecx = index of first differing bytes in xmm1/2
    cmp ecx, edx
    jae .exit
	shr r9d, cl					; Find the corresponding bit in r9d	

	xor edx, edx
	test r9d, 0x1				; If it is set
	setnz al					;	 return 1
	setz dl						; else
	sub eax, edx				;	 return -1
.exit:
	vzeroupper
	ret
; int memcmp_vpcmpestri_unaligned(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n);
memcmp_vpcmpestri_unaligned:
    xor r10d, r10d
	mov rax, rdx				; rax = n
	test rdx, rdx				; if(n == 0)
	jz .exit					;	 return n
	vmovdqu xmm2, [rdi + r10]	; xmm2 = rdi[r10]
	vmovdqu xmm3, [rsi + r10]	; xmm3 = rsi[r10]
.loop:
	; Compare xmm2 and xmm3 for equality and write the index of the differing byte in ecx
    ; rax contains the length of the xmm2 string, rdx contains the length of the xmm3 string
	; If all byte are the same rcx = 16
	vpcmpestri xmm2, xmm3, BYTEWISE_CMP 
	test cx, 0x10				; if(rcx != 16)
	jz .end						; 	break
	add r10, 0x10				; r10 += 16
	vmovdqu xmm2, [rdi + r10]	; xmm2 = rdi[r10]
	vmovdqu xmm3, [rsi + r10]	; xmm3 = rsi[r10]
	sub rdx, 0x10				; rdx -= 16
	ja .loop					; if(rdx > 0) goto .loop
	xor eax, eax				; rax = 0
	ret							; return
.end:
	xor eax, eax				; rax = 0
    cmp rcx, rdx                ; if(index >= rdx)
    jae .exit                   ;    return;
	xor edx, edx				; rdx = 0
    add r10, rcx                ; r10 += index
	mov r8b, [rdi + r10]		; r8b = rdi[r10]
	mov r9b, [rsi + r10]		; r9b = rsi[r10]
	cmp r8b, r9b				; if(r8b > r9b)
	seta al						; 	return 1
	setb dl						; else
	sub eax, edx				; 	return -1
.exit:
	ret
; int memcmp_vpcmpestri(rdi: const void s1[.n], rsi: const void s2[.n], rdx: size_t n);
memcmp_vpcmpestri:
    xor r10d, r10d
	mov rax, rdx				; rax = n
	test rdx, rdx				; if(n == 0)
	jz .exit					;	 return 0
	vmovdqu xmm2, [rdi]			; xmm2 = *rdi
	vmovdqu xmm3, [rsi]			; xmm3 = *rsi

	; Compare xmm2 and xmm3 for equality and write the index of the differing byte in ecx
    ; rax contains the length of the xmm2 string, rdx contains the length of the xmm3 string
	; If all byte are the same rcx = 16
	vpcmpestri xmm2, xmm3, BYTEWISE_CMP 
	test cx, 0x10				; if(rcx != 16)
	jz .end						; 	break
    lea r10, [rdi + 15]         ; r10 = s1 + 15
    and r10, -16                ; r10 = align(s1 + 15, 16)
    sub r10, rdi                ; r10 = align(s1 + 15, 16) - s1
    sub rdx, r10                ; rdx = n - (align(s1 + 15, 16) - s1)
	vmovdqa xmm2, [rdi + r10]			; xmm2 = rdi[r10]
	vmovdqu xmm3, [rsi + r10]			; xmm3 = rsi[r10]
.loop:
	; Compare xmm2 and xmm3 for equality and write the index of the differing byte in ecx
    ; rax contains the length of the xmm2 string, rdx contains the length of the xmm3 string
	; If all byte are the same rcx = 16
	vpcmpestri xmm2, xmm3, BYTEWISE_CMP 
	test cx, 0x10				; if(rcx != 16)
	jz .end						; 	break

	add r10, 0x10				; r10 += 16

	vmovdqa xmm2, [rdi + r10]			; xmm2 = rdi[r10]
	vmovdqu xmm3, [rsi + r10]			; xmm3 = rsi[r10]

	sub rdx, 0x10				; rdx -= 16
	ja .loop					; if(rdx > 0) goto .loop
	xor eax, eax				; rax = 0
	ret							; return
.end:
	xor eax, eax				; rax = 0
    cmp rcx, rdx                ; if(index >= rdx)
    jae .exit                   ;    return;
	xor edx, edx				; rdx = 0
    add r10, rcx
	mov r8b, [rdi + r10]		; r8b = rdi[rcx]
	mov r9b, [rsi + r10]		; r9b = rsi[rcx]
	cmp r8b, r9b				; if(r8b > r9b)
	seta al						; 	return 1
	setb dl						; else
	sub eax, edx				; 	return -1
.exit:
	ret
Info

Note that even with -O3 the compiler couldn't replace our custom C function with a call to memcmp.

Indeed, memcmp only guarantees the sign of the return value when the data differ, whereas our function promises exactly -1, 0 or 1. Since callers may rely on these exact values, the compiler cannot make the substitution.
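
Here is a small hypothetical example of a caller that would break if the compiler made that substitution:

#include <stddef.h>

int memcmp_dummy(const void *s1, const void *s2, size_t n);

/* This caller relies on the exact -1/0/1 values returned by memcmp_dummy.
   Substituting memcmp would be invalid: memcmp is only required to return
   something negative, zero or positive. */
int compare_plus_one(const void *a, const void *b, size_t n) {
    return memcmp_dummy(a, b, n) + 1;   /* expects exactly 0, 1 or 2 */
}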

Benchmark of memcmp implementations

The repe cmpsb instruction is almost as bad as the repne scasb instruction. But this time we could write an implementation using repe cmpsq which performs almost as well as the SSE and AVX vectorized implementations on my computer.

Tip
When you are in a situation where you could use a cmpsb or cmpsq instruction, you should always prefer making the comparisons with the widest registers available.

Here is the disassembled code of __memcmp_avx2_movbe:

__memcmp_avx2_movbe

<+0>:	endbr64
<+4>:	cmp    rdx,0x20
<+8>:	jb     0x7ffff7f26c80 <__memcmp_avx2_movbe+736>
<+14>:	vmovdqu ymm1,YMMWORD PTR [rsi]
<+18>:	vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi]
<+22>:	vpmovmskb eax,ymm1
<+26>:	inc    eax
<+28>:	jne    0x7ffff7f26a80 <__memcmp_avx2_movbe+224>
<+34>:	cmp    rdx,0x40
<+38>:	jbe    0x7ffff7f26c04 <__memcmp_avx2_movbe+612>
<+44>:	vmovdqu ymm2,YMMWORD PTR [rsi+0x20]
<+49>:	vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi+0x20]
<+54>:	vpmovmskb eax,ymm2
<+58>:	inc    eax
<+60>:	jne    0x7ffff7f26aa0 <__memcmp_avx2_movbe+256>
<+66>:	cmp    rdx,0x80
<+73>:	jbe    0x7ffff7f26bf0 <__memcmp_avx2_movbe+592>
<+79>:	vmovdqu ymm3,YMMWORD PTR [rsi+0x40]
<+84>:	vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdi+0x40]
<+89>:	vpmovmskb eax,ymm3
<+93>:	inc    eax
<+95>:	jne    0x7ffff7f26ac0 <__memcmp_avx2_movbe+288>
<+101>:	vmovdqu ymm4,YMMWORD PTR [rsi+0x60]
<+106>:	vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdi+0x60]
<+111>:	vpmovmskb ecx,ymm4
<+115>:	inc    ecx
<+117>:	jne    0x7ffff7f26afb <__memcmp_avx2_movbe+347>
<+123>:	cmp    rdx,0x100
<+130>:	ja     0x7ffff7f26b10 <__memcmp_avx2_movbe+368>
<+136>:	vmovdqu ymm1,YMMWORD PTR [rsi+rdx*1-0x80]
<+142>:	vmovdqu ymm2,YMMWORD PTR [rsi+rdx*1-0x60]
<+148>:	lea    rdi,[rdi+rdx*1-0x80]
<+153>:	lea    rsi,[rsi+rdx*1-0x80]
<+158>:	vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi]
<+162>:	vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi+0x20]
<+167>:	vmovdqu ymm3,YMMWORD PTR [rsi+0x40]
<+172>:	vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdi+0x40]
<+177>:	vmovdqu ymm4,YMMWORD PTR [rsi+0x60]
<+182>:	vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdi+0x60]
<+187>:	vpand  ymm5,ymm2,ymm1
<+191>:	vpand  ymm6,ymm4,ymm3
<+195>:	vpand  ymm7,ymm6,ymm5
<+199>:	vpmovmskb ecx,ymm7
<+203>:	inc    ecx
<+205>:	jne    0x7ffff7f26ae3 <__memcmp_avx2_movbe+323>
<+207>:	vzeroupper
<+210>:	ret
<+211>:	data16 cs nop WORD PTR [rax+rax*1+0x0]
<+222>:	xchg   ax,ax
<+224>:	tzcnt  eax,eax
<+228>:	movzx  ecx,BYTE PTR [rsi+rax*1]
<+232>:	movzx  eax,BYTE PTR [rdi+rax*1]
<+236>:	sub    eax,ecx
<+238>:	vzeroupper
<+241>:	ret
<+242>:	data16 cs nop WORD PTR [rax+rax*1+0x0]
<+253>:	nop    DWORD PTR [rax]
<+256>:	tzcnt  eax,eax
<+260>:	movzx  ecx,BYTE PTR [rsi+rax*1+0x20]
<+265>:	movzx  eax,BYTE PTR [rdi+rax*1+0x20]
<+270>:	sub    eax,ecx
<+272>:	vzeroupper
<+275>:	ret
<+276>:	data16 cs nop WORD PTR [rax+rax*1+0x0]
<+287>:	nop
<+288>:	tzcnt  eax,eax
<+292>:	movzx  ecx,BYTE PTR [rsi+rax*1+0x40]
<+297>:	movzx  eax,BYTE PTR [rdi+rax*1+0x40]
<+302>:	sub    eax,ecx
<+304>:	vzeroupper
<+307>:	ret
<+308>:	data16 cs nop WORD PTR [rax+rax*1+0x0]
<+319>:	nop
<+320>:	add    rsi,rdi
<+323>:	vpmovmskb eax,ymm1
<+327>:	inc    eax
<+329>:	jne    0x7ffff7f26a80 <__memcmp_avx2_movbe+224>
<+331>:	vpmovmskb eax,ymm2
<+335>:	inc    eax
<+337>:	jne    0x7ffff7f26aa0 <__memcmp_avx2_movbe+256>
<+339>:	vpmovmskb eax,ymm3
<+343>:	inc    eax
<+345>:	jne    0x7ffff7f26ac0 <__memcmp_avx2_movbe+288>
<+347>:	tzcnt  ecx,ecx
<+351>:	movzx  eax,BYTE PTR [rdi+rcx*1+0x60]
<+356>:	movzx  ecx,BYTE PTR [rsi+rcx*1+0x60]
<+361>:	sub    eax,ecx
<+363>:	vzeroupper
<+366>:	ret
<+367>:	nop
<+368>:	lea    rdx,[rdi+rdx*1-0x80]
<+373>:	sub    rsi,rdi
<+376>:	and    rdi,0xffffffffffffffe0
<+380>:	sub    rdi,0xffffffffffffff80
<+384>:	vmovdqu ymm1,YMMWORD PTR [rsi+rdi*1]
<+389>:	vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi]
<+393>:	vmovdqu ymm2,YMMWORD PTR [rsi+rdi*1+0x20]
<+399>:	vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi+0x20]
<+404>:	vmovdqu ymm3,YMMWORD PTR [rsi+rdi*1+0x40]
<+410>:	vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdi+0x40]
<+415>:	vmovdqu ymm4,YMMWORD PTR [rsi+rdi*1+0x60]
<+421>:	vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdi+0x60]
<+426>:	vpand  ymm5,ymm2,ymm1
<+430>:	vpand  ymm6,ymm4,ymm3
<+434>:	vpand  ymm7,ymm6,ymm5
<+438>:	vpmovmskb ecx,ymm7
<+442>:	inc    ecx
<+444>:	jne    0x7ffff7f26ae0 <__memcmp_avx2_movbe+320>
<+446>:	sub    rdi,0xffffffffffffff80
<+450>:	cmp    rdi,rdx
<+453>:	jb     0x7ffff7f26b20 <__memcmp_avx2_movbe+384>
<+455>:	sub    rdi,rdx
<+458>:	cmp    edi,0x60
<+461>:	jae    0x7ffff7f26bd0 <__memcmp_avx2_movbe+560>
<+463>:	vmovdqu ymm3,YMMWORD PTR [rsi+rdx*1+0x40]
<+469>:	cmp    edi,0x40
<+472>:	jae    0x7ffff7f26bc0 <__memcmp_avx2_movbe+544>
<+474>:	vmovdqu ymm1,YMMWORD PTR [rsi+rdx*1]
<+479>:	vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdx]
<+483>:	vmovdqu ymm2,YMMWORD PTR [rsi+rdx*1+0x20]
<+489>:	vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdx+0x20]
<+494>:	vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdx+0x40]
<+499>:	vmovdqu ymm4,YMMWORD PTR [rsi+rdx*1+0x60]
<+505>:	vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdx+0x60]
<+510>:	vpand  ymm5,ymm2,ymm1
<+514>:	vpand  ymm6,ymm4,ymm3
<+518>:	vpand  ymm7,ymm6,ymm5
<+522>:	vpmovmskb ecx,ymm7
<+526>:	mov    rdi,rdx
<+529>:	inc    ecx
<+531>:	jne    0x7ffff7f26ae0 <__memcmp_avx2_movbe+320>
<+537>:	vzeroupper
<+540>:	ret
<+541>:	nop    DWORD PTR [rax]
<+544>:	vpcmpeqb ymm3,ymm3,YMMWORD PTR [rdx+0x40]
<+549>:	vpmovmskb eax,ymm3
<+553>:	inc    eax
<+555>:	jne    0x7ffff7f26c20 <__memcmp_avx2_movbe+640>
<+557>:	nop    DWORD PTR [rax]
<+560>:	vmovdqu ymm4,YMMWORD PTR [rsi+rdx*1+0x60]
<+566>:	vpcmpeqb ymm4,ymm4,YMMWORD PTR [rdx+0x60]
<+571>:	vpmovmskb eax,ymm4
<+575>:	inc    eax
<+577>:	jne    0x7ffff7f26c24 <__memcmp_avx2_movbe+644>
<+579>:	vzeroupper
<+582>:	ret
<+583>:	nop    WORD PTR [rax+rax*1+0x0]
<+592>:	vmovdqu ymm1,YMMWORD PTR [rsi+rdx*1-0x40]
<+598>:	vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi+rdx*1-0x40]
<+604>:	vpmovmskb eax,ymm1
<+608>:	inc    eax
<+610>:	jne    0x7ffff7f26c40 <__memcmp_avx2_movbe+672>
<+612>:	vmovdqu ymm1,YMMWORD PTR [rsi+rdx*1-0x20]
<+618>:	vpcmpeqb ymm1,ymm1,YMMWORD PTR [rdi+rdx*1-0x20]
<+624>:	vpmovmskb eax,ymm1
<+628>:	inc    eax
<+630>:	jne    0x7ffff7f26c60 <__memcmp_avx2_movbe+704>
<+632>:	vzeroupper
<+635>:	ret
<+636>:	nop    DWORD PTR [rax+0x0]
<+640>:	sub    rdx,0x20
<+644>:	tzcnt  eax,eax
<+648>:	add    rax,rdx
<+651>:	movzx  ecx,BYTE PTR [rsi+rax*1+0x60]
<+656>:	movzx  eax,BYTE PTR [rax+0x60]
<+660>:	sub    eax,ecx
<+662>:	vzeroupper
<+665>:	ret
<+666>:	nop    WORD PTR [rax+rax*1+0x0]
<+672>:	tzcnt  eax,eax
<+676>:	add    eax,edx
<+678>:	movzx  ecx,BYTE PTR [rsi+rax*1-0x40]
<+683>:	movzx  eax,BYTE PTR [rdi+rax*1-0x40]
<+688>:	sub    eax,ecx
<+690>:	vzeroupper
<+693>:	ret
<+694>:	cs nop WORD PTR [rax+rax*1+0x0]
<+704>:	tzcnt  eax,eax
<+708>:	add    eax,edx
<+710>:	movzx  ecx,BYTE PTR [rsi+rax*1-0x20]
<+715>:	movzx  eax,BYTE PTR [rdi+rax*1-0x20]
<+720>:	sub    eax,ecx
<+722>:	vzeroupper
<+725>:	ret
<+726>:	cs nop WORD PTR [rax+rax*1+0x0]
<+736>:	cmp    edx,0x1
<+739>:	jbe    0x7ffff7f26d00 <__memcmp_avx2_movbe+864>
<+741>:	mov    eax,edi
<+743>:	or     eax,esi
<+745>:	and    eax,0xfff
<+750>:	cmp    eax,0xfe0
<+755>:	jg     0x7ffff7f26cc0 <__memcmp_avx2_movbe+800>
<+757>:	vmovdqu ymm2,YMMWORD PTR [rsi]
<+761>:	vpcmpeqb ymm2,ymm2,YMMWORD PTR [rdi]
<+765>:	vpmovmskb eax,ymm2
<+769>:	inc    eax
<+771>:	bzhi   edx,eax,edx
<+776>:	jne    0x7ffff7f26a80 <__memcmp_avx2_movbe+224>
<+782>:	xor    eax,eax
<+784>:	vzeroupper
<+787>:	ret
<+788>:	data16 cs nop WORD PTR [rax+rax*1+0x0]
<+799>:	nop
<+800>:	cmp    edx,0x10
<+803>:	jae    0x7ffff7f26d43 <__memcmp_avx2_movbe+931>
<+805>:	cmp    edx,0x8
<+808>:	jae    0x7ffff7f26d20 <__memcmp_avx2_movbe+896>
<+810>:	cmp    edx,0x4
<+813>:	jb     0x7ffff7f26d80 <__memcmp_avx2_movbe+992>
<+819>:	movbe  eax,DWORD PTR [rdi]
<+823>:	movbe  ecx,DWORD PTR [rsi]
<+827>:	shl    rax,0x20
<+831>:	shl    rcx,0x20
<+835>:	movbe  edi,DWORD PTR [rdi+rdx*1-0x4]
<+841>:	movbe  esi,DWORD PTR [rsi+rdx*1-0x4]
<+847>:	or     rax,rdi
<+850>:	or     rcx,rsi
<+853>:	sub    rax,rcx
<+856>:	jne    0x7ffff7f26d10 <__memcmp_avx2_movbe+880>
<+858>:	ret
<+859>:	nop    DWORD PTR [rax+rax*1+0x0]
<+864>:	jb     0x7ffff7f26d16 <__memcmp_avx2_movbe+886>
<+866>:	movzx  ecx,BYTE PTR [rsi]
<+869>:	movzx  eax,BYTE PTR [rdi]
<+872>:	sub    eax,ecx
<+874>:	ret
<+875>:	nop    DWORD PTR [rax+rax*1+0x0]
<+880>:	sbb    eax,eax
<+882>:	or     eax,0x1
<+885>:	ret
<+886>:	xor    eax,eax
<+888>:	ret
<+889>:	nop    DWORD PTR [rax+0x0]
<+896>:	movbe  rax,QWORD PTR [rdi]
<+901>:	movbe  rcx,QWORD PTR [rsi]
<+906>:	sub    rax,rcx
<+909>:	jne    0x7ffff7f26d10 <__memcmp_avx2_movbe+880>
<+911>:	movbe  rax,QWORD PTR [rdi+rdx*1-0x8]
<+918>:	movbe  rcx,QWORD PTR [rsi+rdx*1-0x8]
<+925>:	sub    rax,rcx
<+928>:	jne    0x7ffff7f26d10 <__memcmp_avx2_movbe+880>
<+930>:	ret
<+931>:	vmovdqu xmm2,XMMWORD PTR [rsi]
<+935>:	vpcmpeqb xmm2,xmm2,XMMWORD PTR [rdi]
<+939>:	vpmovmskb eax,xmm2
<+943>:	sub    eax,0xffff
<+948>:	jne    0x7ffff7f26a80 <__memcmp_avx2_movbe+224>
<+954>:	vmovdqu xmm2,XMMWORD PTR [rsi+rdx*1-0x10]
<+960>:	lea    rdi,[rdi+rdx*1-0x10]
<+965>:	lea    rsi,[rsi+rdx*1-0x10]
<+970>:	vpcmpeqb xmm2,xmm2,XMMWORD PTR [rdi]
<+974>:	vpmovmskb eax,xmm2
<+978>:	sub    eax,0xffff
<+983>:	jne    0x7ffff7f26a80 <__memcmp_avx2_movbe+224>
<+989>:	ret
<+990>:	xchg   ax,ax
<+992>:	movzx  eax,WORD PTR [rdi]
<+995>:	movzx  ecx,WORD PTR [rsi]
<+998>:	bswap  eax
<+1000>:	bswap  ecx
<+1002>:	shr    eax,1
<+1004>:	shr    ecx,1
<+1006>:	movzx  edi,BYTE PTR [rdi+rdx*1-0x1]
<+1011>:	movzx  esi,BYTE PTR [rsi+rdx*1-0x1]
<+1016>:	or     eax,edi
<+1018>:	or     ecx,esi
<+1020>:	sub    eax,ecx
<+1022>:	ret

…aaaand there is no use of the repe cmps instruction 😥

The __memcmp_avx2_movbe function only uses unrolled loops of AVX2 instructions (plus movbe loads for the small sizes).
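
To make the disassembly above easier to follow, here is a minimal sketch of the core pattern it keeps repeating (my own illustration, not the glibc source; the label name is made up): compare 32 bytes at a time with vpcmpeqb, turn the result into a bitmask with vpmovmskb, and use the inc + tzcnt trick to locate the first differing byte.

; Sketch: compare one 32-byte block of s1 (rdi) and s2 (rsi),
; assuming both buffers have at least 32 readable bytes.
memcmp_32_sketch:
	vmovdqu   ymm0, [rsi]           ; load 32 bytes of s2
	vpcmpeqb  ymm0, ymm0, [rdi]     ; 0xFF where the bytes are equal, 0x00 where they differ
	vpmovmskb eax, ymm0             ; one bit per byte: 1 = equal, 0 = different
	inc       eax                   ; an all-equal mask (0xFFFFFFFF) wraps to 0
	jz        .equal
	tzcnt     eax, eax              ; index of the first differing byte
	movzx     ecx, byte [rsi+rax]
	movzx     eax, byte [rdi+rax]
	sub       eax, ecx              ; return s1[i] - s2[i], like memcmp
	vzeroupper
	ret
.equal:
	xor       eax, eax              ; the two blocks are equal
	vzeroupper
	ret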

Benchmark 5: strchr

I chose the strchr function to demonstrate the use of the lodsb instruction: we need to make two checks on the byte we read (against c and against the NUL terminator), which makes scasb unsuitable. But since lodsb doesn’t set any flags, we can’t use a repetition prefix, which makes it very similar to a plain byte mov.

That’s why the implementations using lodsb and movb, as well as lodsq and movq, are factored into two macros: strchr_byte and strchr_quad.

char *strchr_dummy(const char *s, int c) {
  do {
    if (*s == c)
      return (char *)s;
  } while (*(s++) != '\0');
  return 0;
}
%macro strchr_byte 1
    xchg rdi, rsi           ; rsi = s, dil = c
%%loop:
    %1                      ; al = *(rsi); rsi += 1
    cmp dil, al             ; if(*s == c)
    je %%end                ;    goto end
    test al, al             ; while(*s != '\0')
    jnz %%loop
    xor rax, rax            ; return NULL
    ret
%%end:
    lea rax, [rsi - 1]      ; rsi was already incremented: return rsi - 1
    ret
%endmacro

; char* strchr_lodsb(rdi: const char* s, rsi: int c)
strchr_lodsb:
    strchr_byte lodsb

%macro load_byte_movzx 0 
	movzx eax, byte [rsi]
	add rsi, 1
%endmacro

; char* strchr_movb(rdi: const char* s, rsi: int c)
strchr_movb:
    strchr_byte load_byte_movzx
%macro strchr_quad 1
	movzx eax, sil			; eax = sil
	mul qword [%%one_mask]	; rax *= 0x0101010101010101
	mov rsi, rdi			; rsi = s
	mov rdi, rax			; rdi = rax
%%loop:
    %1                      ; rax = *(rsi); rsi += 8
    mov rcx, rax            ; keep a copy of the quad
    find_zero r8, rcx       ; r8 = mask of the 0x00 bytes (end of string)
    xor rax, rdi            ; bytes equal to c become 0x00
    find_zero r9, rax       ; r9 = mask of the bytes equal to c
    bsf rdx, r9             ; rdx = bit index of the first match
    mov ecx, edx
    shr ecx, 3              ; ecx = byte index of the first match
    test r8, r8             ; the quad contains the terminator:
    jnz %%end               ;    handle the end of the string
    test r9, r9             ; no match in this quad:
    jz %%loop               ;    read the next one
    lea rax, [rsi+rcx-8]    ; return the address of the match
    ret
%%end:
    test r9, r9             ; no match of c in the final quad:
    jz %%notfound           ;    not found
    bsf rax, r8             ; rax = bit index of the terminator
    cmp eax, edx            ; the terminator comes before the match:
    jl %%notfound           ;    not found
    lea rax, [rsi+rcx-8]    ; return the address of the match
    ret
%%notfound:
    xor eax, eax
    ret
%%one_mask: dq 0x0101010101010101
%endmacro

; char* strchr_lodsq(rdi: const char* s, rsi: int c)
strchr_lodsq:
    strchr_quad lodsq

%macro load_quad_mov 0 
	mov rax, [rsi]
	add rsi, 8
%endmacro

; char* strchr_movq(rdi: const char* s, rsi: int c)
strchr_movq:
    strchr_quad load_quad_mov
; char *strchr_avx(rdi: const char *s, rsi: int c)
strchr_avx:
	vpxor xmm0, xmm0	            ; xmm0 = 0

	; Broadcast the sil byte to all bytes of xmm1
	movzx ecx, sil				; Copy sil into ecx with zero-extend
	vmovd xmm1, ecx				; Copy the low dword (ecx) into xmm1
	vpshufb xmm1, xmm1, xmm0	; Broadcast the first byte of xmm1 to all bytes of xmm1

	vmovdqu xmm2, [rdi]         ; xmm2 = *(rdi)

	vpcmpeqb xmm3, xmm1, xmm2	; xmm3 = byte_mask(i => xmm2[i] == xmm1[i])
	vpmovmskb r9d, xmm3	        ; r9d = bitmask(i => sign(xmm3[i]))

	vpcmpeqb xmm3, xmm0, xmm2	; xmm3 = byte_mask(i => xmm2[i] == 0)
	vpmovmskb r8d, xmm3	        ; r8d = bitmask(i => sign(xmm3[i]))

    test r9d, r9d               ; if(r9d != 0)
    jnz .found                  ;    goto .found
	test r8d, r8d		        ; if(r8d != 0)
	jnz .notfound			    ; 	goto .notfound

	and rdi, -16		        ; align rdi on 16 bytes
.loop:
	add rdi, 0x10		        ; rdi += 16
	vmovdqa xmm2, [rdi]	        ; xmm2 = *(rdi)

	vpcmpeqb xmm3, xmm1, xmm2	; xmm3 = byte_mask(i => xmm2[i] == xmm1[i])
	vpmovmskb r9d, xmm3	        ; r9d = bitmask(i => sign(xmm3[i]))

	vpcmpeqb xmm3, xmm0, xmm2	; xmm3 = byte_mask(i => xmm2[i] == 0)
	vpmovmskb r8d, xmm3	        ; r8d = bitmask(i => sign(xmm3[i]))

    test r9d, r9d               ; if(r9d != 0)
    jnz .found                  ;    goto .found
	test r8d, r8d		        ; if(r8d == 0)
	jz .loop			        ; 	goto .loop
.notfound:
    xor eax, eax                ; return NULL
	ret
.found:
	bsf r9d, r9d		        ; r9d = index of the first matching byte
    test r8d, r8d               ; if(r8d == 0): no terminator in this block
    jz .noend                   ;    goto .noend
	bsf r8d, r8d		        ; r8d = index of the first zero byte
    cmp r8d, r9d                ; if(r8d < r9d): the terminator comes first
    jb .notfound                ;    goto .notfound
.noend:
    mov rax, rdi                ; rax = rdi
	add rax, r9 		        ; rax += r9
    ret
; char *strchr_avx2(rdi: const char *s, rsi: int c)
strchr_avx2:
	vpxor ymm0, ymm0	            ; ymm0 = 0

	; Broadcast the sil byte to all bytes of ymm1
	movzx ecx, sil				; Copy sil into ecx with zero-extend
	vmovd xmm1, ecx				; Copy the low dword (ecx) into xmm1
	vpbroadcastb ymm1, xmm1	    ; Broadcast the first byte of xmm1 to all bytes of ymm1

	vmovdqu ymm2, [rdi]         ; ymm2 = *(rdi)

	vpcmpeqb ymm3, ymm1, ymm2	; ymm3 = byte_mask(i => ymm2[i] == ymm1[i])
	vpmovmskb r9d, ymm3	        ; r9d = bitmask(i => sign(ymm3[i]))

	vpcmpeqb ymm3, ymm0, ymm2	; ymm3 = byte_mask(i => ymm2[i] == 0)
	vpmovmskb r8d, ymm3	        ; r8d = bitmask(i => sign(ymm3[i]))

    test r9d, r9d               ; if(r9d != 0)
    jnz .found                  ;    goto .found
	test r8d, r8d		        ; if(r8d != 0)
	jnz .notfound			    ; 	goto .notfound

	and rdi, -32		        ; align rdi on 32 bytes
.loop:
	add rdi, 0x20		        ; rdi += 32
	vmovdqa ymm2, [rdi]	        ; ymm2 = *(rdi)

	vpcmpeqb ymm3, ymm1, ymm2	; ymm3 = byte_mask(i => ymm2[i] == ymm1[i])
	vpmovmskb r9d, ymm3	        ; r9d = bitmask(i => sign(ymm3[i]))

	vpcmpeqb ymm3, ymm0, ymm2	; ymm3 = byte_mask(i => ymm2[i] == 0)
	vpmovmskb r8d, ymm3	        ; r8d = bitmask(i => sign(ymm3[i]))

    test r9d, r9d               ; if(r9d != 0)
    jnz .found                  ;    goto .found
	test r8d, r8d		        ; if(r8d == 0)
	jz .loop			        ; 	goto .loop
.notfound:
    xor eax, eax                ; return NULL
	ret
.found:
	bsf r9d, r9d		        ; r9d = index of the first matching byte
    test r8d, r8d               ; if(r8d == 0): no terminator in this block
    jz .noend                   ;    goto .noend
	bsf r8d, r8d		        ; r8d = index of the first zero byte
    cmp r8d, r9d                ; if(r8d < r9d): the terminator comes first
    jb .notfound                ;    goto .notfound
.noend:
    mov rax, rdi                ; rax = rdi
	add rax, r9 		        ; rax += r9
    vzeroupper
    ret
; char *strchr_sse2(rdi: const char *s, rsi: int c)
strchr_sse2:
	pxor xmm0, xmm0	            ; xmm0 = 0

	; Broadcast the sil byte to all bytes of xmm1
	movzx ecx, sil				; Copy sil into ecx with zero-extend
	movd xmm1, ecx				; Copy the low dword (ecx) into xmm1
	pshufb xmm1, xmm0			; Broadcast the first byte of xmm1 to all bytes of xmm1 (pshufb is SSSE3)

	movdqu xmm2, [rdi]         ; xmm2 = *(rdi)
 
    movdqa xmm3, xmm2
	pcmpeqb xmm3, xmm1	        ; xmm3 = byte_mask(i => xmm2[i] == xmm1[i])
	pmovmskb r9d, xmm3	        ; r9d = bitmask(i => sign(xmm3[i]))

    movdqa xmm3, xmm2
	pcmpeqb xmm3, xmm0	        ; xmm3 = byte_mask(i => xmm2[i] == 0)
	pmovmskb r8d, xmm3	        ; r8d = bitmask(i => sign(xmm3[i]))

    test r9d, r9d               ; if(r9d != 0)
    jnz .found                  ;    goto .found
	test r8d, r8d		        ; if(r8d != 0)
	jnz .notfound			    ; 	goto .notfound

	and rdi, -16		        ; align rdi on 16 bytes
.loop:
	add rdi, 0x10		        ; rdi += 16
	movdqa xmm2, [rdi]	        ; xmm2 = *(rdi)

    movdqa xmm3, xmm2
	pcmpeqb xmm3, xmm1      	; xmm3 = byte_mask(i => xmm2[i] == xmm1[i])
	pmovmskb r9d, xmm3	        ; r9d = bitmask(i => sign(xmm3[i]))

    movdqa xmm3, xmm2
	pcmpeqb xmm3, xmm0	        ; xmm3 = byte_mask(i => xmm2[i] == 0)
	pmovmskb r8d, xmm3	        ; r8d = bitmask(i => sign(xmm3[i]))

    test r9d, r9d               ; if(r9d != 0)
    jnz .found                  ;    goto .found
	test r8d, r8d		        ; if(r8d == 0)
	jz .loop			        ; 	goto .loop
.notfound:
    xor eax, eax                ; return NULL
	ret
.found:
	bsf r9d, r9d		        ; r9d = index of the first matching byte
    test r8d, r8d               ; if(r8d == 0): no terminator in this block
    jz .noend                   ;    goto .noend
	bsf r8d, r8d		        ; r8d = index of the first zero byte
    cmp r8d, r9d                ; if(r8d < r9d): the terminator comes first
    jb .notfound                ;    goto .notfound
.noend:
    mov rax, rdi                ; rax = rdi
	add rax, r9 		        ; rax += r9
    ret
Note
Even with -O3, gcc did not replace my strchr implementation with a call to the standard strchr. It actually generated a pretty bad implementation (maybe this could be fixed by feeding it a more contrived C version).

The implementation is similar to the strlen implementation, especially the 64-bit one. Indeed, we still need to find the end of the string, but we also need to find a given byte. Since we already know how to find a 0x00 byte in a quadword using the find_zero macro, we just need to XOR the quadword with the byte broadcast to every position: this zeroes out all the bytes equal to it (a XOR a = 0), and then we look for the 0x00 bytes as before.
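
As an illustration, here is a standalone sketch of that trick on a single quadword (my own example: it inlines the classic SWAR zero-byte test instead of reusing the find_zero macro, and the label name is made up):

; Sketch: return in rax the index of the first byte equal to c (passed in sil)
; inside the quadword at [rdi], or -1 if the quadword does not contain it.
find_byte_in_quad_sketch:
	movzx   eax, sil
	mov     rdx, 0x0101010101010101
	imul    rax, rdx                ; rax = c broadcast to all 8 bytes
	xor     rax, qword [rdi]        ; bytes equal to c become 0x00
	mov     rcx, rax
	not     rcx                     ; rcx = ~v
	sub     rax, rdx                ; rax = v - 0x0101…01
	and     rax, rcx                ; … & ~v
	mov     rdx, 0x8080808080808080
	and     rax, rdx                ; … & 0x8080…80: non-zero iff v had a 0x00 byte
	jz      .notfound
	bsf     rax, rax                ; bit index of the first 0x00 byte
	shr     rax, 3                  ; byte index of the first occurrence of c
	ret
.notfound:
	mov     rax, -1
	ret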

Here are the results of the benchmark on my computer:

Benchmark of strchr implementations

The lods instruction is always slower than the mov alternatives, so… just don’t use it.

Benchmark 6: iota

I wanted to end this review of the string instructions with one more benchmark of the stos instruction. We saw that this instruction is quite fast with a rep prefix, so let’s see how it compares to a mov when used in a loop without a prefix:

#include <stddef.h>

void iota_dummy(unsigned char *s, size_t n) {
  for (size_t i = 0; i < n; i++)
    s[i] = i;
}
%macro iota_byte 1
	xor eax, eax		; al = current value
	mov rcx, rsi		; rcx = n
	test rcx, rcx
	jz %%exit
%%loop:
    %1					; *(rdi++) = al
	add rax, 0x1
	sub rcx, 0x1
	jnz %%loop
%%exit:
	ret
%endmacro

; void *iota_stosb(rdi: unsigned char s[.n], rsi: size_t n);
iota_stosb:
    iota_byte stosb

%macro store_movb 0
	mov [rdi], al
	add rdi, 0x1
%endmacro

; void *iota_movb(rdi: unsigned char s[.n], rsi: size_t n);
iota_movb:
    iota_byte store_movb
%macro iota_quad 1
	mov r8, 0x0706050403020100	; r8 = the bytes 0..7
	mov r9, 0x0808080808080808	; r9 = 8 in every byte
	mov rax, r8					; rax = current 8-byte pattern
	cmp rsi, 8					; stop if fewer than 8 bytes remain
	jb %%end
%%loop:
    %1							; store the 8 bytes of rax and advance rdi
	add rax, r9					; add 8 to every byte of the pattern
	test al, al					; when al wraps to 0 the add carried between bytes,
	cmovz rax, r8				;    so restart the pattern from 0..7
    sub rsi, 0x8
	cmp rsi, 0x8
	jae %%loop
%%end:
    ret
%endmacro

; void *iota_stosq(rdi: unsigned char s[.n], rsi: size_t n);
iota_stosq:
    iota_quad stosq
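
; store_movq is not shown above; this is a presumed definition, by analogy
; with store_movb: store the 8 bytes of rax and advance the destination.
%macro store_movq 0
	mov [rdi], rax
	add rdi, 0x8
%endmacro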

; void *iota_movq(rdi: unsigned char s[.n], rsi: size_t n);
iota_movq:
    iota_quad store_movq
; void *iota_avx(rdi: void s[.n], rsi: size_t n);
iota_avx:
    cmp rsi, 0x10
	jb .exit
	vmovdqa xmm0, [.init_mask]
	vmovdqa xmm1, [.add_mask]
.loop:
	vmovdqu [rdi], xmm0				; Copy the 16 bytes of xmm0 to s
	vpaddb xmm0, xmm0, xmm1			; Add 16 to every byte of the pattern
	add rdi, 0x10					; Advance rdi by 16 bytes
	sub rsi, 0x10					; n -= 16
	cmp rsi, 0x10
	jae .loop						; Loop while at least 16 bytes remain
.exit:
	ret
align 16
.init_mask: db 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0x9, 0xA, 0xB, 0xC, 0xD, 0xE, 0xF
.add_mask: times 16 db 0x10
; void *iota_avx2(rdi: void s[.n], rsi: size_t n);
iota_avx2:
	shr rsi, 5						; Divide n by 32
	jz .exit
	vmovdqa ymm0, [.init_mask]
	vmovdqa ymm1, [.add_mask]
.loop:
	vmovdqu [rdi], ymm0				; Copy the 32 bytes of ymm0 to s
	vpaddb ymm0, ymm0, ymm1			; Add 32 to every byte of the pattern
	add rdi, 0x20					; Advance rdi by 32 bytes
	sub rsi, 1						; One 32-byte block done
	jnz .loop						; Loop while blocks remain
.exit:
	vzeroupper						; Clear the upper ymm state before returning
	ret
align 32
.init_mask:
	db 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F
	db 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F
.add_mask: times 32 db 0x20

The benchmark result on my computer:

Benchmark of iota implementations

As we can see, there is no point in using the stos instruction without a prefix, as it generally performs worse than an equivalent mov instruction.

This corroborates the advice in Agner Fog’s “Optimizing subroutines in assembly language”:

String instructions without a repeat prefix are too slow and should be replaced by simpler instructions. The same applies to the LOOP instruction and to JECXZ on some processors.

Conclusion

We’ve seen a lot in this article. Now, what should you remember from all of this to write better assembly code?

Here are some general guidelines you can keep in mind when using string instructions:

  • You should use rep stosq to initialize large blocks of memory (see the sketch after this list).
  • You should use rep movsq to copy large blocks of memory (more than 512 bytes) that fit in your cache (~7 MiB); for larger blocks you should use prefetching, the AVX extensions, unrolled loops and non-temporal stores.
  • Do not use string instructions without repeat prefixes as they will generally be slower than their classical alternatives.
  • Do not use scas, cmps and lods as you can always come up with more efficient (AVX) versions.
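
For instance, here is a minimal rep stosq sketch of the first guideline (my own example; it assumes rdi holds the destination, rsi the size in bytes, and that the size is a multiple of 8):

; Zero-fill a buffer 8 bytes at a time with rep stosq.
zero_buffer_sketch:
	mov     rcx, rsi        ; rcx = size in bytes
	shr     rcx, 3          ; rcx = number of quadwords to store
	xor     eax, eax        ; rax = the value to store
	rep stosq               ; while(rcx--) *(rdi++) = rax
	ret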

String instructions are very portable, as they have been present on all x86 processors since the original 8086. However, most of the time they perform worse than alternatives using the largest available registers. Therefore, except for rep stosq and rep movsq, you shouldn’t use them unless you’re optimizing for size.