I'm currently trying to improve the performance of a custom "pseudo" stack, which is used like this (full code is provided at the end of this post):

    void test() {
      theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
      theStack.stackTop.store(1, std::memory_order_seq_cst);           // B
      someFunction();                                                  // C
      theStack.stackTop.store(0, std::memory_order_seq_cst);           // D

      theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
      theStack.stackTop.store(1, std::memory_order_seq_cst);           // F
      someOtherFunction();                                             // G
      theStack.stackTop.store(0, std::memory_order_seq_cst);           // H
    }

A sampler thread periodically suspends the target thread and reads `stackTop` and the `stackFrames` array.
My biggest performance problem is the sequentially consistent stores to `stackTop`, so I'm trying to find out whether I can change them to release-stores.
The central requirement is: when the sampler thread suspends the target thread and reads `stackTop == 1`, the information in `stackFrames[1]` needs to be fully present and consistent. This means:
- When B is observed, A must also be observed. ("Don't increment `stackTop` before putting the stack frame in place.")
- When E is observed, D must also be observed. ("When putting the next frame's information in place, the previous stack frame must have been exited.")
My understanding is that using release-acquire memory ordering for `stackTop` guarantees the first requirement, but not the second. More specifically:
- No writes that are before the `stackTop` release-store in program order can be reordered to occur after it.
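
The guarantee in that bullet (the first requirement: if B is observed, A is observed too) can be sketched as a small stand-alone program. Note the assumptions: it uses two ordinary `std::thread`s and a spinning acquire-load instead of the suspend-based sampler, and `Frame` and `publishAndObserve` are hypothetical names that exist only for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for StackFrame, local to this sketch.
struct Frame {
    const char* functionName;
    uint32_t lineNumber;
};

// Publishes one frame from a writer thread and reads it from a reader thread.
// Returns the frame contents as the reader saw them.
inline Frame publishAndObserve() {
    Frame slot{nullptr, 0};
    std::atomic<uint32_t> top{0};
    Frame seen{nullptr, 0};

    std::thread writer([&] {
        slot = Frame{"someFunction", 30};         // A: plain write
        top.store(1, std::memory_order_release);  // B: release-store
    });

    std::thread reader([&] {
        // Spin until the acquire-load reads the value written by B.
        while (top.load(std::memory_order_acquire) != 1) {
            std::this_thread::yield();
        }
        // The acquire-load read B's value, so A happens-before this read.
        seen = slot;
    });

    writer.join();
    reader.join();
    assert(seen.functionName != nullptr);
    return seen;
}
```

This only exercises the "writes before the release-store" direction. It says nothing about writes that come after the release-store in program order.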
However, no statement is made about writes that occur after the release-store to `stackTop` in program order. Thus, my understanding is that E can be observed before D is observed. Is this correct?
But if that's the case, then wouldn't the compiler be able to reorder my program like this:

    void test() {
      theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
      theStack.stackTop.store(1, std::memory_order_release);           // B
      someFunction();                                                  // C

      // switched D and E:
      theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
      theStack.stackTop.store(0, std::memory_order_release);           // D

      theStack.stackTop.store(1, std::memory_order_release);           // F
      someOtherFunction();                                             // G
      theStack.stackTop.store(0, std::memory_order_release);           // H
    }

... and then combine D and F, optimizing away the zero store?
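
To make the hypothetical concrete, here is a sketch of what `test()` would amount to if D were moved below E and then folded into F. It is written out by hand purely to illustrate the outcome being asked about; it is not a claim that any compiler does, or is allowed to, perform this transformation. The `Sketch`/`sketch`-prefixed names and the trivial stand-in callees are assumptions added so the snippet compiles on its own.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct SketchFrame {
    const char* functionName;
    uint32_t lineNumber;
};

struct SketchStack {
    SketchFrame stackFrames[2] = {{nullptr, 0}, {nullptr, 0}};
    std::atomic<uint32_t> stackTop{0};
};

SketchStack sketchStack;

// Stand-in callees: each checks that a frame is published while it runs.
void sketchSomeFunction() {
    assert(sketchStack.stackTop.load(std::memory_order_relaxed) == 1);
}
void sketchSomeOtherFunction() {
    assert(sketchStack.stackTop.load(std::memory_order_relaxed) == 1);
}

// Hand-written result of the hypothetical "move D below E, merge D with F".
void testAfterHypotheticalMerge() {
    sketchStack.stackFrames[1] = SketchFrame{"someFunction", 30};      // A
    sketchStack.stackTop.store(1, std::memory_order_release);          // B
    sketchSomeFunction();                                              // C

    // stackTop still reads 1 here while frame 1 is being overwritten.
    sketchStack.stackFrames[1] = SketchFrame{"someOtherFunction", 35}; // E
    sketchStack.stackTop.store(1, std::memory_order_release);          // D+F: the 0 is never stored

    sketchSomeOtherFunction();                                         // G
    sketchStack.stackTop.store(0, std::memory_order_release);          // H
}
```

In this form, a sampler that suspends the thread in the middle of E would read `stackTop == 1` together with a half-updated `stackFrames[1]`, which is exactly what the second requirement is meant to rule out.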
Because that's not what I'm seeing if I compile the above program using system clang on macOS:

    $ clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o

    main.o: file format Mach-O 64-bit x86-64

    Disassembly of section __TEXT,__text:
    __Z4testv:
           0: 55                             pushq %rbp
           1: 48 89 e5                       movq  %rsp, %rbp
           4: 48 8d 05 5d 00 00 00           leaq  93(%rip), %rax
           b: 48 89 05 10 00 00 00           movq  %rax, 16(%rip)
          12: c7 05 14 00 00 00 1e 00 00 00  movl  $30, 20(%rip)
          1c: c7 05 1c 00 00 00 01 00 00 00  movl  $1, 28(%rip)
          26: e8 00 00 00 00                 callq 0 <__Z4testv+0x2B>
          2b: c7 05 1c 00 00 00 00 00 00 00  movl  $0, 28(%rip)
          35: 48 8d 05 39 00 00 00           leaq  57(%rip), %rax
          3c: 48 89 05 10 00 00 00           movq  %rax, 16(%rip)
          43: c7 05 14 00 00 00 23 00 00 00  movl  $35, 20(%rip)
          4d: c7 05 1c 00 00 00 01 00 00 00  movl  $1, 28(%rip)
          57: e8 00 00 00 00                 callq 0 <__Z4testv+0x5C>
          5c: c7 05 1c 00 00 00 00 00 00 00  movl  $0, 28(%rip)
          66: 5d                             popq  %rbp
          67: c3                             retq

Specifically, the `movl $0, 28(%rip)` instruction at `2b` is still present.
Coincidentally, this output is exactly what I need in my case. But I don't know if I can rely on it, because to my understanding it's not guaranteed by my chosen memory ordering.
So my main question is this: Does the acquire-release memory order give me another (fortunate) guarantee that I'm not aware of? Or is the compiler only doing what I need by accident / because it's not optimizing this particular case as well as it could?
Full code below:

    // clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o

    #include <atomic>
    #include <cstdint>

    struct StackFrame
    {
      const char* functionName;
      uint32_t lineNumber;
    };

    struct Stack
    {
      Stack()
        : stackFrames{ StackFrame{ nullptr, 0 }, StackFrame{ nullptr, 0 } }
        , stackTop{0}
      {
      }

      StackFrame stackFrames[2];
      std::atomic<uint32_t> stackTop;
    };

    Stack theStack;

    void someFunction();
    void someOtherFunction();

    void test() {
      theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };
      theStack.stackTop.store(1, std::memory_order_release);
      someFunction();
      theStack.stackTop.store(0, std::memory_order_release);

      theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 };
      theStack.stackTop.store(1, std::memory_order_release);
      someOtherFunction();
      theStack.stackTop.store(0, std::memory_order_release);
    }

    /**
     * // Sampler thread:
     *
     * #include <chrono>
     * #include <iostream>
     * #include <thread>
     *
     * void suspendTargetThread();
     * void unsuspendTargetThread();
     *
     * void samplerThread() {
     *   for (;;) {
     *     // Suspend the target thread. This uses a platform-specific
     *     // mechanism:
     *     //  - SuspendThread on Windows
     *     //  - thread_suspend on macOS
     *     //  - send a signal + grab a lock in the signal handler on Linux
     *     suspendTargetThread();
     *
     *     // Now that the thread is paused, read the leaf stack frame.
     *     uint32_t stackTop =
     *       theStack.stackTop.load(std::memory_order_acquire);
     *     StackFrame& f = theStack.stackFrames[stackTop];
     *     std::cout << f.functionName << " at line "
     *               << f.lineNumber << std::endl;
     *
     *     unsuspendTargetThread();
     *
     *     std::this_thread::sleep_for(std::chrono::milliseconds(1));
     *   }
     * }
     */

And, to satisfy curiosity, this is the assembly if I use sequentially-consistent stores:

    $ clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o

    main.o: file format Mach-O 64-bit x86-64

    Disassembly of section __TEXT,__text:
    __Z4testv:
           0: 55                             pushq %rbp
           1: 48 89 e5                       movq  %rsp, %rbp
           4: 41 56                          pushq %r14
           6: 53                             pushq %rbx
           7: 48 8d 05 60 00 00 00           leaq  96(%rip), %rax
           e: 48 89 05 10 00 00 00           movq  %rax, 16(%rip)
          15: c7 05 14 00 00 00 1e 00 00 00  movl  $30, 20(%rip)
          1f: 41 be 01 00 00 00              movl  $1, %r14d
          25: b8 01 00 00 00                 movl  $1, %eax
          2a: 87 05 20 00 00 00              xchgl %eax, 32(%rip)
          30: e8 00 00 00 00                 callq 0 <__Z4testv+0x35>
          35: 31 db                          xorl  %ebx, %ebx
          37: 31 c0                          xorl  %eax, %eax
          39: 87 05 20 00 00 00              xchgl %eax, 32(%rip)
          3f: 48 8d 05 35 00 00 00           leaq  53(%rip), %rax
          46: 48 89 05 10 00 00 00           movq  %rax, 16(%rip)
          4d: c7 05 14 00 00 00 23 00 00 00  movl  $35, 20(%rip)
          57: 44 87 35 20 00 00 00           xchgl %r14d, 32(%rip)
          5e: e8 00 00 00 00                 callq 0 <__Z4testv+0x63>
          63: 87 1d 20 00 00 00              xchgl %ebx, 32(%rip)
          69: 5b                             popq  %rbx
          6a: 41 5e                          popq  %r14
          6c: 5d                             popq  %rbp
          6d: c3                             retq

Instruments identified the `xchgl` instructions as the most expensive part.