Missed optimization opportunity or required behavior due to acquire-release memory ordering?

I'm currently trying to improve the performance of a custom "pseudo" stack, which is used like this (full code is provided at the end of this post):

    void test() {
      theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
      theStack.stackTop.store(1, std::memory_order_seq_cst);           // B
      someFunction();                                                  // C
      theStack.stackTop.store(0, std::memory_order_seq_cst);           // D

      theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
      theStack.stackTop.store(1, std::memory_order_seq_cst);           // F
      someOtherFunction();                                             // G
      theStack.stackTop.store(0, std::memory_order_seq_cst);           // H
    }

A sampler thread periodically suspends the target thread and reads stackTop and the stackFrames array.

My biggest performance problem is the sequentially consistent stores to stackTop, so I'm trying to find out whether I can change them to release stores.

The central requirement is: When the sampler thread suspends the target thread and reads stackTop == 1, then the information in stackFrames[1] needs to be fully present and consistent. This means:

When B is observed, A must also be observed. ("Don't increment stackTop before putting the stack frame in place.")
When E is observed, D must also be observed. ("When putting the next frame's information in place, the previous stack frame must have been exited.")
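
To make the push/pop protocol concrete, here is a minimal sketch that wraps steps A/B and D in a scope guard. FrameScope is a hypothetical helper that is not part of my actual code, and it uses the sequentially consistent stores from the snippet above, since those are the ones I know behave as required:

```cpp
#include <atomic>
#include <cstdint>

struct StackFrame {
  const char* functionName;
  uint32_t lineNumber;
};

struct Stack {
  StackFrame stackFrames[2] = {{nullptr, 0}, {nullptr, 0}};
  std::atomic<uint32_t> stackTop{0};
};

Stack theStack;

// Hypothetical helper (an assumption for illustration only).
// Constructor: put the frame in place, then publish it (A, then B).
// Destructor: mark the frame as exited (D).
class FrameScope {
public:
  FrameScope(const char* functionName, uint32_t lineNumber) {
    theStack.stackFrames[1] = StackFrame{functionName, lineNumber};
    theStack.stackTop.store(1, std::memory_order_seq_cst);
  }
  ~FrameScope() {
    theStack.stackTop.store(0, std::memory_order_seq_cst);
  }
  FrameScope(const FrameScope&) = delete;
  FrameScope& operator=(const FrameScope&) = delete;
};
```

With such a guard, test() would become two consecutive blocks, each declaring a FrameScope before calling the function. In program order, the destructor's store (D) always runs before the next constructor's frame write (E); whether other threads are guaranteed to observe that order is exactly what the rest of this question is about.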

My understanding is that using release-acquire memory ordering for stackTop guarantees the first requirement, but not the second. More specifically:

No writes that are before the stackTop release-store in program order can be reordered to occur after it.

However, no statement is made about writes that occur after the release-store to stackTop in program order. Thus, my understanding is that E can be observed before D is observed. Is this correct?
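
The first requirement (A is visible whenever B is observed) can be sketched in isolation as follows. This is only an illustration under simplifying assumptions: observeFrameAfterRelease, frame and top are hypothetical names, and the reader spins on an acquire load instead of suspending the writer thread as my real sampler does.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

struct StackFrame {
  const char* functionName;
  uint32_t lineNumber;
};

StackFrame frame{nullptr, 0};
std::atomic<uint32_t> top{0};

// Hypothetical demo: returns the line number the reader sees
// after it has observed top == 1.
uint32_t observeFrameAfterRelease() {
  frame = StackFrame{nullptr, 0};
  top.store(0, std::memory_order_relaxed);

  std::thread writer([] {
    frame = StackFrame{"someFunction", 30};   // A: plain write
    top.store(1, std::memory_order_release);  // B: release store
  });

  // The acquire load that reads 1 synchronizes with the release store,
  // so the write to frame (A) happens-before the read below.
  while (top.load(std::memory_order_acquire) != 1) {
    std::this_thread::yield();
  }
  uint32_t line = frame.lineNumber;

  writer.join();
  return line;
}
```

Calling observeFrameAfterRelease() returns 30. Note that the sketch stops after B: it says nothing about a later write to frame (step E), which is the case I'm unsure about.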

But if that's the case, then wouldn't the compiler be able to reorder my program like this:

    void test() {
      theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };      // A
      theStack.stackTop.store(1, std::memory_order_release);           // B
      someFunction();                                                  // C

      // switched D and E:
      theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 }; // E
      theStack.stackTop.store(0, std::memory_order_release);           // D

      theStack.stackTop.store(1, std::memory_order_release);           // F
      someOtherFunction();                                             // G
      theStack.stackTop.store(0, std::memory_order_release);           // H
    }

... and then combine D and F, optimizing away the zero store?

Because that's not what I'm seeing if I compile the above program using system clang on macOS:

    $ clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o
    main.o: file format Mach-O 64-bit x86-64

    Disassembly of section __TEXT,__text:
    __Z4testv:
           0:   55                              pushq   %rbp
           1:   48 89 e5                        movq    %rsp, %rbp
           4:   48 8d 05 5d 00 00 00            leaq    93(%rip), %rax
           b:   48 89 05 10 00 00 00            movq    %rax, 16(%rip)
          12:   c7 05 14 00 00 00 1e 00 00 00   movl    $30, 20(%rip)
          1c:   c7 05 1c 00 00 00 01 00 00 00   movl    $1, 28(%rip)
          26:   e8 00 00 00 00                  callq   0 <__Z4testv+0x2B>
          2b:   c7 05 1c 00 00 00 00 00 00 00   movl    $0, 28(%rip)
          35:   48 8d 05 39 00 00 00            leaq    57(%rip), %rax
          3c:   48 89 05 10 00 00 00            movq    %rax, 16(%rip)
          43:   c7 05 14 00 00 00 23 00 00 00   movl    $35, 20(%rip)
          4d:   c7 05 1c 00 00 00 01 00 00 00   movl    $1, 28(%rip)
          57:   e8 00 00 00 00                  callq   0 <__Z4testv+0x5C>
          5c:   c7 05 1c 00 00 00 00 00 00 00   movl    $0, 28(%rip)
          66:   5d                              popq    %rbp
          67:   c3                              retq

Specifically, the movl $0, 28(%rip) instruction at 2b is still present.

Coincidentally, this output is exactly what I need in my case. But I don't know if I can rely on it, because to my understanding it's not guaranteed by my chosen memory ordering.

So my main question is this: Does the acquire-release memory order give me another (fortunate) guarantee that I'm not aware of? Or is the compiler only doing what I need by accident / because it's not optimizing this particular case as well as it could?

Full code below:

    // clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o

    #include <atomic>
    #include <cstdint>

    struct StackFrame
    {
      const char* functionName;
      uint32_t lineNumber;
    };

    struct Stack
    {
      Stack()
        : stackFrames{ StackFrame{ nullptr, 0 }, StackFrame{ nullptr, 0 } }
        , stackTop{0}
      {
      }

      StackFrame stackFrames[2];
      std::atomic<uint32_t> stackTop;
    };

    Stack theStack;

    void someFunction();
    void someOtherFunction();

    void test() {
      theStack.stackFrames[1] = StackFrame{ "someFunction", 30 };
      theStack.stackTop.store(1, std::memory_order_release);
      someFunction();
      theStack.stackTop.store(0, std::memory_order_release);

      theStack.stackFrames[1] = StackFrame{ "someOtherFunction", 35 };
      theStack.stackTop.store(1, std::memory_order_release);
      someOtherFunction();
      theStack.stackTop.store(0, std::memory_order_release);
    }

    /**
     * // Sampler thread:
     *
     * #include <chrono>
     * #include <iostream>
     * #include <thread>
     *
     * void suspendTargetThread();
     * void unsuspendTargetThread();
     *
     * void samplerThread() {
     *   for (;;) {
     *     // Suspend the target thread. This uses a platform-specific
     *     // mechanism:
     *     //  - SuspendThread on Windows
     *     //  - thread_suspend on macOS
     *     //  - send a signal + grab a lock in the signal handler on Linux
     *     suspendTargetThread();
     *
     *     // Now that the thread is paused, read the leaf stack frame.
     *     uint32_t stackTop =
     *       theStack.stackTop.load(std::memory_order_acquire);
     *     StackFrame& f = theStack.stackFrames[stackTop];
     *     std::cout << f.functionName << " at line "
     *               << f.lineNumber << std::endl;
     *
     *     unsuspendTargetThread();
     *
     *     std::this_thread::sleep_for(std::chrono::milliseconds(1));
     *   }
     * }
     */

And, to satisfy curiosity, this is the assembly if I use sequentially-consistent stores:

    $ clang++ -c main.cpp -std=c++11 -O3 && objdump -d main.o
    main.o: file format Mach-O 64-bit x86-64

    Disassembly of section __TEXT,__text:
    __Z4testv:
           0:   55                              pushq   %rbp
           1:   48 89 e5                        movq    %rsp, %rbp
           4:   41 56                           pushq   %r14
           6:   53                              pushq   %rbx
           7:   48 8d 05 60 00 00 00            leaq    96(%rip), %rax
           e:   48 89 05 10 00 00 00            movq    %rax, 16(%rip)
          15:   c7 05 14 00 00 00 1e 00 00 00   movl    $30, 20(%rip)
          1f:   41 be 01 00 00 00               movl    $1, %r14d
          25:   b8 01 00 00 00                  movl    $1, %eax
          2a:   87 05 20 00 00 00               xchgl   %eax, 32(%rip)
          30:   e8 00 00 00 00                  callq   0 <__Z4testv+0x35>
          35:   31 db                           xorl    %ebx, %ebx
          37:   31 c0                           xorl    %eax, %eax
          39:   87 05 20 00 00 00               xchgl   %eax, 32(%rip)
          3f:   48 8d 05 35 00 00 00            leaq    53(%rip), %rax
          46:   48 89 05 10 00 00 00            movq    %rax, 16(%rip)
          4d:   c7 05 14 00 00 00 23 00 00 00   movl    $35, 20(%rip)
          57:   44 87 35 20 00 00 00            xchgl   %r14d, 32(%rip)
          5e:   e8 00 00 00 00                  callq   0 <__Z4testv+0x63>
          63:   87 1d 20 00 00 00               xchgl   %ebx, 32(%rip)
          69:   5b                              popq    %rbx
          6a:   41 5e                           popq    %r14
          6c:   5d                              popq    %rbp
          6d:   c3                              retq

Instruments identified the xchgl instructions as the most expensive part.
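
For anyone who wants to reproduce the cost difference without a profiler, a rough micro-benchmark could look like the sketch below. timeStores and benchTop are hypothetical names, and the loop measures nothing but back-to-back stores, so the numbers are only indicative and will not match a real workload.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

std::atomic<uint32_t> benchTop{0};

// Hypothetical helper: times `iterations` stores to benchTop using the
// memory order given as the template argument.
template <std::memory_order Order>
std::chrono::nanoseconds timeStores(uint32_t iterations) {
  auto start = std::chrono::steady_clock::now();
  for (uint32_t i = 0; i < iterations; ++i) {
    benchTop.store(i & 1, Order);  // alternate between 0 and 1
  }
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
}
```

Comparing timeStores<std::memory_order_seq_cst>(n) against timeStores<std::memory_order_release>(n) for a large n should show the gap between the xchgl and plain movl stores seen in the two disassemblies above, assuming the compiler emits the same instructions for this loop.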
