Fastest possible barrier

For the last few weeks I have been experimenting with barriers on the Xeon Phi. Right now it seems to me that the barrier already included in OpenMP yields close to the best performance. I can beat its cycles per barrier by implementing a tournament barrier (judging only by the scaling, I guess the OMP barrier in the icc implementation is a tournament barrier as well), but not by much.
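
For readers who want something concrete to compare against, here is a minimal sketch of a tournament barrier with a global release flag. It is not my actual implementation and certainly not what icc does internally; the names TournamentBarrier, PaddedFlag and tournament_violations are made up for illustration, and the 64-byte padding assumes a 64-byte cache line.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// One flag per cache line (assumed 64 bytes) to avoid false sharing.
struct PaddedFlag {
    std::atomic<bool> v{false};
    char pad[63];
};

// Hypothetical sketch: tournament arrival phase, single global release flag.
class TournamentBarrier {
public:
    explicit TournamentBarrier(int n)
        : n_(n), rounds_(log2_ceil(n)),
          flags_(static_cast<std::size_t>(n) * rounds_), release_(false) {
        assert(n > 0);
    }

    // id is the thread index in [0, n); local_sense is per-thread, initially false.
    void wait(int id, bool& local_sense) {
        local_sense = !local_sense;
        for (int k = 0; k < rounds_; ++k) {
            const int step = 1 << k;
            if (id % (2 * step) == 0) {
                // Winner of round k: wait for the loser, if one exists (else a bye).
                if (id + step < n_) {
                    while (flag(id, k).load(std::memory_order_acquire) != local_sense)
                        std::this_thread::yield();  // portable demo; real code would spin
                }
            } else {
                // Loser of round k: signal the winner and drop out of the tournament.
                flag(id - step, k).store(local_sense, std::memory_order_release);
                break;
            }
        }
        if (id == 0) {
            // Thread 0 is the champion and releases everybody.
            release_.store(local_sense, std::memory_order_release);
        } else {
            while (release_.load(std::memory_order_acquire) != local_sense)
                std::this_thread::yield();
        }
    }

private:
    static int log2_ceil(int n) {
        int r = 0;
        while ((1 << r) < n) ++r;
        return r;
    }
    std::atomic<bool>& flag(int id, int k) {
        return flags_[static_cast<std::size_t>(id) * rounds_ + k].v;
    }

    const int n_;
    const int rounds_;
    std::vector<PaddedFlag> flags_;
    std::atomic<bool> release_;
};

// Demo helper: counts how often a thread left the barrier too early.
int tournament_violations(int threads, int rounds) {
    TournamentBarrier barrier(threads);
    std::atomic<int> arrived{0};
    std::atomic<int> violations{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            bool local_sense = false;
            for (int r = 1; r <= rounds; ++r) {
                arrived.fetch_add(1);
                barrier.wait(t, local_sense);
                // Past the barrier, all threads must have arrived r times.
                if (arrived.load() < r * threads) violations.fetch_add(1);
            }
        });
    }
    for (auto& th : pool) th.join();
    return violations.load();
}
```

The arrival phase takes about log2(n) rounds, and each arrival flag has exactly one writer and one reader. The wakeup still goes through one shared flag here; a tree-shaped wakeup would be the obvious variation to try.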

My issue is that the performance of barriers on the MIC seems unsatisfactory. The cost of a barrier grows with the number of threads (as expected), but as a rough figure, for the thread counts we usually run with, a single barrier costs about 10k cycles. This is too much.

I also implemented a centralized barrier, which naturally performs better with a low number of threads (say, fewer than 16). The big problem with the centralized barrier, however, is that it scales much worse. The problem arises in the last stage of the barrier: all threads except the last one to arrive wait in a spin-lock, and the release takes too long to propagate to all of them.
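
To make the problematic last stage concrete, here is a minimal sketch of a sense-reversing centralized barrier. It is only an illustration: the names CentralBarrier and count_violations are made up, and it uses std::atomic fetch_sub (which typically compiles to a locked add-style instruction on x86) instead of the lock cmpxchg loop I actually use.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical sketch of a centralized, sense-reversing barrier.
class CentralBarrier {
public:
    explicit CentralBarrier(int n) : n_(n), count_(n), sense_(false) {
        assert(n > 0);
    }

    // local_sense is per-thread state and must start out as false.
    void wait(bool& local_sense) {
        local_sense = !local_sense;
        if (count_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            // Last thread to arrive: reset the counter, then release the others.
            count_.store(n_, std::memory_order_relaxed);
            sense_.store(local_sense, std::memory_order_release);
        } else {
            // Everyone else spins on the same shared flag. This is the stage
            // where the release has to reach every waiting thread.
            while (sense_.load(std::memory_order_acquire) != local_sense)
                std::this_thread::yield();  // portable demo; real code would spin
        }
    }

private:
    const int n_;
    std::atomic<int> count_;
    std::atomic<bool> sense_;
};

// Demo helper: counts how often a thread left the barrier too early.
int count_violations(int threads, int rounds) {
    CentralBarrier barrier(threads);
    std::atomic<int> arrived{0};
    std::atomic<int> violations{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            bool local_sense = false;
            for (int r = 1; r <= rounds; ++r) {
                arrived.fetch_add(1);
                barrier.wait(local_sense);
                // Past the barrier, all threads must have arrived r times.
                if (arrived.load() < r * threads) violations.fetch_add(1);
            }
        });
    }
    for (auto& th : pool) th.join();
    return violations.load();
}
```

Every waiting thread reads the same sense_ variable, so the single write by the last thread has to reach all of them. That matches the stage where I see the scaling fall apart.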

To boil the whole post down to two questions:

1. Do any of you know a better (faster, more scalable) way to construct barriers on the Xeon Phi? (For my centralized barrier, and for synchronization in general, I use the lock cmpxchg instruction.)

2. Are there any other assembly instructions or globally accessible registers (probably undocumented, since I did not find any) that might help with barriers?

Thanks a lot!

Florian