Dear Intel Forum Gurus,
I am learning VTune by using the VTune 2016 GUI to measure the bandwidth of the STREAM benchmark on a KNC 5110P. I compiled stream.c with the following options:
icpc -mmic -O3 -g -qopenmp -DSTREAM_ARRAY_SIZE=64000000 -qopt-prefetch-distance=64,8 -qopt-streaming-cache-evict=0 -qopt-streaming-stores never -restrict stream.c
Streaming stores are omitted because I want to try core-event-based sampling (more on that later).
First I tried to see whether the GUI could give me the bandwidth directly. This forum thread (https://software.intel.com/en-us/forums/intel-vtune-amplifier-xe/topic/518185#comment-1793935) indicates that VTune 2013 can report the bandwidth directly, but I didn't see "Bandwidth" among my available analysis types (first screenshot below). Section 6.2 of this article (https://software.intel.com/en-us/articles/tutorial-on-intel-xeon-phi-processor-optimization) indicates that VTune 2017 will give a nice bandwidth histogram, but I didn't see the Memory Usage viewpoint within the Memory Access analysis type (second screenshot below).
Next I tried to compute the bandwidth myself using the formulas given in Section 5.4 of this article (https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding), specifically:
Read bandwidth = (L2_DATA_READ_MISS_MEM_FILL + L2_DATA_WRITE_MISS_MEM_FILL + HWP_L2MISS) * 64 / CPU_CLK_UNHALTED
Write bandwidth = (L2_VICTIM_REQ_WITH_DATA + SNP_HITM_L2) * 64 / CPU_CLK_UNHALTED
Bandwidth = Read bandwidth + Write bandwidth
This is also why I compiled without streaming stores: these events do not account for them. I created a custom analysis type to record all the necessary events and applied the formula (third screenshot below) to the Triad kernel (highlighted line). In the denominator I divide CPU_CLK_UNHALTED by 60 because I'm fairly sure CPU_CLK_UNHALTED is the sum of clock ticks over all 60 cores, so dividing by 60 should give the wall-clock cycles of the function.
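To make the arithmetic concrete, this is essentially the calculation I'm doing. The event counts below are made-up placeholders, not my measured numbers, and I'm assuming the 5110P's nominal 1.053 GHz clock:

#include <stdio.h>

int main(void)
{
    /* Placeholder event counts for the Triad kernel -- NOT my measured
       values, just to illustrate the arithmetic. */
    double l2_data_read_miss_mem_fill  = 1.0e9;
    double l2_data_write_miss_mem_fill = 1.0e8;
    double hwp_l2miss                  = 5.0e8;
    double l2_victim_req_with_data     = 8.0e8;
    double snp_hitm_l2                 = 1.0e7;
    double cpu_clk_unhalted            = 6.0e10; /* summed over all cores (I think) */

    const double cores = 60.0;    /* 5110P core count */
    const double freq  = 1.053e9; /* 5110P nominal clock, Hz (my assumption) */

    /* Each event corresponds to a 64-byte cache-line transfer. */
    double read_bytes  = (l2_data_read_miss_mem_fill
                        + l2_data_write_miss_mem_fill
                        + hwp_l2miss) * 64.0;
    double write_bytes = (l2_victim_req_with_data + snp_hitm_l2) * 64.0;

    /* Divide the summed clock ticks by the core count to get wall-clock
       cycles, then by the frequency to get seconds. */
    double seconds = (cpu_clk_unhalted / cores) / freq;

    double bw_gbs = (read_bytes + write_bytes) / seconds / 1.0e9;
    printf("Bandwidth = %.2f GB/s\n", bw_gbs);
    return 0;
}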
My calculation from the metrics gave 182.75 GB/s, but the STREAM executable itself reported "Triad: 101985.9 MB/s." That is in the same ballpark but still a pretty big difference, which makes me suspicious of my calculation.
My questions are:
1. Is there a way that I overlooked to get the GUI to tell me the bandwidth directly (perhaps computed under the hood using memory controller events instead of core events)?
2. Am I applying the formula with the core events correctly? If so, why is there such a large discrepancy with the output of the STREAM executable?
Thanks in advance for your help,
Michael