Channel: Intel® Software - Intel® Many Integrated Core Architecture (Intel MIC Architecture)

VTune bandwidth calculation does not match STREAM benchmark output on KNC


Dear Intel Forum Gurus,

I am learning VTune by using the VTune 2016 GUI to measure the memory bandwidth of the STREAM benchmark on a KNC 5110P. I compiled stream.c with the following options:

icpc -mmic -O3 -g -qopenmp -DSTREAM_ARRAY_SIZE=64000000 -qopt-prefetch-distance=64,8 -qopt-streaming-cache-evict=0 -qopt-streaming-stores never -restrict stream.c

Streaming stores are disabled (-qopt-streaming-stores never) because I want to try core-event-based sampling (more on that later).

First I tried to get the GUI to report the bandwidth directly. This comment (https://software.intel.com/en-us/forums/intel-vtune-amplifier-xe/topic/518185#comment-1793935) indicates that VTune 2013 can report the bandwidth directly, but I didn't see "Bandwidth" among my available analysis types (see the first screenshot below). Section 6.2 of this tutorial (https://software.intel.com/en-us/articles/tutorial-on-intel-xeon-phi-processor-optimization) indicates that VTune 2017 provides a bandwidth histogram, but I didn't see the Memory Usage viewpoint within the Memory Access analysis type (second screenshot below).

Next I tried to measure the bandwidth using the formula given in Section 5.4 of this article: https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding. Specifically:

Read bandwidth = (L2_DATA_READ_MISS_MEM_FILL + L2_DATA_WRITE_MISS_MEM_FILL + HWP_L2MISS) * 64 / CPU_CLK_UNHALTED

Write bandwidth = (L2_VICTIM_REQ_WITH_DATA + SNP_HITM_L2) * 64 / CPU_CLK_UNHALTED

Bandwidth  = (Read bandwidth + write bandwidth)

I compiled without streaming stores because these events do not account for them. I created a custom analysis type to record all the necessary events and applied the formula (third screenshot below) to the Triad kernel (highlighted line). I divide CPU_CLK_UNHALTED by 60 in the denominator because I'm almost positive that CPU_CLK_UNHALTED is reported as the sum of clock ticks over all 60 cores, so dividing by 60 should recover the per-core cycle count that corresponds to the function's actual wall time.
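To make the arithmetic explicit, here is a minimal Python sketch of the formula with the divide-by-core-count step applied. The event names come from the formula quoted above; everything else is an assumption for illustration: the function names, the placeholder event counts (not my measured values), and the final conversion from bytes per clock to GB/s by multiplying by a core clock frequency, for which I assume a nominal 1.053 GHz on the 5110P.

```python
LINE_BYTES = 64  # cache-line size in bytes (the factor 64 in the formula)


def bytes_per_clock(counts, n_cores=1):
    """Return (read, write, total) bandwidth in bytes per clock.

    counts  : dict mapping event name -> count summed over all cores
    n_cores : divides CPU_CLK_UNHALTED, on the assumption that the
              reported value is a sum of clock ticks over all cores
    """
    clocks = counts["CPU_CLK_UNHALTED"] / n_cores
    read_lines = (counts["L2_DATA_READ_MISS_MEM_FILL"]
                  + counts["L2_DATA_WRITE_MISS_MEM_FILL"]
                  + counts["HWP_L2MISS"])
    write_lines = (counts["L2_VICTIM_REQ_WITH_DATA"]
                   + counts["SNP_HITM_L2"])
    read = read_lines * LINE_BYTES / clocks
    write = write_lines * LINE_BYTES / clocks
    return read, write, read + write


def to_gb_per_s(bytes_per_clk, clock_hz):
    """Convert bytes per clock to GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_clk * clock_hz / 1e9


if __name__ == "__main__":
    # Placeholder counts, NOT measured data.
    example_counts = {
        "CPU_CLK_UNHALTED": 60_000_000,
        "L2_DATA_READ_MISS_MEM_FILL": 1_000_000,
        "L2_DATA_WRITE_MISS_MEM_FILL": 500_000,
        "HWP_L2MISS": 500_000,
        "L2_VICTIM_REQ_WITH_DATA": 900_000,
        "SNP_HITM_L2": 100_000,
    }
    ASSUMED_CLOCK_HZ = 1.053e9  # assumed nominal 5110P core clock
    r, w, total = bytes_per_clock(example_counts, n_cores=60)
    print("read  GB/s:", to_gb_per_s(r, ASSUMED_CLOCK_HZ))
    print("write GB/s:", to_gb_per_s(w, ASSUMED_CLOCK_HZ))
    print("total GB/s:", to_gb_per_s(total, ASSUMED_CLOCK_HZ))
```

Calling bytes_per_clock with n_cores=1 instead of 60 shows how strongly the result depends on whether CPU_CLK_UNHALTED really is summed over cores, which is exactly the step I am least sure about.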

My calculation from the event counts gave 182.75 GB/s, but the STREAM executable itself reported "Triad: 101985.9 MB/s" (about 102 GB/s). The two are in the same ballpark, but the difference is large enough to make me suspicious of my calculation.
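For reference, here is a small Python sketch that puts the two numbers in the same units and shows how STREAM counts Triad traffic. It assumes STREAM's MB means 10^6 bytes and that the arrays hold 8-byte doubles (the stream.c default). The four-array case is only a hypothesis to check, not a confirmed explanation: with streaming stores disabled, the store to the destination array may first read the cache line from memory, and STREAM's own byte count does not include that read.

```python
STREAM_ARRAY_SIZE = 64_000_000  # from -DSTREAM_ARRAY_SIZE on the compile line
BYTES_PER_DOUBLE = 8            # assumes the default STREAM_TYPE of double


def mb_to_gb_per_s(mb_per_s):
    """Convert MB/s to GB/s, assuming decimal units (1 MB = 1e6 bytes)."""
    return mb_per_s / 1000.0


def triad_bytes(n, arrays_touched):
    """Bytes moved by one Triad pass over arrays of n doubles."""
    return arrays_touched * BYTES_PER_DOUBLE * n


stream_gb = mb_to_gb_per_s(101985.9)  # value printed by the STREAM executable
vtune_gb = 182.75                     # value from the event-count formula
ratio = vtune_gb / stream_gb

# STREAM counts two loads and one store per Triad element.
counted = triad_bytes(STREAM_ARRAY_SIZE, 3)
# Hypothesis: one extra read of the destination line per store.
with_allocate = triad_bytes(STREAM_ARRAY_SIZE, 4)

if __name__ == "__main__":
    print("STREAM Triad GB/s:", stream_gb)
    print("event-based / STREAM ratio:", ratio)
    print("bytes per pass as counted by STREAM:", counted)
    print("bytes per pass with an extra destination read:", with_allocate)
    print("STREAM GB/s scaled by 4/3:", stream_gb * with_allocate / counted)
```

Even if the extra read does occur, scaling the STREAM figure by 4/3 does not reach 182.75 GB/s, so this alone would not account for the whole gap.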

My questions are:

1. Is there a way I overlooked to get the GUI to report the bandwidth directly (perhaps computed under the hood from memory controller events instead of core events)?
2. Am I applying the formula with the core events correctly? If so, why is there such a large discrepancy with the output of the STREAM executable?

Thanks in advance for your help,
Michael

Attachments:
1.png (image/png, 1.91 MB)
2.png (image/png, 1.14 MB)
3.png (image/png, 408.33 KB)

Thread Topic: Question
