On a Xeon Phi coprocessor, I get a 5x speedup on a single thread when the following code is compiled with -O1 instead of -O0.
    int i;
    for (i = 0; i < nrows; i++) {
        int j;
        double y0 = 0.0;
        int kstart = rows[i];
        int kend = rows[i+1];
        for (j = kstart; j < kend; j++) {
            int jj = cols[j];
            double x0 = x[jj];
            y0 += val[j] * x0;
        }
        y[i] = y0;
    }
Which of the optimizations enabled by -O1 account for the 5x speedup? Is there an ICC flag that reports the optimizations applied, as discussed for GCC in this question?
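For anyone who wants to reproduce the comparison, here is a minimal self-contained sketch of the kernel. The function names `spmv_csr` and `time_spmv` and the `reps` parameter are only illustrative (they are not in my real code), and I am assuming `rows`, `cols` and `val` hold a matrix in CSR format, which is how the loop above uses them.

```c
#include <time.h>

/* Sparse matrix-vector product y = A * x, with A stored in CSR form:
   rows[i]..rows[i+1] delimits the nonzeros of row i,
   cols[j] is the column index and val[j] the value of nonzero j. */
void spmv_csr(int nrows, const int *rows, const int *cols,
              const double *val, const double *x, double *y)
{
    int i;
    for (i = 0; i < nrows; i++) {
        int j;
        double y0 = 0.0;
        int kstart = rows[i];
        int kend = rows[i + 1];
        for (j = kstart; j < kend; j++) {
            int jj = cols[j];
            double x0 = x[jj];
            y0 += val[j] * x0;
        }
        y[i] = y0;
    }
}

/* Hypothetical timing helper: runs the kernel reps times and
   returns the elapsed CPU time in seconds as measured by clock(). */
double time_spmv(int reps, int nrows, const int *rows, const int *cols,
                 const double *val, const double *x, double *y)
{
    int r;
    clock_t t0 = clock();
    for (r = 0; r < reps; r++)
        spmv_csr(nrows, rows, cols, val, x, y);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Building this once with -O0 and once with -O1, keeping every other flag identical, and calling `time_spmv` on the same matrix should make the difference visible. Note that `clock()` measures CPU time at coarse resolution, so `reps` needs to be large enough for the run to take a measurable amount of time.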