Hi all, and thanks in advance for your help,
I wrote a simple Fortran program that does an element-wise multiply-add (not a true dot product) on two arrays: one is always double precision, and the datatype of the other changes from test to test.
PROGRAM datatype
  USE omp_lib
  implicit none
  double precision, allocatable, dimension(:,:,:) :: A,B,C
  integer(kind=1), allocatable, dimension(:,:,:) :: D
  integer(kind=4), allocatable, dimension(:,:,:) :: E
  integer(kind=8), allocatable, dimension(:,:,:) :: F
  real, allocatable, dimension(:,:,:) :: G
  LOGICAL, allocatable, dimension(:,:,:) :: H
  integer :: t,i,j,k,size = 500,repetition=40
  double precision :: time,time1

  ALLOCATE(A(size,size,size),B(size,size,size),C(size,size,size))
  A = 4.
  B = 1.
  time = omp_get_wtime()
  do t = 1,repetition
    do i=1,size
      do j=1,size
        do k=1,size
          !dir$ vector aligned
          c(k,j,i) = a(k,j,i) * b(k,j,i) +5.2
        enddo
      enddo
    enddo
  enddo
  time = omp_get_wtime() - time
  print *,"TIME double",time/DBLE(repetition)
  DEALLOCATE(B)

  ALLOCATE(G(size,size,size))
  G = 240.
  time = omp_get_wtime()
  do t = 1,repetition
    do i=1,size
      do j=1,size
        do k=1,size
          !dir$ vector aligned
          c(k,j,i) = a(k,j,i) * g(k,j,i) +5.2
        enddo
      enddo
    enddo
  enddo
  time = omp_get_wtime() - time
  print *,"TIME float",time/DBLE(repetition)
  DEALLOCATE(G)

  ALLOCATE(D(size,size,size))
  D = 240
  time = omp_get_wtime()
  do t = 1,repetition
    do i=1,size
      do j=1,size
        do k=1,size
          !dir$ vector aligned
          c(k,j,i) = a(k,j,i) * d(k,j,i) +5.2
        enddo
      enddo
    enddo
  enddo
  time = omp_get_wtime() - time
  print *,"TIME int8",time/DBLE(repetition)
  DEALLOCATE(D)

  ALLOCATE(E(size,size,size))
  e = 240
  time = omp_get_wtime()
  do t = 1,repetition
    do i=1,size
      do j=1,size
        do k=1,size
          !dir$ vector aligned
          c(k,j,i) = a(k,j,i) * e(k,j,i) +5.2
        enddo
      enddo
    enddo
  enddo
  time = omp_get_wtime() - time
  print *,"TIME int32",time/DBLE(repetition)
  DEALLOCATE(E)

  ALLOCATE(F(size,size,size))
  f = 240
  time = omp_get_wtime()
  do t = 1,repetition
    do i=1,size
      do j=1,size
        do k=1,size
          !dir$ vector aligned
          c(k,j,i) = a(k,j,i) * f(k,j,i) +5.2
        enddo
      enddo
    enddo
  enddo
  time = omp_get_wtime() - time
  print *,"TIME int64",time/DBLE(repetition)
  DEALLOCATE(F)

  ALLOCATE(H(size,size,size))
  h = .True.
  time = omp_get_wtime()
  do t = 1,repetition
    do i=1,size
      do j=1,size
        do k=1,size
          !dir$ vector aligned
          c(k,j,i) = a(k,j,i) * h(k,j,i) +5.2
        enddo
      enddo
    enddo
  enddo
  time = omp_get_wtime() - time
  print *,"TIME logical",time/DBLE(repetition)
END PROGRAM
I ran this code on a Broadwell Intel(R) Xeon(R) E5-2697 v4 @ 2.30GHz and on an Intel Xeon Phi 7250 (KNL).
BROADWELL (1 core)
TIME double    0.314651775360107
TIME float     0.256021851301193
TIME int8      0.218752950429916
TIME int32     0.245272749662399
TIME int64     0.319928669929504
TIME logical   0.245576351881027
-------------------------------------------------
KNL (1 core)
TIME double    0.545190346240997
TIME float     0.608061379194260
TIME int8      0.749213725328445
TIME int32     0.718595725297928
TIME int64     0.730906349420547
TIME logical   0.544638276100159
On the Broadwell architecture the best performance was obtained with double * int8, and the worst with double * double and double * int64. I think the better performance with int8 comes from better use of the cache and memory bandwidth (fewer bytes are moved per element), which hides the cost of converting int8 to double. Is that right?
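To make that guess more concrete, here is a rough sketch that estimates the bytes moved per element for each datatype and the memory bandwidth implied by my Broadwell timings. It rests on assumptions that I have not verified: each iteration reads one element of A, reads one element of the second array and writes one element of C, all from main memory; any write-allocate traffic for C is ignored; and the default LOGICAL occupies 4 bytes. The timings are the Broadwell values above rounded to four digits, and the names traffic_estimate, width, t_bdw and gbs are only illustrative.

```fortran
program traffic_estimate
  implicit none
  integer, parameter :: n = 6
  character(len=7) :: label(n)
  integer :: width(n), i
  double precision :: t_bdw(n), elems, bytes_per_elem, gbs

  label = [character(len=7) :: "double", "float", "int8", "int32", "int64", "logical"]

  ! Size in bytes of one element of the second operand (storage_size returns bits).
  width = [storage_size(1.0d0), storage_size(1.0), storage_size(1_1), &
           storage_size(1_4), storage_size(1_8), storage_size(.true.)] / 8

  ! Broadwell times per repetition in seconds, rounded from the results above.
  t_bdw = [0.3147d0, 0.2560d0, 0.2188d0, 0.2453d0, 0.3199d0, 0.2456d0]

  ! Number of elements in each 500 x 500 x 500 array.
  elems = 500.0d0**3

  do i = 1, n
    ! Assumed traffic: 8 bytes read from A, 8 bytes written to C,
    ! plus one element of the second array.
    bytes_per_elem = 16.0d0 + dble(width(i))
    ! Implied bandwidth in GB/s under that assumption.
    gbs = elems * bytes_per_elem / t_bdw(i) / 1.0d9
    print '(a7,2x,i2,2x,f5.1,2x,f6.2)', label(i), width(i), bytes_per_elem, gbs
  end do
end program traffic_estimate
```

Under these assumptions the implied bandwidth comes out at roughly 9 to 10 GB/s for every datatype on Broadwell, which would be consistent with the loop being limited by memory traffic rather than by the conversion to double. The KNL timings do not follow the same pattern.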
I don't understand why the behaviour is the opposite on KNL. I analyzed the compiler optimization report, but in both cases the double precision type determines the vector length, and therefore the number of operations per clock cycle.
Can someone help me understand this behaviour?
Thanks
Best regards
Eric