Hello everyone,
I am new to Xeon Phi coprocessors and am trying to adapt a program that was written for the CPU only so that it offloads work to the Xeon Phi. While doing this I have seen some strange results for the time needed to transfer data to the coprocessor. I am using simple offload pragmas with in clauses to transfer an array of floats to the device. For example, a transfer of 2 GB takes 3.08 s (677 MB/s) from the CPU to the device, while the same transfer in the other direction takes 0.315 s (6499 MB/s). We are using PCIe 2.0 x16, so the theoretical bandwidth should be 8 GB/s. When getting data from the device we get close to the ideal bandwidth, but when sending data to the device the bandwidth is much worse. I am wondering whether the lanes of the bus are divided so that only 2 lanes are dedicated to host-to-device transfers and the remaining ones to device-to-host transfers.
To try to work around this problem, since the main array is empty at first, we decided not to transfer the array to the device but to create it directly on the device and return it to the host at the end. I have tried this in a number of ways using the offload pragmas (with the into keyword), but I always get the same error telling me that it cannot find the data associated with a pointer. I need the pointer to this dynamically allocated memory region on the MIC to be global, since it is shared between different offload calls. Has anyone experienced and successfully overcome this problem?
Thanks.