

In the USB 3.0 variant, the USB 3.0 interface is the data bottleneck. What should be the limiting element in data transmission? At an expected data rate of about 1 GB/s that should not be the bottleneck, right? So theoretically it should work with CUDA. This year PCIe 4.0 will come out with twice the data rate. The 6GB Asus GeForce GTX 1060 Dual OC Active PCIe 3.0 x16 (Retail) should have a transfer rate of 15 GB/s.

And the same for the GPU (for the initial tests you could probably go with any non-ancient GPU you can find in a PC or laptop). Then try to get hold of some FPGA target (borrow it somewhere? from NI directly?) and try to implement this FFT. So, wrapping this up, I would review the requirements (OK, if you say that it is absolutely 4 μs without any doubts, then let's stick with it - and I really think it is awesome to push the performance to the limits). I would rather say that the only way to know for sure is to actually prototype it, try different configurations and implementations, and see if it works. I wrote all of the above to provide a bit of perspective, but I'm not saying that this is impossible to do.

The GPU itself should be able to do the FFT calculation in that time with no problem; the limiting factor is data transfer to and from the GPU. OK, that's Xilinx, but you say: "According to NI, the FFT runs in 4 μs (2048 pixels, 12bit) with the PXIe 7965 (Virtex-5 SX95T)." - I can't find it, could you provide a reference? Take a look at page 41 for example - they didn't go under 4 μs. We can also take a look at older versions of the FFT IP - Xilinx even included latency information there. From my own experience: when trying to compile the FFT core at 300 MHz I got about a 50% success rate (that is, 50% of the compilations failed due to the timing constraints) - but this is FPGA compilation, so when you're at the performance boundary, it is really random. Assuming a Kintex-7 target and a pipelined architecture, you probably won't make it to 400 MHz.
If we go down the Xilinx FFT IP core path: the FPGA itself is not the limiting factor for the clock frequency, the IP core is.
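
To put rough numbers on the two budgets discussed in this thread (link bandwidth and FPGA clock), here is a minimal Python sketch. It is only back-of-the-envelope arithmetic under stated assumptions: each 12-bit sample is assumed to be stored in 16 bits, the pipelined core is assumed to accept one sample per clock, and pipeline latency and per-transfer overhead are ignored. All function and variable names are made up for illustration. The USB 3.0 figure is the nominal 5 Gbit/s signalling rate after 8b/10b encoding; the PCIe figure is the one quoted above.

```python
"""Back-of-the-envelope budgets for a 2048-point FFT every 4 µs.

All names and the 16-bit storage assumption are illustrative only.
"""

N_POINTS = 2048          # FFT length quoted in the thread
FRAME_PERIOD_S = 4e-6    # 4 µs per frame, the requirement from the thread
BYTES_PER_SAMPLE = 2     # ASSUMPTION: each 12-bit sample is stored in 16 bits

# Nominal link bandwidths in bytes/s (upper bounds, before protocol overhead).
LINK_BANDWIDTH = {
    "USB 3.0 (5 Gbit/s, 8b/10b encoded)": 500e6,
    "PCIe 3.0 x16 (figure from the thread)": 15e9,
}


def required_data_rate(n_points=N_POINTS,
                       bytes_per_sample=BYTES_PER_SAMPLE,
                       period_s=FRAME_PERIOD_S):
    """One-way data rate in bytes/s needed to move one frame per period."""
    return n_points * bytes_per_sample / period_s


def streaming_frame_time(clock_hz, n_points=N_POINTS, samples_per_clock=1):
    """Seconds needed to stream one frame into a pipelined core.

    ASSUMPTION: the core accepts `samples_per_clock` samples every cycle;
    pipeline latency is not included.
    """
    return n_points / (samples_per_clock * clock_hz)


def min_streaming_clock(n_points=N_POINTS,
                        period_s=FRAME_PERIOD_S,
                        samples_per_clock=1):
    """Lowest clock in Hz that streams one frame within the period."""
    return n_points / (samples_per_clock * period_s)


def main():
    rate = required_data_rate()
    print(f"Required one-way data rate: {rate / 1e9:.3f} GB/s")
    for name, bandwidth in LINK_BANDWIDTH.items():
        verdict = "enough" if bandwidth >= rate else "too slow"
        print(f"  {name}: {bandwidth / 1e9:.2f} GB/s -> {verdict}")

    for clock_mhz in (300, 400):
        frame_time = streaming_frame_time(clock_mhz * 1e6)
        print(f"{clock_mhz} MHz: {frame_time * 1e6:.2f} µs per frame")

    print(f"Clock needed for 4 µs: {min_streaming_clock() / 1e6:.0f} MHz")


if __name__ == "__main__":
    main()
```

Under these assumptions the required rate comes out at roughly 1 GB/s in each direction, which matches the figure mentioned above: more than USB 3.0 can carry, and well within the quoted PCIe 3.0 x16 bandwidth. On the FPGA side, a one-sample-per-clock stream would need about 512 MHz to fit 2048 samples into 4 μs, so a core running at 300 or 400 MHz would need either more than one sample per clock or a relaxed frame period. Bandwidth alone says nothing about per-transfer latency to and from the GPU, which still has to be measured on a prototype.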
