Fixed the GPU engine performance
Fixed the GPU engine performance. The mainstream GPU engine was slower than the ATC version because one kernel performed two extra device memory accesses, which I had added for debugging.