Fix GPU engine
Fix GPU engine performance. The mainstream GPU engine was slower than the ATC version because one kernel made two extra device memory accesses that I had added for debugging. The performance doc is now out of date.