Kon-kkk t1_iuj7nyz wrote

  1. What framework are you using?
  2. What kind of network/model is it?
  3. Try to reduce CPU-GPU data transfers during training.

Try Nsight Systems to profile one iteration (both forward and backward) and see whether there are many idle gaps between GPU kernels. Idle gaps mean GPU utilization is low and many operations are running on the CPU side. If you are using TensorFlow, you can enable XLA to accelerate training; I believe PyTorch has a similar DL compiler for training. You can also enable AMP (automatic mixed precision / fp16) to speed things up.
