Try to reduce CPU-GPU data transfers during training.
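A minimal PyTorch sketch of the idea (the model and batch shapes here are made up for illustration): stage host data in pinned memory, copy it once with `non_blocking=True`, and keep everything on the device for the whole step instead of bouncing tensors back to the CPU.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

batch = torch.randn(32, 64)           # host-side batch (hypothetical shape)
if device == "cuda":
    batch = batch.pin_memory()        # page-locked memory allows async H2D copy
batch = batch.to(device, non_blocking=True)

model = torch.nn.Linear(64, 10).to(device)  # weights stay on the device
out = model(batch)                    # avoid intermediate .cpu()/.item() calls
```

In a real training loop the same effect comes from `DataLoader(..., pin_memory=True)` and doing all logging/metric reads at the end of the step rather than per-kernel.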
Try Nsight Systems to profile one iteration (both forward and backward) and see whether there are many idle gaps between GPU kernels. Idle gaps mean GPU utilization is low because many operations are being done on the CPU side.
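A sketch of the profiler invocation, assuming `nsys` is on your PATH and `train.py` (a hypothetical script name) runs a single iteration:

```shell
# Capture CUDA kernels, NVTX ranges, and OS runtime calls for one iteration
nsys profile -o one_iter --trace=cuda,nvtx,osrt python train.py
# Open one_iter.nsys-rep in the Nsight Systems GUI and look for gaps
# between kernel launches on the CUDA timeline.
```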
If you are using TensorFlow you can enable XLA to accelerate training. PyTorch has a similar DL compiler path (torch.compile). You can also enable AMP (automatic mixed precision / FP16) to accelerate.
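A hedged sketch of AMP in PyTorch via `torch.autocast` (the toy linear model and shapes are assumptions; FP16 is used on CUDA, BF16 on CPU since CPU autocast does not support FP16):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(64, 10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(32, 64, device=device)
y = torch.randn(32, 10, device=device)

amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.autocast(device_type=device, dtype=amp_dtype):
    loss = torch.nn.functional.mse_loss(model(x), y)  # forward in mixed precision
loss.backward()  # backward outside the autocast region, as recommended
opt.step()
```

For real FP16 training on CUDA you would normally pair this with a `torch.cuda.amp.GradScaler` to avoid gradient underflow; it is omitted here to keep the sketch short.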
Kon-kkk t1_iuj7nyz wrote
Reply to [D] When the GPU is NOT the bottleneck...? by alexnasla