Summary: This commit fuses weight decay and the mixed-precision conversion into the parameter update kernels to reduce training latency.
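
To illustrate the idea, here is a minimal NumPy sketch, not the code from this commit. It assumes plain SGD with L2 weight decay, fp32 master weights, and fp16 gradients and model weights; the function names `sgd_step_unfused` and `sgd_step_fused` and their parameters are hypothetical. The unfused version makes a separate pass over the parameters for each step, while the fused version does the gradient conversion, weight decay, update, and fp16 write-back in one pass per element.

```python
import numpy as np


def sgd_step_unfused(master_w, grad_fp16, lr, wd):
    """Reference path: each line stands in for a separate kernel launch."""
    lr, wd = np.float32(lr), np.float32(wd)
    grad = grad_fp16.astype(np.float32)       # pass 1: fp16 -> fp32 gradient
    grad = grad + wd * master_w               # pass 2: weight decay
    new_master = master_w - lr * grad         # pass 3: parameter update
    model_w = new_master.astype(np.float16)   # pass 4: fp32 -> fp16 model copy
    return new_master, model_w


def sgd_step_fused(master_w, grad_fp16, lr, wd):
    """Fused path: one loop reads each element once and writes both outputs."""
    lr, wd = np.float32(lr), np.float32(wd)
    new_master = np.empty_like(master_w)
    model_w = np.empty(master_w.shape, dtype=np.float16)
    for i in range(master_w.size):
        w = master_w.flat[i]
        g = np.float32(grad_fp16.flat[i]) + wd * w   # convert + weight decay
        w = w - lr * g                               # update
        new_master.flat[i] = w                       # fp32 master weight
        model_w.flat[i] = np.float16(w)              # fp16 model weight
    return new_master, model_w
```

In a GPU kernel, the loop body would be the work of a single thread, so the intermediate gradient never has to be written to and re-read from global memory. That saved memory traffic, together with fewer kernel launches, is the likely source of the latency reduction.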