Use block reduction for ArgMax and ArgMin operators
Summary: This commit reimplements the CUDA ArgMax/ArgMin operators using BlockReduce, replacing the naive reduction done inside the kernel loop.
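For readers unfamiliar with the pattern, here is a minimal sketch of a block-reduction ArgMax. It is not the code from this commit. It assumes "BlockReduce" refers to `cub::BlockReduce`, that the input is a row-major `rows x cols` float matrix, and that ties go to the smaller index. The names `ValIdx`, `ArgMaxOp`, `ArgMaxRowsKernel`, `ArgMaxRows`, and `kThreads` are hypothetical and chosen only for illustration.

```cuda
#include <cub/block/block_reduce.cuh>
#include <cuda_runtime.h>
#include <cfloat>
#include <vector>

// Hypothetical (value, index) pair; not taken from the commit.
struct ValIdx {
  float val;
  int idx;
};

// Keeps the larger value; on ties, keeps the smaller index (an assumption).
struct ArgMaxOp {
  __device__ ValIdx operator()(const ValIdx& a, const ValIdx& b) const {
    if (b.val > a.val || (b.val == a.val && b.idx < a.idx)) return b;
    return a;
  }
};

constexpr int kThreads = 128;  // threads per block, assumed for this sketch

// One block handles one row at a time. Each thread first scans a strided
// slice of the row, then cub::BlockReduce combines the per-thread partials.
__global__ void ArgMaxRowsKernel(const float* x, int rows, int cols, int* out) {
  using BlockReduce = cub::BlockReduce<ValIdx, kThreads>;
  __shared__ typename BlockReduce::TempStorage temp;
  for (int r = blockIdx.x; r < rows; r += gridDim.x) {
    ValIdx best{-FLT_MAX, 0};  // sentinel for threads that see no element
    for (int c = threadIdx.x; c < cols; c += blockDim.x) {
      best = ArgMaxOp()(best, ValIdx{x[r * cols + c], c});
    }
    // The reduced result is only valid in thread 0.
    ValIdx result = BlockReduce(temp).Reduce(best, ArgMaxOp());
    if (threadIdx.x == 0) out[r] = result.idx;
    __syncthreads();  // temp storage is reused by the next row
  }
}

// Host helper: copies data to the device, runs the kernel, returns indices.
// Error checking is omitted to keep the sketch short.
std::vector<int> ArgMaxRows(const std::vector<float>& h_x, int rows, int cols) {
  float* d_x = nullptr;
  int* d_out = nullptr;
  cudaMalloc(&d_x, h_x.size() * sizeof(float));
  cudaMalloc(&d_out, rows * sizeof(int));
  cudaMemcpy(d_x, h_x.data(), h_x.size() * sizeof(float),
             cudaMemcpyHostToDevice);
  ArgMaxRowsKernel<<<rows, kThreads>>>(d_x, rows, cols, d_out);
  std::vector<int> h_out(rows);
  cudaMemcpy(h_out.data(), d_out, rows * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_x);
  cudaFree(d_out);
  return h_out;
}
```

ArgMin would follow the same shape with the comparison flipped and `FLT_MAX` as the sentinel. Compared with a single thread looping over the whole reduction axis, this layout lets all threads in the block share the scan and merges their partial results in a logarithmic number of steps.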
Showing changes with 141 additions and 100 deletions