Use block reduction for ArgMax and ArgMin Operator
Summary: This commit reimplements the CUDA ArgMax/ArgMin kernels using BlockReduce, replacing the naive reduction that ran inside the kernel loop.
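For readers unfamiliar with the technique, here is a minimal sketch of a block-level ArgMax. It is not the code from this commit. It assumes that "BlockReduce" refers to `cub::BlockReduce` from the CUB library, and it only handles the simplified case of reducing over the last axis of a row-major `[rows, cols]` float matrix. The names `kThreads`, `Pair`, `ArgMaxOp`, `ArgMaxRowsKernel`, and `ArgMaxRows` are hypothetical and chosen for illustration.

```cuda
#include <cub/block/block_reduce.cuh>
#include <cuda_runtime.h>
#include <cfloat>
#include <vector>

constexpr int kThreads = 128;                // hypothetical block size
using Pair = cub::KeyValuePair<int, float>;  // key = index, value = element

// Keeps the pair with the larger value; ties go to the smaller index.
struct ArgMaxOp {
  __device__ Pair operator()(const Pair& a, const Pair& b) const {
    if (b.value > a.value || (b.value == a.value && b.key < a.key)) return b;
    return a;
  }
};

// One block per row. The kernel must be launched with exactly kThreads
// threads per block, because BlockReduce is specialized on that size.
__global__ void ArgMaxRowsKernel(const float* x, int cols, int* out) {
  using BlockReduce = cub::BlockReduce<Pair, kThreads>;
  __shared__ typename BlockReduce::TempStorage temp;

  const float* row = x + static_cast<size_t>(blockIdx.x) * cols;
  ArgMaxOp op;

  // Step 1: each thread scans a strided slice of the row and keeps its
  // local best. Threads with no elements keep the sentinel (0, -FLT_MAX).
  Pair best(0, -FLT_MAX);
  for (int i = threadIdx.x; i < cols; i += blockDim.x) {
    best = op(best, Pair(i, row[i]));
  }

  // Step 2: combine the per-thread results across the block.
  // The reduced value is only valid in thread 0.
  Pair result = BlockReduce(temp).Reduce(best, op);
  if (threadIdx.x == 0) out[blockIdx.x] = result.key;
}

// Host helper for the sketch (error checking omitted for brevity).
std::vector<int> ArgMaxRows(const std::vector<float>& x, int rows, int cols) {
  float* d_x = nullptr;
  int* d_out = nullptr;
  cudaMalloc(&d_x, x.size() * sizeof(float));
  cudaMalloc(&d_out, rows * sizeof(int));
  cudaMemcpy(d_x, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);
  ArgMaxRowsKernel<<<rows, kThreads>>>(d_x, cols, d_out);
  std::vector<int> out(rows);
  cudaMemcpy(out.data(), d_out, rows * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_x);
  cudaFree(d_out);
  return out;
}
```

In this sketch the work for each row is spread across all threads of a block and then merged by one cooperative reduction, rather than having a single thread walk the whole reduction axis in a loop. An ArgMin variant would flip the comparison in the functor and use `FLT_MAX` as the sentinel.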
149 additions and 108 deletions