Commits


Alex Kogan authored and GitHub committed 8b09702b88a
Enable parallel computation in Clip ops (#14925)

### Description

This PR speeds up Clip operations by replacing their sequential implementation with a parallelized one. The input data is divided into chunks of size N, and a thread pool processes the chunks in parallel. The chunk size N is set to 16K, based on performance evaluation on input tensors of 10^i elements for i in [1 .. 6].

### Motivation and Context

The Clip operation is frequently executed in image processing models. Its implementation is easy to parallelize and can therefore be sped up on a multi-core machine. On long inputs (>= 100K elements), this PR achieves a speedup of over 2x. On shorter inputs, it does not introduce any substantial performance change.
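
### Illustrative sketch

The chunking scheme from the Description can be sketched roughly as follows. This is not the PR's code, which is not shown here. The names `parallel_clip`, `clip_chunk`, and `CHUNK_SIZE` are hypothetical, and reading "16K" as 16 * 1024 elements is an assumption. In CPython the global interpreter lock prevents this pure-Python loop from showing the reported speedup, so the sketch illustrates only how the input is partitioned and handed to a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: "16K" in the description is read as 16 * 1024 elements.
CHUNK_SIZE = 16 * 1024


def clip_chunk(data, start, end, lo, hi):
    """Clamp data[start:end] in place to the range [lo, hi]."""
    for i in range(start, end):
        data[i] = min(max(data[i], lo), hi)


def parallel_clip(data, lo, hi, chunk_size=CHUNK_SIZE, max_workers=None):
    """Clip a list in place, one thread-pool task per chunk (hypothetical helper)."""
    n = len(data)
    # Inputs that fit in a single chunk take the sequential path.
    if n <= chunk_size:
        clip_chunk(data, 0, n, lo, hi)
        return data
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Chunks cover disjoint index ranges, so tasks never write the same element.
        futures = [
            pool.submit(clip_chunk, data, start, min(start + chunk_size, n), lo, hi)
            for start in range(0, n, chunk_size)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return data
```

Because inputs no longer than one chunk stay on the sequential path in this sketch, short inputs pay no thread-pool overhead, which fits the description's note that shorter inputs see no substantial performance change.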