Jing Fang authored and GitHub committed c61a4b115ea
[Mlas] Unblock hardcoded matmul blocking size (#23815)

### Description

In GemmBatch, the target matrix is cut into blocks that are dispatched across multiple threads for intra-op parallelism. The number of blocks is currently hard-coded to 16, so on CPUs with more than 16 cores the cores are not fully utilized within a single op. This change removes the hard-coded block-count limit in the various MatMul implementations.

__Benchmark results__

Model: llmlingua-2-bert-base-multilingual-cased-meetingbank--add-force-token-100--max-seq-len-512-CPU-INT8.onnx
Setup: 96-core x86 Linux
Runs: 50

Before (64 threads):

Setting intra_op_num_threads to 64
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.485097 s
First inference time cost: 356 ms
Total inference time cost: 17.731 s
Total inference requests: 50
__Average inference time cost: 354.619 ms__
Total inference run time: 17.7312 s
Number of inferences per second: 2.81989
Avg CPU usage: 65 %
Peak working set size: 542265344 bytes

After (32 threads):

Setting intra_op_num_threads to 32
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.523394 s
First inference time cost: 316 ms
Total inference time cost: 12.2739 s
Total inference requests: 50
__Average inference time cost: 245.478 ms__
Total inference run time: 12.2741 s
Number of inferences per second: 4.07362
Avg CPU usage: 33 %
Peak working set size: 611241984 bytes

After (64 threads):

Setting intra_op_num_threads to 64
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.497698 s
First inference time cost: 289 ms
Total inference time cost: 9.49205 s
Total inference requests: 50
__Average inference time cost: 189.841 ms__
Total inference run time: 9.49226 s
Number of inferences per second: 5.26745
Avg CPU usage: 65 %
Peak working set size: 548470784 bytes

### Motivation and Context

This issue was reported by the M365 research team.