Commits


Chen Fu authored and GitHub committed 224380448d3
Expand Qgemm UDOT kernel to 8x8 block (#8562)

Create a new M8 loop that processes A[8x8] and B[8x8] per iteration. Avoid saving registers on paths that do not need them. Adjust the M2 and M1 loops to use more registers, relaxing the loop-carried dependencies.

Nearly 7% improvement observed on Surface Pro X 2 with model ssd_mobilenet_v2_300.
About 4.5% improvement on resnet50 on Surface Pro X 2.
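The blocking described in this commit message can be illustrated with a scalar emulation. The sketch below is not the kernel from the commit, which is hand-written assembly. It is a minimal Python model that assumes unsigned 8-bit inputs, 32-bit wrapping accumulators, and a K dimension that is a multiple of 8. All function names (`udot4`, `qgemm_tile_8x8`, `dot_row_two_accumulators`, `reference_gemm`) are hypothetical. The two-accumulator helper shows one common way to relax a loop-carried dependency and is not necessarily how the M2 and M1 loops were changed.

```python
MASK32 = 0xFFFFFFFF  # accumulators are modelled as 32-bit lanes


def udot4(acc, a4, b4):
    """Emulate one 32-bit lane of UDOT: acc += four u8*u8 products."""
    return (acc + sum(x * y for x, y in zip(a4, b4))) & MASK32


def qgemm_tile_8x8(a, b, k_total):
    """Compute an 8x8 output tile; a is 8 x K, b is K x 8 (lists of ints 0..255)."""
    assert k_total % 8 == 0, "this sketch assumes K is a multiple of 8"
    acc = [[0] * 8 for _ in range(8)]
    # Each outer iteration consumes an A[8x8] and a B[8x8] block,
    # as two groups of four K values (one dot-product step per group).
    for k0 in range(0, k_total, 8):
        for k in (k0, k0 + 4):
            for m in range(8):
                a4 = a[m][k:k + 4]
                for n in range(8):
                    b4 = [b[k + i][n] for i in range(4)]
                    acc[m][n] = udot4(acc[m][n], a4, b4)
    return acc


def dot_row_two_accumulators(a_row, b_col):
    """Single-row dot product split across two independent accumulators.

    Consecutive steps write different accumulators, so step i+1 does not
    have to wait for the result of step i. Assumes len(a_row) % 8 == 0.
    """
    acc0 = acc1 = 0
    for k in range(0, len(a_row), 8):
        acc0 = udot4(acc0, a_row[k:k + 4], b_col[k:k + 4])
        acc1 = udot4(acc1, a_row[k + 4:k + 8], b_col[k + 4:k + 8])
    return (acc0 + acc1) & MASK32


def reference_gemm(a, b, k_total):
    """Plain triple-loop product, used only to check the blocked version."""
    return [[sum(a[m][k] * b[k][n] for k in range(k_total)) & MASK32
             for n in range(8)] for m in range(8)]
```

Comparing `qgemm_tile_8x8(a, b, k)` against `reference_gemm(a, b, k)` on small inputs is a quick way to confirm that the blocked loop order does not change the result. A real kernel would hold the 8x8 tile of accumulators in vector registers instead of nested lists.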