MMPERF - single core matmul performance benchmark

Welcome to mmperf

mmperf is a single-core GEMM benchmark. This repository benchmarks single-precision matrix multiply (SGEMM) in hand-tuned libraries and code generation stacks, running on a single thread on one CPU core. The focus is on machine learning workloads, so FP32 or smaller data types and irregular matrix sizes. The goal is to expose high-performance atomic kernels that can then be used to build highly efficient higher-level implementations spanning multiple cores or distributed across systems, where efficient atomic kernels are asynchronously scheduled with overlapping communications (inter-chip, within a system, or across systems).
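To make the measurement concrete, here is a minimal sketch of how a single-threaded matmul can be timed and reported in GFLOP/s. It is not mmperf's own harness: the function names (`sgemm_naive`, `gflops`, `benchmark`) and the matrix sizes are hypothetical, and pure-Python floats are double precision, so this only illustrates the metric (2·M·N·K floating-point operations divided by wall time), not real SGEMM performance.

```python
import time

def sgemm_naive(a, b, m, n, k):
    """Return C = A @ B for row-major flat lists; A is m x k, B is k x n."""
    c = [0.0] * (m * n)
    for i in range(m):
        row_c = i * n
        for p in range(k):
            a_ip = a[i * k + p]
            row_b = p * n
            for j in range(n):
                c[row_c + j] += a_ip * b[row_b + j]
    return c

def gflops(m, n, k, seconds):
    # One multiply and one add per (i, j, p) triple: 2*m*n*k operations.
    return 2.0 * m * n * k / seconds / 1e9

def benchmark(m, n, k, repeats=3):
    """Time the kernel several times and report the best run in GFLOP/s."""
    a = [float((i % 7) - 3) for i in range(m * k)]
    b = [float((i % 5) - 2) for i in range(k * n)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        sgemm_naive(a, b, m, n, k)
        best = min(best, time.perf_counter() - start)
    return gflops(m, n, k, best)

if __name__ == "__main__":
    # Hypothetical sizes: one regular, one irregular.
    for (m, n, k) in [(64, 64, 64), (37, 113, 59)]:
        print(f"{m}x{n}x{k}: {benchmark(m, n, k):.4f} GFLOP/s")
```

Taking the best of several repeats is one common way to reduce timing noise. A real comparison would swap `sgemm_naive` for a call into each library or generated kernel and pin the process to one core.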

Engineered Libraries:

  • Intel MKL
  • OpenBLAS
  • RUY
  • Accelerate
  • BLIS

Compiler / Codegen kernels:

  • MLIR
  • Halide
  • TVM
  • Nod.AI



Results on NVIDIA A100 (cuBLAS vs SHARK)


Results on Intel Alder Lake 12900K (AVX2)


Results on Intel Xeon Skylake (iMac Pro, AVX-512)


Results on Xeon Cascade Lake (GCP C2 instance, AVX-512)


Results on Xeon Cascade Lake Codegen TVM, Halide, MLIR (GCP C2 instance, AVX-512)


Results on AMD Ryzen 5950X (Zen 3, compared to AMD’s BLIS and OpenBLAS for ResNet-50 sizes)


Results on Intel Xeon E-2276M Coffee Lake (ThinkPad P53, AVX2)


Results on Apple M1 (NEON - no AMX2)

Note: 8GB Mac Mini runs roughly 25% slower than the 16GB version on other tests.


For more details, see mmperf on GitHub.

Support or Contact

mmperf aims to be a collaborative effort though primarily developed by so if you can get better performance or add a new backend please submit a PR.