Performance Optimization

Performance Measurement

CPU

  • theoretical peak

Consider two Intel Xeon E5-2697 v2 processors (a dual-socket "2S-E5" system) with 12 cores per CPU, each running at 2.7 GHz with turbo mode disabled. These processors support the AVX extension, whose 256-bit SIMD instructions can process 8 single-precision (32-bit) numbers per CPU cycle.

The theoretical peak Flop/s is 2.7 (GHz) × 8 (SP lanes) × 2 (ADD/MUL) × 12 (cores) × 2 (CPUs) = 1036.8 GFlop/s.

  • memory bandwidth

The theoretical memory bandwidth is computed from the memory transfer rate (1866 MT/s), the number of channels (4), and the number of bytes transferred per channel per transfer (8), which gives 1866 × 4 × 8 × 2 (# of processors) ≈ 119 GByte/s peak bandwidth for the dual-socket 2S-E5 system.
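The two CPU formulas above can be checked with a short script; all figures are the ones given in the text.

```python
# Theoretical peak compute and memory bandwidth for the dual-socket
# Xeon E5-2697 v2 system described above.

ghz, simd_sp, addmul, cores, sockets = 2.7, 8, 2, 12, 2
peak_gflops = ghz * simd_sp * addmul * cores * sockets
print(round(peak_gflops, 1))  # 1036.8 GFlop/s

mts, channels, bytes_per_transfer = 1866, 4, 8
peak_gbs = mts * channels * bytes_per_transfer * sockets / 1000
print(round(peak_gbs, 1))     # 119.4 GB/s
```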

GPU

  • theoretical peak

    Theoretical peak FLOPS can be calculated as follows:

    FLOPS = Number of CUDA Cores × Clock Speed (GHz) × Operations per Cycle

For example, an NVIDIA Tesla V100 GPU has 5120 CUDA cores and a boost clock of 1.53 GHz. Each core can perform 2 floating-point operations per cycle (one fused multiply-add), so the theoretical peak single-precision FLOPS is:
FLOPS = 5120 × 1.53 × 2 ≈ 15.7 TFLOPS

  • memory bandwidth

Theoretical memory bandwidth can be calculated using the formula:

Memory Bandwidth = Effective Memory Data Rate (Gbps per pin) × Memory Bus Width (bits) / 8 (to convert bits to bytes)

Note that for DDR-style memory the effective data rate is already twice the memory clock, so the ×2 must not be applied again.

For instance, the NVIDIA Tesla V100 has an effective memory data rate of 1.75 Gbps per pin (an 877 MHz HBM2 clock, double-pumped) and a memory bus width of 4096 bits, so the theoretical memory bandwidth is:

Memory Bandwidth = 1.75 × 4096 / 8 = 896 ≈ 900 GB/s

  • Further reading: "GPU Architecture: A Look at Peak FLOPS and Memory Bandwidth"
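The two V100 calculations above can likewise be verified in a few lines, using the figures from the text.

```python
# Theoretical peak FLOPS and memory bandwidth for the Tesla V100
# figures quoted above.

cuda_cores, clock_ghz, ops_per_cycle = 5120, 1.53, 2   # FMA = 2 FLOPs/cycle
peak_tflops = cuda_cores * clock_ghz * ops_per_cycle / 1000
print(round(peak_tflops, 1))  # 15.7 TFLOPS

eff_rate_gbps, bus_bits = 1.75, 4096   # effective (double-pumped) rate per pin
bandwidth_gbs = eff_rate_gbps * bus_bits / 8
print(bandwidth_gbs)          # 896.0 GB/s, i.e. ~900 GB/s
```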

Operator Performance Optimization

Conv

Graph Optimization

Layout Tuning

  • LayoutSensitiveOps
    norm
    conv
    pool
    resize
    fused_conv
    grid_simple
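Layout tuning inserts or removes layout conversions around these layout-sensitive ops; in NumPy terms, the NCHW ↔ NHWC conversions are plain transposes:

```python
import numpy as np

# Layout-sensitive ops (conv, pool, norm, ...) assume a fixed axis order.
# Layout tuning places transposes like these around them, and tries to
# cancel adjacent pairs.
x_nchw = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)

x_nhwc = x_nchw.transpose(0, 2, 3, 1)   # NCHW -> NHWC
back   = x_nhwc.transpose(0, 3, 1, 2)   # NHWC -> NCHW (they cancel)

assert back.shape == (2, 3, 4, 5)
assert np.array_equal(back, x_nchw)
```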

Optimization

Compute time is determined by data-movement efficiency and compute efficiency:

  • Data-movement efficiency: CPU/GPU memory reads and writes, PCIe transfers, GPU kernel launches
  • Compute efficiency: speed of the compute units

Increase compute density: with roughly the same model FLOPs, reduce memory traffic and the number of kernels, i.e. raise FLOPs/byte and FLOPs/kernel_launch. For example: minimize tensor-rearrangement operators (Concat/Split/Transpose, etc.), reduce the blow-up in intermediate-result memory traffic caused by operators such as Tile/Gather, and increase the batch size per request where possible.
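The batch-size effect can be made concrete with a back-of-the-envelope FLOPs/byte count for a fully connected layer (the function name and sizes here are illustrative):

```python
# Arithmetic intensity (FLOPs/byte) of an FC layer y = x @ W,
# x: (batch, k), W: (k, n), fp32 (4 bytes). A larger batch amortizes
# the weight traffic, raising FLOPs/byte -- the effect described above.
def flops_per_byte(batch, k, n, dtype_bytes=4):
    flops = 2 * batch * k * n                                  # multiply-add
    bytes_moved = dtype_bytes * (batch * k + k * n + batch * n)  # x, W, y
    return flops / bytes_moved

print(round(flops_per_byte(1, 1024, 1024), 2))   # -> 0.5
print(round(flops_per_byte(32, 1024, 1024), 2))  # -> 15.06
```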

Compute Optimization

Memory-Access Optimization

Graph Optimization

Useless-Node Elimination

  • Unsqueeze with a constant input can be constant-folded
  • A Slice op with index_start equal to 0 and index_end equal to c-1 is a no-op and can be removed
  • An Expand op whose specified output shape equals the input shape is a no-op and can be removed
  • Of consecutive Reshapes, only the last one needs to be kept
  • Of consecutive memory-layout conversions, only the last one needs to be kept
  • Concat/Split elimination: when a Concat is immediately split back the same way, both ops can be removed
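A few of these rules can be sketched as a single pass over a toy op list; the `(op_type, attrs)` tuples are a made-up IR for illustration only:

```python
# Toy useless-node elimination implementing three of the rules above:
# no-op Expand, no-op Slice, and collapsing consecutive Reshapes.
def eliminate(ops):
    out = []
    for op, attrs in ops:
        if op == "Expand" and attrs["shape"] == attrs["input_shape"]:
            continue                      # output shape == input shape: no-op
        if op == "Slice" and attrs["start"] == 0 and attrs["end"] == attrs["dim"] - 1:
            continue                      # full-range slice: no-op
        if op == "Reshape" and out and out[-1][0] == "Reshape":
            out.pop()                     # keep only the last consecutive Reshape
        out.append((op, attrs))
    return out

ops = [
    ("Reshape", {"shape": (2, 8)}),
    ("Reshape", {"shape": (4, 4)}),
    ("Expand",  {"shape": (4, 4), "input_shape": (4, 4)}),
]
print(eliminate(ops))  # [('Reshape', {'shape': (4, 4)})]
```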

Operator Fusion

  • Matmul+Add->Gemm
  • Conv+Add
  • Conv+Act
  • Conv+BN
  • MatMul+Scale: fuse the scale into alpha
  • Gemm+Act
  • Gemm+elementwise
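Conv+BN fusion folds the BatchNorm statistics into the convolution's weights and bias at inference time. A NumPy sketch, using a 1×1 conv (which is just a per-pixel matmul) for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
cin, cout, n = 4, 3, 10
x = rng.standard_normal((n, cin))

# 1x1 conv as a matmul: y = x @ W.T + b
W = rng.standard_normal((cout, cin))
b = rng.standard_normal(cout)

# BatchNorm inference parameters (per output channel)
gamma, beta = rng.standard_normal(cout), rng.standard_normal(cout)
mean, var, eps = rng.standard_normal(cout), rng.random(cout) + 0.1, 1e-5

# Unfused: conv, then BN
y_ref = ((x @ W.T + b) - mean) / np.sqrt(var + eps) * gamma + beta

# Fused: fold the BN scale/shift into W and b
scale = gamma / np.sqrt(var + eps)
W_f = W * scale[:, None]
b_f = (b - mean) * scale + beta
y_fused = x @ W_f.T + b_f

assert np.allclose(y_ref, y_fused)   # one kernel instead of two
```

The fused form does the same math with one pass over memory, which is exactly the FLOPs/kernel_launch improvement discussed earlier.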

Operator Transformation

  • BN -> Scale (less computation)
  • Matmul -> Conv
  • FC -> Conv1x1
  • Matmul+Transpose
  • ShuffleChannel -> Reshape+Transpose
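The FC -> Conv1x1 transform relies on the fact that a dense layer applied at every spatial position is exactly a 1×1 convolution. A NumPy check (the 1×1 conv is written as a channel contraction, since that is all it is):

```python
import numpy as np

rng = np.random.default_rng(1)
n, cin, h, w, cout = 2, 4, 3, 3, 5
x = rng.standard_normal((n, cin, h, w))
W = rng.standard_normal((cout, cin))   # shared by the FC and the 1x1 conv

# FC view: flatten the spatial positions, apply the dense layer per position
x_flat = x.transpose(0, 2, 3, 1).reshape(-1, cin)          # (n*h*w, cin)
fc = (x_flat @ W.T).reshape(n, h, w, cout).transpose(0, 3, 1, 2)

# 1x1-conv view: contract the channel axis in place
conv = np.einsum("nchw,oc->nohw", x, W)

assert np.allclose(fc, conv)   # identical results, (n, cout, h, w)
```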