gpu architecture

GPU Architecture

ampere 100


a100 滴滴云
https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/
The full implementation of the GA100 GPU includes the following units:

  • 8 GPCs, 8 TPCs/GPC, 2 SMs/TPC, 16 SMs/GPC, 128 SMs per full GPU
  • 64 FP32 CUDA Cores/SM, 8192 FP32 CUDA Cores per full GPU
  • 4 third-generation Tensor Cores/SM, 512 third-generation Tensor Cores per full GPU
  • 6 HBM2 stacks, 12 512-bit memory controllers
    five active HBM2 stacks, HBM2 1215 MHz(10 512-bit memory controllers, not full GA100),1555 GB/sec = 10 * 512 * 1215 * 2/8

Turing

Brand Name GPU Architecture Tensor Core NVIDIA CUDA® Cores TensorFLOPS Single-Precision Double-Precision Mixed-Precision(FP16/FP32) INT8 INT4 GPU Memory Interconnect Bandwidth System Interface
V100 PCle NVIDIA Volta 640 1nd 5,120 112 TFLOPS 14 TFLOPS 7 TFLOPS 12x TFLOPS 32 GB HBM2 900 GB/sec 32 GB/sec x16 PCIe Gen3
V100 SXM2 NVIDIA Volta 640 1nd 5,120 125 TFLOPS 15.7 TFLOPS 7.8 TFLOPS 32 GB HBM2 900 GB/sec 300 GB/sec x6 NVLink 2.0
T4 NVIDIA Turing 320 2nd 2,560 8.1 TFLOPS 65 TFLOPS 130 TOPS 260 TOPS 16 GB GDDR6 300 GB/sec 32 GB/sec x16 PCIe Gen3

Nvidia Jetson

Orin 64GB

image

(2 * 4 * 2) * 4 = 64 Tensor Cores
(2GPC * 4TPC/GPC * 2SMs/TPC) * 128 = 2048 CUDA Cores (INT32+FP32)

int8算力

  • 稀疏算力 Sparse TOPS
    270 TOPS = 170T+105T(52.5*2), 其中Tensor Core贡献(1.3GHz * 64 * 256 * 2 * 2) * 2= 85 T * 2, int8是fp16两倍。 DLA 2.0 每个为52.5 TOPS sparse INT8
  • 稠密算力 Dense TOPS

GPU Infos

  • P100

    image image
  • V100

    • ALU
      • 5376 FP32 cores = 6 GPC * 7 TPC * 2 SM * 64 FP32 cores(64 INT32 cores, 32 FP64 cores, 8 Tensor Cores, Four texture units)
      • SM
        • 64 FP32 + INT32 cores, 32 FP64 cores, 8 tensor cores(FP32/FP16 mixed-precision)
        • 4 subcore inside SM, 16 FP32 + INT32 cores, 8 FP64 cores, 2 tensor cores, 8 LD/ST units
      • TensorCore: 64 floating point(FP16) FMA / (TensorCore *clock), 512 per SM per clock. 64 * 640 TensorCore * 2 * 1530 Mhz = 125 TFlops
      • single precision (FP32) floating-point calculations: (5120 FP32 CUDA cores) × (2 flop/core/cycle) × (1.53 Gcycle/s) ≈ 15.7 Tflop/s, The factor of 2 flop/core/cycle comes from the ability of each core to execute FMA instructions(instruction throughput is N/32 instructions per clock cycle).
    • Mem
      • 512 bit * 8 memory controllers
      • 6144 KB L2 cache
    • IO
      • six links and the bi-directional bandwidth of each link is 50 GB/s, so the bi-directional bandwidth between different GPUs is up to 300 GB/s.
image image image image
  • T4

image

  • A100
    • ALU
      • 8192 FP32 cores = 8 GPC * 8 TPC * 2 SM * 64 FP32 cores(64 INT32 cores, 32 FP64 cores, 4 Tensor Cores, Four texture units)
      • SM
        • 64 FP32 + INT32 cores, 32 FP64 cores, 4 * 3rd tensor cores(FP32/FP16, int8/int4 mixed-precision)
        • 4 subcore inside SM, 16 FP32 + INT32 cores, 8 FP64 cores, 1 tensor cores, 8 LD/ST units
      • TensorCore: 256 floating point(FP16) FMA / (TensorCore *clock), 256 * 4 * 108 * 2 * 1.41 Gcycle/s = 312 TFlops. Sparse performance double.
      • single precision (FP32) floating-point calculations: (8192 FP32 CUDA cores) × (2 flop/core/cycle) × (1.41 Gcycle/s) ≈ 23.1 Tflop/s, The factor of 2 flop/core/cycle comes from the ability of each core to execute FMA instructions(instruction throughput is N/32 instructions per clock cycle).
      • 108(7 * 8 * 2) * 64 * 1410 Mhz * 2 = 19.5 TFlops
    • Mem/IO
      • 512 bit * 12 memory controllers (Full A100), 512 * 10 (A100)
      • 5 active HBM2 stacks, HBM2 1215 MHz(10 512-bit memory controllers, not full GA100),1555 GB/sec = 10 * 512 * 1215 * 2/8
      • 192 KB shared-mem / L1 per SM
      • 40 MB L2 cache
      • NVLink: 50 Gbit/sec * 12 = 600 Gbit/sec

image
image

image

Comparison of NVIDIA Tesla GPUs

image
Data Center GPU NVIDIA Tesla P100 NVIDIA Tesla V100 NVIDIA A100
GPU Codename GP100 GV100 GA100
GPU Architecture NVIDIA Pascal NVIDIA Volta NVIDIA Ampere
GPU Board Form Factor SXM SXM2 SXM4
SMs 56 80 108
TPCs 28 40 54
FP32 Cores / SM 64 64 64
FP32 Cores / GPU 3584 5120 6912
FP64 Cores / SM 32 32 32
FP64 Cores / GPU 1792 2560 3456
INT32 Cores / SM NA 64 64
INT32 Cores / GPU NA 5120 6912
Tensor Cores / SM NA 8 42
Tensor Cores / GPU NA 640 432
GPU Boost Clock 1480 MHz 1530 MHz 1410 MHz
Peak FP16 Tensor TFLOPS with FP16 Accumulate1 NA 125 312/6243
Peak FP16 Tensor TFLOPS with FP32 Accumulate1 NA 125 312/6243
Peak BF16 Tensor TFLOPS with FP32 Accumulate1 NA NA 312/6243
Peak TF32 Tensor TFLOPS1 NA NA 156/3123
Peak FP64 Tensor TFLOPS1 NA NA 19.5
Peak INT8 Tensor TOPS1 NA NA 624/12483
Peak INT4 Tensor TOPS1 NA NA 1248/24963
Peak FP16 TFLOPS1 21.2 31.4 78
Peak BF16 TFLOPS1 NA NA 39
Peak FP32 TFLOPS1 10.6 15.7 19.5
Peak FP64 TFLOPS1 5.3 7.8 9.7
Peak INT32 TOPS1,4 NA 15.7 19.5
Texture Units 224 320 432
Memory Interface 4096-bit HBM2 4096-bit HBM2 5120-bit HBM2
Memory Size 16 GB 32 GB / 16 GB 40 GB
Memory Data Rate 703 MHz DDR 877.5 MHz DDR 1215 MHz DDR
Memory Bandwidth 720 GB/sec 900 GB/sec 1555 GB/sec
L2 Cache Size 4096 KB 6144 KB 40960 KB
Shared Memory Size / SM 64 KB Configurable up to 96 KB Configurable up to 164  KB
Register File Size / SM 256 KB 256 KB 256 KB
Register File Size / GPU 14336 KB 20480 KB 27648 KB
TDP 300 Watts 300 Watts 400 Watts
Transistors 15.3 billion 21.1 billion 54.2 billion
GPU Die Size 610 mm² 815 mm² 826 mm2
TSMC Manufacturing Process 16 nm FinFET+ 12 nm FFN 7 nm N7