风花雪月

cuda

Posted on 2026-04-17

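A detailed explanation of the CUDA programming model, including the thread, block, and grid hierarchy, the warp concept, the memory model (registers, shared memory, global memory, etc.), and CUDA stream programming, covering core concepts such as memory allocation, stream creation, and synchronization.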

gemm optimize

Posted on 2026-04-17

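A detailed explanation of GEMM (matrix multiplication) optimization techniques, including basic concepts, vector inner-product and outer-product optimization methods, double buffering, and other core optimization strategies that help improve matrix computation performance on GPUs.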

In-depth analysis of CUTLASS GEMM implementation, including MmaPolicy and MmaBase template class design, shared memory management, tensor references, warp-level GEMM operations, and other core code structures and implementation details.

cuda memory

Posted on 2026-04-17

An explanation of the CUDA memory hierarchy, covering the characteristics and usage of the register file, L1 cache, shared memory, constant cache, L2 cache, global memory, local memory, texture memory, and constant memory. Also covers memory access patterns for global and shared memory, and optimization techniques (SoA vs AoS, vectorized loads, __ldg, broadcast, padding, bank-aware layout, occupancy, TMA, async copy).

gpu instruction throughput

Posted on 2026-04-17

GPU instruction throughput and latency analysis, detailing performance characteristics of different instruction types and instruction execution capabilities per SM, providing important reference data for GPU programming optimization.

gpu materials

Posted on 2026-04-17

A collection of GPU programming learning resources, including NVIDIA GTC conference lectures, CUTLASS library tutorials, CUDA programming books, and open source projects, covering basic to advanced GPU development techniques.

gpu new features

Posted on 2026-04-17

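An introduction to new NVIDIA GPU features, including the V100's Volta SIMT model and Cooperative Groups, as well as advanced A100 technologies such as asynchronous copy, asynchronous barriers, task graph acceleration, and 2:4 structured sparsity.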

GPU Performance Optimization

Posted on 2026-04-17

GPU performance optimization concepts, including register spilling (causes, effects, overflow path) and active warps (occupancy, latency hiding, warp scheduler).

gpu roofline

Posted on 2026-04-17

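The Roofline model, a GPU performance analysis tool used to evaluate GPU compute performance bottlenecks, helping developers understand how arithmetic intensity and memory bandwidth affect performance.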

gpu command

Posted on 2026-04-17

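A comprehensive reference of GPU management and monitoring commands, including detailed nvidia-smi parameter descriptions, GPU status monitoring, compute mode settings, power limits, clock frequency locking, process queries, and other practical commands and configuration methods.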