In-depth analysis of GPU architectures, covering NVIDIA GPU characteristics including Ampere A100, Turing, Volta, SM counts, CUDA cores, Tensor Core configurations, memory bandwidth, and detailed technical specifications comparison.
Read more »

CUDA PTX ISA and SASS assembly language learning resources, including PTX instruction set architecture documentation, compiler APIs, inline assembly guides, dynamic loading techniques, and other GPU low-level programming materials.
Read more »

A detailed explanation of NVIDIA Tensor Core technology, covering the architectural features, compute capabilities, and performance metrics of the first-, second-, and third-generation Tensor Cores, as well as implementation differences across GPU architectures.
Read more »

A roundup of machine learning and parallel computing course resources, including MLSys systems courses, GPU parallel programming course links, and high-performance computing lab resources from CMU, EPFL, the University of Washington, and other well-known institutions.
Read more »

Asynchronous Copy Concepts

  • Conventional load from global memory to shared memory: [global mem] -> [L2 cache] -> [L1 cache] -> [register] -> [shared mem]
  • cp.async.cg: [global mem] -> [L2 cache] -> [shared mem] (bypasses L1)
  • cp.async.ca: [global mem] -> [L2 cache] -> [L1 cache] -> [shared mem] (cached at all levels)
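These cache paths can also be exercised from CUDA C++ without writing PTX by hand, via the pipeline primitives in `<cuda_pipeline.h>` (CUDA 11+). A minimal sketch, assuming an SM80 device; the kernel and buffer names here are illustrative, not from the original post:

```cuda
#include <cuda_pipeline.h>

__global__ void copy_tile(const float4* __restrict__ gmem, float4* out) {
  __shared__ float4 smem[128];
  int i = threadIdx.x;
  // A 16-byte element size is typically lowered to cp.async.cg (L1 bypassed);
  // 4- and 8-byte sizes are lowered to cp.async.ca (cached in L1 as well).
  __pipeline_memcpy_async(&smem[i], &gmem[i], sizeof(float4));
  __pipeline_commit();        // close the current async group
  __pipeline_wait_prior(0);   // block until all committed groups complete
  __syncthreads();
  out[i] = smem[i];
}
```

`__pipeline_memcpy_async` only accepts copy sizes of 4, 8, or 16 bytes, matching the `cp-size` values in the PTX syntax below.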

Memory Copies Across Architectures

Ampere(SM80)

  • cp.async
    cp.async.ca.shared{::cta}.global{.level::cache_hint}{.level::prefetch_size}
    [dst], [src], cp-size{, src-size}{, cache-policy} ;
    cp.async.cg.shared{::cta}.global{.level::cache_hint}{.level::prefetch_size}
    [dst], [src], 16{, src-size}{, cache-policy} ;
    cp.async.ca.shared{::cta}.global{.level::cache_hint}{.level::prefetch_size}
    [dst], [src], cp-size{, ignore-src}{, cache-policy} ;
    cp.async.cg.shared{::cta}.global{.level::cache_hint}{.level::prefetch_size}
    [dst], [src], 16{, ignore-src}{, cache-policy} ;

    .level::cache_hint = { .L2::cache_hint }
    .level::prefetch_size = { .L2::64B, .L2::128B, .L2::256B }
    cp-size = { 4, 8, 16 }

    TS const* gmem_ptr = &gmem_src;
    uint32_t smem_int_ptr = cast_smem_ptr_to_uint(&smem_dst);
    asm volatile("cp.async.ca.shared.global.L2::128B [%0], [%1], %2;\n"
                 :: "r"(smem_int_ptr),
                    "l"(gmem_ptr),
                    "n"(sizeof(TS)));
  • Synchronization mechanisms
    There are two synchronization mechanisms: async groups and mbarrier.
    // Establishes an ordering w.r.t. previously issued cp.async instructions. Does not block.
    CUTE_HOST_DEVICE
    void
    cp_async_fence()
    {
    #if defined(CUTE_ARCH_CP_ASYNC_SM80_ENABLED)
      asm volatile("cp.async.commit_group;\n" ::);
    #endif
    }

    // Blocks until all but N previous cp.async.commit_group operations have committed.
    template <int N>
    CUTE_HOST_DEVICE
    void
    cp_async_wait()
    {
      if constexpr (N == 0) {
        asm volatile("cp.async.wait_all;\n" ::);
      } else {
        asm volatile("cp.async.wait_group %0;\n" :: "n"(N));
      }
    }

    // cp.async.wait_all = cp.async.commit_group + cp.async.wait_group 0

Async groups are per-thread. Within a single async group, the asynchronous operations may complete in any order; across groups, completion follows the order in which the groups were committed.
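Putting the copy and group-based synchronization pieces together, a two-stage (double-buffered) pipeline can be sketched as follows. This is an illustrative single-block kernel, not code from the post; it assumes SM80, 16-byte-aligned `float4` data, and 128 threads per tile:

```cuda
__global__ void pipelined_sum(const float4* __restrict__ gmem,
                              float* out, int num_tiles) {
  __shared__ float4 smem[2][128];   // two buffers: one loading, one computing
  int tid = threadIdx.x;
  float acc = 0.f;

  // Prologue: issue the async copy for tile 0 into buffer 0.
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
      :: "r"((unsigned)__cvta_generic_to_shared(&smem[0][tid])),
         "l"(&gmem[tid]));
  asm volatile("cp.async.commit_group;\n" ::);

  for (int t = 0; t < num_tiles; ++t) {
    int cur = t & 1, nxt = (t + 1) & 1;
    if (t + 1 < num_tiles) {
      // Prefetch the next tile into the other buffer.
      asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
          :: "r"((unsigned)__cvta_generic_to_shared(&smem[nxt][tid])),
             "l"(&gmem[(t + 1) * 128 + tid]));
      asm volatile("cp.async.commit_group;\n" ::);
      // Allow at most one group (the prefetch) to remain in flight,
      // which guarantees the current tile's group has completed.
      asm volatile("cp.async.wait_group 1;\n" ::);
    } else {
      asm volatile("cp.async.wait_group 0;\n" ::);
    }
    __syncthreads();                // make the arrived tile visible to all threads
    float4 v = smem[cur][tid];
    acc += v.x + v.y + v.z + v.w;
    __syncthreads();                // finish reading before the buffer is reused
  }
  out[tid] = acc;
}
```

The ordering guarantee quoted above is exactly what makes `cp.async.wait_group 1` sufficient here: because groups complete in commit order, once at most one group is pending, the pending one can only be the most recently committed prefetch.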

https://www.zhihu.com/column/c_1669290006261825536

A complete guide to setting up a VPN proxy service on a VPS, covering free domain registration, DNS configuration, V2Ray/Trojan server deployment, CDN traffic relaying, client configuration, and other detailed steps and reference resources.
Read more »

A guide to exporting PaddlePaddle models, using PaddleClas as an example to show how to download pretrained models, export them with the export_model.py script, and related model deployment techniques.
Read more »