风花雪月

norm

Posted on 2026-04-17

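A detailed look at normalization methods in deep learning, covering the principles and implementation of four techniques (Batch Norm, Layer Norm, Instance Norm, and Group Norm), with PyTorch code examples, to help clarify where each normalization strategy applies.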

quantization

Posted on 2026-04-17

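A survey of quantization techniques for large language models, covering methods such as SmoothQuant, AWQ, LLM.int8, GPTQ, ZeroQuant, LUT-GEMM, and SparseGPT, as well as the use of weight-only quantization in inference optimization.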

sparse on nvgpu

An overview of sparse computing on NVIDIA GPUs, covering efficient GPU kernel implementations for N:M sparse weights, N:M sparsity support in Apex, structured sparsity optimization on Tensor Cores, and related papers and open-source projects.

stable diffusion optimization

Posted on 2026-04-17

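A detailed look at Stable Diffusion inference optimization, covering core techniques such as Flash Attention, norm fusion, mixed-layout computation, and inference memory optimization. The result generates a 512×512 image in 0.76 seconds, outperforming TensorRT by 7.9%.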

cuda compile

Posted on 2026-04-17

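A detailed look at CUDA compilation, covering nvcc compiler options, the difference between virtual and real architectures, PTX and CUBIN file generation, and strategies for compiling for compatibility across multiple architectures.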

transformer

Posted on 2026-04-17

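An in-depth analysis of the Transformer architecture, covering the design of the Encoder and Decoder, the Multi-Head Attention mechanism, and the Position-Wise Feed-Forward Network, with a complete TensorFlow implementation.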

cuda stream

Posted on 2026-04-17

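CUDA stream programming, covering basic operations such as creating, synchronizing, and destroying streams, as well as advanced features such as stream priorities and non-blocking streams, to help optimize parallel computation on the GPU.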

cutlass conv

Posted on 2026-04-17

An explanation of convolution in CUTLASS, covering the convolution parameter definitions (K, C, R, S), Conv2dProblemSize configuration, the output size formulas, and how the CUTLASS library is applied to convolution operations.
