Accelerates large language model inference through modular attention reuse, achieving 8x (GPU) and 60x (CPU) reductions in time-to-first-token (TTFT) latency; particularly well suited to long-context applications such as document QA and recommendation systems.
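A minimal sketch of the general idea behind attention (KV) reuse, not this project's actual API: all names below (`precompute_kv`, `attend`, etc.) are hypothetical, and the toy omits positional re-encoding and cross-chunk attention corrections that real modular-reuse systems must handle. The point it illustrates is that key/value projections for a reusable context chunk are computed once, so each later request's prefill work, the main driver of TTFT, scales with its short new input rather than the full context.

```python
# Illustrative sketch only: toy single-head attention with a precomputed,
# reusable per-chunk KV cache. Hypothetical names, not this project's API.
import numpy as np

D = 64  # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def precompute_kv(chunk_embeddings):
    """Run the expensive K/V projections for a reusable context chunk once."""
    return chunk_embeddings @ Wk, chunk_embeddings @ Wv

def attend(query_embeddings, cached_k, cached_v):
    """Attention for the new tokens only, against cached + new K/V."""
    q = query_embeddings @ Wq
    k = np.concatenate([cached_k, query_embeddings @ Wk], axis=0)
    v = np.concatenate([cached_v, query_embeddings @ Wv], axis=0)
    scores = softmax(q @ k.T / np.sqrt(D), axis=-1)
    return scores @ v

# Embed a long document chunk and build its KV cache once...
doc_chunk = rng.standard_normal((1000, D))
chunk_kv_cache = precompute_kv(doc_chunk)

# ...then every later question over that document reuses the cache, so
# per-request prefill cost is proportional to the short question alone.
question = rng.standard_normal((12, D))
out = attend(question, *chunk_kv_cache)
print(out.shape)  # (12, 64)
```

In a document-QA setting, the per-chunk caches would be built offline (or on first access) and shared across all queries that hit the same chunk, which is where the TTFT savings come from.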