Archive

2026-02

2026-01

How RoPE Is Actually Computed
1/30/2026
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
1/29/2026
An Overview of Long Context Inference Optimization Techniques
1/27/2026
Context Parallel: A Technical Analysis
1/27/2026
FlashAttention: Principles and Implementation
1/27/2026
CUDA 012 - The Compilation and Linking Process
1/27/2026
[Selected Notes] Jiayi Weng: OpenAI, GPT, Reinforcement Learning, Infra, Post-Training, Tianshou, tuixue, Open Source, CMU, Tsinghua | WhynotTVPodcast
1/18/2026
Applications of RDMA in Large-Model Inference Frameworks
1/13/2026
MTP Theoretical Speedup Analysis: From Formulas to Engineering Decisions
1/6/2026
DeepGEMM Study Guide: An FP8 GEMM Library Explained for Beginners
1/6/2026
A TP-SP-EP Hybrid Parallelism Strategy
1/4/2026

2025-12

Inside NVIDIA GPUs: Anatomy of high performance matmul kernels
12/29/2025
Inside vLLM: Anatomy of a High-Throughput LLM Inference System
12/29/2025
Speeding Up SGLang Diffusion wan2.2 Inference by 60% with Zero-Overhead Layer-wise Weight Offloading
12/28/2025
CUDA Graph Study Notes
12/27/2025
Code is not only an implementation, but also a presentation of a way of thinking
12/26/2025
Understanding Conway’s Law
12/25/2025

2025-11

My first Multi-GPU kernel: Writing All-to-all for AMD MI300X
11/2/2025

2025-08

Writing Speed-of-Light Flash Attention for 5090 in CUDA C++
8/23/2025

2024-08

How To Write A Fast Matrix Multiplication From Scratch With Tensor Cores
8/10/2024