SIGNAL
Five links. Every Tuesday. No noise.
For the data scientist who reads the paper, not the tweet about it.
"FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision"
The headline number is a 1.5-2.0x speedup over FlashAttention-2 on H100s, reaching up to 75% of peak FP16 throughput, but the more interesting detail is the asynchrony model: they pipeline softmax and matmul across warp groups so neither unit sits idle. If you're running long-context workloads and haven't upgraded your attention kernel since 2023, this is the one that will make you feel the gap.
This week's five. Typeset exactly as they land in your inbox.
torchao: PyTorch Architecture Optimization
Meta quietly shipped the quantization library that makes INT8 inference actually ergonomic. Two lines take a model from fp32 to int8 dynamic quantization with little to no accuracy drop on standard benchmarks, and unlike bitsandbytes, it doesn't require a custom CUDA build. The repo has been starred 3k times in two weeks. Your inference costs will feel it.
Patterns for Building LLM-Based Systems & Products
The most grounded post on production LLM systems published this year. Eugene Yan catalogues evals, fallbacks, guardrails, and caching patterns from actual shipped products, not demos. Section 4 on "model cascades" alone is worth the read. Drop this in your team's Notion before Thursday's architecture review.
Scaling Laws for Reward Model Overoptimization
If you're fine-tuning anything with human feedback, this paper explains why optimizing harder against your reward model eventually makes real quality degrade, and gives you the KL-divergence budget math to predict when. The proxy-gaming curve in Figure 3 is the most useful single chart for anyone doing RLHF at scale. This one's been in my "re-read" folder for three weeks.
“Link five is always the wildcard — the one that has nothing to do with your current sprint and everything to do with where the field is going in eighteen months. That's the one readers forward.”
How five links weave through a real workday.
The notification lands.
You're still on your first coffee. Signal drops at 7 AM every Tuesday — same time, every week, no exceptions. The subject line is always just the date. You open it because you already know the format: five links, two sentences each, zero padding.
You read the first annotation.
FlashAttention-3. The two sentences tell you exactly why it matters for your H100 workload: not the paper abstract, not a summary, but the actual implication for the code you shipped last sprint. You tab it open. You'll finish it before standup.

You drop it into #ml-eng.
Eugene Yan's LLM patterns post. "Has everyone seen this? Section 4 on model cascades." Three replies in four minutes. Your tech lead adds it to the architecture doc. The annotation you forwarded is better than anything the team would have written themselves, because it was written for someone exactly like you.

The repo is already cloned.
torchao. You ran the two-line INT8 quant on your inference pipeline during lunch. 31% cost reduction on your staging environment. You didn't find it on Hacker News — it was buried in a weekend GitHub trending list that you'd stopped checking six months ago. Signal found it for you.
The algorithm gives you everything.
Signal gives you five.
FlashAttention-3: 1.5-2.0x faster than FlashAttention-2 on H100s. Here's why your inference pipeline should care this sprint.
arxiv · cs.LG
torchao ships INT8 quant in two lines. No custom CUDA build. Your inference costs will feel it.
GitHub · pytorch/ao
Eugene Yan's production LLM patterns. Section 4 on model cascades is the one for your architecture doc.
Blog · Eugene Yan
RLHF reward model overoptimization: the KL-divergence budget math you need before scaling.
arxiv · stat.ML
The wildcard: a 2019 database paper that every ML engineer building feature stores is about to rediscover.
VLDB · Systems
Read before Thursday's standup.
The full issue is free. Five annotated links, each one chosen because it changes something about how you work this week. No email required to read it.
Opens in a hosted archive. No paywall. No signup wall. Just the dispatch.