You can put a Softmax in front of CrossEntropyLoss. PyTorch won’t stop you. Here are 16 other architecture bugs it won’t catch.
Design-Time ML
A walkthrough of the 17-rule design-time linter inside Neurarch: what each rule catches, why it matters, and where static analysis stops being useful for neural networks.
Xin Gao
May 17, 2026
The bug that started this
You can put a Softmax in front of CrossEntropyLoss in PyTorch. The model trains. The loss curve looks fine. You ship it. Accuracy is bad, and you spend the next day finding out why.
The bug is that nn.CrossEntropyLoss applies log-softmax internally, so an explicit Softmax means the softmax is applied twice, which degrades training stability. The bug is visible from the architecture diagram in two seconds. The framework never flags it; you find out from the metrics, after you have burned the GPU time and the morning.
This is one of 17 structural failure modes I built a design-time linter to catch. The rest cover normalization ordering, missing residuals in deep nets, attention without positional encoding, GQA head divisibility, SwiGLU dimension conventions, and a dozen more. This post walks through what each rule catches, why it matters, and where static analysis stops being useful.
Why a linter at all
Most PyTorch tooling catches structural bugs after the fact. Shape errors only fire when you call forward(). Vanishing gradients show up as flat loss curves once the training loop is running. NaN losses appear on an A100 you have already paid for. The pattern is consistent: the bug was visible from the graph, but it only surfaced at runtime, and the cost was hours of GPU plus the mental tax of figuring out which of 200 layers broke the gradient.
I wrote a linter that runs on the architecture graph at design time, before any forward pass. Static analysis for neural networks. Today it ships with 17 rules.
The 17 rules, grouped
The rules fall into five categories, each tied to a class of common failure mode.
Structure (4 rules) — the graph itself is malformed
R01 — Model has no Input node
R02 — Model has no Output node
R03 — Isolated components with no connections in or out
R04 — Dead-end: a non-output layer that has inputs but no outputs
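As a rough illustration of how checks like R01 to R04 could work, here is a minimal Python sketch over a toy graph. The `nodes`/`edges` representation, the `lint_structure` name, and the message strings are assumptions made for this example, not Neurarch's actual data model or API.

```python
from collections import defaultdict


def lint_structure(nodes, edges):
    """Toy versions of R01-R04.

    nodes: {node_id: node_type}, edges: [(src_id, dst_id)].
    Returns a list of (rule_id, message) tuples.
    """
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1

    findings = []
    node_types = set(nodes.values())
    if "Input" not in node_types:
        findings.append(("R01", "model has no Input node"))
    if "Output" not in node_types:
        findings.append(("R02", "model has no Output node"))

    for node_id, node_type in nodes.items():
        if in_deg[node_id] == 0 and out_deg[node_id] == 0:
            # R03: nothing flows in or out of this node
            findings.append(("R03", f"{node_id} is isolated"))
        elif node_type != "Output" and in_deg[node_id] > 0 and out_deg[node_id] == 0:
            # R04: the node receives data but nothing consumes its result
            findings.append(("R04", f"{node_id} is a dead end"))
    return findings


# Hypothetical graph: relu1 was left dangling and bn_orphan was never wired up.
nodes = {
    "in": "Input",
    "conv1": "Conv2d",
    "relu1": "ReLU",
    "fc": "Linear",
    "out": "Output",
    "bn_orphan": "BatchNorm2d",
}
edges = [("in", "conv1"), ("conv1", "relu1"), ("conv1", "fc"), ("fc", "out")]

findings = lint_structure(nodes, edges)
for rule, message in findings:
    print(rule, message)
```

All four checks need only node types and in/out degree counts, which is why they can run before any tensor shapes are known.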
These four catch the kind of bug you make when you delete a layer mid-edit and forget to reconnect. Most code generators silently drop disconnected nodes, so you only notice when the generated PyTorch file is mysteriously short.
Ordering (4 rules) — layers are in the wrong sequence
R05 — Normalization placed after an activation (the conventional order is Conv → Norm → Activation)
R06 — Dropout directly before BatchNorm (Dropout changes the variance of the activations between training and inference, so BatchNorm’s running statistics no longer match what it sees at eval time)
R07 — Softmax or Sigmoid immediately before Output (PyTorch’s nn.CrossEntropyLoss applies log-softmax internally, so an explicit Softmax means the softmax is applied twice and degrades training stability; the same holds for an explicit Sigmoid in front of nn.BCEWithLogitsLoss)
R08 — Any normalization immediately before Output (normalizing raw logits constrains the output range and breaks standard loss functions)
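To see why R07 matters, the arithmetic can be worked through in plain Python, with no framework involved. The sketch below hand-rolls the log-softmax plus negative log-likelihood computation that nn.CrossEntropyLoss performs on logits, then feeds it values that have already been through a softmax. The helper names and the example logits are illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target):
    """log-softmax followed by negative log-likelihood for one sample."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# A confident, correct prediction for class 0 out of 3 classes.
logits = [10.0, 0.0, 0.0]
target = 0

# Correct usage: raw logits go straight into the loss.
loss_correct = cross_entropy_from_logits(logits, target)

# The R07 bug: the loss receives softmax outputs and applies softmax again.
loss_double = cross_entropy_from_logits(softmax(logits), target)

# Best case under the bug: the first softmax outputs exactly [1, 0, 0].
loss_floor = math.log(1 + 2 / math.e)

print(f"correct:        {loss_correct:.6f}")
print(f"double softmax: {loss_double:.6f}")
print(f"floor (K=3):    {loss_floor:.6f}")
```

With three classes, the double-softmax loss cannot drop below log(1 + 2/e), roughly 0.55, however confident and correct the model is. Because the inner softmax squeezes every input into [0, 1], the outer one sees nearly identical values, which is one way the extra layer weakens the training signal.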
R07 is the rule I personally hit the most. Every ML engineer learns the lesson once. The linter just keeps you from learning it again every six months.
Pattern (4 rules) — the architecture is missing a structural ingredient
R09 — Network deeper than 8 conv/linear layers with no residual connections at all
R10 — Attention layers present but no positional encoding anywhere (self-attention is permutation-equivariant; without position information the model cannot distinguish token order)
R11 — Sigmoid or Tanh used in networks deeper than 5 layers (these saturate: their gradient approaches zero for large-magnitude inputs, which stalls learning in the early layers)
R12 — Network deeper than 7 non-I/O layers with no normalization of any kind
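The saturation argument behind R11 can be checked with a few lines of arithmetic. The sketch below makes a simplifying assumption: it ignores the weight matrices and only multiplies sigmoid derivatives along a chain of layers. The function names and the depth of 6 are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


# The derivative peaks at x = 0, where it equals 0.25.
best_case = sigmoid_grad(0.0)

# Backprop multiplies one such factor per sigmoid layer, so even in the
# best case the activation-only gradient factor shrinks geometrically.
depth = 6  # just past the "deeper than 5 layers" threshold of R11
upper_bound = best_case ** depth

# A moderately large pre-activation is already close to saturation.
saturated = sigmoid_grad(5.0)

print(f"max sigmoid gradient:        {best_case}")
print(f"best case through {depth} layers:  {upper_bound:.6f}")
print(f"gradient at x = 5:           {saturated:.6f}")
```

Real networks also multiply by weight terms, so this is only an upper bound on the contribution of the activations, but it shows why the early layers of a deep sigmoid stack receive very little gradient.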
These are the rules that turn into the “why is my loss curve flat” moment. None of them are individually subtle. What is subtle is that you only notice them collectively, after the training run.
Performance (2 rules) — the architecture trains but inefficiently
R13 — Dropout with p > 0.65 (rates above this introduce so much noise the model cannot learn stable representations)
R14 — Activation tensor larger than 50M elements (~200 MB per sample at float32; at batch size 32 a single such layer needs 6.4 GB of activation memory)
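The memory figures quoted for R14 follow from simple multiplication, sketched below. The helper names and the example feature-map shape are assumptions for illustration; only the 50M-element threshold and the float32 size come from the rule itself.

```python
import math

R14_ELEMENT_LIMIT = 50_000_000  # threshold quoted for R14
BYTES_PER_FLOAT32 = 4


def activation_bytes(shape, batch_size=1, bytes_per_element=BYTES_PER_FLOAT32):
    """Memory for one activation tensor; `shape` excludes the batch dimension."""
    return math.prod(shape) * batch_size * bytes_per_element


def exceeds_r14(shape):
    """True when a per-sample activation has more elements than the R14 limit."""
    return math.prod(shape) > R14_ELEMENT_LIMIT


# A tensor sitting exactly at the threshold.
per_sample = activation_bytes((R14_ELEMENT_LIMIT,))
per_batch = activation_bytes((R14_ELEMENT_LIMIT,), batch_size=32)
print(per_sample / 1e6, "MB per sample")
print(per_batch / 1e9, "GB at batch size 32")

# Hypothetical feature map: 64 channels at 1024 x 1024 resolution.
feature_map = (64, 1024, 1024)
print(feature_map, "exceeds limit:", exceeds_r14(feature_map))
```

The check only needs the inferred output shape of each layer, so it can run on the graph without allocating any tensors.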
R14 is the one that catches you trying to run a model on a 16 GB T4 that should really be on an A100.
Transformer-specific (3 rules) — the new wave of LLM architectures
R15 — MoE layer present: a reminder to add the auxiliary load-balancing loss in the training loop (without it, routing can collapse, with most tokens going to one expert)
R16 — Grouped-Query Attention where numHeads is not divisible by numKVHeads (the head grouping arithmetic literally does not work)
R17 — SwiGLU intermediateSize that...