NaN Loss — Daily AI Briefing
Collected<br>943
After dedup<br>421
Surfacing<br>71 items
Source: all · arxiv 289 · hackernews 33 · twitter 500 · websearch 121<br>Executive summary<br>The biggest industry story today is Midjourney's wild pivot into hardware: the company known for image generation announced a 60-second full-body ultrasound scanner, partnering with Butterfly to ship a physical medical device. Whatever you think about the strategic logic, it's a genuinely novel bet from an AI-native company. On the open-source front, Z.AI's GLM-5.2 open weights dropped under a permissive MIT license, and the model beats GPT-5.5 on multiple long-horizon coding benchmarks for roughly one-sixth the cost, further compressing the gap between open and closed frontier models. Meanwhile, the funding machine keeps running: Baseten pulled in $1.5B at dual $11B/$13B valuations for inference infrastructure, Odyssey raised a $310M Series B, and Sarvam AI hit a $1.5B valuation with a $150M injection from HCLTech. Accenture's stock cratered 17% on weak AI-impacted guidance, which is the flip side of the same story: incumbents that can't show AI leverage are getting punished hard.<br>On the research side, the most interesting cluster of work is around failure modes in RLVR and GRPO training. Multiple papers dropped addressing distinct but related problems: SFT overtraining that triggers entropy collapse and downstream rank inversion in GRPO, a "sparsity curse" that causes model merging to fail on RLVR-trained reasoning models, and policy entropy collapse during GRPO training, which STARE targets. These aren't incremental: they reveal that the post-training stack for reasoning models is more fragile than the benchmark numbers suggest. 
Separately, Sumi became the first 7B uniform diffusion language model pretrained from scratch on 1.5T tokens, and the "User as Engram" architecture cut LLM personalization memory footprint by 33,000x, which could matter a lot for on-device deployment.<br>On the policy and applications front, AI executives at the G7 pushed for a U.S.-led coalition to restrict China's chip access — an escalation of the emerging AI export control regime. Amazon started direct sales talks for its Trainium chips to external data centers, a serious move against Nvidia's stranglehold on AI silicon. And in a striking clinical result, OpenAI's o3 model successfully diagnosed rare diseases in 18 children, another data point that frontier reasoning models are finding real traction in high-stakes domains where exhaustive differential diagnosis actually plays to their strengths.
01 LLM Research · 13 items<br>The past 24 hours in LLM research saw landmark updates in both frontier model benchmarking and post-training optimization. Claude Fable 5 claimed the top spot on the DeepSWE coding benchmark, while Artificial Analysis released a cost-aware agentic knowledge evaluation illustrating massive price-to-performance variances across models. In academic research, a wave of breakthroughs focused heavily on Reinforcement Learning with Verifiable Rewards (RLVR) and GRPO, addressing critical vulnerabilities like SFT entropy collapse (leading to rank inversion), policy entropy decay during training, uniform credit assignment, and model merging failures (the 'sparsity curse'). Key architecture highlights included 'User as Engram,' which reduces LLM personalization footprints by 33,000x, and 'Sumi,' the first 7B uniform diffusion language model pretrained from scratch on 1.5T tokens.<br>Claude Fable 5 Claims Top Spot on DeepSWE Coding Benchmark<br>Claude Fable 5 has debuted at number one on the DeepSWE long-horizon coding benchmark, scoring 70% pass@1 and outscoring the previous best model by 3%. At its default high-effort setting, Fable 5 tracks GPT-5.5 on cost-performance, while Kimi K2.7 also joined the leaderboard with a 31% score.<br>high · 4 src · Claude Fable 5 · DeepSWE · Model Benchmarks · Coding LLMs
Artificial Analysis Releases Agentic Knowledge Work Evaluation<br>Artificial Analysis has released a new agentic knowledge work evaluation benchmark showing that cost-per-task varies by up to 800x across frontier models. While Claude Fable 5 leads the evaluation, it costs over $31 per task on average compared to $0.04 for DeepSeek V4 Flash. GLM-5.2 (which placed between GPT-5.5 and Opus 4.8) and DeepSeek were highlighted as the strongest price-performance open-weight models.<br>high · 2 src · Artificial Analysis · Agentic Benchmarks · GLM-5.2 · Claude Fable 5
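As a quick sanity check on the cost spread quoted above, the two per-task figures can be divided directly. This is only arithmetic on the numbers in the item; the variable names are illustrative, and because the Fable 5 figure is given as "over $31", the result is a lower bound rather than an exact ratio.

```python
# Per-task costs as quoted in the item (USD). Variable names are illustrative.
fable_cost_per_task = 31.00   # quoted as "over $31", so treat as a lower bound
flash_cost_per_task = 0.04    # DeepSeek V4 Flash

# Ratio between the most and least expensive models named in the item.
ratio = fable_cost_per_task / flash_cost_per_task
print(f"Claude Fable 5 costs at least ~{ratio:.0f}x more per task")
```

The ratio lands just under the "up to 800x" headline figure, which is consistent with the Fable 5 cost being somewhat above $31.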
SFT Overtraining Found to Trigger GRPO Rank Inversion via Entropy Collapse<br>Researchers studying SFT depth ladders for Qwen2.5-Coder and DeepSeek-Coder have found that overtraining Supervised Fine-Tuning (SFT) checkpoints collapses rollout distribution entropy, triggering rank inversion during Group Relative Policy Optimization (GRPO). While early SFT depth increases pass@1 scores, the lack of behavioral entropy leaves insufficient group-relative signal for GRPO, causing peak performance to collapse from 0.806 to 0.481.<br>high · 1 src · SFT...
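Why collapsed rollouts starve GRPO of signal can be seen in a minimal sketch, assuming the commonly described GRPO advantage (each rollout's reward minus the group mean, divided by the group standard deviation). The function name, the `eps` term, the use of population standard deviation, and the toy rewards are all illustrative assumptions and are not taken from the paper.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumed form: (r_i - mean) / (std + eps). Real implementations may differ
    in the std estimator or in how eps is applied.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# A diverse group: some rollouts pass, some fail, so advantages are non-zero.
diverse = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# A collapsed group: every rollout behaves the same and gets the same reward,
# so every advantage is zero and the policy gradient carries no signal.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print("diverse:  ", diverse)
print("collapsed:", collapsed)
```

In the collapsed case the numerator is zero for every rollout, so no amount of additional sampling from that prompt produces a learning signal under this formulation.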