Different models solve number-theory race problem

Day 17: PalinPrimeBits

The seventeenth challenge is a number-theory race. The server picks a 1-indexed integer n and the bot must report the length of the longest contiguous block of 1 bits in the binary expansion of p(n), the n-th palindromic prime. The sequence starts 2, 3, 5, 7, 11, 101, 131, 151, 181, 191, … (OEIS A002385); the n-th element is fixed, so every round has exactly one correct answer.
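To make the task concrete, here is a minimal sketch of the per-round computation on the ten sequence elements listed above. The function names and the hard-coded table are illustrative assumptions, not code from any bot; a real bot has to generate p(n) for n up to 1,000,000.

```python
def longest_one_run(value: int) -> int:
    """Length of the longest contiguous block of 1 bits in value's binary expansion."""
    best = current = 0
    while value:
        if value & 1:
            current += 1
            best = max(best, current)
        else:
            current = 0
        value >>= 1
    return best


# First ten palindromic primes, copied from the list in the text (OEIS A002385).
FIRST_PALPRIMES = [2, 3, 5, 7, 11, 101, 131, 151, 181, 191]


def answer(n: int) -> int:
    """Answer for 1-indexed n. This toy table only covers n <= 10."""
    return longest_one_run(FIRST_PALPRIMES[n - 1])


if __name__ == "__main__":
    for n, p in enumerate(FIRST_PALPRIMES, start=1):
        print(n, p, bin(p), answer(n))
```

For example, p(6) = 101 is 1100101 in binary, so the answer for n = 6 is 2.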

The format is 10 solo rounds played serially. Per-round n ranges from 5,000 to 1,000,000, and bots are not told the schedule in advance. Each round awards 10/7/5/3/1/0 points among correct submissions, ranked by earliest submission timestamp. Wrong, timed-out, and malformed responses score zero. The per-round wall-clock budget is 30 seconds.
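The scoring rule can be sketched as follows. This is an illustration of the rule as described, not the server's actual code: the Submission type, the score_round name, and the "WrongBot" entry are all hypothetical. The three timings are the R7 podium from the per-round positions table.

```python
from typing import NamedTuple

POINTS = [10, 7, 5, 3, 1]  # 1st..5th correct submission; 6th and later get 0


class Submission(NamedTuple):
    bot: str
    answer: int
    timestamp: float  # seconds elapsed since the ROUND line (assumed unit)


def score_round(submissions, correct_answer, budget=30.0):
    """Rank correct, in-budget submissions by timestamp and hand out points."""
    scores = {s.bot: 0 for s in submissions}
    correct = [
        s for s in submissions
        if s.answer == correct_answer and s.timestamp <= budget
    ]
    correct.sort(key=lambda s: s.timestamp)
    for rank, s in enumerate(correct):
        if rank < len(POINTS):
            scores[s.bot] = POINTS[rank]
    return scores


if __name__ == "__main__":
    r7 = [
        Submission("Claude", 4, 0.11),
        Submission("DeepSeek", 4, 0.09),
        Submission("Kimi", 4, 0.14),
        Submission("WrongBot", 5, 0.01),  # hypothetical: fast but incorrect
    ]
    print(score_round(r7, correct_answer=4))
```

A fast wrong answer earns nothing, which is why speed only matters among bots that are already correct.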

The dominant strategy choice is whether to enumerate palprimes lazily (start a background thread and answer rounds as the list grows) or eagerly (compute the whole list of 1,000,000 palprimes before submitting anything). prompt.md §9 permits precomputation before the first ROUND line. The clause was written with light amortization in mind: register first, then warm a cache while idle. Seven of the nine bots in the field read it that way. Two read it maximally, as a license to bypass the 30-second per-round wall-clock by deferring sock.connect() until after a full precompute. Those two run into REGISTRATION_WINDOW = 10.0 in server.py, a 10-second window for sending the BOTNAME line, and never register.
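The lazy option might look roughly like the sketch below: a table that a background thread keeps extending, with a lookup that waits up to a deadline for the requested index. The class and function names are hypothetical and none of this is taken from a competing bot.

```python
import threading


class GrowingTable:
    """Append-only list that readers can wait on until it is long enough."""

    def __init__(self):
        self._values = []
        self._cond = threading.Condition()

    def append(self, value):
        with self._cond:
            self._values.append(value)
            self._cond.notify_all()

    def get(self, n, timeout):
        """Return the n-th value (1-indexed), or None if it is not ready in time."""
        with self._cond:
            ready = self._cond.wait_for(
                lambda: len(self._values) >= n, timeout=timeout
            )
            return self._values[n - 1] if ready else None


def start_background_fill(table, source):
    """Drain an iterable into the table on a daemon thread."""

    def run():
        for value in source:
            table.append(value)

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


if __name__ == "__main__":
    table = GrowingTable()
    start_background_fill(table, iter([2, 3, 5, 7, 11, 101]))
    print(table.get(6, timeout=1.0))   # waits until the sixth entry exists
    print(table.get(7, timeout=0.1))   # never arrives, so None
```

Notifying on every append is wasteful at a million entries; batching the appends would be the obvious refinement. The point of the sketch is only that a round can be answered as soon as the table reaches n, and that a miss is detected before the 30-second budget runs out.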

MiMo (V2.5-Pro) is DNF. Three consecutive generation attempts terminated with finish_reason=length, each after 65,532 to 65,540 reasoning tokens and zero output tokens. This is MiMo’s fourth straight challenge as a generation DNF.

ChatGPT (GPT 5.5) and Grok (Expert 4.20) are DNP. Both bots compile fine and implement correct algorithms. Each defers sock.connect() until after a full precompute of 1,000,000 palindromic primes, reading prompt.md §9 (“the bot may take any approach … including pre-computation before the first ROUND line arrives. The 30 s clock only starts at each ROUND line.”) maximally, as license to bypass the per-round wall-clock entirely. ChatGPT’s source comment names the intent: # Precompute before connecting so no ROUND clock is running yet. The server’s 10-second registration window, undocumented in the prompt but enforced in server.py, catches both bots inside that precompute. Neither ever registers, and they don’t appear in the tournament log.

The results

| Rank | Bot | Pts | 1sts | Correct | Total t (correct rounds) |
|---|---|---|---|---|---|
| #1 | DeepSeek (V4-Pro) | 73 | 4 | 9/10 | 11.5 s |
| #2 | Claude (Opus 4.7) | 60 | 1 | 9/10 | 11.9 s |
| #3 | GLM (5.1) | 40 | 4 | 7/10 | 41.0 s |
| #4 | Muse (Spark) | 24 | 0 | 9/10 | 82.4 s |
| #5 | Gemini (Pro 3.1) | 20 | 0 | 8/10 | 50.4 s |
| #6 | Kimi (K2.6) | 18 | 1 | 4/10 | 15.3 s |
| #7 | Nemotron (3 Super) | 5 | 0 | 8/10 | 67.4 s |
| DNP | ChatGPT (GPT 5.5) | - | - | - | - |
| DNP | Grok (Expert 4.20) | - | - | - | - |
| DNF | MiMo (V2.5-Pro) | - | - | - | - |

(Total t is summed only over rounds the bot answered correctly. DNP: did not play. DNF: did not finish. Per-round timings are taken from the server’s results.log file, which is kept local-only by repo policy; the relevant excerpts are inlined in the per-round positions table below and the bot-specific sections that follow.)

Per-round positions

| Round | n | Correct k | 1st | 2nd | 3rd |
|---|---|---|---|---|---|
| R1 | 5,000 | 3 | GLM (0.04s) | DeepSeek (0.06s) | Claude (0.08s) |
| R2 | 10,000 | 5 | GLM (0.04s) | DeepSeek (0.07s) | Claude (0.09s) |
| R3 | 20,000 | 4 | GLM (0.05s) | DeepSeek (0.09s) | Claude (0.10s) |
| R4 | 30,000 | 4 | GLM (0.06s) | Claude (0.09s) | DeepSeek (0.10s) |
| R5 | 50,000 | 4 | DeepSeek (0.07s) | Claude (0.08s) | Muse (7.06s) |
| R6 | 75,000 | 4 | DeepSeek (0.07s) | Claude (0.08s) | Gemini (8.61s) |
| R7 | 100,000 | 4 | DeepSeek (0.09s) | Claude (0.11s) | Kimi (0.14s) |
| R8 | 250,000 | 4 | DeepSeek (4.43s) | Claude (5.76s) | Gemini (16.98s) |
| R9 | 500,000 | 5 | Claude (5.49s) | DeepSeek (6.53s) | Muse (27.56s) |
| R10 | 1,000,000 | 6 | Kimi (0.04s) | - | - |

Round 10 has a single correct submission. Kimi answered in 43 ms; every other bot that played R10 either timed out or, in GLM’s case, submitted its ANSWER 1 fallback after its precompute deadline expired.

The registration-window gap (ChatGPT and Grok DNP)

ChatGPT and Grok both wrote correct, working bots. Both use the same algorithm class: enumerate odd-length decimal palindromes by their left half (the only palprime construction that matters past 11, since every even-length palindrome is divisible by 11 and 11 itself is therefore the only even-length palprime), then test each candidate with deterministic Miller-Rabin. ChatGPT (~250 lines) parallelises the enumeration across a multiprocessing pool and stores the longest-1-run for each palprime in a typed array. Grok (~130 lines) runs single-threaded, using a small-trial-division filter (primes up to 97) before a 9-witness Miller-Rabin (witnesses = [2, 3, 5, 7, 11, 13, 17, 19, 23]). Both implementations are correct and produce the full 1,000,000-palprime list in roughly 100 seconds on a typical core. That’s fast enough to finish before the tournament ends, and far too slow to fit inside the 10-second registration window.
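A minimal single-threaded sketch of that algorithm class follows. It is not either bot's source: the function names are hypothetical, the trial-division step only uses the nine witness primes rather than primes up to 97, and there is no multiprocessing. The nine bases are the ones quoted for Grok's bot; Miller-Rabin with the first nine primes as bases is known to be deterministic for n below 3,825,123,056,546,413,051.

```python
from itertools import count, islice

WITNESSES = (2, 3, 5, 7, 11, 13, 17, 19, 23)


def is_prime(n: int) -> bool:
    """Miller-Rabin on fixed bases, with the bases doubling as trial divisors."""
    if n < 2:
        return False
    for p in WITNESSES:
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in WITNESSES:
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True


def palprimes():
    """Yield palindromic primes in increasing order."""
    # One-digit primes, plus 11, the only even-length palprime.
    yield from (2, 3, 5, 7, 11)
    for half_len in count(2):
        # A left half of half_len digits gives a palindrome of 2*half_len - 1 digits.
        for left in range(10 ** (half_len - 1), 10 ** half_len):
            s = str(left)
            if s[0] in "24568":
                continue  # last digit equals first digit: divisible by 2 or 5
            candidate = int(s + s[-2::-1])  # mirror everything but the middle digit
            if is_prime(candidate):
                yield candidate


if __name__ == "__main__":
    print(list(islice(palprimes(), 10)))
```

Because left halves are visited in increasing order within each length, and lengths increase, the generator emits palprimes already sorted, so the n-th value yielded is p(n) with no extra sort.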

The structural choice that cost them the tournament:

    # ChatGPT
    def main():
        botname = os.environ.get("BOTNAME")
        ...
        # Precompute before connecting so no ROUND clock is running yet.
        answers = precompute_answers(MAX_N)  # ← ~100 s with...
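For contrast, a register-first ordering could be sketched like this. Everything here is an assumption made for illustration: the function name, the exact wire format of the BOTNAME line (the prompt's real format is not reproduced in this article), and the use of a thread for the slow work. The demo uses a local socket pair in place of the tournament server.

```python
import socket
import threading


def register_then_precompute(sock, botname, precompute):
    """Send the registration line first, then start the slow work in the background."""
    # Assumed wire format; the real BOTNAME line may differ.
    sock.sendall(f"BOTNAME {botname}\n".encode())
    worker = threading.Thread(target=precompute, daemon=True)
    worker.start()
    return worker


if __name__ == "__main__":
    client, fake_server = socket.socketpair()  # stand-in for the real connection
    finished = threading.Event()
    worker = register_then_precompute(client, "demo", finished.set)
    print(fake_server.recv(64))  # the registration line arrives immediately
    worker.join()
    print(finished.is_set())
    client.close()
    fake_server.close()
```

With this ordering the registration line goes out within milliseconds of connecting, and the cost of the precompute shifts to whichever rounds arrive before it finishes.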
