Reward models for LMs are fundamentally broken

panthertrax | 1 pts | 0 comments


Vijay V.

@vijaytarian

Continuous reward models can measure partial progress on a task, unlike binary rewards (e.g. RLVR). But we find they inevitably assign wildly different scores to equally-good responses, which can lead to bad policies. Surprisingly, dense rewards are often better if discretized!🧵 https://t.co/rx4eCpKgwR
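To make the discretization idea concrete, here is a minimal sketch of one way a continuous reward could be bucketed before it reaches the policy update. The post does not say how the rewards were discretized, so the function name discretize_reward, the three thresholds, and the example scores are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from typing import Sequence


def discretize_reward(
    score: float,
    thresholds: Sequence[float] = (0.25, 0.5, 0.75),  # hypothetical bin edges
) -> int:
    """Map a continuous reward score to an integer bin index.

    With k thresholds the result is in 0..k. A score equal to a
    threshold falls into the higher bin.
    """
    return bisect_right(sorted(thresholds), score)


# Hypothetical scores a continuous reward model might give to three
# responses that a human would judge equally good.
scores = [0.71, 0.52, 0.74]
bins = [discretize_reward(s) for s in scores]

# After bucketing, all three responses receive the same reward level,
# so the spread between them no longer drives the policy gradient.
print(bins)
```

The trade-off in a sketch like this is that coarser bins suppress spurious score differences but also throw away some of the partial-progress signal, so the number and placement of thresholds would need to be tuned per task.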

3:10 PM · Jun 23, 2026 · 5.2K Views

