Theoretical Bottlenecks for Scaling LLM Inference to Get Higher Token per Second

arjmandi · 1 pts · 1 comments


Freddie

@freddie_spirit


Let's establish the fundamental accounting identity that governs everything else. For any workload on any accelerator, execution time is bounded below by max(compute_time, memory_time, communication_time)....
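The accounting identity above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function name, the parameter names, and every hardware and workload number below are assumptions chosen only to show how the three terms compare.

```python
def step_time_lower_bound(flops, bytes_moved, bytes_communicated,
                          peak_flops, mem_bandwidth, link_bandwidth):
    """Return (lower bound on execution time in seconds, name of the limiting term).

    Each term is the time its resource would need if it ran at peak rate
    and overlapped perfectly with the other two.
    """
    times = {
        "compute_time": flops / peak_flops,
        "memory_time": bytes_moved / mem_bandwidth,
        "communication_time": bytes_communicated / link_bandwidth,
    }
    bottleneck = max(times, key=times.get)
    return times[bottleneck], bottleneck


# Hypothetical workload: one step that performs 14e9 FLOPs and reads 14e9
# bytes from memory, with no inter-device traffic.
# Hypothetical accelerator: 1e15 FLOP/s peak, 2e12 B/s memory bandwidth,
# 1e11 B/s interconnect bandwidth.
bound, limiting_term = step_time_lower_bound(
    flops=14e9,
    bytes_moved=14e9,
    bytes_communicated=0.0,
    peak_flops=1e15,
    mem_bandwidth=2e12,
    link_bandwidth=1e11,
)
print(f"lower bound: {bound:.6f} s, limited by {limiting_term}")
```

With these made-up numbers the memory term (about 7 ms) dwarfs the compute term (about 14 µs), so the step is memory-bound and can run at most roughly 143 times per second regardless of available compute.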

9:18 AM · Jul 2, 2026 · 11 Views
