LLM, give me a JSON. Make no mistakes

marek-hradil2 pts0 comments

LLM, give me a JSON. Make no mistakes. - NobodyWho

LLM, give me a JSON. Make no mistakes.

So how exactly do you make your LLM output a JSON? What happens under the hood? And how do you make it reliable and fast?

Make no mistakes

Imagine you have finally managed to set up LLM inference for your application, and it is now able to respond to you. And it can do so much stuff! But for most use cases, getting "just" text back is very limiting. In fact, to make most non-chatbot use cases work, you need structured output such as JSON. So you just append to the prompt:

Remember to give me the output in JSON format. Make no mistakes.

As the JSON output gets longer and longer, your super smart model somehow fails from time to time. Apart from getting the object keys wrong, it appends an extra trailing comma after the last key-value pair, which makes the parser complain. You might ask: is there a better way?

There is!

Being able to control exactly what format your LLM produces is super valuable and technically super interesting. So let us take a deep dive into how you get past "make no mistakes" and how inference engines do it reliably and fast.

Note: If you are already familiar with JSON schemas and GBNF, just skip ahead to the section "Processing Grammars".

Autoretries

The first solution that comes to mind is just to employ some retry strategy at the message level. Essentially:

    while True:
        answer = llm(prompt)   # one full generation per attempt
        if is_json(answer):    # accept only if the whole message parses
            break

This works. The only positive thing I have to say about it is that you can treat the LLM as a complete black box, which might be viable for some libraries (actually, I believe this is what LangChain does). As for the negatives, there are plenty:

by being "unlucky" or employing smaller models, you can be looping for a very long time or forever, before reaching the desired output

you're wasting an enormous number of tokens, by discarding whole messages, even though they might not be all wrong

to be able to get a JSON, you need to construct or download a specific parser, which is not very extensible
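To make the loop above concrete, here is a minimal sketch with a retry cap. The names `generate_json`, `max_attempts` and `fake_llm` are made up for illustration, and `fake_llm` is just a stand-in that returns canned replies instead of calling a real model. `is_json` is implemented with Python's `json.loads`.

```python
import json
from typing import Callable, Optional


def is_json(text: str) -> bool:
    """Return True if `text` parses as a complete JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def generate_json(
    llm: Callable[[str], str], prompt: str, max_attempts: int = 5
) -> Optional[str]:
    """Call `llm` until it returns valid JSON, giving up after `max_attempts`."""
    for _ in range(max_attempts):
        answer = llm(prompt)
        if is_json(answer):
            return answer
    return None  # every attempt was discarded in full


# Hypothetical stand-in for a model: two replies with a trailing comma, then a valid one.
_replies = iter(['{"a": 1,}', '{"a": 1,}', '{"a": 1}'])


def fake_llm(prompt: str) -> str:
    return next(_replies)


result = generate_json(fake_llm, "Give me JSON. Make no mistakes.")
```

Even with the cap, every failed attempt throws away a full generation, which is exactly the waste described in the list above.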

So if you don't have to, please just don't do this. However, by looking more closely into the LLM, you can take a somewhat more principled approach.

Constrained Sampling

There are two observations we can make.

First, LLMs generate outputs token by token. Usually you don't have to generate the whole answer to see that something is wrong. We can retry right away, as soon as the model produces the first token that breaks the format. This way, we are not discarding the whole message, just the last token:

    answer = ""
    while True:
        token = llm.next_token(prompt + answer)
        if token == "":  # the empty string stands for the end of generation here
            break

        # keep the token only if the answer is still a valid JSON prefix;
        # otherwise drop it and sample again
        if is_partial_json(answer + token):
            answer += token

Secondly, the answer tokens don't appear out of the blue. Given a text, LLMs produce a probability distribution over the next token. Instead of simply sampling from the distribution immediately, we can start by setting the probability of all tokens leading to incorrect output to 0. This way, we are guaranteed to only sample (and thus output) a correct token. If we wanted just a number, for example, we could zero out every token that contains anything other than digits.
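As a small sketch of the "just a number" case: the vocabulary and the probabilities below are invented toy values, not the output of any real model. Every token that is not made of digits gets probability 0, and sampling then only ever returns digit tokens.

```python
import random

# Toy next-token distribution over a tiny, made-up vocabulary.
probs = {"4": 0.30, "2": 0.20, "hello": 0.25, "{": 0.15, "7": 0.10}

# Zero out every token that is not made of digits only.
masked = {tok: (p if tok.isdigit() else 0.0) for tok, p in probs.items()}

# random.choices treats the weights as relative, so a zero weight is never picked.
rng = random.Random(0)
samples = rng.choices(list(masked), weights=list(masked.values()), k=1000)

# What is left after masking: less than 1, so `masked` alone is not a distribution.
remaining_mass = sum(masked.values())
```

Note that `remaining_mass` is smaller than 1 here, so the masked values on their own are no longer a proper probability distribution.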

Technically, this process is called "masking". One more detail to address is that just setting the probability of the unwanted tokens to 0 would break the distribution property (we want the probabilities to sum to 1). In practice, the solution is therefore to set the underlying logits to -inf: after the softmax, the unwanted tokens get probability 0 and the remaining tokens are scaled up so that everything still sums to 1. The pseudocode could then look like this:

    answer = ""
    while True:
        token = llm.next_token(
            prompt + answer,
            # only tokens that keep the answer a valid JSON prefix are allowed
            mask=possible_next_json_tokens(answer),
        )

        if token == "":  # end of generation
            break
        answer += token

So even though we are still looping, nothing is discarded: we can't be unlucky and we are not wasting compute. The hard part now is how to specify the possible next tokens, generally called the "mask", and how to do that quickly, so the LLM is not waiting for us.
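Here is a runnable toy version of that loop, under heavy assumptions: `fake_logits` is a hypothetical stand-in that returns random logits instead of calling a model, `VOCAB` and the `<end>` marker are invented, and the target format is "one or more digits" rather than JSON, so the mask fits in a few lines.

```python
import math
import random

VOCAB = ["0", "1", "2", "7", "cat", "{", ",", "<end>"]


def fake_logits(rng: random.Random) -> list[float]:
    """Hypothetical model: returns arbitrary logits, ignoring the context."""
    return [rng.uniform(-2.0, 2.0) for _ in VOCAB]


def allowed_tokens(answer: str) -> set[str]:
    """Mask for "one or more digits": digits always, <end> only after a digit."""
    allowed = {tok for tok in VOCAB if tok.isdigit()}
    if answer:
        allowed.add("<end>")
    return allowed


def masked_softmax(logits: list[float], allowed: set[str]) -> list[float]:
    """Set disallowed logits to -inf, then apply a softmax."""
    masked = [
        logit if tok in allowed else -math.inf
        for tok, logit in zip(VOCAB, logits)
    ]
    top = max(masked)  # finite, because at least the digits are allowed
    exps = [math.exp(logit - top) for logit in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]


def generate_number(rng: random.Random, max_tokens: int = 8) -> str:
    answer = ""
    for _ in range(max_tokens):
        probs = masked_softmax(fake_logits(rng), allowed_tokens(answer))
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if token == "<end>":
            break
        answer += token
    return answer


number = generate_number(random.Random(42))
```

Whatever the logits are, `number` consists of digits only, because every other token has probability 0 at the moment of sampling. For JSON, the hard part is replacing `allowed_tokens` with something that knows the JSON grammar.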

Specifying the Format

What exactly is JSON? To construct the mask, we have to be able to answer this token by token. One great way to precisely specify a text format is regexes. Unfortunately, as the name suggests, regexes are made for specifying regular languages, which JSON is not. With the basic set of regex features you cannot, for example, guarantee that every opened { is also closed by a corresponding }.
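A small illustration of the brace problem, as a sketch only: the pattern `TWO_LEVELS` and the helper `braces_balanced` are made up for this example. A basic regex can hard-code a fixed nesting depth, while matching arbitrary depth needs something that counts.

```python
import re

# A basic regex can only hard-code a fixed nesting depth;
# this one accepts at most two levels of braces.
TWO_LEVELS = re.compile(r"\{(?:[^{}]|\{[^{}]*\})*\}")


def braces_balanced(text: str) -> bool:
    """Check brace nesting of any depth with a counter.

    Simplification: braces inside string literals are not special-cased.
    """
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # closed a brace that was never opened
                return False
    return depth == 0


shallow = '{"a": {"b": 1}}'
deep = '{"a": {"b": {"c": 1}}}'

regex_shallow = TWO_LEVELS.fullmatch(shallow) is not None
regex_deep = TWO_LEVELS.fullmatch(deep) is not None
counter_deep = braces_balanced(deep)
```

The regex accepts `shallow` but rejects `deep`, even though both are valid JSON, while the counter handles both. You could extend the pattern by one more level, but never to all levels at once.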

A better fit for this use case is JSON schemas. If you don't know JSON schemas, they are essentially a metalanguage on top of JSON for specifying what JSON format you expect. This way, we can say for example:

    { "type": "object" }

to get any JSON object. Or, if you want something more specific:

    {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "age": { "type": "integer" }
      }
    }

This is a more concrete specification and some of the inference...

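To get a feeling for what a schema like the name/age one above accepts, here is a tiny hand-rolled checker. It is only a sketch: `matches` and `PY_TYPES` are hypothetical helpers that understand just the "type" and "properties" keywords, nowhere near a full JSON Schema validator.

```python
import json

schema = json.loads("""
{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "age": { "type": "integer" }
  }
}
""")

# Python types for the few "type" values this sketch supports.
PY_TYPES = {"object": dict, "string": str, "integer": int}


def matches(value, schema: dict) -> bool:
    """Check `value` against a tiny schema subset: "type" and "properties"."""
    expected = schema.get("type")
    if expected is not None:
        if not isinstance(value, PY_TYPES[expected]):
            return False
        if expected == "integer" and isinstance(value, bool):
            return False  # bool is a subclass of int in Python, but not a JSON integer
    if isinstance(value, dict):
        for key, sub_schema in schema.get("properties", {}).items():
            # "properties" only constrains keys that are actually present
            if key in value and not matches(value[key], sub_schema):
                return False
    return True


ok = matches(json.loads('{"name": "Ada", "age": 36}'), schema)
bad = matches(json.loads('{"name": "Ada", "age": "36"}'), schema)
```

Here `ok` is true and `bad` is false, because "36" is a string, not an integer. Note that this checks a finished JSON value; for masking, the same question has to be answered for every partial output.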