No way to parse integers in C


There are a few ways to attempt to parse a string into a number in the C standard library. They are ALL broken.

Update at the bottom: Actually C++’s std::from_chars() looks useful.

Leaving aside the wide character versions, and staying with long (skipping int, long long and intmax_t, which all have the same problem), there are three ways I can think of:

atol()

strtol() / strtoul()

sscanf()

They are all broken.

What is the correct behavior, anyway?

I’ll start by claiming a common sense “I know it when I see it”. The number that I see in the string with my eyeballs must be the numerical value stored in the appropriate data type. “123” must be turned into the number 123.

Another criterion is that the WHOLE number must be parsed. It is not OK to stop at the first sign of trouble and return whatever may or may not be right. “123timmy” is not a number, nor is the empty string.

Failing to provide the above must be an error. Or at least, as the user of the parser, I must have the option to know if it happened.
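To make the second criterion concrete, here is a minimal sketch of a purely syntactic check. The name is_whole_decimal() is hypothetical, and accepting an optional leading sign is my assumption, not something the criteria above spell out. It only answers “is this whole string a decimal number?” and says nothing about whether the value fits in a long.

```c
#include <ctype.h>
#include <stdbool.h>

// Hypothetical helper: true only if `s` is an optional sign followed by
// one or more decimal digits and nothing else. Range is NOT checked.
bool is_whole_decimal(const char* s)
{
  // Assumption: a single leading '+' or '-' is allowed.
  if (*s == '+' || *s == '-') {
    s++;
  }
  // Empty string, lone sign, or leading garbage (including spaces).
  if (!isdigit((unsigned char)*s)) {
    return false;
  }
  while (isdigit((unsigned char)*s)) {
    s++;
  }
  // Anything left over means trailing garbage.
  return *s == '\0';
}
```

With this, “123” passes, while “123timmy”, the empty string, and a string of spaces are all rejected.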

First up: atol()

| Input | Output |
|---|---|
| 123timmy | 123 |
| 99999999999999999999999999999999 | LONG_MAX |
| timmy | 0 |
| empty string | 0 |
| " " | 0 |

No. All wrong. And no way for the caller to know anything happened.
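A small sketch of why the caller is stuck: a perfectly valid “0” and several kinds of garbage produce the same return value, and there is no side channel to tell them apart. The helper name is hypothetical, and the inputs are deliberately kept in range so the sketch stays clear of the overflow case.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper: true if atol() returns the same value for both
// inputs, i.e. the caller cannot tell them apart from the result alone.
bool atol_cannot_tell_apart(const char* a, const char* b)
{
  return atol(a) == atol(b);
}
```

For example, atol_cannot_tell_apart("0", "timmy") and atol_cannot_tell_apart("123", "123timmy") should both be true.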

For the LONG_MAX overflow case the manpage is unclear if it’s supposed to do that or return as many nines as it can, but empirically on Linux this is what it does.

POSIX and C both say “if the value cannot be represented, the behavior is undefined”. What the hell? This makes atol() impossible to use on untrusted input. Now, in practice I don’t see compilers and libc ever triggering the scary parts of UB on bad input, but on paper atol() is allowed to wipe your hard drive if it gets bad input.

Great. How am I supposed to know if the value can be represented if there is no way to check for errors? So if you pass a string to atol() then you’re basically getting a random value, with a bias towards being right most of the time.

I can kinda forgive atol(). It’s from a simpler time, a time when gets() seemed like a good idea. gets() famously cannot be used correctly.

Neither can atol().

Next one: strtol()

I’ll now contradict the title of this post. strtol() can actually be used correctly. strtoul() cannot, but if you’re fine with signed types only, then this’ll actually work.

But only carefully. The manpage has example code, but in function form it’s:

    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    bool parse_long(const char* in, long* out)
    {
      // Detect empty string.
      if (!*in) {
        fprintf(stderr, "empty string\n");
        return false;
      }

      // Parse number.
      char* endp = NULL;  // This will point to end of string.
      errno = 0;          // Pre-set errno to 0.
      *out = strtol(in, &endp, 0);

      // Range errors are delivered as errno.
      // I.e. on amd64 Linux it needs to be between -2^63 and 2^63-1.
      if (errno) {
        fprintf(stderr, "error parsing: %s\n", strerror(errno));
        return false;
      }

      // Check for garbage at the end of the string.
      if (*endp) {
        fprintf(stderr, "incomplete parsing\n");
        return false;
      }
      return true;
    }

Whether it’s OK to clobber *out in the error case is a matter of API design, but that’s a minor detail.
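If clobbering is not OK for your API, one possible variant is to parse into a temporary and only write *out on success. This is a sketch under my own assumptions, not the manpage’s code: the name parse_long_noclobber() is hypothetical, error messages are dropped for brevity, and the empty-string test is replaced by checking that strtol() consumed at least one character.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical variant: leaves *out untouched unless the whole string
// parsed as a long that is in range.
bool parse_long_noclobber(const char* in, long* out)
{
  char* endp = NULL;
  errno = 0;  // Pre-set errno so a stale value is not mistaken for an error.
  const long val = strtol(in, &endp, 0);

  // Range errors are delivered as errno.
  if (errno) {
    return false;
  }
  // Nothing consumed: empty string, only spaces, or no number at all.
  if (endp == in) {
    return false;
  }
  // Garbage after the number.
  if (*endp) {
    return false;
  }
  *out = val;
  return true;
}
```

The design choice is simply that the caller’s variable keeps its old value on every failure path, which makes “parse or keep the default” code harder to get wrong.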

Yay, signed numbers are parsable!

How about strtoul()/strtoull()?

Unlike its sibling, this function cannot be used correctly.

“The strtoul() function returns either the result of the conversion or, if there was a leading minus sign, the negation of the result of the conversion represented as an unsigned value”

Example outputs on amd64 Linux:

| Input raw | Input | Output raw | Output |
|---|---|---|---|
| -1 | -1 | 18446744073709551615 | 2^64-1 |
| -9223372036854775808 | -2^63 | 9223372036854775808 | 2^63 |
| -9223372036854775809 | -2^63-1 | 9223372036854775807 | 2^63-1 |
| " " | just spaces | Error: endp not null | |
| -18446744073709551614 | -2^64+2 | 2 | 2 |
| -18446744073709551615 | -2^64+1 | 1 | 1 |
| -18446744073709551616 | -2^64 | Error ERANGE | |

Phew, finally an error is reported.

This is in no way useful. Or I should say: Maybe there are use cases where this is useful, but it’s absolutely not a function that returns the number I asked for.

The title in the Linux manpage is “convert a string to an unsigned long integer”. It does that. Technically it converts it into an unsigned long integer. Not the obviously correct one, but it indeed returns an unsigned long.
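To see the problem from the caller’s side, here is a sketch that wraps strtoul() in the same errno and endp checks that work for strtol(). The wrapper name strtoul_checked() is hypothetical. Every check passes for “-1”, and the caller is handed a huge positive number.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper: runs strtoul() with the usual errno/endp checks
// and reports whether strtoul() itself claimed success.
bool strtoul_checked(const char* in, unsigned long* out)
{
  char* endp = NULL;
  errno = 0;  // Pre-set errno to 0.
  *out = strtoul(in, &endp, 0);
  // Success means: no range error, something was consumed, nothing trails.
  return errno == 0 && endp != in && *endp == '\0';
}
```

Calling strtoul_checked("-1", &v) should return true with v equal to ULONG_MAX, which is exactly the first row of the table above.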

Interesting note that a non-empty input of just spaces is detectable as an error. It’s obviously the right thing to do, but it’s not clear that this is intentional.

So check your implementation: If passed an input of all isspace() characters, is this correctly detected as an error?

If not then strtol() is probably broken too.
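One way to run that check is a sketch like the following, with a hypothetical helper name. It asks whether strtol() left endp at the start of the input, which is what makes the trailing-garbage test fire for whitespace-only strings.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical check: true if strtol() consumed nothing from `in`,
// leaving endp at the start of the string.
bool strtol_consumed_nothing(const char* in)
{
  char* endp = NULL;
  (void)strtol(in, &endp, 0);  // Only endp matters here, not the value.
  return endp == in;
}
```

If strtol_consumed_nothing(" ") comes back false on your platform, the `*endp` test cannot be trusted to reject whitespace-only input there.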

Maybe sscanf()?

A bit less code needed, which is nice:

    bool parse_ulong(const char* in, unsigned long* out)
    {
      char ch;  // Probe for trailing data.
      int len;
      if (1 != sscanf(in, "%lu%n%c", out, &len, &ch)) {
        fprintf(stderr, "Failed to parse\n");
        return false;
      }

      // This never triggered, so seems sscanf() doesn't stop
      // parsing on overflow. So it's safe to skip the length check.
      if (len != (int)strlen(in)) {
        fprintf(stderr, "Did not parse full string\n");
        return...
