-->
No way to parse integers in C
There are a few ways to attempt to parse a string into a number in the<br>C standard library. They are ALL broken.
Update at the bottom: Actually C++’s std::from_chars() looks useful.
Leaving aside the wide character versions, and staying with long<br>(skipping int, long long or intmax_t, these variants all having<br>the same problem) there are three ways I can think of:
atol()
strtol() / strtoul()
sscanf()
They are all broken.
What is the correct behavior, anyway?
I’ll start by claiming a common sense “I know it when I see it”. The<br>number that I see in the string with my eyeballs must be the numerical<br>value stored in the appropriate data type. “123” must be turned into<br>the number 123.
Another criterion is that the WHOLE number must be parsed. It is not OK<br>to stop at the first sign of trouble, and return whatever maybe is<br>right. “123timmy” is not a number, nor is the empty string.
Failing to provide the above must be an error. Or at least as the<br>user of the parser I must have the option to know if it happened.
First up: atol()
| Input | Output |
|---|---|
| 123timmy | 123 |
| 99999999999999999999999999999999 | LONG_MAX |
| timmy | 0 |
| empty string | 0 |
| " " | 0 |
No. All wrong. And no way for the caller to know anything happened.
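To make the table concrete, here is a minimal sketch of the well-defined rows. The function name `atol_demo` is made up for illustration. The overflow row is deliberately left out, since calling atol() with an unrepresentable value is undefined behavior.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical helper: several different inputs become
// indistinguishable once they have gone through atol().
void atol_demo(void)
{
    assert(atol("123") == 123);       // The one case that works as hoped.
    assert(atol("123timmy") == 123);  // Trailing garbage silently dropped.
    assert(atol("timmy") == 0);       // Not a number, yet same as "0".
    assert(atol("") == 0);            // Empty string, same as "0".
    assert(atol(" ") == 0);           // Just a space, same as "0".
    assert(atol("0") == 0);           // The only legitimate zero here.
    // Not called with an out-of-range value on purpose: that is
    // undefined behavior according to the C standard.
}
```

The caller gets a plain `long` back in every case, with nothing to tell a real 0 apart from a failed parse.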
For the LONG_MAX overflow case the manpage is unclear if it’s<br>supposed to do that or return as many nines as it can, but empirically<br>on Linux this is what it does.
POSIX and C both say “if the value cannot be represented, the behavior is<br>undefined”. What the hell? This makes atol() impossible to use on untrusted<br>input. Now, in practice I don’t see compilers and libc ever triggering the<br>scary parts of UB on bad input, but on paper atol() is allowed to wipe your<br>hard drive if it gets bad input.
Great. How am I supposed to know if the value can be represented if<br>there is no way to check for errors? So if you pass a string to<br>atol() then you’re basically getting a random value, with a bias<br>towards being right most of the time.
I can kinda forgive atol(). It’s from a simpler time, a time when<br>gets() seemed like a good idea. gets() famously cannot<br>be used correctly.
Neither can atol().
Next one: strtol()
I’ll now contradict the title of this post. strtol() can actually be<br>used correctly. strtoul() cannot, but if you’re fine with signed<br>types only, then this’ll actually work.
But only carefully. The manpage has example code, but in function form<br>it’s:
    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    bool parse_long(const char* in, long* out)
    {
        // Detect empty string.
        if (!*in) {
            fprintf(stderr, "empty string\n");
            return false;
        }

        // Parse number.
        char* endp = NULL;  // This will point to the first unparsed character.
        errno = 0;          // Pre-set errno to 0.
        *out = strtol(in, &endp, 0);

        // Range errors are delivered as errno.
        // I.e. on amd64 Linux it needs to be between -2^63 and 2^63-1.
        if (errno) {
            fprintf(stderr, "error parsing: %s\n", strerror(errno));
            return false;
        }

        // Check for garbage at the end of the string.
        if (*endp) {
            fprintf(stderr, "incomplete parsing\n");
            return false;
        }
        return true;
    }
It’s a matter of the API here if it’s OK to clobber *out in the<br>error case, but that’s a minor detail.
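If clobbering is not acceptable, one possible variant parses into a temporary and only writes `*out` on success. This is just a sketch: the names `parse_long_keep` and `parse_long_keep_demo` are made up here, and the error messages are dropped to keep it short.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical variant of parse_long(): *out is written only on success.
bool parse_long_keep(const char* in, long* out)
{
    if (!*in) {
        return false;  // Empty string.
    }
    char* endp = NULL;
    errno = 0;
    const long tmp = strtol(in, &endp, 0);
    if (errno || *endp) {
        return false;  // Out of range, no digits, or trailing garbage.
    }
    *out = tmp;
    return true;
}

// Usage sketch: a failed parse leaves the previous value in place.
void parse_long_keep_demo(void)
{
    long v = 7;
    assert(parse_long_keep("123", &v) && v == 123);
    assert(!parse_long_keep("123timmy", &v) && v == 123);
}
```

Whether this is better depends on the caller. Some APIs prefer a best-effort value even on failure.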
Yay, signed numbers are parsable!
How about strtoul()/strtoull()?
Unlike its sibling, this function cannot be used correctly.
The strtoul() function returns either the result of the conversion or, if there<br>was a leading minus sign, the negation of the result of the conversion<br>represented as an unsigned value
Example outputs on amd64 Linux:
| Input raw | Input | Output raw | Output |
|---|---|---|---|
| -1 | -1 | 18446744073709551615 | 2^64-1 |
| -9223372036854775808 | -2^63 | 9223372036854775808 | 2^63 |
| -9223372036854775809 | -2^63-1 | 9223372036854775807 | 2^63-1 |
| " " | just spaces | Error: endp not null | |
| -18446744073709551614 | -2^64+2 | 2 | 2 |
| -18446744073709551615 | -2^64+1 | 1 | 1 |
| -18446744073709551616 | -2^64 | Error ERANGE | |
Phew, finally an error is reported.
This is in no way useful. Or I should say: Maybe there are use cases<br>where this is useful, but it’s absolutely not a function that returns<br>the number I asked for.
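The first row of the table can be reproduced with a few lines. This is a sketch, with `strtoul_minus_demo` being a made-up name. It uses `ULONG_MAX` rather than a hardcoded 2^64-1, so it should not depend on the width of `unsigned long`.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

// Hypothetical helper: "-1" is accepted without any error indication.
void strtoul_minus_demo(void)
{
    char* endp = NULL;
    errno = 0;
    const unsigned long v = strtoul("-1", &endp, 10);

    assert(v == ULONG_MAX);  // The negation of 1, as an unsigned value.
    assert(errno == 0);      // No range error reported.
    assert(*endp == '\0');   // The whole string was consumed.
}
```

All three of the usual checks pass, so the strtol() recipe from above has nothing to catch here.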
The title in the Linux manpage is “convert a string to an unsigned<br>long integer”. Technically, it does that: it converts the string into SOME<br>unsigned long integer. Not the obviously correct one, but it does indeed<br>return an unsigned long.
Interestingly, a non-empty input of just spaces is detectable<br>as an error. It’s obviously the right thing to do, but it’s not clear<br>that this is intentional.
So check your implementation: If passed an input of all isspace()<br>characters, is this correctly detected as an error?
If not then strtol() is probably broken too.
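One way to run that check is sketched below. The helper name `trailing_check_catches` is made up. It relies on the C standard's rule that when no conversion is performed, the end pointer is set back to the start of the input, so `*endp` is the first (whitespace) character rather than the terminating NUL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper: returns true if the "garbage at the end" check
// (*endp != '\0') would flag this input after a strtol() call.
bool trailing_check_catches(const char* in)
{
    char* endp = NULL;
    (void)strtol(in, &endp, 0);
    return *endp != '\0';
}

// Check sketch for the cases discussed above.
void all_space_demo(void)
{
    assert(trailing_check_catches(" "));      // Only spaces: flagged.
    assert(trailing_check_catches(" \t\n"));  // Mixed whitespace: flagged.
    assert(!trailing_check_catches("123"));   // A real number: not flagged.
    // The empty string is NOT flagged by this check alone, which is why
    // parse_long() tests for it separately.
    assert(!trailing_check_catches(""));
}
```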
Maybe sscanf()?
A bit less code needed, which is nice:
    bool parse_ulong(const char* in, unsigned long* out)
    {
        char ch;  // Probe for trailing data.
        int len;
        if (1 != sscanf(in, "%lu%n%c", out, &len, &ch)) {
            fprintf(stderr, "Failed to parse\n");
            return false;
        }

        // This never triggered, so seems sscanf() doesn't stop
        // parsing on overflow. So it's safe to skip the length check.
        if (len != (int)strlen(in)) {
            fprintf(stderr, "Did not parse full string\n");
            return...