Re: [scc-dev] [PATCH 3/3] libc: fix wchar unicode handling from Roberto E. Vargas Caballero on 2025-02-26 (scc-dev)

From: Roberto E. Vargas Caballero <k0ga_at_shike2.net>
Date: Wed, 26 Feb 2025 10:28:39 +0100

Very good job, you catched so many errors that I am ashamed ^^!!!!!!!
I put some comments, please fell free to give your opinion.

> diff --git a/src/libc/wchar/_validutf8.c b/src/libc/wchar/_validutf8.c
> index 45b12fdc..48bc8066 100644
> --- a/src/libc/wchar/_validutf8.c
> +++ b/src/libc/wchar/_validutf8.c
> _at_@ -23,7 +23,7 @@ _validutf8(wchar_t wc, int *nbytes)
> };
> struct range *bp;
>
> - for (bp = ranges; bp->begin <= wc && bp->end > wc; ++bp)
> + for (bp = ranges; !(bp->begin <= wc && bp->end > wc); ++bp)
> ;

Good catch. I wrote these functions but I didn't test them :'( .
I would use De Morgan rules and avoid the negation:

- for (bp = ranges; bp->begin <= wc && bp->end > wc; ++bp)
+ for (bp = ranges; bp->begin > wc || bp->end <= wc; ++bp)

What do you think?

> diff --git a/src/libc/wchar/mbrtowc.c b/src/libc/wchar/mbrtowc.c
> index 6f825f8b..e1b217f0 100644
> --- a/src/libc/wchar/mbrtowc.c
> +++ b/src/libc/wchar/mbrtowc.c
> _at_@ -5,10 +7,11 @@
> #undef mbrtowc
>
> size_t
> -mbrtowc(wchar_t *restrict pwc, const char *restrict s, size_t n,
> +mbrtowc(wchar_t *restrict pwc, const char *restrict s, size_t n_,
> mbstate_t *restrict ps)
> {
> - unsigned char *t = (unsigned char *) s;
> + const unsigned char *t = (const unsigned char *) s;
> + ptrdiff_t n = n_;

I think this conversion is just wrong, because the input parameter is
a size_t for a good reason. If we recive a really big buffer, where
the size has the msb to 1 then this change will reject it, while it is
well defined that the function should accept such buffer.

> unsigned long wc;
> unsigned c;
> int i, len, maxlen;
> _at_@ -16,24 +19,35 @@ mbrtowc(wchar_t *restrict pwc, const char *restrict s, size_t n,
> if (s == NULL)
> return 0;
>
> + if (!(*t < 0x80 || (*t >= 0xC2 && *t <= 0xF4))) {
> + errno = EILSEQ;
> + return -1;
> + }
> +

I think having this if here defeats the purpose of the following loop,
which should catch any error following the utf8 rules. I have
rewritten the function as:

--- a/src/libc/wchar/mbrtowc.c
+++ b/src/libc/wchar/mbrtowc.c
_at_@ -1,3 +1,4 @@
+#include <errno.h>
#include <wchar.h>

#include "../libc.h"
_at_@ -8,37 +9,41 @@ size_t
mbrtowc(wchar_t *restrict pwc, const char *restrict s, size_t n,
         mbstate_t *restrict ps)
{
- unsigned char *t = (unsigned char *) s;
+ const unsigned char *t = (const unsigned char *) s;
        unsigned long wc;
        unsigned c;
        int i, len, maxlen;

- if (s == NULL)
+ if (s == NULL || *s == '\0')
                return 0;

        wc = c = *t++;
        for (len = 0; n > 0 && c & 0x80; --n, ++len)
                c <<= 1;
- if (n == 0 || len == 1 || len == 8)
- return -1;
+ if (n == 0 && c & 0x80)
+ return -2;
+ if (len == 1 || len == 8)
+ goto return_error;
        if (len == 0)
                goto return_code;

        wc = (c & 0xFF) >> len;
        for (i = 0; i < len-1; i++) {
                if (((c = *t++) & 0xC0) != 0x80)
- return -1;
+ goto return_error;
                wc <<= 6;
                wc |= c & 0x3F;
        }

        if (!_validutf8(wc, &maxlen) || len != maxlen)
- return -1;
+ goto return_error;

return_code:
        if (pwc)
                *pwc = wc;
- if (*s == '\0')
- return 0;
        return t - (unsigned char *) s;
+
+return_error:
+ errno = EILSEQ;
+ return -1;
}

This adds a new test case because we can return -2 in case we don't have
enough bytes to get a full utf8 sequence.

> --- a/src/libc/wchar/wcrtomb.c
> +++ b/src/libc/wchar/wcrtomb.c
> _at_@ -14,13 +14,18 @@ wcrtomb(char *restrict s, wchar_t wc, mbstate_t *restrict ps)
> if (!s)
> return 1;
>
> + if (wc < 0x80) {
> + *s = (char)wc;
> + return 1;
> + }
> +
> if (!_validutf8(wc, &n)) {
> errno = EILSEQ;
> return -1;
> }
>
> n--;
> - *s = 0;
> + *s = 0x80;
> for (i = 0; i < n; i++) {
> *s >>= 1;
> *s |= 0x80;

I think it is better to transform the special case in a normal case
with something like:

--- a/src/libc/wchar/wcrtomb.c
+++ b/src/libc/wchar/wcrtomb.c
_at_@ -18,9 +18,9 @@ wcrtomb(char *restrict s, wchar_t wc, mbstate_t *restrict ps)
                errno = EILSEQ;
                return -1;
        }
+ n--;

- n--;
- *s = 0;
+ *s = (wc < 0x80) ? 0 : 0x80;
        for (i = 0; i < n; i++) {
                *s >>= 1;
                *s |= 0x80;

> --- /dev/null
> +++ b/tests/libc/execute/0038-wchar.c
...
> +void
> +tests_positive()
> +{
> + const char *const mb[] = { "!", "\u00A1", "\u2014", "\U0001F4A9" };

I don't think it is a good idea to use universal character names in strings
because it is not well defined in the standard, in fact the standard seems
even contradictory. Quoting from [1]:

        This is in obvious contradiction with the second sentence. In
        addition, I'm not really clear what is supposed to happen in
        the case where the execution (narrow-)character set is UTF-8.
        Consider the character \u0153 (the oe in the French word
        oeuvre). Should '\u0153' be a char, with an "error" value,
        say '?' (in conformance with the requirement that it be a
        single char), or an int, with the two char values 0xC5, 0x93,
        in an implementation defined order (in conformance with the
        requirement that a character representable in the execution
        character set be represented). Supposing the former, should
        "\u0153" be the equivalent of "?" (in conformance with the
        first sentence), or "\xC5\x93" (in conformance with the
        second).

I don't know if C23 improved the situation, but anyway it does not
seem the correct thing to do in a test for the libc as it depends a
lot of the compiler implementation, and for that reason I think is
better to use directly the \xHH sequences.

> + const wchar_t wc[] = { L'!', L'\u00A1', L'\u2014', L'\U0001F4A9' };

While I think we can assume safely that there will be a 1 to 1 mapping
from universal character names to unicode entry points, I prefer to
avoid that and use directly the hexadecimal integer constants that
will ensure that the test is independent of the compiler.

For that reason, I think is better to represent these arrays as:

        static char *mb[] = {
                "!",
                "\xc2\xa1",
                "\xe2\x80\x94",
                "\xf0\x9f\x92\xa9"
        };
        static wchar_t wc[] = {
                L'!',
                0x00A1,
                0x2014,
                0x0001F4A9
        };

I removed the consts for a question of style, as const is not commonly
used in the scc code base, and I added the static to avoid the initialization
code.

+ for (i = 0; i < 4; i++) {

I think we can use a typical NELEM macro instead of hardcoding 4 here:

        #define NELEM(x) (sizeof(x)/sizeof((x)[0]))

        for (i = 0; i < NELEM(mb); i++)

and

        for (i = 0; i < NELEM(wc); i++)

because I expect more cases are going to be added in the future ...

> +void
> +tests_negative()
> +{
> + char badutf8[] = { 0xF8 /* 0b11111000 */, 0x80 | 1, 0x80 | 2, 0x80 | 3, 0x80 | 4, 0x80 | 5, '\0' };
> + char overlong[] = { 0xC0, 0x80 | 'a' };

Just to be consistent with the other arrays, and to obey the 80 columns rule
I think this should be:

        static char badutf8[] = {
                0xF8 /* 0b11111000 */,
                0x80 | 1, 0x80 | 2, 0x80 | 3,
                0x80 | 4, 0x80 | 5,
                '\0'
        };
        static char overlong[] = {0xC0, 0x80 | 'a'};

> +
> +int
> +main()
> +{
> + puts("testing");
> + tests_positive();
> + tests_negative();

While I think this works very well to test only mbtowc() and mbtowc(),
I think it can be a problem once we begin to add more functions to be
tested, so I would split the tests per function for example:

        test_mbtowc();
        test_mbtowc();

That would require to have the arrays as global variables, but it shouldn't
be a problem. Also, adding the tests for the reentrant versions should be
trivial as they are the same tests, just passing and additional parameter.
I just pushed the implementation of mbsinit() (it is trivial for the case
of utf8), that can be used in the tests for the reentrant versions.

[1] https://wg21.cmeerw.net/cwg/issue411

Regards,

--
To unsubscribe send a mail to scc-dev+unsubscribe_at_simple-cc.org

Received on Wed 26 Feb 2025 - 10:28:39 CET

This archive was generated by hypermail 2.3.0 : Wed 26 Feb 2025 - 10:30:02 CET