Quoth lhr_at_disroot.org:
> The only issue I have here is that for what I presume (with no evidence) to
> be the most common case, wc < 0x80, this calls _validutf8(), while the
> special-case one didn't. For cases of iterating over long strings of wchars,
> this adds an extra function call overhead each time. But I could just be
> prematurely optimising here, the CPU may fix this with caching or branch
> prediction.
Uhmmm, good point. While I usually only care about readability and being
uniform, it is true that the < 0x80 case will be very common, especially
in Western languages. Having the if at the beginning is not so bad and it
saves a lot of time in these cases. I will go back and add the check for
this case at the beginning, removing the ternary before the loop.
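Something like this shape, I mean (a minimal sketch only, assuming the
function under discussion is wctomb(); this is not the actual scc code,
and the validity check and the multibyte encoding loop are elided):

#include <stdlib.h>

int
wctomb(char *s, wchar_t wc)
{
	if (s == NULL)
		return 0;                 /* UTF-8 has no shift state */
	if ((unsigned long) wc < 0x80) {  /* common ASCII case, checked first */
		*s = wc;
		return 1;
	}
	/* the validity check (the _validutf8() call) and the 2- to 4-byte
	 * encoding only run for the less common wc >= 0x80 case */
	return -1;                        /* placeholder for the elided part */
}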
>
> > I don't think it is a good idea to use universal character names in strings
> > because it is not well defined in the standard, in fact the standard seems
> > even contradictory. Quoting from [1]:
> > [SNIP]
>
> I think it's a safe assumption that while '\u0153' is a problem, the same
> problem shouldn't apply to "\u0153" as that would make \u escapes in
> multibyte encodings effectively useless. More importantly, I'm pretty certain
> all existing compilers treat this the same. Using hex escapes assumes the
> compiler's charset is unicode (which is mandated by C23) and the execution
> locale is UTF-8, but if scc libc only intends to support UTF-8 then its
> potato potato I suppose.
Well, in this case it is not about assuming the charset of the compiler, it
is about assuming the charset and encoding of the libc. I would not tie the
tests to specific compiler implementations (even if C23 specifies it, I
would like to keep the code of scc itself and the code of the tests C99 to
improve portability). Do you know if C23 also mandates UTF-8 encoding in
the compiler? The link that I posted was an issue raised to the group and
it was never addressed.
The intention is to support only UTF-8 for one reason: supporting
locales implies dynamic linking or file system access (to load locale
definitions), and one of the targets of the scc libc is bare metal
systems (in fact, an early version is used in Arm Trusted Firmware)
where you don't have them. Once we have only the C locale, which
multibyte encoding should we use? (it is a rhetorical question ;) ).
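Just to illustrate the difference being discussed (the variable names here
are mine, not from the test suite): with a UTF-8 execution charset both
declarations below end up holding the same two bytes, but only the first
one depends on how the compiler translates the UCN:

/* U+0153 (LATIN SMALL LIGATURE OE) is 0xC5 0x93 in UTF-8 */
static const char s_ucn[] = "\u0153";   /* compiler maps the UCN to its execution charset */
static const char s_hex[] = "\xc5\x93"; /* exact bytes, independent of the compiler charset */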
> > I think we can use a typical NELEM macro instead of hardcoding 4 here:
> >
> > #define NELEM(x) (sizeof(x)/sizeof((x)[0]))
> >
> > for (i = 0; i < NELEM(mb); i++)
> >
> > and
> >
> > for (i = 0; i < NELEM(wc); i++)
> >
> > because I expect more cases are going to be added in the future ...
>
> My aim here was that `strlen(mb[n]) == n + 1', so each size of UTF-8 code
> point is tested, so the maximum is 4 and shouldn't change in the foreseeable
> future. More test cases are good but I don't think it would make sense to
> add them to the mb[] and wc[] arrays.
Uhmmm, I didn't notice that. Anyway, the original UTF-8 specification
allowed up to 6 bytes (I do know that RFC-3629 limited it to 4 bytes, but I
prefer to stick with the spec in RFC-2279), and there are more test cases,
for example characters in the invalid ranges. Maybe a more general test
definition would be something like:
struct mbtest {
	char *s;
	int nbytes;
	wchar_t res;
};
that would even unify positive and negative tests in one loop, as they
just become different entries in a single array (something similar could
be done for the wc tests).
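For example, the loop could look roughly like this (a hypothetical sketch
of the table-driven idea, not the actual scc tests; the entries, the
testmbtowc() name and the abort() error handling are just illustrative,
and it reuses the NELEM macro quoted above):

#include <stdlib.h>
#include <string.h>

#define NELEM(x) (sizeof(x)/sizeof((x)[0]))

struct mbtest {		/* the struct proposed above */
	char *s;
	int nbytes;	/* expected return value, < 0 for rejected input */
	wchar_t res;	/* expected wide character when nbytes > 0 */
};

static struct mbtest tests[] = {
	{"A", 1, L'A'},			/* valid 1-byte sequence */
	{"\xc5\x93", 2, 0x153},		/* valid 2-byte sequence, U+0153 */
	{"\x93", -1, 0},		/* stray continuation byte, rejected */
	{"\xc5", -1, 0},		/* truncated 2-byte sequence, rejected */
};

static void
testmbtowc(void)
{
	wchar_t wc;
	size_t i;
	int n;

	for (i = 0; i < NELEM(tests); i++) {
		n = mbtowc(&wc, tests[i].s, strlen(tests[i].s));
		if (n != tests[i].nbytes || (n > 0 && wc != tests[i].res))
			abort();	/* stand-in for the real test reporting */
	}
}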
>
> > While I think this works very well to test only mbtowc() and mbtowc(),
> > I think it can be a problem once we begin to add more functions to be
> > tested, so I would split the tests per function for example:
> >
> > test_mbtowc();
> > test_mbtowc();
>
> Do you mean mbtowc and mbrtowc, or mbtowc and wctomb?
Yeah, sorry for the noise ^^!!!!.
Regards,