Re: [scc-dev] [PATCH 3/3] libc: fix wchar unicode handling

From: <lhr@disroot.org>
Date: Thu, 27 Feb 2025 19:41:13 +0000

> Very good job, you caught so many errors that I am ashamed ^^!!!!!!!

Don't be ashamed, you wrote a whole libc+toolchain!

> Good catch. I wrote these functions but I didn't test them :'( .
> I would use De Morgan's laws and avoid the negation:
>
> - for (bp = ranges; bp->begin <= wc && bp->end > wc; ++bp)
> + for (bp = ranges; bp->begin > wc || bp->end <= wc; ++bp)

Many such cases, lmao. Yes, that looks more readable.
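
Just to convince myself, I checked the two spellings against each other with
a throwaway snippet like this (the range and types here are made up, not the
real rune table):

        #include <assert.h>

        struct range { unsigned begin, end; };

        int
        main(void)
        {
                struct range r = {0x41, 0x5B};  /* made-up half-open range */
                unsigned wc;

                for (wc = 0; wc < 0x100; wc++) {
                        int neg = !(r.begin <= wc && r.end > wc);
                        int dm  = (r.begin > wc || r.end <= wc);
                        assert(neg == dm);      /* the two spellings agree */
                }
                return 0;
        }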

> > diff --git a/src/libc/wchar/mbrtowc.c b/src/libc/wchar/mbrtowc.c
> > index 6f825f8b..e1b217f0 100644
> > --- a/src/libc/wchar/mbrtowc.c
> > +++ b/src/libc/wchar/mbrtowc.c
> > @@ -5,10 +7,11 @@
> > #undef mbrtowc
> >
> > size_t
> > -mbrtowc(wchar_t *restrict pwc, const char *restrict s, size_t n,
> > +mbrtowc(wchar_t *restrict pwc, const char *restrict s, size_t n_,
> > mbstate_t *restrict ps)
> > {
> > - unsigned char *t = (unsigned char *) s;
> > + const unsigned char *t = (const unsigned char *) s;
> > + ptrdiff_t n = n_;
>
> I think this conversion is just wrong, because the input parameter is
> a size_t for a good reason. If we receive a really big buffer, where
> the size has the MSB set to 1, then this change will reject it, while it
> is well defined that the function should accept such a buffer.

I have wondered about this before. I've heard that the C standard leaves
support for objects larger than PTRDIFF_MAX implementation-defined, even when
the size is stored in a size_t, because implementations are allowed to rely
on ptrdiff_t not overflowing. But I can't find a good source for that. Also,
on e.g. 32-bit machines, ptrdiff_t can actually be a wider type than size_t.
I think the reason for taking a size_t is so that in a loop like
        for (size_t i = 0; s[i]; i++) ...
you don't end up with a negative index, which would definitely be undefined
behaviour.

ANYWAY, I digress; I think you're right about avoiding ptrdiff_t. I only did
it so that a count dropping below zero gets caught by `n > 0', but looking
again at the code, I don't think that will actually be a problem.
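
For the archive, the pattern I was worried about is basically this (just a
sketch, not the actual mbrtowc code; `consume' is a made-up name): with an
unsigned count the decrement only happens after the `> 0' test, so it can
never wrap around:

        #include <assert.h>
        #include <stddef.h>

        /* sketch: consume up to n bytes of a string, counting how many were used */
        static size_t
        consume(const char *s, size_t n)
        {
                size_t left = n;

                while (left > 0 && *s != '\0') {
                        s++;
                        left--;         /* guarded by left > 0, so no unsigned wraparound */
                }
                return n - left;
        }

        int
        main(void)
        {
                assert(consume("abc", 10) == 3);        /* stops at the terminator */
                assert(consume("abcdef", 2) == 2);      /* stops when the budget runs out */
                return 0;
        }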

> I think having this if here defeats the purpose of the following loop,
> which should catch any error following the utf8 rules. I have
> rewritten the function as:
> [SNIP]

Yes, this all looks good and tests fine for me. Good catch with returning -2.
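
For anyone reading this in the archive without the patch in front of them:
(size_t)-2 is the return mbrtowc() uses for "everything seen so far is a
valid prefix, but the character is still incomplete". A rough sketch of the
idea (this is not the actual rewrite; it ignores mbstate_t, overlong forms
and surrogates, and `utf8towc' is a made-up name):

        #include <errno.h>
        #include <stddef.h>
        #include <wchar.h>

        /* illustrative sketch only, not the scc implementation */
        size_t
        utf8towc(wchar_t *pwc, const unsigned char *t, size_t n)
        {
                wchar_t wc;
                size_t i, len;

                if (n == 0)
                        return (size_t)-2;              /* nothing to look at yet */
                if (t[0] < 0x80) {
                        wc = t[0];
                        len = 1;
                } else if ((t[0] & 0xE0) == 0xC0) {
                        wc = t[0] & 0x1F;
                        len = 2;
                } else if ((t[0] & 0xF0) == 0xE0) {
                        wc = t[0] & 0x0F;
                        len = 3;
                } else if ((t[0] & 0xF8) == 0xF0) {
                        wc = t[0] & 0x07;
                        len = 4;
                } else {
                        errno = EILSEQ;
                        return (size_t)-1;              /* invalid lead byte */
                }
                for (i = 1; i < len; i++) {
                        if (i >= n)
                                return (size_t)-2;      /* valid prefix, buffer ran out */
                        if ((t[i] & 0xC0) != 0x80) {
                                errno = EILSEQ;
                                return (size_t)-1;      /* bad continuation byte */
                        }
                        wc = wc << 6 | (t[i] & 0x3F);
                }
                if (pwc)
                        *pwc = wc;
                return wc ? len : 0;                    /* 0 for the null wide character */
        }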

> > --- a/src/libc/wchar/wcrtomb.c
> > +++ b/src/libc/wchar/wcrtomb.c
> > @@ -14,13 +14,18 @@ wcrtomb(char *restrict s, wchar_t wc, mbstate_t *restrict ps)
> > if (!s)
> > return 1;
> >
> > + if (wc < 0x80) {
> > + *s = (char)wc;
> > + return 1;
> > + }
> > +
> > if (!_validutf8(wc, &n)) {
> > errno = EILSEQ;
> > return -1;
> > }
> >
> > n--;
> > - *s = 0;
> > + *s = 0x80;
> > for (i = 0; i < n; i++) {
> > *s >>= 1;
> > *s |= 0x80;
>
> I think it is better to transform the special case in a normal case
> with something like:
>
> --- a/src/libc/wchar/wcrtomb.c
> +++ b/src/libc/wchar/wcrtomb.c
> @@ -18,9 +18,9 @@ wcrtomb(char *restrict s, wchar_t wc, mbstate_t *restrict ps)
> errno = EILSEQ;
> return -1;
> }
> + n--;
>
> - n--;
> - *s = 0;
> + *s = (wc < 0x80) ? 0 : 0x80;
> for (i = 0; i < n; i++) {
> *s >>= 1;
> *s |= 0x80;
>

The only issue I have here is that for what I presume (with no evidence) to
be the most common case, wc < 0x80, this calls _validutf8(), while the
special-cased version didn't. When iterating over a long string of wchars,
that adds an extra function call each time. But I could just be prematurely
optimising here; caching and branch prediction may well hide it.
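
For the record, the shift-and-or loop in the patch does build the right lead
byte; I convinced myself with something like this (unsigned char so the right
shift is well defined, and it only exercises the prefix bits, not the
payload):

        #include <assert.h>

        int
        main(void)
        {
                static const unsigned char lead[] = {0, 0, 0xC0, 0xE0, 0xF0};
                unsigned char c;
                int n, i;

                for (n = 2; n <= 4; n++) {
                        c = 0x80;
                        for (i = 0; i < n - 1; i++) {   /* the loop after `n--' */
                                c >>= 1;
                                c |= 0x80;
                        }
                        assert(c == lead[n]);           /* 110xxxxx, 1110xxxx, 11110xxx */
                }
                return 0;
        }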


> I don't think it is a good idea to use universal character names in strings
> because it is not well defined in the standard, in fact the standard seems
> even contradictory. Quoting from [1]:
> [SNIP]

I think it's a safe assumption that while '\u0153' is a problem, the same
problem shouldn't apply to "\u0153", as that would make \u escapes in
multibyte encodings effectively useless. More importantly, I'm pretty certain
all existing compilers treat this the same way. Using hex escapes assumes the
compiler's charset is Unicode (which C23 mandates) and the execution locale
is UTF-8, but if scc libc only intends to support UTF-8 then it's potato,
potato, I suppose.
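
Concretely, the two spellings I mean, which should give identical bytes when
the execution charset is UTF-8 (U+0153 is the oe ligature):

        #include <assert.h>
        #include <string.h>

        int
        main(void)
        {
                const char *ucn = "\u0153";     /* universal character name */
                const char *hex = "\xC5\x93";   /* same code point, hand-encoded as UTF-8 */

                assert(strcmp(ucn, hex) == 0);
                return 0;
        }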

> I think we can use a typical NELEM macro instead of hardcoding 4 here:
>
> #define NELEM(x) (sizeof(x)/sizeof((x)[0]))
>
> for (i = 0; i < NELEM(mb); i++)
>
> and
>
> for (i = 0; i < NELEM(wc); i++)
>
> because I expect more cases are going to be added in the future ...

My aim here was that `strlen(mb[n]) == n + 1', so that every UTF-8 sequence
length gets tested; the maximum is therefore 4 and shouldn't change in the
foreseeable future. More test cases are good, but I don't think it would make
sense to add them to the mb[] and wc[] arrays.
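
To spell the intent out, the shape of the arrays is roughly this (the values
below are only examples, not necessarily the ones in the patch, and a 32-bit
wchar_t is assumed for the 4-byte entry):

        #include <assert.h>
        #include <string.h>
        #include <wchar.h>

        /* example vectors: mb[n] is an (n+1)-byte UTF-8 sequence for wc[n] */
        static const char *mb[] = {
                "A",                    /* 1 byte,  U+0041 */
                "\xC2\xA2",             /* 2 bytes, U+00A2 cent sign */
                "\xE2\x82\xAC",         /* 3 bytes, U+20AC euro sign */
                "\xF0\x9F\x99\x82",     /* 4 bytes, U+1F642 */
        };
        static const wchar_t wc[] = {
                0x0041, 0x00A2, 0x20AC, 0x1F642,
        };

        int
        main(void)
        {
                size_t n;

                for (n = 0; n < 4; n++) {
                        assert(strlen(mb[n]) == n + 1); /* one entry per encoded length */
                        assert(wc[n] != 0);
                }
                return 0;
        }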

> While I think this works very well to test only mbtowc() and mbtowc(),
> I think it can be a problem once we begin to add more functions to be
> tested, so I would split the tests per function for example:
>
> test_mbtowc();
> test_mbtowc();

Do you mean mbtowc and mbrtowc, or mbtowc and wctomb?


Sorry if there was some irrelevant rambling in this.

Thanks,