Re: [scc-dev] [PATCH 3/3] libc: fix wchar unicode handling

From: Roberto E. Vargas Caballero <k0ga_at_shike2.net>
Date: Thu, 06 Mar 2025 09:15:24 +0100

Hi,

Quoth lhr_at_disroot.org:
> You could reuse the same array for both tests, the wc tests would use `res'
> as input and compare the result with `s'. I'd rather not keep `nbytes',
> since that's repeating the same information twice -- once in the actual
> length of the string, and once in `nbytes' which would have to be manually
> updated. Better to use strlen()

I am adding more and more test cases (and detecting problems along the way),
and it is likely that we will have to use different arrays. Having the length
in the test case allows checks like passing a valid UTF-8 multibyte sequence,
but with a shorter length. Currently, the tests_mbrtowc() function is:

        void
        tests_mbrtowc(void)
        {
                static wchar_t wc;
                static struct mbtest {
                        char *s;
                        int l;
                        int r;
                        wchar_t *pwc;
                        wchar_t wc;
                } tests[] = {
                        {"\0", 2, 0, &wc, 0},
                        {"\x21", 2, 1, &wc, 0x21},
                        {"\xc2\xa1", 3, 2, &wc, 0x00A1},
                        {"\xe2\x80\x94", 4, 3, &wc, 0x2014},
                        {"\xf0\x9f\x92\xa9", 5, 4, &wc, 0x01F4A9},
                        {"\xf0\x9f\x92\xa9", 5, 4, NULL, -1},
                        {"\xf0\x9f\x92\xa9", -1, 4, &wc, 0x01F4A9},

                        {NULL, 4, 0, NULL, -1},
                        {"\xed\xa0\x80", 4, -1, &wc, -1},
                        {"\xed\xb3\xbf", 4, -1, &wc, -1},
                        {"\xed\xb4\x80", 4, 3, &wc, 0xdd00},

                        {"\x80", 2, -1, &wc, -1},
                        {"\xc0\x80", 2, -1, &wc, -1},
                        {"\xf0\x9f\x92\xa9", 3, -2, &wc, -1},
                        {"\xf8\x81\x82\x83\x84\x85", -1, -1, &wc, -1},
                        {"\xfe\x81\x82\x83\x84\x85\x86", 8, -1, &wc, -1},
                };
                struct mbtest *tp;
                int r, i;
                mbstate_t s;
        
                puts("testing mbrtowc1");
                for (tp = tests; tp < &tests[NELEM(tests)]; ++tp) {
                        wc = -1;
                        r = mbrtowc(tp->pwc, tp->s, tp->l, NULL);
                        if (tp->r == -1) {
                                assert(r == -1);
                                assert(errno == EILSEQ);
                        } else {
                                assert(tp->r == r);
                                assert(tp->wc == wc);
                        }
                }
        
                puts("testing mbrtowc2");
                for (tp = tests; tp < &tests[NELEM(tests)]; ++tp) {
                        wc = -1;
                        memset(&s, 0, sizeof(s));
                        r = mbrtowc(tp->pwc, tp->s, tp->l, &s);
                        if (tp->r == -1) {
                                assert(r == -1);
                                assert(errno == EILSEQ);
                        } else {
                                assert(tp->r == r);
                                assert(tp->wc == wc);
                        }
                        assert(mbsinit(&s) != 0);
                }
        }

For example, the test:

                {"\xf0\x9f\x92\xa9", 3, -2, &wc, -1},

is checking a valid sequence but with a short length. For the -2 case the
actual content of the string does not matter, only the length. Also, the test

                {"\xf0\x9f\x92\xa9", -1, 4, &wc, 0x01F4A9},

is checking that we don't read more than we should (as best we can, because
we don't have any way to detect a buffer overrun there).
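
As a rough sketch of how the short-length case can be compared against the
int stored in the table: mbrtowc() returns a size_t, so (size_t)-1 and
(size_t)-2 have to be mapped back explicitly. The names mbresult() and
decode() are made up for illustration and are not part of the test suite.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: map mbrtowc()'s size_t result to a table int. */
static int
mbresult(size_t r)
{
        if (r == (size_t)-1)
                return -1;      /* invalid sequence */
        if (r == (size_t)-2)
                return -2;      /* incomplete, but valid so far */
        return (int)r;          /* number of bytes consumed */
}

/* Hypothetical helper: decode at most n bytes of s from a fresh state. */
static int
decode(wchar_t *pwc, const char *s, size_t n)
{
        mbstate_t st;

        memset(&st, 0, sizeof(st));
        return mbresult(mbrtowc(pwc, s, n, &st));
}
```

In a UTF-8 locale, decode(&wc, "\xf0\x9f\x92\xa9", 3) would be expected to
give -2, which is what the table entry above asks for.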

What other cases do you think would be worth testing?

>
> Also I just thought, now that mbrtowc() returns (size_t)-2, mbtowc() should
> check for that, since according to the standard it cannot return -2 (it must
> return -1 if not passed a complete, valid multibyte sequence).
>

Uhmmmmm, good point. I just checked musl: it does not use mbrtowc() to
implement mbtowc(), and it does not return -2. I will add your modifications
to mbtowc().
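
For reference, a minimal sketch of that check, assuming mbtowc() is built on
top of mbrtowc() and that the encoding has no shift states (true for UTF-8).
The name sketch_mbtowc(), the state reset, and setting errno to EILSEQ for
the incomplete case are assumptions of this sketch, not the actual patch.

```c
#include <errno.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical mbtowc() wrapper that never lets -2 escape to the caller. */
int
sketch_mbtowc(wchar_t *pwc, const char *s, size_t n)
{
        static mbstate_t st;
        size_t r;

        if (s == NULL) {
                /* Reset the internal state; 0 means "no shift states". */
                memset(&st, 0, sizeof(st));
                return 0;
        }

        r = mbrtowc(pwc, s, n, &st);
        if (r == (size_t)-1 || r == (size_t)-2) {
                /* Incomplete input is an error for mbtowc(): drop the
                 * partial state so that the next call starts clean. */
                memset(&st, 0, sizeof(st));
                errno = EILSEQ; /* assumption: same errno for both cases */
                return -1;
        }
        return (int)r;
}
```

With this, a truncated sequence such as the 3-byte prefix tested above would
give -1 instead of -2.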

Regards,


Received on Thu 06 Mar 2025 - 09:15:24 CET
