In the 2nd animated GIF, which shows how bits are placed in the encoded sequence, there is an extra set bit in the last byte. If I'm not mistaken, the last byte should be 0x9F (10011111) instead of 0xBF (10111111).
My apologies if I am mistaken.
The article is very clear, and covers a subject which most of us should know. You've got my 'excellent'.
On the other hand, function UTF8Decode2BytesUnicode() has a small bug: since
(MASK2BYTES & MASK3BYTES) == MASK2BYTES, every 3-byte lead byte also passes the 2-byte test, so 3-byte characters will be decoded as if they were 2 bytes. No biggie, though: just changing the order of the last two tests fixes it.
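To illustrate the point, here is a minimal sketch of the two test orders. The mask values 0xC0 and 0xE0 are an assumption (I have not copied them from the article's source), and SequenceLength() and SequenceLengthBuggy() are hypothetical helpers, not functions from the article.

```cpp
#include <cstdint>

// Assumed mask values; the article's actual definitions may differ.
constexpr std::uint8_t MASK2BYTES = 0xC0; // lead byte pattern 110xxxxx
constexpr std::uint8_t MASK3BYTES = 0xE0; // lead byte pattern 1110xxxx

// Hypothetical helper: test order that hides the 3-byte case.
// Any byte with the top three bits set also has the top two bits set,
// so the first test catches every 3-byte lead byte.
constexpr int SequenceLengthBuggy(std::uint8_t lead) {
    if ((lead & MASK2BYTES) == MASK2BYTES) return 2;
    if ((lead & MASK3BYTES) == MASK3BYTES) return 3; // never reached
    return 1;
}

// Hypothetical helper: same tests, wider mask checked first.
constexpr int SequenceLength(std::uint8_t lead) {
    if ((lead & MASK3BYTES) == MASK3BYTES) return 3;
    if ((lead & MASK2BYTES) == MASK2BYTES) return 2;
    return 1;
}
```

With these definitions, the lead byte 0xE2 gives 2 in the buggy order and 3 in the swapped order. The sketch only mirrors the tests under discussion: it does not validate continuation bytes, and a 4-byte lead byte such as 0xF0 would still be reported as 3.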
Anyway, thank you for taking the time to write and share this.
Posted by cilu
on 09/11/2005 04:34am
Yes, you guys are right. The bug was fixed. It will just need a day or so to be updated on the site. Thank you.
The problem is that 3-byte codes won't decode properly since
(? & 0xC0) == 0xC0 is tested before
(? & 0xE0) == 0xE0.
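If the UTF-8 sequence is, say, E2 80 93 (U+2013, the en dash), the lead byte E2 passes the 0xC0 test first, so the sequence is incorrectly decoded as a 2-byte code.
Just swap the order of the two tests. Also, the code doesn't handle 4-byte UTF-8 sequences, which would be nice to add to make it complete.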
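For reference, here is a rough sketch of a decoder that covers 1- to 4-byte sequences. DecodeOne() is a hypothetical helper written for this comment, not the article's function. It compares each lead byte against a mask one bit wider than its pattern (for example (b & 0xF0) == 0xE0 for three bytes), which makes the tests mutually exclusive, so their order no longer matters.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decodes one UTF-8 sequence starting at s[0].
// Stores the code point in 'out' and returns the number of bytes consumed,
// or 0 if the input is truncated or malformed.
// Assumption: overlong encodings and surrogate code points are NOT rejected.
inline std::size_t DecodeOne(const unsigned char* s, std::size_t n,
                             std::uint32_t& out) {
    if (n == 0) return 0;
    const unsigned char b0 = s[0];
    std::size_t len = 0;
    std::uint32_t cp = 0;

    if ((b0 & 0x80) == 0x00)      { out = b0; return 1; }       // 0xxxxxxx
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }  // 110xxxxx
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }  // 1110xxxx
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }  // 11110xxx
    else return 0; // stray continuation byte or invalid lead byte

    if (n < len) return 0; // sequence cut short

    for (std::size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;  // must be 10xxxxxx
        cp = (cp << 6) | (s[i] & 0x3Fu);      // append six payload bits
    }
    out = cp;
    return len;
}
```

Given the bytes E2 80 93, this returns 3 and stores 0x2013; given F0 9F 98 80, it returns 4 and stores 0x1F600. A production decoder would also need to reject overlong forms, surrogates, and values above U+10FFFF.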