Is the UTF-8 string from Google Closure considered valid in this case?

Question

When running the UTF-8 to byte array tests in the Google Closure library, the string provided is:

\u0000\u007F\u0080\u07FF\u0800\uFFFF

This string is expected to be converted into the following array:

[0x00, 0x7F, 0xC2, 0x80, 0xDF, 0xBF, 0xE0, 0xA0, 0x80, 0xEF, 0xBF, 0xBF]

When I tested other JavaScript and TypeScript implementations of UTF-8 to byte array conversion, some of them reported that the given string is invalid.

The string seems to cover the values that transition from 1 byte to 2-byte to 3-byte values.

The question remains: Is Google's implementation correct or are the other libraries right?

javascript typescript utf-8 noncharacter

Answer 1

Google's implementation is correct.

The string

'\u0000\u007F\u0080\u07FF\u0800\uFFFF'

represents the Unicode code points

U+0000 U+007F U+0080 U+07FF U+0800 U+FFFF

The correct UTF-8 encoding of these code points is the byte sequence

00 7F C2 80 DF BF E0 A0 80 EF BF BF

which matches the array the Closure test expects.
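As a quick cross-check, the encoding can be reproduced with the standard `TextEncoder` API, which is available in browsers and as a global in Node.js. This is only a minimal sketch and is not how the Closure library itself performs the conversion; the `hex` helper is a hypothetical name introduced here for printing.

```javascript
const input = '\u0000\u007F\u0080\u07FF\u0800\uFFFF';
const encoder = new TextEncoder(); // always encodes to UTF-8

// Hypothetical helper: format bytes as two-digit uppercase hex.
const hex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');

// Each character here is a single BMP code unit, so iterating the
// string yields exactly one code point per step.
for (const ch of input) {
  const cp = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
  console.log(`U+${cp} -> ${hex(encoder.encode(ch))}`);
}

const allBytes = hex(encoder.encode(input));
console.log(allBytes); // 00 7F C2 80 DF BF E0 A0 80 EF BF BF
```

The per-code-point output shows the boundaries the test string targets: U+007F is the last 1-byte value, U+0080 to U+07FF span the 2-byte range, and U+0800 to U+FFFF span the 3-byte range.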

Note that U+FFFF is a noncharacter code point, which may be why some libraries reject the string. According to the Unicode standard:

A "noncharacter" is a code point that is permanently reserved in the Unicode Standard for internal purposes

...

In the initial version of Unicode, the code points U+FFFE and U+FFFF were marked as "Not character codes" and termed "NOT A CHARACTER". The term "noncharacter" emerged from these early designations and labels.

Specifically:

Q: Are noncharacters meant for sharing?

A: No. They are exclusively intended for internal use. For instance, they might serve as placeholders within strings or act as targets for specific weightings in a collation tailoring process to simplify support for "alphabetic index" implementations.

Q: Are noncharacters prohibited from being shared?

A: This question has caused controversy because the standard has been somewhat ambiguous about the status of noncharacters. The formal definition long stated that noncharacters "should never be interchanged", and some readers took this to mean "shall not be interchanged", which would make any string containing a noncharacter nonconforming. The ambiguity was deliberate: the interpretation of a noncharacter is strictly internal to the implementation that uses it, so it has no publicly interchangeable semantics. In 2013 the UTC issued Corrigendum #9, which removed the phrase about interchange from the definition, making it clear that noncharacters are not formally prohibited in interchange. The corrigendum was incorporated into Unicode 7.0.

Q: Are noncharacters considered invalid in Unicode strings and UTFs?

A: Absolutely not. The presence of noncharacters does not render a Unicode string malformed in any UTF format. This is evident in the presented table where each noncharacter code point has a valid representation in UTF-32, UTF-16, and UTF-8. Any implementation transferring noncharacter code points between different UTF representations must accurately retain these values. Although designated as "noncharacters" and not intended for open sharing, they do not constitute illegitimate or improper code points that invalidate strings containing them.
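The same point can be checked from the decoding side. The sketch below uses the standard `TextDecoder` API with `fatal: true`, which throws a `TypeError` on ill-formed input instead of substituting U+FFFD. The function name `isWellFormedUtf8` is an assumption made for this illustration and is not part of any library mentioned above.

```javascript
// Hypothetical helper: returns true if the bytes decode as well-formed UTF-8.
function isWellFormedUtf8(bytes) {
  try {
    // fatal: true makes decode() throw on any ill-formed sequence.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

const bytes = new Uint8Array([
  0x00, 0x7F, 0xC2, 0x80, 0xDF, 0xBF, 0xE0, 0xA0, 0x80, 0xEF, 0xBF, 0xBF,
]);

// The noncharacter U+FFFF (EF BF BF) does not make the sequence ill-formed.
console.log(isWellFormedUtf8(bytes)); // true

// For contrast, a truncated 3-byte sequence and an overlong encoding are ill-formed.
console.log(isWellFormedUtf8(new Uint8Array([0xEF, 0xBF]))); // false
console.log(isWellFormedUtf8(new Uint8Array([0xC0, 0x80]))); // false

// Round trip: decoding gives back the original test string.
const roundTrip = new TextDecoder('utf-8', { fatal: true }).decode(bytes);
console.log(roundTrip === '\u0000\u007F\u0080\u07FF\u0800\uFFFF'); // true
```

A library that rejects this input is therefore applying a stricter policy than UTF-8 well-formedness, for example refusing noncharacters on purpose.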
