Alright, I keep getting questions about what the new 16-bit Char type (and its associated UnicodeString) really mean. Let’s take a couple of minutes off from the “Preparing for Delphi 2009” series and discuss exactly how this works.
The core issue here is that a character according to the Unicode standard isn’t really 16 bits, it’s 32 (OK, actually it’s 21, but that’s not a normal data size, so we use 32). Since you obviously can’t fit 32 bits (or even 21 bits) into a 16-bit data type, what is going on with this 16-bit Char type?
In the Unicode world, there are two possibilities here. The first is called UCS-2, which basically means that only a subset of the entire character set is allowed. That is, you can only use the characters that will fit into 16 bits.
The other possibility is UTF-16, which uses a 16-bit data type, and has a mechanism for splitting a larger character into two of these 16-bit chunks. When this happens, each chunk is called a Surrogate, and the two of them together are called a Surrogate Pair.
So, let’s settle this once and for all: Delphi 2009 uses UTF-16. It allows the entire Unicode character set, using surrogate pairs for the characters that take more than 16 bits to represent.
Before I get to some specifics, there are a couple of terms we should make sure we have straight:
In Unicode, a character is called a Code Point. The letter ‘A’, an exclamation mark, a space, a line feed, and any other “thing” that is represented as part of the Unicode “character set” (called the Code Space) is a code point.
On the other hand, that “chunk” of data – 16-bits in UTF-16 – is called a Code Unit. If you have a code point that won’t fit into 16 bits, you’ll need to combine two code units to form a surrogate pair.
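To make the splitting concrete, here is a minimal sketch of the arithmetic UTF-16 uses to turn one code point above $FFFF into two code units. The helper SplitCodePoint and its parameter names are my own invention for illustration, not something from the RTL, and it assumes the input is in the range $10000..$10FFFF.

```delphi
program SurrogateMath;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper (not part of the RTL): splits a code point in the
// range $10000..$10FFFF into its two UTF-16 code units.
procedure SplitCodePoint(CodePoint: Cardinal; out HighUnit, LowUnit: Word);
var
  Offset: Cardinal;
begin
  // Subtracting $10000 leaves a 20-bit value.
  Offset := CodePoint - $10000;
  // The top 10 bits go into the high (leading) surrogate...
  HighUnit := Word($D800 + (Offset shr 10));
  // ...and the bottom 10 bits go into the low (trailing) surrogate.
  LowUnit := Word($DC00 + (Offset and $3FF));
end;

var
  HighUnit, LowUnit: Word;
begin
  SplitCodePoint($20086, HighUnit, LowUnit);
  Writeln(IntToHex(HighUnit, 4)); // D840
  Writeln(IntToHex(LowUnit, 4));  // DC86
end.
```

The ranges $D800..$DBFF and $DC00..$DFFF are reserved for surrogates, which is why a lone code unit can always be identified as "half of a pair" just by looking at its value.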
In Delphi terms, the new 16-bit Char data type represents a code unit, not a code point. So, when I wrote last week that Length(myString) returns the number of printable characters in myString, that could have been a little misleading. Length(myString) returns the number of code units in the string. If some of those code units are surrogates, the number of code points you see on screen may not be the same as the number of code units in the string.
Now for a couple of frequently asked (and anticipated) questions:
Q: Is it really safe to assume that the size of a string is Length(myString) * 2?
A: If by “size,” you mean “size in bytes,” yes. That will give you the size of the string in bytes, because Length(myString) tells you the number of code units, which are always 2 bytes each. I would, however, suggest that you not use the magic number 2, but rather SizeOf(Char), because (1) it’s more readable, and (2) it’s not inconceivable that the size of the Char data type could change again in the future. Use SizeOf(Char) and you’re already ready for it.
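As a quick illustration, here is a small sketch of the SizeOf(Char) idiom, including the common case of writing a string's raw bytes to a stream. The program name and the sample text are just placeholders I picked for the example.

```delphi
program StringByteSize;

{$APPTYPE CONSOLE}

uses
  Classes;

var
  S: string;
  Stream: TMemoryStream;
begin
  S := 'Delphi';

  // Number of code units, then number of bytes those code units occupy.
  Writeln(Length(S));
  Writeln(Length(S) * SizeOf(Char));

  Stream := TMemoryStream.Create;
  try
    // Write the raw character data; the byte count comes from SizeOf(Char),
    // not from a hard-coded 2. S must be non-empty for S[1] to be valid.
    Stream.WriteBuffer(S[1], Length(S) * SizeOf(Char));
    Writeln(Stream.Size);
  finally
    Stream.Free;
  end;
end.
```

In Delphi 2009 the last two numbers printed should both be twice the first one; compiled with an older, pre-Unicode compiler, the same source would report the same value for all three.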
If by “size” you mean something else, keep reading…
Q: How do I get these characters into my strings?
A: The easiest way is just to include them in your source code. Since the compiler and code editor are fully unicode enabled, you can just put the character into your source code. However, that isn’t always easy if you’re using a keyboard that isn’t really designed for the characters you’re using, and isn’t always the most readable solution either. There is another way:
Delphi 2009 includes a new unit in the RTL called Character.pas. It has a bunch of utility functions (and a utility class) to help with these conversions. Let’s say you want a string that has the code point $20086 in it. You could do the math to figure out the surrogate pair and do S := Chr($D840) + Chr($DC86); or you could use the ConvertFromUtf32 function from Character.pas: S := ConvertFromUtf32($20086);
Both will give you the same result, but ConvertFromUtf32 is certainly easier to use.
Note that if you do ShowMessage(S), you’ll see only one character on the screen, but that Length(S) will return 2, since there are two code units used to represent the one character.
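Here is a small console sketch that puts the two approaches side by side. I'm assuming the Character unit's standalone IsHighSurrogate and IsLowSurrogate functions here to peek at the individual code units; the variable names are just for the example.

```delphi
program SurrogateDemo;

{$APPTYPE CONSOLE}

uses
  Character;

var
  Manual, Converted: string;
begin
  // Build the surrogate pair by hand...
  Manual := Chr($D840) + Chr($DC86);
  // ...or let the RTL work it out from the code point.
  Converted := ConvertFromUtf32($20086);

  // Both strings hold the same two code units.
  Writeln(Manual = Converted);

  // One code point, but Length counts code units.
  Writeln(Length(Converted));

  // The first code unit is the high surrogate, the second the low one.
  Writeln(IsHighSurrogate(Converted[1]));
  Writeln(IsLowSurrogate(Converted[2]));
end.
```

Indexing into the string (Converted[1], Converted[2]) gives you code units, so any code that walks a string character by character needs to be aware that it may land on half of a pair.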
Q: How do I determine the number of code points in a string with surrogate pairs, instead of the number of code units?
A: SysUtils.pas has some helper functions for things like this. In this case, we could do I := ElementToCharLen(myString, Length(myString)); and I would contain the number of code points in the string.
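If you want to see what such a count involves, here is a rough sketch that counts code points by hand and also calls ElementToCharLen for comparison. CodePointCount is a hypothetical helper of my own, not an RTL function, and it assumes the string is well formed (every low surrogate is preceded by a high surrogate).

```delphi
program CountCodePoints;

{$APPTYPE CONSOLE}

uses
  SysUtils, Character;

// Hypothetical helper: counts code points in a well-formed UTF-16 string
// by counting every code unit that is NOT the second half of a pair.
function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if not IsLowSurrogate(S[I]) then
      Inc(Result);
end;

var
  S: string;
begin
  // 'A', one code point that needs a surrogate pair, then 'B'.
  S := 'A' + ConvertFromUtf32($20086) + 'B';

  Writeln(Length(S));          // 4 code units
  Writeln(CodePointCount(S));  // 3 code points

  // The SysUtils helper mentioned above, for comparison.
  Writeln(ElementToCharLen(S, Length(S)));
end.
```

Note that this counts code points, not what a user would perceive as characters: combining marks are separate code points, so an accented letter built from a base letter plus a combining accent still counts as two.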
Hopefully this will answer some of the questions that have come up. If there are things that still aren’t clear, feel free to comment, and I’ll do my best to clear things up.