Delphi 2009 and Unicode
Posted on September 8th, 2008 in Castalia, CodeGear, Delphi
Alright, I keep getting questions about what the new 16-bit Char type (and its associated UnicodeString) really means. Let’s take a couple of minutes off from the “Preparing for Delphi 2009” series and discuss exactly how this works.
The core issue here is that a character according to the Unicode standard isn’t 16 bits, it’s 32 (OK, actually it’s 21, since code points run from 0 to $10FFFF, but that’s not a normal data size, so we use 32). Since you obviously can’t fit 32 bits (or even 21 bits) into a 16-bit data type, what is going on with this 16-bit Char type?
In the Unicode world, there are two possibilities here. The first is called UCS-2, which basically means that only a subset of the entire character set is allowed. That is, you can only use the characters that will fit into 16 bits.
The other possibility is UTF-16, which uses a 16-bit data type and has a mechanism for splitting a larger character into two of these 16-bit chunks. When this happens, each chunk is called a Surrogate, and the two of them together are called a Surrogate Pair.
So, let’s settle this once and for all: Delphi 2009 uses UTF-16. It allows the entire Unicode character set, using surrogate pairs for the characters that take more than 16 bits to represent.
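To make the splitting mechanism concrete, here is a minimal sketch of the standard UTF-16 arithmetic for a code point above $FFFF. SplitCodePoint and its parameter names are hypothetical helpers written for this illustration; they are not part of the Delphi RTL.

```pascal
program SurrogateMath;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper (not an RTL routine): splits a code point in the
// range $10000..$10FFFF into its two UTF-16 code units.
procedure SplitCodePoint(CodePoint: Cardinal; out HighUnit, LowUnit: Word);
var
  Offset: Cardinal;
begin
  Offset := CodePoint - $10000;                 // leaves a 20-bit value
  HighUnit := Word($D800 + (Offset shr 10));    // top 10 bits
  LowUnit := Word($DC00 + (Offset and $3FF));   // bottom 10 bits
end;

var
  H, L: Word;
begin
  SplitCodePoint($20086, H, L);
  Writeln(IntToHex(H, 4), ' ', IntToHex(L, 4)); // D840 DC86
end.
```

The high surrogate always lands in $D800..$DBFF and the low surrogate in $DC00..$DFFF, which is how code can tell the two halves apart from ordinary characters.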
Before I get to some specifics, there are a couple of terms we should make sure we have straight:
In Unicode, a character is called a Code Point. The letter ‘A’, an exclamation mark, a space, a line feed, and any other “thing” that is represented as part of the Unicode “character set” (called the Code Space) is a code point.
On the other hand, that “chunk” of data (16 bits in UTF-16) is called a Code Unit. If you have a code point that won’t fit into 16 bits, you’ll need to combine two code units to form a surrogate pair.
In Delphi terms, the new 16-bit Char data type represents a code unit, not a code point. So, when I wrote last week that Length(myString) returns the number of printable characters in myString, that could have been a little misleading. Length(myString) returns the number of code units in the string. If some of those code units are surrogates, the number of code points you see on screen may not be the same as the number of code units in the string.
Now for a couple of frequently asked (and anticipated) questions:
Q: Is it really safe to assume that the size of a string is Length(myString) * 2?
A: If by “size,” you mean “size in bytes,” yes. That will give you the size of the string in bytes, because Length(myString) tells you the number of code units, which are always 2 bytes. I would, however, suggest that you not use the magic number 2, but rather SizeOf(Char), because 1. it’s more readable, and 2. it’s not inconceivable that the size of the Char data type could change again in the future. Use SizeOf(Char) and you’re already ready for it.
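As a small sketch of that advice, the byte size can be computed without hard-coding the character width. The variable names are illustrative only, and the value in the comment assumes Delphi 2009, where SizeOf(Char) is 2.

```pascal
program ByteSize;
{$APPTYPE CONSOLE}
var
  S: string;
  SizeInBytes: Integer;
begin
  S := 'Delphi';
  // Code units times the size of one code unit; no magic number 2.
  SizeInBytes := Length(S) * SizeOf(Char);
  Writeln(SizeInBytes); // 12 when SizeOf(Char) = 2
end.
```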
If by “size” you mean something else, keep reading…
Q: How do I get these characters into my strings?
A: The easiest way is just to include them in your source code. Since the compiler and code editor are fully Unicode-enabled, you can just put the character into your source code. However, that isn’t always easy if you’re using a keyboard that isn’t really designed for the characters you’re using, and it isn’t always the most readable solution either. There is another way:
Delphi 2009 includes a new unit in the RTL called Character.pas. It has a bunch of utility functions (and a utility class) to help with these conversions. Let’s say you want a string that has the code point $20086 in it. You could do the math to figure out the surrogate pair and do S := Chr($D840) + Chr($DC86); or you could use the ConvertFromUtf32 function from Character.pas: S := ConvertFromUtf32($20086);
Both will give you the same result, but ConvertFromUtf32 is certainly easier to use.
Note that if you do ShowMessage(S), you’ll see only one character on the screen, but that Length(S) will return 2, since there are two code units used to represent the one character.
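The two approaches can be compared side by side in a small console program. This is a sketch that assumes Delphi 2009 with the Character unit on the uses clause; it prints lengths rather than the character itself, since a console may not be able to display it.

```pascal
program BuildSurrogates;
{$APPTYPE CONSOLE}
uses
  SysUtils, Character;
var
  A, B: string;
begin
  A := Chr($D840) + Chr($DC86);    // surrogate pair worked out by hand
  B := ConvertFromUtf32($20086);   // the same code point via Character.pas
  Writeln(A = B);                  // TRUE
  Writeln(Length(B));              // 2: one code point, two code units
end.
```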
Q: How do I determine the number of code points in a string with surrogate pairs, instead of the number of code units?
A: SysUtils.pas has some helper functions for things like this. In this case, we could do I := ElementToCharLen(myString, Length(myString)); and I would contain the number of code points in the string.
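For comparison, the count can also be done by hand: in well-formed UTF-16, every code point contributes exactly one code unit that is not a low surrogate. CodePointCount below is a hypothetical helper written for this sketch, not an RTL function, and it assumes the string contains no unpaired surrogates.

```pascal
program CountCodePoints;
{$APPTYPE CONSOLE}
uses
  SysUtils, Character;

// Hypothetical helper: counts code points by skipping the second half
// of each surrogate pair. Assumes S is well-formed UTF-16.
function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if not ((S[I] >= #$DC00) and (S[I] <= #$DFFF)) then
      Inc(Result);
end;

var
  S: string;
begin
  S := 'A' + ConvertFromUtf32($20086) + 'B';
  Writeln(Length(S));                       // 4 code units
  Writeln(CodePointCount(S));               // 3 code points
  Writeln(ElementToCharLen(S, Length(S)));  // the RTL's own count, for comparison
end.
```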
Hopefully this will answer some of the questions that have come up. If there are things that still aren’t clear, feel free to comment, and I’ll do my best to clear things up.
8 Responses
Hello,
This is quite confusing for string processing functions like pos, copy etc. Do the arguments/return values from these functions reflect the code point, or code unit? I would have expected Length(unicodeString), pos, copy etc to refer to the nth code point but it seems not. If pos works the same way then do I have to do
thePos := Pos('sometext', myString); // returns the code unit
i := ElementToCharLen(myString, thePos);
myChar := myString[i];
Pos(), Copy(), etc. work with the code unit, not the code point. These utility functions simply see the string as (sort of) an array of Char, and don’t make any distinction between surrogates and displayable characters. Your Copy() and Pos() code isn’t going to have to change unless it assumes that SizeOf(Char) = 1.
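A short sketch of what that means in practice, assuming Delphi 2009 and the Character unit: the index returned by Pos() and the arguments passed to Copy() all count code units, so a surrogate pair ahead of the match shifts the index by two.

```pascal
program PosAndCopy;
{$APPTYPE CONSOLE}
uses
  SysUtils, Character;
var
  S: string;
begin
  S := ConvertFromUtf32($20086) + 'abc';
  // The surrogate pair occupies indexes 1 and 2, so 'abc' starts at 3
  // even though it is the second character a reader would see.
  Writeln(Pos('abc', S));   // 3
  Writeln(Copy(S, 3, 3));   // abc
end.
```

As long as the same code-unit index is fed from Pos() into Copy(), the two stay consistent with each other.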
Hi Jacob,
Again, I am still confused. If Pos(), Copy() etc work with the code unit, but string[x] returns the xth code point, there is a mismatch. Does the nth code unit as returned by Pos always refer to the nth code point when getting string[n]?
Also, if I want to copy 3 code points starting at codepoint 4, but you are saying Copy() works in code units, then
Copy(mystring,4,3*sizeof(char)) may not work if one of my chars is 3 bytes for whatever reason… Am I missing something here?
David,
string[X] does not return the code point, but the code unit. The Delphi string type is oblivious as to whether a particular Char is a surrogate or not. It’s just a Char: a Word, a 16-bit unsigned integer.
When indexing a string, or using the utility functions that we’ve used for years, it’s always just working with Chars, not code points. If you want to work with Code Points, there are new utilities to help with that (like the aforementioned ElementToCharLen).
In other words, the string manipulation routines you know and love have always worked with Chars, and they still work with Chars. It’s just that a Char is now 16 bits instead of 8.
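Indexing can be checked the same way. In this sketch (Delphi 2009 assumed), the first two elements of the string are the two halves of the surrogate pair, and the visible second character only shows up at index 3.

```pascal
program IndexUnits;
{$APPTYPE CONSOLE}
uses
  SysUtils, Character;
var
  S: string;
begin
  S := ConvertFromUtf32($20086) + 'a';
  Writeln(IntToHex(Ord(S[1]), 4)); // D840, the high surrogate
  Writeln(IntToHex(Ord(S[2]), 4)); // DC86, the low surrogate
  Writeln(S[3]);                   // a
end.
```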
Just looking at it quickly, here’s how I would tackle your example (copying 3 code points starting at code point 4):
firstUnit := CharToElementIndex(myString, 4);
tempStr := Copy(myString, firstUnit, Length(myString));
numUnits := CharToElementLen(tempStr, 3);
Result := Copy(tempStr, 1, numUnits);
There may very well be an easier way to do it, but that’s what first came to mind… I fully expect someone to come along and point out some better utility function.
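For readers who want to see the mechanics without relying on the RTL helpers, here is a hand-rolled sketch of the same operation. UnitsAt and CopyCodePoints are hypothetical names invented for this example; the code steps through the string one code point at a time, treating a high surrogate followed by a low surrogate as a single step of two code units.

```pascal
program CopyByCodePoint;
{$APPTYPE CONSOLE}
uses
  SysUtils, Character;

// Hypothetical helper: number of code units (1 or 2) used by the
// code point that starts at S[Index].
function UnitsAt(const S: string; Index: Integer): Integer;
begin
  if (Index < Length(S)) and
     (S[Index] >= #$D800) and (S[Index] <= #$DBFF) and
     (S[Index + 1] >= #$DC00) and (S[Index + 1] <= #$DFFF) then
    Result := 2
  else
    Result := 1;
end;

// Hypothetical helper: copies Count code points starting at code point
// number First (1-based), translating both into code-unit positions.
function CopyCodePoints(const S: string; First, Count: Integer): string;
var
  StartUnit, EndUnit, N: Integer;
begin
  StartUnit := 1;
  N := 1;
  while (N < First) and (StartUnit <= Length(S)) do
  begin
    Inc(StartUnit, UnitsAt(S, StartUnit));
    Inc(N);
  end;
  EndUnit := StartUnit;
  N := 0;
  while (N < Count) and (EndUnit <= Length(S)) do
  begin
    Inc(EndUnit, UnitsAt(S, EndUnit));
    Inc(N);
  end;
  Result := Copy(S, StartUnit, EndUnit - StartUnit);
end;

var
  S, Part: string;
begin
  S := 'abc' + ConvertFromUtf32($20086) + 'de'; // 6 code points, 7 code units
  Part := CopyCodePoints(S, 4, 3);
  Writeln(Length(Part)); // 4: the surrogate pair plus 'd' and 'e'
end.
```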
So again, a Char is a code unit, and all the old routines work with Chars. Under most circumstances, this will just work. If you really need to work with code points that will require surrogates, and you have to parse them or otherwise manipulate them, there are new routines that you can use.
Jacob, thanks for the clarification.
As I understand this, we need to learn new techniques for manipulating strings: blindly splitting, truncating, etc. could “invalidate” a string if the cut hits the middle of a surrogate pair. (And “invalid” strings can lead to strange problems. I experienced some when I tried to manipulate UTF-8 in a similar way, as an array of one-byte chars: a cut in the wrong place led to a completely unreadable display in most browsers when non-ASCII text was shown.)
Or, as another approach, we could try to ensure that surrogate pairs do not appear in our strings, by filtering the user input or by some other means.
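One way to guard against cutting through the middle of a surrogate pair is to check the last code unit before truncating. SafeTruncate below is a hypothetical helper sketched for this discussion, not an RTL routine; it only protects the cut point and does not validate the rest of the string.

```pascal
program TruncateSafely;
{$APPTYPE CONSOLE}
uses
  SysUtils, Character;

// Hypothetical helper: truncates S to at most MaxUnits code units, backing
// up by one if the cut would leave a lone high surrogate at the end.
function SafeTruncate(const S: string; MaxUnits: Integer): string;
begin
  if (MaxUnits > 0) and (MaxUnits < Length(S)) and
     (S[MaxUnits] >= #$D800) and (S[MaxUnits] <= #$DBFF) then
    Dec(MaxUnits);
  Result := Copy(S, 1, MaxUnits);
end;

var
  S: string;
begin
  S := 'ab' + ConvertFromUtf32($20086) + 'c';  // 5 code units
  Writeln(Length(SafeTruncate(S, 3)));         // 2: the pair is dropped whole
  Writeln(Length(SafeTruncate(S, 4)));         // 4: the pair fits completely
end.
```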
If someone wants to work with code points (not code units), I think a more convenient way is to convert the string to the UCS4String type.
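A rough sketch of that approach follows. It assumes that UnicodeStringToUCS4String in the System unit combines surrogate pairs into single elements and that the resulting array ends with a terminating zero; both assumptions are worth verifying against your RTL version before relying on the numbers it prints.

```pascal
program ViaUCS4;
{$APPTYPE CONSOLE}
uses
  SysUtils, Character;
var
  S: string;
  U: UCS4String;
  I: Integer;
begin
  S := 'A' + ConvertFromUtf32($20086);
  U := UnicodeStringToUCS4String(S);
  // Assumption: the last element is a terminating zero, so it is
  // left out of both the count and the loop.
  Writeln(Length(U) - 1);
  for I := 0 to Length(U) - 2 do
    Writeln(IntToHex(Integer(U[I]), 5)); // one line per 32-bit element
end.
```

With 32-bit elements, indexing and counting work per code point, at the cost of converting back with UCS4StringToUnicodeString before handing the text to routines that expect a string.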
Is there a way to determine whether a UnicodeString contains Unicode, UTF8, or Ansi?
xor in Delphi 2009 is giving different output than in Delphi 7.
In Delphi 2009, #$ is getting added to the value.
Is there any solution to this?
We are using the following:
s, ss: String;
ss := '“”•–—˜™š›“”•–—˜™š›“”•–—˜™š›“”•–—˜™š›“”•–—˜™š›“”•–—˜™š›“”•–—˜™š›“”•–—˜™š›';
s := '|u=develop|e=sw01|r=0|m=0';
for i1 := 1 to Length(s) do
  s[i1] := Chr(Ord(s[i1]) xor Ord(ss[i1]));
Delphi 7 is returning: 'ïá¨òòîüöôãèå«äï©«çá©¥êú¥©'
Delphi 2009 is returning: #$2060#$2068'?7'#$2071'?'#$2147'c'#$2055#$206C#$2061#$2052#$202E#$2067'?LO?'#$206E'†?'#$206F'??L'