Bug 293852 - Unicode Supplimental Language Plane -- Java Surrogate Pairs -- Causing Problems in Editors
Status: CLOSED DUPLICATE of bug 65899
Alias: None
Product: Platform
Classification: Eclipse Project
Component: SWT
Version: 3.6
Hardware: Other Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Platform-SWT-Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-10-31 12:30 EDT by Phil Stone CLA
Modified: 2009-11-02 09:56 EST (History)
CC List: 3 users

See Also:


Attachments
Screen shot of Eclipse with Phoenician displayed. (235.24 KB, image/png)
2009-10-31 13:34 EDT, Phil Stone CLA

Description Phil Stone CLA 2009-10-31 12:30:50 EDT
User-Agent:       Mozilla/5.0 (compatible; Konqueror/4.3; Linux) KHTML/4.3.2 (like Gecko)
Build Identifier: M20090917-0800

In 2004 the Unicode Consortium added support for Phoenician in the Supplementary Multilingual Plane.  The block of characters is 32 long and begins at U+10900.

Java stores code points above U+FFFF, such as these, as surrogate pairs: two char values per character.  The assumption that one printed character equals one string character therefore no longer holds.  The breakage shows up in the Eclipse editor, where highlighting and backspacing do not follow the true characters.  I suspect this is because Eclipse ignores the surrogate pairs found in Java strings.
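
To make the mismatch concrete, here is a minimal Java sketch.  The class and method names are made up for illustration and nothing in it is Eclipse or SWT code; it only compares what String.length() reports with the number of code points for the first Phoenician letter.

```java
public class SurrogateDemo {
    /** UTF-16 char units, which is what String.length() reports. */
    static int charUnits(String s) {
        return s.length();
    }

    /** Unicode code points, which is closer to what the user sees on screen. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Build U+10900 from its code point so the source file stays ASCII.
        String alf = new String(Character.toChars(0x10900));

        System.out.println("char units:  " + charUnits(alf));
        System.out.println("code points: " + codePoints(alf));
        // The two halves of the surrogate pair, printed as hex.
        System.out.printf("units: %04X %04X%n",
                (int) alf.charAt(0), (int) alf.charAt(1));
    }
}
```

Any code that advances one char at a time, rather than one code point at a time, can end up between the two halves of the pair.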

Note the interest in the Phoenician code block.  Nearly all phonetic alphabets known in the world today derive directly from Phoenician.  It is the ancestor alphabet to all known alphabets, including Latin, Greek, Cyrillic, Hebrew, Aramaic, Arabic, all Indic languages, and so on.  It is a very important Unicode language block.  Anyone interested in explaining the evolution of any known phonetic alphabet must start with Phoenician.  Note that it was in use from about 1500 BC through about 300 AD, and across that time the letter forms changed.  This change in letter form is handled through the use of fonts.  Many languages, notably Hebrew and Aramaic, were at times written using Phoenician characters instead of Hebrew or Aramaic characters during this period.  Roughly 10 percent of the Dead Sea Scrolls are written in this alphabet.

It could be argued that this block is so important that it should not have been assigned to a supplementary plane, but what was done was done.

There are other code blocks in the supplementary planes of Unicode, and fixing support for this block will also fix support for the others.

Reproducible: Always

Steps to Reproduce:
1. Included here is a Phoenician keyboard extension for US keyboards on
Linux systems.  Tested on Kubuntu 9.10, though any Linux distro should work.
Take the text below the --- CUT --- line in this report and save it as us.txt.  Append it to the US keyboard file on your system with the following command (the append itself must run as root, hence the sh -c wrapper)
sudo sh -c 'cat us.txt >> /usr/share/X11/xkb/symbols/us'
For this test case you cannot have any other national keyboards configured,
only US English (because we're overloading the RIGHT-CTRL key for switching).  Then reboot your machine.  The RIGHT-CTRL key will now
switch to/from Phoenician.  Test this at any command line by using RIGHT-CTRL as a keyboard shift, then try typing any lower case A-Z keys.  (All other keys remain the same.)

Note that Kubuntu 9.10, unlike earlier versions, includes support for the Phoenician code block in the system fonts, so these characters display essentially everywhere.  At the command line the system will always return "command not found", since no commands use Phoenician in their spellings.  So this is safe to play with even at a shell prompt.

Note that the keyboard layout included here follows the latinization of the Phoenician letters, not later transliterations.  Each latinized key label on the US English keyboard produces the original Phoenician character that was used before its letter form was reshaped or latinized into the form recognized today.  There are 4 duplications because Phoenician has 22 letters instead of 26: E/F, U/V, I/J and C/G each produce the same letter.  Note also that the unshifted space bar produces the Phoenician "taag", an inter-word dot that was drawn in the days before spaces separated words.

Once this keyboard is installed you can fire up Eclipse.  Open any editor and start typing, say, a quoted string.  Press RIGHT-CTRL to shift to this Phoenician code block.  Note how strings like "𐤀𐤁𐤂" do not color highlight correctly.

--- CUT ---
// The term "paleo" as used here means "old" and is the informal name
// for the Unicode block starting at U+10900.
partial alphanumeric_keys modifier_keys
xkb_symbols "paleo" {

// This is an extension of the us(basic) keyboard, so start there.

    include "us(basic)"

// Name the 2 groups used here.  This is not seen by people, must be unique.
// A "group" is an entire shifted keyboard.  
// In this case Group1 comes from us(basic)
// and Group 2 is our new paleo set.

    name[Group1]= "USA - Paleo";
    name[Group2]= "Paleo";

// The following lists off each key on the keyboard in <> and then says what
// that key should send off to the rest of the system when it is pressed
// The first set of [] is empty, so use the us(basic) standard whatever that
// currently may be.  The second [...] in each key
// is for Group2, first regular then shifted.  
// Note 10912 is hex Unicode for 𐤒, the paleo Q

    key <AD01> { [],[ 0x1010912, Q ] };
    key <AD02> { [],[ 0x1010914, W ] };
    key <AD03> { [],[ 0x1010904, E ] };
    key <AD04> { [],[ 0x1010913, R ] };
    key <AD05> { [],[ 0x1010915, T ] };
    key <AD06> { [],[ 0x1010909, Y ] };
    key <AD07> { [],[ 0x1010905, U ] };
    key <AD08> { [],[ 0x1010906, I ] };
    key <AD09> { [],[ 0x101090F, O ] };
    key <AD10> { [],[ 0x1010910, P ] };
//    key <AD11> { [],[ bracketleft,  braceleft       ]       };
//    key <AD12> { [],[ bracketright, braceright      ]       };

    key <AC01> { [],[ 0x1010900, A ] };
    key <AC02> { [],[ 0x101090E, S ] };
    key <AC03> { [],[ 0x1010903, D ] };
    key <AC04> { [],[ 0x1010904, F ] };
    key <AC05> { [],[ 0x1010902, G ] };
    key <AC06> { [],[ 0x1010907, H ] };
    key <AC07> { [],[ 0x1010906, J ] };
    key <AC08> { [],[ 0x101090A, K ] };
    key <AC09> { [],[ 0x101090B, L ] };
//    key <AC10> { [], [ semicolon,    colon           ]       };
//    key <AC11> { [], [ apostrophe,   quotedbl        ]       };

    key <AB01> { [],[ 0x1010911, Z ] };
    key <AB02> { [],[ 0x1010908, X ] };
    key <AB03> { [],[ 0x1010902, C ] };
    key <AB04> { [],[ 0x1010905, V ] };
    key <AB05> { [],[ 0x1010901, B ] };
    key <AB06> { [],[ 0x101090D, N ] };
    key <AB07> { [],[ 0x101090C, M ] };

// Paleo language uses a “𐤟” between all words, so cause the space bar to 
// generate that letter.  If a shift-space is typed, while in paleo, 
// then generate a regular space.
    key <SPCE> { [],[ 0x101091F, space ]};

// The following causes RIGHT CONTROL to toggle this entire group
    include "group(rctrl_toggle)"
    
// The following causes the MENU key to be a COMPOSE key.
    include "compose(menu)"
};
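
The numbers in the key definitions above are X11 Unicode keysyms, which by convention are 0x01000000 plus the code point.  A minimal Java sketch of that arithmetic follows; the class, constant and method names are hypothetical and are not part of XKB or of any Eclipse API.

```java
public class PaleoKeysym {
    // X11 convention: a Unicode keysym is 0x01000000 plus the code point.
    static final int UNICODE_KEYSYM_BASE = 0x01000000;

    /** Keysym value to write in an xkb_symbols entry for a code point. */
    static int toKeysym(int codePoint) {
        return UNICODE_KEYSYM_BASE + codePoint;
    }

    /** Code point that a Unicode keysym stands for. */
    static int toCodePoint(int keysym) {
        return keysym - UNICODE_KEYSYM_BASE;
    }

    public static void main(String[] args) {
        int keysym = 0x1010912;               // the paleo Q from the layout above
        int codePoint = toCodePoint(keysym);  // the code point behind that key
        // charCount reports how many Java chars the code point needs.
        System.out.printf("keysym 0x%X -> U+%X, %d Java char(s)%n",
                keysym, codePoint, Character.charCount(codePoint));
    }
}
```

Every letter in this layout lies above U+FFFF, so each keypress delivers a character that Java has to hold as two chars.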
Comment 1 Phil Stone CLA 2009-10-31 13:34:46 EDT
Created attachment 150999
Screen shot of Eclipse with Phoenician displayed.

I've attached a screen capture of Eclipse with Phoenician typed into a list of quoted strings in a Python language file.  In this case the section of the file shows the names of the Bible's books in Phoenician, with a comment following each giving its English language name.

Note the way the comments on the same lines are oddly colored.  The syntax highlighting is lost with these letters.

The editor is similarly lost when trying to edit these lines.
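
For illustration only, here is a sketch in Java of what surrogate-aware caret handling could look like.  It is an assumption about the kind of logic involved, not the editor's or StyledText's actual code, and the class and method names are hypothetical.

```java
public class CaretDemo {
    /** Offset the caret should land on after moving one character left. */
    static int previousCaret(String text, int caret) {
        if (caret <= 0) {
            return 0;
        }
        // Steps back over a whole surrogate pair instead of half of one.
        return text.offsetByCodePoints(caret, -1);
    }

    /** Text after pressing backspace with the caret at the given offset. */
    static String backspace(String text, int caret) {
        int start = previousCaret(text, caret);
        return text.substring(0, start) + text.substring(caret);
    }

    public static void main(String[] args) {
        // A quoted string holding one Phoenician letter: 4 chars, 3 code points.
        String line = "\"" + new String(Character.toChars(0x10900)) + "\"";
        int caret = 3;  // just before the closing quote
        String edited = backspace(line, caret);
        System.out.println(edited.length());
    }
}
```

Deleting a single char at the same position would instead leave an unpaired high surrogate behind.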

Similar problems exist in HTML files and, I presume, in the other language editors.
Comment 2 Dani Megert CLA 2009-11-02 03:22:16 EST
Looks like a problem in StyledText.
Comment 3 Felipe Heidrich CLA 2009-11-02 09:56:33 EST
See Bug 69814 for the same problem on Windows.

We don't support surrogate pairs in SWT; when you use them, some things work and others don't.

*** This bug has been marked as a duplicate of bug 65899 ***