
Bug 65899

Summary: [StyledText] StyledText Supplementary/Surrogate character navigation
Product: [Eclipse Project] Platform Reporter: Adam Warner <adam>
Component: SWT Assignee: Felipe Heidrich <eclipse.felipe>
Status: RESOLVED FIXED QA Contact: Felipe Heidrich <eclipse.felipe>
Severity: normal    
Priority: P3 CC: daniel_megert, david_williams, eclipse, harendra, heath.borders, kennoji, pwebster, snorthov
Version: 3.0   
Target Milestone: 3.7 M7   
Hardware: PC   
OS: Linux-GTK   
Whiteboard:
Attachments:
Running on GTK-2.4.0 (flags: none)
Running on GTK 2.2.1 (flags: none)
Patch for GTK (flags: none)

Description Adam Warner CLA 2004-06-06 04:49:23 EDT
Eclipse 3.0RC1
Sun JVM 1.4.2_03

Within a StyledText box paste in a Unicode supplementary character, for example
MATHEMATICAL_BOLD_CAPITAL_A with hexadecimal code point 1D400h:
<http://www.unicode.org/charts/PDF/U1D400.pdf>
You have to construct the character. I just used CLISP from the GNOME Terminal: 
(princ #\mathematical_bold_capital_a)

I selected the character and pasted it into a StyledText box. In GTK, when the
glyph is not available, this appropriately displays as a rectangle containing
the hexadecimal value:
[01D]
[400]

Now push the left arrow to attempt to move the caret before the character. Type
a character and the text disappears and nothing more typed is visible. This is
probably because the valid UTF-16 character composed of high and low 16 bit
surrogates has been split, creating an illegal UTF-16 sequence.
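
The suspected failure mode can be reproduced without SWT. What follows is a minimal sketch, assuming Java 5 or later for the `Character` surrogate helpers (the JVM in this report, 1.4.2, does not have them). `SplitPairDemo` and `isWellFormed` are hypothetical names for illustration only and are not part of StyledText:

```java
final class SplitPairDemo {
    /** Returns true if every surrogate in s belongs to a valid high/low pair. */
    static boolean isWellFormed(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed directly by a low surrogate.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low half of the pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high surrogate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder("\uD835\uDC00"); // U+1D400
        System.out.println(isWellFormed(text)); // prints true
        text.insert(1, 'x'); // simulates typing with the caret between the halves
        System.out.println(isWellFormed(text)); // prints false
    }
}
```

Inserting at char index 1 is what a caret that moved left by one `char` instead of one code point would do.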

Inserting text is also affected after a supplementary character. Paste a
supplementary character then type "abc". Press the left arrow and type some
additional letters. The caret doesn't move to before "c" even though letters are
inserted before "c".

D800h to DFFFh are reserved for UTF-16 surrogate pairs and no code points are
assigned to them. This means each surrogate is individually distinguishable and
cannot be mistaken for a single 16 bit code point.
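
Because the two surrogate ranges are disjoint (D800h to DBFFh for the high half, DC00h to DFFFh for the low half), a caret offset can be checked locally without scanning the whole text. A rough sketch using plain range comparisons, so it does not depend on Java 1.5 APIs; `CaretSnap` and its methods are hypothetical helpers, not existing SWT code:

```java
final class CaretSnap {
    static boolean isHigh(char c) { return c >= '\uD800' && c <= '\uDBFF'; }
    static boolean isLow(char c)  { return c >= '\uDC00' && c <= '\uDFFF'; }

    /** Moves the offset back by one char if it falls between the halves of a pair. */
    static int snapToBoundary(CharSequence s, int offset) {
        if (offset > 0 && offset < s.length()
                && isHigh(s.charAt(offset - 1))
                && isLow(s.charAt(offset))) {
            return offset - 1;
        }
        return offset;
    }
}
```

For the text `"\uD835\uDC00abc"`, offset 1 snaps back to 0, while offsets 0 and 2 are already on a character boundary and are returned unchanged.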

"Supplementary Characters in the Java Platform"
<http://java.sun.com/developer/technicalArticles/Intl/Supplementary/>

Character navigation of supplementary Unicode characters within a StyledText box
should be feasible even without Java 1.5. It is going to be just as awkward to
implement in Java 1.5 because the char type is not being upgraded to a size that
will fit all Unicode code points. The number of code points in a Java string or
array of char can only be determined by parsing the entire string or array.
UTF-16 is a variable width encoding without the space advantage of UTF-8 nor the
fixed width advantage of UTF-32.
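
To illustrate the point about parsing the entire string, here is a sketch of a code point counter that uses only `charAt` and range checks, so it would also work on Java 1.4. `CodePointCounter` is a hypothetical name; on Java 1.5 and later, `String.codePointCount` does the same job:

```java
final class CodePointCounter {
    /** Counts code points with a full linear scan; unpaired surrogates count as one each. */
    static int count(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            n++;
            // A well-formed pair counts once, so skip its low half.
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next >= 0xDC00 && next <= 0xDFFF) {
                    i++;
                }
            }
        }
        return n;
    }
}
```

The cost is proportional to the length of the string, which is the awkwardness described above: there is no constant-time way to get from a char count to a character count.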

Regards,
Adam Warner
Comment 1 Felipe Heidrich CLA 2004-06-08 18:20:17 EDT
Is there any other way to obtain this glyph so I can copy it and paste it ?
I don't have clisp.

Something else, did you test this same scenario with Gedit or the Text Widget 
example from gtk-demo app ? Did it work ?
Comment 2 Adam Warner CLA 2004-06-08 22:38:57 EDT
GEdit 2.6.1 has no supplementary character navigation problem, but I wouldn't
expect it to as this problem is an artifact of 16 bit character encoding! I've
just checked GEdit's dependencies and it uses the pango library. This library
uses UTF-8 internally (<http://www.pango.org/design.shtml>): "The character set
is extensible to the full range of ISO10646 without requiring escape mechanisms
such as surrogate pairs."

Because of Java's 16-bit char legacy we have to worry about surrogate pairs. 

You can use GNOME 2.6's "Unicode Character Map" (from Applications/Accessories)
to copy any Unicode character. MATHEMATICAL_BOLD_CAPITAL_A is the first code
point in the "Mathematical Alphanumeric Symbols" category.

However I have written a test case for you to compile. It's appended to this
message. I found the surrogate pairs for MATHEMATICAL_BOLD_CAPITAL_A using this
conversion table: http://www.i18nguy.com/unicode/surrogatetable.html
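
The table lookup can also be done with the standard UTF-16 arithmetic. A small sketch, with `SurrogateMath` as a hypothetical helper name (on Java 1.5 and later, `Character.toChars` gives the same result):

```java
final class SurrogateMath {
    /** Splits a supplementary code point (U+10000..U+10FFFF) into its surrogate pair. */
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (v >> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (v & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }
}
```

For 1D400h this gives v = D400h, so the high surrogate is D800h + 35h = D835h and the low surrogate is DC00h + 0 = DC00h, matching the string literal used in the test case.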

It should demonstrate all the problems I previously mentioned. I also managed to
induce this error:

** ERROR **: file pango-layout.c: line 2799 (process_item): assertion failed: (!shape_set)
aborting...
Aborted

And this error:

java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at java.lang.StringBuffer.append(StringBuffer.java:499)
        at org.eclipse.swt.custom.DefaultContent.getTextRange(DefaultContent.java:720)
        at org.eclipse.swt.custom.WrappedContent.getLine(WrappedContent.java:98)
        at org.eclipse.swt.custom.StyledText.performPaint(StyledText.java:5670)
        at org.eclipse.swt.custom.StyledText.handlePaint(StyledText.java:5074)
        at org.eclipse.swt.custom.StyledText$7.handleEvent(StyledText.java:4754)
        at org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:82)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:944)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:968)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:953)
        at org.eclipse.swt.widgets.Control.gtk_expose_event(Control.java:1744)
        at org.eclipse.swt.widgets.Composite.gtk_expose_event(Composite.java:412)
        at org.eclipse.swt.widgets.Canvas.gtk_expose_event(Canvas.java:105)
        at org.eclipse.swt.widgets.Widget.windowProc(Widget.java:1190)
        at org.eclipse.swt.widgets.Display.windowProc(Display.java:3012)
        at org.eclipse.swt.internal.gtk.OS.gtk_main_do_event(Native Method)
        at org.eclipse.swt.widgets.Display.eventProc(Display.java:815)
        at org.eclipse.swt.internal.gtk.OS.gdk_window_process_updates(Native Method)
        at org.eclipse.swt.widgets.Control.update(Control.java:3170)
        at org.eclipse.swt.widgets.Control.update(Control.java:3162)
        at org.eclipse.swt.custom.StyledText.handleTextChanged(StyledText.java:5183)
        at org.eclipse.swt.custom.StyledText$6.textChanged(StyledText.java:4711)
        at org.eclipse.swt.custom.StyledTextListener.handleEvent(StyledTextListener.java:61)
        at org.eclipse.swt.custom.DefaultContent.sendTextEvent(DefaultContent.java:799)
        at org.eclipse.swt.custom.DefaultContent.replaceTextRange(DefaultContent.java:791)
        at org.eclipse.swt.custom.WrappedContent.replaceTextRange(WrappedContent.java:319)
        at org.eclipse.swt.custom.StyledText.modifyContent(StyledText.java:5583)
        at org.eclipse.swt.custom.StyledText.sendKeyEvent(StyledText.java:6423)
        at org.eclipse.swt.custom.StyledText.doContent(StyledText.java:2537)
        at org.eclipse.swt.custom.StyledText.handleKey(StyledText.java:4981)
        at org.eclipse.swt.custom.StyledText.handleKeyDown(StyledText.java:5004)
        at org.eclipse.swt.custom.StyledText$7.handleEvent(StyledText.java:4749)
        at org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:82)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:944)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:968)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:953)
        at org.eclipse.swt.widgets.Control.sendIMKeyEvent(Control.java:2277)
        at org.eclipse.swt.widgets.Control.gtk_commit(Control.java:1694)
        at org.eclipse.swt.widgets.Widget.windowProc(Widget.java:1184)
        at org.eclipse.swt.widgets.Display.windowProc(Display.java:3012)
        at org.eclipse.swt.internal.gtk.OS.gtk_im_context_filter_keypress(Native Method)
        at org.eclipse.swt.widgets.Control.gtk_key_press_event(Control.java:1790)
        at org.eclipse.swt.widgets.Composite.gtk_key_press_event(Composite.java:439)
        at org.eclipse.swt.widgets.Widget.windowProc(Widget.java:1194)
        at org.eclipse.swt.widgets.Display.windowProc(Display.java:3012)
        at org.eclipse.swt.internal.gtk.OS.gtk_main_do_event(Native Method)
        at org.eclipse.swt.widgets.Display.eventProc(Display.java:815)
        at org.eclipse.swt.internal.gtk.OS.gtk_main_iteration(Native Method)
        at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:2222)
        at org.eclipse.jface.window.Window.runEventLoop(Window.java:668)
        at org.eclipse.jface.window.Window.open(Window.java:648)
        at TestUnicode.main(TestUnicode.java:26)


These occurred when I started the test program, moved right to the end of the
text and typed z four times.

Regards,
Adam

import org.eclipse.jface.window.ApplicationWindow;
import org.eclipse.swt.SWT;
import org.eclipse.swt.custom.StyledText;
import org.eclipse.swt.widgets.Composite;
import org.eclipse.swt.widgets.Control;
import org.eclipse.swt.widgets.Display;

public class TestUnicode extends ApplicationWindow {

    StyledText text;

    public TestUnicode () {
        super(null);
    }

    protected Control createContents(Composite parent) {
        this.getShell().setText("StyledText Unicode Test");
        text = new StyledText(parent, SWT.BORDER | SWT.WRAP);
        text.setText("\uD835\uDC00abc");
        return text;
    }

    static public void main(String[] args) {
        TestUnicode gui = new TestUnicode();
        gui.setBlockOnOpen(true);
        gui.open();
        Display.getCurrent().dispose();
    }
}
Comment 3 Felipe Heidrich CLA 2004-06-09 12:14:36 EDT
I was looking at the feature plan of GTK 2.4 (http://www.gtk.org/plan/2.4/):
-Full Unicode 3.2 (4.0?) support, including non-BMP portions. (#68435, #101081)
Exactly what you need.

As I expected, it works on GTK 2.4 and fails on GTK 2.2.
Here is my snippet; run it on both versions of GTK and compare:
import org.eclipse.swt.SWT;
import org.eclipse.swt.widgets.*;
import org.eclipse.swt.custom.*;
public class PR65899 {
    public static void main(String[] args) {
        Display display = new Display();
        Shell shell = new Shell(display);
        shell.setSize(400, 200);
        StyledText styledText = new StyledText(shell, SWT.BORDER);
        styledText.setBounds(10, 10, 360, 30);
        styledText.setText("StyledText - Surrogate pair: >\uD835\uDC00<");
        Text text = new Text(shell, SWT.SINGLE | SWT.BORDER);
        text.setBounds(10, 50, 360, 30);
        text.setText("Text - Surrogate pair: >\uD835\uDC00<");
        shell.open();
        while (!shell.isDisposed()) {
            if (!display.readAndDispatch()) {
                display.sleep();
            }
        }
        display.dispose();
    }
}
Comment 4 Felipe Heidrich CLA 2004-06-09 12:16:43 EDT
Created attachment 11803 [details]
Running on GTK-2.4.0
Comment 5 Felipe Heidrich CLA 2004-06-09 12:17:19 EDT
Created attachment 11804 [details]
Running on GTK 2.2.1
Comment 6 Adam Warner CLA 2004-06-09 18:57:16 EDT
Hi Felipe. I see a screenshot similar to your GTK+ 2.4 one since I'm using GNOME
2.6 and GTK+ 2.4.2 (I'm just using a different window manager). I agree this
version of GTK+ is "Exactly what you need."

To reproduce the bug reports from my original mail do this in the _StyledText_ box:

(a) position the cursor right at the beginning of the surrogate pair (just after
the >). Now press the right arrow once. Type a letter and _all the text
disappears_. I suspect this is because the surrogate pair is incorrectly split.

(b) position the cursor right after the surrogate pair and before the <. Type a
letter. Again all the text disappears.

(c) position the cursor right after the <. Type a letter. The letter appears in
the wrong place (before the <). This is further evidence that the surrogate pair
was incorrectly split in the previous two tests.

None of these issues arise in the second Text box, only the StyledText box!

Regards,
Adam
Comment 7 Felipe Heidrich CLA 2004-06-17 14:50:38 EDT
Adam, I will have to defer this problem to after 3.0

Although Java 1.4 has problems, I didn't expect this kind of trouble,
given that all character navigation, hit testing, drawing, measuring, etc. is
done using PangoLayout, which should handle surrogate pairs properly. For
example, Pango sets the cursor position flag before a low surrogate, causing my
code to set the caret in the middle of the surrogate pair. I'm also having
trouble with measuring.

Comment 8 Steve Northover CLA 2008-08-15 14:43:32 EDT
Is this bug in an older version of GTK (i.e. 2.2.x) and fixed in a newer version (2.4 and greater)?
Comment 9 Felipe Heidrich CLA 2009-05-28 17:31:44 EDT
*** Bug 271587 has been marked as a duplicate of this bug. ***
Comment 10 Felipe Heidrich CLA 2009-08-17 17:05:12 EDT
Your bug has been moved to triage, visit http://www.eclipse.org/swt/triage.php for more info.
Comment 11 Felipe Heidrich CLA 2009-11-02 09:56:33 EST
*** Bug 293852 has been marked as a duplicate of this bug. ***
Comment 12 Harendra CLA 2010-04-02 05:10:47 EDT
This happens in Eclipse 3.6 on Red Hat Enterprise Linux 5.4 and Ubuntu Linux 9.04.
Comment 13 Kentaroh Noji CLA 2010-04-05 21:10:52 EDT
The same symptom happens in Eclipse 3.6 M6 for Solaris 10.4 and Mac OS X 10.6.
Comment 14 Felipe Heidrich CLA 2011-01-28 16:53:35 EST
This bug is a lot bigger than we first assumed.

For example, take the text "a\uD835\uDC00b".
In swt/java (utf-16) the char count is 4.
In gtk (utf-8) the char count is 3 (the byte count is 6).

We assume everywhere that a char offset in java and the char offset in gtk are the same. That is not true. For example: create a text control and set the text above:
text.getCharCount() will return 3.
text.getText().length() will return 4.
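
The three counts for that sample text can be checked in plain Java, independent of SWT. A minimal sketch, assuming Java 7 or later for `StandardCharsets`; `OffsetCounts` is a hypothetical class name:

```java
import java.nio.charset.StandardCharsets;

final class OffsetCounts {
    static int utf16Units(String s) { return s.length(); }
    static int codePoints(String s) { return s.codePointCount(0, s.length()); }
    static int utf8Bytes(String s)  { return s.getBytes(StandardCharsets.UTF_8).length; }

    public static void main(String[] args) {
        String s = "a\uD835\uDC00b";
        System.out.println(utf16Units(s)); // prints 4 (Java char count)
        System.out.println(codePoints(s)); // prints 3 (character count as GTK sees it)
        System.out.println(utf8Bytes(s));  // prints 6 (1 + 4 + 1 bytes)
    }
}
```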

We would need to visit every API in the entire toolkit, find all the places where char offsets are being passed, and fix them all. That is a lot of work and will have an impact on performance too.
Comment 15 Phil Stone CLA 2011-01-28 17:43:06 EST
My real problem was using the Phoenician code block...
I found many, many other problems across the whole computing universe where
Unicode above 16 bits was broken.  Sometimes badly.  All Java apps, or apps
that use Java, appeared to be broken.  (Python is fixing this in an incompatible V3 upgrade, that is how severe the problem is; at least there
were no brain-dead surrogate pairs in Python.)  Most web browsers were broken above 16 bits too, though getting better.

My only reasonable fix was to rebuild Phoenician in the Unicode private use 
area at 0xEFXX...  Adjusting fonts and so on...  Making my unicode incompatible
with everyone else...  Basically back to code pages, only this time at 16
bits instead of 8.  I might actually release those fonts, and a wide set
of additional tooling later this year.  If/when I do that I'll start 
the battle of Unicode Code Pages...

OK, here's a thought...
It might be better to consider this a bug against Java itself.

"This is an artifact of the 16 bit Java Legacy..."

Nobody using Java can be reasonably expected to properly code support for surrogate pairs for all string handling.  None of the original 16 bit string APIs worked any more once surrogate pairs were introduced.  No introductory Java programming manuals were correct any longer...

Why do we have a 16 bit legacy when we're moving into a 64 bit world?

Surrogate pairs are the bug.

The bug is that all Java characters need to be 20 bits, or more, wide.  

20 bits is the limit of utf-8 unicode itself, so it is a real, hard, limit.

32 bits would be more natural inside Java code, though only 20 are needed.

Push this bug to Oracle...
Comment 16 Felipe Heidrich CLA 2011-02-09 11:03:05 EST
(In reply to comment #15)
> It might be better to consider this a bug against Java itself.
> "This is an artifact of the 16 bit Java Legacy..."
> Nobody using Java can be reasonably expected to properly code support for
> surrogate pairs for all string handling.  None of the original 16 bit string
> APIs worked any more once surrogate pairs were introduced.  No introductory
> Java programming manuals were correct any longer...

I'm interested in your experience, java has added (long ago) lots of new API to work with surrogates. 
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/#Modified_UTF-8
http://www.ibm.com/developerworks/java/library/j-unicode/?ca=drs-

Why didn't that solve the problems you faced ?

> The bug is that all Java characters need to be 20 bits, or more, wide.  
> 20 bits is the limit of utf-8 unicode itself, so it is a real, hard, limit.
> 32 bits would be more natural inside Java code, though only 20 are needed.

Shouldn't that be 21 bits ?
Anyway, Java only has 16-bit and 32-bit types. They chose 32 bits (int); in the API, a 32-bit value representing a character is always called a codePoint.
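
As an illustration of that int-based API, here is a short sketch that walks a string one code point at a time using `String.codePointAt` and `Character.charCount` (both available since Java 1.5). `CodePointWalk` is a hypothetical name:

```java
final class CodePointWalk {
    /** Returns the code points of s as ints, advancing by 1 or 2 chars per step. */
    static int[] codePoints(String s) {
        int[] out = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out[n++] = cp;
            i += Character.charCount(cp); // 2 for supplementary characters, else 1
        }
        return out;
    }
}
```

For `"a\uD835\uDC00b"` this yields 0x61, 0x1D400 and 0x62.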
Comment 17 Phil Stone CLA 2011-02-09 13:43:56 EST
I've not been working in Java directly, not recently anyway.

We're building tooling for bible translation using the phoenician alphabet.  
It usually lives at 10900 and following in Unicode.

Eclipse is the IDE, with the PyDev plugin for Eclipse used to write Python.
The non-programming language folk are still using Eclipse for editing
directly in language files.

Here's a sample of what it looks like using this alphabet.
(Use Firefox or Chrome... IE has not been tested recently.)

http://www.biblelanguages.org/bible#1.1.pal

Internally there are quite a few more software tools to handle the book keeping
for a new translation.  That link is the tip of an iceberg.

My point in an earlier post on this bug was the entire computing universe has
trouble...

First 2 steps for adding support for any alphabet is to...
1) Get a custom keyboard for the alphabet
2) Get a font for the alphabet.

Both of these were relatively easy.  Was typing in no time, and the font
was not hard either.

Then test typing...

Eclipse itself cannot handle it, as this bug attests.  OK, typing sorta
works, but no editing appears to be safe in Eclipse when the Unicode code point is above 10000.

The cursor is out of place, and the R-to-L nature of this code page
causes all lines that have these values to be uneditable.

Open Office was worse, utterly useless.  So, there was no way to write
about this language in a regular office document.

For web display, like the page above, the ttf to eot conversion tools
from Google... could not handle...  or else eot itself cannot, or else Internet Explorer cannot...  Unclear which.

Our web servers were in PHP, which when we started this work did not even
support 16 bit unicode...  The promise of Python 3 being pure 32 bit unicode, and the dramatically fewer lines of code per unit of function than Java, is why
we decided to move everything to Python instead of Java.

Of course Python 3 was such a major update that most libraries have not
been ported, so we wait for a future move to that.  Python 2.6 is still
mostly good enough, though it too has surrogate pair related bugs, with no real intent to fix on the part of the python folks as Python 3.2 now mostly works.

Our work with the language was also evolving, and we realized that
the R-to-L was not strictly correct in a historical sense.  The best
known historical examples of uses of this alphabet (stored in the Louvre in Paris) are boustrophedon, i.e. written bidirectionally, and there are no layout tools available for that.  (We're working towards a printed, boustrophedon Phoenician bible, as well as an L-to-R Phoenician bible.)

We also needed more characters than were allocated in the Unicode standard, AND
the code point block was not large enough to add all that we needed,
so we had to abandon 10900 anyway.

So, as a test, we moved everything to the private-use area at EF00, added
all the additional characters that we needed, switched to L-to-R, and things cleaned up everywhere.  Eclipse works, Office works, eot fonts work, layout on web pages works, Python 2.6 mostly works, and so on.

It worked so well that is where we remain.

Phil
Comment 18 Felipe Heidrich CLA 2011-02-14 17:08:46 EST
Thank you Phil, your problem is more complex than the one I have.

I only need to get Java and GTK to work together. Most of the problem for me is to convert character offset from utf16 to utf8 and back.
I can convert from java to gtk using:
gtkCharOffset = string.codePointCount(0, javaCharOffset);
and back using:
javaCharOffset = string.offsetByCodePoints(0, gtkCharOffset);

I also need to be very careful when doing a +1 or -1 from a char offset, it should never be done to a java char offset directly.
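
A sketch of how those conversions and the safe +1/-1 steps could be wrapped, assuming the GTK side counts whole characters (code points) as described in comment 14. `OffsetConvert` and its method names are hypothetical and are not taken from the actual patch:

```java
final class OffsetConvert {
    /** Java (UTF-16) char offset to character offset as GTK counts it. */
    static int toGtkOffset(String s, int javaCharOffset) {
        return s.codePointCount(0, javaCharOffset);
    }

    /** Character offset as GTK counts it back to a Java (UTF-16) char offset. */
    static int toJavaOffset(String s, int gtkCharOffset) {
        return s.offsetByCodePoints(0, gtkCharOffset);
    }

    /** Steps forward one whole character instead of doing javaCharOffset + 1. */
    static int next(String s, int javaCharOffset) {
        return javaCharOffset < s.length()
                ? s.offsetByCodePoints(javaCharOffset, 1) : javaCharOffset;
    }

    /** Steps back one whole character instead of doing javaCharOffset - 1. */
    static int previous(String s, int javaCharOffset) {
        return javaCharOffset > 0
                ? s.offsetByCodePoints(javaCharOffset, -1) : javaCharOffset;
    }
}
```

With `"a\uD835\uDC00b"`, Java offset 3 maps to GTK offset 2 and back, `next` from offset 1 lands on 3, and `previous` from offset 3 lands on 1, so the caret never stops between the two surrogates.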

I have a version of TextLayout/StyledText working here; it needs more testing and the code needs a bit more polish.
The worst problem in the code is that TextLayout adds several characters that should be "invisible to the application" (like bidi controls and other stuff). This means that a char offset can be at three different "levels":
1) application char offset - the offset that exist to the outside world
2) internal to text layout char offset 
3) internal to text layout char offset (processing surrogates) - compatible with gtk.
Comment 19 Felipe Heidrich CLA 2011-03-23 14:24:17 EDT
Created attachment 191778 [details]
Patch for GTK
Comment 20 Felipe Heidrich CLA 2011-03-23 14:24:44 EDT
Fixed in HEAD
Comment 21 Harendra CLA 2011-04-06 00:50:55 EDT
I verified it was working in Eclipse 3.7 build I20110329-0800 under Red Hat Enterprise Linux 6.0.