| Summary: | [StyledText] StyledText Supplementary/Surrogate character navigation | | |
|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | Adam Warner <adam> |
| Component: | SWT | Assignee: | Felipe Heidrich <eclipse.felipe> |
| Status: | RESOLVED FIXED | QA Contact: | Felipe Heidrich <eclipse.felipe> |
| Severity: | normal | | |
| Priority: | P3 | CC: | daniel_megert, david_williams, eclipse, harendra, heath.borders, kennoji, pwebster, snorthov |
| Version: | 3.0 | | |
| Target Milestone: | 3.7 M7 | | |
| Hardware: | PC | | |
| OS: | Linux-GTK | | |
| Whiteboard: | | | |
| Attachments: |
|
||||||||||
|
Description
Adam Warner
Is there any other way to obtain this glyph so I can copy it and paste it? I don't have clisp. Something else, did you test this same scenario with Gedit or the Text Widget example from gtk-demo app? Did it work?

GEdit 2.6.1 has no supplementary character navigation problem, but I wouldn't expect it to as this problem is an artifact of 16 bit character encoding! I've just checked GEdit's dependencies and it uses the pango library. This library uses UTF-8 internally (<http://www.pango.org/design.shtml>): "The character set is extensible to the full range of ISO10646 without requiring escape mechanisms such as surrogate pairs." Because of Java's 16-bit char legacy we have to worry about surrogate pairs.

You can use GNOME 2.6's "Unicode Character Map" (from Applications/Accessories) to copy any Unicode character. MATHEMATICAL_BOLD_CAPITAL_A is the first code point in the "Mathematical Alphanumeric Symbols" category.

However I have written a test case for you to compile. It's appended to this message. I found the surrogate pairs for MATHEMATICAL_BOLD_CAPITAL_A using this conversion table: http://www.i18nguy.com/unicode/surrogatetable.html

It should demonstrate all the problems I previously mentioned. I also managed to induce this error:

    ** ERROR **: file pango-layout.c: line 2799 (process_item): assertion failed: (!shape_set)
    aborting...
    Aborted

And this error:

    java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(Native Method)
        at java.lang.StringBuffer.append(StringBuffer.java:499)
        at org.eclipse.swt.custom.DefaultContent.getTextRange(DefaultContent.java:720)
        at org.eclipse.swt.custom.WrappedContent.getLine(WrappedContent.java:98)
        at org.eclipse.swt.custom.StyledText.performPaint(StyledText.java:5670)
        at org.eclipse.swt.custom.StyledText.handlePaint(StyledText.java:5074)
        at org.eclipse.swt.custom.StyledText$7.handleEvent(StyledText.java:4754)
        at org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:82)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:944)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:968)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:953)
        at org.eclipse.swt.widgets.Control.gtk_expose_event(Control.java:1744)
        at org.eclipse.swt.widgets.Composite.gtk_expose_event(Composite.java:412)
        at org.eclipse.swt.widgets.Canvas.gtk_expose_event(Canvas.java:105)
        at org.eclipse.swt.widgets.Widget.windowProc(Widget.java:1190)
        at org.eclipse.swt.widgets.Display.windowProc(Display.java:3012)
        at org.eclipse.swt.internal.gtk.OS.gtk_main_do_event(Native Method)
        at org.eclipse.swt.widgets.Display.eventProc(Display.java:815)
        at org.eclipse.swt.internal.gtk.OS.gdk_window_process_updates(Native Method)
        at org.eclipse.swt.widgets.Control.update(Control.java:3170)
        at org.eclipse.swt.widgets.Control.update(Control.java:3162)
        at org.eclipse.swt.custom.StyledText.handleTextChanged(StyledText.java:5183)
        at org.eclipse.swt.custom.StyledText$6.textChanged(StyledText.java:4711)
        at org.eclipse.swt.custom.StyledTextListener.handleEvent(StyledTextListener.java:61)
        at org.eclipse.swt.custom.DefaultContent.sendTextEvent(DefaultContent.java:799)
        at org.eclipse.swt.custom.DefaultContent.replaceTextRange(DefaultContent.java:791)
        at org.eclipse.swt.custom.WrappedContent.replaceTextRange(WrappedContent.java:319)
        at org.eclipse.swt.custom.StyledText.modifyContent(StyledText.java:5583)
        at org.eclipse.swt.custom.StyledText.sendKeyEvent(StyledText.java:6423)
        at org.eclipse.swt.custom.StyledText.doContent(StyledText.java:2537)
        at org.eclipse.swt.custom.StyledText.handleKey(StyledText.java:4981)
        at org.eclipse.swt.custom.StyledText.handleKeyDown(StyledText.java:5004)
        at org.eclipse.swt.custom.StyledText$7.handleEvent(StyledText.java:4749)
        at org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:82)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:944)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:968)
        at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:953)
        at org.eclipse.swt.widgets.Control.sendIMKeyEvent(Control.java:2277)
        at org.eclipse.swt.widgets.Control.gtk_commit(Control.java:1694)
        at org.eclipse.swt.widgets.Widget.windowProc(Widget.java:1184)
        at org.eclipse.swt.widgets.Display.windowProc(Display.java:3012)
        at org.eclipse.swt.internal.gtk.OS.gtk_im_context_filter_keypress(Native Method)
        at org.eclipse.swt.widgets.Control.gtk_key_press_event(Control.java:1790)
        at org.eclipse.swt.widgets.Composite.gtk_key_press_event(Composite.java:439)
        at org.eclipse.swt.widgets.Widget.windowProc(Widget.java:1194)
        at org.eclipse.swt.widgets.Display.windowProc(Display.java:3012)
        at org.eclipse.swt.internal.gtk.OS.gtk_main_do_event(Native Method)
        at org.eclipse.swt.widgets.Display.eventProc(Display.java:815)
        at org.eclipse.swt.internal.gtk.OS.gtk_main_iteration(Native Method)
        at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:2222)
        at org.eclipse.jface.window.Window.runEventLoop(Window.java:668)
        at org.eclipse.jface.window.Window.open(Window.java:648)
        at TestUnicode.main(TestUnicode.java:26)

These occurred when I started the test program, moved right to the end of the text and typed z four times.
Regards,
Adam

    import org.eclipse.jface.window.ApplicationWindow;
    import org.eclipse.swt.SWT;
    import org.eclipse.swt.custom.StyledText;
    import org.eclipse.swt.widgets.Composite;
    import org.eclipse.swt.widgets.Control;
    import org.eclipse.swt.widgets.Display;

    public class TestUnicode extends ApplicationWindow {
        StyledText text;

        public TestUnicode() {
            super(null);
        }

        protected Control createContents(Composite parent) {
            this.getShell().setText("StyledText Unicode Test");
            text = new StyledText(parent, SWT.BORDER | SWT.WRAP);
            // One supplementary character (a surrogate pair) followed by "abc".
            text.setText("\uD835\uDC00abc");
            return text;
        }

        public static void main(String[] args) {
            TestUnicode gui = new TestUnicode();
            gui.setBlockOnOpen(true);
            gui.open();
            Display.getCurrent().dispose();
        }
    }

I was looking at the feature plan of GTK 2.4 (http://www.gtk.org/plan/2.4/):

-Full Unicode 3.2 (4.0?) support, including non-BMP portions. (#68435, #101081)

Exactly what you need. As I expected, it works on GTK 2.4 and fails on GTK 2.2. Here is my snippet; run it on both versions of GTK and compare:

    import org.eclipse.swt.SWT;
    import org.eclipse.swt.widgets.*;
    import org.eclipse.swt.custom.*;

    public class PR65899 {
        public static void main(String[] args) {
            Display display = new Display();
            Shell shell = new Shell(display);
            shell.setSize(400, 200);

            // Custom widget: shows the surrogate pair problem.
            StyledText styledText = new StyledText(shell, SWT.BORDER);
            styledText.setBounds(10, 10, 360, 30);
            styledText.setText("StyledText - Surrogate pair: >\uD835\uDC00<");

            // Native widget with the same content, for comparison.
            Text text = new Text(shell, SWT.SINGLE | SWT.BORDER);
            text.setBounds(10, 50, 360, 30);
            text.setText("Text - Surrogate pair: >\uD835\uDC00<");

            shell.open();
            while (!shell.isDisposed()) {
                if (!display.readAndDispatch()) display.sleep();
            }
            display.dispose();
        }
    }

Created attachment 11803
Running on GTK-2.4.0
Created attachment 11804
Running on GTK 2.2.1
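For reference, the `\uD835\uDC00` escape used in both snippets is the UTF-16 surrogate pair for U+1D400 (MATHEMATICAL BOLD CAPITAL A). The following is a minimal sketch of how that pair can be derived without a conversion table, using the standard `Character.toChars` call and, for comparison, the arithmetic from the UTF-16 encoding rules. The class and method names are hypothetical and not part of SWT or the test cases above.

```java
// Sketch only: SurrogatePairs and its methods are illustrative names,
// not part of SWT or of the attached test cases.
class SurrogatePairs {

    // High (leading) surrogate: top 10 bits of (codePoint - 0x10000).
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (trailing) surrogate: bottom 10 bits of (codePoint - 0x10000).
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int codePoint = 0x1D400; // MATHEMATICAL BOLD CAPITAL A

        // The JDK performs the same split.
        char[] units = Character.toChars(codePoint);
        System.out.printf("U+%X -> 0x%04X 0x%04X%n",
                codePoint, (int) units[0], (int) units[1]);

        // The hand-computed values agree with the JDK result.
        System.out.println(units[0] == highSurrogate(codePoint)
                && units[1] == lowSurrogate(codePoint));
    }
}
```

Both helpers assume the argument is a supplementary code point (at or above U+10000); `Character.isSupplementaryCodePoint` can be used to check that first.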
Hi Felipe. I see a screenshot similar to your GTK+ 2.4 one since I'm using GNOME 2.6 and GTK+ 2.4.2 (I'm just using a different window manager). I agree this version of GTK+ is "Exactly what you need."

To reproduce the bug reports from my original mail do this in the _StyledText_ box:

(a) Position the cursor right at the beginning of the surrogate pair (just after the >). Now press the right arrow once. Type a letter and _all the text disappears_. I suspect this is because the surrogate pair is incorrectly split.

(b) Position the cursor right after the surrogate pair and before the <. Type a letter. Again all the text disappears.

(c) Position the cursor right after the <. Type a letter. The letter appears in the wrong place (before the <). This is further evidence that the surrogate pair was incorrectly split in the previous two tests.

None of these issues arise in the second Text box, only the StyledText box!

Regards,
Adam

Adam, I will have to defer this problem to after 3.0. Although Java 1.4 has problems, I didn't expect to have this kind of trouble, given that all character navigation, hit testing, drawing, measuring, etc. is done using PangoLayout, which should handle surrogate pairs properly. For example, Pango sets the cursor position flag before a low surrogate, causing my code to set the caret in the middle of the surrogate pair. I'm also having trouble with measuring.

Is this bug in an older version of GTK (i.e. 2.2.x) and fixed in a newer version (2.4 and greater)?

*** Bug 271587 has been marked as a duplicate of this bug. ***

Your bug has been moved to triage, visit http://www.eclipse.org/swt/triage.php for more info.

*** Bug 293852 has been marked as a duplicate of this bug. ***

This happens for Eclipse 3.6 in Red Hat Enterprise Linux 5.4 and Ubuntu Linux 9.04. The same symptom happens in Eclipse 3.6 M6 for Solaris 10.4 and Mac OS X 10.6.

This bug is a lot bigger than we first assumed. For example, take the text "a\uD835\uDC00b". In swt/java (utf-16) the char count is 4.
In gtk (utf-8) the char count is 3 (the byte count is 6). We assume everywhere that a char offset in java and the char offset in gtk are the same. That is not true. For example: create a text control and set the text above: text.getCharCount() will return 3. text.getText().length() will return 4. We would need to visit every API in the entire toolkit, find all the places where char offsets are being passed, and fix them all. Big work, and it will have an impact on performance too.

My real problem was using the Phoenician code block... I found many, many other problems across the whole computing universe where Unicode above 16 bits was broken. Sometimes badly. All Java apps, or apps that use Java, appeared to be broken. (Python is fixing this in an incompatible V3 upgrade because it is so severe a problem; at least there were no brain-dead surrogate pairs in Python.) Most web browsers were broken above 16 bits too, though getting better.

My only reasonable fix was to rebuild Phoenician in the Unicode private use area at 0xEFXX... Adjusting fonts and so on... Making my unicode incompatible with everyone else... Basically back to code pages, only this time at 16 bits instead of 8. I might actually release those fonts, and a wide set of additional tooling, later this year. If/when I do that I'll start the battle of Unicode Code Pages...

OK, here's a thought... It might be better to consider this a bug against Java itself. "This is an artifact of the 16 bit Java Legacy..." Nobody using Java can be reasonably expected to properly code support for surrogate pairs for all string handling. None of the original 16 bit string APIs worked any more once surrogate pairs were introduced. No introductory Java programming manuals were correct any longer... Why do we have a 16 bit legacy when we're moving into a 64 bit world?

Surrogate pairs are the bug. The bug is that all Java characters need to be 20 bits, or more, wide. 20 bits is the limit of utf-8 unicode itself, so it is a real, hard, limit.
32 bits would be more natural inside Java code, though only 20 are needed. Push this bug to Oracle...

(In reply to comment #15)
> It might be better to consider this a bug against Java itself.
> "This is an artifact of the 16 bit Java Legacy..."
> Nobody using Java can be reasonably expected to properly code support for
> surrogate pairs for all string handling. None of the original 16 bit string
> APIs worked any more once surrogate pairs were introduced. No introductory
> Java programming manuals were correct any longer...

I'm interested in your experience. Java added (long ago) lots of new API to work with surrogates.
http://java.sun.com/developer/technicalArticles/Intl/Supplementary/#Modified_UTF-8
http://www.ibm.com/developerworks/java/library/j-unicode/?ca=drs-
Why didn't that solve the problems you faced?

> The bug is that all Java characters need to be 20 bits, or more, wide.
> 20 bits is the limit of utf-8 unicode itself, so it is a real, hard, limit.
> 32 bits would be more natural inside Java code, though only 20 are needed.

Shouldn't that be 21 bits? Anyway, Java only has 16-bit and 32-bit types to choose from. They chose 32 bits (int); in the API, a 32-bit value that represents a character is always called a codePoint.

I've not been working in Java directly, not recently anyway. We're building tooling for bible translation using the Phoenician alphabet. It usually lives at 10900 and following in Unicode. Eclipse is the IDE, with the PyDev plugin for Eclipse to write Python. The non-programming language folk are still using Eclipse for editing directly in language files.

Here's a sample of what it looks like using this alphabet. (Use Firefox or Chrome... IE has not been tested recently.) http://www.biblelanguages.org/bible#1.1.pal

Internally there are quite a few more software tools to handle the book keeping for a new translation. That link is the tip of an iceberg. My point in an earlier post on this bug was that the entire computing universe has trouble...
The first 2 steps for adding support for any alphabet are to...
1) Get a custom keyboard for the alphabet
2) Get a font for the alphabet.

Both of these were relatively easy. I was typing in no time, and the font was not hard either. Then test typing... Eclipse itself cannot handle it, as this bug attests. OK, typing in sorta works, but no editing appears to be safe in Eclipse when the unicode is above 10000. The cursor is out of place, and the R-to-L nature of this code page causes all lines that have these values to be uneditable.

Open Office was worse, utterly useless. So there was no way to write about this language in a regular office document. For web display, like the page above, the ttf to eot conversion tools from Google... could not handle it... or else eot itself cannot, or else Internet Explorer cannot... Unclear which. Our web servers were in PHP, which when we started this work did not even support 16 bit unicode...

The promise of Python 3 being pure 32 bit unicode, and the dramatically fewer lines of code per unit of function than Java, is why we decided to move everything to Python instead of Java. Of course Python 3 was such a major update that most libraries have not been ported, so we wait for a future move to that. Python 2.6 is still mostly good enough, though it too has surrogate pair related bugs, with no real intent to fix them on the part of the Python folks as Python 3.2 now mostly works.

Our work with the language was also evolving, and we realized that the R-to-L was not strictly correct in a historical sense. The best known historical examples of uses of this alphabet (stored in the Louvre in Paris) are boustrophedon, i.e. written bidirectionally, and there are no layout tools available for that. (We're working towards a printed, boustrophedon Phoenician bible, as well as an L-to-R Phoenician bible.)
We also needed more characters than were allocated in the Unicode standard, AND the code point block was not large enough to add all that we needed, so we had to abandon 10900 anyway.

So, as a test, we moved everything to the private-use area at EF00, added all the additional characters that we needed, switched to L-to-R, and things cleaned up everywhere. Eclipse works, Office works, eot fonts work, layout on web pages works, Python 2.6 mostly works, and so on. It worked so well that this is where we remain.

Phil

Thank you Phil, your problem is more complex than the one I have. I only need to get Java and GTK to work together. Most of the problem for me is converting character offsets from utf16 to utf8 and back.

I can convert from java to gtk using:

    gtkCharOffset = string.codePointCount(0, javaCharOffset);

and back using:

    javaCharOffset = string.offsetByCodePoints(0, gtkCharOffset);

I also need to be very careful when doing a +1 or -1 from a char offset; it should never be done to a java char offset directly.

I have a version of TextLayout/StyledText working here; it needs more testing and the code needs a bit more polish. The worst problem in the code is that TextLayout adds several characters that should be "invisible to the application" (like bidi controls and other stuff). This means that a char offset can be at three different "levels":
1) application char offset - the offset that exists to the outside world
2) internal to text layout char offset
3) internal to text layout char offset (processing surrogates) - compatible with gtk.

Created attachment 191778
Patch for GTK
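The offset conversion described in the comment above can be sketched in isolation as follows. This is a minimal, hedged illustration using only standard `java.lang.String` methods; it is not the attached patch, and the class and helper names (`OffsetConversion`, `toCodePointOffset`, `toUtf16Offset`, `next`, `previous`) are hypothetical. It uses the example string "a\uD835\uDC00b" from earlier in this thread and assumes the GTK side counts whole code points, as the comments describe.

```java
import java.nio.charset.StandardCharsets;

// Sketch only: these helpers are illustrative, not SWT API and not the patch.
class OffsetConversion {

    // Java (UTF-16 code unit) offset -> code point offset.
    static int toCodePointOffset(String s, int utf16Offset) {
        return s.codePointCount(0, utf16Offset);
    }

    // Code point offset -> Java (UTF-16 code unit) offset.
    static int toUtf16Offset(String s, int codePointOffset) {
        return s.offsetByCodePoints(0, codePointOffset);
    }

    // Step one whole character forward instead of doing utf16Offset + 1.
    // Throws IndexOutOfBoundsException at the end of the string.
    static int next(String s, int utf16Offset) {
        return s.offsetByCodePoints(utf16Offset, 1);
    }

    // Step one whole character backward instead of doing utf16Offset - 1.
    // Throws IndexOutOfBoundsException at the start of the string.
    static int previous(String s, int utf16Offset) {
        return s.offsetByCodePoints(utf16Offset, -1);
    }

    public static void main(String[] args) {
        String s = "a\uD835\uDC00b";

        System.out.println(s.length());                                // 4 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));           // 3 code points
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 6 UTF-8 bytes

        // Offset 3 in Java (just before 'b') is offset 2 in code points.
        System.out.println(toCodePointOffset(s, 3)); // 2
        System.out.println(toUtf16Offset(s, 2));     // 3

        // Moving right from offset 1 skips the whole surrogate pair.
        System.out.println(next(s, 1));     // 3
        System.out.println(previous(s, 3)); // 1
    }
}
```

Note that `codePointCount` counts an unpaired surrogate as one code point, so a Java offset that falls in the middle of a pair (offset 2 here) also maps to code point offset 2. Converting it back yields 3, which is one reason the round trip is only safe for offsets that sit on character boundaries.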
Fixed in HEAD

I verified it was working in Eclipse 3.7 build I20110329-0800 under Red Hat Enterprise Linux 6.0. |