Created attachment 274281 [details]
reproducer

I'm looking at eclipse-platform-4.8RC2-linux-gtk-x86_64 installed on Red Hat Enterprise Linux 7.5. I've added several features, including DLTK and SWTBot, and run some automatic tests. One of them failed when testing a hover help created from a man page. Although the man page is shown in the hover help correctly, getText() on the corresponding org.eclipse.swt.browser.Browser returns just "<" instead of the actual text. Browser.getText() works for me for other hover helps; the only ones affected are those created from man pages. This is a regression from Eclipse 4.7.

It seems to me that:
* either DLTK creates man-based hover helps that are partially broken - they render correctly but can't be read by Browser.getText()
* or SWT's Browser.getText() itself is broken

Filing this against DLTK may be wrong but I need to start somewhere... I'm attaching a simple reproducer.

How to reproduce the problem:
1. Install Eclipse 4.8 RC2
2. Install DLTK and SWTBot
3. Create a new project: SWTBot > SWTBot Test Plug-In
4. Create a new class in the project - see the attached class
5. Run As > SWTBot Test

Observe the running test: a shell script gets created containing the command "find", then a hover help for "find" is shown. So far so good. At that moment, follow the console of the parent Eclipse instance: the test lists existing shells and browsers, and for each browser its getUrl() and getText() are called. You should see something like:

SHELL: text='PartRenderingEngine's limbo'
SHELL: text='junit-workspace - shell_proj1/shell_script1.sh - Eclipse Platform'
SHELL: text='Quick Access'
SHELL: text=''
SHELL: text=''
BROWSER: getUrl='about:blank' getText='<'

Now you see the problem: getText() is just "<" instead of "a string with HTML that represents the content of the current page", as promised by the API.
Vaclav just tried and confirmed it works for him with SWT_WEBKIT2=0, so it's an SWT bug.
I reproduced the problem. I'll look into it.
Created attachment 274287 [details]
SWT setText()/getText() to reproduce the issue.

I narrowed it down to a snippet. WebKit2 seems to struggle with this particular man-page string, as other strings work. Will continue to investigate.
It looks like WebKit sometimes returns a UTF-8 string and sometimes a UTF-16 string. The bug occurs because we always assume UTF-8. I haven't found a way to tell which encoding is returned yet, but I'll keep investigating. Might ping the GTK folks.
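
A minimal sketch of why a UTF-8 assumption could yield exactly "<". It assumes the returned bytes are UTF-16LE and are read as a NUL-terminated C string; both are assumptions made for illustration, and the class and helper names are hypothetical, not SWT's actual code.

```java
import java.nio.charset.StandardCharsets;

final class EncodingMismatchDemo {
    /**
     * Hypothetical helper: decodes bytes as a NUL-terminated UTF-8 C string,
     * i.e. stops at the first zero byte.
     */
    static String decodeAsCString(byte[] data) {
        int len = 0;
        while (len < data.length && data[len] != 0) {
            len++;
        }
        return new String(data, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // In UTF-16LE, '<' is the byte pair 0x3C 0x00, so the second byte
        // already looks like a string terminator to a UTF-8 reader.
        byte[] utf16 = "<html><body>find</body></html>".getBytes(StandardCharsets.UTF_16LE);

        System.out.println(decodeAsCString(utf16));                        // prints "<"
        System.out.println(new String(utf16, StandardCharsets.UTF_16LE)); // prints the full markup
    }
}
```

Under these assumptions, any ASCII-leading UTF-16LE payload is cut off after its first character, which matches the getText()='<' seen in the report.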
I got the following response from webkitgtk folks: (Will research)

---------- Forwarded message ----------
From: Michael Catanzaro <mcatanzaro@igalia.com>
Date: Thu, May 31, 2018 at 8:03 PM
Subject: Re: [webkit-gtk] guchar * sometimes a utf8, sometimes utf16?
To: Leo Ufimtsev <Leonidas@redhat.com>
Cc: webkit-gtk@lists.webkit.org

On Thu, May 31, 2018 at 5:05 PM, Leo Ufimtsev <Leonidas@redhat.com> wrote:
> Hello guys,
>
> The following function:
> guchar * webkit_web_resource_get_data_finish(..)
>
> Sometimes returns utf8 and sometimes utf16. Is there a way to tell them apart?
>
> Thank you.

Hm, good question. I don't know the answer, but here are some thoughts anyway:

We use guchar instead of gchar to indicate that it's a byte array, not a string, so it's not expected to be UTF-8. In fact, it could be any arbitrary encoding, not just UTF-16. I've seen more esoteric encodings before, particularly for CJKV websites. Of course, it might not be an HTML resource at all, it could be an image or an executable file or anything.

Assuming you know it is an HTML doc, then I think you want to parse the charset from the meta tag. Of course, that's a bit difficult because you do not know the encoding you should be using to parse it until after you have somehow successfully parsed it. I don't know how you would do it, but clearly WebKit knows how, somewhere. In Epiphany, our use is limited to saving resources on disk, which then get parsed by other applications when you open them, which is why we've never needed to deal with this problem.

For a website loaded via HTTP, the encoding could also have been set by an HTTP header. There's really nothing you can do in that case, as you don't have access to that.

I think Firefox uses an encoding detector. WebKit does not, but it's one option. ICU can do this, as can uchardet. Problem is, they are probabilistic and do not work well for some important encodings (e.g. GB18030). But that might work well enough for your needs.

Michael
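
A rough sketch of one cheap heuristic for the narrow "UTF-8 or UTF-16?" question, assuming the payload is mostly ASCII markup: honor a byte-order mark if present, otherwise look at where NUL bytes fall. The class name, method name, and 256-byte sample size are hypothetical choices for illustration; this is not the merged SWT change and not a general encoding detector like ICU or uchardet.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingGuess {
    /** Guesses between UTF-8 and UTF-16 from a byte array; heuristic only. */
    static Charset guess(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            // A BOM settles it; the generic UTF-16 charset consumes the BOM when decoding.
            if ((b0 == 0xFF && b1 == 0xFE) || (b0 == 0xFE && b1 == 0xFF)) {
                return StandardCharsets.UTF_16;
            }
        }
        // ASCII characters in UTF-16 carry a zero byte: at odd offsets for
        // little-endian, at even offsets for big-endian. Valid UTF-8 text
        // normally contains no zero bytes at all.
        int evenNuls = 0;
        int oddNuls = 0;
        int limit = Math.min(data.length, 256); // sample size is an arbitrary choice
        for (int i = 0; i < limit; i++) {
            if (data[i] == 0) {
                if ((i & 1) == 0) evenNuls++; else oddNuls++;
            }
        }
        if (oddNuls > 0 && evenNuls == 0) return StandardCharsets.UTF_16LE;
        if (evenNuls > 0 && oddNuls == 0) return StandardCharsets.UTF_16BE;
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        byte[] le = "<html>".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(new String(le, guess(le)));
    }
}
```

The heuristic fails when the UTF-16 text has no zero bytes in the sampled range, for example text consisting only of non-Latin characters, so it would be a stopgap rather than a real answer to the question raised in the email.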
New Gerrit change created: https://git.eclipse.org/r/124502
(In reply to Eclipse Genie from comment #6)
> New Gerrit change created: https://git.eclipse.org/r/124502

Patch for review/merge.

Note to self: I also need to submit a bug report to webkitgtk and cc Michael Catanzaro <mcatanzaro@igalia.com> (the webkitgtk dev I spoke with, who works in that field).
Gerrit change https://git.eclipse.org/r/124502 was merged to [master]. Commit: http://git.eclipse.org/c/platform/eclipse.platform.swt.git/commit/?id=1823ab237d69270276b2e681b68b46bf881f6abf
Note to self: Tomas reported that there are compile issues on Windows with the added Unicode characters in the javadoc. I'll have to remove those.

Thomas's email:

Hi Leo,

When trying to compile SWT on Windows, I'm getting the errors

[javac] e:\swt\eclipse.platform.swt.binaries\bundles\org.eclipse.swt.gtk.linux.x86_64\temp.folder\@dot.src\org\eclipse\swt\internal\Converter.java:277: error: unmappable character for encoding Cp1252
[javac] * Some times it can get confused if it receives two non-null bytes. e.g ╨? = (UTF-16 [01,04])
[javac] ^
[javac] e:\swt\eclipse.platform.swt.binaries\bundles\org.eclipse.swt.gtk.linux.x86_64\temp.folder\@dot.src\org\eclipse\swt\internal\Converter.java:336: error: unmappable character for encoding Cp1252
[javac] // E.g Unicode Hyphen U+2010 'ΓÇ?' ( which btw different from the ascii U+002D '-' Hyphen-Minus)
[javac] ^
[javac] 2 errors
[javac] Compile failed; see the compiler error output for details.

in Converter.java. I reckon they are caused by your commit 1823ab237d69270276b2e681b68b46bf881f6abf.

Cheers,
Tom
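
For illustration only, a small sketch of the constraint behind these errors: the Unicode hyphen U+2010 has no representation in windows-1252 (Cp1252), while a Unicode escape in the source keeps the file pure ASCII and therefore independent of the compiler's default encoding. The class and method names are hypothetical, and the sketch assumes the JDK ships the windows-1252 charset (common, but not guaranteed by the Java specification). Escapes are an alternative to deleting the characters, not what the thread settled on.

```java
import java.nio.charset.Charset;

final class Cp1252Check {
    /** Returns true if every character of s can be encoded in the named charset. */
    static boolean representable(String s, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        // Written as escapes, so this source file itself contains only ASCII bytes.
        String unicodeHyphen = "\u2010"; // HYPHEN
        String hyphenMinus = "\u002D";   // HYPHEN-MINUS, the plain ASCII '-'

        System.out.println(representable(unicodeHyphen, "windows-1252")); // false
        System.out.println(representable(hyphenMinus, "windows-1252"));   // true
        System.out.println(representable(unicodeHyphen, "UTF-8"));        // true
    }
}
```

Another option, where the build allows it, is to pass an explicit `-encoding UTF-8` to javac so that the result does not depend on the platform default.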
(In reply to Leo Ufimtsev from comment #9)
> Note to self:
>
> Tomas reported that there are compile issues on windows with the added
> unicode characters in the javadoc. I'll have to remove those.
>

Removed via: https://git.eclipse.org/r/#/c/125248/