Bug 535392 - [Webkit2] Browser.getText() returns wrong decoding when setText() contains utf (code point >127) characters
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: SWT
Version: 4.8
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Leo Ufimtsev CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-31 10:09 EDT by Vaclav Kadlcik CLA
Modified: 2018-06-29 11:54 EDT

See Also:


Attachments
reproducer (2.35 KB, text/x-java)
2018-05-31 10:09 EDT, Vaclav Kadlcik CLA
SWT setText()/getText() to reproduce the issue. (14.63 KB, text/plain)
2018-05-31 13:37 EDT, Leo Ufimtsev CLA

Description Vaclav Kadlcik CLA 2018-05-31 10:09:21 EDT
Created attachment 274281
reproducer

I'm looking at eclipse-platform-4.8RC2-linux-gtk-x86_64 installed
on Red Hat Enterprise Linux 7.5. I've added several features,
including DLTK and SWTBot and run some automatic tests. One of
them failed when testing a hover help created from a man page.
Although the man page is shown in the hover help correctly,
getText() on the corresponding org.eclipse.swt.browser.Browser
returns just "<" instead of the actual text.

The Browser.getText() method works for me for other hover helps;
only those created from man pages are affected.

This is a regression from Eclipse 4.7. It seems to me that:
 * either DLTK creates man-based hover helps that are partially
   broken - they render correctly but can't be read by
   Browser.getText()
 * or SWT's Browser.getText() itself is broken
Filing this against DLTK may be wrong but I need to start
somewhere...

I'm attaching a simple reproducer.

How to reproduce the problem:

 1. Install Eclipse 4.8 RC2
 2. Install DLTK and SWTBot
 3. Create a new project: SWTBot > SWTBot Test Plug-In
 4. Create a new class in the project - see the attached class
 5. Run As > SWTBot Test

Observe the running test: A shell script gets created containing
command "find", then a hover help for "find" is shown. So far so
good. At that moment follow the console of the parent Eclipse
instance: the test lists existing shells and browsers and for each
browser, its getUrl() and getText() are called. You should see
something like:

SHELL: text='PartRenderingEngine's limbo'
SHELL: text='junit-workspace - shell_proj1/shell_script1.sh - Eclipse Platform'
SHELL: text='Quick Access'
SHELL: text=''
SHELL: text=''
  BROWSER: getUrl='about:blank' getText='<'

Now you see the problem: getText() is just "<" instead of
"a string with HTML that represents the content of the current
page", as promised by the API.
Comment 1 Alexander Kurtakov CLA 2018-05-31 10:35:15 EDT
Vaclav just tried and confirmed it works for him with SWT_WEBKIT2=0, so it's an SWT bug.
Comment 2 Leo Ufimtsev CLA 2018-05-31 12:07:18 EDT
I reproduced the problem. I'll look into it.
Comment 3 Leo Ufimtsev CLA 2018-05-31 13:37:18 EDT
Created attachment 274287
SWT setText()/getText() to reproduce the issue.

I narrowed it down to a snippet. Webkit2 seems to struggle with this particular man-page string; other strings work.

Will continue to investigate.
Comment 4 Leo Ufimtsev CLA 2018-05-31 18:02:48 EDT
It looks like webkit sometimes returns a UTF-8 string and sometimes a UTF-16 string. The bug occurs because we always assume UTF-8.

I haven't found a way to tell which type of string is returned yet, but I'll keep investigating. Might ping gtk folks.
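
A minimal sketch of how this failure mode could produce a lone "<". It assumes the returned bytes are UTF-16LE and are read as a NUL-terminated UTF-8 string; that mechanism is an assumption, not something confirmed in this report. The class name, both method names, and the naive zero-byte check are hypothetical and are not the SWT fix.

```java
import java.nio.charset.StandardCharsets;

public class Utf16AsUtf8Demo {

    /** Decodes the way a C-string reader would: stop at the first NUL, assume UTF-8. */
    static String decodeAsCString(byte[] data) {
        int end = 0;
        while (end < data.length && data[end] != 0) {
            end++;
        }
        return new String(data, 0, end, StandardCharsets.UTF_8);
    }

    /**
     * Naive guess (hypothetical): ASCII-range text encoded as UTF-16LE has a
     * zero in the second byte. Text that starts with a non-ASCII character
     * encoded as two non-zero bytes would fool this check.
     */
    static boolean looksLikeUtf16Le(byte[] data) {
        return data.length >= 2 && data[0] != 0 && data[1] == 0;
    }

    public static void main(String[] args) {
        String html = "<html>x</html>";
        byte[] utf8 = html.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = html.getBytes(StandardCharsets.UTF_16LE);

        // UTF-8 bytes contain no NUL, so the whole string survives.
        System.out.println(decodeAsCString(utf8));
        // In UTF-16LE, '<' is the bytes 3C 00, so decoding stops after "<".
        System.out.println(decodeAsCString(utf16));

        // Picking the charset from the guess recovers the full text.
        String recovered = looksLikeUtf16Le(utf16)
                ? new String(utf16, StandardCharsets.UTF_16LE)
                : decodeAsCString(utf16);
        System.out.println(recovered);
    }
}
```

The zero-byte check is only a heuristic; it cannot distinguish encodings in general, which is why the question of telling the two apart remains open at this point in the thread.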
Comment 5 Leo Ufimtsev CLA 2018-06-04 08:43:44 EDT
I got the following response from webkitgtk folks:
(Will research)


---------- Forwarded message ----------
From: Michael Catanzaro <mcatanzaro@igalia.com>
Date: Thu, May 31, 2018 at 8:03 PM
Subject: Re: [webkit-gtk] guchar * sometimes a utf8, sometimes utf16?
To: Leo Ufimtsev <Leonidas@redhat.com>
Cc: webkit-gtk@lists.webkit.org


On Thu, May 31, 2018 at 5:05 PM, Leo Ufimtsev <Leonidas@redhat.com> wrote:

    Hello guys,

    The following function:
    guchar * webkit_web_resource_get_data_finish(..)

    Sometimes returns utf8 and sometimes utf16. Is there a way to tell them apart?

    Thank you. 


Hm, good question. I don't know the answer, but here are some thoughts anyway:

We use guchar instead of gchar to indicate that it's a byte array, not a string, so it's not expected to be UTF-8. In fact, it could be any arbitrary encoding, not just UTF-16. I've seen more esoteric encodings before, particularly for CJKV websites. Of course, it might not be an HTML resource at all, it could be an image or an executable file or anything.

Assuming you know it is an HTML doc, then I think you want to parse the charset from the meta tag. Of course, that's a bit difficult because you do not know the encoding you should be using to parse it until after you have somehow successfully parsed it. I don't know how you would do it, but clearly WebKit knows how, somewhere. In Epiphany, our use is limited to saving resources on disk, which then get parsed by other applications when you open them, which is why we've never needed to deal with this problem.

For a website loaded via HTTP, the encoding could also have been set by an HTTP header. There's really nothing you can do in that case, as you don't have access to that.

I think Firefox uses an encoding detector. WebKit does not, but it's one option. ICU can do this, as can uchardet. Problem is, they are probabilistic and do not work well for some important encodings (e.g. GB18030). But that might work well enough for your needs.

Michael
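
The meta-tag idea from the email above could be sketched roughly as follows. This is an illustration only, not what WebKit or SWT does: the class and method names are hypothetical, the regular expression is a simplification of real HTML charset sniffing, and reading the head as ISO-8859-1 only works for ASCII-compatible encodings (it would not find the declaration in a UTF-16 document).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetSniffer {

    // Matches charset=NAME with optional quotes, as in <meta charset="..."> or
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset if one is found and supported, else the fallback. */
    static Charset sniff(byte[] html, Charset fallback) {
        // ISO-8859-1 maps every byte to one char, so this first pass never
        // fails and leaves ASCII markup readable.
        int limit = Math.min(html.length, 1024);
        String head = new String(html, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through.
            }
        }
        return fallback;
    }

    /** Decodes the document with the sniffed charset, defaulting to UTF-8. */
    static String decode(byte[] html) {
        return new String(html, sniff(html, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] page = "<meta charset=\"ISO-8859-1\"><p>caf\u00e9</p>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(page, StandardCharsets.UTF_8));
        System.out.println(decode(page));
    }
}
```

As the email notes, a charset set only by an HTTP header is not visible to this kind of sniffing.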
Comment 6 Eclipse Genie CLA 2018-06-13 17:55:30 EDT
New Gerrit change created: https://git.eclipse.org/r/124502
Comment 7 Leo Ufimtsev CLA 2018-06-13 18:01:47 EDT
(In reply to Eclipse Genie from comment #6)
> New Gerrit change created: https://git.eclipse.org/r/124502

Patch for review/merge. 

Note to self: I also need to submit a bug report to webkitgtk and cc Michael Catanzaro <mcatanzaro@igalia.com> (the webkitgtk dev I spoke with, who works in that field).
Comment 9 Leo Ufimtsev CLA 2018-06-28 14:54:24 EDT
Note to self:

Tomas reported that there are compile issues on windows with the added unicode characters in the javadoc. I'll have to remove those.

Thomas's email:

Hi Leo,

When trying to compile SWT on Windows, I'm getting the errors

    [javac] e:\swt\eclipse.platform.swt.binaries\bundles\org.eclipse.swt.gtk.linux.x86_64\temp.folder\@dot.src\org\eclipse\swt\internal\Converter.java:277: error: unmappable character for encoding Cp1252
    [javac]      * Some times it can get confused if it receives two non-null bytes. e.g ╨? = (UTF-16  [01,04])
    [javac]                   ^
    [javac] e:\swt\eclipse.platform.swt.binaries\bundles\org.eclipse.swt.gtk.linux.x86_64\temp.folder\@dot.src\org\eclipse\swt\internal\Converter.java:336: error: unmappable character for encoding Cp1252
    [javac]             //            E.g Unicode Hyphen U+2010 'ΓÇ?' ( which btw different from the ascii U+002D  '-' Hyphen-Minus)
    [javac]                                                        ^
    [javac] 2 errors
    [javac] Compile failed; see the compiler error output for details.

in Converter.java. I reckon they are caused by your commit 1823ab237d69270276b2e681b68b46bf881f6abf.

Cheers,
Tom
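
For context, these errors arise when javac reads the source with the platform default encoding (Cp1252 here) and meets non-ASCII bytes it cannot map. Besides removing the characters, another common way to keep such a source file compilable everywhere is to write them as Unicode escapes. The helper below is a hypothetical sketch of that conversion and is not part of SWT.

```java
public class AsciiEscaper {

    /** Replaces every char above 127 with its Java Unicode escape. */
    static String toAsciiEscapes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                // Four hex digits per UTF-16 code unit, so a surrogate
                // pair simply becomes two escapes.
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The string holds the Unicode Hyphen U+2010 and the ASCII Hyphen-Minus.
        System.out.println(toAsciiEscapes("Hyphen \u2010 vs Hyphen-Minus -"));
    }
}
```

The output contains only ASCII, so it reads the same under Cp1252 and UTF-8.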
Comment 10 Leo Ufimtsev CLA 2018-06-29 11:53:57 EDT
(In reply to Leo Ufimtsev from comment #9)
> Note to self:
> 
> Tomas reported that there are compile issues on windows with the added
> unicode characters in the javadoc. I'll have to remove those.
> 

Removed via:
https://git.eclipse.org/r/#/c/125248/