Bug 344283

Summary: wrong xhtml encoding determined from content (UTF-8 instead of iso-8859-1)
Product: [WebTools] Java Server Faces
Component: Core
Reporter: Will Piasecki <alporygone>
Assignee: Ian Trimble <ian.trimble>
Status: NEW ---
QA Contact:
Severity: major
Priority: P3
CC: gerpres, mauromol, raghunathan.srinivasan, thatnitind, vitor.rm
Version: unspecified
Target Milestone: Future
Hardware: PC
OS: Linux
Whiteboard:
Bug Depends on: 371430
Bug Blocks:
Attachments:
Project demonstrating the problem (flags: none)

Description Will Piasecki CLA 2011-04-29 10:24:10 EDT
Build Identifier: 20110218-0911

I have a project in which all files use ISO-8859-1 encoding. Every file I open is parsed correctly, except for xhtml (facelets) files.

I'm using Eclipse Helios on Ubuntu Linux 9.10. I also use the JBoss Tools, JBoss Seam, GReclipse and Subclipse plugins.

Reproducible: Always

Steps to Reproduce:
1. Opened the workspace which I used in Eclipse 3.5 (where the error doesn't happen)
 - This workspace pulls the project from an SVN repository
2. Opened any xhtml/xml file
3. Left click on file > properties > Resource > Text File Encoding is set to "Default (determined from content: UTF-8)" even though the xhtml file starts with "<?xml version="1.0" encoding="ISO-8859-1"?>"
Comment 1 Nitin Dahyabhai CLA 2011-04-29 10:39:16 EDT
It sure sounds like bug 318768, but that should have been fixed.  Is it just xhtml files with the facelets namespaces set up that are problematic, or all xhtml files?
Comment 2 Will Piasecki CLA 2011-04-29 10:51:32 EDT
(In reply to comment #1)
> It sure sounds like bug 318768, but that should have been fixed.  Is it just
> xhtml files with the facelets namespaces set up that are problematic, or all
> xhtml files?

Thanks for the quick reply =)

Well, I just created a new file and added our default structure, like this one:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE composition PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<ui:composition xmlns="http://www.w3.org/1999/xhtml"
	xmlns:s="http://jboss.com/products/seam/taglib"
	xmlns:ui="http://java.sun.com/jsf/facelets"
	xmlns:f="http://java.sun.com/jsf/core"
	xmlns:h="http://java.sun.com/jsf/html"
	xmlns:rich="http://richfaces.org/rich"
	xmlns:a="http://richfaces.org/a4j">
</ui:composition>

And it found its encoding as being UTF-8.

Then I tried removing the namespaces:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE composition PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<ui:composition xmlns="http://www.w3.org/1999/xhtml">
</ui:composition>

And it parsed correctly, as ISO-8859-1!
Comment 3 Will Piasecki CLA 2011-04-29 10:52:31 EDT
> And it parsed correctly, as ISO-8859-1!

I meant that it identified the encoding correctly.
Comment 4 Nitin Dahyabhai CLA 2011-04-29 11:18:28 EDT
And can you reproduce this without the JBoss Tools installed?
Comment 5 Will Piasecki CLA 2011-04-29 11:28:57 EDT
(In reply to comment #4)
> And can you reproduce this without the JBoss Tools installed?

Tried on a fresh install of Eclipse 3.6, without any plugins: same behavior. Once the namespaces are included, it starts identifying the file as UTF-8.
Comment 6 Raghunathan Srinivasan CLA 2011-04-29 18:44:21 EDT
I believe this has been resolved in Indigo, see https://bugs.eclipse.org/bugs/show_bug.cgi?id=341973
Comment 7 Will Piasecki CLA 2011-05-02 08:58:07 EDT
(In reply to comment #6)
> I believe this has been resolved in Indigo, see
> https://bugs.eclipse.org/bugs/show_bug.cgi?id=341973

Hello!

Indeed, this seems like the fix... Forgive me, but must I re-download Eclipse, or use Check for Updates?

I tried checking for updates, but no Eclipse update was available.
Comment 8 Will Piasecki CLA 2011-05-02 09:49:24 EDT
> Indeed, this seems like the fix... Forgive me, i must redownload eclipse? Check
> for Updates? 
> 
> Tried checking for updates, but no eclipse update was available

Ah, Eclipse Indigo is 3.7... I'll give it a shot!

Thanks!
Comment 9 Will Piasecki CLA 2011-05-02 10:43:09 EDT
(In reply to comment #6)
> I believe this has been resolved in Indigo, see
> https://bugs.eclipse.org/bugs/show_bug.cgi?id=341973

Well, I tried downloading Eclipse Indigo M7:
Version: 3.7.0
Build id: I20110428-0848

And I got the same behavior; it still determines UTF-8 from content, even though the declared encoding is ISO-8859-1 and there are ISO-8859-1 characters in the file.

It also happened before installing the web tools (fresh install) and after I installed JBoss Tools.

The file command on Linux gives me the following:

$ file -bi edit.xhtml 
application/xml; charset=iso-8859-1
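
For comparison with the `file` output above, here is a minimal Python sketch of reading the encoding straight from the XML declaration. It is an illustration only and not the detector Eclipse uses; the function name `declared_encoding` and the sample bytes are hypothetical, and the sketch assumes an ASCII-compatible encoding (it would not handle UTF-16).

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*["\'][^"\']+["\']'
    rb'\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(head: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    # Skip a UTF-8 byte order mark if one is present.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    match = _XML_DECL.match(head)
    return match.group(1).decode("ascii") if match else None


# Hypothetical sample resembling the start of the facelet in this report.
sample = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<ui:composition/>'
print(declared_encoding(sample))  # prints ISO-8859-1
```

The lookup depends only on the first line of the file, so the namespaces declared on the root element play no part in it.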
Comment 10 Ian Trimble CLA 2011-08-30 18:32:04 EDT
Will,

Your DOCTYPE element appears incorrect to me. You are mapping root element "composition" to the XHTML 1.0 Transitional DTD, and not "html" (which is more appropriate for the selected DTD). In fact, what I am reading tells me that an XHTML document MUST have "html" as the root element.

This seems more correct:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:ui="http://java.sun.com/jsf/facelets"
    xmlns:s="http://jboss.com/products/seam/taglib"
    xmlns:f="http://java.sun.com/jsf/core"
    xmlns:h="http://java.sun.com/jsf/html"
    xmlns:rich="http://richfaces.org/rich"
    xmlns:a="http://richfaces.org/a4j">

    <ui:composition>
        ...
    </ui:composition>

</html>

Using this form results in the properties dialog showing the correct encoding ("ISO-8859-1").

Also, note that the JSF facet need not be present on a project to see this behaviour.
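
As a rough illustration of the distinction drawn above, the following Python sketch checks whether a document's root element is `html` in the XHTML namespace. It is not the WTP content describer; the helper name `root_is_xhtml_html` is hypothetical, and the two sample documents are shortened versions of the ones in this report (DOCTYPE and most namespaces omitted).

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def root_is_xhtml_html(document: bytes) -> bool:
    """Return True if the root element is <html> in the XHTML namespace."""
    root = ET.fromstring(document)
    # ElementTree expands qualified names to the form "{namespace}localname".
    return root.tag == "{%s}html" % XHTML_NS


# Root element is ui:composition (facelets namespace), as in comment 2.
composition_root = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<ui:composition xmlns="http://www.w3.org/1999/xhtml"
    xmlns:ui="http://java.sun.com/jsf/facelets">
</ui:composition>"""

# Root element is html (XHTML namespace), as suggested in comment 10.
html_root = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:ui="http://java.sun.com/jsf/facelets">
    <ui:composition/>
</html>"""

print(root_is_xhtml_html(composition_root))
print(root_is_xhtml_html(html_root))
```

Both samples declare the same encoding; only the root element differs between them.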

 - Ian
Comment 11 Ian Trimble CLA 2011-09-08 17:06:50 EDT
Will,

We cannot move forward without your feedback on whether you agree that the documents for which the wrong encoding is reported are not valid to begin with. The 3.3.1 window is closing. Please advise.

 - Ian
Comment 12 Vitor R Munhoz CLA 2011-09-14 17:19:17 EDT
Hi Ian.

I am having the same problem here. The project runs OK in Eclipse Galileo, but in Eclipse Helios the problem reported by Will occurs.

I tested with the code you sent above and it didn't work. Eclipse always uses the encoding set under Window -> Preferences -> General -> Content Types -> HTML -> Facelet (UTF-8).

Version: Helios Service Release 2
Build id: 20110218-0911
Comment 13 Vitor R Munhoz CLA 2011-09-15 09:15:45 EDT
I also created a new Dynamic Web Project containing only the code you sent, and it didn't work either.

I am on Windows 7 (64-bit), using Eclipse Helios (32-bit).
Comment 14 Vitor R Munhoz CLA 2011-09-15 10:23:12 EDT
And, as Will said in comment #2, when I removed the namespaces it works. But of course I can't use ui:composition without the namespace.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <ui:composition>
        ...
    </ui:composition>

</html>
Comment 15 Raghunathan Srinivasan CLA 2011-09-15 13:38:49 EDT
Re-targeting to Juno. We can backport to the next service release once we resolve the issue.
Comment 16 Raghunathan Srinivasan CLA 2012-02-01 13:00:21 EST
Investigate
Comment 17 Ian Trimble CLA 2012-02-13 17:16:07 EST
Depends on Bug 371430.
Comment 18 Raghunathan Srinivasan CLA 2012-04-12 14:28:11 EDT
waiting for bug 371430
Comment 19 Will Piasecki CLA 2012-04-20 12:46:44 EDT
Hi. Sorry for the long delay. The problem is not happening anymore, and I didn't have time to investigate what happened.

I forced the configuration to ISO-8859-1 in several places:
- Project properties > Text file encoding
- Preferences > General > Workspace > Text file encoding
- Preferences > XML > XML Files > Encoding
- Preferences > Web > HTML Files > Encoding

There was also a file in the workspace (I can't remember the name) that is created when you configure the encoding of a specific file (right click file > Properties > Text file encoding). I think I deleted those files.
Comment 20 Raghunathan Srinivasan CLA 2012-05-17 14:19:44 EDT
Based on comment 13, deferring from Juno for us to verify.
Comment 21 Mauro Molinari CLA 2014-05-06 10:58:01 EDT
Today I encountered this problem.
On a Windows machine (default OS encoding Cp1252), for a facelet with the following header:

<?xml version="1.0" encoding="UTF-8"?> 

the default encoding determined from content is Cp1252. I have to force it to UTF-8 to get the right encoding.

Using Kepler SR-2.
It's strange that on another project on a different machine other XHTML files are recognized correctly as UTF-8 (even without the <?xml> header).
Comment 22 Mauro Molinari CLA 2014-05-22 06:51:36 EDT
Created attachment 243392
Project demonstrating the problem

Here is a project that demonstrates the problem I'm seeing. If you look at the properties of orderDialog.xhtml you'll see that its encoding is determined from content as Cp1252, although the XML declaration on the first row says differently.

On the other hand, test.xhtml recognizes UTF-8 from content even if the XML declaration is missing.

Of course, if I remove the XML declaration from orderDialog.xhtml the content is still detected as Cp1252.
Comment 23 Mauro Molinari CLA 2017-09-04 10:27:45 EDT
Today I saw this problem again on a co-worker's machine: XHTML files that are UTF-8 encoded (even with a proper XML declaration) are recognized by Eclipse ("from content") as Cp1252 (he's working on a Windows system).
Using Neon.3 with WTP 3.8.2.

Can anyone look at this, with the example project I provided in my previous comment?
Comment 24 Mauro Molinari CLA 2017-10-25 12:11:56 EDT
Increasing severity to major; this is a big problem for non-English users dealing with huge projects. In order to share the correct setting, you have to change the resource encoding on a per-file basis. A change to the default encoding at workspace level cannot be shared among team members through SCM.

Please correct me if I'm wrong.
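
To make the per-file workaround concrete, here is a small Python sketch that lists explicit encoding overrides from a project preferences file. It rests on an assumption: that Eclipse records per-resource encodings as `encoding/...` keys in the project's `.settings/org.eclipse.core.resources.prefs`. The function name `encoding_overrides`, the sample path and the sample values are hypothetical and are not taken from the attached project.

```python
def encoding_overrides(prefs_text: str) -> dict:
    """Map resource keys to charsets from 'encoding/...' preference lines."""
    overrides = {}
    for line in prefs_text.splitlines():
        line = line.strip()
        if not line.startswith("encoding/") or "=" not in line:
            continue
        key, charset = line.split("=", 1)
        # Assumed layout: 'encoding//<path>' for a file or folder,
        # 'encoding/<project>' for the project-wide default.
        overrides[key[len("encoding/"):]] = charset
    return overrides


# Hypothetical preferences content, for illustration only.
sample_prefs = """eclipse.preferences.version=1
encoding//WebContent/orderDialog.xhtml=UTF-8
encoding/<project>=UTF-8
"""

for resource, charset in encoding_overrides(sample_prefs).items():
    print(resource, charset)
```

If that assumption holds, each forced file adds one line to a file that can be committed, which is why the workaround scales poorly on large projects.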
Comment 25 Mauro Molinari CLA 2017-10-25 12:16:16 EDT
Even worse, if I change the "Default encoding" for Facelets to UTF-8 in the workspace settings (in General | Content Types | Text | HTML | Facelet and Facelet Composite Component), it seems to be ignored: my XHTML files are still detected "from contents" as Cp1252... :-(
Comment 26 Ian Trimble CLA 2017-10-25 12:29:40 EDT
The biggest issue I see from your sample project is you have XML files that don't declare a namespace (no root xmlns attribute). To be correct XHTML files, they need to declare the correct namespace and also the root element should be "html". When I add the correct namespace value to your documents, they report the expected encoding.
Comment 27 Mauro Molinari CLA 2017-10-25 12:47:33 EDT
Probably the structure may be improved, as you pointed out (thank you!!), but please consider that:
- if there's a structure problem, the facelet editor should probably notify it in some way (i.e.: the root xmlns attribute is missing or the <html> root element is missing, etc.)
- in my current project I have a lot of facelets with no html root element and root xmlns declaration, but they work perfectly (using Mojarra under Tomcat), so I think the IDE should in some way cope with them
- I still think the XML declaration should be honoured by Eclipse to determine the file content type, in any case

What do you think?
Comment 28 Mauro Molinari CLA 2017-10-26 02:06:21 EDT
Another thought: my XHTML files are missing the html root element and the root xmlns declaration (which I still think is somewhat valid, since the XML is well formed and all elements are prefixed with the correct namespace alias). Even so, and accepting for now that the XML declaration is ignored, why choose Cp1252 for those files when the default text file encoding at project level is set to UTF-8?
(I don't know whether this is the case for the attached sample project, but it certainly is for my real-world project where I hit this problem)
Comment 29 Nitin Dahyabhai CLA 2017-10-26 05:11:00 EDT
(In reply to Ian Trimble from comment #26)
> The biggest issue I see from your sample project is you have XML files that
> don't declare a namespace (no root xmlns attribute). To be correct XHTML
> files, they need to declare the correct namespace and also the root element
> should be "html". When I add the correct namespace value to your documents,
> they report the expected encoding.

Ian, I'm willing to change the HTMLResourceEncodingDetector code that's being called to make use of the XML declaration's encoding attribute regardless, if that's desirable. Whether the file is actually considered a Facelet seems to be decided, separately, in org.eclipse.jst.jsf.core.internal.contenttype.AbstractContentDescriberForFacelets. Please weigh in, either way.
Comment 30 Ian Trimble CLA 2017-10-26 13:02:42 EDT
(In reply to Nitin Dahyabhai from comment #29)

> Ian, I'm willing to change the HTMLResourceEncodingDetector code that's
> being called to make use of the XML declaration's encoding attribute
> regardless, if that's desirable. Whether the file is actually considered a
> Facelet seems to be decided, separately, in
> org.eclipse.jst.jsf.core.internal.contenttype.
> AbstractContentDescriberForFacelets. Please weigh in, either way.

I'd appreciate that, Nitin. My thinking is that first we should understand that a document is intended to be XHTML (which is, of course, the purpose of a namespace declaration) before we try to understand if it's intended to be a facelet.

Any way we slice this, I don't believe the severity to be major - by making the document declare that it is XHTML (otherwise, it's just another XML file), the issue is resolved. We may be willing and able to loosen this requirement, but wanting the tooling to just make a good guess doesn't increase the severity, IMO.
Comment 31 Mauro Molinari CLA 2017-10-26 13:35:02 EDT
The reasoning behind my decision to increase the severity is that a wrong guess by the IDE in this context may lead to data corruption (garbled accented letters in the XML file), which may not be immediately evident to the user (unless they look at the whole file contents). Also, strange and apparently inexplicable errors may occur at runtime (Mojarra was complaining about an invalid template path in my last case, as soon as I added an accented letter to the file :-O). Anyway, feel free to change it if you think differently.
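
A minimal Python sketch of the kind of corruption meant here: UTF-8 bytes are decoded as Cp1252 and then written back. The sample word is hypothetical and the sketch says nothing about how Eclipse itself reads or saves the file.

```python
# Hypothetical sample text containing one accented letter.
original = "café"

# The bytes on disk when the file is saved as UTF-8.
utf8_bytes = original.encode("utf-8")

# What an editor shows if it decodes those bytes as Cp1252 instead.
misread = utf8_bytes.decode("cp1252")
print(misread)  # prints cafÃ©

# If the misread text is saved again as UTF-8, the damage becomes permanent.
saved_again = misread.encode("utf-8")
print(saved_again != utf8_bytes)  # prints True
```

ASCII-only content survives this round trip unchanged, which is why the problem can go unnoticed until an accented letter appears.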
Comment 32 Nitin Dahyabhai CLA 2017-10-26 19:36:52 EDT
Opened bug 526538.
Comment 33 Ian Trimble CLA 2017-10-26 19:59:42 EDT
(In reply to Mauro Molinari from comment #31)
> The reasoning behind my decision to increase the severity is that a wrong
> guess by the IDE in this context may lead to data corruption (messed
> accented letters in the XML file), which may also not be immediately evident
> to the user (unless he looks at the whole file contents). Also, strange and
> apparently unexplainable runtime errors may occur at runtime (Mojarra was
> complaining about an invalid template path in my last case, as soon as I
> added an accented letter in the file :-O). Anyway feel free to change if you
> think differently.

Data corruption is certainly not trivial, so I do understand you raising the severity. My point is that once the document is recognized as an XHTML document (due to the appropriate namespace declaration), functionality is correct, and so data corruption will not then occur.

Let's see where Nitin's fix to the bug he has logged gets us.

Thanks,
 - Ian
Comment 34 Nitin Dahyabhai CLA 2017-10-31 19:24:07 EDT
Resolved bug 526538.