Bug 318768

Summary: XHTML file encoding not detected from XML declaration
Product: [WebTools] WTP Source Editing
Reporter: Sven Köhler <sven.koehler>
Component: wst.html
Assignee: Rakesh <rakes123>
Status: RESOLVED FIXED
QA Contact: Nitin Dahyabhai <thatnitind>
Severity: normal
Priority: P3
CC: nsand.dev
Version: unspecified
Flags: nsand.dev: review+
Target Milestone: 3.2.3   
Hardware: All   
OS: All   
Whiteboard:
Attachments:
patch (Flags: nsand.dev: iplog+)

Description Sven Köhler CLA 2010-07-02 18:32:55 EDT
Build Identifier: I20100608-0911

Below, I describe how to create four files:
abc.xml, abc.htm, abc.html, abc.xhtml

The XML editor handles abc.xml correctly; in particular, the charset is detected properly. The other three files, however, are not opened properly by the HTML editor. On my machines, the editor chooses cp1252 on Windows and UTF-8 on Linux instead of the ISO-8859-15 specified in the XML declaration.

I believe that my example is proper XHTML and should be handled correctly by the HTML editor.

Reproducible: Always

Steps to Reproduce:
1. Create a file called abc.xml with these contents:
<?xml version="1.0" encoding="ISO-8859-15"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
	"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
	<head>
		<title>foobar</title>
	</head>
	<body>
		äöü€äöü€
	</body>
</html>


2. Copy abc.xml to abc.htm, abc.html, and abc.xhtml.
3. Open the four abc.* files.
4. Observe that the umlauts and/or the euro signs are broken.
5. Select Edit->Set Encoding in the menu.
6. Observe that the charset of all three copies is not ISO-8859-15.
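
For convenience, steps 1 and 2 can be scripted. The following is only a sketch: the class name CreateAbcFiles and the method writeAll are made up for illustration, and it assumes the body line was meant to contain umlauts and euro signs, as step 4 suggests. The important part is that the bytes on disk are really ISO-8859-15, where the euro sign is the single byte 0xA4.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of WTP: writes the four test files in ISO-8859-15.
public class CreateAbcFiles {
    static final Charset LATIN9 = Charset.forName("ISO-8859-15");

    // Same document as in step 1. The non-ASCII characters are written as
    // Unicode escapes so this source file does not depend on its own encoding.
    static final String CONTENT = String.join("\n",
        "<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?>",
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"",
        "\t\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">",
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">",
        "\t<head>",
        "\t\t<title>foobar</title>",
        "\t</head>",
        "\t<body>",
        "\t\t\u00e4\u00f6\u00fc\u20ac\u00e4\u00f6\u00fc\u20ac",
        "\t</body>",
        "</html>",
        "");

    // Writes abc.xml, abc.htm, abc.html and abc.xhtml into dir and returns their paths.
    static List<Path> writeAll(Path dir) throws IOException {
        List<Path> written = new ArrayList<>();
        for (String ext : new String[] {"xml", "htm", "html", "xhtml"}) {
            Path file = dir.resolve("abc." + ext);
            Files.write(file, CONTENT.getBytes(LATIN9));
            written.add(file);
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        for (Path file : writeAll(Paths.get("."))) {
            System.out.println("wrote " + file);
        }
    }
}
```

Run it in an empty project folder, refresh the workspace, and continue with step 3.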
Comment 1 Nitin Dahyabhai CLA 2010-07-07 13:58:49 EDT
HTML currently relies on the META tag for specifying encoding when a BOM is not present.
Comment 2 Sven Köhler CLA 2010-07-07 17:08:33 EDT
(In reply to comment #1)
> HTML currently relies on the META tag for specifying encoding when a BOM is not
> present.

You really consider this an enhancement, even though it is a primary cause of the side effect that Eclipse destroys valid XHTML documents when applying patches etc.!?

What if a BOM is present?
Comment 3 Sven Köhler CLA 2010-07-07 17:10:50 EDT
Also note that XHTML documents should be handled as UTF-8 even if there is neither a BOM nor an <?xml ...?> declaration. AFAIK, you need some logic to detect XHTML even if there is no XML declaration, maybe using the DOCTYPE.
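
To make the suggested order concrete, here is a rough sketch of what is meant. It is not WTP code and not a proposed patch: XhtmlEncodingSketch and detect are hypothetical names, it only recognizes a UTF-8 BOM and ASCII-compatible encodings (no UTF-16 handling), and it returns null when the caller should fall back to the existing META-based logic.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the detection order: BOM, XML declaration, XHTML DOCTYPE.
public class XhtmlEncodingSketch {
    // encoding="..." inside an XML declaration at the very start of the content.
    private static final Pattern XML_DECL = Pattern.compile(
        "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // A DOCTYPE whose public identifier names one of the W3C XHTML DTDs.
    private static final Pattern XHTML_DOCTYPE = Pattern.compile(
        "<!DOCTYPE\\s+html\\s+PUBLIC\\s+[\"']-//W3C//DTD XHTML",
        Pattern.CASE_INSENSITIVE);

    // head: the first bytes of the file. Returns a charset name, or null if
    // nothing XML-specific was found and the META tag should be consulted.
    static String detect(byte[] head) {
        // 1. A UTF-8 byte order mark wins.
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // Decode as ISO-8859-1 so every byte maps to exactly one char; the
        // declaration and DOCTYPE themselves are plain ASCII.
        String text = new String(head, StandardCharsets.ISO_8859_1);

        // 2. An explicit encoding in the XML declaration.
        Matcher decl = XML_DECL.matcher(text);
        if (decl.find()) {
            return decl.group(1);
        }
        // 3. XHTML recognized by its DOCTYPE, but no declaration: assume UTF-8.
        if (XHTML_DOCTYPE.matcher(text).find()) {
            return "UTF-8";
        }
        // 4. Not recognizably XHTML: leave it to the META-based detection.
        return null;
    }
}
```

With the abc.xhtml from the description, detect would return "ISO-8859-15"; the same file without the first line would yield "UTF-8".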
Comment 4 Nitin Dahyabhai CLA 2010-07-08 11:50:15 EDT
(In reply to comment #2)
> What if a BOM is present?

Actually I'm not sure anymore. The doctype is used to decide whether to treat it as XHTML, but that's unrelated to how the encoding is chosen.
Comment 5 Rakesh CLA 2010-07-09 10:56:26 EDT
Created attachment 173871 [details]
patch

HTMLResourceEncodingDetector changed to differentiate between XHTML and HTML. HTMLHeadTokenizer also changed to know whether the content type is XHTML or not.
Comment 6 Nick Sandonato CLA 2010-11-02 14:47:03 EDT
Looks good, Rakesh. Thanks.