Build Identifier: I20100608-0911

Below, I describe how to create four files: abc.xml, abc.htm, abc.html, and abc.xhtml.

The XML editor handles its file correctly; in particular, the charset is detected properly. The other three files, however, are not opened properly by the HTML editor. On my machines, the editor chooses cp1252 on Windows and UTF-8 on Linux instead of the ISO-8859-15 specified by the XML declaration. I believe that my example is proper XHTML and should be handled correctly by the HTML editor.

Reproducible: Always

Steps to Reproduce:
1. Create a file called abc.xml with these contents:

<?xml version="1.0" encoding="ISO-8859-15"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>foobar</title>
</head>
<body>
äöü€äöü€
</body>
</html>

2. Copy abc.xml to abc.htm, abc.html, and abc.xhtml.
3. Open the four abc.* files.
4. Observe that the umlauts and/or the euro signs are broken.
5. Select Edit->Set Encoding in the menu.
6. Observe that the charset of all three copies is not ISO-8859-15.
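To illustrate why the wrong charset breaks the characters in step 4, here is a minimal sketch in plain Java (not Eclipse code). It assumes the body text is "äöü€" stored as ISO-8859-15 bytes, and that the running JVM provides the ISO-8859-15 and windows-1252 charsets, which is common but not guaranteed. The class and method names are made up for this example.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; not part of Eclipse or WTP.
class CharsetMismatchDemo {
    // The bytes of "äöü€" as they would be stored in an ISO-8859-15 file.
    static final byte[] ISO_8859_15_BYTES = {
        (byte) 0xE4, (byte) 0xF6, (byte) 0xFC, (byte) 0xA4
    };

    // Decodes the given bytes with the named charset.
    // Malformed input is replaced with U+FFFD by the String constructor.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"ISO-8859-15", "windows-1252", "UTF-8"}) {
            System.out.println(name + " -> " + decode(ISO_8859_15_BYTES, name));
        }
    }
}
```

Decoded as windows-1252, the umlauts survive but byte 0xA4 becomes the generic currency sign instead of the euro sign. Decoded as UTF-8, the bytes are malformed and turn into replacement characters. This matches the "umlauts and/or the euro signs" symptom on the two platforms.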
HTML currently relies on the META tag for specifying encoding when a BOM is not present.
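As a rough illustration of what "BOM first, then META tag" detection can look like, here is a simplified sketch. It is not the actual WTP implementation: the class name, the regular expression, and the choice to return null when nothing is found are all assumptions made for this example, and only the UTF-8 BOM is checked.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sniffer; a sketch, not the WTP HTMLResourceEncodingDetector.
class MetaCharsetSniffer {
    // Matches charset=... inside a <meta ...> tag, with or without quotes.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
        Pattern.CASE_INSENSITIVE);

    // Returns the detected charset name, or null if neither a UTF-8 BOM
    // nor a META charset is present in the given leading bytes.
    static String sniff(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // ISO-8859-1 maps every byte to one char, so ASCII markup stays readable.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

With a detector of this shape, the file from the report yields no result, because its encoding is declared only in the XML declaration. The editor would then fall back to some default, which would be consistent with the cp1252 and UTF-8 behavior described above.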
(In reply to comment #1)
> HTML currently relies on the META tag for specifying encoding when a BOM is not
> present.

You really consider this an enhancement (even though this is a primary cause of the side effect that Eclipse destroys valid XHTML documents when applying patches etc.)!?

What if a BOM is present?
Also note that XHTML documents should be handled as UTF-8 even if there is neither a BOM nor an <?xml ...?> declaration. AFAIK, you need some logic to detect XHTML even if there is no XML declaration, maybe using the DOCTYPE.
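The logic suggested here could be sketched roughly as follows. This is only an illustration under stated assumptions, not the fix that went into WTP: the class name, the patterns, and the null return for non-XHTML content are hypothetical, and BOM and UTF-16 handling are left out for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of XHTML-aware encoding detection.
class XhtmlEncodingSketch {
    // encoding="..." inside an XML declaration at the very start of the file.
    private static final Pattern XML_DECL_ENCODING = Pattern.compile(
        "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // A DOCTYPE whose public identifier names one of the W3C XHTML DTDs.
    private static final Pattern XHTML_DOCTYPE = Pattern.compile(
        "<!DOCTYPE\\s+html\\s+PUBLIC\\s+[\"']-//W3C//DTD XHTML",
        Pattern.CASE_INSENSITIVE);

    // Returns a charset name, or null if the content does not look like XHTML
    // (in which case a caller could fall back to HTML rules such as META).
    static String detect(byte[] head) {
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher decl = XML_DECL_ENCODING.matcher(text);
        if (decl.find()) {
            return decl.group(1);           // explicit encoding wins
        }
        if (text.startsWith("<?xml") || XHTML_DOCTYPE.matcher(text).find()) {
            return "UTF-8";                 // XML default without a declaration
        }
        return null;
    }
}
```

For the abc.* files from the report this yields ISO-8859-15. For the same document without the XML declaration it yields UTF-8, based on the DOCTYPE alone.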
(In reply to comment #2)
> What if a BOM is present?

Actually, I'm not sure anymore. The doctype is used to decide whether to treat the file as XHTML, but that's unrelated to how the encoding is chosen.
Created attachment 173871
patch

HTMLResourceEncodingDetector changed to differentiate between XHTML and HTML. HTMLHeadTokenizer also changed to know whether the content type is XHTML or not.
Looks good, Rakesh. Thanks.