
Bug 336734

Summary: [content type] Charset of html files should be derived from content
Product: [Eclipse Project] Platform Reporter: Ralf Sternberg <rsternberg>
Component: Resources Assignee: Platform-Resources-Inbox <platform-resources-inbox>
Status: CLOSED WONTFIX QA Contact:
Severity: enhancement    
Priority: P3 CC: daniel_megert, markus.kell.r, nsand.dev, prakash, remy.suen, Szymon.Brandys, thatnitind
Version: 3.7   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Whiteboard: stalebug

Description Ralf Sternberg CLA 2011-02-09 11:51:12 EST
Most html files declare their encoding in a meta header like this:
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>

This encoding often differs from the source file encoding set at the project or workspace level. Typical examples are about.html or package.html files in Eclipse projects that declare charset=ISO-8859-1 while the Java code in those projects is encoded in UTF-8.

Currently, the encoding of all *.html files seems to be inherited from the container. It would be great if it could be derived from the content.
Comment 1 Prakash Rangaraj CLA 2011-02-09 12:36:57 EST
Though we parse the content for content-type, we don't do it to determine the char set. I guess this would be a new extension point and should be provided by the Core Resources, rather than Platform UI
Comment 2 John Arthorne CLA 2011-02-09 13:03:29 EST
This can be handled by our existing "content description" mechanism, which includes both encoding and content type. We currently only provide content describers for XML, rather than HTML. Nitin, does WTP already have an HTML content describer?
Comment 3 Nitin Dahyabhai CLA 2011-02-09 13:14:32 EST
Yes, and it does use the meta tag's information.
Comment 4 John Arthorne CLA 2011-02-09 13:17:16 EST
Thanks Nitin. This functionality appropriately belongs in Web Tools, which you can install if you want this capability. We have no HTML tooling in the platform so providing HTML content analysis in the core platform doesn't make sense.
Comment 5 Ralf Sternberg CLA 2011-02-09 16:01:35 EST
Thanks. Do I have to install Web Tools entirely to get content analysis for html, or can you point me to a single bundle that does it?
Comment 6 Nitin Dahyabhai CLA 2011-03-03 17:52:29 EST
(In reply to comment #5)
> Thanks. Do I have to install Web Tools entirely to get content analysis for
> html, or can you point me to a single bundle that does it?

The functionality lies in the org.eclipse.wst.html.core bundle.
Comment 7 Dani Megert CLA 2011-03-04 03:27:27 EST
*** Bug 338864 has been marked as a duplicate of this bug. ***
Comment 8 Dani Megert CLA 2011-03-04 03:33:14 EST
I think we should reconsider this. XML tooling is also not provided by the Platform but we ship the XML content type and we also ship HTML files. And it's important that the encoding is right even when opened with the Text editor.

Maybe we could ship the content type definition from 'org.eclipse.wst.html.core' and WST would remove its own definition? We might even think of keeping the ID, so that those referring to the WST content type continue to work. We did similar things when moving commands to lower layers.
Comment 9 Nitin Dahyabhai CLA 2011-03-04 11:09:09 EST
(In reply to comment #8)
> Maybe we could ship the content type definition from
> 'org.eclipse.wst.html.core' and WST would remove its own definition? We might
> even think of keeping the ID, so that those referring to the WST content type
> continue to work. We did similar things when moving commands to lower layers.

All of the existing detection support lies in the org.eclipse.wst.html.core.internal.contenttype package in /cvsroot/webtools under sourceediting/plugins/org.eclipse.wst.html.core/src/org/eclipse/wst/html/core/internal/contenttype, except for the contents of org.eclipse.wst.html.core.internal.encoding, which build on it for working with IDocuments.
Comment 10 Szymon Brandys CLA 2011-03-07 06:58:10 EST
(In reply to comment #8)
> I think we should reconsider this. XML tooling is also not provided by the
> Platform but we ship the XML content type and we also ship HTML files. And it's
> important that the encoding is right even when opened with the Text editor.

I think that core.runtime and resources should not provide content types. At the very least, we should keep this set as small as possible.

My opinion is that each application should decide which content types it supports. So if we decided to provide the HTML content type, it should be done at the application level, i.e. the Eclipse IDE.

> Maybe we could ship the content type definition from
> 'org.eclipse.wst.html.core' and WST would remove its own definition? We might
> even think of keeping the ID, so that those referring to the WST content type
> continue to work. We did similar things when moving commands to lower layers.

It makes sense to me.
Comment 11 Paul Webster CLA 2011-03-07 07:31:51 EST
Other similar content types are provided by core.resources or core.contenttype itself.

PW
Comment 12 John Arthorne CLA 2011-03-07 09:35:51 EST
(In reply to comment #8)
> I think we should reconsider this. XML tooling is also not provided by the
> Platform but we ship the XML content type and we also ship HTML files. And it's
> important that the encoding is right even when opened with the Text editor.

I don't really buy the argument that since we support XML we should support HTML. Yes, it is important that the editor gets the correct content type when opening a file, but that is true for any type of file. XML is special because it is such a widely used format for data serialization. The framework itself reads and writes XML files in many places. This is why, for example, the Java class libraries include XML parsers but not HTML parsers.

> Maybe we could ship the content type definition from
> 'org.eclipse.wst.html.core' and WST would remove its own definition? We might
> even think of keeping the ID, so that those referring to the WST content type
> continue to work. We did similar things when moving commands to lower layers.

It's not just a matter of including a content definition. We would also need to include a minimal HTML parser, and HTML is far less structured and uniform than XML. We would need to support XHTML, HTML 5, etc.  I suspect this is a fairly big chunk of code to include and maintain.
Comment 13 Markus Keller CLA 2011-03-07 10:30:26 EST
The necessary analysis for HTML is a lot more complicated than for XML. For XML, you can use a standard parser and look for the 'encoding' attribute in the prolog. For HTML, the encoding is buried deep in the document and is much harder to get right (and parsers are even advised to "guess" the encoding), see e.g.: http://www.w3.org/TR/html4/charset.html#h-5.2.2

Maybe a low-cost solution could be shipped with the platform with priority="low" and alias-for="<the WST content type>", similar to the basic definition of *.properties in org.eclipse.core.contenttype/plugin.xml?

By low-cost, I mean a simple case-insensitive ASCII parser that looks for
    <html
followed by
    <head
followed by
    <meta\s*http-equiv="Content-Type"\s*content="text/html;\s*charset=([^"]+)"
and gives up at
    </head>
.
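
The scan described in this comment could look roughly like the following Java sketch. It is an illustration only: the class and method names (HtmlCharsetSniffer, findCharset) are hypothetical and not part of any Eclipse bundle, and it assumes the start of the file has already been decoded as ASCII-compatible text. The regular expression is the one given in the comment, compiled case-insensitively.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not an Eclipse API. */
public class HtmlCharsetSniffer {

    // The meta pattern from the comment, matched case-insensitively.
    private static final Pattern META = Pattern.compile(
        "<meta\\s*http-equiv=\"Content-Type\"\\s*content=\"text/html;\\s*charset=([^\"]+)\"",
        Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name, or null if none is found inside the head. */
    public static String findCharset(String text) {
        int html = indexOfIgnoreCase(text, "<html", 0);
        if (html < 0) return null;
        int head = indexOfIgnoreCase(text, "<head", html);
        if (head < 0) return null;
        // Give up at </head>; if it is missing, scan to the end of the given text.
        int end = indexOfIgnoreCase(text, "</head>", head);
        if (end < 0) end = text.length();
        Matcher m = META.matcher(text).region(head, end);
        return m.find() ? m.group(1).trim() : null;
    }

    // Case-insensitive indexOf that never shifts indices (unlike toLowerCase on exotic input).
    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }
}
```

Like the pattern it is built on, this only recognizes the attribute order http-equiv followed by content with double-quoted values. Other spellings of the meta tag are deliberately left undetected.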
Comment 14 Dani Megert CLA 2011-03-17 06:31:33 EDT
> Maybe a low-cost solution could be shipped with the platform with
> priority="low" and alias-for="<the WST content type>", similar to the basic
> definition of *.properties in org.eclipse.core.contenttype/plugin.xml?
> 
> By low-cost, I mean a simple case-insensitive ASCII parser that looks for
>     <html
> followed by
>     <head
> followed by
>     <meta\s*http-equiv="Content-Type"\s*content="text/html;\s*charset=([^"]+)"
> and gives up at
>     </head>
> .

Yes, that would be enough along with a describer that provides the encoding.
Comment 15 Markus Keller CLA 2011-03-17 12:25:12 EDT
(In reply to comment #13)
To avoid scanning too much, also stop on <body, <p, and after about 10KB of text.
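
The stopping rules can be sketched as a pre-pass that trims the input before any charset search runs. This is a hedged illustration in Java: the names (BoundedHeadReader, scannablePrefix, MAX_BYTES) are hypothetical, "about 10KB" is taken as exactly 10 * 1024 bytes, and the bytes are decoded as ISO-8859-1 purely so that every byte maps to one character.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not an Eclipse API. */
public class BoundedHeadReader {

    // Assumed reading of "about 10KB".
    static final int MAX_BYTES = 10 * 1024;

    // Markers that end the scan. "<p" also matches tags such as <pre,
    // which only makes the scan stop earlier.
    private static final String[] STOP_MARKERS = {"</head>", "<body", "<p"};

    /** Returns the text that precedes the first stop marker, capped at MAX_BYTES. */
    public static String scannablePrefix(byte[] content) {
        int len = Math.min(content.length, MAX_BYTES);
        // ISO-8859-1 maps each byte to exactly one char, so offsets stay aligned.
        String text = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        int cut = text.length();
        for (String marker : STOP_MARKERS) {
            int i = indexOfIgnoreCase(text, marker);
            if (i >= 0 && i < cut) cut = i;
        }
        return text.substring(0, cut);
    }

    private static int indexOfIgnoreCase(String s, String needle) {
        for (int i = 0; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }
}
```

A describer would then search only the returned prefix for the meta tag, so a large file without a head section costs at most one 10KB read.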
Comment 16 Lars Vogel CLA 2019-11-08 04:40:19 EST
This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet.

If you have further information on the current state of the bug, please add it. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

If the bug is still relevant please remove the stalebug whiteboard tag.