Bug 216007

Summary: issue with encoding in news items displayed via Phoenix
Product: Community
Component: Website
Status: RESOLVED WORKSFORME
Severity: major
Priority: P3
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Windows XP
Whiteboard:
Reporter: David Williams <david_williams>
Assignee: phoenix.ui <phoenix.ui-inbox>
QA Contact:
CC: bob.fraser, david_williams, jesper, thatnitind, webmaster

Description David Williams CLA 2008-01-21 11:37:21 EST
I added a news item today that contained the name
Møller 

That's M-&oslash;-l-l-e-r 
(but I used just the character, not the character entity). 

This seemed ok for the news.xml file ... it's UTF-8. 
But ... by the time it was translated and displayed on webtools main page, 
it was MÃ¸ller (looks like someone is thinking "ISO-8859-1"?). 

Where's the right point to fix this? Can this entry be repaired? 

I thought about using &oslash; but in the xml file, it says that's an error (undeclared entity). 

It looks fine, btw, in my newsreader.
Comment 1 David Williams CLA 2008-01-21 11:39:12 EST
Bob ... do you know how to fix? 

Comment 2 Jesper Moller CLA 2008-01-21 15:13:40 EST
As you say, somebody is interpreting the UTF-8 as ISO-8859-1. There is a mismatch between what the HTTP headers say:

Content-Type: text/html; charset=ISO-8859-1

and what the document itself says:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The document carries no XML declaration

The workaround: if it is XML, you should be able to use the numeric character reference &#248; (that's the o with a slash through it).

Phoenix is not the only Eclipse system which has i18n problems - see bug 211139: My name is cursed, I guess.
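
For illustration, the mismatch described above can be reproduced in a few lines of Python. This is only a sketch, not part of the Phoenix code, and the variable names are made up for the example. It encodes the name as UTF-8 (as news.xml does), decodes the same bytes as ISO-8859-1 (as a browser trusting the HTTP header would), and expands the numeric reference to show it denotes the same character.

```python
import html

# "Møller", written with an escape so the example does not depend on the
# encoding of the source file itself.
name = "M\u00f8ller"

# What ends up in a UTF-8 file: the o-slash becomes two bytes, 0xC3 0xB8.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)  # b'M\xc3\xb8ller'

# What a client shows when it believes "charset=ISO-8859-1": each of the
# two bytes is treated as a separate character (A-tilde, then cedilla).
misread = utf8_bytes.decode("iso-8859-1")

# The numeric character reference names the same code point, U+00F8.
from_reference = html.unescape("M&#248;ller")
print(from_reference == name)  # True
```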

Comment 3 Bob Fraser CLA 2008-01-21 15:52:31 EST
This is a tough one.  The good news is that the RSS feed looks good. The o-slash shows up fine. And using the actual character itself is legal ISO-8859-1.  I believe there is a problem in the PHP part of the code.

I saw this problem before with the non-breaking space.

Furthermore, the problem is with eclipse.org.  The character renders fine on my box running MAMP.  It may be an HTTP server setting or a PHP setting.

Comment 4 David Williams CLA 2008-01-21 21:16:40 EST
I've tried the &#248; fix, but same problem. 

In fact, changing the encoding in the browser, from ISO-8859-1 to UTF-8 allows the character to be displayed correctly, so ... it does seem a matter that the HTTP header is wrong. 

I suppose another work-around would be to use ISO-8859-1 on our XML file? 
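
As a rough Python sketch of these two observations (illustrative names only): re-reading the garbled text as UTF-8 recovers the name, which is in effect what switching the browser encoding does, and saving the name as ISO-8859-1 instead would make the bytes agree with the header the server already sends.

```python
# The text as displayed on the page: UTF-8 bytes shown as ISO-8859-1.
garbled = "M\u00c3\u00b8ller"

# Undo the wrong decode to get the raw bytes back, then decode them as
# UTF-8, which is what switching the browser to UTF-8 amounts to.
recovered = garbled.encode("iso-8859-1").decode("utf-8")

# The alternative workaround: store the name as ISO-8859-1, so the
# o-slash is the single byte 0xF8 and matches the existing header.
latin1_bytes = recovered.encode("iso-8859-1")
print(latin1_bytes)  # b'M\xf8ller'
shown_with_header = latin1_bytes.decode("iso-8859-1")
print(shown_with_header == recovered)  # True
```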

Comment 5 David Williams CLA 2008-01-22 01:15:31 EST
Changing title to emphasize that the news item, as a news item, is fine ... it's just the Phoenix version on our webpage, at 
http://www.eclipse.org/webtools 
that is incorrect. (see the "news" section of that page). 

Comment 6 David Williams CLA 2008-01-22 01:41:50 EST
Is Phoenix the right component for this? 

I think it may be a problem with server set up, PHP specifically. 

To clarify Jesper's remarks, if you do a wget -S, you can see the HTTP header is set to ISO-8859-1, even though the HTML output itself says UTF-8. 

Here's what I see

$ wget -S http://www.eclipse.org/webtools/index.php
--01:23:47--  http://www.eclipse.org/webtools/index.php
           => `index.php.2'
Resolving www.eclipse.org... 206.191.52.50
Connecting to www.eclipse.org|206.191.52.50|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Tue, 22 Jan 2008 06:23:47 GMT
  Server: Apache
  Cache-Control: max-age=86400
  Expires: Wed, 23 Jan 2008 06:23:47 GMT
  Connection: close
  Content-Type: text/html; charset=ISO-8859-1
Length: unspecified [text/html]
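
The disagreement visible in this transcript can also be checked mechanically. The following is a minimal Python sketch under stated assumptions: the helper names (header_charset, meta_charset, effective_charset) are invented for the example, the meta lookup is a simple regular expression rather than a real HTML parser, and the precedence rule (HTTP header first, meta element as fallback) reflects how browsers generally behave, leaving byte order marks aside.

```python
import re
from email.message import Message


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # None when no charset is given


# Deliberately simple: finds "charset=..." inside the first matching meta tag.
META_CHARSET = re.compile(r'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def meta_charset(html_text):
    """Return the charset declared in a meta element, or None."""
    match = META_CHARSET.search(html_text)
    return match.group(1).lower() if match else None


def effective_charset(content_type, html_text):
    """The HTTP header wins when it names a charset; else use the meta."""
    return header_charset(content_type) or meta_charset(html_text)


# Values taken from this bug report.
http_value = "text/html; charset=ISO-8859-1"
page = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />'

print(header_charset(http_value))           # iso-8859-1
print(meta_charset(page))                   # utf-8
print(effective_charset(http_value, page))  # iso-8859-1
```

With no charset in the header, effective_charset("text/html", page) falls back to the meta value, which is why removing the server default would also let the page's own declaration take effect.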

Now, my first thought was that this was being done in the Phoenix scripts (and other themes), but the header.php files look correct, and the UTF-8 charset is output very early, so I don't think the output buffer is filling up and forcing some header to be sent. 

Googling around in the PHP docs, and looking at my PHP ini file, I see this: 
= = = = = = = =
; As of 4.0b4, PHP always outputs a character encoding by default in
; the Content-type: header.  To disable sending of the charset, simply
; set it to be empty.
;
; PHP's built-in default is text/html
default_mimetype = "text/html"
;default_charset = "iso-8859-1"
= = = = = = = = 

On mine, it's commented out like this, and indeed I get no charset in the HTTP header. 
Though when they say "set it to be empty", it sounds like some people may have 
to set it explicitly, such as 
default_charset=
or something. 

So ... any of you server guys care to experiment? :) 

Comment 7 David Williams CLA 2008-01-22 01:48:42 EST
I think this should be in the "Eclipse Foundation" product, in the Server component (or website?), but ... darn if I can tell how to move it there. 

Comment 8 Denis Roy CLA 2008-01-22 06:50:38 EST
(In reply to comment #7)
> darn if I can tell how to move it there? 

Set product to Community, choose "Reassign bug to default assignee and QA contact, and add Default CC of selected component" and commit.

Comment 9 Bob Fraser CLA 2008-01-22 14:14:55 EST
I can confirm David's findings as well.  I have had problems with the non-breaking space, either with the actual character or with the entity &#160;.  It worked on my machine but not on eclipse.org.

My recommendation would be to either remove the offending PHP config that is sending the character encoding header altogether, or set it to utf-8.
Comment 10 David Williams CLA 2008-01-22 22:19:30 EST
I'm changing this to "major" (missing function), partly because I see I've run into this before and opened bug 210887. 

Also, I think the main problem is not PHP. I tried fetching a plain HTML file that says utf-8, and it too comes back with the incorrect HTTP header. 

If interested, try 
http://www.eclipse.org/webtools/wst/components/server/index.html
	
I suspect the problem is that the apache server specifies AddDefaultCharset ISO-8859-1, and it should not set any default (allowing content providers to set their own). 


As an aside, if the php.ini file in /etc/php.ini is the one that's actually used by eclipse servers, I see it had outputbuffering "off". This can sometimes prohibit content providers from setting their own charset (header) in their own php script ... but, one problem at a time, I guess. 

Comment 11 Denis Roy CLA 2008-01-23 14:00:46 EST
Indeed, the default character set configuration comes from Apache via the AddDefaultCharset directive, not PHP.

If you use PHP, you can override the default character set with the header() function before sending content. This is a design "feature" with Phoenix, as the headers are not mangled by our page generation code.

<?php
	header( "Content-type: text/html; charset=utf-8" );
?>

This header() function can be added to your _projectCommon.php file to be applied to all your project's web pages.
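
The same principle applies outside PHP: whatever generates the page should name the charset in the Content-Type header and encode the body to match. As an illustration only, here is a minimal Python WSGI analogue. It is not Phoenix code; the names app and call_app and the page content are assumptions made for the sketch.

```python
def app(environ, start_response):
    """Tiny WSGI application that declares and uses UTF-8 consistently."""
    body = "<p>M\u00f8ller</p>".encode("utf-8")
    headers = [
        # The declared charset matches the encoding used for the body above.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    # Headers are handed over before any body bytes are returned.
    start_response("200 OK", headers)
    return [body]


def call_app(application):
    """Invoke a WSGI app directly (no server) and capture what it sends."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(application({}, start_response))
    return captured, body


captured, body = call_app(app)
print(captured["headers"]["Content-Type"])  # text/html; charset=utf-8
```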


For static html files, one could argue that your 'index.html' page isn't a Phoenix page, and needs to be updated. Beyond that, nothing is stopping you from overriding the default Apache setting by creating a .htaccess file in your directory with this line in it:

AddDefaultCharset UTF-8


Try this (using a .htaccess):
wget -S http://www.eclipse.org/webmaster/main.html
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Wed, 23 Jan 2008 18:42:19 GMT
[snip]
  Content-Type: text/html; charset=UTF-8

I'm closing as WORKSFORME because, although we have specified a default system-wide character set according to recommendations from the Apache docs [1], the system is flexible in that you are free to override it.


> As an aside, if the php.ini file in /etc/php.ini is the one that's actually
> used by eclipse servers, I see it had outputbuffering "off". This can sometimes
> prohibit content providers from setting their own charset (header) in their own
> php script ... but, one problem at a time, I guess. 

Although we're not using /etc/php.ini, we have intentionally set OutputBuffering=Off because, as stated above, the way Phoenix is set up allows you to send headers before the content (no loss of functionality for you), and setting the buffer off yields better performance (happy webmaster).
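
To make the ordering constraint concrete, here is a toy Python model of an unbuffered response. It is a sketch of the general idea only, not PHP's implementation: the class UnbufferedResponse and its methods are hypothetical, and the initial header simply mirrors the server default discussed in this bug. Without buffering, the headers go out with the first byte of content, so an override is only possible before that point.

```python
class UnbufferedResponse:
    """Toy model: headers are flushed as soon as any content is written."""

    def __init__(self):
        # Server-wide default, as reported in this bug.
        self.headers = {"Content-Type": "text/html; charset=ISO-8859-1"}
        self.sent_headers = None  # snapshot of what went out on the wire
        self.body = []

    def header(self, name, value):
        """Override a header; only allowed before output has started."""
        if self.sent_headers is not None:
            raise RuntimeError("headers already sent")
        self.headers[name] = value

    def write(self, text):
        """Send content; the first write flushes the headers as they stand."""
        if self.sent_headers is None:
            self.sent_headers = dict(self.headers)
        self.body.append(text)


# Header first, content second: the override takes effect.
ok = UnbufferedResponse()
ok.header("Content-Type", "text/html; charset=utf-8")
ok.write("<p>M\u00f8ller</p>")
print(ok.sent_headers["Content-Type"])  # text/html; charset=utf-8

# Content first: the default has already gone out, so the override fails.
late = UnbufferedResponse()
late.write("<p>M\u00f8ller</p>")
try:
    late.header("Content-Type", "text/html; charset=utf-8")
except RuntimeError as err:
    print(err)  # headers already sent
```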



[1] From mod_mime-defaults.conf
#
# Specify a default charset for all pages sent out. This is
# always a good idea and opens the door for future internationalisation
# of your web site, should you ever want it. Specifying it as
# a default does little harm; as the standard dictates that a page
# is in iso-8859-1 (latin1) unless specified otherwise i.e. you
# are merely stating the obvious. There are also some security
# reasons in browsers, related to javascript and URL parsing
# which encourage you to always set a default char set.
# (see http://httpd.apache.org/info/css-security/)
#
Comment 12 Bob Fraser CLA 2008-01-23 14:29:09 EST
I will try out the 
header( "Content-type: text/html; charset=utf-8" );
fix on the web tools pages.
Bob
Comment 13 Bob Fraser CLA 2008-01-23 14:50:37 EST
Worked fine.  I will update our site.
Comment 14 David Williams CLA 2008-01-23 15:48:19 EST
Hot dogs ... .htaccess files! :) 

I'm glad the common project approach works. Much thanks Denis. (and Bob, and Mr. Møller. :) 

And, Denis, thanks for fixing bug 210887 too.