Today I added some URLs to our project plan [0]. The 5th one [1] breaks because it includes bugs which have non-UTF-8 characters in their title.

[0] http://www.eclipse.org/projects/project-plan.php?projectid=modeling.emf
[1] https://bugs.eclipse.org/bugs/buglist.cgi?query_format=advanced&short_desc_type=allwordssubstr&short_desc=&product=EMF&component=Core&component=Doc&component=Edit&component=Mapping&component=Tools&component=XML/XMI&target_milestone=---&long_desc_type=allwordssubstr&long_desc=&bug_file_loc_type=allwordssubstr&bug_file_loc=&keywords_type=allwords&keywords=&bug_status=RESOLVED&bug_status=VERIFIED&bug_status=CLOSED&emailtype1=substring&email1=&emailtype2=substring&email2=&bugidtype=include&bug_id=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&cmdtype=doit&order=Reuse+same+sort+as+last+time&field0-0-0=noop&type0-0-0=noop&value0-0-0=

Some bugs which I've seen break this are:

bug 73211 - "Unicode – Can’t generate core model using anonatated java source file encoded by UTF 16-BE/LE" (could be the emdash or the non-standard apostrophe character)
bug 73212 - "Unicode – Can’t generate core model using anonatated java source file encoded by UTF 16-BE/LE" (duplicate of 73211)
bug 29282 - "extends AbstractEnumerator is missing für EMF Demo after Generation" changed to "extends AbstractEnumerator is missing for EMF Demo after Generation", to verify it's the title that's the problem (https://bugs.eclipse.org/bugs/show_activity.cgi?id=29282)

---------

Here's the output on the PHP page:

Trouble: PHP Warning: XSLTProcessor::transformToXml() [function.transformToXml]: https://bugs.eclipse.org/bugs/buglist.cgi?query_format=advanced&short_desc_type=allwordssubstr&short_desc=&product=EMF&component=Core&component=Doc&component=Edit&component=Mapping&component=Tools&component=XML/XMI&target_milestone=---&long_desc_type=allwordssubstr&long_desc=&bug_file_loc_type=allwordssubstr&bug_file_loc=&keywords_type=allwords&keywords=&bug_status=RESOLVED&bug_status=VERIFIED&bug_status=CLOSED&emailtype1=substring&email1=&emailtype2=substring&email2=&bugidtype=include&bug_id=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&cmdtype=doit&order=Reuse+same+sort+as+last+time&field0-0-0=noop&type0-0-0=noop&value0-0-0=&ctype=rdf&columnlist=bug_id,short_desc,target_milestone,bug_status:5478: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x96 0x20 0x43 0x61
/home/local/data/httpd/www.eclipse.org/html/projects/project-plan.php (147)

Trouble: PHP Warning: XSLTProcessor::transformToXml() [function.transformToXml]: <bz:short_desc>Unicode – Can’t generate core model using anonatated ja
/home/local/data/httpd/www.eclipse.org/html/projects/project-plan.php (147)

Trouble: PHP Warning: XSLTProcessor::transformToXml() [function.transformToXml]: ^
/home/local/data/httpd/www.eclipse.org/html/projects/project-plan.php (147)

So, either we need a way to tell the page what the encoding is and not die if it's non-UTF-8, or we need to escape these characters so they can be rendered/replaced.
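The four bytes quoted in the parser error can be inspected directly. The sketch below is in Python purely for illustration (the page itself is PHP/XSLT). It checks that the sequence is not valid UTF-8 and shows what it would be if the title had been stored as Windows-1252. The Windows-1252 guess is an assumption; nothing in the error output confirms the actual source encoding.

```python
# Bytes copied from the parser error: "Bytes: 0x96 0x20 0x43 0x61"
reported = bytes([0x96, 0x20, 0x43, 0x61])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_ok = is_valid_utf8(reported)

# Assumption: the text was entered as Windows-1252 (cp1252).
# In that code page 0x96 is U+2013, the en dash.
as_cp1252 = reported.decode("cp1252")

print(utf8_ok)           # whether the bytes are valid UTF-8
print(repr(as_cp1252))   # the same bytes read as Windows-1252
```

Under that assumption the bytes read as an en dash followed by " Ca", which lines up with the "– Ca" in the title of bug 73211.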
There is one big issue I've seen with bugzilla generated XML content, and it has to do with the xml declaration statement. It always returns as: <?xml version="1.0"?> There is no encoding attribute specified. Beyond that, I suspect that the text isn't being encoded as true UTF-8 when it is sent. I'm not sure what version of bugzilla eclipse is using, but version 2.22 turned on UTF-8 by default for new installations. http://www.bugzilla.org/releases/2.22/new-features.html Not sure if this affects the xml or not, but the xml declaration should be: <?xml version="1.0" encoding="UTF-8"?>
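One point worth keeping in mind: under the XML 1.0 specification, a document with no encoding declaration and no byte-order mark is already read as UTF-8. Adding encoding="UTF-8" states the intent, but it probably would not make the parse succeed if the bytes are not UTF-8. A minimal Python sketch of that behaviour follows; the payload is a made-up sample, not real Bugzilla output.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload; 0x96 is the byte reported in the parser error.
BODY = b"<bugs><short_desc>Unicode \x96 test</short_desc></bugs>"


def parses(document: bytes) -> bool:
    """Return True if the document is accepted by the XML parser."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Same invalid byte, with and without an explicit UTF-8 label.
without_encoding = parses(b'<?xml version="1.0"?>' + BODY)
with_utf8_label = parses(b'<?xml version="1.0" encoding="UTF-8"?>' + BODY)

# The same text with the dash properly encoded as UTF-8.
fixed_body = BODY.replace(b"\x96", "\u2013".encode("utf-8"))
clean_utf8 = parses(b'<?xml version="1.0" encoding="UTF-8"?>' + fixed_body)

print(without_encoding, with_utf8_label, clean_utf8)
```

In this sketch only the variant whose bytes are actually UTF-8 parses, which suggests the fix has to touch the content and not only the declaration.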
(In reply to comment #1) > There is one big issue I've seen with bugzilla generated XML content, and it > has to do with the xml declaration. > <?xml version="1.0"?> > should be: > <?xml version="1.0" encoding="UTF-8"?> Is this something that can be fixed in the bugzilla server script that generates the RDF? Or something that can be done in the php parser?
(In reply to comment #2)
> (In reply to comment #1)
> > There is one big issue I've seen with bugzilla generated XML content, and it
> > has to do with the xml declaration.
> > <?xml version="1.0"?>
> > should be:
> > <?xml version="1.0" encoding="UTF-8"?>
>
> Is this something that can be fixed in the bugzilla server script that
> generates the RDF? Or something that can be done in the php parser?

It has to be done in the bugzilla server script, as the XSLT just uses the document() function to open the URL provided. The php script just executes the XSL Transformation, so it doesn't do anything with generating the RDF format; that's all handled by bugzilla. This is what happens when XML isn't encoded right and is sent to tools that follow the XML specification.
I would suggest opening a bug against Bugzilla if you're certain it's misbehaving. It won't help us here immediately, but if you're having trouble you can't be the only ones. (In reply to comment #3) > It has to be done in the bugzilla server script, as the XSLT, just uses the > document() function to open the URL provided. The php script just executes > the XSL Transformation so it doesn't do anything with generating the RDF > format, that's all handled by bugzilla. If you have the file in the PHP script before executing the transformation you can modify the content before you call the XSLT, no? That's how I'd handle it. Even a str_replace() on the <?xml... header should work. > This is what happens when XML isn't encoded right and is sent to tools that > follows the XML specification. Yes, but that's what scripting is for ;) It would be nice if we could have it fixed immediately but since we can't, and if you're certain it's exhibiting incorrect behavior, then that's the solution that's most practical at the moment.
(In reply to comment #4)
> If you have the file in the PHP script before executing the transformation you
> can modify the content before you call the XSLT, no? That's how I'd handle it.
> Even a str_replace() on the <?xml... header should work.

PHP is not handling this. Here is the relevant code that is doing the extract of the bugzilla query (in project-plan-render.xsl) in the @bugzilla template:

<xsl:choose>
  <xsl:when test="string-length($bugzillaURL) > 0">
    <xsl:apply-templates select="document($bugzillaURL)//bz:bugs"/>
  </xsl:when>
  <xsl:otherwise>
    <html:ul>
      <html:li>
        <html:span style="background-color: #FFCCCC; font-weight: bold; font-size: 150%;">
          Error: url is not a bugs.eclipse.org url
        </html:span>
      </html:li>
    </html:ul>
  </xsl:otherwise>
</xsl:choose>

Notice that the xsl:apply-templates uses the XSLT document function to execute the bugzilla query and then work with the information that is returned as XML. The PHP script never sees the RDF, which is why the bugzilla script needs to be corrected so that it handles UTF-8 correctly. This is a common mistake in programs that don't respect the rules of the XML specification.

> > This is what happens when XML isn't encoded right and is sent to tools that
> > follows the XML specification.
>
> Yes, but that's what scripting is for ;) It would be nice if we could have it
> fixed immediately but since we can't, and if you're certain it's exhibiting
> incorrect behavior, then that's the solution that's most practical at the
> moment.

The only way to do this is to have another URL or script that corrects the issue. Bugzilla has known internationalization issues, as has been documented in their releases. There are workarounds (e.g. remove the offending items from the script), but the correct method is to make bugzilla encode items correctly, not to come up with jury-rigged patches.
(In reply to comment #5) > The only way to do this is to have another URL or script that corrects the > issue. Bugzilla has known internationalization issues as has been documented > in their releases. There are work arounds (i.e. remove the affending items > from the script), but the correct method is to make bugzilla encode items > correctly, not to come up with jury rigged patches. So I agree, and that's why I said you should open a bug against Bugzilla. Please do. The correct long-term solution is for them to fix it. But if you also want it to work, then you need a script, which was my second point.
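As a rough idea of what such an intermediate script could do, here is a minimal Python sketch (the real page is PHP/XSLT, so this is illustration only). It assumes the script fetches the query result itself instead of leaving that to document(), and that any non-UTF-8 bytes are Windows-1252. The function name `to_utf8_xml` and the sample payload are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading XML declaration such as <?xml version="1.0"?>
_XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>")


def to_utf8_xml(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw re-encoded as UTF-8 with an explicit encoding declaration.

    Assumption: bytes that are not valid UTF-8 are in the fallback
    encoding (Windows-1252 by default).
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode(fallback, errors="replace")
    body = _XML_DECL.sub("", text, count=1)
    return ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")


# Hypothetical payload shaped loosely like the failing title.
sample = (b'<?xml version="1.0"?><bugs><short_desc>'
          b"Unicode \x96 Can\x92t generate</short_desc></bugs>")

repaired = to_utf8_xml(sample)
title = ET.fromstring(repaired).findtext("short_desc")
print(title)
```

The repaired bytes could then be handed to the transformation in place of the live query URL. This is a stopgap of the kind discussed above; it does not replace fixing the encoding at the source.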
I'm moving this one to the bugzilla component as it seems to fit there better.
(In reply to comment #7) > I'm moving this one to the bugzilla component as it seems to fit there better. Is anyone going to open a bug against Bugzilla about it? Because in the Eclipse Bugzilla queue I'll have to close this as NOT_ECLIPSE. If you still want to implement a fix on your side via a PHP script then I suggest moving this bug to where that work will be tracked.
My thought here is that since Eclipse has already enhanced their own copy of bugzilla it could be fixed here. As I said, depending on the release of bugzilla eclipse is using, this should already be addressed within Bugzilla by the use of the UTF-8 flag. I'm not sure if eclipse has that turned on or not. http://www.bugzilla.org/releases/2.22/new-features.html In Bugzilla 3.0, they added even more support for UTF-8 and internationalization. http://www.bugzilla.org/releases/3.0/new-features.html It's why I think it belongs in this area, at least until it is investigated to see if it's a configuration problem on eclipse's end, an upgrade of bugzilla that is needed, or if it is truly still a bug in bugzilla. No need to open a bug against bugzilla itself until we finish the investigation here.
(In reply to comment #9) > My thought here is that since Eclipse has already enhanced their own copy of > bugzilla it could be fixed here. As I said, depending on the release of > bugzilla eclipse is using this should already be addressed with Bugzilla by the > use of the UTF-8 flag. I'm not sure if eclipse has that turned on or not. I'm sorry David, you're right. There is a parameter in 3.0 to turn on UTF-8 support. However, doing so requires re-encoding the entire database. I also don't know what side effects we can expect from it with other existing scripting that uses the Bugzilla XML or DB interfaces. It appears that it affects a lot of settings throughout the application. I'll have to talk to Denis and Matt about it. I'm vaguely remembering a previous thread about this, but I can't find it in a bug search.
Grrr, that was me, forgot to sign it.
That means this bug is a dupe of bug 220717 right?
(In reply to comment #12)
> That means this bug is a dupe of bug 220717 right?

Yes, this looks like the same issue. I'm marking as a duplicate of that bug. This affects project plan generation in some cases, so it probably needs to be looked at soon.

*** This bug has been marked as a duplicate of bug 220717 ***
hmm, apparently that bug didn't show up on a search of "UTF8". That's the one I was remembering.