Bug 248500 - non-UTF-8 characters break Bugzilla scrape
Summary: non-UTF-8 characters break Bugzilla scrape
Status: RESOLVED DUPLICATE of bug 220717
Alias: None
Product: Community
Classification: Eclipse Foundation
Component: Bugzilla
Version: unspecified
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assignee: Eclipse Management Organization CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-09-24 16:37 EDT by Nick Boldt CLA
Modified: 2008-09-26 12:01 EDT

See Also:


Description Nick Boldt CLA 2008-09-24 16:37:34 EDT
Today I added some URLs to our project plan [0]. The 5th one [1] breaks because it includes bugs which have non-UTF-8 characters in their titles.

[0]http://www.eclipse.org/projects/project-plan.php?projectid=modeling.emf
[1]https://bugs.eclipse.org/bugs/buglist.cgi?query_format=advanced&short_desc_type=allwordssubstr&short_desc=&product=EMF&component=Core&component=Doc&component=Edit&component=Mapping&component=Tools&component=XML/XMI&target_milestone=---&long_desc_type=allwordssubstr&long_desc=&bug_file_loc_type=allwordssubstr&bug_file_loc=&keywords_type=allwords&keywords=&bug_status=RESOLVED&bug_status=VERIFIED&bug_status=CLOSED&emailtype1=substring&email1=&emailtype2=substring&email2=&bugidtype=include&bug_id=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&cmdtype=doit&order=Reuse+same+sort+as+last+time&field0-0-0=noop&type0-0-0=noop&value0-0-0=

Some bugs which I've seen break this are:

bug 73211 - "Unicode – Can’t generate core model using anonatated java source file encoded by UTF 16-BE/LE" (could be the emdash or the non-standard apostrophe character)

bug 73212 - "Unicode – Can’t generate core model using anonatated java source file encoded by UTF 16-BE/LE" (duplicate of 73211)

bug 29282 - "extends AbstractEnumerator is missing für EMF Demo after Generation" changed to "extends AbstractEnumerator is missing for EMF Demo after Generation", to verify it's the title that's the problem (https://bugs.eclipse.org/bugs/show_activity.cgi?id=29282)

---------

Here's the output on the PHP page:

Trouble: PHP Warning:
XSLTProcessor::transformToXml() [function.transformToXml]: https://bugs.eclipse.org/bugs/buglist.cgi?query_format=advanced&short_desc_type=allwordssubstr&short_desc=&product=EMF&component=Core&component=Doc&component=Edit&component=Mapping&component=Tools&component=XML/XMI&target_milestone=---&long_desc_type=allwordssubstr&long_desc=&bug_file_loc_type=allwordssubstr&bug_file_loc=&keywords_type=allwords&keywords=&bug_status=RESOLVED&bug_status=VERIFIED&bug_status=CLOSED&emailtype1=substring&email1=&emailtype2=substring&email2=&bugidtype=include&bug_id=&votes=&chfieldfrom=&chfieldto=Now&chfieldvalue=&cmdtype=doit&order=Reuse+same+sort+as+last+time&field0-0-0=noop&type0-0-0=noop&value0-0-0=&ctype=rdf&columnlist=bug_id,short_desc,target_milestone,bug_status:5478: parser error : Input is not proper UTF-8, indicate encoding ! Bytes: 0x96 0x20 0x43 0x61
/home/local/data/httpd/www.eclipse.org/html/projects/project-plan.php (147)

Trouble: PHP Warning:
XSLTProcessor::transformToXml() [function.transformToXml]: <bz:short_desc>Unicode – Can’t generate core model using anonatated ja
/home/local/data/httpd/www.eclipse.org/html/projects/project-plan.php (147)

Trouble: PHP Warning:
XSLTProcessor::transformToXml() [function.transformToXml]: ^
/home/local/data/httpd/www.eclipse.org/html/projects/project-plan.php (147)

So, either we need a way to tell the page what the encoding is and not die if it's non-UTF-8, or we need to escape these characters so they can be rendered/replaced.
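
The bytes quoted in the parser error can be checked directly. The following is a minimal Python sketch (not the PHP or XSLT the page actually uses), and it assumes the titles were stored as Windows-1252; that encoding is a guess based on 0x96 matching the dash in the bug 73211 title, not something confirmed in this report.

```python
# Bytes quoted in the parser error: 0x96 0x20 0x43 0x61
raw = bytes([0x96, 0x20, 0x43, 0x61])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Assumption: the text is Windows-1252 (cp1252), where 0x96 is U+2013 (en dash).
as_cp1252 = raw.decode("cp1252")

print(is_valid_utf8(raw))                  # False: 0x96 cannot start a UTF-8 sequence
print(ascii(as_cp1252))                    # '\u2013 Ca'
print(as_cp1252.encode("utf-8").hex(" "))  # e2 80 93 20 43 61
```

Under that assumption the failing bytes are the dash followed by " Ca", which lines up with the "Unicode – Can’t ..." title shown in the second warning.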
Comment 1 David Carver CLA 2008-09-24 16:54:35 EDT
There is one big issue I've seen with bugzilla generated XML content, and it has to do with the XML declaration.

It always returns as:

<?xml version="1.0"?>

There is no encoding attribute specified.  That, and I suspect that the text isn't being encoded as true UTF-8 when it is sent.

I'm not sure what version of bugzilla eclipse is using, but version 2.22 turned on UTF-8 by default for new installations.

http://www.bugzilla.org/releases/2.22/new-features.html

Not sure if this affects the xml or not, but the xml declaration should be:

<?xml version="1.0" encoding="UTF-8"?>
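
One caveat worth checking: under the XML 1.0 specification, a document with no encoding declaration (and no byte order mark or external encoding information) is already read as UTF-8, so adding encoding="UTF-8" on its own would not change how the bytes are interpreted. The Python sketch below illustrates this with a hypothetical one-element stand-in for the Bugzilla RDF; the cp1252 source encoding is an assumption, and Python's ElementTree stands in for the libxml2-based parser behind PHP's XSLTProcessor.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one <bz:short_desc> element. 0x96 and 0x92 are the
# cp1252 bytes for the dash and the curly apostrophe in the bug 73211 title.
body = b"<short_desc>Unicode \x96 Can\x92t generate core model</short_desc>"


def parses(document: bytes) -> bool:
    """Return True if the document is accepted by the XML parser."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# 1. What Bugzilla sends: no encoding attribute, so UTF-8 is assumed.
no_encoding = b'<?xml version="1.0"?>' + body

# 2. Declaring UTF-8 without touching the bytes.
declared_utf8 = b'<?xml version="1.0" encoding="UTF-8"?>' + body

# 3. Declaring UTF-8 and actually transcoding the bytes (assumed cp1252 input).
transcoded = (b'<?xml version="1.0" encoding="UTF-8"?>'
              + body.decode("cp1252").encode("utf-8"))

print(parses(no_encoding), parses(declared_utf8), parses(transcoded))
# False False True
```

In other words, the declaration and the bytes have to agree: either the content is really sent as UTF-8, or the declaration names the encoding that was actually used.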

Comment 2 Nick Boldt CLA 2008-09-24 17:34:17 EDT
(In reply to comment #1)
> There is one big issue I've seen with bugzilla generated XML content, and it
> has to do with the xml declaration.
> <?xml version="1.0"?>
> should be:
> <?xml version="1.0" encoding="UTF-8"?>

Is this something that can be fixed in the bugzilla server script that generates the RDF? Or something that can be done in the php parser?



Comment 3 David Carver CLA 2008-09-24 19:02:54 EDT
(In reply to comment #2)
> (In reply to comment #1)
> > There is one big issue I've seen with bugzilla generated XML content, and it
> > has to do with the xml declaration.
> > <?xml version="1.0"?>
> > should be:
> > <?xml version="1.0" encoding="UTF-8"?>
> 
> Is this something that can be fixed in the bugzilla server script that
> generates the RDF? Or something that can be done in the php parser?
> 

It has to be done in the bugzilla server script, as the XSLT just uses the document() function to open the URL provided.   The php script just executes the XSL Transformation, so it doesn't do anything with generating the RDF format; that's all handled by bugzilla.

This is what happens when XML isn't encoded right and is sent to tools that follow the XML specification.
Comment 4 Karl Matthias CLA 2008-09-25 11:53:49 EDT
I would suggest opening a bug against Bugzilla if you're certain it's misbehaving.  It won't help us here immediately, but if you're having trouble you can't be the only ones.

(In reply to comment #3)
> It has to be done in the bugzilla server script, as the XSLT, just uses the
> document() function to open the URL provided.   The php script just executes
> the XSL Transformation so it doesn't do anything with generating the RDF
> format, that's all handled by bugzilla.

If you have the file in the PHP script before executing the transformation you can modify the content before you call the XSLT, no?  That's how I'd handle it.  Even a str_replace() on the <?xml... header should work.

> This is what happens when XML isn't encoded right and is sent to tools that
> follows the XML specification.

Yes, but that's what scripting is for ;)  It would be nice if we could have it fixed immediately but since we can't, and if you're certain it's exhibiting incorrect behavior, then that's the solution that's most practical at the moment.
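
The pre-processing idea above could look roughly like the following. This is a Python sketch of the technique rather than the PHP in project-plan.php; `to_utf8_xml` is a hypothetical helper name, and the cp1252 fallback is an assumption about how the legacy titles were stored. It tries UTF-8 first so that it keeps working if Bugzilla later starts sending proper UTF-8.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration at the start of the document, if there is one.
XML_DECL = re.compile(rb"^\s*<\?xml[^>]*\?>")


def to_utf8_xml(raw: bytes, fallback: str = "cp1252") -> bytes:
    """Return raw re-encoded as UTF-8 with an explicit UTF-8 declaration.

    Valid UTF-8 input is kept as-is; anything else is decoded with the
    fallback encoding (an assumption, not something Bugzilla reports).
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes undefined in the fallback encoding become U+FFFD.
        text = raw.decode(fallback, errors="replace")
    body = XML_DECL.sub(b"", text.encode("utf-8"), count=1)
    return b'<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Title text from bug 29282, with 0xFC as the cp1252 byte for the u-umlaut.
sample = b'<?xml version="1.0"?><short_desc>missing f\xfcr EMF Demo</short_desc>'
fixed = to_utf8_xml(sample)
print(ascii(ET.fromstring(fixed).text))  # 'missing f\xfcr EMF Demo'
```

A plain replacement of the `<?xml ...?>` header is only enough if the new declaration names the encoding the bytes are really in; transcoding as above avoids having to guess which declarations the downstream parser supports.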
Comment 5 David Carver CLA 2008-09-25 13:03:55 EDT
(In reply to comment #4)
> If you have the file in the PHP script before executing the transformation you
> can modify the content before you call the XSLT, no?  That's how I'd handle it.
>  Even a str_replace() on the <?xml... header should work.

PHP is not handling this.  Here is the relevant code that runs the bugzilla query (in project-plan-render.xsl), in the @bugzilla template:

        <xsl:choose>
            <xsl:when test="string-length($bugzillaURL) > 0">
                <xsl:apply-templates select="document($bugzillaURL)//bz:bugs"/>
            </xsl:when>
            <xsl:otherwise>
                <html:ul>
                    <html:li>
                        <html:span style="background-color: #FFCCCC; font-weight: bold; font-size: 150%;">
                            Error: url is not a bugs.eclipse.org url
                        </html:span>
                    </html:li>
                </html:ul>                
            </xsl:otherwise>
        </xsl:choose>

Notice that the xsl:apply-templates uses the XSLT document function to basically execute the bugzilla query and then work with the information that is returned as XML.   This is why the bugzilla script needs to be corrected so that it is handling UTF-8 correctly.   This is a common mistake that is made in programs that don't respect the rules of the XML specification.

The project-plan-render.xsl is executing the queries through the document function so that it can get the bugzilla XML returned.


> 
> > This is what happens when XML isn't encoded right and is sent to tools that
> > follows the XML specification.
> 
> Yes, but that's what scripting is for ;)  It would be nice if we could have it
> fixed immediately but since we can't, and if you're certain it's exhibiting
> incorrect behavior, then that's the solution that's most practical at the
> moment.
> 

The only way to do this is to have another URL or script that corrects the issue.  Bugzilla has known internationalization issues, as has been documented in their releases.   There are workarounds (e.g. removing the offending items from the script), but the correct method is to make bugzilla encode items correctly, not to come up with jury-rigged patches.

Comment 6 Karl Matthias CLA 2008-09-25 14:06:32 EDT
(In reply to comment #5)
> The only way to do this is to have another URL or script that corrects the
> issue.  Bugzilla has known internationalization issues as has been documented
> in their releases.   There are work arounds (i.e. remove the affending items
> from the script), but the correct method is to make bugzilla encode items
> correctly, not to come up with jury rigged patches.

So I agree, and that's why I said you should open a bug against Bugzilla.  Please do.  The correct long-term solution is for them to fix it.  But if you also want it to work, then you need a script, which was my second point.
Comment 7 David Carver CLA 2008-09-25 14:38:50 EDT
I'm moving this one to the bugzilla component as it seems to fit there better.
Comment 8 Karl Matthias CLA 2008-09-25 15:29:05 EDT
(In reply to comment #7)
> I'm moving this one to the bugzilla component as it seems to fit there better.

Is anyone going to open a bug against Bugzilla about it?  Because in the Eclipse Bugzilla queue I'll have to close this as NOT_ECLIPSE.  If you still want to implement a fix on your side via a PHP script then I suggest moving this bug to where that work will be tracked.
Comment 9 David Carver CLA 2008-09-25 15:35:53 EDT
My thought here is that since Eclipse has already enhanced their own copy of bugzilla, it could be fixed here.  As I said, depending on the release of bugzilla eclipse is using, this should already be addressed within Bugzilla by the use of the UTF-8 flag.  I'm not sure if eclipse has that turned on or not.

http://www.bugzilla.org/releases/2.22/new-features.html

In Bugzilla 3.0, they added even more support for UTF-8 and internationalization.

http://www.bugzilla.org/releases/3.0/new-features.html

It's why I think it belongs in this area, at least until it is investigated to see if it's a configuration problem on eclipse's end, an upgrade of bugzilla that is needed, or if it is truly still a bug in bugzilla.   No need to open a bug against bugzilla itself until we finish the investigation here.

Comment 10 Eclipse Webmaster CLA 2008-09-25 15:52:11 EDT
(In reply to comment #9)
> My thought here is that since Eclipse has already enhanced their own copy of
> bugzilla it could be fixed here.  As I said, depending on the release of
> bugzilla eclipse is using this should already be addressed with Bugzilla by the
> use of the UTF-8 flag.  I'm not sure if eclipse has that turned on or not.

I'm sorry David, you're right.  There is a parameter in 3.0 to turn on UTF-8 support.  However, doing so requires re-encoding the entire database.  I also don't know what side effects we can expect from it with other existing scripting that uses the Bugzilla XML or DB interfaces.  It appears that it affects a lot of settings throughout the application.  I'll have to talk to Denis and Matt about it.  I'm vaguely remembering a previous thread about this, but I can't find it in a bug search.
Comment 11 Karl Matthias CLA 2008-09-25 15:53:46 EDT
Grrr, that was me, forgot to sign it.
Comment 12 Denis Roy CLA 2008-09-26 09:59:02 EDT
That means this bug is a dupe of bug 220717 right?
Comment 13 David Carver CLA 2008-09-26 10:24:12 EDT
(In reply to comment #12)
> That means this bug is a dupe of bug 220717 right?
> 

Yes, this looks like the same issue.  I'm marking as a duplicate of that bug.  This is affecting project plan generation in some cases, so it probably needs to be looked at soon.

*** This bug has been marked as a duplicate of bug 220717 ***
Comment 14 Karl Matthias CLA 2008-09-26 12:01:19 EDT
hmm, apparently that bug didn't show up on a search of "UTF8".  That's the one I was remembering.