Bug 344499 - [Webapp] Print Selected Topic and All Subtopics Incorrectly Injects Section Numbers
Status: RESOLVED FIXED
Alias: None
Product: Platform
Classification: Eclipse Project
Component: User Assistance
Version: 3.6.2
Hardware: PC Windows XP
Importance: P3 normal
Target Milestone: 3.8 M1
Assignee: Chris Goldthorpe CLA
Reported: 2011-05-02 13:59 EDT by Brian Lillie CLA
Modified: 2011-10-04 14:09 EDT

Attachments
Example of help file with Unicode title (6.20 KB, text/html)
2011-05-02 14:07 EDT, Brian Lillie CLA
English version of generated content (36.04 KB, text/html)
2011-05-02 14:52 EDT, Brian Lillie CLA
Japanese version of generated content (44.41 KB, text/html)
2011-05-02 14:52 EDT, Brian Lillie CLA
Case 1 of mismarked section (498.13 KB, image/bmp)
2011-05-02 15:00 EDT, Brian Lillie CLA
Case 2 of mismarked section (960.52 KB, image/bmp)
2011-05-02 15:01 EDT, Brian Lillie CLA
Patch (1.20 KB, patch)
2011-05-13 14:44 EDT, Chris Goldthorpe CLA

Description Brian Lillie CLA 2011-05-02 13:59:10 EDT
Build Identifier: Helios SR2 20110218-0911 

From a Help document, select Print > Print Selected Topic and All Subtopics. The resulting document combines the sections and injects section numbers. If the titles are all Latin-1 character based, this works correctly. Documents with Unicode characters in their titles result in incorrectly placed section numbers.

I believe the problem is that PATTERN_HEADING in org.eclipse.help.webapp/org.eclipse.help.internal.webapp.data.PrintData uses \w in the pattern to try to find the first text outside of the HTML tags. According to the documentation for java.util.regex.Pattern, \w matches [a-zA-Z_0-9], which does not match Unicode character content.

From limited research, it appears that a character class such as [\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}_] might be a reasonable approximation of \w that would include Unicode content.
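
A minimal Java sketch of the difference, for illustration only. The class name, method name, and sample title are hypothetical and are not taken from PrintData; the Unicode-aware class is the approximation suggested in this report, not a confirmed fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCharDemo {
    // Default \w in java.util.regex is ASCII-only: [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w");

    // Approximation of \w built from Unicode general categories
    // (lowercase, uppercase, titlecase, other letters, decimal digits, underscore).
    static final Pattern UNICODE_WORD =
            Pattern.compile("[\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}_]");

    /** Returns the index of the first match of p in text, or -1 if there is none. */
    static int firstMatch(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Hypothetical title: two Japanese characters, a space, then "Eclipse".
        String title = "\u6982\u8981 Eclipse";

        // ASCII \w skips the Japanese characters and lands on the 'E'.
        System.out.println("ASCII \\w first match:   " + firstMatch(ASCII_WORD, title));

        // The Unicode-aware class matches the first Japanese character.
        System.out.println("Unicode class first match: " + firstMatch(UNICODE_WORD, title));
    }
}
```

With the ASCII-only \w, the first match falls on the first ASCII letter rather than at the start of the title, which is consistent with section numbers being injected in the wrong place.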



Reproducible: Always
Comment 1 Brian Lillie CLA 2011-05-02 14:07:43 EDT
Created attachment 194521
Example of help file with Unicode title
Comment 2 Chris Goldthorpe CLA 2011-05-02 14:45:27 EDT
I will look into this. Can you attach a screen shot showing the section numbers in the wrong place?
Comment 3 Brian Lillie CLA 2011-05-02 14:52:16 EDT
Created attachment 194526
English version of generated content
Comment 4 Brian Lillie CLA 2011-05-02 14:52:40 EDT
Created attachment 194527
Japanese version of generated content
Comment 5 Brian Lillie CLA 2011-05-02 15:00:23 EDT
If you compare the attached en/ja files and search on a topic title, you will be able to see where they exist in the English output, and that they don't exist in the Japanese output.

The attached images from the Japanese output show where the section is marked (underline) and where it is expected (arrow).
Comment 6 Brian Lillie CLA 2011-05-02 15:00:49 EDT
Created attachment 194528
Case 1 of mismarked section
Comment 7 Brian Lillie CLA 2011-05-02 15:01:20 EDT
Created attachment 194529
Case 2 of mismarked section
Comment 8 Chris Goldthorpe CLA 2011-05-12 19:56:18 EDT
I agree with your analysis of why this is failing. Characters represented by numeric entities will also not get matched.

Other titles which would fail to get matched include:

¿Qué es Eclipse?
águila
/usr/bin

I'm thinking of replacing the regular expression for PATTERN_HEADING with an expression like the one below, which would match the first non-whitespace character in a text element and so would handle all of the above cases. I need to test this to see if there are any more characters which need to be added to the [^<\s] subexpression.

<body.*?>[\s]*?([^<\s])
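
A rough Java sketch comparing the \w-based expression with the proposed one. This is an illustration under assumptions, not the committed patch: the class name, method name, and sample HTML are hypothetical, and the DOTALL and CASE_INSENSITIVE flags are assumed because the flags used by PrintData are not shown in this report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingPatternDemo {
    // Assumed flags; DOTALL lets ".*?" span line breaks between tags.
    static final int FLAGS = Pattern.DOTALL | Pattern.CASE_INSENSITIVE;

    // \w-based form described in the report (ASCII word characters only).
    static final Pattern OLD_HEADING =
            Pattern.compile("<body.*?>[\\s]*?(\\w)", FLAGS);

    // Proposed form: first character that is neither '<' nor whitespace.
    static final Pattern NEW_HEADING =
            Pattern.compile("<body.*?>[\\s]*?([^<\\s])", FLAGS);

    /** Returns the index of the captured first text character, or -1 if no match. */
    static int firstTextIndex(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        // Hypothetical topic with a Japanese heading followed by ASCII body text.
        String html = "<html><body>\n<h1>\u6982\u8981</h1>\n<p>Eclipse help</p></body></html>";

        // The old pattern skips the heading and stops at the 'E' in the paragraph.
        System.out.println("old pattern index: " + firstTextIndex(OLD_HEADING, html));

        // The proposed pattern stops at the first character of the heading.
        System.out.println("new pattern index: " + firstTextIndex(NEW_HEADING, html));
    }
}
```

In this sketch the proposed expression also captures the leading "/" of a title such as /usr/bin and the "&" of a numeric entity, since both are non-whitespace characters other than "<".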

This is too late to get into Eclipse 3.7, so I am targeting Eclipse 3.8.
Comment 9 Chris Goldthorpe CLA 2011-05-13 14:44:49 EDT
Created attachment 195625
Patch
Comment 10 Chris Goldthorpe CLA 2011-06-20 12:54:14 EDT
Patch committed to HEAD, Fixed.