Bug 344499

Summary: [Webapp] Print Selected Topic and All Subtopics Incorrectly Injects Section Numbers
Product: [Eclipse Project] Platform
Reporter: Brian Lillie <brianlil>
Component: User Assistance
Assignee: Chris Goldthorpe <cgold>
Status: RESOLVED FIXED
QA Contact:
Severity: normal
Priority: P3
CC: cgold, mukund
Version: 3.6.2
Target Milestone: 3.8 M1
Hardware: PC
OS: Windows XP
Whiteboard:
Attachments:
Description | Flags
Example of help file with Unicode title | none
English version of generated content | none
Japanese version of generated content | none
Case 1 of mismarked section | none
Case 2 of mismarked section | none
Patch | none

Description Brian Lillie CLA 2011-05-02 13:59:10 EDT
Build Identifier: Helios SR2 20110218-0911 

From a Help document, select Print > Print Selected Topic and All Subtopics. The resulting document combines the sections and injects section numbers. If the titles contain only Latin-1 characters, this works correctly. Documents with Unicode characters in their titles result in incorrectly placed section numbers.

I believe the problem is that PATTERN_HEADING in org.eclipse.help.webapp/org.eclipse.help.internal.webapp.data.PrintData uses \w in the pattern to find the first text outside of the HTML tags. According to the documentation for java.util.regex.Pattern, \w matches [a-zA-Z_0-9], which does not cover Unicode character content.

From limited research, it appears that a character class such as [\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}_] might be a reasonable approximation of \w that would include Unicode content.
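The difference between \w and a category-based class can be checked in isolation. The following is a minimal sketch, not the code in PrintData: the class name WordClassDemo and its helper methods are hypothetical, and the Unicode class is simply assembled from the categories named above (it does not cover modifier letters or combining marks). Non-ASCII titles are written as Unicode escapes to avoid source-encoding issues.

```java
import java.util.regex.Pattern;

public class WordClassDemo {
    // Default \w in java.util.regex.Pattern: [a-zA-Z_0-9] only.
    static final Pattern ASCII_WORD = Pattern.compile("\\w");

    // Hypothetical approximation built from Unicode general categories
    // (lowercase, uppercase, titlecase, other letters, decimal digits) plus '_'.
    static final Pattern UNICODE_WORD =
            Pattern.compile("[\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}_]");

    // True if the first character of the title matches \w.
    static boolean startsWithAsciiWord(String title) {
        return ASCII_WORD.matcher(title).lookingAt();
    }

    // True if the first character of the title matches the Unicode class.
    static boolean startsWithUnicodeWord(String title) {
        return UNICODE_WORD.matcher(title).lookingAt();
    }

    public static void main(String[] args) {
        // "Overview", a two-character Japanese title, and "aguila" with an accented first letter.
        String[] titles = { "Overview", "\u6982\u8981", "\u00e1guila" };
        for (String t : titles) {
            System.out.println(t + " ascii=" + startsWithAsciiWord(t)
                    + " unicode=" + startsWithUnicodeWord(t));
        }
    }
}
```

Only the ASCII title starts with a \w character, while all three start with a character from the category-based class.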



Reproducible: Always
Comment 1 Brian Lillie CLA 2011-05-02 14:07:43 EDT
Created attachment 194521: Example of help file with Unicode title
Comment 2 Chris Goldthorpe CLA 2011-05-02 14:45:27 EDT
I will look into this. Can you attach a screen shot showing the section numbers in the wrong place?
Comment 3 Brian Lillie CLA 2011-05-02 14:52:16 EDT
Created attachment 194526: English version of generated content
Comment 4 Brian Lillie CLA 2011-05-02 14:52:40 EDT
Created attachment 194527: Japanese version of generated content
Comment 5 Brian Lillie CLA 2011-05-02 15:00:23 EDT
If you compare the attached en/ja files and search for a topic title, you will see where they exist in the English output, and that they don't exist in the Japanese output.

The attached images from the Japanese output show where the section is marked (underline) and where it was expected (arrow).
Comment 6 Brian Lillie CLA 2011-05-02 15:00:49 EDT
Created attachment 194528: Case 1 of mismarked section
Comment 7 Brian Lillie CLA 2011-05-02 15:01:20 EDT
Created attachment 194529: Case 2 of mismarked section
Comment 8 Chris Goldthorpe CLA 2011-05-12 19:56:18 EDT
I agree with your analysis of why this is failing. Characters represented by numeric entities will also not get matched.

Other titles that would fail to match include:

¿Qué es Eclipse?
águila
/usr/bin

I'm thinking of replacing the regular expression for PATTERN_HEADING with an expression like the one below, which matches the first non-whitespace character in a text element and so would handle all of the above cases. I need to test this to see whether any more characters need to be added to the [^<\s] subexpression.

<body.*?>[\s]*?([^<\s])

This is too late to get into Eclipse 3.7, so I am targeting Eclipse 3.8.
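As an illustration of how the proposed expression behaves, here is a minimal sketch that inserts a section number before the first non-whitespace text character after the body tag. It is not the committed patch: the class HeadingInsertDemo and the method insertSectionNumber are hypothetical, and the Pattern.DOTALL flag is an assumption added so that markup spanning several lines between the body tag and the title can still be matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeadingInsertDemo {
    // Proposed expression from the comment above. DOTALL is an assumption:
    // it lets ".*?" cross line breaks between <body> and the first text.
    static final Pattern HEADING =
            Pattern.compile("<body.*?>[\\s]*?([^<\\s])", Pattern.DOTALL);

    // Hypothetical helper: put the section number in front of the first
    // non-whitespace character that directly follows a closing '>'.
    static String insertSectionNumber(String html, String number) {
        Matcher m = HEADING.matcher(html);
        if (!m.find()) {
            return html; // no text found after <body>; leave the page unchanged
        }
        int pos = m.start(1); // index of the first text character
        return html.substring(0, pos) + number + " " + html.substring(pos);
    }

    public static void main(String[] args) {
        // Title beginning with an inverted question mark, written as escapes.
        String page = "<html><body>\n<h1>\u00bfQu\u00e9 es Eclipse?</h1></body></html>";
        System.out.println(insertSectionNumber(page, "1.2"));
    }
}
```

Because the capture group accepts any character other than `<` and whitespace, titles that start with punctuation, accented letters, or `/` are handled the same way as ASCII titles. A title that begins with a character entity would match at the `&`, which still places the number before the title.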
Comment 9 Chris Goldthorpe CLA 2011-05-13 14:44:49 EDT
Created attachment 195625 [details]
Patch
Comment 10 Chris Goldthorpe CLA 2011-06-20 12:54:14 EDT
Patch committed to HEAD, Fixed.