Build Identifier: Helios SR2 20110218-0911

From a Help document, select Print > Print Selected Topic and All Subtopics. The resulting document combines the selected sections, with section numbers injected into the headings. If the titles consist only of Latin-1 characters, this works correctly. Documents with Unicode characters in their titles end up with incorrectly placed section numbers.

I believe the problem is that PATTERN_HEADING in org.eclipse.help.webapp/org.eclipse.help.internal.webapp.data.PrintData uses \w in the pattern to find the first text outside the HTML tags. According to the documentation for java.util.regex.Pattern, \w matches [a-zA-Z_0-9], which does not cover Unicode character content. From limited research, it appears that a pattern such as \\p{Ll}|\\p{Lu}|\\p{Lt}|\\p{Lo}|\\p{Nd}|_ might be a reasonable approximation of \w that would include Unicode content.

Reproducible: Always
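The difference between the two character classes can be checked in isolation. The following is a minimal sketch, not the PrintData code: the class name, the helper firstMatch, and the sample titles are hypothetical, and the suggested approximation is written here as a single bracketed character class rather than an alternation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of org.eclipse.help.webapp.
class WordClassDemo {
    // \w in default mode: [a-zA-Z_0-9] only.
    static final Pattern ASCII_WORD = Pattern.compile("\\w");

    // Approximation from the report, written as one character class:
    // lowercase, uppercase, titlecase and other letters, decimal digits, underscore.
    static final Pattern UNICODE_WORD =
            Pattern.compile("[\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}_]");

    // Index of the first character matched by p in s, or -1 if there is none.
    static int firstMatch(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String japanese = "\u6982\u8981"; // two kanji, used as a sample title
        String spanish = "\u00E1guila";   // starts with a precomposed a-acute

        System.out.println(firstMatch(ASCII_WORD, japanese));   // -1: no match at all
        System.out.println(firstMatch(UNICODE_WORD, japanese)); // 0
        System.out.println(firstMatch(ASCII_WORD, spanish));    // 1: skips the accented letter
        System.out.println(firstMatch(UNICODE_WORD, spanish));  // 0
    }
}
```

If the heading pattern behaves like ASCII_WORD, a title made only of non-ASCII characters gives it nothing to anchor on within the title itself, which would be consistent with the misplaced numbers described above.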
Created attachment 194521 [details] Example of help file with Unicode title
I will look into this. Can you attach a screenshot showing the section numbers in the wrong place?
Created attachment 194526 [details] English version of generated content
Created attachment 194527 [details] Japanese version of generated content
If you compare the attached en/ja files and search for a topic title, you will see where the section numbers appear in the English output, and that they do not appear in the Japanese output. The attached images from the Japanese output show where the section number was placed (underline) and where it was expected (arrow).
Created attachment 194528 [details] Case 1 of mismarked section
Created attachment 194529 [details] Case 2 of mismarked section
I agree with your analysis of why this is failing. Characters represented by numeric entities will also not get matched. Other titles which would fail to match correctly include:

¿Qué es Eclipse?
águila
/usr/bin

I'm thinking of replacing the regular expression for PATTERN_HEADING with an expression like the one below, which matches the first non-whitespace character in a text element and would handle all of the above cases. I need to test this to see whether any more characters need to be added to the [^<\s] subexpression.

<body.*?>[\s]*?([^<\s])

This is too late to get into Eclipse 3.7; targeting Eclipse 3.8.
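As a rough check of the proposed expression against the problem titles, here is a minimal sketch. The class and method names are hypothetical, and the DOTALL and CASE_INSENSITIVE flags are an assumption made so that the match can cross line breaks between the body tag and the heading; the flags actually used in PrintData are not stated in this report.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not the PrintData implementation.
class HeadingStartDemo {
    // Proposed expression. The flags are an assumption for this sketch.
    static final Pattern HEADING = Pattern.compile(
            "<body.*?>[\\s]*?([^<\\s])",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Index of the first non-whitespace text character after the body tag,
    // i.e. where a section number could be inserted; -1 if nothing matches.
    static int headingTextStart(String html) {
        Matcher m = HEADING.matcher(html);
        return m.find() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        String[] titles = {
            "\u00BFQu\u00E9 es Eclipse?", // starts with an inverted question mark
            "\u00E1guila",                // starts with an accented letter
            "/usr/bin",                   // starts with punctuation
            "&#27010;&#35201;"            // title written as numeric entities
        };
        for (String title : titles) {
            String html = "<html><body>\n  <h1>" + title + "</h1></body></html>";
            int at = headingTextStart(html);
            // Print the character the section number would be placed in front of.
            System.out.println(title + " -> " + html.charAt(at));
        }
    }
}
```

In each case the captured character is the first character of the title, including the leading & of a numeric entity, so a number inserted at that index lands in front of the whole title.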
Created attachment 195625 [details] Patch
Patch committed to HEAD. Fixed.