| Summary: | [xpath2] fn:string-length, fn:substring and fn:translate need to handle surrogate pairs | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [WebTools] WTP Source Editing | Reporter: | David Carver <d_a_carver> | ||||
| Component: | wst.xpath | Assignee: | David Carver <d_a_carver> | ||||
| Status: | RESOLVED FIXED | QA Contact: | David Carver <d_a_carver> | ||||
| Severity: | normal | ||||||
| Priority: | P3 | CC: | jesper | ||||
| Version: | 3.1 | Keywords: | helpwanted | ||||
| Target Milestone: | 3.2 M2 | ||||||
| Hardware: | PC | ||||||
| OS: | All | ||||||
| Whiteboard: | |||||||
| Bug Depends on: | |||||||
| Bug Blocks: | 280554, 286062 | ||||||
| Attachments: |
|
||||||
|
Description
David Carver
Added some enhancements to the code that was failing the tests, so that it decodes the String for surrogate values and re-encodes it on output. The re-encoding isn't working quite correctly: StringEscapeUtils.escapeXML() doesn't seem to handle high-order surrogates correctly. My foo in this area is weak, so marking as helpwanted.

You know very well that I just can't resist, don't you...

(In reply to comment #2)
> You know very well that I just can't resist, don't you...

I know you like a challenge. :) Any help would be appreciated.

fn:match needs to have its decoding and encoding corrected as well.

I now understand the misunderstanding, so to speak. Some routines dealing with strings are actually encoding and decoding entity references, which is wrong at best, and should not be done. There is no need, outside perhaps test code, to encode and decode entity references; that is a parser/serializer job, dealing with what XML looks like when expressed in "narrow" encodings like ASCII.

Internally, in Java, a java.lang.String is always (since 1.5 anyway) a UTF-16 sequence, which means it uses surrogate pairs (high + low) to express Unicode codepoints above 0xFFFF. This is also how we get them from e.g. Text nodes and whatnot. We just need to handle this, not encode and decode entity references. See my patch to FnCodepointsToString for an example of how to handle this correctly.

I am already working on the patch for this bug, but if you prefer to hold off for M2, I'm fine with that.

Actually, I think we need to rework how XSString works. When an XSString is created through the constructor, the decoding should happen there, and the value should be stored internally in the decoded state. When string_value() is called, the re-encoding should happen there. There are several XPath 2.0 tests that expect encoded values to be returned. Reworking how XSString stores the values internally should help eliminate some of the "hacks" currently implemented to make these test cases pass.
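To make the UTF-16 point above concrete, here is a minimal sketch of code-point-aware length and substring helpers built only on standard java.lang.String methods. The class name CodePointStrings and both method names are hypothetical and are not taken from the attached patch. The substring helper uses a zero-based start, which is an assumption of this sketch; fn:substring itself is one-based and has its own rounding rules.

```java
// Hypothetical helper class, for illustration only (not the code from the patch).
final class CodePointStrings {

    // Counts Unicode code points, so a high+low surrogate pair counts as one character.
    static int stringLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Returns up to 'count' code points starting at the zero-based code point
    // index 'start'. Out-of-range arguments are clamped rather than rejected.
    static String substringByCodePoints(String s, int start, int count) {
        int total = s.codePointCount(0, s.length());
        int from = Math.min(Math.max(start, 0), total);
        // Use long arithmetic so a very large 'count' cannot overflow.
        int to = (int) Math.min((long) from + Math.max(count, 0), total);
        // Translate code point positions into UTF-16 code unit indexes.
        int fromIndex = s.offsetByCodePoints(0, from);
        int toIndex = s.offsetByCodePoints(fromIndex, to - from);
        return s.substring(fromIndex, toIndex);
    }

    public static void main(String[] args) {
        // U+1D504 lies above 0xFFFF, so Java stores it as the pair D835 DD04.
        String s = "a\uD835\uDD04b";
        System.out.println(s.length());       // UTF-16 code units
        System.out.println(stringLength(s));  // code points
        System.out.println(substringByCodePoints(s, 1, 1).length());
    }
}
```

Because the indexes are converted with offsetByCodePoints, the result never cuts a surrogate pair in half, which is the failure mode of a plain String.substring call with code point positions.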
We are running out of time for the 3.2 M1 milestone, so this needs to be put off to 3.2 M2.

Created attachment 143859 [details]
Patch for this bug

This patch has three major contributions:
1) It includes a robust codepoint handling iterator, which the JDK seems to lack
2) It reimplements substring, string-length and translate functionality to handle surrogates
3) It removes escape encoding/decoding entirely from the XPath classes and puts it into the test framework entirely

All Java String values in the XPath2 world are now always UTF-16 code units, no escapes, just surrogates.

Fixes the following tests:
test_fn_codepoints_to_string1args_1(org.eclipse.wst.xml.xpath2.processor.testsuite.functions.CodepointToStringFuncTest)
test_surrogates03(org.eclipse.wst.xml.xpath2.processor.testsuite.functions.SurrogatesTest)
test_surrogates06(org.eclipse.wst.xml.xpath2.processor.testsuite.functions.SurrogatesTest)
test_surrogates07(org.eclipse.wst.xml.xpath2.processor.testsuite.functions.SurrogatesTest)
test_surrogates08(org.eclipse.wst.xml.xpath2.processor.testsuite.functions.SurrogatesTest)
test_surrogates10(org.eclipse.wst.xml.xpath2.processor.testsuite.functions.SurrogatesTest)
test_surrogates14(org.eclipse.wst.xml.xpath2.processor.testsuite.functions.SurrogatesTest)
test_surrogates15(org.eclipse.wst.xml.xpath2.processor.testsuite.functions.SurrogatesTest)
test_fn_translate3args_1(org.eclipse.wst.xml.xpath2.processor.testsuite.functions.TranslateFuncTest)
test_fn_translate3args_2(org.eclipse.wst.xml.xpath2.processor.testsuite.functions.TranslateFuncTest)

I think both translate and substring got a lot closer to the spec, at least once you understand the codepoint iteration class.

There's a possible regression on:
- test_fn_current_date_4(org.eclipse.wst.xml.xpath2.processor.testsuite.functions.ContextCurrentDateFuncTest)
but I think it's a time-of-day issue exposed by timezones, since the logic is not affected by this patch at all. I'll try again tomorrow.
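As a rough illustration of how code point iteration supports translate, the sketch below walks each string one code point at a time and applies the fn:translate rules: the first occurrence in the map string wins, and characters with no counterpart in the replacement string are dropped. The class CodePointTranslate and its methods are hypothetical names for this example and do not reproduce the iterator class in the attachment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the implementation from the patch.
final class CodePointTranslate {

    // Splits a UTF-16 string into code points, treating a surrogate pair as one entry.
    static int[] toCodePoints(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result[n++] = cp;
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return result;
    }

    static String translate(String arg, String mapString, String transString) {
        int[] from = toCodePoints(mapString);
        int[] to = toCodePoints(transString);
        Map<Integer, Integer> mapping = new HashMap<>();
        for (int i = 0; i < from.length; i++) {
            // First occurrence wins; -1 marks "delete this character".
            mapping.putIfAbsent(from[i], i < to.length ? to[i] : -1);
        }
        StringBuilder out = new StringBuilder();
        for (int cp : toCodePoints(arg)) {
            Integer target = mapping.get(cp);
            if (target == null) {
                out.appendCodePoint(cp);      // unmapped: copy through
            } else if (target >= 0) {
                out.appendCodePoint(target);  // mapped: replace
            }                                 // target == -1: omit
        }
        return out.toString();
    }
}
```

Working on int code points rather than char values means a supplementary character in any of the three arguments is matched and replaced as a single unit.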
Oh, and it also fixes string-to-codepoints (bug 280554).

Thanks... I checked the patch and you got lucky: 414 lines of new code, but 183 lines deleted, leaving a net 231 lines of code to actually be committed. Just under the 250-line limit. :)

Patch applied to head. |
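For reference, the string-to-codepoints and codepoints-to-string conversions discussed in this bug can be expressed with plain JDK calls once no entity escaping is involved. This is a sketch assuming Java 8 or later (for CharSequence.codePoints()); the class and method names are hypothetical and are not the FnCodepointsToString code.

```java
// Hypothetical round-trip helpers, for illustration only.
final class CodePointRoundTrip {

    // One int per Unicode code point; a surrogate pair yields a single value.
    static int[] stringToCodepoints(String s) {
        return s.codePoints().toArray();
    }

    // Builds a UTF-16 String, emitting surrogate pairs for values above 0xFFFF.
    // Throws IllegalArgumentException if a value is not a valid code point.
    static String codepointsToString(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }
}
```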