| Summary: | [content type] UTF-16 causes exception in XMLRootHandler with IBM's JRE | | |
|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | David Williams <david_williams> |
| Component: | Resources | Assignee: | Rafael Chaves <eclipse> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | normal | | |
| Priority: | P3 | | |
| Version: | 3.0 | | |
| Target Milestone: | 3.0 RC1 | | |
| Hardware: | PC | | |
| OS: | Windows XP | | |
| Whiteboard: | | | |
| Attachments: | | | |
Description
David Williams
Created attachment 10718 [details]
junit test to demonstrate above error
David, it seems the mentioned file does not have a UTF-16 BOM. According to http://www.w3.org/TR/2004/REC-xml-20040204/#charencoding it seems it should have one. Of course, this does not invalidate the PR; I just thought I should mention it.

Created attachment 10738 [details]
hex values of test file

I think it does -- testUTF16.xml in the zip file? -- unless something's getting "lost in translation". Of course, I think someone's code somewhere may leave the input stream positioned after the BOM :) [that works pretty well, and is needed, for UTF-8 BOMs; I've always found UTF-16 BOMs a little more problematic, sometimes expected, sometimes not, as the different results for the two VMs would seem to indicate].

The attached image shows what the hex values look like for the file I'm looking at ... FFFE, right?
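For reference, a minimal sketch of checking the leading bytes of a file for a UTF-16 BOM (FE FF big-endian, FF FE little-endian); the class name here is just illustrative:

import java.io.*;

public class BomCheck {
    // Prints which UTF-16 BOM (if any) the file starts with.
    public static void main(String[] args) throws IOException {
        InputStream in = new FileInputStream(args[0]);
        try {
            int b1 = in.read();
            int b2 = in.read();
            if (b1 == 0xFE && b2 == 0xFF)
                System.out.println("UTF-16BE BOM (FE FF)");
            else if (b1 == 0xFF && b2 == 0xFE)
                System.out.println("UTF-16LE BOM (FF FE)");
            else
                System.out.println("no UTF-16 BOM");
        } finally {
            in.close();
        }
    }
}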
My fault... yes, the right one has the right BOM... I was trying with "test-UTF-16.xml" in testfiles\genedFiles...\xml.

The reason we were failing is that we want to let IOExceptions flow to the caller, but sun.io.MalformedInputException is an I/O exception (a CharConversionException). We will have to handle those (and let non-encoding-related ones flow). Since we read the contents right at the beginning, before handing them to describers, any "real" I/O exceptions will happen right away. When calling describers, I/O exceptions will not be severe, so they are just logged (not thrown).

Fixed and released to HEAD as described above.

Actually, the problem itself still occurs...

David, that file has an odd number of bytes. The IOException happens when trying to decode the last char. Is this intentional? The following example would cause a CharConversionException to occur with IBM's JRE:
import java.io.*;

public class Simple {
    public static void main(String[] args) throws IOException {
        // args[0]: file to read, args[1]: charset name (e.g. "UTF-16")
        Reader reader = new InputStreamReader(new FileInputStream(args[0]), args[1]);
        int c;
        while ((c = reader.read()) != -1)
            System.out.println((char) c);
    }
}
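A minimal sketch of the handling described above (let genuine I/O errors flow to the caller, treat encoding errors as a failed match); the helper and its names are illustrative only, not the actual Platform code:

import java.io.*;

public class DescribeHelper {
    // Hypothetical helper: reads the stream with the given charset and reports
    // whether it could be decoded. Encoding problems (CharConversionException,
    // which covers sun.io.MalformedInputException) are treated as "no match";
    // any other IOException is rethrown to the caller.
    public static boolean canDecode(InputStream contents, String charsetName) throws IOException {
        Reader reader = new InputStreamReader(contents, charsetName);
        try {
            while (reader.read() != -1) {
                // just consume the stream to force decoding
            }
            return true;
        } catch (CharConversionException e) {
            return false; // malformed input: not this encoding/content type
        }
    }
}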
No, it wasn't intentional. Well, at least I don't think so. I'll try to recover its "history", but it's just part of the whole set of files I've routinely tested for the past few years! (I should document my unit tests better.) I assume some previous version of Java wrote it that way. It does make it obvious, though, that the CharsetDecoder error defaults are different between the IBM and Sun VMs (and we've had trouble in the past where the defaults changed from one version to another).

If you change your example to use "Replace" on error, then you get the same behavior on both VMs:
import java.io.*;
import java.nio.charset.*;

public class Simple {
    public static void main(String[] args) throws IOException {
        // Substitute the replacement character for malformed input
        // instead of throwing a CharConversionException
        Charset charset = Charset.forName("UTF-16");
        CharsetDecoder charsetDecoder = charset.newDecoder();
        charsetDecoder.onMalformedInput(CodingErrorAction.REPLACE);
        Reader reader = new InputStreamReader(new FileInputStream(args[0]), charsetDecoder);
        int c;
        while ((c = reader.read()) != -1)
            System.out.println((char) c);
    }
}
Can things be arranged so each "content type handler" sets its own values for this type of error handling? It seems the XMLRootHandler would be best off ignoring (replacing) them (since it's looking for a "positive match"). But, in the past, we've enjoyed giving our editor users a choice when an error occurs ... e.g. "malformed input detected, do you want to continue or cancel?". I'm not sure how to do that with this new system.
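For illustration, a minimal sketch of what a "positive match" describer could do internally if it were free to choose its own decoder behavior; the class name and the idea of reading only a prefix are assumptions, not the actual describer code:

import java.io.*;
import java.nio.charset.*;

public class LenientPrefixReader {
    // Decodes the first few characters of a stream as UTF-16, substituting the
    // replacement character for malformed input so the check itself never
    // throws an encoding exception. A describer could then scan the result
    // for a root element name.
    public static String readPrefix(InputStream contents, int maxChars) throws IOException {
        CharsetDecoder decoder = Charset.forName("UTF-16").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPLACE);
        decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
        Reader reader = new InputStreamReader(contents, decoder);
        StringBuffer buffer = new StringBuffer();
        int c;
        while (buffer.length() < maxChars && (c = reader.read()) != -1)
            buffer.append((char) c);
        return buffer.toString();
    }
}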
A partial fix was to handle faulty describers so other describers still have a chance. As for the change you suggested, David: in core.runtime we have a requirement to run on J2SE subsets that do not include java.nio. To do what you suggested would require some exercise with reflection.

No java.nio!? How do those systems handle encoding/decoding? I thought java.nio was a "standard" part of Java 1.4. So, if core.runtime has to run on a subset of standard Java, then some of this encoding/decoding function doesn't belong at that level, that'd be my opinion, I mean. Even more concretely, in this case, if your "fix" is just to disable that provider as "faulty", then there will be a bug open because object contributions depending on XMLRootElementContentDescriber would not work. (Or do you mean it would just be disabled for that one pass, for that one file, in which case you'd always need that sort of fallback behaviour for that one time?)

BTW, I think this "invalid file" was formed by checking a UTF-16 file with a single-character EOL into CVS, and then when it was checked back out, a 2-character EOL was added. Or some similar "play" with end-of-lines. I suspect this will be moderately common.

More severely, for me to maintain our product's current level of encoding support/behavior, I will have to use java.nio (e.g. to check the difference between the "detected encoding" and the "used encoding", to know when an alias is being used) and have control over how it's set/initialized. I was going to propose some of these as fixes for core.runtime, but it sounds like that would be a hard case to sell. I don't mind leaving them in my own XML version of ContentDescriber, as long as I can depend on it always being called. I assume I would put its priority as "high" and make it a child of runtime.xml. Do you foresee any problems with this approach? Thanks in advance for any help or advice.

Agreed that the content type support is more restricted than it should be, but right now we don't have many choices. Regardless of the circumstances under which your file got into that state, you agree it is invalid, right? Re: providing a personalized version of the XML content describer: you cannot replace the default XML content provider. But the XML describer will hardly classify any contents as invalid (currently it never does that), so you don't need a new content type for XML. You need a more appropriate XML content describer to be used by your XML-based content types.

No further action planned. We will log such exceptions only if in debug mode (added a debug option for content type), and faulty describers will just be skipped during that lookup.
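A minimal sketch of the lookup behavior described in the resolution (skip a faulty describer, log only when a debug flag is on, and let the remaining describers run); the interface, the debug flag, and all names here are purely illustrative, not the actual org.eclipse.core.runtime code:

import java.io.*;
import java.util.*;

public class ContentTypeLookup {
    // Illustrative stand-in for a content describer.
    public interface Describer {
        boolean describe(InputStream contents) throws IOException;
    }

    // Assumed debug switch for this sketch only.
    private static final boolean DEBUG = Boolean.getBoolean("content.debug");

    // Tries each describer in turn; a describer that throws is just skipped
    // for this lookup, and the exception is logged only in debug mode.
    public static Describer findMatch(List describers, byte[] contents) {
        for (Iterator it = describers.iterator(); it.hasNext();) {
            Describer describer = (Describer) it.next();
            try {
                if (describer.describe(new ByteArrayInputStream(contents)))
                    return describer;
            } catch (IOException e) {
                if (DEBUG)
                    e.printStackTrace(); // faulty describer: log and move on
            }
        }
        return null; // no describer matched
    }
}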