If you create projects with double-byte characters (such as Chinese or Japanese) and then use the J2EE Module Dependencies UI (or the underlying operations) to set up module dependencies, the MANIFEST.MF file gets corrupted by the UTF-8 characters. The proper way to place double-byte characters into a MANIFEST.MF file is to use ASCII escape sequences. I believe the problem is in the ArchiveManifestImpl class. I took a brief look and it doesn't appear to do anything to handle double-byte characters.
Is this then a JRE bug?
Nope. The problem is in WTP code. ArchiveManifestImpl is in the org.eclipse.jst.j2ee plugin.
Created attachment 41969: Straightforward fix for the problem
This fixes it, but still leaves room for improvement.
The fix is against the current HEAD, by the way. The possible improvement concerns long lines. Right now it checks against 72-char lines, but the spec says to check against 72-BYTE lines. This involves some silly backtracking, since a UTF-8 character can be from 1 to 4 bytes in length and must NOT be split in the middle. I'm not sure who enforces the 72-byte limit in practice.
I looked at Sun's impl of manifest serialization to see how they handle UTF-8 and the 72-byte line limit. Here is what they do:
    byte[] vb = value.getBytes("UTF8");
    value = new String(vb, 0, 0, vb.length);
    buffer.append(value);
    ....
    make72Safe(buffer);
Basically, they first encode the value (the only part that's allowed to be UTF-8) into a UTF-8 byte array. Then they convert the resulting byte array into a string using one byte per character. Then they do the whole bit of inserting new lines and writing it out. The read reverses the process, stripping out the new lines first and then decoding from UTF-8. I will put together a patch that mimics Sun's behavior.
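For readers following along, here is a minimal sketch of the byte-per-character approach described above. It is not Sun's code and not the attached patch: the class and method names (ByteWiseManifestValue, toLine, parseValue, wrap72) are made up for illustration, and ISO-8859-1 decoding stands in for the deprecated `new String(vb, 0, 0, vb.length)` constructor, which likewise maps each byte to one char.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class ByteWiseManifestValue {

    /** Encodes the value as UTF-8, then widens each byte to one char (char count == byte count). */
    static String toByteChars(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Inserts CRLF + space so that no physical line is longer than 72 chars (= bytes here). */
    static String wrap72(String logicalLine) {
        StringBuilder sb = new StringBuilder(logicalLine);
        int index = 72;
        while (index < sb.length()) {
            sb.insert(index, "\r\n ");
            index += 74; // 2 chars of CRLF, then a 72-char continuation line (space included)
        }
        return sb.toString();
    }

    /** Builds the wrapped "name: value" text; the break may fall inside a multi-byte sequence. */
    static String toLine(String name, String value) {
        return wrap72(name + ": " + toByteChars(value));
    }

    /** Reverses toLine: joins continuation lines first, then decodes the bytes as UTF-8. */
    static String parseValue(String wrappedLine) {
        String logical = wrappedLine.replace("\r\n ", "");
        String byteChars = logical.substring(logical.indexOf(": ") + 2);
        byte[] utf8 = byteChars.getBytes(StandardCharsets.ISO_8859_1);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

Because every char in the intermediate string stands for exactly one byte, counting chars is the same as counting bytes, which is what makes the 72 limit easy to enforce. The price is that a break can land in the middle of a multi-byte sequence, so the result only decodes correctly after the continuation lines have been joined.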
The algorithm that you describe is incorrect, since it may split a line right in the middle of a multi-byte UTF-8 sequence (and insert the line break and space there). The file then cannot be read back using a character-oriented stream/reader mechanism, and it will certainly not work in Eclipse's editor. The correct algorithm is to split the string at a *character* boundary somewhere before the 72nd position. I can outline the algorithm if required.
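One possible shape for the character-boundary algorithm mentioned above, as a rough sketch only: walk the line one code point at a time, keep a running UTF-8 byte count, and start a continuation line before any code point that would push the count past 72. The names (CharBoundaryWrapper, wrap, utf8Length) are hypothetical and this is not the code that was later written for WTP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class CharBoundaryWrapper {
    static final int MAX_LINE_BYTES = 72;

    /** Number of bytes UTF-8 needs for one code point (1 to 4). */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    /**
     * Splits a logical "name: value" line into physical lines of at most 72 UTF-8 bytes,
     * breaking only between code points. Continuation lines start with a single space,
     * which counts toward the limit. Line terminators are not included.
     */
    static List<String> wrap(String logicalLine) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int bytes = 0;
        for (int i = 0; i < logicalLine.length(); ) {
            int cp = logicalLine.codePointAt(i);
            int len = utf8Length(cp);
            if (bytes + len > MAX_LINE_BYTES) {
                // The next code point does not fit: close this line, open a continuation.
                lines.add(current.toString());
                current = new StringBuilder(" ");
                bytes = 1;
            }
            current.appendCodePoint(cp);
            bytes += len;
            i += Character.charCount(cp);
        }
        lines.add(current.toString());
        return lines;
    }
}
```

With this approach a physical line may end up shorter than 72 bytes when the next character does not fit, but each line is valid UTF-8 on its own, so the file stays readable in a text editor.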
That depends on how you define correct. ;) It will produce manifest files that are compatible with Sun's manifest parser, which is the important part. Technically, if you read the spec, the value part of the name/value pairs in the manifest is binary encoded as text, so you are not guaranteed to be able to read manifest files using an editor.
Created attachment 42188: Patch
I have some questions about this problem and the proposed fix. Are we confident that everyone (all VMs) on the servers that the jars are deployed to will read the MANIFEST.MF in the same way as we are writing it? Wouldn't we be better off using a "standard" interface/class, like java.util.jar.Manifest, to read and write the manifest? I'm assuming this is an "API" in all VMs? I believe that's what the base does, and the source of them trashing MANIFEST.MF files recently. (Trashing according to spec, that is :) Are these manifests ever edited by a person? I mean in general practice in the field? If so, a "manifest formatter" becomes more important to this project. Interested parties might read/follow https://bugs.eclipse.org/bugs/show_bug.cgi?id=130064 and https://bugs.eclipse.org/bugs/show_bug.cgi?id=138520 as well. As I read the spec, such as at http://java.sun.com/j2se/1.4.2/docs/guide/jar/jar.html , long lines do not have to be exactly 72 characters long. So it would be best to back up and break on character boundaries, if that would not interfere with how VMs (the Manifest class) read these. (Or, maybe, the spec is poorly written?)
1. In my opinion, if we could switch back to using java.util.jar.Manifest to write the manifest, we would have a lot fewer problems to deal with. The downside is that we would have no way to control any aspect of the formatting. 2. In my opinion, breaking strictly on character boundaries would be a nice enhancement for the future, but it's not something that we have to do for 1.0.3. The algorithm is a lot more complicated and the risk of introducing new bugs is far greater. The manifests produced by my patch should be readable by the java.util.jar.Manifest class, as that's how that class writes them.
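To make the java.util.jar.Manifest option concrete, here is a small sketch of a write/read round trip through that class with a non-ASCII Class-Path value. The wrapper class and method names (StandardManifestRoundTrip, write, readClassPath) are hypothetical. Exactly where the JDK puts the line breaks is up to its implementation and may differ between versions, so the sketch only relies on the value surviving the round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical helper, for illustration only.
final class StandardManifestRoundTrip {

    /** Serializes a manifest holding only Manifest-Version and Class-Path. */
    static byte[] write(String classPath) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version must be present, otherwise the main section is not written.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.put(Attributes.Name.CLASS_PATH, classPath);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            manifest.write(out); // the JDK handles UTF-8 encoding and line wrapping
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Parses the serialized manifest and returns its Class-Path value. */
    static String readClassPath(byte[] manifestBytes) {
        try {
            Manifest manifest = new Manifest(new ByteArrayInputStream(manifestBytes));
            return manifest.getMainAttributes().getValue(Attributes.Name.CLASS_PATH);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

As noted in point 1, the trade-off is that the caller has no say over the formatting of the output, for example attribute order or where continuation lines begin.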
+1 Thanks, Kosta. I agree with your assessment of the risks and magnitude of change. We can deal with these other issues in the future, I hope, but your fix would address the immediate problem of non-ASCII characters not being encoded per spec. Thanks.
+1 for WTP 1.0.3 - the proposed fix seems to minimize change while making some improvement. Let's do the "right" thing in WTP 2.0 at least.
+1 for WTP 1.0.3
Released the fix to both 1.0.3 and 1.5 RC5. I am assuming that a fix that's authorized for 1.0.3 is automatically authorized for 1.5. Opened a new enhancement request (https://bugs.eclipse.org/bugs/show_bug.cgi?id=145576) to track improving the algorithm to break only on character boundaries.
closing