
Bug 225753

Summary: [implementation] save changes BOM of UTF-16 with LE BOM
Product: [Eclipse Project] Platform
Reporter: Warren Paul <warren.paul>
Component: Text
Assignee: Platform-Text-Inbox <platform-text-inbox>
Status: RESOLVED FIXED
Severity: major
Priority: P3
CC: daniel_megert, Szymon.Brandys
Version: 3.3.1
Target Milestone: 3.4 M7
Hardware: PC
OS: Windows XP
Attachments: test case

Description Warren Paul CLA 2008-04-04 10:55:07 EDT
I wasn't sure what component to put for this.

If you open a UTF-16LE file in the editor and then save it, the encoding gets changed to UTF-16BE.  Obviously this is not the desired behavior.  The only thing I could find in Bugzilla that was somewhat related to this was https://bugs.eclipse.org/bugs/show_bug.cgi?id=136854.  Comment #6 has some interesting info on the background of this with Java charsets.

Reading the documentation for java.nio.charset.Charset, you're led to believe that the UTF-16LE and UTF-16BE charsets shouldn't be used for BOM-marked files.  Here's a snippet:

    *  When decoding, the UTF-16BE and UTF-16LE charsets ignore byte-order marks; when encoding, they do not write byte-order marks.
    *  When decoding, the UTF-16 charset interprets a byte-order mark to indicate the byte order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.


So the current logic is to treat both UTF-16BE and UTF-16LE as UTF-16.  This is the problem though.  This logic is in org.eclipse.core.internal.content.ContentDescription.

	public String getCharset() {
		byte[] bom = (byte[]) getProperty(BYTE_ORDER_MARK);
		if (bom == BOM_UTF_8)
			return CHARSET_UTF_8;
		else if (bom == BOM_UTF_16BE || bom == BOM_UTF_16LE)
			// UTF-16 will properly recognize the BOM
			return CHARSET_UTF_16;
		return (String) getProperty(CHARSET);
	}


In practice, though, the UTF-16BE and UTF-16LE charsets don't strip an existing BOM when decoding; they just don't add one when encoding.  The file stream already carries the correct BOM, so if you use the matching encoder, everything works as expected.  The fix is to change getCharset() like so:

	public String getCharset() {
		byte[] bom = (byte[]) getProperty(BYTE_ORDER_MARK);
		if (bom == BOM_UTF_8)
			return CHARSET_UTF_8;
		else if (bom == BOM_UTF_16BE)
			return CHARSET_UTF_16BE;
		else if (bom == BOM_UTF_16LE)
			return CHARSET_UTF_16LE;
		return (String) getProperty(CHARSET);
	}

with

	private static final String CHARSET_UTF_16BE = "UTF-16BE"; //$NON-NLS-1$
	private static final String CHARSET_UTF_16LE = "UTF-16LE"; //$NON-NLS-1$
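
To see the difference the fix makes, here's a small standalone round-trip check using only java.nio (nothing Eclipse-specific):

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomRoundTrip {

	// Decode the bytes with the given charset, then re-encode with the same charset.
	static byte[] roundTrip(byte[] in, Charset cs) {
		ByteBuffer out = cs.encode(cs.decode(ByteBuffer.wrap(in)));
		byte[] result = new byte[out.remaining()];
		out.get(result);
		return result;
	}

	public static void main(String[] args) {
		byte[] utf16leFile = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}; // LE BOM + 'A'

		// Old logic: "UTF-16" strips the LE BOM on decode and writes a BE BOM on
		// encode, silently flipping the file's byte order.
		System.out.println(Arrays.toString(roundTrip(utf16leFile, StandardCharsets.UTF_16))); // [-2, -1, 0, 65] = FE FF 00 41

		// With the fix: "UTF-16LE" keeps the BOM (as U+FEFF) on decode and adds no
		// BOM of its own, so the original bytes come back unchanged.
		System.out.println(Arrays.equals(roundTrip(utf16leFile, StandardCharsets.UTF_16LE), utf16leFile)); // true
	}
}
```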

Sorry, I don't have this checked out, so I can't create a proper patch.  The same problem exists in 3.4 M5.
Comment 1 Warren Paul CLA 2008-04-04 11:00:05 EDT
Created attachment 94857 [details]
test case

Here's a simple UTF-16LE file.  Before the change, if you opened this file in Eclipse, changed it and saved, the encoding would be changed to UTF-16BE.  An easy way to tell is to open the file in Notepad on Win32 and do Save As: the dialog shows the encoding as either Unicode or Unicode big endian.

After the change, the proper encoding is preserved.  I tested opening/changing both UTF-16BE and UTF-16LE files to make sure the correct BOM is preserved.
Comment 2 Dani Megert CLA 2008-04-09 06:55:57 EDT
I've added a workaround to file buffers which will fix it for most textual editors until this bug gets fixed.
Comment 3 Dani Megert CLA 2008-04-09 07:14:54 EDT
Szymon, the proposed fix in comment 0 looks reasonable to me.
Comment 4 Dani Megert CLA 2008-04-11 06:52:37 EDT
>Szymon, the proposed fix in comment 0 looks reasonable to me.
Actually, after discussing with Szymon and reading http://en.wikipedia.org/wiki/Utf-16, we concluded that the current behavior of ContentDescription.getCharset() is correct.

The writer needs to take care of preserving the BOM.
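
One possible shape for such a writer (an illustrative sketch with a hypothetical writeWithBom helper; this is not the actual file-buffers code): sniff the BOM the file originally had and re-emit those bytes before the output of the BOM-less UTF-16LE/UTF-16BE encoder.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; writeWithBom is a hypothetical helper, not Eclipse API.
public class BomPreservingWriter {

	// Write content with the given charset, re-emitting the BOM the original file
	// had.  The UTF-16LE/UTF-16BE encoders never write a BOM themselves.
	static void writeWithBom(OutputStream out, String content, Charset cs, byte[] originalBom)
			throws IOException {
		if (originalBom != null)
			out.write(originalBom);
		out.write(content.getBytes(cs));
	}

	public static void main(String[] args) throws IOException {
		byte[] leBom = {(byte) 0xFF, (byte) 0xFE};
		ByteArrayOutputStream out = new ByteArrayOutputStream();
		writeWithBom(out, "A", StandardCharsets.UTF_16LE, leBom);
		// The little-endian BOM survives the save: FF FE 41 00
		for (byte b : out.toByteArray())
			System.out.printf("%02X ", b);
		System.out.println();
	}
}
```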
Comment 5 Dani Megert CLA 2008-04-11 10:45:01 EDT
Fixed in HEAD for file buffers and FileDocumentProvider.
Available in builds > N20080410-2000.
Comment 6 Dani Megert CLA 2008-04-28 07:54:46 EDT
Starting verification...
Comment 7 Dani Megert CLA 2008-04-28 07:56:31 EDT
Verified in I20080427-2000.
Comment 8 Warren Paul CLA 2008-06-02 14:40:03 EDT
The test case still fails for me.  I open a Unicode file in RC2, change something and save, then open in Notepad, Save As.  It's Unicode big endian.
Comment 9 Dani Megert CLA 2008-06-08 15:35:03 EDT
Save As is a known issue, please see bug 64567.
Comment 10 Warren Paul CLA 2008-06-09 09:28:46 EDT
The original bug report is still an issue.  The save as was done in notepad as an easy way to see the encoding after saving the file in Eclipse.
Comment 11 Dani Megert CLA 2008-06-09 09:32:06 EDT
Sorry, I cannot reproduce. Please provide exact/detailed steps.
Comment 12 Warren Paul CLA 2008-06-09 09:41:26 EDT
(In reply to comment #11)
> Sorry, I cannot reproduce. Please provide exact/detailed steps.
> 

I take a UTF-16LE file and open it in Eclipse RC3 using File > Open File.  I make a minor change and save the file.  The encoding is changed to UTF-16BE.  I'm using Notepad to verify the encoding of the file before and after changing it in Eclipse.
Comment 13 Dani Megert CLA 2008-06-09 09:44:53 EDT
Please file a separate bug. I fixed the scenario where files within a workspace aren't handled correctly. If you use File > Open File it's a different scenario, especially since we have no clue what encoding that file has.
Comment 14 Warren Paul CLA 2008-06-09 12:00:25 EDT
(In reply to comment #13)
> Please file a separate bug. I fixed the scenario where files within a workspace
> aren't handled correctly. If you use File > Open File it's a different
> scenario, especially since we have no clue what encoding that file has.
> 

OK, I cloned this to 236266.  But we do know the encoding when there's a BOM.  The file may not have a BOM, in which case we won't know; but when it does, it should be honored.
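
For what it's worth, reading the encoding out of a BOM takes only a few lines.  This is an illustrative sketch (charsetFromBom is a hypothetical helper, not the Eclipse content-type code, and it ignores the UTF-32 BOMs):

```java
public class BomSniffer {

	// Map a leading byte-order mark to its charset name; null means no BOM found.
	static String charsetFromBom(byte[] head) {
		if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
				&& (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF)
			return "UTF-8";
		if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE)
			return "UTF-16LE";
		if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF)
			return "UTF-16BE";
		return null;
	}

	public static void main(String[] args) {
		System.out.println(charsetFromBom(new byte[]{(byte) 0xFF, (byte) 0xFE, 0x41, 0x00})); // UTF-16LE
		System.out.println(charsetFromBom(new byte[]{0x41, 0x00})); // null
	}
}
```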