Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.
Bug 333481 - logic of encode to utf-8 from surrogate pair is wrong ?
Summary: logic of encode to utf-8 from surrogate pair is wrong ?
Status: RESOLVED FIXED
Alias: None
Product: Jetty
Classification: RT
Component: other (show other bugs)
Version: 8.0.0   Edit
Hardware: PC Windows 7
: P3 normal (vote)
Target Milestone: 7.1.x   Edit
Assignee: Greg Wilkins CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-01-04 10:23 EST by kenya CLA
Modified: 2011-02-21 22:42 EST (History)
3 users (show)

See Also:
gregw: iplog+


Attachments
test case that demonstrates the problem (2.38 KB, patch)
2011-01-07 06:55 EST, Greg Wilkins CLA
no flags Details | Diff
my proposed fix (10.92 KB, text/plain)
2011-01-08 05:49 EST, kenya CLA
no flags Details
my failing test case (However, this is corrected by proposed fix) (25.20 KB, application/octet-stream)
2011-01-11 10:25 EST, kenya CLA
no flags Details
my failing test case (another solution) (30.43 KB, application/octet-stream)
2011-01-14 14:08 EST, kenya CLA
no flags Details
my failing test code (HttpWriterTest.java) (9.53 KB, application/octet-stream)
2011-01-30 04:42 EST, kenya CLA
no flags Details
failing test code (HttpWriterTest.java) and fix code(HttpWriter.java) (4.09 KB, application/x-zip-compressed)
2011-02-02 13:18 EST, kenya CLA
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description kenya CLA 2011-01-04 10:23:59 EST
I hope surrogate pair encode to utf8 and utf8 decode to surrogate pair.



The following workarounds are taken until corresponding. (JDK5)


======================
at [org.eclipse.jetty.util.Utf8StringBuilder]

126:                    _buffer.append((char)_bits);
126:                    _buffer.append(Character.toChars(_bits)); // change
======================
======================
at [org.eclipse.jetty.util.Utf8StringBuffer]

129:                    _buffer.append((char)_bits);
129:                    _buffer.append(Character.toChars(_bits)); // change
======================
======================
at [org.eclipse.jetty.server.HttpWriter]

62:            _writeMode = WRITE_UTF8;
62://            _writeMode = WRITE_UTF8; // change
======================


This concludes my report.
Comment 1 Jesse McConnell CLA 2011-01-05 08:46:20 EST
I think you are asking about UTF-16 -> UTF-8 conversion issues

if that isn't what your looking at here, please explain the issue more
Comment 2 kenya CLA 2011-01-06 13:19:54 EST
Dear Jesse McConnell 

Thank you for answers.

> I think you are asking about UTF-16 -> UTF-8 conversion issues
Strictly UTF-16*2 -> UTF-8 conversion issues.


The following is the problem example.

[decode problem]
======================
UTF-8:F0,A0,AE,9F
->UTF-32(codePoint):U+20B9F(134047)
-->//_buffer.append((char)134047); // 134047 needs two java chars.
-->_buffer.append(Character.toChars(134047)); // change
   _buffer.append([D842,DF9F]              ); // change(UTF-16*2) 
======================

[encode problem]
======================
UTF-16*2:[D842,DF9F]
->org.eclipse.jetty.server.HttpWriter.write([D842,DF9F], 0, 2);
-->UTF-8:ED,A1,82,ED,BE,9F // Irreversible(I do not want)

UTF-16*2:[D842,DF9F]
->java.io.OutputStreamWrite.write([D842,DF9F], 0, 2);
-->UTF-8:F0,A0,AE,9F // Reversible(I hope)
======================



Because I am referring to the following sites:

at [http://java.sun.com/developer/technicalArticles/Intl/Supplementary/index.html]
======================
> This article describes how supplementary characters are supported 
> in the Java platform. Supplementary characters are characters 
> in the Unicode standard whose code points are above U+FFFF, 
> and which therefore cannot be described as single 16-bit entities 
> such as the char data type in the Java programming language. 
> Such characters are generally rare, but some are used, for example, 
> as part of Chinese and Japanese personal names, and so support 
> for them is commonly required for government applications 
> in East Asian countries.
======================

at [http://www.mail-archive.com/solr-dev@lucene.apache.org/msg19487.html]
======================
> A code point (unicode character) outside of the BMP (basic
> multilingual plane, fits in 16 bits) is represented as two java chars
> - a surrogate pair.  It's a single logical character - see
> String.codePointAt().  In correct UTF-8 it should be encoded as a
> single code point... but Jetty is ignoring the fact that it's a
> surrogate pair and encoding each Java char as it's own code point...
> this is often called modified-UTF8 or CESU-8.
======================

thanks
Comment 3 Jesse McConnell CLA 2011-01-06 16:44:03 EST
ok, taking a look
Comment 4 Jesse McConnell CLA 2011-01-06 17:19:44 EST
Is there a specific writer case you are having issues with?

I am tinkering with this as a test case for the writer and it works correctly.

    @Test
    public void testUTF16x2() throws Exception
    {
        _writer.setCharacterEncoding(StringUtil.__UTF8);
        _writer.write((char)134047);     
        assertArrayEquals(Character.valueOf((char)134047).toString().getBytes(StringUtil.__UTF8),_bytes.asArray());
        
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        OutputStreamWriter osw = new OutputStreamWriter(baos);
        
        osw.write((char)134047);
        osw.flush();
        assertArrayEquals(baos.toByteArray(), _bytes.asArray());
        assertArrayEquals(Character.valueOf((char)134047).toString().getBytes(StringUtil.__UTF8), _bytes.asArray());

    }
Comment 5 Greg Wilkins CLA 2011-01-07 06:41:02 EST
Jesse,

I think the issue is with characters that can only be represented in multiple UTF-16 characters.

Your example is writing only a single UTF-16 character, so will not trigger this issue.  You write 134047,  but I think that is a UTF-32 code point.

I think kenya is saying that 134047 in UTF-16 is represented by the two UTF-16 characters D842,DF9F.
Comment 6 Greg Wilkins CLA 2011-01-07 06:49:42 EST
Jesse, I added the following tests to Utf8StringBufferTest and Utf8StringBuilderTest:


    @Test 
    public void testUTF32codes()
    throws Exception
    {
        String source="\uD842\uDF9F";
        System.err.println("source='"+source+"'");
        byte[] bytes=source.getBytes("UTF-8");
        System.err.println("bytes='"+TypeUtil.toHexString(bytes));
        
        Utf8StringBuilder buffer = new Utf8StringBuilder();
        buffer.append(bytes,0,bytes.length);
        
        String result=buffer.toString();
        System.err.println("result='"+result+"'");
        
        assertEquals(source,result);
        
    }


It fails and produces the result output (which I don't know if bugzilla will handle correctly):

source='
Comment 7 Greg Wilkins CLA 2011-01-07 06:55:01 EST
Created attachment 186262 [details]
test case that demonstrates the problem
Comment 8 Greg Wilkins CLA 2011-01-07 07:03:50 EST
Ah - there is also a proposed fix.... didn't notice first time.
It looks like it works! thanks!

tests and fix committed r2632
Comment 9 kenya CLA 2011-01-07 19:18:00 EST
(In reply to comment #4)
> Is there a specific writer case you are having issues with?
> I am tinkering with this as a test case for the writer and it works correctly.
>     @Test
>     public void testUTF16x2() throws Exception
>     {
>         _writer.setCharacterEncoding(StringUtil.__UTF8);
>         _writer.write((char)134047);     
>        
> assertArrayEquals(Character.valueOf((char)134047).toString().getBytes(StringUtil.__UTF8),_bytes.asArray());
>         ByteArrayOutputStream baos = new ByteArrayOutputStream();
>         OutputStreamWriter osw = new OutputStreamWriter(baos);
>         osw.write((char)134047);
>         osw.flush();
>         assertArrayEquals(baos.toByteArray(), _bytes.asArray());
>        
> assertArrayEquals(Character.valueOf((char)134047).toString().getBytes(StringUtil.__UTF8),
> _bytes.asArray());
>     }


Dear Jesse McConnell 

Thank you for your help.

My idea is the same as Greg.
The following is the test code.

[test code]
======================
	@Test
	public void testUTF16x2() throws Exception {
		_writer.setCharacterEncoding(StringUtil.__UTF8);

		// String source = "\uD842\uDF9F";
		int codePoint = 134047;
		char[] codeUnits = new char[Character.charCount(codePoint)];
		int count = Character.toChars(codePoint, codeUnits, 0);
		String source = new String(codeUnits, 0, count);

		byte[] bytes = source.getBytes("UTF-8");
		_writer.write(source.toCharArray(), 0, source.toCharArray().length);
assertArrayEquals(bytes, _bytes.asArray());
		
		ByteArrayOutputStream baos = new ByteArrayOutputStream();
		OutputStreamWriter osw = new OutputStreamWriter(baos);
		osw.write(source.toCharArray(), 0, source.toCharArray().length);
		osw.flush();
assertArrayEquals(baos.toByteArray(), _bytes.asArray());
	}
======================
Comment 10 kenya CLA 2011-01-08 05:49:44 EST
Created attachment 186326 [details]
my proposed fix

The following is the proposed fix and test code2.
Please have a look.


[proposed fix]
======================
at [org.eclipse.jetty.server.HttpWriter]

155:                    if (bytes+chars>buffer.length)
156:                        chars=buffer.length-bytes;
157:
 + :                    if(Character.isLowSurrogate(s[offset]))
 + :                    {
 + :                    	chars++;
 + :                    	length++;
 + :                    	offset--; // process two input characters (pending from the previous)
 + :                    }
 + :                    if(Character.isHighSurrogate(s[offset+chars-1]))
 + :                    {
 + :                    	chars--; // process two input characters (pending the next)
 + :	                    if(chars==0)
 + :	                    {
 + :	                        length--;
 + :	                        offset++;
 + :	                    }
 + :                    }
 + :
158:                    for (int i = 0; i < chars; i++)
159:                    {
160://                        int code = s[offset+i];
 + :                        int code = Character.codePointAt(s, offset+i);
 + :                        if(Character.charCount(code)>1)
 + :                        	i++; // process two input characters
161:                        
162:                        if ((code & 0xffffff80) == 0) 
======================


[test code2]
======================
    @Test
    public void testMultiByteOverflowUTF16x2() throws Exception
    {
        _writer.setCharacterEncoding(StringUtil.__UTF8);

        final String singleByteStr = "a";
        // final String multiByteDuplicateStr = "\uD842\uDF9F";
        int codePoint = 134047;
		char[] codeUnits = new char[Character.charCount(codePoint)];
		int count = Character.toChars(codePoint, codeUnits, 0);
		final String multiByteDuplicateStr = new String(codeUnits, 0, count);
        int multiByteStrByteLength = multiByteDuplicateStr.getBytes("UTF-8").length;
        // int remainSize = 1;
        int remainSize = HttpWriter.MAX_OUTPUT_CHARS - multiByteStrByteLength + 1;
        // int remainSize = HttpWriter.MAX_OUTPUT_CHARS;

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < HttpWriter.MAX_OUTPUT_CHARS - 1; i++) {
          sb.append(singleByteStr);
        }
        sb.append(multiByteDuplicateStr);
        for (int i = 0; i < remainSize; i++) {
          sb.append(singleByteStr);
        }
        char[] buf = new char[HttpWriter.MAX_OUTPUT_CHARS * 3];

        int length = HttpWriter.MAX_OUTPUT_CHARS - 1 + remainSize + Character.charCount(codePoint);
        sb.toString().getChars(0, length, buf, 0);

        _writer.write(buf, 0, HttpWriter.MAX_OUTPUT_CHARS);        
        _writer.write(buf, HttpWriter.MAX_OUTPUT_CHARS, length-HttpWriter.MAX_OUTPUT_CHARS);

        assertEquals(sb.toString(),new String(_bytes.asArray(),StringUtil.__UTF8));
    }
======================

thanks
Comment 11 kenya CLA 2011-01-08 20:10:13 EST
(In reply to comment #8)
> Ah - there is also a proposed fix.... didn't notice first time.
> It looks like it works! thanks!
> tests and fix committed r2632

Dear Greg Wilkins.

Thanks for quick reaction.
Comment 12 kenya CLA 2011-01-08 20:44:39 EST
Dear Sir.

Thank you for your help.

Encoding problem has not been resolved. (a delicate issue)

so, I would change the status. (reopend)
Comment 13 Jesse McConnell CLA 2011-01-10 12:09:05 EST
How are you testing the fix for this?

I tried your proposed test case and it passes fine...and there are now test cases for the utf32 codes in the reader string buffer and builder tests

Please provide a test case that demonstrates how things are failing for you

note, these changes are provided in 7.3.0-SNAPSHOT at the moment, build from trunk or let me know and I'll deploy an updated snapshot to oss

cheers,
jesse
Comment 14 kenya CLA 2011-01-11 10:25:36 EST
Created attachment 186510 [details]
my failing test case (However, this is corrected by proposed fix)

Dear Jesse McConnell.


The following is Steps to Reproduce:

[Reproduce]
=================================
1) Unzip the archive.(html5.zip)
2) Import as a project in Eclipse.(html5\**)
3) Delete proposed sources.(src\test\java\org\*.*,src\test\java\html5\MyTest.java)
4) Run ant.(maven-build.xml)
5) Run class.(src\test\java\html5\Start.java)
6) Browse the local site in IE8 in windows 7.(http://localhost:8080/html5/)

-> A code point outside of the BMP are garbled.(in request and response)
=================================


> note, these changes are provided in 7.3.0-SNAPSHOT at the moment, build from
> trunk or let me know and I'll deploy an updated snapshot to oss

I build from trunk.
Because we might find another good solution.


Thank you very much.
Comment 15 Jesse McConnell CLA 2011-01-11 12:37:05 EST
Your sample project appears to be using jetty-8.0.0.M2 which is not where the fix was implemented.

The fix was updated to jetty-8.0.0.M3-SNAPSHOT but we don't have a recent snapshot of that deployed at the moment.  I'll get one of those pushed up but I am thinking this might be the issue.

cheers
Comment 16 kenya CLA 2011-01-14 14:08:20 EST
Created attachment 186842 [details]
my failing test case (another solution)

Dear Jesse McConnell.

Thank you for your advice.

My proposed fix was wrong.
Because I had the wrong idea of buffering.
Moreover The test was not enough .


The following is the another solution.
Is this looks good?


[another solution]
======================
at [org.eclipse.jetty.server.HttpWriter]

50 :        _generator=_out._generator;
51+:         _surrogate=0; // AS lastUTF16CodePoint
52 :    }

160:                        int code = s[offset+i];
 + :                         if(_surrogate==0)
 + :                         {
 + :                             if(Character.isHighSurrogate((char)code))
 + :                             {
 + :                             	_surrogate=code; // UCS-?
 + :                             	continue;
 + :                             }                        	
 + :                         }
 + :                         else
 + :                         {
 + :                         	if(Character.isLowSurrogate((char)code))
 + :                         	{
 + :                             	code = Character.toCodePoint((char)_surrogate, (char)code); // UCS-4
 + :                         	}
 + :                         	else
 + :                         	{
 + :                            		code=_surrogate; // UCS-2
 + :                            		i--;
 + :                         	}
 + :                         }
161:                     	
162:                        if ((code & 0xffffff80) == 0) 

262:                            }
 + :                             _surrogate=0; // USED
243:
244:                            if (bytes==buffer.length)
======================


thanks
Comment 17 Jesse McConnell CLA 2011-01-14 14:45:04 EST
could you provide a working test case (and by working I mean one that fails) for the current HttpWriter that I can validate this fix works with?

I am loathe to make changes like this without being able to demonstrate why the current solution is wrong.

if I get a change later I'll try and cobble together something that fails but if you have something already that would speed it up

cheers
Comment 18 kenya CLA 2011-01-14 20:17:11 EST
(In reply to comment #17)
> could you provide a working test case (and by working I mean one that fails)
> for the current HttpWriter that I can validate this fix works with?
> I am loathe to make changes like this without being able to demonstrate why the
> current solution is wrong.
> if I get a change later I'll try and cobble together something that fails but
> if you have something already that would speed it up
> cheers

Thank you for your help.

I'm very sorry.

The following is the test code which current fix was failed.
If first position of buffer is a low surrogate,occurs ArrayIndexOutOfBoundsException.

[test code]
======================
    @Test
    public void testMultiByteOverflowUTF16x2() throws Exception
    {
        _writer.setCharacterEncoding(StringUtil.__UTF8);
        // _writer.setCharacterEncoding("UTF8");
        
        final String singleByteStr = "a";
        int remainSize = 1;
        final String multiByteDuplicateStr = "\uD842\uDF9F"; // valid(High + Low)
        // final String multiByteDuplicateStr = "\uD842\uD842"; // invalid(High + High)
        // final String multiByteDuplicateStr = "\uDF9F\uDF9F"; // invalid(Low + Low)
        // final String multiByteDuplicateStr = "\uDF9F\uD842"; // invalid(Low + High)
        // int adjustSize = 0;
        int adjustSize = -1;

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < HttpWriter.MAX_OUTPUT_CHARS + adjustSize; i++) {
          sb.append(singleByteStr);
        }
        sb.append(multiByteDuplicateStr);
        for (int i = 0; i < remainSize; i++) {
          sb.append(singleByteStr);
        }
        char[] buf = new char[HttpWriter.MAX_OUTPUT_CHARS * 3];

        // int length = HttpWriter.MAX_OUTPUT_CHARS + adjustSize + remainSize + multiByteDuplicateStr.length();
        // sb.toString().getChars(0, length, buf, 0);
       	// _writer.write(buf, 0, length);
        _writer.write(sb.toString());
       	
		// System.err.println(sb.toString());
		// for (int i = HttpWriter.MAX_OUTPUT_CHARS; i > 0; i--)
		// 	System.err.print("-");
		// System.err.println();
		// System.err.println(new String(_bytes.asArray(), StringUtil.__UTF8));

       assertEquals(sb.toString(),new String(_bytes.asArray(),StringUtil.__UTF8));
    }
======================

thanks
Comment 19 Michael Gorovoy CLA 2011-01-24 20:43:21 EST
Greetings,

I just ran your proposed test case using the current head revision of Jetty 7 trunk as well as Jetty 8 branch, and it passed in both cases. I'm currently testing on Ubuntu 10.10 Maverick Meerkat, using Sun JDK, build 1.6.0_22-b04.

Could you please let us know what exact version of Java Development Kit are you using, and what platform you are running it on?

Thanks,
Michael
Comment 20 kenya CLA 2011-01-30 04:42:22 EST
Created attachment 187917 [details]
my failing test code (HttpWriterTest.java)

Dear Sir.

thanks for your work.

I just ran test case using the current head revision of Jetty 7
trunk on Windows 7 Home Premium, using Sun JDK, build 1.6.0_12-b04.

testUTF16x2                  -> failure
testMultiByteOverflowUTF16x2 -> success(no longer fails)

thanks.
Comment 21 Greg Wilkins CLA 2011-01-30 23:53:26 EST
hmmm - both tests are failing for me on 

[java.version]	1.6.0_22
[java.vender]	null
[java.vm.specification.version]	1.0
[java.vm.specification.vender]	null
[java.vm.specification.name]	Java Virtual Machine Specification
[java.vm.version]	17.1-b03
[java.vm.vender]	null
[java.vm.name]	Java HotSpot(TM) 64-Bit Server VM
[java.specification.version]	1.6
[java.specification.vender]	null
[java.specification.name]	Java Platform API Specification
[os.name]	Linux
[os.arch]	amd64
[os.version]	2.6.35-25-generic

investigating....
Comment 22 Greg Wilkins CLA 2011-01-31 00:15:26 EST
I applied your patch from comment 16 and both tests are now passing.

thanks for the contribution. 

Committed in r2713
Comment 23 kenya CLA 2011-02-02 13:18:02 EST
Created attachment 188176 [details]
failing test code (HttpWriterTest.java) and fix code(HttpWriter.java)

Dear Sir.


I just ran following test case using the current head revision of Jetty 7
trunk, and it failed.

[test case]
======================
    @Test
    public void testMultiByteOverflowUTF16x2() throws Exception
    {
        _writer.setCharacterEncoding(StringUtil.__UTF8);

        final String singleByteStr = "a";
        int remainSize = 1;
        final String multiByteDuplicateStr = "\uD842\uDF9F"; // valid(High + Low)
        // final String multiByteDuplicateStr = "\uD842\uD842"; // invalid(High + High)
        // final String multiByteDuplicateStr = "\uDF9F\uDF9F"; // invalid(Low + Low)
        // final String multiByteDuplicateStr = "\uDF9F\uD842"; // invalid(Low + High)
        // int adjustSize = 0;
        // int adjustSize = -1;
        int adjustSize = -2;        
        // int adjustSize = -3;
        // int adjustSize = -4;
        // int adjustSize = -5;
        
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < HttpWriter.MAX_OUTPUT_CHARS + adjustSize; i++)
        {
            sb.append(singleByteStr);
        }
        sb.append(multiByteDuplicateStr);
        for (int i = 0; i < remainSize; i++)
        {
            sb.append(singleByteStr);
        }
        String source = sb.toString();

        byte[] bytes = source.getBytes("UTF-8"/* StringUtil.__UTF81 */);
        _writer.write(source.toCharArray(),0,source.toCharArray().length);

        java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
        java.io.OutputStreamWriter osw = new java.io.OutputStreamWriter(baos/* ,StringUtil.__UTF8 */);
        osw.write(source.toCharArray(),0,source.toCharArray().length);
        osw.flush();

        myReportBytes(bytes);
        myReportBytes(baos.toByteArray());
        myReportBytes(_bytes.asArray());

        assertArrayEquals(bytes,_bytes.asArray());
        assertArrayEquals(baos.toByteArray(),_bytes.asArray());
    }
======================


I will thought that the following fix will be the solution.

[fix of fix]
======================
175 :                            code = Character.toCodePoint((char)_surrogate, (char)code); // UCS-4
176+://                            _surrogate=0; // USED
177 :                        }

266 :                            } 
267 :                            
 +  :                            _surrogate=0; // USED
268 :
269 :                            if (bytes==buffer.length)
======================


I'm so sorry. My test was not enough.
so, I would change the status. (reopened)
Comment 24 kenya CLA 2011-02-04 22:37:28 EST
> so, I would change the status. (reopened)

I changed the status. (RESOLVED FIXED  -> REOPEND)
Comment 25 Greg Wilkins CLA 2011-02-21 22:42:30 EST
Thanks yet again.  Fixed in r2811 and a 7.3.1-snapshot has been pushed to http://oss.sonatype.org