Community
Participate
Working Groups
I hope surrogate pair encode to utf8 and utf8 decode to surrogate pair. The following workarounds are taken until corresponding. (JDK5) ====================== at [org.eclipse.jetty.util.Utf8StringBuilder] 126: _buffer.append((char)_bits); 126: _buffer.append(Character.toChars(_bits)); // change ====================== ====================== at [org.eclipse.jetty.util.Utf8StringBuffer] 129: _buffer.append((char)_bits); 129: _buffer.append(Character.toChars(_bits)); // change ====================== ====================== at [org.eclipse.jetty.server.HttpWriter] 62: _writeMode = WRITE_UTF8; 62:// _writeMode = WRITE_UTF8; // change ====================== This concludes my report.
I think you are asking about UTF-16 -> UTF-8 conversion issues if that isn't what your looking at here, please explain the issue more
Dear Jesse McConnell Thank you for answers. > I think you are asking about UTF-16 -> UTF-8 conversion issues Strictly UTF-16*2 -> UTF-8 conversion issues. The following is the problem example. [decode problem] ====================== UTF-8:F0,A0,AE,9F ->UTF-32(codePoint):U+20B9F(134047) -->//_buffer.append((char)134047); // 134047 needs two java chars. -->_buffer.append(Character.toChars(134047)); // change _buffer.append([D842,DF9F] ); // change(UTF-16*2) ====================== [encode problem] ====================== UTF-16*2:[D842,DF9F] ->org.eclipse.jetty.server.HttpWriter.write([D842,DF9F], 0, 2); -->UTF-8:ED,A1,82,ED,BE,9F // Irreversible(I do not want) UTF-16*2:[D842,DF9F] ->java.io.OutputStreamWrite.write([D842,DF9F], 0, 2); -->UTF-8:F0,A0,AE,9F // Reversible(I hope) ====================== Because I am referring to the following sites: at [http://java.sun.com/developer/technicalArticles/Intl/Supplementary/index.html] ====================== > This article describes how supplementary characters are supported > in the Java platform. Supplementary characters are characters > in the Unicode standard whose code points are above U+FFFF, > and which therefore cannot be described as single 16-bit entities > such as the char data type in the Java programming language. > Such characters are generally rare, but some are used, for example, > as part of Chinese and Japanese personal names, and so support > for them is commonly required for government applications > in East Asian countries. ====================== at [http://www.mail-archive.com/solr-dev@lucene.apache.org/msg19487.html] ====================== > A code point (unicode character) outside of the BMP (basic > multilingual plane, fits in 16 bits) is represented as two java chars > - a surrogate pair. It's a single logical character - see > String.codePointAt(). In correct UTF-8 it should be encoded as a > single code point... but Jetty is ignoring the fact that it's a > surrogate pair and encoding each Java char as it's own code point... > this is often called modified-UTF8 or CESU-8. ====================== thanks
ok, taking a look
Is there a specific writer case you are having issues with? I am tinkering with this as a test case for the writer and it works correctly. @Test public void testUTF16x2() throws Exception { _writer.setCharacterEncoding(StringUtil.__UTF8); _writer.write((char)134047); assertArrayEquals(Character.valueOf((char)134047).toString().getBytes(StringUtil.__UTF8),_bytes.asArray()); ByteArrayOutputStream baos = new ByteArrayOutputStream(); OutputStreamWriter osw = new OutputStreamWriter(baos); osw.write((char)134047); osw.flush(); assertArrayEquals(baos.toByteArray(), _bytes.asArray()); assertArrayEquals(Character.valueOf((char)134047).toString().getBytes(StringUtil.__UTF8), _bytes.asArray()); }
Jesse, I think the issue is with characters that can only be represented in multiple UTF-16 characters. Your example is writing only a single UTF-16 character, so will not trigger this issue. You write 134047, but I think that is a UTF-32 code point. I think kenya is saying that 134047 in UTF-16 is represented by the two UTF-16 characters D842,DF9F.
Jesse, I added the following tests to Utf8StringBufferTest and Utf8StringBuilderTest: @Test public void testUTF32codes() throws Exception { String source="\uD842\uDF9F"; System.err.println("source='"+source+"'"); byte[] bytes=source.getBytes("UTF-8"); System.err.println("bytes='"+TypeUtil.toHexString(bytes)); Utf8StringBuilder buffer = new Utf8StringBuilder(); buffer.append(bytes,0,bytes.length); String result=buffer.toString(); System.err.println("result='"+result+"'"); assertEquals(source,result); } It fails and produces the result output (which I don't know if bugzilla will handle correctly): source='
Created attachment 186262 [details] test case that demonstrates the problem
Ah - there is also a proposed fix.... didn't notice first time. It looks like it works! thanks! tests and fix committed r2632
(In reply to comment #4) > Is there a specific writer case you are having issues with? > I am tinkering with this as a test case for the writer and it works correctly. > @Test > public void testUTF16x2() throws Exception > { > _writer.setCharacterEncoding(StringUtil.__UTF8); > _writer.write((char)134047); > > assertArrayEquals(Character.valueOf((char)134047).toString().getBytes(StringUtil.__UTF8),_bytes.asArray()); > ByteArrayOutputStream baos = new ByteArrayOutputStream(); > OutputStreamWriter osw = new OutputStreamWriter(baos); > osw.write((char)134047); > osw.flush(); > assertArrayEquals(baos.toByteArray(), _bytes.asArray()); > > assertArrayEquals(Character.valueOf((char)134047).toString().getBytes(StringUtil.__UTF8), > _bytes.asArray()); > } Dear Jesse McConnell Thank you for your help. My idea is the same as Greg. The following is the test code. [test code] ====================== @Test public void testUTF16x2() throws Exception { _writer.setCharacterEncoding(StringUtil.__UTF8); // String source = "\uD842\uDF9F"; int codePoint = 134047; char[] codeUnits = new char[Character.charCount(codePoint)]; int count = Character.toChars(codePoint, codeUnits, 0); String source = new String(codeUnits, 0, count); byte[] bytes = source.getBytes("UTF-8"); _writer.write(source.toCharArray(), 0, source.toCharArray().length); assertArrayEquals(bytes, _bytes.asArray()); ByteArrayOutputStream baos = new ByteArrayOutputStream(); OutputStreamWriter osw = new OutputStreamWriter(baos); osw.write(source.toCharArray(), 0, source.toCharArray().length); osw.flush(); assertArrayEquals(baos.toByteArray(), _bytes.asArray()); } ======================
Created attachment 186326 [details] my proposed fix The following is the proposed fix and test code2. Please have a look. [proposed fix] ====================== at [org.eclipse.jetty.server.HttpWriter] 155: if (bytes+chars>buffer.length) 156: chars=buffer.length-bytes; 157: + : if(Character.isLowSurrogate(s[offset])) + : { + : chars++; + : length++; + : offset--; // process two input characters (pending from the previous) + : } + : if(Character.isHighSurrogate(s[offset+chars-1])) + : { + : chars--; // process two input characters (pending the next) + : if(chars==0) + : { + : length--; + : offset++; + : } + : } + : 158: for (int i = 0; i < chars; i++) 159: { 160:// int code = s[offset+i]; + : int code = Character.codePointAt(s, offset+i); + : if(Character.charCount(code)>1) + : i++; // process two input characters 161: 162: if ((code & 0xffffff80) == 0) ====================== [test code2] ====================== @Test public void testMultiByteOverflowUTF16x2() throws Exception { _writer.setCharacterEncoding(StringUtil.__UTF8); final String singleByteStr = "a"; // final String multiByteDuplicateStr = "\uD842\uDF9F"; int codePoint = 134047; char[] codeUnits = new char[Character.charCount(codePoint)]; int count = Character.toChars(codePoint, codeUnits, 0); final String multiByteDuplicateStr = new String(codeUnits, 0, count); int multiByteStrByteLength = multiByteDuplicateStr.getBytes("UTF-8").length; // int remainSize = 1; int remainSize = HttpWriter.MAX_OUTPUT_CHARS - multiByteStrByteLength + 1; // int remainSize = HttpWriter.MAX_OUTPUT_CHARS; StringBuilder sb = new StringBuilder(); for (int i = 0; i < HttpWriter.MAX_OUTPUT_CHARS - 1; i++) { sb.append(singleByteStr); } sb.append(multiByteDuplicateStr); for (int i = 0; i < remainSize; i++) { sb.append(singleByteStr); } char[] buf = new char[HttpWriter.MAX_OUTPUT_CHARS * 3]; int length = HttpWriter.MAX_OUTPUT_CHARS - 1 + remainSize + Character.charCount(codePoint); sb.toString().getChars(0, length, buf, 0); _writer.write(buf, 0, HttpWriter.MAX_OUTPUT_CHARS); _writer.write(buf, HttpWriter.MAX_OUTPUT_CHARS, length-HttpWriter.MAX_OUTPUT_CHARS); assertEquals(sb.toString(),new String(_bytes.asArray(),StringUtil.__UTF8)); } ====================== thanks
(In reply to comment #8) > Ah - there is also a proposed fix.... didn't notice first time. > It looks like it works! thanks! > tests and fix committed r2632 Dear Greg Wilkins. Thanks for quick reaction.
Dear Sir. Thank you for your help. Encoding problem has not been resolved. (a delicate issue) so, I would change the status. (reopend)
How are you testing the fix for this? I tried your proposed test case and it passes fine...and there are now test cases for the utf32 codes in the reader string buffer and builder tests Please provide a test case that demonstrates how things are failing for you note, these changes are provided in 7.3.0-SNAPSHOT at the moment, build from trunk or let me know and I'll deploy an updated snapshot to oss cheers, jesse
Created attachment 186510 [details] my failing test case (However, this is corrected by proposed fix) Dear Jesse McConnell. The following is Steps to Reproduce: [Reproduce] ================================= 1) Unzip the archive.(html5.zip) 2) Import as a project in Eclipse.(html5\**) 3) Delete proposed sources.(src\test\java\org\*.*,src\test\java\html5\MyTest.java) 4) Run ant.(maven-build.xml) 5) Run class.(src\test\java\html5\Start.java) 6) Browse the local site in IE8 in windows 7.(http://localhost:8080/html5/) -> A code point outside of the BMP are garbled.(in request and response) ================================= > note, these changes are provided in 7.3.0-SNAPSHOT at the moment, build from > trunk or let me know and I'll deploy an updated snapshot to oss I build from trunk. Because we might find another good solution. Thank you very much.
Your sample project appears to be using jetty-8.0.0.M2 which is not where the fix was implemented. The fix was updated to jetty-8.0.0.M3-SNAPSHOT but we don't have a recent snapshot of that deployed at the moment. I'll get one of those pushed up but I am thinking this might be the issue. cheers
Created attachment 186842 [details] my failing test case (another solution) Dear Jesse McConnell. Thank you for your advice. My proposed fix was wrong. Because I had the wrong idea of buffering. Moreover The test was not enough . The following is the another solution. Is this looks good? [another solution] ====================== at [org.eclipse.jetty.server.HttpWriter] 50 : _generator=_out._generator; 51+: _surrogate=0; // AS lastUTF16CodePoint 52 : } 160: int code = s[offset+i]; + : if(_surrogate==0) + : { + : if(Character.isHighSurrogate((char)code)) + : { + : _surrogate=code; // UCS-? + : continue; + : } + : } + : else + : { + : if(Character.isLowSurrogate((char)code)) + : { + : code = Character.toCodePoint((char)_surrogate, (char)code); // UCS-4 + : } + : else + : { + : code=_surrogate; // UCS-2 + : i--; + : } + : } 161: 162: if ((code & 0xffffff80) == 0) 262: } + : _surrogate=0; // USED 243: 244: if (bytes==buffer.length) ====================== thanks
could you provide a working test case (and by working I mean one that fails) for the current HttpWriter that I can validate this fix works with? I am loathe to make changes like this without being able to demonstrate why the current solution is wrong. if I get a change later I'll try and cobble together something that fails but if you have something already that would speed it up cheers
(In reply to comment #17) > could you provide a working test case (and by working I mean one that fails) > for the current HttpWriter that I can validate this fix works with? > I am loathe to make changes like this without being able to demonstrate why the > current solution is wrong. > if I get a change later I'll try and cobble together something that fails but > if you have something already that would speed it up > cheers Thank you for your help. I'm very sorry. The following is the test code which current fix was failed. If first position of buffer is a low surrogate,occurs ArrayIndexOutOfBoundsException. [test code] ====================== @Test public void testMultiByteOverflowUTF16x2() throws Exception { _writer.setCharacterEncoding(StringUtil.__UTF8); // _writer.setCharacterEncoding("UTF8"); final String singleByteStr = "a"; int remainSize = 1; final String multiByteDuplicateStr = "\uD842\uDF9F"; // valid(High + Low) // final String multiByteDuplicateStr = "\uD842\uD842"; // invalid(High + High) // final String multiByteDuplicateStr = "\uDF9F\uDF9F"; // invalid(Low + Low) // final String multiByteDuplicateStr = "\uDF9F\uD842"; // invalid(Low + High) // int adjustSize = 0; int adjustSize = -1; StringBuilder sb = new StringBuilder(); for (int i = 0; i < HttpWriter.MAX_OUTPUT_CHARS + adjustSize; i++) { sb.append(singleByteStr); } sb.append(multiByteDuplicateStr); for (int i = 0; i < remainSize; i++) { sb.append(singleByteStr); } char[] buf = new char[HttpWriter.MAX_OUTPUT_CHARS * 3]; // int length = HttpWriter.MAX_OUTPUT_CHARS + adjustSize + remainSize + multiByteDuplicateStr.length(); // sb.toString().getChars(0, length, buf, 0); // _writer.write(buf, 0, length); _writer.write(sb.toString()); // System.err.println(sb.toString()); // for (int i = HttpWriter.MAX_OUTPUT_CHARS; i > 0; i--) // System.err.print("-"); // System.err.println(); // System.err.println(new String(_bytes.asArray(), StringUtil.__UTF8)); assertEquals(sb.toString(),new String(_bytes.asArray(),StringUtil.__UTF8)); } ====================== thanks
Greetings, I just ran your proposed test case using the current head revision of Jetty 7 trunk as well as Jetty 8 branch, and it passed in both cases. I'm currently testing on Ubuntu 10.10 Maverick Meerkat, using Sun JDK, build 1.6.0_22-b04. Could you please let us know what exact version of Java Development Kit are you using, and what platform you are running it on? Thanks, Michael
Created attachment 187917 [details] my failing test code (HttpWriterTest.java) Dear Sir. thanks for your work. I just ran test case using the current head revision of Jetty 7 trunk on Windows 7 Home Premium, using Sun JDK, build 1.6.0_12-b04. testUTF16x2 -> failure testMultiByteOverflowUTF16x2 -> success(no longer fails) thanks.
hmmm - both tests are failing for me on [java.version] 1.6.0_22 [java.vender] null [java.vm.specification.version] 1.0 [java.vm.specification.vender] null [java.vm.specification.name] Java Virtual Machine Specification [java.vm.version] 17.1-b03 [java.vm.vender] null [java.vm.name] Java HotSpot(TM) 64-Bit Server VM [java.specification.version] 1.6 [java.specification.vender] null [java.specification.name] Java Platform API Specification [os.name] Linux [os.arch] amd64 [os.version] 2.6.35-25-generic investigating....
I applied your patch from comment 16 and both tests are now passing. thanks for the contribution. Committed in r2713
Created attachment 188176 [details] failing test code (HttpWriterTest.java) and fix code(HttpWriter.java) Dear Sir. I just ran following test case using the current head revision of Jetty 7 trunk, and it failed. [test case] ====================== @Test public void testMultiByteOverflowUTF16x2() throws Exception { _writer.setCharacterEncoding(StringUtil.__UTF8); final String singleByteStr = "a"; int remainSize = 1; final String multiByteDuplicateStr = "\uD842\uDF9F"; // valid(High + Low) // final String multiByteDuplicateStr = "\uD842\uD842"; // invalid(High + High) // final String multiByteDuplicateStr = "\uDF9F\uDF9F"; // invalid(Low + Low) // final String multiByteDuplicateStr = "\uDF9F\uD842"; // invalid(Low + High) // int adjustSize = 0; // int adjustSize = -1; int adjustSize = -2; // int adjustSize = -3; // int adjustSize = -4; // int adjustSize = -5; StringBuilder sb = new StringBuilder(); for (int i = 0; i < HttpWriter.MAX_OUTPUT_CHARS + adjustSize; i++) { sb.append(singleByteStr); } sb.append(multiByteDuplicateStr); for (int i = 0; i < remainSize; i++) { sb.append(singleByteStr); } String source = sb.toString(); byte[] bytes = source.getBytes("UTF-8"/* StringUtil.__UTF81 */); _writer.write(source.toCharArray(),0,source.toCharArray().length); java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream(); java.io.OutputStreamWriter osw = new java.io.OutputStreamWriter(baos/* ,StringUtil.__UTF8 */); osw.write(source.toCharArray(),0,source.toCharArray().length); osw.flush(); myReportBytes(bytes); myReportBytes(baos.toByteArray()); myReportBytes(_bytes.asArray()); assertArrayEquals(bytes,_bytes.asArray()); assertArrayEquals(baos.toByteArray(),_bytes.asArray()); } ====================== I will thought that the following fix will be the solution. [fix of fix] ====================== 175 : code = Character.toCodePoint((char)_surrogate, (char)code); // UCS-4 176+:// _surrogate=0; // USED 177 : } 266 : } 267 : + : _surrogate=0; // USED 268 : 269 : if (bytes==buffer.length) ====================== I'm so sorry. My test was not enough. so, I would change the status. (reopened)
> so, I would change the status. (reopened) I changed the status. (RESOLVED FIXED -> REOPEND)
Thanks yet again. Fixed in r2811 and a 7.3.1-snapshot has been pushed to http://oss.sonatype.org