| Summary: | logic of encode to utf-8 from surrogate pair is wrong ? | ||
|---|---|---|---|
| Product: | [RT] Jetty | Reporter: | kenya <fwnp1099> |
| Component: | other | Assignee: | Greg Wilkins <gregw> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | normal | ||
| Priority: | P3 | CC: | gregw, jesse.mcconnell, mgorovoy |
| Version: | 8.0.0 | Flags: | gregw:
iplog+
|
| Target Milestone: | 7.1.x | ||
| Hardware: | PC | ||
| OS: | Windows 7 | ||
| Whiteboard: | |||
| Attachments: | |||
|
Description
kenya
I think you are asking about UTF-16 -> UTF-8 conversion issues if that isn't what your looking at here, please explain the issue more Dear Jesse McConnell Thank you for answers. > I think you are asking about UTF-16 -> UTF-8 conversion issues Strictly UTF-16*2 -> UTF-8 conversion issues. The following is the problem example. [decode problem] ====================== UTF-8:F0,A0,AE,9F ->UTF-32(codePoint):U+20B9F(134047) -->//_buffer.append((char)134047); // 134047 needs two java chars. -->_buffer.append(Character.toChars(134047)); // change _buffer.append([D842,DF9F] ); // change(UTF-16*2) ====================== [encode problem] ====================== UTF-16*2:[D842,DF9F] ->org.eclipse.jetty.server.HttpWriter.write([D842,DF9F], 0, 2); -->UTF-8:ED,A1,82,ED,BE,9F // Irreversible(I do not want) UTF-16*2:[D842,DF9F] ->java.io.OutputStreamWrite.write([D842,DF9F], 0, 2); -->UTF-8:F0,A0,AE,9F // Reversible(I hope) ====================== Because I am referring to the following sites: at [http://java.sun.com/developer/technicalArticles/Intl/Supplementary/index.html] ====================== > This article describes how supplementary characters are supported > in the Java platform. Supplementary characters are characters > in the Unicode standard whose code points are above U+FFFF, > and which therefore cannot be described as single 16-bit entities > such as the char data type in the Java programming language. > Such characters are generally rare, but some are used, for example, > as part of Chinese and Japanese personal names, and so support > for them is commonly required for government applications > in East Asian countries. ====================== at [http://www.mail-archive.com/solr-dev@lucene.apache.org/msg19487.html] ====================== > A code point (unicode character) outside of the BMP (basic > multilingual plane, fits in 16 bits) is represented as two java chars > - a surrogate pair. It's a single logical character - see > String.codePointAt(). In correct UTF-8 it should be encoded as a > single code point... but Jetty is ignoring the fact that it's a > surrogate pair and encoding each Java char as it's own code point... > this is often called modified-UTF8 or CESU-8. ====================== thanks ok, taking a look Is there a specific writer case you are having issues with?
I am tinkering with this as a test case for the writer and it works correctly.
@Test
public void testUTF16x2() throws Exception
{
_writer.setCharacterEncoding(StringUtil.__UTF8);
_writer.write((char)134047);
assertArrayEquals(Character.valueOf((char)134047).toString().getBytes(StringUtil.__UTF8),_bytes.asArray());
ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter osw = new OutputStreamWriter(baos);
osw.write((char)134047);
osw.flush();
assertArrayEquals(baos.toByteArray(), _bytes.asArray());
assertArrayEquals(Character.valueOf((char)134047).toString().getBytes(StringUtil.__UTF8), _bytes.asArray());
}
Jesse, I think the issue is with characters that can only be represented in multiple UTF-16 characters. Your example is writing only a single UTF-16 character, so will not trigger this issue. You write 134047, but I think that is a UTF-32 code point. I think kenya is saying that 134047 in UTF-16 is represented by the two UTF-16 characters D842,DF9F.
Jesse, I added the following tests to Utf8StringBufferTest and Utf8StringBuilderTest:
@Test
public void testUTF32codes()
throws Exception
{
String source="\uD842\uDF9F";
System.err.println("source='"+source+"'");
byte[] bytes=source.getBytes("UTF-8");
System.err.println("bytes='"+TypeUtil.toHexString(bytes));
Utf8StringBuilder buffer = new Utf8StringBuilder();
buffer.append(bytes,0,bytes.length);
String result=buffer.toString();
System.err.println("result='"+result+"'");
assertEquals(source,result);
}
It fails and produces the result output (which I don't know if bugzilla will handle correctly):
source='
Created attachment 186262 [details]
test case that demonstrates the problem
Ah - there is also a proposed fix.... didn't notice first time. It looks like it works! thanks! tests and fix committed r2632 (In reply to comment #4) > Is there a specific writer case you are having issues with? > I am tinkering with this as a test case for the writer and it works correctly. > @Test > public void testUTF16x2() throws Exception > { > _writer.setCharacterEncoding(StringUtil.__UTF8); > _writer.write((char)134047); > > assertArrayEquals(Character.valueOf((char)134047).toString().getBytes(StringUtil.__UTF8),_bytes.asArray()); > ByteArrayOutputStream baos = new ByteArrayOutputStream(); > OutputStreamWriter osw = new OutputStreamWriter(baos); > osw.write((char)134047); > osw.flush(); > assertArrayEquals(baos.toByteArray(), _bytes.asArray()); > > assertArrayEquals(Character.valueOf((char)134047).toString().getBytes(StringUtil.__UTF8), > _bytes.asArray()); > } Dear Jesse McConnell Thank you for your help. My idea is the same as Greg. The following is the test code. [test code] ====================== @Test public void testUTF16x2() throws Exception { _writer.setCharacterEncoding(StringUtil.__UTF8); // String source = "\uD842\uDF9F"; int codePoint = 134047; char[] codeUnits = new char[Character.charCount(codePoint)]; int count = Character.toChars(codePoint, codeUnits, 0); String source = new String(codeUnits, 0, count); byte[] bytes = source.getBytes("UTF-8"); _writer.write(source.toCharArray(), 0, source.toCharArray().length); assertArrayEquals(bytes, _bytes.asArray()); ByteArrayOutputStream baos = new ByteArrayOutputStream(); OutputStreamWriter osw = new OutputStreamWriter(baos); osw.write(source.toCharArray(), 0, source.toCharArray().length); osw.flush(); assertArrayEquals(baos.toByteArray(), _bytes.asArray()); } ====================== Created attachment 186326 [details]
my proposed fix
The following is the proposed fix and test code2.
Please have a look.
[proposed fix]
======================
at [org.eclipse.jetty.server.HttpWriter]
155: if (bytes+chars>buffer.length)
156: chars=buffer.length-bytes;
157:
+ : if(Character.isLowSurrogate(s[offset]))
+ : {
+ : chars++;
+ : length++;
+ : offset--; // process two input characters (pending from the previous)
+ : }
+ : if(Character.isHighSurrogate(s[offset+chars-1]))
+ : {
+ : chars--; // process two input characters (pending the next)
+ : if(chars==0)
+ : {
+ : length--;
+ : offset++;
+ : }
+ : }
+ :
158: for (int i = 0; i < chars; i++)
159: {
160:// int code = s[offset+i];
+ : int code = Character.codePointAt(s, offset+i);
+ : if(Character.charCount(code)>1)
+ : i++; // process two input characters
161:
162: if ((code & 0xffffff80) == 0)
======================
[test code2]
======================
@Test
public void testMultiByteOverflowUTF16x2() throws Exception
{
_writer.setCharacterEncoding(StringUtil.__UTF8);
final String singleByteStr = "a";
// final String multiByteDuplicateStr = "\uD842\uDF9F";
int codePoint = 134047;
char[] codeUnits = new char[Character.charCount(codePoint)];
int count = Character.toChars(codePoint, codeUnits, 0);
final String multiByteDuplicateStr = new String(codeUnits, 0, count);
int multiByteStrByteLength = multiByteDuplicateStr.getBytes("UTF-8").length;
// int remainSize = 1;
int remainSize = HttpWriter.MAX_OUTPUT_CHARS - multiByteStrByteLength + 1;
// int remainSize = HttpWriter.MAX_OUTPUT_CHARS;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < HttpWriter.MAX_OUTPUT_CHARS - 1; i++) {
sb.append(singleByteStr);
}
sb.append(multiByteDuplicateStr);
for (int i = 0; i < remainSize; i++) {
sb.append(singleByteStr);
}
char[] buf = new char[HttpWriter.MAX_OUTPUT_CHARS * 3];
int length = HttpWriter.MAX_OUTPUT_CHARS - 1 + remainSize + Character.charCount(codePoint);
sb.toString().getChars(0, length, buf, 0);
_writer.write(buf, 0, HttpWriter.MAX_OUTPUT_CHARS);
_writer.write(buf, HttpWriter.MAX_OUTPUT_CHARS, length-HttpWriter.MAX_OUTPUT_CHARS);
assertEquals(sb.toString(),new String(_bytes.asArray(),StringUtil.__UTF8));
}
======================
thanks
(In reply to comment #8) > Ah - there is also a proposed fix.... didn't notice first time. > It looks like it works! thanks! > tests and fix committed r2632 Dear Greg Wilkins. Thanks for quick reaction. Dear Sir. Thank you for your help. Encoding problem has not been resolved. (a delicate issue) so, I would change the status. (reopend) How are you testing the fix for this? I tried your proposed test case and it passes fine...and there are now test cases for the utf32 codes in the reader string buffer and builder tests Please provide a test case that demonstrates how things are failing for you note, these changes are provided in 7.3.0-SNAPSHOT at the moment, build from trunk or let me know and I'll deploy an updated snapshot to oss cheers, jesse Created attachment 186510 [details] my failing test case (However, this is corrected by proposed fix) Dear Jesse McConnell. The following is Steps to Reproduce: [Reproduce] ================================= 1) Unzip the archive.(html5.zip) 2) Import as a project in Eclipse.(html5\**) 3) Delete proposed sources.(src\test\java\org\*.*,src\test\java\html5\MyTest.java) 4) Run ant.(maven-build.xml) 5) Run class.(src\test\java\html5\Start.java) 6) Browse the local site in IE8 in windows 7.(http://localhost:8080/html5/) -> A code point outside of the BMP are garbled.(in request and response) ================================= > note, these changes are provided in 7.3.0-SNAPSHOT at the moment, build from > trunk or let me know and I'll deploy an updated snapshot to oss I build from trunk. Because we might find another good solution. Thank you very much. Your sample project appears to be using jetty-8.0.0.M2 which is not where the fix was implemented. The fix was updated to jetty-8.0.0.M3-SNAPSHOT but we don't have a recent snapshot of that deployed at the moment. I'll get one of those pushed up but I am thinking this might be the issue. cheers Created attachment 186842 [details]
my failing test case (another solution)
Dear Jesse McConnell.
Thank you for your advice.
My proposed fix was wrong.
Because I had the wrong idea of buffering.
Moreover The test was not enough .
The following is the another solution.
Is this looks good?
[another solution]
======================
at [org.eclipse.jetty.server.HttpWriter]
50 : _generator=_out._generator;
51+: _surrogate=0; // AS lastUTF16CodePoint
52 : }
160: int code = s[offset+i];
+ : if(_surrogate==0)
+ : {
+ : if(Character.isHighSurrogate((char)code))
+ : {
+ : _surrogate=code; // UCS-?
+ : continue;
+ : }
+ : }
+ : else
+ : {
+ : if(Character.isLowSurrogate((char)code))
+ : {
+ : code = Character.toCodePoint((char)_surrogate, (char)code); // UCS-4
+ : }
+ : else
+ : {
+ : code=_surrogate; // UCS-2
+ : i--;
+ : }
+ : }
161:
162: if ((code & 0xffffff80) == 0)
262: }
+ : _surrogate=0; // USED
243:
244: if (bytes==buffer.length)
======================
thanks
could you provide a working test case (and by working I mean one that fails) for the current HttpWriter that I can validate this fix works with? I am loathe to make changes like this without being able to demonstrate why the current solution is wrong. if I get a change later I'll try and cobble together something that fails but if you have something already that would speed it up cheers (In reply to comment #17) > could you provide a working test case (and by working I mean one that fails) > for the current HttpWriter that I can validate this fix works with? > I am loathe to make changes like this without being able to demonstrate why the > current solution is wrong. > if I get a change later I'll try and cobble together something that fails but > if you have something already that would speed it up > cheers Thank you for your help. I'm very sorry. The following is the test code which current fix was failed. If first position of buffer is a low surrogate,occurs ArrayIndexOutOfBoundsException. [test code] ====================== @Test public void testMultiByteOverflowUTF16x2() throws Exception { _writer.setCharacterEncoding(StringUtil.__UTF8); // _writer.setCharacterEncoding("UTF8"); final String singleByteStr = "a"; int remainSize = 1; final String multiByteDuplicateStr = "\uD842\uDF9F"; // valid(High + Low) // final String multiByteDuplicateStr = "\uD842\uD842"; // invalid(High + High) // final String multiByteDuplicateStr = "\uDF9F\uDF9F"; // invalid(Low + Low) // final String multiByteDuplicateStr = "\uDF9F\uD842"; // invalid(Low + High) // int adjustSize = 0; int adjustSize = -1; StringBuilder sb = new StringBuilder(); for (int i = 0; i < HttpWriter.MAX_OUTPUT_CHARS + adjustSize; i++) { sb.append(singleByteStr); } sb.append(multiByteDuplicateStr); for (int i = 0; i < remainSize; i++) { sb.append(singleByteStr); } char[] buf = new char[HttpWriter.MAX_OUTPUT_CHARS * 3]; // int length = HttpWriter.MAX_OUTPUT_CHARS + adjustSize + remainSize + multiByteDuplicateStr.length(); // sb.toString().getChars(0, length, buf, 0); // _writer.write(buf, 0, length); _writer.write(sb.toString()); // System.err.println(sb.toString()); // for (int i = HttpWriter.MAX_OUTPUT_CHARS; i > 0; i--) // System.err.print("-"); // System.err.println(); // System.err.println(new String(_bytes.asArray(), StringUtil.__UTF8)); assertEquals(sb.toString(),new String(_bytes.asArray(),StringUtil.__UTF8)); } ====================== thanks Greetings, I just ran your proposed test case using the current head revision of Jetty 7 trunk as well as Jetty 8 branch, and it passed in both cases. I'm currently testing on Ubuntu 10.10 Maverick Meerkat, using Sun JDK, build 1.6.0_22-b04. Could you please let us know what exact version of Java Development Kit are you using, and what platform you are running it on? Thanks, Michael Created attachment 187917 [details]
my failing test code (HttpWriterTest.java)
Dear Sir.
thanks for your work.
I just ran test case using the current head revision of Jetty 7
trunk on Windows 7 Home Premium, using Sun JDK, build 1.6.0_12-b04.
testUTF16x2 -> failure
testMultiByteOverflowUTF16x2 -> success(no longer fails)
thanks.
hmmm - both tests are failing for me on [java.version] 1.6.0_22 [java.vender] null [java.vm.specification.version] 1.0 [java.vm.specification.vender] null [java.vm.specification.name] Java Virtual Machine Specification [java.vm.version] 17.1-b03 [java.vm.vender] null [java.vm.name] Java HotSpot(TM) 64-Bit Server VM [java.specification.version] 1.6 [java.specification.vender] null [java.specification.name] Java Platform API Specification [os.name] Linux [os.arch] amd64 [os.version] 2.6.35-25-generic investigating.... I applied your patch from comment 16 and both tests are now passing. thanks for the contribution. Committed in r2713 Created attachment 188176 [details]
failing test code (HttpWriterTest.java) and fix code(HttpWriter.java)
Dear Sir.
I just ran following test case using the current head revision of Jetty 7
trunk, and it failed.
[test case]
======================
@Test
public void testMultiByteOverflowUTF16x2() throws Exception
{
_writer.setCharacterEncoding(StringUtil.__UTF8);
final String singleByteStr = "a";
int remainSize = 1;
final String multiByteDuplicateStr = "\uD842\uDF9F"; // valid(High + Low)
// final String multiByteDuplicateStr = "\uD842\uD842"; // invalid(High + High)
// final String multiByteDuplicateStr = "\uDF9F\uDF9F"; // invalid(Low + Low)
// final String multiByteDuplicateStr = "\uDF9F\uD842"; // invalid(Low + High)
// int adjustSize = 0;
// int adjustSize = -1;
int adjustSize = -2;
// int adjustSize = -3;
// int adjustSize = -4;
// int adjustSize = -5;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < HttpWriter.MAX_OUTPUT_CHARS + adjustSize; i++)
{
sb.append(singleByteStr);
}
sb.append(multiByteDuplicateStr);
for (int i = 0; i < remainSize; i++)
{
sb.append(singleByteStr);
}
String source = sb.toString();
byte[] bytes = source.getBytes("UTF-8"/* StringUtil.__UTF81 */);
_writer.write(source.toCharArray(),0,source.toCharArray().length);
java.io.ByteArrayOutputStream baos = new java.io.ByteArrayOutputStream();
java.io.OutputStreamWriter osw = new java.io.OutputStreamWriter(baos/* ,StringUtil.__UTF8 */);
osw.write(source.toCharArray(),0,source.toCharArray().length);
osw.flush();
myReportBytes(bytes);
myReportBytes(baos.toByteArray());
myReportBytes(_bytes.asArray());
assertArrayEquals(bytes,_bytes.asArray());
assertArrayEquals(baos.toByteArray(),_bytes.asArray());
}
======================
I will thought that the following fix will be the solution.
[fix of fix]
======================
175 : code = Character.toCodePoint((char)_surrogate, (char)code); // UCS-4
176+:// _surrogate=0; // USED
177 : }
266 : }
267 :
+ : _surrogate=0; // USED
268 :
269 : if (bytes==buffer.length)
======================
I'm so sorry. My test was not enough.
so, I would change the status. (reopened)
> so, I would change the status. (reopened)
I changed the status. (RESOLVED FIXED -> REOPEND)
Thanks yet again. Fixed in r2811 and a 7.3.1-snapshot has been pushed to http://oss.sonatype.org |