
Bug 377604

Summary: [nls tooling] TUR4.2: Unicode escape of Latin 1 specific characters is incorrect by Externalize String
Product: [Eclipse Project] JDT
Reporter: Kentaroh Noji <kennoji>
Component: Text
Assignee: JDT-Text-Inbox <jdt-text-inbox>
Status: CLOSED INVALID
QA Contact:
Severity: normal
Priority: P3
CC: camle, daniel_megert, harendra, kennoji, maedera
Version: 3.8
Target Milestone: ---
Hardware: PC
OS: Windows 7
Whiteboard:
Attachments (Description: Flags):
Sample test case: none
a message.properties file generated by Externalize String: none
Screen capture of result: none

Description Kentaroh Noji CLA 2012-04-25 04:44:55 EDT
Build Identifier: I20120315-1300

Latin 1 specific characters such as "¡¢£¤¥¦§¨©ª«¬SHY®¯Bx°±²³´µ¶·¸¹º»¼½¾¿CxÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏDxÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßExàáâãäåæçèéêëìíîïFxðñòóôõö÷øùúûüýþÿ!" should be encoded as Unicode escapes (\uxxxx) in a properties file. However, the Externalize String function writes these Latin 1 characters as is. The JDK's native2ascii tool translates these Latin 1 characters into Unicode escapes such as \u00a1\u00a2\u00a3\u00a4\u00a5\u00a6\u00a7\u00a8\u00a9\u00aa\u00ab\u00acSHY\u00ae\u00afBx\u00b0\u00b1\u00b2\u00b3\u00b4\u00b5\u00b6\u00b7\u00b8\u00b9\u00ba\u00bb\u00bc\u00bd\u00be\u00bfCx\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5\u00c6\u00c7\u00c8\u00c9\u00ca\u00cb\u00cc\u00cd\u00ce\u00cfDx\u00d0\u00d1\u00d2\u00d3\u00d4\u00d5\u00d6\u00d7\u00d8\u00d9\u00da\u00db\u00dc\u00dd\u00de\u00dfEx\u00e0\u00e1\u00e2\u00e3\u00e4\u00e5\u00e6\u00e7\u00e8\u00e9\u00ea\u00eb\u00ec\u00ed\u00ee\u00efFx\u00f0\u00f1\u00f2\u00f3\u00f4\u00f5\u00f6\u00f7\u00f8\u00f9\u00fa\u00fb\u00fc\u00fd\u00fe\u00ff

Reproducible: Always

Steps to Reproduce:
OS: e.g. Windows 7 SP1 Professional Turkish Edition
JDK: java full version JRE 1.7.0 IBM Windows AMD 64 build pwa6470-20110906_01
Locale: Turkish

I found this symptom in a Turkish environment with some Turkish characters. After some investigation, I found that it happens with Latin 1 specific characters.

1. Create a Java project, and create a Java class in it.
2. Add "System.out.println("¡¢£¤¥¦§¨©ª«¬SHY®¯Bx°±²³´µ¶·¸¹º»¼½¾¿CxÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏDxÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßExàáâãäåæçèéêëìíîïFxðñòóôõö÷øùúûüýþÿ");" to the Java class you created.
3. Choose Source > Externalize String. Click the Next button, then Finish.
4. Open the messages.properties file.
Comment 1 Kentaroh Noji CLA 2012-04-25 04:56:08 EDT
Created attachment 214510 [details]
Sample test case

This sample contains Latin 1 specific characters.
Comment 2 Kentaroh Noji CLA 2012-04-25 04:56:53 EDT
Created attachment 214511 [details]
a message.properties file generated by Externalize String
Comment 3 Kentaroh Noji CLA 2012-04-25 04:58:33 EDT
Created attachment 214512 [details]
Screen capture of result
Comment 4 Dani Megert CLA 2012-04-25 08:16:08 EDT
(In reply to comment #1)
> Created attachment 214510 [details]
> Sample test case
> 
> This sample contains Latin 1 specific characters.

This file does not compile and hence nothing can be externalized.

When I fix the error and then externalize the string I get this entry:

Uni.0=¡¢£¤¥¦§¨©ª«¬SHY®¯Bx°±²³´µ¶·¸¹º»¼½¾¿CxÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏDxÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßExàáâãäåæçèéêëìíîïFxðñòóôõö÷øùúûüýþÿ


And all works fine.
Comment 5 Kentaroh Noji CLA 2012-04-25 08:45:24 EDT
(In reply to comment #4)
 
> And all works fine.

Let me explain the problem again. Here are the steps to reproduce it:

1. Create a Java class that contains string literals with Latin 1 characters:

public class Uniescape {

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		System.out.println("AÀÁÂÃÄÅ");
	}

}

2. Run Externalize String in Eclipse (Source > Externalize String).

3. A messages.properties file is then created, containing a key=value entry like:

Uniescape.0=AÀÁÂÃÄÅ

4. When I run the JDK's native2ascii command on messages.properties, I get the following result:

Uniescape.0=A\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5

5. Why doesn't Eclipse's Externalize String function transform these Latin 1 characters into Unicode escapes as defined by the Java spec? Note that it does transform non-ASCII characters other than Latin 1 into Unicode escapes.
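
For reference, the conversion seen in step 4 can be approximated in a few lines of Java. This is only a minimal sketch, not the actual native2ascii implementation: the class name AsciiEscaper and the method escapeNonAscii are made up for illustration, and the sketch simply works one UTF-16 code unit at a time.

```java
public class AsciiEscaper {

    // Hypothetical helper: every char above 0x7F is replaced by a
    // backslash-u escape with four lowercase hex digits; ASCII is copied as is.
    public static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Latin 1 letters are written as escapes here so that this
        // source file itself stays plain ASCII.
        String entry = "Uniescape.0=A\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5";
        System.out.println(escapeNonAscii(entry));
    }
}
```

Running main prints the same escaped entry as shown in step 4.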
Comment 6 Dani Megert CLA 2012-04-25 09:01:03 EDT
(In reply to comment #5)
> (In reply to comment #4)
> 
> > And all works fine.
> 
> Let me explain the problem again. Here are problem recreation steps: 
> 
> 1. Create a java class file which contains literals with Latin 1 characters:
> 
> public class Uniescape {
> 
>     /**
>      * @param args
>      */
>         public static void main(String[] args) {
>             System.out.println("AÀÁÂÃÄÅ"); 
>         }
> 
>     }
> 
> 2. Externalize String in Eclipse. Source > Externalize String.
> 
> 3. Then, messages.properties file is created and it contains key=value like:
> 
> Uniescape.0=AÀÁÂÃÄÅ

And this is correct.


 

> 5. Why does not the Eclipse's externalizing string function transform these
> Latin 1 characters into Unicode escape defined by the Java sepc.? 

That's not in the spec. Please point me to the part of the JLS7 you are referring to if you disagree. The Javadoc says that only non-Latin1 characters need to be escaped.
Comment 7 Kentaroh Noji CLA 2012-04-25 23:07:23 EDT
The Java™ Language Specification, Java SE 7 Edition, says in section 3.3 (Unicode Escapes):

The Java programming language specifies a standard way of transforming a
program written in Unicode into ASCII that changes a program into a form that
can be processed by ASCII-based tools. The transformation involves converting
any Unicode escapes in the source text of the program to ASCII by adding an extra
u - for example, \uxxxx becomes \uuxxxx - while simultaneously converting non-
ASCII characters in the source text to Unicode escapes containing a single u each.


So it looks like Unicode escapes are meant for transforming Unicode characters into ASCII characters only, and ASCII does not include U+00A0 - U+00FF.
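
The transformation described in the quoted JLS paragraph could be sketched roughly as follows. This is an illustration under assumptions, not code from the JDK or from Eclipse: the names UnicodeToAscii and toAscii are hypothetical, and the sketch does not try to handle malformed input such as a backslash directly followed by a non-ASCII character.

```java
public class UnicodeToAscii {

    // Hypothetical helper sketching the JLS 3.3 transformation:
    // existing Unicode escapes gain one extra 'u', and non-ASCII
    // characters become escapes with a single 'u'.
    public static String toAscii(String source) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0; // consecutive backslashes just before the current char
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '\\') {
                backslashRun++;
                out.append(c);
                continue;
            }
            if (c == 'u' && backslashRun % 2 == 1) {
                // An odd run of backslashes followed by 'u' starts an escape.
                out.append("uu");
            } else if (c > 0x7F) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
            backslashRun = 0;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input text: an existing escape for U+00C0, a space, then a raw U+00C0.
        System.out.println(toAscii("\\u00c0 \u00c0"));
    }
}
```

For that input, the existing escape comes out with a doubled u and the raw character comes out as an escape with a single u, so the two remain distinguishable when the transformation is reversed.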
Comment 8 Dani Megert CLA 2012-04-26 02:42:26 EDT
Yes, there *is a way* to transform to ASCII. It does not say that one *must* transform it. One only has to do this when a properties file entry contains non-Latin1 characters.
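
A quick way to check this is to load the same entry in both forms through java.util.Properties, whose load(InputStream) method reads ISO 8859-1. The snippet below is a minimal sketch; the class name Latin1PropertiesDemo and the helper loadValue are hypothetical, and the file content is simulated with an in-memory byte array rather than a real messages.properties file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class Latin1PropertiesDemo {

    // Hypothetical helper: encodes the text as ISO 8859-1 bytes, the way a
    // properties file sits on disk, and reads one value back.
    public static String loadValue(String fileContent, String key) {
        byte[] bytes = fileContent.getBytes(StandardCharsets.ISO_8859_1);
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Entry with raw Latin 1 characters, as written by Externalize String.
        String raw = loadValue(
                "Uniescape.0=A\u00c0\u00c1\u00c2\u00c3\u00c4\u00c5\n", "Uniescape.0");
        // The same entry with escapes, as produced by native2ascii.
        String escaped = loadValue(
                "Uniescape.0=A\\u00c0\\u00c1\\u00c2\\u00c3\\u00c4\\u00c5\n", "Uniescape.0");
        System.out.println(raw.equals(escaped)); // prints "true"
    }
}
```

Both forms load to the same string, which is why the raw Latin 1 output is acceptable for a properties file.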
Comment 9 Kentaroh Noji CLA 2012-04-26 23:47:06 EDT
Thank you. I found the following statement in the Javadoc for Properties:

When saving properties to a stream or loading them from a stream, the ISO 8859-1 character encoding is used. For characters that cannot be directly represented in this encoding, Unicode escapes are used; however, only a single 'u' character is allowed in an escape sequence. The native2ascii tool can be used to convert property files to and from other character encodings. 

I understand this is not a bug, so I am closing this report.
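
The rule in the quoted Javadoc (escape only what ISO 8859-1 cannot represent) can be sketched as below. This is not Eclipse's actual Externalize String code; the names Latin1Escaper and escapeNonLatin1 are hypothetical, and the sketch leaves out the other escaping a real properties writer needs, such as backslashes and line breaks.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class Latin1Escaper {

    // Hypothetical helper: keeps every char that ISO 8859-1 can encode and
    // writes all others as backslash-u escapes with four hex digits.
    public static String escapeNonLatin1(String s) {
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (latin1.canEncode(c)) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00C0 fits in ISO 8859-1; U+0130 (Turkish capital dotted I) does not.
        String result = escapeNonLatin1("A\u00c0\u0130");
        System.out.println(result.equals("A\u00c0\\u0130")); // prints "true"
    }
}
```

Only the Turkish letter is escaped here, which matches the behavior reported in this bug: Latin 1 characters are written as is and other non-ASCII characters are escaped.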