I was reading http://www-128.ibm.com/developerworks/xml/library/x-utf8/?ca=dnt-635 today, and it got me thinking: why does Eclipse default to a platform-specific encoding for its file types when creating new files? In the majority of cases, source files will already be in US-ASCII (which is upwards compatible with UTF-8), and these days operating systems are already capable of dealing with UTF-8 documents in editors (Windows has Notepad and WordPad, Mac OS X has TextEdit, and Linux has a variety of tools, such as vim and emacs, that also support UTF-8). This would also avoid potential problems with creating HTML pages, by disallowing certain Windows-codepage characters (those tending to be ` and ' quotes) that don't show up on other platforms. Obviously Eclipse can be configured to define the default encoding type, but Eclipse is such a leading player in the IDE market that it makes sense for Eclipse to take the lead in making UTF-8 the default encoding for text files.
-1 to do this for all kind of text files. Note we already define UTF-8 as default for files with content-type XML. Moving to JDT Core to decide whether they want to define a default encoding for Java source files.
Why -1 for all types of files? It will only affect newly created files within Eclipse, and the default can be changed by users afterwards. Given that Eclipse is well set up for distributed environments, and indeed, works on many platforms, then UTF-8 is the only sensible default encoding type. Granted, there may be good reasons; but can you explain them here for others interested in the reasoning behind the decision?
Because Eclipse is not just an IDE and more important we should not override the platform (os) encoding that user has chosen.
The user does not normally choose the platform OS encoding. They install an operating system and say they're in the U.S./Japan/Quebec/Denmark/wherever and a default character set is chosen for them based on that information. On Windows and the Mac, this encoding is likely to be a local, platform-dependent, non-standard character set. Linux is a little better. You're at least likely to get a genuine standard character set. However, it still may not be Unicode. The user has not made an explicit choice of the default encoding at the operating system level, and generally cannot make that choice. I think the vendors should also change their defaults to UTF-8 and Unicode, but until they do, there's no reason for Eclipse to respect their defaults.
For that matter, Windows users don't get to choose what their encoding is either. The regional options specify a 'Language for non-Unicode platforms' that should be used as a fall-back when you have a program that doesn't know what Unicode is. But Eclipse knows what Unicode is, and can deal with it nicely; even Windows 2000 supported UTF-8. Given that all OS vendors are moving towards supporting UTF-8 as a default option, I think it's time to give the shackles of codepages a rest and move forwards rather than looking backwards. It doesn't really matter whether you're looking at Eclipse as an IDE or the Eclipse platform; I'm writing a Rich Client Application and it's just as important for that that the default text format is a cross-platform rather than platform-specific format. After all, I'm developing it as a Rich Client app because of the cross-platform support.
I'll voice my 2 cents that I do not possibly see how UTF-8 as the default for all files could possibly work. Seems this would mean files created with Eclipse could not be interoperable with other applications not making that assumption. Perhaps the originator is assuming that all UTF-8 is identified with a 3 byte BOM, which I do not think is true. Even if so, Java, by itself, does not even handle that 3 byte BOM well (does not handle well on 'read', does not produce during 'write'). Of course, it makes sense for XML, etc. HTML and JSP's all have their own spec'd encoding rules (well, HTML doesn't, that I know of). But as a general rule, if the encoding is not identified in the content (or spec'd rules for the content), you pretty much have to assume platform default.
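The BOM remark above can be checked with a few lines of Java. This is a minimal sketch, not Eclipse code: the class `BomDemo` and its helper names are made up for illustration. It shows that the standard UTF-8 decoder keeps a leading BOM as the character U+FEFF rather than stripping it, and that encoding a string as UTF-8 does not emit one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Eclipse or the JDK.
class BomDemo {
    // Returns true if the data starts with the three UTF-8 BOM bytes EF BB BF.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF;
    }

    // Decodes as UTF-8, dropping a leading BOM if one is present.
    static String decodeSkippingBom(byte[] data) {
        int offset = hasUtf8Bom(data) ? 3 : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};

        // Plain decoding keeps the BOM as an extra leading character.
        String naive = new String(withBom, StandardCharsets.UTF_8);
        System.out.println("naive length:   " + naive.length());
        System.out.println("skipped length: " + decodeSkippingBom(withBom).length());

        // Encoding never writes a BOM on its own.
        System.out.println("has BOM on write: " + hasUtf8Bom("hi".getBytes(StandardCharsets.UTF_8)));
    }
}
```

A reader that wants to tolerate BOM-marked files has to do the skipping itself, as `decodeSkippingBom` does here.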
Much as I would like to see some sanity in this area, I agree with David. The description is correct - an increasing number of applications can deal with UTF-8. The fact that Windows adds the UTF-8 BOM helps a lot. But other platforms still don't write a UTF-8 BOM, and until there is a reliable, platform-independent, content-independent way to detect UTF-8 encoding, it doesn't make sense as the Eclipse default. Too bad, really, but easy workaround. The user can set UTF-8 as the default encoding.
Autodetection of encoding would be nice. However, without a lot of effort it can't be done for all types of files. However, this does not mean we should accept the platform default. The platform default is just one other encoding that cannot be autodetected. There is no reason that encoding is more likely to be correct than UTF-8. In 2005 files are routinely moved between platforms and locales. I often start a project by checking existing code out of a source repository. What encoding the files are in depends only on what encoding they were checked in as. It has nothing to do with the platform default. One option that UTF-8 offers (and single-byte platform defaults do not) is to attempt to read a file as UTF-8 and, if it fails, to try again with the platform default. A file that is not UTF-8 is unlikely to be read as UTF-8 without detectable error. The reverse is not true. If, for instance, you attempt to read a file as Latin-1, then all files will seem to be legal Latin-1 without exception, even if that's wrong. Non-UTF-8 can normally be detected through invalid byte sequences. However, all byte sequences are legal in Latin-1 and most other single-byte character sets.
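The "try UTF-8 first, then fall back" idea described above can be sketched in Java as follows. This is only an illustration under stated assumptions: `Utf8FirstDecoder` is a hypothetical helper, not something Eclipse provides, and the fallback charset is passed in by the caller rather than taken from any Eclipse preference.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
class Utf8FirstDecoder {
    // Decodes strictly as UTF-8; if the bytes are not valid UTF-8,
    // decodes them with the supplied fallback charset instead.
    static String decode(byte[] data, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            // Invalid byte sequence: this is the "detectable error" case.
            return new String(data, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] latin1Cafe = {'c', 'a', 'f', (byte) 0xE9};              // single-byte encoding
        byte[] utf8Cafe = {'c', 'a', 'f', (byte) 0xC3, (byte) 0xA9};   // UTF-8 encoding

        System.out.println(decode(latin1Cafe, StandardCharsets.ISO_8859_1).length());
        System.out.println(decode(utf8Cafe, StandardCharsets.ISO_8859_1).length());
    }
}
```

The asymmetry from the comment is visible in the design: the strict UTF-8 pass can fail and trigger the fallback, whereas a Latin-1 pass would accept every byte sequence and so could never signal a wrong guess.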
The fact is no single encoding will work as the default for all files. This includes the platform default. The current system does not work. The question is not whether UTF-8 will work for all files. It won't. The question is whether assuming UTF-8 as the default will work better than the current, failing system. It will. Java is a cross-platform language. Teams routinely use different platforms and increasingly the same platform but set to different locales. Even if everyone on a team is using Windows, the developers in Japan, Israel, the U.S. India, and China are all likely to have different default character sets. Unicode is the only character set that has any hope of working for them all, and UTF-8 is the right encoding for Unicode.
Just to be clear, I didn't raise this with the expectation that all UTF-8 files are marked with the BOM, or assume that such encodings can be automatically detected. However, just because one encoding cannot automatically be detected does not mean that another choice is therefore the correct answer. Consider the possibilities that Eclipse (including RCP) are possibly going to be used for: 1) Editing files that other Eclipse installs will read (e.g. private data to an RCP application, or others specific to a feature e.g. Java source files) 2) Editing files that will be stored in some kind of shared repository, potentially globally 3) Editing files as a souped-up editor for the filesystem Of these three possibilities, it's way more likely that Eclipse will be used as one of the first two options. Even Eclipse's assumptions about all files being stored under some particular workspace/project combination (for the IDE, at least) is likely to rule out Eclipse as a general purpose editor, unlike Emacs which happily can edit files in any location. For example, I wouldn't use Eclipse to edit /etc/hosts because (a) I don't want to have to set up a .project in /etc just to look at the config files, and (b) I don't want to create linked resources for every file I want to edit in Eclipse -- I'll just use Emacs or Vi (both of which support UTF-8, by the way). The point is that with any choice, there are pros and cons. In this case, if files are created/assumed to be UTF-8, then you'll end up with a file that is editable on any Unicode-savvy operating system. This includes Windows, where UTF-8 files are supported by the OS (and the encoding reported by Java is the 'fallback encoding' for non-Unicode aware systems). On the other hand, if you use RandomOS' choice of character encoding for files, then it's only RandomOS that will be able to read that file correctly. 
All other non-RandomOS systems will load the file transparently with errors, possibly mangling the data in the process. Eclipse is supposed to be about platform-neutral development, so that development is independent of the OS that is being used to create the content. This simply isn't true when using RandomOS' character set encoding. In fact, by using RandomOS' encoding, you are explicitly limiting those file(s) to only be usable on RandomOS. Yes, it may break obscure cases where Eclipse is being used as an editor for platform-specific files, like /etc/hosts. But it will fix a lot more cases where files are developed by distributed team members around the globe on a variety of different operating systems. So there is no one this-absolutely-works-for-all-cases. But UTF-8 is a much, much better choice as a default than RandomOS' encoding, especially when compared with the target uses of Eclipse outlined above. This bug was also raised against the Core Text component, rather than JDT itself. To re-iterate, this is a bug on the text handling of *all* text files, not just .java files. This may be currently assigned to jdt-inbox for their comments, but the bug should still remain a Core Text bug.
> This may be currently assigned to jdt-inbox for their comments, but the bug
> should still remain a Core Text bug.

It's not Platform Text: the default encoding is provided by Platform Resources.
The encoding for .java files is not spec'ed by the JLS. Moving to Platform Resources for comment on the general resolution of this request.
I beg to differ re: Java files: http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#95413 "Programs are written in Unicode (section 3.1), but lexical translations are provided (section 3.2) so that Unicode escapes (section 3.3) can be used to include any Unicode character using only ASCII characters." "3.1 Unicode Programs are written using the Unicode character set. ... " It doesn't explicitly say which encoding of Unicode should be used (UTF-8, UTF-16 etc.) but it *does* say that it is Unicode. Furthermore, it says that programs may also be written in ASCII with Unicode escape sequences, and UTF-8 is the only encoding that also has the property that the first 128 characters are ASCII, so the implicit conclusion is that the only UTF encoding that can be used is UTF-8. Note that the statement further on: "Except for comments (section 3.7), identifiers, and the contents of character and string literals (section 3.10.4, section 3.10.5), all input elements (section 3.5) in a program are formed only from ASCII characters (or Unicode escapes (section 3.3) which result in ASCII characters). ASCII (ANSI X3.4) is the American Standard Code for Information Interchange. The first 128 characters of the Unicode character encoding are the ASCII characters." may be misleading due to the English, but it is saying that all of the punctuation, white-space and other characters in a file are ASCII (also the same character in UTF-8) -- but (importantly) the comments, identifiers, and string literals (i.e. everything except keywords and punctuation) *is* Unicode. It's just that as well as Unicode, it can also be represented using \u notation, but does not have to be.
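As a small illustration of the JLS passage quoted above, the following sketch keeps the source file pure ASCII while still producing a non-ASCII character through a Unicode escape. The class name `UnicodeEscapeDemo` is made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the source text of this file is ASCII only.
class UnicodeEscapeDemo {
    // The escape in the literal is translated by the compiler into the
    // single character U+00E9 (e with acute accent).
    static final String ESCAPED = "caf\u00e9";

    public static void main(String[] args) {
        System.out.println("chars:       " + ESCAPED.length());
        System.out.println("last char:   " + (int) ESCAPED.charAt(3));
        System.out.println("UTF-8 bytes: " + ESCAPED.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Because the file contains only ASCII bytes, it compiles identically whether the compiler reads it as UTF-8, Cp1252, or Latin-1, which is why escapes sidestep the source-encoding question entirely.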
The only reasonable default encoding is the one supplied by the operating system. If the user is running in a different locale and has an encoding to match that locale, it needs to be honoured by Eclipse. Interoperability with the local operating system and other local programs is more important than cross-platform interoperability. If you want to set the encoding used by Eclipse to UTF-8, you can do so.
This isn't just a cross-platform issue. It's a cross-locale issue. Developers writing code/documentation/files in a Locale on one side of the globe should be able to have files shared with those on the other side of the globe, even on the same platform. Further, there's no way of setting the default locale as picked up by Java on Windows systems. The Cp1252 reported on Windows (when running in England) is the fallback encoding for when UTF-8 isn't supported. I also feel this bug needs a wider audience (and reasoned discussion) than an assertion that 'the only sensible default is the OS locale'. As is noted in comment #4, the user often doesn't have this choice of encoding; they just select from a generic regional location and a locale-specific non-global one is picked randomly without any user intervention. I also strongly disagree with the statement that 'Interoperability with the local operating system and other local programs is more important than cross-[locale] interoperability.'. I invite you to submit an example of any Eclipse application -- JDT or otherwise -- that edits operating system files instead of ones that are destined for UTF-8 capable systems (web browsers, version control systems etc.) And please note, this is about cross-locale interoperability, not just cross-platform interoperability.
Note that for cross-locale interoperability users are expected to set the default encoding (whatever it is) at the project level (instead of at the workspace level). This setting is stored in the project content area, thus being shared through the team repository (all users will end up with the same setting).
I'd also like to point out that whilst it's possible to override the default Java setting (using -Dfile.encoding=UTF-8), this hides what any original platform setting may be at any level. Having Eclipse default to UTF-8 by default, whilst still allowing it to be changed back to any locale-specific encoding, is a way of having a locale- and platform- portable default that is overridable by the user to be locale-specific. I don't necessarily believe that a per-project setting is the best workaround, as there are RCP apps that don't necessarily use .projects for data interchange (they may choose to work with WebDAV or similar). Having a default accessible may make sense for these kinds of applications as well.
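For readers who want to see what the JVM actually picked up, here is a minimal sketch using standard JDK calls; the class name `DefaultCharsetDemo` is an assumption made for illustration. It prints the effective default and shows the alternative of naming the charset explicitly, so the result does not depend on the platform setting or on a `-Dfile.encoding` override.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Eclipse.
class DefaultCharsetDemo {
    // Encodes with an explicitly named charset, independent of the JVM default.
    static byte[] encodeExplicit(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What the JVM treats as its default charset on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());
        // The system property mentioned in the discussion; may be null on some setups.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // Same input, same bytes on every platform and locale.
        System.out.println("explicit UTF-8 length: " + encodeExplicit("caf\u00e9").length);
    }
}
```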
I completely agree with John. UTF-8 is still a minority encoding; most files are in national character sets. The setting most likely to correspond to the user's national character set is the operating system default. Given that the user can change the default encoding with one preference setting, I'm surprised this discussion (reasoned or not) has dragged on this long.
Because it's about changing the *default*. You know, what Eclipse comes with. Yes, it's trivial for me to change my preference setting, but I'm building RCP applications and I don't want users across Europe (who use a variety of slightly different locales) wondering why they can't exchange RCP documents.

o Eclipse uses GIFs instead of BMPs, because they're more portable
o Eclipse uses HTML instead of Word or TROFF, because they're more portable

And yet you're arguing that using a less-portable character set encoding is the right thing to do? Eclipse isn't used as a general-purpose text editor to edit operating system files. Even if it was, current operating systems can deal with UTF-8 character set encodings natively and this 'codepage' thing is a fallback for applications that can't, or in this case, won't deal with UTF-8 encodings. As has already been pointed out, Java files are already UTF-8, and it's also currently the default for XML documents. It should also be the default for any HTML or JSP document to avoid non-printing characters showing up when the page is viewed on a platform where the encoding is different. Eclipse is a very good cross-platform product. It's already used in global development (the Eclipse committers do a great job of making that happen). However, you have situations where developers in one locale will be creating files with one encoding, and developers the other side of the world using another encoding. Tell me why it's not sensible that we should all be using one encoding?
This discussion is closed as far as I'm concerned. I think the discussion has had adequate exposure in the various newsgroup postings Alex has made, and there clearly isn't community consensus. Changing the default encoding is a drastic enough change that we would need broad support from both the community and the committers on affected projects, and the -1's above from the platform and WTP text leads alone are enough for me to consider this closed.
My dev team ran into this unexpected issue. A cut/paste from a Windows doc into an Eclipse Java file editor was then checked into the source control system. A linux user then was unable to compile or open the file because it contained cp1252 characters, which are illegal under the linux default of utf8. Developers expectations are that the tools are going to protect them from such situations. We have people developing product on three platforms: Windows, Linux, and Solaris. Cross platform is a big issue for us. I assume the best suggestion is that we manually configure character encoding to be UTF-8 across all platforms? Cheers,
Yes, you need to use an encoding that is shared across all your development platforms, or restrict yourself to the range that those encodings have in common. cp1252 and UTF-8 share a significant subset (the 128 ASCII characters, and more). The development of Eclipse itself is done across many platforms within that shared subset of encodings.
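One way a team could enforce "restrict yourself to the common range" is to check text for anything outside ASCII before it is committed. The following is a sketch only; `AsciiSubsetCheck` and its methods are hypothetical names, not an Eclipse feature.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
class AsciiSubsetCheck {
    // True if every character can be encoded as US-ASCII.
    static boolean isPureAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    // Index of the first character above 0x7F, or -1 if there is none.
    static int firstNonAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String plain = "String s = \"quoted\";";
        // Curly quotes of the kind a word processor inserts, written as escapes.
        String pasted = "String s = \u201cquoted\u201d;";

        System.out.println(isPureAscii(plain));
        System.out.println(isPureAscii(pasted) + " at index " + firstNonAscii(pasted));
    }
}
```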
That comment is incorrect. The only characters Cp1252 and UTF-8 share in the context of Eclipse are the 128 ASCII characters. While all 256 Cp1252 characters are available in UTF-8, 128 of them do not share the same code points. Since, unlike XML, Java files do not carry any information about their own encoding, this needs to be externally specified by the IDE. Thus a Cp1252 file loaded into a UTF-8 environment will be reported as malformed. The Java editor does not autodetect and account for the different mappings from code points to characters, as an XML editor might be able to do. This is a flaw in the design of Java. We can't fix that. Currently the best solution is indeed to manually configure for UTF-8 across all platforms. However, since that is the best solution it should be the default as well.
cp1252 and UTF-8 only differ in the range 0x80-0x9F, the remaining 224 characters are the same. Here is a mapping table from unicode.org: http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
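The mapping table linked above relates cp1252 bytes to Unicode code points, so the range claim can be probed by comparing cp1252 with ISO-8859-1, which maps every byte to the code point of the same value. The sketch below is an illustration with assumptions: the class name is made up, and it relies on the running JDK shipping the `windows-1252` charset, which is common but not among the charsets every Java platform is required to support. Note that this compares code points, not encoded bytes; at the byte level UTF-8 agrees with either charset only for ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; assumes the JDK provides "windows-1252".
class Cp1252VersusLatin1 {
    // Returns the byte values (0..255) that decode to different
    // characters in windows-1252 and ISO-8859-1.
    static List<Integer> differingBytes() {
        Charset cp1252 = Charset.forName("windows-1252");
        List<Integer> result = new ArrayList<>();
        for (int b = 0; b < 256; b++) {
            byte[] one = {(byte) b};
            String asCp1252 = new String(one, cp1252);
            String asLatin1 = new String(one, StandardCharsets.ISO_8859_1);
            if (!asCp1252.equals(asLatin1)) {
                result.add(b);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> diff = differingBytes();
        System.out.printf("first differing byte: 0x%02X%n", diff.get(0));
        System.out.printf("last differing byte:  0x%02X%n", diff.get(diff.size() - 1));
    }
}
```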
Please confirm or correct the following statement. If a dev team is developing on Linux, Solaris, and Windows, then it is recommended that Eclipse file encoding be set to UTF-8 for all platforms.
The Cp1252 character set is a proper subset of the Unicode character set. Every character in Cp1252 has a corresponding Unicode code point. The problem is that Eclipse doesn't work at the level of characters. It does not know that Cp1252 é (the byte 0xE9) is equivalent to the two UTF-8 bytes 0xC3 0xA9. Thus at the level Eclipse works, the different character sets are not compatible. There is a missing layer of indirection in Java. XML has this additional layer of indirection between bytes and characters. Java doesn't. If Java had it, we wouldn't be having this discussion. Given that Java does not include in-file metadata about the character encoding, the question becomes what Eclipse should do to handle character set identification. No solution will be perfect. However, in the long term I think the current platform-specific approach is clearly inferior to a platform-independent UTF-8 default.
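The byte values quoted above are easy to reproduce. A minimal sketch, assuming the JDK ships the `windows-1252` charset; the class `EAcuteBytes` and its `hex` helper are made-up names for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class EAcuteBytes {
    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // the accented letter from the comment, as an escape
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println("Cp1252: " + hex(eAcute.getBytes(cp1252)));
        System.out.println("UTF-8:  " + hex(eAcute.getBytes(StandardCharsets.UTF_8)));

        // Reading the Cp1252 byte as UTF-8 does not recover the letter;
        // the lenient String constructor substitutes U+FFFD instead.
        String misread = new String(eAcute.getBytes(cp1252), StandardCharsets.UTF_8);
        System.out.println("misread equals original: " + misread.equals(eAcute));
    }
}
```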
My correction: "It is recommended that the Eclipse file encoding be set to UTF-8 for all platforms." No "if" is necessary. :-) Even a mono-platform environment will not be harmed by using UTF-8, and may well be improved by it if characters from outside the current locale are needed. In today's international world, we cannot assume that just because I am typing this message in the U.S. that I only require characters from the Roman alphabet. I may well need Cyrillic or Japanese or other character sets. At worst UTF-8 does no harm. At best it avoids numerous problems of characters set interoperability between programmers on a team.
It would have saved my team some grief had all the platform defaults been set to UTF-8. This character encoding issue is not something most developers want or need to be bothered with. P.S. Why did the Windows cp1252 editor not complain when illegal characters were cut and pasted from a Word doc? Is this a bug? Had the editor detected/prevented the illegal chars, we would not have had a problem.
Good point about catching this on paste - I suggest entering a separate bug report against the Platform Text component.
Where exactly does the file encoding setting get persisted and to which property? I'm playing with the file encoding setting now and exporting a new preference file, but can't locate the encoding change in the preferences file. Further, when I import our old preferences file, the encoding remains set to my change to UTF8, not back to the original default of cp1252. ?
And here is the bug report response...

daniel.megert@eclipse.org changed:

What       |Removed |Added
----------------------------------------------------------------------------
Status     |NEW     |RESOLVED
Resolution |        |INVALID

------- Additional Comments From daniel.megert@eclipse.org 2005-11-09 12:12 -------
We have no idea under which encoding someone will check out that file in the future. The one who checks out the file could have his workspace encoding set to Chinese or whatever. If you share files across platforms you have these choices:
1. have a policy that users watch for such problems and don't release such files
2. have a policy that users must set their workspace encoding to UTF-8
3. set the project encoding to UTF-8. This might be the best solution because anyone who checks out the project will get the correct encoding.
re comment 30: The best place to set the encoding is by right clicking on any resource (project, folder, file), and selecting Properties > Info. When the encoding is set on a project or folder, it sets the *default* encoding for all files in that container. I.e., if the file does not have an explicit encoding stored, Eclipse looks for the encoding on the containing folder recursively until an encoding setting is found. When set this way, the encoding information will be persisted in the project content area in the .settings directory along with the project contents.
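The lookup order described above (file, then each containing folder, then the project, then the workspace default) can be modelled in a few lines. This is a toy model, not the Eclipse resources API: `EncodingLookup`, its methods, and the path strings are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the lookup; not Eclipse code.
class EncodingLookup {
    private final Map<String, String> explicit = new HashMap<>();
    private final String workspaceDefault;

    EncodingLookup(String workspaceDefault) {
        this.workspaceDefault = workspaceDefault;
    }

    // Records an explicit encoding for a file, folder, or project path.
    void set(String path, String charsetName) {
        explicit.put(path, charsetName);
    }

    // Walks from the resource up through its containers and returns the
    // first explicit setting found, or the workspace default if none exists.
    String resolve(String path) {
        String current = path;
        while (true) {
            String found = explicit.get(current);
            if (found != null) {
                return found;
            }
            int slash = current.lastIndexOf('/');
            if (slash <= 0) {
                return workspaceDefault; // reached the project level without a match
            }
            current = current.substring(0, slash); // move to the parent container
        }
    }

    public static void main(String[] args) {
        EncodingLookup lookup = new EncodingLookup("Cp1252");
        lookup.set("/proj", "UTF-8");
        lookup.set("/proj/legacy/old.txt", "ISO-8859-1");

        System.out.println(lookup.resolve("/proj/src/Main.java"));
        System.out.println(lookup.resolve("/proj/legacy/old.txt"));
        System.out.println(lookup.resolve("/other/readme.txt"));
    }
}
```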
I can not locate a .settings folder. Where exactly is it located?
The .settings folder is only created as needed... it is a sibling of the .project file in your project's top level directory. If you set the encoding of a resource in that project (or for the entire project), it will be stored in that directory.
I just changed the encoding on the project to UTF-8, then closed the project. No sign of a .settings folder. ?
Created attachment 29639 [details] Screen shot For illustration, here is a screen shot of a simple project that has its encoding set to UTF8. You can see the .settings folder in the Navigator, and it contains a file called "org.eclipse.core.resources.prefs" that stores the encoding details for the project. I assume you are using Eclipse 3.0 or greater, and that you don't have filter on your view that hides .* files?
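For anyone who cannot open the screen shot, the sketch below shows roughly what such a preferences file looks like and how it could be read as an ordinary Java properties file. The sample content is an assumption based on what typical projects contain (a version line plus an `encoding/<project>` key for the project-level setting), and `ProjectEncodingPrefs` is a hypothetical helper, not an Eclipse class.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper; the sample text is an assumed, typical file body.
class ProjectEncodingPrefs {
    // Assumed contents of .settings/org.eclipse.core.resources.prefs
    // for a project whose default encoding was set to UTF-8.
    static final String SAMPLE =
            "eclipse.preferences.version=1\n"
            + "encoding/<project>=UTF-8\n";

    // Returns the project-level encoding stored in the given prefs text,
    // or null if the key is absent.
    static String projectEncoding(String prefsText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(prefsText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("encoding/<project>");
    }

    public static void main(String[] args) {
        System.out.println(projectEncoding(SAMPLE));
    }
}
```

Because the file lives inside the project, committing it to the repository is what shares the setting with everyone who checks the project out.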
If I do this exactly as you describe on a new sample test project, I get exactly the result you describe. If I try this on my existing project, no .settings folder is created. ?
Actually, I get an "internal error setting encoding" dialog. Don't know if this is relevant, but we are using the CCRC plugin for Eclipse. The project is under ClearCase source control.
Are there more error details in the log file? (workspace/.metadata/.log)?
!ENTRY org.eclipse.core.runtime 4 2 2005-11-09 17:47:00.367
!MESSAGE An internal error occurred during: "Setting encoding".
!STACK 0
java.lang.IllegalArgumentException: Attempted to beginRule: R/, does not match outer scope rule: P/bac-nova
    at org.eclipse.core.internal.runtime.Assert.isLegal(Assert.java:58)
    at org.eclipse.core.internal.jobs.ThreadJob.illegalPush(ThreadJob.java:117)
    at org.eclipse.core.internal.jobs.ThreadJob.push(ThreadJob.java:211)
    at org.eclipse.core.internal.jobs.ImplicitJobs.begin(ImplicitJobs.java:59)
    at org.eclipse.core.internal.jobs.JobManager.beginRule(JobManager.java:190)
    at org.eclipse.core.internal.resources.WorkManager.checkIn(WorkManager.java:96)
    at org.eclipse.core.internal.resources.Workspace.prepareOperation(Workspace.java:1674)
    at org.eclipse.core.internal.resources.Folder.create(Folder.java:88)
    at org.eclipse.core.internal.resources.ProjectPreferences$2.run(ProjectPreferences.java:304)
    at org.eclipse.core.internal.resources.ProjectPreferences.save(ProjectPreferences.java:315)
    at org.eclipse.core.internal.preferences.EclipsePreferences.flush(EclipsePreferences.java:351)
    at org.eclipse.core.internal.resources.ProjectPreferences.flush(ProjectPreferences.java:585)
    at org.eclipse.core.internal.preferences.EclipsePreferences.flush(EclipsePreferences.java:339)
    at org.eclipse.core.internal.resources.ProjectPreferences.flush(ProjectPreferences.java:585)
    at org.eclipse.core.internal.resources.CharsetManager.setCharsetFor(CharsetManager.java:280)
    at org.eclipse.core.internal.resources.Container.setDefaultCharset(Container.java:255)
    at org.eclipse.ui.ide.dialogs.ResourceEncodingFieldEditor$1.run(ResourceEncodingFieldEditor.java:134)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:76)
Re comment #40 - can you enter a new bug report for that error? In the report, include what version/build of Eclipse you are using (build id is in Help > About...).
Re: 41, under which category should this bug be filed?
You can log it under Platform Resources.
*** Bug 171087 has been marked as a duplicate of this bug. ***
*** Bug 213251 has been marked as a duplicate of this bug. ***
+1 UTF-8 should be the obvious default character set for all text files. I had a problem with Eclipse when I was using an English Windows XP system and trying to open a file containing Chinese characters: as you can imagine, the display was completely messed up and Eclipse didn't tell me what I needed to do. I had to spend time googling for answers. I had to put -Dfile.encoding=UTF-8 in eclipse.ini so that it behaves correctly. If Eclipse had this on as the default, or at least detected the file correctly, a user like me wouldn't have to go through all this trouble to get a file to display as it should have been displayed. Every other text editor I use, such as PsPad, UltraEdit, and Notepad++, displays the file properly.
(In reply to comment #46)
> If eclipse had this on as default or at least detect file correctly, a user
> like me wouldn't have to go through all this trouble to get a file to display
> as it should have been displayed. Every other text editor I use such as PsPad,
> ultraedit, notepad++, display the file properly.

Making UTF-8 the default is not the right solution for the problem you were having. I'd suggest you open a separate bug describing the details and how Eclipse didn't meet your expectations compared to the other editors. You'd have to attach a sample file (I recommend zipping it up, so it doesn't get "changed" by attachments and browsers). You should also attach your "configuration" (obtained from about box) so someone could maybe see what the problem is. Also, the .settings folder from the project might also be important. There may be "nothing we can do" for an automatic fix, but ... that'd be the right approach, not making UTF-8 the default. Thanks,
I'm going to bump this up to 4.0 and re-open. Considerations about distributed resources (where the client OS and server OS may be in different locales/character sets) re-emphasise the need for a universal text encoding.
(In reply to comment #48)
> I'm going to bump this up to 4.0 and re-open. Considerations about distributed
> resources (where the client OS and server OS may be in different
> locales/character sets) re-emphasise the need for a universal text encoding.

I'd suggest waiting to re-open until all the client OS's and server OS's agree to a universal encoding. :) Until then, anything else is going to break someone.
If the distributed EFS representation can't come up with some kind of encoding (or at least, demarkating what encoding the resources should be in) then E4 isn't really going to stand much of a chance. Even HTTP resources announce a Content-Type with a charset encoding; this could easily be used to work with a UTF-8 representation. It may not be universal, but it's a darn sight more universal than any other encoding you can name (unless you go with other UTF-* encodings, or one of its subsets like ASCII)
(In reply to comment #50)
> If the distributed EFS representation can't come up with some kind of encoding
> (or at least, demarkating what encoding the resources should be in) then E4
> isn't really going to stand much of a chance. Even HTTP resources announce a
> Content-Type with a charset encoding; this could easily be used to work with a
> UTF-8 representation.
>
> It may not be universal, but it's a darn sight more universal than any other
> encoding you can name (unless you go with other UTF-* encodings, or one of its
> subsets like ASCII)

I don't know much (nothing really) about "distributed EFS" but if you are saying that e4 should use a protocol that contains the encoding in the data stream itself (like HTTP) then I wholeheartedly agree with that. See also bug 210704.
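To make the HTTP comparison concrete, here is a minimal sketch of pulling the charset out of a Content-Type header value. It is a simplified parser written for illustration, not a complete implementation of the HTTP header grammar, and `ContentTypeCharset` is a hypothetical name.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a simplified parser, not a full HTTP implementation.
class ContentTypeCharset {
    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8", or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback; // no charset parameter present
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/plain", StandardCharsets.ISO_8859_1));
    }
}
```

The point of the comparison is that the encoding travels with the data, so the reader never has to guess from its own locale.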
+1 for embedding encoding in the character stream wherever we can (like XML, HTTP, some kinds of file systems). Encoding is meta-info for the data and belongs to the data, not to a separate user-changeable setup. For the actual data in the workspace, the main problem is that this often needs to be interoperable with legacy tools. People want to use their Eclipse editor and legacy editor interchangeably. Encoding is really owned by the data (which may be legacy) and not by Eclipse. There may be some projects (like Java) where UTF-8 is the obvious default choice. In other cases (old C, Makefiles, some Webpages) a very conservative ISO-8859-1 may be the best choice which inhibits accidentally entering "odd" characters from the start. To add to complexity, default encoding may also be specified by the underlying OS/Platform or Country -- although that's often not really desired, especially when data is meant to be shared across geo boundaries like we see more and more. I'm in favor of having Eclipse auto-detect the proper encoding in more places than it does to day, but too much magic in some toolset is always a slippery road and the problem is a tough one to solve. Perhaps the simplest thing that could possibly work is this: (1) Always accept encoding as specified inside data stream. (2) Use UTF-8 default encoding unless otherwise specified. (3) Have project creation wizards / natures override that default as appropriate.
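The three-step proposal above amounts to a simple precedence rule, sketched here in Java. All names (`EncodingPolicy`, `resolve`, the parameter names) are hypothetical; how an in-stream declaration is detected, or how a wizard or nature supplies its override, is deliberately left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical sketch of the proposed precedence; not Eclipse code.
class EncodingPolicy {
    // (1) An encoding declared inside the data stream always wins.
    // (3) Otherwise a project-level override applies, if one was set.
    // (2) Otherwise UTF-8 is the default.
    static Charset resolve(Optional<Charset> declaredInStream,
                           Optional<Charset> projectOverride) {
        if (declaredInStream.isPresent()) {
            return declaredInStream.get();
        }
        return projectOverride.orElse(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Optional<Charset> none = Optional.empty();
        Optional<Charset> latin1 = Optional.of(StandardCharsets.ISO_8859_1);

        System.out.println(resolve(none, none));                                        // default
        System.out.println(resolve(none, latin1));                                      // project override
        System.out.println(resolve(Optional.of(StandardCharsets.UTF_16), latin1));      // in-stream wins
    }
}
```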
*** Bug 284637 has been marked as a duplicate of this bug. ***
This is a really annoying bug. Try generating Javadoc with MacRoman. And for French people... Please remove MacRoman. It would be good for newbies.
This bug is so annoying and kills productivity. It happened a hundred times to me, the working with other people is distracted by that silly behavior. Please just do UTF-8 as default.
(In reply to comment #55)
> This bug is so annoying and kills productivity. It happened a hundred times to
> me, the working with other people is distracted by that silly behavior.
>
> Please just do UTF-8 as default.

I agree with comment 18, made in 2005, "I'm surprised this discussion (reasoned or not) has dragged on this long." A "fix" will have to be something other than changing the default, for the many reasons mentioned in the many comments over the past 6 years, so it's not constructive just to keep suggesting that. Perhaps someone would want to work on a new feature, say, to "provide a better warning UTF-8 is not being used" or something ... but seriously ... 2005 ... changing the default discussed for 6 years?! Perhaps leaving this bug as opened and "new" gives the wrong impression ... perhaps it should be closed as "won't fix"? And interested parties could open more specific feature requests for new behavior or features that wouldn't break existing users and data?
> Perhaps leaving this bug as opened and "new" gives the wrong impression ... > perhaps it should be closed as "won't fix"? +1.
Or maybe now that it's 2012 (!!) we can suppose that Unicode support has progressed far enough for Eclipse to make UTF-8 the default for new projects? The funny thing is that because all these tools use these very conservative defaults, every day new files get created in legacy encodings, reinforcing the need for being conservative... a vicious circle, perpetuating itself. If only these tools would embrace Unicode, within a year or two we could forget about those legacy encodings. The default is important because the majority of programmers don't care/know about encodings and will use the default no matter how bad it may be.
(In reply to comment #58) > Or maybe now that it's 2012 (!!) we can suppose that Unicode support has now > progressed far enough for Eclipse to make UTF-8 the default for new projects? > > reenforcing the > need for being conservative... in a vicious circle, perpetuating itself. +1
Alex Blewitt, you were probably too avant-garde for the time... Others were certainly afraid of the potential bugs and complaints the change would have caused. OK guys, here we go now! Ready for the next release? It's just a property to switch, isn't it? And what about displaying the encoding used by the active editor, in the status bar for instance? Thank you.
Maybe I'm getting this all wrong, but this discussion is about the creation of *new* files, right? The original problem statement was this: "Why does Eclipse default to a platform-specific encoding for its file types when creating new files?" So maybe I'm dumb, but can someone explain why creating a new file as UTF-8, from within Eclipse, would be a problem for anyone? Now I can give many examples of what I consider very plausible use cases where the current platform-default setting causes problems, but I'd be repeating what has been said before. Somehow it looks to me like the majority of the people voting against this are mostly thinking of scenarios involving the opening of *existing* files.
Another user having a problem with the current situation has led me here: http://stackoverflow.com/questions/19251180/encoding-issues-in-eclipse-for-mac-and-for-windows As this is still open, please note that NetBeans uses UTF-8 as default: http://wiki.netbeans.org/FaqI18nProjectEncoding UTF-8 being a minority encoding: This is no longer true, at least for the web: http://w3techs.com/technologies/overview/character_encoding/all (In reply to Dani Megert from comment #3) > Because Eclipse is not just an IDE Maybe this should be another one of the defaults which should be overriden for the IDE packages then.
(In reply to Stijn de Witt from comment #61) > Maybe I'm getting this all wrong, but this discussion is about the creation > of *new* files right? > > The original problem statement was this: > > "Why does Eclipse default to a platform-specific encoding for > its file types when creating new files?" > > So maybe I'm dumb, but can someone explain why creating a new file as UTF-8, > from within Eclipse, would be a problem for anyone? > > Now I can give many examples of what I consider to be a very plausible use > case for the current setting of platform default to give problems, but I'd > be repeating was has been said before. > > Somehow it looks to me like the majority of the people voting against this > are mostly thinking of scenarios involving the opening of *existing* files. Yes, but ... imagine a user has thousands of files created under the old default assumption ... and then they work for a while and create a couple of hundred new files under the new assumed default ... and then someone else on the team checks out that project (let's say, for the first time) ... how is it known at that point which were the "old, existing" files ... and which were the newly created files? (I actually think I know of one answer to this, but am just wondering if you do ... or what you had in mind). Similarly, what if a user had a few thousand files that already existed, let's say created with plain ol' text editors or some old tools that used the platform encoding ... and the user wants to "import" those into Eclipse. I think from Eclipse's point of view, those are (sort of) "new files" ... but still, either way, data could be lost if Eclipse made some other assumption or tried to do some "automatic conversion". Keep in mind, there are some file encodings, such as for various Japanese, Chinese, or Arabic languages, that cannot be properly encoded using UTF-8, so a simple "automatic conversion" would not (always) work. Well, I'm 99% sure of that :) ... native language developers are free to correct me -- I'm certainly not knowledgeable enough to know the exact list. I'm sure all these problems are "solvable" (to some extent) ... but it would take more work than simply "changing the default workspace preference".
Maybe for the same reasons we need to go back to some 7-bit encoding? Imagine I have a thousand files in that encoding. And of course I use a pretty old-school 7-bit editor (why not?). And now I try to open my files in Eclipse. Wow! It doesn't work! Why?!
(In reply to Stijn de Witt from comment #61) > Maybe I'm getting this all wrong, but this discussion is about the creation > of *new* files right? It said so in the initial comment, but that's not the whole story about how it currently works in Eclipse. Most files don't tell you what their encoding is, and hence that preference is also important/used when reading files. The encoding is detected
1. from the file contents, if possible
2. from the file's encoding setting, if available
3. from the file's (parent) folder's encoding setting, if available
4. from the file's project encoding setting, if available
5. from the workspace encoding preference
To be independent of workspaces, it is therefore recommended to set the project-specific encoding, so that it can be shared via the repository.
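The lookup order listed above reads as a most-specific-first chain. Here is a minimal Java sketch of that idea, assuming each level either supplies an encoding or does not. The class and method names are hypothetical and are not Eclipse API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class EncodingLookup {

    /**
     * Walks the candidates from most specific (file contents) to least
     * specific (project setting) and returns the first one that is set.
     * Falls back to the workspace preference when none is.
     */
    static String resolve(List<Optional<String>> candidates, String workspaceDefault) {
        for (Optional<String> candidate : candidates) {
            if (candidate.isPresent()) {
                return candidate.get();
            }
        }
        return workspaceDefault;
    }

    public static void main(String[] args) {
        List<Optional<String>> candidates = Arrays.asList(
                Optional.<String>empty(),   // 1. nothing detectable in the file contents
                Optional.<String>empty(),   // 2. no per-file setting
                Optional.<String>empty(),   // 3. no folder setting
                Optional.of("UTF-8"));      // 4. project setting, shared via the repository
        // The project setting wins over the (hypothetical) Cp1252 workspace default.
        System.out.println(resolve(candidates, "Cp1252"));
    }
}
```

With a project-level value present, the workspace default is never consulted, which is why setting the encoding on the project makes a checkout behave the same in every workspace.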
Why was the only "reason" given for keeping it as is "the only sensible choice" or something like that? Here is just another issue. I have a Java app in which I have a hardcoded regex with an accented character (á). Eclipse warned me the file should be saved as UTF-8 and so I did. After some months, I had to reimport the project into another Eclipse instance on the same computer. The file was imported with the Windows locale encoding or something similar, and thus my "á" got corrupted. Since it was a regex, I could not detect the issue until it was too late... Even changing the file to UTF-8 again wouldn't fix the corrupted character. I had to edit the source. So... a project breaks just by reimporting it if you save a file as UTF-8 without changing the property for the entire project/environment? Please enumerate real reasons for not changing the default to UTF-8 and please, stop assuming that because there is an option somewhere to change it, everyone will find it right away without losing hours of valuable time.
While the Eclipse team is trying to please everyone (impossible IMHO), you can set UTF8 as the default for the whole JDK by setting the system (or user) environment variable JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8. After that the Java editor will use UTF8 by default. But even if you set that property, you will still need to update the default encoding for JSP and .properties files (#@!$#@!$) in Eclipse...
(In reply to Tiger Shark from comment #66) > > Please enum real reasons for not changing the default to utf8 and please, > stop assuming that because there is an option somewhere to change it, > everyone will find it right away without losing hours of valuable time. I think the real reasons have been stated in this bug (though, I admit, it is a lot to read), and I think it's unfair to say they have not been. It basically comes down to this: many people have different requirements, and many people have decades of existing data that was created and continues to operate under one set of assumptions, and we don't want to break them. Plus, I have to emphasize, there are many places where "encoding can go wrong", and making one of them the default all the time will not magically solve all those problems. Those that use encodings must understand them to some degree. I still think the path to improvement is to make the "user education" better. For example, instead of the editor warning you "the file should be saved as UTF-8", perhaps it should also have given you the choice to change the project settings to UTF-8? Further, from your description, I have a feeling your character would have gotten corrupted anyway ... going to a Windows machine that was not using UTF-8 ... works in cases where you export/import projects from/to an SCM system, but it almost sounds like you copied the single file to the file system? (Otherwise, the file would have still been "ok", going from Eclipse to Eclipse via SCM projects.) If it was XML, it would have been handled fine if you had a correct DECL, but perhaps it was a bash script, which does not have any means to "self document" its encoding? I only mention all these to try to convey the fact that "it's complicated".
Hmm... "Windows machines" have used UTF8 as the default for at least several years :) For some reason the Oracle JDK on Windows treats the "Language for non-Unicode programs" setting as the system locale. And so each Java application on Windows works in non-Unicode mode. Fortunately you can force Java to use UTF8, but Eclipse still uses the local encoding even in that case.
(In reply to David Williams from comment #68) > (In reply to Tiger Shark from comment #66) > > > > > Please enum real reasons for not changing the default to utf8 and please, > > stop assuming that because there is an option somewhere to change it, > > everyone will find it right away without losing hours of valuable time. > > I think the real reasons have been stated in this bug (though, admit, it is > a lot to read) but think its unfair to say they have not been. It basically > comes down to many people have different requirements, many people have > decades of existing data that was created and continues to operate under one > set of assumptions and we don't want to break them. Plus, I have to > emphasize, there are many places where "encoding can go wrong" and making > one of them the default all the time, will not magically solve all those > problems. Those that use encodings must understand them to some degree. > > I still think the path to improvement is to make the "user education" > better. Such as instead of the editor warning you "the file should be saved > as UTF-8", perhaps it should have also given you the choice to change the > project settings to UTF-8? > > Further, from your description, I have a feeling your character would have > gotten corrupted anyway ... going to a windows machine, that was not using > UTF-8, ... works in cases if you export/import projects from/to an SCM > system, but it almost sounds like you copied the single file to the file > system? (Otherwise, the file would have still been "ok", going from Eclipse > to Eclipse via SCM projects.) If it was XML, it would have been handled fine > if you had a correct DECL, but, perhaps it was a bash script, which does not > have any means to "self document" its encoding? I only mention all these to > try an convey the fact that "its complicated". Not really. I did not move the files. I'm still using the very same machine and OS from where the file was originally created. 
It's a Java source file, nothing else. The problem is, the file was in UTF-8, but after reimporting the project, Eclipse assumed the system locale. So when I opened the file after the reimport, the character became corrupted, and since it was not a Unicode character anymore, the locale encoding was used every time the file was saved after that. I agree the option to switch the entire project default to UTF-8 should be displayed, but that alone is not enough if Eclipse won't detect it on a reimport of an existing project. Legacy code... ok, so this change will never come as there will always be locale encoded files since it's the default XD
"Legacy code... ok, so this change will never come as there will always be locale encoded files since its the default XD" This kinda sums it up. Eclipse and dozens of other programs insist on keeping creating new files in the, clearly inferior, legacy encodings of old. The reason is that there are so many files out there in those encodings... It's a self-fulfilling prophecy. Or a vicious cycle or whatever. The point is that those legacy encoded files will never go away until we start changing the defaults. This is also the reason why myself (and judging from the comments, many others) are not happy with the ability to change the setting on our local machine. It does not stop the mass creation of new files in legacy encodings that keeps us all locked in this never ending story. Since this bug has been open so long and there are so many comments, I think the people at Eclipse should think of ways to give the community at least something... Maybe for NEW projects, set the encoding setting to Unicode/UTF-8? Or make it an important setting in New Project wizards? So people explicitly have the option to set the encoding, possibly with some explanation?
(In reply to Stijn de Witt from comment #71) > It's a self-fulfilling prophecy. Nicely put...The Eclipse 4.x would have been (another) perfect opportunity to "modernize" the default encoding. We should accept that at one point the past is just that - the past.
*** Bug 428892 has been marked as a duplicate of this bug. ***
There is really nothing we can do at the Platform/Resources level.
Ten years to find out that the issue has the wrong parameters?! And then just close it?! Why not correct the Product/Component to the right ones?
WONTFIX is not acceptable for this bug. There's been a pretty intense outcry from leaders in the community on this. Let's keep it open and work out a proper solution.
(In reply to Doug Schaefer from comment #76) > WONTFIX is not acceptable for this bug. There's been a pretty intense outcry > from leaders in the community on this. Let's keep it open and work out a > proper solution. I don't know why everyone thinks that the "proper solution" is to change component default values. As stated numerous times in this bug, the default value cannot change at the Platform/Resources level (component level). However, it is always possible to change the default per product, so for example one can ask to change it for a certain EPP package via the pluginCustomization file. The same thing was done in many other cases, e.g. lightweight refresh - it is still disabled by default at the component level, but enabled via the pluginCustomization file for EPP packages (see bug 384104). If you want to have different defaults than the component defaults, that's the way forward. Marking WONTFIX, because there is really nothing we can do at the Platform/Resources level. Feel free to move this bug to EPP.
@all you opened the bug in the wrong component, nobody tells you during ten years. As we don't care your problem here and we are already happy to build the best IDE in the world, we're closing it. Love you <3
(In reply to Matthieu Paret from comment #78) > @all you opened the bug in the wrong component, nobody tells you during ten > years. As we don't care your problem here and we are already happy to build > the best IDE in the world, we're closing it. Love you <3 In defense of the Platform team, the EPP project did not exist 10 years ago. (In reply to Doug Schaefer from comment #76) > WONTFIX is not acceptable for this bug. There's been a pretty intense outcry > from leaders in the community on this. Let's keep it open and work out a > proper solution. +1 But let's move it to EPP and see if we can convince a package maintainer to implement this.
Who's going to come over here and fix my plug-in customization file? If fixing everything that's wrong with the Platform in places not in the Platform is the direction we want to take, then let's plan that carefully and do it right.
(In reply to Doug Schaefer from comment #80) > If > fixing everything that's wrong with the Platform You can claim it was wrong not to use UTF-8 when we started. Fair enough. But changing it now would definitely be wrong and break clients. Cp1252 is *not* a subset of UTF-8. This means if a user wrote a file in a current Eclipse Windows workspace with the following content: Diese Nüsse sind geröstet. and then opens, edits and saves it with a workspace where the default is now UTF-8, he will end up with a corrupted file. At least for the Platform this is not an option. If a certain EPP, RCP or product does not see this as a problem for their clients, then they are free to change the default.
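To make the corruption scenario above concrete, here is a small Java sketch that writes the example sentence as Cp1252 bytes and then reads the same bytes back as UTF-8. The class and method names are made up for illustration, and it assumes the JDK ships the windows-1252 charset (common JDKs do, although it is not one of the charsets the Java specification guarantees).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252Mismatch {

    /** Encodes text with one charset, then decodes the same bytes as UTF-8. */
    static String writeThenReadAsUtf8(String text, Charset writtenWith) {
        byte[] onDisk = text.getBytes(writtenWith);
        return new String(onDisk, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The umlauts are written as Unicode escapes so this source file stays pure ASCII.
        String original = "Diese N\u00FCsse sind ger\u00F6stet.";
        Charset cp1252 = Charset.forName("windows-1252"); // assumption: available in this JDK

        String reopened = writeThenReadAsUtf8(original, cp1252);

        // The single bytes Cp1252 uses for the umlauts are not valid UTF-8,
        // so the decoder substitutes U+FFFD (the replacement character).
        System.out.println("same text after reopening: " + reopened.equals(original));
        System.out.println("contains U+FFFD: " + (reopened.indexOf('\uFFFD') >= 0));
    }
}
```

Once the editor saves the reopened text, the replacement characters are what ends up on disk, so the original umlauts cannot be recovered from the file. Pure ASCII content is unaffected, which is why the problem often goes unnoticed for a long time.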
We do know that this is a difficult change that could cause problems for many people if it is not well managed. But do you really think that we can continue like this, seriously? Legacy character encodings are a recurring problem that we must eradicate. So let's go now! Please list the issues to resolve, so that a patch can be produced whose side effects on existing installations are as small as possible. Some strategies: 1 - Make the modification and, when deploying it, have a popup window warn users of the difficulties they might encounter and what to do about them. That solution could be more or less clever in how it detects the current context and what to warn about... 2 - Announce, a long time before the target release, that it will include a core modification that could affect existing installations. Work out procedures to fix the side effects. 3 - Make a sophisticated patch that tries to fix most of the common cases, avoiding side effects as far as possible. 4 ... These are only starting points. Your turn, and be positive and constructive! Thank you.
Would it be an option to set UTF-8 by default for *new* workspaces only, but keep the current behavior for existing workspaces ?
(In reply to Martin Oberhuber from comment #83) > Would it be an option to set UTF-8 by default for *new* workspaces only, > but keep the current behavior for existing workspaces ? I like this option as well, although there's still a risk: whenever I create a new workspace it's to check out existing projects :-) But that could be part of the migration guide "if you have projects that aren't UTF-8 and you create a new workspace you have to either specify the encoding in the project settings or flip the setting back to default in your new workspace". PW
(In reply to Martin Oberhuber from comment #83) > Would it be an option to set UTF-8 by default for *new* workspaces only, > but keep the current behavior for existing workspaces ? That would only protect the existing workspace, but not the case when you check out code from a repository or import an existing project. A more durable solution could be to add a new option that allows specifying the encoding to use *and set* when creating a new project. That would also make sure that the project can be opened in all workspaces, since the encoding would be set on the project.
(In reply to Dani Megert from comment #85) > durable solution could be to add a new option that allows to specify the > encoding to use *and set* when creating a new project. Great suggestion. Encoding should be associated with the project anyways, and not with the workspace. That way, (new projects created with UTF-8 by default), more and more projects would convert to UTF-8 over time. At one point, a warning dialog could come up when importing a project that doesn't have the encoding specified.
Elaborating on the suggestions already made. Eclipse already has a default welcome screen for new workspaces, what about adding information on that very screen about the currently used character encoding and an option to change it then and there? Alternatively a small (~5ish steps) setup at the very first startup of Eclipse itself to set a small number of default settings, say encoding, line numbers, etc. This would serve to educate the users about existing problems and capabilities as well as offer an easy and mostly painless way to address the issue at hand. Such a welcome screen / setup tool could of course be used for projects instead/as well, too.
+1 for changing to UTF-8 as the default text file encoding on all platforms. Currently, the default for Swedish Eclipse users on Windows is Cp1252 (Windows code page 1252). I recently began work in a 5 year old medium sized project (~100 ppl), where they did not know or care about character encoding in the IDE at the time of project start-up. So now everyone in this project still has to use Cp1252 in the IDE. And yes, since this is a governmental system, all code comments and logging have to be in Swedish. One "funny" thing is that the system runs on UNIX, where logging using Cp1252 is not optimal...
@a e: do vote for this issue ;) One might wonder how many votes (given the age of this issue) are necessary to trigger any study of a fix, but what else can we do?
"You can claim it was wrong not to use UTF-8 when we started. Fair enough. But changing it now would definitely be wrong and break clients. Cp1252 is *not* a subset of UTF-8. This means if a user wrote a file in a current Eclipse Windows workspace with the following content: Diese Nüsse sind geröstet. and then opens, edits and saves it with a workspace where the default is now UTF-8, he will end up with a corrupted file." And how is this different from a user that already has set the default encoding to UTF-8 in his workspace? Or even worse, a user that did not make any changes, but is running on an OS that has a different encoding set as the default? Your use case is based on the assumption that people work alone, on the same machine. Only in these cases would something break which does not break in the current situation. But look at it from this perspective: In the current situation peoples projects will *allways* break when they are being shared across different machines with different default encoding. Stuff is *already* broken. That is why this issue exists for ten years and people are still taking the trouble to add comments to it. How about this: New files: UTF-8 with BOM Existing files: Auto-detect Fallback: Platform encoding. Basically rename the existing option to 'Fallback' and add a new option for New files. If a file being opened does not contain a BOM (and so encoding can not be determined reliably), use the fallback encoding when opening. Otherwise use the encoding detected from the BOM. When creating new files, create them as UTF-8 with BOM so that they will be opened correctly even on machines that have different defaults set. If you look at all the ideas suggested for this issue that would at least improve on the situation you can't keep saying that "As stated numerous times in this bug, the default value cannot change at the Platform/Resources level (component level)." If you guys really wanted this you *could* and you would change it.
Stijn, thank you for those reminders and explanations. Of course, this remains a really tricky migration, but Eclipse has to switch to UTF-8 as the default encoding. There is no other option; alternative encodings are a heavy hindrance and will eventually disappear. Every day I personally face encoding problems, because of Eclipse, because of my development environment that I can't change, because of gaps in how plugins handle encodings, and because in my language we can't make do with ASCII characters. So now, let's establish steps to make that migration as smooth as possible! First of all: users have to be prepared and warned each time that UTF-8 could/should be considered, instead of the OS encoding or any other one. This should occur when installing Eclipse, creating a new workspace, creating each new project, RCP, plug-ins... The encoding of the file open in the focused editor could/should be displayed somewhere (in the status bar for instance). ...
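The BOM-based proposal discussed above (use the encoding indicated by a byte order mark, otherwise use a configurable fallback) can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the names are hypothetical, only the UTF-8 and UTF-16 marks are checked, and UTF-32 marks are deliberately ignored.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {

    /** Returns the charset indicated by a leading BOM, or the fallback if there is none. */
    static Charset detect(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return fallback; // no BOM: the encoding cannot be determined this way
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) { // mask because Java bytes are signed
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i'};
        System.out.println(detect(withBom, StandardCharsets.ISO_8859_1));
        System.out.println(detect(withoutBom, StandardCharsets.ISO_8859_1));
    }
}
```

One trade-off to keep in mind is that some tools do not expect a BOM at the start of a UTF-8 file, so writing one into every new file is a design decision rather than a free win.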
Created attachment 249872 [details] Default Encoding in IntelliJ Community Edition 14 Attached a screenshot from IntelliJ Community Edition 14, which shows what the IDE's encoding is set to, and that projects use the system default. So IntelliJ also defaults to whatever the system uses. For certain files, like IDE files and XML files, it defaults to UTF-8, but it is still the project's responsibility to set a default encoding if it needs to.
Created attachment 249873 [details] UTF-8 from a system in Eclipse My system default is UTF-8, but Eclipse still uses some legacy encoding for certain file types, e.g. for .properties it is ISO-8859-1. So I need to check these types each time I create a new workspace.
(In reply to Stanislav Spiridonov from comment #93) > e.g. for .properties it is ISO-8859-1. This is one of the exceptions... the default encoding of a Java .properties file is in fact ISO 8859-1, not UTF-8, see the following JavaDoc: http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.InputStream%29
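The difference is easy to see with `java.util.Properties` itself. In the sketch below (class and method names are illustrative), the same UTF-8 bytes are loaded once through `load(InputStream)`, which decodes them as ISO 8859-1, and once through `load(Reader)` with an explicit UTF-8 reader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesEncodingDemo {

    /** Loads via load(InputStream), which decodes the bytes as ISO 8859-1. */
    static String viaInputStream(byte[] data, String key) {
        try {
            Properties props = new Properties();
            props.load(new ByteArrayInputStream(data));
            return props.getProperty(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Loads via load(Reader), letting the caller choose UTF-8 explicitly. */
    static String viaUtf8Reader(byte[] data, String key) {
        try {
            Properties props = new Properties();
            props.load(new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8));
            return props.getProperty(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // A properties file saved as UTF-8; the umlaut is escaped to keep this source ASCII.
        byte[] utf8File = "greeting=N\u00FCsse\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(viaInputStream(utf8File, "greeting")); // two-character mojibake for the umlaut
        System.out.println(viaUtf8Reader(utf8File, "greeting"));  // the umlaut survives
    }
}
```

This is why an editor that defaults .properties files to ISO 8859-1 matches what `load(InputStream)` will do at run time, while frameworks that read the file through a UTF-8 reader behave differently.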
Thank you, I see the reason. I am working with GWT, and it recognizes UTF-8 in properties files. But anyway, there are still JSP content types which also have the legacy encoding by default.
*** Bug 458618 has been marked as a duplicate of this bug. ***
I've read someone complaining about it, again. I have a dummy question on this topic: is the encoding inherited between content-types? In preferences, the view is a tree, and the "root" text content-type doesn't specify an encoding. If we only set this one to UTF-8, does that mean that all children content-types that don't override the setting will be UTF-8 ? If yes, it seems to be a minor change for reasonable users satisfaction.
Hi, I am the complaining guy Mickael Istria refers to in comment #97. Every time, this happens: the development team works on Windows, unzips Eclipse, works for a few weeks, then the product is deployed on a production server that runs on anything but Windows (Solaris, RedHat...) and thus doesn't use an MS proprietary charset, but rather UTF-8. And then all web pages display those lovely diamonds with question marks, stating that we yet again have an encoding issue. So now everyone has to take time to reconfigure Eclipse in all the places where it manages file encoding, verify hundreds of files for re-encoding issues, etc. I agree it is the developer's duty to know their tools and configure them properly, but not everyone is a senior, encoding-problems-aware developer (at least not in any team I have worked with so far), so a sane default would prevent sooo many problems. Add to that, that nowadays most Java applications are web applications, thus very likely to be i18n'd in "funky" languages like Arabic, Japanese, Korean or Chinese (with glyphs not covered by WIN-CP1252 nor ISO-8859-1)... To conclude, I would be very grateful if the Eclipse team would correct this simple but always painful issue !
(In reply to Olivier Croisier from comment #98) > Add to that, that nowadays most Java applications are web applications, thus > very likely to be i18n'd in "funky" languages like Arabic, Japanese, Korean > or Chinese (with glyphs not covered by WIN-CP1252 nor ISO-8859-1)... What filetype are you complaining about? If you have webtools installed, it should pickup the correct encoding from the file content itself (as well as XML and many other types of files). If you have your code under source code control, you should have set the encoding for your projects as you like ... then all the "new guys" will automatically get your preferred encoding set on those projects. > To conclude, I would be very grateful if the Eclipse team would correct this > simple but always painful issue ! And you don't mind breaking many other people/projects who have different assumptions, eh? :)
I'm sorry. UTF-8 is the industry standard. I can't see how anyone can deny that. I don't mind breaking people. Make the few who this impacts change the preference back. Sometimes you have to take a hit to do the right thing for the vast majority of your users. We need to do the math. I'm tired of sucking just to keep the status quo to satisfy the few. Be brave and take the hit and do the right thing.
The webtools editors use the best strategy to *detect* the encoding when possible. But in case there is not enough to detect the encoding, using UTF-8 as fallback seems to be the best approach from user perspective. I second Doug here. I believe that there will be more people happy by the move to UTF-8 than people unhappy with it, and that those who are using funky alternative conventions and encodings should be the one having to do the extra-step of setting their encoding if it doesn't match. Telling most users to change a preference is not as good user experience as setting this preference as default. Maybe this can be part of a future poll, such as the one that happened about line numbers some time ago?
A future implementation note: if you set -Dfile.encoding=UTF-8 then you lose the ability to switch back to the native filesystem encoding (because you are effectively saying that the native file encoding is UTF-8). If you change the Eclipse preference from OS native to UTF-8 then it will at least permit those who want to switch back to do so.
PS happy ten year bugaversary for this bug earlier this month :-)
(In reply to Alex Blewitt from comment #102) > A future implementation note: if you set -Dfile.encoding=UTF-8 then you lose > the ability to switch back to the native filesysyem encoding Using JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8 is correct on Windows in 99% of cases, because Java "incorrectly" determines the native filesystem encoding. It takes the Windows fallback setting ("Language for non-Unicode programs") as the default. So with JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8 you just set the CORRECT native filesystem encoding for Windows.
Note that the default workspace encoding is not a preference; it's computed in Platform and stored as metadata on the workspace (so the issue seems to be in Platform, not in EPP). See org.eclipse.ui.WorkbenchEncoding#getWorkbenchDefaultEncoding. It relies on the file.encoding of the JVM and falls back to UTF-8. As Stanislav mentioned just above, advising users to revise their JVM settings on Windows rather than telling them to tweak the workspace is a good idea. However, it would be nice if users could figure this out by themselves in the IDE. I suggest we replace the "Default (...)" label for encoding in Preferences > General > Workbench with "JVM Default (...)" and add a tooltip such as "Relies on the 'file.encoding' JVM property. For development and execution consistency, it's recommended that you configure your JVM property rather than overriding the workspace configuration." WDYT?
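The behavior described above (use the JVM's file.encoding, otherwise UTF-8) amounts to a one-line fallback. The sketch below only illustrates that logic as stated in the comment; it is not the Eclipse source, the class and method names are hypothetical, and treating an empty value like a missing one is an added assumption.

```java
public class WorkbenchDefaultSketch {

    /**
     * Returns the given file.encoding value, or UTF-8 when it is missing.
     * Assumption: an empty string is treated the same as a missing value.
     */
    static String defaultEncoding(String fileEncodingProperty) {
        if (fileEncodingProperty == null || fileEncodingProperty.isEmpty()) {
            return "UTF-8";
        }
        return fileEncodingProperty;
    }

    public static void main(String[] args) {
        // On a typical JVM this prints whatever the platform or -Dfile.encoding dictates.
        System.out.println(defaultEncoding(System.getProperty("file.encoding")));
    }
}
```

Because the JVM normally sets file.encoding, the UTF-8 branch is rarely taken, which is why the workspace default in practice follows the platform.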
Sorry, but this is so ridiculous. The platform may have a shitty default, so we have to stick to that shitty default. Of course, in an ideal world every dev should know about that problem and configure their environment correctly, BUT you always have new developers on a team, or you have a new computer and forget that one tiny thing, and your encodings get messed up. What was the benefit of taking the platform default? I probably missed that point.
(In reply to Mickael Istria from comment #105) > See org.eclipse.ui.WorkbenchEncoding#getWorkbenchDefaultEncoding. It relies > on the file.encoding of the JVM and fails back to UTF-8. Or maybe simply override the ResourcesPlugin.getEncoding method to return UTF-8 instead of checking the property. Indeed, there is no strong relationship between the JVM that is used to run the Eclipse IDE on the workstation and the target environment (it may not even be Java), so inferring the Resource encoding from the underlying JVM settings seems irrelevant.
New Gerrit change created: https://git.eclipse.org/r/57036
New Gerrit change created: https://git.eclipse.org/r/57040
(In reply to Olivier Croisier from comment #98) > Hi, I am the complaining guy Mickael Istria refers to in comment #97. Thank you for taking the time to register for an Eclipse account and for posting your comments. The entire community benefits from more input.
"Note that the default workspace encoding is not a preference, it's computed in Platform and stored as a metadata on the workspace (so the issue seems to be in Platform, not in EPP)."
Ha ha, yes, this is what everyone has been saying for ten years now.
"And you don't mind breaking many other people/projects who have different assumptions, eh? :)"
David, could you please describe such a project? This is a sincere question, because I believe it is actually *impossible* to have a non-breaking setup without setting an encoding at the project level, because of the current default. Let me explain.
1. If you do NOT set the encoding at the project level (or file level), Eclipse uses the platform default.
2. Because of how Java works, the platform default is *never* compatible across different machines and operating systems. On Windows machines it will assume CP-1252, whereas on Macs and Linux boxes it will use different (incompatible) encodings.
3. Even on Windows, the platform default will actually vary across locales. There are dozens of different encodings for e.g. Germany, Poland, Japan, France, Spain etc.
So the 'many other people/projects' that would break would have to be groups of people that:
* Are all on the same OS
* Are all within the same locale (or compatible at least)
I'm not sure where these people are who never work with people from different countries, or with people using a different OS, but are they really the people that should be protected? Every day developers are losing time because of this default. Developers that want to create *interoperable* software that works in *every* country and on *every* OS. Is their life really being made more difficult for the sake of these legacy-encoding-dependent people that are creating software that is *by definition* NOT interoperable?
I know I come on strong with my arguments... But imho, in 2005 when this bug was created it made *some* sense to argue against changing the defaults. Unicode was still pretty new then. But today, in 2015, it's becoming totally ridiculous if you ask me. UTF-8 is *the* de facto standard encoding and has been for years now.
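To make the locale argument in the comment above concrete, here is a small sketch that decodes one and the same byte under three Windows code pages and under UTF-8. The class name `CodePageDemo` and the choice of byte `0xF1` are illustrative assumptions, not something taken from the thread; the sketch also assumes the JDK in use ships these charsets, which standard JDK builds do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodePageDemo {

    /** Decodes the same raw bytes using the named charset. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] oneByte = { (byte) 0xF1 };
        // Three "platform defaults" a Windows machine might have, depending on locale.
        for (String name : new String[] { "windows-1252", "windows-1250", "windows-1251" }) {
            String text = decode(oneByte, name);
            System.out.printf("%-12s -> U+%04X%n", name, (int) text.charAt(0));
        }
        // On its own, this byte is not well-formed UTF-8, so lenient decoding yields U+FFFD.
        String asUtf8 = new String(oneByte, StandardCharsets.UTF_8);
        System.out.printf("%-12s -> U+%04X%n", "UTF-8", (int) asUtf8.charAt(0));
    }
}
```

The byte comes out as U+00F1 (ñ) under windows-1252, U+0144 (ń) under windows-1250 and U+0441 (Cyrillic с) under windows-1251, so a file that relies on the platform default reads differently on each of those machines.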
(In reply to Stijn de Witt from comment #112) > David, could you please describe such a project? > So the 'many other people/projects' that would break would have to be groups > of people that: > > * Are all on the same OS > * Are all within the same locale (or compatible at least) This is the use case I'm aware of (Japanese developers developing Japanese web applications, specifically). Admittedly, I've not worked with those development groups for a long time, but I'd think that, if nothing else, they could have assets still in use. I know that for them at least, and possibly others, it's even more complicated than the complications you mentioned, since there's often special hardware, and special versions of Java made for such cases (which I do not really keep track of). I don't mind you, and others, repeatedly asking for this ... but many alternatives have been suggested over the years, and I have yet to hear why none of those alternatives would be feasible. So, it does get tiresome. The easy alternatives: make sure that your files that allow a self-documenting encoding are properly self-documented, and make sure your project encodings are set properly. Beyond that, there were suggestions for someone with a vested interest to contribute "user aides" that would remind users, say during "New Project ...", to specify a better encoding than "workspace default". Those things seem better to me than risk messing up someone's existing data. Which reminds me, that's how this case is different than, say, "voting on line number preferences". Here we are talking about the possibility of damaging someone's existing data, or, I think it was suggested, "make them invest in converting all their existing data". These type of things (damage, and "forced investments") are not open to "majority rule", IMHO. I feel an obligation to protect the minority, in such cases.
(In reply to David Williams from comment #113) > Those things seem better to me than risk messing up someone's existing data. > > Which, reminds me, that's how this case is different than, say "voting on > line number preferences". Here we are talking about the possibility of > damaging someone's existing data, or, I think it was suggested "make them > invest in converting all their existing data". > > These type of things (damage, and "forced investments") are not open to > "majority rule", IMHO. I feel an obligation to protect the minority, in such > cases. +1. I'm definitely also against such a change. I will take this into our next PMC call to see whether other PMC members have a different opinion on this.
(In reply to comment #114) > +1. I'm definitely also against such a change. I will take this into our next > PMC call to see whether other PMC members have a different opinion on this. Good. Nobody wants to rot the environment of others. There are certainly different strategies to achieve this without causing anger and disappointment. If you decide to do nothing, do you realize that you'll have more and more complaints about this issue?
(In reply to Laurent Barbareau from comment #115) > If you decide to do nothing, do you realize that you'll have more and more > complaints about this issue? Or fewer and fewer, since users may prefer other IDEs ;)
(In reply to Laurent Barbareau from comment #115) > Nobody wants to rot the environment of others. Right! Well, some seem to. > There are certainly different > strategies to achieve this without causing anger and disappointment. I think a partial solution would be to set the encoding to UTF-8 for empty workspaces. It won't solve all issues (see my comment 85) but solve the 80% problem.
(In reply to Dani Megert from comment #117) > I think a partial solution would be to set the encoding to UTF-8 for empty > workspaces. It won't solve all issues (see my comment 85) but solve the 80% > problem. I believe that's a (the?) good solution. It's more or less what I was trying to do with the suggested patches, but I didn't manage to get there. Is the alternative of just setting the default value of PREF_ENCODING to UTF-8 a good way to implement that behaviour?
(In reply to Mickael Istria from comment #118) > (In reply to Dani Megert from comment #117) > > I think a partial solution would be to set the encoding to UTF-8 for empty > > workspaces. It won't solve all issues (see my comment 85) but solve the 80% > > problem. > > I believe that's a (the?) good solution. Good path forward. I think we only need to do two things: 1. Add org.eclipse.core.resources.ResourcesPlugin.getDefaultEncoding(), which returns UTF-8 if the workspace is empty (detecting whether it's a completely new workspace is hard) and returns the current default otherwise. 2. Call that method in ResourcesPlugin.getEncoding() and WorkbenchEncoding.getWorkbenchDefaultEncoding() and other places where appropriate. BTW: Didn't like your threat in your previous comment ;-).
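The two-step proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: it does not use any Eclipse API, `getDefaultEncoding` here is a hypothetical stand-in for the method proposed in the comment, the `workspaceIsEmpty` flag is supplied by the caller because the thread itself notes that detecting a new workspace is hard, and `currentDefault` merely mirrors the behaviour described earlier in this thread (JVM `file.encoding`, falling back to UTF-8).

```java
public class DefaultEncodingSketch {

    /** Stand-in for today's behaviour as described in this thread: JVM file.encoding, else UTF-8. */
    static String currentDefault() {
        String fromJvm = System.getProperty("file.encoding");
        return (fromJvm == null || fromJvm.isEmpty()) ? "UTF-8" : fromJvm;
    }

    /**
     * Hypothetical version of the proposed getDefaultEncoding():
     * UTF-8 for an empty workspace, the current default otherwise.
     */
    static String getDefaultEncoding(boolean workspaceIsEmpty) {
        return workspaceIsEmpty ? "UTF-8" : currentDefault();
    }

    public static void main(String[] args) {
        System.out.println("empty workspace:    " + getDefaultEncoding(true));
        System.out.println("existing workspace: " + getDefaultEncoding(false));
    }
}
```

The point of the split is that existing workspaces keep whatever they resolve to today, so no existing data is reinterpreted, while fresh workspaces start out on UTF-8.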
(In reply to comment #117) > (In reply to Laurent Barbareau from comment #115) > > Nobody wants to rot the environment of others. > > Right! Well, some seem to.
No... what we/people want is to get rid of those encoding issues as soon as possible. UTF-8 as the default encoding in Eclipse is just a small step. Eventually, everything in the Eclipse ecosystem has to converge towards UTF-8, but today neither Eclipse nor its plugins know how to deal properly with encoding. There is always a component that fails to determine the right (or most appropriate) encoding for a specific situation, even if you have an identical one at the different levels (workspace, project properties, files...).
But before initiating any action you should take the time to prepare people for that transition. You must not just decide to change something without warning users (as is too often the case, in my opinion, in Eclipse or elsewhere). And when I talk about warning people, I mean explicitly describing what's going to change for users, each time it may be useful or necessary. Not everybody knows which encoding is best for their situation.
Overall, I suggest ensuring that the encoding information is displayed or asked for everywhere it may be useful. For instance:
- by adding that information to the status bar (as we can see in some other software), so that each time you click on an element that can be concerned, the encoding is displayed/updated.
- when you create or copy a workspace, a project, a file... you're asked to choose the encoding (accompanied by a guide, in a popup, to choose the most appropriate one for you).
- when you start Eclipse for the first time, you're asked to choose the encoding (accompanied by a guide, in a popup, to choose the most appropriate one for you).
- ...
But always pushing UTF-8 as the best choice if you start from scratch.
At the same time, regarding what I was saying about the encoding issues encountered with Eclipse or its plug-ins, I hope it would be possible to guide developers to take more care with encoding determination...
(In reply to Dani Megert from comment #114) > I will take this into our > next PMC call to see whether other PMC members have a different opinion on > this.
We discussed this on Wednesday in our Eclipse top-level PMC call, and the PMC unanimously agreed that we will not change the default. You can find more details in the PMC Meeting Minutes from October 7: https://wiki.eclipse.org/Eclipse/PMC#Meeting_Minutes
To repeat the reasons for the decision:
- changing the encoding to 'UTF-8' on Windows causes lots of trouble:
  - encoding on Windows (including Windows 10) is 'Cp1252' in most countries around the globe
  - all Windows tools (including compilers) read and write files with that encoding
  - characters will no longer be readable when copying or importing files from disk
  - characters will be destroyed without warning when saving the file
To go forward, the following things can improve the workflow:
- make sure that the encoding is set on the project when creating it (bug 479450)
- add a warning when a project does not have the encoding set and provide a quick fix to set it (bug 479451)
- revisit the decision regarding the Welcome Questionnaire
- explicitly set the encoding when creating a resource where the encoding differs from the parent (e.g. when using drag & drop or import)
Please use the mentioned bugs to comment on the individual ideas, rather than commenting here.
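The "destroyed without warning" concern can be illustrated with plain Java, independent of Eclipse. The sketch below assumes an editor that decodes leniently (as `new String(bytes, charset)` does) and then saves in the same encoding; whether any particular editor behaves this way is an assumption, and the names `SilentLossDemo`, `openAndSaveAsUtf8` and `isValidUtf8` are made up for this example. It also shows a strict decode that reports malformed input, which is one possible basis for a warning.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SilentLossDemo {

    /** Simulates "open as UTF-8, then save as UTF-8" on bytes that were written in another charset. */
    static byte[] openAndSaveAsUtf8(byte[] onDisk) {
        // Lenient decoding: malformed byte sequences are silently replaced by U+FFFD.
        String inEditor = new String(onDisk, StandardCharsets.UTF_8);
        return inEditor.getBytes(StandardCharsets.UTF_8);
    }

    /** Strict check: returns true only if the bytes are well-formed UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "café" as a Cp1252 tool would have written it to disk.
        byte[] original = "caf\u00e9".getBytes(Charset.forName("windows-1252"));
        byte[] saved = openAndSaveAsUtf8(original);

        System.out.println("bytes unchanged after save: " + Arrays.equals(original, saved));
        System.out.println("original is valid UTF-8:    " + isValidUtf8(original));
        System.out.println("saved file is valid UTF-8:  " + isValidUtf8(saved));
    }
}
```

After the round trip the saved bytes are well-formed UTF-8, but the original `é` byte has been replaced by the encoding of U+FFFD and cannot be recovered from the saved file. A strict check like `isValidUtf8` run before opening is one way a tool could detect the mismatch instead of losing data.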
Thanks for the input, Dani. What you suggest is quite good from a user POV: although it's not changing the default, it will give the same user experience in most cases.
(In reply to Dani Megert from comment #121)
> - encoding on Windows (including Windows 10) is 'Cp1252' in most countries around the globe
Yeah, so? That looks like a pro-argument, not a con-argument to me.
> - all Windows tools (including compilers) read and write files with that encoding
Again, so??
> - characters will no longer be readable when copying or importing files from disk
What? How ...?
> - characters will be destroyed without warning when saving the file
LOL, seriously? There must be more reasons than this, right?
(In reply to Mickael Istria from comment #122) > Thanks for the input Dani. What you suggest is quite good from user POV, > despite it's not changing default, it will give the same user experience in > most cases. Thanks Mickael and Dani. If you can make it so that all new/empty workspaces are UTF-8, that would be a reasonable compromise. Remember that when you make decisions like this, you are making them on behalf of the community, the entire community, not just your employer and certainly not only yourself, and that you are doing it with respect for the opinion of those who've commented on this bug and elsewhere. Then hopefully the respect would be mutual. There are millions of users out there. Our products are successful because Eclipse in the large is successful. We need to make sure we protect that.
FYI, there is precedent for using UTF-8 by default on Windows. Visual SourceSafe (a Microsoft tool) uses UTF-8 by default: https://msdn.microsoft.com/en-US/library/5fdkw2w1(v=vs.80).aspx
Re: comment 123 While I strongly agree that using UTF-8 by default is the right thing to do, insults and sarcasm won't help convince anyone of this.