Bug 325828 - Egit 0.9.3 has troubles with Non-ASCII/Unicode filenames
Summary: Egit 0.9.3 has troubles with Non-ASCII/Unicode filenames
Status: CLOSED WONTFIX
Alias: None
Product: EGit
Classification: Technology
Component: Core
Version: 0.9.0
Hardware: PC Windows XP
Importance: P3 major with 1 vote
Target Milestone: ---
Assignee: Project Inbox CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 332613 352522
Depends on:
Blocks:
 
Reported: 2010-09-21 06:59 EDT by Daniel Stein CLA
Modified: 2012-08-20 18:15 EDT

See Also:


Attachments
Screenshot which shows the switch problem (5.63 KB, image/png)
2010-09-22 02:04 EDT, Daniel Stein CLA

Description Daniel Stein CLA 2010-09-21 06:59:06 EDT
Build Identifier:  M20090917-0800

Hi everyone. 

I'm new to EGit. I have used git and the git CLI for quite some time now and never had any trouble. Today I tried EGit. In my git repo there are file names that contain umlauts (vowel mutations like ä or ö). If I check out a branch of my repo with EGit, all files whose names contain umlauts are marked as "untracked files". The CLI doesn't behave like this.

I think it is very important for the acceptance of EGit in Germany that EGit supports file names with umlauts.

Reproducible: Always

Steps to Reproduce:
1. git CLI: Clone a repo which contains files whose names have umlauts.
2. EGit: Import projects from this repo.
3. git CLI: git status reports nothing to commit and no untracked files.
4. EGit: Check out a branch.
5. git CLI: git status reports the files whose names contain umlauts as untracked files...
Comment 1 Chris Aniszczyk CLA 2010-09-21 09:59:54 EDT
I agree, this should be investigated for 0.10
Comment 2 Daniel Stein CLA 2010-09-22 02:04:55 EDT
Created attachment 179353
Screenshot which shows the switch problem

Hi, 
Thanks for your interest. I investigated the problem further. I'm using msysgit 1.7.1.0 and want to switch to EGit. In msysgit I have set the variable core.quotepath to deal with file names with umlauts. If I import the project shown in the screenshot from a repo created by msysgit, EGit at first only shows the folder marked with (1.) and says the files are untracked. If I then switch to another branch, EGit adds the folder marked with (2.). Funnily, the second folder is named correctly. So it seems that msysgit also has trouble with umlauts. But I think EGit should not add folders by itself...

Another problem is that I must use msysgit because of git-svn...
Comment 3 Matthias Sohn CLA 2010-10-21 09:26:54 EDT
Could you try with the latest EGit nightly (since the nightly update site is currently broken, you may install it from Hudson by pointing p2 at https://hudson.eclipse.org/hudson/job/egit/lastSuccessfulBuild/artifact/org.eclipse.egit-updatesite/target/site/)
and try to reproduce the problem using EGit only (without using msysgit)?

I tried and everything worked just fine (will attach a screenshot).
I am on Mac hence can't try with msysgit. 

Note that msysgit has an open issue in that area
http://code.google.com/p/msysgit/issues/detail?id=80
Comment 4 Manuel Doninger CLA 2010-10-21 10:04:29 EDT
Hi Matthias,
I had the same problem and tried the snapshot. Now files with umlauts are marked as "added" (with 0.9.3 they were marked as "untracked").
My remote Git repo is hosted on a Red Hat server, and I clone this repository on a Windows machine. I don't have time today to do more testing; maybe I will find the time tomorrow or on Saturday.

Regards,
Manuel
Comment 5 Manuel Doninger CLA 2010-10-22 05:05:48 EDT
I have to correct my last comment: with EGit 0.9.3, files with umlauts are marked as "added" after cloning the repository, and the same happens with the nightly build.
This problem also occurs if I clone a repository that was created on a Windows computer.
But if I add and commit the files marked "added" and make a new clone, the files are handled correctly by EGit.
Comment 6 Manuel Doninger CLA 2010-10-22 05:45:22 EDT
Another amendment: after committing a file with an umlaut, EGit shows the icon overlay for "tracked", but if I open the commit window, the file status there says "unknown".
So my last comment about "handling correctly" was wrong.
Comment 7 Chris Aniszczyk CLA 2010-12-28 10:30:41 EST
Should look again at 0.11 timeframe
Comment 8 Robin Rosenberg CLA 2010-12-28 18:01:35 EST
(In reply to comment #2)
> In msysgit I have set the variable
> core.quotepath to deal with file-names with Umlauts. If I import the project

Minor note: core.quotepath only affects how paths are displayed and does not change how things work.
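
To illustrate the display-only point, here is a minimal Python sketch that approximates git's path quoting. The function `quote_path` is hypothetical and simplified (it only handles bytes at or above 0x80 and ignores control characters, quotes and backslashes); it is not git's actual implementation. The stored bytes are the same in both calls, only the rendering differs.

```python
def quote_path(raw: bytes, quotepath: bool = True) -> str:
    """Approximate how a path is shown; the stored bytes never change.

    Hypothetical helper, simplified: only bytes >= 0x80 are escaped.
    """
    if quotepath and any(b >= 0x80 for b in raw):
        # Show each high byte as a backslash-octal escape inside double quotes.
        body = "".join(f"\\{b:03o}" if b >= 0x80 else chr(b) for b in raw)
        return f'"{body}"'
    # With quoting off, show the bytes verbatim (decoded here as UTF-8).
    return raw.decode("utf-8", errors="replace")


stored = "B\u00e4r.txt".encode("utf-8")  # "Bär.txt" as UTF-8 bytes

print(quote_path(stored, quotepath=True))   # "B\303\244r.txt"
print(quote_path(stored, quotepath=False))  # Bär.txt
```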
Comment 9 Robin Rosenberg CLA 2010-12-28 18:04:08 EST
*** Bug 332613 has been marked as a duplicate of this bug. ***
Comment 10 Robin Rosenberg CLA 2010-12-28 18:27:15 EST
See http://egit.eclipse.org/r/#change,335 for a suggestion on how to work around this problem.

The workaround may alleviate the problem for some, but it does not represent a real solution. C Git and
JGit are incompatible.

C Git follows the philosophy that it stores file names "as-is", i.e. in the raw encoding. This is utter nonsense, since on Windows the encoding is UTF-16, which obviously is not the encoding used. We can accept the explanation that it is user or machine specific. C Git commits file names in the eight-bit locale of the user, which may be Latin-1, Latin-2, etc., Cyrillic, various legacy Chinese encodings, or UTF-8 composed (modern Linux) or decomposed (OS X).

If you create a repo using C Git in one encoding and use it (e.g. a clone) on another machine, you will have some kind of problem due to this.
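
A small Python sketch of why this happens, assuming a file name that one machine stores in Latin-1 and another machine expects in UTF-8. The example name and the helper `readable_as_utf8` are illustrative only and not taken from git or JGit.

```python
name = "\u00e4.txt"  # "ä.txt"

# The same name becomes different raw bytes depending on the locale encoding.
latin1_bytes = name.encode("latin-1")
utf8_bytes = name.encode("utf-8")
print(latin1_bytes.hex(), utf8_bytes.hex())


def readable_as_utf8(raw: bytes) -> bool:
    """Return True if the raw bytes form valid UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A UTF-8 machine cannot interpret the name written by the Latin-1 machine.
print(readable_as_utf8(latin1_bytes))
print(readable_as_utf8(utf8_bytes))
```

Because the tree object only records the bytes and not the encoding, the reading side has no reliable way to tell which interpretation was intended.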

Operations on the index are one thing and operations on the file system are another. I'm quite convinced that the ONLY long-term solution is to go for one of the UTF-8 forms in tree objects, ref names, etc. Hence, I made a decision a long time ago to always commit as if JGit's encoding was UTF-8. As it turns out, this is a composed form on Linux and Windows. On OS X it will be the decomposed form.
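
The composed/decomposed distinction can be sketched with Python's standard `unicodedata` module. This is only an illustration of the two Unicode normalization forms (NFC and NFD), not JGit code; the helper `same_name` is hypothetical.

```python
import unicodedata

composed = unicodedata.normalize("NFC", "\u00e4.txt")   # "ä" as one code point
decomposed = unicodedata.normalize("NFD", composed)     # "a" + combining diaeresis

# Both render as "ä.txt", yet they are different strings and different bytes.
print(composed == decomposed)
print(composed.encode("utf-8").hex(), decomposed.encode("utf-8").hex())


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_name(composed, decomposed))
```

A byte-wise comparison of path names treats the two forms as different files, which is why a repository can look inconsistent between OS X and the other platforms even when everything is UTF-8.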

The solution to this needs to come from some kind of agreement, to get enough mass around what is right. JGit can only get a cross-implementation solution halfway. C Git needs adapting too.
Comment 11 Robin Rosenberg CLA 2011-01-12 18:00:49 EST
Discussions around patches for "msys" git are going on, whereby C Git would be able to read our UTF-8 based file names. See http://groups.google.com/group/msysgit/browse_thread/thread/d4414235850ce181/95bfcc1718fd3f1e?lnk=gst&q=blees#95bfcc1718fd3f1e

Maybe we can just ignore the problem and it goes away :)

Fixing C Git is the proper way of dealing with this problem. Fixing msys git will not fix C Git on Unix systems with legacy encodings, but I guess there are not that many instances of these anymore.
Comment 12 Mikhail Sviridov CLA 2011-04-14 02:20:10 EDT
It is now a big problem to use EGit, because it commits all files with Cyrillic names using the UTF-8 charset. If I revert to some previous commit and look at the folder with these files, I see two sets of files: one set with cp1251 file names, and a second set of the same files but with corrupted file names. I think that files under Windows must use the native charset for file names: cp1251 for Cyrillic, or another. For example, SmartGIT works well with cp1251 file names!!! And it uses the Java runtime, just as EGit and Eclipse do. So why can't EGit do that?
Eclipse platform works with these files correctly!!! Why does EGit corrupt file names?
Eclipse is a cross-platform IDE, yet it somehow works correctly with native file names on various operating systems. Maybe the Eclipse platform can help EGit select the correct charset for working with the file system. Right now only EGit has this problem; other git clients work well. Note this!
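
One plausible reading of the "two sets of files" symptom, sketched in Python under the assumption that one tool writes the name as UTF-8 bytes and another tool interprets those bytes as cp1251. The file name is an invented example; this is not a trace of what EGit or msysgit actually do.

```python
original = "\u0444\u0430\u0439\u043b.txt"  # "файл.txt"

utf8_bytes = original.encode("utf-8")      # name as committed in UTF-8
cp1251_bytes = original.encode("cp1251")   # name as a cp1251 tool would store it

# A cp1251 tool reading the UTF-8 bytes sees a different, garbled name.
misread = utf8_bytes.decode("cp1251")
print(original, "->", misread)

# Both names are valid and distinct, so both can exist side by side on disk.
print(misread != original, len(original), len(misread))
```

Each Cyrillic letter takes two bytes in UTF-8, so the misread name has two characters for every original letter and shows up as a second, corrupted-looking file next to the intended one.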
Comment 13 Robin Rosenberg CLA 2011-04-15 12:34:00 EDT
(In reply to comment #12)
> Now it's a big problem to use egit because it commits all files with cyrillic
> names with UTF-8 charset and if I revert to some previous commit and look at
> the folder with these files then I will see 2 sets of files: 1 set - files with
> cp1251 filenames, 2 set - the same files but with corrupted filenames. I think
> that files under Windows must have native charset of filenames - cp1251 for
> cyrillic or another. For example, SmartGIT works good with cp1251 filenames!!!

Windows uses UTF-16 for file names. Applications may choose to work with other encodings representing a subset of UTF-16. Msys and C Git are examples of such applications. EGit can work with any Windows file name.

> And it uses java runtime as egit and eclipse. So why egit can't do that.
> Eclipse platform works with these files correctly!!! Why egit corrupts
> filenames?

Please read the other comments and follow the links. There is even a set of
patches for Msys Git that you may want to try.

> Eclipse is cross platform IDE, but somehow it works correctly with native
> filenames on various operating systems. May be eclipse platform can help egit
> to select correct charset for working with filesystem. Now only egit has this
> problem - other git clients work well. Note this!

That actually depends on your use case. Use C Git to create a repo on Windows and try to work with it on a Mac or Linux (using C Git). This will work to some extent, but look ugly. Working across Windows and Linux works OK with only EGit. Sharing repos between Russian/Swedish/etc. Windows machines will (I assume) work fine with only EGit, but not with C Git.

I haven't tried, but Cygwin 1.7 could possibly be compatible with EGit. This means Cygwin git will be incompatible with msys git.

C Git/EGit on OS X is incompatible with them all. It's not even fully compatible with itself due to the decomposing (...) nature of the OS X file system.

Pick your poison.
Comment 14 Robin Rosenberg CLA 2011-05-29 17:45:50 EDT
Posted patches http://egit.eclipse.org/r/3573 and http://egit.eclipse.org/r/3571, for people with insights into coding.
Comment 15 Robin Rosenberg CLA 2011-08-08 17:58:54 EDT
Status update: See http://markmail.org/message/jux476xzhaz6muoi for a Unicode enabled version of Git for Windows
Comment 16 Robin Rosenberg CLA 2012-08-19 16:34:36 EDT
Now that Git has fixed this for Windows and the rest of the world
is mostly UTF-8, I think we can declare this a WON'T FIX. I abandoned
the patches I had in Gerrit.
Comment 17 Matthias Sohn CLA 2012-08-20 03:48:55 EDT
(In reply to comment #16)
> Now that Git has fixed this for Windows and the rest of the world
> is mostly UTF-8, I think we can declare this a WON'T FIX. I abandoned
> the patches I had in Gerrit.

+1
Comment 18 Robin Rosenberg CLA 2012-08-20 18:15:30 EDT
*** Bug 352522 has been marked as a duplicate of this bug. ***