The current SFTP based remote tools have several disadvantages. The main problem is that the Eclipse core and CDT do many of the file operations in the main thread. This seems to be caused by developers (of Core and CDT) assuming that all file operations are low latency. Having the file operations in the main thread causes several speed and reliability problems.

Speed:
- Because the file operations run in the main thread, they block the GUI until the IO operation finishes, preventing the user from continuing to work while the IO operation is running. It also often prevents IO operations which could run in parallel from doing so. See Bugs 160353, 177994, 195997, 218387, 219169 and wiki.eclipse.org/TM_and_RSE_FAQ from the RSE team regarding the same problem for RSE. There seems to be no work-around for this problem. While it seems in theory possible to improve it somewhat by using Display.readAndDispatch, this is not advised and has been removed from RSE (160353). A responsive UI is considered by many to be extremely important for the user (see e.g. Google), so this is a major point.

Reliability:
- Because it runs in the main thread it causes some trouble with threading. This is not something which can't be overcome, but it makes the code more complicated. For examples see the workaround(s) required for 314771.

It is very unlikely, at least in the medium term (meaning the next Eclipse release in 2011), that both Eclipse Core and CDT will move all file operations into threads and hide latency by doing IO operations in parallel. Therefore a different approach is needed to provide a performant remote IO method.

For RSE an interesting suggestion is to support rsync as an alternative subsystem (195997). The idea is that the IO operations don't operate directly on the remote files but instead on a local replica. This replica would be kept up to date using rsync. The advantage is that (after an initial synchronization) all read operations would be local, and even write operations would not block the UI as long as synchronization is handled correctly. (For disadvantages and advantages see also the RSE FAQ.) In the case of remote development, it might be sufficient to automatically synchronize before any remote build and additionally allow the user to synchronize on demand. This would probably capture the usual use cases and would make it very convenient to use. Implementing this with rsync would give a large performance improvement (especially over connections like cable modem / DSL) and would in my opinion work significantly better than the current SFTP based approach. Of course it should not replace the current approach but only be an alternative, so that users preferring the SFTP approach keep that option.

As I wrote in a comment to 195997, while rsync would be a good option, it might not be the best approach for implementing such a replicating file system. It has two disadvantages: 1) no Java implementation is available and 2) the synchronization is only one-way. The latter is important if remote files get changed, either automatically or by the user. The one-way synchronization of rsync would usually not synchronize changes back to the client and would not detect conflicts caused by changes on both sides very well. Thus a better option than rsync might be to use git for the job. It has a Java implementation shipping with Helios, is known to be extremely fast (1) (including the Java implementation), and supports two-way synchronization.
Of course GIT is not meant as a synchronization tool (but a DVCS), yet it works as a synchronization tool extremely well. Using git for synchronization would work both for users who also use it for version control and for users who use some other tool for version control. The details would need to be discussed, e.g. whether the indexing is done remotely or locally and whether to automatically sync more often than when building. But even with auto-build activated, I would assume the overall (perceived) performance would be better, because the time for the synchronization is small compared to the build time, and a blocked UI gives a perception of very slow performance.

Deciding to add a replicating file system and deciding to implement it with GIT are of course two different questions. We should first discuss the former. I only included the suggestion of implementing it with git to make it clearer what such a replicating file system could look like.

1) As an example, a remote synchronization of a folder containing ~4000 files (1 changed, which is unknown to GIT), ~100MB, where GIT detects file changes on both sides, over a remote connection (cable), takes less than one second. The performance is mainly limited by the file system for the tree traversal.
I agree that the current situation is not ideal. Jeff Overbey has been experimenting with using rsync to enable remote support for Photran. The main problems with the rsync approach are:

1. If indexing is done locally, the entire project must be copied to the local machine. This only happens once, but could take a very long time for large projects/slow connections.
2. Local indexing is problematic as the local environment will be different from the remote environment, so macros and includes will be incorrect. Running scanner discovery remotely seems to be the obvious way to solve the macro problem, but scanner discovery is hopelessly broken and not even the CDT people seem to know how it works. In addition, the indexer would need to be modified to copy system and library includes from the remote machine as part of the indexing.
3. Remote indexing is problematic as each language requires a separate remote indexer. Currently only C and C++ are supported.
4. Some activities, such as building, will always need to be done remotely, so the performance problems will always be evident to some degree.

The GIT approach sounds interesting and worth exploring more. A combination of GIT synchronization and remote indexing might solve some issues in the short term.
(In reply to comment #1)
> I agree that the current situation is not ideal. Jeff Overbey has been
> experimenting with using rsync to enable remote support for Photran.
>
> The main problems with the rsync approach are:

I agree that a replicating approach (both rsync and git) also has disadvantages, so the current approach is certainly better for some cases. An additional disadvantage would be that we would need to support both the current approach and, in addition, the replicating one.

> 1. If indexing is done locally, the entire project must be copied to the local
> machine. This only happens once, but could take a very long time for large
> projects/slow connections.

Yes, if e.g. the user wants to change a few files once or seldom, the current approach would be better. In other cases the one-time wait shouldn't matter.

> 2. Local indexing is problematic as the local environment will be different
> from the remote environment, so macros and includes will be incorrect. Running
> scanner discovery remotely seems to be the obvious way to solve the macro
> problem, but scanner discovery is hopelessly broken and not even the CDT people
> seem to know how it works. In addition, the indexer would need to be modified
> to copy system and library includes from the remote machine as part of the
> indexing.

My experience is that the include files of standard libraries change so little (and are also installed on the client) that I would prefer any small speed advantage over having the remote include files indexed. But again it would of course be best if the user has a choice. Having the correct environment (environment variables, Make variables, ...) would indeed be nice. The remote discovery is currently broken for GNU, so it wouldn't be worse. And hopefully the remote discovery could be fixed for both remote and local indexing.

> 3. Remote indexing is problematic as each language requires a separate remote
> indexer. Currently only C and C++ are supported.

Wouldn't this be an advantage of a replicating (including rsync) approach, because one could do (optional) local indexing? Especially for other languages (e.g. Python) local indexing would work very well, because the environment and include files wouldn't be an issue.

> 4. Some activities, such as building, will always need to be done remotely, so
> the performance problems will always be evident to some degree.

Sure. But at least with C/C++ and larger projects the build step takes longer anyway (even when changing only one file, make has to check the modification time of all files and has to run the linking step) and thus would somewhat hide the time required for the remote IO. But for other languages which build fast or don't require a build this might be annoying. For those languages the best option (if possible) might be (as recommended by RSE) to run Eclipse remotely using VNC or NX.

> The GIT approach sounds interesting and worth exploring more. A combination of
> GIT synchronization and remote indexing might solve some issues in the short
> term.

I don't really think, regarding the 4 points above, that GIT would have advantages over rsync. I think the advantages of GIT would only be
1) Java implementation
2) if files get modified on the remote side (e.g. by the user or by the build)
(In reply to comment #2)
> > 2. Local indexing is problematic as the local environment will be different
> ...
> My experience is that the include files of standard libraries change so little
> (and are also installed on the client) that I would prefer any small speed
> advantage over having the remote include files indexed. But again it would of
> course be best if the user has a choice. Having the correct environment
> (environment variables, Make variables, ...) would indeed be nice. The remote
> discovery is currently broken for GNU, so it wouldn't be worse. And hopefully
> the remote discovery could be fixed for both remote and local indexing.

That might be the case if the client and remote systems are the same, but it is definitely not the case when they are different, which is the more likely situation.

> > 3. Remote indexing is problematic as each language requires a separate remote
> > indexer. Currently only C and C++ are supported.
> Wouldn't this be an advantage of a replicating (including rsync) approach,
> because one could do (optional) local indexing? Especially for other languages
> (e.g. Python) local indexing would work very well, because the environment and
> include files wouldn't be an issue.

Yes, provided you can solve the scanner info and include problems.

> I don't really think, regarding the 4 points above, that GIT would have advantages
> over rsync. I think the advantages of GIT would only be
> 1) Java implementation
> 2) if files get modified on the remote side (e.g. by the user or by the build)

Right. If you're just running a command to do the sync (e.g. prior to building), then rsync is probably fine. If you wanted to create an EFS provider based on GIT, that would be another matter.
(In reply to comment #3)
> (In reply to comment #2)
> > > 2. Local indexing is problematic as the local environment will be different
> ...
> > My experience is that the include files of standard libraries change so little
> > (and are also installed on the client) that I would prefer any small speed
> > advantage over having the remote include files indexed. But again it would of
> > course be best if the user has a choice. Having the correct environment
> > (environment variables, Make variables, ...) would indeed be nice. The remote
> > discovery is currently broken for GNU, so it wouldn't be worse. And hopefully
> > the remote discovery could be fixed for both remote and local indexing.
>
> That might be the case if the client and remote systems are the same, but it is
> definitely not the case when they are different, which is the more likely
> situation.

Yes, you are right. I was thinking of those libraries I use most (Glibc and FFTW), which I have everywhere from cygwin to a Linux notebook, a cluster and a Cray. But of course you are right, there are also systems which don't have Glibc. Still, I think it is not that uncommon to have it on both sides.

> > I don't really think, regarding the 4 points above, that GIT would have advantages
> > over rsync. I think the advantages of GIT would only be
> > 1) Java implementation
> > 2) if files get modified on the remote side (e.g. by the user or by the build)
>
> Right. If you're just running a command to do the sync (e.g. prior to
> building), then rsync is probably fine. If you wanted to create an EFS provider
> based on GIT, that would be another matter.

I never meant to suggest implementing a full EFS provider, but this is an interesting idea. What would be the advantage of implementing an EFS provider?

I think it is possible to do remote indexing with a sync approach. It would be OK to do the sync automatically after each save; the remote indexer would only need to be notified that the sync is finished (to make it performant). For the sync approach, both the build and the remote index would need to explicitly initiate the sync, or wait on the sync if it is initiated by the file save.

For the EFS approach there would be two options:
1) wait for GIT to finish when saving
2) not wait for GIT and have an API to check whether the latest GIT synchronization is finished

I think approach 1 would defeat the purpose. If every save operation has to guarantee that the file has changed remotely before it can finish, then it would have the same latency problem as the current approach. Approach 2 should work, but requires each remote operation to call a function to guarantee that the remote files are up-to-date. Any file save operation would initiate a sync (but not wait on it), and any remote operation (e.g. remote build or index) would automatically wait on any outstanding sync.

Approach 2 would also work without EFS, so I think the difference between implementing it with or without EFS is small. The only thing I see EFS could add would be to somehow combine the sync with the existing sftp based remote tools. E.g. it could be possible to configure it to only synchronize source files (as is typical for version control), and the EFS provider could use that information to automatically use sftp to read a file which is not part of the synchronization. But while this sounds like an interesting approach, I'm skeptical that it would work very well in practice (because of complexity - but I might very well be wrong).
Thus I would suggest the following steps:

1) Have an API which is required to be called by any remote operation to guarantee that the remote files are in sync, and add it to all remote operations (build, index, others?). The implementation of the call would depend on the file system used: it would do nothing for the existing remote tools, and could initiate a sync or wait on running syncs for an implementation of a replicating file system, either with or without EFS (see the sketch after this list).
2) This would allow testing some simple replication-based implementation and would allow comparing rsync and GIT.
3) Optional: Add a full EFS implementation and test what additional advantages it could provide over an implementation without EFS.
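As an illustration of step 1, here is a rough sketch of what such a call could look like. The interface, class and method names below (IRemoteSyncService, ensureRemoteInSync, NullSyncService) are hypothetical and only meant to show the shape of the API, not a concrete proposal:

import org.eclipse.core.runtime.CoreException;
import org.eclipse.core.runtime.IProgressMonitor;

// Hypothetical service every remote operation would call before touching remote files.
public interface IRemoteSyncService {
    // Blocks until the remote copy reflects all local changes (or initiates a
    // sync and waits for it), depending on the file-system implementation.
    void ensureRemoteInSync(IProgressMonitor monitor) throws CoreException;
}

// Default implementation for the existing sftp based remote tools: a no-op,
// because there the remote files are accessed directly and are always current.
class NullSyncService implements IRemoteSyncService {
    public void ensureRemoteInSync(IProgressMonitor monitor) {
        // nothing to do
    }
}

A remote build would then simply call syncService.ensureRemoteInSync(monitor) before launching make on the remote machine; a replicating implementation would do its push/wait there.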
Adding Jeff to CC as he might have some comments/suggestions. He's already been working on this.
I was thinking it might be possible to use GIT to create a mirror filesystem, so that a local and remote copy would be kept in sync. Merges could happen on saving, or prior to building, or something. It would need to be worked out, but it seems like this could be an interesting option for solving the performance and indexing issues, plus providing a nice off-line working environment.
Just searched the internet for some ideas and found the following:

- GIT is in most benchmarks the fastest DVCS. It is also supported by Eclipse (EGit). Thus it is probably the best option.
- Similar to our requirements is the area of web deployment, and people use GIT for that successfully (1-5). A generic file system is less comparable, because it wouldn't be a good idea to sync generated files (e.g. binaries, object files).
- Pushing to a non-bare repository is discouraged. There are different options:
  + Fetch (not a good option because it requires an SSHD on the client)
  + Push into the working branch with a post-update hook (6). Disadvantages: requires a stat of each file on the server (slow over NFS) and doesn't allow merging on the client side
  + Push to a separate bare repository. Disadvantage: requires 2 repositories
  + Push to a remote branch. See (7). Seems the best option (see the JGit sketch after this comment)

It is probably good to make it configurable whether we sync after each save or only before each compile. Both have advantages/disadvantages.
Pro sync after each save:
- Required for auto-build and indexing on the server
- Reduces time to build (because everything is already synced)
Contra:
- Causes a larger repository (~2k per commit) and more traffic

For the UI to be responsive it is important that the sync is in any case asynchronous (thus a file save operation would only initiate a sync but wouldn't wait on it). If it is possible to use a remote scanner but a local indexer, it is probably best to do the indexing locally. It would be a nice feature if one could compile+run both locally and remotely from the same project.

Links:
1) http://www.turnkeylinux.org/blog/website-synchronization
2) http://www.codebork.com/coding/2010/06/03/php-web-deployment-using-git.html
3) http://stackoverflow.com/questions/279169/deploy-php-using-git
4) http://insideria.com/2009/12/5-tips-for-deploying-sites.html
5) http://stackoverflow.com/questions/883878/update-website-with-a-single-command-git-push-instead-of-ftp-drag-and-dropping
6) http://utsl.gen.nz/git/post-update
7) http://thread.gmane.org/gmane.comp.version-control.git/42506/focus=42685
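A minimal sketch of the "push to a remote branch" option using the JGit porcelain API (org.eclipse.jgit.api.Git); the remote name and branch names are assumptions for illustration only:

import java.io.File;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.transport.RefSpec;

// Push the local master into a side branch of the remote (non-bare) repository.
// The remote working tree is then only updated by an explicit merge run on the
// server (e.g. right before a remote build), so nothing changes underneath a
// running build.
public class PushToRemoteBranchSketch {
    public static void pushForSync(File localWorkingCopy) throws Exception {
        Git git = Git.open(localWorkingCopy);
        git.push()
           .setRemote("origin")                                    // assumed remote name
           .setRefSpecs(new RefSpec("master:refs/heads/ptp-sync")) // assumed branch names
           .call();
    }
}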
Some real file systems which work over WAN (or articles about them):
http://www.xtreemfs.org/
http://offlinefs.sourceforge.net/wiki/
http://www.microsoft.com/windowsserversystem/dfs/default.mspx
http://www.eetimes.com/electronics-news/4144653/Enabling-File-Sharing-over-the-WAN
http://sector.sourceforge.net/
http://www.coda.cs.cmu.edu/
http://www.cuteftp.com/wafs/
http://www.riverbed.com/products/compare/wafs.php
http://portal.acm.org/citation.cfm?id=844128.844131
http://userweb.cs.utexas.edu/users/dahlin/papers/FINAL-PRACTI-NSDI.pdf

But none has all the required features: read-write support, production level, free, platform independent. And even if one were available, it would probably not be such a good idea to use a real file system. Even with FUSE it would be difficult to install and would often require admin permissions (at least to activate FUSE).

NetBeans uses replication over SSH for its remote development, and they call it "Secure Secure Copy". I wasn't able to find any design document about it.
By implementing an EFS, additional features are possible - more than I realized when I wrote Comment 4. The advantages of an asynchronous GIT synchronization after each save (described in Comment 7) are only possible with an EFS. And implementing such an EFS should be rather straightforward. It would only have to:
- initiate a sync (bundling file modifications if many happen at almost the same time, e.g. when moving a directory), and
- keep a record of which files have been changed at what time, to allow the build to know which synchronizations it has to wait for.

I still think the most important part can be tested without implementing an EFS, by only linking it into the remote operations (build, ...) as described in Comment 4, i.e. by having a synchronization service in addition to the (local) file system. I propose the following steps (a rough sketch follows after this list):
- Define a new synchronization service type (which adds synchronization/replication to the running EFS). It would have guaranteeSynchronized as its public method. The default service (for a purely local project or for remotetools/RSE) would do nothing.
- Add a call to guaranteeSynchronized to all remote operations (compile, remote index, ...).
- Implement a GIT based synchronization service (doing the GIT push in the guaranteeSynchronized call).
- Add the GUI to configure the synchronization service.

Later, an EFS could do the asynchronous GIT push after a file modification (e.g. save), and guaranteeSynchronized would just wait for the push to finish.
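A rough sketch of how the asynchronous part could be built on Eclipse Jobs plus the JGit porcelain API. The class name, constructor and the exact trigger of scheduleSync() are assumptions for illustration, not a worked-out design:

import java.io.File;
import org.eclipse.core.runtime.IProgressMonitor;
import org.eclipse.core.runtime.IStatus;
import org.eclipse.core.runtime.Status;
import org.eclipse.core.runtime.jobs.Job;
import org.eclipse.jgit.api.Git;

// Hypothetical GIT based synchronization service: a file save schedules an
// asynchronous commit+push, and remote operations (build, remote index) call
// guaranteeSynchronized() to wait only for the outstanding push to finish.
public class GitSyncService {
    private final File workingCopy; // local replica of the project (assumption)
    private Job syncJob;

    public GitSyncService(File workingCopy) {
        this.workingCopy = workingCopy;
    }

    // Called after a save (or, later, by an EFS); returns immediately so the UI
    // is never blocked by the remote IO.
    public synchronized void scheduleSync() {
        syncJob = new Job("Synchronize project with remote host") {
            protected IStatus run(IProgressMonitor monitor) {
                try {
                    Git git = Git.open(workingCopy);
                    git.add().addFilepattern(".").call();               // stage modified files
                    git.commit().setMessage("auto-sync on save").call();
                    git.push().call();                                  // upload the commits
                    return Status.OK_STATUS;
                } catch (Exception e) {
                    return Status.CANCEL_STATUS; // sketch only: real code would report the error
                }
            }
        };
        syncJob.schedule();
    }

    // Called by remote build/index before touching remote files.
    public void guaranteeSynchronized() throws InterruptedException {
        Job job;
        synchronized (this) {
            job = syncJob;
        }
        if (job != null) {
            job.join(); // wait for any outstanding synchronization
        }
    }
}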
I think a commit on every save would be a really nice feature as it would give the ability to roll back to any past buffer save. Even after 1000 commits at 2K per commit, that's only 2MB, which is not very big.

How will the remote side work? In order to build on the remote machine, the files will need to have been checked out, then updated prior to the build launching. Does git store the master repository in a form that could be accessed by filesystem commands like make?
(In reply to comment #10)
> I think a commit on every save would be a really nice feature as it would give
> the ability to roll back to any past buffer save. Even after 1000 commits at 2K
> per commit, that's only 2MB, which is not very big.

True.

> How will the remote side work? In order to build on the remote machine, the
> files will need to have been checked out, then updated prior to the build
> launching. Does git store the master repository in a form that could be
> accessed by filesystem commands like make?

Because Git is distributed it does not have a master repository. The default type of repository has the object storage (containing the file revisions) and the working directory (the checked-out files). A bare repository has only the object storage. As discussed in Comment 7, I would suggest using a standard (non-bare) repository on both the client and the server side. Thus a working directory with all files in their normal folder structure would be available on both sides. As you point out, the working directory has to be updated prior to the build.

Committing a file on the local side and updating the remote side would involve the following steps (see the sketch below):
- Mark the modified file for commit (git add; a local operation, not strictly necessary, but it improves performance because git does not have to check all files for modifications; only possible after implementing a full EFS)
- Commit the change (git commit; a local operation, so it provides the roll-back functionality even without an internet connection)
- Upload the commit to the remote side without changing the working directory (git push; details in the link in Comment 7)
- Update the working directory (git merge, executed remotely through the ssh connection; auto-merges if possible, fails on a merge conflict, allowing the merge to be done on the local side)

To check for updates from the remote side (and merge them in, if any):
- Commit any possible changes (remote: git commit -a; might be relatively slow because it requires a stat of each file, so we should probably not do that more often than every 30s-1min, to not put too much strain on the remote file system)
- Fetch+merge (git pull; auto-merges if possible, opens the EGit merge window if not)
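The same steps written against the JGit porcelain API, as a sketch only. Repository paths, remote and branch names are assumptions, and the remote-side merge is indicated only as a comment because it would go through whatever ssh/exec service the remote connection provides:

import java.io.File;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.transport.RefSpec;

// Sketch of the per-save flow described above (all names are illustrative).
public class SyncFlowSketch {
    // Local side: record a change and upload it without touching the remote working tree.
    public static void commitAndUpload(File workingCopy, String changedFile) throws Exception {
        Git git = Git.open(workingCopy);
        // "git add <file>": avoids a full tree scan once an EFS reports the changed path
        git.add().addFilepattern(changedFile).call();
        // "git commit": local, so roll-back works even without a connection
        git.commit().setMessage("save " + changedFile).call();
        // "git push" into a branch that is not checked out on the remote side
        git.push().setRefSpecs(new RefSpec("master:refs/heads/incoming")).call();
        // The remote working directory is then updated by running "git merge incoming"
        // over the project's ssh connection (not shown here).
    }

    // Local side: pick up changes made remotely (e.g. by a build or by the user).
    public static void pullRemoteChanges(File workingCopy) throws Exception {
        Git git = Git.open(workingCopy);
        // "git pull" = fetch + merge; a conflict would be handed to EGit's merge tooling
        git.pull().call();
    }
}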
Just for reference, here is the work that we did a few months ago on rsync-based remote projects...

Here is a relatively complete description of the project and many of the issues we thought of: http://wiki.eclipse.org/PTP/photran/rsync_remote_projects

Bug 313194 is a strawman prototype of an rsync-based remotely-synchronized project. It adds a new project wizard which creates a C/Fortran project but replaces the standard CDT build command (make) with a call to a custom shell script which uses rsync to copy the project to a remote server and run make remotely. This was definitely a prototype -- I'm sure the final version won't look anything like it (e.g., our build script makes two or three separate connections to the remote machine) -- but this is *simple* and it works, more or less, which gave us something real to try out.

Bug 305525 adds remote (Fortran) INCLUDE paths to Photran. Photran's include paths are configured in the project properties. Traditionally, they'd be paths on the local machine (e.g., /usr/include:/usr/local/include). This replaces them with URIs, so they can be on either the local machine or a remote one (e.g., rse://remotehost/usr/include:file:///usr/include). It also changes the properties page to use a remote file selection dialog box.
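For readers who do not want to dig into bug 313194: a rough Java sketch of what such a wrapper build command boils down to. The actual prototype is a shell script, so the host, paths and rsync flags below are illustrative assumptions, not the prototype's contents:

import java.io.IOException;

// Rough equivalent of "sync, then build remotely": copy the project with rsync
// and run make over ssh. All names below are placeholders.
public class RsyncBuildSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        String remote = "user@remotehost";     // placeholder connection
        String localDir = "/path/to/project/"; // placeholder project location
        String remoteDir = "project/";         // placeholder remote location

        // 1) mirror the project to the remote machine
        run("rsync", "-az", "--delete", localDir, remote + ":" + remoteDir);
        // 2) run the build remotely
        run("ssh", remote, "cd " + remoteDir + " && make");
    }

    private static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("command failed: " + cmd[0]);
        }
    }
}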
(In reply to comment #12)
> Bug 313194 is a strawman prototype of an rsync-based remotely-synchronized
> project. It adds a new project wizard which creates a C/Fortran project but
> replaces the standard CDT build command (make) with a call to a custom shell
> script which uses rsync to copy the project to a remote server and run make
> remotely. This was definitely a prototype -- I'm sure the final version won't
> look anything like it (e.g., our build script makes two or three separate
> connections to the remote machine) -- but this is *simple* and it works, more
> or less, which gave us something real to try out.

Will the code be attached to bug 313194? I'd like to take a look.
In CVS ships with Indigo