| Summary: | replicating file system | | |
|---|---|---|---|
| Product: | [Tools] PTP | Reporter: | Roland Schulz <roland> |
| Component: | Remote Tools | Assignee: | Roland Schulz <roland> |
| Status: | CLOSED FIXED | QA Contact: | |
| Severity: | enhancement | | |
| Priority: | P3 | CC: | ben, com-eclipse-dot-org, dieter.krachtus, g.watson, incongruous, yevshif |
| Version: | 4.0 | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Windows 7 | | |
| Whiteboard: | | | |
Description
Roland Schulz
I agree that the current situation is not ideal. Jeff Overbey has been experimenting with using rsync to enable remote support for Photran. The main problems with the rsync approach are:

1. If indexing is done locally, the entire project must be copied to the local machine. This only happens once, but could take a very long time for large projects/slow connections.

2. Local indexing is problematic as the local environment will be different from the remote environment, so macros and includes will be incorrect. Running scanner discovery remotely seems to be the obvious way to solve the macro problem, but scanner discovery is hopelessly broken and not even the CDT people seem to know how it works. In addition, the indexer would need to be modified to copy system and library includes from the remote machine as part of the indexing.

3. Remote indexing is problematic as each language requires a separate remote indexer. Currently only C and C++ are supported.

4. Some activities, such as building, will always need to be done remotely, so the performance problems will always be evident to some degree.

The GIT approach sounds interesting and worth exploring more. A combination of GIT synchronization and remote indexing might solve some issues in the short term.

(In reply to comment #1)
> I agree that the current situation is not ideal. Jeff Overbey has been
> experimenting with using rsync to enable remote support for Photran.
>
> The main problems with the rsync approach are:

I agree that a replicating approach (whether rsync or git) also has disadvantages, so the current approach is certainly better for some cases. An additional disadvantage would be that we would need to support both the current approach and the replicating one.

> 1. If indexing is done locally, the entire project must be copied to the local
> machine. This only happens once, but could take a very long time for large
> projects/slow connections.

Yes, if e.g. the user only wants to change a few files once or seldom, the current approach would be better. In other cases the one-time wait shouldn't matter.

> 2. Local indexing is problematic as the local environment will be different
> from the remote environment, so macros and includes will be incorrect. Running
> scanner discovery remotely seems to be the obvious way to solve the macro
> problem, but scanner discovery is hopelessly broken and not even the CDT people
> seem to know how it works. In addition, the indexer would need to be modified
> to copy system and library includes from the remote machine as part of the
> indexing.

My experience is that the include files of standard libraries change so little (and are also installed on the client) that I would prefer the small speed advantage over having the remote include files indexed. But again, it would of course be best if the user has a choice. Having the correct environment (environment variables, Make variables, ...) would indeed be nice. The remote discovery is currently broken for GNU, so it wouldn't be worse. And hopefully the remote discovery could be fixed for both remote and local indexing.

> 3. Remote indexing is problematic as each language requires a separate remote
> indexer. Currently only C and C++ are supported.

Wouldn't this be an advantage of a replicating (including rsync) approach? Because one could do (optional) local indexing? Especially for other languages (e.g. Python), local indexing would work very well because the environment and include files wouldn't be an issue.

> 4. Some activities, such as building, will always need to be done remotely, so
> the performance problems will always be evident to some degree.

Sure. But at least with C/C++ and larger projects, the build step is long anyhow (even when changing only one file, make has to check the modification time of all files and has to run the linking step) and would thus somewhat hide the time required for the remote IO. But for other languages which build fast or don't require a build, this might be annoying. For those languages the best option (if possible) might be (as recommended by RSE) to run Eclipse remotely using VNC or NX.

> The GIT approach sounds interesting and worth exploring more. A combination of
> GIT synchronization and remote indexing might solve some issues in the short
> term.

Regarding the 4 points above, I don't really think that GIT would have advantages over rsync. I think the only advantages of GIT would be:
1) Java implementation
2) Handling files that get modified on the remote side (e.g. by the user or by the build)

(In reply to comment #2)
> > 2. Local indexing is problematic as the local environment will be different
> ...
> My experience is that the include files of standard libraries change so little
> (and are also installed on the client) that I would prefer the small speed
> advantage over having the remote include files indexed. ...

That might be the case if the client and remote systems are the same, but is definitely not the case when they are different, which is more likely to be the situation.

> > 3. Remote indexing is problematic as each language requires a separate remote
> > indexer. Currently only C and C++ are supported.
> Wouldn't this be an advantage of a replicating (including rsync) approach?
> Because one could do (optional) local indexing? ...

Yes, provided you can solve the scanner info and include problems.

> Regarding the 4 points above, I don't really think that GIT would have
> advantages over rsync. I think the only advantages of GIT would be:
> 1) Java implementation
> 2) Handling files that get modified on the remote side (e.g. by the user or by the build)

Right. If you're just running a command to do the sync (e.g. prior to building), then rsync is probably fine. If you wanted to create an EFS provider based on GIT this would be another matter.
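A minimal sketch of that "command prior to building" flavor, for illustration only (the host name, paths, and the SyncBeforeBuild class are hypothetical; this is not Jeff's actual prototype from bug 313194):

```java
import java.io.File;
import java.io.IOException;

/**
 * Sketch: mirror the local project to the remote machine with rsync,
 * then run the build remotely. All names (host, paths) are examples.
 */
public class SyncBeforeBuild {
    public static void main(String[] args) throws IOException, InterruptedException {
        String localProject = "/home/user/workspace/myproject/";
        String remoteTarget = "user@remotehost:/home/user/myproject/";

        // -a preserves permissions/times, -z compresses, --delete mirrors
        // removals; --exclude keeps generated files out of the transfer.
        ProcessBuilder rsync = new ProcessBuilder(
                "rsync", "-az", "--delete", "--exclude", "*.o",
                localProject, remoteTarget);
        rsync.inheritIO();
        if (rsync.start().waitFor() != 0) {
            throw new IOException("rsync failed; skipping remote build");
        }

        // Build only after the sync has completed successfully.
        ProcessBuilder build = new ProcessBuilder(
                "ssh", "remotehost", "make", "-C", "/home/user/myproject");
        build.inheritIO();
        build.start().waitFor();
    }
}
```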
(In reply to comment #3)
> That might be the case if the client and remote systems are the same, but is
> definitely not the case when they are different, which is more likely to be the
> situation.

Yes, you are right. I was thinking of the libraries I use most (Glibc and FFTW), which I have everywhere from Cygwin to my Linux notebook, cluster and Cray. But of course you are right, there are also systems which don't have Glibc. Still, I think it is not that uncommon to have it on both sides.

> Right. If you're just running a command to do the sync (e.g. prior to
> building), then rsync is probably fine. If you wanted to create an EFS provider
> based on GIT this would be another matter.

I never meant to suggest implementing a full EFS provider, but this is an interesting idea. What would be the advantage of implementing an EFS provider? I think it is possible to do remote indexing with a sync approach. It would be OK to do the sync automatically after each save; the remote indexer would only need to be notified once the sync is finished (to make it performant). With the sync approach, both the build and the remote index would need to either explicitly initiate the sync, or wait on the sync if it is initiated by the file save.

For the EFS approach there would be two options:
1) wait for GIT to finish when saving
2) not wait for GIT, and have an API to check whether the latest GIT synchronization is finished

I think approach 1 would defeat the purpose. If every save operation has to guarantee that the file has changed remotely before it can finish, then it would have the same latency problem as the current approach. Approach 2 should work, but requires each remote operation to call a function to guarantee that the remote files are up to date. Any file save operation would initiate a sync (but not wait), and any remote operation (e.g. remote build or index) would automatically wait on any outstanding sync. Approach 2 would also work without EFS, so I think the difference between implementing it with or without EFS is small. The only thing I see EFS could add would be to somehow combine the sync with the existing sftp-based remote tools. E.g. it could be possible to configure it to only synchronize source files (as is typical for version control), and the EFS provider could use that information to automatically fall back to sftp to read a file which is not part of the synchronization. But while this sounds like an interesting approach, I'm skeptical that it would work very well in practice (because of complexity - but I might very well be wrong).

Thus I would suggest the following steps (a sketch of the API idea follows below):
1) Define an API which is required to be called by any remote operation to guarantee that the remote files are in sync, and add it to all remote operations (build, index, further?). The implementation of the call would depend on the file system used: it would do nothing for the existing remote tools, and could initiate a sync or wait on running syncs in an implementation based on a replicating file system (either with or without EFS).
2) This would allow testing a simple replication-based implementation and comparing rsync and GIT.
3) Optional: Add a full EFS implementation and test what additional advantages this could provide over an implementation without EFS.
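To make step 1 more concrete, the service interface could look roughly like this (the names are hypothetical, not an existing PTP API; the method name anticipates the guaranteeSynchronized proposal made later in this thread):

```java
/**
 * Hypothetical synchronization service for step 1 above.
 * Names are illustrative only.
 */
public interface ISyncService {
    /** Called on file save: start a sync, but do not block the UI. */
    void initiateSync();

    /**
     * Called by any remote operation (build, remote index, ...) before it
     * touches remote files: block until all outstanding syncs have reached
     * the remote side. The default implementation for the existing
     * remote-tools/RSE file systems would simply return, since files there
     * are already remote.
     */
    void guaranteeSynchronized() throws InterruptedException;
}
```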
Adding Jeff to CC as he might have some comments/suggestions. He's already been working on this.

I was thinking it might be possible to use GIT to create a mirror filesystem, so that a local and remote copy would be kept in sync. Merges could happen on saving, or prior to building, or something. The details would need to be worked out, but it seems like this could be an interesting option for solving the performance and indexing issues, plus it would provide a nice off-line working environment.

Just searched the internet for some ideas and found the following:
- GIT is in most benchmarks the fastest DVCS. It is also supported by Eclipse (EGit). Thus it is probably the best option.
- The area of web deployment has requirements similar to ours, and people use GIT for it successfully (1-5). A generic file system is less comparable, because it wouldn't be a good idea to sync generated files (e.g. binaries, object files).
- Pushing to a non-bare repository is discouraged. There are different options:
  + Fetch (not a good option because it requires an SSHD on the client)
  + Push into the working branch with a post-update hook (6). Disadvantages: requires a stat of each file on the server (slow over NFS) and doesn't allow merging on the client side
  + Push to a separate bare repository. Disadvantage: requires 2 repositories
  + Push to a remote branch. See (7). Seems the best option

It is probably good to make it configurable whether we sync after each save or only before each compile. Both have advantages/disadvantages:
Pro (sync after each save):
- Required for auto-build and indexing on the server
- Reduces the time to build (because the project is already synced)
Contra:
- Causes a larger repository (~2k per commit) and more traffic

For the UI to be responsive it is important that the sync is in any case asynchronous (thus a file save operation would only initiate a sync but wouldn't wait on it). If it is possible to use a remote scanner but a local indexer, it is probably best to do the indexing locally. It would be a nice feature if one could compile+run both locally and remotely from the same project.

Links:
1) http://www.turnkeylinux.org/blog/website-synchronization
2) http://www.codebork.com/coding/2010/06/03/php-web-deployment-using-git.html
3) http://stackoverflow.com/questions/279169/deploy-php-using-git
4) http://insideria.com/2009/12/5-tips-for-deploying-sites.html
5) http://stackoverflow.com/questions/883878/update-website-with-a-single-command-git-push-instead-of-ftp-drag-and-dropping
6) http://utsl.gen.nz/git/post-update
7) http://thread.gmane.org/gmane.comp.version-control.git/42506/focus=42685

Some real file systems which work over a WAN (or articles about them):
http://www.xtreemfs.org/
http://offlinefs.sourceforge.net/wiki/
http://www.microsoft.com/windowsserversystem/dfs/default.mspx
http://www.eetimes.com/electronics-news/4144653/Enabling-File-Sharing-over-the-WAN
http://sector.sourceforge.net/
http://www.coda.cs.cmu.edu/
http://www.cuteftp.com/wafs/
http://www.riverbed.com/products/compare/wafs.php
http://portal.acm.org/citation.cfm?id=844128.844131
http://userweb.cs.utexas.edu/users/dahlin/papers/FINAL-PRACTI-NSDI.pdf

But none has all the required features: read-write support, production quality, free, platform independent. And even if one were available, it would probably not be a good idea to use a real file system. Even with FUSE it would be difficult to install and would often require admin permissions (at least to activate FUSE).

NetBeans uses replication over SSH for its remote development and calls it "Secure Copy". I wasn't able to find any design document about it.

By implementing an EFS, additional features are possible - more than I realized when I wrote Comment 4. The advantages of asynchronous GIT synchronization after each save (described in Comment 7) are only possible with an EFS. And implementing such an EFS should be rather straightforward. It would only have to:
- initiate a sync (bundling file modifications if many arrive at almost the same time, e.g. when moving a directory), and
- keep a record of which files have been changed at what time, to allow the build to know which synchronizations it has to wait for.

I still think the most important part can be tested without implementing an EFS, by only hooking into the remote operations (build, ...) as described in Comment 4 - that is, by having a synchronization service in addition to the (local) file system. I propose the following steps (see the sketch after this list):
- Define a new synchronization service type (which adds synchronization/replication to the running EFS). It would have guaranteeSynchronized as its public method. The default service (for a purely local project or for remote-tools/RSE) would do nothing.
- Add a call to guaranteeSynchronized to all remote operations (compile, remote index, ...)
- Implement a GIT-based synchronization service (doing the GIT push in the guaranteeSynchronized call)
- Add the GUI to configure the synchronization service

Later, an EFS could do the asynchronous GIT push after a file modification (e.g. save), and guaranteeSynchronized would just wait for the push to finish.
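As a rough illustration of how such a service could defer the push while letting remote operations wait on it, here is a sketch implementing the interface shown earlier (the class name and the single-threaded executor are illustrative assumptions, not a design decision):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Sketch of a GIT-based synchronization service: a save initiates an
 * asynchronous push, and guaranteeSynchronized() blocks until the
 * outstanding pushes have finished. Names are illustrative only.
 */
public class GitSyncService implements ISyncService {
    // One worker thread keeps pushes ordered and serializes rapid saves.
    private final ExecutorService pushQueue = Executors.newSingleThreadExecutor();
    private volatile Future<?> lastPush;

    @Override
    public synchronized void initiateSync() {
        // Called on file save: enqueue a commit+push, but do not block the UI.
        lastPush = pushQueue.submit(this::commitAndPush);
    }

    @Override
    public void guaranteeSynchronized() throws InterruptedException {
        // Called by a remote build/index: wait for the newest outstanding push.
        Future<?> pending = lastPush;
        if (pending != null) {
            try {
                pending.get();
            } catch (ExecutionException e) {
                throw new IllegalStateException("sync failed", e.getCause());
            }
        }
    }

    private void commitAndPush() {
        // git add/commit/push would go here (e.g. via JGit or the git CLI).
    }
}
```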
I think a commit on every save would be a really nice feature, as it would give the ability to roll back to any past buffer save. Even after 1000 commits at 2K per commit, that's only 2MB, which is not very big.

How will the remote side work? In order to build on the remote machine, the files will need to have been checked out, then updated prior to the build launching. Does git store the master repository in a form that could be accessed by filesystem commands like make?

(In reply to comment #10)
> I think a commit on every save would be a really nice feature, as it would give
> the ability to roll back to any past buffer save. Even after 1000 commits at 2K
> per commit, that's only 2MB, which is not very big.

True.

> How will the remote side work? In order to build on the remote machine, the
> files will need to have been checked out, then updated prior to the build
> launching. Does git store the master repository in a form that could be
> accessed by filesystem commands like make?

Because Git is distributed, it does not have a master repository. The default type of repository has the object storage (containing the file revisions) and the working directory (the checked-out files). A bare repository has only the object storage. As discussed in Comment 7, I would suggest using a standard (non-bare) repository on both the client and the server side. Thus a working directory with all files in their normal folder structure would be available on both sides. As you point out, the working directory has to be updated prior to the build.

Committing a file on the local side and updating the remote side would involve the following steps (sketched in code below):
- Mark the modified file for commit (git add; a local operation; not strictly necessary, but it improves performance because git does not have to check all files for modifications; only possible after implementing a full EFS)
- Commit the change (git commit; a local operation, thus providing the roll-back functionality without an internet connection)
- Upload the commit to the remote side without changing the working directory (git push; details in the link in Comment 7)
- Update the working directory (git merge, executed remotely through the ssh connection; auto-merges if possible, fails on a merge conflict, allowing the merge to be done on the local side)

To check for updates from the remote side (and merge them in if there are any):
- Commit any possible changes (remotely: git commit -a; might be relatively slow because it requires a stat of each file, so we should probably not do that more than every 30s-1min to avoid putting too much strain on the remote file system)
- Fetch+merge (git pull; auto-merges if possible, opens the EGit merge window if not)
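A minimal sketch of the local-commit/remote-update cycle above, assuming JGit (the library behind EGit) on the local side and a plain ssh call for the remote merge; the repository path, host name, and the "incoming" branch name are made-up examples:

```java
import java.io.File;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.transport.RefSpec;

/** Sketch of the commit/push/remote-merge steps described above. */
public class GitSyncSketch {
    public static void main(String[] args) throws Exception {
        // Local operations: stage and commit (roll-back point, works offline).
        try (Git git = Git.open(new File("/home/user/workspace/myproject"))) {
            git.add().addFilepattern(".").call();
            git.commit().setMessage("autosave").call();

            // Push into a separate remote branch ("push to a remote branch",
            // option (7) above), so the remote working branch is untouched.
            git.push()
               .setRemote("origin")
               .setRefSpecs(new RefSpec("master:incoming"))
               .call();
        }

        // Remote operation: merge the pushed branch into the working
        // directory; a conflict makes this fail so it can be merged locally.
        new ProcessBuilder("ssh", "remotehost",
                "cd myproject && git merge incoming")
                .inheritIO().start().waitFor();
    }
}
```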
Just for reference, here is the work that we did a few months ago on rsync-based remote projects...

Here is a relatively complete description of the project and many of the issues we thought of: http://wiki.eclipse.org/PTP/photran/rsync_remote_projects

Bug 313194 is a strawman prototype of an rsync-based remotely-synchronized project. It adds a new project wizard which creates a C/Fortran project but replaces the standard CDT build command (make) with a call to a custom shell script which uses rsync to copy the project to a remote server and run make remotely. This was definitely a prototype -- I'm sure the final version won't look anything like it (e.g., our build script makes two or three separate connections to the remote machine) -- but it is *simple* and it works, more or less, which gave us something real to try out.

Bug 305525 adds remote (Fortran) INCLUDE paths to Photran. Photran's include paths are configured in the project properties. Traditionally, they'd be paths on the local machine (e.g., /usr/include:/usr/local/include). This replaces them with URIs, so they can be on either the local machine or a remote one (e.g., rse://remotehost/usr/include:file:///usr/include). It also changes the properties page to use a remote file selection dialog box.

(In reply to comment #12)
> Bug 313194 is a strawman prototype of an rsync-based remotely-synchronized
> project. ... This was definitely a prototype -- I'm sure the final version won't
> look anything like it ... but it is *simple* and it works, more or less, which
> gave us something real to try out.

Will the code be attached to bug 313194? I'd like to take a look.

In CVS. Ships with Indigo.