I initiated this discussion on the mailing list [1]. I'd like to continue it here. We've been running the UDC for a number of years now. Despite my own efforts and the efforts of several individuals, companies, and university researchers, we have not yet been able to make any really valuable use of the data. My sense is that the amount of effort that we've been spending far outstrips the value of the data collected. And, as the bug record indicates, we really need to spend more effort just to maintain the status quo. Unfortunately, attracting additional committers to the project won't solve the problem. Ultimately, it's the huge volume of data that is the main problem. Privacy concerns prevent us from making this data widely available, and careful dissemination to individuals has proven fruitless. Ian and I have discussed this and we feel that it is time to retire the UDC. I am concerned about making changes to the composition of the packages at this late date in our development cycle, but would like to recommend that we do so immediately to ensure adequate time for testing. I will plan to archive UDC following the Indigo release. [1] http://dev.eclipse.org/mhonarc/lists/epp-dev/msg01458.html
I'm not very comfortable with a removal of the UDC from the Simultaneous Release that late in the release. Okay, there is no added functionality since Helios, but on the other hand I don't see any show-stopper bugs for the UDC at the moment, something that would prevent us from releasing it as it is with Indigo. Maybe we should think about removing (or improving?) it with next year's Juno release, but I don't see that we save any effort on the client side if we remove it from the packages and from the Simultaneous Release. On the server side? There are many millions of UDC clients out there and there is no way of stopping them from sending data to the Foundation servers. Unless we implement a >/dev/null nothing will change in the near future on the server side. (Apart from that: Removing it from the packages is a trivial task. It requires the removal of a single line in one feature.xml. But again, I don't think we should change it only 4 weeks before the release.)
(In reply to comment #1) > I'm not very comfortable with a removal of the UDC from the Simultaneous Release > that late in the release. Okay, there is no added functionality since Helios, > but on the other hand I don't see any show-stopper bugs for the UDC at the > moment, something that would prevent us from releasing it as it is with Indigo. > Maybe we should think about removing (or improving?) it with next year's Juno > release, but I don't see that we save any effort on the client side if we > remove it from the packages and from the Simultaneous Release. I am nervous about any late change. As I note in the discussion on epp-dev, it is dealing with the data on the server that's the challenge, not maintaining the client. The UDC incurs an ongoing cost to the Foundation in terms of Webmaster and my time, in addition to the bandwidth and load issues. Thus far, we have not been able to generate appreciable value for that cost. > On the server side? There are many millions of UDC clients out there and there > is no way of stopping them from sending data to the Foundation servers. Unless we > implement a >/dev/null nothing will change in the near future on the server > side. I can flip a switch in the server code to shut it off gracefully. > (Apart from that: Removing it from the packages is a trivial task. It requires > the removal of a single line in one feature.xml. But again, I don't think we > should change it only 4 weeks before the release.) I am equally concerned. AFAIK, we have only one downstream plug-in that adds monitors to the UDC, from the m2e project; they have been monitoring this discussion. Still, any major change is worthy of concern.
Is there a way of putting in a property which stops collection instead of removing code? I believe there's a URL that you can change to set the UDC location - if it were set to a null value, would that prevent the transmission of data? Separately, there are use cases for using this internally within organisations. I'd be sad to see it go, especially at late notice and if it works at the moment. Removing it for future releases gives time for those that want to collect the data to come up with an alternative. Alex
The Code Recommenders project is interested in using and extending UDC. Two examples: 1. Stacktrace Search Engine: http://code-recommenders.blogspot.com/2011/05/oh-stacktrace-my-stacktrace.html 2. "Mylyn Inverse" & Extended JavaDoc Platform: How people interact with APIs, i.e., how they navigate through the documentation, carries some interesting information, like "programmers that looked at IWizard.addPage also looked at WizardPage". We are currently evaluating how we could leverage such data and merge it with Mylyn. Code Recommenders does not rely on UDC yet. However, it might be of great use.
IMO, if anyone can come up with a valid use case of this working _now_, then it ought not be removed at this point in the process for Indigo... and reading below I see someone said it would be a pity because of some internal usage... expand on that a bit and I vote we keep it in. Moving forward, though, I would like to see it turned off by default, with a simple way for an organization that has that internal use case to have their folks run with it. Or perhaps have it run and prompt in special cases: 'We have detected that Eclipse did not shut down normally the last time you used it. Would you like to turn on UDC so that, if it happens again, we might get a record of what went wrong to help improve Eclipse?' Something like that, maybe.
I just wanted to point you to another use of the UDC data. The following research paper has analyzed UDC data to understand the usage patterns of refactoring tools in Eclipse. Murphy-Hill, Emerson, Chris Parnin, and Andrew P. Black. 2009. How we refactor, and how we know it. In International Conference on Software Engineering, 287-297. <http://people.engr.ncsu.edu/ermurph3/papers/icse09.pdf>.
(In reply to comment #6) That's a very interesting pointer. Many thanks!
To close this loop... As I indicated on the mailing list, I have decided to stall UDC retirement until post Indigo. I'll leave this bug open as a place to hold the ongoing discussion. (In reply to comment #3) > Is there a way of putting in a property which stops collection instead of > removing code? I believe there's a URL that you can change to set the UDC location - > if it were set to a null value, would that prevent the transmission of data? There's a flag in the server code to make it just ignore upload requests (and pretend that everything is fine so that the cached values on the workstation don't accumulate).
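As a rough illustration of the server-side flag described in the comment above, the following Java sketch shows an upload handler that, once retention is switched off, drops the payload but still acknowledges the request, so that clients clear their local cache instead of accumulating data. All names here (UploadHandler, handleUpload, retainData) are hypothetical and are not taken from the actual UDC server code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical upload endpoint with a "retire" switch; not the real UDC server code. */
final class UploadHandler {
    /** Status reported back to the client for an accepted upload. */
    static final int OK = 200;

    /** Stand-in for the "flag in the server code": false means ignore uploads. */
    private final boolean retainData;
    private final List<String> stored = new ArrayList<>();

    UploadHandler(boolean retainData) {
        this.retainData = retainData;
    }

    /** Handles one upload request and returns the status sent to the client. */
    int handleUpload(String payload) {
        if (retainData) {
            stored.add(payload);
        }
        // With the flag off the payload is silently dropped, but the client is
        // still told that everything is fine so it discards its cached events.
        return OK;
    }

    /** Number of payloads actually kept on the server. */
    int storedCount() {
        return stored.size();
    }
}
```

With `new UploadHandler(false)`, every call to `handleUpload` returns `UploadHandler.OK` while `storedCount()` stays at zero, which is the "ignore but pretend everything is fine" behaviour the comment describes.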
So it's now post-Indigo. I have to admit that I am very concerned with the apparent growth in usage data collectors (The Recommenders project has a contribution working through the CQ process, and there are several third-party plug-ins that have usage data collection functionality). The fact remains that we just don't have the necessary resources to manage the massive amounts of data that we're collecting. And--Emerson's research notwithstanding--I haven't seen any results of particular value to Eclipse committers or the broader community. I'd love to see a community discussion about how we might go about making usage data collection more of a community effort. It'd be nice to see some of these third-party usage data collectors disappear in favour of a single community-oriented solution. I'm happy to participate in a discussion of this nature if there is enough interest. Perhaps this is a topic for a BoF at EclipseCon Europe? In the meantime, I'd like to start the process of removing the UDC from the Juno simultaneous release. From a technical POV, I believe that this is a simple matter of deleting the corresponding b3aggrcon from the Juno build directory in CVS. Correct? Before I do that, I will announce the step on the cross-project mailing list.
> In the meantime, I'd like to start the process of removing the UDC from the > Juno simultaneous release. From a technical POV, I believe that this is a > simple matter of deleting the corresponding b3aggrcon from the Juno build > directory in CVS. Correct? Well, it is best to use the aggregator editor to remove the "contribution" from the simrel.b3aggr file first. Ultimately, the simrel.b3aggr file needs to change at the same time the file is removed. Let me know if you want help (or want me to just do it). Also, don't forget that there are some "features" listed somewhere (epp common?) that have to be updated (removed).
(In reply to comment #9) > I have to admit that I am very concerned with the apparent growth in usage data > collectors (The Recommenders project has a contribution working through the CQ > process, and there are several third-party plug-ins that have usage data > collection functionality). Are you concerned about the fact that (i) more and more usage data is collected or (ii) that there are so many individual collectors? If (i): Are you concerned about privacy or the amount of data? if (ii): Are you looking for a "common collector framework" that serves all collector needs?
(In reply to comment #11) > Are you concerned about the fact that (i) more and more usage data is collected > or (ii) that there are so many individual collectors? > > If (i): Are you concerned about privacy or the amount of data? > if (ii): Are you looking for a "common collector framework" that serves all > collector needs? Privacy is most definitely a concern for any data collected by an Eclipse project. Primarily, though, I am concerned about the potential user-experience impact that multiple disconnected usage data collectors might have.
(In reply to comment #12) > Primarily, though, I am concerned about the potential user-experience impact > that multiple disconnected usage data collectors might have. Basically, you are looking for (i) a simple framework that offers a common user interface and configuration - and (ii) want to get rid of the insane amount of data that gets submitted to eclipse.org? To (i): I think there is a wealth of collectors we (plug-in providers in general) would like to have, and there might be families of problems interesting to many. For instance, one of my problems is the lack of feedback in the case of errors. If a user experiences an exception, I would like to know that there was one, and where. I'm thinking of an "error collector" similar in style to how Windows, Mac, Firefox etc. deal with exceptions. I guess this would be interesting for other plug-in providers too. And BTW: There is quite a lot more you can do with such a client. See http://code-recommenders.blogspot.com/2011/05/oh-stacktrace-my-stacktrace.html for a post on mining stacktraces. There are probably many other areas of interest. I can imagine extending UDC into a "more pluggable UDC" that solves common usage data collection problems and provides a simple infrastructure to store and transmit usage data for common problems. That's what we (Code Recommenders) have to do anyway. To (ii): It would be quite simple to let the remote server inform the client whether it is interested in collecting data. If the server doesn't want to get this kind of data, no data is submitted, or the local data collection may be disabled. So, the overall question to me is: How many plug-in providers actually use the Eclipse UDC or have their own version running? Depending on this, IMHO, a BoF makes sense - or not. We would be interested in continuing the work on the UDC, but relaxing the data collection routines so that the collected data can flow to different resources.
UDC is known, widespread, and accepted by Eclipse users. It would be a pity to lose such a huge community willing to give back to Eclipse and its plug-in providers.
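Point (ii) of the comment above, letting the server tell the client whether it wants a given kind of data, could look roughly like the Java sketch below. It is only an illustration under stated assumptions: the CollectionServer interface, the FixedPolicyServer stand-in, and the NegotiatingClient class are all hypothetical names, and no such negotiation exists in the UDC protocol as discussed in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical server contract: the client asks before it uploads. */
interface CollectionServer {
    boolean wantsData(String kind);
    void upload(String kind, List<String> events);
}

/** Stand-in server with a fixed answer, used only to exercise the client. */
final class FixedPolicyServer implements CollectionServer {
    private final boolean interested;
    private int received;

    FixedPolicyServer(boolean interested) {
        this.interested = interested;
    }

    public boolean wantsData(String kind) {
        return interested;
    }

    public void upload(String kind, List<String> events) {
        received += events.size();
    }

    int receivedCount() {
        return received;
    }
}

/** Client that stops collecting once the server declines the data. */
final class NegotiatingClient {
    private final CollectionServer server;
    private final List<String> pending = new ArrayList<>();
    private boolean collecting = true;

    NegotiatingClient(CollectionServer server) {
        this.server = server;
    }

    /** Records an event locally, unless collection has been disabled. */
    void record(String event) {
        if (collecting) {
            pending.add(event);
        }
    }

    /** Uploads pending events if the server wants them; returns true on upload. */
    boolean flush(String kind) {
        if (!server.wantsData(kind)) {
            // Server is not interested: submit nothing, drop the local cache,
            // and disable further local collection.
            pending.clear();
            collecting = false;
            return false;
        }
        server.upload(kind, new ArrayList<>(pending));
        pending.clear();
        return true;
    }

    int pendingCount() {
        return pending.size();
    }

    boolean isCollecting() {
        return collecting;
    }
}
```

In this sketch a declined `flush` both discards the pending events and turns off local collection; a real client might prefer to re-check the server's answer periodically rather than disabling collection for good.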
(In reply to comment #13) > http://code-recommenders.blogspot.com/2011/05/oh-stacktrace-my-stacktrace.html BTW: We would be ready to set up such a system for demo purposes for the 3.8 and 4.2 milestones in order to improve the feedback cycle in just a few weeks.
I have removed the epp-udc files and feature from the "juno.build" aggregation files for common repository, in preparation for Juno M2. Let me know if you've had a change of heart, and it should go back in.
(In reply to comment #15) > I have removed the epp-udc files and feature from the "juno.build" aggregation > files for common repository, in preparation for Juno M2. Let me know if you've > had a change of heart, and it should go back in. Can we still expect an official announcement of the UDC removal on the cross-project mailing list?
(In reply to comment #15) > I have removed the epp-udc files and feature from the "juno.build" aggregation > files for common repository, in preparation for Juno M2. Let me know if you've > had a change of heart, and it should go back in. Thanks, David. (In reply to comment #16) > Can we still expect an official announcement of the UDC removal on the cross-project > mailing list? Yes.
(In reply to comment #15) > I have removed the epp-udc files and feature from the "juno.build" aggregation > files for common repository, in preparation for Juno M2. Let me know if you've > had a change of heart, and it should go back in. It just occurred to me that this impacts all (most) of the packages. If you've removed the feature from the p2 repository, the package scripts that depend on the feature will break. Markus, is this something that I can help you sort out?
Created attachment 203716: patch to remove udc feature from all epp packages. In response to comment #18, it's my understanding that the udc feature appears explicitly only in the "common feature" used in all epp packages, so it _should_ be easy to remove from packages.
(In reply to comment #19) > It's my understanding that the udc feature appears explicitly only in the "common > feature" used in all epp packages, so it _should_ be easy to remove from packages. I just saw your patch... it contained the very same changes that I made this week in order to remove the UDC from *all* packages. The UDC is not available in any of the EPP packages beginning with Juno M2.
The UDC server is no longer retaining usage data. Clients are still being told to clean up their stored data. Unless somebody steps forward to take over development of the UDC component in EPP, I am going to plan on archiving the component following the Indigo SR-2 release.
Over the past few weeks, I've received a small number of emails asking about UDC data and use of the code in other contexts. I have pointed a few folks at this bug. If there is interest in continuing work on the UDC, we'll need some people to step up and take over the work. Since we've removed it from the Juno simultaneous release and have decided not to include the UDC in the standard packages, I feel that EPP is not the correct home. If there is interest in continuing this work (i.e. if two or more committers and a minimum of one Architecture Council mentor--I can be the second--step up), I'll make it my last act as developer of the UDC to promote the component into a proper project directly under Technology. There is a big outstanding issue of where and how to collect the actual data and how we manage privacy concerns. We'll need to address this as part of the project creation process. Any takers?
We, Code Recommenders, would continue/enhance the collector if there is more than a single interested party. Otherwise we'll go with a tailored collector that suits our needs.
To provide a little more detail on what we would like to do: There will be an Eclipse-wide event bus that dispatches events to all potentially interested parties (e.g., based on http://goo.gl/7rxf8). Any "monitor" can use this bus to send events to all interested parties. All "serializers" subscribed to the bus and to these particular events can serialize the event and send it to a database, a web service, or the local filesystem. The user can enable and disable monitors and serializers via a preference page. Plug-in providers can register new monitors and serializers using extension points. The Foundation would then only provide the framework but no active monitors. No data is collected and shared by default. The users are in charge of enabling/disabling the monitors/projects they'd like to support with their usage data. Regarding privacy. Which data is sent is determined by the "serializer" and its anonymization policy. The Foundation is not involved in this. Monitors we have in mind are: completion proposal monitors, stacktrace monitors, compilation unit shares, certain click feedback, and the like. The benefit of such a system, in my opinion, is that we can easily register monitors to gather the usage data required for a specific need, and companies could create usage statistics or automated error reports with almost no effort. That's a rough sketch of what we would like to do. No clue if this is in accordance with the Foundation's policies - that's why I'm posting it here. If these ideas are OK for the Foundation and if a mentor raises a hand to support us, Johannes, Sebastian, and I would continue the project. We strongly believe that dropping the usage data collector would be a bad decision - for us and Eclipse.
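The monitor/serializer architecture outlined in the comment above might be sketched in plain Java as follows. This is a minimal illustration, not the proposed implementation: UsageEvent, Serializer, UsageBus, and InMemorySerializer are hypothetical names, registration is done by direct method calls rather than Eclipse extension points, the enable/disable sets stand in for the preference page, and recording only the payload length is a placeholder for a real anonymization policy.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** An event published by a monitor (hypothetical type). */
final class UsageEvent {
    final String monitorId;
    final String payload;

    UsageEvent(String monitorId, String payload) {
        this.monitorId = monitorId;
        this.payload = payload;
    }
}

/** Decides what, if anything, is written out for an event (hypothetical type). */
interface Serializer {
    String id();
    void write(UsageEvent event);
}

/** Bus that delivers events only from enabled monitors to enabled serializers. */
final class UsageBus {
    private final List<Serializer> serializers = new ArrayList<>();
    // Both sets start empty, so nothing is collected or shared by default.
    private final Set<String> enabledMonitors = new HashSet<>();
    private final Set<String> enabledSerializers = new HashSet<>();

    void register(Serializer serializer) {
        serializers.add(serializer);
    }

    void enableMonitor(String monitorId) {
        enabledMonitors.add(monitorId);
    }

    void enableSerializer(String serializerId) {
        enabledSerializers.add(serializerId);
    }

    /** Publishes an event and returns how many serializers received it. */
    int publish(UsageEvent event) {
        if (!enabledMonitors.contains(event.monitorId)) {
            return 0;
        }
        int delivered = 0;
        for (Serializer serializer : serializers) {
            if (enabledSerializers.contains(serializer.id())) {
                serializer.write(event);
                delivered++;
            }
        }
        return delivered;
    }
}

/** Example serializer: keeps the monitor id and payload length, not the payload. */
final class InMemorySerializer implements Serializer {
    final List<String> lines = new ArrayList<>();

    public String id() {
        return "memory";
    }

    public void write(UsageEvent event) {
        lines.add(event.monitorId + ":" + event.payload.length());
    }
}
```

Keeping the opt-in state in the bus rather than in each monitor means a single place decides whether anything is delivered, which matches the stated goal that no data is collected or shared unless the user enables it.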
It might make sense to use OSGi's EventAdmin for publishing events over the bus, especially as it already handles a lot of these kinds of things. It also opens up the possibility of using it on other OSGi platforms like Felix.
(In reply to comment #24) > Regarding privacy. > Which data is sent is determined by the "serializer" and its anonymization > policy. The Foundation is not involved in this. Do you envision that the new and improved UDC will be part of packages shipped from eclipse.org? Where does the data actually go? i.e. where does the server live? Who maintains it? Who is responsible for ensuring that private information is not leaked while still making the data available to the community in some form (e.g. the filtered form that we currently make available)? We have a privacy policy that must be followed. (In reply to comment #24) > We strongly > believe that dropping the usage data collector would be a bad decision - for > us and Eclipse. Belief is one thing; reality is something altogether different. As I've stated, the simple fact of the matter is that nobody has actually been able to glean anything more useful than "that's interesting" from the data.
(In reply to comment #26) > (In reply to comment #24) > > Regarding privacy. > > Which data is sent is determined by the "serializer" and its anonymization > > policy. The Foundation is not involved in this. > > Do you envision that the new and improved UDC will be part of packages shipped > from eclipse.org? The Foundation just removed it from all packages. From my POV, it would make sense only if there is a monitor that is actually useful for the Foundation or for Eclipse projects like Platform or JDT. The collector could be used as a stacktrace/error reporting agent, for instance. If not, projects that have a need may deliver this module as part of their features. > Where does the data actually go? > i.e. where does the server live? > Who maintains it? > Who is responsible for ensuring that private information is not leaked while > still making the data available to the community in some form (e.g. the > filtered form that we currently make available)? Is the Foundation willing to continue the project? Would the Foundation allow data to be stored on servers outside the Foundation? My understanding of the usage data collector is that anyone can provide a collector and people *can* use it. You may have seen the request for supporting CodingSpectator (http://blog.deepakazad.com/2011/12/codingspectator-research-study-on.html). With such a framework it would also be easy to enable researchers to conduct studies and to collect the data they need for their studies. And finally, it's up to the users to decide whether they want to support this project. The same should be true for UDC. BTW: There is no need to store data until the end of days. Detailed logs can be erased after 3 months, as my mobile provider does. > We have a privacy policy that must be followed. Sure.
> Belief is one thing; reality is something altogether different. As I've stated, > the simple fact of the matter is that nobody has actually been able to glean > anything more useful than "that's interesting" from the data. To be entirely honest (and I don't want to sound offensive): The data that has been collected is actually not interesting for researchers that want to improve IDEs or software development. I've no idea based on which inputs the Foundation decided to collect this data - for years. What kind of benefits did you expect from that data? CodingSpectator, for instance, is interested in real coding activities and refactorings, which requires completely different information. In addition, needs will change fast and depend on the project/research question to answer. There is not one dataset that provides data for research topics over years; there are many, for different short-lived questions. In 2006 IBM invested quite some money to bring Eclipse to the research community. This worked out well in those days. In the meantime, we have started looking for other platforms that we can get data from. I know from my own research work that it is pretty hard to get reasonable amounts of usage data to support your research claims. And we need more of it to make the tools we work on at Code Recommenders better and better. FWIW, we are growing and getting more and more universities involved (two at the moment, and two more universities are on the way). Real usage data and community are the biggest motivations for this. No research without data. Also, one of the reasons why Code Recommenders is here is that Eclipse offers a wealth of information, feedback - and usage data. A highly customizable usage data collector is a *very* attractive tool in Eclipse's belt. Having said this, I have the feeling that you have already decided to shut down the project. If so, I'm fine. Then I don't need to find arguments for keeping it alive.
If you are looking for a team that continues the work and leverages the name and its distribution to do something useful for Eclipse, its projects, and research: here is an offer. We have plans, and I think we have proven over the last year that we are committed to delivering valuable things to Eclipse. We are growing in the Eclipse community and we are starting to contribute back to JDT. And finally, how much can go wrong here (given that we take care of privacy)? :) Anyway, if the directors see no advantages for the Foundation: close it. If you see the chance that it may pay off in a year or two: give it a try. A virtual server with a CouchDB on it would be enough to get our vision started. If you allow, we could also host this on our university servers (at least for the data we are interested in). Then this would be at no cost to the Foundation. Thanks for listening, sorry for the long reply, and my best wishes, Marcel
The Eclipse Foundation has switched off the Usage Data Collector server side. The client part of the UDC won't be included in any of the Juno EPP packages and is not included in the Juno repository. That's why I think it is time to close this bug. The client side part of the UDC is still available from a separate Git repository - see http://git.eclipse.org/c/epp/org.eclipse.epp.usagedata.git/ for web access. Thanks!