| Summary: | Create a "Calling home" policy | ||
|---|---|---|---|
| Product: | Community | Reporter: | Wayne Beaton <wayne.beaton> |
| Component: | Architecture Council | Assignee: | eclipse.org-architecture-council |
| Status: | RESOLVED WONTFIX | QA Contact: | |
| Severity: | normal | ||
| Priority: | P3 | CC: | achmetow84, bugs.eclipse.org, contact, david_williams, denis.roy, gunnar, hendrik, ian.skerrett, igor, irbull, janet.campbell, john.arthorne, ken_walker, manderse, marc.khouzam, marcel.bruch, mik.kersten, mike.milinkovich, mjmeijer, nepomuk.seiler, robert.munteanu, sbouchet, sewe, steffen.pingel, stepper, wim.jongman, yasser.aziza |
| Version: | unspecified | ||
| Target Milestone: | --- | ||
| Hardware: | PC | ||
| OS: | Linux | ||
| Whiteboard: | stalebug | ||
| Bug Depends on: | |||
| Bug Blocks: | 416371 | ||
Description
Wayne Beaton
(In reply to comment #0)
> Any call home service would have to be opt-in.

+1 It should be easy to enable/disable which information is shared with the community.

> The user needs to be able to review the data before it is sent.

Some kind of review capability in the UI would be nice. However, I tend to believe that not all data can be reviewed in that way. I think there should be some review UI and a dedicated folder for log files where people can browse the data that was collected before submitting.

> Raw data, which may include non-obvious potentially private information,
> needs to be transferred securely.

What does that mean? Communicating via https with some server?

> The target for data collected by content distributed from eclipse.org must
> be an eclipse.org server (e.g. the Eclipse packages must be configured to
> send data to an eclipse.org server). This can be configurable by adopters to
> send to an alternate server.

+1 for enabling others to provide extensions. This could be very interesting for research and tool providers.

> Raw data needs to be stored securely and access to the data needs to be
> strictly controlled.

Who will be able to access the data? Given the AOL example (where even carefully anonymized data may contain enough information to map this back to users), making the raw data publicly available would be a potential risk - at least some people may argue so. Who decides who gets access to the raw data? Some PMC?

> Cleaned and processed data needs to be publicly accessible.

Given the experience from the last UDC, I don't think a predefined data set will be very useful. Instead every collector should clearly specify which data it collects and how much data is currently available. Researchers and interested parties can then come up with their requests.
Of course, there may be some general statistics that everyone may be interested in and thus could be published (e.g., number of Eclipse installs, which plug-ins are installed, popular preferences...)

> Any project implementing a call-home server must get approval from EMO(ED).

I wonder whether a project may use data that may contain privacy relevant information for its internal use. Can Eclipse projects use such data - or will they also only get a public view on the data?

> I'm also thinking that having multiple separate projects each with their own
> call-home mechanism that prompts the user for opt-in will make us look
> goofy. Some coordination of call-home services should be considered.

+1. Collectors may be enabled and disabled on a single preference page.

>
> Thoughts?
>
> [1] http://dev.eclipse.org/mhonarc/lists/recommenders-dev/msg01157.html

(In reply to comment #0)
> Any project implementing a call-home server must get approval from EMO(ED).

How many interested projects are we talking about here anyway? To me it seems code recommenders is currently the only one.

(In reply to comment #2)
> (In reply to comment #0)
> > Any project implementing a call-home server must get approval from EMO(ED).
>
> How many interested projects are we talking about here anyway? To me it
> seems code recommenders is currently the only one.

Several non-eclipse.org projects want to call home as well: subclipse for example and Collabnet. Providing a single framework would be nice with extension points to define what goes where. For the user there needs to be a single settings page for this. Think iphone access to Location or Contacts type.

(In reply to comment #3)
> For the user there needs to be a single settings page for this. Think iphone
> access to Location or Contacts type.

Yes, that would be great. Nice idea!

(In reply to comment #3)
> > How many interested projects are we talking about here anyway? To me it
> > seems code recommenders is currently the only one.
+ JBoss Tools

They do report usage statistics like screen resolution, version of java, plug-in versions used and more. I guess you know about the ongoing discussion at cross-project-issues-dev and people there requesting some kind of survey or usage data.

(In reply to comment #5)
> I guess you know about the ongoing discussion at cross-project-issues-dev
> and people there requesting some kind of survey or usage data.

Yes, though to me it seems as if the discussion is going in a different direction. Anyway, how is JBoss Tools or any other non EF project going to be interested in an EF specific calling home policy?

(In reply to comment #6)
> Anyway, how is JBoss Tools or any other non EF project going to
> be interested in an EF specific calling home policy?

I see. Non-EF projects likely will be interested in:

1. which data gets collected by default.
2. how to extend and reuse the collector for their own purpose and collecting of data on own servers.

Which other EF projects are interested: I can't say. However, what difference does it make how many projects there are?

(In reply to comment #7)
> However, what difference does it make how many projects there are?

I would assume it gets higher priority at the Architecture Council if sufficient projects are asking for it.

+1 Sigasi is also interested. Hendrik.

(In reply to comment #3)
> Several non-eclipse.org projects want to call home as well: subclipse for
> example and Collabnet. Providing a single framework would be nice with
> extension points to define what goes where. For the user there needs to be a
> single settings page for this. Think iphone access to Location or Contacts
> type.

The policy would not apply to non-eclipse.org projects/plug-ins/products. I understand, however, that non-eclipse.org projects would want to leverage corresponding shared "call home" infrastructure, but that is not within the intended scope of this discussion.

Wayne, any answers to my questions in comment 1?
(In reply to comment #1)
> > Raw data, which may include non-obvious potentially private information,
> > needs to be transferred securely.
>
> What does that mean? Communicating via https with some server?

The transfer protocol is certainly part of it. Intuitively, I think that HTTPS communication is enough.

> > Raw data needs to be stored securely and access to the data needs to be
> > strictly controlled.
>
> Who will be able to access the data?

This needs to be answered (I don't know what the answer is). Perhaps we might require that those with access to the raw data sign some kind of NDA with the Foundation.

> Who decides who gets access to the raw data? Some PMC?

Good question. I welcome your input.

> > Any project implementing a call-home server must get approval from EMO(ED).
>
> I wonder whether a project may use data that may contain privacy relevant
> information for its internal use. Can Eclipse projects use such data - or
> will they also only get a public view on the data?

Intuitively, I believe that the general case is the only thing we care about. Anybody who has access to any of the data has access to all of the data.

(In reply to comment #12)
> > > Raw data needs to be stored securely and access to the data needs to be
> > > strictly controlled.
> >
> > Who will be able to access the data?
>
> This needs to be answered (I don't know what the answer is). Perhaps we
> might require that those with access to the raw data sign some kind of NDA
> with the Foundation.

An NDA sounds reasonable to me when raw-data access is needed. We may consider generating public views (on demand) on the raw data which can be made publicly available w/o signing an NDA.

> > Who decides who gets access to the raw data? Some PMC?
>
> Good question. I welcome your input.

I've no real suggestion. EMO(ED) approves which data is collected. So it should also decide who gets access? How many requests did you receive for the old UDC data?
I guess not that many from external.

How do we continue? Who has to give permission whether or not the development of a second UDC can be started?

(In reply to comment #13)
> I've no real suggestion. EMO(ED) approves which data is collected. So it
> should also decide who gets access? How many requests did you receive for
> the old UDC data? I guess not that many from external.

Not too many. Maybe a dozen or so.

> How do we continue? Who has to give permission whether or not the
> development of a second UDC can be started?

The next step is to collect our discussion into a policy and get approval from EMO(ED). The policy itself will dictate the steps that follow.

I've started a wiki document here:

http://wiki.eclipse.org/Development_Resources/Call_Home_Policy

It's currently just a copy and paste of comment #0 with some minor edits.

I'd love to make this happen, so, let's make a test run on the current calling home policy status.

To assess the quality of Code Recommenders' code completion proposals, I'd like to ask users to anonymously share their code completion statistics for org.eclipse.*, org.apache.*, and java[fx|x].* completions. In particular, we are interested in whether the user picked a proposal from our engine and at which position the selected proposal was in the list. So we would collect a minimum of information about the environment in which code completion was triggered:

1. the receiver type (like "String" or "Object") completion was triggered on,
2. the completion proposal that was selected (e.g., "toString()"),
3. the index where this proposal showed up in the list of all proposals, and
4. the completion prefix the user entered (e.g., "to" to filter for "toString").

What would be the next steps to get permission to collect this information?

ping?

(In reply to Marcel Bruch from comment #15)
> I'd love to make this happen, so, let's make a test run on the current
> calling home policy status.

What is--in your opinion--the status of the policy?
> What would be the next steps to get permission to collect this information?

Approval from the EMO(ED). We need to bring him more than a random collection of my thoughts on a wiki page.

We'll also need approval from the Webmaster, assuming that he'll be responsible for keeping any machine that we use to collect data running.

(In reply to Wayne Beaton from comment #17)
> (In reply to Marcel Bruch from comment #15)
> > I'd love to make this happen, so, let's make a test run on the current
> > calling home policy status.
>
> What is--in your opinion--the status of the policy?

The implementation details are missing but the general guidelines are clear to me (at least I think so). I outlined below how I'd continue - if EMO(ED) is in principle not against collecting this kind of usage statistics.

> > What would be the next steps to get permission to collect this information?
>
> Approval from the EMO(ED). We need to bring him more than a random
> collection of my thoughts on a wiki page.

I've to admit that, in all modesty, I've no clue what EMO(ED) needs. Some wiki page that states what gets collected, when, when uploaded? A general architecture/concept of the statistics framework design upfront? If possible, I'd like to start with something simple (as incubator) first to have a basis on which we can discuss what else is needed. But maybe I'm just running in the wrong direction?

> We'll also need approval from the Webmaster, assuming that he'll be
> responsible for keeping any machine that we use to collect data running.

I'd be happy if we could run a simple OSGi-based server on somewhere.eclipse.org that dumps data to daily rotating log files. I'd also be fine with removing logs older than 30 days to keep the amount of data stored on disk low.

Regarding access to the data, I'd propose to create an "orbit-style" group of committers that get access to the data (one per project).
All members of this list need to sign some agreement with the EF to use the data according to the terms of use as determined by the EF.

Regarding cleaning up the data to make it publicly available: I'd not push forward on this at the moment.

Next steps to me would be:

(i) a brief statement of EMO(ED) at this point, saying "Go create a first prototype to evaluate this in an incubator" w/o any legal implications
(ii) webmaster allows us to use recommenders.eclipse.org to set up the service there
(iii) recommenders will create the first UI and sharing service for the IDE
(iv) followed by a technical review by you (Wayne) / EF
(v) another statement by EMO(ED) whether this sharing code is in accordance with the Eclipse bylaws (++)
(vi) a rollout of the stats module by every interested party at eclipse.org that wants to gather some usage statistics to gather community feedback
(vii) continuously evolving the set of data collected based on feedback gathered by the community and other projects.

It may be that I completely miss the points important to you (I feel so somehow :). Let me know when you need something else.

@Wayne, since you marked m2e bug 416371 as depending on this one, can you please explain how current p2 download stats is different from "call home" functionality discussed here and what part of call home behaviour trigger requirement to comply with the proposed new policy?

My main concerns are that:

* We don't violate the Bylaws;
* We honour the privacy policy; and
* We don't look goofy.

The first two need to be addressed before any PoC can be implemented (implementation is not generally something addressed in a policy).

I'll take a pass over the draft policy to see if I can get it into better shape (and ensure that it addresses my concerns).

Frankly, the "goofy" part worries me. I know that m2e has designs on collecting usage data.
The horror scenario is that we have a dozen Eclipse projects all asking the user "hey, do you mind if I upload some data to an Eclipse Foundation server?" in different ways at different times.

(In reply to Igor Fedorenko from comment #19)
> @Wayne, since you marked m2e bug 416371 as depending on this one, can you
> please explain how current p2 download stats is different from "call home"
> functionality discussed here and what part of call home behaviour trigger
> requirement to comply with the proposed new policy?

For starters, we carefully control who has access to things like IP addresses. In my mind, the policy already exists in an informal sense and I'm quite sure that the manner in which we gather download statistics conforms.

When we gather usage data, there is a real possibility that we may inadvertently collect enough information to make reasonable guesses at the identity of some users and the sorts of activities that they are undertaking. Leaking that sort of information would be very bad, IMHO. AFAIK, there is some real risk that folks can be arrested if our privacy policy is violated.

With the usage data collector, for example, I filtered out things like views and editors that had the company name in the bundle id. We don't have to worry about this sort of thing when we're gathering download stats.

Frankly, I think we're close to having an actual policy here that can work. Unfortunately, I'm not getting the help that I asked for in drafting it, so it's had to wait until I've cleared other higher-priority things from my list.

(In reply to Wayne Beaton from comment #20)
> My main concerns are that:
>
> * We don't violate the Bylaws;
> * We honour the privacy policy; and
> * We don't look goofy.
>
> The first two need to be addressed before any PoC can be implemented
> (implementation is not generally something addressed in a policy).
>
> I'll take a pass over the draft policy to see if I can get it into better
> shape (and ensure that it addresses my concerns).
>
> Frankly, the "goofy" part worries me. I know that m2e has designs on
> collecting usage data. The horror scenario is that we have a dozen Eclipse
> projects all asking the user "hey, do you mind if I upload some data to an
> Eclipse Foundation server?" in different ways at different times.

Just to clarify, m2e *used* to have usage collection code, but that code was removed when the Foundation retired UDC (or whatever that thing was called). I am *NOT* suggesting to reintroduce that functionality.

Right now I just need to count the number of active m2e installations, including the versions of eclipse and java they run on. I believe this is the same underlying use case as p2 download stats and I don't see why it should be treated any differently.

(In reply to Igor Fedorenko from comment #22)
> Right now I just need to count the number of active m2e installations, including
> the versions of eclipse and java they run on. I believe this is the same
> underlying use case as p2 download stats and I don't see why it should be
> treated any differently.

Why not propose that the current p2 download stats be extended to support this? That way it benefits all the projects, not just yours.

(In reply to Denis Roy from comment #23)
> (In reply to Igor Fedorenko from comment #22)
> > Right now I just need to count the number of active m2e installations, including
> > the versions of eclipse and java they run on. I believe this is the same
> > underlying use case as p2 download stats and I don't see why it should be
> > treated any differently.
>
> Why not propose that the current p2 download stats be extended to support
> this? That way it benefits all the projects, not just yours.

Propose to who and where? This is an honest question. I sent emails to the m2e mailing list and to Wayne and webmaster explaining what I am doing and asking for feedback and advice.
The answer I got from Wayne was "no, you can't do it until this bug is resolved". So I am trying to understand why p2 stats didn't need a policy and what my next steps should be.

And to reiterate this once again, I don't want to run an m2e-specific active installation count service. I would very much prefer to use a service provided and maintained by the Eclipse Foundation. I'd be more than happy to use p2 download stats, if that's the recommended way of doing this sort of thing.

> So I am trying to understand why p2
> stats didn't need a policy and what my next steps should be.

The p2 download stats mechanism was p2 functionality that was implemented by both the p2 team and the webmasters as a result of wanting better stats for all projects. See bug 302160

The server that receives those stats calls is not accessible by anyone but webmaster. We provide aggregate information to our committers via the download stats page. IP addresses are not part of this aggregation, but country information is.

https://dev.eclipse.org/committers/committertools/stats.php

> And to reiterate this once again, I don't want to run an m2e-specific active
> installation count service.
> I would very much prefer to use a service
> provided and maintained by the Eclipse Foundation. I'd be more than happy to use
> p2 download stats, if that's the recommended way of doing this sort of
> thing.

You can start here, today: http://wiki.eclipse.org/Equinox_p2_download_stats

You won't capture Java and OSGi versions. If this information is desirable, I'd file an enhancement request.

> Propose to who and where?

The p2 project would be the recipient of such a feature request, via this Bugzilla application.

https://bugs.eclipse.org/bugs/enter_bug.cgi?product=Equinox&component=p2&bug_severity=enhancement

Opened bug 416456 for p2 enhancements.

(In reply to Wayne Beaton from comment #21)
> For starters, we carefully control who has access to things like IP
> addresses.

What use case do we need IP addresses for?
If there is no important use case, I'd say to generally not collect that information.

Btw. what about encryption? Are we going to transfer the information over (end-to-end) encrypted channels? Is the data going to be stored in plain text at the EF?...

(In reply to Markus Kuppe from comment #27)
> (In reply to Wayne Beaton from comment #21)
>
> > For starters, we carefully control who has access to things like IP
> > addresses.
>
> What use case do we need IP addresses for? If there is no important use
> case, I'd say to generally not collect that information.

I'm sure that somebody can come up with some use--or potential "future consideration"--for collecting IP Addresses. This is why I want it baked into the policy.

> Btw. what about encryption? Are we going to transfer the information over
> (end-to-end) encrypted channels? Is the data going to be stored in plain
> text at the EF?...

I've captured this in the evolving policy document:

http://wiki.eclipse.org/Development_Resources/Call_Home_Policy#Private_Information

(In reply to Wayne Beaton from comment #28)
> > What use case do we need IP addresses for? If there is no important use
> > case, I'd say to generally not collect that information.
>
> I'm sure that somebody can come up with some use--or potential "future
> consideration"--for collecting IP Addresses. This is why I want it baked
> into the policy.

I guess this is the opposite approach to privacy by design.

Anyway, even if IPs should be collected, they should be properly anonymized by removing the last two octets. I really question there is a legit use case that requires to identify a host.

(In reply to Markus Kuppe from comment #29)
> Anyway, even if IPs should be collected, they should be properly anonymized

The issue may not be limited to "collecting" (ie, storing) private information, but that a committer may access it in transit if they are part of the team managing the collection resource (script and/or server).
This is the same reason we don't allow projects to host services that would require users to supply eclipse.org auth credentials.

(In reply to Denis Roy from comment #30)
> (In reply to Markus Kuppe from comment #29)
> > Anyway, even if IPs should be collected, they should be properly anonymized
>
> The issue may not be limited to "collecting" (ie, storing) private
> information, but that a committer may access it in transit if they are part
> of the team managing the collection resource (script and/or server).
>
> This is the same reason we don't allow projects to host services that would
> require users to supply eclipse.org auth credentials.

True, so the data collector running at the user end should only ever send anonymized IPs.

(In reply to Markus Kuppe from comment #29)
> Anyway, even if IPs should be collected, they should be properly anonymized
> by removing the last two octets. I really question there is a legit use case
> that requires to identify a host.

I agree. I think that my responses on this topic have been a little softer than I intended. I had meant to suggest that while this information is available to us via the call itself, we must never use this information to identify individuals or organizations. The only thing that an IP address can be used for is to determine the country of origin.

No call-home facility should ever persist IP addresses or attempt to identify a host.

(In reply to Wayne Beaton from comment #32)
> No call-home facility should ever persist IP addresses or attempt to
> identify a host.

Is having a UUID (completely randomly generated but that does not change over several submissions) acceptable for EF? Of course a user should know if UUIDs are used and be able to switch to a "null" UUID instead.

(In reply to Marcel Bruch from comment #33)
> Is having a UUID (completely randomly generated but that does not change
> over several submissions) acceptable for EF?
> Of course a user should know if
> UUIDs are used and be able to switch to a "null" UUID instead.

Yes. This is what we ended up doing with the UDC.

A UUID cannot be used by itself to identify a particular individual or organization. It does make it possible for an individual to identify their own data, which I think has some interesting potential.

(In reply to Marcel Bruch from comment #33)
> Is having a UUID (completely randomly generated but that does not change
> over several submissions) acceptable for EF? Of course a user should know if
> UUIDs are used and be able to switch to a "null" UUID instead.

Marcel, there was a huge discussion going on when people discovered that Chrome had such UUIDs (fingerprints) by default and used them for "aligning" auto suggestions to browser instances for tracking.

I think you should plan with an "opt-in" model, i.e. start anonymous and allow a user to opt-in to additional functionality. For example, a user could opt-in to register and associate his submitted data with his user account. That's totally fine because it's "opt-in". You could then use that for competitions with prizes, etc.

However, the default for any Eclipse project should be covered by this policy which should be (IMHO) aiming for anonymization of all data transferred.

Some data, which allows identification of people, is delivered by the protocol automatically. It will be the responsibility of the receivers (server owners) to ensure that such data is not collected in a way which allows any identification. Frankly, I think it should be the responsibility of the Eclipse project to verify that the receiver complies with this or deny transmission of data to the receiver.

(In reply to Gunnar Wagenknecht from comment #35)
> I think you should plan with an "opt-in" model, i.e. start anonymous and
> allow a user to opt-in to additional functionality. For example, a user
> could opt-in to register and associate his submitted data with his user
> account.
> That's totally fine because it's "opt-in". You could then use that
> for competitions with prizes, etc.

That sounds as if users opt-in to de-anonymize data (fine with me), but just to re-iterate: A user first has to opt-in to send any (anonymized) data at all?!

> Some data, which allows identification of people, is delivered by the
> protocol automatically.

Which is? IMO if the protocol leaks bits of data that allow us to identify the user, we should improve or switch the protocol.

(In reply to Markus Kuppe from comment #36)
> (In reply to Gunnar Wagenknecht from comment #35)
> > [...]
> That sounds as if users opt-in to de-anonymize data (fine with me), but just
> to re-iterate: A user first has to opt-in to send any (anonymized) data at
> all?!

To sum up (what I think is) general consensus: All data require an initial opt-in to submit data at all:

1. send data at all? (yes/no)
2. send data with UUID? (yes/no)
3. send data with registered open-id/email? (specify)

I don't want to get too much side tracked but: I added #3 because it would be required for one of our incubator projects. As Wayne implied: people might be interested to see which data they shared. With an open-id approach people could get access to their data even if they change their machines.

IPs will technically always be present in the webserver logs but usually don't have to be stored along with the data. If the IP will be collected and stored along with the shared data, it's the responsibility of the collecting service to prune/anonymize it and/or to make sure it does not leak in any way to the outside world.

(In reply to Marcel Bruch from comment #37)
> IPs will technically always be present in the webserver logs but usually
> don't have to be stored along with the data.

This statement is incorrect. A web server does not necessarily have to log IPs (for Apache there exists mod_removeip which overrides REMOTE_ADDR var [1]).
If that does not provide sufficient privacy, the (TCP) connection can be tunneled through an anonymizer like Tor [2].

[1] https://labs.riseup.net/code/projects/privacy
[2] https://www.torproject.org/

What will be the default for the opt-in checkbox/button? I would assume it is off (do not send data) by default and that a user only ever gets asked once.

The "opt-in" should apply to everything that reduces the anonymity. I'm totally fine with sending anonymous usage statistics by default. Every web site does it. As far as the law in Germany is concerned, this is fine as long as no user related information is collected (such as the IP address, the UUID, etc.) and the user is informed about this together with information on what is collected (anonymously) and why. The "About" dialog seems like the right place for this but I'm sure there are other options.

Technically, I suggest having a preference (check-box) to enable sending of anonymous data. This should be off/disabled by default in code. The Eclipse packages produced by EPP can deliver a customization that sets it to on by default. The user can then turn it off later. But any corporate package just including the infrastructure won't send data by default.

(In reply to Gunnar Wagenknecht from comment #40)
> Technically, I suggest having a preference (check-box) to enable sending of
> anonymous data. This should be off/disabled by default in code. The Eclipse
> packages produced by EPP can deliver a customization that sets it to on by
> default. The user can then turn it off later. But any corporate package just
> including the infrastructure won't send data by default.

-1 for enabling it in EPP (even for anonymous data). The group of consumers of the user data is just too small compared to the size of the projects being part of EPP.

Every now and then someone raises the issue that 'Check for update' should happen automatically (once a month or something). We don't have this set, although it's pretty easy to do.
If the call-home policy eventually states that all communication must be opt-in, then we won't be able to enable automatic check for updates. I'm not saying that's a bad thing, I just want to make everyone aware of that.

(In reply to Ian Bull from comment #42)
> Every now and then someone raises the issue that 'Check for update' should
> happen automatically (once a month or something). We don't have this set,
> although it's pretty easy to do.
>
> If the call-home policy eventually states that all communication must be
> opt-in, then we won't be able to enable automatic check for updates. I'm not
> saying that's a bad thing, I just want to make everyone aware of that.

The intent of the policy is to make sure that we're all on the same page. I have been working under an assumption that our user community expects/demands opt-in for any kind of call home service (even a simple check for updates). I'm quite willing to accept that this assumption is incorrect.

We can also work an exception process into the policy. I do tend to prefer a policy of "do the right thing".

I just wanted to mention standard analytics on web pages. No user information captured, just what kind of browser, etc. This could be a particular vendor which might not be an open source based solution.

Also, Orion being a server based web application where your content is stored remotely we have to call home. I'm assuming this is outside the scope of this discussion?

The "What's New" page on the Welcome view connects to eclipse.org to get news. I believe that this violates the "opt-in" requirement being discussed. Is this an exception, or a precedent?

(In reply to Wayne Beaton from comment #46)
> The "What's New" page on the Welcome view connects to eclipse.org to get
> news. I believe that this violates the "opt-in" requirement being discussed.
> Is this an exception, or a precedent?

I think we just need to define 'call home'.
Obviously reading an RSS feed is not calling home, but if we word the policy as 'Eclipse won't make any network requests without an opt-in', then it violates the policy. However, if we say that 'Eclipse won't track any user information' without an opt-in, then this is just a matter of ensuring that our web-servers are not logging things like IP addresses.

I believe we should have a general exemption or allowance for *fetching* data from eclipse.org or other servers for the sole purpose of providing the user with tools or content, and where there is no data gathered by the server other than the normal rotating logs of a typical HTTP server, and where nobody is given access to those logs. I think if we required an explicit PMC approval and user opt-in for these cases it would become too much of a burden. I can't think of any software that has ever asked for my permission before checking for updates, for example.

Here are some other examples of network communication and I'm sure others can add more:

- Welcome page fetching news feeds
- External links in help content navigating to external servers
- Java editor hovers fetching javadoc from remote servers

I think the policy should cover all cases where data is *sent*, and cases where data is fetched and the data is retained for other purposes (e.g., download stats).

(In reply to Ken Walker from comment #45)
> Also, Orion being a server based web application where your content is
> stored remotely we have to call home. I'm assuming this is outside the
> scope of this discussion?

I think it is in scope, but in this case there is clear opt-in by creating an orionhub account, and the privacy policy and terms of use are clearly laid out. So overall I think it will fit the policy (maybe the terms need to be a bit more in your face at account creation time). Orion does retain a lot of user data by its very nature, but I think we still need to be up front about what is retained, who has access to it, etc, as with any other tool.
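
To make some of the points raised earlier in this thread concrete (dropping the last two octets of an IP address before anything is stored, a random UUID that the user can swap for a "null" UUID, and the three opt-in questions), here is a minimal Python sketch. It is an illustration only and not part of the draft policy or of any Eclipse API: `anonymize_ipv4`, `Consent`, `build_payload` and `NULL_UUID` are hypothetical names chosen for this example, only IPv4 is handled, and the payload layout is an assumption.

```python
import ipaddress
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical "null" UUID used when the user declines a stable identifier.
NULL_UUID = str(uuid.UUID(int=0))


def anonymize_ipv4(address: str) -> str:
    """Zero the last two octets of an IPv4 address before it is stored."""
    ip = ipaddress.IPv4Address(address)
    network = ipaddress.ip_network(f"{ip}/16", strict=False)
    return str(network.network_address)


@dataclass
class Consent:
    """The three opt-in questions from the thread; everything defaults to off."""
    send_data: bool = False
    send_uuid: bool = False
    account: Optional[str] = None  # open-id/email, only if the user supplies one


def build_payload(consent: Consent, install_uuid: str, stats: dict) -> Optional[dict]:
    """Return the record to upload, or None when the user has not opted in."""
    if not consent.send_data:
        return None
    payload = {
        "stats": stats,
        "uuid": install_uuid if consent.send_uuid else NULL_UUID,
    }
    if consent.account is not None:
        payload["account"] = consent.account
    return payload
```

In this sketch the defaults are the conservative choice: nothing is sent unless `send_data` is set, and the stable identifier and account are each a separate, additional opt-in. A collecting service could apply `anonymize_ipv4` to the remote address before writing any log line, assuming it needs the address at all.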
We're going to need Board approval for this policy (there are legal and privacy implications). I'd like to get this on the agenda for their October meeting. To make this happen, I'll need real assistance. Citing examples is very useful, but I need to have actual words in the policy document itself. (In reply to Wayne Beaton from comment #50) > We're going to need Board approval for this policy (there are legal and > privacy implications). I'd like to get this on the agenda for their October > meeting. To make this happen, I'll need real assistance. Citing examples is > very useful, but I need to have actual words in the policy document itself. Apparently, we're on the agenda for the September meeting. This first step is a presentation to the board. The next step is a vote during the October meeting. Once we have approval from the board, we're good to go. What is needed for the presentation to the board? Slideware with examples, classification of different kinds of usage data and services? A written policy? A sketched implementation/ui? (In reply to Marcel Bruch from comment #52) > What is needed for the presentation to the board? Slideware with examples, > classification of different kinds of usage data and services? A written > policy? A sketched implementation/ui? I can build slideware from the policy document itself and I think that we have enough examples on this thread with Recommenders and m2e (though it might be helpful to have a few short bullets specifically describing what Recommenders needs to do). Minimally, we need to have the policy document in a form that we are all (at least mostly) happy with. I figure that we have three different types of "call home" and may be able to treat each of them differently. 1) Fetch data only. No data is sent from the user's workstation (IP address notwithstanding) Examples: News on the welcome screen, HTTP pointers in help. This just works, no user configuration provided or required. 
2) Simple/no personal information sent Examples: heartbeat from an Eclipse plug-in. May provide very minimal configuration information, e.g. Bundle Id and version. This is just turned on "out of the box"; must be thoroughly documented and user must have a means of turning it off. Attempting to collect environmental information (e.g. JVM) may push us into the third category. 3) Complex/potential that personal information may be sent. Examples: code usage patterns for Code Recommenders. The names of packages and classes might, for example, include company names or otherwise identifiable information. Turned off by default. User has ability to "opt-in". Must be thoroughly documented. Thoughts? (In reply to Wayne Beaton from comment #54) > I figure that we have three different types of "call home" and may be able > to treat each of them differently. > > 1) Fetch data only. No data is sent from the user's workstation (IP address > notwithstanding) > > Examples: News on the welcome screen, HTTP pointers in help. > > This just works, no user configuration provided or required. > What about user-agent string? For example m2e is encoding version info as part of user-agent string when downloading artifacts from remote maven repositories (so does maven). Is this still considered "fetch data only"? (In reply to Igor Fedorenko from comment #55) > What about user-agent string? For example m2e is encoding version info as > part of user-agent string when downloading artifacts from remote maven > repositories (so does maven). Is this still considered "fetch data only"? I think that this lands pretty firmly in the second category. Can we reasonably assert that the user implicitly agrees to this sort of communication as a natural part of using the software? i.e. is there any chance that the user might be surprised that this communication is happening? Is it a requirement or convention that the user-agent be set this way? I think your current draft captures the main points. 
Some minor points:

- I find the term "call home" unclear. I wonder if it should instead be called the "Data Collection Policy".
- It says "Aggregate data needs to be publicly accessible". It is unclear to me whether this means that there must be aggregate data, or simply that if there is aggregate data, then it must be publicly accessible.
- Several references to "needs to" should be changed to "must" for clarity.
- You are again using the term "EMO(ED)" that nobody understands.
- In the "Storage" section it says the target must be an EF server. I can imagine a data collection tool that actually has no default "home server", but can be configured by the end user to provide data to some particular server. Maybe this should instead say, "If a default target for data collection is defined, that default must be an Eclipse Foundation server".

(In reply to Wayne Beaton from comment #56)
> (In reply to Igor Fedorenko from comment #55)
> > What about user-agent string? For example m2e is encoding version info as
> > part of user-agent string when downloading artifacts from remote maven
> > repositories (so does maven). Is this still considered "fetch data only"?
>
> I think that this lands pretty firmly in the second category.
>
> Can we reasonably assert that the user implicitly agrees to this sort of
> communication as a natural part of using the software? i.e. is there any
> chance that the user might be surprised that this communication is happening?
>

In the case of m2e, I am pretty sure users expect m2e to contact remote repositories and download artifacts from there. I don't know if the user-agent string will surprise anyone, but it is used by all HTTP clients as far as I know.

> Is it a requirement or convention that the user-agent be set this way?

Not a requirement, at least not yet, but all tools that talk to maven repositories identify themselves.
This proves to be vital when troubleshooting traffic irregularities on the server side, and I wouldn't be surprised if larger repositories started requiring it at some point.

Hey, this awesome discussion somehow went under my radar during my PTO - thanks to Wayne for blogging/tweeting about it :)

I think this discussion is great and some good points have been made - especially that there are very different levels of "Calling home". I like the three separations defined in https://bugs.eclipse.org/bugs/show_bug.cgi?id=413169#c54

Who is interested?
==================
There were questions in this thread about who would be interested in this data, and it was made to sound like it was "just a few", which I think is a big underestimate given how many plugins do it today (all outside of eclipse since eclipse doesn't allow it). Here is the list I know about:

* Atlassian Jira/Bamboo connectors
* Tigris Subversion
* Springsource Tool suite + most of their separately installable plugins from marketplace
* JBoss Developer Studio + JBoss Tools

..and I'm sure there are more. The ones listed above are those that actually have the decency to ask the user if they want to send data.

I also think in general many would be happy to see how their plugins are being used or *not* used.

Need for unified ui
===================
As a user, if you install the four above (and many do, considering all 4 of these are in the Top on marketplace) you will end up with no less than 4 "can we get your data?" dialogs. That makes eclipse look bad - hence I'm all for a policy being made *AND* an extendable mechanism becoming available to opt-in/opt-out of this that does not require multiple complex and noisy dialogs to be shown. Much like p2 unified the license dialogs into one vs many in the past.

What can the data be used for?
==============================
Beyond that I'm sure many within eclipse could make good use of the data - i.e.
we (JBoss Tools) have "only" been gathering data for 2+ years, but we've used the data to make or at least put data behind decisions. Examples:

* Our data can debunk the myth that OSX and Linux are huge and growing wildly compared to Windows. This is false - at least for Eclipse installations. Thus we maintained and increased Windows testing.
* Windows XP is still a widely used OS - it is used more than OSX and Linux combined. Thus we continued to do basic testing of our IDE and runtimes on XP.
* Windows 8 has as big a user base as all Mac OSX versions. Even though the media says Windows 8 is a flop, it's still bigger than most other OS's.
* Java 7 is picking up adoption speed - Java 6 is used a lot by Juno users, much less by Kepler users. We insist on not using Java 7 features where not needed up past Kepler GA; Kepler usage shows we can start relaxing on that.
* Kepler is not being picked up as fast as the previous 2 eclipse releases. Makes one think.
* The Java 7 autoupdate mechanism is visible in our stats. It shows when security exploits have been fixed (mainly for 'fun').
* We all have big displays, except the 70% of users who still run at 1280x or lower resolution. Thus we fix and open bugs in plugins that add UI elements which pollute screen real estate.
* etc.

The data above is purely derived from a single adopted http request using Google Analytics. How and what we collect you can see at http://jboss.org/tools/usage. This is *not* as complex as what CodeRecommenders want to do.

What we would like to add to the list of things to derive (and this is similar to what m2e wants) is to know not only what has been installed but what is actively being used. This is data that can't just be collected from a single ping at eclipse startup. It is something that would need to be sent while eclipse is doing some action, i.e. "When a WTP server is created, send an event that includes the WTP server type" or "When a maven project is imported, send an event with module count".
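A minimal sketch of the event-style reporting described above, gated on user opt-in. The class `UsageReporter`, the event names, and the `serverType`/`moduleCount` keys are all hypothetical; this is not the JBoss Tools usage plugin or any Eclipse API, and the actual transport (e.g. an HTTP request) is deliberately left out:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical opt-in gate for event-style usage reporting (illustration only). */
class UsageReporter {
    // Off by default: nothing is recorded until the user explicitly agrees.
    private boolean optedIn = false;
    private final List<Map<String, String>> outbox = new ArrayList<>();

    void setOptedIn(boolean optedIn) {
        this.optedIn = optedIn;
        if (!optedIn) {
            outbox.clear(); // withdrawing consent discards anything not yet sent
        }
    }

    /** Records one event with a single attribute; returns false if the user has not opted in. */
    boolean record(String eventName, String key, String value) {
        if (!optedIn) {
            return false;
        }
        Map<String, String> event = new LinkedHashMap<>();
        event.put("event", eventName);
        event.put(key, value);
        outbox.add(event);
        return true;
    }

    /** Read-only copy of queued events, so a UI could let the user review them before sending. */
    List<Map<String, String>> pending() {
        return Collections.unmodifiableList(new ArrayList<>(outbox));
    }
}

class UsageReporterDemo {
    public static void main(String[] args) {
        UsageReporter reporter = new UsageReporter();
        // Dropped: the user has not opted in yet.
        reporter.record("wtp.server.created", "serverType", "example-server");
        reporter.setOptedIn(true);
        // Recorded: consent has been given.
        reporter.record("maven.project.imported", "moduleCount", "12");
        System.out.println(reporter.pending());
    }
}
```

Keeping the queue reviewable through `pending()` reflects the request earlier in the thread that users be able to see collected data before it is submitted; whether a real implementation would batch, sample, or anonymize events further is left open here.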
Need for IP collection?
=======================
When we set up our usage tracking, our legal team explicitly called out German law and required us to *not* link IP numbers to this and to make it opt-in. We did this by showing a *very* small dialog (at least compared to the Eclipse UDC and STS dialogs, which are scary legal documents or have complex choices ;). This dialog simply asks if you are willing to send data anonymously to a named entity (the JBoss Tools team and/or Red Hat when it is inside our products).

We unfortunately cannot (legal requirement) tell how many say No to this, but I can say a lot are saying yes. We currently get at least 30.000+ ping backs per day, spread across the planet, across OS's etc. I know that subversion, which is installed in even more installations, has many many more. Thus many users seem to at least be ok with this.

Suggestions
===========
A) Define a policy for eclipse.org projects
B) Make room in this policy for external users to get access to the data (at least in some public derived form)
C) Add infrastructure into eclipse core/EPP that sends back Category #2 data (basic info that can be used to figure out eclipse version pickup, java versions used, screen resolutions etc.)
D) Add infrastructure that allows plugins (whether internal or external) to collect #2 and #3 data with a ui that does not scare users and allows for easy opt-in/opt-out.
E) Publish peer-reviewed results of this data from time to time (statistics can be so devious if not done right :)

(In reply to Wayne Beaton from comment #54)
> 2) Simple/no personal information sent
>
> Examples: heartbeat from an Eclipse plug-in. May provide very minimal
> configuration information, e.g. Bundle Id and version.
>
> This is just turned on "out of the box"; must be thoroughly documented and
> user must have a means of turning it off.
>
> Attempting to collect environmental information (e.g. JVM) may push us into
> the third category.
> > 3) Complex/potential that personal information may be sent. > > Examples: code usage patterns for Code Recommenders. The names of packages > and classes might, for example, include company names or otherwise > identifiable information. > > Turned off by default. User has ability to "opt-in". Must be thoroughly > documented. The distinction between 2) and 3) looks artificial and blurry. What exactly is "very minimal configuration info" in general? Why and when do env infos turn it into personal information? Think of the set of bundles (id/version & activation state) installed. This set is specific to an installation and thus can potentially be used to fingerprint installations (see webtrackers that track browser instances based on installed addons/plugins [1]). Also 2) conflicts with your comment #1 where you write that "Any call home service would have to be opt-in.", if data is collected by default. [1] https://panopticlick.eff.org/browser-uniqueness.pdf (In reply to Markus Kuppe from comment #60) > The distinction between 2) and 3) looks artificial and blurry. What exactly > is "very minimal configuration info" in general? I'm thinking specifically of the requirements of m2e. They want to get a regular heartbeat of the version of their single bundle/feature. AFAIK, Igor's not looking for any means to map this to a specific user. > Why and when do env infos > turn it into personal information? A bundle reporting information about itself seems pretty straightforward in my mind. Reporting any kind of environment information is more dynamic. It might be running, for example, on a company-specific fork of OpenJDK. I understand that this may just be plain silly to be worrying about. > Think of the set of bundles (id/version & activation state) installed. This > set is specific to an installation and thus can potentially be used to > fingerprint installations (see webtrackers that track browser instances > based on installed addons/plugins [1]). 
So I need to tighten up my wording on #2. This example feels like a #3 to me. A full configuration may include company-specific bundle ids and such. > Also 2) conflicts with your comment #1 where you write that "Any call home > service would have to be opt-in.", if data is collected by default. That was my starting position (labeled as "brainstorming"). I'm not a US politician, I'm allowed to change my mind when presented with new information. (In reply to Wayne Beaton from comment #61) > (In reply to Markus Kuppe from comment #60) > > The distinction between 2) and 3) looks artificial and blurry. What exactly > > is "very minimal configuration info" in general? > > I'm thinking specifically of the requirements of m2e. They want to get a > regular heartbeat of the version of their single bundle/feature. AFAIK, > Igor's not looking for any means to map this to a specific user. Correct. I do not need to identify individual users. My immediate requirement is to know exact m2e version, eclipse version (at least major/minor part, i.e., 4.3, 4.4, etc) and major java version, i.e. 6, 7, 8, etc. I basically want to know when to drop support for older versions. I do agree with comment 59. There is a lot we, as a development community, can learn by better understanding our userbase and we need to find a way to collect these metrics in less obtrusive way and obviously without invading users privacy. (In reply to Wayne Beaton from comment #53) > (though it > might be helpful to have a few short bullets specifically describing what > Recommenders needs to do). It's more a "can do" than a "needs to do". Due to the current data collection policy we host such services outside Eclipse, i.e., the core frameworks are hosted at eclipse.org but anything that collects and leverages the data is hosted elsewhere. 
Generally, we want to start looking into how to leverage implicit feedback like api navigation patterns, how much time was spent on certain api docs, which exceptions occurred where in the code, which commands does a user know and use (and not know about), which code completion proposals are selected in certain situations, gather implicit quality feedback about code snippets shown in documentation by tracking copy and paste commands and the like. Recommenders certainly needs quite different data than most other projects, and I'm not sure if it makes sense to discuss the details of this in this bug report (anymore) as this bug report turned into something much more general (which is good and somewhat intended). So I wouldn't focus too much on our needs here. (In reply to Wayne Beaton from comment #61) > > Also 2) conflicts with your comment #1 where you write that "Any call home > > service would have to be opt-in.", if data is collected by default. > > That was my starting position (labeled as "brainstorming"). I'm not a US > politician, I'm allowed to change my mind when presented with new > information. What argument has made you change your mind from a privacy perspective? I only see extenuations why having access to data without opt-in is preferred by those who want that data. The beauty of your original comment #1 is that it is simple and does not lead to (endless) discussion into what category (2 or 3) certain data falls. (In reply to Markus Kuppe from comment #64) > What argument has made you change your mind from a privacy perspective? I > only see extenuations why having access to data without opt-in is preferred > by those who want that data. A simple heartbeat doesn't have any potential to pass data that could potentially be used to determine the identity of the individual or organization. Heartbeat-type data is only really useful if it's collected from a relatively large number of users. 
Providing a user interface to encourage opt-in is almost certainly going to result in a miserable user experience.

> The beauty of your original comment #1 is that it is simple and does not
> lead to (endless) discussion into what category (2 or 3) certain data falls.

I also prefer simplicity. Nothing is set in stone yet. I'm still at the point of trying to make as many people happy as possible.

I think that this special case is really a workaround for the fact that we have a terrible installation story. Would we need this case if we could ask for opt-in as part of the install experience (which is a pretty common thing these days)?

(In reply to Wayne Beaton from comment #65)
> A simple heartbeat doesn't have any potential to pass data [...] to determine
> the identity of the individual or organization.
>
> Heartbeat-type data is only really useful if it's collected from a
> relatively large number of users.

I agree with Markus that the above criteria *may* lead to long discussions. An installation heartbeat does not carry sensitive data (assuming it only collects the info about which org.eclipse.* plugins are installed and used). You may take a different position for this use case:

For every code completion event, we collect at which position in the completion list Code Recommenders ranked the actually selected proposal. We'd send triples <Java Type, applied Proposal, position in ranking> - of course only for org.eclipse.* APIs and, say, once a week.

Is this the same as a heartbeat? The data only makes sense for a relatively large user base and it does not carry information about the company or individual. Can it be enabled by default?

If not, we may be better off defining categories like "installation heartbeat" which describe what *could* be collected and classifying them as a "type 1,2,3" usage data type. The EF/PMC/AC can then decide if a new request falls into an existing category and which type it has.
To be honest, after an initial "run" of 5-10 requests I think it will quickly go down to 1-2 requests/discussions per year. I'd be fine with defining a couple of initial categories like "heartbeat", describing them, getting an ACK from PMC/AC/EF and then seeing where we get after the first 3 requests.

(In reply to Marcel Bruch from comment #66)
> [...] getting ACK by PMC/AC/EF and then see where we get after the first 3
> requests.

Put differently: start with m2e heartbeat & jboss tools installation stats as type 0. Continue with code recommenders stats as type 2,3 with an opt-in and prior review by PMC/AC/EF. See and review where we get after this is in place.

Just for transparency: the heartbeat jboss tools sends includes the product bundle id - used to know what kind of eclipse install users are running.

(In reply to Max Rydahl Andersen from comment #68)
> Just for transparency in the heartbeat jboss tools sends it includes the
> product bundle id - use to know what kind of eclipse install users are
> running.

Let me add MPC to this list; it sends data, including the product bundle id, back to the eclipse.org server.

When I presented our initial thoughts to the board of directors, there was considerable push back regarding any notion of opt-out. There is, IMHO, no chance of us including any notion of opt-out in the policy.

I've updated the policy document in the wiki. Your input will be appreciated.

(In reply to Wayne Beaton from comment #70)
> When I presented our initial thoughts to the board of directors, there was
> considerable push back regarding any notion of opt-out. There is, IMHO, no
> chance of us including any notion of opt-out in the policy.
>
> I've updated the policy document in the wiki. Your input will be appreciated.

To clarify... the board requires opt-in for all data?

(In reply to Wayne Beaton from comment #70)
> When I presented our initial thoughts to the board of directors, there was
> considerable push back regarding any notion of opt-out.
> There is, IMHO, no chance of us including any notion of opt-out in the policy.
>
> I've updated the policy document in the wiki. Your input will be appreciated.

Opt-in means popup dialogs asking the user's permission to call home. It also likely means multiple popups, one for each project that needs to call home, unless Platform provides a common popup.

(In reply to Igor Fedorenko from comment #72)
> Opt-in means popup dialogs asking user's permission to call home. It also
> likely means multiple popups, one for each project that needs to call home,
> unless Platform provides common popup.

This sounds overly pessimistic. Do you think platform would turn down a contribution that implements a common popup?

(In reply to Markus Kuppe from comment #73)
> (In reply to Igor Fedorenko from comment #72)
> > Opt-in means popup dialogs asking user's permission to call home. It also
> > likely means multiple popups, one for each project that needs to call home,
> > unless Platform provides common popup.
>
> This sounds overly pessimistic. Do you think platform would turn down a
> contribution that implements a common popup?

I assume the final approved policy will not be available until the end of 2013 or the beginning of 2014. This hardly leaves enough time for individual projects to implement compliant call-home functionality for Luna. Maybe for the next release we'll have enough time to agree on how this common popup gets implemented at Platform level, but for Luna it is unlikely. I'd be happy to be proved wrong, of course.

Igor, Wayne,

why not let m2e be the first to get permission from EMO to create such a popup for their needs - all assuming that there is a clear statement from EMO(ED) that when the second project joins, a more general solution has to be developed?
(In reply to Marcel Bruch from comment #75) > Igor, Wayne, > > why not let m2e be the first who get's a permission from EMO to create such > a popup for their needs - all assuming that there is clear statement from > EMO(ED) that when the second project joins, a more general solution has to > be developed? +1 Any update? We are well into 2014 already and I really don't want to lose another year. Hi all, I am a computer science student and am planning to take part in GSoC 2014. I have submitted a proposal about approved data collection: http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/y_aziza/5629499534213120 Any suggestions/feedback would be appreciated. Thanks, Yasser. This bug hasn't had any activity in quite some time. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. If you have further information on the current state of the bug, please add it. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant. -- The automated Eclipse Genie. |