| Summary: | [Help][Search] Update Lucene 2.9.1 to the latest version | | |
|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | Holger Voormann <eclipse> |
| Component: | User Assistance | Assignee: | John Arthorne <john.arthorne> |
| Status: | VERIFIED FIXED | QA Contact: | |
| Severity: | enhancement | | |
| Priority: | P3 | CC: | cgold, cvgaviao, daniel_megert, david_williams, jarthana, john.arthorne, krzysztof.daniel, Mike_Wilson, mknauer, pwebster |
| Version: | 3.7 | | |
| Target Milestone: | 4.3 M1 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | | | |
| Bug Depends on: | 350103, 383582, 383586, 384216, 385962 | | |
| Bug Blocks: | 355562, 377861 | | |
Description
Holger Voormann
I'm looking into this right now. I don't know if there is going to be time to get this into Eclipse 3.7, since the way the 2.9.1 Lucene bundles are organized in Eclipse there are 9 bundles which would need updating:

org.apache.lucene
org.apache.lucene.core
org.apache.lucene.analysis
org.apache.lucene.highlighterresolution
org.apache.lucene.memory
org.apache.lucene.queries
org.apache.lucene.snowball
org.apache.lucene.spellchecker
org.apache.lucene.misc

Eight of these correspond to a jar file from Apache; org.apache.lucene does not contain any classes of its own but does export a number of packages declared in the other bundles. The Eclipse help system has dependencies on org.apache.lucene and org.apache.lucene.analysis and indirectly depends on org.apache.lucene.core.

As a historical note, the following versions of Lucene are in Orbit: 1.4.3, 1.9.1, 2.3.2, 2.4.0, 2.9.1. Between 2.3.2 and 2.4.0 the Lucene bundles in Orbit were restructured. In Lucene 1.4.3, 1.9.1 and 2.3.2, org.apache.lucene has no dependencies and contains the classes from Lucene core. In Lucene 2.4.0 and 2.9.1, org.apache.lucene depends on all of the other Lucene bundles and exports packages from those dependencies. An upgrade to 2.9.4 would preserve the structure we used for 2.9.1.

One reason why I am thinking that we are getting too late in the release cycle to put 2.9.4 into Orbit is that it requires that eight jar files be converted to bundles. Only two of these are used by the help system, and I don't know exactly how to test the other six bundles or which components use them.

Bug 260034 describes the way the Lucene jars were split across bundles.

Deferred to Eclipse 3.8, see Comment 1 for the reasoning.

Currently no-one is assigned to do this so I am removing the target milestone.
I started to look at what work would be required to port to Apache Lucene 3.4 and discovered that one class, org.eclipse.help.internal.search.WordTokenStream, will need to be recoded because the superclass org.apache.lucene.analysis.TokenStream has changed significantly. Apart from that the port appeared to be straightforward, but until WordTokenStream is recoded there is no way to test the code.

(In reply to comment #4)
> Currently no-one is assigned to do this so I am removing the target milestone.
>
> I started to look at what work would be required to port to Apache Lucene 3.4
> and discovered that one class, org.eclipse.help.internal.search.WordTokenStream
> will need to be recoded because the superclass
> org.apache.lucene.analysis.TokenStream has changed significantly. Apart from
> that the port appeared to be straightforward but until WordTokenStream was
> recoded there is no way to test the code.

I spent some time on this today, but I was hit by more dependency problems and not just the compilation error. Could this be because of some set-up issue at my end? I took the help projects from the CVS HEAD. I already had Lucene 3.5 in the workspace but I had to upgrade jasper, servlet, servlet.jsp too, and I still had more errors. If you are sure that it's only the compilation error that needs to be looked at, I can try coming up with a patch for that.

(In reply to comment #5)
>
> I spent some time on this today. But I was hit by more dependency problems and
> not just the compilation error. Could this be because of some set-up issue at
> my end? I took the help projects from the CVS HEAD. I already had the lucene
> 3.5 on the workspace but I had to upgrade jasper, servlet, servlet.jsp too. But
> still had more errors. If you are sure that it's only the compilation error
> that needs to be looked at, I can try coming up with a patch for that.

The CVS projects are no longer valid.
The current help projects are available via git from http://git.eclipse.org/c/platform/eclipse.platform.ua.git/

master is used for the Juno release.

PW

I've been investigating this a bit. WordTokenStream is a tokenizer that finds word boundaries using ICU4J's BreakIterator. This was done because ten years ago Lucene's tokenizer couldn't handle DBCS languages (bug 12656). Fast forward ten years, and Lucene's standard tokenizer as of Lucene 3.1 uses the Unicode standard text segmentation algorithm:

http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/standard/StandardTokenizer.html

From reading ICU4J documentation, ICU4J is using the same algorithm:

http://userguide.icu-project.org/boundaryanalysis

So I suspect the easiest option is to just switch to Lucene's standard tokenizer and remove our hand-crafted one. It seems better in the long run to take advantage of steady improvements in Lucene's tokenizer rather than maintaining our own.

Either way this is feeling like a risky change in the platform for Juno at this point. One option I'm liking is to use a system property to enable someone to make this switch if they want to use the new Lucene with help (we want to do this in Orion). This way Orion and others can try using the new Lucene but not risk putting that change on all consumers of the platform. After Juno we could look into switching the platform over too.

Created attachment 213534 [details]
Proposed fix
The fix should look something like this. I tested some scenarios and they seem to be working, but I still see some errors, which may be due to missing language support or because the fix is still incomplete. I will continue to work on this tomorrow. Meanwhile, it would be great if someone from the UA team could take a look at the patch.
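For readers unfamiliar with the approach discussed in comment #7, here is a minimal sketch of finding word boundaries with a BreakIterator. It is not the WordTokenStream code and not the attached patch. As an assumption made to keep it self-contained, it uses the JDK's java.text.BreakIterator rather than ICU4J's class, and the class and method names (WordBoundarySketch, tokenize) are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: uses the JDK BreakIterator as a stand-in
// for the ICU4J BreakIterator mentioned in the discussion.
public class WordBoundarySketch {

    /** Splits text into word tokens using locale-aware word boundaries. */
    public static List<String> tokenize(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            String candidate = text.substring(start, end);
            // The iterator also reports whitespace and punctuation segments;
            // keep only segments containing at least one letter or digit.
            if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world", Locale.ENGLISH));
    }
}
```

The Locale argument is what selects the boundary rules, which may be why the original code computes a Locale before creating the iterator.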
(In reply to comment #7)
> So I suspect the easiest option is to just switch to Lucene's standard
> tokenizer and remove our hand-crafted one. It seems better in the long run to
> take advantage of steady improvements in Lucene's tokenizer rather than
> maintaining our own.

I was very tempted to do this last night when I was working on the patch, but I didn't really know why we had all that code computing the Locale and using it with BreakIterator.

(In reply to comment #7)
> point. One option I'm liking is to use a system property to enable someone to
> make this switch if they want to use the new Lucene with help (we want to do
> this in Orion). This way Orion and others can try using the new Lucene but not
> risk putting that change on all consumers of the platform. After Juno we could
> look into switching the platform over too.

I am not sure I understand this. If I got it right, if we put in any Lucene version >= 2.9.2, we won't be able to use WordTokenStream as is, owing to the compilation error mentioned in comment #4. But we still want to use the existing code in WordTokenStream unless one specifically wants to run with the latest Lucene. Did I overlook something?

The main issue, which occurred to me yesterday when looking at this in detail, is that I think it is too late to switch the Eclipse Platform to use the new Lucene for the Juno release. We are in the final development milestone for Juno, and there isn't enough time to test it properly to ensure the major new Lucene version isn't breaking anything. It would also be a last-minute impact for people downstream from the platform to get a major new Lucene version at the last moment (from 2.9 to 3.5). I think we can move the Platform to Lucene 3.5 immediately after Juno, and your patch looks like a good fix for that. We can leave this bug open for post-Juno to track that.

For Orion we have a separate problem that we want to consume platform help and run with Lucene 3.5.
For Juno I think we should explore using a system property to enable help to run on Lucene 3.5 for those consumers who want it. I have opened bug 376069 for that.

To cross-reference, see bug 382574, which demonstrates that without an upper bound, the help system gets "forced" to use 3.5. Good thing there's a workaround in place? :)

Entered CQs for Lucene 3.5:
https://dev.eclipse.org/ipzilla/show_bug.cgi?id=6615
https://dev.eclipse.org/ipzilla/show_bug.cgi?id=6616

Created attachment 217831 [details]
Work in progress
I now have all Platform UA tests passing on Lucene 3.5 in this branch:

http://git.eclipse.org/c/platform/eclipse.platform.ua.git/log/?h=johna/lucene35

There were several fixes, and there are bugs listed in the "Depends on" list here to track specific issues. The next step will be getting the updated lucene bundles into the build.

Created attachment 217970 [details]
patch for orbit map

> .. The next step will be getting the updated
> lucene bundles into the build.

Want some help? :)

This patch updates Orbit map for 3.5.0 (I'm assuming the bundles required are exactly the same, just change of version).

And, I'm sure you know better than me, but then also update

eclipse.platform.common/bundles/org.eclipse.platform.doc.isv/platformOptions.txt

and

eclipse.platform.releng/features/org.eclipse.sdk/build.properties

(In reply to comment #16)
>
> Want some help? :)
>

And, meant to say, I'm assuming all this needs to be "coordinated", so will let you commit/push as you'd like, unless I hear differently and you'd just prefer I do it.

I was going to wait until we actually had builds going before trying this. I might try next week if we have builds, but otherwise will wait for week of July 23rd because I will be away July 9-20.

(In reply to comment #16)
> This patch updates Orbit map for 3.5.0 (I'm assuming the bundles required are
> exactly the same, just change of version).

Just to record here, I wasn't going to bother including the org.apache.lucene bundle anymore. This is currently just an empty shell that was left in place for compatibility. Since moving to Lucene 3.x is a major breaking change already, it seems like the best time to remove this. This is also discussed in bug 355562. So in the end I believe we just need org.apache.lucene.core and org.apache.lucene.analysis.

I plan to release this tomorrow after the 4.3 integration build.
Most of the changes are in a dozen commits in this branch:

http://git.eclipse.org/c/platform/eclipse.platform.ua.git/log/?h=johna/lucene35

There are also very small changes in these repositories which I will post a link to once committed (mostly version numbers):

eclipse.platform
eclipse.platform.releng.maps
eclipse.platform.releng

Just for the record: John announced this change on cross-project:
http://dev.eclipse.org/mhonarc/lists/cross-project-issues-dev/msg07931.html

Map files:
http://git.eclipse.org/c/platform/eclipse.platform.releng.maps.git/commit/?id=2d9c3e7b21df1256a84f1eaedb07b643845b66b7

org.eclipse.sdk/build.properties:
http://git.eclipse.org/c/platform/eclipse.platform.releng.git/commit/?id=8e354b3ba95006bfb26512dfd5268b2738b6a5aa

Version range updates in org.eclipse.sdk branding plugin:
http://git.eclipse.org/c/platform/eclipse.platform.git/commit/?id=4f620f326d9ccceaf8ba69981b678ad88121060e

There are 12 different commits in eclipse.platform.ua for various parts of this, with the following commit at the tip:
http://git.eclipse.org/c/platform/eclipse.platform.ua.git/commit/?id=e6b2a587a9fe9d9aa617553290d212c8bda1f9ef

Change to platformOptions.txt:
http://git.eclipse.org/c/platform/eclipse.platform.common.git/commit/?id=f480d1507e14e14099ade2a596c48ffa3699e892

John, all your changes look correct. But, I've found some omissions I'll commit tonight, and we'll do another N build on Wednesday, or if you're feeling brave :) ask for an I build respin.

= = = maps project

(some trivial change/file I do not know what it's for, but just as well clean up).

.../apiexclude/exclude_list_external.txt
-org.apache.lucene
+org.apache.lucene.core
 org.apache.lucene.analysis

http://git.eclipse.org/c/platform/eclipse.platform.releng.maps.git/commit/?id=7705f93992f93a508540587d7772dcb479384f9b

= = = eclipse.platform.releng

compareoptions.properties mentioned all three lucene bundles, so I removed the umbrella one.
(Not sure what the file is used for, probably wouldn't hurt anything being there ... but ... while we are at it ... I'll clean that up.)

But, here's a build breaker. I'll bet you have this changed in your workspace, and forgot to commit/release. The help feature ("eclipse.platform.releng/features/org.eclipse.help-feature") still has

<plugin
    id="org.apache.lucene"
    download-size="0"
    install-size="0"
    version="0.0.0"
    unpack="false"/>

So I'll remove that included plugin from that feature, in master.

= = = = =

In eclipsebuilder, there is a place that refers to "org.apache.lucene". And, remember, this is in the non-split stream parts of eclipsebuilder. But, I think its use is wrong and we can fix it without stream-specific configuration. We include it in the "exclude" list for the p2 comparator task during one of our final mirroring steps.

<comparator
    comparator="org.eclipse.equinox.p2.repository.tools.jar.comparator"
    comparatorLog="${buildlogs}/comparatorlog.txt">
    <repository location="${repoBaseline}" />
    <exclude>
        <artifact id="org.eclipse.jdt.doc.isv" />
        <artifact id="org.eclipse.jdt.doc.user" />
        <artifact id="org.eclipse.pde.doc.user" />
        <artifact id="org.eclipse.platform.doc.isv" />
        <artifact id="org.eclipse.platform.doc.user" />
        <artifact id="org.eclipse.equinox.executable" />
        <artifact id="org.eclipse.sdk.examples" />
        <artifact id="org.eclipse.sdk.examples.source" />
        <artifact id="master-equinox" />
        <artifact id="org.apache.lucene" />
        <artifact id="org.apache.lucene.source" />
        <artifact id="org.apache.lucene.core" />
        <artifact id="org.apache.lucene.core.source" />
        <artifact id="org.apache.lucene.analysis" />
        <artifact id="org.apache.lucene.analysis.source" />
    </exclude>
</comparator>

I suspect that is wrong, something left over from long ago? The reason I think that is that those lucene bundles should not change contents while their qualifier does not (which is the usual reason to exclude something ... like the doc bundles).
Perhaps there was a reason once, but I think the right action is to remove all 6 lucene items and see what happens. We can "split" that part of the configuration if we find there is a reason. I'll commit that change too.

A test I build, on build machine only, did not go well. I may be misunderstanding or forgotten something has to be "released" to integration stream ... but, not from what I could see.

Full log at
http://build.eclipse.org/eclipse/eclipse4I/siteDirTESTONLY/eclipse/downloads/drops4/I20120724-2339/buildlogs/fullmasterBuildOutput.txt

You'll notice at beginning it still refers to "old version" in places, such as

GitCheckoutTagInLocalRepo:
     [echo] [GIT] /shared/eclipse/eclipse4I/build/supportDir/scmCache/git___git_eclipse_org_gitroot_platform_eclipse_platform_releng_git >> git checkout --force v20120528-1648
     [exec] HEAD is now at 0ccd4b4... Bug 379747 - Pull request for Platform from CBI Add poms for Tycho build
GitFetchFileFromLocalRepo:
     [copy] Copying 1 file to /shared/eclipse/eclipse4I/build/supportDir/src/tempFeature
GitFetchFileFromLocalRepo:
     [copy] Copying 1 file to /shared/eclipse/eclipse4I/build/supportDir/src/tempFeature
[eclipse.fetch] The entry feature@org.eclipse.license,1.0.0.qualifier has not been found. The entry feature@org.eclipse.license has been used instead.
[eclipse.fetch] Missing directory entry: feature@org.eclipse.rcp.source.
[eclipse.fetch] Missing directory entry: feature@org.eclipse.equinox.p2.user.ui.source.
[eclipse.fetch] Missing directory entry: exclude@org.eclipse.platform.doc.user.
[eclipse.fetch] Missing directory entry: exclude@org.eclipse.jdt.doc.user.
[eclipse.fetch] Missing directory entry: exclude@org.eclipse.pde.doc.user.
[eclipse.fetch] Missing directory entry: plugin@org.apache.lucene.source,2.9.1.qualifier.
[eclipse.fetch] Missing directory entry: plugin@org.apache.lucene.analysis.source,2.9.1.qualifier.
[eclipse.fetch] Missing directory entry: plugin@org.apache.lucene.core.source,2.9.1.qualifier.
and at end gives the typical hint that the FEATURE still requires it, but not found in maps:

"Processing inclusion from feature org.eclipse.help: Unable to find plug-in: org.apache.lucene_0.0.0. Please check the error log for more details."

But ... pretty sure I removed it from feature (and committed and pushed), and pretty sure there's no "integration" merging needed for that project ... works off master. It's almost like something is wrong with git? :\

But, I'll try an N build and am sure others can explain the oddities.

In N20120725-0200 the bundles are in, but the source bundles are empty (except for the legal notice and manifest files).

(In reply to comment #27)
> In N20120725-0200 the bundles are in, but the source bundles are empty (except
> for the legal notice and manifest files).

They appear that way in Orbit too! I've opened Orbit bug 385962.

(In reply to comment #26)
> A test I build, on build machine only, did not go well. I may be
> misunderstanding or forgotten something has to be "released" to integration
> stream ... but, not from what I could see.

This test I-build seems to be using old content from eclipse.platform.ua. For example I see:

Bundle org.eclipse.help.ui:
[eclipse.buildScript] Missing required plug-in org.eclipse.help.base_[3.5.0,4.0.0).

But in master the version range of this dependency is [4.0.0,5.0.0). I noticed there are no new tags in eclipse.platform.ua from your test build. Maybe the tagging was turned off and we're not getting the latest content?

(In reply to comment #29)
> (In reply to comment #26)
> I noticed there are no new tags in eclipse.platform.ua from your test build..
> Maybe the tagging was turned off and we're not getting the latest content?

Ah, of course. I forget ... A test build on build machine only, even an I build, does not tag, just uses existing maps (by default).

So, I think we are good to go here.
Which is good, since I'm going to propose another I build this evening :) (for another bug). Plus, I'll promote the fixed Orbit build and we can use that Orbit I build to make sure we get the right source. Unless there are concerns or counter suggestions.

Thanks,

I've updated the Orbit maps to point to the p2 repo at

http://download.eclipse.org/tools/orbit/downloads/drops/I20120725183811/

which includes the new Lucene versions with fixed source. And, thanks to the new handy .* pattern in properties files, the maps are the only thing to change for a new qualifier, as far as I know.

The only other version (qualifier) to change was Ant 1.8.3. Changed from 1.8.3.v20120321-1730 to 1.8.3.v20120530-0730. That was for Bug 380984 - ant 1.8.3 missing bundle name, provider name (localization).

Documented our change in bug 385990. (That might be a dup of one already?)

I'm confident enough in the Lucene changes to mark this as fixed ... verification is needed, but all known work is done as far as I know.

Thanks all

I meant to mark as fixed :/

And basically looks right. After doing update from Juno, with latest I build, I see only the two 3.5 lucene bundles installed, can import them with source, and sure seems to me that help really is indexed lots faster! (but, I did not actually measure).

The only odd thing ... in console, I saw this message as it started up:

!ENTRY org.eclipse.help.base 4 4 2012-07-25 22:33:26.123
!MESSAGE Help documentation could not be indexed completely.
!SUBENTRY 1 org.eclipse.help.base 4 4 2012-07-25 22:33:26.123
!MESSAGE Help document /org.eclipse.platform.doc.isv/reference/extension-points/org_eclipse_help_base_luceneSearchParticipants.html cannot be opened.

Is that a separate bug, or did one of us change one too many files?

(In reply to comment #32)
>
> The only odd thing ... in console, I saw this message as it started up:
>
> !ENTRY org.eclipse.help.base 4 4 2012-07-25 22:33:26.123
> !MESSAGE Help documentation could not be indexed completely.
> !SUBENTRY 1 org.eclipse.help.base 4 4 2012-07-25 22:33:26.123
> !MESSAGE Help document
> /org.eclipse.platform.doc.isv/reference/extension-points/org_eclipse_help_base_luceneSearchParticipants.html
> cannot be opened.
>
> Is that a separate bug, or did one of us change one too many files?

Wow, there's even a unit test for that! :)

testPlatformIsvGenerated Failure Invalid link in "/org.eclipse.platform.doc.isv/topics_Reference.xml": reference/extension-points/org_eclipse_help_base_luceneSearchParticipants.html

But, I'm still hoping someone else knows what it means. :) Too much removed? Not enough?

(In reply to comment #33)
> testPlatformIsvGenerated Failure Invalid link in
> "/org.eclipse.platform.doc.isv/topics_Reference.xml":
> reference/extension-points/org_eclipse_help_base_luceneSearchParticipants.html

I'll fix it. Opened bug 386044.

Verified in I20120808-2000.

*** Bug 388162 has been marked as a duplicate of this bug. ***
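As a footnote to the version-range points raised in this thread (the "no upper bound" problem from bug 382574 and the [3.5.0,4.0.0) versus [4.0.0,5.0.0) mismatch seen in the test build), here is a rough sketch of how OSGi-style version ranges match. It is a simplified, hand-written check, not the OSGi framework's own implementation. The class name VersionRangeSketch and the example range [2.9.0,3.0.0) are hypothetical, and qualifiers are ignored.

```java
import java.util.Arrays;

// Hypothetical illustration of OSGi-style version range matching.
public class VersionRangeSketch {

    /** Parses "major.minor.micro[.qualifier]" into three ints; the qualifier is ignored. */
    static int[] parse(String version) {
        String[] parts = version.trim().split("\\.");
        int[] numbers = new int[3];
        for (int i = 0; i < 3 && i < parts.length; i++) {
            numbers[i] = Integer.parseInt(parts[i]);
        }
        return numbers;
    }

    /** Returns true if the version lies inside the range. */
    public static boolean includes(String range, String version) {
        String r = range.trim();
        int[] v = parse(version);
        char first = r.charAt(0);
        if (first != '[' && first != '(') {
            // A bare version means "this version or anything newer": no upper bound.
            return Arrays.compare(v, parse(r)) >= 0;
        }
        boolean lowerInclusive = first == '[';
        boolean upperInclusive = r.charAt(r.length() - 1) == ']';
        String[] bounds = r.substring(1, r.length() - 1).split(",");
        int low = Arrays.compare(v, parse(bounds[0]));
        int high = Arrays.compare(v, parse(bounds[1]));
        return (lowerInclusive ? low >= 0 : low > 0)
                && (upperInclusive ? high <= 0 : high < 0);
    }

    public static void main(String[] args) {
        // A bundle at 4.0.0 does not satisfy the old range seen in the test build.
        System.out.println(includes("[3.5.0,4.0.0)", "4.0.0"));
        System.out.println(includes("[4.0.0,5.0.0)", "4.0.0"));
        // Without an upper bound, a 2.9.1 requirement also accepts 3.5.0.
        System.out.println(includes("2.9.1", "3.5.0"));
        System.out.println(includes("[2.9.0,3.0.0)", "3.5.0"));
    }
}
```

Under these assumptions, a requirement written as a bare 2.9.1 accepts Lucene 3.5.0, while adding an exclusive upper bound keeps a consumer on the 2.9 line.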