| Summary: | Search: the filesearch service is not reliable | | |
|---|---|---|---|
| Product: | [ECD] Orion | Reporter: | sheng yao <yaosheng79> |
| Component: | Server | Assignee: | Project Inbox <orion.server-inbox> |
| Status: | RESOLVED WONTFIX | QA Contact: | |
| Severity: | normal | | |
| Priority: | P3 | CC: | ahunter.eclipse, john.arthorne, mamacdon, michael.ochmann, yaosheng79 |
| Version: | 5.0 | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Windows 7 | | |
| Whiteboard: | | | |
Description
sheng yao

Comment 1 (Mark Macdonald)

Issue 1: I opened bug 440914 to get .json files indexed.

Issues 2 & 3 stem from weaknesses of the index-based search implementation that we have recognized for some time. It may be possible to configure the indexer's syntax rules to fix these, but the configuration is complex and must be performed by the server administrator, not by end users. Folks on the Orion team have been discussing alternatives. We would like to extend the filesystem API to allow filesystems to provide reliable "grep" and "find" style searches. These could be executed on the server, and hence offer much better performance than Orion's existing client-side crawl. We don't have a time frame for this work yet, but we all struggle with search on a daily basis so we understand the importance.

Issue 4: if you're hosting your own Orion server built from source, you can tune the indexing interval by changing the IDLE_DELAY field in /bundles/org.eclipse.orion.server.search/src/org/eclipse/orion/internal/server/search/Indexer.java. The default is 5 minutes.
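(For illustration only, a minimal, hypothetical sketch of a periodic indexing job of this shape. Only the IDLE_DELAY name and its 5-minute default come from the comment above; the class name, the executor-based scheduling, and the millisecond unit are assumptions, and the real Orion Indexer is an Eclipse Job with its own scheduling logic.)

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch: a job that re-indexes after every idle delay.
    // Only IDLE_DELAY and its 5-minute default are taken from the comment
    // above; everything else is assumed.
    public class IndexerSketch {
        // Delay between indexing passes; lowering it makes new files
        // searchable sooner at the cost of more server load.
        static final long IDLE_DELAY = 5 * 60 * 1000L; // 5 minutes (unit assumed)

        public static void main(String[] args) {
            ScheduledExecutorService scheduler =
                    Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleWithFixedDelay(
                    () -> System.out.println("re-indexing workspace..."),
                    0, IDLE_DELAY, TimeUnit.MILLISECONDS);
        }
    }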
Comment 2 (Michael Ochmann)

There is a simple example that illustrates issue 3 quite nicely:
1. create a test.js file with the content:
function func(foo_bar){};
2. open global or quick search and enter
func => match
foo => match
bar => match
foo_ => no match
foo_bar => no match
3. insert blanks before and after foo_bar:
function func( foo_bar ){};
4. search again:
func => match
foo => match
bar => match
foo_ => match
foo_bar => match
It's easy to find similar patterns for other common file types:
json: {"key":"foo_bar"} => no match for foo_ and foo_bar
{"key": "foo_bar" } => match for foo_ and foo_bar
xml: <tag>foo_bar</tag> => no match for foo_ and foo_bar
<tag> foo_bar </tag> => match for foo_ and foo_bar
If I understand the Solr documentation right, the input
function func(foo_bar)
is split into tokens (according to the currently used Solr schema.xml) as follows:
WhitespaceTokenizer =>
function
func(foo_bar)
WordDelimiterFilter =>
function
func <- generateWordParts
foo <- generateWordParts
bar <- generateWordParts
funcfoobar <- catenateWords
func(foo_bar) <- preserveOriginal
I left out the other filters that are not relevant in this case. The result explains why searching for "foo" and "bar" produces matches, but "foo_" and "foo_bar" do not.
Introducing blanks
function func( foo_bar )
gives instead
WhitespaceTokenizer =>
function
func(
foo_bar
)
WordDelimiterFilter =>
function
func <- generateWordParts
foo <- generateWordParts
bar <- generateWordParts
foobar <- catenateWords
foo_bar <- preserveOriginal
This looks more promising. Since the search always appends a * wildcard, "foo_", "foo_b", "foo_ba" and "foo_bar" produce matches now.
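(To reproduce these token streams outside of Solr, the chain can be driven directly with the underlying Lucene classes. A minimal sketch, assuming Lucene 4.x APIs and the three WordDelimiterFilter options named above; constructor signatures differ in other Lucene versions, and the class name TokenChainDemo is mine.)

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    // Prints the tokens emitted by WhitespaceTokenizer + WordDelimiterFilter
    // for both variants of the example input (Lucene 4.x assumed).
    public class TokenChainDemo {
        static void printTokens(String input) throws Exception {
            TokenStream ts = new WordDelimiterFilter(
                    new WhitespaceTokenizer(Version.LUCENE_45, new StringReader(input)),
                    WordDelimiterFilter.GENERATE_WORD_PARTS
                            | WordDelimiterFilter.CATENATE_WORDS
                            | WordDelimiterFilter.PRESERVE_ORIGINAL,
                    null); // no protected words
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
        }

        public static void main(String[] args) throws Exception {
            printTokens("function func(foo_bar)");   // no emitted token starts with "foo_"
            printTokens("function func( foo_bar )"); // emits a foo_bar token
        }
    }

Since a query term has to be a prefix of some emitted token once the * wildcard is appended, the first stream can never match "foo_", while the second can.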
I experimented with some alternative filter/tokenizer combinations and eventually introduced an additional MappingCharFilter before the whitespace tokenizer:
https://git.eclipse.org/r/#/c/33506/
The basic idea is to replace typical delimiters like brackets, semicolons etc. with blanks **before** splitting the input into tokens, not afterwards as in the current implementation.
The data flow with the new filter looks like the following:
MappingCharFilter =>
function func foo_bar
WhitespaceTokenizer =>
function
func
foo_bar
WordDelimiterFilter =>
function
func
foo <- generateWordParts
bar <- generateWordParts
foobar <- catenateWords
foo_bar <- preserveOriginal
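(Under the same Lucene 4.x assumption, the proposed chain can be sketched by putting a MappingCharFilter in front of the whitespace tokenizer. The delimiter list below is an illustrative subset; the actual mapping rules are the ones in the Gerrit change linked above.)

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.charfilter.MappingCharFilter;
    import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    // Replaces typical delimiters with blanks *before* tokenizing, as
    // proposed above. The mapped characters here are an illustrative subset.
    public class MappingChainDemo {
        public static void main(String[] args) throws Exception {
            NormalizeCharMap.Builder map = new NormalizeCharMap.Builder();
            for (String delim : new String[] {"(", ")", "{", "}", "[", "]",
                    "<", ">", ";", ",", ":", "\""}) {
                map.add(delim, " ");
            }
            TokenStream ts = new WordDelimiterFilter(
                    new WhitespaceTokenizer(Version.LUCENE_45,
                            new MappingCharFilter(map.build(),
                                    new StringReader("function func(foo_bar)"))),
                    WordDelimiterFilter.GENERATE_WORD_PARTS
                            | WordDelimiterFilter.CATENATE_WORDS
                            | WordDelimiterFilter.PRESERVE_ORIGINAL,
                    null);
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                // expected: function, func, foo, bar, foobar, foo_bar (order may vary)
                System.out.println(term.toString());
            }
            ts.end();
            ts.close();
        }
    }

With the delimiters mapped to blanks up front, foo_bar reaches the WordDelimiterFilter as its own token, so both the word parts and the preserved original get indexed regardless of the surrounding punctuation.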
I have played with the new filter rules for some time now and I think they yield better results for the most common file types, although they might have some other weaknesses.
What do you think, Mark?
Comment 3

(In reply to Michael Ochmann from comment #2)
> I have played with the new filter rules for some time now and I think they
> yield better results for the most common file types, although they might
> have some other weaknesses.
>
> What do you think, Mark?

This sounds like a great improvement, but I'm not the right person to ask about Solr tokenizing. Cc'ing John...

Comment 4

(In reply to Mark Macdonald from comment #1)
> Folks on the Orion team have been discussing alternatives. We would like to
> extend the filesystem API to allow filesystems to provide reliable "grep"
> and "find" style searches. These could be executed on the server, and hence
> offer much better performance than Orion's existing client-side crawl. We
> don't have a time frame for this work yet, but we all struggle with search
> on a daily basis so we understand the importance.

See bug 450017.

Comment 5

Michael, thanks for the very detailed explanation, and sorry it took so long to get back to you. It looks like your change will be an improvement, but after a few years of trying to tweak Solr/Lucene we have never been able to get reliable searches over arbitrary source code. We are working on a replacement search implementation that is not index-based; see bug 450017 for details. I think index-based search could still play an interesting role in language-specific searching, such as searching for tokens, function names, etc., but for arbitrary regular-expression search over arbitrary source code we have not been able to get 100% reliable results.

I'm going to close this one as WONTFIX because we are not continuing to work on the reliability of the indexed search. The root problem of unreliable search is being explored in bug 450017.