
Bug 440699

Summary: Search: the filesearch service is not reliable
Product: [ECD] Orion
Reporter: sheng yao <yaosheng79>
Component: Server
Assignee: Project Inbox <orion.server-inbox>
Status: RESOLVED WONTFIX
QA Contact:
Severity: normal
Priority: P3
CC: ahunter.eclipse, john.arthorne, mamacdon, michael.ochmann, yaosheng79
Version: 5.0   
Target Milestone: ---   
Hardware: PC   
OS: Windows 7   
Whiteboard:

Description sheng yao CLA 2014-07-29 22:28:30 EDT
Hi,
I have encountered some reliability issues while using the fileSearch service. Below is a summary:

1. All .json files are ignored: even when a .json file contains the search term, the filesearch service never returns it. I think .json files should not be skipped, since they are not a binary format.
2. Orion's fileSearch service returns fewer results than Eclipse. In my case, searching for "sap.context" with fileSearch returns only 21 files, whereas Eclipse returns 100+. It seems that some files in deeply nested folders are skipped.
3. If the term contains special characters such as "/" or "_", the filesearch service returns no results at all.
4. The results contain newly deleted files but do not contain newly added files. I know this is because of the indexer; how can I change the indexer's refresh interval?

I know client-side searching can return more accurate results, but its performance is poor: it may send thousands of requests if the workspace contains many files.
Is it possible to fix these issues, especially the first three, in the backend filesearch service as well?

Thanks & Regards,
YaoSheng
Comment 1 Mark Macdonald CLA 2014-07-31 12:35:49 EDT
Issue 1: opened bug 440914 to get .json files indexed.

Issues 2 & 3 stem from weaknesses of the index-based search implementation that we have recognized for some time. It may be possible to configure the indexer's syntax rules to fix these... but the configuration is complex and must be performed by the server administrator, not end users.

Folks on the Orion team have been discussing alternatives. We would like to extend the filesystem API to allow filesystems to provide reliable "grep" and "find" style searches. These could be executed on the server, and hence offer much better performance than Orion's existing client-side crawl. We don't have a time frame for this work yet, but we all struggle with search on a daily basis so we understand the importance.

Issue #4: if you're hosting your own Orion server built from source, you can tune the indexing interval by changing the IDLE_DELAY field in /bundles/org.eclipse.orion.server.search/src/org/eclipse/orion/internal/server/search/Indexer.java. The default is 5 minutes.
Comment 2 Michael Ochmann CLA 2014-09-17 11:57:01 EDT
There is a simple example that illustrates issue 3 quite nicely:

1. create a test.js file with the content:

     function func(foo_bar){};

2. open global or quick search and enter
   func    => match
   foo     => match
   bar     => match
   foo_    => no match
   foo_bar => no match

3. insert blanks before and after foo_bar:

     function func( foo_bar ){};

4. search again:
   func    => match
   foo     => match
   bar     => match
   foo_    => match
   foo_bar => match

It's easy to find similar patterns for other common file types:

json: {"key":"foo_bar"}    => no match for foo_ and foo_bar
      {"key": "foo_bar" }  => match for foo_ and foo_bar

xml:  <tag>foo_bar</tag>   => no match for foo_ and foo_bar
      <tag> foo_bar </tag> => match for foo_ and foo_bar

If I understand the Solr documentation right, the input

     function func(foo_bar)

is split into tokens (according to the currently used Solr schema.xml) as follows:

  WhitespaceTokenizer => 
    function 
    func(foo_bar) 
  WordDelimiterFilter =>
    function
    func            <- generateWordParts
    foo             <- generateWordParts
    bar             <- generateWordParts
    funcfoobar      <- catenateWords
    func(foo_bar)   <- preserveOriginal

I left out the other filters that are not relevant in this case. The result explains why searching for "foo" and "bar" produces matches, but "foo_" and "foo_bar" do not.
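The failure mode described above can be modeled with a small Python sketch. This is a deliberately simplified approximation of the WhitespaceTokenizer/WordDelimiterFilter chain, not Solr itself: the regex for word parts ignores case-change and letter/digit splits that the real filter also performs, and `matches` assumes the trailing * wildcard amounts to a plain prefix match against the indexed terms.

```python
import re

def whitespace_tokenize(text):
    # WhitespaceTokenizer: split only on whitespace
    return text.split()

def word_delimiter_filter(token):
    # generateWordParts: split on non-alphanumeric delimiters (incl. '_')
    parts = re.findall(r"[A-Za-z0-9]+", token)
    terms = list(parts)
    if len(parts) > 1:
        terms.append("".join(parts))  # catenateWords
        terms.append(token)           # preserveOriginal
    return terms

def analyze(text):
    terms = []
    for tok in whitespace_tokenize(text):
        terms.extend(word_delimiter_filter(tok))
    return terms

def matches(terms, query):
    # Orion appends a trailing '*', i.e. a prefix match on indexed terms
    return any(t.startswith(query) for t in terms)

terms = analyze("function func(foo_bar)")
print(terms)
# ['function', 'func', 'foo', 'bar', 'funcfoobar', 'func(foo_bar)']
print(matches(terms, "foo"))      # True
print(matches(terms, "foo_bar"))  # False
```

No indexed term starts with "foo_", so the prefix query fails, exactly as observed in step 2 above.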

Introducing blanks

     function func( foo_bar )

gives instead

  WhitespaceTokenizer => 
    function 
    func(
    foo_bar
    ) 
  WordDelimiterFilter =>
    function
    func            <- generateWordParts
    foo             <- generateWordParts
    bar             <- generateWordParts
    foobar          <- catenateWords
    foo_bar         <- preserveOriginal

This looks more promising. Since the search always appends a * wildcard, "foo_", "foo_b", "foo_ba" and "foo_bar" produce matches now.

I experimented with some alternative filter/tokenizer combinations and eventually introduced an additional MappingCharFilter before the whitespace tokenizer:

  https://git.eclipse.org/r/#/c/33506/

The basic idea is to replace typical delimiters like brackets, semicolons, etc. with blanks **before** splitting the input into tokens, not afterwards as in the current implementation.

The data flow with the new filter looks like the following:

MappingCharFilter =>
    function func foo_bar

WhitespaceTokenizer => 
    function
    func
    foo_bar

WordDelimiterFilter =>
    function
    func
    foo             <- generateWordParts
    bar             <- generateWordParts
    foobar          <- catenateWords
    foo_bar         <- preserveOriginal
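Continuing the same simplified Python model, the proposed pre-tokenization step can be sketched as a character substitution run before the whitespace split. The delimiter set below is illustrative only; the actual character mappings are defined in the MappingCharFilter's mapping file in the Gerrit change, not here.

```python
import re

# Illustrative delimiter set; the real list lives in the
# MappingCharFilter's mapping configuration.
DELIMITERS = "(){}[]<>;:,=\"'"

def mapping_char_filter(text):
    # Replace typical delimiters with blanks BEFORE tokenization
    return text.translate(str.maketrans(DELIMITERS, " " * len(DELIMITERS)))

def analyze(text):
    terms = []
    for tok in mapping_char_filter(text).split():  # WhitespaceTokenizer
        parts = re.findall(r"[A-Za-z0-9]+", tok)   # generateWordParts
        terms.extend(parts)
        if len(parts) > 1:
            terms.append("".join(parts))  # catenateWords
            terms.append(tok)             # preserveOriginal
    return terms

print(analyze("function func(foo_bar)"))
# ['function', 'func', 'foo', 'bar', 'foobar', 'foo_bar']
```

Because "foo_bar" now survives as an indexed term, the prefix queries "foo_*" through "foo_bar*" all match, with or without surrounding blanks.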

I have been experimenting with the new filter rules for some time now, and I think they yield better results for the most common file types, although they might have some other weaknesses.

What do you think, Mark?
Comment 3 Mark Macdonald CLA 2014-09-17 19:53:00 EDT
(In reply to Michael Ochmann from comment #2)
> I played some time now with the new filter rules and I think it yields
> better results for the most common file types, although it might have some
> other weaknesses.
> 
> What do you think, Mark?

This sounds like a great improvement, but I'm not the right person to ask about Solr tokenizing. Cc'ing John...
Comment 4 Anthony Hunter CLA 2014-11-04 16:52:50 EST
(In reply to Mark Macdonald from comment #1)
> Folks on the Orion team have been discussing alternatives. We would like to
> extend the filesystem API to allow filesystems to provide reliable "grep"
> and "find" style searches. These could be executed on the server, and hence
> offer much better performance than Orion's existing client-side crawl. We
> don't have a time frame for this work yet, but we all struggle with search
> on a daily basis so we understand the importance.

See Bug 450017
Comment 5 John Arthorne CLA 2015-01-14 16:24:43 EST
Michael, thanks for the very detailed explanation, and sorry it took so long to get back to you. Your change looks like an improvement, but after a few years of trying to tweak Solr/Lucene we have never been able to get reliable searches over arbitrary source code. We are working on a replacement search implementation that is not index-based; see bug 450017 for details. I think index-based search could still play an interesting role in language-specific searching (for tokens, function names, etc.), but for arbitrary regular-expression search over arbitrary source code we have not been able to get 100% reliable results.

I'm going to close this one as WONTFIX because we are not continuing to work on reliability of the indexed search. The root problem of unreliable search is being explored in bug 450017.