Bug 349041 - [search] Ensure server excludes all binary content
Summary: [search] Ensure server excludes all binary content
Status: RESOLVED FIXED
Alias: None
Product: Orion
Classification: ECD
Component: Server
Version: unspecified
Hardware: PC Windows 7
Importance: P3 major
Target Milestone: 8.0
Assignee: Anthony Hunter CLA
QA Contact:
URL:
Whiteboard:
Keywords:
Duplicates: 421692
Depends on:
Blocks:
 
Reported: 2011-06-10 10:43 EDT by John Arthorne CLA
Modified: 2015-01-19 16:40 EST
CC List: 4 users

See Also:


Attachments

Description John Arthorne CLA 2011-06-10 10:43:37 EDT
The search indexer currently has a hard-coded list of binary file types to avoid. It will eventually need code to distinguish binary from text content so that it can index all text content regardless of file extension.
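For illustration, one common heuristic for telling binary from text content is to sample the start of a file and look for a NUL byte. The sketch below is not the Orion implementation; the class name `BinarySniffer` and the 8 KB sample size are assumptions chosen for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the Orion code base.
final class BinarySniffer {
    // Assumed sample size: only the first 8 KB of a file is inspected.
    private static final int SAMPLE_SIZE = 8192;

    private BinarySniffer() {
    }

    // Returns true if the first 'length' bytes of the sample contain a NUL byte.
    static boolean looksBinary(byte[] sample, int length) {
        for (int i = 0; i < length; i++) {
            if (sample[i] == 0) {
                return true;
            }
        }
        return false;
    }

    // Reads up to SAMPLE_SIZE bytes from the file and applies the NUL-byte check.
    static boolean looksBinary(Path file) throws IOException {
        byte[] buffer = new byte[SAMPLE_SIZE];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (total < SAMPLE_SIZE
                    && (n = in.read(buffer, total, SAMPLE_SIZE - total)) != -1) {
                total += n;
            }
        }
        return looksBinary(buffer, total);
    }
}
```

A check like this would misclassify UTF-16 text, which legitimately contains NUL bytes, so it would likely need to be combined with encoding or content type information.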
Comment 1 John Arthorne CLA 2011-06-10 10:44:54 EDT
Note to myself. See:

org.eclipse.search.internal.core.text.TextSearchVisitor#processFile for the
platform code that determines if a file is binary for the purpose of searching.
Comment 2 John Arthorne CLA 2013-06-19 15:59:04 EDT
Note that in bug 348040 we switched from a blacklist of types to ignore to a whitelist of types to search. This was to address a problem where the search indexer was consuming an unreasonable amount of CPU. Either way, the problem of adding proper binary vs. text detection remains.
Comment 3 John Arthorne CLA 2013-11-14 13:26:05 EST
*** Bug 421692 has been marked as a duplicate of this bug. ***
Comment 4 Rafael Chaves CLA 2013-11-14 13:43:24 EST
Folks, this is a biggie. Pardon if I tweak the severity. The only workaround requires forking the Orion code base. 

I wouldn't expect anything fancy, but this list of searchable extensions needs to be externalized somehow so language tools built on top of Orion can provide search support.

Making this configurable in orion.conf would be sufficient.
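As a rough sketch of what such an externalized list could look like, the code below reads a comma-separated list of extra extensions from a `java.util.Properties` object and merges it with the built-in defaults. The property key `orion.search.extensions` and the class name are hypothetical; no such setting is confirmed to exist in orion.conf.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper, not part of the Orion code base.
final class SearchableExtensions {
    // Hypothetical property key; not an actual orion.conf setting.
    static final String KEY = "orion.search.extensions";

    private SearchableExtensions() {
    }

    // Returns the defaults plus any extensions listed under KEY.
    // Entries are trimmed, lower-cased, and stripped of a leading dot.
    static Set<String> fromProperties(Properties props, Set<String> defaults) {
        Set<String> result = new LinkedHashSet<>(defaults);
        String raw = props.getProperty(KEY);
        if (raw == null) {
            return result;
        }
        for (String part : raw.split(",")) {
            String ext = part.trim().toLowerCase(Locale.ROOT);
            if (ext.startsWith(".")) {
                ext = ext.substring(1);
            }
            if (!ext.isEmpty()) {
                result.add(ext);
            }
        }
        return result;
    }
}
```

With a line such as `orion.search.extensions=.uml, textuml` in the configuration, both extensions would be added to the searchable set without changes to the server code.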
Comment 5 John Arthorne CLA 2013-12-05 13:17:18 EST
There are a few options here:

1) Invert the default assumption so that we index all files that are not known to be binary, rather than only indexing files that are known to be text. This means we will waste some resources attempting to index binary files that we failed to detect.
2) A server-side extension like Rafael mentions. It is a quick solution, but it does not help with client-side extensibility - say someone plugs language tooling into orionhub but doesn't have control over the server.
3) Consume the Eclipse content type infrastructure and do intelligent analysis of whether a file is binary or text. Potentially more expensive, but more likely to be right, and it is also pluggable, on the server side at least.

I am currently thinking I will do 1) for 5.0 M2, mainly because that's all I have time to do. Rafael if you have interest in contributing something more sophisticated it would be welcome.
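Option 1 above can be sketched as a filter that indexes every file unless its extension is in a set of known binary types. This is only an illustration of the inverted default: the class name `IndexFilter` and the extension list are assumptions, not the list the Orion server uses.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the Orion code base.
final class IndexFilter {
    // Illustrative entries only; not the list used by the Orion server.
    private static final Set<String> KNOWN_BINARY = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList(
                    "png", "jpg", "gif", "zip", "jar", "class", "exe", "pdf")));

    private IndexFilter() {
    }

    // Index by default; skip only files whose extension is known to be binary.
    static boolean shouldIndex(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return true; // no extension: assume text
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return !KNOWN_BINARY.contains(ext);
    }
}
```

Under this default, a file type the server has never seen is indexed automatically. The cost is that an unrecognized binary format is also indexed until its extension is added to the set.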
Comment 6 Mike Wilson CLA 2014-06-30 18:29:00 EDT
I would really like us to spend whatever time it takes to fix this. The current situation burns me basically *every* time I start looking at a new file type.
Comment 7 libing wang CLA 2014-07-08 16:21:57 EDT
(In reply to Mike Wilson from comment #6)
> I would really like us to spend whatever time it takes to fix this. The
> current situation burns me basically *every* time I start looking at a new
> file type.

Fixed in bug 438727. The crawler now searches all file types except known binary and image files.
Comment 8 Anthony Hunter CLA 2015-01-19 15:44:46 EST
(In reply to John Arthorne from comment #1)
> Note to myself. See:
> 
> org.eclipse.search.internal.core.text.TextSearchVisitor#processFile for the
> platform code that determines if a file is binary for the purpose of
> searching.

We need to update the new search to skip binary files.
Comment 9 Anthony Hunter CLA 2015-01-19 16:40:36 EST
Added a very small update with the following commit that looks to do the trick:
http://git.eclipse.org/c/orion/org.eclipse.orion.server.git/commit/?id=d7ed9685582288db9ec888d7652deec09712d34e