| Summary: | Indexing failures on orionhub.org | | |
|---|---|---|---|
| Product: | [ECD] Orion | Reporter: | John Arthorne <john.arthorne> |
| Component: | Server | Assignee: | Anthony Hunter <ahunter.eclipse> |
| Status: | RESOLVED FIXED | QA Contact: | |
| Severity: | normal | | |
| Priority: | P3 | CC: | ahunter.eclipse, john.arthorne, ken_walker |
| Version: | 1.0 | | |
| Target Milestone: | 6.0 M1 | | |
| Hardware: | PC | | |
| OS: | Windows 7 | | |
| Whiteboard: | | | |
| Attachments: | | | |

Created attachment 218290
File where failure occurs

Ken and I saw these again while recovering from our failures yesterday. We should try to fix it because it clutters the log, and it also indicates that these failing files are not being indexed at all, so they will be missing from search results.

Is it possible to copy that worker-javascript.js file from OrionHub? I can try to reproduce it on my localhost first. By the way, here is a thread talking about invalid characters (e.g. 0xffff or above): http://lucene.472066.n3.nabble.com/Solr-3-1-indexing-error-Invalid-UTF-8-character-0xffff-td3113191.html

The file is already attached to this bug.

(In reply to John Arthorne from comment #4)
> The file is already attached to this bug.

Ahh, got it. I thought it was the log file. I tried both copying the file content and saving the raw file from the file link in the bug. Then I:
1. Used a new empty workspace on my local server.
2. As a brand new user, created a new js file, pasted the file contents, and saved.
3. Tried both finding the file name and searching the file contents.
4. Both the find and the search found something.
5. Did not see any error in the server log.
6. Then imported the raw file that I saved.
7. Got hits on both files, with no error in the server console.

I think I have to read more of the thread in comment 3, or maybe try to create a file with such special characters.

A link to the EmbeddedSolrServer source code: http://www.docjar.com/html/api/org/apache/solr/client/solrj/embedded/EmbeddedSolrServer.java.html

After the migration of orionhub.org to the simple metadata storage, the server was completely reindexed since the project locations moved. There were 16698 of these errors in the log:

An invalid XML character (Unicode: 0xffff)

Most of the occurrences of the error are in indexing minified JavaScript files. Therefore the XML error has nothing to do with parsing the content of the file being indexed. Instead it looks like Solr has some internal XML representation of requests that is bombing out when a request field contains certain contents.

I have been trying to reproduce on my local Windows server without luck. Many of the instances of the error occur while processing repositories containing the minified Ace editor. Here is one example of a repo that we are failing to index on orionhub.org: https://github.com/Bluefinch/microglark

Anthony, could you try cloning this on a locally running Linux Orion server and see if you get the error? I have doubts about us resolving the core problem, as it seems to be in the Solr code. As a mitigation, we can dial this particular failure down to DEBUG log level so it is not seen by default on the production server. It is currently swamping the logs on our production servers.

I can duplicate this error on my Linux Orion server. There are 26 errors created by cloning this repository. I will look into this issue.

There is a very simple fix for this problem.
The offending code is the line:
doc.addField("Text", getContentsAsString(file));
We get the contents of the file and pass it to the indexer. The indexer blows up because of the invalid XML characters.
Since we already have the file contents as a String, we can see if it contains the offending character and if it does, do not index this file's contents.
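A minimal sketch of that check, for illustration only. The class and method names (`IndexContentFilter`, `isIndexable`) are hypothetical and are not taken from the Orion `Indexer` source; the actual change is in the commit linked in this comment.

```java
// Hypothetical helper; names are illustrative, not from the Orion code base.
class IndexContentFilter {
    // U+FFFF is a Unicode noncharacter and is not a legal XML 1.0 character.
    private static final char INVALID_XML_CHAR = '\uFFFF';

    /** Returns true if the contents do not contain the known problem character. */
    static boolean isIndexable(String contents) {
        return contents.indexOf(INVALID_XML_CHAR) == -1;
    }

    public static void main(String[] args) {
        String clean = "var x = 1;";
        String dirty = "abc\uFFFFdef";

        // In the indexer, the guard would wrap the existing call, roughly:
        //   String contents = getContentsAsString(file);
        //   if (isIndexable(contents)) doc.addField("Text", contents);
        System.out.println("clean indexable: " + isIndexable(clean));
        System.out.println("dirty indexable: " + isIndexable(dirty));
    }
}
```

With a guard like this, a file containing U+FFFF would still be indexed by its other fields, but its contents would not be searchable.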
I am only going to check for the problem character that we know about (Unicode FFFF). There are other Unicode characters that are invalid in XML, but there is no sense adding them all unless we know they might occur in a file.
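For reference, a broader check is possible if other characters ever turn up. The sketch below is not what the delivered fix does; it tests each code point against the `Char` production of the XML 1.0 specification. The class and method names (`XmlChars`, `isValidXmlChar`, `firstInvalidXmlChar`) are hypothetical.

```java
// Hypothetical helper; not part of the delivered fix.
class XmlChars {
    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first invalid character, or -1 if there is none. */
    static int firstInvalidXmlChar(String s) {
        for (int i = 0; i < s.length(); ) {
            // An unpaired surrogate is returned as-is and rejected by the range check.
            int cp = s.codePointAt(i);
            if (!isValidXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidXmlChar("plain text"));
        System.out.println(firstInvalidXmlChar("ab\uFFFFcd"));
    }
}
```

Scanning every code point costs more than a single `indexOf` call, which is one reason to stay with the narrow check unless other characters are actually seen in the logs.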
I have delivered this fix with commit:
http://git.eclipse.org/c/orion/org.eclipse.orion.server.git/commit/?id=738ca3197d866245a7946a971570f2ca66b75661
There are some files we are failing to index on orionhub.org. It looks like we might not be properly encoding content for our solr requests when performing indexing. Here is an example stack:

!ENTRY org.eclipse.orion.server.core.search 4 0 2012-06-28 04:53:19.074
!MESSAGE Error during searching indexing on file: /home/data/workspace/serverworkspace/LZ/LZ/T-/js/ace-0.2.0/src/worker-javascript.js
!STACK 0
org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.SolrServerException: org.apache.solr.common.SolrException: An invalid XML character (Unicode: 0xffff) was found in the element content of the document.
	at org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.request(EmbeddedSolrServer.java:153)
	at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
	at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:120)
	at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:106)
	at org.eclipse.orion.internal.server.search.Indexer.indexProject(Indexer.java:167)
	at org.eclipse.orion.internal.server.search.Indexer.run(Indexer.java:230)
	at org.eclipse.core.internal.jobs.Worker.run(Worker.java:54)