| Summary: | [content type] allow clients to limit how much can be read during content description | | |
|---|---|---|---|
| Product: | [Eclipse Project] Platform | Reporter: | Rafael Chaves <eclipse> |
| Component: | Runtime | Assignee: | Rafael Chaves <eclipse> |
| Status: | RESOLVED WONTFIX | QA Contact: | |
| Severity: | enhancement | | |
| Priority: | P3 | CC: | david_williams, for.work.things, thatnitind |
| Version: | 3.0 | | |
| Target Milestone: | --- | | |
| Hardware: | PC | | |
| OS: | Windows XP | | |
| Whiteboard: | | | |
Description
Rafael Chaves
I think now I understand your problem. You made the assumption that never more than MARK_LIMIT (1k) would be read from a stream submitted to content description/detection, and that is no longer true. As you are now aware, that may cause problems (see bug 67211) - we cannot refuse to work properly just because we don't read more than <your-reasonable-limit-here> bytes. I agree it may get pretty slow (I tried with an Ant build script with several MB of comments before the actual contents, and opening the context menu took a long time), but at least we still work correctly.

One workaround I would suggest is to provide a simple input stream wrapping the real input stream (or extending the one you are using) that limits the maximum number of bytes that can be read, enforcing the behavior you require for your use case. Reducing severity since the current behavior is correct (from an API standpoint) and a workaround has been provided.
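As an illustration of the workaround suggested above, here is a minimal sketch of such a wrapping stream, which fakes an early EOF once a caller-chosen limit is reached. The class name and details are illustrative only, not platform API; the "LimitedBufferedInput" mentioned in a later comment is a similar, client-written variant that additionally buffers what it reads so the original stream can still be reset.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical wrapper (not platform API): caps how many bytes callers may
 * consume from the underlying stream by reporting an early end-of-stream.
 */
public class LimitedInputStream extends FilterInputStream {

    private final int limit;  // maximum bytes this wrapper will ever hand out
    private int consumed = 0; // bytes handed out so far

    public LimitedInputStream(InputStream in, int limit) {
        super(in);
        this.limit = limit;
    }

    public int read() throws IOException {
        if (consumed >= limit)
            return -1; // fake EOF once the cap is reached
        int b = super.read();
        if (b != -1)
            consumed++;
        return b;
    }

    public int read(byte[] buffer, int offset, int length) throws IOException {
        if (consumed >= limit)
            return -1; // fake EOF once the cap is reached
        int read = super.read(buffer, offset, Math.min(length, limit - consumed));
        if (read > 0)
            consumed += read;
        return read;
    }
}
```

Note that skip() is not intercepted in this sketch, so a caller that skips bytes would bypass the accounting; the two read methods are the ones content describers normally go through.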
Thanks Rafael. I still think it might be a blocker, but I will investigate your suggestions more. Yes, I've always assumed "contentType" would be based on peeking into the file (as it was, up to a week or so ago), not reading the whole thing. It seems to me that if someone really needs to read that much, it's not really a "contentType". I'm sure lots of people will find it handy, for sure - a nice, easy way to cache information based on reading the file. But, if I understand your comment in #2, I don't think us limiting our input really prevents others from reading the whole thing, past our ability to re-read the stream. I guess I would have expected some "protection" from that in platform code.

Plus, I don't think the current solution is scalable. You said you tried a couple of megabytes. Have you tried a 400 MB XML file in your workspace? Even if it's not an Ant file, the Ant describer will spend a great deal of time and memory deciding that it's not an Ant file. And, yes, 400 MB XML files are real, usually the result of some logging/database functions. There may not be many of them, but it only takes one. You might say that's an Ant describer bug, but I'd assumed a platform function would protect against run-away describers -- after all, if all describers need to limit themselves, why wouldn't the platform provide a way of limiting it? Would a set-once variable like "max blocks allowed" give enough flexibility to learn from experience for various products/needs?

I also think there may be problems with reset in LazyInputStream. Its mark and reset methods seem pretty non-standard, and I'm not sure what they are for (mark, for example, ignores the readlimit and sets the mark to the current offset, and reset sets the mark back to zero? ... that is, it undoes the previous call to mark?). I'll look more into converting all our code (reluctantly) and coming up with test cases to better exercise the current scheme, to convince myself it's workable with more than one (or two) describers and various input streams. I'll conclude this diatribe by stating the obvious ... there are several ways to define "still work correctly". Always detecting the correct type is one, but doing that at the risk of an "out of memory" error, or poor performance, is not really an overall solution.

Actually, I don't expect describers to have to read the whole file. Ideally, describers should read just as much as necessary to do their job. I agree the XMLRootElementDescriber (currently used by Ant and PDE, but provided by Core) reads more than it needs (16 KB, using Crimson), but that is it, no matter how much bigger the file is. More will be read only if no actual contents can be found in the first 16 KB (e.g. an XML file with pages of comments preceding the actual contents). I hope that diminishes your concerns w.r.t. memory consumption. The workaround I suggested (if you still need one) was for the case where you submit an input stream you control for content description/detection (as it seemed you were doing). In that case, your custom input stream could fake an early EOF if more than appropriate is read. This does not impact how things like IFile#getContentDescription (and so #getCharset) work, but it would help in your scenario, where you submit your own stream and then reuse it for other needs. Re: whether LazyInputStream honors the contract for input streams, I would suggest you open another PR if you have any evidence showing the contrary (I suggest checking ByteArrayInputStream's contract).

Rafael, your suggestion seems to work fine. Where we need it, I create a "LimitedBufferedInput" stream that returns -1 on an attempt to read past the buffer, so the original stream is still resettable to the beginning when we (re)use it to (re)read the contents. Thanks very much for your help (and please just ignore my general grumpiness :). It might be nice to have an option/parameter on the API so clients could specify a limit to "peeking", but that would be such a distant enhancement that you're welcome to dispose of this defect as you best see fit. And, BTW, I know of people on our WSAD team who will love being able to read the whole file to look for certain characteristics, so you've made that possible for them. (I say "whole file" instead of "as much as they need to" since what they are looking for may or may not be there, so they'd have to do an exhaustive search.) Thanks again.

Original summary: "content description does not reset input stream"

Agreed, we could have API that allows clients to limit the number of bytes/chars that can be read. Unfortunately, AFAIK, such a thing will be hard to get even into 3.0.1, since we are not supposed to introduce API changes in service releases. I am changing the subject of this PR so we handle this issue here. Regarding clients taking advantage of the lack of a limit to read the whole file... that is not a legitimate use of the API and it is not a recommended practice - describers are supposed to be as quick as possible in their work, although that is not enforced by the platform. There is a good chance the resulting product will show poor performance or high memory consumption if describers take too long to run or fill content descriptions with too much data.

The API is already too big for what it does, and adding a second version of every method that takes an input stream would make it even more complicated. It seems the more flexible way of tweaking how the content type framework reads the input streams handed to it is exactly by providing customized input streams. Closing as wontfix.
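Following up on the closing remark that customized input streams are the intended way to control how much the framework reads: below is a hedged usage sketch, assuming a running Eclipse runtime and the illustrative LimitedInputStream wrapper shown earlier (the class and method names here are made up for the example).

```java
import java.io.IOException;
import java.io.InputStream;

import org.eclipse.core.runtime.Platform;
import org.eclipse.core.runtime.content.IContentDescription;
import org.eclipse.core.runtime.content.IContentTypeManager;

/**
 * Illustrative only: a client caps how much the content type framework may
 * read by handing it a wrapped stream instead of the raw one.
 */
public class CappedDescription {

    public static IContentDescription describe(InputStream contents, String fileName, int maxBytes)
            throws IOException {
        // The framework only ever sees the capped stream, so no describer can
        // read past maxBytes regardless of how that describer is implemented.
        InputStream capped = new LimitedInputStream(contents, maxBytes);
        IContentTypeManager manager = Platform.getContentTypeManager();
        // IContentDescription.ALL requests every supported description property.
        return manager.getDescriptionFor(capped, fileName, IContentDescription.ALL);
    }
}
```

This keeps the limiting policy entirely on the client side, which matches the resolution above: the platform API stays as it is, and each caller picks its own ceiling.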