
Bug 68565

Summary: [content type] allow clients to limit how much can be read during content description
Product: [Eclipse Project] Platform
Reporter: Rafael Chaves <eclipse>
Component: Runtime
Assignee: Rafael Chaves <eclipse>
Status: RESOLVED WONTFIX
QA Contact:
Severity: enhancement
Priority: P3
CC: david_williams, for.work.things, thatnitind
Version: 3.0
Target Milestone: ---
Hardware: PC
OS: Windows XP
Whiteboard:

Description Rafael Chaves CLA 2004-06-24 21:45:27 EDT
Description copied from bug 67211 comment 20:

"The fix for (bug 67211) is really causing us grief. We basically have a lot of
code 
that assumed what ever input stream we passed into describer sort of functions 
would be resettable to beginning, so we could re-read it. This is no longer 
happening. We've tried some work arounds, by always resetting stream ourselves 
(which we used to get for free) but that didn't fix all the problems. I think, 
basically, we have FileInputStreams, wrapped with BufferedStreams, which are now 
wrapped with LazyInputStreams. I believe we'll have to change all our code so 
can always get "fresh" FileInputStreams after they've been used by describer 
framework. Was this really intended design? Also, is it really that wise to let 
describers go as deep into a file as they'd like ... what about a 400 Megabyte 
XML file? Anyway, I'll try to look at deeper and see if I have more constructive 
suggestions, but figured I'd better re-open so you'd know it broke us. There may 
well be work arounds and things in our code we can change, but our initial 
attempts have not solved all problems. I think if there was some "compromise" so 
there was a published maximimum, 2K? 3K? 10K? We could adopt by easily adapt by 
chaning size of our BufferedInputStreams. And, I suspect, this would solve 
original problem for 99.9 percent of cases."
Comment 1 Rafael Chaves CLA 2004-06-24 22:03:42 EDT
I think I now understand your problem. You assumed that no more than MARK_LIMIT
(1 KB) would ever be read from a stream submitted for content
description/detection, and that is no longer true.

As you are now aware, that assumption can cause problems (see bug 67211) - we cannot
refuse to work properly just so we never read more than <your-reasonable-limit-here>
bytes. I agree it may get pretty slow (I tried with an Ant build script that had
several MB of comments before the actual contents, and opening the context menu
took a long time), but at least we still work correctly.

One workaround I would suggest is to provide a simple input stream wrapping the
real input stream (or extending the one you are using) that would limit the
maximum number of bytes that can be read, enforcing the behavior you require for
your use case.
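
A minimal sketch of that workaround, assuming the client can wrap the stream it
hands to the framework; the class name LimitedInputStream is made up for
illustration:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Wrapper that fakes an early EOF once a fixed number of bytes has been read,
// so content description can never pull more data than the client is willing
// to buffer.
public class LimitedInputStream extends FilterInputStream {
	private final int limit;
	private int bytesRead = 0;

	public LimitedInputStream(InputStream in, int limit) {
		super(in);
		this.limit = limit;
	}

	public int read() throws IOException {
		if (bytesRead >= limit)
			return -1; // pretend end-of-file was reached
		int b = super.read();
		if (b != -1)
			bytesRead++;
		return b;
	}

	public int read(byte[] b, int off, int len) throws IOException {
		if (bytesRead >= limit)
			return -1; // pretend end-of-file was reached
		int count = super.read(b, off, Math.min(len, limit - bytesRead));
		if (count > 0)
			bytesRead += count;
		return count;
	}
}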
Comment 2 Rafael Chaves CLA 2004-06-24 23:22:10 EDT
Reducing severity since the current behavior is correct (from an API standpoint)
and a workaround has been provided.
Comment 3 David Williams CLA 2004-06-24 23:59:45 EDT
Thanks Rafael. I still think this might be a blocker, but I will investigate your
suggestions more.

Yes, I've always assumed "contentType" would be based on peeking into the file (as it
was, up to a week or so ago), not reading the whole thing. It seems to me that if someone
really needs to read that much, it's not really a "contentType". I'm sure
lots of people will find it handy - a nice easy way to cache
information based on reading the file. But, if I understand your comment 2,
I don't think us limiting input really prevents others from reading the whole
thing, past our ability to re-read the stream. I guess I would have expected
some "protection" from that from platform code.

Plus, I don't think the current solution is scalable. You said you tried a couple of
megabytes. Have you tried a 400 MB XML file in your workspace? Even if it's
not an Ant file, the Ant describer will spend a great deal of time and memory
deciding that it's not an Ant file. And, yes, 400 MB XML files are real,
usually the result of some logging/database functions. There may not be many of
them, but it only takes one. You might say that's an Ant describer bug, but I'd
assumed a platform function would protect against runaway describers -- after
all, if all describers need to limit themselves, then why wouldn't the platform
provide a way of limiting it? Would a set-once variable "max blocks allowed"
allow the flexibility to learn from experience for various products/needs?

I do also think there may be problems with reset in LazyInputStream. Its mark
and reset methods seem pretty non-standard, and I'm not sure what they are for
(mark, for example, ignores the readlimit and sets the mark to the current offset, and
reset sets the mark back to zero? ... that is, it undoes the previous call to mark?).
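
For reference, the standard java.io.InputStream contract being alluded to here -
mark(readlimit) remembers the current position and reset() rewinds to it - shown
as a minimal sketch with BufferedInputStream (not related to LazyInputStream's
actual implementation):

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetExample {
	public static void main(String[] args) throws IOException {
		InputStream in = new BufferedInputStream(
				new ByteArrayInputStream("<project name=\"demo\"/>".getBytes()));
		in.mark(1024);        // remember the current position; valid for up to 1024 bytes
		in.read(new byte[8]); // peek at the first few bytes
		in.reset();           // rewind to the marked position (not necessarily offset 0)
		System.out.println(in.read()); // prints 60, the byte value of '<', read again
	}
}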

I'll look more into converting all our code (reluctantly) and coming up with
test cases to better exercise the current scheme, to convince myself it's workable with
more than one (or two) describers and various input streams.

I'll conclude this diatribe by stating the obvious ... there are several ways to
define "still work correctly". Always detecting the correct type is one, but doing
so at the risk of an "out of memory" error, or poor performance, is not
really an overall solution.
Comment 4 Rafael Chaves CLA 2004-06-25 01:03:07 EDT
Actually, I don't expect describers to have to read the whole file. Ideally,
describers should read just as much as necessary to do their job. I agree
the XMLRootElementDescriber (currently used by Ant and PDE, but provided by
Core) reads more than it needs (16 KB using Crimson), but that is the limit, no matter
how much bigger the file is. More will be read only if no actual
contents can be found in the first 16 KB (e.g. an XML file with pages of comments
preceding the actual contents). I hope that diminishes your concerns w.r.t.
memory consumption.
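
To illustrate what "read just as much as necessary" can look like, here is a rough
sketch of a describer that only ever inspects a bounded prefix of the stream; the
class name and the MYFT signature are invented for the example, and for brevity a
single read call is used:

import java.io.IOException;
import java.io.InputStream;

import org.eclipse.core.runtime.QualifiedName;
import org.eclipse.core.runtime.content.IContentDescriber;
import org.eclipse.core.runtime.content.IContentDescription;

// Checks only the first few bytes of the stream for a (made-up) magic
// signature and never reads any further, however large the file is.
public class BoundedMagicDescriber implements IContentDescriber {
	private static final byte[] MAGIC = {'M', 'Y', 'F', 'T'}; // hypothetical signature

	public int describe(InputStream contents, IContentDescription description) throws IOException {
		byte[] prefix = new byte[MAGIC.length];
		int read = contents.read(prefix); // one bounded read; never goes deeper into the file
		if (read < MAGIC.length)
			return INDETERMINATE;
		for (int i = 0; i < MAGIC.length; i++)
			if (prefix[i] != MAGIC[i])
				return INVALID;
		return VALID;
	}

	public QualifiedName[] getSupportedOptions() {
		return new QualifiedName[0];
	}
}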

The workaround I suggested (if you still need one) was for the case where you are
submitting an input stream you control for content description/detection (as it
seemed you were doing). In that case, your custom input stream could fake an
early EOF if more than an appropriate amount is read. This does not impact how things like
IFile#getContentDescription (and so #getCharset) work, but it would help in your
scenario, where you submit your own stream and then reuse the stream for other
needs.

Re: whether LazyInputStream honors the contract for input streams, I would
suggest you open another PR if you have any evidence to the contrary
(I suggest checking ByteArrayInputStream's contract).
Comment 5 David Williams CLA 2004-06-25 15:16:07 EDT
Rafael, your suggestion seems to work fine. Where we need it, I create a
"LimitedBufferedInput" stream that returns -1 on any attempt to read past the buffer,
so the original stream is still resettable to the beginning when we (re)use it to
(re)read the contents. Thanks very much for your help (and please just ignore my
general grumpiness :)
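
A rough sketch of how that wiring might look, assuming the hypothetical
LimitedInputStream from the sketch under comment 1 and the Eclipse 3.0
content-type entry point (Platform.getContentTypeManager().getDescriptionFor);
the class name and the 2 KB limit are illustrative only:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.eclipse.core.runtime.Platform;
import org.eclipse.core.runtime.content.IContentDescription;

public class DescribeThenReread {
	private static final int PEEK_LIMIT = 2048; // how far content description may peek

	public static IContentDescription describe(String path) throws IOException {
		InputStream buffered = new BufferedInputStream(new FileInputStream(path), PEEK_LIMIT);
		buffered.mark(PEEK_LIMIT);
		// Hand the framework a capped view so it can never read past our mark.
		IContentDescription description = Platform.getContentTypeManager().getDescriptionFor(
				new LimitedInputStream(buffered, PEEK_LIMIT), path, IContentDescription.ALL);
		buffered.reset(); // the buffered stream is back at the beginning
		// ... buffered can now be re-read from the start for the real work
		// (closing is omitted for brevity) ...
		return description;
	}
}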

It might be nice to have an option/parameter on the API so clients could specify a
limit to "peeking", but that would be such a distant enhancement that you're welcome
to dispose of this defect however you see fit.

And, BTW, I know of people on our WSAD team who will love being able to read the whole file to
look for certain characteristics, so you've made that possible for them. (And I
say "whole file", instead of "as much as they need to", since what they are looking
for may or may not be there, so they'd have to do an exhaustive search.)

Thanks again.
Comment 6 Rafael Chaves CLA 2004-06-25 15:59:18 EDT
Original summary: "content description does not reset input stream"

Agreed, we could have an API that allows clients to limit the number of bytes/chars
that can be read. Unfortunately, AFAIK, such a thing will be hard to get even
for 3.0.1, since we are not supposed to introduce API changes in service
releases. I am changing the subject of this PR so we can handle this issue here.

Regarding clients taking advantage of the lack of a limit to read the whole
file... that is not a legitimate use of that API and it is not a recommended
practice - describers are supposed to be as quick as possible in their work,
although that is not enforced by the platform. There is a good chance the
resulting product will show poor performance or high memory consumption
if describers take too long to run or fill content descriptions with too much data.
Comment 7 Rafael Chaves CLA 2004-10-07 12:08:38 EDT
The API is already too big for what it does, and adding a second version of
every method that takes an input stream would make it even more complicated. It
seems the most flexible way of tweaking how the content type framework reads the
input streams handed to it is precisely to provide customized input streams.

Closing as wontfix.