Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 336818

Summary: [BlackBoard] revise API/impl to behave the same independent of the impl/config
Product: z_Archived Reporter: thomas menzel <tmenzel>
Component: SmilaAssignee: Juergen Schumacher <juergen.schumacher>
Status: CLOSED FIXED QA Contact:
Severity: enhancement    
Priority: P3 CC: Andreas.Weber
Version: unspecified   
Target Milestone: ---   
Hardware: PC   
OS: Windows Vista   
Whiteboard:
Attachments:
Description Flags
Patch for revised Blackboard API. First try.
none
Revised Blackboard API, second try
none
comments/notes on the interface methods
none
Blackboard Refactoring, Attempt no. 3 none

Description thomas menzel CLA 2011-02-10 08:28:32 EST
the current API and implementations of the Blackboard have some code smell IMO and i think we shoould revise it. 

this issue servers to track the discussion and points regarding this.


the behavior of getRecord() depends not only on the impl but also the config of the PersitingBlackboard. i have updated the Javadocs regarding this and will check this in at some point in time but really think we should really see that the general behavior is the same.

there is also some naming (e.g. create -> createRecord) that i would change and also create and load should return the record.
Comment 1 Andreas Weber CLA 2011-03-10 06:24:09 EST
just stumbled across the Blackboard API, too.

IMHO the interaction/implementation of the Blackboard methods getRecord(), load() and create() is not really understandable. 

I would suggest the following changes: 

- Remove load() method, resp. replace it by getRecord(). 

- In getRecord() don't create records automatically when not finding them. It should be decided by the client, if a new record should be created afterwards. (Yes we have some more work here for the clients, but I prefer a clearer API over reduction of client API calls.). Or implement a new method if that behaviour is badly needed... 

TransientBlackboardImpl: load() is removed, getRecord() no longer calls load() and so no longer creates a new record if it isn't found. Instead it just returns <null> in that case.

PersistentBlackboardImpl: load() is renamed to getRecord(). The method no longer creates a new record if no record storage is available and the record isn't found. This should just returns <null> which is the general behaviour if a record isn't found.

And yes, +1 for a more general naming. I also don't like the name "create" for a method that in fact creates a new record only if it doesn't find an existing one. Maybe this should be changed to "createRecordIfNotExists" or sth. like that.
Comment 2 thomas menzel CLA 2011-03-10 06:36:53 EST
a +1 for all ur ideas on this.

what really lies at my hart here is that the external seen behavior of each method should be transparent to the user independent of the config or impl. which means to me that the usage of the store is to be transparent and just a matter of the config.

extending that idea i think we should have just one BB impl that plugs in the store, where store could be a JPA impl or just a simple Map held in memory.
Comment 3 Andreas Weber CLA 2011-03-10 07:32:13 EST
I totally agree that we should have the same behaviour for each implementation. 

If this means, that we should only have _one_ BB implementation which is configured dynamically by record/binary storage service availability, well, I don't know. Sounds like a good idea at first sight, maybe someone else can give his opinion...
Comment 4 Juergen Schumacher CLA 2011-08-08 10:19:46 EDT
I'll have a look into this now.
My current plan is to have only one Blackboard implementation and one factory method "createBlackboard(boolean useRecordStorage, boolean useBinaryStorage)" so that the client itself can decide which level of persistence it needs. The implementation would of course behave the same regardless of which storages are attached. I'll check how far we get with this and report here before I'll commit this to svn ...
Comment 5 thomas menzel CLA 2011-08-09 02:42:04 EDT
ah, great. dont be shy to open a branch for this ;)
Comment 6 Juergen Schumacher CLA 2011-08-09 04:10:01 EDT
Created attachment 201122 [details]
Patch for revised Blackboard API. First try.
Comment 7 Juergen Schumacher CLA 2011-08-09 04:10:10 EDT
I'm not too shy, i'm just too lazy for a branch (-:
Anyway, I don't want to take this longer than necessary, so posting patches against current trunk should suffice for now. After short discussion we should be able to settle this in trunk ... I hope (-;
So here is a first shot (blackboard-r1436.patch)
- Merge the blackboard implementations into one with optional record/binary storages.
- Add new factory method createBlackboard(boolean, boolean) to select which storages are attached. createTransient...() is now the same as createBlackboard(false, false), while createPersistent...() is the same as create...(true, true).
- getRecord(id) returns null, if the record does not exists on blackboard or in storage
- createRecord(id) renamed to ensureRecord(id): create record on blackboard if not yet loaded. Doesn't check record storage, if attached. I'm not sure if this make sense.
- Changed some methods to return null, if the record/attachment does not exists instead of throwing exceptions. I'm not sure if it's completely consistent yet, so there's probably more to change.
- Some more renaming like commit(id) -> commitRecord(id) etc.
With this patch all tests run successful again. Most adaptations were done automatically, I had to fix just a few places where exceptions were expected instead of null return values.
So ... opinions?
Comment 8 thomas menzel CLA 2011-08-09 05:55:49 EDT
i cant use the patch/it is not complete. it thinks there is a BlackboardImpl to be patched which i dont have...
Comment 9 Juergen Schumacher CLA 2011-08-09 06:07:53 EDT
Created attachment 201126 [details]
Revised Blackboard API, second try

Ok, here's an update. I deleted the old BlackboardImpls and added the new one from scratch, now svn diff seems to get it right.
I added some null checks for record and id arguments to make behaviour more consistent.
Comment 10 thomas menzel CLA 2011-08-09 09:10:04 EDT
Created attachment 201138 [details]
comments/notes on the interface methods

i have commented on the interface methods found in the last patch. however, i think that we need to actually think deeper if the current overall design of the BB is OK, which is why I didnt continue to make annotations in the file but continue the discussion here:

issue 1: 
why do we need to expose methods at all that have smth, todo with the sync'ing of the in-mem copy to the underlying DB.

i would have thought that the BB is a wrapper to the more specific storages and that as a user (and i also mean one of us) i dont need to worry about sync'ing the record, the BB should be able to do this by itself (after i committed my change of course). in that sense it also acts as a cache and we should have means to limit the amount of objects held in memory and do auto-flushing and all the rest of that fun stuff. i know that this is lot of more work, but in the end i dont think we end up with a good design w/o it.

issue 2:
this is more broader but impacts the BB. i can config the pipelining of the ETL stages to use a transient blackboard with sync=false. however, when i do this i cant use attachments, as these dont ever get serialized and hence get lost on destruction of the BB. as a user of the BB this is not transparent and i need to know about this when desinging the config/ETL process.

well, so far. maybe that kicks off some thoughts with others as well and they pitch in. my problem is that i have too little knowledge about all the interal stuff of the BB itself and how it is used by the routers/workes and other payers in smila. hence, the things i have written can be taken as a view from an outsider.
Comment 11 Juergen Schumacher CLA 2011-08-09 09:37:27 EDT
I did not yet read annotations in the file, but only the comment, so:
on issue 1: Yes, we probably don't need to expose the commitRecord(id) and invalidateRecord(id) methods, usually it's sufficient to have commit() and invalidate() and that they are called by a "framework component", e.g. a queue listener or a worker in the new job concept, or the connectivity/crawler/agent managers. It's not necessary for a pipelet/crawler/agent/worker implementor to call any of these methods. But the framework component needs to be able to call them to "commit" or "rollback" the "transaction" (yes, I know, the blackboard is far from being transactional). Maybe we could separate these interfaces, but for the moment I would try to "keep it simple" and solve this by documentation.
on issue 2: Don't see the problem with this one. ETL frames will probably use a BB with at least binstorage attached. On the other hand, in search processing we usually have only small attachments (if any) and no need for record persistence, so we use a pure "in-memory" BB. Yes, BB usage should be transparent for the "user" of the BB (think: crawlers/agents/pipelets/workers) in the sense that the user reads and writes the blackboard without having to care about persistence. I thought that was the target of the whole issue. The "frame" then needs to decide what happens with the result records: If they are to be persisted, or to be taken from the BB completely and sent to somewhere else, or just thrown away. This "dataflow" has to be configured, of course, and one part of this configuration is when what should be persisted and where.
Comment 12 Juergen Schumacher CLA 2011-08-09 09:55:51 EDT
On your annotations in Blackboard.java: 
- ensureRecord(): I'm not too happy with that method either. Maybe we should have "createRecord()" which creates a new empty record regardless of versions in storage, and "ensureRecord()" that first tries to load the record from storage and only creates a new one if it does not exist?
- copyRecord(): currently an existing version of the record would be overwritten.
- syncRecord(): the in-mem version of the record on the record is currently never a filtered version, but complete. Blackboard implementations that support partial record loading would have to care about that this doesn't delete attributes, of course.
- the commit methods can probably leave the record on the blackboard, as the blackboard instance is removed after the processing anyway
- the parts about "records locked in database": bad javadoc (I did not yet review that completely), there is no locking currently 
- getAttachmentAsFile(): the file should be cleaned up by the blackboard on commit/invalidate.
Hope this helps.
Comment 13 Andreas Weber CLA 2011-08-09 11:22:35 EDT
My two cents: First, it's already much better than before ;) From a general perspective I would like to see all blackboard (record) update methods working in memory only. The synching with an underlying persistence storage should be down by explicit calls of the commit method(s).

For the methods mentioned above this would mean:
- commit()/commitRecord(): as agreed, records shouldn't be removed here. 
- copyRecord(): do we need this at all? Or would it be enough to let the user copy records with a helper class and then let the user set(overwrite) it on the blackboard if needed.
- ensureRecord(): from my understanding the name would imply that this also checks storage for existence. That's ok because it doesn't change the storage. It's just a shortcut for getRecord() + createRecord(). 
- removeRecord(): should not delete record in storage. Record in storage should be removed when commit() is called. 

BTW, do we really use/need those Notes (setRecordNote() etc.)?
Comment 14 thomas menzel CLA 2011-08-09 12:33:13 EDT
ok, i understand that a complete overhaul is not possible/wanted time-wise and i'm fine with this. i just wanted to know if it makes sense to you as well, or not . If it does: we should keep this is mind - also the lack of ACID-ness of the BB.

- ensureRecord() this sounds like it would return a boolean. Hence, if we need/want it,  i would do this as a an overload to getRecord(id, boolean create). 
this would in all cases also check the underlying store if any and when create ==false return null on a miss; if true create a new empty record for the id. the new record is only created in memory until comitted.

- removeRecord(): hm, still not obvios from the name that this just removes it from memory but not from store. how about unload() here? that is quite telling but we get the load term in again. "evict" is another term that i know from caching libs and would somewhat fit here too.

- getAttachmentAsFile(): i dont know if this will in cases get rid of the file. to me it would also be fine if the caller is repsonsible to remove it. also the jdoc should tell where the file is created (if not descretionary to the impl) or the method could evene provide means to control this.
Comment 15 Juergen Schumacher CLA 2011-08-10 05:20:15 EDT
Some comments on Thomas'  comments:
- ensureRecord(): I'm not sure if it would be nicer to have two methods (with good names :-) or one method with a parameter. In general I do not like boolean parameters too much: You don't recognize immediately in the caller code what is done, you always have to check "what means true and what means false". I could do it with an enum parameter, then we would have even more possibilities: getRecord(id, {LOAD_ONLY, CREATE_IF_NOT_EXISTS, CREATE_NEW_ALWAYS, ... ?}), but I would add a getRecord(id) that use one of these (NO_CREATE?) as default.
- removeRecord() should actually remove the record from storage, not just unload it. Maybe I should rename the invalidate() methods to unload() - invalidate sounds like it does more than just remove it from memory. 
- getAttachmentAsFile(): letting the caller care about deletion would be easier for the blackboard - I think, the current blackboard code caring about this is not very nice, and if the blackboard is just finalized without being explicitly close, the files remain. So it would be fine with me to move this responsibility to the caller. About the location: It should be on a local hard disk. Everything else should be left to be decided by the blackboard implementation. If the caller wants to have more control it can get the attachment as a stream and write it to it's own file.

And @Andreas:
- commit(): OK, they will not unload the records.
- copyRecord(): OK, I'll remove it.
- removeRecord(): OK, I'll change it to mark the record as deleted only and to the removal from storage in commit.
Comment 16 thomas menzel CLA 2011-08-10 05:38:37 EDT
ensureRecord() -> getRecord(...) 
+1, like ur suggestion with enum

invalidate() rename to unload(): 
+1

getAttachmentAsFile() 
+1
if we move to client having to delete this, then we *must* make this very apparent in a migration guide!

location: ur right here, so scrap my idea.

come to think: what is the use/contract of this method anyhow WRT to the returned file? 
is this just a time saver and each time a new temp file is created? OR would this at least recognize that 2 diff. clients request the same attachment and then return the same file - maybe this just limited to the same thread? 
what are the allowed usages of the file? may i write to it? or could it be that an impl. decides to return the live/stored file of the bin store, e.g. in case of using the FS bin store? then writing to this file should be prevented.!
OR shall we leave all those aspects undefined and refer in the javadoc that the behavior depends on the impl. of the BB/store in use?
Comment 17 Andreas Weber CLA 2011-08-10 06:07:41 EDT
getAttachmentAsFile() - an addition to Thomas' thoughts: If this is just a shortcut/time saver for the user then we should think if this is really something the Blackboard itself should deal with. Maybe it's better to have a smaller but clearer API here. So, do we really need the method?
Comment 18 Juergen Schumacher CLA 2011-08-10 07:00:08 EDT
getAttachmentAsFile() - Originally the idea of this method was to support workers that need to access an attachment as a file (think of some external executable that transforms binary documents to text) and to give the blackboard/binarystorage the possibility to do this without having to copy the attachment from storage to a temp file, if the storage stores attachment as separate files anyway and could provide a File instance to it. And in such a case the client of course *must not* delete the file on its own. But such an optimization was never implemented. So indeed, it seems to make sense to remove this method from the blackboard completely, and leave it to the caller to create a local temp file using the attachment stream and care about cleanup itself. It sounds like a good idea to implement this in a helper class (: Thus, I can remove this method for now? In SMILA itself it's only used in tests anyway.
Comment 19 thomas menzel CLA 2011-08-10 07:03:14 EDT
+1 for removing
Comment 20 Juergen Schumacher CLA 2011-08-10 10:48:04 EDT
Created attachment 201246 [details]
Blackboard Refactoring, Attempt no. 3

Here we go again ...
I hope this reflects the agreements we reached in the comments (:
Tests were green with this patch applied to rev. 1471.
If nobody has objections for now I would like to commit this to SVN tomorrow. That does not mean that it's carved in stone then, of course. 

An attachment-to-tempfile helper is not yet included.
Comment 21 Juergen Schumacher CLA 2011-08-11 06:14:02 EDT
Patch applied: rev 1473. However, i'll keep this issue open for a few days in case something needs to be discussed (;
Comment 22 Andreas Weber CLA 2012-12-19 10:14:08 EST
"a few days" are over :)  so I close that issue.