Some Eclipse Foundation services are deprecated, or will be soon. Please ensure you've read this important communication.

Bug 189787

Summary: [web connector] Make regexp parsing of web pages more flexible
Product: z_Archived Reporter: Jörg Thönnes <jtk499>
Component: MylynAssignee: Eugene Kuleshov <ekuleshov>
Status: RESOLVED FIXED QA Contact:
Severity: enhancement    
Priority: P3 CC: mik.kersten
Version: 2.0 M2   
Target Milestone: ---   
Hardware: All   
OS: All   
Whiteboard:
Bug Depends on:    
Bug Blocks: 189788    
Attachments:
Description Flags
mylar/context/zip
none
mylyn/context/zip none

Description Jörg Thönnes CLA 2007-05-29 17:00:37 EDT
Currently, there is one regexp which has to return ${task.id} in the first match and ${task.description}
in the second one. This could be made more flexible in this way:

1. Any regexp is allowed with any number of capturing groups (...)
2. Any capturing group can be referenced by e.g. ${query.group.1}...
3. For both the task id and the task description a string composed of any characters and references to variables
     including the capturing groups ${query.group.1}... can be build (form input field)

E.g.

	Task ID: "ID ${query.group.2}"
	Task Description: "${query.group.3}: ${query.group.1}"

In this way, access to other task properties (e.g. status, priority etc.) could be defined.
Comment 1 Eugene Kuleshov CLA 2007-05-29 18:51:22 EDT
Finally someone asked for that! :-)
Comment 2 Jörg Thönnes CLA 2007-05-30 02:56:02 EDT
 (In reply to comment #1)
> Finally someone asked for that! :-)

Eugene, actually I have this idea for months now, but it is the first time I posted it.

In the beginning, I was thinking about some sort of JDBC connector, because our
self-made issue database is based on PostgreSQL and has a very simple web frontend.

IMHO, this would also make the implementation of bug 189788: [web connector] Allow more task properties to be set using fixed mappings
much easier.
Comment 3 Jörg Thönnes CLA 2007-05-31 16:49:33 EDT
Eugene, please provide some Mylar context as a starting point. Thanks.
Comment 4 Eugene Kuleshov CLA 2007-05-31 17:13:48 EDT
Joerg, as a first step I'd like to make a suggestion about regular expression syntax.

We already using ${varName} to substitute variables. So, it may make sense to use something like (${varName}) to declare parsing group.
Then you can replace all (${varName}) into (.+?) before compiling regexp and also keep map of those variables to group numbers. Then you can use that map after parsing to map groups back to names. This way we won't need any special UI and can also easily specify what variables are allowed there. So, potential field candidates are (based on the present Mylar's object model - AbstractRepositoryTask): 

task.id
task.summary
task.isCompleted
task.status
task.priority
task.kind
task.owner
task.reporter
task.creationDate
task.completionDate
task.dueDate
Comment 5 Eugene Kuleshov CLA 2007-05-31 17:13:52 EDT
Created attachment 69618 [details]
mylar/context/zip
Comment 6 Jörg Thönnes CLA 2007-05-31 18:29:05 EDT
 (In reply to comment #5)

Thanks for the context. This was a great helped. I was looking for the connector inside org.eclipse.mylar.web.*...

 (In reply to comment #4)
> Joerg, as a first step I'd like to make a suggestion about regular expression
> syntax.
> We already using ${varName} to substitute variables. So, it may make sense to
> use something like (${varName}) to declare parsing group. [...]

There must be some misunderstanding. Looking at WebRepositoryConnector.performQuery
I would like to change the following:

- Allow any number of matching groups (matcher.groupCount()) -- even nothing found???
- Assign every matcher,group(n) to ${query.group.n} variable

How can I assign a new ${...} variable?

Comment 7 Jörg Thönnes CLA 2007-05-31 18:52:26 EDT
 (In reply to comment #4)
> We already using ${varName} to substitute variables. So, it may make sense to
> use something like (${varName}) to declare parsing group.
> Then you can replace all (${varName}) into (.+?) before compiling regexp and
> also keep map of those variables to group numbers. Then you can use that map
> after parsing to map groups back to names. This way we won't need any special UI
> and can also easily specify what variables are allowed there.

OK, now I got it. But this is less flexible: I cannot e.g. assign

  task.summary := FIXED TEXT ${query.group.2} FIXED_TEXT ${query.group.1} ...

I admit that you have to provide a UI field for every task.* attribute with predefined default value.
But your suggestion sounds a bit unhandy and introduces a complex syntax.
Comment 8 Eugene Kuleshov CLA 2007-05-31 21:52:20 EDT
I am not convinced that it is a good idea to introduce such scripting/templating/remapping for individual fields. You should be able to extract id, summary, etc as is and fixed text can only came from the web page. Syntax that I am suggesting look better to me, because you can see which field it will be mapped to right away, and that is an improvement to the current regexps. For comparison, query.group.x won't mean anything inside query and like you noticed, it will require additional UI to deal with that.

Also note that we should not allow configurations that can't return task.id and task.summary. So, those two fields are mandatory.
Comment 9 Jörg Thönnes CLA 2007-06-01 00:35:44 EDT
OK, so please describe the syntax in more details. If I have something like

    ...(issue[A-Z]\d+)...
    
how would this be assigned to task.id?
Comment 10 Eugene Kuleshov CLA 2007-06-01 01:55:43 EDT
You need to use anchors/markers outside of issue, i.e. <td style="issueId">(${task.id})</td>
Comment 11 Jörg Thönnes CLA 2007-06-01 02:00:24 EDT
 (In reply to comment #10)
> You need to use anchors/markers outside of issue, i.e. <td
> style="issueId">(${task.id})</td>

That means you are making assumptions about the format of the web interface.
IMHO, this is far too restrictive for a generic web interface. In our case, we are in control
of our web interface so we could make the web interface fit the web connectors assumptions.
But is this the right way to go?
Comment 12 Eugene Kuleshov CLA 2007-06-01 03:28:32 EDT
Actually, on a second thought I take it back. We need phyton-like syntax [1]. Expression would look like <td .+?>({task.id}issue[A-Z]\d+)</td>  Then you can just cut variable in {} off the (..) declaration and like before run regular regexp and then match mapping back. 
Note that it should handle case like \({aaa}, when {aaa} is not a variable since one "(" is masked with "\".

[1] http://jregex.sourceforge.net/gstarted-advanced.html#ngroups
Comment 13 Jörg Thönnes CLA 2007-06-01 11:47:41 EDT
 (In reply to comment #12)
> [1] http://jregex.sourceforge.net/gstarted-advanced.html#ngroups

Do you intend to add this library to the web connector plugin? This would be the simplest way.

Or is this too difficult and creates too many dependencies?

Cheers, Jörg
Comment 14 Mik Kersten CLA 2007-06-01 12:30:23 EDT
The Eclipse IP review process means that we have to carefully select dependencies that we want to request adding, and the dependencies that we do request need to be under a compatible license (e.g. EPL or APL).  So it is probably worth exploring an approach other than adding [1], since that would take 6 weeks after the Europa release minimum.
Comment 15 Eugene Kuleshov CLA 2007-06-01 12:48:39 EDT
(In reply to comment #13)
> Do you intend to add this library to the web connector plugin? This would be the
> simplest way. Or is this too difficult and creates too many dependencies?

We could in theory, but it is not feasible from the practical point of view for reasons Mik explained. So, I think it is the best to build it around standard Java regexps. Like I described before, remove vars from regexp string, but keep map of wars to regexp groups (that part is probably is most complicated and it can be done outside of Eclipse and should have some unit tests), then you can use that map to match parsed groups with variables.
Comment 16 Jörg Thönnes CLA 2007-06-01 16:39:19 EDT
OK, I did not think about the IP process. Actually, it is not too difficult to implement.
The matching groups are counted by the opening parenthesis, I have to avoid "\(" as Eugene said
and map the group number of every "({id}" to the id.

But at the other side I am still worried about the increased complexity of the regexp.
Splitting the task attribute generaton into a matching step and a assembling step would support
more scenarios.

E.g. we have the fields category and title in our issue database. I would like to join the in the summary:

    category: title
    
I cannot do that with the current regexps.

In summary, the _generic_ web connector should be as generic as possible.
Comment 17 Eugene Kuleshov CLA 2007-06-01 17:10:54 EDT
(In reply to comment #16)
> In summary, the _generic_ web connector should be as generic as possible.

I disagree on that. It should be simple enough to use and too much flexibility and hard to use UI is not necessarily a good thing.
However you should be able to build your own extensions on top of generic web connector.
Comment 18 Eugene Kuleshov CLA 2007-06-01 17:17:42 EDT
(In reply to comment #16)
> The matching groups are counted by the opening parenthesis, I have to avoid "\("
> as Eugene said and map the group number of every "({id}" to the id.

You also have to watch for the nested groups and make sure that they are accounted in the same way as Java regexp does, also (?...) groups aren't captured. Still not relatively straightforward. See http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html#cg
Comment 19 Jörg Thönnes CLA 2007-06-01 17:20:57 EDT
 (In reply to comment #17)
> I disagree on that. It should be simple enough to use and too much flexibility
> and hard to use UI is not necessarily a good thing.

In this specific case, just another GUI element like the ParameterEditor would be sufficient.

> However you should be able to build your own extensions on top of generic web
> connector.

Sure. But then I would not like to call it "generic", rather "example."

Anyway, I will try to implement some regexp syntax as you suggested.
Comment 20 Jörg Thönnes CLA 2007-06-01 17:24:06 EDT
 (In reply to comment #18)
> (In reply to comment #16)
> > The matching groups are counted by the opening parenthesis, I have to avoid
> "\("
> > as Eugene said and map the group number of every "({id}" to the id.
> You also have to watch for the nested groups and make sure that they are
> accounted in the same way as Java regexp does,

They are counted by opening parentheses, e.g.

   ...({a}({b}({c}...)...)...)
   
is assigned as

   a:= $1; b:= $2; c:= $3

> also (?...) groups aren't
> captured.

Good point.

> Still not relatively straightforward. See
> http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html#cg

I may also have a look at the JRegexp implementation to get some hints.
Comment 21 Jörg Thönnes CLA 2007-06-05 17:28:44 EDT
OK, will implement some utility class including unit test first.

Please advise where the utility class and the associated unit test should live.
I have no idea of your package conventions.

Thanks, Jörg
Comment 22 Eugene Kuleshov CLA 2007-07-14 01:33:31 EDT
I am taking it in
Comment 23 Eugene Kuleshov CLA 2007-07-14 02:07:11 EDT
I've implemented NamedPattern wrapper, hooked it into query parser and updated few packaged templates to fetch assignee and status. Current implementation is hardcoded to guess status by one of the common words: completed fixed resolved invalid verified deleted closed. More advanced way to detect completed task is discussed on bug 194496.

This fix adds support for named regexps using syntax like ({name}.+?). Current implementation recognize group names: Id, Description, Owner, Type and Status. Other fields of the AbstractTask need either value substitution or customizable date formats and will have to be handled on bug 189788

Anyways, I think it is a big improvement. Now it can retrieve assignee and status (so no need to manually manage completed status).
Comment 24 Eugene Kuleshov CLA 2007-07-14 02:07:15 EDT
Created attachment 73789 [details]
mylyn/context/zip
Comment 25 Jörg Thönnes CLA 2007-07-15 09:24:13 EDT
(In reply to comment #22)
> I am taking it in

Thanks, Eugene. Actually, I just returned from holiday and was lacking the time before.

Cheers, Jörg
Comment 26 Jörg Thönnes CLA 2007-07-16 10:10:59 EDT
(In reply to comment #23)
> I've implemented NamedPattern wrapper, hooked it into query parser and updated
> few packaged templates to fetch assignee and status. Current implementation is
> hardcoded to guess status by one of the common words: completed fixed resolved
> invalid verified deleted closed.

Actually we use "done" here, could you add this to the hardcoded status guess?

Thanks, Jörg
Comment 27 Eugene Kuleshov CLA 2007-07-16 10:23:38 EDT
(In reply to comment #26)
> Actually we use "done" here, could you add this to the hardcoded status guess?

Done
Comment 28 Mike Snare CLA 2007-09-19 11:32:13 EDT
(In reply to comment #27)
> (In reply to comment #26)
> > Actually we use "done" here, could you add this to the hardcoded status guess?
> 
> Done
> 

It seems to me that, at a minimum, these strings should be moved to a resource file that can be edited by the end user so that he/she can add any string needed.
Comment 29 Eugene Kuleshov CLA 2007-09-19 13:10:49 EDT
(In reply to comment #28)
> It seems to me that, at a minimum, these strings should be moved to a resource
> file that can be edited by the end user so that he/she can add any string
> needed.

No plans to do that. However there is a bug 189788 that suggesting a mapping facility