| Summary: | [web connector] Make regexp parsing of web pages more flexible | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | z_Archived | Reporter: | Jörg Thönnes <jtk499> | ||||||
| Component: | Mylyn | Assignee: | Eugene Kuleshov <ekuleshov> | ||||||
| Status: | RESOLVED FIXED | QA Contact: | |||||||
| Severity: | enhancement | ||||||||
| Priority: | P3 | CC: | mik.kersten | ||||||
| Version: | 2.0 M2 | ||||||||
| Target Milestone: | --- | ||||||||
| Hardware: | All | ||||||||
| OS: | All | ||||||||
| Whiteboard: | |||||||||
| Bug Depends on: | |||||||||
| Bug Blocks: | 189788 | ||||||||
| Attachments: |
|
||||||||
Finally someone asked for that! :-) (In reply to comment #1) > Finally someone asked for that! :-) Eugene, actually I have this idea for months now, but it is the first time I posted it. In the beginning, I was thinking about some sort of JDBC connector, because our self-made issue database is based on PostgreSQL and has a very simple web frontend. IMHO, this would also make the implementation of bug 189788: [web connector] Allow more task properties to be set using fixed mappings much easier. Eugene, please provide some Mylar context as a starting point. Thanks. Joerg, as a first step I'd like to make a suggestion about regular expression syntax.
We already using ${varName} to substitute variables. So, it may make sense to use something like (${varName}) to declare parsing group.
Then you can replace all (${varName}) into (.+?) before compiling regexp and also keep map of those variables to group numbers. Then you can use that map after parsing to map groups back to names. This way we won't need any special UI and can also easily specify what variables are allowed there. So, potential field candidates are (based on the present Mylar's object model - AbstractRepositoryTask):
task.id
task.summary
task.isCompleted
task.status
task.priority
task.kind
task.owner
task.reporter
task.creationDate
task.completionDate
task.dueDate
Created attachment 69618 [details]
mylar/context/zip
(In reply to comment #5) Thanks for the context. This was a great helped. I was looking for the connector inside org.eclipse.mylar.web.*... (In reply to comment #4) > Joerg, as a first step I'd like to make a suggestion about regular expression > syntax. > We already using ${varName} to substitute variables. So, it may make sense to > use something like (${varName}) to declare parsing group. [...] There must be some misunderstanding. Looking at WebRepositoryConnector.performQuery I would like to change the following: - Allow any number of matching groups (matcher.groupCount()) -- even nothing found??? - Assign every matcher,group(n) to ${query.group.n} variable How can I assign a new ${...} variable? (In reply to comment #4) > We already using ${varName} to substitute variables. So, it may make sense to > use something like (${varName}) to declare parsing group. > Then you can replace all (${varName}) into (.+?) before compiling regexp and > also keep map of those variables to group numbers. Then you can use that map > after parsing to map groups back to names. This way we won't need any special UI > and can also easily specify what variables are allowed there. OK, now I got it. But this is less flexible: I cannot e.g. assign task.summary := FIXED TEXT ${query.group.2} FIXED_TEXT ${query.group.1} ... I admit that you have to provide a UI field for every task.* attribute with predefined default value. But your suggestion sounds a bit unhandy and introduces a complex syntax. I am not convinced that it is a good idea to introduce such scripting/templating/remapping for individual fields. You should be able to extract id, summary, etc as is and fixed text can only came from the web page. Syntax that I am suggesting look better to me, because you can see which field it will be mapped to right away, and that is an improvement to the current regexps. For comparison, query.group.x won't mean anything inside query and like you noticed, it will require additional UI to deal with that. Also note that we should not allow configurations that can't return task.id and task.summary. So, those two fields are mandatory. OK, so please describe the syntax in more details. If I have something like
...(issue[A-Z]\d+)...
how would this be assigned to task.id?
You need to use anchors/markers outside of issue, i.e. <td style="issueId">(${task.id})</td>
(In reply to comment #10) > You need to use anchors/markers outside of issue, i.e. <td > style="issueId">(${task.id})</td> That means you are making assumptions about the format of the web interface. IMHO, this is far too restrictive for a generic web interface. In our case, we are in control of our web interface so we could make the web interface fit the web connectors assumptions. But is this the right way to go? Actually, on a second thought I take it back. We need phyton-like syntax [1]. Expression would look like <td .+?>({task.id}issue[A-Z]\d+)</td> Then you can just cut variable in {} off the (..) declaration and like before run regular regexp and then match mapping back.
Note that it should handle case like \({aaa}, when {aaa} is not a variable since one "(" is masked with "\".
[1] http://jregex.sourceforge.net/gstarted-advanced.html#ngroups
(In reply to comment #12) > [1] http://jregex.sourceforge.net/gstarted-advanced.html#ngroups Do you intend to add this library to the web connector plugin? This would be the simplest way. Or is this too difficult and creates too many dependencies? Cheers, Jörg The Eclipse IP review process means that we have to carefully select dependencies that we want to request adding, and the dependencies that we do request need to be under a compatible license (e.g. EPL or APL). So it is probably worth exploring an approach other than adding [1], since that would take 6 weeks after the Europa release minimum. (In reply to comment #13) > Do you intend to add this library to the web connector plugin? This would be the > simplest way. Or is this too difficult and creates too many dependencies? We could in theory, but it is not feasible from the practical point of view for reasons Mik explained. So, I think it is the best to build it around standard Java regexps. Like I described before, remove vars from regexp string, but keep map of wars to regexp groups (that part is probably is most complicated and it can be done outside of Eclipse and should have some unit tests), then you can use that map to match parsed groups with variables. OK, I did not think about the IP process. Actually, it is not too difficult to implement.
The matching groups are counted by the opening parenthesis, I have to avoid "\(" as Eugene said
and map the group number of every "({id}" to the id.
But at the other side I am still worried about the increased complexity of the regexp.
Splitting the task attribute generaton into a matching step and a assembling step would support
more scenarios.
E.g. we have the fields category and title in our issue database. I would like to join the in the summary:
category: title
I cannot do that with the current regexps.
In summary, the _generic_ web connector should be as generic as possible.
(In reply to comment #16) > In summary, the _generic_ web connector should be as generic as possible. I disagree on that. It should be simple enough to use and too much flexibility and hard to use UI is not necessarily a good thing. However you should be able to build your own extensions on top of generic web connector. (In reply to comment #16) > The matching groups are counted by the opening parenthesis, I have to avoid "\(" > as Eugene said and map the group number of every "({id}" to the id. You also have to watch for the nested groups and make sure that they are accounted in the same way as Java regexp does, also (?...) groups aren't captured. Still not relatively straightforward. See http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html#cg (In reply to comment #17) > I disagree on that. It should be simple enough to use and too much flexibility > and hard to use UI is not necessarily a good thing. In this specific case, just another GUI element like the ParameterEditor would be sufficient. > However you should be able to build your own extensions on top of generic web > connector. Sure. But then I would not like to call it "generic", rather "example." Anyway, I will try to implement some regexp syntax as you suggested. (In reply to comment #18) > (In reply to comment #16) > > The matching groups are counted by the opening parenthesis, I have to avoid > "\(" > > as Eugene said and map the group number of every "({id}" to the id. > You also have to watch for the nested groups and make sure that they are > accounted in the same way as Java regexp does, They are counted by opening parentheses, e.g. ...({a}({b}({c}...)...)...) is assigned as a:= $1; b:= $2; c:= $3 > also (?...) groups aren't > captured. Good point. > Still not relatively straightforward. See > http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html#cg I may also have a look at the JRegexp implementation to get some hints. OK, will implement some utility class including unit test first. Please advise where the utility class and the associated unit test should live. I have no idea of your package conventions. Thanks, Jörg I am taking it in I've implemented NamedPattern wrapper, hooked it into query parser and updated few packaged templates to fetch assignee and status. Current implementation is hardcoded to guess status by one of the common words: completed fixed resolved invalid verified deleted closed. More advanced way to detect completed task is discussed on bug 194496. This fix adds support for named regexps using syntax like ({name}.+?). Current implementation recognize group names: Id, Description, Owner, Type and Status. Other fields of the AbstractTask need either value substitution or customizable date formats and will have to be handled on bug 189788 Anyways, I think it is a big improvement. Now it can retrieve assignee and status (so no need to manually manage completed status). Created attachment 73789 [details]
mylyn/context/zip
(In reply to comment #22) > I am taking it in Thanks, Eugene. Actually, I just returned from holiday and was lacking the time before. Cheers, Jörg (In reply to comment #23) > I've implemented NamedPattern wrapper, hooked it into query parser and updated > few packaged templates to fetch assignee and status. Current implementation is > hardcoded to guess status by one of the common words: completed fixed resolved > invalid verified deleted closed. Actually we use "done" here, could you add this to the hardcoded status guess? Thanks, Jörg (In reply to comment #26) > Actually we use "done" here, could you add this to the hardcoded status guess? Done (In reply to comment #27) > (In reply to comment #26) > > Actually we use "done" here, could you add this to the hardcoded status guess? > > Done > It seems to me that, at a minimum, these strings should be moved to a resource file that can be edited by the end user so that he/she can add any string needed. (In reply to comment #28) > It seems to me that, at a minimum, these strings should be moved to a resource > file that can be edited by the end user so that he/she can add any string > needed. No plans to do that. However there is a bug 189788 that suggesting a mapping facility |
Currently, there is one regexp which has to return ${task.id} in the first match and ${task.description} in the second one. This could be made more flexible in this way: 1. Any regexp is allowed with any number of capturing groups (...) 2. Any capturing group can be referenced by e.g. ${query.group.1}... 3. For both the task id and the task description a string composed of any characters and references to variables including the capturing groups ${query.group.1}... can be build (form input field) E.g. Task ID: "ID ${query.group.2}" Task Description: "${query.group.3}: ${query.group.1}" In this way, access to other task properties (e.g. status, priority etc.) could be defined.