The OmniVersion format cannot handle the standard PHP version format, which attaches special meaning to some strings. The rule used by PHP's version_compare is:

    any-string < "dev" < "a" == "alpha" < "b" == "beta" < "rc" == "RC" < "#" < "pl" == "p"

This could be solved in OmniVersion by supporting enumerators. I suggest that enum supports an optional fallback to a second format, typically s or a. There should also be a way to specify whether enum strings are case sensitive.

Here is an example:

    1.2.3foo1 < 1.2.3a1 == 1.2.3alpha1

For reference see http://php.net/manual/en/function.version-compare.php
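To make the ordering above concrete, here is a minimal Python sketch of a rank table for the special strings. The names PHP_RANKS, php_rank, and compare_tokens are made up for illustration, and the sketch compares single string tokens by rank only (so two unknown strings compare as equal here); it is not a reimplementation of version_compare.

```python
# Rank table for the special strings, following the ordering quoted above.
# Rank 0 is reserved for "any other string" (an assumption of this sketch).
PHP_RANKS = {
    "dev": 1,
    "a": 2, "alpha": 2,
    "b": 3, "beta": 3,
    "rc": 4, "RC": 4,
    "#": 5,
    "p": 6, "pl": 6,
}


def php_rank(token):
    """Return the rank of a string token; unknown strings sort lowest."""
    return PHP_RANKS.get(token, 0)


def compare_tokens(left, right):
    """Three-way comparison (-1, 0, 1) of two string tokens by rank only."""
    l, r = php_rank(left), php_rank(right)
    return (l > r) - (l < r)
```

With this table, compare_tokens("foo", "a") is negative and compare_tokens("a", "alpha") is zero, which matches the 1.2.3foo1 < 1.2.3a1 == 1.2.3alpha1 example.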
Additional information: "pl" stands for "patch level", and I believe that "#" is used for "buildnumber"
The raw format needs to support enum, since the ordering is: Integer > Enum > String.

I propose that this is encoded in the raw format as:

    enum : 'e' numeric ;

or, if we want to reduce the risk of two different types of enums comparing as equal, we need to involve a classifier:

    enum : 'e' classifier numeric ;
    classifier : [a-zA-Z_]+ ;

As an example, if the enums for PHP are given the identifier 'php', the encoding of "beta" in raw would be:

    raw: ephp3
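As an illustration of the classifier variant of the grammar, a small Python sketch of encoding and decoding such a raw segment could look like this. The function names encode_enum and decode_enum are hypothetical, and the sketch assumes the numeric part is a plain non-negative integer.

```python
import re

# enum : 'e' classifier numeric ;   classifier : [a-zA-Z_]+ ;
RAW_ENUM = re.compile(r"e([a-zA-Z_]+)(\d+)")


def encode_enum(classifier, ordinal):
    """Build the raw form, e.g. ('php', 3) -> 'ephp3'."""
    if not re.fullmatch(r"[a-zA-Z_]+", classifier):
        raise ValueError("classifier must match [a-zA-Z_]+")
    if ordinal < 0:
        raise ValueError("ordinal must be non-negative")
    return "e{}{}".format(classifier, ordinal)


def decode_enum(raw):
    """Split a raw enum segment into (classifier, ordinal)."""
    match = RAW_ENUM.fullmatch(raw)
    if match is None:
        raise ValueError("not a raw enum segment: " + raw)
    return match.group(1), int(match.group(2))
```

For example, encode_enum("php", 3) yields the ephp3 form used above, and decode_enum reverses it.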
The format specification is probably best done as a processing instruction available for the string, auto, and numeric formats as well as for a new enum (only) format.

Assume =E; is an enum processing rule (to be expanded on below). Then:
- 'a=E;' means auto with enum replacement: numeric input produces a numeric segment; string input is translated to an enum segment if it is a valid enumerator string, else to a string segment.
- 'n=E;' means numeric with enum replacement: numeric input produces a numeric segment; string input is translated to an enum segment if it is a valid enumerator, else the input is in error.
- 's=E;' means string with enum replacement: string input is translated to an enum segment if it is a valid enumerator, else to a string segment.

The =E; processing rule could be expressed as:

    enum-processing : '=' 'E' enum-class? enumerator-list options ';' ;
    enumerator-list : '[' enumerator (',' enumerator)* ']' ;
    enumerator : '[' ENUMCHARSEQ (',' ENUMCHARSEQ)* ']' options | ENUMCHARSEQ ;
    options : ('i' | 'b')* ;

Examples:

    =Ephp[alpha,beta];

Enumerates 'alpha' as raw: ephp1, and 'beta' as raw: ephp2.

    =Ephp[[a,alpha],[b,beta]];

Enumerates 'a' or 'alpha' as raw: ephp1, and 'b' or 'beta' as raw: ephp2.

OPTIONS

The 'i' option makes the comparison case insensitive.

    =Ephp[alpha,beta]i;

Enumerates 'alpha' or 'ALPHA' or 'AlPhA' as raw: ephp1, and 'beta' or 'BETA' etc. as raw: ephp2.

    =Ephp[beta,[rc]i];

Enumerates 'rc' or 'RC' etc. as raw: ephp2.

The 'b' option makes the comparison use 'begins with' in the a or s formats. When a string sequence is detected, the whole sequence is by default matched against the enumerators; if no match is found, the input is not translated to an enumerator. With the 'b' option the comparison is 'non-greedy': the input matches an enumerator if the sequence begins with the enumerator, and the remainder is left to be consumed by the next rule.

    a=Ephp[rc];a?

If input is 'RCa' the result would be the string segment 'RCa'.

    a=Ephp[rc]b;a?

If input is 'RCa' the result would be the enum segment ephp1 followed by the string segment 'a'.

Maybe the 'b' option is not really needed, as the 'auto' format should probably always use this.

One problem with the proposed format is that it becomes quite long if the same enumerators are needed several times in the format. We should perhaps allow declaration and reuse. This could be done by specifying the enumerators once with an identifier, e.g. =Ephp[alpha,beta]; and for subsequent uses of the same enumerator identifier using the short form =Ephp; to mean the same set of enumerators. (Redefinition would not be allowed.)

Another possibly useful option could be 'u' for 'incrementally unique', i.e. when enough input has been seen to be able to select an enumerator, it is matched: input 'r', 'rel', 'relea' would all match the enumerator 'release'.

I doubt that explicit setting of enumerator (raw) values is meaningful; the values would simply be assigned incrementally, starting with 0 or 1 for the first enumerator.

Comments?
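The matching behaviour of the 'i' and 'b' options can be sketched in Python roughly as follows. This is only an illustration under some assumptions: match_enum and its parameters are hypothetical names, ordinals start at 1, synonyms are given as nested lists, and when several enumerators are prefixes of the input the longest one is chosen (the proposal does not say). Since 'RCa' can only match the enumerator 'rc' when case is ignored, the usage below passes ignore_case=True.

```python
def match_enum(text, enumerators, ignore_case=False, begins_with=False):
    """Match text against enumerators (a list of synonym lists).

    Returns (ordinal, remainder) on success, or None when nothing matches.
    Ordinals start at 1 (an assumption of this sketch).
    """
    probe = text.lower() if ignore_case else text
    best = None  # (ordinal, matched_length) of the longest prefix match
    for ordinal, synonyms in enumerate(enumerators, start=1):
        for symbol in synonyms:
            key = symbol.lower() if ignore_case else symbol
            if begins_with:
                # 'b' option: the input only has to begin with the enumerator
                if probe.startswith(key) and (best is None or len(key) > best[1]):
                    best = (ordinal, len(key))
            elif probe == key:
                # default: the whole string sequence must equal the enumerator
                return ordinal, ""
    if best is None:
        return None
    return best[0], text[best[1]:]


# Without 'b' the whole input must match, so 'RCa' is not an enumerator.
print(match_enum("RCa", [["rc"]], ignore_case=True))
# With 'b' the prefix 'RC' matches and 'a' is left for the next rule.
print(match_enum("RCa", [["rc"]], ignore_case=True, begins_with=True))
```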
(In reply to comment #3)
> The format specification is probably best done as a processing instruction
> available for string, auto, and numeric formats as well as for a new enum
> (only) format.

I forgot to include the enum-only format. It is needed if the input is supposed to contain an enumerator and nothing else. For this we could use the format 'e', which would be used as follows:

    e=E[dev,alpha,beta];

Two additional useful options could be 'n', to make the enum produce a raw numeric segment, and 's', to make the enum produce a raw string segment.
I'll look into how this could be implemented.
I think we need to include classifier, ordinal, and identifier in the raw format. How else can we do canonical comparisons? I.e. what rules apply when we compare ephp0 with esomethingelse0? Wouldn't it be good if we could compare 'alpha' == 'alpha' regardless of format?

This is possible if we include both the magnitude and the label, so for =Exx[alpha,beta] we would see raw values like:

    exx0alpha
    exx1beta

whereas for =Ephp[dev,alpha,beta,...] we would see:

    ephp0dev
    ephp1alpha
    ephp2beta
    ...

When comparing, the rule would be:
1. If the classes are the same, use the ordinal.
2. If the classes differ, use the identifier.

There is one drawback in this: it is not 100% coherent. Consider =Eaa[a,b] and =Ebb[b,a]:

    eaa0a == ebb1a (rule 2)
    ebb1a > ebb0b (rule 1)
    ebb0b == eaa1b (rule 2)

From this we can deduce that eaa0a is lexically greater than eaa1b, which is false since:

    eaa0a < eaa1b (rule 1)

Not sure if that matters, since the likelihood of this ever occurring should be relatively small. It can only happen when two enum classes use the same identifiers but in a different order.
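The incoherence can be reproduced with a few lines of Python. This is a sketch under assumptions: each raw value is modelled as a (classifier, ordinal, identifier) tuple, the function name compare is made up, and rule 2 is taken to mean a plain lexical comparison of the identifiers.

```python
def compare(e1, e2):
    """Three-way compare of (classifier, ordinal, identifier) tuples."""
    class1, ordinal1, ident1 = e1
    class2, ordinal2, ident2 = e2
    if class1 == class2:
        a, b = ordinal1, ordinal2  # rule 1: same class, use the ordinal
    else:
        a, b = ident1, ident2      # rule 2: different class, use the identifier
    return (a > b) - (a < b)


# =Eaa[a,b] and =Ebb[b,a]
eaa0a = ("aa", 0, "a")
eaa1b = ("aa", 1, "b")
ebb0b = ("bb", 0, "b")
ebb1a = ("bb", 1, "a")

print(compare(eaa0a, ebb1a))  # equal by rule 2
print(compare(ebb1a, ebb0b))  # greater by rule 1
print(compare(ebb0b, eaa1b))  # equal by rule 2
print(compare(eaa0a, eaa1b))  # less by rule 1, contradicting the chain above
```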
Thinking a bit more. What if the raw version looked like this:

    [dev,*alpha,beta]

instead of:

    ephp0alpha

Now the raw representation contains a canonical representation of the full enum and an indication of the current value, and the classifier is thus redundant. With this information, it is possible to create a canonical comparison algorithm as follows.

Rule 1. If the selected identifier in e1 exists in e2, and the selected identifier in e2 exists in e1, and the order in which both identifiers are listed in e1 is not in conflict with how they are listed in e2, then yield the comparison of the two ordinals as they appear in e1. Example:

    [dev,*alpha,beta] < [alpha,*beta]

Rule 2. The selected identifier in e1 can be found in e2, but the selected identifier in e2 cannot be found in e1. This means that e2 has both identifiers, so we use the identifier ordinals from e2. If the opposite is true, then we return the negation of a reversed comparison. Example:

    [dev,*beta] > [dev,*alpha,beta]

For all other cases, we need to define that one of the compared enums is "primary". The one with the larger number of elements is always primary. For enums with an equal number of elements, we compare each element until a lexical difference is found and let the one with the first lexically greater element be considered primary. After some consideration and experiments, I came up with this:

Rule 3. Primary wins.

Comments?
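A rough Python sketch of rules 1 and 2 follows. It models each raw enum as a (symbols, selected_index) pair, which is an assumption made for illustration, as is the name compare_enums. Rule 3 is deliberately left out: the sketch returns None wherever the "primary" rule would have to decide.

```python
def compare_enums(e1, e2):
    """Compare two enums given as (symbols, selected_index).

    Returns -1, 0 or 1 when rule 1 or rule 2 applies, otherwise None
    (the cases that would need the 'primary' rule).
    """
    symbols1, selected1 = e1
    symbols2, selected2 = e2
    ident1, ident2 = symbols1[selected1], symbols2[selected2]
    e2_knows_ident1 = ident1 in symbols2
    e1_knows_ident2 = ident2 in symbols1

    def sign(a, b):
        return (a > b) - (a < b)

    if e1_knows_ident2 and e2_knows_ident1:
        by_e1 = sign(symbols1.index(ident1), symbols1.index(ident2))
        by_e2 = sign(symbols2.index(ident1), symbols2.index(ident2))
        # Rule 1 applies only when both enums agree on the order.
        return by_e1 if by_e1 == by_e2 else None
    if e2_knows_ident1:
        # Rule 2: e2 lists both identifiers, so use the ordinals from e2.
        return sign(symbols2.index(ident1), symbols2.index(ident2))
    if e1_knows_ident2:
        # Rule 2 mirrored: e1 lists both identifiers, so use the ordinals from e1.
        return sign(symbols1.index(ident1), symbols1.index(ident2))
    return None


# [dev,*alpha,beta] versus [alpha,*beta]
print(compare_enums((["dev", "alpha", "beta"], 1), (["alpha", "beta"], 1)))
# [dev,*beta] versus [dev,*alpha,beta]
print(compare_enums((["dev", "beta"], 1), (["dev", "alpha", "beta"], 1)))
```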
Regarding comparison of enum types: I think a simple ordering based on the enum classifier, i.e.

    eaa1 < eaa99 < eab1

is what we want. 1.0.0.alpha in a scheme where alpha is '0' is not the same as 1.0.0.alpha in a scheme where alpha is '1'. I.e. I am not sure that the fact that two enums happen to define the strings in the same order means that intermixing them is meaningful. The reason enums exist is that there are semantics attached to the symbols.

    Elephant : [S, M, L]
    Mouse : [S, M, L]

Just because they share the same symbols does not mean that they share the same scale.
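The classifier-first ordering is easy to express as a sort key. The following Python sketch assumes the 'e' classifier numeric raw form and uses a made-up helper name, raw_key.

```python
import re


def raw_key(raw):
    """Sort key for a raw enum segment: classifier first, then ordinal."""
    match = re.fullmatch(r"e([a-zA-Z_]+)(\d+)", raw)
    if match is None:
        raise ValueError("not a raw enum segment: " + raw)
    return match.group(1), int(match.group(2))


# The ordinal is compared numerically, so eaa99 sorts after eaa1 but before eab1.
print(sorted(["eab1", "eaa99", "eaa1"], key=raw_key))
```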
(In reply to comment #8)
> Regarding comparison of enum types - I think a simple ordering based on the
> enum classifier - i.e
> eaa1 < eaa99 < eab1 is what we want. 1.0.0.alpha in a scheme where alpha is '0'
> is not the same as 1.0.0.alpha in a scheme where alpha is '1'.
>
An important thing to consider is when a format, say foo[alpha,beta], has been around for some time and the maintainers of foo now want to add 'dev' before 'alpha'. I think that's a valid use case that we should support, but it falls short if we just use classifiers and ordinals. A similar situation will occur if we for some other reason encounter different formats using the same classifier but with different definitions (i.e. php[alpha,beta] and php[dev,alpha,beta]). My proposal covers both those cases.

> i.e. I am not sure that the case that enums happens to define the strings in
> the same order means that intermixing them is meaningful.
>
I, on the other hand, can't think of one situation (in the limited scope of software versions) where that doesn't make perfect sense.

> The reason the enum's exist are because there is semantic attached to the
> symbols.
>
> Elephant : [S, M, L ]
> Mouse : [S, M, L ]
>
We intend to use enums to compare software versions. It's a limited domain. How many enum schemes have you seen, and what identifiers have been used?

> Just because they share the same symbols does not mean that they share the same
> scale.

True, but you have to admit that it is fairly common. I've seen alpha, beta, rc (or cr) in more places than one, and they always come in the same order.

I have problems with classifiers and ordinals because:
1. There is no maintained map of classifier <-> definition, which means that we will see different definitions of one and the same classifier.
2. The lexical magnitude of the classifier has greater significance than the ordinal, which means that php:beta < ruby:alpha.
3. It's impossible to change the meaning of the classifier over time without decoupling all components already in existence.
(In reply to comment #9)
> 3. It's impossible to change the meaning of the classifier over time without
> decoupling all components already in existence.

That is really bad, as it is common to start out with an under-specified version format.

Although it is a limited domain, the mechanism may be useful for ordering "release names" as well as the common alpha, beta, etc. series. But that requirement is not as important.

I wonder if we should instead just have a semantically defined series and let different formats map strings to it. That would be enough for the PHP scheme. If we leave some room between the defined values and before the first, it should be reasonably future proof. The rest is more about translation of original to raw, whether upper case is allowed in a particular format, etc.
After some discussions and working through a working implementation (patch coming as soon as we agree on the format), I have come up with a slightly simplified format.

In short, an enum is not really an enum. It's more an instruction saying that specific symbols are recognized and ordered in a specific way. While that sounds very similar to an enum, there's one big difference. Just like all other format specifications, this sub-specification is also completely anonymous. There is no 'classifier'.

I also removed the "enum only format" since I consider it redundant and a bit strange. The enum is an instruction, not a format. Instead, I added an "optional" flag in the form of a question mark. Its presence means: if the enum is not matched, fall back to string parsing.

Since I wanted to avoid '[' and ']' in the raw format (it generates a lot of escapes in ranges), I used '{' and '}' for the raw enum. It felt natural (and saved some lines of code) to do that for the enum format as well. It also looks less confusing when mixing with the optional format specifiers.

Here's the BNF:

    enum-processing : '=' enumerator-list options ';' ;
    enumerator-list : '{' enumerator (',' enumerator)* '}' ;
    enumerator : symbol ('=' symbol)* ;
    options : ('i' | 'b' | '?')* ;

Example:

    a={alpha=a,beta=b,gamma=g}i?

The first symbol of each enumerator is considered the canonical one, i.e. for alpha=a, alpha would be the canonical identifier.

A PHP version format could look something like this:

    n[.n=0;[.n=0;[.a={dev,alpha=a,beta=b,RC=rc,#,pl=p}?;]]]
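To illustrate how such an enum instruction could be read, here is a small Python sketch that parses only the '={...}' part with its option flags. It is not the code from the patch: parse_enum_spec is a hypothetical name, ordinals are assumed to start at 0, the trailing ';' is treated as optional, and escaping of special characters inside symbols is not handled.

```python
import re

# '=' '{' enumerators '}' options ';'   (the trailing ';' is optional in this sketch)
ENUM_SPEC = re.compile(r"=\{([^{}]*)\}([ib?]*);?")


def parse_enum_spec(spec):
    """Parse e.g. '={alpha=a,beta=b}i?' into (canonical, ordinals, flags)."""
    match = ENUM_SPEC.fullmatch(spec)
    if match is None:
        raise ValueError("not an enum instruction: " + spec)
    body, flags = match.group(1), match.group(2)
    canonical = []  # first symbol of each enumerator
    ordinals = {}   # every symbol (canonical or synonym) -> ordinal
    for ordinal, enumerator in enumerate(body.split(",")):
        symbols = enumerator.split("=")
        canonical.append(symbols[0])
        for symbol in symbols:
            ordinals[symbol] = ordinal
    return canonical, ordinals, set(flags)


canonical, ordinals, flags = parse_enum_spec("={dev,alpha=a,beta=b,RC=rc,#,pl=p}?;")
print(canonical)  # canonical identifiers in declaration order
print(ordinals["rc"], ordinals["RC"])  # synonyms share one ordinal
```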
I am fine with everything, but concerned that the format gets long when the exact same specification needs to be used in multiple places. The PHP format is really any sequence of numbers/strings/enums, so it may be possible to avoid the problem in this case, but in general it seems like a problem.

I am currently using:

    format(nda(d?a)*)

but it should really not require an initial number, if I understand the PHP version_compare algorithm correctly.

But imagine a version spec like OSGi where each specified slot had the same format:

    a={dev,alpha=a,beta=b,RC=rc,#,pl=p}?;[.a={dev,alpha=a,beta=b,RC=rc,#,pl=p}?;=0;[.a={dev,alpha=a,beta=b,RC=rc,#,pl=p}?;=0;[.a={dev,alpha=a,beta=b,RC=rc,#,pl=p}?;]]]

We could defer that to a time when a format like that is actually encountered. The most common case is probably "same format as already used", so a simple solution may be to just have a notation for that.
Created attachment 180540
Patch that adds the enum format

This patch is for HEAD. It adds the enum format as specified in comment #11, together with a bunch of tests (which can be found in org.eclipse.equinox.p2.tests.omniVersion.FormatATest). It's designed with performance and memory preservation in mind and should have little or no effect on existing code.
Patch committed to HEAD.