| Summary: | Regex POSIX \p{IsPunct} and \p{Punct} behave differently with OpenJDK | ||
|---|---|---|---|
| Product: | [Eclipse Project] JDT | Reporter: | Von Landon <landonvg> |
| Component: | Core | Assignee: | JDT-Core-Inbox <jdt-core-inbox> |
| Status: | CLOSED NOT_ECLIPSE | QA Contact: | |
| Severity: | normal | ||
| Priority: | P3 | CC: | twolf |
| Version: | 4.18 | ||
| Target Milestone: | --- | ||
| Hardware: | PC | ||
| OS: | Windows 10 | ||
| Whiteboard: | |||
TL;DR: not a JDT problem; it's a difference in Java libraries and/or Unicode versions used by certain Java versions.
I see differences between the Java 1.8 libraries and Java >= 11 libraries.
Per the JLS [1]: "Versions of the Java programming language prior to JDK 1.1 used Unicode 1.1.5. Upgrades to newer versions of the Unicode Standard occurred in JDK 1.1 (to Unicode 2.0), JDK 1.1.7 (to Unicode 2.1), Java SE 1.4 (to Unicode 3.0), Java SE 5.0 (to Unicode 4.0), Java SE 7 (to Unicode 6.0), Java SE 8 (to Unicode 6.2), Java SE 9 (to Unicode 8.0), Java SE 11 (to Unicode 10.0), Java SE 12 (to Unicode 11.0), Java SE 13 (to Unicode 12.1), and Java SE 15 (to Unicode 13.0)."
In Unicode, ">" (U+003E) is not in any punctuation category; its general category is Sm (Symbol, math). I don't know whether that ever changed between Unicode versions.
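The category claim can be checked directly against the running JRE's Unicode tables, independent of the regex engine. The following is only a minimal sketch: the class name `GreaterThanCategory` and the helper `isUnicodePunctuation` are made up for illustration; `Character.getType` and the category constants are standard `java.lang.Character` API.

```java
public class GreaterThanCategory {
    // Hypothetical helper: true if the code point's general category is one
    // of the Unicode punctuation categories (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    static boolean isUnicodePunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // '>' is a math symbol (Sm), not punctuation.
        System.out.println("'>' is MATH_SYMBOL: "
                + (Character.getType('>') == Character.MATH_SYMBOL));
        System.out.println("'>' is punctuation: " + isUnicodePunctuation('>'));
        // '!' is Po (other punctuation), shown for contrast.
        System.out.println("'!' is punctuation: " + isUnicodePunctuation('!'));
    }
}
```

Running this under each installed JDK would show whether the category of ">" differs between their Unicode versions.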
$ cat RegexTest.java
public class RegexTest {
    public static void main(String[] args) {
        boolean a = ">".matches("\\p{IsPunct}");
        boolean b = ">".matches("\\p{Punct}");
        boolean c = ">".matches("\\p{IsPunctuation}");
        System.out.println("POSIX Result: IsPunct=" + a + ", Punct=" + b + ", IsPunctuation=" + c);
        a = ">".matches("(?U)\\p{IsPunct}");
        b = ">".matches("(?U)\\p{Punct}");
        c = ">".matches("(?U)\\p{IsPunctuation}");
        System.out.println("Unicode Result: IsPunct=" + a + ", Punct=" + b + ", IsPunctuation=" + c);
    }
}
$ JAVA_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home javac RegexTest.java
$ JAVA_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-16.jdk/Contents/Home java RegexTest
POSIX Result: IsPunct=false, Punct=true, IsPunctuation=false
Unicode Result: IsPunct=false, Punct=false, IsPunctuation=false
$ JAVA_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-11.jdk/Contents/Home javac RegexTest.java
$ JAVA_HOME=/Library/Java/JavaVirtualMachines/adoptopenjdk-11.jdk/Contents/Home java RegexTest
POSIX Result: IsPunct=false, Punct=true, IsPunctuation=false
Unicode Result: IsPunct=false, Punct=false, IsPunctuation=false
$ JAVA_HOME=/Library/Java/JavaVirtualMachines/amazon-corretto-11.jdk/Contents/Home javac RegexTest.java
$ JAVA_HOME=/Library/Java/JavaVirtualMachines/amazon-corretto-11.jdk/Contents/Home java RegexTest
POSIX Result: IsPunct=false, Punct=true, IsPunctuation=false
Unicode Result: IsPunct=false, Punct=false, IsPunctuation=false
$ JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_191.jdk/Contents/Home javac RegexTest.java
$ JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_191.jdk/Contents/Home java RegexTest
POSIX Result: IsPunct=true, Punct=true, IsPunctuation=false
Unicode Result: IsPunct=true, Punct=false, IsPunctuation=false
[1] https://docs.oracle.com/javase/specs/jls/se16/html/jls-3.html#jls-3.1
Using the default JRE installed with Eclipse, evaluating ">".matches("(\\p{IsPunct})") returns false, but it should be true. Evaluating ">".matches("(\\p{Punct})") returns true as expected. Switching to Amazon Corretto results in both evaluating to true, as expected.
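If the goal is to match characters like ">" without depending on how a given JDK resolves `\p{IsPunct}`, one option is to name the intended set explicitly. The sketch below is an assumption about what the caller wants, not a fix endorsed in this report; the class and method names are hypothetical. It relies on `\p{Punct}` (ASCII punctuation when `UNICODE_CHARACTER_CLASS` is off, per the `Pattern` Javadoc) and on the Unicode general categories `\p{P}` and `\p{S}`.

```java
import java.util.regex.Pattern;

public class PunctuationOrSymbol {
    // POSIX class: ASCII punctuation only (includes '<', '=', '>').
    static final Pattern ASCII_PUNCT = Pattern.compile("\\p{Punct}");

    // Unicode general categories P (punctuation) and S (symbol) combined,
    // so math symbols such as '>' are covered as well.
    static final Pattern UNICODE_PUNCT_OR_SYMBOL = Pattern.compile("[\\p{P}\\p{S}]");

    static boolean isAsciiPunct(String s) {
        return ASCII_PUNCT.matcher(s).matches();
    }

    static boolean isPunctOrSymbol(String s) {
        return UNICODE_PUNCT_OR_SYMBOL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAsciiPunct(">"));         // true
        System.out.println(isPunctOrSymbol(">"));      // true
        // U+00AB (left-pointing double angle quotation mark) is not ASCII.
        System.out.println(isAsciiPunct("\u00AB"));    // false
        System.out.println(isPunctOrSymbol("\u00AB")); // true
    }
}
```

The outputs above for ">" with `\p{Punct}` agree with the POSIX results in the transcript for Java 8, 11, and 16.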