Java Regular Expressions (Theory, Classes and Syntax)

In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. A regular expression (often shortened to regex or regexp) is written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. For example, Perl, Ruby and Tcl have a powerful regular expression engine built directly into their syntax. Several utilities provided by Unix distributions – including the editor ed and the filter grep – were the first to popularize the concept of regular expressions.

As an example of the syntax, the regular expression \bex can be used to search for all instances of the string “ex” that occur after a word boundary (signified by \b). In layman’s terms, \bex finds the string “ex” in two possible locations:

At the beginning of words, and
Between two characters in a string, where one is a word character and the other is not a word character.

Thus, in the string “Texts for experts,” \bex matches the “ex” in “experts” but not in “Texts” (because the “ex” occurs inside a word and not immediately after a word boundary).
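
The same behaviour can be checked in Java. The following is a minimal sketch; the class name WordBoundaryDemo and the helper matchStarts are made up for illustration, and only Pattern and Matcher come from java.util.regex. Note that in Java source code the backslash must itself be escaped, so the regex \bex is written "\\bex".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical helper: collects the start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String text = "Texts for experts";
        // With the word boundary, only the "ex" that starts "experts" is found.
        System.out.println(matchStarts("\\bex", text)); // [10]
        // Without it, the "ex" inside "Texts" is found as well.
        System.out.println(matchStarts("ex", text));    // [1, 10]
    }
}
```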

Many modern computing systems provide wildcard characters in matching filenames from a file system. This is a core capability of many command-line shells and is also known as globbing. Wildcards differ from regular expressions in generally only expressing very limited forms of alternatives.

The java.util.regex package primarily consists of three classes: Pattern, Matcher, and PatternSyntaxException.

A Pattern object is a compiled representation of a regular expression. The Pattern class provides no public constructors. To create a pattern, you invoke one of its public static compile methods, which return a Pattern object. These methods accept a regular expression as the first argument; the required syntax is summarized in the tables that follow.
A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher defines no public constructors. You obtain a Matcher object by invoking the matcher method on a Pattern object.
A PatternSyntaxException object is an unchecked exception that indicates a syntax error in a regular expression pattern.
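
How the three classes fit together can be sketched as follows. The class name RegexClassesDemo and the helpers firstMatch and isValidRegex are illustrative assumptions, not part of the JDK; the calls to Pattern.compile, Pattern.matcher, Matcher.find and Matcher.group are the standard API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexClassesDemo {
    // Hypothetical helper: returns the first substring matching regex, or null if none.
    static String firstMatch(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);   // no public constructor: use compile
        Matcher matcher = pattern.matcher(input);   // no public constructor: use matcher
        return matcher.find() ? matcher.group() : null;
    }

    // Hypothetical helper: reports whether regex compiles without a syntax error.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // Unchecked, so catching it is optional; done here to turn it into a boolean.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("\\d+", "order 66 shipped")); // 66
        System.out.println(isValidRegex("[a-z]+"));                 // true
        System.out.println(isValidRegex("(unclosed"));              // false
    }
}
```

Because compiling is the expensive step, a Pattern that is used repeatedly is usually compiled once and kept in a static final field rather than recompiled on each call as in this sketch.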

Characters
x    The character x
\\    The backslash character
\0n    The character with octal value 0n (0 <= n <= 7)
\0nn    The character with octal value 0nn (0 <= n <= 7)
\0mnn    The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh    The character with hexadecimal value 0xhh
\uhhhh    The character with hexadecimal value 0xhhhh
\t    The tab character ('\u0009')
\n    The newline (line feed) character ('\u000A')
\r    The carriage-return character ('\u000D')
\f    The form-feed character ('\u000C')
\a    The alert (bell) character ('\u0007')
\e    The escape character ('\u001B')
\cx    The control character corresponding to x
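
A few of these escapes can be tried out as below. This is only a sketch: the class EscapeDemo and its helper fullMatch are invented for the example. The point to watch is the double layer of escaping, since the Java compiler consumes one backslash before the regex engine sees the pattern.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Hexadecimal, Unicode and octal escapes that all denote the letter A.
        System.out.println(fullMatch("\\x41", "A"));   // true
        System.out.println(fullMatch("\\u0041", "A")); // true
        System.out.println(fullMatch("\\0101", "A"));  // true

        // The regex escape for tab matches a real tab character in the input.
        System.out.println(fullMatch("a\\tb", "a\tb")); // true

        // A literal backslash is two backslashes in the regex,
        // which becomes four in Java source.
        System.out.println(fullMatch("C:\\\\temp", "C:\\temp")); // true
    }
}
```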

Character classes
[abc]    a, b, or c (simple class)
[^abc]    Any character except a, b, or c (negation)
[a-zA-Z]    a through z or A through Z, inclusive (range)
[a-d[m-p]]    a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]    d, e, or f (intersection)
[a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]    a through z, and not m through p: [a-lq-z] (subtraction)
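
Union, intersection and subtraction are easiest to see by running each class over the alphabet. The sketch below assumes a made-up helper, filter, that keeps only the characters a class accepts; nothing in it is specific to the JDK beyond Pattern and Matcher.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Hypothetical helper: keeps only the characters of input matched by charClass.
    static String filter(String charClass, String input) {
        StringBuilder kept = new StringBuilder();
        Matcher m = Pattern.compile(charClass).matcher(input);
        while (m.find()) {
            kept.append(m.group());
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        String alphabet = "abcdefghijklmnopqrstuvwxyz";
        System.out.println(filter("[a-d[m-p]]", alphabet));    // abcdmnop (union)
        System.out.println(filter("[a-z&&[def]]", alphabet));  // def (intersection)
        System.out.println(filter("[a-z&&[^bc]]", alphabet));  // adefghijklmnopqrstuvwxyz (subtraction)
        System.out.println(filter("[a-z&&[^m-p]]", alphabet)); // abcdefghijklqrstuvwxyz (subtraction)
    }
}
```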

Predefined character classes
.    Any character (may or may not match line terminators)
\d    A digit: [0-9]
\D    A non-digit: [^0-9]
\s    A whitespace character: [ \t\n\x0B\f\r]
\S    A non-whitespace character: [^\s]
\w    A word character: [a-zA-Z_0-9]
\W    A non-word character: [^\w]
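
The predefined classes are often used through the regex-aware methods of String. The following sketch uses invented helper names (splitOnWhitespace, digitsOnly, isWordCharsOnly); String.split, String.replaceAll and String.matches are standard methods that take a regular expression.

```java
import java.util.Arrays;
import java.util.List;

class PredefinedClassDemo {
    // Hypothetical helper: splits on runs of whitespace (\s+).
    static List<String> splitOnWhitespace(String input) {
        return Arrays.asList(input.split("\\s+"));
    }

    // Hypothetical helper: removes every non-digit (\D).
    static String digitsOnly(String input) {
        return input.replaceAll("\\D", "");
    }

    // Hypothetical helper: true if input consists only of word characters (\w).
    static boolean isWordCharsOnly(String input) {
        return input.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(splitOnWhitespace("a1 b22\tc333")); // [a1, b22, c333]
        System.out.println(digitsOnly("+1 (555) 010-9999"));   // 15550109999
        System.out.println(isWordCharsOnly("user_42"));        // true
        System.out.println(isWordCharsOnly("user-42"));        // false
    }
}
```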

POSIX character classes (US-ASCII only)
\p{Lower}    A lower-case alphabetic character: [a-z]
\p{Upper}    An upper-case alphabetic character: [A-Z]
\p{ASCII}    All ASCII: [\x00-\x7F]
\p{Alpha}    An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}    A decimal digit: [0-9]
\p{Alnum}    An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}    Punctuation: One of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}    A visible character: [\p{Alnum}\p{Punct}]
\p{Print}    A printable character: [\p{Graph}\x20]
\p{Blank}    A space or a tab: [ \t]
\p{Cntrl}    A control character: [\x00-\x1F\x7F]
\p{XDigit}    A hexadecimal digit: [0-9a-fA-F]
\p{Space}    A whitespace character: [ \t\n\x0B\f\r]

Classes for Unicode blocks and categories
\p{InGreek}    A character in the Greek block (simple block)
\p{Lu}    An uppercase letter (simple category)
\p{Sc}    A currency symbol
\P{InGreek}    Any character except one in the Greek block (negation)
[\p{L}&&[^\p{Lu}]]     Any letter except an uppercase letter (subtraction)
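
The difference between the ASCII-only POSIX classes and the Unicode classes shows up as soon as the input leaves ASCII. The sketch below is illustrative only: the class UnicodeClassDemo and the helper fullMatch are made-up names, non-ASCII characters are written as Unicode escapes so the source file encoding does not matter, and no flags such as UNICODE_CHARACTER_CLASS are set.

```java
import java.util.regex.Pattern;

class UnicodeClassDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String greek = "\u03b1\u03b2\u03b3"; // alpha, beta, gamma
        String eAcute = "\u00e9";            // Latin small letter e with acute
        String euro = "\u20ac";              // euro sign

        System.out.println(fullMatch("\\p{InGreek}+", greek)); // true (block)
        System.out.println(fullMatch("\\P{InGreek}+", greek)); // false (negated block)
        System.out.println(fullMatch("\\p{Lu}", "A"));         // true (category)
        System.out.println(fullMatch("\\p{Sc}", "$"));         // true (currency symbol)
        System.out.println(fullMatch("\\p{Sc}", euro));        // true

        // POSIX \p{Lower} is US-ASCII only by default; the Unicode category Ll is not.
        System.out.println(fullMatch("\\p{Lower}", eAcute));   // false
        System.out.println(fullMatch("\\p{Ll}", eAcute));      // true

        // Any letter except an uppercase letter.
        System.out.println(fullMatch("[\\p{L}&&[^\\p{Lu}]]", "a")); // true
        System.out.println(fullMatch("[\\p{L}&&[^\\p{Lu}]]", "A")); // false
    }
}
```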

Boundary matchers
^    The beginning of a line
$    The end of a line
\b    A word boundary
\B    A non-word boundary
\A    The beginning of the input
\G    The end of the previous match
\Z    The end of the input but for the final terminator, if any
\z    The end of the input
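
One detail the table does not spell out is that, by default, ^ and $ match only at the beginning and end of the whole input; they match at each line only when the pattern is compiled with Pattern.MULTILINE. The sketch below shows this, along with the difference between \Z and \z. The class BoundaryDemo and the helper countMatches are names chosen for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: counts how many times pattern is found in input.
    static int countMatches(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String lines = "one\ntwo\nthree";

        // Default mode: ^ matches only at the start of the input.
        System.out.println(countMatches(Pattern.compile("^\\w+"), lines));                      // 1
        // MULTILINE mode: ^ matches at the start of every line.
        System.out.println(countMatches(Pattern.compile("^\\w+", Pattern.MULTILINE), lines));   // 3
        // \A always means the start of the input, whatever the mode.
        System.out.println(countMatches(Pattern.compile("\\A\\w+", Pattern.MULTILINE), lines)); // 1

        // \Z tolerates a final line terminator, \z does not.
        System.out.println(Pattern.compile("end\\Z").matcher("the end\n").find()); // true
        System.out.println(Pattern.compile("end\\z").matcher("the end\n").find()); // false
    }
}
```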

Greedy quantifiers
X?    X, once or not at all
X*    X, zero or more times
X+    X, one or more times
X{n}    X, exactly n times
X{n,}    X, at least n times
X{n,m}    X, at least n but not more than m times

Reluctant quantifiers
X??    X, once or not at all
X*?    X, zero or more times
X+?    X, one or more times
X{n}?    X, exactly n times
X{n,}?    X, at least n times
X{n,m}?    X, at least n but not more than m times

Possessive quantifiers
X?+    X, once or not at all
X*+    X, zero or more times
X++    X, one or more times
X{n}+    X, exactly n times
X{n,}+    X, at least n times
X{n,m}+    X, at least n but not more than m times
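
The three quantifier tables read identically, so the difference is worth showing: greedy quantifiers take as much as possible and then back off, reluctant ones take as little as possible and then expand, and possessive ones take as much as possible and never back off. A minimal sketch, with QuantifierDemo and allMatches as made-up names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: returns every match of regex in input, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* swallows everything, then backs off just enough for the last "foo".
        System.out.println(allMatches(".*foo", input));  // [xfooxxxxxxfoo]
        // Reluctant: .*? grows one character at a time, so each "foo" ends a match.
        System.out.println(allMatches(".*?foo", input)); // [xfoo, xxxxxxfoo]
        // Possessive: .*+ swallows everything and never gives it back, so "foo" cannot match.
        System.out.println(allMatches(".*+foo", input)); // []
    }
}
```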

Logical operators
XY    X followed by Y
X|Y    Either X or Y
(X)    X, as a capturing group

Back references
\n    Whatever the nth capturing group matched
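
A common use of back references is spotting an accidentally repeated word. The sketch below is one way to do it; the class BackReferenceDemo and both helper names are assumptions made for the example. Inside the pattern the group is referenced as \1, while in the replacement string of replaceAll it is referenced as $1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackReferenceDemo {
    // \1 must match the same text that group 1 captured.
    private static final Pattern REPEATED = Pattern.compile("\\b(\\w+) \\1\\b");

    // Hypothetical helper: returns the first word that appears twice in a row, or null.
    static String firstRepeatedWord(String input) {
        Matcher m = REPEATED.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // Hypothetical helper: collapses any run of the same word into one occurrence.
    static String collapseRepeats(String input) {
        return input.replaceAll("\\b(\\w+)( \\1\\b)+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("this is is a test")); // is
        System.out.println(firstRepeatedWord("no repeats here"));   // null
        System.out.println(collapseRepeats("this is is a test"));   // this is a test
        System.out.println(collapseRepeats("it it it works"));      // it works
    }
}
```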

Quotation
\    Nothing, but quotes the following character
\Q    Nothing, but quotes all characters until \E
\E    Nothing, but ends quoting started by \Q
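
Quoting matters whenever text that should be matched literally contains metacharacters. The sketch below compares the unquoted pattern, a hand-written \Q...\E, the standard Pattern.quote method, and a single escaped character. The class QuotingDemo and the helper matchesLiterally are illustrative names.

```java
import java.util.regex.Pattern;

class QuotingDemo {
    // Hypothetical helper: true if input is exactly the given literal text.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        String literal = "1+1=2";

        // Unquoted, + is a quantifier: the regex means "one or more 1s, then 1=2".
        System.out.println(Pattern.matches(literal, "1+1=2")); // false
        System.out.println(Pattern.matches(literal, "11=2"));  // true

        // Quoting the whole literal by hand with \Q and \E.
        System.out.println(Pattern.matches("\\Q1+1=2\\E", "1+1=2")); // true
        // Pattern.quote builds the quoted form for arbitrary text.
        System.out.println(matchesLiterally(literal, "1+1=2"));      // true
        // Escaping just the one metacharacter with a backslash.
        System.out.println(Pattern.matches("1\\+1=2", "1+1=2"));     // true
    }
}
```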

Special constructs (non-capturing)
(?:X)    X, as a non-capturing group
(?idmsux-idmsux)     Nothing, but turns match flags on - off
(?idmsux-idmsux:X)     X, as a non-capturing group with the given flags on - off
(?=X)    X, via zero-width positive lookahead
(?!X)    X, via zero-width negative lookahead
(?<=X)    X, via zero-width positive lookbehind
(?<!X)    X, via zero-width negative lookbehind
(?>X)    X, as an independent, non-capturing group
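
Lookahead and lookbehind check the surrounding text without including it in the match. The following sketch tries each of the four on small, invented inputs and ends with an inline flag; LookaroundDemo and allMatches are made-up names. The \b in the negative cases is deliberate: without it the engine can backtrack to a shorter run of digits that satisfies the negative condition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Hypothetical helper: returns every match of regex in input, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String amounts = "10 USD, 20 EUR";
        // Positive lookahead: digits followed by " USD" (the suffix is not part of the match).
        System.out.println(allMatches("\\d+(?= USD)", amounts));        // [10]
        // Negative lookahead: whole numbers not followed by " USD".
        System.out.println(allMatches("\\b\\d+\\b(?! USD)", amounts));  // [20]

        String prices = "$15 and 30";
        // Positive lookbehind: digits preceded by a dollar sign.
        System.out.println(allMatches("(?<=\\$)\\d+", prices));         // [15]
        // Negative lookbehind: whole numbers not preceded by a dollar sign.
        System.out.println(allMatches("(?<!\\$)\\b\\d+", prices));      // [30]

        // Inline flag: (?i) turns on case-insensitive matching.
        System.out.println(allMatches("(?i)java", "Java JAVA java"));   // [Java, JAVA, java]
    }
}
```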

This article is based on Wikipedia and the Java Pattern documentation page.

Originally posted 2009-09-16 09:58:23.
