% File src/library/base/man/regex.Rd
% Part of the R package, http://www.R-project.org
% Copyright 1995-2011 R Core Team
% Distributed under GPL 2 or later

\name{regex}
\alias{regex}
\alias{regexp}
\alias{regular expression}
\concept{regular expression}
\title{Regular Expressions as used in R}
\description{
  This help page documents the regular expression patterns supported by
  \code{\link{grep}} and related functions \code{grepl}, \code{regexpr},
  \code{gregexpr}, \code{sub} and \code{gsub}, as well as by
  \code{\link{strsplit}}.
}
\details{
  A \sQuote{regular expression} is a pattern that describes a set of
  strings.  Two types of regular expressions are used in \R,
  \emph{extended} regular expressions (the default) and
  \emph{Perl-like} regular expressions used by \code{perl = TRUE}.
  There is also \code{fixed = TRUE} which can be considered to use a
  \emph{literal} regular expression.

  Other functions which use regular expressions (often via the use of
  \code{grep}) include \code{apropos}, \code{browseEnv},
  \code{help.search}, \code{list.files} and \code{ls}.
  These will all use \emph{extended} regular expressions.

  Patterns are described here as they would be printed by \code{cat}:
  (\emph{do remember that backslashes need to be doubled when entering \R
    character strings}, e.g. from the keyboard).
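
  As a minimal sketch of the doubling rule (the sample strings are
  invented for illustration), the pattern that \code{cat} prints as
  \samp{a\\.b} has to be typed with two backslashes in \R source:
\preformatted{
x <- "a\\\\.b"      # typed with a doubled backslash
cat(x, "\\n")      # prints: a\\.b
nchar(x)          # 4: the backslash is stored only once
grepl(x, "a.b")   # TRUE: the escaped period matches a literal period
grepl(x, "axb")   # FALSE
}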

  Do not assume that long regular expressions will be accepted: the
  POSIX standard only requires up to 256 \emph{bytes}.
}
\section{Extended Regular Expressions}{
  This section covers the regular expressions allowed in the default
  mode of \code{grep}, \code{regexpr}, \code{gregexpr}, \code{sub},
  \code{gsub} and \code{strsplit}.  They use an implementation of the
  POSIX 1003.2 standard: that allows some scope for interpretation and
  the interpretations here are those used as from \R 2.10.0.

  Regular expressions are constructed analogously to arithmetic
  expressions, by using various operators to combine smaller
  expressions.  The whole expression matches zero or more characters
  (read \sQuote{character} as \sQuote{byte} if \code{useBytes = TRUE}).

  The fundamental building blocks are the regular expressions that match
  a single character.  Most characters, including all letters and
  digits, are regular expressions that match themselves.  Any
  metacharacter with special meaning may be quoted by preceding it with
  a backslash.  The metacharacters in EREs are \samp{. \\
  | ( ) [ \{ ^ $ * + ?}, but note that whether these have a special
  meaning depends on the context.

  Escaping non-metacharacters with a backslash is
  implementation-dependent.  The current implementation interprets
  \samp{\\a} as \samp{BEL}, \samp{\\e} as \samp{ESC}, \samp{\\f} as
  \samp{FF}, \samp{\\n} as \samp{LF}, \samp{\\r} as \samp{CR} and
  \samp{\\t} as \samp{TAB}.  (Note that these will be interpreted by
  \R's parser in literal character strings.)

  A \emph{character class} is a list of characters enclosed between
  \samp{[} and \samp{]} which matches any single character in that list;
  unless the first character of the list is the caret \samp{^}, when it
  matches any character \emph{not} in the list.  For example, the
  regular expression \samp{[0123456789]} matches any single digit, and
  \samp{[^abc]} matches anything except the characters \samp{a},
  \samp{b} or \samp{c}.  A range of characters may be specified by
  giving the first and last characters, separated by a hyphen.  (Because
  the interpretation of such ranges is locale- and
  implementation-dependent, they are best avoided.)  The only portable
  way to specify all ASCII letters is
  to list them all as the character class\cr
  \samp{[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]}.\cr
  (The current implementation uses numerical order of the encoding: prior to
  \R 2.10.0 locale-specific collation was used, and might be again.)
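
  A small sketch of the two character classes mentioned above, applied
  with \code{grepl} to an invented sample vector:
\preformatted{
x <- c("a1", "b", "42", "xyz")
grepl("[0123456789]", x)   # TRUE FALSE TRUE FALSE: contains a digit
grepl("[^abc]", x)         # TRUE FALSE TRUE TRUE: contains a character
                           # other than a, b or c
}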

  Certain named classes of characters are predefined.  Their
  interpretation depends on the \emph{locale} (see \link{locales}); the
  interpretation below is that of the POSIX locale.

  \describe{
    \item{\samp{[:alnum:]}}{Alphanumeric characters: \samp{[:alpha:]}
      and \samp{[:digit:]}.}

    \item{\samp{[:alpha:]}}{Alphabetic characters: \samp{[:lower:]} and
      \samp{[:upper:]}.}

    \item{\samp{[:blank:]}}{Blank characters: space and tab, and
      possibly other locale-dependent characters such as non-breaking
      space.}

    \item{\samp{[:cntrl:]}}{
      Control characters.  In ASCII, these characters have octal codes
      000 through 037, and 177 (\code{DEL}).  In another character set,
      these are the equivalent characters, if any.}

    \item{\samp{[:digit:]}}{Digits: \samp{0 1 2 3 4 5 6 7 8 9}.}

    \item{\samp{[:graph:]}}{Graphical characters: \samp{[:alnum:]} and
      \samp{[:punct:]}.}

    \item{\samp{[:lower:]}}{Lower-case letters in the current locale.}

    \item{\samp{[:print:]}}{
      Printable characters: \samp{[:alnum:]}, \samp{[:punct:]} and space.}

    \item{\samp{[:punct:]}}{Punctuation characters:\cr
      \samp{! " # $ \% & ' ( ) * + , - . / : ; < = > ? @ [ \\ ] ^ _ ` \{ | \} ~}.}
%'"`  keep Emacs Rd mode happy

    \item{\samp{[:space:]}}{
      Space characters: tab, newline, vertical tab, form feed, carriage
      return, space and possibly other locale-dependent characters.}

    \item{\samp{[:upper:]}}{Upper-case letters in the current locale.}

    \item{\samp{[:xdigit:]}}{Hexadecimal digits:\cr
      \samp{0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f}.}
  }

  For example, \samp{[[:alnum:]]} means \samp{[0-9A-Za-z]}, except the
  latter depends upon the locale and the character encoding, whereas the
  former is independent of locale and character set.  (Note that the
  brackets in these class names are part of the symbolic names, and must
  be included in addition to the brackets delimiting the bracket list.)
  Most metacharacters lose their special meaning inside a character
  class.  To include a literal \samp{]}, place it first in the list.
  Similarly, to include a literal \samp{^}, place it anywhere but first.
  Finally, to include a literal \samp{-}, place it first or last (or,
  for \code{perl = TRUE} only, precede it by a backslash).  (Only
  \samp{^ - \\ ]} are special inside character classes.)
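
  The following sketch, using invented ASCII-only strings, illustrates
  the double brackets needed for a named class and the placement rules
  for a literal \samp{]} and \samp{-}:
\preformatted{
x <- c("abc123", "a b", "x-y", "p]q")
grepl("^[[:alnum:]]+$", x)   # TRUE FALSE FALSE FALSE
grepl("[[:space:]]", x)      # FALSE TRUE FALSE FALSE
grepl("[]-]", x)             # FALSE FALSE TRUE TRUE: literal ] placed
                             # first, literal - placed last
}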

  The period \samp{.} matches any single character.  The symbol
  \samp{\\w} matches a \sQuote{word} character (a synonym for
  \samp{[[:alnum:]_]}) and \samp{\\W} is its negation.  Symbols
  \samp{\\d}, \samp{\\s}, \samp{\\D} and \samp{\\S} denote the digit and
  space classes and their negations.

  The caret \samp{^} and the dollar sign \samp{$} are metacharacters
  that respectively match the empty string at the beginning and end of a
  line.  The symbols \samp{\\<} and \samp{\\>} match the empty string at
  the beginning and end of a word.  The symbol \samp{\\b} matches the
  empty string at either edge of a word, and \samp{\\B} matches the
  empty string provided it is not at an edge of a word.  (The
  interpretation of \sQuote{word} depends on the locale and
  implementation.)
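
  A brief sketch of the anchors and word-boundary symbols, on an invented
  ASCII string (results for non-ASCII text may depend on the locale, as
  noted above):
\preformatted{
x <- "cat concat cat"
gsub("\\\\bcat\\\\b", "dog", x)      # "dog concat dog"
gsub("\\\\<cat", "dog", x)          # "dog concat dog": only at a word start
grepl("^cat$", c("cat", "cats"))  # TRUE FALSE
}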

  A regular expression may be followed by one of several repetition
  quantifiers:
  \describe{
    \item{\samp{?}}{The preceding item is optional and will be matched
      at most once.}

    \item{\samp{*}}{The preceding item will be matched zero or more
      times.}

    \item{\samp{+}}{The preceding item will be matched one or more
      times.}

    \item{\samp{{n}}}{The preceding item is matched exactly \code{n}
      times.}

    \item{\samp{{n,}}}{The preceding item is matched \code{n} or more
      times.}

    \item{\samp{{n,m}}}{The preceding item is matched at least \code{n}
      times, but not more than \code{m} times.}
  }
  By default repetition is greedy, so the maximal possible number of
  repeats is used.  This can be changed to \sQuote{minimal} by appending
  \code{?} to the quantifier.  (There are further quantifiers that allow
  approximate matching: see the TRE documentation.)
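
  As an illustrative sketch (the input string is invented), interval
  quantifiers and the difference between greedy and minimal repetition
  can be seen with \code{sub} and \code{regmatches}:
\preformatted{
x <- "aaa<b>bold</b>"
sub("a{2}", "", x)                   # "a<b>bold</b>": exactly two removed
sub("a{2,}", "", x)                  # "<b>bold</b>": all three removed
regmatches(x, regexpr("<.*>", x))    # greedy:  "<b>bold</b>"
regmatches(x, regexpr("<.*?>", x))   # minimal: "<b>"
}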

  Regular expressions may be concatenated; the resulting regular
  expression matches any string formed by concatenating the substrings
  that match the concatenated subexpressions.

  Two regular expressions may be joined by the infix operator \samp{|};
  the resulting regular expression matches any string matching either
  subexpression.   For example, \samp{abba|cde} matches either the
  string \code{abba} or the string \code{cde}.  Note that alternation
  does not work inside character classes, where \samp{|} has its literal
  meaning.

  Repetition takes precedence over concatenation, which in turn takes
  precedence over alternation.  A whole subexpression may be enclosed in
  parentheses to override these precedence rules.

  The backreference \samp{\\N}, where \samp{N = 1 ... 9}, matches
  the substring previously matched by the Nth parenthesized
  subexpression of the regular expression.  (This is an
  extension for extended regular expressions: POSIX defines them only
  for basic ones.)
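
  A short sketch of alternation, grouping and backreferences, using
  invented inputs.  The last line also uses backreferences in the
  replacement argument of \code{sub}:
\preformatted{
grepl("abba|cde", c("abba", "cde", "abc"))   # TRUE TRUE FALSE
grepl("^(ab|cd)+$", c("abcd", "abd"))        # TRUE FALSE
grepl("(a|b)\\\\1", c("aa", "ab", "bb"))       # TRUE FALSE TRUE
sub("(\\\\w+) (\\\\w+)", "\\\\2 \\\\1", "hello world")   # "world hello"
}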
}
\section{Perl-like Regular Expressions}{
  The \code{perl = TRUE} argument to \code{grep}, \code{regexpr},
  \code{gregexpr}, \code{sub}, \code{gsub} and \code{strsplit} switches
  to the PCRE library that implements regular expression pattern
  matching using the same syntax and semantics as Perl 5.10,
  with just a few differences.

  For complete details please consult the man pages for PCRE, especially
  \command{man pcrepattern} and \command{man pcreapi}, on your system or from
  the sources at \url{http://www.pcre.org}. If PCRE support was compiled
  from the sources within \R, the PCRE version is 8.12 as described here.

  Perl regular expressions can be computed byte-by-byte or
  (UTF-8) character-by-character: the latter is used in all multibyte
  locales and if any of the inputs are marked as UTF-8 (see
  \code{\link{Encoding}}).

  All the regular expressions described for extended regular expressions
  are accepted except \samp{\\<} and \samp{\\>}: in Perl all backslashed
  metacharacters are alphanumeric and backslashed symbols always are
  interpreted as a literal character. \samp{\{} is not special if it
  would be the start of an invalid interval specification.  There can be
  more than 9 backreferences (but the replacement in \code{\link{sub}}
  can only refer to the first 9).

  Character ranges are interpreted in the numerical order of
  the characters, either as bytes in a single-byte locale or as Unicode
  points in UTF-8 mode.  So in either case \samp{[A-Za-z]} specifies
  the set of ASCII letters.

  In UTF-8 mode the named character classes only match ASCII characters:
  see \samp{\\p} below for an alternative.

  The construct \samp{(?...)} is used for Perl extensions in a variety
  of ways depending on what immediately follows the \samp{?}.

  Perl-like matching can work in several modes, set by the options
  \samp{(?i)} (caseless, equivalent to Perl's \samp{/i}), \samp{(?m)}
  (multiline, equivalent to Perl's \samp{/m}), \samp{(?s)} (single line,
  so a dot matches all characters, even new lines: equivalent to Perl's
  \samp{/s}) and \samp{(?x)} (extended, whitespace data characters are
  ignored unless escaped and comments are allowed: equivalent to Perl's
  \samp{/x}).  These can be concatenated, so for example, \samp{(?im)}
  sets caseless multiline matching.  It is also possible to unset these
  options by preceding the letter with a hyphen, and to combine setting
  and unsetting such as \samp{(?im-sx)}.  These settings can be applied
  within patterns, and then apply to the remainder of the pattern.
  Additional options not in Perl include \samp{(?U)} to set
  \sQuote{ungreedy} mode (so matching is minimal unless \samp{?} is used
  as part of the repetition quantifier, when it is greedy).  Initially
  none of these options are set.
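
  The mode options can be sketched as follows, with invented sample
  strings:
\preformatted{
x <- c("Apple", "apple pie", "line1\\napple")
grepl("^apple", x, perl = TRUE)        # FALSE TRUE FALSE
grepl("(?i)^apple", x, perl = TRUE)    # TRUE TRUE FALSE
grepl("(?im)^apple", x, perl = TRUE)   # TRUE TRUE TRUE: ^ also matches
                                       # just after a newline
sub("(?U)<.+>", "", "<a><b>", perl = TRUE)   # "<b>": repetition is minimal
}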

  If you want to remove the special meaning from a sequence of
  characters, you can do so by putting them between \samp{\\Q} and
  \samp{\\E}. This is different from Perl in that \samp{$} and \samp{@} are
  handled as literals in \samp{\\Q...\\E} sequences in PCRE, whereas in
  Perl, \samp{$} and \samp{@} cause variable interpolation.
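
  A minimal sketch of literal quoting, with invented inputs: inside the
  quoted span neither the period nor the dollar sign is special.
\preformatted{
grepl("a.b", c("a.b", "axb"), perl = TRUE)            # TRUE TRUE
grepl("\\\\Qa.b\\\\E", c("a.b", "axb"), perl = TRUE)    # TRUE FALSE
grepl("\\\\Q$x\\\\E", "cost in $x", perl = TRUE)        # TRUE
}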

  The escape sequences \samp{\\d}, \samp{\\s} and \samp{\\w} represent
  any decimal digit, space character and \sQuote{word} character
  (letter, digit or underscore in the current locale: in UTF-8 mode only
  ASCII letters and digits are considered) respectively, and their
  upper-case versions represent their negation.  Unlike POSIX, vertical
  tab is not regarded as a space character.  Sequences \samp{\\h},
  \samp{\\v}, \samp{\\H} and \samp{\\V} match horizontal and vertical
  space or the negation.  (In UTF-8 mode, these do match non-ASCII
  Unicode points.)

  There are additional escape sequences: \samp{\\cx} is
  \samp{cntrl-x} for any \samp{x}, \samp{\\ddd} is the
  octal character (for up to three digits unless
  interpretable as a backreference, as \samp{\\1} to \samp{\\7} always
  are), and \samp{\\xhh} specifies a character by two hex digits.
  In a UTF-8 locale, \samp{\\x\{h...\}} specifies a Unicode point
  by one or more hex digits.  (Note that some of these will be
  interpreted by \R's parser in literal character strings.)

  Outside a character class, \samp{\\A} matches at the start of a
  subject (even in multiline mode, unlike \samp{^}), \samp{\\Z} matches
  at the end of a subject or before a newline at the end, \samp{\\z}
  matches only at the end of a subject, and \samp{\\G} matches at the first
  matching position in a subject (which is subtly different from Perl's
  end of the previous match).  \samp{\\C} matches a single
  byte, including a newline, but its use is warned against.  In UTF-8
  mode, \samp{\\R} matches any Unicode newline character (not just CR),
  and \samp{\\X} matches any number of Unicode characters that form an
  extended Unicode sequence.

  In UTF-8 mode, some Unicode properties are supported via
  \samp{\\p\{xx\}} and \samp{\\P\{xx\}} which match
  characters with and without property \samp{xx} respectively.
  For a list of supported properties see the PCRE documentation, but for
  example \samp{Lu} is \sQuote{upper case letter} and \samp{Sc} is
  \sQuote{currency symbol}.

  The sequence \samp{(?#} marks the start of a comment which continues
  up to the next closing parenthesis.  Nested parentheses are not
  permitted.  The characters that make up a comment play no part at all in
  the pattern matching.

  If the extended option is set, an unescaped \samp{#} character outside
  a character class introduces a comment that continues up to the next
  newline character in the pattern.

  The pattern \samp{(?:...)} groups characters just as parentheses do
  but does not make a backreference.

  Patterns \samp{(?=...)} and \samp{(?!...)} are zero-width positive and
  negative lookahead \emph{assertions}: they match if an attempt to
  match the \code{\dots} forward from the current position would succeed
  (or not), but use up no characters in the string being processed.
  Patterns \samp{(?<=...)} and \samp{(?<!...)} are the lookbehind
  equivalents: they do not allow repetition quantifiers nor \samp{\\C}
  in \code{\dots}.
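
  The non-capturing group and the lookaround assertions can be sketched
  as follows (the inputs are invented for illustration):
\preformatted{
x <- c("price: 100USD", "price: 100EUR")
sub("\\\\d+(?=USD)", "N", x, perl = TRUE)   # "price: NUSD" "price: 100EUR"
sub("(?<=: )\\\\d+", "N", x, perl = TRUE)   # "price: NUSD" "price: NEUR"
grepl("^(?!price)", c("price: 1", "cost: 1"), perl = TRUE)   # FALSE TRUE
sub("(?:ab)+(c)", "[\\\\1]", "ababc", perl = TRUE)   # "[c]": the only
                                                   # numbered group is (c)
}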
  
  As from \R 2.14.0 \code{regexpr} and \code{gregexpr} support
  \sQuote{named capture}.  If groups are named, e.g.,
  \code{"(?<first>[A-Z][a-z]+)"} then the positions of the matches are
  also returned by name.  (Named backreferences are not supported by
  \code{sub}.)
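
  A sketch of reading named captures back from \code{regexpr}, assuming
  an invented two-word input and using the \code{"capture.start"},
  \code{"capture.length"} and \code{"capture.names"} attributes of the
  result:
\preformatted{
x <- "Ada Lovelace"
m <- regexpr("(?<first>[A-Z][a-z]+) (?<last>[A-Z][a-z]+)", x, perl = TRUE)
attr(m, "capture.names")               # "first" "last"
st  <- attr(m, "capture.start")[1, ]   # start positions, one per group
len <- attr(m, "capture.length")[1, ]  # match lengths, one per group
substring(x, st, st + len - 1)         # the text "Ada" and "Lovelace"
}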

  Atomic grouping, possessive quantifiers and conditional
  and recursive patterns are not covered here.
}
\author{
  This help page is based on the documentation of GNU grep 2.4.2, the
  TRE documentation and the POSIX standard, and the \code{pcrepattern}
  man page from PCRE 8.0.
}
\seealso{
  \code{\link{grep}}, \code{\link{apropos}}, \code{\link{browseEnv}},
  \code{\link{glob2rx}}, \code{\link{help.search}}, \code{\link{list.files}},
  \code{\link{ls}} and \code{\link{strsplit}}.

  The TRE documentation at
  \url{http://laurikari.net/tre/documentation/regex-syntax/}.

  The POSIX 1003.2 standard at
  \url{http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09}

  The \code{pcrepattern} man page can be found as part of
  \url{http://www.pcre.org/pcre.txt}, and details of Perl's own
  implementation at \url{http://perldoc.perl.org/perlre.html}.
}
\keyword{character}