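Originally published on CSDN on 2014-02-09 19:32:01; archived here as part of a collection of my earlier articles.
C++ developers have three regular expression libraries to choose from: C++ regex (the <regex> header introduced in C++11), C regex, and Boost regex. When developing on Windows, the latter two are not available by default, so C++ regex is the most convenient choice for getting started quickly. This article covers how to use C++ regex.
C++ regex provides three main functions: regex_match, regex_search, and regex_replace.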
regex_match
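regex_match checks whether a regular expression matches an entire target sequence. The example below shows how it is used; for full details, see the regex_match reference.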
// regex_match example
#include <iostream>
#include <string>
#include <regex>
int main (){
  if (std::regex_match ("subject", std::regex("(sub)(.*)") ))
    std::cout << "string literal matched\n";
  std::string s ("subject");
  std::regex e ("(sub)(.*)");
  if (std::regex_match (s,e))
    std::cout << "string object matched\n";
  if ( std::regex_match ( s.begin(), s.end(), e ) )
    std::cout << "range matched\n";
  std::cmatch cm;    // same as std::match_results<const char*> cm;
  std::regex_match ("subject",cm,e);
  std::cout << "string literal with " << cm.size() << " matches\n";
  std::smatch sm;    // same as std::match_results<string::const_iterator> sm;
  std::regex_match (s,sm,e);
  std::cout << "string object with " << sm.size() << " matches\n";
  std::regex_match ( s.cbegin(), s.cend(), sm, e);
  std::cout << "range with " << sm.size() << " matches\n";
  // using explicit flags:
  std::regex_match ( "subject", cm, e, std::regex_constants::match_default );
  std::cout << "the matches were: ";
  for (unsigned i=0; i<sm.size(); ++i) {
    std::cout << "[" << sm[i] << "] ";
  }
  std::cout << std::endl;
  return 0;
}
Output:
string literal matched
string object matched
range matched
string literal with 3 matches
string object with 3 matches
range with 3 matches
the matches were: [subject] [sub] [ject]
regex_search
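regex_search is the other matching function, and an example follows. The main difference between the two is that regex_match must match the entire target sequence, whereas regex_search looks for a match anywhere within it. For full details, see the regex_search reference.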
// regex_search example
#include <iostream>
#include <regex>
#include <string>
int main(){
  std::string s ("this subject has a submarine as a subsequence");
  std::smatch m;
  std::regex e ("\\b(sub)([^ ]*)");   // matches words beginning with "sub"
  std::cout << "Target sequence: " << s << std::endl;
  std::cout << "Regular expression: /\\b(sub)([^ ]*)/" << std::endl;
  std::cout << "The following matches and submatches were found:" << std::endl;
  while (std::regex_search (s,m,e)) {
    for (auto x=m.begin();x!=m.end();x++) 
      std::cout << x->str() << " ";
    std::cout << "--> ([^ ]*) match " << m.format("$2") <<std::endl;
    s = m.suffix().str();
  }
}
Output:
Target sequence: this subject has a submarine as a subsequence
Regular expression: /\b(sub)([^ ]*)/
The following matches and submatches were found:
subject sub ject --> ([^ ]*) match ject
submarine sub marine --> ([^ ]*) match marine
subsequence sub sequence --> ([^ ]*) match sequence
regex_replace
regex_replace replaces the text matched by a regular expression. An example follows; for full details, see the regex_replace reference.
// regex_replace example
#include <cstring>   // strlen
#include <iostream>
#include <regex>
#include <string>
int main() { 
    char buf[20]; 
    const char *first = "axayaz"; 
    const char *last = first + strlen(first); 
    std::regex rx("a"); 
    std::string fmt("A"); 
    std::regex_constants::match_flag_type fonly = 
        std::regex_constants::format_first_only; 
 
    *std::regex_replace(&buf[0], first, last, rx, fmt) = '\0'; 
    std::cout << &buf[0] << std::endl; 
 
    *std::regex_replace(&buf[0], first, last, rx, fmt, fonly) = '\0'; 
    std::cout << &buf[0] << std::endl; 
 
    std::string str("adaeaf"); 
    std::cout << std::regex_replace(str, rx, fmt) << std::endl; 
 
    std::cout << std::regex_replace(str, rx, fmt, fonly) << std::endl; 
 
    return 0; 
} 
Output:
AxAyAz
Axayaz
AdAeAf
Adaeaf
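By default, C++ regex uses the ECMAScript grammar, so its rules are much the same as in other programming languages:
Special pattern characters (for matching characters that are hard to write literally):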
| characters | description | matches | 
|---|---|---|
| . | not newline | any character except line terminators (LF, CR, LS, PS). | 
| \t | tab (HT) | a horizontal tab character (same as \u0009). | 
| \n | newline (LF) | a newline (line feed) character (same as \u000A). | 
| \v | vertical tab (VT) | a vertical tab character (same as \u000B). | 
| \f | form feed (FF) | a form feed character (same as \u000C). | 
| \r | carriage return (CR) | a carriage return character (same as \u000D). | 
| \cletter | control code | a control code character whose code unit value is the same as the remainder of dividing the code unit value of letter by 32. For example: \ca is the same as \u0001, \cb the same as \u0002, and so on... | 
| \xhh | ASCII character | a character whose code unit value has a hex value equivalent to the two hex digits hh. For example: \x4c is the same as L, or \x23 the same as #. | 
| \uhhhh | unicode character | a character whose code unit value has a hex value equivalent to the four hex digits hhhh. | 
| \0 | null | a null character (same as \u0000). | 
| \int | backreference | the result of the submatch whose opening parenthesis is the int-th (int shall begin by a digit other than 0). See groups below for more info. | 
| \d | digit | a decimal digit character | 
| \D | not digit | any character that is not a decimal digit character | 
| \s | whitespace | a whitespace character | 
| \S | not whitespace | any character that is not a whitespace character | 
| \w | word | an alphanumeric or underscore character | 
| \W | not word | any character that is not an alphanumeric or underscore character | 
| \character | character | the character itself, taken literally, without its special meaning within a regular expression. Any character can be escaped except those which form any of the special character sequences above. Needed for: ^ $ \ . * + ? ( ) [ ] { } \| | 
| [class] | character class | the target character is part of the class | 
| [^class] | negated character class | the target character is not part of the class | 
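Note that the backslash (\) is itself an escape character in C++ string literals, so every backslash in a pattern has to be doubled:
std::regex e1 ("\\d");  //  \d -> matches a digit character
std::regex e2 ("\\\\"); //  \\ -> matches a backslash character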
Quantifiers:
| characters | times | effects | 
|---|---|---|
| * | 0 or more | The preceding atom is matched 0 or more times. | 
| + | 1 or more | The preceding atom is matched 1 or more times. | 
| ? | 0 or 1 | The preceding atom is optional (matched either 0 times or once). | 
| {int} | int | The preceding atom is matched exactly int times. | 
| {int,} | int or more | The preceding atom is matched int or more times. | 
| {min,max} | between min and max | The preceding atom is matched at least min times, but not more than max. | 
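Note that quantifiers are greedy by default and become lazy when followed by ?. Matching the pattern "(a+).*" against "aardvark" captures aa in the group, whereas "(a+?).*" captures only a.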
Groups (for matching several consecutive characters as a unit):
| characters | description | effects | 
|---|---|---|
| (subpattern) | Group | Creates a backreference. | 
| (?:subpattern) | Passive group | Does not create a backreference. | 
Note that the first form creates a backreference, which can be used to extract the matched text; the second does not, and so avoids that overhead.
Assertions and alternatives:
| characters | description | condition for match | 
|---|---|---|
| ^ | Beginning of line | Either it is the beginning of the target sequence, or follows a line terminator. | 
| $ | End of line | Either it is the end of the target sequence, or precedes a line terminator. | 
| \| | Separator | Separates two alternative patterns or subpatterns. | 
Individual characters:
[abc] matches a, b, or c.
[^xyz] matches any character other than x, y, or z.
Ranges:
[a-z] matches any lowercase letter (a, b, c, ..., z).
[abc1-5] matches a, b, c, or a digit from 1 to 5.
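C++ regex also supports POSIX-like character class names, which are written inside a bracket expression (for example, [[:alpha:]]):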
| class | description | equivalent (with regex_traits, default locale) | 
|---|---|---|
| [:alnum:] | alpha-numerical character | isalnum | 
| [:alpha:] | alphabetic character | isalpha | 
| [:blank:] | blank character | isblank | 
| [:cntrl:] | control character | iscntrl | 
| [:digit:] | decimal digit character | isdigit | 
| [:graph:] | character with graphical representation | isgraph | 
| [:lower:] | lowercase letter | islower | 
| [:print:] | printable character | isprint | 
| [:punct:] | punctuation mark character | ispunct | 
| [:space:] | whitespace character | isspace | 
| [:upper:] | uppercase letter | isupper | 
| [:xdigit:] | hexadecimal digit character | isxdigit | 
| [:d:] | decimal digit character | isdigit | 
| [:w:] | word character | isalnum, or an underscore | 
| [:s:] | whitespace character | isspace | 
References:
http://www.cplusplus.com/reference/regex/
