C++ regex: Regular Expressions

Originally published on CSDN on 2014-02-09 19:32:01; archived here as part of a collection of my earlier articles.

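In C++ there are three regular expression libraries to choose from: C++ regex, C regex, and Boost regex. When developing C++ on Windows, the latter two are not supported out of the box, so C++ regex is clearly the most convenient choice if you want to get going quickly. This article covers how to use C++ regex.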

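C++ regex provides three functions: regex_match (tests whether the entire target sequence matches the pattern), regex_search (looks for a match anywhere in the target sequence), and regex_replace (replaces matches with a format string).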



// regex_match example
#include <iostream>
#include <string>
#include <regex>
int main (){

  if (std::regex_match ("subject", std::regex("(sub)(.*)") ))
    std::cout << "string literal matched\n";

  std::string s ("subject");
  std::regex e ("(sub)(.*)");
  if (std::regex_match (s,e))
    std::cout << "string object matched\n";

  if ( std::regex_match ( s.begin(), s.end(), e ) )
    std::cout << "range matched\n";

  std::cmatch cm;    // same as std::match_results<const char*> cm;
  std::regex_match ("subject",cm,e);
  std::cout << "string literal with " << cm.size() << " matches\n";

  std::smatch sm;    // same as std::match_results<string::const_iterator> sm;
  std::regex_match (s,sm,e);
  std::cout << "string object with " << sm.size() << " matches\n";

  std::regex_match ( s.cbegin(), s.cend(), sm, e);
  std::cout << "range with " << sm.size() << " matches\n";

  // using explicit flags:
  std::regex_match ( "subject", cm, e, std::regex_constants::match_default );

  std::cout << "the matches were: ";
  for (unsigned i=0; i<sm.size(); ++i) {
    std::cout << "[" << sm[i] << "] ";
  }

  std::cout << std::endl;

  return 0;
}

Output:
string literal matched
string object matched
range matched
string literal with 3 matches
string object with 3 matches
range with 3 matches
the matches were: [subject] [sub] [ject]



// regex_search example
#include <iostream>
#include <regex>
#include <string>
int main(){
  std::string s ("this subject has a submarine as a subsequence");
  std::smatch m;
  std::regex e ("\\b(sub)([^ ]*)");   // matches words beginning by "sub"

  std::cout << "Target sequence: " << s << std::endl;
  std::cout << "Regular expression: /\\b(sub)([^ ]*)/" << std::endl;
  std::cout << "The following matches and submatches were found:" << std::endl;

  while (std::regex_search (s,m,e)) {
    for (auto x=m.begin();x!=m.end();x++)
      std::cout << x->str() << " ";
    std::cout << "--> ([^ ]*) match " << m.format("$2") <<std::endl;
    s = m.suffix().str();   // continue searching in the rest of the string
  }

  return 0;
}

Output:
Target sequence: this subject has a submarine as a subsequence
Regular expression: /\b(sub)([^ ]*)/
The following matches and submatches were found:
subject sub ject --> ([^ ]*) match ject
submarine sub marine --> ([^ ]*) match marine
subsequence sub sequence --> ([^ ]*) match sequence



// regex_replace example
#include <cstring>    // std::strlen
#include <iostream>
#include <regex>
#include <string>
int main() {
    char buf[20];
    const char *first = "axayaz";
    const char *last = first + std::strlen(first);
    std::regex rx("a");
    std::string fmt("A");
    // format_first_only: replace only the first match
    std::regex_constants::match_flag_type fonly =
        std::regex_constants::format_first_only;
    // iterator overload: writes into buf and returns the end of the output
    *std::regex_replace(&buf[0], first, last, rx, fmt) = '\0';
    std::cout << &buf[0] << std::endl;                            // AxAyAz
    *std::regex_replace(&buf[0], first, last, rx, fmt, fonly) = '\0';
    std::cout << &buf[0] << std::endl;                            // Axayaz
    std::string str("adaeaf");
    std::cout << std::regex_replace(str, rx, fmt) << std::endl;         // AdAeAf
    std::cout << std::regex_replace(str, rx, fmt, fonly) << std::endl;  // Adaeaf
    return 0;
}

The syntax of C++ regex (the ECMAScript grammar by default) is much the same as in other programming languages:

| characters | description | matches |
|---|---|---|
| `.` | not newline | any character except line terminators (LF, CR, LS, PS). |
| `\t` | tab (HT) | a horizontal tab character (same as `\u0009`). |
| `\n` | newline (LF) | a newline (line feed) character (same as `\u000A`). |
| `\v` | vertical tab (VT) | a vertical tab character (same as `\u000B`). |
| `\f` | form feed (FF) | a form feed character (same as `\u000C`). |
| `\r` | carriage return (CR) | a carriage return character (same as `\u000D`). |
| `\cletter` | control code | a control code character whose code unit value is the same as the remainder of dividing the code unit value of letter by 32. For example: `\ca` is the same as `\u0001`, `\cb` the same as `\u0002`, and so on. |
| `\xhh` | ASCII character | a character whose code unit value has a hex value equivalent to the two hex digits hh. For example: `\x4c` is the same as L, and `\x23` the same as #. |
| `\uhhhh` | unicode character | a character whose code unit value has a hex value equivalent to the four hex digits hhhh. |
| `\0` | null | a null character (same as `\u0000`). |
| `\int` | backreference | the result of the submatch whose opening parenthesis is the int-th (int shall begin with a digit other than 0). See groups below for more info. |
| `\d` | digit | a decimal digit character |
| `\D` | not digit | any character that is not a decimal digit character |
| `\s` | whitespace | a whitespace character |
| `\S` | not whitespace | any character that is not a whitespace character |
| `\w` | word | an alphanumeric or underscore character |
| `\W` | not word | any character that is not an alphanumeric or underscore character |
| `\character` | character | the character as it is, without interpreting its special meaning within a regex. Any character can be escaped except those which form any of the special character sequences above. Needed for: `^ $ \ . * + ? ( ) [ ] { } \|` |
| `[class]` | character class | the target character is part of the class |
| `[^class]` | negated character class | the target character is not part of the class |

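Note that C++ string literals also use the backslash as an escape character, so every backslash in a pattern has to be doubled:

std::regex e1 ("\\d");    // \d -> matches a digit character
std::regex e2 ("\\\\");   // \\ -> matches a backslash character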


| characters | times | effects |
|---|---|---|
| `*` | 0 or more | The preceding atom is matched 0 or more times. |
| `+` | 1 or more | The preceding atom is matched 1 or more times. |
| `?` | 0 or 1 | The preceding atom is optional (matched either 0 times or once). |
| `{int}` | int | The preceding atom is matched exactly int times. |
| `{int,}` | int or more | The preceding atom is matched int or more times. |
| `{min,max}` | between min and max | The preceding atom is matched at least min times, but not more than max. |

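Note that quantifiers are greedy by default, and appending ? makes them match as little as possible: matching the pattern "(a+).*" against "aardvark" captures aa in the group, whereas "(a+?).*" captures only a.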

| characters | description | effects |
|---|---|---|
| `(subpattern)` | Group | Creates a backreference. |
| `(?:subpattern)` | Passive group | Does not create a backreference. |


| characters | description | condition for match |
|---|---|---|
| `^` | Beginning of line | Either it is the beginning of the target sequence, or follows a line terminator. |
| `$` | End of line | Either it is the end of the target sequence, or precedes a line terminator. |
| `\|` | Separator | Separates two alternative patterns or subpatterns. |

[abc] matches a, b, or c.
[^xyz] matches any character other than x, y, or z.
[a-z] matches any lowercase letter (a, b, c, ..., z).
[abc1-5] matches a, b, c, or a digit from 1 to 5.

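C++ regex also supports POSIX-like character class names. They are written inside a bracket expression, for example [[:alpha:]]: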

| class | description | equivalent (with regex_traits, default locale) |
|---|---|---|
| `[:alnum:]` | alpha-numerical character | isalnum |
| `[:alpha:]` | alphabetic character | isalpha |
| `[:blank:]` | blank character | isblank |
| `[:cntrl:]` | control character | iscntrl |
| `[:digit:]` | decimal digit character | isdigit |
| `[:graph:]` | character with graphical representation | isgraph |
| `[:lower:]` | lowercase letter | islower |
| `[:print:]` | printable character | isprint |
| `[:punct:]` | punctuation mark character | ispunct |
| `[:space:]` | whitespace character | isspace |
| `[:upper:]` | uppercase letter | isupper |
| `[:xdigit:]` | hexadecimal digit character | isxdigit |
| `[:d:]` | decimal digit character | isdigit |
| `[:w:]` | word character | isalnum, plus the underscore |
| `[:s:]` | whitespace character | isspace |


