A Joyful AI Research Journey🌳😊

The regex pattern \b\w+\b with examples

yjyuwisely 2023. 9. 9. 21:57

Let's break down the regex pattern \b\w+\b and explain it with examples.

1. \w

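The \w metacharacter matches any word character. In ASCII-only matching this is equivalent to the character set [a-zA-Z0-9_]. Some engines, including Python 3's re module for str patterns, also match Unicode letters and digits by default unless an ASCII flag such as re.ASCII is set. The ASCII set includes: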

  • Uppercase letters: A to Z
  • Lowercase letters: a to z
  • Digits: 0 to 9
  • Underscore: _
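To see the character class in action, here is a minimal sketch. It assumes Python's re module (the post itself doesn't name an engine), and the sample characters are just illustrative picks. It checks single characters against \w with and without the re.ASCII flag.

```python
import re

# Illustrative sample characters (an assumption, not from the post).
# "\u00e9" is the precomposed letter e-acute.
samples = ["A", "z", "7", "_", "-", " ", "\u00e9"]

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_word = [ch for ch in samples if re.fullmatch(r"\w", ch, flags=re.ASCII)]

# Without the flag, Python 3 str patterns also accept Unicode letters and digits.
unicode_word = [ch for ch in samples if re.fullmatch(r"\w", ch)]

print(ascii_word)    # ['A', 'z', '7', '_']
print(unicode_word)  # ['A', 'z', '7', '_', 'é']
```

The hyphen and the space are rejected in both modes, which is what later creates word boundaries around them.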

2. \w+

The + is a quantifier that means "one or more" of the preceding element (a character, class, or group). So, \w+ matches a run of one or more consecutive word characters. Here are some examples:

  • apple: Matched in full by \w+ because every character in it is a word character.
  • a: Also matched, because a single word character satisfies "one or more".
  • a_b: Matched as one unit, because the underscore belongs to the word character class.
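The three examples above can be checked with a short sketch, again assuming Python's re module. The extra "a-b" and empty-string checks are my own additions to show where \w+ stops and why "one or more" matters.

```python
import re

pattern = re.compile(r"\w+")

# fullmatch succeeds only if the whole string is one run of word characters.
results = {text: bool(pattern.fullmatch(text)) for text in ["apple", "a", "a_b"]}
print(results)  # {'apple': True, 'a': True, 'a_b': True}

# The + quantifier needs at least one character, so the empty string fails.
print(pattern.fullmatch(""))  # None

# A hyphen is not a word character, so it splits the run in two.
print(pattern.findall("a-b"))  # ['a', 'b']
```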

3. \b

The \b is a word boundary assertion. It consumes no characters; it matches a position where a word character sits on one side and a non-word character, or the start or end of the string, sits on the other. This makes it useful for telling whole words apart from substrings inside longer words.

For instance:

  • In the string apple pie, there's a word boundary between e and the space, and another between the space and p (plus one at the very start and one at the very end of the string).
  • In the string applepie, there's no word boundary between e and p because both are word characters.
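Since \b matches positions rather than characters, one way to make it visible is to list the indices where it matches. This is a small sketch assuming Python's re module; boundary_positions is a hypothetical helper name made up for illustration.

```python
import re

def boundary_positions(text):
    # \b matches an empty string, so each match's start index is a boundary position.
    return [m.start() for m in re.finditer(r"\b", text)]

# Boundaries at the start, on each side of the space, and at the end.
print(boundary_positions("apple pie"))  # [0, 5, 6, 9]

# No interior boundary: every neighbouring pair is two word characters.
print(boundary_positions("applepie"))   # [0, 8]
```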

Putting it together: \b\w+\b

This pattern matches a complete run of word characters that begins and ends at a word boundary, so each match is a whole word rather than a fragment from inside one.

Examples:

  • In the string apple pie, \b\w+\b matches apple and pie separately.
  • In the string applepie, \b\w+\b matches the entire string as a single word.
  • In the string an apple, \b\w+\b matches an and apple separately.
  • In the string apple-pie, \b\w+\b matches apple and pie separately because the hyphen is not a word character, so there's a word boundary on both sides of it.
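The four examples above can be reproduced with a few lines. This is a sketch assuming Python's re module; WORD and matches are just names chosen here.

```python
import re

WORD = re.compile(r"\b\w+\b")

# Map each example string to the list of whole-word matches.
matches = {text: WORD.findall(text)
           for text in ["apple pie", "applepie", "an apple", "apple-pie"]}

for text, found in matches.items():
    print(repr(text), "->", found)
# 'apple pie' -> ['apple', 'pie']
# 'applepie' -> ['applepie']
# 'an apple' -> ['an', 'apple']
# 'apple-pie' -> ['apple', 'pie']
```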

Using \b\w+\b ensures that words are matched as distinct tokens, even if they are adjacent to non-word characters like punctuation or spaces.
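As a rough illustration of that tokenizing behavior, here is a minimal sketch of a word tokenizer built on the pattern. It assumes Python's re module; tokenize and the sample sentence are hypothetical and not from any particular NLP library.

```python
import re

WORD = re.compile(r"\b\w+\b")

def tokenize(text):
    # Return (token, start, end) for every whole-word match, in order.
    return [(m.group(), m.start(), m.end()) for m in WORD.finditer(text)]

tokens = tokenize("Hello, world!")
print(tokens)  # [('Hello', 0, 5), ('world', 7, 12)]
```

The comma, the space, and the exclamation mark never show up in the output, because none of them is a word character.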
