Keep Calm and Regex - Angela Madrid
Server Projects
A Quick Overview
The Scary Bit
Regular Expression
[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}
Matching Text
202ca4c2-749d-4f54-ae02-fdf19939ef10
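The expression above can be checked against the sample text in a few lines. This is only a sketch: memoQ uses the .NET engine, and Python's `re` module stands in for it here because it treats this particular pattern the same way. The variable names are illustrative.

```python
import re

# Pattern and sample text copied from the slide.
GUID_PATTERN = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
sample = "202ca4c2-749d-4f54-ae02-fdf19939ef10"

# fullmatch succeeds only if the whole string fits the pattern.
guid_matches = re.fullmatch(GUID_PATTERN, sample) is not None
print(guid_matches)  # True

# The sets only list lower-case a-f, so an upper-case copy does not match.
upper_matches = re.fullmatch(GUID_PATTERN, sample.upper()) is not None
print(upper_matches)  # False
```

Read piece by piece, the pattern is just "8 hex characters, hyphen, 4, hyphen, 4, hyphen, 4, hyphen, 12".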
What Are Regular Expressions?
• They are not a programming language
• Symbols that describe a text pattern
• Used to match, search and manipulate text
• A more powerful “Search and replace”
• Called “regex” for short
• There are several regex engines or “flavours”
• memoQ uses Microsoft .NET
How Long Does It Take
to Learn a New Language?
*http://www.effectivelanguagelearning.com/language-guide/language-difficulty
How Long Does It Take
to Learn Regex?
You can start creating your own basic expressions within a few minutes.
SIGH OF RELIEF
What Are They Used For?
• Search and match:
– Email addresses
– URLs
– Tags and placeholders
– Phone number formats
– Alternate spellings
– Consistency checks (e.g. lower case v. upper case)
– Trailing spaces
– Other repetitive text
Where in memoQ?
• Source and target filtering
• Find and replace
• Auto-translation rules
• Segmentation
• Filters:
– Regex Tagger
– Regex Text Filter
Search
Two Types of Regex Text
Literal characters
Expression: bomb
Sample text:
– bomb
– bomber
– A-bomb
– The bomb went off.
– Bombs off.
Metacharacters
\ . * ? + [] - | () {} $ ^
Metacharacters
. Any character
* Preceding item zero or more times
? Preceding item zero or one time
+ Preceding item one or more times
[ Begin character set
] End character set
- Separator in ranges
| Either or
{} Bean counting
^ Start of segment / Negate a character set
$ End of segment
( Begin group
) End group
Character Sets
Will match any one of the characters in the set, but only once, unless otherwise specified by bean counting {}
[a-z] Lower case
[A-Z] Upper case
[A-Za-z] Any case (not [A-z], which also matches [ \ ] ^ _ `)
[0-9] Digits
[0-9A-Za-z] Digits + letters
\p{Ll} Lower + special letters
\p{Lu} Upper + special letters
\p{L} Any case + special letters
Can be negated using ^
[^0-9] Any character except a digit
Can be combined
[0-9a-e ,]
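A few of these sets can be tried out as follows. This is a rough sketch in Python rather than in memoQ's .NET engine; the sample strings are made up, and the `\p{L}` family is left out because Python's built-in `re` module does not support it.

```python
import re

# [A-Za-z] matches ASCII letters only.
ascii_letters = re.findall(r"[A-Za-z]", "a_Z^é")
print(ascii_letters)  # ['a', 'Z']

# [A-z] covers everything from 'A' to 'z' in ASCII order,
# which also includes the characters [ \ ] ^ _ `
a_to_z_range = re.findall(r"[A-z]", "a_Z^é")
print(a_to_z_range)  # ['a', '_', 'Z', '^']

# A negated set: any character except a digit.
non_digits = re.findall(r"[^0-9]", "a1b2")
print(non_digits)  # ['a', 'b']

# A combined set: a digit, a letter from a to e, a space or a comma.
combined = re.findall(r"[0-9a-e ,]", "b4, z")
print(combined)  # ['b', '4', ',', ' ']
```

Each set matches exactly one character per hit, which is why `findall` returns single-character strings.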
Shorthand Character Sets
\d Digit
\w Digit, letter OR underscore
\s Whitespace
\b Boundary (Beginning OR end of word)
\t Tab
\r Carriage return
\n New line
\D Not a digit
\W Not a digit, a letter OR an underscore
\S Not a whitespace
\tag memoQ tag
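The shorthand sets can be tried on a made-up string like the one below. This is a sketch in Python, whose `re` module shares these shorthands with .NET; `\tag` is specific to memoQ and is not available here.

```python
import re

# Illustrative text containing a tab (\t) and an underscore.
text = "Order 66:\tship_by 2024"

digits = re.findall(r"\d+", text)
print(digits)  # ['66', '2024']

# \w includes the underscore, so "ship_by" stays in one piece.
words = re.findall(r"\w+", text)
print(words)  # ['Order', '66', 'ship_by', '2024']

# \s covers both the spaces and the tab.
chunks = re.split(r"\s+", text)
print(chunks)  # ['Order', '66:', 'ship_by', '2024']

# \b needs a word/non-word transition; "ship" is followed by "_",
# which counts as a word character, so there is no boundary there.
whole_word = re.findall(r"\bship\b", text)
print(whole_word)  # []
```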
“Escaping” Metacharacters
If you need to match a special character in the text, you will have to “escape” it, or mark it for its literal meaning. This is achieved by putting a backslash in front of it.
\. \? \* \+ \[ \] \- \|
\( \) \{ \} \$ \^ \! \\
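The difference an escape makes can be seen in a small test. The sketch below uses Python as a stand-in for memoQ's .NET engine, and the sample strings are invented for illustration.

```python
import re

price = "Total (net): $4.50"

# Unescaped, "." means "any character", so the pattern also accepts "4x50".
unescaped_hit = bool(re.search(r"4.50", "4x50"))
print(unescaped_hit)  # True

# Escaped, "\." only accepts a literal full stop.
escaped_hit = bool(re.search(r"4\.50", "4x50"))
print(escaped_hit)  # False

# Parentheses and the dollar sign need escaping too.
literal = re.search(r"\(net\): \$4\.50", price).group()
print(literal)  # (net): $4.50

# Python can add the backslashes for you with re.escape.
auto_escaped = re.search(re.escape("$4.50"), price).group()
print(auto_escaped)  # $4.50
```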
Find and Replace
Replace expressions allow you to choose which parts of the text to replace and which parts to keep as they are. This is achieved via groups ().
Search: (\d{1,3})\s{1,}[mM][gG]
Replace: $1 mg
Finds: 225 mG
Replaces with: 225 mg
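The same search and replace can be sketched outside memoQ. The example below uses Python, where the first group is written `\1` in the replacement instead of the `$1` used by .NET and memoQ; the surrounding sentence is invented for illustration.

```python
import re

# Search expression copied from the slide.
search = r"(\d{1,3})\s{1,}[mM][gG]"

# .NET / memoQ: "$1 mg". Python's re.sub spells the group reference \1.
replace = r"\1 mg"

result = re.sub(search, replace, "Dose: 225   mG daily")
print(result)  # Dose: 225 mg daily
```

Group 1 keeps the number as found, while the run of spaces and the mixed-case unit are rewritten to a single space and "mg".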
Greedy v. Lazy
Dangers of Greediness
By default, regex quantifiers are greedy, so it is a good habit to limit your expressions as much as possible to avoid matching more text than you intend.
Example:
pur.*\b (greedy) will match everything from “purées” to “description”:
“All purées contains at least 10% of the main ingredient, unless otherwise specified in the purée description.”
pur.*?\b (lazy) will match only “purées” and “purée”:
“All purées contains at least 10% of the main ingredient, unless otherwise specified in the purée description.”
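The two expressions can be compared on the sample sentence. The sketch below uses Python, which handles greedy and lazy quantifiers the same way as .NET for this case; the variable names are illustrative.

```python
import re

# Sample sentence copied from the slide, as a single segment.
sentence = ("All purées contains at least 10% of the main ingredient, "
            "unless otherwise specified in the purée description.")

# Greedy: .* grabs as much as it can, then backs off to the LAST word boundary.
greedy = re.search(r"pur.*\b", sentence).group()
print(greedy)

# Lazy: .*? grabs as little as it can, stopping at the FIRST word boundary.
lazy = re.findall(r"pur.*?\b", sentence)
print(lazy)  # ['purées', 'purée']
```

The greedy version returns almost the whole sentence, from "purées" up to and including "description", which is rarely what a find-and-replace is meant to touch.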
Auto-Translation: Practical Cases
To have memoQ display certain patterns of text as auto-translation results, you can use expressions such as the ones below. Insert them in the rules section and use a replacement rule in the replace order section.
If you enclose the full match expression in parentheses and use $1 for replacement, you will achieve an identical match in auto-translation, but other types of manipulation are possible.
• Email addresses
(\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)
• URLs
((https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?)
• Phone numbers
(\d{5}\s\d{6}) matches 01908 443300
(\d{5}-\d{6}) matches 01908-443300
(\+\d{2}\s\(0\)\s\d{4}\s\d{6}) matches +44 (0) 1908 443300
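Before pasting expressions like these into a rule, it can help to try them on sample text. The sketch below does so in Python as a stand-in for memoQ's .NET engine; it covers the email and phone patterns only, and the sentence around the email address is invented. The third phone pattern is written with no space before its closing parenthesis, so the match does not have to end in a space.

```python
import re

# Patterns copied from the slide; the outer parentheses form group 1 ($1 in memoQ).
EMAIL = r"(\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)"
PHONE_PATTERNS = [
    r"(\d{5}\s\d{6})",
    r"(\d{5}-\d{6})",
    r"(\+\d{2}\s\(0\)\s\d{4}\s\d{6})",
]
phone_samples = ["01908 443300", "01908-443300", "+44 (0) 1908 443300"]

# Each pattern should reproduce its sample unchanged through group 1.
phone_results = [
    re.fullmatch(pattern, sample).group(1)
    for pattern, sample in zip(PHONE_PATTERNS, phone_samples)
]
print(phone_results)

email_result = re.search(
    EMAIL, "Write to angela.madrid@k-international.com today"
).group(1)
print(email_result)  # angela.madrid@k-international.com
```

Because group 1 wraps the whole expression, replacing with the first group alone reproduces the source text exactly, which is the "identical match" behaviour described above.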
Auto-Translation: Where in memoQ?
Segmentation: Practical Case
SOURCE: “Manufactured in China (PRC) for the UK market.
Ingredients: Lemon Grass Purée (15%), Red Chilli Purée
(11%), Onion, Water, Coconut Milk, Red Pepper, Galangal
(5%), Sugar (Sulphites), Lime Juice From Concentrate
(Sulphites), Salt, Rapeseed Oil, Garlic Purée, Rice Wine
Vinegar (Sulphites), Lime Leaves (2.5%), Yeast Extract,
Chilli Flakes, Cornflour, Tamarind Paste, Coriander,
Cayenne Pepper, Paprika Extract.”
[\s]+#!#\([\s]*[\p{L}0-9]*\.?\d*\s*%?\),\s+\p{Lu}
Regex Tagger: Practical Case
SOURCE: “Dear [%$FIRSTNAME%] [%$LASTNAME%], Your
online order placed on [%$WEBSITE%] on [%$DATE%] and
processed as the authorized vendor of [%$RANGE%]
products, has been successfully completed (order
number: [%$REFNO%]). Please note that [%if $ORDER !=
""%][%$ORDER%][%else%] [%$COMPANY%] will appear
on your bank statement, instead of [%$RANGE%].”
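The slide shows only the source text, so the expression below is an assumption rather than the rule used in the presentation. It is one possible way to find every [%...%] placeholder so that it could be turned into a tag, sketched in Python with a lazy quantifier; `PLACEHOLDER` is a hypothetical name.

```python
import re

# Source text copied from the slide, joined into one string.
source = ('Dear [%$FIRSTNAME%] [%$LASTNAME%], Your online order placed on '
          '[%$WEBSITE%] on [%$DATE%] and processed as the authorized vendor of '
          '[%$RANGE%] products, has been successfully completed (order number: '
          '[%$REFNO%]). Please note that [%if $ORDER != ""%][%$ORDER%][%else%] '
          '[%$COMPANY%] will appear on your bank statement, instead of [%$RANGE%].')

# Hypothetical pattern: "[%", then as little text as possible, then "%]".
PLACEHOLDER = r"\[%.*?%\]"

tags = re.findall(PLACEHOLDER, source)
print(len(tags))  # 11
print(tags[0])    # [%$FIRSTNAME%]
print(tags[6])    # [%if $ORDER != ""%]
```

The lazy `.*?` matters here: a greedy `.*` would swallow everything from the first "[%" to the last "%]" as a single match.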
angela.madrid@k-international.com