Pybot Wiki
This page will INTRODUCE you to a basic topic.
This page describes a bit of REGEX.

A regular expression, or regex, is a series of semantically meaningful characters that pywikipedia uses to find and match patterns within the whole body of your wiki's text.

Most pywikipedia scripts accept regular expressions through the -regex option, with the expression itself given in double quotes.

Case study

It's perhaps better to leave it to Wikipedia and other sources to define regex in detail. Here at Pybot, we'll just "define by doing", and present this practical example that most Wikians can recognise.

Imagine that you've discovered that your editors have repeated a mistake many times on your site. For whatever reason, in every infobox about a book, they've decided to put brackets around the pagecount variable, and so now you've got a lot of redlinks to simple, three-digit numbers. You'll want to get rid of those links, but preserve the numbers themselves. So how do you do this, given that:

  • the numbers aren't the same
  • people haven't always used the same number of spaces between the variable name and the equals sign (=)

Your only option in such a case is regex.

Here's what you'd do:

python replace.py -regex -summary:"de-linking page numbers in infobox" "pagecount( *?)=( *?)\[\[(.*?)\]\]" "pagecount\1=\2\3" -catr:"Books"

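To try the pattern before letting a bot loose on the wiki, the same search and replacement can be sketched with Python's own re module. This is only an illustration, assuming the bot's regex behaves like re.sub; the function name delink_pagecount and the sample text are made up for this example.

```python
import re

# Same pattern and replacement as in the command above, written as raw strings.
PATTERN = r"pagecount( *?)=( *?)\[\[(.*?)\]\]"
REPLACEMENT = r"pagecount\1=\2\3"


def delink_pagecount(wikitext):
    """Strip the [[...]] around the pagecount value, keeping the original spacing."""
    return re.sub(PATTERN, REPLACEMENT, wikitext)


# Hypothetical infobox lines with different numbers and different spacing.
sample = "|pagecount = [[540]]\n|pagecount=[[312]]\n|pagecount   =  [[96]]"
result = delink_pagecount(sample)
print(result)
```

Each line keeps its own spacing around the equals sign and loses only the double brackets.
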
As you can see, the regex is a whole bunch of symbols that appear meaningless, but which actually have quite a bit of kick. Going character-by-character from left to right, here's what this expression means:

  • ": start this regular expression
  • pagecount: every time you see the word "pagecount"
  • (: begin group 1
  • <space>*: if you encounter 0 or more spaces
  • ?: don't be greedy about it; select the number of spaces only up to the "next event", which in this case is the equals sign that's coming up
  • ): close group 1
  • =: the literal equals sign
When you put all the characters so far together, you get, "Look for every instance of 'pagecount' followed by all the spaces between it and the equals sign, and include the equals sign, too."
  • ( *?): group 2 (parentheses always define groups, and this is their second use in this regex), which is all the spaces after the equals sign until the "next event"
  • \: "escape" the next character, because it has a semantic meaning in regular expressions
  • [: the first of two opening square brackets
  • \[: escape the other opening square bracket, too
  • (.: open the definition of group 3, then look for any character
  • *: match as many of those characters as there are, from this point until I tell you to stop
  • ?: don't be greedy about it; stop at the "next event"
  • ): close group 3
  • \]\]: escape the closing brackets (and, oh, by the way, this is the "next event")
  • ": closes the regex
  • "pagecount\1=\2\3": the replacement text. It means, "On this page, where you find 'pagecount= [[something]]', replace it with pagecount(however many spaces are currently before the = sign)=(however many spaces are currently after the = sign)(whatever currently exists between the double brackets)." So, if you started with pagecount = [[540]], you'd end up with pagecount = 540.
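
The group-by-group reading above can be checked with a short sketch in Python's re module, assuming the bot's regex behaves the same way. The variable names and the sample lines are invented for this illustration. The second half compares the non-greedy (.*?) with a greedy (.*) on a line that holds two links.

```python
import re

PATTERN = r"pagecount( *?)=( *?)\[\[(.*?)\]\]"

# Each pair of parentheses captures one piece of the matched text.
match = re.search(PATTERN, "pagecount  = [[540]]")
spaces_before, spaces_after, inner_text = match.groups()
print(repr(spaces_before))  # group 1: the spaces before the equals sign
print(repr(spaces_after))   # group 2: the spaces after the equals sign
print(repr(inner_text))     # group 3: the text between the double brackets

# Greedy versus non-greedy on a made-up line containing two links.
line = "pagecount = [[540]] and [[other]]"
lazy = re.search(r"\[\[(.*?)\]\]", line).group(1)   # stops at the first ]]
greedy = re.search(r"\[\[(.*)\]\]", line).group(1)  # runs on to the last ]]
print(repr(lazy))
print(repr(greedy))
```

The non-greedy form stops at the first pair of closing brackets, which is why the walkthrough keeps adding the question mark; the greedy form would swallow everything up to the last pair on the line.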

Regex on this wiki

What we aim to do on this wiki is to give you little snippets of regex to help you solve all manner of common, and sometimes obscure, problems of wiki maintenance. Our regex library is in the Regex namespace. If you want to see some of these ready-made solutions, please click through to the Regex Repository.