It is the first thing the current parser does after preprocessing, in the "strip" method. The only
thing that ends one of these blocks is the matching close tag.
Notes:
Nowiki, pre and html-comment are always available.
Html is available if $wgRawHtml is true in LocalSettings.php.
Math is available if the Math extension is installed.
Other tags may be available if installed and present in parser->mTagHooks.
Magic links are words that may appear within <wiki-text> that are automatically converted
to external links without any special markup being required by the person writing the page.
Note:
All character literals on this page are case-sensitive (i.e. upper-case characters in the definitions
on this page MUST be written in upper case in the markup).
HTML entity
The parser recognises validly constructed HTML entities and leaves them alone.
<html-entity> ::= "&" <html-entity-name> ";"
| "&#" <decimal-number> ";"
| "&#x" <hex-number> ";"
<html-entity-name> ::= Sanitizer::$wgHtmlEntities (case sensitive)
(* "Aacute" | "aacute" | ... *)
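As a rough illustration of the <html-entity> rule, the sketch below checks whether a string is exactly one entity. It is an assumption-laden sketch, not the parser's code: Python's html.entities.name2codepoint stands in for Sanitizer::$wgHtmlEntities (the two lists are not guaranteed to agree), hex digits are accepted in either case, and the function name is hypothetical.

```python
import re
from html.entities import name2codepoint

# Stand-in for Sanitizer::$wgHtmlEntities (assumption: not the same list).
KNOWN_NAMES = set(name2codepoint)

# "&#x" hex ";"  |  "&#" decimal ";"  |  "&" name ";"
# The "x" is lower case only, because literals are case-sensitive.
ENTITY_RE = re.compile(r"&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));")


def is_html_entity(text):
    """Return True if text is exactly one validly constructed entity."""
    match = ENTITY_RE.fullmatch(text)
    if match is None:
        return False
    name = match.group(3)
    # Numeric forms pass as-is; named forms must be in the table, exact case.
    return name is None or name in KNOWN_NAMES
```

Anything that fails this check would fall through to the unescaped-ampersand case.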
Rendering
<unescaped-ampersand> → &amp;
<unescaped-less-than> → &lt;
<unescaped-greater-than> → &gt;
Text
Harmless characters are characters that cannot be part of any other construct. I'm not sure how
useful this is as a distinction, but perhaps it will help speed things up?
A "random character" is any character which hasn't matched anything else.
Rendering
Both types are written literally.
<newline> ::= CR LF | LF CR | CR | LF
<BOL> ::= <newline> | BOF
<EOL> ::= <newline> | EOF
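The <newline> alternatives can be tried out with a small splitter. This is only a sketch under the assumption that the two-character forms take priority over the single-character ones; the function name is hypothetical.

```python
import re

# Two-character sequences come first so CR LF and LF CR count as one newline.
NEWLINE_RE = re.compile(r"\r\n|\n\r|\r|\n")


def split_lines(text):
    """Split text at every <newline>, whichever of the four forms it uses."""
    return NEWLINE_RE.split(text)
```

With this ordering, each element of the result starts at a <BOL> and ends at an <EOL>.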
Reality:
Rendering
Once the parser has decided which way the toggles go:
bold-toggle-on -> <b>
bold-toggle-off -> </b>
italic-toggle-on -> <i>
italic-toggle-off -> </i>
bold-italic-toggle-on -> <b> <i>
bold-italic-toggle-off -> </i> </b>
Three ( ''' ):
1. Bold (default)
e.g. ( hello ''' blah ) → hello blah
2. Apostrophe, italics
If there is otherwise an odd number of both bold and italics
1. If the preceding characters are <space><non-space> (and
there are no earlier such sequences)
e.g. ( hello l'''amour'' l'''ouest''' blah ) → hello l'amour louest blah
2. Else if the preceding characters are <non-space><non-space>
(and there are no earlier such sequences)
e.g. ( hello mon'''amour'' blah ) → hello mon'amour blah
3. Else (the preceding character is <space>) (and there are no
earlier such sequences)
e.g. ( hello '''amour'' '''blah '''blah ) → hello 'amour blah blah
Five ( ''''' ):
1. Bold, italics; or italics, bold (default, the two cases are equivalent)
e.g. ( hello ''''' blah ) → hello blah
Inline HTML
The parser recognises and cleans a large number of HTML tags, as defined in Sanitizer.php.
A decision has to be made here on whether to attempt to parse these things as a matched
set, or whether to leave that to a later pass.
A loose definition assuming they are treated individually:
Remarks
The range of "word-boundary-char" seems to be an artefact of the regular expression:
if( preg_match( '!^(/?)(\\w+)([^>]*?)(/{0,1}>)([^<]*)$!', $x, $regs ) ) {
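To see what that expression captures, here is a Python transcription of the same pattern. It is a sketch: it assumes the pattern is applied to the text that follows a single "<", and it uses re.ASCII so that \w covers only letters, digits and underscore (PHP's \w may behave differently depending on locale). The function and key names are hypothetical.

```python
import re

# Same five groups as the PHP pattern: optional "/", tag name,
# attribute text, closing bracket (optionally "/>"), trailing text.
TAG_PIECE_RE = re.compile(r"^(/?)(\w+)([^>]*?)(/{0,1}>)([^<]*)$", re.ASCII)


def split_tag_piece(piece):
    """Split the text following a '<' into its tag parts, or return None."""
    match = TAG_PIECE_RE.match(piece)
    if match is None:
        return None
    slash, name, params, brace, rest = match.groups()
    return {"slash": slash, "name": name, "params": params,
            "brace": brace, "rest": rest}
```

A piece such as ' 3' (from "a < 3") does not match, because the tag name must start immediately after the "<".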
block elements: p, span, table, div
lists: ol, ul, dl
paragraph formatting: h1, h2, h3, h4, h5, h6, cite, center, blockquote, caption, pre
character formatting: b, del, i, ins, u, font, big, small, sub, sup, code, em, s, strike, strong, tt, var
Ruby: rt, rb, rp, ruby
Here, blockquote and span are both "nesting" tags. When the close-blockquote tag is found inside the
span block, it is escaped.
Rendering
Tags that have to be paired are forced closed according to some sort of logic.
<extra-characters> are "sanitized": all attributes and styles except those on a pre-approved
whitelist are stripped.
Tags are then written out literally: <InlineHTMLTagname> " " <sanitized-attributes> > etc.
HTML comments are completely discarded, with some whitespace massaging: (sanitizer.php)
To avoid leaving blank lines, when a comment is both preceded and followed by a newline
(ignoring spaces), trim leading and trailing spaces and one of the newlines.
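The whitespace rule for comments can be sketched as two passes. This is an illustration of the rule as described above, not a transcription of the code in sanitizer.php: the function name is hypothetical, "spaces" is assumed to mean spaces and tabs, and unclosed comments are simply left alone.

```python
import re

# A comment body that cannot run past its own "-->".
_COMMENT = r"<!--(?:(?!-->).)*-->"

# Any comment, wherever it appears.
COMMENT_RE = re.compile(_COMMENT, re.DOTALL)

# A comment alone on its line: preceded by a newline and followed by one,
# ignoring spaces. The surrounding spaces and the trailing newline go too.
OWN_LINE_RE = re.compile(r"(?<=\n)[ \t]*" + _COMMENT + r"[ \t]*\n", re.DOTALL)


def strip_comments(text):
    """Discard HTML comments without leaving blank lines behind."""
    text = OWN_LINE_RE.sub("", text)
    return COMMENT_RE.sub("", text)
```

For example, "a\n<!-- c -->\nb" becomes "a\nb" rather than "a\n\nb", while a comment in the middle of a line is removed with its surrounding spaces intact.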
Non-breaking spaces
This is pretty trivial and used basically to improve the appearance of punctuation in French, which
always places a space before certain punctuation, and places spaces inside guillemets. Other
languages use these characters, but without the spaces. Currently performed directly in the parse()
method.
Rendering
In both cases, the space is converted to a non-breaking space entity (such as &nbsp;).
Behaviour switches
Not to be confused with magic links. These seem to be able to be used virtually anywhere: a table of
contents in an image caption even works. See Help:Magic words#Behaviour switches.
Notes:
Semantics
behaviourswitch-toc: a miniature contents page will be rendered and inserted at the first
instance of this token.
behaviourswitch-forcetoc: a contents box will be rendered even if the normal criteria
(typically, 4 sections) have not been met. Irrelevant if behaviourswitch-toc is present.
behaviourswitch-notoc: no miniature contents pages will be rendered. Only takes effect if
neither behaviourswitch-toc nor behaviourswitch-forcetoc is present.
behaviourswitch-noeditsection: no edit links are to be displayed for any sections.
behaviourswitch-nogallery: unclear. According to the code (parser::stripNoGallery): if the
string (not case-sensitive) occurs in the HTML, do not add TOC. Perhaps it only has an effect
in certain namespaces.
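The precedence between the three contents-related switches can be summarised in a few lines. This is a sketch of the semantics as listed above, not of the parser's code: the switch keys ("toc", "forcetoc", "notoc"), the function name and the default threshold of 4 sections are illustrative assumptions.

```python
def show_toc(switches, section_count, threshold=4):
    """Decide whether a contents box is rendered.

    switches: set of switch names found on the page (hypothetical keys).
    section_count: number of sections on the page.
    """
    # toc and forcetoc both override notoc and the section threshold.
    if "toc" in switches or "forcetoc" in switches:
        return True
    # notoc only takes effect when neither of the above is present.
    if "notoc" in switches:
        return False
    # Otherwise the normal criterion applies (typically 4 sections).
    return section_count >= threshold
```

Where the box is placed (at the first toc token, or in the default position) is a separate question that this sketch does not cover.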
Links to images and media should be handled as normal links. It's inline images and media that are
being dealt with here.
Images
ImageInline ::= "[[" , "Image:" , PageName, ".", ImageExtension, ( { <Pipe>, ImageOption, } ) "]]" ;
ImageName ::= PageName, ".", ImageExtension
ImageExtension ::= "jpg" | "jpeg" | "png" | "svg" | "gif" | "bmp" ;
ImageOption ::= ImageModeParameter | ImageSizeParameter | ImageAlignParameter
| ImageVAlignParameter | Caption
/* Default settings: */
mw("img_manualthumb") ::= "thumbnail=", ImageName | "thumb=", ImageName
mw("img_thumbnail") ::= "thumbnail" | "thumb";
mw("img_frame") ::= "framed" | "enframed" | "frame";
mw("img_frameless") ::= "frameless";
/* Default settings: */
mw("img_page") ::= "page=$1" | "page $1" ??? (where is this used?)
mw("img_upright") ::= "upright" [, ["=",] PositiveInteger]
mw("img_border") ::= "border"
/* Default settings: */
mw("img_left") ::= "left"
mw("img_center") ::= "center" | "centre"
mw("img_right") ::= "right"
mw("img_none") ::= "none"
/* By default: */
mw("img_baseline") ::= "baseline"
mw("img_sub") ::= "sub"
mw("img_super") ::= "super" | "sup"
mw("img_top") ::= "top"
mw("img_text_top") ::= "text-top"
mw("img_middle") ::= "middle"
mw("img_bottom") ::= "bottom"
mw("img_text_bottom") ::= "text-bottom"
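One way to read the default keyword tables is as a lookup from an option string to its magic-word key. The sketch below does that under several assumptions: options are compared after trimming surrounding whitespace, matching is case-sensitive, PositiveInteger is taken to be a run of digits not starting with 0, and anything unrecognised (including size parameters, which are not defined here) is treated as the caption. The function name is hypothetical.

```python
import re

# Default keyword tables, copied from the mw("...") definitions.
OPTION_KEYWORDS = {
    "img_thumbnail": {"thumbnail", "thumb"},
    "img_frame": {"framed", "enframed", "frame"},
    "img_frameless": {"frameless"},
    "img_border": {"border"},
    "img_left": {"left"},
    "img_center": {"center", "centre"},
    "img_right": {"right"},
    "img_none": {"none"},
    "img_baseline": {"baseline"},
    "img_sub": {"sub"},
    "img_super": {"super", "sup"},
    "img_top": {"top"},
    "img_text_top": {"text-top"},
    "img_middle": {"middle"},
    "img_bottom": {"bottom"},
    "img_text_bottom": {"text-bottom"},
}


def classify_option(option):
    """Return the magic-word key for one image option, or "caption"."""
    option = option.strip()
    for key, words in OPTION_KEYWORDS.items():
        if option in words:
            return key
    if option.startswith(("thumbnail=", "thumb=")):
        return "img_manualthumb"
    if re.fullmatch(r"upright(=?[1-9][0-9]*)?", option):
        return "img_upright"
    if re.fullmatch(r"page[= ].+", option):
        return "img_page"
    # Nothing matched: treat it as the caption (assumption).
    return "caption"
```

For instance, classify_option("centre") gives "img_center", while classify_option("Thumb") falls through to "caption" because the keywords are matched case-sensitively.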
Semantics
Media
Gallery
Remarks:
The gallery block can technically be used in the middle of a sentence so is not a "special
block". It doesn't render particularly nicely when you do that though.
Non-breaking spaces
<nbsp-before> ::= [any character] <space> ("»" | "?" | ":" | ";" | "!" | "%")
<nbsp-after> ::= "«" <space>
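The two rules translate naturally into two substitutions. The following is a sketch, assuming that the leading character in <nbsp-before> is required, that only a plain space is replaced, and that the replacement is written as &nbsp; (the actual entity used may differ). The function name is hypothetical.

```python
import re

NBSP = "&nbsp;"  # assumption: the exact entity string may differ

# <nbsp-before>: any character, a space, then one of the listed marks.
BEFORE_RE = re.compile(r"(.) (?=[»?:;!%])")

# <nbsp-after>: an opening guillemet followed by a space.
AFTER_RE = re.compile(r"« ")


def french_spaces(text):
    """Replace the spaces covered by <nbsp-before> and <nbsp-after>."""
    text = BEFORE_RE.sub(lambda m: m.group(1) + NBSP, text)
    return AFTER_RE.sub("«" + NBSP, text)
```

So "Quoi ?" becomes "Quoi&nbsp;?", and "a: b" is left alone because there is no space before the colon.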
Behaviour switches
<behaviour-switch> ::= <behaviourswitch-toc> | <behaviourswitch-forcetoc> | <behaviourswitch-notoc>
| <behaviourswitch-noeditsection> | <behaviourswitch-nogallery>
ImageOtherParameter ::= ImageParamPage | ImageParamUpright | ImageParamBorder
ImageParamPage ::= mw("img_page")
ImageParamUpright ::= mw("img_upright")
ImageParamBorder ::= mw("img_border")
Gallery
GalleryBlock ::= "<gallery>" [ NewLine ] GalleryImage { [ NewLine ] GalleryImage } [ NewLine ] "</gallery>" ;
GalleryImage ::= (to be defined: essentially foo.jpg[|caption] )
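Since GalleryImage is still to be defined, the sketch below only illustrates the rough shape of a GalleryBlock. It assumes one image per line and takes everything after the first pipe as the caption; the function name and the tuple layout are hypothetical.

```python
import re

GALLERY_RE = re.compile(r"<gallery>(.*?)</gallery>", re.DOTALL)


def parse_gallery(block):
    """Return a list of (image name, caption or None) pairs."""
    match = GALLERY_RE.fullmatch(block.strip())
    if match is None:
        raise ValueError("not a gallery block")
    images = []
    for line in match.group(1).splitlines():
        line = line.strip()
        if not line:
            continue  # the newlines between images are optional
        name, sep, caption = line.partition("|")
        images.append((name.strip(), caption.strip() if sep else None))
    return images
```

For the input "<gallery>\nfoo.jpg|A\nbar.png\n</gallery>" this yields one captioned image and one without a caption.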