Markup Languages: A History and Guide

In computer text processing, a markup language is a system for annotating a document in a
way that is syntactically distinguishable from the text,[1] meaning that when the document is
processed for display, the markup itself is not shown and is used only to format the text.[2] The
idea and terminology evolved from the "marking up" of paper manuscripts (i.e., the revision
instructions by editors), which were traditionally written with a red or blue pencil on authors'
manuscripts.[3] Such "markup" typically includes both content corrections (such as spelling,
punctuation, or movement of content) and typographic instructions, such as to make a heading
larger or boldface.
In digital media, this "blue pencil instruction text" was replaced by tags which ideally indicate
what the parts of the document are, rather than details of how they might be shown on some
display. This lets authors avoid formatting every instance of the same kind of thing redundantly
(and possibly inconsistently). It also avoids the specification of fonts and dimensions which may
not apply to many users (such as those with different-size displays, impaired vision and screen-
reading software).
Early markup systems typically included typesetting instructions, as troff, TeX and LaTeX do,
while Scribe and most modern markup systems name components, and later process those
names to apply formatting or other processing, as in the case of XML.
Some markup languages, such as the widely used HTML, have pre-defined presentation
semantics—meaning that their specification prescribes some aspects of how to present
the structured data on particular media. HTML, like DocBook, Open eBook, JATS and countless
others, is a specific application of the markup meta-languages SGML and XML. That is, SGML
and XML enable users to specify particular schemas, which determine just what elements,
attributes, and other features are permitted, and where.
One extremely important characteristic of most markup languages is that they allow mixing
markup directly into text streams. This happens all the time in documents: A few words in a
sentence must be emphasized, or identified as a proper name, defined term, or other special
item. This is quite different structurally from traditional databases, where it is by definition
impossible to have data that is (for example) within a record, but not within any field. Likewise,
markup for natural language texts must maintain ordering: it would not suffice to make each
paragraph of a book into a "paragraph" record, where those records do not maintain order.
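The point about mixed content and ordering can be illustrated with a minimal sketch using Python's standard xml.etree.ElementTree module. The sample sentence and the element names (p, name, em) are assumptions chosen for illustration, not taken from any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names are illustrative only.
SOURCE = (
    "<p>The family <name>Anatidae</name> includes "
    "<em>ducks</em>, geese, and swans.</p>"
)

def flatten(element):
    """Return (enclosing tag, text) pieces in document order."""
    pieces = []
    if element.text:
        pieces.append((element.tag, element.text))
    for child in element:
        pieces.extend(flatten(child))
        if child.tail:
            # Text following a child belongs to the parent element,
            # not to any child "field".
            pieces.append((element.tag, child.tail))
    return pieces

pieces = flatten(ET.fromstring(SOURCE))
for tag, text in pieces:
    print(f"{tag!r:8} {text!r}")
```

Some of the text sits directly inside the paragraph and some inside child elements, and the sentence can only be recovered by reading the pieces in their original order. This is the property that a record-and-field model does not provide by itself.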
Etymology
The noun markup is derived from the traditional publishing practice called "marking
up" a manuscript,[4] which involves adding handwritten annotations, in the form of conventional
symbolic printer's instructions, in the margins and the text of a paper or printed manuscript. It
is jargon from the coding of proofs. For centuries, this task was done primarily by skilled
typographers known as "markup men"[5] or "d markers"[6] who marked up text to indicate
what typeface, style, and size should be applied to each part, and then passed the manuscript to
others for typesetting by hand or machine. Markup was also commonly applied by editors,
proofreaders, publishers, and graphic designers, and indeed by document authors, all of whom
might also mark other things, such as corrections and changes.
Types of markup language

There are three main general categories of electronic markup, articulated in Coombs, et al.
(1987),[7] and Bray (2003).[8]
Presentational markup
The kind of markup used by traditional word-processing systems: binary codes
embedded within document text that produce the WYSIWYG ("what you see is what
you get") effect. Such markup is usually hidden from the human users, even authors and
editors. Properly speaking, such systems use procedural and/or descriptive markup
underneath, but convert it to "present" to the user as geometric arrangements of type.
Procedural markup
Markup embedded in text that provides instructions for programs to process the text.
Well-known examples include troff, TeX, and PostScript. The processor is expected to
run through the text from beginning to end, following the instructions as encountered.
Text with such markup is often edited with the markup visible and directly manipulated by
the author. Popular procedural markup systems usually include programming constructs,
and macros or subroutines are commonly defined so that complex sets of instructions
can be invoked by a simple name (and perhaps a few parameters). This is much faster,
less error-prone, and easier to maintain than restating the same or similar
instructions in many places.
Descriptive markup
Markup is specifically used to label parts of the document for what they are, rather than
how they should be processed. Well-known systems that provide many such labels
include LaTeX, HTML, and XML. The objective is to decouple the structure of the
document from any particular treatment or rendition of it. Such markup is often described
as "semantic". An example of a descriptive markup would be HTML's <cite> tag, which
is used to label a citation. Descriptive markup — sometimes called logical
markup or conceptual markup — encourages authors to write in a way that describes the
material conceptually, rather than visually.[9]
There is considerable blurring of the lines between the types of markup. In modern
word-processing systems, presentational markup is often saved in descriptive-
markup-oriented systems such as XML, and then processed procedurally by
implementations. The programming in procedural-markup systems, such as TeX,
may be used to create higher-level markup systems that are more descriptive in
nature, such as LaTeX.
In recent years, a number of small and largely unstandardized markup
languages have been developed to allow authors to create formatted text via web
browsers, such as the ones used in wikis and in web forums. These are sometimes
called lightweight markup languages. Markdown and the markup language used
by Wikipedia are examples of such languages.
History of markup languages

GenCode
The first well-known public presentation of markup languages in computer text
processing was made by William W. Tunnicliffe at a conference in 1967, although he
preferred to call it generic coding. It can be seen as a response to the emergence of
programs such as RUNOFF that each used their own control notations, often
specific to the target typesetting device. In the 1970s, Tunnicliffe led the
development of a standard called GenCode for the publishing industry and later was
the first chairman of the International Organization for Standardization committee
that created SGML, the first standard descriptive markup language. Book designer
Stanley Rice published speculation along similar lines in 1970.[10]
Brian Reid, in his 1980 dissertation at Carnegie Mellon University, developed the
theory and a working implementation of descriptive markup in actual use.
However, IBM researcher Charles Goldfarb is more commonly seen today as the
"father" of markup languages. Goldfarb hit upon the basic idea while working on a
primitive document management system intended for law firms in 1969, and helped
invent IBM GML later that same year. GML was first publicly disclosed in 1973.
In 1975, Goldfarb moved from Cambridge, Massachusetts to Silicon Valley and
became a product planner at the IBM Almaden Research Center. There, he
convinced IBM's executives to deploy GML commercially in 1978 as part of IBM's
Document Composition Facility product, and it was widely used in business within a
few years.
SGML, which was based on both GML and GenCode, was an ISO project worked on
by Goldfarb beginning in 1974.[11] Goldfarb eventually became chair of the SGML
committee. SGML was first released by ISO as the ISO 8879 standard in October
1986.
troff and nroff

Some early examples of computer markup languages available outside the
publishing industry can be found in typesetting tools on Unix systems such
as troff and nroff. In these systems, formatting commands were inserted into the
document text so that typesetting software could format the text according to the
editor's specifications. Getting a document to print correctly was an iterative,
trial-and-error process.[12] The availability of WYSIWYG ("what you see is what you get")
publishing software supplanted much use of these languages among casual users,
though serious publishing work still uses markup to specify the non-visual structure
of texts, and WYSIWYG editors now usually save documents in a markup-language-
based format.
TeX for formulas

Another major publishing standard is TeX, created and refined by Donald Knuth in
the 1970s and '80s. TeX concentrated on detailed layout of text and font
descriptions to typeset mathematical books. This required Knuth to spend
considerable time investigating the art of typesetting. TeX is mainly used
in academia, where it is a de facto standard in many scientific disciplines. A TeX
macro package known as LaTeX provides a descriptive markup system on top of
TeX, and is widely used both among the scientific community and the publishing
industry.[13]
Scribe, GML and SGML

The first language to make a clean distinction between structure and presentation
was Scribe, developed by Brian Reid and described in his doctoral thesis in 1980.[14]
Scribe was revolutionary in a number of ways, not least that it introduced the idea
of styles separated from the marked up document, and of a grammar controlling the
usage of descriptive elements. Scribe influenced the development of Generalized
Markup Language (later SGML),[15] and is a direct ancestor to HTML and LaTeX.[16]
In the early 1980s, the idea that markup should focus on the structural aspects of a
document and leave the visual presentation of that structure to the interpreter led to
the creation of SGML. The language was developed by a committee chaired by
Goldfarb. It incorporated ideas from many different sources, including Tunnicliffe's
project, GenCode. Sharon Adler, Anders Berglund, and James A. Marke were also
key members of the SGML committee.
SGML specified a syntax for including the markup in documents, as well as one for
separately describing what tags were allowed, and where (the Document Type
Definition (DTD), later known as a schema). This allowed authors to create and use
any markup they wished, selecting tags that made the most sense to them and were
named in their own natural languages, while also allowing automated verification.
Thus, SGML is properly a meta-language, and many particular markup languages
are derived from it. From the late '80s onward, most substantial new markup
languages have been based on the SGML system, including for
example TEI and DocBook. SGML was promulgated as an International Standard
by the International Organization for Standardization, as ISO 8879, in 1986.[17]
SGML found wide acceptance and use in fields with very large-scale documentation
requirements. However, many found it cumbersome and difficult to learn, a side
effect of its design attempting to do too much and to be too flexible. For example,
SGML made end-tags (or start-tags, or even both) optional in certain contexts,
because its developers thought markup would be done manually by overworked
support staff who would appreciate saving keystrokes.[citation needed]
HTML
In 1989, computer scientist Sir Tim Berners-Lee wrote a memo proposing
an Internet-based hypertext system,[18] then specified HTML and wrote the browser
and server software in the last part of 1990. The first publicly available description of
HTML was a document called "HTML Tags", first mentioned on the Internet by
Berners-Lee in late 1991.[19][20] It describes 18 elements comprising the initial,
relatively simple design of HTML. Except for the hyperlink tag, these were strongly
influenced by SGMLguid, an in-house SGML-based documentation format at CERN,
and very similar to the sample schema in the SGML standard. Eleven of these
elements still exist in HTML 4.[21]
Berners-Lee considered HTML an SGML application. The Internet Engineering Task
Force (IETF) formally defined it as such with the mid-1993 publication of the first
proposal for an HTML specification: "Hypertext Markup Language (HTML)" Internet-
Draft by Berners-Lee and Dan Connolly, which included an SGML Document Type
Definition to define the grammar.[22] Many of the HTML text elements are found in the
1988 ISO technical report TR 9537 Techniques for using SGML, which in turn
covers the features of early text formatting languages such as that used by
the RUNOFF command developed in the early 1960s for the CTSS (Compatible
Time-Sharing System) operating system. These formatting commands were derived
from those used by typesetters to manually format documents. Steven
DeRose[23] argues that HTML's use of descriptive markup (and influence of SGML in
particular) was a major factor in the success of the Web, because of the flexibility
and extensibility that it enabled. HTML became the main markup language for
creating web pages and other information that can be displayed in a web browser,
and is quite likely the most used markup language in the world today.
XML
XML (Extensible Markup Language) is a meta markup language that is very widely
used. XML was developed by the World Wide Web Consortium, in a committee
created and chaired by Jon Bosak. The main purpose of XML was to simplify SGML
by focusing on a particular problem: documents on the Internet.[24] XML remains a
meta-language like SGML, allowing users to create any tags needed (hence
"extensible") and then describing those tags and their permitted uses.
XML adoption was helped because every XML document can be written in such a
way that it is also an SGML document, so existing SGML users and software could
switch to XML fairly easily. At the same time, XML eliminated many of the more complex
features of SGML in order to simplify implementations in environments such as documents
and publications. It appeared to strike a happy medium between simplicity and flexibility,
while supporting very robust schema definition and validation tools, and was
rapidly adopted for many other uses. XML is now widely used for
communicating data between applications, for serializing program data, for hardware
communications protocols, for vector graphics, and for many other purposes besides
documents.
XHTML
From January 2000 until the HTML5 Recommendation of 2014, all W3C Recommendations
for HTML were based on XML rather than SGML, using the
abbreviation XHTML (Extensible HyperText Markup Language). The language
specification requires that XHTML Web documents be well-formed XML
documents. This allows for more rigorous and robust documents while using tags
familiar from HTML.
One of the most noticeable differences between HTML and XHTML is the rule
that all tags must be closed: empty HTML elements such as <br> must either
be closed with a regular end-tag, or replaced by a special form: <br /> (the space
before the '/' in this empty-element tag is optional, but frequently used because it
enables some pre-XML Web browsers, and SGML parsers, to accept the tag). Another is that
all attribute values in tags must be quoted. Finally, all tag and attribute names within
the XHTML namespace must be lowercase to be valid. HTML, on the other hand,
was case-insensitive.
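The closing and quoting rules can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser as a stand-in for any conforming XML parser; the helper name is_well_formed and the sample fragments are hypothetical. It tests well-formedness only: the lowercase-names rule is a validity constraint of XHTML, so a generic XML parser does not enforce it, although it does treat names as case-sensitive.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# Hypothetical fragments illustrating the rules described above.
samples = {
    "unclosed <br>": "<p>line one<br>line two</p>",
    "empty-element <br />": "<p>line one<br />line two</p>",
    "explicit end-tag": "<p>line one<br></br>line two</p>",
    "unquoted attribute": "<p class=note>text</p>",
    "quoted attribute": '<p class="note">text</p>',
    "mismatched case": "<P>text</p>",
}

for label, fragment in samples.items():
    print(f"{label:22} {is_well_formed(fragment)}")
```

An HTML parser would accept every one of these fragments, which is the practical difference between the two syntaxes.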
Other XML-based applications
Many XML-based applications now exist, including the Resource Description
Framework as RDF/XML, XForms, DocBook, SOAP, and the Web Ontology
Language (OWL). For a partial list of these, see List of XML markup languages.
Features of markup languages

A common feature of many markup languages is that they intermix the text of a
document with markup instructions in the same data stream or file. This is not
necessary; it is possible to isolate markup from text content, using pointers, offsets,
IDs, or other methods to co-ordinate the two. Such "standoff markup" is typical for
the internal representations that programs use to work with marked-up documents.
However, embedded or "inline" markup is much more common elsewhere. Here, for
example, is a small section of text marked up in HTML:
<h1>Anatidae</h1>
<p>
The family <i>Anatidae</i> includes ducks, geese, and swans,
but <em>not</em> the closely related screamers.
</p>
The codes enclosed in angle-brackets <like this> are markup instructions
(known as tags), while the text between these instructions is the actual text of the
document. The codes h1, p, and em are examples of semantic markup, in that
they describe the intended purpose or the meaning of the text they include.
Specifically, h1 means "this is a first-level heading", p means "this is a paragraph",
and em means "this is an emphasized word or phrase". A program interpreting such
structural markup may apply its own rules or styles for presenting the various pieces
of text, using different typefaces, boldness, font size, indentation, colour, or other
styles, as desired. For example, a tag such as "h1" (heading level 1) might be
presented in a large bold sans-serif typeface in an article, or it might be underscored
in a monospaced (typewriter-style) document, or it might simply not change the
presentation at all.
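As a rough illustration of a program applying its own presentation rules, the sketch below uses Python's standard html.parser module to render the sample above for a monospaced display. The class name PlainTextRenderer and its styling choices (underlining h1 with "=" and wrapping em in asterisks) are assumptions made for this example, not a standard rendering.

```python
from html.parser import HTMLParser

SAMPLE = """<h1>Anatidae</h1>
<p>
The family <i>Anatidae</i> includes ducks, geese, and swans,
but <em>not</em> the closely related screamers.
</p>"""

class PlainTextRenderer(HTMLParser):
    """Hypothetical renderer: one arbitrary set of rules for h1, p, em."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished output lines
        self.buffer = []   # text collected for the current block

    def handle_starttag(self, tag, attrs):
        if tag == "em":
            self.buffer.append("*")

    def handle_endtag(self, tag):
        if tag == "em":
            self.buffer.append("*")
        elif tag in ("h1", "p"):
            # Collapse whitespace, then emit the finished block.
            text = " ".join("".join(self.buffer).split())
            self.buffer = []
            self.lines.append(text)
            if tag == "h1":
                # This renderer chooses to underscore headings.
                self.lines.append("=" * len(text))

    def handle_data(self, data):
        self.buffer.append(data)

def render(html):
    renderer = PlainTextRenderer()
    renderer.feed(html)
    renderer.close()
    return "\n".join(renderer.lines)

print(render(SAMPLE))
```

A different renderer could map the same h1, p, and em tags to fonts and sizes instead, without any change to the marked-up text.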
In contrast, the i tag in HTML is an example of presentational markup; it is
generally used to specify a particular characteristic of the text (in this case, the use
of an italic typeface) — without specifying the reason for that appearance.
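The standoff approach mentioned at the start of this section, in which markup is kept apart from the text and tied to it by offsets, can be sketched as follows. The tuple layout (start, end, label) and the helper names annotate and to_inline are assumptions for illustration; real standoff formats vary, and this sketch handles only non-overlapping spans.

```python
TEXT = "The family Anatidae includes ducks, but not screamers."

def annotate(text, phrase, label):
    """Build a (start, end, label) span for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

# The annotations live outside the text, which is left untouched.
ANNOTATIONS = [
    annotate(TEXT, "Anatidae", "i"),
    annotate(TEXT, "not", "em"),
]

def to_inline(text, annotations):
    """Convert non-overlapping standoff spans into inline tags."""
    result = text
    # Work from the last span backwards so earlier offsets stay valid.
    for start, end, label in sorted(annotations, reverse=True):
        result = (
            result[:start]
            + f"<{label}>" + result[start:end] + f"</{label}>"
            + result[end:]
        )
    return result

print(to_inline(TEXT, ANNOTATIONS))
```

Because the spans are stored separately, several independent annotation layers can refer to the same unmodified text.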
The Text Encoding Initiative (TEI) has published extensive guidelines[25] for how to
encode texts of interest in the humanities and social sciences, developed through
years of international cooperative work. These guidelines are used by projects
encoding historical documents, the works of particular scholars, periods, or genres,
and so on.
Alternative usages
While the idea of markup language originated with text documents, there is
increasing use of markup languages in the presentation of other types of
information, including playlists, vector graphics, web services, content syndication,
and user interfaces. Most of these are XML applications, because XML is a well-
defined and extensible language.
The use of XML has also led to the possibility of combining multiple markup
languages into a single profile, like XHTML+SMIL and XHTML+MathML+SVG.[26]
