Joey Takeda
Digital Humanities Innovation Lab, Simon Fraser University
ENGL502A | March 19, 2025
A set of shared, community-developed guidelines for encoding text
Started in 1987 (preceding HTML)
Used by many projects across the world in many different languages and for many different reasons
Website: https://tei-c.org/
Literary recovery
Critical editing
Analysis (both distant and close)
Publication (i.e. from one format to another)
Answering existing (and asking new) research questions
At its core, marking up text (aka encoding) is a way of identifying and differentiating bits of text from other bits of text.
Italics for emphasis
Underlining for titles
Bold for extra-emphasis
Quotation marks for outside attribution or skepticism
All capitals to YELL
But these are contextual and local
E.g. different types of punctuation for levels of quotation
And they are subject to varying interpretations
Markup refers to a structured way to identify and separate textual information
The most common form of markup is a structure called XML (aka "pointy brackets")
Semantic or Descriptive markup = encoding what the thing is
Display or Presentational markup = encoding how you want that thing to look
Marking up text is an assertion of your knowledge and your interpretation of the text
What does the text (form and content) express?
How does your model conform/resist?
The process is analytical, strategic, and interpretive.
It is analytical, in identifying a set of components into which the text can meaningfully be broken and whose relationship can be represented
Markup is strategic, in that text encoding is always aimed (deliberately or by default) at some intellectual or practical goal
And markup is interpretive, in that the act of encoding will always take place through a connection between an observing individual and a source object.
XML = eXtensible Markup Language
XML is not a set language unto itself, but a grammar
There is nothing inherent about the function of XML
It is purely a structure--a way of organizing
Anyone can conceive of an XML dialect (i.e. it is extensible)
Markup codifies intentions
"Sure"
<quotation>Sure</quotation>
<sarcasm>Sure</sarcasm>
<skepticism>Sure</skepticism>
<title>Sure</title>
HTML (HyperText Markup Language: Every website)
KML (Keyhole Markup Language: Google Maps)
RDF (Resource Description Framework: Library catalogues)
SVG (Scalable Vector Graphics: Digital Images)
OOXML (Office Open XML: This presentation, Word documents, et cetera)
XML is hierarchical
XML is a tree-like structure
And is often described in genealogical terms
- chocolate
- 1 tbsp butter
- 2 tbsp
  - cherry brandy
  - kirsch
  - coffee
- 2-3 tbsp sugar
- 2 large eggs
Pointy brackets mark a tag; a start tag, its content, and its end tag together are called an element
E.g. <item> = the item element
All elements have a start tag and an end tag (an empty element can combine both in one self-closing tag, e.g. <lb/>)
<measure> is the start tag and </measure> is the end tag
Elements can also have attributes (@quantity)
Attributes must have a value: <measure quantity="2">.
All XML structures have a "root" (or container) element
Elements nest, and their relationships are described in genealogical terms
<list> is the parent of <item>
<measure> is a child of <item>
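The vocabulary above (element, attribute, parent, child) can be made concrete with a short sketch in Python's standard library. The recipe fragment below is a hypothetical reconstruction: the element names and @quantity come from these slides, but the exact markup is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical recipe fragment; names follow the slides, markup is assumed.
RECIPE = """
<list>
  <item>chocolate</item>
  <item><measure quantity="1">1 tbsp</measure> butter</item>
  <item><measure quantity="2">2</measure> large eggs</item>
</list>
"""

root = ET.fromstring(RECIPE)  # <list> is the root (container) element


def describe(element, depth=0):
    """Return one indented line per element, showing its tag and attributes."""
    lines = ["  " * depth + f"{element.tag} {element.attrib}"]
    for child in element:  # iterating an element yields its children
        lines.extend(describe(child, depth + 1))
    return lines


tree_lines = describe(root)
print("\n".join(tree_lines))

# Every <measure> here is a child of an <item>, which is a child of <list>.
measures = root.findall("./item/measure")
quantities = [m.get("quantity") for m in measures]
print(quantities)
```

The indentation of the printed lines mirrors the nesting: each level of indentation is one generation further from the root.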
Elements cannot overlap
✅ <shelf><book>Anna Karenina</book></shelf>
❌ <shelf><book>Anna Karenina</shelf></book>
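One way to see the no-overlap rule in action is to hand both versions to an XML parser. This is a minimal sketch using Python's standard library; `is_well_formed` is a helper name invented for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the parser accepts the string as well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


nested = "<shelf><book>Anna Karenina</book></shelf>"
overlapping = "<shelf><book>Anna Karenina</shelf></book>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```

The parser rejects the second string because </shelf> arrives while <book> is still open.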
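Whitespace (mostly) doesn't matter: runs of spaces and line breaks are usually treated as a single space when the text is processed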
My paragraph
has a lot of space
My paragraph has a lot of space
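A small sketch of what this means in practice, assuming the two paragraphs above were each wrapped in a <p>. The parser itself keeps the raw whitespace; collapsing it is a processing step. `normalized_text` is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

spaced = ET.fromstring("<p>My paragraph\n      has    a lot\n   of space</p>")
compact = ET.fromstring("<p>My paragraph has a lot of space</p>")


def normalized_text(element):
    """Collapse every run of whitespace in the element's text to one space."""
    return " ".join("".join(element.itertext()).split())


# The raw text differs, but the normalized text is the same.
print(repr(spaced.text))
print(normalized_text(spaced) == normalized_text(compact))
```

Whether to normalize is a decision for whoever processes the file; for some texts (poetry with meaningful spacing, for example) the raw whitespace may be worth keeping.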
The TEI defines elements and attributes to create a standard for encoding texts
All texts must be called <text>
All divisions (whether they be chapters, sections, et cetera) must be called <div>
All paragraphs must be called <p>
All words must be called <w>
Offers a rich vocabulary and method to encode:
Bibliographic and structural features: page breaks, headers, footers, page numbers, line breaks, divisions, paragraphs, line groups, etc
Interpretative features: stage movement, emphasis, place names, proper names, dialogue direction, etc
Editorial apparatus: hands, witnesses, collation, gaps, additions, deletions, etc
Linguistic features: morphemes, feature structures, orthographic form, etc
Spoken features: incidents, pauses, shifts, "communicative phenomenon", etc
Metadata: various classification schemes, provenance, manuscript description, etc
Root <TEI> element
A <teiHeader> that describes both the file and the primary source that you are transcribing (if applicable)
A <text> that contains the text of the document
Within <text>, you can have an optional <front>, a <body>, and an optional <back>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!--...-->
</TEI>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Title</title>
      </titleStmt>
      <publicationStmt>
        <p>Publication Information</p>
      </publicationStmt>
      <sourceDesc>
        <p>Information about the source</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <!--...-->
</TEI>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Title</title>
      </titleStmt>
      <publicationStmt>
        <p>Publication Information</p>
      </publicationStmt>
      <sourceDesc>
        <p>Information about the source</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Some text here.</p>
    </body>
  </text>
</TEI>
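Once a file has this shape, a program can find its parts by name. The sketch below parses the minimal document with Python's standard library; because of the xmlns declaration, every element lives in the TEI namespace, so the queries need a prefix. The prefix `tei` and the variable names are choices made for this example.

```python
import xml.etree.ElementTree as ET

# Map a prefix of our choosing to the TEI namespace declared on <TEI>.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

DOC = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Title</title></titleStmt>
      <publicationStmt><p>Publication Information</p></publicationStmt>
      <sourceDesc><p>Information about the source</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Some text here.</p>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(DOC)

# Metadata lives in the header; the transcription lives in <text>.
title = root.find("./tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", TEI_NS).text
paragraphs = [p.text for p in root.findall("./tei:text/tei:body/tei:p", TEI_NS)]

# Tags are stored as "{namespace}name"; keep only the local name.
top_level = [child.tag.split("}")[1] for child in root]

print(title)
print(paragraphs)
print(top_level)
```

Note that this only checks that the file parses and has the expected parts; validating against the TEI schema is a separate step that this sketch does not attempt.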
Having a Coke With You
is even more fun than going to San Sebastian, Irún, Hendaye, Biarritz,
or being sick to my stomach on the Travesera de Gracia in Barcelona
partly because in your orange shirt you look like a better happier St. Sebastian
partly because of my love for you, partly because of your love for yoghurt
<div>
  <head>Having a Coke With You</head>
  is even more fun than going to San Sebastian, Irún, Hendaye, Biarritz,
  or being sick to my stomach on the Travesera de Gracia in Barcelona
  partly because in your orange shirt you look like a better happier St. Sebastian
  partly because of my love for you, partly because of your love for yoghurt
</div>
<div>
  <head>Having a Coke With You</head>
  <lg>
    <l>is even more fun than going to San Sebastian, Irún, Hendaye, Biarritz,</l>
    <l>or being sick to my stomach on the Travesera de Gracia in Barcelona</l>
    <l>partly because in your orange shirt you look like a better happier St. Sebastian</l>
    <l>partly because of my love for you, partly because of your love for yoghurt</l>
  </lg>
</div>
<div>
  <head>Having a Coke With You</head>
  <lg>
    <l>is even more fun than going to <placeName>San Sebastian</placeName>, <placeName>Irún</placeName>, Hendaye, Biarritz,</l>
    <l>or being sick to my stomach on the <placeName>Travesera de Gracia</placeName> in Barcelona</l>
    <l>partly because in your orange shirt you look like a better happier <persName>St. Sebastian</persName></l>
    <l>partly because of my love for you, partly because of your love for yoghurt</l>
  </lg>
</div>
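To give a sense of what the encoding makes possible, here is a minimal sketch that pulls the tagged names out of the fragment above using Python's standard library. The variable names are made up for this example, and the fragment is used as shown, without a surrounding <TEI> element or namespace.

```python
import xml.etree.ElementTree as ET

POEM = """<div>
  <head>Having a Coke With You</head>
  <lg>
    <l>is even more fun than going to <placeName>San Sebastian</placeName>, <placeName>Irún</placeName>, Hendaye, Biarritz,</l>
    <l>or being sick to my stomach on the <placeName>Travesera de Gracia</placeName> in Barcelona</l>
    <l>partly because in your orange shirt you look like a better happier <persName>St. Sebastian</persName></l>
    <l>partly because of my love for you, partly because of your love for yoghurt</l>
  </lg>
</div>"""

div = ET.fromstring(POEM)

# iter() walks the whole tree, so it finds names at any depth.
places = [el.text for el in div.iter("placeName")]
people = [el.text for el in div.iter("persName")]
line_count = len(div.findall("./lg/l"))

print(places)
print(people)
print(line_count)
```

Hendaye, Biarritz, and Barcelona do not appear in `places`: they are in the poem, but they were not encoded, so the query cannot see them.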
Input =/= Output
Decisions about encoding = editorial practice
Encode what you care about and what you have time to encode
If you don't encode it, you can't do much with it
But: you don't need to encode or retain everything