Encoding in TEI


Joey Takeda

Digital Humanities Innovation Lab, Simon Fraser University

ENGL502A | March 19, 2025

The TEI

A set of shared, community-developed guidelines for encoding text

Started in 1987 (predating both HTML and XML)

Used by many projects across the world in many different languages and for many different reasons

Website: https://tei-c.org/

Why should we encode texts?

Literary recovery

Critical editing

Analysis (both distant and close)

Publication (i.e. from one format to another)

Answering existing (and asking new) research questions

Example Projects

The Yellow Nineties

https://1890s.ca/

Scholarly Editing

https://scholarlyediting.org

Encoding, markup, et cetera...

At its core, marking up text (aka encoding) is a way of identifying bits of text and differentiating them from other bits of text.

Excerpt from O'Hara, Frank. "Having a Coke With You." In The Collected Poems of Frank O'Hara, edited by Donald Allen. University of California Press, 1991.

We do this all the time!

Italics for emphasis

Underlining for titles

Bold for extra emphasis

Quotation marks for outside attribution or skepticism

All capitals to YELL

Encoding, markup, et cetera

But these are contextual and local

E.g. different types of punctuation for levels of quotation

And they are subject to varying interpretations

What is markup?

Markup refers to a structured way to identify and separate textual information

The most common form of markup is a structure called XML (aka "pointy brackets")

Semantics v. Display

Semantic or Descriptive markup = encoding what the thing is

Display or Presentational markup = encoding how you want that thing to look

Encoding Texts as Literary Criticism

Marking up text is an assertion of your knowledge and your interpretation of the text

What does the text (form and content) express?

How does your model of the text conform to (or resist) what the text expresses?

The Process of Marking Up Texts

The process is analytical, strategic, and interpretive.
It is analytical, in identifying a set of components into which the text can meaningfully be broken and whose relationship can be represented
Markup is strategic, in that text encoding is always aimed (deliberately or by default) at some intellectual or practical goal
And markup is interpretive, in that the act of encoding will always take place through a connection between an observing individual and a source object.
Julia Flanders, Syd Bauman, and Sarah Connell. "Text Encoding." Doing Digital Humanities, edited by Constance Crompton, Richard Lane, and Ray Siemens. Routledge, 2016.

XML

XML = eXtensible Markup Language

XML is not a set language unto itself, but a grammar

There is nothing inherent about the function of XML

It is purely a structure--a way of organizing

Anyone can conceive of an XML dialect (i.e. it is extensible)

XML

Markup codifies intentions


"Sure"

<quotation>Sure</quotation>

<sarcasm>Sure</sarcasm>

<skepticism>Sure</skepticism>

<title>Sure</title>

XML is Everywhere

HTML (HyperText Markup Language: every website; strictly a close relative of XML rather than an XML dialect)

KML (Keyhole Markup Language: Google Maps)

RDF (Resource Description Framework: Library catalogues)

SVG (Scalable Vector Graphics: Digital Images)

OOXML (Office Open XML: this presentation, Word documents, et cetera)

XML

XML is hierarchical

XML is a tree-like structure

And is often described in genealogical terms

XML


                
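<list>
  <item>chocolate</item>
  <item>1 tbsp butter</item>
  <item>
    <measure quantity="2">2 tbsp</measure>
    <list>
      <item>cherry brandy</item>
      <item>kirsch</item>
      <item>coffee</item>
    </list>
  </item>
  <item>2-3 tbsp sugar</item>
  <item>2 large eggs</item>
</list>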
                
            

Each pair of pointy brackets is a tag; a start tag, its content, and its end tag together are called an element

E.g. <item> = the item element

All elements have start and end tags
<measure> is the start tag and </measure> is the end tag (an empty element can be written as a single self-closing tag, e.g. <lb/>)

Elements can also have attributes (@quantity)
Attributes must have a value: <measure quantity="2">.

All XML structures have a "root" (or container) element

Elements nest and use genealogical terms

<list> is the parent of <item>

<measure> is a child of <item>

Adapted from Nigella Lawson's "Chocolate Cherry Mousse" from the New York Times
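
The vocabulary on this slide (root, parent, child, attribute) can be tried out in code. What follows is only a minimal sketch using Python's standard-library XML parser; the `RECIPE` string is a trimmed-down stand-in for the recipe list, not the full example.

```python
import xml.etree.ElementTree as ET

# A trimmed-down version of the recipe list (element names follow the slide).
RECIPE = """
<list>
  <item>chocolate</item>
  <item><measure quantity="2">2 tbsp</measure> kirsch</item>
  <item>2 large eggs</item>
</list>
"""

# Parsing returns the root (container) element.
root = ET.fromstring(RECIPE)
print(root.tag)

# Iterating over an element yields its children: here, the three <item>s.
items = list(root)
print([child.tag for child in items])

# <measure> is a child of the second <item>; attributes are read with .get().
measure = items[1].find("measure")
print(measure.tag, measure.get("quantity"), measure.text)
```

Note that the attribute value comes back as the string "2", not a number: XML itself does not know that a quantity is numeric.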

XML Explained

Elements cannot overlap

Well-formed (properly nested): <shelf><book>Anna Karenina</book></shelf>

Not well-formed (overlapping): <shelf><book>Anna Karenina</shelf></book>
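
A parser will refuse overlapping elements outright. As a rough illustration (the helper name `is_well_formed` is made up for this sketch), Python's standard-library parser can be asked to try both versions of the shelf:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_string: str) -> bool:
    """Return True if the parser accepts the string as well-formed XML."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False


nested = "<shelf><book>Anna Karenina</book></shelf>"
overlapping = "<shelf><book>Anna Karenina</shelf></book>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```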

XML Explained

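Whitespace (mostly) doesn't matter: when text is processed, a run of spaces, tabs, and line breaks is usually treated as a single space

<p>My      paragraph   has
      a lot     of    space</p>

<p>My paragraph has a lot of space</p>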
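
One caveat is worth seeing in practice: an XML parser hands back whitespace exactly as it was typed, and it is the later processing step that usually collapses it. The sketch below assumes a simple normalization rule (every run of whitespace becomes one space); the helper `normalize` is illustrative, not part of any TEI tool.

```python
import xml.etree.ElementTree as ET

spaced = ET.fromstring("<p>My    paragraph \n   has a   lot of    space</p>")
tidy = ET.fromstring("<p>My paragraph has a lot of space</p>")


def normalize(text: str) -> str:
    """Collapse every run of spaces, tabs, and newlines into one space."""
    return " ".join(text.split())


# The parser keeps the raw whitespace, so the two texts differ as stored...
print(spaced.text == tidy.text)                        # False
# ...but they are identical once whitespace is normalized.
print(normalize(spaced.text) == normalize(tidy.text))  # True
```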

The TEI = XML Vocabulary

The TEI defines elements and attributes to create a standard for encoding texts

All texts must be called <text>

All divisions (whether they be chapters, sections, et cetera) must be called <div>

All paragraphs must be called <p>

All words must be called <w>

The TEI

Offers a rich vocabulary and method to encode:

Bibliographic and structural features: page breaks, headers, footers, page numbers, line breaks, divisions, paragraphs, line groups, etc.

Interpretative features: stage movement, emphasis, place names, proper names, dialogue direction, etc.

Editorial apparatus: hands, witnesses, collation, gaps, additions, deletions, etc.

Linguistic features: morphemes, feature structures, orthographic form, etc.

Spoken features: incidents, pauses, shifts, "communicative phenomenon", etc.

Metadata: various classification schemes, provenance, manuscript description, etc.

Components of a (basic) TEI file

Root <TEI> element

A <teiHeader> that describes both the file and the primary source that you are transcribing (if applicable)

A <text> that contains the text of the document

Within <text>, a <body> is required; a <front> before it and a <back> after it are optional

                
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<!--...-->
</TEI>
                
            
                
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Title</title>
         </titleStmt>
         <publicationStmt>
            <p>Publication Information</p>
         </publicationStmt>
         <sourceDesc>
            <p>Information about the source</p>
         </sourceDesc>
      </fileDesc>
  </teiHeader>
  <!--...-->
</TEI>
                
            
                
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
      <fileDesc>
         <titleStmt>
            <title>Title</title>
         </titleStmt>
         <publicationStmt>
            <p>Publication Information</p>
         </publicationStmt>
         <sourceDesc>
            <p>Information about the source</p>
         </sourceDesc>
      </fileDesc>
  </teiHeader>
  <text>
      <body>
         <p>Some text here.</p>
      </body>
  </text>
</TEI>
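
Once a file has this shape, software can find its parts by name. The following is a minimal sketch with Python's standard-library parser, assuming the same skeleton as above; the prefix `tei` is an arbitrary label chosen for this example, and only the namespace URI comes from the file itself.

```python
import xml.etree.ElementTree as ET

# Map a prefix of our choosing to the TEI namespace declared on the root.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

DOC = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Title</title></titleStmt>
      <publicationStmt><p>Publication Information</p></publicationStmt>
      <sourceDesc><p>Information about the source</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Some text here.</p>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(DOC)

# The namespace is part of every element's name, so queries must include it.
title = root.find("tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", TEI_NS)
paragraphs = root.findall("tei:text/tei:body/tei:p", TEI_NS)

print(root.tag)
print(title.text)
print([p.text for p in paragraphs])

# Without the namespace the same query finds nothing.
print(root.find("teiHeader"))
```

The last line prints `None`, a common stumbling block when first querying TEI files: the `xmlns` attribute on the root applies to every element inside it.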
                
            
                
  Having a Coke With You
  
is even more fun than going to San Sebastian, Irún, Hendaye, Biarritz,
or being sick to my stomach on the Travesera de Gracia in Barcelona
partly because in your orange shirt you look like a better happier St. Sebastian
partly because of my love for you, partly because of your love for yoghurt
                
            
                
<div>
  <head>Having a Coke With You</head>
  
    is even more fun than going to San Sebastian, Irún, Hendaye, Biarritz,
    or being sick to my stomach on the Travesera de Gracia in Barcelona
    partly because in your orange shirt you look like a better happier St. Sebastian
    partly because of my love for you, partly because of your love for yoghurt
  
</div>
                
            
                    
<div>
  <head>Having a Coke With You</head>
  <lg>
    <l>is even more fun than going to San Sebastian, Irún, Hendaye, Biarritz,</l>
    <l>or being sick to my stomach on the Travesera de Gracia in Barcelona</l>
    <l>partly because in your orange shirt you look like a better happier St. Sebastian</l>
    <l>partly because of my love for you, partly because of your love for yoghurt</l>
  </lg>
</div>

            
                
 <div>
  <head>Having a Coke With You</head>
  <lg>
    <l>is even more fun than going to <placeName>San Sebastian</placeName>, <placeName>Irún</placeName>, Hendaye, Biarritz,</l>
    <l>or being sick to my stomach on the <placeName>Travesera de Gracia</placeName> in Barcelona</l>
    <l>partly because in your orange shirt you look like a better happier <persName>St. Sebastian</persName></l>
    <l>partly because of my love for you, partly because of your love for yoghurt</l>
  </lg>
</div>
                
            

What to encode?

Input =/= Output

Decisions about encoding = editorial practice

Encode what you care about and what you have time to encode

If you don't encode it, you can't do much with it

But: you don't need to encode or retain everything
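
The point about unencoded text can be made concrete. As a rough sketch, the two lines below are taken from the encoded poem, where some place names are tagged with <placeName> and others are left as plain text; only the tagged ones can be pulled out by name.

```python
import xml.etree.ElementTree as ET

POEM = """<lg>
  <l>is even more fun than going to <placeName>San Sebastian</placeName>, <placeName>Irún</placeName>, Hendaye, Biarritz,</l>
  <l>or being sick to my stomach on the <placeName>Travesera de Gracia</placeName> in Barcelona</l>
</lg>"""

lg = ET.fromstring(POEM)

# Collect the text of every <placeName>, wherever it sits in the tree.
places = [el.text for el in lg.iter("placeName")]
print(places)

# Hendaye is in the poem but was never tagged, so the query cannot see it.
print("Hendaye" in places)
```

Whether that gap matters depends on the research question: an edition interested in O'Hara's geography would want all of the names tagged, while one focused on lineation might not tag any.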

Next Steps

  • Starting oXygen
  • Downloading XML file