[bnc] BNC User manual - Written texts

Written texts

Divisions of written texts

Written texts exhibit a bewildering variety and richness of different structural forms. Some have very little organization at levels higher than the paragraph; others may have a complex hierarchy of parts, sections, chapters etc. Novels are divided into chapters, newspapers into sections, reference works into articles and so forth. The following elements are used to represent all such textual divisions:

<div1>: major subdivision of a written text, e.g. chapter.
<div2>: further subdivision of a written text, entirely contained within a <div1>, e.g. section.
<div3>: further subdivision of a written text, entirely contained within a <div2>, e.g. subsection.
<div4>: smallest possible subdivision of a written text, entirely contained within a <div3>, e.g. sub-subsection.

Most written texts, of whatever kind, are hierarchically subdivided using these elements. Structural subdivisions smaller than level 4 (but above paragraph level) are all tagged <div4>. In all texts, structural subdivisions at the highest level (<div1>) are always identified; lower levels of subdivision (i.e. <div2>, <div3> or <div4>) may also be supplied where appropriate, but are not required.
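The containment rules above can be checked mechanically. The following is a minimal sketch only, not part of the BNC software: it assumes that every division carries an explicit end-tag (which the corpus markup may omit), and the function name check_div_nesting is hypothetical.

```python
import re

# Matches opening and closing <div1>..<div4> tags; attributes are ignored.
DIV_TAG = re.compile(r"<(/?)div([1-4])\b[^>]*>", re.IGNORECASE)


def check_div_nesting(markup: str) -> list[str]:
    """Return a list of problems found in the nesting of <divN> tags."""
    problems = []
    open_levels = []  # stack of the division levels currently open
    for match in DIV_TAG.finditer(markup):
        closing = match.group(1) == "/"
        level = int(match.group(2))
        if closing:
            if not open_levels or open_levels[-1] != level:
                problems.append(f"unexpected </div{level}>")
            else:
                open_levels.pop()
        else:
            # A <divN> may only open directly inside a <div(N-1)>.
            expected = len(open_levels) + 1
            if level != expected:
                problems.append(
                    f"<div{level}> opened where <div{expected}> was expected"
                )
            open_levels.append(level)
    problems.extend(f"<div{level}> never closed" for level in open_levels)
    return problems
```

An empty result means that each <div2> sits inside a <div1>, each <div3> inside a <div2>, and so on.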

These elements have the following attributes in common, in addition to the global attributes id, n, and r:

type

categorizes the division in some respect, e.g. as a chapter, section etc.

org

specifies how the content of the division is organized. Legal values are:

compo: composite content: i.e. no claim is made about the sequence in which elements inferior to this one are to be processed, or their interrelationships
seq: sequential content: i.e. elements inferior to this are regarded as forming a logical unit, to be processed in the sequence given

complete

specifies whether or not this division is complete or a sample. Legal values are:

Y: the full text of the original has been used
N: a sample of the original text has been used
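As an illustration of the legal values just listed, the sketch below reports any org or complete value that falls outside them. It is an assumption of this sketch that values are compared without regard to case, since examples in this section write complete=y; the function name is hypothetical.

```python
# Legal values as listed above, held in lower case. Case-insensitive
# comparison is an assumption of this sketch, not a documented rule.
LEGAL_VALUES = {
    "org": {"compo", "seq"},
    "complete": {"y", "n"},
}


def invalid_division_attributes(attrs: dict[str, str]) -> dict[str, str]:
    """Return the subset of attrs whose value is not one of the legal values.

    Attributes with no fixed value list here (such as type or n) are ignored.
    """
    return {
        name: value
        for name, value in attrs.items()
        if name in LEGAL_VALUES and value.lower() not in LEGAL_VALUES[name]
    }
```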

The n attribute is sometimes used to supply an identifying name or number used within the text for a given division, for example a chapter number.

More often, however, chapter names or numbers will appear within the text, tagged using the <head> element discussed in section Headings and captions below.

The value of the attribute type is used to characterise the function of the textual division, according to an informal taxonomy. The values used are listed in ??. If a value is supplied for one division at a given level, it may be assumed to apply to all subsequent divisions at the same level until the end of the enclosing element.
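The carrying forward of a type value to later divisions at the same level might be sketched as follows. This is an illustration under stated assumptions: the input is the list of type values for the divisions inside one enclosing element, in document order, with None where no value was supplied, and the type names used in any call are hypothetical.

```python
from typing import Optional


def inherit_types(sibling_types: list[Optional[str]]) -> list[Optional[str]]:
    """Fill in missing type values for sibling divisions at one level.

    A supplied value applies to all following siblings until another
    value is supplied. Call the function once per enclosing element, so
    that nothing carries over past the end of that element.
    """
    resolved = []
    current = None  # no value is in force until one is supplied
    for value in sibling_types:
        if value is not None:
            current = value
        resolved.append(current)
    return resolved
```

Divisions that precede the first supplied value keep None, since the rule gives them nothing to inherit.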

A sequence of paragraph-level elements of arbitrary length may precede the first structural subdivision at any level. A text may have no structural divisions within it at all. Note that any prefatory or appended matter not forming part of a text will not generally be captured: the TEI elements <front> and <back> are not used.

Paragraph-level elements and chunks

Written texts may be organized into structural units containing more than one <s> element and smaller than any of the divisions discussed in section Divisions of written texts above. The most commonly found such element is the <p> (paragraph), but there are several others. Their common identifying feature is that they may appear directly within divisions (that is, directly within <div1>, <div2> etc., or within <text> elements, not nested within some other element such as a paragraph).

An alphabetically ordered list of these elements follows:

<bibl>

a loosely structured bibliographic citation appearing within a corpus text (see Notes and citations).

<caption>

(1) a heading, title etc. attached to a picture or diagram, usually with deictic content (2) a ‘pull quote’ or other text about or extracted from a text and superimposed upon it to draw attention to it (see Headings and captions). Attributes include:

type

categorizes the caption. Legal values are:

byline: caption containing authorship or provenance of an article in a newspaper or periodical
display: extra-textual caption such as a pull quote or displayed box
attached: caption describing a non-transcribed item such as a figure or photograph
unspec: not specified or unknown

<head>

a title or heading prefixed to some division of a written text or to a poem (see Headings and captions). Attributes include:

type

characterises the heading in some respect. Legal values are:

byline: heading containing authorship or provenance of an article in a periodical
main: a main heading (only one allowed per div)
sub: a secondary heading (may be zero or more per div)
unspec: not specified or unknown

<list>

a collection of distinct items flagged as such by special layout in written texts, often functioning as a single syntactic unit (see Lists).

<note>

any form of note, additional comment or gloss within a written or spoken text (see Notes and citations).

<p>

a paragraph in a written text.

<poem>

a poem, or an extract from one, embedded or quoted within a spoken or written text (see Poems).

<quote>

a quotation from some author other than that of the surrounding text, usually either embedded or displayed (see Quotations).

<sp>

a spoken paragraph, i.e. material marked as ‘written to be spoken’, usually by the presence of a speaker prefix, for example in a play script or printed interview (see Spoken paragraphs).

Examples for each of these (except <p>) are discussed in more detail in the following subsections.

Headings and captions

Headings and captions serve a variety of functions in written texts. The BNC scheme currently distinguishes between <head> elements, which can appear only at the start of a text division and are logically associated with it (for example, chapter titles, newspaper headlines etc.) and <caption> elements which are logically independent of the position they may have within a textual division (for example, captions attached to pictures or figures, ‘pull-quotes’ embedded within the text, ‘by-lines’ identifying authorship and provenance of a newspaper or periodical article).

One or more <head> elements may appear in sequence at the start of any <div1>, <div2>, <div3> or <div4> element, or at the start of a <list> or <poem>, as in the following example:

<div1 type="u" n=1> <head type=MAIN> <s n="1"> <w NN1>AGEISM </head> <head type=SUB> <s n="2"> <w AT0>THE <w NN1>FOUNDATION <w PRF>OF <w NN1>AGE <w NN1>DISCRIMINATION </head> <head type=BYLINE> <s n="3"> <w NP0>STEVE <w NP0-NN1>SCRUTTON </head>

In the following example, the <head> element is followed by a number of <caption> elements introducing particular parts of a magazine story:

<div1 complete=y org=seq> <head> <s n=00040> <w NN2>TROUSERS <w VVB>SUIT </head> <caption> <s n=00041> <w EX0>There <w VBZ>is <w PNI>nothing <w AJ0>masculine <w PRP>about <w DT0>these <w AJ0>new <w NN1>trouser <w NN2-VVZ>suits <w PRP>in <w NN1>summer<w POS>'s <w AJ0>soft <w NN2>pastels<c PUN>. <s n=00042> <w NP0>Smart <w CJC>and <w AJ0>acceptable <w PRP>for <w NN1>city <w NN1-VVB>wear <w CJC>but <w AJ0>soft <w AV0>enough <w PRP>for <w AJ0>relaxed <w NN2>days </caption>

The type attribute may be used to distinguish more exactly the function of the caption or heading, as indicated below.

<div1 complete=y org=seq> <head type=main> <s n=0223> <w PNP>They<w VBB>'re <w VDG>doing <w AJ0>fine </head> <head type=sub> <s n=0224> <w NP0>Dominic <w VVZ>sees <w AJ0-NN1>double </head>
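The constraint that a division may carry only one main heading can be checked with a short routine. This is a sketch only: it assumes the type values of the <head> elements of one division have already been collected into a list, and it compares them without regard to case because the examples in this section write both MAIN and main.

```python
from collections import Counter

# Legal values for the type attribute of <head>, as listed earlier.
LEGAL_HEAD_TYPES = {"byline", "main", "sub", "unspec"}


def check_head_types(head_types: list[str]) -> list[str]:
    """Report unknown head types and more than one main heading per division."""
    problems = []
    normalised = [value.lower() for value in head_types]
    counts = Counter(normalised)
    for value in sorted(set(normalised) - LEGAL_HEAD_TYPES):
        problems.append(f"unknown head type: {value}")
    if counts["main"] > 1:
        problems.append(f"{counts['main']} main headings in one division")
    return problems
```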

Where captions would interrupt the normal flow, pointers are used as discussed in section ??.

Quotations

A quotation is an extract from some other work than the text itself which is embedded within it, for example as an epigraph or illustration. It is marked up using the <quote> element. This may contain any combination of other chunks (for example paragraphs, poems, lists) but may not directly contain phrase-level elements. Any reference for the citation should also be contained within it.

For example:

<quote> <s n=2080> <w DT0>This <w NN1>way <w PRP>for <w AT0>the <w AJ0>sorrowful <w NN1>city<c PUN>. <s n=2081> <w DT0>This <w NN1>way <w PRP>for <w AJ0>eternal <w NN1>suffering<c PUN>. <s n=2082> <w DT0>This <w NN1>way <w TO0>to <w VVI>join <w AT0>the <w AJ0>lost <w NN0>people<c PUN>&hellip <s n=2083> <w VVB>Abandon <w DT0>all <w NN1>hope<c PUN>, <w PNP>you <w PNQ>who <w VVB>enter<c PUN>&hellip <bibl><s n=2084> <w NP0>Dante </bibl> </quote>

Spoken paragraphs

As noted above, the <sp> element is used to mark parts of a written text which were or are intended to be spoken, for example the speeches in a dramatic text or a published interview. Such parts are generally readily identifiable by the use of such conventions as speaker prefixes (the label supplying the name of the speaker) and stage directions, for which the following specific tags are defined:

<spkr>: contains the speech prefix used in the original source to identify the speaker of a passage written to be spoken.
<stage>: contains any kind of stage direction within a dramatic text.

The <sp> element is used only for speaker turns identified as such in a written text, by contrast with the <u> element discussed in section Utterances, which is used only for speaker turns identified in a spoken text, i.e. one which has been transcribed from audio tape.

If present, a <spkr> element should appear at the start of the <sp> element, followed by one or more <p> elements containing the actual speech. Any <stage> element present will usually be relocated to the end of the paragraph in which it occurs and replaced by a <ptr> element, as discussed in section ??.

For example:

<sp> <spkr> <s n=00156> <w CRD>M </spkr> <s n=00157> <ptr target=HHWST01C><w VVB>Give <w DPS>her <w NN1>medicine<c PUN>. <s n=00158> <w PNP>I<w VM0>'ll <w VVI>kill <w PNX>myself <w CJS>if <w PNP>she <w VVZ>dies<c PUN>. <stage id=HHWST01C type=u> <s n=00159> <c PUL>(<w NP0>Sinking <w PRP>to <w NN2>knees <w CJC>and <w AJ0-VVG>banging <w NN1>head <w PRP>on <w NN1>floor<c PUR>) </stage> </sp>
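A relocated <stage> (or, later in this section, a relocated <note>) is found by matching the target of the <ptr> against an id. The sketch below lists pointers that have no such match. It is illustrative only: the regular expressions handle just the simple attribute syntax seen in these examples, and identifiers are compared without regard to case, which is an assumption suggested by the note example in this section (target=A02NT001 against id=a02nt001).

```python
import re

# <ptr target=...>; the attribute value may or may not be quoted.
PTR = re.compile(r'<ptr\s+target="?([\w.-]+)"?\s*>', re.IGNORECASE)
# The id attribute on a <stage> or <note> start-tag.
ANCHOR = re.compile(r'<(?:stage|note)\b[^>]*?\bid="?([\w.-]+)', re.IGNORECASE)


def unresolved_pointers(markup: str) -> list[str]:
    """Return the targets of <ptr> elements with no matching id."""
    ids = {match.group(1).lower() for match in ANCHOR.finditer(markup)}
    return [
        match.group(1)
        for match in PTR.finditer(markup)
        if match.group(1).lower() not in ids
    ]
```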

Poems

Poems or fragments of verse or song may appear both within and between paragraphs. The <l> (line) element is used to mark each metrical line, and any titles or headings present are marked with <head> elements. Each such group of lines is marked as a <poem> element, with no indication of its completeness.

No provision is made for marking units of verse such as stanzas, verse paragraphs etc. A part attribute is defined for the <l> element, which allows incomplete lines to be indicated, but in the current version of the corpus this always takes the value ‘u’ (for unknown).

For example:

<poem> <l part=u> <s n=0900> <w PNP>I <w VVB>send <w DPS>my <w NN1>soul <w PRP>through <w NN1>time <w CJC>and <w NN1>space <w TO0>to <w VVI>greet <w PNP>you<c PUN>. </l> <l part=u> <s n=0901> <w PNP>You <w VBD>were <w AT0>a <w NN1>poet<c PUN>. <s n=0902> <w PNP>You <w VM0>will <w VVI>understand<c PUN>. </l> </poem>
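To recover the plain text of each metrical line from markup like the example above, something along the following lines may serve. It is a rough sketch, assuming that every <l> has an explicit end-tag and that words and punctuation appear as <w TAG>text and <c TAG>text with no end-tags, as in the examples here; entity references such as &mdash are left untouched.

```python
import re

# One metrical line: <l ...> up to its </l>. Does not match <lb> or <list>.
LINE = re.compile(r"<l\b[^>]*>(.*?)</l>", re.IGNORECASE | re.DOTALL)
# A word or punctuation token: <w TAG>text or <c TAG>text.
TOKEN = re.compile(r"<(?:w|c)\s+[^>\s]+>([^<]*)", re.IGNORECASE)


def poem_lines(markup: str) -> list[str]:
    """Return the text of each <l> element, with markup removed."""
    lines = []
    for body in LINE.findall(markup):
        text = "".join(TOKEN.findall(body))
        lines.append(" ".join(text.split()))  # normalise white space
    return lines
```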

Note that the <l> element is not used to mark typographic lineation; on the few occasions where this has been recorded, it is marked with the <lb> tag discussed in section Miscellaneous phrase-level elements below.

Lists

A list is a collection of distinct items flagged as such by special layout in written texts, often functioning as a single syntactic unit. Lists may appear within or between paragraphs. Where marked, lists are tagged with the <list> element.

A <list> element consists of an optional <head> element, followed by one or more <item> elements, each of which may optionally be preceded by a <label> element, used to hold the identifier or tag sometimes attached to a list item, for example ‘(a)’. It may also contain a word or phrase used for a similar purpose.

The <item> element may appear only inside lists. It contains the same mixture of elements as a paragraph, and may thus contain one or more nested lists. It may also contain a series of paragraphs, each marked with a <p> element.

Here is an example of a simple list:

<list> <item> <s n=0087> <w VBZ>Is <w DPS>your <w NN1>nylon <hi r=it> <w NN1>nightie </hi> <w AJ0>fireproof<c PUN>? </item> <item> <s n=0088> <w AT0>The <w NN1>hurricane <w VBD>was <hi r=it> <w AJ0-AV0>mighty </hi> <w AJ0>fierce<c PUN>. <pb n=78> </item> <item> <s n=0089> <w VM0>Will <w PNP>you <hi r=it> <w VVI>mow </hi> <w AT0>the <w NN1>lawn<c PUN>? </item> <item> <s n=0090> <w VDD>Did <w PNP>you <hi r=it> <w VVI>know </hi> <w AT0>the <w NN1>time<c PUN>? </item> </list>

Here is an example of a labelled list:

<list> <label> <s n=0423> <w CRD>1<c PUN>. </label> <item> <s n=0424> <w NN1-NP0>Surya <c PUN>&mdash <w NN1>Sun <c PUN>&mdash <w AJ0>Creative <w NN1>agent </item> <label> <s n=0425> <w CRD>2<c PUN>. </label> <item> <s n=0426> <w NN1-NP0>Vayu <c PUN>&mdash <w NN1>Air <c PUN>&mdash <w NP0>Preserving <w NN1>agent <pb n=43> </item> <label> <s n=0427> <w CRD>3<c PUN>. </label> <item> <s n=0428> <w NN2>Agni <c PUN>&mdash <w NN1>Fire <c PUN>&mdash <w AJ0>Destructive <w NN1>agent </item> </list>
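Since each <label> belongs to the <item> that follows it, a labelled list can be read as a sequence of pairs. The sketch below shows one way of doing so; it assumes explicit end-tags and no nested lists (a nested <item> would defeat the simple pattern used), and the function name is hypothetical.

```python
import re
from typing import Optional

# A <label> or <item> element with its content; nested lists are not handled.
PART = re.compile(r"<(label|item)\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def pair_labels(list_markup: str) -> list[tuple[Optional[str], str]]:
    """Pair each <item> with the <label> immediately preceding it, if any."""
    pairs = []
    pending_label = None
    for kind, body in PART.findall(list_markup):
        if kind.lower() == "label":
            pending_label = body.strip()
        else:
            pairs.append((pending_label, body.strip()))
            pending_label = None  # a label applies to one item only
    return pairs
```

Items in a simple, unlabelled list are returned with None in place of the label.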

Notes and citations

Annotations occurring in written texts, and bibliographic citations or references, have been marked up in some texts, using the <note> element. This element has the following additional attributes:

type

identifies the provenance of the note, i.e. editorial or authorial. Legal values are:

ed: note supplied by BNC transcriber or encoder
orig: note present in the original source text

ed

code for the person or organization responsible for BNC-supplied note. Legal values are:

lancs: Note supplied by UCREL grammarians
longm: Note supplied by Longman transcribers
oucs: Note supplied by OUCS staff
oup: Note supplied by OUP transcribers
undef: Provenance of note unknown or unspecified

place

specifies the location of an original note in the source text. Legal values are:

foot: foot of page
end: end of current division or text
side: left or right margin
unspec: unknown or unspecified.

Notes within headers are tagged using a distinct <bibNote> element, which is a departure from TEI-recommended practice, as is the use of the <note> element for both original and supplied annotation. The two usages are distinguished by the type attribute.

Here for example is a typical transcriber's note:

<note type=ed> <s n=0001> <w NN1-NP0>Page <w NN2>numbers <w XX0>not <w AJ0>available </note>

Original notes may contain any mixture of other chunks, and may also contain paragraphs: they may appear in written texts only. They will normally be relocated to the end of the section in which they appear, and their original position marked by a <ptr> element, as discussed in section ??.

For example:

<s n=053> <w CJS-PRP>As <w AT0>the <w NP0>UK<w POS>'s <w AJ0>main <w AJ0>independent <w NN1>AIDS <w AV0>home <w NN1-VVB>care <w NN1>provider<c PUN>, <w PNP>we <w VVD>cared <w AVP-PRP>for <w PRP>around <w NN0>25% <w PRF>of <w DT0>all <w DT0>those <w PNQ>who <w VVD>died <w PRF>of <w NN1>AIDS <w ORD>last <w NN1>year <ptr target=A02NT001><c PUN>. <s n=054> <w PRP>In <w NP0>London<c PUN>, <w NN1-VVB>demand <w PRP>for <w DPS>our <w NN1>Home <w NN1-VVB>Care <w NN2>services <w VVD-VVN>doubled <w AVP-PRP>over <w AT0>the <w ORD>last <w CRD>twelve <w NN2>months<c PUN>.  <s n=056> <w PNP>I <w VVB>expect <w NN1-VVB>demand <w PRP>for <w DT0>this <w NN1>service <w TO0>to <w VVI>continue <w TO0>to <w VVI>grow <w AVP-PRP>over <w AT0>the <w AJ0>coming <w NN1>year<c PUN>. <note id=a02nt001 n=2 type=orig> <s n=057> <w NN1>AIDS <w NN2>deaths<c PUN>: <w NP0>April <w CRD>1990 <c PUN>&mdash <w NP0>March <w CRD>1991<c PUN>, <w NP0>UK <w NN1>total <c PUL>(<w NN1-NP0>CDSC <w NN2>figures <c PUN>&mdash <w CRD>584 <w NP0>April <w CRD>1991<c PUN>.<c PUR>) <s n=058> <w DPS>Our <w NN1>Home <w NN1-VVB>Care <w NN2>teams <w VVD>saw <w CRD>141 <w NN0>AIDS <w AJ0-VVD>related <w NN2>deaths <w ORD>last <w NN1>year </note>

Note the use of the n attribute to carry the original footnote number in the above example.

Bibliographic citations or references within running texts may also be marked, using the <bibl> element; this is done in some texts only in the present version of the corpus.

For example:

<bibl> <s n=1379> <w NP0>Mordechai <w NP0>Chaim <w NP0>Rumkowski<c PUN>, <w AJS>Eldest <w PRF>of <w AT0>the <w NN2>Jews <w PRP>in <w AT0>the <w NN1-NP0>Lodz <w NN1>ghetto<c PUN>, <w VVG>speaking <w PRP>in <w CRD>1942 </bibl>

Phrase-level elements

Phrase-level elements are elements which cannot appear directly within a textual division, but must be contained by some other element. In practice, this means they will be contained within an <s> element.

Highlighted phrases

Typographic highlighting in the original may not be marked in the transcript at all. Alternatively, highlighted phrases, and the kind of highlighting used, may be recorded in one of two ways:

using the global rend (rendition) attribute
using the <hi> (highlighted) element

The former is used where the function of the highlighting is clear, for example to mark a heading, and where the boundaries of the highlighted phrase therefore coincide with the boundaries of some other CDIF element. The latter is used where the function is not clear, where the DTD does not provide a tag to identify the feature concerned or where the highlighted phrase is not coterminous with some other element.

When the <hi> element is used, its rend attribute must be supplied. On all other CDIF elements, the rend attribute is optional. Its value indicates the nature of the highlighting used, e.g. italic font, quoted, small caps etc. A list of the values used for this attribute is given in section ?? below.

It should be noted that the purpose of the rend attribute is not to provide information adequate to the needs of a typesetter, but simply to record some qualitative information about the original. In particular, the present version of the corpus includes no indication of size of type or style of writing.

Like all other phrase-level elements, each <hi> element must be entirely contained by an <s> element. This implies that where, for example, a bolded passage contains more than one sentence, or an italicised phrase begins in one verse line and ends in another, the <hi> element must be closed at the end of the enclosing element, and then re-opened within the next.

For example, in the following four lines of verse, the first three are rendered in italics, and the rend attribute is therefore specified for each <l> element. In the fourth line, only the first few words are in italics: a <hi> element must be used within the <l> to carry this information.

<l part=u rend=it> <s n=394><w PNP>It <w VBD>was <w CRD>one <w PRF>of <w AT0>a <w NN0>pair<c PUN>. <s n=395><w DPS>Its <w AJ0>precious <w NN1>twin </l> <l part=u rend=it> <s n=396><w VBD>was <w VVN>stolen <w PRP>by <w AT0>the <w NN2>soldiers<c PUN>. <s n=397><w DT0>All <w AT0>the <w NN1>time </l> <l part=u r=it> <s n=398><w DPS>her <w NN1>uncle <w VVD>stood <w AV0>there <w VVG>clutching <w DT0>this <w CRD-PNI>one <w PRP>in </l> <l part=u> <s n=399><hi rend=it> <w DPS>his <w AJ0>big <w NN1>fist </hi> <c PUN>&mdash <w AV0>so<c PUN>! <s n=400><w PNP>She <w VDZ>does <w DT0>a little <w NN1>mime<c PUN>. </l>
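The rule that a <hi> element may not cross the boundary of its enclosing <s> element can be illustrated as follows. This is a simplified sketch, not the procedure used to encode the corpus: sentences are given as plain lists of words, the <s> and <w> tags are left out of the output, and the highlighted stretch is given as a range of word positions counted across all the sentences.

```python
def wrap_highlight(
    sentences: list[list[str]], start: int, end: int, rend: str = "it"
) -> list[str]:
    """Mark words start..end-1 as highlighted, one output string per sentence.

    The <hi> element is closed at the end of each sentence and re-opened
    at the start of the next if the highlighting continues.
    """
    output = []
    offset = 0  # position of the first word of the current sentence
    for words in sentences:
        parts = []
        inside = False
        for index, word in enumerate(words):
            highlighted = start <= offset + index < end
            if highlighted and not inside:
                parts.append(f"<hi rend={rend}>")
                inside = True
            elif not highlighted and inside:
                parts.append("</hi>")
                inside = False
            parts.append(word)
        if inside:
            parts.append("</hi>")  # never carry <hi> over the boundary
        output.append(" ".join(parts))
        offset += len(words)
    return output
```

A highlighted stretch spanning two sentences therefore yields two <hi> elements, one in each.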

Miscellaneous phrase-level elements

The following miscellaneous phrase-level elements also appear within <s> elements in written texts:

<pb>: marks the start of a new page in the original source; used to indicate where e.g. articles in periodicals are split across several pages.
<lb>: marks the start of a new (printed) line in the original source.
<name>: proper name of a person, place or institution.
<salute>: a formulaic greeting or form of address appearing at the start or the end of a spoken or a written text.

In this example, the presence of a page break between two verse lines is indicated by the <pb> element:

<l part=u> <s n=1403> <c PUN>&mdash <w CJC>and <w NN2>creditors <w VVB>grow <w AJ0>cruel<c PUN>, </l> <l part=u> <pb n=75> </l> <l part=u> <s n=1404> <w AV0>so <w PNP>he <w VVZ>bows <w CJC>and <w NN2-VVZ>scrapes<c PUN>, </l>

In the following example, the <lb> element has been used to mark the position of line breaks in the source text, since they seem to be taking the place of conventional punctuation:

<caption> <s n=1503> <w NN1>Man <w PRF>of <w AT0>the <w NN1>Year <lb> <w NN1>Design <lb> <w NN1>Design <w NN1>Concept <lb> <w AJ0>Technical <w NN1>Innovation <w NN1>Safety <w NN1-NP0>Achievement <lb> <w NP0>Environmental <w NN1-NP0>Contribution <lb> <w NN1-VVG>Marketing <w NN1>Initiative <lb> <w NN1>Manufacturer <w PRP>in <w NN1-NP0>Motorsport <lb> <w AJ0-NN1>Specialist <w NN1>Manufacturer <w NN1>Mid-size <w NN1-NP0>Manufacturer <lb> <w AJ0>Large <w NN1>Manufacturer </caption>
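Where <lb> elements have been recorded, the printed lines of the source can be recovered by splitting on them. The following is a minimal sketch under the same assumptions about word and punctuation tokens as the examples in this section (<w TAG>text and <c TAG>text, with no end-tags); the function name is hypothetical.

```python
import re

# A word or punctuation token: <w TAG>text or <c TAG>text.
TOKEN = re.compile(r"<(?:w|c)\s+[^>\s]+>([^<]*)", re.IGNORECASE)
# A line break marker, with or without attributes.
LB = re.compile(r"<lb\b[^>]*>", re.IGNORECASE)


def printed_lines(markup: str) -> list[str]:
    """Split the markup at each <lb> and return the text of each printed line."""
    lines = []
    for segment in LB.split(markup):
        text = "".join(TOKEN.findall(segment))
        lines.append(" ".join(text.split()))  # normalise white space
    return lines
```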

In the following example, the <salute> element has been used to separate the addressee of a letter from the rest of the text:

<s n=0343> <w VVB>Ask <w TO0>to <w VVI>see <w NN2>examples <w PRF>of <w DPS>their <w NN1>work <w CJC>and <w NN1-VVB>contact <w NN2>references <w PRP>from <w DPS>their <w AJ0-NN1>past <w NN2>projects<c PUN>. <s n=0344> <salute> <w NP0>JOHN <w NN0>DIBBLE <w NN1>Partner<c PUN>, <w NP0>Atlam <w NN1>Design <w NN1-NP0>Partnership<c PUN>, <w NP0>Portland <w NP0>House<c PUN>, <w NP0>Portland <w NP0>Street<c PUN>, <w NP0>Leamington <w NP0>Spa<c PUN>, <w NP0>Warwickshire<c PUN>. </salute>
