Encoding of Texts

To be usable by a computer, an electronic text must include some kind of mark-up. The mark-up introduced into the BNC texts explicitly indicates a wide range of important information, including:
  • the boundary and part of speech of each word
  • the sentence structure identified by CLAWS
  • paragraphs, sections, headings and similar features in written texts
  • speech turns, pausing, and para-linguistic features such as laughter in spoken texts
  • meta-textual information about the source or encoding of individual texts
These textual features, and others, are all encoded in a standardized way, to help ensure that the corpus will be usable no matter what the local computational set-up may be.

The format used by the BNC is called the Corpus Document Interchange Format (CDIF for short) and is fully documented in the BNC Users Reference Guide. An article by Gavin Burnage and Dominic Dunlop titled "Encoding the British National Corpus", written while the BNC was being developed, describes the scheme and its use within the project in some detail.

CDIF was originally designed to use an ISO standard called SGML (ISO 8879: Standard Generalized Markup Language). The BNC XML edition uses the more recent W3C standard called XML (Extensible Markup Language).
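
As a rough illustration of how XML mark-up makes word boundaries and part-of-speech information explicit, the sketch below parses a single sentence with Python's standard library. The sample sentence is invented, and the element and attribute names (s, w, c, c5, hw, pos) are assumptions modelled on the XML edition rather than a definitive account of the scheme; the Users Reference Guide remains the authority on the actual format.

```python
import xml.etree.ElementTree as ET

# Invented fragment for illustration only. The element and attribute names
# (s, w, c, c5, hw, pos) are assumptions, not taken from this page.
SAMPLE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="corpus" pos="SUBST">corpus </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="AJ0" hw="large" pos="ADJ">large</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def tagged_words(xml_text):
    """Return (word, tag) pairs for every <w> element in one <s> element."""
    sentence = ET.fromstring(xml_text)
    # Trailing spaces inside <w> mark word boundaries; strip them for display.
    return [((w.text or "").strip(), w.get("c5")) for w in sentence.iter("w")]


if __name__ == "__main__":
    for word, tag in tagged_words(SAMPLE):
        print(f"{word}\t{tag}")
```

Because each word carries its own tag as an attribute, any XML-aware tool can recover the analysis without knowing how the text was originally processed.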

The design of CDIF was strongly influenced by the development of the Guidelines for Encoding of Electronic Text of the international Text Encoding Initiative (TEI). The use of these Guidelines helps to ensure the corpus can be used by many different researchers using different types of machines and software.

