British National Corpus
Introduction to SARA98
This worksheet introduces you to some of the key features of the SARA software. It does not cover everything that the software can do but it gives a good indication of the kinds of facilities available. Please use it as a basis for your exploration of the system.
If you are using SARA to connect to some other server or corpus, the name which appears may be different. SARA can be configured to work with other corpora, with other servers providing access to the BNC, or with a copy of the BNC installed locally on your hard disk. You can click on the menu button to see what other servers are available, or to add a new one. Configuring your client to operate with other corpora is not covered in this tutorial.
The message ‘Initialising please wait’ appears at the bottom of the screen, and there is a short delay during which details about the corpus being searched are loaded into the program. When this process is complete, a minimized window titled BNC-2 appears in the bottom left of the screen.
This is the corpus Browse window. If you open it, you will see a list of the texts making up the corpus. You can select texts from the list for various purposes, e.g. to make a subcorpus or simply to browse them; however, this use of the window is not covered in this tutorial.
At the top of the screen you see the usual Windows menu items (File, Edit, Texts, View, Window, and, on the far right, Help). Below that you can see a number of buttons which we call collectively the Toolbar.
After a second or so, the name of the button will appear in a small popup and a brief description of its purpose will appear at the bottom of the screen. Each button provides rapid access to a specific function also available from a menu. In this tutorial we will be using the following buttons:
In this tutorial we will discuss just a few of the functions available; for an overview of all of them, you may like to explore the built-in Help system, by selecting Contents on the Help menu or simply pressing the F1 key.
The cursor will turn briefly into an hourglass while SARA searches, and then the ‘Too Many Solutions’ alert will appear, telling you the result of the search: there are 1651 occurrences of this phrase in 927 different texts.
By default SARA will not download and display more than 100 solutions to any query; you must therefore specify how many solutions you want to see and how they should be selected from those available. You can change this behaviour using the Preferences command on the View menu, which we discuss later on in this tutorial.
If you look at the status bar, you will see that it now contains additional information. Reading from left to right, you should see something like the following: BNC2 bnc 2:100(100) A0F 142. This indicates that the name of the corpus being searched is BNC2, that the lemmatization scheme in effect is called bnc, that the currently highlighted solution is number 2 of 100 chosen from 100 different texts, and that it appears in text A0F at sentence number 142 (your numbers will probably differ, since you are doing a random sample).
In Colour format, different parts of speech are displayed in different colours, and the POS code itself appears when the mouse hovers over a word. In SGML format, you can see all the underlying markup in the file. Custom format displays the text in a user-defined way, using the markup according to specific requirements. In Plain format, you see only the words and punctuation of the text, with the search term highlighted.
This button switches between displaying solutions one at a time and displaying solutions in the traditional one-per-line KWIC format. In either mode, you can scroll through the solutions using the PgDn and PgUp keys; in line mode you can also use the arrow keys.
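The one-per-line KWIC layout can be sketched in a few lines of Python. This is only an illustration of the display format, not of how SARA produces it: the function name kwic, the width parameter and the sample sentence are all assumptions made for the example.

```python
def kwic(tokens, keyword, width=4):
    """Return one aligned line per occurrence of keyword (case-insensitive)."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    # Right-align the left context so that the keyword forms a single column.
    pad = max((len(left) for left, _, _ in rows), default=0)
    return [f"{left:>{pad}}  {kw}  {right}" for left, kw, right in rows]


# Hypothetical sample text, not taken from the BNC.
text = "we drank red wine with cheese and then more wine after dinner"
lines = kwic(text.split(), "wine", width=3)
for line in lines:
    print(line)
```

Because the keyword always starts in the same column, the eye can run down the display and pick out recurring words immediately to its left or right.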
A menu appears from which you can choose to copy the current solution to the clipboard, to expand the amount of context visible for that solution (Max Scope), to select the current solution or a series of solutions (only in Line mode), and to view information about the Source of the current solution. Experiment with these options to see their effect.
You can also change the font in which solutions are displayed, set the colour scheme used for display of POS information, and set default preferences for display mode etc. in subsequent queries using commands on the View menu.
A dialogue box appears in which you can specify how the concordance lines should be sorted. You can use different sorting orders and other options for each of two sort keys, and also specify a particular collating method, using radio buttons in this dialogue box. You can also indicate how many words are to be considered when sorting using the Span window.
When solutions are displayed in colour format, a radio button labelled POS code is available as a Collating option, making it possible for you to sort the lines according to the POS codes of words they contain.
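As a rough illustration of sorting on a key with a given span, the sketch below orders concordance lines by the words immediately to the right or to the left of the search term. The row layout (left context, keyword, right context), the function name and the sample lines are assumptions for the example; SARA's own collating options are richer than this.

```python
def sort_concordance(rows, side="right", span=1):
    """Sort (left, keyword, right) rows by `span` words of context on one side."""
    def key(row):
        left, _, right = row
        if side == "left":
            # Read the left context outwards from the keyword: nearest word first.
            return left.lower().split()[::-1][:span]
        return right.lower().split()[:span]
    return sorted(rows, key=key)


# Hypothetical concordance lines, not taken from the BNC.
rows = [
    ("a glass of", "wine", "with dinner"),
    ("bottles of red", "wine", "and cheese"),
    ("the local", "wine", "bar was closed"),
]
by_right = sort_concordance(rows, side="right", span=1)
by_left = sort_concordance(rows, side="left", span=1)
```

Sorting on the left context reads outwards from the search term, so lines which share the same preceding word end up next to each other.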
View Query will show you the text of any query at the head of the solutions display; Concordance will show the solutions in a KWIC (one-per-line) format. Custom format will display certain elements, such as new paragraphs and utterances, in particular ways on the screen, while Automatic scope will display roughly one sentence of context for each citation.
The preferences just specified will be active for the remainder of this SARA session: if, by any chance, you are disconnected from the server or accidentally close the program, you should reset them before continuing.
SARA maintains a list of all the distinct word forms in the corpus, together with their frequencies and part of speech codes. We refer to this list as the lexicon. You use the Word Query command to search the lexicon in a number of different ways, and to find places in the corpus where these word forms occur. The Collocation facilities allow you to detect statistically significant patterns of co-occurrence for word forms in the corpus, while the Lemmatization facilities allow you to group words together under linguistically significant headings regardless of their orthographic form.
A list of all the word-forms in the lexicon which begin with the letters wine is displayed in alphabetical order. The other columns show the frequency and the number of different forms grouped under that entry using the current lemmatization scheme. In the default scheme (called bnc) words which differ only in their part-of-speech code are grouped under a single entry.
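The idea of a lexicon whose entries group together forms differing only in POS code can be sketched as follows. This is a toy model, not SARA's implementation: the function names and the tagged sample are invented for illustration, and lower-casing the word forms is an assumption about how entries are matched.

```python
from collections import Counter, defaultdict


def build_lexicon(tagged_tokens):
    """Map each lower-cased word form to a Counter of its POS codes."""
    lexicon = defaultdict(Counter)
    for word, pos in tagged_tokens:
        lexicon[word.lower()][pos] += 1
    return lexicon


def lookup_prefix(lexicon, prefix):
    """Return (word form, total frequency, number of POS variants), alphabetically."""
    return [
        (word, sum(tags.values()), len(tags))
        for word, tags in sorted(lexicon.items())
        if word.startswith(prefix)
    ]


# Hypothetical tagged tokens; the counts are not BNC figures.
tagged = [("wine", "NN1"), ("Wine", "NP0"), ("wines", "NN2"),
          ("wine", "NN1"), ("winery", "NN1"), ("win", "VVB")]
lexicon = build_lexicon(tagged)
entries = lookup_prefix(lexicon, "wine")
```

Each tuple in entries corresponds to one row of the upper list: the word form, its total frequency, and the number of variants grouped under it. The per-code counts in lexicon["wine"] correspond to what the lower window shows for a selected entry.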
The different word-forms grouped under this entry are displayed in the lower window. You will see that wine, while generally classified as a singular common noun (NN1), also appears in the lexicon as a proper noun (NP0).
Once your curiosity is satisfied, you may wish to investigate uses of wine as a common noun. To save time re-defining a query from scratch, we will use the Edit button on the toolbar: this looks like a tiny pencil writing on a blue-edged screen.
The ‘Too Many Solutions’ dialogue appears, reminding us that there are 6050 solutions to this query. In the next section we will use the collocation options in SARA as a means of investigating this mass of data.
The dialogue box expands to offer you several additional tabs which can be used to control the calculation. You can set the window, i.e. the span of words around the hit within which collocations are to be sought; you can use the download tab to control which of the possible collocates are to be displayed; you can apply a lemmatization scheme; and you can also modify the way the collocation scores are calculated. In this exercise we will use only the first two of these.
It may take several minutes for SARA to calculate all the collocates and their scores: wait for the red light on the status bar to go out before you try to do anything else. This may be a good time for a cup of coffee.
The collocate display is designed to show you words which cluster together: this is expressed by means of a frequency-based statistic known as the z-score. The higher the z-score, the more significant the clustering.
You can re-sort the list by clicking on the relevant column heading. You can save a copy of the list by clicking on the Save button and choosing an appropriate format. You can also choose to calculate significance using the Mutual Information statistic rather than z-scores.
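To give a feel for what these two statistics measure, here is a minimal sketch using one common textbook formulation of each. It is an assumption that SARA's own calculation resembles this; the exact formulae it uses may differ. All function names are illustrative and the counts are hypothetical, not BNC figures.

```python
import math


def expected_cooccurrences(node_freq, collocate_freq, corpus_size, span):
    """Co-occurrences expected if the collocate were spread evenly through the corpus."""
    # Chance of meeting the collocate at any position not occupied by the node.
    p = collocate_freq / (corpus_size - node_freq)
    return p * node_freq * span, p


def z_score(observed, node_freq, collocate_freq, corpus_size, span):
    """Distance of the observed count from the expected one, in standard deviations."""
    expected, p = expected_cooccurrences(node_freq, collocate_freq, corpus_size, span)
    return (observed - expected) / math.sqrt(expected * (1 - p))


def mutual_information(observed, node_freq, collocate_freq, corpus_size, span):
    """Base-2 logarithm of the ratio of observed to expected co-occurrences."""
    expected, _ = expected_cooccurrences(node_freq, collocate_freq, corpus_size, span)
    return math.log2(observed / expected)


# Hypothetical counts chosen to keep the arithmetic simple.
counts = dict(node_freq=1000, collocate_freq=500, corpus_size=1_001_000, span=4)
expected, _ = expected_cooccurrences(**counts)
z = z_score(20, **counts)
mi = mutual_information(20, **counts)
print(f"expected={expected:.2f} z={z:.2f} MI={mi:.2f}")
```

Both statistics compare the observed number of co-occurrences with the number expected by chance. The z-score takes the size of the counts into account, whereas Mutual Information looks only at the ratio, so it tends to favour rare words.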
In an inflected language like English, it is often convenient to group under the same heading words which have different forms, as well as to distinguish words which have the same form but different part of speech codes. For example, consider the word rise. This may be a noun or a verb. As a verb, it may be regarded as comprising a number of inflected forms (rose, risen, rises, rising, etc.). The lemmatization feature of SARA allows us to perform such groupings.
In the lower box, you will see a list of the different forms grouped under this heading by the current lemmatization scheme. In the default (BNC) lemmatization scheme, this word has six different POS codes, the frequency for each of which is given.
In the Lancaster lemmatization scheme, nominal and verbal forms of rise are treated as different head words, so there are now two entries for the word in the upper list, one tagged SUBST and the other VERB. The frequency count given for each of these includes all of its inflected forms together.
In the last section we saw that the verbal lemma rise was approximately twice as frequent as the nominal one. As 90% of the BNC is composed of written texts, this difference probably reflects the relative frequencies with which they are employed in writing. Is there also a difference in speech? You can investigate this question by submitting the same query you posed using the full corpus to the subcorpus of spoken texts.
Since the spoken component is approximately 10% of the whole corpus, we would expect the frequencies of the two lemmas to be approximately one-tenth of what they were in the whole corpus. However, the new figures are very much smaller than this, particularly as far as the verb forms are concerned.
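The reasoning here is simple proportional arithmetic, which can be sketched as follows. The counts below are hypothetical placeholders, not the figures SARA reports: substitute the numbers you obtain from your own queries. The function names are also illustrative.

```python
def expected_in_subcorpus(full_count, subcorpus_share):
    """Count expected in a subcorpus if the item were evenly spread over the corpus."""
    return full_count * subcorpus_share


def observed_to_expected(observed, full_count, subcorpus_share):
    """Ratio below 1 means the item is rarer in the subcorpus than its size predicts."""
    return observed / expected_in_subcorpus(full_count, subcorpus_share)


# Hypothetical counts: replace them with the figures from your own queries.
full_corpus_count = 20000   # occurrences of the lemma in the whole corpus
spoken_count = 900          # occurrences in the spoken subcorpus
spoken_share = 0.10         # spoken texts are roughly 10% of the corpus

expected = expected_in_subcorpus(full_corpus_count, spoken_share)
ratio = observed_to_expected(spoken_count, full_corpus_count, spoken_share)
print(f"expected about {expected:.0f}, observed {spoken_count}, ratio {ratio:.2f}")
```

A ratio well below 1 indicates that the lemma is under-represented in speech relative to the size of the spoken component.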
If you have time, you may also like to compare the collocates of the nominal lemma in the full corpus and in the spoken subcorpus: you will find that combinations such as sharp rise are much less common in speech.
The collocate frequencies provided will be for Lancaster lemmata, which you activated in your last Word Query. Their significance level will in each case be calculated with respect to the corpus you are using.
You can define a subcorpus of your own in three ways. The first uses information in the text headers to identify all the texts in a particular category — information which is provided in the <catRef> element. We shall use this method to define a subcorpus of imaginative written texts — novels, stories, plays and poems.
There are 477 imaginative texts in the corpus, so there are 477 hits for your search. To see what these really look like, you should display them in SGML format: each concordance line will contain a string beginning catRef target="alltim3 allava2...." towards the end of which you will find the value wridom1. These are the BNC text categorization codes. You have thus found all the text headers (and consequently all the texts) with the categorization wridom1, i.e. all the imaginative texts.
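Outside SARA, the same selection could be approximated by scanning the headers for the category code. The sketch below assumes each header is available as a string containing a catRef element whose target attribute is a space-separated list of codes. The text identifiers, the complete target values and the codes other than alltim3, allava2 and wridom1 are invented for illustration.

```python
import re

# Captures the value of the target attribute of a catRef element.
CATREF = re.compile(r'<catRef\s+target="([^"]*)"')


def texts_with_code(headers, code):
    """Return the ids of texts whose catRef target lists the given code."""
    matches = []
    for text_id, header in headers.items():
        found = CATREF.search(header)
        # Split on whitespace so that only whole codes are compared.
        if found and code in found.group(1).split():
            matches.append(text_id)
    return sorted(matches)


# Hypothetical headers: identifiers and full target values are made up.
headers = {
    "T01": '<catRef target="alltim3 allava2 wridom1">',
    "T02": '<catRef target="alltim3 allava2 wridom5">',
    "T03": '<catRef target="alltim2 wridom1 allava2">',
}
imaginative = texts_with_code(headers, "wridom1")
```

Comparing whole codes rather than substrings avoids accidental matches, for example a search for wridom1 picking up a longer code that merely begins with the same characters.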
This is the only active button to the right of the box showing the current corpus. A window will appear in which you can type the name of your new subcorpus, which will consist of all the texts for which you have downloaded solutions.
You should now see all the fictional occurrences of the word fictional. Try looking up other words or phrases which seem typical of imaginative writing, such as frightfully, throb, or lips. Compare their frequencies in this subcorpus with their frequencies in the full corpus.
David Lee, of Lancaster University, has provided his own hierarchical classification of all the BNC texts in the World Edition. This makes it possible to define subcorpora using classes such as w (written), w ac (written academic), w ac medicine (written academic medicine).
To make the subcorpus, you need to identify all the texts which contain this classification within the <classCode> element in the header. This requires a complex query, which specifies (for example) that you are only interested in the word interview or the phrase non ac where it occurs within this element.
The QueryBuilder screen appears. You use this screen to define complex queries, each component of the query being represented as a node on this screen. The lefthand node defines the scope of the query — that is, where the search is to be carried out. As you see, this starts off with the assumption that you will search within a single BNC document (i.e. <bncDoc> element). To the right of this node you define what it is you want to look for, as one or more linked content nodes. The box is red because you must supply something.
Now you can activate your subcorpus from the subcorpora box on the toolbar, and design Quick queries or Word queries concerning it. For instance, is there much bad language in the BNC spoken interviews? What about in non-academic writing? You can compare figures with those for the subcorpus of all the spoken texts, or for the corpus as a whole, bearing in mind the different numbers of texts in each.
Nearly all the 13 solutions appear to have something to do with wine-tasting, bar one exception where it is a proper name. Double-click on this solution to select it, then click on the arrow at the right of the Thin button on the toolbar, and choose Reverse selection. The solution you selected will be deleted from the display.
As you've seen, you can type either a word or a phrase into the Quick query box. But suppose you want to search for a phrase in which some of the words, or the word order, can vary? In this section we'll explore some of the facilities for defining more complex, or less exact, queries.
This is slightly less common in terms of absolute frequency, but not in terms of number of texts. In the general case, we would like to be able to find the two words cheese and wine in the same phrase, in either order. The QueryBuilder is the right tool for this job.
You will see that the content node has now become black, since you have provided it with valid content. You will also see that the node has small branches growing from its sides: these allow you to add other nodes with other contents, and to link the nodes in different ways.
Nodes which are presented top to bottom down the screen are interpreted as alternatives. Your current query will find occurrences of wine or cheese, but only if they are followed by another word. The Next link means that the two nodes concerned must directly follow each other, with no other words intervening.
The QueryBuilder can be used to search for combinations of words within particular contexts by additionally changing the scope node. For instance, you might want to find co-occurrences of oaky and fruity within the same sentence, or within an overall span of five words.
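The logic of a span-limited, order-independent query can be sketched as follows. This illustrates the idea only and is not how SARA evaluates queries. It assumes that an overall span of five words means both words must fit inside a window of five consecutive words; the function name and the sample sentence are invented for the example.

```python
def cooccurrences(tokens, first, second, span=5):
    """Find (i, j) index pairs where both words share a window of `span` words."""
    words = [t.lower() for t in tokens]
    hits = []
    for i, word in enumerate(words):
        if word != first:
            continue
        # Two words fit in one window of `span` words only if they are
        # at most span - 1 positions apart, in either direction.
        lo, hi = max(0, i - span + 1), min(len(words), i + span)
        hits.extend((i, j) for j in range(lo, hi) if words[j] == second)
    return hits


# Hypothetical sentence, not taken from the BNC.
tokens = ("a fruity and slightly oaky red with an oaky nose "
          "but no really fruity finish").split()
within_five = cooccurrences(tokens, "oaky", "fruity", span=5)
within_six = cooccurrences(tokens, "oaky", "fruity", span=6)
```

Here the first oaky is found with the fruity that precedes it, showing that word order does not matter. The second pair is five positions apart, so it is only found once the span is widened to six.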
As a final exercise, see if you can work out how to compare the number of occurrences of evaluative terms such as wrong or correct in all spoken utterances, in utterances spoken by men, and in utterances spoken by women. To choose utterances by a particular type of speaker you should first specify <u> as the scope node, and then choose from the attributes listed below the list of elements. For their sex choose who_sex. If you wish, you can additionally restrict the search by the age of the respondent (who_age) or other criteria.
Advanced students may like to compare sentence-initial occurrences of the word right within speech and within writing. Hint: content nodes in a QueryBuilder query can contain any kind of query, including an SGML query.