3 The File menu

Most of the commands on this menu manipulate queries, as opposed to the results which they return from the corpus: the exceptions are Print and Print preview, both of which relate to the solutions returned by the current query.

The following commands are provided on the File menu:

  • New query: Open a submenu, from which you can select the kind of query you wish to define. A new query window is then opened for you to define that kind of query. See 3.1 for information about the types of query that may be defined;
  • Open: Open a previously defined query;
  • Close: Close the current query and its associated window;
  • Save: Save the current query as a file, using the name specified in the title bar of the query window;
  • Save As: Save the current query as a file, giving an option to change its name from that specified in the title bar of the query window;
  • Print: Print the results of the current query;
  • Print preview: Display on the screen the format in which the current result set will be printed;
  • Recent File: Open a recently accessed query (a list of filenames is displayed in the menu at this point);
  • Exit: Exit from the Client program.

By default, the first query defined during your SARA session is named Query1, the second Query2, and so on. You can give any query a more meaningful name, if you wish, before saving it in an SQY file. The name of a query appears in the title bar of the window containing its results. It can contain only characters which are legal in filenames under MS-DOS, and may not exceed eight characters in length.

Queries are opened or saved using the normal Windows dialogue boxes for file manipulation, which allow you to change drives, specify file names etc. If you do not know how to use these, consult any introductory text on using Microsoft Windows.

3.1 Defining a query

The New query option on the File menu opens a submenu from which you can select which type of query you want to perform. SARA allows you to define seven different kinds of query:

  • word query searches the SARA word index, and then optionally also searches the BNC for a word or words selected from those found (see section 3.2);
  • phrase query searches the BNC for a phrase (see section 3.3);
  • pattern query searches the BNC for words matching a pattern (or regular expression) (see section 3.4);
  • POS query searches the BNC for a word with a specific part of speech (POS) code (or codes) (see section 3.5);
  • SGML query searches the BNC for SGML tags (see section 3.6);
  • Query Builder combines queries of different or the same kinds into a single complex query using a visual interface (see section 3.7);
  • CQL defines a query using SARA's own internal command language, the Corpus Query Language (CQL) (see section 3.8).

More detail about each kind of query is given in the appropriate section below. There is a button on the tool bar for each kind of query: it is generally quicker to press the button than to select it from the menu.

3.2 Defining a Word Query

A word query may be defined in any of the following ways:

  • select Word Query from the submenu of the New query option on the File menu;
  • press the Word Query button on the tool bar;
  • within Query Builder, select Word from the Edit submenu.

Any of the above will cause the Word Query dialogue box to be displayed, containing a window into which you can type a word, or part of a word, to be searched for in the SARA index. If the Pattern checkbox to the right of the window is checked, whatever you type will be interpreted as a pattern. If it is not checked, whatever you type will be interpreted as a word stem. (Strictly speaking, a word stem is also a kind of pattern: the pattern XXX.* is exactly equivalent to the word stem XXX.)

The Lookup button carries out a search of the SARA index. Every form found in the index which starts with the same letters as the word or part of a word you typed in will be displayed in the lower window. If the Pattern checkbox was checked, every word matching the pattern you typed in will be displayed.

For example, typing in colour with the Pattern checkbox unchecked will produce a list of words beginning with the letters `colour' (`colour', `coloured', `colouring', etc.). If the box is checked, only the word `colour' will be produced, since this is the only word which matches that pattern. (Patterns are described below in section 3.4.) Typing in colou?r.* with the Pattern box checked will produce a list of all words beginning with the letters `color' or `colour'. Note that, in this case, if the box is not checked, no words will be returned, since there is no word beginning `colou?r.*' in the BNC.
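The difference between the two settings of the Pattern checkbox can be sketched in a few lines of Python. This is an illustration only: it assumes that Python's re module behaves like SARA's pattern matcher for these simple cases, and the function name lookup and the miniature index are hypothetical, not part of SARA.

```python
import re

def lookup(index, text, pattern_checked):
    """Return the index entries that `text` selects.

    pattern_checked=True  : `text` is used as a pattern matching whole words.
    pattern_checked=False : `text` is a literal word stem, i.e. stem + ".*".
    """
    source = text if pattern_checked else re.escape(text) + ".*"
    matcher = re.compile(source, re.IGNORECASE)
    return [word for word in index if matcher.fullmatch(word)]

# Hypothetical miniature index, not real BNC data.
SAMPLE_INDEX = ["color", "colorful", "colour", "coloured", "colouring", "column"]

stem_hits = lookup(SAMPLE_INDEX, "colour", pattern_checked=False)
exact_hits = lookup(SAMPLE_INDEX, "colour", pattern_checked=True)
both_spellings = lookup(SAMPLE_INDEX, "colou?r.*", pattern_checked=True)
literal_hits = lookup(SAMPLE_INDEX, "colou?r.*", pattern_checked=False)

print(stem_hits)       # forms starting with "colour"
print(exact_hits)      # only "colour" itself
print(both_spellings)  # forms starting with "color" or "colour"
print(literal_hits)    # nothing starts with the literal text "colou?r.*"
```

In the sketch, an unchecked box escapes every special character, which is why the last lookup returns nothing.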

A pattern expression which begins with anything other than a literal will usually involve a search through the whole BNC index, which will take a very long time indeed, and should be avoided. This implies that searches for word-endings are not easily done.

Note that the items treated by SARA as single words may not correspond with orthographic words. In particular, hyphenated words and words followed by some punctuation characters may not always be indexed in the way you expect.

The lower window will not display more than 200 items: a warning message will appear if the word or word part you typed was not specific enough, perhaps because it was too short. If the word you wish to look up is also a very common prefix, check the Pattern box to select only the word, rather than all words beginning with that string of characters.

You can click on one or more of the word forms displayed in the lower window to select them. As is usual with Windows applications, clicking on one or more items with the CTRL key depressed will select each of them; clicking one and then another with the SHIFT key depressed will select both those two and all the other items between them in the list.

When an item is selected in this way it is highlighted on the screen, and a count is displayed below the box indicating the frequency and z-score for the selected word forms within the texts making up the BNC. (Note that words occurring within the text headers are excluded.)

When items are selected, the Query button can be pressed to carry out a search for these word forms within the BNC. Section 3.9 gives further details of the process of downloading the results of a search.

The other buttons on the Word Query dialogue box have the following effects:

  • Copy copies the input string to the Windows clipboard;
  • Clear deletes any previous input and selections from the dialogue box;
  • Cancel leaves the dialogue box without starting a query.

When the Word Query is part of a Query Builder query, the Query button is labelled OK and clicking it simply adds the word query into the query being constructed.

3.3 Defining a Phrase Query

A phrase query may be defined in any of the following ways:

  • select Phrase Query from the submenu of the New query option on the File menu;
  • press the Phrase Query button on the tool bar;
  • within Query Builder, select Phrase from the Edit submenu.

Any of the above will cause the Phrase Query dialogue box to be displayed. This dialogue box contains a window into which you can type a word or phrase, a checkbox labelled Ignore Case, and a checkbox labelled Search Headers.

You can type any sequence of words, or a single word, into the window. Press the OK button (or the Return key) and a search is carried out for the specified phrase within the BNC.

If the Search Headers checkbox is checked, then the search is carried out within the TEI headers as well as the text. Otherwise, only the texts are searched.

If the Ignore Case checkbox is unchecked, the search is case-sensitive. If the box is checked, a search for Sara will recover occurrences of `Sara', `SARA', or `sara'; if it is not, only the first of these will be found.

These two check boxes are the only ways SARA provides for searching in a case-sensitive way, or for searching within the headers, other than by using a CQL query.

A phrase query can contain punctuation characters as well as words. For example, the phrase query , whereas will find occurrences of `whereas' only where they are preceded by a comma. When searching for a match, newlines between components of a phrase query are not significant: for example, it makes no difference whether the comma is at the end of one line and the `whereas' at the start of the next.

A special punctuation character known as the Anyword character _ can be used within a phrase query (but not at the start or end of one). It will match any single item in the index. For example, the phrase query home _ centre will recover phrases such as `home loan centre', `home improvement centre', `home planning centre' etc.

Note that not every item in the index is a conventional orthographic word. As further discussed in section , the index uses L-words which may be parts of conventional orthographic words such as `n't' or orthographic phrases such as `in spite of'.

Each part of a phrase query is searched for separately, and the results are then combined. Consequently, if a phrase query contains any very common words (for example, `to', `the' etc.) it may take a very long time to execute: in such cases it is usually better to replace the very high frequency word with an Anyword character. For example, to find the phrase `die the death', type die _ death and discard the (fairly small) number of false positives such as `die a death', using the Thin command described in section 6.3.
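The following Python sketch shows how a phrase containing the Anyword character can be matched against a sequence of index items, and how false positives might then be discarded. It is not SARA's implementation: the function name find_phrase and the token list are hypothetical, matching is done as if Ignore Case were checked, and the final filtering step merely stands in for the manual Thin command.

```python
def find_phrase(tokens, phrase, anyword="_"):
    """Return the start positions at which `phrase` matches consecutive tokens.

    Each term must equal the token at the same offset (ignoring case),
    except for the Anyword character, which accepts any single token.
    """
    terms = phrase.split()
    hits = []
    for start in range(len(tokens) - len(terms) + 1):
        window = tokens[start:start + len(terms)]
        if all(term == anyword or term.lower() == token.lower()
               for term, token in zip(terms, window)):
            hits.append(start)
    return hits

# Hypothetical token sequence; punctuation counts as an item in its own right.
tokens = "he will die the death or die a death , home loan centre".split()

candidates = find_phrase(tokens, "die _ death")
# Stand-in for thinning: keep only hits whose middle item is `the'.
wanted = [start for start in candidates if tokens[start + 1].lower() == "the"]
centres = find_phrase(tokens, "home _ centre")

print(candidates, wanted, centres)
```

The sketch scans every position, which is affordable for a short list; the point of replacing a very common word by the Anyword character in SARA is precisely to avoid looking up that word's very long list of occurrences.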

There is no limit on the number of words a phrase query may contain, but the total length of the string may not exceed 200 characters.

Click on the OK button to send the query to the server, or click on the Cancel button to cancel it; see further 3.9.

3.4 Defining a Pattern Query

A pattern query may be defined in any of the following ways:

  • select Pattern from the submenu of the New query option on the File menu;
  • press the Pattern Query button on the tool bar;
  • within Query Builder, select Pattern from the Edit submenu.

All of the above will cause the Pattern Query dialogue box to be displayed. This dialogue box contains a window into which you can type a pattern query. The pattern is validated, and a search is carried out for all the words which match it. See further 3.9.

As noted above in section 3.2, a pattern can also be typed as part of a Word Query in order to produce a list of matching words. This is a very useful way of checking the results of a pattern query without actually carrying it out by searching the BNC.

A pattern is a string of characters which is used as a template to match words in the SARA index. The characters making up a pattern can be:

  • literal characters, such as A, B or C, which simply match occurrences of the same character; pattern-matching is never case-sensitive, so a and A are equivalent;
  • special characters are characters which behave in a special way within patterns: these are the dot and the hyphen, the square brackets, [ and ], the parentheses ( and ), the caret ^, the repetition operators ?, * and +, and the disjunction symbol |. If any of these characters is to be used within a pattern but interpreted as if it were a literal, it must be escaped using the backslash character.

The dot . is a special character which matches any single character. For example the pattern f... matches any four letter word beginning with F.

A sequence of characters within square brackets matches any one of them. For example the pattern [aeiou] matches any vowel.

A sequence can contain a hyphen to express a range. For example, the patterns [0-9] and [0123456789] are equivalent: either one will match any digit.

The caret ^ is a special character which can appear at the start of a sequence of characters within square brackets, to indicate that any character not in the sequence should be matched. For example, the pattern [^aeiou] will match any character which is not a vowel (such as a consonant); the pattern [^0-9] will match anything which is not a digit.

Single characters or bracketed sequences can be repeated as often as necessary to make up a complete pattern. For example, the pattern [0-9][0-9][0-9] will match all three-digit numbers; the pattern m[0-9][0-9] will match an M followed by two digits.

The question mark ? is a special character which can follow either a single character or a bracketed sequence of characters, to indicate that the character is optional. For example, the pattern colou?r will match either `colour' or `color'; the pattern [0-9][0-9][0-9]? will match all two- or three-digit numbers, e.g. 99 or 42 or 123 or 912.

The star * is a special character which can follow either a single character or a bracketed sequence of characters, to indicate that the character is optionally repeatable. For example, the pattern hm[hm]* will match words beginning with HM and containing only those two letters, no matter how long they are, for example `hm' or `hmmmm' or `hmmhmhmmmm'; the pattern sorrow.* will match any word beginning with the letters `sorrow', including `sorrow' itself.

The plus + is a special character which can follow either a single character or a bracketed sequence of characters, to indicate that the character is repeated at least once. For example, the pattern sorrow.+ will match any word beginning with the letters `sorrow', except for `sorrow' itself; the pattern m[0-9]+ will match all words composed of the letter M followed by at least one digit, and nothing but digits, e.g. M1, M2345; similarly, the pattern e+k will find `ek', `eek', `eeeeek' etc.

The plus or star character can be used to indicate repetition at any point in a pattern. However, matching of patterns beginning with such sequences (for example .*ing, to recover all words ending with `ing') is likely to be unacceptably slow, since it requires a scan through the entire word index. In general, it is best to make the first component of any pattern a literal. Repetition can however be effectively used in the middle of a pattern: for example effec.*ly will match `effectively' or `effectually'.
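Why a leading literal helps can be sketched as follows. The internal organisation of the SARA index is not described here, so this Python example simply assumes a sorted word list: a literal prefix lets a binary search narrow the candidates before any pattern matching is done, whereas a pattern such as .*ing leaves no choice but to examine every entry. The names literal_prefix, candidates and search, and the sample index, are hypothetical.

```python
import bisect
import re

def literal_prefix(pattern):
    """Return the leading run of plain letters or digits that every match must start with."""
    if "|" in pattern:
        return ""  # alternatives may start differently, so be conservative
    prefix = []
    for i, ch in enumerate(pattern):
        following = pattern[i + 1] if i + 1 < len(pattern) else ""
        # Stop at any special character, or at a character made optional by ? or *.
        if not ch.isalnum() or following in ("?", "*"):
            break
        prefix.append(ch)
    return "".join(prefix).lower()

def candidates(sorted_index, pattern):
    """Return the slice of the sorted index that could contain matches."""
    prefix = literal_prefix(pattern)
    if not prefix:
        return list(sorted_index)  # no usable prefix: scan the whole index
    low = bisect.bisect_left(sorted_index, prefix)
    high = bisect.bisect_left(sorted_index, prefix + "\uffff")
    return sorted_index[low:high]

def search(sorted_index, pattern):
    """Return (matching words, number of index entries examined)."""
    matcher = re.compile(pattern, re.IGNORECASE)
    examined = candidates(sorted_index, pattern)
    return [word for word in examined if matcher.fullmatch(word)], len(examined)

# Hypothetical lower-case index, kept in sorted order.
INDEX = sorted(["effect", "effectively", "effectually", "effort",
                "sing", "singing", "sorrow", "thing"])

print(search(INDEX, "effec.*ly"))  # only entries starting with "effec" are examined
print(search(INDEX, ".*ing"))      # every entry is examined
```

On an index of eight words the saving is trivial; on an index the size of the BNC's it would be the difference between examining a handful of entries and examining all of them.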

Two or more patterns can be combined as alternatives using the disjunction meta-character (a vertical bar). For example, the pattern seek|sought will match either the word `seek' or the word `sought'. Parentheses () can be used to group parts of a pattern together: for example, the same effect could be obtained by the pattern s(eek|ought).

Any character preceded by the backslash (\) will be treated as a literal even if it is a meta-character. For example, the pattern Mr?s?\. will match any of `M.', `Mr.', `Mrs.' or `Ms.'. Without the backslash, the final dot would be interpreted as a meta-character, matching any character at all. A backslash is unnecessary within square brackets: the pattern M[rs.]* would have a similar effect to the above, except that it would also match forms lacking a final dot (plus a number of probably unintended matches, such as `mss.').
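The patterns used as examples in this section can be tried out with Python's re module, which interprets the constructs described here (dot, brackets, ranges, ?, *, +, | and backslash) in the same way. This is an assumption made for the sake of illustration: SARA's own matcher may differ in other respects. Whole-word, case-insensitive matching is imitated with re.fullmatch and re.IGNORECASE, and the helper names are hypothetical.

```python
import re

def matches(pattern, word):
    """True if `pattern` matches the whole of `word`, ignoring case."""
    return re.fullmatch(pattern, word, re.IGNORECASE) is not None

# pattern -> (words it should match, words it should not match)
EXAMPLES = {
    "f...": (["fork", "Four"], ["fo", "forks"]),
    "colou?r": (["colour", "color"], ["colouur"]),
    "[0-9][0-9][0-9]?": (["42", "123"], ["7", "1234"]),
    "hm[hm]*": (["hm", "hmmmm", "hmmhmhmmmm"], ["h", "hmn"]),
    "sorrow.+": (["sorrows", "sorrowful"], ["sorrow"]),
    "m[0-9]+": (["M1", "M2345"], ["M", "M1a"]),
    "effec.*ly": (["effectively", "effectually"], ["effect"]),
    "s(eek|ought)": (["seek", "sought"], ["seeks"]),
    r"Mr?s?\.": (["M.", "Mr.", "Mrs.", "Ms."], ["Mr", "Mss."]),
    "M[rs.]*": (["Mr", "Mrs.", "mss."], ["Mx"]),
}

def check_examples(examples):
    """Return a list of (pattern, word, problem) for every unexpected result."""
    failures = []
    for pattern, (should_match, should_not_match) in examples.items():
        for word in should_match:
            if not matches(pattern, word):
                failures.append((pattern, word, "expected a match"))
        for word in should_not_match:
            if matches(pattern, word):
                failures.append((pattern, word, "expected no match"))
    return failures

print(check_examples(EXAMPLES))  # an empty list means every example behaved as described
```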

3.5 Defining a POS Query

A POS, or part of speech query behaves in the same way as a word query, except that it searches for only a single word, which can be further restricted according to its part of speech (POS) code. It may be defined in any of the following ways:

  • select POS from the submenu of the New query option on the File menu;
  • press the POS Query button on the tool bar;
  • within Query Builder, select POS from the Edit submenu.

All of the above will cause the POS Query dialogue box to be displayed, containing two display windows. When the word to be searched for is typed into the upper window, and the mouse is clicked in the lower window, the lower window is filled with a list of the different parts of speech that the word in question has been assigned within the corpus. The same effect can be obtained by typing in a word and pressing the Tab key.

For example, the word `snore' appears in the corpus as a verb (VV1), as a noun (NN1), and as a portmanteau (NN1-VV1). All three possibilities appear in the lower box.

To search for the nominal senses only, highlight the NN1 in the lower window, and press OK. To search for both nominal and portmanteau cases, hold down the control key while highlighting the NN1 and NN1-VV1 entries, and then press OK.

Note that it is not possible to search for a particular part of speech without specifying the word to which it is attached. This implies that you cannot use SARA to search for such things as sequences of three or more adjectives, nor for occurrences of a specific word preceded by any word with a particular part of speech.

The Help system contains a list of POS codes used in the current version of the corpus: this list also appears in appendix below. A brief explanation of each POS code is also displayed when you select it from the upper box in the POS query dialogue box.

Click on the OK button to send the query to the server, or click on the Cancel button to cancel it; see further 3.9.

3.6 Defining an SGML Query

An SGML query may be defined in any of the following ways:

  • select SGML from the submenu of the New query option on the File menu;
  • press the SGML button on the tool bar;
  • within Query Builder, select SGML from the Edit submenu.

All of the above will cause the SGML dialogue box to be displayed.

As well as information about words and their parts of speech, the BNC index searched by SARA contains details of where the SGML elements of which the corpus is composed begin and end. (SGML, the ISO Standard Generalised Markup Language, is briefly described above at section ; see also chapter 5 of the BNC Users' Reference Guide.)

The start of an SGML element is indicated by a start-tag; its end is indicated by an end-tag. Start-tags may additionally carry named attributes, with particular values, to convey additional information about the element occurrences they delimit.

You can use this information to restrict searches to particular types of text (the categorisation of a text is indicated by attributes of a <catRef> element within its header), or to find particular types of text component, for example newspaper headlines, which are mostly tagged <head type=main> in the BNC, or pauses (<pause>) in spoken texts.

The SGML dialogue box contains a scrollable list of the element names or tags defined for the corpus. For an explanation of the way these elements are used in the corpus, refer to the BNC Users' Reference Guide. If the Show Header Tags checkbox is checked, all tags used in the corpus will appear; if it is not, then tags which are used only in the headers will be excluded. To search the corpus for an SGML start- or end-tag, you select the name of the element concerned from this list by clicking on it. A brief description of the way this element is used is then displayed.

Provided that the Start radio button is selected, a list of any attributes defined for this element will then be displayed in the lower left hand window. You can restrict the search to occurrences of this element having particular values for some combination of these attributes by selecting attribute names from the list, one at a time, and adding them into the query. Alternatively, if you do not select any attribute name from the list, the query will select occurrences of this SGML element whatever attribute values it may have.

When you select an attribute name from the list, clicking on the Add button will open a further dialogue, indicating the range of values possible for that attribute. Click on the desired value (or values) and then press OK to close this dialogue box. Several attribute value constraints may be added in this way. You can also remove a particular constraint by selecting it from the right hand window in the SGML dialogue box, and then clicking on the Remove button, or remove all of them by clicking on the Remove All button.

Click on the OK button to send the query to the server, or click on the Cancel button to cancel it; see further 3.9.

3.7 Defining a query with Query Builder

Query Builder is a special purpose tool which allows you to create complex queries using a visual interface. The Query Builder command can be used in either of the following ways:

  • select Query Builder from the submenu of the New query option on the File menu;
  • press the Query Builder button on the tool bar.

Either of these will cause the Query Builder dialogue box to be displayed. This dialogue box is used to define a Query Builder query as further described in this section.

Parts of a complex query are represented in the Query Builder dialogue box by nodes of various types. A Query Builder query always has at least two nodes: one, the scope node, defines the context within which a complex query is to be evaluated. The other nodes, which may be linked in various ways, are known as content nodes. These define the various things which are to be found within this scope. Any form of query can be used in a content node (except for a CQL or Query Builder query).

For example, you might use the Query Builder to search for the word `fork' followed or preceded by the word `knife' within the scope of a single <s> (sentence) element. In this case, the scope node would indicate a single SGML element occurrence, and there would be two content nodes, one for `knife' and the other for `fork'. Alternatively, you might specify the same search but define its scope as a number of words. The default scope for all Query Builder queries is a <bncDoc> element, i.e. any one of the 4124 distinct text samples making up the BNC.

The scope of a query is represented in Query Builder by the scope node which appears on the left of the dialogue box. To the right of this is a single empty content node. Clicking with the mouse inside a content node opens a submenu, from which you can select either Edit, Clear, or (for nodes other than the first one) Delete. Selecting Edit opens a further submenu, from which you select the type of query you wish to define for that node, or, if you have already defined a query for the node, to edit it. Selecting Clear cancels any previous choice, allowing you to select a new query type for the node. Delete removes the content node, but leaves the rest of the query unchanged.

When a single content node has been filled, further nodes can be added to its right, above it, or below it, simply by clicking the mouse on the branch in that direction. Nodes added to the right of a query node represent alternatives. For example, the Query Builder representation of a query to find either the word `fork' or the word `knife' within the scope of a single <bncDoc> element is shown in the following figure.

[Figure: Either FORK or KNIFE]

Alternation can also be contained within a single content node by using a pattern query, or a word query with alternatives. Figure 2 shows another way of achieving the same effect as the preceding query, using a pattern query.

[Figure 2: Either FORK or KNIFE (another way)]

Nodes added above or below a content node represent additional constraints. The query represented in figure 3, for example, searches for both the word `fork' and the word `knife' within the scope of a single <bncDoc> element.

[Figure 3: FORK preceding KNIFE]

The vertical line linking the two content nodes indicates the order and proximity required. Clicking on the line opens a submenu from which you can select one of the following possibilities:

  • next (represented by a thick line): no words or punctuation can appear between the query term indicated above the line and the term below the line;
  • one-way (represented by a downwards pointing arrow): the query term indicated above the line must precede the term below the line within the scope indicated by the scope node;
  • two-way (represented by a double-headed arrow): the query terms above and below the line may appear in any order within the scope indicated by the scope node.

In the current version of the SARA client, you should use the same kind of link (next, one-way, or two-way) between all the content nodes of a single query. Results where the link-types differ are not defined.

To change the scope of a complex query, click on the scope node. A submenu opens, from which you can choose either SGML or Span. Choosing SGML opens the SGML dialogue box, from which you can select an SGML element, possibly modified by attribute values, as in an SGML query (see further section 3.6). Choosing Span opens a dialogue box in which you can enter the number of words within which the rest of the query must be satisfied.

The example shown in figure 4 will find the words `fork' and `knife' in either order, provided they appear within five words of each other.

[Figure 4: FORK followed or preceded by KNIFE within 5 words]

When nodes are added both to the right of and above or below a content node, they must all be satisfied. For example, the query shown in figure 5 will find occurrences of `fork' or `spoon', but only where they are followed by `knife' within a span of five words.

[Figure 5: FORK or SPOON followed or preceded by KNIFE within 5 words]

A content node can contain any kind of query (other than a CQL or a Query Builder query): one or more alternatives chosen from the word query dialogue box; a phrase query; a pattern query; a POS query; or an SGML query. The Anyword character can also be entered as a content node in its own right.

Once you have completed defining the query, press the OK button to carry out a search, or press Cancel to cancel it. See further 3.9.

3.8 Defining a CQL Query

CQL (pronounced ``sequel'') is short for the corpus query language. It is the command language which a SARA client program uses to communicate with the SARA server. Usually expressions in CQL are generated for you by the client program, but there is no reason why you should not type them in directly as well. There are also a few features of the command language which cannot be easily (or at all) expressed by the current client except in this way.

A CQL query may be defined in either of the following ways:

  • select CQL from the submenu of the New query option on the File menu;
  • press the CQL button on the tool bar.

Either of the above causes the CQL query dialogue box to be displayed. This dialogue box contains a window into which you can type a CQL query. The query is then validated, and a search is carried out (see further 3.9).

The syntax of CQL is defined briefly here. The CQL form of any query can always be viewed by switching on the Query Text option on the Query menu (see 6.4).

A CQL query is made up of one or more atomic queries. An atomic query may be one of the following:

  • a word, punctuation mark, or delimited string e.g. jam, ?, "Mrs.";
  • a word-and-POS pair, e.g. "CAN"=NN1;
  • a phrase, e.g. "not on your life";
  • a pattern;
  • an SGML query, that is, a search for a start- or end-tag. Attribute values may also be searched for;
  • the wildcard character _, which will match any single word.

The following unary operators are allowed in CQL:

  • case: The $ operator makes the query which is its operand case-sensitive.
  • header: The @ operator makes the query which is its operand search within headers as well as in the bodies of texts.
  • not: The ! operator matches anything which is not a solution to the query which is its operand; it makes no sense unless the query is combined with another.

A CQL expression containing more than one atomic query may use the following binary operators:

  • sequence: one or more blanks between two queries matches cases where solutions to the first immediately precede solutions to the second.
  • disjunction: The | operator between two queries matches cases where either query is satisfied.
  • join: The * operator between two queries matches cases where both queries are satisfied in the order specified; the # operator between two queries matches cases where both queries are satisfied in either order.

When queries are joined, the scope of the expression may be defined in one of the following ways:

  • SGML element: A join query followed by a / operator and an SGML query matches cases where the joined query is satisfied within the scope of the SGML element.
  • number: A join query followed by a / operator and a number matches cases where the joined query is satisfied within the number of L-words specified.

Some simple examples follow:

  • cat _ dog finds three word phrases of which the first word is `cat' and the last is `dog'
  • !cat dog finds occurrences of `dog' not preceded by `cat' within the same document
  • cat*dog finds occurrences of `cat' followed anywhere within the same document by `dog'
  • cat#dog finds occurrences of `cat' followed or preceded by `dog' anywhere within the same document
  • cat*dog/10 finds occurrences of `cat' followed by `dog' within ten words
  • cat*dog/<head> finds occurrences of `cat' followed by `dog' within a single <head> element
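The behaviour of the join operators in these examples can be imitated on a single tokenised document with a few lines of Python. This is a sketch under stated assumptions rather than a description of the server: the function names positions and join are hypothetical, the whole token list stands for one document, and a span of n is assumed to mean that both words fall inside a window of n consecutive items.

```python
def positions(tokens, word):
    """Return every position at which `word` occurs, ignoring case."""
    return [i for i, token in enumerate(tokens) if token.lower() == word.lower()]

def join(tokens, first, second, span=None, ordered=True):
    """Return (i, j) position pairs of `first` and `second` satisfying a join.

    ordered=True  imitates first*second (first must precede second);
    ordered=False imitates first#second (either order);
    span=n        imitates a trailing /n (assumed: both inside n consecutive items).
    """
    pairs = []
    for i in positions(tokens, first):
        for j in positions(tokens, second):
            if i == j or (ordered and j < i):
                continue
            if span is not None and abs(j - i) >= span:
                continue
            pairs.append((i, j))
    return pairs

# Hypothetical document of twelve items.
tokens = "the cat sat near a dog while another dog chased the cat".split()

print(join(tokens, "cat", "dog"))                         # like cat*dog
print(join(tokens, "cat", "dog", ordered=False))          # like cat#dog
print(join(tokens, "cat", "dog", span=5))                 # like cat*dog/5
print(join(tokens, "cat", "dog", span=5, ordered=False))  # like cat#dog/5
```

Scoping by an SGML element instead of a number would amount to requiring both positions to lie between the same start-tag and end-tag, which the sketch does not attempt.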

3.9 Execution of SARA Queries

Whichever type of SARA query you define, the process of executing it is the same, and proceeds as follows:

  • press the OK button to send the query to the server;
  • the red Busy light on the status bar at the bottom of the main window will be lit, indicating that the server is processing your query;
  • the server returns a count of the number of hits found to the client;
  • if this number is less than the Maximum Downloads set in the User Preferences dialogue box (see 7.5), results will start to appear in a new window, named Queryn (where n is the number of this query in the session);
  • if the number of hits is greater than the Maximum Downloads figure, the Too Many Solutions dialogue box will appear.

The Too Many Solutions dialogue box allows you to reset the download limit temporarily, and also to specify which of the available solutions should be displayed. The number of solutions to be downloaded can be re-set manually, by typing a new number into the box at the bottom, or automatically, by clicking on either or both of the Download all and One per text buttons.

In either case, when solutions are downloaded, they appear in order, starting from the beginning of the corpus. If the Random checkbox is selected, solutions are chosen at random until the specified number has been reached; if it is not, then either all solutions are chosen, or (if One per text is chosen) the first in each text, until the limit has been reached.
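The selection rules just described might be sketched in Python as follows. The function choose_solutions, its parameters and the sample hits are all hypothetical; in particular, how the Random and One per text options interact is not stated above, so the sketch assumes that the random choice is made among the first hits of each text when both are selected.

```python
import random

def choose_solutions(hits, limit, one_per_text=False, randomise=False, seed=None):
    """Pick at most `limit` hits from `hits`, a list of (text_id, position)
    pairs given in corpus order, and return them in corpus order."""
    pool = list(hits)
    if one_per_text:
        seen = set()
        first_hits = []
        for text_id, position in pool:
            if text_id not in seen:  # keep only the first hit in each text
                seen.add(text_id)
                first_hits.append((text_id, position))
        pool = first_hits
    if randomise and len(pool) > limit:
        chosen = random.Random(seed).sample(pool, limit)
        order = {hit: rank for rank, hit in enumerate(pool)}
        return sorted(chosen, key=order.__getitem__)  # restore corpus order
    return pool[:limit]

# Hypothetical hits: text identifier and sentence number.
hits = [("A01", 3), ("A01", 9), ("A02", 1), ("B07", 4), ("B07", 8), ("C11", 2)]

print(choose_solutions(hits, limit=3))
print(choose_solutions(hits, limit=10, one_per_text=True))
print(choose_solutions(hits, limit=3, randomise=True, seed=1))
```

The seed parameter exists only to make the sketch repeatable; nothing above suggests that SARA exposes such a setting.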

When downloading is complete, the red Busy light will go out.

You can scroll, sort, thin, save, see the sources of, or otherwise manipulate the solution set using options from the Query menu, as described in section 6.

You can interrupt execution of a query at any time before downloading of solutions begins by pressing the Esc key. This will abort processing of the query as soon as possible.

3.10 Printing results of a query

You can print the results of a query in three different ways:

  • using the Print command on the File menu (or the Print button on the toolbar);
  • using the Copy command on the Edit menu (or the Copy button on the toolbar), you can save a single result on the Windows clipboard, and then import it to a word processor for later printing;
  • using the Listing command on the Query menu, you can save the whole of a set of results to a file in SGML format, and then import it to a word processor for later printing.

Only the first of these is discussed in this section; for the other two, refer to sections 4 and 6.6 respectively.

Choosing the Print command will open a standard Windows Print dialogue box. You can select whether printing should be done in landscape or portrait mode by clicking on the appropriate button. You can also choose the printer to be used, and configure the printer in the normal Windows manner.

The current version of SARA does not allow you to change the page layout of the report printed: it contains a running title, derived from the query, and page numbering. References for each hit are printed down the left margin, indicating the text and the sentence number from which it comes. As much of each hit as will fit on a single line is included.

You can use the Print Preview command to see a rough indication on the screen of how the results will look when printed.

For more flexible formatting of the results of a search, you should use the Listing command on the Query menu to save the results in SGML format, as described in section 6.6 below. This file can then be formatted in any way appropriate using the word processor of your choice.