corp1   corp2   dis            Comment
equal   equal   equal          same language variety/ies.
equal   equal   much higher    different language varieties.
high    low     high           corp2 is homogeneous and falls within the range of "general" corp1.
high    low     higher         corp2 is homogeneous and falls outside the range of "general" corp1.
high    high    low            impossible
high    high    a bit higher   overlapping; share some varieties
low     low     a bit higher   similar varieties

Table 1: Interactions between homogeneity and similarity.

            Totals     the      of      and     a
Corpus 1    1234567    80123    36356   25143   19976
Corpus 2    1876543    121045   56101   37731   29164

Table 2: Word frequencies in two corpora.

                 the        of        and       a         Remainders
o1               80123      36356     25143     19976     1072969
o2               121045     56101     37731     29164     1632502
e1               79828.5    36689.3   24950.0   19500.0   1073599.2
e2               121339.5   55767.7   37924.0   29640.0   1631871.8
(o1-e1)²/e1      1.09       3.03      1.49      11.62     0.37
(o2-e2)²/e2      0.71       1.99      0.98      7.64      0.24

Table 3: The χ² statistic for two corpora.
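The arithmetic behind Table 3 can be retraced with a short sketch. It assumes the usual expected-value formula for a two-sample comparison, e = N_corpus × (o1 + o2) / (N1 + N2), which reproduces the tabulated e1 and e2 values from the Table 2 counts; the function and variable names are illustrative only and do not come from the paper.

```python
def chi_square_terms(obs1, obs2):
    """Return the per-category (o - e)^2 / e terms for each of two corpora.

    obs1 and obs2 are parallel lists of observed counts, including a final
    "remainder" category so that each list sums to its corpus total.
    """
    n1, n2 = sum(obs1), sum(obs2)
    terms1, terms2 = [], []
    for o1, o2 in zip(obs1, obs2):
        # Expected counts if both corpora were samples of the same population.
        e1 = n1 * (o1 + o2) / (n1 + n2)
        e2 = n2 * (o1 + o2) / (n1 + n2)
        terms1.append((o1 - e1) ** 2 / e1)
        terms2.append((o2 - e2) ** 2 / e2)
    return terms1, terms2


# Counts for "the", "of", "and", "a" and the corpus totals, from Table 2.
total1, total2 = 1234567, 1876543
words1 = [80123, 36356, 25143, 19976]
words2 = [121045, 56101, 37731, 29164]

# The remainder category holds every token that is not one of the four words.
obs1 = words1 + [total1 - sum(words1)]
obs2 = words2 + [total2 - sum(words2)]

terms1, terms2 = chi_square_terms(obs1, obs2)
chi2 = sum(terms1) + sum(terms2)

print([round(t, 2) for t in terms1])
print([round(t, 2) for t in terms2])
print(round(chi2, 2))
```

Summing all ten terms gives the χ² value for the two corpora over these five categories, roughly 29.2 for the figures above.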

Class                     First item in class      Mean error term
(Words in freq. order)    Word          POS        for items in class
First 10 items            the           DET        18.76
Next 10 items             for           PREP       17.45
Next 20 items             not           NOT        14.39
Next 40 items             have          V-BASE     10.71
Next 80 items             also          ADV        7.03
Next 160 items            know          V-INF      6.40
Next 320 items            six           CARD       5.30
Next 640 items            finally       ADV        6.71
Next 1280 items           plants        N-PL       6.05
Next 2560 items           pocket        N-SING     5.82
Next 5120 items           represent     V-BASE     4.53
Next 10240 items          peking        PROPER     3.07
Next 20480 items          fondly        ADV        1.87
Next 40960 items          chandelier    N-SING     1.15

Table 4: Variation of (O-E)²/E term with word frequency for same-variety corpora, for high-frequency and low-frequency word-POS pairs. Part-of-speech codes are from the CLAWS tagset as used in the BNC (modified/lengthened for easier reading).
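The class scheme in Table 4 (a first class of 10 items, then classes of 10, 20, 40, ... items, doubling each time) can be sketched as follows. This is a minimal illustration of the binning only: the function names are hypothetical, and the input is assumed to be a list of per-item (O-E)²/E terms already sorted by descending word frequency.

```python
def class_sizes(n_classes):
    """Sizes of the frequency classes: 10, then 10, 20, 40, ... (doubling)."""
    sizes = [10]
    size = 10
    for _ in range(n_classes - 1):
        sizes.append(size)
        size *= 2
    return sizes


def mean_error_by_class(terms, sizes):
    """Average the error terms within each consecutive frequency class.

    terms must be ordered from the most frequent item to the least frequent.
    Classes that would start beyond the end of the list are skipped.
    """
    means, start = [], 0
    for size in sizes:
        chunk = terms[start:start + size]
        if not chunk:
            break
        means.append(sum(chunk) / len(chunk))
        start += size
    return means


# Toy input (not data from the paper): 25 items in three classes.
toy_terms = [1.0] * 10 + [3.0] * 10 + [5.0] * 5
print(class_sizes(14))
print(mean_error_by_class(toy_terms, class_sizes(3)))
```

With 14 classes the scheme covers the 81,920 most frequent items, the last class alone holding 40,960 of them, which matches the final row of the table.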

       Lo           Mid                Hi                 Im                  Sp
Lo     5.1 (0.3)    5.4 (0.3; 6085)    9.3 (0.5; 5450)    26.3 (1.7; 4460)    35.4 (2.0; 4126)
Mid                 4.5 (0.2)          6.7 (0.4; 5729)    28.7 (2.2; 4407)    36.3 (0.8; 4144)
Hi                                     6.0 (0.2)          42.0 (1.6; 3290)    47.6 (1.7; 3820)
Im                                                        4.6 (0.2)           35.4 (1.2; 3456)
Sp                                                                            4.8 (0.3)

Table 5: Corpus homogeneity and corpus similarity results. Unbracketed: mean homogeneity/similarity figures. Bracketed: standard deviation and number of degrees of freedom. The number of degrees of freedom for all homogeneity figures (on the leading diagonal) is 5,000.