| corp1 | corp2 | dis | Comment |
|---|---|---|---|
| equal | equal | equal | same language variety/ies. |
| equal | equal | much higher | different language varieties. |
| high | low | high | corp2 is homogeneous and falls within the range of "general" corp1. |
| high | low | higher | corp2 is homogeneous and falls outside the range of "general" corp1. |
| high | high | low | impossible |
| high | high | a bit higher | overlapping; share some varieties |
| low | low | a bit higher | similar varieties |

Table 1: Interactions between homogeneity and similarity.
| | Totals | the | of | and | a |
|---|---|---|---|---|---|
| Corpus 1 | 1234567 | 80123 | 36356 | 25143 | 19976 |
| Corpus 2 | 1876543 | 121045 | 56101 | 37731 | 29164 |

Table 2: Word frequencies in two corpora.
| | the | of | and | a | Remainders |
|---|---|---|---|---|---|
| o1 | 80123 | 36356 | 25143 | 19976 | 1072969 |
| o2 | 121045 | 56101 | 37731 | 29164 | 1632502 |
| e1 | 79828.5 | 36689.3 | 24950.0 | 19500.0 | 1073599.2 |
| e2 | 121339.5 | 55767.7 | 37924.0 | 29640.0 | 1631871.8 |
| (o1-e1)²/e1 | 1.09 | 3.03 | 1.49 | 11.62 | 0.37 |
| (o2-e2)²/e2 | 0.71 | 1.99 | 0.98 | 7.64 | 0.24 |

Table 3: The χ² statistic for two corpora.
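The arithmetic behind the expected values and error terms can be sketched in a few lines of Python. This is a minimal illustration, not the author's own code: it assumes the standard two-sample expected-value calculation, in which each word's combined count is split between the corpora in proportion to corpus size, and it takes its counts from Table 2. The names `chi_square_rows`, `observed1` and `observed2` are hypothetical.

```python
# Observed counts for "the", "of", "and", "a", taken from Table 2.
words = ["the", "of", "and", "a", "Remainders"]
total1, total2 = 1234567, 1876543
listed1 = [80123, 36356, 25143, 19976]
listed2 = [121045, 56101, 37731, 29164]

# The "Remainders" column is the corpus total minus the listed words.
observed1 = listed1 + [total1 - sum(listed1)]
observed2 = listed2 + [total2 - sum(listed2)]


def chi_square_rows(o1, o2):
    """Return (e1, e2, term1, term2) for each column of the table.

    Assumption: expected values share each word's combined count
    between the two corpora in proportion to corpus size.
    """
    n1, n2 = sum(o1), sum(o2)
    rows = []
    for a, b in zip(o1, o2):
        e1 = (a + b) * n1 / (n1 + n2)
        e2 = (a + b) * n2 / (n1 + n2)
        rows.append((e1, e2, (a - e1) ** 2 / e1, (b - e2) ** 2 / e2))
    return rows


rows = chi_square_rows(observed1, observed2)
# The statistic is the sum of the error terms over both corpora.
chi_square = sum(t1 + t2 for _, _, t1, t2 in rows)

for word, (e1, e2, t1, t2) in zip(words, rows):
    print(f"{word:<10} e1={e1:.1f} e2={e2:.1f} term1={t1:.2f} term2={t2:.2f}")
print(f"chi-square = {chi_square:.2f}")
```

Summing the two rows of error terms gives the χ² statistic for the pair of corpora. Deriving the remainders from the corpus totals, rather than typing them in, keeps each column consistent with Table 2.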
| Class (words in frequency order) | Word* | POS | Mean error term for items in class |
|---|---|---|---|
| First 10 items | the | DET | 18.76 |
| Next 10 items | for | PREP | 17.45 |
| Next 20 items | not | NOT | 14.39 |
| Next 40 items | have | V-BASE | 10.71 |
| Next 80 items | also | ADV | 7.03 |
| Next 160 items | know | V-INF | 6.40 |
| Next 320 items | six | CARD | 5.30 |
| Next 640 items | finally | ADV | 6.71 |
| Next 1280 items | plants | N-PL | 6.05 |
| Next 2560 items | pocket | N-SING | 5.82 |
| Next 5120 items | represent | V-BASE | 4.53 |
| Next 10240 items | peking | PROPER | 3.07 |
| Next 20480 items | fondly | ADV | 1.87 |
| Next 40960 items | chandelier | N-SING | 1.15 |

*First item in class.

Table 4: Variation of the (O-E)²/E term with word frequency for same-variety corpora, for high-frequency and low-frequency word-POS pairs. Part-of-speech codes are from the CLAWS tagset as used in the BNC (modified/lengthened for easier reading).
| | Lo | Mid | Hi | Im | Sp |
|---|---|---|---|---|---|
| Lo | 5.1 (0.3) | 5.4 (0.3; 6085) | 9.3 (0.5; 5450) | 26.3 (1.7; 4460) | 35.4 (2.0; 4126) |
| Mid | | 4.5 (0.2) | 6.7 (0.4; 5729) | 28.7 (2.2; 4407) | 36.3 (0.8; 4144) |
| Hi | | | 6.0 (0.2) | 42.0 (1.6; 3290) | 47.6 (1.7; 3820) |
| Im | | | | 4.6 (0.2) | 35.4 (1.2; 3456) |
| Sp | | | | | 4.8 (0.3) |

Table 5: Corpus homogeneity and corpus similarity results. Unbracketed: mean homogeneity/similarity figures. Bracketed: standard deviation and number of degrees of freedom. The number of degrees of freedom for all homogeneity figures (on the leading diagonal) is 5,000.