In this tutorial we will be looking at basic data analysis and data evaluation methods.
We will use LibreOffice Calc to exemplify the methods. You may also use Microsoft Excel; however, there might be slight differences.
In later tutorials we will use the statistical programming language R for these matters, as it allows for more efficient processing, especially with multivariate data sets.
Frequency distribution is used to see occurrence differences for a particular linguistic phenomenon in different language varieties, registers, time periods, etc. It is the basic statistical analysis in corpus linguistics and still by far the most popular one. A frequency distribution gives you a first insight into the distribution of a particular phenomenon. You can display frequency distributions in a matrix or as a diagram (bar chart, line chart, …).
[pos="VV.*"]
Download query ad plain text tabulation
pos, text_reg
sort output, display as matrix
data/distr_vfull_pos-reg_brown_matrix.txt
Einfügen -> Diagramm
or click on the diagram symbol in the tool bar to open the diagram assistanttotal
, once with pos
on the x-axis, once and with register
on the x-axis (4 charts)total
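In the later R tutorials, such a chart can be produced with a few lines of code. A minimal sketch, assuming the downloaded file is a tab-separated matrix with the pos tags in the first column and one column per register (the exact layout may differ):

# read the tabulated matrix (layout assumed: pos tags as row names,
# registers as columns, tab-separated, with a header line)
distr <- read.table("data/distr_vfull_pos-reg_brown_matrix.txt",
                    header = TRUE, sep = "\t", row.names = 1)

# grouped bar chart with register on the x-axis;
# use t(as.matrix(distr)) instead to put pos on the x-axis
barplot(as.matrix(distr), beside = TRUE,
        legend.text = rownames(distr),
        xlab = "register", ylab = "raw frequency")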
In order to be able to compare frequency distributions across different corpora/subcorpora you usually need to normalize the frequency counts. This is due to the fact that the corpora you compare are usually of different sizes. It is, e.g., not surprising if you find more verbs in a 20 million word corpus than in a 1 million word corpus. In order to really compare the numbers, you have to bring them to the same basis, i.e., you calculate a frequency per a fixed basis, e.g. the frequency per million tokens (or per 100 or 1,000 tokens).
The mathematical equation for normalization is as follows:
\(\frac{raw~frequency}{N} \times 1,000,000 = frequency~per~million\)
where N is the number of tokens in the corpus.
Let's have a look at a real example: the matrix we extracted for the distribution of verbal pos across registers in the BROWN corpus and the corresponding subcorpus sizes.
We can observe that the size of the subcorpora differs, so the raw frequencies are not comparable and we have to normalize our data. In order to do this, we have to divide each figure in the matrix by the corresponding subcorpus size and multiply it with a normalization basis (e.g. one million). In the case of VVD in register A (press reportage) the calculation would be as follows:
\(\frac{2631}{102512}\times 1,000,000~= 25665.29\)
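The same calculation can be checked quickly in R, which we will use in later tutorials; the two figures are taken from the example above:

# frequency per million for VVD in register A (press reportage)
raw_freq <- 2631          # raw frequency of VVD in subcorpus A
subcorpus_size <- 102512  # size of subcorpus A in tokens
fpm <- raw_freq / subcorpus_size * 1e6
round(fpm, 2)             # 25665.29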
1. Open data/distr_vfull_pos-reg_brown_matrix.txt in LibreOffice/Excel and name the sheet rawfreq.
2. Add a new sheet fpm (frequency per million) and copy-paste the rawfreq table into this sheet.
3. The subcorpus sizes can be found in the data directory of the course directory: data/brown_csizes_full.txt
4. Use Sheet -> Insert Sheet from File (Tabelle -> Tabelle aus Datei einfügen) to open the BROWN corpus size file in a separate sheet corpsize.
5. In the fpm table, start with the first cell (VVD - A): type = to indicate a formula, then build the normalization formula by clicking the corresponding cells in the rawfreq and corpsize table (click and ENTER). Use absolute references with $ for the corpus size cell (e.g. corpsize.$C$2) so that the formula can be copied to the other cells. (A sketch of the same normalization in R follows below.)
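A minimal sketch of this matrix-wise normalization in R; apart from the figure for VVD in register A and the size of subcorpus A, all values below are hypothetical and only serve to illustrate the technique:

# hypothetical raw frequency matrix (pos x register) and subcorpus sizes;
# the sizes vector must be in the same order as the columns
rawfreq <- matrix(c(2631, 1421,
                    3507, 2986),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("VVD", "VVG"), c("A", "B")))
sizes <- c(A = 102512, B = 89000)

# divide each column by the corresponding subcorpus size, scale to per million
fpm <- sweep(rawfreq, 2, sizes, "/") * 1e6
round(fpm, 2)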
The corpus size is not the only size that can matter. The ratio of the different pos of our verbs, e.g., can also depend on the total number of verb forms occurring in the different registers. Thus, it can make sense to use the number of verb forms as N in our formula.
Add a new sheet fpm_VN and copy-paste the data table into this sheet. Calculate the frequencies normalized by the total number of verb forms and compare them with the fpm based on the subcorpus size.

Diagrams visualize frequency distributions and can point to differences between corpora. However, we cannot be sure whether these differences are really significant or simply a matter of chance.
For significance testing we have to use a statistical test such as the \(\chi^2\)-test (chi-square test). The formula for the test is as follows:
\(\chi^2 = \sum_{ij} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}\)
where \(O_{ij}\) is the observed frequency, i.e. the raw frequency that we observe in our corpus, and \(E_{ij}\) refers to the expected frequency, i.e., the frequency we would expect in the corpus assuming an equal distribution (no significant difference).
To calculate the chi-square test, we use a contingency table. The basis of a contingency table is a two-dimensional table, with one feature represented in rows and the other in columns, and the figures for each feature pair in the corresponding cells. Additionally, we have to calculate the so-called marginal frequencies (row totals and column totals), as well as the total sum of instances N.
The next step is to estimate the expected frequencies for each cell. Unlike the observed frequencies, we cannot extract them from our corpus; we have to calculate them based on the observed frequencies.
The formula for the expected frequency is:
\(E = f_1 \times \frac{f_2}{N} = \frac{f_1 f_2}{N}\)
where \(f_1\) and \(f_2\) refer to the so-called marginal frequencies, \(f_1\) being the corresponding row total and \(f_2\) the corresponding column total.
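As a preview of the R tutorials, the expected frequencies can be computed directly from the marginal frequencies of an observed table. A minimal sketch with a hypothetical 2x2 table (the counts below are purely illustrative, not the actual corpus figures):

# hypothetical observed frequencies: that omitted vs. that present
# after think/say vs. after other verbs (illustrative counts only)
observed <- matrix(c(100, 50,
                      30, 70),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("think/say", "other verbs"),
                                   c("that omitted", "that present")))

N <- sum(observed)   # total number of instances
# E = (row total * column total) / N for every cell
expected <- outer(rowSums(observed), colSums(observed)) / N
round(expected)      # rounded to integers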
Let us illustrate this with an example. Suppose we have extracted the following data for omission of that with think or say vs. other verbs.
1. The data can be found in the data directory of the course folder.
2. Calculate the marginal frequencies and N in the first table (observed frequencies).
3. Calculate the expected frequencies in a second table; you can use RUNDEN() (ROUND() in the English version) to round to integers.

If you paid attention, you might wonder about something:
Significance is expressed in the so-called p-value or probability value. The p-value is the probability of obtaining the observed distribution (or an even more extreme one) if the null hypothesis is true, i.e., if there is no real difference. The smaller the p-value, the more confident we can be in rejecting the null hypothesis, i.e., in calling the difference significant. By convention, the difference is considered significant if the p-value is smaller than 0.05.
For calculating the p-value in our example, we first have to apply the chi-square formula to the two tables, i.e., sum up \((O_{ij} - E_{ij})^2 / E_{ij}\) over all cells. This will give us the chi-square value for the comparison.
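In R, the chi-square value then follows from the two tables in one line. Continuing the hypothetical example from above (the counts are still purely illustrative):

# hypothetical observed table and the corresponding expected frequencies
observed <- matrix(c(100, 50,
                      30, 70), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# chi-square value: sum over all cells of (O - E)^2 / E
chisq_value <- sum((observed - expected)^2 / expected)
chisq_value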
The corresponding p-value can be looked up in a table. See here for an example.
You might have noticed the df (degrees of freedom) in the table. It is a parameter which determines the distribution of the chi-square values and thus influences the p-value. See e.g. here for more details. The degrees of freedom are calculated as the number of rows minus 1 multiplied by the number of columns minus 1:
df = ( #Rows - 1 ) * ( #Columns -1)
In our case:
df = (2-1) * (2-1) = 1
Now we can look up our chi-square value in the table to determine the p-value.
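Instead of a printed table, the relation between chi-square value, df and p-value can also be checked in R:

# the p-value is the upper-tail probability of the chi-square distribution;
# for df = 1 the critical value at the 5% level is about 3.84
pchisq(3.84, df = 1, lower.tail = FALSE)   # approximately 0.05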
We can also calculate the exact p-value in LibreOffice/Excel:
CHISQ.TEST(Observed;Expected) in English or CHIQU.TEST(Observed;Expected) in German.

What about our data? Is the difference in the usage of, e.g., VVD significant between the registers C (press reviews) and D (general prose religion), and between the registers D (general prose religion) and M (fiction science fiction)?
TIP3: think about what the four observed frequencies might be!
Does the result conform to your hypothesis?
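For later reference: once we switch to R, the whole procedure (expected frequencies, degrees of freedom, chi-square value and p-value) is available in a single function. A minimal sketch, again with purely illustrative counts rather than the actual BROWN figures:

# chisq.test() computes expected frequencies, df, chi-square value and p-value;
# correct = FALSE switches off Yates' continuity correction so that the
# result matches the hand calculation described above
observed <- matrix(c(100, 50,
                      30, 70), nrow = 2, byrow = TRUE)
result <- chisq.test(observed, correct = FALSE)
result$expected    # expected frequencies
result$statistic   # chi-square value
result$parameter   # degrees of freedom
result$p.value     # p-value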