In this tutorial we will be looking at basic data analysis and data evaluation methods.
We will use LibreOffice Calc to exemplify the methods. You may also use Microsoft Excel; however, there might be slight differences.
In later tutorials we will use the statistical programming language R for these matters, as it allows for more efficient processing, especially with multivariate data sets.
Frequency distribution is used to see occurrence differences for a particular linguistic phenomenon in different language varieties, registers, time periods, etc. It is the basic statistical analysis in corpus linguistics and still by far the most popular one. A frequency distribution gives you a first insight into the distribution of a particular phenomenon. You can display frequency distributions in a matrix or as a diagram (bar chart, line chart, …).
[pos="VV.*"]
Download query ad plain text tabulation
pos, text_reg
sort output, display as matrix
data/distr_vfull_pos-reg_brown_matrix.txt
Einfügen -> Diagramm
or click on the diagram symbol in the tool bar to open the diagram assistanttotal
, once with pos
on the x-axis, once and with register
on the x-axis (4 charts)total
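In the later R tutorials, such a chart can be produced with a few lines of code. A minimal sketch, assuming the downloaded file is a tab-separated matrix with the pos tags in the first column and one column per register (the exact layout may differ):

# read the tabulated matrix (layout assumed: pos tags as row names,
# registers as columns, tab-separated, with a header line)
distr <- read.table("data/distr_vfull_pos-reg_brown_matrix.txt",
                    header = TRUE, sep = "\t", row.names = 1)

# grouped bar chart with register on the x-axis;
# use t(as.matrix(distr)) instead to put pos on the x-axis
barplot(as.matrix(distr), beside = TRUE,
        legend.text = rownames(distr),
        xlab = "register", ylab = "raw frequency")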
In order to be able to compare frequency distributions across different corpora/subcorpora you usually need to normalize the frequency counts. This is due to the fact that the corpora you compare are usually of different sizes. It is, e.g., not surprising if you find more verbs in a 20 million word corpus than in a 1 million word corpus. In order to really compare the numbers, you have to bring them to the same basis, i.e., you calculate a frequency per a fixed basis, e.g. the frequency per million tokens (or per 100 or 1,000 tokens).
The mathematical equation for normalization is as follows:
\(\frac{raw~frequency}{N} \times 1,000,000 = frequency~per~million\)
where N is the number of tokens in the corpus.
Let's have a look at a real example: the matrix we extracted for the distribution of verbal pos across registers in the BROWN corpus and the corresponding subcorpus sizes.
We can observe that the size of the subcorpora differs, so the raw frequencies are not comparable and we have to normalize our data. In order to do this, we have to divide each figure in the matrix by the corresponding subcorpus size and multiply it with a normalization basis (e.g. one million). In the case of VVD in register A (press reportage) the calculation would be as follows:
\(\frac{2631}{102512}\times 1,000,000~= 25665.29\)
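The same calculation can be checked quickly in R, which we will use in later tutorials; the two figures are taken from the example above:

# frequency per million for VVD in register A (press reportage)
raw_freq <- 2631          # raw frequency of VVD in subcorpus A
subcorpus_size <- 102512  # size of subcorpus A in tokens
fpm <- raw_freq / subcorpus_size * 1e6
round(fpm, 2)             # 25665.29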
1. Open data/distr_vfull_pos-reg_brown_matrix.txt in LibreOffice/Excel and name the sheet rawfreq.
2. Add a new sheet fpm (frequency per million) and copy-paste the rawfreq table into this sheet.
3. The subcorpus sizes can be found in the data directory of the course directory: data/brown_csizes_full.txt
4. Use Sheet -> Insert Sheet from File (Tabelle -> Tabelle aus Datei einfügen) to open the BROWN corpus size file in a separate sheet corpsize.
5. In the fpm table, start with the first cell (VVD - A): type = to indicate a formula, then build the normalization formula by clicking the corresponding cells in the rawfreq and corpsize table (click and ENTER). Use absolute references with $ for the corpus size cell (e.g. corpsize.$C$2) so that the formula can be copied to the other cells. (A sketch of the same normalization in R follows below.)
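A minimal sketch of this matrix-wise normalization in R; apart from the figure for VVD in register A and the size of subcorpus A, all values below are hypothetical and only serve to illustrate the technique:

# hypothetical raw frequency matrix (pos x register) and subcorpus sizes;
# the sizes vector must be in the same order as the columns
rawfreq <- matrix(c(2631, 1421,
                    3507, 2986),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("VVD", "VVG"), c("A", "B")))
sizes <- c(A = 102512, B = 89000)

# divide each column by the corresponding subcorpus size, scale to per million
fpm <- sweep(rawfreq, 2, sizes, "/") * 1e6
round(fpm, 2)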
The corpus size is not the only size that can matter. The ratio of the different pos of our verbs, e.g., can also depend on the total number of verb forms occurring in the different registers. Thus, it can make sense to use the number of verb forms as N in our formula.
Add a new sheet fpm_VN and copy-paste the data table into this sheet. Calculate the frequencies normalized by the total number of verb forms and compare them with the fpm based on the subcorpus size.

Diagrams visualize frequency distributions and can point to differences between corpora. However, we cannot be sure whether these differences are really significant or simply a matter of chance.
For significance testing we have to use a statistical test such as the \(\chi^2\)-test (chi-square test). The formula for the test is as follows:
\(\chi^2 = \sum_{ij} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}\)
where \(O_{ij}\) is the observed frequency, i.e. the raw frequency that we observe in our corpus, and \(E_{ij}\) refers to the expected frequency, i.e., the frequency we would expect in the corpus assuming an equal distribution (no significant difference).
To calculate the chi-square test, we use a contingency table. The basis of a contingency table is a two-dimensional table, with one feature represented in rows and the other in columns, and the figures for each feature pair in the corresponding cells. Additionally, we have to calculate the so-called marginal frequencies (row totals and column totals), as well as the total sum of instances N.
The next step is to estimate the expected frequencies for each cell. Unlike the observed frequencies, we cannot extract them from our corpus; we have to calculate them based on the observed frequencies.
The formula for the expected frequency is:
\(E = f_1 \times \frac{f_2}{N} = \frac{f_1 f_2}{N}\)
where \(f_1\) and \(f_2\) refer to the so-called marginal frequencies, \(f_1\) being the corresponding row total and \(f_2\) the corresponding column total.
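As a preview of the R tutorials, the expected frequencies can be computed directly from the marginal frequencies of an observed table. A minimal sketch with a hypothetical 2x2 table (the counts below are purely illustrative, not the actual corpus figures):

# hypothetical observed frequencies: that omitted vs. that present
# after think/say vs. after other verbs (illustrative counts only)
observed <- matrix(c(100, 50,
                      30, 70),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("think/say", "other verbs"),
                                   c("that omitted", "that present")))

N <- sum(observed)   # total number of instances
# E = (row total * column total) / N for every cell
expected <- outer(rowSums(observed), colSums(observed)) / N
round(expected)      # rounded to integers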
Let us illustrate this with an example. Suppose we have extracted the following data for omission of that with think or say vs. other verbs.
1. The data can be found in the data directory of the course folder.
2. Calculate the marginal frequencies and N in the first table (observed frequencies).
3. Calculate the expected frequencies in a second table; you can use RUNDEN() (ROUND() in the English version) to round to integers.

If you paid attention, you might wonder about something:
Significance is expressed in the so-called p-value or probability value. The p-value is the probability of obtaining the observed distribution (or an even more extreme one) if the null hypothesis is true, i.e., if there is no real difference. The smaller the p-value, the more confident we can be in rejecting the null hypothesis, i.e., in calling the difference significant. By convention, the difference is considered significant if the p-value is smaller than 0.05.
For calculating the p-value in our example, we first have to apply the chi-square formula to the two tables, i.e., sum up \((O_{ij} - E_{ij})^2 / E_{ij}\) over all cells. This will give us the chi-square value for the comparison.
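In R, the chi-square value then follows from the two tables in one line. Continuing the hypothetical example from above (the counts are still purely illustrative):

# hypothetical observed table and the corresponding expected frequencies
observed <- matrix(c(100, 50,
                      30, 70), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# chi-square value: sum over all cells of (O - E)^2 / E
chisq_value <- sum((observed - expected)^2 / expected)
chisq_value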
The corresponding p-value can be looked up in a table. See here for an example.
You might have noticed the df (degrees of freedom) in the table. It is a parameter which determines the distribution of the chi-square values and thus influences the p-value. See e.g. here for more details. The degrees of freedom are calculated as the number of rows minus 1 multiplied by the number of columns minus 1:
df = ( #Rows - 1 ) * ( #Columns -1)
In our case:
df = (2-1) * (2-1) = 1
Now we can look up our chi-square value in the table to determine the p-value.
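Instead of a printed table, the relation between chi-square value, df and p-value can also be checked in R:

# the p-value is the upper-tail probability of the chi-square distribution;
# for df = 1 the critical value at the 5% level is about 3.84
pchisq(3.84, df = 1, lower.tail = FALSE)   # approximately 0.05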
We can also calculate the exact p-value in LibreOffice/Excel:
CHISQ.TEST(Observed;Expected) in English or CHIQU.TEST(Observed;Expected) in German.

What about our data? Is the difference in the usage of, e.g., VVD significant between the registers C (press reviews) and D (general prose religion), and between the registers D (general prose religion) and M (fiction science fiction)?
TIP3: think about what the four observed frequencies might be!
Does the result conform to your hypothesis?
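For later reference: once we switch to R, the whole procedure (expected frequencies, degrees of freedom, chi-square value and p-value) is available in a single function. A minimal sketch, again with purely illustrative counts rather than the actual BROWN figures:

# chisq.test() computes expected frequencies, df, chi-square value and p-value;
# correct = FALSE switches off Yates' continuity correction so that the
# result matches the hand calculation described above
observed <- matrix(c(100, 50,
                      30, 70), nrow = 2, byrow = TRUE)
result <- chisq.test(observed, correct = FALSE)
result$expected    # expected frequencies
result$statistic   # chi-square value
result$parameter   # degrees of freedom
result$p.value     # p-value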