In this tutorial we will look at basic data analysis and data evaluation methods.

We will use LibreOffice Calc to exemplify the methods. You may also use Microsoft Excel; however, there might be slight differences.
In later tutorials we will use the statistical programming language R for these tasks, as it allows for more efficient processing, especially with multivariate data sets.

Frequency distribution

Frequency distribution is used to see differences in the occurrence of a particular linguistic phenomenon in different language varieties, registers, time periods, etc. It is the basic statistical analysis in corpus linguistics and still by far the most popular one. A frequency distribution gives you a first insight into the distribution of a particular phenomenon. You can display frequency distributions in a matrix or as a diagram (bar chart, line chart, …).

Exercise 1: Plotting frequency distributions

  • Extract a matrix for the following query from the BROWN Corpus: [pos="VV.*"]
  • Use Download query as plain text tabulation
  • features: pos, text_reg
  • output options: sort output, display as matrix
  • save the results as: data/distr_vfull_pos-reg_brown_matrix.txt
  • open the file in LibreOffice Calc
  • choose Insert -> Chart (Einfügen -> Diagramm) or click on the chart symbol in the toolbar to open the chart wizard
  • create the following diagrams:
    • bar charts (Säulendiagramm) with and without the total, once with pos on the x-axis and once with register on the x-axis (4 charts)
    • a pie chart (Kreisdiagramm) without total
  • what do the different diagrams tell you?
  • which of the diagrams can you use as is and why?
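
Looking ahead to the R tutorials: the same kind of bar chart can also be produced with a few lines of R. The sketch below assumes that the exported matrix is a tab-separated file with the pos tags in the first column and one column per register; adjust the read.table() options if your export looks different.

    # read the exported matrix (assumed: tab-separated, header row, pos tags in the first column)
    freq <- read.table("data/distr_vfull_pos-reg_brown_matrix.txt",
                       header = TRUE, sep = "\t", row.names = 1)
    # grouped bar chart: one group of bars per register, one bar per pos tag
    barplot(as.matrix(freq), beside = TRUE, legend.text = rownames(freq),
            xlab = "register", ylab = "raw frequency")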

Normalization

In order to be able to compare frequency distributions across different corpora/subcorpora, you usually need to normalize the frequency counts. This is due to the fact that the corpora you compare are usually of different sizes. It is, e.g., not surprising if you find more verbs in a 20 million word corpus than in a 1 million word corpus. In order to really compare the numbers, you have to bring them onto the same basis, i.e., you calculate a relative frequency on a fixed basis, e.g. the frequency per million tokens (or per 100 or 1,000 tokens).

The mathematical equation for normalization is as follows:

\(\frac{raw~frequency}{N} \times 1,000,000 = frequency~per~million\)

where N is the number of tokens in the corpus.

Let's have a look at a real example: the matrix we extracted for the distribution of verbal pos tags across registers in the BROWN corpus, together with the corresponding subcorpus sizes.

We can observe that the sizes of the subcorpora differ; the raw frequencies are therefore not comparable, and we have to normalize our data. In order to do this, we divide each figure in the matrix by the corresponding subcorpus size and multiply it by a normalization basis (e.g. one million). In the case of VVD in register A (press reportage) the calculation is as follows:

\(\frac{2631}{102512}\times 1,000,000~= 25665.29\)
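
In R the same normalization is a one-line calculation. The snippet below is only a sketch: the first line reproduces the single value from above, the second line assumes a matrix freq with one column per register and a vector sizes that holds the subcorpus sizes in the same column order.

    # single value: VVD in register A (press reportage)
    2631 / 102512 * 1000000                       # 25665.29
    # whole matrix at once: divide each column by its subcorpus size, then scale
    fpm <- sweep(as.matrix(freq), 2, sizes, "/") * 1000000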

Exercise 2: Calculating normalized figures

  • Open the data file data/distr_vfull_pos-reg_brown_matrix.txt in LibreOffice/Excel
  • Rename the sheet rawfreq
  • Create a new sheet, rename it fpm (frequency per million) and copy paste the rawfreq table into this sheet
  • We will calculate the normalized frequencies in this new table.
  • Delete all figures from the fpm table
  • Download the file with the BROWN corpus sizes and save it in the data directory of the course directory: data/brown_csizes_full.txt
  • Choose Sheet -> Insert Sheet from File (Tabelle -> Tabelle aus Datei einfügen) to open the BROWN corpus size file in a separate sheet
  • Rename this sheet corpsize
  • Now we can add our formula to the first cell in our table.
  • Choose the first data cell in the fpm table (VVD - A)
  • Write a = to indicate a formula
  • Add the formula for normalization; you can select the respective cells from the rawfreq and corpsize sheets by clicking on them (confirm with ENTER)
  • The results of the formula will be displayed in the respective cell
  • Add the formula to all cells
  • TIP
    • You can copy-and-paste formulas
    • You can paste in more than one cell at a time by selecting several cells before pasting
    • LibreOffice/Excel adjusts cell references in a formula according to the formula's position. If you want to avoid this, make the reference absolute by prefixing column and row with $ (e.g. corpsize.$C$2)
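
For illustration only: assuming the raw frequency for VVD in register A sits in cell B2 of the rawfreq sheet and the size of subcorpus A in cell C2 of the corpsize sheet, the formula in the corresponding fpm cell would read =rawfreq.B2/corpsize.$C$2*1000000. The cell references are hypothetical; they depend on how your sheets are laid out.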

Exercise 3: Plotting the normalized figures

  • Create the same diagrams as in Exercise 1 for the normalized figures
  • Compare them to the diagrams with the raw frequencies

Exercise 4: Yet another normalization

The corpus size is not the only size that can matter. The proportions of the different verbal pos tags, e.g., can also depend on the total number of verb forms occurring in the different registers. Thus, it can make sense to use the number of verb forms as N in our formula.

  • Create a new sheet, rename it fpm_VN and copy paste the data table into this sheet
  • Calculate the fpm values based on the total number of verb forms in the registers
  • Plot the resulting figures and compare them to the fpm based on the subcorpus size
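
A sketch of this variant in R, assuming the raw frequency matrix freq from above with one column per register: the column sums then correspond to the total number of verb forms per register.

    # normalize by the total number of verb forms in each register (column sums)
    fpm_vn <- sweep(as.matrix(freq), 2, colSums(freq), "/") * 1000000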

Significance testing: Chi-squared test

Diagrams visualize frequency distributions and can point to differences between corpora. However, we cannot be sure whether these differences are really significant or simply a matter of chance.

For significance testing we have to use statistical tests such as the \(\chi^2\)-test (chi-square test). The formula for the test is as follows:

\(\chi^2 = \sum_{ij} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}\)

where O is the observed frequency, i.e., the raw frequency that we observe in our corpus, and E refers to the expected frequency, i.e., the frequency we would expect in the corpus assuming an equal distribution (no significant difference).

To calculate the chi-square test, we use a contingency table. The basis of a contingency table is a two-dimensional table, with one feature represented in rows and the other in columns, and the figures for each feature pair in the corresponding cells. Additionally, we have to calculate the so-called marginal frequencies (row totals and column totals) as well as the total number of instances N.

The next step is to estimate the expected frequencies for each cell. Unlike the observed frequencies, we cannot extract them from our corpus; we have to calculate them based on the observed frequencies.

The formula for the expected frequency is:

\(E = f_1 \times \frac{f_2}{N} = \frac{f_1 f_2}{N}\)

\(f_1\) and \(f_2\) refer to the so-called marginal frequencies, \(f_1\) being the corresponding row total and \(f_2\) the corresponding column total.

Exercise 5: Chi-square test first example

Let us illustrate this with an example. Suppose we have extracted the following data for omission of that with think or say vs. other verbs.
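
Schematically, such a contingency table looks as follows (the orientation, verbs in rows and that-omission in columns, is only one possible layout):

                      that omitted    that present    row total
    think/say         O11             O12             f1 (think/say)
    other verbs       O21             O22             f1 (other)
    column total      f2 (omitted)    f2 (present)    N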

  • Download the data-file: that-omission_sample.txt and save it in the data directory of the course folder
  • Open the file in LibreOffice Calc
  • Duplicate the contingency table using copy-and-paste - we will use this table to calculate the expected frequencies
  • Calculate the marginal frequencies (row total and column totals) and N in the first table (observed frequencies)
  • Delete all figures from the expected frequency table
  • Add formulas in each cell calculating the expected frequency based on the figures of the first table
  • Use the function ROUND() (RUNDEN() in the German version) to round to integers
  • Duplicate the contingency table again - we will use it to calculate the chi-square values
  • Add the formula for the chi-square values in this table - everything behind the summation sign: \(\frac{(O-E)^2}{E}\)
  • The larger the difference between the observed and the expected frequency, the larger the calculated value
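
The same steps can be sketched in R. The observed frequencies below are made-up placeholder numbers (the real ones come from that-omission_sample.txt); what matters here is how the expected frequencies and the per-cell chi-square contributions are calculated.

    # hypothetical 2x2 table of observed frequencies (placeholders, not the real data)
    obs <- matrix(c(120, 60, 30, 90), nrow = 2,
                  dimnames = list(c("think/say", "other verbs"),
                                  c("that omitted", "that present")))
    # expected frequencies: row total * column total / N
    expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
    # per-cell chi-square contributions: (O - E)^2 / E
    contrib <- (obs - expected)^2 / expected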

If you paid attention, you might wonder about something:

  • What about the summation sign?
  • Where is the p-value?

Significance is expressed in the so-called p-value or probability value. The p-value is the probability of obtaining the observed (or a more extreme) result if the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis, i.e., the more likely it is that the difference is not due to chance. By convention, the difference is considered significant if the p-value is smaller than 0.05.

For calculating the p-value in our example, we first need the overall chi-square value:

  • calculate the row totals of the chi-square table
  • sum up the row totals (this is what the summation sign stands for)

This will give us the chi-square value for the comparison.

The corresponding p-value can be looked up in a table. See here for an example.

You might have noticed the df (degrees of freedom) in the table. It is a parameter which determines the distribution of the chi-square values and thus influences the p-value. See e.g. here for more details. The degrees of freedom are calculated as the number of rows minus 1 multiplied by the number of columns minus 1:

df = (#Rows - 1) * (#Columns - 1)

In our case:

df = (2-1) * (2-1) = 1

Now we can look up our chi-square value in the table to determine the p-value.

We can also calculate the exact p-value in LibreOffice/Excel:

  • use the function CHISQ.TEST(Observed;Expected) (CHIQU.TEST(Observed;Expected) in the German version)
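
In R the whole test is a single function call; the sketch below assumes the observed 2x2 matrix obs from the earlier snippet. Setting correct = FALSE switches off Yates' continuity correction, so the result corresponds to the hand calculation above.

    # chi-square test on the observed table; reports the statistic, df and p-value
    chisq.test(obs, correct = FALSE)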

Exercise 6: A “real” example

What about our data? Is the difference in the usage of, e.g., VVD significant between the registers C (press reviews) and D (general prose religion), and between the registers D (general prose religion) and M (fiction science fiction)?

  • TIP: you can reuse the table with the formulas! You simply have to change the values for the observed frequency
  • TIP2: if you open the extracted data in the same LibreOffice document you can let LibreOffice fill in the observed frequencies
  • TIP3: think about what the four observed frequencies might be!

  • Which comparison would you expect to show a significant difference, and why - based on the observed frequencies?
  • Does the result conform to your hypothesis?
