The document uses two R functions from the R script func_dataana_normfreq.R:
- norm.data
- plot.bar
The functions facilitate normalization and plotting, making the R Markdown file more readable and easier to modify.
If you open the R Markdown file (Data_analysis_norm-and-freq.Rmd) in RStudio and click Knit, it will generate an HTML document including both the text content and the output of the embedded R code chunks.
The document requires two different data sets.
Initially the file runs with the sample data provided.
See Creating the sample data for information on how to create the sample data provided.
You may also use the document with your own data. In this case you simply have to adjust the parameters below.
Parameter settings: you need to adapt the variables in this chunk to process your own data or to change the features for the analysis.
# Parameter settings: ADAPT the variables in this chunk to process your own data
# data set file with column names
datafile <- "data/distr_vfull_lemma-pos-reg-sc_brownfam-meta.txt"
# data set with token sizes with column names
csizefile <- "data/brown_family_csizes-meta.txt"
The first thing we have to do is load the data sets.
As you can see, the R chunk uses the parameters we set above to load the data sets dat and d.csize.
# load the data set file
dat <- read.table(datafile, header = TRUE, fill = TRUE, sep = "\t", row.names = NULL, quote = "")
# load the data set with token sizes
d.csize <- read.table(csizefile, header = TRUE, fill = TRUE, sep = "\t", row.names = NULL, quote = "")
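To see what these read.table arguments do, here is a minimal, self-contained sketch that writes a tiny tab-separated file and reads it back the same way. The column names (lemma, pos, reg) and the values are made up for illustration and are not taken from the sample data.

```r
# Hypothetical toy file; column names and values are invented for illustration.
toyfile <- tempfile(fileext = ".txt")
writeLines(c(
  "lemma\tpos\treg",
  "be\tVB\tnews",
  "don't\tVB\tfiction",
  "have\tVH\tnews"
), toyfile)

# Same arguments as in the chunk above:
#   header = TRUE  : the first line holds the column names
#   sep = "\t"     : columns are tab-separated
#   quote = ""     : treat ' and " as ordinary characters (e.g. in "don't")
#   fill = TRUE    : pad short lines with empty fields instead of failing
toy <- read.table(toyfile, header = TRUE, fill = TRUE, sep = "\t",
                  row.names = NULL, quote = "")

str(toy)    # column names and types
head(toy)   # first rows
```

Running str() and head() on your own data in the same way is a quick check that the column names match what the parameters expect.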
The data set dat is a multivariate data set including a lot of different features, e.g. the verb lemma and the part of speech as well as register, language variety, and year.
We can choose which features we want to investigate more closely using the parameter variables feat1 and feat2.
In this tutorial feat1 refers to the linguistic phenomenon, while feat2 refers to the subcorpus.
We can change the features any time to plot new things.
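As a rough sketch of how two such parameter variables can drive an analysis, the following cross-tabulates one column against another. It assumes that feat1 and feat2 hold column names as strings; the toy data frame and its column names (pos, reg) are hypothetical.

```r
# Hypothetical toy data; the real data set has more columns and rows.
toy <- data.frame(
  pos = c("VB", "VB", "VH", "VV", "VV", "VV"),
  reg = c("news", "fiction", "news", "news", "fiction", "fiction")
)

# Assumption: the parameters name the columns to analyse.
feat1 <- "pos"   # linguistic phenomenon
feat2 <- "reg"   # subcorpus

# Count every combination of the two features.
counts <- table(toy[[feat1]], toy[[feat2]], dnn = c(feat1, feat2))
counts
```

Setting feat2 to a different column name would repeat the same count for another grouping without touching the rest of the code.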
Our first plot shows the distribution of parts of speech (feat1) across registers (feat2).
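Raw counts are hard to compare when subcorpora differ in size, which is presumably why a second data set with token sizes is loaded. The sketch below shows one common way to normalize, a frequency per 1,000 tokens, in base R. It is not the implementation of norm.data; the counts, the corpus sizes, and the scaling factor are invented for illustration.

```r
# Hypothetical raw counts: rows = part of speech, columns = register.
counts <- matrix(c(50, 30,
                   20, 60),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(pos = c("VB", "VV"),
                                 reg = c("news", "fiction")))

# Hypothetical number of tokens in each register.
csize <- c(news = 10000, fiction = 20000)

# Divide each column by its register size and scale to 1,000 tokens.
norm <- sweep(counts, 2, csize[colnames(counts)], "/") * 1000
norm
```

A matrix like norm can be passed to base R's barplot(norm, beside = TRUE, legend.text = TRUE) to draw grouped bars, one group per register.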
We can also easily turn the diagram:
You can also make the plot interactive.
Now we want to have a look at the distribution of parts of speech across time (year).
Now we want to have a look at the distribution of verb lemmas. However, we want to plot the 10 most frequent verb lemmas only.
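Selecting the most frequent items can be sketched in base R as follows. The lemma vector is hypothetical, and n is set to 3 only to keep the toy example small; this is an illustration of the idea, not the code used in the document.

```r
# Hypothetical lemma column; in practice this would come from the data set.
lemmas <- c("be", "have", "be", "say", "do", "be", "have", "go", "say", "be")

n <- 3  # use 10 for the ten most frequent lemmas

# Count each lemma, sort by frequency, and keep the n most frequent.
freq <- sort(table(lemmas), decreasing = TRUE)
top <- names(head(freq, n))

# Keep only the tokens whose lemma is among the top n.
top_only <- lemmas[lemmas %in% top]
table(top_only)
```

Filtering the data down to the top lemmas before plotting keeps the diagram readable when a feature has many distinct values.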
What if we want to see the distribution of the parts of speech across registers in a specific subcorpus, e.g. the BROWN corpus?
Then we have to restrict the data to that subcorpus before plotting.
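One way to restrict a data set to a single subcorpus is logical indexing on a corpus column. The following is a minimal sketch under that assumption; the toy data frame and its column names (corpus, pos, reg) are hypothetical.

```r
# Hypothetical toy data with a corpus column; names and values are invented.
toy <- data.frame(
  corpus = c("BROWN", "BROWN", "BLOB", "BLOB", "BROWN"),
  pos    = c("VB", "VV", "VB", "VH", "VV"),
  reg    = c("news", "fiction", "news", "news", "news")
)

# Keep only the rows that belong to the BROWN corpus.
brown <- toy[toy$corpus == "BROWN", ]

# Count parts of speech per register within that subcorpus.
table(brown$pos, brown$reg)
```

Replacing "BROWN" with "BLOB" gives the corresponding subset for the other corpus.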
The following plots the 10 most frequent verb lemmas in the BROWN corpus.
Plot the same for the BLOB corpus and compare. Make notes of your observations.
Plot the following