In this tutorial we will learn how to manipulate data sets in R.
This tutorial is an R Markdown document that you can use as a laboratory notebook and as a base script for data manipulation. You can render it with the Knit button, which will execute the R code chunks and render the text in a given output format (here: HTML). Code chunks can be given a name (the label in the chunk header, up to the first comma), but names must be unique.
See Introduction to R Markdown and Getting Started for more details on R Markdown documents.
Let’s get started
Data-Set_Manipulation.Rmd
As examples we will use the data sets that we created in the tutorial on Data Extraction and Data Formats. They describe the distribution of content verbs and their parts of speech across registers in different corpora of the Brown corpus family (the BROWN, FROWN, FLOB and LOB corpora).
We will first load the data set from the BROWN corpus, store it in the data frame d.brown, and add column names representing the features (verb, pos, register).
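# load the data and store it in d.brown
# (stringsAsFactors=TRUE: since R 4.0.0 strings are no longer read as factors by default)
d.brown <- read.table("data/distr_vfull_lemma-pos-reg_brown.txt", header=FALSE, fill=TRUE, sep="\t", row.names=NULL, quote="", stringsAsFactors=TRUE)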
# add column names
colnames(d.brown) <- c("verb","pos","register")
We can now have a first look at the data frame.
head(d.brown)
## verb pos register
## 1 say VVD A
## 2 produce VVN A
## 3 take VVD A
## 4 say VVD A
## 5 deserve VVZ A
## 6 conduct VVN A
A data frame is an object class in R for storing multivariate data sets. Each column in a data frame can have a different class, e.g.:
numeric for number values
factor for categorical values
character for string values
If not specified explicitly, R assigns the class of a column automatically based on the values of the column. We will see later that this does not always work as desired.
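To illustrate the automatic class assignment, here is a minimal sketch on a small toy data set. The values and the name d.toy are made up for illustration and are not taken from the corpus files. It shows that a column of year values is read as a number, although we would treat it as a category.

```r
# Toy data (hypothetical values, not taken from the corpus files)
txt <- "verb\tpos\tyear\nsay\tVVD\t1961\ntake\tVVD\t1961\nmake\tVVN\t1991"
d.toy <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                    stringsAsFactors = TRUE)

# class that R assigned to each column
classes <- sapply(d.toy, class)
print(classes)

# year was read as a number; convert it explicitly to a categorical variable
d.toy$year <- factor(d.toy$year)
print(levels(d.toy$year))
```

Checking the classes with sapply right after loading is a cheap way to spot columns that need an explicit conversion.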
With summary we can get a summary of the values in the data frame. Depending on its class, each column may be treated differently. Our example contains only columns of the class factor. Thus, the summary for each column lists the most frequent values with their frequencies.
summary(d.brown)
## verb pos register
## do : 3384 VV :26291 G :16329
## say : 2780 VVD:27855 J :16096
## make : 2360 VVG:17299 F :10685
## go : 1808 VVN:27674 A : 9364
## take : 1588 VVP: 8723 N : 8825
## come : 1564 VVZ: 8077 P : 8408
## (Other):102435 (Other):46212
We can access single rows and columns, or groups of them, by their row names or column names (if they exist) or by their number.
# row number 3
d.brown[3,]
# rows 3-10
d.brown[3:10,]
# column number 2
d.brown[,2]
# columns 2 to 3
d.brown[,2:3]
# column numbers 1 and 3
d.brown[,c(1,3)]
# column by columnname
d.brown$verb
d.brown[,"verb"]
# several columns by name
d.brown[,c("verb","register")]
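One detail of the access patterns above is worth a small sketch: selecting a single column returns a plain vector, while selecting several columns returns a data frame. The toy data frame d.toy below is an assumption for illustration only.

```r
# Toy data frame (hypothetical values) with the same column names as d.brown
d.toy <- data.frame(verb = c("say", "take", "make"),
                    pos = c("VVD", "VVD", "VVN"),
                    register = c("A", "A", "B"))

one.col  <- d.toy[, 2]                # single column: returned as a vector
two.cols <- d.toy[, 2:3]              # several columns: returned as a data frame
kept.df  <- d.toy[, 2, drop = FALSE]  # drop = FALSE keeps the data frame structure

print(is.data.frame(one.col))
print(is.data.frame(two.cols))
print(is.data.frame(kept.df))
```

Using drop = FALSE is useful when later code expects a data frame regardless of how many columns were selected.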
The data set does not yet contain meta information about the corpus it was extracted from. Thus, we add the columns
corpus with the value brown
year with the value 1961
lgvar (for language variety) with the value AE
If we simply assigned the values as follows, R would automatically assign the class character to the new columns.
d.brown$corpus <- "brown"
d.brown$year <- "1961"
d.brown$lgvar <- "AE"
A summary then only reports the length of each new column (the total number of rows), its class and its mode.
summary(d.brown)
## verb pos register corpus
## do : 3384 VV :26291 G :16329 Length:115919
## say : 2780 VVD:27855 J :16096 Class :character
## make : 2360 VVG:17299 F :10685 Mode :character
## go : 1808 VVN:27674 A : 9364
## take : 1588 VVP: 8723 N : 8825
## come : 1564 VVZ: 8077 P : 8408
## (Other):102435 (Other):46212
## year lgvar
## Length:115919 Length:115919
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
However, the meta data of the corpus are categorical features. We can tell R this by specifying the class explicitly when assigning the values.
d.brown$corpus <- factor("brown")
d.brown$year <- factor("1961")
d.brown$lgvar <- factor("AE")
A summary shows the difference.
summary(d.brown)
## verb pos register corpus year
## do : 3384 VV :26291 G :16329 brown:115919 1961:115919
## say : 2780 VVD:27855 J :16096
## make : 2360 VVG:17299 F :10685
## go : 1808 VVN:27674 A : 9364
## take : 1588 VVP: 8723 N : 8825
## come : 1564 VVZ: 8077 P : 8408
## (Other):102435 (Other):46212
## lgvar
## AE:115919
##
##
##
##
##
##
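The difference between the two ways of assigning a constant value can be sketched on a toy data frame. The names d.toy, corpus.chr and corpus.fac are hypothetical; in both cases the single value is recycled to all rows.

```r
# Toy data frame (hypothetical values)
d.toy <- data.frame(verb = c("say", "take", "make"))

d.toy$corpus.chr <- "brown"          # recycled to all rows, class character
d.toy$corpus.fac <- factor("brown")  # recycled to all rows, class factor

class.chr <- class(d.toy$corpus.chr)
class.fac <- class(d.toy$corpus.fac)
print(c(class.chr, class.fac))

# for a factor, R can count how often each value occurs
counts <- table(d.toy$corpus.fac)
print(counts)

# an existing character column can also be converted afterwards
d.toy$corpus.chr <- factor(d.toy$corpus.chr)
```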
What about the register codes? The single characters are not very informative. Let’s add a column with more descriptive names. The data set brown_registers.txt maps the character codes to more descriptive register names.
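# load data set with register information
# (stringsAsFactors=TRUE: since R 4.0.0 strings are no longer read as factors by default)
d.reg <- read.table("data/brown_registers.txt", header=TRUE, sep="\t", row.names=NULL, quote="", stringsAsFactors=TRUE)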
d.reg
## register reg.long
## 1 A press.reportage
## 2 B press.editorial
## 3 C press.reviews
## 4 D general.prose.religion
## 5 E general.prose.skills.trades.hobbies
## 6 F general.prose.popular.lore
## 7 G general.prose.essay
## 8 H general.prose.misc
## 9 J academic.learned
## 10 K fiction.general
## 11 L fiction.adventure
## 12 M fiction.science.fiction
## 13 N fiction.mistery
## 14 P fiction.romance
## 15 R fiction.humor
We can see that the registers fall into different types. Thus, we will add a variable register type (reg.type) to the data set.
# categorize registers using character codes
d.reg$reg.type[d.reg$register %in% c("A","B","C")] <- "press"
d.reg$reg.type[d.reg$register %in% c("D","E","F","G","H")] <- "prose"
d.reg$reg.type[d.reg$register %in% "J"] <- "academic"
# categorize registers using regular expression
d.reg$reg.type[grep("fiction",d.reg$reg.long)] <- "fiction"
d.reg$reg.type <- as.factor(d.reg$reg.type)
d.reg
## register reg.long reg.type
## 1 A press.reportage press
## 2 B press.editorial press
## 3 C press.reviews press
## 4 D general.prose.religion prose
## 5 E general.prose.skills.trades.hobbies prose
## 6 F general.prose.popular.lore prose
## 7 G general.prose.essay prose
## 8 H general.prose.misc prose
## 9 J academic.learned academic
## 10 K fiction.general fiction
## 11 L fiction.adventure fiction
## 12 M fiction.science.fiction fiction
## 13 N fiction.mistery fiction
## 14 P fiction.romance fiction
## 15 R fiction.humor fiction
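The two selection techniques used above return different kinds of results, which the following sketch makes explicit. The vectors codes and long.names are toy data chosen for illustration. grepl is shown in addition as the logical counterpart of grep.

```r
# Toy vectors (hypothetical subset of the register table)
codes <- c("A", "D", "J", "K")
long.names <- c("press.reportage", "general.prose.religion",
                "academic.learned", "fiction.general")

# %in% returns one logical value per element
in.press <- codes %in% c("A", "B", "C")
print(in.press)

# grep returns the positions of the matching elements
idx.fiction <- grep("fiction", long.names)
print(idx.fiction)

# grepl returns one logical value per element, like %in%
is.fiction <- grepl("fiction", long.names)
print(is.fiction)
```

Both a logical vector and a vector of positions can be used inside the square brackets to select the elements that should receive a new value.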
Now we would like to add this information to our data set d.brown. As you can see, the data frames d.reg and d.brown both contain a column with the name register, which includes the same set of values. We will use this column to merge the two data sets and overwrite d.brown with the result.
# merge information of two data sets
d.brown <- merge(d.brown,d.reg)
As a result, d.brown now additionally contains the columns reg.long and reg.type from the register data set.
head(d.brown)
## register verb pos corpus year lgvar reg.long reg.type
## 1 A say VVD brown 1961 AE press.reportage press
## 2 A produce VVN brown 1961 AE press.reportage press
## 3 A take VVD brown 1961 AE press.reportage press
## 4 A say VVD brown 1961 AE press.reportage press
## 5 A deserve VVZ brown 1961 AE press.reportage press
## 6 A conduct VVN brown 1961 AE press.reportage press
summary(d.brown)
## register verb pos corpus year
## G :16329 do : 3384 VV :26291 brown:115919 1961:115919
## J :16096 say : 2780 VVD:27855
## F :10685 make : 2360 VVG:17299
## A : 9364 go : 1808 VVN:27674
## N : 8825 take : 1588 VVP: 8723
## P : 8408 come : 1564 VVZ: 8077
## (Other):46212 (Other):102435
## lgvar reg.long reg.type
## AE:115919 general.prose.essay :16329 academic:16096
## academic.learned :16096 fiction :36613
## general.prose.popular.lore:10685 press :18745
## press.reportage : 9364 prose :44465
## fiction.mistery : 8825
## fiction.romance : 8408
## (Other) :46212
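The behaviour of merge can be sketched on two toy data frames. The names d.left and d.right and their values are assumptions for illustration. The sketch names the key column explicitly with by, and shows that by default rows without a match are dropped, while all.x = TRUE keeps them with missing values.

```r
# Toy data frames (hypothetical values); register "Z" has no match in d.right
d.left <- data.frame(register = c("K", "A", "A", "Z"),
                     verb = c("go", "say", "take", "do"))
d.right <- data.frame(register = c("A", "K"),
                      reg.type = c("press", "fiction"))

# default: only rows whose key occurs in both data frames are kept
m.inner <- merge(d.left, d.right, by = "register")
print(m.inner)

# all.x = TRUE: keep every row of the first data frame, fill gaps with NA
m.left <- merge(d.left, d.right, by = "register", all.x = TRUE)
print(m.left)
```

Note that the key column comes first in the result and that the rows are sorted by the key, which is why register appears as the first column of d.brown after the merge.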
Now we have to load and modify the data set from the FROWN corpus in the same way. This time we combine all commands in one chunk and save the new data set as d.frown.
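# load the data set
# (stringsAsFactors=TRUE: since R 4.0.0 strings are no longer read as factors by default)
d.frown <- read.table("data/distr_vfull_lemma-pos-reg_frown.txt", header=FALSE, sep="\t", row.names=NULL, quote="", stringsAsFactors=TRUE)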
# add column names
colnames(d.frown) <- c("verb","pos","register")
# add corpus information
d.frown$corpus <- factor("frown")
d.frown$year <- factor("1991")
d.frown$lgvar <- factor("AE")
# add register information
d.frown <- merge(d.frown,d.reg)
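Since the same steps are repeated for each corpus, they could be wrapped in a function. The following is only a sketch: the helper load.corpus is hypothetical and not part of the tutorial, and it is demonstrated on a small temporary file and a toy register table instead of the real corpus files.

```r
# Hypothetical helper (not part of the tutorial): load one corpus file,
# name the columns, add the meta data and merge the register information
load.corpus <- function(path, corpus, year, lgvar, d.reg) {
  d <- read.table(path, header = FALSE, sep = "\t", row.names = NULL,
                  quote = "", stringsAsFactors = TRUE)
  colnames(d) <- c("verb", "pos", "register")
  d$corpus <- factor(corpus)
  d$year <- factor(year)
  d$lgvar <- factor(lgvar)
  merge(d, d.reg)
}

# Demonstration with a temporary file standing in for a corpus file
tmp <- tempfile(fileext = ".txt")
writeLines(c("say\tVVD\tA", "go\tVVD\tK"), tmp)
d.reg.toy <- data.frame(register = c("A", "K"),
                        reg.type = c("press", "fiction"))

d.demo <- load.corpus(tmp, "demo", "1961", "AE", d.reg.toy)
print(d.demo)
```

With such a helper, loading a further corpus of the family would take a single call with the corresponding file path and meta data.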
We can now combine the two data sets (d.brown and d.frown) by concatenating them.
d.all <- rbind(d.brown,d.frown)
summary(d.all)
## register verb pos corpus year
## G :33220 do : 6780 VV :53012 brown:115919 1961:115919
## J :32378 say : 6704 VVD:55961 frown:120475 1991:120475
## F :21486 make : 4413 VVG:35920
## A :20288 go : 3645 VVN:52804
## N :17536 get : 3134 VVP:19777
## P :17305 know : 3092 VVZ:18920
## (Other):94181 (Other):208626
## lgvar reg.long reg.type
## AE:236394 general.prose.essay :33220 academic:32378
## academic.learned :32378 fiction :74220
## general.prose.popular.lore:21486 press :39588
## press.reportage :20288 prose :90208
## fiction.mistery :17536
## fiction.romance :17305
## (Other) :94181
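What rbind does with factor columns can be sketched on two toy data frames (d.a and d.b are hypothetical names with made-up values). The rows are stacked, and the levels of each factor column are combined, so the result knows the values of both inputs.

```r
# Toy data frames (hypothetical values) with identical column names
d.a <- data.frame(verb = factor(c("say", "go")), corpus = factor("brown"))
d.b <- data.frame(verb = factor(c("get", "say")), corpus = factor("frown"))

# stack the rows of both data frames
d.ab <- rbind(d.a, d.b)
print(d.ab)

# the factor levels of both inputs are combined
print(levels(d.ab$corpus))
print(levels(d.ab$verb))
```

rbind requires both data frames to have the same column names, which is why d.brown and d.frown were prepared in exactly the same way.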
For further use we will save our newly created data set in the data directory as distr_vfull_lemma-pos-reg_brown-all.txt.
write.table(d.all,file = "data/distr_vfull_lemma-pos-reg_brown-all.txt",quote=FALSE,sep="\t",row.names = FALSE)
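When the saved file is loaded again later, keep in mind that a tab-separated text file does not store the column classes. The sketch below writes a toy data frame to a temporary file (d.toy and tmp are assumptions for illustration) and reads it back. The file now has a header line, and the year column comes back as a number and has to be converted again.

```r
# Toy data frame (hypothetical values)
d.toy <- data.frame(verb = factor(c("say", "go")),
                    year = factor(c("1961", "1991")))

# write to a temporary file with the same options as above
tmp <- tempfile(fileext = ".txt")
write.table(d.toy, file = tmp, quote = FALSE, sep = "\t", row.names = FALSE)

# read it back; header = TRUE because the column names were written to the file
d.back <- read.table(tmp, header = TRUE, sep = "\t", quote = "",
                     stringsAsFactors = TRUE)
classes.back <- sapply(d.back, class)
print(classes.back)

# restore the categorical interpretation of year
d.back$year <- factor(d.back$year)
```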
In this tutorial we created the data frames d.brown and d.frown, combined them into d.all using rbind, and saved the result as distr_vfull_lemma-pos-reg_brown-all.txt.