Manipulating Data Sets

In this tutorial we will learn how to manipulate data sets in R:

adding column names
adding additional variables (columns)
summarizing the data
merging and combining two or more data sets

Getting started

This tutorial is an R Markdown document that you can use as a laboratory notebook and as a base script for data manipulation.

The document contains text/notes (white background) as well as R code chunks (grey background).
You can execute/render the file by clicking on the Knit button, which executes the R code chunks and renders the text in a given output format (here: HTML).
You can also execute the chunks one by one in the console by clicking on the small green arrow at the upper right-hand side of the chunk.
Chunks may be named (the first string after r up to the first comma), but names must be unique.

See Introduction to R Markdown and Getting Started for more details on R Markdown documents.

Let’s get started

download Tutorial_Data-Set_Manipulation.Rmd
save the file in your course directory
copy the file and rename the copy to Data-Set_Manipulation.Rmd
open the new file in RStudio by double-clicking on the file

Data frames

As examples we will use the data sets that we created in the tutorial on Data Extraction and Data Formats. They describe the distribution of content verbs and their parts of speech across registers in different corpora of the Brown corpus family (the BROWN, FROWN, FLOB and LOB corpora).

We will first load the data set from the BROWN corpus and store it in the data frame d.brown and add column names representing the features (verb, pos, register).

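# load the data and store it in d.brown
# (since R 4.0.0 strings are only read as factors if stringsAsFactors=TRUE is set)
d.brown <- read.table("data/distr_vfull_lemma-pos-reg_brown.txt", header=FALSE, fill=TRUE, sep="\t", row.names=NULL, quote="", stringsAsFactors=TRUE)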
# add column names
colnames(d.brown) <- c("verb","pos","register")

We can now have a first look at the data frame.

head(d.brown)

##      verb pos register
## 1     say VVD        A
## 2 produce VVN        A
## 3    take VVD        A
## 4     say VVD        A
## 5 deserve VVZ        A
## 6 conduct VVN        A

A data frame is an object class in R to store multivariate data sets. Each column in a data frame can have a different class (e.g. numeric, factor, character).

numeric for number values
factor for categorical values
character for string values

If not specified explicitly, R assigns the class of a column automatically based on the values of the column. We will see later that this does not always work as desired.
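
To check which class R has assigned to each column, the base functions class, sapply and str can be used. The following is a minimal sketch on a small made-up data frame; d.toy and its values are illustrative assumptions and not part of the tutorial data.

```r
# toy data frame (hypothetical values, for illustration only)
d.toy <- data.frame(
  verb  = c("say", "take", "say"),
  count = c(3, 1, 2),
  pos   = factor(c("VVD", "VVD", "VVZ")),
  stringsAsFactors = FALSE
)

# class of every column as a named character vector
col.classes <- sapply(d.toy, class)
print(col.classes)

# compact overview of the whole structure
str(d.toy)
```

Here verb stays a character column because stringsAsFactors is FALSE, while pos is a factor because it was created with factor().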

With summary we can get an overview of the values in the data frame. Depending on its class, each column may be treated differently.
Our example contains only columns of the class factor. Thus, the summary for each column lists the most frequent values with their frequency.

summary(d.brown)

##       verb         pos           register    
##  do     :  3384   VV :26291   G      :16329  
##  say    :  2780   VVD:27855   J      :16096  
##  make   :  2360   VVG:17299   F      :10685  
##  go     :  1808   VVN:27674   A      : 9364  
##  take   :  1588   VVP: 8723   N      : 8825  
##  come   :  1564   VVZ: 8077   P      : 8408  
##  (Other):102435               (Other):46212

We can access single rows and columns, or groups of them, by their row names or column names (if they exist) or by their number.

# row number 3
d.brown[3,]
# rows 3-10
d.brown[3:10,]
# column number 2
d.brown[,2]
# column number 2:3
d.brown[,2:3]
# column numbers 1 and 3
d.brown[,c(1,3)]
# column by columnname
d.brown$verb
d.brown[,"verb"]
# several columns by name
d.brown[,c("verb","register")]
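
One detail of this kind of indexing is worth knowing: selecting a single column with square brackets returns a vector, not a data frame, unless drop = FALSE is given. The sketch below illustrates this on a made-up data frame (d.toy is an assumption for illustration). It also shows row selection by a logical condition, which is not used above but follows the same bracket syntax.

```r
# toy data frame (hypothetical values, for illustration only)
d.toy <- data.frame(
  verb     = c("say", "produce", "take", "deserve"),
  pos      = c("VVD", "VVN", "VVD", "VVZ"),
  register = c("A", "A", "B", "B"),
  stringsAsFactors = TRUE
)

# a single column selected with [ , ] is returned as a vector (here: a factor)
col.vec <- d.toy[, "pos"]
# drop = FALSE keeps the result a data frame with one column
col.df <- d.toy[, "pos", drop = FALSE]

# rows can also be selected by a logical condition
rows.b <- d.toy[d.toy$register == "B", ]

is.data.frame(col.vec)
is.data.frame(col.df)
print(rows.b)
```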

Adding columns to a data frame

The data set does not yet contain meta information about the corpus it was extracted from. Thus, we add the columns

corpus with the value brown
year with the value 1961
lgvar (for language variety) with the value AE

If we simply assigned the values as follows, R would automatically assign the new columns the class character.

d.brown$corpus <- "brown"
d.brown$year <- "1961"
d.brown$lgvar <- "AE"

For these columns, a summary then only reports the total number of rows, the class and the mode.

summary(d.brown)

##       verb         pos           register        corpus         
##  do     :  3384   VV :26291   G      :16329   Length:115919     
##  say    :  2780   VVD:27855   J      :16096   Class :character  
##  make   :  2360   VVG:17299   F      :10685   Mode  :character  
##  go     :  1808   VVN:27674   A      : 9364                     
##  take   :  1588   VVP: 8723   N      : 8825                     
##  come   :  1564   VVZ: 8077   P      : 8408                     
##  (Other):102435               (Other):46212                     
##      year              lgvar          
##  Length:115919      Length:115919     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

However, the meta data of the corpus are categorical features. We can tell R this by specifying the class explicitly when assigning the values.

d.brown$corpus <- factor("brown")
d.brown$year <- factor("1961")
d.brown$lgvar <- factor("AE")

A summary shows the difference.

summary(d.brown)

##       verb         pos           register       corpus         year       
##  do     :  3384   VV :26291   G      :16329   brown:115919   1961:115919  
##  say    :  2780   VVD:27855   J      :16096                               
##  make   :  2360   VVG:17299   F      :10685                               
##  go     :  1808   VVN:27674   A      : 9364                               
##  take   :  1588   VVP: 8723   N      : 8825                               
##  come   :  1564   VVZ: 8077   P      : 8408                               
##  (Other):102435               (Other):46212                               
##  lgvar      
##  AE:115919  
##             
##             
##             
##             
##             
##
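
The assignment works because a single value is recycled to every row of the data frame. A minimal sketch on a made-up data frame (d.toy and the column corpus.chr are illustrative assumptions) shows the recycling and how an existing character column can be converted afterwards.

```r
# toy data frame (hypothetical values, for illustration only)
d.toy <- data.frame(verb = c("say", "take", "go"), stringsAsFactors = TRUE)

# a single factor value is recycled to every row
d.toy$corpus <- factor("brown")
# character version for comparison
d.toy$corpus.chr <- "brown"

class(d.toy$corpus)
levels(d.toy$corpus)
class(d.toy$corpus.chr)

# an existing character column can be converted afterwards
d.toy$corpus.chr <- as.factor(d.toy$corpus.chr)
class(d.toy$corpus.chr)
```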

Merging data frames to add additional features

What about the register codes? The single characters are not very informative, so let’s add a column with more descriptive names. The data set brown_registers.txt maps the character codes to more descriptive register names.

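# load data set with register information
# (stringsAsFactors=TRUE so that the columns are read as factors in R >= 4.0.0)
d.reg <- read.table("data/brown_registers.txt", header=TRUE, sep="\t", row.names=NULL, quote="", stringsAsFactors=TRUE)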
d.reg

##    register                            reg.long
## 1         A                     press.reportage
## 2         B                     press.editorial
## 3         C                       press.reviews
## 4         D              general.prose.religion
## 5         E general.prose.skills.trades.hobbies
## 6         F          general.prose.popular.lore
## 7         G                 general.prose.essay
## 8         H                  general.prose.misc
## 9         J                    academic.learned
## 10        K                     fiction.general
## 11        L                   fiction.adventure
## 12        M             fiction.science.fiction
## 13        N                     fiction.mistery
## 14        P                     fiction.romance
## 15        R                       fiction.humor

We can see that the registers fall into different types. Thus, we will add a variable for the register type (reg.type) to the data set.

# categorize registers using character codes
d.reg$reg.type[d.reg$register %in% c("A","B","C")] <- "press"
d.reg$reg.type[d.reg$register %in% c("D","E","F","G","H")] <- "prose"
d.reg$reg.type[d.reg$register %in% "J"] <- "academic"
# categorize registers using regular expression
d.reg$reg.type[grep("fiction",d.reg$reg.long)] <- "fiction"
d.reg$reg.type <- as.factor(d.reg$reg.type)

d.reg

##    register                            reg.long reg.type
## 1         A                     press.reportage    press
## 2         B                     press.editorial    press
## 3         C                       press.reviews    press
## 4         D              general.prose.religion    prose
## 5         E general.prose.skills.trades.hobbies    prose
## 6         F          general.prose.popular.lore    prose
## 7         G                 general.prose.essay    prose
## 8         H                  general.prose.misc    prose
## 9         J                    academic.learned academic
## 10        K                     fiction.general  fiction
## 11        L                   fiction.adventure  fiction
## 12        M             fiction.science.fiction  fiction
## 13        N                     fiction.mistery  fiction
## 14        P                     fiction.romance  fiction
## 15        R                       fiction.humor  fiction
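
With rule-based categorization like this, any register that no rule matches ends up as NA, so a quick check is worthwhile. The following sketch repeats the same rules on a made-up mapping that contains one hypothetical register code "Z" without a matching rule; d.toy, the code "Z" and the variable uncovered are assumptions for illustration.

```r
# toy mapping: a subset of the register table plus one hypothetical code "Z"
d.toy <- data.frame(
  register = c("A", "D", "J", "K", "Z"),
  reg.long = c("press.reportage", "general.prose.religion",
               "academic.learned", "fiction.general", "unknown"),
  stringsAsFactors = FALSE
)

# start with an explicit NA column so that uncovered rows are easy to spot
d.toy$reg.type <- NA_character_
d.toy$reg.type[d.toy$register %in% c("A", "B", "C")] <- "press"
d.toy$reg.type[d.toy$register %in% c("D", "E", "F", "G", "H")] <- "prose"
d.toy$reg.type[d.toy$register %in% "J"] <- "academic"
# grepl returns one logical value per row
d.toy$reg.type[grepl("fiction", d.toy$reg.long)] <- "fiction"

# rows that no rule matched are still NA
uncovered <- d.toy$register[is.na(d.toy$reg.type)]
print(uncovered)
table(d.toy$reg.type, useNA = "ifany")
```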

Merging data frames

Now we would like to add this information to our data set d.brown. As you can see, the data frames d.reg and d.brown both contain a column named register, which holds the same set of values. We will merge the two data sets on this column and overwrite d.brown.

# merge information of two data sets
d.brown <- merge(d.brown,d.reg)

As a result d.brown now additionally contains the columns reg.long and reg.type from the register data set.

head(d.brown)

##   register    verb pos corpus year lgvar        reg.long reg.type
## 1        A     say VVD  brown 1961    AE press.reportage    press
## 2        A produce VVN  brown 1961    AE press.reportage    press
## 3        A    take VVD  brown 1961    AE press.reportage    press
## 4        A     say VVD  brown 1961    AE press.reportage    press
## 5        A deserve VVZ  brown 1961    AE press.reportage    press
## 6        A conduct VVN  brown 1961    AE press.reportage    press

summary(d.brown)

##     register          verb         pos          corpus         year       
##  G      :16329   do     :  3384   VV :26291   brown:115919   1961:115919  
##  J      :16096   say    :  2780   VVD:27855                               
##  F      :10685   make   :  2360   VVG:17299                               
##  A      : 9364   go     :  1808   VVN:27674                               
##  N      : 8825   take   :  1588   VVP: 8723                               
##  P      : 8408   come   :  1564   VVZ: 8077                               
##  (Other):46212   (Other):102435                                           
##  lgvar                             reg.long         reg.type    
##  AE:115919   general.prose.essay       :16329   academic:16096  
##              academic.learned          :16096   fiction :36613  
##              general.prose.popular.lore:10685   press   :18745  
##              press.reportage           : 9364   prose   :44465  
##              fiction.mistery           : 8825                   
##              fiction.romance           : 8408                   
##              (Other)                   :46212
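
Note that merge moves the key column to the front and sorts the rows by the key, and that by default it keeps only rows that have a match in both data frames. The sketch below illustrates this on made-up data; d.x, d.map and the unmatched register code "Z" are assumptions for illustration. Comparing row counts before and after a merge is a simple way to notice lost rows.

```r
# toy data (hypothetical values); "Z" has no entry in the mapping
d.x <- data.frame(
  verb     = c("say", "take", "go", "come"),
  register = c("A", "J", "A", "Z"),
  stringsAsFactors = TRUE
)
d.map <- data.frame(
  register = c("A", "J"),
  reg.type = c("press", "academic"),
  stringsAsFactors = TRUE
)

# default: only rows with a match in both data frames are kept
d.inner <- merge(d.x, d.map, by = "register")
# all.x = TRUE keeps every row of d.x and fills missing values with NA
d.left <- merge(d.x, d.map, by = "register", all.x = TRUE)

nrow(d.x)
nrow(d.inner)
nrow(d.left)
```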

Another data set

Now we have to load and modify the data set from the FROWN corpus in the same way. This time we combine all commands in one chunk and save the new data set as d.frown.

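# load the data set
# (stringsAsFactors=TRUE so that the columns are read as factors in R >= 4.0.0)
d.frown <- read.table("data/distr_vfull_lemma-pos-reg_frown.txt", header=FALSE, sep="\t", row.names=NULL, quote="", stringsAsFactors=TRUE)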
# add column names
colnames(d.frown) <- c("verb","pos","register")
# add corpus information
d.frown$corpus <- factor("frown")
d.frown$year <- factor("1991")
d.frown$lgvar <- factor("AE")
# add register information
d.frown <- merge(d.frown,d.reg)
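
Since the same steps are repeated for every corpus, they could also be wrapped in a function. The following is only a sketch: load.corpus is a hypothetical helper that is not part of the tutorial, and the input file and register table are made-up toy data written to a temporary file.

```r
# hypothetical helper (not part of the tutorial): load one corpus file and
# add the meta information and register names in one step
load.corpus <- function(file, corpus, year, lgvar, registers) {
  d <- read.table(file, header = FALSE, sep = "\t", row.names = NULL,
                  quote = "", stringsAsFactors = TRUE)
  colnames(d) <- c("verb", "pos", "register")
  d$corpus <- factor(corpus)
  d$year <- factor(year)
  d$lgvar <- factor(lgvar)
  merge(d, registers)
}

# toy input written to a temporary file (made-up rows)
toy.file <- tempfile(fileext = ".txt")
writeLines(c("say\tVVD\tA", "take\tVVD\tJ"), toy.file)
toy.reg <- data.frame(register = c("A", "J"),
                      reg.long = c("press.reportage", "academic.learned"),
                      stringsAsFactors = TRUE)

d.toy <- load.corpus(toy.file, "frown", "1991", "AE", toy.reg)
print(d.toy)
```

Such a function would keep the processing of the individual corpora identical, which matters when the resulting data frames are combined later.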

Combining two data frames

We can now combine the two data sets (d.brown and d.frown) by concatenating them.

d.all <- rbind(d.brown,d.frown)

summary(d.all)

##     register          verb         pos          corpus         year       
##  G      :33220   do     :  6780   VV :53012   brown:115919   1961:115919  
##  J      :32378   say    :  6704   VVD:55961   frown:120475   1991:120475  
##  F      :21486   make   :  4413   VVG:35920                               
##  A      :20288   go     :  3645   VVN:52804                               
##  N      :17536   get    :  3134   VVP:19777                               
##  P      :17305   know   :  3092   VVZ:18920                               
##  (Other):94181   (Other):208626                                           
##  lgvar                             reg.long         reg.type    
##  AE:236394   general.prose.essay       :33220   academic:32378  
##              academic.learned          :32378   fiction :74220  
##              general.prose.popular.lore:21486   press   :39588  
##              press.reportage           :20288   prose   :90208  
##              fiction.mistery           :17536                   
##              fiction.romance           :17305                   
##              (Other)                   :94181
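
For rbind to work, both data frames need the same set of columns; for factor columns, the levels of both inputs are combined. A minimal sketch on made-up data (d.a, d.b and d.both are illustrative names, not tutorial objects):

```r
# two toy data frames with the same columns (hypothetical values)
d.a <- data.frame(verb = c("say", "go"), corpus = factor("brown"),
                  stringsAsFactors = TRUE)
d.b <- data.frame(verb = c("get", "say"), corpus = factor("frown"),
                  stringsAsFactors = TRUE)

# rbind matches columns by name and appends the rows of d.b to d.a
d.both <- rbind(d.a, d.b)

nrow(d.both)
# the factor now contains the levels of both inputs
levels(d.both$corpus)
table(d.both$corpus)
```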

Saving data sets in a file

For further use, we will save our newly created data set in the data directory as distr_vfull_lemma-pos-reg_brown-all.txt.

write.table(d.all,file = "data/distr_vfull_lemma-pos-reg_brown-all.txt",quote=FALSE,sep="\t",row.names = FALSE)
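
When the saved file is read again later, two things differ from the original input files: it now starts with a header line, and columns such as year are stored as plain text, so R guesses their class anew. The following sketch shows a round trip on made-up data using a temporary file; d.toy, out.file and d.back are assumptions for illustration.

```r
# toy data frame (hypothetical values, for illustration only)
d.toy <- data.frame(verb = c("say", "go"), pos = c("VVD", "VV"),
                    year = factor(c("1961", "1991")),
                    stringsAsFactors = TRUE)

# write the data in the same way as above, but to a temporary file
out.file <- tempfile(fileext = ".txt")
write.table(d.toy, file = out.file, quote = FALSE, sep = "\t",
            row.names = FALSE)

# the file starts with a header line, so header = TRUE is needed
d.back <- read.table(out.file, header = TRUE, sep = "\t", quote = "",
                     stringsAsFactors = TRUE)

colnames(d.back)
# year is read back as a number, not as a factor
class(d.back$year)
# convert it again if a categorical variable is needed
d.back$year <- factor(d.back$year)
```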

Exercise

Extract the corresponding data sets from the FLOB and the LOB corpus.
Manipulate the data sets in the same way as d.brown and d.frown.
Add the data sets to d.all using rbind.
Don’t forget to save the newly created data set in the data directory, overwriting distr_vfull_lemma-pos-reg_brown-all.txt.
