5.5 Managing Data Frames

A data frame is the most common way of storing data in R and, generally, is the data structure most often used for data analyses. Under the hood, a data frame is a list of equal-length vectors. Each element of the list can be thought of as a column and the length of each element of the list is the number of rows. As a result, data frames can store different classes of objects in each column (i.e. numeric, character, factor). In essence, the easiest way to think of a data frame is as an Excel worksheet that contains columns of different types of data but are all of equal length rows. In this section you will learn how to perform basic operations with a data frame to include:

  • Creating data frames

  • Adding on to data frames

  • Adding attributes to data frames

  • Subsetting data frames

5.5.1 Creating Data Frames

Data frames are usually created by reading in a dataset using the read.table() or read.csv(); this will be covered in the importing and scraping data tutorials. However, data frames can also be created explicitly with the data.frame() function or they can be coerced from other types of objects like lists. In this case I’ll create a simple data frame df and assess its basic structure:

Note how col2 in df was converted to a column of factors. This is because there is a default setting in data.frame() that converts character columns to factors. We can turn this off by setting the stringsAsFactors = FALSE argument:

We can also convert pre-existing structures to a data frame. The following illustrates how we can turn multiple vectors, a list, or a matrix into a data frame:

col1 col2 col3
1 this TRUE
2 is FALSE
3 text TRUE
item1 item2 item3
1 this 2.5
2 is 4.2
3 text 5.1
1 5 9
2 6 10
3 7 11
4 8 12
V1 V2 V3
1 5 9
2 6 10
3 7 11
4 8 12

5.5.2 Adding on to Data Frames

We can leverage the cbind() function for adding columns to a data frame. Note that one of the objects being combined must already be a data frame otherwise cbind() could produce a matrix.

col1 col2 col3 col4
1 this TRUE 2.500000
2 is FALSE 4.200000
3 text TRUE 3.141593
col1 col2 col3 col4 v4
1 this TRUE 2.500000 A
2 is FALSE 4.200000 B
3 text TRUE 3.141593 C

We can also use the rbind() function to add data frame rows together. However, severe caution should be taken because this can cause changes in the classes of the columns. For instance, our data frame df currently consists of an integer, character, logical, and numeric variables.

col1 col2 col3 col4
1 this TRUE 2.500000
2 is FALSE 4.200000
3 text TRUE 3.141593

If we attempt to add a row using rbind() and c() it converts all columns to a character class. This is because all elements in the vector created by c() must be of the same class so they are all coerced to the character class which coerces all the variables in the data frame to the character class.

col1 col2 col3 col4
1 this TRUE 2.5
2 is FALSE 4.2
3 text TRUE 3.14159265358979
4 R FALSE 1.1

To add rows appropriately, we need to convert the items being added to a data frame and make sure the columns are the same class as the original data frame.

col1 col2 col3 col4
1 this TRUE 2.500000
2 is FALSE 4.200000
3 text TRUE 3.141593
4 R FALSE 1.100000

There are better ways to join data frames together than to use cbind() and rbind(). These are covered later on in the transforming your data with dplyr tutorial.

5.5.3 Adding Attributes to Data Frames

Similar to matrices, data frames will have a dimension attribute. In addition, data frames can also have additional attributes such as row names, column names, and comments. We can illustrate with data frame df.

col1 col2 col3 col4
1 this TRUE 2.500000
2 is FALSE 4.200000
3 text TRUE 3.141593

Currently df does not have row names but we can add them with rownames():

col1 col2 col3 col4
row1 1 this TRUE 2.500000
row2 2 is FALSE 4.200000
row3 3 text TRUE 3.141593

We can also also change the existing column names by using colnames() or names():

col_1 col_2 col_3 col_4
row1 1 this TRUE 2.500000
row2 2 is FALSE 4.200000
row3 3 text TRUE 3.141593
col.1 col.2 col.3 col.4
row1 1 this TRUE 2.500000
row2 2 is FALSE 4.200000
row3 3 text TRUE 3.141593

Lastly, just like vectors, lists, and matrices, we can add a comment to a data frame without affecting how it operates.

5.5.4 Subsetting Data Frames

Data frames possess the characteristics of both lists and matrices: if you subset with a single vector, they behave like lists and will return the selected columns with all rows; if you subset with two vectors, they behave like matrices and can be subset by row and column:

col.1 col.2 col.3 col.4
row1 1 this TRUE 2.500000
row2 2 is FALSE 4.200000
row3 3 text TRUE 3.141593
col.1 col.2 col.3 col.4
row2 2 is FALSE 4.200000
row3 3 text TRUE 3.141593
col.1 col.2 col.3 col.4
row2 2 is FALSE 4.200000
row3 3 text TRUE 3.141593
col.2 col.4
row1 this 2.500000
row2 is 4.200000
row3 text 3.141593
col.2 col.4
row1 this 2.500000
row2 is 4.200000
row3 text 3.141593
col.1 col.3
row1 1 TRUE
row2 2 FALSE
col.1 col.2 col.4
row1 1 this 2.500000
row2 2 is 4.200000
row3 3 text 3.141593

Note that subsetting data frames with the [ operator will simplify14 the results to the lowest possible dimension. To avoid this you can introduce the drop = FALSE argument:

col.2
row1 this
row2 is
row3 text

You can also subset data frames based on conditional statements. To illustrate we’ll use the built in mtcars data frame:

mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

If we want to subset mtcars for all rows where mpg is greater than 20 we can perform this in two ways:

mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

We can add on to the conditional statement if we want to filter for multiple conditions. You can see how the subset() function helps to simplify the process by only requiring you to state the data frame once and then directly call the variables to perform the condition on.

mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1

And if we want to perform this filtering along with return only specified columns we simply state the columns we want to return.

mpg cyl wt
Mazda RX4 21.0 6 2.620
Mazda RX4 Wag 21.0 6 2.875
Hornet 4 Drive 21.4 6 3.215
mpg cyl wt
Mazda RX4 21.0 6 2.620
Mazda RX4 Wag 21.0 6 2.875
Hornet 4 Drive 21.4 6 3.215

5.5.5 Applying fucntions to data frames

5.5.5.1 The apply() function

The apply() function is most often used to apply a function to the rows or columns (margins) of matrices or data frames. However, it can be used with general arrays, for example, to take the average of an array of matrices. Using apply() is not faster than using a loop function, but it is highly compact and can be written in one line.

The syntax for apply() is as follows where

  • x is the matrix, dataframe or array
  • MARGIN is a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns.
  • FUN is the function to be applied
  • ... is for any other arguments to be passed to the function

To provide examples let’s use the mtcars data set provided in R:

mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mpg cyl disp hp drat wt qsec vs am gear carb
10% 14.340 4 80.610 66.0 3.007 1.95550 15.5340 0 0 3 1
25% 15.425 4 120.825 96.5 3.080 2.58125 16.8925 0 0 3 2
50% 19.200 6 196.300 123.0 3.695 3.32500 17.7100 0 0 4 2
75% 22.800 8 326.000 180.0 3.920 3.61000 18.9000 1 1 4 4
90% 30.090 8 396.000 243.5 4.209 4.04750 19.9900 1 1 5 4

5.5.5.2 Other Useful “apply-like” Functions

In addition to the apply family which provide vectorized functions that minimize your need to explicitly create loops, there are also a few commonly applied apply functions that have been further simplified. These include the calculation of column and row sums, means, medians, standard deviations, variances, and summary quantiles across the entire data set.

The most common apply functions that have been include calculating the sums and means of columns and rows. For instance, to calculate the sum of columns across a data frame or matrix you could do the following:

mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

However, you can perform the same function with the shorter colSums() function, plus it performs faster:

To illustrate the speed difference we can compare the performance of using the apply() function versus the colSums() function on a matrix with 100 million values (10K x 10K). You can see that the speed of colSums() is significantly faster.

Base R provides the following simplified apply functions:

  • colSums (x, na.rm = FALSE)
  • rowSums (x, na.rm = FALSE)
  • colMeans(x, na.rm = FALSE)
  • rowMeans(x, na.rm = FALSE)

In addition, the following functions are provided through the specified packages:

  • miscTools package (note that these functions will work on data frames)
    • colMedians()
    • rowMedians()
  • matrixStats package (note that these functions only operate on matrices)
    • colMedians() & rowMedians()
    • colSds() & rowSds()
    • colVar() & rowVar()
    • colRanges() & rowRanges()
    • colQuantiles() & rowQuantiles()
    • along with several additional summary statistic functions

In addition, the summary() function will provide relevant summary statistics over each column of data frames and matrices. Note in the the example that follows that for the first four columns of the iris data set the summary statistics include min, med, mean, max, and 1st & 3rd quantiles. Whereas the last column (Species) only provides the total count since this is a factor variable.

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 NA
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 NA
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 NA


  1. Its important to understand the difference between simplifying and preserving subsetting. Simplifying subsets returns the simplest possible data structure that can represent the output. Preserving subsets keeps the structure of the output the same as the input. See Hadley Wickham’s section on Simplifying vs. Preserving Subsetting to learn more.