5.5 Managing Data Frames
A data frame is the most common way of storing data in R and, generally, is the data structure most often used for data analyses. Under the hood, a data frame is a list of equal-length vectors. Each element of the list can be thought of as a column and the length of each element of the list is the number of rows. As a result, data frames can store different classes of objects in each column (i.e. numeric, character, factor). In essence, the easiest way to think of a data frame is as an Excel worksheet that contains columns of different types of data but are all of equal length rows. In this section you will learn how to perform basic operations with a data frame to include:
Creating data frames
Adding on to data frames
Adding attributes to data frames
Subsetting data frames
5.5.1 Creating Data Frames
Data frames are usually created by reading in a dataset using the read.table()
or read.csv()
; this will be covered in the importing and scraping data tutorials. However, data frames can also be created explicitly with the data.frame()
function or they can be coerced from other types of objects like lists. In this case I’ll create a simple data frame df
and assess its basic structure:
df <- data.frame(col1 = 1:3,
col2 = c("this", "is", "text"),
col3 = c(TRUE, FALSE, TRUE),
col4 = c(2.5, 4.2, pi))
# assess the structure of a data frame
str(df)
'data.frame': 3 obs. of 4 variables:
$ col1: int 1 2 3
$ col2: Factor w/ 3 levels "is","text","this": 3 1 2
$ col3: logi TRUE FALSE TRUE
$ col4: num 2.5 4.2 3.14
# number of rows
nrow(df)
[1] 3
# number of columns
ncol(df)
[1] 4
Note how col2
in df
was converted to a column of factors. This is because there is a default setting in data.frame()
that converts character columns to factors. We can turn this off by setting the stringsAsFactors = FALSE
argument:
df <- data.frame(col1 = 1:3,
col2 = c("this", "is", "text"),
col3 = c(TRUE, FALSE, TRUE),
col4 = c(2.5, 4.2, pi),
stringsAsFactors = FALSE)
# note how col2 now is of a character class
str(df)
'data.frame': 3 obs. of 4 variables:
$ col1: int 1 2 3
$ col2: chr "this" "is" "text"
$ col3: logi TRUE FALSE TRUE
$ col4: num 2.5 4.2 3.14
We can also convert pre-existing structures to a data frame. The following illustrates how we can turn multiple vectors, a list, or a matrix into a data frame:
v1 <- 1:3
v2 <-c("this", "is", "text")
v3 <- c(TRUE, FALSE, TRUE)
# convert same length vectors to a data frame using data.frame()
data.frame(col1 = v1, col2 = v2, col3 = v3)
col1 | col2 | col3 |
---|---|---|
1 | this | TRUE |
2 | is | FALSE |
3 | text | TRUE |
# convert a list to a data frame using as.data.frame()
l <- list(item1 = 1:3, item2 = c("this", "is", "text"), item3 = c(2.5, 4.2, 5.1))
l
$item1
[1] 1 2 3
$item2
[1] "this" "is" "text"
$item3
[1] 2.5 4.2 5.1
as.data.frame(l)
item1 | item2 | item3 |
---|---|---|
1 | this | 2.5 |
2 | is | 4.2 |
3 | text | 5.1 |
1 | 5 | 9 |
2 | 6 | 10 |
3 | 7 | 11 |
4 | 8 | 12 |
V1 | V2 | V3 |
---|---|---|
1 | 5 | 9 |
2 | 6 | 10 |
3 | 7 | 11 |
4 | 8 | 12 |
5.5.2 Adding on to Data Frames
We can leverage the cbind()
function for adding columns to a data frame. Note that one of the objects being combined must already be a data frame otherwise cbind()
could produce a matrix.
col1 | col2 | col3 | col4 |
---|---|---|---|
1 | this | TRUE | 2.500000 |
2 | is | FALSE | 4.200000 |
3 | text | TRUE | 3.141593 |
col1 | col2 | col3 | col4 | v4 |
---|---|---|---|---|
1 | this | TRUE | 2.500000 | A |
2 | is | FALSE | 4.200000 | B |
3 | text | TRUE | 3.141593 | C |
We can also use the rbind()
function to add data frame rows together. However, severe caution should be taken because this can cause changes in the classes of the columns. For instance, our data frame df
currently consists of an integer, character, logical, and numeric variables.
col1 | col2 | col3 | col4 |
---|---|---|---|
1 | this | TRUE | 2.500000 |
2 | is | FALSE | 4.200000 |
3 | text | TRUE | 3.141593 |
str(df)
'data.frame': 3 obs. of 4 variables:
$ col1: int 1 2 3
$ col2: chr "this" "is" "text"
$ col3: logi TRUE FALSE TRUE
$ col4: num 2.5 4.2 3.14
If we attempt to add a row using rbind()
and c()
it converts all columns to a character class. This is because all elements in the vector created by c()
must be of the same class so they are all coerced to the character class which coerces all the variables in the data frame to the character class.
col1 | col2 | col3 | col4 |
---|---|---|---|
1 | this | TRUE | 2.5 |
2 | is | FALSE | 4.2 |
3 | text | TRUE | 3.14159265358979 |
4 | R | FALSE | 1.1 |
str(df2)
'data.frame': 4 obs. of 4 variables:
$ col1: chr "1" "2" "3" "4"
$ col2: chr "this" "is" "text" "R"
$ col3: chr "TRUE" "FALSE" "TRUE" "FALSE"
$ col4: chr "2.5" "4.2" "3.14159265358979" "1.1"
To add rows appropriately, we need to convert the items being added to a data frame and make sure the columns are the same class as the original data frame.
adding_df <- data.frame(col1 = 4,
col2 = "R",
col3 = FALSE,
col4 = 1.1,
stringsAsFactors = FALSE)
df3 <- rbind(df, adding_df)
df3
col1 | col2 | col3 | col4 |
---|---|---|---|
1 | this | TRUE | 2.500000 |
2 | is | FALSE | 4.200000 |
3 | text | TRUE | 3.141593 |
4 | R | FALSE | 1.100000 |
str(df3)
'data.frame': 4 obs. of 4 variables:
$ col1: num 1 2 3 4
$ col2: chr "this" "is" "text" "R"
$ col3: logi TRUE FALSE TRUE FALSE
$ col4: num 2.5 4.2 3.14 1.1
There are better ways to join data frames together than to use cbind()
and rbind()
. These are covered later on in the transforming your data with dplyr
tutorial.
5.5.3 Adding Attributes to Data Frames
Similar to matrices, data frames will have a dimension attribute. In addition, data frames can also have additional attributes such as row names, column names, and comments. We can illustrate with data frame df
.
col1 | col2 | col3 | col4 |
---|---|---|---|
1 | this | TRUE | 2.500000 |
2 | is | FALSE | 4.200000 |
3 | text | TRUE | 3.141593 |
dim(df)
[1] 3 4
## [1] 3 4
attributes(df)
$names
[1] "col1" "col2" "col3" "col4"
$class
[1] "data.frame"
$row.names
[1] 1 2 3
Currently df
does not have row names but we can add them with rownames()
:
col1 | col2 | col3 | col4 | |
---|---|---|---|---|
row1 | 1 | this | TRUE | 2.500000 |
row2 | 2 | is | FALSE | 4.200000 |
row3 | 3 | text | TRUE | 3.141593 |
attributes(df)
$names
[1] "col1" "col2" "col3" "col4"
$class
[1] "data.frame"
$row.names
[1] "row1" "row2" "row3"
We can also also change the existing column names by using colnames()
or names()
:
col_1 | col_2 | col_3 | col_4 | |
---|---|---|---|---|
row1 | 1 | this | TRUE | 2.500000 |
row2 | 2 | is | FALSE | 4.200000 |
row3 | 3 | text | TRUE | 3.141593 |
attributes(df)
$names
[1] "col_1" "col_2" "col_3" "col_4"
$class
[1] "data.frame"
$row.names
[1] "row1" "row2" "row3"
# add/change column names with names()
names(df) <- c("col.1", "col.2", "col.3", "col.4")
df
col.1 | col.2 | col.3 | col.4 | |
---|---|---|---|---|
row1 | 1 | this | TRUE | 2.500000 |
row2 | 2 | is | FALSE | 4.200000 |
row3 | 3 | text | TRUE | 3.141593 |
attributes(df)
$names
[1] "col.1" "col.2" "col.3" "col.4"
$class
[1] "data.frame"
$row.names
[1] "row1" "row2" "row3"
Lastly, just like vectors, lists, and matrices, we can add a comment to a data frame without affecting how it operates.
5.5.4 Subsetting Data Frames
Data frames possess the characteristics of both lists and matrices: if you subset with a single vector, they behave like lists and will return the selected columns with all rows; if you subset with two vectors, they behave like matrices and can be subset by row and column:
col.1 | col.2 | col.3 | col.4 | |
---|---|---|---|---|
row1 | 1 | this | TRUE | 2.500000 |
row2 | 2 | is | FALSE | 4.200000 |
row3 | 3 | text | TRUE | 3.141593 |
col.1 | col.2 | col.3 | col.4 | |
---|---|---|---|---|
row2 | 2 | is | FALSE | 4.200000 |
row3 | 3 | text | TRUE | 3.141593 |
col.1 | col.2 | col.3 | col.4 | |
---|---|---|---|---|
row2 | 2 | is | FALSE | 4.200000 |
row3 | 3 | text | TRUE | 3.141593 |
col.2 | col.4 | |
---|---|---|
row1 | this | 2.500000 |
row2 | is | 4.200000 |
row3 | text | 3.141593 |
col.2 | col.4 | |
---|---|---|
row1 | this | 2.500000 |
row2 | is | 4.200000 |
row3 | text | 3.141593 |
col.1 | col.3 | |
---|---|---|
row1 | 1 | TRUE |
row2 | 2 | FALSE |
col.1 | col.2 | col.4 | |
---|---|---|---|
row1 | 1 | this | 2.500000 |
row2 | 2 | is | 4.200000 |
row3 | 3 | text | 3.141593 |
Note that subsetting data frames with the [
operator will simplify14 the results to the lowest possible dimension. To avoid this you can introduce the drop = FALSE
argument:
# simplifying results in a named vector
df[, 2]
[1] "this" "is" "text"
# preserving results in a 3x1 data frame
df[, 2, drop = FALSE]
col.2 | |
---|---|
row1 | this |
row2 | is |
row3 | text |
You can also subset data frames based on conditional statements. To illustrate we’ll use the built in mtcars
data frame:
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
If we want to subset mtcars
for all rows where mpg
is greater than 20 we can perform this in two ways:
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
We can add on to the conditional statement if we want to filter for multiple conditions. You can see how the subset()
function helps to simplify the process by only requiring you to state the data frame once and then directly call the variables to perform the condition on.
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
And if we want to perform this filtering along with return only specified columns we simply state the columns we want to return.
mpg | cyl | wt | |
---|---|---|---|
Mazda RX4 | 21.0 | 6 | 2.620 |
Mazda RX4 Wag | 21.0 | 6 | 2.875 |
Hornet 4 Drive | 21.4 | 6 | 3.215 |
mpg | cyl | wt | |
---|---|---|---|
Mazda RX4 | 21.0 | 6 | 2.620 |
Mazda RX4 Wag | 21.0 | 6 | 2.875 |
Hornet 4 Drive | 21.4 | 6 | 3.215 |
5.5.5 Applying fucntions to data frames
5.5.5.1 The apply()
function
The apply()
function is most often used to apply a function to the rows or columns (margins) of matrices or data frames. However, it can be used with general arrays, for example, to take the average of an array of matrices. Using apply()
is not faster than using a loop function, but it is highly compact and can be written in one line.
The syntax for apply()
is as follows where
x
is the matrix, dataframe or arrayMARGIN
is a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns.FUN
is the function to be applied...
is for any other arguments to be passed to the function
To provide examples let’s use the mtcars
data set provided in R:
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
# get the mean of each column
apply(mtcars, 2, mean)
mpg cyl disp hp drat wt
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
qsec vs am gear carb
17.848750 0.437500 0.406250 3.687500 2.812500
# get the sum of each row (not really relevant for this data
# but it illustrates the capability)
apply(mtcars, 1, sum)
Mazda RX4 Mazda RX4 Wag Datsun 710
328.980 329.795 259.580
Hornet 4 Drive Hornet Sportabout Valiant
426.135 590.310 385.540
Duster 360 Merc 240D Merc 230
656.920 270.980 299.570
Merc 280 Merc 280C Merc 450SE
350.460 349.660 510.740
Merc 450SL Merc 450SLC Cadillac Fleetwood
511.500 509.850 728.560
Lincoln Continental Chrysler Imperial Fiat 128
726.644 725.695 213.850
Honda Civic Toyota Corolla Toyota Corona
195.165 206.955 273.775
Dodge Challenger AMC Javelin Camaro Z28
519.650 506.085 646.280
Pontiac Firebird Fiat X1-9 Porsche 914-2
631.175 208.215 272.570
Lotus Europa Ford Pantera L Ferrari Dino
273.683 670.690 379.590
Maserati Bora Volvo 142E
694.710 288.890
# get column quantiles (notice the quantile percents as row names)
apply(mtcars, 2, quantile, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
10% | 14.340 | 4 | 80.610 | 66.0 | 3.007 | 1.95550 | 15.5340 | 0 | 0 | 3 | 1 |
25% | 15.425 | 4 | 120.825 | 96.5 | 3.080 | 2.58125 | 16.8925 | 0 | 0 | 3 | 2 |
50% | 19.200 | 6 | 196.300 | 123.0 | 3.695 | 3.32500 | 17.7100 | 0 | 0 | 4 | 2 |
75% | 22.800 | 8 | 326.000 | 180.0 | 3.920 | 3.61000 | 18.9000 | 1 | 1 | 4 | 4 |
90% | 30.090 | 8 | 396.000 | 243.5 | 4.209 | 4.04750 | 19.9900 | 1 | 1 | 5 | 4 |
5.5.5.2 Other Useful “apply-like” Functions
In addition to the apply
family which provide vectorized functions that minimize your need to explicitly create loops, there are also a few commonly applied apply
functions that have been further simplified. These include the calculation of column and row sums, means, medians, standard deviations, variances, and summary quantiles across the entire data set.
The most common apply functions that have been include calculating the sums and means of columns and rows. For instance, to calculate the sum of columns across a data frame or matrix you could do the following:
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
apply(mtcars, 2, sum)
mpg cyl disp hp drat wt qsec vs
642.900 198.000 7383.100 4694.000 115.090 102.952 571.160 14.000
am gear carb
13.000 118.000 90.000
However, you can perform the same function with the shorter colSums()
function, plus it performs faster:
colSums(mtcars)
mpg cyl disp hp drat wt qsec vs
642.900 198.000 7383.100 4694.000 115.090 102.952 571.160 14.000
am gear carb
13.000 118.000 90.000
To illustrate the speed difference we can compare the performance of using the apply()
function versus the colSums()
function on a matrix with 100 million values (10K x 10K). You can see that the speed of colSums()
is significantly faster.
# develop a 10,000 x 10,000 matrix
mat = matrix(sample(1:10, size=100000000, replace=TRUE), nrow=10000)
system.time(apply(mat, 2, sum))
## user system elapsed
## 1.544 0.329 1.879
system.time(colSums(mat))
## user system elapsed
## 0.126 0.000 0.127
Base R provides the following simplified apply
functions:
colSums (x, na.rm = FALSE)
rowSums (x, na.rm = FALSE)
colMeans(x, na.rm = FALSE)
rowMeans(x, na.rm = FALSE)
In addition, the following functions are provided through the specified packages:
miscTools
package (note that these functions will work on data frames)colMedians()
rowMedians()
matrixStats
package (note that these functions only operate on matrices)colMedians()
&rowMedians()
colSds()
&rowSds()
colVar()
&rowVar()
colRanges()
&rowRanges()
colQuantiles()
&rowQuantiles()
- along with several additional summary statistic functions
In addition, the summary()
function will provide relevant summary statistics over each column of data frames and matrices. Note in the the example that follows that for the first four columns of the iris
data set the summary statistics include min, med, mean, max, and 1st & 3rd quantiles. Whereas the last column (Species
) only provides the total count since this is a factor variable.
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
Min. :4.300 | Min. :2.000 | Min. :1.000 | Min. :0.100 | setosa :50 | |
1st Qu.:5.100 | 1st Qu.:2.800 | 1st Qu.:1.600 | 1st Qu.:0.300 | versicolor:50 | |
Median :5.800 | Median :3.000 | Median :4.350 | Median :1.300 | virginica :50 | |
Mean :5.843 | Mean :3.057 | Mean :3.758 | Mean :1.199 | NA | |
3rd Qu.:6.400 | 3rd Qu.:3.300 | 3rd Qu.:5.100 | 3rd Qu.:1.800 | NA | |
Max. :7.900 | Max. :4.400 | Max. :6.900 | Max. :2.500 | NA |
Its important to understand the difference between simplifying and preserving subsetting. Simplifying subsets returns the simplest possible data structure that can represent the output. Preserving subsets keeps the structure of the output the same as the input. See Hadley Wickham’s section on Simplifying vs. Preserving Subsetting to learn more.↩