4.3 Dealing with Factors
Factors are used to represent categorical data and can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label. In fact, factors are built on top of integer vectors using two attributes: the class(), “factor”, which makes them behave differently from regular integer vectors, and the levels(), which define the set of allowed values. Factors are important in statistical modeling and are treated specially by modeling functions like lm() and glm(). This section will provide you with the basics of managing categorical data as factors.
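As a rough illustration of that special treatment, the sketch below fits lm() to a small made-up data set (the group and y values are hypothetical and not from the text). With the default contrasts, the model expands an unordered factor into one indicator coefficient per non-reference level rather than treating the underlying integer codes as numbers.

```r
# hypothetical toy data, invented for illustration only
group <- factor(c("a", "a", "b", "b", "c", "c"))
y <- c(1.0, 1.2, 2.1, 1.9, 3.0, 3.2)

# lm() expands the factor into indicator (dummy) variables,
# using the first level ("a") as the reference category
fit <- lm(y ~ group)
coef_names <- names(coef(fit))
coef_names
# "(Intercept)" "groupb" "groupc"
```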
4.3.1 Creating, Converting & Inspecting Factors
Factor objects can be created with the factor() function:
# create a factor from a character vector
gender <- factor(c("male", "female", "female", "male", "female"))
gender
[1] male female female male female
Levels: female male
# inspect to see if it is a factor class
class(gender)
[1] "factor"
# show that factors are just built on top of integers
typeof(gender)
[1] "integer"
# See the underlying representation of factor
unclass(gender)
[1] 2 1 1 2 1
attr(,"levels")
[1] "female" "male"
# what are the factor levels?
levels(gender)
[1] "female" "male"
# show summary of counts
summary(gender)
female male
3 2
If we have a vector of character strings or integers, we can easily convert it to a factor with as.factor() or factor().
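A minimal sketch of such conversions, using base R only. The vectors treatment, codes, and years_f are made-up examples, not data from the text. The last step shows a common pitfall: as.integer() on a factor returns the internal codes, not the original values.

```r
# character vector -> factor (levels are the sorted unique values)
treatment <- c("control", "drug", "control", "drug")
treatment_f <- as.factor(treatment)
levels(treatment_f)    # "control" "drug"

# integer codes -> labelled factor
codes <- c(1L, 2L, 2L, 1L, 3L)
codes_f <- factor(codes, levels = 1:3, labels = c("low", "middle", "high"))
as.character(codes_f)  # "low" "middle" "middle" "low" "high"

# converting back: go through character when the levels are numbers,
# because as.integer() returns the internal codes
years_f <- factor(c(2010, 2012, 2010))
as.integer(years_f)                # 1 2 1 (codes, not years)
as.numeric(as.character(years_f))  # 2010 2012 2010
```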
4.3.2 Ordering, Revaluing, & Dropping Factor Levels
We can easily order, revalue, and drop factor levels as the following illustrates.
4.3.2.1 Ordering Levels
When creating a factor we can control the ordering of the levels by using the levels argument:
# when not specified, the levels default to alphabetical order
gender <- factor(c("male", "female", "female", "male", "female"))
gender
[1] male female female male female
Levels: female male
# specifying order
gender <- factor(c("male", "female", "female", "male", "female"),
levels = c("male", "female"))
gender
[1] male female female male female
Levels: male female
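To see why level order matters in practice, the following sketch (base R only) shows that the order determines the underlying integer codes and the order in which table() reports counts. It also uses relevel(), which is not covered in the text above, as one way to move a single level to the front.

```r
gender <- factor(c("male", "female", "female", "male", "female"),
                 levels = c("male", "female"))

# the integer codes follow the level order: male = 1, female = 2
gender_codes <- as.integer(gender)
gender_codes                  # 1 2 2 1 2

# tabulations are reported in level order
table_order <- names(table(gender))
table_order                   # "male" "female"

# relevel() moves one level to the first position
releveled <- relevel(gender, ref = "female")
levels(releveled)             # "female" "male"
```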
We can also create ordinal factors, in which the levels have a specific order, by using the ordered = TRUE argument. The ordering is reflected in the printed levels, shown below as low < middle < high:
ses <- c("low", "middle", "low", "low", "low", "low", "middle", "low", "middle",
"middle", "middle", "middle", "middle", "high", "high", "low", "middle",
"middle", "low", "high")
# create ordinal levels
ses <- factor(ses, levels = c("low", "middle", "high"), ordered = TRUE)
ses
[1] low middle low low low low middle low middle middle
[11] middle middle middle high high low middle middle low high
Levels: low < middle < high
# you can also reverse the order of levels if desired
factor(ses, levels=rev(levels(ses)))
[1] low middle low low low low middle low middle middle
[11] middle middle middle high high low middle middle low high
Levels: high < middle < low
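One practical consequence of an ordered factor is that comparison operators and functions such as max() respect the level order. The sketch below uses a shortened, made-up vector (ses_small) rather than the full ses data above.

```r
# shortened illustrative vector, not the full data from the text
ses_small <- factor(c("low", "middle", "high", "low"),
                    levels = c("low", "middle", "high"),
                    ordered = TRUE)

is.ordered(ses_small)         # TRUE

# comparisons use the level order, not alphabetical order
at_least_middle <- ses_small >= "middle"
at_least_middle               # FALSE TRUE TRUE FALSE

# max() returns the highest level that is present
highest <- as.character(max(ses_small))
highest                       # "high"
```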
4.3.2.2 Revalue Levels
To recode factor levels I usually use the revalue() function from the plyr package.
plyr::revalue(ses, c("low" = "small", "middle" = "medium", "high" = "large"))
[1] small medium small small small small medium small medium medium
[11] medium medium medium large large small medium medium small large
Levels: small < medium < large
☛ Using the :: notation allows you to access the revalue() function without having to attach the plyr package with library().
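If plyr is not available, a similar recoding can be sketched in base R by assigning to levels(). This is an alternative to the revalue() approach above, not a replacement for it. The lookup vector and the shortened ses_small data are illustrative assumptions.

```r
# shortened illustrative vector, not the full data from the text
ses_small <- factor(c("low", "middle", "high", "low"),
                    levels = c("low", "middle", "high"),
                    ordered = TRUE)

# named lookup: old level -> new level (hypothetical helper vector)
lookup <- c(low = "small", middle = "medium", high = "large")

# replace the level labels; the codes and the ordering are kept
recoded <- ses_small
levels(recoded) <- unname(lookup[levels(recoded)])

as.character(recoded)         # "small" "medium" "large" "small"
levels(recoded)               # "small" "medium" "large"
```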
4.3.2.3 Dropping Levels
When you want to drop unused factor levels, use droplevels():