4.6 Dealing with Missing Values

A common task in data analysis is dealing with missing values. In R, missing values are usually represented by NA, although some data sets mark them with a sentinel value instead (e.g. 99). Either form is easy to work with, and in this section you will learn how to:

  • Test for missing values
  • Recode missing values
  • Exclude missing values

4.6.1 Test for missing values

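To identify missing values use is.na(), which returns a logical vector (or, for a matrix or data frame, a logical matrix) with TRUE in the positions that contain NA. is.na() works on vectors, lists, matrices, and data frames. Applied to a data frame with four rows and four columns, for example, the result looks like this: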

 col1  col2  col3  col4
FALSE FALSE FALSE FALSE
FALSE  TRUE FALSE FALSE
FALSE FALSE FALSE FALSE
 TRUE FALSE FALSE  TRUE
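As a minimal sketch, the calls below show is.na() on a vector and on a data frame. The vector x and the data frame df are illustrative assumptions; df was chosen so that its missing values fall in the same positions as the TRUE entries in the output above.

```r
# Illustrative vector with two missing values (assumed data)
x <- c(1:4, NA, 6:7, NA)
x_missing <- is.na(x)
print(x_missing)

# Illustrative data frame (assumed data) with NAs in rows 2 and 4
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)

# On a data frame, is.na() returns a logical matrix of the same dimensions
df_missing <- is.na(df)
print(df_missing)
```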

To identify the location or the number of NAs, wrap is.na() in which() or sum(): which(is.na(x)) returns the positions of the missing values in x, and sum(is.na(x)) counts them.
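A short sketch of both calls, using an assumed example vector x. The column-wise count with colSums() is an addition for illustration and relies on is.na() returning a logical matrix for a data frame; the data frame df is likewise assumed.

```r
# Illustrative vector with NAs in positions 5 and 8 (assumed data)
x <- c(1:4, NA, 6:7, NA)

# Positions of the missing values
na_positions <- which(is.na(x))
print(na_positions)

# Number of missing values (TRUE counts as 1 when summed)
na_count <- sum(is.na(x))
print(na_count)

# Per-column counts for a small illustrative data frame
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c(2.5, NA, 3.2, NA))
na_per_column <- colSums(is.na(df))
print(na_per_column)
```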

4.6.2 Recode missing values

To recode missing values, or to recode specific indicators that represent missing values, we can use normal subsetting and assignment operations. For example, we can replace the missing values in a vector x with the mean of its non-missing values by subsetting the vector to identify the NAs and then assigning a value to those elements: x[is.na(x)] <- mean(x, na.rm = TRUE). Similarly, if missing values are represented by another value (e.g. 99), we can subset the data for the elements that contain that value and assign the desired value to them. Recoding 99 to NA in a two-column data frame, for instance, gives:

col1 col2
   1  2.5
   2  4.2
   3   NA
  NA  3.2
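The two recoding patterns described above can be sketched as follows. The vector x and the data frame df are assumed example data; df was chosen so that recoding 99 to NA gives the values shown in the output above.

```r
# Illustrative vector with missing values (assumed data)
x <- c(1:4, NA, 6:7, NA)

# Replace each NA with the mean of the non-missing values
x[is.na(x)] <- mean(x, na.rm = TRUE)
print(round(x, 2))

# Illustrative data frame in which 99 marks a missing value (assumed data)
df <- data.frame(col1 = c(1:3, 99),
                 col2 = c(2.5, 4.2, 99, 3.2))

# Recode every 99 to NA
df[df == 99] <- NA
print(df)
```

Note that this recodes every cell equal to 99, so it is only appropriate when 99 cannot occur as a genuine observation.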

4.6.3 Exclude missing values

We can exclude missing values in a couple of different ways. First, to exclude missing values from mathematical operations, use the na.rm = TRUE argument. If you do not exclude these values, summary functions such as mean() and sum() will return NA.
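A brief sketch of the difference, using an assumed example vector x:

```r
# Illustrative vector with missing values (assumed data)
x <- c(1:4, NA, 6:7, NA)

# Without na.rm the NA propagates to the result
mean_with_na <- mean(x)
print(mean_with_na)

# With na.rm = TRUE the NAs are dropped before the calculation
mean_without_na <- mean(x, na.rm = TRUE)
sum_without_na <- sum(x, na.rm = TRUE)
print(mean_without_na)
print(sum_without_na)
```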

We may also want to subset our data to obtain complete observations, that is, the observations (rows) that contain no missing data. We can do this in a few different ways. Consider the following data frame:

col1 col2  col3 col4
   1 this  TRUE  2.5
   2   NA FALSE  4.2
   3   is  TRUE  3.2
  NA text  TRUE   NA

First, to find complete cases we can use the complete.cases() function, which returns a logical vector identifying the rows that are complete cases. In this example rows 1 and 3 are complete. We can use this vector to subset the data frame, which returns the rows for which complete.cases() is TRUE (first output below); negating the vector with ! returns the rows that contain missing values instead (second output).

  col1 col2 col3 col4
1    1 this TRUE  2.5
3    3   is TRUE  3.2

  col1 col2  col3 col4
2    2   NA FALSE  4.2
4   NA text  TRUE   NA
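The two outputs above can be reproduced with a sketch along these lines. The data frame df is reconstructed from the printed values and should be read as an assumption about the original data.

```r
# Data frame reconstructed from the printed output (assumed data)
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)

# Logical vector: TRUE for rows with no missing values
complete_rows <- complete.cases(df)
print(complete_rows)

# Rows without any NA
df_complete <- df[complete_rows, ]
print(df_complete)

# Rows with at least one NA
df_incomplete <- df[!complete_rows, ]
print(df_incomplete)
```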

A shorthand alternative is to use na.omit(), which omits all rows containing missing values.

  col1 col2 col3 col4
1    1 this TRUE  2.5
3    3   is TRUE  3.2
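A minimal sketch of the na.omit() call, again using a data frame reconstructed from the printed values (an assumption about the original data):

```r
# Data frame reconstructed from the printed output (assumed data)
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)

# Drop every row that contains at least one NA
df_no_na <- na.omit(df)
print(df_no_na)

# The indices of the removed rows are kept in the "na.action" attribute
removed_rows <- attr(df_no_na, "na.action")
print(as.integer(removed_rows))
```

The rows kept are the same as those returned by df[complete.cases(df), ]; the main difference is the extra attribute recording which rows were dropped.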