Problems and solutions

Premise

The idea of this document is to present common (and often confusing) grievances I encounter with R and how I go about resolving them.

Contents

Extracting a vector from a data frame of characters

Problem

I often want to loop through a data frame and extract a vector from it. This is fine if that data frame contains numbers. For example, the following code generates a data frame of numbers, extracting the first row creates a vector equal to a vector of 1 to 5.

numbers.df <- data.frame (c1=rep (1, 5),
                          c2=rep (2, 5),
                          c3=rep (3, 5),
                          c4=rep (4, 5),
                          c5=rep (5, 5))
row.element <- 1:5
all (row.element %in% numbers.df[1,])  # TRUE

But if I switch to characters, it doesn't work.

letters.df <- data.frame (c1=rep ('a', 5),
                          c2=rep ('b', 5),
                          c3=rep ('c', 5),
                          c4=rep ('d', 5),
                          c5=rep ('e', 5))
row.element <- c ('a', 'b', 'c', 'd', 'e')
all (row.element %in% letters.df[1,])  # FALSE!

How should I resolve this? Printing letters.df[1,] returns this:

> letters.df[1,]
  c1 c2 c3 c4 c5
1  a  b  c  d  e

So perhaps if I convert the vector into characters it will work.

all (row.element %in% as.character (as.vector (letters.df[1, ])))  # FALSE

No, because this has simply converted the vector into a vector of 1s.

> as.character (as.vector (letters.df[1, ]))
[1] "1" "1" "1" "1" "1"

Solution 1

The problem is to do with levels in the data frame. When we extract our row we also need to drop the data frame class. (By default, [] preserve the data.frame class even though its a single row). This returns a list so we need to then convert this into a vector using unlist(). Now there are still levels but these refer to levels within the vector rather than the original data.frame. To drop these we can use as.character().

res <- as.character (unlist (letters.df[1, ,drop=TRUE]))
all (row.element %in% res)  # TRUE

Solution 2

The above solution is complicated. The problem is to do with levels. Therefore, another solution is to force all strings not to be made into factors when we create our data frame.

letters.df <- data.frame (c1=rep ('a', 5),
                          c2=rep ('b', 5),
                          c3=rep ('c', 5),
                          c4=rep ('d', 5),
                          c5=rep ('e', 5),
                          stringsAsFactors=FALSE)
all (row.element %in% letters.df[1,])  # TRUE!

Now the data.frame acts in the same way that numbers.df did. This is the easiest and most straightforward solution, however, it does come at the cost of losing factor functionality (e.g. tapply()) on the data frame.