Day 4

Slightly Advanced Aggregations: calculating ratios with ddply

Remember to read in and subset our sample_data dataframe to create anambra, katsina, and isample_data dataframes:

library(plyr)
sample_data <- read.csv("sample_health_facilities.csv", stringsAsFactors = F)
anambra <- subset(sample_data, state == "Anambra")
katsina <- subset(sample_data, state == "Katsina")
isample_data <- idata.frame(sample_data)

Lets look at something similar to what we looked at in Day 3: ratio of nurses to doctors per LGA. This is not an important indicator, but ratio of students to teacher is (pupil to teacher ratio). We are looking at nurses to doctors to avoid having to load another dataset. The concept is the same.

First, lets do this at a single LGA. What is the ratio for Katsina?

katsina[, c("num_doctors_fulltime", "num_nurses_fulltime")]
##    num_doctors_fulltime num_nurses_fulltime
## 3                     0                   1
## 18                    0                   0
## 24                    0                   0
## 38                    0                   0
## 39                    0                   0
## 42                   NA                  NA

Simple, right? 1/0 or Inf.

Okay, what is the nurses to doctor ratio for Anambra?

anambra[, c("num_doctors_fulltime", "num_nurses_fulltime")]
##    num_doctors_fulltime num_nurses_fulltime
## 2                    NA                   2
## 6                     1                   0
## 17                    1                   2
## 27                    1                   0

Remember that in survey data, we treat NA as “missing value”. The value could be 0, it could be 1, it could be 100. The ratio of nurses to doctors in anambra, for this small sample, is actually 2 / 3 == 0.6667. Because the number of doctors in the first facility is unknown, we have to drop that entire row from our sample before calculating the ratio.

What do you think of the following calculation?

x <- ddply(sample_data, "state", summarize, nurse_per_doctor_ratio = sum(num_nurses_fulltime, 
    na.rm = T)/sum(num_doctors_fulltime, na.rm = T))

Okay, now we know that the ratio calculation is incorrect because we are not eliminating the entire row of any NA value. How should we implement it correctly?

The ratio() function

This kind of calculation has to be done a few times in NMIS. For special cases like this, we have written convenience functions. In fact, it even handles the cases of needing to filter out just a subset of the facilities:

ratio <- function(numerator_col, denominator_col, filter) {
    df <- data.frame(cbind(num = numerator_col, den = denominator_col))
    df <- na.omit(df[filter, ])
    if (nrow(df) == 0 | sum(df$den) == 0) {
        return(NA)
    }
    return(sum(df$num)/sum(df$den))
}

my_summary <- ddply(isample_data, .(state), function(df) {
    data.frame(nurse_per_doctor_ratio = ratio(df$num_nurses_fulltime, df$num_doctors_fulltime), 
        nurse_per_doctor_ratio_public = ratio(df$num_nurses_fulltime, df$num_doctors_fulltime, 
            df$management == "public"))
})

head(my_summary)
##         state nurse_per_doctor_ratio nurse_per_doctor_ratio_public
## 1        Abia                     NA                            NA
## 2     Adamawa                 2.0000                             2
## 3     Anambra                 0.6667                             0
## 4      Bauchi                     NA                            NA
## 5       Benue                     NA                            NA
## 6 Cross River                     NA                            NA

The value for nurse_per_doctor_ratio is now the same as we calculated before: 2 / 3 == 0.6667. Let's take a look at the actual data to make sure that nurse_per_doctor_ratio_public is also correct:

anambra[c("num_doctors_fulltime", "num_nurses_fulltime", "management")]
##    num_doctors_fulltime num_nurses_fulltime management
## 2                    NA                   2     public
## 6                     1                   0     public
## 17                    1                   2       <NA>
## 27                    1                   0       <NA>

The bool_proportion() function

Great. As you may have guessed, we have to get around the same NA problem with logical indicators. Let's look at the proportion of facilities that have c-sections as an example i.e. c_section_yn

# necessary for bool_proportion function
icount <- function(predicate) {
    counts <- table(predicate)
    if ("TRUE" %in% names(counts)) {
        counts["TRUE"]
    } else {
        0
    }
}

bool_proportion <- function(numerator_TF, denominator_TF) {
    if (is.null(numerator_TF) | is.null(denominator_TF)) {
        print("bool_proportion called on empty column")
        NA
    } else {
        if (class(numerator_TF) == "character") {
            if (length(c(which(str_detect(numerator_TF, ignore.case("yes|no|true|false"))), 
                which(is.na(numerator_TF))))/length(numerator_TF) > 0.4) {
                numerator_TF <- as.logical(recodeVar(tolower(numerator_TF), 
                  src = list(c("yes", "true"), c("no", "false")), tgt = list(TRUE, 
                    FALSE), default = NA, keep.na = T))
            } else {
                warning("Cannot recode Boolean value, check the data first!")
            }
        } else if (class(denominator_TF) == "character") {
            if (length(c(which(str_detect(denominator_TF, ignore.case("yes|no|true|false"))), 
                which(is.na(denominator_TF))))/length(denominator_TF) > 0.4) {
                denominator_TF <- as.logical(recodeVar(tolower(denominator_TF), 
                  src = list(c("yes", "true"), c("no", "false")), tgt = list(TRUE, 
                    FALSE), default = NA, keep.na = T))
            } else {
                warning("Cannot recode Boolean value, check the data first!")
            }
        }
        df <- data.frame(cbind(num = numerator_TF, den = denominator_TF))
        df <- na.omit(df)
        icount(df$num & df$den)/icount(df$den)
    }
}

my_summary2 <- ddply(isample_data, .(state), function(df) {
    data.frame(facilities_with_csection = bool_proportion(df$c_section_yn, TRUE))
})

head(my_summary2)
##         state facilities_with_csection
## 1        Abia                      1.0
## 2     Adamawa                      0.0
## 3     Anambra                      0.5
## 4      Bauchi                      0.0
## 5       Benue                      0.0
## 6 Cross River                      0.0

The proportion of facilities with c-sections for Anambra was calculated as .50. This is confirmed when we look at the actual data:

anambra[c("c_section_yn")]
##    c_section_yn
## 2         FALSE
## 6         FALSE
## 17         TRUE
## 27         TRUE

Data Cleaning:

Type Conversion

Type conversion can be forced by as.* functions. Common * types you'd encounter are:

  1. numeric
  2. integer
  3. character
  4. logical
    Sometimes you'll encounter factor variables, we recommend using as.character() function to convert it into character type before proceeding.
my_numbers <- c("1", "2", "3", "4", "TRUE")
my_numbers
## [1] "1"    "2"    "3"    "4"    "TRUE"
as.numeric(my_numbers)
## Warning: NAs introduced by coercion
## [1]  1  2  3  4 NA
as.logical(my_numbers)
## [1]   NA   NA   NA   NA TRUE

String Searching

grep() is a useful fucntion used to efficiently browse data. This can be done by the index, or actual strings of the pattern you are searching for:

my_strings = c("Hello", "World", "Foo")
grep(pattern = "l", x = my_strings, ignore.case = FALSE)
## [1] 1 2

# when value argument is set to true, grep() returns the actual strings
# matchs the patterns
grep(pattern = "l", x = my_strings, ignore.case = FALSE, value = TRUE)
## [1] "Hello" "World"

# once comfortable with the syntax of grep(), you may write in the arguments
# directly:
grep("l", my_strings, value = T)
## [1] "Hello" "World"

Quite often, grep() is used on the column/variable names of a dataset. Be sure to include the names() function in your grep() search if you wish to do so:

grep("num", names(sample_data), value = T)
## [1] "num_nurses_fulltime"    "num_lab_techs_fulltime"
## [3] "num_doctors_fulltime"

Similar to grep(), the function str_detect() is useful for browsing data. The main difference, apart from the argument syntax, is that str_detect() returns logical values for all elements of the string:

# str_detect() is part of the stringr library
library(stringr)

my_strings
## [1] "Hello" "World" "Foo"
str_detect(my_strings, "l")
## [1]  TRUE  TRUE FALSE

String Manipulation

To find and replace a pattern in a list, we use two functions. The first is gsub(), short for global sub, which replaces all the occurance of a matching pattern. The other is sub(), which only replaces the first appearance of the pattern.


my_strings
## [1] "Hello" "World" "Foo"
sub(pattern = "o", replacement = "X", x = my_strings)
## [1] "HellX" "WXrld" "FXo"
gsub(pattern = "o", replacement = "X", x = my_strings)
## [1] "HellX" "WXrld" "FXX"

There are times when you do not wish to find and replace; only replace. A useful function for this is revalue():

revalue(my_strings, c(Hello = "How Now?", Foo = "No Wahala"))
## [1] "How Now?"  "World"     "No Wahala"

Altering the case of strings can be done by using either toupper() or tolower():

my_strings
## [1] "Hello" "World" "Foo"
toupper(my_strings)
## [1] "HELLO" "WORLD" "FOO"
tolower(my_strings)
## [1] "hello" "world" "foo"

Creating strings that require concatination of several parts can be done using paste():

paste("hello", "world", "foo", sep = ",")
## [1] "hello,world,foo"

Writing out data

On Day 3, we learned that there are multiple ways to read data into R; depending on format. One quick follow up to that is the nrows argument with read.csv() that can be used to specify the number of rows from the file you are reading in:

read.csv("Health_661_Merged.csv", stringsAsFactors = F, nrows = 200)

Similarly for writing out data, there are different functions depending on the format.

If the desired output is csv format, use the write.csv() format. It may be helpful to know a common argument that is used to avoid writing out the extra rownames column: row.names=F.

# write.csv(your data, 'the location you wish to write your data')
write.csv(sample_data, "./my_output.csv", row.names = F)

If you are using R exclusively, we recommend using the RDS format for speed purposes (since it is stored as a binary file). Write RDS files using the saveRDS() function. Because it is a binary file, there is no need to use the row.names=F argument.

saveRDS(sample_data, "./my_output.RDS")