Day 3

Review

Assignment review:

Aggregations in R:

There are many functions that can do aggregations for you in R; we will cover ddply() from the plyr package in this tutorial. This is also the function that we have found most useful when writing aggregate indicators for NMIS.

library(plyr)

Basic ddply: summarise

Let's say that we want to calculate the total number of doctors in the different states for which we have data. The aggregations provided by ddply make this easy. The syntax for ddply is:

ddply(input_data_frame, group-by_variables, function, parameters)

The example below is similar to one you will find yourself using quite frequently:

my_summary <- ddply(sample_data, # input_data_frame
                    "state", # group-by variable
                    summarise, # function. in this case, we use summarise
                        counts = length(lga_id),
                        total_num_nurses = sum(num_nurses_fulltime, na.rm=T),
                        avg_num_doctors = mean(num_doctors_fulltime, na.rm=T),
                        avg_c_section_rate = mean(c_section_yn, na.rm=T))
my_summary
##          state counts total_num_nurses avg_num_doctors avg_c_section_rate
## 1         Abia      1                0        308.0000             1.0000
## 2      Adamawa      1                2          1.0000             0.0000
## 3      Anambra      4                4          1.0000             0.5000
## 4       Bauchi      2                0          0.0000             0.0000
## 5        Benue      1                0          0.0000             0.0000
## 6  Cross River      3                3          0.0000             0.0000
## 7        Delta      4               10          0.5000             0.2500
## 8          Edo      1                0          0.0000             0.0000
## 9        Ekiti      1                2          1.0000             0.0000
## 10         Imo      3                8          0.0000             0.3333
## 11      Jigawa      1                0          0.0000             0.0000
## 12      Kaduna      1                5          0.0000             0.0000
## 13        Kano      1                0          0.0000             0.0000
## 14     Katsina      6                1          0.0000             0.0000
## 15       Kebbi      2                0          0.0000             0.0000
## 16        Kogi      1                1          0.0000             0.0000
## 17       Lagos      3                4          1.3333             0.6667
## 18       Niger      3                0          0.0000             0.0000
## 19        Ogun      3                0          0.6667             0.3333
## 20        Osun      2                6          0.5000             0.5000
## 21     Plateau      2                0          0.0000             0.0000
## 22      Rivers      1                2          2.0000             0.0000
## 23      Taraba      2                0          0.5000             0.5000
## 24     Zamfara      1                0          0.0000             0.0000

Understanding ddply

What is ddply doing here? It does something quite complicated but useful. Let's try to repeat what ddply did above for two states in Nigeria.

Exercise 1: Before you knew ddply, if I said that I wanted the total number of doctors in Anambra and Katsina states, how would you get it? Hint: make a dataframe named anambra and a dataframe named katsina.

Exercise 2: Now, let's say I want a two-row, two-column data.frame. The column names are state and total_num_doctors, and each row gives the total number of doctors for one state. Please do this without using ddply. If you have extra time, you can also do it with ddply. Check your work by manually counting the number of doctors in sample_data.

Exercise 3: On the flipboard, make a block diagram of the process that you went through to make this happen. We will do this as a group.

Explanation of how ddply works

If you did exercise 2 correctly, here are the steps you probably took:
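Those steps can be sketched in base R. The data.frame here (toy_data) is a made-up stand-in for sample_data, with matching column names but invented values:

```r
# Hypothetical stand-in for sample_data: column names match, values are made up
toy_data <- data.frame(
    state = c("Anambra", "Anambra", "Katsina"),
    num_doctors_fulltime = c(1, 2, 0)
)

# 1. Split: one small data.frame per state
anambra <- subset(toy_data, state == "Anambra")
katsina <- subset(toy_data, state == "Katsina")

# 2. Apply: compute a one-row summary for each piece
anambra_row <- data.frame(state = "Anambra",
    total_num_doctors = sum(anambra$num_doctors_fulltime, na.rm = TRUE))
katsina_row <- data.frame(state = "Katsina",
    total_num_doctors = sum(katsina$num_doctors_fulltime, na.rm = TRUE))

# 3. Combine: rbind the per-state rows back into one data.frame
result <- rbind(anambra_row, katsina_row)
result
```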

This pattern of data manipulation is called “split-apply-combine” by the author of plyr, Hadley Wickham. ddply and associated functions make it easier to perform this pattern of data manipulation, which turns out to be extremely useful. The equivalent of what we did above in ddply is the following:

my_summary <- ddply(sample_data, # input_data_frame
                    "state", # group-by variable
                    summarise, # function. in this case, we use summarise
                        total_num_doctors = sum(num_doctors_fulltime, na.rm=T))
head(my_summary)
##         state total_num_doctors
## 1        Abia               308
## 2     Adamawa                 1
## 3     Anambra                 3
## 4      Bauchi                 0
## 5       Benue                 0
## 6 Cross River                 0

The difference is that we have now done our calculation for ALL states. As we walk through the ddply, it is useful to see what summarize does for us; can you describe what summarize does based on the output below?

anambra <- subset(sample_data, state == "Anambra")
example <- summarize(anambra, total_num_doctors = sum(num_doctors_fulltime, 
    na.rm = T))
str(example)
## 'data.frame':    1 obs. of  1 variable:
##  $ total_num_doctors: int 3
example
##   total_num_doctors
## 1                 3

Okay, back to our ddply. The group-by variable state tells ddply to take our data.frame and make many small-small dataframes, one for each state: one for anambra, one for abia, one for katsina, and so on. Then, ddply “applies” the function given at the end of the ddply command; this function should return a dataframe, like summarize does. Finally, ddply takes all of those small-small NEW dataframes and rbinds them together into one new large dataframe, adding a column called state whose value it knows from the subsetting process (anambra's state is Anambra, duh!).

Awesome, or what?

Notes on ddply usage:

  1. You must provide at least one group-by variable.
  2. You must specify what type of aggregation you want to perform: either summarize, or your own function (we'll see this below).
  3. To understand more, in addition to this tutorial, you should consider looking at the package document, the plyr website, or this tutorial on plyr.
Exercise
  1. Calculate the total number of public facilities per zone in our sample dataset. Your output should look like the following:
##            zone num_public_facilities
## 1 North-Central                     6
## 2     Northeast                     3
## 3     Northwest                    10
## 4   South-South                     6
## 5     Southeast                     4
## 6     Southwest                     4

Hint: what does sum(c(TRUE, FALSE, TRUE, TRUE, FALSE)) return?
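Running the hint in an R console confirms that summing a logical vector counts its TRUE values:

```r
# TRUE is treated as 1 and FALSE as 0 when summed
sum(c(TRUE, FALSE, TRUE, TRUE, FALSE))  # 3
```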

  2. Calculate the mean and standard deviation of number of doctors in each zone in Nigeria. Your output should look like the following:
##            zone avg_num_doctors standard_deviation_doctors
## 1 North-Central          0.0000                     0.0000
## 2     Northeast          0.4000                     0.5477
## 3     Northwest          0.0000                     0.0000
## 4   South-South          0.4444                     0.8819
## 5     Southeast         44.4286                   116.2252
## 6     Southwest          0.8889                     1.0541

Remember that facility with the 308 doctors?

More ddply

You can use multiple by variables to perform an aggregation. For example, we can use both “state” and “lga” below:

my_summary <- ddply(sample_data, c("state", "lga"), summarise, counts = length(lga_id), 
    total_num_nurses = sum(num_nurses_fulltime, na.rm = T), avg_num_doctors = mean(num_doctors_fulltime, 
        na.rm = T), avg_c_section_rate = mean(c_section_yn, na.rm = T))
head(my_summary)
##     state           lga counts total_num_nurses avg_num_doctors
## 1    Abia Umuahia North      1                0             308
## 2 Adamawa      Shelleng      1                2               1
## 3 Anambra       Anaocha      1                2             NaN
## 4 Anambra      Ayamelum      1                0               1
## 5 Anambra      Ekwusigo      1                0               1
## 6 Anambra        Ihiala      1                2               1
##   avg_c_section_rate
## 1                  1
## 2                  0
## 3                  0
## 4                  0
## 5                  1
## 6                  1

ddply also allows a special .() syntax, so that you don't have to put your column names in strings.

my_summary <- ddply(sample_data, .(state, lga), summarise, counts = length(lga_id), 
    total_num_nurses = sum(num_nurses_fulltime, na.rm = T), avg_num_doctors = mean(num_doctors_fulltime, 
        na.rm = T), avg_c_section_rate = mean(c_section_yn, na.rm = T))
head(my_summary)
##     state           lga counts total_num_nurses avg_num_doctors
## 1    Abia Umuahia North      1                0             308
## 2 Adamawa      Shelleng      1                2               1
## 3 Anambra       Anaocha      1                2             NaN
## 4 Anambra      Ayamelum      1                0               1
## 5 Anambra      Ekwusigo      1                0               1
## 6 Anambra        Ihiala      1                2               1
##   avg_c_section_rate
## 1                  1
## 2                  0
## 3                  0
## 4                  0
## 5                  1
## 6                  1

Question: What is this summary of? Could you use a single group-by variable to get the same result? When might you want to use two variables instead of one?

User defined functions in ddply

You are allowed to, and in fact will need to, use your own function (instead of summarize) in ddply. The syntax is like any function definition in R; we will get to functions later on. For now, please pay attention to the syntax, and note that when used with ddply, your function must take a data.frame and return a data.frame. This is because, as we said above, ddply at the end rbinds a whole lot of small-small data.frames. Using the syntax below, the function returns a new data.frame for each group.

my_summary <- ddply(sample_data, "state", function(df) {
    data.frame(counts = length(df$lga_id), total_num_nurses = sum(df$num_nurses_fulltime, 
        na.rm = T), avg_num_doctors = mean(df$num_doctors_fulltime, na.rm = T), 
        avg_c_section_rate = mean(df$c_section_yn, na.rm = T))
})
head(my_summary)
##         state counts total_num_nurses avg_num_doctors avg_c_section_rate
## 1        Abia      1                0             308                1.0
## 2     Adamawa      1                2               1                0.0
## 3     Anambra      4                4               1                0.5
## 4      Bauchi      2                0               0                0.0
## 5       Benue      1                0               0                0.0
## 6 Cross River      3                3               0                0.0

How is this different from the above, which used summarize? Why do you think this is?

Using idata.frame

When you are working with bigger datasets, there is a type of data.frame called idata.frame that is more efficient than a data.frame for aggregation. By default, R makes copies of the dataframes during a ddply operation (the same way we made copies of our data for anambra and katsina). To keep R from making these copies, and so work more efficiently, we can create an idata.frame from our data frame and then perform exactly the same operations as before.

If ddply calls are starting to take a long time, you should think about using idata.frame. Note that the cost of idata.frames is slightly more complex code; certain functions don't work with idata.frames. One such function is summarize: we ALWAYS NEED THE FUNCTION FORMAT when using ddply with idata.frames. Please read the idata.frame documentation for further instructions. An example:

isample_data <- idata.frame(sample_data)
my_summary <- ddply(isample_data, "state", function(df) {
    data.frame(counts = length(df$lga_id), total_num_nurses = sum(df$num_nurses_fulltime, 
        na.rm = T), avg_num_doctors = mean(df$num_doctors_fulltime, na.rm = T), 
        avg_c_section_rate = mean(df$c_section_yn, na.rm = T))
})
head(my_summary)
##         state counts total_num_nurses avg_num_doctors avg_c_section_rate
## 1        Abia      1                0             308                1.0
## 2     Adamawa      1                2               1                0.0
## 3     Anambra      4                4               1                0.5
## 4      Bauchi      2                0               0                0.0
## 5       Benue      1                0               0                0.0
## 6 Cross River      3                3               0                0.0

In the end, however, we get a data.frame back. Can someone remind me how to check if this is correct?

Advanced R: timing your functions.

To demonstrate the speedup, let's time the output of running ddply on an idata.frame versus a data.frame. We'll replicate a simple ddply call a thousand times so we can notice a difference. Note that the difference grows larger as the datasets grow larger.

system.time(replicate(1000, ddply(isample_data, "state", nrow)))

system.time(replicate(1000, ddply(sample_data, "state", nrow)))

(Make sure plyr is loaded with library(plyr) first; otherwise you will get the error could not find function "ddply". Exact timings vary by machine, but the idata.frame version should come out faster.)

Question: what result do we calculate, before replicating it many times and timing it?

Exercise: How would you calculate the proportion of c_section_yn==TRUE records versus total non-NA records in each state?

##         state num_c_sections total_non_na
## 1        Abia              1            1
## 2     Adamawa              0            1
## 3     Anambra              2            4
## 4      Bauchi              0            2
## 5       Benue              0            1
## 6 Cross River              0            3

Slightly advanced: calculating ratios with ddply

Let's look at something similar to what we looked at at the beginning of this exercise: the ratio of nurses to doctors per state. This is not an important indicator itself, but the ratio of students to teachers (the pupil-to-teacher ratio) is. We are looking at nurses to doctors to avoid having to load another dataset; the concept is the same.

First, let's do this for a single state. What is the ratio for Katsina?

katsina[, c("num_doctors_fulltime", "num_nurses_fulltime")]
##    num_doctors_fulltime num_nurses_fulltime
## 3                     0                   1
## 18                    0                   0
## 24                    0                   0
## 38                    0                   0
## 39                    0                   0
## 42                   NA                  NA

Simple, right? 1 nurse and 0 doctors, so the ratio is undefined; this is why the ratio function we write below returns NA when the denominator sums to zero.

Okay, what is the nurses to doctor ratio for Anambra?

anambra[, c("num_doctors_fulltime", "num_nurses_fulltime")]
##    num_doctors_fulltime num_nurses_fulltime
## 2                    NA                   2
## 6                     1                   0
## 17                    1                   2
## 27                    1                   0

Remember that in survey data, we treat NA as “missing value”. The value could be 0, it could be 1, it could be 100. The ratio of nurses to doctors in anambra, for this small sample, is actually 2 / 3 == 0.6667. Because the number of doctors in the first facility is unknown, we have to drop that entire row from our sample before calculating the ratio.

What do you think of the following calculation?

x <- ddply(sample_data, "state", summarize, nurse_per_doctor_ratio = sum(num_nurses_fulltime, 
    na.rm = T)/sum(num_doctors_fulltime, na.rm = T))

Okay, now we know that it is incorrect. How do we implement it correctly?
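To see concretely why the per-column na.rm sums go wrong, here is the anambra example redone with plain vectors (values copied from the table above):

```r
doctors <- c(NA, 1, 1, 1)
nurses  <- c(2, 0, 2, 0)

# Naive approach: na.rm drops the missing doctor count, but the 2 nurses
# from that same incomplete row stay in the numerator
sum(nurses, na.rm = TRUE) / sum(doctors, na.rm = TRUE)  # 4/3 -- too high

# Correct approach: drop the whole incomplete row first
complete <- !is.na(doctors) & !is.na(nurses)
sum(nurses[complete]) / sum(doctors[complete])  # 2/3 == 0.6667
```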

The ratio function

This kind of calculation has to be done a few times in NMIS. For special cases like this, we have written convenience functions. In fact, the one below even handles the case of needing to filter out just some of the facilities:

# returns sum(numerator_col) / sum(denominator_col) after subsetting by
# filter, which should be a predicate vector (ie, a list of TRUE / FALSE),
# and also drops NAs in the process. filter defaults to TRUE (keep all
# rows) so the function can be called without it.
ratio <- function(numerator_col, denominator_col, filter = TRUE) {
    df <- data.frame(num = numerator_col, den = denominator_col)
    df <- na.omit(df[filter, ])
    if (nrow(df) == 0 || sum(df$den) == 0) {
        return(NA)
    }
    return(sum(df$num)/sum(df$den))
}
my_summary <- ddply(isample_data, .(state), function(df) {
    data.frame(nurse_per_doctor_ratio = ratio(df$num_nurses_fulltime, df$num_doctors_fulltime), 
        nurse_per_doctor_ratio_public = ratio(df$num_nurses_fulltime, df$num_doctors_fulltime, 
            df$management == "public"))
})
head(my_summary)
##         state nurse_per_doctor_ratio nurse_per_doctor_ratio_public
## 1        Abia                     NA                            NA
## 2     Adamawa                 2.0000                             2
## 3     Anambra                 0.6667                             0
## 4      Bauchi                     NA                            NA
## 5       Benue                     NA                            NA
## 6 Cross River                     NA                            NA
anambra[c("num_doctors_fulltime", "num_nurses_fulltime", "management")]
##    num_doctors_fulltime num_nurses_fulltime management
## 2                    NA                   2     public
## 6                     1                   0     public
## 17                    1                   2       <NA>
## 27                    1                   0       <NA>

Outlier Cleaning

Outliers were thoroughly cleaned from the NMIS data set before running the analysis portion of our R scripts. In this section, we will explore both the outlier detection and replacement processes.

Reading in Our Special Data: RDS files

Please download the pre-outlier cleaned education data set here: edu data.

As you can see, the file you downloaded is in .RDS format. This file type can only be opened using R. The main purpose of this file format is speed: R reads and writes .RDS files significantly faster than text formats such as CSV, which matters when working with larger data sets such as NMIS.

Let's take this opportunity now to learn two new commands that are nearly identical to read.csv() and write.csv(): readRDS() and saveRDS().

edu_normalized <- readRDS("~/Dropbox/Nigeria/Nigeria 661 Baseline Data Cleaning/in_process_data/Normalized/Education_774_normalized.rds")
saveRDS(edu_normalized, "~/Desktop/Education_Normalized_unchanged.RDS")
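A quick round trip through a temporary file shows the pair in action; unlike CSV, an RDS file preserves column types exactly (toy data here, not the NMIS set):

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
path <- tempfile(fileext = ".RDS")   # throwaway file for the demo
saveRDS(df, path)                    # analogous to write.csv
df2 <- readRDS(path)                 # analogous to read.csv
identical(df, df2)                   # TRUE: nothing lost in the round trip
```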

Outlier Detection Example

We can start with a simple example of outlier removal using logic checks. Frequently, we run basic logic checks on the data as part of the cleaning process.

# usage: outlierreplace(df, c, rowpredicate)
edu_normalized <- outlierreplace(edu_normalized, "num_tchrs_male", (edu_normalized$num_tchrs_male > 
    edu_normalized$num_tchrs_total))

The Function: outlierreplace()

Now that you've seen an example of the function, let's step back and break down the outlierreplace function line by line. Because our functions are not in an official R package, you cannot access a help manual for them the way you learned on Day 1.


# stringr library is necessary for our function
library(stringr)

# the function
outlierreplace = function(df, c, rowpredicate, replaceVal = NA) {
    naCount1 <- sum(is.na(df[, c]))  # NAs in column c before replacement
    df[, c] <- replace(df[, c], rowpredicate, replaceVal)
    naCount2 <- sum(is.na(df[, c]))  # NAs after replacement
    print(str_c(naCount2 - naCount1, " outliers replaced for field: ", c))
    df  # return the modified data.frame
}

head(edu_normalized$num_tchrs_male > edu_normalized$num_tchrs_total)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
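To see the whole function at work on toy data (a made-up three-row data.frame, not the NMIS set), here is the same replace-and-count logic; this sketch uses base R's paste0 instead of str_c so it runs without stringr:

```r
# Same logic as outlierreplace above, with paste0 standing in for str_c
outlierreplace_demo <- function(df, c, rowpredicate, replaceVal = NA) {
    naCount1 <- sum(is.na(df[, c]))
    df[, c] <- replace(df[, c], rowpredicate, replaceVal)
    naCount2 <- sum(is.na(df[, c]))
    print(paste0(naCount2 - naCount1, " outliers replaced for field: ", c))
    df
}

toy <- data.frame(num_tchrs = c(2, 5, 400))
toy <- outlierreplace_demo(toy, "num_tchrs", toy$num_tchrs > 100)
# prints "1 outliers replaced for field: num_tchrs"
toy$num_tchrs  # 2 5 NA
```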

Questions:

Outlier Removal with a Numerical Threshold

By setting a certain threshold, we can eliminate a range of outliers with one use of the function.


summary(edu_normalized$num_students_female)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0      50     105     204     218 1000000     344
edu_normalized <- outlierreplace(edu_normalized, "num_students_female", (edu_normalized$num_students_female > 
    3000))
## [1] "49 outliers replaced for field: num_students_female"
summary(edu_normalized$num_students_female)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0      50     105     187     217    3000     393

The alternative would be to write a predicate for each individual outlying value, i.e. rowpredicate being edu_normalized$num_students_female == 1000000, and so on for every outlier, which would be tedious.

Notice that there may be multiple variables in the rowpredicate conditions.


edu_normalized <- outlierreplace(edu_normalized, "num_js_female", (edu_normalized$num_js_female > 
    1250 & edu_normalized$num_classrms_total < 25 & edu_normalized$num_tchrs_total < 
    10))
## [1] "3 outliers replaced for field: num_js_female"

Upon inspection of the summary statistics for certain columns, outliers can become obvious.


summary(edu_normalized$num_exercise_books_per_student_pry)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0      84     262     724     714   70800   63993
summary(edu_normalized$num_exercise_books_per_student_jss)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0      60     570    1800    1730   34700   68474

Question: Where should we set a threshold?

A Graphical Example

To find a more precise threshold than using the summary/table method, we can plot our data.

library(ggplot2)

ggplot(edu_normalized, aes(x = zone, y = num_exercise_books_per_student_pry, 
    fill = zone)) + geom_boxplot() + coord_flip() + ylab("Number of Books per Student") + 
    xlab("Zone")  #+ scale_y_continuous(limits=c(0,3000))

[boxplot of num_exercise_books_per_student_pry by zone]

The above graph demonstrates the magnitude of the outliers present for the num_exercise_books_per_student_pry variable. By adjusting the scale of our graph, we can get a more precise picture of where to establish our threshold:

library(ggplot2)

ggplot(edu_normalized, aes(x = zone, y = num_exercise_books_per_student_pry, 
    fill = zone)) + geom_boxplot() + coord_flip() + ylab("Number of Books per Student") + 
    xlab("Zone") + scale_y_continuous(limits = c(0, 10000))

[boxplot of num_exercise_books_per_student_pry by zone, y-axis limited to 0-10000]

Exercise

  1. Calculate zonal means for the number of teachers in a school.
  2. Look at the data, both the zonal means as well as boxplots and summaries; decide on a threshold for outliers, and replace all numbers above that threshold with NA.
  3. Re-calculate the zonal means.