Assignment review:

Using lgas.csv: delete your pop_density_2006 column. Now, re-create a new column called hundred_people_per_sq_km. For example, if an LGA has a population of 1000 and an area of 10 sq km, hundred_people_per_sq_km should be one.
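A minimal sketch of one way to do this. Note that the column names pop_2006 and area_sq_km are guesses; check the real names in your file with names(lgas):

```r
# remove the old column, then compute hundreds of people per square km
lgas$pop_density_2006 <- NULL
lgas$hundred_people_per_sq_km <- (lgas$pop_2006 / lgas$area_sq_km) / 100
```

With a population of 1000 and an area of 10 sq km, the density is 100 people per sq km, so the new column is 1, as required.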
Using sample_data: calculate the number of nurses per doctor in every facility. Let's look at the results. What do you notice?

There are many functions that can do aggregations for you in R; we will cover ddply() from the plyr package in this tutorial. This is also the function that we have found most useful when writing aggregate indicators for NMIS.
library(plyr)
Let's say that we want to calculate the total number of doctors in the different states for which we have data. The aggregations provided by ddply make this very easy. The syntax for ddply is:
ddply(input_data_frame, group-by_variables, function, parameters)
The example below is similar to one you will find yourself using quite frequently:
my_summary <- ddply(sample_data, # input_data_frame
"state", # group-by variable
summarise, # function. in this case, we use summarise
counts = length(lga_id),
total_num_nurses = sum(num_nurses_fulltime, na.rm=T),
avg_num_doctors = mean(num_doctors_fulltime, na.rm=T),
avg_c_section_rate = mean(c_section_yn, na.rm=T))
my_summary
## state counts total_num_nurses avg_num_doctors avg_c_section_rate
## 1 Abia 1 0 308.0000 1.0000
## 2 Adamawa 1 2 1.0000 0.0000
## 3 Anambra 4 4 1.0000 0.5000
## 4 Bauchi 2 0 0.0000 0.0000
## 5 Benue 1 0 0.0000 0.0000
## 6 Cross River 3 3 0.0000 0.0000
## 7 Delta 4 10 0.5000 0.2500
## 8 Edo 1 0 0.0000 0.0000
## 9 Ekiti 1 2 1.0000 0.0000
## 10 Imo 3 8 0.0000 0.3333
## 11 Jigawa 1 0 0.0000 0.0000
## 12 Kaduna 1 5 0.0000 0.0000
## 13 Kano 1 0 0.0000 0.0000
## 14 Katsina 6 1 0.0000 0.0000
## 15 Kebbi 2 0 0.0000 0.0000
## 16 Kogi 1 1 0.0000 0.0000
## 17 Lagos 3 4 1.3333 0.6667
## 18 Niger 3 0 0.0000 0.0000
## 19 Ogun 3 0 0.6667 0.3333
## 20 Osun 2 6 0.5000 0.5000
## 21 Plateau 2 0 0.0000 0.0000
## 22 Rivers 1 2 2.0000 0.0000
## 23 Taraba 2 0 0.5000 0.5000
## 24 Zamfara 1 0 0.0000 0.0000
What is ddply doing here? It does something quite complicated but useful. Let's try to repeat what ddply did above for two states in Nigeria.
Exercise 1: Before you knew ddply, if I said that I wanted the total number of doctors in Anambra and Katsina states, how would you do it? Hint: make a dataframe named anambra and a dataframe called katsina.
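One possible solution sketch, using subset() as in earlier sessions:

```r
anambra <- subset(sample_data, state == "Anambra")
katsina <- subset(sample_data, state == "Katsina")
# total full-time doctors in each state, ignoring missing values
sum(anambra$num_doctors_fulltime, na.rm = TRUE)
sum(katsina$num_doctors_fulltime, na.rm = TRUE)
```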
Exercise 2: Now, let's say I want a two-row, two-column data.frame. The column names are state and total_num_doctors, and each row is the total number of doctors for one state. Please do this without using ddply. If you have extra time, you can also do this with ddply. Check your work by manually counting the number of doctors in sample_data.
Exercise 3: On the flipboard, make a block diagram of the process that you went through to make this happen. We will do this as a group.
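For reference, here is one manual sketch of exercise 2 (not the only way to do it):

```r
# split: one data.frame per state
anambra <- subset(sample_data, state == "Anambra")
katsina <- subset(sample_data, state == "Katsina")
# apply: compute the total for each piece; combine: rbind the one-row results
rbind(
    data.frame(state = "Anambra",
               total_num_doctors = sum(anambra$num_doctors_fulltime, na.rm = TRUE)),
    data.frame(state = "Katsina",
               total_num_doctors = sum(katsina$num_doctors_fulltime, na.rm = TRUE))
)
```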
How ddply works

If you did exercise 2 correctly, here are the steps you probably took:

1. Split sample_data into smaller data.frames, one per state.
2. Apply your calculation (the sum of doctors) to each piece.
3. Combine the one-row results back into a single data.frame.

This pattern of data manipulation is called “split-apply-combine” by the author of plyr, Hadley Wickham. ddply and associated functions basically make it easier to perform this pattern of data manipulation, which turns out to be extremely useful. The equivalent way of doing what we did above with ddply is the following:
my_summary <- ddply(sample_data, # input_data_frame
"state", # group-by variable
summarise, # function. in this case, we use summarise
total_num_doctors = sum(num_doctors_fulltime, na.rm=T))
head(my_summary)
## state total_num_doctors
## 1 Abia 308
## 2 Adamawa 1
## 3 Anambra 3
## 4 Bauchi 0
## 5 Benue 0
## 6 Cross River 0
The difference is that we have now done our calculation for ALL states. As we walk through the ddply call, it is useful to see what summarize does for us; can you describe what summarize does based on the output below?
anambra <- subset(sample_data, state == "Anambra")
example <- summarize(anambra, total_num_doctors = sum(num_doctors_fulltime,
na.rm = T))
str(example)
## 'data.frame': 1 obs. of 1 variable:
## $ total_num_doctors: int 3
example
## total_num_doctors
## 1 3
Okay, back to our ddply. The group-by variable state tells ddply that it should take our data.frame and make many many small-small dataframes, one data frame for each state. So there is one dataframe for Anambra, one for Abia, one for Katsina, and so on. Then, ddply “applies” the function at the end of the ddply command. This function should return a dataframe, like summarize does. Finally, it takes all of those small-small NEW dataframes and rbinds them all together into a new large dataframe. Plus, it adds on a column called state, whose value it knew from the subsetting process (anambra's state is Anambra, duh!).
Awesome, or what?
Notes on ddply usage:

Exercise: use ddply to calculate the number of public facilities in each zone. You should get the following output:
## zone num_public_facilities
## 1 North-Central 6
## 2 Northeast 3
## 3 Northwest 10
## 4 South-South 6
## 5 Southeast 4
## 6 Southwest 4
Hint: what does sum(c(TRUE, FALSE, TRUE, TRUE, FALSE)) return?
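Logical values are coerced to 1 and 0 when summed, so sum() over a predicate counts the TRUEs. Here is a hedged sketch of a call that could produce the zone table above (it assumes sample_data has zone and management columns, as seen elsewhere in this tutorial):

```r
ddply(sample_data, "zone", summarise,
      num_public_facilities = sum(management == "public", na.rm = TRUE))
```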
## zone avg_num_doctors standard_deviation_doctors
## 1 North-Central 0.0000 0.0000
## 2 Northeast 0.4000 0.5477
## 3 Northwest 0.0000 0.0000
## 4 South-South 0.4444 0.8819
## 5 Southeast 44.4286 116.2252
## 6 Southwest 0.8889 1.0541
Remember that facility with the 308 doctors?
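A sketch of a call that could produce the zone-level table above, using sd() for the standard deviation:

```r
ddply(sample_data, "zone", summarise,
      avg_num_doctors = mean(num_doctors_fulltime, na.rm = TRUE),
      standard_deviation_doctors = sd(num_doctors_fulltime, na.rm = TRUE))
```

The Southeast's enormous standard deviation is driven by that single facility reporting 308 doctors.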
You can use multiple group-by variables to perform an aggregation with ddply. For example, we can use both “state” and “lga” below:
my_summary <- ddply(sample_data, c("state", "lga"), summarise,
                    counts = length(lga_id),
                    total_num_nurses = sum(num_nurses_fulltime, na.rm = T),
                    avg_num_doctors = mean(num_doctors_fulltime, na.rm = T),
                    avg_c_section_rate = mean(c_section_yn, na.rm = T))
head(my_summary)
## state lga counts total_num_nurses avg_num_doctors
## 1 Abia Umuahia North 1 0 308
## 2 Adamawa Shelleng 1 2 1
## 3 Anambra Anaocha 1 2 NaN
## 4 Anambra Ayamelum 1 0 1
## 5 Anambra Ekwusigo 1 0 1
## 6 Anambra Ihiala 1 2 1
## avg_c_section_rate
## 1 1
## 2 0
## 3 0
## 4 0
## 5 1
## 6 1
ddply also allows you to use a special .() syntax, where you don't have to put your column names in string variables.
my_summary <- ddply(sample_data, .(state, lga), summarise,
                    counts = length(lga_id),
                    total_num_nurses = sum(num_nurses_fulltime, na.rm = T),
                    avg_num_doctors = mean(num_doctors_fulltime, na.rm = T),
                    avg_c_section_rate = mean(c_section_yn, na.rm = T))
head(my_summary)
## state lga counts total_num_nurses avg_num_doctors
## 1 Abia Umuahia North 1 0 308
## 2 Adamawa Shelleng 1 2 1
## 3 Anambra Anaocha 1 2 NaN
## 4 Anambra Ayamelum 1 0 1
## 5 Anambra Ekwusigo 1 0 1
## 6 Anambra Ihiala 1 2 1
## avg_c_section_rate
## 1 1
## 2 0
## 3 0
## 4 0
## 5 1
## 6 1
Question: What is this summary of? Could you use a single group-by variable to get the same result? When might you want to use two variables instead of one?
You are allowed to, and in fact will need to, use your own function (instead of summarize) in ddply. The syntax is like any function definition in R; we will get to functions later on. For now, please pay attention to the syntax, and note that the function must always return a data.frame when used with ddply. This is because, as we said above, ddply, at the end, rbinds a whole lot of small-small data.frames. Using the syntax below, each function call returns a new data.frame.
my_summary <- ddply(sample_data, "state", function(df) {
    data.frame(counts = length(df$lga_id),
               total_num_nurses = sum(df$num_nurses_fulltime, na.rm = T),
               avg_num_doctors = mean(df$num_doctors_fulltime, na.rm = T),
               avg_c_section_rate = mean(df$c_section_yn, na.rm = T))
})
head(my_summary)
## state counts total_num_nurses avg_num_doctors avg_c_section_rate
## 1 Abia 1 0 308 1.0
## 2 Adamawa 1 2 1 0.0
## 3 Anambra 4 4 1 0.5
## 4 Bauchi 2 0 0 0.0
## 5 Benue 1 0 0 0.0
## 6 Cross River 3 3 0 0.0
How is this different from the above, which used summarize? Why do you think this is?
When you are working with bigger datasets, there is a type of data.frame called idata.frame that is more efficient than a regular data.frame for aggregation. By default, R makes copies of the dataframes during a ddply operation (the same way we made copies of our data for anambra and katsina). To stop R from making these copies, and so work more efficiently, we can use an idata.frame. The way to do it is to create a new idata.frame object from a data frame, and then perform the exact same operations as we did before.
If ddply calls are starting to take a long time, you should think about using idata.frame. Note that the cost of idata.frames is slightly more complex code; certain functions don't work with idata.frames. One such function is summarize: WE ALWAYS NEED THE FUNCTION FORMAT when using ddply with idata.frames. Please read the idata.frame documentation for further instructions. An example:
isample_data <- idata.frame(sample_data)
my_summary <- ddply(isample_data, "state", function(df) {
data.frame(counts = length(df$lga_id), total_num_nurses = sum(df$num_nurses_fulltime,
na.rm = T), avg_num_doctors = mean(df$num_doctors_fulltime, na.rm = T),
avg_c_section_rate = mean(df$c_section_yn, na.rm = T))
})
head(my_summary)
## state counts total_num_nurses avg_num_doctors avg_c_section_rate
## 1 Abia 1 0 308 1.0
## 2 Adamawa 1 2 1 0.0
## 3 Anambra 4 4 1 0.5
## 4 Bauchi 2 0 0 0.0
## 5 Benue 1 0 0 0.0
## 6 Cross River 3 3 0 0.0
In the end, however, we get a data.frame back. Can someone remind me how to check if this is correct?
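One way to check, using the class() and str() functions from Day 1:

```r
class(my_summary)   # should print "data.frame"
str(my_summary)     # shows the columns and their types
```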
To demonstrate the speedup, let's time the output of running ddply on an idata.frame versus a data.frame. We'll replicate a simple ddply call a thousand times so we can notice a difference. Note that the difference will grow larger as the datasets grow larger.
system.time(replicate(1000, ddply(isample_data, "state", nrow)))
system.time(replicate(1000, ddply(sample_data, "state", nrow)))
Question: what is the result that we calculate, before replicating it many times and timing it?
Exercise: for each state, calculate the number of records where c_section_yn == TRUE versus the total number of non-NA records in that state. Hint: what does sum(sample_data$c_section_yn) do? Compare to the output of summary. You should get:
## state num_c_sections total_non_na
## 1 Abia 1 1
## 2 Adamawa 0 1
## 3 Anambra 2 4
## 4 Bauchi 0 2
## 5 Benue 0 1
## 6 Cross River 0 3
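A hedged sketch of a ddply call that could produce the table above; !is.na() marks the non-missing records, and summing logical vectors counts the TRUEs:

```r
ddply(sample_data, "state", summarise,
      num_c_sections = sum(c_section_yn, na.rm = TRUE),
      total_non_na = sum(!is.na(c_section_yn)))
```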
Let's look at something similar to what we looked at in the beginning of this exercise: the ratio of nurses to doctors per LGA. This is not an important indicator in itself, but the ratio of students to teachers (the pupil-to-teacher ratio) is. We are looking at nurses to doctors to avoid having to load another dataset; the concept is the same.
First, let's do this for a single LGA. What is the ratio for Katsina?
katsina[, c("num_doctors_fulltime", "num_nurses_fulltime")]
## num_doctors_fulltime num_nurses_fulltime
## 3 0 1
## 18 0 0
## 24 0 0
## 38 0 0
## 39 0 0
## 42 NA NA
Simple, right? There is 1 nurse in total and 0 doctors, so the ratio is actually undefined; we cannot divide by zero. Keep this case in mind.
Okay, what is the nurse-to-doctor ratio for Anambra?
anambra[, c("num_doctors_fulltime", "num_nurses_fulltime")]
## num_doctors_fulltime num_nurses_fulltime
## 2 NA 2
## 6 1 0
## 17 1 2
## 27 1 0
Remember that in survey data, we treat NA as “missing value”. The value could be 0, it could be 1, it could be 100. The ratio of nurses to doctors in Anambra, for this small sample, is actually 2 / 3 = 0.6667. Because the number of doctors in the first facility is unknown, we have to drop that entire row from our sample before calculating the ratio.
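To see this concretely, we can drop the incomplete rows with na.omit() before summing (a sketch):

```r
# keep only rows where both counts are known, then take the ratio of sums
complete <- na.omit(anambra[, c("num_nurses_fulltime", "num_doctors_fulltime")])
sum(complete$num_nurses_fulltime) / sum(complete$num_doctors_fulltime)  # 2 / 3 = 0.6667
```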
What do you think of the following calculation?
x <- ddply(sample_data, "state", summarize,
           nurse_per_doctor_ratio = sum(num_nurses_fulltime, na.rm = T) /
               sum(num_doctors_fulltime, na.rm = T))
This calculation is incorrect: na.rm = T drops NAs from the numerator and the denominator independently, so a facility with an unknown number of doctors still contributes its nurses (for Anambra, this gives 4 / 3 instead of 2 / 3). Okay, now that we know that it is incorrect, how do we implement it properly?
This kind of calculation has to be done a few times in NMIS. For special cases like this, we have written convenience functions. In fact, it even handles the cases of needing to filter out just some of the facilities:
# returns sum(numerator_col) / sum(denominator_col) after subsetting by
# filter, which should be a predicate vector (ie, a list of TRUE / FALSE;
# it defaults to keeping every row) and also drops NAs in the process
ratio <- function(numerator_col, denominator_col, filter = TRUE) {
    df <- data.frame(num = numerator_col, den = denominator_col)
    df <- na.omit(df[filter, ])
    if (nrow(df) == 0 || sum(df$den) == 0) {
        return(NA)
    }
    return(sum(df$num) / sum(df$den))
}
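A few toy calls to see what the helper does; the vectors below are made up for illustration:

```r
# the (2, NA) pair is dropped, leaving 2 nurses over 3 doctors
ratio(c(2, 0, 2, 0), c(NA, 1, 1, 1), rep(TRUE, 4))   # 0.6667
# every doctor count is zero, so the ratio is undefined
ratio(c(1, 2), c(0, 0), rep(TRUE, 2))                # NA
```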
my_summary <- ddply(isample_data, .(state), function(df) {
data.frame(nurse_per_doctor_ratio = ratio(df$num_nurses_fulltime, df$num_doctors_fulltime),
nurse_per_doctor_ratio_public = ratio(df$num_nurses_fulltime, df$num_doctors_fulltime,
df$management == "public"))
})
head(my_summary)
## state nurse_per_doctor_ratio nurse_per_doctor_ratio_public
## 1 Abia NA NA
## 2 Adamawa 2.0000 2
## 3 Anambra 0.6667 0
## 4 Bauchi NA NA
## 5 Benue NA NA
## 6 Cross River NA NA
anambra[c("num_doctors_fulltime", "num_nurses_fulltime", "management")]
## num_doctors_fulltime num_nurses_fulltime management
## 2 NA 2 public
## 6 1 0 public
## 17 1 2 <NA>
## 27 1 0 <NA>
Outliers were thoroughly cleaned from the NMIS data set before running the analysis portion of our R scripts. In this section, we will explore both the outlier detection and replacement processes.
Please download the education data set from before outlier cleaning here: edu data.
As you can see, the file you downloaded is in .RDS format. This file type can only be opened using R. The main purpose of this file format is speed: R reads and writes .RDS files significantly faster than text formats such as CSV, which is important when working with larger data sets such as NMIS.
Let's take this opportunity now to learn two new commands that are analogous to read.csv() and write.csv(): readRDS() and saveRDS().
edu_normalized <- readRDS("~/Dropbox/Nigeria/Nigeria 661 Baseline Data Cleaning/in_process_data/Normalized/Education_774_normalized.rds")
saveRDS(edu_normalized, "~/Desktop/Education_Normalized_unchanged.RDS")
We can start with a simple example of outlier removal using logic checks. Frequently, we run basic logic checks on the data as part of the cleaning process. Our outlierreplace function takes the arguments (df, c, rowpredicate): a data.frame, the name of the column to clean, and a logical vector marking the rows to replace. For example, the number of male teachers should never exceed the total number of teachers:
edu_normalized <- outlierreplace(edu_normalized, "num_tchrs_male",
                                 (edu_normalized$num_tchrs_male > edu_normalized$num_tchrs_total))
Now that you've seen an example of the function, let's step back and break down the outlierreplace function line by line. Because our functions are not in an official R package, you cannot access a help manual for them as you learned in Day 1.
# stringr library is necessary for our function
library(stringr)
# the function
outlierreplace = function(df, c, rowpredicate, replaceVal = NA) {
naCount1 <- sum(is.na(df[, c]))
df[, c] <- replace(df[, c], rowpredicate, replaceVal)
naCount2 <- sum(is.na(df[, c]))
print(str_c(naCount2 - naCount1, " outliers replaced for field: ", c))
df
}
head(edu_normalized$num_tchrs_male > edu_normalized$num_tchrs_total)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
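To see the function in action on a toy data.frame (the values below are made up):

```r
toy <- data.frame(num_tchrs_male = c(5, 12, 3), num_tchrs_total = c(10, 8, 6))
# row 2 fails the logic check (12 male teachers out of 8 total), so it is
# replaced with NA, and the function reports the count of replacements
toy <- outlierreplace(toy, "num_tchrs_male", toy$num_tchrs_male > toy$num_tchrs_total)
## [1] "1 outliers replaced for field: num_tchrs_male"
toy$num_tchrs_male
## [1]  5 NA  3
```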
By setting a certain threshold, we can eliminate a range of outliers with one use of the function.
summary(edu_normalized$num_students_female)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 50 105 204 218 1000000 344
edu_normalized <- outlierreplace(edu_normalized, "num_students_female", (edu_normalized$num_students_female >
3000))
## [1] "49 outliers replaced for field: num_students_female"
summary(edu_normalized$num_students_female)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 50 105 187 217 3000 393
The alternative would be to call the function once for each individual outlying value, i.e. with rowpredicate being an equality test such as edu_normalized$num_students_female == 1000000, and so on for every bad value.
Notice that there may be multiple variables in the rowpredicate conditions.
edu_normalized <- outlierreplace(edu_normalized, "num_js_female",
                                 (edu_normalized$num_js_female > 1250 &
                                  edu_normalized$num_classrms_total < 25 &
                                  edu_normalized$num_tchrs_total < 10))
## [1] "3 outliers replaced for field: num_js_female"
Upon inspection of the summary statistics for certain columns, outliers can become obvious.
summary(edu_normalized$num_exercise_books_per_student_pry)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 84 262 724 714 70800 63993
summary(edu_normalized$num_exercise_books_per_student_jss)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 60 570 1800 1730 34700 68474
Question: where should we set a threshold?

Exercise: clean the num_doctors_available variable. Hint: if you get stuck on how to set the rowpredicate conditions, remember the table() function.

To find a more precise threshold than the summary/table method provides, we can plot our data.
library(ggplot2)
ggplot(edu_normalized, aes(x = zone, y = num_exercise_books_per_student_pry, fill = zone)) +
    geom_boxplot() +
    coord_flip() +
    ylab("Number of Books per Student") +
    xlab("Zone")  # + scale_y_continuous(limits = c(0, 3000))
The above graph demonstrates the magnitude of the outliers present in the num_exercise_books_per_student_pry variable. By adjusting the scale of our graph, we can get a more precise picture of where to establish our threshold:
library(ggplot2)
ggplot(edu_normalized, aes(x = zone, y = num_exercise_books_per_student_pry, fill = zone)) +
    geom_boxplot() +
    coord_flip() +
    ylab("Number of Books per Student") +
    xlab("Zone") +
    scale_y_continuous(limits = c(0, 10000))