Day 6

Creation of More Complex Columns (indicators) with rowSums():

Review: Do you remember, from our exercise a few days ago, how you would create a new column called skilled_birth_attendants that is the sum of the number of doctors and the number of nurses in a health facility?

sample_data$skilled_birth_attendants <- sample_data$num_nurses_fulltime + sample_data$num_doctors_fulltime

Do you remember what the issue was with doing things this way?
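
As a reminder of the problem, here is a minimal sketch on a made-up three-row data frame (toy and its values are hypothetical, not part of sample_data). With +, a missing value in either column makes the whole sum NA:

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(num_nurses_fulltime  = c(0, 2, 1),
                  num_doctors_fulltime = c(0, NA, 0))

# Adding with + propagates NA: row 2 has 2 nurses, but the sum is NA
toy$skilled_birth_attendants <- toy$num_nurses_fulltime + toy$num_doctors_fulltime
print(toy$skilled_birth_attendants)  # 0 NA 1
```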

Today, we'll see a different method of doing this. We will use the rowSums function, which has the na.rm parameter we have seen so many times before. There are two common ways to use rowSums, as shown below:

sample_data$skilled_birth_attendants <- rowSums(sample_data[, c("num_nurses_fulltime", 
    "num_doctors_fulltime")], na.rm = TRUE)
head(sample_data[, c("num_doctors_fulltime", "num_nurses_fulltime", "skilled_birth_attendants")])
##   num_doctors_fulltime num_nurses_fulltime skilled_birth_attendants
## 1                    0                   0                        0
## 2                   NA                   2                        2
## 3                    0                   1                        1
## 4                    0                   0                        0
## 5                    0                   0                        0
## 6                    1                   0                        1

sample_data$skilled_birth_attendants <- rowSums(cbind(sample_data$num_nurses_fulltime, 
    sample_data$num_doctors_fulltime), na.rm = TRUE)
head(sample_data[, c("num_doctors_fulltime", "num_nurses_fulltime", "skilled_birth_attendants")])
##   num_doctors_fulltime num_nurses_fulltime skilled_birth_attendants
## 1                    0                   0                        0
## 2                   NA                   2                        2
## 3                    0                   1                        1
## 4                    0                   0                        0
## 5                    0                   0                        0
## 6                    1                   0                        1
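
One thing to keep in mind with na.rm = TRUE: a row in which every value is missing sums to 0, not NA. A small sketch on hypothetical values (toy is made up for illustration) shows that the two forms agree and how the all-NA row behaves:

```r
# Hypothetical toy data; row 2 is missing both counts
toy <- data.frame(num_nurses_fulltime  = c(2, NA, 1),
                  num_doctors_fulltime = c(NA, NA, 0))

# Form 1: select the columns by name from the data frame
by_name <- rowSums(toy[, c("num_nurses_fulltime", "num_doctors_fulltime")],
                   na.rm = TRUE)

# Form 2: bind the columns together with cbind()
by_cbind <- rowSums(cbind(toy$num_nurses_fulltime, toy$num_doctors_fulltime),
                    na.rm = TRUE)

print(unname(by_name))  # 2 0 1
print(by_cbind)         # 2 0 1
```

Whether 0 is the right answer for a facility with no data at all depends on the indicator, so it is worth checking such rows before relying on the sum.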

The is.na() function:

Review: Do you remember how to create column-based indicators? Can you calculate the is_public column again, which is TRUE if that facility has public in its management column, and FALSE if not? How about a column called is_public_facility_withdoctor, which is TRUE only when the management is public and num_doctors_fulltime is greater than 0?

Check that your answer matches the following output:

summary(sample_data$is_public)
##    Mode    TRUE    NA's 
## logical      33      17
summary(sample_data$is_public_facility_withdoctor)
##    Mode   FALSE    TRUE    NA's 
## logical      34       3      13

Notice that in our sample dataset, many facilities have an unknown management type (the value is NA), so we cannot tell whether they are public. Let's say that for this exercise, we want to treat these facilities as private. In our calculation above, is_public_facility_withdoctor is NA when is_public is NA and the facility has a doctor, because NA & TRUE is NA. Because we are treating management == NA as private, we can use the is.na function to decide whether a facility is public. This works here because every non-missing management value in our sample is public:

sample_data$is_public <- !is.na(sample_data$management)
sample_data$is_public_facility_withdoctor <- sample_data$is_public & (sample_data$num_doctors_fulltime > 
    0)
summary(sample_data$is_public)
##    Mode   FALSE    TRUE    NA's 
## logical      17      33       0
summary(sample_data$is_public_facility_withdoctor)
##    Mode   FALSE    TRUE    NA's 
## logical      44       3       3
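
The drop from 13 NAs to 3 follows from how R combines NA with &. A short sketch of the rules involved (the variable names are only for illustration):

```r
# With &, a FALSE on either side settles the answer, even if the other side is NA
false_wins <- FALSE & NA      # FALSE

# A TRUE on one side is not enough: the result depends on the unknown side
still_unknown <- TRUE & NA    # NA

# Comparing NA to a number is itself NA
na_comparison <- NA > 0       # NA

print(c(false_wins, still_unknown, na_comparison))  # FALSE NA NA
```

Once is_public has no NAs, the only way to get NA is TRUE & NA, so the 3 remaining NAs are facilities counted as public whose num_doctors_fulltime is missing.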

Finally, let's deal with a more complex indicator. We don't generally do this, but what if you wanted to treat a missing num_doctors_fulltime (NA) as meaning that there is no doctor? How would you write this expression? While it is possible with just & and |, this column is easier to write with ifelse, which is, again, a vectorized function:

sample_data$is_public_facility_withdoctor <- 
    ifelse(is.na(sample_data$num_doctors_fulltime), # condition
       FALSE,  # value if true, in this case, doctor = NA is assumed to be doctor = 0, and therefore we should get out FALSE
       sample_data$is_public & sample_data$num_doctors_fulltime > 0)
summary(sample_data$is_public_facility_withdoctor)
##    Mode   FALSE    TRUE    NA's 
## logical      47       3       0
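
For comparison, here is a sketch of how the same column could be written with only & and !is.na(), on hypothetical values (toy is made up for illustration). Putting !is.na() in the chain turns a missing doctor count into FALSE before the comparison can produce NA:

```r
# Hypothetical toy data; row 2 has a missing doctor count
toy <- data.frame(is_public            = c(TRUE, TRUE, FALSE, TRUE),
                  num_doctors_fulltime = c(1, NA, 2, 0))

# ifelse version, as in the text
with_ifelse <- ifelse(is.na(toy$num_doctors_fulltime),
                      FALSE,
                      toy$is_public & toy$num_doctors_fulltime > 0)

# Equivalent version using only & and !is.na()
with_and <- toy$is_public & !is.na(toy$num_doctors_fulltime) &
    toy$num_doctors_fulltime > 0

print(with_ifelse)  # TRUE FALSE FALSE FALSE
print(with_and)     # TRUE FALSE FALSE FALSE
```

The two agree here because is_public itself contains no NAs. The ifelse form states the special case more directly, which is why it is easier to read.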

In contrast, if in R is not vectorized. What is the difference between the code above and the code below?

if (is.na(sample_data$num_doctors_fulltime)) {
    sample_data$is_public_facility_withdoctor_wrong <- FALSE
} else {
    sample_data$is_public_facility_withdoctor_wrong <- sample_data$is_public & 
        (sample_data$num_doctors_fulltime > 0)
}
## Warning: the condition has length > 1 and only the first element will be
## used
summary(sample_data$is_public_facility_withdoctor_wrong)
##    Mode   FALSE    TRUE    NA's 
## logical      44       3       3
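
To see the difference on a small scale, here is a sketch with hypothetical values. ifelse() checks the condition for every element, while if expects a single TRUE or FALSE and then runs one branch for the whole column. The warning shown above comes from an older version of R; since R 4.2.0 the same code stops with an error instead. The sketch passes only the first element to if, which is what the older versions did implicitly, so that it runs in either case:

```r
doctors <- c(0, NA, 2)   # hypothetical doctor counts

# ifelse() is vectorized: the NA check is applied to each element
per_element <- ifelse(is.na(doctors), FALSE, doctors > 0)
print(per_element)       # FALSE FALSE TRUE

# if makes one decision, based here on the first element only.
# The first element is not NA, so the else branch runs for every row
# and the NA in row 2 is never replaced.
if (is.na(doctors[1])) {
    whole_column <- FALSE
} else {
    whole_column <- doctors > 0
}
print(whole_column)      # FALSE NA TRUE
```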

Exercise:

##    Mode   FALSE    TRUE    NA's 
## logical      32      18       0
head(sample_data[, c("num_nurses_fulltime", "num_doctors_fulltime", "c_section_yn", 
    "skilled_worker")], 20)
##    num_nurses_fulltime num_doctors_fulltime c_section_yn skilled_worker
## 1                    0                    0        FALSE              0
## 2                    2                   NA        FALSE              2
## 3                    1                    0        FALSE              1
## 4                    0                    0        FALSE              0
## 5                    0                    0        FALSE              0
## 6                    0                    1        FALSE              0
## 7                    0                    0        FALSE              0
## 8                    1                    1         TRUE              2
## 9                    3                    0        FALSE              3
## 10                   0                    0        FALSE              0
## 11                   2                    1        FALSE              2
## 12                  NA                   NA        FALSE             NA
## 13                   0                    0        FALSE              0
## 14                   1                    0        FALSE              1
## 15                   7                    0         TRUE              7
## 16                   0                    0        FALSE              0
## 17                   2                    1         TRUE              3
## 18                   0                    0        FALSE              0
## 19                   8                    2         TRUE             10
## 20                   3                    3         TRUE              6
summary(sample_data$skilled_worker)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.00    7.83    2.00  308.00       3

Data Pipeline

outlier_normalized_health.R, outlier_normalized_education.R

nmis_indicators_health_lga_level_normalized.R, nmis_indicators_education_lga_level_normalized.R, nmis_indicators_water_lga_level_normalized.R

nmis_indicators_health_facility_level_normalized.R, nmis_indicators_education_facility_level_normalized.R, nmis_indicators_water_facility_level_normalized.R

nmis_indicators_COMBINING_normalized.R

nmis_post_processing