Review: Do you remember from our exercise a few days ago how to create a new column called skilled_birth_attendants, which is the sum of the number of doctors and the number of nurses in a health facility?
sample_data$skilled_birth_attendants <- sample_data$num_nurses_fulltime + sample_data$num_doctors_fulltime
Do you remember what the issue was with doing things this way?
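As a reminder of that issue, here is a minimal sketch. The toy data frame is made up for illustration and only borrows the column names from sample_data; it is not the real data. With +, any row that has an NA in either column ends up with an NA in the sum:

```r
# Hypothetical toy data standing in for sample_data
toy <- data.frame(
    num_nurses_fulltime  = c(0, 2, 1, NA),
    num_doctors_fulltime = c(0, NA, 0, NA)
)

# Plain addition: NA + anything is NA
toy$skilled_birth_attendants <- toy$num_nurses_fulltime + toy$num_doctors_fulltime
print(toy)
```

Row 2 has two nurses and an unknown number of doctors, and the plain sum reports NA for it rather than 2.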
Today, we'll see a different way of doing this. We will use the rowSums function, which has the na.rm parameter we have seen so many times before. There are two common ways to use rowSums, as shown below:
sample_data$skilled_birth_attendants <- rowSums(sample_data[, c("num_nurses_fulltime",
    "num_doctors_fulltime")], na.rm = TRUE)
head(sample_data[, c("num_doctors_fulltime", "num_nurses_fulltime", "skilled_birth_attendants")])
##   num_doctors_fulltime num_nurses_fulltime skilled_birth_attendants
## 1                    0                   0                        0
## 2                   NA                   2                        2
## 3                    0                   1                        1
## 4                    0                   0                        0
## 5                    0                   0                        0
## 6                    1                   0                        1
sample_data$skilled_birth_attendants <- rowSums(cbind(sample_data$num_nurses_fulltime,
    sample_data$num_doctors_fulltime), na.rm = TRUE)
head(sample_data[, c("num_doctors_fulltime", "num_nurses_fulltime", "skilled_birth_attendants")])
##   num_doctors_fulltime num_nurses_fulltime skilled_birth_attendants
## 1                    0                   0                        0
## 2                   NA                   2                        2
## 3                    0                   1                        1
## 4                    0                   0                        0
## 5                    0                   0                        0
## 6                    1                   0                        1
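One thing worth checking for yourself is what rowSums does with a row where every value is missing. The sketch below uses a made-up toy data frame (not sample_data) to compare the two forms. The last step, which puts NA back for rows with no information at all, is an optional extra and not part of the lesson's method:

```r
# Hypothetical toy data; the last row has no staffing information at all
toy <- data.frame(
    num_nurses_fulltime  = c(0, 2, 1, NA),
    num_doctors_fulltime = c(0, NA, 0, NA)
)

# Form 1: select the columns by name
by_name <- rowSums(toy[, c("num_nurses_fulltime", "num_doctors_fulltime")],
                   na.rm = TRUE)

# Form 2: bind the columns together first
by_cbind <- rowSums(cbind(toy$num_nurses_fulltime, toy$num_doctors_fulltime),
                    na.rm = TRUE)

# Same numbers either way (form 1 also carries the row names along)
print(all.equal(unname(by_name), by_cbind))

# With na.rm = TRUE, a row where everything is NA sums to 0, not NA
print(by_cbind)

# Optional: restore NA where both inputs were missing
all_missing <- is.na(toy$num_nurses_fulltime) & is.na(toy$num_doctors_fulltime)
restored <- by_cbind
restored[all_missing] <- NA
print(restored)
```

Whether an all-missing row should count as 0 or stay NA depends on what the indicator is meant to say, so decide that before relying on na.rm = TRUE.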
Review: Do you remember how to create column-based indicators? Can you calculate the is_public column again, which is TRUE if that facility has public in its management column, and FALSE if not? How about a column called is_public_facility_withdoctor, which is TRUE only when the management is public and num_doctors_fulltime is greater than 0?
You can check your answer against the following output:
summary(sample_data$is_public)
## Mode TRUE NA's
## logical 33 17
summary(sample_data$is_public_facility_withdoctor)
## Mode FALSE TRUE NA's
## logical 34 3 13
Notice that in our sample dataset, many facilities have an unknown management type (management is NA), so we cannot tell whether they are public. Let's say that for this exercise we want to treat these facilities as private. In our calculation above, is_public_facility_withdoctor is NA when is_public is NA and the facility has a doctor (NA & TRUE is NA, while NA & FALSE is FALSE). Because we are treating management == NA as a private facility, and every known management value in this sample is public (the summary of is_public above shows no FALSE), we can use the is.na function to check whether a facility is public:
sample_data$is_public <- !is.na(sample_data$management)
sample_data$is_public_facility_withdoctor <- sample_data$is_public &
    (sample_data$num_doctors_fulltime > 0)
summary(sample_data$is_public)
## Mode FALSE TRUE NA's
## logical 17 33 0
summary(sample_data$is_public_facility_withdoctor)
## Mode FALSE TRUE NA's
## logical 44 3 3
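To see where the remaining NAs come from, it can help to run the same logic on a handful of values. This is a minimal sketch with made-up vectors, not the real sample_data. The %in% line at the end is a possible alternative and not what the lesson uses:

```r
# Hypothetical toy vectors standing in for two columns of sample_data
management  <- c("public", NA, NA, "public")
num_doctors <- c(1, 2, 0, NA)
has_doctor  <- num_doctors > 0                 # TRUE TRUE FALSE NA

# Strict version: an unknown management type stays unknown
is_public_strict <- management == "public"     # TRUE NA NA TRUE
strict <- is_public_strict & has_doctor        # TRUE NA FALSE NA

# Recoded version: unknown management is treated as private (not public)
is_public_recoded <- !is.na(management)        # TRUE FALSE FALSE TRUE
recoded <- is_public_recoded & has_doctor      # TRUE FALSE FALSE NA

# Possible alternative: %in% never returns NA, and it keeps working
# if the column also holds values other than "public"
is_public_in <- management %in% "public"       # TRUE FALSE FALSE TRUE

print(data.frame(management, num_doctors, strict, recoded))
```

After recoding, the only NA left is the public facility whose doctor count is missing, which matches the kind of NA that remains in the summary above.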
Finally, let's deal with a more complex indicator. We don't generally do this, but what if you wanted to say that num_doctors_fulltime == NA implies that there is no doctor? How would you write this expression? While possible with just & and |, this column modification is easier to write with ifelse, which is, again, a vectorized function:
sample_data$is_public_facility_withdoctor <-
    ifelse(is.na(sample_data$num_doctors_fulltime),  # condition
           # value if true: doctor = NA is assumed to be doctor = 0, so we get FALSE
           FALSE,
           # value if false: the usual calculation
           sample_data$is_public & sample_data$num_doctors_fulltime > 0)
summary(sample_data$is_public_facility_withdoctor)
## Mode FALSE TRUE NA's
## logical 47 3 0
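For comparison, here is a sketch of the same rule written both ways on made-up vectors (hypothetical values, not sample_data). The version with only & has to test for NA explicitly before comparing the doctor count:

```r
# Hypothetical toy vectors
is_public   <- c(TRUE, TRUE, FALSE, TRUE)
num_doctors <- c(2, NA, NA, 0)

# ifelse: where the doctor count is NA, answer FALSE; otherwise do the usual test
with_ifelse <- ifelse(is.na(num_doctors), FALSE, is_public & num_doctors > 0)

# Only &: FALSE & NA is FALSE, so the is.na() check removes the NAs
with_and <- is_public & !is.na(num_doctors) & num_doctors > 0

print(with_ifelse)
print(with_and)
print(identical(with_ifelse, with_and))
```

Both give the same answer here. The ifelse version tends to be easier to read because the NA rule is stated separately from the main condition.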
In contrast, if in R is not vectorized. What is the difference between the code above and the code below?
if (is.na(sample_data$num_doctors_fulltime)) {
    sample_data$is_public_facility_withdoctor_wrong <- FALSE
} else {
    sample_data$is_public_facility_withdoctor_wrong <- sample_data$is_public &
        (sample_data$num_doctors_fulltime > 0)
}
## Warning: the condition has length > 1 and only the first element will be
## used
summary(sample_data$is_public_facility_withdoctor_wrong)
## Mode FALSE TRUE NA's
## logical 44 3 3
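In the output above, only the first facility was checked. Its num_doctors_fulltime is 0, not NA, so the else branch ran for the whole column and the three NAs stayed. The sketch below repeats this on a made-up vector. Note that R 4.2.0 and later stop with an error here instead of the warning shown above, so the example uses tryCatch to record whichever happens on your installation:

```r
# Hypothetical toy vector; the first element decides which branch if() takes
num_doctors <- c(0, NA, 2)

# if() needs a single TRUE/FALSE. With a longer condition, older versions of R
# warn and use only the first element; R 4.2.0 and later stop with an error.
if_outcome <- tryCatch(
    {
        if (is.na(num_doctors)) "NA branch" else "non-NA branch"
    },
    warning = function(w) paste("warning:", conditionMessage(w)),
    error   = function(e) paste("error:", conditionMessage(e))
)
print(if_outcome)

# ifelse() checks the condition element by element
has_doctor <- ifelse(is.na(num_doctors), FALSE, num_doctors > 0)
print(has_doctor)
```

Use if when you have one condition that decides what the script does next, and ifelse when you need one answer per row.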
Exercise:
management==NA means private facilities, which facilities in sample_data meet this new regulation? The summary of this column should be as follows:
## Mode FALSE TRUE NA's
## logical 32 18 0
ifelse, define this (mock) indicator. The summary should match the following:
head(sample_data[, c("num_nurses_fulltime", "num_doctors_fulltime", "c_section_yn",
    "skilled_worker")], 20)
##    num_nurses_fulltime num_doctors_fulltime c_section_yn skilled_worker
## 1                    0                    0        FALSE              0
## 2                    2                   NA        FALSE              2
## 3                    1                    0        FALSE              1
## 4                    0                    0        FALSE              0
## 5                    0                    0        FALSE              0
## 6                    0                    1        FALSE              0
## 7                    0                    0        FALSE              0
## 8                    1                    1         TRUE              2
## 9                    3                    0        FALSE              3
## 10                   0                    0        FALSE              0
## 11                   2                    1        FALSE              2
## 12                  NA                   NA        FALSE             NA
## 13                   0                    0        FALSE              0
## 14                   1                    0        FALSE              1
## 15                   7                    0         TRUE              7
## 16                   0                    0        FALSE              0
## 17                   2                    1         TRUE              3
## 18                   0                    0        FALSE              0
## 19                   8                    2         TRUE             10
## 20                   3                    3         TRUE              6
summary(sample_data$skilled_worker)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.00 7.83 2.00 308.00 3
outlier_normalized_health.R, outlier_normalized_education.R
outlier_functions.R
nmis_indicators_health_lga_level_normalized.R, nmis_indicators_education_lga_level_normalized.R, nmis_indicators_water_lga_level_normalized.R
nmis_indicators_health_facility_level_normalized.R, nmis_indicators_education_facility_level_normalized.R, nmis_indicators_water_facility_level_normalized.R
nmis_indicators_COMBINING_normalized.R
nmis_post_processing
post_processing_functions.R