In our surveys, we have found it useful to include a Don't Know / Information Not Available option when asking multiple-choice questions. When users select this option, we also ask them why this information wasn't available. This is similar to allowing for an Other option and asking enumerators to Please Specify Other. In response to the follow-up question, enumerators enter free-form text. This article helps survey managers analyze such free-form text data. We will look at real data from a recent facilities survey in Nigeria; the analysis below closely mirrors the analysis we actually conducted after a pilot survey to figure out how to improve our questionnaire.
In this demo, we will use the formhub and tm (text mining) packages in R to download data from formhub and look at answers from a survey on schools. Before we go any further, we will load our schools dataset into R. If you haven't used formhub.R before, please read the basics tutorial to learn how to load your own formhub data into R.
require(formhub) # See http://sel-columbia.github.io/formhub.R/ for install instructions if you get an error
require(tm) # Run install.packages('tm') if you get an error
edu <- formhubRead("~/Downloads/mopup_questionnaire_education_final_2014_04_02_08_15_46.csv",
"~/Downloads/mopup_questionnaire_education_final.json")
We will look at the question that asks about the level of education in surveyed schools. The question is called facility_type in our dataset. Let's look at it.
## The question was worded as follows:
edu@form["facility_type", "label"]
## [1] "Facility Level - pick one of the following that best describes this facility:"
## The options were worded as follows. (ldply comes from the plyr package and
## fromJSON from a JSON package; both should be pulled in by formhub, but
## require() them yourself if you get an error.)
ldply(fromJSON(edu@form["facility_type", "options"]))$label
## [1] "Primary and Junior Secondary school combined"
## [2] "Primary school only"
## [3] "Junior Secondary school only"
## [4] "Junior and Senior Secondary school combined"
## [5] "Primary, Junior and Senior Secondary school combined"
## [6] "Information not available / Don't know"
If the enumerator selected Information not available / Don't know, she would be asked a follow-up question: “Please explain why this is unknown:”. The following analysis looks at the responses to this follow-up question.
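If you author your forms as XLSForm spreadsheets, this skip-logic pattern looks roughly like the sketch below; the type, choice name, and relevance syntax here are illustrative and may differ from our actual form:
type                        name                    label                                relevant
select_one facility_levels  facility_type           Facility Level - pick one ...
text                        facility_type_dontknow  Please explain why this is unknown:  ${facility_type} = 'dontknow'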
Our goal here is to discover why enumerators are selecting Don't Know when asked what the level of the school is. Usually, this happens because the options aren't exhaustive, because enumerators misunderstand the question, or because of an extreme circumstance in the field, such as the facility being closed.
With that in mind, let's look at (1) how many Don't Know responses we have and (2) some of the responses. For (2), instead of looking at the top or bottom entries in our dataset (using head or tail), we'll look at a random sample of 15 entries using the sample function:
## First, we pull our data into a dontknow_responses variable, omitting
## missing data (NA) in the process.
dontknow_responses <- na.omit(edu$facility_type_dontknow)
## (1) How many don't know responses do we have?
length(dontknow_responses)
## [1] 143
## (2) A random sample of 15 responses
sample(dontknow_responses, 15)
## [1] "Nursery & Primary"
## [2] "Kinder Garten and Primary School"
## [3] "No certificate to be issued"
## [4] "Senior Secondary School only"
## [5] "Recent School Reclassification in Osun State"
## [6] "Kinder Garten and Primary School"
## [7] "Nursery sch"
## [8] "Its a nursery and primary school combined"
## [9] "Local Quranic School"
## [10] "Nursery & Primary"
## [11] "Local Quranic School"
## [12] "Preprimary"
## [13] "Senior Secondry School Only"
## [14] "Kinder garten and primary school"
## [15] "Local Quranic School"
The challenge here is to group our responses in a meaningful way: there are many different spellings and punctuation patterns for the same basic response. To do this, we can borrow some techniques from the text-mining literature. Two things we have found illustrative to look at are (1) the top ten individual responses and (2) the top ten terms.
Note that when looking at either of these, we should lowercase everything. In addition, when looking at terms, we should remove punctuation, strip whitespace, and perform something known as stemming. Stemming removes the ends of words, so “connected”, “connection”, and “connecting” all end up transformed to “connect”, which will help us produce better term frequencies.
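To see stemming in action, you can call tm's stemDocument directly on a few words (in recent tm versions, stemming relies on the SnowballC package, so install it if you get an error):
## A quick illustration of stemming
stemDocument(c("connected", "connection", "connecting"))
## [1] "connect" "connect" "connect"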
## First we write a quick helper function, top_N, which outputs the frequency
## of the top N elements
top_N <- function(vector, N = 10) {
x <- sort(table(vector), decreasing = TRUE)
x[1:min(N, length(x))]
}
## The top 10 entries are easy to produce; note that each response is printed
## with the number of times we found it in our dataset right below it.
top_N(tolower(dontknow_responses), 10)
## vector
## kinder garten and primary school local quranic school
## 18 18
## preprimary nursery & primary
## 12 11
## combined with nursery nursery & primary school
## 4 4
## senior secondary only facility closed
## 3 2
## kinder garden and primary school nur/primary
## 2 2
The fact that we were missing combined nursery and primary schools is clearly a major cause of enumerators selecting the Don't Know option. However, this picture doesn't yet tell us whether smaller categories, such as senior secondary only or facility closed, are also significant sources of Don't Know responses. For that, we turn to term frequencies.
## Producing the top 10 terms in our responses. To do this, we first convert
## the data into a 'corpus'. Then we apply text processing, and then finally
## create a document term matrix, which will show us our top terms.
ftdk_corpus <- Corpus(VectorSource(na.omit(dontknow_responses)))
ftdk_corpus <- tm_map(ftdk_corpus, tolower) # lowercase the documents (with tm >= 0.6, use content_transformer(tolower))
ftdk_corpus <- tm_map(ftdk_corpus, stemDocument) # stem each document
ftdk_corpus <- tm_map(ftdk_corpus, removeWords, c("school")) # 'school' is sometimes used, other times not
ftdk_corpus <- tm_map(ftdk_corpus, removePunctuation) # remove punctuation
ftdk_corpus <- tm_map(ftdk_corpus, stripWhitespace) # strip whitespace
document_term_matrix <- DocumentTermMatrix(ftdk_corpus)
sort(colSums(as.matrix(document_term_matrix)), decreasing = TRUE)
## primari and nurseri onli kinder
## 54 40 39 24 23
## garten local quran senior secondari
## 20 18 18 18 16
## preprimari combin sch facil the
## 14 11 10 6 6
## with both centr close nuseryprimari
## 5 3 3 3 3
## this for garden gartennurseri not
## 3 2 2 2 2
## nurprimari nurseryprimari sec sss vocat
## 2 2 2 2 2
## abandon acquisit ani appli arms
## 1 1 1 1 1
## avail becaus building capac categori
## 1 1 1 1 1
## certif choic doe down educ
## 1 1 1 1 1
## ikun inform issu junior kind
## 1 1 1 1 1
## kindergarten nursari okpodu only osun
## 1 1 1 1 1
## pre pri prim primary primarynuseri
## 1 1 1 1 1
## questionair recent reclassif resourc schools
## 1 1 1 1 1
## scondari secondri section simerin skill
## 1 1 1 1 1
## state teacher there within women
## 1 1 1 1 1
The term-frequency table can be hard to work with, since every term mentioned appears in it. One trick is to drop all terms below a certain frequency. Another trick is to make it more visual by turning the table into a wordcloud. Let's try both below. For this dataset, we will drop any term appearing fewer than 3 times.
## Output term frequency table again, this time only showing terms with at
## least 3 entries
term_frequencies <- sort(colSums(as.matrix(document_term_matrix)), decreasing = TRUE)
term_frequencies[term_frequencies >= 3]
## primari and nurseri onli kinder
## 54 40 39 24 23
## garten local quran senior secondari
## 20 18 18 18 16
## preprimari combin sch facil the
## 14 11 10 6 6
## with both centr close nuseryprimari
## 5 3 3 3 3
## this
## 3
## Make a wordcloud out of the term frequencies
require(wordcloud) # Run install.packages('wordcloud') if you get an error
wordcloud(ftdk_corpus)
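By default, wordcloud places terms at random and hides only very rare ones. If the cloud looks cluttered, the standard min.freq and random.order arguments help, as in this sketch:
## A tidier cloud: hide terms appearing fewer than 5 times, and place the
## most frequent terms in the middle rather than at random positions
wordcloud(ftdk_corpus, min.freq = 5, random.order = FALSE)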
We note that senior secondary schools do seem to be a big source of Don't Know responses (18 entries for “senior”), whereas closed or abandoned facilities are a much smaller source (only 3 responses for “close”).
Stepping back, note that we dove straight in with one of the questions. We may have many such questions in our survey. How do we determine which questions are actually eliciting lots of Don't Know responses and merit further analysis? To do this, we can count how many non-missing (non-NA) responses each follow-up question received. In our case, all of our Don't Know questions have a standard name: they always end with _dontknow. So we can just look at the number of non-NA responses for each of these questions:
# Find all the _dontknow columns, and then count the non-NA responses per
# question. (str_detect comes from the stringr package; require(stringr) if
# it isn't already loaded.)
relevant_columns <- names(edu)[str_detect(names(edu), "_dontknow")]
colSums(!is.na(edu[relevant_columns]))
## facility_type_dontknow
## 143
## education_type_dontknow
## 8
## management_dontknow
## 0
## improved_water_functionality_dontknow
## 1
## phcn_functional_dontknow
## 1
## solar_functional_dontknow
## 0
## generator_functional_dontknow
## 0
## grid_proximity_dontknow
## 246
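With only a handful of follow-up columns this table is easy to scan; in a larger survey, it helps to sort the counts so the most problematic questions come first:
## Same counts, sorted in decreasing order
sort(colSums(!is.na(edu[relevant_columns])), decreasing = TRUE)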
Looks like we picked one of the right columns to analyze above! The other question we should make sure to look at is the grid_proximity question. Let's repeat the analysis above for that field.
Below, we repeat the exact same analysis as above, only this time we wrap the complicated output in functions. As a result, you can copy and paste most of this code to produce outputs for a completely different dataset; you will only have to change the line just below the comment marked CHANGE.
require(tm)
require(wordcloud)
## This function outputs the top N entries in a vector (default 10)
top_N <- function(response_vector, N = 10) {
x <- sort(table(response_vector), decreasing = TRUE)
x[1:min(N, length(x))]
}
## This function takes a vector of string responses and prints the top terms
## in that corpus (optionally only those appearing at least NMIN times).
## It also returns the processed corpus.
top_N_terms <- function(response_vector, NMIN = NULL) {
corpus <- Corpus(VectorSource(response_vector))
corpus <- tm_map(corpus, tolower) # lowercase the documents (use content_transformer(tolower) with tm >= 0.6)
corpus <- tm_map(corpus, stemDocument) # stem each document
corpus <- tm_map(corpus, removeWords, c("school")) # 'school' is sometimes used, other times not
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, stripWhitespace) # strip whitespace
dtm <- DocumentTermMatrix(corpus)
top_terms <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
if (!is.null(NMIN)) {
top_terms <- top_terms[top_terms >= NMIN]
}
print(top_terms)
return(corpus)
}
## Extract the relevant responses from the survey. CHANGE: if you try to
## reproduce this code, change the following line.
dontknow_responses <- na.omit(edu$grid_proximity_dontknow)
## Top 10 responses
top_N(tolower(dontknow_responses))
## response_vector
## not connected
## 29
## no connection
## 26
## remote location
## 23
## too far from grid
## 21
## not accessable to national grid
## 13
## not connected to any power source
## 10
## not connected to any power grid
## 7
## neglegence
## 5
## not connected to nepa
## 4
## there is no power connection in the school
## 4
## Term frequencies for terms appearing at least 3 times
dontknow_corpus <- top_N_terms(dontknow_responses, NMIN = 3)
## connect not grid power the from far ani
## 111 101 58 53 40 32 28 27
## sourc too remot locat nation access there nepa
## 26 26 25 23 23 16 13 12
## has facil suppli close area hard negleg phcn
## 10 8 8 7 5 5 5 5
## reach approv provid avail conect have inaccess non
## 5 4 4 3 3 3 3 3
## Make a word cloud
wordcloud(dontknow_corpus)
Most of the answers relate to facilities not being connected, or being far from the grid. This was actually listed as one of the possible responses in our grid proximity question; due to tricky wording, it seems that enumerators simply misunderstood it.
What can we actually do with our analysis results?
In this case, we have very actionable information about our survey. For this survey, we modified the facility level question to include more options. We re-worded the grid proximity question and its options to make the meaning clearer. Finally, we gained confidence that our other questions were actually working fairly well.
We hope that you are able to benefit from similar analyses yourself. If you have R installed, you should be able to run code very similar to what you see on this page, as long as you put in the correct filenames and column names. In fact, the last block of code is designed to be copy-pasted: as long as you change the line marked CHANGE, you should be able to reproduce this analysis for your own dataset. For questions, please contact the formhub-users Google group.