Open-ended survey questions offer distinct benefits for studying public opinion and attitudes. Yet analyzing responses to these questions can be time-consuming and labor-intensive, which often deters researchers from using them. Natural language processing (NLP) techniques offer new solutions that can help researchers overcome these obstacles and make open-ended questions more usable.
Open-Ended Questions Offer Benefits and Present Challenges
Although closed-ended survey questions are often ideal, open-ended questions offer advantages. For example, because open-ended questions do not include specific response options, there is less risk of biasing responses based on the options listed. They are also useful when it is unfeasible to provide an adequate list of response options.
However, open-ended questions come with challenges. Analyzing the responses manually is resource-intensive, and human coding is often difficult to replicate. Luckily, recent advances in automated NLP offer new solutions to these problems.
In this blog post, we discuss two emerging NLP methods for topic modeling, a technique that clusters text documents and finds topics and themes within them. We show that using the two methods in tandem is an effective approach to analyzing complex open-ended survey questions.
Step 1: Writing Good Questions and Asking Them at the Right Time
In general, NLP algorithms are only as good as the data they are given, and the first step in getting good data is writing a good open-ended question. This means using clear question wording, avoiding leading questions, and following other basics of question construction. First and foremost, however, researchers must consider when to use open-ended questions and where they belong in the survey.
Since the beginning of the COVID-19 pandemic, Gallup has used a probability-based web panel survey to measure the health and wellbeing of Americans. This survey relied almost exclusively on closed-ended questions, for easy tracking of changes over time.
In 2022, Gallup expanded the survey to ask Americans about the factors motivating them to maintain their health with this open-ended question: “What is the most important factor that motivates you to maintain your health and wellbeing?” We opted for an open-ended question because we wanted respondents to answer the question in their own words.
Step 2: Exploring the Data With Biterm Topic Modeling
Because we received thousands of responses, analyzing the data manually would have been too time-consuming. Instead, we used automated topic modeling to help us understand the topics and themes in the responses.
As a first-cut, exploratory effort, we used biterm topic modeling (BTM). BTM is well suited to short texts such as survey responses because it learns topics from pairs of co-occurring words (biterms) pooled across all responses, rather than relying on the few words in any single response. This can make it more insightful than other topic-modeling methods when responses are relatively short. Examples of short responses found in our data included “to feel better,” “health and longevity,” and “to live a healthy life.”
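To make the pooling idea concrete, here is a minimal Python sketch of the first step a biterm approach relies on: extracting biterms, which are unordered pairs of words that appear together in the same short response, and counting them across all responses. It is an illustration rather than our analysis code. The stopword list and the function name extract_biterms are assumptions made for this example, and a full BTM would go on to fit a topic model to these pooled counts.

```python
from collections import Counter
from itertools import combinations

# Assumed, deliberately tiny stopword list for illustration only.
STOPWORDS = {"to", "and", "a", "the"}


def extract_biterms(response):
    """Return every unordered word pair (biterm) in one short response."""
    tokens = [t for t in response.lower().split() if t not in STOPWORDS]
    # Sort each pair so ("feel", "better") and ("better", "feel") count as one biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


# The three example responses quoted above.
responses = [
    "to feel better",
    "health and longevity",
    "to live a healthy life",
]

# Pool biterms across ALL responses instead of treating each response alone.
biterm_counts = Counter()
for response in responses:
    biterm_counts.update(extract_biterms(response))

for biterm, count in biterm_counts.most_common():
    print(biterm, count)
```

With thousands of responses, the same pairs recur often enough that the pooled counts carry far more signal than any single three- or four-word answer could.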
In addition to being less labor-intensive than manual coding, BTM is data-driven. This means that the process is easier to replicate because it relies on mathematically driven algorithms instead of humans assigning topic categories. It also means that possible sources of error and bias arising from human input are reduced.
In our example, BTM revealed seven primary topic “clusters” pertaining to health motivations. One of these involved motivations relating to general health and wellbeing; another involved maintaining health for the benefit of family, friends and loved ones.
While BTM was a helpful first step in identifying topics in our data, some keywords were repeated in multiple topics. As a result, the topics were relatively similar to one another, which made the results somewhat ambiguous and harder to interpret.
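One simple way to quantify this kind of keyword repetition is to compare the top keywords of each pair of topics. The Python sketch below uses Jaccard similarity (shared keywords divided by all distinct keywords across the two topics). The topic names and keyword lists are hypothetical placeholders, not our actual BTM output, and Jaccard similarity is our choice for illustration rather than a diagnostic built into BTM.

```python
from itertools import combinations


def jaccard(keywords_a, keywords_b):
    """Share of distinct keywords that two topics have in common (0 to 1)."""
    a, b = set(keywords_a), set(keywords_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical top keywords per topic, for illustration only.
topic_keywords = {
    "topic_1": ["health", "feel", "good", "life"],
    "topic_2": ["family", "friends", "loved", "life"],
    "topic_3": ["live", "longer", "life", "health"],
}

# Compare every pair of topics once.
overlaps = {
    (a, b): jaccard(topic_keywords[a], topic_keywords[b])
    for a, b in combinations(topic_keywords, 2)
}

for (a, b), score in sorted(overlaps.items(), key=lambda item: -item[1]):
    print(f"{a} vs {b}: {score:.2f}")
```

Pairs with high scores are candidates for topics that blur together and may benefit from being separated with seed keywords.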
Step 3: Organizing the Data With the Keyword-Assisted Topic Model
To address the similarity among topics, we used a Keyword-Assisted Topic Model (KeyATM). KeyATM allows researchers to use keywords to form seed topics that the model builds from. In effect, researchers can use information about possible known topics and their associated keywords to create topic pillars that the model recognizes.
For example, our BTM showed that “family,” “friends” and “loved” appeared in the same topic, suggesting that respondents were motivated to be healthy so they could be around their family, friends and loved ones for as long as possible.
We used these keywords to create a “family and friends” topic, and KeyATM estimated the overall frequency of this topic. We repeated this process with other keyword groupings that appeared to form distinct topic categories.
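The seed-keyword idea can be sketched in Python as follows. To be clear, this is plain keyword matching, not KeyATM itself: KeyATM is a probabilistic model that also learns from words outside the seed lists, whereas this sketch only shows what a seed dictionary looks like and how a rough topic share could be tallied. The “family and friends” keywords come from our BTM results described above; the other two seed lists, the third example response and the function names are assumptions for illustration.

```python
from collections import Counter

# Seed topics: a label mapped to the keywords that anchor it.
seed_topics = {
    "family and friends": {"family", "friends", "loved"},  # from the BTM results
    "feel good": {"feel", "better", "good"},                # hypothetical seeds
    "longevity": {"live", "longer", "longevity"},           # hypothetical seeds
}


def match_topics(response, seeds):
    """Return the seed topics whose keywords appear in the response."""
    tokens = set(response.lower().split())
    return [name for name, keywords in seeds.items() if tokens & keywords]


def topic_shares(responses, seeds):
    """Share of responses that mention at least one keyword from each topic."""
    total = len(responses)
    if total == 0:
        return {name: 0.0 for name in seeds}
    counts = Counter()
    for response in responses:
        counts.update(match_topics(response, seeds))
    return {name: counts[name] / total for name in seeds}


responses = [
    "to feel better",
    "health and longevity",
    "to be there for my family",  # hypothetical response
]

for name, share in topic_shares(responses, seed_topics).items():
    print(f"{name}: {share:.0%}")
```

A response can match more than one seed topic here, so these shares need not sum to 100%. A model-based approach handles that ambiguity by estimating topic proportions rather than counting matches.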
KeyATM was quite informative, and it improved the overall interpretability of the results. We found that the most common motivations for health maintenance were family and friends (37%), wanting to feel good or feel healthy (22%) and wanting to live a longer life (22%).
Overall, we were satisfied with the results that KeyATM provided, and the model was greatly informed by the prior information that BTM provided.
Conclusion
By combining NLP approaches, we were able to make sense of thousands of open-ended responses in just a couple of hours. This would have taken weeks using traditional manual coding. Additionally, the process was replicable and less susceptible to human bias and coding errors. Our process shows that, while distinct, BTM and KeyATM can be used in a complementary way to get the most out of your data and to improve analytical insights.