Back to 2012...
In the wake of the 2016 presidential election, I thought I’d better rev up my natural language processing skills in preparation for the four years of adjective-laden data to come. I decided to get my feet wet with a much tamer data set: the free-response answers from the 2012 American National Election Studies (ANES) time series. I wondered whether survey participants described the Democratic (Obama) and Republican (Romney) candidates for President differently, and what words they used to describe them.
About the data
ANES is a cross-sectional study conducted in election years by Stanford and the University of Michigan, in which a statistically representative sample of adults answers a battery of multiple-choice questions about their views on candidates for election, the national parties, and various political topics, such as abortion, the economy, healthcare, and foreign policy. Each participant also has an opportunity to answer a series of open-response questions. I was interested in the responses to the following questions (paraphrased):
- What do you like about the Democratic candidate for President?
- What do you like about the Republican candidate for President?
- What do you dislike about the Democratic candidate for President?
- What do you dislike about the Republican candidate for President?
I’m not an expert in the field of topic modeling, so I used a method that I, with a good Bayesian statistics background, could easily understand: LDA. To make a long story short, LDA is a hierarchical statistical model that describes documents as mixtures of latent (unobserved) topics. Topics and words are modeled by categorical distributions, whose conjugate priors—which we can think of as describing word and topic probabilities—are Dirichlet distributions. Integrating out the priors yields the joint distribution of the words and topic assignments, from which the probability of observing a given topic can be calculated.
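In symbols (standard LDA notation, which I didn’t spell out above): with a Dirichlet prior \(\alpha\) on each document’s topic mixture and a Dirichlet prior \(\beta\) on each topic’s word distribution, integrating out both priors gives the collapsed joint distribution of the words \(\mathbf{w}\) and topic assignments \(\mathbf{z}\):

```latex
p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta)
  = \prod_{d=1}^{D} \frac{B(\mathbf{n}_{d,\cdot} + \alpha)}{B(\alpha)}
    \prod_{k=1}^{K} \frac{B(\mathbf{n}_{\cdot,k} + \beta)}{B(\beta)}
```

where \(B(\cdot)\) is the multivariate beta function, \(\mathbf{n}_{d,\cdot}\) counts topic assignments within document \(d\), and \(\mathbf{n}_{\cdot,k}\) counts the words assigned to topic \(k\) across the corpus.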
Using the implementation of LDA found in scikit-learn, I located the top four topics that occurred in the answers to each of the four questions listed above. For each topic, I found the four words most closely associated with that topic.
| Top four words per topic |
| --- |
| think good job president |
| business man experience good |
| care don like obama |
| women issues rights abortion |
| que rights care health |
| policies marriage que la |
| economy does just jobs |
| don like just republican |
| country work democrat hes |
| jobs government plan taxes |
| years people way country |
| class middle people taxes |
| class middle people like |
| obama people believe like |
| debt country spending military |
| tax que wants rich |
It was interesting to note that many of the topics were quite easily decipherable; interpretability of results is an ongoing challenge in all of machine learning, topic modeling included. The discovered topics included many of the main storylines of the election, including Romney’s emphasis on his record as a businessman, Obama’s signature healthcare law, and Romney’s promise to cut taxes on the middle class.
Using the (terrifically named) VADER algorithm, I then calculated the average sentiment of responses to each of the four questions. I deliberately did not control for the political persuasion of the respondents, since they were a statistically representative sample of voting adults and I wanted an idea of the mean sentiment surrounding each candidate.
Sentiment about Obama was more polarized than sentiment about Romney: the mean sentiment for positive statements about Obama was higher than the mean sentiment for positive statements about Romney, but the mean sentiment for negative statements about Obama was lower than the corresponding statistic for Romney. Examining histograms of the sentiment distributions for each candidate didn’t really help me further decipher their nature…
However, the distributions for each question type were significantly different: the Kolmogorov–Smirnov (KS) statistic for the positive statements was 0.123, and for the negative statements 0.062 (p < 10⁻⁶ in both cases).
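The comparison above is a two-sample KS test, which in Python is a one-liner via SciPy. Here synthetic sentiment scores stand in for the real per-response compound scores:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-response compound sentiment scores,
# clipped to VADER's [-1, 1] range; the means/spreads are assumptions.
obama_pos = rng.normal(0.55, 0.25, 2000).clip(-1, 1)
romney_pos = rng.normal(0.45, 0.25, 2000).clip(-1, 1)

# The KS statistic is the maximum gap between the two empirical CDFs.
stat, p = ks_2samp(obama_pos, romney_pos)
print(f"KS statistic = {stat:.3f}, p = {p:.2e}")
```

With samples this large, even a modest shift in means produces a tiny p-value, which is why the reported p-values round to zero at six decimal places.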