# Back to 2012...

In the wake of the 2016 presidential election, I thought I’d better rev up my natural language processing skills in preparation for the four years of adjective-laden data to come. I decided to get my feet wet with a much tamer data set: the 2012 American National Election Studies (ANES) time series free-response answers. I wondered whether survey participants described the Democratic (Obama) and Republican (Romney) candidates for President differently, and what words they used to describe them.

ANES is a cross-sectional study conducted in election years by Stanford and the University of Michigan, in which a statistically representative sample of adults answers a battery of multiple-choice questions about their views on the candidates for election, the national parties, and various political topics, such as abortion, the economy, healthcare, and foreign policy. Each participant also has an opportunity to answer a series of open-response questions. I was interested in the responses to the following questions (paraphrased):

1. What do you like about the Democratic candidate for President?
2. What do you like about the Republican candidate for President?
3. What do you dislike about the Democratic candidate for President?
4. What do you dislike about the Republican candidate for President?

### Topic modeling

I’m not an expert in the field of topic modeling, so I used a method that I, with a background in Bayesian statistics, could easily understand: latent Dirichlet allocation (LDA). To make a long story short, LDA is a hierarchical statistical model that describes documents as mixtures of latent (unobserved) topics. Topics and words are drawn from categorical distributions, whose conjugate priors (which we think of as describing word and topic probabilities) are Dirichlet distributions. Integrating out the priors gives the joint distribution of the words and topics, from which the probability of observing a given topic can be calculated.
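For concreteness, here is the textbook LDA setup (the standard presentation of the model, not anything specific to my analysis): with document–topic proportions $\theta_d \sim \mathrm{Dir}(\alpha)$ and topic–word distributions $\phi_k \sim \mathrm{Dir}(\beta)$, each word is generated by

$$
z_{d,n} \sim \mathrm{Cat}(\theta_d), \qquad w_{d,n} \sim \mathrm{Cat}(\phi_{z_{d,n}}),
$$

and integrating out $\theta$ and $\phi$ (tractable thanks to the Dirichlet–categorical conjugacy) gives the collapsed joint distribution

$$
p(\mathbf{w}, \mathbf{z} \mid \alpha, \beta)
= \prod_{d=1}^{D} \frac{B(n_{d,\cdot} + \alpha)}{B(\alpha)}
  \prod_{k=1}^{K} \frac{B(n_{\cdot,k} + \beta)}{B(\beta)},
$$

where $n_{d,k}$ counts words in document $d$ assigned to topic $k$, $n_{k,v}$ counts how often vocabulary word $v$ is assigned to topic $k$, and $B(\cdot)$ is the multivariate Beta function.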

Using the implementation of LDA found in scikit-learn, I located the top four topics that occurred in the answers to each of the four questions listed above. For each topic, I found the four words most closely associated with that topic.

| Topic   | Dem, like                 | Rep, like                  | Dem, dislike                   | Rep, dislike                 |
|---------|---------------------------|----------------------------|--------------------------------|------------------------------|
| Topic 1 | think good job president  |                            | care don like obama            | women issues rights abortion |
| Topic 2 | que rights care health    | policies marriage que la   | economy does just jobs         | don like just republican     |
| Topic 3 | country work democrat hes | jobs government plan taxes | years people way country       | class middle people taxes    |
| Topic 4 | class middle people like  | obama people believe like  | debt country spending military | tax que wants rich           |
It was interesting to note that many of the topics were quite easily decipherable; interpretability of results is an ongoing challenge in all of machine learning, topic modeling included. The discovered topics included many of the main storylines of the election, including Romney’s emphasis on his record as a businessman, Obama’s signature healthcare law, and Romney’s promise to cut taxes on the middle class.

### Sentiment analysis

Using the (terrifically named) VADER algorithm, I then calculated the average sentiment of the responses to each of the above questions. I deliberately did not control for the respondents’ political persuasion, since the respondents were a statistically representative sample of voting adults and I wanted a sense of the overall mean sentiment surrounding each candidate.

| Question     | Mean sentiment |
|--------------|----------------|
| Dem, like    | 0.1409         |
| Rep, like    | 0.0867         |
| Dem, dislike | -0.0432        |
| Rep, dislike | -0.0121        |

Sentiment about Obama was more polarized than sentiment about Romney: the mean sentiment for positive statements about Obama was higher than that for positive statements about Romney, while the mean sentiment for negative statements about Obama was lower than the same statistic for Romney. Examining histograms of the sentiment distributions for the two candidates didn’t really help me further decipher their nature…

However, the distributions for each question type were significantly different: the Kolmogorov–Smirnov (KS) statistic for the positive statements was 0.123 (p < 0.000001), while the KS statistic for the negative statements was 0.062 (p < 0.000001).
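The comparison above is a two-sample KS test; here is a sketch with synthetic sentiment scores standing in for the real per-response VADER compounds (the means and spread below are invented for illustration):

```python
# Two-sample Kolmogorov-Smirnov test on two sets of sentiment scores.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Synthetic stand-ins for the per-response compound scores of the
# "Dem, like" and "Rep, like" questions -- not the real data.
dem_like = rng.normal(0.14, 0.4, size=2000)
rep_like = rng.normal(0.09, 0.4, size=2000)

stat, p_value = ks_2samp(dem_like, rep_like)
print(f"KS statistic = {stat:.3f}, p = {p_value:.6f}")
```

The KS statistic is the maximum vertical distance between the two empirical CDFs, so it detects differences in shape and location without assuming either distribution is normal, which matters for sentiment scores that pile up near zero.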

### Documentation

You can look at the terminal output from my analysis here, while the source code is available here.

Written on January 22, 2017