Word Frequency Histograms of the First US 2004 Presidential Debate
Here are some histograms that show how often each person in the
Presidential Debate held on 2004-09-30 said each phrase or word.
Colored by Candidate
Candidate-Specific
Questions and Answers
-
What are the graphics from?
-
They're made by HTML Graph. Nice, huh?
-
How did you do it?
-
First I needed to find a transcript that actually had everything spelled
out, no Arabic numerals or anything. I ended up taking the AP transcript
and editing it by hand, resulting in this.
I also made sure each sentence was on its own line so I could then assume
that any period not at the end of a line was part of a "word", and make
some assumptions about capitalization. (Some errors are still there,
though.) Then I wrote a Perl program that
parsed the debate file and output a raw
data file, counting all one-word strings, as well as all multi-word
strings that appeared more than once. (I had to go back and change things
so that it would find the maximal multi-word string.) Then I wrote
another Perl script that generated proper
output for HTML Graph to parse.
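The counting step described above could be sketched roughly like this. The original was a Perl program; this is a Python stand-in, not the actual script. The function names, the `max_len` cutoff, the punctuation list, and the decision to lowercase everything are all assumptions made for illustration.

```python
from collections import Counter

# Assumed set of punctuation to trim from the edges of each word.
PUNCT = ",;:\"()"

def tokenize(sentence):
    """Split one transcript line (one sentence) into lowercase words.

    Only sentence-final punctuation is dropped; a period anywhere else
    stays attached to its word.
    """
    words = sentence.strip().rstrip(".?!").split()
    return [w.strip(PUNCT).lower() for w in words if w.strip(PUNCT)]

def count_phrases(sentences, max_len=5):
    """Count every one-word string and every repeated multi-word string."""
    counts = Counter()
    for sentence in sentences:
        words = tokenize(sentence)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Keep all single words, but only multi-word phrases seen more than once.
    return {p: c for p, c in counts.items() if " " not in p or c > 1}

# Made-up sample lines, not quotes from the transcript.
sample = [
    "We will win the war on terror.",
    "The war on terror is hard work.",
    "It is hard work.",
]
counts = count_phrases(sample)
print(counts["the war on terror"])  # 2
```

Phrases never cross a line boundary here, which is why having each sentence on its own line matters.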
-
How long did this take you to do?
-
About 1 hour to get the transcript in the right format, another hour
to write the first prototype, another hour of debugging, and a half-hour
of typing up this document. And I hadn't slept for 14 hours before
that, and had gotten less sleep than usual before that! And I was sick with a cold
on Monday and Tuesday! Why am I still up???
-
Is that all?
-
No, I woke up on October 1st realizing that I could do something simple
to the underlying data structure (namely, instead of assigning a numerical
value to each phrase, assigning a string value to the phrases that have
a dominated superstring) that would remove a lot of bugs in the system.
So I spent about another 30 minutes on that.
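Here is one possible reading of that data-structure change, sketched in Python rather than the original Perl. It assumes a phrase counts as dominated when some longer phrase contains it and occurs exactly as often; the text does not spell out the rule, so that definition, along with the names `contains` and `mark_dominated`, is a guess for illustration only.

```python
def contains(longer, shorter):
    """True if the words of `shorter` occur contiguously inside `longer`."""
    lw, sw = longer.split(), shorter.split()
    return any(lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))

def mark_dominated(counts):
    """Map each phrase to its count, or to a superstring if it is dominated.

    Assumption: a phrase is dominated when a longer phrase contains it and
    has the same count, which suggests it rarely or never appears alone.
    """
    table = {}
    for phrase, n in counts.items():
        supers = [
            q for q, m in counts.items()
            if m == n
            and len(q.split()) > len(phrase.split())
            and contains(q, phrase)
        ]
        # Store the longest superstring (ties broken alphabetically),
        # otherwise keep the plain number.
        table[phrase] = max(supers, key=lambda q: (len(q.split()), q)) if supers else n
    return table

# Made-up counts for illustration.
table = mark_dominated({
    "war": 3, "on": 2, "terror": 2,
    "war on": 2, "on terror": 2, "war on terror": 2,
})
print(table["war on"])         # war on terror
print(table["war on terror"])  # 2
```

With a string stored in place of a number, a later pass can tell at a glance that "war on" should be folded into "war on terror" instead of being plotted separately.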
-
Did you know that the resistance of the human body decreases after a cold?
-
Oh. Well, that explains it.
-
So, what can we conclude from this data?
-
That I should get my priorities straight.
-
Can I tell my friends about this page?
-
Yeah, sure. Go nuts.
-
Are you going to do this for the next few debates?
-
Maybe. I did it for the Vice-Presidential Debate
that happened on October 5th, and the Second Presidential
Debate on October 8th.
-
Can you add some more phrases to the "interesting" page?
-
Sure. Send me an e-mail and I'll think about it. If you don't want
to expend the effort to find my e-mail address, you probably don't
really care that much.
-
How did you decide on the order of the words in the "Frequent words"
page?
-
From this page. Since that list doesn't have contractions, I had to add them myself; I may not have gotten them all.
-
I found a mistake...
-
Yay! I'm not surprised.
Tell me what it is, and if it's simple enough, I'll fix it.
-
What is ‰?
-
It's a permil symbol. Like a percent symbol, except that you have
1000 parts instead of 100 parts. So, 10‰ = 1%.
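The conversion is simple arithmetic. As a small Python sketch (the function names are made up here, and the idea that the graphs divide a phrase's count by some total word count is an assumption, not something stated above):

```python
def per_mil(count, total):
    """Parts per thousand: how many of every 1000 items are `count`."""
    return 1000 * count / total

def per_cent(count, total):
    """Parts per hundred, for comparison."""
    return 100 * count / total

# 5 occurrences out of 500 words is 10 per mil, which is 1 percent.
print(per_mil(5, 500))   # 10.0
print(per_cent(5, 500))  # 1.0
```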
-
Don't you know that historically, the incumbent should be blue and the
challenger should be red?
-
Yes, but I assumed you didn't.
Wei-Hwa Huang