Word Frequency Histograms of the First US 2004 Presidential Debate

Here are some histograms that show how often each person in the Presidential Debate held on 2004-09-30 said each phrase or word.

Colored by Candidate

Candidate-Specific

Questions and Answers

What are the graphics from?
They're made by HTML Graph. Nice, huh?
How did you do it?
First I needed to find a transcript that actually had everything spelled out, no Arabic numerals or anything. I ended up taking the AP transcript and editing it by hand, resulting in this. I also made sure each sentence was on its own line so I could then assume that any period not at the end of a line was part of a "word", and make some assumptions about capitalization. (Some errors are still there, though.) Then I wrote a Perl program that parsed the debate file and output a raw data file, counting all one-word strings, as well as all multi-word strings that appeared more than once. (I had to go back and change things so that it would find the maximal multi-word string.) Then I wrote another Perl script that generated proper output for HTML Graph to parse.
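For the curious, here is a rough sketch of the counting idea in Python. It is not the original Perl program: the function names, the `max_len` cutoff, the blanket lowercasing, and the exact meaning of "maximal" (drop a phrase when a longer phrase containing it occurs exactly as often) are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(line):
    """Split one sentence-per-line transcript line into words.

    Assumption: only punctuation at the very end of the line ends the
    sentence, so periods inside a word (e.g. "u.s.a") are kept.
    """
    line = line.strip().rstrip(".?!")
    # Lowercasing everything is a simplification of "assumptions about capitalization".
    return [w.strip(',;:"').lower() for w in line.split()]


def count_phrases(lines, max_len=5):
    """Count every run of 1..max_len consecutive words within a line."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Keep all single words, but only multi-word phrases said more than once.
    return {p: c for p, c in counts.items() if " " not in p or c > 1}


def maximal_phrases(counts):
    """Drop multi-word phrases that a longer phrase with the same count contains."""
    multi = [p for p in counts if " " in p]
    result = {}
    for phrase, count in counts.items():
        if " " in phrase:
            n = len(phrase.split())
            absorbed = any(
                len(q.split()) > n
                and counts[q] == count
                and f" {phrase} " in f" {q} "  # match on whole-word boundaries
                for q in multi
            )
            if absorbed:
                continue
        result[phrase] = count
    return result
```

With this, a phrase like "war on" disappears whenever it only ever shows up inside "the war on terror", while "the war" survives if it is also said on its own.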
How long did this take you to do?
About 1 hour to get the transcript in the right format, another hour to write the first prototype, another hour of debugging, and a half-hour of typing up this document. And I hadn't slept for 14 hours before that, and had less than usual before that! And I was sick with a cold on Monday and Tuesday! Why am I still up???
Is that all?
No, I woke up on October 1st realizing that I could do something simple to the underlying data structure (namely, instead of assigning a numerical value to each phrase, I can assign a string value to the phrases that are dominated by a superstring) that would remove a lot of bugs in the system. So I spent about another 30 minutes on that.
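One possible reading of that change, sketched in Python rather than the original Perl: the table maps each phrase either to a number (its count) or to a string (the longer phrase that dominates it). The names `mark_dominated` and `resolve`, and the rule that "dominated" means "contained in a longer phrase with the same count", are assumptions for illustration, not a description of the actual script.

```python
def mark_dominated(counts):
    """Return a table: phrase -> int count, or -> str naming its dominating phrase."""
    table = dict(counts)
    # Check the longest candidates first so a phrase points at its longest dominator.
    by_length = sorted(
        (p for p in counts if " " in p),
        key=lambda p: len(p.split()),
        reverse=True,
    )
    for phrase, count in counts.items():
        if " " not in phrase:
            continue  # assumption: single words always keep their numeric count
        n = len(phrase.split())
        for longer in by_length:
            if (
                len(longer.split()) > n
                and counts[longer] == count
                and f" {phrase} " in f" {longer} "
            ):
                table[phrase] = longer  # string value: "look at this phrase instead"
                break
    return table


def resolve(table, phrase):
    """Follow string values until a numeric count is reached."""
    value = table[phrase]
    while isinstance(value, str):
        value = table[value]
    return value
```

The appeal of a layout like this is that nothing gets deleted: a later pass can skip any entry whose value is a string, yet still recover its count when it needs to.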
Did you know that the resistance of the human body decreases after a cold?
Oh. Well, that explains it.
So, what can we conclude from this data?
That I should get my priorities straight.
Can I tell my friends about this page?
Yeah, sure. Go nuts.
Are you going to do this for the next few debates?
Maybe. I did it for the Vice-Presidential Debate that happened on October 5th, and the Second Presidential Debate on October 8th.
Can you add some more phrases to the "interesting" page?
Sure. Send me an e-mail and I'll think about it. If you don't want to expend the effort to find my e-mail address, you probably don't really care that much.
How did you decide on the order of the words in the "Frequent words" page?
From this page. Since it doesn't have contractions, I had to add them myself; I possibly didn't get them all.
I found a mistake...
Yay! I'm not surprised. Tell me what it is, and if it's simple enough, I'll fix it.
What is ‰?
It's a permil symbol. Like a percent symbol, except that you have 1000 parts instead of 100 parts. So, 10‰ = 1%.
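As a quick worked example in Python (a sketch only, assuming the per mil figures are a word's count as a share of the speaker's total words, which this page doesn't actually spell out; the function names are made up):

```python
def per_mil(count, total):
    """Express count as parts per thousand of total."""
    return 1000 * count / total


def per_mil_to_percent(value):
    """Ten per mil make one percent."""
    return value / 10


# A word said 25 times out of 5000 words is 5 per mil, i.e. half a percent.
share = per_mil(25, 5000)
print(f"{share:g}\u2030 = {per_mil_to_percent(share):g}%")
```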
Don't you know that historically, the incumbent should be blue and the challenger should be red?
Yes, but I assumed you didn't.

Wei-Hwa Huang