Processing text

Once you have text ready for analysis, what tools can you use, and what can you do with them?


Natural Language Toolkit: A basic introduction to natural language processing.

Speech and Language Processing, 2nd Edition, by Daniel Jurafsky and James H. Martin, also provides a comprehensive introduction.

Tools for Text: Materials from an excellent University of Washington text analysis conference in 2010, with videos of speakers and links to important papers.

Brandon Stewart and Justin Grimmer's paper on the promise and pitfalls of automated content analysis for political texts.

Specific Processing Packages and Techniques

ReadMe: Automated content analysis in R.

Wordfish: An R package to extract political positions from text documents and place documents onto a single dimension.

tm: A useful text mining R package.

Opinion Mining and Sentiment Analysis: How does what people write reflect their feelings?

Topic Modeling: Computer scientist David Blei's website, with links to topic modeling software.

lda: An R package that implements latent Dirichlet allocation topic models; takes data in the Blei LDA-C format.

JFreq: Easy-to-use standalone software from Will Lowe that batch-processes a folder of text files and creates either LDA-C sparse term-document matrices or non-sparse CSV term-document matrices for analysis in lda() or elsewhere. Because it is Java-based, it creates TDMs much faster than R packages. It can also preprocess and stem documents, and it provides basic content analysis if Yoshikoder dictionaries are supplied.
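
For readers unfamiliar with the LDA-C sparse format mentioned above, here is a minimal Python sketch of how such a file could be produced. It assumes the usual layout, in which each document is one line giving the number of unique terms followed by index:count pairs. The function name to_ldac and the toy documents are hypothetical, and this is not how JFreq itself is implemented.

```python
from collections import Counter

def to_ldac(docs):
    """Convert tokenized documents to LDA-C style lines (illustrative sketch).

    Each output line is: <number of unique terms> <index>:<count> ...
    Term indices are assigned in order of first appearance, starting at 0.
    """
    vocab = {}   # term -> integer index
    lines = []
    for tokens in docs:
        counts = Counter()
        for tok in tokens:
            idx = vocab.setdefault(tok, len(vocab))
            counts[idx] += 1
        pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
        # rstrip() leaves just "0" for an empty document
        lines.append(f"{len(counts)} {pairs}".rstrip())
    return lines, vocab

# Toy, already-tokenized documents (made up for illustration)
docs = [["tax", "cut", "tax"], ["cut", "spending"]]
ldac_lines, vocab = to_ldac(docs)
for line in ldac_lines:
    print(line)
```

The vocabulary mapping has to be saved alongside the matrix, because the sparse file stores only integer indices and the words are needed later to interpret the fitted topics.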

Yoshikoder: Text analysis with the lowest learning curve. A standalone program with an easy-to-use interface that works on English and non-English texts. It provides word frequency tables, dictionary-based content analysis, and concordance tables. However, it does not preprocess texts by stemming or by removing punctuation, numbers, stop words, and the most and least frequent words, all of which are now considered standard steps.
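
The preprocessing steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop word list is a tiny made-up sample, crude_stem is a naive suffix stripper standing in for a real stemmer such as Porter's, the tokenizer keeps ASCII letters only (so it suits English text but not other languages), and the thresholds min_docs and max_doc_share are hypothetical parameter names.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}

def crude_stem(word):
    """Naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(docs, min_docs=1, max_doc_share=1.0):
    """Lowercase, drop punctuation/numbers/stop words, stem, then drop
    terms that appear in too few or too many documents."""
    tokenized = []
    for text in docs:
        # Keeping only runs of a-z removes punctuation and digits.
        words = re.findall(r"[a-z]+", text.lower())
        tokenized.append([crude_stem(w) for w in words if w not in STOP_WORDS])

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    keep = {t for t, df in doc_freq.items()
            if df >= min_docs and df / n <= max_doc_share}
    return [[t for t in toks if t in keep] for toks in tokenized]

# Made-up example sentences
docs = [
    "The senator voted on 3 taxes.",
    "Taxes and spending cuts!",
    "The committee voted.",
]
cleaned = preprocess(docs, min_docs=2)

# A simple word frequency table over the cleaned corpus
freq = Counter(t for toks in cleaned for t in toks)
```

Filtering on document frequency rather than raw counts is one common choice among several; the right thresholds depend on the corpus and the analysis.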