What the Computer Said: Introduction and Background

Introduction

Our computational approach to analysing The Waste Land presently focuses on identifying the points in the text where a "voice switch" occurs, that is, where one voice in the poem gives way to another. Our interest is both in improving algorithms for this kind of stylistic analysis and in contributing an objective, quantitative "voice" to ongoing debates about Eliot's poem.

It should be noted, however, that our automatic method misses some very obvious cues signalling voice changes—cues that humans notice readily. In its present state, our algorithm does not truly "understand" patterns of human discourse, and thus bases its choices entirely on statistical differences between spans of text. Based on the degree of difference, we classify its choices into "high", "medium", or "low" confidence, though often the value of the difference depends largely on the type of text in the vicinity of each break.
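To illustrate the idea of confidence levels, the following is a minimal sketch of how a change score might be sorted into the three bands. The function name and both cut-off values are hypothetical, chosen only for illustration; they are not the thresholds used to produce the results on this site.

```python
def confidence(change, medium_cutoff=0.4, high_cutoff=0.7):
    """Sort a change score into a confidence band.

    Both cut-offs are hypothetical placeholders, not the project's values.
    """
    if change >= high_cutoff:
        return "high"
    if change >= medium_cutoff:
        return "medium"
    return "low"


# Example: three made-up change scores, one per band.
labels = [confidence(score) for score in (0.85, 0.5, 0.1)]
```

Because the size of the difference depends on the kind of text near each break, fixed cut-offs like these are only a rough guide.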

On the "What the Computer Said" page, instances where the algorithm is highly confident that a voice switch has occurred are indicated by a dark red line, as follows:

Instances where the algorithm has only medium confidence are indicated by a medium red line:

Instances where the algorithm has low confidence are indicated by a light pink line:

You can compare these computationally derived breaks to those perceived by human readers (see "What the Class Said"), which are indicated by black lines as follows:

Where the Computer and the Class agree on the location of a voice switch, a double-bar appears, with the level of the Computer's confidence indicated in the bottom position, as in the following bar, which indicates human-computer agreement and the algorithm's medium confidence:

Background

The following is a brief overview of how our voice-breaking algorithm works. A more detailed discussion is available in our paper "Unsupervised Stylistic Segmentation of Poetry with Change Curves and Extrinsic Features" (Julian Brooke, Adam Hammond, Graeme Hirst), presented at the Computational Linguistics for Literature Workshop at the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (June 8, 2012, Montreal).

Our algorithm scans the text of The Waste Land and, at each space between words, calculates a degree of "change" by comparing information (features) derived from the previous 50 words with the same features derived from the next 50 words. The information includes the number of various parts of speech on either side of the space, the surrounding words' stylistic characteristics (are they formal or informal, objective or subjective?), and information derived from various existing external sources, including large Internet text collections. The difference between the two sets of features forms our measure of change, and the change taken across the entire text forms a change curve. The algorithm is able to "see" line breaks (a very useful feature in poetry), but not stanza breaks, since it would be too easy for our algorithm to simply guess a break at every stanza: a method that would be relatively accurate, though not very enlightening.
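The construction of a change curve can be sketched in a few lines of Python. This is a simplified illustration rather than our actual implementation: it assumes each word has already been turned into a list of numeric features, and it measures change as the Euclidean distance between the average features of the two windows, which is an assumption made for this example (the paper describes the real features and measure). The names window_mean and change_curve are hypothetical.

```python
from math import sqrt


def window_mean(vectors, start, end):
    """Average the feature vectors in vectors[start:end], feature by feature."""
    span = vectors[start:end]
    return [sum(column) / len(span) for column in zip(*span)]


def change_curve(vectors, window=50):
    """Return one change score per gap between adjacent words.

    curve[i] is the score for the gap between word i and word i + 1.
    Windows are simply shortened near the start and end of the text.
    """
    curve = []
    for gap in range(1, len(vectors)):
        before = window_mean(vectors, max(0, gap - window), gap)
        after = window_mean(vectors, gap, min(len(vectors), gap + window))
        # Assumed distance measure: Euclidean distance between window averages.
        curve.append(sqrt(sum((a - b) ** 2 for a, b in zip(after, before))))
    return curve


# Toy input: one made-up feature per word (1.0 = "formal", 0.0 = "informal").
toy = [[1.0]] * 6 + [[0.0]] * 6
curve = change_curve(toy, window=3)
```

In this toy text the score peaks at the gap between the sixth and seventh words, where the "formal" words give way to the "informal" ones.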

We choose voice switches by looking at the local maxima of these curves: the relative high points where there appears to be a great amount of stylistic change in process, and where what comes before is relatively different, stylistically, from what comes afterwards. Since there are many possible local maxima in the poem, we must somehow limit them. We accomplish this by requiring that each be the highest point in some range (here, 50 words on either side). This means that the algorithm seldom guesses switches very close to one another, which in turn means there are some (indeed, many) breaks in the poem that it does not even attempt to guess. The alternative, however, is worse: relaxing this restriction would allow the algorithm to run wild, guessing voice switches nearly everywhere. Since there are, on average, approximately 50 words between each voice switch detected in the Class's reading of The Waste Land, we consider this a useful range for our purposes.
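The selection rule described above can be sketched as follows. This is an illustrative version only: the function name pick_switches is hypothetical, and the handling of ties (a point must be strictly higher than every other point in its range) is an assumption made for this example.

```python
def pick_switches(curve, radius=50):
    """Return the positions that are the highest point within `radius` on either side."""
    switches = []
    for i, score in enumerate(curve):
        lo = max(0, i - radius)
        hi = min(len(curve), i + radius + 1)
        # Assumed tie rule: the point must beat every other point in its range.
        if all(score > curve[j] for j in range(lo, hi) if j != i):
            switches.append(i)
    return switches


# Toy change curve with a small radius so the effect is easy to see.
toy_curve = [0.1, 0.5, 0.2, 0.9, 0.3, 0.4, 0.8, 0.1]
narrow = pick_switches(toy_curve, radius=2)
wide = pick_switches(toy_curve, radius=3)
```

With a radius of 2 the toy curve yields two switches, at positions 3 and 6; widening the radius to 3 leaves only position 3, because the peak at 6 now falls within range of the higher one. This is the trade-off discussed above: a larger range suppresses nearby guesses.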

As with "What the Class Said," the results presented here round off the guesses of the algorithm to the nearest line.
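Rounding a guess to the nearest line can be sketched as below. The function name and the representation of the poem as a list of per-line word counts are assumptions made for illustration, not a description of our actual code.

```python
from itertools import accumulate


def nearest_line_break(words_before_switch, line_lengths):
    """Return how many lines precede the line break closest to the guess.

    words_before_switch: number of words before the guessed switch.
    line_lengths: number of words in each line of the poem, in order.
    """
    # Running totals give the word count at the end of each line.
    line_ends = list(accumulate(line_lengths))
    closest = min(range(len(line_ends)),
                  key=lambda i: abs(line_ends[i] - words_before_switch))
    return closest + 1


# Toy poem of four lines with 7, 8, 6 and 9 words.
toy_lines = [7, 8, 6, 9]
after_line = nearest_line_break(13, toy_lines)
```

Here a switch guessed after the 13th word is moved to the end of the second line, which closes at word 15.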

Read The Waste Land as Divided by the Computer