Stylometric Analysis of Colossal Cave Adventure (Part 1): Tools and Method

[Figure: Graph created in Gephi of Colossal Cave Adventure narrative data generated by Stylo for R.]

Introduction

Several dozen versions of the Ur-game of interactive fiction Colossal Cave Adventure (CCA) exist, created over the past 40 years by enthusiasts of the game. It stood to reason that these sets of code were not all independently created, but rather relied on the source code of the second version of the game. The first version of CCA, authored by Will Crowther in 1976, was discovered by Don Woods in 1977, who subsequently revised it. Woods then posted his version as open source, at which point people in North America and Europe began to create their own versions of the game as early as 1978.

Curious to see if I could find evidence of code reuse across 40 years of CCA, I decided to borrow digital tools used by historians and scholars of literature for their work in text analysis, which includes stylometry. A classic example used to train scholars in the use of digital text analysis tools is the set of Federalist Papers. This example is a good one because there is a finite number of authors (3) and a small corpus of data (85 separate papers). By running the text of the corpus through text analysis software, one can attribute authorship on the basis of writing style, an attribution further supported by comparing a disputed text against the corpus of an author’s other works.

Method

The code of Colossal Cave Adventure is not unlike the Federalist Papers: for CCA there is a finite set of authors (120) and a small corpus of text (162 documents) as identified by Nathanael CJE Culver in his Adventure Family Tree. In order to achieve usable results with the software, the corpus of texts needed to be prepared. This was achieved by identifying three classes of files within each set of code.

Versions of CCA have been written in at least 12 programming languages, beginning with FORTRAN IV and continuing on to contemporary languages such as Python. CCA has even been ported to the Nintendo 3DS handheld gaming device. Despite this variety of programming languages, most code sets could be broken into three distinct parts, which could then be stylometrically analyzed. The first part is the code itself. It made little sense to compare sets written in FORTRAN IV to sets written in C (it would be like comparing an Akkadian text to something written in English). It was, however, possible to collect over a dozen examples written in various versions of FORTRAN, and to compare those code sets against each other to see how much (or how little) of the FORTRAN code had been re-used by other authors over time.

The second class of files shared across the spectrum of CCA comprises the “data” files. CCA is a narrative game and as such contains “human-readable” text in the form of statements presented on the screen to the player depending on what the player types. These statements include descriptions of areas within the cave, of creatures encountered during the adventure, bad puns, and notoriously humorous ways to die. The game’s narrative was often packed into its own distinct file and given line numbers, which were called by the main program during the game’s operation. It is no secret that the game’s narrative evolved over time, with later authors adding new rooms to the cave, new adventures, and more. Stylometric analysis would be able to identify what (and how much) of that narrative was preserved and shared across versions.

The third class of files to undergo text analysis were the “ReadMe” files. Many programs—especially those created in the 1980s and 1990s—contained a ReadMe file, a simple text file containing brief instructions for the user as well as copyright information and, on occasion, a history of that program’s creation. Roughly half of the CCA code sets that I was able to locate and download contained a ReadMe file, and I was curious to see if these, too, had been shared over time between authors.

The end result of running these three distinct sets of files through the text analysis software would be a history of text reuse, a borrowing of code, and a genealogy of CCA versions. All of this contributes to understanding the history of the game, and it also points to future applications on other sets of code for any kind of software as archaeology continues its digital turn.

The next section discusses the tools and processes used to carry out the text analysis of a corpus of code. This will be followed in Part 2 by an interpretation of the results, and a conclusion about the utility of this kind of approach to increasing one’s understanding of the digital archaeological record.

To read more about the archaeology of code, see my earlier post.

Tools

In order to conduct any kind of modern text analysis, one needs to employ software programmed to recognize complex patterns within collected bodies of work. The software quickly executes this function by comparing the text of every document in a corpus against every other document in that corpus. The resulting data are similarity scores from 0 to 1, with scores in the .8 and .9 range indicating very close matches/relationships between documents, and scores closer to 0 indicating originality, i.e., no relationship to other documents in the corpus.
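To make that 0-to-1 scale concrete, here is a toy illustration in base R (not part of the workflow below) of Jaccard similarity, the word-overlap measure used by the TextReuse steps later in this post; the sample strings are my own:

jaccard <- function(a, b) {
  # shared unique words divided by total unique words
  a <- unique(strsplit(tolower(a), "\\s+")[[1]])
  b <- unique(strsplit(tolower(b), "\\s+")[[1]])
  length(intersect(a, b)) / length(union(a, b))
}

jaccard("you are in a maze of twisty little passages",
        "you are in a maze of twisting little passages") # 0.8: 8 of 10 unique words shared
jaccard("you are in a maze", "plugh")                    # 0: nothing shared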

After consulting with colleagues in the Digital Humanities (Classicists in particular, who frequently run text analysis against surviving documents from antiquity), I decided to use two packages developed for the R language and environment, plus Gephi for data visualization. R (the R Project for Statistical Computing) is a widely used, free, cross-platform statistics program that encourages the creation of open source packages performing a variety of statistical analysis tasks. I selected the Stylo package, which is widely used for stylometric analysis of texts written in any language. I also selected the TextReuse package, originally developed for the field of law, which specifically checks for the presence of text shared between two or more documents. With Stylo I hoped to see the presence of “hands” across the many versions of CCA. With TextReuse I wanted to determine which versions of the game were borrowing from other versions, and how much text was shared between them. A summary of results will follow in Part 2.

Data visualization allows the researcher to take a different approach to interpreting results derived from experimentation. I needed a way to see if/how CCA’s code was being shared, and how various versions connected with others (if at all). Gephi is a free, open source, cross-platform data visualization tool, and one that is commonly used by Digital Humanities scholars. Again, I wanted to use robust digital tools that are either free or outright open source to demonstrate that one can achieve results on a budget that is either tight or non-existent. With Gephi, I could import the CSV files created through the Stylo and TextReuse packages for R and convert them into color-coded graphs displaying links as well as weights, showing the popularity of some versions over others with regard to how code was borrowed between versions.

The learning curve for both R packages and for Gephi can be steep, especially for a non-technical person. Thankfully the open source community of programmers, as well as the wider Digital Humanities user group, understands this and has gone to great lengths to document the code and to provide free, illustrated tutorials. I am grateful for the assistance of Shawn Graham (archaeologist and Associate Professor of History, Carleton University), who was able to coach me on how to use the software to its greatest effect. This kind of help is not uncommon, and it creates meaningful professional relationships as we work together to make Digital Humanities more user-friendly.

Process

The following steps illustrate how to conduct text analysis of code in R and then visualize the results with Gephi. These steps are for Mac OS X and will differ slightly for people using other platforms.

Stylo

Step 1: Prepare the corpus. For CCA, I opened the files that I wanted to analyze in a simple text editor and then re-saved the files as .txt. To make it easy to read the results, I named each file after its author (e.g., crowther.txt). I then placed all of the .txt files into a folder labeled “corpus”. Note that because I was working with three sets of files (code, data, and ReadMe), I had three separate “corpus” directories.
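If you have many files to prepare, R itself can do the copying and renaming. Here is a minimal sketch, assuming the original files sit in a folder named “raw” and that each file name already identifies its author (both of which are my assumptions; adjust to match your own files):

# copy every file from "raw" into "corpus", re-saving it as [author].txt
dir.create("corpus", showWarnings = FALSE)
for (f in list.files("raw", full.names = TRUE)) {
  author <- tools::file_path_sans_ext(basename(f))  # e.g., "crowther"
  file.copy(f, file.path("corpus", paste0(author, ".txt")))
}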

Step 2: Download and install R from www.r-project.org. Follow the instructions in the “Getting Started” section. Once installed, launch the R app.

Step 3: Download and install Stylo for R from here. Follow the installation instructions provided on that page (Steps 1.1 and 1.2).
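On most setups the installation described there amounts to a single command at the R prompt (my assumption; follow the linked instructions if your platform differs):

install.packages("stylo")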

Step 4: Run Stylo. After launching the R app, type the following at the prompt and then press the Return key (always press Return to run the line of code that you typed):

library(stylo)

Next, set the working directory where the results will ultimately be saved, typing the command without the brackets and substituting your own information for what is inside them:

setwd("/Users/[your user name on your computer]/[filepath to the directory containing the corpus folder]")

For example, I typed:

setwd("/Users/andrew/desktop/colossalcavereadme")

Now you can run Stylo by typing:

stylo()

A window will open that gives you options to describe the nature of your data and how you would like it to appear.

  • Input & Language windowpane: Select “plain text” for input, and “Other” for language.
  • Features windowpane: Accept the default settings.
  • Statistics windowpane: Accept the default settings.
  • Sampling: Accept the default settings.
  • Output: Accept the default settings.

Press the “OK” button, and in a few moments some files will appear in the directory holding your corpus folder. One of these is a .csv file containing the stylometric analysis results for the documents in the corpus folder; it can be opened in a spreadsheet program and reviewed and interpreted prior to data visualization. To exit R when you are finished, type:

q()
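As an aside, once you know which settings you want, the GUI can be bypassed by passing them as arguments to stylo(). A minimal sketch, assuming the plain-text/Other choices described above (check the argument names and values against the Stylo documentation for your installed version):

library(stylo)
# run Stylo without the GUI: plain-text input, language "Other",
# everything else left at its defaults
stylo(gui = FALSE, corpus.format = "plain", corpus.lang = "Other")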

TextReuse

Step 1: Prepare the corpus. You can use the same .txt files in the same corpus folder created in Stylo Step 1 above.

Step 2: Download and install TextReuse. Launch the R app and type the following at the prompt:

install.packages("textreuse")

As above with Stylo, set the working directory where the results will ultimately be saved, typing the command without the brackets and substituting your own information for what is inside them:

setwd("/Users/[your user name on your computer]/[filepath to the directory containing the corpus folder]")

For example, I typed:

setwd("/Users/andrew/desktop/colossalcavereadme")

Then type:

library(textreuse)

Now you can run TextReuse by typing the following lines of code (below is what I typed for my project), pressing Return after each line:

dir <- "corpus"

corpus <- TextReuseCorpus(dir = dir, meta = list(title = "Colossal Cave Adventure"), tokenizer = tokenize_ngrams, n = 7)

[NB: “n = 7” sets the n-gram size: the tokenizer breaks each document into overlapping sequences of seven words, and these sequences are what get compared between documents. This number can be set higher or lower depending on your needs.]
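To see what the tokenizer actually produces, you can try it on a short string at the prompt (the sample sentence here is my own):

tokenize_ngrams("you are standing at the end of a road before a building", n = 7)
# returns overlapping seven-word sequences, e.g.:
# "you are standing at the end of"
# "are standing at the end of a"
# "standing at the end of a road"
# ...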

Now type, pressing Return after each line:

corpus

names(corpus)

comparisons <- pairwise_compare(corpus, jaccard_similarity)

comparisons

pairwise_candidates(comparisons)

df <- pairwise_candidates(comparisons)

View(df)

write.csv(comparisons, file = "textreuse-comparisons.csv")

This will create a .csv file containing a table of data about which documents borrowed how much text from other documents in the corpus folder. You can read and interpret the data now, or move on to data visualization with Gephi.

Data Visualization with Gephi

Follow these steps to create a graph (like the one at the top of this post) based on the .csv files returned from both Stylo and TextReuse:

Step 1: Prepare the .csv files. While the .csv files can be visualized as-is, they may benefit from a bit of data reformatting from a table to a list that will better define the “edges” (links) between “nodes” (.txt files that underwent text analysis). To do this, first download and install a Visual Basic macro, “table2list.xla”, from here. Open a .csv file containing your data, then open the Visual Basic editor from within the spreadsheet application, and run the macro. Re-save and close the .csv file. Repeat for your other .csv files.
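If you would rather stay in R than run the Excel macro, the same table-to-list conversion can be scripted. A minimal sketch for the TextReuse output, assuming the .csv written above (the file name and column handling reflect my own run; verify against your data):

# read the square comparison matrix back in and flatten it into a
# Source / Target / Weight edge list that Gephi can import directly
m <- as.matrix(read.csv("textreuse-comparisons.csv", row.names = 1))
edges <- data.frame(Source = rownames(m)[row(m)],
                    Target = colnames(m)[col(m)],
                    Weight = as.vector(m))
edges <- edges[!is.na(edges$Weight), ]  # pairwise_compare leaves half the matrix NA
write.csv(edges, "textreuse-edges.csv", row.names = FALSE)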

Step 2: Download and install the Gephi data visualization program from gephi.org. Launch Gephi.

Step 3: Create a graph based on a .csv file of text-analyzed data by first selecting “File” from the menu, and then “Import Spreadsheet.”

Once the spreadsheet has been uploaded to Gephi and nodes and edges appear in the Overview window, do the following:

  • Activate the Statistics pane and select “Run” for Eigenvector in the Node Overview section.
  • In the window that appears, choose “Undirected” and then OK.
  • Close the Eigenvector window after it opens.
  • In the Appearance panel, choose the Circles icon, then Nodes and then Ranking and select Eigenvector Centrality from the drop-down list.
  • Set Min size to 5 and Max size to 40. Select “Apply.”
  • In the Statistics pane run the Modularity option in the Network Overview section.
  • Accept the defaults and select OK.
  • Close the Modularity Report window after it opens.
  • In the Appearance panel, choose the Art icon, then Nodes and Partition, and select Modularity Class. Select Apply.
  • At the bottom of the screen, choose the T icon.
  • Open the Data Laboratory window and select the “Copy data to other column” button. Choose ID and then Label, and press OK.
  • Return to the Overview window.
  • Use the Font Size slider to reduce the size of the labels.
  • In the Layout pane, choose ForceAtlas 2 from the drop-down list. Run it. Change Gravity to 5. Run the change.
  • Select Expansion from the drop-down list. Run it.
  • Select Fruchterman Reingold from the drop-down list. Run it. Stop it.
  • Select Noverlap from the drop-down list. Run it.
  • Select Label Adjust from the drop-down list. Run it.
  • In the Filters pane, choose Edge Weight from the Edges Library. Drag Edge Weight into the Queries pane below. Change the slider to read something between 4 and 5, and press the Filter button. Reset the slider to 1.0 and press the Stop button.
  • Open the Preview window and tick the box for “Show Labels.” Click the “Refresh” button to view the final graph. Export to .svg, .png, or .pdf by selecting the “Export” button at the bottom of the window.

Part 2 will interpret my results and will provide conclusions about code-stylometry as well as the next phase of this investigation.

—Andrew Reinhard, Archaeogaming
