Stylometric Analysis of Colossal Cave Adventure (Part 2): Results


People unfamiliar with my case for code-archaeology should start with this post, and then read Part 1 about my tools and methods used to achieve the results below.

Introduction

At the start of this case study, it was unclear whether or not I would be able to perform text analysis on the sets of files from the numerous versions of Colossal Cave Adventure. Working with three packages in R and visualizing the output in Gephi proved that not only was it possible, but that there are now conclusions to be drawn about the 40+-year history of CCA’s development.

As mentioned in the first section of tools and methods, I decided to divide the files up into three groups: code, data, and ReadMe files. The code files also needed to be separated by language, with the largest grouping reserved for 12 sets of FORTRAN code, albeit in various iterations of that language from its original FORTRAN IV (1966) to FORTRAN-77 (1977) and beyond. It was unclear whether any meaningful results about possible code-sharing could be obtained from this small data set written in various versions of the same language, but I felt it worth the effort to try.

To report the results, I have broken them down into three sections of three parts each, the top-level sections pertaining to the three classes of files followed by the three text analysis tools used on each of the three classes of files. Two sections will follow the reporting of results: additional file metadata to support these quantitative findings, and general conclusions.

NOTE: All of my data sets, .gephi files, and CSV files will be uploaded to my GitHub repository at the conclusion of my case study.

NOTE: CCA version names have already been established in the literature. Those version names, dates, and creator names are maintained by Nathanael CJE Culver as part of his Adventure Family Tree project, and are generally accepted as being accurate. The names and dates I use below are derived from Culver’s project.

fortran
FORTRAN source code (from dove0550) as it appears in a simple text editor.

FORTRAN Code Sets

Stylo

The Stylo package for R looks for shared traits of authorship over a corpus of files. The goal of running code, data, and ReadMe files through Stylo was to see what (if any) text was borrowed by the authors of CCA versions over a span of 40+ years.

Stylo and FORTRAN Code

There are 12 known sets of FORTRAN code for CCA written between 1976 and 2016. Stylo read the contents of each file and then compared each set of code against every other set in the small corpus. Once the comparisons were completed, Stylo assigned a weight to each pairing of files, a higher number meaning a close match and a high probability that one set of code borrowed heavily from another.

In the case of the FORTRAN code sets, weights between 1 and 6 were assigned to analyzed file-pairings, a weight of 6 meaning that the files were identical. Stylo compares each pair of files backwards and forwards, meaning that each pairing is analyzed twice where each file in the pair takes a turn being the Source and then the Target, which is compared against the Source.
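Because each pairing appears twice (once per direction), it can help to collapse the two rows into a single undirected pair before counting or graphing. The following is a minimal Python sketch of that step, not the method used in this case study. The tuple layout is hypothetical, and it assumes that both directions of a pairing carry the same weight, which the results above do not state; where they differ, the sketch keeps the larger value.

```python
from collections import defaultdict

# Directed rows as (source, target, weight). The pairings and weights are
# taken from the results reported in this post; the row layout and the
# assumption that both directions share a weight are illustrative only.
directed_edges = [
    ("long0500", "oska0551", 6),
    ("oska0551", "long0500", 6),
    ("wood0350", "wood043b", 6),
    ("wood043b", "wood0350", 6),
    ("blac0350", "dove0550", 5),
    ("dove0550", "blac0350", 5),
    ("long0500", "long0500", 6),  # a file compared against itself
]


def collapse_directions(edges):
    """Merge A->B and B->A rows into one undirected pair, keeping the larger weight."""
    merged = defaultdict(int)
    for source, target, weight in edges:
        if source == target:
            continue  # drop self-comparisons
        pair = tuple(sorted((source, target)))
        merged[pair] = max(merged[pair], weight)
    return dict(merged)


undirected = collapse_directions(directed_edges)
for pair, weight in sorted(undirected.items()):
    print(pair, weight)
```

Sorting the two names inside each pair gives both directions the same key, so seven directed rows reduce to three undirected pairings.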

Fortran-Stylo-Weights
Comparison of FORTRAN code sets generated and weighted by Stylo for R.

Barring the score of 6 given when a file is compared against itself for calibration, there are two sets of identical FORTRAN code: 1) long0500 and oska0551, and 2) wood0350 and wood043b. For the first set of identical files, oska0551 is a direct, 1:1 port done in 1990 by Johann Gunnar Oskarsson to FORTRAN-77 from long0500, which was written by David Long in 1979 in FORTRAN IV for the DEC mainframe. For the second set, it is clear that Don Woods reused his own code: his 1977 direct port of Will Crowther’s original source code (1976), written in FORTRAN IV for the PDP-10 mainframe, carried over into his 1978 version, which became the source for all future versions of the first iteration of CCA.

Two sets of code received a weight of 5: 1) blac0350 and dove0550, and 2) vane0560 and wood0350. Kevin J. Black’s version of CCA was written in Microsoft Fortran in 1987, and became a more accessible ported version of the game for others to borrow from, nearly as popular as Don Woods’ 1977 source code, wood0350, from which Neal Van Eck borrowed heavily in his 2011 version, vane0560.

Three sets of code received a weight of 4: 1) arna0440 and wood0350, 2) blac0350 and jame0551, and 3) vane0560 and wood043b. Mike Arnautov’s code (2001) first appears here, derived from Woods’ original. Arnautov would also create versions in the A-Code language. Daniel Jameson (jame0551) ported Kevin Black’s Microsoft Fortran (blac0350) to FORTRAN-77 in 2016 for the BBC Micro.

By the time we reach code-comparisons with a weight of 3 or less, the same popular authors begin to recur: Don Woods and Kevin Black. Their code, written in 1977 and 1987 respectively, appears to be the well most frequently visited by other authors writing in FORTRAN.

Six sets of code received a weight of 3: 1) arna0440-dos and blac0350, 2) arna0440-source and vane0560, 3) blac0350 and muno0370, 4) blac0350 and olss0551, 5) blac0350 and supn0350, and 6) dove0550 and jame0551.

Seven sets of code received a weight of 2: 1) arna0440-dos and dove0550, 2) arna0440-dos and long0500, 3) arna0440-dos and oska0551, 4) dove0550 and supn0350, 5) dove0550 and muno0370, 6) dove0550 and olss0551, and 7) wood043b and arna0440-dos.

Six sets of code received a weight of 1: 1) arna0440-dos and jame0551, 2) blac0350 and long0500, 3) blac0350 and oska0551, 4) jame0551 and muno0370, 5) jame0551 and olss0551, and 6) jame0551 and supn0350.

Fortran-Stylo-Plot
Graph of FORTRAN code sets generated in Gephi from Stylo-derived data.

Visualizing the Stylo FORTRAN Data

Running the data through Gephi provided the following graph. I set Gephi to ignore code sets with weights lower than 4, displaying only sets with weights of 4, 5, and 6. As expected, long0500 and oska0551 are connected by two thick “edges” (lines) indicating equivalence, as are wood0350 and wood043b. The frequently borrowed code of blac0350, dove0550, and jame0551, and to a lesser extent arna0440-dos, appears in large font to indicate popularity of use, linking stylistically to several other code sets. The code sets are also grouped in green, purple, and orange to show where the greatest stylistic similarities lie. Based on the chart, there are three main divisions to the style of the code, very much like shared DNA. I would have expected wood0350 to be more prominent in the chart, and am unsure why its appearance in the graph is so small.
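The weight cut-off and the idea of node prominence can be approximated outside Gephi. The sketch below is in Python, with pairings taken from the weighted lists above, and it uses a plain count of connections (degree) as a stand-in for prominence. That stand-in is an assumption: Gephi offers several ranking measures, and which one sized the nodes in the graph is not recorded here.

```python
from collections import Counter

# Undirected pairings as (file_a, file_b, weight), copied from the
# weight 6, 5, and 4 lists above plus two weight-3 rows to filter out.
pairs = [
    ("long0500", "oska0551", 6),
    ("wood0350", "wood043b", 6),
    ("blac0350", "dove0550", 5),
    ("vane0560", "wood0350", 5),
    ("arna0440", "wood0350", 4),
    ("blac0350", "jame0551", 4),
    ("vane0560", "wood043b", 4),
    ("blac0350", "muno0370", 3),
    ("dove0550", "jame0551", 3),
]

MIN_WEIGHT = 4  # same cut-off as described for the graph


def filter_edges(edges, min_weight):
    """Keep only pairings at or above the minimum weight."""
    return [edge for edge in edges if edge[2] >= min_weight]


def degree_counts(edges):
    """Count how many retained pairings each file takes part in."""
    degrees = Counter()
    for file_a, file_b, _weight in edges:
        degrees[file_a] += 1
        degrees[file_b] += 1
    return degrees


kept = filter_edges(pairs, MIN_WEIGHT)
degrees = degree_counts(kept)
for name, count in degrees.most_common():
    print(name, count)
```

Comparing a ranking like this against the node sizes in the Gephi graph is one way to check which measure the visualization is actually using.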

FORTRAN-TextReuse-CSV
Weighted FORTRAN code sets compared with each other using TextReuse for R.

TextReuse and FORTRAN Code

TextReuse compares every document against every other document in a corpus in order to see what percentage of text was borrowed/shared between files. Unlike stylometrics, which is a bit more forgiving and takes into account vocabulary, syntax, and punctuation, TextReuse takes a brute-force approach to determine whether blocks of text were copied 1:1 between files. The results are weighted from 0 to 1, 1 being 100% copied/match, and zero meaning no copying at all.
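One common way to turn "how much text is shared" into a score between 0 and 1 is Jaccard similarity over word n-grams: split each document into overlapping runs of n words, then divide the number of runs the two documents share by the number of distinct runs in either. The Python sketch below illustrates that idea only. Whether the TextReuse runs in this case study used exactly this measure and these settings is an assumption, and the sample sentences are illustrative.

```python
import re


def word_ngrams(text, n=3):
    """Return the set of overlapping n-word runs in a text (case-insensitive)."""
    words = re.findall(r"\S+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(text_a, text_b, n=3):
    """Shared n-grams divided by all distinct n-grams; 1.0 means identical sets."""
    grams_a = word_ngrams(text_a, n)
    grams_b = word_ngrams(text_b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both texts shorter than n words
    return len(grams_a & grams_b) / len(union)


original = "you are standing at the end of a road before a small brick building"
exact_copy = "you are standing at the end of a road before a small brick building"
variant = "you are standing at the end of a road before a large stone building"

print(jaccard_similarity(original, exact_copy))  # 1.0
print(jaccard_similarity(original, variant))     # 0.6
```

Changing two adjacent words out of fourteen removes three of the twelve three-word runs, which is why a small edit lowers the score more than its size suggests.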

Unlike the results explained above for Stylo, only three pairs of files showed significant outright copying/formatting. Scoring a weight of one, blac0350 is an exact match with supn0350. Interestingly, these two files received a weight of 3 from Stylo. When I opened the files to compare them visually, it was clear that both files were identical to each other, even though they were not written in English, and appeared instead as hundreds of pages of symbols, which lined up exactly. The discrepancy may be that with Stylo, one can set an “n-value” (the n-gram size), meaning that one compares runs of n consecutive characters, such as seven at a time (n=7), or can set that number higher or lower. In dealing with older FORTRAN files, we might be seeing odd results with n set so high, and future analysis can test with a finer grain to see if this impacts the weight.
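To make the effect of the n-value concrete, here is a small Python sketch of character n-grams, meaning runs of n consecutive characters. It is not Stylo's implementation; the two sample strings are invented FORTRAN-like fragments, and the overlap score is a simple shared-over-total ratio chosen for illustration.

```python
def char_ngrams(text, n):
    """Return the set of runs of n consecutive characters in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_fraction(text_a, text_b, n):
    """Fraction of distinct character n-grams that both strings have in common."""
    grams_a = char_ngrams(text_a, n)
    grams_b = char_ngrams(text_b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


# Invented sample fragments that differ by a single character.
line_a = "GOTO 100"
line_b = "GOTO 200"

print(round(shared_fraction(line_a, line_b, 2), 3))  # 0.556
print(shared_fraction(line_a, line_b, 7))            # 0.0
```

With n=2 the two fragments still share most of their n-grams, but with n=7 a single changed character leaves nothing in common. Short, dense lines of code are therefore more sensitive to a large n than running prose, which is consistent with the suggestion above to retest with a finer grain.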

The second set of files sharing 70% of copied text contains long0500 and oska0551, which makes a bit more sense seeing as oska0551 is a direct port of long0500 and was assigned a weight of 6, meaning identical style. It is interesting to note, however, that visual inspection of both files shows them to be identical, too, and thus should have a TextReuse weight of 1 (or 100%). It remains unclear to me why 70% was assigned by the package to this pairing.

The third set of files has a weight of 17% text shared between wood0350 and arna0440, yet has a stylometric weight of 4. One could interpret this as arna0440 being written in the manner of wood0350 without explicit copying of blocks of code.

Fortran-TextReuse-Plot
Gephi-created chart of TextReuse data showing nodes and relationships between FORTRAN code sets.

Visualizing the TextReuse FORTRAN Data

Running the FORTRAN TextReuse data through Gephi shows five families of text (in dark green, light green, light blue, pink, and orange). Dove0550 is shown as an outlier unrelated to any other code sets, meaning this FORTRAN is completely original. The other families of code sets are loosely connected through thin edges indicating slight relationships, something borne out by the data. Code sets most widely borrowed from appear in a larger font, even if that borrowing is minimal. The 1977 code of wood0350 is one of the two most widely used sets, which is understandable because of its age and proximity to Crowther’s Ur-code.

Textnets and FORTRAN Code

The Textnets package for R creates visualizations of networks shared across a corpus of texts, and also shows how various networks of texts relate to each other. Unfortunately the FORTRAN code sets caused the text analysis tool to fail, erroring out because of non-English and special characters. Textnets did, however, work for the sets of data and ReadMe files, as will be shown below.

CCA-Data
Sample narrative data from William Crowther’s original version of CCA (1976), stored in the advent.dat file.

Data Sets

The second group of Colossal Cave Adventure files contains narrative data referenced by the game’s code. In Crowther’s 1976 original game and in Woods’ port of the game in 1977, there is a program file and a data file, advent.dat. This .dat file can be opened in a simple text editor (e.g., TextEdit on macOS and Notepad on Microsoft Windows). It contains a few hundred lines of human-readable text (i.e., not in FORTRAN or another programming language), each line sequentially numbered and containing a description of where a player is (e.g., a room of the cave) or of player-related actions (e.g., death). Of the 100+ versions of CCA known to exist, 37 have the data file split off from the main program. I was able to prepare these 37 files for text analysis, and then ran this corpus through the three packages for R.

Stylo-Weights-Data
Weighted Stylo data for sets of CCA narrative text.

Stylo and Data Sets

As with the FORTRAN code sets, I used Stylo to check for stylometric similarities between texts in a corpus. These texts were weighted from 1 to 6, 6 being a perfect match, which would likely mean that the data from each set in the pair being compared were identical. Because the game changed over the course of 40 years, I expected to see some exact matches for cavern rooms and events, as well as some close matches.

Weight of 6 (4 sets):

  • bhch0565 and well0550
  • cox_0350 and daim0350
  • kine0350 and wood0350
  • malm0350 and malm1000

It is important to note the numbers in the filenames. These numbers correspond to the maximum score players can achieve in CCA. The original high score was 350 points, and versions named “0350” are typically closest to the original version of the game. This also means that the name and number of the rooms in the cavern as well as the things that happen to the player are roughly the same. Filenames with higher scores typically indicate later versions, and one can see parallels between these files as well. Here bhch0565 (1987) and well0550 (1985) share the same data file for their higher-scoring version. Also David Malmberg copied his own data file between malm0350 (1993) and malm1000 (2000).

Weight of 5 (8 sets):

  • anon0501 and oska0551
  • arna0440-linux and arna0440-source
  • cox_0350 and lumm0350
  • ekma0350 and kine0350
  • gill0350 and wood0350
  • goet0350 and kint0350
  • kenw0350 and plot0350
  • wood0430 and wood043b

Again we see parallels between scores and files, with 350-files and 440-files borrowing from each other, not to mention Woods borrowing from himself. Oska0551 (1990) borrows from anon0501 (1979), and both versions follow the code created by long0500. The pairing of long0500 and oska0551 is listed below with a weight of 4.

Weight of 4 (4 sets):

  • daim0350 and lumm0350
  • goet0350 and ticm0350
  • kenw0550 and vane0560
  • long0500 and oska0551

Similarly scored files continue to pair with each other.

The remaining weights (3, 2, and 1) show an increased diffusion of human-readable text spread across various, later versions of the game as authors continued to change the points system, while adding rooms and events not native to Crowther and Woods’ original CCA.

Weight of 3 (17 sets):

  • anon0501 and arna0660
  • anon0501 and long0500
  • arna0440-dos and arna0440-linux
  • arna0660 and arna0770
  • arna0660 and plat0550
  • arna0770 and mcdo0551
  • beck0500 and kenw0550
  • CROW0000 and russ0000
  • ekma0350 and gill0350
  • ekma0350 and wood0350
  • kenn0000 and lumm0350
  • kenn0000 and pohl0350
  • kenn0000 and whin0450
  • kint0350 and ticm0350
  • munk0430 and wood0430
  • nels0350 and oska0551
  • plot0350 and vane0560

Weight of 2 (18 sets):

  • anon0501 and bhch0565
  • anon0501 and nels0350
  • anon0501 and well0550
  • arna0440-dos and arna0440-source
  • arna0660 and long0500
  • arna0770 and munk0430
  • beck0500 and mcdo0551
  • beck0500 and plot0350
  • bhch0565 and malm1000
  • cox_0350 and kenn1000
  • CROW0000 and lumm0350
  • gill0350 and kine0350
  • lumm0350 and pohl0350
  • lumm0350 and russ0000
  • munk0430 and wood043b
  • oska0551 and malm0350
  • oska0551 and plat0550
  • pohl0350 and whin0450

Weight of 1 (26 sets):

  • anon0501 and gill0350
  • anon0501 and plat0550
  • arna0440-dos and beck0500
  • arna0440-linux and beck0500
  • arna0440-source and beck0500
  • arna0660 and munk0430
  • arna0770 and wood0430
  • beck0500 and vane0560
  • bhch0565 and long0500
  • cox_0350 and CROW0000
  • cox_0350 and pohl0350
  • cox_0350 and russ0000
  • daim0350 and kenn1000
  • gill0350 and long0500
  • gill0350 and munk0430
  • gill0350 and wood0430
  • goet0350 and oska0551
  • goet0350 and wood0350
  • kenw0550 and mcdo0551
  • kint0350 and wood0350
  • long0500 and nels0350
  • long0500 and well0550
  • lumm0350 and whin0450
  • malm0350 and kint0350
  • malm1000 and oska0551
  • ticm0350 and wood0350

Data-Stylo-Plot
Gephi-created graph of stylometric relationships from Stylo data between narrative data files.

Visualizing the Stylo Data Sets

As above with the FORTRAN code sets, I ran the data sets through Gephi to visualize the stylometric relationships between data sets. I excluded relationships weighted below 4. The resulting graph is below and shows six distinct stylometric families (light green, dark green, red, purple, orange, light blue). The most conspicuous data source turns out to be long0500 (1979). This file might have been more popular than Woods’ because of its discoverability: people looking for CCA might have found it easier to find than Woods’ version, and so copied its contents. Again the graph shows shared DNA, demonstrating whose files were derived from whose and showing the primary source of data within each of the six families.

TextReuse-Data-Weights
Weighted relationships between sets of narrative text files generated from TextReuse.

TextReuse and Data Sets

The TextReuse package was able to show the percentage of text borrowed between versions of CCA, from 1 (100%) down to 0. As one might expect due to the nature of the game’s data, many versions (despite the programming language) use the same (or very similar) descriptions for cavern locations and player-events. About 31% of the pairings (48 of 155) are weighted above 50%. That’s a lot of copying, but CCA was released as open source to encourage discovery, circulation, reengineering, and play.

Weight = 100% (10 sets):

  • long0500 and anon0501
  • arna0440-linux and arna0440-dos
  • arna0440-source and arna0440-dos
  • arna0440-source and arna0440-linux
  • CROW0000 and russ0000
  • kenn0000 and pohl0350
  • kint0350 and ticm0350
  • munk0430 and wood0430
  • munk0430 and wood043b
  • wood043b and wood0430

Perfect matches of the English narrative data include Jacob Munkhammer porting Woods’ original game from FORTRAN IV to DOS. John W. Kennedy updated Jerry Pohl’s Macintosh OS/2 version to a more modern Mac operating system. While the code changed, the narrative data did not. In 2007, Matthew Russotto updated the FORTRAN-77 code of Crowther’s original, and like Kennedy’s port of Pohl, he left the narrative data set alone. Mike Arnautov recycled his data set as well between his three versions of CCA made for different platforms, but all with a 440-point maximum score. Woods recycled his data, too, for the 430-point version of the game.

Weight = 90–99% (10 sets):

  • ekma0350 and kine0350 (99%)
  • cox_0350 and daim0350 (99%)
  • daim0350 and lumm0350 (99%)
  • gill0350 and wood0350 (98%)
  • daim0350 and kenn0000 (96%)
  • daim0350 and pohl0350 (96%)
  • cox_0350 and kenn0000 (96%)
  • cox_0350 and pohl0350 (96%)
  • kenn0000 and lumm0350 (95%)
  • lumm0350 and pohl0350 (95%)

Ten more sets of data fall between 95% and 99%, meaning a near-exact copy of the narrative data shared between two versions of CCA. The very minor differences relate to small variations in formatting and punctuation. For all intents and purposes, these 10 sets can be included with the 10 perfect matches above.

Weight = 80–89% (5 sets):

  • kenn0000 and whin0450 (88%)
  • pohl0350 and whin0450 (88%)
  • daim0350 and whin0450 (86%)
  • cox_0350 and whin0450 (86%)
  • lumm0350 and whin0450 (85%)

When we get to data in the 80% range, differences in the text become easier to spot. For example lumm0350 is single-spaced and whin0450 is double-spaced. This is enough to register a significant change to how TextReuse compares the data. One could change the spacing of whin0450, but that would manipulate the data. All data in this case study remain as they were discovered in the wild.

Weight = 70–79% (11 sets):

  • kine0350 and wood0350 (75%)
  • ekma0350 and wood0350 (75%)
  • oska0551 and vane0560 (75%)
  • gill0350 and kine0350 (74%)
  • gill0350 and ekma0350 (74%)
  • anon0501 and oska0551 (71%)
  • long0500 and oska0551 (71%)
  • kine0350 and kint0350 (71%)
  • kine0350 and ticm0350 (71%)
  • ekma0350 and kint0350 (71%)
  • ekma0350 and ticm0350 (71%)

Data in the 70% range show more significant deviations, in part with style and formatting, but also now with actual data. For example ticm0350 ends with a series of numbers, “9, 20, 3, 180, 181”, and ekma0350 ends with “8, 1, 24, 2, 29.” The series of numbers leading up to the end of the data file show significant difference, while the English narrative text in the first half of each file does not.

Weight = 60–69% (4 sets):

  • kint0350 and wood0350 (63%)
  • ticm0350 and wood0350 (63%)
  • gill0350 and kint0350 (62%)
  • gill0350 and ticm0350 (62%)

The drop from the 70s to the 60s is eight percentage points (71% to 63%), and looking at the data files shows not only changes in data, but also in how the data are organized. The reason is that these four sets of files are all from different programming languages, which call the data in different ways.

Weight = 50–59% (8 sets):

  • munk0430 and wood0350 (58%)
  • wood0350 and wood0430 (58%)
  • wood0350 and wood043b (58%)
  • gill0350 and munk0430 (58%)
  • gill0350 and wood0430 (58%)
  • gill0350 and wood043b (58%)
  • anon0501 and vane0560 (57%)
  • long0500 and vane0560 (57%)

The trend in the 60s continues in the 50s for the same reasons, but like the other examples, affects only a small number of sets. This brings us to sets under 57%, which make up the majority of the TextReuse results: 107 data sets (compared to 48 total data sets weighted above 50%):

Weight = 40–49% (35 sets):

Weight = 30–39% (32 sets):

Weight = 20–29% (40 sets):

Weight = 10–19% (0 sets):

Weight = 0–9% (0 sets):

The one thing to notice about the data sets in the lower half of the weighted corpus is that there are no sets below 20%. In reviewing the CSV file, the lowest percentage is 25%, which has a logic to it. CCA is a classic game with plenty of puzzles adored by countless players since 1976. Failure to copy-paste any of the narrative text would result in a game distinctly separate from CCA. The fact that TextReuse bottoms out at 25% is consistent with that.
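The bucket counts reported above can be tallied in a few lines. This Python sketch only re-adds the figures from the headings in this section; the dictionary layout is illustrative.

```python
# Number of pairings in each TextReuse weight bucket, as listed above.
bucket_counts = {
    "100%": 10,
    "90-99%": 10,
    "80-89%": 5,
    "70-79%": 11,
    "60-69%": 4,
    "50-59%": 8,
    "40-49%": 35,
    "30-39%": 32,
    "20-29%": 40,
    "10-19%": 0,
    "0-9%": 0,
}

UPPER_BUCKETS = ["100%", "90-99%", "80-89%", "70-79%", "60-69%", "50-59%"]

above_half = sum(bucket_counts[name] for name in UPPER_BUCKETS)
total = sum(bucket_counts.values())
below_half = total - above_half
share_above = above_half / total

print(above_half, below_half, total)  # 48 107 155
print(f"{share_above:.0%}")           # 31%
```

The 48 pairings in the upper buckets and the 107 below them match the totals given in the text, and together they put the share of pairings weighted above 50% at roughly 31%.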

Data-TextReuse-Graph
Gephi-created graph showing nodes and relationships between narrative text files shared between CCA versions. Yes, I will clean this up.

Visualizing the TextReuse Data Sets

To get statistically meaningful results from visualizing the data in Gephi, I limited the weights of the sets of data files to over 50%. The resulting graph shows two core versions of the English narrative text (green and pink). The large fonts and nodes show the popularity of the data that was borrowed from one version to the next. Interestingly, three small groups of outliers also appear: CROW0000 and russ0000, which are a 100% match; the three Arnautov files, also 100% matches; and kenn0000 and pohl0350. These perfect scores create outliers in the chart surrounding the remaining files in various states of linking and usage.

Textnets and Data Sets

The Textnets package for R displays networks of related text files and how (or if) they connect. The package does not output a CSV file for visualization, but instead has the researcher dump the contents of each file in the corpus into its own row in a two-column spreadsheet. The package then reviews the contents and draws a graph showing how the sets of files relate. I ran Textnets against the 35 sets of narrative data to produce the graph below.
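Preparing that two-column spreadsheet by hand is tedious for a few dozen files, so the step lends itself to a short script. The Python sketch below is one possible way to do it, not the procedure used here. It assumes the corpus is a folder of .txt files named after their versions, and the column names "version" and "text" are hypothetical and would need to match whatever the analysis script expects.

```python
import csv
from pathlib import Path


def build_corpus_table(corpus_dir, output_csv):
    """Write one CSV row per text file: the file's name and its full contents.

    Line breaks and repeated spaces are collapsed so each file fits in one cell.
    Returns the number of rows written.
    """
    rows = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        rows.append({"version": path.stem, "text": " ".join(text.split())})

    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["version", "text"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling build_corpus_table with a corpus folder and an output filename produces a file that can be read into R as a data frame. Collapsing whitespace is a convenience for the spreadsheet only and should be skipped if spacing is part of what is being studied.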

CCA-Data-Textnet
Textnets-derived graph showing a text network of CCA narrative text files.

Visualizing the Textnets Data Sets

I am not quite sure how to interpret the graphical results. In this instance, two groups of text appear as red and black connected by a line running directly from whin0450 (black) to bhch0565 (red, in between the two groups), and newdoc (red), which connects other files exhibiting somewhat tenuous relationships. The two main colors may be attempting to show data similar to that retrieved from TextReuse, showing the two major groups of text being shared across versions. All of the nodes in the black network are interrelated, whereas the red nodes appear to be much less entangled.

ReadMe-Text
Sample ReadMe text from Mike Arnautov’s 550-point version of the game.

ReadMe Files

The final set of CCA files that I analyzed was the ReadMe files. ReadMe files are often created by programmers to explain who created a program and when, what the program does, and how to install and run it. 47 versions of CCA included ReadMe files, which can all be read in a simple text program. I was curious to see if text from the ReadMe files had been borrowed between versions over time. All three R text analysis packages proved this to be the case, again to varying degrees as shown above with the code and narrative data sets.

ReadMe-Stylo-Weights
Weighted comparisons between ReadMe files generated from Stylo.

Stylo and ReadMe Files

As with the other two groups of files, ReadMe files were compared against each other and then given a weight from 1 to 6, 6 being a perfect stylometric match or possible 1:1 copy of the ReadMe file of one version by another.

Weight = 6 (5 sets):

  • arna0660 and arna0770
  • bree_xxx and gerr0000
  • ekma0350 and kine0350
  • kenw0550 and well0550
  • kinm0551 and kint0350

For the five sets of perfectly matched ReadMe files, Mike Arnautov shared his between his different versions. The ReadMe file shared between Jim Gerrie (2015) and Barry Breen (1980), however, is odd: neither version is a port of the other, and they were written in different languages (BASIC and Pascal respectively). Reviewing the files by hand, I see no reason for these two ReadMe files to be matched at all. This odd mismatching continued with other ReadMe files of different weights, and required another round of using Stylo for R to see how and where errors might have crept in. Re-running Stylo against the ReadMe corpus returned identical results, and I am left wondering if I am personally unable to interpret these correctly.

Weight = 5 (3 sets):

  • arna0550 and arna0660
  • gasi0350 and muno0370
  • olss0551 and oska0551

Weight = 4 (8 sets):

  • arna0550 and arna0770
  • arna0550 and pict0551
  • bree_xxx and kenw0550
  • cox_0350 and ticm0350
  • gasi0350 and yong_xxx
  • king0350 and plat0550
  • king0350 and well0550
  • kinm0551 and ticm0350

Weight = 3 (27 sets)

Weight = 2 (32 sets)

Weight = 1 (39 sets)

ReadMe-Stylo-Plot
Gephi-derived graph of stylometric relationships between ReadMe files from various CCA versions.

Visualizing the Stylo ReadMe Files

Gephi graphed the ReadMe Stylo data to create another collection of families sharing similar traits between files (dark green, light green, blue, orange, pink). The pink grouping is almost wholly set apart from the rest, instead sharing style between a dozen classic game versions with the nodes of diaz0350, gasi0350 and muno0370 standing out. Kenw0550 dominates the orange group, while kinm0551 and kint0350 have the biggest pull for the blue nodes. Bree_xxx and gerr0000 top the green nodes, and I continue to wonder why. As with earlier graphs, Mike Arnautov’s versions continue to stand by themselves as outliers, linked to themselves.

ReadMe-TextReuse-Weights
Weighted comparisons of ReadMe files generated by TextReuse.

TextReuse and ReadMe Files

Having received confusing results in Stylo, I was curious to see if TextReuse would return more logical results when checking to see what versions of the game’s ReadMe files borrowed text from other versions.

The data returned from TextReuse is weighted from 1 to 0, one meaning a 100% duplication. Only three sets of ReadMe files were direct copies:

  • blac0350 and supn0350
  • kenn0000 and pohl0350
  • kind0430 and munk0430

I visually checked each of these ReadMe files and can confirm that each set does indeed duplicate these files between versions. As seen above, Kevin Black’s 1987 port of Bob Supnik’s 1978 version remains faithful across all files. The same is true of John Kennedy’s Mac OS update of Jerry Pohl’s (1990) original Mac version. Jacob Munkhammer also updated David Kinder’s version for the Amiga and kept the ReadMe file the same.

Three more sets were in the 90% range, all of them belonging to Mike Arnautov:

  • arna0550 and arna0660 (93%)
  • arna0660 and arna0770 (92%)
  • arna0550 and arna0770 (90%)

Two sets were weighted in the 70% range:

  • kenw0550 and well0550 (75%)
  • ekma0350 and kine0350 (70%)

The remaining ReadMe files showed either no overlap, or overlap of less than 1%, meaning that 70% of the 42 ReadMe files were uniquely written by the authors of these versions. The TextReuse data seems to be a much more accurate representation of the circulation and sharing of the CCA ReadMe files than that returned by Stylo.
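The 70% figure can be checked by counting how many distinct ReadMe files take part in any of the eight overlapping pairs listed above. This Python sketch uses only the pairs and the corpus size of 42 given in this section.

```python
# The eight ReadMe pairings with meaningful overlap, as listed above.
overlapping_pairs = [
    ("blac0350", "supn0350"),
    ("kenn0000", "pohl0350"),
    ("kind0430", "munk0430"),
    ("arna0550", "arna0660"),
    ("arna0660", "arna0770"),
    ("arna0550", "arna0770"),
    ("kenw0550", "well0550"),
    ("ekma0350", "kine0350"),
]

TOTAL_READMES = 42  # corpus size stated in the text

# A file counts as "shared" if it appears in at least one pairing.
shared_files = {name for pair in overlapping_pairs for name in pair}
unique_count = TOTAL_READMES - len(shared_files)
unique_share = unique_count / TOTAL_READMES

print(len(shared_files), unique_count)  # 13 29
print(f"{unique_share:.0%}")            # 69%
```

Thirteen files appear in at least one pairing, leaving 29 of 42 (about 69%) with no meaningful overlap, in line with the rounded 70% quoted above.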

ReadMe-TextReuse-Plot
Gephi-generated graph of CCA ReadMe files, data inherited from TextReuse.

Visualizing the TextReuse ReadMe Files

The visualization of the TextReuse data by Gephi also returns more logical data, correctly reflecting the sharing of ReadMe text amidst three dozen outliers, the size of the nodes again showing the degree of matching nodes where the text was shared.

Textnets and ReadMe Files

Just as I did with the CCA sets of narrative text data, I used the Textnets package for R to see if any networks of usage appeared across the 42 ReadMe files.

CCA-Readme-Textnet
Textnets visualization of CCA ReadMe file text networks.

Visualizing the Textnets ReadMe Files

Five loose networks (black, red, yellow, blue, pink) and two outliers (green, light blue) show how the versions interrelate based on the ReadMe files. The data, however, might not be accurate: based on the TextReuse data, most of the ReadMe file overlaps were less than 1% and often 0. This might explain why most of the “edges” (lines) in the graph are gray instead of a solid black.

Preliminary Conclusions

Computer code and English-language data files (narrative text and ReadMe files) can all be treated as text-artifacts, and as such can undergo text/stylometric analysis. Using the Stylo, TextReuse, and Textnets packages for R returned a quantitative history of Colossal Cave Adventure, showing how 40 years’ worth of versions interrelate and build on each other in the spirit of open source programming. Based on the success of this experiment, it seems likely that digital archaeologists can use these same tools to review other code sets for other software applications, games or otherwise, to create a tree of authors and versions and a timeline of development, and to get a good idea of the human networks underlying digital artifacts.

I would welcome any constructive feedback to the above, especially with the interpretation of data, and if there are other questions that the data can answer that I did not think to ask. I would also appreciate some help understanding why Stylo seemed to fail when analyzing the ReadMe files, when it worked for FORTRAN code sets and narrative data sets. I am happy to share the corpus as well as the CSV file.

What’s Next

In working with the collection of existing CCA versions, I have discovered that the metadata of the files and file-sets contain a lot of extra, supporting information on the history of the game, and should be used in concert with the quantitative, statistical data to round out the development history of the world’s first digital, interactive text adventure. The next section of this case study will be devoted to file metadata, and this will be followed by general conclusions about the entire project.

Following the closure of my work on Colossal Cave Adventure, I hope to finish the case study by applying the same tools, methods, and lessons learned against code sets from Atari cartridges. I would like to see if developers at Atari were sharing code, and which games contained shared code, as well as stylometric analyses of the different developers at Atari to see if I can find “signatures” of these authors in the cartridges they produced with their name “on the box” or perhaps behind the scenes.

—Andrew Reinhard, Archaeogaming
