Towards an Archaeology of Code

26.3.317

Fragment of Colossal Cave Adventure source code on papyrus.

Introduction

In 2007, Dr. Dennis G. Jerz, Associate Professor of English (New Media Journalism) at Seton Hill University, recovered the original source code of Colossal Cave Adventure, the Ur-text of interactive fiction written by William Crowther in 1976, subsequently updated by Don Woods, and released as open source through an unintentional viral “sneakernet.” Jerz published an account of his research into the CCA code, filling various lacunae in research on the game dating back to Mary Ann Buckles’s groundbreaking 1985 PhD thesis on a later version of it (Interactive Fiction: The Computer Storygame “Adventure”), through curated, chronological lists of over 60 versions, ports, and updates of CCA from 1977 through 2006 by Russel Dalenberg and Rick Adams.

If you haven’t played CCA (often referred to as Adventure, although that can confuse it with Atari’s 1980 classic by Warren Robinett loosely based on Woods’ adaptation of Crowther’s original story), there are a number of browser-based versions (like this one), which will get you up to speed. The original game was written in FORTRAN 4 for the PDP-10 mainframe computer, compiled and installed by Crowther at one of Stanford University’s labs, requiring players to make the pilgrimage to the lab at night in order to enjoy the game’s puzzles. CCA is not the first computer game (games like Hunt the Wumpus came before it), but it is the first known distributed work of text-only interactive fiction entertainment, which would lead directly to the creation of Zork (1980) and interactive fiction giant, Infocom, at MIT, followed by thousands of graphical role-playing computer games. CCA was the game that launched 1,000 companies.

The narrative of the game places the player near the grated mouth of “Colossal Cave”, which happened to be a virtual reconstruction of part of Kentucky’s Mammoth Cave system as mapped by Crowther and his wife in the mid-1970s. The player must determine how to navigate the cave, solve puzzles, and collect treasures. To do so, the player must also determine how to interact with the program by typing in commands/responses to on-screen prompts. This was arguably the main source of delight and frustration with the game, the first of its kind to allow people to interact with a program in natural language to proceed through an entertainment narrative (independent of artificial intelligence research at MIT with the language-processing program Eliza, 1964–66).

What we have in the original 1976 source code of Colossal Cave Adventure is archaeological. It is a code-artifact written in an old language, an ancient primary source, something that is securely sourced to a person and a location. Think of CCA as an epic poem, the equivalent of the Iliad or the Odyssey, but perhaps most closely aligned with Gilgamesh, literally the Ur-epic seeing as it was set during the Third Dynasty of Ur. In ancient epics as well as in CCA, we follow the hero’s journey, which includes a trip to the Underworld and the collection of artifacts there. As with all ancient epic poems, the core story and characters remain the same, but the story grows and changes over time depending on who is telling it, the context in which the story is told, and the chronological distance between the current telling and the original source. Gilgamesh comes down to us from two Mesopotamian sources, both written on clay tablets, one in Akkadian, and another in Old Babylonian. Homeric epics come down to us not only from the oral tradition, but also through clay tablets and papyrus. CCA follows suit, with its earliest version written in the ancient language of FORTRAN 4, later ported to more current versions of FORTRAN, as well as early versions of C. The content of an epic is inscribed on either clay or tape, preserved on the early media, but also preserved through the sharing of the stories. Epic poetry is the original open source/open-access, and as such conveys information-as-virus, most reliably through earworm, spoken, sung, or played. Thanks to the open nature of CCA, we now have 42 years over which Crowther’s original epic grew and changed.

As with the papyrological record, archaeologists and historians know three things: 1) some papyrus exists, and its texts are complete; 2) some papyrus exists, but its texts are incomplete; 3) some papyrus does not exist, but once did. This is true with source code versions of CCA. We have the original 1976 source code. We have later versions, which may or may not be able to be read by either people or machines in 2018. We also know through research that some versions once existed (e.g., Alan Solomon’s Microsoft Fortran version from 1985), but are no longer accessible for whatever reason. Code archaeology could very well be the papyrology of the 21st century.

My third PhD case study focuses on the evolution of CCA’s code. I plan on examining over 70 versions of the game not just by eye, but also with stylometric software (Stylo for R, which is among a vast palette of apps for conducting corpus analysis) to see if what I find regarding a versioning chronology and authorship of the actual programs matches up with the historical record. Granted: my data set of ca. 70 code versions is relatively small when compared against data sets of code that can contain millions of lines to compare, so while I will likely not be able to determine an author because of a small data sample, I will definitely be able to see what code was re-used between versions. CCA will allow me to establish formally an archaeology of code, a control set from which I will derive a methodology to use on other sets of code. Once finished with CCA, my plan is to put those code-stylometric methods to the test with source code from Atari games written in the late ’70s and early ’80s, training the software on the styles of Howard Warshaw, Carol Shaw, and others, and then running source code of various games through Stylo to look for evidence of code re-use and other artifacts of code-authorship born in a corporate environment at the birth of the video game industry.

A Classical Archaeologist’s Approach to Code

As archaeological researchers, we cannot escape our pasts (pun very much intended), and often our prior experience colors current/future endeavors. I was trained as a Classical archaeologist and art historian and conducted fieldwork in Etruria and Greece, later serving as the publisher for the American School of Classical Studies at Athens (ASCSA). Twenty years after finishing my M.A. at the University of Missouri-Columbia as advised by William R. Biers, I found my way to the archaeology of the digital. I continue to be curious about how (and whether) the tools and methods I used specifically for Greek pottery apply to my current work in synthetic spaces. I continue to be surprised by the sustained utility of my previous training, but there are other elements at work in the digital that have prompted methodological revision as will be explained below.

So how can an archaeologist conduct an archaeological investigation of 1,000 lines of FORTRAN (or any programming language)?

Let’s start with authorship: who made the artifact I am looking at? Source code is text, and as such was written by one or more people for a primary audience of either a machine or human to read. By the nature of code being text, it is subject to the study of epigraphy (the study and interpretation of inscriptions, typically ancient), palaeography (the study of ancient writing systems and the deciphering and dating of historical manuscripts), and stylometry (the statistical analysis of variations in literary style between one writer and another). When looking at source code, we are looking at the primary source (or variations of it). Computer programming is an iterative venture, meaning that the code changes over time with modifications to intended functionality, debugging, feature-creep, etc. These later iterations of code can be authored by others, and are written in a way that can be tracked chronologically via date/time stamps and version numbers. All of these changes can be identified through stylometric analysis, which reviews vocabulary, punctuation, and syntax, which can then be used to identify an author (although not necessarily by name). Everyone writes code differently, which can include how one organizes routines, how one does or does not comment the code, and even how one punctuates or formats the code (e.g., tabs/indents). We can attribute code authorship by way of understanding style.
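The style markers named above (commenting habits, tabs/indents, formatting) can be counted directly. What follows is a minimal sketch in Python, offered as an illustration rather than as the method used in this project: the function name `style_fingerprint`, the handful of features, and the assumption that a comment line starts with `C` or `*` in the first column (as in fixed-form FORTRAN) are all hypothetical choices, and the sample lines are invented.

```python
from statistics import mean


def style_fingerprint(source: str, comment_prefixes=("C", "*")) -> dict:
    """Summarize a few formatting habits of a source file (illustrative features only)."""
    prefixes = tuple(comment_prefixes)
    lines = [ln for ln in source.splitlines() if ln.strip()]
    if not lines:
        return {"lines": 0}
    letters = [ch for ln in lines for ch in ln if ch.isalpha()]
    return {
        "lines": len(lines),
        "mean_line_length": mean(len(ln) for ln in lines),
        # Share of lines that begin with one of the assumed comment markers.
        "comment_share": sum(ln.startswith(prefixes) for ln in lines) / len(lines),
        # Share of lines indented with a tab rather than spaces.
        "tab_indent_share": sum(ln.startswith("\t") for ln in lines) / len(lines),
        # Share of alphabetic characters typed in upper case.
        "uppercase_share": sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0,
    }


# Invented FORTRAN-like sample, not taken from the CCA source.
sample = "C  DESCRIBE THE ROOM\n\tIF (X .EQ. 1) GOTO 10\n\tCALL SPEAK(K)\n"
print(style_fingerprint(sample))
```

Two files with similar fingerprints are not thereby proven to share an author; features like these only become meaningful when compared across a larger corpus.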

As an art historian of ancient Greek pottery, I can compare this stylometric approach to that of archaeologist John Beazley whose seminal Attic Black-Figure Vase-Painters (1956) and Attic Red-Figure Vase-Painters (1963) were informed by the non-quantitative “stylometric” theory of art historian Giovanni Morelli (1816–91), who attributed authorship to the style of the painted line. In ancient art history, one continues to rely on conducting a visual comparison of a newfound artifact against an established corpus of comparanda: “this thing looks like these other things, which happened to be made by this person/workshop.” The art historian can train the eye in pattern recognition to spot an artist’s style, or in 2018, can supplement what the human eye sees with a variety of Digital Humanities software tools for visual and non-visual analysis. A line is a line, drawn or written, and it is arguably more difficult with texts to determine both identity and authenticity of authorship by the eye and experience alone. Stylometric software such as Stylo (which is open source) can help not just with providing quantitative results for text-authorship, but also with a suite of data visualization tools.

Part of establishing the identity of a maker of something is sourcing the (code) object. Depending on the sample, source code can potentially be signed by the author (or attributed to the author by the author’s employer). The author’s identity could be guessed via the author’s commented and/or redacted code, which will contain its own subset of “fingerprints” of writing style. Thanks to metadata (digital context), code can also be tied to an IP address, giving the code locative data: if we do not know who wrote the code, we can derive where the code was written.

Turning to Greek painted pottery, we see these same identifiers. Some potters/painters signed their work (e.g., Amasis, Euphronios). For those pots unsigned, art historians have assigned names to the painters largely based on their current locations (e.g., the Berlin Painter, the Painter of Louvre F 51). Some ancient pots have actual fingerprints on them of the artisans. Also the clay of pots can be sourced, a geological IP address; the chemistry of soil varies from place to place, a unique identifier. Many of the intact, painted Greek pots were recovered from Italian findspots in funerary contexts, but we can trace their origins back to Greek workshops in Attika.

With identification of authorship comes the question of ethics. For ethics of identification of Ancient Greek artisans who have been dead for over 2,000 years, there is little ethical concern (unless we are dealing with possible counterfeits where authorship has been attributed by an unscrupulous seller). These ancient artisans are gone, and their lines of descent are now murky at best. Those who wanted to be identified signed their works; those who did not sign can still be identified by style and then assigned a unique name for modern reference.

Attributions can also be done through revisiting legacy archaeological data (maps, sections, stratigraphy, notebooks, data sets/spreadsheets/databases, etc.), which can help confirm or revise old attributions/conclusions. I was able to do this at the Greek site of Isthmia by reviewing old stratigraphic drawings and notebooks combined with boxed pottery lots and other contextual markers to determine the date on which the Roman Bath’s mosaic was laid. Returning to older versions of code and the context in which they were written will very likely provide similar revisions in conclusions. When I conduct the stylometric analysis of the Atari game source code (most of it written in Assembly), I should be able to confirm author attribution, code re-use by Atari employees, and might even be able to identify instances of dual authorship among other things, contributing additionally to the corporate history of Atari, and to the history of digital games generally. It’s similar to using the same software to resolve authorship debates: is the author Diogenes or Pseudo-Diogenes, James Madison or Alexander Hamilton?
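Authorship debates of this kind are commonly approached with a distance measure such as Burrows’s Delta, which compares texts on how far their most frequent words sit from the corpus average. The Python sketch below is a simplified illustration of that idea and not Stylo’s implementation: the function names, the tokenizer pattern, the use of the population standard deviation, and the toy “author” samples are all assumptions made for the example.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(text: str) -> dict:
    """Relative frequency of each word-like token in the text."""
    tokens = re.findall(r"[A-Za-z_]+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def burrows_delta(candidates: dict, disputed: str, top_n: int = 50) -> dict:
    """Return {author: delta}; a smaller delta means a closer stylistic match."""
    profiles = {name: relative_freqs(text) for name, text in candidates.items()}
    # Most frequent tokens across the known-author samples.
    pooled = Counter()
    for prof in profiles.values():
        pooled.update(prof)
    features = [tok for tok, _ in pooled.most_common(top_n)]
    # Mean and spread of each feature across the known authors.
    mu = {f: mean(p.get(f, 0.0) for p in profiles.values()) for f in features}
    sigma = {f: pstdev([p.get(f, 0.0) for p in profiles.values()]) for f in features}
    features = [f for f in features if sigma[f] > 0]  # drop features with no variation

    def z(profile: dict, f: str) -> float:
        return (profile.get(f, 0.0) - mu[f]) / sigma[f]

    target = relative_freqs(disputed)
    return {
        name: mean(abs(z(prof, f) - z(target, f)) for f in features)
        for name, prof in profiles.items()
    }


# Toy samples standing in for two known authors and one disputed text.
known = {
    "author_a": "goto goto goto if call goto",
    "author_b": "while while for if while call",
}
print(burrows_delta(known, "goto goto if goto"))
```

With only two short samples the z-scores are crude; a real comparison would need many more texts per candidate before the distances mean anything.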

When researching code, however, one must be sensitive to revealing the identity of the author. While some will sign their work or will make their role known, others will either use a pseudonym or will attempt to hide behind obfuscated IPs. Be that as it may, obfuscating style is nearly impossible: there are always textual tics, and these can still be caught even if code is run through a scrubber in an attempt to further hide one’s identity. Researchers can use the real name with the consent of the author, use the alias if given (Guccifer is the new Berlin Painter…), or assign a unique identifier in publications that designates code authorship without revealing private data. Depending on the sensitivity of an issue, the archaeologist can further respect privacy by assigning a sequential number to represent someone’s identity.

The Sensuality of Data

As a field archaeologist, I learned to engage all of my senses when working with excavation pottery, and I was curious to learn if the same “sensual” approach might work when considering code-artifacts.

Pottery

  • Sight: clay color, slip color, inclusions, biscuit, shape, condition, decoration, inscription
  • Sound: Tap/flick a piece of pottery with your finger to identify pitch and volume, which can help determine the quality of clay and firing, as well as the presence/absence of contents
  • Touch: texture, fine v. coarse, slipped v. unslipped, intentional raised or depressed design elements, breaks and mends
  • Smell: fresh v. old manufacture
  • Taste: stickiness of new v. old breaks when touched by the tongue
  • Context: presence/absence of other things found either with or around the pottery in question

Code

  • Sight: words, phrases, punctuation, syntax. Visual chunks and loops and routines observed as patterns divorced from the actual text within these patterns of code (look at the shape of the code elements instead of what the code actually says). Code comments. Line numbers. Redacted or obfuscated code. Language of both code and coded. Language version and code revisions. Upper and lower case. Tabs. Length of code and of subroutines, line-length. How was the code saved and in what format? Is there file metadata?
  • Sound: convert terms, syntax, and routines to audio in order to listen for patterns and dissonance.
  • Smell: none? Unless the code was printed on old v. new paper.
  • Taste: none?
  • Touch: create raised patterns of routines. Print in Braille?
  • Context: Look at the snippet of code in relation to other code, how the code is presented, how the code is stored, how it is executed and on what device(s). What is the code near? What is the result of compiling the code, of running it?
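Two of the senses listed for code, looking at the shape of code rather than its words and listening for patterns, lend themselves to a small experiment. The Python sketch below is only one possible reading of those ideas: `silhouette` and `line_pitches` are hypothetical helper names, and mapping line length to a MIDI-style note number between 48 and 84 is an arbitrary assumption, not an established sonification method.

```python
def silhouette(source: str, mark: str = "#") -> str:
    """Replace every non-whitespace character so only the layout remains."""
    return "\n".join(
        "".join(ch if ch in " \t" else mark for ch in line)
        for line in source.splitlines()
    )


def line_pitches(source: str, low: int = 48, high: int = 84) -> list:
    """Map each line's length onto a note number between low and high."""
    lengths = [len(line.rstrip()) for line in source.splitlines()]
    longest = max(lengths, default=0)
    if longest == 0:
        return [low for _ in lengths]
    # Longer lines sound higher; the longest line lands exactly on `high`.
    return [low + round((high - low) * n / longest) for n in lengths]


# Invented two-line snippet, used only to show the transformation.
snippet = "IF (A) THEN\n  GO\n"
print(silhouette(snippet))
print(line_pitches(snippet))
```

The silhouette keeps indentation and word boundaries while hiding the text itself, and the list of note numbers could be handed to any audio tool to hear whether two versions “sound” alike.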

Recompiling legacy code is the same as mending a pot: we take pieces that we know belong together and then reassemble them in order to access more data about the artifact, and potentially to preserve it. Debugging old code is the same as reconstructing an artifact or site feature/building: we address the problems surrounding the data in order to reconstruct it to the best of our ability. Running old code is like experimental archaeology. What might happen? What did happen?

The Sensual Assistance of Digital Tools

Traditional archaeologists (as well as digital ones) still rely on empirical evidence as granted through the senses that deliver data. Data are just data until they are sensed/perceived. As soon as data are encountered, interpretation begins, from something as basic as “what is this data?” to instantly recognizing something and making a direct leap to interpretation. Data are anything that can be sensed; data are still data even if there is no one around to sense them. They are just data, but without interpretation. They are the tree that has fallen in the woods when nobody was present to see it.

Our human senses and interpretations are fallible, biased, and at times unreliable. Creating and using machines in support of data analysis assists in data interpretation without replacing human intelligence. In the case of visual or textual stylometric analysis, it is likely that machines can recognize patterns that humans might miss, especially with big data sets: pottery, coins and die studies, thousands of lines of code, etc. A machine can be trained to look at thousands of coins in order to assign them to various groups of dies, a task that takes scholars years, but takes the computer minutes. Researchers can come at the data through their own senses and interpretations, feeding the same data to the machine either to check results or assumptions, to test hypotheses, or to reveal other ways of looking at data. While still not without bias (after all, people create and program machines), the outcomes can provide more information to add to the conclusions of the research, perhaps leading to new, unanticipated research questions.

Reflexivity

All of the above leads the archaeologist to the issue of reflexivity: while self-reflection on tools, methods, and interpretation is important, this reflexivity is often done by the individual, a self-critique. This should be supplemented by a kind of fieldwork peer-review, perhaps in the form of a group debriefing prior to when the site team leaves at season’s end. Ideally this could be done every two weeks to constantly examine and improve workflow on the ground. For digital-only sites, this can still be achieved through Slack channels (or similar) or publicly via social media and public writing/blogging. With the archaeology of code, the computer science and code-hobbyist community is vibrant, vocal, and can be tapped for assistance and public peer review, albeit done so with caution by the researcher in light of well-documented internet behavior. In many cases, the authors of code are still alive and can potentially be approached. As a pottery person, I would have enjoyed asking Charinos why he paired the face of Herakles opposite that of a woman on a kantharos base topped with a decorative painted frieze containing a ring of ivy and satyrs, many of such vessels found in south Italian graves. I can’t ask this question of the pot’s maker, but with authors of code, I can.

Towards an Archaeology of the Code of Colossal Cave Adventure

I am treating CCA as a code-artifact, digital papyrus or clay tablet from which later copies and versions are made. There is a perceived materiality to it especially when one remembers that FORTRAN 4 programs needed to be physically entered onto punch cards, which were then fed into the computer for compiling and running. While we have the source code digitally in a programmer’s analogue for Akkadian, we will likely never find CCA inscribed upon those original punchcards. It’s likely easier to find more tablets with fragments of Gilgamesh inscribed upon them.

CCA is an epic narrative at an early point in the “Digital Age”, one with mysterious beginnings as the code was copied and loaded by computer scientists across the United States from the original findspot of Stanford, to MIT, and beyond. As the program spread, it became the genesis of a new kind of literature as well as a new kind of entertainment good enough to spawn a major software company, snowballing into an industry derived from and supported by a player-base interested in being the lead character on an adventure.

As with epics, these shared stories changed over time while retaining the core action, characters, and feel. With CCA other programmers ported the game to other languages and platforms making it more accessible to a wider group of players, largely for free (there was one officially licensed, commercial version). The game spread and grew, sometimes with added rooms to the Colossal Cave, sometimes with additional points and penalties, other times with the author of the ported code stating explicitly that the original code was used, and nothing more has been added to the story.

My own goal with this case study is to track those changes to the original narrative over time, to see what of Crowther’s code was kept, what was changed, when, and by whom. With stylometric analysis I can do this, as has been done by others albeit with other kinds of code largely in the interest of cybersecurity. CCA makes things a bit easier for me as a control test case, largely thanks to how FORTRAN works. The original source code to CCA comes in two files, one with machine-readable instructions and the other with human-readable data. It is this data file (which is called by the instructions file) that interests me the most because it contains the entire story, the narrative of adventure, the lists of objects and how to use them, the vocabulary needed to play the game successfully, and the numerous humorous results of bad decisions by players as anticipated by the programmer. This is the way the story was told originally, and it is on a single, small text file. Other iterations of the game followed suit, from C to Python, and while the language of the code changed from Akkadian to Old Babylonian, the story remained in English (even when being re-coded by international programmers).
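Because the story lives in a plain text data file, the question of what was kept and what was changed between two versions can be posed as a line-by-line comparison. The following Python sketch uses the standard library’s `difflib` for that; it is an illustration under assumptions, not the stylometric analysis itself. The function name `compare_versions` is hypothetical and the sample lines are invented, not quoted from any CCA data file.

```python
from difflib import SequenceMatcher


def compare_versions(older: str, newer: str) -> dict:
    """Count lines that the newer text keeps, adds, or drops relative to the older."""
    old_lines = [ln.strip() for ln in older.splitlines() if ln.strip()]
    new_lines = [ln.strip() for ln in newer.splitlines() if ln.strip()]
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    tally = {"retained": 0, "added": 0, "removed": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            tally["retained"] += i2 - i1
        else:
            # A "replace" counts as old lines removed plus new lines added.
            tally["removed"] += i2 - i1
            tally["added"] += j2 - j1
    return tally


# Invented narrative lines standing in for an earlier and a later data file.
earlier = "YOU ARE IN A CAVE\nTHERE IS A LAMP HERE\nA DWARF APPEARS\n"
later = (
    "YOU ARE IN A CAVE\nTHERE IS A SHINY LAMP HERE\n"
    "A DWARF APPEARS\nA DRAGON BLOCKS THE WAY\n"
)
print(compare_versions(earlier, later))
```

Comparing whole lines is deliberately blunt: a single changed word marks the entire line as replaced, which suits a first pass over narrative text but would undercount re-use in heavily edited code.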

For the time being, I plan on going about executing the project in this way:

  • Download original source code and data files. [Done]
  • Identify, locate, and download other CCA editions in other languages. [Ongoing, and I have over 70 versions from 1977–2018]
  • Identify the dates of these ports/additions to create a chronology. [Ongoing. Most of the versions do contain at least the year of their creation.]
  • For CCA ports to other languages, find a common language to export to. [Because of the nature of interactive fiction, the CCA narrative is in English, and it will be relatively easy to export the story from its code wrappers.]
  • Identify and download pattern recognition tools for linguistics, philology, and epigraphy. Run these against the solo and cumulative code. [After consulting with a few friends in Classics and the Digital Humanities, I am going with Patrick Burns’ suggestion of Stylo for R, which is open source and cross-platform.]
  • Create 2D and 3D data visualizations of code growth over time. Experiment with audio “visualizations.” [The visuals can be handled by Stylo, but I am also thinking about drawing a tree with Crowther and Woods as the trunk, and the earliest ports of the code as the bigger branches off of which later versions grew. I am not yet sure how to approach the audio version of the data.]
  • Identify code authors by looking at code line lengths in thumbnail images of code snippets. Also look for word “tells” unique to various code authors. [This step may not be necessary for CCA unless some versions need a correction to their author attribution. This will more likely be used when I review the Atari code.]
  • Assign colors to individual authors and to code from specific dates. [Color-coding for CCA versions will likely be for those ports that share a common branch. Colors will also be assigned to retained code and new code in later editions/additions to CCA.]
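The chronology and tree steps in the plan above can be prototyped before any stylometric software is involved. The Python sketch below orders versions by year and guesses each version’s parent as the earlier version with which it shares the most distinct lines. Everything in it is an assumption for illustration: the names `jaccard` and `infer_tree`, the version labels and years, and the choice of Jaccard overlap as the similarity measure.

```python
def jaccard(a: set, b: set) -> float:
    """Share of distinct lines two versions have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0


def infer_tree(versions: list) -> dict:
    """versions: list of (name, year, text). Returns {name: likely parent or None}."""
    ordered = sorted(versions, key=lambda v: v[1])
    line_sets = {
        name: {ln.strip() for ln in text.splitlines() if ln.strip()}
        for name, _, text in ordered
    }
    parents = {}
    for idx, (name, year, _) in enumerate(ordered):
        earlier = [n for n, y, _ in ordered[:idx] if y < year]
        # Pick the earlier version sharing the most lines; the root has no parent.
        parents[name] = max(
            earlier, key=lambda n: jaccard(line_sets[name], line_sets[n]), default=None
        )
    return parents


# Hypothetical versions; single letters stand in for lines of narrative text.
versions = [
    ("port_b", 1984, "A\nB\nC\nE"),
    ("original", 1976, "A\nB"),
    ("port_a", 1980, "A\nB\nC"),
]
print(infer_tree(versions))
```

The resulting parent map could then be drawn as a trunk with branches and color-coded by shared ancestor. Where two earlier versions tie, this sketch simply takes the older one, so any real tree would need checking against the documented history of each port.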

Following the success/failure of analyzing quantitatively the history of CCA, I will attempt to use the same tool (Stylo) and updated methods in order to conduct code stylometry, epigraphy, and palaeography on Atari game Assembly code, using archaeological evidence to contribute to the history of digital games, and the personalities behind their creation.

As with all of my work, constructive comments and critiques are welcome. I’ll post a follow-up or two during the actual number-crunching along with my results. Everything will be made available post-PhD as CC0.

—Andrew Reinhard, Archaeogaming
