This week I finally sent along the data from our survey at Pyla-Koutsopetria on Cyprus to Eric Kansa at Open Context. This was a bigger task than I had anticipated, but with the publication of the volume on our survey, it seemed like the ideal time to make our data accessible to researchers on the web. Hopefully, the web publication of our data at Open Context will become a companion piece to our survey volume, allowing a critical reader to interrogate our claims more thoroughly than traditional paper tables and catalogues would permit.
I do not need to recite all the good reasons to make raw archaeological data publicly available. In preparing the data from our survey for publication, I did discover some unexpected benefits to this process.
1. How can there be so many holes in my data? At the end of every season, we spent a bit of time making sure that our data was in good order and that there weren’t massive gaps in our dataset. Reviewing the entire dataset, however, exposed myriad small gaps and irregularities that had crept into our data over the years. Most of these could be easily filled as we collected data in the field in a way that ensured redundancies, but because these little gaps in our data tables were not significant for our analyses, they remained almost invisible until we reviewed our data for publication. The notion that someone else would use our data in ways we could not entirely anticipate pushed us to apply a greater degree of scrutiny to our dataset and to produce a much cleaner copy.
As a little note, it took more time to fix the last few problems than it took to do large-scale normalization. Hours before I submitted the data for review, I officially gave up on 23 records in our finds database. I decided just to live with 0.3% of our data being not entirely tidy. Reviewing and revising our databases also gave me a firm set of practical limits for the quality of our data.
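The kind of gap-hunting described above is easy to automate. Here is a minimal sketch that counts empty cells per column in a finds table; the field names and sample rows are invented for illustration, not PKAP's actual schema.

```python
import csv
import io

# Hypothetical sample of a finds table; in practice this would be the
# project's exported CSV rather than an inline string.
sample = io.StringIO(
    "batch,period,chronotype,weight\n"
    "1,Roman,Amphora,120\n"
    "2,,Fineware,\n"
    "3,Hellenistic,,45\n"
)

reader = csv.DictReader(sample)
gaps = {}
for row in reader:
    for field, value in row.items():
        if not value.strip():
            # Tally every empty cell by column name.
            gaps[field] = gaps.get(field, 0) + 1

print(gaps)  # -> {'period': 1, 'weight': 1, 'chronotype': 1}
```

Running a report like this at the end of every field season, rather than once before publication, would surface the small irregularities while the redundant field records needed to fill them are still fresh.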
2. Excavating Data. One of the more remarkable things that we discovered on reviewing our data for publication was the number of strange fields that we never used or used only sparingly. For example, our main survey database had three fields describing our orientation as we walked each individual unit. We had columns for bearing, “Direction To”, and “Direction From”. As our survey units were all orthogonal and each fieldwalker walked a straight line through the unit, I have no idea why these three fields existed. We also had a yes/no field for “Black and White Photograph”. Our project had used relatively high-resolution (<8 megapixel) cameras from our first field season in 2004.
These fields, then, must have entered our database from the earlier databases upon which it was based. The “Black and White” photography field must have originated in either the Eastern Korinthia Archaeological Survey (EKAS) database or the Sydney Cyprus Survey Project Database. EKAS originated in the mid-1990s and SCSP in the early 1990s, both prior to the widespread use of publication quality digital photography. Black and white photography remained the standard for archaeological documentation until around the year 2000. The “Direction to” and “Direction from” fields must have derived from the database of a project that anticipated more irregularly shaped units such as terraces or hill slopes. I suspect this came from the database used in the Australian Palaiokythera Archaeological Survey where we walked numerous irregularly shaped units.
We removed these fields from the final dataset submitted for publication because they did not include any data (at all!), but it was intriguing to be reminded of the origins of our survey data structure through these residual components.
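Residual fields like these are easy to detect mechanically: any column that is empty in every record is a candidate for removal. A sketch, using column names that echo the examples above with invented records:

```python
# Invented survey-unit records; the legacy columns are empty throughout,
# as they were in the actual dataset.
rows = [
    {"unit": "001", "bearing": "90", "Direction To": "",
     "Direction From": "", "Black and White Photograph": ""},
    {"unit": "002", "bearing": "270", "Direction To": "",
     "Direction From": "", "Black and White Photograph": ""},
]

# A field is residual if no record holds a non-blank value for it.
empty_fields = [f for f in rows[0]
                if all(not r[f].strip() for r in rows)]

# Emit a cleaned copy of the dataset without the residual fields.
cleaned = [{f: v for f, v in r.items() if f not in empty_fields}
           for r in rows]

print(empty_fields)
# -> ['Direction To', 'Direction From', 'Black and White Photograph']
```

The payoff of doing this programmatically is the list itself: before dropping anything, you have a record of which inherited fields never saw use, which is exactly the kind of structural archaeology described above.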
3. Managing Misfit Data. As we prepared our data for publication we discovered that we had to make some hard decisions about misfit data that does not nest neatly in our larger survey datasets. For example, we collected data on several hundred features in the survey area. Each feature received a number, a GPS coordinate, and a brief description in a notebook. At some point the notebook entries were summarized in a brief table and merged with the GPS point data in our GIS, but these points were never reconciled formally with our survey units. In other words, these points were not part of the survey dataset either spatially or structurally.
As we prepared our data for publication, we decided against including the features dataset, in large part because it was collected on a different spatial scale and in a fundamentally different way from our survey database. We used the features data to describe the landscape of the survey area and even did some rudimentary spatial analysis with it, but in the end this data remained too awkward and complex to include with the survey data.
In contrast, we did find ways to integrate the lithic analysis and the study of organic remains from the survey area with our ceramic dataset even though this was data recorded outside of our standard data structure. Managing misfit data was a tricky task and as I am beginning to look ahead to my next survey project, I am already thinking how to ensure that our data integrates more seamlessly.
4. Many Copies Make a Mess (without version control!). I know that one of the great principles of good data management is to keep multiple copies of data to insure against data loss. With easy access to vast quantities of storage in the cloud, it is now easier than ever to have redundant data storage. The trouble begins when you have multiple copies of your data, managed by multiple scholars, and living in multiple places. It took more time for us to assemble a complete collection of survey unit photographs, for example, than to normalize the finds and survey databases, because our photographs lived on various hard drives and we lacked a definitive dataset. As I move forward with new projects, I am going to insist on a better system for maintaining definitive versions of our data.
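Short of full version control, even content-hashing the scattered copies helps with the photograph problem described above: identical files surface as duplicates, and anything that appears only once flags a drive you cannot lose. A sketch (not the project's actual workflow) using a throwaway temp tree in place of the real hard drives:

```python
import hashlib
import os
import tempfile

def index_by_hash(roots):
    """Map SHA-256 digest -> list of paths holding identical bytes."""
    index = {}
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as fh:
                    digest = hashlib.sha256(fh.read()).hexdigest()
                index.setdefault(digest, []).append(path)
    return index

# Demo: two "drives", one photo duplicated across both, one unique.
with tempfile.TemporaryDirectory() as drive_a, \
     tempfile.TemporaryDirectory() as drive_b:
    for d in (drive_a, drive_b):
        with open(os.path.join(d, "unit_042.jpg"), "wb") as fh:
            fh.write(b"same bytes on both drives")
    with open(os.path.join(drive_b, "unit_043.jpg"), "wb") as fh:
        fh.write(b"only on one drive")

    index = index_by_hash([drive_a, drive_b])
    duplicated = {h: p for h, p in index.items() if len(p) > 1}
    print(len(index), len(duplicated))  # -> 2 1
```

Hashing only says which copies are byte-identical; choosing which divergent copy is definitive still requires a human decision, which is why a single canonical repository is the better long-term fix.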
None of these things should come as revelations to anyone who has dealt with archaeological datasets, but encountering all these little issues and making these decisions has compelled me to engage critically with the data collection, revision, and maintenance process one last time before embarking on a new survey project with a new data structure. Reviewing and revising our dataset for formal publication allowed us to understand the limits of our data collection processes and the structure of our data in new ways.