Never mind the scary theory, here’s some empiricism. And computer programming. The piece I’m working on is an analysis of lists of horses donated to the parliamentarian army in the First Civil War. There are some figures derived from these lists in my forthcoming article in War In History and in the seminar paper that I posted in November, but I’m trying to write an article which examines them in much more detail. This article will be related to debates over allegiance and the causes of the war, which is why I’ve been trying to explore the historiography and think about theoretical issues, but the substance of it will be fairly straightforward empirical stuff with lots of numbers. That’s not to say that this kind of analysis is easy. If it were, someone else might have done it all years ago. John Tincey was the first person to try it, but he only did the smallest of the three account books, which is a fraction of the size of the other two. Following his lead I decided to do all of them.
In 1999 I spent about 2 weeks in the PRO typing these lists into an Access database. I’m still using that transcript as the basis of my work now, although I’ve converted it to XML to make it more flexible and checked a selection of the entries against digital photos of the manuscript. I’ve been using the Python classes that I developed for representing uncertainty to calculate totals of horses and values. Some pages are damaged, meaning that exact totals can’t be calculated – this was difficult to deal with in Access, but the combination of XML and Python has enough flexibility to cope with it. Getting totals for days and months is fairly easy, but I also want to group by the social status of the donors and the counties that they came from. Before I can group by counties I need to identify the place names given in the manuscript: although some entries specify a county in the address, many more give a place name without one.
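To give a concrete flavour of what those uncertainty classes do, here’s a minimal sketch of the idea (this isn’t the actual code from my project, just the core trick): every quantity is stored as a pair of bounds, and addition propagates the bounds, so a damaged entry simply widens the range of the total instead of making it impossible to calculate.

```python
# A minimal sketch of the idea, not the classes from my project:
# a quantity is a (lower, upper) pair; a fully legible entry has
# equal bounds, a damaged one spans the possible range.

class Uncertain:
    """A value known only to lie between lo and hi (inclusive)."""

    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    def __add__(self, other):
        if not isinstance(other, Uncertain):
            other = Uncertain(other)
        # Interval addition: the sum of two uncertain values is the
        # interval of all possible sums.
        return Uncertain(self.lo + other.lo, self.hi + other.hi)

    __radd__ = __add__

    def __repr__(self):
        return str(self.lo) if self.lo == self.hi else f"{self.lo}-{self.hi}"


# Summing one day's entries, one of them on a damaged page:
entries = [Uncertain(2), Uncertain(1), Uncertain(0, 3)]  # last entry illegible
print(sum(entries, Uncertain(0)))  # 3-6
```

Grouped totals work the same way: each group’s subtotal is itself an interval, so the damaged pages only blur the figures they actually touch.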
Having decided to leave my 5th Lincolnshire First World War project for a while, I got an offer I couldn’t refuse: someone from the Great War Forum sent me a transcript of the battalion’s medal citations from the regimental archive so that I could publish them on my site and link them into the index of people that I’d created for the book. The document contains information that can’t be found elsewhere: although awards of the Military Medal were listed in the London Gazette, full citations were not normally published. There are also three awards not mentioned in Sandall’s list, and citations for 10 people who were recommended for awards but were turned down.
I received the list as a Word file with no semantic markup on Wednesday morning, started working on it on Thursday morning, and published it on the web this afternoon. It looks very basic but it’s not bad for two days, and it’s all linked into the index of people for Sandall’s book. First of all I copied the text into jEdit and used Find and Replace to insert some basic TEI XML markup. Then I pasted it into a new TEI document in oXygen. With the automatic validation it was easy to track down and correct errors in the markup, so by lunchtime I had a completely valid TEI file. In the afternoon I spent about 3 or 4 hours linking records by inserting key attributes into <persName> tags. In most cases I already had the keys that I used for linking names in Sandall, but sometimes I had to change them in the light of new evidence from the citations, such as full names of people who I previously only knew by their initials. This also allowed me to clear up some ambiguities. This morning I finished the linkage by creating new keys for the 13 people not mentioned by Sandall, then got started on writing some XSLT. That was easy as I could copy or adapt a lot of the code from the style sheet for Sandall. As well as generating the HTML version of the citations, this XSLT generates an extra JSON file which is imported into the Sandall index of people so that the index can link to the citations. Again this only required some minor adjustments to the Exhibit page. After some testing and corrections I had a live site up this afternoon.
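The key insertion itself was hand work in oXygen, but the mechanical part could be scripted along these lines (a sketch only: the file names and the name-to-key table below are made up, and I’m assuming lxml):

```python
# Sketch of the linking step: write @key attributes into <persName> tags
# wherever a name matches one already keyed in the Sandall index.
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical mapping from regularized names to index keys.
keys = {"Smith, John": "smith-j1", "Brown, William": "brown-w1"}

tree = etree.parse("citations.xml")
for pers in tree.iter(f"{TEI}persName"):
    name = " ".join("".join(pers.itertext()).split())  # normalize whitespace
    if name in keys:
        pers.set("key", keys[name])
    else:
        # Flag unmatched names for manual checking: new people, or
        # initials that the citations expand into full names.
        print("no key for:", name)

tree.write("citations-keyed.xml", encoding="utf-8", xml_declaration=True)
```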
This demonstrates the potential value of the techniques I’ve been using for marking up texts, but it also raises some problems for digital history. I decided to trust a transcript from a random person off the internet. I have no way of knowing how accurate the transcript is, or even if the source document really exists! It could be Hugh Trevor-Roper and the “Hitler Diaries” all over again. Therefore I’m going to think more carefully before putting myself in this situation again. There’s also a possibility that I’ve miscalculated the copyright situation. Based on internal evidence and comparison with other documents my best guess is that the list was created by the army and is therefore under Crown Copyright (and being unpublished and available for inspection in a public record repository should come under the waiver of Crown Copyright), but without seeing the original it’s hard to be sure. I might be wrong, and even if I’m right the holders of the manuscript might not agree. So technology makes some things easier, but there are other problems that it can’t solve.
Having made good progress with my project to digitize Sandall’s History of 5th Lincolnshire Regiment in the last month I’m going to leave it for a while. This month I haven’t read any books or articles, haven’t written anything other than blog posts and computer code, and have only occasionally thought about historiography and theory. I kind of like it like that but I have other things to get on with now.
I’ve made some small changes since the last post. Dates now have tooltips, so if you hover over them you can see the full date. The place name index is a bit more user-friendly. I’ve replaced the hash values with query strings in the incoming links, so that the Exhibit page filters the list down to the place passed in the query instead of displaying a box with the details. This means that you just have to click on “Map” to go straight to map view with only that place displayed. Once you’re there you can easily take the filter off again to see all the other places. The map view is also zoomed out further by default so that you can see Britain and Egypt. That means that you have to zoom in a long way to get to France and Flanders, but I think it’s less confusing than not being able to see Grimsby or Alexandria unless you zoom out.
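In other words, the incoming links changed from something like places.html#Amiens to something like places.html?place=Amiens. Schematically (the parameter name here is just an illustration, not necessarily what the Exhibit page actually reads):

```python
# Sketch of the link generation; "place" is an illustrative parameter
# name and "places.html" an illustrative page name.
from urllib.parse import urlencode

def place_link(name, page="places.html"):
    return f"{page}?{urlencode({'place': name})}"

print(place_link("Saint-Quentin"))  # places.html?place=Saint-Quentin
```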
So the site is now in a satisfactory condition with lots of cool features, and now that I’ve worked out how to do everything I could probably get another book to the same stage within a few weeks. But there are still lots of features that could, and probably should, be added.
Following on from adding an interactive index of people to my digital edition of Sandall’s history of 5th Lincs, I’ve now added a similar feature for place names. It works in exactly the same way as the person index, but it also has a map view. Again this uses the Exhibit API, which makes it very easy to mash up data with Google Maps without even having to know anything about the Google Maps API. The map view is a bit slower than the normal view, especially if the list isn’t filtered, but that’s an inherent limitation of using maps.
One of the many cool things about the map is that it strikingly illustrates the allied advances in the last months of the First World War. If you go into the map view and click “The Beginning of the Great Advance” on the list of chapters, you’ll see the battalion holding the line in Flanders, then moving behind the lines for rest near Amiens, then moving up to the front line at Saint-Quentin. Then click on each of the following chapters in turn and watch the markers surge forward as 46th Division breaks through the Hindenburg Line and pushes towards Belgium.
Adding the place index was mostly similar to adding the person index: I added a unique id to each <placeName> tag using a Python script, pulled the place names out into an SQLite database, identified/disambiguated them and added a regularized name, then used another Python script to pull the regularized names out of the database and put them into the key attributes in the XML file. Identifying the places was easier than identifying people, and took a couple of days, although there are a few that I couldn’t find. As with people I added some code to the XSLT to generate a JSON file of all the places. Then, following the map view tutorial, I used the Exhibit API to pull latitude and longitude co-ordinates from Google Maps and put them into another JSON file. This turned out to be a bit unreliable, as about 10 per cent of the places had their co-ordinates missing. It seems to be random: running the script again with the same set of data produced a similar error rate but with different places. I had to take the missing places from the output file, put them into another input file and run the script over them again, which produced a similar 10 per cent error rate, but the remaining few co-ordinates could be put in manually.

Once I had a JSON file with all the correct geocodes it was easy to copy code from the tutorial to add a map view to the Exhibit page. In a few cases it turned out that Google had given me the wrong co-ordinates. Mostly this was because there are two or more places with the same name and it had picked the wrong one. I thought I’d put in enough information from my manual searches to disambiguate them, but it seems that the results of a Google Maps search can be a bit unpredictable, and don’t necessarily give you the full address of a place.
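To make the first couple of steps concrete, here’s roughly what the id-and-database round trip looks like (a sketch assuming lxml; the file names and id scheme are invented for illustration, not necessarily what I actually used):

```python
# Pass one: stamp each <placeName> with a unique id and dump the raw
# names into SQLite for identification/disambiguation by hand.
import sqlite3
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

tree = etree.parse("sandall.xml")
db = sqlite3.connect("places.db")
db.execute("CREATE TABLE IF NOT EXISTS place (id TEXT PRIMARY KEY, raw TEXT, key TEXT)")

for n, place in enumerate(tree.iter(f"{TEI}placeName"), start=1):
    pid = f"place{n:04d}"
    place.set(XML_ID, pid)
    raw = " ".join("".join(place.itertext()).split())
    db.execute("INSERT OR REPLACE INTO place (id, raw) VALUES (?, ?)", (pid, raw))

db.commit()
tree.write("sandall-ids.xml", encoding="utf-8", xml_declaration=True)

# Pass two (after filling in the key column by hand) reads the
# regularized names back out and writes them into @key attributes.
```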
I’ve now done most of what I planned to do in this phase. There are still some features that could be added, especially a feedback mechanism, but I’ll be giving this project a rest soon so I can do some English Civil War work.
I’ve also made every occurrence of a name in the text into a link which points to the index. My worries about illegal characters in id attributes turned out to be unfounded. With Exhibit I can use the standardized names from the TEI @key attribute as hashes to make permalinks to individual records. Clicking on the link takes you to the index and displays a dialog box with all of that person’s details, including links back to every mention in the text. The dialog box is also displayed by clicking on a person’s name on the index page. I just need to work out a way to display it without having to reload the page.
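Generating the links is the easy part once every name has a key: each mention just points at the index page with the key as the fragment. Schematically (file name illustrative):

```python
# Each mention of a person becomes a link to the index, using the
# TEI @key value as the fragment identifier (the "hash").
def person_link(key, display_name, index_page="people.html"):
    return f'<a href="{index_page}#{key}">{display_name}</a>'

print(person_link("smith-j1", "Sgt J. Smith"))
# <a href="people.html#smith-j1">Sgt J. Smith</a>
```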
Exhibit is really easy to use and makes it possible to add some fairly advanced features with surprisingly little effort. It took some searching, copying examples, trial and error, and asking on the mailing list before I worked out how to do everything, but as the project is documented in a wiki I’ve been able to update it whenever I find out how to do something that isn’t already explained there. The JSON data file for my index page is generated automatically by XSLT which loops through every <persName> and <rs> tag in the TEI document and pulls out extra details (date of death, links to medal cards and CWGC) from another XML file.
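The real version is XSLT, but a rough Python equivalent shows the shape of the data flow (everything here is illustrative: the supplementary file’s element and attribute names are invented, and only the top-level “items” structure follows Exhibit’s documented JSON format):

```python
# Build the Exhibit JSON: one item per keyed person, with extra details
# merged in from a supplementary XML file. A Python stand-in for the XSLT.
import json
from lxml import etree

TEI = "{http://www.tei-c.org/ns/1.0}"

text = etree.parse("sandall.xml")
extra = etree.parse("people-details.xml")  # hypothetical supplementary file
details = {p.get("key"): p for p in extra.iter("person")}

items, seen = [], set()
for tag in ("persName", "rs"):
    for el in text.iter(f"{TEI}{tag}"):
        key = el.get("key")
        if key is None or key in seen:
            continue
        seen.add(key)
        item = {"id": key,
                "label": " ".join("".join(el.itertext()).split()),
                "type": "Person"}
        d = details.get(key)
        if d is not None:  # date of death, medal card and CWGC links
            item.update(died=d.get("died"), medalCard=d.get("medal"),
                        cwgc=d.get("cwgc"))
        items.append(item)

with open("people.json", "w") as f:
    json.dump({"items": items}, f, indent=1)
```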
Now that person names are more or less fully implemented, it’s time to move on to place names. These should be easier to disambiguate, and with Exhibit I can do some even cooler things with them, such as generating a Google map.