Tuesday, May 19, 2015

Day of DH 2015

For the fourth year, I'm participating in the Day of DH.

You can follow my day at the Day of DH blog.

Friday, May 8, 2015

Best Practices at Engaging the Public at CCLA

This is the text of my talk at the best practices panel at the Crowd Consortium for Libraries and Archives meeting Engaging the Public on May 8, 2015.

One caveat: most of my background is in crowdsourced manuscript transcription, though with the development of FromThePage 2 I've become involved in the related fields of collaborative document translation and crowd-sourced OCR correction. I hope that this is useful to non-textual projects as well.

The best practice I'd like to talk about is returning the product of crowd-sourcing to the volunteers who produced it.

What do I mean by product?
I'm not talking about what project managers consider the final product, whether that be item-level finding aids or peer-reviewed papers in the scholarly press. I'm talking about the raw product – the actual work that comes out of a volunteer's direct effort, or the efforts of their fellow volunteers – the transcript of a letter, the corrected text of a newspaper article, the translated photo captions, the carefully researched footnotes and often personal comments left on pages.


First, it's the right thing to do. Yesterday we talked about reciprocity and social justice. An older text says “Thou shalt not muzzle the oxen that tread out the corn.”

Crowdsourced transcription projects vary a lot on this. For wiki-like systems, displaying volunteer transcripts is built into the system – I know that's the case for FromThePage, Transcribe Bentham, and Wikisource, and suspect the same applies to Scripto and DIYHistory. For others, users can't even see their own contributions after they have submitted them. The Smithsonian Institution Transcription Center, by contrast, added this feature deliberately – the team implementing the center added the ability for users to download PDFs of transcribed documents specifically because they felt it was the Right Thing to Do.

Now that I've quoted the Bible, let's talk about purely instrumental reasons crowdsourcing projects should return volunteers' labor to them.


For one thing, exposing the raw data early can better align our projects with the incentives that motivate many volunteers. Most volunteers are not participating because of their affiliation with an institution, nor because they treasure clean library metadata – at least not primarily! What keeps them coming back and contributing is their connection to the material – an intrinsic motivation of experiencing life as a bird-watcher in the 1920s, of marching alongside a Civil War soldier as they transcribe observation cards or diaries.

We should expose the texts volunteers have worked on in ways that are immediately usable to them – PDFs they can print out, texts they can email, URLs they can post on Facebook—to show their friends and families just what they've been up to, and why they're so excited to volunteer.

In some cases this may provide extrinsic rewards project managers can't envision. One of the first projects I worked on – the Zenas Matthews diary of the Mexican-American War – attracted a super-volunteer early on who transcribed the entire diary in two weeks. When I interviewed Scott Patrick, I learned that the biggest reward we could provide – the thing he'd treasure above badges or leaderboards – would be the text itself in a printable and publishable format. You see, Mr. Patrick's heritage organization formally recognizes members who have written books, including editions of primary sources. His contribution to the project certainly matched his fellows' for quality, but the lack of a usable form of the text – the text he'd transcribed himself – was the one thing that stood in his way.


Exposing raw transcripts online during the crowdsourcing process can actually enhance recruitment to crowdsourcing projects. I've seen this in a personal project I worked on, in which one super-volunteer found the project by Googling his own name. You see, a previous volunteer had transcribed a lot of material that mentioned a letter carrier named Nat Wooding. So when Nat Wooding did a vanity search, he found the transcribed diaries, recognized the letter carrier as his great-uncle, and became a major contributor to the project. Had the user-generated transcripts been locked away for expert review, or even published online somewhere outside of the crowdsourcing tool, we would have missed the contributions of a new super-volunteer.


For the past three years, I've been involved with a non-profit called Free UK Genealogy. They have volunteers around the world transcribe genealogical records using offline, spreadsheet-like tools so that the records can be searched on a freely accessible website.

I spent several months building a new system for crowd-sourced transcription of parish registers, but encountered very little enthusiasm—actually some outright opposition—from the most active volunteers. They were used to their spreadsheets, and saw no value at all in changing what they were doing.

Eventually, we switched from improving the transcription tool-chain to improving the delivery system. We re-wrote the public-facing search engine from scratch, focusing on the product visible to the volunteers and their communities. When we launched the site in April, it received the most positive reviews of any software redesign I've been involved with in two decades in the industry. Best of all—although the time frame is too short to have hard numbers—the volunteer community seems to have been reinvigorated, as the FreeREG2 database passed 32 million records at the beginning of the month.

So that's my best practice: expose volunteer contributions online, within your crowdsourcing system, as they are produced. It will improve the quality and productivity of the project, and it's the right thing to do.

Sunday, July 6, 2014

Collaborative Digitization at ALA 2014

This is a transcript of the talk I gave at the Collaborative Digitization SIG meeting at the American Library Association annual meeting on June 28, 2014 in Caesar's Palace casino in Las Vegas.  I was preceded by Frederick Zarndt delivering his excellent talk on Crowdsourcing, Family History, and Long Tails for Libraries, which focused particularly on newspaper digitization and crowdsourced OCR correction.  (See Laura McElfresh's notes [below] for a near-transcript of his talk.)
I'd like to thank Frederick for a number of reasons, one of them being that I don't need to define crowdsourcing, which gives me the opportunity to be a little more technical.
Before we start, I'd just like to make a quick note that all of the slides, the audio files in MP3 format, and a full transcript will be posted at my blog.

I can also direct you to the notes taken by Laura McElfresh [see pp. 19-22] over there who does an amazing job at these [conferences].

Finally, if you tweet about this, there's my handle.

Okay, so we've talked about OCR correction. What's the difference between OCR correction and manuscript transcription? Why would people transcribe manuscripts -- isn't OCR good enough?

I'd like to go into that and talk about the [effectiveness] of OCR on printed material versus handwritten materials.

We're going to go into detail on the results of running Tesseract--which is a popular, open-source OCR tool--on this particular herbarium specimen label.

I chose this one because it's got a title in print up here at the top, and then we've got a handwritten portion down here at the bottom.

So how does Tesseract do with these pieces?

With the print, it does a pretty good job, right? I mean, even though this is sort of an antique typeface, really every character is correct except that this period over here--for some reason--is OCRed as a back-tick.

So it's getting one character wrong out of--fifty, perhaps?

So how about the handwritten portion? What do you get when you run the same Tesseract program on that?

So here's the handwritten stuff, and the results are -- I'm actually pretty impressed -- I think it got the "2" right.

So in this case it got one character right out of the whole thing. So this is actually total garbage.

And my argument is that the quantitative difference in accuracy of OCR software between script versus print actually results in a qualitative difference between these two processes.

This has implications.

One of them is on methodology, which is that--as we've demonstrated--we can't use software to automatically transcribe (particularly joined-up, cursive) writing. You have to use humans.

There are a couple of other implications too, that I want to dive into a bit deeper.

One of them is the goal of the process. In the case of OCR correction, we're talking about improving accuracy of something that already exists. In the case of manuscript transcription, we're actually talking about generating a (rough) transcript from scratch.

The second one comes down to workflow, and I'll go into that in a minute.

Let's talk about findability.

Right now, if you put this page online--this manuscript image--no-one's going to find it. No-one's going to read it. Because Google cannot crawl it -- these are not words to Google, these are pixels. And without a transcript, without that findability, you miss out on the amazing serendipity that is a feature of the internet age. We don't have the serendipity of spotting books shelved next to each other anymore, but we do have the serendipity--in this case--of a retired statistical analyst named Nat Wooding doing a vanity search on his name, encountering a transcript of this diary--my great-great grandmother's diary--mentioning her mailman, Nat Wooding, and realizing that this is his great uncle.

Having discovered this, he started contributing to the project--not financially, but he went through and transcribed an entire year's worth of diaries. So he's contributing his labor.

Other people who've encountered these have made different kinds of contributions. These diaries were distributed on my great-great grandmother's death among her grandchildren. So they were scattered to the four winds. After putting these online, I received a package in the mail one day containing a diary from someone I'd never met, saying "Looks like you'll do more with this than I will." So this element of user engagement is, in this case, bringing the collection back together.

Let's talk about the implications on workflow.

This is--I'm not going to say a typical--OCR correction workflow. The thing that I want to draw your attention to is that OCR correction of print can be done at a very fine grain. The National Library of Finland's Digital Koot project is asking users to correct a small block of text: a single word, a single character even. This lends itself to gamification. It lends itself to certain kinds of quality control, in which maybe you show the same image to multiple people and compare them to see if they match.
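That kind of match-based quality control can be sketched in a few lines (purely illustrative -- not any particular platform's implementation):

```ruby
# Illustrative agreement check: accept a snippet's text only when every
# volunteer's transcription matches after trivial normalization.
def consensus(transcriptions)
  normalized = transcriptions.map { |t| t.strip.downcase }
  normalized.uniq.length == 1 ? transcriptions.first.strip : nil
end

consensus(["Smith ", "smith", "Smith"])  # => "Smith"
consensus(["Smith", "Smyth"])            # => nil (no agreement; re-queue)
```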

That really doesn't work very well with handwritten text, because readers have to get used to a script. Context is really important! And you find this when you put material online: people will go through and transcribe a couple of pages, then say "Oh, that's a 'W'!" And they go back and [correct earlier pages].

I want to tell the story of Page 19. This was a project that was a collaboration between me (and the FromThePage platform) and the Smith Library Special Collections at Southwestern University in Georgetown (Texas). They put a diary of a Texas volunteer in the Mexican-American War online--his name was Zenas Matthews. They found one volunteer who came online and transcribed the whole thing. He added all these footnotes. He did an amazing job.

But let's look at the edit history of one page, and what he did.

We put the material online in September. Two months later, he discovers it, and transcribes it in one session in the morning. Then he comes back in the afternoon and makes a revision to the transcript.

Time passes. Two weeks go by, and he's going back [over the text]. He makes six more revisions in one sitting on December 8, then he makes two more revisions on the next morning. Then another eight months go past, and he comes back in August in the next year, because he's thought of something -- he's reviewing his work and he improves the transcription again. He ends up with [an edition] that I'd argue is very good.

Well, this is very different from the one-time pass of OCR correction. This is, in my opinion, a qualitative difference. We have this deep, editorial approach with crowdsourced transcription.

I'm a tool maker; I'm a tool reviewer, and I'm here to try to give you some hands-on advice about choosing tools and platforms for crowdsourced transcription projects.

Now, I used to go through and review [all of the] tools. Well, I have some good news, which is that there are a lot of tools out there nowadays. There are at least thirty-seven that I'm aware of. Many of them are open source. The bad news is that there are thirty-seven to choose from, and many of them are pretty rough.

So instead of talking about the actual tools, I'm going to direct you to a spreadsheet -- a Google Doc that I put together that is itself crowdsourced. About twenty people have contributed their own tools, so it's essentially a registry of different software platforms for [crowdsourced transcription].

Instead, I'm going to discuss selection criteria -- things to consider when you're looking at launching a crowdsourced transcription project.

The first selection criterion is to look at the kind of material you're dealing with. And there are two broad divisions in source material for transcription.

This top image is a diary entry from Viscountess Emily Anne Strangford's travels through the Mediterranean in the 1850s. The bottom image is a census entry.

These are very different kinds of material. A plaintext transcript that could be printed out and read in bed is probably the [most appropriate purpose] for a diary entry. Whereas, for a census record, you don't really want plaintext -- you want something that can go into a structured database.

And there are a limited number of tools that nevertheless have been used very effectively to transcribe this kind of structured data. FamilySearch Indexing is one that we're all familiar with, as Frederick mentioned it. There are a few others from the Citizen Science world: PyBossa comes from the Open Knowledge Foundation, and Scribe and Notes From Nature both come out of GalaxyZoo. [The Zooniverse/Citizen Science Alliance.] I'm going to leave those, and concentrate on more traditional textual materials.

One of the things you want to ask is, What is the purpose of this transcript? Is mark-up necessary? These kinds of texts, as we're all aware, are not already edited, finished materials.

Most transcription tools which exist ask users for plain-text transcripts, and that's it. So the overwhelming majority of platforms support no mark-up whatsoever.

However, there are two families of mark-up [support] which do exist. One of them is a subset of TEI markup. It's part of this TEI Toolbar which was developed by Transcribe Bentham for their own platform [the Bentham Transcription Desk], which is a modification of MediaWiki. It was later repurposed by the 1916 Letters project and used on top of a totally different software stack, the NARA Transcribr Drupal module [actually DIYHistory]. And what it does is give users a small series of buttons which can be used to mark up features within a text. So this is really useful if you're dealing with marginalia, with additions and deletions within the text, and you want to track all that. Not everybody wants to track all that, but if that's the kind of purpose that you have, you'll want to look at in-page mark-up.

The other form of mark-up is one that I've been using in FromThePage, using wiki-links to do subject identification within the text. [2-3 sentences inaudible: see "Wikilinks in FromThePage" for a detailed presentation given at the iDigBio Original Sources Digitization Workshop.]

What this means is that if users encounter "Irvin Harvey" and it's marked up like this:
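[The slide shows the mark-up itself -- FromThePage's double-square-bracket wiki-link syntax, reconstructed here from the description in "Wikilinks in FromThePage"; the exact form on the slide may have differed:]

```
[[Irvin Harvey]]
```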

The tool will automatically generate an index that shows every time that Irvin Harvey was mentioned within the texts, or read all the pages mentioning Irvin Harvey. You can actually do network analysis and other digital humanities stuff based on [mining the subject mark-up].

So that's a different flavor of mark-up to consider.

Another question to ask is, how open is your project? Right now I know of projects that are using my own FromThePage tool entirely for staff to use internally.

There are others in which they have students working on the transcripts. And in some cases, this is for privacy reasons. For example, Rhodes College Libraries is using FromThePage to transcribe the diaries of Shelby Foote. Well, Shelby Foote only died a few years ago. [His diaries] are private. So this installation is entirely internal. The transcriptions are all done by students. I've never seen it -- I don't have access to it because it's not on the broad Internet.

Then there's the idea of leveraging your own volunteers on-site, with maybe some [ancillary] openness on the Internet. San Diego Natural History Museum is doing this with the people who come in, and ordinarily will volunteer to clean fossils or prepare specimens for photographs. Well, now they're saying Can you transcribe these herpetology field notes?

So these kinds of platforms are not only wide-open crowdsourcing tools; they can be private, and you should consider this. In some cases, the same platform can support both private projects and crowdsourced projects simultaneously, so you can get all of your data in the same place. [One sentence inaudible.]

Branding! Branding may be very important.

Here are a couple of platforms, with screenshots of each.

The first one is the French-language version of Wikisource. Wikisource is a sister project to Wikipedia, spun off around 2003, that allows people to transcribe documents and do OCR correction both. This is being used by the Departmental Archives of Alpes-Maritimes to transcribe a set of journals of episcopal visits. The bishop in the sixteenth century would go around and report on all the villages [in his diocese], so there's all this local history, but it's also got some difficult paleography.

So they're using Wikisource, which is a great tool! It has all kinds of version control. It has ways to track proofreading. It does an elegant job of putting together individual pages into larger documents. But, do you see "Departmental Archives of Alpes-Maritimes" on this page? No! You have no idea [who the institution is]. Now, if they're using this internally, that may be fine -- it's a powerful tool.

By contrast, look at the Letters of 1916. [Three sentences inaudible.] This is public engagement in a public-facing site.

Most platforms are somewhere between the two.

Integration: Let's say you've just done a lot of work to scan a lot of material, gather item-level metadata, and you've [ingested it] into CONTENTdm or another CMS. Now you want to launch a crowdsourcing project. Often, the first thing you have to do is get it all back out again and put it into your crowdsourcing platform.

So you need to look at integration. You need to ask the questions, How am I going to get data into the transcription platform? How am I going to get data back out? These may be totally different things: I know of one project that's trying to get data from Fedora into FromThePage, then trying to get it out of FromThePage by publishing to Omeka. There's a different project that wants to get data from Omeka into FromThePage. But these are totally different code paths! They have nothing to do with each other, believe it or not. So you really have to ask detailed questions about this.

Here are a few of the tools that exist, with what they support. (Or what they plan to support -- last week I was contacted about Fedora support and CONTENTdm support for FromThePage, one on Wednesday and one on Thursday, so if anyone has any advice on integration with those systems, please let me know.)

Hosting: Do you want to install everything on-site? Do you have sysadmins and servers? Is this actually a requirement? Or do you want this all hosted by someone else?

Right now you have pretty limited options for hosting. Notes from Nature and the GalaxyZoo projects host everything themselves. Wikisource and FromThePage can be either local or hosted. Everything else, you've got to download and get running on your servers.

Finally, I'd like to talk a little bit about asking yourself, what are your yardsticks for success?

If you're doing this for volunteer engagement, what does successful engagement look like? I know of one project that launched a trial in which they put some material from 19th century Texas online. One volunteer found this and dove into it. He transcribed a hundred pages in a week, he started adding footnotes -- I mean he just plowed through this. After a couple of weeks, the librarians I was working with cancelled the trial, and I asked them to give me details. One of the things that they said was, We were really disappointed that only one volunteer showed up. Our goal for public engagement was to do a lot of public education and public outreach, and we wanted to reach out [to] a lot of people.

[For them,] a hundred pages transcribed by one volunteer is a failure compared with one page each transcribed by ten volunteers. So what are your goals?

Similarly, if you're using a platform that is a wiki-like platform--an editorial platform--you'll get obsessive users who will go back and revise page 19 over and over again. That may be fine for you. Maybe you want the highest quality transcripts and you don't mind that there's sort of spotty coverage because users come in and only transcribe the things that really interest them.

Other systems try to go for coverage over quality and depth. ProPublica developed the transcribable Ruby on Rails plugin for research on campaign contributions. They intentionally designed their tool with no back button -- there's no way for a user to review what they did. And they wrote a great article about this which is very relevant to this conference venue: it's called "Casino-Driven Design: One Exit, No Windows, Free Drinks". So for them, the page 19 situation would be an absolute failure, while for me I'm thrilled with it. So again there's this trade off of quality versus quantity in product as well as in engagement.
[Audio to follow.]

Friday, March 14, 2014

Wikilinks in FromThePage

From March 10-12, I got to participate in the iDigBio Original Sources Digitization Workshop, a gathering of natural history collections managers, archivists, and technologists. Although the focus of digitization within natural history has been on specimens or specimen labels, this workshop sought to address the challenges and opportunities involved in digitizing ledgers, field notes, and other non-specimen data. As usual for iDigBio events, the workshop was spectacular.

Carolyn Sheffield chaired a panel (video recording) on crowdsourcing which included Rob Guralnik discussing Notes From Nature, Christina Fidler talking about the Grinnell field notes on FromThePage, my talk, and a long, valuable discussion among all participants. My presentation covered the data model and uses of wiki links as I'm using them in FromThePage.

Video, slides, and transcript are below:

"From The Page" - Ben Brumfield from iDigBio on Vimeo.
I'm Ben Brumfield.  You saw a little bit about FromThePage in Christina Fidler's presentation, so I wanted to talk about the internals -- the design and the data structures behind some of the things that make this a little bit different from Notes From Nature or the NARA Transcribr Drupal module.
This is the transcription screen.  You've seen this with Christina, so I'll probably go over this pretty quickly.  This is a full-text transcription, not individual records like you get with Notes From Nature. 
The reason for that is that FromThePage was built to be a wiki-like tool, purpose-built for creating amateur editions.  So we've got a text and we want to create an edition from the text that can then be re-used, printed, and analyzed.

I say "amateur" editions because we're not dealing with the kinds of things that textual scholars in the humanities are dealing with, where they're trying to compare different variant manuscript versions of Chaucer.  [By contrast, we] have something that's very straightforward, and we're interested in some fairly simple annotations.

It's purpose-built -- free-standing on MySQL and Ruby on Rails, so it's not integrated with MediaWiki or anything like that.
So who's using it?

[FromThePage] was built originally for a set of my great-great grandmother's diaries.

Since then it's been used for military diaries by libraries and history departments.
It's been used for literary diaries--in this case for Shelby Foote's diaries--for literary drafts, and for punk rock fanzines.  (Which is kind of awesome!)
So what does that have to do with the people in this room and the kind of material [we're working with]?

Here's an example:  This is an 1859 journal from an expedition in which someone went out and made a number of observations and collected some things to bring back with them.  There are scholars interested in mining those.

But it's not a naturalist expedition.  This is Viscountess Emily Anne Smyth Strangford, who in this case is touring the Mediterranean and visiting a lot of classical monuments.  The folks at the Duke Collaboratory for Classics Computing are interested in finding all the places in which she recorded Latin and Greek inscriptions, coming up with her itinerary, and figuring out how [that data] connects to the objects her father-in-law had collected for the British Museum twenty years earlier.

So there's a lot of correspondence, I tend to think, with field notes.
The San Diego Natural History Museum started using FromThePage for field books in 2010.  They're still working on the project.
  • They've identified ten thousand subjects worth classifying in their system.
  • Individual pages have been edited twenty-four thousand times.  And this goes back to the wiki-like approach -- people transcribe a page, and then they revisit it. They make a number of edits to a page as they get comfortable with the handwriting.
  • And then they've linked individual observations, species mentioned, and people in the field notes to those subjects forty-two thousand times.
Then there are a couple of other projects working with field notes.  [Museum of Vertebrate Zoology] obviously is in trial, and [the Museum of Comparative Zoology] and Missouri Botanical Gardens are just evaluating the software right now.  
So, what is a wiki link?

Any of us who've edited Wikipedia may be used to this.  I followed the same syntax [in FromThePage].

What we have here is a set of double square brackets with the canonical name of the subject--this could be a formatted date, this could be a full name that's spelled out--and then the text that's actually used within the verbatim transcript.

So our example here -- this is when Grinnell meets Klauber.  The field note actually says "L. M. Klauber", so the person transcribing has expanded this out to "Laurence M. Klauber".  So we have the ability to handle variance in references to Klauber, but still identify them as Klauber.
Technically speaking, what's behind one of these wiki links?

There are a lot of tables in this database.
  • We know that there's this page that Klauber is mentioned on.  It's S1 Page 3 in the Grinnell field notes that MVZ has online.
  • We've got a subject which is Laurence M. Klauber.
  • The subject is categorized as a person, which can be used for analysis and filtering, like Christina showed you.
  • And then the individual link between the page and the subject, that contains the variation, is also stored.
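As a sketch, those relationships look something like this in plain Ruby (the names here are illustrative, not FromThePage's actual schema):

```ruby
# A plain-Ruby sketch of the tables behind a wiki link: pages, subjects
# with categories, and one link row per mention that stores the verbatim
# variation used on that page.
Page    = Struct.new(:id, :title)
Subject = Struct.new(:id, :canonical_name, :category)
Link    = Struct.new(:page, :subject, :verbatim)

class LinkIndex
  def initialize
    @links = []
  end

  # Record that `verbatim` on `page` refers to `subject`.
  def add(page, subject, verbatim)
    @links << Link.new(page, subject, verbatim)
  end

  # Every page that mentions a given canonical subject.
  def pages_mentioning(name)
    @links.select { |l| l.subject.canonical_name == name }.map(&:page).uniq
  end

  # Every recorded spelling of a subject -- useful for seeding OR queries.
  def variations(name)
    @links.select { |l| l.subject.canonical_name == name }.map(&:verbatim).uniq
  end
end

index   = LinkIndex.new
page    = Page.new(3, "S1 Page 3")
klauber = Subject.new(1, "Laurence M. Klauber", "person")
index.add(page, klauber, "L. M. Klauber")

index.pages_mentioning("Laurence M. Klauber").map(&:title)  # => ["S1 Page 3"]
index.variations("Laurence M. Klauber")                     # => ["L. M. Klauber"]
```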
So there are a lot of things you can do with that.
  • You can show all the pages that mention Laurence M. Klauber, and read the pages in context or just get a listing of them.
  • More helpfully, as you're transcribing we can mine those links to automatically suggest mark-up.  So the next time we encounter "L. M. Klauber", we can push a button and that will automatically expand the mark-up of "L. M. Klauber" to "[[Laurence M. Klauber|L. M. Klauber]]".
  • You can also feed this to full-text searches.  So if you've got a lot of plain-text transcripts which contain Laurence M. Klauber, we can automatically populate the search with those variations, creating an OR query with "Klauber", "L. M. Klauber", and "Laurence M. Klauber".
  • And then we can mine the mark-up for correspondences [between subjects] as Christina showed.
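The auto-suggestion step above can be sketched like so (illustrative, not the actual FromThePage code):

```ruby
# Once "L. M. Klauber" has been linked to "Laurence M. Klauber" somewhere,
# later occurrences of the verbatim form can be expanded into full
# wiki-link mark-up at the push of a button.
KNOWN_VARIATIONS = { "L. M. Klauber" => "Laurence M. Klauber" }

def suggest_markup(text, known = KNOWN_VARIATIONS)
  known.reduce(text) do |t, (verbatim, canonical)|
    t.gsub(verbatim, "[[#{canonical}|#{verbatim}]]")
  end
end

suggest_markup("Collecting with L. M. Klauber today")
# => "Collecting with [[Laurence M. Klauber|L. M. Klauber]] today"
```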

The last thing you can do with it is export.
Here is a TEI-XML export of the Joseph Grinnell notes.  This is useful for interchange, but the most important thing this does is that it allows amateurs to create well-formatted, TEI P5-compliant XML.  And it will handle one of the things that's very hard about creating TEI in an XML editor, which is associating reference strings with their entries over in the TEI header, which describes who the people are outside the text.
This is a CSV export of the Grinnell field notes.  Basically this is every observation and every person who's mentioned, exported as a CSV file with links back to the pages and URLs at which those pages can be found.  This is the kind of thing that perhaps could be ingested into [museum collection management database] Arctos.
Future plans:

We're going to be doing more CMS integrations.  We're working on Omeka.  The Internet Archive is done.  There are a couple of grant applications that involve hooking FromThePage up to Fedora Commons.

We also really want to contextualize links in time and place.  We want the ability for people to define where the journal's author was and when they were writing, and then to apply those geotags and chronotags to the references.  So you could map when species were mentioned.  You could extract a visual itinerary.

We need more formatting options.  One of our volunteers has found all kinds of crazy editorial issues for handling strike-outs and things like that.

And the last thing that we're looking for is more projects.

Tuesday, December 31, 2013

Code and Conversations in 2013

It's often hard to explain what it is that I do, so perhaps a list of what I did will help.  Inspired by Tim Sherratt's "talking" and "making" posts at the end of 2012, here's my 2013. 


I work on a number of software projects, whether as contract developer, pro bono "code fairy", or product owner.  


It's been a big year for FromThePage, my open-source tool for manuscript transcription and annotation.  We started work upgrading the tool to Rails 3, and built a TEI Export (see discussion on the TEI-L) and an exploratory Omeka integration.  Several institutions (including University of Delaware and the Museum of Vertebrate Zoology) launched trials on FromThePage.com for material ranging from naturalist field notes to Civil War diaries.  Pennsylvania State University joined the ranks of on-site FromThePage installations with their "Zebrapedia", transcribing Philip K. Dick's Exegesis -- initially as a class project and now as an ongoing work of participatory scholarship.

One of the most interesting developments of 2013 was that customizations and enhancements to FromThePage were written into three grant applications.  These enhancements--if funded--would add significant features to the tool, including Fedora integration, authority file import, redaction of transcripts and facsimiles, and support for externally-hosted images.  All these features would be integrated into the FromThePage source, benefiting everybody.

Two other collaborations this year promise interesting developments in 2014.  The Duke Collaboratory for Classics Computing (DC3) will be pushing the tool to support 19th-century women's travel diaries and Byzantine liturgical texts, both of which require more sophisticated encoding than the tool currently supports.  (Expect Unicode support by Valentine's Day.)  The Austin Fanzine Project will be using a new EAC-CPF export which I'll deliver by mid-January.

OpenSourceIndexing / FreeREG 2

Most of my work this year has been focused on improving the new search engine for the twenty-six million church register entries the FreeREG organization has assembled in CSV files over the last decade and a half.  In the spring, I integrated the parsed CSV records into the search engine and converted our ORM to Mongoid.  I also launched the Open Source Indexing Github page to rally developers around the project and began collecting case studies from historical and genealogical organizations.

In May, I added a parser for historical dates to the search engine I'm building for FreeREG.  It handles split dates like "4 Jan 1688/9" and illegible date portions in UCF like "4 Jan 165_", preserving the verbatim transcription while still handling searching and sorting correctly.  Eventually I'll incorporate this into an antique_date gem for general use.
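For the curious, here's a minimal sketch of the split-date idea -- my own illustration, not the actual FreeREG code, with UCF handling omitted for brevity:

```ruby
MONTHS = %w[Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec].freeze

# Parse a transcribed date like "4 Jan 1688/9" into a sortable key
# while preserving the verbatim string.  A split year like "1688/9"
# reflects the old-style year beginning on 25 March; the digits after
# the slash give the modern (new-style) year.
def parse_historical_date(verbatim)
  m = verbatim.match(%r{\A(\d{1,2})\s+([A-Za-z]{3})\s+(\d{4})(?:/(\d{1,2}))?\z})
  return nil unless m
  day, month_name, year, split = m.captures
  month = MONTHS.index(month_name.capitalize)
  return nil unless month
  # Replace the trailing digits of the year with the split suffix:
  # "1688" + "9" -> 1689
  modern_year = split ? (year[0, year.length - split.length] + split).to_i : year.to_i
  { verbatim: verbatim,
    sort_key: format('%04d-%02d-%02d', modern_year, month + 1, day.to_i) }
end
```

The verbatim string survives for display, while the sort key uses the modern year, so "4 Jan 1688/9" sorts as 1689.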

Most of the fall was spent adding GIS search capabilities to the search engine.   In fact, my last commit of the year added the ability to search for records within a radius of a place.  The new year will bring more developments on GIS features, since an effective and easy interface to a geocoded database is just as big a challenge as the geocoding logic itself.
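As an illustration of what a radius search boils down to in MongoDB (my own sketch, not the actual code -- the "location" field name and the method name are assumptions), the selector uses $geoWithin with $centerSphere, converting the radius to radians by dividing by the earth's radius:

```ruby
# Mean radius of the earth in miles, used to convert a search radius
# into the radians that $centerSphere expects.
EARTH_RADIUS_MILES = 3963.2

# Build a MongoDB selector for records whose "location" field (a
# [lng, lat] pair under a 2dsphere index) lies within radius_miles of
# the given coordinates.  Note MongoDB's longitude-first convention.
def radius_query(lat:, lng:, radius_miles:)
  { 'location' => {
      '$geoWithin' => {
        '$centerSphere' => [[lng, lat], radius_miles / EARTH_RADIUS_MILES]
      } } }
end
```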

Other Projects

In January I added a command-line wrapper to Autosplit, my library for automatically detecting the spine in a two-page flatbed scan and splitting the image into recto and verso halves.  In addition to making the tool more usable, this work added support for notebook-bound books, which must be split top-to-bottom rather than left-to-right.
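The core idea can be illustrated with a toy version (my own sketch, not Autosplit's actual algorithm): the spine of a two-page flatbed scan shows up as a dark vertical band near the middle of the image, so the column with the lowest total brightness in the central third is a good guess for the split point.

```ruby
# Given a grayscale image as rows of brightness values (0 = black,
# 255 = white), guess the spine as the darkest column near the center.
def find_spine_column(pixels)
  width = pixels.first.length
  range = (width / 3)...(2 * width / 3) # only consider the central third
  range.min_by { |col| pixels.sum { |row| row[col] } }
end
```

For top-to-bottom splitting of notebook-bound scans, the same search would run over rows instead of columns.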

For the iDigBio Augmenting OCR Hackathon in February, I worked on two exploratory software projects.  HandwritingDetection (code, write-up) analyzes OCR text to look for patterns characteristically produced when OCR tools encounter handwriting.    LabelExtraction (code, write-up) parses OCR-generated bounding boxes and text to identify labels on specimen images.  To my delight, in October part of this second tool was generalized by Matt Christy at the IDHMC to illustrate OCR bounding boxes for the eMOP project's work tuning OCR algorithms for Early Modern English books.
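To give the flavor of this kind of analysis (a made-up heuristic for illustration, not the actual HandwritingDetection code): OCR engines faced with handwriting tend to emit short, symbol-laden tokens, so a page whose OCR output is mostly "junk" words is a candidate for handwritten content.

```ruby
# Flag OCR output as likely handwriting when more than `threshold`
# of its tokens fail to look like ordinary words (two or more letters,
# optionally followed by one punctuation mark).
def likely_handwriting?(ocr_text, threshold: 0.5)
  words = ocr_text.split
  return false if words.empty?
  junk = words.count { |w| w !~ /\A[[:alpha:]]{2,}[[:punct:]]?\z/ }
  junk.to_f / words.length > threshold
end
```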

In June and July, I started working on the Digital Austin Papers, contract development work for Andrew Torget at the University of North Texas.  This was what freelancers call a "rescue" project, as the digital edition software had been mostly written but was still in an exploratory state when the previous programmer left.  My job was to triage features, then turn off anything half-done and non-essential, complete anything half-done and essential, and QA and polish core pieces that worked well.  I think we're all pretty happy with the results, and hope to push the site to production in early 2014.  I'm particularly excited about exposing the TEI XML through the delivery system as well as via GitHub for bulk re-use.

Also in June, I worked on a pro bono project with the Civil War-era census and service records from Pittsylvania County, Virginia which were collected by Jeff McClurken in his research.  My goal is to make the PittsylvaniaCivilWarVets database freely available for both public and scholarly use.   Most of the work remaining here is HTML/CSS formatting, and I'd welcome volunteers to help with that. 

In November, I contributed some modifications to Lincoln Mullen's Omeka client for ruby.  The client should now support read-only interactions with the Omeka API for files, as well as being a bit more robust.

December offered the opportunity to spend a couple of days building a tool for reconciling multi-keyed transcripts produced from the NotesFromNature citizen science UI.  One of the things this effort taught me was how difficult it is to find corresponding transcripts to reconcile -- a very different problem from reconciliation itself.  The project itself is over, but ReconciliationUI is still deployed on the development site.
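Once corresponding transcripts are found, the reconciliation step itself can be sketched as a field-by-field majority vote (my own illustration, not the actual ReconciliationUI code):

```ruby
# Given several volunteers' transcripts of the same image (hashes of
# field name to keyed value), take the most common value for each field
# and flag fields where the keyers disagree for human arbitration.
def reconcile(transcripts)
  fields = transcripts.flat_map(&:keys).uniq
  fields.each_with_object({}) do |field, result|
    values = transcripts.map { |t| t[field] }.compact
    best = values.tally.max_by { |_, count| count }
    result[field] = { value: best && best[0],
                      disputed: values.uniq.length > 1 }
  end
end
```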


February 13-15 -- iDigBio Augmenting OCR Hackathon at the Botanical Research Institute of Texas.  "Improving OCR Inputs from OCR Outputs?" (See below.)

February 26 -- Interview with Ngoni Munyaradzi of the University of Cape Town.  See our discussion of his work with Bushman languages of southern Africa.

March 20-24 -- RootsTech in Salt Lake City.  "Introduction to Regular Expressions"

April 24-28 -- International Colloquium Itinera Nova in Leuven, Belgium.  "Itinera Nova in the World(s) of Crowdsourcing and TEI". 

May 7-8 -- Texas Conference on Digital Libraries in Austin, Texas.  I was so impressed with TCDL when Katheryn Stallard and I presented in 2012 that I attended again this year.  While I was disappointed to miss Jennifer Hecker's presentation on the Austin Fanzine Project, I was so impressed with Nicholas Woodward's talk in the same time slot that I talked him into writing it up as a guest post.

May 22-24 -- Society of Southwestern Archivists Meeting in Austin, Texas.  On a fun panel with Jennifer Hecker and Micah Erwin, I presented "Choosing Crowdsourced Transcription Platforms"

July 11-14 -- Social Digital Scholarly Editing at the University of Saskatchewan.  A truly amazing conference.  My talk: "The Collaborative Future of Amateur Editions".

July 16-20 -- Digital Humanities at the University of Nebraska, Lincoln.  Panel "Text Theory, Digital Document, and the Practice of Digital Editions".  My brief talk discussed the importance of blending both theoretical rigor and good usability into editorial tools.

July 23 -- Interview with Sarah Allen, Presidential Innovation Fellow at the Smithsonian Institution.  Sarah's notes are at her blog Ultrasaurus under the posts "Why Crowdsourced Transcription?" and "Crowdsourced Transcription Landscape".

September 12 -- University of Southern Mississippi. "Crowdsourcing and Transcription".  An introduction to crowdsourced transcription for a general audience.

September 20 -- Interview with Nathan Raab for Forbes.com.  Nathan and I had a great conversation, although his article "Crowdsourcing Technology Offers Organizations New Ways to Engage Public in History" was mostly finished by that point, so my contributions were minor.  His focus on the engagement and outreach aspects of crowdsourcing and its implications for fundraising is one to watch in 2014.

September 25 -- Wisconsin Historical Society.  "The Crowdsourced Transcription Landscape".  Same presentation as USM, with minor changes based on their questions.  Contents: 1. Methodological and community origins.  2. Volunteer demographics and motivations.  3. Accuracy.  4. Case study: Harry Ransom Center Manuscript Fragments.  5. Case study: Itinera Nova at Stadarchief Leuven.

September 26-27 -- Midwest Archives Conference Fall Symposium in Green Bay, Wisconsin.  "Crowdsourcing Transcription with Open Source Software".  1. Overview: why archives are crowdsourcing transcription.  2. Selection criteria for choosing a transcription platform.  3. On-site tools: Scripto, Bentham Transcription Desk, NARA Transcribr Drupal Module, Zooniverse Scribe.  4. Hosted tools deep-dive: Virtual Transcription Laboratory, Wikisource, FromThePage.

October 9-10 -- THATCamp Leadership at George Mason University.  In "Show Me Your Data", Jeff McClurken and I talked about the issues that have come up in our collaboration to put online the database he developed for his book, Take Care of the Living.  See my summary or the expanded notes.

November 1-2 -- Texas State Genealogy Society Conference in Round Rock, Texas.  Attempting to explore public interest in transcribing their own family documents, I set up as an exhibitor, striking up conversations with attendees and demoing FromThePage.  The minority of attendees who possessed family papers were receptive, and in some cases enthusiastic about producing amateur editions.  Many of them had already scanned in their family documents and were wondering what to do next.  That said, privacy and access control was a very big concern -- especially with more recent material which mentioned living people.

November 7 -- THATCamp Digital Humanities & Libraries in Austin, Texas. Great conversations about CMS APIs and GIS visualization tools.

November 19-20 -- Duke University.  I worked with my hosts at the Duke Collaboratory for Classics Computing to transcribe a 19th-century travel diary using FromThePage, then spoke on "The Landscape of Crowdsourcing and Transcription", an expansion of my talks at USM and WHS.  (See a longer write-up and video.)

December 17-20 -- iDigBio Citizen Science Hackathon.  Due to schedule conflicts, I wasn't able to attend this in person, but followed the conversations on the wiki and the collaborative Google docs.  For the hackathon, I built ReconciliationUI, a Ruby on Rails app for reconciling different NotesFromNature-produced transcripts of the same image on the model of FamilySearch Indexing's arbitration tool.


All these projects promise to keep me busy in the new year, though I anticipate taking on more development work in the summer and fall.  If you're interested in collaborating with me in 2014--whether to give a talk, work on a software project, or just chat about crowdsourcing and transcription--please get in touch.

Saturday, November 23, 2013

"The Landscape of Crowdsourcing and Transcription" at Duke University

I spent part of this week at Duke University with the Duke Collaboratory for Classics Computing -- Josh Sosin, Hugh Cayless, and Ryan Baumann. We discussed ideas for mobile epigraphy applications, argued about text encoding, and did some hacking. We loaded an instance of FromThePage onto the DC3's development machine and seeded it with the 1859 journal of Viscountess Emily Anne Beaufort Smyth Strangford (part of Duke Libraries' amazing collection of Women's Travel Diaries). Transcribing six pages of her tour through Smyrna and Syria together suggested some exciting enhancements for the transcription tool, revealing a few bugs along the way. I'm really looking forward to collaborating with the DC3 on this project.

On Wednesday, I gave an introductory talk on crowdsourced manuscript transcription at the Perkins Library: "The Landscape of Crowdsourcing and Transcription":
One of the most popular applications of crowdsourcing to cultural heritage is transcription. Since OCR software doesn’t recognize handwriting, human volunteers are converting letters, diaries, and log books into formats that can be read, mined, searched, and used to improve collection metadata. But cultural heritage institutions aren’t the only organizations working with handwritten material, and many innovations are happening within investigative journalism, citizen science, and genealogy.
This talk will present an overview of the landscape of crowdsourced transcription: where it came from, who’s doing it, and the kinds of contributions their volunteers make, followed by a discussion of motivation, participation, recruitment, and quality controls.
The talk and visit got a nice write-up in Duke Today, which includes this quote by Josh Sosin:
Sosin said that although many students and professors visit the library's collections and partially transcribe the sources that are pertinent to their research, nearly all of these transcripts disappear once the researchers leave the library.
"Scholars or students come to the Rubenstein, check out these precious materials, they transcribe and develop all sorts of interesting ideas about them," Sosin said. "Then they take their notebooks out of the library and we lose all the extra value-added materials developed by these students. If we can host a platform for students and scholars to share their notes and ideas on our collections, the library's base of knowledge will grow with every term paper or book that our scholars produce."
Video of "The Landscape of Crowdsourcing and Transcription" (by Ryan Baumann):

Slides from the talk:

Previous versions of this talk were delivered at University of Southern Mississippi (2013-09-12) and the Wisconsin Historical Society (2013-09-25). It differs substantially in the discussion of quality control mechanisms (on the video from 26:15 through 31:30, slides 37-40), an addition which was suggested by questions posed at USM and WHS.

Friday, October 25, 2013

Feature: TEI-XML Export

How do you get the data out?

This is a question I hear pretty often, particularly from professional archivists.  If an institution and its users have put the effort into creating digital editions on FromThePage, how can they pull the transcripts out of FromThePage to back them up, repurpose them, or import them into other systems?

This spring, I created an XHTML exporter that will generate a single-page XHTML file containing transcripts of a work's pages, their version history, all articles written about subjects within the work, and internally-linked indices between subjects and pages.  Inspired by conversations at the TEI and SDSE conferences and informed by my TEI work for a client project, I decided to explore a more detailed export in TEI.

This is the result, posted on github for discussion:
Zenas Matthews' Mexican War Diary was scanned and posted by Southwestern University's Smith Library Special Collections.  It was transcribed, indexed, and annotated by Scott Patrick, a retired petroleum worker from Houston.

Julia Brumfield's 1919 Diary was scanned and posted by me, transcribed largely by volunteer Linda Tucker, and indexed and annotated by me.

I requested comment on the TEI mailing list (see the thread "Draft TEI Export from FromThePage"), and got a lot of really helpful, generous feedback both on- and off-list.  It's obvious that I've got more work to do for certain kinds of texts--which will probably involve creating a section header notation in my wiki mark-up--but I'm pretty pleased with the results.

One of the most exciting possibilities of TEI export is interoperability with other systems.  I'd been interested in pushing FromThePage editions to TAPAS, but after I posted the TEI-L announcement, Peter Robinson pulled some of the exports into Textual Communities.  We're exploring a way to connect the two systems, which might give editors the opportunity to do the sophisticated TEI editing and textual scholarship supported by Textual Communities starting from the simple UI and powerful indexing of FromThePage.   I can imagine an ecosystem of tools good at OCR correction, genetic mark-up, display and analysis of correspondence, amateur-accessible UIs, or preservation -- all focusing on their strengths and communicating via TEI-XML.

I'm interested in more suggestions for ways to improve the exports, new things to do with TEI, or systems to explore integration options before I deploy the export feature on production. 

Sunday, October 20, 2013

A Gresham's Law for Crowdsourcing and Scholarship?

This is a comment I wanted to make at Neil Fraistat's "Participatory DH" session (proposal, notes) at THATCamp Leadership, but ended up having on twitter instead.

Much of the discussion in the first half of the session focused on the qualitative difference between the activities we ask amateurs to do and the activities performed by scholars.  One concern voiced was that we're not asking "citizen scholars" to do real scholarly work, and then labeling their activity scholarship -- a concern I share with regard to editing.  If most crowdsourcing projects ask amateurs to do little more than wash test tubes, where are the projects that solicit scholarly interpretation?

The Harry Ransom Center's Manuscript Fragments Project is just such a crowdsourcing project, and I think the results may be disquieting.  In this project, fragments of medieval manuscripts reused as binding for printed books are photographed and posted on Flickr.  Volunteers use the comments to identify the fragments, discussing the scribal hand and researching the source texts. I'd argue that while this does not duplicate the full range of an academic medievalist's scholarly activities, it's certainly not just "bottle-washing" either.

The project has been very successful.  (See organizer Micah Erwin's talks for details.)  Most of the contributions to the project have been made on Flickr in the comments by a few "super volunteers" -- retired rare book dealers and graduate students among them.  However, around 20% of the identifications were made by professional medievalists who learned about the project, visited the Flickr site, and then called or emailed the project organizer.  None of their contributions were made on the public Flickr forum at all.

So why did professional scholars avoid contributing in public?  I related this on Twitter, and got some interesting suggestions.
Many of these suggest a sort of Gresham's Law of crowdsourcing, in which inviting the public to participate in an activity lowers that activity's status, driving out professionals concerned with their reputation. 

There's a more reassuring explanation as well -- many people with domain expertise still aren't very comfortable with technology.  Asking them to use a public forum puts additional pressure on them, as any mistakes in typing, encoding, and using the forum will be public and likely permanent.  This challenge is not confined to professionals, either -- I receive commentary on the Julia Brumfield Diaries via email from people without high school degrees, who have no professional reputation to protect.

Wednesday, July 24, 2013

University of Delaware and Cecil County Historical Society on FromThePage

Over the last few months, the University of Delaware and the Cecil County Historical Society have been using FromThePage to transcribe the diary of a minister serving in the American Civil War.  They're using the project to expose undergraduates to primary sources while also improving access to an important local history document.

The county has documented the process with an extensive post on the Cecil County Historical Society Blog, which was picked up by the Cecil Daily.

The university also put together a lovely video providing background on the project and interviewing students and faculty members involved in the project:

One of the things I find most interesting about the project is the collaboration between digital humanities-focused university faculty and the county historical society:
Kasey Grier, director of the Museum Studies Program and the History Media Center at the university, says the transcription will be done by students in a process called “crowd sourcing.”

“Crowd sourcing,” according to Grier, “is when students in remote locations, review the handwritten text and try their hand at transcribing it. They then submit their contributions which are reviewed and put up online. Eventually, all of the diary entries will be available for anyone to access and read.”
Historical Society of Cecil County President Paul Newton says the society welcomes this collaboration with the University of Delaware and hopes to strengthen it because it broadens the society’s horizons and reach.
“The university’s focus is in the area of the digital humanities, which allows us to take largely unused and un-accessed collections and get the material out to a broader audience for study. It is also a preservation method as it reduces handling and makes interpretation much easier,” Grier said.
 You can see the Joseph Brown Diary and the students' work on it at the project site on FromThePage.com.

Saturday, July 13, 2013

The Collaborative Future of Amateur Editions

This is the transcript of my talk at Social Digital Scholarly Editing at the University of Saskatchewan in Saskatoon on July 11 2013.
I'm Ben Brumfield.  I'm not a scholarly editor, I'm an amateur editor and professional software developer.  Most of the talks that I give talk about crowdsourcing, and crowdsourcing manuscript transcription, and how to get people involved. I'm not talking about that today -- I'm here to talk about amateur editions.

So let's talk about the state of amateur editions as it was, as it is now, as it may be, and how that relates to the people in this room.
Let's start with a quote from the past.  This was written in 1996, representing what I think may be a familiar sort of consensus [scholarly] opinion about the quality of amateur editions, which can be summed up in the word "ewww!"
So what's going on now?  Before I start looking at individual examples of amateur editions, let's define--for the purpose of this talk--what an amateur edition is.

Ordinarily people will be talking about three different things:
  • They can be talking about projects like Paul's, in which you have an institution who is organizing and running the project, but all the transcription, editing, and annotation is done by members of the public.
  • Or, they can be talking about organizations like FreeREG, a client of mine, which is a genealogy organization in the UK that is transcribing all the parish registers of baptisms, marriages, and burials from the Reformation up to 1837.  In that case, all the material--all the documents--are held at local records offices and archives, who in many cases are quite hostile to the volunteer attempt to put these things online.  Nevertheless, over the last fifteen years, they've managed to transcribe twenty-four million of these records, and are still going strong.
  • Finally, they can be talking about amateur-run editions of amateur-held documents.  These are cases like me working on my great-great grandmother's diaries, which is what got me into this world [of editing].
I'm going to limit that [definition] slightly and get rid of crowdsourcing.  That's not what I want to talk about right now.  I don't want to talk about projects that have the guiding hand of an institutional authority, whether that's an archive or a [scholarly] editor.
So let's take a look at amateur editions.  Here's a site called Soldier Studies.  Soldier Studies is entirely amateur-run.  It's organized by a high-school history teacher who got really involved in trying to rescue documents from the ephemera trade.
The sources of the transcripts of correspondence from the American Civil War are documents that are being sold on eBay.  He sees the documents that are passing through--and many of them he recognizes as important, as an amateur military historian--and he says, I can't purchase all of these, and I don't belong to an institution that can purchase them. Furthermore, I'm not sure that it's ethical to deal in this ephemera trade--there is some correlation to the antiquities trade--but wouldn't it be great if we could transcribe the documents themselves and just save those, so that as they pass from a vendor to a collector, some of the rest of us can read what's on these documents?
So he set up this site in which users who have access to these transcripts can upload letters.  They upload these transcripts, and there's some basic metadata about locations and subjects that makes the whole thing searchable.  
But the things that I think people in here--and I myself--will be critical about are the transcription conventions that he chose, which are essentially none.  He says, correspondence can be entered as-is--so maybe you want to do a verbatim transcript, but maybe not--and the search engines will be able to handle it.

A little bit more shocking is that -- you know, he's dealing with people who have scans--they have facsimile images--so he says, we're going to use that.  Send us the first page, so that we know that you're not making this piece of correspondence up completely, fabricating it out of whole cloth. 
So that's not a facsimile edition, and we don't have transcription conventions.  He has this caveat, in which he explains that this [site] is reliable because we have "the first page of the document attached to the text transcription as verification that it was transcribed from that source."  So you'll be able to read one page of facsimile from this transcript you have.  We do our best, we're confident, so use them with confidence, but we can't guarantee that things are going to be transcribed validly.

Okay, so how much use is that to a researcher? 
This puts me in mind of Peter Shillingsburg's "Dank Cellar of Electronic Texts", in which he talks about the world "being overwhelmed by texts of unknown provenance, with unknown corruptions, representing unidentified or misidentified versions."

He's talking about things like Project Gutenberg, but that's pretty much what we're dealing with right here.  How much confidence could a historian place in the material on this site?  I'm not sure.

Here's an example of an amateur edition which is in a noble cause, but which is really more ammunition for the earlier quote.
So what about amateur editions that are done well?  This is the Papa's Diary Project, which is a 1924 diary of a Jewish immigrant to New York, transcribed by his grandson.

What's interesting about this -- he's just using Blogger, but he's doing a very effective job of communicating to his reader:
So here is a six-word entry.  We have the facsimile--we can compare and tell [the transcript] is right: "At Kessler's Theater.  Enjoyed Kreuzer Sonata."

So the amateur who's putting this up goes through and explains what Kessler's theater is, who Kessler was.
Later on down in that entry, he explains that Kessler himself died, and the Kreuzer Sonata is what he died listening to.  Further down the page you can listen to the Kreuzer Sonata yourself.

So he's taken this six-word diary entry and turned it into something that's fascinating, compelling reading.  It was picked up by the New York Times at one point, because people got really excited about this.
Another thing that amateurs do well is collaborate.  Again: Papa's Diary Project.  Here is an entry in which the diarist transcribed a poem called "Light". 
Here in the comments to that entry, we see that Jerroleen Sorrensen has volunteered: Here's where you can find [the poem] in this [contemporary] anthology, and, by the way, the title of the poem is not "Light", but "The Night Has a Thousand Eyes".

So we have people in the comments who are going off and doing research and contributing.
I've seen this myself.  When I first started work on FromThePage, my own crowdsourced transcription tool, I invited friends of mine to do beta testing.

I started off with an edition that I was creating based on an amateur print edition of the same diary from fifteen years previously.

If you look at this note here, what you see is Bryan Galloway looking over the facsimile and seeing this strange "Miss Smith sent the drugg... something" and correcting the transcript--which originally said "drugs"--saying, Well actually that word might be "drugget", and "drugget", if you look on Wikipedia, is a coarse woolen fabric.  Which--since it's January and they're working with [tobacco] plant-beds--that's probably what it is.

Well, I had no idea--nobody who's read this had any idea--but here's somebody who's going through and doing this proofreading, and he's doing research and correcting the transcription and annotating at the same time.
Another thing that volunteers do well is translate.  This is the Kriegstagebuch von Dieter Finzen, who was a soldier in World War I, and then was drafted in World War II.  This is being run by a group of volunteers, primarily in Germany.

What I want to point out is, that here is the entry for New Year's Day, 1916.  They originally post the German, and then they have volunteers who go online and translate the entry into English, French, and Italian.

So now, even though my German is not so hot, I can tell that they were stuck drinking grenade water.
So, what's the difference?

What's the difference between things that amateurs seem to be doing poorly, and things that they're doing well?

I think that it comes down to something that Gavin Robinson identified in a blog post that he wrote about six years ago about the difference between professional historians/academic historians and amateur historians.  What he essentially says is that professionals--particularly academics, but most professionals--are particularly concerned with theory.  They're concerned with their methodologies and with documenting their methodologies.

This is something that amateurs, in many cases, are not concerned with -- don't know exists -- maybe have never even been exposed to.
So, based on that, let's talk about the future.

How can we get amateurs--doing amateur editions on their own--to move from the things that they're doing well and poorly to being able to do everything well that's relevant to researchers' needs?

I see three major challenges to high-quality amateur editions.

The first one is one which I really want to involve this community in, which is ignorance of standards.  The idea that you might actually include facsimiles of every page with your transcription -- that's a standard.  I'm not talking about standards like TEI -- I'd love for amateur editions to be elevated to the point that print editions were in 1950 -- we're just talking about some basics here.

The second and third are a lack of community and a lack of a platform.
So let's talk about standards.

How does an amateur learn about editorial methodologies?  How do they learn about emendations?  How do they learn about these kinds of things?

Well, how do they learn about any other subject?  How do they learn about dendrochronology if they're interested in measuring tree rings? 

Let's go check out Wikipedia!
Wikipedia has a problem for most subjects, which is that Wikipedia is filled with jargon.  If you look up dendrochronology, you don't really have a starting place, a "how to".  If you look up the letter X, you get this wonderful description of how 'X' works in Catalan orthography, but it presupposes you being familiar with the International Phonetic Alphabet, and knowing that that thing which looks like an integral sign is actually the 'sh' sound.

Now if amateurs are trying to do research on scholarly editing and documentary editing in Wikipedia, they have a different problem:
There's nothing there. There's no article on documentary editing.
There's no article on scholarly editing.

These practices are invisible to amateurs.
So if they can't find the material online that helps them understand how to encode and transcribe texts, where are they going to get it?

Well--going back to crowdsourcing--one example is by participation in crowdsourcing projects.  Crowdsourcing projects--yes, they are a source of labor; yes they are a way to do outreach about your material--but they are a way to train the public in editing.  And they are training the public in editing whether that's the goal of the transcription project or not.  The problem is that the teacher in this school is the transcription software--is the transcription website.

This means that the people who are teaching the public about transcription--the people who are teaching the public about editing--are people like me: developers.

So, how do developers learn about transcription?

Well, sometimes, as Paul [Flemons] mentioned, we just wing it.  If we're lucky, we find out about TEI, and we read the TEI Guidelines, and we find out that there's so much editorial practice that's encoded in the TEI Guidelines that that's a huge resource.

If we happen to know the people in this room or the people who are meeting at the Association for Documentary Editing in Ann Arbor, we might discover traditional editorial resources like the Guide to Documentary Editing.  But that requires knowing that there's a term "Documentary Editing".

So what does that mean?  What that means is that people like me--developers with my level of knowledge or ignorance--are having a tremendous amount of influence on what the public is learning about editing.  And that influence does not just extend to projects that I run -- it extends to projects run by archives and other institutions using my software.  Because if an archive is trying to start a transcription project, and the archivist has no experience with scholarly editing, I say, "You should pick some transcription conventions.  You should decide how to encode this."  Their response is, "What do you think?  We've never done this before."  So I find myself giving advice on editing.
Okay, moving on.

The other thing that amateurs need is community.

Community is important because community allows you to collaborate.  Communities evaluate each [member's] work and say, This is good.  This is bad.  Communities teach each [member].  And communities create standards -- you don't just hang out on Flickr to share your photos -- you hang out on Flickr to learn to be a better photographer.  People there will tell you how to be a better photographer.

We have no amateur editing community for people who happen to have an attic full of documents and want to know what to do with them.
So communities create standards, and we know this.  Let me quote my esteemed co-panelist, Melissa Terras, who, in her interviews with the managers of online museum collections--non-institutional online "museums"--found that people are coming up with "intuitive metadata" standards of their own, without any knowledge of or reference to existing procedures for creating traditional archival metadata.
The last big problem is that there's currently no platform for someone who has an attic full of documents that they want to edit.  They can upload their scans to Flickr, but Flickr is a terrible platform for transcription.

There's no platform that will guide them through best practices of editing.

What's worse, if there were one, it would need a "killer feature" -- what Julia Flanders, describing the TAPAS project, calls a compelling reason for people to contribute their transcripts and do their editing on a platform that enforces rigor and has some level of permanence to it, rather than just slapping their transcripts up on a blog.
So, let's talk about the future.  In his proposal for this conference, Peter Robinson describes a utopia and a dystopia: a utopia in which textual scholars train the world in how to read documents, and a dystopia in which hordes of "well-meaning but ill-informed enthusiasts will strew the web willy-nilly with error-filled transcripts and annotations, burying good scholarship in rubbish."
This is what I think is the road to dystopia:
  1. Crowdsourcing tools ignore documentary editing methodologies.  If you're transcribing using the Transcribe Bentham tool, you learn about TEI -- you learn from a good school.  But almost none of the other crowdsourced transcription tools do that.  Many of them don't even give the administrator a place to specify transcription conventions for their users!
  2. As a result, the world remains ignorant of the work of scholarly editors, because we're not finding you online--because you're invisible on Wikipedia--and we're not going to learn about your work through crowdsourcing.
  3. So the public gets this attitude that, well, editing is easy -- type what you see.  Who needs an expert?  I think that's a little bit worrisome.
  4. The final thing--which, when I started working on this talk, was a sort of wild bogeyman--is the idea that new standards come into being without any reference whatsoever to the tradition of scholarly or documentary editing.
I thought that [idea] was kind of wild.  But, in March, an organization called the Family History Information Standards Organisation--which is backed by Ancestry.com, the Federation of Genealogical Societies, BrightSolid, and a bunch of other organizations--announced a call for papers for standards for genealogists and family historians to use -- sometimes for representing family trees, sometimes for representing source documents.
And, in May, Call for Papers Submission number sixty-nine, "A Transcription Notation for Genealogy", was submitted.
Let's take a look at it.

Here we have what looks like a fairly traditional print notation.  It's probably okay.
What's a little bit more interesting, though, is the bibliography.

Where is your work in this bibliography?  It's not there.

Where is the Guide to Documentary Editing?  It's not there.

So here's a new standard that was proposed the month before last.  Now, I hope to respond to this--when I get the time--and suggest a few things that I've learned from people like you.  But these standards are forming, and these standards may become what the public thinks of as standards for editing.
All right, so let's talk about the road to utopia.

The road to the utopia that Peter described runs, I think, partly through partnerships between amateurs and professionals: you get amateurs participating in projects that are well run -- projects that teach them useful things about editing and about how to encode manuscripts.

Similarly, you get professionals participating in the public conversation, so that your methodologies are visible.  Certainly your editions are visible, but that doesn't mean that editing is visible.  So maybe someone here wants to respond to that FHISO request, or maybe they just want to release guides to editing as Open Access.

As a result, amateurs produce higher-quality editions on their own, so that they're more useful for other researchers; so that they're verifiable.

And then, amateurs themselves become advocates -- not just for their material and the materials they're working on through crowdsourcing projects, but for editing as a discipline.

So that's what I think is the road to utopia.
So what about the past?

Back in Shillingsburg's "Dank Cellar" paper, he describes the problems with the e-texts he's seeing, and he really encourages scholarly editors not to worry about them -- to disengage -- [and] instead to focus on developing methodologies--and again, this is 2006--for creating digital editions.  Those, he says, aren't well understood yet.  Let's not get distracted by these [amateur] things -- let's focus on what's involved in making and distributing digital editions.

Is he still right?  I don't know.

Maybe--if we're in the post-digital age--it's time to re-engage.