- "Single-track methods": the document is transcribed only once (by a single contributor, or collaboratively by several contributors working on the same document).
  - "Open-ended community revision" (Wikipedia): users can keep modifying the transcribed text with no time limit. A change history makes it possible to revert to the previous version and to guard against vandalism.
  - "Fixed-term community revision" (Transcribe Bentham): suited to more traditional editing projects whose goal is to publish a "final version". Once a transcription reaches an acceptable level, validated by experts, it is closed and published.
  - "Community-controlled revision workflows" (Wikisource): the transcription is considered a "final version" not because experts have approved it, but because it has passed through a collaborative workflow of correction, revision, and validation.
  - "Transcriptions with 'known-bad' insertions before proofreading": in a first phase, contributors are invited to transcribe. Other contributors then review the transcription against the original text. To make sure that this second reading is really carried out, errors are deliberately added to the text: if all the "fake errors" are corrected, the system infers that the "real errors" must have been corrected too.
  - "Single-keying with expert review": once a contributor has produced a transcription, it is accepted or rejected by an expert (either a professional from the institution behind the project or a selected contributor). If the transcription is rejected, it is either resubmitted for correction or corrected by the expert and validated.
- "Multi-track methods": these methods are particularly suited to work on structured data or micro-tasks. The same source image is presented to several contributors, who each transcribe it from scratch. Generally, contributors do not know whether they are the first to work on the image or whether other transcriptions have already been submitted. The data collected in this way is then compared automatically.
  - "Triple-keying with voting" (Old Weather, ReCAPTCHA): the image is presented to 3 contributors and the majority wins (Old Weather initially showed each image to 10 contributors, but found that accuracy was substantially the same with 3 contributors as with 10).
  - "Double-keying with expert reconciliation": the same item is presented to two contributors, and if they disagree, an expert decides.
  - "Double-keying with emergent community-expert reconciliation" (FamilySearch Indexing): the method is almost the same as the previous one, except that the expert who decides between two divergent transcriptions is a contributor who has been promoted to reconciler on the basis of an automatic analysis of their contributions (volume, accuracy).
  - "Double-keying with N-keyed run-off votes": if the two contributors disagree, the item is offered again to a new pair or trio of users.
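The multi-track comparisons in this typology are simple enough to sketch in a few lines. What follows is only a minimal illustration of "triple-keying with voting" and "double-keying with expert reconciliation", not the code of Old Weather, FamilySearch Indexing, or any other project. The function names, the whitespace normalization, and the `reconcile` callback are assumptions made for the example.

```python
from collections import Counter
from typing import Callable, Optional


def normalize(text: str) -> str:
    """Collapse whitespace so spacing differences do not count as disagreement."""
    return " ".join(text.split())


def vote(transcriptions: list[str]) -> Optional[str]:
    """Return the reading keyed by a strict majority of contributors, else None."""
    if not transcriptions:
        return None
    counts = Counter(normalize(t) for t in transcriptions)
    reading, count = counts.most_common(1)[0]
    return reading if count > len(transcriptions) / 2 else None


def double_key(first: str, second: str,
               reconcile: Callable[[str, str], str]) -> str:
    """Accept two matching transcriptions; otherwise ask a reconciler to decide.

    `reconcile` is a hypothetical stand-in for the expert (or promoted
    contributor) who chooses between two divergent readings.
    """
    a, b = normalize(first), normalize(second)
    return a if a == b else reconcile(a, b)


# Three contributors key the same barometer reading; two of them agree.
print(vote(["29.84", "29.84 ", "29.34"]))  # 29.84
# No majority: the item would need another round or a reconciler.
print(vote(["29.84", "29.34", "20.84"]))   # None
```

A run-off variant would call `vote` again on a fresh set of transcriptions whenever it returns `None`.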
Tuesday, May 14, 2013
Typology of Quality Control Methods in Crowdsourcing Projects
A translation of my 2012-03-05 post "Quality Control for Crowdsourced Transcription" which appeared in "Etat de l’art en matière de Crowdsourcing dans les bibliothèques numériques" by Moirez, Moreaux, and Josse (2013), reproduced for Francophone readers:
Monday, April 29, 2013
Itinera Nova in the World(s) of Crowdsourcing and TEI
On April 25, 2013, I presented this talk at the International Colloquium Itinera Nova in Leuven, Belgium. It was a fantastic experience, which I plan to post (and speak) more about, but I wanted to get my slides and transcript online as soon as possible.
Abstract: Crowdsourcing for cultural heritage material has become increasingly popular over the last decade, but manuscript transcription has become the most actively studied and widely discussed crowdsourcing activity over the last four years. However, of the thirty collaborative transcription tools which have been developed since 2005, only a handful attempt to support the Text Encoding Initiative (TEI) standard first published in 1990. What accounts for the reluctance to adopt editorial best practices, and what is the way forward for crowdsourced transcription and community edition? This talk will draw on interviews with the organizers behind Transcribe Bentham, MoM-CA, the Papyrological Editor, and T-PEN as well as the speaker's own experience working with transcription projects to situate Itinera Nova within the world of crowdsourced transcription and suggest that Itinera Nova's approach to mark-up may represent a pragmatic future for public editions.
I'd like to talk about Itinera Nova within the world of crowdsourced transcription tools, which means that I need to talk a little bit about crowdsourced transcription tools themselves, and their history, and the new things that Itinera Nova brings.
Crowdsourced transcription has actually been around for a long time. Starting in the 1990s we see a number of what are called "offline" projects. This is before the term crowdsourcing was invented.
Really the modern era of crowdsourced transcription begins about eight years ago. There are a number of projects that begin development in 2005. They are released (even though they've been in development for a while) starting around 2006. FamilySearch Indexing is, again, a genealogy system primarily concerned with records of genealogical interest which are tabular. It is put up by the Mormon Church.
Then things start to change a little bit. In 2008, I publish FromThePage, which is not designed for genealogy records per se -- rather it's designed for 19th and 20th century diaries and letters. (So here we have more complex textual documents.) Also in 2008, Wikisource--which had been a development of Wikipedia to put primary sources online--starts using a transcription tool. But initially, they're not using it for manuscripts because of policy in the English, French, and Spanish language Wikisources. The only people using it for manuscripts are the German Wikisource community, which has always been slightly separate. So they start transcribing free-form textual material like war journals [ed: memoirs] and letters. But again, we have a departure from the genealogy world.
In 2009, the North American Bird Phenology Program starts transcribing bird observations. So in the 1880s you had amateur bird-watchers who would go into the field and they would record their sightings of certain ducks, or geese, or things like that, and they would record the location and the birds they had observed. So we have this huge database of the presences of species throughout North America that is all on index cards. And as the climate changes and habitats change, those species are no longer there. So scientists who want to study bird migration and climate change need access to these. But they're hand-written on 250,000 index cards, so they need to be transformed. So that requires transcription, also by volunteers. [ed: The correct number of cards is over 6 million, according to Jessica Zelt's "Phenology Program (BPP): Reviving a Historic Program in the Digital Era"]
2010 is the year that crowdsourced transcription really gets big. The first big development is the Old Weather project, which comes out of the Citizen Science Alliance and the Zooniverse team that got started with GalaxyZoo. The problem with studying climate change isn't knowing what the climate is like now. It is very easy to point a weather satellite at the South Pacific right now. The problem is that you can't point a weather satellite at the South Pacific in 1911. Fortunately, in many of the world's navies, the officer of the watch would, every four hours, record the barometric pressure, the temperature, the wind speed and direction, the latitude and the longitude in the ships' logs. So all we have to do is type up every weather observation for all the navies' ships, and suddenly we know what the climate was like. Well, they've actually succeeded at this point -- in 2012 they finished transcribing all the British Royal Navy's ships' log weather observations during World War I. So this has been very successful -- it's a monumental effort: they have over six hundred thousand registered accounts--not all of those are active, but they have a very large number of volunteers.
Also in 2010 in the UK, Transcribe Bentham goes live. (We'll talk a lot more about this -- it's a very well documented project.) This is a project to transcribe the notes and papers of the utilitarian philosopher Jeremy Bentham. It's very interesting technically, but it was also very successful at drawing attention to the world of crowdsourced transcription.
In 2011, the Center for History and New Media at George Mason University in northern Virginia publishes the Papers of the United States War Department, and builds a tool called Scripto that plugs into it. Now this is primarily of interest to military and social historians, but again we're getting away from the world of genealogy, we're getting away from the world of individual tabular records, and we're getting into dealing with documents.
Once we get there, we have a tension. And this is a pretty common tension. There's an institutional tension, in that editing of documents has historically been done by professionals, and amateur editions have very bad reputations. Well now we're asking volunteers to transcribe. And there's a big tension between, well how do volunteers deal with this [process], do we trust volunteers? Wouldn't it be better just to give us more money to hire more professionals? So there's a tension there.
There's another tension that I want to get into here, since today is the technical track, and that's the difference between easy tools and powerful tools, and [the question of] making powerful tools easy to use. This is common to all technology--not just software, and certainly not just crowdsourced transcription--but it's new because this is the first time we're asking people to do these sorts of transcription projects.
Historically these professional [projects] have been done using mark-up to indicate deletions or abbreviations or things like that.
So there's this fear: what happens when you take amateurs and add mark-up?
Well, what is going to happen? Well, one solution--and it's a solution that I'm distressed to say is becoming more and more popular in the United States--is to get rid of the mark-up, and to say, well, let's just ask them to type plain text.
There's a problem with this. Which is that giving users power to represent what they see--to do the tasks that we're asking them to do--enables them. Lack of power frustrates them. And when you're asking people to transcribe documents that are even remotely complex, mark-up is power.
So I'm going to tell a little story about scrambled eggs. These are not the scrambled eggs that I ate this morning--which were delicious by the way--but they're very similar.
I'm going to pick on my friends at the New York Public Library, who in 2011 launched the "What's on the Menu?" project. They have an enormous collection of menus from around the world, and they want to track the culinary history of the world as dishes originate in one spot and move to other locations, the change in dishes--when did anchovies become popular? Why are they no longer popular?--things like that. So they're asking users to transcribe all of these menu items. They developed a very elegant and simple UI. This UI did not involve mark-up; this is plain-text. In fact--I'm going to get over here and read this--if you look at this instruction, this is almost stripped text: "Please type the text of the indicated dish exactly as it appears. Don't worry about accents."
Well, this may not be a problem for Americans, but it turns out that some of their menus are in languages that contain things that American developers might consider accents. This is a menu that was published on their site in 2011. They sent out an appeal asking, "can anyone read Sütterlin or old German Kurrentschrift"? I saw this and I went over to a chat channel for people who are discussing German and the German language, because I knew that there were some people familiar with German paleography there, and I wanted to try it out.
So the transcribers are going through and they're transcribing things, and they get to this entry: Rühreier. All right, let's transcribe that without accents. So they type in what they see. Rühreier is scrambled eggs. And what they type is converted to "Ruhreier", which are... eggs from the Ruhrgebiet? I don't know? This is not a dish. I'm not familiar with German cuisine, but I don't think that the Ruhr valley is famous for its eggs.
And this is incredibly frustrating! We see in the chat room logs: "Man, I can't get rid of 'Ruhreier' and this (all-capital) 'OMELETTE'! What's going on? Is someone adding these back? Can you try to change 'Ruhreier' to 'Rühreier'? It keeps going back!"
So we have this frustration. We have this potential to lose users when we abandon mark-up; when we don't give them the tools to do the job that we're asking them to do.
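The talk does not say how the menu site actually removed the umlaut, so the following is only a sketch of one common way that "don't worry about accents" gets implemented: decompose each character with Unicode normalization and drop the combining marks. The function name `strip_accents` is made up for this example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The umlaut is treated as a removable "accent", so the dish changes.
print(strip_accents("Rühreier"))  # Ruhreier
```

A filter like this cannot tell a decorative accent from a letter that changes the word, which is why the volunteers' corrections kept reverting.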
Okay. Let's shift gears and talk about a different world. This is the world of TEI, the Text Encoding Initiative. It's regarded as the ultimate in mark-up -- Manfred [Thaller] mentioned it some time earlier. It's been a standard since 1990, and it's ubiquitous in the world of scholarly editing.
Remember, up until recently, all scholarly editing was done by professionals. These professionals were using offline tools to edit this XML which Manfred described as a "labyrinth of angle brackets." It was never really designed to be hand-edited, but that's what we're doing.
And because it's ubiquitous and because it's old, there's a perception among at least some scholars, some editors, that this is just a 'boring old standard'. I have a colleague who did a set of interviews with scholars about evaluating digital scholarship, and not all but some of the responses she got when she brought up TEI were "TEI? Oh, that's just for data entry."
Well, not quite. TEI has some strengths. It is an incredibly powerful data model. The people who are doing this--these professionals who have been working with manuscripts for decades--they've developed very sophisticated ways of modeling additions to texts, deletions to texts, personal names, foreign terms -- all sorts of ways of marking this up.
It has great tools for presentation and analysis. Notice I didn't say transcription.
And it has a very active community, and that community is doing some really exciting things.
I want to use just one example of something that has only been developed in the last four years. It's a module that was created for TEI called the Genetic Edition module. A "genetic edition" is the idea of studying a text as it changes -- studying the changes that an author has made as they cross through sections and create new sections, or over-write pieces.
So it's very sophisticated, and I want to show you the sorts of things you can do [with it] by demonstrating an example of one of these presentation tools by Elena Pierazzo and Julie Andre. Elena's at King's College London, and they developed this last year.
This is a draft of--I believe it's Proust's À la recherche du temps perdu--unfortunately I can't see up there. But as you can see, this is a very complicated document. The author has struck through sections and over-written them. He's indicated parts moved. He's even -- if you look over here -- he's pasted on an extra page to the bottom of this document. So if you can transcribe this to indicate those changes, then you can visualize them.
[Demo screenshots from the Proust Prototype.] And as you slide, you see transcripts appear on the page in the order that they're created,
And in the order that they're deleted even.
There's even rotation and stuff --
It's just a brilliant visualization!
So this is the kind of thing that you can do with this powerful data model.
But how was that encoded? How did you get there?
Well, in this case, this is an extension to that thousand-page book. It's only about fifty pages long, printed, and it contains individual sets of guidelines. In this case, this is how Henrik Ibsen clarified a letter. In order to encode this, you use this
So this is incredibly complex. So if TEI is powerful; and if, as it gets more complex, it becomes harder to hand-encode; and as we start inviting members of the public and amateurs to participate in this work, how are we going to resolve this?
If there's a fear about combining amateurs and mark-up, what do we do when we combine amateurs with TEI? This is panic!
And it is very rarely attempted. I maintain a directory of crowdsourced transcription tools, with multiple projects per tool. And of the 29 projects in this directory, only 7 claim to support TEI.
One of them is Itinera Nova. I found out about this when I was preparing a presentation for the TEI conference last year, in which I interviewed people running projects doing this crowdsourcing, and found out about their experience of users trying to encode in TEI, and asked, "Do you know anyone else?"
And that's how I found out about Itinera Nova, which is unfortunately not very well known outside of Belgium. This is something that I hope to be part of correcting, because you have a hidden gem here -- you really do. It is amazing.
So how do you support TEI? Well, one approach--the most common approach--is to say we'll have our users enter TEI, but we'll give them help. We'll create buttons that add tags, or menus that add tags. This has been the approach taken by T-PEN (created by the Center for Digital Theology out of Saint Louis University), and a project associated with them, the Carolingian Canon Law Project. It's also the approach taken by Transcribe Bentham with their TEI toolbar. Menus are an alternative, but essentially they do the same thing -- they're a way of keeping users from typing angle brackets. So the Virtuelles deutsches Urkundennetzwerk is one of those, as well as the Papyrological Editor which is used by scholars studying Greek papyri.
So how well does that work? You provide users with buttons that add tags to their text. Here's an example from Transcribe Bentham.
Here's an example from Monasterium. And the results are still very complicated. The presentation here is hard. It's hard to read; it's hard to work with.
That does not mean that amateurs cannot do it at all! Certainly the experience of Transcribe Bentham proves that amateurs can perform to the same level as any professional transcriber, using these tools and coding these manuscripts, even without the background.
But there are limitations. One limitation is that users outgrow buttons. In Transcribe Bentham, [the most active] users eventually just started typing the angle brackets themselves -- they returned to that labyrinth of angle brackets of TEI tags.
Another problem is more interesting to me, which is when users ignore buttons. Here we have one editor who's dealing with German charters, who uses these double-pipes instead of the line break tag, because this is what he was used to from print. This speaks to something very interesting, which is that we have users who are used to their own formats, they're used to their own languages for mark-up, they're used to their own notations from print editions that they have either read or created themselves. And by asking them to switch over to this style of tagging, we're asking them not just to learn something new, but also to abandon what they may already know.
And, frankly, it's really hard to figure out which buttons [to support]. Abigail Firey of the Carolingian Canon Law Project talks about how when they were designing their interface, they had 67 buttons. This is very hard to navigate, and the users would just give up and start typing angle brackets instead, because buttons aren't a magic solution.
This is where Itinera Nova comes in. The "intermediate notation" that Professor Thaller was talking about is quite clear-cut, and it maps well to the print notations that volunteers are already used to.
And what's interesting about this, and what many people may not realize, is that Itinera Nova--despite having a very clear, non-TEI interface--has full TEI under the hood.
Everything is persisted in this TEI database, so the kinds of complex analysis that we talked about earlier--not necessarily the Proust genetic editions, but this kind of thing--is possible with the data that's being created. It's not idiosyncratic.
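To make the idea of a print-style notation on the surface and TEI underneath concrete, here is a minimal sketch of such a mapping. It is not Itinera Nova's actual notation or code, which the talk does not spell out. The double pipe for a line break comes from the charter editor's habit mentioned earlier; the `[[...]]` deletion notation and the function name `to_tei` are hypothetical. `<lb/>`, `<del>`, and `<p>` are standard TEI elements.

```python
import re
from xml.sax.saxutils import escape


def to_tei(transcription: str) -> str:
    """Convert a print-style transcription into a small TEI fragment."""
    xml = escape(transcription)  # protect any &, < or > the volunteer typed
    # Hypothetical notation: [[text]] marks a deletion in the manuscript.
    xml = re.sub(r"\[\[(.+?)\]\]", r"<del>\1</del>", xml)
    # A double pipe marks a line break, as in many print editions.
    xml = re.sub(r"\s*\|\|\s*", "<lb/>", xml)
    return f"<p>{xml}</p>"


print(to_tei("received [[ten]] twelve shillings || from the widow"))
# <p>received <del>ten</del> twelve shillings<lb/>from the widow</p>
```

The volunteer never types an angle bracket, yet what gets stored can be read by the same presentation and analysis tools as hand-encoded TEI.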
So as a result, I really think that in this, Itinera Nova points the way to the future. Which is to abandon this idea that TEI is just for data entry, or that amateurs cannot do mark-up. Both of those ideas are bogus! Instead, let's say: use TEI for the data model; for the presentation, so we have these beautiful sliders. And whatever else will get created out of the annotation tool, out of the transcription tool, let's use that for the data model and for the presentation. But let's consider hooking up these--I don't want to say "easier"--but these more straightforward, these more traditional user interfaces [for transcription].
This is something that I think is really the way forward for crowdsourced transcription. It is being done right now by the Papyrological Editor, it has been done by Itinera Nova for a long time. And there are now some incipient projects to move forward with this. One of these is a new project at the University of Maryland's Maryland Institute for Technology in the Humanities, the Skylark project, in which they are taking those same transcription tools that were used for Old Weather to allow people to mark up and transcribe portions of an image of a literary text that has been heavily annotated--like that Proust--to create data using the data model that can be viewed with tools like the Proust viewer.
So this is, I think, the technical contribution that Itinera Nova is making. Obviously there are a lot more contributions--I mean I'm absolutely stunned by the interaction with the volunteer community that's happening here--but I'm staying on the technical track, so I'm not going to get into that.
Are there any questions? No? Keep up the great work -- you folks are amazing.
The "offline" projects that started in the 1990s include:
- A Dutch initiative, Van Papier naar Digitaal, which is transcribing primarily genealogy records.
- FreeBMD, FreeREG, and FreeCEN in the UK, transcribing church registers and census records.
- Demogen in Belgium -- I don't know a lot about this -- it appears to be dead right now, but if anyone can tell me more about this, I'd like to talk after this.
- Archivalier Online in Denmark, also transcribing census records.
- A series of projects by the Western Michigan Genealogy Society to transcribe local census records and also to create indexes of obituaries.
Really the modern era of crowdsourced transcription begins about eight years ago. There are a number of projects that begin development in 2005. They are released (even though they've been in development for a while) starting around 2006. Familysearch Indexing is, again, a genealogy system primarily concerned with records of genealogical interest which are tabular. It is put up by the Mormon Church.
Then things start to change a little bit. In 2008, I publish FromThePage, which is not designed for genealogy records per se -- rather it's designed for 19th and 20th century diaries and letters. (So here we have more complex textual documents.) Also in 2008, Wikisource--which had been a development of Wikipedia to put primary sources online--start using a transcription tool. But initially, they're not using it for manuscripts because of policy in the English, French, and Spanish language Wikisources. The only people using it for manuscripts are the German Wikisource community, which has always been slightly separate. So they start transcribing free-form textual material like war journals [ed: memoirs] and letters. But again, we have a departure from the genealogy world.
In 2009, the North American Bird Phenology Program starts transcribing bird observations. So in the 1880s you had amateur bird-watchers who would go into the field and they would record their sightings of certain ducks, or geese, or things like that, and they would record the location and the birds they had observed. So we have this huge database of the presences of species throughout North America that is all on index cards. And as the climate changes and habitats change, those species are no longer there. So scientists who want to study bird migration and climate change need access to these. But they're hand-written on 250,000 index cards, so they need to be transformed. So that requires transcription, also by volunteers. [ed: The correct number of cards is over 6 million, according to Jessica Zelt's "Phenology Program (BPP): Reviving a Historic Program in the Digital Era"]
2010 is the year that crowdsourced transcription really gets big. The first big development is the Old Weather project, which comes out of the Citizen Science Alliance and the Zooniverse team that got started with GalaxyZoo. The problem with studying climate change isn't knowing what the climate is like now. It is very easy to point a weather satellite at the South Pacific right now. The problem is that you can't point a weather satellite at the South Pacific in 1911. Fortunately, in many of the world's navies, the officer of the watch would, every four hours, record the barometric pressure, the temperature, the wind speed and direction, the latitude and the longitude in the ships logs. So all we have to do is type up every weather observation for all the navies' ships, and suddenly we know what the climate was like. Well, they've actually succeeded at this point -- in 2012 they finished transcribing all the British Royal Navy's ships log weather observations during World War I. So this has been very successful -- it's a monumental effort: they have over six hundred thousand registered accounts--not all of those are active, but they have a very large number of volunteers.
Also in 2010 in the UK, Transcribe Bentham goes live. (We'll talk a lot more about this -- it's a very well documented project.) This is a project to transcribe the notes and papers of the utilitarian philosopher Jeremy Bentham. It's very interesting technically, but it was also very successful drawing attention to the world of crowdsourced transcription.
In 2011, the Center for History and New Media at George Mason University in northern Virginia published the Papers of the United States War Department, and builds a tool called Scripto that plugs into it. Now this is primarily of interest to military and social historians, but again we're getting away from the world of genealogy, we're getting away from the world of individual tabular records, and we're getting into dealing with documents.
Once we get there, we have a tension. And this is a pretty common tension. There's an institutional tension, in that editing of documents has historically been done by professionals, and amateur editions have very bad reputations. Well now we're asking volunteers to transcribe. And there's a big tension between, well how do volunteers deal with this [process], do we trust volunteers? Wouldn't it be better just to give us more money to hire more professionals? So there's a tension there.
There's another tension that I want to get into here, since today is the technical track, and that's the difference between easy tools and powerful tools, and [the question of] making powerful tools easy to use. This is common to all technology--not just software, and certainly not just crowdsourced transcription--but it's new because this is the first time we're asking people to do these sorts of transcription projects.
Historically these professional [projects] have been done using mark-up to indicate deletions or abbreviations or things like that.
So there's this fear: what happens when you take amateurs and add mark-up?
Well, what is going to happen? Well, one solution--and it's a solution that I'm distressed to say is becoming more and more popular in the United States--is to get rid of the mark-up, and to say, well, let's just ask them to type plain text.
There's a problem with this. Which is that giving users power to represent what they see--to do the tasks that we're asking them to do--enables them. Lack of power frustrates them. And when you're asking people to transcribe documents that are even remotely complex, mark-up is power.
So I'm going to tell a little story about scrambled eggs. These are not the scrambled eggs that I ate this morning--which were delicious by the way--but they're very similar.
I'm going to pick on my friends at the New York Public Library, who in 2011 launched the "What's on the Menu?" project. They have an enormous collection of menus from around the world, and they want to track to culinary history of the world as dishes originate in one spot and move to other locations, the change in dishes--when did anchovies become popular? Why are they no longer popular?--things like that. So they're asking users to transcribe all of these menu items. They developed a very elegant and simple UI. This UI did not involve mark-up; this is plain-text. In fact--I'm going to get over here and read this--if you look at this instruction, this is almost stripped text: "Please type the text of the indicated dish exactly as it appears. Don't worry about accents."
Well, this may not be a problem for Americans, but it turns out that some of their menus are in languages that contain things that American developers might consider accents. This is a menu that was published on their site in 2011. They sent out an appeal asking, "can anyone read Sütterlin or old German Kurrentschrift"? I saw this and I went over to a chat channel for people who are discussing German and the German language, because I knew that there were some people familiar with German paleography there, and I wanted to try it out.
So the transcribers are going through and they're transcribing things, and they get to this entry: Rühreier. All right, let's transcribe that without accents. So they type in what they see. Rühreier is scrambled eggs. And what they type is converted to "Ruhreier", which are... eggs from the Ruhrgebiet? I don't know? This is not a dish. I'm not familiar with German cuisine, but I don't think that the Ruhr valley is famous for its eggs.
And this is incredibly frustrating! We see in the chat room logs: "Man, I can't get rid of 'Ruhreier' and this (all-capital) 'OMELETTE'! What's going on? Is someone adding these back? Can you try to change "Ruhreier" to "Rühreier"? It keeps going back!"
So we have this frustration. We have this potential to lose users when we abandon mark-up; when we don't give them the tools to do the job that we're asking them to do.
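The effect in the story is easy to reproduce. The following is only a minimal sketch of one common way to strip accents (Unicode decomposition followed by removal of combining marks); how the "What's on the Menu?" site actually normalized its input is not described here, and `strip_accents` is a hypothetical name.

```ruby
# Hypothetical normalizer, for illustration only: NOT the NYPL implementation.
def strip_accents(text)
  # Decompose each character (ü becomes u + combining diaeresis),
  # then drop the combining marks that carried the accent.
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

puts strip_accents('Rühreier')   # the umlaut is lost, and with it the dish
```

Once the text has been folded this way, the transcriber has no means of saying that the original had an umlaut, which is exactly the loss of power described above.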
Okay. Let's shift gears and talk about a different world. This is the world of TEI, the Text Encoding Initiative. It's regarded as the ultimate in mark-up -- Manfred [Thaller] mentioned it some time earlier. It's been a standard since 1990, and it's ubiquitous in the world of scholarly editing.
Remember, up until recently, all scholarly editing was done by professionals. These professionals were using offline tools to edit this XML which Manfred described as a "labyrinth of angle brackets." It was never really designed to be hand-edited, but that's what we're doing.
And because it's ubiquitous and because it's old, there's a perception among at least some scholars, some editors, that this is just a 'boring old standard'. I have a colleague who did a set of interviews with scholars about evaluating digital scholarship, and not all but some of the responses she got when she brought up TEI were "TEI? Oh, that's just for data entry."
Well, not quite. TEI has some strengths. It is an incredibly powerful data model. The people who are doing this--these professionals who have been working with manuscripts for decades--they've developed very sophisticated ways of modeling additions to texts, deletions to texts, personal names, foreign terms -- all sorts of ways of marking this up.
It has great tools for presentation and analysis. Notice I didn't say transcription.
And it has a very active community, and that community is doing some really exciting things.
I want to use just one example of something that has only been developed in the last four years. It's a module that was created for TEI called the Genetic Edition module. A "genetic edition" is the idea of studying a text as it changes -- studying the changes that an author has made as they cross through sections and created new sections, or over-written pieces.
So it's very sophisticated, and I want to show you the sorts of things you can do [with it] by demonstrating an example of one of these presentation tools by Elena Pierazzo and Julie Andre. Elena's at King's College London, and they developed this last year.
This is a draft of--I believe it's Proust's Recherches du Temps Perdu--unfortunately I can't see up there. But as you can see, this is a very complicated document. The author has struck through sections and over-written them. He's indicated parts moved. He's even -- if you look over here -- he's pasted on an extra page to the bottom of this document. So if you can transcribe this to indicate those changes, then you can visualize them.
[Demo screenshots from the Proust Prototype.] And as you slide, you see transcripts appear on the page in the order that they're created,
And in the order that they're deleted even.
There's even rotation and stuff --
It's just a brilliant visualization!
So this is the kind of thing that you can do with this powerful data model.
But how was that encoded? How did you get there?
Well, in this case, this is an extension to that thousand-page book. It's only about fifty pages long, printed, and it contains individual sets of guidelines. In this case, this is how Henrik Ibsen clarified a letter. In order to encode this, you use this rewrite tag with a cause...
And this is that forest of angle brackets; this is very hard. And this is only one item from this document of instructions, which was small enough that I could cut it out and fit it on a slide. So this is incredibly complex. So if TEI is powerful; and if, as it gets more complex, it becomes harder to hand-encode; and as we start inviting members of the public and amateurs to participate in this work, how are we going to resolve this?
If there's a fear about combining amateurs and mark-up, what do we do when we combine amateurs with TEI? This is panic!
And it is very rarely attempted. I maintain a directory of crowdsourced transcription tools, with multiple projects per tool. And of the 29 projects in this directory, only 7 claim to support TEI.
One of them is Itinera Nova. I found out about this when I was preparing a presentation for the TEI conference last year, in which I interviewed people running projects doing this crowdsourcing, and found out about their experience of users trying to encode in TEI, and asked, "Do you know anyone else?"
And that's how I found out about Itinera Nova, which is unfortunately not very well known outside of Belgium. This is something that I hope to be part of correcting, because you have a hidden gem here -- you really do. It is amazing.
So how do you support TEI? Well, one approach--the most common approach--is to say we'll have our users enter TEI, but we'll give them help. We'll create buttons that add tags, or menus that add tags. This has been the approach taken by T-PEN (created by the Center for Digital Theology out of Saint Louis University), and a project associated with them, the Carolingian Canon Law Project. It's also the approach taken by Transcribe Bentham with their TEI toolbar. Menus are an alternative, but essentially they do the same thing -- they're a way of keeping users from typing angle brackets. So the Virtuelles deutsches Urkundennetzwerk is one of those, as well as the Papyrological Editor which is used by scholars studying Greek papyri.
So how well does that work? You provide users with buttons that add tags to their text. Here's an example from Transcribe Bentham.
Here's an example from Monasterium. And the results are still very complicated. The presentation here is hard. It's hard to read; it's hard to work with.
That does not mean that amateurs cannot do it at all! Certainly the experience of Transcribe Bentham proves that amateurs can perform to the same level as any professional transcriber, using these tools and coding these manuscripts, even without the background.
But there are limitations. One limitation is that users outgrow buttons. In Transcribe Bentham, [the most active] users eventually just started typing the angle brackets themselves -- they returned to that labyrinth of angle brackets of TEI tags.
Another problem is more interesting to me, which is when users ignore buttons. Here we have one editor who's dealing with German charters, who uses these double-pipes instead of the line break tag, because this is what he was used to from print. This speaks to something very interesting, which is that we have users who are used to their own formats, they're used to their own languages for mark-up, they're used to their own notations from print editions that they have either read or created themselves. And by asking them to switch over to this style of tagging, we're asking them not just to learn something new, but also to abandon what they may already know.
And, frankly, it's really hard to figure out which buttons [to support]. Abigail Firey of the Carolingian Canon Law Project talks about how when they were designing their interface, they had 67 buttons. This is very hard to navigate, and the users would just give up and start typing angle brackets instead, because buttons aren't a magic solution.
This is where Itinera Nova comes in. The "intermediate notation" that Professor Thaller was talking about is quite clear-cut, and it maps well to the print notations that volunteers are already used to.
And what's interesting about this is that what many people may not realize is that Itinera Nova--despite having a very clear, non-TEI interface--has full TEI under the hood.
Everything is persisted in this TEI database, so the kinds of complex analysis that we talked about earlier--not necessarily the Proust genetic editions, but this kind of thing--is possible with the data that's being created. It's not idiosyncratic.
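To make the idea of "a plain interface with TEI under the hood" concrete, here is a minimal sketch of converting a print-style notation into TEI elements at save time. The two conventions used (`||` for a line break, square brackets for editorially supplied text) and the names `NOTATION_TO_TEI`, `escape_xml` and `to_tei` are assumptions chosen for illustration; they are not Itinera Nova's actual notation or code. `<lb/>` and `<supplied>` are standard TEI elements.

```ruby
# Hypothetical print-style conventions mapped to TEI elements.
# These rules are illustrative assumptions, not Itinera Nova's real notation.
NOTATION_TO_TEI = [
  ['||', '<lb/>'],                              # "||" marks a line break
  [/\[([^\]]+)\]/, '<supplied>\1</supplied>']   # "[text]" marks text supplied by the editor
]

# Escape characters that are special in XML before any tags are added.
def escape_xml(text)
  text.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
end

# Apply each notation rule in turn to the escaped transcript.
def to_tei(transcript)
  NOTATION_TO_TEI.reduce(escape_xml(transcript)) do |xml, (pattern, replacement)|
    xml.gsub(pattern, replacement)
  end
end

puts to_tei('In nomine domini || amen [anno] 1432')
```

The volunteer keeps typing the notation they know from print editions, while what gets stored is ordinary TEI that other tools can read.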
So as a result, I really think that in this, Itinera Nova points the way to the future. Which is to abandon this idea that TEI is just for data entry, or that amateurs cannot do mark-up. Both of those ideas are bogus! Instead, let's say: use TEI for the data model; for the presentation, so we have these beautiful sliders. And whatever else will get created out of the annotation tool, out of the transcription tool, let's use that for the data model and for the presentation. But let's consider hooking up these--I don't want to say "easier"--but these more straightforward, these more traditional user interfaces [for transcription].
This is something that I think is really the way forward for crowdsourced transcription. It is being done right now by the Papyrological Editor, it has been done by Itinera Nova for a long time. And there are now some incipient projects to move forward with this. One of these is a new project at the University of Maryland, Maryland Institute for Technology in the Humanities, the Skylark project, in which they are taking those same transcription tools that were used for Old Weather to allow people to mark up and transcribe portions of an image of a literary text that has been heavily annotated--like that Proust--to create data using the data model that can be viewed with tools like the Proust viewer.
So this is, I think, the technical contribution that Itinera Nova is making. Obviously there are a lot more contributions--I mean I'm absolutely stunned by the interaction with the volunteer community that's happening here--but I'm staying on the technical track, so I'm not going to get into that.
Are there any questions? No? Keep up the great work -- you folks are amazing.
Labels: crowdsourcing, presentations, similar projects, tei
Tuesday, February 26, 2013
Ngoni Munyaradzi on Transcribe Bleek and Lloyd
Ngoni Munyaradzi is a Master's student in Computer Science at the University of Cape Town, South Africa, working on a research project on the transcription of the Digital Bleek and Lloyd collection. He kindly agreed to an interview over email, which I present below:
Your website does an excellent job explaining the background and motivation of Transcribe Bleek and Lloyd. Can you tell us more about the field notebooks you are transcribing?
The Digital Bleek and Lloyd Collection is composed of dictionaries, artwork and notebooks documenting stories about the earliest inhabitants of Southern Africa, the Bushman people. The notebooks were written by Wilhelm Bleek, his sister-in-law, Lucy Lloyd and Dorothea Bleek (Wilhelm's daughter) in the 19th century, with the help of a number of Bushmen people who were prisoners in the Western Cape region of South Africa at the time. The notebooks were recorded in the |Xam and !Kun languages and English translations of these languages are available in the notebooks.
Link to the collection: http://lloydbleekcollection.cs.uct.ac.za/
Correct me if I'm wrong, but it seems like at least in the case of |Xam, you are working with one of the only representatives of an extinct language. Are there any standard data models for these kinds of vocabularies/bilingual texts which you're using?
There are no complete models - the best known models are still only partial.
I suspect that I'm not alone in wondering why these Bushman people were prisoners during the writing of these texts. Can you tell us a bit more about the Bleek/Lloyd informants, or point us to resources on the subject?
The Bushman people were prisoners because of petty crimes and a grossly unfair colonial government. On the Bleek and Lloyd website there is a story on each contributor. There is information in various books on the subject as well, but I am not sure there is more that is known than what is on the website. See:
http://lloydbleekcollection.cs.uct.ac.za/xam.html
http://lloydbleekcollection.cs.uct.ac.za/kun.html
This is the first transcription project I'm aware of using the Bossa Crowd Create platform. What are the factors that led you to choose that platform and what's been your experience setting it up?
In 2011, when our project began, Bossa was the most mature open-source crowdsourcing framework available that was tailored for volunteer projects. Because of this, Bossa suited the project's requirements well. The alternative crowdsourcing frameworks available at the time used payment methods.
Setting up the Bossa framework was a relatively straightforward task. The documentation online is very thorough, with examples of how to set up test applications. I also got assistance from David Anderson, the developer of Bossa.
The Bushman writing system seems extremely complex with its special characters and multiple diacritics. I see that you are using LaTeX macros to encode these complexities. Why did you decide on LaTeX and what has been the user response to using that notation?
So the project is part of ongoing research related to the Bleek and Lloyd Collection within our Digital Libraries Laboratory at the University of Cape Town. Credit for developing the encoding tool goes to Kyle Williams. The reason he chose LaTeX was that using custom LaTeX macros allowed both the encoding and the visual rendering of the text to be solved in a single step. Developing a unique font for the Bushman script is something we might look at in the future!
Here's a link to a paper published on the encoding tool developed by Kyle Williams: http://link.springer.com/chapter/10.1007%2F978-3-642-24826-9_28
Overall the user feedback has been good, as most users are able to complete transcriptions using the LaTeX macros. We have gotten suggestions from users to use glyphs to encode the complexities. Currently the scope of my master's research project does not include that. There are talks in our research group to develop a unique font to represent the |Xam and !Kun languages, as this is not supported by Unicode.
User 1 Comment: "I think the palette handles the complexity of the character set very well. This material is inherently difficult to transcribe. The tool has, on the whole, been well thought out to meet this challenge. I think it needs to be improved in some ways, but considering the difficulties it is remarkably well done."
User 2 Comment: "VERY intuitive, after a few practice transcriptions. I actually enjoyed using the tool after a page was done."
This is incredibly useful. So far as I'm aware, yours is only the third crowdsourced transcription project that's surveyed users seriously (after the North American Bird Phenology Project and Transcribe Bentham). Do you have any advice on collecting user feedback at such an early stage?
Collecting user feedback in the early stages will tremendously help project administrators determine whether the setup of the project is easy to follow for participants. One can easily pick up any hindrances to user participation and address these early. From our project, I've found that participants can actually suggest very helpful ideas that will make the data collection process better.
Crowdsourced citizen science and cultural heritage projects have mostly been based in the USA, Northern Europe and Australia until recently -- in fact, yours is the first that I'm aware of originating in sub-Saharan Africa. I'd really like to know which projects inspired your work with Transcribe Bushman, and what your hopes are for crowdsourced transcription projects focusing on Africa?
Our work was mostly inspired by the success of GalaxyZoo at recruiting volunteers, and also by the Transcribe Bentham project, which explored the feasibility of volunteers performing transcription. I hope that more crowdsourced transcription projects will start up within Africa in the near future. What would be interesting is to see a transcription project for the Timbuktu manuscripts of Mali. Beyond transcription, I would like to see other researchers adopting crowdsourcing in fields of specialty within Africa.
Thanks so much for this interview. If people want to help out on the project, what's the best way for them to contribute?
Interested participants can simply:
- Create an account on the project website.
- Watch a 5 minute video tutorial on how to transcribe the Bushman languages.
- With that, you are ready to start transcribing pages.
Labels: crowdsourcing, interview, similar projects
Monday, February 25, 2013
Detecting Handwriting in OCR Text
This is my fourth and final post about the iDigBio Augmenting OCR Hackathon. Prior posts covered the hackathon itself, my presentation on preliminary results, and my results improving the OCR on entomology specimens. The other participants are slowly adding their results to the hackathon wiki, which I recommend checking back with (their efforts were much more impressive than mine).
[Image: Clearly handwritten: T=8, N=78% from terse and noisy OCR files.]
Let's say you have scanned a large number of cards and want to convert them from pixels into data. The cards--which may be bibliography cards, crime reports, or (in this case) labels for lichen specimens--have these important attributes:
- They contain structured data (e.g. title of book, author, call number, etc. for bibliographies) you want to extract, and
- They were part of a living database built over decades, so some cards are printed, some typewritten, some handwritten, and some with a mix of handwriting and type.
Since OCR doesn't work on handwriting, how do you know which images to route to the humans and which to process algorithmically? It's simple: any images that contain handwriting should go to the humans. Detecting the handwriting on the images is unfortunately not so simple.
I adopted a quick-and-dirty approach for the hackathon: if OCR of handwriting produces gibberish, why not send all the images through a simple pass of OCR and look in the resulting text files for representative gibberish? In my preliminary work, I pulled 1% of our sample dataset (all cards ending with "11") and classified them three ways:
- Visual inspection of the text files produced by the ABBYY OCR engine,
- Visual inspection of the text files produced by the Tesseract OCR engine, and
- Looking at the actual images themselves.

To my surprise, I was only able to correctly classify cards from OCR output 80% of the time -- a disappointing finding, since any program I produced to identify handwriting from OCR output could only be less accurate. More interesting was the difference between the kinds of files that ABBYY and Tesseract produced. Tesseract produced a lot more gibberish in general--including on card images that were entirely printed. ABBYY, on the other hand, scrubbed a lot of gibberish out of its results, including that which might be produced when it encountered handwriting.
This suggested an approach: look at both the "terse" results from ABBYY and the "noisy" results from Tesseract to see if I could improve my classification rate.
[Image: Easily classified as type-only, despite (non-characteristic) gibberish: T=0, N=0 from terse and noisy OCR files.]
But what does it mean to "look" at a file? I wrote a program to loop through each line of an OCR file and check for the kind of gibberish characteristic of OCR and handwriting. Inspecting the files reveals some common gibberish patterns, which we can sum up as regular expressions:
    # Patterns characteristic of OCR run over handwriting
    GARBAGE_REGEXEN = {
      'Four Dots' => /\.\.\.\./,
      'Five Non-Alphanumerics' => /\W\W\W\W\W/,
      'Isolated Euro Sign' => /\S€\D/,
      'Double "Low-Nine" Quotes' => /„/,
      'Anomalous Pound Sign' => /£\D/,
      'Caret' => /\^/,
      'Guillemets' => /[«»]/,
      'Double Slashes and Pipes' => /(\\\/)|(\/\\)|([\/\\]\||\|[\/\\])/,
      'Bizarre Capitalization' => /([A-Z][A-Z][a-z][a-z])|([a-z][a-z][A-Z][A-Z])|([A-LN-Z][a-z][A-Z])/,
      'Mixed Alphanumerics' => /(\w[^\s\w\.\-]\w).*(\w[^\s\w]\w)/
    }
However, some of these expressions match non-handwriting features like geographic coordinates or bar codes. Handling these requires a white list of regular expressions for gibberish we know not to be handwriting:
    # Patterns for gibberish-like text known not to be handwriting
    WHITELIST_REGEXEN = {
      'Four Caps' => /[A-Z]{4,}/,
      'Date' => /Date/,
      'Likely year' => /1[98]\d\d|2[01]\d\d/,
      'N.S.F.' => /N\.S\.F\.|Fund/,
      'Lat Lon' => /Lat|Lon/,
      'Old style Coordinates' => /\d\d°\s?\d\d['’]\s?[NW]/,
      'Old style Minutes' => /\d\d['’]\s?[NW]/,
      'Decimal Coordinates' => /\d\d°\s?[NW]/,
      'Distances' => /\d?\d(\.\d+)?\s?[mkf]/,
      'Caret within heading' => /[NEWS]\^s/,
      'Likely Barcode' => /[l1\|]{5,}/,
      'Blank Line' => /^\s+$/,
      'Guillemets as bad E' => /d«t|pav«aont/
    }
With these on hand, we can calculate a score for each file based on the number of occurrences of gibberish we find per line. That score can then be compared against a threshold to determine whether a file contains handwriting. Due to the noisiness of the Tesseract files, I found it most useful to calculate their score N as a percentage of non-blank lines, while the score for the terse files T worked best as a simple count of gibberish matches.
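The scoring step just described might be sketched as follows. This is not the code from the hackathon repository: the pattern tables are abbreviated to a handful of entries so the example runs on its own, the names `SAMPLE_GARBAGE`, `SAMPLE_WHITELIST`, `gibberish?` and `score` are hypothetical, and the rule that a whitelist match cancels a garbage match on the same line is an assumption about how the two lists interact.

```ruby
# Abbreviated stand-ins for the two pattern tables, so the sketch is self-contained.
SAMPLE_GARBAGE = {
  'Four Dots' => /\.\.\.\./,
  'Caret' => /\^/,
  'Five Non-Alphanumerics' => /\W\W\W\W\W/
}
SAMPLE_WHITELIST = {
  'Lat Lon' => /Lat|Lon/,
  'Likely Barcode' => /[l1\|]{5,}/
}

# Assumption: a line counts as gibberish when it matches any garbage pattern
# and no whitelist pattern.
def gibberish?(line)
  SAMPLE_GARBAGE.values.any? { |re| re.match?(line) } &&
    SAMPLE_WHITELIST.values.none? { |re| re.match?(line) }
end

# Returns [count, percent]: the number of gibberish lines (the style of score
# used for terse files, T) and that number as a percentage of non-blank lines
# (the style of score used for noisy files, N).
def score(text)
  lines = text.lines.reject { |line| line.strip.empty? }
  hits = lines.count { |line| gibberish?(line) }
  percent = lines.empty? ? 0.0 : 100.0 * hits / lines.size
  [hits, percent]
end

sample = "Lecanora ^ muralis ....\nLat 45 ^ N\n\nCollected 1931\n"
count, percent = score(sample)
puts "count=#{count} percent=#{percent.round(1)}"
```

In the sample, the first line is counted, the second is rescued by the 'Lat Lon' whitelist entry, the blank line is ignored, and the last line matches nothing.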
| Threshold | Correct | False Positives | False Negatives |
|---|---|---|---|
| T > 1 and N > 20% | 82% | 10 of 45 | 8 of 60 |
| T > 0 and N > 20% | 84% | 13 of 45 | 4 of 60 |
| T > 1 | 79% | 10 of 45 | 12 of 60 |
| N > 20% | 75% | 8 of 45 | 18 of 60 |
| N > 10% | 81% | 14 of 45 | 6 of 60 |
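Each row of the table above is just a pair of cut-offs applied to the two scores. As a minimal sketch (the method name `handwriting?` and its keyword parameters are hypothetical, with defaults mirroring the "T > 0 and N > 20%" row):

```ruby
# t: gibberish count from the terse OCR file
# n: gibberish percentage from the noisy OCR file
# A card is routed to humans only when both scores exceed their thresholds.
def handwriting?(t, n, t_min: 0, n_min: 20.0)
  t > t_min && n > n_min
end

puts handwriting?(8, 78.0)             # a card scoring T=8, N=78%
puts handwriting?(0, 10.0)             # a card scoring T=0, N=10%
puts handwriting?(1, 50.0, t_min: 1)   # the stricter "T > 1" cut-off
```

Keeping the thresholds as parameters makes it cheap to rerun the classification with different cut-offs and compare the false positive and false negative counts.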
[Image: One of the false negatives: T=0, N=10% from parsing terse and noisy text files.]
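One interesting thing about this approach is that adjusting the thresholds lets us tune the classifications for resources and desired quality. If our humans doing data entry are particularly expensive or impatient, raising the thresholds should ensure that they are only very rarely sent typed text. On the other hand, lowering the thresholds would increase the human workload while improving quality of the resulting text.
I'm really pleased with this result. The combined classifications are slightly better than I was able to accomplish by looking at the OCR myself. The experience of a volunteer presented with 56 images containing handwriting and 13 which don't may necessitate a "send to OCR" button in the user interface, but must be less frustrating than the unclassified ratio of 45 in 105 from the sample set. With a different distribution of handwriting-to-type in the dataset, the process might be very useful for extracting rare typed material from a mostly-handwritten set, or vice versa.
All of the datasets, code, and scored CSV files are in the iDigBio AOCR Hackathon's HandwritingDetection repository on GitHub.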
Friday, February 15, 2013
Results of the "Ocrocrop" Approach to Improving OCR
This project attempted to improve the quality of OCR applied to difficult
entomology images[*] by cropping labels from the images to run through OCR
separately. In order to identify labels on the image to crop, an initial,
'naive' pass of OCR was made over the whole image, generating both
- A) a set of rectangles on the image defined as word bounding boxes by the OCR engine, and
- B) a control OCR text file to be used for comparing the 'naive' model with the methodology.
I'll call this method "ocrocrop". (For more detail on method, see the transcript of my preliminary presentation.)
The results were encouraging. (See CSV file listing results for each file, and the directory containing "naive" output, annotated JPGs, and cleaned output files for each test.)
Of 80 files tested, 20 experienced a decrease in score (see Alex Thomson's scoring service), but most (14/20) of those were on OCR output below 10% accuracy in the first place, and the remainder were at or below 20% accuracy. So it is reasonable to say that the ocrocrop method only degraded the quality of texts that were unusable in the first place.
40 of the 80 files tested showed more promising results, with improvements of one to twenty percentage points -- in some cases only marginally improving unusable (below 10% accurate) outputs, but in many cases improving the scores more substantially (say from 25% to 35% in the case of EMEC609908_Stigmus_sp).
Most of the top quartile of results saw improvements on texts that were already scoring above 10% accuracy rates (16 of 20), so it appears that the effectiveness of the ocrocrop method is correlated to the quality of the naive input data -- garbage is degraded or only minimally improved, while OCR that is merely bad under the naive approach can be significantly improved.
The ocrocrop method saw the greatest improvement in cases where the naive OCR pass was effective at identifying word bounding boxes, but ineffective at translating their contents into words. Taking EMEC609928_Stigmus_sp, the case of greatest improvement (naive: 18.9%, ocrocrop: 70.5%), we see that all words on the labels except for the collector name were recognized as words (in purple), making the cropped label images (in blue) good representatives of the actual labels on the image.
The cropped image was more easily processed by our OCR engine, so that we may compare the naive version of the second label:
CALIF:Hunbo1dt Co. ;‘ ~ 3 m1.N' Garbervflle ,::f< '_- ' v—23~75 n.n1e:z.' 9 ._ ’
with the ocrocrop version of the second label:
CALIF:Humboldt Co. 3 mi.N Garberville V-23-76 R.Dietz,'
One of the problems with the OCR-based pre-processing which may be hidden by the scores is that many labels are entirely missed by the ocrocrop if the first, naive OCR pass failed to identify any words at all on the label. In cases such as EMEC609651_Cerceris_completa, the determination label was not cropped (indicated by blue rectangles) because no words (purple rectangles) were detected by the original. As a result, while the ocrocrop OCR is an improvement over the naive OCR (6.6% vs. 6.5%), substantial portions of text on the image are unimproved because they are unattempted.
There are two possible ways to solve this problem. One is to abandon the ocrocrop model entirely, switching back to a computer vision approach -- either by programmatically locating rectangles on the image (as Phuc Nguyen demonstrated) or by asking humans to identify regions of interest for OCR processing (as demonstrated by Jason Best in Apiary and by Paul and Robin Schroeder in ScioTR). The other option is to improve the naive OCR -- perhaps by swapping out the engine (e.g. use ABBYY instead of Tesseract), perhaps by using a different image pre-processor (like ocropus's front-end to Tesseract), perhaps by re-training Tesseract.
I suspect that a computer vision approach to extracting entomology labels (or similar pieces of paper photographed against a noisy background) will provide a more effective eventual solution than the ocrocrop method. Nevertheless, the ocrocrop "bang it with a rock until it works" approach has a lot of potential to take entomology-style OCR to bad from worse.
[*]In addition to the difficulties typical of specimen labels--mix of typefaces, handwritten material, typewritten material, text inventory with few overlaps with a dictionary of literary English--the entomology dataset contained additional challenges. Difficulties included the following:
- Images containing specimens and rulers as well as labels.
- Labels casually arranged for photography, so that text orientation was not necessarily aligned.
- Labels photographed against a background of heavily pin-pricked styrofoam rather than a black or neutral background.
- 3-d images including what appear to be shadows, which soften the contrast differences around borders.
iDigBio Augmenting OCR Hackathon
I spent the last three days at the iDigBio Augmenting OCR Hackathon working alongside mycologists, botanists, entomologists, herbarium managers, and bioinformaticians to explore ways to improve parsing of digitized specimen labels. While I'm pleased with the results of my own contribution, I'd like to take a minute to talk about the hackathon process itself before I post them.
This was my first hackathon--a condition which seemed to be the rule among the participants--and I was really impressed with it. The iDigBio folks defined a clear set of goals (improve OCR parsing of specimen labels) with clear metrics (these datasets, these output formats, this scoring algorithm) a couple of months beforehand, and organized five weekly videoconferences before the event. Most important of all, the participants were encouraged to prepare a 10-minute lightning talk on their efforts and preliminary results. (See below for the transcript of my talk, see the notes document for descriptions of all talks.)
In my opinion, these preliminary talks were critical to the success of the project. The preliminary nature relaxed pressure on participants, so we were able to experiment beyond the target of the hackathon (as I did with my handwriting detection digression, a related, but un-scorable effort). On the other hand, they did provide enough impetus to get many of us looking at the data, working with the tools, and thinking about approaches. This meant that even before the hackathon started, many of us were familiar enough with the materials to have a real 'meeting of the minds' experience during the pre-event supper: "Did you just say 'the contrast difference between the print and the label is higher than the difference between the label and the background'? We ran into that too, and here's what we did..."
The experience was a real education in OCR for me, and I feel like I picked up techniques I can apply directly to projects I've discussed with clients and potential clients. In particular, I got a real appreciation for how interrelated image preparation, OCR, and parsing are to each other. One participant had created separate libraries of regular expressions to clean up each kind of field, having discovered that latitude/longitude coordinates require different error correction than personal names or herbarium catalog numbers do. Another group had built a touch-screen tool for classifying segments of the image before submitting them to OCR. My own project required a first pass of OCR to clean images before sending them to a second, 'real' pass of OCR. A simple 1,2,3 workflow just isn't sufficient!
iDigBio itself is an NSF-funded attempt to advance digitization practices on natural history collections, combining disciplinary "thematic collection networks" and methodologically focused working groups on topics like georeferencing, crowdsourcing, and OCR. Aware that they're not the only people digitizing things, they have been reaching out beyond the natural sciences to the library and information science community at the iConference this year. This rejection of "not invented here" siloing was a big part of the hackathon, and I hope that more people from outside the natural sciences will get involved.
Thursday, February 14, 2013
Improving OCR Inputs from OCR Outputs?
This is a transcript of my talk at the iDigBio Augmenting OCR Hackathon, presenting preliminary results of my efforts before the event.
For my preliminary work, I tried to improve the inputs to our OCR process through looking at the outputs of a naive OCR.
One of the first things that we can do to improve the quality of our inputs to OCR is to not feed them handwriting. To quote Homer Simpson, "Remember son, if you don't try, you can't fail." So let's not try feeding our OCR processes handwritten materials.
To do this, we need to try to detect the presence of handwriting. When you try to feed handwriting to OCR, you get a lot of gibberish. If we can detect handwriting, we can route some of our material to "humans in the loop" -- not wasting their time with things we could be OCRing. So how do we do this?
My approach was to use the outputs of [naive] OCR to detect the gibberish it produces when it sees handwriting to try to determine when there was handwriting present in the images. The first thing I did, before I started programming, was to classify OCR output from the lichen samples by visual inspection: whether I thought there was handwriting present or not, based on looking at the OCR outputs. Step two was to automate the classifications.
I tried this initially on the results that came out of ABBYY and then the results that came out of Tesseract, and I was really surprised by how hard it was for me as a human to spot gibberish. I could spot it, but in a lot of cases -- ABBYY does a great job of cleaning up its OCR output -- so in a lot of cases, particularly the labels that were all printed with the exception of some species name that was handwritten, ABBYY generally misses those. Tesseract, on the other hand, does not produce outputs that are quite as clean.
So the really interesting thing about this to me is that while we were able to get 70-75% accuracy on both ABBYY and Tesseract, if you look at the difference between the false positives that come out of ABBYY and Tesseract and the false negatives, I think there is some real potential here for making a much more sophisticated algorithm. Maybe the goal is to pump things through ABBYY for OCR, but beforehand look at Tesseract output to determine whether there is handwriting or not.
The next thing I did was try to automate this. I just used some regular expressions to look for representative gibberish, and then based on the number of matches got results that matched the visual inspection, though you do get some false positives.
The next thing I want to do with this is to come up with a way to filter the results based on doing a detection on ABBYY [output] and doing a detection on Tesseract [output].
The next thing that I wanted to work on was label extraction.
We're all familiar with the entomology labels and problems associated with them.
So if you pump that image of Cerceris through Tesseract, you end up with a lot of garbage. You end up with a lot of gibberish, a lot of blank lines, some recognizable words. That "Cerceris compacta" is, I believe, the result of a post-digitization process: it looks like an artifact of somebody using Photoshop or ImageMagick to add labels to the image. The rest of it is the actual label contents, and it's pretty horrible. We've all stared at this; we've all seen it.
So how do you sort the labels in these images from rulers, holes in styrofoam, and bugs? I tried a couple of approaches. I first tried to traverse the image itself, looking for contrast differences between the more-or-less white labels and their backgrounds. The problem I found with that was that the highest contrast regions of the image are the difference between print and the labels behind the print. So you're looking for a fairly low-contrast difference--and there are shadows involved. Probably, if I had more math I could do this, but this was too hard.
So my second try was to use the output of OCR that produces these word bounding boxes to determine where labels might be, because labels have words on them.
If you run Tesseract or Ocropus with an "hocr" option, you get these pseudo-HTML files that have bounding boxes around the text. Here you see this text element inside a span; the span has these HTML attributes that say "this is an OCR word". Most importantly, you have the title attribute as the bounding box definition of a rectangle.
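As a rough illustration of pulling those rectangles out of an hOCR file, here is a minimal sketch that uses only a regular expression. It assumes word spans carry a class of `ocr_word` or `ocrx_word` and that the `class` attribute appears before `title`; the names `WordBox`, `HOCR_WORD`, and `word_boxes` are made up for this example, and a real HTML parser would be more robust.

```ruby
# One word rectangle taken from an hOCR file.
WordBox = Struct.new(:text, :x0, :y0, :x1, :y1)

# Matches a word span and captures the four bbox numbers and the inner markup.
# Assumes class comes before title inside the tag.
HOCR_WORD = /<span[^>]*class=['"]ocrx?_word['"][^>]*title=['"][^'"]*?bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>(.*?)<\/span>/m

def word_boxes(hocr)
  hocr.scan(HOCR_WORD).map do |x0, y0, x1, y1, inner|
    # Drop any nested markup (for example <strong>) to leave the bare text.
    text = inner.gsub(/<[^>]+>/, '').strip
    WordBox.new(text, x0.to_i, y0.to_i, x1.to_i, y1.to_i)
  end
end
```

Each `WordBox` then holds the recognized text and the corners of its rectangle in image pixel coordinates.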
If you extract that and re-apply it to an image, you see that there are a lot of rectangles on the image, but not all the rectangles are words. You've got bees, you've got rulers; you've got a lot of random trash in the styrofoam.
So how do we sort good rectangles from bad rectangles? First I did a pass looking at the OCR text itself. If the bounding box was around text that looked like a word, I decided that this was a good rectangle. Next, I did a pass by size. A lot of the dots in the styrofoam come out looking suspiciously word-like for reasons I don't understand. So if the area of the rectangle was smaller than .015% of the image, I threw it away.
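The two passes just described might look something like the sketch below. The .015% area cutoff comes from the talk; the word test (three or more letters in a row) is only an assumption, since the talk does not say exactly what counted as "looked like a word", and the names `Rect`, `WORD_LIKE`, and `good_rectangles` are invented for the example.

```ruby
Rect = Struct.new(:text, :x0, :y0, :x1, :y1) do
  def area
    (x1 - x0) * (y1 - y0)
  end
end

# Assumed word test: at least three letters in a row.
WORD_LIKE = /[A-Za-z]{3,}/

# .015% of the image area, expressed as a fraction.
MIN_AREA_FRACTION = 0.00015

# Keep rectangles whose text looks like a word and that are not tiny.
def good_rectangles(rects, image_width, image_height)
  min_area = MIN_AREA_FRACTION * image_width * image_height
  rects.select { |r| r.text =~ WORD_LIKE && r.area >= min_area }
end
```

On a 2000 x 1000 pixel image, for instance, the cutoff works out to about 300 square pixels.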
The result was [above]: you see rectangles marked with green that pass my filter and rectangles marked with red that don't. So you get rid of the bee, you get rid of part of the ruler -- more important, you get rid of a lot of the trash over here. [Pointing to small red rectangles on styrofoam.] There are some bugs in this--we end up getting rid of "Arizona" for reasons I need to look at--but it does clean the thing up pretty nicely.
Question: A very simple solution to this would be for the guys at Berkeley to take two photographs -- one of the bee and ruler, one of the labels. I'm just thinking how much simpler that would be.
Me: If the guys in Berkeley had a workflow that took the picture--even with the bee--against a black background, that would trivialize this problem completely!
Question: If the photos were taken against a background of wallpaper with random letters, it couldn't be much worse than this [styrofoam]. The idea is that you could make this a lot easier if you would go to the museums and say, we'll participate, we'll do your OCRing, but you must take photographs this way.
Me: You're absolutely right. You could even hand them a piece of cardboard that was a particular color and say, "Use this and we'll do it for you, don't use it and we won't." I completely agree. But this is what we're starting with, so this is what I'm working on.
The next thing is to aggregate all those word boxes into the labels [they constitute]. For each rectangle, look at all of the other rectangles in the system, expand them both a little bit, determine if they overlap, and if they do, consolidate them into a new rectangle, and repeat the process until there are no more consolidations to be done. [Thanks to Sara Brumfield for this algorithm.]
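A minimal sketch of that consolidation loop follows. The 10-pixel padding is an assumed value for "a little bit", and `Box`, `members`, and the method names are hypothetical; `members` simply counts how many word boxes went into each rectangle.

```ruby
Box = Struct.new(:x0, :y0, :x1, :y1, :members)

# Grow a box by pad pixels on every side.
def expand(b, pad)
  Box.new(b.x0 - pad, b.y0 - pad, b.x1 + pad, b.y1 + pad, b.members)
end

# True when two boxes share any area or edge.
def overlap?(a, b)
  a.x0 <= b.x1 && b.x0 <= a.x1 && a.y0 <= b.y1 && b.y0 <= a.y1
end

# Smallest box covering both inputs.
def union(a, b)
  Box.new([a.x0, b.x0].min, [a.y0, b.y0].min,
          [a.x1, b.x1].max, [a.y1, b.y1].max,
          a.members + b.members)
end

# Merge overlapping (padded) boxes until no pair overlaps any more.
def consolidate(boxes, pad: 10)
  boxes = boxes.dup
  loop do
    pair = boxes.combination(2).find do |a, b|
      overlap?(expand(a, pad), expand(b, pad))
    end
    return boxes unless pair
    a, b = pair
    # Remove by identity, since Structs with equal fields compare equal.
    boxes.delete_at(boxes.index { |x| x.equal?(a) })
    boxes.delete_at(boxes.index { |x| x.equal?(b) })
    boxes << union(a, b)
  end
end
```

Each pass removes two boxes and adds one, so the loop always finishes. It re-scans every pair after each merge, which is slow for large inputs but fine for the number of word boxes on a single specimen image.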
If you do that, the blue boxes are the consolidated rectangles. Here you see a rectangle around the U.C. Berkeley label, a rectangle around the collector, and a pretty glorious rectangle around the determination that does not include the border.
Having done that, you want to further filter those rectangles. Labels contain words, so you can reject any rectangles that were "primitives" -- you can get rid of the ruler rectangle, for example, because it was just a single [primitive] rectangle that was pretty large.
So you make sure that all of your rectangles were created through consolidation, then you crop the results. And you end up automatically extracting these images from that sample -- some of which are pretty good, some of which are not. We've got some extra trash here, we cropped the top of "Arizona" here. But for some of the labels -- I don't think I could do better than that determination label by hand.
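The "reject primitives, then crop" step could be sketched as below, assuming each consolidated rectangle remembers how many word boxes it was built from (the hypothetical `members` field). The geometry string uses ImageMagick's WIDTHxHEIGHT+X+Y convention; using ImageMagick for the crop itself is an assumption here, not something stated in the talk.

```ruby
Region = Struct.new(:x0, :y0, :x1, :y1, :members)

# Keep only regions that were created by consolidating several word boxes.
def label_regions(regions)
  regions.select { |r| r.members > 1 }
end

# ImageMagick-style crop geometry: WIDTHxHEIGHT+X+Y.
def crop_geometry(r)
  format('%dx%d+%d+%d', r.x1 - r.x0, r.y1 - r.y0, r.x0, r.y0)
end
```

A geometry string produced this way could be passed to something like `convert input.jpg -crop 100x45+0+0 +repage label.jpg`.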
Then you feed the results back into Tesseract one by one, then we combine the text files in Y-axis order to produce a single file for all those images. (Not something that's a necessary step, but that does allow us to compare the results with the "raw" OCR.) How did we do?
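Combining the per-label text in Y-axis order is a simple sort on the top edge of each crop. In this sketch, `Crop` is a hypothetical record pairing that top coordinate with the OCR text for the cropped image.

```ruby
Crop = Struct.new(:y0, :text)

# Join the OCR text of each crop, topmost label first.
def combine_in_y_order(crops)
  crops.sort_by(&:y0).map { |c| c.text.strip }.join("\n")
end
```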
This is a resulting text file -- we've got a date that's pretty recognizable, we've got a label that's recognizable, and the determination is pretty nice.
Let's compare it to the raw result. In the cropped results, we somehow missed the "Cerceris compacta", we did a much nicer job on the date, and the determination is actually pretty nice.
Let's try it on a different specimen image.
We run the same process over this Stigmus image. We again find labels pretty well.
When we crop them out, the autocrop pulls them out into these three images.
Running those images through OCR, we get a comparison of the original, which had a whole lot of gibberish.
The original did a decent job with the specimen number, but the autocrop version does as well. In particular, for this location [field], the autocrop version is nearly perfect, whereas the original is just a mess.
My conclusion is that we can extract labels fairly effectively by first doing a naive pass of OCR and looking at the results of that, and that the results of OCR over the cropped images are less horrible than running OCR over the raw images -- though still not great.
[2013-02-15 update: See the results of this approach and my write-up of the iDigBio Augmenting OCR Hackathon itself.]
Labels:
hackathon,
ocr,
presentations


































































