Tuesday, March 19, 2013

Crowd Sourcing and Digital Editing

The term is almost over, and I have finally found some time to write down some of the thoughts I have been chewing on over the past months. Apologies to my numerous followers for the long silence.
The topic this time is crowd-sourcing, which is a bit unusual for me as I have not been directly involved in any crowd-sourcing project. But, as many of you already know, I'm working on a book on Digital Scholarly Editing, which inevitably forces me to consider new forms of editing such as, for instance, crowd-sourcing and its role in editing.
A King's College London project devoted to the classification of different types of crowd-sourcing activity has just concluded, producing a hefty report written by Stuart Dunn. By its author's own admission, however, the classification contained in the report is a bit too comprehensive to be really useful for my purpose, so here is the one I have created (yes, all by myself!). Comments are welcome!

Without any claim to being exhaustive, crowdsourcing that concerns in some way the editing and publication of texts can be classified according to five parameters:
1.     Context: Crowdsourcing projects can be hosted and supported by:
a.     Universities and Cultural Heritage Institutions, such as Libraries and Museums. This is the case of Transcribe Bentham, which is hosted and supported by UCL, and of the National Library of Australia’s Historic Newspaper Digitisation Project, where users have been asked to correct OCRed articles from historical newspapers.
b.     Non-governmental organisations and other private initiatives: this is the case, for instance, of Project Gutenberg, which began in 1971 from the vision of its founder, Michael Hart, and has continued ever since thanks to donations.
c.     Commercial: this is the case, for instance, of Google, which uses the reCAPTCHA service to ask users to enter words seen in distorted text images onscreen. Some of these words come from unreadable passages of digitised books, so users help to correct the output of the OCR process while the service protects websites from attacks by internet robots (the so-called ‘bots’).
2.     Participants: or better, how they are recruited and which skills they should possess to be allowed to contribute. Some projects issue open calls, for which anybody can enrol and contribute as they wish, with no particular skill required other than commitment; other projects require their contributors to possess specific skills, which are checked before the user is allowed to do anything. The former is the case for the Historic Newspaper Digitisation Project and Project Gutenberg, the latter for the Early English Laws project. Many projects place themselves in between these two categories, closer to one end or the other. In the SOL project, for instance, users are assumed to read and understand Greek, and their competence is verified by the quality of their translations; to register as editors, however, users are expected to declare their competences, which are checked by the editorial board.
3.     Tasks: The tasks requested of the users could be one or more of:
a.     Transcribing manuscripts or other primary sources, like in the case of Transcribe Bentham.
b.     Translating: as in the case of SOL.
c.     Editing, which is requested by the Early English Laws project.
d.     Commenting and Annotating: as in the case of the Pynchon Wiki.
e.     Correcting: this is the case, for instance, of the National Library of Australia’s project seen above and of Project Gutenberg, where users not only contribute by uploading new material, but also take on the proofreading of texts in the archive.
f.      Answering specific questions: this is the case for the Friedberg Genizah Project, for instance, which uses the project's Facebook page to ask its followers specific questions about, for instance, a particular reading of a passage, or whether the hand of two different fragments is the same, and so on.
4.     Quality control: the quality of the work produced by the contributors can be assessed by professional staff hired for that purpose (e.g. Transcribe Bentham), or could be assured by the community itself, with super-contributors whose controlling roles are earned in the field by becoming major contributors (e.g. Wikipedia), or granted because of their qualifications (e.g. SOL), or both.
5.     Role in the project: for some projects the crowdsourced material can be the final aim of the project, as for Project Gutenberg or the Historic Newspaper Digitisation Project, or it could be a product that will be used in other stages of the project. The transcriptions produced within the Transcribe Bentham project serve a double purpose: they represent the main outcome of the project as, once their quality has been ascertained, they feed into UCL’s digital repository, but they are also meant to be used for the edition of The Collected Works of Jeremy Bentham, in preparation since 1958.

Is there anything else I should have included?

Friday, May 11, 2012

Genetic encoding at work

I get back to my blog after a long silence caused by a rather busy month (March), with three papers given on three different continents on three different topics (Paris on Proust: see below; Providence on Modelling in Teaching; Canberra on the role of the TEI in DH projects), and all of it in the middle of term. Nice. Then there was a rather deserved holiday (Australia seems to get better each time I go!), and MMSDA (April). Finally, catching up with loads of emails, deadlines, etc.

This post reports on the content of the first of the three conferences, i.e. the presentation that Julie André and I gave in Paris on the 1st of March at Proust, l’œuvre des manuscrits. The conference was organised by the "Equipe PROUST" of ITEM-CNRS (Institut des Textes et Manuscrits modernes), with funding from the ANR Program CAHIERS-PROUST (Nathalie Mauriac Dyer, ITEM, dir.).

You can admire the prototype I have created at this address: http://research.cch.kcl.ac.uk/proust_prototype/ You can download the XML and XSLT, if you so wish.

The idea at the base of this prototype is that in digital editions we have so far tried to reproduce print editions without engaging with the new medium in a fruitful or interesting way. Even the most sought-after type of online edition, the transcription presented side by side with the facsimile, is not new at all, and shows quite a few limitations.
  1. It creates an alternative space which tries to mimic the original space, without ever being able to represent it in full; 
  2. It leaves to the user/reader the task of establishing the relationship between the transcribed and the inscribed text; 
  3. It is bound to present pages (and not, for instance, openings), given the constraint in width of the screen, an approach that, if applied to Proust’s Cahiers, would indeed falsify the documentary evidence, which shows how Proust considered the double page as his writing space (have a look at these materials on Gallica: they are amazing!).
The normal type of publication format adopted for draft manuscripts is the ultra-diplomatic edition, which presents the transcribed text in a format that tries to mimic the layout of the manuscript page as much as possible. While this type of edition provides many advantages, it lacks one fundamental aspect: the dynamicity of the writing process.

So what, then? For the transcription we have used the new TEI elements for documentary transcription (I talk about this in another post); then I have used SVG to plot the transcribed zones of text on top of the facsimile; then I have used a bit of JavaScript to animate the output and reproduce the sequence of writing and the sequence of reading of such zones. I have also used colour to mark uncertainty: are we sure about the temporal placement of the sequences? The yellower the background, the less sure we are.
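To make the idea a little more concrete, here is a minimal sketch in Python of how zones could be turned into an SVG overlay. It is not the code of the prototype (which relies on XSLT and JavaScript): the `Zone` class, the `build_overlay` function, the `data-order` attribute and the opacity scale are all assumptions made up for illustration. Each zone becomes a polygon drawn over the facsimile, in writing order, and the less certain we are about its place in the sequence, the more opaque its yellow fill.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

SVG_NS = "http://www.w3.org/2000/svg"


@dataclass
class Zone:
    """Hypothetical record for one transcribed zone of a page."""
    zone_id: str
    points: str       # "x1,y1 x2,y2 ..." in facsimile pixel coordinates
    order: int        # position in the writing sequence (1 = first)
    certainty: float  # 1.0 = sure about the order, 0.0 = a guess


def fill_opacity(certainty):
    """Map certainty to opacity: the less sure we are, the yellower."""
    certainty = max(0.0, min(1.0, certainty))
    return 0.1 + 0.6 * (1.0 - certainty)


def build_overlay(zones, width, height, image_href):
    """Return an SVG string with one polygon per zone, in writing order."""
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg", {"viewBox": f"0 0 {width} {height}"})
    # The facsimile image is the structural support for the zones.
    ET.SubElement(svg, f"{{{SVG_NS}}}image", {
        "href": image_href, "width": str(width), "height": str(height),
    })
    for zone in sorted(zones, key=lambda z: z.order):
        ET.SubElement(svg, f"{{{SVG_NS}}}polygon", {
            "id": zone.zone_id,
            "points": zone.points,
            "fill": "yellow",
            "fill-opacity": f"{fill_opacity(zone.certainty):.2f}",
            # A script could read this attribute to reveal zones one by one.
            "data-order": str(zone.order),
        })
    return ET.tostring(svg, encoding="unicode")
```

Calling `build_overlay(zones, 1000, 1400, "page.jpg")` with a list of `Zone` objects returns a string that can be saved as an `.svg` file; the animation of the writing sequence would still have to be added on top, for instance by a script that steps through the `data-order` values.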

I think this type of visualisation is definitely not perfect but it is interesting for many reasons: first, because it tries to do something that the print editions cannot do; second, because it doesn't present a coherent read-me-top-to-bottom type of text (which would be just wrong in this case); and third, because it takes the (facsimile of the) document as its structural support.
What's still missing? Quite a lot, actually, but in particular I can think of these few points now:

  • A way to represent the dynamic sequences across pages and documents: this can be easily done in the XML source, but not yet in the output
  • A way to drag the zones away in order to read what's underneath
  • Microgenesis: timing writing and rewriting at word level.  
But this is for the next project!

Thursday, January 26, 2012

Digital Humanities seen from the outside: a Fish out of water

In this post I would like to reflect on the way Digital Humanities are seen from the outside and on the consequences of misunderstandings.

Apparently, seen from outside, we are those people who count words and detect hidden meanings in numbers and statistics; this method is seen as being in contrast with the more traditional literary interpretation (close vs. distant reading, to put it as a slogan). Stanley Fish seems to be of this opinion. In his blog post Mind Your P’s and B’s: The Digital Humanities and Interpretation he reports on a DH-like analysis of John Milton's Areopagitica, where he studies the dance of the “b’s” and “p’s” in a given passage. In the end, he concludes that DH-like analysis is not his cup of tea:
But whatever vision of the digital humanities is proclaimed, it will have little place for the likes of me and for the kind of criticism I practice: a criticism that narrows meaning to the significances designed by an author, a criticism that generalizes from a text as small as half a line, a criticism that insists on the distinction between the true and the false, between what is relevant and what is noise, between what is serious and what is mere play. Nothing ludic in what I do or try to do. I have a lot to answer for.
Well, there is nothing wrong with the fact that DH is not everybody's cup of tea. I can live with that, pretty easily, as it happens. The problem is that, to offer effective criticism, you should actually know what you are talking about. Mark Liberman has in fact run a test on the very premise of Fish's argument and has discovered that in that passage:

  • The number of 'p's and 'b's is only 1% higher than the average number of 'p's and 'b's in the whole text
  • There are passages that contain even more 'p's and 'b's
  • There are letters that show similar patterns, such as 'x's and 'y's
  • There are letters that show even bigger peaks, such as 'l's
In other words: to do DH-like research you should use DH tools, i.e. use a computer! Had Fish used a programme for his own research he could have spared himself a bit of ridicule. To do DH-like research, you should actually be able to do it. DH are not approaches and theories only, they are practice as well (see my definition of DH in an earlier post). It turns out that to count words (or letters) you actually have to count them.
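Just to show how little it takes, here is a minimal sketch in Python of this kind of check. It is not Mark Liberman's actual procedure, only an illustration: the function names are my own, and the choice to measure the share of a set of letters among all alphabetic characters (ignoring case, spaces and punctuation) is an assumption.

```python
from collections import Counter


def letter_rate(text, letters):
    """Share of the alphabetic characters in `text` that are in `letters`."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[letter] for letter in letters.lower()) / total


def compare(passage, full_text, letters="pb"):
    """Return (rate in passage, rate in whole text, difference)."""
    passage_rate = letter_rate(passage, letters)
    overall_rate = letter_rate(full_text, letters)
    return passage_rate, overall_rate, passage_rate - overall_rate
```

With the passage and the full text of the work loaded as strings, `compare(passage, full_text)` tells you whether the passage really stands out, and calling it again with `letters="l"` or `letters="xy"` shows whether other letters behave in the same way.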

What do we learn from this? Two main lessons, I think.
First, that we have to reflect on our image and the way we present our research and ourselves to people who take more traditional approaches to scholarship. Second, that if you want to criticise something you have to make sure you have done your homework (something I discussed in another post). The problem, I think, is that Fish *has* a point here, namely that the statistical, computational approach is not for everybody (he doesn't say that it is not useful, only that it is not for him) and that there is still a lot of value in doing things traditionally.
But if you want to make a point, make sure your argument is solid, otherwise people will make fun of you, missing something potentially interesting.

Are my students listening?

Wednesday, January 25, 2012

Research without Borders and the TEI

Last Friday (20th of January) I was invited by Marjorie Burghart to give a lecture in Lyon as part of a two-day DH event called L’édition électronique dans tous ses états : évolution des pratiques, évolution des besoins (details of the event here and poster here).
It was a lot of fun, in particular because I organised a role-playing game and everybody was very involved.
I also had the opportunity to investigate one of my favourite topics: why on earth do people spend time (and money) working for the TEI when all this work is not credited, i.e. the name of whoever had the idea is not recorded anywhere and it is just presented as the collaborative effort of The TEI (a.k.a. the Technical Council + the SIG + TEI-L + etc. etc.)? This is not what academics normally do, right? And even more so, why on earth do institutions allow this to happen?
On a personal level, the best answer to these questions is, in my opinion, that working to improve the TEI is fun: you have the opportunity to meet exceptionally gifted researchers from all over the world and, even if you cannot immediately quantify it or point at something specific, your research is affected by this. Mine has been: I think I am a much better researcher as a result of my past ten-odd years of work with the TEI, as part of the SIG, the Council and now the Board.
At the institutional level, the reason is that the TEI is recognised as one of the foundations of DH, for which we are all collectively responsible.
Yes, the TEI has a lot of open issues (last summer's putsch is a shining example of this), but, as always, I think the best way to solve the problems is to get involved. So, au travail mes amis!

Here are the slides of the presentation, in French though... apologies to all non-French speakers and to all French speakers as well (quality of the language is, well, you'll see!).

Monday, January 23, 2012

Medievalists in the making and the digital

For the past few years I have been lucky enough to be involved in a wonderful training course, MMSDA, i.e. Medieval Manuscript Studies in the Digital Age. This course is offered for free to UK PhD students who have to work with medieval manuscripts and are interested in the digital stuff. We have now run the course for three years with exceptional success, which we measure in the number of applicants (65 in the first year, 42 in the second and 28 in the third) and in their enthusiasm and commitment. The main brain behind this initiative is Peter Stokes (yep, my Peter Stokes).
The course was initially funded by the AHRC, so we were forced to offer it only to UK-based students, but, from the very first time we ran it, we were aware of a much larger interest out there. This is the reason why we sought alternative funding, and we were finally lucky enough, thanks to the hard work of Charles Burnett from the Warburg Institute, to secure some substantial funding from a COST Action project, IS1005, 'Medieval Europe - Medieval Cultures and Technological Resources'.

So we have opened the application to European countries. Results? We had 90 applications (yes, 90!!) from 18 countries for 20 places. The quality of the applicants was outstanding; I have never had to make more difficult choices, really! We have just been through them all and sent the list of successful candidates to the COST office for approval; then we will communicate the results.

This experience is telling me a few things:

  1. There are some amazing young researchers out there, we will have some stiff competition quite soon.
  2. Many people who have to engage with manuscripts lack appropriate training. Even at PhD level, for many a manuscript is little more than a support for a text.
  3. Young researchers are desperate to acquire essential digital skills (we teach XML, TEI, imaging, so nothing very sophisticated, but very desirable, as it seems)
  4. We (i.e. the organisers) have knowingly left out of the course some essential topics: Greek, Hebrew, Arabic, Glagolitic, Cyrillic... all of these languages and scripts and traditions and manuscripts are part of our common European culture, but we tend, quite conveniently, to forget it... In our case it was mostly due to lack of time (there are only so many things you can fit into 5 days, you know); still, there is something to keep in mind here, I think
Food for thought...

Tuesday, December 6, 2011

European-Flavoured Digital Editions

As it happens, I'm the co-chair of a working group on Digital Editions of NeDiMAH.

All clear? In case it is not so clear, NeDiMAH is the European Science Foundation Network for Digital Methods in the Arts and Humanities. The network groups representatives of 13 European countries (in no particular order): United Kingdom, Denmark, Netherlands, Ireland, Sweden, Finland, Croatia, France, Portugal, Norway, Switzerland, Bulgaria and Germany (yes, Italy is missing... and this is a huge problem, as it seems we cannot involve many Italian colleagues or hold events there... and I'm not commenting on the European dissemination of Italian research, no I'm not).

This network is organised in six working groups, and one of them is on Digital Editions, which I happen to co-chair with Matthew Driscoll.

We met in Copenhagen on the 5th of December and it was great fun: I met people I hadn't seen for a long time (Hilde Boe, Peter Boot), saw some dear friends (Malte Rehbein, Marjorie Burghart) and met some people for the first time (Mats Dahlström, Michael Stolz).

In the next 4 years we will be looking into many issues connected with digital editions, such as:

  • Theory of digital editions
  • Production of digital editions
  • Delivery of digital editions 
  • The role and changing nature of the editor
  • Long term issues (preservation, impact, sustainability)
We have on the radar a few events (an expert seminar and a few workshops) and publications such as a few articles in journals and a print publication.
The first two topics we will concentrate on are the skill set for editors and the role of technology in editorial work (which stemmed somehow from my presentation at the TEI Members' Meeting, the object of another post on this blog), and new approaches to critical apparatus and textual scholarship. Each of the two should be embodied in an article.

Well, can I say I can't wait to get my greedy hands on this? We will have so much fun!

Sunday, November 27, 2011

Tablet Apps, or the future of Scholarly Editions?

Yesterday I was at a very interesting symposium entitled The Future Perfect of the Book, organised by the Institute of English Studies. It was an interesting mixture of people and of presentations. There is one in particular I would like to discuss (i.e. Elaine Treharne's keynote address The Numbered Days of the Page), but not now, in another post.
In this one I would like to report on the presentation I gave with Miguel Vieira, which builds on the research conducted by Patricia Searl for her MA dissertation in Digital Humanities at King's College London. Here is the abstract we submitted to the conference organisers:
The last 20 years have seen a rise of scholarly digital editions, which offer new, unexplored dimensions and depth to textual scholarship. The new possibilities opened up by the pioneering work of Peter Robinson and Jerome McGann charmed editors to the point that at the beginning of the millennium they seemed ready to switch to the digital medium. However, this promised switch did not happen and while many universities have adopted eBooks for course readings, digital textual scholarship seems not to have reached classrooms at all. In fact, an unpublished survey of twenty-two undergraduate syllabi in the US and UK has revealed that not one class had a single web edition as an assigned reading material. On the other hand, in commercial publishing the last couple of years have seen a boom in eBooks and eReaders. It is true though that eBooks look like very poor relatives of digital scholarly editions: in most cases they include the raw text with no additional features other than string searching. As such, eBooks look somewhat regressive, representing an evolution of the codex but not the revolution of the way we read texts which was promised by the advent of computers. Usability studies have demonstrated that reading on tablets is more enjoyable than reading on the screen of computers and, in some cases, more than reading print. But this is for general reading: does it also apply to highly sophisticated digital scholarly editions? Is the sophistication of such editions, as we have conceived them so far, the enemy of accessibility and user-friendliness? Are tablet apps a possible way to enhance the appeal of Digital Scholarly Editions? These are some of the questions that this paper will address.
and here are the slides we have used for the presentation


Comments? The presentation was very well received and we got plenty of nice feedback.
Earlier this week Miguel and I rehearsed the presentation at the DDH internal seminar, and Raffaele Viglianti wrote a very nice summary of the event.

I will get back shortly on this topic, so stay tuned.