OpenITI AOCP presents: Digital Publication of Right-to-Left Script Corpora

Date: June 29 - 30, 2023
Location: The University of Maryland, College Park, MD, USA

Welcome Workshop Participants!

The Open Islamicate Texts Initiative Arabic OCR Catalyst Project (OpenITI AOCP) team would like to thank you for agreeing to participate in this expert workshop and to share your expertise and experience with all in attendance.

Before we delve into more specifics on the workshop itself, let’s go over the logistical details.

John Mullan—the incredible digital specialist of OpenITI AOCP—has already sent you your flight and accommodation information in separate emails (if for some reason you did not receive these emails, please let us know as soon as possible). Some of this information is repeated below, but be sure to review everything here in detail, because it includes additional information not covered in those emails.

Workshop Background

This workshop is a part of the Open Islamicate Texts Initiative Arabic OCR Catalyst Project Phase II (OpenITI AOCP II)—a three-year project generously funded by The Mellon Foundation.

OpenITI AOCP II is led by Matthew Thomas Miller (Roshan Institute for Persian Studies, UMD), Taylor Berg-Kirkpatrick (University of California, San Diego), Sarah Bowen Savant (Aga Khan University), David Smith (Northeastern University), and Raffaele Viglianti (Maryland Institute for Technology in the Humanities, UMD).

This workshop will bring together scholars working with digital corpora who are addressing questions including digital text collation, annotation and encoding, and digital publication of texts. Participants will be invited to share and work collaboratively on approaches and solutions to the digital publication of text corpora in general, with a focus on challenges pertaining to text in right-to-left scripts.

The workshop has two core goals:
  1. To discuss the mechanics of encoding texts and the process of creating a digital edition (with a particular focus on multi-text editions).
  2. To discuss mechanisms for the dissemination of digitally encoded texts, in ways that allow other scholars to engage with the complexity of the textual tradition while taking account of long-term sustainability.

The workshop as a whole will focus on these issues as they relate to right-to-left scripts and the challenges these scripts present, especially where many existing solutions and software cater to left-to-right use cases. As part of the workshop, we will examine and (where possible) experiment with existing software solutions and evaluate their potential for our use cases.

Transportation

Please use taxis, rideshare (Uber, Lyft, etc.), or light rail to travel between the airport and the hotel. You can submit these receipts to John for reimbursement; an original receipt is required. If for any reason obtaining a taxi or rideshare is difficult for you, please let us know as soon as possible and we will schedule transportation for you. DC traffic can be quite bad, depending on the time your flight arrives. If you are traveling from Dulles International Airport (IAD) to The Hotel via taxi or rideshare, especially near rush hour (6:30-9:30 a.m. and 4:00-6:30 p.m.), please prepare for a lengthy car ride (sometimes as long as two hours) by using the restroom and grabbing something to eat in the airport before departing.

Guests arriving at Ronald Reagan Washington National Airport (DCA) or Washington Dulles International Airport (IAD) have the option of using the D.C. Metro, which can be faster than hiring a car at rush hour. If you arrive at DCA, follow signs to the airport Metro station, then take the Yellow or Green Line towards Greenbelt and exit at the College Park - U of MD station. If you arrive at IAD, follow signs to the airport Metro station, take the Silver Line towards Largo, and exit at L’Enfant Plaza; from there, take the Green or Yellow Line towards Greenbelt and exit at the College Park - U of MD station. The Hotel at the University of Maryland is a short rideshare or bus ride from the College Park station, or a fifteen-minute walk.

Accommodations

You will all be staying at The Hotel at the University of Maryland (UMD). The Hotel is located across from the university’s entrance at 7777 Baltimore Ave, College Park, MD 20740.

Workshop Location

The workshop will take place in the McKeldin Special Events Room (Room 6137, 7649 S Library Ln, College Park, MD 20742). Jonathan Allen or Osama Eshera will meet you in The Hotel lobby each morning at 8:40 a.m. to walk with you from The Hotel to the workshop venue, but in case you need to navigate the campus on your own, here is a link to a campus map. We can also arrange transportation for any participants who would prefer not to walk for any reason.

Food

All meals during the workshop will be provided. We will also be hosting an optional pre-workshop dinner and drinks on Wednesday, June 28th at 7:30 p.m. at The Hall CP (located behind The Hotel) for those who arrive early enough (and aren’t too jetlagged to attend). For breakfast each morning you will receive a $12 voucher that can be used at Bagels ‘N Grinds, located off the lobby of The Hotel. We will have some pastries, fruit, and coffee in the meeting room for snacking during the morning hours of the workshop, but if you want a sizable breakfast, please eat before we depart The Hotel each morning. Lastly, if you will miss the group dinner on the second day of the workshop due to an early flight, please feel free to request a boxed meal to go, or purchase dinner at the airport and we will reimburse you for it.

Reimbursement

We will reimburse you for travel to and from the airport as well as for any meals you purchase while traveling. U.S. citizens and permanent residents will need to fill out a W-9 form, while international travelers will need to fill out an X-9 form and a W-8 BEN form. We will need receipts of all purchases for which you would like to receive reimbursement. All materials for reimbursement must be submitted to jmullan@umd.edu by August 4, 2023.

Workshop Participants

A. Sean Pue, Associate Professor of Hindi Language & South Asian Literature and Culture, Michigan State University
Project Affiliation: Jugaad CoLab

Bronson Brown-deVost, Postdoctoral Fellow, Theologische Fakultät, Georg-August-Universität Göttingen

Chander Shekhar, Professor of Persian, University of Delhi

Daniel Stökl Ben Ezra, Directeur d’Études, École Pratique des Hautes Études (EPHE), Université Paris Sciences et Lettres; Co-director, eScripta
Project Affiliation: eScripta

David Smith, Associate Professor of Computer Science, Northeastern University
Project Affiliation: OpenITI

David Vishanoff, Associate Professor of Religious Studies, University of Oklahoma

Elena Pierazzo, Professor of Digital Humanities, Centre d’Études Superieures de la Renaissance, University of Tours; Former Chair, Text Encoding Initiative (TEI) & TEI Manuscripts Special Interest Group

Farrukh Shahzad, Chief Librarian, Forman Christian College University

Gregory Crane, Professor of Classical Studies & Computer Science, Tufts University; Editor-in-Chief, Perseus Project
Project Affiliation: Perseus Digital Library

Hugh Cayless, Senior Digital Humanities Developer, Duke Collaboratory for Classics Computing, Duke University

Intisar Rabb, Professor of History & Law, Harvard Law School; Director, Program in Islamic Law, Harvard University
Project Affiliation: SHARIAsource

Jacob Murel, Postdoctoral Research Associate, Northeastern University
Project Affiliation: OpenITI

Janelle Jenstad, Professor of English, University of Victoria; General Editor & Director, The Map of Early Modern London (MoEML)
Project Affiliation: MoEML

Jonathan Parkes Allen, Mellon Humanities Postdoctoral Fellow, Roshan Institute for Persian Studies, University of Maryland, College Park
Project Affiliation: OpenITI

Joseph Hilleary, Master’s Candidate, Tufts University

Lorenz Nigst, Research Associate, Corpus Management, KITAB Project, Aga Khan University Institute for the Study of Muslim Civilizations
Project Affiliation: KITAB; OpenITI

Magdalena Turska, e-editiones.org

Mathew Barber, Post-Doctoral Research Fellow, KITAB Project, Aga Khan University Institute for the Study of Muslim Civilizations
Project Affiliation: KITAB, OpenITI

Matthew Thomas Miller, Assistant Professor of Persian Literature and Digital Humanities, Roshan Institute for Persian Studies, University of Maryland, College Park; Director, Roshan Initiative in Persian Digital Humanities
Project Affiliation: OpenITI

Maxim Romanov, Junior Research Group Leader, The Evolution of Islamic Societies (c.600-1600 CE): Algorithmic Analysis into Social History, Universität Hamburg
Project Affiliation: OpenITI

Muhammad Taimoor Shahid Khan, Mellon Humanities Postdoctoral Fellow, Roshan Institute for Persian Studies, University of Maryland, College Park
Project Affiliation: OpenITI

Nick Laiacona, President, Performant Software Solutions LLC

Osama Eshera, Assistant Research Professor, Roshan Institute for Persian Studies, University of Maryland, College Park; Assistant Director, OpenITI AOCP Phase II
Project Affiliation: OpenITI

Peter Stokes, Directeur d’Études, École Pratique des Hautes Études (EPHE), Université Paris Sciences et Lettres; Co-director, eScripta
Project Affiliation: eScripta

Peter Verkinderen, Assistant Professor, Centre for Digital Humanities, Aga Khan University Institute for the Study of Muslim Civilizations; Postdoctoral Research Fellow, KITAB Project
Project Affiliation: KITAB

Raffaele Viglianti, Senior Research Software Developer, Maryland Institute for Technology in the Humanities, University of Maryland
Project Affiliation: OpenITI

Sarah Bowen Savant, Professor of History, Aga Khan University Institute for the Study of Muslim Civilizations; Principal Investigator, KITAB Project
Project Affiliation: KITAB

Theodore Beers, Postdoctoral Research Fellow, Seminar für Semitistik und Arabistik, Freie Universität Berlin
Project Affiliation: Kalīla and Dimna – AnonymClassic

Till Grallert, Research Associate, Universitätsbibliothek, Humboldt-Universität zu Berlin; Marie Sklodowska-Curie Postdoctoral Fellow, Universität Hamburg

Wayne Graham, Chief Information Officer & Director of Informatics, Cultural Networks, and Knowledge Systems, Council on Library and Information Resources

Wolfgang Meier, e-editiones.org

Schedule

Pre-workshop Event (Optional)
Wednesday, June 28th, 2023

7:30 p.m. - 10:00 p.m. Welcome dinner and drinks (The Hall CP located behind The Hotel) for those who arrive early enough (and aren’t too jetlagged to attend)

Workshop Day #1: Thursday, June 29, 2023

7:30 a.m. - 8:40 a.m. Breakfast at Bagels ‘N Grinds located at The Hotel (not a group event—please just grab breakfast at your leisure)
8:40 a.m. - 9:00 a.m. Walk with Jonathan Allen to McKeldin Library
9:00 a.m. - 9:45 a.m. Welcome and introductory remarks
Matthew Miller, Raff Viglianti
9:45 a.m. - 10:45 a.m.

Opening talks

Digital editing as publication, craft, and research
Elena Pierazzo, University of Tours

Perseus, NLP and Opening up the Human Record
Gregory Crane and Joseph Hilleary

10:45 a.m. - 11:00 a.m. Break
11:00 a.m. - 12:30 p.m.

Curation

Open Arabic Periodical Editions (OpenArabicPE): a case study in digital editing outside the Global North
Till Grallert, Freie Universität Berlin (remote)

OpenITI mARkdown: Balancing Between a Standard and a Research Method
Maxim Romanov, University of Hamburg (remote)

Digital Cairo: Exploring Urban News
Hugh Cayless, Duke University
12:30 p.m. - 2:00 p.m. Lunch
2:00 p.m. - 3:30 p.m.

Annotation

TEI Publisher and right-to-left texts
Wolfgang Meier and Magdalena Turska, eXist Solutions GmbH (remote)

Challenges and Potentials in Multi-Script Text-Image Annotation
Peter Stokes, Paris Sciences et Lettres University

Annotating and Publishing Right-to-Left Script Corpora using FairCopy, ArchivEngine, and EditionCrafter
Nick Laiacona, Performant Software Solutions LLC
3:30 p.m. - 3:45 p.m. Break
3:45 p.m. - 5:15 p.m.

Dissemination

KITAB: corpus, text reuse and the multi-text editions future
Peter Verkinderen, Mathew Barber, Lorenz Nigst, Sarah Savant, Aga Khan University

Tools for Comparative Reading in the AnonymClassic Project
Theodore Beers, Freie Universität Berlin

“The end is where we start from”: The Endings Principles
Janelle Jenstad, University of Victoria (remote)
5:15 p.m. - 6:30 p.m. Break
6:30 p.m. Dinner (Pennyroyal Station)

Workshop Day #2: Friday, June 30, 2023

On day two, we will have a series of break-out/discussion groups around different topics. Each session will last 90 minutes: 60 min for break-out groups and 30 min for plenary discussion. Each break-out group will focus on a tool or approach and will have an assigned moderator who may showcase certain approaches to facilitate discussion.

7:30 a.m. - 8:40 a.m. Breakfast at Bagels ‘N Grinds located at The Hotel (not a group event—please just grab breakfast at your leisure)
8:40 a.m. - 9:00 a.m. Walk with Jonathan Allen to McKeldin Library
9:00 a.m. - 9:15 a.m. Day Overview and Planning
Raff Viglianti, Mathew Barber
9:15 a.m. - 10:45 a.m.

Session 1: Strategies for Right-to-Left Encoding (Post-OCR)

Break-out groups
  • oXygen XML editor for RTL scripts (lead: Hugh Cayless)
  • mARkdown and NotePad++ (lead: Lorenz Nigst and Peter Verkinderen)
  • FairCopy (lead: Nick Laiacona)
  • Free and Open Source Software for XML (lead: Raff Viglianti)
10:45 a.m. - 11:00 a.m. Break
11:00 a.m. - 12:30 p.m.

Session 2: Enriching Editions: Collation, Additional Markup, Annotation

Break-out groups
  • AnonymClassic approaches to collation (lead: Theo Beers)
  • Performant Software solutions: ArchivEngine and EditionCrafter (lead: Nick Laiacona)
  • Collation algorithms: CollateX, Passim (lead: David Smith)
  • TEI Publisher annotation module (lead: Magdalena Turska — remote)
12:30 p.m. - 2:00 p.m. Lunch
2:00 p.m. - 3:30 p.m.

Session 3: Publication Systems

Break-out groups
  • Scaife and KITAB’s Digital Arabic Reader (lead: Greg Crane and Mathew Barber)
  • Making and Knowing Reader (lead: Nick Laiacona)
  • TEI Publisher (lead: Daniel Stökl and Magdalena Turska)
  • Static solutions: CETEIcean (lead: Raff Viglianti, Hugh Cayless, Bronson Brown-deVost)
3:30 p.m. - 3:45 p.m. Break
3:45 p.m. - 5:15 p.m.

Session 4: Open Discussion and Closing Remarks

This session will be run “unconference” style: the group will spend 10 min writing down themes (e.g. on a whiteboard) and 10 min assigning people to groups; break-out groups will then discuss for 45 min before reconvening for closing discussion.
Example topics:
  • Longevity and infrastructure: how long should editions be expected to exist online? What infrastructure is needed for longevity?
  • Scale: how to handle very large texts and very large corpora? What systems are better suited than others and why?
  • Other themes that emerged from the previous focus sessions.
5:15 p.m. - 6:30 p.m. Break
6:30 p.m. Dinner (GrillmarX College Park)

Abstracts

Digital editing as publication, craft, and research

Elena Pierazzo, University of Tours

More than thirty years of digital editing have borne fruit: nowadays we have methods, standards, theories, and editions. Yet we are far from having exhausted all the possibilities offered by the digital sphere, and new developments are still happening at an ever-faster rate. Digital editing and digital philology are no longer thought of as imitating print, nor valued merely for their capacity to do so, but are expected to exceed it, to explore new possibilities, and to break new ground. Research possibilities are more open than ever, and many digital editorial projects aim to study new texts, or old texts with new methods, or to develop new methods to study and display all of the above.

However, not everybody is interested in doing all this, or has the objective or the skills to do so, and for many researchers the digital is sought mostly for its improved capacity for dissemination. Luckily, creating a digital edition no longer needs to mean creating a brand-new digital infrastructure: we have tools, and we have infrastructures, even if not as many as we could and should have. In fact, although producing a digital edition from scratch is tempting because it lets us adapt it to our specific research needs, showcasing our scholarship like a luxury designer red-carpet outfit, the sustainability and financial costs of such enterprises are prohibitive for most researchers, except those with generous research grants. A few years ago, I used the metaphor of haute couture and prêt-à-porter editions to argue for a two-track discipline: on the one hand, pre-built tools that allow researchers to publish their editions digitally without massive financial and pedagogical investment; on the other, a research space for exploring the digital medium and its computational potential. This paper will revisit these distinctions with examples, with the aim of better defining the landscape of digital scholarly editing.

Perseus, NLP and Opening up the Human Record

Gregory Crane and Joseph Hilleary

Overarching themes of our work:

  • We are trying to intensify the role that the human record plays in the intellectual life of human society around the world. In this view, professional academic work has its ultimate value insofar as it has an impact beyond closed specialist networks. This principle has profound implications for the practice of scholarship. Open Access is a necessary, but by no means sufficient condition.
  • We are exploring how language technologies are beginning to transform our relationship with the hyperlingual human record, allowing us to use translation as a gateway into source texts.
  • From a practical perspective, we look at two fields: Classical Studies and World Literature
    • Classical Studies (at least in countries such as the US) must include all major cultural heritage languages, with Classical Arabic and Persian playing very prominent roles. With Classics Departments giving up on Greek and Latin, we need new technologies if we are to engage with more rather than fewer languages.
    • World Literature uses a common translation language (English in the anglophone world) and generally ignores all linguistic features. The generality of tools at our disposal now begins to make this possible.
  • From an applied perspective,
    • We are finishing an NEH-funded project called Beyond Translation which addresses the challenge of integrating multiple classes of annotation to create a new kind of reading environment. We are working now on adding Classical Arabic and Persian to Greek and Latin. An overview of this work is available here: https://pdldatajournal.pubpub.org/new-features-in-beyond-translation
    • We are beginning a new NEH-funded project on “Perseus the next 30 years.” We are particularly interested now in addressing challenges of scale:
      • Working with a comprehensive database of texts in Greek, Latin and other languages
      • Seriously assessing how to work with uncorrected OCR in a scientific way (this will build on a lot of work that David Smith and his colleagues have done over the past decade)
    • We are looking to generalize the work that we have done

Open Arabic Periodical Editions (OpenArabicPE): a case study in digital editing outside the Global North

Till Grallert, Freie Universität Berlin

I will present my project Open Arabic Periodical Editions (OpenArabicPE, 2015–) as a framework for bootstrapped scholarly editions outside the global north that addresses the severe impact of the multidimensional digital divides for RTL scripts and digitised cultural artefacts of societies from the global south. Without any external funding and thus relying on volunteer labour and free-to-use software and infrastructures, OpenArabicPE has published digital editions of six periodicals published in Baghdad, Beirut, Cairo, and Damascus between 1892 and 1918 with a total of 41 volumes, 645 issues and more than six million words.

The project is influenced by ideas of minimal computing and pirate care. The guiding principles for every part of the tool chain and workflow are accessibility, simplicity, sustainability, and credibility. The basic idea behind the project is to unite manually transcribed digital texts from shadow libraries, such as al-Maktaba al-Shamela, with digital facsimiles from academic scanning efforts, with the aims of validating the former through the latter and, ultimately, producing the ground truth necessary for modern character recognition algorithms. All texts are marked up in XML following the guidelines of the Text Encoding Initiative (TEI): each issue is modelled as a single file with relatively light structural mark-up for sections, articles, headers, and bylines, as well as page breaks linking the text to the facsimiles. A set of authority files is maintained for the disambiguation of named entities in bylines. Finally, a webview based on the TEI Boilerplate provides a local GUI for reading the text and the facsimiles side by side.
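
The issue model described above can be sketched in miniature. The snippet below is illustrative only: the elements are standard TEI, but the sample content and @facs file names are invented. It shows how the <pb/> page breaks tie a transcription to its facsimiles.

```python
# Illustrative sketch only: a miniature TEI issue along the lines described
# above. The elements are standard TEI; the sample content and @facs file
# names are invented.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

issue = f"""<TEI xmlns="{TEI_NS}">
  <text><body>
    <div type="article">
      <head>افتتاحية</head>
      <pb n="1" facs="facs-001.jpg"/>
      <p>...</p>
      <pb n="2" facs="facs-002.jpg"/>
      <p>...</p>
    </div>
  </body></text>
</TEI>"""

root = ET.fromstring(issue)
# Collect the page-break links that tie the transcription to the scans.
pages = [(pb.get("n"), pb.get("facs")) for pb in root.iter(f"{{{TEI_NS}}}pb")]
print(pages)  # [('1', 'facs-001.jpg'), ('2', 'facs-002.jpg')]
```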

The presentation will discuss our workflows and infrastructures with a view to their wider applicability.

OpenITI mARkdown: Balancing Between a Standard and a Research Method

Maxim Romanov, University of Hamburg

From its inception, OpenITI mARkdown was conceived to offer lightweight machine readability to the OpenITI corpus texts. It was primarily envisioned as a practical alternative to TEI XML, the de facto standard for text editions. Over time, my personal perspective on mARkdown has shifted from seeing it as a soft standard to treating it as a methodological research instrument. OpenITI mARkdown possesses a limited set of tags vital for basic structural annotation of texts. Its “standard” aspect—the one that should maintain stability and universality—is therefore relatively concise, bearing substantial similarity to the original markdown, its primary source of inspiration. However, the “methodological” aspect of mARkdown has expanded in response to the need to address specific research queries. Striving for standardization here proved somewhat ineffective, since we all may pose diverse research questions that 1) may not resonate with others, or 2) may not yield noteworthy results. The primary requirement was to provide efficient annotation of the entities pertinent to modeling specific research queries. We will delve into examples of how the greater KITAB team employs mARkdown principles. Furthermore, I will explore potential enhancements to the “standard” OpenITI mARkdown, provisionally dubbed mARkdownSimple, along with two distinct variations, each designed for a specific purpose: mARkdownMSS, an experimental system for generating and maintaining diplomatic and edited text versions, and the EIS1600 flavor, created specifically for my ongoing project analyzing historical and biographical data. While mARkdownMSS can be considered more of a standardized schema with a finite development trajectory, EIS1600 is anticipated to continually evolve to accommodate new research challenges.
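
As a rough illustration of what such lightweight structural annotation looks like in practice, here is a toy sketch (not the official OpenITI parser). The tags follow the published mARkdown conventions as I understand them ("### |" for a structural heading, "# " for a paragraph, PageVxxPxxx for page markers), and the sample text is invented.

```python
# Toy sketch, not the official parser. Tag set per the published OpenITI
# mARkdown conventions as I understand them; the sample text is invented.
import re

sample = """### | الباب الأول
# هذا نص تجريبي قصير PageV01P001
### || فصل
# فقرة أخرى PageV01P002
"""

# Structural headings: "### |" (level 1), "### ||" (level 2), and so on.
headings = re.findall(r"^### (\|+) (.+)$", sample, flags=re.M)
# Page markers of the form PageV<volume>P<page>.
pages = re.findall(r"PageV(\d+)P(\d+)", sample)
print(len(headings), pages)  # 2 [('01', '001'), ('01', '002')]
```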

Digital Cairo: Exploring Urban News

Hugh Cayless, Duke University

Digital Cairo is an NEH-funded project aiming to create TEI-encoded versions of Arabic and Ottoman Turkish articles from the Egyptian Affairs newspaper (Al-Waqa’i’ al-Misriyya, الوقائع المصريّة) relating to urbanization in 19th century Cairo. It is a sub-project of La fabrique du Caire moderne, an initiative led by Adam Mestyan (Duke University) and Mercedes Volait (CNRS). Mestyan, Volait, and Hugh Cayless (Duke) are co-PIs on the Digital Cairo project. Our process begins with MS Word transcriptions of articles, currently between the years 1828–1884, which are automatically converted to TEI XML, then processed by student assistants who do some structural editing and mark up the names of persons, places, and organizations. Our intention is to link these annotations up to authority lists, which we have begun to develop, and a gazetteer and GIS in the case of the place names. Further editorial checking is carried out by project editors using visualizations of the TEI documents before they are finalized. Work on the texts is done using oXygen in Author Mode, with a custom CSS configuration and a custom project ODD. GitHub Actions automate our workflow processes to facilitate editorial review and final publication. We have found that these tools gave us the freedom we needed to decide how to handle the material without dictating a particular approach, and then allowed us to narrow our focus and create efficient editorial workflows.
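
The entity-markup stage described above can be pictured with a minimal, hedged example: standard TEI <persName> and <placeName> elements whose @ref values (invented here) would point into a project's authority lists.

```python
# Hedged illustration: standard TEI <persName>/<placeName> markup with
# invented @ref values pointing at a hypothetical authority list.
import xml.etree.ElementTree as ET

p = ET.fromstring(
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    'أمر <persName ref="#pers001">محمد علي</persName> '
    'ببناء قناطر في <placeName ref="#plc042">القاهرة</placeName>.</p>'
)
# Pull out each entity with its local tag name, text, and authority pointer.
entities = [(el.tag.split('}')[1], el.text, el.get('ref'))
            for el in p.iter()
            if el.tag.split('}')[1] in ('persName', 'placeName')]
print(entities)
```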

TEI Publisher and right-to-left texts

Wolfgang Meier and Magdalena Turska, e-editiones.org

TEI Publisher is a community effort based on ideas and contributions by TEI enthusiasts all over the world, licensed under the GPLv3. Initially inspired by the vision behind the TEI Processing Model (the work of the late Sebastian Rahtz and other members of the 2015 TEI Simple project), it continues to evolve into what you can see today. This is only possible thanks to the contributions of developers, users, and institutions that have concrete publication projects and are willing to take an Open-Source-first approach, so the whole community can reuse and benefit from their work. This presentation will provide an overview of its functionality, including support for RTL scripts and the new TEI Publisher annotation module.

Challenges and Potentials in Multi-Script Text-Image Annotation

Peter Stokes, Paris Sciences et Lettres University

A key topic in digital publication (in the broader sense) is text-image annotation. This could include annotating palaeographical or art-historical features in images of manuscripts, linking images to texts, and so on, and then sharing this data, whether for large-scale analysis, for training machine learning systems, or for many other purposes one can (and probably cannot) imagine. In principle, achieving this may sound straightforward, particularly given the very large number of annotation tools already available. In practice, however, significant challenges remain, not only in terms of tools but also in terms of data. If we are truly to make the best use of our efforts, our data should be linked and linkable, open and shared, but this implies standards, and how to achieve this in practice is by no means evident. Many concepts that scholars use from day to day are very ill-defined, and sharing or comparing palaeographical annotations is difficult when we do not have clear and precise definitions even for very basic terms such as “letter”. Furthermore, those standards that do exist tend to function poorly in capturing even basic information such as script directionality. This difficulty becomes even greater when one corpus or even one manuscript contains multiple writing systems, such as the countless manuscripts and inscriptions written in various combinations of Greek, Latin, Arabic and/or Hebrew, not to mention Hebrew and Chinese (as in the sixteenth-century manuscripts from Kaifeng), or Brahmi, Kharosthi, Greek and Aramaic (as in the inscriptions of Ashoka from the 3rd century BCE), and so on. Ideally, standards and tools should allow for the sharing and comparison of these multi-language, multi-script annotations. We are still a long way from achieving this in practice, but several promising efforts are beginning to show the way.
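
On the directionality point, Unicode itself records a bidirectional class for every character, which already allows simple, standards-based inspection of mixed-direction text with nothing beyond the Python standard library:

```python
# Every Unicode character carries a bidirectional class, so mixed-direction
# text can be inspected with the standard library alone.
import unicodedata

def bidi_classes(text):
    """Map each non-space character to its Unicode bidi class."""
    return {ch: unicodedata.bidirectional(ch) for ch in text if not ch.isspace()}

print(bidi_classes("TEI و XML"))
# Latin letters are class 'L'; Arabic letters are 'AL' (Hebrew letters are 'R').
```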

Annotating and Publishing Right-to-Left Script Corpora using FairCopy, ArchivEngine, and EditionCrafter

Nick Laiacona, Performant Software Solutions LLC

In this presentation, we will survey the different ways in which texts can be annotated, marked up, structured, and aligned to produce a digital critical edition. We will look at different tools Performant has developed that have support for RTL script corpora. One such tool is FairCopy, which is a specialized word processor for annotating primary sources. Another tool is ArchivEngine, which provides collaboration and version control. The third tool is EditionCrafter, which can provide a sophisticated side-by-side rendering.

FairCopy can mark up the structure of the document to prepare it for publishing on the web. The interface can be customized for each project so that editors have the elements and attributes they need for the job. The interface can also be configured for RTL editing. It can ingest texts from plain text, TEI XML, and IIIF. It exports valid TEI XML documents.

ArchivEngine works with FairCopy to provide version control and collaboration features. It was developed to support the Sapientia project at the University of Quebec. Using ArchivEngine, the whole team can browse the repository of texts. Team members can check out texts to edit and check them in when they are done. They can also change the project schema, which is then reflected to all other users.

EditionCrafter is a recent project still in development with the Making and Knowing Project at Columbia University. This project takes the folio viewer from “Secrets of Craft and Nature” and generalizes it into a reusable React component. The React component consumes an IIIF Manifest that is adorned with Web Annotations containing the textual layers associated with each folio. The project provides a command line tool that can generate these artifacts from the TEI document. It can also draw texts from ArchivEngine, creating a complete document publishing pipeline.
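
To make that data shape concrete, here is a minimal, hypothetical sketch of a IIIF Presentation 3 manifest whose canvas carries a Web Annotation with a textual body. All ids are invented placeholders, and EditionCrafter's actual output may differ in detail.

```python
# Hypothetical sketch of a IIIF Presentation 3 manifest: one canvas carrying
# a Web Annotation whose textual body holds a transcription layer. All ids
# are invented placeholders.
import json

annotation = {
    "id": "https://example.org/anno/1",
    "type": "Annotation",
    "motivation": "supplementing",  # the text supplements the image
    "body": {"type": "TextualBody",
             "value": "transcription of folio 1r",
             "format": "text/plain"},
    "target": "https://example.org/canvas/1",
}

manifest = {
    "@context": "http://iiif.io/api/presentation/3/context.json",
    "id": "https://example.org/manifest",
    "type": "Manifest",
    "items": [{
        "id": "https://example.org/canvas/1",
        "type": "Canvas",
        "annotations": [{
            "id": "https://example.org/annopage/1",
            "type": "AnnotationPage",
            "items": [annotation],
        }],
    }],
}

print(json.dumps(manifest)[:60])
```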

KITAB: corpus, text reuse and the multi-text editions future

Peter Verkinderen, Mathew Barber, Lorenz Nigst, Sarah Savant, Aga Khan University

The KITAB project is one of the most active contributors to and users of the OpenITI corpus. KITAB’s main aim is to develop computational methods and infrastructures to study the Arabic written tradition. In the first five years of the project, our major foci have been building a corpus and developing text reuse detection methods. Text reuse helps us to investigate how authors wrote their texts and how the deeply intertextual Arabic written tradition evolved.

We will give a brief overview of our current dissemination strategy, which is based primarily on flat text files stored on GitHub (to facilitate collaborative work) and Zenodo (for long-term sustainability).

The Arabic corpus currently consists almost exclusively of digitized print editions. We remove footnotes from the texts, partly for copyright reasons, partly for facilitating computational analysis.

This is a very unsatisfactory state of affairs, because the printed critical edition in itself did much to sever modern research from the manuscript tradition, and removing the critical apparatus only deepens that divide.

However, we believe that HTR and digital multi-text editions hold the promise to bridge the gap between 21st-century research and the manuscript tradition. Using some of our own research as examples, we will sketch our vision for multi-text editions that could transform the way we read texts and the way we can study the long Arabic written tradition computationally.
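As a toy illustration of the underlying idea of text reuse detection (not KITAB's actual pipeline, which relies on dedicated tools such as passim), shared word n-grams between two passages can flag candidate reuse pairs; the two sample passages here are invented.

```python
# Toy illustration of text reuse detection via shared word n-grams; this is
# not KITAB's actual pipeline, and the two sample passages are invented.
def shingles(text, n=3):
    """Return the set of word n-grams ("shingles") in a passage."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

a = "قال رسول الله صلى الله عليه وسلم انما الاعمال بالنيات"
b = "وروي انما الاعمال بالنيات ولكل امرئ ما نوى"

# Overlapping shingles flag a candidate reuse pair for closer inspection.
shared = shingles(a) & shingles(b)
print(shared)  # {('انما', 'الاعمال', 'بالنيات')}
```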

Tools for Comparative Reading in the AnonymClassic Project

Theodore Beers, Freie Universität Berlin

At the previous OpenITI workshop in September 2022, I gave a general introduction to the work of the AnonymClassic project, which has, for a few years now, focused on constructing synoptic digital editions of the text of the Arabic book of fables, Kalīla wa-Dimna. And I demonstrated certain parts of the software platform that our team has built from the ground up, which allows us to collect images of digitized manuscripts of Kalīla wa-Dimna; to encode their metadata and transcribe their contents according to a standardized schema; and then to juxtapose the different versions of the text that those manuscripts represent.

The literary-historical analysis that represents the end goal of much (though not all) of our research takes place in the final stages of this process. That is, once we have a section of the text of Kalīla wa-Dimna transcribed from a range of manuscript witnesses, our team can gather for a collaborative reading session, in which we go through the text one narrative unit at a time, comparing the different versions. We also examine translations of Kalīla and Dimna in other languages, such as Syriac, Persian, Hebrew, and Old Castilian. (Including these other medieval versions of Kalīla and Dimna can be helpful in managing the difficulties posed by the corpus of extant Arabic manuscripts, none of which predates the seventh/thirteenth century.) A meeting of the AnonymClassic colloquium, then, involves both close and broad reading of part of one fable in the book.

I would like to use my presentation at the upcoming OpenITI workshop to offer a more focused demonstration and discussion of this process of juxtaposing versions of Kalīla and Dimna, both within and beyond the Arabic tradition. In doing so, I will show two of the main digital tools that we have developed. First is our adaptation of the LERA platform (drawing on the work of Marcus Pöckelmann), which enables us to view simultaneously a range of transcribed texts from Arabic Kalīla wa-Dimna manuscripts. Second is a tool called Kalīla Reader, which I developed to make it easier for members of our team to explore published versions of Kalīla and Dimna in many languages. I will use passages from the fable of “The Lion and the Jackal” as examples to give a sense of our comparative reading practices. This is meant to address the question of “What next?” after a project like AnonymClassic has built a database of digitized manuscripts along with their metadata and transcribed contents.
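The kind of word-level juxtaposition described here can be approximated, in a toy way, with the standard library's SequenceMatcher. The project itself uses an adaptation of LERA, so the sketch below (with two invented versions of a Kalīla-style opening) only illustrates the general idea of aligning two versions of a sentence.

```python
# Toy word-level alignment of two invented versions of a Kalīla-style opening
# using the standard library; the project itself uses an adaptation of LERA.
from difflib import SequenceMatcher

v1 = "زعموا ان الاسد كان في ارض".split()
v2 = "زعموا ان الاسد عاش في غيضة".split()

# Opcodes report which word spans are equal and which vary between versions.
ops = SequenceMatcher(a=v1, b=v2).get_opcodes()
for op, i1, i2, j1, j2 in ops:
    print(op, v1[i1:i2], v2[j1:j2])
```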

“The end is where we start from”: The Endings Principles

Janelle Jenstad, University of Victoria

How can you begin your project in such a way that the final deliverable is robust and archivable? The Endings Project — a 2016-2023 collaboration between developers, researchers, and librarians/archivists (endings.uvic.ca) — aims to help digital humanities projects reduce both the ongoing maintenance burden of evolving technologies and the risk of complete project loss. The “Endings Principles for Digital Longevity” ensure that your project is ready for long-term archiving at any point in its development. I will briefly present the key recommendations for Endings-compliant Data, Documentation, Processing, Products, and Release Management and then show how two major DH projects have achieved Endings Compliance. The Map of Early Modern London (MoEML; mapoflondon.uvic.ca) moved in 2018 from its twenty-year habit of rolling releases with server-side dependencies to periodic releases of static “editions” with no server-side dependencies. Each edition is archived with full functionalities and citation information. Linked Early Modern Drama Online (LEMDO) has been Endings-compliant from its inception; we have developed release strategies that allow us to hold back unfinished files, editions, and anthologies while letting completed work be released incrementally. This talk aims to provide strategies for new and continuing digital projects to plan for their own intentional and thoughtful conclusion.