*Automatic Collation for Diversifying Corpora* (*ACDC*)

The Automatic Collation for Diversifying Corpora (ACDC) project, funded by a Level III Digital Humanities Advanced Grant from the National Endowment for the Humanities, aims to significantly improve the accuracy of handwritten text recognition (HTR) for Arabic-script manuscripts. Our team will develop a collation tool to automatically create large amounts of training data from existing digital texts and manuscript images without time-consuming human annotation of individual manuscripts.

The ACDC project will accomplish this by extending the capabilities of the text alignment tool passim and the OCR/HTR engine Kraken. The extended tools will align poor initial HTR transcriptions of diverse manuscript exemplars with existing digital texts, automatically producing training data in a “distantly supervised” manner.
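
The alignment idea described above can be illustrated with a small, self-contained sketch. It does not use passim or Kraken and is not the project's implementation: as a stand-in it relies only on Python's standard-library `difflib`, and every function name, parameter, and threshold in it is hypothetical. Each noisy HTR line is compared against word windows of a clean reference text, and the best-matching window is kept as the training label only when the similarity is high enough. English sample text is used purely for readability.

```python
from difflib import SequenceMatcher


def best_match(htr_line, reference_words, slack=1):
    """Return (score, text) for the reference word window most similar to htr_line.

    Windows range from n - slack to n + slack words, where n is the number
    of words in the noisy line. Hypothetical helper for illustration only.
    """
    n = len(htr_line.split())
    best = (0.0, "")
    for size in range(max(1, n - slack), n + slack + 1):
        for start in range(len(reference_words) - size + 1):
            candidate = " ".join(reference_words[start:start + size])
            score = SequenceMatcher(None, htr_line, candidate, autojunk=False).ratio()
            if score > best[0]:
                best = (score, candidate)
    return best


def build_training_pairs(htr_lines, reference_text, min_score=0.8):
    """Pair each line id with clean reference text when the alignment is confident.

    Lines whose best similarity falls below min_score (an assumed threshold)
    are dropped rather than risk a wrong label.
    """
    words = reference_text.split()
    pairs = []
    for line_id, htr_line in htr_lines:
        score, candidate = best_match(htr_line, words)
        if score >= min_score:
            pairs.append((line_id, candidate))
    return pairs


# Made-up sample data: a clean digital text and noisy first-pass transcriptions.
reference = "the quick brown fox jumps over the lazy dog near the river bank"
noisy = [
    ("line1", "tha quick brwn fox"),
    ("line2", "ovr the lasy dog"),
    ("line3", "zzzz qqqq"),  # no plausible counterpart in the reference
]
pairs = build_training_pairs(noisy, reference)
```

In this sketch the first two lines are paired with their clean counterparts and the third is discarded. The exhaustive window search is only practical for short texts; aligning full manuscripts against large corpora calls for a scalable alignment tool, which is the role the text above assigns to passim.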

By accelerating the production of training data, the ACDC tool will mark an important step towards the generalizable Arabic and Persian HTR models required to digitally transcribe large-scale Arabic and Persian manuscript collections.

Funding and Project Duration: $282,905.00 from January 2021 to June 2023 (see the National Endowment for the Humanities’ website for more information).

Primary Project Personnel

Jonathan Parkes Allen

Mellon Post-Doctoral Fellow, Roshan Institute for Persian Studies, University of Maryland, College Park; Acting Assistant Director, OpenITI AOCP project

Matthew Thomas Miller

Assistant Professor of Persian Literature & Digital Humanities, Roshan Institute for Persian Studies, University of Maryland, College Park; Director, Roshan Initiative in Persian Digital Humanities; Affiliate, Maryland Institute for Technology in the Humanities

David Smith

Associate Professor, Khoury College of Computer Sciences, Northeastern University; Founding Member, NULab for Texts, Maps, and Networks

Alejandro Toselli

Associate Research Scientist, Khoury College of Computer Sciences, Northeastern University

Si Wu

Doctoral Candidate, Khoury College of Computer Sciences, Northeastern University

Advisory Board

Carl Ernst

William R. Kenan, Jr. Distinguished University Professor, University of North Carolina, Chapel Hill; Co-Director, UNC Center for Middle East and Islamic Studies

Adi Keinan-Schoonbaert

Digital Curator, Asian and African Collections, British Library

Evyn Kropf

Librarian for Middle Eastern & North African Studies and Religious Studies, University of Michigan; Curator, Islamic Manuscripts Collection, University of Michigan

Sarah Bowen Savant

Professor of History, Institute for the Study of Muslim Civilisations, Aga Khan University, London; Principal Investigator, KITAB project

Sabine Schmidtke

Professor of Islamic Intellectual History, School of Historical Studies, Institute for Advanced Study; Principal Investigator, The Zaydi Manuscript Tradition (ZMT) project

Columba Stewart

Executive Director, Hill Museum & Manuscript Library; Professor of Theology, Saint John’s University

Daniel Stoekl Ben Ezra

Directeur d’Études, École Pratique des Hautes Études (EPHE), Paris, Section des Sciences historiques et philologiques; Principal Investigator, eScripta project