#1 Workshop | 24 April 2024 (Wed)
Computational Approaches to Textual Similarity
Date: 24 April 2024 (Wed)
Time: 3:00-4:30pm
Venue: Digital Scholarship Lab, University Library
Speaker: Dr. Donald Sturgeon
Textual similarity – encompassing a variety of phenomena including direct quotation, unattributed copying with rewording or embellishment, allusion, and distinctly similar word usage – has long been of interest to textual scholars in many domains for a variety of reasons. Often these similarities are non-trivial to uncover, but once identified can provide valuable evidence for hypotheses about textual transmission histories and authorship – particularly important where these are complex or disputed.
Large-scale digitization of texts, together with ever more powerful computer technology, presents excellent opportunities for identifying textual similarities automatically at scales that would otherwise be impossible. Focusing primarily on classical Chinese examples, this interactive, hands-on workshop will give an overview of some of the most commonly used approaches and introduce practical ways of identifying, summarizing, and visualizing a variety of types of textual similarity in historical materials. No technical background is assumed, and all necessary materials will be provided.
#2 Seminar | 26 April 2024 (Fri)
Premodern China in the Age of AI: Opportunities and Challenges
Date: 26 April 2024 (Fri)
Time: 3:00-4:30pm
Venue: Digital Scholarship Lab, University Library
Speaker: Dr. Donald Sturgeon
This seminar introduces ongoing work on building and using NLP models in the context of the Chinese Text Project, a large and widely used digital library of premodern Chinese texts, and the challenges encountered in the process. Using deep learning models trained on a large collection of data obtained through a combination of rule-based automation, linking of external resources, and crowdsourced editing, the project directly augments the digital library with computer-generated data in a practically useful and sustainable way.
These models cover a range of tasks, many of which would previously have been considered impractical without human intervention, including: automated punctuation of unpunctuated texts; automated annotation of named historical entities in transcribed texts; automated post-correction of transcription errors in OCR-generated text; and, lastly, more speculative and research-oriented tasks such as the application of deep learning to chronological and authorial attribution of historical texts.
About the Speaker
• Creator and administrator of the Chinese Text Project, a major online collaborative digital library project for pre-modern Chinese texts
• Research interests: issues of language, mind and knowledge in classical Chinese thought, and the application of digital methods to the study of pre-modern Chinese literature and language
Faculty Office of Arts
arts@cuhk.edu.hk