### Programmable Corpora: A New Infrastructural Concept for Digital Literary Studies
Frank Fischer · Ingo Börner · Mathias Göbel · Angelika Hechtl
Christopher Kittel · Carsten Milling · Peer Trilcke
(Moscow · Vienna · Göttingen · Berlin · Potsdam)
**This presentation:** [bit.ly/2xEPupe](https://bit.ly/2xEPupe)
[DH2019](https://dh2019.adho.org/) · Utrecht 🇳🇱 · 10 July 2019

--

### Chapters

1. [TEI Corpus Work](#/2)
2. [This is going to be APIc!](#/3)
3. [Programmable Corpora](#/4)
4. [Linked Open Data](#/5)

---

# Prologue.

--

## Previously on …
### Drama Studies (2014–2018)
- objective: digital analysis of corpora of dramatic texts
- Python super script [***dramavis***](https://github.com/lehkost/dramavis)

--

## Distant-Reading Showcase

*Distant-Reading Showcase* (DHd2016, Leipzig 🏆).
Download via Figshare. DOI: [10.6084/m9.figshare.3101203.v2](https://dx.doi.org/10.6084/m9.figshare.3101203.v2).
--

## Gamification!

„Brecht Beats Shakespeare!“ (released 2018, img source: [@angelikah](https://twitter.com/angelikah/status/1012100869301702657)).
Full-res version here: https://doi.org/10.6084/m9.figshare.6667424.v1
--

What to do in case of …
… a power outage at data-analysis bootcamp? 
(Photo courtesy of Daniil Skorinkin.)
--

### Some of Our Research Presented at DH Conferences

- conditions for a network analysis of dramatic texts (DH2015)
- small-world phenomenon in drama (DH2016)
- progressive structuration of drama networks/„plot“ (DH2017)
- catching protagonists: typology of characters (DH2018)

--

### Problems?

- all-in-one-script approach: maintenance costs too high over the years
- although script and corpora are openly accessible, it was still hard to reproduce our findings or use our code for experiments

--

### This Talk Will Introduce …

**#ProgrammableCorpora** = the concept

**DraCor** = an application and showcase for the concept

---

# Chapter 1.
### TEI Corpus Work

--

## How **DraCor** Came About

- sustainable research projects require (and generate) infrastructure
- DraCor: Drama Corpora Platform (https://dracor.org/)
- two in-house corpora:
  - German Drama Corpus, GerDraCor (1730s–1930s)
  - Russian Drama Corpus, RusDraCor (1740s–1940s)

--

## Frontend dracor.org

https://dracor.org/ (public beta!)
--

## GerDraCor + RusDraCor

- TEI-P5
- added TEI markup to assemble all speakers, even functional or ephemeral characters not featured in cast lists
- added Wikidata IDs for authors and plays (connecting DraCor to the LOD cloud)

--

## ShakeDraCor

- derived from the [Folger Shakespeare Library](https://www.folgerdigitaltexts.org/)
- simple XQuery script to adapt the corpus to the platform
- example: [„Hamlet“](https://dracor.org/shake/hamlet)

--

## SpanDraCor

- „Biblioteca Electrónica Textual del Teatro en Español de 1868–1936“ (BETTE)
- 25 plays by 8 authors
- forked from their [GitHub repo](https://github.com/GHEDI/BETTE), minimal adaptations to connect to DraCor
- example: Valle-Inclán’s [„Águila de blasón“](https://dracor.org/span/valle-aguila)

--

## GreekDraCor + RomDraCor

- derived from [Perseus Digital Library](http://www.perseus.tufts.edu/hopper/)
- 39 plays of Greek antiquity downloaded from [their website](http://www.perseus.tufts.edu/hopper/opensource/download) (licenced under CC BY-SA 3.0)
- 36 Roman plays (Plautus, Seneca, Terence)
- convert Betacode to Unicode (using Python package [betacode 0.2](https://pypi.org/project/betacode/))
- example: Aristophanes’s [„Frogs“](https://dracor.org/greek/aristophanes-frogs)

--

## Coming Soon

- IbsDraCor
- HolDraCor (derived from the digital edition of [*Ludvig Holbergs skrifter*](http://holbergsskrifter.dk/holberg-public/view?docId=adm/HolbergsWritings.xml&sort=category))
- SweDraCor (derived from [*Dramawebben*](https://litteraturbanken.se/dramawebben))
- BashDraCor
- ItaDraCor (derived from the [*Letteratura teatrale nella Biblioteca italiana*](http://www.bibliotecaitaliana.it/))
- …?

---

# Chapter 2.
## This is going to be APIc! ¹ ²
¹ API: Application Programming Interface
² Peter Handke’s Law: [„One pun per text is permitted.“](https://www.welt.de/119569452)
--

## DraCor Technology Stack

All repos are open source: https://github.com/dracor-org
--

## DraCor API (1/2)

- provides metadata, bespoke excerpts of the actual text (stage directions, spoken text per character, etc.) and various metrics
- e.g., network data for all plays in CSV and GEXF format
- SPARQL endpoint
- live documentation via Swagger: https://dracor.org/documentation/api/

--

## DraCor API (2/2): Example Queries

Spoken text by female characters in Goethe’s „Faust“.
Basic network metrics for Shakespeare’s „Hamlet“.
All stage directions of Chekhov’s „Cherry Orchard“.
Metadata for all plays of the Russian corpus.
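
Queries like these boil down to URLs. The following is a minimal sketch in Python, not the official client: only the `https://dracor.org/api/corpora/rus/metadata.csv` pattern and the `hamlet` slug appear in this deck, so the endpoint names (`spoken-text`, `metrics`, `stage-directions`), the `gender` parameter and the two other play slugs are assumptions to be checked against the Swagger documentation.

```python
from urllib.parse import urlencode

API_BASE = "https://dracor.org/api"  # base of the metadata.csv URL used in this deck


def corpus_metadata_url(corpus: str) -> str:
    """URL for the metadata of all plays in one corpus."""
    return f"{API_BASE}/corpora/{corpus}/metadata"


def play_url(corpus: str, play: str, endpoint: str = "", **params: str) -> str:
    """URL for one play; endpoint names are assumptions, not confirmed here."""
    url = f"{API_BASE}/corpora/{corpus}/play/{play}"
    if endpoint:
        url += f"/{endpoint}"
    if params:
        url += "?" + urlencode(params)
    return url


# The slugs for "Faust" and "Cherry Orchard" are hypothetical placeholders.
queries = {
    "female speech in Faust": play_url(
        "ger", "goethe-faust-eine-tragoedie", "spoken-text", gender="FEMALE"
    ),
    "network metrics for Hamlet": play_url("shake", "hamlet", "metrics"),
    "stage directions in Cherry Orchard": play_url(
        "rus", "chekhov-vishnevyi-sad", "stage-directions"
    ),
    "metadata for the Russian corpus": corpus_metadata_url("rus"),
}

for label, url in queries.items():
    print(f"{label}: {url}")
```

Keeping URL construction in two small functions means the same code can address any corpus on the platform by changing only the corpus and play identifiers.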
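--

## Simple Use Case
#### Number of characters per play in chronological order (1/3)

```R
library(data.table)
library(ggplot2)

# Load the metadata table of the Russian corpus straight from the DraCor API
rusdracor <- fread("https://dracor.org/api/corpora/rus/metadata.csv")

# One point per play: year on the x-axis, number of speakers on the y-axis
ggplot(rusdracor, aes(x = year, y = numOfSpeakers)) +
  geom_point()
```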
Very simple R script.
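
The same idea can be sketched in Python with the standard library only. The URL and the column names `year` and `numOfSpeakers` come from the R script; the helper names and the three sample rows are invented for illustration, and `fetch_metadata` is defined but not called so that the sketch runs offline.

```python
import csv
import io
import urllib.request

METADATA_URL = "https://dracor.org/api/corpora/rus/metadata.csv"  # from the R script


def fetch_metadata(url: str = METADATA_URL) -> str:
    """Download the metadata CSV as text (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def speakers_by_year(csv_text: str) -> list[tuple[int, int]]:
    """Return (year, numOfSpeakers) pairs in chronological order."""
    rows = csv.DictReader(io.StringIO(csv_text))
    pairs = [
        (int(row["year"]), int(row["numOfSpeakers"]))
        for row in rows
        if row["year"] and row["numOfSpeakers"]  # skip plays with a missing value
    ]
    return sorted(pairs)


# Hypothetical sample rows, NOT real DraCor data.
SAMPLE = """name,year,numOfSpeakers
play-c,1896,13
play-a,1747,8
play-b,1825,
"""

for year, count in speakers_by_year(SAMPLE):
    print(year, count)
```

With network access, `speakers_by_year(fetch_metadata())` should yield the pairs for a scatter plot, assuming the `year` column holds plain integers.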
--

#### Number of characters per play in chronological order (2/3)

Output in RStudio.
--

#### Number of characters per play in chronological order (3/3)

Output in LibreOffice Calc (Microsoft Excel is similar).
---

# Chapter 3.
## Programmable Corpora

--

## The Concept

- analogy to the IT news platform *ProgrammableWeb* (slogan: „APIs, Mashups and the Web as Platform“)
- corpora as comparable objects that offer functions themselves and are linked to other data sources (via LOD)
- FAIR principles, better reproducibility
- you can connect to the platform at any level (TEI, API, R, Python, SPARQL, Excel, …)

--

### Pushkin’s „Boris Godunov“ (1/2)

Label size correlates with betweenness centrality. Gavrila Pushkin – a side character – in the middle.
(Data source: GEXF file from https://dracor.org/rus/pushkin-boris-godunov.)
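
Betweenness centrality counts how often a character lies on the shortest paths between other characters, which is why a side character who links otherwise separate groups can end up in the middle. Below is a minimal sketch of Brandes' algorithm for an unweighted, undirected graph. The toy network is invented and is not the „Boris Godunov“ data, and this is not necessarily how DraCor computes its own metrics.

```python
from collections import deque


def betweenness(graph: dict[str, set[str]]) -> dict[str, float]:
    """Unnormalized betweenness for an unweighted, undirected graph."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        # Breadth-first search: count the shortest paths starting at `source`.
        order = []                             # nodes in order of discovery
        preds = {node: [] for node in graph}   # predecessors on shortest paths
        paths = dict.fromkeys(graph, 0)        # number of shortest paths
        dist = dict.fromkeys(graph, -1)        # -1 means "not reached yet"
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in graph[node]:
                if dist[neighbour] < 0:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
                if dist[neighbour] == dist[node] + 1:
                    paths[neighbour] += paths[node]
                    preds[neighbour].append(node)
        # Walk back from the farthest nodes and accumulate dependencies.
        delta = dict.fromkeys(graph, 0.0)
        for node in reversed(order):
            for pred in preds[node]:
                delta[pred] += paths[pred] / paths[node] * (1 + delta[node])
            if node != source:
                score[node] += delta[node]
    # Every undirected pair was counted once from each end.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical toy network: "M" is a messenger linking two groups.
toy = {
    "A": {"B", "M"},
    "B": {"A", "M"},
    "M": {"A", "B", "X"},
    "X": {"M", "Y"},
    "Y": {"X"},
}

for name, value in sorted(betweenness(toy).items(), key=lambda kv: -kv[1]):
    print(name, value)
```

In the toy network the messenger "M" gets the highest score although it has no group of its own, which mirrors the effect described for Gavrila Pushkin.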
--

### Pushkin’s „Boris Godunov“ (2/2)

**Dynamic graph**, generated with the [**ndtv**](https://cran.r-project.org/web/packages/ndtv/index.html) package. Data comes directly from the **DraCor API**.
Script by Ivan Pozdniakov ([source code on RPubs.com](https://rpubs.com/Pozdniakov/godunov)).
--

## Shiny App

https://shiny.dracor.org/ (by Ivan Pozdniakov).
---

# Chapter 4.
## Linked Open Data

--

(img source: https://www.w3.org/DesignIssues/LinkedData.html)
--

## Connecting to the Linked-Open-Data Cloud

- example: Gogol’s „Женитьба“/„Marriage“ (1842)
  - has [Wikipedia article](https://en.wikipedia.org/wiki/Marriage_%28play%29)
  - has Wikidata item with facts: https://www.wikidata.org/wiki/Q4179360
- first performances of Russian plays: http://tinyurl.com/y8r337pa
- first performances of German plays: http://tinyurl.com/y9vga68j

--

## Location of first performances in GerDraCor

Via Wikidata:[P4647](https://www.wikidata.org/wiki/Property:P4647) („location of first performance“).
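
A lookup of this kind can be sent to the public Wikidata SPARQL endpoint. The sketch below only builds the query and the request URL, using the Wikidata item Q4179360 linked on the previous slide and property P4647. The function and variable names are illustrative, and whether this particular item carries a P4647 statement is not checked here.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service


def first_performance_query(item_ids: list[str]) -> str:
    """SPARQL query for P4647 (location of first performance) of the given items."""
    values = " ".join(f"wd:{item_id}" for item_id in item_ids)
    return (
        "SELECT ?play ?playLabel ?place ?placeLabel WHERE {\n"
        f"  VALUES ?play {{ {values} }}\n"
        "  ?play wdt:P4647 ?place .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


def request_url(query: str) -> str:
    """GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = first_performance_query(["Q4179360"])  # item ID taken from this deck
print(query)
print(request_url(query))
```

Because plays in the corpora carry Wikidata IDs, the list passed to `first_performance_query` could in principle be filled from corpus metadata instead of being typed by hand.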
--

## Berlin as location of first performances

Via Wikidata:[P4647](https://www.wikidata.org/wiki/Property:P4647) („location of first performance“).
---

## Summary

in general:

- **Programmable Corpora** as a concept for research and teaching
- applicable to other cases ([ELTeC](https://www.distant-reading.net/eltec/)?)

in particular:

- **dracor.org**: research infrastructure for European drama:
  - reliable, expandable corpora in multiple languages (→ comparative approaches)
  - open-source project: lurk, fork, send pull requests

--

Thanks.