### Programmable Corpora: A New Infrastructural Concept for Digital Literary Studies
Frank Fischer · Ingo Börner · Mathias Göbel · Angelika Hechtl
Christopher Kittel · Carsten Milling · Peer Trilcke
(Moscow · Vienna · Göttingen · Berlin · Potsdam)
**This presentation:** [bit.ly/2xEPupe](https://bit.ly/2xEPupe)
[DH2019](https://dh2019.adho.org/) · Utrecht 🇳🇱 · 10 July 2019

--

### Chapters

1. [TEI Corpus Work](#/2)
2. [This is going to be APIc!](#/3)
3. [Programmable Corpora](#/4)
4. [Linked Open Data](#/5)

---

# Prologue.

--

## Previously on …

### Drama Studies (2014–2018)

- objective: digital analysis of corpora of dramatic texts
- Python super script [***dramavis***](https://github.com/lehkost/dramavis)

--

## Distant-Reading Showcase

![465 drama networks at a glance](images/distant-reading-showcase-poster.jpg)
*Distant-Reading Showcase* (DHd2016, Leipzig 🏆).
Download via Figshare. DOI: [10.6084/m9.figshare.3101203.v2](https://dx.doi.org/10.6084/m9.figshare.3101203.v2).
--

## Gamification!

![card game](images/card-game-dh2018.jpg)
„Brecht Beats Shakespeare!“ (released 2018, img source: [@angelikah](https://twitter.com/angelikah/status/1012100869301702657)).
Full-res version here: https://doi.org/10.6084/m9.figshare.6667424.v1
--

What to do in case of …

… a power outage at data-analysis bootcamp?

![card game](images/boot-camp-card-game.jpg)
(Photo courtesy of Daniil Skorinkin.)
--

### Some of Our Research Presented at DH Conferences

- conditions for a network analysis of dramatic texts (DH2015)
- small-world phenomenon in drama (DH2016)
- progressive structuration of drama networks/„plot“ (DH2017)
- catching protagonists: typology of characters (DH2018)

--

### Problems?

- all-in-one-script approach: maintenance costs too high over the years
- although script and corpora are openly accessible, it was still hard to reproduce our findings or use our code for experiments

--

### This Talk Will Introduce …
**#ProgrammableCorpora** = the concept

**dracor.org** = an application and showcase for the concept

---

# Chapter 1.

### TEI Corpus Work

--

## How **DraCor** Came About

- sustainable research projects require (and generate) infrastructure
- DraCor: Drama Corpora Platform (https://dracor.org/)
- two in-house corpora:
  - German Drama Corpus, GerDraCor (1730s–1930s)
  - Russian Drama Corpus, RusDraCor (1740s–1940s)

--

## Frontend dracor.org

![DraCor frontpage](images/dracor-frontpage.png)
https://dracor.org/ (public beta!)
--

## GerDraCor + RusDraCor

- TEI-P5
- added a participant list (`<particDesc>`) to assemble all speakers, even functional or ephemeral characters not featured in cast lists
- added Wikidata IDs for authors and plays (connecting DraCor to the LOD cloud)

--

## ShakeDraCor
- derived from [Folger Shakespeare Library](https://www.folgerdigitaltexts.org/)
- simple XQuery script to adapt the corpus to the platform
- example: [„Hamlet“](https://dracor.org/shake/hamlet)

--

## SpanDraCor

- „Biblioteca Electrónica Textual del Teatro en Español de 1868–1936“ (BETTE)
- 25 plays by 8 authors
- forked from their [GitHub repo](https://github.com/GHEDI/BETTE), minimal adaptations to connect to DraCor
- example: Valle Inclán’s [„Águila de blasón“](https://dracor.org/span/valle-aguila)

--

## GreekDraCor + RomDraCor

- derived from [Perseus Digital Library](http://www.perseus.tufts.edu/hopper/)
- 39 plays of Greek antiquity downloaded from [their website](http://www.perseus.tufts.edu/hopper/opensource/download) (licensed under CC BY-SA 3.0)
- 36 Roman plays (Plautus, Seneca, Terence)
- converted Betacode to Unicode (using Python package [betacode 0.2](https://pypi.org/project/betacode/))
- example: Aristophanes’s [„Frogs“](https://dracor.org/greek/aristophanes-frogs)

--

## Coming Soon

- IbsDraCor
- HolDraCor (derived from the digital edition of [*Ludvig Holbergs skrifter*](http://holbergsskrifter.dk/holberg-public/view?docId=adm/HolbergsWritings.xml&sort=category))
- SweDraCor (derived from [*Dramawebben*](https://litteraturbanken.se/dramawebben))
- BashDraCor
- ItaDraCor (derived from the [*Letteratura teatrale nella Biblioteca italiana*](http://www.bibliotecaitaliana.it/))
- …?

---

# Chapter 2.
## This is going to be APIc! ¹ ²
¹ API: Application Programming Interface
² Peter Handke’s Law: [„One pun per text is permitted.“](https://www.welt.de/119569452)
--

## DraCor Technology Stack

![DraCor technology stack](images/dracor-drawio.svg)
All repos are open source: https://github.com/dracor-org
--

## DraCor API (1/2)

- provides metadata, bespoke excerpts of the actual text (stage directions, spoken text per character, etc.) and various metrics
- e.g., network data for all plays in CSV and GEXF format
- SPARQL endpoint
- live documentation via Swagger: https://dracor.org/documentation/api/

--

## DraCor API (2/2): Example Queries
https://dracor.org/api/corpora/ger/play/goethe-faust-eine-tragoedie/spoken-text?gender=FEMALE
Spoken text by female characters in Goethe’s „Faust“.
https://dracor.org/api/corpora/shake/play/hamlet/metrics
Basic network metrics for Shakespeare’s „Hamlet“.
https://dracor.org/api/corpora/rus/play/chekhov-vishnevyi-sad/stage-directions
All stage directions of Chekhov’s „Cherry Orchard“.
https://dracor.org/api/corpora/rus/metadata
Metadata for all plays of the Russian corpus.
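
The four queries above follow one URL pattern, which a small helper can make explicit. This is a minimal Python sketch, assuming the paths stay exactly as shown on this slide (the live API may have changed since); `play_url`, `corpus_url` and `fetch` are hypothetical helper names, not part of DraCor.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the example queries on this slide (assumed unchanged).
API_BASE = "https://dracor.org/api"


def play_url(corpus, play, endpoint, **params):
    """Build the URL of a play-level endpoint such as 'metrics' or 'spoken-text'."""
    url = f"{API_BASE}/corpora/{corpus}/play/{play}/{endpoint}"
    if params:
        # optional filters, e.g. gender="FEMALE"
        url += "?" + urlencode(params)
    return url


def corpus_url(corpus, endpoint):
    """Build the URL of a corpus-level endpoint such as 'metadata'."""
    return f"{API_BASE}/corpora/{corpus}/{endpoint}"


def fetch(url):
    """Download a resource as text (hypothetical helper; needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URLs; call fetch(...) on one of them to query the API.
    print(play_url("ger", "goethe-faust-eine-tragoedie", "spoken-text", gender="FEMALE"))
    print(play_url("shake", "hamlet", "metrics"))
    print(corpus_url("rus", "metadata"))
```

Keeping URL construction separate from the download means the pattern can be checked without touching the network.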
--

## Simple Use Case

#### Number of characters per play in chronological order (1/3)
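```R
library(data.table)
library(ggplot2)

# load the metadata table of the Russian corpus straight from the DraCor API
rusdracor <- fread("https://dracor.org/api/corpora/rus/metadata.csv")

# one point per play: year on the x-axis, number of speakers on the y-axis
ggplot(rusdracor[], aes(x = year, y = numOfSpeakers)) + geom_point()
```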
Very simple R script.
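
The same table can be read from Python with the standard library alone. The sketch below assumes the CSV has the `year` and `numOfSpeakers` columns used in the R script; `speakers_by_year` is a hypothetical helper, and the sample rows are invented placeholders, not DraCor data.

```python
import csv
import io


def speakers_by_year(csv_text):
    """Return (year, numOfSpeakers) pairs sorted chronologically."""
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            pairs.append((int(row["year"]), int(row["numOfSpeakers"])))
        except (KeyError, TypeError, ValueError):
            # skip rows without a usable year or speaker count
            continue
    return sorted(pairs)


# Placeholder rows, invented for illustration only (not taken from DraCor).
SAMPLE = (
    "name,year,numOfSpeakers\n"
    "play-b,1850,12\n"
    "play-a,1790,7\n"
    "play-c,,5\n"
)

if __name__ == "__main__":
    for year, count in speakers_by_year(SAMPLE):
        print(year, count)
```

In practice the function would be fed the text downloaded from the `metadata.csv` URL in the R script rather than the placeholder sample.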
--

#### Number of characters per play in chronological order (2/3)
![number of characters per play](images/num-of-speakers-rusdracor.png)
Output in RStudio.
--

#### Number of characters per play in chronological order (3/3)
![number of characters per play (LibreOffice Calc)](images/num-of-speakers-rusdracor-libreoffice-calc.jpg)
Output in LibreOffice Calc (Microsoft Excel is similar).
---

# Chapter 3.

## Programmable Corpora

--

## The Concept

- analogy to the IT news platform *ProgrammableWeb* (slogan: „APIs, Mashups and the Web as Platform“)
- corpora as comparable objects that offer functions themselves and are linked to other data sources (via LOD)
- FAIR principles, better reproducibility
- you can connect to the platform at any level (TEI, API, R, Python, SPARQL, Excel, …)

--

### Pushkin’s „Boris Godunov“ (1/2)

![network graph](images/betweenness-pushkin-boris-godunov.png)
Label size correlates with betweenness centrality. Gavrila Pushkin – a side character – in the middle.
(Data source: GEXF file from https://dracor.org/rus/pushkin-boris-godunov.)
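
To make the measure behind the label sizes concrete, here is a minimal Python sketch of betweenness centrality for an undirected, unweighted co-occurrence network, using Brandes' algorithm. It is not the code that produced the figure; the edge list with characters `A`, `B`, `C` is a hypothetical toy example, and scores are left unnormalised.

```python
from collections import deque


def betweenness(edges):
    """Unnormalised betweenness centrality (Brandes) for an undirected graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    score = {v: 0.0 for v in adj}
    for s in adj:
        # breadth-first search from s, counting shortest paths (sigma)
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # accumulate pair dependencies, farthest nodes first
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # every unordered pair was visited from both endpoints
    return {v: c / 2 for v, c in score.items()}


if __name__ == "__main__":
    # Hypothetical toy network: B shares scenes with A and C, who never meet.
    print(betweenness([("A", "B"), ("B", "C")]))
```

A character who links otherwise separate groups scores high even with few lines of their own, which is how a side character can end up in the middle of the graph. With a graph library the measure is available directly; networkx, for instance, provides `read_gexf` and `betweenness_centrality`.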
--

### Pushkin’s „Boris Godunov“ (2/2)

![Boris Godunov (dynamic graph)](images/pushkin-boris-godunov.gif)

**Dynamic graph**, generated with the [**ndtv**](https://cran.r-project.org/web/packages/ndtv/index.html) package. Data comes directly from the **DraCor API**.

Script by Ivan Pozdniakov ([source code on RPubs.com](https://rpubs.com/Pozdniakov/godunov)).
--

## Shiny App

![Shiny App](images/shiny-kaethchen.png)
https://shiny.dracor.org/ (by Ivan Pozdniakov).
---

# Chapter 4.

## Linked Open Data

--

![LOD cup](images/lod-cup.jpg)
(img source: https://www.w3.org/DesignIssues/LinkedData.html)
--

## Connecting to the Linked-Open-Data Cloud

- example: Gogol’s „Женитьба“/„Marriage“ (1842)
  - has [Wikipedia article](https://en.wikipedia.org/wiki/Marriage_%28play%29)
  - has Wikidata item with facts: https://www.wikidata.org/wiki/Q4179360
- first performances of Russian plays: http://tinyurl.com/y8r337pa
- first performances of German plays: http://tinyurl.com/y9vga68j

--

## Location of first performances in GerDraCor

![location of first performances (map)](images/first-performances-overall.png)
Via Wikidata:[P4647](https://www.wikidata.org/wiki/Property:P4647) („location of first performance“).
--

## Berlin as location of first performances

![location of first performances in Berlin (map)](images/first-performances-berlin.png)
Via Wikidata:[P4647](https://www.wikidata.org/wiki/Property:P4647) („location of first performance“).
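
A lookup of this property for a single play can be sketched as follows. This is a minimal Python example, not the query behind the maps or the shortened links; it assumes the public Wikidata Query Service endpoint, and `first_performance_query` and `query_url` are hypothetical helper names. The item ID is the one given above for Gogol’s „Marriage“.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def first_performance_query(qid):
    """SPARQL asking for the 'location of first performance' (P4647) of one item."""
    return (
        "SELECT ?place ?placeLabel WHERE {\n"
        f"  wd:{qid} wdt:P4647 ?place .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


def query_url(sparql):
    """URL that returns the query result as JSON when requested."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


if __name__ == "__main__":
    sparql = first_performance_query("Q4179360")
    print(sparql)
    print(query_url(sparql))
```

Because DraCor stores Wikidata IDs for its plays, the same query shape could be repeated per play or rewritten to cover a whole corpus at once.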
---

## Summary

in general:

- **Programmable Corpora** as a concept for research and teaching
- applicable to other cases ([ELTeC](https://www.distant-reading.net/eltec/)?)

in particular:

- **dracor.org**: research infrastructure for European drama:
  - reliable, expandable corpora in multiple languages (→ comparative approaches)
  - open-source project: lurk, fork, send pull requests

--

Thanks.
https://dracor.org/
#ProgrammableCorpora