Programmable Corpora: A New Infrastructural Concept for Digital Literary Studies
Frank Fischer · Boris Orekhov
Higher School of Economics, Moscow
Slavic DH Workshop · Princeton University · 28 May 2019
1. TEI Corpus Work
2. This is going to be APIc!
3. Programmable Corpora
4. Linked Open Data
5. Research
6. Didactics

---

# Prologue.

--

## In Light of Current Events
* Nan Z. Da: The Computational Case against Computational Literary Studies. *Critical Inquiry* 45:3 (Spring 2019), pp. 601–639. DOI:[10.1086/702594](https://doi.org/10.1086/702594).
* a controversial debate, to say the least; one major shortcoming is the total neglect of non-American scholarship in the field, cf. the [statement of ADHO’s SIG-DLS](https://culturalanalytics.org/2019/05/response-by-the-special-interest-group-on-digital-literary-stylistics-to-nan-z-das-study/)
* a point well made and generally accepted concerns the replication crisis: „the process of requesting complete, runnable codes and quantitative results […] took me **nearly two years**“

--

## Previously on …

### Drama Studies (2014–2018)
- objective: digital analysis of corpora of dramatic texts
- Python super script [***dramavis***](https://github.com/lehkost/dramavis) (2014–2018)
- all-in-one approach: maintenance costs too high
- although script and corpora are openly accessible, it was still hard for others to reproduce our findings or to use our code for their own experiments

--

### Adaptations?
---

# Chapter 1.

### TEI Corpus Work

--

## **DraCor**
- sustainable research projects require (and generate) infrastructure
- DraCor: Drama Corpora Platform (https://dracor.org/)
- two in-house corpora:
  - German Drama Corpus, GerDraCor (1730s–1930s)
  - Russian Drama Corpus, RusDraCor (1740s–1940s)

--

## Frontend dracor.org
https://dracor.org/ (public beta!)
--

## GerDraCor + RusDraCor

- based on reliable sources (for Russian plays: lib.ru, rvb.ru, ilibrary.ru, …)
- page numbers map to the scans of the corresponding pages of the edition used
- converted to TEI-P5
- added a TEI element to assemble all speakers, even the functional or ephemeral characters not featured in cast lists
- added Wikidata IDs for authors and plays
- example: Pushkin’s [*Boris Godunov*](https://dracor.org/rus/pushkin-boris-godunov)

--

## ShakeDraCor
- derived from the [Folger Shakespeare Library](https://www.folgerdigitaltexts.org/)
- simple XQuery script to adapt the corpus to the platform
- example: [*Hamlet*](https://dracor.org/shake/hamlet)

--

## SpanDraCor
- „Biblioteca Electrónica Textual del Teatro en Español de 1868–1936“ (BETTE)
- 25 plays by 8 authors
- forked from their [GitHub repo](https://github.com/GHEDI/BETTE), with minimal adaptations to connect it to DraCor
- our visualisations and API-fication led to the correction of previously unnoticed (structural) bugs
- example: Valle-Inclán’s [*Águila de blasón*](https://dracor.org/span/valle-aguila)

--

## GreekDraCor
- derived from the [Perseus Digital Library](http://www.perseus.tufts.edu/hopper/)
- 39 plays of Greek antiquity downloaded from [their website](http://www.perseus.tufts.edu/hopper/opensource/download) (licensed under CC BY-SA 3.0)
- converted from Beta Code to Unicode (using the Python package [betacode 0.2](https://pypi.org/project/betacode/))
- example: Aristophanes’s [*Frogs*](https://dracor.org/greek/aristophanes-frogs)

--

## Coming Soon
- IbsDraCor
- HolDraCor (derived from the digital edition of [*Ludvig Holbergs skrifter*](http://holbergsskrifter.dk/holberg-public/view?docId=adm/HolbergsWritings.xml&sort=category))
- SweDraCor (derived from [*Dramawebben*](https://litteraturbanken.se/dramawebben))
- BashDraCor
- ItaDraCor (derived from the [*Letteratura teatrale nella Biblioteca italiana*](http://www.bibliotecaitaliana.it/))
- …?

---

# Chapter 2.
## This is going to be APIc! ¹ ²
¹ API: Application Programming Interface
--

## DraCor Technology Stack
All repos are open source: https://github.com/dracor-org
--

## DraCor API (1/2)

- provides metadata, bespoke excerpts and various metrics
- live documentation via Swagger: https://dracor.org/documentation/api/

--

## DraCor API (2/2)

- list of available corpora and their content (list of all plays)
- metadata and network data for all plays in CSV or GEXF format
- list of speaking characters per play (usually much more comprehensive than the *dramatis personae*!)
- characters per segment of a play (*dynamic graphs*!)
- spoken text (total/female/male, or per character)
- stage directions
- SPARQL endpoint
- …

--

## Example

#### Number of characters per play in chronological order (1/2)
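```R
library(data.table)
library(ggplot2)

# Load the RusDraCor metadata table straight from the DraCor API
rusdracor <- fread("https://dracor.org/api/corpora/rus/metadata.csv")

# One point per play: year against number of speakers
ggplot(rusdracor[], aes(x = year, y = numOfSpeakers)) + geom_point()
```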
Very simple R script.
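The same idea can be sketched in Python using only the standard library. This is a minimal illustration, not part of the original script: the inline sample stands in for the CSV served by the API, its values and the `name` column are made up, and only the column names `year` and `numOfSpeakers` are taken from the R example.

```python
import csv
import io

# Made-up stand-in for https://dracor.org/api/corpora/rus/metadata.csv.
# Only the columns `year` and `numOfSpeakers` appear in the R example;
# the `name` column and all values are hypothetical.
SAMPLE_CSV = """name,year,numOfSpeakers
play-c,1895,13
play-a,1747,7
play-b,1825,60
"""


def speakers_by_year(csv_text):
    """Return (year, numOfSpeakers) pairs in chronological order."""
    rows = csv.DictReader(io.StringIO(csv_text))
    points = [
        (int(row["year"]), int(row["numOfSpeakers"]))
        for row in rows
        if row["year"] and row["numOfSpeakers"]  # skip rows with gaps
    ]
    return sorted(points)


points = speakers_by_year(SAMPLE_CSV)
for year, num_speakers in points:
    print(year, num_speakers)
```

To work with live data, the text of the downloaded CSV file would be passed to `speakers_by_year` instead of the sample.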
--

#### Number of characters per play in chronological order (2/2)

Output in RStudio.
--

## Little Hands-On with Metadata

```https://dracor.org/api/corpora/rus/metadata.csv```

---

# Chapter 3.

## Programmable Corpora

--

## The Concept
- analogy to the IT news platform *ProgrammableWeb* (slogan: „APIs, Mashups and the Web as Platform“)
- corpora as comparable objects that offer functions themselves and are linked to other data sources (via LOD)
- FAIR principles, better reproducibility
- you can connect to the platform at any level (TEI, API, R, Python, SPARQL, …)

--

### Pushkin’s „Boris Godunov“ (1/2)
Label size correlates with betweenness centrality. Gavrila Pushkin – a side character – in the middle.
(Data source: GEXF file from https://dracor.org/rus/pushkin-boris-godunov.)
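How a side character can end up "in the middle" is easiest to see on a toy network. The following is a minimal sketch of betweenness centrality (Brandes' algorithm for unweighted, undirected graphs) in plain Python; the edge list and character names are hypothetical, and in practice the edges would be read from the GEXF file. It is not necessarily how the figure itself was computed.

```python
from collections import deque


def betweenness(edges):
    """Unnormalised betweenness centrality (Brandes' algorithm)
    for an undirected, unweighted graph given as a list of edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate dependencies, farthest nodes first.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]

    # Every undirected pair was counted once from each end.
    return {v: c / 2 for v, c in score.items()}


# Hypothetical toy network: two groups linked only through "Messenger".
EDGES = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("C", "Messenger"), ("Messenger", "D"),
    ("D", "E"), ("E", "F"), ("D", "F"),
]
scores = betweenness(EDGES)
print(max(scores, key=scores.get))
```

In this toy graph the connecting character gets the highest score although it has only two neighbours, which mirrors the effect described for Gavrila Pushkin.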
--

### Pushkin’s *Boris Godunov* (2/2)

**Dynamic graph**, generated with the [**ndtv**](https://cran.r-project.org/web/packages/ndtv/index.html) package. Data comes directly from the **DraCor API**.
Script by Ivan Pozdniakov ([source code on RPubs.com](https://rpubs.com/Pozdniakov/godunov)).
--

## Shiny App
https://shiny.dracor.org/ (by Ivan Pozdniakov).
--

## Boris Yarkho (1889–1942)

(img source: http://urokiistorii.ru/article/52560)
--

Boris Yarkho: ***Speech Distribution in Five-Act Tragedies (A Question of Classicism and Romanticism)*** (written 1935–1938).
Ed. by Frank Fischer, Marina Akimova and Boris Orekhov.
In: *Journal of Literary Theory* 13:1 ([*„**Moscow Formalism and Literary History**“*](https://www.degruyter.com/view/j/jlt.2019.13.issue-1/issue-files/jlt.2019.13.issue-1.xml)). De Gruyter 2019, pp. 13–76.
DOI:[**10.1515/jlt-2019-0002**](https://doi.org/10.1515/jlt-2019-0002)

--

## Speech Distribution According to Yarkho

Shakespeare vs. moderate romanticists (on average): „romantic drama is a return to Shakespeare“.
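A rough sketch of how such a distribution can be computed, under the assumption that "speech distribution" means the share of scenes in which exactly *n* characters speak (the slide itself does not spell out the definition). The scene data and character labels are hypothetical.

```python
from collections import Counter


def speech_distribution(scenes):
    """Map number of speakers -> percentage of scenes with that many speakers.

    `scenes` is a list of scenes; each scene is a list of the characters
    who speak in it (an assumed input format)."""
    counts = Counter(len(set(speakers)) for speakers in scenes)
    total = len(scenes)
    return {n: 100 * c / total for n, c in sorted(counts.items())}


# Hypothetical play with four scenes.
SCENES = [
    ["A"],            # monologue
    ["A", "B"],       # dialogue
    ["B", "C"],       # dialogue
    ["A", "B", "C"],  # three speakers
]
dist = speech_distribution(SCENES)
print(dist)
```

Two plays, or two groups of plays, can then be compared by putting their distributions side by side.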
--

## Speech Distribution According to Yarkho

Implemented in **DraCor** (example: Ozerov’s *Dmitry Donskoy*):
---

# Chapter 4.

## Linked Open Data

--
(img source: https://www.w3.org/DesignIssues/LinkedData.html)
--

### Example of an RDF Representation

- taken from the website *The Programming Historian*:
  - Matthew Lincoln: [Using SPARQL to access Linked Open Data](https://programminghistorian.org/lessons/graph-databases-and-SPARQL) (publ. 2015, released under CC-BY 4.0)

--

- Rembrandt: De Nachtwacht (The Nightwatch), 1642
(img source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Rembrandt_van_Rijn-De_Nachtwacht-1642.jpg))
--

- Vermeer: A Woman Holding a Balance, 1662/1663
(img source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Woman-with-a-balance-by-Vermeer.jpg))
Graph visualisation of RDF-encoded information:
the arrows indicate the direction of the predicate. *The Nightwatch*
was created by Rembrandt, and not the other way around.
(img source: [*The Programming Historian*](https://programminghistorian.org/lessons/graph-databases-and-SPARQL))
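The subject–predicate–object structure can be sketched with plain Python tuples. This is only an illustration of the idea of directed triples: the predicate wording is made up for readability and does not come from a real RDF vocabulary, and `match` is a hypothetical helper, not a SPARQL engine.

```python
# Each statement is a (subject, predicate, object) triple.
# The predicate label is illustrative, not a real vocabulary term.
TRIPLES = [
    ("The Nightwatch", "was created by", "Rembrandt"),
    ("A Woman Holding a Balance", "was created by", "Vermeer"),
]


def match(triples, s=None, p=None, o=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Which works were created by Rembrandt?"
works_by_rembrandt = [
    subj for subj, _, _ in match(TRIPLES, p="was created by", o="Rembrandt")
]
print(works_by_rembrandt)

# The direction matters: Rembrandt is never the subject of a statement here.
print(match(TRIPLES, s="Rembrandt"))
```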
--

## Connecting to the Linked-Open-Data Cloud

- example: Gogol’s „Женитьба“/„Marriage“ (1842)
  - has a [Wikipedia article](https://en.wikipedia.org/wiki/Marriage_%28play%29)
  - has a Wikidata item with facts: https://www.wikidata.org/wiki/Q4179360
- first performances of Russian plays: http://tinyurl.com/y8r337pa
- first performances of German plays: http://tinyurl.com/y9vga68j

--

## Location of first performances in GerDraCor
Via Wikidata:[P4647](https://www.wikidata.org/wiki/Property:P4647) („location of first performance“).
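A query for this property can be sketched as follows. The snippet only builds the SPARQL text and the request URL for the public Wikidata query service; it does not send the request, and it makes no claim about what the service returns. The item ID is the one given for Gogol's *Marriage*; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def first_performance_query(play_qid):
    """Build a SPARQL query asking for the P4647 value(s) of one play."""
    return f"""
SELECT ?place ?placeLabel WHERE {{
  wd:{play_qid} wdt:P4647 ?place .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
""".strip()


# Q4179360 is the Wikidata item for Gogol's "Marriage".
query = first_performance_query("Q4179360")
request_url = WIKIDATA_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
print(query)
print(request_url)
```

Replacing `wd:Q4179360` with a variable and adding a condition that selects the plays of a corpus would turn this into a query over many plays at once; how the corpus plays are best selected is left open here.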
--

## Berlin as location of first performances
Via Wikidata:[P4647](https://www.wikidata.org/wiki/Property:P4647) („location of first performance“).
--

## Representativity of a Corpus

| Play | Number of Wikipedia language versions |
|:----:|:---------------------------:|
| Chekhov: The Cherry Orchard | 35 |
| Gogol: The Government Inspector | 35 |
| Chekhov: The Seagull | 34 |
| Chekhov: Three Sisters | 34 |
| Chekhov: Uncle Vanya | 31 |
| Pushkin: Boris Godunov | 13 |
| Gorky: The Lower Depths | 12 |
| Griboyedov: Woe from Wit | 11 |
| Chekhov: Ivanov | 10 |
| Gogol: Marriage | 10 |
The 10 most popular plays worldwide, measured by their number of Wikipedia language versions (RusDraCor currently includes 161 plays).

--

## Little Hands-On with Metadata

```https://dracor.org/api/corpora/rus/metadata.csv```

---

# Chapter 5.

## Research

--

## Distant-Reading Showcase
*Distant-Reading Showcase* (DHd2016, Leipzig 🏆).
Download via Figshare. DOI: [10.6084/m9.figshare.3101203.v2](https://dx.doi.org/10.6084/m9.figshare.3101203.v2).
--

### Small-World Phenomenon in Russian Drama
Work by Evgeniya Ustinova.
--

### Topics in Spoken Text of Russian Drama
Work by Irina Pavlova ([abstract](https://eadh2018.exordo.com/files/papers/158/final_draft/Pavlova___Fischer_-_Topic_Modeling_-_EADH_conference.pdf)).
--

### Stage Directions
Work by Daria Maximova ([abstract](https://eadh2018.exordo.com/files/papers/79/final_draft/Stage_Directions_for_EADH_Conference.pdf)).
--

### Female/Male Character Word Usage
Work by Skorinkin/Fischer/Palchikov ([article](http://www.dialog-21.ru/media/4332/skorinkind.pdf)).
---

# Chapter 6.

## Didactics

--

## Brecht Beats Shakespeare!
*Brecht Beats Shakespeare* (DHd2018, Cologne 🏆, and DH2018, Mexico City).
Download via Figshare. DOI: [10.6084/m9.figshare.5926363.v1](https://doi.org/10.6084/m9.figshare.5926363.v1).
--

## Gamification!
„Brecht Beats Shakespeare“ (released 2018, img source: [@angelikah](https://twitter.com/angelikah/status/1012100869301702657)).
Full-res version here: https://doi.org/10.6084/m9.figshare.6667424.v1
--

What to do in case of …

… a power outage at a data-analysis bootcamp?

--

### *ezlinavis*
**ezlinavis** in action: https://ezlinavis.dracor.org/
---

## Summary

in general:

- **Programmable Corpora** as a concept for research and teaching

in particular:

- **dracor.org**: research infrastructure for European drama:
  - reliable, expandable corpora in multiple languages (→ comparative approaches)
  - open-source project: lurk, fork, send pull requests

--

Thanks.