![workshop logo cropped](images/workshop-logo-cropped.png)

### Programmable Corpora

A New Infrastructural Concept for Digital Literary Studies
[Frank Fischer](https://www.hse.ru/en/org/persons/182492735) ¹ ² · [Boris Orekhov](http://nevmenandr.net/bo.php) ¹

¹ Higher School of Economics, Moscow
² DARIAH-EU
[Slavic DH Workshop](https://cdh.princeton.edu/events/2019/05/slavic-dh-workshop-russian-literary-studies-digital-age/) · Princeton University 🇺🇸 · 28 May 2019, 10:45 a.m.–12:15 p.m.

--

### TOC

1. TEI Corpus Work
2. This is going to be APIc!
3. Programmable Corpora
4. Linked Open Data
5. Research
6. Didactics

---

# Prologue.

--

## In Light of Current Events
* Nan Z. Da: The Computational Case against Computational Literary Studies. *Critical Inquiry* 45:3 (Spring 2019), pp. 601–639. DOI:[10.1086/702594](https://doi.org/10.1086/702594).
* a controversial debate, to say the least; one major shortcoming is the total neglect of non-American scholarship in the field, cf. the [statement of ADHO’s SIG-DLS](https://culturalanalytics.org/2019/05/response-by-the-special-interest-group-on-digital-literary-stylistics-to-nan-z-das-study/)
* a point well made and generally accepted concerns the replication crisis: „the process of requesting complete, runnable codes and quantitative results […] took me **nearly two years**“

--

## Previously on …

### Drama Studies (2014–2018)

- objective: digital analysis of corpora of dramatic texts
- Python super script [***dramavis***](https://github.com/lehkost/dramavis) (2014–2018)
- all-in-one approach: maintenance costs too high
- although script and corpora are openly accessible, it was still hard to reproduce our findings or to reuse our code for one’s own experiments

--

### Adaptations?

![Phèdre](images/phedre-it-works.png)
https://twitter.com/christof77/status/772370040016568320
---

# Chapter 1.

### TEI Corpus Work

--

## **DraCor**

- sustainable research projects require (and generate) infrastructure
- DraCor: Drama Corpora Platform (https://dracor.org/)
- two in-house corpora:
  - German Drama Corpus, GerDraCor (1730s–1930s)
  - Russian Drama Corpus, RusDraCor (1740s–1940s)

--

## Frontend dracor.org

![DraCor-Frontpage](images/dracor-frontpage.png)
https://dracor.org/ (public beta!)
--

## GerDraCor + RusDraCor

- based on reliable sources (for Russian plays: lib.ru, rvb.ru, ilibrary.ru, …)
- page numbers map to the scans of the corresponding pages of the edition used
- converted to TEI-P5
- added ``` ``` to assemble all speakers, even the functional or ephemeral characters not featured in cast lists
- added Wikidata IDs for authors and plays
- example: Pushkin’s [*Boris Godunov*](https://dracor.org/rus/pushkin-boris-godunov)

--

## ShakeDraCor

- derived from the [Folger Shakespeare Library](https://www.folgerdigitaltexts.org/)
- simple XQuery script to adapt the corpus to the platform
- example: [*Hamlet*](https://dracor.org/shake/hamlet)

--

## SpanDraCor

- „Biblioteca Electrónica Textual del Teatro en Español de 1868–1936“ (BETTE)
- 25 plays by 8 authors
- forked from their [GitHub repo](https://github.com/GHEDI/BETTE), minimal adaptations to connect to DraCor
- our visualisations and API-fication led to the correction of hitherto unknown (structural) bugs
- example: Valle Inclán’s [*Águila de blasón*](https://dracor.org/span/valle-aguila)

--

## GreekDraCor

- derived from the [Perseus Digital Library](http://www.perseus.tufts.edu/hopper/)
- 39 plays of Greek antiquity downloaded from [their website](http://www.perseus.tufts.edu/hopper/opensource/download) (licenced under CC BY-SA 3.0)
- Betacode converted to Unicode (using the Python package [betacode 0.2](https://pypi.org/project/betacode/))
- example: Aristophanes’s [*Frogs*](https://dracor.org/greek/aristophanes-frogs)

--

## Coming Soon

- IbsDraCor
- HolDraCor (derived from the digital edition of [*Ludvig Holbergs skrifter*](http://holbergsskrifter.dk/holberg-public/view?docId=adm/HolbergsWritings.xml&sort=category))
- SweDraCor (derived from [*Dramawebben*](https://litteraturbanken.se/dramawebben))
- BashDraCor
- ItaDraCor (derived from the [*Letteratura teatrale nella Biblioteca italiana*](http://www.bibliotecaitaliana.it/))
- …?

---

# Chapter 2.
## This is going to be APIc! ¹ ²
¹ API: Application Programming Interface
² Peter Handke’s Law: [„One pun per text is permitted.“](https://www.welt.de/119569452)
--

## DraCor Technology Stack

![DraCor technology stack](images/dracor-drawio.svg)
All repos are open source: https://github.com/dracor-org
--

## DraCor API (1/2)

- provides metadata, bespoke excerpts and various metrics
- live documentation via Swagger: https://dracor.org/documentation/api/

--

## DraCor API (2/2)

- list of available corpora and their content (list of all plays)
- metadata and network data for all plays in CSV or GEXF format
- list of speaking characters per play (usually much more comprehensive than the *dramatis personae*!)
- characters per segment of a play (*dynamic graphs*!)
- spoken text (total/female/male, or per character)
- stage directions
- SPARQL endpoint
- …

--

## Example

#### Number of characters per play in chronological order (1/2)

```R
library(data.table)
library(ggplot2)

# read the RusDraCor metadata table straight from the API
rusdracor <- fread("https://dracor.org/api/corpora/rus/metadata.csv")

# one point per play: year on the x-axis, number of speakers on the y-axis
ggplot(rusdracor[], aes(x = year, y = numOfSpeakers)) + geom_point()
```
Very simple R script.
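For those connecting via Python rather than R, a rough counterpart might look like the sketch below. It uses only the standard library and assumes the same column names as the R script (`year`, `numOfSpeakers`); the helper names and the sample rows are invented for illustration and are not corpus data.

```python
import csv
import io
import urllib.request

# URL taken from the R script above
METADATA_URL = "https://dracor.org/api/corpora/rus/metadata.csv"


def fetch_metadata_csv(url=METADATA_URL):
    """Download the metadata CSV as text (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def speakers_by_year(csv_text):
    """Return (year, numOfSpeakers) pairs in chronological order."""
    pairs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            pairs.append((int(row["year"]), int(row["numOfSpeakers"])))
        except (KeyError, ValueError, TypeError):
            # skip rows with a missing or non-numeric value
            continue
    return sorted(pairs)


# Invented sample rows, only to show the shape of the data
SAMPLE = "name,year,numOfSpeakers\nplay-a,1830,12\nplay-b,1790,7\nplay-c,,5\n"
print(speakers_by_year(SAMPLE))
```

Replacing `SAMPLE` with `fetch_metadata_csv()` would run the same function on the live table.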
--

#### Number of characters per play in chronological order (2/2)

![Number of characters](images/num-of-speakers-gerdracor.png)
Output in RStudio.
--

## Little Hands-On with Metadata

`https://dracor.org/api/corpora/rus/metadata.csv`

---

# Chapter 3.

## Programmable Corpora

--

## The Concept

- analogy to the IT news platform *ProgrammableWeb* (slogan: „APIs, Mashups and the Web as Platform“)
- corpora as comparable objects that offer functions themselves and are linked to other data sources (via LOD)
- FAIR principles, better reproducibility
- you can connect to the platform at any level (TEI, API, R, Python, SPARQL, …)

--

### Pushkin’s *Boris Godunov* (1/2)

![betweenness](images/betweenness-pushkin-boris-godunov.png)
Label size correlates with betweenness centrality. Gavrila Pushkin – a side character – in the middle.
(Data source: GEXF file from https://dracor.org/rus/pushkin-boris-godunov.)
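To make the measure behind the label sizes concrete, here is a minimal sketch of betweenness centrality for an undirected, unweighted network, written in plain Python after Brandes' algorithm. The five-node toy network is invented and is not the *Boris Godunov* graph; it only shows how a node that bridges otherwise separate characters ends up with the highest score.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalised betweenness centrality (Brandes' algorithm) for an
    undirected, unweighted graph given as {node: set_of_neighbours}."""
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # breadth-first search that counts shortest paths from `source`
        order = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # accumulate dependencies, farthest nodes first
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # every undirected pair was counted from both ends
    return {v: s / 2 for v, s in score.items()}


# Invented toy network: C connects the pair A, B with D and E
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

centrality = betweenness(adjacency)
print(centrality)
```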
--

### Pushkin’s *Boris Godunov* (2/2)

![Boris Godunov (dynamic graph)](images/pushkin-boris-godunov.gif)
**Dynamic graph**, generated with the [**ndtv**](https://cran.r-project.org/web/packages/ndtv/index.html) package. Data comes directly from the **DraCor API**.
Script by Ivan Pozdniakov ([source code on RPubs.com](https://rpubs.com/Pozdniakov/godunov)).
-- ## Shiny App ![Shiny App](images/shiny-kaethchen.png)
https://shiny.dracor.org/ (by Ivan Pozdniakov).
-- ## Boris Yarkho (1889–1942)
![Boris Yarkho (cropped)](images/boris-yarkho.jpg)
(img source: http://urokiistorii.ru/article/52560)
-- Boris Yarkho: ***Speech Distribution in Five-Act Tragedies (A Question of Classicism and Romanticism)*** (written 1935–1938).
Ed. by Frank Fischer, Marina Akimova and Boris Orekhov.
In: *Journal of Literary Theory* 13:1 ([*„**Moscow Formalism and Literary History**“*](https://www.degruyter.com/view/j/jlt.2019.13.issue-1/issue-files/jlt.2019.13.issue-1.xml)). De Gruyter 2019, pp. 13–76.
DOI:[**10.1515/jlt-2019-0002**](https://doi.org/10.1515/jlt-2019-0002) -- ## Speech Distribution According to Yarkho
![Shakespeare vs Romanticists 2, speech distribution on average](images/yarkho-shakespeare-vs-romanticists-2.png)
Shakespeare vs moderate romanticists (on average): „romantic drama is a return to Shakespeare“.
-- ## Speech Distribution According to Yarkho
![Dmitry Donskoy, speech distribution](images/speech-distribution-ozerov-dmitrij-donskoj.png)
Implemented in **DraCor** (example: Ozerov’s *Dmitry Donskoy*):
https://dracor.org/rus/ozerov-dmitrij-donskoj#speech
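A minimal sketch of how such a distribution could be computed, assuming the measure is the share of segments (scenes) in which exactly *k* characters speak. This reading of Yarkho's measure, the function name and the toy play are assumptions made for illustration; it is not the DraCor implementation.

```python
from collections import Counter


def speech_distribution(segments):
    """Map each speaker count k to the percentage of segments
    in which exactly k distinct characters speak."""
    if not segments:
        return {}
    counts = Counter(len(set(speakers)) for speakers in segments)
    total = len(segments)
    return {k: 100 * n / total for k, n in sorted(counts.items())}


# Invented toy play: each inner list names the speakers of one scene
toy_segments = [["A"], ["A", "B"], ["A", "B"], ["A", "B", "C"]]
print(speech_distribution(toy_segments))
```

The input shape mirrors the "characters per segment" data mentioned among the API functions, so a real play could be fed in once its segments are reduced to lists of speaker names.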
--- # Chapter 4.
## Linked Open Data -- ![LOD cup](images/lod-cup.jpg)
(img source: https://www.w3.org/DesignIssues/LinkedData.html)
-- ### Example for an RDF Representation
- taken from the website *The Programming Historian*:
  - Matthew Lincoln: [Using SPARQL to access Linked Open Data](https://programminghistorian.org/lessons/graph-databases-and-SPARQL) (publ. 2015, released under CC-BY 4.0)

--

![Rembrandt: De Nachtwacht](https://upload.wikimedia.org/wikipedia/commons/0/0b/Rembrandt_van_Rijn-De_Nachtwacht-1642.jpg)

- Rembrandt: De Nachtwacht (The Nightwatch), 1642

(img source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Rembrandt_van_Rijn-De_Nachtwacht-1642.jpg))

--

![Vermeer: A Woman Holding a Balance](https://upload.wikimedia.org/wikipedia/commons/7/72/Woman-with-a-balance-by-Vermeer.jpg)

- Vermeer: A Woman Holding a Balance, 1662/1663

(img source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Woman-with-a-balance-by-Vermeer.jpg))
-- ![graph visualisation of RDF-encoded information](https://programminghistorian.org/images/graph-databases-and-SPARQL/sparql01.svg)
Graph visualisation of RDF-encoded information: the arrows indicate the direction of the predicate, i.e. *The Nightwatch* was created by Rembrandt and not the other way around.
(img source: [*The Programming Historian*](https://programminghistorian.org/lessons/graph-databases-and-SPARQL))
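The directedness of such statements can be sketched with plain subject–predicate–object triples. The snippet below is only an illustration in Python: the predicate wording and the helper `objects_of` are made up, and no RDF library or real vocabulary is involved.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The two paintings from the slides; predicate wording is illustrative only
triples = [
    Triple("The Nightwatch", "was created by", "Rembrandt"),
    Triple("A Woman Holding a Balance", "was created by", "Vermeer"),
]


def objects_of(subject, predicate, store):
    """Follow the arrow from `subject` along `predicate`."""
    return [t.obj for t in store
            if t.subject == subject and t.predicate == predicate]


# The arrow runs from painting to painter ...
print(objects_of("The Nightwatch", "was created by", triples))
# ... and asking in the opposite direction finds nothing
print(objects_of("Rembrandt", "was created by", triples))
```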
--

## Connecting to the Linked-Open-Data Cloud

- example: Gogol’s „Женитьба“/„Marriage“ (1842)
  - has a [Wikipedia article](https://en.wikipedia.org/wiki/Marriage_%28play%29)
  - has a Wikidata item with facts: https://www.wikidata.org/wiki/Q4179360
- first performances of Russian plays: http://tinyurl.com/y8r337pa
- first performances of German plays: http://tinyurl.com/y9vga68j

--

## Location of first performances in GerDraCor

![ezlinavis screenshot](images/first-performances-overall.png)
Via Wikidata:[P4647](https://www.wikidata.org/wiki/Property:P4647) („location of first performance“).
-- ## Berlin as location of first performances ![ezlinavis screenshot](images/first-performances-berlin.png)
Via Wikidata:[P4647](https://www.wikidata.org/wiki/Property:P4647) („location of first performance“).
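A query of this kind might be assembled as in the following sketch. It is not the query behind the shortened links above: it simply asks the public Wikidata SPARQL endpoint for any item whose P4647 value is Berlin (Q64), without restricting the results to plays in GerDraCor. The function names are invented for illustration.

```python
from urllib.parse import urlencode

# Public Wikidata query service
ENDPOINT = "https://query.wikidata.org/sparql"


def first_performance_query(place_qid, limit=20):
    """Build a SPARQL query for works first performed at `place_qid`."""
    return f"""
SELECT ?work ?workLabel WHERE {{
  ?work wdt:P4647 wd:{place_qid} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {limit}
""".strip()


def request_url(query):
    """URL that would return the results as JSON (not fetched here)."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = first_performance_query("Q64")  # Q64 is Berlin on Wikidata
print(query)
print(request_url(query))
```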
--

## Representativity of a Corpus

| Play | Number of Wikipedia language versions |
|:----:|:---------------------------:|
| Chekhov: The Cherry Orchard | 35 |
| Gogol: The Government Inspector | 35 |
| Chekhov: The Seagull | 34 |
| Chekhov: Three Sisters | 34 |
| Chekhov: Uncle Vanya | 31 |
| Pushkin: Boris Godunov | 13 |
| Gorky: The Lower Depths | 12 |
| Griboyedov: Woe from Wit | 11 |
| Chekhov: Ivanov | 10 |
| Gogol: Marriage | 10 |

The 10 most popular plays worldwide, measured by their number of Wikipedia language versions (RusDraCor currently includes 161 plays).

--

## Little Hands-On with Metadata

`https://dracor.org/api/corpora/rus/metadata.csv`

---

# Chapter 5.

## Research

--

## Distant-Reading Showcase

![465 drama networks at a glance](images/distant-reading-showcase-poster.jpg)
*Distant-Reading Showcase* (DHd2016, Leipzig 🏆).
Download via Figshare. DOI: [10.6084/m9.figshare.3101203.v2](https://dx.doi.org/10.6084/m9.figshare.3101203.v2).
--

### Small-World Phenomenon in Russian Drama

![Small-world phenomenon in Russian drama](images/rusdracor-small-world-preview.png)
Work by Evgeniya Ustinova.
--

### Topics in Spoken Text of Russian Drama

![Topics](images/rusdracor-topics-per-author.png)
Work by Irina Pavlova ([abstract](https://eadh2018.exordo.com/files/papers/158/final_draft/Pavlova___Fischer_-_Topic_Modeling_-_EADH_conference.pdf)).
--

### Stage Directions

![Stage directions](images/rusdracor-didascalie-all-pos.png)
Work by Daria Maximova ([abstract](https://eadh2018.exordo.com/files/papers/79/final_draft/Stage_Directions_for_EADH_Conference.pdf)).
--

### Female/Male Character Word Usage

![Female/Male Character Word Usage](images/rusdracor-craigs-zeta.jpg)
Work by Skorinkin/Fischer/Palchikov ([article](http://www.dialog-21.ru/media/4332/skorinkind.pdf)).
---

# Chapter 6.

## Didactics

--

## Brecht Beats Shakespeare!

![Drama quartet card game](images/brecht-shakespeare.jpg)
*Brecht Beats Shakespeare* (DHd2018, Cologne 🏆, and DH2018, Mexico City).
Download via Figshare. DOI: [10.6084/m9.figshare.5926363.v1](https://doi.org/10.6084/m9.figshare.5926363.v1).
-- ## Gamification! ![card game](images/card-game-dh2018.jpg)
„Brecht Beats Shakespeare“ (released 2018, img source: [@angelikah](https://twitter.com/angelikah/status/1012100869301702657)).
Full-res version here: https://doi.org/10.6084/m9.figshare.6667424.v1
--

What to do in case of …
… a power outage at a data-analysis bootcamp?

![card game](images/boot-camp-card-game.jpg)

--

![a card](images/card-andromaque.jpg) ![a card](images/card-nora.jpg) ![a card](images/card-miss-julie.jpg)

--

![a card](images/card-hamlet.jpg) ![a card](images/card-faust.jpg) ![a card](images/card-grabbe.jpg)

--

### *ezlinavis*

![ezlinavis screenshot](images/ezlinavis-screenshot-tolstoy.png)
**ezlinavis** in action: https://ezlinavis.dracor.org/
---

## Summary

in general:

- **Programmable Corpora** as a concept for research and teaching

in particular:

- **dracor.org**: research infrastructure for European drama:
  - reliable, expandable corpora in multiple languages (→ comparative approaches)
  - open-source project: lurk, fork, send pull requests

--

Thanks.
https://dracor.org/
#ProgrammableCorpora