Undergraduate student uses Savio to perform Natural Language Processing on Fanfiction

August 12, 2016

Smitha Milli, a fourth-year Electrical Engineering and Computer Science (EECS) undergraduate student at UC Berkeley, is collaborating with David Bamman, Assistant Professor at the Berkeley School of Information, to perform Natural Language Processing (NLP) on fanfiction texts. Milli is using Professor Bamman’s Faculty Computing Allowance to run the computation for this research on Berkeley Research Computing’s (BRC) High Performance Computing (HPC) cluster, Savio. With Bamman’s help, Milli has analyzed over 5 million fanfiction stories -- more than 50 billion words in total, or a body of text equivalent in size to “about 10 percent of all of Google books data storage,” according to Bamman. From these analyses, Milli has culled fascinating information about the interaction between fan and story, and about the unmet desires of readers and media consumers reflected in fanfiction texts. Milli is now preparing to present her research findings at the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), to be held in Austin, Texas this fall.

What is Fanfiction?

Has a work of literature ever left you unsatisfied, wanting the story to continue, to highlight a secondary character or perspective, or to take on new meaning as it is transposed into a different literary genre? For the millions of fanfiction writers across the world who want to expand on the work of their favorite authors -- including Jane Austen’s Pride and Prejudice, J.R.R. Tolkien’s Lord of the Rings, Sir Arthur Conan Doyle’s The Adventures of Sherlock Holmes, and J.K. Rowling’s Harry Potter -- the answer is yes. On an internet platform conducive to the quirky, parodic, romantic, and horrific bloomings of the imagination, fanfiction writers bring their unfulfilled desires to life. For example, a fanfiction writer pen-named “sohypothetically” translated Edgar Allan Poe’s “The Masque of the Red Death,” a short story that describes a grisly epidemic and a prince’s attempt to keep himself and his entourage from contracting the disease, into the language of Dr. Seuss, producing this sardonic verse on the fanfiction.net site:

Original Work

The Masque of the Red Death

“…The scarlet stains upon the body and especially upon the face of the victim, were the pest ban which shut him out from the aid and from the sympathy of his fellow-men. And the whole seizure, progress and termination of the disease, were the incidents of half an hour.

But the Prince Prospero was happy and dauntless and sagacious. When his dominions were half depopulated, he summoned to his presence a thousand hale and light-hearted friends from among the knights and dames of his court, and with these retired to the deep seclusion of one of his castellated abbeys…”

-Edgar Allan Poe

Fanfiction Rendering

Tick Tock Goes the Clock

“…We fell one by one with our faces stained red,
We corpses piled up behind each house and shed.
Our Prince, it was said, was dauntless indeed.
He cared nary a bit for Death's latest great deed.

Despite all the quaking and fearful bemoaning
He grandly proclaimed to all his friends who were roaming
‘Join me! Behind my castle's strong iron gates.
We'll weld them all shut until the Red Death abates.
We'll party and dance and drink and make merry
Until one and all forget it's so very scary…’”


Inspired by both classic and modern literature, movies, television series, anime and comics, among other media and entertainment sources, fan-authored work represents a vibrant network of human communication. These are the “stories that everyday readers and writers are using and authoring,” Bamman says, and that reflect, in the changes they make from the original work, Milli continues, “what mainstream literature is omitting that fans are interested in.” 

Computational Analysis of Fanfiction

According to Bamman, “NLP is a research area focused on the computational analysis of human language [both written and spoken]” and includes the development and application of “algorithms for reasoning about linguistic phenomena like the syntactic structure of sentences; broader applications including speech recognition, automatic translation [like Google Translate], and question answering [like Apple’s Siri]; and other quantitative analyses of language that rely to some degree on computation.”

Both Bamman and Milli have an interest in “bringing NLP to underserved domains like literary text,” Bamman says, and fanfiction is a great fit because it’s a “massive enough corpora to do useful, large scale analysis,” but has little to no “prior work or attention.” Their computational process, as Milli describes, begins by running both fanfiction and canonical text (the latter taken from the online public library Project Gutenberg) through a natural language processing pipeline Bamman developed, called BookNLP. From both the fanfic and the original, the software pulls out the characters, the things they say, verbs related to them (e.g., “Mary cooks”), and other salient information related to those characters. It then pairs characters with the same name between the stories, and from there Milli and Bamman perform a series of post-analyses of the data on Savio.
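
The name-pairing step described above can be illustrated with a minimal Python sketch. It is not BookNLP's code or output format: the record layout (dicts with `name` and `verbs` keys), the helper names, and the sample data are all assumptions made up for illustration, and real name matching would likely need to handle aliases and nicknames.

```python
from collections import Counter


def normalize(name):
    """Lowercase and collapse whitespace so 'Mary' and ' mary ' compare equal."""
    return " ".join(name.lower().split())


def pair_characters(canon, fanfic):
    """Pair character records that share a normalized name.

    Both arguments are lists of dicts with 'name' and 'verbs' keys
    (a hypothetical layout, not BookNLP's actual output).
    """
    fanfic_by_name = {normalize(c["name"]): c for c in fanfic}
    pairs = []
    for character in canon:
        match = fanfic_by_name.get(normalize(character["name"]))
        if match is not None:
            pairs.append((character, match))
    return pairs


def verb_shift(canon_char, fanfic_char):
    """Return fanfic count minus canon count for each verb whose count differs."""
    canon_counts = Counter(canon_char["verbs"])
    fanfic_counts = Counter(fanfic_char["verbs"])
    verbs = set(canon_counts) | set(fanfic_counts)
    return {
        v: fanfic_counts[v] - canon_counts[v]
        for v in verbs
        if fanfic_counts[v] != canon_counts[v]
    }


# Invented sample data, for illustration only.
canon = [
    {"name": "Mary", "verbs": ["cooks", "reads"]},
    {"name": "John", "verbs": ["walks"]},
]
fanfic = [
    {"name": "mary", "verbs": ["cooks", "fights", "fights"]},
    {"name": "Kitty", "verbs": ["sings"]},
]

pairs = pair_characters(canon, fanfic)
for canon_char, fanfic_char in pairs:
    print(canon_char["name"], verb_shift(canon_char, fanfic_char))
```

Here only “Mary” appears in both lists, so she is the only paired character; comparing her verb counts is a toy version of the kind of post-analysis that could follow the pairing.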

Milli emphasizes that “running this pipeline takes a lot of compute power and time, so it is essential for us to be able to parallelize our work on the [Savio] cluster.” BRC’s consulting staff “was also incredibly helpful,” Milli continues, “and answered my emails and questions always within a couple of hours. They’re doing a good job of taking in feedback from users, and making the cluster easier and easier to access.”

Results, Influence and Continued Support of Future Research Goals

Milli and Bamman have shown that secondary characters typically receive greater attention in fanfiction texts than in the original works, and that female characters are more prominent and play stronger roles in fanfiction texts, possibly a result of greater female fan-authorship. Finally, the ability of fan-authors to release their writings chapter by chapter, and subsequently receive comments from their fan-readers, not only promotes community but also allows analysis of evolving fan responses to evolving storylines, thereby yielding a possible predictive model of reader reactions to narrative over time.

Beyond these results, their research produces two principal accomplishments, according to Milli. First, it gives insight into popular, modern literature and media sources by providing computational analysis of a large body of text that, although dealing explicitly with the interests of today’s readers and writers, is generally overlooked by the NLP community. Second, their work informs scholarly research on fanfiction, confirming qualitative conjectures about fanfiction trends with empirical data and complementing qualitative studies with a unique perspective on fanfiction texts developed through application of metadata and computational tools over dozens of stories at once. Bamman says, for example, “the varying attention given to different characters across different stories would be challenging to uncover without these methods.” Milli and Bamman are hopeful that their work, and Milli’s presentation of her findings at the 2016 EMNLP Conference, will open doors to more scholarly inquiry into fanfiction, and influence “other researchers to look at this kind of data,” Milli says.

Milli and Bamman’s collaboration with BRC will continue. In reference to his access to Savio, Bamman says, “There’s no question we’ll keep using the free Faculty Computing Allowance. That has been essential for this work, and it made ramping up this research possible. Having the ability to redistribute that allowance to other students in my research group is hugely important, and will be a continued strategy to use going forward.”

Milli is also looking toward the next stage of her research project, which involves a different algorithm that requires training a neural net, “a type of machine learning model,” Milli says, to extract “a description, or representation, of a word by the words it’s generally found next to within a text.” For example, Bamman continues, one “can learn a representation of a character in a story from the kinds of actions or things that have happened to them,” and then use the representation “to measure how similar characters are between stories, or as features for a different kind of downstream classification task.” To expedite the computationally intensive network training process, Milli needs the Graphics Processing Units (GPUs) on Savio; however, TensorFlow, the open-source software she wants to work with, currently requires custom installation on the cluster. Milli says a BRC consultant was able to do this for her, “but still it would be really effective to have a solution immediately available.” BRC is now exploring the possibility of using Docker containers, in which applications like TensorFlow can be pre-configured, in the cluster environment. “A TensorFlow Docker image would install really easily,” Milli says. “That would solve everything.”
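
The idea of representing a word “by the words it’s generally found next to” can be shown with a small Python sketch. It is a deliberately simplified stand-in: it counts neighboring words and compares the counts with cosine similarity, rather than training a neural net in TensorFlow as Milli plans to do. The function names, the window size, and the sample sentence are assumptions chosen for illustration.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=1):
    """Count the words found within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented sample text, for illustration only.
tokens = "the prince danced while the prince laughed and the knight danced".split()

prince = context_vector(tokens, "prince")
knight = context_vector(tokens, "knight")
similarity = cosine(prince, knight)
print(prince, knight, round(similarity, 3))
```

Two characters that tend to appear next to the same words end up with similar vectors. A trained neural model learns denser representations, but the comparison step (measuring how similar two characters are) works on the same principle.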

To learn more about Savio, BRC Consulting, or the application of Docker within an HPC cluster, contact research-it@berkeley.edu. And congratulations, Smitha Milli, on your paper’s acceptance to the 2016 EMNLP Conference!