
Conversation Datasets and Learning Character Styles from Movie Dialogues

As Artificial Intelligence continues to push the boundaries of cognition, it takes on a challenge that we humans handle naturally: understanding and responding in natural language. Human conversations are incredibly rich in content. The foundation of the information conveyed is laid by the words at face value, tempered by prosodic features such as tone, pitch and volume, by the power difference between the two speakers, and by the emotions and attitudinal disposition hinted at through facial expression, eye contact, body language and even the time delay of the response. This rich, turn-by-turn content makes conversations particularly interesting compared to monologues.

The complex, multi-modal nature of dialogue calls for a multi-disciplinary approach to analysis: linguists, psychologists and machine learning researchers come together and draw on existing research in conversation analysis, emotion analysis and natural language processing. To do that well, research must be reproducible, which is why publicly available data is sought after. Outside of academia, publicly available data is the starting point for data scientists and machine learning practitioners building applied machine learning systems.

The authors of [1] have conducted an extensive review of the conversation datasets available for building dialogue systems. For each dataset, they record its size and source, and annotate whether the conversation is:

  • Written / Spoken / Multi-Modal (i.e., visual modality – facial expression etc.)
  • Human to Human / Human to Machine
  • Transcribed from an actual, natural conversation, as opposed to one where participants were hired to speak to a machine
  • From works of fiction: Movies or novels
  • Spontaneous or constrained (i.e., having a task to work on and goal-driven, like a topical debate or planning of routes)
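
To make these annotation dimensions easier to work with, here is a minimal sketch of how they could be stored and filtered in Python. It is only an illustration: the class names, field names and the three catalog entries are placeholders I made up, not datasets or labels taken from [1].

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    WRITTEN = "written"
    SPOKEN = "spoken"
    MULTI_MODAL = "multi-modal"


class Participants(Enum):
    HUMAN_HUMAN = "human-human"
    HUMAN_MACHINE = "human-machine"


@dataclass(frozen=True)
class DatasetAnnotation:
    """One row of a dataset catalog, mirroring the dimensions listed above."""
    name: str                  # placeholder name, not a real corpus
    modality: Modality
    participants: Participants
    natural: bool              # transcribed from an actual conversation
    fictional: bool            # taken from movies or novels
    spontaneous: bool          # False means constrained / task-driven


def find_datasets(catalog, **criteria):
    """Return the entries whose fields equal every given criterion."""
    return [
        entry for entry in catalog
        if all(getattr(entry, field) == value for field, value in criteria.items())
    ]


# Hypothetical entries, for illustration only.
CATALOG = [
    DatasetAnnotation("example_film_scripts", Modality.WRITTEN,
                      Participants.HUMAN_HUMAN,
                      natural=False, fictional=True, spontaneous=False),
    DatasetAnnotation("example_phone_calls", Modality.SPOKEN,
                      Participants.HUMAN_HUMAN,
                      natural=True, fictional=False, spontaneous=True),
    DatasetAnnotation("example_route_planner", Modality.SPOKEN,
                      Participants.HUMAN_MACHINE,
                      natural=False, fictional=False, spontaneous=False),
]

fiction_names = [d.name for d in find_datasets(CATALOG, fictional=True)]
spoken_human = find_datasets(CATALOG, modality=Modality.SPOKEN,
                             participants=Participants.HUMAN_HUMAN)
```

Keeping each dimension as its own field means several criteria can be combined in one query, which is usually what you want when shortlisting corpora for a project.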

I would highly recommend this review to anyone looking for conversation or dialogue datasets; the link to the review is in the references section.

For the rest of this blog post, let’s focus on an application in which a film corpus is used to learn characters’ speaking styles. I find films interesting because a character’s speaking style is usually consistent across scenes, unlike in natural, spontaneous conversations, where speakers tend to mimic the style of the more powerful speaker to build rapport.

The authors of [2] built a system that automatically generates dialogue based on film characters, using hundreds of film scripts from the Internet Movie Script Database website, and they have released the corpus data. They used external tools to extract distinctive features from the transcripts. I have chosen a select few tools which may be of interest:

After these features are extracted, they are fed into the Personage architecture described in [4]. The implementation details are complex and out of scope for this post; the architecture consists of multiple modules that select the syntax, aggregate sentences, insert pragmatic markers and make choices about the lexical structure.
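
To give a feel for what "extracting features from the transcripts" can mean in practice, here is a minimal sketch that computes a few per-character style statistics. It is not the pipeline used in [2] and it does not call any of the authors' tools: the two word lists are tiny hand-written stand-ins for a proper lexicon resource such as LIWC [3], and the sample lines are invented.

```python
import re
from collections import defaultdict

# Hypothetical mini word lists; a real system would use a full lexicon.
HEDGES = {"maybe", "probably", "perhaps", "think"}
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_features(lines):
    """Map each character to a few simple style statistics.

    `lines` is an iterable of (character, utterance) pairs.
    """
    by_character = defaultdict(list)
    for character, utterance in lines:
        by_character[character].append(utterance)

    features = {}
    for character, utterances in by_character.items():
        tokens = [tok for utt in utterances for tok in tokenize(utt)]
        n_tokens = len(tokens) or 1  # avoid division by zero
        n_utts = len(utterances)
        features[character] = {
            "avg_words_per_utterance": len(tokens) / n_utts,
            "first_person_rate": sum(t in FIRST_PERSON for t in tokens) / n_tokens,
            "hedge_rate": sum(t in HEDGES for t in tokens) / n_tokens,
            "question_rate": sum(u.strip().endswith("?") for u in utterances) / n_utts,
        }
    return features


# Invented lines, for illustration only.
SAMPLE = [
    ("A", "I think we should go. Maybe tomorrow?"),
    ("A", "I lost my keys."),
    ("B", "Move."),
    ("B", "Where is the map?"),
]

feats = style_features(SAMPLE)
```

Even with such crude counts, character A comes out as wordier, more self-referential and more hedging than character B, which is the kind of contrast a generator could then be tuned to reproduce.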

We conclude with an illustration of the differences in character styles. The authors include a table that compares generated dialogues side by side, in which the differences in character styles can be clearly perceived.


Utterances generated using Film Character Models. Table from [2].


Lastly, here are some short clips and quotes that give some context about the speaking styles of the characters.

Annie Hall: Alvy and Annie

Indiana Jones – Indy Quotes from Thought Catalog

“Fortune and glory, kid. Fortune and glory.”
“I think it’s time to ask yourself; what do you believe in?”
“…Indiana Jones. I always knew someday you’d come walking back through my door. I never doubted that. Something made it inevitable.”
“Professor of Archaeology, expert on the occult, and how does one say it… obtainer of rare antiquities.”
“Throw me the idol; I’ll throw you the whip!”

Pulp Fiction – Vincent Quotes from MovieQuoteDB

“You don’t **** with another man’s vehicle. It’s just against the rules.”
“So you’re gonna go out there, drink your drink, say ‘Goodnight, I’ve had a very lovely evening’, go home, jerk off. And that’s all you’re gonna do.”
“Oh man, I just shot Marvin in the face.”
“Chill out, man, I told you it was an accident. We probably went over a bump or something.”
“Why the **** didn’t you tell us there was someone in the bathroom? Slipped your mind? You forgot to mention someone’s in the bathroom with a goddamn handcannon?!”


[1] I.V. Serban, R. Lowe, P. Henderson, L. Charlin, J. Pineau, A Survey of Available Corpora for Building Data-Driven Dialogue Systems, (2015). (accessed February 22, 2018).

[2] M.A. Walker, G.I. Lin, J.E. Sawyer, An Annotated Corpus of Film Dialogue for Learning and Characterizing Character Style, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), 2012: pp. 1373–1378. (accessed December 11, 2017).

[3] J.W. Pennebaker, R.J. Booth, M.E. Francis, Operator’s Manual: Linguistic Inquiry and Word Count – LIWC2007, (2007) 1–11. doi:10.4018/978-1-60960-741-8.ch012.

[4] F. Mairesse, M.A. Walker, Towards personality-based user adaptation: Psychologically informed stylistic language generation, User Modeling and User-Adapted Interaction. 20 (2010) 227–278. doi:10.1007/s11257-010-9076-2.

