For my capstone project proposal, I am performing a textual analysis of Dilbert, a comic strip about an engineer in a bureaucratic corporate machine. Since the late 80s, the author has created over 10,000 strips across nearly 26 years. In an earlier assignment, I scraped the dialogue of each strip and downloaded each strip as a separate image; this data was used to produce a visualization indicating the relationship strength of each character pair. For my capstone project, I would like to extend this exercise to extract more interesting information from this data set and apply it to a creative application.
I currently have dialogue for each strip, but the dialogue is not associated with individual panels (there are three panels in a typical strip). Using Python image manipulation and optical character recognition (Tesseract) bindings, I intend to slice each strip into individual panels and then perform OCR to associate dialogue with a specific panel. The Levenshtein distance algorithm would be used to match the noisy OCR text against the ground truth of the scraped dialogue. Once this task has been completed, I can wipe the original dialogue from the strips’ image files. Then, using natural language processing techniques, I can compare the topics of an individual strip with other bureaucratic content, such as C-SPAN transcripts or United Nations assembly meeting minutes, and then insert this new text with a Dilbert-like font to effectively create a new strip.
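The matching step could be sketched roughly as follows. This is only an illustration under several assumptions: the OCR output for each panel is already available as a string, the sample text is made up rather than taken from a real strip, and the function names (levenshtein, normalized_distance, assign_lines_to_panels) are hypothetical. It also assumes roughly one line of dialogue per panel; panels with several speech bubbles would need their OCR text split per bubble first.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text):
    """Lowercase and collapse whitespace so layout differences do not count as edits."""
    return " ".join(text.lower().split())


def normalized_distance(a, b):
    """Edit distance scaled to the range 0..1 by the longer string's length."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def assign_lines_to_panels(panel_ocr, dialogue_lines):
    """For each scraped line, return the index of the panel whose OCR text is closest."""
    return [
        min(range(len(panel_ocr)),
            key=lambda p: normalized_distance(line, panel_ocr[p]))
        for line in dialogue_lines
    ]


# Made-up sample data: OCR output with typical character errors, one string per panel.
panel_ocr = ["I need a budgct for my projcct.", "Whv?", "So I can spcnd it."]
dialogue = ["I need a budget for my project.", "Why?", "So I can spend it."]
panel_indices = assign_lines_to_panels(panel_ocr, dialogue)
```

Scaling the distance by string length keeps a short line such as "Why?" from being pulled toward whichever panel happens to contain the least text.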
Part of my previous research looked into what other information I could extract from the comic strip’s dialogue. This includes my Looking Outwards 9 post, which showed examples of plotting how often each character appears in a given chapter of Les Miserables and noting the mood of the setting in which the character appeared. Another example in that post showed a visualization of common phrases or words spoken by various public speakers. At Golan’s suggestion, I have also looked into MALLET (MAchine Learning for LanguagE Toolkit), which provides a way to analyze large amounts of unlabeled text and group it into topics, which are determined by analyzing clusters of words that frequently occur together. This toolkit can be applied to bring out common themes and topics that occur throughout the Dilbert strip, and this data can then be used to determine what type of content can replace the strip’s original dialogue in a way that makes sense.
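To make the idea of choosing replacement content more concrete, here is a minimal sketch in Python. It is not MALLET and does not do topic modeling; it swaps in a much simpler stand-in, cosine similarity over raw word counts, just to show the shape of the comparison. The function names and the sample sentences are hypothetical, and in practice common words such as "the" would need to be filtered out before the scores mean much.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word occurrences in a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 means no shared words)."""
    dot = sum(a[word] * b[word] for word in set(a) & set(b))
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_replacement(strip_dialogue, candidates):
    """Pick the candidate passage whose vocabulary overlaps most with the strip."""
    strip_vector = word_counts(strip_dialogue)
    return max(candidates,
               key=lambda text: cosine_similarity(strip_vector, word_counts(text)))


# Made-up sample text standing in for a strip and for outside transcripts.
strip = "The budget meeting is delayed until the committee approves the budget."
candidates = [
    "The committee will now vote on the budget resolution.",
    "The weather today is sunny with light wind.",
]
chosen = best_replacement(strip, candidates)
```

A topic model would replace word_counts with a per-document topic distribution, but the final step of ranking outside passages against a strip would look much the same.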
A 140-character description of this project can be worded:
My proposal is to textually examine dialogue from every Dilbert comic strip and replace it with new content from outside sources.