Chong Han Chua | App Store Visualization | 31 January 2011
This project explores the possibility of doing an interesting visualization of icons from the Apple iTunes App Store.
This project was conceived when I saw the visualization of flags by colours. The idea of a set of graphics that all share a similar form, such as flags, amused me immensely. It then suddenly occurred to me that the icons on the iTunes App Store are of the same nature. Almost rectangular, with rounded corners, and usually vector graphics of some sort, they would be interesting to look at.
I guess in a way, this was largely a technical inquiry. This is the first time I wrote a crawler as well as a screen scraper. This is the first time I dealt with a data set so large that it takes almost forever to do anything with it. I can still almost feel my heart beat from when I ran the scraping script for the first time, half expecting Apple to boot me off their servers after a few hundred continuous queries. Thankfully, they didn’t.
There were two main technical challenges in this inquiry:
1. Scraping large sets of data requires planning. My scraping code went through at least 3 different versions, not to mention playing with various languages. Originally, I wanted to use Scala as I was under the impression that the JVM would be more efficient as well as speedy. Unfortunately, the HTML returned by the iTunes App Store is malformed: one of the link tags is not properly closed, and it choked Scala’s built-in XML parser.
After determining that using any random Java XML parser would be too much of a hassle, I turned to my favourite scripting language, JavaScript on node.js (using Google V8). After looking through a bunch of DOM selection solutions, I finally got jsdom and jQuery to work, and then I knew that I was in business.
The original plan was to crawl the website from the first page to the last and create a database entry for every page. There was only very basic crash recovery in the script, which simply recorded that the last scraped entry was a certain index n. Unfortunately for me, the links are not traversed in exactly the same order every time, so I ended up with duplicate entries in my database. Also, the script was largely single threaded, and it took upwards of 10 hours to scrape 70,000+ pictures.
After realizing that a partial data set would not do me any good, I decided to refocus my efforts. I built some redundancy into the link gathering and tested the database for existing entries before inserting. I also ran another script on top of the scraper that restarts it when it crashes on a bad response. Furthermore, I used 20 processes instead of 1 to speed things up. I was half expecting to really get booted off this time round, or to get a warning letter from CMU, but thankfully so far there has been none. After 10 hours or so, I managed to collect 300,014 images. Finder certainly isn’t very happy about that.
2. Working with large data sets requires planning. Overall, this is a pretty simple visualization; however, the scaffolding required to process the data consumes plenty of time. For one, there was a need to cache results so that it doesn’t take forever to debug anything. SQLite was immensely useful in this process. Working with large sets of data also means that when a long running script crashes, most of the time the intermediate data is corrupted and has to be deleted. I pretty much ran through every iteration at least 2 to 3 times. I’m quite sure most of my data is in fact accurate, but the fact that a small portion of the data was corrupted (I think more than 20 images) does not escape me.
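The check-before-insert step and the SQLite cache described above could look roughly like the following. This is a minimal sketch in Python rather than the node.js the scraper actually used, and it is not the original code. The table name `icons`, its columns, and the helpers `open_db` and `save_icon` are all hypothetical. It assumes each app has a stable identifier, so that a primary key lets SQLite reject duplicates no matter what order the links are traversed in.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the cache database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS icons ("
        " app_id TEXT PRIMARY KEY,"
        " icon_url TEXT NOT NULL)"
    )
    return conn


def save_icon(conn, app_id, icon_url):
    """Insert a scraped entry unless it already exists.

    Returns True if a new row was written, False if it was a duplicate.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT OR IGNORE INTO icons (app_id, icon_url) VALUES (?, ?)",
            (app_id, icon_url),
        )
    return cur.rowcount == 1
```

Because every insert is committed in its own transaction, a crash in the middle of a run leaves the rows written so far intact, and a restarted scraper can simply try the same entries again.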
I wouldn’t consider this a very successful inquiry. Technically it is ego stroking, but on an intellectual-art level there seem to be no very useful results from the data visualization. When constructing this visualization, I had a few goals in mind:
1. I don’t want to reduce all these rich data sets into simply aggregations of colours or statistics. I want to display the richness of the dataset.
2. I want to show the vastness of the data set.
As a result, I ended up with a pseudo spectrum display of the primary colour of every icon in the App Store that I scraped. It basically shows the primary colour distribution in something that looks like an HSB palette. What seems obvious from the result is that there are plenty of whitish or blackish icons, and that the hue distribution in the middle saturation range seems quite even. In fact, it says nothing at all. It’s just nice to look at.
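How the primary colour of each icon was extracted is not spelled out here, so the following is only one plausible way to do it, sketched in Python. It assumes the primary colour is the most common colour after coarse quantization, and that the icon has already been decoded into a list of (r, g, b) tuples. The function names `primary_colour` and `to_hsb` and the bucket size `step` are made up for illustration.

```python
import colorsys
from collections import Counter


def primary_colour(pixels, step=32):
    """Return the most common colour after coarse quantization.

    pixels: iterable of (r, g, b) tuples with channels in 0-255.
    step:   bucket width per channel (an assumed value, not from the project).
    """
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    (qr, qg, qb), _ = counts.most_common(1)[0]
    half = step // 2
    # Use the centre of the winning bucket as the representative colour.
    return (qr * step + half, qg * step + half, qb * step + half)


def to_hsb(rgb):
    """Convert a 0-255 RGB tuple to (hue, saturation, brightness) in 0-1."""
    r, g, b = (channel / 255.0 for channel in rgb)
    return colorsys.rgb_to_hsv(r, g, b)
```

Quantizing first keeps small gradients and anti-aliased edges from splitting one visually dominant colour into many near-identical counts.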
There are a couple of technical critiques of this. The 3D to 2D mapping algorithm sucks. What I used was a very simple binning and sorting along both the x and y axes. Due to the binning, the hue distribution was not equal across bins. To improve this visualization, the first step would be to at least equalize the hue distribution across bins.
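One way to equalize the bins, sketched in Python below, is to cut the hue-sorted list into equal-count columns instead of fixed-width hue ranges, then sort each column by brightness. This is an illustration of that idea under stated assumptions, not the algorithm actually used in the visualization. The function name `layout_grid` and the choice of brightness as the vertical axis are assumptions.

```python
def layout_grid(colours, columns):
    """Arrange (h, s, b) tuples into equal-count columns.

    Icons are sorted by hue and split so every column holds the same number
    of icons (give or take one), then each column is sorted by brightness.
    Returns a list of columns, each a list of (h, s, b) tuples.
    """
    ordered = sorted(colours, key=lambda c: c[0])
    n = len(ordered)
    grid = []
    for i in range(columns):
        start = i * n // columns
        end = (i + 1) * n // columns
        grid.append(sorted(ordered[start:end], key=lambda c: c[2]))
    return grid
```

With equal-count columns, a crowded hue range such as the near-greys spreads over more columns rather than overflowing a single bin, at the cost of the x axis no longer being linear in hue.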
I guess what I really wanted to do, if I had the time, was to have a bunch of controls that filter the icons shown on the screen. I really wanted a timeline control where I could drag a slider across time and see the App Store icons populate, or a bunch of checkboxes with which I could show and hide the categories that the apps belong to. If I had more chops, I would attempt to sort the data into histograms for prices and ratings and have some sort of colour sorting algorithm. If I had more chops, I would make them animate from one to another.
I think there is a certain difficulty in working with big data sets: there is no expectation of what trends will occur, since statistically speaking everything basically evens out, and on the other hand it just takes forever to get anything done. But it is fun, and satisfying.
If you want the data set, email me at johncch at cmu dot edu.
Code can be found at:
Hi Chong-Han – great work! Here are the comments from the PiratePad.
I like you already.
Enjoying your presentation style.
Woop async. Node.js.
You got a ‘whoa’ out of me. Good sign.
I like the zoom feature. Would be better if it animated between zoom levels.
search for app feature?
This could have benefited from some more searching capabilities. For instance, could there have been a way to see if certain types of apps have certain colors, or if the most popular apps have brighter colors, etc.?
Wow nice levels of zooming.
Could use a filter (show only Genre X).
Why are the horizontal lines not continuous (some greys among yellows etc)?
It’s interesting to see at what level a colorful icon at first glance fades away to a generic grey around the edges, or speckling the spectrum. I think you need a different organization scheme, though, than generic rainbow.
Shiny --> $$$$
Even if this is the only visualization you made, the zoomability as well as the image itself is really interesting. Good use of humor in presentation. It’s a little unclear exactly how the sorting algorithm works?
Good presentation! Your visualization looks awesome. That is a lot of apps! Zoom view is a lot of fun to look at. Very impressive. Agreed!
Cool topic! The mosaic idea is still interesting to me.. or some other form of organizing the data.
I think this presentation would benefit if it had more examples of all the other visual options you tried.
It’s the ultimate in small-multiples.
Zoom function is pretty incredible. Interesting dataset.
Would be nice to cut off the corners so they have the rounded edges that they would show in the app store. Agreed!
the TEXTURE of this is just spectacular.
Some more computer vision could have helped — for other organizational schemes.
I’d love to see this organized perhaps, seeing what kind of app something is vs its color.
Would be nice if you removed all the duplicates
nice data scraping, data prep, panic slide
would like to see some meta data correlations, genres, comments etc
wow. the icons on the mouse pointer have white edges, maybe take that away and frame it. make a Voltron of the Apple logo!
There are some really interesting icons… Some really similar to each other. Agreed with the idea of pulling out the most successful apps in the viz.