In how many different ways can you sort a collection? From how many different perspectives can you view an assembly of things?
A collection is a group of things related to each other in some identifiable way. The relationship might be through a topic, a place, a person, a type of object etc. Even though there is a specific motivation for gathering on the first place, what makes a collection dynamic is that one can reorganize the same data in alternative ways and make different sense of it.
The motivation about this project came from the ‘Pink project’ by Portia Munson. The pink project comprised a series still life installations created out of the artist’s collection of discarded objects, all of which are varying shades of pink and are objects used by females. The discarded items assume new meaning when seen out of the perspective of the common color and the connotation with gender as well as through the way they are organized in space.
Portia Munson, Pink Project, 1994
The data set I am using for the project, comes from a collection of images of butterflies provided by an iPhone app made by Hunter Research and Technology. The collection is composed by 240 images. Each butterfly is shown in plan on a white background; the images I extracted are 260×340 pixels in size. The only data accompanying the images is the name of each butterfly. The sizes do not reflect the real life size of the butterflies; they were probably all brought to the same scale before being used in the iPhone app. I later found that the original images are coming from the following website: http://butterflies.aa6g.org/ where also a description of how the butterflies had been photographed can be found.
Butterfly Collection, by Hunter Research and Technology
Data processing
The images have been processed in Matlab to extract a series of values that would enable different ways of sorting.
Initially, the following image descriptors were extracted from each image:
- perimeter of the butterfly outline
- area of the overall shape
- number of detected boundaries on the surface of the wings
- the image’s average value
- the image variance
- color histogram
For 1, 2,3: I used the Image Processing Toolbox for Matlab. The algorithms were run on the grayscale representation of the image after thresholding. More specifically the regionprops and the bwboundaries toolset have been used.
For 4, 5: to compute the image statistics I worked also on grayscale images. To get each image’s average I computed the arithmetic mean. To get the image variance I computed the square of the standard deviation.
For 1-5 I got a range of numbers according to which I sorted the images linearly. See below a video capture of a Processing applet that demonstrates the sorted images in a slide show progressing from the image with the smaller value to the image with the bigger value for a given sort. By keyboard input the user can change the sorting mode, change the slide show speed as well as pause the slide show and go forward and backward manually.
From the results I got for the images I noticed that the values do not vary significantly. This, I believe, is also reflected on the slideshow: in most of the cases the reason for transitioning from one butterfly to the other is not observable. My impression is that given the collection under consideration (same family of things with similar characteristics) the linear ordering might not make that much sense in terms of results.
Also, it might have been wrong on the first place to compute the image statistics without taking into consideration that a significant amount of the image pixels were those of the background. So I run the histogram analysis taking a different approach:
(1) I considered all three color channels
(2) I masked the image in order to compute a histogram only on the butterfly shape
(3) I computed the similarity among all pairs of butterflies and got the corresponding sorts; more specifically given the histogram vector for each image I computed the distance matrix between all pairs of butterflies.
The results seemed more reasonable. Thus, taking the three points mentioned below under consideration, I repeated the image processing and extracted the following image descriptors:
- perimeter and area – considered together in one descriptor
- color histogram
- edge histogram – horizontal, vertical and edges on the two diagonal were considered for the computation of the edge histogram
- average value – the average value was computed separately for each channel
- standard deviation – the deviation was computed separately for the columns and rows of the image matrix for each of the three color channels
- color histogram – 10 bin color histogram for each r g b channel
The following images are from my first attempt to visualize the data. I thought that it might be interesting to observe what kind of relations may occur among two different sorts. In the applet below I represented the sequence of a given sort e.g. by variation with a sequence of dots where each dot represents a butterfly in the sequence. By following the movement of a curve in time representing another sort e.g. by number of blobs I can compare how similar or not is the distribution of the butterflies with regard to the first sort.
I consider that this attempt does not work in communicating the objective of the visualization. I redesigned it following the rules below: