Getting to know Crossfilter

I was recently asked to add a Crossfilter and dc.js section to my D3 workshops, so I've had to sit down and really get to know these two libraries. Crossfilter is a JavaScript library for slicing and dicing row-based data, whilst dc.js is a library that combines the analytics power of Crossfilter with the charting prowess of D3.

What is Crossfilter?

Crossfilter is a library for multidimensional filtering and aggregation of tabular data. For example, if we take the first 10 rows of the example data from the Crossfilter homepage:

Date                 Origin  Destination  Delay (mins)
03/01/2001 00:05:00  MCI     MDW          8
03/01/2001 00:45:00  LAS     PHX          95
03/01/2001 05:30:00  LAX     PHX          10
03/01/2001 05:30:00  ONT     PHX          0
03/01/2001 05:30:00  LAS     LAX          -9
03/01/2001 05:30:00  LAX     OAK          12
03/01/2001 05:40:00  ONT     SMF          -8
03/01/2001 06:00:00  MDW     BNA          -1
03/01/2001 06:00:00  OAK     LAX          -10
03/01/2001 06:00:00  LAS     LAX          -11

we can use Crossfilter to compute things like:

  • the top three longest delays
  • the origin (or destination) airport that had the most delays
  • how many flights were delayed
  • how many flights were delayed at a given airport
  • how many flights were delayed at a given airport within a given time period
  • the number of delays per hour throughout the day
  • the number of delays for a given airport, grouped by day

Crossfilter is purely an analysis tool, but it is most often used in conjunction with D3 for drawing charts. It comes into its own with large datasets as it's been designed with performance in mind. Although it can be a bit hard to learn and understand it's definitely worth considering if you're handling large datasets in the browser.

We'll be using the data from Crossfilter's homepage but just one day's worth. The original dataset has 231083 rows!

How do I use it?

We start by creating an instance of Crossfilter based on our data set:

var cf = crossfilter(data);

We can straight away ask Crossfilter how many rows it has:

cf.size();

In our case this returns 2692.
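
The snippets in this post assume that data is an array of plain objects, one per flight. The origin, destination and delay property names come from the accessors used later on. The date property name, the parseRow helper and the day/month/year reading of the date column are assumptions made for this sketch.

```javascript
// Hypothetical helper: turn one row of the table into a record object.
// Assumes the date column is day/month/year (swap the fields if yours is month-first).
function parseRow(dateString, origin, destination, delay) {
  var parts = dateString.split(' ');
  var dmy = parts[0].split('/').map(Number);
  var hms = parts[1].split(':').map(Number);
  return {
    // JavaScript months are zero-based, hence dmy[1] - 1.
    date: new Date(dmy[2], dmy[1] - 1, dmy[0], hms[0], hms[1], hms[2]),
    origin: origin,
    destination: destination,
    delay: delay // minutes; zero or negative means on time or early
  };
}

// The first three rows of the table above.
var sampleData = [
  parseRow('03/01/2001 00:05:00', 'MCI', 'MDW', 8),
  parseRow('03/01/2001 00:45:00', 'LAS', 'PHX', 95),
  parseRow('03/01/2001 05:30:00', 'LAX', 'PHX', 10)
];
```

An array shaped like sampleData is what would be passed to crossfilter() in place of data.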

There are two primary concepts that Crossfilter uses: dimensions and groups.

  • dimensions represent a property of the data, be it an existing property (e.g. delay duration) or a derived property
  • groups tell Crossfilter how to group, or aggregate, the result set. For example, you might want to group by time of day or by airport and count the number of occurrences in each group

Dimensions

A Crossfilter dimension represents a property of the data, be it an existing property or a derived property. (You can also think of a dimension as a column in a spreadsheet.) We can set up filters and grouping functions on dimensions. Let's start by setting up some dimensions and trying out the filtering.

Let's set up a dimension on the flight delay:

var delayDimension = cf.dimension(function(d) {return d.delay;});

We can basically do 3 things with a dimension:

  • list the top n or bottom n items
  • set up a filter
  • set up a group

Let's start by listing the top 3 delays:

delayDimension.top(3);

This returns an array containing the following rows:

Date                 Origin  Destination  Delay (mins)
03/01/2001 12:43:00  PHX     SEA          192
03/01/2001 16:03:00  SEA     BNA          178
03/01/2001 17:52:00  SEA     SMF          170

Similarly we can list out the bottom 3 delays:

delayDimension.bottom(3);

Date                 Origin  Destination  Delay (mins)
03/01/2001 09:35:00  PHX     PVD          -40
03/01/2001 18:15:00  PHX     SDF          -30
03/01/2001 17:55:00  MSY     SAN          -30
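
Conceptually, top(n) returns the n records with the largest dimension values, largest first, and bottom(n) the n smallest, smallest first. The plain JavaScript sketch below reproduces that result for the first five rows of the sample table. topBy and bottomBy are hypothetical helper names, and sorting the whole array on every call is only for illustration; it says nothing about how Crossfilter works internally.

```javascript
// Hypothetical helpers; they only handle numeric accessors.
function topBy(records, accessor, n) {
  // slice() copies the array so the caller's data is left untouched.
  return records.slice().sort(function(a, b) {
    return accessor(b) - accessor(a); // descending
  }).slice(0, n);
}

function bottomBy(records, accessor, n) {
  return records.slice().sort(function(a, b) {
    return accessor(a) - accessor(b); // ascending
  }).slice(0, n);
}

var flights = [
  {origin: 'MCI', delay: 8},
  {origin: 'LAS', delay: 95},
  {origin: 'LAX', delay: 10},
  {origin: 'ONT', delay: 0},
  {origin: 'LAS', delay: -9}
];

var byDelay = function(d) {return d.delay;};
var longest = topBy(flights, byDelay, 3);
var shortest = bottomBy(flights, byDelay, 3);
```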

If you're using Chrome, or similar, you can open up the developer console and have a play around with this data. For example, type delayDimension.bottom(3) and you should see the same result as above. Type in reset() if you'd like to reset all the dimension filters.

Derived dimensions

We can also set up dimensions that are derived from our data's properties. For example, we might want a dimension representing origin-destination pairs (e.g. MCI-MDW):

var originDestinationDimension = cf.dimension(function(d) {return d.origin + '-' + d.destination;});

Similarly we can set up a dimension that returns whether or not the flight was delayed:

var isDelayedDimension = cf.dimension(function(d) {return d.delay > 0;});

Filtering

If we set up a filter on a dimension it acts across the whole Crossfilter instance, meaning that subsequent operations respect whichever filters are in place. Let's set up a filter on our delay dimension so that only delayed flights (rather than on-time or early ones) are included:

delayDimension.filter(function(d) {return d > 0;});

Now let's look at the bottom 3 using delayDimension.bottom(3):

Date                 Origin  Destination  Delay (mins)
03/01/2001 12:08:00  BWI     JAX          1
03/01/2001 14:35:00  TUL     DAL          1
03/01/2001 14:30:00  PHX     ABQ          1

We can see that only positive delays have been included in our data.

If we're interested in flights from one particular location we could set up another dimension and filter on the origin property:

var originDimension = cf.dimension(function(d) {return d.origin});
originDimension.filter(function(d) {return d === 'MDW'});

And now let's look at the top 3 delays:

delayDimension.top(3);

Date                 Origin  Destination  Delay (mins)
03/01/2001 18:40:00  MDW     ISP          53
03/01/2001 20:50:00  MDW     BDL          53
03/01/2001 19:25:00  MDW     BWI          49

If we now take a look at the bottom 3 delays:

delayDimension.bottom(3);

Date                 Origin  Destination  Delay (mins)
03/01/2001 10:00:00  MDW     DTW          1
03/01/2001 08:55:00  MDW     MHT          2
03/01/2001 06:25:00  MDW     STL          2

we can see that only positive delays are included because the delay filter is still in place. Let's remove it and look at the bottom 3 again:

delayDimension.filterAll();
delayDimension.bottom(3);

Date                 Origin  Destination  Delay (mins)
03/01/2001 07:40:00  MDW     LAS          -22
03/01/2001 08:20:00  MDW     PVD          -14
03/01/2001 09:25:00  MDW     RDU          -14

Finally, let's remove the origin filter with originDimension.filterAll() and check that delayDimension.top(3) gives the same result as when we started out:

Date                 Origin  Destination  Delay (mins)
03/01/2001 12:43:00  PHX     SEA          192
03/01/2001 16:03:00  SEA     BNA          178
03/01/2001 17:52:00  SEA     SMF          170

The main takeaways from this section are that dimensions allow you to set up filters and to view the data (sorted by the dimension). As we've seen in the example, filters remain active until they're removed.
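
One way to picture the behaviour above: each dimension holds at most one filter, and a record is visible only if it passes every active filter. Here is a plain JavaScript sketch of that rule, using a few rows from the tables above and hypothetical names (activeFilters, visibleRecords). It doesn't use Crossfilter at all and isn't meant to mirror its internals.

```javascript
var flights = [
  {origin: 'MDW', destination: 'ISP', delay: 53},
  {origin: 'MDW', destination: 'LAS', delay: -22},
  {origin: 'PHX', destination: 'SEA', delay: 192},
  {origin: 'PHX', destination: 'PVD', delay: -40}
];

// One slot per dimension; null means "no filter set".
var activeFilters = {
  delay: function(d) {return d.delay > 0;},
  origin: function(d) {return d.origin === 'MDW';}
};

// A record is visible only if every active filter accepts it.
function visibleRecords(records, filters) {
  return records.filter(function(d) {
    return Object.keys(filters).every(function(name) {
      return filters[name] === null || filters[name](d);
    });
  });
}

var delayedFromMdw = visibleRecords(flights, activeFilters); // both filters active
activeFilters.delay = null;  // plays the role of delayDimension.filterAll()
var allFromMdw = visibleRecords(flights, activeFilters);
activeFilters.origin = null; // plays the role of originDimension.filterAll()
var everything = visibleRecords(flights, activeFilters);
```

As a side note, the Crossfilter API reference also describes passing a plain value to filter (for example originDimension.filter('MDW')) to match that value exactly, which saves writing a function for simple cases.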

Groups

Groups in Crossfilter allow us to aggregate our data. Specifically, a Crossfilter grouping allows us to group our data by a particular dimension and to perform some kind of aggregation operation such as count, sum or average on each group.

Examples of aggregations we might want to perform on the flight data are:

  • the number of delays at each origin airport
  • the number of delays per hour throughout the day
  • the number of delays for a given airport, grouped by day

Notice the distinction between the dimension operations in the previous section and the grouping operations. With the former our result sets are rows of the original data, whilst with the grouping operations, our result set is an array of groups where each group has a key (the thing we're grouping by) and a value such as the count or sum.

Let's start by grouping our data by origin airport:

var originGrouping = originDimension.group();

We can now get an array of the groups using:

originGrouping.all();

key  value
ABQ  62
ALB  10
AMA  10
AUS  47
BDL  13
BHM  27
BNA  88
BOI  17
BUF  10
BUR  52

etc.

We've truncated the results in this case, but originGrouping.all() returns an array of size 59 because there are 59 unique origins.

As with dimensions, we can list the top n groups, this time ordered by value:

originGrouping.top(3);

key  value
PHX  180
LAS  167
HOU  143

By default, the value property is the count of each of the dimension's unique values. So in our example, it's the number of flights leaving each airport and we can see that Phoenix Sky Harbor International had the most flights.
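
As a mental model, the default grouping counts the records that share each dimension value. The sketch below does the same in plain JavaScript for the origins of the ten sample rows at the top of this post. countBy is a hypothetical helper, not part of Crossfilter, and it recounts everything on each call.

```javascript
// Hypothetical helper: count records per key.
function countBy(records, keyFn) {
  var counts = {};
  records.forEach(function(d) {
    var key = keyFn(d);
    counts[key] = (counts[key] || 0) + 1;
  });
  // Same shape as group.all(): {key, value} objects sorted by key.
  // Note that keys become strings in this sketch.
  return Object.keys(counts).sort().map(function(key) {
    return {key: key, value: counts[key]};
  });
}

var flights = [
  {origin: 'MCI'}, {origin: 'LAS'}, {origin: 'LAX'}, {origin: 'ONT'}, {origin: 'LAS'},
  {origin: 'LAX'}, {origin: 'ONT'}, {origin: 'MDW'}, {origin: 'OAK'}, {origin: 'LAS'}
];

var groups = countBy(flights, function(d) {return d.origin;});

// Comparable to group.top(1): the group with the largest value.
var busiest = groups.slice().sort(function(a, b) {
  return b.value - a.value;
})[0];
```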

Let's try grouping by delay:

var delayGrouping = delayDimension.group();
delayGrouping.top(5);

key  value
0    231
-5   157
5    112
-2   100
3    90

What this tells us is that the most common delay duration was 0 minutes (no delay).

Specifying how to group our data

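Now let's get a bit cleverer and group by hour of day. This needs a dimension on the flight date, which we haven't set up yet (the date property name is an assumption here; use whatever your records call it):

var dateDimension = cf.dimension(function(d) {return d.date;}); // assumes d.date is a Date object
var hourGrouping = dateDimension.group(function(d) {return d.getHours();});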

What we're doing here is telling Crossfilter that we want to group by the time of day, bucketing times with the same hour together. (If we didn't do this, we'd end up with a group for every single flight time during the day.) The result of hourGrouping.all() is:

key  value
0    2
5    5
6    157
7    198
8    181
9    167
10   147
11   161
12   175
13   147
14   153
15   164
16   164
17   172
18   170
19   163
20   162
21   134
22   58
23   12

So there were 2 flights at midnight, 5 at 5am, and so on.
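
One caveat worth knowing: the Crossfilter API reference asks that the function passed to group() return values whose order is consistent with the dimension's own ordering. Grouping a date dimension by hour satisfies this for a single day of data, as here, but would not across several days (11pm on one day sorts before 1am on the next, yet 23 > 1). The check below is a plain JavaScript sketch; isOrderPreserving is a hypothetical helper and the Date values are made up for illustration.

```javascript
// Hypothetical helper: given dimension values in ascending order, check
// that the group key never decreases as the values increase.
function isOrderPreserving(sortedValues, keyFn) {
  for (var i = 1; i < sortedValues.length; i++) {
    if (keyFn(sortedValues[i - 1]) > keyFn(sortedValues[i])) {
      return false;
    }
  }
  return true;
}

var hourOf = function(d) {return d.getHours();};

// Made-up timestamps within a single day, in ascending order.
var oneDay = [
  new Date(2001, 0, 3, 0, 5),
  new Date(2001, 0, 3, 5, 30),
  new Date(2001, 0, 3, 23, 10)
];

// The same timestamps plus one early the next morning.
var twoDays = oneDay.concat([new Date(2001, 0, 4, 1, 0)]);

var oneDayOk = isOrderPreserving(oneDay, hourOf);
var twoDaysOk = isOrderPreserving(twoDays, hourOf);
```

If your data spans several days, one option is a separate dimension on the hour itself with a plain group(), so that the dimension and the group are ordered the same way.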

Now let's set up a filter to see the flight hours from Phoenix:

originDimension.filter(function(d) {return d === 'PHX';});
hourGrouping.all();

key  value
0    0
5    0
6    5
7    13
8    12
9    10
10   10
11   12
12   13
13   10
14   13
15   9
16   14
17   7
18   12
19   9
20   11
21   10
22   8
23   2

Specifying our own reduce functions

Finally let's get really clever and write the code to answer the following question:

How many flights from Phoenix were delayed each hour?

Now we need to count each flight in hourGrouping, but only if the delay was greater than 0. Crossfilter allows us to do this by using the reduce function which requires us to define three functions: add (for when records are added to the filtered selection), remove (for when records are removed from the filtered selection) and initial (provides the start value). (For more detail read the API documentation.)

So let's go ahead and write the code:

function reduceAdd(p, v) {return v.delay > 0 ? p + 1 : p;}
function reduceRemove(p, v) {return v.delay > 0 ? p - 1 : p;}
function reduceInitial() {return 0;}

hourGrouping.reduce(reduceAdd, reduceRemove, reduceInitial);

hourGrouping.all();

This now returns the number of delayed flights originating from Phoenix for each hour of the day:

key  value
0    0
5    0
6    0
7    3
8    3
9    2
10   4
11   8
12   9
13   7
14   9
15   7
16   10
17   3
18   10
19   7
20   9
21   8
22   6
23   2
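
To see why reduce needs both an add and a remove function, it can help to trace them by hand. Crossfilter calls add for each record that enters the filtered selection and remove for each one that leaves, rather than recounting from scratch. The sketch below replays that sequence in plain JavaScript for a single hour bucket. The records are made up, and only reduceAdd, reduceRemove and reduceInitial come from the text above.

```javascript
function reduceAdd(p, v) {return v.delay > 0 ? p + 1 : p;}
function reduceRemove(p, v) {return v.delay > 0 ? p - 1 : p;}
function reduceInitial() {return 0;}

// Made-up records that all fall in the same hour bucket.
var bucket = [
  {origin: 'PHX', delay: 12},
  {origin: 'PHX', delay: -3},
  {origin: 'LAS', delay: 40},
  {origin: 'LAS', delay: 0}
];

// No filters yet: every record is added to the running value.
var value = bucket.reduce(reduceAdd, reduceInitial());
var delayedAll = value;

// Filtering on origin === 'PHX' takes the LAS records out of the
// selection, so remove is called once for each of them.
bucket.filter(function(d) {
  return d.origin !== 'PHX';
}).forEach(function(d) {
  value = reduceRemove(value, d);
});
var delayedFromPhx = value;
```

Here delayedAll counts the two delayed flights in the bucket, and delayedFromPhx drops to one once the LAS records have been removed.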

Summary

First of all, thanks for reading this far! It took me a while to figure out how to use Crossfilter and hopefully this introduction will be useful to you.

Do have a play around with Crossfilter yourself. If you're using Chrome, or similar, you can open up the developer console and experiment with this very data. For example, try typing in:

delayDimension.top(10)

and inspecting the resulting array.