Getting to know Crossfilter

I've had a recent request to add a Crossfilter and dc.js section to my D3 workshops, so I've had to sit down and really get to know these two libraries. Crossfilter is a JavaScript library for slicing and dicing row-based data, whilst dc.js is a library that combines the analytics power of Crossfilter with the charting prowess of D3.

What is Crossfilter?

Crossfilter is a library for multidimensional filtering and aggregation of tabular data. For example, if we take the first 10 rows of the example data from the Crossfilter homepage:

Date | Origin | Destination | Delay (mins)

we can use Crossfilter to compute things like:

  • the top three longest delays
  • the origin (or destination) airport that had the most delays
  • how many flights were delayed
  • how many flights were delayed at a given airport
  • how many flights were delayed at a given airport within a given time period
  • the number of delays per hour throughout the day
  • the number of delays for a given airport, grouped by day

Crossfilter is purely an analysis tool, but it is most often used in conjunction with D3 for drawing charts. It comes into its own with large datasets, as it has been designed with performance in mind. Although it can be a bit hard to learn and understand, it's definitely worth considering if you're handling large datasets in the browser.

We'll be using the data from Crossfilter's homepage but just one day's worth. The original dataset has 231083 rows!
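The code in this post reads d.date, d.origin, d.destination and d.delay from each row, so the data is presumably an array of plain objects along the following lines. This is only a sketch: the values are invented for illustration (they are not rows from the real dataset), and storing the date as a JavaScript Date object is an assumption based on the later call to getHours().

```javascript
// Hypothetical rows, NOT taken from the real flight data.
// Assumption: `date` is a Date object and `delay` is in minutes
// (negative means the flight was early).
var sampleData = [
  {date: new Date(2001, 0, 1, 0, 15), origin: 'MCI', destination: 'MDW', delay: 12},
  {date: new Date(2001, 0, 1, 5, 5), origin: 'MDW', destination: 'PHX', delay: -5},
  {date: new Date(2001, 0, 1, 5, 40), origin: 'PHX', destination: 'MCI', delay: 0},
  {date: new Date(2001, 0, 1, 17, 30), origin: 'PHX', destination: 'MDW', delay: 47}
];

// The kinds of accessor the dimensions in this post rely on.
var firstDelay = sampleData[0].delay;
var firstRoute = sampleData[0].origin + '-' + sampleData[0].destination;
```

An array like this is the sort of thing we'd pass to crossfilter(data).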

How do I use it?

We start by creating an instance of Crossfilter based on our data set:

var cf = crossfilter(data);

We can straight away ask Crossfilter how many rows it has:

cf.size();

In our case this returns .

There are two primary concepts that Crossfilter uses: dimensions and groups.

  • dimensions represent a property of the data, be it an existing property (e.g. delay duration) or a derived property
  • groups tell Crossfilter how to group, or aggregate, the result set. For example, you might want to group by time of day or by airport and count the number of occurrences in each group

Dimensions

A Crossfilter dimension represents a property of the data, be it an existing property or a derived property. (You can also think of a dimension as a column in a spreadsheet.) We can set up filters and grouping functions on dimensions. Let's start by setting up some dimensions and trying out the filtering.

Let's set up a dimension on the flight delay:

var delayDimension = cf.dimension(function(d) {return d.delay;});

We can basically do 3 things with a dimension:

  • list the top n or bottom n items
  • set up a filter
  • set up a group

Let's start by listing the top 3 delays:

delayDimension.top(3);

This returns an array containing the following rows:

Date | Origin | Destination | Delay (mins)

Similarly we can list out the bottom 3 delays:

delayDimension.bottom(3);

Date | Origin | Destination | Delay (mins)

If you're using Chrome, or similar, you can open up the developer console and have a play around with this data. For example, type delayDimension.bottom(3) and you should see the same result as above. Type in reset() if you'd like to reset all the dimension filters.
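To make the ordering concrete, here's a plain-JavaScript sketch of what top(n) and bottom(n) give us: the rows sorted by the dimension's value, largest first for top and smallest first for bottom. The helper names topBy and bottomBy and the sample rows are made up for illustration. This mimics the result only and is not how Crossfilter works internally.

```javascript
// Hypothetical delays in minutes (negative = early); not real data.
var rows = [
  {origin: 'MCI', delay: 12},
  {origin: 'MDW', delay: -5},
  {origin: 'PHX', delay: 0},
  {origin: 'MDW', delay: 47},
  {origin: 'PHX', delay: 3}
];

// Illustrative stand-in for dimension.top(n): largest values first.
function topBy(rows, accessor, n) {
  return rows.slice().sort(function(a, b) {return accessor(b) - accessor(a);}).slice(0, n);
}

// Illustrative stand-in for dimension.bottom(n): smallest values first.
function bottomBy(rows, accessor, n) {
  return rows.slice().sort(function(a, b) {return accessor(a) - accessor(b);}).slice(0, n);
}

var byDelay = function(d) {return d.delay;};
var top3 = topBy(rows, byDelay, 3).map(byDelay);       // [47, 12, 3]
var bottom3 = bottomBy(rows, byDelay, 3).map(byDelay); // [-5, 0, 3]
```

Note that both helpers return whole rows; we only map to the delay at the end to keep the output short.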

Derived dimensions

We can also set up dimensions that are derived from our data's properties. For example, we might want a dimension representing origin-destination pairs (e.g. MCI-MDW):

var originDestinationDimension = cf.dimension(function(d) {return d.origin + '-' + d.destination;});

Similarly we can set up a dimension that returns whether or not the flight was delayed:

var isDelayedDimension = cf.dimension(function(d) {return d.delay > 0;});

Filtering

If we set up a filter on a dimension it acts across our whole Crossfilter instance, meaning that future operations will respect whichever filters have been set up. Let's set up a filter on our delay dimension so that only delayed flights (rather than on-time or early flights) are included:

delayDimension.filter(function(d) {return d > 0;});

Now let's look at the bottom 3 using delayDimension.bottom(3):

Date | Origin | Destination | Delay (mins)

We can see that only positive delays have been included in our data.

If we're interested in flights from one particular location we could set up another dimension and filter on the origin property:

var originDimension = cf.dimension(function(d) {return d.origin;});
originDimension.filter(function(d) {return d === 'MDW';});

And now let's look at the top 3 delays:

delayDimension.top(3);

Date | Origin | Destination | Delay (mins)

If we now take a look at the bottom 3 delays:

delayDimension.bottom(3);

Date | Origin | Destination | Delay (mins)

we can see that only positive delays are included because the delay filter is still in place. Let's remove it and look at the bottom 3 again:

delayDimension.filterAll();
delayDimension.bottom(3);

Date | Origin | Destination | Delay (mins)

Finally, let's remove the origin filter using originDimension.filterAll() and check that delayDimension.top(3) returns the same top 3 as when we started out:

Date | Origin | Destination | Delay (mins)

The main takeaways from this section are that dimensions allow you to set up filters and to view the data (sorted by the dimension). As we've seen in the example, filters remain active until they're removed.
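The way the delay and origin filters combined above can be sketched in plain JavaScript: a row is only visible if it passes every filter that is currently set, and clearing one filter leaves the others in place. The activeFilters object, the visibleRows helper and the sample rows are all hypothetical. They illustrate the behaviour we observed rather than Crossfilter's actual implementation.

```javascript
// Hypothetical rows; not real data.
var rows = [
  {origin: 'MDW', delay: 47},
  {origin: 'MDW', delay: -5},
  {origin: 'PHX', delay: 20},
  {origin: 'PHX', delay: 0}
];

// One optional predicate per "dimension"; null means no filter is set.
var activeFilters = {delay: null, origin: null};

// A row is visible only if it passes every filter that is currently set.
function visibleRows() {
  return rows.filter(function(d) {
    return Object.keys(activeFilters).every(function(name) {
      var predicate = activeFilters[name];
      return predicate === null || predicate(d[name]);
    });
  });
}

activeFilters.delay = function(v) {return v > 0;};
var delayedOnly = visibleRows().length;     // 2 rows: MDW 47 and PHX 20

activeFilters.origin = function(v) {return v === 'MDW';};
var delayedFromMDW = visibleRows().length;  // 1 row: MDW 47

activeFilters.delay = null;                 // plays the role of delayDimension.filterAll()
var allFromMDW = visibleRows().length;      // 2 rows: MDW 47 and MDW -5
```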

Groups

Groups in Crossfilter allow us to aggregate our data. Specifically, a Crossfilter grouping allows us to group our data by a particular dimension and to perform some kind of aggregation operation such as count, sum or average on each group.

Examples of aggregations we might want to perform on the flight data are:

  • the number of delays for each origin airport
  • the number of delays per hour throughout the day
  • the number of delays for a given airport, grouped by day

Notice the distinction between the dimension operations in the previous section and the grouping operations. With the former our result sets are rows of the original data, whilst with the grouping operations, our result set is an array of groups where each group has a key (the thing we're grouping by) and a value such as the count or sum.

Let's start by grouping our data by origin airport:

var originGrouping = originDimension.group();

We can now get an array of the groups using:

originGrouping.all();

key | value

etc.

We've truncated the results in this case, but originGrouping.all() returns an array of size because there are unique origins.

As with dimensions, we can list the top n groups by value:

originGrouping.top(3);

key | value

By default, the value property is the count of each of the dimension's unique values. So in our example, it's the number of flights leaving each airport and we can see that Phoenix Sky Harbor International had the most flights.
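If the key/value shape is hard to picture, here's a plain-JavaScript sketch of the default counting behaviour: one entry per unique key, with value holding the number of rows that share that key. The countBy helper and the sample rows are made up for illustration, and this is not Crossfilter's own code.

```javascript
// Hypothetical rows; not real data.
var rows = [
  {origin: 'PHX'}, {origin: 'MDW'}, {origin: 'PHX'},
  {origin: 'MCI'}, {origin: 'PHX'}, {origin: 'MDW'}
];

// Illustrative stand-in for dimension.group().all():
// count the rows per key and return {key, value} objects sorted by key.
function countBy(rows, keyOf) {
  var counts = {};
  rows.forEach(function(d) {
    var key = keyOf(d);
    counts[key] = (counts[key] || 0) + 1;
  });
  return Object.keys(counts).sort().map(function(key) {
    return {key: key, value: counts[key]};
  });
}

var groups = countBy(rows, function(d) {return d.origin;});
// [{key: 'MCI', value: 1}, {key: 'MDW', value: 2}, {key: 'PHX', value: 3}]

// Illustrative stand-in for group.top(1): the group with the largest value.
var busiest = groups.slice().sort(function(a, b) {return b.value - a.value;})[0];
// {key: 'PHX', value: 3}
```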

Let's try grouping by delay:

var delayGrouping = delayDimension.group();
delayGrouping.top(5);

key | value

What this tells us is that the most common delay duration was 0 minutes (no delay).

Specifying how to group our data

Now let's get a bit cleverer and group by hour of day:

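var dateDimension = cf.dimension(function(d) {return d.date;}); // assumes each row holds a Date object in d.date
var hourGrouping = dateDimension.group(function(d) {return d.getHours();});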

What we're doing here is telling Crossfilter that we want to group by the time of day, bucketing times with the same hour together. (If we didn't do this, we'd end up with a group for every single flight time during the day.) The result set is:

key | value

So there were 2 flights at midnight, 5 at 5am, and so on.
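Here's a small plain-JavaScript sketch of why the bucketing function matters. Without it every distinct departure time would be its own group, but once we map each time to its hour, times within the same hour collapse into one group. The times are invented and the code only imitates the result, so treat it as an illustration rather than what Crossfilter does under the hood.

```javascript
// Hypothetical departure times (local time); not real data.
var times = [
  new Date(2001, 0, 1, 0, 15),
  new Date(2001, 0, 1, 0, 50),
  new Date(2001, 0, 1, 5, 5),
  new Date(2001, 0, 1, 5, 40),
  new Date(2001, 0, 1, 17, 30)
];

// Count the times per hour bucket.
var countsByHour = {};
times.forEach(function(t) {
  var hour = t.getHours();
  countsByHour[hour] = (countsByHour[hour] || 0) + 1;
});

// Turn the counts into {key, value} objects sorted by hour.
var hourGroups = Object.keys(countsByHour)
  .map(Number)
  .sort(function(a, b) {return a - b;})
  .map(function(hour) {return {key: hour, value: countsByHour[hour]};});
// [{key: 0, value: 2}, {key: 5, value: 2}, {key: 17, value: 1}]

// Without bucketing, each of the 5 distinct times would be its own group.
var groupsWithoutBucketing = times.length; // 5
```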

Now let's set up a filter to see the flight hours from Phoenix:

originDimension.filter(function(d) {return d === 'PHX';});
hourGrouping.all();

key | value

Specifying our own reduce functions

Finally let's get really clever and write the code to answer the following question:

How many flights from Phoenix were delayed each hour?

Now we need to count each flight in hourGrouping, but only if the delay was greater than 0. Crossfilter allows us to do this by using the reduce function which requires us to define three functions: add (for when records are added to the filtered selection), remove (for when records are removed from the filtered selection) and initial (provides the start value). (For more detail read the API documentation.)

So let's go ahead and write the code:

function reduceAdd(p, v) {return v.delay > 0 ? p + 1 : p;}
function reduceRemove(p, v) {return v.delay > 0 ? p - 1 : p;}
function reduceInitial() {return 0;}

hourGrouping.reduce(reduceAdd, reduceRemove, reduceInitial);

hourGrouping.all();

This now returns the number of delayed flights originating from Phoenix, grouped by hour of the day:

key | value
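To see how the three reduce functions fit together, here's a sketch that drives them by hand for a single hour bucket. We start from the initial value, add each record, and then remove a record as if a filter had just excluded it. The three sample flights are hypothetical, and Crossfilter's real bookkeeping is more involved than this loop, but the calls it makes to our functions follow the same add/remove idea.

```javascript
function reduceAdd(p, v) {return v.delay > 0 ? p + 1 : p;}
function reduceRemove(p, v) {return v.delay > 0 ? p - 1 : p;}
function reduceInitial() {return 0;}

// Hypothetical flights that all fall into the same hour bucket.
var phxDelayed = {origin: 'PHX', delay: 20};
var phxOnTime = {origin: 'PHX', delay: 0};
var mdwDelayed = {origin: 'MDW', delay: 35};

// With no filters set, every record is added to the bucket.
var value = reduceInitial();
[phxDelayed, phxOnTime, mdwDelayed].forEach(function(d) {
  value = reduceAdd(value, d);
});
var delayedBeforeFilter = value; // 2 (the on-time flight isn't counted)

// Filtering the origin to PHX removes the MDW record from the bucket.
value = reduceRemove(value, mdwDelayed);
var delayedAfterFilter = value;  // 1
```

The important design point is that reduceRemove must exactly undo reduceAdd, otherwise the counts drift as filters change.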

Summary

First of all, thanks for reading this far! It took me a while to figure out how to use Crossfilter and hopefully this introduction will be useful to you.

Do have a play around with Crossfilter yourself. If you're using Chrome, or similar, you can open up the developer console and experiment with this very data. For example, try typing in:

delayDimension.top(10)

and inspecting the resulting array.