On the Side: Using Apache Spark and Clojure for Basketball Reasons(?)


I spend a lot of time thinking about “the next big thing”. In tech it seems you can almost never be too far ahead of the curve. Whatever toolset you’re working with now, chances are some kids out there are spending their nights trying to obsolete, er, disrupt it (Note: “disrupt” obsoleted the word “obsolete”).

When I started building nbawowy.com towards the end of 2012, I decided to use a stack that wasn’t all that common. Fast forward to today, and the “MEAN” stack (MEAN = MongoDB, Express, AngularJS, and Node.js), as it came to be called, seems to be everywhere (funny enough, I was actually referring to it as AMEN). AngularJS (a Google-backed project) now has over 27,000 stars, almost 10,000 more than Backbone, which was widely considered the “default” JavaScript front-end framework before 2013.

This past year I spent a lot of time in my day job learning more about “big data” and how to deal with it. In the past 5 or so years this mostly meant learning how to run MapReduce jobs on Hadoop, either by hand-coding them yourself in Java or using a higher-level scripting language, such as Pig or Hive. Not being a Java developer myself, I decided to learn Pig (a top-level Apache project) and it has made me much more productive.

Let me tell you how. At my work (a “social network” app called Skout) we generate a lot of data every day, not nearly as much as a Twitter or Facebook, of course, but enough to make it inconvenient to work with using traditional means (MySQL!). Last time I checked we were generating somewhere in the neighborhood of 100 million data messages per day (a “data message” is a little piece of JSON-formatted text sent over the network that tells us about an action taken by the user in the app). Like many companies, we store these messages on S3, an Amazon AWS service which is essentially an infinitely (for our purposes) scalable storage service in the cloud.

You can think of S3 as a really gigantic hard drive. What MapReduce (or Pig in my case) allows one to do is query the data in an ad hoc fashion, but the catch is that up until now this has mostly been a batch process. So one of my queries (…count all the chat messages sent by women under the age of 25 in Asian countries on Android phones over the past week) might take anywhere from 10 minutes to upwards of an hour. It’s better than nothing, and often the only way to get real answers, but it sort of takes the hoc out of ad hoc. What I’d really like to do (and so would everyone else in tech) is interactively query the data on S3 (or some other Hadoop service). And by “interactive”, I mean getting real-time or near real-time (seconds to a couple of minutes) results, as one would by querying a MySQL database (at least, one designed for such a purpose). With such a system it becomes possible to iterate much faster. It also enables data scientists to implement iterative algorithms that were previously not feasible with the current MapReduce toolset.

Enter Apache Spark, a cluster computing project coming out of UC Berkeley that has burst onto the big data scene in the past year. Spark’s selling point is the following:

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

The promise of Spark is to enable a whole new set of big data applications. Naturally, I became intrigued when I first learned about it, and thought it could be a great new tool for my day job. My second thought was…can I use this for basketball statistics? The obvious answer being: Sure, why the hell not? One thing that is useful about being a (self-proclaimed) NBA stats geek is that I always have a fun data sandbox at my command (I’m not sure there are two things, actually).

Spark comes out of the box with an API in three different programming languages: Java, Scala (the source code language), and Python. Unfortunately, I’m not using any of those languages, and the language I typically use for such things (Ruby) isn’t supported (yet, although I’m sure there will eventually be such a project). There is a SparkR project, but I had another idea. In the past few months I have taken up the task of learning Clojure, which is basically a Lisp that runs on the JVM. Scala, by the way, is in a similar vein in that it is a functional language hosted on the JVM. In researching the two languages, I simply decided that Clojure was eventually what “all the cool kids” would be doing, and that’s always where I want to be. Also, Rich Hickey, the developer of Clojure, is brilliant and reminds me of the 70’s version of Doctor Who.

Fortunately, there is a project called Flambo that is developing a Clojure API for Spark. I decided to give it a try. I’m in the very early phase of the learning curve, but I’ve already figured out enough to see that this is shaping up to be a very cool/powerful data stack, indeed.

First, here is a sample of the data set I’m using, which comes straight from my nbawowy database:

{
    "76ers" : [
        "Lorenzo Brown",
        "Elliot Williams",
        "Hollis Thompson",
        "Brandon Davies",
        "Daniel Orton"
    ],
    "Timberwolves" : [
        "A.J. Price",
        "Alexey Shved",
        "Robbie Hummel",
        "Ronny Turiaf",
        "Gorgui Dieng"
    ],
    "_id" : ObjectId("53531a345bca6d54dd0382b2"),
    "as" : 120,
    "assist" : null,
    "away" : "Timberwolves",
    "coords" : {
        "x" : 13,
        "y" : 15
    },
    "date" : "2014-01-06",
    "distance" : 16,
    "espn_id" : "400489378",
    "event" : "A.J. Price makes a pull up jump shot from 16 feet out.",
    "home" : "76ers",
    "hs" : 93,
    "last_state" : {
        "type" : "fga",
        "val" : 2,
        "rel" : "jump shot",
        "made" : true,
        "shooter" : "Daniel Orton",
        "dist" : 17
    },
    "made" : true,
    "opponent" : "76ers",
    "pd" : 27,
    "pid" : 424,
    "points" : 2,
    "q" : 4,
    "release" : "pull up jump shot",
    "season" : "2014",
    "shooter" : "A.J. Price",
    "t" : "2:22",
    "team" : "Timberwolves",
    "type" : "fga",
    "url" : "http://scores.nbcsports.msnbc.com/nba/pbp.asp?gamecode=2014010620",
    "value" : 2
}

This is a single play. Each season of nbawowy has roughly 550K plays just like this with metadata describing all kinds of things I pull out from the play-by-play data with my current parser (written in Ruby). The 2013-2014 season is a little under 500 MB of data like this. I “dumped” it to a text file that could then be processed with Flambo/Spark.
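To give a feel for what each line of that text file turns into on the Clojure side, here is a minimal sketch. It assumes the dump holds one JSON object per line and that the org.clojure/data.json library is on the classpath; the sample line is a hypothetical, heavily trimmed stand-in for a real play, not an actual row from the file.

```clojure
(require '[clojure.data.json :as json])

;; Hypothetical, trimmed-down line from the dumped text file
;; (a real play carries many more keys, as shown above).
(def sample-line
  "{\"type\":\"fga\",\"value\":2,\"made\":true,\"team\":\"Timberwolves\",\"shooter\":\"A.J. Price\"}")

;; Parse the JSON text into a Clojure map, turning string keys into keywords.
(def play (json/read-str sample-line :key-fn keyword))

(:shooter play) ;; => "A.J. Price"
(:value play)   ;; => 2
(:made play)    ;; => true
```

Keyword keys are a convenience: they let you pull fields out of a play with (:shooter play) instead of (get play "shooter").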

The following is a code sample that counts the made three-point field goals for each Warriors player last season and prints them in descending order (comments are denoted by leading semi-colons):

;; create a namespace and require libraries
(ns flambo.clojure.spark.demo
  (:require [flambo.conf :as conf]
            [flambo.api :as f]
            [clojure.data.json :as json]
            [clojure.pprint :as pprint]))

;; configure Spark
(def c (-> (conf/spark-conf)
           (conf/master "local[*]")
           (conf/app-name "nba_dsl")))

;; create a SparkContext object
(def sc (f/spark-context c))

;; read in plays from nbawowy database
(def plays (f/text-file sc "/Users/evanzamir/Code/Clojure/flambo-nba/resources/plays.json")) ;; returns an unrealized lazy dataset

;; define a function that prints out made three-point field goals per player
(defn field-goals-made-by-player
  [team p]
  (let [fgm (-> p
                ;; parse each line of JSON text into a map with keyword keys
                (f/map (f/fn [x] (json/read-str x :key-fn keyword)))
                ;; keep only made three-point attempts by the given team
                (f/filter (f/fn [x] (and (= "fga" (:type x))
                                         (= 3 (:value x))
                                         (true? (:made x))
                                         (= team (:team x)))))
                ;; emit a [SHOOTER 1] pair for every made shot
                (f/map (f/fn [x] [(.toUpperCase (:shooter x)) 1]))
                ;; sum the 1s per shooter (assumes a Flambo version whose
                ;; reduce-by-key accepts and returns [key value] vectors)
                (f/reduce-by-key (f/fn [x y] (+ x y)))
                ;; realize the RDD and bring the totals back to the driver
                f/collect)]
    ;; sort by count, highest first, and print
    (pprint/pprint (sort-by last > fgm))))

(field-goals-made-by-player "Warriors" plays)

The very last line calls the function, which prints a list of Warriors players with their made three-pointers last season, sorted in descending order.


I’m not going to explain the code, except to say it is basically a series of very common functional operations, including filter, map, and reduce. Every line in the code where you see “f/operation” is the Flambo API instructing Spark to do some operation on a dataset (called an RDD in Spark terminology). There is another important point to be made about the code. In the second f/map call you can see the .toUpperCase method being called. This is interesting because it is actually a Java method being called from Clojure inside a function that is passed to the Spark engine. One of the design principles of Clojure is to enable very transparent and powerful interoperability with Java, which lets one take advantage of the tremendous number of Java libraries available. It is a huge win (and also true for Scala, btw).
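For readers who have not used these operations before, the same filter/map/reduce shape can be sketched in plain Clojure on an in-memory vector, with no Spark or Flambo involved. This is only an analogue of the pipeline, not the Flambo code itself: the function name made-threes-by-player and the sample plays (with placeholder player names) are made up for illustration, and the per-key sum is done with an ordinary reduce rather than Spark’s reduce-by-key.

```clojure
;; Hypothetical in-memory plays; only the keys used below are included.
(def sample-plays
  [{:type "fga" :value 3 :made true  :team "Warriors" :shooter "Player A"}
   {:type "fga" :value 3 :made true  :team "Warriors" :shooter "Player B"}
   {:type "fga" :value 3 :made false :team "Warriors" :shooter "Player B"}
   {:type "fga" :value 2 :made true  :team "Warriors" :shooter "Player A"}
   {:type "fga" :value 3 :made true  :team "Warriors" :shooter "Player A"}
   {:type "fga" :value 3 :made true  :team "76ers"    :shooter "Player C"}])

(defn made-threes-by-player
  "Counts made three-pointers per shooter for `team`, highest count first."
  [team plays]
  (->> plays
       ;; filter: keep made three-point attempts by the given team
       (filter (fn [x] (and (= "fga" (:type x))
                            (= 3 (:value x))
                            (true? (:made x))
                            (= team (:team x)))))
       ;; map: emit [SHOOTER 1] pairs; .toUpperCase is java.lang.String's method
       (map (fn [x] [(.toUpperCase ^String (:shooter x)) 1]))
       ;; reduce: sum the 1s per shooter into a map
       (reduce (fn [acc [k v]] (update acc k (fnil + 0) v)) {})
       ;; sort by count, descending
       (sort-by val >)
       (vec)))

(made-threes-by-player "Warriors" sample-plays)
;; => [["PLAYER A" 2] ["PLAYER B" 1]]
```

The difference with Spark is where the work happens: here everything runs in one thread over a vector, while the f/ versions describe the same steps over a dataset that Spark can split across cores or machines.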

I hope this post was useful. It really just scratches the surface of what is possible. This was all done locally on a MacBook Pro (automatically multi-threaded though!). The real fun begins when you take the code to a cluster (think EC2 and S3). It wouldn’t surprise me at all if some NBA analytics departments working with SportVU data are already headed down this path even as you read this. I would encourage anyone interested in a future in analytics (NBA or otherwise) to check out these projects.
