On the Side: Using Apache Spark and Clojure for Basketball Reasons(?)

Flambo!

I spend a lot of time thinking about "the next big thing". In tech it seems you can almost never be too far ahead of the curve. Whatever toolset you're working with now, chances are some kids out there are spending their nights trying to obsolete, er, disrupt it (Note: "disrupt" obsoleted the word "obsolete").

When I started building nbawowy.com towards the end of 2012, I decided to use a stack that wasn't all that common, but fast forward to today and the "MEAN" stack (MEAN = MongoDB, Express, AngularJS, and Node.js), as it came to be called, seems to be everywhere (funny enough, I was actually referring to it as AMEN). AngularJS (a Google-backed project) now has over 27,000 stars, almost 10,000 more than Backbone, which was widely considered the "default" JavaScript front-end framework before 2013.

This past year I spent a lot of time in my day job learning more about "big data" and how to deal with it. In the past 5 or so years this mostly meant learning how to run MapReduce jobs on Hadoop, either by hand-coding them yourself in Java or using a higher-level scripting language, such as Pig or Hive. Not being a Java developer myself, I decided to learn Pig (a top-level Apache project) and it has made me much more productive.

Let me tell you how. At my work (a "social network" app called Skout) we generate a lot of data every day, not nearly as much as a Twitter or Facebook, of course, but enough to make it inconvenient to work with using traditional means (MySQL!). Last time I checked we were generating somewhere in the neighborhood of 100 million data messages per day (a "data message" is a little piece of JSON-formatted text sent over the network that tells us about an action taken by the user in the app). Like many companies, we store these messages on S3, an Amazon AWS service which is essentially an infinitely (for our purposes) scalable storage service in the cloud.

You can think of S3 as a really gigantic hard drive. What MapReduce (or Pig in my case) allows one to do is query the data in an ad hoc fashion, but the catch is that up until now this has mostly been a batch process. So one of my queries (...count all the chat messages sent by women under the age of 25 in Asian countries on Android phones over the past week) might take anywhere from 10 minutes to upwards of an hour. It's better than nothing, and often the only way to get real answers, but it sort of takes the hoc out of ad hoc. What I'd really like to be able to do (and so would everyone else in tech) is interactively query the data on S3 (or some other Hadoop service). And by "interactive", I mean essentially get real-time or near real-time (seconds to a couple minutes) results as one would get by querying a MySQL database (at least, one designed for such a purpose). With such a system it becomes possible to iterate much faster. It also literally enables data scientists to implement iterative algorithms that were previously not feasible using the current MapReduce toolset.

Enter Apache Spark, a cluster computing project coming out of UC Berkeley that has burst onto the big data scene in the past year. The selling point of Spark is the following:

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

The promise of Spark is to enable a whole new set of big data applications. Naturally, I became intrigued when I first learned about it, and thought it could be a great new tool for my day job. My second thought was...can I use this for basketball statistics? The obvious answer being: Sure, why the hell not? One thing that is useful about being a (self-proclaimed) NBA stats geek is that I always have a fun data sandbox at my command (I'm not sure there are two things, actually).

Spark comes out of the box with an API in three different programming languages: Java, Scala (the source code language), and Python. Unfortunately, I'm not using any of those languages, and the language I typically use for such things (Ruby) isn't supported (yet, although I'm sure there will eventually be such a project). There is a SparkR project, but I had another idea. In the past few months I have taken up the task of learning Clojure, which is basically a Lisp that runs on the JVM.  Scala, by the way, is in a similar vein in that it is a functional language hosted on the JVM. In researching the two languages, I simply decided that Clojure was eventually what "all the cool kids" would be doing, and that's always where I want to be. Also, Rich Hickey, the developer of Clojure, is brilliant and reminds me of the 70's version of Doctor Who.

Fortunately, there is a project called Flambo that is developing a Clojure API for Spark. I decided to give it a try. I'm in the very early phase of the learning curve, but I've already figured out enough to see that this is shaping up to be a very cool/powerful data stack, indeed.

First, here is a sample of the data set I'm using, which comes straight from my nbawowy database:

{
	"76ers" : [
		"Lorenzo Brown",
		"Elliot Williams",
		"Hollis Thompson",
		"Brandon Davies",
		"Daniel Orton"
	],
	"Timberwolves" : [
		"A.J. Price",
		"Alexey Shved",
		"Robbie Hummel",
		"Ronny Turiaf",
		"Gorgui Dieng"
	],
	"_id" : ObjectId("53531a345bca6d54dd0382b2"),
	"as" : 120,
	"assist" : null,
	"away" : "Timberwolves",
	"coords" : {
		"x" : 13,
		"y" : 15
	},
	"date" : "2014-01-06",
	"distance" : 16,
	"espn_id" : "400489378",
	"event" : "A.J. Price makes a pull up jump shot from 16 feet out.",
	"home" : "76ers",
	"hs" : 93,
	"last_state" : {
		"type" : "fga",
		"val" : 2,
		"rel" : "jump shot",
		"made" : true,
		"shooter" : "Daniel Orton",
		"dist" : 17
	},
	"made" : true,
	"opponent" : "76ers",
	"pd" : 27,
	"pid" : 424,
	"points" : 2,
	"q" : 4,
	"release" : "pull up jump shot",
	"season" : "2014",
	"shooter" : "A.J. Price",
	"t" : "2:22",
	"team" : "Timberwolves",
	"type" : "fga",
	"url" : "http://scores.nbcsports.msnbc.com/nba/pbp.asp?gamecode=2014010620",
	"value" : 2
}

This is a single play. Each season of nbawowy has roughly 550K plays just like this with metadata describing all kinds of things I pull out from the play-by-play data with my current parser (written in Ruby). The 2013-2014 season is a little under 500 MB of data like this. I "dumped" it to a text file that could then be processed with Flambo/Spark.

The following is a code sample that produces the number of made three-point field goals by the Warriors last season in descending order (comments are denoted by leading semi-colons):

;; create a namespace and require libraries
(ns flambo.clojure.spark.demo
  (:require [flambo.conf :as conf])
  (:require [flambo.api :as f])
  (:require [clojure.data.json :as json] [clojure.pprint]))

;; configure Spark
(def c (-> (conf/spark-conf)
           (conf/master "local[*]")
           (conf/app-name "nba_dsl")))

;; create a SparkContext object
(def sc (f/spark-context c))

;; read in plays from nbawowy database
(def plays (f/text-file sc "/Users/evanzamir/Code/Clojure/flambo-nba/resources/plays.json")) ;; returns an unrealized lazy dataset

;; define a function that prints out field goals
(defn field-goals-made-by-player
  [team p]
  (let
      [fgm
       (-> p
           (f/map (f/fn [x] (json/read-str x :key-fn keyword)))
           (f/filter (f/fn [x] (and (= "fga" (:type x))
                                    (= 3 (:value x))
                                    (= true (:made x))
                                    (= team (:team x)))))
           (f/map (f/fn [x] [(.toUpperCase (:shooter x)) 1]))
           (f/reduce-by-key (f/fn [x y] (+ x y)))
           f/collect)]
    (clojure.pprint/pprint (sort-by last > fgm))))

(field-goals-made-by-player "Warriors" plays)

The results of this code (generated by the very last line) are a list of Warriors 3pt fgm last season:

(["STEPHEN CURRY" 261]
["KLAY THOMPSON" 223]
["HARRISON BARNES" 66]
["ANDRE IGUODALA" 62]
["DRAYMOND GREEN" 55]
["JORDAN CRAWFORD" 40]
["STEVE BLAKE" 27]
["TONEY DOUGLAS" 19]
["KENT BAZEMORE" 10]
["MARREESE SPEIGHTS" 8]
["NEMANJA NEDOVIC" 3])

I'm not going to explain the code, except to say it is basically a series of very common functional operations, including filter, map, and reduce. Every place in the code where you see "f/operation" is the Flambo API instructing Spark to do some operation on a dataset (called an RDD in Spark terminology). There is another important point to be made about the code. You can see in line 29 (the second f/map) the .toUpperCase function being called. This is interesting because it is actually a Java method being called from Clojure and passed to the Spark engine. One of the design principles of Clojure is to enable very transparent and powerful interoperability with Java, which lets one take advantage of the tremendous number of Java libraries available. It is a huge win (and also true for Scala, btw).
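For anyone who doesn't read Lisp, the same filter/map/reduce shape can be sketched in plain JavaScript on an in-memory array. This is only an illustration and has nothing to do with Flambo or Spark internals: the function name threesMadeByPlayer and the toy plays (with placeholder player names) are made up for the example, and only the field names (type, value, made, team, shooter) come from the play document shown earlier.

```javascript
// Count made three-pointers per shooter for one team, highest total first.
// `lines` is an array of JSON strings, one play per string (an assumption
// that mirrors reading a text file line by line).
function threesMadeByPlayer(team, lines) {
  const counts = lines
    .map((line) => JSON.parse(line))              // parse each play
    .filter((p) => p.type === "fga" && p.value === 3 &&
                   p.made === true && p.team === team) // made threes by `team`
    .map((p) => [p.shooter.toUpperCase(), 1])     // emit [NAME, 1] pairs
    .reduce((acc, [name, n]) => {                 // sum the 1s per name
      acc[name] = (acc[name] || 0) + n;
      return acc;
    }, {});
  // Turn the totals into [name, count] pairs sorted in descending order.
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}

// Made-up plays with placeholder names, only to exercise the function.
const toyLines = [
  { type: "fga", value: 3, made: true,  team: "Warriors", shooter: "Player A" },
  { type: "fga", value: 3, made: true,  team: "Warriors", shooter: "Player B" },
  { type: "fga", value: 3, made: true,  team: "Warriors", shooter: "Player A" },
  { type: "fga", value: 2, made: true,  team: "Warriors", shooter: "Player B" },
  { type: "fga", value: 3, made: false, team: "Warriors", shooter: "Player B" },
  { type: "fga", value: 3, made: true,  team: "76ers",    shooter: "Player C" },
].map((play) => JSON.stringify(play));

console.log(threesMadeByPlayer("Warriors", toyLines));
```

The difference with Spark is where the work happens: here everything runs eagerly in one process, whereas the Flambo version describes the same steps and lets Spark distribute them.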

I hope this post was useful. It really just scratches the surface of what is possible. This was all done locally on a MacBook Pro (automatically multi-threaded though!). The real fun begins when you take the code to a cluster (think EC2 and S3). It wouldn't surprise me at all if some NBA analytics departments working with SportVU data are already headed down this path even as you read this. I would encourage anyone interested in a future in analytics (NBA or otherwise) to check out these projects.

NBA Combine Measurement Similarities

I'm sick, so I data.

With the annual NBA Draft Combine having completed the anthropometric and athletic testing portion, it's a good time to update the similarity study I did a few years ago here. To summarize, I take all the testing categories available from DraftExpress (from 2009 through 2014) and use a couple of R packages (ape and cluster) to spit out the similarities between players. The result is a circular dendrogram. The closer two players are on the dendrogram, the more similar they are in terms of the combine results.

A few examples of closest comps for fun:

  • Gary Harris and Austin Rivers
  • Thanasis Ante... and Wesley Johnson
  • James Young and Xavier Henry
  • Aaron Craft and Jimmer Fredette
  • Jahii Carson and Peyton Siva
  • Jordan McRae and Jeremy Lamb
  • Noah Vonleh and Derrick Favors

See if you can find some others. It's not perfect, of course. But it's fun. Should entertain you for at least several minutes. Enjoy! Pass it around on the interwebz if you like.

 

NBA Draft Combine similarities 2009-2014.

A History of Hating Harrison Barnes

I think Twitter is amazing. It is also somewhat, perhaps mostly, responsible for the diminution in frequency of my long-form blog posts here and at GSoM over the last couple years (also I got really freaking busy with the nbawowy stuff). It's just so easy on Twitter to communicate your thoughts in real time that I often feel like I've already said everything I want to say, and it obviates the need for more than 140 characters at a time that the old-fashioned blog platform originally provided.

If you are reading this, there is a good chance you follow me on Twitter, and if you follow me on Twitter you probably have heard me make a disparaging remark or two about the play of a certain Golden State Warrior who arrived by way of North Carolina and Iowa. I'm referring, of course, to Harrison Bryce Jordan Barnes. They say don't hate the player, hate the game. Well, I've tried my best to hate the game, but I am continually accused of hating the player regardless.

As a fun exercise for myself, and to stir the passions of Barnes fanboys everywhere, I wanted to go through my history of tweeting about Barnes (I now have over 706 tweets with "Barnes" as a search term, although some of those could be about Matt Barnes!) to see how my "hate" for this player came to be. Think of this post as the origin myth for the most rampant and prolific Barnes "hater" on all of Twitter (if you know of anyone who "hates" Barnes more than myself, let me know in the comments or on Twitter!). So without further ado...let's a do this.

I didn't think Barnes would be there at 7 leading up to the draft.

 

And cue the draft, Barnes falls to 7. I'm apparently fine with it.

 

Although in my heart and head I wanted us to take John Henson.

(since 2011)

 

(and Nicholson!)

 

(and I knew we would never have the balls to take him)

 

(oh, the wildcard!)

 

(one last Henson regret for good measure)

 

So Barnes, Ezeli, & Draymond it is. How do I feel about it at the time?

 

Uh, that's kind of spooky how accurate that fake quote turned out to be! (I'm apparently pretty good at fake interviewing people.)

I noted the hand measurements being small at the time of the draft. Anthony Davis doesn't seem to have been bothered by it (perhaps, because he was a point guard growing up), but I often think (and still do) it's a real issue for Barnes and at the core of his ball handling troubles on the perimeter:

 

Still, I was optimistic.

 

Oh, gosh. Really optimistic!

 

Starting to come down to reality.

 

Apparently I thought the bar needed to be lowered.

 

Foreshadowing here?

 

Hmm...jury still out on this one, perhaps?

 

This is still an insult apparently (but also still appropriate).

 

I think I shifted the proximity of my position on this one quite a bit in the interim.

 

This debate was a thing at the time.

 

It's really funny going back to that article to see what I had written as the "Case for Barnes":

The case I would make for Barnes actually has less to do with Barnes' strengths than it does thinking about what will work best for the team. As stated above, one of my concerns with Barnes coming off the bench is that he'll feel that he has a responsibility to be "the scorer". That is the last thing I want in terms of his development as a player. Conversely, I feel that Barnes would have to learn how to play the "right way" as a member of the starting unit, because he would be surrounded by several players that are clearly a step or two or three above him right now in terms of offensive production. Of course, one could turn this right around and argue, well, if Barnes isn't in the starting unit because of his offense, and it isn't because of his defense, then maybe he shouldn't be starting, eh? And I can't really disagree with that argument. (I'm a terrible self-debater.)

Clearly, I am now of the same opinion as the second guy in that quote.

Back to the tweets! Here I start to notice Draymond.

 

That trend would continue and intensify.

 

 

Then I started to question the kool-aid.

 

 

I was at this game tweeting from Oracle! Perhaps, it could be like this forever.

 

 

He was decent for a while!

 

(with certain caveats)

 

Here is clear evidence of me hating Harrison Barnes:

 

Much more foreshadowing!

 

I was skeptical even against Denver.

 

At the time, some people were advocating for David Lee to be moved so that Barnes could replace him. Hmm. I wonder if those people ever said they were wrong about that.

 

I still wonder this, fwiw:

 

There's that Marvin Williams comp for the first time (from me):

 

At the time, a lot of folks said they wouldn't have (I wanted Kawhi on draft night, btw):

 

A continuing concern to this day. The number one concern in my estimation.

 

This. Still. Except not so much dunking.

 

And then we got Iguodala.

 

He is coming off the bench, and he is not shining. And they are discounting it because he doesn't have the benefit of always playing with better players. Sigh.

 

I believe this was something I heard Sam Mitchell say on NBA TV:

 

It's been pretty much all downhill from there:

 

Always this. But again this season with fewer dunks.

 

Still waiting.

 

You've surely heard me say this by now:

 

And probably this too:

 

Harrison Barnes' best skill:

 

This could get awkward:

 

And so it goes on and on:

 

 

Crazy talk!

 

Ok, I'm going to stop here. It just gets worse and worse.

 

Well, one last tweet for good measure.

 

Right idea, but the execution needs some work!

In his 2+ seasons as head coach of the Golden State Warriors, Mark Jackson has clearly made improving the defense one of his highest priorities. So much so, in fact, that in a live blog/hangout yesterday morning from the Warriors training facility, Stephen Curry pointed out how all the photos of the team hanging on the wall depict the team defending the ball, as opposed to "posterizing" players on offense (so evidently "Barnes over Pekovic" is nowhere to be seen).

Curry goes on to show viewers a chart that Mark Jackson had created for the players to show them where they should try to force opponents to take shots, based on efficiencies. This is a great idea, and it's one of the things you have almost come to expect as analytics has swept into front office and coaching mentalities across the league, with the Warriors, perhaps, being one of its top proponents.

There is a curious thing, however, in this chart. And it makes me wonder how much further analytics needs to go before its lessons are fully learned (or even appreciated).

Screen Shot 2013-10-26 at 1.41.28 PM

Did you spot the problem? (If not, I suggest you read my Advanced Stats Primer!) Notice how the chart shows FG% in each region? From what we can see, there is no label as such, but to all of us who have studied the numbers even a little, it's clear that the %'s given are field goal percentages. It's sort of odd, right? I mean, if I were a player, the message I'd receive looking at this chart is that I'd rather force opponents to take "above the break" 3-pt shots (34.2%) as opposed to 16-23 ft jump shots (38.1%). But we know that a better metric to use here is "equivalent" or "effective" FG% (eFG%), which counts each made 3-pt shot as 1.5 made field goals, so that 34.2% becomes effectively 51% or so, a much better return for the offense than the long 2-pt jumpers.
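To make the arithmetic concrete, here is a small sketch of the standard eFG% formula, (FGM + 0.5 * 3PM) / FGA, applied to the two FG% values read off the chart. The function name effectiveFgPct and the "per 1000 attempts" framing are just conveniences for the example; the chart itself gives only the percentages.

```javascript
// eFG% = (FGM + 0.5 * 3PM) / FGA: a made three counts as 1.5 made field goals.
function effectiveFgPct(fgm, threesMade, fga) {
  return (fgm + 0.5 * threesMade) / fga;
}

// Per 1000 attempts, using the two FG% values shown on the chart.
const aboveBreakThree = effectiveFgPct(342, 342, 1000); // every make is a three
const longTwo = effectiveFgPct(381, 0, 1000);           // no threes in this zone

console.log((aboveBreakThree * 100).toFixed(1) + "%"); // 51.3%
console.log((longTwo * 100).toFixed(1) + "%");         // 38.1%
```

In points per shot the gap is the same story: 3 x 0.342 is about 1.03 points for the three versus 2 x 0.381 or about 0.76 points for the long two.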

And if you're thinking the numbers aren't important, that the players will only look at the colors (which to my eye are confusing, if anything), then why bother putting numbers at all? I see this as a window into the current state of affairs in the NBA. Analytics has definitely become the prominent way of thinking among the "NBA intelligentsia", and players are most likely aware of the "take-home messages", but there's still quite a ways to go until analytics becomes part of the everyday language of basketball (especially for players) in the same way that "pick and roll" or "coming off a screen" have implicit meaning.

Lists! The League's Best Scorers in 2013 according to Scoring Index

Long time, no write. I've been busy with things lately, as some of you may know. Hopefully, I can sprinkle in more posts now and again, though. So to ease back into this web logging habit, I've compiled a list of the best scorers this season from nbawowy.com (heard of it?). The "Scoring Index" (SI) is based on work I did a while back (see here and here and here and here and here) looking at the tradeoff between usage (i.e. volume shooting) and efficiency (measured by TS%). At the very edge of the TS-USG relationship, there appears to be a "frontier" of all-time great scorers.

The "Usage-Efficiency" Frontier

The list I've compiled has a minimum threshold of 250 FGA taken. The one (significant) change I've made from the earlier metric is that SI is "signed", meaning if a player actually falls outside of the frontier (above and to the right of that line on the plot), they will have an SI > 1. IOW, they are scoring at a rate even better than the all-time greats. And wouldn't you know, we happen to have a couple players like that this season. You may have heard of them.

Here's the list in all its glory. And if you're wondering (which you surely are by now)...it's Draymond Green.

Introducing NBA WOWY!

I'll make this short and sweet. As some of you know, I've spent the last few months moving my codebase over to a new database framework. After finishing that I decided that there was so much good stuff in there, that I needed to make some of it public. NBA WOWY! (nbawowy.com) — pronounced Wow-ee! — is the result.

The basic idea is that it lets you select any combination of players on or off the court and calculate the stats for all the other players. Right now, it's only got a few basic shooting stats, but much more is to come.

Update (Apr. 6): Ok, a few months later now, and here's a couple more recent screenshots:

Screen Shot 2013-04-06 at 5.13.13 PM

Screen Shot 2013-04-06 at 3.29.25 PM

Update (Jan. 6): The site now has a much fuller suite of stats, including turnovers, assists, and rebounding. More to come...

Updated screen shot.


Think of this as the "beta" version, and you can be my very first beta testers.

Let me know what you think. My e-mail address is given at the bottom of the site. I'm interested to know what features you'd like to see added, in terms of both data and usability. Also, if you find bugs, please let me know.

Here's a quick tutorial. Let's say I want to know what the Warriors shoot when David Lee is on the court. I simply select Warriors from the team menu:

Screen Shot 2013-01-04 at 11.29.10 AM

Then I select David Lee from the "ON" menu in the green box and hit the "+" button to add him to the list (which is empty at first):

Screen Shot 2013-01-04 at 11.31.17 AM

Screen Shot 2013-01-04 at 11.32.34 AM

After adding Lee, I click on the "Submit" button to run the query:

Screen Shot 2013-01-04 at 11.34.24 AM

Then I just wait for the results (which should hopefully not take more than a few seconds to calculate):

Screen Shot 2013-01-04 at 11.35.27 AM

To run a different query, simply clear the list of players or add new ones or both. You can search literally any combination of players on or off the court. That's the whole point!
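For the curious, the on/off idea can be sketched in a few lines of JavaScript. This is not the site's actual code, just a guess at the general shape: it assumes each play carries a five-man lineup array under the team's name (as in the play documents I store), and the function names lineupMatches and shootingWith, along with the single-letter placeholder players, are invented for the example.

```javascript
// True if every player in `on` is in the team's lineup for this play
// and no player in `off` is.
function lineupMatches(play, team, on, off) {
  const lineup = play[team] || [];
  return on.every((p) => lineup.includes(p)) &&
         !off.some((p) => lineup.includes(p));
}

// Basic shooting line for `team` over the plays that match the on/off filter.
function shootingWith(plays, team, on, off) {
  const shots = plays.filter((p) =>
    p.type === "fga" && p.team === team && lineupMatches(p, team, on, off));
  const fgm = shots.filter((p) => p.made).length;
  return { fga: shots.length, fgm: fgm, fgPct: shots.length ? fgm / shots.length : null };
}

// Made-up plays with placeholder player names, only to exercise the filter.
const toyPlays = [
  { type: "fga", team: "Warriors", made: true,  Warriors: ["A", "B", "C", "D", "E"] },
  { type: "fga", team: "Warriors", made: false, Warriors: ["A", "B", "C", "D", "E"] },
  { type: "fga", team: "Warriors", made: true,  Warriors: ["A", "B", "C", "D", "F"] },
  { type: "fga", team: "Warriors", made: false, Warriors: ["B", "C", "D", "E", "F"] },
  { type: "tov", team: "Warriors",              Warriors: ["A", "B", "C", "D", "E"] },
];

console.log(shootingWith(toyPlays, "Warriors", ["A"], []));    // A on the court
console.log(shootingWith(toyPlays, "Warriors", ["A"], ["F"])); // A on, F off
```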

Anyway, that's pretty much all there is to it. Have fun and keep watching the site as I will periodically roll out updates.

A Post-Christmas Post about the Knicks Offense

Let's take shooting efficiency from the field (points per shot) and see how it is affected by having Carmelo Anthony, Tyson Chandler, and Jason Kidd on or off the floor.

First, with all 3 on the floor, here are the PPS stats for every Knicks player with >= 30 FGA (each list is NAME/FGA/PPS):

With Melo, Tyson, & Kidd

  1. Tyson Chandler, 75, 1.36
  2. Jason Kidd, 74, 1.284
  3. Carmelo Anthony, 233, 1.189
  4. Ronnie Brewer, 63, 1.032
  5. J.R. Smith, 74, 0.973
  6. Raymond Felton, 175, 0.926

Now, we'll take each of them off, one at a time. The number in () is the ∆PPS from the above list with all 3 on the court.

Without Melo, With Tyson & Kidd

  1. Tyson Chandler, 33, 1.333 (-0.027)
  2. J.R. Smith, 39, 0.872 (-0.101)
  3. Jason Kidd, 39, 0.872 (-0.412)
  4. Raymond Felton, 79, 0.797 (-0.129)

Without Tyson, With Melo & Kidd

  1. Carmelo Anthony, 41, 1.0 (-0.189)

Without Kidd, With Melo & Tyson

  1. Tyson Chandler, 65, 1.446 (+0.086)
  2. Carmelo Anthony, 157, 0.968 (-0.221)
  3. Raymond Felton, 122, 0.844 (-0.082)
  4. Ronnie Brewer, 38, 0.816 (-0.216)
  5. J.R. Smith, 65, 0.769 (-0.204)
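The ∆PPS numbers in parentheses are nothing more than a subtraction against the all-three-on baseline. Here is a quick sketch using the PPS values from the lists above (the object and function names are mine, chosen for the example), shown for the "Without Melo" split:

```javascript
// PPS with Melo, Tyson, and Kidd all on the floor (from the first list).
const allThreeOn = {
  "Tyson Chandler": 1.36,
  "Jason Kidd": 1.284,
  "Carmelo Anthony": 1.189,
  "Ronnie Brewer": 1.032,
  "J.R. Smith": 0.973,
  "Raymond Felton": 0.926,
};

// PPS without Melo, with Tyson and Kidd (from the second list).
const withoutMelo = {
  "Tyson Chandler": 1.333,
  "J.R. Smith": 0.872,
  "Jason Kidd": 0.872,
  "Raymond Felton": 0.797,
};

// Difference between a split and the baseline, rounded to 3 decimals
// to hide floating-point noise.
function deltaPps(baseline, split) {
  const out = {};
  for (const name of Object.keys(split)) {
    out[name] = Number((split[name] - baseline[name]).toFixed(3));
  }
  return out;
}

console.log(deltaPps(allThreeOn, withoutMelo));
```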

Conclusion

The stats pretty much speak for themselves, don't they? What they suggest is that the offense takes a significant hit when any of the three come off the floor. Also, Tyson Chandler appears to be the only Knicks player whose efficiency doesn't fluctuate too much regardless of who is on the court with him.

ezPM ratings are back!

(If you want to get on with Christmas and stuff, you can read this later, and just check out the new ezPM link at the top of the page.)

It's taken me several months to re-code my play-by-play parser since Basketball-Value.com is no longer being updated (i.e. since Aaron Barzilai was hired by the 76ers). The cool part is that now I can make updates faster. I also have more data available to put in the model. Every play (or event) in my database has a lot of information associated with it that can be queried. To illustrate, here's a typical field goal attempt (it should be pretty straightforward to follow each field):

{
	"Lakers" : [
		"Steve Blake",
		"Kobe Bryant",
		"Antawn Jamison",
		"Pau Gasol",
		"Jordan Hill"
	],
	"Warriors" : [
		"Stephen Curry",
		"Klay Thompson",
		"Richard Jefferson",
		"Carl Landry",
		"David Lee"
	],
	"_id" : ObjectId("50d802605bca6d03c1008ad6"),
	"as" : 24,
	"away" : "Warriors",
	"block" : "Jordan Hill",
	"coords" : {
		"x" : 2,
		"y" : 10
	},
	"date" : "2012-11-09",
	"distance" : 4,
	"espn_id" : "400277800",
	"event" : "Jordan Hill blocks a Stephen Curry driving finger roll shot from 4 feet out.",
	"home" : "Lakers",
	"hs" : 27,
	"made" : false,
	"opponent" : "Lakers",
	"pid" : 142,
	"q" : 2,
	"release" : "driving finger roll shot",
	"season" : "2013",
	"shooter" : "Stephen Curry",
	"t" : "9:22",
	"team" : "Warriors",
	"type" : "fga",
	"url" : "http://scores.nbcsports.msnbc.com/nba/pbp.asp?gamecode=2012110913",
	"value" : 2
}

Here's an example of a turnover (you'll see the fields are somewhat different, because it's a different type of event):

{
	"Suns" : [
		"Goran Dragic",
		"Jared Dudley",
		"P.J. Tucker",
		"Luis Scola",
		"Marcin Gortat"
	],
	"Warriors" : [
		"Stephen Curry",
		"Jarrett Jack",
		"Klay Thompson",
		"David Lee",
		"Andrew Bogut"
	],
	"_id" : ObjectId("50d801f85bca6d03c1001113"),
	"as" : 46,
	"away" : "Warriors",
	"date" : "2012-10-31",
	"espn_id" : "400277730",
	"event" : "Stephen Curry with a bad pass turnover: Bad Pass",
	"home" : "Suns",
	"hs" : 36,
	"opponent" : "Suns",
	"pid" : 195,
	"player" : "Stephen Curry",
	"q" : 2,
	"season" : "2013",
	"t" : "3:39",
	"team" : "Warriors",
	"tov_type" : "Bad Pass",
	"type" : "tov",
	"url" : "http://scores.nbcsports.msnbc.com/nba/pbp.asp?gamecode=2012103121"
}

Anyway, after doing all this, I can now get back to routinely calculating my various metrics, and hopefully, making them even more informative in the future. For example, here are a couple of things I'm working on for a future iteration of ezPM:

  • Change value of a rebound depending on the floor location and type of release. For example, if the offense tends to have a higher OREB% after a missed layup attempt, then the value of a defensive board in that situation should be higher.
  • Similarly, a player might be debited less for a missed layup attempt, since the offense has a better chance of securing the rebound.
  • Another change that I've been wanting to make for a while is to make the value of a possession dependent on the starting state. For example, possessions started after a steal, defensive rebound, or made basket, tend to have different expected values. This should be accounted for wherever the model uses PPP.
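As a rough idea of what the last item might look like, here is a sketch that averages points per possession by starting state. None of this is in ezPM yet: the start and points field names, the function name pppByStartState, and the handful of possessions are all hypothetical, and the toy numbers are chosen only to make the arithmetic easy to follow, not to reflect real league values.

```javascript
// Average points per possession, grouped by how the possession started.
function pppByStartState(possessions) {
  const totals = {};
  for (const { start, points } of possessions) {
    const t = totals[start] || (totals[start] = { points: 0, n: 0 });
    t.points += points;
    t.n += 1;
  }
  const ppp = {};
  for (const [start, t] of Object.entries(totals)) {
    ppp[start] = t.points / t.n;
  }
  return ppp;
}

// Hypothetical possessions (made-up values, for illustration only).
const toyPossessions = [
  { start: "steal", points: 2 },
  { start: "steal", points: 0 },
  { start: "steal", points: 3 },
  { start: "steal", points: 2 },
  { start: "made_basket", points: 0 },
  { start: "made_basket", points: 2 },
  { start: "made_basket", points: 0 },
  { start: "made_basket", points: 2 },
];

console.log(pppByStartState(toyPossessions));
```

The model would then look up the expected value for the relevant starting state wherever it currently plugs in a single league-wide PPP.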

Simple Data Visualization using Node+Express+Jade

Update (2012-11-12): I created an app to go along with this post. Check it out at: http://ezamir.mongotest.jit.su.

If you know about Node, you're probably one of the cool kids. And you'll no doubt grok this post. In a nutshell, Node.js enables one to create an entire web application stack from the server to the client using JavaScript. It's pretty cool and stuff.

Another cool JavaScript thingy these days is D3, which is a library for doing all kinds of awesome visualization (that's actually what the "d3" in my domain refers to, if you were ever wondering). What D3 does is it essentially lets you bind data to elements of the DOM (i.e. the underlying structure of a web page). So D3 is really great and it has a huge and ever-growing community of users.

The reason I'm writing this post is because I have found it's not that easy to inject D3 code into a web app built on the Node stack (which almost always includes the Express framework as well). I could only find one decent tutorial, and on top of Node and Express, that code wraps D3 in an AngularJS directive. While I was trying to figure out that code, I realized that for relatively simple use cases, it's possible to bind visual elements directly using nothing more than Node+Express+Jade. Jade is a popular HTML templating language.

To demonstrate how this works, we'll visualize shot location for the Warriors this season. First, we pull the data from some data store (in this case, I'm using MongoDB):

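// Assumes the legacy (1.x) MongoDB driver and a mongoUri defined elsewhere in the app.
var Db = require('mongodb').Db;

// Route handler: look up all shots for a team and render them with shots.jade.
exports.shots = function(req, res){
    var team = req.params.team;
    Db.connect(mongoUri, function(err, db) {
        if (err) { return res.send(500, err.message); }
        db.collection('shots', function(err, coll) {
            if (err) { db.close(); return res.send(500, err.message); }
            // newest shots first, then longest distance first
            coll.find({'for':team}).sort({'date':-1,'dist':-1}).toArray(function(err, docs) {
                db.close();
                if (err) { return res.send(500, err.message); }
                res.render('shots',{shots: docs, team: team});
            });
        });
    });
};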

The important line there is: res.render('shots',{shots: docs, team: team});. This basically hands off the shot data (which is now an array) to the Jade template (called "shots.jade"). The template looks like this:

extends layout

block content

    div.hero-unit
        h1 #{shots[0].for}
    div.row
        div.span2.offset1
            svg(width=600,height=600)
                each shot in shots
                    if (shot.made)
                        circle(cx="#{(shot.coords.x+25)*10}",cy="#{shot.coords.y*10}",r="3",fill="green",stroke="black")
                    else
                        circle(cx="#{(shot.coords.x+25)*10}",cy="#{shot.coords.y*10}",r="3",fill="red",stroke="black")

What you see is that the iterator each shot in shots in the Jade template creates a circle element for each shot in the array pulled in from the database. Here's a screen shot of the final result (it's only running locally right now, so I can't give a link to the application):

Screen Shot of Jade-generated data visualization.
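The only real "logic" in that template is the mapping from court coordinates to SVG pixels, which can be pulled out into a plain function. This is a sketch of the same arithmetic as the template, not code from the app: the function name shotToCircle is made up, and my reading that the +25 shifts a left/right coordinate measured from the middle of the court into positive pixel space is an inference from the template rather than something spelled out in it.

```javascript
// Same mapping as the Jade template: shift x by 25, then scale both axes
// by 10 pixels per unit; made shots are green, misses are red.
function shotToCircle(shot) {
  return {
    cx: (shot.coords.x + 25) * 10,
    cy: shot.coords.y * 10,
    r: 3,
    fill: shot.made ? "green" : "red",
    stroke: "black",
  };
}

// Coordinates borrowed from the two field goal attempts shown in earlier posts.
console.log(shotToCircle({ coords: { x: 13, y: 15 }, made: true }));
console.log(shotToCircle({ coords: { x: 2, y: 10 }, made: false }));
```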

So there you have it. It's possible to do some basic data visualization using just Node+Express+Jade. There isn't a lot out there on this particular topic, so I figured this might help someone or give some inspiration to go further with it.

It's Early Yet, But There's Some Historically Productive Scoring in the League Right Now

You might remember I have done a bit of work on the usage-efficiency tradeoff in the past. The "payoff" was a chart that presented evidence of a usage-efficiency "frontier" (having stolen the idea of an efficient frontier from finance, of course):

All-time productive scoring seasons lie along the "frontier".

We're almost at the quarter-point of the 2012-13 season now, so I thought it would be interesting to look at the current leaders, and see where they stand with respect to the frontier. So far, pretty, pretty good. In particular, Kevin Durant, Kobe Bryant, Tyson Chandler (so good he looks to be close to setting a new point along the frontier), and Carmelo Anthony are on or very close to the frontier itself. Have a look:

The players in green make up the historical reference for the "frontier". Note that Chandler would be very near the frontier if it was extrapolated out further.


Of course, we should expect some regression to the mean. How much is anyone's guess, so I'll update the results periodically throughout the season.

A Grown Man NBA Blog