
On the Side: Using Apache Spark and Clojure for Basketball Reasons(?)

Flambo!

I spend a lot of time thinking about "the next big thing". In tech it seems you can almost never be too far ahead of the curve. Whatever toolset you're working with now, chances are some kids out there are spending their nights trying to obsolete, er, disrupt it (Note: "disrupt" obsoleted the word "obsolete").

When I started building nbawowy.com towards the end of 2012, I decided to use a stack that wasn't all that common, but fast forward to today and the "MEAN" stack (MEAN = MongoDB, Express, AngularJS, and Node.js), as it came to be called, seems to be everywhere (funnily enough, I was actually referring to it as AMEN). AngularJS (a Google-backed project) now has over 27,000 stars on GitHub, almost 10,000 more than Backbone, which was widely considered the "default" JavaScript front-end framework before 2013.

This past year I spent a lot of time in my day job learning more about "big data" and how to deal with it. In the past 5 or so years this mostly meant learning how to run MapReduce jobs on Hadoop, either by hand-coding them yourself in Java or using a higher-level scripting language, such as Pig or Hive. Not being a Java developer myself, I decided to learn Pig (a top-level Apache project) and it has made me much more productive.

Let me tell you how. At my work (a "social network" app called Skout) we generate a lot of data every day, not nearly as much as a Twitter or Facebook, of course, but enough to make it inconvenient to work with using traditional means (MySQL!). Last time I checked we were generating somewhere in the neighborhood of 100 million data messages per day (a "data message" is a little piece of JSON-formatted text sent over the network that tells us about an action taken by the user in the app). Like many companies, we store these messages on S3, an Amazon AWS service which is essentially an infinitely (for our purposes) scalable storage service in the cloud.

You can think of S3 as a really gigantic hard drive. What MapReduce (or Pig in my case) allows one to do is query the data in an ad hoc fashion, but the catch is that up until now this has mostly been a batch process. So one of my queries (...count all the chat messages sent by women under the age of 25 in Asian countries on Android phones over the past week) might take anywhere from 10 minutes to upwards of an hour. It's better than nothing, and often the only way to get real answers, but it sort of takes the hoc out of ad hoc. What I'd really like to be able to do (and so would everyone else in tech) is interactively query the data on S3 (or some other Hadoop service). And by "interactive", I mean essentially get real-time or near real-time (seconds to a couple of minutes) results, as one would get by querying a MySQL database (at least, one designed for such a purpose). With such a system it becomes possible to iterate much faster. It also literally enables data scientists to implement iterative algorithms that were previously not feasible using the current MapReduce toolset.

Enter Apache Spark, a cluster computing project coming out of UC Berkeley that has burst onto the big data scene in the past year. Spark's selling point is the following:

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

The promise of Spark is to enable a whole new set of big data applications. Naturally, I became intrigued when I first learned about it, and thought it could be a great new tool for my day job. My second thought was...can I use this for basketball statistics? The obvious answer being: Sure, why the hell not? One thing that is useful about being a (self-proclaimed) NBA stats geek is that I always have a fun data sandbox at my command (I'm not sure there are two things, actually).

Spark comes out of the box with an API in three different programming languages: Java, Scala (the source code language), and Python. Unfortunately, I'm not using any of those languages, and the language I typically use for such things (Ruby) isn't supported (yet, although I'm sure there will eventually be such a project). There is a SparkR project, but I had another idea. In the past few months I have taken up the task of learning Clojure, which is basically a Lisp that runs on the JVM. Scala, by the way, is in a similar vein in that it is a functional language hosted on the JVM. In researching the two languages, I simply decided that Clojure was eventually what "all the cool kids" would be doing, and that's always where I want to be. Also, Rich Hickey, the developer of Clojure, is brilliant and reminds me of the '70s version of Doctor Who.

Fortunately, there is a project called Flambo that is developing a Clojure API for Spark. I decided to give it a try. I'm in the very early phase of the learning curve, but I've already figured out enough to see that this is shaping up to be a very cool/powerful data stack, indeed.

First, here is a sample of the data set I'm using, which comes straight from my nbawowy database:

{
	"76ers" : [
		"Lorenzo Brown",
		"Elliot Williams",
		"Hollis Thompson",
		"Brandon Davies",
		"Daniel Orton"
	],
	"Timberwolves" : [
		"A.J. Price",
		"Alexey Shved",
		"Robbie Hummel",
		"Ronny Turiaf",
		"Gorgui Dieng"
	],
	"_id" : ObjectId("53531a345bca6d54dd0382b2"),
	"as" : 120,
	"assist" : null,
	"away" : "Timberwolves",
	"coords" : {
		"x" : 13,
		"y" : 15
	},
	"date" : "2014-01-06",
	"distance" : 16,
	"espn_id" : "400489378",
	"event" : "A.J. Price makes a pull up jump shot from 16 feet out.",
	"home" : "76ers",
	"hs" : 93,
	"last_state" : {
		"type" : "fga",
		"val" : 2,
		"rel" : "jump shot",
		"made" : true,
		"shooter" : "Daniel Orton",
		"dist" : 17
	},
	"made" : true,
	"opponent" : "76ers",
	"pd" : 27,
	"pid" : 424,
	"points" : 2,
	"q" : 4,
	"release" : "pull up jump shot",
	"season" : "2014",
	"shooter" : "A.J. Price",
	"t" : "2:22",
	"team" : "Timberwolves",
	"type" : "fga",
	"url" : "http://scores.nbcsports.msnbc.com/nba/pbp.asp?gamecode=2014010620",
	"value" : 2
}

This is a single play. Each season of nbawowy has roughly 550K plays just like this with metadata describing all kinds of things I pull out from the play-by-play data with my current parser (written in Ruby). The 2013-2014 season is a little under 500 MB of data like this. I "dumped" it to a text file that could then be processed with Flambo/Spark.

The following is a code sample that produces the number of made three-point field goals by the Warriors last season in descending order (comments are denoted by leading semi-colons):

;; create a namespace and require libraries
(ns flambo.clojure.spark.demo
  (:require [flambo.conf :as conf])
  (:require [flambo.api :as f])
  (:require [clojure.data.json :as json] [clojure.pprint])) ;; clojure.pprint is used for the final printout

;; configure Spark
(def c (-> (conf/spark-conf)
           (conf/master "local[*]")
           (conf/app-name "nba_dsl")))

;; create a SparkContext object
(def sc (f/spark-context c))

;; read in plays from nbawowy database
(def plays (f/text-file sc "/Users/evanzamir/Code/Clojure/flambo-nba/resources/plays.json")) ;; returns an unrealized lazy dataset

;; define a function that prints out field goals
(defn field-goals-made-by-player
  [team p]
  (let
      [fgm
       (-> p
           (f/map (f/fn [x] (json/read-str x :key-fn keyword)))
           (f/filter (f/fn [x] (and (= "fga" (:type x))
                                    (= 3 (:value x))
                                    (= true (:made x))
                                    (= team (:team x)))))
           (f/map (f/fn [x] [(.toUpperCase (:shooter x)) 1]))
           (f/reduce-by-key (f/fn [x y] (+ x y)))
           f/collect)]
    (clojure.pprint/pprint (sort-by last > fgm))))

(field-goals-made-by-player "Warriors" plays)

The results of this code (generated by the very last line) are a list of Warriors 3pt fgm last season:

(["STEPHEN CURRY" 261]
["KLAY THOMPSON" 223]
["HARRISON BARNES" 66]
["ANDRE IGUODALA" 62]
["DRAYMOND GREEN" 55]
["JORDAN CRAWFORD" 40]
["STEVE BLAKE" 27]
["TONEY DOUGLAS" 19]
["KENT BAZEMORE" 10]
["MARREESE SPEIGHTS" 8]
["NEMANJA NEDOVIC" 3])

I'm not going to explain the code, except to say it is basically a series of very common functional operations, including filter, map, and reduce. Every line in the code where you see "f/operation" is the Flambo API instructing Spark to do some operation on a dataset (called an RDD in Spark terminology). There is another important point to be made about the code. You can see in line 29 (the second f/map call) the .toUpperCase method being called. This is interesting because it is actually a Java method (from java.lang.String) being called from Clojure, inside a function that is passed to the Spark engine. One of the design principles of Clojure is to enable very transparent and powerful interoperability with Java, which lets one take advantage of the tremendous number of Java libraries available. It is a huge win (and also true for Scala, btw).
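For anyone who doesn't read Lisp, here is roughly the same filter/map/reduce shape sketched in plain JavaScript over an in-memory array. To be clear, this is not Flambo or Spark and nothing here is distributed; the function name threesMadeByPlayer and the handful of sample lines are made up for illustration, and only the field names (type, value, made, team, shooter) come from the sample play above.

```javascript
// Hypothetical stand-in for the plays file: one JSON string per play.
const samplePlayLines = [
  '{"type":"fga","value":3,"made":true,"team":"Warriors","shooter":"Stephen Curry"}',
  '{"type":"fga","value":3,"made":true,"team":"Warriors","shooter":"Klay Thompson"}',
  '{"type":"fga","value":3,"made":true,"team":"Warriors","shooter":"Stephen Curry"}',
  '{"type":"fga","value":2,"made":true,"team":"Warriors","shooter":"Stephen Curry"}',
  '{"type":"fga","value":3,"made":false,"team":"Warriors","shooter":"Klay Thompson"}',
  '{"type":"fga","value":3,"made":true,"team":"76ers","shooter":"Hollis Thompson"}'
];

function threesMadeByPlayer(team, lines) {
  const counts = lines
    // parse each line of text into an object (the first f/map)
    .map((line) => JSON.parse(line))
    // keep only made three-point attempts by the given team (the f/filter)
    .filter((p) => p.type === 'fga' && p.value === 3 && p.made === true && p.team === team)
    // emit a [shooter, 1] pair per made three (the second f/map)
    .map((p) => [p.shooter.toUpperCase(), 1])
    // sum the 1s per shooter (what f/reduce-by-key does)
    .reduce((acc, [shooter, n]) => {
      acc[shooter] = (acc[shooter] || 0) + n;
      return acc;
    }, {});
  // sort descending by count, like (sort-by last > fgm)
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}

console.log(threesMadeByPlayer('Warriors', samplePlayLines));
// → [ [ 'STEPHEN CURRY', 2 ], [ 'KLAY THOMPSON', 1 ] ]
```

The difference with Spark, of course, is that each of those steps can be spread across many cores or machines instead of running over one array in one process.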

I hope this post was useful. It really just scratches the surface of what is possible. This was all done locally on a MacBook Pro (automatically multi-threaded though!). The real fun begins when you take the code to a cluster (think EC2 and S3). It wouldn't surprise me at all if some NBA analytics departments working with SportVU data are already headed down this path even as you read this. I would encourage anyone interested in a future in analytics (NBA or otherwise) to check out these projects.

NBA Combine Measurement Similarities

I'm sick, so I data.

With the annual NBA Draft Combine having completed the anthropometric and athletic testing portion, it's a good time to update the similarity study I did a few years ago here. To summarize, I take all the testing categories available from DraftExpress (from 2009 through 2014) and use a couple of R packages (ape and cluster) to spit out the similarities between players. The result is a circular dendrogram. The closer two players are on the dendrogram, the more similar they are in terms of the combine results.
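If you want a feel for what "similar" means here, the sketch below is a much simpler stand-in for the real thing. It is not what the R packages do (there is no clustering or dendrogram); it just standardizes each measurement and looks for the nearest player by Euclidean distance. The players and their numbers are invented, and the choice of three features is an assumption for illustration.

```javascript
// Hypothetical measurements (height in., wingspan in., weight lb); not real combine data.
const sampleProspects = [
  { name: 'Player A', height: 76, wingspan: 80, weight: 205 },
  { name: 'Player B', height: 76.5, wingspan: 80.5, weight: 207 },
  { name: 'Player C', height: 82, wingspan: 88, weight: 245 },
  { name: 'Player D', height: 81.5, wingspan: 87, weight: 240 }
];
const sampleFeatures = ['height', 'wingspan', 'weight'];

// Rescale every feature to mean 0 and standard deviation 1 so that
// pounds don't swamp inches in the distance calculation.
function standardize(players, keys) {
  const stats = keys.map((k) => {
    const vals = players.map((p) => p[k]);
    const mean = vals.reduce((a, b) => a + b, 0) / vals.length;
    const variance = vals.reduce((a, v) => a + (v - mean) ** 2, 0) / vals.length;
    return { mean, sd: Math.sqrt(variance) || 1 };
  });
  return players.map((p) => ({
    name: p.name,
    z: keys.map((k, i) => (p[k] - stats[i].mean) / stats[i].sd)
  }));
}

// Plain Euclidean distance between two standardized vectors.
function euclidean(a, b) {
  return Math.sqrt(a.reduce((sum, v, i) => sum + (v - b[i]) ** 2, 0));
}

// Name of the single closest comp for one player.
function closestComp(name, players, keys) {
  const scaled = standardize(players, keys);
  const target = scaled.find((p) => p.name === name);
  return scaled
    .filter((p) => p.name !== name)
    .map((p) => ({ name: p.name, dist: euclidean(target.z, p.z) }))
    .sort((x, y) => x.dist - y.dist)[0].name;
}

console.log(closestComp('Player A', sampleProspects, sampleFeatures)); // → Player B
```

A dendrogram goes further by repeatedly merging the closest players and groups, but the pairwise distance is the basic ingredient.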

A few examples of closest comps for fun:

  • Gary Harris and Austin Rivers
  • Thanasis Ante... and Wesley Johnson
  • James Young and Xavier Henry
  • Aaron Craft and Jimmer Fredette
  • Jahii Carson and Peyton Siva
  • Jordan McRae and Jeremy Lamb
  • Noah Vonleh and Derrick Favors

See if you can find some others. It's not perfect, of course. But it's fun. Should entertain you for at least several minutes. Enjoy! Pass it around on the interwebz if you like.

 

NBA Draft Combine similarities 2009-2014.

A History of Hating Harrison Barnes

I think Twitter is amazing. It is also somewhat, perhaps mostly, responsible for the diminution in frequency of my long-form blog posts here and at GSoM over the last couple years (also I got really freaking busy with the nbawowy stuff). It's just so easy on Twitter to communicate your thoughts in real time that I often feel like I've already said everything I want to say, which obviates the need for the more than 140 characters at a time that the old-fashioned blog platform provides.

If you are reading this, there is a good chance you follow me on Twitter, and if you follow me on Twitter you probably have heard me make a disparaging remark or two about the play of a certain Golden State Warrior who arrived by way of North Carolina and Iowa. I'm referring, of course, to Harrison Bryce Jordan Barnes. They say don't hate the player, hate the game. Well, I've tried my best to hate the game, but I am continually accused of hating the player regardless.

As a fun exercise for myself, and to stir the passions of Barnes fanboys everywhere, I wanted to go through my history of tweeting about Barnes (I now have over 706 tweets with "Barnes" as a search term, although some of those could be about Matt Barnes!) to see how my "hate" for this player came to be. Think of this post as the origin myth for the most rampant and prolific Barnes "hater" on all of Twitter (if you know of anyone who "hates" Barnes more than myself, let me know in the comments or on Twitter!). So without any further ado...let's a do this.

I didn't think Barnes would be there at 7 leading up to the draft.

 

And cue the draft, Barnes falls to 7. I'm apparently fine with it.

 

Although in my heart and head I wanted us to take John Henson.

(since 2011)

 

(and Nicholson!)

 

(and I knew we would never have the balls to take him)

 

(oh, the wildcard!)

 

(one last Henson regret for good measure)

 

So Barnes, Ezeli, & Draymond it is. How do I feel about it at the time?

 

Uh, that's kind of spooky how accurate that fake quote turned out to be! (I'm apparently pretty good at fake interviewing people.)

I noted the hand measurements being small at the time of the draft. Anthony Davis doesn't seem to have been bothered by it (perhaps, because he was a point guard growing up), but I often think (and still do) it's a real issue for Barnes and at the core of his ball handling troubles on the perimeter:

 

Still, I was optimistic.

 

Oh, gosh. Really optimistic!

 

Starting to come down to reality.

 

Apparently I thought the bar needed to be lowered.

 

Foreshadowing here?

 

Hmm...jury still out on this one, perhaps?

 

This is still an insult apparently (but also still appropriate).

 

I think I shifted the proximity of my position on this one quite a bit in the interim.

 

This debate was a thing at the time.

 

It's really funny going back to that article to see what I had written as the "Case for Barnes":

The case I would make for Barnes actually has less to do with Barnes' strengths than it does with thinking about what will work best for the team. As stated above, one of my concerns with Barnes coming off the bench is that he'll feel that he has a responsibility to be "the scorer". That is the last thing I want in terms of his development as a player. Conversely, I feel that Barnes would have to learn how to play the "right way" as a member of the starting unit, because he would be surrounded by several players that are clearly a step or two or three above him right now in terms of offensive production. Of course, one could turn this right around and argue, well, if Barnes isn't in the starting unit because of his offense, and it isn't because of his defense, then maybe he shouldn't be starting, eh? And I can't really disagree with that argument. (I'm a terrible self-debater.)

Clearly, I am now of the same opinion as the second guy in that quote.

Back to the tweets! Here I start to notice Draymond.

 

That trend would continue and intensify.

 

 

Then I started to question the kool-aid.

 

 

I was at this game tweeting from Oracle! Perhaps, it could be like this forever.

 

 

He was decent for a while!

 

(with certain caveats)

 

Here is clear evidence of me hating Harrison Barnes:

 

Much more foreshadowing!

 

I was skeptical even against Denver.

 

At the time, some people were advocating for David Lee to be moved so that Barnes could replace him. Hmm. I wonder if those people ever said they were wrong about that.

 

I still wonder this, fwiw:

 

There's that Marvin Williams comp for the first time (from me):

 

At the time, a lot of folks said they wouldn't have (I wanted Kawhi on draft night, btw):

 

A continuing concern to this day. The number one concern in my estimation.

 

This. Still. Except not so much dunking.

 

And then we got Iguodala.

 

He is coming off the bench, and he is not shining. And they are discounting it because he doesn't have the benefit of always playing with better players. Sigh.

 

I believe this was something I heard Sam Mitchell say on NBA TV:

 

It's been pretty much all downhill from there:

 

Always this. But again this season with fewer dunks.

 

Still waiting.

 

You've surely heard me say this by now:

 

And probably this too:

 

Harrison Barnes' best skill:

 

This could get awkward:

 

And so it goes on and on:

 

 

Crazy talk!

 

Ok, I'm going to stop here. It just gets worse and worse.

 

Well, one last tweet for good measure.

 

Right idea, but the execution needs some work!

In his 2+ seasons as head coach of the Golden State Warriors, Mark Jackson has clearly made improving the defense one of his highest priorities. So much so, in fact, that in a live blog/hangout yesterday morning from the Warriors training facility, Stephen Curry pointed out how all the photos of the team hanging on the wall depict the team defending the ball, as opposed to "posterizing" players on offense (so evidently "Barnes over Pekovic" is nowhere to be seen).

Curry goes on to show viewers a chart that Mark Jackson had created for the players to show them where they should try to force opponents to take shots, based on efficiencies. This is a great idea, and it's one of the things you have almost come to expect as analytics has swept into front office and coaching mentalities across the league, with the Warriors, perhaps, being one of its top proponents.

There is a curious thing, however, in this chart. And it makes me wonder how much further analytics needs to go before its lessons are fully learned (or even appreciated).

Screen Shot 2013-10-26 at 1.41.28 PM

Did you spot the problem? (If not, I suggest you read my Advanced Stats Primer!) Notice how the chart shows FG% in each region? From what we can see, there is no label as such, but to all of us who have studied the numbers even a little, it's clear that the %'s given are field goal percentages. It's sort of odd, right? I mean, if I was a player, the message I'd receive looking at this chart is that I'd rather force opponents to take "above the break" 3-pt shots (34.2%) as opposed to 16-23 ft jump shots (38.1%). But we know that a better metric to use here is "equivalent" or "effective" FG% (eFG%), which multiplies 3-pt shots by 1.5X, so that 34.2% becomes effectively 51% or so, much better than the long 2-pt jumpers.
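Here is the arithmetic spelled out as a small sketch. The two percentages are the ones from the chart; the function names are mine. The general formula is eFG% = (FGM + 0.5 × 3PM) / FGA, which reduces to 1.5 × FG% for a zone where every attempt is a three.

```javascript
// General definition: a made three counts as 1.5 made field goals.
function effectiveFgPct(fgMade, threesMade, fgAttempted) {
  return (fgMade + 0.5 * threesMade) / fgAttempted;
}

// Shortcut for a single shot zone, where every attempt is worth the same.
function zoneEfgPct(fgPct, isThreePointZone) {
  return isThreePointZone ? 1.5 * fgPct : fgPct;
}

const aboveBreakThree = zoneEfgPct(0.342, true);   // FG% of 34.2% on threes
const longTwo = zoneEfgPct(0.381, false);          // FG% of 38.1% on 16-23 ft twos

console.log(aboveBreakThree.toFixed(3)); // → 0.513
console.log(longTwo.toFixed(3));         // → 0.381
console.log(aboveBreakThree > longTwo);  // → true
```

So by raw FG% the defense "wins" by giving up the three, but by eFG% the long two is clearly the shot you would rather concede.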

And if you're thinking the numbers aren't important, that the players will only look at the colors (which to my eye are confusing, if anything), then why bother putting numbers at all? I see this as a window into the current state of affairs in the NBA. Analytics has definitely become the prominent way of thinking among the "NBA intelligentsia", and players are most likely aware of the "take-home messages", but there's still quite a ways to go until analytics becomes part of the everyday language of basketball (especially for players) in the same way that "pick and roll" or "coming off a screen" have implicit meaning.

Lists! The League's Best Scorers in 2013 according to Scoring Index

Long time, no write. I've been busy with things lately, as some of you may know. Hopefully, I can sprinkle in more posts now and again, though. So to ease back into this web logging habit, I've compiled a list of the best scorers this season from nbawowy.com (heard of it?). The "Scoring Index" (SI) is based on work I did a while back (see here and here and here and here and here) looking at the tradeoff between usage (i.e. volume shooting) and efficiency (measured by TS%). At the very edge of the TS-USG relationship, there appears to be a "frontier" of all-time great scorers.
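As a refresher on the efficiency half of that tradeoff, here is a minimal sketch of the standard TS% formula, PTS / (2 × (FGA + 0.44 × FTA)). The stat line in the example is made up, and I'm not attempting to reproduce the Scoring Index itself here, since that depends on the fitted frontier from the earlier posts.

```javascript
// True shooting percentage: points per "true" shot attempt, halved so it
// lands on the same scale as FG%. The 0.44 factor is the conventional
// estimate of how many free throw attempts end a possession.
function trueShootingPct(points, fga, fta) {
  return points / (2 * (fga + 0.44 * fta));
}

// Hypothetical per-game line: 25 points on 18 FGA and 6 FTA.
console.log(trueShootingPct(25, 18, 6).toFixed(3)); // → 0.606
```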

The "Usage-Efficiency" Frontier

The list I've compiled has a minimum threshold of 250 FGA taken. The one (significant) change I've made from the earlier metric is that SI is "signed", meaning if a player actually falls outside of the frontier (above and to the right of that line on the plot), they will have a SI > 1. IOW, they are scoring at a rate even better than the all-time greats. And wouldn't you know, we happen to have a couple players like that this season. You may have heard of them.

Here's the list in all its glory. And if you're wondering (which you surely are by now)...it's Draymond Green.

Simple Data Visualization using Node+Express+Jade

Update (2012-11-12): I created an app to go along with this post. Check it out at: http://ezamir.mongotest.jit.su.

If you know about Node, you're probably one of the cool kids. And you'll no doubt grok this post. In a nutshell, Node.js enables one to create an entire web application stack from the server to the client using JavaScript. It's pretty cool and stuff.

Another cool JavaScript thingy these days is D3, which is a library for doing all kinds of awesome visualization (that's actually what the "d3" in my domain refers to, if you were ever wondering). What D3 does is it essentially lets you bind data to elements of the DOM (i.e. the underlying structure of a web page). So D3 is really great and it has a huge and ever-growing community of users.

The reason I'm writing this post is because I have found it's not that easy to inject D3 code into a web app built on the Node stack (which almost always includes the Express framework as well). I could only find one decent tutorial, and on top of Node and Express, that code wraps D3 in an AngularJS directive. While I was trying to figure out that code, I realized that for relatively simple use cases, it's possible to bind visual elements directly using nothing more than Node+Express+Jade. Jade is a popular HTML templating language.

To demonstrate how this works, we'll visualize shot location for the Warriors this season. First, we pull the data from some data store (in this case, I'm using MongoDB):

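// Route handler: look up every shot for the team named in the URL and render it.
exports.shots = function(req, res){
    var team = req.params.team;
    console.log(team);
    Db.connect(mongoUri, function(err, db) {
        // Bail out early if the database connection fails
        if (err) { return res.status(500).send('Could not connect to the database'); }
        console.log('show all shots!');
        db.collection('shots', function(err, coll) {
            // Newest shots first, then longest distance first
            coll.find({'for':team}).sort({'date':-1,'dist':-1}).toArray(function(err, docs) {
                db.close();
                if (err) { return res.status(500).send('Could not load shots'); }
                res.render('shots',{shots: docs, team: team});
            });
        });
    });
};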

The important line there is: res.render('shots',{shots: docs, team: team});. This basically hands off the shot data (which is now an array) to the Jade template (called "shots.jade"). The template looks like this:

extends layout

block content

    div.hero-unit
        h1 #{shots[0].for}
    div.row
        div.span2.offset1
            svg(width=600,height=600)
                each shot in shots
                    if (shot.made)
                        circle(cx="#{(shot.coords.x+25)*10}",cy="#{shot.coords.y*10}",r="3",fill="green",stroke="black")
                    else
                        circle(cx="#{(shot.coords.x+25)*10}",cy="#{shot.coords.y*10}",r="3",fill="red",stroke="black")

What you see is that the iterator each shot in shots in the Jade template created a circle element for each shot in the array pulled in from the database. Here's a screen shot of the final result (it's only running locally right now, so I can't give a link to the application):
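If it helps to see what the template is doing without Jade in the way, this sketch builds the same circle markup with plain string templates. It is only an illustration of the mapping, not how Jade renders internally; the function names are mine, and the coordinate transform ((x + 25) * 10, y * 10) and the green/red fill are copied from the template above.

```javascript
// Turn one shot document into the circle tag the template would emit.
function shotToCircle(shot) {
  const cx = (shot.coords.x + 25) * 10; // shift x so negative values land on the canvas, then scale
  const cy = shot.coords.y * 10;        // scale y by the same factor
  const fill = shot.made ? 'green' : 'red';
  return `<circle cx="${cx}" cy="${cy}" r="3" fill="${fill}" stroke="black"></circle>`;
}

// Wrap all the circles in a 600x600 svg element, as the template does.
function shotsToSvg(shots) {
  return `<svg width="600" height="600">${shots.map(shotToCircle).join('')}</svg>`;
}

// Hypothetical shots using the same coords/made fields as the template expects.
console.log(shotsToSvg([
  { coords: { x: 13, y: 15 }, made: true },
  { coords: { x: -20, y: 3 }, made: false }
]));
```

Once you see it this way, it's clear why no client-side D3 is needed for a static chart like this one: the server already knows where every circle goes.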

Screen Shot of Jade-generated data visualization.

So there you have it. It's possible to do some basic data visualization using just Node+Express+Jade. There isn't a lot out there on this particular topic, so I figured this might help someone or give some inspiration to go further with it.

It's Early Yet, But There's Some Historically Productive Scoring in the League Right Now

You might remember I have done a bit of work on the usage-efficiency tradeoff in the past. The "payoff" was a chart that presented evidence of a usage-efficiency "frontier" (having stolen the idea of an efficient frontier from finance, of course):

All-time productive scoring seasons lie along the "frontier".

We're almost at the quarter-point of the 2012-13 season now, so I thought it would be interesting to look at the current leaders, and see where they stand with respect to the frontier. So far, pretty, pretty good. In particular, Kevin Durant, Kobe Bryant, Tyson Chandler (so good he looks to be close to setting a new point along the frontier), and Carmelo Anthony are on or very close to the frontier, itself. Have a look:

The players in green make up the historical reference for the "frontier". Note that Chandler would be very near the frontier if it was extrapolated out further.


Of course, we should expect some regression to the mean. How much is anyone's guess, so I'll update the results periodically throughout the season.

On Ceilings and Floors and Betting

Harrison Barnes has exceeded most Warriors fans' expectations through 9 games this season. He's looked especially good in the last couple of games. This has prompted some fans to revisit the classic sports discussion regarding a player's "ceiling" and "floor". While the topic is one of the oldest in the book, the criteria for selecting a ceiling and floor for a player are not very clear (to me, anyway).

I think that most people see it as equivalent to asking the following question:

Who is the best current or former player that Player X has *some* possibility of becoming better than?

The key word here is *some*. When a fan suggests a ceiling that is deemed too low, the response is always something like, "How can you say he doesn't have *some* chance to be better than that player!?" Well, my reply is, of course, there's *some* chance. I'm going to illustrate why this is a problematic foundation for the discussion.

I think it's fair to say that, ideally, we would like to have debates that have some objectivity to them. One way to constrain a debate to be more objective is simply to introduce a bet. A bet invariably has to be settled by some objective criteria, otherwise, neither party would agree. If we want to debate which team is better, we should bet on the outcome of a game or maybe a season. That might not truly settle the debate, but at least it's an objective approach. If I pick Team A and you pick Team B, we bet against each other, and the winner is easy to declare.

So let's think about how we might construct a bet on the ceiling for a player (the floor could be done in a similar way). Here's one way to do it. The player in question is Player X. I propose that Player A is his ceiling. You propose that Player B is his ceiling. First, we need some objective criterion, i.e. a "stat". For the sake of argument, I'll just choose a stat that most everyone reading this has heard of: Hollinger's PER. (This is not the time to debate the merit of PER. You can substitute any stat you would like, as it won't materially change the point at hand.) Ok, so with PER as the base metric, the winner of the bet is the one who picks the ceiling that is closest to Player X.

Let me demonstrate with some numbers. Say that Player A's highest PER was 25 and Player B's highest PER was 30. Let's have one scenario where Player X ends up with a PER of 24. In this case, I win the bet because Player A meets two important criteria: 1) Player X did not achieve a PER higher than Player A (which would mean Player A was by definition too low a ceiling); and 2) In absolute terms, the difference between Player X and Player A is smaller than between Player X and Player B.

Now, say we have another scenario where Player X ends up with a PER of 26. In that case - and again, according to how I would set up the bet - you would win simply because Player X achieved a PER higher than the ceiling I set for him. The fact that my ceiling was closer (in absolute terms) doesn't make a difference.

Does that make sense? Let me re-iterate that this is just one way to construct the bet. Obviously, there are others. We could just take the absolute difference and not worry about whether Player X ends up higher or lower than our ceilings. I don't like that approach, because I'm used to thinking about the games from the Price is Right, where you had to guess the price *without going over*. It makes even more sense to have that rule here, because the whole point of choosing a ceiling is that we're saying that is the player's LIMIT.
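For the programmers in the audience, the rules above boil down to a few lines. This is just a sketch of the bet as I've described it; the function name is made up, and two cases the text doesn't cover (both ceilings exceeded, or both equally close) are treated as a "push" purely as an assumption.

```javascript
// Settle a ceiling bet under the "closest without going over" rule.
// ceilingA / ceilingB are the peak PERs of the two proposed ceilings,
// actual is the peak PER that Player X ends up achieving.
function settleCeilingBet(ceilingA, ceilingB, actual) {
  const aHolds = actual <= ceilingA; // Player X never exceeded ceiling A
  const bHolds = actual <= ceilingB; // Player X never exceeded ceiling B
  if (aHolds && !bHolds) return 'A';
  if (bHolds && !aHolds) return 'B';
  if (!aHolds && !bHolds) return 'push'; // assumption: both ceilings were too low
  // Both ceilings held, so the smaller gap wins.
  const gapA = ceilingA - actual;
  const gapB = ceilingB - actual;
  if (gapA === gapB) return 'push';      // assumption: identical gaps are a tie
  return gapA < gapB ? 'A' : 'B';
}

console.log(settleCeilingBet(25, 30, 24)); // → A (both hold, 25 is closer)
console.log(settleCeilingBet(25, 30, 26)); // → B (Player X went over 25)
```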

The problem I have with the original (and seemingly more popular) approach to the ceiling/floor discussion is that there's really no way to evaluate it objectively. Let's use Harrison Barnes as the example. I'll say that his ceiling is Danny Granger. You say that his ceiling is LeBron James. Who would win that bet? If Barnes never becomes "better" than LeBron, do you win? If that's the case, what exactly is the incentive of choosing any player other than arguably the best SF of all time? The ceiling for every SF would then either be Bird or James, right?

Now, I think most people inherently understand that dilemma, so they pick someone not quite as good as that for Barnes. But the criteria for doing so are usually ad hoc. It's basically, "Well, I think he has some chance of being better than this player, but no chance of being better than that player."

My point is let's bet on it. Let's put some numbers on it. The challenge here is not to pick *some* player that is the absolute ceiling (which is easy and trivial). The challenge (for me, anyway) is to pick the *worst* player that you think will be *better* than the player in question (Player X). Because otherwise, as I said, there's no incentive to pick anyone other than the best player of all time. In math, they would say that's an "ill-posed" problem. In order to make it a well-posed problem, it seems to me the logical solution is to construct it as a bet. From there everything else follows.

I know, that was a lot of words. But next time you enter into the ceiling/floor debate or listen to it on TV, just remember the main point here: Pick the guy that you would be willing to bet on.

Bayesian True Power Ratings for the NFL

In a recent post, I laid out the framework for developing a Bayesian power ratings model for the NFL using the BUGS/JAGS simulation software. That was a really simple model that essentially amounted to little more than a standard linear regression (or ridge regression). At the end of the article I suggested that one area of improvement would be to take into account turnovers. So, this is my first attempt to do that (at least, the first one that I'm writing about).


Hand Down, Man Down: New Source of NBA Data Reveals Critical Detail for More Effective Shot Defense

There’s a revolution taking place in NBA analytics driven by new sources and types of data and an increasingly sophisticated application of statistics. As a blogger, I’m always on the lookout for the “next big thing” in NBA analytics. Recently, I have had the opportunity to do some statistical consulting for a start-up company in Seattle trying to establish itself as (you guessed it) the next big thing in NBA analytics. The company is Competitive Analytics Consulting (CAC), and with a combination of proprietary technology and highly trained data analysts, CAC tracks each player during every play of every game, resulting in countless new and relevant stats, some of which you’ve likely always wanted to see in a box score, and others which it would never even dawn on you to track, but which make perfect sense after the fact. Of course, there are a couple of other increasingly well-known companies that supply NBA teams with useful (or potentially useful) data, including STATS, Inc. with their SportVU player and ball tracking technology, and Synergy Sports Technology, which relies on video tracking of various play types.

Knowing what Synergy currently offers, and having had access to CAC’s database for a few months now, I can state for a fact that CAC is tracking stats that are not available anywhere else today. I think of CAC’s software platform (called “Vantage”) as Synergy on PEDs.  CAC has given me permission to use some of their never-before-released data in this article, which I will use to highlight just one new factor (“shot defense”) that CAC is tracking.  Once you see the data, you will wonder why nobody has done this before (aside from the cost and laborious work involved!). It will also hopefully drive home the message that very soon all 30 teams are going to need these kinds of data just to keep up with the Joneses (or possibly even the Jameses). It’ll simply be the cost of doing business in the NBA.

Shot defense is defined by CAC as the pressure that a defender applies to the shooter. There are 7 possible values:

  • OPEN (no defender within 5 feet of player)
  • GUARDED (defender within 3 to 5 feet)
  • PRESSURED (defender within 3 feet but no hand up)
  • CONTESTED (defender within 3 feet and hand is up in front of shooter)
  • ALTERED (defender within 3 feet, hand is up, and shooter is forced to change shooting angle or release point while in air)
  • BLOCK (defender blocks shot)
  • FOUL (defender fouls shooter)
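To make the seven definitions concrete, here is a minimal sketch of how they could be turned into a classification rule. This is not CAC’s actual logic: the record fields (`defender_distance_ft`, `hand_up`, and so on) are hypothetical names, and both the handling of shots at exactly 3 or 5 feet and the precedence of FOUL over BLOCK are assumptions the definitions above do not settle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShotObservation:
    """Hypothetical per-shot record; field names are illustrative, not CAC's schema."""
    defender_distance_ft: Optional[float]  # None means no defender was charted
    hand_up: bool = False
    shot_altered: bool = False
    blocked: bool = False
    fouled: bool = False


def classify_shot_defense(shot: ShotObservation) -> str:
    """Map one observation to one of the seven shot defense values."""
    # Assumption: outcome-based values take precedence, FOUL before BLOCK.
    if shot.fouled:
        return "FOUL"
    if shot.blocked:
        return "BLOCK"
    distance = shot.defender_distance_ft
    # Assumption: exactly 5 feet counts as GUARDED, exactly 3 feet as GUARDED.
    if distance is None or distance > 5:
        return "OPEN"
    if distance >= 3:
        return "GUARDED"
    # From here on the defender is within 3 feet.
    if not shot.hand_up:
        return "PRESSURED"
    if shot.shot_altered:
        return "ALTERED"
    return "CONTESTED"
```

For example, `classify_shot_defense(ShotObservation(2.0, hand_up=True))` returns `"CONTESTED"`, while the same distance with the hand down returns `"PRESSURED"`.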

Only two of those values are currently recorded by the box score or play-by-play data, BLOCK and FOUL. But how many conversations have you had with your buddies about the number of contested shots Kobe Bryant takes compared to, say, a player like Matt Bonner whose job it is to stand in the corner and hit open 3-pt shots? We’ve all had those conversations numerous times, and likewise, NBA decision-makers, scouts, and players know that the shot defense a player faces is a huge factor in his shooting efficiency. Now let’s slice through the data in a few different ways.

I should make it clear that I’m not going to spit out a bunch of numbers in a table and tell you to go read them. I think that one of the critically important aspects of supporting NBA decision-makers is distilling the vast amounts of data in ways that are understandable and actionable for folks who don’t have PhDs. This is a vision CAC shares, and it often means utilizing new visualization techniques to reduce the dimensions of a problem. The reason I’m prefacing the “data” part of this article with this comment about visualization is that you may find the attached graphics a little intimidating at first. Fear not, they will become second nature once you realize the key points in each one.

The first graphic looks at shot defense on 2-pt shots in matrix or “heat map” form, where the color of each tile represents the efficiency in terms of FG%. White tiles are worse (less efficient) shots and red tiles are better (more efficient) shots. Blocks were not included (0% efficiency by definition). Each row of the matrix represents a play type defined in a similar way to Synergy, except that the observant reader will note the addition of a new play type called “FLASH”, which is when a player receives the ball after cutting sharply toward the perimeter (as opposed to CUT, which is defined as moving sharply toward the basket). The other play types (TRANS = transition, SPOT-UP, ROLL = screen and roll, POP = screen and pop, OREB = following immediately after an offensive rebound, POST-UP, OFF SCREEN = coming off screen, and ISOLATION) are probably familiar to you, regardless of whether you have used Synergy.
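For readers who like to see the mechanics, the matrix behind a heat map like this is just FG% grouped by play type and shot defense. The sketch below is one way to compute it, not the code behind the graphic. The `(play_type, shot_defense, was_made)` tuple layout is a hypothetical format, the toy records are invented purely for illustration, and how fouled shots were treated in the real chart is not stated, so they are simply left out of the toy data.

```python
from collections import defaultdict


def fg_pct_matrix(shots):
    """Return {(play_type, shot_defense): FG%}, skipping blocked shots."""
    made = defaultdict(int)
    attempts = defaultdict(int)
    for play_type, shot_defense, was_made in shots:
        if shot_defense == "BLOCK":
            continue  # blocks are excluded, as in the graphic
        key = (play_type, shot_defense)
        attempts[key] += 1
        made[key] += int(was_made)
    return {key: 100.0 * made[key] / attempts[key] for key in attempts}


# Invented toy records (NOT CAC data), just to exercise the function.
toy_shots = [
    ("POST-UP", "PRESSURED", True),
    ("POST-UP", "PRESSURED", False),
    ("POST-UP", "CONTESTED", True),
    ("POST-UP", "CONTESTED", False),
    ("POST-UP", "CONTESTED", False),
    ("POST-UP", "CONTESTED", False),
    ("POST-UP", "BLOCK", False),
]
matrix = fg_pct_matrix(toy_shots)
```

Each key of `matrix` corresponds to one tile: the play type picks the row, the shot defense picks the column, and the value sets the color.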

What the data show is, perhaps, a striking confirmation of what basketball fans think they already know. As the defender gets closer to the shooter, the shots get less and less efficient. What may surprise you, however, is the difference that just having a hand up can make. Across virtually every play type, having a hand up (CONTESTED) vs. not having a hand up (PRESSURED) makes a significant difference. On POST-UP shots, a pressured shot averages 54.3%, whereas a contested shot is only 42.3%. A contested screen and pop is 37.8% on average, compared to 50.5% when the shot is only pressured. That’s a difference of more than 10 percentage points just from the defender putting a hand up. Another important takeaway from the chart is that on several play types, there isn’t much of a drop-off between guarded and pressured in terms of shooter efficiency. In other words, it doesn’t matter if you’re 5 feet away or 2 feet away if you don’t have your hand up. It’s no wonder Mark Jackson’s favorite line is “Hand down, man down!”
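The size of the hand-up effect can be checked with simple arithmetic on the four percentages quoted above. The snippet reports each gap in percentage points and also as a drop relative to the pressured FG%. The relative figure is my own framing and does not appear in the chart, and the helper name `fg_gap` is purely illustrative.

```python
def fg_gap(pressured_pct, contested_pct):
    """Return (gap in percentage points, gap as % of the pressured FG%)."""
    points = pressured_pct - contested_pct
    relative = 100.0 * points / pressured_pct
    return round(points, 1), round(relative, 1)


post_up_gap = fg_gap(54.3, 42.3)  # post-up: pressured vs. contested
pop_gap = fg_gap(50.5, 37.8)      # screen and pop: pressured vs. contested

print(post_up_gap)  # (12.0, 22.1)
print(pop_gap)      # (12.7, 25.1)
```

So the hand up costs the shooter 12.0 and 12.7 percentage points respectively, which is a drop of roughly a fifth to a quarter of the pressured shooting percentage.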

The second graphic slices the shot defense data in a different way, namely, as a function of the 24-second shot clock. Instead of comparing all the different play types, we’ll just focus on two, isolation and spot-up. This graphic “bins” the data, one shot at a time, into buckets defined for each shot defense value. The data are slightly jittered so you can see all the shots, because otherwise they would just lie in a straight vertical line. What this graphic allows you to see is a data “fingerprint”, so to speak, for each play type. Shots generated via isolation, not surprisingly, tend to fall most often in the contested and pressured bins. Compare that to spot-up shots, where the open and guarded bins have a relatively high density of shots compared to the isolation plays. Again, this should not be surprising, but the value is in being able to quantify the data. One last interesting statistic before we move on to the final graphic. You can see the distribution of shots by shot clock time remaining. I bet you’re wondering if the average shot clock time varies as a function of shot defense. Well, yes, it does actually. It turns out that contested shots on average take place with about 2 fewer seconds on the shot clock than open shots. If we remove transition plays (which skew the distribution), contested shots still occur about 1.3 seconds later in the shot clock on average.
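Both ideas in this graphic, the jitter and the average shot clock per bin, are easy to sketch. The code below is an illustration under stated assumptions rather than the code used for the graphic. It assumes shot clock time runs along the vertical axis with one column per shot defense value, so the jitter is horizontal. The jitter width, the `(shot_defense, seconds_remaining)` tuple layout, and the toy values are all invented for the example and are not CAC data.

```python
import random
from collections import defaultdict

# Column order for the bins (BLOCK and FOUL are left out of this sketch).
DEFENSE_ORDER = ["OPEN", "GUARDED", "PRESSURED", "CONTESTED", "ALTERED"]


def jittered_x(shot_defense, rng, width=0.3):
    """x position for one shot: bin index plus uniform noise in [-width, width]."""
    return DEFENSE_ORDER.index(shot_defense) + rng.uniform(-width, width)


def mean_clock_by_defense(shots):
    """Average seconds remaining on the shot clock for each shot defense value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for shot_defense, seconds_remaining in shots:
        totals[shot_defense] += seconds_remaining
        counts[shot_defense] += 1
    return {d: totals[d] / counts[d] for d in counts}


# Invented toy shots (NOT CAC data): (shot_defense, seconds_remaining).
toy_shots = [("OPEN", 16.0), ("OPEN", 12.0), ("CONTESTED", 13.0), ("CONTESTED", 11.0)]

rng = random.Random(0)  # fixed seed so the jitter is reproducible
points = [(jittered_x(d, rng), clock) for d, clock in toy_shots]

means = mean_clock_by_defense(toy_shots)
gap = means["OPEN"] - means["CONTESTED"]  # positive: contested shots come later
```

Plotting `points` as a scatter gives one cloud per shot defense value instead of a single vertical line, and `gap` is the kind of open-vs.-contested difference in average shot clock time discussed above.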

For the final graphic, I drilled down to a specific comparison of shot defense faced by LeBron James and Kobe Bryant as a function of the shot clock. This graphic shows the fraction of total shots facing each type of shot defense with a given amount of time remaining on the shot clock. The main difference between the two stars is that Kobe faces a consistently high level of contested shots throughout the shot clock, whereas LeBron displays the more typical pattern of facing increasingly difficult shot pressure as the shot clock winds down. I’ll leave it to the reader to debate with his or her friends whether this exposes a flaw or greatness in Kobe’s game.

Clearly, I have just scratched the surface of what can be done with these new data. And remember, this is just one new factor that I’ve discussed here. CAC is tracking dozens of other ones that I can’t share right now. The combinations of factors that can be studied and analyzed on every single play are mind-boggling. You know, if I were an up-and-coming basketball player, I might think about taking just enough math in college so that I could understand what guys like me are talking about with all this analytics stuff. Maybe it can give you that extra edge, like shooting another 100 jumpers a day or something. Because I can tell you this, it’s not going away, and it’s only going to get more important over time.

Vantage Basketball currently offers industry-leading data collection and analysis products and services to NBA organizations and media/broadcast companies at www.cacvantage.com. All data presented in this article are used with permission of Vantage Basketball LLC and should not be copied or distributed without its express written consent.