Category Archives: Uncategorized

Lists! The League's Best Scorers in 2013 according to Scoring Index

Long time, no write. I've been busy with things lately, as some of you may know. Hopefully, I can sprinkle in more posts now and again, though. So to ease back into this web logging habit, I've compiled a list of the best scorers this season from (heard of it?). The "Scoring Index" (SI) is based on work I did a while back (see here and here and here and here and here) looking at the tradeoff between usage (i.e. volume shooting) and efficiency (measured by TS%). At the very edge of the TS-USG relationship, there appears to be a "frontier" of all-time great scorers.

The "Usage-Efficiency" Frontier

The list I've compiled has a minimum threshold of 250 FGA. The one (significant) change I've made from the earlier metric is that SI is now "signed", meaning that if a player actually falls outside the frontier (above and to the right of that line on the plot), he will have an SI > 1. In other words, he is scoring at a rate even better than the all-time greats. And wouldn't you know, we happen to have a couple of players like that this season. You may have heard of them.
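To make the "signed" idea concrete, here's a rough sketch of how such an index could be computed. The frontier function below is a made-up linear stand-in for the curve fitted in the earlier posts, so treat the numbers as placeholders, not the actual metric:

// True shooting percentage: the efficiency half of the usage-efficiency tradeoff.
function trueShootingPct(pts, fga, fta) {
    return pts / (2 * (fga + 0.44 * fta));
}

// Signed scoring index: the ratio of a player's TS% to the frontier TS% at his
// usage rate, so a value above 1 means he falls outside (beyond) the frontier.
// The slope and intercept here are illustrative placeholders only.
function scoringIndex(playerTS, usagePct) {
    var frontierTS = 0.65 - 0.004 * (usagePct - 20);
    return playerTS / frontierTS;
}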

Here's the list in all its glory. And if you're wondering (which you surely are by now)... it's Draymond Green.

Simple Data Visualization using Node+Express+Jade

Update (2012-11-12): I created an app to go along with this post. Check it out at: .

If you know about Node.js, you're probably one of the cool kids. And you'll no doubt grok this post. In a nutshell, Node.js enables one to create an entire web application stack, from the server to the client, using JavaScript. It's pretty cool and stuff.

Another cool JavaScript thingy these days is D3.js, which is a library for doing all kinds of awesome visualization (that's actually what the "d3" in my domain refers to, if you were ever wondering). What D3 essentially does is let you bind data to elements of the DOM (i.e. the underlying structure of a web page). D3 is really great, and it has a huge and ever-growing community of users.
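To give a flavor of that data binding, here's a tiny, self-contained example (the selector and values are just placeholders): take an array of numbers and let D3 turn each one into a circle on the page.

// The core D3 pattern: bind an array of data to (future) SVG circle elements.
var data = [10, 25, 40, 55];
d3.select('svg').selectAll('circle')
    .data(data)                  // one datum per circle
  .enter().append('circle')      // create a circle for each unbound datum
    .attr('cx', function(d) { return d * 10; })
    .attr('cy', 100)
    .attr('r', 5);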

The reason I'm writing this post is that I have found it's not that easy to inject D3 code into a web app built on the Node stack (which almost always includes the Express framework as well). The only example I could find sits on top of Node and Express, and that code wraps D3 in an AngularJS directive. While I was trying to figure out that code, I realized that for relatively simple use cases, it's possible to bind visual elements directly using nothing more than Node+Express+Jade. Jade is a popular HTML templating language.

To demonstrate how this works, we'll visualize shot location for the Warriors this season. First, we pull the data from some data store (in this case, I'm using MongoDB):

// Assumes the mongodb driver's Db object and the connection URI are defined
// elsewhere in this module, e.g.:
//   var Db = require('mongodb').Db;
//   var mongoUri = process.env.MONGOLAB_URI || 'mongodb://localhost/nba';
exports.shots = function(req, res){
    console.log(req.route.params.team);
    var team = req.route.params.team;
    Db.connect(mongoUri, function(err, db) {
        console.log('show all shots!');
        // grab the 'shots' collection and pull every shot for the requested team
        db.collection('shots', function(err, coll) {
            coll.find({'for': team}).sort({'date': -1, 'dist': -1}).toArray(function(err, docs) {
                db.close();
                // hand the array of shot documents off to the Jade template
                res.render('shots', {shots: docs, team: team});
            });
        });
    });
};
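As an aside, a handler like this gets wired to a URL pattern in the main Express app; something along these lines (the path and module name are my assumptions, but the :team parameter has to match what the handler reads):

// in app.js (Express 3-era setup)
var routes = require('./routes');
app.get('/shots/:team', routes.shots);   // e.g. GET /shots/GSW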

The important line in that handler is res.render('shots', {shots: docs, team: team});. This hands the shot data (now an array of documents) off to the Jade template (called "shots.jade"). The template looks like this:

extends layout

block content

    div.hero-unit
        h1 #{shots[0].for}
    div.row
        div.span2.offset1
            svg(width=600,height=600)
                each shot in shots
                    if (shot.made)
                        circle(cx="#{(shot.coords.x+25)*10}",cy="#{shot.coords.y*10}",r="3",fill="green",stroke="black")
                    else
                        circle(cx="#{(shot.coords.x+25)*10}",cy="#{shot.coords.y*10}",r="3",fill="red",stroke="black")

What you see is that the each shot in shots iterator in the Jade template creates an SVG circle element for each shot in the array pulled from the database. Here's a screen shot of the final result (it's only running locally right now, so I can't give a link to the application):

Screen Shot of Jade-generated data visualization.

So there you have it. It's possible to do some basic data visualization using just Node+Express+Jade. There isn't a lot out there on this particular topic, so I figured this might help someone or give some inspiration to go further with it.

It's Early Yet, But There's Some Historically Productive Scoring in the League Right Now

You might remember I have done a bit of work on the usage-efficiency tradeoff in the past. The "payoff" was a chart that presented evidence of a usage-efficiency "frontier" (having stolen the idea of an efficient frontier from finance, of course):

All-time productive scoring seasons lie along the "frontier".

We're almost at the quarter mark of the 2012-13 season now, so I thought it would be interesting to look at the current leaders and see where they stand with respect to the frontier. So far, pretty, pretty good. In particular, Kevin Durant, Kobe Bryant, Tyson Chandler (so good he looks to be close to setting a new point along the frontier), and Carmelo Anthony are on or very close to the frontier itself. Have a look:

The players in green make up the historical reference for the "frontier". Note that Chandler would be very near the frontier if it were extrapolated further.


Of course, we should expect some regression to the mean. How much is anyone's guess, so I'll update the results periodically throughout the season.

On Ceilings and Floors and Betting

Harrison Barnes has exceeded most Warriors fans' expectations through 9 games this season. He's looked especially good in the last couple of games. This has prompted some fans to revisit the classic sports discussion regarding a player's "ceiling" and "floor". While the topic is one of the oldest in the book, the criteria for selecting a ceiling and floor for a player are not very clear (to me, anyway).

I think that most people see it as equivalent to asking the following question:

Who is the best current or former player that Player X has *some* possibility of becoming better than?

The key word here is *some*. When a fan suggests a ceiling that is deemed too low, the response is always something like, "How can you say he doesn't have *some* chance to be better than that player!?" Well, my reply is, of course, there's *some* chance. I'm going to illustrate why this is a problematic foundation for the discussion.

I think it's fair to say that, ideally, we would like to have debates that have some objectivity to them. One way to constrain a debate to be more objective is simply to introduce a bet. A bet invariably has to be settled by some objective criterion; otherwise, neither party would agree. If we want to debate which team is better, we should bet on the outcome of a game or maybe a season. That might not truly settle the debate, but at least it's an objective approach. If I pick Team A and you pick Team B, we bet against each other, and the winner is easy to declare.

So let's think about how we might construct a bet on the ceiling for a player (the floor could be done in a similar way). Here's one way to do it. The player in question is Player X. I propose that Player A is his ceiling. You propose that Player B is his ceiling. First, we need some objective criterion, i.e. a "stat". For the sake of argument, I'll just choose a stat that most everyone reading this has heard of: Hollinger's PER. (This is not the time to debate the merit of PER. You can substitute any stat you would like, as it won't materially change the point at hand.) OK, so with PER as the base metric, the winner of the bet is the one whose proposed ceiling ends up closest to the PER that Player X eventually achieves.

Let me demonstrate with some numbers. Say that Player A's highest PER was 25 and Player B's highest PER was 30. Let's have one scenario where Player X ends up with a PER of 24. In this case, I win the bet because Player A meets two important criteria: 1) Player X did not achieve a PER higher than Player A (which would mean Player A was by definition too low a ceiling); and 2) In absolute terms, the difference between Player X and Player A is smaller than between Player X and Player B.

Now, say we have another scenario where Player X ends up with a PER of 26. In that case - and again, according to how I would set up the bet - you would win simply because Player X achieved a PER higher than the ceiling I set for him. The fact that my ceiling was closer (in absolute terms) doesn't make a difference.

Does that make sense? Let me reiterate that this is just one way to construct the bet. Obviously, there are others. We could just take the absolute difference and not worry about whether Player X ends up higher or lower than our ceilings. I don't like that approach, because I'm used to thinking about the games on The Price is Right, where you had to guess the price *without going over*. It makes even more sense to have that rule here, because the whole point of choosing a ceiling is that we're saying it is the player's LIMIT.
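Since this is really just a small set of rules, here's a sketch of the settlement logic in code. The two scenarios above fall out directly; the tie-break when both ceilings get busted is my own assumption, since the rule above doesn't cover that case:

// Settle a ceiling bet, Price-is-Right style: a pick "busts" if Player X
// out-performs it; otherwise the closest pick (in absolute terms) wins.
function settleCeilingBet(ceilingA, ceilingB, actualPER) {
    var bustedA = actualPER > ceilingA;
    var bustedB = actualPER > ceilingB;
    if (bustedA && !bustedB) return 'B wins';   // A's ceiling was too low by definition
    if (bustedB && !bustedA) return 'A wins';
    if (bustedA && bustedB) return 'both busted (assume the higher pick wins)';
    return Math.abs(actualPER - ceilingA) <= Math.abs(actualPER - ceilingB) ? 'A wins' : 'B wins';
}

settleCeilingBet(25, 30, 24);   // 'A wins'  (closer, and Player X stayed under both ceilings)
settleCeilingBet(25, 30, 26);   // 'B wins'  (Player X beat A's proposed ceiling)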

The problem I have with the original (and seemingly more popular) approach to the ceiling/floor discussion is that there's really no way to evaluate it objectively. Let's use Harrison Barnes as the example. I'll say that his ceiling is Danny Granger. You say that his ceiling is LeBron James. Who would win that bet? If Barnes never becomes "better" than LeBron, do you win? If that's the case, what exactly is the incentive of choosing any player other than arguably the best SF of all time? The ceiling for every SF would then be either Bird or James, right?

Now, I think most people inherently understand that dilemma, so they pick someone not quite as good as that for Barnes. But the criteria for doing so are usually ad hoc. It's basically, "Well, I think he has some chance of being better than this player, but no chance of being better than that player."

My point is let's bet on it. Let's put some numbers on it. The challenge here is not to pick *some* player that is the absolute ceiling (which is easy and trivial). The challenge (for me, anyway) is to pick the *worst* player that you think will be *better* than the player in question (Player X). Because otherwise, as I said, there's no incentive to pick anyone other than the best player of all time. In math, they would say that's an "ill-posed" problem. In order to make it a well-posed problem, it seems to me the logical solution is to construct it as a bet. From there everything else follows.

I know, that was a lot of words. But next time you enter into the ceiling/floor debate or listen to it on TV, just remember the main point here: pick the guy you would be willing to bet on.

Bayesian True Power Ratings for the NFL

In a recent post, I laid out the framework for developing a Bayesian power ratings model for the NFL using the BUGS/JAGS simulation software. That was a really simple model that essentially amounted to little more than a standard linear regression (or ridge regression). At the end of the article I suggested that one area of improvement would be to take into account turnovers. So, this is my first attempt to do that (at least, the first one that I'm writing about).


Hand Down, Man Down: New Source of NBA Data Reveals Critical Detail for More Effective Shot Defense

There’s a revolution taking place in NBA analytics, driven by new sources and types of data and an increasingly sophisticated application of statistics. As a blogger, I’m always on the lookout for the “next big thing” in NBA analytics. Recently, I have had the opportunity to do some statistical consulting for a start-up company in Seattle trying to establish itself as — you guessed it — the next big thing in NBA analytics. The company is CAC, and with a combination of proprietary technology and highly trained data analysts, CAC tracks each player during every play of every game, resulting in countless new and relevant stats — some of which you’ve likely always wanted to see in a box score, and others which would never even dawn on you to track but make perfect sense after the fact. Of course, there are a couple of other increasingly well-known companies that supply NBA teams with useful (or potentially useful) data, including STATS, Inc. with their SportVU player and ball tracking technology, and Synergy Sports Technology, which relies on video tracking of various play types.

Knowing what Synergy currently offers, and having had access to CAC’s database for a few months now, I can state for a fact that CAC is tracking stats that are not available anywhere else today. I think of CAC’s software platform (called “Vantage”) as Synergy on PEDs.  CAC has given me permission to use some of their never-before-released data in this article, which I will use to highlight just one new factor (“shot defense”) that CAC is tracking.  Once you see the data, you will wonder why nobody has done this before (aside from the cost and laborious work involved!). It will also hopefully drive home the message that very soon all 30 teams are going to need these kinds of data just to keep up with the Joneses (or possibly even the Jameses). It’ll simply be the cost of doing business in the NBA.

Shot defense is defined by CAC as the pressure that a defender applies to the shooter. There are 7 possible values:

  • OPEN (no defender within 5 feet of player)
  • GUARDED (defender within 3 to 5 feet)
  • PRESSURED (defender within 3 feet but no hand up)
  • CONTESTED (defender within 3 feet and hand is up in front of shooter)
  • ALTERED (defender within 3 feet, hand is up, and shooter is forced to change shooting angle or release point while in air)
  • BLOCK (defender blocks shot)
  • FOUL (defender fouls shooter)
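
Those definitions amount to a simple decision rule. Here's a rough encoding in code (my own sketch of the criteria above, not CAC's actual logic), with BLOCK and FOUL treated as separate terminal outcomes:

// Classify shot defense from defender distance (in feet) and hand position.
function classifyShotDefense(defenderDistFt, handUp, shooterAltered) {
    if (defenderDistFt > 5) return 'OPEN';       // no defender within 5 feet
    if (defenderDistFt > 3) return 'GUARDED';    // defender within 3 to 5 feet
    if (!handUp) return 'PRESSURED';             // within 3 feet, no hand up
    return shooterAltered ? 'ALTERED' : 'CONTESTED';
}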

Only two of those values are currently recorded by the box score or play-by-play data, BLOCK and FOUL. But how many conversations have you had with your buddies about the number of contested shots Kobe Bryant takes compared to, say, a player like Matt Bonner whose job it is to stand in the corner and hit open 3-pt shots? We’ve all had those conversations numerous times, and likewise, NBA decision-makers, scouts, and players know that the shot defense a player faces is a huge factor in his shooting efficiency. Now let’s slice through the data in a few different ways.

I should make it clear that I'm not going to spit out a bunch of numbers in a table and tell you to go read them. I think that one of the critically important aspects of supporting NBA decision-makers is distilling the vast amounts of data in ways that are understandable and actionable for folks who don't have PhDs. This is a vision CAC shares, and it often means utilizing new visualization techniques to reduce the dimensions of a problem. The reason I'm prefacing the "data" part of this article with this comment about visualization is that you may find the attached graphics a little intimidating at first. Fear not, they will become second nature once you realize the key points in each one.

The first graphic looks at shot defense on 2-pt shots in matrix or “heat map” form, where the color of each tile represents the efficiency in terms of FG%. White tiles are worse (less efficient) shots and red tiles are better (more efficient) shots. Blocks were not included (0% efficiency by definition). Each row of the matrix represents a play type defined in a similar way to Synergy, except, the observant reader will note the addition of a new play type called “FLASH”, which is when a player receives the ball after cutting sharply toward the perimeter (as opposed to CUT which is defined as moving sharply to the basket). The other play types (TRANS = transition, SPOT-UP, ROLL = screen and roll, POP = screen and pop, OREB = following immediately after an offensive rebound, POST-UP, OFF SCREEN = coming off screen, and ISOLATION) are probably familiar to you, regardless of whether you have used Synergy.
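For the curious, here's a rough sketch of how a matrix like that could be tabulated from shot-level records. The field names (playType, defense, made) are my own placeholders, not CAC's actual schema:

// Tabulate FG% for every (play type, shot defense) cell, excluding blocks.
function shotDefenseMatrix(shots) {
    var cells = {};   // 'PLAYTYPE|DEFENSE' -> {made: n, att: n, fgPct: x}
    shots.forEach(function(shot) {
        if (shot.defense === 'BLOCK') return;   // 0% by definition, so left out
        var key = shot.playType + '|' + shot.defense;
        cells[key] = cells[key] || {made: 0, att: 0};
        cells[key].att += 1;
        if (shot.made) cells[key].made += 1;
    });
    Object.keys(cells).forEach(function(key) {
        cells[key].fgPct = 100 * cells[key].made / cells[key].att;
    });
    return cells;
}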

What the data show is, perhaps, striking confirmation of what basketball fans think they already know. As the defender gets closer to the shooter, the shots get less and less efficient. What may surprise you, however, is the difference that just having a hand up can make. Across virtually every play type, having a hand up (CONTESTED) vs. not having a hand up (PRESSURED) makes a significant difference. On POST-UP shots, a pressured shot averages 54.3%, whereas a contested shot is only 42.3%. A contested screen and pop is 37.8% on average, compared to 50.5% when the shot is only pressured. That's a difference of more than 10 percentage points just from the defender putting a hand up. Another important takeaway from the chart is that on several play types there isn't much of a drop-off between pressured and guarded in terms of shooter efficiency. In other words, it doesn't matter whether you're 5 feet away or 2 feet away if you don't have your hand up. It's no wonder Mark Jackson's favorite line is "Hand down, man down!"

The second graphic slices the shot defense data in a different way, namely, as a function of the 24-second shot clock. Instead of comparing all the different play types, we'll just focus on two: isolation and spot-up. This graphic "bins" the data, one shot at a time, into buckets defined for each shot defense value. The data are slightly jittered so you can see all the shots; otherwise, they would just lie in a straight vertical line. What this graphic allows you to see is a data "fingerprint", so to speak, for each play type. Shots generated via isolation, not surprisingly, tend to fall most often in the contested and pressured bins. Compare that to spot-up shots, where the open and guarded bins have a relatively high density of shots compared to the isolation plays. Again, this should not be surprising, but the value is in being able to quantify it. One last interesting statistic before we move on to the final graphic: you can see the distribution of shots by shot clock time remaining, and I bet you're wondering if the average shot clock time varies as a function of shot defense. Well, yes, it does actually. It turns out that contested shots on average take place with about 2 fewer seconds left on the shot clock than open shots. If we remove transition plays (which skew the distribution), contested shots still occur about 1.3 seconds later in the shot clock on average.

For the final graphic, I drilled down to a specific comparison of shot defense faced by LeBron James and Kobe Bryant as a function of the shot clock. This graphic shows the fraction of total shots facing each type of shot defense with a given amount of time remaining on the shot clock. The main feature and difference between the two stars is that Kobe faces a consistently high level of contested shots throughout the shot clock, whereas LeBron displays the more typical pattern of facing increasingly more difficult shot pressure as the shot clock winds down. I’ll leave it to the reader to debate with his or her friends whether this exposes a flaw or greatness in Kobe’s game.

Clearly, I have just scratched the surface of what can be done with these new data. And remember, this is just one new factor that I’ve discussed here. CAC is tracking dozens of other ones that I can’t share right now. The combinations of factors that can be studied and analyzed on every single play are mind-boggling. You know, if I were an up-and-coming basketball player, I might think about taking just enough math in college so that I could understand what guys like me are talking about with all this analytics stuff. Maybe it can give you that extra edge, like shooting another 100 jumpers a day or something. Because I can tell you this, it’s not going away, and it’s only going to get more important over time.

Vantage Basketball currently offers industry-leading data collection and analysis products and services to NBA organizations and media/broadcast companies. All data presented in this article are used with permission of Vantage Basketball LLC and should not be copied or distributed without its express written consent.

Starting work on Bayesian football power ratings

The more I learn about Bayesian statistics, the more I want to use the approach in my own sports research. I have found other football rating systems that use Bayesian methodology, so most of what I'm doing here is not novel. However, I feel it's important to document new things I'm working on as much as possible, because you can always learn something new from seeing how someone else tackles the same problem, and I usually learn something by forcing myself to write about it. Anyway, that's more or less the pedagogic motivation for this article. In this post, I'll just introduce the framework and starting point for the model and show some initial predictive results from last season. I promise to make this as high-level as possible.

Probability is about calculating the likelihood of a given set of data given some known parameters. The classic example is flipping a coin, which we "know" is equally likely to come up heads or tails. Statistics is essentially the inverse problem (or "inverse probability"), i.e. calculating the likelihood of a parameter or set of parameters lying within a certain range of values, given a known set of data. If you didn't know a coin was "fair", how many flips would it take for you to figure it out? The example I always give to people who know nothing about Bayesian statistics is to imagine flipping a coin 10 times and having it come up heads 8 times. Most people (hopefully everyone reading this) will know intuitively that just because the coin doesn't come up heads exactly 5 times, it doesn't (necessarily) mean the coin is an unfair one. The purpose of Bayesian statistics, then, is to combine the evidence that a current set of data presents (i.e. 8/10 flips coming up heads) with our "prior knowledge" of the phenomenon under study. Figuring out how to do that is where all the math stuff comes in handy, and up until the early '90s it was actually a very difficult problem to solve in all but a few well-studied "toy" problems, because there was not enough computational power to tackle the really complex models.
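Going back to the coin for a second: with the simplest possible prior (a uniform one, which is my choice here purely for illustration), the combining of prior and evidence can be done by hand.

// Conjugate Beta-Binomial update: uniform Beta(1,1) prior + 8 heads, 2 tails.
var priorHeads = 1, priorTails = 1;        // uniform prior over p(heads)
var heads = 8, tails = 2;                  // the observed flips
var postHeads = priorHeads + heads;        // 9
var postTails = priorTails + tails;        // 3
var posteriorMean = postHeads / (postHeads + postTails);   // 0.75
// The estimate gets pulled from the raw 8/10 = 0.8 back toward the prior's 0.5:
// evidence and prior knowledge combined, which is the whole Bayesian idea.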

Nowadays, any statistical hack like myself can download the necessary software tools for free and run extremely sophisticated models on a PC or even a laptop (I'm doing this on a MacBook Air) that 20 years ago probably would have required a supercomputer (or two). Unquestionably, the computational tool that kickstarted the widespread Bayesian revolution in statistics over the past two decades (and the one that anyone can download for free) is BUGS (Bayesian inference Using Gibbs Sampling). I'm using JAGS (Just Another Gibbs Sampler), which is essentially like BUGS but runs easily within R on a Mac via the rjags package.

The way these programs work is that you enter the data along with your data model and prior knowledge, each typically expressed as a probability distribution. Given all this information, the program crunches through thousands and thousands of simulations of the model with the given set of data (using so-called Markov chain Monte Carlo methods), which in the end generates a probability distribution for the parameters of the model you are interested in. This is called the "posterior" distribution. Along with that, you can monitor other quantities in the model that you want predicted. Let me show you how it works with the NFL model I've been setting up.

model {
  # likelihood: one margin-of-victory observation per game, N games in total
  for (i in 1:N) {
    mov[i] ~ dnorm(mov.hat[i], tau1)
    mov.hat[i] <- b0 + inprod( b[] , x[i,] )
  }

This part describes the model for the margin of victory (mov) when two teams face each other. It says that each observed margin is drawn from a normal distribution, the mean of which is the difference between the two teams' power ratings plus home-field advantage (b0), and the precision of which (tau1; in JAGS, dnorm is parameterized by precision, i.e. 1/variance) will be estimated during the simulation. Think of this as the error that comes with any prediction.
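Written out explicitly (assuming, as the description implies, that each row x[i,] is just a +1/-1 indicator picking out the home and away teams), the likelihood is:

\[ \mathrm{mov}_i \sim \mathcal{N}\!\left( b_{\mathrm{home}(i)} - b_{\mathrm{away}(i)} + b_0,\ \sigma^2 = 1/\tau_1 \right) \]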

  # priors: a rating for each of the 32 teams, centered on zero
  for (i in 1:32) {
    b[i] ~ dnorm(0, tau2)
  }

Here, we're setting the prior distribution for the array of 32 coefficients that represent each team's power rating. It's again a normal distribution, but we specify the mean to be 0 (i.e. we "know" this), along with a second precision parameter, tau2, which will also be determined by the simulation.

  b0 <- 3            # home-field advantage fixed at a "known" 3 points
  tau1 ~ dexp(0.1)   # vague exponential priors on the two precision parameters
  tau2 ~ dexp(0.1)
}

Here we set the home-field advantage to a "known" constant (as opposed to estimating it from the data), and we set exponential priors on the two precision parameters. What you should note so far is that the only things we have explicitly said are "known" are that the mean of all team ratings should be zero and that HFA = 3 points, neither of which is a controversial assumption. Everything else is extremely general, and we just wait for the simulation to tell us what the actual parameter values turn out to be.

The data fed into the model consisted of the game results from the regular season. The ratings produced by the simulation were as follows:

Based on the regular season, that looks about right. A histogram of the team ratings looks normal, as would be expected.

Distribution of team ratings.

Within the simulation itself, one can make predictions about future events simply by monitoring certain variables of interest. In this case, I set up the simulation to monitor "virtual" matchups between teams that actually met in the post-season. In other words, I used the model based on regular season data to predict (out-of-sample) post-season results. For each game, we can look at the resulting probability distribution. Here's the prediction for SF vs. NYG as an example:

Bayesian prediction for Niners-Giants NFC Championship game.

You can see that, according to the simulation, the Giants had almost no chance of winning the game. The probability mass lies almost entirely to the right of zero and is centered around 8.44 points. With hindsight, we can say that either the simulation was wrong or incomplete, or the Giants got extremely lucky. A little luck was involved, to be sure, but I think it's safe to say the model is far from perfect at this point. Let's take a look at the predictions for all the post-season games:

So how did the model do overall? Well, to be perfectly honest, not all that great. It beat the spread 6/11 times, but lost big time to the Vegas spreads in terms of error (least squares columns). In non-Giants games, it beat Vegas spreads 6/8 times, but there always seems to be a "hot" team in January, and it doesn't help bettors to realize that only after the SB has been played. Still, this is just scratching the surface of what Bayesian models are capable of doing. As others have pointed out, we can try to build in the effects of turnovers and also try to account for trends within and across seasons. Much remains to be done. Let's go!

Why you should be excited and nervous to be a Chicago Bulls fan in 2012-13


I'm endeavoring to do a post for all 30 teams before the season starts on why, if I were a fan of that team, I'd be excited and nervous. I'll talk about the great, the good, the bad, and oftentimes, the really ugly.

The Bulls are on pause. After Derrick Rose went down with an ACL tear against Philly in Game 1 of the first round of the Playoffs, not only were Chicago's hopes for a 7th championship dashed in the short term, but it quickly became clear that this season would be a placeholder for those hopes going forward.

A look at scoring productivity for several "Player of the Century" candidates

This is going to be a quick hit, basically just an excuse to put up a cool plot. A while back I developed a relatively simple metric for measuring scoring productivity that combines usage (i.e. volume) and shooting efficiency. I also did some work on aging curves using that metric. Motivated by Dime Mag's naming Kobe Bryant the "Player of the Century" (or, at least, the first 0.12 parts thereof), I thought I'd revisit my scoring metric and post aging curves for several of Kobe's potential rivals for that honor. Enjoy, debate, stare blankly, whatever! (Keep in mind this doesn't account for defense or any aspects of offense aside from scoring.)

Aging curves for several POC candidates. Click to enlarge.

Why you should be excited and nervous to be a Charlotte Bobcats fan in 2012-13

I'm endeavoring to do a post for all 30 teams before the season starts on why, if I were a fan of that team, I'd be excited and nervous. I'll talk about the great, the good, the bad, and oftentimes, the really ugly.

Let's get this out of the way: if you're a Bobcats fan, you know there's a steep climb ahead. The good news is that Michael Jordan hired Rich Cho, and by doing so, may have finally got the team headed in the right direction. In fact, this was evident to me on Draft Night, when they actually took the second-best player available in Michael Kidd-Gilchrist. Very few people expected them to do the right thing there, perhaps because it was unclear who was calling the shots. If you remember, MJ doesn't exactly have a strong record when it comes to evaluating other people's talent.