Note #1: Before I get into this, I want to make it clear that many, perhaps most (hell, maybe all), of the ideas in the proposed model I am going to describe are not new. They may seem new to you, but I promise there are folks out there who have thought about these things before and heavily influenced my particular choices for all terms and coefficients in the model. Some of these folks you may have heard of before, including Dean Oliver ("Basketball on Paper"), John Hollinger (Pro Basketball Prospectus, ESPN), Dave Berri (Wages of Wins, Stumbling on Wins), and Dan Rosenbaum (who, I believe, has done stats for the Cavs for the past several years).
Note #2: Another motivator for my doing this is to help others get up to speed on (most) of the issues that need to be addressed when considering or developing player valuation models. As an engineer, I'm a big fan of rigorously defining a problem before tackling it. In developing this model and writing up my findings, it has helped me to better understand the problem (and basketball, hopefully). It has also raised many questions that need to be tackled going forward.
Note #3: This is the start of a discussion, not the end of one.
Note #4: Some will take this post as an implicit criticism of some other existing models. To some extent, that is obviously true. Let me name one, to be exact, since it's the elephant in the room: Wins Produced or WP (the metric developed principally by Dave Berri). First, let me emphasize that my understanding and appreciation of WP is what led me to start considering alternative models. My model, on the face of it, is not drastically different from WP. In fact, since both models are tied to point margin, it is really only the player valuation aspects that will be different. Of course, you will say, that is a large part. True. And if it weren't important, I wouldn't have bothered with all this. Finally, I would add that it has become clear to me that WP is not really going to change any time soon. Many of the components of my proposed model could easily be fit into the framework of WP, but I don't see that happening. If my post inspires Berri and others to re-consider their models, great. If not, that's fine, too. I don't mind rowing the canoe solo. That pretty much sums up my entire academic career. If this post inspires others to develop their own models or revise mine, that would be the greatest outcome of all. Go for it! (And let me know the results.)
First, what is a +/- model? Ok, let me back it up one step. What is +/-? At the team level, this is the number of points a team goes up or falls behind while it is on the floor. If starters played an entire game, this would simply be the point differential at the end of the game. Because players come in and out of the game, +/- typically refers to the number of points a team goes up or down while a player was in the game. In other words, +/- is assigned to individual players, but represents a team outcome. If Stephen Curry's +/- is +10 in a game, that doesn't (necessarily) mean Curry created a +10 point differential by himself. All it means is that while Curry was on the floor, the Warriors outscored their opponent by 10 points. Therefore, to be a meaningful statistic for evaluating players, what we really want to know is the individual +/-, or how a particular player contributed to the aggregate +/-. Say Curry was +5, Ellis was +6, Lee was +2, Biedrins was +1 and Dorell Wright was -4. In this example, the team +/- is +10, but Dorell Wright was actually a negative contributor. Going a step further, we can foresee a situation where some players are credited with good +/- stats simply by playing with four other really good players.
One more thing, before I go further. Why do we care about +/-? Intuitively, it should be obvious that teams that score more points than they allow will win games. In fact, there is such a good relationship between +/- and wins, that we can actually formulate a simple equation that predicts wins based on point margin (another term for +/-). For example, Fig. 1 shows wins as a function of point margin per 100 possessions (simply offensive rating - defensive rating) for all teams for the 2009-10 NBA season.
Here, the "winning formula" (literally) is:
- W = 2.54 * p.m. + 40.9
A team with a p.m. of zero wins about 41 games (hey, that makes sense, right?). A team with a p.m. of +10 (that would be San Antonio currently) should win about 66 games. The Warriors, with a p.m. currently of -6.6, are predicted to win about 24 (ugh). Now, it should be noted that p.m. can be adjusted for strength of schedule, which makes a large difference early in the season, and between conferences, since the East is weaker than the West (at least, I think that's still the case). Now we can move on...
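To make the formula concrete, here is a minimal Python sketch that plugs a point margin into the fit above. The function name `predicted_wins` is mine, and the coefficients are just the rounded ones quoted in this post, so small differences from numbers quoted elsewhere can come down to rounding.

```python
def predicted_wins(point_margin_per_100, slope=2.54, intercept=40.9):
    """Predict season wins from point margin per 100 possessions.

    slope and intercept are the rounded coefficients quoted in the post
    for the 2009-10 season fit.
    """
    return slope * point_margin_per_100 + intercept


# The three cases mentioned in the text: break-even, +10, and -6.6.
for pm in (0.0, 10.0, -6.6):
    print(f"p.m. {pm:+.1f} -> {predicted_wins(pm):.1f} wins")
```

No strength-of-schedule adjustment is applied here; that would go into the point margin before it is passed in.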
So, that's what +/- is, and that's also why it should be clear that we need some sort of model for decoupling individual contributions from team +/- stats. How do we do this?
Fundamentally, it's a simple problem. What creates +/-? Points scored by the team minus points scored by the opponent. So, we can just take points scored by each player, distribute the points allowed by the team across all players evenly, and voila! That would be a very simple way to do this. Would it be "correct"? Well, technically, yes. All the points scored minus all the points allowed has to add up to +/-. But is it "right" in the sense that it properly attributes value? What about all the other things that players do besides score? What about rebounds? Assists? Steals? Turnovers? Why do we account for these stats? And how do we account for these stats? I'll get to the how in a bit, but let me address the why, right now. Points don't travel well. You can take that almost literally. If you put a bunch of high scoring point guards out on the floor, they probably won't come anywhere close to their individual +/- stats, because they will lose a ton of possessions due to poor rebounding. What if we put together a team of all centers? Well, they will probably get a ton of rebounds, but they won't be able to get the ball up the floor because they are such poor ball handlers compared to the guards. In short, a team is the sum of five positions, each of which brings its own strengths and weaknesses. To have a useful model (ideally, a predictive model), we need to account for everything. Well, everything that we can get our hands on.
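Here is a quick sketch of that "very simple way", just to show that it adds up even though it ignores everything except scoring. The function name and the box score numbers are invented for illustration, and it assumes all five players were on the floor the whole time.

```python
def naive_plus_minus(points_by_player, points_allowed):
    """Each player's own points minus an equal share of the points allowed."""
    share = points_allowed / len(points_by_player)
    return {name: pts - share for name, pts in points_by_player.items()}


# Hypothetical line for five players who played every minute (made-up numbers).
scored = {"PG": 30, "SG": 25, "SF": 20, "PF": 15, "C": 10}
allowed = 90

individual = naive_plus_minus(scored, allowed)
team_margin = sum(scored.values()) - allowed

# The individual numbers always sum to the team margin, by construction.
assert abs(sum(individual.values()) - team_margin) < 1e-9
print(individual)
```

The books balance, but the center who grabbed every rebound looks terrible and the high-volume scorer looks like a star, which is exactly the problem.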
The way that Dean Oliver, Berri, and others have created such models is by considering the possession as the atomic unit of basketball. A possession is usually defined by what ends one: FGM (field goal made), DRB (defensive rebound), FTM (free throw made; well, some of them, anyway), TOV (turnover). That's it. Each team will have essentially the same number of possessions during a game. Another useful fact is that the average number of points scored or allowed per possession in the NBA is 1.0. (Technically, it's a few hundredths of a point higher, but 1.0 is going to be good enough for our purposes.) With this number in mind, we can actually assign a value to the result of every possession at the team level and the player level. This is the marginal point value. Let me start with the team level:
| Result | Offense | Defense |
|---|---|---|
| FG (2 PT) | +1 | -1 |
| FG (3 PT) | +2 | -2 |
Note that assists, blocks, and personal fouls all have values of zero. The reason for this has to do with accounting. At the team level, all three are already accounted for by other results, either made or missed field goals (assists and blocks, respectively) or free throws (personal fouls). There is no need to count these again. At the player level, we will account for these. So, the next step is to distribute value at the player level so that points add up to the same total at the team level. Does that make sense? Here's how I distribute marginal points across players (explanations follow):
| Result | Offense | Defense |
|---|---|---|
| 2PT Assisted FG (shooter) | +0.7 | |
| 2PT Assisted FG (passer) | +0.3 | |
| 3PT Assisted FG (shooter) | +1.4 | |
| 3PT Assisted FG (passer) | +0.6 | |
| 2PT Unassisted FG | +1.0 | |
| 3PT Unassisted FG | +2.0 | |
| Any FG missed (minus BLK) | -0.7 | +0.14 (all) |
| 2PT FG made | | -0.2 (all) |
| 3PT FG made | | -0.4 (all) |
| FT made (minus And1) | +0.5 | -0.5 |
| FT missed (minus And1) | -0.5 | +0.5 |
Let me go through the rationale for each of these assignments:
- Field goals: On offense, the +1 or +2 values for made shots are obvious, and to my knowledge, all marginal value metrics (Oliver, Berri) use these values. For missed field goals, we encounter the first bit of separation from WP, namely that a player is debited only a portion (0.7) of a marginal point. The reason for this is that there is a 30% chance that the offense will get an ORB and the possession will continue. I discussed this a bit a few days ago here. WP subtracts a full point in this case.
- Assists: In my model, an assist is valued at 30% of the marginal value of the type of field goal that is made. This is one of the weakest parts of my model. The "true value" of an assist is not currently known (not that it is not knowable, though). I think everyone can agree that assists are helpful. Not all players can create their own shot. If we completely ignore the assist, then a player appears to be a more efficient scorer than he might otherwise be on a team lacking a good point guard. Although my 30% is essentially picked out of thin air, it is important to note that the total value of the assisted field goal is still +1.0 or +2.0 at the team level. This is not true for WP, which essentially double counts assists (or I should say 1.5X counts to be more accurate), since there is no distinction between assisted and unassisted FG in that model.
- Free throws: Fairly straightforward, except accounting for these is dramatically different if we are using box score stats or play-by-play (PBP) data. If we use PBP data, FT can be attributed easily (but analyzing PBP data is hard). If we use box score stats, we don't exactly know who to blame for giving up free throws, so we apportion blame to the players proportional to their PF, noting that 0.44 FTA is roughly one possession.
- And1: These are free points, since the player would have already been credited with a made field goal. You are credited value for making one, but not debited for missing.
- Blocks: In my model, the +0.7 coefficient for blocks has real meaning, since it gives the full value of the opponent's missed field goal to the player getting the block. To me, this makes perfect sense, since the guy getting his shot blocked missed the field goal and loses 0.7 pts. Shouldn't the blocker be credited with exactly the same amount?
- Let me do STL and TOV before getting to rebounds: A steal is worth +1, since you are taking away a possession from the opponent. A TOV is -1, since you are losing a possession and any chance for scoring. Team TOV should be debited or credited evenly among players (in the absence of a better model).
- Rebounds (yay!): This is where it gets interesting. Here's the basic logic. A missed field goal is worth -0.7 pts. The other -0.3 pts are lost if the team does not get the ORB. On the other hand, I credit the defense +0.7 (distributed evenly among 5 players) for creating the missed field goal, and +0.3 to the player who gets the DRB. On the offense, the players on the team that "gave up" the DRB are debited by an amount that adds up to -0.3 and is proportional to the league average DRB% at their position. So, a PG loses a tiny bit of value, while the PF/C loses quite a bit more. Two things to note here. 1) This is vastly different from the WP model. 2) Everything adds up and defense is accounted for. Now, you may argue it's not right to split the defensive credit evenly. I agree! Well, I don't disagree. Truth is I'm not sure right now what the best way to debit value is (maybe subtract 0.7 from the position counterpart that scored), but what I am sure of is that there needs to be this credit/debit for defense, and it needs to be of larger value than it is in the WP model. In the WP model, a DRB is worth +1. You can see now that is the equivalent of crediting the player with creating the missed field goal (i.e. the entire defensive possession). Is that a fair way to value the player? I don't think it is.
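To check that the bookkeeping above really does add up, here is a rough Python sketch of two of the rules: the assisted field goal split, and a missed shot that ends in a defensive rebound. This is not my actual spreadsheet; the function names are made up, and the positional rebounding weights are placeholders invented for the example rather than real league DRB% numbers.

```python
ORB_CHANCE = 0.3               # chance the offense rebounds its own miss (from the post)
MISS_VALUE = 1.0 - ORB_CHANCE  # 0.7, debited to the shooter on a miss
ASSIST_SHARE = 0.3             # passer's cut of an assisted field goal (from the post)


def assisted_fg(points):
    """Split the marginal value of a made, assisted shot between shooter and passer."""
    marginal = points - 1.0  # +1 for a two, +2 for a three
    return {"shooter": (1.0 - ASSIST_SHARE) * marginal,
            "passer": ASSIST_SHARE * marginal}


def missed_fg_with_drb(shooter, offense_weights, defenders, rebounder):
    """Value a missed shot that the defense rebounds.

    offense_weights maps each offensive player to a positional DRB weight
    (hypothetical values here); the weights are normalized inside.
    Returns (offense_values, defense_values), both keyed by player.
    """
    total_weight = sum(offense_weights.values())
    # The offense shares the -0.3 for giving up the rebound, by position weight.
    offense = {p: -ORB_CHANCE * w / total_weight for p, w in offense_weights.items()}
    # The shooter also takes the -0.7 for the miss itself.
    offense[shooter] -= MISS_VALUE
    # The defense splits +0.7 evenly; the rebounder gets the extra +0.3.
    defense = {p: MISS_VALUE / len(defenders) for p in defenders}
    defense[rebounder] += ORB_CHANCE
    return offense, defense


# Placeholder weights: bigs are blamed more than guards for a lost rebound.
weights = {"PG": 0.10, "SG": 0.12, "SF": 0.18, "PF": 0.28, "C": 0.32}
offense, defense = missed_fg_with_drb(
    "SG", weights, ["dPG", "dSG", "dSF", "dPF", "dC"], "dC")

# The whole possession nets -1 for the offense and +1 for the defense.
assert abs(sum(offense.values()) + 1.0) < 1e-9
assert abs(sum(defense.values()) - 1.0) < 1e-9
```

The two asserts at the end are the point: however the credit is sliced among players, the possession is still worth -1 to one team and +1 to the other.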
Ideally, we would apply this scoring system to PBP (play-by-play) data. But that is hard, and I have not implemented such a system yet. Over Thanksgiving, however, I checked that everything adds up by going through the PBP data line by line for a single game (remember that Denver game?). In lieu of PBP data, we can use box score stats. I've done that for the Warriors through Friday's game against the Heat. As a point of comparison, I also ran the numbers for the Spurs. Here are the numbers (for players with > 100 minutes played):
The column you want to look at is EZPM100, which is the point margin (+/-) for the player per 100 possessions on the floor according to my model. I'm not going to go into a heavy discussion of the actual numbers now (the rankings generally make sense, although there are probably some things that will surprise you). The model is still preliminary, as far as I'm concerned. As I said in the beginning of the post, this is the start, not the end of my work on the model. What I would really like to hear is feedback. Be as critical as you want. That's what makes this fun. If you think the coefficients are crap, say so. Besides using PBP data (which I think will have a big effect), I have tons of ideas for how to improve and test the validity of the model. Defense needs much more work. Do you have ideas for crediting/debiting value on defense? If so, I want to hear them. What about assists? How can we do that more rigorously? So many things to discuss.
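For anyone wanting to reproduce a per-100 number from raw totals, the normalization is presumably nothing more than the following. This is a sketch under that assumption; the function name and the sample numbers are made up.

```python
def per_100_possessions(marginal_points, possessions_on_floor):
    """Scale a player's total marginal points to a per-100-possession rate."""
    if possessions_on_floor <= 0:
        raise ValueError("possessions_on_floor must be positive")
    return 100.0 * marginal_points / possessions_on_floor


# Made-up example: +45 marginal points accumulated over 1500 possessions.
rate = per_100_possessions(45.0, 1500.0)
print(f"{rate:+.1f} per 100 possessions")
```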
- I am now using the rebounding stats from basketball-value.com, so the actual number of rebounds while a player was on the floor is taken into account (as opposed to an estimate of total rebounds based on minutes played). This appears to boost second unit guys more than starters (which is actually the opposite of what I had predicted).