ezPM Compared with RAPM: Part I

An APBR forum member (back2newbelf) has recently started publishing regularized adjusted +/- (RAPM) data. It’s like +/- data, but on steroids. If you want to learn more about how RAPM works, in general, see the paper presented by Joe Sill at the MIT Sloan Sports Analytics Conference in March, 2010. I thought it would be useful at this point in the development of the ezPM model to compare 1yr and 3yr averages with back2newbelf’s RAPM data. What follows are the results for regression of total RAPM on total ezPM100 (both metrics are per 100 possessions), with some tables of best/worst players by average of the two metrics, to give some idea of the actual numbers. In a subsequent post, I will perform the same type of analysis on the offensive and defensive components of the metrics.

1 Yr RAPM

The 1yr RAPM data set can be found here. I used a 1,000-possession minimum as my cutoff, which left 207 players to compare. Here are the results in graphical form, followed by regression data from R:

1 yr. RAPM as a function of ezPM100.
Call:
lm(formula = RAPM ~ EZPM, data = data.1yr)

Residuals:
    Min      1Q  Median      3Q     Max
-3.2659 -0.8885  0.0009  0.8501  4.3308

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.20883    0.09066   2.303   0.0223 *
EZPM         0.27176    0.03348   8.118 4.34e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.284 on 205 degrees of freedom
Multiple R-squared: 0.2433,	Adjusted R-squared: 0.2396
F-statistic: 65.91 on 1 and 205 DF,  p-value: 4.335e-14

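A few things to note here. 1) The regression result is highly significant (p=4.34e-14); 2) The slope of the regression is 0.27, which means that each additional point of ezPM100 corresponds, on average, to only 0.27 points of RAPM. Put the other way, ezPM100 spreads players over a much wider range than RAPM does (some compression is expected, since RAPM is regularized toward zero); and 3) ezPM100 explains about 24% of the variance in RAPM (R^2=0.24).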
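To make the fitted line concrete, here is a minimal Python sketch that plugs the two coefficients from the R output into the regression equation. The function names are mine, chosen for illustration; the example values are LeBron James's 1 yr. numbers (RAPM 3.3, ezPM100 7.37).

```python
# Coefficients copied from the 1 yr. lm() output above.
INTERCEPT = 0.20883
SLOPE = 0.27176


def predicted_rapm(ezpm100: float) -> float:
    """RAPM predicted by the fitted line for a given ezPM100."""
    return INTERCEPT + SLOPE * ezpm100


def residual(rapm: float, ezpm100: float) -> float:
    """Observed RAPM minus the value on the fitted line."""
    return rapm - predicted_rapm(ezpm100)


# LeBron James, 1 yr. data: RAPM 3.3, ezPM100 7.37
print(round(predicted_rapm(7.37), 2))  # 2.21
print(round(residual(3.3, 7.37), 2))   # 1.09
```

A positive residual means the player's RAPM is higher than the line predicts from his ezPM100.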

Let’s look at some of the best and worst players according to the 1-yr data by averaging the two metrics:

Top 20 Players by 1 Yr. Metric Average (ezPM100+RAPM)/2

RANK NAME RAPM EZPM AVG
1 LeBron James 3.3 7.37 5.34
2 Manu Ginobili 4.1 5.41 4.76
3 Chris Paul 2.7 6.63 4.67
4 Dwight Howard 1.7 7.48 4.59
5 Pau Gasol 2.9 6.04 4.47
6 Dwyane Wade 2.3 6.45 4.38
7 Paul Pierce 3.4 5.20 4.30
8 Steve Nash 2.5 5.67 4.09
9 Dirk Nowitzki 5.2 2.43 3.82
10 Kevin Garnett 4.0 3.45 3.73
11 Kobe Bryant 0.5 6.22 3.36
12 Tyson Chandler 3.4 3.25 3.33
13 Nene Hilario 2.4 4.12 3.26
14 George Hill 3.5 2.87 3.19
15 Ronnie Brewer 0.9 5.43 3.17
16 Kevin Durant 1.7 4.30 3.00
17 Brandon Bass 3.2 2.75 2.98
18 Lamar Odom 1.7 4.11 2.91
19 Rajon Rondo 2.2 3.45 2.83
20 Al Horford 1.1 4.51 2.81

Bottom 20 Players

RANK NAME RAPM EZPM AVG
207 J.J. Hickson -2.8 -4.99 -3.90
206 Andrea Bargnani -0.9 -5.89 -3.40
205 Goran Dragic -3.6 -3.13 -3.37
204 Eric Bledsoe -2.5 -3.57 -3.04
203 Sonny Weems -1.4 -4.47 -2.94
202 Jordan Hill -2.9 -2.93 -2.92
201 DeMarcus Cousins -2.1 -3.71 -2.91
200 John Wall -1.8 -4.00 -2.90
199 Travis Outlaw -2.8 -2.97 -2.89
198 Dante Cunningham -0.9 -4.80 -2.85
197 Spencer Hawes -0.8 -4.79 -2.80
196 Steve Blake 0.6 -6.03 -2.72
195 Richard Hamilton -2.7 -2.27 -2.49
194 Charlie Villanueva -0.1 -4.77 -2.44
193 Stephen Jackson -2.1 -2.33 -2.22
192 Linas Kleiza -0.1 -4.31 -2.21
191 Jeff Green -0.4 -3.99 -2.20
190 Antawn Jamison -1.7 -2.65 -2.18
189 Michael Beasley -0.8 -3.46 -2.13
188 Darko Milicic -1.0 -3.25 -2.13

Ok, both models are clearly wrong. Did you see that block J.J. Hickson made on Griffin last night?

3 Yr RAPM

The 3 yr. RAPM data can be found here. My data set runs from the 2008-2009 season through the first week of February. As far as I know, the RAPM data weights each year equally, so I did the same to make the comparison fair. As before, a plot followed by numbers:

 

3 yr. RAPM as a function of ezPM100.
Call:
lm(formula = RAPM ~ EZPM, data = avg.3yr)

Residuals:
    Min      1Q  Median      3Q     Max
-5.8064 -1.1009  0.0435  1.1527  5.9218

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.58592    0.10964   5.344 1.92e-07 ***
EZPM         0.43908    0.03725  11.786  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.703 on 273 degrees of freedom
Multiple R-squared: 0.3372,	Adjusted R-squared: 0.3348
F-statistic: 138.9 on 1 and 273 DF,  p-value: < 2.2e-16

As expected, there is an improvement in both the slope (0.44) and R^2 (~0.34). Here are the player tables:

Top 20 Players by 3 Yr Metric Average

RANK NAME RAPM EZPM AVG
1 LeBron James 10.6 9.8 10.2
2 Dwight Howard 6.7 10.1 8.4
3 Chris Paul 6.7 7.7 7.2
4 Kevin Garnett 7.2 4.1 5.6
5 Manu Ginobili 5.8 5.4 5.6
6 Tim Duncan 5.0 5.7 5.4
7 Steve Nash 7.4 3.3 5.3
8 Dirk Nowitzki 7.5 2.3 4.9
9 Kobe Bryant 4.7 5.0 4.8
10 Pau Gasol 4.0 5.4 4.7
11 Yao Ming 4.0 5.2 4.6
12 Chris Bosh 5.0 3.4 4.2
13 Paul Pierce 4.7 3.6 4.2
14 Greg Oden 2.7 5.6 4.2
15 Andrew Bogut 4.5 3.8 4.1
16 Lamar Odom 5.0 3.2 4.1
17 Leon Powe 1.4 6.7 4.1
18 Amir Johnson 5.2 2.9 4.1
19 Marcus Camby 3.2 4.3 3.8
20 Nene Hilario 3.6 3.9 3.8

Bottom 20 Players by 3yr Average (Or the List You Really Don’t Want to Appear On)

RANK NAME RAPM EZPM AVG
458 Gerald Green -1.9 -10.2 -6.0
457 Josh Powell -5.5 -6.4 -5.9
456 Bobby Brown -3.6 -7.4 -5.5
455 Adam Morrison -2.6 -7.7 -5.2
454 Ricky Davis -4.9 -5.0 -5.0
453 Darnell Jackson -2.8 -6.8 -4.8
452 Stephon Marbury -2.6 -6.8 -4.7
451 Brian Skinner -4.5 -4.6 -4.6
450 Brian Scalabrine -1.1 -8.0 -4.5
449 Johan Petro -3.7 -5.2 -4.5
448 Jannero Pargo -0.7 -8.1 -4.4
447 Marcus Williams -3.1 -5.6 -4.3
446 Malik Allen -2.2 -6.4 -4.3
445 Trenton Hassell -4.5 -3.8 -4.1
444 Timofey Mozgov -1.8 -6.5 -4.1
443 Rob Kurz -2.2 -6.0 -4.1
442 DaJuan Summers -1.2 -7.0 -4.1
441 J.J. Hickson -5.9 -2.2 -4.1
440 Oleksiy Pecherov -0.8 -7.3 -4.0
439 Brian Cook -2.6 -5.3 -3.9

Conclusions

This was definitely a worthwhile exercise. It’s good to see how the ezPM model compares to RAPM. Of course, it should not be expected that the two models line up perfectly. That would be great, but in practice, we should be using multiple models to evaluate players. Some players may look better in one metric or the other. We should have more confidence in players who are highly rated by both an APM model and a box score metric, such as ezPM. For example, what I didn’t show here are the players who were ranked in the top 20 by either metric alone. That would have shown that Derek Fisher is one of the best players in the league according to RAPM 1yr data (2.4), but not according to ezPM (-4.03). Kris Humphries looks great according to ezPM (4.52), but not RAPM (-1.4). (His new girlfriend always looks great!)

Anyway, this is a good stopping point, but also a good starting point. Going forward, we’ll see if there are adjustments that can be made to ezPM that will make it even more consistent with RAPM. For example, why is Dirk rated so much higher in RAPM? Does it have something to do with usage? His teammates? It’s also important to ask which model is a better predictor. If one or the other (or an average) is a better predictor, we probably want to know that, right? As always, to be continued…

Regressing Point Differential on The “Four Factors” (Part 2)

There are four factors of an offense or defense that define its efficiency: shooting percentage, turnover rate, offensive rebounding percentage, and getting to the foul line. Striving to control those factors leads to a more successful team. (Dean Oliver, “Basketball on Paper”)

How well do these four factors predict point differential (and thus, winning)? How important is each of the factors relative to the others? The first question was the subject of Part 1. Now would be a convenient time to read Part 1, if you haven’t already done so (don’t worry, Part 2 will still be here when you get back, thanks to the magic of the interwebs). Today we will address the second question.

How important is each of the factors relative to the others? In Part 1, we found the following model for predicting point differential (p.d.) as a function of the four factors (well, eight factors, counting offense and defense separately):

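$$ \mathrm{p.d.} = 10.41 + 1.49 \cdot \mathrm{eFG(own)} - 1.63 \cdot \mathrm{eFG(opp)} + 0.187 \cdot \mathrm{FTR(own)} - 0.213 \cdot \mathrm{FTR(opp)} - 1.51 \cdot \mathrm{TOR(own)} + 1.37 \cdot \mathrm{TOR(opp)} + 0.327 \cdot \mathrm{ORR(own)} - 0.365 \cdot \mathrm{ORR(opp)} $$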

where,

  • effective FG% (eFG): eFG=(FG+0.5 *3PT)/FGA
  • foul rate (FTR): FTR = FTA/FGA
  • turnover rate (TOR): TOR=TOV / (FGA + 0.44 * FTA + TOV)
  • offensive rebounding rate (ORR): ORR=ORB / (ORB + Opp DRB)
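As a quick illustration of the four definitions above, here is a small Python sketch that computes them from team box-score totals and reports them in percent (the units used in the tables in this post). The function name and the sample totals are made up for the example; they are not from a real game.

```python
def four_factors(fg, fg3, fga, fta, tov, orb, opp_drb):
    """Return the four factors, in percent, from one team's box-score totals."""
    efg = (fg + 0.5 * fg3) / fga          # effective FG%
    ftr = fta / fga                       # foul rate
    tor = tov / (fga + 0.44 * fta + tov)  # turnover rate
    orr = orb / (orb + opp_drb)           # offensive rebounding rate
    return {"eFG": 100 * efg, "FTR": 100 * ftr, "TOR": 100 * tor, "ORR": 100 * orr}


# Made-up totals, purely illustrative.
example = four_factors(fg=40, fg3=10, fga=80, fta=25, tov=9, orb=10, opp_drb=30)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

The opponent's numbers come from calling the same function on the opponent's totals, with `opp_drb` then being your own defensive rebounds.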

Recall that positive coefficients (own eFG%, own FTR, opp TOR, own ORR) mean that terms add to point differential, while negative coefficients (opp eFG%, opp FTR, own TOR, opp ORR or own DRR) subtract from point differential.
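For readers who want to play with the model, here is a minimal Python sketch of it. The coefficients are the ones in the equation above; the league-average inputs and the 2.3-point standard deviation for own eFG% are taken from the table later in this post. The dictionary keys and function name are my own labels. Because the published coefficients and averages are rounded, the baseline comes out a few tenths of a point away from zero rather than exactly matching the table.

```python
INTERCEPT = 10.41
COEF = {
    "eFG_own": 1.49, "eFG_opp": -1.63,
    "FTR_own": 0.187, "FTR_opp": -0.213,
    "TOR_own": -1.51, "TOR_opp": 1.37,
    "ORR_own": 0.327, "ORR_opp": -0.365,
}

# League averages (in percent) from the table later in the post.
LEAGUE_AVG = {
    "eFG_own": 49.6, "eFG_opp": 49.6,
    "FTR_own": 31.4, "FTR_opp": 31.4,
    "TOR_own": 13.8, "TOR_opp": 13.8,
    "ORR_own": 26.2, "ORR_opp": 26.3,
}


def point_differential(factors):
    """Predicted point differential for a dict of the eight factor values."""
    return INTERCEPT + sum(COEF[name] * factors[name] for name in COEF)


baseline = point_differential(LEAGUE_AVG)
# Raise own eFG% by one standard deviation (2.3 points), everything else fixed.
better_shooting = dict(LEAGUE_AVG, eFG_own=LEAGUE_AVG["eFG_own"] + 2.3)
gain = point_differential(better_shooting) - baseline

print(f"baseline p.d.: {baseline:.2f}")
print(f"gain from +1 STD own eFG%: {gain:.2f}")
```

Since the model is linear, the gain from changing one factor is just its coefficient times the change, regardless of the other inputs.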

Upon inspection of the model, one is, perhaps, initially tempted to conclude that the most important terms are the ones with the largest coefficients (in terms of absolute value): eFG% and TOR. The problem with that logic is twofold: 1) The means of the stats differ widely (e.g. eFG% is typically around 50%, whereas TOR~13%). Therefore, even though the coefficients for eFG% and TOR are similar, the eFG% term is larger overall, and would appear to dominate. 2) The spread of each stat differs as well. In other words, even if a parameter appears to be a large contributor based on its coefficient and mean, if it varies little between teams, it will not have a large effect on winning.

Fortunately, there is a straightforward way to deal with both issues and get at the truth. Specifically, we can use the model, itself, to calculate the variation in total wins due to a normalized change in each parameter. Here’s how it works. First, I will temporarily take over David Stern’s role as NBA Commish (thank you, thank you), and create a new franchise in Las Vegas (VEG — Vegas, baby!). Next, we will magically skip forward to next pre-season — yes, there was an expansion draft, no, VEG did not get LeBron James, although I will not rule out James having taken his talents to Vegas on several occasions. Before the season begins, we would like to predict how many wins VEG might have. How do we do this?

Oh, right, the model! Let’s start out by assuming (optimistically) that our new franchise is average in all eight of the four factors (you know what I mean). How many wins would such a team produce (if you’ve already guessed around 41, eat a cookie or something)? Take a look at the table below. I’ve calculated the NBA average value and standard deviation (STD) for each category. Next, I varied each parameter by one standard deviation (in a direction that increases wins for that category), and used the model to predict point differential (P.D.) and wins (uh, Wins). Wins are related to point differential by the following formula (see here for explanation):

W = 2.54 * p.d. + 40.9

                            eFG%              FTR               TOR               ORR
Team  Factor     P.D.   Own   Opp  Diff   Own   Opp  Diff   Own   Opp  Diff   Own   Opp  Diff   Wins
NBA               0.2  49.6  49.6   0.0  31.4  31.4   0.0  13.8  13.8  -0.0  26.2  26.3  -0.1
STD               6.1   2.3   2.0   3.5   3.3   3.3   5.2   0.9   1.1   1.4   2.8   2.7   3.1
VEG0  UNCH        0.1  49.6  49.6   0.0  31.4  31.4   0.0  13.8  13.8   0.0  26.2  26.3  -0.1   41.2
VEG1  eFG(own)    3.5  51.9  49.6   2.3  31.4  31.4   0.0  13.8  13.8   0.0  26.2  26.3  -0.1   49.9
VEG2  eFG(opp)    3.3  49.6  47.6   2.0  31.4  31.4   0.0  13.8  13.8   0.0  26.2  26.3  -0.1   49.5
VEG3  FTR(own)    0.7  49.6  49.6   0.0  34.7  31.4   3.3  13.8  13.8   0.0  26.2  26.3  -0.1   42.8
VEG4  FTR(opp)    0.8  49.6  49.6   0.0  31.4  28.1   3.3  13.8  13.8   0.0  26.2  26.3  -0.1   43.0
VEG5  TOR(own)    1.4  49.6  49.6   0.0  31.4  31.4   0.0  12.9  13.8  -0.9  26.2  26.3  -0.1   44.7
VEG6  TOR(opp)    1.6  49.6  49.6   0.0  31.4  31.4   0.0  13.8  14.9  -1.1  26.2  26.3  -0.1   45.0
VEG7  ORR(own)    1.0  49.6  49.6   0.0  31.4  31.4   0.0  13.8  13.8   0.0  29.0  26.3   2.7   43.6
VEG8  ORR(opp)    1.1  49.6  49.6   0.0  31.4  31.4   0.0  13.8  13.8   0.0  26.2  23.6   2.6   43.7

As expected, if VEG is totally average across-the-board (case VEG0), the model predicts 41.2 wins (no surprise, eh, that’s just about 50%). And if you’re complaining that the prediction is not exactly 41.0 wins, well, get a life. (And curl up with a good statistics book that can tell you about the nature of error and uncertainty in model predictions.)

Next, we change eFG%(own) by +1 STD from 49.6 to 51.9 (VEG1). The result is that VEG is now predicted to win 49.9 games. Wow! That’s an increase of almost 9 wins, just by varying eFG% by 1 STD. What happens when we do the same thing to the other categories? Ok, alright. You get it by now…just look at the table.

Having varied each factor by +(-) 1 STD, we can now rank the factors in terms of wins produced over average. We see that the ranking goes:

Rank Factor Case Prediction Wins Delta %
1 eFG(own) VEG1 3.5 49.9 8.7 26.8%
2 eFG(opp) VEG2 3.3 49.5 8.3 25.5%
3 TOR(opp) VEG6 1.6 45.0 3.8 11.8%
4 TOR(own) VEG5 1.4 44.7 3.5 10.6%
5 ORR(opp) VEG8 1.1 43.7 2.5 7.6%
6 ORR(own) VEG7 1.0 43.6 2.4 7.3%
7 FTR(opp) VEG4 0.8 43.0 1.8 5.5%
8 FTR(own) VEG3 0.7 42.8 1.6 4.9%

The last column (%) takes the wins produced above average (Delta) and divides that amount by the sum of the Deltas across all eight cases. This is what we were looking for to begin with: the relative weight of each factor. Note that shooting efficiency (producing it and defending against it) accounts for about 52% of the extra wins. Shooting efficiency is followed by turnover rate, rebounding, and foul rate.
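The whole ranking can be reproduced in a few lines. Because the model is linear, the extra wins from a one-STD change in a factor are just 2.54 times the size of its coefficient times its standard deviation. The Python sketch below uses the rounded coefficients and STDs printed in this post, so individual values can differ from the table by about a tenth of a win; the variable names are mine.

```python
WINS_PER_POINT = 2.54  # slope of the wins vs. point differential formula

# |coefficient| and standard deviation for each factor, as printed in the post.
COEF_ABS = {
    "eFG(own)": 1.49, "eFG(opp)": 1.63,
    "FTR(own)": 0.187, "FTR(opp)": 0.213,
    "TOR(own)": 1.51, "TOR(opp)": 1.37,
    "ORR(own)": 0.327, "ORR(opp)": 0.365,
}
STD = {
    "eFG(own)": 2.3, "eFG(opp)": 2.0,
    "FTR(own)": 3.3, "FTR(opp)": 3.3,
    "TOR(own)": 0.9, "TOR(opp)": 1.1,
    "ORR(own)": 2.8, "ORR(opp)": 2.7,
}

# Extra wins from moving each factor one STD in the helpful direction.
delta_wins = {f: WINS_PER_POINT * COEF_ABS[f] * STD[f] for f in COEF_ABS}
total = sum(delta_wins.values())
share = {f: d / total for f, d in delta_wins.items()}

ranking = sorted(delta_wins, key=delta_wins.get, reverse=True)
for factor in ranking:
    print(f"{factor:9s} {delta_wins[factor]:5.2f} wins  {100 * share[factor]:5.1f}%")

shooting_share = share["eFG(own)"] + share["eFG(opp)"]
print(f"shooting share: {100 * shooting_share:.1f}%")
```

The ordering matches the table above, and the two shooting terms together come out at a little over half of the total.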

To bring this back to reality a bit, now let’s look at the current season and how teams at the top and bottom of the league are doing with respect to the four factors. The top and bottom rows represent hypothetical teams that are +1 or -1 STD relative to the mean in all 8 factors.

                             eFG%              FTR               TOR               ORR
Rank Team   P.D.  Wins   Own   Opp  Diff   Own   Opp  Diff   Own   Opp  Diff   Own   Opp  Diff
+1          12.7  73.2  51.8  47.6   4.2  34.7  28.1   6.6  12.9  14.9  -2.0  29.0  23.6   5.4
1    BOS    12.4  72.5  54.3  46.7   7.6  31.4  32.7  -1.4  14.5  15.5  -1.0  21.3  23.1  -1.8
2    MIA    11.5  70.1  51.7  46.1   5.6  38.0  31.3   6.7  12.8  13.5  -0.7  24.8  24.5   0.3
3    SAS     8.9  63.5  52.3  49.4   2.9  31.9  24.5   7.4  12.9  14.8  -1.8  26.4  25.7   0.7
4    LAL     7.4  59.6  50.7  47.5   3.2  29.8  25.2   4.7  12.6  13.3  -0.8  30.0  29.8   0.2
5    DAL     7.2  59.1  52.1  47.4   4.7  30.5  27.7   2.7  13.9  13.6   0.3  23.6  25.3  -1.7
26   SAC    -6.7  23.9  46.8  51.0  -4.1  29.9  35.1  -5.2  13.8  13.5   0.4  29.9  26.5   3.4
27   NJN    -7.0  23.1  46.8  49.2  -2.4  31.4  34.3  -2.9  13.7  11.6   2.1  24.6  25.0  -0.4
28   MIN    -8.0  20.5  47.7  51.2  -3.5  28.5  36.0  -7.6  15.3  13.0   2.2  30.9  24.9   6.0
29   WAS    -9.5  16.9  47.9  52.1  -4.2  29.4  33.1  -3.8  14.9  14.5   0.4  28.6  32.6  -4.0
30   CLE   -10.0  15.5  46.5  53.0  -6.6  30.1  28.2   1.9  12.7  12.7  -0.0  22.0  23.7  -1.7
-1         -12.0  10.5  47.5  51.5  -4.0  28.2  34.6  -6.4  14.7  12.8   2.1  23.5  28.9  -5.4

I’ve highlighted in green (red) the values that are above (below) 1 STD from the mean (in a direction that produces more or less wins, respectively). Lastly, since this is a Warriors-centric blog, let’s take a (sad and unfortunate) look at my favorite team with respect to the four factors:

                        eFG%              FTR               TOR               ORR
Team   P.D.  Wins   Own   Opp  Diff   Own   Opp  Diff   Own   Opp  Diff   Own   Opp  Diff
GSW    -4.7  29.0  49.8  51.4  -1.7  24.3  36.9 -12.6  14.4  15.1  -0.7  29.7  30.5  -0.8

Interestingly, the Warriors are not terrible in shooting, although they are just below league average in offensive efficiency, and well below in defensive efficiency. The Warriors are absolutely terrible in FTR. In fact, they are about 2 STD below the league average in going to the line. Surprisingly, considering the off-season acquisition of Lee and the return of Biedrins, the defensive rebounding is really bad. However, the offensive rebounding is actually very good. So, that’s a push. The Warriors are good at forcing turnovers, and this is keeping them from dropping to the very bottom of the league. Make no mistake about it, though, the Warriors will continue to be cellar dwellers until their offensive and defensive shooting efficiency improves.

Summary

I have shown that offensive and defensive shooting efficiency are by far the most important of the four factors, accounting for over 50% of wins alone. In comparison, offensive and defensive rebounding account for about 15% of wins. For reference, we can compare my results to Dean Oliver’s estimates for the weight of each factor:

Factor DeanO Regression
Shooting 40% 52%
Turnovers 25% 22%
Rebounding 20% 15%
Foul Rate 15% 10%

I like that the results of the regression are consistent with what Oliver found, and it is especially comforting since I haven’t been able to actually track down his studies that show how he derived these weights. I assume he performed a similar analysis, but there may, of course, be other ways to arrive at the same conclusion. Lastly, it should be clear that the challenge for player valuation models is attributing credit for each of the factors to individual play. It is important to think about this team level analysis when you are considering models like Wins Produced, PER, Win Shares, etc.

EZPM: Yet Another Model for Player Evaluation

Notes

Note #1: Before I get into this, I want to make it clear that many, perhaps most (hell, maybe all), of the ideas in the proposed model I am going to describe are not new. They may seem new to you, but I promise there are folks out there who have thought about these things before and heavily influenced my particular choices for all terms and coefficients in the model. Some of these folks you may have heard of before, including Dean Oliver (“Basketball on Paper”), John Hollinger (Pro Basketball Prospectus, ESPN), Dave Berri (Wages of Wins, Stumbling on Wins), and Dan Rosenbaum (who, I believe, has done stats for the Cavs for the past several years).

Note #2: Another motivator for my doing this is to help others get up to speed on (most) of the issues that need to be addressed when considering or developing player valuation models. As an engineer, I’m a big fan of rigorously defining a problem before tackling it. In developing this model and writing up my findings, it has helped me to better understand the problem (and basketball, hopefully). It has also raised many questions that need to be tackled going forward.

Note #3: This is the start of a discussion, not the end of one.

Note #4: Some will take this post as an implicit criticism of some other existing models. To some extent, that is obviously true. Let me name one, to be exact, since it’s the elephant in the room: Wins Produced or WP (the metric developed principally by Dave Berri). First, let me emphasize that my understanding and appreciation of WP is what led me to start considering alternative models. My model, on the face of it, is not drastically different from WP. In fact, since both models are tied to point margin, it is really only the player valuation aspects that will be different. Of course, you will say, that is a large part. True. And if it weren’t important, I wouldn’t have bothered with all this. Finally, I would add that it has become clear to me that WP is not really going to change any time soon. Many of the components of my proposed model could easily be fit into the framework of WP, but I don’t see that happening. If my post inspires Berri and others to re-consider their models, great. If not, that’s fine, too. I don’t mind rowing the canoe solo. That pretty much sums up my entire academic career. If this post inspires others to develop their own models or revise mine, that would be the greatest outcome of all. Go for it! (And let me know the results.)


First, what is a +/- model? Ok, let me back it up one step. What is +/-? At the team level, this is the number of points a team goes up or falls behind while it is on the floor. If starters played an entire game, this would simply be the point differential at the end of the game. Because players come in and out of the game, +/- typically refers to the number of points a team goes up or down while a player was in the game. In other words, +/- is assigned to individual players, but represents a team outcome. If Stephen Curry’s +/- is +10 in a game, that doesn’t (necessarily) mean Curry created a +10 point differential by himself. All it means is that while Curry was on the floor, the Warriors outscored their opponent by 10 points. Therefore, to be a meaningful statistic for evaluating players, what we really want to know is the individual +/-, or how a particular player contributed to the aggregate +/-. Say Curry was +5, Ellis was +6, Lee was +2, Biedrins was +1 and Dorell Wright was -4. In this example, the team +/- is +10, but Dorell Wright was actually a negative contributor. Going a step further, we can foresee a situation where some players are attributed +/- stats simply by playing with 4 other really good players.

One more thing, before I go further. Why do we care about +/-? Intuitively, it should be obvious that teams that score more points than they allow will win games. In fact, there is such a good relationship between +/- and wins, that we can actually formulate a simple equation that predicts wins based on point margin (another term for +/-). For example, Fig. 1 shows wins as a function of point margin per 100 possessions (simply offensive rating – defensive rating) for all teams for the 2009-10 NBA season.

Figure 1

Here, the “winning formula” (literally) is:

  • W = 2.54 * p.m. + 40.9

A team with a p.m. of zero wins about 41 games (hey, that makes sense, right?). A team with a p.m. of +10 (that would be San Antonio currently) should win about 66 games. The Warriors, with a p.m. currently of -6.6, are predicted to win about 24 (ugh). Now, it should be noted that p.m. can be adjusted for strength of schedule, which makes a large difference early in the season, and between conferences, since the East is weaker than the West (at least, I think that’s still the case). Now we can move on…
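The winning formula is simple enough to check by hand, but here it is as a tiny Python sketch anyway. The function name is my own; the slope and intercept are the ones from the fit above.

```python
def predicted_wins(point_margin_per_100: float) -> float:
    """Predicted wins over an 82-game season from point margin per 100 possessions."""
    return 2.54 * point_margin_per_100 + 40.9


print(round(predicted_wins(0.0), 1))   # 40.9
print(round(predicted_wins(10.0), 1))  # 66.3
```

Each point of margin per 100 possessions is worth about two and a half wins over a season.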

So, that’s what +/- is, and that’s also why it should be clear that we need some sort of model for decoupling individual contributions from team +/- stats. How do we do this?

Fundamentally, it’s a simple problem. What creates +/-? Points scored by the team minus points scored by the opponent. So, we can just take points scored by each player, distribute the points allowed by the team across all players evenly, and voila! That would be a very simple way to do this. Would it be “correct”? Well, technically, yes. All the points scored minus all the points allowed has to add up to +/-. But is it “right” in the sense that it properly attributes value? What about all the other things that players do besides score? What about rebounds? Assists? Steals? Turnovers? Why do we account for these stats? And how do we account for these stats? I’ll get to the how in a bit, but let me address the why, right now. Points don’t travel well. You can take that almost literally. If you put a bunch of high scoring point guards out on the floor, they probably won’t come anywhere close to their individual +/- stats, because they will lose a ton of possessions due to poor rebounding. What if we put together a team of all centers? Well, they will probably get a ton of rebounds, but they won’t be able to get the ball up the floor because they are such poor ball handlers compared to the guards. In short, a team is the sum of five positions, each of which brings its own strengths and weaknesses. To have a useful model (ideally, a predictive model), we need to account for everything. Well, everything that we can get our hands on.

The way that Dean Oliver, Berri, and others have created such models is by considering the possession as the atomic unit of basketball. A possession is usually defined by what ends one: FGM (field goal made), DRB (defensive rebound), FTM (free throw made —well, some of them, anyway), TOV (turnover). That’s it. Each team essentially will have the same number of possessions during a game. Another useful fact is that the average number of points scored or allowed per possession in the NBA is 1.0. (Technically, it’s a few hundredths of a point higher, but 1.0 is going to be good enough for our purposes.) With this number in mind, we can actually assign a value to the result of every possession at the team level and the player level. This is the marginal point value. Let me start with the team level:

Result Offense Defense
FG (2 PT) +1 -1
FG (3 PT) +2 -2
FG miss -1 +1
FT made +0.5 -0.5
FT miss -0.5 +0.5
And1 +1 -1
ORB team +1 -1
DRB -1 +1
Assist 0 0
Block 0 0
PF 0 0
TOV -1 +1
STL -1 +1

Note that assists, blocks, and personal fouls all have values of zero. The reason for this has to do with accounting. At the team level, all three are already accounted for by other results, either made or missed field goals (assists and blocks, respectively) or free throws (personal fouls).  There is no need to count these again. At the player level, we will account for these. So, the next step is to distribute value at the player level so that points add up to the same total at the team level. Does that make sense? Here’s how I distribute marginal points across players (explanations follow):

Result Offense Defense
2PT Assisted FG (shooter) +0.7
2PT Assisted FG (passer) +0.3
3PT Assisted FG (shooter) +1.4
3PT Assisted FG (passer) +0.6
2PT Unassisted FG +1.0
3PT Unassisted FG +2.0
Any FG missed (minus BLK) -0.7 0.14 (all)
2PT FG made -0.2 (all)
3PT FG made -0.4 (all)
FT made (minus And1) 0.5 -0.5
FT missed (minus And1) -0.5 +0.5
And1 made +1 -1
And1 missed 0 0
ORB +0.7 -0.7*DRB%
DRB +0.3 -0.3*ORB%
BLK -0.7 +0.7
TOV -1 +0.2
STL +1

Let me go through the rationale for each of these assignments:

  1. Field goals: On offense, the +1 or +2 values for made shots are obvious, and to my knowledge, all marginal value metrics (Oliver, Berri) use these values. For missed field goals, we encounter the first bit of separation from WP, namely that a player is debited only a portion (0.7) of a marginal point. The reason for this is that there is a 30% chance that the offense will get an ORB and the possession will continue. I discussed this a bit a few days ago here. WP subtracts a full point in this case.
  2. Assists: In my model, an assist is valued at 30% of the marginal value of the type of field goal that is made. This is one of the weakest parts of my model. The “true value” of an assist is not currently known (not that it is not knowable, though). I think everyone can agree that assists are helpful. Not all players can create their own shot. If we completely ignore the assist, then a player appears to be a more efficient scorer than he might otherwise be on a team lacking a good point guard. Although my 30% is essentially picked out of thin air, it is important to note that the total value of the assisted field goal is still +1.0 or +2.0 at the team level. This is not true for WP, which essentially double counts assists (or I should say 1.5X counts to be more accurate), since there is no distinction between assisted and unassisted FG in that model.
  3. Free throws: Fairly straightforward, except accounting for these is dramatically different if we are using box score stats or play-by-play (PBP) data. If we use PBP data, FT can be attributed easily (but analyzing PBP data is hard). If we use box score stats, we don’t exactly know who to blame for giving up free throws, so we apportion blame to the players proportional to their PF, noting that 0.44 FTA is roughly one possession.
  4. And1: These are free points, since the player would have already been credited with a made field goal. You are credited value for making one, but not debited for missing.
  5. Blocks: In my model, the +0.7 coefficient for blocks has real meaning, since it gives the full value of the opponent’s missed field goal to the player getting the block. To me, this makes perfect sense, since the guy getting his shot blocked missed the field goal and loses 0.7 pts. Shouldn’t the blocker be credited with exactly the same amount?
  6. Let me do STL and TOV before getting to rebounds: A steal is worth +1, since you are taking away a possession from the opponent. A TOV is -1, since you are losing a possession and any chance for scoring. Team TOV should be debited or credited evenly among players (in the absence of a better model).
  7. Rebounds (yay!): This is where it gets interesting. Here’s the basic logic. A missed field goal is worth -0.7 pts. The other -0.3 pts are lost, if the team does not get the ORB. On the other hand, I credit the defense +0.7 (distributed evenly among 5 players) for creating the missed field goal, and +0.3 to the player who gets the DRB. On the offense, the players on the team that “gave up” the DRB are debited by an amount that adds up to -0.3 and is proportional to the league average DRB% at their position. So, a PG loses a tiny bit of value, while the PF/C loses quite a bit more. Two things to note here. 1) This is vastly different from the WP model. 2) Everything adds up and defense is accounted for. Now, you may argue it’s not right to split the defensive credit evenly. I agree! Well, I don’t disagree. Truth is I’m not sure right now what the best way to debit value is (maybe subtract 0.7 from the position counterpart that scored), but what I am sure of is that there needs to be this credit/debit for defense, and it needs to be of larger value than it is in the WP model. In the WP model, a DRB is worth +1. You can see now that is the equivalent of crediting the player with creating the missed field goal (i.e. the entire defensive possession). Is that a fair way to value the player? I don’t think it is.
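To show how the pieces add up, here is a small Python sketch of the field goal accounting described above. It is only an illustration of the bookkeeping, not the full model: the function names are mine, and writing the marginal value of a made shot as "points minus 1.0" is my reading of the +1/+2 values, based on the 1.0 points-per-possession baseline.

```python
ASSIST_SHARE = 0.3  # passer's share of an assisted field goal
ORB_CHANCE = 0.3    # chance the offense rebounds its own miss
PLAYERS_ON_FLOOR = 5


def made_fg_credit(points, assisted):
    """Return (shooter, passer) marginal points for a made 2PT or 3PT field goal."""
    marginal = points - 1.0  # 2PT -> +1, 3PT -> +2
    if assisted:
        return marginal * (1 - ASSIST_SHARE), marginal * ASSIST_SHARE
    return marginal, 0.0


def missed_fg_split():
    """Return (shooter debit, credit per defender) for an unblocked missed shot."""
    shooter = -(1 - ORB_CHANCE)
    per_defender = (1 - ORB_CHANCE) / PLAYERS_ON_FLOOR
    return shooter, per_defender


shooter, passer = made_fg_credit(3, assisted=True)
print(round(shooter, 2), round(passer, 2), round(shooter + passer, 2))

debit, per_defender = missed_fg_split()
print(round(debit, 2), round(per_defender, 2), round(debit + PLAYERS_ON_FLOOR * per_defender, 2))
```

The point of the sketch is the last number on each line: shooter plus passer still equals the team-level value of the basket, and the shooter's debit on a miss is exactly offset by the credit spread across the five defenders.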

Ideally, we would apply this scoring system to PBP (play-by-play) data. But that is hard, and I have not implemented such a system yet. Over Thanksgiving, however, I checked that everything adds up by going through the PBP data line by line for a single game (remember that Denver game?). In lieu of PBP data, we can use box score stats. I’ve done that for the Warriors through Friday’s game against the Heat. As a point of comparison, I also ran the numbers for the Spurs. Here are the numbers (for players with > 100 minutes played):

2010-11 Warriors

Player POS MIN OFF O100 DEF D100 EZPM EZPM100
Stephen Curry PG 673 76.5 5.6 -29.5 -2.2 47.0 3.4
Monta Ellis SG 918 63.1 3.4 -43.2 -2.3 19.9 1.1
Reggie Williams SG 515 44.2 4.2 -49.2 -4.7 -5.1 -0.5
Jeremy Lin PG 134 -8.9 -3.3 6.9 2.5 -2.0 -0.7
Dorell Wright SF 909 41.8 2.3 -59.5 -3.2 -17.7 -1.0
Rodney Carney SF 266 -0.9 -0.2 -24.7 -4.6 -25.6 -4.7
V. Radmanovic PF 264 12.5 2.3 -38.3 -7.1 -25.7 -4.8
David Lee PF 539 12.0 1.1 -67.4 -6.2 -55.4 -5.1
Charlie Bell SG 106 -7.0 -3.2 -7.0 -3.3 -14.0 -6.5
Andris Biedrins C 636 6.9 0.5 -93.0 -7.2 -86.0 -6.7
Jeff Adrien PF 162 4.3 1.3 -26.3 -8.0 -22.1 -6.7
Dan Gadzuric C 217 -2.7 -0.6 -33.0 -7.5 -35.7 -8.1

2010-2011 Spurs

Player POS MIN OFF O100 DEF D100 EZPM EZPM100
Manu Ginobili SG 735 149.2 10.2 -10.2 -0.7 139.0 9.5
Richard Jefferson SF 729 120.2 8.3 -39.4 -2.7 80.8 5.6
James Anderson SG 106 16.8 8.0 -5.3 -2.5 11.4 5.4
George Hill PG 596 79.0 6.7 -15.3 -1.3 63.6 5.4
Tony Parker PG 758 93.2 6.2 -18.5 -1.2 74.7 5.0
Gary Neal PG 365 50.7 7.0 -20.2 -2.8 30.5 4.2
Matt Bonner PF 359 57.6 8.1 -32.3 -4.5 25.3 3.6
Tim Duncan C 662 54.5 4.2 -47.0 -3.6 7.5 0.6
DeJuan Blair PF 472 24.2 2.6 -30.5 -3.3 -6.3 -0.7
Tiago Splitter C 223 15.1 3.4 -26.4 -6.0 -11.3 -2.6
Antonio McDyess PF 380 6.3 0.8 -36.7 -4.9 -30.4 -4.0

The column you want to look at is EZPM100, which is the point margin (+/-) for the player per 100 possessions on the floor according to my model. I’m not going to go into a heavy discussion of the actual numbers now (although the rankings generally make sense, but there are probably some things that will surprise you). The model is still preliminary, as far as I’m concerned. As I said in the beginning of the post, this is the start, not the end of my work on the model. What I would really like to hear is feedback. Be as critical as you want. That’s what makes this fun. If you think the coefficients are crap, say so. Besides using PBP data (which I think will have a big effect), I have tons of ideas for how to improve and test the validity of the model. Defense needs much more work. Do you have ideas for crediting/debiting value on defense? If so, I want to hear them. What about assists? How can we do that more rigorously? So many things to discuss.
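For anyone checking the tables, the columns relate to each other in a simple way: EZPM is OFF plus DEF, and the per-100 columns divide by possessions on the floor. The Python sketch below shows the arithmetic. The possession estimate from minutes and a team pace is purely my assumption for illustration (the pace value is hypothetical, and the tables may have been built from a different possession count), so treat the per-100 result as approximate.

```python
def ezpm_total(off: float, deff: float) -> float:
    """Total marginal points: offensive plus defensive contribution."""
    return off + deff


def estimated_possessions(minutes: float, pace_per_48: float) -> float:
    """Rough possessions on the floor, assuming a constant team pace (an assumption)."""
    return minutes * pace_per_48 / 48.0


def per_100(value: float, possessions: float) -> float:
    """Scale a total to a per-100-possession rate."""
    return 100.0 * value / possessions


ASSUMED_PACE = 98.0  # hypothetical possessions per 48 minutes, not from the post

# Stephen Curry's row: MIN 673, OFF 76.5, DEF -29.5
total = ezpm_total(76.5, -29.5)
poss = estimated_possessions(673, ASSUMED_PACE)
print(round(total, 1))
print(round(per_100(total, poss), 1))
```

With a real possession count per player (for example, from play-by-play data), `estimated_possessions` would simply be replaced by the observed number.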

Update (12/13/10):

  1. I realized that I had put the missed FTA (debits) in the defensive column, instead of the offensive column. I have updated the tables to reflect that change.
  2. I am now using the rebounding stats from basketball-value.com, so the actual number of rebounds while a player was on the floor is taken into account (as opposed to an estimate of total rebounds based on minutes played). This appears to boost second unit guys more than starters (which is actually the opposite of what I had predicted).