 A

Adjusted R-Squared, R-Squared Adjusted -  A version of R-Squared that has been adjusted for the number of predictors in the model.  R-Squared tends to overestimate the strength of the association, especially when the model has more than one independent variable.

C

Cp Statistic -   Cp measures the differences of a fitted regression model from a true model, along with the random error.  When a regression model with p independent variables contains only random differences from a true model, the average value of Cp is (p+1), the number of parameters. Thus, in evaluating many alternative regression models, our goal is to find models whose Cp is close to or below (p+1). (Statistics for Managers, page 917.)

Cp Statistic formula:

Cp = [(1 – Rp²)(n – T) / (1 – RT²)] – [n – 2(p + 1)]

p = number of independent variables included in a regression model
T = total number of parameters (including the intercept) to be estimated in the full regression model
Rp2 = coefficient of multiple determination for a regression model that has p independent variables
RT2 = coefficient of multiple determination for a full regression model that contains all T estimated parameters.
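As a sketch of how the Cp statistic can be computed with the definitions above (the dataset and the helper function names are made up for illustration; note that for the full model itself, the formula reduces exactly to T = p + 1):

```python
import numpy as np

def r_squared(X, y):
    """R-squared from an OLS fit of y on the columns of X, with an intercept added."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def cp(n, p, r2_p, T, r2_T):
    """Cp = (1 - Rp^2)(n - T) / (1 - RT^2) - [n - 2(p + 1)]."""
    return (1.0 - r2_p) * (n - T) / (1.0 - r2_T) - (n - 2 * (p + 1))

# Toy data: y depends on x1; x2 is pure noise (illustrative only)
rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.5, size=n)

X_full = np.column_stack([x1, x2])
r2_full = r_squared(X_full, y)           # full model: T = 3 parameters
cp_full = cp(n, 2, r2_full, 3, r2_full)  # for the full model, Cp = p + 1 = 3
```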

Confidence Interval - The lower endpoint on a confidence interval is called the lower bound or lower limit.  The lower bound is the point estimate minus the margin of error.  The upper bound is the point estimate plus the margin of error.

Coefficient of Determination -  In general, the coefficient of determination measures the amount of variation in the response variable that is explained by the predictor variable(s).  The coefficient of simple determination is denoted by r-squared, and the coefficient of multiple determination is denoted by R-squared.  (See r-squared.)

Coefficient of Variation -  In general, the coefficient of variation measures the amount of variation in the response variable.  If this value is small, then the data is considered ill conditioned.

Correlation Coefficients, Pearson’s r -  Measures the strength of linear association between two numerical variables.  (See r.)

D

DFITS, DFFITS:  Combines leverage and studentized residual (deleted t residuals) into one overall measure of how unusual an observation is.  DFFITS is the difference between the fitted values calculated with and without the ith observation, scaled by stdev(Ŷi).  Belsley, Kuh, and Welsch suggest that observations with DFFITS > 2√(p/n) should be considered unusual.  (Minitab, page 2-9.)
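A small numpy sketch of this measure, using the standard identity DFFITS_i = (studentized deleted residual) × √(hi/(1 − hi)); the dataset below is made up, with one deliberately shifted observation:

```python
import numpy as np

def dffits(X, y):
    """DFFITS for each observation; X must include a column of ones for the intercept."""
    n, k = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
    h = np.diag(H)                          # leverages
    e = y - H @ y                           # ordinary residuals
    sse = e @ e
    # Deleted (leave-one-out) variance estimate s_(i)^2
    s2_del = (sse - e**2 / (1 - h)) / (n - k - 1)
    t = e / np.sqrt(s2_del * (1 - h))       # studentized deleted residuals
    return t * np.sqrt(h / (1 - h))

# Toy data on a line, with one shifted point at index 10
rng = np.random.default_rng(42)
x = np.arange(20.0)
y = 2.0 * x + rng.normal(scale=0.5, size=20)
y[10] += 8.0                                # inject an unusual observation

X = np.column_stack([np.ones_like(x), x])
d = dffits(X, y)
flagged = np.abs(d) > 2 * np.sqrt(2 / 20)   # cutoff 2*sqrt(p/n), p = 2 parameters here
```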

E

Error -  In general, the error is the difference between the observed and estimated value of a parameter.

Error in Regression = Error in the prediction for the ith observation (actual Y minus predicted Y)

Errors, Residuals -  In regression analysis, the error is the difference between the observed Y values and the predicted Y values obtained from the regression model.

F

F-test:  An F-test is usually a ratio of two numbers, where each number estimates a variance. An F-test can be used in the test of equality of two population variances. An F-test is also used in analysis of variance (ANOVA), where it tests the hypothesis of equality of means for two or more groups. For instance, in an ANOVA test, the F statistic is usually a ratio of the Mean Square for the effect of interest and Mean Square Error. The F-statistic is very large when MS for the factor is much larger than the MS for error. In such cases, reject the null hypothesis that group means are equal. The p-value helps to determine statistical significance of the F-statistic.  (Vogt, page 117)

The F test statistic can be used in Simple Linear Regression to assess the overall fit of the model.

F = test statistic for ANOVA for Regression = MSR/MSE,
where MSR = Mean Square Regression, MSE = Mean Square Error.
F has dfSSR degrees of freedom for the numerator and dfSSE for the denominator.

The null and alternative hypotheses for simple linear regression for the F-test statistic are

Ho:  b1 = 0;      where b1 is the coefficient for x (i.e., the slope of x)

Ha:  b1 ≠ 0

p-value = the probability that the random variable F > the value of the test statistic.  This value is found by using an F table where F has dfSSR degrees of freedom for the numerator and dfSSE for the denominator.
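A worked sketch of the F-test for simple linear regression as described above (the data values are made up for illustration):

```python
import numpy as np

# Made-up data roughly following y = 2x
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8])
n = len(x)

# Least-squares slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation
sse = np.sum((y - y_hat) ** 2)          # unexplained variation
msr = ssr / 1                           # dfSSR = 1 (one predictor)
mse = sse / (n - 2)                     # dfSSE = n - 2
F = msr / mse                           # compare to an F table with (1, n-2) df
```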

J

K

L

Leverages, Leverage Points -  An extreme value in the independent (explanatory) variable(s), as opposed to an outlier, which is an extreme value in the dependent (response) variable.

The hat matrix is H = X(X'X)^(-1)X', where X is the design matrix. The leverage of the ith observation is the ith diagonal element, hi (also called vii and rii), of H. Note that hi depends only on the predictors; it does not involve the response Y. If hi is large, the ith observation has unusual predictors (X1i, X2i, ..., Xki).  Many people consider hi to be large enough to merit checking if it is more than 2p/n or 3p/n, where p is the number of predictors (including one for the constant).  This observation will have a large influence in determining the regression coefficients.

(Note: Minitab uses a cutoff value of 3p/n or 0.99, whichever is smaller.)  (Minitab, page 2-9.)
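A minimal numpy sketch of the leverage computation just described (the data are made up; the last x-value is deliberately extreme so the 2p/n rule of thumb flags it):

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^-1 X'."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)

x = np.array([1., 2., 3., 4., 5., 20.])    # 20 is an extreme predictor value
X = np.column_stack([np.ones_like(x), x])  # design matrix with a constant column
h = leverages(X)

n, p = X.shape                             # p counts the constant, as in the text
high = h > 2 * p / n                       # common 2p/n rule of thumb
```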

M

Mean Square Error

MSE = Mean Square Errors = Error Mean Square = Residual Mean Square

Mean Square Regression

MSR = MSRegression = Mean Square of Regression

Multiple Correlation Coefficient, R - A measure of the amount of correlation between more than two variables.  As in multiple regression, one variable is the dependent variable and the others are independent variables.  The positive square root of R-squared.  (See R.)

P

Prediction Interval - In regression analysis, a range of values that estimates the value of the dependent variable for given values of one or more independent variables.  Comparing prediction intervals with confidence intervals: prediction intervals estimate a random value, while confidence intervals estimate population parameters.

R

r, Correlation Coefficient, Pearson’s r -  Measures the strength of linear association between two numerical variables.

R, Coefficient of Multiple Correlation -  A measure of the amount of correlation between more than two variables.  As in multiple regression, one variable is the dependent variable and the others are independent variables.  The positive square root of R-squared.

r2, r-squared, Coefficient of Simple Determination -  The percent of the variance in the dependent variable that can be explained by the independent variable.

r2 = SSRegression / SSTotal = (explained variation)/(total variation) = percent of the variation of Y that is explained by the model.

R-squared, Coefficient of Multiple Determination -  The percent of the variance in the dependent variable that can be explained by all of the independent variables taken together.  = 1 – percent of variation that is not explained by the model = the percent of variation that is explained by the model.

For simple linear regression, R2 reduces to r2.

Note:  The coefficient of simple (multiple) determination is the square of the simple (multiple) correlation coefficient.
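A quick numerical check of these identities, using made-up data: r² = SSR/SST, and for simple linear regression it equals the square of Pearson’s r.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([2., 4., 5., 4., 5.])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation
r2 = ssr / sst

pearson_r = np.corrcoef(x, y)[0, 1]     # r2 equals pearson_r squared here
```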

R-Squared Adjusted, Adjusted R-Squared -  A version of R-Squared that has been adjusted for the number of predictors in the model.  R-Squared tends to overestimate the strength of the association, especially when the model has more than one independent variable.

More equivalent formulas for R2 and R2-adjusted are shown below.  From this formulation, we can see the relationship between the two statistics and how R-squared Adjusted “adjusts” for the number of variables in the model:

R² = 1 – SSE/SST

R²-adjusted = 1 – (SSE/dferrors)/(SST/dftotal) = 1 – (1 – R²)(n – 1)/(n – k),

where k=the number of coefficients in the regression equation.  Note, k includes the constant coefficient.  For simple linear regression when you fit the y-intercept, k=2.    If you do not fit the y-intercept (i.e. let the y-intercept be zero) then k=1.

For simple linear regression, when you do not fit the y-intercept, then k=1 and the formula for R-squared Adjusted simplifies to R-squared.
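As a sketch, one common formulation of the adjustment, with k = the number of coefficients including the constant as defined above, is R²-adjusted = 1 – (1 – R²)(n – 1)/(n – k); in code:

```python
def adjusted_r2(r2, n, k):
    """R-squared adjusted = 1 - (1 - R^2)(n - 1)/(n - k); k includes the constant."""
    return 1 - (1 - r2) * (n - 1) / (n - k)

# With k = 1 (no fitted y-intercept), the adjustment vanishes and R2-adjusted = R2
same = adjusted_r2(0.90, 20, 1)     # 0.90
smaller = adjusted_r2(0.90, 20, 3)  # penalized for the extra coefficients
```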

If k=1, then R2-adjusted = R2.

Regression SS (See SSRegression) -  The sum of squares that is explained by the regression equation.  Analogous to the between-groups sum of squares in analysis of variance.

S

Standard Deviation - A statistic that measures the spread of the data: the square root of the average squared distance of the data points from the mean.

s = √[ Σ(x – x̄)² / (n – 1) ]  for a sample

σ = √[ Σ(x – μ)² / N ]  for a population

Standard Error, Standard Error of the Regression, Standard Error of the Mean, Standard Error of the Estimate -  In regression, the standard error of the estimate is the standard deviation of the observed y-values about the predicted y-values.  In general, the standard error is a measure of sampling error.  Standard error refers to error in estimates resulting from random fluctuations in samples.  The standard error is the standard deviation of the sampling distribution of a statistic.  Typically, the smaller the standard error, the better the sample statistic estimates the population parameter.  As N goes up, the standard error goes down.

Formula for the Standard Error of the Estimate:

s = √( SSE / dferrors ),  where dferrors = number of observations – number of independent variables in the model – 1

For simple linear regression: dferrors = n – 1 – 1 = n – 2 when fitting the y-intercept.  (Two degrees of freedom are lost since we are estimating the slope and the y-intercept.)
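A sketch of the standard error of the estimate for simple linear regression (the data are made up):

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

df_errors = n - 2                       # slope and intercept both estimated
se_est = np.sqrt(np.sum(resid ** 2) / df_errors)
```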

Standardized Residuals -
Standardized residuals are of the form (residual) / (square root of the Mean Square Error).  Standardized residuals have variance 1.  A standardized residual larger than 2 is usually considered large.  (Minitab.)

Sum Square Errors

SSE = SSErrors = Sum Square of Errors = Error Sum of Squares = SSResidual = Sum Square of Residuals = Residual Sum of Squares = Σ(Yi – Ŷi)².  An alternative computational formula is SSE = SST – SSR.

Sum Square Regression

SSR = SSRegression = Sum Square of Regression = sum of squares of the differences between the predicted value of Y and the average value of Y.  This tells how far the predicted values are from the average value.

Sum Square Total

SST = SSTotal = Sum Square of Total Variation of Y = sum of squares of the differences between Y and the mean of Y.  SST = SSE + SSR = unexplained variation + explained variation.

Note: the predicted values Ŷ follow a definite pattern, but the residuals are the error and they should be random.

T

V

Variance Inflation Factor (VIF) - A statistic used to measure the possible collinearity of the explanatory variables.  Let X1, X2, ..., Xk be the k predictors. Regress Xj on the remaining k – 1 predictors and let RSQj be the R-squared from this regression. Then the variance inflation factor for Xj is 1/(1 – RSQj). When Xj is highly correlated with the remaining predictors, its variance inflation factor will be very large. When Xj is orthogonal to the remaining predictors, its variance inflation factor will be 1. (Minitab)
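The VIF recipe above can be sketched with plain numpy (the predictor matrix below is made up: two orthogonal columns plus a near-copy of the first, so one VIF stays near 1 while two blow up):

```python
import numpy as np

def vif(X):
    """VIF_j = 1/(1 - RSQ_j), regressing each column on the remaining columns."""
    n, k = X.shape
    out = []
    for j in range(k):
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
        resid = X[:, j] - Z @ beta
        rsq = 1 - (resid @ resid) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(1.0 / (1.0 - rsq))
    return np.array(out)

# Two orthogonal predictors -> VIF = 1; a near-copy of x1 -> very large VIF
x1 = np.array([1., -1., 1., -1., 1., -1., 1., -1.])
x2 = np.array([1., 1., -1., -1., 1., 1., -1., -1.])
rng = np.random.default_rng(1)
x3 = x1 + rng.normal(scale=0.01, size=8)   # nearly collinear with x1

v_ok = vif(np.column_stack([x1, x2]))
v_bad = vif(np.column_stack([x1, x2, x3]))
```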

W

X

Yi = actual value of Y for observation i

Ŷi = predicted or estimated value of Y based on the given X values

Ȳ = average of the original Y-values

Z