Statistics for Health Research

Entering Multidimensional Space: Multiple Regression

Peter T. Donnan
Professor of Epidemiology and Biostatistics

Objectives of session

• Recognise the need for multiple regression
• Understand methods of selecting variables
• Understand strengths and weaknesses of selection methods
• Carry out multiple regression in SPSS and interpret the output

Why do we need multiple regression?

Research is rarely as simple as the effect of one variable on one outcome, especially with observational data. We need to assess many factors simultaneously, giving more realistic models.

Consider the fitted plane y = a + b_1 x_1 + b_2 x_2, with the dependent variable (y) plotted against two explanatory variables (x_1 and x_2).

[Figure: 3-dimensional scatterplot from SPSS of Min LDL in relation to baseline LDL and age]

When to use multiple regression modelling (1)

Assess the relationship between two variables while adjusting or allowing for another variable. Sometimes the second variable is considered a 'nuisance' factor.

Example: physical activity, allowing for age and medications.

When to use multiple regression modelling (2)

In an RCT, whenever there is imbalance between arms of the trial at baseline in characteristics of subjects.

e.g. survival in colorectal cancer on two different randomised therapies, adjusted for age, gender, stage, and co-morbidity

When to use multiple regression modelling (2, continued)

A special case of this is adjusting for the baseline level of the primary outcome in an RCT. The baseline level is added as a factor in the regression model. This will be covered in the Trials part of the course.

When to use multiple regression modelling (3)

With observational data, in order to produce a prognostic equation for future prediction of risk of mortality.

e.g. predicting future risk of CHD used 10-year data from the Framingham cohort

When to use multiple regression modelling (4)

With observational data, in order to adjust for possible confounders.

e.g. survival in colorectal cancer in those with hypertension, adjusted for age, gender, social deprivation and co-morbidity

Definition of Confounding

A confounder is a factor which is related to both the variable of interest (explanatory) and the outcome, but is not an intermediary in a causal pathway.

Example of Confounding

[Diagram: Smoking is related to both Deprivation (the explanatory variable) and Lung Cancer (the outcome), so it confounds the Deprivation-Lung Cancer association]

But it is also worth adjusting for factors related only to the outcome.

[Diagram: Exercise is related to Lung Cancer (the outcome) but not to Deprivation]

It is not worth adjusting for an intermediate factor in a causal pathway.

[Diagram: Exercise → Blood viscosity → Stroke]

In a causal pathway each factor is merely a marker of the others, i.e. they are correlated (collinearity).

SPSS: add both Baseline LDL and Age in the Independent(s) box in Linear Regression.

Output from SPSS linear regression on Age at baseline

Coefficients (dependent variable: Min LDL achieved)

                  B       Std. Error   Beta    t        Sig.   95% CI for B      Tolerance   VIF
(Constant)        2.024   .105                 19.340   .000   (1.819, 2.229)
Age at baseline   -.008   .002         -.121   -4.546   .000   (-.011, -.004)    1.000       1.000

Output from SPSS linear regression on Baseline LDL

Coefficients (dependent variable: Min LDL achieved)

               B      Std. Error   Beta   t        Sig.   95% CI for B
(Constant)     .668   .066                10.091   .000   (.538, .798)
Baseline LDL   .257   .018         .351   13.950   .000   (.221, .293)

Output: Multiple regression

Model Summary (R² now improved to 13%)

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .360   .130       .129                .6753538

Predictors: (Constant), Age at baseline, Baseline LDL

Coefficients (dependent variable: Min LDL achieved)

                  B       Std. Error   Beta    t        Sig.   95% CI for B
(Constant)        1.003   .124                 8.086    .000   (.760, 1.246)
Baseline LDL      .250    .019         .342    13.516   .000   (.214, .286)
Age at baseline   -.005   .002         -.081   -3.187   .001   (-.008, -.002)

Both variables remain significant INDEPENDENTLY of each other.
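For readers without SPSS, here is a minimal sketch of the same three fits in Python with statsmodels; the CSV file and column names are assumed stand-ins for the variables in 'LDL Data.sav', not the dataset's actual names.

```python
# Minimal sketch, assuming 'LDL Data.sav' has been exported to CSV with
# hypothetical column names min_ldl, baseline_ldl and age.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ldl_data.csv")

fit_age = smf.ols("min_ldl ~ age", data=df).fit()           # age alone
fit_ldl = smf.ols("min_ldl ~ baseline_ldl", data=df).fit()  # baseline LDL alone
fit_both = smf.ols("min_ldl ~ baseline_ldl + age", data=df).fit()

print(fit_both.summary())     # coefficients, t, p and 95% CIs, as in SPSS
print(fit_both.rsquared_adj)  # adjusted R-squared (0.129 in the output above)
```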

How do you select which variables to enter the model?

• Usually consider which hypotheses you are testing
• If there is a main 'exposure' variable, enter it first and assess confounders one at a time
• For derivation of a clinical prediction rule (CPR) you want powerful predictors
• Also include clinically important factors, e.g. cholesterol in CHD prediction
• Significance is important, but it is acceptable to have an 'important' variable without statistical significance

How do you decide which variables to enter in the model?

Correlations? With great difficulty!

[Figure: 3-dimensional scatterplot from SPSS of Time from Surgery in relation to Duke's staging and age]

Approaches to model building

1. Let scientific or clinical factors guide selection
2. Use automatic selection algorithms
3. A mixture of the above

1) Let Science or Clinical factors guide selection

Baseline LDL cholesterol is an important factor determining LDL outcome, so enter it first. Next allow for age and gender. Add adherence as important? Add BMI and smoking?

1) Let Science or Clinical factors guide selection

Results in a model of:
1. Baseline LDL
2. Age and gender
3. Adherence
4. BMI and smoking

Is this a 'good' model?

1) Let Science or Clinical factors guide selection: Final Model

[SPSS output: final model. Note three variables entered but not statistically significant]

1) Let Science or Clinical factors guide selection

Is this the 'best' model? Should I leave out the non-significant factors (Model 2)?

Model   Adj R²   F from ANOVA   No. of parameters p
1       0.137    37.48          7
2       0.134    72.021         4

Adjusted R² is lower, F has increased and the number of parameters is smaller in the second model. Is this better?

Kullback-Leibler Information

Kullback and Leibler (1951) quantified the meaning of 'information', related to Fisher's 'sufficient statistics'. Basically we have reality f, and a model g to approximate f; the K-L information is I(f, g).

[Diagram: reality f approximated by model g]

Kullback-Leibler Information

We want to minimise I(f, g) to obtain the best model over other models. I(f, g) is the information lost, or 'distance', between reality and a model, so we need to minimise:

I(f, g) = \int f(x) \log\left( \frac{f(x)}{g(x \mid \theta)} \right) dx
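To make the formula concrete, here is an illustrative sketch with two made-up discrete distributions, for which the integral becomes a sum:

```python
# Illustration only: K-L information for two made-up discrete distributions,
# where f plays the role of 'reality' and g a model approximating it.
import numpy as np

f = np.array([0.5, 0.3, 0.2])   # assumed 'true' distribution
g = np.array([0.4, 0.4, 0.2])   # assumed model approximation

kl = np.sum(f * np.log(f / g))  # discrete analogue of the integral above
print(kl)                       # ~0.025: information lost using g in place of f
```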

Akaike's Information Criterion

It turns out that the function I(f, g) is related to a very simple measure of goodness-of-fit: Akaike's Information Criterion, or AIC.

Selection Criteria

• With a large number of factors the type 1 error is large, so we are likely to end up with a model containing many variables
• Two standard criteria:
  1) Akaike's Information Criterion (AIC)
  2) Schwarz's Bayesian Information Criterion (BIC)
• Both penalise models with a large number of variables (BIC more heavily when the sample size is large)

Akaike's Information Criterion

\mathrm{AIC} = -2 \log(\text{likelihood}) + 2p

• where p = number of parameters, and -2*log likelihood is in the output
• hence AIC penalises models with a large number of variables
• select the model that minimises (-2LL + 2p)
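As a quick check, the same quantity can be computed by hand from a fitted model's log likelihood; this sketch reuses the assumed file and column names from earlier.

```python
# Sketch: AIC computed by hand as -2*logL + 2p and checked against the
# built-in value. File and column names are assumptions, as before.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ldl_data.csv")
fit = smf.ols("min_ldl ~ baseline_ldl + age", data=df).fit()

p = fit.df_model + 1                  # parameters including the intercept
print(-2 * fit.llf + 2 * p, fit.aic)  # the two values should agree
```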

Generalized linear models

• Unfortunately the standard REGRESSION procedure in SPSS does not give these statistics
• Need to use Analyze > Generalized Linear Models...

Generalized linear models (default is linear)

• Add Min LDL achieved as the dependent variable, as in REGRESSION in SPSS
• Next go to predictors...

Generalized linear models: Predictors

WARNING! Make sure you add the predictors in the correct box:
• Categorical variables in the FACTORS box
• Continuous variables in the COVARIATES box

Generalized linear models: Model

• Add all factors and covariates in the model as main effects

Generalized Linear Models: Parameter Estimates

[SPSS output: parameter estimates. Note these are identical to the REGRESSION output]

Generalized Linear Models: Goodness-of-fit

The output gives the log likelihood and AIC = 2835 (AIC ≈ -2 × (-1409.6) + 2 × 7).

A footnote explains that a smaller AIC is 'better'.
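For comparison, a Gaussian generalized linear model in Python reproduces ordinary linear regression and reports the log likelihood and AIC directly; the file and column names are the same assumptions as before.

```python
# Sketch: a Gaussian GLM with identity link is ordinary linear regression,
# and statsmodels reports log likelihood and AIC, much like SPSS GENLIN.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("ldl_data.csv")   # hypothetical export of 'LDL Data.sav'
glm = smf.glm("min_ldl ~ baseline_ldl + age", data=df,
              family=sm.families.Gaussian()).fit()
print(glm.llf, glm.aic)            # log likelihood and AIC
```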

Let Science or Clinical factors guide selection: 'Optimal' model

• The log likelihood is a measure of GOODNESS-OF-FIT
• Seek the 'optimal' model that maximises the log likelihood, or minimises the AIC

Model                                    Log likelihood   p   AIC
1   Full model                           -1409.6          7   2835.6
2   Non-significant variables removed    -1413.6          4   2837.2

Change in AIC is 1.6.

1) Let Science or Clinical factors guide selection

Key points:
1. Results demonstrate a significant association with Baseline LDL, Age and Adherence
2. Difficult choices with Gender, Smoking and BMI
3. AIC only changes by 1.6 when they are removed
4. Generally, changes of 4 or more in AIC are considered important

1) Let Science or Clinical factors guide selection

Key points:
1. Conclude there is little to choose between the models
2. AIC is actually lower for the larger model, and Gender and BMI are considered important factors, so keep the larger model, but you have to justify this
3. Model building is manual, logical, transparent and under your control

2) Use automatic selection procedures

These are based on automatic mechanical algorithms, usually related to statistical significance. Common ones are stepwise, forward and backward elimination. They can be selected in SPSS using 'Method' in the dialogue box.

2) Use automatic selection procedures (e.g. Stepwise)

Select Method = Stepwise.

2) Use automatic selection procedures (e.g. Stepwise)

[SPSS output: 1st step, 2nd step, and the final model]

2) Change in AIC with Stepwise selection

Note: only available from Generalized Linear Models.

Step   Model          Log likelihood   AIC      Change in AIC   No. of parameters p
1      Baseline LDL   -1423.1          2852.2   -               2
2      + Adherence    -1418.0          2844.1   8.1             3
3      + Age          -1413.6          2837.2   6.9             4

2) Advantages and disadvantages of stepwise

Advantages:
• Simple to implement
• Gives a parsimonious model
• Selection is certainly objective

Disadvantages:
• Non-stable selection: stepwise considers many models that are very similar
• The p-value on entry may be smaller once the procedure is finished, so p-values are exaggerated
• Predictions in an external dataset are usually worse for stepwise procedures

2) Automatic procedures: Backward elimination

Backward elimination starts by removing the least significant factor from the full model, and has a few advantages over forward selection:
• The modeller has to consider the 'full' model, and sees results for all factors simultaneously
• Correlated factors can remain in the model (in forward methods they may not even enter)
• Criteria for removal tend to be more lax in backward elimination, so you end up with more parameters

2) Use automatic selection procedures (e.g. Backward)

Select Method = Backward.

2) Backward elimination in SPSS

[SPSS output: 1st step, Gender removed; 2nd step, BMI removed; final model]
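As a rough illustration of what the Backward method automates, here is a sketch of backward elimination by p-value in Python; the 0.10 removal threshold and the column names are assumptions, and SPSS's exact default criteria may differ.

```python
# Sketch of backward elimination by p-value. Threshold and column names are
# assumptions; binary variables are assumed to be coded numerically (0/1).
import pandas as pd
import statsmodels.formula.api as smf

def backward_eliminate(df, outcome, predictors, p_remove=0.10):
    preds = list(predictors)
    while preds:
        fit = smf.ols(f"{outcome} ~ " + " + ".join(preds), data=df).fit()
        pvals = fit.pvalues.drop("Intercept")  # p-values of current terms
        worst = pvals.idxmax()                 # least significant term
        if pvals[worst] < p_remove:            # everything significant: stop
            return fit
        preds.remove(worst)                    # drop it and refit
    return None                                # nothing survived

df = pd.read_csv("ldl_data.csv")               # hypothetical export, as before
final = backward_eliminate(df, "min_ldl",
                           ["baseline_ldl", "age", "gender", "bmi",
                            "smoking", "adherence"])
print(final.model.formula, final.aic)
```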

Summary of automatic selection

• Automatic selection may not give the 'optimal' model (it may leave out important factors)
• Different methods may give different results (forward vs. backward elimination)
• Backward elimination is preferred, as it is less stringent
• Too easily fitted in SPSS!
• Model assessment still requires some thought

3) A mixture of automatic procedures and self selection

• Use automatic procedures as a guide
• Think about what factors are important
• Add 'important' factors
• Do not blindly follow statistical significance
• Consider AIC

Summary of model selection

• Selection of factors for multiple linear regression models requires some judgement
• Automatic procedures are available, but treat the results with caution
• They are easily fitted in SPSS
• Check AIC or the log likelihood for fit

Summary

• Multiple regression models are the most used analytical tool in quantitative research
• They are easily fitted in SPSS
• Model assessment requires some thought
• Parsimony is better: Occam's Razor

Remember Occam's Razor

'Entia non sunt multiplicanda praeter necessitatem'
'Entities must not be multiplied beyond necessity'

William of Ockham (1288-1347), 14th-century friar and logician

Summary

After fitting any model, check the assumptions:
• Functional form: linearity or not
• Check residuals for normality
• Check residuals for outliers
• All accomplished within SPSS

See publications for further info:
Donnelly LA, Palmer CNA, Whitley AL, Lang C, Doney ASF, Morris AD, Donnan PT. Apolipoprotein E genotypes are associated with lipid lowering response to statin treatment in diabetes: a GoDARTS study. Pharmacogenetics and Genomics 2008; 18: 279-87.
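The same residual checks can be sketched in Python; as throughout, the file and column names are assumptions rather than the dataset's actual names.

```python
# Sketch of the residual checks above: normality (Shapiro-Wilk, Q-Q plot)
# and outliers (standardised residuals). File/column names are assumptions.
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.formula.api as smf

df = pd.read_csv("ldl_data.csv")
fit = smf.ols("min_ldl ~ baseline_ldl + age", data=df).fit()

resid = fit.resid
print(stats.shapiro(resid))        # Shapiro-Wilk test of normality
stats.probplot(resid, plot=plt)    # Q-Q plot: points on the line = normal
plt.show()

std_resid = resid / resid.std()    # crude standardised residuals
print((std_resid.abs() > 3).sum(), "potential outliers beyond +/-3 SD")
```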

Practical on Multiple Regression

Read in 'LDL Data.sav'.

1) Try fitting a multiple regression model for Min LDL obtained, using forward and backward elimination. Are the results the same? Add factors other than those considered in the presentation, such as BMI and smoking. Remember the goal is to assess the association of APOE with LDL response.

2) Try fitting multiple regression models for Min Chol achieved. Is the model similar to that found for Min LDL?
