Linear Models for Continuous Data
FIGURE 2.1: Scattergrams for the Program Effort Data

[...] them into categories and treat them as discrete factors.

2.1.2 The Random Structure

The first issue we must deal with is that the response will vary even among units with identical values of the covariates. To model this fact we will treat each response $y_i$ as a realization of a random variable $Y_i$. Conceptually, we view the observed response as only one out of many possible outcomes that we could have observed under identical circumstances, and we describe the possible values in terms of a probability distribution.

For the models in this chapter we will assume that the random variable $Y_i$ has a normal distribution with mean $\mu_i$ and variance $\sigma^2$, in symbols

\[ Y_i \sim N(\mu_i, \sigma^2). \]

The mean $\mu_i$ represents the expected outcome, and the variance $\sigma^2$ measures the extent to which an actual observation may deviate from expectation.

Note that the expected value may vary from unit to unit, but the variance is the same for all. In terms of our example, we may expect a larger fertility decline in Cuba than in Haiti, but we don't anticipate that our expectation will be closer to the truth for one country than for the other.

The normal or Gaussian distribution (after the mathematician Carl Gauss) has probability density function

\[ f(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2} \frac{(y_i - \mu_i)^2}{\sigma^2} \right\}. \tag{2.1} \]

FIGURE 2.2: The Standard Normal Density

The standard density with mean zero and standard deviation one is shown in Figure 2.2.

Most of the probability mass in the normal distribution (in fact, 99.7%) lies within three standard deviations of the mean. In terms of our example, we would be very surprised if fertility in a country declined $3\sigma$ more than expected. Of course, we don't know yet what to expect, nor what $\sigma$ is.

So far we have considered the distribution of one observation. At this point we add the important assumption that the observations are mutually independent. This assumption allows us to obtain the joint distribution of the data as a simple product of the individual probability distributions, and it underlies the construction of the likelihood function that will be used for estimation and testing. When the observations are independent they are also uncorrelated and their covariance is zero, so $\mathrm{cov}(Y_i, Y_j) = 0$ for $i \ne j$.

It will be convenient to collect the $n$ responses in a column vector $y$, which we view as a realization of a random vector $Y$ with mean $E(Y) = \mu$ and variance-covariance matrix $\mathrm{var}(Y) = \sigma^2 I$, where $I$ is the identity matrix. The diagonal elements of $\mathrm{var}(Y)$ are all $\sigma^2$ and the off-diagonal elements are all zero, so the $n$ observations are uncorrelated and have the same variance. Under the assumption of normality, $Y$ has a multivariate normal distribution

\[ Y \sim N_n(\mu, \sigma^2 I) \tag{2.2} \]

with the stated mean and variance.

2.1.3 The Systematic Structure

Let us now turn our attention to the systematic part of the model. Suppose that we have data on $p$ predictors $x_1, \ldots, x_p$ which take values $x_{i1}, \ldots, x_{ip}$ for the $i$-th unit. We will assume that the expected response depends on these predictors. Specifically, we will assume that $\mu_i$ is a linear function of the predictors

\[ \mu_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} \]

for some unknown coefficients $\beta_1, \beta_2, \ldots, \beta_p$.
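To make the two parts of the model concrete, here is a minimal simulation sketch in Python (assuming numpy; the sample size, predictor values and coefficients below are made up for illustration, not taken from the Program Effort data): the systematic structure fixes each mean $\mu_i$ as a linear function of the predictors, and the random structure draws $Y_i$ around that mean with constant variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and parameter values, for illustration only.
n, p = 20, 2
X = rng.uniform(0, 100, size=(n, p))  # predictor values x_{ij}
beta = np.array([0.2, -0.1])          # "true" coefficients beta_1, beta_2
sigma = 5.0                           # common standard deviation

mu = X @ beta              # systematic part: mu_i = beta_1*x_{i1} + beta_2*x_{i2}
y = rng.normal(mu, sigma)  # random part: each Y_i ~ N(mu_i, sigma^2)
```

The same arrays `X` and `y` are reused in the sketches that follow.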
The coefficients $\beta_j$ are called regression coefficients and we will devote considerable attention to their interpretation.

This equation may be written more compactly using matrix notation as

\[ \mu_i = x_i' \beta, \tag{2.3} \]

where $x_i'$ is a row vector with the values of the $p$ predictors for the $i$-th unit and $\beta$ is a column vector containing the $p$ regression coefficients. Even more compactly, we may form a column vector $\mu$ with all the expected responses and then write

\[ \mu = X\beta, \tag{2.4} \]

where $X$ is an $n \times p$ matrix containing the values of the $p$ predictors for the $n$ units. The matrix $X$ is usually called the model or design matrix. Matrix notation is not only more compact but, once you get used to it, it is also easier to read than formulas with lots of subscripts.

The expression $X\beta$ is called the linear predictor, and includes many special cases of interest. Later in this chapter we will show how it includes simple and multiple linear regression models, analysis of variance models and analysis of covariance models.

The simplest possible linear model assumes that every unit has the same expected value, so that $\mu_i = \mu$ for all $i$. This model is often called the null model, because it postulates no systematic differences between the units. The null model can be obtained as a special case of Equation 2.3 by setting $p = 1$ and $x_i = 1$ for all $i$. In terms of our example, this model would expect fertility to decline by the same amount in all countries, and would attribute all observed differences between countries to random variation.

At the other extreme we have a model where every unit has its own expected value $\mu_i$. This model is called the saturated model because it has as many parameters in the linear predictor (or linear parameters, for short) as it has observations. The saturated model can be obtained as a special case of Equation 2.3 by setting $p = n$ and letting $x_{ij}$ take the value one when $j = i$ and zero otherwise. In this model the $x$'s are indicator variables for the different units, and there is no random variation left; all observed differences between countries are attributed to their own idiosyncrasies.

Obviously the null and saturated models are not very useful by themselves. Most statistical models of interest lie somewhere in between, and most of this chapter will be devoted to an exploration of the middle ground. Our aim is to capture systematic sources of variation in the linear predictor, and let the error term account for unstructured or random variation.

2.2 Estimation of the Parameters

Consider for now a rather abstract model where $\mu_i = x_i'\beta$ for some predictors $x_i$. How do we estimate the parameters $\beta$ and $\sigma^2$?

2.2.1 Estimation of $\beta$

The likelihood principle instructs us to pick the values of the parameters that maximize the likelihood, or equivalently, the logarithm of the likelihood function. If the observations are independent, then the likelihood function is a product of normal densities of the form given in Equation 2.1. Taking logarithms we obtain the normal log-likelihood

\[ \log L(\beta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2} \sum (y_i - \mu_i)^2 / \sigma^2, \tag{2.5} \]

where $\mu_i = x_i'\beta$.
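Equation 2.5 translates directly into code; the sketch below (reusing numpy and the simulated `X` and `y` above) evaluates the normal log-likelihood at any trial values of $\beta$ and $\sigma^2$.

```python
def normal_loglik(beta, sigma2, y, X):
    """Normal log-likelihood of Equation 2.5."""
    n = len(y)
    mu = X @ beta                # expected responses under the trial beta
    rss = np.sum((y - mu) ** 2)  # sum of squared deviations
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * rss / sigma2
```

Note that, for fixed `sigma2`, the only term that depends on `beta` is the sum of squared deviations; this is exactly the observation developed next.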
The most important thing to notice about Equation 2.5 is that maximizing the log-likelihood with respect to the linear parameters $\beta$ for a fixed value of $\sigma^2$ is exactly equivalent to minimizing the sum of squared differences between observed and expected values, or residual sum of squares

\[ \mathrm{RSS}(\beta) = \sum (y_i - \mu_i)^2 = (y - X\beta)'(y - X\beta). \tag{2.6} \]

In other words, we need to pick values of $\beta$ that make the fitted values $\mu_i = x_i'\beta$ as close as possible to the observed values $y_i$.

Taking derivatives of the residual sum of squares with respect to $\beta$ and setting the derivative equal to zero leads to the so-called normal equations for the maximum-likelihood estimator $\hat\beta$:

\[ X'X \hat\beta = X'y. \]

If the model matrix $X$ is of full column rank, so that no column is an exact linear combination of the others, then the matrix of cross-products $X'X$ is of full rank and can be inverted to solve the normal equations. This gives an explicit formula for the ordinary least squares (OLS) or maximum likelihood estimator of the linear parameters:

\[ \hat\beta = (X'X)^{-1} X'y. \tag{2.7} \]

If $X$ is not of full column rank one can use generalized inverses, but interpretation of the results is much more straightforward if one simply eliminates redundant columns. Most current statistical packages are smart enough to detect and omit redundancies automatically.

There are several numerical methods for solving the normal equations, including methods that operate on $X'X$, such as Gaussian elimination or the Cholesky decomposition, and methods that attempt to simplify the calculations by factoring the model matrix $X$, including Householder reflections, Givens rotations and the Gram-Schmidt orthogonalization. We will not discuss these methods here, assuming that you will trust the calculations to a reliable statistical package. For further details see McCullagh and Nelder (1989, Section 3.8) and the references therein.

The foregoing results were obtained by maximizing the log-likelihood with respect to $\beta$ for a fixed value of $\sigma^2$. The result obtained in Equation 2.7 does not depend on $\sigma^2$, and is therefore a global maximum.

For the null model $X$ is a vector of ones, $X'X = n$ and $X'y = \sum y_i$ are scalars, and $\hat\beta = \bar y$, the sample mean. For our sample data $\bar y = 14.3$. Thus, the calculation of a sample mean can be viewed as the simplest case of maximum likelihood estimation in a linear model.

2.2.2 Properties of the Estimator

The least squares estimator $\hat\beta$ of Equation 2.7 has several interesting properties. If the model is correct, in the (weak) sense that the expected value of the response $Y_i$ given the predictors $x_i$ is indeed $x_i'\beta$, then the OLS estimator is unbiased; its expected value equals the true parameter value:

\[ E(\hat\beta) = \beta. \tag{2.8} \]

It can also be shown that if the observations are uncorrelated and have constant variance $\sigma^2$, then the variance-covariance matrix of the OLS estimator is

\[ \mathrm{var}(\hat\beta) = (X'X)^{-1} \sigma^2. \tag{2.9} \]

This result follows immediately from the fact that $\hat\beta$ is a linear function of the data $y$ (see Equation 2.7), and the assumption that the variance-covariance matrix of the data is $\mathrm{var}(Y) = \sigma^2 I$, where $I$ is the identity matrix.

A further property of the estimator is that it has minimum variance among all unbiased estimators that are linear functions of the data, i.e. it is the best linear unbiased estimator (BLUE).
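As a computational sketch of Equation 2.7 (again assuming numpy, and continuing the simulated example): in practice one solves the normal equations rather than inverting $X'X$, or better still uses a least-squares routine based on the orthogonal factorizations mentioned above.

```python
# Solve the normal equations X'X beta = X'y (Equation 2.7).
bhat = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferable alternative: a least-squares solver that
# factors X directly instead of forming the cross-products X'X.
bhat_qr, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(bhat, bhat_qr)  # both give the same OLS estimate
```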
No other unbiased estimator that is linear in the data can have lower variance for a fixed sample size, and in this sense we say that OLS estimators are fully efficient.

Finally, it can be shown that the sampling distribution of the OLS estimator $\hat\beta$ in large samples is approximately multivariate normal with the mean and variance given above, i.e.

\[ \hat\beta \sim N_p(\beta, (X'X)^{-1}\sigma^2). \]

Applying these results to the null model we see that the sample mean $\bar y$ is an unbiased estimator of $\mu$, has variance $\sigma^2/n$, and is approximately normally distributed in large samples.

All of these results depend only on second-order assumptions concerning the mean, variance and covariance of the observations, namely the assumption that $E(Y) = X\beta$ and $\mathrm{var}(Y) = \sigma^2 I$.

Of course, $\hat\beta$ is also a maximum likelihood estimator under the assumption of normality of the observations. If $Y \sim N_n(X\beta, \sigma^2 I)$ then the sampling distribution of $\hat\beta$ is exactly multivariate normal with the indicated mean and variance.

The significance of these results cannot be overstated: the assumption of normality of the observations is required only for inference in small samples. The really important assumption is that the observations are uncorrelated and have constant variance, and this is sufficient for inference in large samples.

2.2.3 Estimation of $\sigma^2$

Substituting the OLS estimator of $\beta$ into the log-likelihood in Equation 2.5 gives a profile likelihood for $\sigma^2$:

\[ \log L(\sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2} \mathrm{RSS}(\hat\beta)/\sigma^2. \]

Differentiating this expression with respect to $\sigma^2$ (not $\sigma$) and setting the derivative to zero leads to the maximum likelihood estimator

\[ \hat\sigma^2 = \mathrm{RSS}(\hat\beta)/n. \]

This estimator happens to be biased, but the bias is easily corrected by dividing by $n - p$ instead of $n$. The situation is exactly analogous to the use of $n - 1$ instead of $n$ when estimating a variance. In fact, the estimator of $\sigma^2$ for the null model is the sample variance, since $\hat\beta = \bar y$ and the residual sum of squares is $\mathrm{RSS} = \sum (y_i - \bar y)^2$.

Under the assumption of normality, the ratio $\mathrm{RSS}/\sigma^2$ of the residual sum of squares to the true parameter value has a chi-squared distribution with $n - p$ degrees of freedom and is independent of the estimator of the linear parameters. You might be interested to know that using the chi-squared distribution as a likelihood to estimate $\sigma^2$ (instead of the normal likelihood to estimate both $\beta$ and $\sigma^2$) leads to the unbiased estimator

\[ \hat\sigma^2 = \mathrm{RSS}(\hat\beta)/(n - p). \]

For the sample data the RSS for the null model is 2650.2 on 19 d.f., and therefore $\hat\sigma = 11.81$, the sample standard deviation.

2.3 Tests of Hypotheses

Consider testing hypotheses about the regression coefficients $\beta$. Sometimes we will be interested in testing the significance of a single coefficient, say $\beta_j$, but on other occasions we will want to test the joint significance of several components of $\beta$. In the next few sections we consider tests based on the sampling distribution of the maximum likelihood estimator and likelihood ratio tests.

2.3.1 Wald Tests

Consider first testing the significance of one particular coefficient, say

\[ H_0: \beta_j = 0. \]

The m.l.e. $\hat\beta_j$ has a distribution with mean 0 (under $H_0$) and variance given by the $j$-th diagonal element of the matrix in Equation 2.9. Thus, we can base our test on the ratio

\[ t = \frac{\hat\beta_j}{\sqrt{\mathrm{var}(\hat\beta_j)}}. \tag{2.10} \]

Note from Equation 2.9 that $\mathrm{var}(\hat\beta_j)$ depends on $\sigma^2$, which is usually unknown. In practice we replace $\sigma^2$ by the unbiased estimate based on the residual sum of squares.

Under the assumption of normality of the data, the ratio of the coefficient to its standard error has, under $H_0$, a Student's t distribution with $n - p$ degrees of freedom when $\sigma^2$ is estimated, and a standard normal distribution if $\sigma^2$ is known.
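Putting Equations 2.9 and 2.10 together, here is a sketch of how the unbiased variance estimate, the standard errors and the t ratios would be computed (scipy is assumed for the t tail probabilities; `bhat` comes from the OLS sketch above).

```python
from scipy import stats

n, p = X.shape
resid = y - X @ bhat
rss = resid @ resid                      # residual sum of squares, Equation 2.6
s2 = rss / (n - p)                       # unbiased estimate of sigma^2

cov_bhat = s2 * np.linalg.inv(X.T @ X)   # estimated var(beta-hat), Equation 2.9
se = np.sqrt(np.diag(cov_bhat))          # standard errors
t_ratios = bhat / se                     # Equation 2.10
p_values = 2 * stats.t.sf(np.abs(t_ratios), df=n - p)  # two-sided t test
```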
The Student's t result provides a basis for exact inference in samples of any size.

Under the weaker second-order assumptions concerning the means, variances and covariances of the observations, the ratio has, in large samples, an approximately standard normal distribution. This result provides a basis for approximate inference in large samples.

Many analysts treat the ratio as a Student's t statistic regardless of the sample size. If normality is suspect one should not conduct the test unless the sample is large, in which case it really makes no difference which distribution is used. If the sample size is moderate, using the t test provides a more conservative procedure. (The Student's t distribution converges to a standard normal as the degrees of freedom increase to $\infty$. For example, the 95% two-tailed critical value is 2.09 for 20 d.f. and 1.98 for 100 d.f., compared to the normal critical value of 1.96.)

The t test can also be used to construct a confidence interval for a coefficient. Specifically, we can state with $100(1-\alpha)$% confidence that $\beta_j$ is between the bounds

\[ \hat\beta_j \pm t_{1-\alpha/2,\, n-p} \sqrt{\mathrm{var}(\hat\beta_j)}, \tag{2.11} \]

where $t_{1-\alpha/2,\, n-p}$ is the two-sided critical value of Student's t distribution with $n - p$ d.f. for a test of size $\alpha$.

The Wald test can also be used to test the joint significance of several coefficients. Let us partition the vector of coefficients into two components, say $\beta' = (\beta_1', \beta_2')$ with $p_1$ and $p_2$ elements, respectively, and consider the hypothesis

\[ H_0: \beta_2 = 0. \]

In this case the Wald statistic is given by the quadratic form

\[ W = \hat\beta_2' \, \mathrm{var}^{-1}(\hat\beta_2) \, \hat\beta_2, \tag{2.12} \]

where $\hat\beta_2$ is the m.l.e. of $\beta_2$ and $\mathrm{var}(\hat\beta_2)$ is its variance-covariance matrix. Note that the variance depends on $\sigma^2$, which is usually unknown; in practice we substitute the estimate based on the residual sum of squares.

In the case of a single coefficient $p_2 = 1$ and this formula reduces to the square of the t statistic in Equation 2.10.

Asymptotic theory tells us that under $H_0$ the large-sample distribution of the m.l.e. is multivariate normal with mean vector 0 and variance-covariance matrix $\mathrm{var}(\hat\beta_2)$. Consequently, the large-sample distribution of the quadratic form $W$ is chi-squared with $p_2$ degrees of freedom. This result holds whether $\sigma^2$ is known or estimated.

Under the assumption of normality we have a stronger result: the distribution of $W$ is exactly chi-squared with $p_2$ degrees of freedom if $\sigma^2$ is known.
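Finally, a sketch of the joint Wald test of Equation 2.12, continuing the example above; `idx` is a hypothetical index set picking out the components $\beta_2$ under test, and since $\sigma^2$ is estimated here, the chi-squared reference is a large-sample approximation.

```python
idx = [1]                                 # hypothetical subset of coefficients
b2 = bhat[idx]
V2 = cov_bhat[np.ix_(idx, idx)]           # var(beta_2-hat), sigma^2 estimated

W = b2 @ np.linalg.solve(V2, b2)          # quadratic form of Equation 2.12
p_value = stats.chi2.sf(W, df=len(idx))   # chi-squared with p_2 d.f.
```

For a single coefficient this $W$ is just the square of the corresponding t ratio computed earlier.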