Regression and time series model selection in small samples

Abstract: A bias correction to the Akaike information criterion, AIC, is derived for regression and autoregressive time series models.

The book supplements the classic Box-Jenkins-Tiao model-building strategy with recent auxiliary tests for transformation, differencing, and model selection.

Not only does the text discuss new developments, including the prospects for widespread adoption of Bayesian hypothesis testing. A thorough review of the most current regression methods in time series analysis: regression methods have been an integral part of time series analysis for over a century. Recently, new developments have made major strides in such areas as non-continuous data where a linear model is not appropriate. This book introduces the reader to newer developments and more diverse regression models and methods for time series analysis.

Accessible to anyone who is familiar with the basic modern concepts of statistical inference, Regression Models for Time Series Analysis provides. Statisticians and applied scientists must often select a model to fit empirical data. This book discusses the philosophy and strategy of selecting such a model using the information theory approach pioneered by Hirotugu Akaike.

This approach focuses critical attention on a priori modeling and the selection of a good approximating model that best represents the inference supported by the data. The book includes practical applications in biology and environmental science. With its broad coverage of methodology, this comprehensive book is a useful learning and reference tool for those in applied sciences where analysis and research of time series is useful.

Its plentiful examples show the operational details and purpose of a variety of univariate and multivariate time series methods. Numerous figures, tables and real-life time series data sets illustrate the models and methods useful for analyzing, modeling, and forecasting data collected sequentially in time. The text also offers a balanced. New statistical methods and future directions of research in time series A Course in Time Series Analysis demonstrates how to build time series models for univariate and multivariate time series data.

It brings together material previously available only in the professional literature and presents a unified view of the most advanced procedures available for time series model building. The selection of an appropriate model from a potentially large class of candidate models is an issue that is central to regression, time series modeling, and generalized linear models.

The variety of model selection methods in use not only demonstrates a wide range of statistical techniques, but it also illustrates the creativity statisticians have employed to approach various problems: there are parametric procedures, data resampling and bootstrap procedures, and a full complement of nonparametric procedures as well. The object of this book is to connect many different aspects of the growing model selection field by examining the different lines of reasoning that have motivated the derivation of both classical and modern criteria, and then to examine the performance of these criteria to see how well it matches the intent of their creators.

In this way we hope to bridge theory and practicality with a book that can serve both as a guide to the researcher in techniques for the application of these criteria, and also as a resource for the practicing statistician for matching appropriate selection criteria to a given problem or data set.

We begin to understand the different approaches that inspired the many criteria considered in this book by deriving some of the most commonly used selection criteria. These criteria are themselves statistics, and have their own moments and distributions. An evaluation of the properties of these moments leads us to suggest a new model selection criterion diagnostic, the signal-to-noise ratio.

The signal-to-noise ratio and other properties such as expectations, mean and variance for differences between two models, and probabilities of selecting one model over another, can be used not only to evaluate individual criterion performance, but also to suggest modifications to improve that performance. We determine relative performance by comparing criteria against each other under a wide variety of simulated conditions.

The simulation studies in this book, some of the most detailed in the literature, are a useful tool for narrowing the field of selection criteria that are applicable to a given practical scenario. We cover parametric, nonparametric, semiparametric, and wavelet regression models as well as univariate and multivariate response structures. We discuss bootstrap, cross validation, and robust methods. While we focus on Gaussian random errors, we also consider quasi-likelihood and location-scale distributions.

Overall, this book collects and relates a broad range of insightful work in the field of model selection, and we hope that a diverse readership will find it accessible.

We wish to thank our families for their support, without which this book would not have been possible. We also thank Michelle Pallas for her careful review and constructive comments, and Rhonda Boughtin for proofreading.

The manuscript was typeset and the graphs were produced using the PostScript features of the statistical software used.

[List of Tables omitted: the front matter lists tables of simulation results, probabilities of overfitting, counts and observed efficiencies, and K-L and L2 observed efficiency ranks for the univariate regression, autoregressive, multivariate regression, VAR, and stepwise regression studies.]

For example, model selection techniques can be applied to areas such as histogram construction (see Linhart and Zucchini), to determine the number of factors in factor analysis, and to nonparametric problems such as curve smoothing and smoothing bandwidth selection.

In fact, model selection criteria can be applied to any situation where one tries to balance variability with complexity. What defines a good model? A good model certainly fits the data set under investigation well. Of course, the more variables added to the model, the better the apparent fit. One of the goals of model selection is to balance the increase in fit against the increase in model complexity. Perhaps a better defining quality of a good model is its performance on future data sets collected from the same process.

A model that fits well on one of the data sets representing the process should fit well on any other data set. More importantly, a model that is too complicated but fits the current data set well may fit subsequent data sets poorly. A model that is too simple may fit none of the data sets well.

How to select a model? Once a probabilistic model has been proposed for an experiment, data can be collected, leading to a set of competing candidate models. Model selection criteria are often compared using results from simulation studies. However, assessing subtle differences between performance results is a daunting task: no single model selection criterion will always be better than another; certain criteria perform best for specific model types. In this book we use many different models to compare performance of the criteria, sometimes narrowly focusing on only a few differences between model types and sometimes varying them very widely.

Often, a count of the times that a selection criterion identifies the correct model is a useful measure of model selection performance.

However, the more variety in models, the more unreliable counts can become, as we will see in some simulations throughout the book. When the true model belongs to the set of candidate models, our measure of performance is the distance between the selected model and the true model. In any set of candidate models, one of the candidates will be closest to the true model.

We term the ratio that compares the distance between the closest candidate model and the selected model the observed efficiency, which we will discuss in more detail below. We will see that observed efficiency is a much more flexible measure of performance than comparisons of counts. Historical Review. Much of past model selection research has been concerned with univariate or multiple regression models.

It is known that R2 always increases whenever a variable is added to the model, and therefore it will always recommend additional complexity without regard to relative contribution to model fit. The latter is currently one of the most commonly used model selection criteria for regression. AIC is probably the most commonly used model selection criterion for time series data.

AICc has shown itself to be one of the best model selection criteria in an increasingly crowded field. The notion of asymptotic efficiency appeared in the literature (Shibata) as a paradigm for selecting the most appropriate model, and SIC, HQ, and GM became associated with the notion of consistency.

We briefly describe these two philosophies of model selection. Efficient Criteria. A common assumption in both regression and time series is that the generating or true model is of infinite dimension, or that the set of candidate models does not contain the true model. The goal is to select one model that best approximates the true model from a set of finite-dimensional candidate models.

The candidate model that is closest to the true model is assumed to be the appropriate choice. In large samples, a model selection criterion that chooses the model with minimum mean squared error is said to be asymptotically efficient (Shibata). Researchers who believe that the system they study is infinitely complicated, or that there is no way to measure all the important variables, choose models based on efficiency.

AIC is perhaps the most popular basis for correction. Sometimes the predictive ability of a candidate model is its most important attribute.

Both PRESS and FPE are efficient, and while we do not study predictive ability as a way to evaluate performance except with respect to bootstrapping and cross-validation methods, it is worth noting that prediction and asymptotic efficiency are related (Shibata). Consistent Criteria. Many researchers assume that the true model is of finite dimension, and that it is included in the set of candidate models.

Under this assumption the goal of model selection is to correctly choose the true model from the list of candidate models. A model selection criterion that identifies the correct model asymptotically with probability one is said to be consistent. Here the researcher believes that all variables can be measured, and furthermore, that enough is known about the physical system being studied to write the list of all important variables.

These are strong assumptions to many statisticians, but they may hold in fields like physics, where there are large bodies of theory to justify assuming the existence of a true model that belongs to the set of candidate models.

Many of the classic consistent selection criteria are derived from asymptotic arguments. Less work has been focused on finding improvements to consistent criteria than efficient criteria, due in part to the fact that the consistent criteria do not estimate some distance function or discrepancy.

Which is better, efficiency or consistency? There is little agreement. To make matters more confusing, both consistency and efficiency are asymptotic properties. In small samples, the criteria can behave much differently. Because of the practical limitations on gathering and using data, small-sample performance is often more important than asymptotic properties. This issue is discussed in Chapter 2 using the signal-to-noise diagnostic. Overview. Distributions. It is important to remember that all model selection criteria are themselves random variables with their own distributions.

The moments of many of the classical selection criteria have been investigated in other papers, as have their probabilities of selecting a true model, assuming that it is one of the candidate models (Nishii; Akaike). We derive moments and probabilities for the primary criteria discussed in this book, and relate them to performance via the concept of the signal-to-noise ratio.

Differences between models are also investigated. When evaluating the relative merits of two models, the value of the selection criterion for each is compared and some decision is made. Such differences also have distributions that can be investigated, and probabilities of selecting one model over another are based on the distribution of the difference.

We derive moments for these differences as well. Examination of the moments can lead to insights into the behavior of model selection criteria. These moments are used to derive the signal-to-noise ratio. Two somewhat uncommon distributions are reviewed, the log-χ² and log-Beta distributions. These two distributions are important to the derivations of many of the classical model selection criteria, and detailed information about them can be found in Appendix 2A to Chapter 2.

They can be described as follows: if X ~ χ²_m, then log X ~ log-χ²_m; log-Beta is related to the usual Beta distribution in the same way. While the exact moments can be computed for these distributions, we will derive some useful approximations that will allow us to more easily compute small-sample signal-to-noise ratios. Multivariate model selection criteria often make use of the generalized variance (Anderson).
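For instance, the exact moments of the log-χ² distribution involve the digamma and trigamma functions, and simple large-m approximations are easy to check numerically. A minimal sketch (using SciPy's special functions; the approximations shown here are illustrative and not the book's own expansions):

```python
import numpy as np
from scipy.special import digamma, polygamma

def log_chi2_moments(m):
    """Exact mean and variance of log X for X ~ chi-square with m df.

    Uses E[log X] = psi(m/2) + log 2 and Var[log X] = psi'(m/2), which follow
    from the Gamma(m/2, scale=2) representation of the chi-square distribution.
    """
    mean = digamma(m / 2.0) + np.log(2.0)
    var = polygamma(1, m / 2.0)          # trigamma function
    return mean, var

def log_chi2_moments_approx(m):
    """Large-m approximations: psi(x) ~ log x - 1/(2x), psi'(x) ~ 1/x + 1/(2x^2)."""
    mean = np.log(m) - 1.0 / m
    var = 2.0 / m + 2.0 / m**2
    return mean, var

for m in (5, 20, 100):
    print(m, log_chi2_moments(m), log_chi2_moments_approx(m))
```

Even for moderate m, the approximate moments track the exact values closely, which is what makes closed-form small-sample signal-to-noise calculations tractable.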

In regression, the variance has either a central or noncentral Wishart distribution. Many of the classic multivariate selection criteria have moments involving the log-determinant Wishart distribution, and therefore exact and closed-form approximations are developed for the log-determinant Wishart.

Moments for the log-U distribution are developed so that signal-to-noise ratios can be formulated for model selection criteria in the multivariate case. Model Notation Regression as well as time series autoregressive models are discussed. Since these models necessarily have different structures, different notation is used. We use k to represent the model order when the model includes the intercept.

For regression cases, all models include the intercept. Our time series models do not include an intercept or constant term, and the order of the model will be equal to the number of variables, or p , and the true autoregressive model is denoted by p ,. Discrepancy a n d Distance Measures How to measure model selection performance? If the true model belongs to the set of candidate models and consistency holds, then a natural way t o measure performance is to compare the probabilities of selecting the correct model for each criterion considered.

For efficiency, where the true model may not belong to the set of candidate models, selecting the closest approximation is the goal. For this some sort of distance measure is required. A distance function, or metric, d, is a real-valued function with two arguments, u and v, which may be vectors or scalars. By definition, a model with a better fit must have a smaller distance than a model with a poor fit.

We do not need a distance function for model selection; any function satisfying Property 1 will suffice. Such a function is often referred to as a discrepancy, a term dating back to Haldane. Other authors have continued to use the term to describe the distance between likelihoods for a variety of problems. Certainly, the set of functions satisfying Property 1 yields a large class of potential discrepancy functions, and several important ones are given in Linhart and Zucchini.

The three we will use in this book are listed below. Let M_A denote the candidate approximating model with density f_A, and let Δ denote the discrepancy. The Kullback-Leibler discrepancy (Kullback and Leibler), also called the Kullback-Leibler information number, or K-L, is based on the likelihood ratio. The Kullback-Leibler discrepancy applies to nearly all parametric models. K-L is a real-valued function for univariate regression as well as multivariate regression.

As such, K-L is perhaps the most important discrepancy used in model selection. In general, The L2 norm can be used as a basis for measuring distance as well. L2 is a distance function and is easy to apply to univariate models. An advantage of Lz is that it depends only on the means of the two distributions and not the actual densities. This means that Lz can be applied when errors are not normally distributed.

However, a disadvantage is that L2 is a matrix in certain multivariate models. While there are many types of discrepancy functions on which to base model choices, some are more easily applied and computed than others. The relative ease with which K-L and L2 can be adapted to a variety of situations led us to choose them to measure model selection performance.
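As a concrete illustration, consider the simple case where both the true density f and the candidate density f_A are univariate normals. A sketch of the two discrepancies in that case (standard expressions, not quoted from the book, with an L2-type measure based on the means only):

```latex
% f = N(\mu, \sigma^2) is the true density, f_A = N(\mu_A, \sigma_A^2) the candidate.
\Delta_{K\text{-}L}(f, f_A)
   = \mathrm{E}_f\!\left[\log\frac{f(y)}{f_A(y)}\right]
   = \log\frac{\sigma_A}{\sigma}
     + \frac{\sigma^2 + (\mu - \mu_A)^2}{2\,\sigma_A^2}
     - \frac{1}{2},
\qquad
\Delta_{L_2}(f, f_A) = (\mu - \mu_A)^2 .
```

K-L uses the full densities, while the L2-type measure depends only on the means, which is why the latter remains usable when the errors are not normally distributed.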

Although the two measures sometimes give different indications of small-sample performance, in large samples they can be shown via a lengthy derivation to be equivalent. Thus, criteria that are efficient in the L2 sense are also efficient in the K-L sense.

The L2 and L1 norms are much more applicable in the robust setting because their forms do not depend on any given distribution. By contrast, when errors are nonnormal the Kullback-Leibler discrepancy must be computed for each distribution.

We can use these two measures to define efficiency in both the asymptotic and the small-sample observed sense. To define observed small-sample efficiency for K-L, again let M* be the candidate model that is closest to the true model, and let K-L(M*) denote this distance.

Let Mk be the candidate model, with distance K-L(Mk), selected by some criterion. The observed efficiency is then the ratio K-L(M*)/K-L(Mk); the L2 observed efficiency is defined analogously. Wherever we make references to model selection performance under K-L and L2, the terms K-L and L2 refer to observed efficiency unless otherwise mentioned.
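A minimal sketch of how observed efficiency can be tabulated in a simulation, using hypothetical distance values chosen purely for illustration:

```python
import numpy as np

def observed_efficiency(distances, selected_index):
    """Observed efficiency of a selection: d(closest candidate) / d(selected model).

    `distances` holds the K-L (or L2) distance from each candidate model to the
    true model; the ratio equals 1 when the selected model is the closest one.
    """
    distances = np.asarray(distances, dtype=float)
    return distances.min() / distances[selected_index]

# Hypothetical run: five candidate models, and the criterion picks model index 3.
kl_distances = [0.92, 0.40, 0.25, 0.31, 0.58]
print(observed_efficiency(kl_distances, selected_index=3))  # 0.25 / 0.31 ~= 0.81
```

The ratio is 1 when the criterion selects the closest candidate and decreases toward 0 as the selected model moves farther from the true model.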

The chapters that follow include theoretical properties of model selection criteria and the L2 and K-L distances. Here we use the expected values of L2 and K-L when discussing theoretical distance between the candidate model and the true model. As noted earlier, efficiency can be defined in terms of expected distance. When the true model belongs to the set of candidate models, or for general expectation, we use E without subscripts.

Overfitting and Underfitting The terms overfitting and underfitting can be defined two ways. Under consistency, when a true model is itself a candidate model, overfitting is defined as choosing a model with extra variables, and underfitting is defined as choosing a model that either has too few variables or is incomplete.

We have no term to describe choosing a model with the correct order but the wrong variables. Using efficiency (observed or expected), overfitting can be defined as choosing a model that has more variables than the model identified as closest to the true model, thereby reducing efficiency. Underfitting is defined as choosing a model with too few variables compared to the closest model, also reducing efficiency.

Both overfitting and underfitting can lead to problems with the predictive abilities of a model. An underfitted model may have poor predictive ability due to a lack of detail in the model. An overfitted model may be unstable in the sense that repeated samples from the same process can lead to widely differing predictions due to variability in the extraneous variables. A criterion that can balance the tendencies to overfit and underfit is preferable.

Layout. We will discuss the broad model categories of univariate models, multivariate models, data resampling techniques, and nonparametric models, and include simulation results for each category, presenting results under both K-L and L2 observed efficiencies. We leave it to the practitioner to decide his or her preference.

In addition, at the end of this book we devote an entire chapter of simulation studies to each model type as well as real data examples. The contents of each chapter are summarized below. In Chapter 2 we lay the foundation for the criteria we will discuss throughout the book, and for the K-L and L2 observed efficiencies. We introduce the distributions necessary to develop the concept of the signal-to-noise ratio. We begin by examining the large-sample and small-sample properties of the classical criteria AICc, AIC, FPE, and SIC for univariate regression, including their asymptotic probabilities of overfitting (the probability of preferring one overfit model to the true model) and asymptotic signal-to-noise ratios.

The signal-to-noise information is analyzed in order to suggest some signal-to-noise corrected variant criteria that perform better than the parent criteria. In this Chapter we also introduce the simulation model format we will use to illustrate criterion performance throughout the book. This includes a brief discussion of random X regression, since it is used to generate the design matrices for our simulations, and also an explanation of the ranking method we will use to compare model selection criteria.

Ranks for each individual simulation run are computed and averaged over all runs, and the criterion with the lowest overall average rank is considered the best. In general, for each model category we will begin with a simulation study of two special cases where the noncentrality parameter and true model structure are known.
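A minimal sketch of this ranking scheme, with made-up observed-efficiency values and criterion names used purely for illustration:

```python
import numpy as np

# Rows: simulation runs; columns: criteria. Entries: observed efficiency of the
# model each criterion selected in that run (hypothetical numbers).
criteria = ["AIC", "AICc", "SIC"]
obs_eff = np.array([
    [0.70, 0.95, 0.80],
    [0.55, 0.90, 0.85],
    [0.60, 0.75, 0.95],
])

# Rank within each run: rank 1 goes to the highest observed efficiency.
order = (-obs_eff).argsort(axis=1)
ranks = np.empty_like(order)
rows = np.arange(obs_eff.shape[0])[:, None]
ranks[rows, order] = np.arange(1, obs_eff.shape[1] + 1)

# Average the per-run ranks; the lowest average rank is considered best overall.
for name, avg in zip(criteria, ranks.mean(axis=0)):
    print(name, avg)
```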

Then, to measure the performance of a model selection criterion in small samples, observed efficiency is developed and a large-scale small-sample simulation is conducted. In Chapter 3 we discuss the autoregressive model, which describes the present value y_t as a linear combination of past observations y_{t-1}, y_{t-2}, ....

This linear relationship allows us to write the autoregressive model as a special case regression model similar to those in Chapter 2. Since past observations are used to model the present, we have a problem modeling the first observation y_1 because there are no past observations. There are several possible solutions.

The one we have chosen is to begin modeling at observation p + 1, and to lose the first p observations due to conditioning on the past. Although this results in a reduced sample size, it also requires fewer model assumptions.
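A minimal sketch of this conditioning approach (our illustration, not the book's code): an AR(p) model without an intercept is fit by ordinary least squares on a design matrix of lagged observations, so the effective sample size is n - p.

```python
import numpy as np

def fit_ar_conditional_ls(y, p):
    """Fit an AR(p) model (no intercept) by conditional least squares.

    Modeling starts at observation p + 1; the first p observations serve only
    as regressors, so the effective sample size is n - p.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Row for response y_t holds the lagged values (y_{t-1}, ..., y_{t-p}).
    X = np.column_stack([y[p - j - 1 : n - j - 1] for j in range(p)])
    response = y[p:]
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    resid = response - X @ coef
    sigma2_hat = resid @ resid / (n - p)
    return coef, sigma2_hat

rng = np.random.default_rng(0)
e = rng.standard_normal(200)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.6 * y[t - 1] + e[t]          # simulated AR(1) with phi = 0.6
print(fit_ar_conditional_ls(y, p=1))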

However, this also means that the sample size for autoregressive models changes with the model, unlike the univariate regression models in Chapter 2. Another way to model time series is with a univariate moving average model.

Although we do not discuss model selection with respect to moving average models, we do address the situation where the data is truly the result of a moving average process, but is modeled using autoregressive models. Also, under certain conditions a moving average MA(1) model may be written as an infinite-order AR model.

This allows us to examine how criteria behave with models of infinite order where the true model does not belong to the set of candidate models.
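For reference, the standard expansion behind this statement (not quoted from the book): an invertible MA(1) process can be inverted into an autoregression of infinite order whose coefficients decay geometrically, so any finite AR candidate is only an approximation.

```latex
% Invertible MA(1): y_t = e_t + \theta e_{t-1} with |\theta| < 1.
% Inverting (1 + \theta B) gives the AR(\infty) representation
y_t = \sum_{j=1}^{\infty} \pi_j\, y_{t-j} + e_t,
\qquad \pi_j = -(-\theta)^j = (-1)^{j+1}\theta^j .
```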

Multistep prediction AR models are discussed briefly and the performance of some multistep variants is tested via a simulation study. In Chapter 4 we consider the multivariate regression model. Multivariate regression models are similar to univariate regression models, with the important difference that the error variance is actually a covariance matrix. Since many selection criteria are functions of this matrix, a central issue is how to produce a scalar function from these matrices.

Determinants and traces are common methods, but are by no means the only options. Generalized variances are popular due to the fact that their distributional properties are well known, whereas distributions of other scalar functions of matrices, such as the trace, are not well known. In this book we focus on the generalized variance (the determinant) so that moments and probabilities of overfitting (the probability of preferring one overfit model to the true model) can be computed.
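A minimal sketch of the two scalar summaries for a multivariate regression fit (illustrative data; we assume the error covariance is estimated by the residual cross-product matrix divided by n):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, q = 60, 3, 2                     # n observations, k regressors, q responses
X = rng.standard_normal((n, k))
B = np.array([[1.0, -0.5],
              [0.5,  0.0],
              [0.0,  1.0]])
E = rng.standard_normal((n, q)) @ np.array([[1.0, 0.3], [0.0, 0.8]])
Y = X @ B + E

B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
resid = Y - X @ B_hat
sigma_hat = resid.T @ resid / n        # estimated q x q error covariance matrix

gen_var = np.linalg.det(sigma_hat)     # generalized variance (determinant)
tr = np.trace(sigma_hat)               # trace-based scalar summary, for comparison
print(gen_var, tr)
```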

However, we also present the trace criterion results for comparison purposes, since in some settings the trace proves to be more useful than the determinant of L2. In Chapter 5 we discuss the vector autoregressive model. Of all the models in this book, the vector autoregressive or VAR model is perhaps the most difficult to work with due to the rapid increase in parameter counts as model complexity increases. This rapid increase causes many selection criteria to perform poorly, particularly those prone to underfitting.

As we did with the univariate autoregressive models, we begin modeling at p + 1, and thus the sample size decreases as model order increases. We again condition on the past and write the vector autoregressive model as a special case multivariate regression model. This loss of sample size eliminates the need for backcasting or other assumptions about unobserved past data. Casting the VAR model into a multivariate framework allows us to compute moments of the model selection criteria as well as probabilities of overfitting (the probability of preferring one overfit model to the true model).

Such moments allow us to better study small-sample properties of the selection criteria. Overfitting has a much smaller probability of occurring in VAR models than in multivariate regression models, due in part to the rapid increase in parameters with model order and to the decrease in sample size with increasing model order.
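The parameter growth is easy to quantify: an m-dimensional VAR(p) has m² autoregressive coefficients per lag (ignoring the error covariance matrix). A quick count for a few illustrative values of m and p:

```python
# Coefficient count for an m-dimensional VAR(p): p matrices of size m x m.
for m in (1, 2, 5):
    for p in (1, 3, 6):
        print(f"m={m}, p={p}: {m * m * p} autoregressive coefficients")
```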

Simulation results indicate that an excessively heavy penalty function leads to decreased performance in VAR model selection. In Chapter 6 we investigate data resampling techniques. If predictive ability is of interest for the model, then cross-validation or bootstrapping techniques can be applied. Cross-validation and bootstrapping are discussed for univariate as well as multivariate regression and time series.

CV is also an efficient criterion and is asymptotically equivalent to FPE. Some issues unique to bootstrapping include choosing between randomly selecting pairs (y, x) or bootstrapping from the residuals.
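A minimal sketch of the two resampling schemes for a linear regression fit (our illustration, with hypothetical data): the pairs bootstrap resamples whole (y, x) rows, while the residual bootstrap keeps the design fixed and resamples the fitted residuals.

```python
import numpy as np

def ls_fit(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def bootstrap_pairs(X, y, rng):
    """Resample (y, x) rows with replacement, then refit."""
    idx = rng.integers(0, len(y), size=len(y))
    return ls_fit(X[idx], y[idx])

def bootstrap_residuals(X, y, rng):
    """Keep X fixed; resample residuals around the fitted values, then refit."""
    beta = ls_fit(X, y)
    fitted = X @ beta
    resid = y - fitted
    y_star = fitted + rng.choice(resid, size=len(y), replace=True)
    return ls_fit(X, y_star)

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.standard_normal(50)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(50)
print(bootstrap_pairs(X, y, rng), bootstrap_residuals(X, y, rng))
```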

Both are considered in a simulation study. Variants of bootstrapped selection criteria with penalty functions that prevent overfitting are also introduced. In Chapter 7 we discuss robust regression and robust model selection criteria. The least squares approach does not assume normality; however, least squares can be affected by heavy-tailed distributions. We begin with least absolute deviations regression, or L1 regression, and introduce the L1 distance and observed efficiency. Under the corresponding error assumption, we will discuss the L1AICc criterion and present an L1 regression simulation study.

In this Chapter we also propose a generalized Kullback-Leibler information for measuring the distance between a robust function evaluated under the true model and a fitted model. We then use this generalization to obtain robust model selection criteria that not only fit the majority of the data, but also take into account nonnormal errors.

These criteria have the additional advantage of unifying most existing Akaike information criteria. Lastly, in Chapter 7 we develop criteria for quasi-likelihood models. Such models include not only regression models with normal errors, but also logistic regression models, Poisson regression models, exponential regression models, etc.

The performance of these criteria is examined via simulation, focusing on logistic regression. In Chapter 8, we develop a version of AICc for use with nonparametric and semiparametric regression models. The nonparametric AICc can be used to choose smoothing parameters for any linear smoother, including local quadratic and smoothing spline estimators. It has less tendency to undersmooth and it exhibits low variability.

Monte Carlo results show that the nonparametric AICc is comparable to well-behaved plug-in methods (see Ruppert, Sheather and Wand), but also performs well when the plug-in method fails or is unavailable.
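One widely cited form of such a criterion, for a linear smoother with hat matrix H, replaces the parameter count with tr(H). The sketch below uses the expression given by Hurvich, Simonoff and Tsai, which may differ in detail from the version developed in this book, together with a hypothetical ridge-type smoother family chosen only for illustration.

```python
import numpy as np

def aicc_linear_smoother(y, H):
    """AICc-type score for a linear smoother y_hat = H y.

    Follows the Hurvich-Simonoff-Tsai form for smoothing parameter selection;
    treat it as an illustrative assumption, not this book's exact criterion.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    resid = y - H @ y
    sigma2_hat = resid @ resid / n
    tr_H = np.trace(H)
    return np.log(sigma2_hat) + (1.0 + tr_H / n) / (1.0 - (tr_H + 2.0) / n)

# Example: pick the ridge-type smoothing parameter minimizing the score for the
# (hypothetical) smoother family H(lam) = X (X'X + lam I)^{-1} X'.
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 80)
X = np.column_stack([x**d for d in range(6)])
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(80)
scores = {lam: aicc_linear_smoother(y, X @ np.linalg.solve(X.T @ X + lam * np.eye(6), X.T))
          for lam in (1e-6, 1e-3, 1e-1, 1.0)}
print(min(scores, key=scores.get), scores)
```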

We also develop a cross-validatory version of AICc for selecting a hard wavelet threshold (Donoho and Johnstone), and show via simulations that our method can outperform universal hard thresholding.

In addition, we provide supporting theory on the rate at which our proposed method attains the optimal mean integrated squared error. Finally, Chapter 9 is devoted almost exclusively to simulation results for each of the modeling categories in earlier chapters. Simulations include two special case models, a large-scale multi-model study, and two very large sample size models. Sixteen criteria are compared for the univariate models, while 18 criteria are compared for the multivariate models.

While our studies are by no means comprehensive, they do illustrate the performance of a variety of selection criteria under many different modeling circumstances. Four real data examples are also analyzed for each model type. Finally, we study the performance of the stepwise procedure in the selection of variables. Topics Not Covered. Unfortunately, there is much interesting work being done on topics that are outside the scope of this book, but important to the topic of model selection.

In addition, there are model categories which we do not address but which are nevertheless important areas for research in variable selection. These include, but are not limited to, survival models (Lawless), regression models with ARMA errors (Tsay), measurement error models (Fuller), transformation and weighted regression models (Carroll and Ruppert), nonlinear regression models (Bates and Watts), Markov regression time series models (Zeger and Qaqish), structural time series models (Harvey), sliced inverse regression (Li), linear models with longitudinal data (Diggle, Liang and Zeger), generalized partially linear single-index models (Carroll, Fan, Gijbels and Wand), and ARCH models (Gouriéroux). Finally, a forthcoming book, Model Selection and Inference: A Practical Information Theoretic Approach (Burnham and Anderson), covers some subjects that have not been addressed in this book.

The interested reader may find it a useful reference for the study of model selection. While this list is by no means complete, these six criteria were chosen as the basis for illustrating three possible approaches to selecting a model: using efficient criteria to estimate K-L, using efficient criteria to estimate L2, and using consistent criteria. With the aim of making further refinements, we will also examine the small-sample moments of three of these criteria in order to suggest improvements to their penalty functions.

We will discuss the use of the signal-to-noise ratio as a descriptive statistic for evaluating model selection criteria. The rest of Chapter 2 examines small-sample properties, including underfitting, using two special case models, and we close with a simulation study of these two models for purposes of comparison to the expected theoretical results.

Model Description. Model Structure and Notation. Before we can discuss model selection in regression, we need to define the model structures with which we will work and the assumptions we will make.

Here we introduce three model structures: the true model, the general model, and the fitted model. Finally, we will define the fitted model, or the candidate model, with respect to the general model. We will further assume that the method of least squares is used to fit a model to the data, and the candidate model (unless otherwise noted) will be of order k.

This is also the maximum likelihood estimate (MLE) of β, since the errors ε satisfy the normality assumption. Distance Measures. The distance measures L2 and the Kullback-Leibler discrepancy K-L provide a way to evaluate how well the candidate model approximates the true model defined above.

Using this notation and taking expectations with respect to the true model, we arrive at the expected distance between the candidate and true models. In practice, the candidate model is estimated from the data. Derivations of the Foundation Model Selection Criteria. We first consider FPE, which was originally derived for autoregressive time series models.

A similar procedure was developed by Davisson for analyzing signal-plus-noise data; however, since Akaike published his findings in the statistical literature, FPE is usually attributed to Akaike. The derivation of FPE is straightforward for regression. Suppose we have n observations from the overfitted model: FPE weighs the error variance of the fitted model against the additional variance incurred when predicting a new observation, and hence the idea of minimizing FPE strikes a balance between these two variances.
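For a candidate regression model of order k fitted by least squares to n observations, the resulting criterion takes the familiar form below (a standard expression stated here for reference; scalings differ slightly across authors), where the residual variance uses the divisor n:

```latex
\mathrm{FPE}(k) = \hat\sigma^2_k \,\frac{n + k}{\,n - k\,},
\qquad
\hat\sigma^2_k = \frac{1}{n}\sum_{i=1}^{n}\hat e_i^{\,2}.
```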

Mallows (1973) took a different approach to obtaining an L2-based model selection criterion. Recent work by Mallows indicates that any candidate model whose Cp value is sufficiently small can be considered adequate. The derivation of Cp relies on the assumption that the true model belongs to the set of candidate models.

This assumption may be unrealistic in practice, but it allows us to compute expectations for central distributions, and it also allows us to entertain the concept of overfitting. The derivation of AIC is intended to create an estimate that is an approximation of the Kullback-Leibler discrepancy (a detailed derivation can be found in Linhart and Zucchini).

Like the Kullback-Leibler discrepancy on which it is based, AIC is readily adapted to a wide range of statistical models. In fitting candidate models, the number of parameters is k for the β and 1 for σ². In response to the small-sample bias of AIC, Sugiura and, later, Hurvich and Tsai derived AICc by estimating the expected Kullback-Leibler discrepancy directly in regression models.

As with AIC, the candidate model is estimated via maximum likelihood. Hurvich and Tsai also adopted the assumption that the true model belongs to the set of candidate models. Under this assumption, they took expectations of the Kullback-Leibler discrepancy and simplified the result to obtain AICc. Hurvich and Tsai have shown that AICc does in fact outperform AIC in small samples, but that it is asymptotically equivalent to AIC and therefore performs just as well in large samples.
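For reference, commonly used scaled forms of the two criteria for a normal-error regression model of order k are shown below (standard expressions, with the residual variance of the candidate model computed with divisor n; the exact scaling used in this book may differ by constants that do not affect the ranking of candidate models):

```latex
\mathrm{AIC}(k)  = \log\hat\sigma^2_k + \frac{2\,(k+1)}{n},
\qquad
\mathrm{AIC_c}(k) = \log\hat\sigma^2_k + \frac{n+k}{\,n-k-2\,}.
```

As n grows with k fixed, the AICc penalty approaches the AIC penalty, which is consistent with the asymptotic equivalence noted above.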

We next consider the case where an investigator believes that the true model belongs to the set of candidate models. Here the goal is to identify the true model with an asymptotic probability of 1, the approach that resulted in the derivation of consistent model selection criteria. Two authors, Akaike and Schwarz, introduced equivalent consistent model selection criteria conceived from a Bayesian perspective.

Schwarz derived SIC for selecting models in the Koopman-Darmois family, whereas Akaike derived his model selection criterion BIC for the problem of selecting a model in linear regression. Although in this book we consider SIC, the reader should note that the two procedures are equivalent both in performance and by date of introduction.

Since SIC does not depend on the prior, the exact distribution of the prior need not be known. Schwarz assumes it is of the form Σ_j α_j p_j, where α_j is the prior probability for model j, and p_j is the conditional prior of β given model j. Finally, Schwarz assumed a fixed penalty or loss for selecting the wrong model. The Bayes solution for selecting a model is to choose the model with the largest posterior probability of being correct.

In large samples, this posterior can be approximated by a Taylor expansion. The second term was of the form (k/2) log n, where k is the dimension of the model and n is the sample size. The remaining terms in the Taylor expansion were shown to be bounded and hence could be ignored in large samples.

Scaling the first two terms by n, we obtain SIC. The other consistent criterion among our foundation criteria was proposed by Hannan and Quinn. They applied the law of the iterated logarithm to derive HQ for autoregressive time series models. Although intended for use with the autoregressive model, HQ can also be applied to regression models.

We postpone the derivation for HQ until Chapter 3, where we discuss autoregressive models in detail, and simply present the expression for the scaled HQ for regression, with the residual variance estimate σ̂² defined as above.

This can be explained by the behavior of its penalty function, which grows only very slowly with the sample size. Many authors have considered generalizing the penalty by a multiplier α. On the basis of simulation results, Bhansali and Downham propose FPE4, although other authors have recommended different values of α. Note that the penalty function of HQ falls within the range of α values these authors consider. Our signal-to-noise ratio derivations show that in small samples, adjusting the penalty function by α yields much less satisfactory results than the correction proposed by AICc.
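For reference, commonly used scaled forms of the two consistent criteria are shown below (standard expressions; again the scaling in this book may differ by constants that do not affect model ranking):

```latex
\mathrm{SIC}(k) = \log\hat\sigma^2_k + \frac{k\,\log n}{n},
\qquad
\mathrm{HQ}(k)  = \log\hat\sigma^2_k + \frac{2\,k\,\log\log n}{n}.
```

Because log log n grows extremely slowly, the HQ penalty per parameter exceeds AIC's only modestly even for very large samples, which is the behavior described above.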

Other choices of α have also been proposed. Moments of Model Selection Criteria. When choosing among candidate models, the standard rule is that the best model is the one for which the value of the model selection criterion used attains its minimum, and models are compared by taking the difference between the criterion values for each model. For example, suppose we have one model with k variables and a second model with L additional variables, and we would like to use some hypothetical model selection criterion, say MSC, to evaluate them.

This difference depends largely on the strength of the penalty function, and is actually a random variable for which moments can be computed for nested models.
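One natural way to express the diagnostic is as the standardized expected difference between the criterion values of the two models (a paraphrase of the construction rather than a quoted formula):

```latex
\mathrm{SNR} \;=\;
\frac{\mathrm{E}\!\left[\,\mathrm{MSC}(k+L) - \mathrm{MSC}(k)\,\right]}
     {\sqrt{\operatorname{Var}\!\left[\,\mathrm{MSC}(k+L) - \mathrm{MSC}(k)\,\right]}}.
```

A large positive signal-to-noise ratio indicates that the criterion reliably penalizes the L extra variables, while a small ratio indicates that random variation can easily overturn the intended penalty.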



0コメント

  • 1000 / 1000