
Variable Selection Through the LAD-Lasso


Hansheng WANG
Guanghua School of Management, Peking University, Beijing, P. R. China 100871

Guodong LI
Department of Statistics and Actuarial Science, University of Hong Kong, P. R. China (ligd@hkusua.hku.hk)

Guohua JIANG
Guanghua School of Management, Peking University, Beijing, P. R. China 100871 (gjiang@gsm.pku.edu.cn)

The least absolute deviation (LAD) regression is a useful method for robust regression, and the least absolute shrinkage and selection operator (lasso) is a popular choice for shrinkage estimation and variable selection. In this article we combine these two classical ideas to produce LAD-lasso. Compared with LAD regression, LAD-lasso can perform parameter estimation and variable selection simultaneously. Compared with the traditional lasso, LAD-lasso is resistant to heavy-tailed errors or outliers in the response. Furthermore, with easily estimated tuning parameters, the LAD-lasso estimator enjoys the same asymptotic efficiency as the unpenalized LAD estimator obtained under the true model (i.e., the oracle property). Extensive simulation studies demonstrate satisfactory finite-sample performance of LAD-lasso, and a real example is analyzed for illustration purposes.


1. INTRODUCTION

Datasets subject to heavy-tailed errors or outliers are commonly encountered in applications. They may appear in the responses and/or the predictors. In this article we focus on the situation in which the heavy-tailed errors or outliers are found in the responses. In such a situation, it is well known that ordinary least squares (OLS) may fail to produce a reliable estimator, whereas the least absolute deviation (LAD) estimator can be very useful. Specifically, the $\sqrt{n}$-consistency and asymptotic normality of the LAD estimator can be established without assuming any moment condition on the residual. Because the objective function (i.e., the LAD criterion) is a nonsmooth function, the usual Taylor expansion argument cannot be used directly to study the LAD estimator's asymptotic properties. Therefore, over the past decade, much effort has been devoted to establishing the $\sqrt{n}$-consistency and asymptotic normality of various LAD estimators (Bassett and Koenker 1978; Pollard 1991; Bloomfield and Steiger 1983; Knight 1998; Peng and Yao 2003; Ling 2005), which left the important problem of robust model selection open for study.

In a regression setting, it is well known that omitting an important explanatory variable may produce severely biased parameter estimates and prediction results. On the other hand, including unnecessary predictors can degrade the efficiency of the resulting estimation and yield less accurate predictions. Hence selecting the best model based on a finite sample is always a problem of interest for both theory and application. Under appropriate moment conditions on the residual, the problem of model selection has been extensively studied in the literature (Shao 1997; Hurvich and Tsai 1989; Shi and Tsai 2002, 2004), and two important selection criteria, the Akaike information criterion (AIC) (Akaike 1973) and the Bayes information criterion (BIC) (Schwarz 1978), have been widely used in practice.

By Shao's (1997) definition, many selection criteria can be classified into two major classes. One class contains all of the efficient model selection criteria, among which the most well-known example is the AIC. The efficient criteria have the ability to select the best model according to an appropriately defined asymptotic optimality criterion. Therefore, they are particularly useful if the underlying model is too complicated to be well approximated by any finite-dimensional model. However, if an underlying model indeed has a finite dimension, then it is well known that many efficient criteria (e.g., the AIC) suffer from a nonignorable overfitting effect regardless of sample size. In such a situation, a consistent model selection criterion can be useful. Hence the second major class contains all of the consistent model selection criteria, among which the most representative example is the BIC. Compared with the efficient criteria, the consistent criteria have the ability to identify the true model consistently, if such a finite-dimensional true model does in fact exist.

Unfortunately, there is no general theoretical agreement regarding which type of selection criterion is preferable (Shi and Tsai 2002). As we demonstrate, the proposed LAD-lasso method belongs to the consistent category. In other words, we make the assumption that the true model is of finite dimension and is contained in the set of candidate models under consideration. Then the proposed LAD-lasso method has the ability to identify the true model consistently. (For a fuller explanation of model selection efficiency and consistency, see Shao 1997; McQuarrie and Tsai 1998; and Shi and Tsai 2002.)

Because most selection criteria (e.g., AIC, BIC) are developed based on OLS estimates, their finite-sample performance under heavy-tailed errors is poor. Consequently, Hurvich and Tsai (1990) derived a set of useful model selection criteria (e.g., AIC, AICc, BIC) based on the LAD estimates. But despite


their usefulness, these criteria have some limitations, the major one being the computational burden. Note that the number of all possible candidate models increases exponentially with the number of regression variables. Hence performing best subset selection by considering all possible candidate models is difficult if the number of predictors is relatively large.

To address the deficiencies of traditional model selection methods (i.e., AIC and BIC), Tibshirani (1996) proposed the least absolute shrinkage and selection operator (lasso), which can effectively select important explanatory variables and estimate regression parameters simultaneously. Under normal errors, the satisfactory finite-sample performance of lasso has been demonstrated numerically by Tibshirani (1996), and its statistical properties have been studied by Knight and Fu (2000), Fan and Li (2001), and Tibshirani, Saunders, Rosset, Zhu, and Knight (2005). But if the regression error has a very heavy tail or suffers from severe outliers, the finite-sample performance of lasso can be poor due to its sensitivity to heavy-tailed errors and outliers.

In this article we attempt to develop a robust regression shrinkage and selection method that can do regression shrinkage and selection (like lasso) and is also resistant to outliers or heavy-tailed errors (like LAD). The basic idea is to combine the usual LAD criterion and the lasso-type penalty to produce the LAD-lasso method. Compared with LAD, LAD-lasso can do parameter estimation and model selection simultaneously. Compared with lasso, LAD-lasso is resistant to heavy-tailed errors and/or outliers. Furthermore, with easily estimated tuning parameters, the LAD-lasso estimator enjoys the same asymptotic efficiency as the oracle estimator.

The rest of the article is organized as follows. Section 2 proposes LAD-lasso and discusses its main theoretical and numerical properties. Section 3 presents extensive simulation results, and Section 4 analyzes a real example. Finally, Section 5 concludes the article with a short discussion. The Appendix provides all of the technical details.

2. THE LAD-LASSO

2.1 LAD-lasso

Consider the linear regression model

$$y_i = x_i'\beta + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (1)$$

where $x_i = (x_{i1}, \ldots, x_{ip})'$ is the p-dimensional regression covariate, $\beta = (\beta_1, \ldots, \beta_p)'$ is the vector of associated regression coefficients, and the $\varepsilon_i$ are iid random errors with median 0. Moreover, assume that $\beta_j \ne 0$ for $j \le p_0$ and $\beta_j = 0$ for $j > p_0$ for some $p_0 \ge 0$. Thus the correct model has $p_0$ significant and $(p - p_0)$ insignificant regression variables.

Usually, the unknown parameters of model (1) can be estimated by minimizing the OLS criterion, $\sum_{i=1}^{n} (y_i - x_i'\beta)^2$. Furthermore, to shrink unnecessary coefficients to 0, Tibshirani (1996) proposed the following lasso criterion:

$$\text{lasso} = \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + n\lambda \sum_{j=1}^{p} |\beta_j|.$$

Because the lasso uses the same tuning parameter for all regression coefficients, the resulting estimators may suffer an appreciable bias (Fan and Li 2001). Hence we further consider the following modified lasso criterion:

$$\text{lasso}^* = \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + n \sum_{j=1}^{p} \lambda_j |\beta_j|,$$

which allows for different tuning parameters for different coefficients. As a result, lasso* is able to produce sparse solutions more effectively than lasso.

Nonetheless, it is well known that the OLS criterion used in lasso* is very sensitive to outliers. To obtain a robust lasso-type estimator, we further modify the lasso* objective function into the following LAD-lasso criterion:

$$\text{LAD-lasso} = Q(\beta) = \sum_{i=1}^{n} |y_i - x_i'\beta| + n \sum_{j=1}^{p} \lambda_j |\beta_j|.$$

As can be seen, the LAD-lasso criterion combines the LAD criterion and the lasso penalty, and hence the resulting estimator is expected to be robust against outliers and also to enjoy a sparse representation.

Computationally, it is very easy to find the LAD-lasso estimator. Specifically, consider an augmented dataset $\{(y_i^*, x_i^*)\}$ with $i = 1, \ldots, n + p$, where $(y_i^*, x_i^*) = (y_i, x_i)$ for $1 \le i \le n$, $(y_{n+j}^*, x_{n+j}^*) = (0, n\lambda_j e_j)$ for $1 \le j \le p$, and $e_j$ is a p-dimensional vector with the jth component equal to 1 and all others equal to 0. It can be easily verified that

$$\text{LAD-lasso} = Q(\beta) = \sum_{i=1}^{n+p} |y_i^* - x_i^{*\prime}\beta|.$$

Hence we can simply treat $(y_i^*, x_i^*)$ as if they were the true data. Consequently, any standard unpenalized LAD program (e.g., l1fit in S-PLUS, rq in the quantreg package of R) can be used to find the LAD-lasso estimator without much programming effort.
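To make the recipe concrete, here is a minimal sketch in R (the environment the article itself points to via the quantreg package). The wrapper name lad_lasso and the convention that lambda is supplied as a length-p vector are our own illustrative choices, not the article's.

```r
# Sketch of the augmented-data LAD-lasso fit described above.
library(quantreg)

lad_lasso <- function(y, x, lambda) {
  # y: response vector; x: n-by-p design matrix; lambda: length-p tuning vector
  n <- nrow(x)
  p <- ncol(x)
  # Append p pseudo-observations: response 0, covariate row n * lambda_j * e_j'
  x_aug <- rbind(x, diag(n * lambda, nrow = p))
  y_aug <- c(y, rep(0, p))
  # Median regression (tau = .5) on the augmented data is an unpenalized LAD
  # fit, and its objective coincides with the LAD-lasso criterion Q(beta).
  rq(y_aug ~ x_aug - 1, tau = .5)
}
```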


2.2 Theoretical Properties

We decompose the true coefficient vector as $\beta = (\beta_a', \beta_b')'$, where $\beta_a = (\beta_1, \ldots, \beta_{p_0})'$ and $\beta_b = (\beta_{p_0+1}, \ldots, \beta_p)'$. The corresponding LAD-lasso estimator is denoted by $\hat\beta = (\hat\beta_a', \hat\beta_b')'$, and the LAD-lasso objective function is denoted by $Q(\beta) = Q(\beta_a, \beta_b)$. In addition, we also decompose the covariate $x_i = (x_{ia}', x_{ib}')'$ with $x_{ia} = (x_{i1}, \ldots, x_{ip_0})'$ and $x_{ib} = (x_{i(p_0+1)}, \ldots, x_{ip})'$. To study the theoretical properties of LAD-lasso, the following technical assumptions are needed:

Assumption A. The error $\varepsilon_i$ has a continuous and positive density at the origin.

Assumption B. The matrix $\Sigma = \mathrm{cov}(x_1)$ exists and is positive definite.

Note that Assumptions A and B are both very typical technical assumptions used extensively in the literature (Pollard 1991; Bloomfield and Steiger 1983; Knight 1998). They are needed


for establishing the $\sqrt{n}$-consistency and the asymptotic normality of the unpenalized LAD estimator. Furthermore, define

$$a_n = \max\{\lambda_j, 1 \le j \le p_0\} \quad\text{and}\quad b_n = \min\{\lambda_j, p_0 < j \le p\},$$

where each $\lambda_j$ is a function of $n$. Based on the foregoing notation, the consistency of the LAD-lasso estimator can first be established.

Lemma 1 (Root-n consistency). Suppose that $(x_i, y_i)$, $i = 1, \ldots, n$, are iid and that the linear regression model (1) satisfies Assumptions A and B. If $\sqrt{n}\,a_n \to 0$, then the LAD-lasso estimator is root-n consistent.

The proof is given in Appendix A. Lemma 1 implies that if the tuning parameters associated with the significant variables converge to 0 at a speed faster than $n^{-1/2}$, then the corresponding LAD-lasso estimator can be $\sqrt{n}$-consistent.

We next show that if the tuning parameters associated with the insignificant variables shrink to 0 more slowly than $n^{-1/2}$, then those regression coefficients are estimated as exactly 0 with probability tending to 1.

Lemma 2 (Sparsity). Under the same assumptions as in Lemma 1 and the further assumption that $\sqrt{n}\,b_n \to \infty$, with probability tending to 1 the LAD-lasso estimator $\hat\beta = (\hat\beta_a', \hat\beta_b')'$ must satisfy $\hat\beta_b = 0$.

The proof is given in Appendix B. Lemma 2 shows that LAD-lasso has the ability to consistently produce sparse solutions for the insignificant regression coefficients; hence variable selection and parameter estimation can be accomplished simultaneously.

The foregoing two lemmas imply that the root-n-consistent estimator must satisfy $P(\hat\beta_b = 0) \to 1$ when the tuning parameters fulfill appropriate conditions. As a result, the LAD-lasso estimator performs as well as the oracle estimator obtained by assuming that $\beta_b = 0$ is known in advance. Finally, we obtain the asymptotic distribution of the LAD-lasso estimator.

Theorem (Oracle property). Suppose that $(x_i, y_i)$, $i = 1, \ldots, n$, are iid and that the linear regression model (1) satisfies Assumptions A and B. If $\sqrt{n}\,a_n \to 0$ and $\sqrt{n}\,b_n \to \infty$, then the LAD-lasso estimator $\hat\beta = (\hat\beta_a', \hat\beta_b')'$ must satisfy $\hat\beta_b = 0$ with probability tending to 1, and $\sqrt{n}(\hat\beta_a - \beta_a) \to_d N\big(0, \Sigma_0^{-1}/\{4 f^2(0)\}\big)$, where $\Sigma_0 = \mathrm{cov}(x_{ia})$ and $f(t)$ is the density of $\varepsilon_i$.

The proof is given in Appendix C. Under the conditions $\sqrt{n}\,a_n \to 0$ and $\sqrt{n}\,b_n \to \infty$, this theorem implies that the LAD-lasso estimator is robust against heavy-tailed errors, because the $\sqrt{n}$-consistency of $\hat\beta_a$ is established without making any moment assumptions on the regression error $\varepsilon_i$. The theorem also implies that the resulting LAD-lasso estimator has the same asymptotic distribution as the LAD estimator obtained under the true model. Hence the oracle property of the LAD-lasso estimator is established. Furthermore, due to the convexity and piecewise linearity of the criterion function $Q(\beta)$, the LAD-lasso estimator properties discussed in this article are global rather than local, as in Fan and Li (2001).

2.3 Tuning Parameter Estimation

Intuitively, the commonly used cross-validation, generalized cross-validation (Craven and Wahba 1979; Tibshirani 1996; Fan and Li 2001), AIC, AICc, and BIC (Hurvich and Tsai 1989; Zou, Hastie, and Tibshirani 2004) methods could be used to select the optimal regularization parameters $\lambda_j$ after appropriate modification, for example, replacing the least squares term by the LAD criterion. However, using them in our situation is difficult for at least two reasons. First, there are a total of p tuning parameters involved in the LAD-lasso criterion, and simultaneously tuning so many regularization parameters is much too expensive computationally. Second, their statistical properties are not clearly understood under heavy-tailed errors. Hence we consider the following option.

Following an idea of Tibshirani (1996), we can view the LAD-lasso estimator as a Bayesian estimator in which each regression coefficient follows a double-exponential prior with location parameter 0 and scale parameter $n\lambda_j$. Then the optimal $\lambda_j$ can be selected by minimizing the following negative posterior log-likelihood function:

$$\sum_{i=1}^{n} |y_i - x_i'\beta| + n \sum_{j=1}^{p} \lambda_j |\beta_j| - \sum_{j=1}^{p} \log(.5 n \lambda_j). \qquad (2)$$

Simple calculation shows that, for a fixed $\beta$, (2) is minimized at $\lambda_j = 1/(n|\beta_j|)$. Of course, we do not know the values of $\beta_j$; however, they can be easily estimated by the unpenalized LAD estimator $\tilde\beta_j$, which produces $\hat\lambda_j = 1/(n|\tilde\beta_j|)$. Noting that $\tilde\beta_j$ is $\sqrt{n}$-consistent, it follows that the condition $\sqrt{n}\,a_n \to 0$ needed in the theorem is satisfied. However, the other condition, $\sqrt{n}\,\hat\lambda_j \to \infty$ for $j > p_0$, is not necessarily guaranteed. Hence directly using such a tuning parameter estimate tends to produce overfitted results if the underlying true model is indeed of finite dimension. This behavior is very similar to that of the AIC. In contrast, note that in (2) the complexity of the final model is actually controlled by $-\sum_j \log(.5 n \lambda_j)$, which plays a role similar to the number of significant variables in the AIC. This motivates us to consider the following BIC-type objective function:

$$\sum_{i=1}^{n} |y_i - x_i'\beta| + n \sum_{j=1}^{p} \lambda_j |\beta_j| - \sum_{j=1}^{p} \log(.5 n \lambda_j) \log(n),$$

which mimics the BIC by weighting the model complexity term with the factor $\log(n)$. This function produces the tuning parameter estimates

$$\hat\lambda_j = \frac{\log(n)}{n |\tilde\beta_j|}. \qquad (3)$$

It can be verified that these estimates satisfy both $\sqrt{n}\,\hat\lambda_j \to_p 0$ for $j \le p_0$ and $\sqrt{n}\,\hat\lambda_j \to_p \infty$ for $j > p_0$. Hence consistent variable selection is guaranteed by the final estimator.
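Continuing the sketch above, the plug-in rule (3) needs only one pilot median regression. The wrapper name fit_lad_lasso is again our illustrative choice, and the pilot coefficient estimates are assumed to be nonzero.

```r
# Sketch of the BIC-type plug-in tuning rule (3).
library(quantreg)

fit_lad_lasso <- function(y, x) {
  n <- nrow(x)
  # Unpenalized LAD (median regression) gives the pilot estimator beta-tilde.
  beta_tilde <- coef(rq(y ~ x - 1, tau = .5))
  lambda_hat <- log(n) / (n * abs(beta_tilde))  # equation (3); assumes no exact zeros
  lad_lasso(y, x, lambda_hat)                   # sketch from Section 2.1
}
```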


3. SIMULATION RESULTS

In this section, simulation studies are carried out to evaluate the finite-sample performance of LAD-lasso under heavy-tailed errors. For comparison purposes, the finite-sample performance of the LAD-based AIC, AICc, and BIC as specified by Hurvich and Tsai (1990), together with the oracle estimator, is also evaluated. For the AIC, AICc, and BIC, best subset selection is used.


The true regression coefficient vector is $\beta = (\ldots, 0, 0)'$; in other words, the first $p_0 = 4$ regression variables are significant, but the rest are not. For a given $i$, the covariate $x_i$ is generated from a standard eight-dimensional multivariate normal distribution. The sample sizes considered are $n = 50$, 100, and 200. Furthermore, the response variables are generated according to

$$y_i = x_i'\beta + \sigma \varepsilon_i.$$


Specifically, the following three error distributions are considered for $\varepsilon_i$: the standard double exponential, the standard t-distribution with 5 degrees of freedom ($t_5$), and the standard t-distribution with 3 degrees of freedom ($t_3$). Two different values of $\sigma$ are tested, .5 and 1.0, representing strong and weak signal-to-noise ratios.

For each parameter setting, a total of 1,000 simulation iterations are carried out to evaluate the finite-sample performance of LAD-lasso. The simulation results are summarized in Tables 1-3, which report the percentage of correctly (under-/over-) fitted regression models, together with the average number of correctly (mistakenly) estimated 0s, in the same manner as done by Tibshirani (1996) and Fan and Li (2001). Also included are the mean and median of the mean absolute prediction error (MAPE), evaluated based on another 1,000 independent testing samples for each iteration.
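As a rough illustration of this design, the sketch below generates one replicate and evaluates its out-of-sample MAPE. Because the true nonzero coefficient values are garbled in this copy, the entries of beta here are placeholders of our own choosing.

```r
# One simulation replicate under the design above (illustrative only).
set.seed(1)
n <- 100; p <- 8; sigma <- .5
beta <- c(1, 1, 1, 1, rep(0, 4))    # first p0 = 4 significant; values are placeholders
x <- matrix(rnorm(n * p), n, p)     # standard 8-dimensional normal covariates
y <- drop(x %*% beta) + sigma * rt(n, df = 3)   # heavy-tailed t3 errors

fit <- fit_lad_lasso(y, x)          # sketch defined in Section 2.3

# MAPE on 1,000 independent testing samples, as in the text
x_test <- matrix(rnorm(1000 * p), 1000, p)
y_test <- drop(x_test %*% beta) + sigma * rt(1000, df = 3)
mean(abs(y_test - drop(x_test %*% coef(fit))))
```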

As can be seen from Tables 1-3, all of the methods (AIC, AICc, BIC, and LAD-lasso) demonstrate comparable prediction accuracy in terms of mean and median MAPE. However, the selection results differ significantly. In the case of $\sigma = 1.0$, both the AIC and AICc demonstrate appreciable overfitting effects, whereas the BIC and LAD-lasso have very comparable performance, with both demonstrating a clear pattern of finding the true model consistently. Nevertheless, in the case of $\sigma = .5$, LAD-lasso significantly outperforms the BIC with a margin that can be as large as 30% (e.g., $n = 50$; see Table 3), whereas the overfitting effect of the BIC is quite substantial.

Keep in mind that our simulation results should not be mistakenly interpreted as evidence that the AIC (AICc) is an inferior choice compared with the BIC or LAD-lasso. As pointed out earlier, the AIC (AICc) is a well-known efficient but inconsistent model selection criterion that is asymptotically optimal if the underlying model is of infinite dimension. On the other hand, the BIC and LAD-lasso are useful consistent model selection methods if the underlying model is indeed of finite dimension. Consequently, all of the methods are useful for practical data analysis. Our theory and numerical experience suggest that LAD-lasso can deliver selection performance comparable to or even better than that of the BIC, but with a much lower computational cost.

Table 1. Simulation results with standard double-exponential errors

Method | Underfitted | Correctly fitted | Overfitted | No. of zeros (Incorrect) | No. of zeros (Correct) | Average MAPE | Median MAPE

For each combination of $\sigma \in \{1.0, .5\}$ and $n \in \{50, 100, 200\}$, the rows report these quantities for AIC, AICc, BIC, LASSO (i.e., LAD-lasso), and ORACLE. [The individual numeric entries are too garbled in this copy to reproduce reliably.]

Table 2. Simulation results with standard $t_5$ errors, in the same layout as Table 1. [The individual numeric entries are too garbled in this copy to reproduce reliably.]

4. A REAL EXAMPLE

4.1 The Chinese Stock Market


Whereas the problem of earnings forecasting has been extensively studied in the literature on the North American and European markets, the same problem regarding one of the world's fastest-growing capital markets, the Chinese stock market, remains not well understood. China resumed its stock market in 1991 after a hiatus of four decades. Since it reopened, the Chinese stock market has grown rapidly, from a few stocks in 1991 to more than 1,300 stocks today, with a total of more than 450 billion U.S. dollars in market capitalization. Many international institutional investors are moving into China looking for better returns, and earnings forecast research has become extremely important to their success.

In addition, over the past decade, China has enjoyed a high economic growth rate and rapid integration into the world market. Operating in such an environment, Chinese companies tend to experience more turbulence and uncertainty in their operations. As a result, they tend to generate more extreme earnings. For example, in our dataset, the kurtosis of the residuals obtained from an OLS fit is as large as 90.95, much larger than the value of 3 for a normal distribution. Hence the reliability of the usual OLS-based estimation and model selection methods (e.g., AIC, BIC, lasso) is severely challenged, whereas the LAD-based methods (e.g., LAD, LAD-lasso) become more attractive.


The data are taken from a database that was partially developed by the China Center for Economic Research (CCER) at Peking University and is considered one of the most authoritative and widely used stock market databases on the Chinese stock market. The dataset contains a total of 2,247 records, with each record corresponding to one yearly observation of one company. Among these, 1,092 come from year 2002 and serve as the training data, whereas the rest come from year 2003 and serve as the testing data.

The response variable is the return on equity (ROE) (i.e., earnings divided by total equity) of the following year.


Table 3. Simulation results with standard $t_3$ errors, in the same layout as Table 1. [The individual numeric entries are too garbled in this copy to reproduce reliably.]

The explanatory variables include the ROE of the current year ($\text{ROE}_t$), asset turnover ratio (ATO), profit margin (PM), debt-to-asset ratio or leverage (LEV), sales growth rate (GROWTH), price-to-book ratio (PB), accounts receivable/revenues (ARR), inventory/assets (INV), and the logarithm of total assets (ASSET). These are all measured at year t.

Past studies on earnings forecasting show that these explanatory variables are among the most important accounting variables in predicting future earnings. The asset turnover ratio measures the efficiency of the company in using its assets; the profit margin measures the profitability of the company's operations; and the debt-to-asset ratio describes how much of the company is financed by debt.

Table 4. Estimation results for the Chinese stock market data

Rows: Int, $\text{ROE}_t$, ATO, PM, LEV, GROWTH, PB, ARR, INV, and ASSET, followed by the MAPE and its standard error (STDE). Columns: AIC, AICc, BIC, LAD-lasso, LAD, and OLS. The AIC, AICc, and BIC columns coincide (MAPE .121379); the MAPEs of LAD-lasso, LAD, and OLS are .120662, .120382, and .233541. [The remaining numeric entries are too garbled in this copy to reproduce reliably.]


The sales growth rate and the price-to-book ratio measure the actual past growth and the expected future growth of the company. The accounts receivable/revenues ratio and the inventory/assets ratio have not been used prominently in studies of North American or European companies, but we use them here because Chinese managers have much more discretion over a company's receivable and inventory policies, as allowed by accounting standards. Therefore, it is relatively easy for them to use these items to manage reported earnings and influence earnings predictability. We include the logarithm of total assets for a similar reason, because large companies have more resources under control, which can be used to manage earnings.


Various methods are used to select the best model based on the training dataset of 2002. The prediction accuracies of these methods are measured by the MAPE based on the testing data for 2003. For the AIC and BIC, best subset selection is carried out. For comparison purposes, the results of the full model based on the LAD and OLS estimators are also reported; these are summarized in Table 4.

As can be seen, the MAPE of the OLS is as large as .233541. It is substantially worse than that of all of the LAD-based methods, further justifying the use of the LAD methods. Furthermore, all of the LAD-based selection criteria (i.e., AIC, AICc, and BIC) agree that PB, ARR, and INV are not important, whereas LAD-lasso further identifies the intercept, together with PM, LEV, and GROWTH, as insignificant variables. Based on a substantially simplified model, the prediction accuracy of the LAD-lasso estimator remains very satisfactory. Specifically, the MAPE of LAD-lasso is .120662, only slightly larger than the MAPE of the full LAD model (.120382). According to the reported standard error of the MAPE estimate (STDE), such a difference is clearly not statistically significant. Consequently, we conclude that among all of the LAD-based model selection methods, LAD-lasso results in the simplest model with satisfactory prediction accuracy.

According to the model selected by LAD-lasso, a firm with high ATO tends to have higher future profitability (positive sign of ATO). Furthermore, currently more profitable companies continue to earn higher earnings (positive coefficient on $\text{ROE}_t$). Large companies tend to have higher earnings because they tend to have more stable operations, occupy monopolistic industries, and have more resources under control to manage earnings (positive sign of ASSET). Not coincidentally, these three variables are the few variables that best capture a company's core earnings ability: efficiency (ATO), profitability ($\text{ROE}_t$), and company size (ASSET) (Nissim and Penman 2001). It is natural to expect that more accounting variables would also have predictive power for future earnings in developed stock markets such as the North American and European markets. However, in a newly emerging stock market with fast economic growth, such as the Chinese stock market, it is not surprising that only the few variables that best capture a company's core earnings ability survive our model selection and produce accurate predictions.

5. CONCLUDING REMARKS

In this article we propose LAD-lasso, which combines the ideas of LAD and lasso for robust regression shrinkage and selection. Similar ideas can be further extended to Huber's M-estimation using the following $\rho$-lasso objective function:

$$Q_\rho(\beta) = \sum_{i=1}^{n} \rho(y_i - x_i'\beta) + n \sum_{j=1}^{p} \lambda_j |\beta_j|,$$

where $\rho(\cdot)$ is Huber's function. Because the $\rho$-function contains the LAD function $|\cdot|$ as a special case, $\rho$-lasso would be expected to be even more efficient than LAD-lasso if an optimal transition point could be used. However, for real data, how to select such an optimal transition point automatically within a lasso framework is indeed a challenging and interesting problem. Further study along this line is definitely needed.
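Although selecting the transition point automatically remains open, the objective itself is simple to write down. The sketch below is purely illustrative: the Huber constant 1.345 is a conventional default and the generic optimizer is our own choice, not a recommendation (the penalty is nonsmooth, so a general-purpose optimizer is crude here).

```r
# Sketch of the rho-lasso objective with Huber's rho-function.
huber <- function(r, delta = 1.345) {
  ifelse(abs(r) <= delta, r^2 / 2, delta * (abs(r) - delta / 2))
}
rho_lasso_obj <- function(b, y, x, lambda) {
  sum(huber(y - drop(x %*% b))) + nrow(x) * sum(lambda * abs(b))
}
# Crude illustrative minimization (nonsmooth penalty, so only a rough sketch):
# optim(rep(0, ncol(x)), rho_lasso_obj, y = y, x = x, lambda = lambda_hat)
```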


ACKNOWLEDGMENTS

The authors are grateful to the editor, the associate editor, the referees, Rong Chen, and W. K. Li for their valuable comments and constructive suggestions, which led to substantial improvement of the manuscript. This research was supported in part by the Natural Science Foundation of China (grant 70532002).


APPENDIX A: PROOF OF LEMMA 1

We want to show that for any given $\epsilon > 0$, there exists a large constant $C$ such that

$$P\Big\{ \inf_{\|u\| = C} Q(\beta + n^{-1/2} u) > Q(\beta) \Big\} \ge 1 - \epsilon, \qquad (A.1)$$

where $u$ is a p-dimensional vector with $\|u\| = C$. Because the function $Q(\beta)$ is convex and piecewise linear, inequality (A.1) implies, with probability at least $1 - \epsilon$, that the LAD-lasso estimator lies in the ball $\{\beta + n^{-1/2} u : \|u\| \le C\}$. For convenience, define $D_n(u) \equiv Q(\beta + n^{-1/2} u) - Q(\beta)$. It then follows that

$$\begin{aligned} D_n(u) &= \sum_{i=1}^{n} \big( |y_i - x_i'\beta - n^{-1/2} u' x_i| - |y_i - x_i'\beta| \big) + n \sum_{j=1}^{p} \lambda_j \big( |\beta_j + n^{-1/2} u_j| - |\beta_j| \big) \\ &\ge \sum_{i=1}^{n} \big( |y_i - x_i'\beta - n^{-1/2} u' x_i| - |y_i - x_i'\beta| \big) - n^{1/2} a_n \sum_{j=1}^{p_0} |u_j|. \end{aligned} \qquad (A.2)$$

Denote the first term in the last line of (A.2) by $L_n(u)$. On the other hand, according to Knight (1998), it holds for $x \ne 0$ that

$$|x - y| - |x| = -y\,\mathrm{sgn}(x) + 2 \int_0^y \big[ I(x \le s) - I(x \le 0) \big]\, ds,$$

where $\mathrm{sgn}(x)$ equals 1 for $x > 0$ and $-1$ for $x < 0$.

Applying this identity with $x = \varepsilon_i$ and $y = n^{-1/2} u' x_i$ yields

$$L_n(u) = -n^{-1/2} \sum_{i=1}^{n} u' x_i\, \mathrm{sgn}(\varepsilon_i) + 2 \sum_{i=1}^{n} Z_{ni}(u), \qquad (A.3)$$

where $Z_{ni}(u) = \int_0^{n^{-1/2} u' x_i} [I(\varepsilon_i \le s) - I(\varepsilon_i \le 0)]\, ds$.

By the central limit theorem, the first term on the right side of (A.3) converges in distribution to $-u' W$, where $W$ is a p-dimensional normal random vector with mean 0 and variance matrix $\Sigma = \mathrm{cov}(x_1)$.

Next we show that the second term on the right side of (A.3) converges to a real function of $u$ in probability. Denote the cumulative distribution function of $\varepsilon_i$ by $F$. Then

$$n E[Z_{ni}(u)] = n E \int_0^{n^{-1/2} u' x_i} [F(s) - F(0)]\, ds = n E \int_0^{n^{-1/2} u' x_i} s f(0)\, ds + o(1) = .5 f(0)\, u' E(x_1 x_1') u + o(1).$$

Furthermore, for any $\delta > 0$,

$$P\big( n^{-1/2} \max(|u' x_1|, \ldots, |u' x_n|) > \delta \big) \le n P\big( |u' x_1| > \delta n^{1/2} \big) \le \delta^{-2} E\big[ |u' x_1|^2\, I(|u' x_1| > \delta n^{1/2}) \big] \to 0,$$

so that $n^{-1/2} \max_i |u' x_i| = o_p(1)$ and lemma A.2 of Koenker and Zhao (1996) can be applied. Choosing $\tilde\epsilon > 0$ and $0 < \delta < \infty$ such that $\sup_{|s| < \delta} f(s) < f(0) + \tilde\epsilon$, a truncation argument on the event $\{n^{-1/2} |u' x_i| < \delta\}$ then shows that

$$\mathrm{var}\Big( \sum_{i=1}^{n} Z_{ni}(u) \Big) \le \sum_{i=1}^{n} \mathrm{var}\big(Z_{ni}(u)\big) \le n E\big[Z_{ni}^2(u)\big] \to 0.$$

Consequently,

$$\sum_{i=1}^{n} Z_{ni}(u) \to_p \tfrac{1}{2} f(0)\, u' \Sigma u,$$

and the second term on the right side of (A.3) converges to $f(0)\, u' \Sigma u$ in probability.

By choosing a sufficiently large $C$, the second term in (A.3) dominates the first term uniformly in $\|u\| = C$. Furthermore, the second term in the last line of (A.2) converges to 0 in probability (because $\sqrt{n}\, a_n \to 0$) and hence is also dominated by the second term on the right side of (A.3). This completes the proof of Lemma 1.

APPENDIX B: PROOF OF LEMMA 2

By Bloomfield and Steiger (1983, p. 4), it can be seen that the LAD-lasso criterion $Q(\beta)$ is piecewise linear and reaches its minimum at some breaking point. Taking the first derivative of $Q(\beta)$ at any differentiable point $\beta = (\beta_1, \ldots, \beta_p)'$ with respect to $\beta_j$, $j = p_0 + 1, \ldots, p$, we obtain

$$n^{-1/2} \frac{\partial Q(\beta)}{\partial \beta_j} = -n^{-1/2} \sum_{i=1}^{n} \mathrm{sgn}(y_i - x_i'\beta)\, x_{ij} + n^{1/2} \lambda_j\, \mathrm{sgn}(\beta_j).$$

Let $\eta_n = M n^{-1/2}$, where $M$ is an arbitrary positive number, and consider any $\beta = (\beta_a', \beta_b')'$ satisfying $\sqrt{n}(\beta_a - \beta_a^0) = O_p(1)$ and $\|\beta_b\| \le \eta_n$,

where $\beta_a^0$ denotes the true value of $\beta_a$. For any such $\beta$,

$$n^{-1/2} \sum_{i=1}^{n} x_i\, \mathrm{sgn}(y_i - x_i'\beta) = O_p(1),$$

whereas $n^{1/2} \lambda_j \ge n^{1/2} b_n \to \infty$ for $j > p_0$. Hence the penalty term dominates, and

$$n^{-1/2} \frac{\partial Q(\beta)}{\partial \beta_j} \begin{cases} > 0 & \text{for } 0 < \beta_j < \eta_n, \\ < 0 & \text{for } -\eta_n < \beta_j < 0, \end{cases}$$

where $j = p_0 + 1, \ldots, p$. This completes the proof of Lemma 2.


APPENDIX C: PROOF OF THE THEOREM

Define $s_n(v) = Q(\beta_a + n^{-1/2} v, 0) - Q(\beta_a, 0)$. Then

$$s_n(v) = \sum_{i=1}^{n} \big( |y_i - x_{ai}'\beta_a - n^{-1/2} v' x_{ai}| - |y_i - x_{ai}'\beta_a| \big) + n \sum_{j=1}^{p_0} \lambda_j \big( |\beta_j + n^{-1/2} v_j| - |\beta_j| \big).$$

By the same arguments as in Appendix A, we know that the first term on the right side satisfies

$$\sum_{i=1}^{n} \big( |y_i - x_{ai}'\beta_a - n^{-1/2} v' x_{ai}| - |y_i - x_{ai}'\beta_a| \big) \to_d -v' W_0 + f(0)\, v' \Sigma_0 v,$$

where $W_0$ is a $p_0$-dimensional normal random vector with mean 0 and variance matrix $\Sigma_0$. Furthermore,

$$\Big| n \sum_{j=1}^{p_0} \lambda_j \big( |\beta_j + n^{-1/2} v_j| - |\beta_j| \big) \Big| \le n^{1/2} a_n \sum_{j=1}^{p_0} |v_j| \to 0.$$

Consequently, $s_n(v) \to -v' W_0 + f(0)\, v' \Sigma_0 v$ in distribution. The limit is a convex function of $v$ minimized at $v = \Sigma_0^{-1} W_0 / \{2 f(0)\}$, which is normally distributed with mean 0 and variance matrix $\Sigma_0^{-1} / \{4 f^2(0)\}$. Hence the required central limit theorem follows from lemma 2.2 and remark 1 of Davis, Knight, and Liu (1992). This completes the proof of the theorem.

[Received March 2005. Revised February 2006.]

REFERENCES

Akaike, H. (1973), "Information Theory and an Extension of the Maximum Likelihood Principle," in 2nd International Symposium on Information Theory, eds. B. N. Petrov and F. Csaki, Budapest: Akademiai Kiado, pp. 267-281.
Bassett, G., and Koenker, R. (1978), "Asymptotic Theory of Least Absolute Error Regression," Journal of the American Statistical Association, 73, 618-621.
Bloomfield, P., and Steiger, W. L. (1983), Least Absolute Deviations: Theory, Applications and Algorithms, Boston: Birkhäuser.
Craven, P., and Wahba, G. (1979), "Smoothing Noisy Data With Spline Functions: Estimating the Correct Degree of Smoothing by the Method of Generalized Cross-Validation," Numerische Mathematik, 31, 377-403.
Davis, R. A., Knight, K., and Liu, J. (1992), "M-Estimation for Autoregressions With Infinite Variance," Stochastic Processes and Their Applications, 40, 145-180.
Fan, J., and Li, R. (2001), "Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties," Journal of the American Statistical Association, 96, 1348-1360.
Hurvich, C. M., and Tsai, C. L. (1989), "Regression and Time Series Model Selection in Small Samples," Biometrika, 76, 297-307.
Hurvich, C. M., and Tsai, C. L. (1990), "Model Selection for Least Absolute Deviation Regression in Small Samples," Statistics and Probability Letters, 9, 259-265.
Knight, K. (1998), "Limiting Distributions for L1 Regression Estimators Under General Conditions," The Annals of Statistics, 26, 755-770.
Knight, K., and Fu, W. (2000), "Asymptotics for Lasso-Type Estimators," The Annals of Statistics, 28, 1356-1378.
Koenker, R., and Zhao, Q. (1996), "Conditional Quantile Estimation and Inference for ARCH Models," Econometric Theory, 12, 793-813.
Ling, S. (2005), "Self-Weighted Least Absolute Deviation Estimation for Infinite Variance Autoregressive Models," Journal of the Royal Statistical Society, Ser. B, 67, 1-13.
McQuarrie, A. D. R., and Tsai, C. L. (1998), Regression and Time Series Model Selection, Singapore: World Scientific.
Nissim, D., and Penman, S. (2001), "Ratio Analysis and Equity Valuation: From Research to Practice," Review of Accounting Studies, 6, 109-154.
Peng, L., and Yao, Q. (2003), "Least Absolute Deviation Estimation for ARCH and GARCH Models," Biometrika, 90, 967-975.
Pollard, D. (1991), "Asymptotics for Least Absolute Deviation Regression Estimators," Econometric Theory, 7, 186-199.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461-464.
Shao, J. (1997), "An Asymptotic Theory for Linear Model Selection," Statistica Sinica, 7, 221-264.
Shi, P., and Tsai, C. L. (2002), "Regression Model Selection: A Residual Likelihood Approach," Journal of the Royal Statistical Society, Ser. B, 64, 237-252.
Shi, P., and Tsai, C. L. (2004), "A Joint Regression Variable and Autoregressive Order Selection Criterion," Journal of Time Series Analysis, 25, 923-941.
Tibshirani, R. J. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Ser. B, 58, 267-288.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005), "Sparsity and Smoothness via the Fused Lasso," Journal of the Royal Statistical Society, Ser. B, 67, 91-108.
Zou, H., Hastie, T., and Tibshirani, R. (2004), "On the Degrees of Freedom of the Lasso," technical report, Stanford University, Dept. of Statistics.

