Variable Selection With The Knockoffs: Composite Null Hypotheses

1
Variable Selection with the Knockoffs:

Composite Null Hypotheses
Mehrdad Pournaderi, Student Member, IEEE, and Yu Xiang, Member, IEEE
Abstract—The Fixed-X knockoff filter is a flexible framework where S0 = {1 ≤ i ≤ p : βi = 0} denotes the set of true null
arXiv:2203.02849v1 [math.ST] 6 Mar 2022
for variable selection with false discovery rate (FDR) control variables and Ŝ = {i : H0,i rejected} the selected variables
in linear models with arbitrary (non-singular) design matrices by some variable selection procedure, and | · | the cardinality
and it allows for finite-sample selective inference via the LASSO
estimates. In this paper, we extend the theory of the knockoff of the sets. A selection rule controls the FDR at level q if its
procedure to tests with composite null hypotheses, which are corresponding FDR is guaranteed to be at most q for some
usually more relevant to real-world problems. The main technical predetermined q ∈ [0, 1].
challenge lies in handling composite nulls in tandem with depen- Recently, Barber and Candès proposed the (fixed-X) knock-
dent features from arbitrary designs. We develop two methods for off filter procedure [2], a data-dependent selection rule that
composite inference with the knockoffs, namely, shifted ordinary
least-squares (S-OLS) and feature-response product perturbation controls the FDR in finite sample settings and under arbitrary
(FRPP), building on new structural properties of test statistics designs. In this procedure, a test statistic is computed for
under composite nulls. We also propose two heuristic variants each feature through constructing a knockoff variable and a
of S-OLS method that outperform the celebrated Benjamini- feature is selected by (data-dependent) thresholding the statis-
Hochberg (BH) procedure for composite nulls, which serves as tics according to the target FDR. The knockoff construction
a heuristic baseline under dependent test statistics. Finally, we
analyze the loss in FDR when the original knockoff procedure is allows for correlated features and this framework has higher
naively applied on composite tests. statistical power in comparison with the Benjamini-Hochberg
(BH) procedure [1], [3]–[5] (using z-scores for the p variables)
Index Terms—Selective inference, composite hypothesis testing,
FDR control, knockoff procedure, Benjamini-Hochberg proce- in a range of settings [2]. The knockoff filter has inspired
dure. various formulations such as the model-X knockoffs and deep
learning based knockoffs among others [6]–[13].
The original fixed-X knockoff filter is focused on the simple
I. I NTRODUCTION nulls (βi = 0). A natural question is whether one can extend
Selecting variables from a large collection of potential ex- this to handle composite nulls, namely,
planatory variables that are associated with responses of in- ′
H0,i : |βi | ≤ δ
terest is a fundamental problem in many fields of science in- ′ , 1≤i≤p ,
H1,i : |βi | > δ
cluding genome-wide association study (GWAS), geophysics,
and economics. In this paper, we focus on the classical linear for some given δ ≥ 0. In this paper, we provide an affirmative
regression model, answer to the above question by developing two methods:
shifted ordinary least-squares (S-OLS) and feature-response
y = Xβ + w , w ∼ N (0, σ 2 In ) , (1) product perturbation (FRPP). We show that both methods
achieve FDR control in finite sample settings under arbitrary
where y and w are n-dimensional random vectors with
designs, leveraging new structural properties for test statistics
elements denoting response and error variables, respectively,
under composite nulls. The main technical difficult is to handle
X = [X1 , ..., Xp ] ∈ Rn×p denotes a fixed design matrix
composite nulls in tandem with dependent features. This is
containing n samples of p explanatory features/variables, and
highly nontrivial and, to the best of our knowledge, the closest
β = [β1 , ..., βp ]⊤ ∈ Rp is the vector of unknown fixed coeffi-
solution to this is the BH procedure for composite nulls but
cients relating X and E(y). For the following p hypotheses,
with independent test statistics [3, Theorem 5.2] (referred to
H0,i : βi = 0 as the composite BH in this paper). We use the composite BH
, 1≤i≤p
H1,i : βi 6= 0 as a heuristic baseline in our simulations, as the theoretical
guarantees no longer hold for composite nulls with dependent
the problem of interest is to test these hypotheses while test statistics. Our S-OLS method motivates two LASSO-based
controlling a simultaneous measure of type I error called the heuristic variants that outperform the composite BH in power.
false discovery rate (FDR), defined as follows [1], Furthermore, we quantify the loss in FDR when one uses the
" #
|Ŝ ∩ S0 | original fixed-X knockoff filter for composite nulls, and the
FDR := E , (2) result reduces to the exact FDR control for simple nulls.
max(|Ŝ|, 1) The paper is organized as follows. In Section II, we briefly
present the knockoff filter framework by introducing the main
M. Pournaderi and Y. Xiang are with the Electrical and Computer Engi-
neering Department, University of Utah, Salt Lake City, UT, 84112, USA steps to set the stage for our analysis. In Section III, we present
(e-mail: {mehrdad.pournaderi, yu.xiang}@utah.edu) our main results and theoretical guarantees, along with two
2
heuristic methods, with all the proofs deferred to Appendices. Remark 1. Since a column-wise normalization of the design
We report our experimental results in Section IV for a range matrix is natural for variable selection purpose and essential
of composite nulls, amplitude of alternatives, and correlation in terms of statistical power, throughout this paper, we always
coefficients. assume X is normalized by ℓ2 norm of columns
II. BACKGROUND : F IXED -X K NOCKOFF F ILTER B. Statistics

The knockoff variable selection procedure [2] consists of Using the knockoff features, we now compute a vector
two main steps: (I) computing an statistic Wj for each variable of anti-symmetric statistics W = (W1 , W2 , . . . , Wp )⊤ by
in the model, and (II) selecting Ŝko = {j : Wj ≥ T (q, Ω)}, regression over the augmented design [X X̃]. Let θ̂i and
where the threshold T ≥ 0 depends on the target FDR q and θ̂i′ denote some estimated parameters corresponding to the
the set of computed statistics, i.e., Ω = {W1 , . . . , Wp }. In this variables Xi and X̃i . In this case, one can define an statistic
section we look at this procedure in more details. as follows
Wi = |θ̂i | − |θ̂i′ | , 1≤i≤p. (6)
A. Knockoff Design
We can also define the statistics differently,
The knockoff methodology for detecting non-null variables
involves creating a fake (or knockoff) design that maintain Wi = sgn(|θ̂i | − |θ̂i′ |) max(|θ̂i |, |θ̂i′ |) . (7)
the correlation structure of X but break down the relationship
between X and y. To be more precise, the term anti-symmetric here means
that swapping Xi and X̃i (in the regression problem) for
Assumption 1. Σ = X⊤ X is invertible. any subset of indices F ⊆ {1, 2, . . . , p} has the effect of
Specifically, if n ≥ 2p, [2] proposes the following construc- switching the signs of {Wi : i ∈ F }. Specifically, the
tion to produce knockoff designs knockoff framework guarantees the FDR control when the
(anti-symmetric) statistics are computed based on estimators
X̃(s) = X (Ip − Σ−1 diag{s}) + ŨC, (3) that depend on the data through the following form,
where s ∈ Rp+ is a free vector of parameters as long as it θ̂2p×1 = T (G, [X X̃]⊤ y) , (8)
satisfies G := [X X̃]⊤ [X X̃] 0, C is obtained by Cholesky
where T is a deterministic operator. For instance, the
decomposition of the Schur complement of G, and Ũn×p is
LASSO [14] regression estimates β̂ LAS given by
an orthonormal matrix that satisfies Ũ⊤ X = 0 (see [2] for
details). Therefore, knockoff matrices are not unique and they n 2 o

β̂ LAS := arg min y − [X X̃]b + λkbk1 , λ ≥ 0, (9)
are constructed according to the original design matrix. Using b ∈R2p 2
(3), the following relation can be easily verified.
can be considered as an example of estimators satisfying (8).
Σ Σ − diag{s}
G= . (4)
Σ − diag{s} Σ
C. FDR Control
In fact, this construction not only preserves the correlation In this subsection, we briefly discuss the existing approaches
structure of X, but also has another subtle yet important to show the FDR control in the knockoff procedure. Let
geometrical implication: Xi (i-th column of X) and X̃i PF denote the 2p × 2p permutation matrix corresponding to
are second-order “exchangable” in a deterministic sense, i.e., swapping Xi and X̃i for all i ∈ F , then by the structure of
swapping Xi and X̃i does not change the inner product struc- knockoff matrix we have,
ture (Gram matrix) of the augmented design. This property
makes X̃ an appropriate tool for FDR control. It should PF⊤ GPF = G , F ⊆ {1, . . . , p} . (10)
be noted that the detection power highly depends on the
Also, relying on Assumption 1, we get
parameter s as it determines the angle between a feature and
its corresponding knockoff. In other words, si (i-th element of d
PF⊤ [X X̃]⊤ y = [X X̃]⊤ y , F ⊆ S0 . (11)
s) controls how different (or orthogonal) Xi and X̃i would be.
Assuming the columns of X are normalized by the Euclidean The identities (10) and (11) immediately imply an interesting
norm, one way to choose s is to solve the following convex property of the estimates θ̂: the estimated parameters for
problem, null variables and their corresponding knockoff variables are
p exchangeable. This property along with the anti-symmetric
X
minimize |1 − si | (5) structure of the statistics are the main ingredients of the FDR
i=1
control proof in the original paper. In fact, under the simple
subject to si ≥ 0 , diag{s} 2 Σ . null hypotheses, these properties lead to the so-called i.i.d.
sign property of the nulls, i.e., the signs of null statistics
This semi-definite programming minimizes the average cor- (signs of W = {Wj : βj = 0}) are independent of
relation between variables and their corresponding knockoff the magnitudes and have i.i.d. Rademacher distribution (i.e.,
variable. P(+1) = P(−1) = 1/2). In this case, using the martingale
3
theory, it is shown that rejecting {j : Wj ≥ T } with the In fact, showing R ≤ 1 in the knockoff procedure framework
following threshold controls the FDR at level q. d
means that the procedure overesitmates the FDP(t) and there-
n o fore, can be interpreted as an equivalent for super uniformity of
d
T = inf t ∈ Ψ : FDP(t) ≤q , (12) null p-values in terms of BH procedure. However, for bounds
R ≤ B where B > 1, we need to correct the test size by a
d 1 + #{j : Wj ≤ −t} factor of 1/B, i.e., q ′ = q/B. In cases where R = ∞, we
FDP(t) := , (13)
#{j : Wj ≥ t} ∨ 1 use a generalization of this argument which leads to tighter
bounds [8]. More precisely, we bound the following quantity.
where Ψ = {|Wi | : i = 1, 2, . . . p} \ {0}. Although this n o
approach is elegant, we shall not follow it in development
P Wj > 0, Aj |Wj |, W −j
of composite tests as in this situation (11) fails to hold R′ = sup n

o . (16)
immediately. On the other hand, [8] provides another proof j∈S0 , |Wj |6=0 P Wj < 0 |Wj |, W −j
for FDR control1 which requires weaker conditions on the
statistics [8, equation (16)]. Specifically, it suggests that if the where Aj is some event regarding the j-th variable.
S In this
null statistics satisfy case, if R′ ≤ B ′ , we get FDR ≤ q · B ′ + P j∈S0 Acj [8,
n o Theorem 2]. This bound characterizes the FDR loss term and
provides intuition and insight about the loss of FDR under
P Wj > 0 |Wj |, W −j
n o different values of parameters. In the following subsection we

≤ C · P Wj < 0 |Wj |, W −j , (14) present composite inference methods that allow for theoretical
FDR control.
for some C > 0, then the knockoff procedure with the target
FDR q/C controls the FDR at level q. It is straightforward
to verify that in the case of simple nulls, this condition holds A. Composite Testing with FDR Control
with C ≥ 1 according to the i.i.d. sign property of the null The following theorem concerns the composite knockoff
statistics, resulting in FDR control. procedure based on the ordinary least-squares.3 We consider
both one-sided and two-sided null hypothesis, i.e., βj ≤ δ and
III. M AIN R ESULTS |βj | ≤ δ, respectively. We show that shifting the estimates
corresponding to the knockoff variables by δ would result in
In case of a single composite test of the form |βi | ≤ δ, exact FDR control in the former case (R ≤ 1). Regarding the
the common approach is to compute super uniform p-values latter case, we derive a bound of the FDR of this approach
under the null (i.e., P (P ≤ t) ≤ t) which clearly controls the (R = ∞, R′ ≤ 1).
probability of type I error by definition. This usually happens
when the p-value is computed according to the distribution Theorem 1 (S-OLS). Consider the knockoff procedure (target
0
corresponding to a parameter on the boundary of null region. FDR=q) and based on the estimates β̂S-OLS = β̂ OLS + p×1
However, in case of composite multiple testing problems, δ ′p×1
′ ′
this argument gets more complicated as the dependencies with some δi ≥ δ for all 1 ≤ i ≤ p. Let β̂j and β̂j denote the
between the statistics should be considered. The composite j-th and (j + p)-th elements of β̂S-OLS .
BH procedure [3, Theorem 5.2] guarantees the FDR control (I) Consider testing H0 : βj ≤ δ for 1 ≤ j ≤ p. If Wj =
when the null p-values are super uniform, but it is only valid β̂j − β̂j′ , then the procedure controls the FDR at level q.
for independent p-values. On the other hand, the kncockoff
H
(II) Consider testing 0 : |βj | ≤ δ for 1 ≤ j ≤ p. If
filter is designed to utilize the model and covarites structure sgn(Wj ) = sgn β̂j − β̂j′ , then
for computing one-bit p-values that handle the dependencies
naturally. In developing the composite tests, it turns out
FDR ≤ q + P min β̂j + β̂j′ < 0 .
that we can actually maintain this property of the knockoff j∈S0
procedure. To show the FDR control we rely on manipulating
the estimates so that we can (upper) bound the following ratio2
as it is known that this will lead to rigorous FDR control at Remark 2. We note that β̂j + β̂j′ ∼ N (δj′ + βj , 2σ 2 sj ).
level q ·R (see [8, equation (16) and proof of Theorem 2] with Therefore, for small |S0 | and large (δj′ +βj )2 /(σ 2 sj ) the FDR
Ej = 0). control is theoretical. This also evidences how larger shifts δ ′
n o would help the theoretical FDR control.

P Wj > 0 |Wj |, W −j Corollary 1 (Alternative method for Theorem 1 (II)).
R= sup n o . (15)
Consider testing H0 : |βj | ≤ δ for 1 ≤ j ≤ p but let δj′ ≤ −δ.
j∈S0 , |Wj |6=0 P Wj < 0 |Wj |, W −j
If sgn(Wj ) = sgn β̂j − β̂j′ , then

1 The results in [8] concern robustness of the Model-X knockoff framework
(where X is random) but the FDR control proof works for the Fixed-X setting FDR ≤ q + P max β̂j + β̂j′ > 0 .
as well, since they only rely on antisymmetry of the statistics and (14). j∈S0
2 Recall that the knockoff procedure discards D = {j : |W | = 0} and as
n j o 3 Note that performing the knockoff procedure using the OLS estimator

a result T > 0 by definition. Also, we have P Wj < 0 |Wj |, W −j > 0 requires that G is invertible. In Lemma 1 we show that this is the case if
for |Wj | 6= 0 since supp (Wj ) = R. max si < 2λmin (Σ).
1≤i≤p
4
Our next method is not restricted to the OLS estimates. This IV. S IMULATIONS
makes the situation difficult in the sense that the analysis of the In this section we present simulation results on synthetic
shifted estimates would not be as easy as before. Therefore, data sets for all methods. We set the sample size and dimension
we propose to introduce artificial randomness to the procedure to be n = 2000 and p = 800, respectively. We generate
to be able to perform composite inference in such general samples (rows of X) i.i.d. according to N (0, Sp ) where
setting. Specifically, we perturb the feature-response products Sij = ρ|i−j| , |ρ| < 1 and then normalize the columns of
[X X]e ⊤ y by noise generated from Laplace distribution. In this
X by the ℓ2 -norm. The responses y are generated according
case, we will be able to show that R ≤ eǫ with ǫ determined to the linear model (1) with noise variance σ 2 = 1 and the
by the noise variance. number of true alternatives is k = 100. The composite null
Theorem 2 (FRPP). Fix some ǫ > 0. Define the null variables boundary is set to be δ = 1, and we consider two distributions
iid
S0 = {i : |βi | ≤ δ} and let ∆i ∼ Lap (2 si δ/ǫ) where si = for generating coefficients corresponding to null variables:
1 − X⊤ e (a) β0 ∼ U [−1, 1] which is presented in the left column of
i Xi , 1 ≤ i ≤ p. The knockoff procedure with the target
FDR q/eǫ , and using (antisymmetric) statistics based on any the figures. We consider this setting as a practical case.
estimator of the form T (G, [X X̃]⊤ y + ∆) controls the FDR (b) β0 ∼ Rademacher, which is presented in the right
at level q. column of the figures. This setting tries to examine the
methods in the hardest situation (worst case) for FDR control.
We note that this theorem gives an stochastic generalization The super uniform p-values for the composite BH pro-
of the knockoff procedure, i.e., if δ = 0 the FRPP method cedure are computed according to Pj = 2P N (δ, 1) ≥
reduces to the original method without any additional assump-
|β̂j | . We adopt the equicorrelated knockoffs and set the
tions. elements of s (the vector used in creating the knockoff
matrices) to be min(1.8λmin (Σ), 1), min(λmin (Σ), 1), and
B. Heuristic Methods min(2λmin (Σ), 1) for the S-OLS, FRPP, and heuristic meth-
Motivated by our results that show the shifting argument ods (S-LAS1 and S-LAS2), respectively. We use the structure
is theoretically valid in case of using the OLS estimator, (7) to construct the coefficient signed max statistics. The target
we propose the following two methods based on shifting the FDR is q = 0.2 and the plots are based on averaging 200 trials.
LASSO estimates. The power is defined as follows,
Method S-LAS1: This method shifts the LASSO estimates 1
Power := EŜ ∩ S0c . (18)
just as the S-OLS method. Namely, we use the following k
formulae to compute the statistics. Remark 5. In [2], it is suggested to set si =

0 min(2λmin (Σ), 1) for equicorrelated knockoffs. However, this
β̂ S-LAS1 = β̂ LAS (λ) + is not possible in case of using OLS estimator as it would
δ
result in a singular augmented design if 2λmin (Σ) ≤ 1 (See
Method S-LAS2: This method estimate the coefficients by Lemma 1). Regarding the FRPP method, note that unlike the
solving the following LASSO problem. deterministic methods larger values of si does not result in
( 2 )
0 higher detection power necessarily because the variance of
β̂ S-LAS2 := arg min y − X X̃ b + δ
+ λ b ,
1 the additive Laplacian noise is proportional to s2i . In our
b ∈R2p 2
experiments it turns out that for ρ = 0 (no correlation)
where λ ≥ 0. case setting the procedure reaches to its highest power when
si = min(λmin (Σ), 1). However this is not the case for higher
Remark 3. We note that both methods reduce to the S-OLS
correlations since λmin (Σ) will be small. In this case the
method if λ = 0.
maximum power is reached when min(2λmin (Σ), 1).
C. FDR Bound for Naive Selection

Theorem 3. Naive application of the (fixed) knockoff pro- V. D ISCUSSION
cedure (with target FDR q) on composite null hypotheses Fixed-X knockoff procedure is an elegant method for se-
S0 = {i : |βi | ≤ δ} and using (antisymmetric) statistics based lective inference in general linear models with FDR control
on any estimator of the form T (G, [X X̃]⊤ y), will result in guarantee. However, the composite extension of this method
FDR bounded as follows. has not been developed yet. In this paper, we have investigated
( ) the Fixed-X knockoff filter approach to the variable selection
δ
ǫ
FDR ≤ min q . e + P 2 max γj − γj > ǫ ′
, (17) problem with composite nulls and under arbitrary depen-
ǫ≥0 σ j∈S0 dencies among statistics. The knockoff inference procedure
handles the dependencies between variables very naturally by
γ p×1 e ⊤ y. computing model-based statistics. We have shown that this
where = [X X]
γ ′ p×1 structure is still useful under the composite nulls and allows for
Remark 4. We note that the bound (17) will reduce to q in development of methods with theoretical FDR control guar-
the case of simple nulls, i.e., δ = 0. It also evidences that antee. We have derived a full stochastic generalization of the
when δ ≪ σ 2 the FDR loss will be negligible. knockoff procedure which has reasonable statistical power. We
5
A PPENDIX A
0.4 0.4
S-OLS
C-BH
S-OLS
C-BH
T ECHNICAL L EMMAS
FRPP (eps=0.6) FRPP (eps=0.6)
0.3 FRPP (eps=0.8)
S-LAS1
0.3 FRPP (eps=0.8)
S-LAS1 Lemma 1. If max si < 2λmin (Σ), then G is invertible.
S-LAS2 S-LAS2 1≤i≤p
Target FDR Target FDR
FDR
FDR
0.2 0.2
Proof. Recall,

0.1 0.1
Σ Σ − diag(s)
G= , (19)
Σ − diag(s) Σ
0 0
4 6 8 10 12 14 16 4 6 8 10 12 14 16
1 1 where s ∈ Rp+ . We note,

0.8 0.8
I −I A B I I A−B O
= .
0.6 0.6 O I B A O I B A+B
Power
Power

0.4 0.4
Therefore, det(G
− λI) = det diag(s) − λI det 2Σ −
0.2 0.2 diag(s) − λI and as a result, the set of eigenvalues of G
is the union of the eigenvalues of diag(s) and 2Σ − diag(s).
0 0
4 6 8 10
Amplitude
12 14 16 4 6 8 10
Amplitude
12 14 16 If max si < 2λmin (Σ) holds, then λmin (2Σ − diag(s)) > 0
1≤i≤p
which implies that G is positive definite and therefore, invert-
(a) (b) ible.
Fig. 1. FDR and power vs Amplitude of the true alternatives. In this
experiment the correlation coefficient is set to be ρ = 0 and all of the true Lemma 2. Let G = [X X] e ⊤ [X X]
e and Σ = X⊤ X. If G is
alternatives have the same underlying coefficient. It can be observed that both −1
heuristic methods outperform the composite BH procedure.
invertible, G has the following structure.

A A−D
G−1 = ,
A−D A
0.4
S-OLS
0.4
S-OLS
where D is a diagonal matrix.
C-BH C-BH
FRPP (eps=0.8) FRPP (eps=0.8)
0.3 S-LAS1 0.3 S-LAS1 Proof. Recall,
S-LAS2 S-LAS2
Target FDR Target FDR
Σ Σ − D∗
FDR
FDR
0.2 0.2
G= , (20)
Σ − D∗ Σ
0.1 0.1
with some diagonal D∗ . From the inverse of 2 × 2 block
0
-0.5 0 0.5
0
-0.5 0 0.5
matrices we have

1 1
−1 J − J(Ip − D∗ Σ−1 )
G = .
0.8 0.8 −J(Ip − D∗ Σ−1 ) J
0.6 0.6
where J = (2D∗ − D∗ Σ−1 D∗ )−1 . We observe,
Power
Power
0.4 0.4
J − − J(Ip − D∗ Σ−1 ) = D∗ −1 . (21)
0.2 0.2
0 0
Therefore, A = (2D∗ − D∗ Σ−1 D∗ )−1 and D = D∗ −1 .
-0.5 0 0.5 -0.5 0 0.5
Correlation Coefficient ( ) Correlation Coefficient ( )
γ p×1
Lemma 3. Let ′
e ⊤ y. It holds that E(γj −
= [X X]
(a) (b) γ p×1
Fig. 2. FDR and power vs Correlation coefficient ρ. In this experiment γj′ ) ≤ sj δ for all j ∈ S0 .
the coefficients corresponding to the true alternatives is set to be βa = 8.
Although both heuristic methods outperform the BH procedure in high γ β
Proof. We observe E = G . According to the
correlations, it can be observed that the second heuristic method fails to control γ′ 0p×1
the FDR. structure of G (20), we get

E(γj − γ ′ ) = G(j) − G(j+p) β
j 0

have shown that if we restrict ourselves to the ordinary least-
= Σ(j) − Σ(j) − D(j) ∗
β
squares estimates, the intuitive (and deterministic) method of
shifting the estimate would be theoretically valid. We have = sj |βj | ≤ sj δ ,
also derived a general bound on the FDR for cases where
the original knockoff procedure is applied on a composite where the index (j) denotes the j-th row of the matrices and
problem, without any additional assumptions. the inequality holds according to the definition of S0 .
6
p−1
A PPENDIX B with some cj ∈ R . Hence,
P ROOF OF T HEOREM 3
⊤
γ p×1 γ̆j cj cj a − γ̆ −j
Lemma 4. Let = [X X] e ⊤ y. For fixed X and any ηj = + (25)
γ ′ p×1 γ̆j′ cj cj b − γ̆ ′−j

(deterministic) anti-symmetric f , the vector of statistics Ŵ = γ̆j + cj ⊤ (a + b − γ̆ −j − γ̆ ′−j ) γ̆j + dj
= = .
e ⊤ [X X],
f ([X X] e [X X] e ⊤ y) statisfies the following properties. γ̆j′ + cj ⊤ (a + b − γ̆ −j − γ̆ ′−j ) γ̆j′ + dj
Similarly, for the covariance matrix R we have,
(I) For all j ∈ S0 and (u, v) ∈ R2 ,
h i⊤ h i e e
n o e e j j
Rj = Xj Xj Xj Xj − , (26)
P Wj > 0 (γj , γj′ ) ∈ U, (γ −j , γ ′−j ) ≤ ej ej
δ n o
where ej ∈ R does not depend on a or b. Let h(·, ·) denote
exp 2 u − v P Wj < 0 (γj , γj′ ) ∈ U, (γ −j , γ ′−j ) ,
σ h pdf
∼ N (ηj , σ 2 Rj ). The proof for u = v is
′
(γj ,γj ) (γ −j ,γ −j )
′
where U = {(u, v), (v, u)}. trivial as it implies Wj = 0. Suppose u 6= v. In this case,
(II) For all 1 ≤ j ≤ p we have, since swapping γj and γj′ would switch the sign of Wj , we
get

sgn(Wj ) ⊥⊥ sgn(W −j ) γ, γ ′ V , (23) n o

P Wj > 0 (γj , γj′ ) ∈ U, (γ −j , γ ′−j )
where {·, ·}V denotes a vector of unordered pairs and sgn(·) n o

operates coordiante-wise. P Wj < 0 (γj , γj′ ) ∈ U, (γ −j , γ ′−j )
( )
Proof. (I) We compute the conditional distribution
h(u,v) h(v,u)
γ γ̆ ≤ max
h(u,v)+h(v,u)
,
h(u,v)+h(v,u)
(γj , γj′ )(γ −j , γ ′−j ). We note that ′ ∼ N 2
′ ,σ G h(v,u) h(u,v)
γ γ̆ h(u,v)+h(v,u) h(u,v)+h(v,u)
γ̆ β
where = G according to (1). Therefore we h(u, v) h(v, u)
γ̆ ′ 0p×1 = max , . (27)
h(v, u) h(u, v)
have,
By a π
counter-clockwise rotation of h we get
4
(γj , γj′ )(γ −j , γ ′−j ) = (a, b) ∼ N (ηj , σ 2 Rj ) , (24)
1 1
h(u, v) h1 √2 (u − v) h2 √2 (u + v)
where =
h h(v, u) h1 √1 (v − u) h2 √1 (v + u)
i⊤ h i 2 2
γ̆j
ηj = + X e
X X e
X h1 √12 (u − v)
γ̆j′ j j −j −j
= , (28)
−1
!−1 h1 √ (u − v)
h i⊤ h i h γ̆ −j i
2
e −j e −j a
X−j X X−j X − ′ , where,
b γ̆ −j

pdf 1 ′
2 (11) (12)
and h1 ∼ N √ ηj − ηj , σ Rj − Rj ,
h i⊤ h i h i⊤ h i 2
ej e j − Xj Xej e −j
Rj = Xj X Xj X X−j X pdf 1 (11) (12)
! h2 ∼ N √ ηj + ηj ′ , σ 2 Rj + Rj .
h i⊤ h i −1 h i⊤ h i 2
X−j Xe −j e −j
X−j X X−j Xe −j Xj Xej .
According to (25) and (26), we have

From the definition of the knockoff variables we have 1
i z z ⊤
(11) (12)
h i⊤ h N √ ηj − ηj ′ , σ 2 Rj − Rj =
Xj Xej X−j X e −j = j j
with some z j ∈ R
p−1
2
zj z j 1 2
′
and according to Lemma 2 we have N √ γ̆j − γ̆j , σj , (29)
! 2
h i⊤ h i −1
e e A A−D where σj2 = sj σ 2 and sj = 1 − X⊤ e
X
X−j −j X−j −jX = , j Xj since the columns of
A−D A X are normalized. Therefore, h1 is free of the given values
for (γ −j , γ ′−j ), which also means
with some symmetric positive-definite A ∈ R(p−1)×(p−1) and

diagonal D. Therefore we get d
γj − γj′ γ −j , γ ′−j = γj − γj′ , (30)
!−1
h i⊤ h i h i⊤ h i
ej e −j e −j e −j and hence γj − γj′ ⊥⊥ γ −j , γ ′−j . We now use (28) and
Xj X X−j X X−j X X−j X
(29) to bound the RHS of (27) under H0 ,
⊤ ( )
c cj h1 √12 (u − v) h1 √ −1
(u − v)
= j , max , 2

cj cj h1 √ −1
(u − v) h1 √12 (u − v)
2
7

2
exp − 1 γ̆j − γ̆ ′ − u − v the tower property since |Wj |, W −j is a function of
4σj2 j
≤ {γj , γj′ }, (γ −j , γ ′−j ) .
2
1
exp − 4σ2 γ̆j − γ̆j + u − v ′
j Hence (17) holds according to the Theorem 2 in [8].

1
′

= exp γ̆j − γ̆j u − v
σ2
j A PPENDIX C
sj δ
≤ exp u−v
σ2
j β̂
δ

Lemma 6. If we use estimates with distribution ∼
= exp u−v , β̂ ′
σ2
β̆
N 2
′ ,σ G
−1
to compute the statistics, then, the fol-
where the second inequality is a consequence of γ̆j − γ̆j′ ≤ β̆
sj δ under H0 (Lemma 3). lowing properties hold for any 1 ≤ j ≤ p.

(II) According to (25) and (26), the conditional distribution (I) If sgn(Wj ) = sgn β̂j − β̂ ′ , then
j
(24) only depends on a + b (through ηj ). Hence, we have

d
sgn(Wj ) ⊥⊥ W −j β̂j , β̂j′ , (31)
γj , γj′ γ −j , γ ′−j = γj , γj′ γ −j , γ ′−j V ,
We observe that the unordered pair {γj , γj′ } is a function of where {·, ·} denotes an unordered pair.
(γj , γj′ ). Therefore, by conditioning on it we get, (II) If Wj = β̂j − β̂j′ , then

d
γj , γj′ (γ −j , γ ′−j ), {γj , γj′ } = γj , γj′ γ, γ ′ V . Wj ⊥⊥ W −j . (32)
Therefore,
Proof. (I) We compute the conditional distribution,
d
sgn(Wj ) sgn(W −j ), {γ, γ ′ V = sgn(Wj ) γ, γ ′ V , ′

β̂−j , β̂ −j β̂j , β̂j′ ) = (a, b ∼ N (ξ, C) .
which implies (23) immediately.
Lemma 5. For all j ∈ S0 we have We note that according to Lemma 2, G and G−1 have the
n o same structure. Therefore, similar calculations as in the proof

P Wj > 0 , σδ2 γj − γj′ ≤ ǫ |Wj |, W −j of Lemma 4 gives,
n

o ≤ eǫ . ! !
P Wj < 0 |Wj |, W −j β̆ −j c c a − β̆j
ξ= ′ + ,
β̆ −j c c b − β̆j′
Proof.
n δ o
P Wj > 0 , 2 γj − γj′ ≤ ǫ |Wj |, W −j ′ Q Q
σ C= Cov(β̂ −j , β̂ −j ) − ,
Q Q
δ
= E P Wj > 0 , 2 γj − γj′ ≤ ǫ {γj , γj′ },
σ with some c ∈ R
p−1
and symmetric Q ∈ R(p−1)×(p−1) that

′
(γ −j , γ −j ) |Wj |, W −j does not depend on a and b. We note that the conditional
n distribution depends on the pair (a, b) only through their sum
δ o
a + b, so it is free of the order of the pair. Thus, we get
= E 1 2 γj − γj′ ≤ ǫ P Wj > 0 {γj , γj′ },
σ
′
′ d
β̂ −j , β̂−j β̂j , β̂j′ = β̂ −j , β̂−j β̂j , β̂j′ .
(γ −j , γ ′−j ) |Wj |, W −j
n o δ Therefore,
δ
≤ E 1 2 γj − γj′ ≤ ǫ exp 2 γj − γj′ d
σ σ
W −j sgn(Wj ), β̂j , β̂j′ = W −j β̂j , β̂j′ ,

P Wj < 0 {γj , γj }, (γ −j , γ ′−j )
′
which implies (31) immediately.

|Wj |, W −j (II) Again, since G and G−1 have the same structure

(according to Lemma 2), the same argument that led to (30)
ǫ ′ ′ applies and we get
≤ e E P Wj < 0{γj , γj }, (γ −j , γ −j ) |Wj |, W −j
n o ′
β̂j − β̂j′ ⊥⊥ β̂−j , β̂−j . (33)
= eǫ P Wj < 0 |Wj |, W −j ,
where {·, ·} denotes an unordered pair, the first in- This would imply (32) since Wj and W −j are functions of
′
equality holds according to Lemma 4 (I), and we used β̂j − β̂j′ and β̂ −j , β̂−j , respectively.
8

A. One-sided Test: for all realizations of W −j and β̂j , β̂j′ . Since β̂j + β̂j′ is a

According to (15), it is sufficient to show that for any j ∈ S0 function of β̂j , β̂j′ , we have
and |Wj | 6= 0, we have

n o P Wj > 0, β̂j + β̂j′ > 0W −j , β̂j , β̂j′

P Wj > 0 |Wj |, W −j
n o ≤1. P Wj < 0 W −j , β̂j , β̂j′
n o
P Wj < 0 |Wj |, W −j
1 β̂j + β̂j′ > 0 P Wj > 0W −j , β̂j , β̂j′
=
Also since |Wj | is a function of Wj , (32) yields
P Wj < 0 W −j , β̂j , β̂j′
n o n o
′
P Wj > 0 |Wj |, W −j P Wj > 0 |Wj | (∗)
n o P W j > 0 β̂ j , β̂ j
n o= n o . = 1 β̂j + β̂j′ > 0

P Wj < 0 |Wj |, W −j P Wj < 0 |Wj | P Wj < 0 β̂j , β̂j′

n o ′ ′
Using the tower property we get, (∗∗)
P β̂ j > β̂ j β̂ j , β̂ j
= 1 β̂j + β̂j′ > 0
n o
P β̂j < β̂j β̂j , β̂j′
′
′
E P Wj > 0 β̂j , β̂j |Wj |
P Wj > 0 |Wj |
n o =
P β̂j > β̂j′ β̂j , β̂j′
P Wj < 0 |Wj |
E P Wj < 0 β̂j , β̂j′ |Wj | ≤
≤ 1 ,
P β̂j < β̂j′ β̂j , β̂j′

′
E P β̂j > β̂j β̂j , β̂j |Wj |
′
where (*) holds according to (31) and (**) holds since β̂j >
= .
β̂j′ ⇔ |β̂j | > |β̂j′!
| when β̂j + β̂j′ > 0. The last inequality
′ ′
E P β̂j < β̂j β̂j , β̂j |Wj |
β̂ β
follows from ′ ∼N , σ 2 G−1 and the definition
β̂ δ
Hence, it will be sufficient to prove,
of (composite) null variables (|βj | ≤ δ). Hence the following

P β̂j > β̂j′ β̂j , β̂j′ = {u, v} inequality holds according to the Theorem 2 in [8].
≤1, n o
FDR ≤ q + P min β̂j + β̂j′ < 0 .
P β̂j < β̂j′ β̂j , β̂j′ = {u, v}
j∈S0
′

for all j ∈ S0 and u, v > 0. But since β̂j , β̂j is bivariate
normal with E(β̂j′ ) = δ ′ ≥ δ ≥ E(β̂j ) for j ∈ S0 , the result
is immediate. A PPENDIX D
In this method, the Laplace noise ∆2p×1 is added to the
B. Two-sided Test:
feature-response products,
According to [8], it is sufficient to show that for any j ∈ S0
[X X]e ⊤ y + ∆ = G β + ∆ + [X X] e ⊤w ,
and |Wj | 6= 0, we have
n o
iid e i, β = β
P Wj > 0, β̂j + β̂j′ > 0 |Wj |, W −j where ∆i ∼ Lap(2 si δ/ǫ), si = 1−X⊤ i X , and
0p×1
n o ≤1 .
W is the model noise. For simplicity in notation we re-write
P Wj < 0 |Wj |, W −j
the equation as follows,

According to the tower property we have, κp×1 θ α
= + ,
n o κ′p×1 θ′ α′

P Wj > 0, β̂j + β̂j′ > 0 |Wj |, W −j
n o = θ
where = Gβ +∆. By Lemma 3 we have E(θj −θj′ ) ≤
P Wj < 0 |Wj |, W −j θ′
sj δ for all j ∈ S0 . Therefore, the following inequality holds

′ ′
E P Wj > 0, β̂j + β̂j > 0W −j , β̂j , β̂j W −j , |Wj | according to the Laplace mechanism [15], [16].
.
f(θj ,θj′ ) (x, y) ≤ eǫ f(θj′ ,θj ) (x, y) , (34)
′
E P Wj < 0 W −j , β̂j , β̂j W −j , |Wj |
for all (x, y) ∈ R2 and j ∈ S0 .
Therefore, it is sufficient to prove, In order to show the FDR control, according to (15), it will
be sufficient to prove that

P Wj > 0, β̂j + β̂j′ > 0W −j , β̂j , β̂j′ n o n o
≤ 1,
P Wj > 0 |Wj |, W −j ≤ eǫ P Wj > 0 |Wj |, W −j ,
P Wj < 0 W −j , β̂j , β̂j′
9

all ′j ∈ S0 . We
for note that |Wj |, W −j is a function of
′
R EFERENCES
( κj , κj }, α−j , α′−j , θ−j , θ−j , therefore, according to the [1] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate:
tower property we get a practical and powerful approach to multiple testing,” Journal of the
n o royal statistical society. Series B (Methodological), pp. 289–300, 1995.
[2] R. F. Barber, E. J. Candès et al., “Controlling the false discovery rate
P Wj > 0 |Wj | , W −j = via knockoffs,” The Annals of Statistics, vol. 43, no. 5, pp. 2055–2085,
2015.
′ ′ ′
E P Wj > 0{κj , κj }, α−j , α−j , θ−j , θ−j |Wj |, W −j , [3] Y. Benjamini and D. Yekutieli, “The control of the false discovery rate
in multiple testing under dependency,” The annals of statistics, vol. 29,
no. 4, pp. 1165–1188, 2001.
where {·, ·} denotes an unordered pair. Therefore, it is suffi- [4] J. D. Storey, J. E. Taylor, and D. Siegmund, “Strong control, conservative
cient to show point estimation and simultaneous conservative consistency of false
discovery rates: a unified approach,” Journal of the Royal Statistical
Society: Series B (Statistical Methodology), vol. 66, no. 1, pp. 187–205,
P (κj , κ′j ) = (a, b)(κj , κ′j ) ∈ K, α−j , α′−j , θ−j , θ′−j ≤ 2004.

[5] B. Efron, R. Tibshirani, J. D. Storey, and V. Tusher, “Empirical bayes
eǫ P (κ′j , κj ) = (a, b)(κj , κ′j ) ∈ K, α−j , α′−j , θ−j , θ′−j , analysis of a microarray experiment,” Journal of the American statistical
association, vol. 96, no. 456, pp. 1151–1160, 2001.
for all (a, b) ∈ R2 . We also note that (κj , κ′j ) is conditionally [6] R. F. Barber, E. J. Candès et al., “A knockoff filter for high-dimensional
selective inference,” The Annals of Statistics, vol. 47, no. 5, pp. 2504–
independent of (θ −j , θ′−j ). Hence, we only need to show 2537, 2019.
[7] E. Candes, Y. Fan, L. Janson, and J. Lv, “Panning for gold:‘model-

P (κj , κ′j ) = (a, b)(κj , κ′j ) ∈ K, α−j , α′−j ≤ X’knockoffs for high dimensional controlled variable selection,” Journal
of the Royal Statistical Society: Series B (Statistical Methodology),

vol. 80, no. 3, pp. 551–577, 2018.
eǫ P (κ′j , κj ) = (a, b)(κj , κ′j ) ∈ K, α−j , α′−j , [8] R. F. Barber, E. J. Candès, and R. J. Samworth, “Robust inference with
knockoffs,” The Annals of Statistics, vol. 48, no. 3, pp. 1409–1431, 2020.
for all (a, b), where K = {(a, b), (b, a)}. Starting from LHS, [9] Y. Romano, M. Sesia, and E. Candès, “Deep knockoffs,” Journal of the
American Statistical Association, pp. 1–12, 2019.
we get [10] J. Jordon, J. Yoon, and M. van der Schaar, “KnockoffGAN: Generating
knockoffs for feature selection using generative adversarial networks,”

P (κj , κ′j ) = (a, b)(κj , κ′j ) ∈ K, α−j , α′−j in International Conference on Learning Representations, 2018.
[11] Y. Lu, Y. Fan, J. Lv, and W. S. Noble, “DeepPINK: reproducible feature
1 selection in deep neural networks,” in Advances in Neural Information
= g (a, b)
g(a, b) (κj ,κ′j )(α−j ,α′−j ) Processing Systems, 2018, pp. 8676–8686.
[12] Y. Fan, J. Lv, M. Sharifvaghefi, and Y. Uematsu, “IPAD: stable inter-
1 pretable forecasting with knockoffs inference,” Journal of the American
= g (a, b)
g(a, b) (θj +αj ,θj′ +α′j )(α−j ,α′−j ) Statistical Association, pp. 1–13, 2019.
1 n [13] M. Pournaderi and Y. Xiang, “Differentially private variable selection
= E via the knockoff filter,” in 2021 IEEE 31st International Workshop on
′
g(a, b) (θj ,θj ) (α−j ,α−j )
′
Machine Learning for Signal Processing. IEEE, 2021, pp. 1–6.
o [14] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Jour-
g (a − θj , b − θj′ )
′
(αj ,αj ) (α−j ,α−j ,θj ,θj )
′ ′ nal of the Royal Statistical Society: Series B (Methodological), vol. 58,
no. 1, pp. 267–288, 1996.
(∗) 1 [15] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise
= g (a, b)
g(a, b) (θj +α′j ,θj′ +αj )(α−j ,α′−j ) to sensitivity in private data analysis,” in Theory of cryptography
1 n o conference. Springer, 2006, pp. 265–284.
= E g (a − α′j , b − αj ) [16] C. Dwork, A. Roth et al., “The algorithmic foundations of differential
′ ′
g(a, b) (αj ,αj ) (α−j ,α−j ) (θj ,θj ) (α,α )
′ ′ privacy.” Found. Trends Theor. Comput. Sci., vol. 9, no. 3-4, pp. 211–
407, 2014.
(∗∗) eǫ
≤ g (a, b)
g(a, b) (θj′ +α′j ,θj +αj )(α−j ,α′−j )
eǫ
= g (a, b)
g(a, b) (κ′j ,κj )(α−j ,α′−j )

= eǫ P (κ′j , κj ) = (a, b)(κ′j , κj ) ∈ K, α−j , α′−j .
where g denotes the conditional probability

(κj ,κ′j )(α−j ,α′−j )
density function,
g(a, b) = g (a, b) + g (a, b) ,

(κj ,κ′j )(α−j ,α′−j ) (κ′j ,κj )(α−j ,α′−j )
and (*) holds according to the independence of (θj , θj′ ) and

(α, α′ ), and the fact that

′⊤ d ′⊤
α′j , α⊤ ⊤
−j , αj , α −j = (α , α ) ,

α e ⊤ w ∼ N (0, G). The inequality marked
since = [X X]
α′
with (**) follows from the independence of (θj , θj′ ) and
(α, α′ ) and (34).

Variable Selection With The Knockoffs: Composite Null Hypotheses

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Variable Selection With The Knockoffs: Composite Null Hypotheses

Uploaded by

Copyright:

Available Formats

1

Variable Selection with the Knockoffs:

II. BACKGROUND : F IXED -X K NOCKOFF F ILTER B. Statistics

C. FDR Bound for Naive Selection

1 1 where s ∈ Rp+ . We note,

where g denotes the conditional probability

g(a, b) = g (a, b) + g (a, b) ,

and (*) holds according to the independence of (θj , θj′ ) and

You might also like