arXiv:1112.6235v1 [math.ST] 29 Dec 2011
Detecting a Vector Based on Linear Measurements
Ery Arias-Castro
Department of Mathematics, University of California, San Diego
We consider a situation where the state of a system is represented by a real-valued vector $x \in \mathbb{R}^n$. Under normal circumstances, the vector $x$ is zero, while an event manifests as non-zero entries in $x$, possibly few. Our interest is in designing algorithms that can reliably detect events — i.e., test whether $x = 0$ or $x \neq 0$ — with the least amount of information. We place ourselves in a situation, now common in the signal processing literature, where information on $x$ comes in the form of noisy linear measurements $y = \langle a, x\rangle + z$, where $a \in \mathbb{R}^n$ has norm bounded by 1 and $z \sim \mathcal{N}(0, 1)$. We derive information bounds in an active learning setup and exhibit some simple near-optimal algorithms. In particular, our results show that the task of detection within this setting is at once much easier, simpler, and different from the tasks of estimation and support recovery.
Keywords: signal detection, compressed sensing, adaptive measurements, normal mean model, sparsity, high-dimensional data.
1 Introduction
We consider a situation where the state of a system is represented by a real-valued vector $x \in \mathbb{R}^n$. Under normal circumstances, the vector $x$ is zero, while an event manifests as non-zero entries in $x$, possibly few. Our interest is in the design of algorithms that reliably detect events — i.e., test whether $x = 0$ or $x \neq 0$ — with the least amount of information. We assume that we may learn about $x$ via noisy linear measurements of the form
$$y_i = \langle a_i, x\rangle + z_i, \qquad (1)$$
where the measurement vectors $a_i$ have Euclidean norm bounded by 1 and the noise terms $z_i$ are i.i.d. standard normal. Assuming that we may take a limited number of linear measurements, the engineering is in choosing them in order to minimize the false alarm and missed detection rates. We derive information bounds, establishing some fundamental detection limits relating the signal strength and the number of linear measurements. The bounds we obtain apply to all adaptive schemes, where we may choose the $i$th measurement vector $a_i$ based on the past measurements, i.e., we may choose $a_i$ as a function of $(a_1, y_1, \ldots, a_{i-1}, y_{i-1})$.
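To fix ideas, here is a minimal Python sketch (ours, not part of the paper) of the measurement model (1): it simulates $m$ possibly adaptive noisy linear measurements of an unknown vector $x$, where each measurement vector is produced by a user-supplied function and rescaled to unit norm. The function name `measure` and its interface are illustrative assumptions.

```python
import numpy as np

def measure(x, design, m, rng=None):
    """Simulate m noisy linear measurements y_i = <a_i, x> + z_i (model (1)).

    `design(i, history)` returns the i-th measurement vector; it may look at the
    history of past (a_j, y_j) pairs, so adaptive schemes are allowed.
    Each a_i is rescaled so that its Euclidean norm is at most 1, as the model requires.
    """
    rng = np.random.default_rng() if rng is None else rng
    history, ys = [], []
    for i in range(m):
        a = np.asarray(design(i, history), dtype=float)
        norm = np.linalg.norm(a)
        if norm > 1.0:                       # enforce ||a_i|| <= 1
            a = a / norm
        y = a @ x + rng.standard_normal()    # noise z_i ~ N(0, 1)
        history.append((a, y))
        ys.append(y)
    return np.array(ys), history

# Example: non-adaptive design that always uses the constant vector 1/sqrt(n).
n = 1000
x = np.zeros(n); x[:10] = 0.5                # sparse, non-negative signal
const = lambda i, hist: np.ones(n) / np.sqrt(n)
y, _ = measure(x, const, m=25, rng=np.random.default_rng(0))
```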
1.1 Related work
Learning as much as possible about a vector based on a few linear measurements is one of the central
themes of compressive sensing (CS) [4, 5, 8]. Most of this literature, as it relates to signal processing,
has focused on the tasks of estimation and support recovery. Particularly in surveillance situations,
however, it makes sense to perform detection before estimation because, as we shall confirm, reliable
detection is possible at much lower signal-to-noise ratios or, equivalently, with far fewer linear measurements than estimation. This can be achieved with much greater implementation ease and
much lower computational cost than standard CS methods based on convex programming.
The literature on the detection of a high-dimensional signal is centered around the classical normal mean model, based on observations $y_i = x_i + z_i$, where the $z_i$ are i.i.d. standard normal. In this model, only one noisy observation is available per coordinate, so some assumptions are necessary; the most common one, by far, is that the vector $x = (x_1, \ldots, x_n)$ is sparse. This setting has attracted a fair amount of attention [7, 15, 16], with recent publications allowing adaptive measurements [12]. More recently, a few papers [2, 10, 14] extended these results to testing for a sparse coefficient vector in a linear system, with the aim of characterizing the detection feasibility. These papers work with designs having low mutual coherence, for example, assuming that the $a_i$ are i.i.d. multivariate normal. As we shall see below, such designs are not always desirable. We also mention [13], which assumes that an estimator $\hat x$ of $x$ is available and examines the performance of the test based on $\langle \hat x, x\rangle$; and [17], which proposes a Bayesian approach for the detection of sparse signals in a sensor network, where the design matrix is assumed to exhibit polynomial decay in terms of the distance between sensors.
We mention that the present paper may be seen as a companion paper to [1] which considers
the tasks of estimation and support recovery in the same setting.
1.2 Notation and terminology
Our detection problem translates into a hypothesis testing problem $H_0: x = 0$ versus $H_1: x \in \mathcal{X}$, for some subset $\mathcal{X} \subset \mathbb{R}^n \setminus \{0\}$. A test procedure based on $m$ measurements of the form (1) is a binary function of the data, i.e., $T = T(a_1, y_1, \ldots, a_m, y_m)$, with $T = \varepsilon \in \{0, 1\}$ indicating that $T$ favors $H_\varepsilon$. The (worst-case) risk of a test $T$ is defined as
$$\gamma(T) := P_0(T = 1) + \max_{x \in \mathcal{X}} P_x(T = 0),$$
where $P_x$ denotes the distribution of the data when $x$ is the true underlying vector. With a prior $\pi$ on the set of alternatives $\mathcal{X}$, the corresponding average (Bayes) risk is defined as
$$\gamma_\pi(T) := P_0(T = 1) + \mathbb{E}_\pi P_x(T = 0),$$
where $\mathbb{E}_\pi$ denotes the expectation under $\pi$. Note that for any prior $\pi$ and any test procedure $T$,
$$\gamma(T) \ge \gamma_\pi(T). \qquad (2)$$
For a vector $a = (a_1, \ldots, a_k)$,
$$\|a\| = \Big(\sum_j a_j^2\Big)^{1/2}, \qquad |a| = \sum_j |a_j|,$$
and $a^T$ denotes its transpose. For a matrix $M$,
$$\|M\|_{\mathrm{op}} = \sup_{a \neq 0} \frac{\|Ma\|}{\|a\|}.$$
Everywhere in the paper, $x = (x_1, \ldots, x_n)$ denotes the unknown vector, while $\mathbf{1}$ denotes the vector with all coordinates equal to 1 and dimension implicitly given by the context.
1.3 Content
In Section 2 we focus on vectors x with non-negative coordinates. This situation leads to an
exceedingly simple, yet near-optimal procedure based on a measurement scheme that is completely
at odds with what is commonly used in CS. In Section 3 we treat the case of a general vector
x and derive another simple, near-optimal procedure. In both cases, the methods we suggest
are non-adaptive — in the sense that the measurement vectors are chosen independently of the
observations — yet perform nearly as well as any adaptive method. In Section 4 we discuss our
results and important extensions, particularly to the case of structured signals.
2 Vectors with non-negative entries
Vectors with non-negative entries may be relevant in image processing, for example, where the
object to be detected is darker (or lighter) than the background. As we shall see, detecting such a
vector is essentially straightforward in every respect. In particular, the use of low-coherence designs
is counter-productive in this situation.
The first thing that comes to mind, perhaps, is gathering strength across coordinates by measuring $x$ with the constant vector $\mathbf{1}/\sqrt{n}$. And, with a budget of $m$ measurements, we simply take this measurement $m$ times.
Proposition 1. Suppose we take $m$ measurements of the form (1) with $a_i = \mathbf{1}/\sqrt{n}$ for all $i$. Consider then the test that rejects when
$$\sum_{i=1}^m y_i > \tau \sqrt{m},$$
where $\tau$ is some critical value. Its risk against a vector $x$ is equal to
$$1 - \Phi(\tau) + \Phi\big(\tau - \sqrt{m/n}\,|x|\big),$$
where $\Phi$ is the standard normal distribution function. In particular, if $\tau = \tau_n \to \infty$, this test has vanishing risk against alternatives satisfying $\sqrt{m/n}\,|x| - \tau_n \to \infty$.
Since we may choose $\tau_n \to \infty$ as slowly as we wish, in essence, the simple sum test based on repeated measurements from the constant vector has vanishing risk against alternatives satisfying $\sqrt{m/n}\,|x| \to \infty$.
Proof. The result is a simple consequence of the fact that
$$\frac{1}{\sqrt{m}} \sum_{i=1}^m y_i \sim \mathcal{N}\big(\sqrt{m/n}\,|x|,\; 1\big).$$
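The following short Python sketch (ours, not part of the paper; the helper names are illustrative) estimates the risk of the sum test of Proposition 1 by Monte Carlo and compares it to the stated formula, illustrating that the risk is driven by $\sqrt{m/n}\,|x|$.

```python
import numpy as np
from scipy.stats import norm

def sum_test_risk(x, m, tau, n_trials=2000, rng=None):
    """Monte Carlo risk of the test 'reject if sum_i y_i > tau*sqrt(m)',
    with all measurement vectors equal to 1/sqrt(n) (Proposition 1)."""
    rng = np.random.default_rng() if rng is None else rng
    n = x.size
    mean_alt = np.sqrt(m / n) * np.abs(x).sum()        # sqrt(m/n) * |x|_1
    # Under either hypothesis, (1/sqrt(m)) * sum_i y_i ~ N(mean, 1).
    stat_null = rng.standard_normal(n_trials)           # mean 0 under H0
    stat_alt = mean_alt + rng.standard_normal(n_trials)
    false_alarm = np.mean(stat_null > tau)
    missed = np.mean(stat_alt <= tau)
    return false_alarm + missed

n, m, tau = 10_000, 50, 2.0
x = np.zeros(n); x[:100] = 0.2                           # non-negative sparse signal
print(sum_test_risk(x, m, tau, rng=np.random.default_rng(1)))
# Theoretical risk: 1 - Phi(tau) + Phi(tau - sqrt(m/n) * |x|_1)
print(1 - norm.cdf(tau) + norm.cdf(tau - np.sqrt(m / n) * np.abs(x).sum()))
```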
Although the choice of measurement vectors and the test itself are both exceedingly simple,
the resulting procedure comes close to achieving the best possible performance in this particular
setting, as the following information bound reveals.
Theorem 1. Let $\mathcal{X}(\mu, S)$ denote the set of vectors in $\mathbb{R}^n$ having exactly $S$ non-zero entries, all equal to $\mu > 0$. Based on $m$ measurements of the form (1), possibly adaptive, any test for $H_0: x = 0$ versus $H_1: x \in \mathcal{X}(\mu, S)$ has risk at least $1 - \sqrt{m/(8n)}\, S\mu$.
In particular, the risk against alternatives $x \in \mathcal{X}(\mu, S)$ with $\sqrt{m/n}\,|x| = \sqrt{m/n}\, S\mu \to 0$ goes to 1 uniformly over all procedures.
Proof. The standard approach to deriving uniform lower bounds on the risk is to put a prior on the set of alternatives and use (2). We simply choose the uniform prior on $\mathcal{X}(\mu, S)$, which we denote by $\pi$. The hypothesis testing problem reduces to $H_0: x = 0$ versus $H_1: x \sim \pi$, for which the likelihood ratio test is optimal by the Neyman-Pearson fundamental lemma. The likelihood ratio is defined as
$$L := \frac{P_\pi(a_1, y_1, \ldots, a_m, y_m)}{P_0(a_1, y_1, \ldots, a_m, y_m)} = \mathbb{E}_\pi \exp\Big(\sum_{i=1}^m \big[y_i\, (a_i^T x) - (a_i^T x)^2/2\big]\Big),$$
where $\mathbb{E}_\pi$ denotes the expectation with respect to $\pi$, and the related test is $T = \{L > 1\}$. It has risk equal to
$$\gamma_\pi(T) = 1 - \tfrac{1}{2}\,\|P_\pi - P_0\|_{\mathrm{TV}}, \qquad (3)$$
where $P_\pi := \mathbb{E}_\pi P_x$ — the $\pi$-mixture of $P_x$ — and $\|\cdot\|_{\mathrm{TV}}$ is the total variation distance. By Pinsker's inequality,
$$\|P_\pi - P_0\|_{\mathrm{TV}} \le \sqrt{K(P_0, P_\pi)/2}, \qquad (4)$$
where $K(P_0, P_\pi)$ denotes the Kullback-Leibler divergence. We have
$$
\begin{aligned}
K(P_0, P_\pi) &= -\mathbb{E}_0 \log L && (5)\\
&\le \mathbb{E}_\pi \sum_{i=1}^m \mathbb{E}_0\big[(a_i^T x)^2/2 - y_i\,(a_i^T x)\big] && (6)\\
&= \mathbb{E}_\pi \sum_{i=1}^m \mathbb{E}_0\big[(a_i^T x)^2\big]/2 && (7)\\
&\le \sum_{i=1}^m \mathbb{E}_0\big[a_i^T C a_i\big] && (8)\\
&\le m\, \|C\|_{\mathrm{op}}, && (9)
\end{aligned}
$$
where $C = (c_{jk}) := \mathbb{E}_\pi(x x^T)$. The first line is by definition; the second is by definition of $P_x / P_0$, by the application of Jensen's inequality justified by the convexity of $x \mapsto -\log x$, and by Fubini's theorem; the third is by independence of $a_i$, $y_i$ and $x$ (under $P_0$), and by the fact that $\mathbb{E}(y_i) = 0$; the fourth is by independence of $a_i$ and $x$ (under $P_0$), by Fubini's theorem, and after dropping the harmless factor $1/2$; the fifth is because $\|a_i\| \le 1$ for all $i$.
Since under $\pi$ the support of $x$ is chosen uniformly at random among subsets of size $S$, we have
$$c_{jj} = \mu^2\, P_\pi(x_j \neq 0) = \mu^2\, \frac{S}{n}, \qquad \forall j,$$
and
$$c_{jk} = \mu^2\, P_\pi(x_j \neq 0,\, x_k \neq 0) = \mu^2\, \frac{S}{n}\cdot\frac{S-1}{n-1}, \qquad j \neq k.$$
This simple matrix has operator norm $\|C\|_{\mathrm{op}} = \mu^2 S^2/n$.
Coming back to the divergence, we therefore have
$$K(P_0, P_\pi) \le m\, \mu^2 S^2/n,$$
and returning to (3) via (4), we bound the risk of the likelihood ratio test as follows:
$$\gamma(T) \ge 1 - \sqrt{K(P_0, P_\pi)/8} \ge 1 - \sqrt{m/(8n)}\, S\mu.$$
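As a quick numerical sanity check (ours, not part of the paper), the operator norm of the prior covariance matrix $C$ used in the proof can be verified for small parameter values:

```python
import numpy as np

def prior_covariance(n, S, mu):
    """C = E_pi[x x^T] under the uniform prior on vectors with exactly S
    entries equal to mu (support chosen uniformly at random)."""
    C = np.full((n, n), mu**2 * (S / n) * ((S - 1) / (n - 1)))
    np.fill_diagonal(C, mu**2 * S / n)
    return C

n, S, mu = 200, 10, 0.7
C = prior_covariance(n, S, mu)
op_norm = np.linalg.norm(C, 2)          # largest singular value of C
print(op_norm, mu**2 * S**2 / n)        # the two numbers agree: mu^2 S^2 / n
```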
With Proposition 1 and Theorem 1, we conclude that the following is true in a minimax sense:
Reliable detection of a nonnegative vector $x \in \mathbb{R}^n$ from $m$ noisy linear measurements is possible if $\sqrt{m/n}\,|x| \to \infty$ and impossible if $\sqrt{m/n}\,|x| \to 0$.
3 General vectors
When dealing with arbitrary vectors, the measurement vector $\mathbf{1}/\sqrt{n}$ may not be appropriate. In fact, the resulting procedure is completely insensitive to vectors $x$ such that $\langle \mathbf{1}, x\rangle = 0$. Nevertheless, if one selects a measurement vector $a$ from the Bernoulli ensemble — i.e., with independent entries taking values $\pm 1/\sqrt{n}$ with equal probability — then on average, $\langle a, x\rangle$ is of the order of $\|x\|/\sqrt{n}$. This is true when the number of non-zero entries in $x$ grows with the dimension $n$; if we repeat the process a few times, it becomes true for any fixed vector $x$.
Proposition 2. Sample $b_1, \ldots, b_{h_m}$ independently from the Bernoulli ensemble, with $h_m \to \infty$ slowly, and take $m$ measurements of the form (1) with $a_i = b_s$ for $i \in I_s := [(m/h_m)(s-1) + 1,\, (m/h_m)s)$, $s = 1, \ldots, h_m$. Consider then the test that rejects when
$$\sum_{s=1}^{h_m} \Big(\sum_{i \in I_s} y_i\Big)^2 > m\,\big(1 + \tau_m/\sqrt{h_m}\big), \qquad (10)$$
where $\tau_m \to \infty$. When $m \to \infty$, its risk against a vector $x$ — averaged over the Bernoulli ensemble — vanishes if $(m/n)\,\|x\|^2 \ge 2\tau_m \sqrt{h_m}$.
Since we may take $h_m$ and $\tau_m$ increasing as slowly as we please, in essence, the test is reliable when $(m/n)\,\|x\|^2 \to \infty$. Compared with repeatedly measuring with the constant vector $\mathbf{1}/\sqrt{n}$ as studied in Proposition 1, there is a substantial loss in power when $|x|^2$ is much larger than $\|x\|^2$. For example, when $x$ has $S$ non-zero entries all equal to $\mu > 0$, $|x|^2 = S\|x\|^2$.
Proof. For simplicity, assume that $m/h_m$ is an integer and fix $x$ throughout. For short, let
$$Y_s = \sum_{i \in I_s} y_i = (m/h_m)\,\langle b_s, x\rangle + \sqrt{m/h_m}\; Z_s, \qquad Z_s := \sqrt{h_m/m} \sum_{i \in I_s} z_i.$$
Note that the $Z_s$ are i.i.d. $\mathcal{N}(0, 1)$, while the $\langle b_s, x\rangle$ are i.i.d. with mean zero, variance $\|x\|^2/n$ and fourth moment bounded by $6\|x\|^4/n^2$ — which is immediate using the fact that the coordinates of $b_s$ are i.i.d. taking values $\pm 1/\sqrt{n}$ with equal probability. Proceeding in an elementary way, we have
$$\mathbb{E}\Big[\sum_{s=1}^{h_m} Y_s^2\Big] = (m^2/h_m)\,\mathbb{E}\big[\langle b_1, x\rangle^2\big] + m\,\mathbb{E}\big[Z_1^2\big] = (m^2/h_m)\,\|x\|^2/n + m,$$
and
$$\mathrm{Var}\Big[\sum_{s=1}^{h_m} Y_s^2\Big] \le (m^4/h_m^3)\,\mathbb{E}\big[\langle b_1, x\rangle^4\big] + 6\,(m^3/h_m^2)\,\mathbb{E}\big[\langle b_1, x\rangle^2\big]\,\mathbb{E}\big[Z_1^2\big] + (m^2/h_m)\,\mathbb{E}\big[Z_1^4\big] \le 6\,(m^4/h_m^3)\,\|x\|^4/n^2 + 6\,(m^3/h_m^2)\,\|x\|^2/n + 3\,(m^2/h_m).$$
(The first inequality bounds the variance of each $Y_s^2$ by its second moment $\mathbb{E}[Y_s^4]$.)
Therefore, by Chebyshev's inequality, the probability of (10) under the null is bounded from above by $3/\tau_m^2 \to 0$. Similarly, the probability of (10) not happening under an alternative $x$ satisfying $(m/n)\,\|x\|^2 \ge 2\tau_m\sqrt{h_m}$ is bounded from above by
$$\frac{6\,(m^4/h_m^3)\,\|x\|^4/n^2 + 6\,(m^3/h_m^2)\,\|x\|^2/n + 3\,(m^2/h_m)}{\big((m^2/h_m)\,\|x\|^2/n - m\,\tau_m/\sqrt{h_m}\big)^2} \;\le\; \frac{24}{h_m} + \frac{24}{\tau_m\sqrt{h_m}} + \frac{3}{\tau_m^2} \;\to\; 0.$$
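As an illustration (ours, not from the paper; the helper names and parameter values are illustrative assumptions), the following Python sketch simulates the block test (10) and estimates its false alarm and missed detection rates by Monte Carlo.

```python
import numpy as np

def block_test(x, m, h, tau, rng):
    """One run of the test (10): h Bernoulli measurement vectors, each repeated
    m/h times; reject when sum_s (sum_{i in I_s} y_i)^2 > m (1 + tau/sqrt(h))."""
    n = x.size
    block = m // h                                         # measurements per block
    stat = 0.0
    for _ in range(h):
        b = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)   # Bernoulli ensemble vector
        # Y_s = sum_{i in I_s} y_i: block copies of <b, x> plus Gaussian noise
        Ys = block * (b @ x) + np.sqrt(block) * rng.standard_normal()
        stat += Ys**2
    return stat > m * (1 + tau / np.sqrt(h))

def error_rates(x, m, h, tau, n_trials=500, seed=0):
    rng = np.random.default_rng(seed)
    false_alarm = np.mean([block_test(np.zeros_like(x), m, h, tau, rng) for _ in range(n_trials)])
    missed = np.mean([not block_test(x, m, h, tau, rng) for _ in range(n_trials)])
    return false_alarm, missed

n, m, h, tau = 2000, 400, 4, 3.0
x = np.zeros(n)
x[:30] = np.random.default_rng(2).choice([-1.0, 1.0], 30) * 2.0   # sparse signal with random signs
print(error_rates(x, m, h, tau))
```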
Again, this relatively simple procedure nearly achieves the best possible performance.
Theorem 2. Let $\mathcal{X}_\pm(\mu, S)$ denote the set of vectors in $\mathbb{R}^n$ having exactly $S$ non-zero entries, all equal to $\pm\mu$. Based on $m$ measurements of the form (1), possibly adaptive, any test for $H_0: x = 0$ versus $H_1: x \in \mathcal{X}_\pm(\mu, S)$ has risk at least $1 - \sqrt{Sm/(8n)}\,\mu$.
In particular, the risk against alternatives $x \in \mathcal{X}_\pm(\mu, S)$ with $(m/n)\,\|x\|^2 = (m/n)\,S\mu^2 \to 0$ goes to 1 uniformly over all procedures.
Proof. Again, we choose the uniform prior on $\mathcal{X}_\pm(\mu, S)$. The proof is then completely parallel to that of Theorem 1, now with $C = \mu^2 (S/n)\, I$ — since the signs of the nonzero entries of $x$ are i.i.d. Rademacher — so that $\|C\|_{\mathrm{op}} = \mu^2 S/n$.
With Proposition 2 and Theorem 2, we conclude that the following is true in a minimax sense:
Reliable detection of a vector $x \in \mathbb{R}^n$ from $m$ noisy linear measurements is possible if $\sqrt{m/n}\,\|x\| \to \infty$ and impossible if $\sqrt{m/n}\,\|x\| \to 0$.
4 Discussion
In this short paper, we tried to convey some very basic principles about detecting a high-dimensional vector with as few linear measurements as possible. First, when the vector has non-negative entries, repeatedly sampling from the constant vector $\mathbf{1}/\sqrt{n}$ is near-optimal. Second, when the vector is general but sparse, repeatedly sampling from a few measuring vectors drawn from a standard random (e.g., Bernoulli) ensemble is also near-optimal. In both cases, choosing the measuring vectors adaptively does not bring a substantial improvement. Moreover, sparsity does not help, in the sense that the detection rates depend on $|x|$ and $\|x\|$, respectively.
4.1 A more general adaptive scheme
Suppose we may take as many linear measurements of the form (1) as we please (possibly an infinite number), with the only constraint being on the total measurement energy
$$\sum_i \|a_i\|^2 \le m. \qquad (11)$$
(Note that $m$ is no longer constrained to be an integer.) This is essentially the setting considered in [11, 12], and clearly, the setup we studied in the previous sections satisfies this condition. So what can we achieve with this additional flexibility?
In fact, the same results apply. The lower bounds in Theorem 1 and Theorem 2 are proved in exactly the same way. (We effectively use (11) to go from (8) to (9), and this is the only place where the constraints on the number and norm of the measurement vectors are used.) Of course, Proposition 1 and Proposition 2 apply since the measurement schemes used there satisfy (11). However, in this special case they could be simplified. For instance, in Proposition 1 we could take one measurement with the constant vector $\sqrt{m/n}\,\mathbf{1}$.
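A quick numerical sketch of this equivalence (ours, with illustrative values): a single measurement with $a = \sqrt{m/n}\,\mathbf{1}$ has total energy $m$, so it satisfies (11), and it reproduces the distribution of the statistic of Proposition 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 10_000, 50
x = np.zeros(n); x[:100] = 0.2                       # non-negative sparse signal

# Proposition 1: m measurements with a_i = 1/sqrt(n), statistic (1/sqrt(m)) * sum_i y_i.
a = np.ones(n) / np.sqrt(n)
y = np.array([a @ x + rng.standard_normal() for _ in range(m)])
stat_repeated = y.sum() / np.sqrt(m)

# Energy-constrained alternative: one measurement with a = sqrt(m/n) * 1 (energy n * m/n = m).
a_single = np.sqrt(m / n) * np.ones(n)
stat_single = a_single @ x + rng.standard_normal()

# Both statistics are N(sqrt(m/n) * |x|_1, 1); their common mean is printed first.
print(np.sqrt(m / n) * np.abs(x).sum(), stat_repeated, stat_single)
```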
4.2 Detecting structured signals
The results we derived are tailored to the case where x has no known structure. What if we know
a priori that the signal x has some given structure? The most emblematic case is when the support
of x is an interval of length S. In the classical setting where each coordinate of x is observed once,
the scan statistic (aka generalized likelihood ratio test) is the tool of choice [3]. How does the story
change in the setting where adaptive linear measurements in the form of (1) can be taken?
Perhaps surprisingly, knowing that $x$ has such a specific structure does not help much. Indeed, Theorem 1 and Theorem 2 are proved in the same way. In the case of non-negative vectors, we use the uniform prior on vectors with support an interval of length $S$ and nonzero entries all equal to $\mu$, and the proof is identical, except for the matrix $C$, which now has coefficients $c_{jk} = \mu^2 \max(S - |j - k|,\, 0)/n$ for all $j, k$. Because $C$ is symmetric, we have
$$\|C\|_{\mathrm{op}} \le \max_j \sum_k |c_{jk}| = \mu^2 S^2/n, \qquad (12)$$
which is exactly the same bound as before. In the general case, the arguments are really identical, except that we use the uniform prior on vectors with support an interval of length $S$ and nonzero entries all equal to $\mu$ in absolute value. (Here the matrix $C$ is exactly the same as in the proof of Theorem 2.) Of course, Proposition 1 and Proposition 2 apply here too, so the conclusions are the same. Here too, these conclusions hold in the more general setup with measurements satisfying (11).
To appreciate how powerful the ability to take linear measurements of the form (1) with the constraint (11) really is, let us stay with the same task of detecting an interval of length $S$ with a positive mean. On the one hand, we have the simple test based on $\sum_i y_i$ studied in Proposition 1. On the other hand, we have the scan statistic
$$\max_t \sum_{i=t}^{t+S-1} y_i,$$
with observations of the form
$$y_i = x_i + \sigma z_i, \qquad \sigma := \sqrt{n/m}. \qquad (13)$$
While the former requires $\sqrt{m/n}\,|x| \to \infty$ to be asymptotically powerful, the scan statistic requires
$$\liminf\; \sqrt{m/n}\,|x|\, \big(S \log_+(n/S)\big)^{-1/2} \ge \sqrt{2},$$
where $\log_+(x) := \max(\log x, 1)$. With observations provided in the form of (13), this is asymptotically optimal [3]. Note that (13) is a special case of (11). Hence, the ability to take measurements of the form (1) allows one to detect structured signals that are potentially much weaker, without a priori knowledge of the structure and with much simpler algorithms. Hardware that is able to take linear measurements such as (1) is currently being developed [9].
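To make the comparison concrete, here is a small Python sketch (ours, with illustrative parameter values): the sum test of Proposition 1 is run on constant-vector measurements, and the scan statistic is run on per-coordinate observations of the form (13), both for the same interval signal.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, S, mu = 5000, 500, 50, 0.6
x = np.zeros(n)
x[1200:1200 + S] = mu                          # interval signal of length S, positive mean

# Design of Proposition 1: m measurements with the constant vector 1/sqrt(n).
y_const = np.array([x.sum() / np.sqrt(n) + rng.standard_normal() for _ in range(m)])
sum_stat = y_const.sum() / np.sqrt(m)          # ~ N(sqrt(m/n) * |x|_1, 1)

# Classical per-coordinate design (13): y_i = x_i + sigma * z_i with sigma = sqrt(n/m).
sigma = np.sqrt(n / m)
y_coord = x + sigma * rng.standard_normal(n)
window_sums = np.convolve(y_coord, np.ones(S), mode="valid")
scan_stat = window_sums.max() / (sigma * np.sqrt(S))   # standardized scan statistic

print(np.sqrt(m / n) * np.abs(x).sum(), sum_stat, scan_stat)
```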
4.3 A comparison with estimation and support recovery
The results we obtain for detection are in sharp contrast with the corresponding results in estimation and support recovery. Though, by definition, detection is always easier, in most other settings it is not that much easier. For example, take the normal mean model described in the Introduction, assuming $x$ is sparse with $S$ coefficients equal to $\mu > 0$. In the regime where $S = n^{1-\beta}$, $\beta \in (1/2, 1)$, detection is impossible when $\mu \le \sqrt{2r \log n}$ with $r < \rho_1(\beta)$, while support recovery is possible when $\mu \ge \sqrt{2r \log n}$ with $r > \rho_2(\beta)$, for fixed functions $\rho_1, \rho_2 : (1/2, 1) \to (0, \infty)$ [7, 15, 16].
So the difference is a constant factor in the per-coordinate amplitude. In the setting we consider here, we are able to detect at a much smaller signal-to-noise ratio than what is required for estimation or support recovery, which nominally require at least $m \ge S$ measurements regardless of the signal amplitude, where $S$ is the number of nonzero entries in $x$. In fact, [1] shows that reliable support recovery is impossible unless $\mu$ is of order at least $\sqrt{n/m}$. In detection, however, we saw that $m = 1$ measurement may suffice if the signal amplitude is large enough, and the required amplitude can be smaller than $\sqrt{n/m}$ by a factor of $S$ or $\sqrt{S}$ in the nonnegative and general cases, respectively. Therefore, when one has the ability to take linear measurements of the form (1) in a surveillance setting, it makes sense to perform detection as described here before estimation (identification) or support recovery (localization) of the signal.
4.4 Possible improvements
Though we provided simple algorithms that nearly match the information bounds, there might be room for improvement. For one thing, it might be possible to reliably detect when, say, $\sqrt{m/n}\,|x|$ is sufficiently large — for the case where $x_j \ge 0$ for all $j$ — without necessarily tending to infinity. A good candidate for this might be the Bayesian algorithm proposed in [6].
More importantly, in the general case of Section 3, we might want to design an algorithm that detects any fixed $x$ with high probability, without averaging over the measurement design. This averaging may be interpreted in at least two ways:
(A1) If we were to repeat the experiment many times, each time choosing new measurement vectors and corrupting the measurements with new noise, then for a fixed vector $x$, in most instances the test would be accurate.
(A2) Given the amplitudes $|x_j|$, $j = 1, \ldots, n$, for most sign configurations the test will be accurate.
Interpretation (A1) is controversial as we do not repeat the experiment, which would amount to taking more samples. And interpretation (A2) raises the issue of robustness to any sign configuration. One way — and the only way we know of — to ensure this robustness is to use a CS-like sampling scheme, i.e., choosing $a_1, \ldots, a_m$ in (1) such that the matrix with these rows satisfies RIP-like properties. This setting is studied in detail in [2], which in a nutshell says the following. Take measurement vectors from the Bernoulli ensemble, say, but hold the measurement design fixed. This is just a way to build a measurement matrix satisfying the RIP and with low mutual coherence. In particular, this requires that $m$ is of order at least $S \log n$, though what follows assumes that $m \gg S (\log n)^3$. Based on such measurements, the test based on $\sum_i y_i^2$ is able to detect when $(\sqrt{m}/n)\,\|x\|^2 \to \infty$, which is more stringent than what is required in Proposition 2; while the test based on $\max_{j=1,\ldots,n} |\sum_i a_{ij} y_i|$ is able to detect when $\liminf \sqrt{m/n}\, \max_j |x_j|\, (\log n)^{-1/2} > \sqrt{2}$, which, except for the log factor, is what is required for support recovery. And this is essentially optimal, as shown in [2].
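For concreteness, here is a short Python sketch (ours; the setup follows the description above and is not code from [2], and the parameter values are illustrative) of the two tests with a fixed Bernoulli design: the energy test based on $\sum_i y_i^2$ and the max-correlation test based on $\max_j |\sum_i a_{ij} y_i|$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, S, mu = 2000, 400, 5, 10.0
A = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(n)    # fixed Bernoulli design, rows a_i

x = np.zeros(n)
support = rng.choice(n, S, replace=False)
x[support] = mu * rng.choice([-1.0, 1.0], S)              # sparse signal with random signs

y = A @ x + rng.standard_normal(m)                         # y_i = <a_i, x> + z_i

# Energy test: compare sum_i y_i^2 to its null mean m (null std is about sqrt(2m)).
energy_stat = (np.sum(y**2) - m) / np.sqrt(2 * m)

# Max-correlation test: max_j |sum_i a_ij y_i|, standardized by sqrt(m/n),
# to be compared to roughly sqrt(2 log n) under the null.
corr_stat = np.max(np.abs(A.T @ y)) / np.sqrt(m / n)

print(energy_stat, corr_stat, np.sqrt(2 * np.log(n)))
```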
Acknowledgements
The author would like to thank Emmanuel Candès for stimulating discussions and Rui Castro for clarifying the setup described in Section 4.1. This work was partially supported by a grant from the Office of Naval Research (N00014-09-1-0258).
References
[1] E. Arias-Castro, E. J. Candès, and M. A. Davenport. On the fundamental limits of adaptive sensing. Submitted, 2011.
[2] E. Arias-Castro, E. J. Candès, and Y. Plan. Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism. Ann. Statist., 2012. To appear.
[3] E. Arias-Castro, D. Donoho, and X. Huo. Near-optimal detection of geometric objects by fast multiscale methods. IEEE Trans. Inform. Theory, 51(7):2402–2425, 2005.
[4] E. Candès and T. Tao. Near optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425, Dec. 2006.
[5] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete Fourier information. IEEE Trans. Inform. Theory, 52(2):489–509, Feb. 2006.
[6] R. Castro, J. Haupt, R. Nowak, and G. Raz. Finding needles in noisy haystacks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5133–5136, 2008.
[7] D. Donoho and J. Jin. Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist., 32(3):962–994, 2004.
[8] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.
[9] M. Duarte, M. Davenport, D. Takhar, J. Laska, T. Sun, K. Kelly, and R. Baraniuk. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83–91, 2008.
[10] M. Duarte, M. Davenport, M. Wakin, and R. Baraniuk. Sparse signal detection from incoherent projections. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 3, May 2006.
[11] J. Haupt, R. Baraniuk, R. Castro, and R. Nowak. Compressive distilled sensing: Sparse recovery using adaptivity in compressive measurements. In Proc. 43rd Asilomar Conf. on Signals, Systems, and Computers, 2009.
[12] J. Haupt, R. Castro, and R. Nowak. Distilled sensing: Selective sampling for sparse signal recovery. In Proc. 12th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 216–223, 2009.
[13] J. Haupt and R. Nowak. Compressive sampling for signal detection. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007.
[14] Y. Ingster, A. Tsybakov, and N. Verzelen. Detection boundary in sparse regression. Electronic Journal of Statistics, 4:1476–1526, 2010.
[15] Y. I. Ingster. Minimax detection of a signal for $\ell_p^n$ balls. Math. Methods Statist., 7:401–428, 1999.
[16] Y. I. Ingster and I. A. Suslina. Nonparametric goodness-of-fit testing under Gaussian models, volume 169 of Lecture Notes in Statistics. Springer-Verlag, New York, 2003.
[17] J. Meng, H. Li, and Z. Han. Sparse event detection in wireless sensor networks using compressive sensing. In 43rd Annual Conference on Information Sciences and Systems (CISS), 2009.
